CN114926511A - High-resolution remote sensing image change detection method based on self-supervised learning

Publication number: CN114926511A
Application number: CN202210509781.1A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 詹涛, 赵伟, 蒋祥明, 徐偲, 刘洁怡, 张普照
Applicant/Assignee: Xidian University
Legal status: Pending

Classifications

    • G06T 7/30: Determination of transform parameters for the alignment of images, i.e. image registration
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06T 3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 2207/10032: Satellite or aerial image; Remote sensing
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20221: Image fusion; Image merging


Abstract

The invention discloses a high-resolution remote sensing image change detection method based on self-supervised learning, comprising the following steps: acquiring two high-resolution remote sensing images and performing radiometric correction and registration to obtain two target images; comparing the spectral features of the two target images to obtain a pixel-level difference map; acquiring a plurality of feature maps corresponding to each target image through a twin feature extraction network; calculating a plurality of target-level difference maps of the two target images from the feature maps; and fusing the pixel-level difference map with the target-level difference maps and obtaining the change detection result from the fused image. The twin feature extraction network is a sub-network of a deep twin convolutional network model; the training samples all come from registered pairs of high-resolution sample remote sensing images, and their labeling information is generated by automatic calculation from the sample remote sensing image pairs. By applying self-supervised learning of a deep network to high-resolution remote sensing image change detection, the invention improves change detection performance.

Description

High-resolution remote sensing image change detection method based on self-supervised learning
Technical Field
The invention belongs to the field of remote sensing image processing, and particularly relates to a high-resolution remote sensing image change detection method based on self-supervised learning.
Background
With the rapid development of remote sensing technology, new satellites carrying a variety of advanced sensors have been launched into orbit one after another, and the temporal, spatial and spectral resolution of satellite imaging of the ground has improved remarkably. As an important source of current remote sensing data, high-resolution remote sensing images intuitively reflect the distribution and information content of ground object targets in a specific geographic space and contain rich spectral, textural and spatial-structure features. Change detection with high-resolution remote sensing images is therefore of great significance for a deep understanding of the relationship between human activities and the natural environment and their mechanisms of interaction.
Remote sensing image change detection aims to determine and analyze changes in the state of ground features in a region, using remote sensing image data acquired over the same geographic region in different historical periods together with image processing theory and statistical learning models. With the open sharing of remote sensing platform data and the development of change detection algorithms, remote sensing image change detection has been widely applied in land resource investigation, agricultural and forestry monitoring, urban expansion, natural disaster assessment and other fields.
For medium- and low-resolution remote sensing images, the spectral information of a single pixel is sufficient to characterize different classes of ground features, so directly comparing two remote sensing images pixel by pixel can effectively highlight the actual changes in the observed area; related detection methods include the image difference/ratio method, the image transformation method and the change vector analysis method. For high-resolution remote sensing images, owing to factors such as imaging under different spectral conditions, a single ground target is usually composed of a group of pixels, and targets are spatially related to one another, so a high-resolution remote sensing image usually contains much richer ground feature information than a medium- or low-resolution one. At the same time, because of the phenomenon of "same object, different spectra; same spectrum, different objects", pixel-by-pixel comparison of two remote sensing images has difficulty making full use of the ground feature information in the images, so the "salt-and-pepper" phenomenon readily appears in the change detection result (as shown in FIG. 1).
To realize high-resolution remote sensing image change detection, deep learning techniques typified by convolutional neural networks have in recent years been introduced into the field of remote sensing image processing. However, in the prior art, the performance of deep-learning-based methods for processing remote sensing images rests on the availability of sufficient labeled samples. Some methods need an open sample library, and others need a large number of manually labeled samples; that is, the labeling information must be obtained by judgments based on subjective human experience. However, the samples in an open sample library sometimes do not match the type or size of the images to be processed in the detection task, so the trained model performs poorly in practice. For methods that manually label a large number of samples, the detection result depends on the manual labels, which carry a certain subjective uncertainty, and obtaining a large amount of labeled data promptly and accurately generally consumes a great deal of manpower, material resources and time, greatly limiting the applicability of such methods in real scenes.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention provides a high-resolution remote sensing image change detection method based on self-supervised learning.
The technical problem addressed by the invention is solved by the following technical scheme:
A high-resolution remote sensing image change detection method based on self-supervised learning comprises the following steps:
acquiring two high-resolution remote sensing images captured at different times over the same area, and performing radiometric correction and registration on them to obtain two registered target images;
comparing the spectral features of the two target images to obtain a pixel-level difference map;
acquiring a plurality of feature maps corresponding to each target image through a preset twin feature extraction network, wherein the feature maps correspond to a plurality of different operational layers of a single branch of the twin feature extraction network;
calculating the difference between the feature maps of the two target images that come from operational layers at the same level, to obtain a plurality of target-level difference maps;
fusing the pixel-level difference map with the target-level difference maps, and obtaining a change detection result from the fused image;
wherein the twin feature extraction network is a feature extraction sub-network of a pre-trained deep twin convolutional network model; the training samples used to train the deep twin convolutional network model all come from registered pairs of high-resolution sample remote sensing images, and the labeling information of the training samples is generated automatically by feeding the sample remote sensing image pairs into a preset labeling algorithm.
Optionally, the deep twin convolutional network model is used for performing overlap prediction on paired image blocks;
the labeling algorithm comprises:
comparing the spectral features of the sample remote sensing image pair to obtain a sample pixel-level difference map;
performing superpixel segmentation on the sample pixel-level difference map to obtain a first segmentation result;
taking the centroid coordinates of each superpixel in the first segmentation result as image-block centers and a preset size as the image-block size, calculating the degree of overlap of each pair of image blocks between the two images of the sample remote sensing image pair; and
constructing the training samples and their labeling information from each pair of image blocks of the sample remote sensing image pair and the degree of overlap of those image blocks.
Optionally, obtaining a plurality of feature maps corresponding to each target image through the preset twin feature extraction network comprises:
performing superpixel segmentation on the pixel-level difference map to obtain a second segmentation result;
generating a plurality of template windows by taking the centroid coordinates of each superpixel in the second segmentation result as image-block centers and the preset size as the image-block size;
dividing each target image into image blocks according to the template windows, to obtain a plurality of pairs of image blocks whose positions correspond one-to-one between the two target images;
extracting the multi-layer features of each image block from each obtained pair of image blocks with the twin feature extraction network; and
splicing, layer by layer, the multi-layer features belonging to the same target image, to obtain the plurality of feature maps corresponding to each target image.
Optionally, the superpixel segmentation is implemented using a simple linear iterative clustering method.
Optionally, the degree of overlap is calculated as:

$$Y(p,q)=\begin{cases}0, & \mathrm{IoU}(p,q)=0\\ 1, & 0<\mathrm{IoU}(p,q)<1\\ 2, & \mathrm{IoU}(p,q)=1\end{cases}$$

where the image blocks $p$ and $q$ come from the two different images of the sample remote sensing image pair; $Y(p,q)$ denotes the degree of overlap of image block $p$ and image block $q$; and $\mathrm{IoU}$ denotes the intersection-over-union between the image blocks.
Optionally, before extracting the multi-layer features of each image block from each obtained pair of image blocks with the twin feature extraction network, the method further comprises:
normalizing the obtained image blocks;
moreover, the image blocks constituting the training samples are likewise normalized image blocks.
Optionally, calculating the difference between the feature maps of the two target images that come from operational layers at the same level comprises:
calculating the Euclidean distance between the feature maps of the two target images from the same-level operational layer.
Optionally, fusing the pixel-level difference map and the target-level difference maps comprises:
computing a weighted average of the pixel-level difference map and the target-level difference maps to obtain the fused image.
Optionally, obtaining a change detection result from the fused image comprises:
performing binary segmentation on the fused image with the Otsu threshold method, the resulting binary segmentation being the change detection result, wherein 1 in the binary segmentation result indicates change and 0 indicates no change.
Optionally, the deep twin convolutional network model comprises, along the data flow direction: the twin feature extraction network, a feature comparison layer that takes the absolute difference of the features output at the end of the twin feature extraction network, and a fully connected layer;
wherein each branch of the twin feature extraction network comprises, in sequence along the data flow direction, a first convolutional layer, a first BN layer, a max-pooling layer and a residual layer as the plurality of different operational layers of that branch, the residual layer comprising two serially connected residual modules;
each residual module comprises, along the data flow direction, a second convolutional layer, a second BN layer, a third convolutional layer, a third BN layer and an addition module; the input features of the second convolutional layer and the output features of the third BN layer are fed jointly into the addition module for summation;
the first BN layer and the second BN layer are each followed by a ReLU activation layer;
the first convolutional layer comprises 64 convolution kernels of size 3 × 3 with a stride of 2;
the kernel size of the max-pooling layer is 3 × 3 with a stride of 2;
the second and third convolutional layers each comprise 64 convolution kernels of size 3 × 3 with a stride of 1.
In the high-resolution remote sensing image change detection method based on self-supervised learning described above, the plurality of feature maps corresponding to each target image are obtained through the twin feature extraction network, so that the targets in the target images serve as the units of analysis; feature information such as spectrum, texture and geometric shape contained in the target images is effectively extracted, key information in the target images is better mined, and the image context information is fully exploited. Consequently, after the differences between the feature maps of the two target images from same-level operational layers are calculated, the resulting target-level difference maps are fused with the pixel-level difference map of the two target images, and the change detection result obtained is more accurate. Moreover, the twin feature extraction network used by the invention is the feature extraction sub-network of a pre-trained deep twin convolutional network model; the training samples used to train that model all come from registered pairs of high-resolution sample remote sensing images and are matched to the detection task, ensuring that the practical performance of the twin feature extraction network is consistent with its performance at the end of training. In addition, the labeling information of the training samples is generated automatically by feeding the sample remote sensing image pairs into a preset labeling algorithm; it does not depend on subjective human experience, introduces no subjective uncertainty, and has higher labeling precision. This mechanized labeling is also far more efficient: it requires no large investment of manpower, material resources and time, and useful samples can be obtained promptly and accurately, so the method has a wider application range in real scenes.
The present invention will be described in further detail with reference to the accompanying drawings.
Drawings
FIG. 1 is a schematic illustration of the salt-and-pepper phenomenon;
FIG. 2 is a flowchart of the high-resolution remote sensing image change detection method based on self-supervised learning according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of constructing labeling information for a sample remote sensing image pair according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the network structure of the deep twin convolutional network model used in an embodiment of the present invention;
FIG. 5 shows one set of test data used in the experimental validation of an embodiment of the present invention;
FIG. 6 shows another set of test data used in the experimental validation of an embodiment of the present invention;
FIG. 7 shows the detection results obtained on the test data of FIG. 5 with the existing IR-MAD method, the PCANet method, the DCVA method and the method provided by the present invention;
FIG. 8 shows the detection results obtained on the test data of FIG. 6 with the existing IR-MAD method, the PCANet method, the DCVA method and the method provided by the present invention;
FIG. 9 shows the quantitative evaluation indexes of the detection results of FIG. 7 and FIG. 8.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
In order to effectively and practically utilize deep learning technology to improve the performance of high-resolution remote sensing image change detection, an embodiment of the invention provides a high-resolution remote sensing image change detection method based on self-supervised learning. As shown in FIG. 2, the method comprises the following steps:
s10: and acquiring two high-resolution remote sensing images shot at different moments aiming at the same area, and carrying out radiation correction and registration on the two high-resolution remote sensing images to obtain two registered target images.
The radiation correction is carried out on the two high-resolution remote sensing images, so that image radiation errors caused by different factors such as sensitivity characteristics of a sensor, solar altitude, atmospheric conditions, topographic relief and the like can be eliminated; the specific radiation correction method is the same as that in the related prior art, and the embodiment of the present invention is not described again.
After the two high-resolution remote sensing images are subjected to radiation correction, the two corrected high-resolution remote sensing images are continuously registered, so that pixels of the same ground object correspond to the same spatial position. Various specific registration modes exist, for example, the specific registration modes can be implemented by using an existing Scale-invariant Feature Transform (SIFT-invariant Feature Transform) method, and details are not repeated in the embodiment of the present invention.
It can be understood that the two high-resolution remote sensing images obtained after the registration are two target images.
S20: comparing the spectral features of the two target images to obtain a pixel-level difference map.
Specifically, the spectral features of the two target images can be compared with the change vector analysis method, expressed by the formula:

$$X_0=\sqrt{\sum_{b=1}^{B}\left(I_1^{b}-I_2^{b}\right)^{2}}$$

where $I_1^{b}$ and $I_2^{b}$ denote the pixel values of target image $I_1$ and target image $I_2$ in the $b$-th spectral band, $B$ denotes the number of spectral bands of the target images, and $X_0$ denotes the pixel-level difference map.
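By way of illustration only (not part of the patent), this computation can be sketched with NumPy, assuming band-first arrays of shape (B, H, W):

```python
import numpy as np

def cva_difference(img1: np.ndarray, img2: np.ndarray) -> np.ndarray:
    """Change vector analysis: the pixel-level difference map X0 as the
    per-pixel Euclidean distance over the B spectral bands of two
    registered images of shape (B, H, W)."""
    diff = img1.astype(np.float64) - img2.astype(np.float64)
    return np.sqrt((diff ** 2).sum(axis=0))
```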
S30: acquiring a plurality of feature maps corresponding to each target image through a preset twin feature extraction network, wherein the feature maps correspond to a plurality of different operational layers of a single branch of the twin feature extraction network. The twin feature extraction network is a feature extraction sub-network of a pre-trained deep twin convolutional network model; the training samples used to train the deep twin convolutional network model all come from registered pairs of high-resolution sample remote sensing images, and the labeling information of the training samples is generated automatically by feeding the sample remote sensing image pairs into a preset labeling algorithm.
It can be understood that high-resolution remote sensing images generally contain rich ground feature information but, as the data scale grows, become massive, high-dimensional and diverse. When traditional shallow learning models process such data, their limited capacity easily leads to over- or under-fitting, resulting in weak feature expression capability and poor generalization. In contrast, a deep learning model has a deeper network structure, can automatically learn more essential feature representations from the raw data using training samples, and has stronger feature extraction capability and generalization performance. For the task of high-resolution remote sensing image change detection, the higher-level feature representations learned by a deep neural network amplify the factors in the sensed input that matter for classification and suppress irrelevant variations (such as illumination and noise interference), thereby promoting the identification of truly changed ground areas. Therefore, in the embodiment of the invention, the feature extraction sub-network of the deep twin convolutional network model is used to extract feature maps from the target images.
In practical applications, the concrete function of the deep twin convolutional network model can be embodied in many ways. For example, in one implementation, the deep twin convolutional network model may be used for overlap prediction on paired image blocks; that is, after a pair of image blocks is input into the deep twin convolutional network model, the model outputs the degree of overlap of that pair. Accordingly, its feature extraction sub-network, i.e. the twin feature extraction network described above, extracts features from the paired image blocks layer by layer through two twin branches. The features of the image blocks belonging to the same image are then spliced layer by layer, yielding a plurality of feature maps corresponding to that image.
Alternatively, in another implementation, the deep twin convolutional network model may be used to segment similar regions between a pair of images; that is, after a pair of images is input into the deep twin convolutional network model, the model outputs the coordinates of the similar regions between the pair. Accordingly, the feature extraction sub-network, i.e. the twin feature extraction network described above, extracts features from the paired images layer by layer through two twin branches.
It is to be understood that the network structure of the deep twin convolutional network model depends on the function it is to realize; for clarity of description, the network structure is illustrated later.
S40: calculating the difference between the feature maps of the two target images that come from operational layers at the same level, to obtain a plurality of target-level difference maps.
For example, the Euclidean distance between the feature maps of the two target images from the same-level operational layer may be calculated:

$$X_j=d\left(x_{1j},x_{2j}\right)=\left\lVert x_{1j}-x_{2j}\right\rVert_2$$

where $x_{1j}$ denotes the feature map output by the $j$-th operational layer of the branch that target image $I_1$ enters, $x_{2j}$ denotes the feature map output by the $j$-th operational layer of the branch that target image $I_2$ enters, $\lVert\cdot\rVert_2$ is the two-norm, $d$ is the Euclidean distance function, and $X_j$ denotes the computed $j$-th target-level difference map; that is, $j$ numbers the operational layers of the twin feature extraction network.
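A corresponding sketch (illustrative only; it assumes the per-layer feature maps have already been spliced into full-image arrays of shape (C_j, H_j, W_j)):

```python
import numpy as np

def target_level_differences(feats_1, feats_2):
    """One target-level difference map X_j per operational layer: the
    channel-wise Euclidean distance between the two branches' feature
    maps, given as lists of arrays of shape (C_j, H_j, W_j)."""
    return [np.sqrt(((f1 - f2) ** 2).sum(axis=0))
            for f1, f2 in zip(feats_1, feats_2)]
```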
S50: fusing the pixel-level difference map with the target-level difference maps, and obtaining a change detection result from the fused image.
Specifically, the pixel-level difference map and the target-level difference maps may simply be averaged to obtain an average-fused image.
Alternatively, in a preferred implementation, a weighted average of the pixel-level difference map and the plurality of target-level difference maps may be computed (as in the following formula) to obtain a weighted-average fused image. In this way a deeper fusion is achieved, retaining more useful change information:

$$X_D=\sum_{k}\mu_k X_k$$

where $X_D$ denotes the fused image, the $X_k$ for different values of $k$ denote the pixel-level difference map and the plurality of target-level difference maps respectively, and $\mu_k$ is the weighting factor corresponding to $X_k$. Preferably, the weighting factor of the pixel-level difference map may be slightly higher than that of each target-level difference map.
After the fused image is obtained, there are various concrete ways to derive the change detection result from it. For example, in one implementation, the regions whose pixel values in the fused image exceed a preset threshold may be declared changed, with the threshold determined from manual experience.
In another implementation, to reduce the influence of a manually set threshold on the detection result, the Otsu threshold method may be used to perform binary segmentation on the fused image; the resulting binary segmentation is the change detection result, in which 1 indicates change and 0 indicates no change.
Here, applying the Otsu method to the fused image $X_D$ means computing, from $X_D$, the segmentation threshold $T$ that maximizes the between-class variance. Specifically, let the proportion of foreground pixels be $w_0$ with mean gray level $u_0$, and the proportion of background pixels be $w_1$ with mean gray level $u_1$. The threshold $T$ is traversed from the minimum gray value to the maximum gray value; the value of $T$ that maximizes the variance $\delta = w_0 w_1 (u_0-u_1)^2$ is the optimal segmentation threshold determined by the Otsu method.
The difference map is then binarized with the threshold $T$:

$$CM(r,c)=\begin{cases}1, & X_D(r,c)>T\\ 0, & X_D(r,c)\le T\end{cases}$$

where $X_D(r,c)$ denotes the pixel value of the fused image $X_D$ at position $(r,c)$, and $CM(r,c)$ denotes the pixel value at position $(r,c)$ of the change detection result, with 1 indicating change at $(r,c)$ and 0 indicating no change at $(r,c)$.
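The fusion and threshold-segmentation steps can be sketched as follows. This is illustrative only: it assumes the difference maps have already been resampled to a common size, and the per-map min-max normalization before weighting is an added assumption the patent does not state.

```python
import numpy as np

def fuse_and_segment(diff_maps, weights):
    """Weighted-average fusion of the pixel-level and target-level
    difference maps, then Otsu thresholding into a binary change map
    CM (1 = changed, 0 = unchanged). All maps are assumed already
    resampled to one common (H, W)."""
    maps = [(m - m.min()) / (m.max() - m.min() + 1e-12) for m in diff_maps]
    fused = sum(w * m for w, m in zip(weights, maps))

    # Otsu: traverse candidate thresholds T and maximize the
    # between-class variance delta = w0 * w1 * (u0 - u1)^2
    hist, edges = np.histogram(fused, bins=256)
    prob = hist / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for i in range(1, len(centers)):
        w0, w1 = prob[:i].sum(), prob[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        u0 = (prob[:i] * centers[:i]).sum() / w0
        u1 = (prob[i:] * centers[i:]).sum() / w1
        var = w0 * w1 * (u0 - u1) ** 2
        if var > best_var:
            best_var, best_t = var, centers[i]
    return (fused > best_t).astype(np.uint8)
```

With the preferred weights described later, a call would look like `cm = fuse_and_segment([X0, X1, X2, X3, X4], [0.4, 0.1, 0.3, 0.1, 0.1])`.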
In the high-resolution remote sensing image change detection method based on self-supervised learning provided by the embodiment of the invention, the plurality of feature maps corresponding to each target image are obtained through a twin feature extraction network, so that the targets in the target images serve as the units of analysis; feature information such as spectrum, texture and geometric shape contained in the target images is effectively extracted, key information in the target images is better mined, and the image context information is fully exploited. Consequently, after the differences between the feature maps of the two target images from same-level operational layers are calculated, the resulting target-level difference maps are fused with the pixel-level difference map of the two target images; the change detection result obtained is more accurate, the missed-detection and false-detection rates are lower, and the salt-and-pepper phenomenon is effectively avoided. Moreover, the twin feature extraction network used in the embodiment of the invention is the feature extraction sub-network of a pre-trained deep twin convolutional network model; the training samples used to train that model all come from registered pairs of high-resolution sample remote sensing images and are matched to the detection task, ensuring that the practical performance of the twin feature extraction network is consistent with its performance at the end of training. In addition, in the embodiment of the invention, the labeling information of the training samples is generated automatically by feeding the sample remote sensing image pairs into a preset labeling algorithm; it does not depend on subjective human experience, introduces no subjective uncertainty, and has higher labeling precision. This mechanized labeling is also far more efficient: it requires no large investment of manpower, material resources and time, and useful samples can be obtained promptly and accurately, so the method has a wider application range in real scenes.
In summary, the embodiment of the invention is a completely unsupervised method; it realizes automatic and efficient change detection of high-resolution remote sensing images in complex scenes and further improves change detection performance.
It can be understood that remote sensing image change detection methods can be divided into pixel-based and target-based methods according to the unit of analysis and the means of feature extraction. The embodiment of the invention adopts the idea of pixel-based change detection for computing the pixel-level difference map, and the idea of target-based change detection for extracting the feature maps and computing the target-level difference maps with the twin feature extraction network.
A target-based change detection method generally comprises three links: image segmentation, feature extraction and change identification. Image segmentation is the core of target-based change detection; it aims to divide the remote sensing image into a number of targets that are spectrally similar and spatially adjacent, laying the foundation for subsequent feature extraction and change analysis. How to select an appropriate segmentation scale and accurately represent the key information in the image is therefore a key factor determining detection quality. To this end, the embodiment of the invention adopts a labeling algorithm that realizes automatic segmentation of the image and thereby constructs a labeled sample data set for the image-block classification task, requiring no manual labeling and suiting any type of remote sensing image. Accordingly, the deep twin convolutional network model is designed as a self-supervised learning model that performs overlap prediction on paired image blocks.
Specifically, the preferred labeling algorithm proposed by the embodiment of the invention comprises:
(1) comparing the spectral features of the sample remote sensing image pair to obtain a sample pixel-level difference map.
The specific implementation of this step can be seen in step S20 above.
(2) performing superpixel segmentation on the sample pixel-level difference map to obtain a first segmentation result.
It can be understood that a superpixel is an irregular block of adjacent pixels with similar texture, color, brightness and other characteristics, carrying a certain visual significance.
For example, in this step the simple linear iterative clustering (SLIC) method can be used to perform superpixel segmentation on the sample pixel-level difference map, as sketched below. Specifically, a segmentation scale $s$ is set (for example, 10); the sample pixel-level difference map is segmented with simple linear iterative clustering to obtain superpixels that are spectrally similar and spatially adjacent, together with the corresponding superpixel segmentation mask $I_M$.
The superpixel segmentation method in this step is not limited to this; for example, the FS algorithm, which generates superpixels with a Fuzzy C-Means (FCM) clustering algorithm, may also be used to segment the sample pixel-level difference map.
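An illustrative sketch with scikit-image follows. Note that the patent's segmentation scale $s$ does not map directly onto SLIC's `n_segments` and `compactness` parameters, so the values below are placeholder assumptions:

```python
import numpy as np
from skimage.measure import regionprops
from skimage.segmentation import slic

def superpixel_centroids(diff_map: np.ndarray, n_segments: int = 500):
    """SLIC superpixel segmentation of a single-band (H, W) difference
    map, returning the label mask I_M and the integer centroid
    coordinates (row, col) of every superpixel."""
    mask = slic(diff_map, n_segments=n_segments, compactness=0.1,
                channel_axis=None)  # channel_axis=None: single-band image
    centroids = [tuple(int(round(v)) for v in p.centroid)
                 for p in regionprops(mask)]
    return mask, centroids
```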
(3) taking the centroid coordinates of each superpixel in the first segmentation result as image-block centers and a preset size as the image-block size, calculating the degree of overlap of each pair of image blocks between the two images of the sample remote sensing image pair.
Specifically, for each superpixel, its centroid coordinates are computed from the segmentation mask $I_M$; then, in each of the two sample images of the sample remote sensing image pair, an image block of size $S\times S$ (with $S = 2s+2$) is taken centered on the centroid coordinates. In this way, each sample remote sensing image yields as many image blocks as there are superpixels. The degree of overlap of each pair of image blocks between the two sample images is then calculated. For example, if the segmentation result contains 10 superpixels, 10 image blocks are taken from each sample remote sensing image, and the degree of overlap must be calculated for the 10 × 10 = 100 image-block pairs between the two images.
The degree of overlap is obtained mainly by calculating the intersection-over-union of the two image blocks:

$$Y(p,q)=\begin{cases}0, & \mathrm{IoU}(p,q)=0\\ 1, & 0<\mathrm{IoU}(p,q)<1\\ 2, & \mathrm{IoU}(p,q)=1\end{cases}$$

where the image blocks $p$ and $q$ come from the two different images of the sample remote sensing image pair; $Y(p,q)$ denotes the degree of overlap of image block $p$ and image block $q$, i.e. the category label of the pair $(p,q)$; and $\mathrm{IoU}$ denotes the intersection-over-union between the image blocks.
For example, FIG. 3 is a schematic diagram of constructing labeling information for a sample remote sensing image pair, in which the left and right images are the two images of the pair. Image block 1 and image block 3 partially overlap, their intersection-over-union lying in (0, 1), so the category label of the pair (1, 3) is 1; image block 1 and image block 4 overlap completely, their intersection-over-union being 1, so the category label of the pair (1, 4) is 2; image block 1 and image block 2 do not overlap at all, their intersection-over-union being 0, so the category label of the pair (1, 2) is 0.
After each pair of image blocks is labeled according to the above process, the image-block pairs can be used as training samples, with the labels assigned to them as the corresponding labeling information.
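Because the paired blocks are equal-size, axis-aligned squares, the intersection-over-union follows directly from the two center coordinates. A sketch of the labeling rule (illustrative only, not taken from the patent):

```python
def overlap_label(c1, c2, size: int) -> int:
    """Category label Y(p, q) for two size x size image blocks centred
    at pixel coordinates c1 and c2 (row, col) in the two registered
    images: 0 = no overlap, 1 = partial overlap, 2 = complete overlap."""
    inter_h = max(0, size - abs(c1[0] - c2[0]))
    inter_w = max(0, size - abs(c1[1] - c2[1]))
    inter = inter_h * inter_w
    iou = inter / (2 * size * size - inter)  # intersection over union
    if iou == 0:
        return 0
    return 2 if iou == 1 else 1
```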
Thus, in step S30, acquiring a plurality of feature maps corresponding to each target image through the twin feature extraction network comprises:
(1) performing superpixel segmentation on the pixel-level difference map obtained in step S20, to obtain a second segmentation result.
Here, the superpixel segmentation is performed in the same manner as used above when constructing the training samples.
(2) generating a plurality of template windows by taking the centroid coordinates of each superpixel in the second segmentation result as image-block centers and the preset size as the image-block size.
Here, the preset size is the same as the preset size used above when constructing the training samples.
(3) dividing each target image into image blocks according to the template windows, to obtain a plurality of pairs of image blocks whose positions correspond one-to-one between the two target images.
It can be understood that, since a change detection task is being executed at this point, only the image-block pairs whose positions correspond one-to-one between the two target images need to be taken; pairs at non-corresponding positions need not be considered.
(4) extracting the multi-layer features of each image block from each obtained pair of image blocks with the twin feature extraction network.
Specifically, each pair of image blocks obtained in step (3) is input into the twin feature extraction network, which outputs the multi-layer features of each image block. It can be understood that the multi-layer features described here correspond one-to-one to the different operational layers of the branch that the image block enters.
(5) splicing, layer by layer, the multi-layer features belonging to the same target image, to obtain a plurality of feature maps corresponding to each target image.
Specifically, for each target image, the features that were extracted from its image blocks and that correspond to the same operational layer are spliced together, yielding a complete feature map of that target image for each operational layer, as sketched below.
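A sketch of this same-layer splicing under stated assumptions: the patent does not specify the per-layer stride bookkeeping or how overlapping template windows are handled, so the integer stride division and the averaging of overlaps below are assumptions.

```python
import numpy as np

def stitch_same_layer(patch_feats, centers, image_hw, stride):
    """Paste the per-image-block features of one operational layer back
    into a full-image feature map of shape (C, H // stride, W // stride);
    where template windows overlap, features are averaged."""
    C, fh, fw = patch_feats[0].shape
    H, W = image_hw[0] // stride, image_hw[1] // stride
    acc = np.zeros((C, H, W))
    cnt = np.zeros((1, H, W))
    for f, (r, c) in zip(patch_feats, centers):
        r0 = int(np.clip(r // stride - fh // 2, 0, H - fh))
        c0 = int(np.clip(c // stride - fw // 2, 0, W - fw))
        acc[:, r0:r0 + fh, c0:c0 + fw] += f
        cnt[:, r0:r0 + fh, c0:c0 + fw] += 1
    return acc / np.maximum(cnt, 1)
```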
In the embodiment of the invention, many deep twin convolutional network models are suitable for the above labeling algorithm. For example, as shown in FIG. 4, the designed deep twin convolutional network model may comprise, along the data flow direction: the twin feature extraction network, a feature comparison layer that takes the absolute difference of the features output at the end of the twin feature extraction network, and a fully connected layer.
As shown in FIG. 4, each branch of the twin feature extraction network comprises, in sequence along the data flow direction, a first convolutional layer, a first BN layer, a max-pooling layer and a residual layer as the plurality of different operational layers of that branch, the residual layer comprising two serially connected residual modules. Each residual module comprises, along the data flow direction, a second convolutional layer, a second BN layer, a third convolutional layer, a third BN layer and an addition module; the input features of the second convolutional layer and the output features of the third BN layer are fed jointly into the addition module for summation. The first BN layer and the second BN layer are each followed by a ReLU activation layer.
To reduce the number of model parameters, the first convolutional layer preferably comprises 64 convolution kernels of size 3 × 3 with a stride of 2; the kernel size of the max-pooling layer is 3 × 3 with a stride of 2; and the second and third convolutional layers each comprise 64 convolution kernels of size 3 × 3 with a stride of 1.
It should be noted that settings such as the size and number of convolution kernels and the stride are merely examples and do not limit the embodiments of the invention; in practice they may be adjusted according to factors such as the actual resolution of the target images.
With the deep twin convolutional network model of FIG. 4, the twin feature extraction network has 4 different operational layers, so 4 feature maps are obtained for each target image in step S30 and 4 target-level difference maps are calculated in step S40. When these are fused with the pixel-level difference map, the weighting factors are preferably set to {0.4, 0.1, 0.3, 0.1, 0.1}: 0.4 for the pixel-level difference map, 0.1 for the target-level difference map obtained from the features output by the two first convolutional layers, 0.3 for that obtained from the two first BN layers, 0.1 for that obtained from the two max-pooling layers, and 0.1 for that obtained from the two residual layers.
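By way of illustration, the structure of FIG. 4 described above could be sketched in PyTorch as follows. The padding values and the global average pooling before the fully connected layer are assumptions not stated in the patent:

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Second conv -> second BN -> ReLU -> third conv -> third BN,
    summed with the module input (the addition module)."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.conv2 = nn.Conv2d(ch, ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.relu = nn.ReLU(inplace=True)
        self.conv3 = nn.Conv2d(ch, ch, 3, stride=1, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(ch)

    def forward(self, x):
        return x + self.bn3(self.conv3(self.relu(self.bn2(self.conv2(x)))))

class Branch(nn.Module):
    """One branch of the twin feature extraction network; returns the
    outputs of its four operational layers."""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 64, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.Sequential(nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)
        self.res = nn.Sequential(ResidualModule(), ResidualModule())

    def forward(self, x):
        f1 = self.conv1(x)   # first convolutional layer
        f2 = self.bn1(f1)    # first BN layer (+ ReLU)
        f3 = self.pool(f2)   # max-pooling layer
        f4 = self.res(f3)    # residual layer (two residual modules)
        return [f1, f2, f3, f4]

class DeepTwinNet(nn.Module):
    """Shared branch, absolute-difference feature comparison layer and
    a fully connected head over the three overlap classes."""
    def __init__(self, in_ch: int = 3, n_classes: int = 3):
        super().__init__()
        self.branch = Branch(in_ch)  # weights shared by both inputs
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, n_classes))

    def forward(self, p, q):
        fp, fq = self.branch(p)[-1], self.branch(q)[-1]
        return self.head(torch.abs(fp - fq))  # feature comparison layer
```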
Further, the network models that fit the labeling algorithm and can realize the overlap prediction function are not limited to the one shown in FIG. 4; for example, other networks with similar functions may be built with a ResNet18 network or another residual network as the backbone, or the above twin feature extraction network may be built from two VGG16 networks, and so on.
After the network model and the data set are constructed, the model can be trained.
First, initial parameters are set for the model. Here, to accelerate convergence and improve the model's feature extraction capability, weights pre-trained on an existing data set may be used as the initial parameters of the built model. For example, when a ResNet18 network is used as the backbone to build the deep twin convolutional network model, the weights of a ResNet18 network pre-trained on the ImageNet data set can be used as the initial parameters of the deep twin convolutional network model.
Then, the training samples in the training sample library are input into the built model so that it outputs prediction results. A loss value is calculated from the prediction results and the labeling information of the input training samples; when the loss value falls below a preset loss threshold, the model has converged and the trained deep twin convolutional network model is obtained.
Since the model performs a multi-class classification task, the Softmax loss function is used as the loss function of the network, which can be expressed as:

$$L=-\frac{1}{N}\sum_{n=1}^{N}\sum_{i} y_i^{(n)}\log \hat{y}_i^{(n)}$$

where $N$ denotes the number of training samples; $y_i^{(n)}$ indicates whether the $n$-th training sample belongs to the $i$-th class, taking the value 0 if it does not and 1 if it does, its specific value being determined by the labeling information of the $n$-th training sample; and $\hat{y}_i^{(n)}$ denotes the probability, output by the network model, that the $n$-th training sample belongs to the $i$-th class, computed by the Softmax function from the weights $w_i$ associated with the $i$-th class.
During training, the network is optimized with the Adam algorithm (Adaptive Moment Estimation); the learning rate may be set to 0.001, the weight decay coefficient to 0.00005 and the number of training epochs to 200, and these settings may be adjusted.
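A minimal training-loop sketch under these settings (the data loader and its (block p, block q, label) batch format are assumptions for illustration):

```python
import torch
import torch.nn as nn

def train_model(model, loader, epochs: int = 200, device: str = "cuda"):
    """Training sketch with the hyper-parameters above: Adam optimizer,
    learning rate 0.001, weight decay 0.00005, Softmax (cross-entropy)
    loss over the three overlap classes."""
    model = model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-5)
    loss_fn = nn.CrossEntropyLoss()  # combines Softmax and NLL loss
    for _ in range(epochs):
        for p, q, label in loader:
            p, q, label = p.to(device), q.to(device), label.to(device)
            opt.zero_grad()
            loss = loss_fn(model(p, q), label)
            loss.backward()
            opt.step()
```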
In other implementations, if the deep twin convolutional network model is a self-supervised learning model that performs similar-region segmentation on the input sample remote sensing image pair, its main structure can be built from common image segmentation models such as FCN, SegNet, U-Net or PSPNet. When constructing the labeling information for the sample remote sensing image pair, each image of the pair can be traversed with an image block of preset size as a template, with an adjustable traversal step equal to or smaller than the side length of the image block, so that each image is divided into a plurality of image blocks. Then, the similarity between the image blocks at each pair of corresponding positions between the two images is calculated, and the changed and unchanged regions of the pair are determined from a similarity threshold, thereby constructing the labeling information automatically and realizing self-supervised learning.
Optionally, in one implementation, to increase the convergence speed of the model, before the twin feature extraction network is used to extract the multi-layer features of each image block from each obtained pair of image blocks, the method provided by the embodiment of the invention may further comprise: normalizing the obtained image blocks. Likewise, when the deep twin convolutional network model is trained, the image blocks in the training samples are also normalized image blocks.
The normalization process can be formulated as:

$$X_{out}=\frac{X(r,c)-X_{min}}{X_{max}-X_{min}}$$

where $X(r,c)$ is the pixel value of the image block at position $(r,c)$, $X_{out}$ is the corresponding normalized output pixel value, and $X_{max}$ and $X_{min}$ are the maximum and minimum pixel values of the target image, here 255 and 0 respectively.
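A one-line sketch of this normalization, assuming 8-bit input blocks:

```python
import numpy as np

def normalize_block(x: np.ndarray, x_min: float = 0.0,
                    x_max: float = 255.0) -> np.ndarray:
    """Min-max normalization of an image block to [0, 1] with the
    fixed 8-bit range (0, 255) used above."""
    return (x.astype(np.float32) - x_min) / (x_max - x_min)
```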
In conclusion, the embodiments of the invention can solve various problems in the prior art, such as the inability to effectively represent the ground feature information of an image, the shortage of high-quality labeled samples and the insufficient extraction of change information, so that deep learning technology can be used practically and effectively in high-resolution remote sensing image change detection, thereby improving change detection performance.
The beneficial effects of the embodiment of the invention can be further explained by the following simulation:
simulation conditions are as follows: the method provided by the invention and the simulation comparison experiment of the existing IR-MAD (iterative weighted Multivariate Detection algorithm) method, PCANet (multi-channel principal component analysis network-based) method and DCVA (depth variation vector analysis) algorithm are completed under AMD Ryzen 5800X CPU, 3.8GHz Windows 10 system and Python3.7 running platform.
Evaluation indexes:
(1) Precision: the proportion of pixels correctly identified as changed among all pixels predicted as changed in the change detection result, calculated as:

$$\mathrm{Precision}=\frac{TP}{TP+FP}$$

where $TP$ and $FP$ denote the numbers of pixels correctly and incorrectly classified as the changed class in the change detection result, respectively.
(2) Recall: the proportion of pixels correctly identified as changed among all truly changed pixels, calculated as:

$$\mathrm{Recall}=\frac{TP}{TP+FN}$$

where $FN$ denotes the number of truly changed pixels not correctly identified as the changed class in the change detection result.
(3) Accuracy: the proportion of correctly detected pixels among all pixels of the image, calculated as:

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$

where $TN$ denotes the number of pixels correctly classified as unchanged in the change detection result.
(4) Kappa coefficient: measures the overall classification capability of a change detection algorithm; the larger the value, the better the classification performance. It is calculated as:

$$\mathrm{Kappa}=\frac{\mathrm{Accuracy}-\mathrm{PRE}}{1-\mathrm{PRE}},\qquad \mathrm{PRE}=\frac{(TP+FP)\cdot RC+(FN+TN)\cdot RU}{(TP+TN+FP+FN)^{2}}$$

where $RC$ and $RU$ denote the numbers of changed and unchanged pixels, respectively, in the reference image manually labeled from prior knowledge.
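For illustration, the four indexes can be computed from a binary change map and a binary reference map as follows; the PRE term follows the reconstruction above, and this is a sketch rather than code from the patent:

```python
import numpy as np

def evaluate(pred: np.ndarray, ref: np.ndarray) -> dict:
    """Precision, recall, accuracy and Kappa from a binary change map
    `pred` and a binary reference map `ref` (1 = changed)."""
    tp = int(np.sum((pred == 1) & (ref == 1)))
    fp = int(np.sum((pred == 1) & (ref == 0)))
    fn = int(np.sum((pred == 0) & (ref == 1)))
    tn = int(np.sum((pred == 0) & (ref == 0)))
    n = tp + fp + fn + tn
    acc = (tp + tn) / n
    rc, ru = tp + fn, fp + tn  # truly changed / truly unchanged pixels
    pre = ((tp + fp) * rc + (fn + tn) * ru) / (n * n)  # chance agreement
    return {"precision": tp / (tp + fp),
            "recall": tp / (tp + fn),
            "accuracy": acc,
            "kappa": (acc - pre) / (1 - pre)}
```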
Experimental data:
The first test data set was captured somewhere in the country, as shown in sub-figures (a) and (b) of FIG. 5, and mainly reflects land-type changes caused by ground vegetation. Each image comprises the red, green and near-infrared bands, is 877 × 738 pixels in size and has a spatial resolution of 2.5 meters. Sub-figure (c) of FIG. 5 is the reference image manually labeled by an expert from prior knowledge of the images, in which white denotes changed regions and black denotes unchanged regions.
The second test data set was captured at another location in the country, as shown in sub-figures (a) and (b) of FIG. 6, and mainly reflects land-type changes during urbanization. Each image comprises the red, green and blue bands, is 1431 × 1431 pixels in size and has a spatial resolution of 2 meters. Sub-figure (c) of FIG. 6 is the reference image obtained by manual labeling from visual interpretation of the images.
Experimental contents and results:
Sub-figures (a)-(d) of FIG. 7 are the detection results obtained on the first set of experimental data by the existing IR-MAD method, the PCANet method, the DCVA method and the method provided by the embodiment of the invention, respectively.
By comparison, the IR-MAD method erroneously identifies unchanged roads as the changed class, yielding a poor detection result; by considering neighborhood information around pixels, the PCANet method can detect the major changed areas, but many small changes are not effectively identified; the DCVA method can detect most of the changed regions on the ground surface, but a large number of white noise points appear in its result; the method provided by the embodiment of the invention can accurately identify the changed areas, and its change result is cleaner overall.
Sub-figures (a)-(d) of FIG. 8 are the detection results obtained on the second set of experimental data by the existing IR-MAD method, the PCANet method, the DCVA method and the method provided by the embodiment of the invention, respectively.
The comparison shows that the IR-MAD method cannot effectively identify the changed areas and its detection result is full of speckles; the PCANet method detects only a small portion of the changed regions, while some unchanged regions are also identified as the changed class; the DCVA method can identify the major changed areas but loses a great deal of change detail; the method provided by the embodiment of the invention effectively highlights the changes on the ground surface, and its result contains less noise.
The quantitative evaluation indexes of the different change detection methods on the two sets of experimental data are shown in FIG. 9. It can be seen from FIG. 9 that the Accuracy and Kappa values of the method provided by the embodiment of the invention on the first test data set are 0.9754 and 0.9013 respectively, far better than the existing methods. On the second test data set, which contains more ground feature types and is therefore harder for change detection, the method provided by the embodiment of the invention achieves the highest Recall, Accuracy and Kappa, further illustrating its effectiveness and feasibility.
In summary, the self-supervised-learning-based high-resolution remote sensing image change detection method provided by the embodiment of the invention takes superpixels as the basic analysis unit, extracts useful features from the images in a self-supervised manner, and combines feature comparison with threshold segmentation to obtain the final change detection result. Compared with existing methods, the method provided by the embodiment of the invention achieves better performance in high-resolution remote sensing image change detection, and has a wider application range and stronger robustness.
The method provided by the embodiment of the invention can be applied to electronic equipment. Specifically, the electronic device may be a desktop computer, a portable computer, a terminal device, a server, etc.; no limitation is imposed here, and any electronic device capable of implementing the present invention falls within the scope of the present invention.
It should be noted that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description of the specification, reference to the description of the term "one embodiment", "some embodiments", "an example", "a specific example", or "some examples", etc., means that a particular feature or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples described in this specification can be combined and combined by those skilled in the art.
While the invention has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a review of the drawings, the disclosure, and the appended claims.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, numerous simple deductions or substitutions may be made without departing from the spirit of the invention, which shall be deemed to belong to the scope of the invention.

Claims (10)

1. A high-resolution remote sensing image change detection method based on self-supervision learning is characterized by comprising the following steps:
acquiring two high-resolution remote sensing images of the same area shot at different moments, and performing radiation correction and registration on the two high-resolution remote sensing images to obtain two registered target images;
comparing the spectral characteristics of the two target images to obtain a pixel level difference image;
acquiring a plurality of feature maps corresponding to each target image through a preset twin feature extraction network; wherein the plurality of feature maps respectively correspond to a plurality of different operation layers of a single branch of the twin feature extraction network;
calculating the difference between the feature maps of the two target images from the same-level operation layer to obtain a plurality of target level difference images;
fusing the pixel level difference image and the target level difference images, and obtaining a change detection result according to the fused image;
the twin feature extraction network is a feature extraction sub-network in a deep twin convolutional network model trained in advance; the training samples used for training the deep twin convolutional network model all come from registered pairs of high-resolution sample remote sensing images, and the labeling information of the training samples is generated automatically by inputting the sample remote sensing image pairs into a preset labeling algorithm.
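Purely as an illustration of the data flow in claim 1 above (not the patented implementation itself), the sketch below wires the steps together in Python; `extract_features` stands in for the pre-trained twin feature extraction network, `fuse_and_segment` for the fusion and segmentation of claims 8 and 9, and the per-band magnitude difference is one assumed way of comparing spectral characteristics.

```python
import numpy as np

def detect_changes(target_1, target_2, extract_features, fuse_and_segment):
    """Data flow of claim 1; both inputs are the registered target images.

    extract_features(img) -> list of feature maps, one per operation
    layer of a single branch of the twin feature extraction network.
    """
    # Pixel-level difference image from spectral comparison.
    pixel_diff = np.linalg.norm(
        target_1.astype(float) - target_2.astype(float), axis=-1)

    # A plurality of feature maps for each target image.
    feats_1 = extract_features(target_1)
    feats_2 = extract_features(target_2)

    # One target-level difference image per same-level operation layer.
    target_diffs = [np.linalg.norm(f1 - f2, axis=-1)
                    for f1, f2 in zip(feats_1, feats_2)]

    # Fuse the pixel-level and target-level maps, then segment.
    return fuse_and_segment(pixel_diff, target_diffs)
```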
2. The method of claim 1, wherein the deep twin convolutional network model is used for overlap prediction of pairs of image blocks;
the labeling algorithm comprises the following steps:
comparing spectral characteristics of the sample remote sensing image pair to obtain a sample pixel level difference image;
performing superpixel segmentation on the sample pixel level difference image to obtain a first segmentation result;
calculating the overlapping degree of each pair of image blocks between the sample remote sensing image pairs by taking the coordinate of the centroid position of each super pixel in the first segmentation result as the center of the image block and a preset size as the size of the image block;
and constructing the training sample and the labeling information according to each pair of image blocks in the sample remote sensing image pair and the overlapping degree of the image blocks.
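A minimal sketch of the centroid step of the labelling algorithm in claim 2 follows; SLIC is used per claim 4, while `n_segments` and the rounding of centroids are illustrative assumptions. Pairs of same-size blocks cut at these centres from the two sample images, labelled with the overlap degree of claim 5, then form the training samples.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_block_centres(sample_pixel_diff, n_segments=500):
    """Superpixel segmentation of the sample pixel-level difference
    image (claim 4: simple linear iterative clustering); the centroid
    of each superpixel becomes the centre of one image block."""
    segments = slic(sample_pixel_diff, n_segments=n_segments,
                    channel_axis=None)  # 2-D (single-band) input
    centres = []
    for label in np.unique(segments):
        ys, xs = np.nonzero(segments == label)
        centres.append((int(round(ys.mean())), int(round(xs.mean()))))
    return centres
```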
3. The method according to claim 2, wherein the obtaining of the plurality of feature maps corresponding to each target image based on the preset twin feature extraction network comprises:
performing superpixel segmentation on the pixel level difference image to obtain a second segmentation result;
generating a plurality of template windows by taking the coordinates of the mass center position of each super pixel in the second segmentation result as the center of the image block and the preset size as the size of the image block;
dividing each target image into image blocks according to the template windows to obtain a plurality of pairs of image blocks in one-to-one correspondence between the two target images;
extracting the multilayer features of each image block from each pair of obtained image blocks by using the twin feature extraction network;
and carrying out same-layer splicing on the multilayer features belonging to the same target image to obtain a plurality of feature maps corresponding to each target image.
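The same-layer splicing step of claim 3 can be pictured with the sketch below, in which each block's layer output is reduced to a single feature vector, overlapping blocks are averaged, and the 32-pixel block size is a placeholder; all three simplifications are assumptions for illustration, not details from the patent.

```python
import numpy as np

def splice_same_layer(block_feats, centres, height, width, block=32):
    """Write each block's feature vector back over its footprint to
    form one feature map covering the whole target image; positions
    covered by several blocks are averaged."""
    dim = block_feats[0].shape[-1]
    canvas = np.zeros((height, width, dim))
    counts = np.zeros((height, width))
    half = block // 2

    for feat, (cy, cx) in zip(block_feats, centres):
        y0, y1 = max(cy - half, 0), min(cy + half, height)
        x0, x1 = max(cx - half, 0), min(cx + half, width)
        canvas[y0:y1, x0:x1] += feat     # broadcast over the footprint
        counts[y0:y1, x0:x1] += 1

    counts[counts == 0] = 1              # avoid division by zero
    return canvas / counts[..., None]
```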
4. A method according to claim 2 or 3, wherein the superpixel segmentation is achieved using a simple linear iterative clustering method.
5. The method of claim 2, wherein the degree of overlap is calculated by:
y(p, q) = IoU(p, q)
wherein the image blocks p and q respectively come from the two different images in the sample remote sensing image pair; y(p, q) represents the degree of overlap of image block p and image block q; and IoU denotes the intersection-over-union ratio between image blocks.
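For two axis-aligned square blocks of the same preset size, the intersection-over-union has the closed form sketched below (the function name is illustrative). Blocks centred on the same coordinates give y = 1; blocks with disjoint footprints give y = 0.

```python
def overlap_degree(centre_p, centre_q, size):
    """y(p, q): IoU of two square blocks of side `size` whose centres
    are centre_p and centre_q, given as (row, column) coordinates."""
    (py, px), (qy, qx) = centre_p, centre_q
    inter_h = max(0, size - abs(py - qy))   # overlap height
    inter_w = max(0, size - abs(px - qx))   # overlap width
    inter = inter_h * inter_w
    union = 2 * size * size - inter
    return inter / union
```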
6. The method of claim 3, wherein before extracting the multi-layer features of each tile from each pair of resulting tiles using the twin feature extraction network, the method further comprises:
normalizing the obtained image block;
moreover, the image blocks constituting the training samples are also normalized image blocks.
7. The method of claim 1, wherein calculating the difference between the feature maps from the same level of operation for the two target images comprises:
calculating the Euclidean distance between the feature maps of the two target images from the same-level operation layer.
8. The method of claim 1, wherein fusing the pixel-level difference map and the plurality of target-level difference maps comprises:
calculating a weighted average of the pixel level difference image and the plurality of target level difference images to obtain the fused image.
9. The method of claim 1, wherein deriving change detection results from the fused image comprises:
performing binary segmentation on the fused image by adopting an Otsu threshold method, wherein the obtained binary segmentation result is a change detection result; wherein, 1 in the binary segmentation result indicates that there is a change, and 0 indicates that there is no change.
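Claims 7 to 9 together amount to a short post-processing chain, sketched below; min-max normalisation of each map, equal fusion weights, and a common resolution for all maps are assumptions of convenience, not details from the claims.

```python
import numpy as np
from skimage.filters import threshold_otsu

def layer_difference(f1, f2):
    """Claim 7: per-position Euclidean distance between two feature maps."""
    return np.linalg.norm(f1 - f2, axis=-1)

def fuse_and_segment(pixel_diff, target_diffs, weights=None):
    """Claims 8-9: weighted average of all difference maps, then
    binary segmentation with the Otsu threshold (1 = changed)."""
    maps = [pixel_diff] + list(target_diffs)
    if weights is None:
        weights = [1.0 / len(maps)] * len(maps)  # equal weights (assumption)

    # Min-max normalise each map so the weighted average is balanced.
    norm = [(m - m.min()) / (m.max() - m.min() + 1e-12) for m in maps]
    fused = sum(w * m for w, m in zip(weights, norm))

    return (fused > threshold_otsu(fused)).astype(np.uint8)
```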
10. The method of claim 1, wherein the deep twin convolutional network model comprises, along the data flow direction: the twin feature extraction network, a feature comparison layer for computing the absolute difference of the features output at the end of the twin feature extraction network, and a fully connected layer;
wherein each branch of the twin feature extraction network comprises in sequence along the data flow direction: a first convolution layer, a first BN layer, a maximum pooling layer and a residual layer as a plurality of different operation layers of the branch; wherein the residual layer comprises two serially connected residual modules;
the residual module sequentially comprises, along the data flow direction: the second convolution layer, the second BN layer, the third convolution layer, the third BN layer and an addition module; the input features of the second convolution layer and the output features of the third BN layer are jointly fed into the addition module to be summed;
the first BN layer and the second BN layer are each followed by a ReLU activation layer;
the first convolution layer comprises 64 convolution kernels of size 3 × 3, with a convolution step size of 2;
the convolution kernel size in the maximum pooling layer is 3 × 3, with a convolution step size of 2;
the second convolution layer and the third convolution layer each comprise 64 convolution kernels of size 3 × 3, with a convolution step size of 1.
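The branch of claim 10 translates almost directly into PyTorch. The sketch below keeps the claimed hyper-parameters (64 kernels of 3 × 3, the stated strides, two serially connected residual modules, the absolute-difference comparison layer) and adds padding, global average pooling before the fully connected layer, and a sigmoid output as assumptions of convenience.

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Residual module of claim 10: second conv + BN (+ ReLU), third
    conv + BN, then the addition module sums with the module input."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv3 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.relu(self.bn2(self.conv2(x)))  # second BN has a ReLU
        out = self.bn3(self.conv3(out))
        return out + x                            # addition module

class TwinBranch(nn.Module):
    """One branch of the twin feature extraction network."""
    def __init__(self, in_channels=3):
        super().__init__()
        # First convolution layer: 64 kernels of 3 x 3, stride 2.
        self.conv1 = nn.Conv2d(in_channels, 64, 3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)  # 3 x 3, stride 2
        self.res = nn.Sequential(ResidualModule(), ResidualModule())

    def forward(self, x):
        f1 = self.relu(self.bn1(self.conv1(x)))  # first operation layer
        f2 = self.pool(f1)                       # max pooling layer
        f3 = self.res(f2)                        # residual layer
        return [f1, f2, f3]

class DeepTwinModel(nn.Module):
    """Twin branches with shared weights, a feature comparison layer
    (absolute difference of the end features) and a fully connected
    layer predicting the overlap degree of a pair of image blocks."""
    def __init__(self):
        super().__init__()
        self.branch = TwinBranch()
        self.fc = nn.Linear(64, 1)

    def forward(self, p, q):
        fp, fq = self.branch(p)[-1], self.branch(q)[-1]
        diff = torch.abs(fp - fq)          # feature comparison layer
        pooled = diff.mean(dim=(2, 3))     # global average pool (assumption)
        return torch.sigmoid(self.fc(pooled)).squeeze(1)
```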
CN202210509781.1A 2022-05-11 2022-05-11 High-resolution remote sensing image change detection method based on self-supervision learning Pending CN114926511A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210509781.1A CN114926511A (en) 2022-05-11 2022-05-11 High-resolution remote sensing image change detection method based on self-supervision learning

Publications (1)

Publication Number Publication Date
CN114926511A true CN114926511A (en) 2022-08-19

Family

ID=82808749

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210509781.1A Pending CN114926511A (en) 2022-05-11 2022-05-11 High-resolution remote sensing image change detection method based on self-supervision learning

Country Status (1)

Country Link
CN (1) CN114926511A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161218A (en) * 2019-12-10 2020-05-15 核工业北京地质研究院 High-resolution remote sensing image change detection method based on twin convolutional neural network
CN114022522A (en) * 2021-08-30 2022-02-08 北京邮电大学 Multi-time-phase remote sensing image registration method and system based on multi-scale receptive field
CN113901900A (en) * 2021-09-29 2022-01-07 西安电子科技大学 Unsupervised change detection method and system for homologous or heterologous remote sensing image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TAO ZHAN et al.: "S3Net: Superpixel-Guided Self-Supervised Learning Network for Multitemporal Image Change Detection", IEEE Geoscience and Remote Sensing Letters, 1 August 2023 (2023-08-01) *
ZHAN Tao: "Research on Change Detection of Multi-temporal Remote Sensing Images Based on Deep Learning", China Doctoral Dissertations Full-text Database, Engineering Science and Technology II, 15 July 2020 (2020-07-15) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115170575A (en) * 2022-09-09 2022-10-11 阿里巴巴(中国)有限公司 Method and equipment for remote sensing image change detection and model training
CN115170575B (en) * 2022-09-09 2022-12-23 阿里巴巴(中国)有限公司 Method and equipment for remote sensing image change detection and model training
CN115761529A (en) * 2023-01-09 2023-03-07 阿里巴巴(中国)有限公司 Image processing method and electronic device
CN115861823A (en) * 2023-02-21 2023-03-28 航天宏图信息技术股份有限公司 Remote sensing change detection method and device based on self-supervision deep learning
CN117173104A (en) * 2023-08-04 2023-12-05 山东大学 Low-altitude unmanned aerial vehicle image change detection method and system
CN117173104B (en) * 2023-08-04 2024-04-16 山东大学 Low-altitude unmanned aerial vehicle image change detection method and system
CN117115679A (en) * 2023-10-25 2023-11-24 北京佳格天地科技有限公司 Screening method for space-time fusion remote sensing image pairs
CN117808807A (en) * 2024-02-29 2024-04-02 中国人民解放军国防科技大学 Optical satellite remote sensing image instance level change detection method
CN117808807B (en) * 2024-02-29 2024-05-14 中国人民解放军国防科技大学 Optical satellite remote sensing image instance level change detection method

Similar Documents

Publication Publication Date Title
CN111986099B (en) Tillage monitoring method and system based on convolutional neural network with residual error correction fused
CN114926511A (en) High-resolution remote sensing image change detection method based on self-supervision learning
CN106778605B (en) Automatic remote sensing image road network extraction method under assistance of navigation data
CN110969088B (en) Remote sensing image change detection method based on significance detection and deep twin neural network
CN109241913B (en) Ship detection method and system combining significance detection and deep learning
CN110287932B (en) Road blocking information extraction method based on deep learning image semantic segmentation
Prathap et al. Deep learning approach for building detection in satellite multispectral imagery
CN109255317B (en) Aerial image difference detection method based on double networks
CN110705457A (en) Remote sensing image building change detection method
CN109934154B (en) Remote sensing image change detection method and detection device
CN109740639B (en) Wind cloud satellite remote sensing image cloud detection method and system and electronic equipment
CN109766936B (en) Image change detection method based on information transfer and attention mechanism
CN109191424B (en) Breast mass detection and classification system and computer-readable storage medium
CN112541508A (en) Fruit segmentation and recognition method and system and fruit picking robot
CN108846835A (en) The image change detection method of convolutional network is separated based on depth
CN114494821B (en) Remote sensing image cloud detection method based on feature multi-scale perception and self-adaptive aggregation
CN111914804A (en) Multi-angle rotation remote sensing image small target detection method
CN112163588A (en) Intelligent evolution-based heterogeneous image target detection method, storage medium and equipment
CN113469950A (en) Method for diagnosing abnormal heating defect of composite insulator based on deep learning
CN113158789B (en) Target detection method, system, device and medium for remote sensing image
CN111079807A (en) Ground object classification method and device
CN114998251A (en) Air multi-vision platform ground anomaly detection method based on federal learning
CN115512247A (en) Regional building damage grade assessment method based on image multi-parameter extraction
CN114494870A (en) Double-time-phase remote sensing image change detection method, model construction method and device
CN116630352A (en) Rock core measurement method and device based on bidirectional cascade pixel differential network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination