CN115601661A - Building change detection method for urban dynamic monitoring - Google Patents
- Publication number
- CN115601661A (application CN202211344397.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/176—Urban or other man-made structures
Abstract
The invention discloses a building change detection method for urban dynamic monitoring. Bi-temporal images of the urban surface captured by a remote sensing satellite are cropped and fed into an automatic urban-building detection model, which outputs the change detection result for the buildings in the image pair. The model consists of an encoding stage and a decoding stage. In the encoding stage, a weight-shared twin network down-samples the input bi-temporal images to extract rich multi-scale feature information, while a twin cross attention mechanism strengthens the representation of that information. In the decoding stage, a multi-scale feature fusion module progressively fuses the extracted multi-scale features, and a differential context discrimination module pushes the detection result closer to the actual change. The method discriminates and fuses multiple features efficiently, improving the accuracy of urban building change detection.
Description
Technical Field
The invention belongs to the field of urban dynamic monitoring, and particularly relates to a building change detection method for urban dynamic monitoring.
Background
At present, most automatic urban-building monitoring systems require large-scale detection equipment and cabling deployed around the city. Powering and maintaining this equipment is very costly, and the systems are strongly affected by signal interference, changes in shooting angle, and illumination, which can cause false alarms and missed detections. Remote sensing technology, by contrast, can acquire information about the earth's surface at fixed time intervals and extract the dynamic changes of the same area across multiple periods. The automatic urban-building detection model is based on remote sensing change detection: its task is to observe how the same target differs between periods and to assign each image pixel a label, 0 (unchanged) or 1 (changed). Researchers have done a great deal of work on the theory and application of remote sensing change detection, with important implications for land resource management, urban construction and planning, and the policing of illegal construction.
Over the past several decades, many algorithms have been proposed for remote sensing image change detection. They fall broadly into two categories: traditional methods and deep learning based methods. Among the traditional methods, when the resolution of remote sensing imagery was still limited, change detection was mostly pixel-based: the spectral features of each pixel were analysed with change vector analysis (CVA) and principal component analysis (PCA). As aerospace and remote sensing technologies developed rapidly and high-resolution imagery became available, scholars introduced the concept of objects into change detection, relying mainly on object-level spectral, texture, and spatial context information. Although these methods achieved good results at the time, traditional approaches require hand-designed features and manually specified thresholds, can only extract shallow features, and cannot fully represent building changes in high-resolution imagery, so they struggle to meet real-world accuracy requirements.
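The pixel-based CVA approach mentioned above can be sketched briefly: for two co-registered multi-band images, the change vector at each pixel is the band-wise spectral difference, and its magnitude is thresholded into a binary change map. A minimal sketch (the threshold value is an illustrative assumption, not from the patent):

```python
import numpy as np

def change_vector_analysis(img_t1, img_t2, threshold):
    """Pixel-based change vector analysis (CVA): threshold the magnitude
    of the per-pixel spectral difference between two co-registered
    multi-band images of shape (H, W, bands)."""
    diff = img_t2.astype(np.float64) - img_t1.astype(np.float64)
    magnitude = np.sqrt((diff ** 2).sum(axis=-1))  # length of the change vector
    return (magnitude > threshold).astype(np.uint8)  # 1 = changed, 0 = unchanged
```

This is exactly the shallow, hand-thresholded pipeline the text criticises: the threshold must be tuned per scene, and only per-pixel spectral evidence is used.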
On the other hand, with growing computing power and the accumulation of massive data, change detection algorithms based on deep learning have become mainstream because of their strong performance. Most current deep learning change detection methods build on networks that perform well on contrastive learning and segmentation tasks. Some scholars use a focused contrastive loss for change detection, reducing intra-class variance and increasing inter-class difference, with the final binary detection result obtained by thresholding. Segmentation-style networks treat change detection as image segmentation; representative examples are the U-shaped network (UNet), the fully convolutional network (FCN), and the DeepLab family.
Although these methods achieve high performance, several problems remain. First, when the bi-temporal images contain many pseudo-changes, current attention mechanisms cannot focus efficiently and selectively on changed versus unchanged areas, which causes serious false detections. Second, the heavy down-sampling and up-sampling in existing networks loses feature information from the bi-temporal images, and coarse fusion strategies compound the problem, so the network cannot restore the original image features well and the final result suffers from missed detections and irregular change edges. Finally, current algorithms do not differentiate context information well, so detection quality on urban building images with many pseudo-changes is poor.
Disclosure of Invention
Aiming at the defects and improvement needs of the prior art, the invention provides a building change detection method for urban dynamic monitoring that can accurately and automatically detect changes in urban buildings. The method comprises the following steps:
s1, taking an image of an urban building acquired by a remote sensing satellite as a data set, acquiring an actual change image corresponding to each building in the data set, and dividing the actual change image and a corresponding double-time-phase image into a training set and a test set;
s2, building an automatic building detection model consisting of an encoder and a decoder, wherein the encoder comprises a weight-shared double-channel twin network and a twin cross attention module, and the decoder comprises a multi-scale feature fusion and differential context discrimination module;
the weight-shared two-channel twin network comprises a batch normalization layer and several down-sampling blocks; it takes the bi-temporal image pair as input and produces feature maps at different scales;
the twin cross attention module firstly carries out embedding operation on feature maps of different scales, and then extracts deeper variation feature semantic information by using a multi-head cross attention mechanism, so that the global attention to the feature information is improved;
the multi-scale feature fusion module adopts a double progressive fusion strategy of reconstruction and up-sampling blocks to fuse the extracted features containing rich multi-scale semantic information;
the input of the differential context discrimination module is the output image of the multi-scale feature fusion module together with the differential image of the two phases; its purpose is to combine the context information in the images to improve the discrimination capability of the network, so that the detection result is closer to the real change image and detection accuracy improves;
and S3, training the building automatic detection model in the S2 by using the training set in the S1, and realizing building change detection by using the trained model.
In some alternative embodiments, step S1 comprises:
the method comprises the steps of adopting an artificially-made urban building change image as a data set, and making an actual change image according to a front-back time sequence image in the data set, wherein the actual change image is a change area in the front-back time sequence image, and each pixel in the front-back time sequence image represents a category (unchanged or changed).
The bi-temporal images and their corresponding actual change images form the automatic urban-building detection dataset, which is divided into a training set and a test set at a ratio of 8:2.
In some optional embodiments, the encoder comprises a weight-shared two-channel twin network and twin cross attention module, and the decoder comprises a multi-scale feature fusion and differential context discrimination module.
In this embodiment, the weight-shared two-channel twin network in the encoder is implemented with a multi-scale densely connected UNet containing skip connections, which can fully extract both low-level and high-level features. The twin cross attention module in the encoder builds on the Transformer multi-head attention mechanism: it first embeds each of the bi-temporal images independently to obtain corresponding multi-stage embedded tokens. A multi-head attention mechanism then splits the feature information into query vectors, keys, and values; a Sigmoid function further activates the attended feature information; and a multi-layer perceptron block effectively reduces the time complexity of the network. Finally, the attention channels attend separately to the changed and unchanged areas of the image, while self-attention is computed within sliding windows over the image information, improving the network's modelling of global information.
The multi-scale fusion module in the decoder uses a multi-scale feature fusion technique to fuse the multi-stage embedded tokens extracted by the encoder with the context-rich channel attention output, then fuses the features through up-sampling, so that the network restores the original image information to the greatest extent and the missed-detection rate is reduced. The differential context discrimination module in the decoder takes as input the output image of the multi-scale fusion module and the differential image of the two phases; by combining the context information in the images it improves the discrimination capability of the network, so that the detection result approaches the real change image and detection accuracy improves.
In some optional embodiments, the weight-shared two-channel twin network in step S2 first batch-normalizes the input bi-temporal images (a two-dimensional convolution with kernel size 3 and stride 1, two-dimensional BatchNorm, and a ReLU activation, with 64 output channels) and then extracts feature information through 3 down-sampling blocks. Defining x_{i,j} as the output node of a down-sampling block, the objective function of the down-sampling block is:

x_{i,0} = N(D(x_{i-1,0})),  x_{i,j} = N([[x_{i,k}]_{k=0}^{j-1}, U(x_{i+1,j-1})]),  j > 0

where N(·) denotes the nested convolution function, D(·) the down-sampling layer, U(·) the up-sampling layer, and [·] the feature concatenation function; x_{i,j} is the output feature map, i the layer index, j the j-th convolutional layer of that layer, and k the k-th connection layer. The twin network channel finally outputs four kinds of multi-scale feature information.
In some optional embodiments, the twin cross attention module in step S2 performs an embedding operation on the four outputs of the two-channel twin network: a 2D convolution first extracts features, which are then unfolded into two-dimensional sequences T_1, T_2, T_3 and T_4 with patch sizes 32, 16, 8 and 4 respectively; T_1 to T_4 are concatenated to obtain T_Σ, which a multi-head cross attention mechanism then processes. The objective function of the first stage is:

Q_u = T_l · W_{Q_u},  K = T_Σ · W_K,  V = T_Σ · W_V

where W_{Q_u}, W_K and W_V are the weight coefficients of the different inputs, T_l is the token of feature information at the l-th scale, and T_Σ is the feature union of the four tokens; this yields the query vector Q_u, the key K and the value V, with l = 1,2,3,4 and u = 1,2,3,4;
the objective function for the second stage is:
wherein σ (·) andrespectively representing the softmax function and the instance normalization function, C ∑ Represents the sum of the number of channels;
the objective function of the third stage of multi-head cross attention is:
wherein, CA h Representing the output of the second stage of multi-head cross attention, h representing the output of the h-th attention head, and N being the number of the attention heads;
the objective function of the final stage of multi-head cross attention is as follows:
O_r = MCA_p + MLP(Q_u + MCA_p)

determining the final output of multi-head cross attention, where MCA_p is the output of the third stage of multi-head cross attention, p denotes the p-th output, MLP(·) is a multi-layer perceptron function, Q_u is the query vector, and u denotes the u-th query vector.
In some optional embodiments, in step S2, the objective function of the multi-scale feature fusion module is:
M_i = W_1 · V(T_l) + W_2 · V(O_r)

where W_1 and W_2 are the weight parameters of two linear layers, T_l is the token of feature information at the l-th scale, O_r is the output of the multi-head cross attention module, and r denotes the output of the r-th attention head.
In some optional embodiments, in step S2 the differential context discrimination module comprises a generator and a discriminator. The generator receives two inputs: the detection image produced by the last layer of the multi-scale feature fusion module and the image generated by a differential operation on the first and second phase images; the loss between the two pushes the result closer to the actual change image. A weighted sum of the SCAD loss and the least-squares LSGAN loss serves as the generator's loss function, reducing the model's false detection rate, while the discriminator uses the least-squares LSGAN loss to improve detection precision. The generator and discriminator losses are summed to give the final probability loss.
In some optional embodiments, in step S2 the objective function of the differential context discrimination module is:

L(P) = L(D) + L(G)
L(D) = L_LSGAN(D)
L(G) = L_LSGAN(G) + α·L_SCAD

where L(P) denotes the probability loss, L(D) the discriminator loss, L(G) the generator loss, L_LSGAN(D) the least-squares LSGAN loss of the discriminator, L_LSGAN(G) the least-squares LSGAN loss of the generator, and L_SCAD the SCAD loss.
In some alternative embodiments, the SCAD loss is defined over the detection classes, where C denotes the detection class, v(C) the pixel error value of that class, J_C the loss term, and ρ a continuously optimized parameter; in the definition of v(c), y_i is the actual change image, s_g(c) the detection score, and g denotes the g-th pixel.
In some alternative embodiments, the least-squares LSGAN loss of the discriminator is defined over both phase images, where D(x_1, y) and D(x_1, G(x_1)) denote the discriminator's outputs for the first phase image, G(x_1) the generator's output for the first phase image, D(x_2, y) and D(x_2, G(x_2)) the discriminator's outputs for the second phase image, and G(x_2) the generator's output for the second phase image; the expectations are taken over the detections of the first and second phase images respectively, x_1 and x_2 are the first and second phase images input to the discriminator, and y is the actual change image.
In some alternative embodiments, the least-squares LSGAN loss of the generator is defined over both phase images, where the expectations are taken over the detections of the first and second phase images, D(x_1, G(x_1)) denotes the discriminator's output for the first phase image, G(x_1) the generator's output for the first phase image, D(x_2, G(x_2)) the discriminator's output for the second phase image, G(x_2) the generator's output for the second phase image, and x_1, x_2 are the first and second phase images input to the discriminator.
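The loss structure described above can be sketched in the standard least-squares GAN form (real labelled 1, fake labelled 0); the SCAD term is passed in as a precomputed value, since its exact formula is not reproduced in this excerpt, and the generator loss is taken as its LSGAN term plus α times the SCAD term:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # least-squares discriminator loss: push real outputs to 1, fakes to 0
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # least-squares generator loss: push the discriminator's fake outputs to 1
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

def probability_loss(d_real, d_fake, scad, alpha):
    """L(P) = L(D) + L(G), with L(G) = LSGAN generator loss + alpha * SCAD."""
    return lsgan_d_loss(d_real, d_fake) + lsgan_g_loss(d_fake) + alpha * scad
```

The least-squares form (rather than the cross-entropy of the original GAN) penalises samples by their distance from the target label, which tends to stabilise training; this sketch is the generic LSGAN formulation, not the patent's exact expression.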
In general, compared with the prior art, the technical solution contemplated by the invention achieves the following beneficial effects: built on a deep convolutional neural network, an automatic building detection model consisting of an encoder and a decoder effectively discriminates and fuses the multi-scale feature information in bi-temporal images, improving building change detection accuracy. In the end, the change in urban buildings can be detected automatically simply by feeding the bi-temporal images into the trained model.
Drawings
Fig. 1 is a schematic flow chart of an automated urban building detection system and method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an automated building inspection model according to an embodiment of the present invention;
FIG. 3 is a diagram of a multi-head cross attention mechanism network according to an embodiment of the present invention;
FIG. 4 is a comparison chart of tests performed with different methods according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
A multi-scale densely connected UNet extracts rich feature information from the bi-temporal images; the twin attention mechanism attends separately to the changed and unchanged areas of the bi-temporal images, enhancing the feature representation and improving global attention; a multi-scale feature fusion module progressively fuses the features of each scale; and the differential context discrimination module computes the weighted sum of the generator and discriminator losses as the probability loss, pushing the detection result toward the real change image. Eight evaluation indexes are used to assess performance: precision, recall, F1-score, intersection over union (IoU), unchanged IoU (IoU_0), changed IoU (IoU_1), overall accuracy (OA) and the Kappa coefficient (Kappa).
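All of these evaluation indexes can be derived from the binary confusion matrix of the predicted change map against the ground truth; a sketch under the usual definitions (division-by-zero handling omitted for brevity):

```python
import numpy as np

def change_metrics(pred, gt):
    """Precision, recall, F1, changed/unchanged IoU, OA and Kappa for a
    binary change map versus its ground truth (0 = unchanged, 1 = changed)."""
    pred, gt = np.asarray(pred).ravel(), np.asarray(gt).ravel()
    tp = np.sum((pred == 1) & (gt == 1))   # changed pixels correctly found
    tn = np.sum((pred == 0) & (gt == 0))   # unchanged pixels correctly kept
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_1 = tp / (tp + fp + fn)            # IoU of the changed class
    iou_0 = tn / (tn + fp + fn)            # IoU of the unchanged class
    oa = (tp + tn) / n                     # overall accuracy
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (oa - pe) / (1 - pe)           # chance-corrected agreement
    return {"precision": precision, "recall": recall, "f1": f1,
            "iou_1": iou_1, "iou_0": iou_0, "oa": oa, "kappa": kappa}
```

Kappa corrects overall accuracy for the agreement expected by chance (pe), which matters in change detection because the unchanged class usually dominates the pixels.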
Fig. 1 is a schematic flow chart of an automated urban building detection system and method according to an embodiment of the present invention, which specifically includes the following steps:
s1: and (3) data set construction: the method comprises the steps of constructing a data set by using images of urban buildings acquired by a remote sensing satellite, acquiring actual change images corresponding to all buildings in the data set, and taking the actual change images and corresponding double-time-phase images as the data set;
the detection precision of the model can be effectively improved by constructing a reasonable building change detection data set. In the experiments of the present example, a LEVIR-CD dataset was used, which contained a wide variety of architectural images from 20 regions. The original size of each image is 1024 × 1024 pixels, and the spatial resolution is 0.5m. Considering the limitation of the memory capacity of the GPU, each image is cut into 16 area images with the size of 256 multiplied by 256 pixels by adopting an image segmentation algorithm, and 4450 front and back time sequence image pairs are finally obtained. The invention adopts professional computer vision labeling software to label the urban building image. For each pair of front and rear time sequence images, a corresponding actual change image group channel is obtained, each pixel point in the actual change image represents a category, wherein the category labels in the actual change image are represented by 0 and 1, 0 represents an unchanged area (which can be displayed as black), and 1 represents a changed area (which can be displayed as white).
The above processing yields the bi-temporal images and their corresponding actual change images; each pair together with its actual change image forms the automatic urban-building detection dataset, which is divided into a training set (3560 images) and a test set (890 images) at a ratio of 8:2.
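The tiling and split arithmetic above can be checked with a small helper (assuming, as with 1024/256, that the tile size divides the image size exactly):

```python
def tile_and_split(img_size, tile_size, total_pairs, train_ratio=0.8):
    """Tiles obtained from one square image, and the train/test split of
    the resulting bi-temporal pairs at the given ratio."""
    tiles_per_image = (img_size // tile_size) ** 2
    n_train = int(total_pairs * train_ratio)
    return tiles_per_image, n_train, total_pairs - n_train
```

With the figures in the text, each 1024 × 1024 image yields 16 tiles, and the 4450 pairs split into 3560 for training and 890 for testing.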
S2: building automatic detection model construction: constructing a twin cross attention discrimination network consisting of an encoder and a decoder as a building automatic detection model;
as shown in fig. 2, the building automation detection model of the embodiment of the present invention includes two main modules: an encoder and a decoder. The encoder comprises a two-channel twin network and twin cross attention module which are shared by weight, and the decoder comprises a multi-scale feature fusion and difference context discrimination module.
The encoder is responsible for extracting multi-scale characteristic information and high-level semantic information in the input image. The decoder carries out progressive fusion on the extracted multi-scale features, calculates probability loss by combining context difference information, and continuously pushes a result graph to be close to the Ground Truth.
As shown in fig. 2 (a), a weight-shared two-channel twin network first batch-normalizes the input bi-temporal images: a two-dimensional convolution with kernel size 3 and stride 1, two-dimensional BatchNorm, and a ReLU activation with 64 output channels. Feature information is then extracted by down-sampling blocks. Defining x_{i,j} as the output node of a down-sampling block, the objective function of the down-sampling block is:

x_{i,0} = N(D(x_{i-1,0})),  x_{i,j} = N([[x_{i,k}]_{k=0}^{j-1}, U(x_{i+1,j-1})]),  j > 0

where N(·) denotes the nested convolution function, D(·) the down-sampling layer, U(·) the up-sampling layer, and [·] the feature concatenation function; x_{i,j} is the output feature map, i the layer index, j the j-th convolutional layer of that layer, and k the k-th connection layer. To describe the network parameters, the output channel numbers of the three down-sampling blocks are defined as 128, 256 and 512 respectively. The twin network channel finally outputs four kinds of multi-scale feature information.
As shown in fig. 2 (b), the twin cross attention module performs an embedding operation on the four outputs of the weight-shared two-channel twin network: a 2D convolution first extracts features, which are then unfolded into two-dimensional sequences T_1, T_2, T_3 and T_4 with patch sizes 32, 16, 8 and 4 respectively. T_1 to T_4 are concatenated to obtain T_Σ.
As shown in fig. 3, the twin cross attention module extracts deeper-level variation feature semantic information by using a multi-head cross attention mechanism, so as to improve the global attention to the feature information. The objective function for the first stage of multi-headed cross attention is:
Q_u = T_l · W_{Q_u},  K = T_Σ · W_K,  V = T_Σ · W_V

where W_{Q_u}, W_K and W_V are the weight coefficients of the different inputs, T_l is the token of feature information at the l-th scale, and T_Σ is the feature union of the four tokens. This yields the query vector Q_u (u = 1,2,3,4), the key K and the value V. The channel numbers of the four query vectors are [64, 128, 256, 512] respectively.
Since the time complexity of the network is large due to the global attention mechanism, the calculation amount of the network is reduced by adopting the transposition attention mechanism. WhereinAnd V T Are respectively query vectors Q u And a transpose of the query value V. The objective function for the second stage of multi-headed cross attention is therefore:
CA_h = σ(ψ(Q_u^T · K / √C_Σ)) · V^T

This determines the output of the second stage of multi-head cross attention, wherein σ(·) and ψ(·) represent the softmax function and the instance normalization function, respectively, Q_u^T and V^T are the transposes of the query vector and the query value, and C_Σ represents the sum of the numbers of channels of the four tokens.
The objective function of the third stage of multi-head cross attention is:
MCA_p = (CA_1 + CA_2 + … + CA_N) / N

wherein CA_h represents the output of the second stage of multi-head cross attention (h = 1, 2, 3, 4), h denotes the h-th attention head, and N is the number of attention heads; experiments show that the network achieves the best detection effect when N = 4.
The objective function for the final stage of multi-headed cross attention is:
O r =MCA p +MLP(Q u +MCA p )
This determines the final output of multi-head cross attention, wherein MCA_p represents the output of the third stage of multi-head cross attention, p denotes the p-th output, MLP(·) is a multi-layer perceptron function, and Q_u represents the query vector, u denoting the u-th query vector (u = 1, 2, 3, 4). Finally, four outputs O_1, O_2, O_3 and O_4 are obtained.
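The four attention stages above can be sketched end-to-end as follows. This is a reconstruction under assumptions: the √C_Σ scaling, the averaging of the N = 4 heads, and the random weight matrices are illustrative, and the LayerNorm inside the MLP is a common choice rather than something the text specifies:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_attention_head(q, t_sum, w_k, w_v):
    """One head CA_h of the transposed cross attention (stage 2):
    CA_h = softmax(instance_norm(Q^T K / sqrt(C_sum))) V^T,
    transposed back to token-major layout."""
    k = t_sum @ w_k                                     # K = T_sum W_K  (B, P, C_sum)
    v = t_sum @ w_v                                     # V = T_sum W_V  (B, P, C_sum)
    sim = q.transpose(1, 2) @ k / k.shape[-1] ** 0.5    # channel similarity (B, C_l, C_sum)
    sim = F.softmax(F.instance_norm(sim), dim=-1)       # sigma(psi(...))
    return (sim @ v.transpose(1, 2)).transpose(1, 2)    # CA_h  (B, P, C_l)

B, P, C_l, C_sum, N = 1, 64, 128, 960, 4
t_l, t_sum = torch.randn(B, P, C_l), torch.randn(B, P, C_sum)
q = t_l @ torch.randn(C_l, C_l) / C_l ** 0.5            # stage 1: Q_u = T_l W_Q
heads = [cross_attention_head(q, t_sum,
                              torch.randn(C_sum, C_sum) / C_sum ** 0.5,
                              torch.randn(C_sum, C_sum) / C_sum ** 0.5)
         for _ in range(N)]                             # N = 4 attention heads
mca = torch.stack(heads).mean(0)                        # stage 3: MCA_p = mean of CA_h
mlp = nn.Sequential(nn.LayerNorm(C_l), nn.Linear(C_l, C_l))
o = mca + mlp(q + mca)                                  # stage 4: O = MCA + MLP(Q + MCA)
```

Note the key efficiency point from the text: the similarity matrix is C_l × C_Σ (channels against channels), not P × P (patches against patches), which is what makes the transposed attention cheaper than global spatial attention.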
As shown in fig. 2 (c), the multi-scale feature fusion module adopts a dual progressive fusion strategy of reconstruction and up-sampling blocks to fuse the extracted features containing rich multi-scale semantic information. The reconstruction strategy first fuses the four embedded tokens T_1, T_2, T_3 and T_4 from the cross attention module with the four outputs O_1, O_2, O_3 and O_4 of the multi-head cross attention mechanism.
The objective function in the reconstruction strategy is:
M i =W 1 ·V(T l )+W 2 ·V(O r )
wherein W_1 and W_2 are the weight parameters of two linear layers, T_l represents a token of feature information, l represents the feature information of the l-th scale, O_r represents the output of the multi-head cross attention module, and r represents the output of the r-th attention head (r = 1, 2, 3, 4). Four outputs M_1, M_2, M_3 and M_4 are obtained.
To better fuse the multi-scale feature information, the four outputs are processed by up-sampling blocks, whose numbers of output channels are 256, 128, 64 and 64, respectively. Each up-sampling block contains a two-dimensional convolution with a kernel size of 2, an average pooling layer and a ReLU activation function. Finally, a convolution with a kernel size of 1 and a stride of 1 is applied to the output of the fourth up-sampling block to obtain the detection image.
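The reconstruction and up-sampling steps can be sketched as below. The up-sampling block is an assumption: the text lists a kernel-2 convolution, average pooling and ReLU, which is rendered here as a kernel-2/stride-2 transposed convolution plus ReLU, since pooling as literally written would undo the up-sampling:

```python
import torch
import torch.nn as nn

class Reconstruct(nn.Module):
    """Reconstruction step M = W1 . V(T_l) + W2 . V(O_r): two linear layers
    weight the embedded token and the attention output before summing."""
    def __init__(self, c):
        super().__init__()
        self.w1 = nn.Linear(c, c, bias=False)
        self.w2 = nn.Linear(c, c, bias=False)

    def forward(self, t, o):
        return self.w1(t) + self.w2(o)

class UpBlock(nn.Module):
    """Assumed up-sampling block: kernel-2/stride-2 transposed convolution
    (doubling the spatial resolution) followed by ReLU."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.up(x))

# Output channels 256/128/64/64 as in the text, then a kernel-1, stride-1
# convolution produces the single-channel detection image.
ups = nn.Sequential(UpBlock(512, 256), UpBlock(256, 128),
                    UpBlock(128, 64), UpBlock(64, 64))
head = nn.Conv2d(64, 1, kernel_size=1, stride=1)
detection = head(ups(torch.randn(1, 512, 8, 8)))        # (1, 1, 128, 128)
```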
As shown in fig. 2 (d), the differential context discrimination module includes a generator and a discriminator. The generator receives two inputs: the detection image obtained at the last layer of the multi-scale feature fusion module, and the generated image obtained by a difference operation between the first and second time-phase images. The loss between the two is computed to push the result closer to the actual change image. The generator adopts the weighted sum of the SCAD loss and the least-squares LSGAN loss as its loss function to reduce the false detection rate of the model, while the discriminator adopts the least-squares LSGAN loss to improve the detection precision. The loss functions of the generator and the discriminator are accumulated to obtain the final probability loss. The objective function of the differential context discrimination module is:
L(P)=L(D)+L(G)
L(D)=L LSGAN (D)
L(G) = L_LSGAN(G) + αL_SCAD
SCAD loss is defined as:
This determines the SCAD loss, wherein c represents the detection class, v(c) represents the pixel error value of the detection class, J_C is the loss term, and ρ is a continuously optimized parameter. v(c) is defined as follows:
wherein y_i is the actual change image, s_g(c) is the detection score, and g denotes the g-th pixel.
The least square LSGAN loss of the discriminator is as follows:
L_LSGAN(D) = E_{x_1,y}[(D(x_1, y) − 1)^2] + E_{x_1}[D(x_1, G(x_1))^2] + E_{x_2,y}[(D(x_2, y) − 1)^2] + E_{x_2}[D(x_2, G(x_2))^2]

This determines the least-squares LSGAN loss of the discriminator, wherein D(x_1, y) and D(x_1, G(x_1)) denote the outputs of the discriminator on the first time-phase image, G(x_1) represents the output of the generator for the first time-phase image, D(x_2, y) and D(x_2, G(x_2)) denote the outputs of the discriminator on the second time-phase image, G(x_2) represents the output of the generator for the second time-phase image, E_{x_1,y}[·] and E_{x_1}[·] denote the detection expectations for the first time-phase image, E_{x_2,y}[·] and E_{x_2}[·] denote the detection expectations for the second time-phase image, x_1 and x_2 respectively represent the first and second time-phase images input to the discriminator, and y represents the actual change image.
The generator least squares LSGAN loss in the present invention is:
L_LSGAN(G) = E_{x_1}[(D(x_1, G(x_1)) − 1)^2] + E_{x_2}[(D(x_2, G(x_2)) − 1)^2]

This determines the least-squares LSGAN loss of the generator, wherein E_{x_1}[·] denotes the detection expectation for the first time-phase image, E_{x_2}[·] denotes the detection expectation for the second time-phase image, D(x_1, G(x_1)) denotes the output of the discriminator on the first time-phase image, G(x_1) represents the output of the generator for the first time-phase image, D(x_2, G(x_2)) denotes the output of the discriminator on the second time-phase image, G(x_2) represents the output of the generator for the second time-phase image, and x_1, x_2 respectively represent the first and second time-phase images input to the discriminator.
Thus, the objective function of the differential context discrimination module is:
L(P)=L(D)+L(G)
L(D)=L LSGAN (D)
L(G) = L_LSGAN(G) + αL_SCAD
wherein L(P) represents the probability loss, L(D) the discriminator loss, L(G) the generator loss, L_LSGAN(D) the least-squares LSGAN loss of the discriminator, L_LSGAN(G) the least-squares LSGAN loss of the generator, and L_SCAD the SCAD loss. α is a weighting parameter that controls the relative importance of the two losses. Guided by this objective function, the generator and the discriminator generate the probability loss in loop iterations until it falls below a set threshold, after which the detection result is output.
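The loss combination can be sketched as follows, using the standard least-squares GAN form (real outputs pushed toward 1, generated outputs toward 0); the exact expectation weighting and the value of α are assumptions, and the SCAD term is passed in as a precomputed tensor since its exact formula is given separately:

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    """Least-squares GAN loss for the discriminator on one time phase."""
    return 0.5 * ((d_real - 1) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    """Generator side: discriminator outputs on generated images pushed toward 1."""
    return 0.5 * ((d_fake - 1) ** 2).mean()

def probability_loss(d_real1, d_fake1, d_real2, d_fake2, l_scad, alpha):
    """L(P) = L(D) + L(G), with L(G) = L_LSGAN(G) + alpha * L_SCAD,
    accumulated over the two time phases."""
    l_d = lsgan_d_loss(d_real1, d_fake1) + lsgan_d_loss(d_real2, d_fake2)
    l_g = lsgan_g_loss(d_fake1) + lsgan_g_loss(d_fake2) + alpha * l_scad
    return l_d + l_g
```

In practice the two sub-losses are optimized alternately (the discriminator on L(D), the generator on L(G)); summing them into L(P) here mirrors the patent's accumulated "probability loss" used as the stopping criterion.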
S3: training the building automatic detection model in the S2 by using the training set in the S1, realizing building change detection by using the trained model, and finally evaluating a detection result by using the evaluation index of the building automatic detection model;
The network structure provided by the invention is trained on the LEVIR-CD data set constructed in step S1 to obtain model weights for model evaluation. The training process is based on the PyTorch deep learning framework; the software environment is Ubuntu 20.04, and the hardware environment is an NVIDIA RTX 3090 graphics card with 24 GB of video memory. The batch size is set to 8 for a total of 100 epochs. Each input comprises three images: the first time-phase image, the second time-phase image and the actual change image. The model is tested once after each training epoch, and the change information between the dual time-phase images and the real change image is continuously learned during network training. The loop iterates until the epoch count reaches 100, at which point training is finished.
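A minimal loop matching the described schedule (batch size 8 in the loader, 100 epochs, a test pass after every training epoch). The optimiser choice (Adam), the learning rate and the BCE surrogate loss are assumptions for illustration, not the patent's full adversarial objective; `model` and the loaders are hypothetical names:

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, test_loader=None, epochs=100, lr=1e-4, device="cpu"):
    """Train a two-input change-detection model; returns the last training loss."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    last = None
    for _ in range(epochs):
        model.train()
        for x1, x2, y in train_loader:   # phase-1 image, phase-2 image, actual change image
            pred = model(x1.to(device), x2.to(device))
            loss = F.binary_cross_entropy_with_logits(pred, y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
            last = loss.item()
        if test_loader is not None:      # test once after each training epoch
            model.eval()
            with torch.no_grad():
                for x1, x2, y in test_loader:
                    model(x1.to(device), x2.to(device))
    return last
```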
Precision (Precision), Recall (Recall), the comprehensive evaluation index (F1-score), the intersection over union (IoU), the unchanged-class intersection over union (IoU_0), the changed-class intersection over union (IoU_1), the overall accuracy (OA) and the Kappa coefficient (Kappa) are selected as evaluation indices; the calculation formulas of the evaluation indices are as follows:
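The formulas themselves did not survive extraction. Assuming the standard confusion-matrix definitions of these indices (an assumption, since the patent's own formulas are lost), they can be computed as:

```python
def change_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix formulas for the listed indices; TP/FP/FN/TN
    are pixel counts with the changed class taken as positive."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_1 = tp / (tp + fp + fn)                  # changed-class IoU
    iou_0 = tn / (tn + fp + fn)                  # unchanged-class IoU
    iou = (iou_0 + iou_1) / 2                    # mean IoU (assumed reading of "IOU")
    oa = (tp + tn) / n                           # overall accuracy
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou,
            "iou_1": iou_1, "iou_0": iou_0, "oa": oa, "kappa": kappa}
```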
In order to verify the performance of the building automatic detection model provided by the invention, the final experimental results are presented: fig. 4 is a visual comparison of the various methods, and table 1 gives the quantitative indices of the various methods.
Fig. 4 shows the building detection result images obtained by the various methods. Image (a) is the preceding time-phase image, (b) the subsequent time-phase image, (c) the real change image (GT), and (d)-(g) the detection result images of the different methods. Relative to the actual change image, black represents an unchanged area, white a changed area, red a false detection area, and green a missed detection area.
Table 1: building detection precision in LEVIR-CD data set by various methods
Note that all indices are in percent, and larger values indicate better performance. For ease of observation, the best results are shown in bold.
It should be noted that, according to the implementation requirement, each step/component described in the present application can be divided into more steps/components, and two or more steps/components or partial operations of the steps/components can be combined into new steps/components to achieve the purpose of the present invention.
It will be appreciated by those skilled in the art that the foregoing is only a preferred embodiment of the invention and is not intended to limit the invention, which is to cover any modifications, equivalents, improvements and the like, which fall within the spirit and scope of the invention.
Claims (10)
1. A building change detection method for urban dynamic monitoring is characterized by comprising the following steps:
s1, taking an image of an urban building acquired by a remote sensing satellite as a data set, acquiring an actual change image corresponding to each building in the data set, and dividing the actual change image and a corresponding double-time-phase image into a training set and a test set;
s2, building an automatic building detection model consisting of an encoder and a decoder, wherein the encoder comprises a two-channel twin network and a twin cross attention module which are shared by weight, and the decoder comprises a multi-scale feature fusion and differential context discrimination module;
the weight-shared two-channel twin network comprises a batch normalization layer and a plurality of down-sampling blocks, and the dual time-phase images are input to obtain feature maps of different scales;
the twin cross attention module firstly carries out embedding operation on feature maps of different scales, and then extracts deeper variation feature semantic information by using a multi-head cross attention mechanism, so that the global attention to the feature information is improved;
the multi-scale feature fusion module adopts a dual progressive fusion strategy of reconstruction and up-sampling blocks to fuse the extracted features containing rich multi-scale semantic information;
the inputs of the differential context discrimination module are the output image of the multi-scale fusion module and the front-and-rear time-sequence difference image; the purpose is to improve the discrimination capability of the network by combining the context information in the images, so that the detection result image is closer to the real change image and the detection accuracy is improved;
and S3, training the building automatic detection model in the S2 by using the training set in the S1, and realizing building change detection by using the trained model.
2. The method according to claim 1, wherein step S1 comprises:
the method comprises the steps of taking artificially annotated urban building change images as a data set, and producing an actual change image from the dual time-phase images in the data set, wherein the actual change image marks the changed areas in the dual time-phase images, and each pixel in the actual change image represents one of two classes, unchanged or changed;
and the front and rear time-sequence images and the corresponding actual change images form the urban building automatic detection image data set, which is divided into training and test sets at a ratio of 8:2.
3. The method of claim 1, wherein: the weight-shared dual-channel twin network in step S2 performs a batch normalization operation on the input dual time-phase images, the operation comprising a two-dimensional convolution with a kernel size of 3, a stride of 1 and 64 output channels, a two-dimensional BatchNorm and a ReLU activation function; feature information is then extracted through 3 down-sampling blocks; defining x_{i,j} as the output node of a down-sampling block, the objective function of the down-sampling block is:
wherein N(·) represents a nested convolution function, D(·) represents a down-sampling layer, U(·) represents an up-sampling layer, [·] represents a feature concatenation function, x_{i,j} represents the output feature diagram, i represents the layer number, j represents the j-th convolutional layer of that layer, and k represents the k-th connection layer; finally, each twin network channel outputs four kinds of multi-scale feature information.
4. The method of claim 1, wherein: the twin cross attention module in step S2 performs an embedding operation on the four outputs of the dual-channel twin network: it first performs a 2D convolution to extract features, and then flattens them into two-dimensional token sequences T_1, T_2, T_3 and T_4, with patch sizes of 32, 16, 8 and 4, respectively; T_1 to T_4 are concatenated to obtain T_Σ, which is then processed by a multi-head cross attention mechanism; the objective function of the first stage is:
Q_u = T_l W_Q,  K = T_Σ W_K,  V = T_Σ W_V

wherein W_Q, W_K and W_V are the weight coefficients of the different inputs, T_l represents a token of feature information, l represents the feature information of the l-th scale, and T_Σ represents the feature concatenation of the four tokens, yielding the query vectors Q_u, the query key K and the query value V, with l = 1, 2, 3, 4 and u = 1, 2, 3, 4;
the objective function for the second stage is:
CA_h = σ(ψ(Q_u^T · K / √C_Σ)) · V^T

wherein σ(·) and ψ(·) represent the softmax function and the instance normalization function, respectively, and C_Σ represents the sum of the numbers of channels;
the objective function of the third stage of multi-head cross attention is:
MCA_p = (CA_1 + CA_2 + … + CA_N) / N

wherein CA_h represents the output of the second stage of multi-head cross attention, h represents the output of the h-th attention head, and N is the number of attention heads;
the objective function of the final stage of multi-head cross attention is as follows:
O r =MCA p +MLP(Q u +MCA p )
this determines the final output of multi-head cross attention, wherein MCA_p represents the output of the third stage of multi-head cross attention, p represents the p-th output, MLP(·) is a multi-layer perceptron function, and Q_u represents the query vector, u denoting the u-th query vector.
5. The method of claim 4, wherein: in step S2, the objective function of the multi-scale feature fusion module is:
M i =W 1 ·V(T l )+W 2 ·V(O r )
wherein W_1 and W_2 are the weight parameters of two linear layers, T_l represents a token of feature information, l represents the feature information of the l-th scale, O_r represents the output of the multi-head cross attention module, and r represents the output of the r-th attention head.
6. The method of claim 1, wherein: in step S2, the differential context discrimination module comprises a generator and a discriminator; the generator receives two inputs, namely the detection image obtained at the last layer of the multi-scale feature fusion module and the generated image obtained by a difference operation between the first and second time phases, and the loss between the two is computed to push the result closer to the actual change image; the generator adopts the weighted sum of the SCAD loss and the least-squares LSGAN loss as its loss function to reduce the false detection rate of the model; the discriminator adopts the least-squares LSGAN loss to improve the detection precision; and the loss functions of the generator and the discriminator are accumulated to obtain the final probability loss.
7. The method of claim 6, wherein: in step S2, the objective function of the differential context discrimination module is:
L(P)=L(D)+L(G)
L(D)=L LSGAN (D)
L(G) = L_LSGAN(G) + αL_SCAD
wherein L(P) represents the probability loss, L(D) the discriminator loss, L(G) the generator loss, L_LSGAN(D) the least-squares LSGAN loss of the discriminator, L_LSGAN(G) the least-squares LSGAN loss of the generator, and L_SCAD the SCAD loss.
8. The method of claim 7, wherein: SCAD loss is defined as:
wherein c represents the detection class, v(c) represents the pixel error value of the detection class, J_C is the loss term, ρ is a continuously optimized parameter, and v(c) is defined as follows:
wherein y_i is the actual change image, s_g(c) is the detection score, and g represents the g-th pixel.
9. The method of claim 7, wherein: the least-squares LSGAN loss of the discriminator is:
L_LSGAN(D) = E_{x_1,y}[(D(x_1, y) − 1)^2] + E_{x_1}[D(x_1, G(x_1))^2] + E_{x_2,y}[(D(x_2, y) − 1)^2] + E_{x_2}[D(x_2, G(x_2))^2]

wherein D(x_1, y) and D(x_1, G(x_1)) denote the outputs of the discriminator on the first time-phase image, G(x_1) represents the output of the generator for the first time-phase image, D(x_2, y) and D(x_2, G(x_2)) denote the outputs of the discriminator on the second time-phase image, G(x_2) represents the output of the generator for the second time-phase image, E_{x_1,y}[·] and E_{x_1}[·] denote the detection expectations for the first time-phase image, E_{x_2,y}[·] and E_{x_2}[·] denote the detection expectations for the second time-phase image, x_1 and x_2 respectively represent the first and second time-phase images input to the discriminator, and y represents the actual change image.
10. The method of claim 7, wherein: the least-squares LSGAN loss of the generator is:
L_LSGAN(G) = E_{x_1}[(D(x_1, G(x_1)) − 1)^2] + E_{x_2}[(D(x_2, G(x_2)) − 1)^2]

wherein E_{x_1}[·] denotes the detection expectation for the first time-phase image, E_{x_2}[·] denotes the detection expectation for the second time-phase image, D(x_1, G(x_1)) denotes the output of the discriminator on the first time-phase image, G(x_1) represents the output of the generator for the first time-phase image, D(x_2, G(x_2)) denotes the output of the discriminator on the second time-phase image, G(x_2) represents the output of the generator for the second time-phase image, and x_1, x_2 respectively represent the first and second time-phase images input to the discriminator.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211344397.7A CN115601661A (en) | 2022-10-31 | 2022-10-31 | Building change detection method for urban dynamic monitoring |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211344397.7A CN115601661A (en) | 2022-10-31 | 2022-10-31 | Building change detection method for urban dynamic monitoring |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115601661A true CN115601661A (en) | 2023-01-13 |
Family
ID=84850193
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211344397.7A Pending CN115601661A (en) | 2022-10-31 | 2022-10-31 | Building change detection method for urban dynamic monitoring |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115601661A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116051519A (en) * | 2023-02-02 | 2023-05-02 | 广东国地规划科技股份有限公司 | Method, device, equipment and storage medium for detecting double-time-phase image building change |
CN116091492A (en) * | 2023-04-06 | 2023-05-09 | 中国科学技术大学 | Image change pixel level detection method and system |
CN116343052A (en) * | 2023-05-30 | 2023-06-27 | 华东交通大学 | Attention and multiscale-based dual-temporal remote sensing image change detection network |
CN116862252A (en) * | 2023-06-13 | 2023-10-10 | 河海大学 | Urban building loss emergency assessment method based on composite convolution operator |
CN117576574A (en) * | 2024-01-19 | 2024-02-20 | 湖北工业大学 | Electric power facility ground feature change detection method and device, electronic equipment and medium |
CN118212532A (en) * | 2024-04-28 | 2024-06-18 | 西安电子科技大学 | Method for extracting building change region in double-phase remote sensing image based on twin mixed attention mechanism and multi-scale feature fusion |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116051519A (en) * | 2023-02-02 | 2023-05-02 | 广东国地规划科技股份有限公司 | Method, device, equipment and storage medium for detecting double-time-phase image building change |
CN116051519B (en) * | 2023-02-02 | 2023-08-22 | 广东国地规划科技股份有限公司 | Method, device, equipment and storage medium for detecting double-time-phase image building change |
CN116091492A (en) * | 2023-04-06 | 2023-05-09 | 中国科学技术大学 | Image change pixel level detection method and system |
CN116091492B (en) * | 2023-04-06 | 2023-07-14 | 中国科学技术大学 | Image change pixel level detection method and system |
CN116343052A (en) * | 2023-05-30 | 2023-06-27 | 华东交通大学 | Attention and multiscale-based dual-temporal remote sensing image change detection network |
CN116343052B (en) * | 2023-05-30 | 2023-08-01 | 华东交通大学 | Attention and multiscale-based dual-temporal remote sensing image change detection network |
CN116862252A (en) * | 2023-06-13 | 2023-10-10 | 河海大学 | Urban building loss emergency assessment method based on composite convolution operator |
CN116862252B (en) * | 2023-06-13 | 2024-04-26 | 河海大学 | Urban building loss emergency assessment method based on composite convolution operator |
CN117576574A (en) * | 2024-01-19 | 2024-02-20 | 湖北工业大学 | Electric power facility ground feature change detection method and device, electronic equipment and medium |
CN117576574B (en) * | 2024-01-19 | 2024-04-05 | 湖北工业大学 | Electric power facility ground feature change detection method and device, electronic equipment and medium |
CN118212532A (en) * | 2024-04-28 | 2024-06-18 | 西安电子科技大学 | Method for extracting building change region in double-phase remote sensing image based on twin mixed attention mechanism and multi-scale feature fusion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111126202B (en) | Optical remote sensing image target detection method based on void feature pyramid network | |
CN115601661A (en) | Building change detection method for urban dynamic monitoring | |
CN111723732B (en) | Optical remote sensing image change detection method, storage medium and computing equipment | |
CN112668494A (en) | Small sample change detection method based on multi-scale feature extraction | |
CN109741328A (en) | A kind of automobile apparent mass detection method based on production confrontation network | |
Liu et al. | An attention-based multiscale transformer network for remote sensing image change detection | |
CN108399248A (en) | A kind of time series data prediction technique, device and equipment | |
Chen et al. | Changemamba: Remote sensing change detection with spatio-temporal state space model | |
CN113569788B (en) | Building semantic segmentation network model training method, system and application method | |
CN103714148B (en) | SAR image search method based on sparse coding classification | |
CN116524361A (en) | Remote sensing image change detection network and detection method based on double twin branches | |
Li et al. | A review of deep learning methods for pixel-level crack detection | |
Eftekhari et al. | Building change detection using the parallel spatial-channel attention block and edge-guided deep network | |
CN116823664B (en) | Remote sensing image cloud removal method and system | |
CN115937774A (en) | Security inspection contraband detection method based on feature fusion and semantic interaction | |
CN114565594A (en) | Image anomaly detection method based on soft mask contrast loss | |
CN116434069A (en) | Remote sensing image change detection method based on local-global transducer network | |
CN116703885A (en) | Swin transducer-based surface defect detection method and system | |
CN115035334B (en) | Multi-classification change detection method and system for multi-scale fusion double-time-phase remote sensing image | |
CN115937697A (en) | Remote sensing image change detection method | |
CN117911879B (en) | SAM-fused fine-granularity high-resolution remote sensing image change detection method | |
CN114972882A (en) | Wear surface damage depth estimation method and system based on multi-attention machine system | |
Seydi et al. | BDD-Net+: A building damage detection framework based on modified coat-net | |
Fan et al. | Application of YOLOv5 neural network based on improved attention mechanism in recognition of Thangka image defects | |
Zhang et al. | CDMamba: Remote Sensing Image Change Detection with Mamba |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||