CN116434069A - Remote sensing image change detection method based on local-global Transformer network - Google Patents

Remote sensing image change detection method based on local-global Transformer network

Info

Publication number
CN116434069A
CN116434069A (application CN202310470097.1A)
Authority
CN
China
Prior art keywords
stage
attention
features
image
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310470097.1A
Other languages
Chinese (zh)
Inventor
Xia Min (夏旻)
Song Lei (宋磊)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202310470097.1A priority Critical patent/CN116434069A/en
Publication of CN116434069A publication Critical patent/CN116434069A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/10 Scenes; scene-specific elements; terrestrial scenes
    • G06N 3/045 Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N 3/08 Learning methods
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • Y04S 10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a remote sensing image change detection method based on a local-global Transformer network, comprising the following steps: preprocessing multiple groups of image sample data to obtain multiple groups of normalized image data; inputting each group of normalized image data into a local-global Transformer network; performing a differencing operation on the outputs of the pre-temporal and post-temporal remote sensing images at each stage and feeding the result into a high-frequency enhancement unit to obtain high-frequency features of the change-region edges at each stage; inputting the high-frequency features of the second, third and fourth stages into a multi-scale fusion attention unit to obtain fine-grained fusion features for each of these stages; and inputting the high-frequency edge features of the first stage together with the fine-grained fusion features obtained in the following three stages into a deep feature guiding unit. To address severe false and missed detections at feature boundaries, the invention provides a plug-and-play high-frequency enhancement unit that replaces the inflexible U-shaped structure and refines the detection boundaries.

Description

Remote sensing image change detection method based on local-global Transformer network
Technical Field
The invention belongs to the technical field of change detection networks, and particularly relates to a remote sensing image change detection method based on a local-global Transformer network.
Background
The evolution of the natural environment and human production activities cause the Earth's surface to change continuously, so regularly monitoring and promptly detecting land-cover change is of great significance for the harmonious coexistence of humans and nature. Remote sensing image change detection aims to determine and analyze changes of ground objects, including changes in their extent and state, using multi-temporal remote sensing images and geospatial data acquired over the same surface area at different times; it is an important means of Earth observation. The technology has been widely applied in urban planning, land-use monitoring, agricultural and forestry monitoring, natural disaster monitoring, and many other fields.
Most existing deep-learning-based remote sensing image change detection methods rely on convolutional networks. Owing to the inherent locality of convolution kernels, convolutional networks lack the ability to model long-range dependencies, which may limit their further development in the change detection field.
In recent years, some works have begun to use non-local relational modeling to efficiently extract global relationships between pixels. Compared with convolution-based change detection methods, such approaches can fully exploit the connection between arbitrary pixels, so the limited receptive field is no longer a key factor restricting model performance. However, most existing non-local modeling methods struggle to capture the information flow between multi-scale features: they typically append a non-local feature extraction module at the end of the network and ignore a more critical problem in change detection, namely the multi-scale nature of geospatial objects.
In addition, how well the boundaries of change regions are detected directly affects the final detection accuracy. In real aerial remote sensing images, the exact boundaries of some change regions are difficult to delineate because of spectral variation or shadow occlusion. Existing change detection algorithms typically adopt a U-shaped structure to capture boundary detail information layer by layer, but this inevitably sacrifices flexibility in model design.
Disclosure of Invention
The invention aims to: overcome the shortcomings of existing methods, provide a remote sensing image change detection method based on a local-global Transformer network, and solve the problem of low remote sensing image change detection accuracy.
The technical scheme is as follows: the remote sensing image change detection method based on a local-global Transformer network disclosed by the invention comprises the following steps:
S1, preprocessing multiple groups of image sample data to obtain multiple groups of normalized image data, wherein each group of image sample data comprises a bi-temporal remote sensing image pair consisting of a pre-temporal remote sensing image and a post-temporal remote sensing image;
S2, inputting each group of normalized image data into a local-global Transformer network in sequence, wherein the local-global Transformer network comprises an image block embedding and a backbone network; the backbone network comprises four stages, namely a first stage, a second stage, a third stage and a fourth stage; the pre-temporal and post-temporal remote sensing images after image block embedding are input into the first stage, and the output of each stage serves as the input of the next; each stage comprises two Siamese Transformer models, the first of which restricts self-attention computation to a local window to model the local pixel relationships of the input image, while the second performs attention computation over all pixels of the image to model its global pixel relationships; between every two stages there is an image block merging operation, which halves the feature map size and doubles the number of channels;
S3, performing a differencing operation on the outputs of the pre-temporal and post-temporal remote sensing images at each stage, and feeding the result into a high-frequency enhancement unit to obtain high-frequency features of the change-region edges at each stage;
S4, inputting the high-frequency features of the second, third and fourth stages into a multi-scale fusion attention unit to obtain fine-grained fusion features for each of these stages;
S5, inputting the high-frequency features of the change-region edges from the first stage together with the fine-grained fusion features obtained in the second, third and fourth stages into a deep feature guiding unit to obtain a finer detection map;
S6, fusing the output of the deep feature guiding unit with the fine-grained fusion features obtained in the second, third and fourth stages to obtain the model output features;
S7, training the model with training data and then testing it to obtain the final prediction result.
Further, the method comprises the steps of:
the first stage comprises two Siamese Transformer models, and both restrict self-attention computation to a local window to model the local pixel relationships of the input image.
Further, the method comprises the steps of:
the image block embedding is implemented as a two-dimensional convolution with a 4×4 kernel and a stride of 4: the bi-temporal images I_1, I_2 ∈ R^(D×H×W), where D, H and W denote the number of channels, the height and the width, respectively, are converted into image token sequences by the image block embedding operation.
Further, the method comprises the steps of:
the Siamese Transformer model comprises two standard Transformer blocks with shared weights, each of which is divided by function into a token mixer and a channel mixer; the token mixer captures the spatial feature representation of the bi-temporal images: a layer normalization is first applied to the feature pair of the bi-temporal images, three input features Q, K and V are then obtained by linear transformation and fed into the self-attention computation, and finally the features of the bi-temporal images are added to the self-attention output through a skip connection to yield the token mixer output features;
the channel mixer fuses features along the channel dimension: the token mixer output features are first normalized and passed through the first multi-layer perceptron, a depthwise convolution is applied, the result of the depthwise convolution is added to the output of the first multi-layer perceptron, an activation function is applied, and the result is fed into the second multi-layer perceptron; the output of the second multi-layer perceptron is added to the token mixer output features to obtain the channel mixer output features Y; both the first and the second multi-layer perceptron are linear transformations.
Further, the method comprises the steps of:
the token mixer uses a self-attention mechanism to fully capture the global spatial feature relationships of the image; its mathematical expression is shown in formulas (1)-(2):
SA_i(Q_i, K_i, V_i) = Softmax(Q_i K_i^T / sqrt(c)) V_i    (1)
MHSA_k(Q, K, V) = Φ(Concat(SA_1, SA_2, ..., SA_k))    (2)
wherein SA_i(·) and MHSA_k(·) denote the output features of the i-th self-attention head and of the multi-head self-attention with k attention heads, respectively, Q (Q_i), K (K_i) and V (V_i) denote the three input features, namely query, key and value, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation; the channel mixer uses a multi-layer perceptron to effectively fuse features along the channel dimension, and its mathematical expression is shown in formulas (3)-(5):
X'_m = Φ_1(norm(X_m))    (3)
pos = T_2(DW(T_1(X'_m)))    (4)
Y = Φ_2(σ(X'_m + pos)) + X_m    (5)
wherein X'_m, pos and Y denote the transition feature (i.e. the output feature of the first multi-layer perceptron), the conditional position encoding feature and the output feature of the channel mixer, respectively, X_m denotes the input feature of the channel mixer, m ∈ {1, 2}, norm denotes layer normalization, Φ denotes a linear transformation, T denotes a reshaping operation with T_1 ∈ (1D→2D) and T_2 ∈ (2D→1D), DW denotes a depthwise convolution, and σ denotes GELU activation.
Further, the method comprises the steps of:
the high-frequency enhancement unit operates as follows:
the input feature X is passed through a 3×3 convolution for differential feature optimization, yielding the shallow differential feature E_s; to obtain high-frequency features, low-frequency features are first captured with average pooling and multi-head self-attention and then up-sampled by bilinear interpolation to obtain an intermediate feature; the intermediate feature is subtracted from the shallow differential feature E_s to obtain the high-frequency feature E_H; E_H and E_s are concatenated along the channel dimension and fused with a 3×3 convolution to obtain the output feature Y of the high-frequency enhancement unit;
in formulas:
E_s = W_1 X    (6)
E_H = E_s - Up(T_2(MHSA(T_1(Avg(E_s)))))    (7)
Y = W_2 Concat(E_s, E_H)    (8)
wherein X denotes the input feature, Y denotes the output feature, E_s and E_H denote the shallow differential feature and the high-frequency feature, respectively, W_1 and W_2 denote two 3×3 convolutions, T_1 and T_2 denote two reshaping operations, MHSA denotes the multi-head self-attention computation, Avg denotes average pooling, and Up denotes the up-sampling operation.
Further, the method comprises the steps of:
in the multi-scale fusion attention unit, the three input features are first each passed through a 1×1 convolution, and the three two-dimensional image features are then converted into one-dimensional token sequences by a reshaping operation; to let the features of the multi-scale tokens interact fully, the one-dimensional token sequences are concatenated along the spatial dimension and the spatial relationships of the different-scale features are aggregated by N successive multi-head self-attention layers; the multi-head self-attention output features are split along the spatial dimension, and each one-dimensional token sequence is restored to a two-dimensional image feature by a reshaping operation; the two-dimensional image features are passed through a 1×1 convolution, batch normalization and Sigmoid activation to obtain the corresponding attention weights; the three input features of different scales are fused at three different spatial sizes by average pooling and up-sampling to obtain three coarse-grained fusion features, which are refined by a 3×3 convolution, batch normalization and a ReLU activation function to obtain the corresponding calibration features; finally, the three calibration features are weighted by the corresponding attention weights to obtain the final fine-grained fusion features.
Further, the method comprises the steps of:
the multi-scale fusion attention unit is expressed as shown in formula (9):
Y_i = ReLU(BN(W_3×3 C_i)) * Sigmoid(BN(W_1×1 A_i))    (9)
wherein C_i denotes the calibration feature, A_i denotes the output feature of the multi-head self-attention, Y_i denotes the output feature of the multi-scale fusion attention unit, W_k×k denotes a k×k convolution, * denotes the matrix Hadamard product, i denotes the stage number with i ∈ {2, 3, 4}, and BN denotes batch normalization.
Further, the method comprises the steps of:
the deep feature guiding unit first performs scale calibration on the fine-grained fusion features obtained in the second, third and fourth stages to obtain the calibrated deep fusion features; cross attention is computed between the calibrated deep fusion features and the high-frequency features of the change-region edges from the first stage, followed by a 1×1 convolution, batch normalization and Sigmoid activation to obtain the corresponding attention weights; the calibrated deep fusion features and the first-stage high-frequency edge features are added and refined by a 3×3 convolution, batch normalization and a ReLU activation function to obtain the corresponding optimized features; finally, the attention weights and the optimized features are combined by attention weighting to obtain a finer detection map.
Further, the method comprises the steps of:
the deep feature guiding unit is expressed as shown in formulas (10)-(12):
ReLU(BN(W_3×3(X_1 + M))) * Sigmoid(BN(W_1×1 MHCA(X_1, M)))    (10)
CA_j(Q_j, K_j, V_j) = Softmax(Q_j K_j^T / sqrt(c)) V_j    (11)
MHCA_k(Q, K, V) = Φ(Concat(CA_1, CA_2, ..., CA_k))    (12)
wherein X_1 denotes the high-frequency features of the first stage, M denotes the fused calibration features of the last three stages, MHCA denotes multi-head cross attention, CA_j(·) and MHCA_k(·) denote the output features of the j-th cross attention head and of the multi-head cross attention with k attention heads, respectively, Q (Q_j), K (K_j) and V (V_j) denote the three input features query, key and value, where Q is obtained from X_1 by linear transformation and K and V are obtained from M by linear transformation, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation.
The beneficial effects are that: the method of the invention extracts local and global features of the image with a Transformer while also addressing the multi-scale feature modeling of ground objects, and proposes a local-global Siamese Transformer as the backbone network to extract semantically discriminative features. First, to address severe false and missed detections at feature boundaries, a plug-and-play high-frequency enhancement unit is proposed to replace the inflexible U-shaped structure and refine the detection boundaries. Second, for the multi-scale modeling of ground objects, a multi-scale fusion attention unit is proposed that integrates multi-scale information flow into the self-attention computation. Finally, a deep feature guiding unit is used to refine shallow detail feature information and obtain a refined detection result.
Drawings
FIG. 1 is a flowchart illustrating steps of a detection method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a remote sensing image change detection model based on a local-global Transformer network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a Siamese Transformer network according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a high-frequency enhancement unit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a multi-scale fused attention unit according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a deep feature guiding unit according to an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the drawings and the detailed description.
The invention discloses a remote sensing image change detection method based on a local-global Transformer network. First, considering that existing change detection methods rely heavily on CNNs for feature extraction and are therefore limited by the receptive field of the convolution kernel, which makes it difficult to capture long-range dependencies in the image, the invention proposes a local-global Transformer as the backbone network that considers both the local and the global features of the bi-temporal images. Second, a plug-and-play high-frequency enhancement unit and a multi-scale fusion attention unit are proposed to address the loss of boundary detail information in change regions and the multi-scale feature modeling of remote sensing ground objects. Finally, a deep feature guiding unit is proposed that uses deep abstract semantic features to guide and refine the fine-grained features of the shallow network layers, yielding finer detection results. These components are unified into an end-to-end Transformer-based remote sensing image change detection network that requires no cumbersome stage-by-stage training.
As shown in fig. 1, the present invention includes the steps of:
step (1): a group of double-phase remote sensing images are input as samples to test the algorithm. Because the neural network is very sensitive to data distribution, the data preprocessing operation is firstly carried out on the input double-phase remote sensing image. In order to facilitate subsequent batch training, all the double-temporal remote sensing images are cut into uniform 256×256 pixel sizes; random horizontal flipping, random vertical flipping, and random 0-180 ° rotation are used as data enhancement means to train a more powerful change detection model; the two-phase image is subjected to a data normalization step to obtain normalized image data before the data is fed into the model.
Step (2): with the normalized bi-temporal remote sensing image data from step (1), a local-global Transformer backbone network is built so that both the local high-frequency features and the global low-frequency features of the bi-temporal remote sensing images are attended to in the feature extraction stage.
As shown in fig. 2, the local-global Transformer network is divided into four stages; each time the image features pass through one stage, their spatial size is halved and the feature dimension is doubled, giving a more flexible feature representation. Specifically, the bi-temporal images I_1, I_2 ∈ R^(D×H×W) (D, H and W denote the number of channels, the height and the width, respectively) are first converted into image token sequences T_1 and T_2 by the image block embedding operation. In the present invention, a two-dimensional convolution with a 4×4 kernel and a stride of 4 implements the image block embedding, so the resulting token sequences have length (H/4)×(W/4). T_1 and T_2 then pass through four successive stages. In each stage, the invention proposes a local-global attention to extract the local and global features of the bi-temporal images effectively.
It consists of two Siamese Transformers: the first restricts the self-attention computation to a local window to model the local pixel relationships of the image, and the second performs the attention computation over all pixels of the image to model the global pixel relationships. Considering the high cost of computing self-attention directly on the shallow features of the network, only local attention is used in the first stage. In addition, there is a block merging operation between every two stages, which halves the feature map size and doubles the number of channels.
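The sketch below illustrates how an image block embedding (4×4 convolution with stride 4) and a block merging step (halving the feature map size while doubling the channels) could be written in PyTorch; the class names and the use of a strided 2×2 convolution for merging are illustrative assumptions rather than the patent's exact implementation.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Image block embedding: a 4x4 convolution with stride 4 turns a
    (B, D, H, W) image into a token sequence of length (H/4)*(W/4)."""
    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=4, stride=4)

    def forward(self, x):                        # x: (B, D, H, W)
        x = self.proj(x)                         # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)      # (B, N, C) token sequence

class PatchMerging(nn.Module):
    """Image block merging between stages: halves the feature-map size and
    doubles the channel number, here via a strided 2x2 convolution."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=2, stride=2)

    def forward(self, x, h, w):                  # x: (B, N, C) with N = h*w
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.reduction(x)                    # (B, 2C, h/2, w/2)
        return x.flatten(2).transpose(1, 2), h // 2, w // 2
```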
The structure of the Siamese Transformer is shown in FIG. 3. The module consists of two standard Transformer blocks with shared weights, and each Transformer block is divided by function into a token mixer and a channel mixer. Let the feature pair of the bi-temporal images be denoted X_1, X_2 ∈ R^(N×C), where N denotes the token sequence length and C the number of feature channels. X_1 and X_2 first pass through the token mixer, which captures the spatial feature representation of the bi-temporal images, and then through the channel mixer, which fuses feature information along the channel dimension.
Specifically, in the token mixer, X_1 first undergoes a layer normalization operation, and three input features Q, K and V are then obtained by linear transformation. Q, K and V are used to compute multi-head self-attention, which captures the relationships between all image tokens. Finally, X_1 is added to the attention output through a skip connection to obtain the output feature of the token mixer. X_2 goes through the same computation. The key step of this process is the multi-head self-attention, whose mathematical expression is shown in formulas (1)-(2):
SA_i(Q_i, K_i, V_i) = Softmax(Q_i K_i^T / sqrt(c)) V_i    (1)
MHSA_k(Q, K, V) = Φ(Concat(SA_1, SA_2, ..., SA_k))    (2)
wherein SA_i(·) and MHSA_k(·) denote the output features of the i-th self-attention head and of the multi-head self-attention with k attention heads, respectively, Q (Q_i), K (K_i) and V (V_i) denote the three input features query, key and value, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation.
The core of the channel mixer is a multi-layer perceptron consisting of two linear transformation layers and an intermediate activation function, which models the feature relationships along the channel dimension. In addition, to compensate for the Transformer's insensitivity to spatial position, the invention introduces a depthwise-convolution-based conditional position encoding into the multi-layer perceptron. The mathematical expression of the channel mixer is shown in formulas (3)-(5):
X'_m = Φ_1(norm(X_m))    (3)
pos = T_2(DW(T_1(X'_m)))    (4)
Y = Φ_2(σ(X'_m + pos)) + X_m    (5)
wherein X'_m, pos and Y denote the transition feature, the conditional position encoding feature and the output feature of the channel mixer, respectively, X_m denotes the input feature of the channel mixer, norm denotes layer normalization, Φ denotes a linear transformation, T denotes a reshaping operation with T_1 ∈ (1D→2D) and T_2 ∈ (2D→1D), DW denotes a depthwise convolution, and σ denotes GELU activation.
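As one illustrative reading of equations (1)-(5), the following PyTorch sketch shows a weight-shared Siamese Transformer block consisting of a token mixer (multi-head self-attention with a skip connection) and a channel mixer (a two-layer perceptron with a depthwise-convolution conditional position encoding). All class names, the hidden expansion ratio and the head count are assumptions made for illustration.

```python
import torch.nn as nn

class TokenMixer(nn.Module):
    """Multi-head self-attention over the token sequence (Eqs. (1)-(2))."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (B, N, C)
        y = self.norm(x)
        y, _ = self.attn(y, y, y)                  # Q, K, V from the same tokens
        return x + y                               # skip connection

class ChannelMixer(nn.Module):
    """MLP over the channel dimension with a depthwise-convolution
    conditional position encoding (Eqs. (3)-(5))."""
    def __init__(self, dim, hidden=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden * dim)
        self.dw = nn.Conv2d(hidden * dim, hidden * dim, 3, padding=1,
                            groups=hidden * dim)   # depthwise convolution
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden * dim, dim)

    def forward(self, x, h, w):                    # x: (B, N, C), N = h*w
        y = self.fc1(self.norm(x))                 # Eq. (3)
        b, n, c = y.shape
        pos = self.dw(y.transpose(1, 2).reshape(b, c, h, w))  # Eq. (4)
        pos = pos.flatten(2).transpose(1, 2)
        return x + self.fc2(self.act(y + pos))     # Eq. (5)

class SiameseTransformerBlock(nn.Module):
    """One weight-shared block applied to both temporal features."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.token_mixer = TokenMixer(dim, num_heads)
        self.channel_mixer = ChannelMixer(dim)

    def forward(self, x1, x2, h, w):
        x1 = self.channel_mixer(self.token_mixer(x1), h, w)
        x2 = self.channel_mixer(self.token_mixer(x2), h, w)
        return x1, x2
```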
Step (3): with the local and global features of the bi-temporal images obtained in step (2), a difference map is computed by a differencing operation; the high-frequency enhancement unit, the multi-scale fusion attention unit and the deep feature guiding unit are then built to complete the remote sensing image change detection model based on the local-global Transformer network.
Step (3.1): and building a high-frequency enhancement unit. In the change detection task, the detection effect of the edge of the change area has great influence on the performance index of the final model. Most existing change detection methods utilize complex U-shaped structures in combination with attention mechanisms to refine the change region boundaries, which generally optimize features from a global perspective, which is not targeted for boundary information modeling. Furthermore, the U-shaped structure tends to be complex in network design. Therefore, the present invention proposes a simple but effective high frequency enhancement unit that uses a self-attention mechanism and correlation between front and rear features to effectively extract high frequency feature information of the edges of the change region.
As shown in FIG. 4, assume that the input feature is X ∈ R^(C×H×W). X is first passed through a 3×3 convolution for differential feature optimization. To obtain high-frequency features, low-frequency features are captured with average pooling and multi-head self-attention, then up-sampled with bilinear interpolation, and finally subtracted from the shallow feature by a differential algebra operation. The high-frequency features and the shallow features are concatenated along the channel dimension and fused with a 3×3 convolution to obtain the output feature of the high-frequency enhancement unit. The mathematical expression of this process is shown in formulas (6)-(8):
E_s = W_1 X    (6)
E_H = E_s - Up(T_2(MHSA(T_1(Avg(E_s)))))    (7)
Y = W_2 Concat(E_s, E_H)    (8)
wherein X denotes the input feature, Y denotes the output feature, E_s and E_H denote the shallow differential feature and the high-frequency feature, respectively, W_1 and W_2 denote two 3×3 convolutions, T_1 and T_2 denote two reshaping operations, MHSA denotes the multi-head self-attention computation, Avg denotes average pooling, and Up denotes the up-sampling operation.
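A minimal PyTorch sketch of the high-frequency enhancement unit of equations (6)-(8) might look as follows; the pooled token size and the module name are assumptions, and torch.nn.MultiheadAttention stands in for the multi-head self-attention computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFrequencyEnhancement(nn.Module):
    """Sketch of the high-frequency enhancement unit (Eqs. (6)-(8)):
    low-frequency content is estimated by pooled self-attention and
    subtracted from the shallow differential feature."""
    def __init__(self, dim, num_heads=4, pooled_size=8):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 3, padding=1)            # W1
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv2 = nn.Conv2d(2 * dim, dim, 3, padding=1)         # W2
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)

    def forward(self, x):                         # x: (B, C, H, W) difference feature
        e_s = self.conv1(x)                        # shallow differential feature, Eq. (6)
        b, c, h, w = e_s.shape
        low = self.pool(e_s)                       # average pooling
        tok = low.flatten(2).transpose(1, 2)       # reshape 2D -> 1D tokens (T1)
        tok, _ = self.attn(tok, tok, tok)          # multi-head self-attention
        low = tok.transpose(1, 2).reshape(b, c, *low.shape[2:])    # reshape back (T2)
        low = F.interpolate(low, size=(h, w), mode='bilinear',
                            align_corners=False)   # bilinear up-sampling
        e_h = e_s - low                            # high-frequency feature, Eq. (7)
        return self.conv2(torch.cat([e_s, e_h], dim=1))            # Eq. (8)
```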
Step (3.2): build the multi-scale fusion attention unit. Remote sensing images contain ground objects at many different scales, so multi-scale modeling capability is one of the important criteria for evaluating a remote sensing image change detection model. Most existing change detection algorithms address the multi-scale problem with multi-scale convolution or pooling, U-shaped fusion modules, and the like. Unlike these methods, the present invention proposes a new multi-scale fusion attention unit based on multi-head self-attention, whose structure is shown schematically in fig. 5.
The input features of this module come from the outputs of the last three stages of the local-global Transformer backbone in step (2), denoted X_2, X_3 and X_4. To calibrate the channels, the three input features are first projected to dimension C with a 1×1 convolution. The three two-dimensional image features are then converted into one-dimensional token sequences Tk_i by a reshaping operation. To let the feature information of the multi-scale tokens interact fully, the Tk_i are concatenated along the spatial dimension and the spatial relationships of the different-scale features are aggregated by N successive multi-head self-attention layers. Thanks to the long-range dependency modeling capability of self-attention, the relationships between all tokens are unified. Considering that the original feature distribution is easily altered after multi-head self-attention, the invention proposes an attention weighting mechanism that effectively alleviates this phenomenon.
Specifically, the multi-head self-attention output features are first split along the spatial dimension, and each one-dimensional token sequence is restored to a two-dimensional image feature by a reshaping operation. The two-dimensional image features are passed through a 1×1 convolution, batch normalization and Sigmoid activation to obtain the attention weights. To aggregate multi-scale features more fully, the invention proposes a scale calibration operation: the three different-scale features are fused at three different spatial sizes by average pooling and up-sampling to obtain three coarse-grained fusion features. These three features are then refined by a 3×3 convolution, batch normalization and a ReLU activation function. Finally, the three features are weighted by the corresponding attention weights to obtain the final fine-grained fusion features Y_i.
The mathematical expression of the above process is shown in formula (9):
Y_i = ReLU(BN(W_3×3 C_i)) * Sigmoid(BN(W_1×1 A_i))    (9)
wherein C_i denotes the calibration feature, A_i denotes the output feature of the multi-head self-attention, Y_i denotes the output feature of the multi-scale fusion attention unit, W_k×k denotes a k×k convolution, BN denotes batch normalization, * denotes the matrix Hadamard product, and i denotes the stage number, i ∈ {2, 3, 4};
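The following sketch illustrates one possible PyTorch reading of the multi-scale fusion attention unit: tokens from the three scales are concatenated, passed through N self-attention layers, split back, and used to re-weight scale-calibrated coarse fusion features as in equation (9). The class and parameter names are assumptions, and bilinear interpolation is used here for both down- and up-sampling in the scale calibration, which simplifies the average-pooling step described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionAttention(nn.Module):
    """Sketch of the multi-scale fusion attention unit: tokens from the
    last three stages interact through shared multi-head self-attention,
    and the resulting attention weights re-weight scale-calibrated
    coarse fusion features (Eq. (9))."""
    def __init__(self, dims, out_dim, num_heads=4, depth=2):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(d, out_dim, 1) for d in dims])
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(out_dim, num_heads, batch_first=True)
             for _ in range(depth)])
        self.weight = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(out_dim, out_dim, 1),
                           nn.BatchNorm2d(out_dim), nn.Sigmoid())
             for _ in range(3)])
        self.calib = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(out_dim, out_dim, 3, padding=1),
                           nn.BatchNorm2d(out_dim), nn.ReLU(inplace=True))
             for _ in range(3)])

    def forward(self, feats):                     # feats: [X2, X3, X4], (B, C_i, H_i, W_i)
        feats = [p(f) for p, f in zip(self.proj, feats)]      # 1x1 channel calibration
        sizes = [f.shape[-2:] for f in feats]
        tokens = [f.flatten(2).transpose(1, 2) for f in feats]
        lens = [t.shape[1] for t in tokens]
        x = torch.cat(tokens, dim=1)              # concatenate along the spatial dimension
        for attn in self.attn:                    # N rounds of multi-head self-attention
            x = x + attn(x, x, x)[0]
        tokens = torch.split(x, lens, dim=1)      # split back per scale
        outs = []
        for i, (t, (h, w)) in enumerate(zip(tokens, sizes)):
            a = t.transpose(1, 2).reshape(t.shape[0], -1, h, w)
            # scale calibration: fuse the three inputs at this spatial size
            m = sum(F.interpolate(f, size=(h, w), mode='bilinear',
                                  align_corners=False) for f in feats)
            outs.append(self.calib[i](m) * self.weight[i](a))   # Eq. (9)
        return outs                               # fine-grained fusion features Y_2..Y_4
```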
step (3.3): and constructing a deep feature guiding unit. The shallow features of the network typically contain more fine-grained information such as texture, color, boundaries, etc., and additionally contain more background interference information. Deep features of the network have semantic directionality, often containing more abstract semantic information. From this point of view, better detection accuracy can be obtained by guiding shallow detail features with deep semantic features. Existing work has incorporated this idea into model designs, but they tend to consider only deep semantic features at a single scale. The invention provides a novel deep feature guiding unit which can effectively optimize shallow fine granularity features by utilizing deep multi-scale semantic features to obtain a finer detection diagram, and the structural schematic diagram of the novel deep feature guiding unit is shown in fig. 6.
The overall structure is somewhat similar to the multiscale fused attention unit of step (3.2). Likewise, a scale calibration operation is used to fuse the three output features of the multiscale fused attention unit in the spatial dimension of stage 1. To migrate semantic information of deep features to shallow features, a similar attention weighting mechanism is introduced. In contrast, multi-headed self-attention is replaced by multi-headed cross-attention.
The input of the multi-head cross attention is likewise three features, denoted Q, K and V. Q is derived from the shallow features of stage 1, and K and V are derived from the calibrated deep fusion features. Through the interaction of Q with K and V, the multi-head cross attention can fully model the semantic relationship between the shallow features and the multi-scale deep features; its mathematical expression is shown in formulas (10)-(12):
ReLU(BN(W_3×3(X_1 + M))) * Sigmoid(BN(W_1×1 MHCA(X_1, M)))    (10)
CA_j(Q_j, K_j, V_j) = Softmax(Q_j K_j^T / sqrt(c)) V_j    (11)
MHCA_k(Q, K, V) = Φ(Concat(CA_1, CA_2, ..., CA_k))    (12)
wherein X_1 denotes the high-frequency features of the first stage, M denotes the fused calibration features of the last three stages, MHCA denotes multi-head cross attention, CA_j(·) and MHCA_k(·) denote the output features of the j-th cross attention head and of the multi-head cross attention with k attention heads, respectively, Q (Q_j), K (K_j) and V (V_j) denote the three input features query, key and value, where Q is obtained from X_1 by linear transformation and K and V are obtained from M by linear transformation, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation.
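A corresponding sketch of the deep feature guiding unit (equations (10)-(12)) is given below, with the stage-1 high-frequency feature as the query and the calibrated deep fusion features as key and value of a multi-head cross attention; it assumes that all input features have already been projected to a common channel dimension, and the module name is illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class DeepFeatureGuidance(nn.Module):
    """Sketch of the deep feature guiding unit (Eqs. (10)-(12)): shallow
    first-stage high-frequency features query the calibrated deep fusion
    features through multi-head cross attention."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.weight = nn.Sequential(nn.Conv2d(dim, dim, 1),
                                    nn.BatchNorm2d(dim), nn.Sigmoid())
        self.refine = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                                    nn.BatchNorm2d(dim), nn.ReLU(inplace=True))

    def forward(self, x1, deep_feats):            # x1: (B, C, H1, W1) stage-1 feature
        b, c, h, w = x1.shape
        # scale calibration: resize and sum the three deep fusion features (M)
        m = sum(F.interpolate(f, size=(h, w), mode='bilinear',
                              align_corners=False) for f in deep_feats)
        q = x1.flatten(2).transpose(1, 2)         # query from the shallow feature
        kv = m.flatten(2).transpose(1, 2)         # key/value from the deep features
        a, _ = self.cross_attn(q, kv, kv)         # Eqs. (11)-(12)
        a = a.transpose(1, 2).reshape(b, c, h, w)
        return self.refine(x1 + m) * self.weight(a)   # Eq. (10)
```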
Step (4): build the remote sensing image change detection model based on the local-global Transformer network and train it with the normalized bi-temporal image data. Fig. 2 shows the model proposed by the invention. Experiments are conducted on four binary remote sensing image change detection datasets, CDD, BTCDD, LEVIR-CD and Google; all experiments are carried out with the PyTorch framework on an NVIDIA TITAN RTX GPU. Binary cross entropy is used as the training loss function and AdamW as the optimizer, with a weight decay coefficient of 0.01, a learning rate of 0.0001, a batch size of 16 and 200 training epochs. Finally, the model output features are activated with a Sigmoid to obtain the final prediction, whose values are all compressed to between 0 and 1. Regions with pixel values of 0.5 or above are judged to be changed (shown in white), and regions with pixel values below 0.5 are judged to be unchanged (shown in black).
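The training configuration described above (binary cross entropy, AdamW, learning rate 0.0001, weight decay 0.01, 200 epochs, Sigmoid with a 0.5 threshold) could be sketched as follows; nn.BCEWithLogitsLoss is used here to fold the Sigmoid into the loss for numerical stability, and the function names and data-loader interface are assumptions, not part of the patent.

```python
import torch
import torch.nn as nn

def train_change_detector(model, train_loader, device="cuda", epochs=200):
    """Minimal training loop matching the settings described above:
    binary cross entropy, AdamW with lr 1e-4 and weight decay 0.01."""
    model = model.to(device)
    criterion = nn.BCEWithLogitsLoss()            # BCE on the change map
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
    for _ in range(epochs):
        model.train()
        for img_a, img_b, label in train_loader:  # bi-temporal pair + change mask
            img_a, img_b = img_a.to(device), img_b.to(device)
            label = label.to(device).float()
            logits = model(img_a, img_b)          # model output before Sigmoid
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

def predict(model, img_a, img_b, threshold=0.5):
    """Apply Sigmoid and binarize: >= 0.5 means 'changed' (white), else unchanged."""
    with torch.no_grad():
        prob = torch.sigmoid(model(img_a, img_b))
    return (prob >= threshold).float()
```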
Step (5): evaluate the change detection performance of the model. Step (4) yields 200 trained change detection models; the best model is selected from these 200 models according to five evaluation metrics, namely precision (PR), recall (RC), F1 score (F1), intersection over union (IoU) and overall accuracy (OA), and the same five metrics are used to evaluate the performance of the best model. Their mathematical expressions are shown in formulas (13)-(17):
PR = TP / (TP + FP)    (13)
RC = TP / (TP + FN)    (14)
F1 = 2 × PR × RC / (PR + RC)    (15)
IoU = TP / (TP + FP + FN)    (16)
OA = (TP + TN) / (TP + TN + FP + FN)    (17)
wherein TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
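The five evaluation metrics of formulas (13)-(17) can be computed from the binary prediction and ground-truth maps as in the following sketch; the function name and the small epsilon added to avoid division by zero are assumptions.

```python
import numpy as np

def change_detection_metrics(pred, gt):
    """Compute PR, RC, F1, IoU and OA (Eqs. (13)-(17)) from binary maps.

    pred, gt: numpy arrays of 0/1 values with identical shape.
    """
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    eps = 1e-10                                   # avoid division by zero
    pr = tp / (tp + fp + eps)
    rc = tp / (tp + fn + eps)
    f1 = 2 * pr * rc / (pr + rc + eps)
    iou = tp / (tp + fp + fn + eps)
    oa = (tp + tn) / (tp + tn + fp + fn + eps)
    return {"PR": pr, "RC": rc, "F1": f1, "IoU": iou, "OA": oa}
```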
By implementing the model of the invention, the final change detection accuracy is greatly improved. The invention proposes a local-global Transformer as the backbone network to extract features of the bi-temporal remote sensing images, effectively capturing both their local and global features; a high-frequency enhancement unit replaces the complex U-shaped network structure and refines the high-frequency features of the change-region boundaries in a simple but efficient manner; a multi-scale fusion attention unit lets multi-scale features interact fully through a multi-head self-attention mechanism; and a deep feature guiding unit uses deep multi-scale semantic features and a multi-head cross attention mechanism to guide and refine the fine-grained features of the shallow network layers, yielding a more refined detection result. Together, these components constitute the remote sensing image change detection method based on the local-global Transformer network.

Claims (10)

1. A remote sensing image change detection method based on a local-global Transformer network, characterized by comprising the following steps:
S1, preprocessing multiple groups of image sample data to obtain multiple groups of normalized image data, wherein each group of image sample data comprises a bi-temporal remote sensing image pair consisting of a pre-temporal remote sensing image and a post-temporal remote sensing image;
S2, inputting each group of normalized image data into a local-global Transformer network in sequence, wherein the local-global Transformer network comprises an image block embedding and a backbone network; the backbone network comprises four stages, namely a first stage, a second stage, a third stage and a fourth stage; the pre-temporal and post-temporal remote sensing images after image block embedding are input into the first stage, and the output of each stage serves as the input of the next; each stage comprises two Siamese Transformer models, the first of which restricts self-attention computation to a local window to model the local pixel relationships of the input image, while the second performs attention computation over all pixels of the image to model its global pixel relationships; between every two stages there is an image block merging operation, which halves the feature map size and doubles the number of channels;
S3, performing a differencing operation on the outputs of the pre-temporal and post-temporal remote sensing images at each stage, and feeding the result into a high-frequency enhancement unit to obtain high-frequency features of the change-region edges at each stage;
S4, inputting the high-frequency features of the second, third and fourth stages into a multi-scale fusion attention unit to obtain fine-grained fusion features for each of these stages;
S5, inputting the high-frequency features of the change-region edges from the first stage together with the fine-grained fusion features obtained in the second, third and fourth stages into a deep feature guiding unit to obtain a finer detection map;
S6, fusing the output of the deep feature guiding unit with the fine-grained fusion features obtained in the second, third and fourth stages to obtain the model output features;
S7, training the model with training data and then testing it to obtain the final prediction result.
2. The method as recited in claim 1, wherein the first stage comprises two Siamese Transformer models, and both restrict self-attention computation to a local window to model the local pixel relationships of the input image.
3. The method according to claim 1 or 2, characterized in that the image block embedding is implemented as a two-dimensional convolution with a 4×4 kernel and a stride of 4: the bi-temporal images I_1, I_2 ∈ R^(D×H×W), where D, H and W denote the number of channels, the height and the width, respectively, are converted into image token sequences by the image block embedding operation.
4. The method according to claim 1 or 2, characterized in that the Siamese Transformer model comprises two standard Transformer blocks with shared weights, each of which is divided by function into a token mixer and a channel mixer; the token mixer captures the spatial feature representation of the bi-temporal images: a layer normalization is first applied to the feature pair of the bi-temporal images, three input features Q, K and V are then obtained by linear transformation and fed into the self-attention computation, and finally the features of the bi-temporal images are added to the self-attention output through a skip connection to yield the token mixer output features;
the channel mixer fuses features along the channel dimension: the token mixer output features are first normalized and passed through the first multi-layer perceptron, a depthwise convolution is applied, the result of the depthwise convolution is added to the output of the first multi-layer perceptron, an activation function is applied, and the result is fed into the second multi-layer perceptron; the output of the second multi-layer perceptron is added to the token mixer output features to obtain the channel mixer output features Y; both the first and the second multi-layer perceptron are linear transformations.
5. The method of claim 4, wherein the token mixer uses a self-attention mechanism to fully capture the global spatial feature relationships of the image, the mathematical expression of which is shown in formulas (1)-(2):
SA_i(Q_i, K_i, V_i) = Softmax(Q_i K_i^T / sqrt(c)) V_i    (1)
MHSA_k(Q, K, V) = Φ(Concat(SA_1, SA_2, ..., SA_k))    (2)
wherein SA_i(·) and MHSA_k(·) denote the output features of the i-th self-attention head and of the multi-head self-attention with k attention heads, respectively, Q (Q_i), K (K_i) and V (V_i) denote the three input features query, key and value, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation; the channel mixer uses a multi-layer perceptron to effectively fuse features along the channel dimension, the mathematical expression of which is shown in formulas (3)-(5):
X'_m = Φ_1(norm(X_m))    (3)
pos = T_2(DW(T_1(X'_m)))    (4)
Y = Φ_2(σ(X'_m + pos)) + X_m    (5)
wherein X'_m, pos and Y denote the transition feature (i.e. the output feature of the first multi-layer perceptron), the conditional position encoding feature and the output feature of the channel mixer, respectively, X_m denotes the input feature of the channel mixer, m ∈ {1, 2}, norm denotes layer normalization, Φ denotes a linear transformation, T denotes a reshaping operation with T_1 ∈ (1D→2D) and T_2 ∈ (2D→1D), DW denotes a depthwise convolution, and σ denotes GELU activation.
6. The method according to claim 1 or 2, wherein the high frequency enhancement unit comprises:
the input feature X is passed through a 3×3 convolution for differential feature optimization, yielding the shallow differential feature E_s; to obtain high-frequency features, low-frequency features are first captured with average pooling and multi-head self-attention and then up-sampled by bilinear interpolation to obtain an intermediate feature; the intermediate feature is subtracted from the shallow differential feature E_s to obtain the high-frequency feature E_H; E_H and E_s are concatenated along the channel dimension and fused with a 3×3 convolution to obtain the output feature Y of the high-frequency enhancement unit;
in formulas:
E_s = W_1 X    (6)
E_H = E_s - Up(T_2(MHSA(T_1(Avg(E_s)))))    (7)
Y = W_2 Concat(E_s, E_H)    (8)
wherein X denotes the input feature, Y denotes the output feature, E_s and E_H denote the shallow differential feature and the high-frequency feature, respectively, W_1 and W_2 denote two 3×3 convolutions, T_1 and T_2 denote two reshaping operations, MHSA denotes the multi-head self-attention computation, Avg denotes average pooling, and Up denotes the up-sampling operation.
7. The method according to claim 1 or 2, wherein in the multi-scale fusion attention unit the three input features are each passed through a 1×1 convolution, and the three two-dimensional image features are then converted into one-dimensional token sequences by a reshaping operation; the one-dimensional token sequences are concatenated along the spatial dimension and the spatial relationships of the different-scale features are aggregated by N successive multi-head self-attention layers; the multi-head self-attention output features are split along the spatial dimension and each one-dimensional token sequence is restored to a two-dimensional image feature by a reshaping operation; the two-dimensional image features are passed through a 1×1 convolution, batch normalization and Sigmoid activation to obtain the corresponding attention weights; the three input features of different scales are fused at three different spatial sizes by average pooling and up-sampling to obtain three coarse-grained fusion features, which are refined by a 3×3 convolution, batch normalization and a ReLU activation function to obtain the corresponding calibration features; finally, the three calibration features are weighted by the corresponding attention weights to obtain the final fine-grained fusion features.
8. The method of claim 7, wherein the multi-scale fusion attention unit is expressed as shown in formula (9):
Y_i = ReLU(BN(W_3×3 C_i)) * Sigmoid(BN(W_1×1 A_i))    (9)
wherein C_i denotes the calibration feature, A_i denotes the output feature of the multi-head self-attention, Y_i denotes the output feature of the multi-scale fusion attention unit, W_k×k denotes a k×k convolution, * denotes the matrix Hadamard product, i denotes the stage number with i ∈ {2, 3, 4}, and BN denotes batch normalization.
9. The method according to claim 1 or 2, wherein the deep feature guiding unit performs scale calibration on the fine-grained fusion features obtained in the second, third and fourth stages to obtain the calibrated deep fusion features; cross attention is computed between the calibrated deep fusion features and the high-frequency features of the change-region edges from the first stage, followed by a 1×1 convolution, batch normalization and Sigmoid activation to obtain the corresponding attention weights; the calibrated deep fusion features and the first-stage high-frequency edge features are added and refined by a 3×3 convolution, batch normalization and a ReLU activation function to obtain the corresponding optimized features; finally, the attention weights and the optimized features are combined by attention weighting to obtain a finer detection map.
10. The method of claim 9, wherein the deep feature guiding unit is expressed as shown in formulas (10)-(12):
ReLU(BN(W_3×3(X_1 + M))) * Sigmoid(BN(W_1×1 MHCA(X_1, M)))    (10)
CA_j(Q_j, K_j, V_j) = Softmax(Q_j K_j^T / sqrt(c)) V_j    (11)
MHCA_k(Q, K, V) = Φ(Concat(CA_1, CA_2, ..., CA_k))    (12)
wherein X_1 denotes the high-frequency features of the first stage, M denotes the fused calibration features of the last three stages, MHCA denotes multi-head cross attention, CA_j(·) and MHCA_k(·) denote the output features of the j-th cross attention head and of the multi-head cross attention with k attention heads, respectively, Q (Q_j), K (K_j) and V (V_j) denote the three input features query, key and value, where Q is obtained from X_1 by linear transformation and K and V are obtained from M by linear transformation, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation.
CN202310470097.1A 2023-04-27 2023-04-27 Remote sensing image change detection method based on local-global Transformer network Pending CN116434069A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310470097.1A CN116434069A (en) Remote sensing image change detection method based on local-global Transformer network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310470097.1A CN116434069A (en) Remote sensing image change detection method based on local-global Transformer network

Publications (1)

Publication Number Publication Date
CN116434069A true CN116434069A (en) 2023-07-14

Family

ID=87087196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310470097.1A Pending CN116434069A (en) Remote sensing image change detection method based on local-global Transformer network

Country Status (1)

Country Link
CN (1) CN116434069A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994202A (en) * 2023-08-03 2023-11-03 杭州宸悦智能工程有限公司 Intelligent car washer and system thereof
CN116994202B (en) * 2023-08-03 2024-03-15 杭州宸悦智能工程有限公司 Intelligent car washer and system thereof
CN117237740A (en) * 2023-11-07 2023-12-15 山东军地信息技术集团有限公司 SAR image classification method based on CNN and Transformer
CN117237740B (en) * 2023-11-07 2024-03-01 山东军地信息技术集团有限公司 SAR image classification method based on CNN and Transformer
CN117612024A (en) * 2023-11-23 2024-02-27 国网江苏省电力有限公司扬州供电分公司 Remote sensing image roof recognition method and system based on multi-scale attention
CN117612024B (en) * 2023-11-23 2024-06-07 国网江苏省电力有限公司扬州供电分公司 Remote sensing image roof recognition method based on multi-scale attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination