CN116434069A - Remote sensing image change detection method based on local-global Transformer network - Google Patents

Remote sensing image change detection method based on local-global Transformer network

Info

Publication number
CN116434069A
CN116434069A (application CN202310470097.1A)
Authority
CN
China
Prior art keywords
stage
attention
features
image
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310470097.1A
Other languages
Chinese (zh)
Inventor
Xia Min (夏旻)
Song Lei (宋磊)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202310470097.1A priority Critical patent/CN116434069A/en
Publication of CN116434069A publication Critical patent/CN116434069A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/10 Scenes; scene-specific elements; terrestrial scenes
    • G06N 3/045 Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N 3/08 Learning methods
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level, of extracted features
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • Y04S 10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a remote sensing image change detection method based on a local-global Transformer network, comprising the following steps: preprocessing multiple groups of image sample data to obtain multiple groups of normalized image data; inputting each group of normalized image data into a local-global Transformer network; performing a differencing operation on the outputs of the pre-temporal and post-temporal remote sensing images at each stage and feeding the result into a high-frequency enhancement unit to obtain high-frequency features of the change-region edges at each stage; inputting the high-frequency features of the second, third and fourth stages into a multi-scale fusion attention unit to obtain fine-grained fusion features for each of these stages; and inputting the high-frequency edge features of the first stage together with the fine-grained fusion features obtained in the following three stages into a deep feature guiding unit. To address severe false and missed detections at feature boundaries, the invention provides a plug-and-play high-frequency enhancement unit that replaces the inflexible U-shaped structure and refines the detection boundaries.

Description

Remote sensing image change detection method based on local-global Transformer network
Technical Field
The invention belongs to the technical field of change detection networks, and particularly relates to a remote sensing image change detection method based on a local-global Transformer network.
Background
The evolution of the natural environment and human production activities cause the Earth's surface to change continuously, so regularly monitoring and promptly detecting land-cover change is of great significance for the harmonious coexistence of humans and nature. Remote sensing image change detection aims to determine and analyze changes of ground objects, including changes in their extent and state, using multi-temporal remote sensing images and geospatial data acquired over the same surface area at different times; it is an important means of Earth observation. The technology has been widely applied in urban planning, land-use monitoring, agricultural and forestry monitoring, natural disaster monitoring, and many other fields.
Most existing deep-learning-based remote sensing image change detection methods rely on convolutional networks. Owing to the inherent locality of convolution kernels, convolutional networks lack the ability to model long-range dependencies, which may limit their further development in the change detection field.
In recent years, some works have begun to use non-local relational modeling to efficiently extract global relationships between pixels. Compared with convolution-based change detection methods, such approaches can fully exploit the connection between arbitrary pixels, so the limited receptive field is no longer a key factor restricting model performance. However, most existing non-local modeling methods struggle to capture the information flow between multi-scale features: they typically append a non-local feature extraction module at the end of the network and ignore a more critical problem in change detection, namely the multi-scale nature of geospatial objects.
In addition, how well the boundaries of change regions are detected directly affects the final detection accuracy. In real aerial remote sensing images, the exact boundaries of some change regions are difficult to delineate because of spectral variation or shadow occlusion. Existing change detection algorithms typically adopt a U-shaped structure to capture boundary detail information layer by layer, but this inevitably sacrifices flexibility in model design.
Disclosure of Invention
The invention aims to: overcome the shortcomings of existing methods, provide a remote sensing image change detection method based on a local-global Transformer network, and solve the problem of low remote sensing image change detection accuracy.
The technical scheme is as follows: the remote sensing image change detection method based on a local-global Transformer network disclosed by the invention comprises the following steps:
S1, preprocessing multiple groups of image sample data to obtain multiple groups of normalized image data, wherein each group of image sample data comprises a bi-temporal remote sensing image pair consisting of a pre-temporal remote sensing image and a post-temporal remote sensing image;
S2, inputting each group of normalized image data into a local-global Transformer network in sequence, wherein the local-global Transformer network comprises an image block embedding and a backbone network; the backbone network comprises four stages, namely a first stage, a second stage, a third stage and a fourth stage; the pre-temporal and post-temporal remote sensing images after image block embedding are input into the first stage, and the output of each stage serves as the input of the next; each stage comprises two Siamese Transformer models, the first of which restricts self-attention computation to a local window to model the local pixel relationships of the input image, while the second performs attention computation over all pixels of the image to model its global pixel relationships; between every two stages there is an image block merging operation, which halves the feature map size and doubles the number of channels;
S3, performing a differencing operation on the outputs of the pre-temporal and post-temporal remote sensing images at each stage, and feeding the result into a high-frequency enhancement unit to obtain high-frequency features of the change-region edges at each stage;
S4, inputting the high-frequency features of the second, third and fourth stages into a multi-scale fusion attention unit to obtain fine-grained fusion features for each of these stages;
S5, inputting the high-frequency features of the change-region edges from the first stage together with the fine-grained fusion features obtained in the second, third and fourth stages into a deep feature guiding unit to obtain a finer detection map;
S6, fusing the output of the deep feature guiding unit with the fine-grained fusion features obtained in the second, third and fourth stages to obtain the model output features;
S7, training the model with training data and then testing it to obtain the final prediction result.
Further, the method comprises the steps of:
the first stage comprises two Siamese Transformer models, and both restrict self-attention computation to a local window to model the local pixel relationships of the input image.
Further, the method comprises the steps of:
the image block embedding is implemented as a two-dimensional convolution with a 4×4 kernel and a stride of 4: the bi-temporal images I_1, I_2 ∈ R^(D×H×W), where D, H and W denote the number of channels, the height and the width, respectively, are converted into image token sequences by the image block embedding operation.
Further, the method comprises the steps of:
the Siamese Transformer model comprises two standard Transformer blocks with shared weights, each of which is divided by function into a token mixer and a channel mixer; the token mixer captures the spatial feature representation of the bi-temporal images: a layer normalization is first applied to the feature pair of the bi-temporal images, three input features Q, K and V are then obtained by linear transformation and fed into the self-attention computation, and finally the features of the bi-temporal images are added to the self-attention output through a skip connection to yield the token mixer output features;
the channel mixer fuses features along the channel dimension: the token mixer output features are first normalized and passed through the first multi-layer perceptron, a depthwise convolution is applied, the result of the depthwise convolution is added to the output of the first multi-layer perceptron, an activation function is applied, and the result is fed into the second multi-layer perceptron; the output of the second multi-layer perceptron is added to the token mixer output features to obtain the channel mixer output features Y; both the first and the second multi-layer perceptron are linear transformations.
Further, the method comprises the steps of:
the token mixer uses a self-attention mechanism to fully capture the global spatial feature relationships of the image; its mathematical expression is shown in formulas (1)-(2):
SA_i(Q_i, K_i, V_i) = Softmax(Q_i K_i^T / sqrt(c)) V_i    (1)
MHSA_k(Q, K, V) = Φ(Concat(SA_1, SA_2, ..., SA_k))    (2)
wherein SA_i(·) and MHSA_k(·) denote the output features of the i-th self-attention head and of the multi-head self-attention with k attention heads, respectively, Q (Q_i), K (K_i) and V (V_i) denote the three input features, namely query, key and value, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation; the channel mixer uses a multi-layer perceptron to effectively fuse features along the channel dimension, and its mathematical expression is shown in formulas (3)-(5):
X'_m = Φ_1(norm(X_m))    (3)
pos = T_2(DW(T_1(X'_m)))    (4)
Y = Φ_2(σ(X'_m + pos)) + X_m    (5)
wherein X'_m, pos and Y denote the transition feature (i.e. the output feature of the first multi-layer perceptron), the conditional position encoding feature and the output feature of the channel mixer, respectively, X_m denotes the input feature of the channel mixer, m ∈ {1, 2}, norm denotes layer normalization, Φ denotes a linear transformation, T denotes a reshaping operation with T_1 ∈ (1D→2D) and T_2 ∈ (2D→1D), DW denotes a depthwise convolution, and σ denotes GELU activation.
Further, the method comprises the steps of:
the high-frequency enhancement unit operates as follows:
the input feature X is passed through a 3×3 convolution for differential feature optimization, yielding the shallow differential feature E_s; to obtain high-frequency features, low-frequency features are first captured with average pooling and multi-head self-attention and then up-sampled by bilinear interpolation to obtain an intermediate feature; the intermediate feature is subtracted from the shallow differential feature E_s to obtain the high-frequency feature E_H; E_H and E_s are concatenated along the channel dimension and fused with a 3×3 convolution to obtain the output feature Y of the high-frequency enhancement unit;
in formulas:
E_s = W_1 X    (6)
E_H = E_s - Up(T_2(MHSA(T_1(Avg(E_s)))))    (7)
Y = W_2 Concat(E_s, E_H)    (8)
wherein X denotes the input feature, Y denotes the output feature, E_s and E_H denote the shallow differential feature and the high-frequency feature, respectively, W_1 and W_2 denote two 3×3 convolutions, T_1 and T_2 denote two reshaping operations, MHSA denotes the multi-head self-attention computation, Avg denotes average pooling, and Up denotes the up-sampling operation.
Further, the method comprises the steps of:
in the multi-scale fusion attention unit, the three input features are first each passed through a 1×1 convolution, and the three two-dimensional image features are then converted into one-dimensional token sequences by a reshaping operation; to let the features of the multi-scale tokens interact fully, the one-dimensional token sequences are concatenated along the spatial dimension and the spatial relationships of the different-scale features are aggregated by N successive multi-head self-attention layers; the multi-head self-attention output features are split along the spatial dimension, and each one-dimensional token sequence is restored to a two-dimensional image feature by a reshaping operation; the two-dimensional image features are passed through a 1×1 convolution, batch normalization and Sigmoid activation to obtain the corresponding attention weights; the three input features of different scales are fused at three different spatial sizes by average pooling and up-sampling to obtain three coarse-grained fusion features, which are refined by a 3×3 convolution, batch normalization and a ReLU activation function to obtain the corresponding calibration features; finally, the three calibration features are weighted by the corresponding attention weights to obtain the final fine-grained fusion features.
Further, the method comprises the steps of:
the multi-scale fusion attention unit is expressed as shown in formula (9):
Y_i = ReLU(BN(W_3×3 C_i)) * Sigmoid(BN(W_1×1 A_i))    (9)
wherein C_i denotes the calibration feature, A_i denotes the output feature of the multi-head self-attention, Y_i denotes the output feature of the multi-scale fusion attention unit, W_k×k denotes a k×k convolution, * denotes the matrix Hadamard product, i denotes the stage number with i ∈ {2, 3, 4}, and BN denotes batch normalization.
Further, the method comprises the steps of:
the deep feature guiding unit first performs scale calibration on the fine-grained fusion features obtained in the second, third and fourth stages to obtain the calibrated deep fusion features; cross attention is computed between the calibrated deep fusion features and the high-frequency features of the change-region edges from the first stage, followed by a 1×1 convolution, batch normalization and Sigmoid activation to obtain the corresponding attention weights; the calibrated deep fusion features and the first-stage high-frequency edge features are added and refined by a 3×3 convolution, batch normalization and a ReLU activation function to obtain the corresponding optimized features; finally, the attention weights and the optimized features are combined by attention weighting to obtain a finer detection map.
Further, the method comprises the steps of:
the deep feature guiding unit is expressed as shown in formulas (10)-(12):
ReLU(BN(W_3×3(X_1 + M))) * Sigmoid(BN(W_1×1 MHCA(X_1, M)))    (10)
CA_j(Q_j, K_j, V_j) = Softmax(Q_j K_j^T / sqrt(c)) V_j    (11)
MHCA_k(Q, K, V) = Φ(Concat(CA_1, CA_2, ..., CA_k))    (12)
wherein X_1 denotes the high-frequency features of the first stage, M denotes the fused calibration features of the last three stages, MHCA denotes multi-head cross attention, CA_j(·) and MHCA_k(·) denote the output features of the j-th cross attention head and of the multi-head cross attention with k attention heads, respectively, Q (Q_j), K (K_j) and V (V_j) denote the three input features query, key and value, where Q is obtained from X_1 by linear transformation and K and V are obtained from M by linear transformation, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation.
The beneficial effects are that: the method of the invention extracts local and global features of the image with a Transformer while also addressing the multi-scale feature modeling of ground objects, and proposes a local-global Siamese Transformer as the backbone network to extract semantically discriminative features. First, to address severe false and missed detections at feature boundaries, a plug-and-play high-frequency enhancement unit is proposed to replace the inflexible U-shaped structure and refine the detection boundaries. Second, for the multi-scale modeling of ground objects, a multi-scale fusion attention unit is proposed that integrates multi-scale information flow into the self-attention computation. Finally, a deep feature guiding unit is used to refine shallow detail feature information and obtain a refined detection result.
Drawings
FIG. 1 is a flowchart illustrating steps of a detection method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a remote sensing image change detection model based on a local-global Transformer network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a Siamese Transformer network according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a high-frequency enhancement unit according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a multi-scale fused attention unit according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a deep feature guiding unit according to an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the drawings and the detailed description.
The invention discloses a remote sensing image change detection method based on a local-global Transformer network. First, considering that existing change detection methods rely heavily on CNNs for feature extraction and are therefore limited by the receptive field of the convolution kernel, which makes it difficult to capture long-range dependencies in the image, the invention proposes a local-global Transformer as the backbone network that considers both the local and the global features of the bi-temporal images. Second, a plug-and-play high-frequency enhancement unit and a multi-scale fusion attention unit are proposed to address the loss of boundary detail information in change regions and the multi-scale feature modeling of remote sensing ground objects. Finally, a deep feature guiding unit is proposed that uses deep abstract semantic features to guide and refine the fine-grained features of the shallow network layers, yielding finer detection results. These components are unified into an end-to-end Transformer-based remote sensing image change detection network that requires no cumbersome stage-by-stage training.
As shown in fig. 1, the present invention includes the steps of:
step (1): a group of double-phase remote sensing images are input as samples to test the algorithm. Because the neural network is very sensitive to data distribution, the data preprocessing operation is firstly carried out on the input double-phase remote sensing image. In order to facilitate subsequent batch training, all the double-temporal remote sensing images are cut into uniform 256×256 pixel sizes; random horizontal flipping, random vertical flipping, and random 0-180 ° rotation are used as data enhancement means to train a more powerful change detection model; the two-phase image is subjected to a data normalization step to obtain normalized image data before the data is fed into the model.
Step (2): with the normalized bi-temporal remote sensing image data from step (1), a local-global Transformer backbone network is built so that both the local high-frequency features and the global low-frequency features of the bi-temporal remote sensing images are attended to in the feature extraction stage.
As shown in fig. 2, the local-global Transformer network is divided into four stages; each time the image features pass through one stage, their spatial size is halved and the feature dimension is doubled, giving a more flexible feature representation. Specifically, the bi-temporal images I_1, I_2 ∈ R^(D×H×W) (D, H and W denote the number of channels, the height and the width, respectively) are first converted into image token sequences T_1 and T_2 by the image block embedding operation. In the present invention, a two-dimensional convolution with a 4×4 kernel and a stride of 4 implements the image block embedding, so the resulting token sequences have length (H/4)×(W/4). T_1 and T_2 then pass through four successive stages. In each stage, the invention proposes a local-global attention to extract the local and global features of the bi-temporal images effectively.
It consists of two Siamese Transformers: the first restricts the self-attention computation to a local window to model the local pixel relationships of the image, and the second performs the attention computation over all pixels of the image to model the global pixel relationships. Considering the high cost of computing self-attention directly on the shallow features of the network, only local attention is used in the first stage. In addition, there is a block merging operation between every two stages, which halves the feature map size and doubles the number of channels.
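The sketch below illustrates how an image block embedding (4×4 convolution with stride 4) and a block merging step (halving the feature map size while doubling the channels) could be written in PyTorch; the class names and the use of a strided 2×2 convolution for merging are illustrative assumptions rather than the patent's exact implementation.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Image block embedding: a 4x4 convolution with stride 4 turns a
    (B, D, H, W) image into a token sequence of length (H/4)*(W/4)."""
    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=4, stride=4)

    def forward(self, x):                        # x: (B, D, H, W)
        x = self.proj(x)                         # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)      # (B, N, C) token sequence

class PatchMerging(nn.Module):
    """Image block merging between stages: halves the feature-map size and
    doubles the channel number, here via a strided 2x2 convolution."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=2, stride=2)

    def forward(self, x, h, w):                  # x: (B, N, C) with N = h*w
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.reduction(x)                    # (B, 2C, h/2, w/2)
        return x.flatten(2).transpose(1, 2), h // 2, w // 2
```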
The structure of the Siamese Transformer is shown in FIG. 3. The module consists of two standard Transformer blocks with shared weights, and each Transformer block is divided by function into a token mixer and a channel mixer. Let the feature pair of the bi-temporal images be denoted X_1, X_2 ∈ R^(N×C), where N denotes the token sequence length and C the number of feature channels. X_1 and X_2 first pass through the token mixer, which captures the spatial feature representation of the bi-temporal images, and then through the channel mixer, which fuses feature information along the channel dimension.
Specifically, in the token mixer, X_1 first undergoes a layer normalization operation, and three input features Q, K and V are then obtained by linear transformation. Q, K and V are used to compute multi-head self-attention, which captures the relationships between all image tokens. Finally, X_1 is added to the attention output through a skip connection to obtain the output feature of the token mixer. X_2 goes through the same computation. The key step of this process is the multi-head self-attention, whose mathematical expression is shown in formulas (1)-(2):
SA_i(Q_i, K_i, V_i) = Softmax(Q_i K_i^T / sqrt(c)) V_i    (1)
MHSA_k(Q, K, V) = Φ(Concat(SA_1, SA_2, ..., SA_k))    (2)
wherein SA_i(·) and MHSA_k(·) denote the output features of the i-th self-attention head and of the multi-head self-attention with k attention heads, respectively, Q (Q_i), K (K_i) and V (V_i) denote the three input features query, key and value, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation.
The core of the channel mixer is a multi-layer perceptron consisting of two linear transformation layers and an intermediate activation function, which models the feature relationships along the channel dimension. In addition, to compensate for the Transformer's insensitivity to spatial position, the invention introduces a depthwise-convolution-based conditional position encoding into the multi-layer perceptron. The mathematical expression of the channel mixer is shown in formulas (3)-(5):
X'_m = Φ_1(norm(X_m))    (3)
pos = T_2(DW(T_1(X'_m)))    (4)
Y = Φ_2(σ(X'_m + pos)) + X_m    (5)
wherein X'_m, pos and Y denote the transition feature, the conditional position encoding feature and the output feature of the channel mixer, respectively, X_m denotes the input feature of the channel mixer, norm denotes layer normalization, Φ denotes a linear transformation, T denotes a reshaping operation with T_1 ∈ (1D→2D) and T_2 ∈ (2D→1D), DW denotes a depthwise convolution, and σ denotes GELU activation.
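As one illustrative reading of equations (1)-(5), the following PyTorch sketch shows a weight-shared Siamese Transformer block consisting of a token mixer (multi-head self-attention with a skip connection) and a channel mixer (a two-layer perceptron with a depthwise-convolution conditional position encoding). All class names, the hidden expansion ratio and the head count are assumptions made for illustration.

```python
import torch.nn as nn

class TokenMixer(nn.Module):
    """Multi-head self-attention over the token sequence (Eqs. (1)-(2))."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (B, N, C)
        y = self.norm(x)
        y, _ = self.attn(y, y, y)                  # Q, K, V from the same tokens
        return x + y                               # skip connection

class ChannelMixer(nn.Module):
    """MLP over the channel dimension with a depthwise-convolution
    conditional position encoding (Eqs. (3)-(5))."""
    def __init__(self, dim, hidden=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden * dim)
        self.dw = nn.Conv2d(hidden * dim, hidden * dim, 3, padding=1,
                            groups=hidden * dim)   # depthwise convolution
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden * dim, dim)

    def forward(self, x, h, w):                    # x: (B, N, C), N = h*w
        y = self.fc1(self.norm(x))                 # Eq. (3)
        b, n, c = y.shape
        pos = self.dw(y.transpose(1, 2).reshape(b, c, h, w))  # Eq. (4)
        pos = pos.flatten(2).transpose(1, 2)
        return x + self.fc2(self.act(y + pos))     # Eq. (5)

class SiameseTransformerBlock(nn.Module):
    """One weight-shared block applied to both temporal features."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.token_mixer = TokenMixer(dim, num_heads)
        self.channel_mixer = ChannelMixer(dim)

    def forward(self, x1, x2, h, w):
        x1 = self.channel_mixer(self.token_mixer(x1), h, w)
        x2 = self.channel_mixer(self.token_mixer(x2), h, w)
        return x1, x2
```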
Step (3): with the local and global features of the bi-temporal images obtained in step (2), a difference map is computed by a differencing operation; the high-frequency enhancement unit, the multi-scale fusion attention unit and the deep feature guiding unit are then built to complete the remote sensing image change detection model based on the local-global Transformer network.
Step (3.1): and building a high-frequency enhancement unit. In the change detection task, the detection effect of the edge of the change area has great influence on the performance index of the final model. Most existing change detection methods utilize complex U-shaped structures in combination with attention mechanisms to refine the change region boundaries, which generally optimize features from a global perspective, which is not targeted for boundary information modeling. Furthermore, the U-shaped structure tends to be complex in network design. Therefore, the present invention proposes a simple but effective high frequency enhancement unit that uses a self-attention mechanism and correlation between front and rear features to effectively extract high frequency feature information of the edges of the change region.
As shown in FIG. 4, assume that the input feature is X ∈ R^(C×H×W). X is first passed through a 3×3 convolution for differential feature optimization. To obtain high-frequency features, low-frequency features are captured with average pooling and multi-head self-attention, then up-sampled with bilinear interpolation, and finally subtracted from the shallow feature by a differential algebra operation. The high-frequency features and the shallow features are concatenated along the channel dimension and fused with a 3×3 convolution to obtain the output feature of the high-frequency enhancement unit. The mathematical expression of this process is shown in formulas (6)-(8):
E_s = W_1 X    (6)
E_H = E_s - Up(T_2(MHSA(T_1(Avg(E_s)))))    (7)
Y = W_2 Concat(E_s, E_H)    (8)
wherein X denotes the input feature, Y denotes the output feature, E_s and E_H denote the shallow differential feature and the high-frequency feature, respectively, W_1 and W_2 denote two 3×3 convolutions, T_1 and T_2 denote two reshaping operations, MHSA denotes the multi-head self-attention computation, Avg denotes average pooling, and Up denotes the up-sampling operation.
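A minimal PyTorch sketch of the high-frequency enhancement unit of equations (6)-(8) might look as follows; the pooled token size and the module name are assumptions, and torch.nn.MultiheadAttention stands in for the multi-head self-attention computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFrequencyEnhancement(nn.Module):
    """Sketch of the high-frequency enhancement unit (Eqs. (6)-(8)):
    low-frequency content is estimated by pooled self-attention and
    subtracted from the shallow differential feature."""
    def __init__(self, dim, num_heads=4, pooled_size=8):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 3, padding=1)            # W1
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv2 = nn.Conv2d(2 * dim, dim, 3, padding=1)         # W2
        self.pool = nn.AdaptiveAvgPool2d(pooled_size)

    def forward(self, x):                         # x: (B, C, H, W) difference feature
        e_s = self.conv1(x)                        # shallow differential feature, Eq. (6)
        b, c, h, w = e_s.shape
        low = self.pool(e_s)                       # average pooling
        tok = low.flatten(2).transpose(1, 2)       # reshape 2D -> 1D tokens (T1)
        tok, _ = self.attn(tok, tok, tok)          # multi-head self-attention
        low = tok.transpose(1, 2).reshape(b, c, *low.shape[2:])    # reshape back (T2)
        low = F.interpolate(low, size=(h, w), mode='bilinear',
                            align_corners=False)   # bilinear up-sampling
        e_h = e_s - low                            # high-frequency feature, Eq. (7)
        return self.conv2(torch.cat([e_s, e_h], dim=1))            # Eq. (8)
```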
Step (3.2): build the multi-scale fusion attention unit. Remote sensing images contain ground objects at many different scales, so multi-scale modeling capability is one of the important criteria for evaluating a remote sensing image change detection model. Most existing change detection algorithms address the multi-scale problem with multi-scale convolution or pooling, U-shaped fusion modules, and the like. Unlike these methods, the present invention proposes a new multi-scale fusion attention unit based on multi-head self-attention, whose structure is shown schematically in fig. 5.
The input features of this module come from the outputs of the last three stages of the local-global Transformer backbone in step (2), denoted X_2, X_3 and X_4. To calibrate the channels, the three input features are first projected to dimension C with a 1×1 convolution. The three two-dimensional image features are then converted into one-dimensional token sequences Tk_i by a reshaping operation. To let the feature information of the multi-scale tokens interact fully, the Tk_i are concatenated along the spatial dimension and the spatial relationships of the different-scale features are aggregated by N successive multi-head self-attention layers. Thanks to the long-range dependency modeling capability of self-attention, the relationships between all tokens are unified. Considering that the original feature distribution is easily altered after multi-head self-attention, the invention proposes an attention weighting mechanism that effectively alleviates this phenomenon.
Specifically, the multi-head self-attention output features are first split along the spatial dimension, and each one-dimensional token sequence is restored to a two-dimensional image feature by a reshaping operation. The two-dimensional image features are passed through a 1×1 convolution, batch normalization and Sigmoid activation to obtain the attention weights. To aggregate multi-scale features more fully, the invention proposes a scale calibration operation: the three different-scale features are fused at three different spatial sizes by average pooling and up-sampling to obtain three coarse-grained fusion features. These three features are then refined by a 3×3 convolution, batch normalization and a ReLU activation function. Finally, the three features are weighted by the corresponding attention weights to obtain the final fine-grained fusion features Y_i.
The mathematical expression of the above process is shown in formula (9):
Y_i = ReLU(BN(W_3×3 C_i)) * Sigmoid(BN(W_1×1 A_i))    (9)
wherein C_i denotes the calibration feature, A_i denotes the output feature of the multi-head self-attention, Y_i denotes the output feature of the multi-scale fusion attention unit, W_k×k denotes a k×k convolution, BN denotes batch normalization, * denotes the matrix Hadamard product, and i denotes the stage number, i ∈ {2, 3, 4};
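The following sketch illustrates one possible PyTorch reading of the multi-scale fusion attention unit: tokens from the three scales are concatenated, passed through N self-attention layers, split back, and used to re-weight scale-calibrated coarse fusion features as in equation (9). The class and parameter names are assumptions, and bilinear interpolation is used here for both down- and up-sampling in the scale calibration, which simplifies the average-pooling step described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionAttention(nn.Module):
    """Sketch of the multi-scale fusion attention unit: tokens from the
    last three stages interact through shared multi-head self-attention,
    and the resulting attention weights re-weight scale-calibrated
    coarse fusion features (Eq. (9))."""
    def __init__(self, dims, out_dim, num_heads=4, depth=2):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(d, out_dim, 1) for d in dims])
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(out_dim, num_heads, batch_first=True)
             for _ in range(depth)])
        self.weight = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(out_dim, out_dim, 1),
                           nn.BatchNorm2d(out_dim), nn.Sigmoid())
             for _ in range(3)])
        self.calib = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(out_dim, out_dim, 3, padding=1),
                           nn.BatchNorm2d(out_dim), nn.ReLU(inplace=True))
             for _ in range(3)])

    def forward(self, feats):                     # feats: [X2, X3, X4], (B, C_i, H_i, W_i)
        feats = [p(f) for p, f in zip(self.proj, feats)]      # 1x1 channel calibration
        sizes = [f.shape[-2:] for f in feats]
        tokens = [f.flatten(2).transpose(1, 2) for f in feats]
        lens = [t.shape[1] for t in tokens]
        x = torch.cat(tokens, dim=1)              # concatenate along the spatial dimension
        for attn in self.attn:                    # N rounds of multi-head self-attention
            x = x + attn(x, x, x)[0]
        tokens = torch.split(x, lens, dim=1)      # split back per scale
        outs = []
        for i, (t, (h, w)) in enumerate(zip(tokens, sizes)):
            a = t.transpose(1, 2).reshape(t.shape[0], -1, h, w)
            # scale calibration: fuse the three inputs at this spatial size
            m = sum(F.interpolate(f, size=(h, w), mode='bilinear',
                                  align_corners=False) for f in feats)
            outs.append(self.calib[i](m) * self.weight[i](a))   # Eq. (9)
        return outs                               # fine-grained fusion features Y_2..Y_4
```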
step (3.3): and constructing a deep feature guiding unit. The shallow features of the network typically contain more fine-grained information such as texture, color, boundaries, etc., and additionally contain more background interference information. Deep features of the network have semantic directionality, often containing more abstract semantic information. From this point of view, better detection accuracy can be obtained by guiding shallow detail features with deep semantic features. Existing work has incorporated this idea into model designs, but they tend to consider only deep semantic features at a single scale. The invention provides a novel deep feature guiding unit which can effectively optimize shallow fine granularity features by utilizing deep multi-scale semantic features to obtain a finer detection diagram, and the structural schematic diagram of the novel deep feature guiding unit is shown in fig. 6.
The overall structure is somewhat similar to the multiscale fused attention unit of step (3.2). Likewise, a scale calibration operation is used to fuse the three output features of the multiscale fused attention unit in the spatial dimension of stage 1. To migrate semantic information of deep features to shallow features, a similar attention weighting mechanism is introduced. In contrast, multi-headed self-attention is replaced by multi-headed cross-attention.
The input of the multi-head cross attention is likewise three features, denoted Q, K and V. Q is derived from the shallow features of stage 1, and K and V are derived from the calibrated deep fusion features. Through the interaction of Q with K and V, the multi-head cross attention can fully model the semantic relationship between the shallow features and the multi-scale deep features; its mathematical expression is shown in formulas (10)-(12):
ReLU(BN(W_3×3(X_1 + M))) * Sigmoid(BN(W_1×1 MHCA(X_1, M)))    (10)
CA_j(Q_j, K_j, V_j) = Softmax(Q_j K_j^T / sqrt(c)) V_j    (11)
MHCA_k(Q, K, V) = Φ(Concat(CA_1, CA_2, ..., CA_k))    (12)
wherein X_1 denotes the high-frequency features of the first stage, M denotes the fused calibration features of the last three stages, MHCA denotes multi-head cross attention, CA_j(·) and MHCA_k(·) denote the output features of the j-th cross attention head and of the multi-head cross attention with k attention heads, respectively, Q (Q_j), K (K_j) and V (V_j) denote the three input features query, key and value, where Q is obtained from X_1 by linear transformation and K and V are obtained from M by linear transformation, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation.
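A corresponding sketch of the deep feature guiding unit (equations (10)-(12)) is given below, with the stage-1 high-frequency feature as the query and the calibrated deep fusion features as key and value of a multi-head cross attention; it assumes that all input features have already been projected to a common channel dimension, and the module name is illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class DeepFeatureGuidance(nn.Module):
    """Sketch of the deep feature guiding unit (Eqs. (10)-(12)): shallow
    first-stage high-frequency features query the calibrated deep fusion
    features through multi-head cross attention."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.weight = nn.Sequential(nn.Conv2d(dim, dim, 1),
                                    nn.BatchNorm2d(dim), nn.Sigmoid())
        self.refine = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),
                                    nn.BatchNorm2d(dim), nn.ReLU(inplace=True))

    def forward(self, x1, deep_feats):            # x1: (B, C, H1, W1) stage-1 feature
        b, c, h, w = x1.shape
        # scale calibration: resize and sum the three deep fusion features (M)
        m = sum(F.interpolate(f, size=(h, w), mode='bilinear',
                              align_corners=False) for f in deep_feats)
        q = x1.flatten(2).transpose(1, 2)         # query from the shallow feature
        kv = m.flatten(2).transpose(1, 2)         # key/value from the deep features
        a, _ = self.cross_attn(q, kv, kv)         # Eqs. (11)-(12)
        a = a.transpose(1, 2).reshape(b, c, h, w)
        return self.refine(x1 + m) * self.weight(a)   # Eq. (10)
```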
Step (4): build the remote sensing image change detection model based on the local-global Transformer network and train it with the normalized bi-temporal image data. Fig. 2 shows the model proposed by the invention. Experiments are conducted on four binary remote sensing image change detection datasets, CDD, BTCDD, LEVIR-CD and Google; all experiments are carried out with the PyTorch framework on an NVIDIA TITAN RTX GPU. Binary cross entropy is used as the training loss function and AdamW as the optimizer, with a weight decay coefficient of 0.01, a learning rate of 0.0001, a batch size of 16 and 200 training epochs. Finally, the model output features are activated with a Sigmoid to obtain the final prediction, whose values are all compressed to between 0 and 1. Regions with pixel values of 0.5 or above are judged to be changed (shown in white), and regions with pixel values below 0.5 are judged to be unchanged (shown in black).
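The training configuration described above (binary cross entropy, AdamW, learning rate 0.0001, weight decay 0.01, 200 epochs, Sigmoid with a 0.5 threshold) could be sketched as follows; nn.BCEWithLogitsLoss is used here to fold the Sigmoid into the loss for numerical stability, and the function names and data-loader interface are assumptions, not part of the patent.

```python
import torch
import torch.nn as nn

def train_change_detector(model, train_loader, device="cuda", epochs=200):
    """Minimal training loop matching the settings described above:
    binary cross entropy, AdamW with lr 1e-4 and weight decay 0.01."""
    model = model.to(device)
    criterion = nn.BCEWithLogitsLoss()            # BCE on the change map
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
    for _ in range(epochs):
        model.train()
        for img_a, img_b, label in train_loader:  # bi-temporal pair + change mask
            img_a, img_b = img_a.to(device), img_b.to(device)
            label = label.to(device).float()
            logits = model(img_a, img_b)          # model output before Sigmoid
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

def predict(model, img_a, img_b, threshold=0.5):
    """Apply Sigmoid and binarize: >= 0.5 means 'changed' (white), else unchanged."""
    with torch.no_grad():
        prob = torch.sigmoid(model(img_a, img_b))
    return (prob >= threshold).float()
```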
Step (5): evaluate the change detection performance of the model. Step (4) yields 200 trained change detection models; the best model is selected from these 200 models according to five evaluation metrics, namely precision (PR), recall (RC), F1 score (F1), intersection over union (IoU) and overall accuracy (OA), and the same five metrics are used to evaluate the performance of the best model. Their mathematical expressions are shown in formulas (13)-(17):
PR = TP / (TP + FP)    (13)
RC = TP / (TP + FN)    (14)
F1 = 2 × PR × RC / (PR + RC)    (15)
IoU = TP / (TP + FP + FN)    (16)
OA = (TP + TN) / (TP + TN + FP + FN)    (17)
wherein TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively.
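The five evaluation metrics of formulas (13)-(17) can be computed from the binary prediction and ground-truth maps as in the following sketch; the function name and the small epsilon added to avoid division by zero are assumptions.

```python
import numpy as np

def change_detection_metrics(pred, gt):
    """Compute PR, RC, F1, IoU and OA (Eqs. (13)-(17)) from binary maps.

    pred, gt: numpy arrays of 0/1 values with identical shape.
    """
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    eps = 1e-10                                   # avoid division by zero
    pr = tp / (tp + fp + eps)
    rc = tp / (tp + fn + eps)
    f1 = 2 * pr * rc / (pr + rc + eps)
    iou = tp / (tp + fp + fn + eps)
    oa = (tp + tn) / (tp + tn + fp + fn + eps)
    return {"PR": pr, "RC": rc, "F1": f1, "IoU": iou, "OA": oa}
```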
By implementing the model of the invention, the final change detection accuracy is greatly improved. The invention proposes a local-global Transformer as the backbone network to extract features of the bi-temporal remote sensing images, effectively capturing both their local and global features; a high-frequency enhancement unit replaces the complex U-shaped network structure and refines the high-frequency features of the change-region boundaries in a simple but efficient manner; a multi-scale fusion attention unit lets multi-scale features interact fully through a multi-head self-attention mechanism; and a deep feature guiding unit uses deep multi-scale semantic features and a multi-head cross attention mechanism to guide and refine the fine-grained features of the shallow network layers, yielding a more refined detection result. Together, these components constitute the remote sensing image change detection method based on the local-global Transformer network.

Claims (10)

1. A remote sensing image change detection method based on a local-global Transformer network, characterized by comprising the following steps:
S1, preprocessing multiple groups of image sample data to obtain multiple groups of normalized image data, wherein each group of image sample data comprises a bi-temporal remote sensing image pair consisting of a pre-temporal remote sensing image and a post-temporal remote sensing image;
S2, inputting each group of normalized image data into a local-global Transformer network in sequence, wherein the local-global Transformer network comprises an image block embedding and a backbone network; the backbone network comprises four stages, namely a first stage, a second stage, a third stage and a fourth stage; the pre-temporal and post-temporal remote sensing images after image block embedding are input into the first stage, and the output of each stage serves as the input of the next; each stage comprises two Siamese Transformer models, the first of which restricts self-attention computation to a local window to model the local pixel relationships of the input image, while the second performs attention computation over all pixels of the image to model its global pixel relationships; between every two stages there is an image block merging operation, which halves the feature map size and doubles the number of channels;
S3, performing a differencing operation on the outputs of the pre-temporal and post-temporal remote sensing images at each stage, and feeding the result into a high-frequency enhancement unit to obtain high-frequency features of the change-region edges at each stage;
S4, inputting the high-frequency features of the second, third and fourth stages into a multi-scale fusion attention unit to obtain fine-grained fusion features for each of these stages;
S5, inputting the high-frequency features of the change-region edges from the first stage together with the fine-grained fusion features obtained in the second, third and fourth stages into a deep feature guiding unit to obtain a finer detection map;
S6, fusing the output of the deep feature guiding unit with the fine-grained fusion features obtained in the second, third and fourth stages to obtain the model output features;
S7, training the model with training data and then testing it to obtain the final prediction result.
2. The method as recited in claim 1, wherein the first stage comprises two Siamese Transformer models, and both restrict self-attention computation to a local window to model the local pixel relationships of the input image.
3. The method according to claim 1 or 2, characterized in that the image block embedding is implemented as a two-dimensional convolution with a 4×4 kernel and a stride of 4: the bi-temporal images I_1, I_2 ∈ R^(D×H×W), where D, H and W denote the number of channels, the height and the width, respectively, are converted into image token sequences by the image block embedding operation.
4. The method according to claim 1 or 2, characterized in that the Siamese Transformer model comprises two standard Transformer blocks with shared weights, each of which is divided by function into a token mixer and a channel mixer; the token mixer captures the spatial feature representation of the bi-temporal images: a layer normalization is first applied to the feature pair of the bi-temporal images, three input features Q, K and V are then obtained by linear transformation and fed into the self-attention computation, and finally the features of the bi-temporal images are added to the self-attention output through a skip connection to yield the token mixer output features;
the channel mixer fuses features along the channel dimension: the token mixer output features are first normalized and passed through the first multi-layer perceptron, a depthwise convolution is applied, the result of the depthwise convolution is added to the output of the first multi-layer perceptron, an activation function is applied, and the result is fed into the second multi-layer perceptron; the output of the second multi-layer perceptron is added to the token mixer output features to obtain the channel mixer output features Y; both the first and the second multi-layer perceptron are linear transformations.
5. The method of claim 4, wherein the token mixer uses a self-attention mechanism to fully capture the global spatial feature relationships of the image, the mathematical expression of which is shown in formulas (1)-(2):
SA_i(Q_i, K_i, V_i) = Softmax(Q_i K_i^T / sqrt(c)) V_i    (1)
MHSA_k(Q, K, V) = Φ(Concat(SA_1, SA_2, ..., SA_k))    (2)
wherein SA_i(·) and MHSA_k(·) denote the output features of the i-th self-attention head and of the multi-head self-attention with k attention heads, respectively, Q (Q_i), K (K_i) and V (V_i) denote the three input features query, key and value, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation; the channel mixer uses a multi-layer perceptron to effectively fuse features along the channel dimension, the mathematical expression of which is shown in formulas (3)-(5):
X'_m = Φ_1(norm(X_m))    (3)
pos = T_2(DW(T_1(X'_m)))    (4)
Y = Φ_2(σ(X'_m + pos)) + X_m    (5)
wherein X'_m, pos and Y denote the transition feature (i.e. the output feature of the first multi-layer perceptron), the conditional position encoding feature and the output feature of the channel mixer, respectively, X_m denotes the input feature of the channel mixer, m ∈ {1, 2}, norm denotes layer normalization, Φ denotes a linear transformation, T denotes a reshaping operation with T_1 ∈ (1D→2D) and T_2 ∈ (2D→1D), DW denotes a depthwise convolution, and σ denotes GELU activation.
6. The method according to claim 1 or 2, wherein the high frequency enhancement unit comprises:
the input feature X is passed through a 3×3 convolution for differential feature optimization, yielding the shallow differential feature E_s; to obtain high-frequency features, low-frequency features are first captured with average pooling and multi-head self-attention and then up-sampled by bilinear interpolation to obtain an intermediate feature; the intermediate feature is subtracted from the shallow differential feature E_s to obtain the high-frequency feature E_H; E_H and E_s are concatenated along the channel dimension and fused with a 3×3 convolution to obtain the output feature Y of the high-frequency enhancement unit;
in formulas:
E_s = W_1 X    (6)
E_H = E_s - Up(T_2(MHSA(T_1(Avg(E_s)))))    (7)
Y = W_2 Concat(E_s, E_H)    (8)
wherein X denotes the input feature, Y denotes the output feature, E_s and E_H denote the shallow differential feature and the high-frequency feature, respectively, W_1 and W_2 denote two 3×3 convolutions, T_1 and T_2 denote two reshaping operations, MHSA denotes the multi-head self-attention computation, Avg denotes average pooling, and Up denotes the up-sampling operation.
7. The method according to claim 1 or 2, wherein in the multi-scale fusion attention unit the three input features are each passed through a 1×1 convolution, and the three two-dimensional image features are then converted into one-dimensional token sequences by a reshaping operation; the one-dimensional token sequences are concatenated along the spatial dimension and the spatial relationships of the different-scale features are aggregated by N successive multi-head self-attention layers; the multi-head self-attention output features are split along the spatial dimension and each one-dimensional token sequence is restored to a two-dimensional image feature by a reshaping operation; the two-dimensional image features are passed through a 1×1 convolution, batch normalization and Sigmoid activation to obtain the corresponding attention weights; the three input features of different scales are fused at three different spatial sizes by average pooling and up-sampling to obtain three coarse-grained fusion features, which are refined by a 3×3 convolution, batch normalization and a ReLU activation function to obtain the corresponding calibration features; finally, the three calibration features are weighted by the corresponding attention weights to obtain the final fine-grained fusion features.
8. The method of claim 7, wherein the multi-scale fusion attention unit is expressed as shown in formula (9):
Y_i = ReLU(BN(W_3×3 C_i)) * Sigmoid(BN(W_1×1 A_i))    (9)
wherein C_i denotes the calibration feature, A_i denotes the output feature of the multi-head self-attention, Y_i denotes the output feature of the multi-scale fusion attention unit, W_k×k denotes a k×k convolution, * denotes the matrix Hadamard product, i denotes the stage number with i ∈ {2, 3, 4}, and BN denotes batch normalization.
9. The method according to claim 1 or 2, wherein the deep feature guiding unit performs scale calibration on the fine-grained fusion features obtained in the second, third and fourth stages to obtain the calibrated deep fusion features; cross attention is computed between the calibrated deep fusion features and the high-frequency features of the change-region edges from the first stage, followed by a 1×1 convolution, batch normalization and Sigmoid activation to obtain the corresponding attention weights; the calibrated deep fusion features and the first-stage high-frequency edge features are added and refined by a 3×3 convolution, batch normalization and a ReLU activation function to obtain the corresponding optimized features; finally, the attention weights and the optimized features are combined by attention weighting to obtain a finer detection map.
10. The method of claim 9, wherein the deep feature guiding unit is expressed as shown in formulas (10)-(12):
ReLU(BN(W_3×3(X_1 + M))) * Sigmoid(BN(W_1×1 MHCA(X_1, M)))    (10)
CA_j(Q_j, K_j, V_j) = Softmax(Q_j K_j^T / sqrt(c)) V_j    (11)
MHCA_k(Q, K, V) = Φ(Concat(CA_1, CA_2, ..., CA_k))    (12)
wherein X_1 denotes the high-frequency features of the first stage, M denotes the fused calibration features of the last three stages, MHCA denotes multi-head cross attention, CA_j(·) and MHCA_k(·) denote the output features of the j-th cross attention head and of the multi-head cross attention with k attention heads, respectively, Q (Q_j), K (K_j) and V (V_j) denote the three input features query, key and value, where Q is obtained from X_1 by linear transformation and K and V are obtained from M by linear transformation, c denotes the feature dimension of each attention head, and Φ denotes a linear transformation.
CN202310470097.1A 2023-04-27 2023-04-27 Remote sensing image change detection method based on local-global Transformer network Pending CN116434069A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310470097.1A CN116434069A (en) Remote sensing image change detection method based on local-global Transformer network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310470097.1A CN116434069A (en) Remote sensing image change detection method based on local-global Transformer network

Publications (1)

Publication Number Publication Date
CN116434069A true CN116434069A (en) 2023-07-14

Family

ID=87087196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310470097.1A Pending CN116434069A (en) Remote sensing image change detection method based on local-global Transformer network

Country Status (1)

Country Link
CN (1) CN116434069A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116994202A (en) * 2023-08-03 2023-11-03 杭州宸悦智能工程有限公司 Intelligent car washer and system thereof
CN116994202B (en) * 2023-08-03 2024-03-15 杭州宸悦智能工程有限公司 Intelligent car washer and system thereof
CN117237740A (en) * 2023-11-07 2023-12-15 山东军地信息技术集团有限公司 SAR image classification method based on CNN and Transformer
CN117237740B (en) * 2023-11-07 2024-03-01 山东军地信息技术集团有限公司 SAR image classification method based on CNN and Transformer
CN117612024A (en) * 2023-11-23 2024-02-27 国网江苏省电力有限公司扬州供电分公司 Remote sensing image roof recognition method and system based on multi-scale attention
CN117612024B (en) * 2023-11-23 2024-06-07 国网江苏省电力有限公司扬州供电分公司 Remote sensing image roof recognition method based on multi-scale attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination