CN115147760B - High-resolution remote sensing image change detection method based on video understanding and space-time decoupling - Google Patents

High-resolution remote sensing image change detection method based on video understanding and space-time decoupling

Info

Publication number
CN115147760B
CN115147760B (application CN202210742299.2A)
Authority
CN
China
Prior art keywords
time
space
encoder
output
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210742299.2A
Other languages
Chinese (zh)
Other versions
CN115147760A (en)
Inventor
张洪艳
林漫晖
杨光义
张良培
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210742299.2A priority Critical patent/CN115147760B/en
Publication of CN115147760A publication Critical patent/CN115147760A/en
Application granted granted Critical
Publication of CN115147760B publication Critical patent/CN115147760B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to a high-resolution remote sensing image change detection method based on video understanding and space-time decoupling. Because a bi-temporal high-resolution remote sensing image pair is structurally unbalanced between its spatial and temporal dimensions, the invention adopts a time-sequence linear interpolation strategy to construct a pseudo video frame sequence and expand the temporal dimension, making it possible to process the change detection task with a video understanding algorithm. Exploiting the fact that the change detection task focuses on spatio-temporal information, the invention proposes a space-time decoupled encoder design, so that the network attends to only one dimension of the problem at a time, which relieves the burden on the decoder and improves the detection result. Meanwhile, to promote information exchange between the spatial and temporal encoders, the invention provides a temporal aggregation module placed in the side-path connection from the temporal encoder to the spatial encoder, improving the consistency between the spatial and temporal features. In addition, the invention uses deep supervision to accelerate the convergence of the deep model and to address the insufficient effectiveness of the features in the middle layers of the model.

Description

High-resolution remote sensing image change detection method based on video understanding and space-time decoupling
Technical Field
The invention relates to the field of remote sensing image change detection, and in particular to a method that converts an image pair into a pseudo video frame sequence through time-sequence interpolation and, on that basis, builds a space-time decoupled network structure to detect changes in remote sensing images. The whole change detection network is trained in a data-driven manner, realizing accurate extraction of the change information in high-resolution remote sensing image pairs.
Background
Remote sensing image change detection aims at analyzing the state changes of ground objects in a region by repeatedly observing the same region at different times. Since the 1970s, researchers worldwide have analyzed remote sensing data from different sources and from different angles, and a large number of models and methods have been proposed. With the progress of satellite sensor and signal transmission technology, remote sensing images have become ever easier to acquire and their spatial resolution has kept improving. The increasingly abundant high-resolution remote sensing data bring both opportunities and challenges to remote sensing image change detection. On the one hand, compared with medium- and low-resolution imagery, high-resolution imagery provides richer ground-object detail and spatial distribution information, so that fine changes can be found and the boundaries of changed objects can be located more precisely. On the other hand, in high-resolution imagery the same ground object usually appears as a planar region, the assumption that pixels are independent of each other no longer holds, the gray values of pixels within one object fluctuate with the material and reflection characteristics of the target, and the phenomenon of "same object with different spectra, different objects with the same spectrum" is more pronounced than in medium- and low-resolution imagery, so the difficulty of detecting change regions increases sharply. In recent years, with the rise and maturation of deep learning in artificial intelligence, change detection in high-resolution remote sensing images has found new solutions. Deep learning methods train network models on large numbers of samples, giving the model the ability to extract more discriminative features and avoiding tedious and inefficient manual feature engineering. Meanwhile, compared with traditional algorithms, deep learning architectures offer higher parallelism and excellent end-to-end properties, enabling efficient and accurate inference. Considering the massive and multi-dimensional nature of remote sensing imagery, deep learning is well suited to learning and optimization for the remote sensing image change detection task. Research on deep-learning-based high-resolution remote sensing image change detection can therefore greatly improve detection accuracy while accelerating the intelligent and automated processing of change detection, and thus has high application value.
In general, current remote sensing image change detection algorithms can be categorized into three categories:
Conventional pixel-based change detection methods: pixel-based change detection methods were the earliest developed and remain the most diverse. As the main approach to change detection in medium- and low-resolution imagery, the most representative techniques, namely image algebra, image transformation, and post-classification comparison, are all pixel-based. Image algebra obtains a change intensity map through pixel-by-pixel algebraic operations between the corresponding bands of the two images, an idea that has influenced many more advanced change detection algorithms proposed later. Image transformation methods start from the statistical structure of the data and extract pixel-level change information after mathematically transforming the input images according to theories such as principal component analysis and slow feature analysis. Post-classification comparison first classifies the image of each phase independently and then derives the change detection result from the classification results. Traditional pixel-level change detection methods are easy to operate and convenient to implement. However, because of the complexity of ground objects in high-resolution remote sensing images, these methods are only suitable for images with few bands and relatively simple scenes, and large areas of false and missed detections are difficult to avoid in varied scenes.
Object-based traditional change detection methods: object-based change detection algorithms overcome the poor performance of pixel-based methods on high-resolution imagery. These methods first segment objects according to the spatial and spectral properties of pixels and then use objects instead of pixels as the basic processing unit in detection. Compared with pixel-based methods, object-based change detection has natural advantages in accurately delineating object boundaries and keeping the change state consistent inside each object, but it has two main limitations: first, the detection result is strongly limited by the quality of the segmented objects; second, object-based change detection usually adopts a two-stage structure of segmenting objects and then extracting change features rather than end-to-end framework optimization, which easily accumulates errors and makes a globally optimal solution hard to obtain.
Deep-learning-based change detection methods: such methods can be subdivided into three subclasses: feature-based methods, patch-based methods, and full-image-based methods. Deep-learning-based change detection can exploit the rich spatial and spectral information of high-resolution imagery to obtain more effective and robust features, thereby overcoming the shortcomings of traditional algorithms to some extent. However, most existing deep-learning-based change detection algorithms focus either on using spatial and spectral information more efficiently or on domain alignment between images of different dates. They mostly handle change detection with semantic segmentation or metric learning models and rely on pixel-wise differencing or channel-dimension concatenation as the main means of extracting temporal information, lacking explicit modeling of the temporal process and ignoring the essence that "change is a process". A small number of algorithms do consider temporal information explicitly, but their modeling of the time dimension is incomplete, which limits the achievable performance gain.
Therefore, there is still great room for improvement in deep-learning-based change detection algorithms, and it is necessary to develop high-resolution remote sensing image change detection algorithms that take time-sequence information mining into account.
Disclosure of Invention
Aiming at the shortcomings of existing deep-learning-based remote sensing image change detection algorithms, the invention provides a high-resolution remote sensing image change detection method based on video understanding and space-time decoupling: a video frame sequence is constructed from the image pair, the spatial and temporal dimensions of the multi-temporal images are decoupled, and a dual-stream structure with separate spatial and temporal branches is designed, so that spatio-temporal information is fully mined and change regions are located more accurately.
The technical scheme of the invention provides a high-resolution remote sensing image change detection method based on video understanding and space-time decoupling, which comprises the following steps of:
step 1, according to an input double-phase remote sensing image pair, a pseudo video frame sequence is obtained through a time sequence interpolation strategy, and each image in the pseudo video frame sequence has the same spatial size and numerical range as an original image;
Step 2, constructing a time encoder and a space encoder, wherein the time encoder receives a pseudo video frame sequence as input, firstly performs downsampling operation, then extracts characteristics through a cascaded three-dimensional convolution layer, the space encoder receives an original double-phase remote sensing image pair as input, extracts the characteristics through a two-dimensional convolution layer, sets unidirectional side-path connection between the two encoders, and processes the characteristics transferred from the time encoder to the space encoder through a time sequence aggregation module (temporal aggregation module, TAM);
Step 3, constructing a progressive decoder, connecting the output of each level module of the space encoder with the input of each level module in the decoder, and outputting a single-channel change probability map by the final convolution layer of the decoder;
Step 4, adding an additional convolution layer at the tail end of the time encoder, applying depth supervision to the output of the time encoder, constructing a joint loss function, and optimizing weight parameters of the whole network by using a gradient descent method until the loss converges; the convolution layer at the end of the time encoder only provides additional output in the model training stage, and a single-channel change probability map of the final convolution layer output of the decoder is still used as the final output of the network in the model reasoning stage.
Further, in step 1, assuming that the first phase original image is I 1, the second phase original image is I 2, and the video contains N frames, the interpolation formula for the nth frame image F n is:
Further, in step 2, the input of the spatial encoder is:
Xs=concat(I1,I2) (2)
where I1 is the first-phase original image, I2 is the second-phase original image, and concat() denotes concatenation along the channel dimension. The basic building block of the spatial encoder is the spatial module S-Block, which is divided into two types, S-Block I and S-Block II. The initial part of both types consists of two cascaded convolution layers with corresponding BN layers and ReLU activation functions, and the tail part is a max pooling layer. Compared with S-Block I, S-Block II has an additional convolution layer with its corresponding BN layer and ReLU activation function, giving it stronger feature extraction and fitting capability, and a residual connection is added between the output of the first ReLU activation function and the output of the last normalization layer. Two S-Block I and one S-Block II are connected in sequence to form the spatial encoder.
Further, the inputs to the temporal encoder are:
Xt=stack(F0,F1,F2,...,FN-1) (3)
where stack() denotes stacking images along a new dimension. The basic building block of the temporal encoder is the temporal module T-Block. A T-Block first uses a 1×1 convolution layer to reduce the number of channels of the input features; the features are then passed through a 3×3 convolution layer to encode the spatial context and fully mine the change information; finally another 1×1 convolution layer increases the number of feature channels and thus the capacity of the model. A BN layer is added after each convolution layer, and ReLU activation functions are added after the first and second BN layers. In addition, to alleviate the gradient vanishing problem and improve the convergence behavior of the module, a residual connection is added between the input and the output of the T-Block, whose residual branch uses a 1×1 convolution layer and a BN layer to match the number of feature channels. A downsampling module composed of a convolution layer, a BN layer and a ReLU activation function connected in series is added before the first T-Block of the temporal encoder; the convolution kernel size is set to 3×9×9 and the stride to 1×4×4, so that after the downsampling module the spatial resolution of the input video frame sequence is reduced by a factor of 4, which lowers the degree of attention the temporal encoder pays to spatial information and realizes explicit space-time decoupling.
Further, the temporal aggregation module TAM first applies global max pooling and global average pooling along the time dimension to the features output by a given level of the temporal encoder, obtaining a compact representation of the temporal change information contained in the T-Block features; the two pooling results are then concatenated along the channel dimension to obtain an aggregated feature; finally, the aggregated feature is transformed point-wise by a convolution layer with kernel size 1×1, a batch normalization layer and a ReLU activation function to obtain the final output.
Further, in step 3, the progressive decoder consists of a convolution layer followed by several decoding modules (D-Blocks), the total number of D-Blocks being one more than the number of S-Blocks in the spatial encoder. Each D-Block receives two inputs, namely the output of the previous D-Block and the output of the S-Block at the same level: it first upsamples the higher-level decoding features, then concatenates the upsampling result with the same-level coding features along the channel dimension, and finally fuses the features with two convolution layers, each followed by a BN layer and a ReLU activation function, with a residual connection added between the two convolution layers to alleviate the gradient vanishing problem.
Further, in step 4, the entire change detection network is trained by minimizing a joint loss function, which can be expressed as:
L=l(Pfinal,R)+λl(Pinter,R) (4)
where l denotes the specific loss function applied to each output-truth label pair, Pfinal and Pinter denote the change probability maps of the final model output (i.e., the decoder output) and the bypass output (i.e., the temporal encoder output), respectively, R denotes the ground-truth change label, and λ is the weight coefficient of the auxiliary loss; class-balanced cross entropy loss is chosen as the specific loss type:
l(P,R) = -(1/(H·W))·Σi=1..H Σj=1..W [wc·Rij·log(Pij) + wu·(1-Rij)·log(1-Pij)] (5)
where H and W denote the height and width of the image, i and j index the i-th row and j-th column, and wc and wu are the class weight coefficients of the changed and unchanged classes, respectively; an Adam optimizer is used to perform the gradient updates that minimize the loss function.
Further, the spatial encoder includes 3 S-Blocks, and the numbers of output channels of the three S-Blocks are set to 32, 64 and 128, respectively.
Further, the temporal encoder comprises 1 downsampling module and 4 T-Blocks; the number of output channels of the downsampling module is set to 64, and the numbers of output channels of the T-Blocks are set to 256, 512 and 512, respectively.
According to the high-resolution remote sensing image change detection method based on video understanding and space-time decoupling, the change detection problem is converted into the dense classification problem of videos for the first time, and a space encoder and a time encoder are used for respectively processing double-time-phase input images and constructed pseudo video frame sequences, so that explicit decoupling of time and space dimensions is achieved.
Meanwhile, considering that ground-object scenes in high-resolution remote sensing images are complex and false changes easily appear in detection results, the proposed method makes full use of the rich information of multi-temporal remote sensing images, relieves the structural imbalance of the original data between the temporal and spatial dimensions through time-sequence interpolation, and lets the model concentrate on refining the change information, thereby suppressing false changes in the results. The proposed method is important for downstream applications of high-resolution remote sensing imagery, such as urban building change monitoring and disaster monitoring. Therefore, the high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling has significant academic value and practical importance.
The invention not only provides a new deep learning change detection paradigm that differs from existing deep learning change detection frameworks and emphasizes the full use of time-sequence information, but also, for the first time, explicitly considers the space-time coupling problem inherent in the change detection task and provides an effective solution for accurately delineating the boundaries of changed objects.
Drawings
FIG. 1 is an overall network architecture diagram of the present invention;
FIG. 2 is a block diagram of the basic components of a spatial encoder, a temporal encoder, and a progressive decoder;
FIG. 3 is a block diagram of the timing interpolation module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, a method for detecting a change in a remote sensing image based on video understanding and space-time decoupling will be described in detail with reference to the accompanying drawings. It should be understood that the specific examples described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a remote sensing image change detection method based on video understanding and space-time decoupling, which regards "change" as a continuous "temporal process" rather than a discrete "state change", thereby modeling the change detection problem as a video understanding task. The bi-temporal image pair is converted into a pseudo video frame sequence through time-sequence interpolation, which relieves the imbalance between the temporal and spatial structure of the original data. A space-time decoupled encoder is adopted, with a temporal encoder and a spatial encoder handling the temporal and spatial information respectively, so that the model attends to only one dimension at a time and the adverse effects of space-time coupling are avoided. A side-path connection is added between the two encoders to promote interaction between the spatio-temporal features. In addition, the invention applies deep supervision to the output of the temporal encoder, forcing it to learn more useful features and speeding up the training process.
Step 1, according to an input double-phase remote sensing image pair, a pseudo video frame sequence is obtained through a time sequence interpolation strategy, and each image in the pseudo video frame sequence has the same spatial size and numerical range as an original image;
Step 2, constructing a time encoder and a space encoder, wherein the time encoder receives a pseudo video frame sequence as input, firstly performs downsampling operation, then extracts characteristics through a cascaded three-dimensional convolution layer, the space encoder receives an original double-phase remote sensing image pair as input, extracts the characteristics through a two-dimensional convolution layer, sets unidirectional side-path connection between the two encoders, and processes the characteristics transferred from the time encoder to the space encoder through a time sequence aggregation module (temporal aggregation module, TAM);
Step 3, constructing a progressive decoder, connecting the output of each level module of the space encoder with the input of each level module in the decoder, and outputting a single-channel change probability map by the final convolution layer of the decoder;
And 4, adding an additional convolution layer at the tail end of the time encoder, applying depth supervision to the output of the time encoder, constructing a joint loss function, and optimizing weight parameters of the whole network by using a gradient descent method until the loss converges. The convolutional layer at the end of the time encoder provides additional output only during the model training phase, and still uses the single-channel variation probability map of the final convolutional layer output of the decoder as the final network output during the model reasoning phase.
Further, in step 1, assuming that the first-phase original image is I1, the second-phase original image is I2, and the video contains N frames, the interpolation formula for the n-th frame image Fn is:
Fn = (1 - n/(N-1))·I1 + (n/(N-1))·I2, n = 0, 1, ..., N-1 (6)
Further, in step 2, the input of the spatial encoder is:
Xs=concat(I1,I2) (7)
where concat() denotes concatenation along the channel dimension. The basic building block of the spatial encoder is the spatial module S-Block, which can be divided into two types, S-Block I and S-Block II, whose structures are shown in FIG. 2(a) and FIG. 2(b), respectively. The initial part of both types consists of two cascaded convolution layers (with corresponding BN layers and ReLU activation functions), and the tail part is a max pooling layer. Compared with S-Block I, S-Block II has an additional convolution layer with its corresponding BN layer and ReLU activation function, and therefore stronger feature extraction and fitting capability, with a residual connection added between the output of the first ReLU activation function and the output of the last normalization layer. Two S-Block I and one S-Block II are connected in sequence to form the spatial encoder.
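By way of illustration only, the following PyTorch sketch shows one possible realization of the two spatial modules described above; the 3×3 kernel size, padding, and 2×2 pooling window are assumptions not stated in the text, and the class names SBlockI and SBlockII are illustrative.
```python
# A rough sketch of the spatial modules, assuming 3x3 convolutions and 2x2 max pooling.
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class SBlockI(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        # two cascaded conv+BN+ReLU groups followed by max pooling
        self.body = nn.Sequential(conv_bn_relu(cin, cout), conv_bn_relu(cout, cout))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        return self.pool(self.body(x))

class SBlockII(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        self.conv1 = conv_bn_relu(cin, cout)   # output of the first ReLU
        self.conv2 = conv_bn_relu(cout, cout)  # additional conv+BN+ReLU of S-Block II
        self.conv3 = nn.Sequential(nn.Conv2d(cout, cout, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(cout))  # last normalization layer
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        y = self.conv1(x)
        z = self.conv3(self.conv2(y))
        # residual connection between the first ReLU output and the last BN output
        return self.pool(self.relu(y + z))
```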
Further, the inputs to the temporal encoder are:
Xt=stack(F0,F1,F2,...,FN-1) (8)
where stack() denotes stacking images along a new dimension. The basic building block of the temporal encoder is the temporal module T-Block, whose structure is shown in FIG. 2(c). A T-Block first uses a 1×1 convolution layer to reduce the number of channels of the input features; the features are then passed through a 3×3 convolution layer to encode the spatial context and fully mine the change information; finally another 1×1 convolution layer increases the number of feature channels and improves the capacity of the model. A BN layer is added after each convolution layer, and ReLU activation functions are added after the first and second BN layers. In addition, to alleviate the gradient vanishing problem and improve the convergence behavior of the module, a residual connection is added between the input and the output of the T-Block, whose residual branch uses a 1×1 convolution layer and a BN layer to match the number of feature channels. As shown in FIG. 1, a downsampling module composed of a convolution layer, a BN layer and a ReLU activation function connected in series (i.e., the stem part marked in FIG. 1) is added before the first T-Block of the temporal encoder; the convolution kernel size is set to 3×9×9 and the stride to 1×4×4, so that after the downsampling module the spatial resolution of the input video frame sequence is reduced by a factor of 4, which lowers the degree of attention the temporal encoder pays to spatial information and realizes explicit space-time decoupling.
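As a non-limiting sketch, the downsampling stem and T-Block described above could be realized in PyTorch roughly as follows; interpreting the 1×1 and 3×3 convolutions as 1×1×1 and 3×3×3 three-dimensional convolutions, and the padding values, are assumptions, whereas the 3×9×9 kernel and 1×4×4 stride of the stem follow the text.
```python
# Features are 5-D tensors (batch, channels, time, height, width).
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Downsampling module: conv + BN + ReLU with kernel 3x9x9 and stride 1x4x4."""
    def __init__(self, cin=3, cout=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(cin, cout, kernel_size=(3, 9, 9), stride=(1, 4, 4),
                      padding=(1, 4, 4), bias=False),
            nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)  # spatial resolution reduced by a factor of 4

class TBlock(nn.Module):
    """Bottleneck-style temporal module with a projection residual branch."""
    def __init__(self, cin, cmid, cout, stride=(1, 1, 1)):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv3d(cin, cmid, 1, bias=False),
                                   nn.BatchNorm3d(cmid), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv3d(cmid, cmid, 3, stride=stride,
                                             padding=1, bias=False),
                                   nn.BatchNorm3d(cmid), nn.ReLU(inplace=True))
        self.conv3 = nn.Sequential(nn.Conv3d(cmid, cout, 1, bias=False),
                                   nn.BatchNorm3d(cout))  # no ReLU after the third BN
        # residual branch: 1x1x1 conv + BN to match the number of feature channels
        self.skip = nn.Sequential(nn.Conv3d(cin, cout, 1, stride=stride, bias=False),
                                  nn.BatchNorm3d(cout))

    def forward(self, x):
        return self.conv3(self.conv2(self.conv1(x))) + self.skip(x)
```
For a T-Block that performs space-time downsampling (such as the 3rd T-Block in the example below), stride=(2, 2, 2) would be passed to both the trunk and the residual branch.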
Further, the temporal aggregation module TAM first applies global max pooling and global average pooling along the time dimension to the features output by a given level of the temporal encoder, obtaining a compact representation of the temporal change information contained in the T-Block features; the two pooling results are then concatenated along the channel dimension to obtain an aggregated feature; finally, the aggregated feature is transformed point-wise by a convolution layer with kernel size 1×1, a batch normalization layer and a ReLU activation function to obtain the final output.
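A minimal sketch of the temporal aggregation module under the description above, assuming a (batch, channel, time, height, width) tensor layout; the class name TAM and the output-channel argument are illustrative.
```python
import torch
import torch.nn as nn

class TAM(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        # point-wise transform: 1x1 conv + BN + ReLU applied to the aggregated feature
        self.fuse = nn.Sequential(nn.Conv2d(2 * cin, cout, 1, bias=False),
                                  nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    def forward(self, x):                         # x: (B, C, T, H, W)
        max_pool = x.max(dim=2).values            # global max pooling over time
        avg_pool = x.mean(dim=2)                  # global average pooling over time
        agg = torch.cat([max_pool, avg_pool], 1)  # channel-wise concatenation -> (B, 2C, H, W)
        return self.fuse(agg)                     # 2-D feature passed to the spatial encoder
```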
Further, in step 3, the progressive decoder consists of a convolution layer followed by several decoding modules (D-Blocks), the total number of D-Blocks being one more than the number of S-Blocks (S-Block I and S-Block II together) in the spatial encoder; the structure of each D-Block is shown in FIG. 2(d). Each D-Block receives two inputs, namely the output of the previous D-Block (or of the initial convolution layer) and the output of the S-Block at the same level: the higher-level decoding features are first upsampled, the upsampling result is then concatenated with the same-level coding features along the channel dimension, and finally two convolution layers perform feature fusion, each followed by a BN layer and a ReLU activation function, with a residual connection added between the two convolution layers to alleviate the gradient vanishing problem. As shown in FIG. 1, the channel-dimension concatenation of the original bi-temporal image pair is also used as a level of coding features in order to preserve as much spatial detail of the images as possible.
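A rough sketch of one decoding module under the description above; the 3×3 kernels, bilinear upsampling, and the exact placement of the residual connection between the two fusion convolutions are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DBlock(nn.Module):
    def __init__(self, c_dec, c_enc, cout):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(c_dec + c_enc, cout, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(cout, cout, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    def forward(self, dec_feat, enc_feat):
        # upsample the higher-level decoding feature to the same-level encoding size
        up = F.interpolate(dec_feat, size=enc_feat.shape[-2:], mode="bilinear",
                           align_corners=False)
        x = self.conv1(torch.cat([up, enc_feat], dim=1))  # channel-wise concatenation + fusion
        return x + self.conv2(x)                          # residual link between the two convs
```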
Further, in step 4, the entire change detection network is trained by minimizing a joint loss function, which can be expressed as:
L=l(Pfinal,R)+λl(Pinter,R) (9)
where l denotes the specific loss function applied to each output-truth label pair, Pfinal and Pinter denote the change probability maps of the final model output (i.e., the decoder output) and the bypass output (i.e., the temporal encoder output), respectively, R denotes the ground-truth change label, and λ is the weight coefficient of the auxiliary loss; class-balanced cross entropy loss is chosen as the specific loss type:
l(P,R) = -(1/(H·W))·Σi=1..H Σj=1..W [wc·Rij·log(Pij) + wu·(1-Rij)·log(1-Pij)] (10)
where H and W denote the height and width of the image, i and j index the i-th row and j-th column, and wc and wu are the class weight coefficients of the changed and unchanged classes, respectively; an Adam optimizer is used to perform the gradient updates that minimize the loss function.
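A minimal sketch of the joint loss of equations (9) and (10), assuming the standard class-balanced binary cross entropy form with per-pixel averaging; the helper names class_balanced_bce and joint_loss are illustrative, and resizing the label to the bypass-output resolution with nearest-neighbour interpolation is an assumption.
```python
import torch
import torch.nn.functional as F

def class_balanced_bce(prob, label, w_c=0.5, w_u=0.5, eps=1e-7):
    """prob, label: tensors of shape (B, 1, H, W) with values in [0, 1]."""
    prob = prob.clamp(eps, 1.0 - eps)
    loss = -(w_c * label * torch.log(prob) + w_u * (1.0 - label) * torch.log(1.0 - prob))
    return loss.mean()

def joint_loss(p_final, p_inter, label, lam=0.4):
    # deep supervision: main term on the decoder output, auxiliary term on the bypass output
    label_inter = F.interpolate(label, size=p_inter.shape[-2:], mode="nearest")
    return class_balanced_bce(p_final, label) + lam * class_balanced_bce(p_inter, label_inter)
```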
The present invention may be implemented using computer software technology. The following details the steps of the method for detecting a change in a high-resolution remote sensing image according to the embodiment with reference to fig. 1.
Step 1: perform temporal interpolation on the input bi-temporal high-resolution remote sensing image pair to obtain a pseudo video frame sequence.
In the proposed method, each frame of the pseudo video is obtained with a linear interpolation strategy, taking the first-phase image as the initial frame and the second-phase image as the end frame. In this example, the spatial size of the original images is 256×256 and the number of bands is 3. The length T of the pseudo video frame sequence is 8, meaning that the interpolation result contains 8 frames, each of spatial size 256×256 with 3 bands. The interpolation can be implemented in a vectorized manner with the NumPy scientific computing library or the PyTorch deep learning framework. In practice, those skilled in the art can choose the pseudo video frame sequence length T according to the available computing power and the desired temporal resolution; in general, a larger T makes the algorithm slower and more resource-hungry but yields better accuracy metrics.
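As an illustration, the time-sequence linear interpolation of step 1 can be vectorized with NumPy roughly as follows; the function name make_pseudo_video and its arguments are illustrative and not part of the patent text.
```python
import numpy as np

def make_pseudo_video(img1: np.ndarray, img2: np.ndarray, T: int = 8) -> np.ndarray:
    """Linearly interpolate T frames from img1 (first phase) to img2 (second phase)."""
    img1 = img1.astype(np.float32)
    img2 = img2.astype(np.float32)
    # coefficients alpha_n = n / (T - 1), so frame 0 equals img1 and frame T-1 equals img2
    alphas = np.linspace(0.0, 1.0, T, dtype=np.float32).reshape(T, 1, 1, 1)
    frames = (1.0 - alphas) * img1[None] + alphas * img2[None]
    return frames  # shape (T, H, W, C), same spatial size and value range as the inputs

# example: two 256x256 3-band images -> an 8-frame pseudo video
video = make_pseudo_video(np.random.rand(256, 256, 3), np.random.rand(256, 256, 3), T=8)
print(video.shape)  # (8, 256, 256, 3)
```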
Step 2: construct a temporal encoder and a spatial encoder that respectively receive the video frame sequence and the original bi-temporal remote sensing image pair as inputs, add side-path connections between the two encoders, and process the features transferred from the temporal encoder to the spatial encoder with the temporal aggregation module TAM.
In this example, the spatial encoder contains 3 S-Blocks in total, while the temporal encoder contains 1 downsampling module and 4 T-Blocks. The numbers of output channels of the three S-Blocks are set to 32, 64 and 128, respectively; the first two S-Blocks use the S-Block I type and the third uses the S-Block II type. The number of output channels of the downsampling module in the temporal encoder is set to 64, and the numbers of output channels of the T-Blocks are set to 256, 512 and 512, respectively. In particular, for the 3rd T-Block, the 3×3 convolution layers on its trunk and bypass (residual branch) use a stride of 2 in both the temporal and spatial dimensions in order to achieve space-time downsampling. With this setup, both the spatial encoder and the temporal encoder have an output stride of 8, i.e., the spatial resolution of the output coding features is 1/8 of that of the input. In addition, to facilitate the exchange of temporal and spatial information, side-path connections are added between the two encoders: the intermediate features output by the 2nd and 4th T-Blocks of the temporal encoder are first processed by a temporal aggregation module TAM and then transferred to the spatial encoder as inputs of the 2nd and 3rd S-Blocks, respectively. In a specific implementation, the numbers of S-Blocks and T-Blocks can be adjusted as needed, but the number of S-Blocks must be 1 less than the number of T-Blocks and 1 more than the number of temporal aggregation modules.
Step 3: construct a progressive decoder, connect the output of each level of the spatial encoder with the input of the corresponding level of the decoder, and let the final convolution layer of the decoder output a single-channel change probability map.
In the example, the decoder concatenates a convolutional layer and 4-level D-blocks, where each D-Block receives the output of the last D-Block (or the most forward convolutional layer) and the output of the S-Block at the same level. The number of D-blocks must be guaranteed to be 1 more than that of S-blocks during implementation.
Step 4: compute the loss from the final output of the decoder and the bypass output of the temporal encoder, and optimize the weight parameters of the whole network with a gradient descent method until the loss converges.
In this example, each term of the joint loss is computed with the class-balanced cross entropy loss, the weight coefficients of the positive and negative classes are both set to 0.5, and the weight coefficient of the auxiliary loss applied to the temporal encoder is set to 0.4. The network is optimized with an Adam optimizer, the initial learning rate is set to 0.0004, and training lasts for 260,000 iterations. In a specific implementation, the training hyper-parameters can be adjusted by those skilled in the art according to the particular data set used.
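For illustration, a minimal training-loop sketch under the hyper-parameters of this example (Adam, initial learning rate 0.0004, auxiliary weight 0.4, 260,000 iterations); the model and loader objects and the joint_loss helper from the earlier sketch are assumptions standing in for the full network and data pipeline.
```python
import torch

def train(model, loader, num_iters=260_000, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
    it = 0
    while it < num_iters:
        for img_pair, frames, label in loader:          # bi-temporal pair, pseudo video, truth map
            img_pair = img_pair.to(device)
            frames = frames.to(device)
            label = label.to(device)
            p_final, p_inter = model(img_pair, frames)  # decoder output and bypass output
            loss = joint_loss(p_final, p_inter, label, lam=0.4)
            optimizer.zero_grad()
            loss.backward()                             # gradient-descent (backpropagation) update
            optimizer.step()
            it += 1
            if it >= num_iters:
                break
```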
Those of ordinary skill in the art can understand that the invention, for the first time, views the change detection problem from the perspective of video understanding and achieves more refined temporal modeling by combining two-dimensional and three-dimensional convolutional neural networks to mine the spatio-temporal features in multi-temporal image pairs. Second, the space-time decoupled design of the encoder structure strengthens the network's ability to extract spatio-temporal features, relieves the burden on the decoder, and reduces the training difficulty. Finally, the side-path connections and the temporal aggregation module both enhance the information interaction between the two encoders and make the temporal and spatial features learned by the network better matched, thereby improving the accuracy and robustness of the model.
It should be noted and appreciated that various modifications and improvements of the invention described in detail above can be made without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any particular exemplary teachings presented.

Claims (9)

1. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling is characterized by comprising the following steps of:
step 1, according to an input double-phase remote sensing image pair, a pseudo video frame sequence is obtained through a time sequence interpolation strategy, and each image in the pseudo video frame sequence has the same spatial size and numerical range as an original image;
Step 2, constructing a time encoder and a space encoder, wherein the time encoder receives a pseudo video frame sequence as input, firstly performs downsampling operation, then extracts characteristics through a cascaded three-dimensional convolution layer, the space encoder receives an original double-phase remote sensing image pair as input, extracts the characteristics through a two-dimensional convolution layer, sets unidirectional side-path connection between the two encoders, and processes the characteristics transferred from the time encoder to the space encoder through a time sequence aggregation module (temporal aggregation module, TAM);
Step 3, constructing a progressive decoder, connecting the output of each level module of the space encoder with the input of each level module in the decoder, and outputting a single-channel change probability map by the final convolution layer of the decoder;
Step 4, adding an additional convolution layer at the tail end of the time encoder, applying depth supervision to the output of the time encoder, constructing a joint loss function, and optimizing weight parameters of the whole network by using a gradient descent method until the loss converges; the convolution layer at the end of the time encoder only provides additional output in the model training stage, and a single-channel change probability map of the final convolution layer output of the decoder is still used as the final output of the network in the model reasoning stage.
2. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 1, wherein: in step 1, assuming that the first-phase original image is I1, the second-phase original image is I2, and the video contains N frames, the interpolation formula for the n-th frame image Fn is:
Fn = (1 - n/(N-1))·I1 + (n/(N-1))·I2, n = 0, 1, ..., N-1 (1)
3. the high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 1, wherein: in step 2, the spatial encoder inputs are:
Xs=concat(I1,I2) (2)
wherein I1 is the first-phase original image, I2 is the second-phase original image, and concat() denotes concatenation along the channel dimension; the basic building block of the spatial encoder is the spatial module S-Block, which is divided into two types, S-Block I and S-Block II; the initial part of both types consists of two cascaded convolution layers with corresponding BN layers and ReLU activation functions, and the tail part is a max pooling layer; compared with S-Block I, S-Block II has an additional convolution layer with a corresponding BN layer and ReLU activation function, and therefore stronger feature extraction and fitting capability, and a residual connection is added between the output of the first ReLU activation function and the output of the last normalization layer; two S-Block I and one S-Block II are connected in sequence to form the spatial encoder.
4. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 1, wherein: the inputs to the temporal encoder are:
Xt=stack(F0,F1,F2,...,FN-1) (3)
wherein stack() denotes stacking images along a new dimension; the basic building block of the temporal encoder is the temporal module T-Block; a T-Block first uses a 1×1 convolution layer to reduce the number of channels of the input features; the features are then passed through a 3×3 convolution layer to encode the spatial context and fully mine the change information; finally another 1×1 convolution layer increases the number of feature channels and thus the capacity of the model; a BN layer is added after each convolution layer, and ReLU activation functions are added after the first and second BN layers; in addition, to alleviate the gradient vanishing problem and improve the convergence behavior of the module, a residual connection is added between the input and the output of the T-Block, whose residual branch uses a 1×1 convolution layer and a BN layer to match the number of feature channels; a downsampling module composed of a convolution layer, a BN layer and a ReLU activation function connected in series is added before the first T-Block of the temporal encoder, the convolution kernel size is set to 3×9×9 and the stride to 1×4×4, so that after the downsampling module the spatial resolution of the input video frame sequence is reduced by a factor of 4, which lowers the degree of attention the temporal encoder pays to spatial information and realizes explicit space-time decoupling.
5. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 1, wherein: the temporal aggregation module TAM first applies global max pooling and global average pooling along the time dimension to the features output by a given level of the temporal encoder, obtaining a compact representation of the temporal change information contained in the T-Block features; the two pooling results are then concatenated along the channel dimension to obtain an aggregated feature; finally, the aggregated feature is transformed point-wise by a convolution layer with kernel size 1×1, a batch normalization layer and a ReLU activation function to obtain the final output.
6. The high resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 3, wherein: in step 3, the progressive decoder consists of a convolution layer followed by several decoding modules (D-Blocks), the total number of D-Blocks being one more than the number of S-Blocks in the spatial encoder; each D-Block receives two inputs, namely the output of the previous D-Block and the output of the S-Block at the same level: it first upsamples the higher-level decoding features, then concatenates the upsampling result with the same-level coding features along the channel dimension, and finally fuses the features with two convolution layers, each followed by a BN layer and a ReLU activation function, with a residual connection added between the two convolution layers to alleviate the gradient vanishing problem.
7. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 1, wherein: in step 4, the entire change detection network is trained by minimizing a joint loss function, which can be expressed as:
L=l(Pfinal,R)+λl(Pinter,R) (4)
where l denotes the specific loss function applied to each output-truth label pair, Pfinal and Pinter denote the change probability maps of the final model output (i.e., the decoder output) and the bypass output (i.e., the temporal encoder output), respectively, R denotes the ground-truth change label, and λ is the weight coefficient of the auxiliary loss; class-balanced cross entropy loss is chosen as the specific loss type:
l(P,R) = -(1/(H·W))·Σi=1..H Σj=1..W [wc·Rij·log(Pij) + wu·(1-Rij)·log(1-Pij)] (5)
where H and W denote the height and width of the image, i and j index the i-th row and j-th column, and wc and wu are the class weight coefficients of the changed and unchanged classes, respectively; an Adam optimizer is used to perform the gradient updates that minimize the loss function.
8. The high resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 3, wherein: the spatial encoder includes 3 S-Blocks, and the numbers of output channels of the three S-Blocks are set to 32, 64 and 128, respectively.
9. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 4, wherein: the temporal encoder comprises 1 downsampling module and 4 T-Blocks; the number of output channels of the downsampling module is set to 64, and the numbers of output channels of the T-Blocks are set to 256, 512 and 512, respectively.
CN202210742299.2A 2022-06-27 2022-06-27 High-resolution remote sensing image change detection method based on video understanding and space-time decoupling Active CN115147760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210742299.2A CN115147760B (en) 2022-06-27 2022-06-27 High-resolution remote sensing image change detection method based on video understanding and space-time decoupling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210742299.2A CN115147760B (en) 2022-06-27 2022-06-27 High-resolution remote sensing image change detection method based on video understanding and space-time decoupling

Publications (2)

Publication Number Publication Date
CN115147760A CN115147760A (en) 2022-10-04
CN115147760B (en) 2024-04-19

Family

ID=83410214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210742299.2A Active CN115147760B (en) 2022-06-27 2022-06-27 High-resolution remote sensing image change detection method based on video understanding and space-time decoupling

Country Status (1)

Country Link
CN (1) CN115147760B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259853A (en) * 2020-02-04 2020-06-09 中国科学院计算技术研究所 High-resolution remote sensing image change detection method, system and device
CN112577473A (en) * 2020-12-21 2021-03-30 陕西土豆数据科技有限公司 Double-time-phase high-resolution remote sensing image change detection algorithm
CN112949549A (en) * 2021-03-19 2021-06-11 中山大学 Super-resolution-based change detection method for multi-resolution remote sensing image
CN113420662A (en) * 2021-06-23 2021-09-21 西安电子科技大学 Remote sensing image change detection method based on twin multi-scale difference feature fusion
CN114359723A (en) * 2021-12-27 2022-04-15 陕西科技大学 Remote sensing image change detection method based on space spectrum feature fusion network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Current status and prospects of multi-temporal remote sensing image change detection; 张良培; 武辰; Acta Geodaetica et Cartographica Sinica; 2017-10-15 (No. 10); 249-261 *
A survey of remote sensing image change detection algorithms; 佟国峰; 李勇; 丁伟利; 岳晓阳; Journal of Image and Graphics; 2015-12-16 (No. 12); 5-15 *

Also Published As

Publication number Publication date
CN115147760A (en) 2022-10-04

Similar Documents

Publication Publication Date Title
CN111325751B (en) CT image segmentation system based on attention convolution neural network
CN112669325B (en) Video semantic segmentation method based on active learning
CN115049936B (en) High-resolution remote sensing image-oriented boundary enhanced semantic segmentation method
CN115601549B (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN111178316A (en) High-resolution remote sensing image land cover classification method based on automatic search of depth architecture
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
Wang et al. TF-SOD: a novel transformer framework for salient object detection
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN115797635A (en) Multi-stage instance segmentation method and system based on parallel feature completion
CN113392727B (en) RGB-D salient object detection method based on dynamic feature selection
Chong et al. Multi-hierarchy feature extraction and multi-step cost aggregation for stereo matching
Xing et al. MABNet: a lightweight stereo network based on multibranch adjustable bottleneck module
CN115147760B (en) High-resolution remote sensing image change detection method based on video understanding and space-time decoupling
CN112419325A (en) Super-pixel segmentation method based on deep learning
Gao et al. Multi-branch aware module with channel shuffle pixel-wise attention for lightweight image super-resolution
CN116071281A (en) Multi-mode image fusion method based on characteristic information interaction
CN115731280A (en) Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network
CN115187777A (en) Image semantic segmentation method under data set manufacturing difficulty
Wu et al. Lightweight stepless super-resolution of remote sensing images via saliency-aware dynamic routing strategy
Geng et al. Dual-path feature aware network for remote sensing image semantic segmentation
Yang et al. SA-MVSNet: Self-attention-based multi-view stereo network for 3D reconstruction of images with weak texture
CN116152441B (en) Multi-resolution U-net curved surface reconstruction method based on depth priori
Xu et al. CVE-Net: cost volume enhanced network guided by sparse features for stereo matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant