CN115147760B - High-resolution remote sensing image change detection method based on video understanding and space-time decoupling - Google Patents

High-resolution remote sensing image change detection method based on video understanding and space-time decoupling

Info

Publication number
CN115147760B
CN115147760B (application CN202210742299.2A)
Authority
CN
China
Prior art keywords
time
space
encoder
output
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210742299.2A
Other languages
Chinese (zh)
Other versions
CN115147760A (en)
Inventor
张洪艳
林漫晖
杨光义
张良培
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210742299.2A priority Critical patent/CN115147760B/en
Publication of CN115147760A publication Critical patent/CN115147760A/en
Application granted granted Critical
Publication of CN115147760B publication Critical patent/CN115147760B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention relates to a high-resolution remote sensing image change detection method based on video understanding and space-time decoupling. Because a bi-temporal high-resolution remote sensing image pair is structurally unbalanced between its spatial and temporal dimensions, the invention adopts a time-sequence linear interpolation strategy to construct a pseudo video frame sequence and expand the temporal dimension, making it possible to process the change detection task with a video understanding algorithm. Exploiting the fact that the change detection task focuses on spatio-temporal information, the invention proposes a space-time decoupled encoder design, so that the network attends to only one dimension of the problem at a time, which relieves the burden on the decoder and improves the detection result. Meanwhile, to promote information exchange between the spatial and temporal encoders, the invention provides a temporal aggregation module placed in the side-path connection from the temporal encoder to the spatial encoder, improving the consistency between the spatial and temporal features. In addition, the invention uses deep supervision to accelerate the convergence of the deep model and to address the insufficient effectiveness of the features in the middle layers of the model.

Description

High-resolution remote sensing image change detection method based on video understanding and space-time decoupling
Technical Field
The invention relates to the field of remote sensing image change detection, and in particular to a method that converts an image pair into a pseudo video frame sequence through time-sequence interpolation and, on that basis, builds a space-time decoupled network structure to detect changes in remote sensing images. The whole change detection network is trained in a data-driven manner, realizing accurate extraction of the change information in high-resolution remote sensing image pairs.
Background
Remote sensing image change detection aims at analyzing the state changes of ground objects in a region by repeatedly observing the same region at different times. Since the 1970s, researchers worldwide have analyzed remote sensing data from different sources and from different angles, and a large number of models and methods have been proposed. With the progress of satellite sensor and signal transmission technology, remote sensing images have become ever easier to acquire and their spatial resolution has kept improving. The increasingly abundant high-resolution remote sensing data bring both opportunities and challenges to remote sensing image change detection. On the one hand, compared with medium- and low-resolution imagery, high-resolution imagery provides richer ground-object detail and spatial distribution information, so that fine changes can be found and the boundaries of changed objects can be located more precisely. On the other hand, in high-resolution imagery the same ground object usually appears as a planar region, the assumption that pixels are independent of each other no longer holds, the gray values of pixels within one object fluctuate with the material and reflection characteristics of the target, and the phenomenon of "same object with different spectra, different objects with the same spectrum" is more pronounced than in medium- and low-resolution imagery, so the difficulty of detecting change regions increases sharply. In recent years, with the rise and maturation of deep learning in artificial intelligence, change detection in high-resolution remote sensing images has found new solutions. Deep learning methods train network models on large numbers of samples, giving the model the ability to extract more discriminative features and avoiding tedious and inefficient manual feature engineering. Meanwhile, compared with traditional algorithms, deep learning architectures offer higher parallelism and excellent end-to-end properties, enabling efficient and accurate inference. Considering the massive and multi-dimensional nature of remote sensing imagery, deep learning is well suited to learning and optimization for the remote sensing image change detection task. Research on deep-learning-based high-resolution remote sensing image change detection can therefore greatly improve detection accuracy while accelerating the intelligent and automated processing of change detection, and thus has high application value.
In general, current remote sensing image change detection algorithms can be categorized into three categories:
Conventional pixel-based change detection methods: pixel-based change detection methods were the earliest developed and remain the most diverse. As the main approach to change detection in medium- and low-resolution imagery, the most representative techniques, namely image algebra, image transformation, and post-classification comparison, are all pixel-based. Image algebra obtains a change intensity map through pixel-by-pixel algebraic operations between the corresponding bands of the two images, an idea that has influenced many more advanced change detection algorithms proposed later. Image transformation methods start from the statistical structure of the data and extract pixel-level change information after mathematically transforming the input images according to theories such as principal component analysis and slow feature analysis. Post-classification comparison first classifies the image of each phase independently and then derives the change detection result from the classification results. Traditional pixel-level change detection methods are easy to operate and convenient to implement. However, because of the complexity of ground objects in high-resolution remote sensing images, these methods are only suitable for images with few bands and relatively simple scenes, and large areas of false and missed detections are difficult to avoid in varied scenes.
Object-based traditional change detection methods: object-based change detection algorithms overcome the poor performance of pixel-based methods on high-resolution imagery. These methods first segment objects according to the spatial and spectral properties of pixels and then use objects instead of pixels as the basic processing unit in detection. Compared with pixel-based methods, object-based change detection has natural advantages in accurately delineating object boundaries and keeping the change state consistent inside each object, but it has two main limitations: first, the detection result is strongly limited by the quality of the segmented objects; second, object-based change detection usually adopts a two-stage structure of segmenting objects and then extracting change features rather than end-to-end framework optimization, which easily accumulates errors and makes a globally optimal solution hard to obtain.
Deep-learning-based change detection methods: such methods can be subdivided into three subclasses: feature-based methods, patch-based methods, and full-image-based methods. Deep-learning-based change detection can exploit the rich spatial and spectral information of high-resolution imagery to obtain more effective and robust features, thereby overcoming the shortcomings of traditional algorithms to some extent. However, most existing deep-learning-based change detection algorithms focus either on using spatial and spectral information more efficiently or on domain alignment between images of different dates. They mostly handle change detection with semantic segmentation or metric learning models and rely on pixel-wise differencing or channel-dimension concatenation as the main means of extracting temporal information, lacking explicit modeling of the temporal process and ignoring the essence that "change is a process". A small number of algorithms do consider temporal information explicitly, but their modeling of the time dimension is incomplete, which limits the achievable performance gain.
Therefore, there is still great room for improvement in deep-learning-based change detection algorithms, and it is necessary to develop high-resolution remote sensing image change detection algorithms that take time-sequence information mining into account.
Disclosure of Invention
Aiming at the shortcomings of existing deep-learning-based remote sensing image change detection algorithms, the invention provides a high-resolution remote sensing image change detection method based on video understanding and space-time decoupling: a video frame sequence is constructed from the image pair, the spatial and temporal dimensions of the multi-temporal images are decoupled, and a dual-stream structure with separate spatial and temporal branches is designed, so that spatio-temporal information is fully mined and change regions are located more accurately.
The technical scheme of the invention provides a high-resolution remote sensing image change detection method based on video understanding and space-time decoupling, which comprises the following steps of:
step 1, according to an input double-phase remote sensing image pair, a pseudo video frame sequence is obtained through a time sequence interpolation strategy, and each image in the pseudo video frame sequence has the same spatial size and numerical range as an original image;
Step 2, constructing a time encoder and a space encoder, wherein the time encoder receives a pseudo video frame sequence as input, firstly performs downsampling operation, then extracts characteristics through a cascaded three-dimensional convolution layer, the space encoder receives an original double-phase remote sensing image pair as input, extracts the characteristics through a two-dimensional convolution layer, sets unidirectional side-path connection between the two encoders, and processes the characteristics transferred from the time encoder to the space encoder through a time sequence aggregation module (temporal aggregation module, TAM);
Step 3, constructing a progressive decoder, connecting the output of each level module of the space encoder with the input of each level module in the decoder, and outputting a single-channel change probability map by the final convolution layer of the decoder;
Step 4, adding an additional convolution layer at the tail end of the time encoder, applying depth supervision to the output of the time encoder, constructing a joint loss function, and optimizing weight parameters of the whole network by using a gradient descent method until the loss converges; the convolution layer at the end of the time encoder only provides additional output in the model training stage, and a single-channel change probability map of the final convolution layer output of the decoder is still used as the final output of the network in the model reasoning stage.
Further, in step 1, assuming that the first phase original image is I 1, the second phase original image is I 2, and the video contains N frames, the interpolation formula for the nth frame image F n is:
Further, in step 2, the input of the spatial encoder is:
Xs=concat(I1,I2) (2)
where I1 is the first-phase original image, I2 is the second-phase original image, and concat() denotes concatenation along the channel dimension. The basic building block of the spatial encoder is the spatial module S-Block, which is divided into two types, S-Block I and S-Block II. The initial part of both types consists of two cascaded convolution layers with corresponding BN layers and ReLU activation functions, and the tail part is a max pooling layer. Compared with S-Block I, S-Block II has an additional convolution layer with its corresponding BN layer and ReLU activation function, giving it stronger feature extraction and fitting capability, and a residual connection is added between the output of the first ReLU activation function and the output of the last normalization layer. Two S-Block I and one S-Block II are connected in sequence to form the spatial encoder.
Further, the inputs to the temporal encoder are:
Xt=stack(F0,F1,F2,...,FN-1) (3)
where stack() denotes stacking images along a new dimension. The basic building block of the temporal encoder is the temporal module T-Block. A T-Block first uses a 1×1 convolution layer to reduce the number of channels of the input features; the features are then passed through a 3×3 convolution layer to encode the spatial context and fully mine the change information; finally another 1×1 convolution layer increases the number of feature channels and thus the capacity of the model. A BN layer is added after each convolution layer, and ReLU activation functions are added after the first and second BN layers. In addition, to alleviate the gradient vanishing problem and improve the convergence behavior of the module, a residual connection is added between the input and the output of the T-Block, whose residual branch uses a 1×1 convolution layer and a BN layer to match the number of feature channels. A downsampling module composed of a convolution layer, a BN layer and a ReLU activation function connected in series is added before the first T-Block of the temporal encoder; the convolution kernel size is set to 3×9×9 and the stride to 1×4×4, so that after the downsampling module the spatial resolution of the input video frame sequence is reduced by a factor of 4, which lowers the degree of attention the temporal encoder pays to spatial information and realizes explicit space-time decoupling.
Further, the temporal aggregation module TAM first applies global max pooling and global average pooling along the time dimension to the features output by a given level of the temporal encoder, obtaining a compact representation of the temporal change information contained in the T-Block features; the two pooling results are then concatenated along the channel dimension to obtain an aggregated feature; finally, the aggregated feature is transformed point-wise by a convolution layer with kernel size 1×1, a batch normalization layer and a ReLU activation function to obtain the final output.
Further, in step 3, the progressive decoder consists of a convolution layer followed by several decoding modules (D-Blocks), the total number of D-Blocks being one more than the number of S-Blocks in the spatial encoder. Each D-Block receives two inputs, namely the output of the previous D-Block and the output of the S-Block at the same level: it first upsamples the higher-level decoding features, then concatenates the upsampling result with the same-level coding features along the channel dimension, and finally fuses the features with two convolution layers, each followed by a BN layer and a ReLU activation function, with a residual connection added between the two convolution layers to alleviate the gradient vanishing problem.
Further, in step 4, the entire change detection network is trained by minimizing a joint loss function, which can be expressed as:
L=l(Pfinal,R)+λl(Pinter,R) (4)
where l denotes the specific loss function applied to each output-truth label pair, Pfinal and Pinter denote the change probability maps of the final model output (i.e., the decoder output) and the bypass output (i.e., the temporal encoder output), respectively, R denotes the ground-truth change label, and λ is the weight coefficient of the auxiliary loss; class-balanced cross entropy loss is chosen as the specific loss type:
l(P,R) = -(1/(H·W))·Σi=1..H Σj=1..W [wc·Rij·log(Pij) + wu·(1-Rij)·log(1-Pij)] (5)
where H and W denote the height and width of the image, i and j index the i-th row and j-th column, and wc and wu are the class weight coefficients of the changed and unchanged classes, respectively; an Adam optimizer is used to perform the gradient updates that minimize the loss function.
Further, the spatial encoder includes 3 S-Blocks, and the numbers of output channels of the three S-Blocks are set to 32, 64 and 128, respectively.
Further, the temporal encoder comprises 1 downsampling module and 4 T-Blocks; the number of output channels of the downsampling module is set to 64, and the numbers of output channels of the T-Blocks are set to 256, 512 and 512, respectively.
According to the high-resolution remote sensing image change detection method based on video understanding and space-time decoupling, the change detection problem is converted into the dense classification problem of videos for the first time, and a space encoder and a time encoder are used for respectively processing double-time-phase input images and constructed pseudo video frame sequences, so that explicit decoupling of time and space dimensions is achieved.
Meanwhile, considering that ground-object scenes in high-resolution remote sensing images are complex and false changes easily appear in detection results, the proposed method makes full use of the rich information of multi-temporal remote sensing images, relieves the structural imbalance of the original data between the temporal and spatial dimensions through time-sequence interpolation, and lets the model concentrate on refining the change information, thereby suppressing false changes in the results. The proposed method is important for downstream applications of high-resolution remote sensing imagery, such as urban building change monitoring and disaster monitoring. Therefore, the high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling has significant academic value and practical importance.
The invention not only provides a new deep learning change detection paradigm that differs from existing deep learning change detection frameworks and emphasizes the full use of time-sequence information, but also, for the first time, explicitly considers the space-time coupling problem inherent in the change detection task and provides an effective solution for accurately delineating the boundaries of changed objects.
Drawings
FIG. 1 is an overall network architecture diagram of the present invention;
FIG. 2 is a block diagram of the basic components of a spatial encoder, a temporal encoder, and a progressive decoder;
FIG. 3 is a block diagram of the timing interpolation module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, a method for detecting a change in a remote sensing image based on video understanding and space-time decoupling will be described in detail with reference to the accompanying drawings. It should be understood that the specific examples described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The invention provides a remote sensing image change detection method based on video understanding and space-time decoupling, which regards "change" as a continuous "temporal process" rather than a discrete "state change", thereby modeling the change detection problem as a video understanding task. The bi-temporal image pair is converted into a pseudo video frame sequence through time-sequence interpolation, which relieves the imbalance between the temporal and spatial structure of the original data. A space-time decoupled encoder is adopted, with a temporal encoder and a spatial encoder handling the temporal and spatial information respectively, so that the model attends to only one dimension at a time and the adverse effects of space-time coupling are avoided. A side-path connection is added between the two encoders to promote interaction between the spatio-temporal features. In addition, the invention applies deep supervision to the output of the temporal encoder, forcing it to learn more useful features and speeding up the training process.
Step 1, according to an input double-phase remote sensing image pair, a pseudo video frame sequence is obtained through a time sequence interpolation strategy, and each image in the pseudo video frame sequence has the same spatial size and numerical range as an original image;
Step 2, constructing a time encoder and a space encoder, wherein the time encoder receives a pseudo video frame sequence as input, firstly performs downsampling operation, then extracts characteristics through a cascaded three-dimensional convolution layer, the space encoder receives an original double-phase remote sensing image pair as input, extracts the characteristics through a two-dimensional convolution layer, sets unidirectional side-path connection between the two encoders, and processes the characteristics transferred from the time encoder to the space encoder through a time sequence aggregation module (temporal aggregation module, TAM);
Step 3, constructing a progressive decoder, connecting the output of each level module of the space encoder with the input of each level module in the decoder, and outputting a single-channel change probability map by the final convolution layer of the decoder;
And 4, adding an additional convolution layer at the tail end of the time encoder, applying depth supervision to the output of the time encoder, constructing a joint loss function, and optimizing weight parameters of the whole network by using a gradient descent method until the loss converges. The convolutional layer at the end of the time encoder provides additional output only during the model training phase, and still uses the single-channel variation probability map of the final convolutional layer output of the decoder as the final network output during the model reasoning phase.
Further, in step 1, assuming that the first-phase original image is I1, the second-phase original image is I2, and the video contains N frames, the interpolation formula for the n-th frame image Fn is:
Fn = (1 - n/(N-1))·I1 + (n/(N-1))·I2, n = 0, 1, ..., N-1 (6)
Further, in step 2, the input of the spatial encoder is:
Xs=concat(I1,I2) (7)
where concat() denotes concatenation along the channel dimension. The basic building block of the spatial encoder is the spatial module S-Block, which can be divided into two types, S-Block I and S-Block II, whose structures are shown in FIG. 2(a) and FIG. 2(b), respectively. The initial part of both types consists of two cascaded convolution layers (with corresponding BN layers and ReLU activation functions), and the tail part is a max pooling layer. Compared with S-Block I, S-Block II has an additional convolution layer with its corresponding BN layer and ReLU activation function, and therefore stronger feature extraction and fitting capability, with a residual connection added between the output of the first ReLU activation function and the output of the last normalization layer. Two S-Block I and one S-Block II are connected in sequence to form the spatial encoder.
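By way of illustration only, the following PyTorch sketch shows one possible realization of the two spatial modules described above; the 3×3 kernel size, padding, and 2×2 pooling window are assumptions not stated in the text, and the class names SBlockI and SBlockII are illustrative.
```python
# A rough sketch of the spatial modules, assuming 3x3 convolutions and 2x2 max pooling.
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1, bias=False),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class SBlockI(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        # two cascaded conv+BN+ReLU groups followed by max pooling
        self.body = nn.Sequential(conv_bn_relu(cin, cout), conv_bn_relu(cout, cout))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        return self.pool(self.body(x))

class SBlockII(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        self.conv1 = conv_bn_relu(cin, cout)   # output of the first ReLU
        self.conv2 = conv_bn_relu(cout, cout)  # additional conv+BN+ReLU of S-Block II
        self.conv3 = nn.Sequential(nn.Conv2d(cout, cout, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(cout))  # last normalization layer
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        y = self.conv1(x)
        z = self.conv3(self.conv2(y))
        # residual connection between the first ReLU output and the last BN output
        return self.pool(self.relu(y + z))
```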
Further, the inputs to the temporal encoder are:
Xt=stack(F0,F1,F2,...,FN-1) (8)
where stack() denotes stacking images along a new dimension. The basic building block of the temporal encoder is the temporal module T-Block, whose structure is shown in FIG. 2(c). A T-Block first uses a 1×1 convolution layer to reduce the number of channels of the input features; the features are then passed through a 3×3 convolution layer to encode the spatial context and fully mine the change information; finally another 1×1 convolution layer increases the number of feature channels and improves the capacity of the model. A BN layer is added after each convolution layer, and ReLU activation functions are added after the first and second BN layers. In addition, to alleviate the gradient vanishing problem and improve the convergence behavior of the module, a residual connection is added between the input and the output of the T-Block, whose residual branch uses a 1×1 convolution layer and a BN layer to match the number of feature channels. As shown in FIG. 1, a downsampling module composed of a convolution layer, a BN layer and a ReLU activation function connected in series (i.e., the stem part marked in FIG. 1) is added before the first T-Block of the temporal encoder; the convolution kernel size is set to 3×9×9 and the stride to 1×4×4, so that after the downsampling module the spatial resolution of the input video frame sequence is reduced by a factor of 4, which lowers the degree of attention the temporal encoder pays to spatial information and realizes explicit space-time decoupling.
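As a non-limiting sketch, the downsampling stem and T-Block described above could be realized in PyTorch roughly as follows; interpreting the 1×1 and 3×3 convolutions as 1×1×1 and 3×3×3 three-dimensional convolutions, and the padding values, are assumptions, whereas the 3×9×9 kernel and 1×4×4 stride of the stem follow the text.
```python
# Features are 5-D tensors (batch, channels, time, height, width).
import torch
import torch.nn as nn

class Stem(nn.Module):
    """Downsampling module: conv + BN + ReLU with kernel 3x9x9 and stride 1x4x4."""
    def __init__(self, cin=3, cout=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(cin, cout, kernel_size=(3, 9, 9), stride=(1, 4, 4),
                      padding=(1, 4, 4), bias=False),
            nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)  # spatial resolution reduced by a factor of 4

class TBlock(nn.Module):
    """Bottleneck-style temporal module with a projection residual branch."""
    def __init__(self, cin, cmid, cout, stride=(1, 1, 1)):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv3d(cin, cmid, 1, bias=False),
                                   nn.BatchNorm3d(cmid), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv3d(cmid, cmid, 3, stride=stride,
                                             padding=1, bias=False),
                                   nn.BatchNorm3d(cmid), nn.ReLU(inplace=True))
        self.conv3 = nn.Sequential(nn.Conv3d(cmid, cout, 1, bias=False),
                                   nn.BatchNorm3d(cout))  # no ReLU after the third BN
        # residual branch: 1x1x1 conv + BN to match the number of feature channels
        self.skip = nn.Sequential(nn.Conv3d(cin, cout, 1, stride=stride, bias=False),
                                  nn.BatchNorm3d(cout))

    def forward(self, x):
        return self.conv3(self.conv2(self.conv1(x))) + self.skip(x)
```
For a T-Block that performs space-time downsampling (such as the 3rd T-Block in the example below), stride=(2, 2, 2) would be passed to both the trunk and the residual branch.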
Further, the temporal aggregation module TAM first applies global max pooling and global average pooling along the time dimension to the features output by a given level of the temporal encoder, obtaining a compact representation of the temporal change information contained in the T-Block features; the two pooling results are then concatenated along the channel dimension to obtain an aggregated feature; finally, the aggregated feature is transformed point-wise by a convolution layer with kernel size 1×1, a batch normalization layer and a ReLU activation function to obtain the final output.
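A minimal sketch of the temporal aggregation module under the description above, assuming a (batch, channel, time, height, width) tensor layout; the class name TAM and the output-channel argument are illustrative.
```python
import torch
import torch.nn as nn

class TAM(nn.Module):
    def __init__(self, cin, cout):
        super().__init__()
        # point-wise transform: 1x1 conv + BN + ReLU applied to the aggregated feature
        self.fuse = nn.Sequential(nn.Conv2d(2 * cin, cout, 1, bias=False),
                                  nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    def forward(self, x):                         # x: (B, C, T, H, W)
        max_pool = x.max(dim=2).values            # global max pooling over time
        avg_pool = x.mean(dim=2)                  # global average pooling over time
        agg = torch.cat([max_pool, avg_pool], 1)  # channel-wise concatenation -> (B, 2C, H, W)
        return self.fuse(agg)                     # 2-D feature passed to the spatial encoder
```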
Further, in step 3, the progressive decoder consists of a convolution layer followed by several decoding modules (D-Blocks), the total number of D-Blocks being one more than the number of S-Blocks (S-Block I and S-Block II together) in the spatial encoder; the structure of each D-Block is shown in FIG. 2(d). Each D-Block receives two inputs, namely the output of the previous D-Block (or of the initial convolution layer) and the output of the S-Block at the same level: the higher-level decoding features are first upsampled, the upsampling result is then concatenated with the same-level coding features along the channel dimension, and finally two convolution layers perform feature fusion, each followed by a BN layer and a ReLU activation function, with a residual connection added between the two convolution layers to alleviate the gradient vanishing problem. As shown in FIG. 1, the channel-dimension concatenation of the original bi-temporal image pair is also used as a level of coding features in order to preserve as much spatial detail of the images as possible.
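A rough sketch of one decoding module under the description above; the 3×3 kernels, bilinear upsampling, and the exact placement of the residual connection between the two fusion convolutions are assumptions.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DBlock(nn.Module):
    def __init__(self, c_dec, c_enc, cout):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(c_dec + c_enc, cout, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(cout, cout, 3, padding=1, bias=False),
                                   nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    def forward(self, dec_feat, enc_feat):
        # upsample the higher-level decoding feature to the same-level encoding size
        up = F.interpolate(dec_feat, size=enc_feat.shape[-2:], mode="bilinear",
                           align_corners=False)
        x = self.conv1(torch.cat([up, enc_feat], dim=1))  # channel-wise concatenation + fusion
        return x + self.conv2(x)                          # residual link between the two convs
```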
Further, in step 4, the entire change detection network is trained by minimizing a joint loss function, which can be expressed as:
L=l(Pfinal,R)+λl(Pinter,R) (9)
where l denotes the specific loss function applied to each output-truth label pair, Pfinal and Pinter denote the change probability maps of the final model output (i.e., the decoder output) and the bypass output (i.e., the temporal encoder output), respectively, R denotes the ground-truth change label, and λ is the weight coefficient of the auxiliary loss; class-balanced cross entropy loss is chosen as the specific loss type:
l(P,R) = -(1/(H·W))·Σi=1..H Σj=1..W [wc·Rij·log(Pij) + wu·(1-Rij)·log(1-Pij)] (10)
where H and W denote the height and width of the image, i and j index the i-th row and j-th column, and wc and wu are the class weight coefficients of the changed and unchanged classes, respectively; an Adam optimizer is used to perform the gradient updates that minimize the loss function.
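A minimal sketch of the joint loss of equations (9) and (10), assuming the standard class-balanced binary cross entropy form with per-pixel averaging; the helper names class_balanced_bce and joint_loss are illustrative, and resizing the label to the bypass-output resolution with nearest-neighbour interpolation is an assumption.
```python
import torch
import torch.nn.functional as F

def class_balanced_bce(prob, label, w_c=0.5, w_u=0.5, eps=1e-7):
    """prob, label: tensors of shape (B, 1, H, W) with values in [0, 1]."""
    prob = prob.clamp(eps, 1.0 - eps)
    loss = -(w_c * label * torch.log(prob) + w_u * (1.0 - label) * torch.log(1.0 - prob))
    return loss.mean()

def joint_loss(p_final, p_inter, label, lam=0.4):
    # deep supervision: main term on the decoder output, auxiliary term on the bypass output
    label_inter = F.interpolate(label, size=p_inter.shape[-2:], mode="nearest")
    return class_balanced_bce(p_final, label) + lam * class_balanced_bce(p_inter, label_inter)
```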
The present invention may be implemented using computer software technology. The following details the steps of the method for detecting a change in a high-resolution remote sensing image according to the embodiment with reference to fig. 1.
Step 1: perform temporal interpolation on the input bi-temporal high-resolution remote sensing image pair to obtain a pseudo video frame sequence.
In the proposed method, each frame of the pseudo video is obtained with a linear interpolation strategy, taking the first-phase image as the initial frame and the second-phase image as the end frame. In this example, the spatial size of the original images is 256×256 and the number of bands is 3. The length T of the pseudo video frame sequence is 8, meaning that the interpolation result contains 8 frames, each of spatial size 256×256 with 3 bands. The interpolation can be implemented in a vectorized manner with the NumPy scientific computing library or the PyTorch deep learning framework. In practice, those skilled in the art can choose the pseudo video frame sequence length T according to the available computing power and the desired temporal resolution; in general, a larger T makes the algorithm slower and more resource-hungry but yields better accuracy metrics.
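As an illustration, the time-sequence linear interpolation of step 1 can be vectorized with NumPy roughly as follows; the function name make_pseudo_video and its arguments are illustrative and not part of the patent text.
```python
import numpy as np

def make_pseudo_video(img1: np.ndarray, img2: np.ndarray, T: int = 8) -> np.ndarray:
    """Linearly interpolate T frames from img1 (first phase) to img2 (second phase)."""
    img1 = img1.astype(np.float32)
    img2 = img2.astype(np.float32)
    # coefficients alpha_n = n / (T - 1), so frame 0 equals img1 and frame T-1 equals img2
    alphas = np.linspace(0.0, 1.0, T, dtype=np.float32).reshape(T, 1, 1, 1)
    frames = (1.0 - alphas) * img1[None] + alphas * img2[None]
    return frames  # shape (T, H, W, C), same spatial size and value range as the inputs

# example: two 256x256 3-band images -> an 8-frame pseudo video
video = make_pseudo_video(np.random.rand(256, 256, 3), np.random.rand(256, 256, 3), T=8)
print(video.shape)  # (8, 256, 256, 3)
```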
Step 2: construct a temporal encoder and a spatial encoder that respectively receive the video frame sequence and the original bi-temporal remote sensing image pair as inputs, add side-path connections between the two encoders, and process the features transferred from the temporal encoder to the spatial encoder with the temporal aggregation module TAM.
In this example, the spatial encoder contains 3 S-Blocks in total, while the temporal encoder contains 1 downsampling module and 4 T-Blocks. The numbers of output channels of the three S-Blocks are set to 32, 64 and 128, respectively; the first two S-Blocks use the S-Block I type and the third uses the S-Block II type. The number of output channels of the downsampling module in the temporal encoder is set to 64, and the numbers of output channels of the T-Blocks are set to 256, 512 and 512, respectively. In particular, for the 3rd T-Block, the 3×3 convolution layers on its trunk and bypass (residual branch) use a stride of 2 in both the temporal and spatial dimensions in order to achieve space-time downsampling. With this setup, both the spatial encoder and the temporal encoder have an output stride of 8, i.e., the spatial resolution of the output coding features is 1/8 of that of the input. In addition, to facilitate the exchange of temporal and spatial information, side-path connections are added between the two encoders: the intermediate features output by the 2nd and 4th T-Blocks of the temporal encoder are first processed by a temporal aggregation module TAM and then transferred to the spatial encoder as inputs of the 2nd and 3rd S-Blocks, respectively. In a specific implementation, the numbers of S-Blocks and T-Blocks can be adjusted as needed, but the number of S-Blocks must be 1 less than the number of T-Blocks and 1 more than the number of temporal aggregation modules.
Step 3: construct a progressive decoder, connect the output of each level of the spatial encoder with the input of the corresponding level of the decoder, and let the final convolution layer of the decoder output a single-channel change probability map.
In the example, the decoder concatenates a convolutional layer and 4-level D-blocks, where each D-Block receives the output of the last D-Block (or the most forward convolutional layer) and the output of the S-Block at the same level. The number of D-blocks must be guaranteed to be 1 more than that of S-blocks during implementation.
Step 4: compute the loss from the final output of the decoder and the bypass output of the temporal encoder, and optimize the weight parameters of the whole network with a gradient descent method until the loss converges.
In this example, each term of the joint loss is computed with the class-balanced cross entropy loss, the weight coefficients of the positive and negative classes are both set to 0.5, and the weight coefficient of the auxiliary loss applied to the temporal encoder is set to 0.4. The network is optimized with an Adam optimizer, the initial learning rate is set to 0.0004, and training lasts for 260,000 iterations. In a specific implementation, the training hyper-parameters can be adjusted by those skilled in the art according to the particular data set used.
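For illustration, a minimal training-loop sketch under the hyper-parameters of this example (Adam, initial learning rate 0.0004, auxiliary weight 0.4, 260,000 iterations); the model and loader objects and the joint_loss helper from the earlier sketch are assumptions standing in for the full network and data pipeline.
```python
import torch

def train(model, loader, num_iters=260_000, device="cuda"):
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
    it = 0
    while it < num_iters:
        for img_pair, frames, label in loader:          # bi-temporal pair, pseudo video, truth map
            img_pair = img_pair.to(device)
            frames = frames.to(device)
            label = label.to(device)
            p_final, p_inter = model(img_pair, frames)  # decoder output and bypass output
            loss = joint_loss(p_final, p_inter, label, lam=0.4)
            optimizer.zero_grad()
            loss.backward()                             # gradient-descent (backpropagation) update
            optimizer.step()
            it += 1
            if it >= num_iters:
                break
```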
Those of ordinary skill in the art can understand that the invention, for the first time, views the change detection problem from the perspective of video understanding and achieves more refined temporal modeling by combining two-dimensional and three-dimensional convolutional neural networks to mine the spatio-temporal features in multi-temporal image pairs. Second, the space-time decoupled design of the encoder structure strengthens the network's ability to extract spatio-temporal features, relieves the burden on the decoder, and reduces the training difficulty. Finally, the side-path connections and the temporal aggregation module both enhance the information interaction between the two encoders and make the temporal and spatial features learned by the network better matched, thereby improving the accuracy and robustness of the model.
It should be noted and appreciated that various modifications and improvements of the invention described in detail above can be made without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any particular exemplary teachings presented.

Claims (9)

1. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling is characterized by comprising the following steps of:
step 1, according to an input double-phase remote sensing image pair, a pseudo video frame sequence is obtained through a time sequence interpolation strategy, and each image in the pseudo video frame sequence has the same spatial size and numerical range as an original image;
Step 2, constructing a time encoder and a space encoder, wherein the time encoder receives a pseudo video frame sequence as input, firstly performs downsampling operation, then extracts characteristics through a cascaded three-dimensional convolution layer, the space encoder receives an original double-phase remote sensing image pair as input, extracts the characteristics through a two-dimensional convolution layer, sets unidirectional side-path connection between the two encoders, and processes the characteristics transferred from the time encoder to the space encoder through a time sequence aggregation module (temporal aggregation module, TAM);
Step 3, constructing a progressive decoder, connecting the output of each level module of the space encoder with the input of each level module in the decoder, and outputting a single-channel change probability map by the final convolution layer of the decoder;
Step 4, adding an additional convolution layer at the tail end of the time encoder, applying depth supervision to the output of the time encoder, constructing a joint loss function, and optimizing weight parameters of the whole network by using a gradient descent method until the loss converges; the convolution layer at the end of the time encoder only provides additional output in the model training stage, and a single-channel change probability map of the final convolution layer output of the decoder is still used as the final output of the network in the model reasoning stage.
2. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 1, wherein: in step 1, assuming that the first-phase original image is I1, the second-phase original image is I2, and the video contains N frames, the interpolation formula for the n-th frame image Fn is:
Fn = (1 - n/(N-1))·I1 + (n/(N-1))·I2, n = 0, 1, ..., N-1 (1)
3. the high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 1, wherein: in step 2, the spatial encoder inputs are:
Xs=concat(I1,I2) (2)
wherein I1 is the first-phase original image, I2 is the second-phase original image, and concat() denotes concatenation along the channel dimension; the basic building block of the spatial encoder is the spatial module S-Block, which is divided into two types, S-Block I and S-Block II; the initial part of both types consists of two cascaded convolution layers with corresponding BN layers and ReLU activation functions, and the tail part is a max pooling layer; compared with S-Block I, S-Block II has an additional convolution layer with a corresponding BN layer and ReLU activation function, and therefore stronger feature extraction and fitting capability, and a residual connection is added between the output of the first ReLU activation function and the output of the last normalization layer; two S-Block I and one S-Block II are connected in sequence to form the spatial encoder.
4. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 1, wherein: the inputs to the temporal encoder are:
Xt=stack(F0,F1,F2,...,FN-1) (3)
wherein stack() denotes stacking images along a new dimension; the basic building block of the temporal encoder is the temporal module T-Block; a T-Block first uses a 1×1 convolution layer to reduce the number of channels of the input features; the features are then passed through a 3×3 convolution layer to encode the spatial context and fully mine the change information; finally another 1×1 convolution layer increases the number of feature channels and thus the capacity of the model; a BN layer is added after each convolution layer, and ReLU activation functions are added after the first and second BN layers; in addition, to alleviate the gradient vanishing problem and improve the convergence behavior of the module, a residual connection is added between the input and the output of the T-Block, whose residual branch uses a 1×1 convolution layer and a BN layer to match the number of feature channels; a downsampling module composed of a convolution layer, a BN layer and a ReLU activation function connected in series is added before the first T-Block of the temporal encoder, the convolution kernel size is set to 3×9×9 and the stride to 1×4×4, so that after the downsampling module the spatial resolution of the input video frame sequence is reduced by a factor of 4, which lowers the degree of attention the temporal encoder pays to spatial information and realizes explicit space-time decoupling.
5. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 1, wherein: the temporal aggregation module TAM first applies global max pooling and global average pooling along the time dimension to the features output by a given level of the temporal encoder, obtaining a compact representation of the temporal change information contained in the T-Block features; the two pooling results are then concatenated along the channel dimension to obtain an aggregated feature; finally, the aggregated feature is transformed point-wise by a convolution layer with kernel size 1×1, a batch normalization layer and a ReLU activation function to obtain the final output.
6. The high resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 3, wherein: in step 3, the progressive decoder consists of a convolution layer followed by several decoding modules (D-Blocks), the total number of D-Blocks being one more than the number of S-Blocks in the spatial encoder; each D-Block receives two inputs, namely the output of the previous D-Block and the output of the S-Block at the same level: it first upsamples the higher-level decoding features, then concatenates the upsampling result with the same-level coding features along the channel dimension, and finally fuses the features with two convolution layers, each followed by a BN layer and a ReLU activation function, with a residual connection added between the two convolution layers to alleviate the gradient vanishing problem.
7. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 1, wherein: in step 4, the entire change detection network is trained by minimizing a joint loss function, which can be expressed as:
L=l(Pfinal,R)+λl(Pinter,R) (4)
where l denotes the specific loss function applied to each output-truth label pair, Pfinal and Pinter denote the change probability maps of the final model output (i.e., the decoder output) and the bypass output (i.e., the temporal encoder output), respectively, R denotes the ground-truth change label, and λ is the weight coefficient of the auxiliary loss; class-balanced cross entropy loss is chosen as the specific loss type:
l(P,R) = -(1/(H·W))·Σi=1..H Σj=1..W [wc·Rij·log(Pij) + wu·(1-Rij)·log(1-Pij)] (5)
where H and W denote the height and width of the image, i and j index the i-th row and j-th column, and wc and wu are the class weight coefficients of the changed and unchanged classes, respectively; an Adam optimizer is used to perform the gradient updates that minimize the loss function.
8. The high resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 3, wherein: the spatial encoder includes 3 S-Blocks, and the numbers of output channels of the three S-Blocks are set to 32, 64 and 128, respectively.
9. The high-resolution remote sensing image change detection algorithm based on video understanding and space-time decoupling as claimed in claim 4, wherein: the temporal encoder comprises 1 downsampling module and 4 T-Blocks; the number of output channels of the downsampling module is set to 64, and the numbers of output channels of the T-Blocks are set to 256, 512 and 512, respectively.
CN202210742299.2A 2022-06-27 2022-06-27 High-resolution remote sensing image change detection method based on video understanding and space-time decoupling Active CN115147760B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210742299.2A CN115147760B (en) 2022-06-27 2022-06-27 High-resolution remote sensing image change detection method based on video understanding and space-time decoupling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210742299.2A CN115147760B (en) 2022-06-27 2022-06-27 High-resolution remote sensing image change detection method based on video understanding and space-time decoupling

Publications (2)

Publication Number Publication Date
CN115147760A CN115147760A (en) 2022-10-04
CN115147760B (en) 2024-04-19

Family

ID=83410214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210742299.2A Active CN115147760B (en) 2022-06-27 2022-06-27 High-resolution remote sensing image change detection method based on video understanding and space-time decoupling

Country Status (1)

Country Link
CN (1) CN115147760B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259853A (en) * 2020-02-04 2020-06-09 中国科学院计算技术研究所 High-resolution remote sensing image change detection method, system and device
CN112577473A (en) * 2020-12-21 2021-03-30 陕西土豆数据科技有限公司 Double-time-phase high-resolution remote sensing image change detection algorithm
CN112949549A (en) * 2021-03-19 2021-06-11 中山大学 Super-resolution-based change detection method for multi-resolution remote sensing image
CN113420662A (en) * 2021-06-23 2021-09-21 西安电子科技大学 Remote sensing image change detection method based on twin multi-scale difference feature fusion
CN114359723A (en) * 2021-12-27 2022-04-15 陕西科技大学 Remote sensing image change detection method based on space spectrum feature fusion network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Current status and prospects of multi-temporal remote sensing image change detection; 张良培; 武辰; Acta Geodaetica et Cartographica Sinica; 2017-10-15 (No. 10); 249-261 *
A survey of remote sensing image change detection algorithms; 佟国峰; 李勇; 丁伟利; 岳晓阳; Journal of Image and Graphics; 2015-12-16 (No. 12); 5-15 *

Also Published As

Publication number Publication date
CN115147760A (en) 2022-10-04

Similar Documents

Publication Publication Date Title
CN111325751B (en) CT image segmentation system based on attention convolution neural network
CN112669325B (en) Video semantic segmentation method based on active learning
CN115049936B (en) High-resolution remote sensing image-oriented boundary enhanced semantic segmentation method
CN115601549B (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN111178316A (en) High-resolution remote sensing image land cover classification method based on automatic search of depth architecture
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
Wang et al. TF-SOD: a novel transformer framework for salient object detection
CN113554032A (en) Remote sensing image segmentation method based on multi-path parallel network of high perception
CN114693929A (en) Semantic segmentation method for RGB-D bimodal feature fusion
CN115797635A (en) Multi-stage instance segmentation method and system based on parallel feature completion
CN113392727B (en) RGB-D salient object detection method based on dynamic feature selection
Chong et al. Multi-hierarchy feature extraction and multi-step cost aggregation for stereo matching
Xing et al. MABNet: a lightweight stereo network based on multibranch adjustable bottleneck module
CN115147760B (en) High-resolution remote sensing image change detection method based on video understanding and space-time decoupling
CN112419325A (en) Super-pixel segmentation method based on deep learning
Gao et al. Multi-branch aware module with channel shuffle pixel-wise attention for lightweight image super-resolution
CN116071281A (en) Multi-mode image fusion method based on characteristic information interaction
CN115731280A (en) Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network
CN115187777A (en) Image semantic segmentation method under data set manufacturing difficulty
Wu et al. Lightweight stepless super-resolution of remote sensing images via saliency-aware dynamic routing strategy
Geng et al. Dual-path feature aware network for remote sensing image semantic segmentation
Yang et al. SA-MVSNet: Self-attention-based multi-view stereo network for 3D reconstruction of images with weak texture
CN116152441B (en) Multi-resolution U-net curved surface reconstruction method based on depth priori
Xu et al. CVE-Net: cost volume enhanced network guided by sparse features for stereo matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant