CN117765361B - Method for detecting building change area in double-time-phase remote sensing image based on residual neural network

Method for detecting building change area in double-time-phase remote sensing image based on residual neural network

Info

Publication number
CN117765361B
Authority
CN
China
Prior art keywords
layer
feature map
convolution layer
convolution
remote sensing
Prior art date
Legal status
Active
Application number
CN202311808119.7A
Other languages
Chinese (zh)
Other versions
CN117765361A
Inventor
刘志恒
张文杰
李晨阳
周绥平
刘彦明
盛家豪
江澄
节永师
Current Assignee
Xidian University
Beijing Institute of Space Research Mechanical and Electricity
Original Assignee
Xidian University
Beijing Institute of Space Research Mechanical and Electricity
Priority date
Filing date
Publication date
Application filed by Xidian University and Beijing Institute of Space Research Mechanical and Electricity
Priority to CN202311808119.7A
Publication of CN117765361A
Application granted
Publication of CN117765361B
Legal status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

A method for detecting building change regions in double-phase remote sensing images based on a residual neural network comprises the following steps: constructing an input sub-network for remote sensing image building change; constructing a feature extraction sub-network for remote sensing image building change; constructing a feature fusion sub-network for remote sensing image building change; constructing a prediction sub-network for remote sensing image building change; constructing a remote sensing image building change detection network; generating a training set, a verification set and a test set; training the remote sensing image building change detection network; and detecting the change range of buildings in the remote sensing images. By combining extrusion excitation (squeeze-and-excitation) with residual convolution, the invention constructs an extrusion-excitation residual convolution to replace the original convolution; it can selectively emphasize the features of change areas and suppress the features of unchanged buildings, with the technical effect of highlighting change features and facilitating their extraction.

Description

Method for detecting building change area in double-time-phase remote sensing image based on residual neural network
Technical Field
The invention belongs to the technical field of remote sensing image detection, and particularly relates to a method for detecting a building change region in a double-time-phase remote sensing image based on a residual neural network.
Background
Changes in surface buildings are an important part of the dynamic evolution of human social activity and an important factor in the harmonious coexistence of people and nature; they provide necessary basic data and thematic information for decisions on land-use monitoring, urban planning, environmental assessment, urban expansion, pre-disaster monitoring, post-disaster assessment and management. As urbanization accelerates, regional buildings change continuously, accompanied by unauthorized reconstruction, extension and demolition of buildings, which profoundly affect regional resource utilization and environmental improvement. A building change detection technology that is accurate, efficient and easy to deploy on satellites is therefore of great significance for business management such as regional natural resource optimization and urban and rural development.
Because remote sensing images offer wide coverage, easy acquisition and all-day, all-weather observation, building change detection based on remote sensing images has gradually replaced traditional field investigation and screening methods and manual delineation of changes on images, reducing the cost in labor, time and finances. Building change detection based on remote sensing images relies on features of the building in the image, such as shape, size and texture, to judge whether the building has changed and to delineate the change range. However, with the continuous progress of society, such building change detection techniques can hardly meet the basic requirements of current digital city and smart city development. In recent years, with the improvement of remote sensing image resolution and of the computing power of computer equipment, and with the development of deep learning technology, researchers have converted the change detection task into an image semantic segmentation problem: double-phase remote sensing images are input into a deep learning model in batches, a weight result is obtained through continuous training of the model, and the building change regions in the double-phase remote sensing images can be detected quickly and accurately, improving the detection efficiency of the building change range.
Yin Meijie, Ni Cui and other authors propose a UNet-based remote sensing image building change detection method in the paper "Remote sensing image building change detection based on semantic segmentation" (Journal of Applied Sciences, 2023, volume 41, no. 3, pages 448-460). The specific steps are as follows: first, a lightweight Efficient Channel Attention module (ECANet) is introduced into the original UNet network model and the skip-link structure of the original UNet is adjusted to improve the segmentation accuracy of building change areas; then the parameters of the Squeeze-and-Excitation attention module (SENet) are improved to raise the precision of building change detection in remote sensing images; finally, the feature map is restored to the original image size through a convolution block as the building change detection result. The drawback of this method is that ECANet mainly focuses on local cross-channel interaction and adaptive selection of the convolution kernel, so the correspondence between the weight vector and the input is still imperfect; reducing model complexity while only considering the interaction between a channel and its neighbors weakens the feature expression ability of the model. In addition, the dimensionality reduction in SENet, designed for classification, can adversely affect the channel attention mechanism and cannot efficiently capture the dependencies among all channels.
Huiwei Jiang, Xiangyun Hu and other authors, in their published paper "PGA-SiamNet: Pyramid Feature-Based Attention-Guided Siamese Network for Remote Sensing Orthoimagery Building Change Detection" (Remote Sensing, 2020, volume 12, page 484), propose a method for detecting building changes in remote sensing images based on a feature pyramid attention-guided twin (Siamese) convolutional network, PGA-SiamNet. The specific steps are as follows: the network is trained with a feature pyramid convolutional neural network so that it captures the features of building change areas and enhances the expression of multi-scale change features; a global co-attention mechanism is introduced to emphasize the importance of correlations between the input features, so that the network pays more attention to the changed building parts in the remote sensing images, improving long-range feature dependence and making it easier to obtain richer change information; finally, a segmentation result image is generated through convolution to obtain the building change area. The drawbacks of this method are that the model has many parameters and is large, it depends on the reliability and quality of sample labeling, its detection capability differs for remote sensing images of different sizes, the labeling cost of the samples is high, and the change detection efficiency is low.
Zhu Jiezhong, Chen Yong and other authors, in their published paper "Siam-UNet++ based building change detection for high-resolution remote sensing images" (Computer Application Research, 2021, volume 38, no. 4, pages 3460-3465), propose a Siam-UNet++ based deep neural network algorithm to detect building change ranges in high-resolution remote sensing images. The specific steps are as follows: first, a Siam-diff (Siamese-diff) structure is applied in the UNet++ encoder to extract building change features from the earlier and later remote sensing images, and a Triplet Attention (TA) mechanism is introduced after up-sampling and the skip-link path in the decoder, so that the network focuses more on the changed parts of buildings and suppresses the learning of other kinds of features; second, a multiple side-output fusion (MSOF) strategy is used to weight and fuse the change feature information of different semantic levels; finally, a sliding-window method is adopted to detect building changes over the whole large-scale remote sensing image, alleviating the image holes and misalignment produced by traditional algorithms when stitching results. The drawback of this method is that the TA module uses residual transformation to establish dependencies among dimensions, eliminating the indirect correspondence between channels and weights and reducing computation, but its detection effect is weaker than that of earlier models; in particular, for building change data sets with large scale differences, the model still shows missed detections and false detections.
Existing methods for detecting building changes in remote sensing images have the following defects:
1. When the two phases of a double-time-phase remote sensing image pair are not acquired synchronously, feeding them directly into a change detection model makes the result susceptible to differences in spectrum, texture and the like, especially interference from bright clouds and snow, so the change detection result still contains many false detections.
2. Remote sensing images contain rich spectral information and texture features of ground objects, and buildings themselves have diverse structures, so traditional change detection algorithms based on image processing cannot achieve very accurate detection when extracting multi-level, multi-scale features of all categories.
3. Building change detection methods based on encoder-decoder neural networks lose resolution during down-sampling and segment large building change areas poorly; when the data set categories are unbalanced, the model tends to favor the classes with more samples; and at edge segmentation, deconvolution causes overlap and aliasing between pixels, producing jagged edges.
4. Compared with a conventional neural network, building change detection based on a twin (Siamese) convolutional neural network has to compute the similarity of two inputs, so training requires more computation and time. The output is a distance between the two classes rather than a probability, so more processing is required to obtain the final change detection result. It is difficult to handle complete occlusion and out-of-view cases. At inference time, nearest-neighbor search has to be used, which tends to ignore interference from background information.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a method for detecting building change areas in double-time-phase remote sensing images based on a residual neural network, which combines extrusion excitation with residual convolution to construct an extrusion-excitation residual convolution that replaces the original convolution, selectively emphasizing the features of change areas and suppressing the features of unchanged buildings.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
A method for detecting a building change region in a double-phase remote sensing image based on a residual neural network comprises the following steps:
Step 1, constructing an input sub-network of remote sensing image building change, which is used for inputting double-phase remote sensing image data;
Step 2, constructing a feature extraction sub-network of the building change of the remote sensing image, and extracting features of a building change area from the double-phase remote sensing image data input in the step 1;
Step 3, constructing a feature fusion sub-network of the remote sensing image building change, and fusing the features of the building change area extracted in the step 2 to obtain building change feature diagrams extracted in different layers;
step 4, constructing a prediction sub-network of the remote sensing image building change, and recovering the building change feature map obtained in the step 3 to the original feature map size to obtain a final building change area;
step 5, constructing a remote sensing image building change detection network based on the input sub-network constructed in the step 1, the feature extraction sub-network constructed in the step 2, the feature fusion sub-network constructed in the step 3 and the prediction sub-network constructed in the step 4;
step 6, generating a training set, a verification set and a test set;
step 7, training the remote sensing image building change detection network constructed in the step 5 by using the training set and the verification set generated in the step 6 to obtain a model weight file;
Step 8, detecting the remote sensing image building change range in the test set generated in the step 6 by using the model weight file obtained through training in the step 7.
The specific method of the step 1 is as follows:
An input sub-network for preprocessing the double-phase remote sensing image data is built. The input images of the input sub-network cover two moments, time T1 and time T2, where T1 is the earlier moment and T2 is the later moment. The remote sensing image at time T2 is histogram-matched to the histogram of the remote sensing image at time T1; the histogram matching is realized with the match_histograms function, yielding the histogram-matched T2 image T2'. The histogram-matched image T2' and the T1 image are then channel-merged using the cat function, with merging performed in the first dimension, to obtain the merged double-time-phase remote sensing image.
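By way of illustration, the following minimal Python sketch shows one way this preprocessing could be realized with scikit-image and PyTorch; the function name preprocess_pair, the tensor layout and the channel_axis argument (scikit-image 0.19 or later) are assumptions for illustration, not taken from the patent.

```python
import numpy as np
import torch
from skimage.exposure import match_histograms

def preprocess_pair(t1_img: np.ndarray, t2_img: np.ndarray) -> torch.Tensor:
    """t1_img, t2_img: H x W x 3 uint8 remote sensing images of the same scene."""
    # Match the T2 histogram to the T1 histogram, channel by channel.
    t2_matched = match_histograms(t2_img, t1_img, channel_axis=-1)

    # Convert to C x H x W float tensors.
    t1 = torch.from_numpy(t1_img.astype(np.float32) / 255.0).permute(2, 0, 1)
    t2 = torch.from_numpy(t2_matched.astype(np.float32) / 255.0).permute(2, 0, 1)

    # Merge along the channel dimension: 3 + 3 = 6 channels.
    return torch.cat([t1, t2], dim=0)
```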
The specific method of the step 2 is as follows:
A feature extraction sub-network with 5 downsampling modules is built, and the structure of the feature extraction sub-network is as follows: downsampling block 1, downsampling block 2, downsampling block 3, downsampling block 4 and downsampling block 5;
step 2.1, constructing a module of a downsampling block 1 in a feature extraction sub-network of remote sensing image building change;
The downsampling block 1 comprises three layers, namely a first convolution layer, a second convolution layer and an extrusion excitation layer, wherein the first convolution layer is used for extracting basic features from the input double-phase remote sensing image data; the second convolution layer is used for processing and combining the basic features extracted by the first convolution layer to generate deeper change features; the extrusion excitation layer is used for multiplying different channels by different weight coefficients so that the model pays more attention to channels carrying more information, and comprises extrusion, excitation and scale operations: the extrusion operation adopts global average pooling, the excitation operation comprises two fully connected layers whose bias is False, and the scale operation is a multiplication function, namely the input feature map is multiplied by the result of the excitation operation to obtain the final weighted feature map;
Step 2.2, constructing a module of a downsampling block 2 in a feature extraction sub-network of remote sensing image building change;
The downsampling block 2 comprises: a maximum pooling layer, a first convolution layer, a second convolution layer and an extrusion excitation layer, wherein the maximum pooling layer is used for downsampling the length and width of the feature map to half those of the feature map of downsampling block 1; the first convolution layer is used for extracting basic features from the max-pooled feature map; the second convolution layer is used for processing and combining the basic features extracted by the first convolution layer to generate deeper change features; the extrusion excitation layer is used for multiplying different channels by different weight coefficients so that the model pays more attention to channels carrying more information, and comprises extrusion, excitation and scale operations: the extrusion operation adopts global average pooling, the excitation operation comprises two fully connected layers whose bias is False, and the scale operation is a multiplication function, namely the input feature map is multiplied by the result of the excitation operation to obtain the final weighted feature map;
Step 2.3, constructing a module of a downsampling block 3 in a feature extraction sub-network of remote sensing image building change;
The downsampling block 3 comprises: a maximum pooling layer, a first convolution layer, a second convolution layer and an extrusion excitation layer, wherein the maximum pooling layer is used for downsampling the length and width of the feature map to half those of the feature map of downsampling block 2; the first convolution layer is used for extracting basic features from the max-pooled feature map; the second convolution layer is used for processing and combining the basic features extracted by the first convolution layer to generate deeper change features; the extrusion excitation layer is used for multiplying different channels by different weight coefficients so that the model pays more attention to channels carrying more information, and comprises extrusion, excitation and scale operations: the extrusion operation adopts global average pooling, the excitation operation comprises two fully connected layers whose bias is False, and the scale operation is a multiplication function, namely the input feature map is multiplied by the result of the excitation operation to obtain the final weighted feature map;
step 2.4, constructing a module of a downsampling block 4 in a feature extraction sub-network of remote sensing image building change;
The downsampling block 4 comprises: a maximum pooling layer, a first convolution layer, a second convolution layer and an extrusion excitation layer, wherein the maximum pooling layer is used for downsampling the length and width of the feature map to half those of the feature map of downsampling block 3; the first convolution layer is used for extracting basic features from the max-pooled feature map; the second convolution layer is used for processing and combining the basic features extracted by the first convolution layer to generate deeper change features; the extrusion excitation layer is used for multiplying different channels by different weight coefficients so that the model pays more attention to channels carrying more information, and comprises extrusion, excitation and scale operations: the extrusion operation adopts global average pooling, the excitation operation comprises two fully connected layers whose bias is False, and the scale operation is a multiplication function, namely the input feature map is multiplied by the result of the excitation operation to obtain the final weighted feature map;
step 2.5, constructing a module of a downsampling block 5 in a feature extraction sub-network of remote sensing image building change;
The downsampling block 5 comprises: a maximum pooling layer, a first convolution layer, a second convolution layer and an extrusion excitation layer, wherein the maximum pooling layer is used for downsampling the length and width of the feature map to half those of the feature map of downsampling block 4; the first convolution layer is used for extracting basic features from the max-pooled feature map; the second convolution layer is used for processing and combining the basic features extracted by the first convolution layer to generate deeper change features; the extrusion excitation layer is used for multiplying different channels by different weight coefficients so that the model pays more attention to channels carrying more information, and comprises extrusion, excitation and scale operations: the extrusion operation adopts global average pooling, the excitation operation comprises two fully connected layers whose bias is False, and the scale operation is a multiplication function, namely the input feature map is multiplied by the result of the excitation operation to obtain the final weighted feature map.
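As an illustration of the downsampling blocks described above, the following PyTorch sketch assembles an extrusion (squeeze) excitation layer and a residual downsampling block; the 1×1 shortcut convolution and the reduction factor of 16 follow the embodiment described later in this document, while the class names and the overall wiring are assumptions, not code taken from the patent.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Extrusion-excitation (squeeze-and-excitation) channel attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)                       # extrusion: global average pooling
        self.excite = nn.Sequential(                                 # excitation: two FC layers, bias=False
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                                 # scale: multiply input by channel weights

class SEResDownBlock(nn.Module):
    """Down-sampling block: [optional max-pool] -> conv -> conv -> residual add -> SE."""
    def __init__(self, in_ch: int, out_ch: int, pool: bool = True):
        super().__init__()
        self.pool = nn.MaxPool2d(2) if pool else nn.Identity()
        self.conv1 = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.shortcut = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                                      nn.BatchNorm2d(out_ch))
        self.se = SELayer(out_ch)

    def forward(self, x):
        x = self.pool(x)
        out = self.conv2(self.conv1(x))
        out = out + self.shortcut(x)          # residual connection through a 1x1 convolution
        return self.se(out)
```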
The specific method of the step 3 is as follows:
The feature fusion sub-network comprises a feature pyramid attention module, an up-sampling block 1, an up-sampling block 2, an up-sampling block 3, an up-sampling block 4 and an attention gating structure;
Step 3.1, constructing a feature pyramid attention module of a feature fusion sub-network;
The feature pyramid attention module structurally comprises: the system comprises a first convolution layer, a global average pooling layer, a second convolution layer, a bilinear upsampling layer, a first average pooling layer, a third convolution layer, a fourth convolution layer, a first expansion convolution layer, a second average pooling layer, a fifth convolution layer, a sixth convolution layer, a third expansion convolution layer and a fourth expansion convolution layer;
The first convolution layer is used for reducing the dimensionality of the input data; the global average pooling layer is used for downsampling the input layer at each pixel to a specified size to realize feature extraction; the second convolution layer is used for reducing the dimensionality of the globally average-pooled feature map and the number of model parameters; the bilinear upsampling layer is used for enlarging the size of the feature map, increasing the density of pixels and improving the image resolution; the first average pooling layer is used for reducing the size of the feature map, lowering the risk of overfitting by averaging it; the third convolution layer is used for extracting basic features from the first average pooling layer feature map; the fourth convolution layer is used for integrating the feature map of the third convolution layer to generate a deeper feature map; the first expansion convolution layer is used for increasing the convolution kernel size so as to enlarge the range over which features of the second average pooling layer are extracted; the second expansion convolution layer captures details of the change features while maintaining the resolution of the first expansion convolution feature map; the second average pooling layer is used for reducing the size of the first average pooling layer feature map, lowering the risk of overfitting by averaging it; the fifth convolution layer is used for extracting basic features from the second average pooling layer feature map; the sixth convolution layer is used for integrating the feature map of the fifth convolution layer to generate a deeper feature map; the third expansion convolution layer is used for increasing the convolution kernel size so as to enlarge the range over which features of the second average pooling layer are extracted; the fourth expansion convolution layer captures details of the change features while maintaining the resolution of the third expansion convolution feature map;
step 3.2, an up-sampling block 1 of the feature fusion sub-network is constructed;
The structure of the upper sampling block 1 includes: the device comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
The first convolution layer is used for extracting basic features of the feature pyramid attention module feature map in the step 3.1;
The attention gating layer comprises a gate convolution layer, an L convolution layer, a ReLU activation layer, a simple convolution layer and a Sigmoid activation layer; the gate convolution is used to halve the number of channels of the first convolution layer feature map without changing the feature map size; the L convolution is used to halve the number of channels of the feature map of the downsampling block 4 without changing the feature map size; the ReLU activation layer is used to add the gate convolution layer feature map and the L convolution layer feature map, set negative values of the result directly to zero and keep the positive part unchanged; the simple convolution layer is used to extract the basic features of the ReLU activation layer output; the Sigmoid activation layer is used to map the simple convolution layer feature map to the range 0 to 1, and the downsampling block feature map is multiplied by the Sigmoid activation layer feature map to give the output feature map;
The splicing layer is used for executing channel merging operation on the first convolution layer feature map and the attention gating layer feature map, wherein the channel merging uses a cat function, and merging is executed in a first dimension to obtain a feature map after channel merging;
the second convolution layer is used for extracting the basic characteristics of the spliced layer after the channels are combined;
the third convolution layer is used for processing and integrating the basic features extracted by the second convolution layer to generate a deeper feature map;
The extrusion excitation layer is used for multiplying different weight coefficients for different channels so that the model focuses on the channel with large information quantity more and a feature map is generated;
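The following hedged PyTorch sketch illustrates one way the attention-gated up-sampling block described above could be realized. It reuses the hypothetical SELayer from the earlier sketch; the stride-2 transposed convolution standing in for the "first convolution layer" and the single-channel attention map are assumptions, since the text does not fix these details.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, dec_ch: int, skip_ch: int):
        super().__init__()
        inter_ch = skip_ch // 2                          # both 1x1 convs halve the channel count
        self.gate_conv = nn.Conv2d(dec_ch, inter_ch, 1)  # "gate convolution" on the decoder feature
        self.l_conv = nn.Conv2d(skip_ch, inter_ch, 1)    # "L convolution" on the encoder skip feature
        self.relu = nn.ReLU(inplace=True)
        self.simple_conv = nn.Conv2d(inter_ch, 1, 1)     # "simple convolution" -> single attention map (assumed)
        self.sigmoid = nn.Sigmoid()

    def forward(self, dec, skip):
        att = self.sigmoid(self.simple_conv(self.relu(self.gate_conv(dec) + self.l_conv(skip))))
        return skip * att                                # re-weight the skip (downsampling block) feature map

class AttentionUpBlock(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, skip_ch, 2, stride=2)   # assumed "first convolution layer"
        self.gate = AttentionGate(skip_ch, skip_ch)
        self.conv = nn.Sequential(
            nn.Conv2d(skip_ch * 2, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.se = SELayer(out_ch)                        # SELayer from the earlier sketch

    def forward(self, dec, skip):
        dec = self.up(dec)
        gated = self.gate(dec, skip)
        x = torch.cat([dec, gated], dim=1)               # "splicing layer": channel concatenation
        return self.se(self.conv(x))
```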
Step 3.3, an up-sampling block 2 of the feature fusion sub-network is constructed;
the structure of the up-sampling block 2 includes: the device comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
The first convolution layer is used for extracting basic features of the feature map of the up-sampling block 1 from the step 3.2;
The attention gating layer comprises a gate convolution layer, an L convolution layer, a ReLU activation layer, a simple convolution layer and a Sigmoid activation layer; the gate convolution is used to halve the number of channels of the first convolution layer feature map without changing the feature map size; the L convolution is used to halve the number of channels of the feature map of the downsampling block 3 without changing the feature map size; the ReLU activation layer is used to add the gate convolution layer feature map and the L convolution layer feature map, set negative values of the result directly to zero and keep the positive part unchanged; the simple convolution layer is used to extract the basic features of the ReLU activation layer output; the Sigmoid activation layer is used to map the simple convolution layer feature map to the range 0 to 1, and the downsampling block feature map is multiplied by the Sigmoid activation layer feature map to give the output feature map;
The splicing layer is used for executing channel merging operation on the first convolution layer feature map and the attention gating layer feature map, wherein the channel merging uses a cat function, and merging is executed in a first dimension to obtain a feature map after channel merging;
the second convolution layer is used for extracting the basic characteristics of the spliced layer after the channels are combined;
the third convolution layer is used for processing and integrating the basic features extracted by the second convolution layer to generate a deeper feature map;
The extrusion excitation layer is used for multiplying different weight coefficients for different channels so that the model focuses on the channel with large information quantity more and a feature map is generated;
step 3.4, an up-sampling block 3 of the feature fusion sub-network is constructed;
the structure of the up-sampling block 3 includes: the device comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
the first convolution layer is used for extracting basic features of the feature map of the upper sampling block 2 in the step 3.3;
The attention gating layer comprises a gate convolution layer, an L convolution layer, a ReLU activation layer, a simple convolution layer and a Sigmoid activation layer; the gate convolution is used to halve the number of channels of the first convolution layer feature map without changing the feature map size; the L convolution is used to halve the number of channels of the feature map of the downsampling block 2 without changing the feature map size; the ReLU activation layer is used to add the gate convolution layer feature map and the L convolution layer feature map, set negative values of the result directly to zero and keep the positive part unchanged; the simple convolution layer is used to extract the basic features of the ReLU activation layer output; the Sigmoid activation layer is used to map the simple convolution layer feature map to the range 0 to 1, and the downsampling block feature map is multiplied by the Sigmoid activation layer feature map to give the output feature map;
The splicing layer is used for executing channel merging operation on the first convolution layer feature map and the attention gating layer feature map, wherein the channel merging uses a cat function, and merging is executed in a first dimension to obtain a feature map after channel merging;
the second convolution layer is used for extracting the basic characteristics of the spliced layer after the channels are combined;
the third convolution layer is used for processing and integrating the basic features extracted by the second convolution layer to generate a deeper feature map;
The extrusion excitation layer is used for multiplying different weight coefficients for different channels so that the model focuses on the channel with large information quantity more and a feature map is generated;
step 3.5, an up-sampling block 4 of the feature fusion sub-network is constructed;
the structure of the upsampling block 4 comprises: the device comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
The first convolution layer is used for extracting basic features of the feature map of the upper sampling block 3 in the step 3.4;
The attention gating layer comprises a gate convolution layer, an L convolution layer, a ReLU activation layer, a simple convolution layer and a Sigmoid activation layer; the gate convolution is used to halve the number of channels of the first convolution layer feature map without changing the feature map size; the L convolution is used to halve the number of channels of the feature map of the downsampling block 1 without changing the feature map size; the ReLU activation layer is used to add the gate convolution layer feature map and the L convolution layer feature map, set negative values of the result directly to zero and keep the positive part unchanged; the simple convolution layer is used to extract the basic features of the ReLU activation layer output; the Sigmoid activation layer is used to map the simple convolution layer feature map to the range 0 to 1, and the downsampling block feature map is multiplied by the Sigmoid activation layer feature map to give the output feature map;
The splicing layer is used for executing channel merging operation on the first convolution layer feature map and the attention gating layer feature map, wherein the channel merging uses a cat function, and merging is executed in a first dimension to obtain a feature map after channel merging;
the second convolution layer is used for extracting the basic characteristics of the spliced layer after the channels are combined;
the third convolution layer is used for processing and integrating the basic features extracted by the second convolution layer to generate a deeper feature map;
The extrusion excitation layer is used for multiplying different weight coefficients for different channels so that the model focuses on the channel with large information amount more and a feature map is generated.
Step 4, constructing a prediction sub-network of remote sensing image building change;
The prediction sub-network consists of a convolution layer and is used for recovering the characteristic map of the characteristic fusion sub-network to the size of the characteristic map of the double-phase remote sensing image.
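A minimal sketch of such a prediction sub-network is given below, assuming a single 1×1 convolution that maps the fused feature map to a one-channel change map; the 64-channel input mirrors downsampling block 1 and is an assumption, not a value stated for the decoder in this section.

```python
import torch.nn as nn

# Hypothetical prediction head: 64-channel fused feature map -> 1-channel change logits.
prediction_head = nn.Conv2d(in_channels=64, out_channels=1, kernel_size=1)
# Output: N x 1 x 256 x 256 logits; the Sigmoid inside BCEWithLogitsLoss during training
# (or applied explicitly at inference) turns the logits into change probabilities.
```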
The specific method in the step 5 is as follows:
step 5.1, connecting the input sub-network constructed in the step 1 and the downsampling block 1 of the feature extraction sub-network constructed in the step 2 in series;
step 5.2, cascading the feature map of the building change area extracted in the step 2 with the feature map of the building change area extracted by the up-sampling block in the step 3 through an attention gating unit;
Step 5.3, connecting the building change area feature map extracted by the up-sampling block in the step 3 with the prediction sub-network constructed in the step 4 in series.
The specific method of the step 6 is as follows:
Step 6.1, acquiring a remote sensing image building change detection public dataset LEVIR-CD;
step 6.2, cutting the image in the data set obtained in the step 6.1;
Step 6.3, dividing the image data set cut in the step 6.2 into a training set, a verification set and a test set in proportion.
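A possible sketch of this data preparation follows, assuming the 1024×1024 LEVIR-CD tiles are cut into non-overlapping 256×256 patches to match the network input size; the 7:2:1 split ratio and the file paths are placeholders, since the text does not state them here.

```python
import random
from pathlib import Path
from PIL import Image

def crop_tiles(src_dir: Path, dst_dir: Path, patch: int = 256) -> None:
    """Cut each large tile into non-overlapping patch x patch images."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(src_dir.glob("*.png")):
        img = Image.open(path)
        w, h = img.size
        for top in range(0, h - patch + 1, patch):
            for left in range(0, w - patch + 1, patch):
                tile = img.crop((left, top, left + patch, top + patch))
                tile.save(dst_dir / f"{path.stem}_{top}_{left}.png")

def split_names(names: list[str], ratios=(0.7, 0.2, 0.1), seed: int = 0):
    """Shuffle and split file names into train / verification / test subsets (placeholder ratio)."""
    names = names[:]
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * ratios[0])
    n_val = int(len(names) * ratios[1])
    return names[:n_train], names[n_train:n_train + n_val], names[n_train + n_val:]
```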
The specific method of the step 7 is as follows:
Step 7.1, setting training parameters: each time, remote sensing images of the two periods (before and after) are selected randomly and without repetition from the training set and input into the network; the loss function uses the binary cross-entropy loss function with Sigmoid, BCEWithLogitsLoss;
The binary cross-entropy loss function with Sigmoid, BCEWithLogitsLoss, is as follows:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\right], \qquad p_i = \sigma(x_i) = \frac{1}{1 + e^{-x_i}}$$

wherein σ(·) is the Sigmoid function, log is the natural logarithm, p_i represents the probability that sample x_i is predicted to be a positive example, and y_i represents the true label of sample x_i; in the two-class problem, y_i is typically 0 or 1, indicating whether sample x_i belongs to the positive class;
Step 7.2, sequentially inputting the training set and the verification set generated in the step 6 into the remote sensing image building change detection network constructed in the step 5; the method comprises the following specific steps: firstly, matching a remote sensing image at the time T2 according to a histogram of the remote sensing image at the time T1 input into an input sub-network, and realizing channel combination; secondly, inputting the input data into a feature extraction sub-network to perform downsampling feature extraction; thirdly, inputting an output result of the feature extraction sub-network into a feature fusion sub-network to obtain a feature map after feature fusion; and finally, restoring the feature map after feature fusion to the original image size by utilizing a prediction sub-network.
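A hedged sketch of this training loop is given below; only the BCEWithLogitsLoss criterion and the "best_loss.pth" checkpoint name come from the text, while the Adam optimizer, learning rate, epoch count and the assumed model and data loaders are illustrative choices.

```python
import torch

def train(model, train_loader, val_loader, num_epochs: int = 100) -> None:
    """Training-loop sketch for step 7; loaders are assumed to yield (t1, t2, label) batches."""
    criterion = torch.nn.BCEWithLogitsLoss()                    # BCE with built-in Sigmoid, as in step 7.1
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # optimizer and lr are assumptions
    best_val = float("inf")

    for _ in range(num_epochs):
        model.train()
        for t1, t2, label in train_loader:                      # randomly sampled, non-repeating pairs
            inputs = torch.cat([t1, t2], dim=1)                 # channel merge; histogram matching assumed done in the dataset
            loss = criterion(model(inputs), label.float())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for t1, t2, label in val_loader:
                val_loss += criterion(model(torch.cat([t1, t2], dim=1)), label.float()).item()
        if val_loss < best_val:                                 # keep the weights with the lowest verification loss
            best_val = val_loss
            torch.save(model.state_dict(), "best_loss.pth")
```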
The specific method of the step 8 is as follows:
Step 8.1, inputting the test set generated in the step 6 into the remote sensing image building change detection network trained in the step 7, and loading a remote sensing image building change detection network parameter file 'best_loss.pth' trained in the step 7.1;
Step 8.2, performing remote sensing image building change detection on the image in the test set to obtain a change range of the building;
Step 8.3, outputting the detection result of the remote sensing image building change detection network and storing it as a label file in "png" format.
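The detection stage could then be sketched as follows, assuming the trained model and a test_samples iterator already exist and reusing the hypothetical preprocess_pair helper from the step 1 sketch; the 0.5 threshold and the output file naming are illustrative.

```python
import numpy as np
import torch
from PIL import Image

def detect(model, test_samples, weight_path: str = "best_loss.pth", out_dir: str = ".") -> None:
    """Inference sketch for step 8; test_samples is assumed to yield (name, t1_img, t2_img) numpy pairs."""
    model.load_state_dict(torch.load(weight_path, map_location="cpu"))
    model.eval()
    with torch.no_grad():
        for name, t1_img, t2_img in test_samples:
            x = preprocess_pair(t1_img, t2_img).unsqueeze(0)    # helper from the step 1 sketch
            prob = torch.sigmoid(model(x))[0, 0].numpy()
            mask = (prob > 0.5).astype(np.uint8) * 255          # changed pixels -> 255, unchanged -> 0
            Image.fromarray(mask).save(f"{out_dir}/{name}_change.png")
```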
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention designs a building change detection neural network in which an extrusion excitation module is introduced into both the encoder and the decoder of the model. The extrusion excitation module is a channel attention module that extracts the feature map of building change information and strengthens it along the channel dimension without changing its size. By learning channel features and combining them with the input feature map, the model focuses more on channel information and finally obtains a feature map endowed with channel attention, improving the capability and accuracy of the model in detecting building change information; the residual structure adds more nonlinearity, fits the complex correlations among channels better, and keeps the model lightweight.
(2) A conventional UNet network has to combine research objects of different sizes at its skip links. The attention gating unit allows the model to learn and attend to building change targets of different sizes; the recovered image is finer than one obtained by direct up-sampling, the unit adds few parameters, and it is easy to deploy in other encoder-decoder based image segmentation models. The invention therefore improves the skip-link scheme: when the down-sampled and up-sampled feature maps are combined, an attention gating unit is introduced so that the model pays more attention to building change information on the feature map channels. The attention gating mechanism allows the model to extract features of different scales on a single feature layer and achieves attention awareness in the channel dimension. Introducing the attention gating unit improves the model's ability to fuse building change feature maps of different scales, raises the extraction precision for multi-scale building change information and the performance in detecting building change information, and at the same time reduces the computation cost.
(3) In the prior art, when feature maps of different scales are connected by skip links, the low-level features of the encoding region are spliced directly with the high-level features of the decoding region, which is often accompanied by a large semantic gap. The invention introduces a feature pyramid attention mechanism at the end of the encoder to fuse multi-scale information, place more attention on the high-dimensional features of building changes, and mine the feature information of building change images. Using the feature pyramid attention mechanism helps the network recover the information lost by down-sampling and resolves the semantic gap and information loss caused by directly splicing the low-level features of the encoding region with the high-level features of the decoding region.
In summary, compared with existing technologies for detecting building changes in remote sensing images, the extrusion-excitation attention and attention-gated residual neural network designed by the invention reduces missed and false detections of large-building change information and improves the accuracy and precision of building change detection.
Drawings
FIG. 1 is a schematic flow chart of a method in an embodiment of the invention.
Fig. 2 is a schematic diagram of a network model in an embodiment of the present invention.
FIG. 3 is a schematic diagram of the structure of a downsampling block without a stacked maximum pooling layer in accordance with an embodiment of the present invention.
FIG. 4 is a schematic diagram of the structure of the extrusion excitation layer in the embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a downsampling block stacked maximum pooling layer in an embodiment of the invention.
FIG. 6 is a schematic diagram of a feature pyramid structure in an embodiment of the present invention.
Fig. 7 is a schematic diagram of a jump connection in an embodiment of the present invention, in which fig. 7 (a) is a schematic diagram of a jump connection of a downsampling block and a feature pyramid, and fig. 7 (b) is a schematic diagram of a jump connection of an upsampling block and a downsampling block.
Fig. 8 is a schematic diagram of an attention gating architecture in an embodiment of the present invention.
Fig. 9 is a schematic diagram showing comparison of building detection results in an embodiment of the present invention, in which fig. 9 (a) is a remote sensing image at time T1, fig. 9 (b) is a remote sensing image at time T2, fig. 9 (c) is a tag image, and fig. 9 (d) is a graph showing detection results of the method of the present invention.
Detailed Description
The technical scheme and effect of the present invention are described in further detail below with reference to the accompanying drawings.
The implementation steps of the embodiment of the present invention will be further described with reference to fig. 1.
The remote sensing image building change detection network constructed by the invention comprises four main parts: an input sub-network, a feature extraction sub-network, a feature fusion sub-network and a prediction sub-network. The input sub-network is responsible for histogram-matching the remote sensing images acquired in the two periods and merging their channels to form the input image, i.e., the preprocessing before training. The feature extraction sub-network is responsible for feature extraction and obtains rich feature information from the two-phase remote sensing images of buildings; the feature fusion sub-network is responsible for image recovery, restoring the building change feature map to the original resolution. The skip-link modules fuse the feature maps extracted by each layer of the feature extraction sub-network with the feature maps obtained by up-sampling in the feature fusion sub-network; after the last feature map produced by down-sampling in the feature extraction sub-network, the feature pyramid attention module is used, while the other skip-link modules use the attention gating structure. The prediction sub-network is responsible for restoring the feature map to the size of the original image.
Step1, constructing an input sub-network of remote sensing image building change, which is used for inputting double-phase remote sensing image data;
An input sub-network for preprocessing the double-phase remote sensing image data is built. The input images of the input sub-network cover two moments, time T1 and time T2, where T1 is the earlier moment and T2 is the later moment; the input size of the remote sensing images of both periods is 256×256×3. The remote sensing image at time T2 is histogram-matched to the histogram of the remote sensing image at time T1 to alleviate the large gray-level difference between the two images; the histogram matching is realized with the match_histograms function, yielding the histogram-matched T2 image T2'. The histogram-matched image T2' and the T1 image are then channel-merged as the input data of the subsequent feature extraction network; the channels are merged with the cat function in the first dimension, giving a merged double-time-phase remote sensing image of size 256×256×6.
Step 2, constructing a feature extraction sub-network of the building change of the remote sensing image, and extracting features of a building change area from the double-phase remote sensing image data input in the step 1;
Building a feature extraction network with 5 downsampling modules, wherein the feature extraction network comprises the following structures in sequence: a downsampling block 1, a downsampling block 2, a downsampling block 3, a downsampling block 4 and a downsampling block 5;
The module structures of the downsampling blocks 2, 3, 4 and 5 are similar to the overall structure of the downsampling block 1, and the only difference is that a maximum pooling layer is added before a first convolution layer, and the output of the last downsampling block is used as the input of the current downsampling block.
Step 2.1, constructing a module of a downsampling block 1 in a feature extraction sub-network of remote sensing image building change;
referring to fig. 3, a module of a downsampling block 1 in a remote sensing image building change feature extraction sub-network structure constructed by the invention is further described.
FIG. 3 is a schematic diagram of the structure of downsampling block 1. The input image size is 256×256×6, and downsampling block 1 mainly comprises three layers: a first convolution layer, a second convolution layer and an extrusion excitation layer. The first convolution layer has 6 input channels, 64 output channels, a 3×3 convolution kernel, a stride of 1 and padding of 1; its output is normalized by a BN layer (64 output channels) and, to enhance the nonlinear expression capability of the model, passed through a ReLU activation function; it is used to extract basic features of the input double-phase remote sensing image data, such as edges and textures. The second convolution layer has a 3×3 kernel, a stride of 1, padding of 1, 64 input channels and 64 output channels; its output is normalized by a BN layer (64 output channels) and passed through a ReLU activation function to enhance the nonlinear expression capability of the model. The result of the second convolution layer is then added to the original input data through a shortcut convolution whose input and output channels are 6 and 64 respectively, with a 1×1 kernel, a stride of 1 and bias set to False, followed by BN normalization with 64 output channels.
With reference to fig. 4, a further description will be given of an extrusion excitation layer in a feature extraction sub-network structure of remote sensing image building change constructed by the present invention.
FIG. 4 is a schematic diagram of the extrusion excitation layer, which mainly consists of three operations: extrusion, excitation and scale. The extrusion operation uses global average pooling, with an input size of 256×256×64 and an output size of 1×1×64. The excitation operation mainly comprises two fully connected layers with a reduction factor of 16: the input and output channels of the first fully connected layer are 64 and 4, those of the second are 4 and 64, and the bias of both is False; ReLU and Sigmoid activation functions are added after the first and second fully connected layers respectively to improve the expression capability of the model, giving an output size of 1×1×64. The scale operation is mainly a multiplication: the input feature map is multiplied by the result of the excitation operation to obtain a final output of size 256×256×64.
Referring to fig. 5, a further description will be given of the downsampling block in the feature extraction sub-network structure of the remote sensing image building change constructed in the present invention.
Step 2.2, constructing a module of a downsampling block 2 in a feature extraction sub-network of remote sensing image building change;
The maximum pooling layer in downsampling block 2 has an input size of 256×256×64, a kernel size of 2×2 and a stride of 2. The first convolution layer has an input size of 128×128×64, 64 input channels and 128 output channels, a 3×3 kernel, padding of 1 and bias set to True; its output is normalized by a BN layer (128 output channels) and passed through a ReLU activation function to enhance the nonlinear expression capability of the model. The second convolution layer has an input size of 128×128×128, 128 input and output channels, a 3×3 kernel, padding of 1 and bias set to True; its output is normalized by a BN layer (128 output channels) and passed through a ReLU activation function, giving an output size of 128×128×128. The result of the second convolution layer is added to the original input data through a shortcut convolution whose input and output channels are 64 and 128 respectively, with a 1×1 kernel and a stride of 1. The extrusion excitation layer has 128 input channels and uses global average pooling with an output size of 1×1×128; it has two fully connected layers with a reduction factor of 16, the input and output channels of the first being 128 and 8 and those of the second being 8 and 128, both with bias set to False; ReLU and Sigmoid activation functions are added after the first and second fully connected layers respectively to increase the expression capability of the model, giving an output size of 1×1×128 and a final output size of 128×128×128.
Step 2.3, constructing a module of a downsampling block 3 in a feature extraction sub-network of remote sensing image building change;
The maximum pooling layer in downsampling block 3 has an input size of 128×128×128, a kernel size of 2×2 and a stride of 2. The first convolution layer has an input size of 64×64×128, 128 input channels and 256 output channels, a 3×3 kernel, padding of 1 and bias set to True; its output is normalized by a BN layer (256 output channels) and passed through a ReLU activation function to enhance the nonlinear expression capability of the model. The second convolution layer has an input size of 64×64×256, 256 input and output channels, a 3×3 kernel, padding of 1 and bias set to True; its output is normalized by a BN layer (256 output channels) and passed through a ReLU activation function, giving an output size of 64×64×256. The extrusion excitation layer has 256 input channels and uses global average pooling with an output size of 1×1×256; it has two fully connected layers with a reduction factor of 16, the input and output channels of the first being 256 and 16 and those of the second being 16 and 256, both with bias set to False; ReLU and Sigmoid activation functions are added after the first and second fully connected layers respectively, giving an output size of 1×1×256 and a final output size of 64×64×256.
Step 2.4, constructing a module of a downsampling block 4 in a feature extraction sub-network of remote sensing image building change;
The maximum pooling layer in downsampling block 4 has an input size of 64×64×256, a kernel size of 2×2 and a stride of 2. The first convolution layer has an input size of 32×32×256, 256 input channels and 512 output channels, a 3×3 kernel, padding of 1 and bias set to True; its output is normalized by a BN layer (512 output channels) and passed through a ReLU activation function to enhance the nonlinear expression capability of the model. The second convolution layer has an input size of 32×32×512, 512 input and output channels, a 3×3 kernel, padding of 1 and bias set to True; its output is normalized by a BN layer (512 output channels) and passed through a ReLU activation function, giving an output size of 32×32×512. The extrusion excitation layer has 512 input channels and uses global average pooling with an output size of 1×1×512; it has two fully connected layers with a reduction factor of 16, the input and output channels of the first being 512 and 32 and those of the second being 32 and 512, both with bias set to False; ReLU and Sigmoid activation functions are added after the first and second fully connected layers respectively, giving an output size of 1×1×512 and a final output size of 32×32×512.
Step 2.5, constructing a module of a downsampling block 5 in a feature extraction sub-network of remote sensing image building change;
The maximum pooling layer in downsampling block 5 has an input size of 32×32×512, a convolution kernel size of 2×2 and a stride of 2. The first convolution layer has an input size of 16×16×512, input and output channels of 512 and 1024, a convolution kernel size of 3×3, padding of 1 and bias of True; its output is normalized by a BN layer with 1024 output channels, and a ReLU activation function is used to enhance the nonlinear expression capability of the model. The second convolution layer has an input size of 16×16×1024, input and output channels both of 1024, a convolution kernel size of 3×3, padding of 1 and bias of True; its output is normalized by a BN layer with 1024 output channels and activated by a ReLU function, giving an output size of 16×16×1024. The extrusion excitation layer has 1024 input channels and uses global average pooling, giving an output size of 1×1×1024; 2 fully connected layers are adopted with a scale factor of 16, where the input and output channels of the first fully connected layer are 1024 and 64, those of the second fully connected layer are 64 and 1024, and the bias of both fully connected layers is False. ReLU and Sigmoid activation functions are added after the first and second fully connected layers respectively, giving an output size of 1×1×1024; the input feature map is multiplied by the result of the excitation operation, so the final output size is 16×16×1024.
Step 3, constructing a feature fusion sub-network of the remote sensing image building change, and fusing the features of the building change area extracted in the step 2 to obtain building change feature diagrams extracted in different layers;
Referring to fig. 2, the feature fusion sub-network includes a feature pyramid attention module, an up-sampling block 1, an up-sampling block 2, an up-sampling block 3, an up-sampling block 4, and an attention gating structure.
Step 3.1, constructing a feature pyramid attention module of a feature fusion sub-network;
The structure comprises: a first convolution layer, a global average pooling layer, a second convolution layer, a bilinear upsampling layer, a first average pooling layer, a third convolution layer, a fourth convolution layer, a first expansion convolution layer, a second expansion convolution layer, a second average pooling layer, a fifth convolution layer, a sixth convolution layer, a third expansion convolution layer and a fourth expansion convolution layer.
The feature pyramid attention module in the feature fusion subnetwork for building changes in remote sensing images of the present invention will be further described with reference to fig. 6.
The feature map output by the downsampling block 5 passes through a first convolution layer, wherein the input and output channels are both 1024, the convolution kernel size is 1×1, the stride is 1, the padding is 0, and the output result size is 16×16×1024; the first convolution layer is used for reducing the dimension of the input data, so that the number of parameters in the model and the complexity of the model are reduced;
the feature map output by the downsampling block 5 also passes through a global average pooling layer, giving a size of 1×1, and is input into a second convolution layer whose input and output channels are both 1024, with a convolution kernel size of 1×1, a stride of 1 and padding of 0; the result then passes through a bilinear upsampling layer with corner alignment set to True, giving an output size of 16×16×1024;
The feature map output by the downsampling block 5 passes through a first average pooling layer whose convolution kernel size is 2×2; the pooled result is input into a third convolution layer, which extracts basic features such as edges and textures from the first average pooling layer feature map, with input and output channels of 1024 and 256 respectively, a convolution kernel size of 5×5, a stride of 2 and padding of 2, using BN layer normalization and ReLU activation; the result is input into a fourth convolution layer with input and output channels both of 256, a convolution kernel size of 5×5, a stride of 1 and padding of 2, using BN layer normalization. The result of the first average pooling layer is also input into a first expansion convolution layer with input and output channels of 1024 and 256 respectively, a convolution kernel size of 3×3, padding of 2 and an expansion rate of 3, using BN layer normalization and ReLU activation; the first expansion convolution result is then input into a second expansion convolution layer with input and output channels both of 256, a convolution kernel size of 3×3, padding of 2 and an expansion rate of 3, using BN layer normalization, and the output result size is 4×4×256;
The feature map output by the first average pooling layer passes through a second average pooling layer whose convolution kernel size is 2×2; the second average pooling layer feature map is input into a fifth convolution layer with input and output channels of 1024 and 256 respectively, a convolution kernel size of 3×3, padding of 2 and bias of False, using BN layer normalization and ReLU activation; the fifth convolution layer feature map is input into a sixth convolution layer with input and output channels both of 256, a convolution kernel size of 3×3, a stride of 2, padding of 2 and bias of False, using BN layer normalization. The second average pooling layer feature map is also input into a third expansion convolution layer with input and output channels of 1024 and 256 respectively, a convolution kernel size of 3×3, padding of 5 and an expansion rate of 5, using BN layer normalization and ReLU activation; the third expansion convolution result is then input into a fourth expansion convolution layer with input and output channels both of 256, a convolution kernel size of 3×3, padding of 5 and an expansion rate of 5, using BN layer normalization, and the output result size is 4×4×256;
The results of the fourth convolution layer, the second expansion convolution layer, the sixth convolution layer and the fourth expansion convolution layer are added; the sum is passed through a deconvolution layer with input and output channels both of 256, a convolution kernel size of 4, a stride of 2, padding of 1 and bias of False, using BN layer normalization and ReLU activation, giving an output of size 8×8×256, and then through a second deconvolution layer with input and output channels of 256 and 1024 respectively, a convolution kernel size of 4, a stride of 2, padding of 1 and bias of False, using BN layer normalization and ReLU activation, giving a pyramid attention map H(X) of size 16×16×1024. H(X) is multiplied element-wise by the first convolution layer result X1, and the product is added to the bilinear upsampling result X2, which is formulated as:
F = ReLU( H(X) ⊙ X1 + X2 )
where ⊙ denotes element-wise multiplication. The addition result is activated by a ReLU function, and the resulting output feature map F has a size of 16×16×1024.
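The multi-branch structure above can be condensed into the following hedged PyTorch sketch of the feature pyramid attention module. It follows the convolution parameters stated in the text, but the helper names, the bias settings of the unstated layers and the defensive resizing are assumptions of this sketch, not part of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn(in_ch, out_ch, k, s=1, p=0, d=1, relu=True, bias=False):
    layers = [nn.Conv2d(in_ch, out_ch, k, stride=s, padding=p, dilation=d, bias=bias),
              nn.BatchNorm2d(out_ch)]
    if relu:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)

class FeaturePyramidAttention(nn.Module):
    """Sketch of the feature pyramid attention module: a 1x1 branch X1, a
    global-average-pooling branch X2, and two average-pooled branches (plain
    and dilated convolutions) that are summed, deconvolved to H(X),
    multiplied with X1 and added to X2."""
    def __init__(self, ch=1024, mid=256):
        super().__init__()
        self.branch_x1 = nn.Conv2d(ch, ch, 1)     # dimension-reducing 1x1 branch (X1)
        self.gap_conv = nn.Conv2d(ch, ch, 1)      # global pooling branch (X2)
        self.pool = nn.AvgPool2d(2)
        self.conv5a = conv_bn(ch, mid, 5, s=2, p=2)
        self.conv5b = conv_bn(mid, mid, 5, s=1, p=2, relu=False)
        self.dil3a = conv_bn(ch, mid, 3, p=2, d=3)
        self.dil3b = conv_bn(mid, mid, 3, p=2, d=3, relu=False)
        self.conv3a = conv_bn(ch, mid, 3, p=2)
        self.conv3b = conv_bn(mid, mid, 3, s=2, p=2, relu=False)
        self.dil5a = conv_bn(ch, mid, 3, p=5, d=5)
        self.dil5b = conv_bn(mid, mid, 3, p=5, d=5, relu=False)
        self.up1 = nn.Sequential(nn.ConvTranspose2d(mid, mid, 4, stride=2, padding=1, bias=False),
                                 nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.up2 = nn.Sequential(nn.ConvTranspose2d(mid, ch, 4, stride=2, padding=1, bias=False),
                                 nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, x):                         # x: B x 1024 x 16 x 16
        x1 = self.branch_x1(x)
        x2 = F.interpolate(self.gap_conv(F.adaptive_avg_pool2d(x, 1)),
                           size=x.shape[-2:], mode='bilinear', align_corners=True)
        p1 = self.pool(x)                         # 8 x 8
        p2 = self.pool(p1)                        # 4 x 4
        b1 = self.conv5b(self.conv5a(p1)) + self.dil3b(self.dil3a(p1))
        b2 = self.conv3b(self.conv3a(p2)) + self.dil5b(self.dil5a(p2))
        # with the stated parameters both pyramid levels end at 4x4x256;
        # resize defensively in case of a different input size
        b2 = F.interpolate(b2, size=b1.shape[-2:], mode='bilinear', align_corners=True)
        h = self.up2(self.up1(b1 + b2))           # H(X), back to 16 x 16 x 1024
        return torch.relu(h * x1 + x2)            # F = ReLU(H(X) * X1 + X2)
```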
And 3.2, constructing an up-sampling block 1 module in the feature fusion subnetwork of the remote sensing image building change.
The up-sampling blocks in the feature fusion sub-network of remote sensing image building change according to the present invention are further described with reference to fig. 7 (a).
The main structure of the up-sampling block 1 comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
the first convolution layer in the upsampling block 1 is: inputting the output characteristic diagram F obtained in the step 3.1 into a first convolution layer, performing simple up-sampling, wherein the scale factor is 2, using simple two-dimensional convolution, wherein input and output channels are 1024 and 512 respectively, the convolution kernel size is 3, the step length is 1, the filling is 1, the bias is True, and using BN layer normalization and ReLU activation;
The attention gating layer in the up-sampling block 1 of the feature fusion sub-network for constructing remote sensing image building changes according to the present invention will be further described with reference to fig. 8.
The attention gating layer in the upsampling block 1 is: the result of the first convolution layer of the up-sampling block 1 is input into the attention gating layer and passed through a gate convolution with input and output channels of 512 and 256 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 256 channels, and the result is denoted as A; as shown in fig. 8, the output result of the downsampling block 4 is input into an L convolution layer with input and output channels of 512 and 256 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 256 channels, and the result is denoted as B; A and B are added, and the result is denoted as C; C is activated by a ReLU activation function, and the result is denoted as D; D is input into a simple convolution layer with input and output channels of 256 and 1 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 1 channel, and the result is denoted as E; E is activated by a Sigmoid function to obtain a weight matrix, the result is denoted as F, and F is multiplied by the feature map of the downsampling block 4;
Splicing the attention layer gating result and the first convolution layer result in the up-sampling block 1 in the 1 st dimension;
Referring to fig. 7 (a), the splice layer result is input into a second convolution layer with input and output channels of 1024 and 512 respectively, a convolution kernel size of 3×3 and padding of 1, normalized by a BN layer with 512 output channels and activated by a ReLU function; the result of the second convolution layer is input into a third convolution layer with input and output channels both of 512, a convolution kernel size of 3×3 and padding of 1, normalized by a BN layer with 512 output channels and activated by a ReLU function; the splice layer result is projected by a shortcut convolution with input and output channels of 1024 and 512 respectively, a convolution kernel size of 1×1, a stride of 1 and bias of False, normalized by a BN layer, and added to the third convolution layer result, giving an output result of size 32×32×512;
The addition result is input into an extrusion excitation layer, a schematic diagram of which is shown in fig. 4; the input channel of the extrusion excitation layer is 512, global average pooling is adopted, and the output size is 1×1×512; 2 fully connected layers are adopted with a scale factor of 16, the input and output channels of the first fully connected layer are 512 and 32, those of the second fully connected layer are 32 and 512, and the bias of both fully connected layers is False; ReLU and Sigmoid activation functions are added after the first and second fully connected layers respectively to increase the expression capability of the model, giving an output size of 1×1×512; the input feature map is multiplied by the result of the excitation operation, giving a final output size of 32×32×512.
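A minimal PyTorch sketch of the attention gating layer of fig. 8 follows. It assumes, consistently with the claims, that the Sigmoid weight map re-weights the encoder (downsampling block) skip feature; all identifiers are illustrative.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Attention gating layer (fig. 8): a gate convolution on the decoder
    (up-sampling) feature, an L convolution on the encoder (down-sampling)
    skip feature, addition, ReLU, a 1-channel convolution, Sigmoid, and
    multiplication of the resulting weight map with the skip feature."""
    def __init__(self, dec_ch, enc_ch, inter_ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(dec_ch, inter_ch, 1, bias=True),
                                  nn.BatchNorm2d(inter_ch))     # result A
        self.l_conv = nn.Sequential(nn.Conv2d(enc_ch, inter_ch, 1, bias=True),
                                    nn.BatchNorm2d(inter_ch))   # result B
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, 1, bias=True),
                                 nn.BatchNorm2d(1),             # result E
                                 nn.Sigmoid())                  # weight matrix F

    def forward(self, dec_feat, enc_feat):
        a = self.gate(dec_feat)
        b = self.l_conv(enc_feat)
        d = torch.relu(a + b)                                   # C, then D
        weights = self.psi(d)
        return enc_feat * weights                               # re-weighted skip feature

# In up-sampling block 1 the gate takes the 512-channel decoder feature and the
# 512-channel downsampling block 4 feature, e.g. AttentionGate(512, 512, 256).
```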
Step 3.3, constructing an upper sampling block 2 module in a feature fusion sub-network of remote sensing image building change;
The main structure of the up-sampling block 2 comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
Referring to fig. 7 (b), the building change area feature map extracted by the up-sampling block 1 is input into a first convolution layer, which performs simple up-sampling with a scale factor of 2 followed by a simple two-dimensional convolution with input and output channels of 512 and 256 respectively, a convolution kernel size of 3, a stride of 1, padding of 1 and bias of True, using BN layer normalization and ReLU activation; the feature map extracted by the first convolution layer is input into the attention gating layer and passed through a gate convolution with input and output channels of 256 and 128 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 128 channels, and the result is denoted as A; as shown in fig. 8, the building change area feature map extracted by the downsampling block 3 is input into the L convolution layer of the attention gating structure, with input and output channels of 256 and 128 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 128 channels, and the result is denoted as B; A and B are added, and the result is denoted as C; C is activated by a ReLU activation function, and the result is denoted as D; D is input into a simple convolution layer with input and output channels of 128 and 1 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 1 channel, and the result is denoted as E; E is activated by a Sigmoid function to obtain a weight matrix, the result is denoted as F, and F is multiplied by the feature map of the downsampling block 3; the attention gating layer result is spliced with the first convolution layer result of the up-sampling block 2 in the 1st dimension, and the splice layer result is input into a second convolution layer with input and output channels of 512 and 256 respectively, a convolution kernel size of 3×3 and padding of 1, normalized by a BN layer with 256 output channels and activated by a ReLU function; the second convolution layer result is input into a third convolution layer with input and output channels both of 256, a convolution kernel size of 3×3 and padding of 1, normalized by a BN layer with 256 output channels and activated by a ReLU function; the splice layer result is projected by a shortcut convolution with input and output channels of 512 and 256 respectively, a convolution kernel size of 1×1, a stride of 1 and bias of False, normalized by a BN layer, and added to the third convolution layer result, giving an output result of size 64×64×256; the addition result is input into an extrusion excitation layer whose input channel is 256, global average pooling is adopted, and the output size is 1×1×256; 2 fully connected layers are adopted with a scale factor of 16, the input and output channels of the first fully connected layer are 256 and 16, those of the second fully connected layer are 16 and 256, and the bias of both fully connected layers is False.
ReLU and Sigmoid activation functions are added after the first and second fully connected layers respectively, increasing the expression capability of the model, with an output size of 1×1×256. The input feature map is multiplied by the result of the excitation operation, giving a final output size of 64×64×256;
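Since up-sampling blocks 1 to 4 share the same structure and only halve the channel count at each stage, they can be sketched as a single parameterized PyTorch module; the sketch below reuses the AttentionGate and SqueezeExcitation classes given above. The bilinear interpolation mode of the "simple up-sampling" step is an assumption, as the text only specifies a scale factor of 2.

```python
import torch
import torch.nn as nn

class SEUpBlock(nn.Module):
    """Up-sampling block: 2x upsampling followed by a 3x3 convolution,
    attention gating of the encoder skip feature, channel concatenation,
    two 3x3 conv+BN+ReLU layers, a 1x1 residual projection of the
    concatenated feature, and a final extrusion excitation layer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=True),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.gate = AttentionGate(out_ch, out_ch, out_ch // 2)
        self.conv1 = nn.Sequential(nn.Conv2d(2 * out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.shortcut = nn.Sequential(nn.Conv2d(2 * out_ch, out_ch, 1, bias=False),
                                      nn.BatchNorm2d(out_ch))
        self.se = SqueezeExcitation(out_ch)

    def forward(self, x, skip):
        x = self.up(x)                      # e.g. 16x16x1024 -> 32x32x512
        skip = self.gate(x, skip)           # attention-gated encoder feature
        cat = torch.cat([skip, x], dim=1)   # splice in the channel dimension
        out = self.conv2(self.conv1(cat))
        return self.se(out + self.shortcut(cat))

# up1 = SEUpBlock(1024, 512); up2 = SEUpBlock(512, 256)
# up3 = SEUpBlock(256, 128);  up4 = SEUpBlock(128, 64)
```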
step 3.4, an up-sampling block 3 module in a feature fusion sub-network of remote sensing image building change is constructed;
the main structure of the up-sampling block 3 comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
The building change area feature map extracted by the up-sampling block 2 is input into a first convolution layer, which performs simple up-sampling with a scale factor of 2 followed by a simple two-dimensional convolution with input and output channels of 256 and 128 respectively, a convolution kernel size of 3, a stride of 1, padding of 1 and bias of True, using BN layer normalization and ReLU activation; the result of the first convolution layer of the up-sampling block 3 is input into the attention gating layer and passed through a gate convolution with input and output channels of 128 and 64 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 64 channels, and the result is denoted as A; as shown in fig. 8, the output result of the downsampling block 2 is input into an L convolution layer with input and output channels of 128 and 64 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 64 channels, and the result is denoted as B; A and B are added, and the result is denoted as C; C is activated by a ReLU activation function, and the result is denoted as D; D is input into a simple convolution layer with input and output channels of 64 and 1 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 1 channel, and the result is denoted as E; E is activated by a Sigmoid function to obtain a weight matrix, the result is denoted as F, and F is multiplied by the feature map of the downsampling block 2; the attention gating layer result is spliced with the first convolution layer result of the up-sampling block 3 in the 1st dimension, and the splice layer result is input into a second convolution layer with input and output channels of 256 and 128 respectively, a convolution kernel size of 3×3 and padding of 1, normalized by a BN layer with 128 output channels and activated by a ReLU function; the second convolution layer result is input into a third convolution layer with input and output channels both of 128, a convolution kernel size of 3×3 and padding of 1, normalized by a BN layer with 128 output channels and activated by a ReLU function; the splice layer result is projected by a shortcut convolution with input and output channels of 256 and 128 respectively, a convolution kernel size of 1×1, a stride of 1 and bias of False, normalized by a BN layer, and added to the third convolution layer result, giving an output result of size 128×128×128; the addition result is input into an extrusion excitation layer whose input channel is 128, global average pooling is adopted, and the output size is 1×1×128; 2 fully connected layers are adopted with a scale factor of 16, the input and output channels of the first fully connected layer are 128 and 8, those of the second fully connected layer are 8 and 128, and the bias of both fully connected layers is False.
ReLU and Sigmoid activation functions are added after the first and second fully connected layers respectively, increasing the expression capability of the model, with an output size of 1×1×128. The input feature map is multiplied by the result of the excitation operation, giving a final output size of 128×128×128;
step 3.5, an up-sampling block 4 module in a feature fusion sub-network of remote sensing image building change is constructed;
The main structure of the up-sampling block 4 comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
The building change area feature map extracted by the up-sampling block 3 is input into a first convolution layer, which performs simple up-sampling with a scale factor of 2 followed by a simple two-dimensional convolution with input and output channels of 128 and 64 respectively, a convolution kernel size of 3, a stride of 1, padding of 1 and bias of True, using BN layer normalization and ReLU activation; the result of the first convolution layer is input into the attention gating layer and passed through a gate convolution with input and output channels of 64 and 32 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 32 channels, and the result is denoted as A; as shown in fig. 8, the building change area feature map extracted by the downsampling block 1 is input into an L convolution layer with input and output channels of 64 and 32 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 32 channels, and the result is denoted as B; A and B are added, and the result is denoted as C; C is activated by a ReLU activation function, and the result is denoted as D; D is input into a simple convolution layer with input and output channels of 32 and 1 respectively, a convolution kernel size of 1×1, a stride of 1, padding of 0 and bias of True, normalized by a BN layer with 1 channel, and the result is denoted as E; E is activated by a Sigmoid function to obtain a weight matrix; the result is denoted as F and multiplied by the feature map of the downsampling block 1. The attention gating layer result is spliced with the first convolution layer result of the up-sampling block 4 in the 1st dimension. The splice layer result is input into a second convolution layer with input and output channels of 128 and 64 respectively, a convolution kernel size of 3×3 and padding of 1, normalized by a BN layer with 64 output channels and activated by a ReLU function. The second convolution layer result is input into a third convolution layer with input and output channels both of 64, a convolution kernel size of 3×3 and padding of 1, normalized by a BN layer with 64 output channels and activated by a ReLU function. The splice layer result is projected by a shortcut convolution with input and output channels of 128 and 64 respectively, a convolution kernel size of 1×1, a stride of 1 and bias of False, normalized by a BN layer, and added to the third convolution layer result, giving an output result of size 256×256×64; the addition result is then input into an extrusion excitation layer with 64 input channels and a scale factor of 16, as in the preceding up-sampling blocks, giving a final output size of 256×256×64.
Step 4, constructing a prediction sub-network of the remote sensing image building change, and recovering the building change feature map obtained in the step 3 to the original feature map size to obtain a final building change area;
the prediction sub-network consists of a convolution layer whose input and output channel numbers are 64 and 2, with a convolution kernel size of 1×1, a stride of 1 and padding of 0, and an output result size of 256×256×2.
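This layer can be written in one line of PyTorch; the dummy input below is only an illustration of the expected shapes.

```python
import torch
import torch.nn as nn

# Prediction sub-network: a single 1x1 convolution mapping the fused 64-channel
# feature map to 2 output channels (change / no change), preserving 256x256.
head = nn.Conv2d(in_channels=64, out_channels=2, kernel_size=1, stride=1, padding=0)
print(head(torch.randn(1, 64, 256, 256)).shape)   # torch.Size([1, 2, 256, 256])
```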
Step 5, constructing a remote sensing image building change detection network based on the input sub-network constructed in the step 1, the feature extraction sub-network constructed in the step 2, the feature fusion sub-network constructed in the step 3 and the prediction sub-network constructed in the step 4;
step 5.1, connecting the input sub-network constructed in the step 1 and the downsampling block 1 of the feature extraction sub-network constructed in the step 2 in series;
Step 5.2, cascading the feature map extracted in the step 2 and the feature map after upsampling through an attention gating unit;
And 5.3, connecting the result after the up-sampling block processing with the prediction sub-network constructed in the step 4 in series.
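The wiring of steps 5.1 to 5.3 can be summarized with the following assembly sketch, built from the block sketches above; the 6-channel input of downsampling block 1 (two merged RGB images) and its internal layout are assumptions of this sketch, since step 2.1 is only summarized here.

```python
import torch
import torch.nn as nn

class BuildingChangeNet(nn.Module):
    """Assembly sketch for steps 5.1-5.3, reusing the block sketches above;
    layer names are illustrative and downsampling block 1 is simplified."""
    def __init__(self):
        super().__init__()
        self.down1 = nn.Sequential(            # merged bi-temporal input: 6 channels (assumption)
            nn.Conv2d(6, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            SqueezeExcitation(64))
        self.down2 = SEResidualDownBlock(64, 128)
        self.down3 = SEResidualDownBlock(128, 256)
        self.down4 = SEResidualDownBlock(256, 512)
        self.down5 = SEResidualDownBlock(512, 1024)
        self.fpa = FeaturePyramidAttention(1024)
        self.up1 = SEUpBlock(1024, 512)
        self.up2 = SEUpBlock(512, 256)
        self.up3 = SEUpBlock(256, 128)
        self.up4 = SEUpBlock(128, 64)
        self.head = nn.Conv2d(64, 2, kernel_size=1)    # prediction sub-network

    def forward(self, x):                  # x: B x 6 x 256 x 256
        d1 = self.down1(x)                 # 256 x 256 x 64
        d2 = self.down2(d1)                # 128 x 128 x 128
        d3 = self.down3(d2)                # 64 x 64 x 256
        d4 = self.down4(d3)                # 32 x 32 x 512
        d5 = self.down5(d4)                # 16 x 16 x 1024
        u1 = self.up1(self.fpa(d5), d4)    # attention-gated skip connections
        u2 = self.up2(u1, d3)
        u3 = self.up3(u2, d2)
        u4 = self.up4(u3, d1)
        return self.head(u4)               # B x 2 x 256 x 256
```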
And 6, generating a training set, a verification set and a test set.
Step 6.1, the public remote sensing image building change detection dataset LEVIR-CD is collected; it contains 637 pairs of Google Earth remote sensing images of the same regions acquired at two different times, with an image size of 1024×1024;
Step 6.2, considering the computing power of the computer, the images in the dataset collected in step 6.1 are cut, and the size of the cut images is 256×256;
Step 6.3, the image dataset cut in step 6.2 is divided into a training set, a verification set and a test set at a ratio of 7:1:2, comprising 7120 training images, 1024 verification images and 2048 test images.
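A possible tiling and splitting script is sketched below; the folder layout and file naming are illustrative, and in practice the same deterministic tiling must be applied to the T1 images, the T2 images and the label masks so that the tiles stay paired.

```python
import random
from pathlib import Path
from PIL import Image

def tile_images(src_dir, dst_dir, tile=256):
    """Cut every 1024x1024 image in src_dir into 256x256 tiles with
    deterministic names; applying this separately to the T1, T2 and label
    folders keeps the tiles aligned (paths are illustrative)."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    names = []
    for img_path in sorted(Path(src_dir).glob("*.png")):
        img = Image.open(img_path)
        w, h = img.size                                   # expected 1024 x 1024
        for top in range(0, h, tile):
            for left in range(0, w, tile):
                name = f"{img_path.stem}_{top}_{left}.png"
                img.crop((left, top, left + tile, top + tile)).save(dst / name)
                names.append(name)
    return names

def split_names(names, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split the tile names at a 7:1:2 ratio into train/validation/test lists."""
    names = sorted(names)
    random.Random(seed).shuffle(names)
    n_train = int(ratios[0] * len(names))
    n_val = int(ratios[1] * len(names))
    return names[:n_train], names[n_train:n_train + n_val], names[n_train + n_val:]
```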
Step 7, training the remote sensing image building change detection network constructed in the step 5 by using the training set and the verification set generated in the step 6 to obtain a model weight file;
Step 7.1, training parameters are set: the batch size is set to 16 and the network is trained for 150 epochs with an initial learning rate of 0.0001; at each iteration, 16 randomly selected, non-repeating pairs of remote sensing images from the two periods are taken from the training set and input into the network; the weight parameter file is saved as 'best_loss.pth'; the loss function uses the binary cross entropy loss with Sigmoid, BCEWithLogitsLoss;
The binary cross entropy loss function BCEWithLogitsLoss with Sigmoid is as follows:
Loss = -(1/N) · Σ_{i=1}^{N} [ y_i · log σ(p_i) + (1 - y_i) · log(1 - σ(p_i)) ]
where σ(·) is the Sigmoid function, log is the natural logarithm, p_i denotes the network prediction for sample x_i, so that σ(p_i) is the probability that sample x_i is predicted as a positive example, and y_i denotes the true label of sample x_i. In a binary classification problem, y_i is typically 0 or 1, indicating whether sample x_i belongs to the positive class;
Step 7.2, sequentially inputting the training set and the verification set generated in the step 6 into the remote sensing image building change detection network constructed in the step 5; the method comprises the following specific steps: firstly, matching a remote sensing image at the time T2 according to a histogram of the remote sensing image at the time T1 input into an input sub-network, and realizing channel combination; secondly, inputting the input data into a feature extraction sub-network to perform downsampling feature extraction; thirdly, inputting an output result of the feature extraction sub-network into a feature fusion sub-network to obtain a feature map after feature fusion; and finally, restoring the feature map after feature fusion to the original image size by utilizing a prediction sub-network.
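A hedged training sketch covering steps 7.1 and 7.2 is given below. The histogram matching and channel merging follow the input sub-network description; the Adam optimizer and the data-loading details are assumptions, since the patent does not name them.

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.exposure import match_histograms

def merge_pair(t1, t2):
    """Input sub-network: histogram-match the T2 image to the T1 image, then
    merge the channels. t1 and t2 are H x W x 3 uint8 arrays; the function
    returns a 6 x H x W float tensor."""
    t2_matched = match_histograms(t2, t1, channel_axis=-1)
    pair = np.concatenate([t1.astype(np.float32), t2_matched], axis=-1) / 255.0
    return torch.from_numpy(pair.astype(np.float32)).permute(2, 0, 1)

def train(model, train_loader, val_loader, device="cuda", epochs=150, lr=1e-4):
    """Step 7 sketch: batch size 16 image pairs, 150 epochs, initial learning
    rate 0.0001, BCEWithLogitsLoss; the Adam optimizer is an assumption."""
    model = model.to(device)
    criterion = nn.BCEWithLogitsLoss()        # Sigmoid + binary cross entropy in one call
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val = float("inf")
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:   # images: B x 6 x 256 x 256
            optimizer.zero_grad()             # labels: B x 2 x 256 x 256 one-hot maps (assumption)
            loss = criterion(model(images.to(device)), labels.to(device).float())
            loss.backward()
            optimizer.step()
        model.eval()
        val_loss, batches = 0.0, 0
        with torch.no_grad():
            for images, labels in val_loader:
                val_loss += criterion(model(images.to(device)),
                                      labels.to(device).float()).item()
                batches += 1
        if batches and val_loss / batches < best_val:
            best_val = val_loss / batches     # keep the weights with the best validation loss
            torch.save(model.state_dict(), "best_loss.pth")
```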
And 8, detecting the remote sensing image building change range in the test set generated in the step 6 by using the model weight file obtained through training in the step 7.
Step 8.1, inputting the test set generated in the step 6 into the remote sensing image building change detection network trained in the step 7, and loading a remote sensing image building change detection network parameter file 'best_loss.pth' trained in the step 7.1;
Step 8.2, performing remote sensing image building change detection on the image in the test set to obtain a change range of the building;
And 8.3, outputting a detection result of the remote sensing image building change detection network and storing the detection result as a tag file in a 'png' format.
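An inference sketch for step 8, loading 'best_loss.pth' and saving the predicted change maps as PNG label files (file naming and output folder are illustrative):

```python
import os
import torch
from PIL import Image

def predict(model, test_loader, device="cuda", out_dir="results"):
    """Step 8 sketch: load the trained weights, run the network on the test
    tiles and save each predicted change map as a PNG label image."""
    os.makedirs(out_dir, exist_ok=True)
    model.load_state_dict(torch.load("best_loss.pth", map_location=device))
    model = model.to(device).eval()
    with torch.no_grad():
        for idx, (images, _) in enumerate(test_loader):
            logits = model(images.to(device))             # B x 2 x 256 x 256
            change = logits.argmax(dim=1).byte() * 255    # changed pixels -> 255
            for j, mask in enumerate(change.cpu()):
                Image.fromarray(mask.numpy(), mode="L").save(
                    os.path.join(out_dir, f"pred_{idx:04d}_{j}.png"))
```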
With reference to fig. 9, a further description will be made of the prediction result of the remote sensing image building change detection network constructed in the present invention.
As shown in fig. 9, fig. 9 (a) is a remote sensing image at time T1, fig. 9 (b) is a remote sensing image at time T2, fig. 9 (c) is a label image, and fig. 9 (d) is a detection result; it can be seen that the model is more focused on edge details of building variations due to the addition of squeeze excitation and feature pyramid attention modules; through the attention gating module, the characteristics of the change area can be selectively emphasized, the characteristics of buildings without change are restrained, and the method has the technical effects that the characteristics are outstanding and the change characteristics are convenient to extract.
Application prospect of the invention
Building change detection is one of the important tasks in urban planning and land resource management. With the acceleration of urbanization, cities gradually expand into the surrounding suburban and rural areas, so urban buildings are continuously renewed, expanded and rebuilt. Timely monitoring of urban building changes can provide reliable basic data and decision support for regional land use management and assessment.
Building changes that are tracked only by conventional surveys slow down regional land resource management and affect the progress of urban land planning; in particular, illegal buildings pose a serious hidden danger to people's safety. After a building change occurs, remote sensing satellite images captured before and after the change can be analyzed with computer image processing techniques to detect the changed region, which has important practical significance and research value for regional urban development, land use, resource management and the like.
Therefore, the method for detecting building change areas in dual-phase remote sensing images based on the extrusion excitation attention and attention gating residual neural network provided by the invention can solve the problems of insufficient detection precision, rough edges, poor recognition of small-size building changes, and frequent missed and false detections of building change areas in existing detection methods. The building change detection network designed by the invention has high precision and good performance, and can rapidly output the extent of the building changes in two remote sensing images.

Claims (2)

1. A method for detecting a building change area in a double-time-phase remote sensing image based on a residual neural network, characterized by comprising the following steps:
step1, constructing an input sub-network of remote sensing image building change, which is used for inputting double-phase remote sensing image data;
the specific method of the step 1 is as follows:
An input sub-network for preprocessing the double-phase remote sensing image data is built; the input images of the input sub-network correspond to two moments, a time T1 and a time T2, wherein T1 is the first time and T2 is the second time; according to the histogram of the remote sensing image at the time T1, histogram matching is carried out on the remote sensing image at the time T2, wherein the histogram matching is realized by using a match_histograms function, obtaining a histogram-matched remote sensing image T2'; a channel merging operation is carried out on the histogram-matched remote sensing image T2' and the remote sensing image at the time T1, wherein the channel merging uses a cat function and the merging is performed in the first dimension, obtaining a merged double-time-phase remote sensing image;
Step 2, constructing a feature extraction sub-network of the building change of the remote sensing image, and extracting features of a building change area from the double-phase remote sensing image data input in the step 1;
the specific method of the step 2 is as follows:
A feature extraction sub-network with 5 downsampling modules is built, and the structure of the feature extraction sub-network is as follows: downsampling block 1, downsampling block 2, downsampling block 3, downsampling block 4 and downsampling block 5;
step 2.1, constructing a module of a downsampling block 1 in a feature extraction sub-network of remote sensing image building change;
The downsampling block 1 comprises a first convolution layer, a second convolution layer and an extrusion excitation layer, wherein the first convolution layer is used for extracting basic features of the input double-phase remote sensing image data; the second convolution layer is used for processing and combining the basic features extracted by the first convolution layer to generate deeper change features; the extrusion excitation layer is used for multiplying different channels by different weight coefficients so that the model pays more attention to the channels with a large amount of information, and comprises an extruding operation, an exciting operation and a proportional operation, wherein the extruding operation adopts global average pooling, the exciting operation comprises two fully connected layers whose bias is False, and the proportional operation is a multiplication, namely multiplying the input feature map by the result of the exciting operation to obtain the final multiplied feature map;
Step 2.2, constructing a module of a downsampling block 2 in a feature extraction sub-network of remote sensing image building change;
The downsampling block 2 comprises a maximum pooling layer, a first convolution layer, a second convolution layer and an extrusion excitation layer, wherein the maximum pooling layer is used for downsampling the length and the width of the feature map to half of the length and the width of the feature map of the downsampling block 1; the first convolution layer is used for extracting basic features of the feature map of the maximum pooling layer; the second convolution layer is used for processing and combining the basic features extracted by the first convolution layer to generate deeper change features; the extrusion excitation layer is used for multiplying different channels by different weight coefficients so that the model pays more attention to the channels with a large amount of information, and comprises an extruding operation, an exciting operation and a proportional operation, wherein the extruding operation adopts global average pooling, the exciting operation comprises two fully connected layers whose bias is False, and the proportional operation is a multiplication, namely multiplying the input feature map by the result of the exciting operation to obtain the final multiplied feature map;
Step 2.3, constructing a module of a downsampling block 3 in a feature extraction sub-network of remote sensing image building change;
The downsampling block 3 comprises a maximum pooling layer, a first convolution layer, a second convolution layer and an extrusion excitation layer, wherein the maximum pooling layer is used for downsampling the length and the width of the feature map to half of the length and the width of the feature map of the downsampling block 2; the first convolution layer is used for extracting basic features of the feature map of the maximum pooling layer; the second convolution layer is used for processing and combining the basic features extracted by the first convolution layer to generate deeper change features; the extrusion excitation layer is used for multiplying different channels by different weight coefficients so that the model pays more attention to the channels with a large amount of information, and comprises an extruding operation, an exciting operation and a proportional operation, wherein the extruding operation adopts global average pooling, the exciting operation comprises two fully connected layers whose bias is False, and the proportional operation is a multiplication, namely multiplying the input feature map by the result of the exciting operation to obtain the final multiplied feature map;
step 2.4, constructing a module of a downsampling block 4 in a feature extraction sub-network of remote sensing image building change;
The downsampling block 4 comprises a maximum pooling layer, a first convolution layer, a second convolution layer and an extrusion excitation layer, wherein the maximum pooling layer is used for downsampling the length and the width of the feature map to half of the length and the width of the feature map of the downsampling block 3; the first convolution layer is used for extracting basic features of the feature map of the maximum pooling layer; the second convolution layer is used for processing and combining the basic features extracted by the first convolution layer to generate deeper change features; the extrusion excitation layer is used for multiplying different channels by different weight coefficients so that the model pays more attention to the channels with a large amount of information, and comprises an extruding operation, an exciting operation and a proportional operation, wherein the extruding operation adopts global average pooling, the exciting operation comprises two fully connected layers whose bias is False, and the proportional operation is a multiplication, namely multiplying the input feature map by the result of the exciting operation to obtain the final multiplied feature map;
step 2.5, constructing a module of a downsampling block 5 in a feature extraction sub-network of remote sensing image building change;
the downsampling block 5 comprises a maximum pooling layer, a first convolution layer, a second convolution layer and an extrusion excitation layer, wherein the maximum pooling layer is used for downsampling the length and the width of the feature map to half of the length and the width of the feature map of the downsampling block 4; the first convolution layer is used for extracting basic features of the feature map of the maximum pooling layer; the second convolution layer is used for processing and combining the basic features extracted by the first convolution layer to generate deeper change features; the extrusion excitation layer is used for multiplying different channels by different weight coefficients so that the model pays more attention to the channels with a large amount of information, and comprises an extruding operation, an exciting operation and a proportional operation, wherein the extruding operation adopts global average pooling, the exciting operation comprises two fully connected layers whose bias is False, and the proportional operation is a multiplication, namely multiplying the input feature map by the result of the exciting operation to obtain the final multiplied feature map;
Step 3, constructing a feature fusion sub-network of the remote sensing image building change, and fusing the features of the building change area extracted in the step 2 to obtain building change feature diagrams extracted in different layers;
The specific method of the step 3 is as follows:
The feature fusion sub-network comprises a feature pyramid attention module, an up-sampling block 1, an up-sampling block 2, an up-sampling block 3, an up-sampling block 4 and an attention gating structure;
Step 3.1, constructing a feature pyramid attention module of a feature fusion sub-network;
The feature pyramid attention module structurally comprises: a first convolution layer, a global average pooling layer, a second convolution layer, a bilinear upsampling layer, a first average pooling layer, a third convolution layer, a fourth convolution layer, a first expansion convolution layer, a second expansion convolution layer, a second average pooling layer, a fifth convolution layer, a sixth convolution layer, a third expansion convolution layer and a fourth expansion convolution layer;
The first convolution layer is used for reducing the dimension of the input data; the global average pooling layer is used for downsampling the input layer to a specified size at each pixel point to realize feature extraction; the second convolution layer is used for reducing the dimension of the global average pooling feature map and reducing the number of model parameters; the bilinear upsampling layer is used for enlarging the size of the feature map, increasing the density of pixel points and improving the resolution of the image; the first average pooling layer is used for reducing the size of the feature map, and reduces the risk of overfitting by averaging the feature map; the third convolution layer is used for extracting basic features of the first average pooling layer feature map; the fourth convolution layer is used for integrating the feature map of the third convolution layer to generate a deeper feature map; the first expansion convolution layer is used for increasing the size of the convolution kernel so as to enlarge the range over which features of the first average pooling layer are extracted; the second expansion convolution layer captures details of the change features while maintaining the resolution of the first expansion convolution feature map; the second average pooling layer is used for reducing the size of the feature map of the first average pooling layer, and reduces the risk of overfitting by averaging the feature map; the fifth convolution layer is used for extracting basic features of the feature map of the second average pooling layer; the sixth convolution layer is used for integrating the feature map of the fifth convolution layer to generate a deeper feature map; the third expansion convolution layer is used for increasing the size of the convolution kernel so as to enlarge the range over which features of the second average pooling layer are extracted; the fourth expansion convolution layer captures details of the change features while maintaining the resolution of the third expansion convolution feature map;
step 3.2, an up-sampling block 1 of the feature fusion sub-network is constructed;
The structure of the upper sampling block 1 includes: the device comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
The first convolution layer is used for extracting basic features of the feature pyramid attention module feature map in the step 3.1;
The attention gating layer comprises a gate convolution layer, an L convolution layer, a ReLU activation layer, a simple convolution layer and a Sigmoid activation layer; wherein the gate convolution is used to halve the number of channels of the first convolution layer feature map without changing the feature map size; the L convolution is used to halve the number of channels of the feature map of the downsampling block 4 without changing the feature map size; the ReLU activation layer is used for adding the gate convolution layer feature map and the L convolution layer feature map, setting negative values of the feature map directly to zero and keeping the positive part unchanged; the simple convolution layer is used for extracting the basic features of the ReLU activation layer; the Sigmoid activation layer is used for mapping the simple convolution layer feature map to the range from 0 to 1, and the downsampling block feature map is multiplied by the Sigmoid activation layer feature map to be used as the output feature map;
The splicing layer is used for executing channel merging operation on the first convolution layer feature map and the attention gating layer feature map, wherein the channel merging uses a cat function, and merging is executed in a first dimension to obtain a feature map after channel merging;
the second convolution layer is used for extracting the basic characteristics of the spliced layer after the channels are combined;
the third convolution layer is used for processing and integrating the basic features extracted by the second convolution layer to generate a deeper feature map;
The extrusion excitation layer is used for multiplying different weight coefficients for different channels so that the model focuses on the channel with large information quantity more and a feature map is generated;
Step 3.3, an up-sampling block 2 of the feature fusion sub-network is constructed;
the structure of the up-sampling block 2 includes: the device comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
The first convolution layer is used for extracting basic features of the feature map of the up-sampling block 1 in the step 3.2;
The attention gating layer comprises a gate convolution layer, an L convolution layer, a ReLU activation layer, a simple convolution layer and a Sigmoid activation layer; wherein the gate convolution is used to halve the number of channels of the first convolution layer feature map without changing the feature map size; the L convolution is used to halve the number of channels of the feature map of the downsampling block 3 without changing the feature map size; the ReLU activation layer is used for adding the gate convolution layer feature map and the L convolution layer feature map, setting negative values of the feature map directly to zero and keeping the positive part unchanged; the simple convolution layer is used for extracting the basic features of the ReLU activation layer; the Sigmoid activation layer is used for mapping the simple convolution layer feature map to the range from 0 to 1, and the downsampling block feature map is multiplied by the Sigmoid activation layer feature map to be used as the output feature map;
The splicing layer is used for executing channel merging operation on the first convolution layer feature map and the attention gating layer feature map, wherein the channel merging uses a cat function, and merging is executed in a first dimension to obtain a feature map after channel merging;
the second convolution layer is used for extracting the basic characteristics of the spliced layer after the channels are combined;
the third convolution layer is used for processing and integrating the basic features extracted by the second convolution layer to generate a deeper feature map;
The extrusion excitation layer is used for multiplying different weight coefficients for different channels so that the model focuses on the channel with large information quantity more and a feature map is generated;
step 3.4, an up-sampling block 3 of the feature fusion sub-network is constructed;
the structure of the up-sampling block 3 includes: the device comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
the first convolution layer is used for extracting basic features of the feature map of the upper sampling block 2 in the step 3.3;
The attention gating layer comprises a gate convolution layer, an L convolution layer, a ReLU activation layer, a simple convolution layer and a Sigmoid activation layer; wherein the gate convolution is used to halve the number of channels of the first convolution layer feature map without changing the feature map size; the L convolution is used to halve the number of channels of the feature map of the downsampling block 2 without changing the feature map size; the ReLU activation layer is used for adding the gate convolution layer feature map and the L convolution layer feature map, setting negative values of the feature map directly to zero and keeping the positive part unchanged; the simple convolution layer is used for extracting the basic features of the ReLU activation layer; the Sigmoid activation layer is used for mapping the simple convolution layer feature map to the range from 0 to 1, and the downsampling block feature map is multiplied by the Sigmoid activation layer feature map to be used as the output feature map;
The splicing layer is used for executing channel merging operation on the first convolution layer feature map and the attention gating layer feature map, wherein the channel merging uses a cat function, and merging is executed in a first dimension to obtain a feature map after channel merging;
the second convolution layer is used for extracting the basic characteristics of the spliced layer after the channels are combined;
the third convolution layer is used for processing and integrating the basic features extracted by the second convolution layer to generate a deeper feature map;
The extrusion excitation layer is used for multiplying different weight coefficients for different channels so that the model focuses on the channel with large information quantity more and a feature map is generated;
step 3.5, an up-sampling block 4 of the feature fusion sub-network is constructed;
the structure of the upsampling block 4 comprises: the device comprises a first convolution layer, an attention gating layer, a splicing layer, a second convolution layer, a third convolution layer and an extrusion excitation layer;
The first convolution layer is used for extracting basic features of the feature map of the upper sampling block 3 in the step 3.4;
The attention gating layer comprises a gate convolution layer, an L convolution layer, a ReLU activation layer, a simple convolution layer and a Sigmoid activation layer; wherein the gate convolution is used to halve the number of channels of the first convolution layer feature map without changing the feature map size; the L convolution is used to halve the number of channels of the feature map of the downsampling block 1 without changing the feature map size; the ReLU activation layer is used for adding the gate convolution layer feature map and the L convolution layer feature map, setting negative values of the feature map directly to zero and keeping the positive part unchanged; the simple convolution layer is used for extracting the basic features of the ReLU activation layer; the Sigmoid activation layer is used for mapping the simple convolution layer feature map to the range from 0 to 1, and the downsampling block feature map is multiplied by the Sigmoid activation layer feature map to be used as the output feature map;
The splicing layer is used for executing channel merging operation on the first convolution layer feature map and the attention gating layer feature map, wherein the channel merging uses a cat function, and merging is executed in a first dimension to obtain a feature map after channel merging;
the second convolution layer is used for extracting the basic characteristics of the spliced layer after the channels are combined;
the third convolution layer is used for processing and integrating the basic features extracted by the second convolution layer to generate a deeper feature map;
The extrusion excitation layer is used for multiplying different weight coefficients for different channels so that the model focuses on the channel with large information quantity more and a feature map is generated;
step 4, constructing a prediction sub-network of the remote sensing image building change, and recovering the building change feature map obtained in the step 3 to the original feature map size to obtain a final building change area;
step 5, constructing a remote sensing image building change detection network based on the input sub-network constructed in the step 1, the feature extraction sub-network constructed in the step 2, the feature fusion sub-network constructed in the step 3 and the prediction sub-network constructed in the step 4;
the specific method in the step 5 is as follows:
step 5.1, connecting the input sub-network constructed in the step 1 and the downsampling block 1 of the feature extraction sub-network constructed in the step 2 in series;
step 5.2, cascading the feature map of the building change area extracted in the step 2 with the feature map of the building change area extracted by the up-sampling block in the step 3 through an attention gating unit;
Step 5.3, the building change area feature map extracted by the up-sampling block in the step 3 is connected with the prediction sub-network constructed in the step 4 in series;
step 6, generating a training set, a verification set and a test set;
the specific method of the step 6 is as follows:
Step 6.1, acquiring a remote sensing image building change detection public dataset LEVIR-CD;
step 6.2, cutting the image in the data set obtained in the step 6.1;
step 6.3, dividing the image data set cut in the step 6.2 into a training set, a verification set and a test set in proportion;
step 7, training the remote sensing image building change detection network constructed in the step 5 by using the training set and the verification set generated in the step 6 to obtain a model weight file;
the specific method of the step 7 is as follows:
Step 7.1, setting training parameters, inputting remote sensing images obtained from 2 periods before and after random and non-repeated selection in a training set into a network each time, wherein a loss function uses a binary cross entropy loss function BCEWithLogitsLoss with Sigmoid;
The binary cross entropy loss function BCEWithLogitsLoss with Sigmoid is as follows:
Loss = -(1/N) · Σ_{i=1}^{N} [ y_i · log σ(p_i) + (1 - y_i) · log(1 - σ(p_i)) ]
where σ(·) is the Sigmoid function, log is the natural logarithm, p_i denotes the network prediction for sample x_i, so that σ(p_i) is the probability that sample x_i is predicted as a positive example, and y_i denotes the true label of sample x_i; in a binary classification problem, y_i is typically 0 or 1, indicating whether sample x_i belongs to the positive class;
Step 7.2, sequentially inputting the training set and the verification set generated in the step 6 into the remote sensing image building change detection network constructed in the step 5; the method comprises the following specific steps: firstly, matching a remote sensing image at the time T2 according to a histogram of the remote sensing image at the time T1 input into an input sub-network, and realizing channel combination; secondly, inputting the input data into a feature extraction sub-network to perform downsampling feature extraction; thirdly, inputting an output result of the feature extraction sub-network into a feature fusion sub-network to obtain a feature map after feature fusion; finally, restoring the feature map after feature fusion to the original image size by utilizing a prediction sub-network;
step 8, detecting the remote sensing image building change range in the test set generated in the step 6 by using the model weight file obtained by training in the step 7;
The specific method of the step 8 is as follows:
Step 8.1, inputting the test set generated in the step 6 into the remote sensing image building change detection network trained in the step 7, and loading a remote sensing image building change detection network parameter file 'best_loss.pth' trained in the step 7.1;
Step 8.2, performing remote sensing image building change detection on the image in the test set to obtain a change range of the building;
And 8.3, outputting a detection result of the remote sensing image building change detection network and storing the detection result as a tag file in a 'png' format.
2. The method for detecting a building change area in a double-time-phase remote sensing image based on a residual neural network according to claim 1, wherein in step 4 a prediction sub-network for remote sensing image building change is constructed;
The prediction sub-network consists of a convolution layer and is used for restoring the feature map output by the feature fusion sub-network to the size of the feature map of the double-time-phase remote sensing image (one possible reading is sketched below).
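One possible reading of claim 2, sketched in PyTorch: a convolution layer maps the fused features to a single-channel change map, which is then resized to the resolution of the input double-time-phase image. The channel count and the bilinear resizing are assumptions; the claim only states that the sub-network consists of a convolution layer that restores the feature map size.

```python
import torch.nn as nn
import torch.nn.functional as F

class PredictionSubNetwork(nn.Module):
    """Hypothetical prediction sub-network: 1x1 convolution followed by
    up-sampling back to the original remote sensing image size."""

    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.head = nn.Conv2d(in_channels, 1, kernel_size=1)

    def forward(self, fused_feat, out_size):
        logits = self.head(fused_feat)                       # (B, 1, h, w) change logits
        return F.interpolate(logits, size=out_size,          # restore to input resolution
                             mode="bilinear", align_corners=False)
```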
CN202311808119.7A 2023-12-26 2023-12-26 Method for detecting building change area in double-time-phase remote sensing image based on residual neural network Active CN117765361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311808119.7A CN117765361B (en) 2023-12-26 2023-12-26 Method for detecting building change area in double-time-phase remote sensing image based on residual neural network

Publications (2)

Publication Number Publication Date
CN117765361A (en) 2024-03-26
CN117765361B (en) 2024-06-11

Family

ID=90323885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311808119.7A Active CN117765361B (en) 2023-12-26 2023-12-26 Method for detecting building change area in double-time-phase remote sensing image based on residual neural network

Country Status (1)

Country Link
CN (1) CN117765361B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118135405A (en) * 2024-04-30 2024-06-04 湖南省第二测绘院 Optical remote sensing image road extraction method and system based on self-attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015042772A1 (en) * 2013-09-24 2015-04-02 中国科学院自动化研究所 Remote sensing image salient object change detection method
CN109767412A (en) * 2018-12-28 2019-05-17 珠海大横琴科技发展有限公司 A kind of remote sensing image fusing method and system based on depth residual error neural network
CN111104850A (en) * 2019-10-30 2020-05-05 中国资源卫星应用中心 Remote sensing image building automatic extraction method and system based on residual error network
CN116958800A (en) * 2022-04-02 2023-10-27 新疆大学 Remote sensing image change detection method based on hierarchical attention residual unet++
CN116402766A (en) * 2023-03-20 2023-07-07 南京信息工程大学 Remote sensing image change detection method combining convolutional neural network and Transformer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jingyu Yang et al., "CDnet: CNN-Based Cloud Detection for Remote Sensing Imagery," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 8, pp. 6193-6211, August 2019. *

Also Published As

Publication number Publication date
CN117765361A (en) 2024-03-26

Similar Documents

Publication Publication Date Title
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
Gong et al. Superpixel-based difference representation learning for change detection in multispectral remote sensing images
CN112668494A (en) Small sample change detection method based on multi-scale feature extraction
CN111047551A (en) Remote sensing image change detection method and system based on U-net improved algorithm
CN113887459B (en) Open-pit mining area stope change area detection method based on improved Unet +
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
CN113435411B (en) Improved DeepLabV3+ based open pit land utilization identification method
CN117765361B (en) Method for detecting building change area in double-time-phase remote sensing image based on residual neural network
CN113486897A (en) Semantic segmentation method for convolution attention mechanism up-sampling decoding
CN111353544B (en) Improved Mixed Pooling-YOLOV 3-based target detection method
CN112163496B (en) Embedded terminal reservoir water level early warning method based on semantic segmentation
CN113487576B (en) Insect pest image detection method based on channel attention mechanism
CN113222824A (en) Infrared image super-resolution and small target detection method
Duan et al. Multi-scale convolutional neural network for SAR image semantic segmentation
Chen et al. A novel lightweight bilateral segmentation network for detecting oil spills on the sea surface
Hu et al. Supervised multi-scale attention-guided ship detection in optical remote sensing images
Luo et al. RBD-Net: robust breakage detection algorithm for industrial leather
CN113688826A (en) Pollen image detection method and system based on feature fusion
CN112330562A (en) Heterogeneous remote sensing image transformation method and system
Tian et al. Recognition of geological legends on a geological profile via an improved deep learning method with augmented data using transfer learning strategies
CN115456957B (en) Method for detecting change of remote sensing image by full-scale feature aggregation
Zhang et al. A cross-channel multi-scale gated fusion network for recognizing construction and demolition waste from high-resolution remote sensing images
Ma et al. A lightweight hybrid convolutional neural network for hyperspectral image classification
Zhang et al. Enhanced deeplabv3+ for urban land use classification based on uav-borne images
Dong et al. SiameseDenseU‐Net‐based Semantic Segmentation of Urban Remote Sensing Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant