CN115631412A - Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling

Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling

Info

Publication number
CN115631412A
Authority
CN
China
Prior art keywords
building
image
data
remote sensing
cad
Prior art date
Legal status
Pending
Application number
CN202211270279.6A
Other languages
Chinese (zh)
Inventor
Cheng Zhiyou
Peng Yougen
Wang Chuanjian
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202211270279.6A
Publication of CN115631412A
Legal status: Pending


Classifications

    • G06V20/176 — Scenes; terrestrial scenes; urban or other man-made structures
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/267 — Image preprocessing; segmentation of patterns in the image field by performing operations on regions
    • G06V10/774 — Pattern recognition or machine learning; generating sets of training patterns
    • G06V10/806 — Fusion of extracted features
    • G06V10/82 — Image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling, comprising the following steps: acquiring remote sensing data; data preprocessing and data enhancement; constructing a building extraction network model, the CAD-UNet network model, comprising an encoder, a coordinate attention (CA) module, and a data-dependent upsampling (DUp) module; model training and evaluation; and automated building extraction, in which a new remote sensing image to be processed is preprocessed and fed to the trained CAD-UNet network model, which outputs a predicted image as the building extraction result. The designed network gradually extracts deep building features, fuses them, and gradually upsamples back to the input resolution, which suits the building extraction task and markedly improves extraction accuracy; it effectively captures the position and boundary information of buildings, so the extracted buildings have smoother boundaries and complete outlines.

Description

Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling
Technical Field
The invention relates to the technical field of image processing, in particular to a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling.
Background
Buildings are indispensable to people's daily lives and an important component of urban construction and development. The main task of building extraction is to identify and delineate building areas in remote sensing images, which matters for smart city construction, traffic management, population estimation, land use monitoring, and more. With the rapid development of remote sensing technology, imagery has transitioned from low to high resolution, trending toward high spatial, high spectral, and high temporal resolution. High-resolution remote sensing images carry ever more features and information, but correspondingly more noise and interference, posing new challenges: accurately extracting buildings from high-resolution remote sensing images has become a research hotspot and difficulty.
Traditional building extraction methods usually rely on prior knowledge and hand-crafted features, then apply algorithms such as clustering; they mainly comprise methods based on building features and methods based on auxiliary information. Most exploit characteristics such as building shape and texture, or auxiliary cues. Their principles are simple, but they suffer from low recognition rates and frequent errors, are time-consuming and labor-intensive, and are therefore severely limited in practical applications. Specifically:
First, insufficient attention to building position information. Position is crucial for the building extraction task: buildings are usually regularly distributed in a remote sensing image and are often shadowed or occluded by trees, so attending to position information yields the accurate location of each building and avoids misclassification. The prior art does not fully exploit position information, especially for shadow-occluded or densely adjoining buildings, so misclassification occurs easily.
Second, rough and blurry extracted boundaries. Buildings are mostly rectangular with regular boundaries, so boundary information is an essential feature of the building extraction task. Ignoring edge information easily produces rough, blurry, or disordered boundaries and holes. The prior art performs only conventional feature extraction and cannot fully mine building edge features, resulting in poor extraction with coarse, fuzzy boundaries.
Third, imbalance between positive and negative samples. Building extraction is a binary classification task that distinguishes buildings from background, but background pixels usually far outnumber building pixels in a remote sensing image, which weakens the model's ability to extract buildings during training. The prior art does not adequately address this imbalance, so the model's extraction accuracy is low and its generalization is weak.
Disclosure of Invention
The invention aims to provide a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling that effectively resolves misclassification caused by insufficient attention to building position information, improves building boundary extraction, alleviates the imbalance of positive and negative samples, and improves the generalization ability of the network.
To achieve this purpose, the invention adopts the following technical scheme: a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling, comprising the following steps in sequence:
(1) Obtaining remote sensing data: downloading a WHU building data set and a Massachusetts building data set;
(2) Data preprocessing and data enhancement: preprocessing means cutting the large images in the datasets and applying data enhancement to the cropped remote sensing images and label images; the enhanced remote sensing images and label images are divided into a training set, a validation set, and a test set in the ratio 8:1:1;
(3) Constructing the CAD-UNet network model: improving on the UNet network to build a building extraction network model, the CAD-UNet network model, comprising an encoder, a coordinate attention (CA) module, and a data-dependent upsampling (DUp) module;
(4) Model training and evaluation: training the CAD-UNet network model on the training set with a joint loss function combining binary cross-entropy (BCE) loss and Focal loss, then evaluating the building extraction accuracy and quality of the trained CAD-UNet network model on the test set;
(5) Automated building extraction: after data preprocessing, a new remote sensing image to be processed is input to the trained CAD-UNet network model, which outputs a predicted image as the building extraction result.
The step (2) specifically comprises the following steps:
(2a) Cut the large remote sensing images and label images in the Massachusetts building dataset into 512 × 512 images with a sliding window; in the label images of both the WHU and Massachusetts building datasets, mark building pixels as 1 and background pixels as 0;
(2b) Apply data enhancement to the WHU building dataset images and labels and to the cropped Massachusetts building dataset images and labels to enlarge the data volume, comprising:
Horizontal flip: flip the remote sensing image and the label image horizontally using the image processing library OpenCV;
Vertical flip: flip the remote sensing image and the label image vertically using OpenCV;
Horizontal-then-vertical flip: flip the remote sensing image and the label image first horizontally and then vertically using OpenCV;
Shift, scale, random crop, and added noise: apply shifting, scaling, random cropping, noise addition, and similar operations to the remote sensing image and the label image;
(2c) Divide the enhanced remote sensing images and label images into a training set, a validation set, and a test set in the ratio 8:1:1; the training set directly participates in training the CAD-UNet network model and feature extraction; the validation set is used to tune the hyper-parameters of the CAD-UNet network model; the test set is used after training to evaluate the accuracy and extraction quality of the CAD-UNet network model.
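A minimal sketch of the preprocessing and enhancement of step (2), assuming Python with OpenCV and NumPy (the patent names only OpenCV); the 512 × 512 tile size follows the text, while the stride and the subset of augmentations shown are illustrative choices:

```python
# Sliding-window cropping and flip-based augmentation, as described in steps (2a)-(2b).
import cv2
import numpy as np

def sliding_window_crop(image, tile=512, stride=512):
    """Cut a large remote sensing image (or its label) into tile x tile patches."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            patches.append(image[y:y + tile, x:x + tile])
    return patches

def augment(image, label):
    """Return the original pair plus the three flip variants of step (2b);
    shift/scale/random-crop/noise are omitted here for brevity."""
    pairs = [(image, label)]
    pairs.append((cv2.flip(image, 1), cv2.flip(label, 1)))    # horizontal flip
    pairs.append((cv2.flip(image, 0), cv2.flip(label, 0)))    # vertical flip
    pairs.append((cv2.flip(image, -1), cv2.flip(label, -1)))  # horizontal then vertical
    return pairs
```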
Step (3) specifically comprises the following steps:
(3a) Replacing the UNet encoder: replace the UNet encoder with a VGG16 network module, formed by removing the last pooling layer and the fully connected layers from VGG16; the module downsamples through several convolution blocks and four max-pooling operations, extracts building features from the remote sensing image, and outputs four feature maps at different scales; a sketch of this substitution follows.
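A sketch of the encoder substitution in step (3a). PyTorch and torchvision are assumptions (the patent names neither framework); the slice indices follow torchvision's VGG16 layer layout, with the last pooling layer and the classifier head removed so exactly four max-poolings remain:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGG16Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Older torchvision versions use vgg16(pretrained=True) instead.
        features = vgg16(weights="IMAGENET1K_V1").features
        # vgg16.features places max-pools at indices 4, 9, 16, 23, 30;
        # stopping at index 30 drops the last pool (and the FC head is never used).
        self.stages = nn.ModuleList([
            features[:5],     # conv block 1 + pool -> 64 ch,  1/2 resolution
            features[5:10],   # conv block 2 + pool -> 128 ch, 1/4
            features[10:17],  # conv block 3 + pool -> 256 ch, 1/8
            features[17:30],  # blocks 4-5 + pool4  -> 512 ch, 1/16 (pool5 removed)
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # four feature maps at different scales
```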
(3b) Constructing the coordinate attention (CA) module: embed a CA module into each skip connection of the UNet network obtained in step (3a);
The CA module captures long-range dependencies along the two spatial directions while preserving position information, encoding the feature map into two direction- and position-sensitive feature maps. It takes any intermediate tensor X = [x_1, x_2, x_3, ..., x_C] ∈ R^{C×H×W} as input and outputs a tensor Y = [y_1, y_2, y_3, ..., y_C] of the same size. Specifically, each channel of the input X is encoded along the horizontal and vertical directions using pooling kernels of sizes (H, 1) and (1, W); the output of the c-th channel at height h is:
z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i) (1)
where H is the height of the image, W its width, C the total number of channels, c the channel index, x_c the image of the c-th channel, i the horizontal coordinate, and R the set of real numbers;
similarly, the output of the c-th channel at width w is expressed as follows:
z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w) (2)
Equations (1) and (2) are two feature-aggregating transformations that aggregate along the two spatial directions and return a pair of direction-aware attention maps. The CA module concatenates the two resulting feature layers and transforms them with a 1 × 1 convolution, as in equation (3):
f = δ(F_1([z^h, z^w])) (3)
where δ is a non-linear activation function and f is the intermediate feature map obtained by encoding the spatial information of the horizontal and vertical directions. f is then split along the spatial dimension into two separate tensors f^h ∈ R^{C/r × H} and f^w ∈ R^{C/r × W}, where r is the reduction ratio controlling the number of channels; two further 1 × 1 convolutions F_h and F_w then convert f^h and f^w into two tensors g^h and g^w with the same number of feature layers:
g^h = σ(F_h(f^h)) (4)
g^w = σ(F_w(f^w)) (5)
where F_h and F_w are the two 1 × 1 convolutions, f^h and f^w are the two tensors obtained by splitting f, g^h and g^w are the tensors produced by the convolutions and activation, and σ is the sigmoid activation function. The reduction ratio r shrinks the channel count of f during the transformation; the outputs g^h and g^w are then expanded and used as attention weights. The final output of the CA module is given by equation (6):
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (6)
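The following is a minimal PyTorch sketch of the CA block described by equations (1)-(6); the framework is an assumption, and the reduction-ratio default and BatchNorm after F_1 follow the published coordinate-attention design rather than the patent text:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)            # C/r intermediate channels
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (H,1) pooling over width, eq. (1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (1,W) pooling over height, eq. (2)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # F_1 in eq. (3)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)               # delta, the non-linearity
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h in eq. (4)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w in eq. (5)

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = self.pool_h(x)                           # N x C x H x 1
        z_w = self.pool_w(x).permute(0, 1, 3, 2)       # N x C x W x 1
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))  # eq. (3)
        f_h, f_w = torch.split(f, [h, w], dim=2)       # split along the spatial dim
        g_h = torch.sigmoid(self.conv_h(f_h))                      # eq. (4)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # eq. (5)
        return x * g_h * g_w                           # eq. (6), broadcast multiply
```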
(3c) Constructing the data-dependent upsampling (DUp) module: the DUp module combines convolution layers with data-dependent upsampling to extract high-resolution building boundary information. Each of the four input feature maps of different scales first passes through a 3 × 3 convolution layer that reduces its channel count, is then upsampled in a data-dependent way directly back to 512 × 512, and the four upsampled feature maps are fused by point-wise addition and output by the DUp module; a sketch follows.
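A sketch of the DUp branch of step (3c), assuming PyTorch. Data-dependent upsampling is realized here as a learned 1 × 1 projection followed by pixel shuffle, one common formulation of the technique; the channel widths (mid_ch, out_ch) and the scale factors are illustrative assumptions, chosen to match four encoder maps at 1/2 to 1/16 of a 512 × 512 input:

```python
import torch
import torch.nn as nn

class DUpsampling(nn.Module):
    """Data-dependent up-sampling: learn a linear mapping from each low-res
    feature vector to an r x r patch of the high-res output."""
    def __init__(self, in_ch, out_ch, scale):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch * scale * scale, kernel_size=1)
        self.shuffle = nn.PixelShuffle(scale)  # rearranges channels into space

    def forward(self, x):
        return self.shuffle(self.proj(x))

class DUpModule(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), mid_ch=64, out_ch=64,
                 scales=(2, 4, 8, 16)):
        super().__init__()
        # 3x3 convolutions that reduce the channel count of each input map.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, mid_ch, kernel_size=3, padding=1) for c in in_channels])
        self.dup = nn.ModuleList(
            [DUpsampling(mid_ch, out_ch, s) for s in scales])

    def forward(self, feats):
        # feats: four maps at 1/2, 1/4, 1/8, 1/16 of the 512 x 512 input
        outs = [dup(conv(f)) for conv, dup, f in zip(self.reduce, self.dup, feats)]
        return torch.stack(outs, dim=0).sum(dim=0)  # point-wise addition / fusion
```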
Finally, the CAD-UNet network model is obtained.
Step (4) specifically comprises the following steps:
(4a) Constructing the joint loss function: construct a joint loss combining binary cross-entropy (BCE) loss and Focal loss; the BCE loss and the Focal loss are given by:
BL(p_t, target) = −ω · (target · ln(p_t) + (1 − target) · ln(1 − p_t)) (7)
where p_t is the prediction of the CAD-UNet network model, target is the label value, and ω is a weight;
FL(p_t) = −α · (1 − p_t)^γ · log(p_t) (8)
where p_t is the prediction of the CAD-UNet network model; α ∈ (0, 1] is a balance parameter that balances the proportion of positive and negative samples; γ ∈ [0, +∞) is a focusing parameter that down-weights the loss of easily classified samples;
the joint loss function is shown in equation (9):
Loss=BL+FL (9)
(4b) Parameter settings: set ω = 1, α = 0.5, γ = 2;
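A sketch of the joint loss of step (4a) with the settings of step (4b), assuming PyTorch and per-pixel sigmoid probabilities; interpreting p_t in the focal term as the probability of the true class follows the standard Focal-loss formulation:

```python
import torch

def joint_loss(pred, target, omega=1.0, alpha=0.5, gamma=2.0, eps=1e-7):
    """BCE loss (eq. 7) plus Focal loss (eq. 8), averaged over pixels: eq. (9)."""
    p = pred.clamp(eps, 1.0 - eps)               # predicted building probability
    bce = -omega * (target * torch.log(p) + (1 - target) * torch.log(1 - p))
    p_t = torch.where(target > 0.5, p, 1 - p)    # probability of the true class
    focal = -alpha * (1 - p_t) ** gamma * torch.log(p_t)
    return bce.mean() + focal.mean()             # Loss = BL + FL
```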
(4c) Training strategy: training starts from pre-trained VGG16 weights and uses a freezing schedule: the backbone parameters are frozen for the first 100 epochs, the whole network is trained normally for the last 100 epochs, and each experiment trains for 200 epochs in total;
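A sketch of the freeze-then-unfreeze schedule of step (4c), assuming PyTorch; the attribute name model.encoder, the optimizer, and the learning rate are illustrative, and joint_loss refers to the sketch above:

```python
import torch

def set_backbone_frozen(model, frozen: bool):
    for p in model.encoder.parameters():
        p.requires_grad = not frozen

def train(model, loader, total_epochs=200, freeze_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    set_backbone_frozen(model, True)            # epochs 1-100: backbone frozen
    for epoch in range(total_epochs):
        if epoch == freeze_epochs:
            set_backbone_frozen(model, False)   # epochs 101-200: full training
        for images, labels in loader:
            optimizer.zero_grad()
            loss = joint_loss(model(images), labels)
            loss.backward()
            optimizer.step()
```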
(4d) Model accuracy evaluation: accuracy is evaluated with the Precision metric and the intersection-over-union (IoU), computed as in equations (10) and (11):
Precision = TP / (TP + FP) (10)
IoU = TP / (TP + FP + FN) (11)
where TP counts pixels whose ground truth is positive and whose prediction is positive; FP counts pixels whose ground truth is negative but whose prediction is positive; and FN counts pixels whose ground truth is positive but whose prediction is negative.
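A sketch of the metrics of step (4d), equations (10) and (11), assuming binary NumPy masks with building = 1 and background = 0:

```python
import numpy as np

def precision_iou(pred, truth):
    tp = np.logical_and(pred == 1, truth == 1).sum()  # building predicted as building
    fp = np.logical_and(pred == 1, truth == 0).sum()  # background predicted as building
    fn = np.logical_and(pred == 0, truth == 1).sum()  # building predicted as background
    precision = tp / (tp + fp)      # eq. (10)
    iou = tp / (tp + fp + fn)       # eq. (11)
    return precision, iou
```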
Step (5) specifically comprises the following steps:
(5a) Preprocess the new remote sensing image to be extracted and resize it to 512 × 512;
(5b) Input the resized image into the trained CAD-UNet network model, which outputs a predicted image as the building extraction result; the model predicts building pixels as 255 and background pixels as 0, so white regions in the predicted image are buildings and black regions are background.
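A sketch of the automated extraction of step (5), assuming PyTorch and OpenCV; the model is assumed to output per-pixel sigmoid probabilities, and the 0.5 threshold is an illustrative choice:

```python
import cv2
import numpy as np
import torch

def extract_buildings(image_path, model, device="cuda"):
    img = cv2.imread(image_path)
    img = cv2.resize(img, (512, 512))             # step (5a): resize to 512 x 512
    x = torch.from_numpy(img).float().permute(2, 0, 1).unsqueeze(0) / 255.0
    model.eval()
    with torch.no_grad():
        prob = model(x.to(device)).squeeze().cpu().numpy()
    # Building = 255 (white), background = 0 (black), as in step (5b).
    return np.where(prob > 0.5, 255, 0).astype(np.uint8)
```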
The technical scheme has the following beneficial effects. First, high building extraction accuracy: compared with other methods, the designed network gradually extracts deep building features, fuses them, and then gradually upsamples back to the input resolution, which suits the building extraction task and markedly improves extraction accuracy. Second, better boundary extraction: the added coordinate attention (CA) module and data-dependent upsampling (DUp) module effectively capture building position and boundary information, so the extracted buildings have smoother boundaries and complete contours. Third, few network parameters and easy training: the adopted coordinate attention is plug-and-play lightweight attention, and compared with the original UNet model the CAD-UNet network model has fewer channels and lower network complexity.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a block diagram of a CAD-UNet network model according to the present invention;
FIG. 3 is a block diagram of a coordinate attention CA module of the present invention;
FIG. 4 is a block diagram of a data-dependent upsampling DUp module of the present invention;
FIG. 5 is an example of training data in the present invention;
FIG. 6 shows the prediction results of the present invention.
Detailed Description
As shown in FIG. 1, a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling comprises the following steps in sequence:
(1) Obtaining remote sensing data: download the WHU building dataset (the Wuhan University building dataset) and the Massachusetts building dataset;
(2) Data preprocessing and data enhancement: preprocessing means cutting the large images in the datasets and applying data enhancement to the cropped remote sensing images and label images; the enhanced remote sensing images and label images are divided into a training set, a validation set, and a test set in the ratio 8:1:1;
(3) Constructing the CAD-UNet network model: improving on the UNet network to build a building extraction network model, the CAD-UNet network model, comprising an encoder, a coordinate attention (CA) module, and a data-dependent upsampling (DUp) module;
(4) Model training and evaluation: training the CAD-UNet network model on the training set with a joint loss function combining binary cross-entropy (BCE) loss and Focal loss, then evaluating the building extraction accuracy and quality of the trained CAD-UNet network model on the test set;
(5) Automated building extraction: after data preprocessing, a new remote sensing image to be processed is input to the trained CAD-UNet network model, which outputs a predicted image as the building extraction result.
Step (2) specifically comprises the following steps:
(2a) Cut the large remote sensing images and label images in the Massachusetts building dataset into 512 × 512 images with a sliding window; in the label images of both the WHU and Massachusetts building datasets, mark building pixels as 1 and background pixels as 0;
(2b) Apply data enhancement to the WHU building dataset images and labels and to the cropped Massachusetts building dataset images and labels to enlarge the data volume, comprising:
Horizontal flip: flip the remote sensing image and the label image horizontally using the image processing library OpenCV;
Vertical flip: flip the remote sensing image and the label image vertically using OpenCV;
Horizontal-then-vertical flip: flip the remote sensing image and the label image first horizontally and then vertically using OpenCV;
Shift, scale, random crop, and added noise: apply shifting, scaling, random cropping, noise addition, and similar operations to the remote sensing image and the label image;
(2c) Divide the enhanced remote sensing images and label images into a training set, a validation set, and a test set in the ratio 8:1:1; the training set directly participates in training the CAD-UNet network model and feature extraction; the validation set is used to tune the hyper-parameters of the CAD-UNet network model; the test set is used after training to evaluate the accuracy and extraction quality of the CAD-UNet network model.
Step (3) specifically comprises the following steps:
(3a) Replacing the UNet encoder: replace the UNet encoder with a VGG16 network module, formed by removing the last pooling layer and the fully connected layers from VGG16; the module downsamples through several convolution blocks and four max-pooling operations, extracts building features from the remote sensing image, and outputs four feature maps at different scales;
(3b) Constructing the coordinate attention (CA) module: embed a CA module into each skip connection of the UNet network obtained in step (3a);
The CA module captures long-range dependencies along the two spatial directions while preserving position information, encoding the feature map into two direction- and position-sensitive feature maps. It takes any intermediate tensor X = [x_1, x_2, x_3, ..., x_C] ∈ R^{C×H×W} as input and outputs a tensor Y = [y_1, y_2, y_3, ..., y_C] of the same size. Specifically, each channel of the input X is encoded along the horizontal and vertical directions using pooling kernels of sizes (H, 1) and (1, W); the output of the c-th channel at height h is:
z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i) (1)
where H is the height of the image, W its width, C the total number of channels, c the channel index, x_c the image of the c-th channel, i the horizontal coordinate, and R the set of real numbers;
similarly, the output of the c-th channel at width w is expressed as follows:
z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w) (2)
Equations (1) and (2) are two feature-aggregating transformations that aggregate along the two spatial directions and return a pair of direction-aware attention maps. The CA module concatenates the two resulting feature layers and transforms them with a 1 × 1 convolution, as in equation (3):
f = δ(F_1([z^h, z^w])) (3)
where δ is a non-linear activation function and f is the intermediate feature map obtained by encoding the spatial information of the horizontal and vertical directions. f is then split along the spatial dimension into two separate tensors f^h ∈ R^{C/r × H} and f^w ∈ R^{C/r × W}, where r is the reduction ratio controlling the number of channels; two further 1 × 1 convolutions F_h and F_w then convert f^h and f^w into two tensors g^h and g^w with the same number of feature layers:
g^h = σ(F_h(f^h)) (4)
g^w = σ(F_w(f^w)) (5)
where F_h and F_w are the two 1 × 1 convolutions, f^h and f^w are the two tensors obtained by splitting f, g^h and g^w are the tensors produced by the convolutions and activation, and σ is the sigmoid activation function. The reduction ratio r shrinks the channel count of f during the transformation; the outputs g^h and g^w are then expanded and used as attention weights. The final output of the CA module is given by equation (6):
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (6)
(3c) Constructing the data-dependent upsampling (DUp) module: the DUp module combines convolution layers with data-dependent upsampling to extract high-resolution building boundary information. Each of the four input feature maps of different scales first passes through a 3 × 3 convolution layer that reduces its channel count, is then upsampled in a data-dependent way directly back to 512 × 512, and the four upsampled feature maps are fused by point-wise addition and output by the DUp module.
Finally, the CAD-UNet network model is obtained.
Step (4) specifically comprises the following steps:
(4a) Constructing the joint loss function: construct a joint loss combining binary cross-entropy (BCE) loss and Focal loss; the BCE loss and the Focal loss are given by:
BL(p_t, target) = −ω · (target · ln(p_t) + (1 − target) · ln(1 − p_t)) (7)
where p_t is the prediction of the CAD-UNet network model, target is the label value, and ω is a weight;
FL(p_t) = −α · (1 − p_t)^γ · log(p_t) (8)
where p_t is the prediction of the CAD-UNet network model; α ∈ (0, 1] is a balance parameter that balances the proportion of positive and negative samples; γ ∈ [0, +∞) is a focusing parameter that down-weights the loss of easily classified samples;
the joint loss function is shown in equation (9):
Loss=BL+FL (9)
(4b) Parameter settings: set ω = 1, α = 0.5, γ = 2;
(4c) Training strategy: training starts from pre-trained VGG16 weights and uses a freezing schedule: the backbone parameters are frozen for the first 100 epochs, the whole network is trained normally for the last 100 epochs, and each experiment trains for 200 epochs in total;
(4d) Model accuracy evaluation: accuracy is evaluated with the Precision metric and the intersection-over-union (IoU), computed as in equations (10) and (11):
Precision = TP / (TP + FP) (10)
IoU = TP / (TP + FP + FN) (11)
where TP counts pixels whose ground truth is positive and whose prediction is positive; FP counts pixels whose ground truth is negative but whose prediction is positive; and FN counts pixels whose ground truth is positive but whose prediction is negative.
Step (5) specifically comprises the following steps:
(5a) Preprocess the new remote sensing image to be extracted and resize it to 512 × 512;
(5b) Input the resized image into the trained CAD-UNet network model, which outputs a predicted image as the building extraction result; the model predicts building pixels as 255 and background pixels as 0, so white regions in the predicted image are buildings and black regions are background.
To verify the effectiveness of the invention, UNet was chosen as the comparative example; both methods were evaluated on the standard building datasets and compared on Precision and intersection-over-union.
Table 1: comparison of results on data sets for examples and comparative examples
Figure BDA0003894877760000113
As shown in FIG. 2, the CAD-UNet network model adopts an encoder-decoder structure: on the left is the encoder, used for downsampling and feature extraction; the solid arrows in the middle represent the coordinate attention (CA) modules, which attend to building position information; on the right is the decoder, used for feature fusion and upsampling; the dotted arrow at the lower right represents the data-dependent upsampling (DUp) module, which extracts building boundary information; finally, after a 1 × 1 convolution adjusts the number of channels, the building extraction result is output.
As shown in FIG. 3, the coordinate attention CA module first encodes the input feature tensor per channel along the horizontal and vertical directions; it then aggregates the features of the two directions to obtain an intermediate tensor; finally, it splits the intermediate tensor along the spatial dimension, and each branch passes through a convolution layer and a sigmoid function to produce the final output.
As shown in FIG. 4, in the data-dependent upsampling DUp module, the four input feature maps of different scales first pass through a 3 × 3 convolution layer that reduces their channel counts; data-dependent upsampling then restores each of the four feature maps directly to 512 × 512; finally, the four feature maps are fused by point-wise addition and output from the DUp module.
FIG. 5 shows examples from the WHU building dataset and the Massachusetts building dataset, with the remote sensing image on the left and the corresponding ground-truth label on the right.
FIG. 6 shows prediction results: the first column is the remote sensing image, the second the corresponding ground-truth label, the third the prediction of the CAD-UNet network model of the invention, and the fourth the prediction of UNet.
In conclusion, the designed network gradually extracts deep building features, fuses them, and gradually upsamples back to the input resolution, which suits the building extraction task and markedly improves extraction accuracy; the added coordinate attention (CA) module and data-dependent upsampling (DUp) module effectively capture building position and boundary information, so the extracted buildings have smoother boundaries and complete contours; and the adopted coordinate attention is plug-and-play lightweight attention, so compared with the original UNet model the CAD-UNet network model has fewer channels, lower network complexity, fewer parameters, and is easier to train.

Claims (5)

1. A remote sensing image building extraction method based on coordinate attention and data-dependent upsampling, characterized by comprising the following steps in sequence:
(1) Obtaining remote sensing data: downloading a WHU building data set and a Massachusetts building data set;
(2) Data preprocessing and data enhancement: preprocessing means cutting the large images in the datasets and applying data enhancement to the cropped remote sensing images and label images; the enhanced remote sensing images and label images are divided into a training set, a validation set, and a test set in the ratio 8:1:1;
(3) Constructing the CAD-UNet network model: improving on the UNet network to build a building extraction network model, the CAD-UNet network model, comprising an encoder, a coordinate attention (CA) module, and a data-dependent upsampling (DUp) module;
(4) Model training and evaluation: training the CAD-UNet network model on the training set with a joint loss function combining binary cross-entropy (BCE) loss and Focal loss, then evaluating the building extraction accuracy and quality of the trained CAD-UNet network model on the test set;
(5) Automated building extraction: after data preprocessing, a new remote sensing image to be processed is input to the trained CAD-UNet network model, which outputs a predicted image as the building extraction result.
2. The remote sensing image building extraction method based on coordinate attention and data-dependent upsampling according to claim 1, characterized in that step (2) specifically comprises the following steps:
(2a) Cut the large remote sensing images and label images in the Massachusetts building dataset into 512 × 512 images with a sliding window; in the label images of both the WHU and Massachusetts building datasets, mark building pixels as 1 and background pixels as 0;
(2b) Apply data enhancement to the WHU building dataset images and labels and to the cropped Massachusetts building dataset images and labels to enlarge the data volume, comprising:
Horizontal flip: flip the remote sensing image and the label image horizontally using the image processing library OpenCV;
Vertical flip: flip the remote sensing image and the label image vertically using OpenCV;
Horizontal-then-vertical flip: flip the remote sensing image and the label image first horizontally and then vertically using OpenCV;
Shift, scale, random crop, and added noise: apply shifting, scaling, random cropping, noise addition, and similar operations to the remote sensing image and the label image;
(2c) Divide the enhanced remote sensing images and label images into a training set, a validation set, and a test set in the ratio 8:1:1; the training set directly participates in training the CAD-UNet network model and feature extraction; the validation set is used to tune the hyper-parameters of the CAD-UNet network model; the test set is used after training to evaluate the accuracy and extraction quality of the CAD-UNet network model.
3. The remote sensing image building extraction method based on coordinate attention and data-dependent upsampling according to claim 1, characterized in that step (3) specifically comprises the following steps:
(3a) Replacing the UNet encoder: replace the UNet encoder with a VGG16 network module, formed by removing the last pooling layer and the fully connected layers from VGG16; the module downsamples through several convolution blocks and four max-pooling operations, extracts building features from the remote sensing image, and outputs four feature maps at different scales;
(3b) Constructing the coordinate attention (CA) module: embed a CA module into each skip connection of the UNet network obtained in step (3a);
The CA module captures long-range dependencies along the two spatial directions while preserving position information, encoding the feature map into two direction- and position-sensitive feature maps. It takes any intermediate tensor X = [x_1, x_2, x_3, ..., x_C] ∈ R^{C×H×W} as input and outputs a tensor Y = [y_1, y_2, y_3, ..., y_C] of the same size. Specifically, each channel of the input X is encoded along the horizontal and vertical directions using pooling kernels of sizes (H, 1) and (1, W); the output of the c-th channel at height h is:
z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i) (1)
where H is the height of the image, W its width, C the total number of channels, c the channel index, x_c the image of the c-th channel, i the horizontal coordinate, and R the set of real numbers;
similarly, the output of the c-th channel at width w is expressed as follows:
z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w) (2)
Equations (1) and (2) are two feature-aggregating transformations that aggregate along the two spatial directions and return a pair of direction-aware attention maps. The CA module concatenates the two resulting feature layers and transforms them with a 1 × 1 convolution, as in equation (3):
f = δ(F_1([z^h, z^w])) (3)
where δ is a non-linear activation function and f is the intermediate feature map obtained by encoding the spatial information of the horizontal and vertical directions. f is then split along the spatial dimension into two separate tensors f^h ∈ R^{C/r × H} and f^w ∈ R^{C/r × W}, where r is the reduction ratio controlling the number of channels; two further 1 × 1 convolutions F_h and F_w then convert f^h and f^w into two tensors g^h and g^w with the same number of feature layers:
g^h = σ(F_h(f^h)) (4)
g^w = σ(F_w(f^w)) (5)
where F_h and F_w are the two 1 × 1 convolutions, f^h and f^w are the two tensors obtained by splitting f, g^h and g^w are the tensors produced by the convolutions and activation, and σ is the sigmoid activation function. The reduction ratio r shrinks the channel count of f during the transformation; the outputs g^h and g^w are then expanded and used as attention weights. The final output of the CA module is given by equation (6):
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (6)
(3c) Constructing the data-dependent upsampling (DUp) module: the DUp module combines convolution layers with data-dependent upsampling to extract high-resolution building boundary information. Each of the four input feature maps of different scales first passes through a 3 × 3 convolution layer that reduces its channel count, is then upsampled in a data-dependent way directly back to 512 × 512, and the four upsampled feature maps are fused by point-wise addition and output by the DUp module.
Finally, the CAD-UNet network model is obtained.
4. The remote sensing image building extraction method based on coordinate attention and data-dependent upsampling according to claim 1, characterized in that step (4) specifically comprises the following steps:
(4a) Constructing the joint loss function: construct a joint loss combining binary cross-entropy (BCE) loss and Focal loss; the BCE loss and the Focal loss are given by:
BL(p_t, target) = −ω · (target · ln(p_t) + (1 − target) · ln(1 − p_t)) (7)
where p_t is the prediction of the CAD-UNet network model, target is the label value, and ω is a weight;
FL(p_t) = −α · (1 − p_t)^γ · log(p_t) (8)
where p_t is the prediction of the CAD-UNet network model; α ∈ (0, 1] is a balance parameter that balances the proportion of positive and negative samples; γ ∈ [0, +∞) is a focusing parameter that down-weights the loss of easily classified samples;
the joint loss function is shown in equation (9):
Loss=BL+FL (9)
(4b) Parameter settings: set ω = 1, α = 0.5, γ = 2;
(4c) Training strategy: training starts from pre-trained VGG16 weights and uses a freezing schedule: the backbone parameters are frozen for the first 100 epochs, the whole network is trained normally for the last 100 epochs, and each experiment trains for 200 epochs in total;
(4d) Model accuracy evaluation: accuracy is evaluated with the Precision metric and the intersection-over-union (IoU), computed as in equations (10) and (11):
Precision = TP / (TP + FP) (10)
IoU = TP / (TP + FP + FN) (11)
where TP counts pixels whose ground truth is positive and whose prediction is positive; FP counts pixels whose ground truth is negative but whose prediction is positive; and FN counts pixels whose ground truth is positive but whose prediction is negative.
5. The remote sensing image building extraction method based on coordinate attention and data-dependent upsampling according to claim 1, characterized in that step (5) specifically comprises the following steps:
(5a) Preprocess the new remote sensing image to be extracted and resize it to 512 × 512;
(5b) Input the resized image into the trained CAD-UNet network model, which outputs a predicted image as the building extraction result; the model predicts building pixels as 255 and background pixels as 0, so white regions in the predicted image are buildings and black regions are background.
CN202211270279.6A 2022-10-18 2022-10-18 Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling Pending CN115631412A (en)

Priority Applications (1)

Application number: CN202211270279.6A · Priority date: 2022-10-18 · Filing date: 2022-10-18 · Title: Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling

Applications Claiming Priority (1)

Application number: CN202211270279.6A · Priority date: 2022-10-18 · Filing date: 2022-10-18 · Title: Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling

Publications (1)

Publication number: CN115631412A · Publication date: 2023-01-20

Family

ID=84906561

Family Applications (1)

Application number: CN202211270279.6A · Filing date: 2022-10-18 · Title: Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling

Country Status (1)

Country Link
CN (1) CN115631412A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503464A (en) * 2023-06-25 2023-07-28 武汉理工大学三亚科教创新园 Farmland building height prediction method based on remote sensing image
CN116503464B (en) * 2023-06-25 2023-10-03 武汉理工大学三亚科教创新园 Farmland building height prediction method based on remote sensing image

Similar Documents

Publication Publication Date Title
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
CN111626128B (en) Pedestrian detection method based on improved YOLOv3 in orchard environment
CN110322453B (en) 3D point cloud semantic segmentation method based on position attention and auxiliary network
CN110111345B (en) Attention network-based 3D point cloud segmentation method
CN111209921A (en) License plate detection model based on improved YOLOv3 network and construction method
CN109145836B (en) Ship target video detection method based on deep learning network and Kalman filtering
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN112489054A (en) Remote sensing image semantic segmentation method based on deep learning
CN114821342B (en) Remote sensing image road extraction method and system
CN114187520B (en) Building extraction model construction and application method
CN111797841B (en) Visual saliency detection method based on depth residual error network
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN111753682A (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN112507849A (en) Dynamic-to-static scene conversion method for generating countermeasure network based on conditions
CN115223017B (en) Multi-scale feature fusion bridge detection method based on depth separable convolution
CN116222577A (en) Closed loop detection method, training method, system, electronic equipment and storage medium
CN115631412A (en) Remote sensing image building extraction method based on coordinate attention and data correlation upsampling
CN113361493B (en) Facial expression recognition method robust to different image resolutions
CN114926826A (en) Scene text detection system
CN107358625B (en) SAR image change detection method based on SPP Net and region-of-interest detection
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN116703996A (en) Monocular three-dimensional target detection algorithm based on instance-level self-adaptive depth estimation
CN116386042A (en) Point cloud semantic segmentation model based on three-dimensional pooling spatial attention mechanism

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination