CN115631412A - Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling

Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling

Info

Publication number
CN115631412A
Authority
CN
China
Prior art keywords
building
image
data
remote sensing
cad
Prior art date
Legal status
Pending
Application number
CN202211270279.6A
Other languages
Chinese (zh)
Inventor
Cheng Zhiyou
Peng Yougen
Wang Chuanjian
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202211270279.6A
Publication of CN115631412A
Legal status: Pending


Classifications

    • G06V20/176 — Scenes; terrestrial scenes; urban or other man-made structures
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/267 — Image preprocessing; segmentation of patterns in the image field by performing operations on regions
    • G06V10/774 — Pattern recognition or machine learning; generating sets of training patterns
    • G06V10/806 — Fusion of extracted features
    • G06V10/82 — Image or video recognition or understanding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling, comprising the following steps: acquiring remote sensing data; data preprocessing and data enhancement; constructing a building extraction network model, the CAD-UNet network model, comprising an encoder, a coordinate attention (CA) module, and a data-dependent upsampling (DUp) module; model training and evaluation; and automated building extraction, in which a new remote sensing image to be processed is preprocessed and fed to the trained CAD-UNet network model, which outputs a predicted image as the building extraction result. The designed network gradually extracts deep building features, fuses them, and gradually upsamples back to the input resolution, which suits the building extraction task and markedly improves extraction accuracy; it effectively captures the position and boundary information of buildings, so the extracted buildings have smoother boundaries and complete outlines.

Description

Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling
Technical Field
The invention relates to the technical field of image processing, in particular to a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling.
Background
Buildings are indispensable to people's daily lives and an important component of urban construction and development. The main task of building extraction is to identify and delineate building areas in remote sensing images, which matters for smart city construction, traffic management, population estimation, land use monitoring, and more. With the rapid development of remote sensing technology, imagery has transitioned from low to high resolution, trending toward high spatial, high spectral, and high temporal resolution. High-resolution remote sensing images carry ever more features and information, but correspondingly more noise and interference, posing new challenges: accurately extracting buildings from high-resolution remote sensing images has become a research hotspot and difficulty.
Traditional building extraction methods usually rely on prior knowledge and hand-crafted features, then apply algorithms such as clustering; they mainly comprise methods based on building features and methods based on auxiliary information. Most exploit characteristics such as building shape and texture, or auxiliary cues. Their principles are simple, but they suffer from low recognition rates and frequent errors, are time-consuming and labor-intensive, and are therefore severely limited in practical applications. Specifically:
First, insufficient attention to building position information. Position is crucial for the building extraction task: buildings are usually regularly distributed in a remote sensing image and are often shadowed or occluded by trees, so attending to position information yields the accurate location of each building and avoids misclassification. The prior art does not fully exploit position information, especially for shadow-occluded or densely adjoining buildings, so misclassification occurs easily.
Second, rough and blurry extracted boundaries. Buildings are mostly rectangular with regular boundaries, so boundary information is an essential feature of the building extraction task. Ignoring edge information easily produces rough, blurry, or disordered boundaries and holes. The prior art performs only conventional feature extraction and cannot fully mine building edge features, resulting in poor extraction with coarse, fuzzy boundaries.
Third, imbalance between positive and negative samples. Building extraction is a binary classification task that distinguishes buildings from background, but background pixels usually far outnumber building pixels in a remote sensing image, which weakens the model's ability to extract buildings during training. The prior art does not adequately address this imbalance, so the model's extraction accuracy is low and its generalization is weak.
Disclosure of Invention
The invention aims to provide a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling that effectively resolves misclassification caused by insufficient attention to building position information, improves building boundary extraction, alleviates the imbalance of positive and negative samples, and improves the generalization ability of the network.
To achieve this purpose, the invention adopts the following technical scheme: a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling, comprising the following steps in sequence:
(1) Obtaining remote sensing data: downloading a WHU building data set and a Massachusetts building data set;
(2) Data preprocessing and data enhancement: preprocessing means cutting the large images in the datasets and applying data enhancement to the cropped remote sensing images and label images; the enhanced remote sensing images and label images are divided into a training set, a validation set, and a test set in the ratio 8:1:1;
(3) Constructing the CAD-UNet network model: improving on the UNet network to build a building extraction network model, the CAD-UNet network model, comprising an encoder, a coordinate attention (CA) module, and a data-dependent upsampling (DUp) module;
(4) Model training and evaluation: training the CAD-UNet network model on the training set with a joint loss function combining binary cross-entropy (BCE) loss and Focal loss, then evaluating the building extraction accuracy and quality of the trained CAD-UNet network model on the test set;
(5) Automated building extraction: after data preprocessing, a new remote sensing image to be processed is input to the trained CAD-UNet network model, which outputs a predicted image as the building extraction result.
The step (2) specifically comprises the following steps:
(2a) Cut the large remote sensing images and label images in the Massachusetts building dataset into 512 × 512 images with a sliding window; in the label images of both the WHU and Massachusetts building datasets, mark building pixels as 1 and background pixels as 0;
(2b) Apply data enhancement to the WHU building dataset images and labels and to the cropped Massachusetts building dataset images and labels to enlarge the data volume, comprising:
Horizontal flip: flip the remote sensing image and the label image horizontally using the image processing library OpenCV;
Vertical flip: flip the remote sensing image and the label image vertically using OpenCV;
Horizontal-then-vertical flip: flip the remote sensing image and the label image first horizontally and then vertically using OpenCV;
Shift, scale, random crop, and added noise: apply shifting, scaling, random cropping, noise addition, and similar operations to the remote sensing image and the label image;
(2c) Divide the enhanced remote sensing images and label images into a training set, a validation set, and a test set in the ratio 8:1:1; the training set directly participates in training the CAD-UNet network model and feature extraction; the validation set is used to tune the hyper-parameters of the CAD-UNet network model; the test set is used after training to evaluate the accuracy and extraction quality of the CAD-UNet network model.
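A minimal sketch of the preprocessing and enhancement of step (2), assuming Python with OpenCV and NumPy (the patent names only OpenCV); the 512 × 512 tile size follows the text, while the stride and the subset of augmentations shown are illustrative choices:

```python
# Sliding-window cropping and flip-based augmentation, as described in steps (2a)-(2b).
import cv2
import numpy as np

def sliding_window_crop(image, tile=512, stride=512):
    """Cut a large remote sensing image (or its label) into tile x tile patches."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            patches.append(image[y:y + tile, x:x + tile])
    return patches

def augment(image, label):
    """Return the original pair plus the three flip variants of step (2b);
    shift/scale/random-crop/noise are omitted here for brevity."""
    pairs = [(image, label)]
    pairs.append((cv2.flip(image, 1), cv2.flip(label, 1)))    # horizontal flip
    pairs.append((cv2.flip(image, 0), cv2.flip(label, 0)))    # vertical flip
    pairs.append((cv2.flip(image, -1), cv2.flip(label, -1)))  # horizontal then vertical
    return pairs
```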
Step (3) specifically comprises the following steps:
(3a) Replacing the UNet encoder: replace the UNet encoder with a VGG16 network module, formed by removing the last pooling layer and the fully connected layers from VGG16; the module downsamples through several convolution blocks and four max-pooling operations, extracts building features from the remote sensing image, and outputs four feature maps at different scales; a sketch of this substitution follows.
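A sketch of the encoder substitution in step (3a). PyTorch and torchvision are assumptions (the patent names neither framework); the slice indices follow torchvision's VGG16 layer layout, with the last pooling layer and the classifier head removed so exactly four max-poolings remain:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGG16Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Older torchvision versions use vgg16(pretrained=True) instead.
        features = vgg16(weights="IMAGENET1K_V1").features
        # vgg16.features places max-pools at indices 4, 9, 16, 23, 30;
        # stopping at index 30 drops the last pool (and the FC head is never used).
        self.stages = nn.ModuleList([
            features[:5],     # conv block 1 + pool -> 64 ch,  1/2 resolution
            features[5:10],   # conv block 2 + pool -> 128 ch, 1/4
            features[10:17],  # conv block 3 + pool -> 256 ch, 1/8
            features[17:30],  # blocks 4-5 + pool4  -> 512 ch, 1/16 (pool5 removed)
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # four feature maps at different scales
```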
(3b) Constructing the coordinate attention (CA) module: embed a CA module into each skip connection of the UNet network obtained in step (3a);
The CA module captures long-range dependencies along the two spatial directions while preserving position information, encoding the feature map into two direction- and position-sensitive feature maps. It takes any intermediate tensor X = [x_1, x_2, x_3, ..., x_C] ∈ R^{C×H×W} as input and outputs a tensor Y = [y_1, y_2, y_3, ..., y_C] of the same size. Specifically, each channel of the input X is encoded along the horizontal and vertical directions using pooling kernels of sizes (H, 1) and (1, W); the output of the c-th channel at height h is:
z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i) (1)
where H is the height of the image, W its width, C the total number of channels, c the channel index, x_c the image of the c-th channel, i the horizontal coordinate, and R the set of real numbers;
similarly, the output of the c-th channel at width w is expressed as follows:
z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w) (2)
Equations (1) and (2) are two feature-aggregating transformations that aggregate along the two spatial directions and return a pair of direction-aware attention maps. The CA module concatenates the two resulting feature layers and transforms them with a 1 × 1 convolution, as in equation (3):
f = δ(F_1([z^h, z^w])) (3)
where δ is a non-linear activation function and f is the intermediate feature map obtained by encoding the spatial information of the horizontal and vertical directions. f is then split along the spatial dimension into two separate tensors f^h ∈ R^{C/r × H} and f^w ∈ R^{C/r × W}, where r is the reduction ratio controlling the number of channels; two further 1 × 1 convolutions F_h and F_w then convert f^h and f^w into two tensors g^h and g^w with the same number of feature layers:
g^h = σ(F_h(f^h)) (4)
g^w = σ(F_w(f^w)) (5)
where F_h and F_w are the two 1 × 1 convolutions, f^h and f^w are the two tensors obtained by splitting f, g^h and g^w are the tensors produced by the convolutions and activation, and σ is the sigmoid activation function. The reduction ratio r shrinks the channel count of f during the transformation; the outputs g^h and g^w are then expanded and used as attention weights. The final output of the CA module is given by equation (6):
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (6)
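The following is a minimal PyTorch sketch of the CA block described by equations (1)-(6); the framework is an assumption, and the reduction-ratio default and BatchNorm after F_1 follow the published coordinate-attention design rather than the patent text:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)            # C/r intermediate channels
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (H,1) pooling over width, eq. (1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (1,W) pooling over height, eq. (2)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # F_1 in eq. (3)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)               # delta, the non-linearity
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h in eq. (4)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w in eq. (5)

    def forward(self, x):
        n, c, h, w = x.shape
        z_h = self.pool_h(x)                           # N x C x H x 1
        z_w = self.pool_w(x).permute(0, 1, 3, 2)       # N x C x W x 1
        f = self.act(self.bn1(self.conv1(torch.cat([z_h, z_w], dim=2))))  # eq. (3)
        f_h, f_w = torch.split(f, [h, w], dim=2)       # split along the spatial dim
        g_h = torch.sigmoid(self.conv_h(f_h))                      # eq. (4)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # eq. (5)
        return x * g_h * g_w                           # eq. (6), broadcast multiply
```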
(3c) Constructing the data-dependent upsampling (DUp) module: the DUp module combines convolution layers with data-dependent upsampling to extract high-resolution building boundary information. Each of the four input feature maps of different scales first passes through a 3 × 3 convolution layer that reduces its channel count, is then upsampled in a data-dependent way directly back to 512 × 512, and the four upsampled feature maps are fused by point-wise addition and output by the DUp module; a sketch follows.
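A sketch of the DUp branch of step (3c), assuming PyTorch. Data-dependent upsampling is realized here as a learned 1 × 1 projection followed by pixel shuffle, one common formulation of the technique; the channel widths (mid_ch, out_ch) and the scale factors are illustrative assumptions, chosen to match four encoder maps at 1/2 to 1/16 of a 512 × 512 input:

```python
import torch
import torch.nn as nn

class DUpsampling(nn.Module):
    """Data-dependent up-sampling: learn a linear mapping from each low-res
    feature vector to an r x r patch of the high-res output."""
    def __init__(self, in_ch, out_ch, scale):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch * scale * scale, kernel_size=1)
        self.shuffle = nn.PixelShuffle(scale)  # rearranges channels into space

    def forward(self, x):
        return self.shuffle(self.proj(x))

class DUpModule(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), mid_ch=64, out_ch=64,
                 scales=(2, 4, 8, 16)):
        super().__init__()
        # 3x3 convolutions that reduce the channel count of each input map.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, mid_ch, kernel_size=3, padding=1) for c in in_channels])
        self.dup = nn.ModuleList(
            [DUpsampling(mid_ch, out_ch, s) for s in scales])

    def forward(self, feats):
        # feats: four maps at 1/2, 1/4, 1/8, 1/16 of the 512 x 512 input
        outs = [dup(conv(f)) for conv, dup, f in zip(self.reduce, self.dup, feats)]
        return torch.stack(outs, dim=0).sum(dim=0)  # point-wise addition / fusion
```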
Finally, the CAD-UNet network model is obtained.
Step (4) specifically comprises the following steps:
(4a) Constructing the joint loss function: construct a joint loss combining binary cross-entropy (BCE) loss and Focal loss; the BCE loss and the Focal loss are given by:
BL(p_t, target) = −ω · (target · ln(p_t) + (1 − target) · ln(1 − p_t)) (7)
where p_t is the prediction of the CAD-UNet network model, target is the label value, and ω is a weight;
FL(p_t) = −α · (1 − p_t)^γ · log(p_t) (8)
where p_t is the prediction of the CAD-UNet network model; α ∈ (0, 1] is a balance parameter that balances the proportion of positive and negative samples; γ ∈ [0, +∞) is a focusing parameter that down-weights the loss of easily classified samples;
the joint loss function is shown in equation (9):
Loss=BL+FL (9)
(4b) Parameter settings: set ω = 1, α = 0.5, γ = 2;
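A sketch of the joint loss of step (4a) with the settings of step (4b), assuming PyTorch and per-pixel sigmoid probabilities; interpreting p_t in the focal term as the probability of the true class follows the standard Focal-loss formulation:

```python
import torch

def joint_loss(pred, target, omega=1.0, alpha=0.5, gamma=2.0, eps=1e-7):
    """BCE loss (eq. 7) plus Focal loss (eq. 8), averaged over pixels: eq. (9)."""
    p = pred.clamp(eps, 1.0 - eps)               # predicted building probability
    bce = -omega * (target * torch.log(p) + (1 - target) * torch.log(1 - p))
    p_t = torch.where(target > 0.5, p, 1 - p)    # probability of the true class
    focal = -alpha * (1 - p_t) ** gamma * torch.log(p_t)
    return bce.mean() + focal.mean()             # Loss = BL + FL
```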
(4c) Training strategy: training starts from pre-trained VGG16 weights and uses a freezing schedule: the backbone parameters are frozen for the first 100 epochs, the whole network is trained normally for the last 100 epochs, and each experiment trains for 200 epochs in total;
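A sketch of the freeze-then-unfreeze schedule of step (4c), assuming PyTorch; the attribute name model.encoder, the optimizer, and the learning rate are illustrative, and joint_loss refers to the sketch above:

```python
import torch

def set_backbone_frozen(model, frozen: bool):
    for p in model.encoder.parameters():
        p.requires_grad = not frozen

def train(model, loader, total_epochs=200, freeze_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    set_backbone_frozen(model, True)            # epochs 1-100: backbone frozen
    for epoch in range(total_epochs):
        if epoch == freeze_epochs:
            set_backbone_frozen(model, False)   # epochs 101-200: full training
        for images, labels in loader:
            optimizer.zero_grad()
            loss = joint_loss(model(images), labels)
            loss.backward()
            optimizer.step()
```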
(4d) Model accuracy evaluation: accuracy is evaluated with the Precision metric and the intersection-over-union (IoU), computed as in equations (10) and (11):
Precision = TP / (TP + FP) (10)
IoU = TP / (TP + FP + FN) (11)
where TP counts pixels whose ground truth is positive and whose prediction is positive; FP counts pixels whose ground truth is negative but whose prediction is positive; and FN counts pixels whose ground truth is positive but whose prediction is negative.
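A sketch of the metrics of step (4d), equations (10) and (11), assuming binary NumPy masks with building = 1 and background = 0:

```python
import numpy as np

def precision_iou(pred, truth):
    tp = np.logical_and(pred == 1, truth == 1).sum()  # building predicted as building
    fp = np.logical_and(pred == 1, truth == 0).sum()  # background predicted as building
    fn = np.logical_and(pred == 0, truth == 1).sum()  # building predicted as background
    precision = tp / (tp + fp)      # eq. (10)
    iou = tp / (tp + fp + fn)       # eq. (11)
    return precision, iou
```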
Step (5) specifically comprises the following steps:
(5a) Preprocess the new remote sensing image to be extracted and resize it to 512 × 512;
(5b) Input the resized image into the trained CAD-UNet network model, which outputs a predicted image as the building extraction result; the model predicts building pixels as 255 and background pixels as 0, so white regions in the predicted image are buildings and black regions are background.
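A sketch of the automated extraction of step (5), assuming PyTorch and OpenCV; the model is assumed to output per-pixel sigmoid probabilities, and the 0.5 threshold is an illustrative choice:

```python
import cv2
import numpy as np
import torch

def extract_buildings(image_path, model, device="cuda"):
    img = cv2.imread(image_path)
    img = cv2.resize(img, (512, 512))             # step (5a): resize to 512 x 512
    x = torch.from_numpy(img).float().permute(2, 0, 1).unsqueeze(0) / 255.0
    model.eval()
    with torch.no_grad():
        prob = model(x.to(device)).squeeze().cpu().numpy()
    # Building = 255 (white), background = 0 (black), as in step (5b).
    return np.where(prob > 0.5, 255, 0).astype(np.uint8)
```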
The technical scheme has the following beneficial effects. First, high building extraction accuracy: compared with other methods, the designed network gradually extracts deep building features, fuses them, and then gradually upsamples back to the input resolution, which suits the building extraction task and markedly improves extraction accuracy. Second, better boundary extraction: the added coordinate attention (CA) module and data-dependent upsampling (DUp) module effectively capture building position and boundary information, so the extracted buildings have smoother boundaries and complete contours. Third, few network parameters and easy training: the adopted coordinate attention is plug-and-play lightweight attention, and compared with the original UNet model the CAD-UNet network model has fewer channels and lower network complexity.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a block diagram of a CAD-UNet network model according to the present invention;
FIG. 3 is a block diagram of a coordinate attention CA module of the present invention;
FIG. 4 is a block diagram of a data-dependent upsampling DUp module of the present invention;
FIG. 5 is an example of training data in the present invention;
FIG. 6 shows the prediction results of the present invention.
Detailed Description
As shown in FIG. 1, a remote sensing image building extraction method based on coordinate attention and data-dependent upsampling comprises the following steps in sequence:
(1) Obtaining remote sensing data: download the WHU building dataset (the Wuhan University building dataset) and the Massachusetts building dataset;
(2) Data preprocessing and data enhancement: preprocessing means cutting the large images in the datasets and applying data enhancement to the cropped remote sensing images and label images; the enhanced remote sensing images and label images are divided into a training set, a validation set, and a test set in the ratio 8:1:1;
(3) Constructing the CAD-UNet network model: improving on the UNet network to build a building extraction network model, the CAD-UNet network model, comprising an encoder, a coordinate attention (CA) module, and a data-dependent upsampling (DUp) module;
(4) Model training and evaluation: training the CAD-UNet network model on the training set with a joint loss function combining binary cross-entropy (BCE) loss and Focal loss, then evaluating the building extraction accuracy and quality of the trained CAD-UNet network model on the test set;
(5) Automated building extraction: after data preprocessing, a new remote sensing image to be processed is input to the trained CAD-UNet network model, which outputs a predicted image as the building extraction result.
Step (2) specifically comprises the following steps:
(2a) Cut the large remote sensing images and label images in the Massachusetts building dataset into 512 × 512 images with a sliding window; in the label images of both the WHU and Massachusetts building datasets, mark building pixels as 1 and background pixels as 0;
(2b) Apply data enhancement to the WHU building dataset images and labels and to the cropped Massachusetts building dataset images and labels to enlarge the data volume, comprising:
Horizontal flip: flip the remote sensing image and the label image horizontally using the image processing library OpenCV;
Vertical flip: flip the remote sensing image and the label image vertically using OpenCV;
Horizontal-then-vertical flip: flip the remote sensing image and the label image first horizontally and then vertically using OpenCV;
Shift, scale, random crop, and added noise: apply shifting, scaling, random cropping, noise addition, and similar operations to the remote sensing image and the label image;
(2c) Divide the enhanced remote sensing images and label images into a training set, a validation set, and a test set in the ratio 8:1:1; the training set directly participates in training the CAD-UNet network model and feature extraction; the validation set is used to tune the hyper-parameters of the CAD-UNet network model; the test set is used after training to evaluate the accuracy and extraction quality of the CAD-UNet network model.
Step (3) specifically comprises the following steps:
(3a) Replacing the UNet encoder: replace the UNet encoder with a VGG16 network module, formed by removing the last pooling layer and the fully connected layers from VGG16; the module downsamples through several convolution blocks and four max-pooling operations, extracts building features from the remote sensing image, and outputs four feature maps at different scales;
(3b) Constructing the coordinate attention (CA) module: embed a CA module into each skip connection of the UNet network obtained in step (3a);
The CA module captures long-range dependencies along the two spatial directions while preserving position information, encoding the feature map into two direction- and position-sensitive feature maps. It takes any intermediate tensor X = [x_1, x_2, x_3, ..., x_C] ∈ R^{C×H×W} as input and outputs a tensor Y = [y_1, y_2, y_3, ..., y_C] of the same size. Specifically, each channel of the input X is encoded along the horizontal and vertical directions using pooling kernels of sizes (H, 1) and (1, W); the output of the c-th channel at height h is:
z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i) (1)
where H is the height of the image, W its width, C the total number of channels, c the channel index, x_c the image of the c-th channel, i the horizontal coordinate, and R the set of real numbers;
similarly, the output of the c-th channel at width w is expressed as follows:
z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w) (2)
Equations (1) and (2) are two feature-aggregating transformations that aggregate along the two spatial directions and return a pair of direction-aware attention maps. The CA module concatenates the two resulting feature layers and transforms them with a 1 × 1 convolution, as in equation (3):
f = δ(F_1([z^h, z^w])) (3)
where δ is a non-linear activation function and f is the intermediate feature map obtained by encoding the spatial information of the horizontal and vertical directions. f is then split along the spatial dimension into two separate tensors f^h ∈ R^{C/r × H} and f^w ∈ R^{C/r × W}, where r is the reduction ratio controlling the number of channels; two further 1 × 1 convolutions F_h and F_w then convert f^h and f^w into two tensors g^h and g^w with the same number of feature layers:
g^h = σ(F_h(f^h)) (4)
g^w = σ(F_w(f^w)) (5)
where F_h and F_w are the two 1 × 1 convolutions, f^h and f^w are the two tensors obtained by splitting f, g^h and g^w are the tensors produced by the convolutions and activation, and σ is the sigmoid activation function. The reduction ratio r shrinks the channel count of f during the transformation; the outputs g^h and g^w are then expanded and used as attention weights. The final output of the CA module is given by equation (6):
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (6)
(3c) Constructing the data-dependent upsampling (DUp) module: the DUp module combines convolution layers with data-dependent upsampling to extract high-resolution building boundary information. Each of the four input feature maps of different scales first passes through a 3 × 3 convolution layer that reduces its channel count, is then upsampled in a data-dependent way directly back to 512 × 512, and the four upsampled feature maps are fused by point-wise addition and output by the DUp module.
Finally, the CAD-UNet network model is obtained.
Step (4) specifically comprises the following steps:
(4a) Constructing the joint loss function: construct a joint loss combining binary cross-entropy (BCE) loss and Focal loss; the BCE loss and the Focal loss are given by:
BL(p_t, target) = −ω · (target · ln(p_t) + (1 − target) · ln(1 − p_t)) (7)
where p_t is the prediction of the CAD-UNet network model, target is the label value, and ω is a weight;
FL(p_t) = −α · (1 − p_t)^γ · log(p_t) (8)
where p_t is the prediction of the CAD-UNet network model; α ∈ (0, 1] is a balance parameter that balances the proportion of positive and negative samples; γ ∈ [0, +∞) is a focusing parameter that down-weights the loss of easily classified samples;
the joint loss function is shown in equation (9):
Loss=BL+FL (9)
(4b) Parameter settings: set ω = 1, α = 0.5, γ = 2;
(4c) Training strategy: training starts from pre-trained VGG16 weights and uses a freezing schedule: the backbone parameters are frozen for the first 100 epochs, the whole network is trained normally for the last 100 epochs, and each experiment trains for 200 epochs in total;
(4d) Model accuracy evaluation: accuracy is evaluated with the Precision metric and the intersection-over-union (IoU), computed as in equations (10) and (11):
Precision = TP / (TP + FP) (10)
IoU = TP / (TP + FP + FN) (11)
where TP counts pixels whose ground truth is positive and whose prediction is positive; FP counts pixels whose ground truth is negative but whose prediction is positive; and FN counts pixels whose ground truth is positive but whose prediction is negative.
Step (5) specifically comprises the following steps:
(5a) Preprocess the new remote sensing image to be extracted and resize it to 512 × 512;
(5b) Input the resized image into the trained CAD-UNet network model, which outputs a predicted image as the building extraction result; the model predicts building pixels as 255 and background pixels as 0, so white regions in the predicted image are buildings and black regions are background.
To verify the effectiveness of the invention, UNet was chosen as the comparative example; both methods were evaluated on the standard building datasets and compared on Precision and intersection-over-union.
Table 1: comparison of results on data sets for examples and comparative examples
Figure BDA0003894877760000113
As shown in FIG. 2, the CAD-UNet network model adopts an encoder-decoder structure: on the left is the encoder, used for downsampling and feature extraction; the solid arrows in the middle represent the coordinate attention (CA) modules, which attend to building position information; on the right is the decoder, used for feature fusion and upsampling; the dotted arrow at the lower right represents the data-dependent upsampling (DUp) module, which extracts building boundary information; finally, after a 1 × 1 convolution adjusts the number of channels, the building extraction result is output.
As shown in FIG. 3, the coordinate attention CA module first encodes the input feature tensor per channel along the horizontal and vertical directions; it then aggregates the features of the two directions to obtain an intermediate tensor; finally, it splits the intermediate tensor along the spatial dimension, and each branch passes through a convolution layer and a sigmoid function to produce the final output.
As shown in FIG. 4, in the data-dependent upsampling DUp module, the four input feature maps of different scales first pass through a 3 × 3 convolution layer that reduces their channel counts; data-dependent upsampling then restores each of the four feature maps directly to 512 × 512; finally, the four feature maps are fused by point-wise addition and output from the DUp module.
FIG. 5 shows examples from the WHU building dataset and the Massachusetts building dataset, with the remote sensing image on the left and the corresponding ground-truth label on the right.
FIG. 6 shows prediction results: the first column is the remote sensing image, the second the corresponding ground-truth label, the third the prediction of the CAD-UNet network model of the invention, and the fourth the prediction of UNet.
In conclusion, the designed network gradually extracts deep building features, fuses them, and gradually upsamples back to the input resolution, which suits the building extraction task and markedly improves extraction accuracy; the added coordinate attention (CA) module and data-dependent upsampling (DUp) module effectively capture building position and boundary information, so the extracted buildings have smoother boundaries and complete contours; and the adopted coordinate attention is plug-and-play lightweight attention, so compared with the original UNet model the CAD-UNet network model has fewer channels, lower network complexity, fewer parameters, and is easier to train.

Claims (5)

1. A remote sensing image building extraction method based on coordinate attention and data-dependent upsampling, characterized by comprising the following steps in sequence:
(1) Obtaining remote sensing data: downloading a WHU building data set and a Massachusetts building data set;
(2) Data preprocessing and data enhancement: preprocessing means cutting the large images in the datasets and applying data enhancement to the cropped remote sensing images and label images; the enhanced remote sensing images and label images are divided into a training set, a validation set, and a test set in the ratio 8:1:1;
(3) Constructing the CAD-UNet network model: improving on the UNet network to build a building extraction network model, the CAD-UNet network model, comprising an encoder, a coordinate attention (CA) module, and a data-dependent upsampling (DUp) module;
(4) Model training and evaluation: training the CAD-UNet network model on the training set with a joint loss function combining binary cross-entropy (BCE) loss and Focal loss, then evaluating the building extraction accuracy and quality of the trained CAD-UNet network model on the test set;
(5) Automated building extraction: after data preprocessing, a new remote sensing image to be processed is input to the trained CAD-UNet network model, which outputs a predicted image as the building extraction result.
2. The remote sensing image building extraction method based on coordinate attention and data-dependent upsampling according to claim 1, characterized in that step (2) specifically comprises the following steps:
(2a) Cut the large remote sensing images and label images in the Massachusetts building dataset into 512 × 512 images with a sliding window; in the label images of both the WHU and Massachusetts building datasets, mark building pixels as 1 and background pixels as 0;
(2b) Apply data enhancement to the WHU building dataset images and labels and to the cropped Massachusetts building dataset images and labels to enlarge the data volume, comprising:
Horizontal flip: flip the remote sensing image and the label image horizontally using the image processing library OpenCV;
Vertical flip: flip the remote sensing image and the label image vertically using OpenCV;
Horizontal-then-vertical flip: flip the remote sensing image and the label image first horizontally and then vertically using OpenCV;
Shift, scale, random crop, and added noise: apply shifting, scaling, random cropping, noise addition, and similar operations to the remote sensing image and the label image;
(2c) Divide the enhanced remote sensing images and label images into a training set, a validation set, and a test set in the ratio 8:1:1; the training set directly participates in training the CAD-UNet network model and feature extraction; the validation set is used to tune the hyper-parameters of the CAD-UNet network model; the test set is used after training to evaluate the accuracy and extraction quality of the CAD-UNet network model.
3. The remote sensing image building extraction method based on coordinate attention and data-dependent upsampling according to claim 1, characterized in that step (3) specifically comprises the following steps:
(3a) Replacing the UNet encoder: replace the UNet encoder with a VGG16 network module, formed by removing the last pooling layer and the fully connected layers from VGG16; the module downsamples through several convolution blocks and four max-pooling operations, extracts building features from the remote sensing image, and outputs four feature maps at different scales;
(3b) Constructing the coordinate attention (CA) module: embed a CA module into each skip connection of the UNet network obtained in step (3a);
The CA module captures long-range dependencies along the two spatial directions while preserving position information, encoding the feature map into two direction- and position-sensitive feature maps. It takes any intermediate tensor X = [x_1, x_2, x_3, ..., x_C] ∈ R^{C×H×W} as input and outputs a tensor Y = [y_1, y_2, y_3, ..., y_C] of the same size. Specifically, each channel of the input X is encoded along the horizontal and vertical directions using pooling kernels of sizes (H, 1) and (1, W); the output of the c-th channel at height h is:
z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i) (1)
where H is the height of the image, W its width, C the total number of channels, c the channel index, x_c the image of the c-th channel, i the horizontal coordinate, and R the set of real numbers;
similarly, the output of the c-th channel at width w is expressed as follows:
z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w) (2)
Equations (1) and (2) are two feature-aggregating transformations that aggregate along the two spatial directions and return a pair of direction-aware attention maps. The CA module concatenates the two resulting feature layers and transforms them with a 1 × 1 convolution, as in equation (3):
f = δ(F_1([z^h, z^w])) (3)
where δ is a non-linear activation function and f is the intermediate feature map obtained by encoding the spatial information of the horizontal and vertical directions. f is then split along the spatial dimension into two separate tensors f^h ∈ R^{C/r × H} and f^w ∈ R^{C/r × W}, where r is the reduction ratio controlling the number of channels; two further 1 × 1 convolutions F_h and F_w then convert f^h and f^w into two tensors g^h and g^w with the same number of feature layers:
g^h = σ(F_h(f^h)) (4)
g^w = σ(F_w(f^w)) (5)
where F_h and F_w are the two 1 × 1 convolutions, f^h and f^w are the two tensors obtained by splitting f, g^h and g^w are the tensors produced by the convolutions and activation, and σ is the sigmoid activation function. The reduction ratio r shrinks the channel count of f during the transformation; the outputs g^h and g^w are then expanded and used as attention weights. The final output of the CA module is given by equation (6):
y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j) (6)
(3c) Constructing the data-dependent upsampling (DUp) module: the DUp module combines convolution layers with data-dependent upsampling to extract high-resolution building boundary information. Each of the four input feature maps of different scales first passes through a 3 × 3 convolution layer that reduces its channel count, is then upsampled in a data-dependent way directly back to 512 × 512, and the four upsampled feature maps are fused by point-wise addition and output by the DUp module.
Finally, the CAD-UNet network model is obtained.
4. The remote sensing image building extraction method based on coordinate attention and data-dependent upsampling according to claim 1, characterized in that step (4) specifically comprises the following steps:
(4a) Constructing the joint loss function: construct a joint loss combining binary cross-entropy (BCE) loss and Focal loss; the BCE loss and the Focal loss are given by:
BL(p_t, target) = −ω · (target · ln(p_t) + (1 − target) · ln(1 − p_t)) (7)
where p_t is the prediction of the CAD-UNet network model, target is the label value, and ω is a weight;
FL(p_t) = −α · (1 − p_t)^γ · log(p_t) (8)
where p_t is the prediction of the CAD-UNet network model; α ∈ (0, 1] is a balance parameter that balances the proportion of positive and negative samples; γ ∈ [0, +∞) is a focusing parameter that down-weights the loss of easily classified samples;
the joint loss function is shown in equation (9):
Loss=BL+FL (9)
(4b) Parameter settings: set ω = 1, α = 0.5, γ = 2;
(4c) Training strategy: training starts from pre-trained VGG16 weights and uses a freezing schedule: the backbone parameters are frozen for the first 100 epochs, the whole network is trained normally for the last 100 epochs, and each experiment trains for 200 epochs in total;
(4d) Model accuracy evaluation: accuracy is evaluated with the Precision metric and the intersection-over-union (IoU), computed as in equations (10) and (11):
Precision = TP / (TP + FP) (10)
IoU = TP / (TP + FP + FN) (11)
where TP counts pixels whose ground truth is positive and whose prediction is positive; FP counts pixels whose ground truth is negative but whose prediction is positive; and FN counts pixels whose ground truth is positive but whose prediction is negative.
5. The remote sensing image building extraction method based on coordinate attention and data-dependent upsampling according to claim 1, characterized in that step (5) specifically comprises the following steps:
(5a) Preprocess the new remote sensing image to be extracted and resize it to 512 × 512;
(5b) Input the resized image into the trained CAD-UNet network model, which outputs a predicted image as the building extraction result; the model predicts building pixels as 255 and background pixels as 0, so white regions in the predicted image are buildings and black regions are background.
CN202211270279.6A 2022-10-18 2022-10-18 Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling Pending CN115631412A (en)

Priority Applications (1)

Application number: CN202211270279.6A · Priority date: 2022-10-18 · Filing date: 2022-10-18 · Title: Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling

Applications Claiming Priority (1)

Application number: CN202211270279.6A · Priority date: 2022-10-18 · Filing date: 2022-10-18 · Title: Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling

Publications (1)

Publication number: CN115631412A · Publication date: 2023-01-20

Family

ID=84906561

Family Applications (1)

Application number: CN202211270279.6A · Filing date: 2022-10-18 · Title: Remote sensing image building extraction method based on coordinate attention and data-dependent upsampling

Country Status (1)

Country Link
CN (1) CN115631412A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503464A (en) * 2023-06-25 2023-07-28 武汉理工大学三亚科教创新园 Farmland building height prediction method based on remote sensing image
CN116503464B (en) * 2023-06-25 2023-10-03 武汉理工大学三亚科教创新园 Farmland building height prediction method based on remote sensing image

Similar Documents

Publication Publication Date Title
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
CN111626128B (en) Pedestrian detection method based on improved YOLOv3 in orchard environment
CN110322453B (en) 3D point cloud semantic segmentation method based on position attention and auxiliary network
CN110111345B (en) Attention network-based 3D point cloud segmentation method
CN111209921A (en) License plate detection model based on improved YOLOv3 network and construction method
CN109145836B (en) Ship target video detection method based on deep learning network and Kalman filtering
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN112489054A (en) Remote sensing image semantic segmentation method based on deep learning
CN114821342B (en) Remote sensing image road extraction method and system
CN114187520B (en) Building extraction model construction and application method
CN111797841B (en) Visual saliency detection method based on depth residual error network
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN111753682A (en) Hoisting area dynamic monitoring method based on target detection algorithm
CN112507849A (en) Dynamic-to-static scene conversion method for generating countermeasure network based on conditions
CN115223017B (en) Multi-scale feature fusion bridge detection method based on depth separable convolution
CN116222577A (en) Closed loop detection method, training method, system, electronic equipment and storage medium
CN115631412A (en) Remote sensing image building extraction method based on coordinate attention and data correlation upsampling
CN113361493B (en) Facial expression recognition method robust to different image resolutions
CN114926826A (en) Scene text detection system
CN107358625B (en) SAR image change detection method based on SPP Net and region-of-interest detection
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN116703996A (en) Monocular three-dimensional target detection algorithm based on instance-level self-adaptive depth estimation
CN116386042A (en) Point cloud semantic segmentation model based on three-dimensional pooling spatial attention mechanism

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination