CN116912708A - Remote sensing image building extraction method based on deep learning - Google Patents
Remote sensing image building extraction method based on deep learning
- Publication number
- CN116912708A CN116912708A CN202310894293.1A CN202310894293A CN116912708A CN 116912708 A CN116912708 A CN 116912708A CN 202310894293 A CN202310894293 A CN 202310894293A CN 116912708 A CN116912708 A CN 116912708A
- Authority
- CN
- China
- Prior art keywords
- model
- building
- remote sensing
- image
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/176—Urban or other man-made structures
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A30/00—Adapting or protecting infrastructure or their operation
- Y02A30/60—Planning or developing urban green infrastructure
Abstract
The invention relates to a remote sensing image building extraction method based on deep learning, and belongs to the field of remote sensing images. The method comprises the following steps: first, based on a processed public data set, the DeepLabv3+ network model is improved by replacing its backbone network with the lightweight MobileNetV2; the ASPP module of the model is replaced with DenseASPP, which connects a set of dilated convolutions in a denser manner to produce multi-scale features covering a larger scale range. To address the model's insufficient ability to represent small-scale feature information, a channel attention mechanism is added after the DenseASPP module, and the performance of the model is improved by strengthening the channel features most relevant to small buildings. Considering that shallow features contain more of the original information, two shallow feature layers are fused in the decoding region, providing finer spatial information and enhancing robustness.
Description
Technical Field
The invention belongs to the field of remote sensing images, and relates to a remote sensing image building extraction method based on deep learning.
Background
Current remote sensing image information extraction is mainly performed at the pixel level. Because high-resolution images have rich detail and large data volume, traditional pixel-based methods are not well suited to processing high-resolution image data. At present there are two main approaches to extracting building features from high-resolution remote sensing images: 1) traditional machine-learning classifiers (such as support vector machines (SVM) and random forests) are used to extract building features, usually followed by post-processing steps to refine the segmentation results; 2) traditional computer-vision methods rely on hand-crafted features such as vegetation indices, texture, and color. However, these methods not only have high model complexity but are also limited by manual knowledge and experience. Convolutional neural networks (CNN) and fully convolutional networks (FCN), e.g. encoder-decoder architectures, have been successfully applied in this field and outperform traditional computer-vision methods. However, these methods downsample with pooling layers; although enlarging the receptive field of the convolution kernel allows global image features to be extracted, high-frequency details are lost at the same time, so boundary information is missing from the segmentation results, buildings are easily broken into scattered spots, and complete building boundaries are difficult to extract. Therefore, how to extract the required urban building information from massive data is both a difficulty and a hot spot of high-resolution remote sensing image interpretation. Moreover, research on building feature extraction from remote sensing images is very challenging, because building features such as color, shape, and size vary across regions and are often similar to the background or to other objects.
The identification and detection of buildings directly affect the automation level of ground-object mapping. Timely knowledge of urban building information is therefore of great significance for better urban development planning and digital city construction.
Computers were first used to analyze urban buildings in remote sensing images in the 1970s, and the main methods can be classified into texture-based, edge-based, and shadow-based approaches.
Stephen Levitt proposed a texture-based building detection method: since man-made structures and natural areas differ in texture, measuring texture allows building areas to be distinguished from other areas. Ye used amplitude-spectrum information to extract texture and edge features of multi-storey buildings and combined them to extract multi-storey building information, which has practical application value; however, this method cannot avoid the interference that the "same spectrum, different objects" phenomenon of high-resolution images causes to extraction accuracy. Zhang Hao et al. proposed a LiDAR point-cloud building extraction method based on gray-level co-occurrence matrix texture to extract buildings automatically. To reduce the amount of computation, the authors compress the gray levels when computing the gray-level co-occurrence matrix, which loses part of the texture and leads to misclassification. In general, texture-based extraction methods are effective for low- and medium-resolution images, but high-resolution images have complex textures, and extracting buildings with such methods is difficult. Lin and Nevatia proposed perceptual grouping theory, which first uses edge detection to extract building outlines, groups the extracted contours according to their spatial relationships, and finally searches for parallel lines to obtain rectangular outlines and thus the contour positions of buildings. Chen et al. proposed an edge regularity index and a shadow line index as new features of building candidates obtained from a segmentation method to refine the boundaries of detection results.
In recent years, deep learning has developed rapidly in the field of artificial intelligence, and research in many fields has gradually shifted from traditional methods to deep neural networks. Since convolutional neural networks were proposed, deep learning represented by CNNs has achieved remarkable results in image learning. As an emerging research hotspot, deep learning performs well in speech recognition, computer vision, natural language processing, and other fields. Compared with traditional neural networks, deep neural networks are greatly improved: layer-by-layer learning effectively reduces the difficulty of training on data, and learning a deep nonlinear network structure allows many complex functions to be approximated. Such networks can mine the features of large amounts of labeled data layer by layer and learn their essential characteristics, and they also show considerable learning ability on unlabeled data sets, so deep learning is widely applied to image classification, object detection, image segmentation, and related fields. In the field of image segmentation, semantic segmentation of high-altitude remote sensing images benefits urban road planning, geological exploration, national defense construction, and other applications. Bischke et al. introduced a new cascaded multi-task loss into a deep network structure and fused building boundary information, suppressing the "speckle" segmentation phenomenon, addressing the problem of preserving semantic-segmentation boundaries in high-resolution satellite images, and reaching a mean intersection-over-union of 73%. Zhou et al. proposed the D-LinkNet semantic segmentation network for road extraction from high-resolution satellite images; it adopts a LinkNet structure with dilated convolution layers in its central part, through which multi-scale information of high-level semantic features is learned and fused, and its mIoU in the CVPR 2018 DeepGlobe Road Extraction Challenge reached 64.66%. Liu Hao et al. improved the U-Net network, trained it with a loss function combining the Dice coefficient and the cross-entropy function, and obtained good results in extracting irregular buildings.
High-resolution remote sensing images provide detailed building information, and deep learning can better learn data features at high spatial resolution and extract building features efficiently. The DeepLab series of networks adds atrous (dilated) convolution to expand the receptive field and introduces the atrous spatial pyramid pooling (ASPP) layer, which uses different dilation rates to extract multi-scale image features and thus improves the accuracy of remote sensing image segmentation. However, when extracting buildings from remote sensing images, the DeepLab series still suffers from slow convergence, coarse edge extraction, blurred segmentation of small-scale targets, and holes in the segmentation of large-scale targets. How to select a suitable and efficient deep learning network for building extraction has therefore remained a focus of researchers.
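For illustration only (not part of the claimed method), the sketch below shows how an atrous (dilated) convolution of the kind used in the ASPP layer enlarges the receptive field without adding parameters; the channel count, feature-map size and dilation rate are arbitrary example values.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation rate 6, as used in DeepLab-style ASPP branches.
# With dilation d, the 3x3 kernel spans an effective extent of 2*d + 1 pixels per side
# without adding parameters; padding = d keeps the spatial size unchanged.
atrous = nn.Conv2d(in_channels=256, out_channels=256,
                   kernel_size=3, padding=6, dilation=6, bias=False)

x = torch.randn(1, 256, 64, 64)   # an example backbone feature map
y = atrous(x)
print(y.shape)                    # torch.Size([1, 256, 64, 64])
```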
Disclosure of Invention
In view of the above, the present invention aims to provide a remote sensing image building extraction method based on deep learning.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a remote sensing image building extraction method based on deep learning comprises the following steps:
s1: selecting the public remote sensing building data set Inria Aerial Image Labeling Dataset as original data, and performing data preprocessing steps including cropping and data enhancement;
s2: improving the traditional DeepLabv3+ model: the backbone network is replaced with the lightweight MobileNetV2, which effectively reduces the number of parameters and improves model speed; the main ASPP module of the model is replaced with DenseASPP, connecting a set of dilated convolutions in a denser manner; an ECA attention module is added, which directly generates a channel attention map through two 1×1 convolution layers, recalibrates the channel weights of the feature map, selects the more important feature channels, and improves the model's ability to extract small-scale information, yielding the DAEC_DeepLabv3+ network;
s3: inputting the processed remote sensing image data set into the DAEC_DeepLabv3+ network for training to obtain a trained building detection model;
s4: running the trained building detection model on the test set of remote sensing images to obtain remote sensing building image segmentation results.
Optionally, the S1 specifically includes:
s11: downloading the required remote sensing building data set from the data set website; the Inria Aerial Image Labeling Dataset comprises 180 color images of 5000×5000 pixels and 180 corresponding binary gray-scale label images of 5000×5000 pixels, and the test set comprises a further 180 color images of 5000×5000 pixels; the data set covers urban residential areas in Europe and the United States, spanning the five regions of Austin, Chicago, Kitsap, Tyrol and Vienna, with 36 images per region; all labeled images divide pixels into two classes, building and non-building, where the pixel value of building areas is 255 and that of non-building areas is 0; the images in the data set are cropped into 500×500-pixel tiles, and the data set is enhanced by rotation, mirroring or brightness transformations;
s12: randomly dividing the data in the data set into training data, verification data and test data at a ratio of 7:2:1, and storing the split file lists in the sub-files train.txt, val.txt and test.txt.
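A minimal sketch of this preprocessing in Python (using Pillow) is given below; the directory layout, tile naming and exact augmentation parameters are illustrative assumptions rather than values specified above.

```python
import random
from pathlib import Path
from PIL import Image, ImageEnhance, ImageOps

TILE = 500  # tile size in pixels, as described in s11

def cut_into_tiles(img_path: Path, out_dir: Path):
    """Cut one 5000x5000 image into 100 non-overlapping 500x500 tiles."""
    img = Image.open(img_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    for r in range(0, img.height, TILE):
        for c in range(0, img.width, TILE):
            tile = img.crop((c, r, c + TILE, r + TILE))
            tile.save(out_dir / f"{img_path.stem}_{r}_{c}.png")

def augment(tile):
    """Random rotation / mirroring / brightness change, the enhancements listed in s11."""
    if random.random() < 0.5:
        tile = tile.rotate(random.choice([90, 180, 270]))
    if random.random() < 0.5:
        tile = ImageOps.mirror(tile)
    if random.random() < 0.5:
        tile = ImageEnhance.Brightness(tile).enhance(random.uniform(0.8, 1.2))
    return tile

def split_7_2_1(tile_paths, out_dir: Path):
    """Randomly split tile paths into train/val/test lists at 7:2:1 and write them (s12)."""
    random.shuffle(tile_paths)
    n = len(tile_paths)
    splits = {"train.txt": tile_paths[:int(0.7 * n)],
              "val.txt":   tile_paths[int(0.7 * n):int(0.9 * n)],
              "test.txt":  tile_paths[int(0.9 * n):]}
    for name, paths in splits.items():
        (out_dir / name).write_text("\n".join(str(p) for p in paths))
```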
Optionally, the S2 specifically includes:
s21: an improved DAEC_DeepLabv3+ network model is built, with the lightweight network MobileNetV2 as the backbone network;
s22: in the encoding region (Encoder), dense atrous spatial pyramid pooling (DenseASPP) replaces the original ASPP module, applying the dense-connection idea of DenseNet to ASPP; several atrous convolution layers are cascaded, and the output of each atrous convolution layer is passed in a densely connected manner to all subsequent atrous convolution layers that it has not yet reached;
DenseASPP is expressed by formula (1):

y_l = H_{K,d_l}([y_{l-1}, y_{l-2}, ..., y_0])   (1)

where d_l denotes the dilation rate of layer l, [·] denotes the concatenation operation, and [y_{l-1}, ..., y_0] denotes the feature formed by concatenating the outputs of all preceding layers;
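One possible PyTorch sketch of the densely connected atrous branches of formula (1) is shown below; the dilation rates (3, 6, 12, 18) follow the description in this document, while the input channel width (assumed 320 for the MobileNetV2 high-level feature), the per-branch width and the BatchNorm/ReLU placement are assumptions.

```python
import torch
import torch.nn as nn

class DenseASPP(nn.Module):
    """Sketch of densely connected atrous branches implementing formula (1)."""
    def __init__(self, in_ch=320, branch_ch=64, rates=(3, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        ch = in_ch
        for d in rates:
            self.branches.append(nn.Sequential(
                nn.Conv2d(ch, branch_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True)))
            ch += branch_ch  # each later branch also sees all earlier branch outputs

    def forward(self, x):
        feats = [x]
        for branch in self.branches:
            y = branch(torch.cat(feats, dim=1))  # y_l = H_{K,d_l}([y_{l-1}, ..., y_0])
            feats.append(y)
        return torch.cat(feats, dim=1)           # stacked (thickened) multi-scale feature
```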
s23: after feature extraction is completed by DenseASPP, the features are stacked and thickened, and an ECA module is added; a 1×1 convolution layer is used directly after the global average pooling layer in place of a fully connected layer, the channel weights of the feature map are recalibrated, more important feature channels are selected, and the model's ability to extract small-scale information is improved; the module efficiently realizes local cross-channel interaction with a 1-dimensional convolution and extracts the dependencies among channels; the specific steps are as follows: first a global average pooling operation is applied to the input feature map, then a 1-dimensional convolution with kernel size k is applied, and the weight w of each channel is obtained through a Sigmoid activation function, as shown in formula (2):
ω = σ(C1D_k(y))   (2)
the weights are multiplied element-wise with the corresponding channels of the original input feature map to obtain the final output feature map, and a 1×1 convolution is then used to adjust the number of channels of the deep features carrying higher-level semantic information; here C1D denotes a one-dimensional convolution;
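A possible PyTorch sketch of this ECA step (global average pooling, 1-dimensional convolution with kernel size k, Sigmoid, channel reweighting, formula (2)) follows; the default k = 3 is an assumption, not a value given above.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of the channel attention step in formula (2)."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                       # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                  # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1))           # 1-D convolution across channels -> (B, 1, C)
        w = self.sigmoid(y).squeeze(1)          # channel weights w -> (B, C)
        return x * w.view(x.size(0), -1, 1, 1)  # recalibrate each channel of the input
```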
s24: in the decoding region (Decoder), the 4th and 7th shallow feature layers are extracted from the backbone network and a multi-scale feature fusion (MSFF) operation is performed; the result is added to the deep feature layer that has been upsampled by a factor of 2, and the feature-map size is then recovered through another 2× upsampling to realize semantic segmentation; the result is then concatenated with the original shallow features obtained from the backbone network to increase the number of channels, feature extraction is carried out with a 3×3 convolution, and finally the output image is adjusted to the same size as the input image; the MSFF operation fuses the multi-scale features of the two feature layers.
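The decoder fusion of s24 could be sketched as follows; the channel numbers of the backbone's 4th/7th feature layers and of the DenseASPP+ECA output are assumptions, and since the text mentions both addition and concatenation, this sketch adds the two projected shallow layers and then concatenates them with the upsampled deep feature before the 3×3 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the s24 decoding branch with a simple MSFF of two shallow layers."""
    def __init__(self, shallow4_ch=24, shallow7_ch=32, deep_ch=256, mid_ch=48, n_cls=2):
        super().__init__()
        self.proj4 = nn.Conv2d(shallow4_ch, mid_ch, 1, bias=False)  # project both shallow
        self.proj7 = nn.Conv2d(shallow7_ch, mid_ch, 1, bias=False)  # layers to one width
        self.fuse = nn.Sequential(
            nn.Conv2d(mid_ch + deep_ch, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(256, n_cls, 1)

    def forward(self, shallow4, shallow7, deep, out_size):
        # MSFF: bring the two shallow layers to a common scale and add them
        s = self.proj4(shallow4) + F.interpolate(self.proj7(shallow7),
                                                 size=shallow4.shape[2:],
                                                 mode="bilinear", align_corners=False)
        # upsample the deep (DenseASPP + ECA) feature to the shallow scale and fuse
        d = F.interpolate(deep, size=s.shape[2:], mode="bilinear", align_corners=False)
        x = self.fuse(torch.cat([s, d], dim=1))
        # restore the prediction to the input image size
        return F.interpolate(self.classifier(x), size=out_size,
                             mode="bilinear", align_corners=False)
```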
Optionally, the step S3 specifically includes:
training is carried out in the PyTorch deep learning framework using the PyCharm programming environment; the processed remote sensing image data set is input into the DAEC_DeepLabv3+ network for training, a trained building detection model is obtained, and pre-trained model parameters are saved for different segmentation objects.
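A minimal training-loop sketch for S3 is given below; the batch size, learning rate, optimizer, epoch count and output file name shown are assumptions and not values stated here.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=100, lr=1e-3, device="cuda"):
    """Train the improved network on the preprocessed tiles and save its weights."""
    loader = DataLoader(train_set, batch_size=8, shuffle=True, num_workers=4)
    model = model.to(device)
    criterion = torch.nn.CrossEntropyLoss()                # cross-entropy loss, as in s42
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for images, masks in loader:                       # masks: (B, H, W) with 0/1 labels
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)         # logits: (B, 2, H, W)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: loss = {running / len(loader):.4f}")
    torch.save(model.state_dict(), "daec_deeplabv3plus_building.pth")
```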
Optionally, the step S4 specifically includes the following steps:
s41: the test images in the test set are input into the trained improved DeepLabv3+ network model, the segmentation object (building or background) is selected, the corresponding segmentation result is output, and the model's result images are saved;
s42: cross entropy is selected as the loss of the algorithm, and the mean pixel accuracy (mPA) and mean intersection over union (mIoU) are used as evaluation indexes; the training result of the model is evaluated from two perspectives, the proportion of correctly predicted pixels within the union of the predicted and ground-truth pixels and the proportion of correctly predicted pixels among all pixels, where a higher mIoU value indicates a better image segmentation effect; the cross-entropy formula is:
Loss = −(1/N) · Σ_{i=1}^{N} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]

where y_i is the true value of a given pixel, taking 0 or 1 in the two-class task; ŷ_i is the predicted value of that pixel; and N is the number of samples in each loss calculation;
the calculation formulas of the mean pixel accuracy mPA and the mean intersection over union mIoU over the two classes are respectively:

mPA = (1/2) · [ TP/(TP + FN) + TN/(TN + FP) ]

mIoU = (1/2) · [ TP/(TP + FP + FN) + TN/(TN + FN + FP) ]
where TP means the model prediction is correct, i.e. both the prediction and the ground truth are positive examples; FP means the model prediction is wrong, i.e. the predicted class is positive while the actual class is negative; FN means the model prediction is wrong, i.e. the predicted class is negative while the actual class is positive; TN means the model prediction is correct, i.e. both the prediction and the ground truth are negative examples.
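For reference, the two evaluation indexes of s42 can be computed from the TP, FP, FN and TN counts as in the sketch below; it assumes binary prediction and ground-truth masks with values 0 (background) and 1 (building) and does not guard against empty classes.

```python
import numpy as np

def evaluate(pred, target):
    """Mean pixel accuracy and mean IoU over the two classes from 0/1 masks."""
    tp = np.sum((pred == 1) & (target == 1))
    fp = np.sum((pred == 1) & (target == 0))
    fn = np.sum((pred == 0) & (target == 1))
    tn = np.sum((pred == 0) & (target == 0))
    mpa  = 0.5 * (tp / (tp + fn) + tn / (tn + fp))            # mean pixel accuracy mPA
    miou = 0.5 * (tp / (tp + fp + fn) + tn / (tn + fn + fp))  # mean intersection over union mIoU
    return mpa, miou
```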
The invention has the beneficial effects that:
First, the backbone network is changed to the lightweight MobileNetV2, which greatly reduces the number of parameters and the computational cost of the model.
Secondly, the original ASPP module of the model is replaced with DenseASPP: convolution layers with four different dilation rates (3, 6, 12, 18) are connected in the DenseNet manner to form a dense feature pyramid in which each layer fuses in parallel several different-scale features from the preceding layers; this produces fused multi-scale features with large receptive fields, and the large receptive field provides more context for segmenting large targets in high-resolution images.
Thirdly, the ECA attention module directly generates a channel attention map through two 1×1 convolution layers, which avoids conventional attention matrix multiplication, effectively improves computational efficiency, improves the model's ability to extract small-scale information, raises the mIoU index of the segmented remote sensing images, and yields higher segmentation accuracy.
Fourth, considering that shallow features contain more of the original information, feature fusion is performed on two shallow feature layers in the decoding region, retaining richer spatial information and enhancing robustness. In addition, the method is easy to understand as a whole, easy to operate, and applicable to semantic segmentation of other complex remote sensing images.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the preferred embodiments and the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a remote sensing image building extraction model based on the improved DAEC_DeepLabv3+ according to a preferred embodiment of the present invention;
FIG. 2 is a DenseASPP module in a coding network;
FIG. 3 is an attention mechanism ECA added to the coding network;
fig. 4 is a multi-scale feature fusion module MSFF.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other different embodiments, and the details of this description may be modified or varied in various ways without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention, and the following embodiments and the features in the embodiments may be combined with each other provided there is no conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
Referring to figs. 1 to 4, S1 covers the input data source and data preprocessing of the present model. The Inria Aerial Image Labeling Dataset comprises 180 color images of 5000×5000 pixels and 180 corresponding binary gray-scale label images of 5000×5000 pixels, and the test set comprises a further 180 color images of 5000×5000 pixels. All annotated images divide pixels into two classes, building and non-building, with a pixel value of 255 (white) for building areas and 0 (black) for non-building areas. The images in the dataset are cropped into 500×500-pixel tiles, and the dataset is enhanced by geometric and non-geometric means such as rotation, mirroring and brightness transformations. After image enhancement, the whole dataset is divided into training, validation and test sets at a ratio of 7:2:1.
S2 constructs the improved DAEC_DeepLabv3+ network model. The backbone network is replaced with the lightweight MobileNetV2, which effectively reduces the number of parameters and improves model speed. In the encoding region (Encoder), the original ASPP module is replaced with DenseASPP, applying the dense-connection idea of DenseNet to ASPP. ASPP can be expressed by formula (1):
y = H_{3,6}(x) + H_{3,12}(x) + H_{3,18}(x) + H_{3,24}(x)   (1)

where H_{K,d}(x) denotes an atrous convolution with kernel size K and dilation rate d, and y denotes the fused feature.
In DenseASPP, the output of each atrous convolution layer is passed in a densely connected manner to all subsequent atrous convolution layers that it has not yet reached, making full use of reasonable dilation rates. Through this series of feature connections, the neurons of every intermediate feature encode semantic information at different scales, and different intermediate features cover different scale ranges. The final DenseASPP output therefore contains receptive fields of more and larger scales, and a denser and larger feature pyramid can be generated with only a few atrous convolution layers.
DenseASPP can be expressed by formula (2):

y_l = H_{K,d_l}([y_{l-1}, y_{l-2}, ..., y_0])   (2)

where d_l denotes the dilation rate of layer l, [·] denotes the concatenation operation, and [y_{l-1}, ..., y_0] denotes the feature produced by concatenating the outputs of all preceding layers. Compared with ASPP, DenseASPP stacks all atrous convolution layers in a densely connected manner, generating a denser feature pyramid and a larger receptive field. After feature extraction is completed by DenseASPP, the features are stacked and thickened, and an efficient channel attention (ECA) module is added, which avoids dimensionality reduction and effectively captures cross-channel interaction information. The module realizes local cross-channel interaction with a 1-dimensional convolution and extracts the dependencies among channels. The specific steps are as follows: first a global average pooling operation is applied to the input feature map, then a 1-dimensional convolution with kernel size k is applied, and the weight w of each channel is obtained through a Sigmoid activation function, as shown in formula (3):

ω = σ(C1D_k(y))   (3)

The weights are multiplied element-wise with the corresponding channels of the original input feature map to obtain the final output feature map, and a 1×1 convolution is then used to adjust the number of channels of the deep features carrying higher-level semantic information. In the decoding region (Decoder), the 4th and 7th shallow feature layers are extracted from the backbone network, the MSFF operation is performed, the result is added to the deep feature layer that has been upsampled by a factor of 2, and the feature-map size is then recovered through another 2× upsampling to realize semantic segmentation. The result is then concatenated (concat) with the original shallow features obtained from the backbone network to increase the number of channels, feature extraction is performed with a 3×3 convolution, and finally the output image is adjusted to the same size as the input image. The MSFF operation fuses the multi-scale features of the two feature layers.
Further, S3 is training in the PyTorch deep learning framework using the PyCharm programming environment; as an embodiment, the PyCharm professional edition is used and the network is trained with the deep learning framework PyTorch 1.11.0. The processed remote sensing image data set is input into the DAEC_DeepLabv3+ network for training, a trained building detection model is obtained, and pre-trained model parameters are saved for different segmentation objects.
Further, in S4 the test images in the test set are input into the trained improved DAEC_DeepLabv3+ network model, the segmentation object (building or background) is selected, the corresponding segmentation result is output, and the model's result images are saved. During network training, cross entropy is selected as the loss of the algorithm, the mean pixel accuracy (mPA) and mean intersection over union (mIoU) are used as evaluation indexes, and the training result of the model is evaluated from two perspectives, the proportion of correctly predicted pixels within the union of the predicted and ground-truth pixels and the proportion of correctly predicted pixels among all pixels; a higher mIoU value indicates a better image segmentation effect. The cross-entropy formula is:
Loss = −(1/N) · Σ_{i=1}^{N} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]

where y_i is the true value of a given pixel, taking 0 or 1 in the two-class task; ŷ_i is the predicted value of that pixel; and N is the number of samples in each loss calculation.
The calculation formulas of the mean pixel accuracy mPA and the mean intersection over union mIoU over the two classes are respectively:

mPA = (1/2) · [ TP/(TP + FN) + TN/(TN + FP) ]

mIoU = (1/2) · [ TP/(TP + FP + FN) + TN/(TN + FN + FP) ]
In the above formulas, TP means the model prediction is correct, i.e. both the prediction and the ground truth are positive examples; FP means the model prediction is wrong, i.e. the predicted class is positive while the actual class is negative; FN means the model prediction is wrong, i.e. the predicted class is negative while the actual class is positive; TN means the model prediction is correct, i.e. both the prediction and the ground truth are negative examples.
Experimental results show that the method, based on the improved DAEC_DeepLabv3+ network model, has the advantages of fewer training parameters, higher segmentation accuracy, finer edge-information extraction, and effective alleviation of problems such as holes in the segmentation of large-scale targets; in addition, the method has good robustness, is easy to understand as a whole and easy to operate, and can be applied to semantic segmentation of other complex remote sensing images.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.
Claims (5)
1. A remote sensing image building extraction method based on deep learning, characterized by comprising the following steps:
s1: selecting the public remote sensing building data set Inria Aerial Image Labeling Dataset as original data, and performing data preprocessing steps including cropping and data enhancement;
s2: improving the traditional DeepLabv3+ model: the backbone network is replaced with the lightweight MobileNetV2, which effectively reduces the number of parameters and improves model speed; the main ASPP module of the model is replaced with DenseASPP, connecting a set of dilated convolutions in a denser manner; an ECA attention module is added, which directly generates a channel attention map through two 1×1 convolution layers, recalibrates the channel weights of the feature map, selects the more important feature channels, and improves the model's ability to extract small-scale information, yielding the DAEC_DeepLabv3+ network;
s3: inputting the processed remote sensing image data set into the DAEC_DeepLabv3+ network for training to obtain a trained building detection model;
s4: running the trained building detection model on the test set of remote sensing images to obtain remote sensing building image segmentation results.
2. The remote sensing image building extraction method based on deep learning as claimed in claim 1, wherein S1 specifically comprises the following steps:
s11: downloading the required remote sensing building data set from the data set website; the Inria Aerial Image Labeling Dataset comprises 180 color images of 5000×5000 pixels and 180 corresponding binary gray-scale label images of 5000×5000 pixels, and the test set comprises a further 180 color images of 5000×5000 pixels; the data set covers urban residential areas in Europe and the United States, spanning the five regions of Austin, Chicago, Kitsap, Tyrol and Vienna, with 36 images per region; all labeled images divide pixels into two classes, building and non-building, where the pixel value of building areas is 255 and that of non-building areas is 0; the images in the data set are cropped into 500×500-pixel tiles, and the data set is enhanced by rotation, mirroring or brightness transformations;
s12: randomly dividing the data in the data set into training data, verification data and test data at a ratio of 7:2:1, and storing the split file lists in the sub-files train.txt, val.txt and test.txt.
3. The remote sensing image building extraction method based on deep learning as claimed in claim 2, wherein S2 specifically comprises the following steps:
s21: an improved DAEC_DeepLabv3+ network model is built, with the lightweight network MobileNetV2 as the backbone network;
s22: in the encoding region (Encoder), dense atrous spatial pyramid pooling (DenseASPP) replaces the original ASPP module, applying the dense-connection idea of DenseNet to ASPP; several atrous convolution layers are cascaded, and the output of each atrous convolution layer is passed in a densely connected manner to all subsequent atrous convolution layers that it has not yet reached;
DenseASPP is expressed by formula (1):

y_l = H_{K,d_l}([y_{l-1}, y_{l-2}, ..., y_0])   (1)

where d_l denotes the dilation rate of layer l, [·] denotes the concatenation operation, and [y_{l-1}, ..., y_0] denotes the feature formed by concatenating the outputs of all preceding layers;
s23: after feature extraction is completed by DenseASPP, the features are stacked and thickened, and an ECA module is added; a 1×1 convolution layer is used directly after the global average pooling layer in place of a fully connected layer, the channel weights of the feature map are recalibrated, more important feature channels are selected, and the model's ability to extract small-scale information is improved; the module efficiently realizes local cross-channel interaction with a 1-dimensional convolution and extracts the dependencies among channels; the specific steps are as follows: first a global average pooling operation is applied to the input feature map, then a 1-dimensional convolution with kernel size k is applied, and the weight w of each channel is obtained through a Sigmoid activation function, as shown in formula (2):
ω = σ(C1D_k(y))   (2)
the weights are multiplied element-wise with the corresponding channels of the original input feature map to obtain the final output feature map, and a 1×1 convolution is then used to adjust the number of channels of the deep features carrying higher-level semantic information; here C1D denotes a one-dimensional convolution;
s24: in the decoding region (Decoder), the 4th and 7th shallow feature layers are extracted from the backbone network and a multi-scale feature fusion (MSFF) operation is performed; the result is added to the deep feature layer that has been upsampled by a factor of 2, and the feature-map size is then recovered through another 2× upsampling to realize semantic segmentation; the result is then concatenated with the original shallow features obtained from the backbone network to increase the number of channels, feature extraction is carried out with a 3×3 convolution, and finally the output image is adjusted to the same size as the input image; the MSFF operation fuses the multi-scale features of the two feature layers.
4. The remote sensing image building extraction method based on deep learning as claimed in claim 3, wherein S3 specifically comprises the following steps:
training is carried out in the PyTorch deep learning framework using the PyCharm programming environment; the processed remote sensing image data set is input into the DAEC_DeepLabv3+ network for training, a trained building detection model is obtained, and pre-trained model parameters are saved for different segmentation objects.
5. The remote sensing image building extraction method based on deep learning as claimed in claim 4, wherein S4 specifically comprises the following steps:
s41: the test images in the test set are input into the trained improved DeepLabv3+ network model, the segmentation object (building or background) is selected, the corresponding segmentation result is output, and the model's result images are saved;
s42: cross entropy is selected as the loss of the algorithm, and the mean pixel accuracy (mPA) and mean intersection over union (mIoU) are used as evaluation indexes; the training result of the model is evaluated from two perspectives, the proportion of correctly predicted pixels within the union of the predicted and ground-truth pixels and the proportion of correctly predicted pixels among all pixels, where a higher mIoU value indicates a better image segmentation effect; the cross-entropy formula is:
Loss = −(1/N) · Σ_{i=1}^{N} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]

where y_i is the true value of a given pixel, taking 0 or 1 in the two-class task; ŷ_i is the predicted value of that pixel; and N is the number of samples in each loss calculation;
the calculation formulas of the mean pixel accuracy mPA and the mean intersection over union mIoU over the two classes are respectively:

mPA = (1/2) · [ TP/(TP + FN) + TN/(TN + FP) ]

mIoU = (1/2) · [ TP/(TP + FP + FN) + TN/(TN + FN + FP) ]
where TP means the model prediction is correct, i.e. both the prediction and the ground truth are positive examples; FP means the model prediction is wrong, i.e. the predicted class is positive while the actual class is negative; FN means the model prediction is wrong, i.e. the predicted class is negative while the actual class is positive; TN means the model prediction is correct, i.e. both the prediction and the ground truth are negative examples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310894293.1A CN116912708A (en) | 2023-07-20 | 2023-07-20 | Remote sensing image building extraction method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310894293.1A CN116912708A (en) | 2023-07-20 | 2023-07-20 | Remote sensing image building extraction method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116912708A true CN116912708A (en) | 2023-10-20 |
Family
ID=88362558
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310894293.1A Pending CN116912708A (en) | 2023-07-20 | 2023-07-20 | Remote sensing image building extraction method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116912708A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117152546A (en) * | 2023-10-31 | 2023-12-01 | 江西师范大学 | Remote sensing scene classification method, system, storage medium and electronic equipment |
CN117237644A (en) * | 2023-11-10 | 2023-12-15 | 广东工业大学 | Forest residual fire detection method and system based on infrared small target detection |
CN117496563A (en) * | 2024-01-03 | 2024-02-02 | 脉得智能科技(无锡)有限公司 | Carotid plaque vulnerability grading method and device, electronic equipment and storage medium |
CN117975195A (en) * | 2024-01-17 | 2024-05-03 | 山东建筑大学 | Texture information guiding-based high-resolution farmland pseudo-sample controllable generation method |
CN118366045A (en) * | 2024-06-20 | 2024-07-19 | 福建师范大学 | Remote sensing image building extraction method and device based on improved DeepLabV + |
-
2023
- 2023-07-20 CN CN202310894293.1A patent/CN116912708A/en active Pending
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117152546A (en) * | 2023-10-31 | 2023-12-01 | 江西师范大学 | Remote sensing scene classification method, system, storage medium and electronic equipment |
CN117152546B (en) * | 2023-10-31 | 2024-01-26 | 江西师范大学 | Remote sensing scene classification method, system, storage medium and electronic equipment |
CN117237644A (en) * | 2023-11-10 | 2023-12-15 | 广东工业大学 | Forest residual fire detection method and system based on infrared small target detection |
CN117237644B (en) * | 2023-11-10 | 2024-02-13 | 广东工业大学 | Forest residual fire detection method and system based on infrared small target detection |
CN117496563A (en) * | 2024-01-03 | 2024-02-02 | 脉得智能科技(无锡)有限公司 | Carotid plaque vulnerability grading method and device, electronic equipment and storage medium |
CN117496563B (en) * | 2024-01-03 | 2024-03-19 | 脉得智能科技(无锡)有限公司 | Carotid plaque vulnerability grading method and device, electronic equipment and storage medium |
CN117975195A (en) * | 2024-01-17 | 2024-05-03 | 山东建筑大学 | Texture information guiding-based high-resolution farmland pseudo-sample controllable generation method |
CN117975195B (en) * | 2024-01-17 | 2024-09-06 | 山东建筑大学 | Texture information guiding-based high-resolution farmland pseudo-sample controllable generation method |
CN118366045A (en) * | 2024-06-20 | 2024-07-19 | 福建师范大学 | Remote sensing image building extraction method and device based on improved DeepLabV + |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106909924B (en) | Remote sensing image rapid retrieval method based on depth significance | |
CN116912708A (en) | Remote sensing image building extraction method based on deep learning | |
CN111797779A (en) | Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion | |
CN110298037A (en) | The matched text recognition method of convolutional neural networks based on enhancing attention mechanism | |
Yuan et al. | Exploring multi-level attention and semantic relationship for remote sensing image captioning | |
CN113657450B (en) | Attention mechanism-based land battlefield image-text cross-modal retrieval method and system | |
CN110826638A (en) | Zero sample image classification model based on repeated attention network and method thereof | |
CN113034506B (en) | Remote sensing image semantic segmentation method and device, computer equipment and storage medium | |
CN115438215B (en) | Image-text bidirectional search and matching model training method, device, equipment and medium | |
CN117036936A (en) | Land coverage classification method, equipment and storage medium for high-resolution remote sensing image | |
CN113743417A (en) | Semantic segmentation method and semantic segmentation device | |
CN115222998B (en) | Image classification method | |
CN117351363A (en) | Remote sensing image building extraction method based on transducer | |
CN114510594A (en) | Traditional pattern subgraph retrieval method based on self-attention mechanism | |
CN115527113A (en) | Bare land classification method and device for remote sensing image | |
CN117593666B (en) | Geomagnetic station data prediction method and system for aurora image | |
Hu et al. | Saliency-based YOLO for single target detection | |
Cheng et al. | A survey on image semantic segmentation using deep learning techniques | |
CN117953299A (en) | Land utilization classification method based on multi-scale remote sensing images | |
Li et al. | HRVQA: A Visual Question Answering benchmark for high-resolution aerial images | |
CN117058437B (en) | Flower classification method, system, equipment and medium based on knowledge distillation | |
CN112925983A (en) | Recommendation method and system for power grid information | |
CN114511787A (en) | Neural network-based remote sensing image ground feature information generation method and system | |
CN116844039A (en) | Multi-attention-combined trans-scale remote sensing image cultivated land extraction method | |
Pang et al. | PTRSegNet: A Patch-to-Region Bottom-Up Pyramid Framework for the Semantic Segmentation of Large-Format Remote Sensing Images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |