CN116912708A - Remote sensing image building extraction method based on deep learning - Google Patents
Remote sensing image building extraction method based on deep learning
- Publication number
- CN116912708A CN116912708A CN202310894293.1A CN202310894293A CN116912708A CN 116912708 A CN116912708 A CN 116912708A CN 202310894293 A CN202310894293 A CN 202310894293A CN 116912708 A CN116912708 A CN 116912708A
- Authority
- CN
- China
- Prior art keywords
- model
- building
- remote sensing
- image
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/13—Satellite images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/176—Urban or other man-made structures
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A30/00—Adapting or protecting infrastructure or their operation
- Y02A30/60—Planning or developing urban green infrastructure
Abstract
The invention relates to a remote sensing image building extraction method based on deep learning, and belongs to the field of remote sensing images. The method comprises the following steps: first, based on a processed public data set, the DeepLabv3+ network model is improved by replacing its backbone network with the lightweight MobileNetV2; the ASPP module of the model is replaced with DenseASPP, which connects a set of dilated convolutions in a denser manner to produce multi-scale features covering a larger scale range. To address the model's insufficient ability to represent small-scale feature information, a channel attention mechanism is added after the DenseASPP module, and the performance of the model is improved by strengthening the channel features most relevant to small buildings. Considering that shallow features contain more of the original information, two shallow feature layers are fused in the decoding region, providing finer spatial information and enhancing robustness.
Description
Technical Field
The invention belongs to the field of remote sensing images, and relates to a remote sensing image building extraction method based on deep learning.
Background
Current remote sensing image information extraction is mainly performed at the pixel level. Because high-resolution images have rich detail and large data volume, traditional pixel-based methods are not well suited to processing high-resolution image data. At present there are two main approaches to extracting building features from high-resolution remote sensing images: 1) traditional machine-learning classifiers (such as support vector machines (SVM) and random forests) are used to extract building features, usually followed by post-processing steps to refine the segmentation results; 2) traditional computer-vision methods rely on hand-crafted features such as vegetation indices, texture, and color. However, these methods not only have high model complexity but are also limited by manual knowledge and experience. Convolutional neural networks (CNN) and fully convolutional networks (FCN), e.g. encoder-decoder architectures, have been successfully applied in this field and outperform traditional computer-vision methods. However, these methods downsample with pooling layers; although enlarging the receptive field of the convolution kernel allows global image features to be extracted, high-frequency details are lost at the same time, so boundary information is missing from the segmentation results, buildings are easily broken into scattered spots, and complete building boundaries are difficult to extract. Therefore, how to extract the required urban building information from massive data is both a difficulty and a hot spot of high-resolution remote sensing image interpretation. Moreover, research on building feature extraction from remote sensing images is very challenging, because building features such as color, shape, and size vary across regions and are often similar to the background or to other objects.
The identification and detection of buildings directly affect the automation level of ground-object mapping. Timely knowledge of urban building information is therefore of great significance for better urban development planning and digital city construction.
Computers were first used to analyze urban buildings in remote sensing images in the 1970s, and the main methods can be classified into texture-based, edge-based, and shadow-based approaches.
Stephen Levitt proposed a texture-based building detection method: since man-made structures and natural areas differ in texture, measuring texture allows building areas to be distinguished from other areas. Ye used amplitude-spectrum information to extract texture and edge features of multi-storey buildings and combined them to extract multi-storey building information, which has practical application value; however, this method cannot avoid the interference that the "same spectrum, different objects" phenomenon of high-resolution images causes to extraction accuracy. Zhang Hao et al. proposed a LiDAR point-cloud building extraction method based on gray-level co-occurrence matrix texture to extract buildings automatically. To reduce the amount of computation, the authors compress the gray levels when computing the gray-level co-occurrence matrix, which loses part of the texture and leads to misclassification. In general, texture-based extraction methods are effective for low- and medium-resolution images, but high-resolution images have complex textures, and extracting buildings with such methods is difficult. Lin and Nevatia proposed perceptual grouping theory, which first uses edge detection to extract building outlines, groups the extracted contours according to their spatial relationships, and finally searches for parallel lines to obtain rectangular outlines and thus the contour positions of buildings. Chen et al. proposed an edge regularity index and a shadow line index as new features of building candidates obtained from a segmentation method to refine the boundaries of detection results.
In recent years, deep learning has developed rapidly in the field of artificial intelligence, and research in many fields has gradually shifted from traditional methods to deep neural networks. Since convolutional neural networks were proposed, deep learning represented by CNNs has achieved remarkable results in image learning. As an emerging research hotspot, deep learning performs well in speech recognition, computer vision, natural language processing, and other fields. Compared with traditional neural networks, deep neural networks are greatly improved: layer-by-layer learning effectively reduces the difficulty of training on data, and learning a deep nonlinear network structure allows many complex functions to be approximated. Such networks can mine the features of large amounts of labeled data layer by layer and learn their essential characteristics, and they also show considerable learning ability on unlabeled data sets, so deep learning is widely applied to image classification, object detection, image segmentation, and related fields. In the field of image segmentation, semantic segmentation of high-altitude remote sensing images benefits urban road planning, geological exploration, national defense construction, and other applications. Bischke et al. introduced a new cascaded multi-task loss into a deep network structure and fused building boundary information, suppressing the "speckle" segmentation phenomenon, addressing the problem of preserving semantic-segmentation boundaries in high-resolution satellite images, and reaching a mean intersection-over-union of 73%. Zhou et al. proposed the D-LinkNet semantic segmentation network for road extraction from high-resolution satellite images; it adopts a LinkNet structure with dilated convolution layers in its central part, through which multi-scale information of high-level semantic features is learned and fused, and its mIoU in the CVPR 2018 DeepGlobe Road Extraction Challenge reached 64.66%. Liu Hao et al. improved the U-Net network, trained it with a loss function combining the Dice coefficient and the cross-entropy function, and obtained good results in extracting irregular buildings.
High-resolution remote sensing images provide detailed building information, and deep learning can better learn data features at high spatial resolution and extract building features efficiently. The DeepLab series of networks adds atrous (dilated) convolution to expand the receptive field and introduces the atrous spatial pyramid pooling (ASPP) layer, which uses different dilation rates to extract multi-scale image features and thus improves the accuracy of remote sensing image segmentation. However, when extracting buildings from remote sensing images, the DeepLab series still suffers from slow convergence, coarse edge extraction, blurred segmentation of small-scale targets, and holes in the segmentation of large-scale targets. How to select a suitable and efficient deep learning network for building extraction has therefore remained a focus of researchers.
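For illustration only (not part of the claimed method), the sketch below shows how an atrous (dilated) convolution of the kind used in the ASPP layer enlarges the receptive field without adding parameters; the channel count, feature-map size and dilation rate are arbitrary example values.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation rate 6, as used in DeepLab-style ASPP branches.
# With dilation d, the 3x3 kernel spans an effective extent of 2*d + 1 pixels per side
# without adding parameters; padding = d keeps the spatial size unchanged.
atrous = nn.Conv2d(in_channels=256, out_channels=256,
                   kernel_size=3, padding=6, dilation=6, bias=False)

x = torch.randn(1, 256, 64, 64)   # an example backbone feature map
y = atrous(x)
print(y.shape)                    # torch.Size([1, 256, 64, 64])
```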
Disclosure of Invention
In view of the above, the present invention aims to provide a remote sensing image building extraction method based on deep learning.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a remote sensing image building extraction method based on deep learning comprises the following steps:
s1: selecting the public remote sensing building data set Inria Aerial Image Labeling Dataset as original data, and performing data preprocessing steps including cropping and data enhancement;
s2: improving the traditional DeepLabv3+ model: the backbone network is replaced with the lightweight MobileNetV2, which effectively reduces the number of parameters and improves model speed; the main ASPP module of the model is replaced with DenseASPP, connecting a set of dilated convolutions in a denser manner; an ECA attention module is added, which directly generates a channel attention map through two 1×1 convolution layers, recalibrates the channel weights of the feature map, selects the more important feature channels, and improves the model's ability to extract small-scale information, yielding the DAEC_DeepLabv3+ network;
s3: inputting the processed remote sensing image data set into the DAEC_DeepLabv3+ network for training to obtain a trained building detection model;
s4: running the trained building detection model on the test set of remote sensing images to obtain remote sensing building image segmentation results.
Optionally, the S1 specifically includes:
s11: downloading the required remote sensing building data set from the data set website; the Inria Aerial Image Labeling Dataset comprises 180 color images of 5000×5000 pixels and 180 corresponding binary gray-scale label images of 5000×5000 pixels, and the test set comprises a further 180 color images of 5000×5000 pixels; the data set covers urban residential areas in Europe and the United States, spanning the five regions of Austin, Chicago, Kitsap, Tyrol and Vienna, with 36 images per region; all labeled images divide pixels into two classes, building and non-building, where the pixel value of building areas is 255 and that of non-building areas is 0; the images in the data set are cropped into 500×500-pixel tiles, and the data set is enhanced by rotation, mirroring or brightness transformations;
s12: randomly dividing the data in the data set into training data, verification data and test data at a ratio of 7:2:1, and storing the split file lists in the sub-files train.txt, val.txt and test.txt.
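A minimal sketch of this preprocessing in Python (using Pillow) is given below; the directory layout, tile naming and exact augmentation parameters are illustrative assumptions rather than values specified above.

```python
import random
from pathlib import Path
from PIL import Image, ImageEnhance, ImageOps

TILE = 500  # tile size in pixels, as described in s11

def cut_into_tiles(img_path: Path, out_dir: Path):
    """Cut one 5000x5000 image into 100 non-overlapping 500x500 tiles."""
    img = Image.open(img_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    for r in range(0, img.height, TILE):
        for c in range(0, img.width, TILE):
            tile = img.crop((c, r, c + TILE, r + TILE))
            tile.save(out_dir / f"{img_path.stem}_{r}_{c}.png")

def augment(tile):
    """Random rotation / mirroring / brightness change, the enhancements listed in s11."""
    if random.random() < 0.5:
        tile = tile.rotate(random.choice([90, 180, 270]))
    if random.random() < 0.5:
        tile = ImageOps.mirror(tile)
    if random.random() < 0.5:
        tile = ImageEnhance.Brightness(tile).enhance(random.uniform(0.8, 1.2))
    return tile

def split_7_2_1(tile_paths, out_dir: Path):
    """Randomly split tile paths into train/val/test lists at 7:2:1 and write them (s12)."""
    random.shuffle(tile_paths)
    n = len(tile_paths)
    splits = {"train.txt": tile_paths[:int(0.7 * n)],
              "val.txt":   tile_paths[int(0.7 * n):int(0.9 * n)],
              "test.txt":  tile_paths[int(0.9 * n):]}
    for name, paths in splits.items():
        (out_dir / name).write_text("\n".join(str(p) for p in paths))
```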
Optionally, the S2 specifically includes:
s21: an improved DAEC_DeepLabv3+ network model is built, with the lightweight network MobileNetV2 as the backbone network;
s22: in the encoding region (Encoder), dense atrous spatial pyramid pooling (DenseASPP) replaces the original ASPP module, applying the dense-connection idea of DenseNet to ASPP; several atrous convolution layers are cascaded, and the output of each atrous convolution layer is passed in a densely connected manner to all subsequent atrous convolution layers that it has not yet reached;
DenseASPP is expressed by formula (1):

y_l = H_{K,d_l}([y_{l-1}, y_{l-2}, ..., y_0])   (1)

where d_l denotes the dilation rate of layer l, [·] denotes the concatenation operation, and [y_{l-1}, ..., y_0] denotes the feature formed by concatenating the outputs of all preceding layers;
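One possible PyTorch sketch of the densely connected atrous branches of formula (1) is shown below; the dilation rates (3, 6, 12, 18) follow the description in this document, while the input channel width (assumed 320 for the MobileNetV2 high-level feature), the per-branch width and the BatchNorm/ReLU placement are assumptions.

```python
import torch
import torch.nn as nn

class DenseASPP(nn.Module):
    """Sketch of densely connected atrous branches implementing formula (1)."""
    def __init__(self, in_ch=320, branch_ch=64, rates=(3, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        ch = in_ch
        for d in rates:
            self.branches.append(nn.Sequential(
                nn.Conv2d(ch, branch_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True)))
            ch += branch_ch  # each later branch also sees all earlier branch outputs

    def forward(self, x):
        feats = [x]
        for branch in self.branches:
            y = branch(torch.cat(feats, dim=1))  # y_l = H_{K,d_l}([y_{l-1}, ..., y_0])
            feats.append(y)
        return torch.cat(feats, dim=1)           # stacked (thickened) multi-scale feature
```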
s23: after feature extraction is completed by DenseASPP, the features are stacked and thickened, and an ECA module is added; a 1×1 convolution layer is used directly after the global average pooling layer in place of a fully connected layer, the channel weights of the feature map are recalibrated, more important feature channels are selected, and the model's ability to extract small-scale information is improved; the module efficiently realizes local cross-channel interaction with a 1-dimensional convolution and extracts the dependencies among channels; the specific steps are as follows: first a global average pooling operation is applied to the input feature map, then a 1-dimensional convolution with kernel size k is applied, and the weight w of each channel is obtained through a Sigmoid activation function, as shown in formula (2):
ω = σ(C1D_k(y))   (2)
the weights are multiplied element-wise with the corresponding channels of the original input feature map to obtain the final output feature map, and a 1×1 convolution is then used to adjust the number of channels of the deep features carrying higher-level semantic information; here C1D denotes a one-dimensional convolution;
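A possible PyTorch sketch of this ECA step (global average pooling, 1-dimensional convolution with kernel size k, Sigmoid, channel reweighting, formula (2)) follows; the default k = 3 is an assumption, not a value given above.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of the channel attention step in formula (2)."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                       # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                  # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1))           # 1-D convolution across channels -> (B, 1, C)
        w = self.sigmoid(y).squeeze(1)          # channel weights w -> (B, C)
        return x * w.view(x.size(0), -1, 1, 1)  # recalibrate each channel of the input
```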
s24: in the decoding region (Decoder), the 4th and 7th shallow feature layers are extracted from the backbone network and a multi-scale feature fusion (MSFF) operation is performed; the result is added to the deep feature layer that has been upsampled by a factor of 2, and the feature-map size is then recovered through another 2× upsampling to realize semantic segmentation; the result is then concatenated with the original shallow features obtained from the backbone network to increase the number of channels, feature extraction is carried out with a 3×3 convolution, and finally the output image is adjusted to the same size as the input image; the MSFF operation fuses the multi-scale features of the two feature layers.
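The decoder fusion of s24 could be sketched as follows; the channel numbers of the backbone's 4th/7th feature layers and of the DenseASPP+ECA output are assumptions, and since the text mentions both addition and concatenation, this sketch adds the two projected shallow layers and then concatenates them with the upsampled deep feature before the 3×3 convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the s24 decoding branch with a simple MSFF of two shallow layers."""
    def __init__(self, shallow4_ch=24, shallow7_ch=32, deep_ch=256, mid_ch=48, n_cls=2):
        super().__init__()
        self.proj4 = nn.Conv2d(shallow4_ch, mid_ch, 1, bias=False)  # project both shallow
        self.proj7 = nn.Conv2d(shallow7_ch, mid_ch, 1, bias=False)  # layers to one width
        self.fuse = nn.Sequential(
            nn.Conv2d(mid_ch + deep_ch, 256, 3, padding=1, bias=False),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(256, n_cls, 1)

    def forward(self, shallow4, shallow7, deep, out_size):
        # MSFF: bring the two shallow layers to a common scale and add them
        s = self.proj4(shallow4) + F.interpolate(self.proj7(shallow7),
                                                 size=shallow4.shape[2:],
                                                 mode="bilinear", align_corners=False)
        # upsample the deep (DenseASPP + ECA) feature to the shallow scale and fuse
        d = F.interpolate(deep, size=s.shape[2:], mode="bilinear", align_corners=False)
        x = self.fuse(torch.cat([s, d], dim=1))
        # restore the prediction to the input image size
        return F.interpolate(self.classifier(x), size=out_size,
                             mode="bilinear", align_corners=False)
```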
Optionally, the step S3 specifically includes:
training is carried out in the PyTorch deep learning framework using the PyCharm programming environment; the processed remote sensing image data set is input into the DAEC_DeepLabv3+ network for training, a trained building detection model is obtained, and pre-trained model parameters are saved for different segmentation objects.
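A minimal training-loop sketch for S3 is given below; the batch size, learning rate, optimizer, epoch count and output file name shown are assumptions and not values stated here.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=100, lr=1e-3, device="cuda"):
    """Train the improved network on the preprocessed tiles and save its weights."""
    loader = DataLoader(train_set, batch_size=8, shuffle=True, num_workers=4)
    model = model.to(device)
    criterion = torch.nn.CrossEntropyLoss()                # cross-entropy loss, as in s42
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for images, masks in loader:                       # masks: (B, H, W) with 0/1 labels
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)         # logits: (B, 2, H, W)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: loss = {running / len(loader):.4f}")
    torch.save(model.state_dict(), "daec_deeplabv3plus_building.pth")
```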
Optionally, the step S4 specifically includes the following steps:
s41: the test images in the test set are input into the trained improved DeepLabv3+ network model, the segmentation object (building or background) is selected, the corresponding segmentation result is output, and the model's result images are saved;
s42: cross entropy is selected as the loss of the algorithm, and the mean pixel accuracy (mPA) and mean intersection over union (mIoU) are used as evaluation indexes; the training result of the model is evaluated from two perspectives, the proportion of correctly predicted pixels within the union of the predicted and ground-truth pixels and the proportion of correctly predicted pixels among all pixels, where a higher mIoU value indicates a better image segmentation effect; the cross-entropy formula is:
Loss = −(1/N) · Σ_{i=1}^{N} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]

where y_i is the true value of a given pixel, taking 0 or 1 in the two-class task; ŷ_i is the predicted value of that pixel; and N is the number of samples in each loss calculation;
the calculation formulas of the mean pixel accuracy mPA and the mean intersection over union mIoU over the two classes are respectively:

mPA = (1/2) · [ TP/(TP + FN) + TN/(TN + FP) ]

mIoU = (1/2) · [ TP/(TP + FP + FN) + TN/(TN + FN + FP) ]
where TP means the model prediction is correct, i.e. both the prediction and the ground truth are positive examples; FP means the model prediction is wrong, i.e. the predicted class is positive while the actual class is negative; FN means the model prediction is wrong, i.e. the predicted class is negative while the actual class is positive; TN means the model prediction is correct, i.e. both the prediction and the ground truth are negative examples.
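For reference, the two evaluation indexes of s42 can be computed from the TP, FP, FN and TN counts as in the sketch below; it assumes binary prediction and ground-truth masks with values 0 (background) and 1 (building) and does not guard against empty classes.

```python
import numpy as np

def evaluate(pred, target):
    """Mean pixel accuracy and mean IoU over the two classes from 0/1 masks."""
    tp = np.sum((pred == 1) & (target == 1))
    fp = np.sum((pred == 1) & (target == 0))
    fn = np.sum((pred == 0) & (target == 1))
    tn = np.sum((pred == 0) & (target == 0))
    mpa  = 0.5 * (tp / (tp + fn) + tn / (tn + fp))            # mean pixel accuracy mPA
    miou = 0.5 * (tp / (tp + fp + fn) + tn / (tn + fn + fp))  # mean intersection over union mIoU
    return mpa, miou
```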
The invention has the beneficial effects that:
First, the backbone network is changed to the lightweight MobileNetV2, which greatly reduces the number of parameters and the computational cost of the model.
Secondly, the original ASPP module of the model is replaced with DenseASPP: convolution layers with four different dilation rates (3, 6, 12, 18) are connected in the DenseNet manner to form a dense feature pyramid in which each layer fuses in parallel several different-scale features from the preceding layers; this produces fused multi-scale features with large receptive fields, and the large receptive field provides more context for segmenting large targets in high-resolution images.
Thirdly, the ECA attention module directly generates a channel attention map through two 1×1 convolution layers, which avoids conventional attention matrix multiplication, effectively improves computational efficiency, improves the model's ability to extract small-scale information, raises the mIoU index of the segmented remote sensing images, and yields higher segmentation accuracy.
Fourth, considering that shallow features contain more of the original information, feature fusion is performed on two shallow feature layers in the decoding region, retaining richer spatial information and enhancing robustness. In addition, the method is easy to understand as a whole, easy to operate, and applicable to semantic segmentation of other complex remote sensing images.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below with reference to the preferred embodiments and the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a remote sensing image building extraction model based on the improved DAEC_DeepLabv3+ according to a preferred embodiment of the present invention;
FIG. 2 is a DenseASPP module in a coding network;
FIG. 3 is an attention mechanism ECA added to the coding network;
fig. 4 is a multi-scale feature fusion module MSFF.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other different embodiments, and the details of this description may be modified or varied in various ways without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention, and the following embodiments and the features in the embodiments may be combined with each other provided there is no conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
Referring to figs. 1 to 4, S1 covers the input data source and data preprocessing of the present model. The Inria Aerial Image Labeling Dataset comprises 180 color images of 5000×5000 pixels and 180 corresponding binary gray-scale label images of 5000×5000 pixels, and the test set comprises a further 180 color images of 5000×5000 pixels. All annotated images divide pixels into two classes, building and non-building, with a pixel value of 255 (white) for building areas and 0 (black) for non-building areas. The images in the dataset are cropped into 500×500-pixel tiles, and the dataset is enhanced by geometric and non-geometric means such as rotation, mirroring and brightness transformations. After image enhancement, the whole dataset is divided into training, validation and test sets at a ratio of 7:2:1.
S2 constructs the improved DAEC_DeepLabv3+ network model. The backbone network is replaced with the lightweight MobileNetV2, which effectively reduces the number of parameters and improves model speed. In the encoding region (Encoder), the original ASPP module is replaced with DenseASPP, applying the dense-connection idea of DenseNet to ASPP. ASPP can be expressed by formula (1):
y = H_{3,6}(x) + H_{3,12}(x) + H_{3,18}(x) + H_{3,24}(x)   (1)

where H_{K,d}(x) denotes an atrous convolution with kernel size K and dilation rate d, and y denotes the fused feature.
In DenseASPP, the output of each atrous convolution layer is passed in a densely connected manner to all subsequent atrous convolution layers that it has not yet reached, making full use of reasonable dilation rates. Through this series of feature connections, the neurons of every intermediate feature encode semantic information at different scales, and different intermediate features cover different scale ranges. The final DenseASPP output therefore contains receptive fields of more and larger scales, and a denser and larger feature pyramid can be generated with only a few atrous convolution layers.
DenseASPP can be expressed by formula (2):

y_l = H_{K,d_l}([y_{l-1}, y_{l-2}, ..., y_0])   (2)

where d_l denotes the dilation rate of layer l, [·] denotes the concatenation operation, and [y_{l-1}, ..., y_0] denotes the feature produced by concatenating the outputs of all preceding layers. Compared with ASPP, DenseASPP stacks all atrous convolution layers in a densely connected manner, generating a denser feature pyramid and a larger receptive field. After feature extraction is completed by DenseASPP, the features are stacked and thickened, and an efficient channel attention (ECA) module is added, which avoids dimensionality reduction and effectively captures cross-channel interaction information. The module realizes local cross-channel interaction with a 1-dimensional convolution and extracts the dependencies among channels. The specific steps are as follows: first a global average pooling operation is applied to the input feature map, then a 1-dimensional convolution with kernel size k is applied, and the weight w of each channel is obtained through a Sigmoid activation function, as shown in formula (3):

ω = σ(C1D_k(y))   (3)

The weights are multiplied element-wise with the corresponding channels of the original input feature map to obtain the final output feature map, and a 1×1 convolution is then used to adjust the number of channels of the deep features carrying higher-level semantic information. In the decoding region (Decoder), the 4th and 7th shallow feature layers are extracted from the backbone network, the MSFF operation is performed, the result is added to the deep feature layer that has been upsampled by a factor of 2, and the feature-map size is then recovered through another 2× upsampling to realize semantic segmentation. The result is then concatenated (concat) with the original shallow features obtained from the backbone network to increase the number of channels, feature extraction is performed with a 3×3 convolution, and finally the output image is adjusted to the same size as the input image. The MSFF operation fuses the multi-scale features of the two feature layers.
Further, S3 is training in the PyTorch deep learning framework using the PyCharm programming environment; as an embodiment, the PyCharm professional edition is used and the network is trained with the deep learning framework PyTorch 1.11.0. The processed remote sensing image data set is input into the DAEC_DeepLabv3+ network for training, a trained building detection model is obtained, and pre-trained model parameters are saved for different segmentation objects.
Further, in S4 the test images in the test set are input into the trained improved DAEC_DeepLabv3+ network model, the segmentation object (building or background) is selected, the corresponding segmentation result is output, and the model's result images are saved. During network training, cross entropy is selected as the loss of the algorithm, the mean pixel accuracy (mPA) and mean intersection over union (mIoU) are used as evaluation indexes, and the training result of the model is evaluated from two perspectives, the proportion of correctly predicted pixels within the union of the predicted and ground-truth pixels and the proportion of correctly predicted pixels among all pixels; a higher mIoU value indicates a better image segmentation effect. The cross-entropy formula is:
Loss = −(1/N) · Σ_{i=1}^{N} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]

where y_i is the true value of a given pixel, taking 0 or 1 in the two-class task; ŷ_i is the predicted value of that pixel; and N is the number of samples in each loss calculation.
The calculation formulas of the mean pixel accuracy mPA and the mean intersection over union mIoU over the two classes are respectively:

mPA = (1/2) · [ TP/(TP + FN) + TN/(TN + FP) ]

mIoU = (1/2) · [ TP/(TP + FP + FN) + TN/(TN + FN + FP) ]
In the above formulas, TP means the model prediction is correct, i.e. both the prediction and the ground truth are positive examples; FP means the model prediction is wrong, i.e. the predicted class is positive while the actual class is negative; FN means the model prediction is wrong, i.e. the predicted class is negative while the actual class is positive; TN means the model prediction is correct, i.e. both the prediction and the ground truth are negative examples.
Experimental results show that the method, based on the improved DAEC_DeepLabv3+ network model, has the advantages of fewer training parameters, higher segmentation accuracy, finer edge-information extraction, and effective alleviation of problems such as holes in the segmentation of large-scale targets; in addition, the method has good robustness, is easy to understand as a whole and easy to operate, and can be applied to semantic segmentation of other complex remote sensing images.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.
Claims (5)
1. A remote sensing image building extraction method based on deep learning, characterized by comprising the following steps:
s1: selecting the public remote sensing building data set Inria Aerial Image Labeling Dataset as original data, and performing data preprocessing steps including cropping and data enhancement;
s2: improving the traditional DeepLabv3+ model: the backbone network is replaced with the lightweight MobileNetV2, which effectively reduces the number of parameters and improves model speed; the main ASPP module of the model is replaced with DenseASPP, connecting a set of dilated convolutions in a denser manner; an ECA attention module is added, which directly generates a channel attention map through two 1×1 convolution layers, recalibrates the channel weights of the feature map, selects the more important feature channels, and improves the model's ability to extract small-scale information, yielding the DAEC_DeepLabv3+ network;
s3: inputting the processed remote sensing image data set into the DAEC_DeepLabv3+ network for training to obtain a trained building detection model;
s4: running the trained building detection model on the test set of remote sensing images to obtain remote sensing building image segmentation results.
2. The remote sensing image building extraction method based on deep learning as claimed in claim 1, wherein S1 specifically comprises the following steps:
s11: downloading the required remote sensing building data set from the data set website; the Inria Aerial Image Labeling Dataset comprises 180 color images of 5000×5000 pixels and 180 corresponding binary gray-scale label images of 5000×5000 pixels, and the test set comprises a further 180 color images of 5000×5000 pixels; the data set covers urban residential areas in Europe and the United States, spanning the five regions of Austin, Chicago, Kitsap, Tyrol and Vienna, with 36 images per region; all labeled images divide pixels into two classes, building and non-building, where the pixel value of building areas is 255 and that of non-building areas is 0; the images in the data set are cropped into 500×500-pixel tiles, and the data set is enhanced by rotation, mirroring or brightness transformations;
s12: randomly dividing the data in the data set into training data, verification data and test data at a ratio of 7:2:1, and storing the split file lists in the sub-files train.txt, val.txt and test.txt.
3. The remote sensing image building extraction method based on deep learning as claimed in claim 2, wherein S2 specifically comprises the following steps:
s21: an improved DAEC_DeepLabv3+ network model is built, with the lightweight network MobileNetV2 as the backbone network;
s22: in the encoding region (Encoder), dense atrous spatial pyramid pooling (DenseASPP) replaces the original ASPP module, applying the dense-connection idea of DenseNet to ASPP; several atrous convolution layers are cascaded, and the output of each atrous convolution layer is passed in a densely connected manner to all subsequent atrous convolution layers that it has not yet reached;
DenseASPP is expressed by formula (1):

y_l = H_{K,d_l}([y_{l-1}, y_{l-2}, ..., y_0])   (1)

where d_l denotes the dilation rate of layer l, [·] denotes the concatenation operation, and [y_{l-1}, ..., y_0] denotes the feature formed by concatenating the outputs of all preceding layers;
s23: after feature extraction is completed by DenseASPP, the features are stacked and thickened, and an ECA module is added; a 1×1 convolution layer is used directly after the global average pooling layer in place of a fully connected layer, the channel weights of the feature map are recalibrated, more important feature channels are selected, and the model's ability to extract small-scale information is improved; the module efficiently realizes local cross-channel interaction with a 1-dimensional convolution and extracts the dependencies among channels; the specific steps are as follows: first a global average pooling operation is applied to the input feature map, then a 1-dimensional convolution with kernel size k is applied, and the weight w of each channel is obtained through a Sigmoid activation function, as shown in formula (2):
ω = σ(C1D_k(y))   (2)
the weights are multiplied element-wise with the corresponding channels of the original input feature map to obtain the final output feature map, and a 1×1 convolution is then used to adjust the number of channels of the deep features carrying higher-level semantic information; here C1D denotes a one-dimensional convolution;
s24: in the decoding region (Decoder), the 4th and 7th shallow feature layers are extracted from the backbone network and a multi-scale feature fusion (MSFF) operation is performed; the result is added to the deep feature layer that has been upsampled by a factor of 2, and the feature-map size is then recovered through another 2× upsampling to realize semantic segmentation; the result is then concatenated with the original shallow features obtained from the backbone network to increase the number of channels, feature extraction is carried out with a 3×3 convolution, and finally the output image is adjusted to the same size as the input image; the MSFF operation fuses the multi-scale features of the two feature layers.
4. The remote sensing image building extraction method based on deep learning as claimed in claim 3, wherein S3 specifically comprises the following steps:
training is carried out in the PyTorch deep learning framework using the PyCharm programming environment; the processed remote sensing image data set is input into the DAEC_DeepLabv3+ network for training, a trained building detection model is obtained, and pre-trained model parameters are saved for different segmentation objects.
5. The remote sensing image building extraction method based on deep learning as claimed in claim 4, wherein S4 specifically comprises the following steps:
s41: the test images in the test set are input into the trained improved DeepLabv3+ network model, the segmentation object (building or background) is selected, the corresponding segmentation result is output, and the model's result images are saved;
s42: cross entropy is selected as the loss of the algorithm, and the mean pixel accuracy (mPA) and mean intersection over union (mIoU) are used as evaluation indexes; the training result of the model is evaluated from two perspectives, the proportion of correctly predicted pixels within the union of the predicted and ground-truth pixels and the proportion of correctly predicted pixels among all pixels, where a higher mIoU value indicates a better image segmentation effect; the cross-entropy formula is:
Loss = −(1/N) · Σ_{i=1}^{N} [ y_i·log(ŷ_i) + (1 − y_i)·log(1 − ŷ_i) ]

where y_i is the true value of a given pixel, taking 0 or 1 in the two-class task; ŷ_i is the predicted value of that pixel; and N is the number of samples in each loss calculation;
the calculation formulas of the mean pixel accuracy mPA and the mean intersection over union mIoU over the two classes are respectively:

mPA = (1/2) · [ TP/(TP + FN) + TN/(TN + FP) ]

mIoU = (1/2) · [ TP/(TP + FP + FN) + TN/(TN + FN + FP) ]
where TP means the model prediction is correct, i.e. both the prediction and the ground truth are positive examples; FP means the model prediction is wrong, i.e. the predicted class is positive while the actual class is negative; FN means the model prediction is wrong, i.e. the predicted class is negative while the actual class is positive; TN means the model prediction is correct, i.e. both the prediction and the ground truth are negative examples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310894293.1A CN116912708A (en) | 2023-07-20 | 2023-07-20 | Remote sensing image building extraction method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310894293.1A CN116912708A (en) | 2023-07-20 | 2023-07-20 | Remote sensing image building extraction method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116912708A true CN116912708A (en) | 2023-10-20 |
Family
ID=88362558
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310894293.1A Pending CN116912708A (en) | 2023-07-20 | 2023-07-20 | Remote sensing image building extraction method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116912708A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117152546A (en) * | 2023-10-31 | 2023-12-01 | 江西师范大学 | Remote sensing scene classification method, system, storage medium and electronic equipment |
CN117237644A (en) * | 2023-11-10 | 2023-12-15 | 广东工业大学 | Forest residual fire detection method and system based on infrared small target detection |
CN117496563A (en) * | 2024-01-03 | 2024-02-02 | 脉得智能科技(无锡)有限公司 | Carotid plaque vulnerability grading method and device, electronic equipment and storage medium |
CN117975195A (en) * | 2024-01-17 | 2024-05-03 | 山东建筑大学 | Texture information guiding-based high-resolution farmland pseudo-sample controllable generation method |
CN118366045A (en) * | 2024-06-20 | 2024-07-19 | 福建师范大学 | Remote sensing image building extraction method and device based on improved DeepLabV + |
-
2023
- 2023-07-20 CN CN202310894293.1A patent/CN116912708A/en active Pending
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117152546A (en) * | 2023-10-31 | 2023-12-01 | 江西师范大学 | Remote sensing scene classification method, system, storage medium and electronic equipment |
CN117152546B (en) * | 2023-10-31 | 2024-01-26 | 江西师范大学 | Remote sensing scene classification method, system, storage medium and electronic equipment |
CN117237644A (en) * | 2023-11-10 | 2023-12-15 | 广东工业大学 | Forest residual fire detection method and system based on infrared small target detection |
CN117237644B (en) * | 2023-11-10 | 2024-02-13 | 广东工业大学 | Forest residual fire detection method and system based on infrared small target detection |
CN117496563A (en) * | 2024-01-03 | 2024-02-02 | 脉得智能科技(无锡)有限公司 | Carotid plaque vulnerability grading method and device, electronic equipment and storage medium |
CN117496563B (en) * | 2024-01-03 | 2024-03-19 | 脉得智能科技(无锡)有限公司 | Carotid plaque vulnerability grading method and device, electronic equipment and storage medium |
CN117975195A (en) * | 2024-01-17 | 2024-05-03 | 山东建筑大学 | Texture information guiding-based high-resolution farmland pseudo-sample controllable generation method |
CN117975195B (en) * | 2024-01-17 | 2024-09-06 | 山东建筑大学 | Texture information guiding-based high-resolution farmland pseudo-sample controllable generation method |
CN118366045A (en) * | 2024-06-20 | 2024-07-19 | 福建师范大学 | Remote sensing image building extraction method and device based on improved DeepLabV + |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106909924B (en) | Remote sensing image rapid retrieval method based on depth significance | |
CN116912708A (en) | Remote sensing image building extraction method based on deep learning | |
CN111797779A (en) | Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion | |
CN110298037A (en) | The matched text recognition method of convolutional neural networks based on enhancing attention mechanism | |
Yuan et al. | Exploring multi-level attention and semantic relationship for remote sensing image captioning | |
CN113657450B (en) | Attention mechanism-based land battlefield image-text cross-modal retrieval method and system | |
CN110826638A (en) | Zero sample image classification model based on repeated attention network and method thereof | |
CN113034506B (en) | Remote sensing image semantic segmentation method and device, computer equipment and storage medium | |
CN115438215B (en) | Image-text bidirectional search and matching model training method, device, equipment and medium | |
CN117036936A (en) | Land coverage classification method, equipment and storage medium for high-resolution remote sensing image | |
CN113743417A (en) | Semantic segmentation method and semantic segmentation device | |
CN115222998B (en) | Image classification method | |
CN117351363A (en) | Remote sensing image building extraction method based on transducer | |
CN114510594A (en) | Traditional pattern subgraph retrieval method based on self-attention mechanism | |
CN115527113A (en) | Bare land classification method and device for remote sensing image | |
CN117593666B (en) | Geomagnetic station data prediction method and system for aurora image | |
Hu et al. | Saliency-based YOLO for single target detection | |
Cheng et al. | A survey on image semantic segmentation using deep learning techniques | |
CN117953299A (en) | Land utilization classification method based on multi-scale remote sensing images | |
Li et al. | HRVQA: A Visual Question Answering benchmark for high-resolution aerial images | |
CN117058437B (en) | Flower classification method, system, equipment and medium based on knowledge distillation | |
CN112925983A (en) | Recommendation method and system for power grid information | |
CN114511787A (en) | Neural network-based remote sensing image ground feature information generation method and system | |
CN116844039A (en) | Multi-attention-combined trans-scale remote sensing image cultivated land extraction method | |
Pang et al. | PTRSegNet: A Patch-to-Region Bottom-Up Pyramid Framework for the Semantic Segmentation of Large-Format Remote Sensing Images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |