CN111898543B - Building automatic extraction method integrating geometric perception and image understanding - Google Patents

Building automatic extraction method integrating geometric perception and image understanding

Info

Publication number
CN111898543B
CN111898543B
Authority
CN
China
Prior art keywords
network
building
pixel
layer
value
Prior art date
Legal status
Active
Application number
CN202010757389.XA
Other languages
Chinese (zh)
Other versions
CN111898543A
Inventor
Zhang Zhan (张展)
Zheng Xianwei (郑先伟)
Gong Jianya (龚健雅)
Chen Xiaoling (陈晓玲)
Xu Xu (徐旭)
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202010757389.XA
Publication of CN111898543A
Application granted
Publication of CN111898543B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/176 Urban or other man-made structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/194 Terrestrial scenes using hyperspectral data, i.e. more or other wavelengths than RGB

Abstract

The invention discloses a building automatic extraction method integrating geometric perception and image understanding. The method first selects and preprocesses remote sensing data, takes the resulting multiband remote sensing image and normalized digital surface model as input feature maps, and feeds them to an encoder based on an improved deep residual network for feature learning. The learned high-level semantic features are then passed to the network's building multi-scale efficient perception module, and the low-level semantic features together with the high-level semantic features output by that module are fed to the network decoder to obtain a binary land-cover classification map. Finally, the binary classification map output by the decoder is passed to the network loss-function layer to drive the network to learn the optimal weight parameters that fit the target task, and the network output layer produces the final building classification map, completing end-to-end automatic building extraction. The invention achieves the best results in both building extraction accuracy and extraction efficiency.

Description

Building automatic extraction method integrating geometric perception and image understanding
Technical Field
The invention belongs to the field of applying deep learning technology to intelligent interpretation of remote sensing imagery, and relates to an automatic building extraction method integrating geometric perception and image understanding, in particular to a method for automatically extracting buildings from multi-source remote sensing data (aerial remote sensing images and airborne laser radar data).
Background
Remote sensing is a modern applied science that has developed with the progress of science and technology since the 1960s. It acquires data on ground targets through various sensors (such as cameras, scanners, and laser radars) and processes and analyzes those data to obtain useful information about the detected targets, thereby realizing information acquisition and description of the targets. Depending on platform altitude, remote sensing is generally divided into spaceborne remote sensing and aerial (including near-ground) remote sensing. Common spaceborne remote sensing platforms include artificial satellites and spacecraft, while aerial remote sensing platforms mainly use aircraft. Remote sensing can acquire earth-surface information over large areas and dynamically, is an important scientific means of obtaining geospatial information, and is now widely applied in fields such as environmental change detection, urban construction and management, and meteorological disaster monitoring.
Because buildings are updated frequently, occupy large areas, and are closely related to human activities, they have long been among the most important man-made targets and a key object of automatic urban remote sensing information extraction. Knowing the location and intensity of building change is of great significance to research on urban disaster assessment, urban expansion, urban resource distribution, the urban environment, and digital cities. With continued economic development and ever-increasing human activity, buildings are being renewed more rapidly and their structural composition is becoming more complex, placing high demands on the data acquisition cycle of remote sensing platforms, the spatial resolution of the acquired data, and the degree of automation of extraction algorithms. Aerial remote sensing platforms feature high imaging resolution, short survey cycles, and high measurement accuracy, and are widely used in research on automatic building information extraction. Besides an aerial photography system, an aerial platform can also carry an airborne laser radar (LiDAR) system, which directly acquires high-precision three-dimensional surface terrain data and is an important complement to traditional aerial photography for acquiring land-cover elevation data and for automatic rapid processing. In addition, the rapid development of remote sensing sensors has greatly enhanced the ability of spaceborne platforms to acquire remote sensing data of high spatial and temporal resolution, and the resulting high-resolution imagery is also widely used in land-cover information extraction research.
Remote sensing imagery is one of the most common types of remote sensing data. Traditional automatic building extraction methods based on a single aerial image data source follow three main ideas: methods based on geometric boundary features, methods based on active contour models, and methods based on region segmentation (documents 1 to 3). With the rapid development of airborne laser radar systems, processing theory for the three-dimensional laser point clouds they acquire has been studied extensively and in depth. Traditional automatic building extraction methods based on LiDAR data follow two main ideas: point-by-point classification of the point cloud and point cloud segmentation (documents 4 to 5). High-resolution aerial imagery provides rich spectral, textural, and geometric information for building detection, while LiDAR data provide the three-dimensional spatial information of buildings through the acquired digital surface model. Extracting buildings from multi-source remote sensing data (aerial imagery and airborne LiDAR data) can resolve the mutual occlusion of target features in high-resolution imagery, and the redundancy of information improves the accuracy and reliability of building extraction. Conventional building extraction methods fusing multi-source remote sensing data can be roughly divided into pixel-based and object-based classification methods (documents 6 to 7). Because multi-source remote sensing data can exploit the complementary advantages of aerial imagery and airborne LiDAR and thereby improve extraction accuracy and reliability, this approach is a current focus of research. However, traditional multi-source automatic building extraction methods are still limited by the restricted expressive power of hand-crafted features and the low robustness of the algorithms.
With the success of convolutional neural networks in image processing, deep convolutional neural networks (DCNNs) and fully convolutional networks (FCNs) have been widely applied to the extraction of remote sensing features such as buildings and roads (documents 8 to 12). Compared with traditional data fusion algorithms, FCN-based data fusion can make full use of the rich information in high-resolution multi-source remote sensing data to improve building extraction accuracy and result reliability, enabling automatic extraction of buildings in a variety of complex environments. Learning building features in multi-source remote sensing data with FCNs is currently an effective way to improve the accuracy and reliability of extraction results. However, current network learning strategies that treat LiDAR-derived digital surface models as auxiliary inputs or extra image features not only fail to sufficiently mine the spatial geometric information in LiDAR data, but also greatly increase the computation and parameter counts of the network. In addition, the lack of detail information in the output classification is a main problem of fully convolutional architectures; when buildings of small spatial size are extracted in complex urban scenes, boundary ambiguity in the extraction results is especially prominent.
[Document 1] Cui S, Yan Q, Reinartz P. 2012. Complex building description and extraction based on Hough transformation and cycle detection. Remote Sensing Letters [J], 3:151-159.
[Document 2] Mohammadzadeh A. 2010. Automatic urban building boundary extraction from high resolution aerial images using an innovative model of active contours. International Journal of Applied Earth Observation & Geoinformation [J], 12:150-157.
[Document 3] Ok A O, Senaras C, Yuksel B. 2013. Automated Detection of Arbitrarily Shaped Buildings in Complex Environments From Monocular VHR Optical Satellite Imagery. IEEE Transactions on Geoscience and Remote Sensing [J], 51:1701-1717.
[Document 4] Niemeyer J, Rottensteiner F, Soergel U. 2014. Contextual classification of lidar data and building object detection in urban areas. ISPRS Journal of Photogrammetry and Remote Sensing [J], 87:152-165.
[Document 5] Du S, Zhang Y, Zou Z, et al. 2017. Automatic building extraction from LiDAR data fusion of point and grid-based features. ISPRS Journal of Photogrammetry and Remote Sensing [J], 130:294-307.
[Document 6] Haala N, Brenner C. 1999. Extraction of buildings and trees in urban environments. ISPRS Journal of Photogrammetry and Remote Sensing [J], 54:130-137.
[Document 7] Zarea A, Mohammadzadeh A. 2016. A Novel Building and Tree Detection Method From LiDAR Data and Aerial Images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing [J], 9:1864-1875.
[Document 8] Paisitkriangkrai S, Sherrah J, Janney P, et al. 2016. Semantic Labeling of Aerial and Satellite Imagery. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing [J], 9:2868-2881.
[Document 9] Liu Y, Fan B, Wang L, et al. 2017. Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS Journal of Photogrammetry and Remote Sensing [J], 145:78-95.
[Document 10] Xu Y, Wu L, Xie Z, et al. 2018. Building Extraction in Very High Resolution Remote Sensing Imagery Using Deep Learning and Guided Filters. Remote Sensing [J], 10:144.
[Document 11] Marmanis D, Schindler K, Wegner J D, et al. 2018. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS Journal of Photogrammetry and Remote Sensing [J], 135:158-172.
[Document 12] Huang J, Zhang X, Xin Q, et al. 2019. Automatic building extraction from high-resolution aerial images and LiDAR data using gated residual refinement network. ISPRS Journal of Photogrammetry and Remote Sensing [J], 151:91-105.
Disclosure of Invention
Aiming at the shortcomings of existing automatic building extraction methods, the invention provides an automatic building extraction method integrating efficient geometric perception and image understanding, offering a new approach for extracting remote sensing building information automatically, accurately, and efficiently.
The technical scheme adopted by the invention is as follows: a building automatic extraction method fusing geometric perception and image understanding is characterized by comprising the following steps:
step 1: selecting and preprocessing remote sensing data to obtain an experimental data set consisting of a first data set, a second data set, a normalized digital surface model and a building real type label; the first data set comprises a plurality of multiband remote sensing images with red, green and blue wave bands, and the second data set comprises a plurality of multiband remote sensing images with red, green and near-infrared wave bands; wherein one part of the first data set is used as a training and verifying data set, and the other part of the first data set is used as a testing data set; one part of the second data set is used as a training and verifying data set, and the other part of the second data set is used as a testing data set;
step 2: inputting the preprocessed multiband remote sensing image and the corresponding normalized digital surface model as an input feature map to a network encoder end for feature learning;
the network encoder is an improved deep residual network ResNet-101, with the following improvements: (1) a network layer formed by three parallel branches is added at the front end of the deep residual network, where the first branch consists of a 3 × 3 standard convolution layer, a batch normalization layer, and an activation layer, and the second and third branches, on the basis of the first branch structure, replace the 3 × 3 standard convolution layer with, respectively, a convolution layer perceiving building local height-similar information and an efficient perception layer for building three-dimensional spatial geometric information; (2) in the 5th and 6th network layers, the ordinary 3 × 3 convolution kernels are replaced with atrous (dilated) convolution kernels of dilation rates 2 and 4, respectively; (3) a structure capable of efficiently perceiving building local height-similar information is introduced into each layer of the network encoder;
Step 3: inputting the high-level semantic features learned by the network encoder in step 2 into the building multi-scale efficient perception module of the network;
Step 4: inputting the low-level semantic features learned at the network encoder and the high-level semantic features output by the building multi-scale efficient perception module into the network decoder to obtain a binary land-cover classification map comprising buildings and non-buildings;
Step 5: inputting the binary land-cover classification map output by the network decoder into the network loss-function layer to drive the network to learn the optimal weight parameters that fit the target task; the network output layer finally outputs the final building classification map, completing end-to-end automatic building extraction.
The invention can fully fuse and mine the data characteristics of high-resolution remote sensing imagery and three-dimensional laser point clouds to achieve efficient automatic extraction of buildings. Its advantages are:
(1) at the encoder end of the baseline network, the method exploits the powerful feature learning capability of the ResNet-101 deep residual network to extract features from multi-source remote sensing data; meanwhile, at the decoder end, the invention gradually recovers the spatial size of the input image by combining multi-level features learned within the network, preserving building detail information to the greatest extent;
(2) through the efficient perception structure for building local height-similar information embedded at the network encoder, the three-dimensional geometric relationships between pixels are seamlessly fused into the convolution and pooling operations of the network, greatly improving building extraction accuracy without introducing any additional parameters or computational complexity;
(3) through the efficient perception structure for building three-dimensional spatial geometric information embedded at the network encoder, the three-dimensional geometric information in laser radar data is fully mined by the quasi-three-dimensional convolution within the structure, further improving building extraction accuracy;
(4) between the encoder and decoder, the network model uses the proposed building multi-scale information efficient perception module, which effectively enlarges the size and diversity of the network receptive field, thereby improving the network's ability to learn multi-scale image features and the building extraction result.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a block diagram of the overall architecture of an encoder-decoder embodying the present invention;
FIG. 3 is a schematic diagram of a convolution kernel operation for building local highly similar information according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a pooling kernel operation for building local highly similar information according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the operation of the building three-dimensional spatial information efficient sensing structure implemented by the present invention;
FIG. 6 is a block diagram of the building multi-scale information efficient perception module implemented by the present invention.
Detailed Description
To help those of ordinary skill in the art understand and practice the present invention, the invention is further described below with reference to the drawings and specific embodiments. It should be understood that the examples described herein are for illustration and explanation only and are not intended to limit the invention.
Referring to FIG. 1, the automatic building extraction method integrating geometric perception and image understanding provided by the invention comprises the following steps:
step 1: selecting and preprocessing remote sensing data to obtain an experimental data set consisting of a first data set, a second data set, a normalized digital surface model and a building real type label; the first data set comprises a plurality of multiband remote sensing images with red, green and blue wave bands, and the second data set comprises a plurality of multiband remote sensing images with red, green and near-infrared wave bands; wherein one part of the first data set is used as a training and verifying data set, and the other part of the first data set is used as a testing data set; one part of the second data set is used as a training and verifying data set, and the other part of the second data set is used as a testing data set;
in the embodiment, a SpaceNet city three-dimensional challenge data set and an IEEE GRASS data fusion match data set are selected to compare and test automatic extraction methods of various buildings. The data sets have a plurality of complex urban scenes consisting of different building densities, space sizes and surrounding environments, and the extraction accuracy and reliability of the automatic extraction methods for different buildings can be well verified. Which comprises the following steps: (1) the spatial resolution is a high-resolution orthographic remote sensing image with the size of 0.5 m; (2) a digital surface model and a digital elevation model corresponding to the remote sensing image; and (3) obtaining a building real type label based on the remote sensing image manual labeling. The invention selects 132 scenes covering 4 cities comprising Jackson Ville (Jackson), Tampas (Tampa), Richmond (Richmond) and Houston (Houston) from the two data sets, and takes the ortho remote sensing images containing three wave bands of red, green and blue and having the space size of 2048 multiplied by 2048 or 1192 multiplied by 1202 as experimental data sets. It should be particularly noted that 10 scenes of the houston city scene in the published IEEE GRASS data set lack the building real type labels corresponding to the remote sensing images, the vector building boundary of the corresponding area is downloaded on an Open Street Map (OSM), the building real distribution condition in the original remote sensing image is referred to modify and adjust the vector building boundary, and then the processed building vector boundary data is rasterized by the invention, so that the corresponding 10 scenes of the building real type labels are finally obtained and are supplemented into the experimental data set of the invention. In order to obtain the three-dimensional geometrical space information capable of reflecting the reality of the ground object, the invention obtains the normalized digital surface model by utilizing the height difference between pixel values of the digital surface model and the digital elevation model, and the normalized digital surface model is used as another input characteristic of the network model. The spatial resolution of all normalized digital surface models is then resampled to the same spatial resolution as their corresponding remote sensing images. In the experimental data set, the invention selects 105 scenes of images as the training and verification data set of the experiment, and the remaining 27 scenes of images as the test data set of the experiment.
In addition, the ISPRS two-dimensional semantic labeling benchmark data sets are used to further supplement the experimental data. For the Vaihingen urban remote sensing data set, with 33 image scenes of 2500 × 2000 pixels at 0.09 m resolution, 27 scenes are randomly selected to supplement the training and validation set and the remaining 6 scenes supplement the test set. For the Potsdam urban remote sensing data set, with 38 scenes of 6000 × 6000 pixels at 0.05 m resolution, 30 scenes are randomly selected to supplement the training and validation set and the remaining 8 scenes supplement the test set. The Vaihingen images have 3 bands: near-infrared, red, and green. The Potsdam images have 4 bands: near-infrared, red, green, and blue; only the near-infrared, red, and green bands of the ISPRS images are kept as network model inputs. Furthermore, to obtain data reflecting the true three-dimensional geometry of the land surface, the normalized digital surface model products supplied with the data sets are used as part of the network input features for training and prediction. Because the spatial resolution and band composition of the ISPRS images differ somewhat from those of the SpaceNet and IEEE GRSS images, the invention places them in a separate group of experimental data for model training and testing.
Finally, the present invention obtains an experimental data set consisting of 203 scene images, including 162 scene images of the training/validation data set and 41 scene images of the test data set.
Step 2: inputting the preprocessed multiband remote sensing image and the corresponding normalized digital surface model as an input feature map to a network encoder end for feature learning;
the network encoder of the embodiment is an improved depth residual error network ResNet-101; wherein the improvement comprises: (1) a network layer formed by connecting three different network branches in parallel is added at the front end of the depth residual error network, wherein the first network branch is formed by a 3 multiplied by 3 standard convolution operation layer, a batch normalization layer and an activation layer structure, and the second network branch and the third network branch respectively replace the 3 multiplied by 3 standard convolution operation layer with a convolution operation layer of building local height similar information and a building three-dimensional space geometric information high-efficiency perception operation layer on the basis of the first network branch structure; (2) replacing a common 3 multiplied by 3 convolution kernel with a cavity convolution kernel with the cavity rates of 2 and 4 respectively in the 5 th and 6 th network layers; (3) a structure capable of efficiently sensing local height similar information of a building is introduced into each layer of a network encoder end;
the overall structure of the network encoder end of the present embodiment is shown in fig. 2. In order to enable the network model to have better feature extraction capability and expression effect, the invention is based on a basic framework of a deep residual error network ResNet-101 as an encoder end, and is effectively improved on the basis of the basic framework. The encoder end based on the ResNet-101 network can be divided into 5 different network layers according to the difference of the number of the input characteristic diagram wave bands in the network layers. The first point of improvement of the invention at the encoder end is that a network layer is added at the front end of ResNet-101, and the network layer integrates a building local highly similar information efficient sensing structure and a building three-dimensional space geometric information efficient sensing structure on the basis of a 3 x 3 standard convolution layer, a batch normalization layer and an activation layer structure. The added network layer is formed by connecting three different network branches in parallel, and can receive the input of multiband remote sensing images and normalized digital surface model data to obtain 64 characteristic graphs with the same space size as the input data. The purpose of adding the network layer is to extract and fuse the three-dimensional space geometric information, local highly similar information, contour and other low-level information of the building target in the data and transmit the fused information to the next network layer. After the processing of the first network layer, the number of feature map output channels is 64. The second point of improvement of the invention at the network encoder end is that the 5 th and 6 th network layers use the hole convolution with the hole rates of 2 and 4 respectively to increase the receptive field of the network. The third point of the improvement of the invention at the network encoder end is that an efficient sensing structure capable of sensing the local highly similar information of the building is introduced at each layer of the network. The structure can enable the network to flexibly and efficiently integrate the normalized digital surface model information obtained by airborne laser radar data under the condition of not introducing additional parameters. After the original input data passes through 6 network layers at the encoder end, the number of output characteristic graph wave bands is 2048.
The calculation principle of the efficient perception structure for building local height-similar information in FIG. 2 is as follows: within the $h \times w$ spatial extent of a convolution kernel, a pixel $P_{ij}$ whose height is similar to that of the central pixel $P_0$ is considered more strongly correlated with the category of the central pixel, and is then given a larger weight by a function model in the convolution operation, yielding a convolution output feature map that contains the spatial geometric relationships between pixels. The Gaussian function model describing the height-similar information between pixels in each convolution operation is:

$$G(d_{ij}) = \exp\left(-\frac{(d_{ij}-d_0)^2}{2\delta^2}\right) \tag{1}$$

where $d_0$ denotes the height value of the central pixel $P_0$ of the convolution kernel, $P_{ij}$ denotes any pixel within the convolution kernel, $d_{ij}$ denotes the height values of the different pixels within the $h \times w$ spatial extent centered on pixel $P_0$, the subscripts $i$ and $j$ denote the width and height indices of pixel positions within the convolution kernel, and $\delta$, the unit size of the two-dimensional convolution kernel, serves as the standard-deviation term in the Gaussian function model.
Except for the 1st network layer, all standard convolution kernels at the network encoder are replaced with convolution kernels that efficiently perceive building local height-similar information. In addition, to prevent the encoder from focusing excessively on the spatial geometric relationships between pixels and thereby neglecting low-level building features in the input data (such as building outlines and corner points), the invention adds a parallel branch with standard convolution in layer 1 of the network encoder; the resulting convolution operation function model is given in Equation (2). FIG. 3 illustrates the operation of the convolution kernel that efficiently perceives building local height-similar information.
$$y_0 = \sum_{i=1}^{h}\sum_{j=1}^{w} \tilde{w}_{ij}\, x_{ij} + b \tag{2}$$

where $h$ and $w$ denote the width and height of the convolution kernel, $b$ denotes the bias value on the feature map, $\tilde{w}_{ij} = G(d_{ij})\, w_{ij}$ denotes the redistributed convolution kernel weight computed from the Gaussian function model, and $w_{ij}$ denotes the weight of the original convolution kernel operation;
the standard convolution kernel added in the first layer of the network encoder end and the convolution kernel capable of efficiently sensing the local highly similar information of the building respectively learn different characteristics of the building in the input data in a side-by-side manner through a parallel branch structure under the condition of not sharing any operation weight parameters.
Although traditional average pooling reduces network parameters and computation and improves the nonlinear learning capability of the network, it averages all pixels of each target region in the feature map indiscriminately, so the network easily loses low-level feature information such as building outlines and ultimately cannot obtain an accurate building extraction result. The invention effectively alleviates this by integrating inter-pixel geometric relationship information into the standard pooling operation, i.e., adding the Gaussian function model describing the geometric correlation between pixels into the standard average pooling function. The pooling operation function model with efficient perception of building local information is:

$$y_0 = \frac{\sum_{i=1}^{h}\sum_{j=1}^{w} G(d_{ij})\, x_{ij}}{\sum_{i=1}^{h}\sum_{j=1}^{w} G(d_{ij})} \tag{3}$$

where $G(\cdot)$ represents the Gaussian function model of Equation (1), $d_0$ represents the height value of the central pixel $P_0$, $d_{ij}$ represents the height values of the different pixels within the $h \times w$ spatial extent centered on pixel $P_0$, and $\delta$ represents the unit size of the two-dimensional convolution kernel, serving as the standard-deviation term in the Gaussian function model.
The invention replaces the standard pooling kernel in the 2nd network layer of the encoder with the pooling kernel that efficiently perceives building local information. FIG. 4 illustrates its operation. During this calculation, within the $h \times w$ spatial extent of the pooling kernel, a pixel $P_{ij}$ whose height is more similar to that of the central pixel $P_0$ is considered more strongly correlated with the category of the central pixel and is given a larger weight in the operation, yielding a pooling output feature map that contains the spatial geometric relationships between pixels.
In order to fully mine the three-dimensional geometric information in the airborne laser radar data while keeping the computation and parameter counts of the network model as small as possible, the invention proposes an efficient perception structure for building three-dimensional spatial geometric information. Its processing idea is to first obtain the maximum height value of the pixels in the input feature map, divide that maximum height evenly into $L_i$ intervals to obtain a feature map set $L$, and then apply an independent two-dimensional convolution to each feature map in the set. This operation can be expressed as:

$$v_{ij}^{xy} = h\left(\sum_{k=1}^{L_i}\sum_{m=0}^{M_i-1}\sum_{n=0}^{N_i-1} w_{ij}^{kmn}\, u_k^{(x+m)(y+n)} + b_{ij}\right) \tag{4}$$

where $h(\cdot)$ denotes the activation function, $b_{ij}$ denotes the bias value on the feature map, the subscripts $i, j$ denote the width and height indices of convolution-kernel element positions on the feature map, $M_i$, $N_i$, $L_i$ denote the width, height, and length of the three-dimensional convolution kernel, $x$ and $y$ denote the position indices of a pixel within the convolution kernel, $z$ denotes the height value at the pixel position, and $l$ denotes the index range of the height values of the input feature map; $k$ is the index of the currently input feature map in the feature map set $L$, and $w_{ij}^{kmn}$ denotes the element value at position $(m, n)$ of the two-dimensional convolution kernel on the $k$-th feature map. The $k$-th input feature map in the convolution operation is

$$u_k^{xy} = F^{xy} \odot \mathbb{1}\{z^{xy} \in [Z_l]\}$$

where $\odot$ denotes the point multiplication of the pixel at each position of the input feature map $F$; the specific rule of the point multiplication is that if the height value of the pixel at position $(x, y)$ on the feature map falls within the range set for the layer, the pixel value at that position is multiplied by 1 and kept unchanged, and if not, the pixel value at that position is multiplied by 0, giving a pixel value of 0. The range set for each layer is determined jointly by the index value $Z_l$ ($Z_l = L_i$) corresponding to the input feature map and the unit grid size $\delta$ in three-dimensional space.
Compared with a standard two-dimensional convolution structure, the computational complexity of the proposed quasi-three-dimensional convolution only adds a point multiplication to each convolution kernel operation, repeated $L_i$ times over the input feature map set $L$. Compared with applying a full three-dimensional convolution over the three-dimensional space of the input feature map, this structure can mine the three-dimensional information in laser radar data in a similar way while adding only a small amount of computation to the network model. The invention uses the efficient perception structure for building three-dimensional spatial geometric information on the third parallel branch of the first layer at the network encoder; FIG. 5 illustrates its operation.
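The following sketch illustrates the quasi-three-dimensional convolution idea: 0/1 height-slab masks derived from the nDSM select pixels per slab, each slab gets its own two-dimensional convolution, and the slab responses are summed. The slab count and class name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class QuasiConv3d(nn.Module):
    """Eq. (4) sketch: the input map is sliced into L_i height slabs via 0/1
    point-multiplication masks from the nDSM, each slab is convolved by its
    own 2-D kernel, and slab responses are summed, mining vertical structure
    at near-2-D cost."""
    def __init__(self, in_ch, out_ch, slabs=4, k=3):
        super().__init__()
        self.slabs = slabs
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for _ in range(slabs))

    def forward(self, x, ndsm):
        zmax = ndsm.amax(dim=(2, 3), keepdim=True).clamp_min(1e-6)
        delta = zmax / self.slabs               # unit grid size in height
        out = 0
        for l, conv in enumerate(self.convs):
            # Mask = 1 where the pixel height falls in slab l, else 0; the
            # top slab is closed so the maximum height is not dropped.
            upper = (l + 1) * delta if l < self.slabs - 1 else zmax + 1
            mask = ((ndsm >= l * delta) & (ndsm < upper)).float()
            out = out + conv(x * mask)          # point-multiply, then 2-D conv
        return out

qc = QuasiConv3d(4, 16, slabs=4)
y = qc(torch.randn(1, 4, 64, 64), torch.rand(1, 1, 64, 64) * 30)
print(y.shape)  # torch.Size([1, 16, 64, 64])
```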
Step 3: inputting the high-level semantic features learned by the network encoder in step 2 into the building multi-scale efficient perception module of the network;
In this embodiment, the building information multi-scale perception module is shown in FIG. 6. It sits between the encoder and decoder of the network model and strengthens the network's ability to extract and represent multi-scale building features in the remote sensing data. The module consists of several parallel branches and serial structures. The input feature map from the encoder enters three branches: the middle branch undergoes a 1 × 1 convolution, which reduces the number of feature map bands and hence the network computation while keeping the spatial size of the feature map, and the other two branches serve as additional input-feature branches for the feature fusion layer at the rear of the module. The middle input-feature branch then stacks and combines convolution layers of different kernel sizes through a series of atrous convolution layers with different dilation rates (d = 6, 12, 18, 24) in serial and parallel branch structures with 1 × 1 convolutions, obtaining several output feature maps under different network receptive field sizes. At the rear of the multi-scale perception module, the bands of the output feature maps from the different branches are fused at the same spatial size; the band-fused feature map carrying multi-scale information then undergoes a 1 × 1 convolution to reduce the number of bands, and is finally passed to the decoder of the network model for the subsequent upsampling operations. By continually stacking and combining atrous convolution layers, the proposed building information multi-scale perception module gives the network model a larger receptive field at the cost of only a small increase in parameters and computation, effectively improving the network's ability to extract multi-scale building feature information.
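An illustrative PyTorch sketch of such a module follows; since the exact serial/parallel wiring and channel counts are not fully specified in the text, this simplification (stacked atrous branches plus two skip branches, band-concatenated and reduced by 1 × 1 convolution) should be read as an approximation:

```python
import torch
import torch.nn as nn

class MultiScalePerception(nn.Module):
    """Sketch of the building multi-scale efficient perception module: a 1x1
    middle branch plus serially stacked atrous convolutions (d = 6, 12, 18,
    24) whose outputs are band-fused with skip branches and reduced by 1x1."""
    def __init__(self, in_ch=2048, mid_ch=256, rates=(6, 12, 18, 24)):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1)   # middle 1x1 branch
        self.atrous = nn.ModuleList(
            nn.Sequential(nn.Conv2d(mid_ch, mid_ch, 3, padding=d, dilation=d),
                          nn.ReLU(inplace=True))
            for d in rates)
        self.skip = nn.Conv2d(in_ch, mid_ch, 1)     # extra input branch
        self.fuse = nn.Conv2d(mid_ch * (len(rates) + 2), mid_ch, 1)

    def forward(self, x):
        mid = self.reduce(x)
        feats = [mid, self.skip(x)]
        y = mid
        for conv in self.atrous:  # serial stacking enlarges the receptive field
            y = conv(y)
            feats.append(y)
        return self.fuse(torch.cat(feats, dim=1))   # band fusion + 1x1 reduce

m = MultiScalePerception()
print(m(torch.randn(1, 2048, 64, 64)).shape)  # torch.Size([1, 256, 64, 64])
```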
Step 4: inputting the low-level semantic features learned at the network encoder and the high-level semantic features output by the building multi-scale efficient perception module into the network decoder to obtain a binary land-cover classification map comprising buildings and non-buildings;
In this embodiment, the network decoder gradually recovers the spatial size of the original input data by combining multi-level feature information (see FIG. 2). The decoder has two types of input features: one is a high-level feature map of spatial size 64 × 64 with 256 bands, produced by the encoder and the building multi-scale information efficient perception module; the other is a lower-level feature map of spatial size 128 × 128 with 256 bands, output by the 3rd network layer of the encoder. The decoder first passes the lower-level feature map through a 1 × 1 convolution layer to reduce its number of bands, balancing the band proportions of the two feature types and improving the learning efficiency of the network model. Meanwhile, the decoder applies 2× bilinear upsampling to the high-level feature map; once its spatial size matches that of the lower-level feature map, the bands of the two features are fused. After the fused feature map passes through several convolution and dropout layers, the decoder applies 4× bilinear upsampling to obtain a feature map with 2 bands and the same spatial size as the network model input image (512 × 512), which is finally fed to the network output layer to obtain the final classification result map, recovering the detail information in the image classification result to the greatest extent.
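A decoder sketch under these shapes follows; the 48-channel reduction of the low-level feature map is an assumption borrowed from common encoder-decoder practice, not a figure given in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the decoder: the 64x64x256 high-level map is upsampled 2x,
    band-fused with the 1x1-reduced 128x128 low-level map from encoder layer
    3, refined by convolutions with dropout, then upsampled 4x to a 2-band
    map matching the 512x512 network input."""
    def __init__(self, low_ch=256, high_ch=256, reduced=48, classes=2):
        super().__init__()
        self.low_reduce = nn.Conv2d(low_ch, reduced, 1)  # balance band counts
        self.refine = nn.Sequential(
            nn.Conv2d(high_ch + reduced, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Dropout2d(0.5),
            nn.Conv2d(256, classes, 3, padding=1))

    def forward(self, high, low):
        high = F.interpolate(high, scale_factor=2, mode='bilinear',
                             align_corners=False)           # 64 -> 128
        x = torch.cat([high, self.low_reduce(low)], dim=1)  # band fusion
        x = self.refine(x)
        return F.interpolate(x, scale_factor=4, mode='bilinear',
                             align_corners=False)           # 128 -> 512

dec = Decoder()
out = dec(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 128, 128))
print(out.shape)  # torch.Size([1, 2, 512, 512])
```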
Step 5: inputting the binary land-cover classification map output by the network decoder into the network loss-function layer to drive the network to learn the optimal weight parameters that fit the target task (the true binary land-cover labels); the network output layer finally outputs the final building classification map, completing end-to-end automatic building extraction.
In this embodiment, the objective function used by the loss-function layer is the cross-entropy loss function. For each of the $n$ target pixels $x^{(n,m)}$ in all $m$ training mini-batches, the function measures the difference between the model prediction $p_k^{(n,m)}$ and the true value $y^{(n,m)}$, so as to "drive" the whole network to learn the optimal weight parameters that fit the target task. The cross-entropy loss is computed as:

$$L_{CE} = -\frac{1}{NM}\sum_{m=1}^{M}\sum_{n=1}^{N}\sum_{k=0}^{K} I\{y^{(n,m)}=k\}\,\log p_k^{(n,m)}, \qquad p_k^{(n,m)} = \frac{e^{f_k(x^{(n,m)})}}{\sum_{k'} e^{f_{k'}(x^{(n,m)})}} \tag{5}$$

where $K$ denotes the land-cover class values participating in classification; in the building classification task of the invention there are only two class values: 0 for the non-building class and 1 for the building class. $N$ and $M$ denote, respectively, all pixel values of the network model input image in each training epoch and the total number of training mini-batches. $f_k(x^{(n,m)})$ denotes the output value of the target pixel $x^{(n,m)}$ at the output end of the network model, and $p_k^{(n,m)}$ the probability value finally output for $x^{(n,m)}$ after normalization by the softmax layer at the end of the network model. Furthermore, $I\{y^{(n,m)}=k\}$ is an indicator function of the true class value: its output is 1 when $y^{(n,m)}=k$, and 0 otherwise.
In a remote sensing land-cover classification task, the number of non-building pixels in the sample data set is often far larger than the number of building pixels, and this imbalance in the number and distribution of pixel values of different classes can greatly affect the training efficiency and classification performance of the network model. To solve this problem, the invention trains the network model with an objective function combining the focal loss function and the pixel median-frequency balancing method. The focal loss is a modification of the cross-entropy loss: an adjustable parameter $\gamma$ gradually reduces the proportion of easy samples in the total loss of the network model, making the model pay more attention to the hard samples in the training data. The focal loss is computed as:

$$L_{FL} = -\frac{1}{NM}\sum_{m=1}^{M}\sum_{n=1}^{N}\sum_{k=0}^{K} I\{y^{(n,m)}=k\}\left(1 - p_k^{(n,m)}\right)^{\gamma}\log p_k^{(n,m)} \tag{6}$$

where $\gamma$ is an adjustable parameter that lets the network model focus more on difficult samples. In the network model of the invention, the best building extraction results are achieved when this parameter is set to 2.
The pixel median-frequency balancing method then applies different weights to the loss values computed by the focal loss function, based on the number of pixels of each true land-cover class, so as to balance the classes within the network model. The final objective function of the network model is:

$$L = \sum_{k=0}^{K} w_k\, L_{FL}^{(k)}, \qquad w_k = \frac{\mathrm{median}\left(\{f_k \mid k \in K\}\right)}{f_k} \tag{7}$$

where $w_k$ is the median-frequency weight obtained for class $k$ in the training samples, $f_k$ denotes the pixel frequency of class $k$, and $\mathrm{median}(\{f_k \mid k \in K\})$ denotes the median pixel frequency over all classes in the training samples.
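A compact PyTorch sketch of the combined objective of Equations (6) and (7) follows; the function name and the example class frequencies are illustrative:

```python
import torch
import torch.nn.functional as F

def focal_mfb_loss(logits, target, freqs, gamma=2.0):
    """Per-pixel focal loss with gamma = 2 (Eq. (6)), weighted per class by
    median-frequency balancing w_k = median(f) / f_k (Eq. (7)); `freqs`
    holds the training-set pixel frequency f_k of each class."""
    w = freqs.median() / freqs                       # Eq. (7) class weights
    logp = F.log_softmax(logits, dim=1)              # (b, K, h, w)
    p = logp.exp()
    onehot = F.one_hot(target, logits.size(1)).permute(0, 3, 1, 2).float()
    focal = -((1 - p) ** gamma) * logp               # Eq. (6) modulating factor
    return (onehot * focal * w.view(1, -1, 1, 1)).sum(1).mean()

# Two classes: 0 = non-building, 1 = building (buildings are the rare class).
logits = torch.randn(2, 2, 64, 64, requires_grad=True)
target = (torch.rand(2, 64, 64) > 0.8).long()
freqs = torch.tensor([0.8, 0.2])
print(focal_mfb_loss(logits, target, freqs))
```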
To demonstrate the effectiveness of the method, it was tested on the remote sensing data selected in step 1, which contains many different urban scenes. The experimental results in Tables 1 and 2 show that, compared with other advanced building extraction networks, the proposed automatic building extraction method achieves the best extraction accuracy: the overall accuracy (OA) on test data set 1 reaches 95.68% with a mean intersection-over-union (mIoU) of 91.78%, and the overall accuracy on test data set 2 reaches 97.30% with an mIoU of 91.32%. The method also achieves the best building extraction efficiency: the model size is only 53.00 MB, and the computational time cost is only 83.88 ms (the forward propagation time of the model, measured with the time function of the PyTorch open-source deep learning library as the average over 50 network model iterations). The test results show that the method has good application prospects in automatic remote sensing building extraction research.
TABLE 1 (accuracy and efficiency comparison on test data set 1; the table is reproduced as an image in the original publication)
TABLE 2 (accuracy and efficiency comparison on test data set 2; the table is reproduced as an image in the original publication)
It should be understood that parts of the specification not set forth in detail are of the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A building automatic extraction method fusing geometric perception and image understanding is characterized by comprising the following steps:
step 1: selecting and preprocessing remote sensing data to obtain an experimental data set consisting of a first data set, a second data set, a normalized digital surface model and a building real type label; the first data set comprises a plurality of red, green and blue waveband multiband remote sensing images, and the second data set comprises a plurality of red, green and near-infrared waveband multiband remote sensing images; wherein one part of the first data set is used as a training and verifying data set, and the other part of the first data set is used as a testing data set; one part of the second data set is used as a training and verifying data set, and the other part of the second data set is used as a testing data set;
step 2: inputting the preprocessed multiband remote sensing image and the corresponding normalized digital surface model as an input feature map to a network encoder end for feature learning;
the network encoder is an improved deep residual network ResNet-101; wherein the improvements comprise: (1) a network layer formed by three parallel branches is added at the front end of the deep residual network, wherein the first branch is formed by a 3 × 3 standard convolution operation layer, a batch normalization layer, and an activation layer structure, and the second and third branches, on the basis of the first branch structure, replace the 3 × 3 standard convolution operation layer with, respectively, a convolution operation layer for building local height-similar information and an efficient perception operation layer for building three-dimensional spatial geometric information; (2) in the 5th and 6th network layers, the ordinary 3 × 3 convolution kernels are replaced with atrous convolution kernels of dilation rates 2 and 4, respectively; (3) a structure capable of efficiently perceiving building local height-similar information is introduced into each layer of the network encoder;
in step 2, the working principle of the structure for efficiently perceiving building local height-similar information is as follows: within the $h \times w$ spatial extent of a convolution kernel, a pixel $P_{ij}$ whose height is similar to that of the central pixel $P_0$ is considered more strongly correlated with the category of the central pixel, and is then given a larger weight by a function model in the convolution operation, yielding a convolution output feature map containing the spatial geometric relationships between pixels;

the Gaussian function model describing the height-similar information between pixels in each convolution operation is:

$$G(d_{ij}) = \exp\left(-\frac{(d_{ij}-d_0)^2}{2\delta^2}\right) \tag{1}$$

where $d_0$ denotes the height value of the central pixel $P_0$ of the convolution kernel, $P_{ij}$ denotes any pixel within the convolution kernel, $d_{ij}$ denotes the height values of the different pixels within the $h \times w$ spatial extent centered on pixel $P_0$, the subscripts $i$ and $j$ denote the width and height indices of pixel positions within the convolution kernel, and $\delta$ denotes the unit size of the two-dimensional convolution kernel, serving as the standard-deviation term in the Gaussian function model;
the convolution operation function model is:

$$y_0 = \sum_{i=1}^{h}\sum_{j=1}^{w} \tilde{w}_{ij}\, x_{ij} + b \tag{2}$$

where $h$ and $w$ denote the width and height of the convolution kernel, $b$ denotes the bias value on the feature map, $\tilde{w}_{ij} = G(d_{ij})\, w_{ij}$ denotes the redistributed convolution kernel weight computed from the Gaussian function model, and $w_{ij}$ denotes the weight of the original convolution kernel operation;
the pooling operation function model with efficient perception of building local information is:

$$y_0 = \frac{\sum_{i=1}^{h}\sum_{j=1}^{w} G(d_{ij})\, x_{ij}}{\sum_{i=1}^{h}\sum_{j=1}^{w} G(d_{ij})} \tag{3}$$

where $G(\cdot)$ represents the Gaussian function model;
the structure for efficiently perceiving building three-dimensional spatial geometric information first obtains the maximum height value of the pixels in the input feature map, divides that maximum height evenly into $L_i$ intervals to obtain a feature map set $L$, and then applies an independent two-dimensional convolution to each feature map in the set, the operation being:

$$v_{ij}^{xy} = h\left(\sum_{k=1}^{L_i}\sum_{m=0}^{M_i-1}\sum_{n=0}^{N_i-1} w_{ij}^{kmn}\, u_k^{(x+m)(y+n)} + b_{ij}\right) \tag{4}$$

where $h(\cdot)$ denotes the activation function, $b_{ij}$ denotes the bias value on the feature map, the subscripts $i, j$ denote the width and height indices of convolution-kernel pixel positions on the feature map, $M_i$, $N_i$, $L_i$ denote the width, height, and length of the three-dimensional convolution kernel, $x$ and $y$ denote the position indices of a pixel within the convolution kernel, $z$ denotes the height value at the pixel position, and $l$ denotes the index range of the height values of the input feature map; $k$ is the index of the currently input feature map in the feature map set $L$, and $w_{ij}^{kmn}$ denotes the element value at $(m, n)$ of the two-dimensional convolution kernel on the $k$-th feature map; the input feature map in the convolution operation is

$$u_k^{xy} = F^{xy} \odot \mathbb{1}\{z^{xy} \in [Z_l]\}$$

where $\odot$ denotes the point multiplication of the pixel at each position of the input feature map $F$; the specific rule of the point multiplication is that if the height value of the pixel at position $(x, y)$ on the feature map falls within the range set for each layer of the feature map set $L$, the pixel value at that position is multiplied by 1 and kept unchanged, and if not, the pixel value at that position is multiplied by 0, giving a pixel value of 0; the range set for each layer is determined jointly by the index value $Z_l$ corresponding to the input feature map and the unit grid size $\delta$ in three-dimensional space, where $Z_l = L_i$;
Step 3: inputting the high-level semantic features learned at the network encoder end in step 2 into the building multi-scale efficient perception module of the network;
in step 3, the high-level feature map extracted at the network encoder end is input into the three branches of the building multi-scale efficient perception module; the high-level feature map undergoes a 1 × 1 convolution operation in the middle branch, while the other two branches serve as additional input-feature branches for the feature-fusion layer at the rear of the module; the middle input-feature branch then produces several output feature maps under different network receptive-field sizes by stacking and combining convolution layers of different kernel sizes, through a series of atrous (dilated) convolution layers with different dilation rates arranged in serial-branch and parallel-branch structures together with 1 × 1 convolutions; at the rear of the building multi-scale efficient perception module, the bands of the output feature maps obtained from the different branches are fused at the same spatial size, the band-fused high-level feature map carrying multi-scale information then undergoes a 1 × 1 convolution operation, and it is finally output to the network decoder end;
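For illustration only, a branch structure of this kind (a 1 × 1 branch plus dilated convolutions at several rates, fused by band concatenation) might be sketched with PyTorch under assumed channel counts and dilation rates; this is a generic sketch, not the patented module:

    import torch
    import torch.nn as nn

    class MultiScaleModule(nn.Module):
        # parallel 1x1 conv plus dilated 3x3 convs at several rates,
        # band-concatenated and fused by a final 1x1 conv
        def __init__(self, c_in, c_out, rates=(2, 4, 8)):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Conv2d(c_in, c_out, 1)] +
                [nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r) for r in rates])
            self.fuse = nn.Conv2d(c_out * (len(rates) + 1), c_out, 1)

        def forward(self, x):
            return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

    x = torch.randn(1, 256, 32, 32)
    print(MultiScaleModule(256, 64)(x).shape)  # torch.Size([1, 64, 32, 32])

Setting padding equal to the dilation rate keeps every branch at the same spatial size, which is what allows the band-wise fusion at the module's rear end.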
Step 4: inputting the low-level semantic features learned at the network encoder end and the high-level semantic features output by the building multi-scale efficient perception module into the network decoder end to obtain a binary ground-object classification map comprising buildings and non-buildings;
in step 4, the network decoder end has two kinds of input features: one is the high-level feature map obtained from the network encoder end and the building multi-scale efficient perception module; the other is the lower-level feature map output by the 3rd network layer at the network encoder end; the decoder end feeds the lower-level feature map into a 1 × 1 convolution layer; meanwhile, the high-level feature map is upsampled 2× bilinearly, and once the two features share the same spatial size as the lower-level feature map they are band-fused; after several layers of operations on the fused feature map, the network decoder end performs a 4× bilinear upsampling operation to obtain a feature map with the same spatial size as the network model's input image;
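A sketch of this decoder-side fusion under assumed channel counts (PyTorch, illustrative names): the high-level map is upsampled 2× bilinearly, band-fused with the 1 × 1-convolved lower-level map, refined, then upsampled 4×:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Decoder(nn.Module):
        def __init__(self, c_low, c_high, c_mid=48, n_classes=2):
            super().__init__()
            self.reduce = nn.Conv2d(c_low, c_mid, 1)       # 1x1 conv on low-level map
            self.refine = nn.Sequential(                   # "several layers" after fusion
                nn.Conv2d(c_high + c_mid, 256, 3, padding=1), nn.ReLU(),
                nn.Conv2d(256, n_classes, 1))

        def forward(self, low, high):
            high = F.interpolate(high, scale_factor=2, mode='bilinear',
                                 align_corners=False)       # 2x bilinear upsampling
            x = self.refine(torch.cat([self.reduce(low), high], dim=1))
            return F.interpolate(x, scale_factor=4, mode='bilinear',
                                 align_corners=False)       # 4x back to input size

    low = torch.randn(1, 256, 64, 64)   # from the 3rd encoder layer (illustrative)
    high = torch.randn(1, 64, 32, 32)   # from the multi-scale module
    print(Decoder(256, 64)(low, high).shape)  # torch.Size([1, 2, 256, 256])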
Step 5: inputting the binary ground-object classification map output by the network decoder end into the network loss-function layer to drive the network to learn the optimal weight parameters that fit the target task; the network output layer finally outputs the final building classification map, completing the end-to-end automatic extraction of buildings.
2. The building automatic extraction method integrating geometric perception and image understanding according to claim 1, characterized in that: in step 1, the selected remote sensing data comprise a high-resolution orthorectified remote sensing image, the digital surface model and digital elevation model corresponding to the remote sensing image, and the building ground-truth class labels obtained by manually annotating the remote sensing image.
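One common way to turn these two elevation products into a per-pixel building-height channel is to difference them; a short sketch with illustrative file names, assuming co-registered rasters (this preprocessing step is not itself recited in the claim):

    import numpy as np

    dsm = np.load("dsm.npy")                  # digital surface model
    dem = np.load("dem.npy")                  # digital elevation model
    ndsm = np.clip(dsm - dem, 0.0, None)      # per-pixel height above ground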
3. The building automatic extraction method integrating geometric perception and image understanding according to any one of claims 1-2, characterized in that: in step 5, the network loss-function layer trains the network model using an objective function that combines a focal loss function with a pixel median-frequency balancing method, finally obtains the building classification result map, and completes the end-to-end automatic extraction of buildings;
the objective function used by the loss-function layer is a cross-entropy loss function, obtained by calculating the degree of fit between the model prediction \hat{y}^{(n,m)} for the n-th target pixel x^{(n,m)} among all M training batches of sample data and the corresponding true value y^{(n,m)}; it is used to drive the whole network to learn the optimal weight parameters that fit the target task;
the cross-entropy loss function is:

    L_{CE} = -\frac{1}{N M} \sum_{m=1}^{M} \sum_{n=1}^{N} \sum_{k} I\{y^{(n,m)} = k\} \log p_k\big(x^{(n,m)}\big),

    p_k\big(x^{(n,m)}\big) = \frac{\exp\!\big(a_k(x^{(n,m)})\big)}{\sum_{k'} \exp\!\big(a_{k'}(x^{(n,m)})\big)}

wherein k represents the ground-object target category values participating in the classification, taking the values 0 and 1: 0 represents the non-building ground-object category and 1 represents the building ground-object category; N and M respectively represent the number of pixels of the network model's input images in each training round and the total number of training batches; a_k(x^{(n,m)}) represents the output value for the target pixel x^{(n,m)} at the end of the network model, and p_k(x^{(n,m)}) represents the normalized probability value finally output for the target pixel x^{(n,m)} after the softmax layer at the end of the network model; I{y^{(n,m)} = k} is an indicator function of the true ground-object class, whose value is 1 when y^{(n,m)} = k and 0 when y^{(n,m)} ≠ k;
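A numeric check of the formula above, assuming NumPy, with logits standing in for the network outputs a_k and integer labels for the indicator I{y = k}:

    import numpy as np

    def softmax_ce(logits, labels):
        # mean cross-entropy over P pixels; logits: (P, K) raw outputs a_k
        z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
        p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        return -np.mean(np.log(p[np.arange(len(labels)), labels]))

    logits = np.array([[2.0, 0.5], [0.2, 1.8]])   # two pixels, K = 2 classes
    print(softmax_ce(logits, np.array([0, 1])))   # ~0.19: both pixels classified well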
the focal loss function is:

    L_{FL} = -\frac{1}{N M} \sum_{m=1}^{M} \sum_{n=1}^{N} \sum_{k} I\{y^{(n,m)} = k\} \,\big(1 - p_k(x^{(n,m)})\big)^{\gamma} \log p_k\big(x^{(n,m)}\big)

wherein γ is an adjustable parameter that makes the network model focus more on hard samples;
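Extending the same sketch, the focal term simply scales each pixel's log-loss by (1 − p_true)^γ; a minimal version:

    import numpy as np

    def focal_loss(logits, labels, gamma=2.0):
        # focal term: well-classified pixels (p_true near 1) contribute almost nothing
        z = logits - logits.max(axis=1, keepdims=True)
        p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
        p_true = p[np.arange(len(labels)), labels]
        return -np.mean((1.0 - p_true) ** gamma * np.log(p_true))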
then, the pixel median-frequency balancing method weights the loss values computed by the focal loss function differently, based on the number of pixels of each ground object's true category, so as to balance the classification of the various ground objects within the network model; the final objective function of the network model is:

    L = -\frac{1}{N M} \sum_{m=1}^{M} \sum_{n=1}^{N} \sum_{k} w_k \, I\{y^{(n,m)} = k\} \,\big(1 - p_k(x^{(n,m)})\big)^{\gamma} \log p_k\big(x^{(n,m)}\big),

    w_k = \frac{\mathrm{median}\big(\{f_k \mid k \in K\}\big)}{f_k}

wherein w_k is the median-frequency weight obtained for class k in the training samples, f_k represents the pixel frequency of class k, and median({f_k | k ∈ K}) represents the median of the pixel frequencies of all classes in the training samples.
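The median-frequency weights themselves can be computed from label counts; a short sketch assuming a stack of training label maps:

    import numpy as np

    def median_frequency_weights(label_maps, n_classes=2):
        # w_k = median({f_k}) / f_k, with f_k the pixel frequency of class k
        counts = np.bincount(np.concatenate([m.ravel() for m in label_maps]),
                             minlength=n_classes).astype(float)
        freqs = counts / counts.sum()
        return np.median(freqs) / freqs

    labels = [np.random.randint(0, 2, (64, 64)) for _ in range(4)]
    print(median_frequency_weights(labels))   # rarer classes receive larger weights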
CN202010757389.XA 2020-07-31 2020-07-31 Building automatic extraction method integrating geometric perception and image understanding Active CN111898543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010757389.XA CN111898543B (en) 2020-07-31 2020-07-31 Building automatic extraction method integrating geometric perception and image understanding

Publications (2)

Publication Number Publication Date
CN111898543A CN111898543A (en) 2020-11-06
CN111898543B true CN111898543B (en) 2022-06-07

Family

ID=73182843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010757389.XA Active CN111898543B (en) 2020-07-31 2020-07-31 Building automatic extraction method integrating geometric perception and image understanding

Country Status (1)

Country Link
CN (1) CN111898543B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633140B (en) * 2020-12-21 2023-09-01 华南农业大学 Multi-spectrum remote sensing image city village multi-category building semantic segmentation method and system
CN112465844A (en) * 2020-12-29 2021-03-09 华北电力大学 Multi-class loss function for image semantic segmentation and design method thereof
CN112766155A (en) * 2021-01-19 2021-05-07 山东华宇航天空间技术有限公司 Deep learning-based mariculture area extraction method
CN112949407B (en) * 2021-02-02 2022-06-14 武汉大学 Remote sensing image building vectorization method based on deep learning and point set optimization
CN113221648B (en) * 2021-04-08 2022-06-03 武汉大学 Fusion point cloud sequence image guideboard detection method based on mobile measurement system
CN113205018B (en) * 2021-04-22 2022-04-29 武汉大学 High-resolution image building extraction method based on multi-scale residual error network model
CN113378897A (en) * 2021-05-27 2021-09-10 浙江省气候中心 Neural network-based remote sensing image classification method, computing device and storage medium
CN113284171B (en) * 2021-06-18 2023-04-07 成都天巡微小卫星科技有限责任公司 Vegetation height analysis method and system based on satellite remote sensing stereo imaging
CN113936217B (en) * 2021-10-25 2024-04-30 华中师范大学 Priori semantic knowledge guided high-resolution remote sensing image weak supervision building change detection method
CN113963177A (en) * 2021-11-11 2022-01-21 电子科技大学 CNN-based building mask contour vectorization method
CN114092832B (en) * 2022-01-20 2022-04-15 武汉大学 High-resolution remote sensing image classification method based on parallel hybrid convolutional network
CN115564926B (en) * 2022-12-06 2023-03-10 武汉大学 Three-dimensional patch model construction method based on image building structure learning
CN115858519B (en) * 2023-02-27 2023-05-16 航天宏图信息技术股份有限公司 DEM leveling method and device
CN116883745B (en) * 2023-07-13 2024-02-27 南京恩博科技有限公司 Animal positioning model and method based on deep learning
CN117372669B (en) * 2023-12-07 2024-03-08 北京新兴科遥信息技术有限公司 Moving object detection device based on natural resource image
CN117456530B (en) * 2023-12-20 2024-04-12 山东大学 Building contour segmentation method, system, medium and equipment based on remote sensing image

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105608458A (en) * 2015-10-20 2016-05-25 武汉大学 High-resolution remote sensing image building extraction method
CN110889449A (en) * 2019-11-27 2020-03-17 中国人民解放军国防科技大学 Edge-enhanced multi-scale remote sensing image building semantic feature extraction method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604450A (en) * 2009-07-24 2009-12-16 武汉大学 The method of integrated images and LiDAR data extract contour of building
EP3534296A1 (en) * 2018-02-28 2019-09-04 Chanel Parfums Beauté A method for building a computer-implemented tool for assessment of qualitative features from face images
US11087165B2 (en) * 2018-11-29 2021-08-10 Nec Corporation Method and system for contextualizing automatic image segmentation and regression
CN110516539A (en) * 2019-07-17 2019-11-29 苏州中科天启遥感科技有限公司 Remote sensing image building extracting method, system, storage medium and equipment based on confrontation network
CN110781775B (en) * 2019-10-10 2022-06-14 武汉大学 Remote sensing image water body information accurate segmentation method supported by multi-scale features
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion

Also Published As

Publication number Publication date
CN111898543A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN111898543B (en) Building automatic extraction method integrating geometric perception and image understanding
CN110363215B (en) Method for converting SAR image into optical image based on generating type countermeasure network
Wang et al. Scene classification of high-resolution remotely sensed image based on ResNet
Liu et al. Deep convolutional neural network training enrichment using multi-view object-based analysis of Unmanned Aerial systems imagery for wetlands classification
CN110728658A (en) High-resolution remote sensing image weak target detection method based on deep learning
Tong et al. A supervised and fuzzy-based approach to determine optimal multi-resolution image segmentation parameters
CN108596108B (en) Aerial remote sensing image change detection method based on triple semantic relation learning
CN103839267B (en) Building extracting method based on morphological building indexes
CN113569815B (en) Method for detecting remote sensing image change based on image segmentation and twin neural network
Zeng et al. Recognition and extraction of high-resolution satellite remote sensing image buildings based on deep learning
Crommelinck et al. Interactive cadastral boundary delineation from UAV data
CN112819066A (en) Res-UNet single tree species classification technology
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
Bosch et al. Metric evaluation pipeline for 3d modeling of urban scenes
Benbahrıa et al. Intelligent mapping of irrigated areas from Landsat 8 images using transfer learning
CN114943902A (en) Urban vegetation unmanned aerial vehicle remote sensing classification method based on multi-scale feature perception network
Jiang et al. Arbitrary-shaped building boundary-aware detection with pixel aggregation network
Li et al. 3DCentripetalNet: Building height retrieval from monocular remote sensing imagery
CN117496347A (en) Remote sensing image building extraction method, device and medium
CN105654462A (en) Building elevation extraction method based on image registration
CN115690597A (en) Remote sensing image urban ground feature change detection method based on depth background difference
CN114694022A (en) Spherical neighborhood based multi-scale multi-feature algorithm semantic segmentation method
Lian et al. End-to-end building change detection model in aerial imagery and digital surface model based on neural networks
He et al. Fast and Accurate Sea-Land Segmentation Based on Improved SeNet and Coastline Database for Large-Scale Image
Yang et al. Improving Semantic Segmentation Performance by Jointly Using High Resolution Remote Sensing Image and Ndsm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant