CN112183432B - Building area extraction method and system based on medium-resolution SAR image - Google Patents

Building area extraction method and system based on medium-resolution SAR image

Info

Publication number
CN112183432B
CN112183432B (application CN202011084022.2A)
Authority
CN
China
Prior art keywords
extraction
building
extraction network
building area
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011084022.2A
Other languages
Chinese (zh)
Other versions
CN112183432A (en
Inventor
吴樊
王超
李娟娟
张红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS filed Critical Aerospace Information Research Institute of CAS
Priority to CN202011084022.2A priority Critical patent/CN112183432B/en
Publication of CN112183432A publication Critical patent/CN112183432A/en
Application granted granted Critical
Publication of CN112183432B publication Critical patent/CN112183432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V20/176 — Scenes; Terrestrial scenes; Urban or other man-made structures
    • G06F18/213 — Pattern recognition; Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2148 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
    • G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 — Neural networks; Combinations of networks
    • G06N3/084 — Learning methods; Backpropagation, e.g. using gradient descent
    • G06V10/267 — Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

The present disclosure relates to a building area extraction method and system based on medium-resolution SAR images. First, all SAR images are preprocessed, and pixel-level labeled samples of multiple building classes are produced from the preprocessed images. A building area extraction network is then constructed, comprising an attention-based feature extraction network and a pyramid multi-scale city extraction network with a multi-scale pyramid structure; the network is trained and tested with training and test data sets and optimized. Finally, the preprocessed SAR images are input into the building area extraction network to obtain a preliminary extraction result, which is post-processed to optimize the result. The method improves the precision of building area extraction and effectively removes the influence of non-building areas such as vegetation.

Description

Building area extraction method and system based on medium-resolution SAR image
Technical Field
The present disclosure relates to the field of image processing, and more particularly, to a method and a system for extracting a building area based on a medium-resolution SAR image.
Background
Urbanization has now become a global trend affecting most citizens of the world. With the acceleration of the urbanization process, accurate and timely urban information acquisition is important for urban risk assessment, infrastructure planning, urban spread, population estimation, environmental protection and urban sustainable development.
Satellite remote sensing is widely recognized as the most economically feasible way to obtain human settlement information over large areas. Previous regional or global land cover products are mostly based on time-series optical satellite data. Their overall extraction quality is acceptable, and regional or global urban extents have been mapped through a series of data sets and algorithms, but the urban extraction results cannot meet the requirements of urban change detection. Moreover, using optical data for regional or global urban mapping remains a significant challenge because cloudy and rainy weather limits data availability.
Compared with optical data, the all-day, all-weather availability of SAR data and its unique surface feature information give it an advantage for earth surface observation, and it is increasingly applied to global urban extent extraction. In 2012, the Esch research team at DLR in Germany experimentally verified the potential of high-resolution X-band SAR data for automatically mapping human settlements, and released a brand-new unsupervised, fully automatic classification system, the Urban Footprint Processor (UFP), developed on TanDEM-X Mission (TDM) data. Inspired by subsequent research results, the team later added an automatic editing module to the original system to optimize the UFP and eliminate errors; the method was then applied globally, producing the Global Urban Footprint (GUF) data set with a spatial resolution of 12 meters. In 2013, Gamba and Lisini et al. developed a fast and efficient method for extracting the global urban extent from ENVISAT ASAR wide-swath data with a resolution of 75 m; experiments showed the extraction results to be more accurate than existing global land use data sets (including GlobCover 2009). To evaluate the potential of ENVISAT SAR data for global urban extraction, Ban et al. developed the KTH-Pavia urban extractor in 2014, using ENVISAT ASAR 30 m data to efficiently extract central cities and small towns. In 2015, Jacob, Ban et al. evaluated the KTH-Pavia urban extractor for urban extent extraction using Sentinel-1A SAR data; preliminary results showed that the Sentinel-1A stripmap mode is very suitable for urban extraction, with an accuracy above 83%. Cao et al. introduced spatial indices, texture features, and the intensity of Sentinel-1 SAR data into the seed selection process and successfully extracted built-up areas of Chinese cities.
Traditional algorithms for extracting urban areas from SAR data mainly include threshold methods based on texture features and intensity, support vector machines (SVM), neural networks, and the like. In recent years, deep learning has achieved remarkable results in classification and object recognition, and many scholars have used deep learning methods to address the multi-scale distribution of building areas in SAR data. These methods contribute greatly to solving the multi-scale distribution problem of buildings in SAR images, but they have not performed large-scale regional mapping over complex terrain containing multiple building types. At present, urban building area extraction against the complex background of large regions faces two major problems. (1) Omission of low buildings. Low and densely distributed buildings, such as those in rural areas, have weak scattering echoes, so such building areas show low backscattering values in SAR images. Some flat-roofed buildings also scatter weakly from the roof and show only contours in the SAR image. In rural China, some smaller villages have low backscattering values and are easily missed. (2) False alarms. SAR image pixels are spatially correlated, and some landscapes are physically very similar to building areas: scattered forests, or paddy fields and wetlands mixed with sparse vegetation, can also produce high backscattering values; in addition, some bridges and ships have large backscattering values and are easily misclassified.
Therefore, in order to realize high-precision mapping of building areas in complex areas, rich deep texture feature information of the building areas needs to be acquired, and meanwhile, a network needs to have the performance of extracting very small or very large urban areas.
Disclosure of Invention
The present disclosure is provided to solve the above-mentioned problems in the prior art. It proposes a building area extraction network that combines an attention-based feature extraction network with a pyramid multi-scale city extraction network of multi-scale pyramid structure. The attention-based feature extraction network focuses attention on the main target, effectively improving the accuracy of target extraction and classification; the pyramid multi-scale city extraction network considers target features under multiple receptive fields in parallel and is suitable for extracting multi-scale building areas. In addition, to address the unbalanced ratio of positive and negative samples in pixel-level classification, the present disclosure introduces Focal Loss into the classifier to replace the original cross-entropy loss function, further improving classification precision.
According to a first aspect of the present disclosure, there is provided a building region extraction method based on a medium-resolution SAR image, the extraction method including:
s1, establishing pixel level marking samples of multi-category building areas;
s2, building area extraction networks are established, wherein the building area extraction networks comprise a feature extraction network based on an attention mechanism and a pyramid multi-scale city extraction network with a multi-scale pyramid structure;
s3, training and testing the extracted network of the building area, and optimizing the network;
s4, inputting the preprocessed SAR image into an optimized building area extraction network to obtain a building area extracted preliminarily;
and S5, carrying out post-processing on the building area subjected to primary extraction, and optimizing an extraction result.
According to a second aspect of the present disclosure, a building area extraction system based on a medium-resolution SAR image is provided, which includes a sample establishment module, a building area extraction network establishment module, an optimization module, a preliminary extraction module, and a post-processing module:
the sample establishment module: used for establishing pixel-level labeled samples of multi-class building areas;
the building area extraction network establishing module: used for establishing the building area extraction network, which comprises an attention-based feature extraction network module and a pyramid multi-scale city extraction network module with a multi-scale pyramid structure;
the optimization module: used for training and testing the building area extraction network and performing network optimization;
the preliminary extraction module: used for inputting the preprocessed SAR image into the optimized building area extraction network to obtain a preliminarily extracted building area;
the post-processing module: used for post-processing the preliminarily extracted building area and optimizing the extraction result.
According to a third aspect of the present disclosure, a non-volatile computer storage medium is provided, on which a computer program is stored; when a processor executes the computer program, the steps of the above building area extraction method based on medium-resolution SAR images are implemented.
According to the present disclosure, during training of the building area extraction network, the established pixel-level labeled samples of multi-class building areas allow the network to learn the texture, intensity, and spatial relationship characteristics of multiple building areas well. The disclosure combines an attention-based feature extraction network and a pyramid multi-scale city extraction network of multi-scale pyramid structure into the building area extraction network. The multi-scale input layer lets the network fully learn building area features in SAR images, and the pyramid multi-scale city extraction network extracts the multi-scale features of the building area; the combination of deep and multi-scale features improves the accuracy of building area extraction, and the attention gating module of the decoder concentrates attention on the extraction and identification of building area targets, further improving extraction precision. The disclosure also incorporates the rich spectral information of ground features in optical data, effectively removing the influence of vegetation and other non-building ground features and further improving extraction precision.
Drawings
In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in different views. Like reference numerals having letter suffixes or different letter suffixes may represent different instances of similar components. The drawings illustrate various embodiments generally by way of example and not by way of limitation, and together with the description and claims serve to explain the disclosed embodiments. The same reference numbers will be used throughout the drawings to refer to the same or like parts, where appropriate. Such embodiments are illustrative, and are not intended to be exhaustive or exclusive embodiments of the present apparatus or method.
Fig. 1 shows a flow chart of a method for building region extraction based on medium resolution SAR images according to an embodiment of the present disclosure;
FIG. 2 shows a schematic structural diagram of a building zone extraction network according to an embodiment of the present disclosure;
FIG. 3 shows a schematic structural diagram of a pyramid multi-scale city extraction network according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a medium resolution SAR image based architectural region extraction system according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a sample setup module according to an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of an attention mechanism based feature extraction network module in accordance with an embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of a pyramid multi-scale city extraction network module, according to an embodiment of the present disclosure;
FIG. 8 shows a schematic diagram of an optimization module according to an embodiment of the present disclosure;
FIG. 9 shows city area mapping results for four areas according to the present disclosure;
FIG. 10 shows the building zone results for four areas extracted according to the present disclosure;
FIG. 11 shows the comparison of four region extraction results with the GUF product according to the present disclosure;
FIG. 12 shows a comparison of Beijing City center urban areas with GUF products, extracted according to the present disclosure;
FIG. 13 shows the comparison of the complex and suburban areas extracted according to the present disclosure with GUF products;
FIG. 14 shows the extraction results compared to optical products according to the present disclosure;
FIG. 15 shows urban results extracted in accordance with the present disclosure compared to Attention U-Net and Residual U-Net; and
FIG. 16 shows the results of the extraction of the Sentinel-1 and ALOS-2/PALSAR-2 data according to an embodiment of the disclosure.
Detailed Description
For a better understanding of the technical aspects of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings. Embodiments of the present disclosure are described in further detail below with reference to the figures, but the present disclosure is not limited thereto. Where steps have no required ordering between them, the order in which they are described should not be construed as a limitation; one skilled in the art will know that the sequence may be adjusted without destroying the logical relationship between steps or rendering the overall process impractical.
Fig. 1 shows a flowchart of a method for building region extraction based on medium resolution SAR images according to an embodiment of the present disclosure. As shown in fig. 1, the method comprises the following steps:
s1, establishing pixel level marking samples of multi-category building areas;
in the SAR image, buildings such as office buildings, shopping malls, high-rise residential areas and the like in a central urban area are high, a superposition area formed by the buildings is wide, the overall performance of the buildings is high in echo characteristics, and the backscattering value of the building area in the SAR image is high and is easy to distinguish from non-buildings; in low and densely distributed building areas, such as urban villages and the like, the echo characteristics of the overall performance of the buildings are weak and are not easily distinguished from non-buildings. And the flat top buildings such as a factory building and the like mainly generate single scattering, so that the scattering echo is weak, and only the contour is shown in the SAR image. Generally, the urban distribution of each city has the multi-scale characteristic, the central urban building areas are distributed in a piece, and the suburban building areas are small in scale and are scattered. However, different cities are influenced by factors such as terrain and history, and the spatial layout of buildings still has certain difference. Some cities have more plants and villages in the city; the relief of part of urban terrains is large, and buildings are distributed along roads and rivers. While some cities have smaller residential areas and are scattered on plains.
The present disclosure adopts the following pixel-level sample labeling strategy for multi-class building areas, including city central areas, mountain villages, plain villages, urban villages, and the like:
s11, carrying out radiometric calibration, contrast enhancement and geocoding pretreatment on all SAR images;
differences in the nature, height and size of buildings can lead to differences in scattering characteristics in SAR images. The echo intensities of tall buildings and short buildings are greatly different, the scale difference of building areas enables the building areas in the SAR image to be characterized as multi-scale, and vegetation, roads and mountains have similarity with the scattering characteristics of the buildings under certain imaging conditions. To improve accuracy, the pictures need to be preprocessed.
S12, making pixel-level labeled samples of the multi-class buildings from the preprocessed SAR images, cutting the sample images into slices of n × n size, and setting an overlapping area of several pixels during slice production to prevent the boundaries of buildings with small areas from being damaged;
The present disclosure cuts a sample image into 256 × 256 slices. When the sample is cut, the boundaries of small buildings are easily damaged, so an overlapping area of 30 pixels is set during slice production to ensure that these boundaries are not damaged and to improve sample quality.
S13, removing the incomplete slices in the sample;
and removing the incomplete slices in the sample to ensure the sample quality, such as the edge of the SAR image, containing less building information, wherein the slices need to be removed, so that the classification accuracy is improved.
S14, determining whether the slice contains a building area target, if so, taking the slice as a positive sample, and if not, taking the slice as a negative sample, and respectively storing the positive sample and the negative sample in different subsets;
according to the actual conditions of the research area, plain and coastal terrain flat areas, the differentiable degree of a building area and a non-building area is high, and the two types of samples in the sample set are positive samples; the mountain area building samples comprise positive samples and negative samples, so that the building area extraction network can be ensured to fully learn the building characteristics and the mountain body characteristics, and can be accurately distinguished, and the accuracy of the classification result of the building area extraction network is improved.
S2, building area extraction networks are established, wherein the building area extraction networks comprise a feature extraction network based on an attention mechanism and a pyramid multi-scale city extraction network with a multi-scale pyramid structure;
the present disclosure proposes a building region extraction network that combines an attention-based feature extraction network and a multi-scale pyramid structured pyramid multi-scale city extraction network. The feature extraction network based on the attention mechanism is used for feature extraction and classification, attention is paid to main targets, and the accuracy of target extraction and classification is effectively improved; the pyramid multi-scale city extraction network is used for deep multi-scale extraction of features, and the target features under multiple receptive fields are considered in parallel, so that the pyramid multi-scale city extraction network is suitable for multi-scale building area extraction.
S3, training and testing the building area extraction network, and optimizing the network;
the method adopts a plurality of data sets to train and test the building area extraction network, and continuously iterates until the optimal state is reached, so that the accuracy of target extraction and classification is improved.
S4, inputting the preprocessed SAR image into the optimized building area extraction network to obtain a preliminarily extracted building area;
and (3) carrying out radiometric calibration, contrast enhancement and geocoding pretreatment on all SAR images, and inputting the pretreated SAR images into a trained building area extraction network to obtain a preliminarily extracted building area.
S5, post-processing the building area subjected to primary extraction, and optimizing an extraction result;
the building area result primarily extracted by the building area extraction network may have a certain false alarm, and further post-processing is required to improve the extraction accuracy. Common false alarms include: vegetation, roads, mountains, and the like. The method comprises the steps of obtaining mask layers of several types of non-building ground objects by using auxiliary data, and removing intersection of initial extraction results and the mask layers by using logic calculation to obtain optimized results. And the post-processing is carried out on the preliminarily extracted building area, so that the extraction precision of the multi-scale building area is further improved.
Fig. 2 shows a schematic structural diagram of a building area extraction network according to an embodiment of the present disclosure. As shown in fig. 2, the building area extraction network includes an attention-based feature extraction network and a pyramid multi-scale city extraction network of multi-scale pyramid structure. The attention-based feature extraction network is divided into a left part and a right part: the left part performs feature extraction and the right part performs up-sampling, also referred to as the encoder and decoder parts. The encoder obtains a feature map from the input image through a series of convolutions and pooling, and is configured to include a convolution block (Conv2D block), a maximum pooling layer (max pooling layer), and a concatenation structure. The convolution block is configured to include two sets of a convolution layer (Conv2D) with an n × n kernel, a normalization layer (Batch Normalization), and an activation function layer (ReLU); the present disclosure uses a 3 × 3 kernel. The decoder restores the feature map to an output image through a series of deconvolutions and convolutions, and is configured to include a deconvolution structure (Transconv2D), a convolution block structure, a concatenation structure, and an attention block structure (Attention block); the deconvolution structure up-samples the feature map to realize pixel-level classification.
The attention-based feature extraction network, improved from a CNN model, focuses attention on the main target: it uses features obtained in the down-sampling stage as a gating signal to compute an attention coefficient, then suppresses task-irrelevant feature responses in the feature map by multiplying the attention coefficient with the feature map.
The method for extracting the features by adopting the feature extraction network based on the attention mechanism comprises the following steps:
s21, inputting the SAR image into an encoder of the feature extraction network based on the attention mechanism to obtain an original feature map, and outputting the original feature map as a gating signal g of a down-sampling layer after the original feature map is subjected to 1 × 1 convolution operation and ReLU activation function mapping;
the gating signal g indicates whether each pixel is activated, provides context information acquired by the shallow network, and is a shallow feature derived in the encoder structure.
S22, inputting the original feature map obtained by the encoder into the decoder for up-sampling to obtain a feature map x_l;
x_l denotes the feature map of the up-sampled layer passed through the skip connection; it is a deep feature input to the deconvolution layer in the decoder structure.
S23, performing an up-sampling operation on the gating signal g obtained in step S21 so that it has the same size as the feature map x_l in step S22;
Because the feature map x_l is twice as large as the current gating signal, g must be up-sampled to ensure that its dimensions match those of x_l.
S24, performing a mean pooling operation on the gating signal obtained in step S23 using dilated convolution to obtain a feature map;
The mean pooling operation is performed with the dilated convolution kernel, and the number of channels of the feature map is reduced to obtain a description of the feature map.
S25, performing a convolution operation on the feature map x_l, connecting the result with the feature map from step S24, and then passing it sequentially through a ReLU activation function layer, a convolution layer, and a Sigmoid activation function layer to obtain the spatial attention coefficient α;
The convolution operation reduces the number of channels of x_l; the result is added to the feature map from step S24, the ReLU activation function enhances nonlinearity, a 1 × 1 convolution then reduces the dimension, and finally the Sigmoid activation function outputs a number in the range 0 to 1, namely the spatial attention coefficient α.
S26, identifying salient image regions and pruning feature responses so that only activations relevant to the specific task are kept;
task-independent feature responses in the feature map are suppressed by multiplying the attention coefficient and the feature map.
S27, performing a dot product operation on the coefficient α and x_l to obtain the attended feature:

x̂_l = α · x_l

The feature obtained by the dot product of the α coefficient and x_l is passed as input to the next layer of the network. The weight parameters in this structure are initialized by a normal distribution method and updated following the gradient descent back-propagation algorithm.
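Steps S21 to S27 can be sketched as a PyTorch module. This is an assumed re-implementation of a standard attention gate, not the authors' released code; channel counts and layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Attention gate: the encoder gating signal g and the skip-connection
    feature x_l are projected by 1x1 convolutions, summed, passed through
    ReLU, a 1x1 convolution, and Sigmoid to produce the spatial attention
    coefficient alpha in (0, 1), which then reweights x_l."""
    def __init__(self, g_channels, x_channels, inter_channels):
        super().__init__()
        self.w_g = nn.Conv2d(g_channels, inter_channels, kernel_size=1)
        self.w_x = nn.Conv2d(x_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, g, x):
        # S23: up-sample g to the (twice larger) spatial size of x_l
        g = F.interpolate(g, size=x.shape[2:], mode='bilinear',
                          align_corners=False)
        # S25: project, add, ReLU, 1x1 conv, Sigmoid -> alpha
        a = F.relu(self.w_g(g) + self.w_x(x))
        alpha = torch.sigmoid(self.psi(a))
        # S27: attended feature x_hat = alpha * x_l
        return x * alpha

gate = AttentionGate(g_channels=64, x_channels=32, inter_channels=16)
out = gate(torch.randn(1, 64, 8, 8), torch.randn(1, 32, 16, 16))
```

The output keeps the shape of x_l, so the gated feature drops directly into the decoder's concatenation structure.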
Fig. 3 shows a schematic structural diagram of a pyramid multi-scale city extraction network according to an embodiment of the present disclosure. As shown in fig. 3, a pyramid scene parsing network (PSPNet) can embed difficult-to-parse scene information features into an FCN-based prediction framework through a pyramid pooling structure, so as to provide effective global context prior for pixel-level scene parsing. The present disclosure introduces pyramid pooling modules into a multi-scale building area extraction network. The pyramid pooling module used in the present disclosure is a four-layer structure, bin sizes are 1 × 1, 2 × 2, 3 × 3, and 6 × 6, respectively, and features under four different pyramid scales are fused. The pyramid level divides the feature map into different sub-regions and integrates different location information. The outputs at different levels in the pyramid pooling module contain feature maps of different sizes. In order to maintain the weight of the global characteristics, if the pyramid has N levels, 1 × 1 convolution is used after each level to reduce the number of channels of the corresponding level to 1/N of the original number, and then the low-dimensional feature map is directly up-sampled by bilinear interpolation to obtain the feature map with the same size as the original feature map. And finally, cascading the features of different levels to serve as the final pyramid pooling global characteristic, and realizing subsequent classification on the basis.
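A minimal PyTorch sketch of the four-level pyramid pooling module described above (bins 1 × 1, 2 × 2, 3 × 3, 6 × 6; channels reduced to 1/N per level, bilinear up-sampling, then concatenation with the original map). This is an assumed re-implementation for illustration, not the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Each level adaptively average-pools the feature map to bin x bin,
    reduces channels to C/N with a 1x1 convolution, up-samples back to
    the input size, and all levels are concatenated with the input."""
    def __init__(self, channels, bins=(1, 2, 3, 6)):
        super().__init__()
        reduced = channels // len(bins)   # 1/N of the original channels
        self.levels = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(channels, reduced, kernel_size=1))
            for b in bins)

    def forward(self, x):
        size = x.shape[2:]
        outs = [x] + [F.interpolate(level(x), size=size, mode='bilinear',
                                    align_corners=False)
                      for level in self.levels]
        # output channels: C + N * (C/N) = 2C
        return torch.cat(outs, dim=1)

ppm = PyramidPooling(64)
y = ppm(torch.randn(1, 64, 24, 24))
```

With a 64-channel input, each of the four levels contributes 16 channels, so the pooled global feature has 128 channels at the original spatial resolution.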
The pyramid multi-scale city extraction network is located between the encoder and the decoder of the feature extraction network based on the attention mechanism, multi-layer information of a multi-scale building area is collected, and then the multi-layer information is combined with an original feature map, so that the accuracy of building area extraction is improved.
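The pyramid pooling computation described above can be sketched in NumPy. This is an illustrative re-implementation under stated assumptions — the 1 × 1 convolutions are random, untrained projections standing in for learned weights, and a single (C, H, W) feature map is used — not the trained network of the disclosure:

```python
import numpy as np

def adaptive_avg_pool(feat, bins):
    """Pool a (C, H, W) feature map into a (C, bins, bins) grid of sub-region means."""
    c, h, w = feat.shape
    out = np.zeros((c, bins, bins))
    for i in range(bins):
        for j in range(bins):
            out[:, i, j] = feat[:, i*h//bins:(i+1)*h//bins,
                                   j*w//bins:(j+1)*w//bins].mean(axis=(1, 2))
    return out

def bilinear_upsample(feat, oh, ow):
    """Bilinearly resize a (C, h, w) map to (C, oh, ow) (align-corners style)."""
    c, h, w = feat.shape
    ys = np.linspace(0, h - 1, oh)
    xs = np.linspace(0, w - 1, ow)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    return (feat[:, y0][:, :, x0] * (1 - wy) * (1 - wx)
          + feat[:, y1][:, :, x0] * wy * (1 - wx)
          + feat[:, y0][:, :, x1] * (1 - wy) * wx
          + feat[:, y1][:, :, x1] * wy * wx)

def pyramid_pooling(feat, levels=(1, 2, 3, 6), rng=None):
    """Fuse sub-region context at four pyramid scales and concatenate with the input."""
    rng = rng or np.random.default_rng(0)
    c, h, w = feat.shape
    cr = c // len(levels)             # each level keeps 1/N of the channels
    branches = [feat]
    for bins in levels:
        pooled = adaptive_avg_pool(feat, bins)
        proj = rng.standard_normal((cr, c)) / np.sqrt(c)   # stand-in for a 1x1 conv
        reduced = np.tensordot(proj, pooled, axes=(1, 0))  # (cr, bins, bins)
        branches.append(bilinear_upsample(reduced, h, w))
    return np.concatenate(branches, axis=0)  # (2C, H, W) pyramid-pooled global feature
```

With the four levels of bin sizes 1, 2, 3, and 6 and channel reduction to 1/N per level, concatenating the original map with the four upsampled branches doubles the channel count, which is the cascaded global feature used for subsequent classification.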
The method for training and testing the building area extraction network comprises the following steps:
s31, dividing a sample into a training set and a testing set;
the sample set is divided into a training set and a testing set, and the proportion of the training set and the testing set can be set according to specific requirements.
S32, inputting the training set into the building area extraction network for training, wherein the classifier adopts a Focal Loss function with the formula:
FL(ŷ, y) = -α(1 - ŷ)^γ · y · log(ŷ) - (1 - α) · ŷ^γ · (1 - y) · log(1 - ŷ)
where α is the balance factor, γ is the focusing factor, y ∈ {0, 1} is the class label, and ŷ ∈ (0, 1) is the predicted probability output after the activation function;
the predicted values output after the activation function are iterated continuously to obtain the optimal building area extraction network;
the building section comprises suburb small villages and villages distributed along river currents, the area of the building area in the section of the building area is small in the section, and the problem of serious unbalance of positive and negative samples of the building exists. In order to solve the problem, the original cross entropy Loss function is replaced by the Focal local, wherein the Focal local is modified on the basis of the cross entropy Loss function, the two-classification cross entropy Loss is analyzed firstly, and the formula is as follows:
CE(ŷ, y) = -y · log(ŷ) - (1 - y) · log(1 - ŷ)
With the ordinary cross entropy, for positive samples the larger the predicted probability, the smaller the loss; for negative samples, the smaller the predicted probability, the smaller the loss. During iteration this loss is dominated by the large number of easy samples, converges slowly, and may not be optimized to the optimum. The Focal Loss of the present disclosure therefore improves on it with the formula:
FL(ŷ, y) = -α(1 - ŷ)^γ · y · log(ŷ) - (1 - α) · ŷ^γ · (1 - y) · log(1 - ŷ)
A modulating factor with γ > 0 is added to the original loss, which reduces the loss contributed by easily classified samples so that training focuses on hard, misclassified samples. In addition, a balance factor α is added to counter the imbalance between positive and negative samples. After a series of parameter-tuning experiments, α = 0.25 and γ = 2 were determined to perform well for the building area extraction of the present disclosure.
S33, inputting the test set into a trained building area extraction network, and outputting a classification result;
and inputting the test set into a trained building area extraction network, and finally outputting a classification result of the building area image.
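As a minimal numerical sketch (not the disclosure's training code), the binary Focal Loss above and its cross-entropy special case can be written as follows; the clipping constant is an added numerical-safety detail:

```python
import numpy as np

def binary_cross_entropy(y, p):
    """CE = -y*log(p) - (1-y)*log(1-p) for label y in {0, 1} and prediction p."""
    p = np.clip(p, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(y, p, alpha=0.25, gamma=2.0):
    """Focal Loss: down-weights easy samples and re-balances positives/negatives."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(alpha * (1 - p) ** gamma * y * np.log(p)
             + (1 - alpha) * p ** gamma * (1 - y) * np.log(1 - p))

# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1),
# so the gradient concentrates on the misclassified samples.
easy = focal_loss(1.0, 0.9)
hard = focal_loss(1.0, 0.1)
```

A quick sanity check: with γ = 0 and α = 1 the positive-sample term reduces to the plain binary cross entropy.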
The post-processing of the preliminarily extracted building area comprises the following steps:
s41, taking Sentinel-2 optical data as a data source, and acquiring NDVI and MNDWI images by using a spectral index module in ENVI software;
s42, setting a threshold value to obtain a vegetation and water body mask layer;
The water body mask layer covers wetlands and non-building objects on water; the thresholds for the vegetation and water body mask layers are 0.42 and 0.38, respectively.
S43, resampling the 30m SRTM DEM to 10m SAR resolution, and acquiring gradient data;
The present disclosure sets the slope threshold to 15°.
S44, overlapping the extracted building area result image layer with the mask image layer, obtaining an intersection, erasing the intersection image layer, and finally obtaining a post-processing building area result image layer;
the present disclosure eliminates errors caused by other artifacts with high backscattering values based on NDVI and MNDWI indices.
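The mask-and-erase post-processing of steps S41–S44 can be sketched as below; the ENVI spectral-index and DEM-resampling steps are replaced with plain arrays, and the thresholds (NDVI > 0.42, MNDWI > 0.38, slope > 15°) are those stated above:

```python
import numpy as np

def postprocess(building, ndvi, mndwi, slope,
                ndvi_t=0.42, mndwi_t=0.38, slope_t=15.0):
    """Erase preliminary building pixels falling on vegetation, water, or steep slopes."""
    veg_mask = ndvi > ndvi_t        # vegetation mask layer
    water_mask = mndwi > mndwi_t    # water / wetland mask layer
    slope_mask = slope > slope_t    # steep terrain from the resampled DEM
    erase = veg_mask | water_mask | slope_mask
    return building & ~erase        # the intersection with the masks is erased

building = np.array([[True, True], [True, False]])
ndvi = np.array([[0.50, 0.10], [0.10, 0.10]])
mndwi = np.array([[0.00, 0.40], [0.00, 0.00]])
slope = np.array([[0.0, 0.0], [5.0, 20.0]])
cleaned = postprocess(building, ndvi, mndwi, slope)
# only the pixel at (1, 0) survives: low NDVI/MNDWI and gentle slope
```

This removes the false alarms caused by other artifacts with high backscattering values, which is the purpose of step S44's intersect-and-erase operation.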
Fig. 4 shows a schematic diagram of a building region extraction system based on medium resolution SAR images according to an embodiment of the present disclosure. As shown in fig. 4, includes: a sample establishing module 41, a building area extraction network establishing module 42, an optimizing module 43, a preliminary extraction module 44 and a post-processing module 45:
The sample creation module 41: used for establishing pixel-level marking samples of multi-category building areas;
The building area extraction network establishment module 42: used for establishing a building area extraction network, the building area extraction network comprising a feature extraction network module based on an attention mechanism and a pyramid multi-scale city extraction network module with a multi-scale pyramid structure;
the optimization module 43: the system is used for training and testing the building area extraction network and carrying out network optimization;
The preliminary extraction module 44: used for inputting the preprocessed SAR image into the optimized building area extraction network to obtain a preliminarily extracted building area;
post-processing module 45: and the post-processing unit is used for post-processing the building area subjected to the primary extraction and optimizing the extraction result.
Fig. 5 shows a schematic diagram of a sample setup module according to an embodiment of the disclosure. As shown in fig. 5, the sample creating module 41 includes a preprocessing module 51, a cutting module 52, a culling module 53, and a distinguishing module 54:
The preprocessing module 51: used for performing radiometric calibration, contrast enhancement, and geocoding preprocessing on all SAR images;
The cutting module 52: used for making pixel-level marking samples of multi-class buildings from the preprocessed SAR images, cutting the sample images into slices of n × n size, and setting an overlapping area of several pixels during slice making so that the boundaries of small building areas are not broken;
the eliminating module 53: for culling defective slices in the sample;
The distinguishing module 54: used for determining whether a slice contains a building area target; if so, the slice is a positive sample, otherwise it is a negative sample, and the positive and negative samples are stored in different subsets respectively.
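The behavior of the cutting and distinguishing modules — overlapping n × n slices, with positive/negative labeling by the presence of building pixels — can be sketched as follows. The overlap of 32 pixels is an illustrative choice, not a value fixed by the disclosure:

```python
import numpy as np

def make_slices(image, label, size=256, overlap=32):
    """Cut an image/label pair into overlapping size x size slices.

    Each slice is tagged positive if its label patch contains any building pixel,
    so positives and negatives can be stored in separate subsets afterwards."""
    step = size - overlap  # stride; the overlap keeps small building borders intact
    h, w = label.shape
    slices = []
    for y in range(0, h - size + 1, step):
        for x in range(0, w - size + 1, step):
            patch = label[y:y + size, x:x + size]
            slices.append((image[y:y + size, x:x + size], patch, bool(patch.any())))
    return slices
```

On a 512 × 512 scene this stride yields a 2 × 2 grid of slices, each sharing a 32-pixel band with its neighbor so that a small building straddling a cut line appears whole in at least one slice.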
FIG. 6 shows a schematic diagram of an attention mechanism based feature extraction network module according to an embodiment of the present disclosure. As shown in fig. 6, the feature extraction network module based on attention mechanism includes an encoder module 61 and a decoder module 62: the encoder module 61 is configured to include a convolution block, a max-pooling layer, and a concatenation structure, and the decoder module 62 is configured to include a deconvolution structure, a convolution block structure, a concatenation structure, and an attention block structure. The convolution block in the encoder module 61 is configured to include two sets of convolution layers with convolution kernels of n × n, a normalization layer, and an activation function layer, the present disclosure employing a convolution kernel of 3 × 3; the deconvolution structure in the decoder module 62 upsamples the feature map to achieve pixel-level classification.
FIG. 7 shows a schematic diagram of a pyramid multi-scale city extraction network module, according to an embodiment of the present disclosure. As shown in fig. 7, the pyramid multi-scale city extraction network module includes a pooling layer 71, a convolutional layer 72, an upsampling layer 73, and a cascading structure 74. The pyramid multi-scale city extraction network module is located between the encoder module 61 and the decoder module 62 of the attention-based feature extraction network module.
The building area extraction network establishment module 42 of the present disclosure is further configured to:
s51, inputting the SAR image into an encoder module of the feature extraction network module based on the attention mechanism to obtain an original feature map, and outputting the original feature map as a gating signal g of a down-sampling layer after the original feature map is subjected to 1 × 1 convolution operation and ReLU activation function mapping;
S52, inputting the original feature map obtained by the encoder module into the decoder module for up-sampling to obtain a feature map x_l;
S53, performing an up-sampling operation on the gating signal g obtained in step S51 so that its size matches that of the feature map x_l;
S54, performing a mean pooling operation with a dilated (expansion) convolution on the gating signal obtained in step S53 to obtain a feature map;
S55, after a convolution operation on the feature map x_l, concatenating it with the feature map obtained in step S54, and then passing the result sequentially through a ReLU activation function layer, a convolution layer, and a Sigmoid activation function layer to obtain the spatial attention coefficient α;
S56, identifying the salient image regions and pruning feature responses, keeping only the activations relevant to the specific task;
S57, performing a dot-product operation on the coefficient α and x_l to obtain the attended feature x̂_l = α · x_l,
wherein the value of the α coefficient lies in the range [0, 1].
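Steps S51–S57 amount to an additive attention gate in the style of Attention U-Net. The sketch below is a simplified, untrained NumPy version — random weights stand in for the learned 1 × 1 convolutions, and the dilated-convolution mean pooling of S54 is omitted — showing how a coefficient α in (0, 1) gates the skip feature x_l:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x_l, g, rng=None):
    """Gate a skip feature x_l (C, H, W) with a gating signal g (C, H, W).

    Returns the attended feature alpha * x_l and the spatial coefficient alpha."""
    rng = rng or np.random.default_rng(0)
    c = x_l.shape[0]
    w_x = rng.standard_normal(c) / np.sqrt(c)   # stand-in 1x1 conv on x_l
    w_g = rng.standard_normal(c) / np.sqrt(c)   # stand-in 1x1 conv on g
    # S55: combine the two projected maps, then ReLU
    q = np.maximum(np.tensordot(w_x, x_l, axes=1) + np.tensordot(w_g, g, axes=1), 0.0)
    psi = rng.standard_normal()                 # stand-in final 1x1 conv weight
    alpha = sigmoid(psi * q)                    # Sigmoid -> alpha in (0, 1)
    # S56/S57: task-irrelevant responses are suppressed by the elementwise product
    return alpha * x_l, alpha
```

Because α is produced per pixel by the Sigmoid, the elementwise product leaves strongly attended locations nearly unchanged and attenuates the rest, which is the suppression of task-independent feature responses described above.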
FIG. 8 shows a schematic diagram of an optimization module according to an embodiment of the disclosure. As shown in fig. 8, the optimization module 43 includes a training module 81 and a testing module 82:
The training module 81 is configured to divide the samples obtained by the sample establishing module 41 into a training set and a test set, and to input the training set into the building area extraction network for training, where the classifier uses a Focal Loss function with the formula:
FL(ŷ, y) = -α(1 - ŷ)^γ · y · log(ŷ) - (1 - α) · ŷ^γ · (1 - y) · log(1 - ŷ)
where α is the balance factor, γ is the focusing factor, y ∈ {0, 1} is the class label, and ŷ ∈ (0, 1) is the predicted probability output after the activation function;
the predicted values output after the activation function are iterated continuously to obtain the optimal building area extraction network;
the test module 82 is configured to input the test set into a trained building area extraction network, and output a classification result.
Wherein the activation function is a sigmoid function, α is 0.25, and γ is 2.
The post-processing module 45 of the present disclosure is configured to:
s61, taking Sentinel-2 optical data as a data source, and acquiring NDVI and MNDWI images by using a spectral index module in ENVI software;
s62, setting a threshold value to obtain a vegetation and water body mask layer;
s63, resampling the 30m SRTM DEM to 10m SAR resolution, and acquiring gradient data;
and S64, overlapping the extracted building area result image layer with the mask image layer, obtaining an intersection, erasing the intersection image layer, and finally obtaining a post-processing building area result image layer.
Wherein, the threshold values of the vegetation and the water body mask layer are 0.42 and 0.38 respectively.
Two experiments were designed to evaluate the robustness of building area extraction by the pyramid multi-scale building area extraction network in medium-resolution SAR images: A. GF-3 SAR data was used to evaluate the detection performance of the proposed method in different building areas. B. The applicability of the proposed method to data from different SAR sensors was evaluated.
(1) Study area and experimental data:
For experiment A, Beijing, Luohe (Henan), Chongqing, and Fuzhou (Fujian) were selected as experimental areas. Beijing and Luohe lie in plain areas, Chongqing is a mountainous city, and Fuzhou is a coastal city. For experiment B, L-band ALOS-2 data was selected as the data source, with Beijing as the study area. Tables 1 and 2 list the detailed information of the SAR data used in the experiments.
TABLE 1 basic information of GF-3 SAR data
TABLE 2 basic information for Sentinel-1 and ALOS-2 SAR data
Data         Region    Acquisition time    Resolution    Polarization mode
Sentinel-1   Beijing   2019-06             20 m          VV/VH
ALOS-2       Beijing   2019-06             10 m          HH/HV
(2) Samples of different building zone distributions:
The spatial layout and building types of the four areas differ across the different terrain scenes. To improve the extraction performance of the network, approximately 9000 valid sample slices covering different types of building areas were produced according to the sample-making design strategy (see fig. 9). The basic information of the training and test data is shown in table 3. To improve detection precision, sample slices containing no buildings were removed. All slices are 256 × 256 in size. The ratio of training samples to test samples is 7:3, and the ratio of plain samples : mountain samples : coastal samples is 5:3:2. Buildings are widely distributed in the plain parts of the experimental areas and sparsely in the mountainous parts, and the building proportion within mountain slices is smaller than within plain slices; negative (mountain) samples were therefore added to the training set so that the network can fully learn the differences between plain and mountain areas. The ratio of building slices from the three terrain types in the test sample is likewise 5:3:2.
FIG. 9 shows samples with multi-class labels. (a1)-(f1) are Google Earth image slices, (a2)-(f2) the corresponding SAR image slices, and (a3)-(f3) the corresponding pixel-level label slices. (a1) is a mountainous building area, (b1) a village in a plain, (c1) flat-roofed buildings, (d1) a linearly distributed village, (e1) a densely built-up area, and (f1) a net-like village distributed across a plain.
TABLE 3 basic information of training sample set and test sample set
(3) Setting parameters:
The main parameters of the feature extraction network of the present disclosure are as follows: the learning rate, number of iterations, and batch size are 0.0005, 200, and 20, respectively.
(4) And (4) analyzing results:
FIG. 10 shows the building area mapping results of the building area extraction network of the present disclosure for the four areas: Beijing, Luohe, Chongqing, and Fuzhou. As can be seen from fig. 10, the urban areas extracted by the present disclosure substantially coincide with the building area signatures in the SAR images. Detection is good in urban centers, suburban villages and towns, densely connected building areas, and small-scale villages, as well as in the mountainous areas of Chongqing and Fuzhou.
The Global Urban Footprint (GUF) product produced by the German DLR team was selected for comparison. The GUF product has a resolution of 12 m and is derived from TanDEM-X and TerraSAR-X radar images with 3 m ground resolution, interpreted by the fully automatic Urban Footprint Processor (UFP) framework. The extraction results for the four regions and the GUF results are visualized in fig. 11. The comparison shows that in the Beijing area the building extent extracted by the GUF product is larger than that extracted by the present disclosure, while in the Luohe area the extraction result of the present disclosure is clearly larger than that of the GUF product; the differences in Chongqing and Fuzhou are smaller. To examine these differences, the extraction results for Beijing were overlaid on a 10 m resolution Sentinel-2 optical image, and the extraction results for Luohe were overlaid on a Google Earth image for visual comparison. FIG. 12 compares the urban area extraction results for Beijing with GUF. As seen in fig. 12, the GUF product has more false alarms; the black markings indicate many non-building areas detected as building areas. Fig. 13 shows a visual comparison of the results for Luohe and its suburbs, where the GUF product shows high false alarms in the urban area and high missed detections in the suburban area. Table 4 compares the accuracy of the extraction results with the GUF product; from table 4 it can be seen that the overall extraction accuracy of the present disclosure in the Beijing area is greater than that of the GUF product, while the extraction accuracy in Luohe, Chongqing, and Fuzhou is lower than that of the GUF product.
TABLE 4 comparison of the extraction results of the present disclosure with the accuracy of the GUF product
The Global Human Settlement Layer (GHSL) and FROMGLC10 products were also selected for comparison with the extraction results of the present disclosure. GHSL is produced at 0.5–10 m resolution from multiple data sources acquired by the SPOT (2 and 5), CBERS-2B, RapidEye (2 and 4), WorldView (1 and 2), GeoEye-1, QuickBird-2, and Ikonos-2 satellites and airborne sensors. The FROMGLC10 product is a global 10 m resolution land cover map made by Professor Gong's team at Tsinghua University from Sentinel-2 optical data using a random forest classification method. The results extracted by the present disclosure are compared with GHSL and FROMGLC10 in fig. 14. As can be seen from the figures, the urban areas extracted by the present disclosure are substantially consistent with the other products. However, GHSL shows a degree of over-extraction: the boundary extent and density of its urban areas are greater than those of the other products, especially in the Beijing area. For small-scale villages, the present disclosure extracts the small villages of Luohe well; compared with its results, the GHSL product shows some missed detections, and the FROMGLC10 product misses a large number of villages. Table 5 compares the extraction results of the present disclosure with the optical products in the four regions. As can be seen from table 5, in the Beijing area the extraction result of the present disclosure and the FROMGLC10 product are highly consistent, but the false-alarm rate of the FROMGLC10 product, 6.55%, is higher than that of the present disclosure, and the commission error (CE) of the GHSL product reaches 30.40%. In the Luohe area, the overall accuracy of all three results exceeds 90%, but the GHSL product has the highest false-alarm rate at 10.86%, while the omission rate of the FROMGLC10 product is higher, reaching 25.90%.
In the Chongqing and Fuzhou areas, the extraction precision of the present disclosure is comparable to that of the GHSL product, and the false-alarm rate of the FROMGLC10 product is the highest. Across the four areas, the building areas extracted by the present disclosure have the lowest false-alarm rates, but some missed detections exist: in the Beijing, Chongqing, and Fuzhou areas the omission rates reach 5.74%, 6.54%, and 6.10%, respectively. These omissions mainly occur in areas with low building backscattering, where the texture and geometric structure of buildings in the image are incomplete; the building features are then unstable in the convolutional neural network and are assigned to non-building classes in the classification layer. In terms of spatial scale, the method can therefore perform regional building area mapping accurately, and for small-scale village extraction it can to a certain extent complement 10 m resolution optical mapping, providing technical support for surface-monitoring applications based on multi-source remote sensing data.
TABLE 5 architectural area results extracted by the present disclosure compared to optical products
(5) And (3) comparison test:
1) comparison of the present disclosure with the Attention U-Net and Residual U-Net methods:
To verify the extraction performance of the present disclosure, it was compared with Attention U-Net and Residual U-Net; the Attention U-Net and Residual U-Net models were trained with the same training data. The urban area results extracted by the three methods are compared in fig. 15, and table 6 gives their accuracy evaluations. Fig. 15 shows that the visual detection results of the present disclosure are best in all four regions. In the Luohe area, the detection effect of Residual U-Net is the worst, with a large number of missed detections. As can be seen from table 6, the overall accuracy of building area detection by the present disclosure in the four regions is between 94.06% and 95.79% (mean Kappa of 0.83), while the Attention U-Net and Residual U-Net methods have mean Kappa values of 0.81 and 0.73, respectively; the worst of the three methods is Residual U-Net. Compared with the present disclosure, Residual U-Net has the highest omission rates in Beijing, Luohe, Chongqing, and Fuzhou, followed by the Attention U-Net method; Residual U-Net reaches its highest omission rate of 27.30% in Chongqing, consistent with the visualization in fig. 15. From a methodological standpoint, the extraction precision of Attention U-Net is higher than that of Residual U-Net mainly because Attention U-Net not only uses the U-shaped structure to pass encoded features to the decoder, but also adds a spatial attention module in the decoder, focusing the network on building area targets.
The extraction precision of the present disclosure is the highest, mainly because Attention U-Net is used as the feature extraction network: the geometric structure and texture features of building areas are first fully learned through the multi-scale input layer; the pyramid multi-scale city extraction network is then introduced so that the network further extracts the features of different sub-regions of the building area; finally, the deep features and multi-scale features are combined, and the spatial attention module focuses the network on identifying building area targets, realizing accurate extraction of multi-scale building areas.
TABLE 6 evaluation of accuracy of extraction results in construction area
2) The robustness analysis of the extraction application in different SAR data source building areas is as follows:
To test the robustness of the present disclosure when using different SAR data for regional mapping, Sentinel-1 and ALOS-2/PALSAR-2 data were selected as experimental data sources. Table 7 gives the details of the training and test data. FIG. 16 shows the extraction results for the Sentinel-1 and ALOS-2/PALSAR-2 data. Table 8 gives the accuracy evaluation of the extraction results of the present disclosure for the building areas of the two SAR data sets; the overall extraction accuracy is higher than 80% and the Kappa coefficient is higher than 0.7. These results show that the building area extraction network of the present disclosure has good generalization capability for regional mapping with different SAR sensor data.
TABLE 7 basic information of training and test samples
Data source   Polarization mode   Training samples   Test samples
Sentinel-1    VV/VH               686                70
ALOS-2        HH/HV               686                70
TABLE 8 evaluation of the accuracy of the results of two data construction area extractions
In the regional building area mapping experiments based on GF-3 data, the building area extraction method of the present disclosure not only identifies building areas more accurately than traditional FCN-based methods, but also achieves higher accuracy in small-scale village extraction than the optical data products. The present disclosure is robust in regional mapping with Sentinel-1 and ALOS-2/PALSAR-2 data because its building area extraction network has good generalization performance.
Moreover, although exemplary embodiments have been described herein, the scope thereof includes any and all embodiments based on the disclosure with equivalent elements, modifications, omissions, combinations (e.g., of various embodiments across), adaptations or alterations. The elements of the claims are to be interpreted broadly based on the language employed in the claims and not limited to examples described in the present specification or during the prosecution of the application, which examples are to be construed as non-exclusive. It is intended, therefore, that the specification and examples be considered as exemplary only, with a true scope and spirit being indicated by the following claims and their full scope of equivalents.

Claims (6)

1. A building region extraction method based on medium-resolution SAR images, characterized in that
the extraction method comprises the following steps:
s1, establishing pixel level marking samples of multi-category building areas;
s11, carrying out radiometric calibration, contrast enhancement and geocoding pretreatment on all SAR images;
s12, making pixel-level marking samples of the multi-class buildings according to the preprocessed SAR images,
cutting the sample image into slices of n multiplied by n size, and setting an overlapping area of a plurality of pixels in the slice manufacturing process to ensure that the boundary of a building with small area is not damaged;
s13, removing the incomplete slices in the sample;
s14, determining whether the slice contains a building area target, if so, determining the slice as a positive sample, otherwise, determining the slice as a negative sample, and respectively storing the positive sample and the negative sample in different subsets;
s2, building area extraction networks are established, wherein the building area extraction networks comprise a feature extraction network based on an attention mechanism and a pyramid multi-scale city extraction network with a multi-scale pyramid structure;
wherein the feature extraction network based on the attention mechanism comprises an encoder and a decoder: the encoder is configured to include: a convolution block, a max-pooling layer, and a concatenation structure, the decoder configured to include a deconvolution structure, a convolution block structure, a concatenation structure, and an attention block structure; the step of extracting the features of the feature extraction network based on the attention mechanism comprises the following steps:
S21, inputting the SAR image into an encoder of the feature extraction network based on the attention mechanism to obtain an original feature map, and performing an n × n convolution operation and ReLU activation function mapping on the original feature map, which is then output as a gating signal g of the down-sampling layer;
S22, inputting the original feature map obtained by the encoder into a decoder for up-sampling to obtain a feature map x1;
S23, performing an up-sampling operation on the gating signal g obtained in step S21 so that it has the same size as the feature map x1;
s24, performing mean pooling operation on the gating signals obtained in the step S23 by using an expansion convolution core to obtain a feature map;
S25, connecting the feature map x1, after a convolution operation, with the feature map obtained in step S24, and then obtaining a spatial attention coefficient α after sequentially passing through a ReLU activation function layer, a convolution layer, and a Sigmoid activation function layer;
s26, identifying the prominent image area and deleting the characteristic response, and only keeping the activation related to the specific task;
S27, performing a dot-product operation on the coefficient α and the feature map x1 to obtain the attended feature x̂1 = α · x1;
S3, training and testing the extracted network of the building area, and optimizing the network;
s31, dividing the samples obtained in the step S1 into a training set and a testing set;
S32, inputting the training set into the building area extraction network for training, wherein the classifier adopts a Focal Loss function with the formula:
FL(ŷ, y) = -α(1 - ŷ)^γ · y · log(ŷ) - (1 - α) · ŷ^γ · (1 - y) · log(1 - ŷ)
where α is the balance factor, γ is the focusing factor, y ∈ {0, 1} is the class label, and ŷ ∈ (0, 1) is the predicted probability output after the activation function;
the predicted values output after the activation function are iterated continuously to obtain the optimal building area extraction network;
s33, inputting the test set into a trained building area extraction network, and outputting a classification result;
s4, inputting the preprocessed SAR image into an optimized building area extraction network to obtain a building area extracted preliminarily;
and S5, carrying out post-processing on the building area subjected to primary extraction, and optimizing an extraction result.
2. The method for extracting a building region based on a medium resolution SAR image according to claim 1,
the pyramid multi-scale city extraction network in step S2 includes a pooling layer, a convolution layer, an upsampling layer, and a cascade structure, and is located between an encoder and a decoder of the attention-based feature extraction network.
3. The method for extracting a building region based on a medium resolution SAR image according to claim 2,
the pyramid multi-scale city extraction network feature extraction step comprises the following steps:
the pyramid level divides the feature map into different sub-regions, and integrates different position information, and the output of different levels comprises feature maps with different sizes;
setting pyramid levels to be N, and reducing the number of channels of the corresponding levels to 1/N of the original number by using 1 multiplied by 1 convolution after each level;
up-sampling the low-dimensional feature map through a bilinear interpolation value to obtain a feature map with the same size as the original feature map;
and cascading features of different levels to serve as the final pyramid pooling global feature.
4. The method for extracting a building region based on a medium resolution SAR image according to claim 1,
the step S5 includes:
s51, taking Sentinel-2 optical data as a data source, and acquiring NDVI and MNDWI images by using a spectral index module in ENVI software;
s52, setting a threshold value to obtain a vegetation and water body mask layer;
s53, resampling the SRTM DEM to SAR resolution, and acquiring gradient data;
and S54, overlapping the extracted building area result image layer with the mask image layer, obtaining an intersection, erasing the intersection image layer, and finally obtaining a post-processing building area result image layer.
5. A building area extraction system based on a medium-resolution SAR image, characterized in that:
the system comprises a sample establishment module, a building area extraction network establishment module, an optimization module, a preliminary extraction module and a post-processing module:
the sample establishment module is used for establishing multi-category pixel-level labelled samples of building areas:
performing radiometric calibration, contrast enhancement and geocoding preprocessing on all SAR images;
making pixel-level labelled samples of multi-category buildings from the preprocessed SAR images, cutting the sample images into slices of size n×n, and setting an overlap of several pixels between adjacent slices during slicing so that the boundaries of small building areas are not broken;
removing incomplete slices from the samples;
determining whether a slice contains a building area target; if so, the slice is a positive sample, otherwise it is a negative sample, and the positive and negative samples are stored in different subsets;
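The slicing rule described above (n×n slices with a fixed pixel overlap, incomplete edge slices removed) can be sketched as follows; the function name and its return convention are our own.

```python
def slice_positions(height, width, n, overlap):
    """Top-left corners of n x n slices covering an image, with `overlap`
    pixels shared between neighbouring slices; edge slices that would run
    past the image boundary are dropped, matching the claim's removal of
    incomplete slices."""
    stride = n - overlap
    rows = range(0, height - n + 1, stride)
    cols = range(0, width - n + 1, stride)
    return [(r, c) for r in rows for c in cols]
```

For a 10×10 image with n = 4 and a 2-pixel overlap, the stride is 2 and the slices start at rows/columns 0, 2, 4, 6, so every small building near a slice boundary appears whole in at least one slice.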
the building area extraction network establishment module is used for establishing a building area extraction network, which comprises an attention-based feature extraction network module and a pyramid multi-scale city extraction network module with a multi-scale pyramid structure;
wherein the attention-based feature extraction network comprises an encoder and a decoder:
the encoder comprises convolution blocks, max-pooling layers and a concatenation structure, and the decoder comprises a deconvolution structure, convolution block structures, a concatenation structure and attention block structures;
the feature extraction step of the attention-based feature extraction network comprises:
inputting the SAR image into the encoder of the attention-based feature extraction network to obtain an original feature map, which is output as the gating signal g of the downsampling layer after an n×n convolution operation and ReLU activation mapping;
inputting the original feature map obtained by the encoder into the decoder and upsampling it to obtain a feature map x1;
performing an upsampling operation on the obtained gating signal g so that it has the same size as the feature map x1;
performing a mean-pooling operation on the obtained gating signal with a dilated convolution kernel to obtain a feature map;
connecting the feature map x1 with the obtained feature map after the convolution operation, and then passing the result through a ReLU activation layer, a convolution layer and a Sigmoid activation layer in turn to obtain the spatial attention coefficient α, which identifies salient image regions and prunes feature responses, retaining only the activations relevant to the given task;
and performing element-wise multiplication of the coefficient α with the feature map x1 to obtain the attended feature x̂ = α · x1.
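A minimal sketch of the gating arithmetic described above, assuming per-pixel scalar features: the network's convolution layers are collapsed into scalar weights (w_x, w_g, w_psi, all hypothetical), keeping only the additive-attention chain of ReLU, Sigmoid and element-wise rescaling of x1 by the coefficient α.

```python
import math

def attention_gate(x1, g, w_x=1.0, w_g=1.0, w_psi=1.0):
    """Additive attention gate on per-pixel scalar features: combine the
    decoder feature x1 with the gating signal g, squash through
    ReLU -> Sigmoid to get the attention coefficient alpha, then rescale
    x1 by alpha (the attended feature of the claim)."""
    h, w = len(x1), len(x1[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            q = max(0.0, w_x * x1[i][j] + w_g * g[i][j])   # ReLU
            alpha = 1.0 / (1.0 + math.exp(-w_psi * q))     # Sigmoid -> alpha
            out[i][j] = alpha * x1[i][j]                   # attended feature
    return out
```

Because α lies in (0, 1), the gate can only attenuate responses, which is how irrelevant background activations are suppressed while task-relevant ones pass through nearly unchanged.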
the optimization module is used for training and testing the building area extraction network and performing network optimization:
dividing the obtained samples into a training set and a test set;
inputting the training set into the building area extraction network for training, wherein the classifier adopts the Focal Loss function, given by:
Figure 11436DEST_PATH_IMAGE002
where α is the balance factor, γ is the loss factor, y ∈ {0,1} is the class label,
Figure 676641DEST_PATH_IMAGE004
continuously iterating a predicted value output after the function is activated to obtain an optimal building area extraction network;
inputting the test set into a trained building area extraction network, and outputting a classification result;
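The Focal Loss used by the classifier can be sketched for a single pixel as below; the function name and the default values α = 0.25, γ = 2.0 are conventional choices for this loss, not values fixed by the claim.

```python
import math

def focal_loss(y_pred, y_true, alpha=0.25, gamma=2.0):
    """Binary focal loss for one pixel: the (1 - p_t)^gamma modulating
    factor down-weights easy, well-classified examples, while alpha
    balances the positive and negative classes."""
    eps = 1e-7
    p = min(max(y_pred, eps), 1.0 - eps)   # clip for numerical stability
    if y_true == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)
```

A confident correct prediction (ŷ = 0.9 for y = 1) incurs a far smaller loss than an uncertain one (ŷ = 0.5), which is what lets training focus on the hard building/background pixels.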
the preliminary extraction module is used for inputting the preprocessed SAR image into the optimized building area extraction network to obtain the preliminarily extracted building area;
and the post-processing module is used for post-processing the preliminarily extracted building area and optimizing the extraction result.
6. A non-volatile computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the building area extraction method based on a medium-resolution SAR image according to any one of claims 1-4.
CN202011084022.2A 2020-10-12 2020-10-12 Building area extraction method and system based on medium-resolution SAR image Active CN112183432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011084022.2A CN112183432B (en) 2020-10-12 2020-10-12 Building area extraction method and system based on medium-resolution SAR image


Publications (2)

Publication Number Publication Date
CN112183432A CN112183432A (en) 2021-01-05
CN112183432B true CN112183432B (en) 2022-04-15

Family

ID=73948339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011084022.2A Active CN112183432B (en) 2020-10-12 2020-10-12 Building area extraction method and system based on medium-resolution SAR image

Country Status (1)

Country Link
CN (1) CN112183432B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033321A (en) * 2021-03-02 2021-06-25 深圳市安软科技股份有限公司 Training method of target pedestrian attribute identification model and pedestrian attribute identification method
CN112950473B (en) * 2021-03-04 2023-04-18 清华大学深圳国际研究生院 Super-resolution method for MR image
CN112949550B (en) * 2021-03-19 2022-04-15 中国科学院空天信息创新研究院 Water body identification method, system and medium based on deep learning
CN113408540B (en) * 2021-05-17 2022-11-18 中国电子科技集团公司电子科学研究院 Synthetic aperture radar image overlap area extraction method and storage medium
CN113312998A (en) * 2021-05-19 2021-08-27 中山大学·深圳 SAR image target identification method and device based on high-resolution network and storage medium
CN113361662B (en) * 2021-07-22 2023-08-29 全图通位置网络有限公司 Urban rail transit remote sensing image data processing system and method
CN113591685B (en) * 2021-07-29 2023-10-27 武汉理工大学 Geographic object spatial relationship identification method and system based on multi-scale pooling
CN114066876B (en) * 2021-11-25 2022-07-08 北京建筑大学 Construction waste change detection method based on classification result and CVA-SGD method
CN115049160B (en) * 2022-08-12 2022-11-11 江苏省测绘工程院 Method and system for estimating carbon emission of plain industrial city by using space-time big data
CN115331087B (en) * 2022-10-11 2023-03-24 水利部交通运输部国家能源局南京水利科学研究院 Remote sensing image change detection method and system fusing regional semantics and pixel characteristics

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9799327B1 (en) * 2016-02-26 2017-10-24 Google Inc. Speech recognition with attention-based recurrent neural networks
CN110136170A (en) * 2019-05-13 2019-08-16 武汉大学 A kind of remote sensing image building change detecting method based on convolutional neural networks
CN110533631A (en) * 2019-07-15 2019-12-03 西安电子科技大学 SAR image change detection based on the twin network of pyramid pondization
CN110705457A (en) * 2019-09-29 2020-01-17 核工业北京地质研究院 Remote sensing image building change detection method
CN111291622A (en) * 2020-01-16 2020-06-16 武汉汉达瑞科技有限公司 Method and device for detecting building change in remote sensing image
CN111709387A (en) * 2020-06-22 2020-09-25 中国科学院空天信息创新研究院 Building segmentation method and system for high-resolution remote sensing image

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110675408A (en) * 2019-09-19 2020-01-10 成都数之联科技有限公司 High-resolution image building extraction method and system based on deep learning
CN111192200A (en) * 2020-01-02 2020-05-22 南京邮电大学 Image super-resolution reconstruction method based on fusion attention mechanism residual error network


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Automatic Building Extraction in Aerial Scenes Using Convolutional Networks";Jiangye Yuan;《Arxiv》;20160221;第1-7页 *
"Building Change Detection in Aerial Images Based on Attention Pyramid Networks"; Tian Qinglin et al.; Acta Optica Sinica; 20200831; pp. 1-11 *


Similar Documents

Publication Publication Date Title
CN112183432B (en) Building area extraction method and system based on medium-resolution SAR image
CN108230329B (en) Semantic segmentation method based on multi-scale convolution neural network
Verma et al. Transfer learning approach to map urban slums using high and medium resolution satellite imagery
CN107239751B (en) High-resolution SAR image classification method based on non-subsampled contourlet full convolution network
Wang et al. Deprivation pockets through the lens of convolutional neural networks
CN111476159B (en) Method and device for training and detecting detection model based on double-angle regression
CN112949550B (en) Water body identification method, system and medium based on deep learning
CN114565860B (en) Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN112287983B (en) Remote sensing image target extraction system and method based on deep learning
CN109977968B (en) SAR change detection method based on deep learning classification comparison
CN103353988A (en) Method for evaluating performance of heterogeneous SAR (synthetic aperture radar) image feature matching algorithm
Wang et al. The poor generalization of deep convolutional networks to aerial imagery from new geographic locations: an empirical study with solar array detection
CN116168240A (en) Arbitrary-direction dense ship target detection method based on attention enhancement
CN114943902A (en) Urban vegetation unmanned aerial vehicle remote sensing classification method based on multi-scale feature perception network
Gao et al. Road extraction using a dual attention dilated-linknet based on satellite images and floating vehicle trajectory data
Xia et al. Submesoscale oceanic eddy detection in SAR images using context and edge association network
CN109034213A (en) Hyperspectral image classification method and system based on joint entropy principle
Cao et al. Multi angle rotation object detection for remote sensing image based on modified feature pyramid networks
Wang et al. SAR ship detection in complex background based on multi-feature fusion and non-local channel attention mechanism
Chen et al. A novel lightweight bilateral segmentation network for detecting oil spills on the sea surface
CN113628180A (en) Semantic segmentation network-based remote sensing building detection method and system
CN106971402B (en) SAR image change detection method based on optical assistance
Wang et al. Land-sea target detection and recognition in SAR image based on non-local channel attention network
CN112613354A (en) Heterogeneous remote sensing image change detection method based on sparse noise reduction self-encoder
Han et al. Remote sensing sea ice image classification based on multilevel feature fusion and residual network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant