AU2018101336A4 - Building extraction application based on machine learning in Urban-Suburban-Integration Area - Google Patents

Building extraction application based on machine learning in Urban-Suburban-Integration Area Download PDF

Info

Publication number
AU2018101336A4
Authority
AU
Australia
Prior art keywords
rate
sep
image
mobileunet
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2018101336A
Inventor
Yuan HU
Zhouxuan Lin
Ruixue Ma
Dacheng Wang
Ge Yang
Xiaojing Yao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hu Yuan Miss
Ma Ruixue Miss
Original Assignee
Hu Yuan Miss
Ma Ruixue Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hu Yuan Miss, Ma Ruixue Miss filed Critical Hu Yuan Miss
Priority to AU2018101336A priority Critical patent/AU2018101336A4/en
Application granted granted Critical
Publication of AU2018101336A4 publication Critical patent/AU2018101336A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/10: Terrestrial scenes
    • G06V20/176: Urban or other man-made structures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10032: Satellite or aerial image; Remote sensing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20084: Artificial neural networks [ANN]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30181: Earth observation
    • G06T2207/30184: Infrastructure

Abstract

Abstract This is a technology based on machine learning which realizes the automatic recognition of buildings through semantic segmentation by means of deep learning. The invention consists of the following steps: input the image into the improved DCNN (with atrous convolution and an ASPP module) and get a rough prediction result; expand it to the original size by bilinear interpolation; refine the prediction result with a fully connected CRF; get the final output. It does not require manual selection of features and can accurately and quickly identify buildings of different shapes and heights in macroscopic remote sensing images.

Description

The invention consists of the following steps: input the image into the improved DCNN (with atrous convolution and an ASPP module) and get a rough prediction result; expand it to the original size by bilinear interpolation; refine the prediction result with a fully connected CRF; get the final output. It does not require manual selection of features and can accurately and quickly identify buildings of different shapes and heights in macroscopic remote sensing images.
TITLE
Building extraction application based on machine learning in Urban-Suburban-Integration Area
FIELD OF THE INVENTION
This invention is an application of convolutional neural networks in digital image processing. What is special is that, with reference to remote sensing images, it can accurately and quickly identify the different shapes, heights and materials of buildings in a macroscopic image with rich information.
BACKGROUND OF THE INVENTION
The advent of big data has led to an increase in data volume and a complication of data structure. The storage and processing of huge amounts of information has become a new opportunity and an intense challenge in computing and other fields. Recently, deep learning has become a hotspot in the field of artificial intelligence. It has also received great attention in the fields of computer vision, speech recognition and natural language processing. Deep learning can explore more essential features, making the features represent the data more reasonably and accurately. Deep learning originated from neural networks; it transforms the neural network from shallow learning to deep learning and improves the network's ability to process data by adding multiple hidden layers between the input and output layers and by expressing the input information hierarchically.
Machine learning promotes the development of the field of computer vision in various aspects. Since deep learning provides more accurate features of images, it makes recognition more accurate and quick. Semantic segmentation is the basis of visual analysis: it refers to dividing the image into regions with certain semantic meanings, identifying the semantic category of each region block, then carrying out the reasoning process from the bottom level to the high-level semantics, and finally obtaining a segmented image with pixel-wise semantic annotation.
In early research, people used convolutional neural networks in image semantic recognition to recognize two-dimensional graphics invariant to displacement, scaling and other forms of distortion. The basic structure contains two parts. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and local features are extracted. Once a local feature is extracted, its positional relationships with other features are also determined. The second is the feature mapping layer: each computing layer of the network is composed of multiple feature maps, each feature map is a plane, and the neurons in a plane share equal weights. This has become a basic framework for subsequent development. Because the neurons on a mapped surface share weights, the number of free network parameters is reduced. Each convolutional layer in the convolutional neural network is followed by a computational layer for local averaging and secondary extraction. This unique two-stage feature extraction structure reduces the feature resolution.
Later, based on CNN, people proposed many convolution structures.
After many experiments, people proposed a new convolution method, 'atrous convolution', as a powerful tool for dense prediction tasks, which explicitly controls the resolution of the corresponding features in the deep convolutional neural network. The network can effectively expand the field of view of the filter so that it can integrate larger contexts without increasing parameters or computation. It also enables more precise determination of the positional relationships between features. This is the convolutional network to which the invention is applied.
Since the retention of information in the feature image is mainly determined by the pooling method in the neural network, the earliest max-pooling method retained only the point with the largest response. However, this method neglects features that people really need. Later, a mean-pooling method was proposed to obtain global information, but this method still has drawbacks: it has no ability to select features independently. PSPNet addresses this with a pooling method whose essence is the ability to assign different weights to different features during pooling, so that features are selectively expressed and the semantic information is truly preserved.
Exploiting the characteristics of atrous convolution, the invention adopts atrous spatial pyramid pooling (ASPP) to obtain more robust segmentation results with multi-scale information. ASPP applies atrous convolutional layers with multiple sampling rates in parallel to detect and capture objects and image context at multiple scales. It is built on the basis of pyramid pooling, thus it preserves the hierarchical global prior of PSPNet, including information characteristics at different scales among different levels.
The segmentation boundary results are improved by combining DCNNs with probabilistic graphical models. Through the combination of max pooling and downsampling in a DCNN, translational invariance can be achieved, but this has an impact on accuracy. The invention overcomes this problem by combining the final DCNN layer response with a fully connected CRF.
The invention addresses three challenging problems of DCNNs: 1) progressively smaller feature resolution; 2) feature fusion across different scales; 3) loss of spatial features leading to a decrease in positioning accuracy.
Since remote sensing images are the real forms of the earth's surface shot by satellite, they have the characteristics of diverse data types, strong macroscopic features and a rich variety of ground objects. Traditional semantic segmentation is only used on images with simple or obvious features. With our eyes, we can quickly and accurately identify each semantic category.
However, when faced with a remote sensing image containing buildings of different shapes, heights and materials, this is unreachable work for humans and for traditional semantic segmentation. After several tests of our invention, it was shown that the invention can achieve the purpose of quickly and accurately identifying buildings. It is also an innovative development of deep learning in the field of remote sensing.
SUMMARY OF THE INVENTION
The invention is based on the methods of machine learning. It can accurately and quickly recognize the objects shown in remote sensing images. Since the information about objects in remote sensing images has practical positioning meaning, our invention has great value and can help in remote sensing quantification and qualitative analysis in the future. It also provides a reliable reference for various fields of daily life such as agriculture and transportation.
To meet the expected purpose of the invention, we improve the convolution and pooling models on the basis of the convolutional neural network, and typical problems such as low resolution of the feature image, inaccurate edge positioning, loss of spatial information, image scale mismatch and poor model performance are optimized. The classic models FCN (Fully Convolutional Network), PSP (Pyramid Spatial Pooling), ASPP (Atrous Spatial Pyramid Pooling), atrous convolution, bilinear interpolation and CRF (Conditional Random Field) are applied in the whole model. The steps, and the ideas involved in each step of the invention, are explained in detail below, starting with the input image.
Step 1: Input the image. The images processed by this model are different from images that have high resolution and obvious differences in object types: they are remote sensing images. Their particularity is that the number of feature types is huge and there are no regular patterns; the shapes and heights of buildings differ. Only a model that can classify accurately can extract buildings that appear irregular.
Step 2: Input the image into the improved DCNN to get a rough prediction called the building coarse score map. The key to this step is the improvement of the DCNN. The original convolutional neural network model loses too much background information after continuous pooling. To solve this problem, the model uses atrous convolution and cancels the last two pooling layers, ensuring that more semantic information is merged while the final feature image size is no longer reduced; and an ASPP module with atrous convolution is proposed based on PSPNet, so that spatial information and semantic information are better integrated.
Step 3: Expand the building coarse score map to its original size by bilinear interpolation. Since continuous pooling makes the final result image much smaller than the original feature image, and the output must have the same size as the original image, upsampling must be used. However, the conventional deconvolution method loses some feature information when the image is restored. The bilinear interpolation method, by contrast, can take into account the relationship between the global and the local to obtain a more accurate restoration.
Step 4: Refine the prediction results through the fully connected CRF to get the final output. Many models do not use a CRF to optimize results, so their final edge information is not very accurate. In order to make our classifier perform better, the CRF is adopted; its merit is that it considers the label information of adjacent data. This is difficult for a common classifier, and it is precisely what a CRF is good at. A code sketch of the four steps follows.
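For concreteness, the four steps can be sketched in Python/PyTorch. This is a minimal illustration only: the names extract_buildings, dcnn and crf_refine are hypothetical, and the backbone and CRF routine are assumed to exist elsewhere; the specification does not fix these interfaces.

    import torch
    import torch.nn.functional as F

    def extract_buildings(image, dcnn, crf_refine):
        # image: (1, 3, H, W) remote sensing tile (Step 1).
        # dcnn: improved DCNN with atrous convolution and ASPP (Step 2).
        # crf_refine: fully connected CRF post-processing (Step 4).
        h, w = image.shape[2:]
        coarse = dcnn(image)                    # Step 2: coarse score map, e.g. (1, C, H/8, W/8)
        scores = F.interpolate(coarse, size=(h, w),
                               mode="bilinear", align_corners=False)  # Step 3
        probs = torch.softmax(scores, dim=1)    # per-pixel class probabilities
        return crf_refine(image, probs)         # Step 4: refined (H, W) label map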
DESCRIPTION OF THE DRAWINGS
The appended drawings are only for the purpose of description and explanation but not for limitation, wherein:
Fig. 1 Overall description of the model
Fig. 2 Atrous convolution schematic
Fig. 3 ASPP schematic
Fig. 4 Bilinear interpolation schematic
Fig. 5 CRF schematic
Fig. 6 Remote sensing data
Fig. 7 Manually tagged data
Fig. 8 Experimental results
Fig. 9 Experimental accuracy of each model
Fig. 10 Comparison of experimental accuracy
DESCRIPTION OF PREFERRED EMBODIMENTS
The repeated combination of max pooling and downsampling layers in a DCNN greatly reduces the spatial resolution of the final feature map. One remedy is to use transposed convolution to expand the feature map resolution, but this requires additional memory and computation. We advocate the use of atrous convolution to compute the feature map of any layer at any feature response resolution. As shown in figure 1, the specific method is as follows: first insert null values so that the number of computed parameters is unchanged, then extract pixel information at different scales and insert the corresponding null values. This can enlarge the receptive field while capturing information at different scales.
Explanation of the atrous convolution:
First, consider a one-dimensional signal. The atrous convolution's output is y[i], the input is x[i], and the filter of length K is ω[k]. It is defined as:
y[i] = Σ_{k=1}^{K} x[i + r·k] · ω[k]
The step size of the input sampling is the rate parameter r, and the standard sampling rate is r = 1, as shown in Fig. 2(a); Fig. 2(b) shows the sampling condition with rate r = 2.
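As a minimal NumPy sketch of this definition (the zero padding at the right boundary is an assumption made for illustration; the specification does not prescribe boundary handling):

    import numpy as np

    def atrous_conv1d(x, w, r=1):
        # y[i] = sum_{k=1..K} x[i + r*k] * w[k], zero-padded on the right.
        K, N = len(w), len(x)
        xp = np.concatenate([x, np.zeros(K * r)])
        return np.array([sum(xp[i + r * k] * w[k - 1] for k in range(1, K + 1))
                         for i in range(N)])

    x = np.arange(8.0)
    w = np.array([1.0, -2.0, 1.0])
    print(atrous_conv1d(x, w, r=1))  # standard sampling, r = 1
    print(atrous_conv1d(x, w, r=2))  # rate 2: taps x[i+2], x[i+4], x[i+6]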
The atrous convolution enlarges the receptive field of the filter: the rate r introduces r - 1 zeros between consecutive filter taps, effectively extending the receptive field from k × k to k_e = k + (k - 1)(r - 1) without increasing the number of parameters or the amount of computation. In a DCNN it is common to use a combination of atrous convolutions to compute the final network response at high resolution (understood as sampling density). In this invention, the use of atrous convolution increases the density of the features by a factor of four, and the feature response is then upsampled by bilinear interpolation by a factor of 8 to the original resolution.
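A quick check of the formula k_e = k + (k - 1)(r - 1) against a dilated convolution in PyTorch (an illustrative check, not part of the specification):

    import torch
    import torch.nn as nn

    k, r = 3, 4
    k_e = k + (k - 1) * (r - 1)                 # effective kernel size: 9
    conv = nn.Conv2d(1, 1, kernel_size=k, dilation=r, bias=False)
    y = conv(torch.zeros(1, 1, 32, 32))
    # Without padding, the output shrinks by the span of the dilated kernel
    # minus one, so that span equals k_e.
    print(k_e, 32 - y.shape[-1] + 1)            # both print 9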
In reality, objects often exist at multiple scales. One way to address this is to rescale a picture into different versions and aggregate the features to predict the results. After many experiments, although this method improves performance, it requires a lot of storage space. Finally, inspired by SPP, a similar structure was proposed which applies atrous convolutions with different sampling rates in parallel to a given input; this is equivalent to capturing the context of images at multiple scales and is called the ASPP (atrous spatial pyramid pooling) module. Since an ordinary convolutional neural network cannot efficiently handle the relationships between pixels and the global information, a global average pooling step must be used. The pyramid pooling produces features at different levels that are concatenated smoothly into one layer. This layer has a global prior and contains information at different scales from different sub-regions. In order to improve on the defects of PSPNet itself, this model proposes the pooling module ASPP suited to atrous convolution, so that it can maintain complete global information and prevent the possible loss of information that convolution kernels can bring in extreme cases.
Figure 3 shows the principle of ASPP. In SPP, the pool size is determined according to the size of the input so as to obtain feature maps of the same size. Through atrous convolutions of different rates and further processing, feature maps of the same size are obtained. ASPP superimposes several features of different rates on one image after retaining the pooled maximum-resolution image, which ensures the resolution of the image and obtains a large amount of feature information, improving the ability to capture global information.
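A minimal PyTorch sketch of such an ASPP module follows. The channel widths, the rate set (2, 4, 8, 16) read from figure 3, and the 1×1 fusion convolution are illustrative assumptions, not dimensions fixed by this specification:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ASPP(nn.Module):
        def __init__(self, in_ch, out_ch, rates=(2, 4, 8, 16)):
            super().__init__()
            # Parallel atrous convolutions; padding = rate keeps spatial size.
            self.branches = nn.ModuleList(
                [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
            )
            self.global_branch = nn.Conv2d(in_ch, out_ch, 1)  # after global pooling
            self.fuse = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

        def forward(self, x):
            h, w = x.shape[2:]
            feats = [b(x) for b in self.branches]
            g = self.global_branch(F.adaptive_avg_pool2d(x, 1))   # global context
            feats.append(F.interpolate(g, size=(h, w),
                                       mode="bilinear", align_corners=False))
            return self.fuse(torch.cat(feats, dim=1))

    aspp = ASPP(512, 256)
    print(aspp(torch.randn(1, 512, 32, 32)).shape)  # torch.Size([1, 256, 32, 32])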
As we all know, pooling reduces the size of the image. For example, after the five pooling stages of VGG16, the image is reduced by a factor of 32. In order to get a segmentation map as large as the original image, we need an upsampling operation to make the image resolution consistent with the original. This invention uses the bilinear interpolation method to upsample the image, as shown in figure 4.
For (i, j+v), the gray level changes linearly from f(i, j) to f(i, j+1), so:
f(i, j+v) = [f(i, j+1) - f(i, j)] · v + f(i, j)
For the same reason, for (i+1, j+v):
f(i+1, j+v) = [f(i+1, j+1) - f(i+1, j)] · v + f(i+1, j)
The gray level change from f(i, j+v) to f(i+1, j+v) is also linear, from which the formula for the gray value of the target pixel can be derived:
f(i+u, j+v) = (1-u)(1-v) · f(i, j) + (1-u)v · f(i, j+1) + u(1-v) · f(i+1, j) + uv · f(i+1, j+1)
Bilinear interpolation involves more computation than the nearest-neighbor method, but it has no gray-level discontinuity and the result is generally satisfactory.
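A direct NumPy transcription of this formula, as a sketch (boundary clamping and whole-image upsampling are omitted for brevity):

    import numpy as np

    def bilinear(img, y, x):
        # Gray value at the fractional position (i+u, j+v) from its four
        # integer-grid neighbors, exactly as in the formula above.
        i, j = int(np.floor(y)), int(np.floor(x))
        u, v = y - i, x - j
        return ((1 - u) * (1 - v) * img[i, j] + (1 - u) * v * img[i, j + 1]
                + u * (1 - v) * img[i + 1, j] + u * v * img[i + 1, j + 1])

    img = np.array([[10.0, 20.0], [30.0, 40.0]])
    print(bilinear(img, 0.5, 0.5))  # 25.0, the mean of the four neighbors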
The basic classification network extracts coarse features, then conducts classification prediction and generates the segmentation map, which is optimized for output. In order to ensure the accuracy of the edges, the invention finally connects a fully connected conditional random field (CRF) to refine the boundary. The model of the CRF is shown in Fig. 5. The label of each pixel is used as a random variable, and the relationships between pixels are used as edges; this constitutes a conditional random field, and once the global observation is available, the CRF can model these labels. The global observation is usually the input image.
Let the random variable X_i be the label of pixel i, with X_i ∈ L = {l_1, l_2, ..., l_L}; let X = (X_1, X_2, ..., X_N) be the random vector over all pixels, where N is the number of pixels in the image. Given the graph G = (V, E), where V = {X_1, X_2, ..., X_N}, and the global observation I, the conditional random field (I, X) conforms to the Gibbs distribution:
P(X = x | I) = (1 / Z(I)) · exp(-E(x | I))
In a fully connected CRF model, the energy of a labeling x can be expressed as:
E(x) = Σ_i θ_i(x_i) + Σ_{i,j} θ_ij(x_i, x_j)
Here θ_i(x_i) is the unary energy term, representing the energy of assigning label x_i to pixel i, and the binary energy term θ_ij(x_i, x_j) is the energy of simultaneously assigning labels x_i, x_j to the pixel pair i, j. The binary term describes the relationship between pixels and encourages similar pixels to take the same label, while pixels with larger differences take different labels; the notion of difference is related to the color values and the actual relative distance. In this way the CRF makes the image split as far as possible at boundaries, and minimizing the energy above finds the most likely segmentation. What distinguishes DenseCRF is that its binary potential function describes the relationship between each pixel and all other pixels, hence the name 'fully connected'. Specifically, the unary energy term in the model comes directly from the output of the front-end FCN and is calculated as follows:
θ_i(x_i) = -log P(x_i)
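As a one-line illustration of the unary term (assuming probs is the softmax output of the front-end network at one pixel; the values are made up):

    import numpy as np

    probs = np.array([0.7, 0.2, 0.1])   # P(x_i) for each candidate label at pixel i
    unary = -np.log(probs)              # theta_i(x_i) = -log P(x_i)
    print(unary)                        # the likely label has the lowest energy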
The binary energy term is calculated as follows:
θ_ij(x_i, x_j) = μ(x_i, x_j) · [ω_1 · exp(-||p_i - p_j||^2 / (2σ_α^2) - ||I_i - I_j||^2 / (2σ_β^2)) + ω_2 · exp(-||p_i - p_j||^2 / (2σ_γ^2))]
Here μ(x_i, x_j) = 1 when x_i ≠ x_j, and 0 otherwise; that is, there is a penalty only when the labels differ. The remaining expression consists of two Gaussian kernel functions in different feature spaces. The first, a bilateral Gaussian, encourages pixels with similar RGB values and positions to take similar labels; the second considers only pixel positions, which amounts to applying a smoothing term. The hyperparameters σ_α, σ_β and σ_γ control the scale of the Gaussian kernels.
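To make the pairwise term concrete, the following NumPy sketch evaluates it for a single pixel pair. The kernel weights and bandwidths are illustrative assumptions, and in practice a dedicated DenseCRF inference library is used rather than this direct form:

    import numpy as np

    def pairwise_energy(p_i, p_j, I_i, I_j, x_i, x_j,
                        w1=5.0, w2=3.0, s_a=40.0, s_b=10.0, s_g=3.0):
        if x_i == x_j:
            return 0.0                          # mu = 0: equal labels cost nothing
        d_pos = np.sum((p_i - p_j) ** 2)        # squared position distance
        d_col = np.sum((I_i - I_j) ** 2)        # squared color distance
        bilateral = w1 * np.exp(-d_pos / (2 * s_a ** 2) - d_col / (2 * s_b ** 2))
        smooth = w2 * np.exp(-d_pos / (2 * s_g ** 2))
        return bilateral + smooth               # high for similar, nearby pixels

    # Two nearby pixels with similar color: giving them different labels is costly.
    print(pairwise_energy(np.array([10.0, 10.0]), np.array([11.0, 10.0]),
                          np.array([120.0, 130.0, 90.0]),
                          np.array([118.0, 133.0, 88.0]),
                          x_i=0, x_j=1))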
The experiment uses a color remote sensing image with a resolution of 10 meters. A part of it is used as training samples, as shown in Figure 6. The types of objects include roads, vegetation, cars, buildings, etc. The image with the buildings marked is used as the standard reference image, as shown in Figure 7. The samples are identified by the model, and the result is shown in Figure 8. It can be seen from the experimental results that all the building areas are identified accurately and the building edges are predicted in the output image. In order to ensure the reliability of the results, we identify the same training samples with different models and compare the accuracy results over 99 trainings. It is found that all three precision metrics of our model are better than those of the other seven models, and the computation time is reduced. The experimental results show that the use of the CRF has a certain optimizing effect on the edges, and the addition of the ASPP pooling method also greatly preserves the semantic information; the accuracy is improved by at least 7% compared with the other results.
This invention is a convolutional network built on existing neural networks and adapted to the input image types and intended usage. The pooling method is changed considerably: the model not only retains the strengths of the previous models but also applies a series of optimizations to them. Across many experiments, the average accuracy of the model is higher than that of the others. In particular, the use of bilinear interpolation to enlarge the feature image and of the CRF to obtain the optimized image improves the precision.

Claims (1)

1. A method of building extraction based on machine learning in an Urban-Suburban-Integration Area, which combines computer vision with remote sensing images. With the improved model, buildings of different shapes and heights can be accurately identified in various images of ground objects without training samples; this method is an innovation in both computer vision and remote sensing.
[figure 1: overall pipeline: improved DCNN, bilinear interpolation, fully connected CRF, final output]
[figure 2: atrous convolution schematic: kernel = 3, stride = 1, pad = 2, rate = 2 (insert 1 zero)]
[figure 3: ASPP schematic: Conv1 + Pool1, Blocks 4-7 at rates 2, 4, 8, 16; output stride 8 then 16]
[figure 4: bilinear interpolation schematic]
[figure 5: CRF schematic]
[figures 6-8: remote sensing data, manually tagged data, experimental results]
figure 9: Experimental accuracy of each model

model structure    avg accuracy   f1 score   mean iou
experiment model   0.873311       0.874998   0.6882995
adaptnet           0.7988692      0.8269     0.5031025
densenet-103       0.8033674      0.805057   0.6523883
encoder-decoder    0.750553       0.741214   0.5116506
mobileunet         0.7871232      0.874087   0.3988447
mobileunet-skip    0.7761902      0.776578   0.6036939
fc-densen56        0.8085084      0.808514   0.6655708
gcn-rest101        0.7856583      0.787802   0.61853
[figure 10: precision comparison diagram: avg_accuracy, f1 score and mean iou for each model]
AU2018101336A 2018-09-12 2018-09-12 Building extraction application based on machine learning in Urban-Suburban-Integration Area Ceased AU2018101336A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2018101336A AU2018101336A4 (en) 2018-09-12 2018-09-12 Building extraction application based on machine learning in Urban-Suburban-Integration Area

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2018101336A AU2018101336A4 (en) 2018-09-12 2018-09-12 Building extraction application based on machine learning in Urban-Suburban-Integration Area

Publications (1)

Publication Number Publication Date
AU2018101336A4 true AU2018101336A4 (en) 2018-10-11

Family

ID=63719905

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2018101336A Ceased AU2018101336A4 (en) 2018-09-12 2018-09-12 Building extraction application based on machine learning in Urban-Suburban-Integration Area

Country Status (1)

Country Link
AU (1) AU2018101336A4 (en)


Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084826A (en) * 2018-11-30 2019-08-02 叠境数字科技(上海)有限公司 Hair dividing method based on TOF camera
CN110084826B (en) * 2018-11-30 2023-09-12 叠境数字科技(上海)有限公司 Hair segmentation method based on TOF camera
CN109784149A (en) * 2018-12-06 2019-05-21 北京飞搜科技有限公司 A kind of detection method and system of skeleton key point
CN109784149B (en) * 2018-12-06 2021-08-20 苏州飞搜科技有限公司 Method and system for detecting key points of human skeleton
CN110163239A (en) * 2019-01-25 2019-08-23 太原理工大学 A kind of Weakly supervised image, semantic dividing method based on super-pixel and condition random field
CN109829926A (en) * 2019-01-30 2019-05-31 杭州鸿泉物联网技术股份有限公司 Road scene semantic segmentation method and device
CN110059772B (en) * 2019-05-14 2021-04-30 温州大学 Remote sensing image semantic segmentation method based on multi-scale decoding network
CN110059772A (en) * 2019-05-14 2019-07-26 温州大学 Remote sensing images semantic segmentation method based on migration VGG network
CN110717921A (en) * 2019-09-26 2020-01-21 哈尔滨工程大学 Full convolution neural network semantic segmentation method of improved coding and decoding structure
CN110717921B (en) * 2019-09-26 2022-11-15 哈尔滨工程大学 Full convolution neural network semantic segmentation method of improved coding and decoding structure
CN110852207A (en) * 2019-10-29 2020-02-28 北京科技大学 Blue roof building extraction method based on object-oriented image classification technology
CN111126202A (en) * 2019-12-12 2020-05-08 天津大学 Optical remote sensing image target detection method based on void feature pyramid network
WO2021134970A1 (en) * 2019-12-30 2021-07-08 深圳市商汤科技有限公司 Image semantic segmentation method and device and storage medium
CN111259905A (en) * 2020-01-17 2020-06-09 山西大学 Feature fusion remote sensing image semantic segmentation method based on downsampling
CN111259905B (en) * 2020-01-17 2022-05-31 山西大学 Feature fusion remote sensing image semantic segmentation method based on downsampling
CN111259828A (en) * 2020-01-20 2020-06-09 河海大学 High-resolution remote sensing image multi-feature-based identification method
CN111259828B (en) * 2020-01-20 2022-05-17 河海大学 High-resolution remote sensing image multi-feature-based identification method
CN111563448A (en) * 2020-04-30 2020-08-21 北京百度网讯科技有限公司 Method and device for detecting illegal building, electronic equipment and storage medium
CN111797703A (en) * 2020-06-11 2020-10-20 武汉大学 Multi-source remote sensing image classification method based on robust deep semantic segmentation network
CN111860173B (en) * 2020-06-22 2021-10-15 中国科学院空天信息创新研究院 Remote sensing image ground feature element extraction method and system based on weak supervision
CN111860173A (en) * 2020-06-22 2020-10-30 中国科学院空天信息创新研究院 Remote sensing image ground feature element extraction method and system based on weak supervision
CN111814771A (en) * 2020-09-04 2020-10-23 支付宝(杭州)信息技术有限公司 Image processing method and device
CN112767413A (en) * 2021-01-06 2021-05-07 武汉大学 Remote sensing image depth semantic segmentation method integrating region communication and symbiotic knowledge constraints
CN112819837B (en) * 2021-02-26 2024-02-09 南京大学 Semantic segmentation method based on multi-source heterogeneous remote sensing image
CN112819837A (en) * 2021-02-26 2021-05-18 南京大学 Semantic segmentation method based on multi-source heterogeneous remote sensing image
CN112801929A (en) * 2021-04-09 2021-05-14 宝略科技(浙江)有限公司 Local background semantic information enhancement method for building change detection
CN113205023B (en) * 2021-04-23 2022-04-15 武汉大学 High-resolution image building extraction fine processing method based on prior vector guidance
CN113205023A (en) * 2021-04-23 2021-08-03 武汉大学 High-resolution image building extraction fine processing method based on prior vector guidance
CN113963177A (en) * 2021-11-11 2022-01-21 电子科技大学 CNN-based building mask contour vectorization method
CN116012709A (en) * 2023-01-06 2023-04-25 山东建筑大学 High-resolution remote sensing image building extraction method and system
CN116012709B (en) * 2023-01-06 2023-07-18 山东建筑大学 High-resolution remote sensing image building extraction method and system

Similar Documents

Publication Publication Date Title
AU2018101336A4 (en) Building extraction application based on machine learning in Urban-Suburban-Integration Area
CN110188705B (en) Remote traffic sign detection and identification method suitable for vehicle-mounted system
CN104392463B (en) Image salient region detection method based on joint sparse multi-scale fusion
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
CN110807422A (en) Natural scene text detection method based on deep learning
CN108304873A (en) Object detection method based on high-resolution optical satellite remote-sensing image and its system
CN108595636A (en) The image search method of cartographical sketching based on depth cross-module state correlation study
US9639748B2 (en) Method for detecting persons using 1D depths and 2D texture
CN107480620B (en) Remote sensing image automatic target identification method based on heterogeneous feature fusion
CA3105272A1 (en) Human pose analysis system and method
CN110060273B (en) Remote sensing image landslide mapping method based on deep neural network
CN106815323B (en) Cross-domain visual retrieval method based on significance detection
CN113033398B (en) Gesture recognition method and device, computer equipment and storage medium
CN109299303B (en) Hand-drawn sketch retrieval method based on deformable convolution and depth network
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN109035300B (en) Target tracking method based on depth feature and average peak correlation energy
CN109903339B (en) Video group figure positioning detection method based on multi-dimensional fusion features
CN109635726B (en) Landslide identification method based on combination of symmetric deep network and multi-scale pooling
CN105405138A (en) Water surface target tracking method based on saliency detection
Hu et al. A video streaming vehicle detection algorithm based on YOLOv4
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
Li et al. An aerial image segmentation approach based on enhanced multi-scale convolutional neural network
CN109034213A (en) Hyperspectral image classification method and system based on joint entropy principle
CN111275732B (en) Foreground object image segmentation method based on depth convolution neural network
CN105844299B (en) A kind of image classification method based on bag of words

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry