CN108764039B - Neural network, building extraction method of remote sensing image, medium and computing equipment - Google Patents


Info

Publication number
CN108764039B
CN108764039B (application number CN201810373725.3A)
Authority
CN
China
Prior art keywords
layers
scale
remote sensing
neural network
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810373725.3A
Other languages
Chinese (zh)
Other versions
CN108764039A (en)
Inventor
李祥
彭玲
胡媛
肖莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Remote Sensing and Digital Earth of CAS
Original Assignee
Institute of Remote Sensing and Digital Earth of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Remote Sensing and Digital Earth of CAS filed Critical Institute of Remote Sensing and Digital Earth of CAS
Priority to CN201810373725.3A priority Critical patent/CN108764039B/en
Publication of CN108764039A publication Critical patent/CN108764039A/en
Application granted granted Critical
Publication of CN108764039B publication Critical patent/CN108764039B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06V20/176 - Urban or other man-made structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a neural network, a building extraction method for remote sensing images, a medium and computing equipment. The disclosed neural network is used for building extraction from remote sensing images and comprises: an input layer, first to fifth convolutional layers, and first to fourth pooling layers from the VGG network; a first single-scale fusion layer, the input end of which is connected to the output end of the first convolutional layer; second to fifth single-scale fusion layers, the input ends of which are respectively connected to the output ends of the second to fifth convolutional layers; first to fourth upsampling layers, the input ends of which are respectively connected to the output ends of the second to fifth single-scale fusion layers; a multi-scale stitching-fusion layer, the input end of which is connected to the output ends of the first single-scale fusion layer and the first to fourth upsampling layers; and an output layer. The disclosed neural network can effectively process densely distributed buildings of various sizes and improves the precision of automatic building extraction.

Description

Neural network, building extraction method of remote sensing image, medium and computing equipment
Technical Field
The invention relates to the field of neural networks and image processing, in particular to a neural network, a building extraction method of remote sensing images, a medium and computing equipment.
Background
With the rapid development of sensor technology, the spatial resolution of remote sensing images has continuously improved. Inspired by deep learning algorithms from the field of computer vision, most scholars currently adopt convolutional neural networks for the semantic segmentation of remote sensing images. Although some state-of-the-art methods have achieved good results on this task, certain characteristics of remote sensing images are not taken into account. Firstly, in a conventional computer vision semantic segmentation task, an image to be detected generally contains only a few to dozens of targets, and the targets are loosely distributed, as shown in fig. 1(a). In remote sensing images, however, buildings are generally distributed much more densely, especially in residential areas, as shown in fig. 1(b). Secondly, in a conventional semantic segmentation task the objects to be detected are generally large, with lengths and widths of dozens to hundreds of pixels, whereas buildings in remote sensing images are generally much smaller and vary greatly in scale (the number of pixels covered by different buildings), as shown in fig. 1(c).
To ensure the accuracy of semantic segmentation, the accuracy of building (target) extraction must first be ensured. Some prior-art solutions do combine convolutional neural networks to extract specific targets from remote sensing images. For example, patent application publication No. CN107025440A discloses a method for extracting roads from remote sensing images based on a fully convolutional neural network; the disclosed solution uses the fully convolutional network to produce structured output and thereby fully mine the two-dimensional geometric structure of roads in the remote sensing image. However, the prior art contains no effective method that fully exploits a convolutional neural network to extract feature information of buildings at different scales in a remote sensing image.
Therefore, a new technical scheme needs to be provided to combine the convolutional neural network to fuse the image features under different scales, so as to effectively improve the accuracy of the automatic extraction of buildings with different scales.
Disclosure of Invention
The neural network system according to the invention is used for building extraction from remote sensing images and comprises:
an input layer, first to fifth convolutional layers, first to fourth pooling layers in the VGG network;
the input end of the first single-scale fusion layer is connected to the output end of the first convolution layer and is used for fusing the first-scale multi-channel feature map output by the first convolution layer and outputting the fused first-scale fusion single-channel feature map;
input ends of the second to fifth single-scale fusion layers are respectively connected to output ends of the second to fifth convolution layers and are used for respectively fusing second to fifth scale multichannel feature maps output by the second to fifth convolution layers and respectively outputting fused second to fifth scale fusion single-channel feature maps;
the input ends of the first to fourth up-sampling layers are respectively connected to the output ends of the second to fifth single-scale fusion layers;
the input end of the multi-scale stitching-fusion layer is connected to the output ends of the first single-scale fusion layer and the first to fourth upsampling layers, and the layer is used for fusing the feature maps output by the first single-scale fusion layer and the first to fourth upsampling layers and outputting the fused multi-scale fusion single-channel feature map;
an output layer, the input end of which is connected to the output end of the multi-scale stitching-fusion layer and which outputs the building feature map based on the multi-scale fusion single-channel feature map,
wherein the output ends of the first single-scale fusion layer, the first to fourth upsampling layers, and the multi-scale stitching-fusion layer each output a two-dimensional single-channel feature map with the same resolution as the remote sensing image.
The neural network system according to the present invention, further comprising:
and the first to fourth clipping layers are respectively arranged between the first to fourth up-sampling layers and the multi-scale splicing and fusing layer and are used for respectively clipping the feature maps output by the first to fourth up-sampling layers to the resolution ratio same as that of the original input image.
According to the neural network system of the present invention, the following layers are further included after the first to fifth convolutional layers:
the first to fifth ReLU layers, the first to fifth Batch Normalization layers and the first to fifth Dropout layers are used for avoiding over-fitting and improving the generalization capability of the neural network system.
The building extraction method for the remote sensing image comprises the following steps:
constructing a trained neural network system as described above;
and acquiring a building characteristic diagram corresponding to the remote sensing image by using the trained neural network system.
According to the building extraction method for remote sensing images, before the step of constructing the trained neural network system, the method further comprises the following steps:
and training the neural network system by using a data set comprising the remote sensing training image of the building and the label image corresponding to the remote sensing training image to obtain the trained neural network system.
According to the building extraction method for the remote sensing image, after the building characteristic map corresponding to the remote sensing image is obtained, the final building distribution map is obtained by using a threshold value method.
According to the building extraction method for remote sensing images, a Sigmoid Cross Entropy Loss function and a stochastic gradient descent algorithm are used when training the neural network system.
A computer-readable storage medium according to the invention, having stored thereon a computer program, which when executed by a processor, carries out the steps of the method as described above.
The computing device according to the invention comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method as described above when executing the program.
According to the technical scheme of the invention, the multi-scale information in the deep convolutional neural network is directly utilized, so that the building with dense distribution and various scales can be effectively processed, and the precision of automatic extraction of the building is improved. In addition, according to the above technical solution of the present invention, the whole image is used as an input, and the segmentation (i.e., building extraction) result is directly output without performing overlapped slices, thereby greatly improving the efficiency of building extraction.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention. In the drawings, like reference numerals are used to indicate like elements. The drawings in the following description are directed to some, but not all embodiments of the invention. For a person skilled in the art, other figures can be derived from these figures without inventive effort.
Fig. 1 shows a schematic diagram of a conventional image to be detected and a remote sensing image to be detected according to the present invention.
Fig. 2 schematically shows a schematic block diagram of a neural network system according to the present invention.
Fig. 3 schematically shows a flow chart of a building extraction method for remote sensing images according to the invention.
FIG. 4 schematically illustrates different image maps output by the various layers of the neural network system shown in FIG. 2.
Fig. 5 exemplarily shows an original satellite remote sensing image, a corresponding real label image thereof, and a building feature diagram actually output according to the technical solution of the present invention.
Fig. 6 shows exemplary accuracy vs. recall curves for different relaxation coefficients according to the solution of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.
Fig. 1 shows a schematic diagram of a conventional image to be detected and a remote sensing image to be detected according to the present invention.
As described in the background art with reference to fig. 1, because the remote sensing image and the conventional image have the above difference, a new technical solution needs to be provided to combine the convolutional neural network to fuse the image features at different scales, so as to effectively improve the accuracy of the automatic extraction of buildings at different scales.
Fig. 2 schematically shows a schematic block diagram of a neural network system according to the present invention.
As shown in fig. 2, the neural network system according to the present invention, for building extraction of a remote sensing image, includes:
input layer (corresponding to "input image" in fig. 2), first to fifth convolutional layers (corresponding to the layer sets containing "Conv1_2", "Conv2_2", "Conv3_3", "Conv4_3" and "Conv5_3" in fig. 2, respectively), and first to fourth pooling layers (corresponding to the "Pool1", "Pool2", "Pool3" and "Pool4" layers in fig. 2, respectively) from the VGG network;
a first single-scale fusion layer (corresponding to the 1 st "Conv" on the left side of the horizontal distribution in fig. 2), an input end of the first single-scale fusion layer being connected to an output end of the first convolution layer, for fusing the first-scale multi-channel feature map output by the first convolution layer and outputting a fused first-scale fused single-channel feature map;
second to fifth single-scale fusion layers (corresponding to 2 nd to 5 th "Conv" horizontally distributed in fig. 2, respectively), the input ends of which are connected to the output ends of the second to fifth convolution layers, respectively, for fusing the second to fifth scale multi-channel feature maps output by the second to fifth convolution layers, respectively, and outputting the fused second to fifth scale fusion single-channel feature maps, respectively;
first to fourth Upsampling layers (corresponding to "2 × Upsampling", "4 × Upsampling", "8 × Upsampling", and "16 × Upsampling" in fig. 2, respectively), input ends of the first to fourth Upsampling layers being connected to output ends of the second to fifth single-scale fusion layers, respectively;
a multi-scale splicing and fusing layer (corresponding to the "Concat" layer in fig. 2), wherein an input end of the multi-scale splicing and fusing layer is connected to output ends of the first single-scale fusing layer and the first to fourth upsampling layers, and is used for fusing the feature maps output by the first single-scale fusing layer and the first to fourth upsampling layers and outputting a fused multi-scale fusion single-channel feature map;
an output layer (corresponding to "Conv" above "P" in fig. 2), an input end of which is connected to an output end of the multi-scale stitch-fusion layer, for outputting a building feature map (corresponding to "P" in fig. 2) based on the multi-scale fusion single-channel feature map,
and the output ends of the first single-scale fusion layer, the first to fourth upsampling layers, and the multi-scale stitching-fusion layer each output a two-dimensional single-channel feature map with the same resolution as the remote sensing image.
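The first to fourth upsampling layers restore the coarser fused maps toward the input resolution before stitching. As an illustration only, a nearest-neighbour upsampling can be sketched in NumPy as follows; the function name and the choice of nearest-neighbour repetition are assumptions, since fig. 2 specifies only the factors 2x, 4x, 8x and 16x, not the interpolation scheme:

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Enlarge a (C, H, W) feature map by an integer factor using
    nearest-neighbour repetition along the two spatial axes."""
    return fmap.repeat(factor, axis=-2).repeat(factor, axis=-1)

coarse = np.arange(4.0).reshape(1, 2, 2)   # a tiny 2x2 single-channel map
fine = upsample_nearest(coarse, 2)          # -> shape (1, 4, 4)
```

In the actual network a bilinear or learned upsampling would behave analogously, only with smoother interpolated values.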
In the above technical solution, although the number of channels C of the input feature maps of the first to fifth single-scale fusion layers differs (64, 128, 256, 512 and 512, respectively, as shown in fig. 2), each layer uses a 1 × 1 convolution kernel of dimension 1 × 1 × C × 1 (with C = 64, 128, 256, 512 and 512, respectively) to fuse all input feature maps at its scale, so that each finally outputs a single-channel feature map (the first to fifth scale fusion single-channel feature maps, which are single-channel feature maps of 5 different resolutions: 256², 128², 64², 32² and 16², respectively).
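The per-scale fusion just described is a 1 × 1 convolution that collapses C channels into one: each output pixel is a weighted sum over the C channel values at that location. A minimal NumPy sketch (with illustrative, untrained weights) is:

```python
import numpy as np

def single_scale_fuse(fmap, weights, bias=0.0):
    """Fuse a (C, H, W) multi-channel feature map into one channel with a
    1x1 convolution; tensordot over the channel axis is exactly that
    per-pixel weighted sum."""
    return np.tensordot(weights, fmap, axes=([0], [0])) + bias

fmap = np.ones((64, 8, 8))            # e.g. the first-scale 64-channel map
w = np.full(64, 1.0 / 64)             # hypothetical averaging weights
fused = single_scale_fuse(fmap, w)    # -> single-channel (8, 8) map
```

In training, the weights and bias would be learned rather than fixed to an average as here.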
For the multi-scale stitching-fusion layer, the fusion is analogous to that of the first to fifth single-scale fusion layers, except that the number of input channels C is 5 (the feature map output by the first single-scale fusion layer plus the 4 feature maps output by the first to fourth upsampling layers, all with the same resolution as the original remote sensing image). These 5 feature maps are therefore stitched into one 5-channel map, and a single-channel prediction map (the multi-scale fusion single-channel feature map) is obtained with a 1 × 1 convolution kernel having 5 input channels and 1 output channel.
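The stitch-and-fuse step can be sketched the same way: stack the 5 same-resolution single-channel maps ("Concat") and collapse them with a 1 × 1 convolution over the 5 channels. The weights below are illustrative placeholders, not the trained values:

```python
import numpy as np

def multi_scale_fuse(maps, weights, bias=0.0):
    """Stack 5 same-resolution single-channel maps into a (5, H, W)
    tensor and collapse them with a 1x1 convolution over the 5 channels,
    yielding the single-channel prediction map."""
    stacked = np.stack(maps, axis=0)                   # (5, H, W)
    return np.tensordot(weights, stacked, axes=([0], [0])) + bias

H = W = 16
maps = [np.full((H, W), v) for v in (0.1, 0.2, 0.3, 0.4, 0.5)]
w = np.ones(5) / 5                                     # hypothetical weights
pred = multi_scale_fuse(maps, w)                       # -> (16, 16)
```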
Although the solution shown in fig. 2 does not require clipping of the first to fourth upsampling layers, optionally, the neural network system may further include:
the first to fourth clipping layers (respectively corresponding to "Crop" in the layer set where "P2" to "P5" in fig. 2) are respectively disposed between the first to fourth upsampling layers and the multi-scale stitching fusion layer, and are used for respectively clipping the feature maps output by the first to fourth upsampling layers to the same resolution as the original input image. And automatically adapting to the condition that the resolution of the remote sensing image is inconsistent with the resolution of the output feature maps of the first to fourth upsampling layers.
Optionally, after the first to fifth convolutional layers, the neural network system further includes the following layers (not shown in fig. 2):
the first to fifth ReLU layers, the first to fifth Batch Normalization layers and the first to fifth Dropout layers are used for avoiding over-fitting and improving the generalization capability of the neural network system.
The shallow layers of the network shown in fig. 2 generate feature maps with fine spatial resolution but low-level semantic information; the deep layers generate coarse feature maps with high-level semantic information; and the middle layers produce feature maps corresponding to intermediate-level features. The present technical solution integrates these different feature maps, so that buildings with different appearances, or partially occluded buildings, can be effectively extracted.
Fig. 3 schematically shows a flow chart of a building extraction method for remote sensing images according to the invention.
The building extraction method for the remote sensing image comprises the following steps:
step S302: constructing a trained neural network system as described above;
step S304: the trained neural network system is used to obtain the building probability map (the building feature map, i.e. the building extraction prediction map "P" described above) corresponding to the remote sensing image (the "input image" in fig. 2).
Optionally, the building extraction method for remote sensing images further includes, before step S302:
step S302': the neural network system is trained using a data set comprising remote sensing training images of buildings (the "input image" in fig. 2) and the label images corresponding to those training images, to obtain the trained neural network system.
Optionally, in step S304 and step S302', the final building extraction result is obtained from the building feature map using a threshold method.
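Such a threshold method can be as simple as the following sketch; the cut-off value 0.5 is an illustrative assumption, since the patent does not fix a threshold:

```python
import numpy as np

def threshold_buildings(prob_map, t=0.5):
    """Binarise the building probability map: pixels with probability >= t
    are labelled building (1), the rest background (0)."""
    return (prob_map >= t).astype(np.uint8)

prob = np.array([[0.1, 0.8],
                 [0.6, 0.4]])
binary = threshold_buildings(prob)   # -> [[0, 1], [1, 0]]
```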
Optionally, in step S302', a Sigmoid Cross Entropy Loss function (the calculation corresponding to "Loss" in fig. 2) and a stochastic gradient descent algorithm are used when training the neural network system.
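The sigmoid cross entropy loss and a single stochastic-gradient-descent update can be sketched as follows. The mean reduction and the learning rate are assumptions; the patent names the loss and the optimizer but not these details:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cross_entropy(logits, labels):
    """Per-pixel binary cross entropy on sigmoid outputs, averaged over
    the image (mean reduction assumed; clipped for numerical safety)."""
    p = np.clip(sigmoid(logits), 1e-12, 1.0 - 1e-12)
    return -np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

def sgd_step(params, grad, lr=0.01):
    """One stochastic-gradient-descent update (learning rate illustrative)."""
    return params - lr * grad

logits = np.array([[2.0, -2.0], [0.0, 3.0]])
labels = np.array([[1.0, 0.0], [1.0, 1.0]])
loss = sigmoid_cross_entropy(logits, labels)
# Gradient of the mean-reduced loss w.r.t. the logits: (sigmoid(z) - y) / N.
grad = (sigmoid(logits) - labels) / logits.size
# Illustrative update (in practice SGD updates the network weights).
new_logits = sgd_step(logits, grad)
```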
So that those skilled in the art may better understand the technical advantages of the present invention, specific embodiments are described below.
FIG. 4 schematically illustrates different image maps output by the various layers of the neural network system shown in FIG. 2.
As shown in fig. 4, fig. 4(a) is an original satellite remote sensing image (the "input image" in fig. 2, at the first scale) selected from the Massachusetts remote sensing data set. Fig. 4(b) is the feature map ("P2" in fig. 2) obtained by interpolating the second-scale feature map, which has a small receptive field; from it, low-level features such as edges and corners of the original image can be extracted. Fig. 4(c) is the interpolated feature map ("P3" in fig. 2) of the third-scale feature map, which has a larger receptive field and can delineate a preliminary outline of a building. Fig. 4(d) is the feature map ("P4" in fig. 2) obtained by interpolating the fourth-scale feature map, which has a still larger receptive field; from it, non-building regions such as lakes can be identified. Fig. 4(e) is the feature map ("P5" in fig. 2) obtained by interpolating the fifth-scale feature map, which has the largest receptive field; from it, non-building areas such as lakes and bare land can be identified. Finally, integrating the semantic and spatial information of these multiple levels yields a reliable prediction (the multi-scale fusion single-channel feature map, i.e. "P" in fig. 2), as shown in fig. 4(f).
Fig. 5 exemplarily shows an original satellite remote sensing image, a corresponding real label image thereof, and a building feature diagram actually output according to the technical solution of the present invention.
As shown in fig. 5, fig. 5(a) is an original satellite remote sensing image selected from Massachusetts remote sensing data sets, fig. 5(b) is a real label image thereof, and fig. 5(c) is a predicted label image (i.e., a building feature image actually output according to the technical solution of the present invention). According to the technical scheme, the building distribution condition can be well predicted, and the building boundary is accurate.
Fig. 6 shows exemplary accuracy vs. recall curves for different relaxation coefficients according to the solution of the invention.
The accuracy (precision) is defined as the proportion of detected pixels that lie within ρ pixels of a true pixel, and the recall is defined as the proportion of true pixels that lie within ρ pixels of a detected pixel. Fig. 6(a) is the accuracy-recall curve of the present invention at ρ = 3, corresponding to a model accuracy of about 0.9668 (the breakeven point, where accuracy and recall are equal, marked with the symbol x in fig. 6(a)). Fig. 6(b) is the accuracy-recall curve at ρ = 0, corresponding to a model accuracy of about 0.8424 (the breakeven point marked with the symbol x in fig. 6(b)).
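The relaxed precision and recall defined above can be computed by dilating each binary mask by ρ pixels before intersecting, as in this sketch (a square neighbourhood is assumed; ρ = 0 reduces to the ordinary per-pixel metrics):

```python
import numpy as np

def dilate(mask, rho):
    """Binary dilation of a 2-D mask with a (2*rho+1)-square neighbourhood."""
    h, w = mask.shape
    padded = np.pad(mask, rho)
    out = np.zeros_like(mask)
    for dy in range(2 * rho + 1):
        for dx in range(2 * rho + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def relaxed_precision_recall(pred, truth, rho):
    """A detected pixel counts as correct if any true pixel lies within
    rho pixels of it, and vice versa for recall."""
    precision = (pred & dilate(truth, rho)).sum() / max(pred.sum(), 1)
    recall = (truth & dilate(pred, rho)).sum() / max(truth.sum(), 1)
    return precision, recall

truth = np.zeros((8, 8), dtype=bool)
truth[3, 3] = True
pred = np.zeros((8, 8), dtype=bool)
pred[3, 5] = True                    # a detection 2 pixels off target
```

With this offset detection, both metrics are 0 at ρ = 0 but 1 at ρ = 3, illustrating why the breakeven score rises with the relaxation coefficient.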
Table 1 compares building extraction performance between the technical scheme of the present invention and several other schemes: the Mnih-CNN and Mnih-CNN+CRF schemes disclosed in Mnih's doctoral thesis "Machine learning for aerial image labeling" (2013), and the Saito-multi-MA and Saito-multi-MA&CIS schemes disclosed in Saito's "Multiple object extraction from aerial imagery with convolutional neural networks".
TABLE 1 comparison of Performance between different technical solutions
Model Breakeven (ρ=3) Breakeven (ρ=0) Prediction time (s)
Mnih-CNN 0.9271 0.7661 8.7
Mnih-CNN+CRF 0.9282 0.7638 26.6
Saito-multi-MA 0.9503 0.7873 67.72
Saito-multi-MA&CIS 0.9509 0.7872 67.84
Technical scheme of the invention 0.9668 0.8424 2.05
Note: the prediction time is the average time required to predict a single 1500 × 1500 test image; the graphics card used is an NVIDIA TITAN X.
As can be seen from the results in Table 1, the above solution according to the present invention achieves better results both in model accuracy under different relaxation coefficients (ρ = 3 and ρ = 0) and in prediction time: it not only significantly improves extraction precision but also reduces running time.
In connection with the above technical solution according to the present invention, a computer-readable storage medium is also provided, on which a computer program is stored, which when executed by a processor implements the steps of the method shown in fig. 3.
In combination with the above technical solution according to the present invention, a computing device is further provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the steps of the method shown in fig. 3 are implemented.
According to the technical scheme of the invention, the VGG network is used as the basic structure; the last layer at each feature-map resolution in the network is extracted, and a convolution operation fuses each set of feature maps into a single-channel feature map. The final prediction result is then obtained through upsampling and feature-map stitching.
According to the technical scheme of the invention, the multi-scale information in the deep convolutional neural network is directly utilized, so that the building with dense distribution and various scales can be effectively processed, and the precision of automatic extraction of the building is improved. In addition, according to the above technical solution of the present invention, the whole image is used as an input, and the segmentation (i.e., building extraction) result is directly output without performing overlapped slices, thereby greatly improving the efficiency of building extraction.
According to the technical scheme of the invention, the method also has the following advantages: 1) feature maps at multiple resolutions can be fused, so multi-scale information is extracted from the input image and buildings are extracted accurately; 2) since no model ensembling is required during prediction (i.e., extraction) and no post-processing is needed, building extraction efficiency is greatly improved; 3) since a fully convolutional network is used, an input image of arbitrary size can be accepted, as GPU memory permits.
In addition, according to the technical scheme of the invention, the whole image is directly used as input, and the segmentation (namely, building extraction) result can be obtained through one-time network forward propagation, so that model integration in a mode of overlapping slices is not required, post-processing operation is not required, and the building extraction efficiency is greatly improved. The result of the comparison test based on the Massachusetts remote sensing data set shows that the technical scheme of the invention is obviously superior to other methods in precision and efficiency.
The above-described aspects may be implemented individually or in various combinations, and such variations are within the scope of the present invention.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media, as known to those skilled in the art.
Finally, it should be noted that: the above examples are only for illustrating the technical solutions of the present invention, and are not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A neural network system for building extraction from remote sensing images, comprising:
an input layer, and the first to fifth convolutional layers and first to fourth pooling layers of a VGG network;
a first single-scale fusion layer, the input end of which is connected to the output end of the first convolutional layer, for fusing the first-scale multi-channel feature map output by the first convolutional layer and outputting a fused first-scale single-channel feature map;
second to fifth single-scale fusion layers, the input ends of which are connected to the output ends of the second to fifth convolutional layers, respectively, for fusing the second- to fifth-scale multi-channel feature maps output by the second to fifth convolutional layers and outputting the corresponding fused second- to fifth-scale single-channel feature maps;
first to fourth upsampling layers, the input ends of which are connected to the output ends of the second to fifth single-scale fusion layers, respectively;
a multi-scale splicing and fusion layer, the input ends of which are connected to the output ends of the first single-scale fusion layer and the first to fourth upsampling layers, for fusing the feature maps output by the first single-scale fusion layer and the first to fourth upsampling layers and outputting a fused multi-scale single-channel feature map;
and an output layer, the input of which is connected to the output of the multi-scale splicing and fusion layer, for outputting a building feature map based on the multi-scale fused single-channel feature map,
wherein the first single-scale fusion layer, the first to fourth upsampling layers, and the multi-scale splicing and fusion layer each output a two-dimensional single-channel feature map with the same resolution as the remote sensing image.
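For illustration only (not part of the claims), the data flow of claim 1 can be sketched with toy numpy arrays: each single-scale fusion layer collapses a multi-channel feature map to one channel (a 1x1 convolution), the upsampling layers restore the original resolution, and the splicing-and-fusion layer combines the five single-channel maps before the output layer. The channel count, the 16x16 image size, nearest-neighbour upsampling, and the fusion weights are all assumptions of this sketch, not specifics taken from the patent.

```python
import numpy as np

def single_scale_fuse(feat, w, b=0.0):
    # 1x1 convolution: collapse a (C, H, W) multi-channel map to an (H, W) single-channel map.
    return np.tensordot(w, feat, axes=([0], [0])) + b

def upsample_nearest(img, factor):
    # Nearest-neighbour upsampling of a 2-D map back to the input resolution (assumed mode).
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Toy multi-channel feature maps at five scales for a 16x16 "remote sensing image";
# each scale halves the resolution, as the VGG pooling layers would.
feats = [rng.standard_normal((8, 16 // 2**s, 16 // 2**s)) for s in range(5)]

# Single-scale fusion: one single-channel map per scale.
fused = [single_scale_fuse(f, rng.standard_normal(8)) for f in feats]

# Upsample scales 2-5 back to the original resolution (scale 1 already matches it).
full_res = [fused[0]] + [upsample_nearest(fused[s], 2**s) for s in range(1, 5)]

# Multi-scale splicing and fusion: stack the five maps, fuse them with another 1x1
# convolution, and let a sigmoid output layer produce the building probability map.
prob = sigmoid(np.tensordot(np.full(5, 0.2), np.stack(full_res), axes=([0], [0])))
```

All intermediate single-channel maps and the final probability map come out at the input resolution, which is exactly the property claim 1 states for the fusion and upsampling layers.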
2. The neural network system of claim 1, further comprising:
first to fourth cropping layers, arranged between the first to fourth upsampling layers and the multi-scale splicing and fusion layer, respectively, for cropping the feature maps output by the first to fourth upsampling layers to the same resolution as the original input image.
3. The neural network system of claim 1 or 2, further comprising, after each of the first to fifth convolutional layers:
first to fifth ReLU layers, first to fifth Batch Normalization layers, and first to fifth Dropout layers, for avoiding overfitting and improving the generalization capability of the neural network system.
4. A building extraction method for remote sensing images is characterized by comprising the following steps:
constructing a trained neural network system according to any one of claims 1 to 3;
and acquiring a building feature map corresponding to the remote sensing image by using the trained neural network system.
5. The building extraction method for remote sensing images according to claim 4, further comprising, before the step of constructing the trained neural network system according to any one of claims 1 to 3:
training the neural network system with a data set comprising remote sensing training images of buildings and their corresponding label images, to obtain the trained neural network system.
6. The building extraction method for remote sensing images according to claim 4 or 5, characterized in that, after the building feature map corresponding to the remote sensing image is obtained, a final building distribution map is obtained by using a threshold method.
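As an illustrative sketch of the threshold method in claim 6 (the 0.5 cut-off and the toy probability values are assumptions; the claim does not fix a threshold): pixels of the building feature map at or above the threshold are marked as building, the rest as background.

```python
import numpy as np

# Hypothetical building probability map produced by the network's sigmoid output layer.
prob = np.array([[0.10, 0.72],
                 [0.91, 0.43]])

threshold = 0.5  # assumed cut-off value
building_map = (prob >= threshold).astype(np.uint8)  # 1 = building, 0 = background
```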
7. The building extraction method for remote sensing images according to claim 5, characterized in that the neural network system is trained using a Sigmoid Cross Entropy Loss function and a stochastic gradient descent algorithm.
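The loss and optimizer named in claim 7 can be sketched as follows; the single-weight toy model, the data values, and the learning rate are assumptions for illustration, not details from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_cross_entropy(logits, labels):
    # Numerically stable mean sigmoid cross-entropy:
    # max(z, 0) - z*y + log(1 + exp(-|z|))
    return float(np.mean(np.maximum(logits, 0) - logits * labels
                         + np.log1p(np.exp(-np.abs(logits)))))

# Toy model with a single weight: logit z = w * x, one gradient descent step.
x = np.array([1.0, -2.0, 3.0])
y = np.array([1.0, 0.0, 1.0])
w, lr = 0.0, 0.1

loss_before = sigmoid_cross_entropy(w * x, y)    # equals log(2) at w = 0
grad = np.mean((sigmoid(w * x) - y) * x)         # dL/dw of the sigmoid cross-entropy
w -= lr * grad                                   # stochastic gradient descent update
loss_after = sigmoid_cross_entropy(w * x, y)     # smaller than loss_before
```

One step moves the weight against the gradient and lowers the loss, which is the whole training loop of claim 7 in miniature.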
8. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 4 to 7.
9. A computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of claims 4 to 7 when executing the program.
CN201810373725.3A 2018-04-24 2018-04-24 Neural network, building extraction method of remote sensing image, medium and computing equipment Active CN108764039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810373725.3A CN108764039B (en) 2018-04-24 2018-04-24 Neural network, building extraction method of remote sensing image, medium and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810373725.3A CN108764039B (en) 2018-04-24 2018-04-24 Neural network, building extraction method of remote sensing image, medium and computing equipment

Publications (2)

Publication Number Publication Date
CN108764039A (en) 2018-11-06
CN108764039B (en) 2020-12-01

Family

ID=64011327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810373725.3A Active CN108764039B (en) 2018-04-24 2018-04-24 Neural network, building extraction method of remote sensing image, medium and computing equipment

Country Status (1)

Country Link
CN (1) CN108764039B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109859167A * 2018-12-28 2019-06-07 China Agricultural University Method and device for assessing the severity of cucumber downy mildew
CN109753928B (en) * 2019-01-03 2022-03-29 北京百度网讯科技有限公司 Method and device for identifying illegal buildings
CN109871798B (en) * 2019-02-01 2021-06-29 浙江大学 Remote sensing image building extraction method based on convolutional neural network
CN109934110B (en) * 2019-02-02 2021-01-12 广州中科云图智能科技有限公司 Method for identifying illegal buildings near river channel
CN110163207B (en) * 2019-05-20 2022-03-11 福建船政交通职业学院 Ship target positioning method based on Mask-RCNN and storage device
CN110263797B (en) * 2019-06-21 2022-07-12 北京字节跳动网络技术有限公司 Method, device and equipment for estimating key points of skeleton and readable storage medium
CN110991252B (en) * 2019-11-07 2023-07-21 郑州大学 Detection method for people group distribution and counting in unbalanced scene
CN113486840B (en) * 2021-07-21 2022-08-30 武昌理工学院 Building rapid extraction method based on composite network correction
CN116052019B (en) * 2023-03-31 2023-07-25 深圳市规划和自然资源数据管理中心(深圳市空间地理信息中心) High-quality detection method suitable for built-up area of large-area high-resolution satellite image

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106056628A * 2016-05-30 2016-10-26 中国科学院计算技术研究所 Target tracking method and system based on deep convolutional neural network feature fusion
CN106886977A * 2017-02-08 2017-06-23 徐州工程学院 Multi-image auto-registration and fusion stitching method
CN107092870A * 2017-04-05 2017-08-25 武汉大学 High-resolution image semantic information extraction method and system
CN107092871A * 2017-04-06 2017-08-25 重庆市地理信息中心 Remote sensing image building detection method based on multi-scale multi-feature fusion
CN107123083A * 2017-05-02 2017-09-01 中国科学技术大学 Face editing method
CN107169974A * 2017-05-26 2017-09-15 中国科学技术大学 Image segmentation method based on multi-supervised fully convolutional neural networks
CN107220657A * 2017-05-10 2017-09-29 中国地质大学(武汉) Method for high-resolution remote sensing image scene classification oriented to small data sets

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10489703B2 (en) * 2015-05-20 2019-11-26 Nec Corporation Memory efficiency for convolutional neural networks operating on graphics processing units
US11055537B2 (en) * 2016-04-26 2021-07-06 Disney Enterprises, Inc. Systems and methods for determining actions depicted in media contents based on attention weights of media content frames


Also Published As

Publication number Publication date
CN108764039A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108764039B (en) Neural network, building extraction method of remote sensing image, medium and computing equipment
US11151725B2 (en) Image salient object segmentation method and apparatus based on reciprocal attention between foreground and background
CN108664981B (en) Salient image extraction method and device
CN108876792B (en) Semantic segmentation method, device and system and storage medium
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
CN112016614B (en) Construction method of optical image target detection model, target detection method and device
CN110781756A (en) Urban road extraction method and device based on remote sensing image
CN109003297B (en) Monocular depth estimation method, device, terminal and storage medium
CN111369581A (en) Image processing method, device, equipment and storage medium
CN110838125A (en) Target detection method, device, equipment and storage medium of medical image
CN111275034B (en) Method, device, equipment and storage medium for extracting text region from image
CN107506792B (en) Semi-supervised salient object detection method
CN111291825A (en) Focus classification model training method and device, computer equipment and storage medium
CN112307853A (en) Detection method of aerial image, storage medium and electronic device
CN113689373B (en) Image processing method, device, equipment and computer readable storage medium
CN110335228B (en) Method, device and system for determining image parallax
CN112132867B (en) Remote sensing image change detection method and device
CN113516697A (en) Image registration method and device, electronic equipment and computer-readable storage medium
CN116994000A (en) Part edge feature extraction method and device, electronic equipment and storage medium
CN116798041A (en) Image recognition method and device and electronic equipment
CN112085652A (en) Image processing method and device, computer storage medium and terminal
CN110570376A (en) image rain removing method, device, equipment and computer readable storage medium
CN112651351B (en) Data processing method and device
CN115512405A (en) Background image processing method and device, electronic equipment and storage medium
CN109961083A Method and image processing entity for applying convolutional neural networks to an image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant