CN110009637B - Remote sensing image segmentation network based on tree structure - Google Patents

Remote sensing image segmentation network based on tree structure

Info

Publication number
CN110009637B
CN110009637B
Authority
CN
China
Prior art keywords
tree
remote sensing
confusion
sensing image
segmentation
Prior art date
Legal status
Active
Application number
CN201910280400.5A
Other languages
Chinese (zh)
Other versions
CN110009637A (en)
Inventor
岳凯
李瑞瑞
Current Assignee
Beijing University of Chemical Technology
Original Assignee
Beijing University of Chemical Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Chemical Technology filed Critical Beijing University of Chemical Technology
Priority to CN201910280400.5A priority Critical patent/CN110009637B/en
Publication of CN110009637A publication Critical patent/CN110009637A/en
Application granted granted Critical
Publication of CN110009637B publication Critical patent/CN110009637B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10032Satellite or aerial image; Remote sensing

Abstract

The invention relates to a remote sensing image segmentation network based on a tree structure, and belongs to the technical field of computer vision. The remote sensing image segmentation network is a tree network model built on DeepLab V3+, comprising a segmentation module and a tree processing module connected in sequence; the segmentation module is a DeepLab V3+ network model, which includes an encoder portion and a decoder portion. The tree processing module is constructed by building a confusion matrix, computing the corresponding lower triangular matrix, establishing a confusion undirected complete graph, and iteratively cutting edges of that graph to obtain the tree processing module. The invention can better distinguish confusable pixels and obtain more accurate segmentation results, effectively improving the overall semantic segmentation accuracy on high-resolution remote sensing images and markedly improving the segmentation accuracy on confusable class data.

Description

Remote sensing image segmentation network based on tree structure
Technical Field
The invention relates to a remote sensing image segmentation network based on a tree structure, in particular to a remote sensing image segmentation network combining DeepLab V3+ and a tree structure, and belongs to the technical field of computer vision.
Background
Semantic segmentation of high-resolution remote sensing images is the task of assigning a semantic label to each pixel in an image. In recent years, with the rapid development of remote sensing and mapping technology, ultra-high-resolution optical remote sensing images with ground sampling distances (GSD) of 5 to 10 centimeters have become easy to obtain. Accordingly, how to segment these images accurately and efficiently has become a research focus in the field of remote sensing image segmentation. For ultra-high-resolution remote sensing image data, most traditional methods rely on supervised classifiers with hand-crafted features. Hand-crafted features can only express low-level semantic information, whereas deep learning can fully mine high-level semantic features in an image. Deep learning has enjoyed great success in computer vision tasks such as image classification, object detection, and semantic segmentation. A deep convolutional neural network takes raw data as input and, through end-to-end learning, produces an output suited to the specific task.
For the multi-class remote sensing image segmentation task, confusable class data are usually spatially adjacent, as shown in FIG. 4; because general network models have difficulty learning effective feature representations for such data, segmentation accuracy is low. The essence of remote sensing image semantic segmentation is pixel-level classification of earth surface forms such as houses, vehicles, roads, vegetation, and sea ice. Remote sensing images come from aerial drones or spectral sensors. Early remote sensing image segmentation research was mainly based on graph theory, for example the traditional method combining a minimum spanning tree algorithm with Mumford-Shah theory. Among supervised approaches, most segmentation methods are based on manual feature selection. These features are often complex, yet can only express low-level or mid-level semantic information.
With the development of remote sensing technology, large numbers of VHR (very-high-resolution) remote sensing images can be conveniently acquired. These images typically carry rich context information, which makes most traditional segmentation methods unsuitable. Convolutional neural networks (CNN) initially learned semantic representations of pixels through region-based training. The fully convolutional network (FCN), the first network model applied to image segmentation, replaces all fully connected layers with convolutional layers so that images of arbitrary size can be segmented. Subsequently, network models such as the deconvolution network (DeconvNet), SegNet, RefineNet, and U-Net were introduced into the remote sensing image segmentation field, addressing the low segmentation accuracy of ultra-high-resolution remote sensing images. Although overall segmentation accuracy has greatly improved over traditional methods, segmentation remains unsatisfactory for confusable data categories such as shrub and tree, or building and impervious surface.
Disclosure of Invention
To solve the problem of low segmentation accuracy on confusable class data in ultra-high-resolution remote sensing images, the invention provides a remote sensing image segmentation network combining DeepLab V3+ with a tree structure, improving the segmentation accuracy on confusable class data.
The purpose of the invention is realized by the following technical scheme:
a remote sensing image segmentation network based on a tree structure is a tree network model built on DeepLab V3+, and the tree network model comprises a segmentation module and a tree processing module connected in sequence; wherein the segmentation module is a DeepLab V3+ network model, and the DeepLab V3+ network model comprises an encoder portion and a decoder portion;
the construction method of the tree processing module comprises the following steps:
step 1, using a priori segmentation result, count the number of pixels of each category in the segmentation map and construct a confusion matrix A, where a_ij in A denotes the number of pixels that actually belong to class j but are predicted as class i;
step 2, add the symmetric elements a_ij and a_ji of matrix A to obtain the corresponding lower triangular matrix B, where B_ij denotes the degree of confusion between class i and class j;
B_ij = a_ij + a_ji, for i > j
step 3, establish the confusion undirected complete graph: treat the lower triangular matrix B as the adjacency matrix of an undirected graph, in which each node represents a category and the weight of an edge between two nodes represents their degree of confusion; the value of B_ij is used as the weight of the edge between node i and node j of the confusion undirected complete graph;
step 4, perform iterative edge cutting on the confusion undirected complete graph: each time, traverse all edges not yet cut, select the edge with the minimum weight and cut it, then check whether the undirected graph has been divided into two subgraphs; if not, continue by selecting and cutting the remaining unselected edge with the minimum weight and repeat step 4; if the graph has been divided into two subgraphs, repeat step 4 on each subgraph;
step 5, obtain the tree processing module: each time step 4 divides a graph into two subgraphs, take the node sets of the two subgraphs as the two child nodes of the corresponding root node of the tree structure; when all nodes have been separated, the construction of the tree processing module is complete.
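For illustration only, steps 1 and 2 can be sketched in a few lines of Python with numpy; the function and variable names below are ours, not part of the claimed method:

import numpy as np

def confusion_and_degree_matrices(y_true, y_pred, num_classes):
    # Step 1: confusion matrix A, where A[i, j] counts pixels whose true
    # class is j but whose predicted class is i (rows indexed by prediction).
    A = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(A, (y_pred, y_true), 1)
    # Step 2: fold A into the strictly lower triangular confusion-degree
    # matrix B, with B[i, j] = A[i, j] + A[j, i] for i > j.
    B = np.tril(A + A.T, k=-1)
    return A, B

Here y_true and y_pred are flat integer arrays of per-pixel class labels taken from the prior segmentation result.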
Further, the Middle Flow block of the encoder portion in the DeepLab V3+ network model contains two Xception units.
Further, the tree processing module is a binary tree model with 6 leaf nodes, and each node is a ResNeXt unit.
Further, in the construction method of the tree processing module, the confusion matrix A in step 1 is a 6 × 6 matrix, and the confusion undirected complete graph in step 3 has a total of 6 vertices.
Furthermore, the remote sensing image segmentation network is a fully convolutional network with no fully connected layers; after the tree processing module, all feature maps are fed into a 1 × 1 convolutional layer and the output is produced by a SoftMax function.
The invention has the following beneficial effects:
the invention not only modifies the DeepLab V3+ to be suitable for multi-scale and multi-modal data, but also adds and connects the tree neural network processing module afterwards. The tree processing module is constructed by establishing a confusion matrix, extracting a confusion graph and dividing the graph, so that confusable pixels can be better distinguished, and a more accurate division result is obtained. According to the invention, experiments are respectively carried out on remote sensing image sets of two different cities provided by the ISPRS (International society for photogrammetry and remote sensing), and the experimental results show that the network combining the deep Lab V3+ and the tree structure effectively improves the integral precision of semantic segmentation of the high-resolution remote sensing image, wherein the segmentation accuracy of the confusable class data is obviously improved.
Drawings
FIG. 1 is a schematic diagram of a remote sensing image segmentation network structure based on a tree structure according to the present invention;
FIG. 2 is a schematic diagram of the DeepLab V3+ network model according to the present invention;
FIG. 3 is a schematic diagram of a tree processing module according to the present invention;
FIG. 4 is an example of an optical remote sensing image;
FIG. 5 is a schematic diagram of a tree processing module construction method;
FIG. 6 is a schematic diagram of the graph cut algorithm used in the iterative edge cutting process in the embodiment;
FIG. 7 is a diagram of an example of tree processing module construction;
FIG. 8 is a diagram of an ISPRS data set in an embodiment;
FIG. 9 is a schematic diagram illustrating an image overlay strategy according to an embodiment;
FIG. 10 shows global and local segmentation results for test set pictures in the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A remote sensing image segmentation network based on a tree structure is a DeepLab V3+ structured tree network model, and the tree network model comprises a segmentation module and a tree processing module which are connected in sequence.
An hourglass-shaped encoder-decoder model (encoder-decoder network) is selected as the segmentation module; as shown in FIG. 2, this is the DeepLab V3+ network model, which consists of two parts: an encoder and a decoder. Compared with the original DeepLab V3+ structure, the invention reduces the number of Xception units in the Middle Flow block of the encoder from 16 to 2, because the image data and computational resources available for training are limited.
Since the ISPRS remote sensing data set contains 6 classes of data, the tree processing module is a binary tree model with 6 leaf nodes, each node being a ResNeXt unit; the structure of the ResNeXt unit is shown in FIG. 3. The ResNeXt unit not only avoids vanishing gradients during training, but also reduces GPU memory overhead by reducing the number of hyper-parameters.
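As an illustration, such a ResNeXt-style unit could be sketched in MXNet Gluon roughly as follows; the bottleneck width and cardinality are assumptions for the sketch, since the text does not fix them:

from mxnet.gluon import nn

class ResNeXtUnit(nn.HybridBlock):
    # Residual unit whose 3x3 convolution is grouped (the ResNeXt
    # "split-transform-merge" aggregation); channel sizes are illustrative.
    def __init__(self, channels, cardinality=32, **kwargs):
        super(ResNeXtUnit, self).__init__(**kwargs)
        mid = channels // 2
        self.body = nn.HybridSequential()
        self.body.add(
            nn.Conv2D(mid, kernel_size=1, use_bias=False),
            nn.BatchNorm(), nn.Activation('relu'),
            nn.Conv2D(mid, kernel_size=3, padding=1,
                      groups=cardinality, use_bias=False),
            nn.BatchNorm(), nn.Activation('relu'),
            nn.Conv2D(channels, kernel_size=1, use_bias=False),
            nn.BatchNorm())

    def hybrid_forward(self, F, x):
        # Identity shortcut plus the grouped-convolution branch.
        return F.Activation(x + self.body(x), act_type='relu')

The grouped convolution is what keeps the parameter count down relative to a plain residual unit of the same width, consistent with the memory savings described above.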
The construction method of the tree processing module yields five intermediate results, namely (a) the confusion matrix, (b) the lower triangular matrix, (c) the confusion undirected complete graph, (d) the confusion graph after edge cutting, and (e) the confusion tree structure; see FIG. 5, where the numbers 1-6 denote the 6 data categories of the remote sensing image segmentation task: impervious surfaces, buildings, low shrub vegetation, trees, cars, and clutter. The method comprises the following steps:
Step 1, using a priori segmentation result, count the number of pixels of each category in the segmentation map and construct the confusion matrix A, a 6 × 6 matrix; a_ij in A denotes the number of pixels that actually belong to class j but are predicted as class i.
Step 2, add the symmetric elements a_ij and a_ji of matrix A to obtain the corresponding lower triangular matrix B, where B_ij denotes the degree of confusion between class i and class j;
B_ij = a_ij + a_ji, for i > j
Step 3, establish the confusion undirected complete graph: treat the lower triangular matrix B as the adjacency matrix of an undirected graph, in which each node represents a category and the weight of an edge between two nodes represents their degree of confusion; the value of B_ij is used as the weight of the edge between node i and node j of the confusion undirected graph. The confusion undirected complete graph has a total of 6 vertices, representing the six classes in the ISPRS dataset (a code sketch of this step follows step 5 below).
Step 4, perform iterative edge cutting on the confusion undirected complete graph: each time, traverse all edges not yet cut, select the edge with the minimum weight and cut it, then check whether the undirected graph has been divided into two subgraphs; if not, continue by selecting and cutting the remaining unselected edge with the minimum weight and repeat step 4; if the graph has been divided into two subgraphs, repeat step 4 on each subgraph. The pseudo code for this step is shown in FIG. 6.
Step 5, obtain the tree processing module: each time step 4 divides a graph into two subgraphs, take the node sets of the two subgraphs as the two child nodes of the corresponding root node of the tree structure; when all nodes have been separated, the construction of the tree processing module is complete.
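Step 3 could be expressed, for example, with networkx; this sketch assumes B is the numpy matrix built in the earlier sketch, and the names are ours:

import networkx as nx

def build_confusion_graph(B):
    # One node per category; edge (i, j) weighted by the confusion
    # degree B[i, j] taken from the lower triangular matrix.
    n = B.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i):
            G.add_edge(i, j, weight=int(B[i, j]))
    return G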
FIG. 7 shows an example of the edge cutting operation. The left-most column is the lower triangular matrix, where gray elements represent edges that have already been cut and black elements represent edges about to be cut. The middle column shows the state of the confusion graph, with dashed lines for edges cut up to this step and solid lines for edges not yet selected. The right-most column shows the state of the tree structure under construction.
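The iterative cutting of steps 4 and 5, illustrated in FIG. 7, can be read as a recursion that removes minimum-weight edges until the graph falls apart and then recurses on the two components; the sketch below is our reading of that procedure, not the patent's own pseudo code:

import networkx as nx

def split_by_edge_cutting(G):
    # Returns a nested tuple whose leaves are category indices,
    # i.e. the shape of the binary tree processing module.
    if G.number_of_nodes() == 1:
        return next(iter(G.nodes))
    H = G.copy()
    while nx.is_connected(H):
        # Cut the lightest edge that has not been cut yet (step 4).
        u, v, _ = min(H.edges(data='weight'), key=lambda e: e[2])
        H.remove_edge(u, v)
    # The first disconnection yields exactly two components (step 5).
    left, right = (H.subgraph(c).copy() for c in nx.connected_components(H))
    return (split_by_edge_cutting(left), split_by_edge_cutting(right))

Applied to the 6-vertex confusion graph, the recursion terminates with six leaves, one per category, matching the binary tree with 6 leaf nodes described above.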
The remote sensing image segmentation network provided by the invention is a fully convolutional network with no fully connected layers; after the tree processing module, all feature maps are fed into a 1 × 1 convolutional layer and the output is produced by a SoftMax function.
Experiments are performed on the two ISPRS remote sensing data sets, Vaihingen and Potsdam, and the results are reported below; the data sets are shown in FIG. 8. The ISPRS remote sensing data set is a publicly available high-resolution remote sensing image data set that mainly covers urban centers and their surroundings and contains the most common land cover categories in remote sensing images: impervious surfaces, buildings (building), low shrub vegetation (low_veg), trees (tree), cars (car), and clutter (clutter).
(1) Vaihingen data set
The data set contains 33 high-resolution pictures taken by a drone camera over the town of Vaihingen, Germany, and provides IRRG images, DSM data, and annotations. The average size of each picture is 2494 × 2046 pixels with a spatial resolution of 9 cm. In this experiment, 11 pictures are used as the training set, 5 as the validation set, and 17 as the test set.
(2) Potsdam dataset
The Potsdam data set comprises 38 overhead high-resolution remote sensing pictures, each 6000 × 6000 pixels with a spatial resolution of 5 cm, and provides RGB images, DSM data, IRRG images, and annotations. Since one image in the data set (No. 7_10) contains many erroneous labels, the experiment uses 17 images as the training set, 5 as the validation set, and 15 as the test set.
Test evaluation criteria
The evaluation criterion of the segmentation experiment is Overall Accuracy (OA). For the segmentation performance on each class, this embodiment uses the F1 value, computed from precision and recall. Because Overall Accuracy (OA) is insensitive to imbalanced class distributions, this embodiment also uses the mean F1 value across classes to evaluate overall per-class segmentation performance. The overall accuracy and F1 value are calculated as follows:
OA = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 × precision × recall / (precision + recall)
where TP (true positive) is the number of positive-class samples predicted as positive; TN (true negative) is the number of negative-class samples predicted as negative; FP (false positive) is the number of negative-class samples predicted as positive; and FN (false negative) is the number of positive-class samples predicted as negative.
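Using the definitions above, OA and the per-class F1 value can be computed directly from the multi-class confusion matrix; a numpy sketch (one-vs-rest counts per class, names ours):

import numpy as np

def overall_accuracy(A):
    # Fraction of pixels on the diagonal, i.e. predicted correctly.
    return np.trace(A) / A.sum()

def per_class_f1(A):
    # A[i, j] = pixels of true class j predicted as class i, as above.
    tp = np.diag(A).astype(float)
    fp = A.sum(axis=1) - tp          # predicted as the class, but wrong
    fn = A.sum(axis=0) - tp          # of the class, but missed
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)

The mean F1 reported below is then simply per_class_f1(A).mean().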
Experimental Environment and data Pre-processing
The segmentation network is implemented on the MXNet framework and trained on 2 Nvidia GeForce GTX 1080 Ti GPUs. Due to video memory limits, the input image blocks during training are 640 × 640 pixels and the batch size is 4. Training runs for 80 epochs in total with momentum set to 0.9; the initial learning rate is 0.01, adjusted to 0.001 halfway through training and again to 0.0001 at three quarters of training.
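The step schedule just described amounts to a simple function of the epoch index; a sketch:

def learning_rate(epoch, total_epochs=80):
    # 0.01 for the first half, 0.001 until three quarters, then 0.0001.
    if epoch < total_epochs // 2:
        return 0.01
    if epoch < total_epochs * 3 // 4:
        return 0.001
    return 0.0001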
The ISPRS remote sensing data sets are small, so data augmentation is needed to enlarge the training data. The method used in this embodiment is: for each piece of raw data, rotate the image about its center by 10° at a time and crop out the largest square block; each training image thus yields 36 rotated pictures.
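A sketch of this rotation augmentation with Pillow follows; the inscribed-square formula is our own derivation for a square input and is an assumption, as the embodiment does not spell it out:

import math
from PIL import Image

def rotation_crops(img):
    # 36 rotations in 10-degree steps; after each rotation, keep the
    # largest axis-aligned square fully contained in the rotated image.
    w = min(img.size)
    crops = []
    for k in range(36):
        angle = 10 * k
        rot = img.rotate(angle, resample=Image.BILINEAR, expand=True)
        a = math.radians(angle % 90)
        side = int(w / (math.sin(a) + math.cos(a)))  # inscribed square side
        cx, cy = rot.size[0] // 2, rot.size[1] // 2
        crops.append(rot.crop((cx - side // 2, cy - side // 2,
                               cx + side // 2, cy + side // 2)))
    return crops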
In addition, because the original training images are very large, an entire image cannot be fed directly into the network and must be cropped into image blocks of 640 × 640 pixels. To ensure that no visible seams appear in the stitched segmentation map, an overlap-tile strategy is required: crops overlap in the interior of the original image, and the border regions are extrapolated by mirror reflection. A schematic is shown in FIG. 9, where the thin white solid line is the boundary of the training image, the thick white solid box is the size of the crop, and the thick white dashed box is the region in which the segmentation is actually valid.
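The overlap-tile cropping can be sketched with numpy reflection padding; the tile size follows the embodiment (640 pixels), while the size of the central valid region is an illustrative assumption:

import numpy as np

def overlap_tiles(image, tile=640, valid=512):
    # Mirror-extrapolate the borders, then slide a tile-sized window in
    # steps of `valid`; only the central valid region of each prediction
    # is kept when the results are stitched back together.
    margin = (tile - valid) // 2
    padded = np.pad(image,
                    ((margin, margin + tile), (margin, margin + tile), (0, 0)),
                    mode='reflect')
    h, w = image.shape[:2]
    for y in range(0, h, valid):
        for x in range(0, w, valid):
            yield (y, x), padded[y:y + tile, x:x + tile]

The window whose origin is (y, x) covers original pixels y..y+valid and x..x+valid in its center, so predictions can be cropped to that region and concatenated without seams.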
Analysis of Experimental results
The experiments of this embodiment are performed on the Vaihingen and Potsdam data sets respectively; the results are shown in Tables 1 and 2. SVL is the baseline result provided by the ISPRS organizers, DST is a fully convolutional network (FCN) with an added CRF unit, and UZ is the segmentation result of a deconvolution network.
TABLE 1 comparison of segmentation results of different models on the Vaihingen test data set
(Table 1 is available only as an image in the original publication.)
TABLE 2 comparison of segmentation results of different models on Potsdam test dataset
(Table 2 is available only as an image in the original publication.)
(1) Analysis of the results of the experiment
As Tables 1 and 2 show, the model of this embodiment performs best on Overall Accuracy (OA), reaching 90.4% and 90.7% on the Vaihingen and Potsdam data sets respectively, which indicates stronger segmentation accuracy. In addition, the F1 value improves over the other network models on the two mutually confusable categories, low shrub vegetation (low_veg) and trees (tree). In the Vaihingen experiments, the F1 values of this embodiment's model on the low shrub vegetation and tree classes are 83.6% and 89.6% respectively; in the Potsdam experiments they are 86.8% and 87.1%. On the mean F1 value, the model scores 89.3% on Vaihingen and 92.0% on Potsdam, much higher than the other three methods, so this embodiment's model is also the best in average performance across categories for remote sensing image segmentation.
(2) Tree module performance analysis
An innovation of this embodiment's model is the proposed tree module, which further improves the segmentation accuracy on easily confused categories in remote sensing image data. In this section, a controlled experiment is run on the network with and without the tree module; the results are shown in Table 3.
TABLE 3 Controlled comparison of segmentation results with and without the tree structure
(Table 3 is available only as an image in the original publication.)
Compared with the model without the tree structure, the network model with the tree structure improves the segmentation accuracy on every category to varying degrees. On the Vaihingen data set, the tree structure improves Overall Accuracy (OA) by 1.1% and the mean F1 value by 0.6%; on the Potsdam data set, the overall accuracy and mean F1 value improve by 1.3% and 0.9% respectively. The improvement brought by the tree structure is therefore not directed at one particular category but raises the overall segmentation accuracy. Comparing the classes in detail, low shrub vegetation (low_veg) and trees (tree) are a pair of classes that are easily confused with each other.
Results of the controlled experiment are shown in FIG. 10: the first row shows global image segmentation and the second row local segmentation results; from left to right, the columns are the original image, the ground-truth label, the model without the tree structure, and the model with the tree structure. As the figure shows, after adding the tree structure to this embodiment's model, the accuracy of the segmentation results improves markedly, the segmented edges are smoother, and no bubbling artifacts appear.
In summary, on high-resolution remote sensing images containing complex semantic information, the network model with the tree structure proposed in this embodiment benefits from the reduced error on pixels of confusable categories, and its overall segmentation accuracy is greatly improved compared with the model without the tree structure.
The above description is only a preferred embodiment of the present invention and is not intended to limit it; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims (5)

1. A remote sensing image segmentation network based on a tree structure, characterized in that the remote sensing image segmentation network is a tree network model built on DeepLab V3+, the tree network model comprising a segmentation module and a tree processing module connected in sequence; wherein the segmentation module is a DeepLab V3+ network model, and the DeepLab V3+ network model comprises an encoder portion and a decoder portion;
the construction method of the tree processing module comprises the following steps:
step 1, using a priori segmentation result, count the number of pixels of each category in the segmentation map and construct a confusion matrix A, where a_ij in A denotes the number of pixels that actually belong to class j but are predicted as class i;
step 2, add the symmetric elements a_ij and a_ji of matrix A to obtain the corresponding lower triangular matrix B, where B_ij denotes the degree of confusion between class i and class j;
B_ij = a_ij + a_ji, for i > j
step 3, establish the confusion undirected complete graph: treat the lower triangular matrix B as the adjacency matrix of an undirected graph, in which each node represents a category and the weight of an edge between two nodes represents their degree of confusion; the value of B_ij is used as the weight of the edge between node i and node j of the confusion undirected complete graph;
step 4, perform iterative edge cutting on the confusion undirected complete graph: each time, traverse all edges not yet cut, select the edge with the minimum weight and cut it, then check whether the undirected graph has been divided into two subgraphs; if not, continue by selecting and cutting the remaining unselected edge with the minimum weight and repeat step 4; if the graph has been divided into two subgraphs, repeat step 4 on each subgraph;
step 5, obtain the tree processing module: each time step 4 divides a graph into two subgraphs, take the node sets of the two subgraphs as the two child nodes of the corresponding root node of the tree structure; when all nodes have been separated, the construction of the tree processing module is complete.
2. The remote sensing image segmentation network based on the tree structure as claimed in claim 1, wherein the Middle Flow block of the encoder portion in the DeepLab V3+ network model includes two Xception units.
3. The remote sensing image segmentation network based on the tree structure as claimed in claim 1, wherein the tree processing module is a binary tree model with 6 leaf nodes, and each node is a ResNeXt unit.
4. The remote sensing image segmentation network based on the tree structure as claimed in claim 3, wherein in the construction method of the tree processing module, the confusion matrix A in step 1 is a 6 × 6 matrix, and the confusion undirected complete graph in step 3 has a total of 6 vertices.
5. The remote sensing image segmentation network based on the tree structure as claimed in claim 1, wherein the remote sensing image segmentation network is a fully convolutional network with no fully connected layers; after the tree processing module, all feature maps are fed into a 1 × 1 convolutional layer and the output is produced by a SoftMax function.
CN201910280400.5A 2019-04-09 2019-04-09 Remote sensing image segmentation network based on tree structure Active CN110009637B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910280400.5A CN110009637B (en) 2019-04-09 2019-04-09 Remote sensing image segmentation network based on tree structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910280400.5A CN110009637B (en) 2019-04-09 2019-04-09 Remote sensing image segmentation network based on tree structure

Publications (2)

Publication Number Publication Date
CN110009637A CN110009637A (en) 2019-07-12
CN110009637B true CN110009637B (en) 2021-04-16

Family

ID=67170461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910280400.5A Active CN110009637B (en) 2019-04-09 2019-04-09 Remote sensing image segmentation network based on tree structure

Country Status (1)

Country Link
CN (1) CN110009637B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110718234A (en) * 2019-09-02 2020-01-21 江苏师范大学 Acoustic scene classification method based on semantic segmentation coding and decoding network
CN110889394A (en) * 2019-12-11 2020-03-17 安徽大学 Rice lodging recognition method based on deep learning UNet network
CN111967351B (en) * 2020-07-31 2023-06-20 华南理工大学 Finger vein authentication algorithm, device, medium and equipment based on depth tree network
CN112557815B (en) * 2020-11-27 2022-05-20 广东电网有限责任公司肇庆供电局 Fixed and movable inspection image-based distribution network line tree obstacle identification and fault positioning method
CN112686905A (en) * 2020-12-22 2021-04-20 天津大学 Lightweight brain tumor segmentation method based on depth separable convolution
CN112801109A (en) * 2021-04-14 2021-05-14 广东众聚人工智能科技有限公司 Remote sensing image segmentation method and system based on multi-scale feature fusion
CN113780750B (en) * 2021-08-18 2024-03-01 同济大学 Medical risk assessment method and device based on medical image segmentation

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060018516A1 (en) * 2004-07-22 2006-01-26 Masoud Osama T Monitoring activity using video information
CN108764173B (en) * 2018-05-31 2021-09-03 西安电子科技大学 Hyperspectral image classification method based on multi-class generation countermeasure network
CN108805874B (en) * 2018-06-11 2022-04-22 中国电子科技集团公司第三研究所 Multispectral image semantic cutting method based on convolutional neural network
CN108830319B (en) * 2018-06-12 2022-09-16 北京合众思壮科技股份有限公司 Image classification method and device
CN109255334B (en) * 2018-09-27 2021-12-07 中国电子科技集团公司第五十四研究所 Remote sensing image ground feature classification method based on deep learning semantic segmentation network

Also Published As

Publication number Publication date
CN110009637A (en) 2019-07-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant