CN110059772B

CN110059772B - Remote sensing image semantic segmentation method based on multi-scale decoding network

Info

Publication number: CN110059772B
Application number: CN201910397121.7A
Authority: CN
Inventors: 张笑钦; 肖智恒; 李东阳; 樊明宇
Original assignee: Wenzhou University
Current assignee: Wenzhou University
Priority date: 2019-05-14
Filing date: 2019-05-14
Publication date: 2021-04-30
Anticipated expiration: 2039-05-14
Also published as: CN110059772A

Abstract

The invention discloses a remote sensing image semantic segmentation method based on a multi-scale decoding network, comprising the following steps: a high-resolution remote sensing image used for training and its corresponding label map are randomly cut into small images; the network structure is divided into two parts, encoding and multi-scale decoding; the resolution of the encoded information is doubled through an unpooling path and a deconvolution path, and the result is concatenated along the channel dimension with the output of a dilated convolution; the feature map is restored to the original size by deconvolution upsampling; the output label map is input into a PPB module for multi-scale aggregation; finally, the network parameters are updated by stochastic gradient descent with cross entropy as the loss function. Small images cut sequentially from the test picture are then input into the neural network to predict the corresponding label maps, which are stitched back to the original size. This technical scheme improves the segmentation precision of the model, reduces the complexity of the network, and saves training time.

Description

Remote sensing image semantic segmentation method based on multi-scale decoding network

Technical Field

The invention relates to the technical field of machine vision, in particular to a remote sensing image semantic segmentation method based on a multi-scale decoding network.

Background

Semantic segmentation is an important problem of broad interest in fields such as autonomous driving, medical image analysis, and geographic information systems. It separates the different objects in a picture at the pixel level: each pixel of the original picture is labeled and assigned to a class, so the segmentation precision reflects how well the information in the picture is understood. Remote sensing images are characterized by complex imaging, high pixel counts, and large amounts of information, so how to extract useful information from them quickly and accurately using artificial intelligence is a research hotspot in the field of machine vision.

Semantic segmentation based on neural networks has been studied extensively. FCN (fully convolutional network) is a classic framework for image semantic segmentation: it is trained end to end and adapts a trained classification network to semantic segmentation; to restore the resolution of the image, FCN upsamples using deconvolution. Unlike FCN, SegNet upsamples by unpooling, so it has far fewer network parameters than FCN. Compared with FCN and SegNet, U-Net has a more symmetric encoder-decoder structure, and its skip connections from the encoding part to the decoding part help recover position information, but they also make the network structure complex and require more training time. Networks often use pooling to enlarge the receptive field, but pooling reduces the spatial resolution as it does so. Dilated convolution enlarges the receptive field without losing resolution, and convolutions with different dilation rates can capture information at different scales; however, its sparse sampling loses local information, so that long-distance information lacks correlation. In semantic segmentation a large receptive field provides more global information but neglects local information. Balancing the size of the receptive field is considered one of the keys to improving semantic segmentation precision; at the same time, reducing model complexity and training time while maintaining segmentation precision is also a problem to be considered.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a remote sensing image semantic segmentation method based on a multi-scale decoding network that improves the segmentation precision of the model, reduces the complexity of the network, and saves training time.

To achieve this purpose, the invention provides the following technical scheme: a remote sensing image semantic segmentation method based on a multi-scale decoding network, comprising the following steps:

(1) a high-resolution remote sensing image used for training and its corresponding label map are randomly cut into small images of 256 × 256 pixels; the cut images are divided into two parts, one used as the training set of the network and the other as the validation set;

(2) the network structure is divided into two parts, encoding and decoding; the first 16 layers of the classification network VGG16 are used as the encoding network, and the decoding network consists of three paths: an unpooling path, a deconvolution path, and a dilated convolution path; the resolution of the encoded information is doubled through the unpooling path and the deconvolution path, the results are concatenated along the channel dimension with the dilated convolution result, the feature map is restored to the original size by deconvolution upsampling, the output label map is input into a PPB module for multi-scale aggregation, and finally the network parameters are updated by stochastic gradient descent with cross entropy as the loss function;

(3) the test picture is cut sequentially into small images of 256 × 256 pixels, the small images are input into the neural network to predict the corresponding label maps, and the label maps are then stitched back to the original size.
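The cutting and stitching in steps (1) and (3) can be sketched as follows. This is a minimal illustration, not the patented implementation: images are plain nested lists, the function names `random_crop`, `split_into_tiles` and `stitch_tiles` are hypothetical, and the image height and width are assumed to be exact multiples of the tile size (the text does not say how borders are handled).

```python
import random

TILE = 256  # tile edge length in pixels, as stated in steps (1) and (3)


def crop(grid, top, left, tile):
    """Return the tile x tile window of `grid` whose corner is (top, left)."""
    return [row[left:left + tile] for row in grid[top:top + tile]]


def random_crop(image, label, tile=TILE, rng=random):
    """Cut one tile at the same random position from an image and its label map."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - tile)
    left = rng.randint(0, width - tile)
    return crop(image, top, left, tile), crop(label, top, left, tile)


def split_into_tiles(image, tile=TILE):
    """Cut an image into non-overlapping tiles in row-major order.

    Assumption: height and width are multiples of `tile`.
    """
    height, width = len(image), len(image[0])
    return [
        crop(image, top, left, tile)
        for top in range(0, height, tile)
        for left in range(0, width, tile)
    ]


def stitch_tiles(tiles, height, width, tile=TILE):
    """Reassemble row-major tiles (e.g. predicted label maps) to full size."""
    per_row = width // tile
    out = [[None] * width for _ in range(height)]
    for index, patch in enumerate(tiles):
        top = (index // per_row) * tile
        left = (index % per_row) * tile
        for offset, row in enumerate(patch):
            out[top + offset][left:left + tile] = row
    return out
```

In use, each tile returned by `split_into_tiles` would be passed through the trained network, and the predicted label maps would be handed to `stitch_tiles` in the same order.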

Preferably, step (2) comprises the sub-steps of:

(1.1) the high-pixel remote sensing image is randomly cut into image patches of a specified size;

(1.2) the first 16 layers of the VGG network are used as the encoding network to extract semantic features from the preprocessed image patches.

Preferably, step (2) further comprises the sub-steps of:

(2.1) the size of the feature map is recovered by deconvolution and unpooling, which are combined for upsampling: unpooling is added after the fifth pooling of the VGG network, followed by 3 × 3 and 1 × 1 convolutions, to obtain a first feature map;

(2.2) after the fifth pooling of the VGG network, 3 × 3 and 1 × 1 convolutions are applied, the feature map is enlarged by a 4 × 4 deconvolution with a stride of 2, and the result is cropped to the size of the first feature map to obtain a second feature map; after the fourth pooling of the VGG network, a third feature map is generated by a 3 × 3 convolution with a dilation rate of 2;

(2.3) the feature maps generated by the three paths are concatenated, integrating information at different scales so that the network can select an optimal combination; the feature map is then restored to the original size by a 32 × 32 deconvolution with a stride of 16, and the prediction label is output through a softmax layer to obtain the semantic segmentation image.
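The kernel sizes and strides in sub-steps (2.1) to (2.3) imply specific feature-map sizes, which the following sketch works out for a 256 × 256 tile. It relies on several assumptions not stated in the text: the VGG convolutions preserve spatial size, each pooling is 2 × 2 with stride 2, the deconvolutions (transposed convolutions) use no padding, and the dilated convolution is padded so that it preserves size. The function names are hypothetical.

```python
def pool_out(size, kernel=2, stride=2):
    """Spatial size after a pooling layer without padding."""
    return (size - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding=0):
    """Spatial size after a transposed convolution (deconvolution)."""
    return (size - 1) * stride - 2 * padding + kernel


def decoder_sizes(input_size=256):
    """Trace one spatial dimension through the encoder and the three decoding paths."""
    size = input_size
    after_pool = []
    for _ in range(5):  # five poolings in the VGG16 encoder
        size = pool_out(size)
        after_pool.append(size)
    pool4, pool5 = after_pool[3], after_pool[4]

    first = 2 * pool5  # unpooling (inverse pooling) doubles the resolution
    second_raw = deconv_out(pool5, kernel=4, stride=2)  # before cropping
    third = pool4  # assumption: the dilated convolution keeps the size
    restored_raw = deconv_out(first, kernel=32, stride=16)  # before cropping
    return {
        "pool4": pool4,
        "pool5": pool5,
        "first": first,
        "second_before_crop": second_raw,
        "third": third,
        "restored_before_crop": restored_raw,
    }


if __name__ == "__main__":
    for name, value in decoder_sizes(256).items():
        print(name, value)
```

Under these assumptions the fifth pooling leaves an 8-pixel map, the 4 × 4 stride-2 deconvolution produces 18 pixels, and the first and third feature maps are 16 pixels wide, which is consistent with the cropping mentioned in sub-step (2.2). The final 32 × 32 stride-16 deconvolution gives 272 pixels here, so a crop back to 256 would also be needed in this unpadded setting.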

Preferably, the prediction label is produced as follows:

(3.1) a 3 × 3 convolution is applied to the decoded label, and the resulting feature map is downsampled by global average pooling at different scales;

(3.2) the downsampled results are upsampled and aggregated into a feature tensor by depth-wise concatenation;

(3.3) a 1 × 1 convolution reduces the dimensionality to obtain the prediction label.
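Sub-steps (3.1) to (3.3) can be illustrated on a single-channel map as below. This is only a sketch under stated assumptions: the initial 3 × 3 convolution is omitted, the pooling factors 4, 8, 16 and 32 are taken from the PPB description elsewhere in this document, nearest-neighbour repetition is an assumed upsampling method, the 1 × 1 convolution is reduced to a fixed weighted sum with one output channel, and the map side length must be a multiple of 32. All function names are hypothetical.

```python
def average_pool(grid, factor):
    """Average over non-overlapping factor x factor blocks."""
    height, width = len(grid), len(grid[0])
    area = factor * factor
    return [
        [
            sum(grid[r + dr][c + dc] for dr in range(factor) for dc in range(factor)) / area
            for c in range(0, width, factor)
        ]
        for r in range(0, height, factor)
    ]


def upsample_nearest(grid, factor):
    """Repeat every value `factor` times along both axes (assumed upsampling)."""
    return [
        [value for value in row for _ in range(factor)]
        for row in grid
        for _ in range(factor)
    ]


def pyramid_pool(grid, factors=(4, 8, 16, 32)):
    """Pool at each scale, upsample back, and stack the results as channels."""
    return [upsample_nearest(average_pool(grid, f), f) for f in factors]


def conv1x1(channels, weights):
    """Mix channels pixel by pixel: a 1 x 1 convolution with one output channel."""
    height, width = len(channels[0]), len(channels[0][0])
    return [
        [sum(w * ch[r][c] for w, ch in zip(weights, channels)) for c in range(width)]
        for r in range(height)
    ]
```

For example, `conv1x1(pyramid_pool(feature_map), [0.25] * 4)` averages the four pyramid levels; in a trained network the weights would be learned rather than fixed.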

Preferably, the invention uses a computer with an Intel Core-i5 central processing unit and 4 GB of memory, and the algorithm framework for remote sensing image semantic segmentation with the migrated VGG network is built in the Matlab language.

Preferably, cross entropy is used as the loss function, and the parameters are updated by stochastic gradient descent with a momentum of 0.9.

The advantages of the invention over the prior art are as follows:

1. The VGG-based remote sensing image semantic segmentation model provided by the invention achieves a good segmentation effect on high-resolution remote sensing images.

2. The remote sensing image semantic segmentation model provided by the invention greatly reduces the time consumed by network training while maintaining the segmentation precision.

3. The decoding scheme that combines the three paths also offers a new idea for image semantic segmentation.

The invention is further described with reference to the drawings and the specific embodiments in the following description.

Drawings

FIG. 1 is a schematic diagram of a remote sensing image semantic segmentation model for migrating a VGG network in an embodiment of the present invention;

FIG. 2 is a schematic diagram of a network architecture according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a PPB module according to an embodiment of the present invention;

FIG. 4 is a graph illustrating a loss function according to an embodiment of the present invention;

FIG. 5 is a graph illustrating the verification accuracy according to an embodiment of the present invention.

Detailed Description

In the description of the present embodiment, it should be noted that, as the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", "front", "rear", etc. appear, the indicated orientation or positional relationship thereof is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, but does not indicate or imply that the indicated device or element must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" as appearing herein are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

Referring to fig. 1, fig. 2, fig. 3, fig. 4 and fig. 5, the invention discloses a remote sensing image semantic segmentation method based on a multi-scale decoding network, which comprises steps (1) to (3) set out above.

The following four indexes are used to quantitatively evaluate the quality of image segmentation:

Global accuracy: $\sum_i n_{ii} / \sum_i t_i$

Mean accuracy: $(1/n_{cl}) \sum_i n_{ii} / t_i$

Mean overlap ratio (mean IoU): $(1/n_{cl}) \sum_i n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$

Weighted overlap ratio (frequency-weighted IoU): $\left(\sum_k t_k\right)^{-1} \sum_i t_i \, n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$

where $n_{ij}$ is the number of pixels of class $i$ predicted as class $j$, $n_{cl}$ is the number of classes, and the total number of pixels of class $i$ is $t_i = \sum_j n_{ij}$.
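The four indexes can be computed from a confusion matrix, as in the following sketch. It assumes the usual definitions of these measures, with `confusion[i][j]` holding the number of pixels of class i predicted as class j, and it assumes every class appears at least once in the ground truth so that no division by zero occurs. The function name `segmentation_metrics` is hypothetical.

```python
def segmentation_metrics(confusion):
    """Compute the four indexes from confusion[i][j] = class-i pixels predicted as j."""
    n_cl = len(confusion)
    # t[i]: total number of pixels whose true class is i (row sums)
    t = [sum(row) for row in confusion]
    # predicted[i]: total number of pixels predicted as class i (column sums)
    predicted = [sum(confusion[j][i] for j in range(n_cl)) for i in range(n_cl)]
    diag = [confusion[i][i] for i in range(n_cl)]
    total = sum(t)
    # per-class intersection over union: n_ii / (t_i + predicted_i - n_ii)
    iou = [diag[i] / (t[i] + predicted[i] - diag[i]) for i in range(n_cl)]
    return {
        "global_accuracy": sum(diag) / total,
        "mean_accuracy": sum(diag[i] / t[i] for i in range(n_cl)) / n_cl,
        "mean_iou": sum(iou) / n_cl,
        "weighted_iou": sum(t[i] * iou[i] for i in range(n_cl)) / total,
    }
```

For a two-class matrix such as `[[3, 1], [1, 3]]`, each class has 3 correct pixels out of a union of 5, so both overlap ratios come out at 0.6 while both accuracies are 0.75.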

The present embodiment is described in further detail below:

Conv: convolution operation.

Pooling: an operation similar to downsampling.

ReLU: an activation function with the mathematical form max(0, x).

softmax: suppose V is an array and $V_i$ is the i-th element of V; the softmax value of $V_i$ is $S_i = e^{V_i} / \sum_j e^{V_j}$.

Deconv: transposed convolution (deconvolution), which can be used for upsampling.

Unpooling: inverse pooling, which can be used for upsampling.

Dilated Conv: dilated (atrous) convolution, which uses the dilation rate to enlarge the receptive field of the convolution result without reducing the resolution.
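Several of the operations listed above can be written out directly. The sketch below shows ReLU, a numerically stable softmax, and a one-dimensional dilated convolution that illustrates how the dilation rate widens the receptive field while keeping the output length. The one-dimensional form, the zero padding, and the function names are simplifying assumptions for illustration; the network itself operates on two-dimensional feature maps.

```python
import math


def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def softmax(values):
    """Softmax over a list; subtracting the maximum avoids overflow."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dilated_conv1d(signal, kernel, rate=1):
    """1-D dilated convolution with zero padding that preserves the length."""
    span = (len(kernel) - 1) * rate  # distance between first and last tap
    pad = span // 2
    padded = [0.0] * pad + list(signal) + [0.0] * (span - pad)
    return [
        sum(k * padded[i + j * rate] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def receptive_field(kernel_size, rate):
    """Number of input positions spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * rate + 1
```

With a kernel of size 3, a dilation rate of 2 spans 5 input positions instead of 3, yet the kernel still has only 3 weights; the skipped positions are the sparse sampling discussed in the Background section.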

Details of the network architecture:

The pooling layer is typically used to extract abstract features and filter out noisy activations, but it shrinks the resolution of the input features and loses information. Deconvolution and unpooling are commonly used to recover the size of the feature map, and here the two are combined for upsampling: unpooling is added after the fifth pooling of the VGG network, followed by 3 × 3 and 1 × 1 convolutions, to obtain a first feature map. In addition, after the fifth pooling of the VGG network, 3 × 3 and 1 × 1 convolutions are applied, the feature map is enlarged by a 4 × 4 deconvolution with a stride of 2, and the result is cropped to the size of the first feature map to obtain a second feature map.
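The text does not specify how the unpooling (inverse pooling) is carried out. One common variant, used in SegNet, records the position of each maximum during pooling and writes the pooled value back to that position during unpooling, filling the rest with zeros. The sketch below illustrates that variant as an assumption, with hypothetical function names and nested lists in place of feature tensors.

```python
def max_pool_with_indices(grid, size=2):
    """Max pooling over size x size blocks that also records where each maximum was."""
    pooled, indices = [], []
    for top in range(0, len(grid), size):
        pooled_row, index_row = [], []
        for left in range(0, len(grid[0]), size):
            window = [
                (grid[top + dr][left + dc], (top + dr, left + dc))
                for dr in range(size)
                for dc in range(size)
            ]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def unpool(pooled, indices, height, width):
    """Place each pooled value at its recorded position; all other cells are zero."""
    out = [[0.0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out
```

Because the indices carry the position information, this form of upsampling has no learnable weights, which is in line with the Background remark that unpooling keeps the parameter count low.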

A third feature map is generated after the fourth pooling of the VGG network using a 3 × 3 convolution with a dilation rate of 2.

Finally, the feature maps generated by the three paths are concatenated along the third (channel) dimension, integrating information at different scales so that the network can select an optimal combination. The feature map is then restored to the original size by a 32 × 32 deconvolution with a stride of 16.

As shown in fig. 3, after a 3 × 3 convolution, global average pooling is applied to the output features at four scales (4×, 8×, 16× and 32×) to construct a four-level pooling pyramid; finally, a 1 × 1 convolution reduces the dimensionality, and the prediction label map is output through a softmax layer to obtain the semantic segmentation image.

The remote sensing image semantic segmentation model with the migrated VGG network uses cross entropy as the loss function and updates the parameters by stochastic gradient descent with a momentum of 0.9. The loss function and the validation accuracy during network training are shown in fig. 4 and fig. 5.
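The training rule described above can be sketched for a flat list of parameters. The momentum of 0.9 comes from the text; the learning rate, the function names, and the classical momentum form of the update are assumptions, since the text gives no further detail.

```python
import math


def cross_entropy(probabilities, target):
    """Loss for one pixel: negative log of the probability given to the true class."""
    return -math.log(probabilities[target])


def sgd_momentum_step(weights, gradients, velocity, learning_rate=0.01, momentum=0.9):
    """One stochastic gradient descent update with momentum.

    velocity <- momentum * velocity - learning_rate * gradient
    weights  <- weights + velocity
    The learning rate default is an assumed value, not taken from the source.
    """
    new_velocity = [
        momentum * v - learning_rate * g for v, g in zip(velocity, gradients)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

In a full training loop, the per-pixel losses would be averaged over a batch of 256 × 256 tiles, the gradients would come from backpropagation, and the velocity returned by one step would be passed into the next.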

The hardware and programming language used to run the method of the invention are not limited, and the method can be implemented in any language, so other working modes are not described further.

The semantic segmentation model of the remote sensing image based on the VGG provided by the invention has a good segmentation effect on the remote sensing image with high resolution.

The remote sensing image semantic segmentation model provided by the invention greatly reduces the time consumption of network training on the premise of ensuring the segmentation precision.

The decoding mode combining the three paths also provides a new idea for semantic segmentation of the image.

The above embodiments are described in detail for the purpose of further illustrating the present invention and should not be construed as limiting the scope of the present invention, and the skilled engineer can make insubstantial modifications and variations of the present invention based on the above disclosure.

Claims

1. A remote sensing image semantic segmentation method based on a multi-scale decoding network, characterized by comprising the following steps:

(2) the network structure is divided into two parts, encoding and multi-scale decoding; the first 16 layers of the classification network VGG16 are used as the encoding network, and the multi-scale decoding network consists of three paths: an unpooling path, a deconvolution path, and a dilated convolution path; the resolution of the encoded information is doubled through the unpooling path and the deconvolution path, the results are concatenated along the channel dimension with the dilated convolution result, the feature map is restored to the original size by deconvolution upsampling, the output label map is input into a PPB module for multi-scale aggregation, and finally the network parameters are updated by stochastic gradient descent with cross entropy as the loss function;

the PPB module: after a 3 × 3 convolution, global average pooling is applied to the output features at four scales (4×, 8×, 16× and 32×) to construct a four-level pooling pyramid; finally, a 1 × 1 convolution reduces the dimensionality, and the prediction label map is output through a softmax layer;

(3) the test picture is cut sequentially into small images of 256 × 256 pixels, the small images are input into the neural network to predict the corresponding label maps, and the label maps are stitched back to the original size;

step (2) comprises the following sub-steps:

(1.2) the first 16 layers of the VGG network are used as the encoding network to extract semantic features from the preprocessed image patches;

step (2) further comprises the following sub-steps:

(2.3) the feature maps generated by the three paths are concatenated, integrating information at different scales, and serve as the decoding network so that the model can select an optimal combination; the feature map is then restored to the original size by a 32 × 32 deconvolution with a stride of 16, and the prediction label is output through a softmax layer to obtain the semantic segmentation image.

2. The remote sensing image semantic segmentation method based on the multi-scale decoding network according to claim 1, characterized in that the prediction label is produced by the following steps:

3. The remote sensing image semantic segmentation method based on the multi-scale decoding network according to claim 1, characterized in that a computer with an Intel Core-i5 central processing unit and 4 GB of memory is used, and the algorithm framework for remote sensing image semantic segmentation with the multi-scale decoding network is built in the Matlab language.

4. The remote sensing image semantic segmentation method based on the multi-scale decoding network according to claim 1, characterized in that cross entropy is used as the loss function, and the parameters are updated by stochastic gradient descent with a momentum of 0.9.