CN112837320B - Remote sensing image semantic segmentation method based on parallel hole convolution

Remote sensing image semantic segmentation method based on parallel hole convolution

Info

Publication number
CN112837320B
Authority
CN
China
Prior art keywords
remote sensing
convolution
sensing image
network
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110129416.3A
Other languages
Chinese (zh)
Other versions
CN112837320A (en)
Inventor
张东映
唐振超
罗蔚然
洪志明
梁忠壮
刘震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202110129416.3A priority Critical patent/CN112837320B/en
Publication of CN112837320A publication Critical patent/CN112837320A/en
Application granted granted Critical
Publication of CN112837320B publication Critical patent/CN112837320B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/10: Image analysis; Segmentation; Edge detection
    • G06F 18/241: Pattern recognition; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/253: Pattern recognition; Fusion techniques of extracted features
    • G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T 5/30: Image enhancement or restoration using local operators; Erosion or dilatation, e.g. thinning
    • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 2207/10024: Image acquisition modality; Color image
    • G06T 2207/10032: Image acquisition modality; Satellite or aerial image; Remote sensing
    • G06T 2207/20081: Special algorithmic details; Training; Learning
    • G06T 2207/20084: Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/20221: Special algorithmic details; Image combination; Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image semantic segmentation method based on parallel hole (dilated) convolution, which relates to the technical field of remote sensing images and comprises the following steps: a high-resolution remote sensing image is acquired in advance, sliced, normalized, and standardized to obtain the source high-resolution remote sensing image; the feature extraction network is initialized with resnet101 parameters pre-trained on ImageNet, the low-level layers of resnet101 are taken, a parallel hole convolution network is constructed, and shallow features of the source image are extracted; the shallow features are fed into the parallel hole convolution network to obtain multi-scale information, and the multi-scale information is fused; the fused features are re-fused with the shallow features, and image-level information is repaired with a fully connected conditional random field to obtain the semantic segmentation result. The invention enlarges the convolution receptive field without adding extra parameters and, compared with standard convolution reaching the same receptive field, the parallel hole convolution method saves GPU memory.

Description

Remote sensing image semantic segmentation method based on parallel hole convolution
Technical Field
The invention relates to the technical field of remote sensing images, and in particular to a remote sensing image semantic segmentation method based on parallel hole (dilated) convolution.
Background
With the maturation and commercialization of satellite remote sensing technology, and with the encouragement and promotion of governments around the world, satellite remote sensing is developing rapidly and being applied in more and more fields. Semantic segmentation of remote sensing images is an important link in satellite remote sensing applications and is widely used in pattern recognition tasks such as city planning, road planning, ground-object target extraction, and even automatic driving. Improving semantic segmentation accuracy is therefore of great significance for remote sensing image processing.
The ground-object information in remote sensing images is complex and varied, and to improve segmentation accuracy researchers have carried out extensive studies and proposed many algorithms. These algorithms mainly (a) apply a fully convolutional network to segment the remote sensing image, or (b) fuse feature information at symmetric scales on top of the fully convolutional network and record pooling indices for unpooling to compensate for the loss of position information. All of these methods are based on standard convolution, whose receptive field is limited; enlarging the receptive field therefore has important research and application value for semantic segmentation of remote sensing images.
For the problems in the related art, no effective solution has been proposed at present.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides a remote sensing image semantic segmentation method based on parallel hole convolution to overcome the technical problems of the prior art.
The technical scheme of the invention is realized as follows:
A remote sensing image semantic segmentation method based on parallel hole convolution comprises the following steps:
acquiring a high-resolution remote sensing image in advance, slicing it, and normalizing and standardizing the slices to obtain the source high-resolution remote sensing image;
initializing the feature extraction network with resnet101 parameters pre-trained on ImageNet, taking the low-level layers of resnet101, constructing a parallel hole convolution network, and extracting shallow features of the source high-resolution remote sensing image;
feeding the shallow features into the parallel hole convolution network to obtain multi-scale information and fusing the multi-scale information, wherein the multi-scale information is captured by setting different expansion (dilation) rates;
re-fusing the fused features with the shallow features, and repairing image-level information with a fully connected conditional random field to obtain the semantic segmentation result.
Further, slicing the high-resolution remote sensing image includes slicing it into tiles whose length and width are 512 pixels.
Further, the method also comprises the following step:
extracting the three RGB channels from the sliced high-resolution remote sensing image.
Further, the parallel hole convolution network comprises the following steps:
starting from standard convolution: let F: \mathbb{Z}^2 \to \mathbb{R} be a discrete function, let \Omega_r = [-r, r]^2 \cap \mathbb{Z}^2, and let k: \Omega_r \to \mathbb{R} be a discrete convolution kernel; the convolution centered on p is computed as
(F * k)(p) = \sum_{s + t = p} F(s)\,k(t);
generalizing the standard convolution with an expansion rate l gives the hole convolution
(F *_l k)(p) = \sum_{s + l \cdot t = p} F(s)\,k(t);
carrying out hole convolution on the shallow features in parallel with different expansion rates to obtain multi-scale features, and fusing the multi-scale features by splicing (concatenation) to form the parallel hole convolution network layer.
Further, the expansion rates are set to 2, 3, 4, and 5, respectively.
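The following is a minimal PyTorch sketch of such a parallel hole (dilated) convolution layer; the module name, channel arguments, and the 1x1 fusion convolution are illustrative assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn

class ParallelDilatedConv(nn.Module):
    """3x3 hole convolutions with expansion rates 2, 3, 4, 5 applied in parallel,
    fused by channel-wise concatenation (splicing) and a 1x1 convolution."""
    def __init__(self, in_ch: int, out_ch: int, rates=(2, 3, 4, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # padding=r keeps the spatial size unchanged for a 3x3 kernel with dilation r
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # one multi-scale feature map per rate
        return self.fuse(torch.cat(feats, dim=1))          # splice, then fuse

# usage: shallow backbone features, e.g. a (N, 1024, 64, 64) tensor
x = torch.randn(1, 1024, 64, 64)
y = ParallelDilatedConv(1024, 256)(x)   # -> (1, 256, 64, 64)
```

Because the branches share the same input and have no data dependence on each other, they can be dispatched to different devices in a distributed setting, which is the parallelism benefit described later in the text.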
Further, repairing the image-level information with the fully connected conditional random field comprises the following steps:
the energy function used by the fully connected conditional random field is
E(x) = \sum_i \theta_i(x_i) + \sum_{i < j} \theta_{ij}(x_i, x_j);
the unary potential function describes the agreement between an observation and its label:
\theta_i(x_i) = -\log P(x_i),
wherein i indexes the pixels and P(x_i) is the probability the network assigns to the class of pixel i; the binary potential function describes the correlation between observations:
\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \sum_m w_m k_m(f_i, f_j),
wherein \mu(x_i, x_j) = 1 when x_i \neq x_j and \mu(x_i, x_j) = 0 otherwise, k_m(f_i, f_j) is a Gaussian kernel between f_i and f_j, f_i is the color information of pixel i, i.e. its feature vector, and w_m is the weight of the Gaussian kernel;
in the process of minimizing the energy function, unreasonably classified pixels in the image are corrected, and the repaired semantic segmentation result is obtained.
The invention has the beneficial effects that:
according to the remote sensing image semantic segmentation method based on parallel hole convolution, a high-resolution remote sensing image is obtained in advance, the high-resolution remote sensing image is sliced, normalization and standardization are carried out, a source high-resolution remote sensing image is obtained, a low-layer network of a network resnet101 is extracted based on a resnet101 parameter initialization feature pre-trained on an ImageNet, a parallel hole convolution network is constructed, shallow layer features of the source high-resolution remote sensing image are extracted, the shallow layer features are input into the parallel hole convolution network to obtain multi-scale information, the multi-scale information is fused, the fused features are fused with the shallow layer features again, and image-level information is restored by using a full-connection condition random field to obtain semantic segmentation result, so that a convolution receptive field is enlarged under the condition that additional parameters are not increased, and compared with standard convolution reaching the same receptive field, the parallel hole convolution method can save display memory; the parallel computing structure is adopted, so that nodes in the neural network computing graph can be conveniently distributed on distributed hardware, and the computing speed is improved; the multi-scale information is beneficial to capturing detail objects and large objects by a network, small target objects are not easy to miss, and semantic segmentation precision is improved; in addition, the cavity convolution can widely sense the adjacent object of the target object, pixel-level classification can be effectively carried out by means of the adjacent information, and the method has better pixel-level classification effect compared with standard convolution.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of hole convolution sampling with different expansion rates for a remote sensing image semantic segmentation method based on parallel hole convolution according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a multi-scale parallel hole convolution network of a remote sensing image semantic segmentation method based on parallel hole convolution according to an embodiment of the invention;
fig. 3 is a parallel hole convolution semantic segmentation result of a remote sensing image semantic segmentation method based on parallel hole convolution according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the invention, fall within the scope of protection of the invention.
According to the embodiment of the invention, a remote sensing image semantic segmentation method based on parallel hole convolution is provided.
As shown in fig. 1-3, the remote sensing image semantic segmentation method based on parallel hole convolution according to an embodiment of the present invention includes the following steps:
acquiring a high-resolution remote sensing image in advance, slicing it, and normalizing and standardizing the slices to obtain the source high-resolution remote sensing image;
initializing the feature extraction network with resnet101 parameters pre-trained on ImageNet, taking the low-level layers of resnet101, constructing a parallel hole convolution network, and extracting shallow features of the source high-resolution remote sensing image;
feeding the shallow features into the parallel hole convolution network to obtain multi-scale information and fusing the multi-scale information, wherein the multi-scale information is captured by setting different expansion rates;
re-fusing the fused features with the shallow features, and repairing image-level information with a fully connected conditional random field to obtain the semantic segmentation result.
The slicing of the high-resolution remote sensing image comprises slicing it into tiles whose length and width are 512 pixels.
The method further comprises the following step:
extracting the three RGB channels from the sliced high-resolution remote sensing image.
Wherein, the parallel hole convolution network comprises the following steps:
starting from standard convolution: let F: \mathbb{Z}^2 \to \mathbb{R} be a discrete function, let \Omega_r = [-r, r]^2 \cap \mathbb{Z}^2, and let k: \Omega_r \to \mathbb{R} be a discrete convolution kernel; the convolution centered on p is computed as
(F * k)(p) = \sum_{s + t = p} F(s)\,k(t);
generalizing the standard convolution with an expansion rate l gives the hole convolution
(F *_l k)(p) = \sum_{s + l \cdot t = p} F(s)\,k(t);
carrying out hole convolution on the shallow features in parallel with different expansion rates to obtain multi-scale features, and fusing the multi-scale features by splicing to form the parallel hole convolution network layer.
Wherein the expansion rates are set to 2, 3, 4, and 5, respectively.
The repair of image-level information by the fully connected conditional random field comprises the following steps:
the energy function used by the fully connected conditional random field is
E(x) = \sum_i \theta_i(x_i) + \sum_{i < j} \theta_{ij}(x_i, x_j);
the unary potential function describes the agreement between an observation and its label:
\theta_i(x_i) = -\log P(x_i),
wherein i indexes the pixels and P(x_i) is the probability the network assigns to the class of pixel i; the binary potential function describes the correlation between observations:
\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \sum_m w_m k_m(f_i, f_j),
wherein \mu(x_i, x_j) = 1 when x_i \neq x_j and \mu(x_i, x_j) = 0 otherwise, k_m(f_i, f_j) is a Gaussian kernel between f_i and f_j, f_i is the color information of pixel i, i.e. its feature vector, and w_m is the weight of the Gaussian kernel.
By means of the above technical scheme, the high-resolution remote sensing image is sliced, the low-level layers of a pre-trained resnet101 are transferred as the feature extraction network, and the shallow features of the sliced images are extracted; a parallel hole convolution is constructed with the convolution kernel expansion rates set to 2 through 5, the shallow features are fed into the parallel hole convolution network, and the information at different scales is spliced; the output features of the hole convolution network are re-fused with the shallow features, the resolution is recovered by upsampling, and the segmentation result is repaired with a conditional random field; finally, the segmentation results of the slices are merged, and unreasonable predictions are repaired by simply filling holes and removing small connected domains.
Specifically, in one embodiment, the method includes the steps of:
s1: preprocessing a high-resolution remote sensing image, wherein the resolution of the high-resolution remote sensing image is too high, the memory and the video memory of a general computer are not easy to bear the calculation of the whole image, and the image is sliced to 512 pixels long and wide according to the 512 resolution commonly used in a main stream semantic segmentation model;
s2: to be compatible with conventional deep convolutional neural networks, three RGB channels need to be extracted from the sliced remote sensing image. Conventional data enhancement was performed: random horizontal overturn, random vertical overturn and color dithering. When the data is enhanced, the labeling image also carries out the same processing along with the RGB image;
s3: properly scaling the RGB three-channel tensorAnd (5) putting, namely normalizing. Assuming a total of m RGB images in the dataset, these RGB images may be divided into 3 channel tensors x 1 ,x 2 ,x 3 ]Normalization of tensors gives [ y ] 1 ,y 2 ,y 3 ]The tensor normalization formula for each channel is:
s4: and then normalized according to the mean value mu and standard deviation sigma of each channel to obtain tensor [ z ] 1 ,z 2 ,z 3 ]The normalized calculation formula is:
s5: based on the network initialized by the resnet101, intercepting the layers 1 to 4, wherein the hole convolution expansion rate of the layer4 is 2, and the hole convolution expansion rates of the layers 1 to 3 are 1, which is equivalent to standard common convolution;
s6: the method comprises the steps of carrying out cavity space pyramid convolution on features output by the resnet101, carrying out parallel convolution with different expansion rates, and replacing global pooled branches by standard convolution instead of global pooled branches, so as to obtain semantic information deeply and improve classification accuracy;
s7: the jump level structure is used for fusing the low-level features generated by layer1 in the resnet101 with the spatial pyramid convolution result after linear interpolation, the low-level features can bring partial position information to the high-level features, and as global pooling is cancelled in the spatial pyramid convolution layer, the position information of the image-level features is lost in the network, the rough segmentation result output by the network is required to be subjected to post-processing based on a conditional random field;
s8: calculating loss by using cross entropy, wherein the object distribution of the remote sensing image is unbalanced, so that weight is added to each class of object during cross entropy calculation, calculating gradient is counter-propagated in a calculation graph through the loss function, and network parameters are updated;
s9: the optimization method of model training adopts Adadelta, and the initial learning rate is set to be 1e -1
Adadelta converges quickly in the early stage of training. The feature extraction backbone of the model is resnet101; although a resnet101 pre-trained on ImageNet cannot directly detect specific remote-sensing objects, it effectively perceives low-level information such as edges, corners, and colors, so the feature extraction layers are initialized with ImageNet pre-trained resnet101 parameters, giving the network a good initial solution. The remaining layer parameters are randomly initialized from a Gaussian distribution;
s10: the model can be converged after traversing the whole data set 256 times, the batch size is set to be 8, and the total iteration number of model training is 5e 4
S11: the high-resolution remote sensing image cannot be segmented in a single pass, so the slices are semantically segmented one by one, and when the slices are spliced back together, unreasonable prediction results are repaired by simply filling holes and removing small connected domains.
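A minimal sketch of this repair step, assuming SciPy and scikit-image are available; the minimum component size and the nearest-pixel relabeling of removed components are illustrative choices not fixed by the patent.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes, distance_transform_edt
from skimage.morphology import remove_small_objects

def clean_prediction(pred: np.ndarray, num_classes: int, min_size: int = 256) -> np.ndarray:
    """pred: (H, W) array of class indices; returns the repaired label map."""
    out = pred.copy()
    # 1) fill holes enclosed inside each class region
    for c in range(num_classes):
        mask = out == c
        out[binary_fill_holes(mask) & ~mask] = c
    # 2) drop connected domains smaller than min_size and relabel their pixels
    #    with the class of the nearest surviving pixel
    keep = np.zeros_like(out, dtype=bool)
    for c in range(num_classes):
        keep |= remove_small_objects(out == c, min_size=min_size)
    dist, idx = distance_transform_edt(~keep, return_indices=True)
    return out[idx[0], idx[1]]
```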
In addition, as shown in fig. 1, panels (a)/(b)/(c) show hole convolutions sampling the features with expansion rates of 1, 2, and 3, respectively; the receptive field grows as the expansion rate increases. By setting the expansion rate, hole convolution samples the features sparsely, and any expansion rate can be used, which makes it possible to control the receptive field explicitly and capture context information in dense prediction tasks. Setting the expansion rate does not change the structure of the original network parameters, which is friendly to transfer learning: after the expansion rate is set, the network can be fine-tuned from the original parameters.
In addition, according to steps S1 to S4, the GID high-resolution remote sensing images are sliced to a resolution of 512, normalized, and standardized. Statistics over the normalized dataset give RGB channel means of 0.3515224, 0.38427463, 0.35403764 and standard deviations of 0.19264674, 0.18325084, 0.17028946.
In addition, as shown in fig. 2, according to steps S5 and S6, the feature extraction layers are initialized with resnet101 parameters pre-trained on ImageNet and the lower layers of resnet101 are taken; these lower layers effectively detect positional information such as edges and corners. A parallel hole convolution network is then constructed with the expansion rates set to 2, 3, 4, and 5, each convolution kernel being a tensor of size (3, 3). The shallow features are fed into the parallel hole convolution network to obtain multi-scale information, which is fused by splicing; the computation of the hole convolution over the input features is illustrated in fig. 2.
In addition, according to step S7, the fused features are fused again with the shallow features to recover positional detail, and the resolution is restored by upsampling. The segmentation result at this point is coarse, and the image-level information must be repaired with reference to the original image, which improves the semantic segmentation result.
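A minimal sketch of this repair step using the third-party pydensecrf package, a common implementation of the fully connected conditional random field; the patent does not name a specific implementation, and the kernel widths, weights, and iteration count shown here are illustrative assumptions.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image: np.ndarray, probs: np.ndarray, iters: int = 5) -> np.ndarray:
    """image: (H, W, 3) uint8 RGB slice; probs: (num_classes, H, W) float32 softmax map."""
    num_classes, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, num_classes)
    d.setUnaryEnergy(np.ascontiguousarray(unary_from_softmax(probs)))  # unary term: -log P(x_i)
    d.addPairwiseGaussian(sxy=3, compat=3)                             # position-only smoothness kernel
    d.addPairwiseBilateral(sxy=80, srgb=13,                            # appearance kernel over color + position
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(iters)                                             # approximate energy minimization
    return np.argmax(np.array(q), axis=0).reshape(h, w)                # repaired label map
```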
In addition, according to steps S8 to S10, the dataset is traversed for forward computation, and the loss is updated after each batch. Pixels are counted per class to obtain the class proportions, which are fused into the cross-entropy loss as per-class weighting coefficients. Starting from the loss node, the computation graph is traversed backwards to obtain gradients and update the model parameters. The optimizer is Adadelta, which converges quickly in the early and middle stages of training.
In addition, as shown in fig. 3, the parameters of the semantic segmentation network obtained after training are loaded into a network of the corresponding structure at inference time, and each slice of the high-resolution remote sensing image is segmented. Panels (a)/(b)/(c) of fig. 3 show the original slice, the corresponding ground-truth label, and the parallel hole convolution segmentation result, respectively; different ground-object classes are rendered with different pixel values. Fig. 3 shows that the remote sensing image semantic segmentation method based on parallel hole convolution achieves a good result, with the segmentation close to the ground-truth annotation.
In addition, according to step S11, the semantic segmentation results of the respective slices are consolidated, and when the respective slices are spliced, unreasonable prediction results are repaired by simply filling holes and removing small connected domains.
In summary, by means of the above technical solution, a high-resolution remote sensing image is acquired in advance, sliced, normalized, and standardized to obtain the source high-resolution remote sensing image; the feature extraction network is initialized with resnet101 parameters pre-trained on ImageNet, the low-level layers of resnet101 are taken, a parallel hole convolution network is constructed, and shallow features of the source image are extracted; the shallow features are fed into the parallel hole convolution network to obtain multi-scale information, the multi-scale information is fused, the fused features are re-fused with the shallow features, and image-level information is repaired with a fully connected conditional random field to obtain the semantic segmentation result, so that the convolution receptive field is enlarged without adding extra parameters. The parallel computing structure makes it easy to distribute the nodes of the neural network computation graph over distributed hardware, improving computation speed; the multi-scale information helps the network capture both fine details and large objects, so small targets are less likely to be missed and segmentation accuracy is improved; in addition, hole convolution perceives a wide neighborhood around the target object, and this neighboring information supports effective pixel-level classification, giving a better pixel-level classification effect than standard convolution.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (1)

1. A remote sensing image semantic segmentation method based on parallel hole convolution, characterized by comprising the following steps:
acquiring a high-resolution remote sensing image in advance, slicing it, and normalizing and standardizing the slices to obtain the source high-resolution remote sensing image;
initializing the feature extraction network with resnet101 parameters pre-trained on ImageNet, taking the low-level layers of resnet101, constructing a parallel hole convolution network, and extracting shallow features of the source high-resolution remote sensing image; based on the resnet101-initialized network, intercepting layer1 to layer4, wherein the hole convolution expansion rate of layer4 is 2 and the expansion rates of layers 1 to 3 are 1, which is equivalent to standard convolution;
feeding the shallow features into the parallel hole convolution network to obtain multi-scale information and fusing the multi-scale information, wherein the multi-scale information is captured by setting different expansion rates; applying hole spatial pyramid convolution to the features output by resnet101 with parallel convolutions of different expansion rates, and replacing the global pooling branch with a standard convolution branch; fusing the low-level features generated by layer1 of resnet101 with the linearly interpolated spatial pyramid convolution result using a skip-level structure;
re-fusing the fused features with the shallow features, and repairing image-level information with a fully connected conditional random field to obtain the semantic segmentation result;
performing semantic segmentation on the normalized and standardized slices one by one, and repairing unreasonable prediction results by simply filling holes and removing small connected domains when the semantic segmentation results of the slices are spliced;
the parallel hole convolution network comprises the following steps:
starting from standard convolution: let F: \mathbb{Z}^2 \to \mathbb{R} be a discrete function, let \Omega_r = [-r, r]^2 \cap \mathbb{Z}^2, and let k: \Omega_r \to \mathbb{R} be a discrete convolution kernel; the convolution centered on p is computed as
(F * k)(p) = \sum_{s + t = p} F(s)\,k(t);
generalizing the standard convolution with an expansion rate l gives the hole convolution
(F *_l k)(p) = \sum_{s + l \cdot t = p} F(s)\,k(t);
the shallow features are subjected to hole convolution with different expansion rates in parallel to obtain multi-scale features, and the multi-scale features are fused by splicing, forming the parallel hole convolution network layer;
repairing the image-level information with the fully connected conditional random field comprises the following steps:
the energy function used by the fully connected conditional random field is
E(x) = \sum_i \theta_i(x_i) + \sum_{i < j} \theta_{ij}(x_i, x_j);
the unary potential function describes the agreement between an observation and its label:
\theta_i(x_i) = -\log P(x_i),
wherein i indexes the pixels and P(x_i) is the probability the network assigns to the class of pixel i; the binary potential function describes the correlation between observations:
\theta_{ij}(x_i, x_j) = \mu(x_i, x_j) \sum_m w_m k_m(f_i, f_j),
wherein \mu(x_i, x_j) = 1 when x_i \neq x_j and \mu(x_i, x_j) = 0 otherwise, k_m(f_i, f_j) is a Gaussian kernel between f_i and f_j, f_i is the color information of pixel i, i.e. its feature vector, and w_m is the weight of the Gaussian kernel;
slicing the high-resolution remote sensing image comprises slicing it into tiles whose length and width are 512 pixels;
the method further comprises the following step:
extracting the three RGB channels from the sliced high-resolution remote sensing image;
the expansion rates are set to 2, 3, 4, and 5, respectively.
CN202110129416.3A 2021-01-29 2021-01-29 Remote sensing image semantic segmentation method based on parallel hole convolution Active CN112837320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110129416.3A CN112837320B (en) 2021-01-29 2021-01-29 Remote sensing image semantic segmentation method based on parallel hole convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110129416.3A CN112837320B (en) 2021-01-29 2021-01-29 Remote sensing image semantic segmentation method based on parallel hole convolution

Publications (2)

Publication Number Publication Date
CN112837320A CN112837320A (en) 2021-05-25
CN112837320B true CN112837320B (en) 2023-10-27

Family

ID=75931168

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110129416.3A Active CN112837320B (en) 2021-01-29 2021-01-29 Remote sensing image semantic segmentation method based on parallel hole convolution

Country Status (1)

Country Link
CN (1) CN112837320B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113486840B (en) * 2021-07-21 2022-08-30 武昌理工学院 Building rapid extraction method based on composite network correction
CN113780297B (en) * 2021-09-15 2024-03-12 北京百度网讯科技有限公司 Image processing method, device, equipment and storage medium
CN114067221B (en) * 2022-01-14 2022-04-15 成都数联云算科技有限公司 Remote sensing image woodland extraction method, system, device and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN108877832A (en) * 2018-05-29 2018-11-23 东华大学 A kind of audio sound quality also original system based on GAN
CN109285162A (en) * 2018-08-30 2019-01-29 杭州电子科技大学 A kind of image, semantic dividing method based on regional area conditional random field models
CN109461157A (en) * 2018-10-19 2019-03-12 苏州大学 Image, semantic dividing method based on multi-stage characteristics fusion and Gauss conditions random field
CN109741383A (en) * 2018-12-26 2019-05-10 西安电子科技大学 Picture depth estimating system and method based on empty convolution sum semi-supervised learning
CN110070022A (en) * 2019-04-16 2019-07-30 西北工业大学 A kind of natural scene material identification method based on image
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN110781775A (en) * 2019-10-10 2020-02-11 武汉大学 Remote sensing image water body information accurate segmentation method supported by multi-scale features
CN111259828A (en) * 2020-01-20 2020-06-09 河海大学 High-resolution remote sensing image multi-feature-based identification method
CN111539959A (en) * 2020-07-13 2020-08-14 浙江省肿瘤医院(浙江省癌症中心) Thyroid nodule ultrasonic image processing method based on cross-layer sparse hole convolution
CN112183360A (en) * 2020-09-29 2021-01-05 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2580671B (en) * 2019-01-22 2022-05-04 Toshiba Kk A computer vision system and method

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062756A (en) * 2018-01-29 2018-05-22 重庆理工大学 Image, semantic dividing method based on the full convolutional network of depth and condition random field
CN110232394A (en) * 2018-03-06 2019-09-13 华南理工大学 A kind of multi-scale image semantic segmentation method
CN108877832A (en) * 2018-05-29 2018-11-23 东华大学 A kind of audio sound quality also original system based on GAN
CN109285162A (en) * 2018-08-30 2019-01-29 杭州电子科技大学 A kind of image, semantic dividing method based on regional area conditional random field models
CN109461157A (en) * 2018-10-19 2019-03-12 苏州大学 Image, semantic dividing method based on multi-stage characteristics fusion and Gauss conditions random field
CN109741383A (en) * 2018-12-26 2019-05-10 西安电子科技大学 Picture depth estimating system and method based on empty convolution sum semi-supervised learning
CN110070022A (en) * 2019-04-16 2019-07-30 西北工业大学 A kind of natural scene material identification method based on image
CN110781775A (en) * 2019-10-10 2020-02-11 武汉大学 Remote sensing image water body information accurate segmentation method supported by multi-scale features
CN111259828A (en) * 2020-01-20 2020-06-09 河海大学 High-resolution remote sensing image multi-feature-based identification method
CN111539959A (en) * 2020-07-13 2020-08-14 浙江省肿瘤医院(浙江省癌症中心) Thyroid nodule ultrasonic image processing method based on cross-layer sparse hole convolution
CN112183360A (en) * 2020-09-29 2021-01-05 上海交通大学 Lightweight semantic segmentation method for high-resolution remote sensing image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-Receptive Atrous Convolutional Network for Semantic Segmentation;Mingyang Zhong等;《2020 International Joint Conference on Neural Network》;20200731;第1-8页 *
Research on Named Entity Recognition Methods Based on Deep Learning; Li Ping; China Excellent Master's and Doctoral Dissertations Full-text Database (Master's), Information Science and Technology Series; 2020-07-15 (No. 07); pp. I138-1592 *

Also Published As

Publication number Publication date
CN112837320A (en) 2021-05-25

Similar Documents

Publication Publication Date Title
CN112837320B (en) Remote sensing image semantic segmentation method based on parallel hole convolution
CN110232394B (en) Multi-scale image semantic segmentation method
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
Henry et al. Road segmentation in SAR satellite images with deep fully convolutional neural networks
CN108154192B (en) High-resolution SAR terrain classification method based on multi-scale convolution and feature fusion
CN111582043B (en) High-resolution remote sensing image ground object change detection method based on multitask learning
CN106845529B (en) Image feature identification method based on multi-view convolution neural network
CN113076871B (en) Fish shoal automatic detection method based on target shielding compensation
CN112541904B (en) Unsupervised remote sensing image change detection method, storage medium and computing device
CN110598600A (en) Remote sensing image cloud detection method based on UNET neural network
CN108038435B (en) Feature extraction and target tracking method based on convolutional neural network
CN111652892A (en) Remote sensing image building vector extraction and optimization method based on deep learning
CN110287777B (en) Golden monkey body segmentation algorithm in natural scene
WO2020062360A1 (en) Image fusion classification method and apparatus
CN110458192B (en) Hyperspectral remote sensing image classification method and system based on visual saliency
CN112233129B (en) Deep learning-based parallel multi-scale attention mechanism semantic segmentation method and device
CN113269224B (en) Scene image classification method, system and storage medium
CN107506792B (en) Semi-supervised salient object detection method
CN112101364B (en) Semantic segmentation method based on parameter importance increment learning
CN111539314A (en) Cloud and fog shielding-oriented sea surface target significance detection method
CN111179196B (en) Multi-resolution depth network image highlight removing method based on divide-and-conquer
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
Li et al. An aerial image segmentation approach based on enhanced multi-scale convolutional neural network
Zuo et al. A remote sensing image semantic segmentation method by combining deformable convolution with conditional random fields
CN117058546A (en) High-resolution remote sensing image building extraction method of global local detail perception conditional random field

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230615

Address after: 430074 Hubei Province, Wuhan city Hongshan District Luoyu Road No. 1037

Applicant after: HUAZHONG University OF SCIENCE AND TECHNOLOGY

Address before: Room 305-65, 2-3 / F, 5-15 / F, R & D building / unit 1, modern service industry base, Science Park, Huazhong University of science and technology, 13-1, daxueyuan Road, Donghu New Technology Development Zone, Wuhan City, Hubei Province, 430000

Applicant before: Wuhan shanlai Technology Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant