CN109447994B - Remote sensing image segmentation method combining complete residual error and feature fusion - Google Patents

Remote sensing image segmentation method combining complete residual error and feature fusion

Info

Publication number
CN109447994B
CN109447994B (application CN201811306585.4A)
Authority
CN
China
Prior art keywords
encoder
network
convolution
decoder
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811306585.4A
Other languages
Chinese (zh)
Other versions
CN109447994A (en)
Inventor
汪西莉
张小娟
洪灵
刘明
刘侍刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN201811306585.4A priority Critical patent/CN109447994B/en
Publication of CN109447994A publication Critical patent/CN109447994A/en
Application granted granted Critical
Publication of CN109447994B publication Critical patent/CN109447994B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10032 Satellite or aerial image; Remote sensing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

A remote sensing image segmentation method combining complete residual and multi-scale feature fusion comprises the following steps. S100: improving the convolutional encoding-decoding network that serves as the segmentation backbone, specifically: S101: adopting a convolutional encoding-decoding network as the segmentation backbone; S102: adding a feature pyramid module for aggregating multi-scale context information to the backbone network; S103: adding residual units inside the corresponding convolutional layers of the encoder and decoder of the backbone network, and at the same time fusing the features of the encoder into the corresponding decoder layers by pixel-wise addition. S200: segmenting the remote sensing image with the improved image segmentation network combining complete residual and multi-scale feature fusion. S300: outputting the segmentation result of the remote sensing image. The method simplifies the training of a deep network, strengthens feature fusion, enables the network to extract rich context information, copes with changes in target scale, and improves segmentation performance.

Description

Remote sensing image segmentation method combining complete residual error and feature fusion
Technical Field
The disclosure belongs to the technical field of remote sensing image processing, and particularly relates to a remote sensing image segmentation method combining complete residual and multi-scale feature fusion.
Background
With the advent of unmanned aerial vehicles and improvements in acquisition sensors, remote sensing images at extreme resolutions (< 10 cm) have become available, particularly in urban areas. Compared with ordinary images, as spatial resolution improves the spectral and ground-object information contained in a remote sensing image becomes increasingly rich, target scales vary widely, and occlusion and shadowing become more common, all of which challenge the understanding of high-resolution remote sensing images. Research on remote sensing image segmentation is therefore of great significance for the growing demands of remote sensing data processing, such as environment modeling, land-use change detection and urban planning.
Image segmentation refers to the process of dividing sets of pixels with similar features into several image sub-regions; it can also be regarded as assigning a unique label (or category) to each pixel so that pixels with the same label share certain visual characteristics, making the image easier to understand and analyze. At present, deep learning methods, especially the convolutional neural network (CNN), have achieved remarkable results in the field of image processing and exert a growing influence on remote sensing image processing.
Deep learning can be applied to image segmentation, but some disadvantages remain. For a deep convolutional neural network, firstly, feature information at different scales can be extracted by multi-rate dilated convolution and spatial pyramid pooling structures, but the gridding artifacts and the loss of local information caused by dilated convolution and pooling greatly limit the improvement of final segmentation accuracy. Secondly, deeper convolutional neural networks with better performance are used as segmentation backbones; although they can improve segmentation accuracy to some extent and overcome vanishing gradients, their network structures are overly complex and training them consumes a large amount of memory. Features at all levels are considered helpful for semantic segmentation: high-level features help category identification, while low-level features help refine the details of segmentation results.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a remote sensing image segmentation method combining complete residual and multi-scale feature fusion, comprising the following steps:
S100: improving the convolutional encoding-decoding network that serves as the segmentation backbone, as follows:
S101: adopting a convolutional encoding-decoding network as the segmentation backbone, which contains two components: an encoder and a decoder;
S102: adding a feature pyramid module for aggregating multi-scale context information to the backbone network;
S103: adding residual units inside the corresponding convolutional layers of the encoder and decoder of the backbone network, and at the same time fusing the features of the encoder into the corresponding decoder layers by pixel-wise addition;
S200: segmenting the remote sensing image with the improved image segmentation network combining complete residual and multi-scale feature fusion;
S300: outputting the segmentation result of the remote sensing image.
Through this technical scheme, the features of the encoder are first fused into the corresponding decoder layers by pixel-wise addition on the basis of a convolutional encoding-decoding network; these connections can be called long-range residual connections. Secondly, short-range residual connections are introduced inside the corresponding convolutional layers of the encoder and decoder. The long-range and short-range residual connections together not only blend more of the original input information into each layer and strengthen feature fusion, but also allow the gradient to propagate directly to any convolutional layer, thereby simplifying training. When fusing encoder features into the decoder, only the last layer of features is selected for the shallow stages, whereas all high-level features are selected for the deep stages; in the fifth stage a feature pyramid module that aggregates multi-scale information is used, so that fusing features of different contents and scales enables the whole network to cope effectively with changes in target scale and improves segmentation performance.
Drawings
FIG. 1 is a schematic flow chart diagram of a remote sensing image segmentation method combining complete residual and multi-scale feature fusion provided in an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a feature pyramid module in one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a residual unit according to an embodiment of the present disclosure;
FIG. 4 is a graph comparing the segmentation results for each deep net on the ISPRS Vaihingen test set in one embodiment of the present disclosure;
FIG. 5 shows the evaluation results of the different methods of FIG. 4 on each image, in one embodiment of the present disclosure;
FIGS. 6(a) and 6(b) are graphs comparing evaluation results of each deep network on an ISPRS Vaihingen test set in an embodiment of the disclosure;
FIG. 7 compares the method with methods in the literature that currently have good segmentation performance, in one embodiment of the present disclosure;
FIG. 8 is a comparison graph of segmentation results of comparison networks on an annotated image in one embodiment of the present disclosure;
FIG. 9 is a graph comparing the segmentation results of each deep net in the Road Detection test set in one embodiment of the present disclosure;
FIG. 10 shows the evaluation results of the methods of FIG. 9 on each image, in one embodiment of the present disclosure;
FIG. 11 is a comparison graph of evaluation results of deep nets on the Road Detection test set in one embodiment of the present disclosure;
FIG. 12 compares the present method with prior road segmentation methods on the Road Detection data set, in one embodiment of the present disclosure;
FIG. 13 is a comparison graph of segmentation results of different comparison networks on a label-free image in a road image in one embodiment of the present disclosure.
Detailed Description
In one embodiment, as shown in fig. 1, a method for segmenting a remote sensing image combining complete residual and multi-scale feature fusion is disclosed, comprising the following steps:
S100: improving the convolutional encoding-decoding network that serves as the segmentation backbone, as follows:
S101: adopting a convolutional encoding-decoding network as the segmentation backbone, which contains two components: an encoder and a decoder;
S102: adding a feature pyramid module for aggregating multi-scale context information to the backbone network;
S103: adding residual units inside the corresponding convolutional layers of the encoder and decoder of the backbone network, and at the same time fusing the features of the encoder into the corresponding decoder layers by pixel-wise addition;
S200: segmenting the remote sensing image with the improved image segmentation network combining complete residual and multi-scale feature fusion;
S300: outputting the segmentation result of the remote sensing image.
The image segmentation network combining complete residual and multi-scale feature fusion is defined as follows: on the basis of a convolutional encoding-decoding network, complete residual connections are added between the encoder and the decoder and inside their convolutional layers, and a feature pyramid module (FPM) that aggregates multi-scale features is applied to the convolutional features of the last convolution stage of the encoder.
Specifically: first, the basic network is a convolutional encoder-decoder network consisting of a completely symmetric encoder and decoder. Next, short-range residual connections are added inside the convolutional layers of the encoder and decoder. A residual connection takes an input, learns the residual of that input through a series of operation units such as convolution layers, batch normalization units and rectified linear units, and then adds the residual to the input to obtain the output. Meanwhile, the features of each convolution stage in the encoder are fused into the corresponding decoder layers by pixel-wise addition; by analogy with the processing principle of a residual unit, these connections are called long-range residual connections, and together with the short-range residual connections they form the complete residual connections. Finally, when the features of each convolution stage in the encoder are fused into the corresponding decoder layers, a feature pyramid module (FPM) is applied to the features of the fifth convolution stage of the encoder, so that context information at different scales is aggregated and the resulting multi-scale features are fused into the corresponding decoder layer. The complete residual connections and the feature pyramid module operate simultaneously and are different operations at the same level.
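As a concrete illustration of this structure (not part of the original disclosure), the following PyTorch sketch shows one way the symmetric encoder-decoder, the index-preserving pooling and unpooling, the pixel-wise-addition long-range residual connections and a fifth-stage feature pyramid module could be wired together. The patent reports a Caffe implementation; the class names, channel sizes and the simplification of fusing only one feature map per stage are assumptions of this sketch.

# Minimal sketch of an encoder-decoder with long-range residual fusion (assumption-based).
import torch
import torch.nn as nn

def stage(c_in, c_out, n_convs):
    """One convolution stage: n_convs x (conv 3x3 -> BN -> ReLU)."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class EncoderDecoder(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        chans = [64, 128, 256, 512, 512]        # five stages, VGG16-like
        convs = [2, 2, 3, 3, 3]                 # 13 convolutional layers in total
        self.enc = nn.ModuleList(
            [stage(3 if i == 0 else chans[i - 1], chans[i], convs[i]) for i in range(5)])
        self.pool = nn.MaxPool2d(2, 2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, 2)
        self.dec = nn.ModuleList(
            [stage(chans[i], chans[i - 1] if i > 0 else chans[0], convs[i]) for i in range(5)])
        self.fpm = nn.Identity()                # placeholder for the FPM sketched further below
        self.classifier = nn.Conv2d(chans[0], num_classes, 1)

    def forward(self, x):
        skips, indices = [], []
        for enc in self.enc:                    # encoder: conv stages + max pooling with indices
            x = enc(x)
            skips.append(x)
            x, idx = self.pool(x)
            indices.append(idx)
        skips[-1] = self.fpm(skips[-1])         # multi-scale context at the fifth stage
        for i in reversed(range(5)):            # decoder mirrors the encoder
            x = self.unpool(x, indices[i], output_size=skips[i].shape[-2:])
            x = x + skips[i]                    # long-range residual: pixel-wise addition
            x = self.dec[i](x)
        return self.classifier(x)

out = EncoderDecoder()(torch.randn(1, 3, 256, 256))   # -> (1, 6, 256, 256)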
This embodiment adopts the improved image segmentation network combining complete residual and multi-scale feature fusion; it simplifies the training of a deep network, strengthens feature fusion, and, through feature fusion at different scales and in different modes, enables the network to extract rich context information, cope with changes in target scale, and improve segmentation performance.
In another embodiment, the encoder in step S101 comprises 13 convolutional layers and 5 pooling layers, and a decoder is stacked on top of the encoder; the decoder is in a full mirror relationship with the encoder and comprises 13 convolutional layers and 5 unpooling layers.
With this embodiment, the encoder performs feature extraction on the input data with convolution kernels, which achieves a good feature-extraction effect.
In another embodiment, the 13 convolutional layers of the encoder are divided into five convolution stages, the first convolution stage and the second convolution stage each containing two convolutional layers, and the third convolution stage, the fourth convolution stage and the fifth convolution stage each containing three convolutional layers.
In another embodiment, each convolution layer is followed by a batch normalization unit and a rectified linear unit, wherein the batch normalization unit normalizes the extracted feature data and the rectified linear unit adds a nonlinear factor; one pooling layer follows each convolution stage.
For this embodiment, the batch normalization unit alleviates the shift of the data distribution in intermediate layers during training, preventing the gradient from vanishing and accelerating training; the rectified linear unit adds a nonlinear factor and improves the network's capability to represent the data.
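As a minimal sketch (PyTorch, with hypothetical channel sizes; the patent's Caffe layers are analogous), one such convolution + batch normalization + rectified linear unit block can be composed as follows.

import torch
import torch.nn as nn

conv_bn_relu = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # feature extraction
    nn.BatchNorm2d(64),                           # normalize the extracted features
    nn.ReLU(inplace=True),                        # add the nonlinear factor
)

x = torch.randn(1, 64, 64, 64)
y = conv_bn_relu(x)   # same spatial size, normalized and rectified activations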
In another embodiment, the pooling operation in the encoder employs max pooling, and the index positions of the maxima are preserved.
For this embodiment, preserving the max-pooling indices allows the unpooling layer to expand a smaller feature map into a sparse feature map of larger size.
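A small sketch of this behaviour (PyTorch, hypothetical sizes): max pooling with recorded indices lets the decoder's unpooling layer place values back at the remembered positions, yielding a sparse enlarged feature map.

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)
pooled, idx = pool(x)          # (1, 64, 16, 16) plus the saved index positions
sparse = unpool(pooled, idx)   # (1, 64, 32, 32), non-zero only at the saved positions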
In another embodiment: pyramid structures are commonly used to extract context information at different scales, such as the spatial pyramid pooling module in the PSPNet network or the ASPP module with dilated convolution in the DeepLab network. Such modules aggregate multi-scale information by channel concatenation of parallel branches, which on the one hand introduces too many network parameters, and on the other hand easily causes loss of local information (from pooling) and gridding artifacts (from dilated convolution), which ultimately affect the local consistency of the feature map. Therefore, the feature pyramid module (FPM) of this method has the structure shown in fig. 2: convolution kernels of 3x3 and 5x5 are applied to the original input feature map (conv5) to extract context information at different scales, which is then integrated step by step so that context features of adjacent scales are combined. The original input feature map (conv5) is also passed through a 1x1 convolution and multiplied pixel-wise with the multi-scale features. Finally, global pooling information is fused in to improve the performance of the feature pyramid module. "Upsample" in fig. 2 refers to enlarging a feature map to a specified resolution through a deconvolution operation.
For this embodiment, the feature pyramid module reduces the computational load and avoids the loss of local information and the gridding artifacts.
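The following PyTorch sketch gives one plausible reading of the FPM just described; since FIG. 2 is not reproduced here, the arrangement of the 3x3/5x5 branches, the 1x1 gating convolution and the global-pooling branch, as well as the use of interpolation in place of the deconvolution-based Upsample, are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPM(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)   # 3x3 context
        self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)   # 5x5 context
        self.gate = nn.Conv2d(channels, channels, 1)                 # 1x1 convolution of the input
        self.global_proj = nn.Conv2d(channels, channels, 1)          # global pooling branch

    def forward(self, conv5):
        multi = self.branch3(conv5) + self.branch5(conv5)            # gradual pixel-wise addition
        gated = self.gate(conv5) * multi                             # pixel-wise multiplication
        g = F.adaptive_avg_pool2d(conv5, 1)                          # global context
        g = F.interpolate(self.global_proj(g), size=conv5.shape[-2:])  # enlarge to the map size
        return gated + g                                             # fuse global information

out = FPM()(torch.randn(1, 512, 8, 8))   # same shape as the stage-5 feature map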
In another embodiment, the feature pyramid module is used before the features of the fifth convolution stage in the encoder are fused to the corresponding layers in the decoder.
For this embodiment, the feature pyramid module is applied at the conv5 stage because the high-level feature maps in the deeper layers have a smaller resolution, so using larger convolution kernels does not impose an excessive computational burden.
In another embodiment, the gradual integration is a step-by-step pixel-wise addition that aggregates the multi-scale information.
For this embodiment, aggregating multi-scale information by gradual pixel-wise addition takes the hierarchical dependency between features at different scales into account and preserves the local consistency of the feature information.
In another embodiment, fusing the features of the encoder into the corresponding decoder layers by pixel-wise addition in step S103 specifically comprises:
selecting only the last convolution feature map in the first and second convolution stages of the encoder, and selecting all convolution feature maps for pixel-wise addition fusion in the third, fourth and fifth convolution stages of the encoder.
With this embodiment, the loss of feature map resolution is reduced.
In another embodiment, residual units as in fig. 3, called short-range residual connections, are incorporated inside the convolution stages of the encoder and the decoder respectively. In FIG. 3, X_l and y denote the input and output of the l-th residual unit, and F(X_l) denotes the residual learned by the unit, obtained through a series of operations such as convolution layers, batch normalization (BN) and rectified linear units (ReLU). The convolution layer extracts features, the batch normalization unit normalizes the extracted feature data, and the rectified linear unit adds a nonlinear factor. The output is y = F(X_l) + X_l; in the special case where the residual F(X_l) = 0, the output equals the input. By analogy with the principle of the residual unit in fig. 3, the feature fusion connections of step S103 are called long-range residual connections, and together with the short-range residual connections they form the complete residual connections. On the one hand this solves the problem of vanishing gradients as the depth of the network increases; on the other hand, regarding the feature-map information lost through the convolution operations of the deep network, the complete residual connections fuse not only the multi-scale features but also the original input information of each layer, supplementing the lost information to a certain extent and further strengthening feature fusion.
With this embodiment, the residual units effectively prevent the gradient from vanishing.
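A minimal sketch of the residual unit of FIG. 3 (PyTorch; the channel count and the number of conv-BN-ReLU operations inside F are assumptions): the output is y = F(X_l) + X_l.

import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.residual = nn.Sequential(                 # F(X_l): conv -> BN -> ReLU, twice
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.residual(x) + x                    # y = F(X_l) + X_l; if F(X_l) = 0 the output equals the input

y = ResidualUnit()(torch.randn(1, 256, 32, 32))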
In another embodiment, a workstation running a 64-bit Ubuntu system is used, with an Intel(R) Xeon(R) E5-2690 v3 2.6 GHz processor, 256 GB of memory and a 4 TB hard disk. The whole network is trained on the Caffe deep learning platform, and an NVIDIA Tesla K40c GPU with 12 GB of video memory accelerates the training process. The network parameters are initialized with VGG16 pre-trained on the ImageNet data set, and the remaining layer parameters are initialized by the MSRA initialization method proposed by He et al. (2015), which, considering only the number of inputs n, makes the weights obey a Gaussian distribution with mean 0 and variance 2/n. During training, the learning rate is fixed at 0.0001, batch_size is 5, gamma is 1, weight decay is 0.0002, momentum is 0.99, and the maximum number of iterations is 100000.
In the back propagation stage of training, the error is calculated by a cross-entropy loss function and the weights of the whole network are updated by stochastic gradient descent. The cross-entropy loss function is defined as

Loss(l, p; θ) = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} σ(l_i = k) · log p_{k,i}

where l_i denotes the true label at pixel i, p_{k,i} denotes the output probability that pixel i belongs to the k-th class, K denotes the total number of classes, N denotes the total number of pixels in the batch of images, and σ(·) denotes the indicator function, equal to 1 when l_i = k and 0 otherwise. l denotes the set of true labels, p denotes the output of the last convolutional layer in the decoder, θ denotes the set of parameters in the loss function, and log defaults to base 10.
A convolutional neural network uses the back propagation algorithm to transmit the error at the end of the network back to each layer, so that each layer modifies and updates its weights and ultimately extracts features better. The standard back propagation (BP) procedure comprises a forward propagation phase and a backward propagation phase. In the forward propagation phase, features of the input image are learned with the initially given weights and a prediction is obtained at the end of the network; an error exists between this prediction and the truly given label values, and the weights are not updated in this phase. To make the weights of each layer better model the distribution of features in the image, the error is passed back layer by layer in the backward propagation phase to update the weights of each layer. After the weights have been updated through many forward and backward propagation passes, the prediction learned by the network comes closer to the true label values. A stochastic gradient descent algorithm is used to update the weights. The error mentioned above is computed by a loss function; in this method, a cross-entropy loss function computes the error between the true label values and the forward-propagation output.
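The following sketch illustrates one training step with the hyper-parameters listed above (PyTorch standing in for the Caffe setup reported in the patent; the stand-in model and tensor shapes are hypothetical, and nn.CrossEntropyLoss uses the natural logarithm rather than the base-10 logarithm mentioned above).

import torch
import torch.nn as nn

model = nn.Conv2d(3, 6, 1)                       # stand-in for the segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.99, weight_decay=2e-4)
criterion = nn.CrossEntropyLoss()                # error between prediction and true labels

images = torch.randn(5, 3, 256, 256)             # batch_size = 5
labels = torch.randint(0, 6, (5, 256, 256))      # per-pixel class labels (K = 6)

for step in range(2):                            # max iterations = 100000 in the patent
    optimizer.zero_grad()
    logits = model(images)                       # forward propagation
    loss = criterion(logits, labels)             # cross-entropy loss
    loss.backward()                              # back-propagate the error
    optimizer.step()                             # update the weights by SGD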
In another embodiment, the following two data sets are used to verify the performance of the network in segmenting remote sensing images, and both are augmented as described below:
(1) ISPRS Vaihingen Dataset: a benchmark data set of the ISPRS 2D semantic labeling challenge over Vaihingen, consisting of 3-band IRRG (near infrared, red, green) image data and corresponding digital surface model (DSM) and normalized digital surface model (NDSM) data. The data set contains 33 images of different sizes with a ground sampling distance of 9 cm, of which 16 are labeled; each image is labeled with six classes: Impervious surface, Building, Low vegetation, Tree, Car, and Clutter/Background. From the 16 labeled images, 12 were randomly selected as the training set, 2 as the validation set and 2 as the test set. This data set is relatively small for training a deep network, and 256x256 image blocks were selected to train the network in the experiments. Because the numbers of training and validation images partitioned above are relatively small, a two-stage approach is used to augment the training and validation sets. In the first stage, for a given image, since the sizes differ, a sliding window of size 256x256 with stride 128 is used to crop the IRRG image and its corresponding label map, and 3 further image blocks are extracted at fixed positions (top right, bottom left and bottom right corners). In the second stage, all image blocks are rotated by 90, 180 and 270 degrees, and all rotated image blocks are then mirrored horizontally and vertically. Finally, 15000 training samples and 2045 validation samples are obtained.
(2) Road Detection Dataset: this data set was collected by Cheng et al. (2017) from Google Earth and manually annotated with a road segmentation reference map and its corresponding centerline reference map; it is currently the largest road data set. It contains 224 high-resolution images with a spatial resolution of 1.2 m; each image has at least 600x600 pixels and the road width is about 12-15 pixels. The 224 images were randomly divided into 180 training images, 14 validation images and 30 test images. 300x300 image blocks were chosen for training the network in the experiments. Similarly, a two-stage approach is used to augment the training and validation sets. In the first stage, for a given image, 4 image blocks are extracted at fixed positions (top left, top right, bottom left and bottom right), and 25 image blocks are then cropped at random positions from the original image and the label map with a 300x300 sliding window. In the second stage, all image blocks are rotated in steps of 90 degrees and then flipped in the horizontal and vertical directions. Finally, 31320 training samples and 2436 validation samples are obtained.
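As an illustration of the two-stage augmentation described for both data sets, the following NumPy sketch crops sliding-window patches and then rotates and mirrors them; the tile size, stride and the handling of the fixed-position crops are assumptions of this sketch.

import numpy as np

def sliding_window_patches(image, size=256, stride=128):
    # Stage one: crop patches with a sliding window over the tile.
    h, w = image.shape[:2]
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, stride)
            for c in range(0, w - size + 1, stride)]

def augment(patch):
    # Stage two: rotate by 90/180/270 degrees, then mirror horizontally and vertically.
    out = []
    for k in (1, 2, 3):
        rot = np.rot90(patch, k)
        out += [rot, np.fliplr(rot), np.flipud(rot)]
    return out

image = np.zeros((1024, 1024, 3), dtype=np.uint8)     # stand-in IRRG tile
patches = sliding_window_patches(image)
augmented = [a for p in patches for a in augment(p)]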
In another embodiment, to verify the effectiveness of the image segmentation network of the present invention, the following networks are compared:
Four semantic segmentation networks: FCN8s (Long et al., 2015), DeconvNet (Noh et al., 2015), SegNet (Badrinarayanan et al., 2017) and U-Net (Ronneberger et al., 2015).
Structurally, FCN8s is the simplest of the four: its VGG16-based encoding part comprises 15 convolutional layers and 5 pooling layers, and its decoding part enlarges the feature maps of the third, fourth and fifth convolution stages by deconvolution, adds them layer by layer for feature fusion, and finally performs per-pixel class prediction. DeconvNet, SegNet and U-Net can be grouped as fully symmetric encoding-decoding networks of comparable depth; their encoders are completed by convolution and pooling operations, the decoders of DeconvNet and SegNet are completed by unpooling and deconvolution (or convolution) operations, and the decoder of U-Net is completed by deconvolution operations only. The decoding process of this type of encoding-decoding network is deeper than that of FCN8s. In terms of feature fusion, FCN8s and U-Net perform feature fusion: FCN8s fuses the feature maps of the third, fourth and fifth stages of the encoder by layer-wise addition, while U-Net copies the last feature map of each convolution stage in the encoder and concatenates it into the corresponding decoder layer, so more feature information is fused and the fusion mode is more complex. DeconvNet and SegNet do not use feature fusion during decoding; they only enlarge the high-level encoder features layer by layer to a feature map of the same size as the input image and then perform per-pixel class prediction.
The image segmentation network of the present invention can also be classified as an encoding-decoding network and is structurally very similar to U-Net, but there are four differences. First, the fusion mode differs: the present network fuses the encoder feature maps into the corresponding decoder layers by pixel-wise addition, while U-Net performs feature fusion by channel concatenation; compared with concatenation, pixel-wise addition adds no extra parameters to the network. Second, the fused content differs: because the successive convolution and pooling operations in the encoder lose feature-map resolution, the present network selects the last convolution feature of the first and second stages and all convolution features of the third, fourth and fifth stages for fusion, whereas U-Net only selects the last feature map of each convolution stage. Third, multi-scale features are fused: before the fifth-stage feature map is fused into the corresponding layer, the present network extracts multi-scale feature information with the feature pyramid module so as to cope with multi-scale changes of the targets, while U-Net does not fuse features of different scales. Fourth, the present network adds residual connections inside the corresponding convolutional layers of the encoder and decoder, and these, together with the feature-fusion connections, form the complete residual connections, which allow the gradient to propagate directly to any convolutional layer and thus simplify training; U-Net does not use residual connections.
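A small illustration of the first difference (PyTorch, hypothetical shapes): pixel-wise addition keeps the channel count and adds no parameters, whereas channel concatenation doubles the channels and enlarges the following convolution.

import torch
import torch.nn as nn

enc = torch.randn(1, 256, 64, 64)
dec = torch.randn(1, 256, 64, 64)

fused_add = dec + enc                      # still 256 channels, no extra parameters
fused_cat = torch.cat([dec, enc], dim=1)   # 512 channels

conv_after_add = nn.Conv2d(256, 256, 3, padding=1)   # 256*256*9 weights
conv_after_cat = nn.Conv2d(512, 256, 3, padding=1)   # twice as many weights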
In another embodiment, to quantitatively evaluate the performance of the segmentation networks, the following evaluation indexes are used, explained and defined as follows:
F1 value (F1-score), overall accuracy (OA) and intersection-over-union ratio (IOU).
The F1 value is the harmonic mean of precision (P) and recall (R) and is a comprehensive evaluation index; the overall accuracy (OA) measures the percentage of all correctly labeled pixels in the total number of image pixels. They are defined as follows:

P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2·P·R / (P + R), OA = (TP + TN) / (TP + TN + FP + FN)

where TP (true positive) is a positive sample judged as positive, FP (false positive) is a negative sample judged as positive, FN (false negative) is a positive sample judged as negative, and TN (true negative) is a negative sample judged as negative.
The IOU is a standard metric for semantic segmentation; it is the ratio of the number of pixels in the intersection of the prediction and the true labels to the number of pixels in their union, defined as

IOU = |P_gt ∩ P_m| / |P_gt ∪ P_m|

where P_gt is the set of pixels of the true label map, P_m is the set of pixels of the predicted image, ∩ and ∪ denote intersection and union respectively, and |·| denotes the number of pixels in a set.
In another embodiment, experiments on the ISPRS Vaihingen test set are as follows:
On the ISPRS Vaihingen test set, the segmentation results of this method and the competing deep networks are shown in fig. 4. The input of all networks is only the 256x256 IRRG three-channel color image, and the output is a predicted label map of the same size as the input. Fig. 4 shows, from top to bottom, the IRRG image, the label map, and the segmentation results of FCN8s, DeconvNet, SegNet, U-Net and FRes-MFDNN.
The size and form of the targets in each image differ, and there is some shadow occlusion. For example, the low vegetation and trees in the first and fifth images are densely distributed, large shadowed areas exist in the original images owing to the heights of trees and buildings, and some shadows occlude cars and road surfaces. As can be seen from fig. 4, the segmentation results of FCN8s and DeconvNet are poor; the DeconvNet result differs greatly from the actual label map, details at target edges are blurred, and segmentation is discontinuous inside single targets. Compared with FCN8s, the SegNet network deepens the decoding process and uses the position indices obtained during pooling, so its result is closer to the actual label map, retains target details better, and has fewer misclassified regions than FCN8s and DeconvNet. U-Net copies and fuses the features of the corresponding encoder stages into the decoder, so its result is still closer to the actual label map and the target details are clearer. The network of this method uses complete residual connections in the corresponding encoder and decoder layers and incorporates multi-scale information of the high-level features, so its result is very close to the actual label map, with clearer target details and fewer misclassifications; the method copes to a certain extent with the diversity of target sizes and the shadows in the original images and improves segmentation accuracy.
Fig. 5 shows the quantitative evaluation results corresponding to fig. 4, with bold for the best results and underlining for the second-best results. Precision (P) and recall (R) measure the correctness and completeness of the segmentation respectively; ideally both are high. The method achieves the highest index on every image; moreover, its average precision and average recall are about 3% and 2% higher than the second-best results, respectively. Both the qualitative and quantitative results show that the method is closer to the actual label maps and performs better in urban remote sensing image segmentation.
The evaluation results of each deep network and the present method on the ISPRS Vaihingen test images are shown in figs. 6(a) and 6(b). Although some comparison algorithms obtain better results on individual IOU and F1 measures, the method achieves the best results in the per-class IOU and F1 values and in the average performance over the whole test set. Specifically, the average IOU of the method is about 6% higher than the second-best result (U-Net) and the average F1 value is about 4% higher, which fully demonstrates the effectiveness of the method for urban remote sensing image segmentation.
A comparison of this method with methods in the literature that currently have good segmentation performance is given in fig. 7. Paisitkriangkrai et al. (2015) propose a CNN+RF segmentation network that combines a CNN, mainly for feature extraction, with a random forest (RF) for classification. Volpi and Tuia (2017) propose a deconvolution network for remote sensing image segmentation, consisting of a symmetric encoder completed by eight convolutional layers and three pooling layers and a decoder in mirror relationship with the encoder, where the encoding and decoding processes are linked by 1x1 convolutional layers. Sherrah (2016) segments the remote sensing image with dilated convolution and smooths the result with a CRF; Maggiori et al. (2016) incorporate a CRF as post-processing into the training of a deep network at the end of an encoding-decoding network. Audebert et al. (2017) use a symmetric encoding-decoding network whose encoder consists of convolutional and pooling layers and whose decoder consists of deconvolution and unpooling layers. The above experimental results are taken from the original papers, and the numbers of training samples used by the methods are roughly comparable. The comparison in fig. 7 shows that the segmentation of this method is better than the compared methods in terms of the per-class F1 values and the overall segmentation accuracy.
In order to better verify the segmentation performance of the method, three images of the ISPRS Vaihingen data set without label maps, namely area4, area31 and area35, are used; experiments are carried out with each comparison network, and partial results are shown in fig. 8: from left to right are the original image and the segmentation results of FRes-MFDNN, U-Net, SegNet, FCN8s and DeconvNet.
Without reference label maps, comparison with the original images (first column) shows that the method performs better than the other comparison networks in terms of the accuracy and completeness of segmentation and the smoothness of target boundaries.
In another embodiment, experiments on the Road Detection data set are as follows:
On the Road Detection Dataset, the segmentation results of this method and each deep network are shown in fig. 9. The input of all networks is a 300x300 RGB three-channel image and the output is a prediction map of the same size as the input, where black represents background and white represents road. Fig. 9 shows, from top to bottom, the RGB image, the label map, and the segmentation results of FRes-MFDNN, U-Net, SegNet, DeconvNet and FCN8s.
The first row of fig. 9 shows five images with different spectral information and background complexity; part of the road is occluded by trees and cars. In the fourth image the background is very similar to the roads of the residential area, and the fourth and fifth images also contain clearly trodden dirt roads, all of which add difficulty to the segmentation. As can be seen from fig. 9, the results of FCN8s and DeconvNet differ considerably from the actual label maps, with many misclassified and missed regions and poor continuity of the segmented roads. The SegNet result is similar to the actual label map and its misclassified area is clearly smaller than that of DeconvNet, but misclassification still occurs. The results of U-Net and this method are the most similar to the actual label maps, with misclassifications and omissions clearly reduced compared with the other networks. Compared with U-Net, the detail of this method's results is more complete; where cars and trees occlude the road, the segmented road edges are smoother and the spatial consistency is higher.
Fig. 10 shows the evaluation results corresponding to fig. 9, with bold for the best value and underlining for the second-best value. Although some comparison methods obtain better precision or recall on individual images, this method achieves almost the highest index on every image, and its average precision and average recall are 2% and 3% higher, respectively, than those of the second-best method. Both the qualitative and quantitative results show that the method is closer to the actual label maps in segmenting remote sensing road images and performs better.
Fig. 11 shows the average IOU and average F1 values of each method on the Road Detection test set. The average IOU and average F1 value of this method are clearly higher than those of the other methods: the average IOU is 4% higher than that of the second-best U-Net and the average F1 value reaches 93%, fully reflecting the good segmentation performance of the method on this data set.
Fig. 12 compares the results of this method and existing road segmentation studies on the Road Detection data set, including the average IOU, average F1 value, training time (h) and inference time (s/p, seconds per patch) of each method.
In fig. 12, Zhang et al. (2018) propose the Res-unet network comprising a three-layer encoder and a three-layer decoder; the encoding process is completed by convolution and the decoding process by bilinear interpolation, the last feature map of each encoder stage is copied and fused into the corresponding decoder stage, and residual connections are introduced in the encoder and the decoder. Ronneberger et al. (2015) proposed U-Net for medical image segmentation, and many studies now use it for remote sensing image segmentation. Panboonyuen et al. (2017) propose the ELU-SegNet structure, which replaces the ReLU activation function of SegNet with the ELU activation function. Cheng et al. (2014) propose the Cascaded-net structure consisting of a four-layer encoder and a four-layer decoder, where the encoder is completed by convolution and pooling and the decoder by deconvolution and unpooling. Res-unet, ELU-SegNet and Cascaded-net were proposed for road segmentation applications. The results are obtained by running the Road Detection experiments on the same Caffe deep learning platform configured for training the present network. Fig. 12 shows that although the method is somewhat inferior to the Res-unet and U-Net networks in training time and inference time, the difference is small, and the method has an advantage over the other methods in both the average IOU and the average F1 value.
In order to better verify the performance of each comparison network in segmenting remote sensing road images, images over a certain area of Saint Louis, USA, were collected from Google Maps; all are three-channel RGB color images with a spatial resolution of 20 meters. They were fed into each trained network for testing, and partial results are shown in fig. 13: from left to right are the original image and the segmentation results of the present network, U-Net, SegNet, DeconvNet and FCN8s.
Although the collected images differ from the Road Detection data set used to train the networks in background complexity, spectral information and spatial resolution, fig. 13 shows that, compared with the other methods, this method segments the roads better, effectively removes most of the background interference, and overcomes the influence of the differing spatial resolution, which fully demonstrates the robustness of the method in segmenting remote sensing road images.
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, the present invention is not limited to the above-described embodiments and application fields, and the above-described embodiments are illustrative, instructive, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous modifications thereto without departing from the scope of the invention as defined by the appended claims.

Claims (7)

1. A remote sensing image segmentation method combining complete residual and multi-scale feature fusion, comprising the following steps:
S100: improving the convolutional encoding-decoding network that serves as the segmentation backbone, as follows:
S101: adopting a convolutional encoding-decoding network as the segmentation backbone, the network containing two components: an encoder and a decoder;
S102: adding a feature pyramid module for aggregating multi-scale context information to the backbone network;
S103: adding residual units inside the corresponding convolutional layers of the encoder and decoder of the backbone network, and at the same time fusing the features of the encoder into the corresponding decoder layers by pixel-wise addition;
S200: segmenting the remote sensing image with the improved image segmentation network combining complete residual and multi-scale feature fusion;
S300: outputting a segmentation result of the remote sensing image;
The feature pyramid module specifically comprises:
Extracting context information under different scales from a feature map of a fifth convolution stage in an encoder by using convolution kernels of 3x3 and 5x5 respectively, and gradually integrating to obtain multi-scale features;
convolving the feature map of the fifth convolution stage with a 1x1 convolution and multiplying it pixel-wise with the multi-scale features;
fusing global pooling information;
The encoder comprises 13 convolutional layers, the 13 convolutional layers of the encoder are divided into five convolution stages, the first convolution stage and the second convolution stage respectively comprise two convolutional layers, and the third convolution stage, the fourth convolution stage and the fifth convolution stage respectively comprise three convolutional layers;
The gradual integration aggregates the multi-scale information by step-by-step pixel-wise addition.
2. The method of claim 1, wherein the encoder in step S101 comprises 13 convolutional layers and 5 pooling layers, and a decoder is stacked on top of the encoder, the decoder being in a full mirror image relationship with the encoder and comprising 13 convolutional layers and 5 unpooling layers.
3. The method of claim 1, wherein a batch normalization unit and a rectified linear unit are included after each convolutional layer, wherein the batch normalization unit normalizes the extracted feature data and the rectified linear unit is used for adding a nonlinear factor; one pooling layer is included after each convolution stage.
4. The method of claim 2, wherein the pooling operation in the encoder employs max pooling and preserves the max-pooling index positions.
5. The method of claim 1, wherein the feature pyramid module is used before the features of the fifth convolution stage in the encoder are fused into the corresponding layers of the decoder.
6. The method according to claim 1, wherein said fusing the features of the encoder into the corresponding layers of the decoder by pixel-wise addition in step S103 specifically comprises:
selecting only the last convolution feature map in the first and second convolution stages of the encoder, and selecting all convolution feature maps for pixel-wise addition fusion in the third, fourth and fifth convolution stages of the encoder.
7. The method of claim 1, wherein the residual learned by the residual unit is learned through a series of operation units including a convolution layer, a batch normalization unit and a rectified linear unit; the convolution layer is used for extracting features, the batch normalization unit is used for normalizing the extracted feature data, and the rectified linear unit is used for adding a nonlinear factor.
CN201811306585.4A 2018-11-05 2018-11-05 Remote sensing image segmentation method combining complete residual error and feature fusion Active CN109447994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811306585.4A CN109447994B (en) 2018-11-05 2018-11-05 Remote sensing image segmentation method combining complete residual error and feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811306585.4A CN109447994B (en) 2018-11-05 2018-11-05 Remote sensing image segmentation method combining complete residual error and feature fusion

Publications (2)

Publication Number Publication Date
CN109447994A CN109447994A (en) 2019-03-08
CN109447994B true CN109447994B (en) 2019-12-17

Family

ID=65550465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811306585.4A Active CN109447994B (en) 2018-11-05 2018-11-05 Remote sensing image segmentation method combining complete residual error and feature fusion

Country Status (1)

Country Link
CN (1) CN109447994B (en)

Families Citing this family (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084297B (en) * 2019-04-23 2023-09-15 东华大学 Image semantic alignment system for small samples
CN110378194A (en) * 2019-05-10 2019-10-25 中国石油大学(华东) Human motion recognition method based on fine spatial network
CN110197505B (en) * 2019-05-30 2022-12-02 西安电子科技大学 Remote sensing image binocular stereo matching method based on depth network and semantic information
CN110276344B (en) * 2019-06-04 2023-11-24 腾讯科技(深圳)有限公司 Image segmentation method, image recognition method and related device
CN110232671B (en) * 2019-06-19 2023-05-16 重庆米弘科技有限公司 Image visual effect enhancement method based on image tonality
CN110472634B (en) * 2019-07-03 2023-03-14 中国民航大学 Change detection method based on multi-scale depth feature difference fusion network
CN110378913B (en) * 2019-07-18 2023-04-11 深圳先进技术研究院 Image segmentation method, device, equipment and storage medium
CN110490205B (en) * 2019-07-23 2021-10-12 浙江科技学院 Road scene semantic segmentation method based on full-residual-error hole convolutional neural network
CN110599495B (en) * 2019-07-26 2022-08-16 山东大学 Image segmentation method based on semantic information mining
CN110458849B (en) * 2019-07-26 2023-04-25 山东大学 Image segmentation method based on feature correction
CN110781756A (en) * 2019-09-29 2020-02-11 北京化工大学 Urban road extraction method and device based on remote sensing image
CN110689083B (en) * 2019-09-30 2022-04-12 苏州大学 Context pyramid fusion network and image segmentation method
CN110781776B (en) * 2019-10-10 2022-07-05 湖北工业大学 Road extraction method based on prediction and residual refinement network
CN110781773B (en) * 2019-10-10 2021-05-18 湖北工业大学 Road extraction method based on residual error neural network
CN110807376A (en) * 2019-10-17 2020-02-18 北京化工大学 Method and device for extracting urban road based on remote sensing image
CN110852330A (en) * 2019-10-23 2020-02-28 天津大学 Behavior identification method based on single stage
CN111127472B (en) * 2019-10-30 2021-09-14 武汉大学 Multi-scale image segmentation method based on weight learning
CN111104962B (en) * 2019-11-05 2023-04-18 北京航空航天大学青岛研究院 Semantic segmentation method and device for image, electronic equipment and readable storage medium
CN111047551B (en) * 2019-11-06 2023-10-31 北京科技大学 Remote sensing image change detection method and system based on U-net improved algorithm
CN111192245B (en) * 2019-12-26 2023-04-07 河南工业大学 Brain tumor segmentation network and method based on U-Net network
CN111369581B (en) * 2020-02-18 2023-08-08 Oppo广东移动通信有限公司 Image processing method, device, equipment and storage medium
CN111325750B (en) * 2020-02-25 2022-08-16 西安交通大学 Medical image segmentation method based on multi-scale fusion U-shaped chain neural network
CN111460936A (en) * 2020-03-18 2020-07-28 中国地质大学(武汉) Remote sensing image building extraction method, system and electronic equipment based on U-Net network
CN111553391A (en) * 2020-04-09 2020-08-18 东南大学 Feature fusion method in semantic segmentation technology
CN111461130B (en) * 2020-04-10 2021-02-09 视研智能科技(广州)有限公司 High-precision image semantic segmentation algorithm model and segmentation method
CN111488884A (en) * 2020-04-28 2020-08-04 东南大学 Real-time semantic segmentation method with low calculation amount and high feature fusion
CN111739028A (en) * 2020-05-26 2020-10-02 华南理工大学 Nail region image acquisition method, system, computing device and storage medium
CN111986099B (en) * 2020-06-30 2022-05-13 武汉大学 Tillage monitoring method and system based on convolutional neural network with residual error correction fused
CN113971427B (en) * 2020-07-23 2023-08-18 四川大学 Improved model-based rock debris identification method
CN112085017B (en) * 2020-08-04 2023-11-21 中南民族大学 Tea leaf tender shoot image segmentation method based on significance detection and Grabcut algorithm
CN112037226A (en) * 2020-08-27 2020-12-04 海南大学 Satellite image road segmentation method and system based on deep learning
CN112308863B (en) * 2020-10-27 2023-06-06 苏州大学 OCT (optical coherence tomography) image myopic macular lesion segmentation method based on improved U-shaped network
CN112347926B (en) * 2020-11-06 2023-05-23 天津市勘察设计院集团有限公司 High-resolution image city village detection method based on building morphology distribution
CN112115926B (en) * 2020-11-18 2021-04-27 浙江大华技术股份有限公司 Building object block model construction method based on remote sensing image and related equipment
CN112580649B (en) * 2020-12-15 2022-08-02 重庆邮电大学 Semantic segmentation method based on regional context relation module
CN112465815B (en) * 2020-12-17 2023-09-19 杭州电子科技大学 Remote sensing target significance detection method based on edge main body fusion information
CN112767258B (en) * 2020-12-18 2023-10-31 闽江学院 End-to-end image sand storm removing method
CN113379791A (en) * 2020-12-31 2021-09-10 珠海大横琴科技发展有限公司 Method and device for motion segmentation in image
CN113034413B (en) * 2021-03-22 2024-03-05 西安邮电大学 Low-illumination image enhancement method based on multi-scale fusion residual error coder-decoder
CN113240677B (en) * 2021-05-06 2022-08-02 浙江医院 Retina optic disc segmentation method based on deep learning
CN113392840B (en) * 2021-05-20 2023-07-25 大连大学 Real-time semantic segmentation method based on multi-scale segmentation fusion
CN113449594B (en) * 2021-05-25 2022-11-11 湖南省国土资源规划院 Multilayer network combined remote sensing image ground semantic segmentation and area calculation method
CN113298818B (en) * 2021-07-09 2023-08-18 大连大学 Remote sensing image building segmentation method based on attention mechanism and multi-scale features
CN113780296B (en) * 2021-09-13 2024-02-02 山东大学 Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN113642535B (en) * 2021-10-13 2022-01-25 聊城高新生物技术有限公司 Biological branch detection method and device and electronic equipment
CN115578567A (en) * 2022-12-07 2023-01-06 北京矩视智能科技有限公司 Surface defect area segmentation method and device and electronic equipment
CN116188492A (en) * 2023-02-21 2023-05-30 北京长木谷医疗科技有限公司 Hip joint segmentation method, device, electronic equipment and computer readable storage medium
CN116152197A (en) * 2023-02-21 2023-05-23 北京长木谷医疗科技有限公司 Knee joint segmentation method, knee joint segmentation device, electronic equipment and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10635927B2 (en) * 2017-03-06 2020-04-28 Honda Motor Co., Ltd. Systems for performing semantic segmentation and methods thereof
CN107644426A (en) * 2017-10-12 2018-01-30 中国科学技术大学 Image, semantic dividing method based on pyramid pond encoding and decoding structure
CN107945185B (en) * 2017-11-29 2020-02-07 北京工商大学 Image segmentation method and system based on wide residual pyramid pooling network

Also Published As

Publication number Publication date
CN109447994A (en) 2019-03-08


Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant