CN115035418A - Remote sensing image semantic segmentation method and system based on improved DeepLabV3+ network - Google Patents

Remote sensing image semantic segmentation method and system based on improved DeepLabV3+ network

Info

Publication number
CN115035418A
Authority
CN
China
Prior art keywords
data
model
training
semantic segmentation
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210677113.XA
Other languages
Chinese (zh)
Inventor
白根宝
徐欣
姚英彪
杨阿锋
刘晴
姜显扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202210677113.XA
Publication of CN115035418A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/182Network patterns, e.g. roads or rivers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Remote Sensing (AREA)
  • Astronomy & Astrophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image semantic segmentation method and system based on an improved DeepLabV3+ network. The method comprises the following steps: S1, acquiring a remote sensing road data set and preprocessing it, the data in the data set being divided into training data, verification data and test data; S2, building an improved DeepLabV3+ semantic segmentation network model based on a PyTorch environment; S3, training the improved DeepLabV3+ semantic segmentation network model with the training data and verification data obtained in step S1; and S4, inputting the test data obtained in step S1 into the improved DeepLabV3+ semantic segmentation network model of step S3 to obtain the semantic segmentation result of the remote sensing road image. Compared with methods based on the traditional DeepLabV3+ network model, the method adopts the R-Drop regularization method, which regularizes, for each data sample during training, the outputs of the two sub-models randomly sampled by dropout.

Description

Remote sensing image semantic segmentation method and system based on improved DeepLabV3+ network
Technical Field
The invention belongs to the technical field of remote sensing image segmentation, relates to a remote sensing image segmentation method, and particularly relates to a remote sensing image semantic segmentation method and system based on an improved DeepLabV3+ network.
Background
Remote sensing image segmentation predicts a label for every pixel in a remote sensing image; it is a pixel-level classification algorithm that can be widely applied in scenarios such as land planning, environment monitoring and disaster assessment, and therefore has great application value. Traditional image segmentation methods mainly rely on manually designed classifiers built on low-level image features such as color and texture to segment the image, and then attach semantics to the segmented regions; examples include pixel-level clustering segmentation, pixel-level threshold segmentation and pixel-level decision-tree classification. These algorithms meet the requirements of image segmentation to a certain extent, but they demand carefully hand-crafted feature extractors, generalize poorly across data sets, and are difficult to apply at scale to general scenes with complex backgrounds.
The rapid development of computer hardware in recent years, especially the growth of GPU computing power, has greatly advanced artificial intelligence and provided strong momentum for computer vision. Semantic segmentation is a basic task in computer vision; relying on the computing power of GPUs, deep-learning-based image segmentation methods can rapidly segment remote sensing images and accurately extract useful information. Semantic segmentation architectures take different forms but can be understood overall as encoder-decoder networks. The encoder is usually a pre-trained classification network such as ResNet that extracts image features; the decoder maps the discriminative features from semantic space back to pixel space, producing the dense classification that semantic segmentation requires.
DeepLabV3+ is one of the better-performing network models for semantic segmentation. It allows the encoder to arbitrarily control the resolution of the extracted features (via atrous convolution), striking a balance between efficiency and accuracy; the MobileNetV2 network model is applied to the semantic segmentation task, and depthwise separable convolutions are used in the decoding module, which improves the execution efficiency of the encode-decode pipeline. The DeepLabV3+ network adopts the Dropout method to avoid overfitting during training: some neurons are randomly ignored during training, the contribution of an ignored neuron to downstream neurons temporarily disappears in the forward pass, and that neuron receives no weight update in the backward pass.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a remote sensing image semantic segmentation method and system based on an improved DeepLabV3+ network. The invention replaces the Dropout method used in the original DeepLabV3+ network with the R-Drop regularization method; R-Drop further regularizes the model space beyond what Dropout achieves and can further improve the generalization ability of the model, so that effective segmentation of remote sensing urban road images can be accomplished.
The technical scheme adopted by the invention is as follows:
a remote sensing image semantic segmentation method based on an improved DeepLabV3+ network comprises the following steps:
s1, acquiring a remote sensing road data set and preprocessing the remote sensing road data set;
s2, building an improved DeepLabV3+ semantic segmentation network based on a Pytrch environment;
s3, training the improved DeepLabV3+ semantic segmentation network model by using the training data and the verification data obtained in the step S1;
and S4, inputting the test data obtained in step S1 into the improved DeepLabV3+ semantic segmentation network model trained in step S3 to obtain the semantic segmentation result of the remote sensing road image.
Further, the step S1 specifically includes the following steps:
s11, downloading or self-making a remote sensing data set from an open source data set website;
s12, respectively placing the image file and the label file which are originally placed in one folder into different folders;
s13, randomly dividing data in the data set into training data, verification data and test data according to the ratio of 2:1:1, and storing the divided file name list files under the path where the project is located, wherein the divided file name list files are respectively train.txt, val.txt and test.txt.
Further, the step S2 specifically includes the following steps:
s21, improving a DeepLabV3+ semantic segmentation network model and dividing the improved DeepLabV3+ semantic segmentation network model into an encoder module and a decoder module;
s22, in an encoder module, extracting shallow features and deep features of the remote sensing image by using MobileNet V2 as a main network;
and S23, performing further feature extraction on the deep features obtained in step S22 by adopting a spatial pyramid pooling module (also called the ASPP module, ASPP being the abbreviation of Atrous Spatial Pyramid Pooling). The spatial pyramid pooling module consists of a 1 × 1 convolution, three dilated convolutions with dilation rates of 6, 12 and 18 respectively, and an Image Pooling (global average pooling) module; the three dilated convolutions capture receptive fields of different scales and thus feature information of different scales, while the global average pooling and the 1 × 1 convolution layer extract features;
s24, stacking the feature layers with different receptive fields obtained in the step S23 by using a continate feature fusion method, wherein the number of input channels is 5 times of the number of original input channels, and reducing the number of the channels to the original value by using a 1 multiplied by 1 convolution layer to obtain deep features;
s25, adjusting the number of channels of the shallow feature obtained in the step S22 by adopting 1 × 1 convolution in the decoder module, and then performing concatemate feature fusion with the result obtained in the step S24 after 4 times of upsampling on the deep feature layer;
s26, thinning the feature fusion result obtained in the step S25 by adopting two 3 x 3 convolutional layers, and then performing four-time upsampling to obtain a segmentation prediction graph.
Further, the step S3 specifically includes the following steps:
s31, setting initial parameters of the training model as follows:
initial learning rate: 0.014;
weight decay: 0.0005;
momentum: 0.9;
the batch size is determined according to the GPU memory size of the server actually used for training;
s32, in the training process, an R-Drop regularization method is adopted, namely: in each small training batch, each data sample undergoes two forward passes, each of which is implemented by a different submodel by randomly deleting some hidden units.
The specific process is as follows. Given the training data $\{(x_i, y_i)\}_{i=1}^{n}$, the goal of training is to learn a model $P_w(y_i \mid x_i)$, where $n$ is the number of training samples and $(x_i, y_i)$ is a labeled data pair, $x_i$ being the input data and $y_i$ its label. The loss of each sample is the cross entropy:

$$L_i = -\log P_w(y_i \mid x_i)$$

Under the R-Drop regularization method, each sample can be considered to have passed through two slightly different models, denoted respectively as $P_w^{(1)}(y_i \mid x_i)$ and $P_w^{(2)}(y_i \mid x_i)$. The final loss of the model is divided into two parts. One part is the conventional cross entropy:

$$L_i^{CE} = -\log P_w^{(1)}(y_i \mid x_i) - \log P_w^{(2)}(y_i \mid x_i)$$

The other part is the symmetric KL divergence between the two models, whose effect is to make the outputs of the two sub-models obtained through different Dropout masks as consistent as possible:

$$L_i^{KL} = \frac{1}{2}\Big[D_{\mathrm{KL}}\big(P_w^{(1)}(y_i \mid x_i)\,\big\|\,P_w^{(2)}(y_i \mid x_i)\big) + D_{\mathrm{KL}}\big(P_w^{(2)}(y_i \mid x_i)\,\big\|\,P_w^{(1)}(y_i \mid x_i)\big)\Big]$$

The final loss of the network model is the weighted sum of the two losses:

$$L_i^{final} = L_i^{CE} + \alpha L_i^{KL}$$
where α is the weight of the auxiliary loss and is set to 1, and the basic loss function is the cross entropy;
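A PyTorch sketch of this loss is given below, assuming a segmentation model whose Dropout layers are active in training mode; the `batchmean` KL reduction is an illustrative choice rather than a detail fixed by the text.

```python
import torch.nn.functional as F

def r_drop_loss(model, x, y, alpha=1.0):
    """R-Drop loss of step S32: two stochastic forward passes through the
    same network sample two Dropout sub-models; their cross entropies are
    summed and their symmetric KL divergence is added with weight alpha."""
    logits1 = model(x)  # first forward pass (one random dropout mask)
    logits2 = model(x)  # second forward pass (a different dropout mask)
    ce = F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)
    p1 = F.log_softmax(logits1, dim=1)
    p2 = F.log_softmax(logits2, dim=1)
    kl = 0.5 * (F.kl_div(p1, p2, log_target=True, reduction="batchmean")
                + F.kl_div(p2, p1, log_target=True, reduction="batchmean"))
    return ce + alpha * kl
```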
s33, calculating a gradient according to the loss function obtained in the step S32, and updating a weight value and a bias value of the neural network by adopting a random gradient descent method as an optimizer;
s34, Pixel Accuracy (PA) and average Intersection over Union (MIoU) are introduced to evaluate the performance of the model, wherein the PA represents the proportion of the number of correct pixels of the prediction category to the total number of pixels, the MIoU represents the precision of the network model for segmenting the image, and the higher the MIoU value is, the better the image segmentation effect is. The calculation method comprises the following steps:
$$PA = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MIoU = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i + FN_i}$$
In the above formulas, TP (true positive) means the prediction is correct, i.e. both the predicted and the actual class are positive; FP (false positive) means the model predicts a positive class that is actually negative; FN (false negative) means the model predicts a negative class that is actually positive; TN (true negative) means the prediction is correct, i.e. both the predicted and the actual class are negative; N is the number of classes and the subscript i denotes the i-th class;
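Both metrics can be computed from a class confusion matrix, as in the following NumPy sketch (the helper names are illustrative, not part of the invention):

```python
import numpy as np

def confusion_matrix(pred, label, num_classes):
    """Accumulate an N x N confusion matrix from flattened class indices."""
    mask = (label >= 0) & (label < num_classes)
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(
        num_classes, num_classes)

def pa_miou(cm):
    """PA = correctly classified pixels over all pixels (matrix trace over
    its sum); IoU_i = TP_i / (TP_i + FP_i + FN_i), averaged over classes."""
    pa = np.diag(cm).sum() / cm.sum()
    iou = np.diag(cm) / (cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm))
    return pa, np.nanmean(iou)  # nanmean skips classes absent from the data
```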
s35, the training process of the steps S32-S24 is repeated, after each round of training is finished, the network model is evaluated by using a verification data set, the model is stored according to the MIoU optimal result, the training is stopped until the iteration number reaches a set value, and the trained model is stored.
Further, the step S4 specifically includes the following steps:
s41, loading the model trained in the step S3, and reading in the test picture and the label of the test data obtained in the step S1;
and S42, computing the metric scores and saving the test results.
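Steps S41-S42 then reduce to loading the best checkpoint and scoring the test set, as in this sketch (it reuses the metric helpers sketched above; the checkpoint name is an assumption):

```python
import numpy as np
import torch

@torch.no_grad()
def test(model, test_loader, num_classes, device="cuda"):
    """Test stage of step S4: load the saved model, predict, and score."""
    model.load_state_dict(torch.load("best_model.pth", map_location=device))
    model.to(device).eval()  # Dropout is disabled at inference time
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for x, y in test_loader:
        pred = model(x.to(device)).argmax(dim=1).cpu().numpy().ravel()
        cm += confusion_matrix(pred, y.numpy().ravel(), num_classes)
    pa, miou = pa_miou(cm)
    print(f"PA: {100 * pa:.4f}%  MIoU: {100 * miou:.4f}%")
```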
The invention also discloses a remote sensing image semantic segmentation system based on the improved DeepLabV3+ network, which comprises the following modules:
a data classification module: acquiring a remote sensing road data set and preprocessing the data set, wherein the data in the data set is divided into training data, verification data and test data;
a model building module: constructing an improved DeepLabV3+ semantic segmentation network model based on a PyTorch environment;
a training module: training the improved DeepLabV3+ semantic segmentation network model by using training data and verification data obtained by the data classification module;
a segmentation result obtaining module: and inputting the test data obtained by the data classification module into an improved DeepLabV3+ semantic segmentation network model of the training module to obtain a semantic segmentation result of the remote sensing road image.
Compared with the prior art, the invention has the following beneficial effects:
compared with a traditional DeepLabV3+ network model-based method, the method for semantically segmenting the remote sensing image based on the improved DeepLabV3+ network can regularize the output of two submodels randomly extracted from dropouts by each data sample in training by adopting an R-Drop regularization method, can reduce the degree of freedom of network model parameters, can relieve inconsistency between the training and reasoning stages, and enhances generalization capability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of the remote sensing image semantic segmentation method based on an improved DeepLabV3+ model according to embodiment 1 of the present invention.
FIG. 2 is a schematic diagram of the R-Drop regularization method provided in embodiment 1 of the present invention.
Fig. 3 is a semantic segmentation result diagram of a remote sensing road image provided in embodiment 1 of the present invention.
Fig. 4 is a block diagram of the remote sensing image semantic segmentation system based on an improved DeepLabV3+ network according to embodiment 2 of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Example 1
As shown in fig. 1, this embodiment provides a remote sensing image semantic segmentation method based on an improved DeepLabV3+ model, which specifically includes the following steps:
s1, acquiring a remote sensing road data set and preprocessing the remote sensing road data set. In the embodiment, a DeepGlobe Road Extraction Dataset downloaded from an open source Dataset website kaggle.com is used, and 2000 remote sensing Road RGB satellite images with the size of 1024 × 1024 are randomly selected from the Dataset and are randomly divided into training data, verification data and test data according to the ratio of 2:1: 1.
And S2, building the improved DeepLabV3+ semantic segmentation network based on the PyTorch environment. In this embodiment, MobileNetV2 is selected as the backbone of the DeepLabV3+ semantic segmentation network to extract shallow features and deep features. The deep features are fed into the ASPP module, whose five parallel branches (a 1 × 1 convolution, three atrous convolutions and global average pooling) produce multi-scale feature layers with different receptive fields; after concatenate stacking, a 1 × 1 convolution reduces the number of channels back to the original value to obtain the deep features, which are passed to the decoder module. In the decoder module of the network model, the number of channels of the shallow features from the encoder module is adjusted, the deep features are upsampled 4× and concatenated with them, and the stacked result then passes through two 3 × 3 depthwise separable convolutions and a further 4× upsampling to restore the original image size, yielding the predicted segmentation map of the remote sensing image.
And S3, training the improved DeepLabV3+ semantic segmentation network model by using the training data and the verification data obtained in step S1. To verify the feasibility of the designed network and its road recognition effect in complex environments, the network was implemented and trained; the specific experimental environment and configuration are shown in Table 1:
TABLE 1 Experimental Environment and configuration
[The contents of Table 1 appear only as an image in the original publication and are not reproduced here.]
The initial parameters of the training model are set as shown in Table 2:
table 2 initial parameter settings
Parameter               Value
initial learning rate   0.014
weight decay            0.0005
momentum                0.9
After the parameters are set, training can begin. In the training process, the R-Drop regularization method replaces the Dropout method used in the original DeepLabV3+ network: in each mini-batch, each data sample undergoes two forward passes, each realized by a different sub-model obtained by randomly dropping some hidden units. A schematic diagram of the R-Drop regularization method is shown in FIG. 2.
The specific process is as follows. Given the training data $\{(x_i, y_i)\}_{i=1}^{n}$, the goal of training is to learn a model $P_w(y_i \mid x_i)$, where $n$ is the number of training samples and $(x_i, y_i)$ is a labeled data pair, $x_i$ being the input data and $y_i$ its label. The loss of each sample is the cross entropy:

$$L_i = -\log P_w(y_i \mid x_i)$$

Under the R-Drop regularization method, each sample can be considered to have passed through two slightly different models, denoted respectively as $P_w^{(1)}(y_i \mid x_i)$ and $P_w^{(2)}(y_i \mid x_i)$. The final loss of the model is divided into two parts. One part is the conventional cross entropy:

$$L_i^{CE} = -\log P_w^{(1)}(y_i \mid x_i) - \log P_w^{(2)}(y_i \mid x_i)$$

The other part is the symmetric KL divergence between the two models, whose effect is to make the outputs of the two sub-models obtained through different Dropout masks as consistent as possible:

$$L_i^{KL} = \frac{1}{2}\Big[D_{\mathrm{KL}}\big(P_w^{(1)}(y_i \mid x_i)\,\big\|\,P_w^{(2)}(y_i \mid x_i)\big) + D_{\mathrm{KL}}\big(P_w^{(2)}(y_i \mid x_i)\,\big\|\,P_w^{(1)}(y_i \mid x_i)\big)\Big]$$

The final loss of the network model is the weighted sum of the two losses:

$$L_i^{final} = L_i^{CE} + \alpha L_i^{KL}$$
where α is the weight of the auxiliary loss and is set to 1, and the basic loss function is the cross entropy.
and (2) evaluating the performance of the model by introducing Pixel Accuracy (PA) and average Intersection over Union (MIoU), wherein the PA represents the proportion of the Pixel number with correct prediction category to the total Pixel number, the MIoU represents the precision of the network model for segmenting the image, and the higher the MIoU value is, the better the image segmentation effect is. The calculation method comprises the following steps:
$$PA = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MIoU = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i + FN_i}$$
In the formulas, TP (true positive) means the prediction is correct, i.e. both the predicted and the actual class are positive; FP (false positive) means the model predicts a positive class that is actually negative; FN (false negative) means the model predicts a negative class that is actually positive; TN (true negative) means the prediction is correct, i.e. both the predicted and the actual class are negative; N is the number of classes and the subscript i denotes the i-th class.
In the training stage, stochastic gradient descent (SGD) is used as the optimizer to compute the updated weights and biases of the convolutional neural network; after each round of training, the network model is evaluated on the verification data set and the checkpoint with the best MIoU is saved; training stops after 300 iterations, and the trained model is saved.
And S4, inputting the test data obtained in step S1 into the trained improved DeepLabV3+ semantic segmentation network model to obtain the semantic segmentation result of the remote sensing road image; a result graph is shown in FIG. 3.
In addition to the experiments on the improved DeepLabV3+ semantic segmentation network model, the original DeepLabV3+ algorithm was also trained into a corresponding model on the selected remote sensing road data set and compared with the algorithm of the invention. The performance of the two algorithms on the remote sensing road data set is shown in Table 3:
TABLE 3 comparison of the Performance of the two models on a set of remote sensing road data
Model           PA (%)    MIoU (%)
DeepLabV3+ 97.3721 73.8854
The invention 97.6744 76.8213
The remote sensing semantic segmentation method based on the improved DeepLabV3+ network improves the pixel accuracy and gains nearly 3 percentage points in mean intersection-over-union; its segmentation effect on images is clearly better than that of the original DeepLabV3+ algorithm.
Example 2
As shown in fig. 4, this embodiment discloses a remote sensing image semantic segmentation system based on an improved DeepLabV3+ network, which includes the following modules:
a data classification module: acquiring a remote sensing road data set and preprocessing the data set, wherein the data in the data set are divided into training data, verification data and test data;
a model building module: constructing an improved DeepLabV3+ semantic segmentation network model based on a PyTorch environment;
a training module: training the improved DeepLabV3+ semantic segmentation network model by using training data and verification data obtained by the data classification module;
a segmentation result obtaining module: and inputting the test data obtained by the data classification module into an improved DeepLabV3+ semantic segmentation network model of the training module to obtain a semantic segmentation result of the remote sensing road image.
Other contents of this embodiment can refer to embodiment 1.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention. However, any simple modification, equivalent change and modification made to the above embodiments according to the technical essence of the present invention still belong to the protection scope of the technical solution of the present invention.

Claims (6)

1. A remote sensing image semantic segmentation method based on an improved DeepLabV3+ network is characterized by comprising the following steps:
s1, acquiring a remote sensing road data set and preprocessing the data set, wherein the data in the data set is divided into training data, verification data and test data;
s2, building an improved DeepLabV3+ semantic segmentation network model based on a Pytrch environment;
s3, training the improved DeepLabV3+ semantic segmentation network model by using the training data and the verification data obtained in the step S1;
and S4, inputting the test data obtained in the step S1 into the improved DeepLabV3+ semantic segmentation network model in the step S3 to obtain a semantic segmentation result of the remote sensing road image.
2. The remote sensing image semantic segmentation method based on the improved DeepLabV3+ network according to claim 1, wherein the step S1 specifically comprises the following steps:
s11, downloading or self-making a remote sensing image data set from an open source data set website;
s12, respectively placing the image file and the label file which are originally placed in one folder into different folders;
s13, randomly dividing data in the data set into training data, verification data and test data according to the ratio of 2:1:1, and storing the divided file name list files under the path of the project, wherein the divided file name list files are respectively train.txt, val.txt and test.txt.
3. The remote sensing image semantic segmentation method based on the improved DeepLabV3+ network as claimed in claim 2, wherein the step S2 specifically comprises the following steps:
s21, improving a DeepLabV3+ semantic segmentation network model to be divided into an encoder module and a decoder module;
s22, in the encoder module, extracting shallow features and deep features of the remote sensing image by using MobileNet V2 as a main network;
s23, further extracting the deep features obtained in the step S21 by adopting a spatial pyramid pooling module; the spatial pyramid pooling module consists of a 1 × 1 convolution, three expansion convolutions with expansion rates of 6, 12 and 18 respectively and an imageposing module, the three expansion convolutions are used for capturing the receptive field information of different scales and capturing the characteristic information of different scales, and the global average pooling and the 1 × 1 convolution layer are used for extracting characteristics;
s24, stacking the feature layers with different receptive fields obtained in the step S23 by using a concatenate feature fusion method, wherein the number of input channels is 5 times of the number of original input channels, and reducing the number of channels to the original value by using a 1 multiplied by 1 convolution layer to obtain deep features;
s25, adjusting the number of channels of the shallow feature obtained in the step S22 by adopting 1 × 1 convolution in the decoder module, and then performing concatemate feature fusion with the result obtained in the step S24 after 4 times of upsampling on the deep feature layer;
s26, thinning the feature fusion result obtained in the step S25 by adopting two 3 x 3 convolutional layers, and then performing four-time upsampling to obtain a segmentation prediction graph.
4. The remote sensing image semantic segmentation method based on the improved DeepLabV3+ network according to claim 3, wherein the step S3 specifically comprises the following steps:
s31, setting initial parameters of the training model as follows:
initial learning rate: 0.014;
weight decay: 0.0005;
momentum: 0.9;
s32, in the training process, an R-Drop regularization method is adopted, namely: in each small batch training, each data sample is subjected to two forward transmissions, and each transmission is processed by different submodels through random deletion of some hidden units; the method comprises the following specific steps: the training data is
Figure FDA0003695201610000011
The goal of the training is to learn a model P w (y i |x i ) Where n is the number of training samples, (x) i ,y i ) Is a marked data pair, x i Is input data, y i Is a label, and the loss of each sample is the cross entropy:
L i =-logP w (y i |x i )
in the case of the R-Drop regularization method, the samples are considered to pass through two slightly different models, denoted respectively as
Figure FDA0003695201610000021
And
Figure FDA0003695201610000022
the final loss of the model is divided into two parts, one part is the conventional cross entropy:
Figure FDA0003695201610000023
the other part is the symmetric KL divergence between the two models:
Figure FDA0003695201610000024
the final loss of the network model is the weighted sum of the two losses:
Figure FDA0003695201610000025
where α is the weight of the auxiliary loss and is set to 1, and the basic loss function is the cross entropy;
s33, calculating a gradient according to the loss function obtained in the step S32, and updating a weight value and a bias value of the neural network by adopting a random gradient descent method as an optimizer;
s34, introducing a pixel accuracy rate PA and an average cross-over ratio MIoU to evaluate the performance of the model, wherein the PA represents the proportion of the number of pixels with correct prediction categories to the total number of pixels, the MIoU represents the image segmentation precision of the network model, and the higher the MIoU value is, the better the image segmentation effect is; the calculation method comprises the following steps:
$$PA = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MIoU = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i + FN_i}$$
in the formulas, TP means the prediction is correct, i.e. both the predicted and the actual class are positive; FP means the model predicts a positive class that is actually negative; FN means the model predicts a negative class that is actually positive; TN means the prediction is correct, i.e. both the predicted and the actual class are negative; N is the number of classes and the subscript i denotes the i-th class;
s35, the training process of the steps S32-S34 is repeated, after each round of training is finished, the network model is evaluated by using verification data, the model is stored according to the MIoU optimal result, the training is stopped until the iteration number reaches a set value, and the trained model is stored.
5. The remote sensing image semantic segmentation method based on the improved DeepLabV3+ network according to claim 4, wherein the step S4 specifically comprises the following steps:
s41, loading the model trained in the step S3, and reading in the test picture and the label of the test data obtained in the step S1;
and S42, calculating index scores and storing test results.
6. A remote sensing image semantic segmentation system based on an improved DeepLabV3+ network is characterized by comprising the following modules:
a data classification module: acquiring a remote sensing road data set and preprocessing the data set, wherein the data in the data set is divided into training data, verification data and test data;
a model building module: constructing an improved DeepLabV3+ semantic segmentation network model based on a PyTorch environment;
a training module: training the improved DeepLabV3+ semantic segmentation network model by using training data and verification data obtained by the data classification module;
a segmentation result obtaining module: and inputting the test data obtained by the data classification module into an improved DeepLabV3+ semantic segmentation network model of the training module to obtain a semantic segmentation result of the remote sensing road image.
CN202210677113.XA 2022-06-15 2022-06-15 Remote sensing image semantic segmentation method and system based on improved DeepLabV3+ network Pending CN115035418A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210677113.XA CN115035418A (en) 2022-06-15 2022-06-15 Remote sensing image semantic segmentation method and system based on improved DeepLabV3+ network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210677113.XA CN115035418A (en) 2022-06-15 2022-06-15 Remote sensing image semantic segmentation method and system based on improved DeepLabV3+ network

Publications (1)

Publication Number Publication Date
CN115035418A true CN115035418A (en) 2022-09-09

Family

ID=83124046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210677113.XA Pending CN115035418A (en) 2022-06-15 2022-06-15 Remote sensing image semantic segmentation method and system based on improved deep LabV3+ network

Country Status (1)

Country Link
CN (1) CN115035418A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546647A (en) * 2022-10-21 2022-12-30 河北省科学院地理科学研究所 Semantic segmentation model based on remote sensing image
CN115408498A (en) * 2022-11-02 2022-11-29 中孚安全技术有限公司 Data dynamic identification method based on natural language
CN116167991A (en) * 2023-02-15 2023-05-26 中科微至科技股份有限公司 DeepLabv3+ based belt edge line detection method
CN116167991B (en) * 2023-02-15 2023-09-08 中科微至科技股份有限公司 DeepLabv3+ based belt edge line detection method
CN116703834A (en) * 2023-05-22 2023-09-05 浙江大学 Method and device for judging and grading excessive sintering ignition intensity based on machine vision
CN116703834B (en) * 2023-05-22 2024-01-23 浙江大学 Method and device for judging and grading excessive sintering ignition intensity based on machine vision
CN117036982A (en) * 2023-10-07 2023-11-10 山东省国土空间数据和遥感技术研究院(山东省海域动态监视监测中心) Method and device for processing optical satellite image of mariculture area, equipment and medium
CN117036982B (en) * 2023-10-07 2024-01-09 山东省国土空间数据和遥感技术研究院(山东省海域动态监视监测中心) Method and device for processing optical satellite image of mariculture area, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination