CN114972759A - Remote sensing image semantic segmentation method based on hierarchical contour cost function - Google Patents

Remote sensing image semantic segmentation method based on hierarchical contour cost function

Info

Publication number
CN114972759A
CN114972759A (application CN202210675935.4A)
Authority
CN
China
Prior art keywords: contour, network, remote sensing, convolution, layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210675935.4A
Other languages
Chinese (zh)
Inventor
韩振 (Han Zhen)
吕宁 (Lü Ning)
陈晨 (Chen Chen)
原昊 (Yuan Hao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210675935.4A
Publication of CN114972759A

Classifications

    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/34: Smoothing or thinning of the pattern; morphological operations; skeletonisation
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/13: Satellite images


Abstract

The invention discloses a remote sensing image semantic segmentation method based on a hierarchical contour cost function, used for segmenting high-resolution remote sensing images. The implementation steps are: 1. generating a training set; 2. constructing an Inception-v3 U-Net segmentation network; 3. training the Inception-v3 U-Net network; 4. predicting the remote sensing image. The invention constructs the training network Inception-v3 U-Net, which reduces the amount of computation and the number of parameters and improves training efficiency. The invention constructs a hierarchical contour cost function to supervise the network loss, strengthening the model's ability to segment the foreground contour; the contour judgment range is refined by sequentially dilating with a convolution kernel and then subtracting the erosion, which improves the accuracy of contour classification. Meanwhile, the invention assigns pairwise corresponding, mutually complementary hyper-parameters to background and foreground contour levels at the same relative contour distance, so as to achieve accurate segmentation of the contour.

Description

Remote sensing image semantic segmentation method based on hierarchical contour cost function
Technical Field
The invention belongs to the technical field of image processing, and further relates to a remote sensing image semantic segmentation method based on a hierarchical contour cost function within the technical field of image segmentation. The method can be used to segment remote sensing images, and thereby to carry out the remote sensing image-spot interpretation task in automated production and construction projects.
Background
With the rapid development of remote sensing technology comes a large demand for interpreting image data. Interpreting a remote sensing image requires identifying regions or objects by segmenting out objects and regions of different classes and assigning the same label to the same class. At present, engineering practice adopts deep neural network techniques such as U-Net for remote sensing image segmentation: the neural network extracts features of the remote sensing image, and the trained network predicts the class of each pixel, finally yielding a segmentation map with class labels. However, existing methods can only handle relatively simple tasks effectively; because the segmented contour edges are neither clear nor smooth enough, the accuracy of subsequent interpretation suffers, so that in complex scenes the interpretation results still have to be checked and corrected by specialists. In practice, the cost functions mainly used by current semantic segmentation networks cannot adequately improve the accuracy with which the model segments the target contour.
Shandong Fengshi Information Technology Co., Ltd., in the patent document "A remote sensing image segmentation method and system based on edge auxiliary information" (patent application No. 202111094364.7, application publication No. CN113920311A), discloses a remote sensing image segmentation method based on edge auxiliary information. The implementation steps are: first, a remote sensing image is preprocessed to obtain a number of local images; next, the local images are predicted with a remote sensing image segmentation model that uses cross entropy as the segmentation and auxiliary cost function and ResNet as the backbone network, producing per-pixel class predictions; finally, an edge feature map and a body feature map are obtained through upsampling, concatenation and difference operations and fused into a final feature map, thereby improving remote sensing segmentation precision. The method still has shortcomings: because cross entropy is used directly as the cost function during segmentation, the model attends more to the accuracy of background segmentation and neglects the foreground, the boundary and the internal texture cannot be effectively distinguished, and edge processing is rough. Moreover, using ResNet as the backbone network increases the number of training parameters and lowers computational efficiency.
Chen Z., Zhou H., Lai J. et al., in the published article "Boundary Loss: Boundary-Aware Learning for Salient Object Segmentation" (IEEE Transactions on Image Processing, 2021, 30:431-443), propose an image segmentation method using a contour cost function. During training, a pre-trained VGG-16 network first classifies the image to obtain multi-layer feature maps, which are then fed into a decoder composed of residual blocks; a weight matrix that increases the proportion of pixels within the contour range is adopted as the contour cost function, helping to learn the boundary distinction between a salient object and the background. The drawback of this method is that its contour cost function judges the contour only coarsely, so contours of different classes can be judged incorrectly. Moreover, the weight matrices obtained for different classes are identical, different directions of the contour cannot be accurately distinguished, and the treatment of those directions lacks specificity, so contour accuracy is lost.
Disclosure of Invention
The invention aims to provide a remote sensing image semantic segmentation method based on a hierarchical contour cost function that addresses the defects of the prior art: the foreground being neglected when edges are segmented in remote sensing image semantic segmentation, the boundary and the internal texture not being effectively distinguished, the large number of parameters and the large amount of computation, the misjudgment of contours of different classes, and the inaccurate contours of segmented objects.
The idea for achieving this purpose is to construct a U-Net segmentation network with Inception-v3 as its backbone, i.e. an Inception-v3 U-Net that segments the remote sensing image while decomposing large convolutions into small ones to reduce the number of network parameters. The invention constructs a hierarchical contour cost function to supervise the network loss, paying more attention to the accuracy of boundary processing during training and strengthening the model's ability to segment the foreground contour. The contour judgment range is refined by sequentially dilating with a convolution kernel and then subtracting the erosion, so that pixels at different positions and under different labels receive different weight parameters, improving the accuracy of contour classification. The invention assigns pairwise corresponding, mutually complementary hyper-parameters to the contour levels in the background and foreground directions at the same relative contour distance, so that the complementary weights form adversarial learning during training, thereby achieving accurate segmentation of the contour. The network is trained on a dataset of high-resolution remote sensing images to obtain the final remote sensing image semantic segmentation model.
The method comprises the following specific steps:
step 1, generating a training set:
step 1.1, randomly selecting at least 20 high-resolution remote sensing images with a balanced foreground-to-background ratio and their corresponding label images; cropping each high-resolution image and its corresponding label image into patches of 224 × 224 pixels;
step 1.2, selecting the cropped label images whose foreground pixels account for more than 10% of the patch, together with the remote sensing images corresponding to them, to form a training set;
step 2, constructing the Inception-v3 U-Net segmentation network:
step 2.1, constructing 1 convolution module:
building a convolution module formed by connecting a first convolution layer and a second convolution layer in series;
setting the convolution kernel sizes of the first and second convolution layers to 1 × 1, the stride to 1 and the padding to 1;
step 2.2, constructing an up-sampling sub-network:
building, as the decoder, an up-sampling sub-network formed by sequentially connecting a first up-sampling module, a second up-sampling module, a third up-sampling module and a CBR module in series; the first to third up-sampling modules have the same structure, and the structure of each is, in order: a first convolution layer, a second convolution layer, a third convolution layer, a BatchNorm layer, an activation layer and an up-sampling layer;
the convolution kernels of the first to third convolution layers are all set to 3 × 3, the strides are all set to 1, the padding is all set to 1, the negative-part slope of the activation layer is set to 0.2, the activation layer is implemented with the LeakyReLU function, and the up-sampling layer performs 2× nearest-neighbor up-sampling;
the structure of the CBR module is, in order: a convolution layer, a BatchNorm layer, an activation layer; the convolution kernel size of the convolution layer is set to 3 × 3, the stride to 1, the padding to 1, the negative-part slope of the activation layer to 0.2, and the activation layer is implemented with the LeakyReLU function;
step 2.3, using concatenation, connecting the inputs of the three up-sampling modules in the up-sampling sub-network with the outputs of the three Inception modules in the Inception-v3 network, and connecting the output of the input module with the input of the CBR module, to form an Inception-v3 network with a skip-connection structure;
step 2.4, sequentially connecting the Inception-v3 network with the skip-connection structure, the convolution module and the up-sampling sub-network in series to form the Inception-v3 U-Net network;
step 3, training the Inception-v3 U-Net network:
inputting the training set into the Inception-v3 U-Net network, and iteratively updating the parameters of each layer in the network by gradient descent until the total cost function converges, to obtain the trained Inception-v3 U-Net network;
step 4, predicting the remote sensing image:
step 4.1, sequentially cropping all remote sensing images to be predicted into 224 × 224 patches, and numbering the cropped patches;
step 4.2, sequentially inputting the numbered patches into the trained Inception-v3 U-Net network to obtain segmentation results of the cropped remote sensing images;
step 4.3, sequentially stitching the segmentation results of the cropped remote sensing images according to their numbers to obtain the final segmentation result.
Compared with the prior art, the invention has the following advantages:
First, the hierarchical contour cost function supervises the loss of the segmentation network during training, which overcomes the prior-art problems that the foreground is neglected in remote sensing image semantic segmentation and that the accuracy of foreground segmentation is reduced; foreground segmentation accuracy is thus assured, and the boundary and the internal texture can be effectively distinguished.
Second, the invention constructs the training network Inception-v3 U-Net, taking the Inception-v3 model structure as the backbone of the encoder, which overcomes the prior-art problems of large parameter counts and difficult model training; the amount of computation and the number of parameters are greatly reduced, training is more efficient, and the method is easy to popularize and apply.
Third, the invention refines the contour judgment range of the hierarchical contour cost function by sequential dilation and erosion with a 2 × 2 convolution kernel, which overcomes the prior-art problem of misjudging contours of different classes caused by coarse contour judgment; the processing of contour information is refined and the accuracy of the contour classification result is improved.
Fourth, the hyper-parameters assigned in the hierarchical contour cost function to the contour levels in the background and foreground directions at the same relative contour distance correspond pairwise and are mutually complementary, which overcomes the prior-art problem that contours in different directions are not treated specifically; the segmentation result obtained by the method therefore has a more accurate contour.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the Inception-v3 U-Net network of the present invention;
FIG. 3 is a simulation diagram of segmenting a remote sensing image from the Massachusetts Buildings dataset with the present invention.
Detailed Description
The implementation steps of the invention are described in further detail below with reference to FIG. 1 and the embodiment.
Step 1, generating a training set.
Step 1.1: the embodiment of the invention randomly selects, from the Massachusetts Buildings dataset, 20 high-resolution remote sensing images with balanced foreground-to-background ratios, together with their corresponding label images; the resolution of each image is 1500 × 1500 pixels. Each high-resolution image and its corresponding label image are cropped one by one into patches of 224 × 224 pixels.
Step 1.2: the cropped label images whose foreground pixels account for more than 10% of the patch, together with the remote sensing images corresponding to them, are selected to form the training set.
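As a concrete illustration of steps 1.1 and 1.2 only (this sketch is not part of the patent disclosure), the following minimal NumPy code tiles each image/label pair into 224 × 224 patches on a non-overlapping grid (the remainder at the right and bottom edges of a 1500 × 1500 image is discarded) and keeps only patches whose foreground ratio exceeds 10%. It assumes binary {0, 1} label arrays; the function name and interface are illustrative, not taken from the patent.

import numpy as np

def make_training_set(images, labels, tile=224, fg_min=0.10):
    # images[k]: (H, W, 3) array; labels[k]: (H, W) array with values in {0, 1}
    samples = []
    for img, lab in zip(images, labels):
        h, w = lab.shape
        for y in range(0, h - tile + 1, tile):
            for x in range(0, w - tile + 1, tile):
                lab_patch = lab[y:y + tile, x:x + tile]
                # the mean of a {0, 1} mask is exactly the foreground pixel ratio
                if lab_patch.mean() > fg_min:
                    samples.append((img[y:y + tile, x:x + tile], lab_patch))
    return samples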
Step 2, constructing the Inception-v3 U-Net network.
The Inception-v3 U-Net network constructed by the invention is described in further detail with reference to FIG. 2.
Step 2.1, constructing the Inception-v3 network.
The encoder of the invention adopts the prior-art Inception-v3 network, whose structure is formed by connecting 1 input module and 3 Inception modules of the same structure in series. The structure of the input module is, in order: a first convolution layer, a first pooling layer, a first LocalRespNorm layer, a second convolution layer, a third convolution layer, a second LocalRespNorm layer and a second pooling layer.
The convolution kernel size of the first convolution layer is set to 7 × 7, that of the second convolution layer to 1 × 1 and that of the third convolution layer to 3 × 3; the stride of the first convolution layer is set to 2, the strides of the second and third convolution layers to 1, and the padding to 'same'. The first and second pooling layers use max pooling with a 3 × 3 pooling window and a stride of 2.
Step 2.2, constructing 1 convolution module.
A convolution module is built by connecting a first convolution layer and a second convolution layer in series. The convolution kernel sizes of both convolution layers are set to 1 × 1, the stride to 1 and the padding to 1.
Step 2.3, constructing the up-sampling sub-network.
An up-sampling sub-network, used as the decoder, is built by sequentially connecting a first, a second and a third up-sampling module and a CBR module in series. The first to third up-sampling modules have the same structure; the structure of each is, in order: a first convolution layer, a second convolution layer, a third convolution layer, a BatchNorm layer, an activation layer and an up-sampling layer.
The convolution kernel sizes of the first to third convolution layers are all set to 3 × 3, the strides to 1 and the padding to 1. The negative-part slope of the activation layer is set to 0.2, and the activation layer is implemented with the LeakyReLU function. The up-sampling layer performs 2× nearest-neighbor up-sampling.
The structure of the CBR module is, in order: a convolution layer, a BatchNorm layer, an activation layer. The convolution kernel size of the convolution layer is set to 3 × 3, the stride to 1 and the padding to 1. The negative-part slope of the activation layer is set to 0.2, and the activation layer is implemented with the LeakyReLU function.
Step 2.4: using concatenation, the inputs of the three up-sampling modules in the up-sampling sub-network are connected with the outputs of the three Inception modules in the Inception-v3 network, and the output of the input module is connected with the input of the CBR module, forming an Inception-v3 network with a skip-connection structure.
Step 2.5: the Inception-v3 network with the skip-connection structure, the convolution module and the up-sampling sub-network are sequentially connected in series to form the Inception-v3 U-Net network.
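A minimal PyTorch sketch of the architecture just described follows, offered for illustration only. The Inception-v3 backbone is replaced by a stand-in encoder (ToyEncoder) that merely exposes a stem output and three stage outputs at the expected resolutions, since wiring the real torchvision Inception-v3 intermediate features is beyond a sketch; the channel widths, the reuse of the deepest stage output both through the bottleneck and as a skip, and the use of padding 0 for the 1 × 1 bottleneck convolutions (the patent text specifies padding 1) are simplifying assumptions made so the tensor shapes align.

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    # Up-sampling module: three 3x3 convs (stride 1, padding 1), BatchNorm,
    # LeakyReLU with negative slope 0.2, then 2x nearest-neighbor up-sampling.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
            nn.Upsample(scale_factor=2, mode="nearest"),
        )

    def forward(self, x):
        return self.body(x)

class CBR(nn.Module):
    # CBR module: 3x3 conv, BatchNorm, LeakyReLU(0.2).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.body(x)

class ToyEncoder(nn.Module):
    # Stand-in for the Inception-v3 backbone: returns the input-module (stem)
    # feature map and three stage feature maps at 1x, 1/2, 1/4 and 1/8 resolution.
    def __init__(self, ch=(32, 64, 128, 256)):
        super().__init__()
        self.stem = nn.Conv2d(3, ch[0], 7, stride=1, padding=3)
        self.s1 = nn.Conv2d(ch[0], ch[1], 3, stride=2, padding=1)
        self.s2 = nn.Conv2d(ch[1], ch[2], 3, stride=2, padding=1)
        self.s3 = nn.Conv2d(ch[2], ch[3], 3, stride=2, padding=1)

    def forward(self, x):
        f0 = self.stem(x)
        f1 = self.s1(f0)
        f2 = self.s2(f1)
        f3 = self.s3(f2)
        return f0, f1, f2, f3

class InceptionV3UNet(nn.Module):
    # Encoder -> 1x1 bottleneck convolution module -> decoder with skip
    # connections by concatenation, ending in a CBR head, as in FIG. 2.
    def __init__(self, encoder, ch=(32, 64, 128, 256)):
        super().__init__()
        self.encoder = encoder
        self.bottleneck = nn.Sequential(   # padding 0 here; the patent text says 1
            nn.Conv2d(ch[3], ch[3], 1, stride=1, padding=0),
            nn.Conv2d(ch[3], ch[3], 1, stride=1, padding=0),
        )
        self.up3 = UpBlock(ch[3] + ch[3], ch[2])   # skip from the deepest stage
        self.up2 = UpBlock(ch[2] + ch[2], ch[1])
        self.up1 = UpBlock(ch[1] + ch[1], ch[0])
        self.head = CBR(ch[0] + ch[0], 1)          # skip from the stem output

    def forward(self, x):
        f0, f1, f2, f3 = self.encoder(x)
        d = self.up3(torch.cat([self.bottleneck(f3), f3], dim=1))  # 1/8 -> 1/4
        d = self.up2(torch.cat([d, f2], dim=1))                    # 1/4 -> 1/2
        d = self.up1(torch.cat([d, f1], dim=1))                    # 1/2 -> 1/1
        return torch.sigmoid(self.head(torch.cat([d, f0], dim=1)))

net = InceptionV3UNet(ToyEncoder())
print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1, 224, 224])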
Step 3, training the Inception-v3 U-Net network.
The training set is input into the Inception-v3 U-Net network, and the parameters of each layer are iteratively updated by gradient descent until the total cost function converges, yielding the trained Inception-v3 U-Net network.
The total cost function L is as follows:
L = (1/N)·Σ_(j=1..N) L_GCL(j)
where L represents the total cost function, N represents the total number of samples in the training set (N is set to 20 in the embodiment of the invention), Σ(·) represents the summation operation, j is the index of a sample in the training set, and L_GCL(j) represents the hierarchical contour cost function of the j-th sample in the training set.
The hierarchical contour cost function L_GCL(j) is as follows:
L_GCL(j) = -Σ_i M_gc^i·[y_i·log(ŷ_i) + (1 - y_i)·log(1 - ŷ_i)]
where M_gc^i represents the weight of the i-th pixel in the hierarchical contour weight matrix M_gc of the j-th sample in the training set, y_i and ŷ_i represent the ground-truth value and the predicted value of the i-th pixel of the j-th sample respectively, and log(·) denotes the base-2 logarithm.
The hierarchical contour weight matrix M_gc is as follows:
[equation rendered only as an image in the source: M_gc combines, over the levels g = -G, …, G, the Gaussian-smoothed level bands Gauss(Γ_g) weighted by the hyper-parameters K_g, together with the contour-range term δ]
where G represents the number of inward hierarchical levels of the contour matrix, the interior of the target area being taken as the positive direction of the contour matrix and the exterior as the negative direction, and g is the index of a level in the contour matrix. Gauss(·) denotes the Gaussian convolution function and K_g the hyper-parameter of the g-th level contour range [its defining expression is rendered only as an image in the source]; K represents a hyper-parameter controlling the weight of the whole contour, with K ≥ 1. Γ represents the division result of the contour at each level, and δ represents the determination result of the contour range. In the embodiment of the invention, G is 3, the kernel of the Gaussian convolution is set to 2 × 2 with variance 1.5, and K is 4.
The division result Γ of the contour at each level is obtained by the following formula (reconstructed from the stated values of α; the original expression is rendered only as an image in the source):
Γ_g = (Y;S)^(g+1) - (Y;S)^(g) for g > 0;  Γ_0 = (Y;S)^(+1) - (Y;S)^(-1);  Γ_g = (Y;S)^(g) - (Y;S)^(g-1) for g < 0
where Y represents a label image in the training set, S represents the convolution kernel (structuring element) used when dilating or eroding the label image, and (Y;S)^α represents a dilation or erosion of the label image: when α is positive it denotes dilation, and when α is negative it denotes erosion; α is set to g+1, g, +1, -1 and g-1 respectively.
The determination result δ of the contour range is obtained by the following formula:
δ = 255·One - ((Y;S)^(+1) - (Y;S)^(-1))
where One represents an all-ones matrix, whose size is set to 224 × 224 in the embodiment of the invention.
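The expressions for M_gc, K_g and Γ survive only as images in the source, so the sketch below is an assumption rather than the patent's exact formulas: level bands are taken as differences of successive dilations/erosions of the label with a 2 × 2 structuring element, smoothed by a Gaussian with σ = 1.5, and weighted by a pair of level hyper-parameters derived from K; the pairing k_out·k_in = K is invented here to mirror the "pairwise corresponding and complementary" description. The weighted base-2 binary cross-entropy then follows the stated definition of L_GCL. All function names are illustrative.

import numpy as np
from scipy import ndimage

def morph(y, alpha, size=2):
    # (Y;S)^alpha: |alpha| dilations for alpha > 0, |alpha| erosions for alpha < 0
    s = np.ones((size, size), dtype=bool)
    op = ndimage.binary_dilation if alpha > 0 else ndimage.binary_erosion
    return op(y.astype(bool), structure=s, iterations=abs(alpha)).astype(np.float32)

def hierarchical_contour_weights(y, G=3, K=4.0, sigma=1.5):
    # Assumed form of M_gc: a base weight of 1 plus Gaussian-smoothed,
    # K_g-weighted level bands inside (g < 0) and outside (g > 0) the contour.
    M = np.ones_like(y, dtype=np.float32)
    for g in range(1, G + 1):
        outer = morph(y, g + 1) - morph(y, g)      # g-th band outside the object
        inner = morph(y, -g) - morph(y, -(g + 1))  # g-th band inside the object
        k_out = K ** (1.0 - g / G)                 # assumed complementary pair:
        k_in = K ** (g / G)                        # k_out * k_in == K for every g
        M += k_out * ndimage.gaussian_filter(outer, sigma)
        M += k_in * ndimage.gaussian_filter(inner, sigma)
    return M

def gcl_loss(y, y_hat, M, eps=1e-7):
    # Hierarchical contour cost: M-weighted binary cross-entropy with base-2
    # logs, averaged over pixels (the patent sums over the pixels of sample j).
    bce = y * np.log2(y_hat + eps) + (1 - y) * np.log2(1 - y_hat + eps)
    return float(-np.mean(M * bce))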
Step 4, predicting the remote sensing image.
Step 4.1: all remote sensing images to be predicted are sequentially cropped into 224 × 224 patches, and the cropped patches are numbered.
Step 4.2: the numbered patches are sequentially input into the trained Inception-v3 U-Net network to obtain segmentation results for the cropped remote sensing images.
Step 4.3: the segmentation results of the cropped remote sensing images are stitched together in numbered order to obtain the final segmentation result.
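Steps 4.1 to 4.3 amount to tile, predict and stitch. The sketch below, again illustrative rather than part of the disclosure, assumes a network like the one sketched after step 2.5 that maps a (1, 3, 224, 224) tensor to per-pixel foreground probabilities, and an input whose sides are multiples of 224 (pad beforehand otherwise).

import numpy as np
import torch

def predict_tiled(net, image, tile=224):
    # image: (H, W, 3) float array with H and W multiples of tile
    net.eval()
    h, w, _ = image.shape
    out = np.zeros((h, w), dtype=np.float32)
    with torch.no_grad():
        for y in range(0, h, tile):          # step 4.1: numbered tiles in raster order
            for x in range(0, w, tile):
                patch = image[y:y + tile, x:x + tile]
                inp = torch.from_numpy(patch).permute(2, 0, 1)[None].float()
                prob = net(inp)[0, 0].numpy()       # step 4.2: segment one tile
                out[y:y + tile, x:x + tile] = prob  # step 4.3: stitch in order
    return (out > 0.5).astype(np.uint8)      # final binary segmentation map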
The effects of the present invention can be further illustrated by the following simulation experiments.
1. Simulation experiment conditions:
The hardware platform of the simulation experiment is: an Intel i7-6850K processor at 3.60 GHz, 64 GB of memory, and an NVIDIA TITAN Xp graphics card with 12 GB of video memory.
The software platform of the simulation experiment is: the Ubuntu operating system and Python 3.6.
The data used in the simulation experiment are 20 randomly selected groups from the Massachusetts Buildings remote sensing dataset, which consists of 151 aerial images of the Boston area; each image is 1500 × 1500 pixels, and the foreground of the dataset comprises buildings of various kinds.
2. Simulation experiment content and result analysis:
the simulation experiment of the invention is carried out by adopting the method of the invention and the ablation experiment method of the prior art according to the following steps respectively.
The ablation experimental method is characterized in that binary cross entropy and a contour cost function are respectively adopted as cost functions to train the increment-v 3U-Net network.
Step A: the data in the Massachusetts dataset are cropped to 224 × 224, the standard image size of this experiment.
Step B: the 4464 remote sensing images corresponding to cropped label images whose foreground pixel proportion exceeds 10% are selected as samples, forming the overall dataset of the experiment.
Step C: the samples in the overall dataset are randomly shuffled; 3960 samples are assigned to the training set, 144 to the validation set and 360 to the test set.
Step D: the training set is input into the Inception-v3 U-Net network, which is trained for 30 epochs in total. During training, the performance of the current model is evaluated on the validation set with the evaluation indices every 5 epochs.
Step E: after training, the performance of the model with the best validation result is evaluated on the test set.
The effect of the invention is further described with reference to the simulation diagram of FIG. 3.
FIG. 3(a) is a remote sensing image from the test set. FIG. 3(b) is the label image corresponding to FIG. 3(a). FIG. 3(c) is the result of segmenting the remote sensing image using binary cross entropy as the cost function. FIG. 3(d) is the result of segmenting the remote sensing image using the contour cost function as the cost function. FIG. 3(e) is the result of segmenting the remote sensing image with the method of the invention.
As can be seen from FIGS. 3(c), 3(d) and 3(e), compared with the segmentation results of the two ablation methods, the segmentation result of the invention has a clearer foreground contour segmentation and a more precise contour.
To verify that the segmentation effect of the invention is superior to that of the two ablation methods, the segmentation results are evaluated with three indices, namely Recall, accuracy (Acc) and intersection-over-union (IoU), computed by the following formulas; all results are collected in Table 1.
The formula for Recall is:
Recall = TP / (TP + FN)
where TP is the number of samples that are actually positive and predicted positive, and FN is the number of samples that are actually positive but predicted negative.
The formula for the accuracy Acc is:
Acc = (TP + TN) / (TP + TN + FP + FN)
where FP is the number of samples that are actually negative but predicted positive, and TN is the number of samples that are actually negative and predicted negative.
The formula for the intersection-over-union IoU is:
IoU = TP / (TP + FP + FN)
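For completeness, the three indices can be computed from pixel-wise confusion counts as in the short sketch below (the function name is illustrative):

import numpy as np

def evaluate(pred, truth):
    # Pixel-wise Recall, Acc and IoU, with the building class as positive.
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)
    recall = tp / (tp + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)
    return recall, acc, iou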
Table 1. Quantitative comparison of the segmentation results of the method of the invention and the ablation methods in the simulation experiment
[Table 1 is rendered only as an image in the source; for the method of the invention it reports Recall 67.81%, Acc 89.68% and IoU 57.00%.]
Larger values of Recall, Acc and IoU in Table 1 indicate more accurate segmentation. Table 1 shows that the method achieves a Recall of 67.81%, an Acc of 89.68% and an IoU of 57.00%; all three indices are higher than those of the 2 ablation methods, which proves the effectiveness of the proposed hierarchical contour cost function and shows that it can effectively improve the accuracy of remote sensing image segmentation.

Claims (2)

1. A remote sensing image semantic segmentation method based on a hierarchical contour cost function, characterized in that an Inception-v3 U-Net segmentation network is constructed and the loss of the segmentation network is supervised with a hierarchical contour cost function; the segmentation method comprises the following specific steps:
step 1, generating a training set:
step 1.1, randomly selecting at least 20 high-resolution remote sensing images with a balanced foreground-to-background ratio and their corresponding label images; cropping each high-resolution image and its corresponding label image into patches of 224 × 224 pixels;
step 1.2, selecting the cropped label images whose foreground pixels account for more than 10% of the patch, together with the remote sensing images corresponding to them, to form a training set;
step 2, constructing the Inception-v3 U-Net segmentation network:
step 2.1, constructing 1 convolution module:
building a convolution module formed by connecting a first convolution layer and a second convolution layer in series;
setting the convolution kernel sizes of the first and second convolution layers to 1 × 1, the stride to 1 and the padding to 1;
step 2.2, constructing an up-sampling sub-network:
building, as the decoder, an up-sampling sub-network formed by sequentially connecting a first up-sampling module, a second up-sampling module, a third up-sampling module and a CBR module in series; the first to third up-sampling modules have the same structure, and the structure of each is, in order: a first convolution layer, a second convolution layer, a third convolution layer, a BatchNorm layer, an activation layer and an up-sampling layer;
the convolution kernels of the first to third convolution layers are all set to 3 × 3, the strides are all set to 1, the padding is all set to 1, the negative-part slope of the activation layer is set to 0.2, the activation layer is implemented with the LeakyReLU function, and the up-sampling layer performs 2× nearest-neighbor up-sampling;
the structure of the CBR module is, in order: a convolution layer, a BatchNorm layer, an activation layer; the convolution kernel size of the convolution layer is set to 3 × 3, the stride to 1, the padding to 1, the negative-part slope of the activation layer to 0.2, and the activation layer is implemented with the LeakyReLU function;
step 2.3, using concatenation, connecting the inputs of the three up-sampling modules in the up-sampling sub-network with the outputs of the three Inception modules in the Inception-v3 network, and connecting the output of the input module with the input of the CBR module, to form an Inception-v3 network with a skip-connection structure;
step 2.4, sequentially connecting the Inception-v3 network with the skip-connection structure, the convolution module and the up-sampling sub-network in series to form the Inception-v3 U-Net network;
step 3, training the Inception-v3 U-Net network:
inputting the training set into the Inception-v3 U-Net network, and iteratively updating the parameters of each layer in the network by gradient descent until the total cost function converges, to obtain the trained Inception-v3 U-Net network;
step 4, predicting the remote sensing image:
step 4.1, sequentially cropping all remote sensing images to be predicted into 224 × 224 patches, and numbering the cropped patches;
step 4.2, sequentially inputting the numbered patches into the trained Inception-v3 U-Net network to obtain segmentation results of the cropped remote sensing images;
step 4.3, sequentially stitching the segmentation results of the cropped remote sensing images according to their numbers to obtain the final segmentation result.
2. The remote sensing image semantic segmentation method based on a hierarchical contour cost function according to claim 1, characterized in that the total cost function in step 3 is as follows:
L = (1/N)·Σ_(j=1..N) L_GCL(j)
where L represents the total cost function, N represents the total number of samples in the training set, Σ(·) represents the summation operation, j is the index of a sample in the training set, and L_GCL(j) represents the hierarchical contour cost function of the j-th sample in the training set;
the hierarchical contour cost function L_GCL(j) is as follows:
L_GCL(j) = -Σ_i M_gc^i·[y_i·log(ŷ_i) + (1 - y_i)·log(1 - ŷ_i)]
where M_gc^i represents the weight of the i-th pixel in the hierarchical contour weight matrix M_gc of the j-th sample in the training set, y_i and ŷ_i represent the ground-truth value and the predicted value of the i-th pixel of the j-th sample respectively, and log(·) denotes the base-2 logarithm;
the hierarchical contour weight matrix M_gc is as follows:
[equation rendered only as an image in the source: M_gc combines, over the levels g = -G, …, G, the Gaussian-smoothed level bands Gauss(Γ_g) weighted by the hyper-parameters K_g, together with the contour-range term δ]
where G represents the number of inward hierarchical levels of the contour matrix, the interior of the target area being taken as the positive direction of the contour matrix and the exterior as the negative direction, g is the index of a level in the contour matrix, Gauss(·) denotes the Gaussian convolution function, K_g represents the hyper-parameter of the g-th level contour range, Γ represents the division result of the contour at each level, and δ represents the determination result of the contour range;
the division result Γ of the contour at each level is obtained by the following formula (reconstructed from the stated values of α; the original expression is rendered only as an image in the source):
Γ_g = (Y;S)^(g+1) - (Y;S)^(g) for g > 0;  Γ_0 = (Y;S)^(+1) - (Y;S)^(-1);  Γ_g = (Y;S)^(g) - (Y;S)^(g-1) for g < 0
where Y represents a label image in the training set, S represents the convolution kernel used when dilating or eroding the label image, and (Y;S)^α represents a dilation or erosion of the label image: when α is positive it denotes dilation, and when α is negative it denotes erosion;
the determination result δ of the contour range is obtained by the following formula:
δ = 255·One - ((Y;S)^(+1) - (Y;S)^(-1))
where One represents an all-ones matrix.
CN202210675935.4A (filed 2022-06-15): Remote sensing image semantic segmentation method based on hierarchical contour cost function. Status: Pending. Published as CN114972759A.

Priority Applications (1)

CN202210675935.4A, priority and filing date 2022-06-15: Remote sensing image semantic segmentation method based on hierarchical contour cost function

Publications (1)

Publication number: CN114972759A, published 2022-08-30

Family ID: 82963891

Family Applications (1)

CN202210675935.4A (pending, published as CN114972759A): Remote sensing image semantic segmentation method based on hierarchical contour cost function

Country Status (1)

CN: CN114972759A


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115810139A (en) * 2022-12-16 2023-03-17 西北民族大学 Target area identification method and system of SPECT image
CN115810139B (en) * 2022-12-16 2023-09-01 西北民族大学 Target area identification method and system for SPECT image
CN116030080A (en) * 2023-02-03 2023-04-28 北京博睿恩智能科技有限公司 Remote sensing image instance segmentation method and device
CN116030080B (en) * 2023-02-03 2023-08-22 北京博睿恩智能科技有限公司 Remote sensing image instance segmentation method and device
CN116310883A (en) * 2023-05-17 2023-06-23 山东建筑大学 Agricultural disaster prediction method based on remote sensing image space-time fusion and related equipment
CN116310883B (en) * 2023-05-17 2023-10-20 山东建筑大学 Agricultural disaster prediction method based on remote sensing image space-time fusion and related equipment

Similar Documents

Publication number and title
CN110136154B (en) Remote sensing image semantic segmentation method based on full convolution network and morphological processing
CN109919108B (en) Remote sensing image rapid target detection method based on deep hash auxiliary network
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN108664971B (en) Pulmonary nodule detection method based on 2D convolutional neural network
CN114120102A (en) Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium
CN114972759A (en) Remote sensing image semantic segmentation method based on hierarchical contour cost function
CN107808138B (en) Communication signal identification method based on FasterR-CNN
CN112949783B (en) Road crack detection method based on improved U-Net neural network
CN111833322B (en) Garbage multi-target detection method based on improved YOLOv3
CN114897779A (en) Cervical cytology image abnormal area positioning method and device based on fusion attention
CN111461213A (en) Training method of target detection model and target rapid detection method
CN113012177A (en) Three-dimensional point cloud segmentation method based on geometric feature extraction and edge perception coding
CN114627383A (en) Small sample defect detection method based on metric learning
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN113362277A (en) Workpiece surface defect detection and segmentation method based on deep learning
CN113221956A (en) Target identification method and device based on improved multi-scale depth model
CN115880529A (en) Method and system for classifying fine granularity of birds based on attention and decoupling knowledge distillation
CN116012709B (en) High-resolution remote sensing image building extraction method and system
CN112613354A (en) Heterogeneous remote sensing image change detection method based on sparse noise reduction self-encoder
CN111860465A (en) Remote sensing image extraction method, device, equipment and storage medium based on super pixels
CN114419078B (en) Surface defect region segmentation method and device based on convolutional neural network
CN113971764B (en) Remote sensing image small target detection method based on improvement YOLOv3
CN113177563B (en) Post-chip anomaly detection method integrating CMA-ES algorithm and sequential extreme learning machine
CN114332107A (en) Improved tunnel lining water leakage image segmentation method
CN112348062A (en) Meteorological image prediction method, meteorological image prediction device, computer equipment and storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination