CN110119728B - Remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network - Google Patents

Remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network

Info

Publication number
CN110119728B
CN110119728B (application CN201910436645.2A)
Authority
CN
China
Prior art keywords
convolution kernel
remote sensing
convolution
sensing image
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910436645.2A
Other languages
Chinese (zh)
Other versions
CN110119728A
Inventor
彭宇
郭玥
于希明
马宁
姚博文
刘大同
彭喜元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201910436645.2A priority Critical patent/CN110119728B/en
Publication of CN110119728A publication Critical patent/CN110119728A/en
Application granted granted Critical
Publication of CN110119728B publication Critical patent/CN110119728B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A remote sensing image cloud detection method based on a multi-scale fusion semantic segmentation network belongs to the technical field of remote sensing image cloud detection. The method solves the problem of low cloud detection accuracy in existing methods that perform cloud detection with manually extracted features. The invention extracts shallow features with the first three stages of sub-networks and deep features with the last two stages of sub-networks, and fuses the extracted deep features with the shallow features, thereby fully utilizing the rich detail information contained in the shallow features and the rich semantic information contained in the deep features. The advantages of the shallow and deep features are combined, the segmentation of boundaries in the deep features becomes finer, and the best cloud detection effect is achieved by optimizing the proportion of the deep and shallow features; the cloud area detection error is less than 1%. The method can be applied to the technical field of remote sensing image cloud detection.

Description

Remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network
Technical Field
The invention belongs to the technical field of remote sensing image cloud detection, and particularly relates to a remote sensing image cloud detection method.
Background
Remote sensing is an important means of acquiring earth resource and environmental information, and cloud is a main factor affecting the quality of satellite remote sensing images. In general, about 50% of the earth's surface is covered by cloud, and the presence of cloud brings great inconvenience to remote sensing image processing. A remote sensing image covered by cloud contains little usable information but occupies a large amount of the system's storage space and transmission bandwidth, thereby reducing the utilization rate of satellite data. At present, apart from synthetic aperture radar sensors, which can penetrate cloud layers to acquire surface information, no sensor can thoroughly solve the problem of cloud coverage in remote sensing images, and most image data are currently acquired by sensors in the visible-light band. Therefore, high-precision cloud detection of visible-light remote sensing images becomes the key to improving the utilization rate of remote sensing data.
Cloud detection methods have gone through the stages from manual judgment to computer processing. Early cloud detection and classification mainly relied on manual visual inspection by observers, which depends heavily on the observers' subjective experience. With the huge and growing volume of remote sensing data, relying on manual judgment for all of it is no longer feasible, so automatic, rapid and effective cloud detection and classification has become an important research direction of satellite data processing centers.
Cloud detection by computer processing is accomplished on the basis of extracting cloud image features; the extraction of cloud image features keeps mining deeper features, and the extraction mode is also changing from manual extraction to automatic extraction. The most intuitive difference between cloud and ground objects is the gray-level feature: clouds appear white in the picture. Methods that perform cloud detection directly with a gray-level threshold are called threshold methods; they are simple to compute, but they require prior knowledge, are affected by many factors, and their detection accuracy is low. The gray-level feature of the cloud cannot represent all characteristics of the cloud, so subsequent cloud detection methods continued to mine other features of the cloud, including frequency features, texture features and so on. For example, some researchers divide a picture evenly into several parts, extract the gray-level, frequency and texture features of each part for training, and finally classify with a support vector machine (Support Vector Machine, SVM).
Gray-level, frequency and texture features of clouds are all manually extracted shallow features, and methods that perform cloud detection with manually extracted features have the following problems:
(1) Features are often extracted directly from the whole picture; owing to the complexity of clouds, pictures containing only a small amount of cloud are easily missed;
(2) Only shallow features are extracted, so ground objects with features similar to clouds cannot be distinguished and the robustness is poor;
(3) Only the position of the cloud can be roughly judged, and the cloud amount is extracted with low precision.
Because methods that perform cloud detection with manually extracted features have the above problems, the existing methods of this kind have low cloud detection accuracy.
Disclosure of Invention
The invention aims to solve the problem of low cloud detection accuracy in existing methods that perform cloud detection with manually extracted features.
The technical scheme adopted for solving the technical problems is as follows: a remote sensing image cloud detection method based on a multi-scale fusion semantic segmentation network comprises the following steps:
step one, randomly selecting N_0 images from a real panchromatic visible-light remote sensing image data set as original remote sensing images;
preprocessing the N_0 original remote sensing images to obtain N_0 preprocessed remote sensing images;
step two, inputting the N_0 preprocessed remote sensing images into a semantic segmentation network as a training set for training, continuously updating the convolution kernel parameters of the convolution layers in the semantic segmentation network during training, and stopping training when the maximum number of iterations is reached to obtain a trained semantic segmentation network;
step three, preprocessing the remote sensing image to be detected by the method of step one to obtain a preprocessed remote sensing image to be detected;
inputting the preprocessed remote sensing image to be detected into the semantic segmentation network trained in step two to obtain a clipped image output by the semantic segmentation network;
passing the clipped image through a softmax classifier to obtain a binary image of the same size as the clipped image, wherein pixels with non-zero gray values in the binary image represent cloud regions and pixels with a gray value of 0 represent cloud-free regions, thereby realizing cloud detection of the remote sensing image to be detected.
The beneficial effects of the invention are as follows: the invention provides a remote sensing image cloud detection method based on a multi-scale fusion semantic segmentation network, which extracts shallow features with the first three stages of sub-networks, extracts deep features with the last two stages of sub-networks, and fuses the extracted deep features with the shallow features. The rich detail information contained in the shallow features and the rich semantic information contained in the deep features are thereby fully utilized, the advantages of the shallow and deep features are combined, the segmentation of boundaries in the deep features becomes finer, and the best cloud detection effect is achieved by optimizing the proportion of the deep and shallow features, improving the cloud detection accuracy; the cloud area detection error is less than 1%.
Drawings
FIG. 1 is a flow chart of a remote sensing image cloud detection method based on a multi-scale fusion semantic segmentation network;
FIG. 2 is a schematic diagram of a network architecture of a semantic segmentation network of the present invention;
FIG. 3 is a flow chart of training a semantic segmentation network according to the present invention;
FIG. 4 is a schematic diagram of a deconvolution operation process;
FIG. 5 is a schematic diagram of a bilinear kernel computation process of deconvolution;
FIG. 6 is an original view of a selected test dataset of scenario 1 of the present invention;
FIG. 7 is an original view of a test dataset of scenario 2 selected by the present invention;
FIG. 8 is an original view of a test dataset of scenario 3 selected by the present invention;
FIG. 9 is an effect diagram of cloud detection of an original graph of a test dataset of scenario 1 using a maximum inter-class variance method;
FIG. 10 is an effect diagram of cloud detection of an original graph of a test dataset of scenario 2 using a maximum inter-class variance method;
FIG. 11 is an effect diagram of cloud detection of an original graph of a test dataset of scene 3 using a maximum inter-class variance method;
FIG. 12 is an effect diagram of cloud detection of an original graph of a test dataset of scenario 1 using a multi-feature extraction method;
FIG. 13 is an effect diagram of cloud detection of an original view of a test dataset of scenario 2 using multi-feature extraction;
FIG. 14 is an effect diagram of cloud detection of an original view of a test dataset of scenario 3 using multi-feature extraction;
FIG. 15 is the annotation map corresponding to the original graphs of the test datasets for scenario 1, scenario 2, and scenario 3;
FIG. 16 is an effect diagram of cloud detection of original graphs of test data sets of scenario 1, scenario 2 and scenario 3 using the FCN method;
FIG. 17 is an effect diagram of cloud detection of test dataset original graphs for scenario 1, scenario 2, and scenario 3 using the U-net approach;
FIG. 18 is an effect diagram of cloud detection of original graphs of test data sets of scenario 1, scenario 2, and scenario 3 using the DeepLab V3+ method;
fig. 19 is an effect diagram of cloud detection of test dataset original graphs of scene 1, scene 2, and scene 3 using WMSFNet method of the present invention.
Detailed Description
Embodiment 1: as shown in fig. 1, the remote sensing image cloud detection method based on the multi-scale fusion semantic segmentation network according to this embodiment comprises the following steps:
step one, randomly selecting N_0 images from a real panchromatic visible-light remote sensing image data set as original remote sensing images;
preprocessing the N_0 original remote sensing images to obtain N_0 preprocessed remote sensing images;
the data set adopted in step one is a 2-m-resolution real panchromatic visible-light remote sensing image data set captured by the Gaofen-1 (GF-1) satellite;
step two, inputting the N_0 preprocessed remote sensing images into a semantic segmentation network (WMSFNet, a multi-scale fusion network) as a training set for training, continuously updating the convolution kernel parameters of the convolution layers in the semantic segmentation network during training, and stopping training when the maximum number of iterations is reached to obtain a trained semantic segmentation network;
the multi-scale fusion semantic segmentation network means that a fusion layer is added on the basis of the semantic segmentation network;
step three, preprocessing the remote sensing image to be detected by the method of step one to obtain a preprocessed remote sensing image to be detected;
inputting the preprocessed remote sensing image to be detected into the semantic segmentation network trained in step two to obtain a clipped image output by the semantic segmentation network;
passing the clipped image through a softmax classifier to obtain a binary image of the same size as the clipped image, wherein pixels with non-zero gray values in the binary image represent cloud regions and pixels with a gray value of 0 represent cloud-free regions, thereby realizing cloud detection of the remote sensing image to be detected.
As shown in fig. 1, the WMSFNet cloud detection algorithm framework of this embodiment first preprocesses the input picture, that is, the mean gray level of each channel of the picture is subtracted from the gray levels of the pixels of that channel, which speeds up the computation.
Deep features of the picture are then extracted by the convolution layers, and feature dimensionality is reduced by the pooling layers, so that cloud and ground objects can be distinguished. The deconvolution layers then perform up-sampling to obtain a binary picture of the same size as the input picture.
Pixels with a gray value of 0 in the binary picture represent cloud-free regions of the image and pixels with non-zero gray values represent cloud regions, so the proportion of cloud in the original input picture is obtained by counting the proportion of pixels with non-zero gray values among all pixels of the binary picture.
When the cloud proportion is larger than a set threshold, most of the image is cloud and it contains very little useful information, so the image can be discarded.
For an input picture, VGGNet is used as the backbone to extract features. Because cloud detection is a pixel-level prediction task, a prediction map of the same size as the original picture must be generated so that each pixel is classified. Deep learning algorithms often use fully connected layers for classification tasks, converting two-dimensional images into one-dimensional labels, whereas a pixel-level prediction task does not require converting the picture to one dimension. Therefore, the fully connected layers in VGGNet are replaced by convolution layers.
The feature extraction process of WMSFNet takes VGGNet as the backbone, and table 1 shows the structures of VGGNet and WMSFNet. In the original VGGNet, the convolution layers whose output feature maps have the same size form one stage; by the characteristics of VGGNet, the pooling layers halve the size of the input feature map while the other layers do not change it, so the output feature map of each stage of the network is reduced to half of its input feature map.
Table 1 VGGNet and WMSFNet network structures
The layer configuration information of WMSFNet (without taking into account the zero padding operation at the first layer convolution) is shown in table 2:
table 2 WMSFNet network layer configuration information
The pooling layers make the feature map smaller, while a binary picture of the same size as the original picture is ultimately needed, so the pooled image must be up-sampled; this is realized by the deconvolution layers. However, if the output of the last stage is directly up-sampled to the size of the original picture, the detection result at cloud and ground-object edges is very coarse. The invention therefore adopts the multi-scale fusion idea: shallow detail features are fused with deep semantic features, the deep semantic features are used to improve the cloud detection accuracy, and the shallow detail features are used to enhance the detection of cloud edges.
The invention performs cloud detection with the proposed WMSFNet network. The network has the following characteristics:
1) Traditional cloud detection methods require manually extracting features and setting thresholds, which demands rich experience from researchers. The WMSFNet network can be trained end to end without manual parameter tuning, which simplifies the realization of cloud detection;
2) The training process of the WMSFNet network distinguishes cloud regions and non-cloud regions of the input pictures in advance and learns them separately, and the network is insensitive to the shape of the cloud;
3) The WMSFNet network can automatically extract deep features of the cloud, realize the pixel-level prediction task, and fully fuse shallow detail features with deep semantic features so that the segmentation boundary is finer;
4) The WMSFNet network realizes the pixel-level prediction task and finally obtains a binary picture of the same size as the input picture, whose regions represent cloud areas and cloud-free areas respectively.
Compared with the prior art, the method fuses only the shallow and deep features, in a 1:3 proportion, and thus achieves a good cloud-region detection effect, where the shallow detail features improve the detection of cloud edges and the deep semantic features improve the cloud detection accuracy and reduce misjudgment. The method can be applied to high-precision cloud detection of visible-light remote sensing images.
Embodiment 2: this embodiment differs from Embodiment 1 in the following. The specific process of preprocessing the N_0 original remote sensing images to obtain the N_0 preprocessed remote sensing images is as follows:
for any original remote sensing image, the mean value M of the gray levels of each channel of the original remote sensing image is calculated, and the mean value M is subtracted from the gray level of each pixel of the original remote sensing image to obtain the preprocessed remote sensing image corresponding to the original remote sensing image; that is, the gray value of each pixel of the preprocessed remote sensing image corresponding to the original remote sensing image is:
O′(i,j) = O(i,j) - M    (1)
wherein: O(i,j) is the gray value of pixel (i,j) in the original remote sensing image, and O′(i,j) is the gray value of pixel (i,j) in the preprocessed remote sensing image corresponding to the original remote sensing image;
similarly, the preprocessing of each of the N_0 original remote sensing images is calculated to obtain the N_0 preprocessed remote sensing images.
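For illustration, the following is a minimal NumPy sketch of this preprocessing step; the function name and the H x W (x C) array layout are assumptions made here for the example, not part of the patent.

```python
import numpy as np

def preprocess(image: np.ndarray) -> np.ndarray:
    """Subtract the mean gray value M from every pixel, as in formula (1):
    O'(i, j) = O(i, j) - M. For a multi-channel image the mean is taken per channel."""
    image = image.astype(np.float32)
    mean = image.mean(axis=(0, 1), keepdims=True) if image.ndim == 3 else image.mean()
    return image - mean
```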
Embodiment 3: as shown in fig. 2 and fig. 3, this embodiment differs from Embodiment 2 in the following. The specific process of step two is as follows:
the N_0 preprocessed remote sensing images are input into the semantic segmentation network as a training set; before training starts, the network parameters of the semantic segmentation network are initialized, and the training process starts after the network parameters are initialized;
the semantic segmentation network comprises 15 convolution layers, 5 pooling layers, 2 deconvolution layers and 2 clipping layers, which are respectively:
2 convolution layers with a convolution kernel size of 3*3 and a convolution kernel number of 64;
1 pooling layer with the convolution kernel size of 2 x 2 and the convolution kernel number of 64;
2 convolution layers with a convolution kernel size of 3*3 and a convolution kernel number of 128;
1 pooling layer with convolution kernel size of 2 x 2 and convolution kernel number of 128;
3 convolution layers with the convolution kernel size of 3*3 and the convolution kernel number of 256;
1 pooling layer with the convolution kernel size of 2 x 2 and the convolution kernel number of 256;
3 convolution layers with the convolution kernel size of 3*3 and the convolution kernel number of 512;
1 pooling layer with convolution kernel size of 2 x 2 and convolution kernel number of 512;
3 convolution layers with the convolution kernel size of 3*3 and the convolution kernel number of 512;
1 pooling layer with convolution kernel size of 2 x 2 and convolution kernel number of 512;
1 convolution layer with convolution kernel size 7*7 and number 4096;
1 convolution layer with convolution kernel size 1*1 and number 4096;
1 deconvolution layer with the convolution kernel size of 8 x 8 and the convolution kernel number of 2;
1 cutting layer is used for cutting out the material,
1 deconvolution layer with the convolution kernel size of 16 x 16 and the convolution kernel number of 2;
1 clipping layer;
The feature map output by the convolution layer with a convolution kernel size of 1×1 and 4096 convolution kernels is up-sampled by the deconvolution layer with a convolution kernel size of 8×8 and 2 convolution kernels to obtain an up-sampled feature map; the obtained up-sampled feature map is four times the size of the feature map output by the convolution layer with a convolution kernel size of 1×1 and 4096 convolution kernels;
the obtained up-sampled feature map is also four times the size of the feature map output by the last pooling layer (convolution kernel size 2×2, 512 convolution kernels), because the feature map size is unchanged after the output of that pooling layer passes through the convolution layer with a convolution kernel size of 7×7 and 4096 convolution kernels and the convolution layer with a convolution kernel size of 1×1 and 4096 convolution kernels.
A pixel-by-pixel weighted average of the obtained up-sampled feature map and the feature map output by the last convolution layer with a convolution kernel size of 3×3 and 512 convolution kernels is computed to obtain a fused feature map; the fused feature map is up-sampled by the deconvolution layer with a convolution kernel size of 16×16 and 2 convolution kernels to obtain an up-sampled fused feature map, which is eight times the size of the fused feature map;
the up-sampled fused feature map passes through a clipping layer to obtain a clipped image, which has the same size as the preprocessed remote sensing image;
during training, the convolution kernel parameters of the convolution layers of the semantic segmentation network are continuously updated through the BP algorithm; iteration stops when the set maximum number of iterations N is reached, and the trained semantic segmentation network is obtained.
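To make the multi-scale fusion head concrete, the following is a rough PyTorch-style sketch (the patent itself is implemented in Caffe); the class name FusionHead, the forward signature, and the assumption that the deep and shallow feature maps have already been reduced to 2 channels (cloud / no cloud) are illustrative choices made here, and the cropping is simplified compared with the offset-19 clipping layer derived later in the text.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Sketch of the multi-scale fusion described above: up-sample the deep score map
    4x with an 8x8 deconvolution, fuse it with the shallow map by a pixel-wise
    weighted average (1:3, see Embodiment 7), then up-sample 8x with a 16x16
    deconvolution and crop to the input size."""

    def __init__(self, num_classes: int = 2, alpha: float = 1.0, beta: float = 3.0):
        super().__init__()
        self.up4 = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=8,
                                      stride=4, bias=False)   # 4x up-sampling
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=16,
                                      stride=8, bias=False)   # 8x up-sampling
        self.alpha, self.beta = alpha, beta                    # 1:3 fusion weights

    def forward(self, deep_score: torch.Tensor, shallow: torch.Tensor,
                out_size: tuple) -> torch.Tensor:
        up = self.up4(deep_score)
        # crop the up-sampled deep map to the shallow map before fusing
        up = up[:, :, :shallow.shape[2], :shallow.shape[3]]
        fused = self.alpha * up + self.beta * shallow          # pixel-wise weighted average
        out = self.up8(fused)
        # final clipping layer: crop to the size of the preprocessed input image
        return out[:, :, :out_size[0], :out_size[1]]
```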
The deep learning framework used in the invention is Caffe, and the programming language is Python.
WMSFNet is a fully convolutional network. Its convolution layers follow the same algorithm as the convolution layers of other deep learning networks; what is special is the introduction of the deconvolution layers. In the forward-propagation stage there is a certain error between the training result of each iteration and the training label, and this error causes recognition mistakes, so the convolution kernel parameters of the convolution layers must be adjusted continuously through the learning process to obtain suitable convolution kernel parameters.
The convolution kernel parameters of the deconvolution layers in WMSFNet do not participate in training, i.e., the convolution kernel parameters of the deconvolution layers are fixed throughout the training process.
The calculation process of a convolution layer is as follows: the convolution layer receives N_C input feature maps; a k×k sliding window is convolved with the inputs to generate one pixel of an output feature map, the stride of the sliding window being s, which is usually smaller than k; a total of N_F output feature maps are produced, and they form the input feature maps of the next convolution layer. The convolution layer receives an input feature map of size N_C*H*W and a set of convolution kernels of size N_F*N_C*k*k, and obtains a set of output feature maps of size N_F*H_O*W_O, where H_O and W_O (with p the zero-padding) are given by the following formulas:
H_O = (H + 2*p - k)/s + 1
W_O = (W + 2*p - k)/s + 1
Deconvolution is in effect a transposed convolution; when the deconvolution is computed, it is equivalent to a convolution whose kernel size is k, whose stride is 1 and whose padding is k - p - 1, with s - 1 zeros inserted between all input units. The deconvolution layer receives an input feature map of size N_C*H*W and a set of convolution kernels of size N_F*N_C*k*k, and obtains a set of output feature maps of size N_F*H_O*W_O, where H_O and W_O are given by the following formulas:
H_O = s*(H - 1) + k - 2*p
W_O = s*(W - 1) + k - 2*p
The specific computation process of the deconvolution operation is shown in fig. 4: the left part of fig. 4 shows the input feature map and the right part shows the output feature map produced by the deconvolution layer. The deconvolution layer has a kernel size k of 4, an s of 2 and a p of 0; the input size is 4×4, the input size after zero insertion and padding is 13×13, and the output size is 10×10.
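The two size formulas can be checked against the example of fig. 4 with a short Python sketch; the helper function names are illustrative only.

```python
def conv_output_size(h: int, k: int, s: int, p: int) -> int:
    """H_O = (H + 2*p - k)/s + 1 for an ordinary convolution."""
    return (h + 2 * p - k) // s + 1

def deconv_output_size(h: int, k: int, s: int, p: int) -> int:
    """H_O = s*(H - 1) + k - 2*p for a deconvolution (transposed convolution)."""
    return s * (h - 1) + k - 2 * p

# Example of fig. 4: k = 4, s = 2, p = 0, input 4x4 -> output 10x10.
assert deconv_output_size(4, k=4, s=2, p=0) == 10
# Seen as an ordinary convolution over the zero-inserted, zero-padded 13x13 input:
assert conv_output_size(13, k=4, s=1, p=0) == 10
```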
The convolution kernel of the deconvolution layer is a bilinear kernel, which can be obtained by bilinear interpolation. Let the coordinates of the current point of the kernel be (i, j) and the coordinates of the center point be (a, b); the value D of the current point of the kernel is computed by the following formula:
D = [1 - abs(i - a)/2] * [1 - abs(j - b)/2]
The calculation is illustrated with a bilinear kernel of size 4×4, as shown in fig. 5. Taking the 2nd cell as an example, its coordinates are (0, 1), and the weight corresponding to this point is:
[1 - abs(1 - 1.5)/2] * [1 - abs(0 - 1.5)/2] = 0.1875
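A minimal sketch of how such a bilinear kernel can be generated follows; the divisor 2 is taken directly from the formula above for the 4×4 case (for larger kernels a standard bilinear kernel would use the corresponding up-sampling factor instead, which is an assumption beyond the text).

```python
import numpy as np

def bilinear_kernel(size: int = 4) -> np.ndarray:
    """Build a bilinear deconvolution kernel using
    D = [1 - |i - a|/2] * [1 - |j - b|/2] with center (a, b)."""
    a = b = (size - 1) / 2.0          # (1.5, 1.5) for a 4x4 kernel
    kernel = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            kernel[i, j] = (1 - abs(i - a) / 2.0) * (1 - abs(j - b) / 2.0)
    return kernel

k = bilinear_kernel(4)
print(round(k[0, 1], 4))  # 0.1875, matching the 2nd cell of fig. 5
```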
In the WMSFNet network structure, only the pooling layers affect the size of the output feature map. If the size of the input feature map is H, then after the 5th-stage pooling the size of the output feature map is H/2^5; next, this output passes through the convolution layer with a kernel size of 7×7, and the size H_6 of the resulting output feature map is:
H_6 = (H/2^5 - 7)/1 + 1 = (H - 192)/2^5
Thus the size of the map finally fed into the deconvolution layer is H_6. In addition, pictures whose length or width is not larger than 192 pixels cannot be processed by the algorithm. To solve this problem, the input picture is generally zero-padded by 100 pixels at the first convolution, in which case the output feature map size H_6 becomes:
H_6 = (H + 6)/2^5
The next step is to up-sample the output of size H_6 by a factor of 32; let the deconvolution output size be H_7, which is:
H_7 = (H_6 - 1)*32 + 64 = ((H + 6)/32 - 1)*32 + 64 = H + 38
Obviously H_7 differs from the size H of the input picture, so a clipping layer is needed to crop H_7 to the same size as H. The clipping position must be specified to tell the algorithm where to crop; it can be derived from the expression of H_7, and the crop offset should be set to 19.
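The size bookkeeping above can be traced with a small Python sketch; the function name is illustrative, and the example input of 250 is chosen only so that (H + 6) divides evenly by 32.

```python
def wmsfnet_sizes(h: int):
    """Trace the sizes above (input zero-padded by 100 pixels at the first
    convolution). Exact arithmetic assumes (h + 6) is divisible by 32."""
    h6 = (h + 6) // 32              # size entering the 32x deconvolution
    h7 = (h6 - 1) * 32 + 64         # size after up-sampling, i.e. h + 38
    offset = (h7 - h) // 2          # crop offset of the final clipping layer
    return h6, h7, offset

print(wmsfnet_sizes(250))  # (8, 288, 19) -> crop offset 19, as stated above
```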
Embodiment 4: this embodiment differs from Embodiment 3 in the following. The loss function adopted by the semantic segmentation network is J(W, b). The clipped image is input into the softmax classifier to obtain a binary image of the same size as the clipped image, and the value of the loss function J(W, b) is calculated using the obtained binary image:
wherein: S_j′ represents the j′-th value of the output vector S of the Softmax classifier, j′ = 1, 2, …, T, T being the total number of values in the output vector S; a_j′ represents the j′-th value of the input vector a of the Softmax classifier, and e represents the natural constant; y_j′ is a 1×T vector, y_j′ = {0, …, 0, 1, 0, …, 0}, wherein the 1 is the j′-th element of the vector y_j′ and all other elements of the vector y_j′ are 0.
Embodiment 5: this embodiment differs from Embodiment 4 in the following. The convolution kernel parameters of the convolution layers of the semantic segmentation network are continuously updated through the BP algorithm, and the specific process is as follows:
during training, the convolution kernel parameters of the convolution layers of the semantic segmentation network are updated according to formula (4) in each iteration;
wherein: W_ij^(l) represents the transfer parameter from the i-th neuron of the l-th convolution layer to the j-th neuron of the (l+1)-th convolution layer, α is the learning rate, and b_i^(l) is the bias term of the i-th neuron of the l-th convolution layer.
The purpose of the training process is to make the cost function J(W, b) smaller and smaller. As in other deep learning algorithms, the adjustment of the convolution kernel parameters of the convolution layers is learned through the back-propagation (BP) algorithm. In each iteration, the parameters responsible for poor recognition are updated so that the new parameters give a better cloud detection effect, until the number of training iterations is reached and the final trained model is obtained.
WMSFNet can detect clouds more accurately mainly because the convolution kernels of the network's convolution layers, trained through the BP algorithm, extract the features of clouds more effectively. The shallow convolution kernels extract shallow features such as the gray level and texture of the cloud, the deep convolution kernels extract abstract semantic features of the cloud, and these features are finally fused, yielding a good cloud detection effect.
Embodiment 6: this embodiment differs from Embodiment 3 in the following. The obtained up-sampled feature map and the feature map output by the last convolution layer with a convolution kernel size of 3×3 and 512 convolution kernels are weighted and averaged pixel by pixel to obtain the fused feature map, and the specific process is as follows:
wherein: A_i″j″ is the pixel value at pixel (i″, j″) of the up-sampled feature map, B_i″j″ is the pixel value at pixel (i″, j″) of the feature map output by the last convolution layer with a convolution kernel size of 3×3 and 512 convolution kernels, α′ and β′ are both weight coefficients, and C_i″j″ is the pixel value at pixel (i″, j″) of the fused feature map.
The convolution layers of the first three stages of the network extract shallow features and the convolution layers of the last two stages extract deep features; the shallow features contain rich detail information and the deep features contain rich semantic information, and the advantages of the two are combined. Because the output feature map of the third-stage pooling layer differs in size from the output feature map of the fifth-stage pooling layer, they cannot be fused by direct pixel-by-pixel addition. Therefore, the output feature map of the fifth-stage pooling layer is first up-sampled to 4 times its size with a deconvolution layer, a pixel-by-pixel weighted average with the output feature map of the third-stage pooling layer is then computed, and finally the fused feature map is up-sampled to 8 times its size with a deconvolution layer, yielding a binary picture of the same size as the input picture.
Embodiment 7: this embodiment differs from Embodiment 6 in the following. The fusion ratio of the up-sampled feature map to the feature map output by the last convolution layer with a convolution kernel size of 3×3 and 512 convolution kernels is 1:3.
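A minimal NumPy sketch of this pixel-by-pixel weighted fusion, C_i″j″ = α′·A_i″j″ + β′·B_i″j″ with the 1:3 ratio of Embodiment 7, follows; the function name and the choice to normalize by the sum of the weights are assumptions made here for illustration (the fusion formula image itself is not reproduced in the text).

```python
import numpy as np

def fuse(up_sampled: np.ndarray, shallow: np.ndarray,
         alpha: float = 1.0, beta: float = 3.0) -> np.ndarray:
    """Pixel-by-pixel weighted average of the up-sampled deep map (A) and the
    shallow map (B), with alpha:beta = 1:3 following Embodiment 7."""
    assert up_sampled.shape == shallow.shape
    return (alpha * up_sampled + beta * shallow) / (alpha + beta)
```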
Experimental verification and analysis
To evaluate the performance of the WMSFNet network in cloud detection, the invention selects images from the 2-m-resolution real panchromatic visible-light remote sensing images captured by the GF-1 satellite for test verification; the size of the images is 256×256 pixels. The WMSFNet-based cloud detection method uses 100 pictures for training and 20 pictures for test verification.
To verify that the proposed method performs better on panchromatic visible-light remote sensing images than other methods, it is compared with cloud detection methods that extract shallow features manually and with other advanced semantic segmentation methods that automatically extract deep features, which fully illustrates the effectiveness of the WMSFNet network in cloud detection.
The invention selects three different scenes to illustrate the cloud detection effect, as shown in fig. 6, fig. 7 and fig. 8:
scene 1 is the simplest scene, in which the picture contains only cloud apart from the background; scene 2 contains not only cloud but also land apart from the background, with a sea-land boundary showing obvious gray-level variation; scene 3 is relatively complex, with buildings in the background that resemble cloud features.
The cloud detection methods based on manual feature extraction mainly include the threshold method and the multi-feature extraction method. The threshold method is a common cloud detection method, since clouds have obvious gray-level features relative to ground objects. Cloud detection is performed on the panchromatic visible-light remote sensing images of scene 1, scene 2 and scene 3 by the maximum between-class variance method of the literature (automatic cloud detection of GF-1 satellite imagery), and the detection effects are shown in fig. 9, fig. 10 and fig. 11, respectively.
A single gray-level feature cannot summarize all characteristics of clouds, so the literature (feature extraction in remote sensing cloud image recognition) further proposes a multi-feature extraction method, which extracts the gray-level, frequency and texture features of clouds and verifies whether a picture contains cloud through an SVM classifier; the SVM classifier parameters can be optimized by a genetic algorithm. The method divides a picture evenly into several small blocks and predicts with the SVM whether each block contains cloud. Cloud detection is performed on the original images of the test data sets of scene 1, scene 2 and scene 3 with the multi-feature extraction method, and the obtained cloud detection effects are shown in figs. 12-14.
In the output pictures, the dark parts represent cloud regions and the light parts represent cloud-free regions. Experimental results show that the classification accuracy of cloud detection with the SVM can reach 89%. However, because the picture is divided evenly into blocks, each block may contain both a cloud region and a cloud-free region, so the method can only roughly extract the cloud region and its detection accuracy is low.
In practical engineering applications, it is necessary not only to judge whether an image contains cloud but also to evaluate the cloud amount. A remote sensing satellite can hardly capture a completely cloud-free scene when acquiring images; a small amount of cloud does not block the effective information, and if such images were still discarded, effective information would be lost. Detecting the cloud amount therefore becomes an unavoidable problem. The difference between the cloud area percentage of the predicted picture and the cloud area percentage of the real picture is recorded as the cloud area detection error; the cloud area percentage of the real picture is obtained by manually annotating the cloud region and dividing the number of pixels of the cloud region by the number of all pixels of the picture. The cloud area detection value of the multi-feature extraction method is obtained by dividing the number of dark blocks in the output picture by the number of all blocks, as shown in table 3.
Table 3
As can be seen from table 3, the method based on multi-feature extraction + SVM can only roughly detect the cloud area: in a simple scene the cloud area detection error is less than 20%, but if the scene contains ground objects similar to cloud features the cloud area detection error exceeds 30%. In summary, cloud detection based on methods that extract shallow features gives a poor detection effect.
Besides gray-level, frequency and texture features, clouds have many abstract deep features, and more abstract semantic features of clouds can be extracted through multiple convolution layers. If AlexNet is adopted for feature extraction, replacing the SVM for classification, the classification accuracy can reach 94%. AlexNet is a network with a relatively simple structure; a network with a more complex structure would give even better classification accuracy, but such classification methods can only roughly detect the cloud amount, so the invention uses a semantic segmentation method to extract the cloud amount accurately.
The invention compares the cloud detection effect of the WMSFNet network with FCN, the common semantic segmentation method U-net, and the advanced semantic segmentation method DeepLab V3+ in the three different scenes. Fig. 15 shows the annotation maps corresponding to the original images of scene 1, scene 2 and scene 3; fig. 16, fig. 17, fig. 18 and fig. 19 show the cloud detection effect maps of FCN, U-net, DeepLab V3+ and WMSFNet, respectively.
In the first and simplest scene, all the semantic segmentation methods obtain good results. In the second scene, U-net performs worse: the sea-land boundary with obvious gray-level changes is also identified as cloud. The third scene is relatively complex, with buildings resembling cloud features in the background; the traditional threshold method can no longer handle this scene, whereas the semantic segmentation methods other than U-net do not misidentify the buildings as cloud.
As can be seen from fig. 15 to fig. 19, apart from U-net, both FCN and DeepLab V3+ can roughly detect the outline of the cloud, but the large up-sampling stride of FCN makes its edge detection coarse, while DeepLab V3+ introduces dilated (atrous) convolution and a conditional random field, so its detection of cloud edges is finer, although some cloud pixels at the image boundary are easily misjudged. It can also be seen from figs. 15-19 that the cloud detection effect of WMSFNet is better than that of the other methods; because the method fuses shallow detail features with deep semantic features, it achieves a better effect on cloud edge detection, so the extraction of the cloud amount is more accurate.
A number of criteria are commonly used in image segmentation to measure the accuracy of a method; the invention adopts accuracy criteria based on pixel-by-pixel annotation. Assume there are k + 1 classes in total and let P_ij denote the number of pixels that belong to class i but are predicted as class j; then P_ii denotes correctly annotated pixels, while P_ij and P_ji (i ≠ j) denote wrongly annotated pixels.
(1) Pixel Accuracy (PA): the proportion of correctly annotated pixels among all pixels.
(2) Mean Pixel Accuracy (MPA): the proportion of correctly classified pixels is computed within each class and then averaged over the classes.
(3) Mean Intersection over Union (MIoU): the ratio of the intersection to the union of the ground-truth set and the predicted set is computed for each class and then averaged.
(4) Frequency Weighted Intersection over Union (FWIoU): based on MIoU, each class is weighted by its frequency of occurrence.
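For reference, the four indices can be computed from a confusion matrix as in the following sketch; the function name and the use of a pre-computed confusion matrix are illustrative assumptions, not part of the patent.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Compute PA, MPA, MIoU and FWIoU from a (k+1) x (k+1) confusion matrix
    whose entry conf[i, j] is the number of pixels of class i predicted as class j."""
    total = conf.sum()
    per_class_gt = conf.sum(axis=1)                  # pixels belonging to each class
    per_class_pred = conf.sum(axis=0)                # pixels predicted as each class
    tp = np.diag(conf)                               # correctly annotated pixels P_ii

    pa = tp.sum() / total                            # pixel accuracy
    mpa = np.mean(tp / per_class_gt)                 # mean pixel accuracy
    iou = tp / (per_class_gt + per_class_pred - tp)  # per-class IoU
    miou = np.mean(iou)                              # mean IoU
    freq = per_class_gt / total
    fwiou = (freq * iou).sum()                       # frequency weighted IoU
    return pa, mpa, miou, fwiou
```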
The cloud area detection error is computed as the difference between the cloud area percentage of the predicted picture and that of the real picture. The input picture passes through the WMSFNet network to obtain a binary picture of the same size as the input picture, in which pixels with a gray value of 0 represent cloud-free regions and pixels with non-zero gray values represent cloud regions; the cloud area percentage of the predicted picture is obtained by counting the proportion of pixels with non-zero gray values among all pixels of the binary picture.
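As a small illustrative sketch of this error measure (the helper name is not from the patent):

```python
import numpy as np

def cloud_area_error(pred_mask: np.ndarray, true_mask: np.ndarray) -> float:
    """Cloud area detection error: difference between the predicted and true
    cloud area percentages, each measured as the fraction of non-zero pixels."""
    pred_pct = np.count_nonzero(pred_mask) / pred_mask.size
    true_pct = np.count_nonzero(true_mask) / true_mask.size
    return abs(pred_pct - true_pct)
```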
The quantitative index results of the different semantic segmentation methods are shown in table 4. The quantitative indices include PA, MPA, MIoU, FWIoU and the cloud area detection error. Compared with the other semantic segmentation methods, the WMSFNet cloud detection method improves on all five indices and achieves a better detection effect on this data set than the other methods.
Table 4 quantization index comparison
Experimental results show that WMSFNet performs well in different scenes, achieving a pixel classification accuracy of 95.39% and a cloud area detection error of less than 1%.
The above examples of the present invention are only for describing the calculation model and calculation flow of the present invention in detail, and are not limiting of the embodiments of the present invention. Other variations and modifications of the above description will be apparent to those of ordinary skill in the art, and it is not intended to be exhaustive of all embodiments, all of which are within the scope of the invention.

Claims (5)

1. The remote sensing image cloud detection method based on the multi-scale fusion semantic segmentation network is characterized by comprising the following steps of:
step one, randomly selecting N_0 images from a real panchromatic visible-light remote sensing image data set as original remote sensing images;
preprocessing the N_0 original remote sensing images to obtain N_0 preprocessed remote sensing images;
the specific process of preprocessing the N_0 original remote sensing images to obtain the N_0 preprocessed remote sensing images is as follows:
for any original remote sensing image, calculating the mean value M of the gray levels of each channel of the original remote sensing image, and subtracting the mean value M from the gray level of each pixel of the original remote sensing image to obtain the preprocessed remote sensing image corresponding to the original remote sensing image, that is, the gray value of each pixel of the preprocessed remote sensing image corresponding to the original remote sensing image is:
O′(i,j) = O(i,j) - M    (1)
wherein: O(i,j) is the gray value of pixel (i,j) in the original remote sensing image, and O′(i,j) is the gray value of pixel (i,j) in the preprocessed remote sensing image corresponding to the original remote sensing image;
similarly, the preprocessing of each of the N_0 original remote sensing images is calculated to obtain the N_0 preprocessed remote sensing images;
step two, inputting the N_0 preprocessed remote sensing images into a semantic segmentation network as a training set for training, continuously updating the convolution kernel parameters of the convolution layers in the semantic segmentation network during training, and stopping training when the maximum number of iterations is reached to obtain a trained semantic segmentation network;
the specific process of step two is as follows:
inputting the N_0 preprocessed remote sensing images into the semantic segmentation network as a training set, initializing the network parameters of the semantic segmentation network before training starts, and starting the training process after the network parameters are initialized;
the semantic segmentation network comprises 15 convolution layers, 5 pooling layers, 2 deconvolution layers and 2 clipping layers, which are respectively:
2 convolution layers with a convolution kernel size of 3*3 and a convolution kernel number of 64;
1 pooling layer with the convolution kernel size of 2 x 2 and the convolution kernel number of 64;
2 convolution layers with a convolution kernel size of 3*3 and a convolution kernel number of 128;
1 pooling layer with convolution kernel size of 2 x 2 and convolution kernel number of 128;
3 convolution layers with the convolution kernel size of 3*3 and the convolution kernel number of 256;
1 pooling layer with the convolution kernel size of 2 x 2 and the convolution kernel number of 256;
3 convolution layers with the convolution kernel size of 3*3 and the convolution kernel number of 512;
1 pooling layer with convolution kernel size of 2 x 2 and convolution kernel number of 512;
3 convolution layers with the convolution kernel size of 3*3 and the convolution kernel number of 512;
1 pooling layer with convolution kernel size of 2 x 2 and convolution kernel number of 512;
1 convolution layer with convolution kernel size 7*7 and number 4096;
1 convolution layer with convolution kernel size 1*1 and number 4096;
1 deconvolution layer with the convolution kernel size of 8 x 8 and the convolution kernel number of 2;
1 cutting layer is used for cutting out the material,
1 deconvolution layer with the convolution kernel size of 16 x 16 and the convolution kernel number of 2;
1 clipping layer;
up-sampling the feature map output by the convolution layer with a convolution kernel size of 1×1 and 4096 convolution kernels by using the deconvolution layer with a convolution kernel size of 8×8 and 2 convolution kernels to obtain an up-sampled feature map, the obtained up-sampled feature map being four times the size of the feature map output by the convolution layer with a convolution kernel size of 1×1 and 4096 convolution kernels;
performing a pixel-by-pixel weighted average of the obtained up-sampled feature map and the feature map output by the last convolution layer with a convolution kernel size of 3×3 and 512 convolution kernels to obtain a fused feature map; up-sampling the fused feature map by using the deconvolution layer with a convolution kernel size of 16×16 and 2 convolution kernels to obtain an up-sampled fused feature map, the obtained up-sampled fused feature map being eight times the size of the fused feature map;
passing the up-sampled fused feature map through a clipping layer to obtain a clipped image, the clipped image having the same size as the preprocessed remote sensing image;
during training, continuously updating the convolution kernel parameters of the convolution layers of the semantic segmentation network through the BP algorithm, and stopping iteration when the set maximum number of iterations N is reached to obtain the trained semantic segmentation network;
step three, preprocessing the remote sensing image to be detected by the method of step one to obtain a preprocessed remote sensing image to be detected;
inputting the preprocessed remote sensing image to be detected into the semantic segmentation network trained in step two to obtain a clipped image output by the semantic segmentation network;
passing the clipped image through a softmax classifier to obtain a binary image of the same size as the clipped image, wherein pixels with non-zero gray values in the binary image represent cloud regions and pixels with a gray value of 0 represent cloud-free regions, thereby realizing cloud detection of the remote sensing image to be detected.
2. The remote sensing image cloud detection method based on the multi-scale fusion semantic segmentation network according to claim 1, wherein the loss function adopted by the semantic segmentation network is J(W, b); the clipped image is input into the softmax classifier to obtain a binary image of the same size as the clipped image, and the value of the loss function J(W, b) is calculated using the obtained binary image:
wherein: S_j′ represents the j′-th value of the output vector S of the Softmax classifier, j′ = 1, 2, …, T, T being the total number of values in the output vector S; a_j′ represents the j′-th value of the input vector a of the Softmax classifier, and e represents the natural constant; y_j′ is a 1×T vector, y_j′ = {0, …, 0, 1, 0, …, 0}, wherein the 1 is the j′-th element of the vector y_j′ and all other elements of the vector y_j′ are 0.
3. The remote sensing image cloud detection method based on the multi-scale fusion semantic segmentation network according to claim 2, wherein the convolution kernel parameters of the convolution layers of the semantic segmentation network are continuously updated through the BP algorithm, the specific process being:
during training, the convolution kernel parameters of the convolution layers of the semantic segmentation network are updated according to formula (4) in each iteration;
wherein: W_ij^(l) represents the transfer parameter from the i-th neuron of the l-th convolution layer to the j-th neuron of the (l+1)-th convolution layer, α is the learning rate, and b_i^(l) is the bias term of the i-th neuron of the l-th convolution layer.
4. The remote sensing image cloud detection method based on the multi-scale fusion semantic segmentation network according to claim 1, wherein the obtained up-sampled feature map and the feature map output by the last convolution layer with a convolution kernel size of 3×3 and 512 convolution kernels are weighted and averaged pixel by pixel to obtain the fused feature map, the specific process being:
wherein: A_i″j″ is the pixel value at pixel (i″, j″) of the up-sampled feature map, B_i″j″ is the pixel value at pixel (i″, j″) of the feature map output by the last convolution layer with a convolution kernel size of 3×3 and 512 convolution kernels, α′ and β′ are both weight coefficients, and C_i″j″ is the pixel value at pixel (i″, j″) of the fused feature map.
5. The remote sensing image cloud detection method based on the multi-scale fusion semantic segmentation network according to claim 4, wherein the fusion ratio of the up-sampled feature map to the feature map output by the last convolution layer with a convolution kernel size of 3×3 and 512 convolution kernels is 1:3.
CN201910436645.2A 2019-05-23 2019-05-23 Remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network Active CN110119728B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910436645.2A CN110119728B (en) 2019-05-23 2019-05-23 Remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network

Publications (2)

Publication Number Publication Date
CN110119728A CN110119728A (en) 2019-08-13
CN110119728B true CN110119728B (en) 2023-12-05

Family

ID=67523101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910436645.2A Active CN110119728B (en) 2019-05-23 2019-05-23 Remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network

Country Status (1)

Country Link
CN (1) CN110119728B (en)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598600A (en) * 2019-08-27 2019-12-20 广东工业大学 Remote sensing image cloud detection method based on UNET neural network
CN110781770B (en) * 2019-10-08 2022-05-06 高新兴科技集团股份有限公司 Living body detection method, device and equipment based on face recognition
CN110910390B (en) * 2019-11-11 2022-10-21 大连理工大学 Panoramic three-dimensional color point cloud semantic segmentation method based on depth distortion convolution
CN111079683B (en) * 2019-12-24 2023-12-12 天津大学 Remote sensing image cloud and snow detection method based on convolutional neural network
CN111144304A (en) * 2019-12-26 2020-05-12 上海眼控科技股份有限公司 Vehicle target detection model generation method, vehicle target detection method and device
CN111523381A (en) * 2020-03-13 2020-08-11 上海眼控科技股份有限公司 Method and equipment for updating land utilization information in numerical weather forecast
CN111523546B (en) * 2020-04-16 2023-06-16 湖南大学 Image semantic segmentation method, system and computer storage medium
CN111553925B (en) * 2020-04-27 2023-06-06 南通智能感知研究院 FCN-based end-to-end crop image segmentation method and system
CN111553289A (en) * 2020-04-29 2020-08-18 中国科学院空天信息创新研究院 Remote sensing image cloud detection method and system
CN111611968B (en) * 2020-05-29 2022-02-01 中国科学院西北生态环境资源研究院 Processing method of remote sensing image and remote sensing image processing model
CN111797712B (en) * 2020-06-16 2023-09-15 南京信息工程大学 Remote sensing image cloud and cloud shadow detection method based on multi-scale feature fusion network
CN111798461B (en) * 2020-06-19 2022-04-01 武汉大学 Pixel-level remote sensing image cloud area detection method for guiding deep learning by coarse-grained label
CN111738954B (en) * 2020-06-24 2022-11-25 北京航空航天大学 Single-frame turbulence degradation image distortion removal method based on double-layer cavity U-Net model
CN111783968B (en) * 2020-06-30 2024-05-31 山东信通电子股份有限公司 Power transmission line monitoring method and system based on cloud edge cooperation
CN112149492B (en) * 2020-07-06 2022-08-30 北京航空航天大学 Remote sensing image accurate cloud detection method based on reinforcement genetic learning
CN112001403B (en) * 2020-08-11 2023-12-15 北京化工大学 Image contour detection method and system
CN111951284B (en) * 2020-08-12 2022-04-22 湖南神帆科技有限公司 Optical remote sensing satellite image refined cloud detection method based on deep learning
CN112489054A (en) * 2020-11-27 2021-03-12 中北大学 Remote sensing image semantic segmentation method based on deep learning
CN112508031B (en) * 2020-12-22 2022-09-02 北京航空航天大学 Unsupervised remote sensing image semantic segmentation method and model from virtual to reality
CN112784894B (en) * 2021-01-18 2022-11-15 西南石油大学 Automatic labeling method for rock slice microscopic image
CN112819837B (en) * 2021-02-26 2024-02-09 南京大学 Semantic segmentation method based on multi-source heterogeneous remote sensing image
CN113239830B (en) * 2021-05-20 2023-01-17 北京航空航天大学 Remote sensing image cloud detection method based on full-scale feature fusion
CN113743300A (en) * 2021-09-03 2021-12-03 中化现代农业有限公司 Semantic segmentation based high-resolution remote sensing image cloud detection method and device
CN113792653B (en) * 2021-09-13 2023-10-20 山东交通学院 Method, system, equipment and storage medium for cloud detection of remote sensing image
CN114092801A (en) * 2021-10-28 2022-02-25 国家卫星气象中心(国家空间天气监测预警中心) Remote sensing image cloud detection method and device based on depth semantic segmentation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341517A (en) * 2017-07-07 2017-11-10 哈尔滨工业大学 The multiple dimensioned wisp detection method of Fusion Features between a kind of level based on deep learning
CN107944354A (en) * 2017-11-10 2018-04-20 南京航空航天大学 A kind of vehicle checking method based on deep learning
CN108447048A (en) * 2018-02-23 2018-08-24 天津大学 Convolutional neural networks characteristics of image processing method based on concern layer
CN108491757A (en) * 2018-02-05 2018-09-04 西安电子科技大学 Remote sensing image object detection method based on Analysis On Multi-scale Features study
CN108830855A (en) * 2018-04-02 2018-11-16 华南理工大学 A kind of full convolutional network semantic segmentation method based on the fusion of multiple dimensioned low-level feature

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106097353B (en) * 2016-06-15 2018-06-22 北京市商汤科技开发有限公司 Method for segmenting objects and device, computing device based on the fusion of multi-level regional area
US10679351B2 (en) * 2017-08-18 2020-06-09 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Semantic image segmentation using fully convolutional neural networks with multi-scale images and multi-scale dilated convolutions";Duc My Vo等;《Multimedia Tools and Applications》;第77卷;全文 *
邓国徽等."基于改进的全卷积神经网络高分遥感数据语义分割研究".《第四届高分辨率对地观测学术年会论文集》.2017,第1-13页. *

Also Published As

Publication number Publication date
CN110119728A (en) 2019-08-13

Similar Documents

Publication Publication Date Title
CN110119728B (en) Remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN109255334B (en) Remote sensing image ground feature classification method based on deep learning semantic segmentation network
CN109934200B (en) RGB color remote sensing image cloud detection method and system based on improved M-Net
CN109241913B (en) Ship detection method and system combining significance detection and deep learning
CN111915592B (en) Remote sensing image cloud detection method based on deep learning
CN110781756A (en) Urban road extraction method and device based on remote sensing image
CN110598600A (en) Remote sensing image cloud detection method based on UNET neural network
CN113239830B (en) Remote sensing image cloud detection method based on full-scale feature fusion
JP6397379B2 (en) CHANGE AREA DETECTION DEVICE, METHOD, AND PROGRAM
CN106909902A (en) A kind of remote sensing target detection method based on the notable model of improved stratification
CN107506792B (en) Semi-supervised salient object detection method
CN107330861B (en) Image salient object detection method based on diffusion distance high-confidence information
CN113887472B (en) Remote sensing image cloud detection method based on cascade color and texture feature attention
CN113850324B (en) Multispectral target detection method based on Yolov4
CN111274964B (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN112950780A (en) Intelligent network map generation method and system based on remote sensing image
Fang et al. Geometric-spectral reconstruction learning for multi-source open-set classification with hyperspectral and LiDAR data
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
CN104766065A (en) Robustness prospect detection method based on multi-view learning
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN111291818B (en) Non-uniform class sample equalization method for cloud mask
CN110472632B (en) Character segmentation method and device based on character features and computer storage medium
CN113743300A (en) Semantic segmentation based high-resolution remote sensing image cloud detection method and device
CN104408731A (en) Region graph and statistic similarity coding-based SAR (synthetic aperture radar) image segmentation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant