CN115410081A - Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium

Info

Publication number: CN115410081A
Application number: CN202210959194.2A
Authority: CN (China)
Prior art keywords: cloud, convolution, feature, attention, scale
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 夏旻, 陈凯, 翁理国
Current and original assignee: Nanjing University of Information Science and Technology
Application filed by Nanjing University of Information Science and Technology
Priority to: CN202210959194.2A

Classifications

    • G06V 20/10 - Scenes; scene-specific elements; terrestrial scenes
    • G06N 3/02 - Computing arrangements based on biological models; neural networks
    • G06N 3/08 - Learning methods
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/806 - Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 - Image or video recognition or understanding using neural networks


Abstract

The invention discloses a multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium, belonging to the technical field of image processing. The method comprises: acquiring a picture to be detected; and inputting the picture to be detected into a pre-trained multi-scale attention feature aggregation network to obtain a mask image of cloud and cloud shadow, thereby completing the identification of the cloud and the cloud shadow. The mask image of the cloud and cloud shadow is output after the trained weights are used to extract features and perform encoding and decoding. The method effectively improves cloud and cloud shadow identification accuracy, reduces the interference of complex backgrounds and noise in the image, captures scattered small-scale cloud and cloud shadow targets, enhances the detection of thin clouds, refines the segmentation of irregular cloud and cloud shadow junctions, and improves the segmentation accuracy of complex cloud and cloud shadow edge details. It also achieves good results in segmentation experiments on other targets and shows excellent generalization capability and robustness.

Description

Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium
Technical Field
The invention relates to a multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium, and belongs to the technical field of image processing.
Background
With the development of remote sensing technology, remote sensing images have been widely applied in fields such as agriculture, meteorology and the military. Since 67% of the earth's surface is covered by cloud layers, many areas in remote sensing images are frequently obscured by cloud, so that the acquired ground information is attenuated or even lost entirely. Accurate identification of cloud and cloud shadow is therefore of great significance for the application of optical remote sensing images. Traditional deep learning networks are easily affected by factors such as ground object interference and noise and lack generalization capability; if they are applied directly to cloud detection, details and spatial information are easily lost, leading to coarse segmentation of cloud and cloud shadow boundaries as well as missed and false detections.
Disclosure of Invention
The invention aims to provide a multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium, which solve the problems of rough segmentation of cloud and cloud shadow boundaries, missed detection and false detection of images and the like and improve the identification accuracy of clouds and cloud shadows.
To achieve the above purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a method for identifying a multi-scale aggregated cloud and cloud shadow, including:
acquiring a picture to be detected;
and inputting the picture to be detected into a pre-trained multi-scale attention feature aggregation network to obtain a mask image of cloud and cloud shadow, and completing the identification of the cloud and the cloud shadow.
Further in combination with the first aspect, the multi-scale attention feature aggregation network is trained by:
acquiring training data;
performing data enhancement processing on the image in the training data, and then converting the image and the corresponding label into a tensor;
and inputting the tensor into a multi-scale attention feature aggregation network for training to obtain the trained multi-scale attention feature aggregation network.
With reference to the first aspect, further, the multi-scale attention feature aggregation network includes a multi-scale strip-shaped pooling attention module, which is composed of 4 parallel strip-shaped average pooling branches and an adaptive average pooling branch, two parallel strip-shaped convolution branches, a spatial attention module, and a channel attention module, and is configured to extract multi-scale context information and deep space and channel information;
the system comprises 4 average pooling branches and a self-adaptive average pooling branch, wherein the average pooling branches and the self-adaptive average pooling branch are used for extracting and adding a picture to be detected in parallel, recovering the size of the picture to be detected after obtaining a multi-scale characteristic diagram, and connecting the multi-scale characteristic diagram and the picture to be detected in a height dimension to obtain a weight vector;
respectively inputting the weight vectors into two parallel strip convolution branches, wherein the first branch consists of a convolution kernel 1 multiplied by 7 and a convolution kernel 7 multiplied by 1, the second branch consists of a convolution kernel 7 multiplied by 1 and a convolution kernel 1 multiplied by 7, the first branch extracts a feature map and then inputs the feature map into a space attention module to extract a first feature map containing space information, the second branch extracts a feature map and then inputs the feature map into a channel attention module to extract a second feature map containing channel information, and then the final feature maps are output after connection and interaction;
the calculation process of the channel attention module is as follows:
features were extracted using global average pooling and global maximum pooling, respectively:
W_max = C2D_{1×1}(G_max(x))

W_avg = C2D_{1×1}(G_avg(x))

wherein x represents the input feature map, W_max and W_avg represent the second and third weight vectors output by the global max-pooling branch and the global average-pooling branch respectively, G_max and G_avg represent global max pooling and global average pooling respectively, and C2D_{1×1} represents a two-dimensional convolution with a 1×1 convolution kernel;

splicing the features extracted by the global average pooling and the global maximum pooling:

X_cat = CAT_3(W_max, W_avg)

wherein CAT_3 represents splicing in the width dimension and X_cat is the image spliced in the width dimension;

size recovery, feature selection, re-weighting:

CA(x) = x · σ(DWC2D_{1×1}(DWC2D_{1×2}(X_cat)))

wherein CA(x) represents the first feature map output by the channel attention module, DWC2D_{1×2} represents a two-dimensional depth separable convolution with a 1×2 convolution kernel, DWC2D_{1×1} represents a two-dimensional depth separable convolution with a 1×1 convolution kernel, and σ represents the nonlinear activation function Sigmoid;
the spatial attention module is calculated as follows:
respectively extracting features by using global average pooling and global maximum pooling, then connecting along channel dimensions, executing convolution operation, and generating a second feature map by using a nonlinear activation function:
SA(x) = σ(C2D_{7×7}(CAT_1(MP(x), AP(x))))

wherein SA(x) represents the second feature map output by the spatial attention module, C2D_{7×7} represents a two-dimensional convolution with a 7×7 convolution kernel, CAT_1 represents splicing in the channel dimension, and MP and AP represent maximum pooling and average pooling respectively.
With reference to the first aspect, further, the multi-scale attention feature aggregation network includes a deep multi-head feedforward transfer attention module, configured to promote two adjacent layers of a backbone network in the multi-scale attention feature aggregation network to guide each other for feature mining, and to merge feature map information of the two adjacent layers extracted from the backbone network;
firstly, carrying out layer normalization operation on the feature maps output by two adjacent layers to generate a first layer normalization tensor and a second layer normalization tensor, generating a query vector by the first layer normalization tensor, generating a key vector and a value vector by the second layer normalization tensor, and calculating as follows:
Q = DWC2D^Q_{3×3}(C2D^Q_{1×1}(X))

K = DWC2D^K_{3×3}(C2D^K_{1×1}(Y))

V = DWC2D^V_{3×3}(C2D^V_{1×1}(Y))

wherein X is the first-layer normalization tensor, Y is the second-layer normalization tensor, Q is the query vector, K is the key vector, V is the value vector, C2D^Q_{1×1} represents the 1×1 two-dimensional convolution used to compute the query vector, DWC2D^Q_{3×3} represents the 3×3 two-dimensional depth separable convolution used to compute the query vector, C2D^K_{1×1} represents the 1×1 two-dimensional convolution used to compute the key vector, DWC2D^K_{3×3} represents the 3×3 two-dimensional depth separable convolution used to compute the key vector, C2D^V_{1×1} represents the 1×1 two-dimensional convolution used to compute the value vector, and DWC2D^V_{3×3} represents the 3×3 two-dimensional depth separable convolution used to compute the value vector;

reshaping the query vector and the key vector so that their dot-product interaction generates a transposed attention map:

Attention(Q′, K′, V′) = V′ · Softmax(K′ · V′ / β)

P′ = C2D_{1×1}(Attention(Q′, K′, V′)) + x + y

wherein x and y represent the shallow and deep input feature maps respectively, P′ is the output transposed attention feature map, Q′, K′ and V′ are the three matrices obtained by reshaping the tensors from their original sizes, β is a scaling parameter, and Attention(Q′, K′, V′) is the attention function;

inputting the transposed attention feature map into a feedforward network to obtain a feedforward feature map:

Z_1 = DWC2D_{3×3}(C2D_{1×1}(Z))

Z_2 = δ(Z_1) ⊙ Z_1

Z′ = C2D_{1×1}(Z_2) + Z

wherein Z is the third-layer normalization tensor obtained by performing layer normalization on the transposed attention feature map, C2D_{1×1} is a two-dimensional convolution with a 1×1 convolution kernel, DWC2D_{3×3} is a two-dimensional depth separable convolution with a 3×3 convolution kernel, Z_1 is the first feedforward intermediate map, ⊙ denotes the dot product, δ is the Gelu nonlinear activation function, Z_2 is the second feedforward intermediate map, and Z′ is the output feedforward feature map.
With reference to the first aspect, further, the multi-scale attention feature aggregation network includes a bilateral feature fusion module including a detail branch and a context branch;
and simultaneously inputting the feature diagram output by the detail branch into two branches to obtain two feature-mapped detail output values:
Y_1 = R(BN(DDWC2D_{3×3}(x_d)))

Y_2 = AP(R(BN(DDWC2D_{3×3}(x_d))))

wherein x_d is the feature map output by the detail branch, DDWC2D_{3×3} represents a two-dimensional depth separable dilated convolution with a 3×3 convolution kernel and a dilation factor of 2, BN represents batch normalization, R represents the Relu activation function, AP represents average pooling, and Y_1 and Y_2 respectively represent the detail output values after the two feature mappings;

simultaneously inputting the feature map output by the context branch into two branches to obtain two feature-mapped context output values:

y_1 = γ_1 · σ(x_c_up)

y_2 = γ_2 · σ(DWC2D_{3×3}(DC2D_{3×3}(x_c_up)))

wherein x_c_up is the upsampled feature map output by the context branch, σ represents the nonlinear activation function Sigmoid, DC2D_{3×3} represents a two-dimensional dilated convolution with a 3×3 convolution kernel and a dilation factor of 2, DWC2D_{3×3} represents a two-dimensional depth separable convolution with a 3×3 convolution kernel, γ_1 and γ_2 are the re-weighting values obtained from the detail-branch feature mappings, and y_1 and y_2 respectively represent the context output values after the two feature mappings;

adding the two feature-mapped context output values and then performing a two-dimensional depth separable convolution, batch normalization and activation to obtain the bilateral feature fusion feature map:

y_out = R(BN(DWC2D_{3×3}(y_1 + y_2)))

wherein y_out is the bilateral feature fusion feature map.
With reference to the first aspect, further, the multiscale attention feature aggregation network includes a boundary refinement boosting module, configured to enhance detection of cloud and cloud shadow complex edge information, where a computation process of the boundary refinement boosting module is as follows:
x′=C2D 3×3 (C2D 3×3 (x))+x
y=drop(C2D 3×3 (x′))
y′=Up(C2D 1×1 (y))
wherein x and y' respectively represent the input value and the output value of the boundary refinement boosting module, and C2D 1×1 Representing a two-dimensional convolution with a convolution kernel of 1 × 1, up representing 2 times upsampling, C2D 3×3 Representing a two-dimensional convolution with a convolution kernel of 3 x 3, drop representing the dropout algorithm, x' representing a first refinement intermediate quantity, and y representing a second refinement intermediate quantity.
In a second aspect, the present invention further provides a multi-scale aggregated cloud and cloud shadow identification system, including:
the picture acquisition module: the method comprises the steps of obtaining a picture to be detected;
cloud and cloud shadow identification module: the method is used for inputting the picture to be detected into the pre-trained multi-scale attention feature aggregation network to obtain the mask image of the cloud and the cloud shadow, and the identification of the cloud and the cloud shadow is completed.
In a third aspect, the present invention further provides a multi-scale aggregated cloud and cloud shadow recognition apparatus, including a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of the first aspect.
In a fourth aspect, the invention also provides a computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, performs the steps of the method of any one of the first aspects.
Compared with the prior art, the invention has the following beneficial effects:
according to the cloud and cloud shadow identification method, system, equipment and storage medium for multi-scale aggregation, the picture to be detected is input into a pre-trained multi-scale attention feature aggregation network, the trained weight extraction features are used for carrying out coding and decoding operations, and then the mask image of the cloud and cloud shadow is output, so that the cloud and cloud shadow identification accuracy can be effectively improved;
the multi-scale strip pooling attention module is adopted to further extract multi-scale context information and deep space and channel information, the cloud and cloud shadows are classified through the context information, edge detail processing and segmentation are carried out between the cloud and cloud shadows, strip pooling can reduce interference of other irrelevant areas in the image, interference of complex backgrounds and noise in the image is effectively reduced, and scattered small-scale cloud and cloud shadow targets can be effectively captured;
the deep multi-head feedforward attention transfer module is adopted to enhance the communication capacity of two channels, promote the mutual guiding of two adjacent layers of a backbone network to carry out feature mining, fuse the feature map information of the two adjacent layers extracted from the backbone network, enhance the detection capacity of thin clouds and refine the segmentation of irregular combination parts of the clouds and cloud shadows;
a bilateral feature fusion module is adopted to fuse low-level semantic information and high-level detail information, and context branch semantic information is used to guide the feature response of detail branches, so that efficient information exchange is realized, and the influence of interference objects on identification is reduced;
the feature representation is enhanced in the training phase by adopting a boundary thinning boosting module, so that the segmentation precision of the cloud and the complex edge details of the cloud shadow is improved.
Drawings
FIG. 1 is a schematic structural diagram of a multi-scale attention feature aggregation network provided by an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a channel attention module provided in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a spatial attention module according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a deep multi-head feedforward transfer attention module according to an embodiment of the invention;
fig. 5 is a schematic structural diagram of a bilateral feature fusion module according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a boundary refinement boosting module provided by an embodiment of the present invention;
fig. 7 is a flowchart of a method for identifying a cloud and cloud shadow for multi-scale aggregation according to an embodiment of the present invention.
Detailed Description
The present invention is further described with reference to the accompanying drawings, and the following examples are only for clearly illustrating the technical solutions of the present invention, and should not be taken as limiting the scope of the present invention.
Example 1
As shown in fig. 7, the method for identifying a cloud and a cloud shadow in a multi-scale aggregation according to an embodiment of the present invention includes:
s1, obtaining a picture to be detected.
An original color picture is collected as a picture to be detected.
S2, inputting the picture to be detected into a pre-trained multi-scale attention feature aggregation network to obtain a mask image of cloud and cloud shadow, and completing identification of the cloud and the cloud shadow.
The multi-scale attention feature aggregation network shown in fig. 1 is constructed. The whole network adopts an encoder-decoder structure and an end-to-end training mode, and is mainly composed of the multi-scale strip pooling attention module, the deep multi-head feedforward transfer attention module, the bilateral feature fusion module and the boundary refinement boosting module.
In the process of identifying cloud and cloud shadow in the remote sensing image, the extraction of the feature information in the image is very important, and the detection efficiency of the network can be greatly improved by selecting a proper backbone network.
The multi-scale strip pooling attention module (MSPA) is used to further extract multi-scale context information and deep spatial and channel information; it consists of 4 parallel strip average pooling branches, an adaptive average pooling branch, two parallel strip convolution branches, a spatial attention module and a channel attention module.
The pooling kernel of the 4 average pooling branches is N×1 (N = 1, 3, 5, 6) and the pooling kernel of the adaptive average pooling branch is 1×N. These branches extract and add the input features in parallel to obtain a multi-scale feature map, which is restored to the size of the input picture to be detected and then concatenated with the input feature map in the height dimension to obtain a weight vector, completing the extraction of the multi-scale features.
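For illustration, the following is a minimal PyTorch-style sketch of this pooling step (not the patented implementation; the interpolation mode and the width of the adaptive 1×N branch are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleStripPooling(nn.Module):
    """Sketch: four N x 1 strip average-pooling branches plus one adaptive
    1 x N branch, restored to the input size, added, and concatenated with
    the input along the height dimension to form the weight vector."""
    def __init__(self, strip_sizes=(1, 3, 5, 6)):
        super().__init__()
        self.strip_sizes = strip_sizes

    def forward(self, x):                                  # x: (B, C, H, W)
        h, w = x.shape[-2:]
        branches = []
        for n in self.strip_sizes:                         # N x 1 strip pooling
            pooled = F.adaptive_avg_pool2d(x, (n, 1))
            branches.append(F.interpolate(pooled, size=(h, w),
                                          mode="bilinear", align_corners=False))
        pooled = F.adaptive_avg_pool2d(x, (1, self.strip_sizes[-1]))  # adaptive 1 x N branch (width assumed)
        branches.append(F.interpolate(pooled, size=(h, w),
                                      mode="bilinear", align_corners=False))
        multi_scale = sum(branches)                        # parallel extraction, then addition
        return torch.cat([multi_scale, x], dim=2)          # concatenate along the height dimension
```

The resulting weight vector would then be fed to the two strip convolution branches and the attention modules described next.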
The weight vectors are respectively input into two parallel strip convolution branches: the first branch consists of a 1×7 convolution kernel followed by a 7×1 convolution kernel, and the second branch consists of a 7×1 convolution kernel followed by a 1×7 convolution kernel. The feature map extracted by the first branch is input into the spatial attention module to extract a first feature map containing spatial information, and the feature map extracted by the second branch is input into the channel attention module to extract a second feature map containing channel information; finally, the information of the two branches is connected and interacts to output the final feature map.
The horizontal strip convolution tends to learn horizontal details in cloud and cloud shadow images, while the vertical strip convolution tends to learn vertical details. Because the edges of clouds and their shadows are often closely connected, this operation captures the edge feature information at the junction of cloud and cloud shadow well and improves the edge segmentation effect.
The structure of the channel attention module is shown in fig. 2. Its core idea is to extract high-level features with two types of global pooling, global average pooling and global maximum pooling; using different global pooling operations makes the extracted high-level features richer. Two-dimensional depth separable convolutions are then used as extractors of inter-channel information, with the focus on the importance of features in different channels. The calculation process of the channel attention module is as follows:
features were extracted using global average pooling and global maximum pooling, respectively:
W_max = C2D_{1×1}(G_max(x))

W_avg = C2D_{1×1}(G_avg(x))

wherein x represents the input feature map, W_max and W_avg represent the second and third weight vectors output by the global max-pooling branch and the global average-pooling branch respectively, G_max and G_avg represent global max pooling and global average pooling respectively, and C2D_{1×1} represents a two-dimensional convolution with a 1×1 convolution kernel;

then, splicing the features extracted by the global average pooling and the global maximum pooling:

X_cat = CAT_3(W_max, W_avg)

wherein CAT_3 represents splicing in the width dimension and X_cat is the image spliced in the width dimension;
then, performing size recovery on the spliced image by using a two-dimensional depth separable convolution with a convolution kernel of 1 × 2, and paying attention to detail characteristic information of the image; then, a two-dimensional depth separable convolution with a convolution kernel of 1 x 1 is used as a selector, and the feature representation of the global average pooling branch and the global maximum pooling branch is focused in a self-adaptive mode; finally, after the output of the selector, the original feature map is re-weighted using the nonlinear activation function Sigmoid.
The calculation processes of the size recovery, the feature selection and the re-weighting are as follows:
CA(x) = x · σ(DWC2D_{1×1}(DWC2D_{1×2}(X_cat)))

wherein CA(x) represents the first feature map output by the channel attention module, DWC2D_{1×2} represents a two-dimensional depth separable convolution with a 1×2 convolution kernel, DWC2D_{1×1} represents a two-dimensional depth separable convolution with a 1×1 convolution kernel, and σ represents the nonlinear activation function Sigmoid.
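A minimal PyTorch-style sketch of this channel attention computation follows; channel counts and layer names are assumptions, not the patented implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Sketch of the channel attention formulas above: global max / average
    pooling, 1x1 convolutions producing two weight vectors, width-wise
    concatenation, 1x2 and 1x1 depthwise convolutions, and Sigmoid
    re-weighting of the input feature map."""
    def __init__(self, channels):
        super().__init__()
        self.conv_max = nn.Conv2d(channels, channels, kernel_size=1)   # C2D_1x1 (max branch)
        self.conv_avg = nn.Conv2d(channels, channels, kernel_size=1)   # C2D_1x1 (avg branch)
        self.dw_recover = nn.Conv2d(channels, channels, kernel_size=(1, 2),
                                    groups=channels)                   # DWC2D_1x2 (size recovery)
        self.dw_select = nn.Conv2d(channels, channels, kernel_size=1,
                                   groups=channels)                    # DWC2D_1x1 (feature selection)

    def forward(self, x):                                   # x: (B, C, H, W)
        w_max = self.conv_max(F.adaptive_max_pool2d(x, 1))  # Gmax then 1x1 conv -> (B, C, 1, 1)
        w_avg = self.conv_avg(F.adaptive_avg_pool2d(x, 1))  # Gavg then 1x1 conv -> (B, C, 1, 1)
        cat = torch.cat([w_max, w_avg], dim=3)              # CAT_3: width-wise concat -> (B, C, 1, 2)
        weight = torch.sigmoid(self.dw_select(self.dw_recover(cat)))   # (B, C, 1, 1)
        return x * weight                                   # CA(x): re-weight the input feature map
```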
The structure of the spatial attention module is shown in fig. 3. It also extracts feature information using average pooling and maximum pooling but, unlike the channel attention module, both pooling operations are performed along the channel dimension. After the feature maps generated by average pooling and maximum pooling are concatenated in the channel dimension, a convolution with a 7 × 7 kernel is applied to reduce the number of channels from 2 to 1; this large 7 × 7 kernel provides a relatively large receptive field. The final feature map is then generated through the nonlinear activation function Sigmoid.
The calculation process of the above spatial attention module is as follows:
extracting features by using global average pooling and global maximum pooling respectively, then connecting along channel dimensions, executing convolution operation, and generating a second feature map by using a nonlinear activation function:
SA(x) = σ(C2D_{7×7}(CAT_1(MP(x), AP(x))))

wherein SA(x) represents the second feature map output by the spatial attention module, C2D_{7×7} represents a two-dimensional convolution with a 7×7 convolution kernel, CAT_1 represents splicing in the channel dimension, and MP and AP represent maximum pooling and average pooling respectively.
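The spatial attention formula can be sketched in the same way; the returned map is assumed to be applied to the branch features by element-wise multiplication in the surrounding module:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention formula above: channel-wise max and mean
    maps are concatenated, passed through a 7x7 convolution and a Sigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # C2D_7x7

    def forward(self, x):                                  # x: (B, C, H, W)
        max_map, _ = torch.max(x, dim=1, keepdim=True)     # MP(x): (B, 1, H, W)
        avg_map = torch.mean(x, dim=1, keepdim=True)       # AP(x): (B, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))  # SA(x)
```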
The structure of the deep multi-head feedforward attention transfer module (DMFA) is shown in fig. 4, and is used for promoting two adjacent layers of the backbone network in the multi-scale attention feature aggregation network to mutually guide feature mining, and fusing feature map information of the two adjacent layers extracted from the backbone network.
The deep multi-head feedforward attention transfer module first performs layer normalization on the feature maps output by two adjacent layers to generate a first-layer normalization tensor X ∈ R^{H×W×C} and a second-layer normalization tensor Y ∈ R^{H×W×C}. The query vector Q is generated from the first-layer normalization tensor, and the key vector K and the value vector V are generated from the second-layer normalization tensor. Local context information is used for enrichment: pixel-level cross-channel context information is aggregated with a 1×1 convolution, and channel-level spatial context information is then encoded with a 3×3 depthwise convolution. The calculation process is as follows:

Q = DWC2D^Q_{3×3}(C2D^Q_{1×1}(X))

K = DWC2D^K_{3×3}(C2D^K_{1×1}(Y))

V = DWC2D^V_{3×3}(C2D^V_{1×1}(Y))

wherein X is the first-layer normalization tensor, Y is the second-layer normalization tensor, Q is the query vector, K is the key vector, V is the value vector, C2D^Q_{1×1} represents the 1×1 two-dimensional convolution used to compute the query vector, DWC2D^Q_{3×3} represents the 3×3 two-dimensional depth separable convolution used to compute the query vector, C2D^K_{1×1} represents the 1×1 two-dimensional convolution used to compute the key vector, DWC2D^K_{3×3} represents the 3×3 two-dimensional depth separable convolution used to compute the key vector, C2D^V_{1×1} represents the 1×1 two-dimensional convolution used to compute the value vector, and DWC2D^V_{3×3} represents the 3×3 two-dimensional depth separable convolution used to compute the value vector.
The query vector and the key vector are then reshaped so that their dot-product interaction generates a transposed attention map of size R^{C×C}. The calculation process is as follows:

Attention(Q′, K′, V′) = V′ · Softmax(K′ · V′ / β)

P′ = C2D_{1×1}(Attention(Q′, K′, V′)) + x + y

wherein x and y represent the shallow and deep input feature maps respectively, P′ is the output transposed attention feature map, Q′, K′ and V′ are the three matrices obtained by reshaping the tensors from their original sizes, Q′ ∈ R^{HW×C}, K′ ∈ R^{C×HW}, V′ ∈ R^{HW×C}, β is a scaling parameter used to control the magnitude of the dot product of K′ and V′ before the Softmax function is applied, and Attention(Q′, K′, V′) is the attention function.
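As a hedged illustration, the sketch below follows the common single-head, channel-wise (transposed) attention formulation, with the query taken from one layer and the key and value from the adjacent layer; the head count, the learnable scaling parameter and the exact matrix arrangement are assumptions rather than the patented implementation:

```python
import torch
import torch.nn as nn

class CrossLayerTransposedAttention(nn.Module):
    """Sketch of transposed (channel-wise) attention between two adjacent
    layers: Q comes from the shallow layer, K and V from the deep layer, each
    produced by a 1x1 convolution followed by a 3x3 depthwise convolution;
    attention is computed across channels, so its cost scales with C."""
    def __init__(self, channels):
        super().__init__()
        self.norm_x = nn.LayerNorm(channels)
        self.norm_y = nn.LayerNorm(channels)
        def qkv_proj():
            return nn.Sequential(
                nn.Conv2d(channels, channels, 1),                              # C2D_1x1
                nn.Conv2d(channels, channels, 3, padding=1, groups=channels))  # DWC2D_3x3
        self.to_q, self.to_k, self.to_v = qkv_proj(), qkv_proj(), qkv_proj()
        self.proj = nn.Conv2d(channels, channels, 1)
        self.beta = nn.Parameter(torch.ones(1))              # learnable scaling parameter (assumed)

    def _layer_norm(self, t, norm):
        b, c, h, w = t.shape
        return norm(t.flatten(2).transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x, y):                 # x: shallow layer, y: deep layer, both (B, C, H, W)
        b, c, h, w = x.shape
        q = self.to_q(self._layer_norm(x, self.norm_x)).flatten(2)   # (B, C, HW)
        k = self.to_k(self._layer_norm(y, self.norm_y)).flatten(2)   # (B, C, HW)
        v = self.to_v(self._layer_norm(y, self.norm_y)).flatten(2)   # (B, C, HW)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.beta, dim=-1)   # (B, C, C) attention map
        out = (attn @ v).reshape(b, c, h, w)                              # apply attention to V
        return self.proj(out) + x + y        # P' = C2D_1x1(Attention) + x + y
```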
After the feature map information has been processed as above, the transposed attention feature map is input into a feedforward network, which performs the same operations at every pixel position of the input. First, a third-layer normalization tensor Z ∈ R^{H×W×C} is obtained through layer normalization; the feature channels are then expanded by a two-dimensional convolution with a 1×1 kernel, and a two-dimensional depth separable convolution with a 3×3 kernel is used to encode information from spatially neighboring pixel positions. A gating mechanism follows: after splitting along the channel dimension, the feature map produced by the depth separable convolution passes through two parallel branches, one of which goes through a Gelu nonlinear activation function; the outputs of the two parallel branches are multiplied element-wise, and a 1×1 convolution then reduces the channels back to the original input dimension to obtain the feedforward feature map. The calculation process is as follows:

Z_1 = DWC2D_{3×3}(C2D_{1×1}(Z))

Z_2 = δ(Z_1) ⊙ Z_1

Z′ = C2D_{1×1}(Z_2) + Z

wherein Z is the third-layer normalization tensor obtained by performing layer normalization on the transposed attention feature map, C2D_{1×1} is a two-dimensional convolution with a 1×1 convolution kernel, DWC2D_{3×3} is a two-dimensional depth separable convolution with a 3×3 convolution kernel, Z_1 is the first feedforward intermediate map, ⊙ denotes the dot product, δ is the Gelu nonlinear activation function, Z_2 is the second feedforward intermediate map, and Z′ is the output feedforward feature map.
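A minimal sketch of these feedforward formulas is given below; the channel expansion and the channel split described in the prose are omitted for brevity and would be straightforward to add:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFeedForward(nn.Module):
    """Sketch following the feedforward formulas above: layer normalization, a
    1x1 convolution, a 3x3 depthwise convolution, Gelu self-gating and a 1x1
    projection with a residual connection."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.pw_in = nn.Conv2d(channels, channels, kernel_size=1)
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                            groups=channels)
        self.pw_out = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, p):                 # p: transposed attention feature map, (B, C, H, W)
        b, c, h, w = p.shape
        z = self.norm(p.flatten(2).transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)
        z1 = self.dw(self.pw_in(z))       # Z1 = DWC2D_3x3(C2D_1x1(Z))
        z2 = F.gelu(z1) * z1              # Z2 = Gelu(Z1) ⊙ Z1
        return self.pw_out(z2) + p        # Z' = C2D_1x1(Z2) + Z (residual)
```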
The structure of the bilateral feature fusion module (BFF) is shown in fig. 5, and includes a detail branch and a context branch; the method is used for fusing low-level semantic information and high-level detail information, improving the overall anti-interference performance of the model, and reducing the influence of interferents and noise in the image on cloud and cloud shadow prediction; the module uses semantic information of the context branch to guide the feature response of the detail branch, and through guidance of different scales, we can extract feature representations of different scales.
The feature map output by the detail branch is simultaneously input into two branches to obtain two feature-mapped detail output values. One branch enters a depth separable dilated convolution with a 3×3 kernel and a dilation factor of 2; the dilated convolution greatly enlarges the receptive field without adding extra parameters, and enlarging the receptive field enhances the context information and better improves the accuracy of the segmentation boundary. The other branch enters a depth separable dilated convolution with a 3×3 kernel and a dilation factor of 2 followed by an average pooling layer. Batch normalization and Relu activation functions are added to both branches so that the network converges faster and more stably and overfitting is prevented. The calculation process is as follows:
Y_1 = R(BN(DDWC2D_{3×3}(x_d)))

Y_2 = AP(R(BN(DDWC2D_{3×3}(x_d))))

wherein x_d is the feature map output by the detail branch, DDWC2D_{3×3} represents a two-dimensional depth separable dilated convolution with a 3×3 convolution kernel and a dilation factor of 2, BN represents batch normalization, R represents the Relu activation function, AP represents average pooling, and Y_1 and Y_2 respectively represent the detail output values after the two feature mappings.
Because the pixel sizes of the feature maps output by the detail branch and the context branch are different, with the detail-branch feature map being twice the size of the context-branch feature map, the feature map output by the context branch is simultaneously input into two branches and upsampled in each. In one branch, the upsampled feature map is directly passed through Sigmoid activation; in the other branch, the feature map first passes through a dilated convolution and then a depth separable convolution before the Sigmoid activation. The Sigmoid-activated values are then re-weighted by the detail-branch feature mappings to obtain the two feature-mapped context output values:
y_1 = γ_1 · σ(x_c_up)

y_2 = γ_2 · σ(DWC2D_{3×3}(DC2D_{3×3}(x_c_up)))

wherein x_c_up is the upsampled feature map output by the context branch, σ represents the nonlinear activation function Sigmoid, DC2D_{3×3} represents a two-dimensional dilated convolution with a 3×3 convolution kernel and a dilation factor of 2, DWC2D_{3×3} represents a two-dimensional depth separable convolution with a 3×3 convolution kernel, γ_1 and γ_2 are the re-weighting values obtained from the detail-branch feature mappings, and y_1 and y_2 respectively represent the context output values after the two feature mappings.
The results at the different scales are then summarized and further feature information is extracted: the two feature-mapped context output values y_1 and y_2 are added, and a two-dimensional depth separable convolution, batch normalization and activation are applied to obtain the bilateral feature fusion feature map:

y_out = R(BN(DWC2D_{3×3}(y_1 + y_2)))

wherein y_out is the bilateral feature fusion feature map.
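A minimal PyTorch-style sketch of this bilateral feature fusion is given below; it assumes the detail and context feature maps have already been brought to the same spatial size (so the average pooling and upsampling steps are omitted), and channel counts and the placement of BN/Relu are assumptions:

```python
import torch
import torch.nn as nn

class BilateralFeatureFusion(nn.Module):
    """Sketch: two dilated depthwise-separable mappings of the detail branch
    re-weight Sigmoid-activated versions of the context branch (direct, and
    after a dilated conv + depthwise conv); the two results are summed and
    refined by a depthwise-separable conv, BN and Relu."""
    def __init__(self, channels):
        super().__init__()
        def dw_dilated_block():
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=2, dilation=2,
                          groups=channels),           # depthwise dilated 3x3
                nn.Conv2d(channels, channels, 1),     # pointwise
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True))
        self.detail_map1 = dw_dilated_block()
        self.detail_map2 = dw_dilated_block()
        self.ctx_path = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),        # dilated conv
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels))   # depthwise conv
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))

    def forward(self, x_d, x_c_up):          # detail map and upsampled context map, same size
        g1 = self.detail_map1(x_d)           # first detail feature mapping
        g2 = self.detail_map2(x_d)           # second detail feature mapping
        y1 = g1 * torch.sigmoid(x_c_up)                    # y1
        y2 = g2 * torch.sigmoid(self.ctx_path(x_c_up))     # y2
        return self.fuse(y1 + y2)                          # y_out
```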
The structure of the boundary refinement boosting module (BRB) is shown in fig. 6; it is used to enhance the detection of complex cloud and cloud shadow edge information. The boundary details of the cloud and cloud shadow are re-predicted through end-to-end training. To address the problem that the segmentation prediction map is unsatisfactory when the segmentation accuracy is low, the BRB module provides a training enhancement strategy: feature representation is enhanced in the training stage, a dropout step is added in the intermediate training process, and neurons in the network are discarded with a probability of 0.1, which improves the segmentation accuracy to a certain extent, yields a prediction map with a good segmentation effect, and prevents overfitting of the network. The calculation process of the boundary refinement boosting module is as follows:
x′ = C2D_{3×3}(C2D_{3×3}(x)) + x

y = drop(C2D_{3×3}(x′))

y′ = Up(C2D_{1×1}(y))

wherein x and y′ respectively represent the input value and the output value of the boundary refinement boosting module, C2D_{1×1} represents a two-dimensional convolution with a 1×1 convolution kernel, Up represents 2× upsampling, C2D_{3×3} represents a two-dimensional convolution with a 3×3 convolution kernel, drop represents the dropout algorithm, x′ represents the first refinement intermediate quantity, and y represents the second refinement intermediate quantity.
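These formulas can be sketched as follows; the channel count, the number of output classes and the use of a 1×1 classification convolution before upsampling are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryRefinementBoost(nn.Module):
    """Sketch of the boundary refinement boosting formulas above: two 3x3 convs
    with a residual connection, a 3x3 conv followed by dropout (p = 0.1), then
    a 1x1 conv and 2x upsampling."""
    def __init__(self, channels, num_classes=3):           # num_classes is an assumption
        super().__init__()
        self.res = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.drop = nn.Dropout2d(p=0.1)                    # discard neurons with probability 0.1
        self.classify = nn.Conv2d(channels, num_classes, 1)

    def forward(self, x):
        x = self.res(x) + x                   # x' = C2D_3x3(C2D_3x3(x)) + x
        y = self.drop(self.conv(x))           # y  = drop(C2D_3x3(x'))
        y = self.classify(y)                  # C2D_1x1(y)
        return F.interpolate(y, scale_factor=2, mode="bilinear",
                             align_corners=False)          # Up: 2x upsampling
```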
For a deep neural network, capturing long-distance correlation is crucial; however, convolution operation is used for processing local areas, and the receptive field is limited, so that the correlation of long-distance feature information is difficult to capture; the method has good effect when detecting large-scale cloud layers, but has poor effect on scattered small-scale cloud clusters, and because the large square kernels extract too much information from irrelevant areas, the final prediction of the model is interfered, and the segmentation precision is reduced; in order to solve the problems, the invention provides a multi-scale strip-shaped pooling attention Module (MSPA) for further extracting multi-scale context information and deep space and channel information.
On one hand, the cloud and the cloud shadow have similar shapes, so that the cloud and the cloud shadow can be classified through context information, the edge between the cloud and the cloud shadow is subjected to detailed processing and segmentation, and the strip pooling can reduce the interference of other irrelevant areas in an image and more effectively identify scattered small-sized cloud and cloud shadow, so that the probability of missed detection and false detection of a detected target is reduced, and the segmentation effect is improved; on the other hand, after the strip pooling operation is performed, attention mechanism operation is performed next to extract multi-scale deep space information and multi-scale deep channel information in parallel, and the category information and the position information of cloud and cloud shadow are better concerned, so that the model can focus on important information in the image, and the segmentation effect is further improved.
In order to meet the requirements of thin cloud layers and of segmenting irregular cloud and cloud shadow junctions in the cloud and cloud shadow segmentation task, the scheme of the invention chooses to let two adjacent layers of the backbone network guide each other for feature mining and fuses the feature map information of the two adjacent layers extracted from the backbone network. However, simply combining two feature maps of different scales causes a loss of the diversity of the two kinds of information, so the scheme of the invention designs the deep multi-head feedforward transfer attention module (DMFA) to enhance the communication capability of the two channels, so that the two adjacent layers of the backbone network guide each other for feature mining and the fusion of image feature information is promoted, providing more useful feature information for the upsampling process.
In order to minimize the influence of interfering objects in the image on cloud and cloud shadow prediction during segmentation, improve the overall anti-interference capability of the model, reduce the probability of false and missed detections, and address the difficulty of accurately predicting the irregular shapes of cloud and cloud shadow, the scheme of the invention provides a bilateral feature fusion module (BFF) in the decoding stage for fusing low-level semantic information and high-level detail information. The feature representations of the detail branch and the context branch are complementary, and neither is aware of the information of the other. There are several ways to combine the two feature responses, such as element-wise summation; however, the outputs of the two branches lie at different levels of feature representation, the detail branch at a lower level and the semantic branch at a higher level, so a simple combination ignores the diversity of the two kinds of information, resulting in poor performance and difficult optimization. The bilateral feature fusion module greatly improves the fusion of the feature maps output by the two branches and uses the semantic information of the context branch to guide the feature response of the detail branch. By guiding at different scales, feature representations of different scales can be extracted, and compared with a simple combination this guidance achieves efficient information exchange between the two branches.
Because the size and shape of the cloud and cloud shadows are arbitrary and irregular, it is difficult to detect boundary information; the scheme of the invention provides a new module (BRB) to re-predict the boundary details of cloud and cloud shadow through end-to-end training, and provides a training-enhancing strategy for solving the problem that the effect of segmenting a prediction graph is not ideal when the segmentation precision is not high, wherein a dropout link is added in the middle training process in the training stage to enhance feature representation, and the neuron in a 0.1 probability network can be discarded in the prediction stage, so that the segmentation precision can be improved to a certain extent, the prediction graph with good segmentation effect can be obtained, and the overfitting of the network can be prevented.
And pre-training the multi-scale attention feature aggregation network after the multi-scale attention feature aggregation network is constructed.
Acquisition of a training data set:
the cloud and cloud shadow data set used in the embodiment of the invention is from Google Earth (Google Earth) which is virtual Earth software developed by Google; it places satellite photographs, aerial photographs and geographic information systems on a three-dimensional model of the earth, google earth with an effective resolution of at least 100 meters, usually 30 meters, and an observation height (Eyealt) of 15 kilometers.
The data set consists of high-definition remote sensing images randomly acquired by professional meteorological experts over Qinghai, the Yunnan Plateau, the Qinghai-Tibet Plateau and the Yangtze River Delta. To better reflect the performance of the model, several groups of high-resolution cloud images taken at different angles and heights were selected. Owing to the limited video memory of the GPU, the high-resolution cloud remote sensing images with an original resolution of 4800 × 2692 were cut into patches of size 224 × 224; after screening, 12280 images were obtained, of which 9824 are used as the training set and 2456 as the validation set, a training/validation ratio of 8:2.
Deep neural networks require a large amount of training data, but such learning samples are difficult to obtain; when training samples are few, data enhancement is therefore essential to avoid overfitting. The embodiment of the invention performs data enhancement through translation, flipping and rotation, as sketched below. The high-resolution cloud and cloud shadow images obtained from Google Earth are divided into 5 types with different backgrounds, namely water, forest, field, town and desert. The labels are annotated manually and contain three classes: cloud (red), cloud shadow (green) and background (black).
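A minimal sketch of such data enhancement with torchvision follows; the parameter values are assumptions, and in practice the same geometric transform must be applied to an image and its label mask (for example via joint functional transforms):

```python
import torchvision.transforms as T

# Translation, flipping and rotation; the ranges and probabilities are assumed values.
augment = T.Compose([
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # random translation
    T.RandomHorizontalFlip(p=0.5),                     # random horizontal flip
    T.RandomVerticalFlip(p=0.5),                       # random vertical flip
    T.RandomRotation(degrees=90),                      # random rotation
])
```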
The embodiment of the invention adopts a supervised training mode: data enhancement is first applied to the images in the data set, and the original images and the corresponding labels are then converted into tensors and fed into the model for training. The batch size of each training step is set to 16; an equal-interval learning-rate adjustment (StepLR) strategy is adopted so that the learning rate decreases as training proceeds, with an initial learning rate of 0.001, a decay coefficient of 0.98 and the learning rate updated every 3 epochs, for a total of 200 training epochs; the Adam algorithm is used as the optimizer, as in the configuration sketch below.
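A minimal sketch of this training configuration follows; `model` and `train_loader` (batch size 16) are assumed to be defined elsewhere, and the loss function is assumed to be cross-entropy over the three classes:

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

criterion = torch.nn.CrossEntropyLoss()                  # cloud / cloud shadow / background
optimizer = Adam(model.parameters(), lr=0.001)           # Adam optimizer, initial lr 0.001
scheduler = StepLR(optimizer, step_size=3, gamma=0.98)   # decay by 0.98 every 3 epochs

for epoch in range(200):                                 # 200 training epochs in total
    for images, labels in train_loader:                  # batch size 16
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                                     # equal-interval (StepLR) learning-rate decay
```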
After training is finished, the model weights are obtained and the prediction stage begins: the collected picture to be detected (an original color picture) is input into the pre-trained multi-scale attention feature aggregation network to obtain the mask image of cloud and cloud shadow, completing the identification of the cloud and the cloud shadow; a minimal inference sketch follows.
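In the sketch below, the network class name `MSAFNet`, the weight file name and the class-index ordering are hypothetical placeholders used only for illustration:

```python
import torch
from PIL import Image
import torchvision.transforms.functional as TF

# Load the trained weights, run the picture to be detected through the network,
# and take the per-pixel argmax as the cloud / cloud shadow mask.
model = MSAFNet(num_classes=3)                               # hypothetical network class
model.load_state_dict(torch.load("msaf_weights.pth", map_location="cpu"))
model.eval()

image = Image.open("test_image.png").convert("RGB")          # picture to be detected
x = TF.to_tensor(image).unsqueeze(0)                         # (1, 3, H, W)
with torch.no_grad():
    logits = model(x)                                        # (1, 3, H, W) class scores
mask = logits.argmax(dim=1).squeeze(0)                       # assumed: 0 = background, 1 = cloud, 2 = cloud shadow
```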
Example 2
The embodiment of the invention provides a multi-scale aggregated cloud and cloud shadow identification system, which comprises:
the picture acquisition module: the method comprises the steps of obtaining a picture to be detected;
cloud and cloud shadow identification module: the method is used for inputting the picture to be detected into the pre-trained multi-scale attention feature aggregation network to obtain the mask image of the cloud and the cloud shadow, and the identification of the cloud and the cloud shadow is completed.
Example 3
The embodiment of the invention provides a multi-scale aggregated cloud and cloud shadow identification device, which comprises a processor and a storage medium, wherein the processor is used for processing cloud shadow and cloud shadow;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method of:
acquiring a picture to be detected;
and inputting the picture to be detected into a pre-trained multi-scale attention feature aggregation network to obtain a mask image of cloud and cloud shadow, and completing the identification of the cloud and the cloud shadow.
Example 4
An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following method steps:
acquiring a picture to be detected;
and inputting the picture to be detected into a pre-trained multi-scale attention feature aggregation network to obtain a mask image of the cloud and the cloud shadow, and completing the identification of the cloud and the cloud shadow.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (9)

1. A multi-scale aggregated cloud and cloud shadow identification method is characterized by comprising the following steps:
acquiring a picture to be detected;
and inputting the picture to be detected into a pre-trained multi-scale attention feature aggregation network to obtain a mask image of cloud and cloud shadow, and completing the identification of the cloud and the cloud shadow.
2. The method of claim 1, wherein the multi-scale attention feature aggregation network is trained by:
acquiring training data;
performing data enhancement processing on images in the training data, and then converting the images and corresponding labels into tensors;
and inputting the tensor into a multi-scale attention feature aggregation network for training to obtain the trained multi-scale attention feature aggregation network.
3. The method of claim 1, wherein the multi-scale attention feature aggregation network comprises a multi-scale strip pooling attention module, which is composed of parallel 4 strip average pooling branches and an adaptive average pooling branch, two parallel strip convolution branches, a spatial attention module, and a channel attention module, and is used for extracting multi-scale context information and deep space and channel information;
the system comprises 4 average pooling branches and a self-adaptive average pooling branch, wherein the average pooling branches and the self-adaptive average pooling branch are used for extracting and adding a picture to be detected in parallel, recovering the size of the picture to be detected after obtaining a multi-scale characteristic diagram, and connecting the multi-scale characteristic diagram and the picture to be detected in a height dimension to obtain a weight vector;
respectively inputting the weight vectors into two parallel strip convolution branches, wherein the first branch consists of a convolution kernel 1 multiplied by 7 and a convolution kernel 7 multiplied by 1, the second branch consists of a convolution kernel 7 multiplied by 1 and a convolution kernel 1 multiplied by 7, the first branch extracts a feature map and then inputs the feature map into a space attention module to extract a first feature map containing space information, the second branch extracts a feature map and then inputs the feature map into a channel attention module to extract a second feature map containing channel information, and then the final feature maps are output after connection and interaction;
the calculation process of the channel attention module is as follows:
features were extracted using global average pooling and global maximum pooling, respectively:
W_max = C2D_{1×1}(G_max(x))

W_avg = C2D_{1×1}(G_avg(x))

wherein x represents the input feature map, W_max and W_avg represent the second and third weight vectors output by the global max-pooling branch and the global average-pooling branch respectively, G_max and G_avg represent global max pooling and global average pooling respectively, and C2D_{1×1} represents a two-dimensional convolution with a 1×1 convolution kernel;

splicing the features extracted by the global average pooling and the global maximum pooling:

X_cat = CAT_3(W_max, W_avg)

wherein CAT_3 represents splicing in the width dimension and X_cat is the image spliced in the width dimension;

size recovery, feature selection, re-weighting:

CA(x) = x · σ(DWC2D_{1×1}(DWC2D_{1×2}(X_cat)))

wherein CA(x) represents the first feature map output by the channel attention module, DWC2D_{1×2} represents a two-dimensional depth separable convolution with a 1×2 convolution kernel, DWC2D_{1×1} represents a two-dimensional depth separable convolution with a 1×1 convolution kernel, and σ represents the nonlinear activation function Sigmoid;
the spatial attention module is calculated as follows:
extracting features by using global average pooling and global maximum pooling respectively, then connecting along channel dimensions, executing convolution operation, and generating a second feature map by using a nonlinear activation function:
SA(x) = σ(C2D_{7×7}(CAT_1(MP(x), AP(x))))

wherein SA(x) represents the second feature map output by the spatial attention module, C2D_{7×7} represents a two-dimensional convolution with a 7×7 convolution kernel, CAT_1 represents splicing in the channel dimension, and MP and AP represent maximum pooling and average pooling respectively.
4. The multi-scale aggregated cloud and cloud shadow identification method of claim 1, wherein the multi-scale attention feature aggregation network comprises a deep multi-head feedforward transfer attention module, which is used for enabling two adjacent layers of the backbone network in the multi-scale attention feature aggregation network to guide each other in feature mining, and for fusing the feature map information of the two adjacent layers extracted by the backbone network;
firstly, carrying out layer normalization operation on the feature maps output by two adjacent layers to generate a first layer normalization tensor and a second layer normalization tensor, generating a query vector by the first layer normalization tensor, generating a key vector and a value vector by the second layer normalization tensor, and calculating as follows:
Q = DWC2D^Q_{3×3}(C2D^Q_{1×1}(X))
K = DWC2D^K_{3×3}(C2D^K_{1×1}(Y))
V = DWC2D^V_{3×3}(C2D^V_{1×1}(Y))
wherein X is the first layer-normalized tensor, Y is the second layer-normalized tensor, Q is the query vector, K is the key vector, V is the value vector, C2D^Q_{1×1} represents the 1×1 two-dimensional convolution used to compute the query vector, DWC2D^Q_{3×3} represents the 3×3 two-dimensional depth separable convolution used to compute the query vector, C2D^K_{1×1} represents the 1×1 two-dimensional convolution used to compute the key vector, DWC2D^K_{3×3} represents the 3×3 two-dimensional depth separable convolution used to compute the key vector, C2D^V_{1×1} represents the 1×1 two-dimensional convolution used to compute the value vector, and DWC2D^V_{3×3} represents the 3×3 two-dimensional depth separable convolution used to compute the value vector;
reshaping the query vector and the key vector so that their dot-product interaction generates a transposed attention map:
Attention(Q′, K′, V′) = V′ · Softmax(K′ · Q′ / β)
P′ = C2D_{1×1}(Attention(Q′, K′, V′)) + x + y
wherein x and y represent the shallow and deep input feature maps, respectively, P′ is the output transposed attention feature map, Q′, K′ and V′ are the three matrices obtained by reshaping the tensors from their original sizes, β is a scaling parameter, and Attention(Q′, K′, V′) is the attention function;
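A hedged PyTorch sketch of this cross-layer transposed (channel-wise) attention is given below; the head count, the standard Softmax(Q′·K′ᵀ/β)·V′ ordering, and the LayerNorm handling are assumptions made only to obtain runnable code.

import torch
import torch.nn as nn

class TransposedAttention(nn.Module):
    # Sketch of the cross-layer transposed attention: Q from the shallow feature map,
    # K and V from the deep one, attention computed across channels rather than pixels.
    def __init__(self, channels, heads=4):
        super().__init__()
        assert channels % heads == 0               # channels must divide evenly into heads
        self.heads = heads
        self.beta = nn.Parameter(torch.ones(1))    # learnable scaling parameter
        self.norm_x = nn.LayerNorm(channels)
        self.norm_y = nn.LayerNorm(channels)
        def qkv_proj():
            return nn.Sequential(
                nn.Conv2d(channels, channels, 1),                              # C2D_{1x1}
                nn.Conv2d(channels, channels, 3, padding=1, groups=channels))  # DWC2D_{3x3}
        self.q, self.k, self.v = qkv_proj(), qkv_proj(), qkv_proj()
        self.proj = nn.Conv2d(channels, channels, 1)

    def _ln(self, t, norm):
        n, c, h, w = t.shape
        return norm(t.flatten(2).transpose(1, 2)).transpose(1, 2).reshape(n, c, h, w)

    def forward(self, x, y):                       # x: shallow map, y: deep map
        n, c, h, w = x.shape
        xn, yn = self._ln(x, self.norm_x), self._ln(y, self.norm_y)
        q = self.q(xn).reshape(n, self.heads, c // self.heads, h * w)
        k = self.k(yn).reshape(n, self.heads, c // self.heads, h * w)
        v = self.v(yn).reshape(n, self.heads, c // self.heads, h * w)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.beta, dim=-1)  # channel attention map
        out = (attn @ v).reshape(n, c, h, w)
        return self.proj(out) + x + y              # P' = C2D_{1x1}(attention) + x + y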
inputting the transposed attention feature map into a feedforward network to obtain a feedforward feature map:
Z_1 = DWC2D_{3×3}(C2D_{1×1}(Z))
Z_2 = δ(Z_1) ⊙ Z_1
Z′ = C2D_{1×1}(Z_2) + Z
wherein Z is the third layer-normalized tensor obtained by performing layer normalization on the transposed attention feature map, C2D_{1×1} is a two-dimensional convolution with a 1×1 convolution kernel, DWC2D_{3×3} is a two-dimensional depth separable convolution with a 3×3 convolution kernel, Z_1 is the first feed-forward intermediate map, ⊙ denotes the element-wise product, δ is the GELU nonlinear activation function, Z_2 is the second feed-forward intermediate map, and Z′ is the output feed-forward feature map.
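A compact sketch of this gated feed-forward step is shown below; it assumes Z has already been layer-normalized and keeps the hidden width equal to the input width, which the claim does not specify.

import torch.nn as nn

class GatedFeedForward(nn.Module):
    # Sketch of Z1 = DWC2D_{3x3}(C2D_{1x1}(Z)), Z2 = GELU(Z1) ⊙ Z1, Z' = C2D_{1x1}(Z2) + Z.
    def __init__(self, channels):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels, kernel_size=1)
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        self.project = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, z):
        z1 = self.dwconv(self.expand(z))   # 1x1 convolution then 3x3 depthwise convolution
        z2 = self.act(z1) * z1             # GELU gating via element-wise product
        return self.project(z2) + z        # 1x1 convolution plus residual connection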
5. The multi-scale aggregated cloud and cloud shadow identification method of claim 1, wherein the multi-scale attention feature aggregation network comprises a bilateral feature fusion module, the bilateral feature fusion module comprising a detail branch and a context branch;
simultaneously inputting the feature map output by the detail branch into two branches to obtain the two detail output values after feature mapping:
γ_1 = R(BN(DWC2D^{d=2}_{3×3}(x_d)))
γ_2 = AP(R(BN(DWC2D^{d=2}_{3×3}(x_d))))
wherein x_d is the feature map output by the detail branch, DWC2D^{d=2}_{3×3} represents a two-dimensional depth separable convolution with a 3×3 convolution kernel and a dilation factor of 2, BN represents batch normalization, R represents the ReLU activation function, AP represents average pooling, and γ_1 and γ_2 respectively represent the detail output values after the two feature mappings;
simultaneously inputting the feature map output by the context branch into two branches to obtain the two context output values after feature mapping:
y_1 = γ_1 · σ(x_c_up)
y_2 = γ_2 · C2D^{d=2}_{3×3}(DWC2D_{3×3}(x_c_up))
wherein x_c_up is the feature map output by the context branch, σ represents the nonlinear activation function Sigmoid, C2D^{d=2}_{3×3} represents a two-dimensional convolution with a 3×3 convolution kernel and a dilation factor of 2, DWC2D_{3×3} represents a two-dimensional depth separable convolution with a 3×3 convolution kernel, and y_1 and y_2 respectively represent the context output values after the two feature mappings;
adding the context output values after the two feature mappings, and then performing two-dimensional depth separable convolution, batch normalization processing and activation processing to obtain a bilateral feature fusion feature map:
y_out = R(BN(DWC2D_{3×3}(y_1 + y_2)))
wherein y_out is the bilateral feature fusion feature map.
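Under the reading of the formulas given above, the bilateral fusion could be sketched in PyTorch as follows; the stride-1 3×3 average pooling and the exact wiring of the two detail and two context mappings are assumptions rather than the patent's literal design.

import torch
import torch.nn as nn

class BilateralFusion(nn.Module):
    # Sketch of the bilateral feature fusion: detail-branch mappings gamma_1/gamma_2
    # weight two mappings of the upsampled context branch, then DWC2D_{3x3}, BN, ReLU.
    def __init__(self, channels):
        super().__init__()
        def dw_dilated():                           # DWC2D_{3x3}, dilation 2, BN, ReLU
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=2, dilation=2, groups=channels),
                nn.Conv2d(channels, channels, 1),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.detail_a = dw_dilated()                                          # -> gamma_1
        self.detail_b = nn.Sequential(dw_dilated(),
                                      nn.AvgPool2d(3, stride=1, padding=1))   # -> gamma_2
        self.context_b = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),     # DWC2D_{3x3}
            nn.Conv2d(channels, channels, 1),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2))          # dilated C2D_{3x3}
        self.out = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),     # DWC2D_{3x3}
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x_d, x_c_up):
        y1 = self.detail_a(x_d) * torch.sigmoid(x_c_up)   # y_1 = gamma_1 * sigmoid(x_c_up)
        y2 = self.detail_b(x_d) * self.context_b(x_c_up)  # y_2 = gamma_2 * mapped context
        return self.out(y1 + y2)                          # y_out = R(BN(DWC2D_{3x3}(y_1 + y_2)))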
6. The multi-scale aggregated cloud and cloud shadow identification method of claim 1, wherein the multi-scale attention feature aggregation network comprises a boundary refinement boosting module for enhancing the detection of complex cloud and cloud shadow edge information, and the calculation process of the boundary refinement boosting module is as follows:
x′ = C2D_{3×3}(C2D_{3×3}(x)) + x
y = drop(C2D_{3×3}(x′))
y′ = Up(C2D_{1×1}(y))
wherein x and y′ respectively represent the input value and the output value of the boundary refinement boosting module, C2D_{1×1} represents a two-dimensional convolution with a 1×1 convolution kernel, Up represents 2× upsampling, C2D_{3×3} represents a two-dimensional convolution with a 3×3 convolution kernel, drop represents the dropout algorithm, x′ represents the first refinement intermediate quantity, and y represents the second refinement intermediate quantity.
7. A multi-scale aggregated cloud and cloud shadow identification system, comprising:
a picture acquisition module, configured to acquire a picture to be detected; and
a cloud and cloud shadow identification module, configured to input the picture to be detected into a pre-trained multi-scale attention feature aggregation network to obtain a mask image of cloud and cloud shadow, completing the identification of cloud and cloud shadow.
8. A multi-scale aggregated cloud and cloud shadow identification device, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202210959194.2A 2022-08-10 2022-08-10 Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium Pending CN115410081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210959194.2A CN115410081A (en) 2022-08-10 2022-08-10 Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210959194.2A CN115410081A (en) 2022-08-10 2022-08-10 Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115410081A true CN115410081A (en) 2022-11-29

Family

ID=84158523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210959194.2A Pending CN115410081A (en) 2022-08-10 2022-08-10 Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115410081A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173103A (en) * 2023-08-04 2023-12-05 山东大学 Image shadow detection method and system
CN117173103B (en) * 2023-08-04 2024-04-12 山东大学 Image shadow detection method and system
CN116703928A (en) * 2023-08-08 2023-09-05 宁德市天铭新能源汽车配件有限公司 Automobile part production detection method and system based on machine learning
CN116703928B (en) * 2023-08-08 2023-10-27 宁德市天铭新能源汽车配件有限公司 Automobile part production detection method and system based on machine learning
CN117612029A (en) * 2023-12-21 2024-02-27 石家庄铁道大学 Remote sensing image target detection method based on progressive feature smoothing and scale adaptive expansion convolution
CN117612029B (en) * 2023-12-21 2024-05-24 石家庄铁道大学 Remote sensing image target detection method based on progressive feature smoothing and scale adaptive expansion convolution

Similar Documents

Publication Publication Date Title
CN109934200B (en) RGB color remote sensing image cloud detection method and system based on improved M-Net
CN109543606B (en) Human face recognition method with attention mechanism
Ren et al. Unsupervised change detection in satellite images with generative adversarial network
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN115410081A (en) Multi-scale aggregated cloud and cloud shadow identification method, system, equipment and storage medium
CN111079739B (en) Multi-scale attention feature detection method
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN113298815A (en) Semi-supervised remote sensing image semantic segmentation method and device and computer equipment
CN111310766A (en) License plate identification method based on coding and decoding and two-dimensional attention mechanism
CN105243154A (en) Remote sensing image retrieval method and system based on significant point characteristics and spare self-encodings
CN112329771B (en) Deep learning-based building material sample identification method
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN110827312A (en) Learning method based on cooperative visual attention neural network
Chen et al. ASF-Net: Adaptive screening feature network for building footprint extraction from remote-sensing images
Chong et al. Context union edge network for semantic segmentation of small-scale objects in very high resolution remote sensing images
CN113111716A (en) Remote sensing image semi-automatic labeling method and device based on deep learning
Liu et al. Two-stage underwater object detection network using swin transformer
Song et al. PSTNet: Progressive sampling transformer network for remote sensing image change detection
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN111126155A (en) Pedestrian re-identification method for generating confrontation network based on semantic constraint
Wang et al. Detection of SAR image multiscale ship targets in complex inshore scenes based on improved YOLOv5
Wu et al. AMR-Net: Arbitrary-oriented ship detection using attention module, multi-scale feature fusion and rotation pseudo-label
Gao et al. PE-Transformer: Path enhanced transformer for improving underwater object detection
CN117079095A (en) Deep learning-based high-altitude parabolic detection method, system, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination