CN113469094A - Multi-mode remote sensing data depth fusion-based earth surface coverage classification method - Google Patents
Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
- Publication number
- CN113469094A CN113469094A CN202110787839.4A CN202110787839A CN113469094A CN 113469094 A CN113469094 A CN 113469094A CN 202110787839 A CN202110787839 A CN 202110787839A CN 113469094 A CN113469094 A CN 113469094A
- Authority
- CN
- China
- Prior art keywords
- fusion
- convolution
- module
- attention
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a surface coverage classification method based on multi-modal remote sensing data depth fusion, which comprises the following steps: (1) constructing a remote sensing image semantic segmentation network based on multi-modal information fusion; the network comprises an encoder for extracting ground-object features, a depth feature fusion module, a spatial pyramid module and an up-sampling decoder; the depth feature fusion module comprises an ACF3 module and a CACF3 module which simultaneously fuse RGB, DSM and NDVI modal information, wherein the ACF3 module is a self-attention convolution fusion module based on the transformer and convolution, and the CACF3 module is a cross-modal convolution fusion module based on the transformer and convolution; (2) training the network constructed in step (1); (3) predicting the ground-object categories of remote sensing images with the network model trained in step (2). Compared with conventional methods, the surface coverage classification method based on multi-modal remote sensing data depth fusion achieves a notable accuracy improvement on surface coverage classification tasks and has broad application prospects.
Description
Technical Field
The invention belongs to the technical field of remote sensing, and relates to a surface coverage classification method based on multi-mode remote sensing data depth fusion.
Background
Surface object (ground feature) classification is an important basis for remote sensing image analysis and application. Continuous earth observation by multiple sensors now yields multi-scale, multi-temporal, multi-azimuth and multi-level surface remote sensing images, providing richer data for accurately describing ground objects. Because the different modalities essentially observe the same ground objects, multi-source remote sensing images remain complementary to one another despite the differences between the modal information. Therefore, classifying ground objects with multiple remote sensing information sources can achieve higher accuracy than classification with single-modality information.
Most existing deep-learning-based classification methods adopt pixel-level, feature-level or decision-level fusion, and they make little use of the latent complementary information among multi-source remote sensing images. If the complementary and redundant information among multi-source remote sensing images is effectively associated and de-redundified to obtain shared high-level features, and the relevant features are extracted and fused step by step at the feature level, high-accuracy remote sensing image classification can be achieved.
On the other hand, in the field of multi-modal semantic segmentation, deep neural networks that perform feature extraction and class prediction with a U-shaped structure (UNet) and residual connections, and that perform feature fusion with a fusion network (FuseNet), have been widely used. Although the convolutional neural network (CNN) achieves excellent performance, the locality of the convolution operation prevents it from learning global, long-range semantic information during deep feature fusion, so it cannot extract and exploit the complementarity between different modalities at a global scale. Many studies have applied the deep self-attention transformation network (transformer) to vision tasks, using its multi-head attention structure for long-range-dependent sequence modeling and transduction; unlike a convolutional layer, whose receptive field covers only a neighborhood, the receptive field of a self-attention layer is always global. This distinctive property makes it better suited than a convolutional network to extracting and fusing features of different dimensions. However, existing methods have not explored the transformer for deep feature fusion: the tightly-connected transformer network (DCST) uses the transformer as a feature extractor and tries to replace the encoder of a classical convolutional network; the transformer network based on a U-shaped network (TransUNet) treats the transformer as a high-dimensional feature extraction module that further processes the features extracted by a classical encoder; and the pure attention network based on a U-shaped network (Swin-Unet) imitates the U-shaped network and performs feature extraction and class prediction with transformers throughout. All of these methods regard the transformer as a feature extraction structure and do not employ it for feature fusion.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a ground surface coverage classification method based on multi-mode remote sensing data depth fusion.
The invention first proposes two attention fusion modules based simultaneously on the transformer and convolution: a self-attention convolution fusion module ACF (self-attention and convolution fusion module) and a cross-modal convolution fusion module CACF (cross-attention and convolution fusion module). Both modules perform feature extraction on top of a convolutional backbone network and can easily be transplanted to other networks. The ACF first maps the information of the different modalities into the same sequence, then extracts the fused features of each channel with the self-attention fusion module, and finally computes channel weights with channel convolutions to emphasize the more relevant channels, thereby fusing the different modalities. The CACF uses a cross-modal attention mechanism to compute the interaction of the two modalities; after the respective outputs are further processed by self-attention, channel convolutions perform the weight calculation.
Based on the ACF and CACF, the invention also provides the frameworks ACF3 and CACF3, in which three modalities are fused simultaneously. On top of the RGB (red, green, blue) and DSM (digital surface model) information fused by the ACF and CACF, this structure further takes the information of another modality into account. ACF3 and CACF3 first compute the remote sensing index NDVI (normalized difference vegetation index), which reflects vegetation growth and vegetation coverage and suppresses part of the noise, and input it as additional modal information. DSM information helps to identify buildings but offers little help for the trees and low vegetation that appear in large numbers in the data set, so the method takes the NDVI index as the third input modality. Considering that the NDVI is more closely related to the RGB image, it is fused in a manner different from the DSM during deep fusion; that is, the universal fusion framework provided by the invention can fuse the RGB, DSM and NDVI modal information simultaneously on the basis of the remote sensing image semantic segmentation network with multi-modal information fusion.
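For reference, the NDVI used as the third modality is the standard normalized difference vegetation index computed from the near-infrared and red bands. A minimal sketch is given below (function and array names are illustrative assumptions, not part of the patent disclosure):

```python
import numpy as np

def compute_ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / (nir + red + eps)  # eps avoids division by zero
```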
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a surface coverage classification method based on multi-modal remote sensing data depth fusion comprises the following steps:
(1) constructing a remote sensing image semantic segmentation network based on multi-mode information fusion;
the remote sensing image semantic segmentation network based on multi-mode information fusion comprises an encoder for extracting surface feature, a depth feature fusion module, a spatial pyramid module and an up-sampling decoder;
the encoder is divided into three branches, namely an RGB branch, a DSM branch and an NDVI branch, wherein the RGB branch serves as the main branch and the DSM and NDVI branches serve as subordinate branches; each branch comprises four convolution blocks, named feature extraction layers, and as the network in each branch deepens, different input feature vectors are obtained after the ground-object features are processed by the convolution blocks; the input feature vectors represent the ground-object features and are denoted uniformly by X, Y, Z for the RGB, DSM and NDVI branches respectively;
the depth feature fusion module comprises an ACF3 module and a CACF3 module which simultaneously fuse the RGB, DSM and NDVI modal information, wherein the ACF3 module is a self-attention convolution fusion module (ACF) based on the transformer (deep self-attention transformation network) and convolution, and the CACF3 module is a cross-modal convolution fusion module (CACF) based on the transformer and convolution; because the NDVI is calculated from the near-infrared band and the red band, when the information of the different branches enters the fusion module, the RGB is first fused directly with the NDVI and is then fused with the DSM information through the attention mechanism;
the self-attention convolution fusion module adopts a self-attention mechanism and comprises one transformer structure and two convolution structures (the convolution structures are located at the tail of the self-attention convolution fusion module); the transformer structure is formed by stacking 8 layers of self-attention fusion modules, and each layer of the self-attention fusion module comprises two normalization layers LN1 and LN2, a multi-layer perceptron MLP and a core self-attention layer Attn; the self-attention fusion formulas are as follows:
s_j = Attn(LN1(XY))
z_j = MLP(LN2(s_j));
wherein j represents the layer index of the self-attention fusion module within the self-attention convolution fusion module (ACF) and depends on the number of stacked attention blocks; s_j represents the fused feature calculated by the self-attention layer Attn; different input feature vectors are obtained after the ground-object features are processed by the convolution blocks, and these vectors represent the ground-object features and correspond to the three branches RGB, DSM and NDVI, denoted X, Y, Z respectively; XY denotes the direct concatenation of the input feature vectors X and Y; LN1(XY) denotes inputting XY into LN1, and Attn(LN1(XY)) denotes inputting the result of LN1(XY) into Attn, which yields the fused feature s_j calculated by the self-attention layer Attn; z_j represents the final result after processing by the self-attention fusion module; LN2(s_j) denotes inputting the fused feature s_j into LN2, and MLP(LN2(s_j)) denotes inputting the result of LN2(s_j) into the MLP, giving the final result processed by the self-attention fusion module;
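A minimal PyTorch-style sketch of one such self-attention fusion layer, consistent with the formulas above, is shown below (the class name, head count and MLP width are illustrative assumptions rather than the patent's implementation):

```python
import torch
import torch.nn as nn

class SelfAttentionFusionLayer(nn.Module):
    """One stacked layer: s_j = Attn(LN1(XY)), z_j = MLP(LN2(s_j))."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        # dim must be divisible by num_heads
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        # xy: concatenated token sequence of the two modalities, shape (B, L, dim)
        normed = self.ln1(xy)                     # LN1(XY)
        s, _ = self.attn(normed, normed, normed)  # s_j = Attn(LN1(XY))
        return self.mlp(self.ln2(s))              # z_j = MLP(LN2(s_j))
```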
after the correlation calculation for fusion, the output will be split and restored to the original size. In the semantic segmentation task of the high-resolution remote sensing image, the information of each pixel point is very important. Whereas if all the information is retained and the calculation of attention is performed, the computational complexity will be the square of the sequence length. If pooling is performed first, and then the interpolation is used to restore the scale after attention is calculated, information will be lost, which will have a serious impact on the result. To solve this problem, the present invention combines the advantages of transformer and convolution, and further extracts channel information by using a convolution module after attention calculation. The convolution block may learn to use global information to emphasize the informational channels and suppress the less useful channels, which helps the AFC3 module to efficiently utilize the informational characteristics of the two branches. Finally, the weighted channels are merged directly and input to the main branch.
The cross-modal convolution fusion module adopts a cross-modal attention mechanism and a self-attention mechanism simultaneously, and comprises three transformer structures and two convolution structures (the convolution structures are located at the tail of the cross-modal convolution fusion module); one transformer structure is formed by stacking four layers of cross-modal fusion modules, and the other two are each formed by stacking four layers of self-attention fusion modules; each self-attention fusion module layer has the same structure as the self-attention fusion module in the self-attention convolution fusion module, and each cross-modal fusion module layer consists of two normalization layers LN1 and LN2, a multi-layer perceptron MLP, and the core cross-modal attention layers Attn_rgb and Attn_d; the cross-modal attention calculation formulas are as follows:
Attn_rgb(Q_rgb, K_d, V_d) = softmax(Q_rgb · K_d^T / √d_k) · V_d
Attn_d(Q_d, K_rgb, V_rgb) = softmax(Q_d · K_rgb^T / √d_k) · V_rgb;
wherein Q, K and V respectively represent three vectors carrying different information, calculated from the input feature map with different linear structures; K^T represents the transpose of K; d_k represents the dimension of the vector K and is used to control the information scale; the subscripts rgb and d indicate that the corresponding quantity is derived from the RGB information or the DSM information; softmax is a known activation function. The cross-modal attention formula therefore means that the feature information calculated from Q and K is scaled by d_k, passed through the activation function softmax, and then multiplied with V to obtain the output; the whole formula computes its result with the information of the complementary modality and describes how the two modalities are fused in the cross-modal fusion module, Attn_rgb(Q_rgb, K_d, V_d) representing the RGB information fused with the DSM information and Attn_d(Q_d, K_rgb, V_rgb) representing the DSM information fused with the RGB information;
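A minimal PyTorch-style sketch of this scaled dot-product cross-attention is given below; queries come from one modality and keys/values from the complementary modality, and the single-head projection layers are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Attn_a(Q_a, K_b, V_b) = softmax(Q_a K_b^T / sqrt(d_k)) V_b."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # different linear layers produce Q, K, V
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x_query: torch.Tensor, x_context: torch.Tensor) -> torch.Tensor:
        # x_query, x_context: token sequences (B, L, dim) of the two modalities
        q = self.q_proj(x_query)
        k = self.k_proj(x_context)
        v = self.v_proj(x_context)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))  # Q K^T / sqrt(d_k)
        return torch.matmul(F.softmax(scores, dim=-1), v)

# Both directions are used, e.g. with two instances:
#   rgb_fused = attn_rgb(x_rgb, x_dsm)   # Attn_rgb(Q_rgb, K_d, V_d)
#   dsm_fused = attn_d(x_dsm, x_rgb)     # Attn_d(Q_d, K_rgb, V_rgb)
```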
the up-sampling decoder consists of three layers of convolution blocks and a classifier. Each convolution block comprises a convolution layer that processes the residual connection and an up-sampling convolution layer that restores the resolution: in each up-sampling convolution layer, the low-resolution feature map from the previous layer is up-sampled by bilinear interpolation to the same resolution as the feature map from the residual connection, the two feature map streams are then added element by element, and finally they are mixed by a 3 × 3 convolution and input to the next layer. The classifier is located at the tail of the up-sampling decoder and is a convolution structure whose output equals the number of classes, so as to realize the final class prediction;
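A minimal sketch of one such decoder block (bilinear up-sampling, element-wise addition of the residual stream, then a 3 × 3 mixing convolution) is shown below; the channel-alignment convolution and channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Upsample, add the residual (skip) feature map element-wise, mix with a 3x3 convolution."""

    def __init__(self, in_channels: int, skip_channels: int, out_channels: int):
        super().__init__()
        self.skip_conv = nn.Conv2d(skip_channels, in_channels, kernel_size=1)  # align skip channels
        self.mix = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        x = x + self.skip_conv(skip)   # element-wise addition of the two streams
        return self.mix(x)             # 3x3 convolution mixes and feeds the next layer
```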
after passing through the feature extraction layer of each branch processing network, the output result is fused by a self-attention convolution fusion module or a cross-mode convolution fusion module, and four values are output and respectively input to an RGB branch, a DSM branch and an NDVI branch and output to a decoder as residual connection, and the process is expressed as the following two feature fusion formulas:
X_i, Y_i, Z_i, skip_i = ACF_i(X_{i-1}, Y_{i-1}, Z_{i-1})
X_i, Y_i, Z_i, skip_i = CACF_i(X_{i-1}, Y_{i-1}, Z_{i-1});
wherein X ∈ R^{3*H*W} represents the input feature vector corresponding to the RGB channels, Y ∈ R^{1*H*W} represents the input feature vector corresponding to the DSM channel, Z ∈ R^{1*H*W} represents the input feature vector corresponding to the NDVI channel, H and W respectively denote the height and width of the input data, H × W is the size of the picture, i denotes the layer at which the fusion module is located, ACF denotes the self-attention convolution fusion module, and CACF denotes the cross-modal convolution fusion module. The feature fusion formulas mean that the self-attention convolution fusion module or the cross-modal convolution fusion module receives the unfused RGB feature information X_{i-1}, DSM feature information Y_{i-1} and NDVI feature information Z_{i-1} of the three modalities as input, and outputs the fused RGB feature information X_i, DSM feature information Y_i, NDVI feature information Z_i, and the residual information skip_i for the decoding stage;
The whole network first extracts features through the encoder, i.e. the input picture is mapped to a feature space represented by vectors, and in this process the information of the different modalities is fused by the depth feature fusion module provided by the invention. At the final stage of the fusion, the spatial pyramid module realizes feature fusion at different scales and outputs a feature map containing rich feature information. In the restoration stage of the whole network, the up-sampling decoder restores the feature map step by step, i.e. the semantically rich visual features in the low-resolution space are up-sampled to the input resolution of the original picture, and at the final stage of the decoder the classifier outputs the prediction of the category of each pixel;
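The overall data flow can be summarized by the sketch below; the class itself, its constructor arguments and the decoder interface are illustrative assumptions that only mirror the stage/fusion/skip structure described above:

```python
import torch
import torch.nn as nn

class MultiModalFusionSegNet(nn.Module):
    """Three-branch encoder with per-stage fusion, a spatial pyramid, and an up-sampling decoder."""

    def __init__(self, rgb_stages, dsm_stages, ndvi_stages, fusion_modules, pyramid, decoder):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)          # 4 convolution blocks (ResNet-34 style)
        self.dsm_stages = nn.ModuleList(dsm_stages)          # 4 convolution blocks (ResNet-18 style)
        self.ndvi_stages = nn.ModuleList(ndvi_stages)        # 4 convolution blocks (ResNet-18 style)
        self.fusion_modules = nn.ModuleList(fusion_modules)  # ACF3 / CACF3 per stage
        self.pyramid = pyramid                               # spatial pyramid module
        self.decoder = decoder                               # up-sampling decoder + classifier

    def forward(self, rgb, dsm, ndvi):
        x, y, z, skips = rgb, dsm, ndvi, []
        for rgb_blk, dsm_blk, ndvi_blk, fuse in zip(
                self.rgb_stages, self.dsm_stages, self.ndvi_stages, self.fusion_modules):
            x, y, z = rgb_blk(x), dsm_blk(y), ndvi_blk(z)    # per-branch feature extraction
            x, y, z, skip = fuse(x, y, z)                    # X_i, Y_i, Z_i, skip_i = fusion(...)
            skips.append(skip)                               # residual connections for the decoder
        feat = self.pyramid(x)                               # multi-scale aggregation
        return self.decoder(feat, skips)                     # per-pixel class prediction
```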
(2) training the remote sensing image semantic segmentation network constructed in the step (1) based on multi-modal information fusion to obtain a trained model;
(3) predicting the ground-object categories of remote sensing images with the model trained in step (2).
As a preferred technical scheme:
in the above method for classifying the earth surface coverage based on multi-modal remote sensing data depth fusion, the calculation formula of the core self-attention layer Attn in the self-attention fusion module is as follows:
Attn(Q, K, V) = softmax(Q · K^T / √d_k) · V;
wherein Q, K and V represent three vectors carrying different information, calculated from the input feature map with different linear structures; K^T represents the transpose of K; d_k represents the dimension of the vector K and is used to control the information scale; softmax is a known activation function; the feature information calculated from Q and K is passed through the activation function softmax and then multiplied with V to obtain the output.
According to the method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion, each volume block is stacked at different depths according to the position of the volume block in the network, the four volume blocks of an RGB branch are stacked in 3, 4, 6 and 3 layers respectively, and the four volume blocks of a DSM branch and an NDVI branch are stacked in 2, 2 and 2 layers respectively.
According to the method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion, the spatial pyramid module is composed of 3 layers of convolution structures, and information is gathered on different scales of 8 × 8, 4 × 4 and 2 × 2. After passing through the four residual error network layers and the fusion module, rich high-level semantic information is hidden in the fused feature map. In order to increase the detection capability of large-scale objects in the image, the invention adopts a spatial pyramid structure, and the alignment grids with different granularities are averaged before upsampling. The decoder is to upsample the semantically rich visual features in the coarse spatial resolution to the input high resolution.
In the above method for classifying the earth surface coverage based on multi-modal remote sensing data depth fusion, the process of training the remote sensing image semantic segmentation network based on multi-modal information fusion is as follows: the parameters are first initialized with the Kaiming (He) initialization method and the encoder branches load trained pre-trained models of the ResNet series (residual networks, a known concept); the model then reads part of the data from the data set, the output obtained after the read data passes through the complete model is compared with the known ground-truth labels, and the comparison is computed with the cross-entropy loss function, whose value reflects the prediction ability of the model on the data set; the model then updates its parameters according to the value of the cross-entropy loss function, and the training procedure is repeated (parameter initialization and loading of the pre-trained models are only needed the first time, to provide the initial parameters of the model; after the first pass over the data the model parameters are updated, and the remaining steps are cycled to read all the data and keep updating the parameters) while the value of the cross-entropy loss function is monitored, until the prediction ability of the model on the whole data set is stable;
the model is a remote sensing image semantic segmentation network based on multi-mode information fusion;
the partial data refers to 4 pieces of data selected from a data set;
when the calculated value of the cross entropy loss function cannot be reduced continuously, the prediction capability of the model on the whole data set is stable.
According to the method for classifying the earth surface coverage based on the multi-mode remote sensing data depth fusion, the formula of the cross entropy loss function is as follows:
wherein, L represents the calculated value of the cross entropy loss function, N represents the number of each training data, N is 4, k is taken from 1 to 4, which represents that 4 pieces of data are respectively calculated and finally averaged, M represents the number of categories of the ground objects, y represents the number of categories of the ground objects, andkcrepresenting the probability of the kth sample belonging to class c, which is provided by the true label, if it is 1, otherwise it is 0, pkcRepresenting the prediction probability of the model for the kth sample belonging to class c.
In the above method for classifying the earth surface coverage based on multi-modal remote sensing data depth fusion, the specific steps of predicting the ground-object categories of remote sensing images with the trained network model are as follows:
(1) processing data to be predicted into data with the size consistent with that of the training data;
(2) reading data by using the trained model;
(3) outputting the prediction result for the category of each pixel with the model, and visualizing the result to obtain the ground-object category prediction map of the remote sensing image.
Advantageous effects:
(1) the invention proposes, for the first time, two attention fusion modules based simultaneously on the transformer and convolution, namely the self-attention convolution fusion module ACF and the cross-modal convolution fusion module CACF, and further provides the frameworks ACF3 and CACF3 in which three modalities are fused simultaneously on the basis of the ACF and CACF; the universal fusion framework provided by the invention can fuse the RGB, DSM and NDVI modal information simultaneously on the basis of the remote sensing image semantic segmentation network with multi-modal information fusion;
(2) compared with conventional methods, the surface coverage classification method based on multi-modal remote sensing data depth fusion achieves a notable accuracy improvement on surface classification tasks and has broad application prospects.
Drawings
FIG. 1 is a schematic diagram of a remote sensing image semantic segmentation network structure based on multi-modal information fusion;
FIG. 2 is a self-attention convolution fusion module;
FIG. 3 is a cross-modal convolution fusion module;
FIG. 4 is a visualized prediction map of the ground-object categories of a remote sensing image.
Detailed Description
The invention will be further illustrated with reference to specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
A surface coverage classification method based on multi-modal remote sensing data depth fusion comprises the following steps:
(1) constructing a model;
the model is a remote sensing image semantic segmentation network based on multi-modal information fusion; its structure is shown in FIG. 1, and it comprises an encoder for extracting ground-object features, a depth feature fusion module, a spatial pyramid module and an up-sampling decoder;
the encoder is divided into three branches: an RGB branch, a DSM branch and an NDVI branch, wherein the RGB branch serves as the main branch and the DSM and NDVI branches serve as subordinate branches; each branch comprises four convolution blocks named feature extraction layers, and as the network in each branch deepens, different input feature vectors are obtained after the ground-object features are processed by the convolution blocks; the input feature vectors represent the ground-object features and are denoted uniformly by X, Y, Z for the RGB, DSM and NDVI branches; each convolution block is stacked to a different depth according to its position in the network, the four convolution blocks of the RGB branch being stacked with 3, 4, 6 and 3 layers respectively, and the four convolution blocks of the DSM branch and of the NDVI branch each being stacked with 2, 2, 2 and 2 layers respectively;
the depth feature fusion module comprises an ACF3 module and a CACF3 module which simultaneously fuse the RGB, DSM and NDVI modal information; the ACF3 module is a self-attention convolution fusion module (ACF) based on the transformer (deep self-attention transformation network) and convolution, and the CACF3 module is a cross-modal convolution fusion module (CACF) based on the transformer and convolution; the self-attention convolution fusion module adopts a self-attention mechanism and comprises one transformer structure and two convolution structures (the convolution structures are located at the tail of the self-attention convolution fusion module), the transformer structure being formed by stacking 8 layers of self-attention fusion modules, each of which comprises two normalization layers LN1 and LN2, a multi-layer perceptron MLP and a core self-attention layer Attn; as shown in FIG. 2, when the information of the different modalities enters the fusion module, the ACF3 module first connects X and Y and maps them to the same sequence space through position coding, thereby converting the picture into a sequence, and the sequence is input into the transformer structure of the self-attention convolution fusion module;
the self-attention fusion formulas are as follows:
s_j = Attn(LN1(XY))
z_j = MLP(LN2(s_j));
wherein j represents the layer index of the self-attention fusion module within the self-attention convolution fusion module (ACF) and depends on the number of stacked attention blocks; s_j represents the fused feature calculated by the self-attention layer Attn; XY represents the direct concatenation of the input feature vectors X and Y; LN1(XY) represents inputting XY into LN1, and Attn(LN1(XY)) represents inputting the result of LN1(XY) into Attn, which yields the fused feature s_j after the self-attention calculation; z_j represents the final result after processing by the self-attention fusion module; LN2(s_j) represents inputting the fused feature s_j into LN2, and MLP(LN2(s_j)) represents inputting the result of LN2(s_j) into the MLP;
the calculation formula of the core self-attention layer Attn in the self-attention fusion module is as follows:
Attn(Q, K, V) = softmax(Q · K^T / √d_k) · V;
wherein Q, K and V represent three vectors carrying different information, calculated from the input feature map with different linear structures; K^T represents the transpose of K; d_k represents the dimension of the vector K and is used to control the information scale; the feature information calculated from Q and K is passed through the activation function softmax and then multiplied with V to obtain the output;
the overall structure of the CACF3 module is shown in FIG. 3. The cross-modal convolution fusion module simultaneously adopts a cross-modal attention mechanism and a self-attention mechanism, and comprises three transformer structures and two convolution structures (the convolution structures are positioned at the tail part of the cross-modal convolution fusion module), wherein one transformer structure is formed by stacking four layers of cross-modal fusion modules, the other two transformer structures are respectively formed by stacking four layers of self-attention fusion modules, each layer of self-attention fusion module has the same structure as the self-attention fusion module in the self-attention convolution fusion module, and each layer of cross-modal fusion module comprises two normalization layers LN1 and LN2, a multilayer perceptron MLP and a core cross-modal attention layer AttnrgbAnd AttndComposition is carried out; since the ACF3 module needs to connect the information of the two modalities, the sequence length that needs to be processed in its attention module will be twice that of the CACF3 module; the cross-modal attention calculation formula is as follows:
wherein Q, K, V represent three vectors representing different information calculated from the input feature map and different linear structures, KTRepresents the transposed vector of K, dkRepresenting the dimensions of the vector K to control the information scale, all the subscripts RGB and d indicate that the calculation is derived from RGB information and DSM information, so that the cross-modal attention calculation formula represents that the feature information calculated by Q and K is represented by dkAfter controlling the scale of the model, the model is calculated with V after the model is activated by softmax to obtain output, the whole formula adopts the information of complementary modes to calculate the result, the cross-mode attention calculation formula explains the mode of characteristic fusion of the two modes in the cross-mode fusion module, Attnrgb(Qrgb,Kd,Vd) Representing RGB information, Attn, incorporating DSM informationd(Qd,Krgb,Vrgb) Represents DSM information fused with RGB information;
the spatial pyramid module collects the fused RGB-DSM-NDVI features from the three branches and generates a feature map with multi-scale information, the spatial pyramid module consists of 3 layers of convolution structures, and information is gathered on different scales of 8 × 8, 4 × 4 and 2 × 2;
the up-sampling decoder consists of three layers of convolution blocks and a classifier, wherein each convolution block comprises a convolution layer for processing residual connection and an up-sampling convolution layer for restoring resolution, the classifier is positioned at the tail part of the up-sampling decoder, and the classifier is a convolution structure with the output as class number so as to realize final class prediction;
after passing through the feature extraction layer of each branch processing network, the output result is fused by a self-attention convolution fusion module or a cross-mode convolution fusion module, and four values are output and respectively input to an RGB branch, a DSM branch and an NDVI branch and output to a decoder as residual connection, and the process is expressed as the following two feature fusion formulas:
X_i, Y_i, Z_i, skip_i = ACF_i(X_{i-1}, Y_{i-1}, Z_{i-1})
X_i, Y_i, Z_i, skip_i = CACF_i(X_{i-1}, Y_{i-1}, Z_{i-1});
wherein X ∈ R^{3*H*W} represents the input feature vector corresponding to the RGB branch, Y ∈ R^{1*H*W} represents the input feature vector corresponding to the DSM branch, Z ∈ R^{1*H*W} represents the input feature vector corresponding to the NDVI branch, H and W respectively denote the height and width of the input data, H × W is the size of the picture, i denotes the layer at which the fusion module is located, ACF denotes the self-attention convolution fusion module and CACF denotes the cross-modal convolution fusion module. The feature fusion formulas mean that the self-attention convolution fusion module or the cross-modal convolution fusion module receives the unfused RGB feature information X_{i-1}, DSM feature information Y_{i-1} and NDVI feature information Z_{i-1} of the three modalities as input, and outputs the fused RGB feature information X_i, DSM feature information Y_i, NDVI feature information Z_i, and the residual information skip_i for the decoding stage;
(2) Training a model;
for RGB branches, a residual error network Resnet-34 pre-training network is adopted to assign initial values to the RGB branches, a residual error network Resnet-18 is adopted for DSM and NDVI branches to assign initial values to the DSM and NDVI branches, and other network parameters are assigned by using a Hommin initialization method. The model firstly initializes parameters by a Hommin method, a decoder loads a ResNet series trained pre-training model, then the model reads 4 data from a data set, the output of the read data after being processed by a complete model is compared with a known real label, the comparison process is calculated by a cross entropy loss function, and the cross entropy loss function formula is as follows:wherein, L represents the calculated value of the cross entropy loss function, N represents the number of each training data, N is 4, k is taken from 1 to 4, which represents that 4 pieces of data are respectively calculated and finally averaged, M represents the number of categories of the ground objects, y represents the number of categories of the ground objects, andkcrepresenting the probability of the kth sample belonging to class c, taking 1 if it is, and 0 if it is not, the value being provided by the true label, pkcRepresenting the prediction probability of the model for the kth sample belonging to the class c, the calculation result shows the prediction capability of the model for the data set, and then the model is based on the loss functionThe calculated value of the cross entropy loss function is subjected to parameter updating, a training process is repeated (the whole training process except for parameter initialization and pre-training model loading needs two steps of parameter initialization and pre-training model loading to give initial parameters of the model at first, but after first-time data, the model parameters are updated, other steps are circulated subsequently to read all data and perform parameter updating on the model), the calculated value of the cross entropy loss function is observed, and the prediction capability of the model on the whole data set is stable until the calculated value of the cross entropy loss function cannot be reduced continuously; in the training process, parameter optimization is carried out by using an Adaptive momentum Estimation (Adam) algorithm, and the learning rate is set to be 4 multiplied by 10-4The number of training iterations is 40, the batch size is 4, and experiments show that the number of iterations can enable the model to be converged;
(3) predicting the ground object class of the remote sensing image by using the model trained in the step (2);
(3.1) processing the data to be predicted into data of the same size as the training data;
the three modal data required by the invention are the RGB three-band data, the DSM data and the NDVI index, where the NDVI index is calculated from the red band and the near-infrared band. The experiments use the ISPRS Potsdam two-dimensional semantic labeling contest data set, which includes 38 four-channel pictures of 6000 × 6000 pixels and the corresponding DSM data. Considering the available computing power and the required accuracy, in the data processing stage all data are cropped to 256 × 256 with an overlap ratio of 0.5 and the corresponding NDVI modal data are calculated. Meanwhile, since many tiles contain essentially a single label category, the data are cleaned according to the rule that a tile in which no automobile, water-surface or building category exists and some other category accounts for more than 80 percent is discarded, and the training set is finally obtained;
(3.2) reading data by using the trained model;
(3.3) the model outputs the prediction result for the category of each pixel, and the result is visualized to obtain the ground-object category prediction map of the remote sensing image, as shown in FIG. 4. The prediction results are listed in Table 1.
TABLE 1 precision comparison
The test performance is evaluated by the overall accuracy, i.e. the percentage of correctly classified pixels among all pixels; for each class, the ratio of the pixels correctly predicted as that class to all pixels of that class is also computed. Table 1 illustrates the accuracy improvement of the invention over conventional methods on the surface classification task. The numeric suffix in the method column of the table denotes the number of modalities: 2 indicates that the training data are RGB and DSM, and 3 indicates that the training data are augmented with NDVI. As with other methods, the comparison mainly concerns the accuracy of the five major categories and the overall accuracy. The results show that after the ACF is used to fuse the DSM information, all categories except vehicles are improved, most notably trees by 1.25%; the CACF module fuses the depth features better and improves all categories, with the building category improved by 0.44%. Meanwhile, ACF3, which fuses DSM and NDVI, greatly improves trees by 4.11%, showing that the NDVI index is very helpful for recognizing taller trees; CACF3 further improves the judgement of buildings and trees by 0.56% and 2.51% respectively. The overall accuracy of the ACF3 fusion method is improved by 0.36%, and that of CACF3 by 0.40%. In summary, the invention achieves a significant improvement over conventional methods in the field of deep feature fusion.
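A minimal sketch of these two metrics (overall accuracy and per-class accuracy over predicted and ground-truth label maps) is given below; the function and array names are illustrative assumptions:

```python
import numpy as np

def overall_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Percentage of correctly classified pixels among all pixels."""
    return float((pred == gt).sum()) / gt.size

def per_class_accuracy(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> list:
    """For each class, the ratio of correctly predicted pixels to all pixels of that class."""
    accs = []
    for c in range(num_classes):
        mask = (gt == c)
        accs.append(float((pred[mask] == c).sum()) / max(int(mask.sum()), 1))
    return accs
```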
Claims (7)
1. A surface coverage classification method based on multi-modal remote sensing data depth fusion is characterized by comprising the following steps:
(1) constructing a remote sensing image semantic segmentation network based on multi-mode information fusion;
the remote sensing image semantic segmentation network based on multi-mode information fusion comprises an encoder for extracting surface feature, a depth feature fusion module, a spatial pyramid module and an up-sampling decoder;
the encoder is divided into an RGB branch, a DSM branch and an NDVI branch, wherein the RGB branch is used as a main branch, the DSM branch and the NDVI branch are used as subordinate branches, and each branch comprises four convolution blocks;
the depth feature fusion module comprises an ACF3 module and a CACF3 module, wherein the ACF3 module and the CACF3 module are used for simultaneously fusing information of three modes, namely RGB, DSM and NDVI, the ACF3 module is a self-attention convolution fusion module based on a transformer and convolution, and the CACF3 module is a cross-mode convolution fusion module based on a transformer and convolution;
the self-attention convolution fusion module adopts a self-attention mechanism and comprises a transformer structure and two convolution structures, wherein the transformer structure is formed by stacking 8 layers of self-attention fusion modules, each layer of self-attention fusion module comprises two normalization layers LN1 and LN2, a multilayer perceptron MLP and a core self-attention layer Attn; the self-attention fusion formula is as follows:
sj=Attn(LN1(XY))
zj=MLP(LN2(sj));
wherein j represents the layer index of the self-attention fusion module within the self-attention convolution fusion module; s_j represents the fused feature calculated by the self-attention layer Attn; different input feature vectors are obtained after the ground-object features are processed by the convolution blocks, and the input feature vectors correspond to the three branches RGB, DSM and NDVI, denoted X, Y, Z respectively; XY represents the direct concatenation of the input feature vectors X and Y; LN1(XY) represents inputting XY into LN1, and Attn(LN1(XY)) represents inputting the result of LN1(XY) into Attn; z_j represents the final result after processing by the self-attention fusion module; LN2(s_j) represents inputting the fused feature s_j into LN2, and MLP(LN2(s_j)) represents inputting the result of LN2(s_j) into the MLP;
the cross-modal convolution fusion module simultaneously adopts a cross-modal attention mechanism and a self-attention mechanism and comprises three transformer structures and two convolution structures, wherein one transformer structure is formed by stacking four layers of cross-modal fusion modules and the other two transformer structures are each formed by stacking four layers of self-attention fusion modules, each layer of the cross-modal fusion module consisting of two normalization layers LN1 and LN2, a multi-layer perceptron MLP and the core cross-modal attention layers Attn_rgb and Attn_d; the cross-modal attention calculation formulas are as follows:
Attn_rgb(Q_rgb, K_d, V_d) = softmax(Q_rgb · K_d^T / √d_k) · V_d
Attn_d(Q_d, K_rgb, V_rgb) = softmax(Q_d · K_rgb^T / √d_k) · V_rgb;
wherein Q, K and V respectively represent three vectors carrying different information, calculated from the input feature map with different linear structures, K^T represents the transpose of K, d_k represents the dimension of the vector K, and the subscripts rgb and d indicate that the corresponding quantity is derived from the RGB information or the DSM information; the cross-modal attention calculation formulas indicate the way in which the two modalities are feature-fused in the cross-modal fusion module, Attn_rgb(Q_rgb, K_d, V_d) representing the RGB information fused with the DSM information and Attn_d(Q_d, K_rgb, V_rgb) representing the DSM information fused with the RGB information;
the up-sampling decoder consists of three layers of convolution blocks and a classifier, wherein each convolution block comprises a convolution layer for processing residual connection and an up-sampling convolution layer for restoring resolution, the classifier is positioned at the tail part of the up-sampling decoder, and the classifier is a convolution structure with the output as class number;
after passing through the feature extraction layer of each branch processing network, the output result is fused by a self-attention convolution fusion module or a cross-mode convolution fusion module, and four values are output and respectively input to an RGB branch, a DSM branch and an NDVI branch and output to a decoder as residual connection, and the process is expressed as the following two feature fusion formulas:
X_i, Y_i, Z_i, skip_i = ACF_i(X_{i-1}, Y_{i-1}, Z_{i-1})
X_i, Y_i, Z_i, skip_i = CACF_i(X_{i-1}, Y_{i-1}, Z_{i-1});
wherein X ∈ R^{3*H*W} represents the input feature vector corresponding to the RGB branch, Y ∈ R^{1*H*W} represents the input feature vector corresponding to the DSM branch, Z ∈ R^{1*H*W} represents the input feature vector corresponding to the NDVI branch, H and W respectively indicate the height and width of the input data, i represents the layer at which the fusion module is located, ACF represents the self-attention convolution fusion module and CACF represents the cross-modal convolution fusion module; the feature fusion formulas indicate that the self-attention convolution fusion module or the cross-modal convolution fusion module receives the unfused RGB feature information X_{i-1}, DSM feature information Y_{i-1} and NDVI feature information Z_{i-1} of the three modalities as input, and outputs the fused RGB feature information X_i, DSM feature information Y_i, NDVI feature information Z_i and the residual information skip_i for the decoding stage;
(2) Training the remote sensing image semantic segmentation network constructed in the step (1) based on multi-modal information fusion to obtain a trained model;
(3) predicting the ground-object categories of remote sensing images with the model trained in step (2).
2. The method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion as claimed in claim 1, wherein the calculation formula of the core self-attention layer Attn in the self-attention fusion module is as follows:
Attn(Q, K, V) = softmax(Q · K^T / √d_k) · V;
wherein Q, K and V represent three vectors carrying different information, calculated from the input feature map with different linear structures, K^T represents the transpose of K, and d_k represents the dimension of the vector K.
3. The method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion of claim 2, wherein the four convolution blocks of the RGB branch are stacked with 3, 4, 6 and 3 layers respectively, and the four convolution blocks of the DSM branch and of the NDVI branch are each stacked with 2, 2, 2 and 2 layers respectively.
4. The method as claimed in claim 3, wherein the spatial pyramid module is composed of 3 layers of convolution structure, and information is collected on different scales of 8 x 8, 4 x 4 and 2 x 2.
5. The method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion as claimed in claim 1, wherein the process of training the remote sensing image semantic segmentation network based on multi-modal information fusion is as follows: initializing the parameters with the Kaiming (He) initialization method, loading trained pre-trained models of the ResNet series, then reading partial data from the data set with the model, comparing the output obtained after the read data passes through the complete model with the known ground-truth labels, computing the comparison with the cross-entropy loss function, updating the parameters of the model according to the calculated value of the cross-entropy loss function, repeating the training process, and monitoring the calculated value of the cross-entropy loss function until the prediction ability of the model on the whole data set is stable;
the model is a remote sensing image semantic segmentation network based on multi-mode information fusion;
the partial data refers to 4 pieces of data selected from a data set;
when the calculated value of the cross entropy loss function cannot be reduced continuously, the prediction capability of the model on the whole data set is stable.
6. The method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion as claimed in claim 5, wherein the cross-entropy loss function formula is as follows:
L = -(1/N) Σ_{k=1}^{N} Σ_{c=1}^{M} y_kc · log(p_kc);
wherein L represents the calculated value of the cross-entropy loss function, N represents the number of pieces of data in each training batch, N = 4, M represents the number of ground-object categories, y_kc represents the true probability that the k-th sample belongs to class c, and p_kc represents the prediction probability of the model that the k-th sample belongs to class c.
7. The method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion as claimed in claim 6, wherein the specific steps of predicting the remote sensing image surface feature category by using the trained network model are as follows:
(1) processing data to be predicted into data with the size consistent with that of the training data;
(2) reading data by using the trained model;
(3) outputting the prediction result for the category of each pixel with the model, and visualizing the result to obtain the ground-object category prediction map of the remote sensing image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110787839.4A CN113469094B (en) | 2021-07-13 | 2021-07-13 | Surface coverage classification method based on multi-mode remote sensing data depth fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110787839.4A CN113469094B (en) | 2021-07-13 | 2021-07-13 | Surface coverage classification method based on multi-mode remote sensing data depth fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113469094A true CN113469094A (en) | 2021-10-01 |
CN113469094B CN113469094B (en) | 2023-12-26 |
Family
ID=77879924
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110787839.4A Active CN113469094B (en) | 2021-07-13 | 2021-07-13 | Surface coverage classification method based on multi-mode remote sensing data depth fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113469094B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108230329A (en) * | 2017-12-18 | 2018-06-29 | 孙颖 | Semantic segmentation method based on multiple dimensioned convolutional neural networks |
KR20210063016A (en) * | 2019-11-22 | 2021-06-01 | 주식회사 공간정보 | method for extracting corp cultivation area and program thereof |
CN111259828A (en) * | 2020-01-20 | 2020-06-09 | 河海大学 | High-resolution remote sensing image multi-feature-based identification method |
CN111597830A (en) * | 2020-05-20 | 2020-08-28 | 腾讯科技(深圳)有限公司 | Multi-modal machine learning-based translation method, device, equipment and storage medium |
CN111985369A (en) * | 2020-08-07 | 2020-11-24 | 西北工业大学 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
CN112183360A (en) * | 2020-09-29 | 2021-01-05 | 上海交通大学 | Lightweight semantic segmentation method for high-resolution remote sensing image |
CN112287940A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Semantic segmentation method of attention mechanism based on deep learning |
Non-Patent Citations (3)
Title |
---|
李万琦; 李克俭; 陈少波: "Semantic segmentation method for high-resolution remote sensing images based on multi-modal fusion", Journal of South-Central University for Nationalities (Natural Science Edition), no. 04 *
李志峰; 张家硕; 洪宇; 尉桢楷; 姚建民: "Multi-modal neural machine translation incorporating a coverage mechanism", Journal of Chinese Information Processing, no. 03 *
李瑶: "Fusion classification method for multi-modal remote sensing images with small samples", China Masters' Theses Full-text Database, Information Science and Technology *
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610070A (en) * | 2021-10-11 | 2021-11-05 | 中国地质环境监测院(自然资源部地质灾害技术指导中心) | Landslide disaster identification method based on multi-source data fusion |
CN113920262A (en) * | 2021-10-15 | 2022-01-11 | 中国矿业大学(北京) | Mining area FVC calculation method and system for enhancing edge sampling and improving Unet model |
CN114581459A (en) * | 2022-02-08 | 2022-06-03 | 浙江大学 | Improved 3D U-Net model-based segmentation method for image region of interest of preschool child lung |
WO2023154137A1 (en) * | 2022-02-10 | 2023-08-17 | Qualcomm Incorporated | System and method for performing semantic image segmentation |
CN114663733A (en) * | 2022-02-18 | 2022-06-24 | 北京百度网讯科技有限公司 | Method, device, equipment, medium and product for fusing multi-modal features |
CN114565816A (en) * | 2022-03-03 | 2022-05-31 | 中国科学技术大学 | Multi-modal medical image fusion method based on global information fusion |
CN114565816B (en) * | 2022-03-03 | 2024-04-02 | 中国科学技术大学 | Multi-mode medical image fusion method based on global information fusion |
CN114638964A (en) * | 2022-03-07 | 2022-06-17 | 厦门大学 | Cross-domain three-dimensional point cloud segmentation method based on deep learning and storage medium |
CN114419449B (en) * | 2022-03-28 | 2022-06-24 | 成都信息工程大学 | Self-attention multi-scale feature fusion remote sensing image semantic segmentation method |
CN114419449A (en) * | 2022-03-28 | 2022-04-29 | 成都信息工程大学 | Self-attention multi-scale feature fusion remote sensing image semantic segmentation method |
CN114943893A (en) * | 2022-04-29 | 2022-08-26 | 南京信息工程大学 | Feature enhancement network for land coverage classification |
CN114943893B (en) * | 2022-04-29 | 2023-08-18 | 南京信息工程大学 | Feature enhancement method for land coverage classification |
CN114723951A (en) * | 2022-06-08 | 2022-07-08 | 成都信息工程大学 | Method for RGB-D image segmentation |
CN115345886A (en) * | 2022-10-20 | 2022-11-15 | 天津大学 | Brain glioma segmentation method based on multi-modal fusion |
CN115527123A (en) * | 2022-10-21 | 2022-12-27 | 河北省科学院地理科学研究所 | Land cover remote sensing monitoring method based on multi-source feature fusion |
CN115546649B (en) * | 2022-10-24 | 2023-04-18 | 中国矿业大学(北京) | Single-view remote sensing image height estimation and semantic segmentation multi-task prediction method |
CN115546649A (en) * | 2022-10-24 | 2022-12-30 | 中国矿业大学(北京) | Single-view remote sensing image height estimation and semantic segmentation multi-task prediction method |
CN115661681A (en) * | 2022-11-17 | 2023-01-31 | 中国科学院空天信息创新研究院 | Deep learning-based landslide disaster automatic identification method and system |
CN115830469A (en) * | 2022-11-25 | 2023-03-21 | 中国科学院空天信息创新研究院 | Multi-mode feature fusion based landslide and surrounding ground object identification method and system |
CN115578406B (en) * | 2022-12-13 | 2023-04-07 | 四川大学 | CBCT jaw bone region segmentation method and system based on context fusion mechanism |
CN115578406A (en) * | 2022-12-13 | 2023-01-06 | 四川大学 | CBCT jaw bone region segmentation method and system based on context fusion mechanism |
CN115984656A (en) * | 2022-12-19 | 2023-04-18 | 中国科学院空天信息创新研究院 | Multi-mode data fusion method based on special and shared architecture |
CN115984656B (en) * | 2022-12-19 | 2023-06-09 | 中国科学院空天信息创新研究院 | Multi-mode data fusion method based on special and shared architecture |
CN116091890A (en) * | 2023-01-03 | 2023-05-09 | 重庆大学 | Small target detection method, system, storage medium and product based on Transformer
CN116452936B (en) * | 2023-04-22 | 2023-09-29 | 安徽大学 | Rotation target detection method integrating optics and SAR image multi-mode information |
CN116452936A (en) * | 2023-04-22 | 2023-07-18 | 安徽大学 | Rotation target detection method integrating optics and SAR image multi-mode information |
CN116258971A (en) * | 2023-05-15 | 2023-06-13 | 江西啄木蜂科技有限公司 | Multi-source fused forestry remote sensing image intelligent interpretation method |
CN116258971B (en) * | 2023-05-15 | 2023-08-08 | 江西啄木蜂科技有限公司 | Multi-source fused forestry remote sensing image intelligent interpretation method |
CN117372885A (en) * | 2023-09-27 | 2024-01-09 | 中国人民解放军战略支援部队信息工程大学 | Multi-mode remote sensing data change detection method and system based on twin U-Net neural network |
CN117372885B (en) * | 2023-09-27 | 2024-06-25 | 中国人民解放军战略支援部队信息工程大学 | Multi-mode remote sensing data change detection method and system based on twin U-Net neural network |
CN117576483A (en) * | 2023-12-14 | 2024-02-20 | 中国石油大学(华东) | Multisource data fusion ground object classification method based on multiscale convolution self-encoder |
CN117409264A (en) * | 2023-12-16 | 2024-01-16 | 武汉理工大学 | Multi-sensor data fusion robot terrain sensing method based on transformer |
CN117409264B (en) * | 2023-12-16 | 2024-03-08 | 武汉理工大学 | Multi-sensor data fusion robot terrain sensing method based on transformer |
CN117496281A (en) * | 2024-01-03 | 2024-02-02 | 环天智慧科技股份有限公司 | Crop remote sensing image classification method |
CN117496281B (en) * | 2024-01-03 | 2024-03-19 | 环天智慧科技股份有限公司 | Crop remote sensing image classification method |
CN117789042A (en) * | 2024-02-28 | 2024-03-29 | 中国地质大学(武汉) | Road information interpretation method, system and storage medium |
CN117789042B (en) * | 2024-02-28 | 2024-05-14 | 中国地质大学(武汉) | Road information interpretation method, system and storage medium |
CN118397694A (en) * | 2024-04-24 | 2024-07-26 | 大湾区大学(筹) | Training method of video skeleton action recognition model and computer equipment |
CN118644783A (en) * | 2024-08-15 | 2024-09-13 | 云南瀚哲科技有限公司 | Crop remote sensing identification method based on multi-mode deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN113469094B (en) | 2023-12-26 |
Similar Documents
Publication | Title |
---|---|
CN113469094A (en) | Multi-mode remote sensing data depth fusion-based earth surface coverage classification method | |
CN112347859B (en) | Method for detecting significance target of optical remote sensing image | |
CN110111366B (en) | End-to-end optical flow estimation method based on multistage loss | |
CN112926396B (en) | Action identification method based on double-current convolution attention | |
CN113657388B (en) | Image semantic segmentation method for super-resolution reconstruction of fused image | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN113033570B (en) | Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion | |
CN112991350B (en) | RGB-T image semantic segmentation method based on modal difference reduction | |
CN113780149A (en) | Method for efficiently extracting building target of remote sensing image based on attention mechanism | |
CN111178316A (en) | High-resolution remote sensing image land cover classification method based on automatic search of depth architecture | |
CN115035131A (en) | Unmanned aerial vehicle remote sensing image segmentation method and system of U-shaped self-adaptive EST | |
CN117274760A (en) | Infrared and visible light image fusion method based on multi-scale mixed converter | |
CN112766099B (en) | Hyperspectral image classification method for extracting context information from local to global | |
CN110222615A (en) | The target identification method that is blocked based on InceptionV3 network | |
CN115631513B (en) | Transformer-based multi-scale pedestrian re-identification method | |
CN112288772B (en) | Channel attention target tracking method based on online multi-feature selection | |
CN117237559A (en) | Digital twin city-oriented three-dimensional model data intelligent analysis method and system | |
CN111476133A (en) | Unmanned driving-oriented foreground and background codec network target extraction method | |
CN117456182A (en) | Multi-mode fusion remote sensing image semantic segmentation method based on deep learning | |
CN111627055A (en) | Scene depth completion method based on semantic segmentation | |
CN116844004A (en) | Point cloud automatic semantic modeling method for digital twin scene | |
CN118212127A (en) | Misregistration-based physical instruction generation type hyperspectral super-resolution countermeasure method | |
CN115311508A (en) | Single-frame image infrared dim target detection method based on depth U-type network | |
CN112488117B (en) | Point cloud analysis method based on direction-induced convolution | |
CN117727022A (en) | Three-dimensional point cloud target detection method based on transform sparse coding and decoding |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |