CN113469094A - Multi-mode remote sensing data depth fusion-based earth surface coverage classification method - Google Patents
Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
- Publication number
- CN113469094A CN113469094A CN202110787839.4A CN202110787839A CN113469094A CN 113469094 A CN113469094 A CN 113469094A CN 202110787839 A CN202110787839 A CN 202110787839A CN 113469094 A CN113469094 A CN 113469094A
- Authority
- CN
- China
- Prior art keywords
- fusion
- convolution
- module
- attention
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a surface coverage classification method based on multi-modal remote sensing data depth fusion, which comprises the following steps: (1) constructing a remote sensing image semantic segmentation network based on multi-modal information fusion; the network comprises an encoder for extracting ground-object features, a depth feature fusion module, a spatial pyramid module and an up-sampling decoder; the depth feature fusion module comprises an ACF3 module and a CACF3 module which simultaneously fuse RGB, DSM and NDVI modal information, wherein the ACF3 module is a self-attention convolution fusion module based on the transformer and convolution, and the CACF3 module is a cross-modal convolution fusion module based on the transformer and convolution; (2) training the network constructed in step (1); (3) predicting the ground-object categories of remote sensing images with the network model trained in step (2). Compared with conventional methods, the surface coverage classification method based on multi-modal remote sensing data depth fusion achieves a notable accuracy improvement on surface coverage classification tasks and has broad application prospects.
Description
Technical Field
The invention belongs to the technical field of remote sensing, and relates to a surface coverage classification method based on multi-mode remote sensing data depth fusion.
Background
Surface object (ground feature) classification is an important basis for remote sensing image analysis and application. Continuous earth observation by multiple sensors now yields multi-scale, multi-temporal, multi-azimuth and multi-level surface remote sensing images, providing richer data for accurately describing ground objects. Because the different modalities essentially observe the same ground objects, multi-source remote sensing images remain complementary to one another despite the differences between the modal information. Therefore, classifying ground objects with multiple remote sensing information sources can achieve higher accuracy than classification with single-modality information.
Most existing deep-learning-based classification methods adopt pixel-level, feature-level or decision-level fusion, and they make little use of the latent complementary information among multi-source remote sensing images. If the complementary and redundant information among multi-source remote sensing images is effectively associated and de-redundified to obtain shared high-level features, and the relevant features are extracted and fused step by step at the feature level, high-accuracy remote sensing image classification can be achieved.
On the other hand, in the field of multi-modal semantic segmentation, deep neural networks that perform feature extraction and class prediction with a U-shaped structure (UNet) and residual connections, and that perform feature fusion with a fusion network (FuseNet), have been widely used. Although the convolutional neural network (CNN) achieves excellent performance, the locality of the convolution operation prevents it from learning global, long-range semantic information during deep feature fusion, so it cannot extract and exploit the complementarity between different modalities at a global scale. Many studies have applied the deep self-attention transformation network (transformer) to vision tasks, using its multi-head attention structure for long-range-dependent sequence modeling and transduction; unlike a convolutional layer, whose receptive field covers only a neighborhood, the receptive field of a self-attention layer is always global. This distinctive property makes it better suited than a convolutional network to extracting and fusing features of different dimensions. However, existing methods have not explored the transformer for deep feature fusion: the tightly-connected transformer network (DCST) uses the transformer as a feature extractor and tries to replace the encoder of a classical convolutional network; the transformer network based on a U-shaped network (TransUNet) treats the transformer as a high-dimensional feature extraction module that further processes the features extracted by a classical encoder; and the pure attention network based on a U-shaped network (Swin-Unet) imitates the U-shaped network and performs feature extraction and class prediction with transformers throughout. All of these methods regard the transformer as a feature extraction structure and do not employ it for feature fusion.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a ground surface coverage classification method based on multi-mode remote sensing data depth fusion.
The invention first proposes two attention fusion modules based simultaneously on the transformer and convolution: a self-attention convolution fusion module ACF (self-attention and convolution fusion module) and a cross-modal convolution fusion module CACF (cross-attention and convolution fusion module). Both modules perform feature extraction on top of a convolutional backbone network and can easily be transplanted to other networks. The ACF first maps the information of the different modalities into the same sequence, then extracts the fused features of each channel with the self-attention fusion module, and finally computes channel weights with channel convolutions to emphasize the more relevant channels, thereby fusing the different modalities. The CACF uses a cross-modal attention mechanism to compute the interaction of the two modalities; after the respective outputs are further processed by self-attention, channel convolutions perform the weight calculation.
Based on the ACF and CACF, the invention also provides the frameworks ACF3 and CACF3, in which three modalities are fused simultaneously. On top of the RGB (red, green, blue) and DSM (digital surface model) information fused by the ACF and CACF, this structure further takes the information of another modality into account. ACF3 and CACF3 first compute the remote sensing index NDVI (normalized difference vegetation index), which reflects vegetation growth and vegetation coverage and suppresses part of the noise, and input it as additional modal information. DSM information helps to identify buildings but offers little help for the trees and low vegetation that appear in large numbers in the data set, so the method takes the NDVI index as the third input modality. Considering that the NDVI is more closely related to the RGB image, it is fused in a manner different from the DSM during deep fusion; that is, the universal fusion framework provided by the invention can fuse the RGB, DSM and NDVI modal information simultaneously on the basis of the remote sensing image semantic segmentation network with multi-modal information fusion.
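For reference, the NDVI used as the third modality is the standard normalized difference vegetation index computed from the near-infrared and red bands. A minimal sketch is given below (function and array names are illustrative assumptions, not part of the patent disclosure):

```python
import numpy as np

def compute_ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel."""
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / (nir + red + eps)  # eps avoids division by zero
```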
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a surface coverage classification method based on multi-modal remote sensing data depth fusion comprises the following steps:
(1) constructing a remote sensing image semantic segmentation network based on multi-mode information fusion;
the remote sensing image semantic segmentation network based on multi-mode information fusion comprises an encoder for extracting surface feature, a depth feature fusion module, a spatial pyramid module and an up-sampling decoder;
the encoder is divided into three branches, namely an RGB branch, a DSM branch and an NDVI branch, wherein the RGB branch serves as the main branch and the DSM and NDVI branches serve as subordinate branches; each branch comprises four convolution blocks, named feature extraction layers, and as the network in each branch deepens, different input feature vectors are obtained after the ground-object features are processed by the convolution blocks; the input feature vectors represent the ground-object features and are denoted uniformly by X, Y, Z for the RGB, DSM and NDVI branches respectively;
the depth feature fusion module comprises an ACF3 module and a CACF3 module which simultaneously fuse the RGB, DSM and NDVI modal information, wherein the ACF3 module is a self-attention convolution fusion module (ACF) based on the transformer (deep self-attention transformation network) and convolution, and the CACF3 module is a cross-modal convolution fusion module (CACF) based on the transformer and convolution; because the NDVI is calculated from the near-infrared band and the red band, when the information of the different branches enters the fusion module, the RGB is first fused directly with the NDVI and is then fused with the DSM information through the attention mechanism;
the self-attention convolution fusion module adopts a self-attention mechanism and comprises one transformer structure and two convolution structures (the convolution structures are located at the tail of the self-attention convolution fusion module); the transformer structure is formed by stacking 8 layers of self-attention fusion modules, and each layer of the self-attention fusion module comprises two normalization layers LN1 and LN2, a multi-layer perceptron MLP and a core self-attention layer Attn; the self-attention fusion formulas are as follows:
s_j = Attn(LN1(XY))
z_j = MLP(LN2(s_j));
wherein j represents the layer index of the self-attention fusion module within the self-attention convolution fusion module (ACF) and depends on the number of stacked attention blocks; s_j represents the fused feature calculated by the self-attention layer Attn; different input feature vectors are obtained after the ground-object features are processed by the convolution blocks, and these vectors represent the ground-object features and correspond to the three branches RGB, DSM and NDVI, denoted X, Y, Z respectively; XY denotes the direct concatenation of the input feature vectors X and Y; LN1(XY) denotes inputting XY into LN1, and Attn(LN1(XY)) denotes inputting the result of LN1(XY) into Attn, which yields the fused feature s_j calculated by the self-attention layer Attn; z_j represents the final result after processing by the self-attention fusion module; LN2(s_j) denotes inputting the fused feature s_j into LN2, and MLP(LN2(s_j)) denotes inputting the result of LN2(s_j) into the MLP, giving the final result processed by the self-attention fusion module;
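A minimal PyTorch-style sketch of one such self-attention fusion layer, consistent with the formulas above, is shown below (the class name, head count and MLP width are illustrative assumptions rather than the patent's implementation):

```python
import torch
import torch.nn as nn

class SelfAttentionFusionLayer(nn.Module):
    """One stacked layer: s_j = Attn(LN1(XY)), z_j = MLP(LN2(s_j))."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        # dim must be divisible by num_heads
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        # xy: concatenated token sequence of the two modalities, shape (B, L, dim)
        normed = self.ln1(xy)                     # LN1(XY)
        s, _ = self.attn(normed, normed, normed)  # s_j = Attn(LN1(XY))
        return self.mlp(self.ln2(s))              # z_j = MLP(LN2(s_j))
```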
after the correlation calculation for fusion, the output will be split and restored to the original size. In the semantic segmentation task of the high-resolution remote sensing image, the information of each pixel point is very important. Whereas if all the information is retained and the calculation of attention is performed, the computational complexity will be the square of the sequence length. If pooling is performed first, and then the interpolation is used to restore the scale after attention is calculated, information will be lost, which will have a serious impact on the result. To solve this problem, the present invention combines the advantages of transformer and convolution, and further extracts channel information by using a convolution module after attention calculation. The convolution block may learn to use global information to emphasize the informational channels and suppress the less useful channels, which helps the AFC3 module to efficiently utilize the informational characteristics of the two branches. Finally, the weighted channels are merged directly and input to the main branch.
The cross-modal convolution fusion module adopts a cross-modal attention mechanism and a self-attention mechanism simultaneously, and comprises three transformer structures and two convolution structures (the convolution structures are located at the tail of the cross-modal convolution fusion module); one transformer structure is formed by stacking four layers of cross-modal fusion modules, and the other two are each formed by stacking four layers of self-attention fusion modules; each self-attention fusion module layer has the same structure as the self-attention fusion module in the self-attention convolution fusion module, and each cross-modal fusion module layer consists of two normalization layers LN1 and LN2, a multi-layer perceptron MLP, and the core cross-modal attention layers Attn_rgb and Attn_d; the cross-modal attention calculation formulas are as follows:
Attn_rgb(Q_rgb, K_d, V_d) = softmax(Q_rgb · K_d^T / √d_k) · V_d
Attn_d(Q_d, K_rgb, V_rgb) = softmax(Q_d · K_rgb^T / √d_k) · V_rgb;
wherein Q, K and V respectively represent three vectors carrying different information, calculated from the input feature map with different linear structures; K^T represents the transpose of K; d_k represents the dimension of the vector K and is used to control the information scale; the subscripts rgb and d indicate that the corresponding quantity is derived from the RGB information or the DSM information; softmax is a known activation function. The cross-modal attention formula therefore means that the feature information calculated from Q and K is scaled by d_k, passed through the activation function softmax, and then multiplied with V to obtain the output; the whole formula computes its result with the information of the complementary modality and describes how the two modalities are fused in the cross-modal fusion module, Attn_rgb(Q_rgb, K_d, V_d) representing the RGB information fused with the DSM information and Attn_d(Q_d, K_rgb, V_rgb) representing the DSM information fused with the RGB information;
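A minimal PyTorch-style sketch of this scaled dot-product cross-attention is given below; queries come from one modality and keys/values from the complementary modality, and the single-head projection layers are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Attn_a(Q_a, K_b, V_b) = softmax(Q_a K_b^T / sqrt(d_k)) V_b."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # different linear layers produce Q, K, V
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x_query: torch.Tensor, x_context: torch.Tensor) -> torch.Tensor:
        # x_query, x_context: token sequences (B, L, dim) of the two modalities
        q = self.q_proj(x_query)
        k = self.k_proj(x_context)
        v = self.v_proj(x_context)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))  # Q K^T / sqrt(d_k)
        return torch.matmul(F.softmax(scores, dim=-1), v)

# Both directions are used, e.g. with two instances:
#   rgb_fused = attn_rgb(x_rgb, x_dsm)   # Attn_rgb(Q_rgb, K_d, V_d)
#   dsm_fused = attn_d(x_dsm, x_rgb)     # Attn_d(Q_d, K_rgb, V_rgb)
```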
the up-sampling decoder consists of three layers of convolution blocks and a classifier. Each convolution block comprises a convolution layer that processes the residual connection and an up-sampling convolution layer that restores the resolution: in each up-sampling convolution layer, the low-resolution feature map from the previous layer is up-sampled by bilinear interpolation to the same resolution as the feature map from the residual connection, the two feature map streams are then added element by element, and finally they are mixed by a 3 × 3 convolution and input to the next layer. The classifier is located at the tail of the up-sampling decoder and is a convolution structure whose output equals the number of classes, so as to realize the final class prediction;
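A minimal sketch of one such decoder block (bilinear up-sampling, element-wise addition of the residual stream, then a 3 × 3 mixing convolution) is shown below; the channel-alignment convolution and channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Upsample, add the residual (skip) feature map element-wise, mix with a 3x3 convolution."""

    def __init__(self, in_channels: int, skip_channels: int, out_channels: int):
        super().__init__()
        self.skip_conv = nn.Conv2d(skip_channels, in_channels, kernel_size=1)  # align skip channels
        self.mix = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        x = x + self.skip_conv(skip)   # element-wise addition of the two streams
        return self.mix(x)             # 3x3 convolution mixes and feeds the next layer
```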
after passing through the feature extraction layer of each branch processing network, the output result is fused by a self-attention convolution fusion module or a cross-mode convolution fusion module, and four values are output and respectively input to an RGB branch, a DSM branch and an NDVI branch and output to a decoder as residual connection, and the process is expressed as the following two feature fusion formulas:
X_i, Y_i, Z_i, skip_i = ACF_i(X_{i-1}, Y_{i-1}, Z_{i-1})
X_i, Y_i, Z_i, skip_i = CACF_i(X_{i-1}, Y_{i-1}, Z_{i-1});
wherein X ∈ R^{3*H*W} represents the input feature vector corresponding to the RGB channels, Y ∈ R^{1*H*W} represents the input feature vector corresponding to the DSM channel, Z ∈ R^{1*H*W} represents the input feature vector corresponding to the NDVI channel, H and W respectively denote the height and width of the input data, H × W is the size of the picture, i denotes the layer at which the fusion module is located, ACF denotes the self-attention convolution fusion module, and CACF denotes the cross-modal convolution fusion module. The feature fusion formulas mean that the self-attention convolution fusion module or the cross-modal convolution fusion module receives the unfused RGB feature information X_{i-1}, DSM feature information Y_{i-1} and NDVI feature information Z_{i-1} of the three modalities as input, and outputs the fused RGB feature information X_i, DSM feature information Y_i, NDVI feature information Z_i, and the residual information skip_i for the decoding stage;
The whole network first extracts features through the encoder, i.e. the input picture is mapped to a feature space represented by vectors, and in this process the information of the different modalities is fused by the depth feature fusion module provided by the invention. At the final stage of the fusion, the spatial pyramid module realizes feature fusion at different scales and outputs a feature map containing rich feature information. In the restoration stage of the whole network, the up-sampling decoder restores the feature map step by step, i.e. the semantically rich visual features in the low-resolution space are up-sampled to the input resolution of the original picture, and at the final stage of the decoder the classifier outputs the prediction of the category of each pixel;
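The overall data flow can be summarized by the sketch below; the class itself, its constructor arguments and the decoder interface are illustrative assumptions that only mirror the stage/fusion/skip structure described above:

```python
import torch
import torch.nn as nn

class MultiModalFusionSegNet(nn.Module):
    """Three-branch encoder with per-stage fusion, a spatial pyramid, and an up-sampling decoder."""

    def __init__(self, rgb_stages, dsm_stages, ndvi_stages, fusion_modules, pyramid, decoder):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)          # 4 convolution blocks (ResNet-34 style)
        self.dsm_stages = nn.ModuleList(dsm_stages)          # 4 convolution blocks (ResNet-18 style)
        self.ndvi_stages = nn.ModuleList(ndvi_stages)        # 4 convolution blocks (ResNet-18 style)
        self.fusion_modules = nn.ModuleList(fusion_modules)  # ACF3 / CACF3 per stage
        self.pyramid = pyramid                               # spatial pyramid module
        self.decoder = decoder                               # up-sampling decoder + classifier

    def forward(self, rgb, dsm, ndvi):
        x, y, z, skips = rgb, dsm, ndvi, []
        for rgb_blk, dsm_blk, ndvi_blk, fuse in zip(
                self.rgb_stages, self.dsm_stages, self.ndvi_stages, self.fusion_modules):
            x, y, z = rgb_blk(x), dsm_blk(y), ndvi_blk(z)    # per-branch feature extraction
            x, y, z, skip = fuse(x, y, z)                    # X_i, Y_i, Z_i, skip_i = fusion(...)
            skips.append(skip)                               # residual connections for the decoder
        feat = self.pyramid(x)                               # multi-scale aggregation
        return self.decoder(feat, skips)                     # per-pixel class prediction
```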
(2) training the remote sensing image semantic segmentation network constructed in the step (1) based on multi-modal information fusion to obtain a trained model;
(3) predicting the ground-object categories of remote sensing images with the model trained in step (2).
As a preferred technical scheme:
in the above method for classifying the earth surface coverage based on multi-modal remote sensing data depth fusion, the calculation formula of the core self-attention layer Attn in the self-attention fusion module is as follows:
Attn(Q, K, V) = softmax(Q · K^T / √d_k) · V;
wherein Q, K and V represent three vectors carrying different information, calculated from the input feature map with different linear structures; K^T represents the transpose of K; d_k represents the dimension of the vector K and is used to control the information scale; softmax is a known activation function; the feature information calculated from Q and K is passed through the activation function softmax and then multiplied with V to obtain the output.
According to the method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion, each volume block is stacked at different depths according to the position of the volume block in the network, the four volume blocks of an RGB branch are stacked in 3, 4, 6 and 3 layers respectively, and the four volume blocks of a DSM branch and an NDVI branch are stacked in 2, 2 and 2 layers respectively.
According to the method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion, the spatial pyramid module is composed of 3 layers of convolution structures, and information is gathered on different scales of 8 × 8, 4 × 4 and 2 × 2. After passing through the four residual error network layers and the fusion module, rich high-level semantic information is hidden in the fused feature map. In order to increase the detection capability of large-scale objects in the image, the invention adopts a spatial pyramid structure, and the alignment grids with different granularities are averaged before upsampling. The decoder is to upsample the semantically rich visual features in the coarse spatial resolution to the input high resolution.
In the above method for classifying the earth surface coverage based on multi-modal remote sensing data depth fusion, the process of training the remote sensing image semantic segmentation network based on multi-modal information fusion is as follows: the parameters are first initialized with the Kaiming (He) initialization method and the encoder branches load trained pre-trained models of the ResNet series (residual networks, a known concept); the model then reads part of the data from the data set, the output obtained after the read data passes through the complete model is compared with the known ground-truth labels, and the comparison is computed with the cross-entropy loss function, whose value reflects the prediction ability of the model on the data set; the model then updates its parameters according to the value of the cross-entropy loss function, and the training procedure is repeated (parameter initialization and loading of the pre-trained models are only needed the first time, to provide the initial parameters of the model; after the first pass over the data the model parameters are updated, and the remaining steps are cycled to read all the data and keep updating the parameters) while the value of the cross-entropy loss function is monitored, until the prediction ability of the model on the whole data set is stable;
the model is a remote sensing image semantic segmentation network based on multi-mode information fusion;
the partial data refers to 4 pieces of data selected from a data set;
when the calculated value of the cross entropy loss function cannot be reduced continuously, the prediction capability of the model on the whole data set is stable.
According to the method for classifying the earth surface coverage based on the multi-mode remote sensing data depth fusion, the formula of the cross entropy loss function is as follows:
wherein, L represents the calculated value of the cross entropy loss function, N represents the number of each training data, N is 4, k is taken from 1 to 4, which represents that 4 pieces of data are respectively calculated and finally averaged, M represents the number of categories of the ground objects, y represents the number of categories of the ground objects, andkcrepresenting the probability of the kth sample belonging to class c, which is provided by the true label, if it is 1, otherwise it is 0, pkcRepresenting the prediction probability of the model for the kth sample belonging to class c.
In the above method for classifying the earth surface coverage based on multi-modal remote sensing data depth fusion, the specific steps of predicting the ground-object categories of remote sensing images with the trained network model are as follows:
(1) processing data to be predicted into data with the size consistent with that of the training data;
(2) reading data by using the trained model;
(3) outputting the prediction result for the category of each pixel with the model, and visualizing the result to obtain the ground-object category prediction map of the remote sensing image.
Advantageous effects:
(1) the invention proposes, for the first time, two attention fusion modules based simultaneously on the transformer and convolution, namely the self-attention convolution fusion module ACF and the cross-modal convolution fusion module CACF, and further provides the frameworks ACF3 and CACF3 in which three modalities are fused simultaneously on the basis of the ACF and CACF; the universal fusion framework provided by the invention can fuse the RGB, DSM and NDVI modal information simultaneously on the basis of the remote sensing image semantic segmentation network with multi-modal information fusion;
(2) compared with conventional methods, the surface coverage classification method based on multi-modal remote sensing data depth fusion achieves a notable accuracy improvement on surface classification tasks and has broad application prospects.
Drawings
FIG. 1 is a schematic diagram of a remote sensing image semantic segmentation network structure based on multi-modal information fusion;
FIG. 2 is a self-attention convolution fusion module;
FIG. 3 is a cross-modal convolution fusion module;
FIG. 4 is a visualized prediction map of the ground-object categories of a remote sensing image.
Detailed Description
The invention will be further illustrated with reference to specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
A surface coverage classification method based on multi-modal remote sensing data depth fusion comprises the following steps:
(1) constructing a model;
the model is a remote sensing image semantic segmentation network based on multi-modal information fusion; its structure is shown in FIG. 1, and it comprises an encoder for extracting ground-object features, a depth feature fusion module, a spatial pyramid module and an up-sampling decoder;
the encoder is divided into three branches: an RGB branch, a DSM branch and an NDVI branch, wherein the RGB branch serves as the main branch and the DSM and NDVI branches serve as subordinate branches; each branch comprises four convolution blocks named feature extraction layers, and as the network in each branch deepens, different input feature vectors are obtained after the ground-object features are processed by the convolution blocks; the input feature vectors represent the ground-object features and are denoted uniformly by X, Y, Z for the RGB, DSM and NDVI branches; each convolution block is stacked to a different depth according to its position in the network, the four convolution blocks of the RGB branch being stacked with 3, 4, 6 and 3 layers respectively, and the four convolution blocks of the DSM branch and of the NDVI branch each being stacked with 2, 2, 2 and 2 layers respectively;
the depth feature fusion module comprises an ACF3 module and a CACF3 module which simultaneously fuse the RGB, DSM and NDVI modal information; the ACF3 module is a self-attention convolution fusion module (ACF) based on the transformer (deep self-attention transformation network) and convolution, and the CACF3 module is a cross-modal convolution fusion module (CACF) based on the transformer and convolution; the self-attention convolution fusion module adopts a self-attention mechanism and comprises one transformer structure and two convolution structures (the convolution structures are located at the tail of the self-attention convolution fusion module), the transformer structure being formed by stacking 8 layers of self-attention fusion modules, each of which comprises two normalization layers LN1 and LN2, a multi-layer perceptron MLP and a core self-attention layer Attn; as shown in FIG. 2, when the information of the different modalities enters the fusion module, the ACF3 module first connects X and Y and maps them to the same sequence space through position coding, thereby converting the picture into a sequence, and the sequence is input into the transformer structure of the self-attention convolution fusion module;
the self-attention fusion formulas are as follows:
s_j = Attn(LN1(XY))
z_j = MLP(LN2(s_j));
wherein j represents the layer index of the self-attention fusion module within the self-attention convolution fusion module (ACF) and depends on the number of stacked attention blocks; s_j represents the fused feature calculated by the self-attention layer Attn; XY represents the direct concatenation of the input feature vectors X and Y; LN1(XY) represents inputting XY into LN1, and Attn(LN1(XY)) represents inputting the result of LN1(XY) into Attn, which yields the fused feature s_j after the self-attention calculation; z_j represents the final result after processing by the self-attention fusion module; LN2(s_j) represents inputting the fused feature s_j into LN2, and MLP(LN2(s_j)) represents inputting the result of LN2(s_j) into the MLP;
the calculation formula of the core self-attention layer Attn in the self-attention fusion module is as follows:
Attn(Q, K, V) = softmax(Q · K^T / √d_k) · V;
wherein Q, K and V represent three vectors carrying different information, calculated from the input feature map with different linear structures; K^T represents the transpose of K; d_k represents the dimension of the vector K and is used to control the information scale; the feature information calculated from Q and K is passed through the activation function softmax and then multiplied with V to obtain the output;
the overall structure of the CACF3 module is shown in FIG. 3. The cross-modal convolution fusion module simultaneously adopts a cross-modal attention mechanism and a self-attention mechanism, and comprises three transformer structures and two convolution structures (the convolution structures are positioned at the tail part of the cross-modal convolution fusion module), wherein one transformer structure is formed by stacking four layers of cross-modal fusion modules, the other two transformer structures are respectively formed by stacking four layers of self-attention fusion modules, each layer of self-attention fusion module has the same structure as the self-attention fusion module in the self-attention convolution fusion module, and each layer of cross-modal fusion module comprises two normalization layers LN1 and LN2, a multilayer perceptron MLP and a core cross-modal attention layer AttnrgbAnd AttndComposition is carried out; since the ACF3 module needs to connect the information of the two modalities, the sequence length that needs to be processed in its attention module will be twice that of the CACF3 module; the cross-modal attention calculation formula is as follows:
wherein Q, K, V represent three vectors representing different information calculated from the input feature map and different linear structures, KTRepresents the transposed vector of K, dkRepresenting the dimensions of the vector K to control the information scale, all the subscripts RGB and d indicate that the calculation is derived from RGB information and DSM information, so that the cross-modal attention calculation formula represents that the feature information calculated by Q and K is represented by dkAfter controlling the scale of the model, the model is calculated with V after the model is activated by softmax to obtain output, the whole formula adopts the information of complementary modes to calculate the result, the cross-mode attention calculation formula explains the mode of characteristic fusion of the two modes in the cross-mode fusion module, Attnrgb(Qrgb,Kd,Vd) Representing RGB information, Attn, incorporating DSM informationd(Qd,Krgb,Vrgb) Represents DSM information fused with RGB information;
the spatial pyramid module collects the fused RGB-DSM-NDVI features from the three branches and generates a feature map with multi-scale information, the spatial pyramid module consists of 3 layers of convolution structures, and information is gathered on different scales of 8 × 8, 4 × 4 and 2 × 2;
the up-sampling decoder consists of three layers of convolution blocks and a classifier, wherein each convolution block comprises a convolution layer for processing residual connection and an up-sampling convolution layer for restoring resolution, the classifier is positioned at the tail part of the up-sampling decoder, and the classifier is a convolution structure with the output as class number so as to realize final class prediction;
after passing through the feature extraction layer of each branch processing network, the output result is fused by a self-attention convolution fusion module or a cross-mode convolution fusion module, and four values are output and respectively input to an RGB branch, a DSM branch and an NDVI branch and output to a decoder as residual connection, and the process is expressed as the following two feature fusion formulas:
X_i, Y_i, Z_i, skip_i = ACF_i(X_{i-1}, Y_{i-1}, Z_{i-1})
X_i, Y_i, Z_i, skip_i = CACF_i(X_{i-1}, Y_{i-1}, Z_{i-1});
wherein X ∈ R^{3*H*W} represents the input feature vector corresponding to the RGB branch, Y ∈ R^{1*H*W} represents the input feature vector corresponding to the DSM branch, Z ∈ R^{1*H*W} represents the input feature vector corresponding to the NDVI branch, H and W respectively denote the height and width of the input data, H × W is the size of the picture, i denotes the layer at which the fusion module is located, ACF denotes the self-attention convolution fusion module and CACF denotes the cross-modal convolution fusion module. The feature fusion formulas mean that the self-attention convolution fusion module or the cross-modal convolution fusion module receives the unfused RGB feature information X_{i-1}, DSM feature information Y_{i-1} and NDVI feature information Z_{i-1} of the three modalities as input, and outputs the fused RGB feature information X_i, DSM feature information Y_i, NDVI feature information Z_i, and the residual information skip_i for the decoding stage;
(2) Training a model;
for RGB branches, a residual error network Resnet-34 pre-training network is adopted to assign initial values to the RGB branches, a residual error network Resnet-18 is adopted for DSM and NDVI branches to assign initial values to the DSM and NDVI branches, and other network parameters are assigned by using a Hommin initialization method. The model firstly initializes parameters by a Hommin method, a decoder loads a ResNet series trained pre-training model, then the model reads 4 data from a data set, the output of the read data after being processed by a complete model is compared with a known real label, the comparison process is calculated by a cross entropy loss function, and the cross entropy loss function formula is as follows:wherein, L represents the calculated value of the cross entropy loss function, N represents the number of each training data, N is 4, k is taken from 1 to 4, which represents that 4 pieces of data are respectively calculated and finally averaged, M represents the number of categories of the ground objects, y represents the number of categories of the ground objects, andkcrepresenting the probability of the kth sample belonging to class c, taking 1 if it is, and 0 if it is not, the value being provided by the true label, pkcRepresenting the prediction probability of the model for the kth sample belonging to the class c, the calculation result shows the prediction capability of the model for the data set, and then the model is based on the loss functionThe calculated value of the cross entropy loss function is subjected to parameter updating, a training process is repeated (the whole training process except for parameter initialization and pre-training model loading needs two steps of parameter initialization and pre-training model loading to give initial parameters of the model at first, but after first-time data, the model parameters are updated, other steps are circulated subsequently to read all data and perform parameter updating on the model), the calculated value of the cross entropy loss function is observed, and the prediction capability of the model on the whole data set is stable until the calculated value of the cross entropy loss function cannot be reduced continuously; in the training process, parameter optimization is carried out by using an Adaptive momentum Estimation (Adam) algorithm, and the learning rate is set to be 4 multiplied by 10-4The number of training iterations is 40, the batch size is 4, and experiments show that the number of iterations can enable the model to be converged;
(3) predicting the ground object class of the remote sensing image by using the model trained in the step (2);
(3.1) processing the data to be predicted into data of the same size as the training data;
the three modal data required by the invention are the RGB three-band data, the DSM data and the NDVI index, where the NDVI index is calculated from the red band and the near-infrared band. The experiments use the ISPRS Potsdam two-dimensional semantic labeling contest data set, which includes 38 four-channel pictures of 6000 × 6000 pixels and the corresponding DSM data. Considering the available computing power and the required accuracy, in the data processing stage all data are cropped to 256 × 256 with an overlap ratio of 0.5 and the corresponding NDVI modal data are calculated. Meanwhile, since many tiles contain essentially a single label category, the data are cleaned according to the rule that a tile in which no automobile, water-surface or building category exists and some other category accounts for more than 80 percent is discarded, and the training set is finally obtained;
(3.2) reading data by using the trained model;
(3.3) the model outputs the prediction result for the category of each pixel, and the result is visualized to obtain the ground-object category prediction map of the remote sensing image, as shown in FIG. 4. The prediction results are listed in Table 1.
TABLE 1 precision comparison
The test performance is evaluated by the overall accuracy, i.e. the percentage of correctly classified pixels among all pixels; for each class, the ratio of the pixels correctly predicted as that class to all pixels of that class is also computed. Table 1 illustrates the accuracy improvement of the invention over conventional methods on the surface classification task. The numeric suffix in the method column of the table denotes the number of modalities: 2 indicates that the training data are RGB and DSM, and 3 indicates that the training data are augmented with NDVI. As with other methods, the comparison mainly concerns the accuracy of the five major categories and the overall accuracy. The results show that after the ACF is used to fuse the DSM information, all categories except vehicles are improved, most notably trees by 1.25%; the CACF module fuses the depth features better and improves all categories, with the building category improved by 0.44%. Meanwhile, ACF3, which fuses DSM and NDVI, greatly improves trees by 4.11%, showing that the NDVI index is very helpful for recognizing taller trees; CACF3 further improves the judgement of buildings and trees by 0.56% and 2.51% respectively. The overall accuracy of the ACF3 fusion method is improved by 0.36%, and that of CACF3 by 0.40%. In summary, the invention achieves a significant improvement over conventional methods in the field of deep feature fusion.
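A minimal sketch of these two metrics (overall accuracy and per-class accuracy over predicted and ground-truth label maps) is given below; the function and array names are illustrative assumptions:

```python
import numpy as np

def overall_accuracy(pred: np.ndarray, gt: np.ndarray) -> float:
    """Percentage of correctly classified pixels among all pixels."""
    return float((pred == gt).sum()) / gt.size

def per_class_accuracy(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> list:
    """For each class, the ratio of correctly predicted pixels to all pixels of that class."""
    accs = []
    for c in range(num_classes):
        mask = (gt == c)
        accs.append(float((pred[mask] == c).sum()) / max(int(mask.sum()), 1))
    return accs
```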
Claims (7)
1. A surface coverage classification method based on multi-modal remote sensing data depth fusion is characterized by comprising the following steps:
(1) constructing a remote sensing image semantic segmentation network based on multi-mode information fusion;
the remote sensing image semantic segmentation network based on multi-mode information fusion comprises an encoder for extracting surface feature, a depth feature fusion module, a spatial pyramid module and an up-sampling decoder;
the encoder is divided into an RGB branch, a DSM branch and an NDVI branch, wherein the RGB branch is used as a main branch, the DSM branch and the NDVI branch are used as subordinate branches, and each branch comprises four convolution blocks;
the depth feature fusion module comprises an ACF3 module and a CACF3 module, wherein the ACF3 module and the CACF3 module are used for simultaneously fusing information of three modes, namely RGB, DSM and NDVI, the ACF3 module is a self-attention convolution fusion module based on a transformer and convolution, and the CACF3 module is a cross-mode convolution fusion module based on a transformer and convolution;
the self-attention convolution fusion module adopts a self-attention mechanism and comprises a transformer structure and two convolution structures, wherein the transformer structure is formed by stacking 8 layers of self-attention fusion modules, each layer of self-attention fusion module comprises two normalization layers LN1 and LN2, a multilayer perceptron MLP and a core self-attention layer Attn; the self-attention fusion formula is as follows:
sj=Attn(LN1(XY))
zj=MLP(LN2(sj));
wherein j represents the layer index of the self-attention fusion module within the self-attention convolution fusion module; s_j represents the fused feature calculated by the self-attention layer Attn; different input feature vectors are obtained after the ground-object features are processed by the convolution blocks, and the input feature vectors correspond to the three branches RGB, DSM and NDVI, denoted X, Y, Z respectively; XY represents the direct concatenation of the input feature vectors X and Y; LN1(XY) represents inputting XY into LN1, and Attn(LN1(XY)) represents inputting the result of LN1(XY) into Attn; z_j represents the final result after processing by the self-attention fusion module; LN2(s_j) represents inputting the fused feature s_j into LN2, and MLP(LN2(s_j)) represents inputting the result of LN2(s_j) into the MLP;
the cross-modal convolution fusion module simultaneously adopts a cross-modal attention mechanism and a self-attention mechanism and comprises three transformer structures and two convolution structures, wherein one transformer structure is formed by stacking four layers of cross-modal fusion modules and the other two transformer structures are each formed by stacking four layers of self-attention fusion modules, each layer of the cross-modal fusion module consisting of two normalization layers LN1 and LN2, a multi-layer perceptron MLP and the core cross-modal attention layers Attn_rgb and Attn_d; the cross-modal attention calculation formulas are as follows:
Attn_rgb(Q_rgb, K_d, V_d) = softmax(Q_rgb · K_d^T / √d_k) · V_d
Attn_d(Q_d, K_rgb, V_rgb) = softmax(Q_d · K_rgb^T / √d_k) · V_rgb;
wherein Q, K and V respectively represent three vectors carrying different information, calculated from the input feature map with different linear structures, K^T represents the transpose of K, d_k represents the dimension of the vector K, and the subscripts rgb and d indicate that the corresponding quantity is derived from the RGB information or the DSM information; the cross-modal attention calculation formulas indicate the way in which the two modalities are feature-fused in the cross-modal fusion module, Attn_rgb(Q_rgb, K_d, V_d) representing the RGB information fused with the DSM information and Attn_d(Q_d, K_rgb, V_rgb) representing the DSM information fused with the RGB information;
the up-sampling decoder consists of three layers of convolution blocks and a classifier, wherein each convolution block comprises a convolution layer for processing residual connection and an up-sampling convolution layer for restoring resolution, the classifier is positioned at the tail part of the up-sampling decoder, and the classifier is a convolution structure with the output as class number;
after passing through the feature extraction layer of each branch processing network, the output result is fused by a self-attention convolution fusion module or a cross-mode convolution fusion module, and four values are output and respectively input to an RGB branch, a DSM branch and an NDVI branch and output to a decoder as residual connection, and the process is expressed as the following two feature fusion formulas:
X_i, Y_i, Z_i, skip_i = ACF_i(X_{i-1}, Y_{i-1}, Z_{i-1})
X_i, Y_i, Z_i, skip_i = CACF_i(X_{i-1}, Y_{i-1}, Z_{i-1});
wherein X ∈ R^{3*H*W} represents the input feature vector corresponding to the RGB branch, Y ∈ R^{1*H*W} represents the input feature vector corresponding to the DSM branch, Z ∈ R^{1*H*W} represents the input feature vector corresponding to the NDVI branch, H and W respectively indicate the height and width of the input data, i represents the layer at which the fusion module is located, ACF represents the self-attention convolution fusion module and CACF represents the cross-modal convolution fusion module; the feature fusion formulas indicate that the self-attention convolution fusion module or the cross-modal convolution fusion module receives the unfused RGB feature information X_{i-1}, DSM feature information Y_{i-1} and NDVI feature information Z_{i-1} of the three modalities as input, and outputs the fused RGB feature information X_i, DSM feature information Y_i, NDVI feature information Z_i and the residual information skip_i for the decoding stage;
(2) Training the remote sensing image semantic segmentation network constructed in the step (1) based on multi-modal information fusion to obtain a trained model;
(3) predicting the ground-object categories of remote sensing images with the model trained in step (2).
2. The method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion as claimed in claim 1, wherein the calculation formula of the core self-attention layer Attn in the self-attention fusion module is as follows:
Attn(Q, K, V) = softmax(Q · K^T / √d_k) · V;
wherein Q, K and V represent three vectors carrying different information, calculated from the input feature map with different linear structures, K^T represents the transpose of K, and d_k represents the dimension of the vector K.
3. The method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion of claim 2, wherein the four convolution blocks of the RGB branch are stacked with 3, 4, 6 and 3 layers respectively, and the four convolution blocks of the DSM branch and of the NDVI branch are each stacked with 2, 2, 2 and 2 layers respectively.
4. The method as claimed in claim 3, wherein the spatial pyramid module is composed of 3 layers of convolution structure, and information is collected on different scales of 8 x 8, 4 x 4 and 2 x 2.
5. The method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion as claimed in claim 1, wherein the process of training the remote sensing image semantic segmentation network based on multi-modal information fusion is as follows: initializing the parameters with the Kaiming (He) initialization method, loading trained pre-trained models of the ResNet series, then reading partial data from the data set with the model, comparing the output obtained after the read data passes through the complete model with the known ground-truth labels, computing the comparison with the cross-entropy loss function, updating the parameters of the model according to the calculated value of the cross-entropy loss function, repeating the training process, and monitoring the calculated value of the cross-entropy loss function until the prediction ability of the model on the whole data set is stable;
the model is a remote sensing image semantic segmentation network based on multi-mode information fusion;
the partial data refers to 4 pieces of data selected from a data set;
when the calculated value of the cross entropy loss function cannot be reduced continuously, the prediction capability of the model on the whole data set is stable.
6. The method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion as claimed in claim 5, wherein the cross-entropy loss function formula is as follows:
L = -(1/N) Σ_{k=1}^{N} Σ_{c=1}^{M} y_kc · log(p_kc);
wherein L represents the calculated value of the cross-entropy loss function, N represents the number of pieces of data in each training batch, N = 4, M represents the number of ground-object categories, y_kc represents the true probability that the k-th sample belongs to class c, and p_kc represents the prediction probability of the model that the k-th sample belongs to class c.
7. The method for classifying the earth surface coverage based on the multi-modal remote sensing data depth fusion as claimed in claim 6, wherein the specific steps of predicting the remote sensing image surface feature category by using the trained network model are as follows:
(1) processing data to be predicted into data with the size consistent with that of the training data;
(2) reading data by using the trained model;
(3) outputting the prediction result for the category of each pixel with the model, and visualizing the result to obtain the ground-object category prediction map of the remote sensing image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110787839.4A CN113469094B (en) | 2021-07-13 | 2021-07-13 | Surface coverage classification method based on multi-mode remote sensing data depth fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110787839.4A CN113469094B (en) | 2021-07-13 | 2021-07-13 | Surface coverage classification method based on multi-mode remote sensing data depth fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113469094A true CN113469094A (en) | 2021-10-01 |
CN113469094B CN113469094B (en) | 2023-12-26 |
Family
ID=77879924
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110787839.4A Active CN113469094B (en) | 2021-07-13 | 2021-07-13 | Surface coverage classification method based on multi-mode remote sensing data depth fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113469094B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108230329A (en) * | 2017-12-18 | 2018-06-29 | 孙颖 | Semantic segmentation method based on multiple dimensioned convolutional neural networks |
KR20210063016A (en) * | 2019-11-22 | 2021-06-01 | 주식회사 공간정보 | method for extracting corp cultivation area and program thereof |
CN111259828A (en) * | 2020-01-20 | 2020-06-09 | 河海大学 | High-resolution remote sensing image multi-feature-based identification method |
CN111597830A (en) * | 2020-05-20 | 2020-08-28 | 腾讯科技(深圳)有限公司 | Multi-modal machine learning-based translation method, device, equipment and storage medium |
CN111985369A (en) * | 2020-08-07 | 2020-11-24 | 西北工业大学 | Course field multi-modal document classification method based on cross-modal attention convolution neural network |
CN112183360A (en) * | 2020-09-29 | 2021-01-05 | 上海交通大学 | Lightweight semantic segmentation method for high-resolution remote sensing image |
CN112287940A (en) * | 2020-10-30 | 2021-01-29 | 西安工程大学 | Semantic segmentation method of attention mechanism based on deep learning |
Non-Patent Citations (3)
Title |
---|
李万琦; 李克俭; 陈少波: "Semantic segmentation method for high-resolution remote sensing images based on multi-modal fusion", Journal of South-Central University for Nationalities (Natural Science Edition), no. 04 *
李志峰; 张家硕; 洪宇; 尉桢楷; 姚建民: "Multi-modal neural machine translation incorporating a coverage mechanism", Journal of Chinese Information Processing, no. 03 *
李瑶: "Fusion classification method for multi-modal remote sensing images with small samples", China Masters' Theses Full-text Database, Information Science and Technology *
Cited By (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610070A (en) * | 2021-10-11 | 2021-11-05 | 中国地质环境监测院(自然资源部地质灾害技术指导中心) | Landslide disaster identification method based on multi-source data fusion |
CN113920262A (en) * | 2021-10-15 | 2022-01-11 | 中国矿业大学(北京) | Mining area FVC calculation method and system for enhancing edge sampling and improving Unet model |
CN114581459A (en) * | 2022-02-08 | 2022-06-03 | 浙江大学 | Improved 3D U-Net model-based segmentation method for image region of interest of preschool child lung |
WO2023154137A1 (en) * | 2022-02-10 | 2023-08-17 | Qualcomm Incorporated | System and method for performing semantic image segmentation |
CN114663733A (en) * | 2022-02-18 | 2022-06-24 | 北京百度网讯科技有限公司 | Method, device, equipment, medium and product for fusing multi-modal features |
CN114565816A (en) * | 2022-03-03 | 2022-05-31 | 中国科学技术大学 | Multi-modal medical image fusion method based on global information fusion |
CN114565816B (en) * | 2022-03-03 | 2024-04-02 | 中国科学技术大学 | Multi-mode medical image fusion method based on global information fusion |
CN114638964A (en) * | 2022-03-07 | 2022-06-17 | 厦门大学 | Cross-domain three-dimensional point cloud segmentation method based on deep learning and storage medium |
CN114419449B (en) * | 2022-03-28 | 2022-06-24 | 成都信息工程大学 | Self-attention multi-scale feature fusion remote sensing image semantic segmentation method |
CN114419449A (en) * | 2022-03-28 | 2022-04-29 | 成都信息工程大学 | Self-attention multi-scale feature fusion remote sensing image semantic segmentation method |
CN114943893A (en) * | 2022-04-29 | 2022-08-26 | 南京信息工程大学 | Feature enhancement network for land coverage classification |
CN114943893B (en) * | 2022-04-29 | 2023-08-18 | 南京信息工程大学 | Feature enhancement method for land coverage classification |
CN114723951A (en) * | 2022-06-08 | 2022-07-08 | 成都信息工程大学 | Method for RGB-D image segmentation |
CN115345886A (en) * | 2022-10-20 | 2022-11-15 | 天津大学 | Brain glioma segmentation method based on multi-modal fusion |
CN115527123A (en) * | 2022-10-21 | 2022-12-27 | 河北省科学院地理科学研究所 | Land cover remote sensing monitoring method based on multi-source feature fusion |
CN115546649B (en) * | 2022-10-24 | 2023-04-18 | 中国矿业大学(北京) | Single-view remote sensing image height estimation and semantic segmentation multi-task prediction method |
CN115546649A (en) * | 2022-10-24 | 2022-12-30 | 中国矿业大学(北京) | Single-view remote sensing image height estimation and semantic segmentation multi-task prediction method |
CN115661681A (en) * | 2022-11-17 | 2023-01-31 | 中国科学院空天信息创新研究院 | Deep learning-based landslide disaster automatic identification method and system |
CN115830469A (en) * | 2022-11-25 | 2023-03-21 | 中国科学院空天信息创新研究院 | Multi-mode feature fusion based landslide and surrounding ground object identification method and system |
CN115578406B (en) * | 2022-12-13 | 2023-04-07 | 四川大学 | CBCT jaw bone region segmentation method and system based on context fusion mechanism |
CN115578406A (en) * | 2022-12-13 | 2023-01-06 | 四川大学 | CBCT jaw bone region segmentation method and system based on context fusion mechanism |
CN115984656A (en) * | 2022-12-19 | 2023-04-18 | 中国科学院空天信息创新研究院 | Multi-mode data fusion method based on special and shared architecture |
CN115984656B (en) * | 2022-12-19 | 2023-06-09 | 中国科学院空天信息创新研究院 | Multi-mode data fusion method based on special and shared architecture |
CN116091890A (en) * | 2023-01-03 | 2023-05-09 | 重庆大学 | Small target detection method, system, storage medium and product based on Transformer
CN116452936B (en) * | 2023-04-22 | 2023-09-29 | 安徽大学 | Rotation target detection method integrating optics and SAR image multi-mode information |
CN116452936A (en) * | 2023-04-22 | 2023-07-18 | 安徽大学 | Rotation target detection method integrating optics and SAR image multi-mode information |
CN116258971A (en) * | 2023-05-15 | 2023-06-13 | 江西啄木蜂科技有限公司 | Multi-source fused forestry remote sensing image intelligent interpretation method |
CN116258971B (en) * | 2023-05-15 | 2023-08-08 | 江西啄木蜂科技有限公司 | Multi-source fused forestry remote sensing image intelligent interpretation method |
CN117372885A (en) * | 2023-09-27 | 2024-01-09 | 中国人民解放军战略支援部队信息工程大学 | Multi-mode remote sensing data change detection method and system based on twin U-Net neural network |
CN117372885B (en) * | 2023-09-27 | 2024-06-25 | 中国人民解放军战略支援部队信息工程大学 | Multi-mode remote sensing data change detection method and system based on twin U-Net neural network |
CN117576483A (en) * | 2023-12-14 | 2024-02-20 | 中国石油大学(华东) | Multisource data fusion ground object classification method based on multiscale convolution self-encoder |
CN117409264A (en) * | 2023-12-16 | 2024-01-16 | 武汉理工大学 | Multi-sensor data fusion robot terrain sensing method based on transformer |
CN117409264B (en) * | 2023-12-16 | 2024-03-08 | 武汉理工大学 | Multi-sensor data fusion robot terrain sensing method based on transformer |
CN117496281A (en) * | 2024-01-03 | 2024-02-02 | 环天智慧科技股份有限公司 | Crop remote sensing image classification method |
CN117496281B (en) * | 2024-01-03 | 2024-03-19 | 环天智慧科技股份有限公司 | Crop remote sensing image classification method |
CN117789042A (en) * | 2024-02-28 | 2024-03-29 | 中国地质大学(武汉) | Road information interpretation method, system and storage medium |
CN117789042B (en) * | 2024-02-28 | 2024-05-14 | 中国地质大学(武汉) | Road information interpretation method, system and storage medium |
CN118397694A (en) * | 2024-04-24 | 2024-07-26 | 大湾区大学(筹) | Training method of video skeleton action recognition model and computer equipment |
CN118644783A (en) * | 2024-08-15 | 2024-09-13 | 云南瀚哲科技有限公司 | Crop remote sensing identification method based on multi-mode deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN113469094B (en) | 2023-12-26 |
Similar Documents
Publication | Title |
---|---|
CN113469094A (en) | Multi-mode remote sensing data depth fusion-based earth surface coverage classification method | |
CN112347859B (en) | Method for detecting significance target of optical remote sensing image | |
CN110111366B (en) | End-to-end optical flow estimation method based on multistage loss | |
CN112926396B (en) | Action identification method based on double-current convolution attention | |
CN113657388B (en) | Image semantic segmentation method for super-resolution reconstruction of fused image | |
CN112396607B (en) | Deformable convolution fusion enhanced street view image semantic segmentation method | |
CN113033570B (en) | Image semantic segmentation method for improving void convolution and multilevel characteristic information fusion | |
CN112991350B (en) | RGB-T image semantic segmentation method based on modal difference reduction | |
CN113780149A (en) | Method for efficiently extracting building target of remote sensing image based on attention mechanism | |
CN111178316A (en) | High-resolution remote sensing image land cover classification method based on automatic search of depth architecture | |
CN115035131A (en) | Unmanned aerial vehicle remote sensing image segmentation method and system of U-shaped self-adaptive EST | |
CN117274760A (en) | Infrared and visible light image fusion method based on multi-scale mixed converter | |
CN112766099B (en) | Hyperspectral image classification method for extracting context information from local to global | |
CN110222615A (en) | The target identification method that is blocked based on InceptionV3 network | |
CN115631513B (en) | Transformer-based multi-scale pedestrian re-identification method | |
CN112288772B (en) | Channel attention target tracking method based on online multi-feature selection | |
CN117237559A (en) | Digital twin city-oriented three-dimensional model data intelligent analysis method and system | |
CN111476133A (en) | Unmanned driving-oriented foreground and background codec network target extraction method | |
CN117456182A (en) | Multi-mode fusion remote sensing image semantic segmentation method based on deep learning | |
CN111627055A (en) | Scene depth completion method based on semantic segmentation | |
CN116844004A (en) | Point cloud automatic semantic modeling method for digital twin scene | |
CN118212127A (en) | Misregistration-based physical instruction generation type hyperspectral super-resolution countermeasure method | |
CN115311508A (en) | Single-frame image infrared dim target detection method based on depth U-type network | |
CN112488117B (en) | Point cloud analysis method based on direction-induced convolution | |
CN117727022A (en) | Three-dimensional point cloud target detection method based on transform sparse coding and decoding |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |