CN113255727A - Multi-sensor remote sensing image fusion classification method based on a hierarchical dense fusion network - Google Patents

Multi-sensor remote sensing image fusion classification method based on a hierarchical dense fusion network

Info

Publication number
CN113255727A
CN113255727A (application CN202110446906.6A)
Authority
CN
China
Prior art keywords
layer
convolution
pixel
conv2
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110446906.6A
Other languages
Chinese (zh)
Inventor
王相海
冯一宁
宋若曦
穆振华
宋传鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Normal University
Original Assignee
Liaoning Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Normal University filed Critical Liaoning Normal University
Priority to CN202110446906.6A
Publication of CN113255727A
Legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 - Distances to prototypes
    • G06F 18/24137 - Distances to cluster centroïds
    • G06F 18/2414 - Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/58 - Extraction of image or video features relating to hyperspectral data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-sensor remote sensing image fusion classification method based on a hierarchical dense fusion network, and belongs to the field of remote sensing image processing. First, a network framework with three branches (spatial, spectral and elevation) is introduced to extract the spatial and spectral features of a hyperspectral image and the spatial-elevation features of a LiDAR image, respectively. Second, a modal attention mechanism for multi-sensor remote sensing images is proposed, which exploits the correlation and the differences among data of different modalities to obtain modality-specific features. Finally, the features produced by the modal attention mechanism and the self-attention mechanism are fused using the Flatten and Concatenate operations of a convolutional neural network and classified with a softmax activation function, realizing ground-object classification based on multi-sensor remote sensing images.

Description

Multi-sensor remote sensing image fusion classification method based on a hierarchical dense fusion network
Technical Field
The invention relates to the field of remote sensing image processing, and in particular to a multi-sensor remote sensing image fusion classification method based on a hierarchical dense fusion network that offers good fusion quality, high classification accuracy and strong multi-modal data interaction capability.
Background
As a non-contact Earth-observation technology, remote sensing has been widely applied to the classification of land-cover objects. Among the many types of sensors, hyperspectral imagery provides a detailed spectral description of ground objects in a single data cube and can therefore distinguish objects that share the same elevation but have different spectral characteristics, such as a road surface and a lawn at the same height. However, the spatial resolution of hyperspectral images is limited, and phenomena such as spectral mixing and "different objects with the same spectrum" are common, which seriously degrades classification accuracy in complex scenes. LiDAR data, on the other hand, provides height information and can better distinguish objects that have the same spectral characteristics but different elevations. Because conventional LiDAR operates in a single band, the three-dimensional spatial information it acquires usually supports only the classification and recognition of coarse object categories and cannot provide a fine interpretation of a remote sensing scene. Taking the public hyperspectral and LiDAR-DSM images of a campus area in Florida as an example, the hyperspectral image can accurately separate grassland from pavement but cannot separate pavement from building roofs made of the same material; conversely, LiDAR accurately separates buildings and pavement of different heights but cannot effectively separate pavement and lawns of the same height. The hyperspectral image and the LiDAR data therefore carry rich complementary information; if this information can be fully exploited for cooperative ground-object analysis, the advantages of the two sensor types can be fused, the performance of intelligent processing algorithms can be improved, and ground-object information can be interpreted more comprehensively. Against this background, methods for the fusion classification of hyperspectral imagery and LiDAR data have received considerable attention.
Mercier et al. introduced the Support Vector Machine (SVM) with a nonlinear kernel function into remote sensing image classification in 2003. The Extreme Learning Machine (ELM) was introduced into remote sensing image classification by Li et al. in 2015 and achieved performance comparable to the SVM. However, the classification accuracy of these methods remains low. Rasi et al. proposed a hyperspectral image-LiDAR data fusion classification method based on sparsity and low-rank decomposition in 2017, which uses sparsity to capture the spatial redundancy of image features and improve the spatial smoothness of the fused features, effectively alleviating the Hughes phenomenon during fusion. Owing to the curse of dimensionality of hyperspectral images, however, the accuracy of this method is still clearly insufficient. Xue et al. proposed a hyperspectral image-LiDAR data fusion model based on coupled higher-order tensor decomposition in 2019, which extracts more latent features and, to a certain extent, overcomes the low classification accuracy and the Hughes effect introduced by the fusion process in the above techniques.
In recent years, the computing power and data acquisition capability of computing devices have grown rapidly. The increase in computing power mitigates the inefficiency of training, and the increase in training data reduces the risk of overfitting. Complex models represented by deep learning techniques such as Convolutional Neural Networks (CNN) have therefore been applied increasingly to the classification of hyperspectral remote sensing images, with results superior to those of conventional machine learning methods. In 2017, Li et al. trained a CNN on the pixels of a hyperspectral image and realized pixel-level classification. However, that method treats each individual pixel as a whole and ignores the spectral characteristics specific to hyperspectral imagery, so the model accuracy is insufficient. In 2018, Xu et al. proposed a dual-branch convolutional neural network framework for hyperspectral image classification, introducing a spectral-domain branch and a spatial-domain branch to classify the spectral and spatial features of the hyperspectral image cooperatively. Because the spectral characteristics of the hyperspectral image are taken into account, the classification accuracy improves, but the accuracy for ground objects with the same elevation remains poor. On this basis, Hao et al. introduced the elevation information provided by LiDAR data into the dual-branch convolutional neural network structure in 2018 and proposed a hyperspectral image-LiDAR data collaborative fusion classification framework based on a neural network and composite kernels. The framework uses three convolutional branches to extract the spectral, spatial and elevation features of ground objects, which improves classification accuracy. Although such a multi-branch input structure reduces the information loss of different modalities during fusion, it still establishes no connection between the spatial information of the two modalities. In 2020, Hong et al. combined the generative adversarial network (GAN) with multi-modal deep learning and proposed a cross-modal network model with a GAN as the main framework. The method fuses shallow features in a single-stage manner and, to a certain extent, considers the spatial correlation and interaction between features of different modalities. However, because only the shallow features of each modality are processed, the deep spatial correlation between the multi-source data cannot be fully exploited, and there is still room for improving classification accuracy.
In general, cooperatively analysing the characteristics of the different modalities of remote sensing images yields higher-quality fusion results. Unfortunately, most existing techniques fuse the modalities at the front end of the network in a single stage, which usually ignores the correlation and interaction of spatial information between features of different modalities, and a multi-branch network structure alone cannot establish a sufficient relationship between the spatial information of the two modalities. At present, there is no fusion classification method that exploits the common and modality-specific characteristics of multi-sensor remote sensing images to increase cross-modal interaction and thereby significantly improve fusion quality and classification accuracy; the existing technical solutions still suffer from weak cross-modal interaction, poor fusion quality and limited classification accuracy.
Disclosure of Invention
The invention aims to solve the above technical problems of the prior art and provides a multi-sensor remote sensing image fusion classification method based on a hierarchical dense fusion network, which offers good fusion quality, high classification accuracy and strong multi-modal data interaction capability.
The technical solution of the invention is as follows: a multi-sensor remote sensing image fusion classification method based on a hierarchical dense fusion network, characterized by comprising the following steps:
Step 1. Establish and initialize a convolutional neural network N_ahd for the fusion and classification of multi-sensor remote sensing images. N_ahd comprises two sub-networks N_featureSpe and N_featureSpa for feature extraction, one sub-network N_shallowfusion for shallow feature fusion, one sub-network N_deepfusion for deep feature fusion, and one sub-network N_cls for classification;
Step 1.1 establishing and initializing a sub-network NfeatureSpa4 groups of convolutional layers, Conv2_0, Conv2_1, Conv2_2 and Conv2_ 3;
the Conv2_0 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 100 convolution kernels with the size of 3 x 3, each convolution kernel performs convolution operation by taking 1 pixel as a step size, and a nonlinear activation function ReLU is selected as an activation function for operation;
the Conv2_1 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 100 convolution kernels with the size of 3 x 3, each convolution kernel performs convolution operation by taking 1 pixel as a step size, and a nonlinear activation function ReLU is selected as an activation function for operation;
the Conv2_2 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 100 convolution kernels with the size of 1 × 1, each convolution kernel performs convolution operation by taking 1 pixel as a step size, and a nonlinear activation function ReLU is selected as an activation function for operation;
the Conv2_3 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 100 convolution kernels with the size of 1 × 1, each convolution kernel performs convolution operation by taking 1 pixel as a step size, and a nonlinear activation function ReLU is selected as an activation function for operation;
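For illustration, a minimal PyTorch-style sketch of the spatial branch N_featureSpa of step 1.1 is given below; the number of input channels and the padding that keeps the 11×11 block size are assumptions not fixed by the text above.

    import torch.nn as nn

    def conv_bn_relu(in_ch, out_ch, kernel):
        # One Conv2_x group: convolution (stride 1 pixel) + BatchNorm + ReLU.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=kernel, stride=1, padding=kernel // 2),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    class FeatureSpa(nn.Module):
        # Spatial branch N_featureSpa: Conv2_0 .. Conv2_3 of step 1.1.
        def __init__(self, in_channels):
            super().__init__()
            self.conv2_0 = conv_bn_relu(in_channels, 100, 3)
            self.conv2_1 = conv_bn_relu(100, 100, 3)
            self.conv2_2 = conv_bn_relu(100, 100, 1)
            self.conv2_3 = conv_bn_relu(100, 100, 1)

        def forward(self, x):          # x: (batch, in_channels, 11, 11) pixel blocks
            x = self.conv2_0(x)
            x = self.conv2_1(x)
            x = self.conv2_2(x)
            return self.conv2_3(x)     # shallow spatial feature F_spa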
Step 1.2. Establish and initialize the sub-network N_featureSpe, which comprises 2 groups of convolutional layers: Conv1_0 and Conv1_1;
the Conv1_0 group comprises 1 convolution layer, 1 BatchNorm normalization layer and 1 activation layer, wherein the convolution layer contains 64 one-dimensional convolution kernels of size 11, each kernel convolving with a step size of 1 pixel, and the nonlinear activation function ReLU is used for activation;
the Conv1_1 group comprises 1 convolution layer, 1 BatchNorm normalization layer and 1 activation layer, wherein the convolution layer contains 128 one-dimensional convolution kernels of size 3, each kernel convolving with a step size of 1 pixel, and the nonlinear activation function ReLU is used for activation;
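Similarly, a sketch of the spectral branch N_featureSpe of step 1.2 follows; the (batch, 1, number-of-bands) input layout and the absence of padding are assumptions.

    import torch.nn as nn

    class FeatureSpe(nn.Module):
        # Spectral branch N_featureSpe: Conv1_0 and Conv1_1 of step 1.2.
        def __init__(self):
            super().__init__()
            self.conv1_0 = nn.Sequential(       # 64 one-dimensional kernels of size 11
                nn.Conv1d(1, 64, kernel_size=11, stride=1),
                nn.BatchNorm1d(64),
                nn.ReLU(inplace=True),
            )
            self.conv1_1 = nn.Sequential(       # 128 one-dimensional kernels of size 3
                nn.Conv1d(64, 128, kernel_size=3, stride=1),
                nn.BatchNorm1d(128),
                nn.ReLU(inplace=True),
            )

        def forward(self, spectrum):            # spectrum: (batch, 1, num_bands)
            return self.conv1_1(self.conv1_0(spectrum))   # shallow spectral feature F_spe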
Step 1.3. Establish and initialize the sub-network N_shallowfusion, which comprises 6 groups of parallel convolutional layers, Conv2_Q1, Conv2_K1, Conv2_V1, Conv2_Q2, Conv2_K2 and Conv2_V2, and 2 groups of custom modules, L_SAM and L_DPAM;
the Conv2_Q1 group comprises 1 convolution layer containing 25 convolution kernels of size 1×1, each kernel convolving with a step size of 1 pixel;
the Conv2_K1 group comprises 1 convolution layer containing 25 convolution kernels of size 1×1, each kernel convolving with a step size of 1 pixel;
the Conv2_V1 group comprises 1 convolution layer containing 200 convolution kernels of size 1×1, each kernel convolving with a step size of 1 pixel;
the Conv2_Q2 group comprises 1 convolution layer containing 25 convolution kernels of size 1×1, each kernel convolving with a step size of 1 pixel;
the Conv2_K2 group comprises 1 convolution layer containing 25 convolution kernels of size 1×1, each kernel convolving with a step size of 1 pixel;
the Conv2_V2 group comprises 1 convolution layer containing 200 two-dimensional convolution kernels of size 1×1, each kernel convolving with a step size of 1 pixel;
the L_SAM module maps the input three-dimensional tensor F to the space ℝ^(C_spe×N_2) by a reshape operation to obtain the feature F_speR ∈ ℝ^(C_spe×N_2), where C_spe denotes the number of input channels, N_2 = 1×1, and F_speR,i denotes the i-th channel of F_speR; the spectral attention matrix F_speS ∈ ℝ^(C_spe×C_spe) is then computed according to the definition of formula (1), where F_speS(j,i) denotes the element in the j-th row and i-th column of F_speS, F_speR,j^T denotes the transpose of the j-th channel of F_speR, F_speR,i denotes the i-th channel of F_speR, and "·" denotes the inner product operation; F_speR and F_speS are then multiplied as matrices according to the definition of formula (2) to obtain the spectral attention feature F_speA, where γ denotes a preset coefficient;
the L_DPAM module comprises the following 7 steps:
(a) the input three-dimensional tensor F_1 is fed into the convolutional layer Conv2_Q1 to compute the feature F_Q1 ∈ ℝ^(K_1×H_spa×W_spa), F_1 is then fed into the convolutional layer Conv2_K1 to compute the feature F_K1 ∈ ℝ^(K_2×H_spa×W_spa), and F_1 is then fed into the convolutional layer Conv2_V1 to compute the feature F_V1 ∈ ℝ^(K_3×H_spa×W_spa), where F_Q1,i, F_K1,i and F_V1,i denote the i-th elements of F_Q1, F_K1 and F_V1 respectively, C_spa denotes the number of channels of the input tensor, H_spa and W_spa denote the height and width of the input tensor, and K_1 = 25, K_2 = 25, K_3 = 200;
(b) the L_DPAM module feeds the three-dimensional tensor F_2 into the convolutional layer Conv2_Q2 to compute the feature F_Q2, then feeds F_2 into the convolutional layer Conv2_K2 to compute the feature F_K2, and then feeds F_2 into the convolutional layer Conv2_V2 to compute the feature F_V2, where F_Q2,i, F_K2,i and F_V2,i denote the i-th elements of F_Q2, F_K2 and F_V2 respectively;
(c) F_Q1 and F_K1 are mapped to the space ℝ^(K_1×N_1) by a reshape operation, and the spatial attention matrix F_spaX ∈ ℝ^(N_1×N_1) is computed according to the definition of formula (3), where N_1 denotes the total number of features and N_1 = H_spa × W_spa, F_spaX(j,i) denotes the element in the j-th row and i-th column of F_spaX, and F_K1,j^T denotes the transpose of the j-th element of F_K1;
(d) F_V1 is mapped to the space ℝ^(K_3×N_1) by a reshape operation, and the spatial attention feature F_spaA is computed according to the definition of formula (4), where η_spa is a preset scaling factor and F_spaX,i denotes the vector formed by the elements of the i-th row of F_spaX;
(e) F_Q2 and F_K2 are mapped to the space ℝ^(K_1×N_1) by a reshape operation, and the attention matrix F_mX ∈ ℝ^(N_1×N_1) is computed according to the definition of formula (5), where F_mX(j,i) denotes the element in the j-th row and i-th column of F_mX and F_K2,j^T denotes the transpose of the j-th element of F_K2;
(f) the modal attention feature F_mA is computed according to the definition of formula (6), where ε_2 denotes a preset scaling factor and F_mX,i denotes the vector formed by the elements of the i-th row of F_mX;
(g) the spatially weighted feature F_maF is computed according to the definition of formula (7):

F_maF = α_1·F_1 + α_2·F_spaA + α_3·F_mA   (7)

where α_1, α_2 and α_3 denote preset weight coefficients;
Step 1.4. Establish and initialize the sub-network N_deepfusion, which comprises 2 groups of maximum pooling layers and 1 group of custom connection layers: MaxPool1, MaxPool2 and Concatenate;
the MaxPool1 group comprises 1 pooling layer and 1 Flatten operation, wherein the pooling layer performs maximum pooling with a one-dimensional pooling kernel of size 1;
the MaxPool2 group comprises 1 pooling layer, 2 fully connected layers, 2 activation layers and 1 Flatten operation, wherein the pooling layer performs maximum pooling with a pooling kernel of size 2×2, the 2 fully connected layers have 1024 and 512 output units respectively, ReLU is used as the activation function, and a Dropout operation with parameter 0.4 is applied; applying MaxPool1 and MaxPool2 yields the 3 tensors F'_spe, F'_spa and F'_L;
the Concatenate layer fuses F'_spe, F'_spa and F'_L according to formula (8) and applies a Dropout operation with parameter 0.5 three times:

F_M = ω·(F'_spe ‖ F'_spa ‖ F'_L) + b   (8)

where ω and b denote the weights and biases of the fully connected layers and "‖" denotes the operation of connecting the spectral features with the spatial features;
Step 1.5. Establish and initialize the sub-network N_cls, which comprises 1 group of fully connected layers: Dense1;
the Dense1 layer has num classification units and uses Softmax as its activation function, where num denotes the total number of ground-object classes to be classified;
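The deep fusion and classification stages (steps 1.4 and 1.5) can be sketched as follows; the flattened feature widths spe_dim and spa_dim, the width of the fused layer and the placement of the three Dropout(0.5) operations around the fully connected fusion are assumptions, since only the 1024/512 unit counts, the Dropout rates and the num-way Softmax head are fixed above.

    import torch
    import torch.nn as nn

    class DeepFusionClassifier(nn.Module):
        # Sketch of N_deepfusion (MaxPool1, MaxPool2, Concatenate) plus N_cls (Dense1).
        def __init__(self, spe_dim, spa_dim, num_classes, fused_dim=512):
            super().__init__()
            self.pool_spe = nn.Sequential(nn.MaxPool1d(kernel_size=1), nn.Flatten())  # MaxPool1
            self.pool_spa = nn.Sequential(                                            # MaxPool2
                nn.MaxPool2d(kernel_size=2), nn.Flatten(),
                nn.Linear(spa_dim, 1024), nn.ReLU(inplace=True),
                nn.Linear(1024, 512), nn.ReLU(inplace=True),
                nn.Dropout(0.4),
            )
            self.fuse = nn.Sequential(               # Concatenate: Eq. (8) with Dropout(0.5) x 3
                nn.Dropout(0.5),
                nn.Linear(spe_dim + 512 + 512, fused_dim),
                nn.Dropout(0.5), nn.Dropout(0.5),
            )
            self.cls = nn.Linear(fused_dim, num_classes)        # Dense1 with num units

        def forward(self, f_spe_a, f_ma_hf, f_ma_lf):
            f_spe = self.pool_spe(f_spe_a)                      # deep spectral feature F'_spe
            f_spa = self.pool_spa(f_ma_hf)                      # deep spatial feature F'_spa
            f_l = self.pool_spa(f_ma_lf)                        # deep elevation feature F'_L
            f_m = self.fuse(torch.cat([f_spe, f_spa, f_l], dim=1))   # Eq. (8)
            return torch.softmax(self.cls(f_m), dim=1)          # classification prediction

Note that MaxPool2 is shared between the hyperspectral spatial branch and the LiDAR elevation branch, which matches steps 2.10.2 and 2.10.3 below.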
Step 2. Input the training set H of the hyperspectral image, the training set L of the LiDAR image, and the manually annotated pixel coordinate set and label set, and train N_ahd;
Step 2.1. According to the manually annotated pixel coordinate set, extract from the hyperspectral training set H the set of all labelled pixels X_H = {x_H,i | i = 1, …, M} and extract from the LiDAR training set L the set of all labelled pixels X_L = {x_L,i | i = 1, …, M}, where x_H,i denotes the i-th pixel of X_H, x_L,i denotes the i-th pixel of X_L, and M denotes the total number of labelled pixels;
Step 2.2. Normalize X_H and X_L according to the definitions of formula (9) and formula (10) to obtain X̂_H and X̂_L, where X̂_H denotes the normalized set of labelled hyperspectral pixels, x̂_H,i denotes the i-th pixel of X̂_H, X̂_L denotes the normalized set of labelled LiDAR pixels, and x̂_L,i denotes the i-th pixel of X̂_L;
Step 2.3. Taking each pixel of X̂_H as a centre, divide H into a series of hyperspectral pixel blocks of size 11×11 to form the set X_H1, and taking each pixel of X̂_L as a centre, divide L into a series of LiDAR pixel blocks of size 11×11 to form the set X_L1;
Step 2.4. Flip each pixel block in X_H1 and X_L1 vertically to obtain the hyperspectral pixel block set X_H2 and the LiDAR pixel block set X_L2;
Step 2.5. Add Gaussian noise with variance 0.01 to each pixel block of X_H1 to obtain the hyperspectral pixel block set X_H3, and add Gaussian noise with variance 0.03 to each pixel block of X_L1 to obtain the LiDAR pixel block set X_L3;
Step 2.6. Rotate each pixel block in X_H1 clockwise by a random n×90° about its centre point to obtain the hyperspectral pixel block set X_H4, and rotate each pixel block in X_L1 clockwise by a random n×90° about its centre point to obtain the LiDAR pixel block set X_L4, where n denotes a value randomly selected from the set {1, 2, 3};
Step 2.7. Let X̃_H = X_H1 ∪ X_H2 ∪ X_H3 ∪ X_H4 and X̃_L = X_L1 ∪ X_L2 ∪ X_L3 ∪ X_L4; take X̃_H and X̃_L as the training set of the fusion and classification neural network, and organize the samples of the training set into triples (x̃_H,i, x̃_L,i, Y_i) as the network input, where (x̃_H,i, x̃_L,i) denotes a pixel-block pair consisting of a hyperspectral block and a LiDAR block of the training set whose spatial coordinates are identical, and Y_i denotes the ground-truth class label corresponding to x̃_H,i and x̃_L,i; set the iteration counter iter ← 1 and execute steps 2.8 to 2.13;
Step 2.8. Use the sub-networks N_featureSpe and N_featureSpa to extract the features of the training set;
Step 2.8.1. Use the sub-network N_featureSpe to extract features from the hyperspectral training set X̃_H, obtaining the shallow spectral feature F_spe of the hyperspectral image;
Step 2.8.2. Use the sub-network N_featureSpa to extract features from the hyperspectral training set X̃_H, obtaining the shallow spatial feature F_spa of the hyperspectral image;
Step 2.8.3. Use the sub-network N_featureSpa to extract features from the LiDAR training set X̃_L, obtaining the shallow elevation feature F_L of the LiDAR image;
Step 2.9. Use the sub-network N_shallowfusion to perform feature-level shallow fusion and obtain the shallow features;
Step 2.9.1. Apply the L_SAM module to the shallow spectral feature F_spe to obtain the spectral attention feature F_speA of the hyperspectral image;
Step 2.9.2. Apply the L_DPAM module to the shallow spatial feature F_spa and the shallow spatial feature F_L to obtain the spatial modal attention feature F_maHF of the hyperspectral image;
Step 2.9.3. Apply the L_DPAM module to the shallow spatial feature F_L and the shallow spatial feature F_spa to obtain the spatial modal attention feature F_maLF of the LiDAR image;
Step 2.10. use sub-network NdeepfusionCarrying out deep fusion of characteristic levels to obtain deep characteristics;
step 2.10.1 spectral attention feature F using max-pooling layer MaxPool1speACalculating to obtain deep spectral characteristics of the hyperspectral image
Figure BDA00030372694600000812
Step 2.10.2 spatial modal attention feature F of hyperspectral image by using maximum pooling layer MaxPool2maHFCalculating to obtain deep space characteristics of hyperspectral image
Figure BDA00030372694600000813
Step 2.10.3 utilizes the spatial modal attention feature F of the max pooling layer Maxpool2 for LiDAR imagerymaLFCalculation was carried out to obtain LiDeep elevation features for DAR images
Figure BDA00030372694600000814
Step 2.10.4 utilizes the custom linker Concatenate to characterize the deep spectra of the hyperspectral image
Figure BDA0003037269460000091
Spatial features of deep layers
Figure BDA0003037269460000092
Deep elevation features of LiDAR images
Figure BDA0003037269460000093
Calculating to obtain deep layer characteristics FM
Step 2.11 Using subnetwork NclsClassifying the deep features, and calculating to obtain a classification prediction result TRpred
Step 2.12, taking the weighted cross entropy as a loss function according to the definitions of the formula (11) and the formula (12);
Figure BDA0003037269460000094
Figure BDA0003037269460000095
wherein, ω isjThe weight of the jth class is represented,
Figure BDA0003037269460000096
probability, n, of a picture element belonging to class j terrainjRepresenting the number of the jth class of ground-truth ground objects in the ground-truth training sample;
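Since formula (11) is given only as an image, the exact class-weight definition is not recoverable; the sketch below uses weights inversely proportional to the class counts n_j as one plausible choice and relies on the standard weighted cross-entropy for formula (12).

    import torch
    import torch.nn.functional as F_t

    def weighted_cross_entropy(pred, target, class_counts):
        # pred: raw class scores before Softmax; target: ground-truth class indices.
        counts = torch.as_tensor(class_counts, dtype=torch.float32)
        weights = counts.sum() / (len(counts) * counts)         # assumed form of omega_j (Eq. (11))
        return F_t.cross_entropy(pred, target, weight=weights)  # weighted cross entropy (Eq. (12))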
Step 2.13. If all pixel blocks in the training set have been processed, go to step 2.14; otherwise take an unprocessed group of pixel blocks from the training set and return to step 2.8;
Step 2.14. Let iter ← iter + 1. If the iteration count iter > Total_iter, the trained convolutional neural network N_ahd is obtained and the procedure goes to step 3; otherwise, update the parameters of N_ahd with the back-propagation algorithm based on stochastic gradient descent and the prediction loss L_ω-C, and return to step 2.8 to reprocess all pixel blocks of the training set, where Total_iter denotes the preset number of iterations;
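A minimal training loop corresponding to steps 2.8 to 2.14 might look as follows, assuming the whole network N_ahd is wrapped in a single module whose forward pass runs feature extraction, shallow fusion, deep fusion and classification; the learning rate is an assumption, since only stochastic-gradient-descent back-propagation and the iteration count are specified.

    import torch

    def train(model, loader, loss_fn, total_iter, lr=0.01):
        # loader yields the triples (x_H, x_L, Y) of step 2.7; loss_fn is the
        # weighted cross entropy of step 2.12.
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(total_iter):
            for hsi_block, lidar_block, label in loader:
                opt.zero_grad()
                pred = model(hsi_block, lidar_block)     # TR_pred: raw class scores
                loss = loss_fn(pred, label)              # prediction loss L_omega-C
                loss.backward()                          # reverse error propagation
                opt.step()
        return model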
Step 3. Input an unlabelled hyperspectral image H' and LiDAR image L', preprocess all pixels of H' and L', and use the trained convolutional neural network N_ahd to complete the pixel classification;
Step 3.1. Extract all pixels of H' to form the set T_H = {t_H,i | i = 1, …, U} and all pixels of L' to form the set T_L = {t_L,i | i = 1, …, U}, where t_H,i denotes the i-th pixel of T_H, t_L,i denotes the i-th pixel of T_L, and U denotes the total number of pixels;
Step 3.2. Normalize T_H and T_L according to the definitions of formula (17) and formula (18) to obtain T̂_H and T̂_L, where T̂_H denotes the normalized set of hyperspectral pixels, t̂_H,i denotes the i-th pixel of T̂_H, T̂_L denotes the normalized set of LiDAR pixels, and t̂_L,i denotes the i-th pixel of T̂_L;
Step 3.3. Taking each pixel of T̂_H as a centre, divide H' into a series of hyperspectral pixel blocks of size 11×11 to form the hyperspectral image test set T_H1, and taking each pixel of T̂_L as a centre, divide L' into a series of LiDAR pixel blocks of size 11×11 to form the LiDAR image test set T_L1;
Step 3.4. use sub-network NfeatureSpeAnd NfeatureSpaExtracting the characteristics of the test set;
step 3.4.1 utilizing sub-network NfeatureSpeTo pair
Figure BDA0003037269460000106
Carrying out feature extraction to obtain spectral feature T of hyperspectral image Hspe
Step 3.4.2 utilizes sub-network NfeatureSpaTo pair
Figure BDA0003037269460000107
Carrying out feature extraction to obtain spatial feature T of hyperspectral image Hspa
Step 3.4.3 utilizing sub-network NfeatureSpaTo pair
Figure BDA0003037269460000108
Extracting features to obtain the elevation features T of the LiDAR image LL
Step 3.5. use sub-network NshallowfusionPerforming shallow layer fusion of a characteristic level to obtain shallow layer characteristics;
step 3.5.1 Using LSAMModule pair spectral feature TspeCalculating to obtain the spectral attention characteristic T of the hyperspectral image HspeA
Step 3.5.2 Using LDPAMModule to space characteristics TspaAnd spatial feature TLCalculating to obtain the spatial modal attention feature T of the hyperspectral image HmaHF
Step 3.5.3 utilizes LDPAMModule to space characteristics TLAnd spatial feature TspaCalculating to obtain the space modal attention feature T of the LiDAR image LmaLF
Step 3.6. use sub-network NdeepfusionCarrying out deep fusion of characteristic levels to obtain deep characteristics;
step 3.6.1 spectral attention feature T using max-pooling layer MaxPool1speACalculating to obtain deep spectral characteristics of the hyperspectral image H
Figure BDA0003037269460000109
Step 3.6.2 attention feature T to spatial modality with max pooling layer Maxpool2maHFCalculating to obtain deep spatial features of the hyperspectral image H
Figure BDA00030372694600001010
Step 3.6.3 utilizes the max pooling layer Maxpool2 for the spatial modal attention feature TmaLFCalculating to obtain the deep elevation features of the LiDAR image L
Figure BDA00030372694600001011
Step 3.6.4 utilizes custom connection layer conditioner pairs
Figure BDA00030372694600001012
Calculating to obtain deep layer characteristics TM
Step 3.7 Using subnetwork NclsFor deep layer characteristic TMClassifying to calculate the classified prediction result TEpred
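At test time (step 3), the trained network is simply run over every 11×11 pixel-block pair and the arg-max class is taken per pixel; the batching below and the assumption that normalization and patch extraction have already been done are illustrative.

    import torch

    def classify_scene(model, hsi_patches, lidar_patches, batch_size=256):
        # hsi_patches / lidar_patches: tensors of co-registered 11x11 test blocks.
        model.eval()
        preds = []
        with torch.no_grad():
            for start in range(0, len(hsi_patches), batch_size):
                h = hsi_patches[start:start + batch_size]
                l = lidar_patches[start:start + batch_size]
                scores = model(h, l)                     # TE_pred class scores
                preds.append(scores.argmax(dim=1))
        return torch.cat(preds)                          # one class label per pixel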
Compared with the prior art, the invention has two advantages. First, it introduces a multi-modal fusion classification framework based on a hierarchical dense fusion network with multiple attention mechanisms, which organically combines the spatial-spectral-elevation three-branch framework of the hyperspectral and LiDAR images with the attention mechanisms and thereby improves the accuracy of ground-object fusion classification. Second, by fusing the shallow spectral and spatial features of the hyperspectral image with the shallow spatial features of the LiDAR image, a modal attention mechanism for shallow feature fusion is designed to discover the correlation and diversity among the multi-modal data of the same ground object, realizing interaction and complementarity between data of different modalities. The method therefore offers good fusion quality, high classification accuracy and strong multi-modal data interaction capability. Experimental results show that the overall accuracy of the method reaches 90.06% on the Houston data set and 99.03% on the Trento data set, the average accuracy reaches 92.25% and 98.32%, and the Kappa coefficient reaches 89.24% and 98.70%, respectively, effectively improving the classification accuracy of ground objects.
Drawings
FIG. 1 is a comparison of the fusion classification results of the method of the present invention with the SVM, ELM, CNN-PPF, Two-Branch CNN, and EndNet methods on the Houston data set.

FIG. 2 is a comparison of the fusion classification results of the method of the present invention with the SVM, ELM, CNN-PPF, Two-Branch CNN, and EndNet methods on the Trento data set.
Detailed Description
The multi-sensor remote sensing image fusion classification method based on a hierarchical dense fusion network disclosed by the invention is carried out according to the following steps:
step 1, establishing and initializing a convolutional neural network N for fusion and classification of multi-sensor remote sensing imagesahdSaid N isahdComprising 2 sub-networks N for feature extractionfeatureSpeAnd N featureSpa1 sub-network N for shallow feature fusion shallowfusion1 sub-network N for deep feature fusiondeepfusionAnd 1 sub-network N for classificationcls
Step 1.1. Establish and initialize the sub-network N_featureSpa, which comprises 4 groups of convolutional layers: Conv2_0, Conv2_1, Conv2_2 and Conv2_3;
the Conv2_0 group comprises 1 convolution layer, 1 BatchNorm normalization layer and 1 activation layer, wherein the convolution layer contains 100 convolution kernels of size 3×3, each kernel convolving with a step size of 1 pixel, and the nonlinear activation function ReLU is used for activation;
the Conv2_1 group comprises 1 convolution layer, 1 BatchNorm normalization layer and 1 activation layer, wherein the convolution layer contains 100 convolution kernels of size 3×3, each kernel convolving with a step size of 1 pixel, and the nonlinear activation function ReLU is used for activation;
the Conv2_2 group comprises 1 convolution layer, 1 BatchNorm normalization layer and 1 activation layer, wherein the convolution layer contains 100 convolution kernels of size 1×1, each kernel convolving with a step size of 1 pixel, and the nonlinear activation function ReLU is used for activation;
the Conv2_3 group comprises 1 convolution layer, 1 BatchNorm normalization layer and 1 activation layer, wherein the convolution layer contains 100 convolution kernels of size 1×1, each kernel convolving with a step size of 1 pixel, and the nonlinear activation function ReLU is used for activation;
step 1.2. establishing and initializing sub-network N featureSpe2 groups of convolutional layers, Conv1_0 and Conv1_ 1;
the Conv1_0 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 64 one-dimensional convolution kernels with the size of 11, each convolution kernel performs convolution operation by taking 1 pixel as a step size, and a nonlinear activation function ReLU is selected as an activation function for operation;
the Conv1_1 comprises a 1-layer convolution operation, a 1-layer BatchNorm normalization operation and a 1-layer activation operation, wherein the convolution layer comprises 128 one-dimensional convolution kernels with the size of 3, each convolution kernel performs convolution operation by taking 1 pixel as a step size, and a nonlinear activation function ReLU is selected as an activation function for operation;
step 1.3. establishing and initializing sub-network NshallowfusionComprising 6 sets of parallel convolutional layers, Conv2_ Q1, Conv2_ K1, Conv2_ V1, Conv2_ Q2, Conv2_ K2 and Conv2_ V2, and 2 sets of custom modules, LSAM、LDPAM
The Conv2_ Q1 comprises 1 layer of convolution operations, including 25 convolution kernels of size 1 × 1, each convolution kernel performing convolution operations in steps of 1 pixel;
the Conv2_ K1 includes 1 layer of convolution operations, including 25 convolution kernels of size 1 × 1, each convolution kernel performing convolution operations with 1 pixel step size;
the Conv2_ V1 comprises 1 layer of convolution operations, including 200 convolution kernels of size 1 × 1, each convolution kernel performing convolution operations with a step size of 1 pixel;
the Conv2_ Q2 comprises 1 layer of convolution operations, including 25 convolution kernels of size 1 × 1, each convolution kernel performing convolution operations in steps of 1 pixel;
the Conv2_ K2 includes 1 layer of convolution operations, including 25 convolution kernels of size 1 × 1, each convolution kernel performing convolution operations with 1 pixel step size;
the Conv2_ V2 comprises 1-layer convolution operation, including 200 2D convolution kernels with the size of 1 × 1, and each convolution kernel carries out convolution operation by taking 1 pixel as a step size;
said LSAMThe module maps the input three-dimensional tensor F to using reshape operation
Figure BDA0003037269460000131
Space, get the characteristic
Figure BDA0003037269460000132
Wherein, CspeIndicating the number of input channels, N2=1×1,
Figure BDA0003037269460000133
Is represented by FspeRThe ith channel of (2), and then calculating the spectrum attention moment array according to the definition of the formula (1)
Figure BDA0003037269460000134
Figure BDA0003037269460000135
Wherein the content of the first and second substances,
Figure BDA0003037269460000136
is represented by FspeSThe element in the jth row and ith column,
Figure BDA0003037269460000137
is represented by FspeRThe transpose of the jth lane of (1),
Figure BDA0003037269460000138
is represented by FspeRRepresents the inner product operation, and further defines F according to the formula (2)speRAnd FspeSPerforming matrix multiplication to obtain spectral attention feature FspeA
Figure BDA0003037269460000139
Where γ represents a preset coefficient, and in this embodiment, γ is made to be 0.4;
said LDPAMThe module comprises the following 7 steps:
(a) three-dimensional tensor F to be input1Feeding into the convolutional layer Conv2_ Q1 to calculate the characteristics
Figure BDA00030372694600001310
Then F is mixed1Feeding into the convolutional layer Conv2_ K1 to calculate the characteristics
Figure BDA00030372694600001311
Then F is put1Sending the convolution layer Conv2_ V1 to calculate the characteristics
Figure BDA00030372694600001312
Wherein, FQ1,i、FK1,iAnd FV1,iRespectively represent FQ1、FK1And FV1The ith element of (1), CspaNumber of channels representing input tensor, HspaAnd WspaRespectively representing the length and width of the input tensor, K1=25,K2=25,K3=200;
(b)LDPAMModule three-dimensional tensor F2Feeding into the convolutional layer Conv2_ Q2 to calculate the characteristics
Figure BDA00030372694600001313
Then F is mixed2Feeding into the convolutional layer Conv2_ K2 to calculate the characteristics
Figure BDA00030372694600001314
Then F is put2Sending the convolution layer Conv2_ V2 to calculate the characteristics
Figure BDA00030372694600001315
Wherein, FQ2,i、FK2,iAnd FV2,iRespectively represent FQ2、FK2And FV2The ith element of (1);
(c) f is processed by reshape operationQ1And FK1Mapping to
Figure BDA0003037269460000141
Space and calculating a space attention moment matrix according to the definition of the formula (3)
Figure BDA0003037269460000142
Figure BDA0003037269460000143
Wherein N is1Represents the total number of features and N1=Hspa×Wspa
Figure BDA0003037269460000144
Is represented by FspaXThe element in the jth row and ith column,
Figure BDA0003037269460000145
is represented by FK1Transpose of jth element of (a);
(d) f is processed by reshape operationV1Mapping to
Figure BDA0003037269460000146
Space, calculating the spatial attention feature F according to the definition of formula (4)spaA
Figure BDA0003037269460000147
Wherein eta isspaIs a pre-set scaling factor that is,
Figure BDA0003037269460000148
is represented by FspaXThe vector formed by the ith row element of (1), let η in this embodimentspa=0.4;
(e) F is processed by reshape operationQ2And FK2Mapping to
Figure BDA0003037269460000149
Space, calculating a space attention moment array according to the definition of formula (5)
Figure BDA00030372694600001410
Figure BDA00030372694600001411
Wherein the content of the first and second substances,
Figure BDA00030372694600001412
is represented by FmXThe element in the jth row and ith column,
Figure BDA00030372694600001413
is represented by FK2Transpose of jth element of (a);
(f) calculating the modal attention feature F according to the definition of formula (6)mA
Figure BDA00030372694600001414
Wherein epsilon2Which represents a preset scaling factor, is set,
Figure BDA00030372694600001415
is represented by FmXThe vector formed by the ith row element of (1), let ε in this embodiment2=0.4;
(g) Calculating the spatial weighting feature F according to the definition of equation (7)maF
FmaF=α1F12FspaA3FmA (7)
Wherein alpha is1,α2And alpha3Represents a predetermined weight coefficient, let α in this embodiment1=0.4,α2=0.3, α3=0.3;
Step 1.4. Establish and initialize the sub-network N_deepfusion, which comprises 2 groups of maximum pooling layers and 1 group of custom connection layers: MaxPool1, MaxPool2 and Concatenate;
the MaxPool1 group comprises 1 pooling layer and 1 Flatten operation, wherein the pooling layer performs maximum pooling with a one-dimensional pooling kernel of size 1;
the MaxPool2 group comprises 1 pooling layer, 2 fully connected layers, 2 activation layers and 1 Flatten operation, wherein the pooling layer performs maximum pooling with a pooling kernel of size 2×2, the 2 fully connected layers have 1024 and 512 output units respectively, ReLU is used as the activation function, and a Dropout operation with parameter 0.4 is applied; applying MaxPool1 and MaxPool2 yields the 3 tensors F'_spe, F'_spa and F'_L;
the Concatenate layer fuses F'_spe, F'_spa and F'_L according to formula (8) and applies a Dropout operation with parameter 0.5 three times:

F_M = ω·(F'_spe ‖ F'_spa ‖ F'_L) + b   (8)

where ω and b denote the weights and biases of the fully connected layers and "‖" denotes the operation of connecting the spectral features with the spatial features;
Step 1.5. Establish and initialize the sub-network N_cls, which comprises 1 group of fully connected layers: Dense1;
the Dense1 layer has num classification units and uses Softmax as its activation function, where num denotes the total number of ground-object classes to be classified;
step 2, inputting a training set L of a training set H, LiDAR image of a hyperspectral image, a pixel point coordinate set and a label set which are marked artificially, and performing comparison on NahdTraining is carried out;
step 2.1, extracting all pixel point sets X with labels from a hyperspectral image training set H according to the artificially marked pixel point coordinate setH={xH,i1, …, M, and extracting pixel point set X with all labels from training set L of LiDAR imageL={xL,i1, …, M, where xH,iRepresents XHThe ith pixel point, xL,iRepresents XLM represents the total number of pixel points having labels;
step 2.2. according to the definition of the formula (9) and the formula (10), X is definedHAnd XLPerforming standardization treatment to obtain
Figure BDA0003037269460000156
And
Figure BDA0003037269460000157
wherein the content of the first and second substances,
Figure BDA0003037269460000158
representing a normalized set of labeled hyperspectral image primitive points,
Figure BDA0003037269460000159
to represent
Figure BDA00030372694600001510
The point of the ith pixel of (a),
Figure BDA00030372694600001511
represents a normalized set of labeled LiDAR pixel points,
Figure BDA00030372694600001512
to represent
Figure BDA00030372694600001513
The ith pixel point of (1);
Figure BDA00030372694600001514
Figure BDA00030372694600001515
step 2.3. with
Figure BDA0003037269460000161
Is divided into a series of high spectral pel block sets X of size 11X 11 centered on each pel point of HH1And are combined with
Figure BDA0003037269460000162
Divides L into a series of sets X of LiDAR image metablocks of size 11X 11 centered on each image metablock ofL1
Step 2.4. mixing XH1And XL1Each image element block in the image acquisition system is turned over up and down to obtain a high-spectrum image element block set XH2And LiDAR pixelblock set XL2
Step 2.5 for XH1Adding Gaussian noise with variance of 0.01 to each pixel block to obtain a hyperspectral pixel block set XH3And to XL1Adding Gaussian noise with variance of 0.03 to each pixel block to obtain a LiDAR pixel block set XL3
Step 2.6. mixing XH1Each pixel block in the hyperspectral image block set X rotates by n multiplied by 90 degrees clockwise and randomly by taking the central point as a rotation center to obtain a hyperspectral image block set XH4And X isL1Each pixel block in the LiDAR pixel block set X is obtained by clockwise randomly rotating n multiplied by 90 degrees by taking the central point as the rotating centerL4Wherein n represents a value randomly selected from the set {1,2,3 };
step 2.7. order
Figure BDA0003037269460000163
And
Figure BDA0003037269460000164
will be provided with
Figure BDA0003037269460000165
And
Figure BDA0003037269460000166
as a training set for fusing and classifying neural networks, and integrating samples in the training set into a triad
Figure BDA0003037269460000167
In the form of a network data input, wherein,
Figure BDA0003037269460000168
represents a pixel pair consisting of a hyperspectral image and a LiDAR image in the training set, and
Figure BDA0003037269460000169
and
Figure BDA00030372694600001610
are the same in spatial coordinates of (a) YiTo represent
Figure BDA00030372694600001611
And
Figure BDA00030372694600001612
making the iteration number iter ← 1 for the corresponding real category label, and executing the step 2.8 to the step 2.13;
step 2.8. adopt the subnetwork NfeatureSpeAnd NfeatureSpaExtracting the characteristics of the training set;
step 2.8.1 utilizing subnetwork NfeatureSpeTraining set for hyperspectral images
Figure BDA00030372694600001613
Performing feature extraction to obtain shallow spectrum features of hyperspectral imageSign Fspe
Step 2.8.2 utilizing sub-network NfeatureSpaTraining set for hyperspectral images
Figure BDA00030372694600001614
Carrying out feature extraction to obtain shallow space features F of the hyperspectral imagespa
Step 2.8.3 utilizing sub-network NfeatureSpaTraining set for LiDAR imagery
Figure BDA00030372694600001615
Performing feature extraction to obtain shallow elevation features F of the LiDAR imageL
Step 2.9. use sub-network NshallowfusionPerforming shallow layer fusion of a characteristic level to obtain shallow layer characteristics;
step 2.9.1 Using LSAMModule pair shallow spectral feature FspeCalculating to obtain the spectral attention characteristic F of the hyperspectral imagespeA
Step 2.9.2 Using LDPAMModule pair shallow space feature FspaAnd shallow space feature FLCalculating to obtain the spatial modal attention feature F of the hyperspectral imagemaHF
Step 2.9.3 Using LDPAMModule pair shallow space feature FLAnd shallow space feature FspaCalculating to obtain the spatial modal attention feature F of the LiDAR imagemaLF
Step 2.10. Use the sub-network N_deepfusion to perform feature-level deep fusion and obtain the deep features;
Step 2.10.1. Apply the maximum pooling layer MaxPool1 to the spectral attention feature F_speA to obtain the deep spectral feature F'_spe of the hyperspectral image;
Step 2.10.2. Apply the maximum pooling layer MaxPool2 to the spatial modal attention feature F_maHF of the hyperspectral image to obtain the deep spatial feature F'_spa of the hyperspectral image;
Step 2.10.3. Apply the maximum pooling layer MaxPool2 to the spatial modal attention feature F_maLF of the LiDAR image to obtain the deep elevation feature F'_L of the LiDAR image;
Step 2.10.4. Apply the custom connection layer Concatenate to the deep spectral feature F'_spe, the deep spatial feature F'_spa and the deep elevation feature F'_L of the LiDAR image to obtain the deep feature F_M;
Step 2.11. Use the sub-network N_cls to classify the deep feature and compute the classification prediction result TR_pred;
Step 2.12. Take the weighted cross entropy as the loss function L_ω-C according to the definitions of formula (11) and formula (12), where ω_j denotes the weight of the j-th class, p_j denotes the probability that a pixel belongs to the j-th class of ground objects, and n_j denotes the number of class-j ground-truth samples in the training set;
Step 2.13. If all pixel blocks in the training set have been processed, go to step 2.14; otherwise take an unprocessed group of pixel blocks from the training set and return to step 2.8;
Step 2.14. Let iter ← iter + 1. If the iteration count iter > Total_iter, the trained convolutional neural network N_ahd is obtained and the procedure goes to step 3; otherwise, update the parameters of N_ahd with the back-propagation algorithm based on stochastic gradient descent and the prediction loss L_ω-C, and return to step 2.8 to reprocess all pixel blocks of the training set, where Total_iter denotes the preset number of iterations; in this embodiment, Total_iter is set to 200;
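For reference, the embodiment-specific hyper-parameter values quoted in this detailed description can be collected as follows; grouping them in a single configuration object is purely illustrative.

    from dataclasses import dataclass

    @dataclass
    class EmbodimentConfig:
        gamma: float = 0.4          # spectral attention coefficient of formula (2)
        eta_spa: float = 0.4        # spatial attention scaling factor of formula (4)
        eps2: float = 0.4           # modal attention scaling factor of formula (6)
        alpha: tuple = (0.4, 0.3, 0.3)   # weights alpha_1, alpha_2, alpha_3 of formula (7)
        dropout_fc: float = 0.4     # Dropout after the fully connected layers of MaxPool2
        dropout_fuse: float = 0.5   # Dropout applied three times in the Concatenate layer
        total_iter: int = 200       # preset number of training iterations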
step 3, inputting unlabeled hyperspectral images H 'and LiDAR images L', performing data preprocessing on all pixels of H 'and L', and adopting a trained convolutional neural network NahdCompleting pixel classification;
step 3.1, extracting all pixel points in H' to form a set TH={tH,iI1, …, U, extracting all pixel points in L' to form a set TL={t L,i1, …, U }, where t isH,iRepresents THI-th pixel of (1), tL,iRepresents TLU represents the total number of all picture elements;
step 3.2. definition of T according to formula (17) and formula (18)HAnd TLPerforming standardization treatment to obtain
Figure BDA0003037269460000181
And
Figure BDA0003037269460000182
wherein the content of the first and second substances,
Figure BDA0003037269460000183
representing a normalized labeled set of high-spectrum image pixel points,
Figure BDA0003037269460000184
to represent
Figure BDA0003037269460000185
The point of the ith pixel of (a),
Figure BDA0003037269460000186
represents a normalized set of labeled LiDAR pixel points,
Figure BDA0003037269460000187
to represent
Figure BDA0003037269460000188
The ith pixel point of (1);
Figure BDA0003037269460000189
Figure BDA00030372694600001810
Step 3.3, taking each pixel of the normalized hyperspectral set as a center, dividing H' into a series of 11 × 11 hyperspectral pixel blocks that form the hyperspectral image test set; then taking each pixel of the normalized LiDAR set as a center, dividing L' into a series of 11 × 11 LiDAR pixel blocks that form the LiDAR image test set;
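As an illustration of step 3.3 (and the analogous step 2.3), the following NumPy sketch cuts an 11 × 11 block around every pixel; reflection padding at the image border is an assumption, since the border handling is not specified.

```python
import numpy as np

def extract_blocks(image, size=11):
    """Cut one size x size block centered on every pixel of `image`
    (H x W x C for hyperspectral data, H x W x 1 for LiDAR data).
    Border handling by reflection padding is an assumption."""
    pad = size // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    h, w = image.shape[:2]
    blocks = np.empty((h * w, size, size, image.shape[2]), dtype=image.dtype)
    for r in range(h):
        for c in range(w):
            blocks[r * w + c] = padded[r:r + size, c:c + size, :]
    return blocks

# usage: blocks_h = extract_blocks(h_prime); blocks_l = extract_blocks(l_prime[..., None])
```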
Step 3.4, using the sub-networks NfeatureSpe and NfeatureSpa to extract the features of the test set;
Step 3.4.1, using the sub-network NfeatureSpe to extract features from the hyperspectral image test set and obtain the spectral feature Tspe of the hyperspectral image H';
Step 3.4.2, using the sub-network NfeatureSpa to extract features from the hyperspectral image test set and obtain the spatial feature Tspa of the hyperspectral image H';
Step 3.4.3, using the sub-network NfeatureSpa to extract features from the LiDAR image test set and obtain the elevation feature TL of the LiDAR image L';
Step 3.5, using the sub-network Nshallowfusion to perform feature-level shallow fusion and obtain the shallow features;
Step 3.5.1, using the LSAM module to process the spectral feature Tspe and obtain the spectral attention feature TspeA of the hyperspectral image H';
Step 3.5.2, using the LDPAM module to process the spatial feature Tspa and the elevation feature TL and obtain the spatial modal attention feature TmaHF of the hyperspectral image H';
Step 3.5.3, using the LDPAM module to process the elevation feature TL and the spatial feature Tspa and obtain the spatial modal attention feature TmaLF of the LiDAR image L';
Step 3.6, using the sub-network Ndeepfusion to perform feature-level deep fusion and obtain the deep features;
Step 3.6.1, using the maximum pooling layer MaxPool1 to calculate the deep spectral feature of the hyperspectral image H' from the spectral attention feature TspeA;
Step 3.6.2, using the maximum pooling layer MaxPool2 to calculate the deep spatial feature of the hyperspectral image H' from the spatial modal attention feature TmaHF;
Step 3.6.3, using the maximum pooling layer MaxPool2 to calculate the deep elevation feature of the LiDAR image L' from the spatial modal attention feature TmaLF;
Step 3.6.4, using the custom connection layer Concatenate to fuse the deep spectral feature, the deep spatial feature and the deep elevation feature into the deep feature TM;
Step 3.7, using the sub-network Ncls to classify the deep feature TM and calculate the classification prediction result TEpred.
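Putting steps 3.1 to 3.7 together, a minimal inference routine might look as follows; model is a hypothetical trained network with the two-branch input assumed above, and normalize and extract_blocks refer to the illustrative helpers sketched earlier.

```python
import numpy as np
import torch

@torch.no_grad()
def classify_image(model, h_img, l_img, normalize, extract_blocks,
                   batch_size=256, device="cpu"):
    """Steps 3.1-3.7 in schematic form: normalize both images, cut 11 x 11 blocks
    around every pixel, push them through the trained network and take the
    arg-max class of each prediction as the classification map."""
    model.to(device).eval()
    hh, hw, hc = h_img.shape                                   # hyperspectral cube H x W x C
    h_norm = normalize(h_img.reshape(-1, hc)).reshape(hh, hw, hc)   # step 3.2
    l_norm = normalize(l_img.reshape(-1, 1)).reshape(hh, hw, 1)     # single-band LiDAR DSM
    h_blocks = extract_blocks(h_norm)                          # step 3.3
    l_blocks = extract_blocks(l_norm)
    preds = []
    for i in range(0, len(h_blocks), batch_size):              # steps 3.4-3.7
        hb = torch.from_numpy(h_blocks[i:i + batch_size]).float().permute(0, 3, 1, 2)
        lb = torch.from_numpy(l_blocks[i:i + batch_size]).float().permute(0, 3, 1, 2)
        preds.append(model(hb.to(device), lb.to(device)).argmax(dim=1).cpu().numpy())
    return np.concatenate(preds).reshape(hh, hw)
```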
In order to verify the effectiveness of the method, experiments are carried out on the publicly available Houston and Trento data sets. The fusion classification results are evaluated with the overall accuracy (OA), the average accuracy (AA) and the Kappa coefficient as objective indices, and are compared with the SVM, ELM, CNN-PPF, Two-Branch CNN and EndNet methods.
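The three objective indices can be computed from a confusion matrix as in the following sketch, which uses the standard definitions of OA, AA and the Kappa coefficient rather than anything specific to the patent.

```python
import numpy as np

def evaluate(y_true, y_pred, num_classes):
    """Overall accuracy (OA), average accuracy (AA) and Kappa coefficient
    from integer label arrays, via the confusion matrix."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total                                   # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1).clip(min=1))      # mean per-class accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```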
The main challenge of the land-cover classification task is misclassification. For classification based on remote sensing images, the most common error is classifying bare soil as grassland. As can be seen from Table 1, the SVM, ELM, CNN-PPF, Two-Branch CNN and EndNet methods do not fully exploit the interaction among the features acquired by different sensors, so their classification accuracy is limited; the proposed method still shows some confusion among the three grassland classes in different states, but almost no misclassification between grassland as a whole and bare soil. In addition, for the tennis court and running track classes in Table 1 and the wood and vineyard classes in Table 2, the proposed method produces no classification errors and reaches an accuracy of 100%. As can be seen from Table 1, for the Houston data set the OA of the proposed method is 9.57%, 8.14%, 6.73%, 2.08% and 1.54% higher than that of the SVM, ELM, CNN-PPF, Two-Branch CNN and EndNet methods respectively, an average improvement of 5.61%; its Kappa coefficient is 10.26%, 8.79%, 7.36%, 2.26% and 1.65% higher respectively, an average improvement of 6.06%. As can be seen from Table 2, for the Trento data set the OA of the proposed method is 6.26%, 13.22%, 4.27%, 1.11% and 4.86% higher than that of the SVM, ELM, CNN-PPF, Two-Branch CNN and EndNet methods respectively, an average improvement of 5.94%; its Kappa coefficient is 2.85%, 17.34%, 5.66%, 1.89% and 6.48% higher respectively, an average improvement of 6.84%.
FIG. 1 shows the classification results of different methods on the Houston data set, wherein (a) is the HSI pseudo-color image; (b) is the digital surface model derived from the LiDAR image; (c) is the ground-truth classification map; (d) is the classification result of the SVM method, with an overall accuracy of 80.49%; (e) is the classification result of the ELM method, with an overall accuracy of 81.92%; (f) is the classification result of the CNN-PPF method, with an overall accuracy of 83.33%; (g) is the classification result of the Two-Branch CNN method, with an overall accuracy of 87.98%; (h) is the classification result of the EndNet method, with an overall accuracy of 88.52%; (i) is the classification result of the invention, with an overall accuracy of 90.06%.
FIG. 2 shows the classification results of different methods on the Trento data set, wherein (a) is the HSI pseudo-color image; (b) is the digital surface model derived from the LiDAR image; (c) is the ground-truth classification map; (d) is the classification result of the SVM method, with an overall accuracy of 92.77%; (e) is the classification result of the ELM method, with an overall accuracy of 85.81%; (f) is the classification result of the CNN-PPF method, with an overall accuracy of 94.76%; (g) is the classification result of the Two-Branch CNN method, with an overall accuracy of 97.92%; (h) is the classification result of the EndNet method, with an overall accuracy of 94.17%; (i) is the classification result of the invention, with an overall accuracy of 99.03%.
As can be seen from FIG. 1 and FIG. 2, for areas that are difficult to classify, the invention identifies the various ground objects more effectively and, in particular, judges the commercial area in the upper right corner of FIG. 2 relatively accurately. Moreover, because the multi-branch input helps to reduce information loss, the invention obtains smoother and more accurate classification results than the five comparison methods, namely the SVM, ELM, CNN-PPF, Two-Branch CNN and EndNet methods.
The comparisons in Table 1, Table 2, FIG. 1 and FIG. 2 show that fully exploiting the interaction among the data from different sensors effectively improves the fusion quality and classification accuracy of multi-sensor remote sensing images.
Table 1 Classification accuracy comparison on the Houston data set (%)
Table 2 Classification accuracy comparison on the Trento data set (%)

Claims (1)

1. A multi-sensor remote sensing image fusion classification method capable of layering dense fusion network is characterized by comprising the following steps:
step 1, establishing and initializing a convolutional neural network Nahd for multi-sensor remote sensing image fusion and classification, the Nahd comprising 2 sub-networks NfeatureSpe and NfeatureSpa for feature extraction, 1 sub-network Nshallowfusion for shallow feature fusion, 1 sub-network Ndeepfusion for deep feature fusion, and 1 sub-network Ncls for classification;
Step 1.1, establishing and initializing the sub-network NfeatureSpa, which comprises 4 groups of convolutional layers: Conv2_0, Conv2_1, Conv2_2 and Conv2_3;
the Conv2_0 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 100 convolution kernels with the size of 3 x 3, each convolution kernel performs convolution operation by taking 1 pixel as a step size, and a nonlinear activation function ReLU is selected as an activation function for operation;
the Conv2_1 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 100 convolution kernels with the size of 3 x 3, each convolution kernel performs convolution operation by taking 1 pixel as a step size, and a nonlinear activation function ReLU is selected as an activation function for operation;
the Conv2_2 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 100 convolution kernels with the size of 1 × 1, each convolution kernel performs convolution operation by taking 1 pixel as a step size, and a nonlinear activation function ReLU is selected as an activation function for operation;
the Conv2_3 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 100 convolution kernels with the size of 1 × 1, each convolution kernel performs convolution operation by taking 1 pixel as a step size, and a nonlinear activation function ReLU is selected as an activation function for operation;
Step 1.2, establishing and initializing the sub-network NfeatureSpe, which comprises 2 groups of convolutional layers: Conv1_0 and Conv1_1;
the Conv1_0 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 64 one-dimensional convolution kernels with the size of 11, each convolution kernel performs convolution operation by taking 1 pixel as a step length, and a nonlinear activation function ReLU is selected as an activation function for operation;
the Conv1_1 comprises 1-layer convolution operation, 1-layer BatchNorm normalization operation and 1-layer activation operation, wherein the convolution layer comprises 128 one-dimensional convolution kernels with the size of 3, each convolution kernel performs convolution operation by taking 1 pixel as a step length, and a nonlinear activation function ReLU is selected as an activation function for operation;
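For illustration, a minimal PyTorch sketch of the two feature-extraction sub-networks of steps 1.1 and 1.2 is given below; the padding choices and input layouts are assumptions, since the claim only fixes the kernel counts, kernel sizes, stride and activation.

```python
import torch
import torch.nn as nn

def conv2d_block(in_ch, out_ch, k):
    """Conv2_x pattern: one 2-D convolution (stride 1), BatchNorm, ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=1, padding=k // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def conv1d_block(in_ch, out_ch, k):
    """Conv1_x pattern: one 1-D convolution (stride 1), BatchNorm, ReLU."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=k, stride=1, padding=k // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(inplace=True),
    )

class SpatialBranch(nn.Module):
    """Sketch of N_featureSpa: Conv2_0..Conv2_3 with 100 kernels each (3x3, 3x3, 1x1, 1x1)."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            conv2d_block(in_ch, 100, 3), conv2d_block(100, 100, 3),
            conv2d_block(100, 100, 1), conv2d_block(100, 100, 1),
        )
    def forward(self, x):                # x: (B, in_ch, 11, 11) pixel block
        return self.net(x)

class SpectralBranch(nn.Module):
    """Sketch of N_featureSpe: Conv1_0 (64 kernels, size 11) and Conv1_1 (128 kernels, size 3)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv1d_block(1, 64, 11), conv1d_block(64, 128, 3))
    def forward(self, x):                # x: (B, 1, num_bands), the center-pixel spectrum
        return self.net(x)
```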
Step 1.3, establishing and initializing the sub-network Nshallowfusion, which comprises 6 groups of parallel convolutional layers (Conv2_Q1, Conv2_K1, Conv2_V1, Conv2_Q2, Conv2_K2 and Conv2_V2) and 2 custom modules (LSAM and LDPAM);
the Conv2_Q1 comprises 1 layer of convolution operation, which contains 25 convolution kernels with the size of 1 × 1, and each convolution kernel carries out the convolution operation with a step size of 1 pixel;
the Conv2_K1 comprises 1 layer of convolution operation, which contains 25 convolution kernels with the size of 1 × 1, and each convolution kernel carries out the convolution operation with a step size of 1 pixel;
the Conv2_V1 comprises 1 layer of convolution operation, which contains 200 convolution kernels with the size of 1 × 1, and each convolution kernel carries out the convolution operation with a step size of 1 pixel;
the Conv2_Q2 comprises 1 layer of convolution operation, which contains 25 convolution kernels with the size of 1 × 1, and each convolution kernel carries out the convolution operation with a step size of 1 pixel;
the Conv2_K2 comprises 1 layer of convolution operation, which contains 25 convolution kernels with the size of 1 × 1, and each convolution kernel carries out the convolution operation with a step size of 1 pixel;
the Conv2_V2 comprises 1 layer of convolution operation, which contains 200 two-dimensional convolution kernels with the size of 1 × 1, and each convolution kernel carries out the convolution operation with a step size of 1 pixel;
the LSAM module maps the input three-dimensional tensor F to a feature FspeR of size Cspe × N2 with a reshape operation, wherein Cspe denotes the number of input channels and N2 = 1 × 1; it then calculates the spectral attention matrix FspeS according to formula (1), wherein the element in the jth row and ith column of FspeS is obtained from the inner product of the ith channel of FspeR and the transpose of the jth channel of FspeR; finally, FspeR and FspeS are multiplied according to formula (2) to obtain the spectral attention feature FspeA, wherein γ represents a preset coefficient;
the LDPAM module comprises the following 7 steps:
(a) feeding the input three-dimensional tensor F1 into the convolutional layer Conv2_Q1 to calculate the feature FQ1, feeding F1 into the convolutional layer Conv2_K1 to calculate the feature FK1, and feeding F1 into the convolutional layer Conv2_V1 to calculate the feature FV1, wherein FQ1,i, FK1,i and FV1,i respectively denote the ith elements of FQ1, FK1 and FV1, Cspa denotes the number of channels of the input tensor, Hspa and Wspa respectively denote the length and width of the input tensor, and K1 = 25, K2 = 25, K3 = 200;
(b) feeding the input three-dimensional tensor F2 into the convolutional layer Conv2_Q2 to calculate the feature FQ2, feeding F2 into the convolutional layer Conv2_K2 to calculate the feature FK2, and feeding F2 into the convolutional layer Conv2_V2 to calculate the feature FV2, wherein FQ2,i, FK2,i and FV2,i respectively denote the ith elements of FQ2, FK2 and FV2;
(c) mapping FQ1 and FK1 to the corresponding two-dimensional space with a reshape operation and calculating the spatial attention matrix FspaX according to formula (3), wherein N1 denotes the total number of features and N1 = Hspa × Wspa, and the element in the jth row and ith column of FspaX is obtained from the ith element of FQ1 and the transpose of the jth element of FK1;
(d) mapping FV1 to the corresponding two-dimensional space with a reshape operation and calculating the spatial attention feature FspaA according to formula (4), wherein ηspa is a preset scaling factor and the vector formed by the elements of the ith row of FspaX serves as the corresponding weight;
(e) mapping FQ2 and FK2 to the corresponding two-dimensional space with a reshape operation and calculating the modal attention matrix FmX according to formula (5), wherein the element in the jth row and ith column of FmX is obtained from the ith element of FQ2 and the transpose of the jth element of FK2;
(f) calculating the modal attention feature FmA according to formula (6), wherein ε2 denotes a preset scaling factor and the vector formed by the elements of the ith row of FmX serves as the corresponding weight;
(g) calculating the spatially weighted feature FmaF according to formula (7):
FmaF = α1F1 + α2FspaA + α3FmA    (7)
wherein α1, α2 and α3 denote preset weight coefficients;
Step 1.4, establishing and initializing the sub-network Ndeepfusion, which comprises 2 groups of maximum pooling layers and 1 group of custom connection layers, namely MaxPool1, MaxPool2 and Concatenate;
the MaxPool1 comprises 1-layer pooling operation and 1-layer Flatten operation, wherein the pooling layer carries out maximum pooling operation by using a one-dimensional pooling kernel with the size of 1;
the MaxPool2 comprises a 1-layer pooling operation, a 2-layer fully connected operation, a 2-layer activation operation and a 1-layer Flatten operation, wherein the pooling layer performs maximum pooling with a pooling kernel of size 2 × 2, the 2 fully connected layers comprise 1024 and 512 output units respectively with ReLU selected as the activation function, and a Dropout operation with parameter 0.4 is then executed to obtain 3 three-dimensional tensors;
the Concatenate performs, according to formula (8), a fusion operation on the 3 three-dimensional tensors, followed by 3 Dropout operations with parameter 0.5, wherein ω and b denote the weights and biases of the fully connected layer and "|" denotes the operation of connecting the spectral features with the spatial features;
Step 1.5, establishing and initializing the sub-network Ncls, which comprises 1 group of fully connected layers, namely Dense1;
the Dense1 has num classification units and takes Softmax as an activation function, wherein num represents the total number of the ground feature categories to be classified;
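Steps 1.4 and 1.5 can be approximated by the sketch below; the flattened feature sizes, the sharing of one MaxPool2 head between the spatial and elevation branches, the placement of the Dropout layers and the application of Softmax inside the loss are assumptions based on the claim wording rather than the exact formula (8).

```python
import torch
import torch.nn as nn

class DeepFusionHead(nn.Module):
    """Sketch of N_deepfusion + N_cls: MaxPool1 (1-D pooling + flatten) for the
    spectral branch, MaxPool2 (2x2 pooling, FC 1024 -> 512 with ReLU, Dropout 0.4,
    flatten) for the spatial/elevation branches, concatenation with Dropout 0.5,
    and a Dense1 classifier with `num` output units (Softmax applied in the loss)."""
    def __init__(self, spe_dim, spa_dim, num):
        super().__init__()
        self.pool1 = nn.Sequential(nn.MaxPool1d(kernel_size=1), nn.Flatten())
        self.pool2 = nn.Sequential(                 # one shared head as a simplification
            nn.MaxPool2d(2), nn.Flatten(),
            nn.Linear(spa_dim, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 512), nn.ReLU(inplace=True),
            nn.Dropout(0.4),
        )
        self.drop = nn.Dropout(0.5)
        self.cls = nn.Linear(spe_dim + 512 + 512, num)   # Dense1

    def forward(self, f_spe_a, f_ma_hf, f_ma_lf):
        d_spe = self.pool1(f_spe_a)                 # deep spectral feature
        d_spa = self.pool2(f_ma_hf)                 # deep spatial feature
        d_ele = self.pool2(f_ma_lf)                 # deep elevation feature
        fm = self.drop(torch.cat([d_spe, d_spa, d_ele], dim=1))   # Concatenate
        return self.cls(fm)                         # logits; Softmax applied in the loss
```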
step 2, inputting the training set H of the hyperspectral image, the training set L of the LiDAR image, and the artificially marked pixel coordinate set and label set, and training Nahd;
Step 2.1, extracting the set of all labeled pixels XH = {xH,i | i = 1, …, M} from the training set H of the hyperspectral image according to the artificially marked pixel coordinate set, and extracting the set of all labeled pixels XL = {xL,i | i = 1, …, M} from the training set L of the LiDAR image, wherein xH,i denotes the ith pixel of XH, xL,i denotes the ith pixel of XL, and M denotes the total number of labeled pixels;
Step 2.2, normalizing XH and XL according to formula (9) and formula (10) to obtain the normalized set of labeled hyperspectral image pixels and the normalized set of labeled LiDAR pixels, whose ith elements are the normalized counterparts of xH,i and xL,i, respectively;
Step 2.3, taking each pixel of the normalized hyperspectral set as a center, dividing H into a series of 11 × 11 hyperspectral pixel blocks that form the set XH1, and taking each pixel of the normalized LiDAR set as a center, dividing L into a series of 11 × 11 LiDAR pixel blocks that form the set XL1;
Step 2.4, flipping each pixel block in XH1 and XL1 upside down to obtain the hyperspectral pixel block set XH2 and the LiDAR pixel block set XL2;
Step 2.5, adding Gaussian noise with variance 0.01 to each pixel block in XH1 to obtain the hyperspectral pixel block set XH3, and adding Gaussian noise with variance 0.03 to each pixel block in XL1 to obtain the LiDAR pixel block set XL3;
Step 2.6, rotating each pixel block in XH1 clockwise by a random n × 90° about its center point to obtain the hyperspectral pixel block set XH4, and rotating each pixel block in XL1 clockwise by a random n × 90° about its center point to obtain the LiDAR pixel block set XL4, wherein n denotes a value randomly selected from the set {1, 2, 3};
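Steps 2.4 to 2.6 amount to three simple block-level augmentations; the NumPy sketch below illustrates them, taking the noise standard deviation as the square root of the stated variances.

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_blocks(blocks):
    """Step 2.4: flip every block (N, h, w, c) upside down."""
    return blocks[:, ::-1, :, :].copy()

def add_gaussian_noise(blocks, variance):
    """Step 2.5: add zero-mean Gaussian noise with the given variance."""
    return blocks + rng.normal(0.0, np.sqrt(variance), size=blocks.shape)

def rotate_blocks(blocks):
    """Step 2.6: rotate every block clockwise by a random n x 90 degrees, n in {1, 2, 3}."""
    out = np.empty_like(blocks)
    for i, blk in enumerate(blocks):
        n = rng.integers(1, 4)                     # n randomly chosen from {1, 2, 3}
        out[i] = np.rot90(blk, k=-n, axes=(0, 1))  # negative k = clockwise rotation
    return out

# usage: x_h2 = flip_blocks(x_h1); x_h3 = add_gaussian_noise(x_h1, 0.01); x_h4 = rotate_blocks(x_h1)
```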
Step 2.7, taking the union of the hyperspectral pixel block sets XH1, XH2, XH3 and XH4 and the union of the LiDAR pixel block sets XL1, XL2, XL3 and XL4 as the training set of the fusion and classification neural network, and organizing the samples in the training set into triples of the form (hyperspectral pixel block, LiDAR pixel block, Yi) as the network input, wherein the hyperspectral pixel block and the LiDAR pixel block of each triple have the same spatial coordinates and Yi denotes their corresponding ground-truth class label; letting the iteration counter iter ← 1 and executing step 2.8 to step 2.13;
Step 2.8, using the sub-networks NfeatureSpe and NfeatureSpa to extract the features of the training set;
Step 2.8.1, using the sub-network NfeatureSpe to extract features from the hyperspectral training set and obtain the shallow spectral feature Fspe of the hyperspectral image;
Step 2.8.2, using the sub-network NfeatureSpa to extract features from the hyperspectral training set and obtain the shallow spatial feature Fspa of the hyperspectral image;
Step 2.8.3, using the sub-network NfeatureSpa to extract features from the LiDAR training set and obtain the shallow elevation feature FL of the LiDAR image;
Step 2.9, using the sub-network Nshallowfusion to perform feature-level shallow fusion and obtain the shallow features;
Step 2.9.1, using the LSAM module to process the shallow spectral feature Fspe and obtain the spectral attention feature FspeA of the hyperspectral image;
Step 2.9.2, using the LDPAM module to process the shallow spatial feature Fspa and the shallow elevation feature FL and obtain the spatial modal attention feature FmaHF of the hyperspectral image;
Step 2.9.3, using the LDPAM module to process the shallow elevation feature FL and the shallow spatial feature Fspa and obtain the spatial modal attention feature FmaLF of the LiDAR image;
Step 2.10, using the sub-network Ndeepfusion to perform feature-level deep fusion and obtain the deep features;
Step 2.10.1, using the maximum pooling layer MaxPool1 to calculate the deep spectral feature of the hyperspectral image from the spectral attention feature FspeA;
Step 2.10.2, using the maximum pooling layer MaxPool2 to calculate the deep spatial feature of the hyperspectral image from the spatial modal attention feature FmaHF;
Step 2.10.3, using the maximum pooling layer MaxPool2 to calculate the deep elevation feature of the LiDAR image from the spatial modal attention feature FmaLF;
Step 2.10.4, using the custom connection layer Concatenate to fuse the deep spectral feature, the deep spatial feature and the deep elevation feature into the deep feature FM;
Step 2.11, using the sub-network Ncls to classify the deep feature FM and calculate the classification prediction result TRpred;
Step 2.12, taking the weighted cross entropy defined by formula (11) and formula (12) as the loss function Lω-C, wherein ωj denotes the weight of the jth class, the probability that a pixel belongs to the jth class of ground objects appears in formula (11), and nj denotes the number of ground-truth samples of the jth class in the training set;
Step 2.13, if all pixel blocks in the training set have been processed, going to step 2.14; otherwise, taking an unprocessed group of pixel blocks from the training set and returning to step 2.8;
Step 2.14, letting iter ← iter + 1; if iter > Total_iter, the trained convolutional neural network Nahd is obtained and the method goes to step 3; otherwise, updating the parameters of Nahd with the back-propagation algorithm based on stochastic gradient descent and the prediction loss Lω-C and returning to step 2.8 to reprocess all pixel blocks in the training set, wherein Total_iter denotes the preset number of iterations;
step 3, inputting the unlabeled hyperspectral image H' and LiDAR image L', performing data preprocessing on all pixels of H' and L', and completing the pixel classification with the trained convolutional neural network Nahd;
Step 3.1, extracting all pixels of H' to form the set TH = {tH,i | i = 1, …, U} and all pixels of L' to form the set TL = {tL,i | i = 1, …, U}, wherein tH,i denotes the ith pixel of TH, tL,i denotes the ith pixel of TL, and U denotes the total number of pixels;
Step 3.2, normalizing TH and TL according to formula (17) and formula (18) to obtain the normalized set of hyperspectral image pixels and the normalized set of LiDAR pixels, whose ith elements are the normalized counterparts of tH,i and tL,i, respectively;
Step 3.3, taking each pixel of the normalized hyperspectral set as a center, dividing H' into a series of 11 × 11 hyperspectral pixel blocks that form the hyperspectral image test set; then taking each pixel of the normalized LiDAR set as a center, dividing L' into a series of 11 × 11 LiDAR pixel blocks that form the LiDAR image test set;
Step 3.4, using the sub-networks NfeatureSpe and NfeatureSpa to extract the features of the test set;
Step 3.4.1, using the sub-network NfeatureSpe to extract features from the hyperspectral image test set and obtain the spectral feature Tspe of the hyperspectral image H';
Step 3.4.2, using the sub-network NfeatureSpa to extract features from the hyperspectral image test set and obtain the spatial feature Tspa of the hyperspectral image H';
Step 3.4.3, using the sub-network NfeatureSpa to extract features from the LiDAR image test set and obtain the elevation feature TL of the LiDAR image L';
Step 3.5, using the sub-network Nshallowfusion to perform feature-level shallow fusion and obtain the shallow features;
Step 3.5.1, using the LSAM module to process the spectral feature Tspe and obtain the spectral attention feature TspeA of the hyperspectral image H';
Step 3.5.2, using the LDPAM module to process the spatial feature Tspa and the elevation feature TL and obtain the spatial modal attention feature TmaHF of the hyperspectral image H';
Step 3.5.3, using the LDPAM module to process the elevation feature TL and the spatial feature Tspa and obtain the spatial modal attention feature TmaLF of the LiDAR image L';
Step 3.6, using the sub-network Ndeepfusion to perform feature-level deep fusion and obtain the deep features;
Step 3.6.1, using the maximum pooling layer MaxPool1 to calculate the deep spectral feature of the hyperspectral image H' from the spectral attention feature TspeA;
Step 3.6.2, using the maximum pooling layer MaxPool2 to calculate the deep spatial feature of the hyperspectral image H' from the spatial modal attention feature TmaHF;
Step 3.6.3, using the maximum pooling layer MaxPool2 to calculate the deep elevation feature of the LiDAR image L' from the spatial modal attention feature TmaLF;
Step 3.6.4, using the custom connection layer Concatenate to fuse the deep spectral feature, the deep spatial feature and the deep elevation feature into the deep feature TM;
Step 3.7, using the sub-network Ncls to classify the deep feature TM and calculate the classification prediction result TEpred.
CN202110446906.6A 2021-04-25 2021-04-25 Multi-sensor remote sensing image fusion classification method capable of layering dense fusion network Withdrawn CN113255727A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110446906.6A CN113255727A (en) 2021-04-25 2021-04-25 Multi-sensor remote sensing image fusion classification method capable of layering dense fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110446906.6A CN113255727A (en) 2021-04-25 2021-04-25 Multi-sensor remote sensing image fusion classification method capable of layering dense fusion network

Publications (1)

Publication Number Publication Date
CN113255727A true CN113255727A (en) 2021-08-13

Family

ID=77221568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110446906.6A Withdrawn CN113255727A (en) 2021-04-25 2021-04-25 Multi-sensor remote sensing image fusion classification method capable of layering dense fusion network

Country Status (1)

Country Link
CN (1) CN113255727A (en)


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887645A (en) * 2021-10-13 2022-01-04 西北工业大学 Remote sensing image fusion classification method based on joint attention twin network
CN113887645B (en) * 2021-10-13 2024-02-13 西北工业大学 Remote sensing image fusion classification method based on joint attention twin network
CN113920323A (en) * 2021-11-18 2022-01-11 西安电子科技大学 Different-chaos hyperspectral image classification method based on semantic graph attention network
CN113920323B (en) * 2021-11-18 2023-04-07 西安电子科技大学 Different-chaos hyperspectral image classification method based on semantic graph attention network
CN114565858A (en) * 2022-02-25 2022-05-31 辽宁师范大学 Multispectral image change detection method based on geospatial perception low-rank reconstruction network
CN114565858B (en) * 2022-02-25 2024-04-05 辽宁师范大学 Multispectral image change detection method based on geospatial perception low-rank reconstruction network
CN114663777A (en) * 2022-03-07 2022-06-24 辽宁师范大学 Hyperspectral image change detection method based on spatio-temporal joint graph attention mechanism
CN114663777B (en) * 2022-03-07 2024-04-05 辽宁师范大学 Hyperspectral image change detection method based on space-time joint graph attention mechanism
CN114663779A (en) * 2022-03-25 2022-06-24 辽宁师范大学 Multi-temporal hyperspectral image change detection method based on time-space-spectrum attention mechanism
CN114581838A (en) * 2022-04-26 2022-06-03 阿里巴巴达摩院(杭州)科技有限公司 Image processing method and device and cloud equipment
CN116051896A (en) * 2023-01-28 2023-05-02 西南交通大学 Hyperspectral image classification method of lightweight mixed tensor neural network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210813