CN117475216A - Hyperspectral and laser radar data fusion classification method based on AGLT network - Google Patents

Hyperspectral and laser radar data fusion classification method based on AGLT network

Info

Publication number
CN117475216A
Authority
CN
China
Prior art keywords
output
layer
result
hyperspectral
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311439960.3A
Other languages
Chinese (zh)
Inventor
王敏慧
孙亚秀
项建弘
王霖郁
黄丽莲
钟瑜
孙蕊
武雅若
蒋涵宇
王英
徐昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202311439960.3A priority Critical patent/CN117475216A/en
Publication of CN117475216A publication Critical patent/CN117475216A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/194 Terrestrial scenes using hyperspectral data, i.e. more or other wavelengths than RGB

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Remote Sensing (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Astronomy & Astrophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Processing (AREA)

Abstract

A hyperspectral and laser radar data fusion classification method based on an AGLT network belongs to the field of hyperspectral image classification. The invention solves the problem of poor classification performance of existing methods. The invention can capture and learn joint spatial-spectral features from hyperspectral image data and acquire elevation features from LiDAR-DSM data; asymmetric convolution kernels are introduced into a vision Transformer structure, making full use of the strong spatial-context extraction capability of the convolutional neural network and the strong long-range dependency modeling capability of the self-attention-based vision Transformer; and a Bi feedforward unit is designed for the vision Transformer to fully extract the global and local information of the data, improving the classification performance of the model. The method can be applied to hyperspectral image classification.

Description

Hyperspectral and laser radar data fusion classification method based on AGLT network
Technical Field
The invention belongs to the field of hyperspectral image classification, and particularly relates to a hyperspectral and laser radar data fusion classification method based on an AGLT network.
Background
Remote sensing technology has advanced significantly in recent years, increasing the availability of remote sensing images. Typically, data from several remote sensing devices covering the same geographical area are available, so multi-modal data can be used to analyze land-cover information. Different sensor technologies effectively capture different properties of the land cover. As massive amounts of data are acquired, the differences between data of different modalities become complementary advantages, effectively improving remote sensing ground-object classification. In multi-modal data fusion and land-cover interpretation tasks, the fused interpretation of hyperspectral image (HSI) and light detection and ranging digital surface model (LiDAR-DSM) data has been an important issue requiring attention. A hyperspectral image sensor obtains spectral and geospatial information, while the light detection and ranging digital surface model measures surface elevation and object height information. By integrating data of different modalities, more detailed information can be obtained, constructing a complete feature representation.
Fusing multiple data sources can improve the accuracy of land-cover recognition, but many technical obstacles remain, such as differing data structures and unrelated physical features. Hyperspectral data and LiDAR-DSM images have already been used successfully in combination, where convolutional neural networks are a common method for their joint classification and are powerful tools for feature extraction and context modeling. However, owing to inherent limitations of the network backbone, convolutional neural networks cannot establish long-distance connections across the global image and remain weak at capturing the sequential properties of spectral features; their convolution operations capture only local information, making it difficult to obtain discriminative spectral-spatial features from a global perspective. A vision Transformer backbone network can address these challenges and offers new insight into multi-modal image classification: the features it acquires in shallow and deep layers are more similar to one another, and it is good at capturing global feature information of the image, but it lacks the prior knowledge of scale, translation invariance and feature locality that the image itself possesses, and must learn high-quality intermediate representations using large-scale datasets. However, because of the lack of training data, combining a convolutional neural network with a vision Transformer backbone network remains difficult; the classification performance of existing methods is therefore still poor and needs further improvement.
Disclosure of Invention
The invention aims to solve the problem of poor classification performance of the existing method, and provides a hyperspectral and laser radar data fusion classification method based on an AGLT network.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a hyperspectral and laser radar data fusion classification method based on an AGLT network specifically comprises the following steps:
step 1, acquiring LiDAR-DSM image data and hyperspectral image data;
step 2, performing dimension reduction processing on the acquired hyperspectral image data by adopting a principal component analysis method to obtain dimension-reduced hyperspectral image data;
step 3, slicing the acquired LiDAR-DSM image data and the hyperspectral image data subjected to dimension reduction processing, and dividing the LiDAR-DSM image data and the hyperspectral image data subjected to slicing processing into a training set and a testing set;
step 4, constructing an AGLT network, training the constructed AGLT network by using a training set until the classification accuracy of the AGLT network on a test set is not improved any more, and obtaining a trained AGLT network;
step 5, jointly processing the hyperspectral image to be classified and the LiDAR-DSM image by using the trained AGLT network to obtain a classification result.
Further, the AGLT network comprises a hyperspectral image data processing branch and a LiDAR-DSM data processing branch, wherein:
in a hyperspectral image data processing branch, an input image sequentially passes through a three-dimensional normalization layer, a first activation function layer, a first three-dimensional convolution layer with a convolution kernel size of 1×1×3, a second three-dimensional convolution layer with a convolution kernel size of 1×3×1, a third three-dimensional convolution layer with a convolution kernel size of 3×1×1, a first two-dimensional normalization layer, a second activation function layer, a first two-dimensional convolution layer with a convolution kernel size of 3×1, a second two-dimensional convolution layer with a convolution kernel size of 1×3 and a first Bi-Former module;
in the LiDAR-DSM data processing branch, the input LiDAR-DSM data sequentially passes through a second two-dimensional normalization layer, a third activation function layer, a third two-dimensional convolution layer with a convolution kernel size of 3×1, a fourth two-dimensional convolution layer with a convolution kernel size of 1×3 and a second Bi-Former module;
and sending the output of the first Bi-Former module and the output of the second Bi-Former module into a cross attention layer together to obtain an output A and an output B, then enabling the output A to pass through a first MLP layer, enabling the output B to pass through a second MLP layer, and finally superposing the output of the first MLP layer and the output of the second MLP layer, wherein a superposition result is a final classification result.
Further, the working principle of the Bi-Former module is as follows:
mapping an input image of the Bi-Former module into a vector sequence, embedding additional learning codes into the head of the vector sequence to obtain an overall sequence, embedding position codes into the overall sequence, and enabling the vector sequence after embedding the position codes to pass through an encoder submodule, wherein the output of the encoder submodule is used as the output of the Bi-Former module; and the encoder sub-module comprises N encoders;
the first encoder works on the principle that:
step one, the vector sequence with embedded position coding input by the encoder submodule is the input of the first encoder, and this vector sequence is mapped into a query vector, a key vector and a value vector respectively;
step two, multi-head attention calculation is carried out on the query vector, the key vector and the value vector, and a multi-head attention calculation result is obtained;
thirdly, carrying out residual connection on the multi-head attention calculation result in the second step and the vector sequence after embedding the position codes, and normalizing the residual connection result;
step four, sending the normalization result of the step three into a Bi feedforward unit, carrying out residual connection on the output of the Bi feedforward unit and the normalization result of the step three, normalizing the residual connection result, and taking the normalization result as the output of the first encoder;
and taking the output of the first encoder as the input of the second encoder, and so on, until the output of the N-th encoder is obtained, wherein the output of the N-th encoder is the output of the encoder submodule.
Further, the multi-head attention calculating method comprises the following steps:
step 1), passing the query vector, the key vector and the value vector through their respective linear layers, and then jointly feeding the linear-layer outputs of the query vector, the key vector and the value vector into a first scaled dot-product attention unit, jointly feeding the same three outputs into a second scaled dot-product attention unit, and jointly feeding the same three outputs into a third scaled dot-product attention unit;
splicing the outputs of the first scaled dot-product attention unit, the second scaled dot-product attention unit and the third scaled dot-product attention unit to obtain a splicing result;
step 2), passing the splicing result through a linear layer, wherein the output after the linear-layer processing is the output of the multi-head attention.
Further, the first scaled dot-product attention unit is calculated as follows:
multiplying the linear-layer output of the query vector by the linear-layer output of the key vector, scaling the multiplication result, masking the scaled result, and finally passing the masked result through a Softmax activation function;
multiplying the output of the Softmax activation function by the linear-layer output of the value vector to obtain the calculation result of the first scaled dot-product attention unit.
Further, the working principle of the Bi feedforward unit is as follows:
the input of the Bi feedforward unit is sent to two parallel branches, wherein the first branch is a channel attention branch, the channel attention branch comprises a global subunit, a linear layer and a Sigmoid activation function layer, the second branch is a spatial attention branch, and the spatial attention branch comprises a local subunit, a linear layer and a Sigmoid activation function layer;
in the channel attention branch, the input of the Bi feedforward unit sequentially passes through the global subunit, the linear layer and the Sigmoid activation function layer to obtain an output X;
in the spatial attention branch, the input of the Bi feedforward unit first passes through the local subunit, then the output of the local subunit is spliced with the output of the global subunit, and the spliced result sequentially passes through the linear layer and the Sigmoid activation function layer of the spatial attention branch to obtain an output Y;
multiplying X and Y, and multiplying the multiplication result with the input of the Bi feedforward unit to obtain a final multiplication result, namely obtaining the output of the Bi feedforward unit.
Further, the global subunit comprises an average pooling layer, a linear layer and a GELU activation function layer; the local subunit comprises a linear layer and a GELU activation function layer.
Further, taking the hyperspectral image data as an example, the cross attention is calculated as follows:
the additional learning code of the hyperspectral data in the output of the first Bi-Former module is taken as the input of a linear projection function F_HSI(·), and the output of F_HSI(·) is mapped to a query vector through a matrix W_Q;
the output of F_HSI(·) is stacked with the LiDAR-DSM feature-code part of the output of the second Bi-Former module, and the stacked result is divided into two parts, the first part being mapped to a key vector through a matrix W_K and the second part being mapped to a value vector through a matrix W_V;
the query vector corresponding to the output of F_HSI(·) is multiplied by the key vector corresponding to the first part of the stacked result to obtain a multiplication result a, which is input into a Softmax function; the output of the Softmax function is multiplied by the value vector corresponding to the second part of the stacked result to obtain a multiplication result b, which is superposed with the query vector corresponding to the output of F_HSI(·); the superposition result is input into a linear back-projection function G_HSI(·), and the output of G_HSI(·) is then spliced with the hyperspectral feature-code part of the output of the first Bi-Former module, the splicing result being output A;
similarly, cross attention is calculated for the additional learning code of the LiDAR-DSM data to obtain output B.
The beneficial effects of the invention are as follows:
the invention can capture and learn hyperspectral air-spectrum combination characteristics from hyperspectral image data and acquire elevation characteristics from LiDAR-DSM data; the asymmetric convolution kernel is introduced into a visual transducer structure, so that the strong space context information extraction capability of the convolution neural network and the strong remote dependency modeling capability of the visual transducer based on a self-attention mechanism are fully utilized; the Bi feedforward unit is designed for the visual transducer to fully extract the global and local information of the data, so that the classification performance of the model is improved.
Drawings
FIG. 1 is a diagram of a model structure of an AGLT network;
FIG. 2 is a training flow diagram of an AGLT network;
FIG. 3 is a block diagram of a Bi-Former module;
FIG. 4 is a diagram of a multi-headed attention structure;
FIG. 5 is a block diagram of a Bi feed forward unit;
in the figure, (a) is channel attention and (b) is spatial attention;
FIG. 6 is a cross-attention block diagram;
FIG. 7 is a TR data presentation diagram;
in the figure, (a) is a hyperspectral false color chart, (b) is a DSM gray chart, and (c) is a truth chart;
FIG. 8 is a MU data presentation diagram;
in the figure, (a) is a hyperspectral false color chart, (b) is a DSM gray chart, and (c) is a truth chart;
fig. 9 is an AU data display diagram;
in the figure, (a) is a hyperspectral false color chart, (b) is a DSM gray chart, and (c) is a truth chart;
fig. 10 is a graph of classification results of different data by the AGLT network model.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings by way of specific embodiments. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the invention. Based on the embodiments of the present invention, other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present invention.
First embodiment: this embodiment is described with reference to FIG. 2. The hyperspectral and laser radar data fusion classification method based on the AGLT network specifically comprises the following steps:
step 1, acquiring LiDAR-DSM image data and hyperspectral image data (publicly available data are used);
step 2, performing dimension reduction processing on the acquired hyperspectral image data by adopting a principal component analysis method to obtain dimension-reduced hyperspectral image data;
dimension reduction reduces the number of bands along the third dimension of the three-dimensional raw data, reducing data redundancy and shortening the run time;
step 3, slicing the acquired LiDAR-DSM image data and the hyperspectral image data subjected to dimension reduction processing, and dividing the LiDAR-DSM image data and the hyperspectral image data subjected to slicing processing into a training set and a testing set;
step 4, constructing an AGLT network, training the constructed AGLT network by using a training set until the classification accuracy of the AGLT network on a test set is not improved any more, and obtaining a trained AGLT network;
step 5, jointly processing the hyperspectral image to be classified and the LiDAR-DSM image by using the trained AGLT network to obtain a classification result.
The invention combines a convolutional neural network with a vision Transformer, capturing and learning joint spatial-spectral features from hyperspectral image data and obtaining elevation features from LiDAR-DSM data; secondly, asymmetric convolution kernels are introduced into the vision Transformer structure, making full use of the strong spatial-context extraction capability of the convolutional neural network and the strong long-range dependency modeling capability of the self-attention-based vision Transformer; finally, the Bi feedforward unit is designed for the vision Transformer to fully extract the global and local information of the data. The method fuses multi-source heterogeneous information and improves joint classification performance.
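For illustration, the preprocessing of steps 2 and 3 can be sketched as follows (Python; the component count, patch size and helper names are illustrative assumptions, not values fixed by the invention):

```python
# Minimal sketch of PCA band reduction and patch slicing (steps 2 and 3).
# The number of components and the patch size are assumptions for illustration.
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(hsi, n_components=30):
    """PCA along the spectral (third) dimension: (H, W, C) -> (H, W, L)."""
    h, w, c = hsi.shape
    flat = hsi.reshape(-1, c)                       # each pixel is one sample
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)

def slice_patches(img, labels, patch=11):
    """Cut one patch centered on every labeled pixel (label 0 = unlabeled)."""
    r = patch // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="reflect")
    xs, ys = [], []
    for i, j in zip(*np.nonzero(labels)):
        xs.append(padded[i:i + patch, j:j + patch, :])
        ys.append(labels[i, j] - 1)                 # classes become 0-based
    return np.stack(xs), np.array(ys)
```

The single-channel LiDAR-DSM image can be sliced with the same helper by adding a singleton channel axis (dsm[..., None]); the resulting patch pairs are then divided into the training and test sets.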
The second embodiment is as follows: this embodiment is described with reference to FIG. 1. This embodiment differs from the first embodiment in that the AGLT network comprises a hyperspectral image data processing branch and a LiDAR-DSM data processing branch, wherein:
in a hyperspectral image data processing branch, an input image sequentially passes through a three-dimensional normalization layer, a first activation function layer, a first three-dimensional convolution layer with a convolution kernel size of 1×1×3, a second three-dimensional convolution layer with a convolution kernel size of 1×3×1, a third three-dimensional convolution layer with a convolution kernel size of 3×1×1, a first two-dimensional normalization layer, a second activation function layer, a first two-dimensional convolution layer with a convolution kernel size of 3×1, a second two-dimensional convolution layer with a convolution kernel size of 1×3 and a first Bi-Former module;
in the LiDAR-DSM data processing branch, the input LiDAR-DSM data sequentially passes through a second two-dimensional normalization layer, a third activation function layer, a third two-dimensional convolution layer with a convolution kernel size of 3×1, a fourth two-dimensional convolution layer with a convolution kernel size of 1×3 and a second Bi-Former module;
and sending the output of the first Bi-Former module and the output of the second Bi-Former module into a cross attention layer together to obtain an output A and an output B, then enabling the output A to pass through a first MLP layer, enabling the output B to pass through a second MLP layer, and finally superposing the output of the first MLP layer and the output of the second MLP layer, wherein a superposition result is a final classification result.
Other steps and parameters are the same as in the first embodiment.
The convolution kernels of the hyperspectral data branch take the form of three-dimensional asymmetric kernels (1×1×3, 1×3×1, 3×1×1) converted into two-dimensional asymmetric kernels (1×3, 3×1), while the DSM data branch uses two-dimensional asymmetric convolutions (1×3, 3×1), so that the multi-source image features can be fully learned.
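A PyTorch sketch of the two asymmetric convolutional stems is given below; the channel widths and the choice of ReLU as the activation function are illustrative assumptions, as the text names the layers but not their sizes:

```python
# Sketch of the two asymmetric convolutional stems (channel widths and the
# ReLU activations are assumptions; the kernel shapes follow the text).
import torch
import torch.nn as nn

class HSIStem(nn.Module):
    """3-D asymmetric convolutions (1x1x3, 1x3x1, 3x1x1), then data
    reconstruction and 2-D asymmetric convolutions (3x1, 1x3)."""
    def __init__(self, bands=30, mid=8, out=64):
        super().__init__()
        self.pre3d = nn.Sequential(nn.BatchNorm3d(1), nn.ReLU())
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, mid, (1, 1, 3), padding=(0, 0, 1)),
            nn.Conv3d(mid, mid, (1, 3, 1), padding=(0, 1, 0)),
            nn.Conv3d(mid, mid, (3, 1, 1), padding=(1, 0, 0)),
        )
        self.conv2d = nn.Sequential(
            nn.BatchNorm2d(mid * bands), nn.ReLU(),
            nn.Conv2d(mid * bands, out, (3, 1), padding=(1, 0)),
            nn.Conv2d(out, out, (1, 3), padding=(0, 1)),
        )

    def forward(self, x):               # x: (B, 1, bands, H, W)
        x = self.conv3d(self.pre3d(x))  # the three kernels together cover the
        b, c, d, h, w = x.shape         # spectral axis and both spatial axes
        x = x.reshape(b, c * d, h, w)   # reconstruct: fold spectra into channels
        return self.conv2d(x)           # (B, out, H, W)

class DSMStem(nn.Module):
    """2-D asymmetric convolutions (3x1, 1x3) for the single-channel DSM."""
    def __init__(self, out=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(1), nn.ReLU(),
            nn.Conv2d(1, out, (3, 1), padding=(1, 0)),
            nn.Conv2d(out, out, (1, 3), padding=(0, 1)),
        )

    def forward(self, x):               # x: (B, 1, H, W)
        return self.net(x)
```

Each stem output is flattened into a token sequence of shape (B, H·W, out) before entering its Bi-Former module.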
The third embodiment is as follows: this embodiment is described with reference to FIG. 3. This embodiment differs from the first or second embodiment in that the working principle of the Bi-Former module is as follows:
mapping an input image of the Bi-Former module into a vector sequence, embedding additional learning codes (extra learnable embedding) into the head of the vector sequence to obtain an overall sequence, embedding position codes into the overall sequence, and passing the vector sequence with embedded position codes through an encoder submodule, wherein the output of the encoder submodule is used as the output of the Bi-Former module; and the encoder submodule comprises N encoders;
the first encoder works on the principle that:
step one, the vector sequence with embedded position coding input by the encoder submodule is the input of the first encoder, and this vector sequence is mapped into a query (Q) vector, a key (K) vector and a value (V) vector respectively;
step two, multi-head attention calculation is carried out on the query vector, the key vector and the value vector, and a multi-head attention calculation result is obtained;
thirdly, carrying out residual connection on the multi-head attention calculation result in the second step and the vector sequence after embedding the position codes, and normalizing the residual connection result;
step four, sending the normalization result of the step three into a Bi feedforward unit, carrying out residual connection on the output of the Bi feedforward unit and the normalization result of the step three, normalizing the residual connection result, and taking the normalization result as the output of the first encoder;
and taking the output of the first encoder as the input of the second encoder, and so on, until the output of the N-th encoder is obtained, wherein the output of the N-th encoder is the output of the encoder submodule.
Other steps and parameters are the same as in the first or second embodiment.
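The module can be sketched as follows (PyTorch; the embedding width, head count, depth N and token count are illustrative assumptions, and the feed-forward shown here is a plain stand-in for the Bi feedforward unit sketched under the sixth embodiment):

```python
# Skeleton of the Bi-Former module: additional learning code + position codes,
# followed by N stacked encoders (all sizes are illustrative assumptions).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """One encoder: multi-head attention and a feed-forward unit, each
    followed by a residual connection and normalization (steps one to four)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(        # stand-in for the Bi feedforward unit
            nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])  # steps two and three
        return self.norm2(x + self.ffn(x))         # step four

class BiFormer(nn.Module):
    def __init__(self, dim=64, depth=2, tokens=121):  # 121 tokens for 11x11
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))           # additional learning code
        self.pos = nn.Parameter(torch.zeros(1, tokens + 1, dim))  # position codes
        self.encoders = nn.ModuleList([Encoder(dim) for _ in range(depth)])

    def forward(self, x):                # x: (B, tokens, dim) from the conv stem
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        for enc in self.encoders:        # output of encoder i feeds encoder i+1
            x = enc(x)
        return x
```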
The fourth embodiment is as follows: this embodiment is described with reference to FIG. 4. This embodiment differs from the first to third embodiments in that the multi-head attention is calculated as follows:
step 1), passing the query vector, the key vector and the value vector through their respective linear layers, and then jointly feeding the linear-layer outputs of the query vector, the key vector and the value vector into a first scaled dot-product attention unit, jointly feeding the same three outputs into a second scaled dot-product attention unit, and jointly feeding the same three outputs into a third scaled dot-product attention unit;
splicing the outputs of the first scaled dot-product attention unit, the second scaled dot-product attention unit and the third scaled dot-product attention unit to obtain a splicing result;
step 2), passing the splicing result through a linear layer, wherein the output after the linear-layer processing is the output of the multi-head attention.
Other steps and parameters are the same as in the first to third embodiments.
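Spelled out explicitly, the computation looks like the sketch below (PyTorch 2.x; three heads as in the text, with the embedding and head widths being assumptions):

```python
# Explicit multi-head attention: per-vector linear layers, parallel scaled
# dot-product attention units, splicing, and a final linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=48, heads=3):
        super().__init__()
        assert dim % heads == 0
        self.h, self.dk = heads, dim // heads
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim) for _ in range(3))
        self.out = nn.Linear(dim, dim)   # linear layer applied after splicing

    def forward(self, q, k, v):          # each: (B, n, dim)
        b, n, _ = q.shape
        split = lambda t: t.view(b, n, self.h, self.dk).transpose(1, 2)
        q, k, v = split(self.wq(q)), split(self.wk(k)), split(self.wv(v))
        heads = F.scaled_dot_product_attention(q, k, v)  # one unit per head
        heads = heads.transpose(1, 2).reshape(b, n, self.h * self.dk)  # splice
        return self.out(heads)
```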
Fifth embodiment: this embodiment differs from the first to fourth embodiments in that the first scaled dot-product attention unit is calculated as follows:
multiplying the linear-layer output of the query vector by the linear-layer output of the key vector, scaling the multiplication result, masking the scaled result, and finally passing the masked result through a Softmax activation function;
multiplying the output of the Softmax activation function by the linear-layer output of the value vector to obtain the calculation result of the first scaled dot-product attention unit.
Other steps and parameters are the same as in the first to fourth embodiments.
The second and third scaled dot-product attention units are calculated in the same way as the first scaled dot-product attention unit.
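The per-unit computation corresponds to standard scaled dot-product attention, sketched below (the optional mask argument plays the role of the masking step):

```python
# One scaled dot-product attention unit: multiply, scale, mask, Softmax,
# then multiply by the values.
import math
import torch

def scaled_dot_product(q, k, v, mask=None):
    """q, k, v: (..., n, d_k); returns the unit's attention output."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # multiply + scale
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))      # masking step
    weights = torch.softmax(scores, dim=-1)                   # Softmax activation
    return weights @ v                                        # multiply by values
```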
The sixth embodiment is as follows: this embodiment is described with reference to FIG. 5. This embodiment differs from the first to fifth embodiments in that the working principle of the Bi feedforward unit is as follows:
the input of the Bi feedforward unit is sent to two parallel branches, wherein the first branch is a channel attention branch, the channel attention branch comprises a global subunit, a linear layer and a Sigmoid activation function layer, the second branch is a spatial attention branch, and the spatial attention branch comprises a local subunit, a linear layer and a Sigmoid activation function layer;
in the channel attention branch, the input of the Bi feedforward unit sequentially passes through the global subunit, the linear layer and the Sigmoid activation function layer to obtain an output X;
in the spatial attention branch, the input of the Bi feedforward unit first passes through the local subunit, then the output of the local subunit is spliced with the output of the global subunit, and the spliced result sequentially passes through the linear layer and the Sigmoid activation function layer of the spatial attention branch to obtain an output Y;
multiplying X and Y, and multiplying the multiplication result with the input of the Bi feedforward unit to obtain a final multiplication result, namely obtaining the output of the Bi feedforward unit.
Other steps and parameters are the same as in one of the first to fifth embodiments.
Seventh embodiment: this embodiment differs from one of the first to sixth embodiments in that the global subunit includes an average pooling layer, a linear layer, and a GELU activation function layer; the local subunit comprises a linear layer and a GELU activation function layer.
Other steps and parameters are the same as in one of the first to sixth embodiments.
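Combining the sixth and seventh embodiments, the Bi feedforward unit can be sketched as follows (PyTorch; the input is assumed to be a token sequence of shape (B, n, dim), so averaging over the token axis plays the role of the average pooling layer, and the hidden sizes are illustrative assumptions):

```python
# Sketch of the Bi feedforward unit with its global and local subunits.
import torch
import torch.nn as nn

class BiFeedForward(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # global subunit: average pooling (done in forward) + linear + GELU
        self.glob = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        # local subunit: linear + GELU
        self.loc = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        # channel attention branch head: linear + Sigmoid
        self.ch = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        # spatial attention branch head: linear + Sigmoid, applied to the
        # spliced local and global outputs
        self.sp = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x):                             # x: (B, n, dim)
        g = self.glob(x.mean(dim=1, keepdim=True))    # average pool over tokens
        X = self.ch(g)                                # channel attention, (B, 1, dim)
        l = self.loc(x)                               # per-token local features
        spliced = torch.cat([l, g.expand_as(l)], dim=-1)
        Y = self.sp(spliced)                          # spatial attention, (B, n, dim)
        return X * Y * x                              # multiply X, Y and the input
```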
Eighth embodiment: this embodiment is described with reference to FIG. 6. This embodiment differs from the first to seventh embodiments in that, taking the hyperspectral image data as an example, the cross attention is calculated as follows:
the additional learning code of the hyperspectral data in the output of the first Bi-Former module is taken as the input of a linear projection function F_HSI(·), and the output of F_HSI(·) is mapped to a query (Q) vector through a learning matrix W_Q; the output of F_HSI(·) is stacked with the LiDAR-DSM feature-code part of the output of the second Bi-Former module, and the stacked result is divided into two parts, the first part being mapped to a key (K) vector through a learning matrix W_K and the second part being mapped to a value (V) vector through a learning matrix W_V;
the query vector corresponding to the output of F_HSI(·) is multiplied by the key vector corresponding to the first part of the stacked result to obtain a multiplication result a, which is input into a Softmax function; the output of the Softmax function is multiplied by the value vector corresponding to the second part of the stacked result to obtain a multiplication result b, which is superposed with the query vector corresponding to the output of F_HSI(·); the superposition result is input into a linear back-projection function G_HSI(·), and the output of G_HSI(·) is then spliced with the hyperspectral feature-code part of the output of the first Bi-Former module, the splicing result being output A;
the linear projection function F_HSI(·) and the linear back-projection function G_HSI(·) serve for dimension alignment;
similarly, cross attention is calculated for the additional learning code of the LiDAR-DSM data (the same operation with the roles swapped: the additional learning code of the hyperspectral data is replaced by the additional learning code of the LiDAR-DSM data, and the LiDAR-DSM feature codes are replaced by the hyperspectral feature codes) to obtain an output B.
Other steps and parameters are the same as those of one of the first to seventh embodiments.
Because the additional learning code of the hyperspectral data has already learned the abstract information in all the feature codes of the hyperspectral data, interacting with the feature codes from the LiDAR-DSM data helps it learn complementary information. Likewise, because the additional learning code of the LiDAR-DSM data has already learned the abstract information in all the feature codes of the LiDAR-DSM data, interacting with the feature codes from the hyperspectral data helps it learn complementary information.
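A sketch of this exchange for the hyperspectral branch is given below (the LiDAR-DSM branch is symmetric); F_HSI(·), G_HSI(·) and the W matrices are modeled as linear layers, and the square-root scaling is a standard assumption not spelled out in the text:

```python
# Cross attention for the hyperspectral branch: the HSI additional learning
# code queries the stacked HSI/LiDAR-DSM codes (a sketch; sizes assumed).
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.f = nn.Linear(dim, dim)    # F_HSI: projection for dimension alignment
        self.g = nn.Linear(dim, dim)    # G_HSI: back-projection
        self.wq = nn.Linear(dim, dim)   # W_Q
        self.wk = nn.Linear(dim, dim)   # W_K
        self.wv = nn.Linear(dim, dim)   # W_V

    def forward(self, hsi_tokens, dsm_tokens):             # (B, 1+n, dim) each
        cls, feats = hsi_tokens[:, :1], hsi_tokens[:, 1:]  # learning code / feature codes
        cls_p = self.f(cls)                                # F_HSI(.)
        q = self.wq(cls_p)                                 # query vector
        stack = torch.cat([cls_p, dsm_tokens[:, 1:]], dim=1)  # stack with DSM feature codes
        k, v = self.wk(stack), self.wv(stack)
        a = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        b = a @ v                                          # multiplication result b
        out = self.g(b + q)                                # superpose with the query, G_HSI(.)
        return torch.cat([out, feats], dim=1)              # splice -> output A
```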
Examples
The invention provides a hyperspectral and laser radar data fusion classification method based on an AGLT network; the implementation flow of the method is shown in Table 1:
Table 1 AGLT network architecture algorithm flow
The specific implementation steps are as follows:
and step 1, acquiring hyperspectral image data and LiDAR-DSM data (public data is adopted).
And 2, performing dimension reduction processing on the obtained hyperspectral image data by using a PCA principal component analysis method, so that the number of wave bands of a third dimension of the three-dimensional original data can be reduced, the data redundancy can be reduced, and the running time can be shortened.
And step 3, preprocessing LiDAR-DSM data and dimension-reduced hyperspectral data. The number of training sets and the number of testing sets are designated according to the numbers of marked pixel samples in the hyperspectral image and the LiDAR-DSM image, and the training sets and the number of testing sets are stored as two pieces of data with the same format as LiDAR-DSM data and dimension-reduced hyperspectral data respectively. Then, the two pieces of data are separately sliced.
Step 4, training and classifying AGLT network
And 4.1, constructing an AGLT network, wherein the training set and the testing set used by the method adopt 11 multiplied by 11 resolution ratio. As shown in FIG. 1, H W represents the spatial dimension, C represents the third dimension, and after PCA processing and data preprocessing, a series of 11X L hyperspectral images (L represents the dimension-reduced third dimension) and 11X 11 LiDAR-DSM images are obtained.
Step 4.2, the AGLT network structure is mainly divided into three parts, namely an asymmetric three-dimensional convolution aiming at hyperspectral data, an asymmetric two-dimensional convolution after data reconstruction and an asymmetric two-dimensional convolution aiming at LiDAR-DSM data. The asymmetric three-dimensional convolution mainly includes 1×1×3,1×3×1, and 3×1×1; the asymmetric two-dimensional convolution consists mainly of 1×3 and 3×1. And the parameter quantity is reduced while the space-spectrum joint characteristic information is extracted.
And 4.3, respectively sending the output of the asymmetric convolution neural network of the hyperspectral data branch and the LiDAR-DSM data branch into Bi-force. As shown in FIG. 5, the Bi feedforward unit in the Bi-Former can enable the AGLT network to obtain lighter calculation burden in the small channel dimension, increase the information interaction of the cross channels, and fuse the global information with the local information, thereby further improving the model performance and the precision.
And 4.4, aiming at the output of Bi-force of the hyperspectral data branch and the LiDAR-DSM data branch, sending the output into the cross attention, carrying out characteristic information interaction, and fully extracting the multi-mode data fusion characteristics.
And 5, classifying the images.
Experimental part
The network model was trained, and the classification results were verified, on a computer configured as follows: an Intel Core i9-12900K CPU, 32 GB of DDR5-5200 memory, an NVIDIA GeForce RTX 3090 Ti graphics card, a 1 TB SSD plus a 4 TB HDD, running Windows 11 Professional. Three well-known multi-modal feature fusion datasets were used: the Trento dataset (TR), the MUUFL Gulfport dataset (MU) and the Augsburg dataset (AU). The dataset display diagrams are shown in FIGS. 7, 8 and 9, where different colors represent different ground-object categories; the detailed data information corresponding to FIG. 7 is given in Table 2, that corresponding to FIG. 8 in Table 3, and that corresponding to FIG. 9 in Table 4.
Table 2
Table 3
Table 4
The TR dataset was captured in a rural area around the city of Trento, Italy. The dataset contains hyperspectral images and lidar images. The hyperspectral image has 600×166 pixels and 63 bands covering the spectral range of 420.89-989.09 nm, with a spectral resolution of 9.2 nm and a spatial resolution of 1 m. The lidar image is a single-channel image containing the altitude corresponding to each ground position, and it has the same size as the hyperspectral image. The labeling information of the dataset contains 6 categories.
The MU dataset is a co-registered aerial hyperspectral-lidar dataset. The two modal images were acquired simultaneously in a single aerial flight in November 2010, over Mississippi, USA. The image size is 325×220 pixels. The hyperspectral image contains 64 spectral bands. The lidar image is a single-channel image containing the altitude corresponding to each ground position, and it has the same size as the hyperspectral image. The data annotation information contains 11 categories.
The AU dataset was captured over the city of Augsburg, Germany. The HSI data were acquired by the DAS-EOC HySpex sensor, and the LiDAR-DSM data were collected by the DLR-3K system. The spatial resolution of both images was downsampled to a uniform 30 m to adequately manage the multi-modal fusion. In this dataset, the HSI data consist of 180 bands ranging from 0.4 to 2.5 μm, while the LiDAR-DSM data have only a single channel. The size of the dataset is 332×485 pixels. The dataset depicts seven different land-cover categories.
For the evaluation indices of the model, the three objective evaluation indices most widely used in the field were selected: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (K).
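These three indices are standard and can be computed from the confusion matrix as in the sketch below (numpy; not code from the patent):

```python
# OA, AA and Kappa from a confusion matrix, where conf[i, j] counts samples
# of true class i predicted as class j.
import numpy as np

def oa_aa_kappa(conf):
    n = conf.sum()
    oa = np.trace(conf) / n                          # overall accuracy
    aa = (np.diag(conf) / conf.sum(axis=1)).mean()   # mean per-class accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```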
The objective classification results of the AGLT network model used in the present invention on three data sets are shown in Table 5.
Table 5
From left to right, the first column is the dataset; the second column is the OA of the classification result; the third column is the AA; the fourth column is K; the fifth column is the training time; the sixth column is the test time. The subjective classification results are shown in FIG. 10, where (a) shows the classification effect on the TR dataset, (b) on the MU dataset, and (c) on the AU dataset. The higher the accuracy, the less salt-and-pepper noise appears in the result map.
The above examples of the present invention are only for describing the calculation model and calculation flow of the present invention in detail, and are not limiting of the embodiments of the present invention. Other variations and modifications of the above description will be apparent to those of ordinary skill in the art, and it is not intended to be exhaustive of all embodiments, all of which are within the scope of the invention.

Claims (8)

1. The hyperspectral and laser radar data fusion classification method based on the AGLT network is characterized by comprising the following steps:
step 1, acquiring LiDAR-DSM image data and hyperspectral image data;
step 2, performing dimension reduction processing on the acquired hyperspectral image data by adopting a principal component analysis method to obtain dimension-reduced hyperspectral image data;
step 3, slicing the acquired LiDAR-DSM image data and the hyperspectral image data subjected to dimension reduction processing, and dividing the LiDAR-DSM image data and the hyperspectral image data subjected to slicing processing into a training set and a testing set;
step 4, constructing an AGLT network, training the constructed AGLT network by using a training set until the classification accuracy of the AGLT network on a test set is not improved any more, and obtaining a trained AGLT network;
step 5, jointly processing the hyperspectral image to be classified and the LiDAR-DSM image by using the trained AGLT network to obtain a classification result.
2. The hyperspectral and LiDAR data fusion classification method based on the AGLT network of claim 1, wherein the AGLT network comprises a hyperspectral image data processing branch and a LiDAR-DSM data processing branch, wherein:
in a hyperspectral image data processing branch, an input image sequentially passes through a three-dimensional normalization layer, a first activation function layer, a first three-dimensional convolution layer with a convolution kernel size of 1×1×3, a second three-dimensional convolution layer with a convolution kernel size of 1×3×1, a third three-dimensional convolution layer with a convolution kernel size of 3×1×1, a first two-dimensional normalization layer, a second activation function layer, a first two-dimensional convolution layer with a convolution kernel size of 3×1, a second two-dimensional convolution layer with a convolution kernel size of 1×3 and a first Bi-Former module;
in the LiDAR-DSM data processing branch, the input LiDAR-DSM data sequentially passes through a second two-dimensional normalization layer, a third activation function layer, a third two-dimensional convolution layer with a convolution kernel size of 3×1, a fourth two-dimensional convolution layer with a convolution kernel size of 1×3 and a second Bi-Former module;
and sending the output of the first Bi-Former module and the output of the second Bi-Former module into a cross attention layer together to obtain an output A and an output B, then enabling the output A to pass through a first MLP layer, enabling the output B to pass through a second MLP layer, and finally superposing the output of the first MLP layer and the output of the second MLP layer, wherein a superposition result is a final classification result.
3. The hyperspectral and laser radar data fusion classification method based on the AGLT network as described in claim 2, wherein the working principle of the Bi-Former module is as follows:
mapping an input image of the Bi-Former module into a vector sequence, embedding additional learning codes into the head of the vector sequence to obtain an overall sequence, embedding position codes into the overall sequence, and enabling the vector sequence after embedding the position codes to pass through an encoder submodule, wherein the output of the encoder submodule is used as the output of the Bi-Former module; and the encoder sub-module comprises N encoders;
the first encoder works on the principle that:
step one, the vector sequence with embedded position coding input by the encoder submodule is the input of the first encoder, and this vector sequence is mapped into a query vector, a key vector and a value vector respectively;
step two, multi-head attention calculation is carried out on the query vector, the key vector and the value vector, and a multi-head attention calculation result is obtained;
thirdly, carrying out residual connection on the multi-head attention calculation result in the second step and the vector sequence after embedding the position codes, and normalizing the residual connection result;
step four, sending the normalization result of the step three into a Bi feedforward unit, carrying out residual connection on the output of the Bi feedforward unit and the normalization result of the step three, normalizing the residual connection result, and taking the normalization result as the output of the first encoder;
and taking the output of the first encoder as the input of the second encoder, and so on, until the output of the N-th encoder is obtained, wherein the output of the N-th encoder is the output of the encoder submodule.
4. The hyperspectral and lidar data fusion classification method based on the AGLT network as described in claim 3, wherein the multi-head attention computing method is as follows:
step 1), passing the query vector, the key vector and the value vector through their respective linear layers, and then jointly feeding the linear-layer outputs of the query vector, the key vector and the value vector into a first scaled dot-product attention unit, jointly feeding the same three outputs into a second scaled dot-product attention unit, and jointly feeding the same three outputs into a third scaled dot-product attention unit;
splicing the outputs of the first scaled dot-product attention unit, the second scaled dot-product attention unit and the third scaled dot-product attention unit to obtain a splicing result;
step 2), passing the splicing result through a linear layer, wherein the output after the linear-layer processing is the output of the multi-head attention.
5. The hyperspectral and lidar data fusion classification method based on the AGLT network as described in claim 4, wherein the first scaled dot-product attention unit is calculated as follows:
multiplying the linear-layer output of the query vector by the linear-layer output of the key vector, scaling the multiplication result, masking the scaled result, and finally passing the masked result through a Softmax activation function;
multiplying the output of the Softmax activation function by the linear-layer output of the value vector to obtain the calculation result of the first scaled dot-product attention unit.
6. The hyperspectral and lidar data fusion classification method based on the AGLT network as described in claim 5, wherein the working principle of the Bi feedforward unit is as follows:
the input of the Bi feedforward unit is sent to two parallel branches, wherein the first branch is a channel attention branch, the channel attention branch comprises a global subunit, a linear layer and a Sigmoid activation function layer, the second branch is a spatial attention branch, and the spatial attention branch comprises a local subunit, a linear layer and a Sigmoid activation function layer;
in the channel attention branch, the input of the Bi feedforward unit sequentially passes through the global subunit, the linear layer and the Sigmoid activation function layer to obtain an output X;
in the spatial attention branch, the input of the Bi feedforward unit first passes through the local subunit, then the output of the local subunit is spliced with the output of the global subunit, and the spliced result sequentially passes through the linear layer and the Sigmoid activation function layer of the spatial attention branch to obtain an output Y;
multiplying X and Y, and multiplying the multiplication result with the input of the Bi feedforward unit to obtain a final multiplication result, namely obtaining the output of the Bi feedforward unit.
7. The hyperspectral and lidar data fusion classification method based on the AGLT network as described in claim 6, wherein the global subunit comprises an average pooling layer, a linear layer and a GELU activation function layer; the local subunit comprises a linear layer and a GELU activation function layer.
8. The method for fusion classification of hyperspectral and lidar data based on an AGLT network according to claim 7, wherein the method for computing the cross-attention is as follows:
the additional learning code of the hyperspectral data in the output of the first Bi-Former module is taken as the input of a linear projection function F_HSI(·), and the output of F_HSI(·) is mapped to a query vector through a matrix W_Q;
the output of F_HSI(·) is stacked with the LiDAR-DSM feature-code part of the output of the second Bi-Former module, and the stacked result is divided into two parts, the first part being mapped to a key vector through a matrix W_K and the second part being mapped to a value vector through a matrix W_V;
the query vector corresponding to the output of F_HSI(·) is multiplied by the key vector corresponding to the first part of the stacked result to obtain a multiplication result a, which is input into a Softmax function; the output of the Softmax function is multiplied by the value vector corresponding to the second part of the stacked result to obtain a multiplication result b, which is superposed with the query vector corresponding to the output of F_HSI(·); the superposition result is input into a linear back-projection function G_HSI(·), and the output of G_HSI(·) is then spliced with the hyperspectral feature-code part of the output of the first Bi-Former module, the splicing result being output A;
similarly, cross attention is calculated for the additional learning code of the LiDAR-DSM data to obtain output B.
CN202311439960.3A 2023-11-01 2023-11-01 Hyperspectral and laser radar data fusion classification method based on AGLT network Pending CN117475216A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311439960.3A CN117475216A (en) 2023-11-01 2023-11-01 Hyperspectral and laser radar data fusion classification method based on AGLT network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311439960.3A CN117475216A (en) 2023-11-01 2023-11-01 Hyperspectral and laser radar data fusion classification method based on AGLT network

Publications (1)

Publication Number Publication Date
CN117475216A true CN117475216A (en) 2024-01-30

Family

ID=89639164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311439960.3A Pending CN117475216A (en) 2023-11-01 2023-11-01 Hyperspectral and laser radar data fusion classification method based on AGLT network

Country Status (1)

Country Link
CN (1) CN117475216A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876890A (en) * 2024-03-11 2024-04-12 成都信息工程大学 Multi-source remote sensing image classification method based on multi-level feature fusion
CN117876890B (en) * 2024-03-11 2024-05-07 成都信息工程大学 Multi-source remote sensing image classification method based on multi-level feature fusion
CN117934978A (en) * 2024-03-22 2024-04-26 安徽大学 Hyperspectral and laser radar multilayer fusion classification method based on countermeasure learning
CN117934978B (en) * 2024-03-22 2024-06-11 安徽大学 Hyperspectral and laser radar multilayer fusion classification method based on countermeasure learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination