CN117475216A - Hyperspectral and laser radar data fusion classification method based on AGLT network - Google Patents

Hyperspectral and laser radar data fusion classification method based on AGLT network

Info

Publication number
CN117475216A
Authority
CN
China
Prior art keywords
output
layer
result
hyperspectral
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311439960.3A
Other languages
Chinese (zh)
Inventor
王敏慧
孙亚秀
项建弘
王霖郁
黄丽莲
钟瑜
孙蕊
武雅若
蒋涵宇
王英
徐昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202311439960.3A priority Critical patent/CN117475216A/en
Publication of CN117475216A publication Critical patent/CN117475216A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/194 Terrestrial scenes using hyperspectral data, i.e. more or other wavelengths than RGB

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Remote Sensing (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Astronomy & Astrophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Processing (AREA)

Abstract

A hyperspectral and laser radar data fusion classification method based on an AGLT network belongs to the field of hyperspectral image classification. The invention solves the problem of poor classification performance of existing methods. The invention can capture and learn joint spatial-spectral features from hyperspectral image data and acquire elevation features from LiDAR-DSM data; asymmetric convolution kernels are introduced into a vision Transformer structure, making full use of the strong spatial-context extraction capability of the convolutional neural network and the strong long-range dependency modeling capability of the self-attention-based vision Transformer; and a Bi feedforward unit is designed for the vision Transformer to fully extract the global and local information of the data, improving the classification performance of the model. The method can be applied to hyperspectral image classification.

Description

Hyperspectral and laser radar data fusion classification method based on AGLT network
Technical Field
The invention belongs to the field of hyperspectral image classification, and particularly relates to a hyperspectral and laser radar data fusion classification method based on an AGLT network.
Background
Remote sensing technology has advanced significantly in recent years, increasing the availability of remote sensing images. Typically, data from several remote sensing devices covering the same geographical area are available, so multi-modal data can be used to analyze land-cover information. Different sensor technologies effectively capture different properties of the land cover. As massive amounts of data are acquired, the differences between data of different modalities become complementary advantages, effectively improving remote sensing ground-object classification. In multi-modal data fusion and land-cover interpretation tasks, the fused interpretation of hyperspectral image (HSI) and light detection and ranging digital surface model (LiDAR-DSM) data has been an important issue requiring attention. A hyperspectral image sensor obtains spectral and geospatial information, while the light detection and ranging digital surface model measures surface elevation and object height information. By integrating data of different modalities, more detailed information can be obtained, constructing a complete feature representation.
Fusing multiple data sources can improve the accuracy of land-cover recognition, but many technical obstacles remain, such as differing data structures and unrelated physical features. Hyperspectral data and LiDAR-DSM images have already been used successfully in combination, where convolutional neural networks are a common method for their joint classification and are powerful tools for feature extraction and context modeling. However, owing to inherent limitations of the network backbone, convolutional neural networks cannot establish long-distance connections across the global image and remain weak at capturing the sequential properties of spectral features; their convolution operations capture only local information, making it difficult to obtain discriminative spectral-spatial features from a global perspective. A vision Transformer backbone network can address these challenges and offers new insight into multi-modal image classification: the features it acquires in shallow and deep layers are more similar to one another, and it is good at capturing global feature information of the image, but it lacks the prior knowledge of scale, translation invariance and feature locality that the image itself possesses, and must learn high-quality intermediate representations using large-scale datasets. However, because of the lack of training data, combining a convolutional neural network with a vision Transformer backbone network remains difficult; the classification performance of existing methods is therefore still poor and needs further improvement.
Disclosure of Invention
The invention aims to solve the problem of poor classification performance of the existing method, and provides a hyperspectral and laser radar data fusion classification method based on an AGLT network.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a hyperspectral and laser radar data fusion classification method based on an AGLT network specifically comprises the following steps:
step 1, acquiring LiDAR-DSM image data and hyperspectral image data;
step 2, performing dimension reduction processing on the acquired hyperspectral image data by adopting a principal component analysis method to obtain dimension-reduced hyperspectral image data;
step 3, slicing the acquired LiDAR-DSM image data and the hyperspectral image data subjected to dimension reduction processing, and dividing the LiDAR-DSM image data and the hyperspectral image data subjected to slicing processing into a training set and a testing set;
step 4, constructing an AGLT network, training the constructed AGLT network by using a training set until the classification accuracy of the AGLT network on a test set is not improved any more, and obtaining a trained AGLT network;
step 5, jointly processing the hyperspectral image to be classified and the LiDAR-DSM image by using the trained AGLT network to obtain a classification result.
Further, the AGLT network comprises a hyperspectral image data processing branch and a LiDAR-DSM data processing branch, wherein:
in a hyperspectral image data processing branch, an input image sequentially passes through a three-dimensional normalization layer, a first activation function layer, a first three-dimensional convolution layer with a convolution kernel size of 1×1×3, a second three-dimensional convolution layer with a convolution kernel size of 1×3×1, a third three-dimensional convolution layer with a convolution kernel size of 3×1×1, a first two-dimensional normalization layer, a second activation function layer, a first two-dimensional convolution layer with a convolution kernel size of 3×1, a second two-dimensional convolution layer with a convolution kernel size of 1×3 and a first Bi-Former module;
in the LiDAR-DSM data processing branch, the input LiDAR-DSM data sequentially passes through a second two-dimensional normalization layer, a third activation function layer, a third two-dimensional convolution layer with a convolution kernel size of 3×1, a fourth two-dimensional convolution layer with a convolution kernel size of 1×3 and a second Bi-Former module;
and sending the output of the first Bi-Former module and the output of the second Bi-Former module into a cross attention layer together to obtain an output A and an output B, then enabling the output A to pass through a first MLP layer, enabling the output B to pass through a second MLP layer, and finally superposing the output of the first MLP layer and the output of the second MLP layer, wherein a superposition result is a final classification result.
Further, the working principle of the Bi-Former module is as follows:
mapping an input image of the Bi-Former module into a vector sequence, embedding additional learning codes into the head of the vector sequence to obtain an overall sequence, embedding position codes into the overall sequence, and enabling the vector sequence after embedding the position codes to pass through an encoder submodule, wherein the output of the encoder submodule is used as the output of the Bi-Former module; and the encoder sub-module comprises N encoders;
the first encoder works on the principle that:
step one, the vector sequence with embedded position coding input by the encoder submodule is the input of the first encoder, and this vector sequence is mapped into a query vector, a key vector and a value vector respectively;
step two, multi-head attention calculation is carried out on the query vector, the key vector and the value vector, and a multi-head attention calculation result is obtained;
thirdly, carrying out residual connection on the multi-head attention calculation result in the second step and the vector sequence after embedding the position codes, and normalizing the residual connection result;
step four, sending the normalization result of the step three into a Bi feedforward unit, carrying out residual connection on the output of the Bi feedforward unit and the normalization result of the step three, normalizing the residual connection result, and taking the normalization result as the output of the first encoder;
and taking the output of the first encoder as the input of the second encoder, and so on, until the output of the N-th encoder is obtained, wherein the output of the N-th encoder is the output of the encoder submodule.
Further, the multi-head attention calculating method comprises the following steps:
step 1), passing the query vector, the key vector and the value vector through their respective linear layers, and then jointly feeding the linear-layer outputs of the query vector, the key vector and the value vector into a first scaled dot-product attention unit, jointly feeding the same three outputs into a second scaled dot-product attention unit, and jointly feeding the same three outputs into a third scaled dot-product attention unit;
splicing the outputs of the first scaled dot-product attention unit, the second scaled dot-product attention unit and the third scaled dot-product attention unit to obtain a splicing result;
step 2), passing the splicing result through a linear layer, wherein the output after the linear-layer processing is the output of the multi-head attention.
Further, the first scaled dot-product attention unit is calculated as follows:
multiplying the linear-layer output of the query vector by the linear-layer output of the key vector, scaling the multiplication result, masking the scaled result, and finally passing the masked result through a Softmax activation function;
multiplying the output of the Softmax activation function by the linear-layer output of the value vector to obtain the calculation result of the first scaled dot-product attention unit.
Further, the working principle of the Bi feedforward unit is as follows:
the input of the Bi feedforward unit is sent to two parallel branches, wherein the first branch is a channel attention branch, the channel attention branch comprises a global subunit, a linear layer and a Sigmoid activation function layer, the second branch is a spatial attention branch, and the spatial attention branch comprises a local subunit, a linear layer and a Sigmoid activation function layer;
in the channel attention branch, the input of the Bi feedforward unit sequentially passes through the global subunit, the linear layer and the Sigmoid activation function layer to obtain an output X;
in the spatial attention branch, the input of the Bi feedforward unit first passes through the local subunit, then the output of the local subunit is spliced with the output of the global subunit, and the spliced result sequentially passes through the linear layer and the Sigmoid activation function layer of the spatial attention branch to obtain an output Y;
multiplying X and Y, and multiplying the multiplication result with the input of the Bi feedforward unit to obtain a final multiplication result, namely obtaining the output of the Bi feedforward unit.
Further, the global subunit comprises an average pooling layer, a linear layer and a GELU activation function layer; the local subunit comprises a linear layer and a GELU activation function layer.
Further, taking the hyperspectral image data as an example, the cross attention is calculated as follows:
the additional learning code of the hyperspectral data in the output of the first Bi-Former module is taken as the input of a linear projection function F_HSI(·), and the output of F_HSI(·) is mapped to a query vector through a matrix W_Q;
the output of F_HSI(·) is stacked with the LiDAR-DSM feature-code part of the output of the second Bi-Former module, and the stacked result is divided into two parts, the first part being mapped to a key vector through a matrix W_K and the second part being mapped to a value vector through a matrix W_V;
the query vector corresponding to the output of F_HSI(·) is multiplied by the key vector corresponding to the first part of the stacked result to obtain a multiplication result a, which is input into a Softmax function; the output of the Softmax function is multiplied by the value vector corresponding to the second part of the stacked result to obtain a multiplication result b, which is superposed with the query vector corresponding to the output of F_HSI(·); the superposition result is input into a linear back-projection function G_HSI(·), and the output of G_HSI(·) is then spliced with the hyperspectral feature-code part of the output of the first Bi-Former module, the splicing result being output A;
similarly, cross attention is calculated for the additional learning code of the LiDAR-DSM data to obtain output B.
The beneficial effects of the invention are as follows:
the invention can capture and learn hyperspectral air-spectrum combination characteristics from hyperspectral image data and acquire elevation characteristics from LiDAR-DSM data; the asymmetric convolution kernel is introduced into a visual transducer structure, so that the strong space context information extraction capability of the convolution neural network and the strong remote dependency modeling capability of the visual transducer based on a self-attention mechanism are fully utilized; the Bi feedforward unit is designed for the visual transducer to fully extract the global and local information of the data, so that the classification performance of the model is improved.
Drawings
FIG. 1 is a diagram of a model structure of an AGLT network;
FIG. 2 is a training flow diagram of an AGLT network;
FIG. 3 is a block diagram of a Bi-Former module;
FIG. 4 is a diagram of a multi-headed attention structure;
FIG. 5 is a block diagram of a Bi feed forward unit;
in the figure, (a) is channel attention and (b) is spatial attention;
FIG. 6 is a cross-attention block diagram;
FIG. 7 is a TR data presentation diagram;
in the figure, (a) is a hyperspectral false color chart, (b) is a DSM gray chart, and (c) is a truth chart;
FIG. 8 is a MU data presentation diagram;
in the figure, (a) is a hyperspectral false color chart, (b) is a DSM gray chart, and (c) is a truth chart;
fig. 9 is an AU data display diagram;
in the figure, (a) is a hyperspectral false color chart, (b) is a DSM gray chart, and (c) is a truth chart;
fig. 10 is a graph of classification results of different data by the AGLT network model.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings by way of specific embodiments. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the invention. Based on the embodiments of the present invention, other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present invention.
First embodiment: this embodiment is described with reference to FIG. 2. The hyperspectral and laser radar data fusion classification method based on the AGLT network specifically comprises the following steps:
step 1, acquiring LiDAR-DSM image data and hyperspectral image data (publicly available data are used);
step 2, performing dimension reduction processing on the acquired hyperspectral image data by adopting a principal component analysis method to obtain dimension-reduced hyperspectral image data;
dimension reduction reduces the number of bands along the third dimension of the three-dimensional raw data, reducing data redundancy and shortening the run time;
step 3, slicing the acquired LiDAR-DSM image data and the hyperspectral image data subjected to dimension reduction processing, and dividing the LiDAR-DSM image data and the hyperspectral image data subjected to slicing processing into a training set and a testing set;
step 4, constructing an AGLT network, training the constructed AGLT network by using a training set until the classification accuracy of the AGLT network on a test set is not improved any more, and obtaining a trained AGLT network;
step 5, jointly processing the hyperspectral image to be classified and the LiDAR-DSM image by using the trained AGLT network to obtain a classification result.
The invention combines a convolutional neural network with a vision Transformer, capturing and learning joint spatial-spectral features from hyperspectral image data and obtaining elevation features from LiDAR-DSM data; secondly, asymmetric convolution kernels are introduced into the vision Transformer structure, making full use of the strong spatial-context extraction capability of the convolutional neural network and the strong long-range dependency modeling capability of the self-attention-based vision Transformer; finally, the Bi feedforward unit is designed for the vision Transformer to fully extract the global and local information of the data. The method fuses multi-source heterogeneous information and improves joint classification performance.
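For illustration, the preprocessing of steps 2 and 3 can be sketched as follows (Python; the component count, patch size and helper names are illustrative assumptions, not values fixed by the invention):

```python
# Minimal sketch of PCA band reduction and patch slicing (steps 2 and 3).
# The number of components and the patch size are assumptions for illustration.
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(hsi, n_components=30):
    """PCA along the spectral (third) dimension: (H, W, C) -> (H, W, L)."""
    h, w, c = hsi.shape
    flat = hsi.reshape(-1, c)                       # each pixel is one sample
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)

def slice_patches(img, labels, patch=11):
    """Cut one patch centered on every labeled pixel (label 0 = unlabeled)."""
    r = patch // 2
    padded = np.pad(img, ((r, r), (r, r), (0, 0)), mode="reflect")
    xs, ys = [], []
    for i, j in zip(*np.nonzero(labels)):
        xs.append(padded[i:i + patch, j:j + patch, :])
        ys.append(labels[i, j] - 1)                 # classes become 0-based
    return np.stack(xs), np.array(ys)
```

The single-channel LiDAR-DSM image can be sliced with the same helper by adding a singleton channel axis (dsm[..., None]); the resulting patch pairs are then divided into the training and test sets.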
The second embodiment is as follows: this embodiment is described with reference to FIG. 1. This embodiment differs from the first embodiment in that the AGLT network comprises a hyperspectral image data processing branch and a LiDAR-DSM data processing branch, wherein:
in a hyperspectral image data processing branch, an input image sequentially passes through a three-dimensional normalization layer, a first activation function layer, a first three-dimensional convolution layer with a convolution kernel size of 1×1×3, a second three-dimensional convolution layer with a convolution kernel size of 1×3×1, a third three-dimensional convolution layer with a convolution kernel size of 3×1×1, a first two-dimensional normalization layer, a second activation function layer, a first two-dimensional convolution layer with a convolution kernel size of 3×1, a second two-dimensional convolution layer with a convolution kernel size of 1×3 and a first Bi-Former module;
in the LiDAR-DSM data processing branch, the input LiDAR-DSM data sequentially passes through a second two-dimensional normalization layer, a third activation function layer, a third two-dimensional convolution layer with a convolution kernel size of 3×1, a fourth two-dimensional convolution layer with a convolution kernel size of 1×3 and a second Bi-Former module;
and sending the output of the first Bi-Former module and the output of the second Bi-Former module into a cross attention layer together to obtain an output A and an output B, then enabling the output A to pass through a first MLP layer, enabling the output B to pass through a second MLP layer, and finally superposing the output of the first MLP layer and the output of the second MLP layer, wherein a superposition result is a final classification result.
Other steps and parameters are the same as in the first embodiment.
The convolution kernels of the hyperspectral data branch take the form of three-dimensional asymmetric kernels (1×1×3, 1×3×1, 3×1×1) converted into two-dimensional asymmetric kernels (1×3, 3×1), while the DSM data branch uses two-dimensional asymmetric convolutions (1×3, 3×1), so that the multi-source image features can be fully learned.
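A PyTorch sketch of the two asymmetric convolutional stems is given below; the channel widths and the choice of ReLU as the activation function are illustrative assumptions, as the text names the layers but not their sizes:

```python
# Sketch of the two asymmetric convolutional stems (channel widths and the
# ReLU activations are assumptions; the kernel shapes follow the text).
import torch
import torch.nn as nn

class HSIStem(nn.Module):
    """3-D asymmetric convolutions (1x1x3, 1x3x1, 3x1x1), then data
    reconstruction and 2-D asymmetric convolutions (3x1, 1x3)."""
    def __init__(self, bands=30, mid=8, out=64):
        super().__init__()
        self.pre3d = nn.Sequential(nn.BatchNorm3d(1), nn.ReLU())
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, mid, (1, 1, 3), padding=(0, 0, 1)),
            nn.Conv3d(mid, mid, (1, 3, 1), padding=(0, 1, 0)),
            nn.Conv3d(mid, mid, (3, 1, 1), padding=(1, 0, 0)),
        )
        self.conv2d = nn.Sequential(
            nn.BatchNorm2d(mid * bands), nn.ReLU(),
            nn.Conv2d(mid * bands, out, (3, 1), padding=(1, 0)),
            nn.Conv2d(out, out, (1, 3), padding=(0, 1)),
        )

    def forward(self, x):               # x: (B, 1, bands, H, W)
        x = self.conv3d(self.pre3d(x))  # the three kernels together cover the
        b, c, d, h, w = x.shape         # spectral axis and both spatial axes
        x = x.reshape(b, c * d, h, w)   # reconstruct: fold spectra into channels
        return self.conv2d(x)           # (B, out, H, W)

class DSMStem(nn.Module):
    """2-D asymmetric convolutions (3x1, 1x3) for the single-channel DSM."""
    def __init__(self, out=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm2d(1), nn.ReLU(),
            nn.Conv2d(1, out, (3, 1), padding=(1, 0)),
            nn.Conv2d(out, out, (1, 3), padding=(0, 1)),
        )

    def forward(self, x):               # x: (B, 1, H, W)
        return self.net(x)
```

Each stem output is flattened into a token sequence of shape (B, H·W, out) before entering its Bi-Former module.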
The third embodiment is as follows: this embodiment is described with reference to FIG. 3. This embodiment differs from the first or second embodiment in that the working principle of the Bi-Former module is as follows:
mapping an input image of the Bi-Former module into a vector sequence, embedding additional learning codes (extra learnable embedding) into the head of the vector sequence to obtain an overall sequence, embedding position codes into the overall sequence, and passing the vector sequence with embedded position codes through an encoder submodule, wherein the output of the encoder submodule is used as the output of the Bi-Former module; and the encoder submodule comprises N encoders;
the first encoder works on the principle that:
step one, the vector sequence with embedded position coding input by the encoder submodule is the input of the first encoder, and this vector sequence is mapped into a query (Q) vector, a key (K) vector and a value (V) vector respectively;
step two, multi-head attention calculation is carried out on the query vector, the key vector and the value vector, and a multi-head attention calculation result is obtained;
thirdly, carrying out residual connection on the multi-head attention calculation result in the second step and the vector sequence after embedding the position codes, and normalizing the residual connection result;
step four, sending the normalization result of the step three into a Bi feedforward unit, carrying out residual connection on the output of the Bi feedforward unit and the normalization result of the step three, normalizing the residual connection result, and taking the normalization result as the output of the first encoder;
and taking the output of the first encoder as the input of the second encoder, and so on, until the output of the N-th encoder is obtained, wherein the output of the N-th encoder is the output of the encoder submodule.
Other steps and parameters are the same as in the first or second embodiment.
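The module can be sketched as follows (PyTorch; the embedding width, head count, depth N and token count are illustrative assumptions, and the feed-forward shown here is a plain stand-in for the Bi feedforward unit sketched under the sixth embodiment):

```python
# Skeleton of the Bi-Former module: additional learning code + position codes,
# followed by N stacked encoders (all sizes are illustrative assumptions).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """One encoder: multi-head attention and a feed-forward unit, each
    followed by a residual connection and normalization (steps one to four)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(        # stand-in for the Bi feedforward unit
            nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])  # steps two and three
        return self.norm2(x + self.ffn(x))         # step four

class BiFormer(nn.Module):
    def __init__(self, dim=64, depth=2, tokens=121):  # 121 tokens for 11x11
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))           # additional learning code
        self.pos = nn.Parameter(torch.zeros(1, tokens + 1, dim))  # position codes
        self.encoders = nn.ModuleList([Encoder(dim) for _ in range(depth)])

    def forward(self, x):                # x: (B, tokens, dim) from the conv stem
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        for enc in self.encoders:        # output of encoder i feeds encoder i+1
            x = enc(x)
        return x
```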
The fourth embodiment is as follows: this embodiment is described with reference to FIG. 4. This embodiment differs from the first to third embodiments in that the multi-head attention is calculated as follows:
step 1), passing the query vector, the key vector and the value vector through their respective linear layers, and then jointly feeding the linear-layer outputs of the query vector, the key vector and the value vector into a first scaled dot-product attention unit, jointly feeding the same three outputs into a second scaled dot-product attention unit, and jointly feeding the same three outputs into a third scaled dot-product attention unit;
splicing the outputs of the first scaled dot-product attention unit, the second scaled dot-product attention unit and the third scaled dot-product attention unit to obtain a splicing result;
step 2), passing the splicing result through a linear layer, wherein the output after the linear-layer processing is the output of the multi-head attention.
Other steps and parameters are the same as in the first to third embodiments.
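Spelled out explicitly, the computation looks like the sketch below (PyTorch 2.x; three heads as in the text, with the embedding and head widths being assumptions):

```python
# Explicit multi-head attention: per-vector linear layers, parallel scaled
# dot-product attention units, splicing, and a final linear layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=48, heads=3):
        super().__init__()
        assert dim % heads == 0
        self.h, self.dk = heads, dim // heads
        self.wq, self.wk, self.wv = (nn.Linear(dim, dim) for _ in range(3))
        self.out = nn.Linear(dim, dim)   # linear layer applied after splicing

    def forward(self, q, k, v):          # each: (B, n, dim)
        b, n, _ = q.shape
        split = lambda t: t.view(b, n, self.h, self.dk).transpose(1, 2)
        q, k, v = split(self.wq(q)), split(self.wk(k)), split(self.wv(v))
        heads = F.scaled_dot_product_attention(q, k, v)  # one unit per head
        heads = heads.transpose(1, 2).reshape(b, n, self.h * self.dk)  # splice
        return self.out(heads)
```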
Fifth embodiment: this embodiment differs from the first to fourth embodiments in that the first scaled dot-product attention unit is calculated as follows:
multiplying the linear-layer output of the query vector by the linear-layer output of the key vector, scaling the multiplication result, masking the scaled result, and finally passing the masked result through a Softmax activation function;
multiplying the output of the Softmax activation function by the linear-layer output of the value vector to obtain the calculation result of the first scaled dot-product attention unit.
Other steps and parameters are the same as in the first to fourth embodiments.
The second and third scaled dot-product attention units are calculated in the same way as the first scaled dot-product attention unit.
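The per-unit computation corresponds to standard scaled dot-product attention, sketched below (the optional mask argument plays the role of the masking step):

```python
# One scaled dot-product attention unit: multiply, scale, mask, Softmax,
# then multiply by the values.
import math
import torch

def scaled_dot_product(q, k, v, mask=None):
    """q, k, v: (..., n, d_k); returns the unit's attention output."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # multiply + scale
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))      # masking step
    weights = torch.softmax(scores, dim=-1)                   # Softmax activation
    return weights @ v                                        # multiply by values
```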
The sixth embodiment is as follows: this embodiment is described with reference to FIG. 5. This embodiment differs from the first to fifth embodiments in that the working principle of the Bi feedforward unit is as follows:
the input of the Bi feedforward unit is sent to two parallel branches, wherein the first branch is a channel attention branch, the channel attention branch comprises a global subunit, a linear layer and a Sigmoid activation function layer, the second branch is a spatial attention branch, and the spatial attention branch comprises a local subunit, a linear layer and a Sigmoid activation function layer;
in the channel attention branch, the input of the Bi feedforward unit sequentially passes through the global subunit, the linear layer and the Sigmoid activation function layer to obtain an output X;
in the spatial attention branch, the input of the Bi feedforward unit first passes through the local subunit, then the output of the local subunit is spliced with the output of the global subunit, and the spliced result sequentially passes through the linear layer and the Sigmoid activation function layer of the spatial attention branch to obtain an output Y;
multiplying X and Y, and multiplying the multiplication result with the input of the Bi feedforward unit to obtain a final multiplication result, namely obtaining the output of the Bi feedforward unit.
Other steps and parameters are the same as in one of the first to fifth embodiments.
Seventh embodiment: this embodiment differs from one of the first to sixth embodiments in that the global subunit includes an average pooling layer, a linear layer, and a GELU activation function layer; the local subunit comprises a linear layer and a GELU activation function layer.
Other steps and parameters are the same as in one of the first to sixth embodiments.
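Combining the sixth and seventh embodiments, the Bi feedforward unit can be sketched as follows (PyTorch; the input is assumed to be a token sequence of shape (B, n, dim), so averaging over the token axis plays the role of the average pooling layer, and the hidden sizes are illustrative assumptions):

```python
# Sketch of the Bi feedforward unit with its global and local subunits.
import torch
import torch.nn as nn

class BiFeedForward(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # global subunit: average pooling (done in forward) + linear + GELU
        self.glob = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        # local subunit: linear + GELU
        self.loc = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        # channel attention branch head: linear + Sigmoid
        self.ch = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        # spatial attention branch head: linear + Sigmoid, applied to the
        # spliced local and global outputs
        self.sp = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x):                             # x: (B, n, dim)
        g = self.glob(x.mean(dim=1, keepdim=True))    # average pool over tokens
        X = self.ch(g)                                # channel attention, (B, 1, dim)
        l = self.loc(x)                               # per-token local features
        spliced = torch.cat([l, g.expand_as(l)], dim=-1)
        Y = self.sp(spliced)                          # spatial attention, (B, n, dim)
        return X * Y * x                              # multiply X, Y and the input
```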
Eighth embodiment: this embodiment is described with reference to FIG. 6. This embodiment differs from the first to seventh embodiments in that, taking the hyperspectral image data as an example, the cross attention is calculated as follows:
the additional learning code of the hyperspectral data in the output of the first Bi-Former module is taken as the input of a linear projection function F_HSI(·), and the output of F_HSI(·) is mapped to a query (Q) vector through a learning matrix W_Q; the output of F_HSI(·) is stacked with the LiDAR-DSM feature-code part of the output of the second Bi-Former module, and the stacked result is divided into two parts, the first part being mapped to a key (K) vector through a learning matrix W_K and the second part being mapped to a value (V) vector through a learning matrix W_V;
the query vector corresponding to the output of F_HSI(·) is multiplied by the key vector corresponding to the first part of the stacked result to obtain a multiplication result a, which is input into a Softmax function; the output of the Softmax function is multiplied by the value vector corresponding to the second part of the stacked result to obtain a multiplication result b, which is superposed with the query vector corresponding to the output of F_HSI(·); the superposition result is input into a linear back-projection function G_HSI(·), and the output of G_HSI(·) is then spliced with the hyperspectral feature-code part of the output of the first Bi-Former module, the splicing result being output A;
the linear projection function F_HSI(·) and the linear back-projection function G_HSI(·) serve for dimension alignment;
similarly, cross attention is calculated for the additional learning code of the LiDAR-DSM data (the same operation with the roles swapped: the additional learning code of the hyperspectral data is replaced by the additional learning code of the LiDAR-DSM data, and the LiDAR-DSM feature codes are replaced by the hyperspectral feature codes) to obtain an output B.
Other steps and parameters are the same as those of one of the first to seventh embodiments.
Because the additional learning code of the hyperspectral data has already learned the abstract information in all the feature codes of the hyperspectral data, interacting with the feature codes from the LiDAR-DSM data helps it learn complementary information. Likewise, because the additional learning code of the LiDAR-DSM data has already learned the abstract information in all the feature codes of the LiDAR-DSM data, interacting with the feature codes from the hyperspectral data helps it learn complementary information.
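A sketch of this exchange for the hyperspectral branch is given below (the LiDAR-DSM branch is symmetric); F_HSI(·), G_HSI(·) and the W matrices are modeled as linear layers, and the square-root scaling is a standard assumption not spelled out in the text:

```python
# Cross attention for the hyperspectral branch: the HSI additional learning
# code queries the stacked HSI/LiDAR-DSM codes (a sketch; sizes assumed).
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.f = nn.Linear(dim, dim)    # F_HSI: projection for dimension alignment
        self.g = nn.Linear(dim, dim)    # G_HSI: back-projection
        self.wq = nn.Linear(dim, dim)   # W_Q
        self.wk = nn.Linear(dim, dim)   # W_K
        self.wv = nn.Linear(dim, dim)   # W_V

    def forward(self, hsi_tokens, dsm_tokens):             # (B, 1+n, dim) each
        cls, feats = hsi_tokens[:, :1], hsi_tokens[:, 1:]  # learning code / feature codes
        cls_p = self.f(cls)                                # F_HSI(.)
        q = self.wq(cls_p)                                 # query vector
        stack = torch.cat([cls_p, dsm_tokens[:, 1:]], dim=1)  # stack with DSM feature codes
        k, v = self.wk(stack), self.wv(stack)
        a = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        b = a @ v                                          # multiplication result b
        out = self.g(b + q)                                # superpose with the query, G_HSI(.)
        return torch.cat([out, feats], dim=1)              # splice -> output A
```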
Examples
The invention provides a hyperspectral and laser radar data fusion classification method based on an AGLT network; the implementation flow of the method is shown in Table 1:
Table 1 AGLT network architecture algorithm flow
The specific implementation steps are as follows:
and step 1, acquiring hyperspectral image data and LiDAR-DSM data (public data is adopted).
And 2, performing dimension reduction processing on the obtained hyperspectral image data by using a PCA principal component analysis method, so that the number of wave bands of a third dimension of the three-dimensional original data can be reduced, the data redundancy can be reduced, and the running time can be shortened.
And step 3, preprocessing LiDAR-DSM data and dimension-reduced hyperspectral data. The number of training sets and the number of testing sets are designated according to the numbers of marked pixel samples in the hyperspectral image and the LiDAR-DSM image, and the training sets and the number of testing sets are stored as two pieces of data with the same format as LiDAR-DSM data and dimension-reduced hyperspectral data respectively. Then, the two pieces of data are separately sliced.
Step 4, training and classifying AGLT network
And 4.1, constructing an AGLT network, wherein the training set and the testing set used by the method adopt 11 multiplied by 11 resolution ratio. As shown in FIG. 1, H W represents the spatial dimension, C represents the third dimension, and after PCA processing and data preprocessing, a series of 11X L hyperspectral images (L represents the dimension-reduced third dimension) and 11X 11 LiDAR-DSM images are obtained.
Step 4.2, the AGLT network structure is mainly divided into three parts, namely an asymmetric three-dimensional convolution aiming at hyperspectral data, an asymmetric two-dimensional convolution after data reconstruction and an asymmetric two-dimensional convolution aiming at LiDAR-DSM data. The asymmetric three-dimensional convolution mainly includes 1×1×3,1×3×1, and 3×1×1; the asymmetric two-dimensional convolution consists mainly of 1×3 and 3×1. And the parameter quantity is reduced while the space-spectrum joint characteristic information is extracted.
And 4.3, respectively sending the output of the asymmetric convolution neural network of the hyperspectral data branch and the LiDAR-DSM data branch into Bi-force. As shown in FIG. 5, the Bi feedforward unit in the Bi-Former can enable the AGLT network to obtain lighter calculation burden in the small channel dimension, increase the information interaction of the cross channels, and fuse the global information with the local information, thereby further improving the model performance and the precision.
And 4.4, aiming at the output of Bi-force of the hyperspectral data branch and the LiDAR-DSM data branch, sending the output into the cross attention, carrying out characteristic information interaction, and fully extracting the multi-mode data fusion characteristics.
And 5, classifying the images.
Experimental part
The network model was trained, and the classification results were verified, on a computer configured as follows: an Intel Core i9-12900K CPU, 32 GB of DDR5-5200 memory, an NVIDIA GeForce RTX 3090 Ti graphics card, a 1 TB SSD plus a 4 TB HDD, running Windows 11 Professional. Three well-known multi-modal feature fusion datasets were used: the Trento dataset (TR), the MUUFL Gulfport dataset (MU) and the Augsburg dataset (AU). The dataset display diagrams are shown in FIGS. 7, 8 and 9, where different colors represent different ground-object categories; the detailed data information corresponding to FIG. 7 is given in Table 2, that corresponding to FIG. 8 in Table 3, and that corresponding to FIG. 9 in Table 4.
Table 2
Table 3
Table 4
The TR dataset was captured in a rural area around the city of Trento, Italy. The dataset contains hyperspectral images and lidar images. The hyperspectral image has 600×166 pixels and 63 bands covering the spectral range of 420.89-989.09 nm, with a spectral resolution of 9.2 nm and a spatial resolution of 1 m. The lidar image is a single-channel image containing the altitude corresponding to each ground position, and it has the same size as the hyperspectral image. The labeling information of the dataset contains 6 categories.
The MU dataset is a co-registered aerial hyperspectral-lidar dataset. The two modal images were acquired simultaneously in a single aerial flight in November 2010, over Mississippi, USA. The image size is 325×220 pixels. The hyperspectral image contains 64 spectral bands. The lidar image is a single-channel image containing the altitude corresponding to each ground position, and it has the same size as the hyperspectral image. The data annotation information contains 11 categories.
The AU dataset was captured over the city of Augsburg, Germany. The HSI data were acquired by the DAS-EOC HySpex sensor, and the LiDAR-DSM data were collected by the DLR-3K system. The spatial resolution of both images was downsampled to a uniform 30 m to adequately manage the multi-modal fusion. In this dataset, the HSI data consist of 180 bands ranging from 0.4 to 2.5 μm, while the LiDAR-DSM data have only a single channel. The size of the dataset is 332×485 pixels. The dataset depicts seven different land-cover categories.
For the evaluation indices of the model, the three objective evaluation indices most widely used in the field were selected: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (K).
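These three indices are standard and can be computed from the confusion matrix as in the sketch below (numpy; not code from the patent):

```python
# OA, AA and Kappa from a confusion matrix, where conf[i, j] counts samples
# of true class i predicted as class j.
import numpy as np

def oa_aa_kappa(conf):
    n = conf.sum()
    oa = np.trace(conf) / n                          # overall accuracy
    aa = (np.diag(conf) / conf.sum(axis=1)).mean()   # mean per-class accuracy
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```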
The objective classification results of the AGLT network model used in the present invention on three data sets are shown in Table 5.
Table 5
From left to right, the first column is the dataset; the second column is the OA of the classification result; the third column is the AA; the fourth column is K; the fifth column is the training time; the sixth column is the test time. The subjective classification results are shown in FIG. 10, where (a) shows the classification effect on the TR dataset, (b) on the MU dataset, and (c) on the AU dataset. The higher the accuracy, the less salt-and-pepper noise appears in the result map.
The above examples of the present invention are only for describing the calculation model and calculation flow of the present invention in detail, and are not limiting of the embodiments of the present invention. Other variations and modifications of the above description will be apparent to those of ordinary skill in the art, and it is not intended to be exhaustive of all embodiments, all of which are within the scope of the invention.

Claims (8)

1. The hyperspectral and laser radar data fusion classification method based on the AGLT network is characterized by comprising the following steps:
step 1, acquiring LiDAR-DSM image data and hyperspectral image data;
step 2, performing dimension reduction processing on the acquired hyperspectral image data by adopting a principal component analysis method to obtain dimension-reduced hyperspectral image data;
step 3, slicing the acquired LiDAR-DSM image data and the hyperspectral image data subjected to dimension reduction processing, and dividing the LiDAR-DSM image data and the hyperspectral image data subjected to slicing processing into a training set and a testing set;
step 4, constructing an AGLT network, training the constructed AGLT network by using a training set until the classification accuracy of the AGLT network on a test set is not improved any more, and obtaining a trained AGLT network;
step 5, jointly processing the hyperspectral image to be classified and the LiDAR-DSM image by using the trained AGLT network to obtain a classification result.
2. The hyperspectral and LiDAR data fusion classification method based on the AGLT network of claim 1, wherein the AGLT network comprises a hyperspectral image data processing branch and a LiDAR-DSM data processing branch, wherein:
in a hyperspectral image data processing branch, an input image sequentially passes through a three-dimensional normalization layer, a first activation function layer, a first three-dimensional convolution layer with a convolution kernel size of 1×1×3, a second three-dimensional convolution layer with a convolution kernel size of 1×3×1, a third three-dimensional convolution layer with a convolution kernel size of 3×1×1, a first two-dimensional normalization layer, a second activation function layer, a first two-dimensional convolution layer with a convolution kernel size of 3×1, a second two-dimensional convolution layer with a convolution kernel size of 1×3 and a first Bi-Former module;
in the LiDAR-DSM data processing branch, the input LiDAR-DSM data sequentially passes through a second two-dimensional normalization layer, a third activation function layer, a third two-dimensional convolution layer with a convolution kernel size of 3×1, a fourth two-dimensional convolution layer with a convolution kernel size of 1×3 and a second Bi-Former module;
and sending the output of the first Bi-Former module and the output of the second Bi-Former module into a cross attention layer together to obtain an output A and an output B, then enabling the output A to pass through a first MLP layer, enabling the output B to pass through a second MLP layer, and finally superposing the output of the first MLP layer and the output of the second MLP layer, wherein a superposition result is a final classification result.
3. The hyperspectral and laser radar data fusion classification method based on the AGLT network as described in claim 2, wherein the working principle of the Bi-Former module is as follows:
mapping an input image of the Bi-Former module into a vector sequence, embedding additional learning codes into the head of the vector sequence to obtain an overall sequence, embedding position codes into the overall sequence, and enabling the vector sequence after embedding the position codes to pass through an encoder submodule, wherein the output of the encoder submodule is used as the output of the Bi-Former module; and the encoder sub-module comprises N encoders;
the first encoder works on the principle that:
step one, the vector sequence with embedded position coding input by the encoder submodule is the input of the first encoder, and this vector sequence is mapped into a query vector, a key vector and a value vector respectively;
step two, multi-head attention calculation is carried out on the query vector, the key vector and the value vector, and a multi-head attention calculation result is obtained;
thirdly, carrying out residual connection on the multi-head attention calculation result in the second step and the vector sequence after embedding the position codes, and normalizing the residual connection result;
step four, sending the normalization result of the step three into a Bi feedforward unit, carrying out residual connection on the output of the Bi feedforward unit and the normalization result of the step three, normalizing the residual connection result, and taking the normalization result as the output of the first encoder;
and taking the output of the first encoder as the input of the second encoder, and so on, until the output of the N-th encoder is obtained, wherein the output of the N-th encoder is the output of the encoder submodule.
4. The hyperspectral and lidar data fusion classification method based on the AGLT network as described in claim 3, wherein the multi-head attention computing method is as follows:
step 1), passing the query vector, the key vector and the value vector through their respective linear layers, and then jointly feeding the linear-layer outputs of the query vector, the key vector and the value vector into a first scaled dot-product attention unit, jointly feeding the same three outputs into a second scaled dot-product attention unit, and jointly feeding the same three outputs into a third scaled dot-product attention unit;
splicing the outputs of the first scaled dot-product attention unit, the second scaled dot-product attention unit and the third scaled dot-product attention unit to obtain a splicing result;
step 2), passing the splicing result through a linear layer, wherein the output after the linear-layer processing is the output of the multi-head attention.
5. The hyperspectral and lidar data fusion classification method based on the AGLT network as described in claim 4, wherein the first scaled dot-product attention unit is calculated as follows:
multiplying the linear-layer output of the query vector by the linear-layer output of the key vector, scaling the multiplication result, masking the scaled result, and finally passing the masked result through a Softmax activation function;
multiplying the output of the Softmax activation function by the linear-layer output of the value vector to obtain the calculation result of the first scaled dot-product attention unit.
6. The hyperspectral and lidar data fusion classification method based on the AGLT network as described in claim 5, wherein the working principle of the Bi feedforward unit is as follows:
the input of the Bi feedforward unit is sent to two parallel branches, wherein the first branch is a channel attention branch, the channel attention branch comprises a global subunit, a linear layer and a Sigmoid activation function layer, the second branch is a spatial attention branch, and the spatial attention branch comprises a local subunit, a linear layer and a Sigmoid activation function layer;
in the channel attention branch, the input of the Bi feedforward unit sequentially passes through the global subunit, the linear layer and the Sigmoid activation function layer to obtain an output X;
in the spatial attention branch, the input of the Bi feedforward unit first passes through the local subunit, then the output of the local subunit is spliced with the output of the global subunit, and the spliced result sequentially passes through the linear layer and the Sigmoid activation function layer of the spatial attention branch to obtain an output Y;
multiplying X and Y, and multiplying the multiplication result with the input of the Bi feedforward unit to obtain a final multiplication result, namely obtaining the output of the Bi feedforward unit.
7. The hyperspectral and lidar data fusion classification method based on the AGLT network as described in claim 6, wherein the global subunit comprises an average pooling layer, a linear layer and a GELU activation function layer; the local subunit comprises a linear layer and a GELU activation function layer.
8. The method for fusion classification of hyperspectral and lidar data based on an AGLT network according to claim 7, wherein the method for computing the cross-attention is as follows:
the additional learning code of the hyperspectral data in the output of the first Bi-Former module is taken as the input of a linear projection function F_HSI(·), and the output of F_HSI(·) is mapped to a query vector through a matrix W_Q;
the output of F_HSI(·) is stacked with the LiDAR-DSM feature-code part of the output of the second Bi-Former module, and the stacked result is divided into two parts, the first part being mapped to a key vector through a matrix W_K and the second part being mapped to a value vector through a matrix W_V;
the query vector corresponding to the output of F_HSI(·) is multiplied by the key vector corresponding to the first part of the stacked result to obtain a multiplication result a, which is input into a Softmax function; the output of the Softmax function is multiplied by the value vector corresponding to the second part of the stacked result to obtain a multiplication result b, which is superposed with the query vector corresponding to the output of F_HSI(·); the superposition result is input into a linear back-projection function G_HSI(·), and the output of G_HSI(·) is then spliced with the hyperspectral feature-code part of the output of the first Bi-Former module, the splicing result being output A;
similarly, cross attention is calculated for the additional learning code of the LiDAR-DSM data to obtain output B.
CN202311439960.3A 2023-11-01 2023-11-01 Hyperspectral and laser radar data fusion classification method based on AGLT network Pending CN117475216A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311439960.3A CN117475216A (en) 2023-11-01 2023-11-01 Hyperspectral and laser radar data fusion classification method based on AGLT network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311439960.3A CN117475216A (en) 2023-11-01 2023-11-01 Hyperspectral and laser radar data fusion classification method based on AGLT network

Publications (1)

Publication Number Publication Date
CN117475216A true CN117475216A (en) 2024-01-30

Family

ID=89639164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311439960.3A Pending CN117475216A (en) 2023-11-01 2023-11-01 Hyperspectral and laser radar data fusion classification method based on AGLT network

Country Status (1)

Country Link
CN (1) CN117475216A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117876890A (en) * 2024-03-11 2024-04-12 成都信息工程大学 Multi-source remote sensing image classification method based on multi-level feature fusion
CN117876890B (en) * 2024-03-11 2024-05-07 成都信息工程大学 Multi-source remote sensing image classification method based on multi-level feature fusion
CN117934978A (en) * 2024-03-22 2024-04-26 安徽大学 Hyperspectral and laser radar multilayer fusion classification method based on countermeasure learning
CN117934978B (en) * 2024-03-22 2024-06-11 安徽大学 Hyperspectral and laser radar multilayer fusion classification method based on countermeasure learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination