CN117474781A - High spectrum and multispectral image fusion method based on attention mechanism - Google Patents

High spectrum and multispectral image fusion method based on attention mechanism

Info

Publication number
CN117474781A
CN117474781A CN202311469057.1A CN202311469057A CN117474781A CN 117474781 A CN117474781 A CN 117474781A CN 202311469057 A CN202311469057 A CN 202311469057A CN 117474781 A CN117474781 A CN 117474781A
Authority
CN
China
Prior art keywords
feature
attention
image
hyperspectral
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311469057.1A
Other languages
Chinese (zh)
Inventor
徐炳洁
石静芸
傅安特
陈施施
吴海军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Eastern Communication Co Ltd
Original Assignee
Eastern Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Eastern Communication Co Ltd filed Critical Eastern Communication Co Ltd
Priority to CN202311469057.1A priority Critical patent/CN117474781A/en
Publication of CN117474781A publication Critical patent/CN117474781A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/58 Extraction of image or video features relating to hyperspectral data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10032 Satellite or aerial image; Remote sensing
    • G06T2207/10036 Multispectral image; Hyperspectral image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A40/00 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production
    • Y02A40/10 Adaptation technologies in agriculture, forestry, livestock or agroalimentary production in agriculture

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a hyperspectral and multispectral image fusion method based on an attention mechanism, which comprises the following steps: step 1: acquiring a hyperspectral image with low spatial resolution and a multispectral image with high spatial resolution, and constructing a training set and a test set; step 2: constructing a dual-stream hyperspectral and multispectral image fusion model based on an attention mechanism; step 3: training the constructed image fusion model with the training set and an Adam optimizer, back-propagating a loss function to update the model and obtain an optimal model; step 4: inputting the hyperspectral image with low spatial resolution and the multispectral image with high spatial resolution to be fused from the test set into the trained image fusion model to obtain a hyperspectral image with high spatial resolution. The method shows good fusion performance when processing complex ground objects and effectively reduces the overall error of the fusion result while attending to details, thereby obtaining high-quality, high-resolution hyperspectral images.

Description

High spectrum and multispectral image fusion method based on attention mechanism
Technical Field
The invention relates to the technical field of remote sensing image processing and deep learning, in particular to a hyperspectral and multispectral image fusion method based on an attention mechanism.
Background
Because of the limitations of imaging sensors, the spectral resolution and the spatial resolution of a remote sensing image constrain each other, and no single sensor can simultaneously acquire data with high spatial resolution, high spectral resolution and high temporal resolution. A hyperspectral image has hundreds of spectral bands and provides rich spectral information that can be used for finer material identification and classification, but its spatial resolution is low. A multispectral image has higher spatial resolution and can capture clearer ground features, but it has fewer spectral bands and cannot provide detailed spectral information. By fusing the hyperspectral image with the multispectral image, an image with both high spatial and high spectral resolution can be generated, which greatly improves the classification and recognition of land objects and widens the range of image applications.
Current hyperspectral and multispectral image fusion methods can be divided into traditional methods and deep-learning-based methods. Traditional fusion methods include approaches based on pansharpening, matrix decomposition, tensor representation and the like. Deep-learning-based methods can automatically learn and extract complex features from the data and can handle high-dimensional data effectively, but the fixed size of the convolution kernel prevents them from modeling global semantic information; attention mechanisms have therefore been introduced into hyperspectral and multispectral fusion to capture long-range details of the hyperspectral and multispectral images.
The existing hyperspectral and multispectral fusion methods have certain limitations, mainly the following:
1) Most conventional methods are based on manually extracted features and rely on prior assumptions; they are often sensitive to parameter selection, and if these assumptions do not apply to the problem at hand, the fusion quality may degrade.
2) The spectral bands of remote sensing images are highly correlated and the images exhibit non-local similarity across spatial positions, but convolutional neural networks, limited by their receptive field, only extract local feature information within a window, so these intrinsic properties of remote sensing images are not fully utilized.
3) Because the hyperspectral image and the multispectral image both contain spatial and spectral information, there is redundant as well as complementary information between them; however, existing methods only consider how to extract spatial or spectral information from each image separately and do not consider the information interaction between the hyperspectral and multispectral images.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a hyperspectral and multispectral image fusion method based on an attention mechanism. The method fully utilizes the strong correlation between the spectral bands of remote sensing images and the non-local similarity of spatial positions, captures long-range dependencies and the self-similarity prior, and captures the redundant and complementary information of the hyperspectral and multispectral images, thereby achieving a better fusion effect. The specific technical scheme is as follows:
a hyperspectral and multispectral image fusion method based on an attention mechanism comprises the following steps:
step 1: acquiring a hyperspectral image with low spatial resolution and a multispectral image with high spatial resolution, and constructing a training set and a testing set;
step 2: constructing a dual-stream hyperspectral and multispectral image fusion model based on an attention mechanism;
step 3: training the constructed image fusion model with the training set and an Adam optimizer, back-propagating a loss function to update the model and obtain an optimal model (a minimal training-loop sketch is given after this list of steps);
step 4: inputting the hyperspectral image with low spatial resolution and the multispectral image with high spatial resolution to be fused from the test set into the trained image fusion model to obtain a hyperspectral image with high spatial resolution.
Further, the image fusion model comprises a dual-branch network, a feature fusion module and an image reconstruction module;
the hyperspectral image with low spatial resolution is up-sampled to the same size as the multispectral image with high spatial resolution, and then is input into a dual-branch network for feature extraction;
the feature fusion module fuses the extracted features, and then reconstructs a hyperspectral image with high spatial resolution from the fused features through the image reconstruction module.
Further, the upsampling of the low spatial resolution hyperspectral image uses a bilinear interpolation algorithm.
Further, the dual-branch network comprises a spatial feature extraction branch and a spectral feature extraction branch;
the spatial feature extraction branch and the spectral feature extraction branch respectively perform, in sequence, shallow feature extraction and depth feature extraction on the multispectral image with high spatial resolution and on the up-sampled hyperspectral image with low spatial resolution;
the shallow feature extraction adopts two independent shallow feature extraction modules, which extract shallow spatial features and shallow spectral features from the multispectral image with high spatial resolution and from the up-sampled hyperspectral image with low spatial resolution respectively, and map the feature data to high-dimensional features;
the depth feature extraction adopts two attention-guided cross-domain feature extraction modules to alternately extract the global feature-domain information of the shallow spatial features and shallow spectral features and the interaction feature information across feature domains, so as to obtain deep spatial features and deep spectral features.
Furthermore, the shallow feature extraction module consists of two consecutive convolution layers with a 3×3 convolution kernel and a stride of 1, and a parametric rectified linear unit (PReLU) activation function is applied between the convolution layers.
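As an illustration, a minimal PyTorch sketch of such a shallow feature extraction module might look as follows; the class name and the default channel width (192, taken from the embodiment below) are assumptions for illustration only.

```python
import torch.nn as nn

# Sketch of one shallow feature extraction module: two consecutive 3x3
# convolutions with stride 1 and a PReLU activation between them, mapping
# the input bands to a high-dimensional feature space.
class ShallowFeatureExtractor(nn.Module):
    def __init__(self, in_channels: int, out_channels: int = 192):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.PReLU(out_channels),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x):          # x: (B, in_channels, H, W)
        return self.body(x)        # (B, out_channels, H, W)
```

One such module would be applied to the HR-MSI and another, with its own weights, to the up-sampled LR-HSI.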
Further, the attention-guided cross-domain feature extraction module consists of a self-attention-based Swin Transformer and a cross-attention-based Swin Transformer connected in cascade.
Further, the self-attention-based Swin Transformer includes a regular window layer as the first layer and a shifted window layer as the next layer, wherein the regular window layer includes: a window-based multi-head self-attention W-MSA module and a multi-layer perceptron MLP module, each module applying a residual connection, with an LN layer in front of the W-MSA and the MLP; correspondingly, the shifted window layer includes: a shifted-window multi-head self-attention SW-MSA module and a multi-layer perceptron MLP module, each module applying a residual connection, with an LN layer in front of the SW-MSA and the MLP.
Further, the specific process of the feature data processing of the regular window layer is as follows: assume the input feature F of the self-attention-based Swin Transformer has a size of H×W×C, C representing the number of channels. The input feature is first divided into non-overlapping M×M local windows, i.e., F is reshaped into features of size (HW/M²)×M²×C, where HW/M² is the total number of windows; next, standard self-attention is performed within each window separately. For a local window feature X ∈ R^(M²×C), three learnable weight matrices W^Q, W^K and W^V, shared between different windows, project it into query Q, key K and value V by:
Q=XW^Q, K=XW^K, V=XW^V #(1)
By computing the dot product of the query Q with all keys K and then normalizing with the softmax function, the corresponding attention scores are obtained, and the attention mechanism is defined as follows:
Attention(Q,K,V)=softmax(QK^T/√d_k+B)V #(2)
where d_k is the dimension of the key and B is a learnable relative position code;
the self-attention SA is extended to a multi-head self-attention MSA, so the overall process of the local window feature X at the regular window layer is formulated as:
SA(X)=Attention(Q(X),K(X),V(X)) #(3)
X′=W-MSA(LN(X))+X #(4)
Z_1=MLP(LN(X′))+X′ #(5)
where Z_1 denotes the output of the regular window layer of the self-attention-based Swin Transformer with X as input.
Further, the specific process of the feature data processing of the shifted window layer is expressed as:
X″=SW-MSA(LN(Z_1))+Z_1 #(6)
Z_2=MLP(LN(X″))+X″ #(7)
where Z_2 is the output of the self-attention-based Swin Transformer with X as input; the multi-layer perceptron MLP module is expressed as:
MLP(X)=GELU(W_1X+b_1)W_2+b_2 #(8)
where GELU is the Gaussian error linear unit, W_1 and W_2 are learnable weights of the fully connected layers, and b_1 and b_2 are learnable bias parameters.
Further, the layer structure of the cross-attention-based Swin Transformer is similar to that of the self-attention-based Swin Transformer, except that the attention module adopts a multi-head cross-attention MCA module; the specific feature data processing of the cross-attention-based Swin Transformer is as follows:
for two local window features X_1 and X_2 from different domains, namely from feature domain 1 and feature domain 2 respectively, the cross-attention mechanism is defined as:
CA(X_1,X_2)=Attention(Q(X_2),K(X_1),V(X_1)) #(9)
where CA(·) is the attention function computing the relationship between X_1 and X_2; the features X_1 and X_2 are projected into queries, keys and values by:
Q_1=X_1W_1^Q, K_1=X_1W_1^K, V_1=X_1W_1^V #(10)
Q_2=X_2W_2^Q, K_2=X_2W_2^K, V_2=X_2W_2^V #(11)
As shown in equations (10) and (11), feature X_1 is used to generate key K_1 and value V_1, while feature X_2 is used to generate query Q_2; the attention weighting operation is then performed with the generated keys, values and queries, realizing the fusion of cross-modal information. Meanwhile, a residual connection preserves the original information of feature domain 1 so that the information is stored and transferred, and the same processing is applied to feature domain 2;
thus, the overall process of the cross-attention-based Swin Transformer is defined as:
X_1^(1)=W-MCA(LN(X_1),LN(X_2))+X_1 #(12)
X_2^(1)=W-MCA(LN(X_2),LN(X_1))+X_2 #(13)
where X_1^(1) denotes the output of feature X_1 through the first-layer MCA, X_2^(1) denotes the output of feature X_2 through the first-layer MCA, and W-MCA(·,·) denotes window-based multi-head cross-attention whose first argument supplies the keys and values and whose second argument supplies the queries. The features are then shifted by (⌊M/2⌋, ⌊M/2⌋) pixels in the two directions, the windows are repartitioned, and the attention within each window is computed as:
F_1=SW-MCA(LN(X_1^(1)),LN(X_2^(1)))+X_1^(1) #(14)
F_2=SW-MCA(LN(X_2^(1)),LN(X_1^(1)))+X_2^(1) #(15)
where F_1 denotes the output of feature X_1 in feature domain 1 through the cross-attention-based Swin Transformer, and F_2 is the output of feature X_2 in feature domain 2 through the cross-attention-based Swin Transformer.
Further, the feature fusion module comprises a Transformer-based deep feature fusion module and a CNN-based feature fusion module; the deep spectral features and the deep spatial features are spliced and then input into the Transformer-based deep feature fusion module to obtain fused deep spectral-spatial features; the fused deep spectral-spatial features are combined with the shallow spatial features and the shallow spectral features through a long skip connection to obtain spliced feature information; finally, the spliced feature information is input into the CNN-based feature fusion module, which extracts local information again from the features containing global information and fuses the local information of the different domains, obtaining the fused spatial-spectral features.
Further, the image reconstruction module comprises 3 convolution layers; each convolution layer has a 3×3 filter and a stride of 1, and a PReLU activation is applied after each convolution layer. The module maps the fused deep spectral features and deep spatial features back to the image space and recovers the fused shallow spatial features and shallow spectral features, obtaining a hyperspectral image with high spatial resolution.
Compared with the prior art, the method adopts the Swin Transformer to fully extract the global information of the hyperspectral and multispectral images and adopts the cross-attention mechanism to fully utilize the global information within and between feature domains; the error distribution is uniform over the whole image, good fusion performance is shown when processing complex regions, and the overall error of the fusion result is effectively reduced while attending to details, thereby obtaining high-quality, high-resolution hyperspectral images.
Drawings
FIG. 1 is a schematic flow diagram of the attention-mechanism-based hyperspectral and multispectral image fusion method of the present invention;
FIG. 2 is a schematic diagram of a specific flow of image feature extraction, fusion and image reconstruction performed by the image fusion model according to an embodiment of the present invention;
FIG. 3 is a data processing flow diagram of an image fusion model according to an embodiment of the present invention;
FIG. 4 is a data processing flow diagram of the self-attention-based Swin Transformer in an embodiment of the present invention;
FIG. 5 is a data processing flow diagram of the cross-attention-based Swin Transformer in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention will be further described in detail with reference to the drawings and examples of the specification.
As shown in FIG. 1, the method for fusing hyperspectral and multispectral images based on an attention mechanism mainly comprises the following steps:
step 1: acquiring a hyperspectral image with low spatial resolution and a multispectral image with high spatial resolution, and constructing a training set and a testing set;
step 2: constructing a dual-stream hyperspectral and multispectral image fusion model based on an attention mechanism;
step 3: training the constructed image fusion model with the training set and an Adam optimizer, back-propagating a loss function to update the model and obtain an optimal model;
step 4: inputting the hyperspectral image with low spatial resolution and the multispectral image with high spatial resolution to be fused from the test set into the trained image fusion model to obtain a hyperspectral image with high spatial resolution.
Specifically, assume that the LR-HSI (Low Resolution Hyperspectral Image, i.e., the hyperspectral image with low spatial resolution) has spatial size w×h and S spectral bands, each band having lower spatial resolution, and that the HR-MSI (High Resolution Multispectral Image, i.e., the multispectral image with high spatial resolution) has spatial size W×H and s spectral bands but higher spatial resolution, where H and h denote the image heights, W and w the image widths, and S and s the numbers of spectral bands, with h < H, w < W and s < S.
As shown in fig. 2 and 3, in the image fusion model, the LR-HSI is first up-sampled to the same size as the HR-MSI using bilinear interpolation, giving the up-sampled LR-HSI; the HR-MSI and the up-sampled LR-HSI are then input into the spatial feature extraction branch and the spectral feature extraction branch of the dual-branch network, respectively.
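For reference, this upsampling step can be sketched with PyTorch's bilinear interpolation as follows; the function name and the (batch, bands, height, width) tensor layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Upsample the LR-HSI to the spatial size of the HR-MSI by bilinear interpolation.
# lr_hsi: (B, S, h, w) low-resolution hyperspectral cube
# hr_msi: (B, s, H, W) high-resolution multispectral image
def upsample_to_msi(lr_hsi: torch.Tensor, hr_msi: torch.Tensor) -> torch.Tensor:
    H, W = hr_msi.shape[-2:]
    return F.interpolate(lr_hsi, size=(H, W), mode="bilinear", align_corners=False)
```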
Shallow feature extraction is then performed on the HR-MSI and the up-sampled LR-HSI respectively, yielding high-resolution shallow spatial features and shallow spectral features. The shallow feature extraction uses two independent shallow feature extraction modules, which obtain spatial texture features and spectral features from the HR-MSI and the up-sampled LR-HSI respectively. Each of the two shallow feature extraction modules consists of two consecutive convolution layers, which provide a simple and convenient way to increase the feature dimension, with a parametric rectified linear unit (PReLU) activation function between the convolution layers; the convolution kernel size is 3×3 and the stride is 1. The modules extract shallow local semantic information and map the feature data to high-dimensional features; the extracted shallow spectral features and shallow spatial features are each of size H×W×C, where C denotes the number of channels, and in this embodiment C is 192.
Then, the shallow spectral features and shallow spatial features are input into the depth feature extraction module, which alternately extracts the global feature-domain information of the MSI (Multispectral Image) and HSI (Hyperspectral Image) and the interaction feature information across feature domains, obtaining deep spectral features and deep spatial detail features. The depth feature extraction module is provided with 2 Attention-guided Cross-domain Feature Extraction (ACFE) modules; each ACFE module consists of a Self-Attention Swin Transformer (SAST) and a Cross-Attention Swin Transformer (CAST) in cascade, which encourage information exchange between the two features, perform interactive feature extraction and fusion between the spatial-detail branch and the spectral branch, and capture the redundant and complementary information of the HSI and the MSI.
The hyperspectral image HSI may have consistency or similarity between spectral characteristics and features across the entire image. Due to the receptive field limitation of CNN, the long-range dependence in modeling images is limited, only the local information of the images can be modeled, global semantic information cannot be effectively captured, and the capability of screening and distinguishing useful features from redundant information is lacking. Therefore, after the preliminary shallow feature extraction is carried out, the Swin Transformer is used for further extracting the features containing global information, and each extracted feature fuses the global information because the Swin Transformer has the capability of acquiring the long-range dependency information.
Thus, the shallow spatial features and shallow spectral features are input into the depth feature extraction module and pass through the self-attention-based Swin Transformer modules and the cross-attention-based Swin Transformer modules to obtain the deep spectral features and deep spatial detail features, which contain the global feature-domain information and the cross-feature-domain global information of the MSI and the HSI.
The shallow spatial features, shallow spectral features, deep spectral features and deep spatial detail features are then input into the feature fusion module for feature fusion to obtain spectral-spatial fusion features. The feature fusion module comprises a Transformer-based deep feature fusion module and a CNN-based feature fusion module, which fuse the deep global features and then fuse the fused deep global features with the local features. Specifically, the deep spectral features and deep spatial features are spliced and input into the Transformer-based deep feature fusion module to obtain the fused deep spectral-spatial features F_FDF. Because the Transformer self-attention computation requires stretching each patch into a one-dimensional token, the spatial information inside the patch is lost; a long skip connection is therefore used to combine the fusion information F_FDF with the shallow feature information, giving the spliced feature information F_CF, which retains the local information while integrating medium- and long-range information into the features. The spliced features F_CF are input into the CNN-based feature fusion module, which extracts local information again from the features containing global information and fuses the local information of the different domains, yielding the fused spatial-spectral features F_FF.
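A structural sketch of this fusion stage is given below; the Transformer-based deep fusion block is abstracted as an injected module and the CNN-based fusion is assumed to be a small 3×3 convolutional stack, since the disclosure does not fix these internals. The channel bookkeeping (each input feature having C channels, the Transformer block keeping 2C channels) is likewise an assumption.

```python
import torch
import torch.nn as nn

# Sketch of the feature fusion stage: deep spectral/spatial features are
# concatenated and fused by a Transformer-based block (F_FDF), recombined
# with the shallow features through a long skip connection (F_CF), and
# finally fused locally by a small CNN (F_FF).
class FeatureFusion(nn.Module):
    def __init__(self, channels: int, transformer_fusion: nn.Module):
        super().__init__()
        self.transformer_fusion = transformer_fusion   # assumed to map 2C -> 2C channels
        self.cnn_fusion = nn.Sequential(               # CNN-based local fusion (assumed form)
            nn.Conv2d(4 * channels, channels, 3, 1, 1),
            nn.PReLU(channels),
            nn.Conv2d(channels, channels, 3, 1, 1),
        )

    def forward(self, f_spa_shallow, f_spe_shallow, f_spa_deep, f_spe_deep):
        f_deep = torch.cat([f_spe_deep, f_spa_deep], dim=1)             # splice deep features
        f_fdf = self.transformer_fusion(f_deep)                         # fused deep spectral-spatial features
        f_cf = torch.cat([f_fdf, f_spa_shallow, f_spe_shallow], dim=1)  # long skip connection
        return self.cnn_fusion(f_cf)                                    # fused spatial-spectral features F_FF
```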
Finally, the spatial-spectral fusion features are input into the image reconstruction module, and the required high-spatial-resolution hyperspectral image is reconstructed from the fusion features. Specifically, the spatial-spectral fusion features F_FF are input into the image reconstruction module, which has 3 convolution layers; each convolution layer has a 3×3 filter and a stride of 1, and a PReLU activation is applied after each convolution layer. The module maps the deep features fused in F_FF back to the image space and recovers the fused shallow features, obtaining the high-resolution hyperspectral image.
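A sketch of such a reconstruction module is shown below; the intermediate channel width and the assumption that the last convolution maps to the S output bands are illustrative choices.

```python
import torch.nn as nn

# Sketch of the image reconstruction module: three 3x3 convolutions with
# stride 1, each followed by a PReLU activation, mapping the fused features
# back to the S-band image space.
class ImageReconstruction(nn.Module):
    def __init__(self, in_channels: int, out_bands: int, mid_channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, 1, 1), nn.PReLU(mid_channels),
            nn.Conv2d(mid_channels, mid_channels, 3, 1, 1), nn.PReLU(mid_channels),
            nn.Conv2d(mid_channels, out_bands, 3, 1, 1), nn.PReLU(out_bands),
        )

    def forward(self, fused_features):
        return self.body(fused_features)   # high-spatial-resolution hyperspectral image
```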
In summary, the dual-stream hyperspectral and multispectral image fusion method based on the attention mechanism adopts a dual-branch network and combines CNN and Swin Transformer to extract features containing both local and global information, extracting spatial details and spectral information from the multispectral image and the hyperspectral image respectively. A cross-attention mechanism is introduced so that the features of the two branches of the hyperspectral and multispectral images can interact; the similarity and redundant information of the two images are used to enhance the features and extract the features of both images better. The Swin Transformer combines the spatial and spectral global features, which are spliced with the local information extracted by the CNN for better feature fusion. Finally, a high-spatial-resolution hyperspectral image is reconstructed from the fused features by the image reconstruction network. The global information within and between feature domains is fully utilized, so a better fusion effect and a higher-quality high-resolution hyperspectral image can be obtained.
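Putting these pieces together, the overall forward pass of the dual-stream model can be sketched as follows; the sub-modules are stand-ins corresponding to the sketches above and the composition mirrors figs. 2 and 3, so this is an illustrative skeleton rather than the exact implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

# End-to-end sketch of the dual-stream fusion model: shallow extraction on
# each branch, attention-guided cross-domain deep feature extraction (SAST +
# CAST), feature fusion, and image reconstruction.
class DualStreamFusionModel(nn.Module):
    def __init__(self, shallow_spe: nn.Module, shallow_spa: nn.Module,
                 acfe: nn.Module, fusion: nn.Module, recon: nn.Module):
        super().__init__()
        self.shallow_spe = shallow_spe   # shallow feature extraction, spectral branch
        self.shallow_spa = shallow_spa   # shallow feature extraction, spatial branch
        self.acfe = acfe                 # attention-guided cross-domain feature extraction
        self.fusion = fusion             # Transformer + CNN feature fusion
        self.recon = recon               # image reconstruction module

    def forward(self, lr_hsi, hr_msi):
        up_hsi = F.interpolate(lr_hsi, size=hr_msi.shape[-2:],
                               mode="bilinear", align_corners=False)
        f_spe = self.shallow_spe(up_hsi)                       # shallow spectral features
        f_spa = self.shallow_spa(hr_msi)                       # shallow spatial features
        f_spe_deep, f_spa_deep = self.acfe(f_spe, f_spa)       # interactive deep features
        fused = self.fusion(f_spa, f_spe, f_spa_deep, f_spe_deep)
        return self.recon(fused)                               # high-resolution HSI
```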
As shown in fig. 4, the processing flow of the self-attention-based Swin Transformer (SAST) is the same as that of a Swin Transformer layer and includes a window-based multi-head self-attention (MSA) module and a shifted-window-based MSA module. Attention based on the shifted window mechanism is the basic component of the SAST design: it captures global information in the HR-MSI and LR-HSI feature extraction branches respectively, so as to effectively integrate global information within the same feature domain. Specifically, assume the input is a given feature F of size H×W×C. The input is first divided into non-overlapping M×M local windows, i.e., it is reshaped into features of size (HW/M²)×M²×C, where HW/M² is the total number of windows. Next, standard self-attention is performed for each window separately. For a local window feature X ∈ R^(M²×C), three learnable weight matrices W^Q, W^K and W^V, shared between different windows, project it into query Q, key K and value V by:
Q=XW^Q, K=XW^K, V=XW^V #(1)
The attention mechanism essentially computes the dot product of the query Q with all keys K and then normalizes it with the softmax function to obtain the corresponding attention scores. The attention mechanism is defined as follows:
Attention(Q,K,V)=softmax(QK^T/√d_k+B)V #(2)
where d_k is the dimension of the key and B is a learnable relative position code. Extending self-attention to multi-head self-attention MSA enables the attention mechanism to consider various attention distributions and allows the model to capture information from different angles. In practice, the attention function is performed h times in parallel and the results are concatenated to form the multi-head self-attention, where h is set to 8.
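As an illustration, the window partitioning and the windowed multi-head attention of equations (1)-(2) might be sketched in PyTorch as follows; the relative position bias is stored here as one full table per head for brevity, which is a simplification of the usual Swin Transformer relative-coordinate indexing.

```python
import torch
import torch.nn as nn

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)   # (num_windows*B, M*M, C)

# Window-based multi-head self-attention: softmax(Q K^T / sqrt(d_k) + B) V
# with h = 8 heads and a learnable relative position bias B.
class WindowSelfAttention(nn.Module):
    def __init__(self, dim: int, window_size: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)     # shared W^Q, W^K, W^V
        self.proj = nn.Linear(dim, dim)
        self.rel_bias = nn.Parameter(
            torch.zeros(num_heads, window_size ** 2, window_size ** 2))

    def forward(self, x):                      # x: (num_windows*B, M*M, C)
        Bn, N, C = x.shape
        qkv = self.qkv(x).reshape(Bn, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (Bn, heads, N, C/heads)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.rel_bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bn, N, C)
        return self.proj(out)
```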
The Swin Transformer layer comprises the W-MSA module and the multi-layer perceptron MLP module; each employs a residual connection, with an LN (LayerNorm) layer in front of the W-MSA and the MLP. Thus, the overall process of the SAST module for a local window feature X is formulated as:
SA(X)=Attention(Q(X),K(X),V(X)) #(3)
X′=W-MSA(LN(X))+X #(4)
Z_1=MLP(LN(X′))+X′ #(5)
where Z_1 denotes the output of the first (regular window) layer of the Swin Transformer with X as input.
If the window partitions of all layers remained fixed, no connections could be established across local windows. To achieve cross-window connectivity, regular window and shifted window partitioning are applied alternately. In the regular window layer (the first layer), the standard window partitioning scheme is used and self-attention is computed within each window. Then, in the shifted window layer (the next layer), new windows are created by moving the window partitions: before this moving window partitioning, the features are shifted by (⌊M/2⌋, ⌊M/2⌋) pixels in the two directions, the windows are divided again, and the attention within each window is computed, so that the attention computation in the new windows crosses the boundaries of the layer-1 windows and connections between the windows are realized. The calculation is as follows:
X″=SW-MSA(LN(Z_1))+Z_1 #(6)
Z_2=MLP(LN(X″))+X″ #(7)
where Z_2 is the output of the SAST with X as input. The multi-layer perceptron MLP is given by:
MLP(X)=GELU(W_1X+b_1)W_2+b_2 #(8)
where GELU is the Gaussian error linear unit, W_1 and W_2 are learnable weights of the fully connected layers, and b_1 and b_2 are learnable bias parameters.
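A sketch of one such SAST unit, i.e., the regular window layer followed by the shifted window layer of equations (3)-(8), is given below; it reuses the window_partition and WindowSelfAttention sketches above, boundary masking of the rolled windows is omitted for brevity, and the module composition is an assumption consistent with the description rather than the exact implementation.

```python
import torch
import torch.nn as nn

# One SAST unit: regular-window layer then shifted-window layer, each with
# pre-LayerNorm, (S)W-MSA, an MLP with GELU, and residual connections.
class SASTBlock(nn.Module):
    def __init__(self, dim, window_size, attn_regular, attn_shifted, mlp_ratio=4):
        super().__init__()
        self.M = window_size
        self.attn1, self.attn2 = attn_regular, attn_shifted   # e.g. WindowSelfAttention
        self.norm = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.mlp1 = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                  nn.Linear(mlp_ratio * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                  nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                                       # x: (B, H, W, C)
        x = self._windowed(self.attn1, self.norm[0](x), 0) + x              # eq. (4)
        x = self.mlp1(self.norm[1](x)) + x                                  # eq. (5)
        x = self._windowed(self.attn2, self.norm[2](x), self.M // 2) + x    # eq. (6)
        x = self.mlp2(self.norm[3](x)) + x                                  # eqs. (7)-(8)
        return x

    def _windowed(self, attn, x, shift):
        B, H, W, C = x.shape
        if shift:
            x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))   # shift the features
        w = attn(window_partition(x, self.M))                         # attention per window
        w = w.view(B, H // self.M, W // self.M, self.M, self.M, C)
        x = w.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        if shift:
            x = torch.roll(x, shifts=(shift, shift), dims=(1, 2))     # shift back
        return x
```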
The SAST modules are used to extract the long-range dependency information of the HSI and the MSI, so that the spatial and spectral long-range dependencies of the HSI and the MSI can be effectively learned and the spatial and spectral quality can be improved.
As shown in fig. 5, the CAST, i.e., the cross-attention-based Swin Transformer, and the SAST follow a similar baseline but have a key difference: the CAST employs multi-head cross-attention (Multi-head Cross Attention, MCA) rather than multi-head self-attention MSA, so as to achieve global context exchange across feature domains.
Through the dual-branch network, spatial information and spectral information can be extracted from the multispectral image and the hyperspectral image respectively; however, the multispectral image also contains certain spectral information, and a plain dual-branch network ignores the complementarity between the multispectral and hyperspectral images, so that the extracted feature information may be incomplete and the reconstructed image may still suffer from spatial or spectral distortion. Therefore, the invention designs a cross-attention-based Swin Transformer (CAST), which accurately models the cross-modal relationship between the HSI and the MSI.
In the deep feature extraction branch for the HSI, the HSI features are used to generate K and V, while the MSI features are used to generate Q; in the deep feature extraction branch for the MSI, the MSI features are used to generate K and V and the HSI features are used to generate Q, where Q denotes the query vector, K the key vector, and V the value vector.
Given two features X_1 and X_2 respectively, the relationship between them can be modeled with an attention mechanism, defined as follows:
CA(X_1,X_2)=Attention(Q(X_2),K(X_1),V(X_1)) #(9)
where CA(·) is the attention function computing the relationship between X_1 and X_2, calculated with the same attention function as equation (2). The features X_1 and X_2 are projected into queries, keys and values by:
Q_1=X_1W_1^Q, K_1=X_1W_1^K, V_1=X_1W_1^V #(10)
Q_2=X_2W_2^Q, K_2=X_2W_2^K, V_2=X_2W_2^V #(11)
As shown in equations (10) and (11), feature X_1 in feature domain 1 is used to generate key K_1 and value V_1, whereas in feature domain 2 feature X_2 is used to generate query Q_2. The attention weighting operation is then performed with these generated keys, values and queries, realizing the fusion of cross-modal information. Meanwhile, a residual connection preserves the original information of feature domain 1 so that it is stored and transferred, and the same processing is applied to feature domain 2. This design effectively captures the complementary information in the two feature domains and fuses it organically, improving the expressive capacity of the model.
Thus, given two local window features X_1 and X_2 from different domains, the overall process of the CAST is defined as:
X_1^(1)=W-MCA(LN(X_1),LN(X_2))+X_1 #(12)
X_2^(1)=W-MCA(LN(X_2),LN(X_1))+X_2 #(13)
where X_1^(1) denotes the output of feature X_1 in feature domain 1 through the first-layer MCA, X_2^(1) denotes the output of feature X_2 in feature domain 2 through the first-layer MCA, and W-MCA(·,·) denotes window-based multi-head cross-attention whose first argument supplies the keys and values and whose second argument supplies the queries. The features are then shifted by (⌊M/2⌋, ⌊M/2⌋) pixels in the two directions, the windows are repartitioned, and the attention within each window is computed:
F_1=SW-MCA(LN(X_1^(1)),LN(X_2^(1)))+X_1^(1) #(14)
F_2=SW-MCA(LN(X_2^(1)),LN(X_1^(1)))+X_2^(1) #(15)
where F_1 denotes the output of feature X_1 in feature domain 1 through the CAST, and F_2 is the output of feature X_2 in feature domain 2 through the CAST.
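A window-level sketch of the MCA operation of equations (9)-(11) is shown below; the relative position bias is omitted and the layer norms are assumed to be handled by the enclosing block, as in the SAST sketch, so the class is illustrative only.

```python
import torch
import torch.nn as nn

# Multi-head cross-attention between two window features: for feature
# domain 1 the query comes from X2 while keys and values come from X1,
# i.e. CA(X1, X2) = Attention(Q(X2), K(X1), V(X1)); the residual keeps
# the original domain-1 information. Calling it with swapped arguments
# handles feature domain 2.
class WindowCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.to_q = nn.Linear(dim, dim)   # W^Q of the query (other) domain
        self.to_k = nn.Linear(dim, dim)   # W^K of the key/value (own) domain
        self.to_v = nn.Linear(dim, dim)   # W^V of the key/value (own) domain
        self.proj = nn.Linear(dim, dim)

    def forward(self, x1, x2):            # x1, x2: (num_windows*B, M*M, C)
        B, N, C = x1.shape
        q = self.to_q(x2).view(B, N, self.h, self.d).transpose(1, 2)   # queries from domain 2
        k = self.to_k(x1).view(B, N, self.h, self.d).transpose(1, 2)   # keys from domain 1
        v = self.to_v(x1).view(B, N, self.h, self.d).transpose(1, 2)   # values from domain 1
        attn = (q @ k.transpose(-2, -1) / self.d ** 0.5).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out) + x1        # residual preserves domain-1 information
```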
Unlike the MSA, whose inputs K, Q and V come from the same image feature, the inputs K, Q and V of the MCA come from different image features. For example, when K and V come from the HSI features while Q comes from the MSI features, the HSI feature information is influenced by the MSI feature information after the MCA computation. Through the interaction of spatial and spectral attention, the feature redundancy between the two sub-networks is reduced, the complementarity between the features is improved, and the effective fusion of the HSI and MSI feature information is realized.
The cross-attention mechanism is effectively applied to the decoupling of spatial and spectral information to enhance the complementarity of the extracted features and reduce redundancy. Through the interaction of spatial and spectral attention, feature redundancy between two subnetworks is reduced, while complementarity between features is also improved, thereby optimizing the integration and processing of information. The design method is beneficial to ensuring the independence of feature extraction and simultaneously maximally utilizing the complementarity between the HSI and the MSI, thereby realizing efficient feature fusion and optimizing the final image quality.
In summary, the hyperspectral and multispectral image fusion method based on the attention mechanism of the invention first designs a dual-stream spatial-spectral fusion network with an attention mechanism, combining the advantages of CNN and Swin Transformer to fully mine the local and global dependencies within the HSI and MSI. Secondly, a cross-attention mechanism is introduced to model the redundant and complementary information across the HSI and MSI modalities, capture the complex correlations between the HSI and the MSI, and acquire the high correlation between spectral bands and the non-local similarity of spatial positions. The Swin Transformer is then used to fuse the spatial details and the global spectral features, which are spliced with the local information extracted by the CNN for better feature fusion. Finally, the desired high-spatial-resolution hyperspectral image (HR-HSI, High Resolution Hyperspectral Image) is reconstructed from the fused features using an image reconstruction network.
Although the convolution operation can properly fuse texture features in a deep network, it cannot adjust each pixel value over a global field of view. Compared with the prior art, the invention captures global information by introducing the Swin Transformer and fully utilizes the global information within and between feature domains through the cross-attention mechanism; the error distribution is uniform over the whole image, good fusion performance is shown when processing complex ground objects, and the overall error of the fusion result is effectively reduced while attending to details, obtaining a higher-quality high-resolution hyperspectral image.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the foregoing detailed description of the invention has been provided, it will be apparent to those skilled in the art that modifications may be made to the embodiments described in the foregoing examples, and that certain features may be substituted for those illustrated and described herein. Modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (12)

1. The hyperspectral and multispectral image fusion method based on the attention mechanism is characterized by comprising the following steps of:
step 1: acquiring a hyperspectral image with low spatial resolution and a multispectral image with high spatial resolution, and constructing a training set and a testing set;
step 2: constructing a dual-stream hyperspectral and multispectral image fusion model based on an attention mechanism;
step 3: training the constructed image fusion model with the training set and an Adam optimizer, back-propagating a loss function to update the model and obtain an optimal model;
step 4: inputting the hyperspectral image with low spatial resolution and the multispectral image with high spatial resolution to be fused from the test set into the trained image fusion model to obtain a hyperspectral image with high spatial resolution.
2. The method for merging hyperspectral and multispectral images based on an attention mechanism as claimed in claim 1, wherein the image merging model comprises a dual-branch network, a feature merging module and an image reconstruction module;
the hyperspectral image with low spatial resolution is up-sampled to the same size as the multispectral image with high spatial resolution, and then is input into a dual-branch network for feature extraction;
the feature fusion module fuses the extracted features, and then reconstructs a hyperspectral image with high spatial resolution from the fused features through the image reconstruction module.
3. A method of attention-based hyperspectral and multispectral image fusion as recited in claim 2, wherein the upsampling of the low spatial resolution hyperspectral image uses a bilinear interpolation algorithm.
4. A hyperspectral and multispectral image fusion method based on an attention mechanism as recited in claim 2, wherein the dual-branch network includes a spatial feature extraction branch and a spectral feature extraction branch;
the spatial feature extraction branch and the spectral feature extraction branch respectively perform, in sequence, shallow feature extraction and depth feature extraction on the multispectral image with high spatial resolution and on the up-sampled hyperspectral image with low spatial resolution;
the shallow feature extraction adopts two independent shallow feature extraction modules, which extract shallow spatial features and shallow spectral features from the multispectral image with high spatial resolution and from the up-sampled hyperspectral image with low spatial resolution respectively, and map the feature data to high-dimensional features;
the depth feature extraction adopts two attention-guided cross-domain feature extraction modules to alternately extract the global feature-domain information of the shallow spatial features and shallow spectral features and the interaction feature information across feature domains, so as to obtain deep spatial features and deep spectral features.
5. The method for merging hyperspectral and multispectral images based on an attention mechanism as claimed in claim 4, wherein the shallow feature extraction module consists of two consecutive convolution layers with a 3×3 convolution kernel and a stride of 1, and a parametric rectified linear unit (PReLU) activation function is applied between the convolution layers.
6. The method of claim 4, wherein the attention-guided cross-domain feature extraction module consists of a self-attention-based Swin Transformer and a cross-attention-based Swin Transformer connected in cascade.
7. The method for merging hyperspectral and multispectral images based on an attention mechanism as recited in claim 6, wherein the self-attention-based Swin Transformer includes a regular window layer as the first layer and a shifted window layer as the next layer, wherein the regular window layer includes: a window-based multi-head self-attention W-MSA module and a multi-layer perceptron MLP module, each module applying a residual connection, with an LN layer in front of the W-MSA and the MLP; correspondingly, the shifted window layer includes: a shifted-window multi-head self-attention SW-MSA module and a multi-layer perceptron MLP module, each module applying a residual connection, with an LN layer in front of the SW-MSA and the MLP.
8. The method for merging hyperspectral and multispectral images based on an attention mechanism as recited in claim 7, wherein the specific process of the feature data processing of the regular window layer is as follows: assuming the input feature F of the self-attention-based Swin Transformer has a size of H×W×C, C representing the number of channels, the input feature is first divided into non-overlapping M×M local windows, i.e., it is reshaped into features of size (HW/M²)×M²×C, where HW/M² is the total number of windows; then, standard self-attention SA is executed for each window separately; for a local window feature X ∈ R^(M²×C), three learnable weight matrices W^Q, W^K and W^V shared between different windows project it into query Q, key K and value V by:
Q=XW^Q, K=XW^K, V=XW^V #(1)
by calculating the dot product of the query Q and all keys K and then normalizing with the softmax function, the corresponding attention scores are obtained, and the attention mechanism is defined as follows:
Attention(Q,K,V)=softmax(QK^T/√d_k+B)V #(2)
where d_k is the dimension of the key and B is a learnable relative position code;
the self-attention SA is extended to a multi-head self-attention MSA, so the overall process of the local window feature X at the regular window layer is formulated as:
SA(X)=Attention(Q(X),K(X),V(X)) #(3)
X′=W-MSA(LN(X))+X #(4)
Z_1=MLP(LN(X′))+X′ #(5)
wherein Z_1 is the output of the regular window layer of the self-attention-based Swin Transformer with X as input.
9. The method for merging hyperspectral and multispectral images based on an attention mechanism as recited in claim 8, wherein the specific process expression of the feature data processing of the shifted window layer is:
X″=SW-MSA(LN(Z_1))+Z_1 #(6)
Z_2=MLP(LN(X″))+X″ #(7)
wherein Z_2 is the output of the self-attention-based Swin Transformer with X as input, and the expression of the multi-layer perceptron MLP module is as follows:
MLP(X)=GELU(W_1X+b_1)W_2+b_2 #(8)
wherein GELU is the Gaussian error linear unit, W_1 and W_2 are learnable weights of the fully connected layers, and b_1 and b_2 are learnable bias parameters.
10. The method of claim 7, wherein the layer structure of the cross-attention-based Swin Transformer is similar to that of the self-attention-based Swin Transformer, except that the attention module adopts a multi-head cross-attention MCA module, and the specific feature data processing of the cross-attention-based Swin Transformer is as follows:
for two local window features X_1 and X_2 from different domains, namely from feature domain 1 and feature domain 2 respectively, the cross-attention mechanism is defined as:
CA(X_1,X_2)=Attention(Q(X_2),K(X_1),V(X_1)) #(9)
wherein CA(·) is the attention function computing the relationship between X_1 and X_2, and the features X_1 and X_2 are projected into queries, keys and values by:
Q_1=X_1W_1^Q, K_1=X_1W_1^K, V_1=X_1W_1^V #(10)
Q_2=X_2W_2^Q, K_2=X_2W_2^K, V_2=X_2W_2^V #(11)
as shown in equations (10) and (11), feature X_1 is used to generate key K_1 and value V_1, and feature X_2 is used to generate query Q_2; the attention weighting operation is then performed with the generated keys, values and queries, realizing the fusion of cross-modal information; meanwhile, a residual connection preserves the original information in feature domain 1 so that the information is stored and transferred, and the same processing is applied to feature domain 2;
thus, the overall process of the cross-attention-based Swin Transformer is defined as:
X_1^(1)=W-MCA(LN(X_1),LN(X_2))+X_1 #(12)
X_2^(1)=W-MCA(LN(X_2),LN(X_1))+X_2 #(13)
wherein X_1^(1) represents the output of feature X_1 through the first-layer MCA, X_2^(1) represents the output of feature X_2 through the first-layer MCA, and W-MCA(·,·) denotes window-based multi-head cross-attention whose first argument supplies the keys and values and whose second argument supplies the queries; the features are shifted by (⌊M/2⌋, ⌊M/2⌋) pixels in the two directions, the windows are repartitioned, and the attention in each window is calculated as follows:
F_1=SW-MCA(LN(X_1^(1)),LN(X_2^(1)))+X_1^(1) #(14)
F_2=SW-MCA(LN(X_2^(1)),LN(X_1^(1)))+X_2^(1) #(15)
wherein F_1 represents the output of feature X_1 in feature domain 1 through the cross-attention-based Swin Transformer, and F_2 is the output of feature X_2 in feature domain 2 through the cross-attention-based Swin Transformer.
11. The method for fusing hyperspectral and multispectral images based on an attention mechanism as claimed in claim 10, wherein the feature fusion module comprises a Transformer-based deep feature fusion module and a CNN-based feature fusion module; the deep spectral features and the deep spatial features are spliced and then input into the Transformer-based deep feature fusion module to obtain fused deep spectral-spatial features, the fused deep spectral-spatial features are combined with the shallow spatial features and the shallow spectral features through a long skip connection to obtain spliced feature information, the spliced feature information is input into the CNN-based feature fusion module, and local information is extracted again from the features containing global information to fuse the local information of the different domains, thereby obtaining the fused spatial-spectral features.
12. The method of claim 11, wherein the image reconstruction module comprises 3 convolution layers, each convolution layer has a 3×3 filter and a stride of 1, and a PReLU activation is applied after each convolution layer; the module maps the fused deep spectral features and deep spatial features back into the image space, and recovers the fused shallow spatial features and shallow spectral features to obtain the high-spatial-resolution hyperspectral image.
CN202311469057.1A 2023-11-06 2023-11-06 High spectrum and multispectral image fusion method based on attention mechanism Pending CN117474781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311469057.1A CN117474781A (en) 2023-11-06 2023-11-06 High spectrum and multispectral image fusion method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311469057.1A CN117474781A (en) 2023-11-06 2023-11-06 High spectrum and multispectral image fusion method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN117474781A true CN117474781A (en) 2024-01-30

Family

ID=89636029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311469057.1A Pending CN117474781A (en) 2023-11-06 2023-11-06 High spectrum and multispectral image fusion method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN117474781A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117726916A (en) * 2024-02-18 2024-03-19 电子科技大学 Implicit fusion method for enhancing image resolution fusion
CN117726916B (en) * 2024-02-18 2024-04-19 电子科技大学 Implicit fusion method for enhancing image resolution fusion
CN117911830A (en) * 2024-03-20 2024-04-19 安徽大学 Global interaction hyperspectral multi-spectral cross-modal fusion method for spectrum fidelity
CN117911830B (en) * 2024-03-20 2024-05-28 安徽大学 Global interaction hyperspectral multi-spectral cross-modal fusion method for spectrum fidelity
CN117953312A (en) * 2024-03-25 2024-04-30 深圳航天信息有限公司 Part detection method, device, equipment and storage medium based on visual recognition

Similar Documents

Publication Publication Date Title
Wang et al. Ultra-dense GAN for satellite imagery super-resolution
CN117474781A (en) High spectrum and multispectral image fusion method based on attention mechanism
CN113673590B (en) Rain removing method, system and medium based on multi-scale hourglass dense connection network
Gao et al. Cross-scale mixing attention for multisource remote sensing data fusion and classification
Zhang et al. LR-Net: Low-rank spatial-spectral network for hyperspectral image denoising
Yan et al. When pansharpening meets graph convolution network and knowledge distillation
CN112669248B (en) Hyperspectral and panchromatic image fusion method based on CNN and Laplacian pyramid
Li et al. Deep hybrid 2-D–3-D CNN based on dual second-order attention with camera spectral sensitivity prior for spectral super-resolution
CN111951195A (en) Image enhancement method and device
CN116343052B (en) Attention and multiscale-based dual-temporal remote sensing image change detection network
Li et al. RGB-induced feature modulation network for hyperspectral image super-resolution
de Souza Brito et al. Combining max-pooling and wavelet pooling strategies for semantic image segmentation
Pan et al. Structure–color preserving network for hyperspectral image super-resolution
CN114972024A (en) Image super-resolution reconstruction device and method based on graph representation learning
CN114757862B (en) Image enhancement progressive fusion method for infrared light field device
CN115578262A (en) Polarization image super-resolution reconstruction method based on AFAN model
Long et al. Dual self-attention Swin transformer for hyperspectral image super-resolution
Choudhary et al. From conventional approach to machine learning and deep learning approach: an experimental and comprehensive review of image fusion techniques
Yang et al. Variation learning guided convolutional network for image interpolation
Wu et al. Hprn: Holistic prior-embedded relation network for spectral super-resolution
Gong et al. Learning deep resonant prior for hyperspectral image super-resolution
CN112734645B (en) Lightweight image super-resolution reconstruction method based on feature distillation multiplexing
CN109584194B (en) Hyperspectral image fusion method based on convolution variation probability model
CN116563187A (en) Multispectral image fusion based on graph neural network
CN114429424B (en) Remote sensing image super-resolution reconstruction method suitable for uncertain degradation modes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination