CN116452930A - Multispectral image fusion method and multispectral image fusion system based on frequency domain enhancement in degradation environment - Google Patents
- Publication number
- CN116452930A (application number CN202310311387.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- attention
- frequency information
- fusion
- visible light
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
- G06V10/54—Extraction of image or video features relating to texture
- G06V10/56—Extraction of image or video features relating to colour
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a multispectral image fusion method and system based on frequency domain enhancement in a degradation environment, relating to the technical field of image processing. According to the invention, a ViT backbone network is used as the feature encoder to extract multi-scale multispectral features, and high-frequency and low-frequency modules are designed that improve the self-attention structure in the ViT model so that frequency information in the multispectral images is captured. Deep features are fully utilized through a nested connection architecture, the different-scale information extracted by the encoder network is further fused, and a high-resolution fused image is output through layer-by-layer up-sampling. As a result, the complementary information of the multispectral images is fully utilized in a degradation environment, background clutter is effectively suppressed, target characteristics are markedly enhanced, and high-quality, reliable image data are provided for downstream tasks such as target detection, tracking, and segmentation.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a multispectral image fusion method and system based on frequency domain enhancement in a degradation environment.
Background
Multispectral image fusion is an important technique in image processing that aims to generate a single image containing the significant features and complementary information of the source images, by using appropriate feature extraction methods and fusion strategies. The most advanced fusion algorithms are now widely used in many applications, such as autonomous vehicles, visual tracking, and intelligent security.
Fusion algorithms can be broadly divided into two categories: traditional methods and deep learning-based methods. Most conventional methods are based on signal processing operators that have achieved good performance. In recent years, deep learning-based approaches have shown tremendous potential in image fusion tasks and are believed to offer better performance than traditional algorithms.
Conventional methods generally fall into two groups: multi-scale methods and methods based on sparse and low-rank representation (LRR) learning.
Multi-scale methods typically decompose the source image into different scales to extract features, fuse the features at each scale using an appropriate fusion strategy, and then reconstruct the fused image with the inverse operator. Although these methods exhibit good fusion performance, their results depend heavily on the chosen multi-scale decomposition.
Before deep learning-based fusion methods were developed, sparse representation (SR) and LRR attracted considerable attention. Methods based on joint sparse representation (JSR) extract common information and complementary features from the source images.
One multi-focus image fusion method based on LRR and dictionary learning first divides the source image into image blocks and classifies each block using histogram of oriented gradients (HOG) features; a global dictionary is then learned by K-singular value decomposition (K-SVD). Many other methods combine SR with further operators, such as pulse-coupled neural networks (PCNNs) and shearlet transforms.
Conventional fusion methods have the following disadvantages: their efficiency depends heavily on structural operators such as dictionary learning; and when the input images are complex, they generalize poorly, leading to degraded fusion performance.
To address these shortcomings, many deep learning-based fusion approaches have been proposed over the past few years. Because a model can be trained specifically for the image fusion task to obtain better fusion performance, the latest deep learning methods follow this training strategy. In the field of infrared and visible light image fusion, dense-block and auto-encoder architectures have been adopted to fuse multispectral data; however, without downsampling operations these deep learning methods cannot extract multi-scale features, so deep multispectral features are not fully exploited, and a well-designed fusion module for fusing multi-scale multispectral deep features is lacking.
In addition, conventional methods fuse multispectral images using hand-crafted low-level features, so they often fail in complex scenes. Some approaches attempt to design texture enhancement modules or employ attention mechanisms to direct the model toward complementary regions in the multispectral images. All of these recent deep learning-based methods share a common trait: however sophisticated the technique, they enhance information only in the image (spatial) domain, and make no effective use of frequency domain information.
Therefore, in order to better model the spectrum-invariant and spectrum-specific information in RGB-T images, a solution is needed that enables robust multispectral image fusion.
Disclosure of Invention
In recent years, the progress and development of sensor technology provides more spectrum data for target perception in a degraded environment, and in order to effectively utilize the characteristics of multispectral images and solve the technical problems in the prior art, the invention provides a multispectral image fusion scheme based on frequency domain enhancement in the degraded environment; the scheme combines the high-frequency local detail information and the low-frequency global structure information of the image, effectively suppresses background clutter under the degradation condition, enhances the useful target characteristics, and provides high-quality and reliable image data for the downstream specific target recognition tasks such as target detection, tracking and segmentation.
The invention discloses a multispectral image fusion method based on frequency domain enhancement in a degradation environment. The method comprises the following steps: step S1, taking a visible light image and an infrared image as multispectral images to be fused, and respectively extracting multiscale characteristics of the visible light image and the infrared image by utilizing a characteristic encoder so as to obtain an original characteristic diagram formed by the multiscale characteristics of the visible light image and the multiscale characteristics of the infrared image; s2, extracting a high-frequency information attention vector and a low-frequency information attention vector in the original feature map based on a high-frequency information attention module and a low-frequency information attention module respectively, and obtaining the attention vector of the original feature map by cascading the high-frequency information attention vector and the low-frequency information attention vector; and S3, cascading the visible light image multiscale features and the infrared image multiscale features, multiplying the cascaded visible light image multiscale features and the attention vector of the original feature image to obtain multispectral fusion features, and performing multiscale up-sampling processing on the multispectral fusion features to obtain a multispectral fusion image.
According to the method of the first aspect, in the step S1, the feature encoder is a visual Transformer based on a multi-head self-attention architecture, comprising a plurality of self-attention layers and one fully-connected layer; wherein each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping calculation as its output, and the output features of the self-attention layers are cascaded and further subjected to normalization processing and processing by a fully-connected layer with a GELU activation function, so as to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image and thereby the original feature map.
According to the method of the first aspect, in said step S2: the high-frequency information attention module extracts the high-frequency information attention vector in the original feature map by utilizing a 3×3 window, wherein the high-frequency information attention vector represents line and shape information in the original feature map; the low-frequency information attention module extracts a low-frequency signal in the original feature map by utilizing a two-dimensional discrete cosine transform, and captures the low-frequency information attention vector from the low-frequency signal based on a standard attention mechanism, wherein the low-frequency information attention vector characterizes texture and color information in the original feature map.
According to the method of the first aspect, in said step S3, said multi-scale up-sampling process is performed with a cross-layer dense connection structure comprising a number of basic decoding units, each comprising two 3×3 convolutional layers.
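A basic decoding unit of the kind just described (two stacked 3×3 convolutional layers) can be sketched in numpy for a single channel; the zero padding, the ReLU nonlinearity, and the random kernels are illustrative assumptions, not the patent's learned layers:

```python
import numpy as np

def conv3x3(x, kernel):
    """'Same' 2-D convolution of a single-channel map with a 3x3 kernel
    (zero padding), standing in for one learned convolutional layer."""
    h, w = x.shape
    padded = np.pad(x, 1)                      # zero-pad by 1 on each side
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def basic_decoding_unit(x, k1, k2):
    """Two 3x3 convolutions with an assumed ReLU after each, as in one
    basic decoding unit of the cross-layer densely connected decoder."""
    y = np.maximum(conv3x3(x, k1), 0.0)
    return np.maximum(conv3x3(y, k2), 0.0)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8))
out = basic_decoding_unit(feat,
                          rng.standard_normal((3, 3)),
                          rng.standard_normal((3, 3)))
print(out.shape)  # (8, 8): spatial size preserved by 'same' padding
```

The 'same' padding means stacking such units preserves resolution, so the decoder can raise resolution purely through the interleaved up-sampling steps.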
The invention discloses a multispectral image fusion system based on frequency domain enhancement in a degradation environment. The system comprises: a first processing unit configured to: taking a visible light image and an infrared image as multispectral images to be fused, and respectively extracting multiscale characteristics of the visible light image and the infrared image by utilizing a characteristic encoder so as to obtain an original characteristic diagram formed by the multiscale characteristics of the visible light image and the multiscale characteristics of the infrared image; a second processing unit configured to: extracting a high-frequency information attention vector and a low-frequency information attention vector in the original feature map based on a high-frequency information attention module and a low-frequency information attention module respectively, and obtaining the attention vector of the original feature map by cascading the high-frequency information attention vector and the low-frequency information attention vector; a third processing unit configured to: and carrying out cascade connection on the multi-scale features of the visible light image and the multi-scale features of the infrared image, multiplying the multi-scale features with the attention vector of the original feature image to obtain a multi-spectrum fusion feature, and carrying out multi-scale up-sampling processing on the multi-spectrum fusion feature to obtain a multi-spectrum fusion image.
The system according to the second aspect, wherein the feature encoder is a visual Transformer based on a multi-head self-attention architecture, comprising a plurality of self-attention layers and one fully-connected layer; wherein each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping calculation as its output, and the output features of the self-attention layers are cascaded and further subjected to normalization processing and processing by a fully-connected layer with a GELU activation function, so as to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image and thereby the original feature map.
The system according to the second aspect, wherein the high-frequency information attention module extracts the high-frequency information attention vector in the original feature map using a 3×3 window, the high-frequency information attention vector characterizing line and shape information in the original feature map; and the low-frequency information attention module extracts a low-frequency signal in the original feature map by utilizing a two-dimensional discrete cosine transform, and captures the low-frequency information attention vector from the low-frequency signal based on a standard attention mechanism, the low-frequency information attention vector characterizing texture and color information in the original feature map.
The system according to the second aspect, wherein the multi-scale up-sampling process is performed using a cross-layer dense connection structure comprising a number of basic decoding units, each of the basic decoding units comprising two 3×3 convolutional layers.
A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps in the multispectral image fusion method based on frequency domain enhancement in the degradation environment according to the first aspect of the disclosure when executing the computer program.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer readable storage medium stores a computer program, which when executed by a processor, implements the steps in a multispectral image fusion method based on frequency domain enhancement in a degradation environment according to the first aspect of the disclosure.
In summary, the technical scheme provided by the invention utilizes the state-of-the-art Vision Transformer (ViT) deep learning model as a backbone network to extract multi-scale multispectral features, designs high-frequency and low-frequency modules to extract frequency information in the spectrum, fully utilizes deep features through a nested connection architecture, and retains more information from the different-scale features extracted by the encoder network. Finally, high-precision and robust multispectral image fusion is realized.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the invention and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a multispectral image fusion method based on frequency domain enhancement in a degradation environment according to an embodiment of the invention;
fig. 2 is a flow diagram of performing a multi-scale upsampling process using the nested structure of a UNet++ network according to an embodiment of the present invention;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a multispectral image fusion scheme based on frequency domain enhancement in a degradation environment. According to the invention, a ViT backbone network is used as the feature encoder to extract multi-scale multispectral features, and high-frequency and low-frequency modules are designed that improve the self-attention structure in the ViT model so that frequency information in the multispectral images is captured. Deep features are fully utilized through a nested connection architecture, the different-scale information extracted by the encoder network is further fused, and a high-resolution fused image is output through layer-by-layer up-sampling. As a result, the complementary information of the multispectral images is fully utilized in a degradation environment, background clutter is effectively suppressed, target characteristics are markedly enhanced, and high-quality, reliable image data are provided for downstream tasks such as target detection, tracking and segmentation.
The invention discloses a multispectral image fusion method based on frequency domain enhancement in a degradation environment. Fig. 1 is a flow chart of a multispectral image fusion method based on frequency domain enhancement in a degradation environment according to an embodiment of the invention; as shown in connection with fig. 1, the method comprises: step S1, taking a visible light image and an infrared image as multispectral images to be fused, and respectively extracting multiscale characteristics of the visible light image and the infrared image by utilizing a characteristic encoder so as to obtain an original characteristic diagram formed by the multiscale characteristics of the visible light image and the multiscale characteristics of the infrared image; s2, extracting a high-frequency information attention vector and a low-frequency information attention vector in the original feature map based on a high-frequency information attention module and a low-frequency information attention module respectively, and obtaining the attention vector of the original feature map by cascading the high-frequency information attention vector and the low-frequency information attention vector; and S3, cascading the visible light image multiscale features and the infrared image multiscale features, multiplying the cascaded visible light image multiscale features and the attention vector of the original feature image to obtain multispectral fusion features, and performing multiscale up-sampling processing on the multispectral fusion features to obtain a multispectral fusion image.
Specifically, by enhancing the frequency domain characteristics, the method effectively combines the high-frequency local detail information and the low-frequency global structure information of the image and realizes high-quality, robust fusion of multispectral images. In a degradation environment, the background clutter of an image is large and the effective target information is interfered with and weakened. To effectively utilize the characteristics of the multispectral images, a ViT backbone network is used as the feature encoder to extract multispectral features, and high-frequency and low-frequency modules are designed that improve the self-attention structure in the ViT model and extract the frequency information in the multispectral images. Deep features are fully utilized through a nested connection architecture, the different-scale information extracted by the encoder network is further fused, and a high-resolution fused image is output through layer-by-layer up-sampling. Finally, background clutter is effectively suppressed in a degradation environment, target characteristics are markedly enhanced, and high-quality, reliable target image data are provided for downstream target sensing tasks such as target detection, tracking and segmentation.
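In miniature, steps S1–S3 reduce to: concatenate the visible and infrared feature stacks along channels, scale the channels by an attention vector, and up-sample. The numpy sketch below illustrates only these shape-level operations; the tensor sizes, the random attention vector, and the nearest-neighbour up-sampling are illustrative assumptions standing in for the learned encoder, attention modules, and decoder:

```python
import numpy as np

def fuse_multispectral(vis_feat, ir_feat, attn, scale=2):
    """Steps S1-S3 in miniature: concatenate visible-light and infrared
    feature maps along channels (S3 cascade), weight them by a per-channel
    attention vector (product with the attention of S2), then up-sample
    (nearest-neighbour, standing in for the multi-scale decoder)."""
    stacked = np.concatenate([vis_feat, ir_feat], axis=0)   # (2C, H, W)
    weighted = stacked * attn[:, None, None]                # channel-wise scaling
    return weighted.repeat(scale, axis=1).repeat(scale, axis=2)

rng = np.random.default_rng(1)
vis = rng.standard_normal((4, 16, 16))   # assumed C=4 visible-light features
ir = rng.standard_normal((4, 16, 16))    # assumed infrared features
attn = rng.uniform(size=8)               # attention vector for 2C = 8 channels
fused = fuse_multispectral(vis, ir, attn)
print(fused.shape)  # (8, 32, 32)
```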
In some embodiments, in the step S1, the feature encoder is a visual Transformer based on a multi-head self-attention architecture, including a plurality of self-attention layers and one fully-connected layer; wherein each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping calculation as its output, and the output features of the self-attention layers are cascaded and further subjected to normalization processing and processing by a fully-connected layer with a GELU activation function, so as to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image and thereby the original feature map.
Specifically, the Transformer architecture was originally proposed for sequence-to-sequence learning, such as machine translation. Owing to its efficiency, the Transformer has since become the model of choice for various natural language processing tasks. In the field of computer vision, self-attention (SA) is used instead of convolution: the visual Transformer (Vision Transformer, ViT) extracts patches from an image and feeds them into a Transformer encoder to obtain a global representation, which is finally transformed for classification. Models based on the Transformer architecture show better scalability than CNNs; that is, ViT performs significantly better than ResNet models when larger models are trained on larger data sets. Transformer-based models are becoming a powerful backbone network in the field of computer vision.
The Transformer is based on multi-head self-attention (MSA), which captures long-term relationships between tokens at different locations. Specifically, let the feature X ∈ R^{C×H×W} be reshaped into the input sequence X ∈ R^{N×D} of the standard MSA layer, where N is the length of the input sequence and D is the hidden dimension. Each self-attention head computes query, key, and value matrices using linear transformations of X:

Q = XW_q, K = XW_k, V = XW_v,

wherein W_q, W_k, and W_v are learnable parameters, all of dimension R^{D×D_h}, and D_h is the hidden size of each head. The output of the SA is then a weighted sum of the N value vectors:

Attention(X) = softmax(QK^T / √D_h) V.

For an MSA layer with N_h heads, the final output is obtained by a linear mapping of the cascaded outputs of the individual SA heads:

MSA(X) = concat[Attention_1(X), …, Attention_{N_h}(X)] W_o,

wherein W_o ∈ R^{(N_h·D_h)×D} is a learnable parameter. A Transformer block thus comprises an MSA layer and a fully-connected layer, which can be expressed as:

X′ = MSA(LN(X)) + X,
X″ = FC(LN(X′)) + X′,

where LN represents a layer normalization (LayerNorm) operation and FC represents a fully-connected layer with a GELU activation function.
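The MSA and Transformer-block computations above can be checked with a small numpy sketch using random weights and two heads; the 0.1 weight scaling and the tanh approximation of GELU are illustrative choices, not part of the patent:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def attention(X, Wq, Wk, Wv):
    """One self-attention head: softmax(QK^T / sqrt(D_h)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def transformer_block(X, heads, Wo, W1, W2):
    """MSA with residual connection, then a GELU fully-connected layer
    with residual, each branch preceded by LayerNorm."""
    msa = np.concatenate([attention(layer_norm(X), *h) for h in heads],
                         axis=-1) @ Wo
    X = X + msa                                  # X' = MSA(LN(X)) + X
    return X + gelu(layer_norm(X) @ W1) @ W2     # X'' = FC(LN(X')) + X'

rng = np.random.default_rng(0)
N, D, Dh = 6, 8, 4                    # sequence length, hidden dim, head dim
heads = [tuple(rng.standard_normal((D, Dh)) * 0.1 for _ in range(3))
         for _ in range(2)]           # two heads, each with (Wq, Wk, Wv)
Wo = rng.standard_normal((2 * Dh, D)) * 0.1
W1 = rng.standard_normal((D, 2 * D)) * 0.1
W2 = rng.standard_normal((2 * D, D)) * 0.1
Y = transformer_block(rng.standard_normal((N, D)), heads, Wo, W1, W2)
print(Y.shape)  # (6, 8): the block preserves the sequence shape
```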
The ViT backbone is divided into 4 stages to generate pyramid feature maps for dense prediction tasks. For the multispectral inputs R and T, multi-scale features Φ_R^i and Φ_T^i (i = 1, …, 4) are obtained via the ViT multi-scale feature extractor E, defined as Φ_R^i = E^i(R) and Φ_T^i = E^i(T).
in some embodiments, in said step S2: the high-frequency information attention module extracts high-frequency information attention vectors in the original feature map by utilizing a window of 3*3, wherein the high-frequency information attention vectors represent lines and shape information in the original feature map; the low frequency information attention module extracts a low frequency signal in the original feature map by utilizing a two-dimensional discrete cosine transform, and captures the low frequency information attention vector from the low frequency signal based on a standard attention mechanism, wherein the low frequency information attention vector characterizes texture and color information in the original feature map.
Specifically, the two-dimensional discrete cosine transform (Discrete Cosine Transform, DCT) is

F_{h,w} = Σ_{i=1}^{H} Σ_{j=1}^{W} x_{i,j} cos(π(h−1)(2i−1)/(2H)) cos(π(w−1)(2j−1)/(2W)),

where h ∈ {1, 2, 3, …, H}, w ∈ {1, 2, 3, …, W}, and F ∈ R^(H×W) is the two-dimensional DCT spectrum; x ∈ R^(H×W) is the input feature, and H and W represent the height and width of x, respectively (normalization constants are omitted).
Specifically, the inverse DCT transform is

x_{i,j} = Σ_{h=1}^{H} Σ_{w=1}^{W} F_{h,w} cos(π(h−1)(2i−1)/(2H)) cos(π(w−1)(2j−1)/(2W)),

where i ∈ {1, 2, 3, …, H} and j ∈ {1, 2, 3, …, W}, again up to normalization constants.
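The DCT pair can be sketched in numpy. This is the orthonormal (type-II) variant, so it carries the normalization constants that the patent's formulas omit; note that the lowest coefficient is proportional to the feature mean, which is the connection between the DCT spectrum and global average pooling made later in the text.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix: M[k, i] = s_k * cos(pi*k*(2i+1)/(2n)),
    with s_0 = sqrt(1/n) and s_k = sqrt(2/n) for k > 0."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def dct2(x):
    """Two-dimensional DCT, applied separably over rows and columns."""
    return dct_matrix(x.shape[0]) @ x @ dct_matrix(x.shape[1]).T

def idct2(F):
    """Inverse 2D DCT: the basis is orthonormal, so inversion is transposition."""
    return dct_matrix(F.shape[0]).T @ F @ dct_matrix(F.shape[1])

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # toy feature map of height 8, width 16
F = dct2(x)                    # 2D DCT spectrum
```

The round trip idct2(dct2(x)) recovers x exactly up to floating-point error, and F[0, 0] equals the mean of x scaled by √(H·W).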
In particular, the channel attention mechanism is widely used in CNNs. It uses a scalar to represent and evaluate the importance of each channel. Suppose X ∈ R^(C×H×W) is an image feature tensor in the network, where C is the number of channels, H is the feature height, and W is the feature width. The present invention treats the scalar representation in channel attention as a compression problem, since an entire channel must be represented by a single scalar. The attention mechanism can thus be written as:
Att = σ(f1(f2(X)))
where Att is the channel attention vector, σ is the sigmoid function, f1 is a mapping function (either a fully-connected layer or a one-dimensional convolution may be chosen), and f2 is a compression operation realizing R^(C×H×W) → R^C. After the attention vector Att for all C channels is obtained, each channel feature map of the input X is scaled by the corresponding attention value:
X̃_{i,:,:} = Att_i · X_{i,:,:},

where X̃ is the output of the attention mechanism, Att_i is the i-th element of the attention vector, and X_{i,:,:} is the i-th channel of the input feature. In general, global average pooling (Global Average Pooling, GAP) is a common choice for channel compression due to its simplicity and efficiency. Other compression methods, such as global max pooling (Global Max Pooling, GMP) and global standard deviation pooling (Global Standard Deviation Pooling, GSDP), can also realize channel compression.
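The compress-then-scale pipeline above can be sketched in numpy, with GAP, GMP, and GSDP as interchangeable choices for f2 and a single fully-connected layer (one of the options the text names) for f1. The weight matrix is random here, purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(X, W, compress="gap"):
    """Att = sigma(f1(f2(X))): f2 compresses each channel of X in
    R^{C,H,W} to one scalar (GAP/GMP/GSDP); f1 is a learnable mapping,
    here a single fully-connected layer W in R^{C,C}."""
    if compress == "gap":
        z = X.mean(axis=(1, 2))   # global average pooling
    elif compress == "gmp":
        z = X.max(axis=(1, 2))    # global max pooling
    else:                         # "gsdp"
        z = X.std(axis=(1, 2))    # global standard-deviation pooling
    att = sigmoid(z @ W)          # attention vector, one scalar per channel
    return att[:, None, None] * X, att  # scale each channel map by its weight

rng = np.random.default_rng(1)
C, H, W_ = 8, 16, 16
X = rng.normal(size=(C, H, W_))
Y, att = channel_attention(X, rng.normal(0, 0.1, (C, C)))
```

Each output channel is exactly the input channel multiplied by its attention scalar, matching the per-channel scaling formula above.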
It is known that using GAP in the channel attention mechanism retains only the lowest-frequency information: active components at other frequencies are discarded, even though those frequency components also contain information useful for channel learning. Therefore, to compress the channels better and introduce more information, the present invention extends GAP to more frequency components of the 2D DCT.
Natural images contain rich frequencies: high frequencies capture local details of objects (e.g., lines and shapes), while low frequencies encode global structures (e.g., textures and colors). However, the global self-attention in a typical MSA layer does not take the characteristics of these different frequencies into account. For this purpose, the invention processes the high-frequency and the low-frequency information in the feature map separately at the attention layer. High-frequency attention focuses on the local details of objects, so global attention over the whole feature map is unnecessary; this effectively reduces computational complexity and improves efficiency. The high-frequency information attention module (High-Frequency Information Attention Module, HIAM) uses 3×3 windows with local window self-attention to capture fine-grained high-frequency information, which saves substantial computing resources. The global attention in MSA helps capture low-frequency information, but applying MSA directly to high-resolution feature maps incurs a significant computational cost. The low-frequency information attention module (Low-Frequency Information Attention Module, LIAM) therefore first applies a two-dimensional DCT to each window to obtain the low-frequency signal in the input X. The DCT feature map is then mapped to the key K and the value V, while the query vector Q in LIAM still comes from the original feature map X. A standard attention mechanism is then applied to capture the rich low-frequency information in the feature map:
Q = XW_q,
K = DCT(X)W_k,
V = DCT(X)W_v.
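A minimal numpy sketch of this low-frequency attention follows, with queries from the original feature map and keys/values from its DCT spectrum, as the text describes. Treating each pixel as a 1-dimensional token and using 4-dimensional projections are illustrative assumptions, not the patent's actual configuration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dct2(x):
    """Orthonormal 2D DCT-II via separable basis matrices."""
    def basis(n):
        k, i = np.arange(n)[:, None], np.arange(n)[None, :]
        M = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
        M[0] *= np.sqrt(1.0 / n)
        M[1:] *= np.sqrt(2.0 / n)
        return M
    return basis(x.shape[0]) @ x @ basis(x.shape[1]).T

def liam_attention(X, Wq, Wk, Wv):
    """Queries from the original feature map X; keys and values from its
    DCT spectrum, following Q = XWq, K = DCT(X)Wk, V = DCT(X)Wv."""
    tok = X.reshape(-1, 1)            # pixels as 1-dim tokens (illustrative)
    spec = dct2(X).reshape(-1, 1)     # flattened DCT spectrum
    Q, K, V = tok @ Wq, spec @ Wk, spec @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # standard scaled dot-product
    return A @ V, A

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 8))
out, A = liam_attention(X, *[rng.normal(0, 0.5, (1, 4)) for _ in range(3)])
```

Each attention row sums to one, so every output token is a convex combination of the DCT-domain value vectors.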
The final high-frequency and low-frequency information attention vectors are obtained as:
HLA(X)=concat(HIAM(X),LIAM(X))
where concat() represents the concatenation of the high-frequency information attention vector and the low-frequency information attention vector.
The multi-scale features extracted by ViT are fed into the HIAM and LIAM modules to obtain the high-frequency and low-frequency attention vectors. The visible and infrared multi-scale features are multiplied with the attention vectors to yield the enhanced multispectral fusion features:
y_i = HLA(x_i) * x_i
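As a toy illustration of HLA(X) = concat(HIAM(X), LIAM(X)) followed by y_i = HLA(x_i) * x_i, the sketch below uses hypothetical per-channel stand-ins for the HIAM and LIAM outputs (the real modules are attention networks, not these heuristics). The pairing of the high-frequency half of the attention vector with the visible features and the low-frequency half with the infrared features is our assumption, made so that the cascaded dimensions line up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hiam_vec(X):
    """Hypothetical stand-in for the HIAM output: one weight per channel
    from local high-frequency energy (difference to the 4-neighbour mean)."""
    nb = 0.25 * (np.roll(X, 1, 1) + np.roll(X, -1, 1)
                 + np.roll(X, 1, 2) + np.roll(X, -1, 2))
    return sigmoid(np.abs(X - nb).mean(axis=(1, 2)))

def liam_vec(X):
    """Hypothetical stand-in for the LIAM output: one weight per channel
    from the low-frequency (channel-mean) component."""
    return sigmoid(X.mean(axis=(1, 2)))

def hla_enhance(feat_vis, feat_ir):
    """y = HLA(x) * x: cascade the visible and infrared features, cascade
    the high- and low-frequency attention vectors, and scale channel-wise."""
    x = np.concatenate([feat_vis, feat_ir], axis=0)
    att = np.concatenate([hiam_vec(feat_vis), liam_vec(feat_ir)])
    return att[:, None, None] * x

rng = np.random.default_rng(4)
vis, ir = rng.normal(size=(6, 8, 8)), rng.normal(size=(6, 8, 8))
y = hla_enhance(vis, ir)
```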
In some embodiments, in the step S3, the multi-scale upsampling process is performed using a cross-layer densely connected structure comprising a number of basic decoding units, each of which consists of two 3×3 convolutional layers. In some alternative embodiments, the multi-scale upsampling process is performed using the nested structure of a UNet++ network, which likewise contains several basic decoding units of two 3×3 convolutional layers each.
Specifically, as shown in fig. 2, the nested structure of the UNet++ network is adopted to fully retain multi-scale information. Consider first the basic decoding unit (Decoding Unit, DU), which consists mainly of two 3×3 convolutional layers and is defined as:
v = ReLU(Conv_{3×3}(u))
w = ReLU(Conv_{3×3}(v))
where u is the input feature, v ∈ R^(32×H×W) is the output of the first layer, and w is the output of the DU. First, the two input images are fed separately into the encoder network to obtain multi-scale depth features. For each scale, the proposed fusion strategy is used to fuse the corresponding features. Finally, a decoder network based on nested connections reconstructs the fusion image from the fused multi-scale depth features, yielding the final multispectral fusion image.
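The DU above can be sketched with a naive "same"-padded 3×3 convolution in numpy. The 32-channel width follows the definition above; the random weights and the 4-channel input are illustrative only, and a real implementation would use an optimized convolution.

```python
import numpy as np

def conv3x3(x, w, b):
    """Naive 3x3 'same' convolution: x is (C_in, H, W), w is
    (C_out, C_in, 3, 3), b is (C_out,)."""
    C_in, H, W = x.shape
    C_out = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero padding keeps H, W
    out = np.zeros((C_out, H, W))
    for o in range(C_out):
        for c in range(C_in):
            for di in range(3):
                for dj in range(3):
                    out[o] += w[o, c, di, dj] * xp[c, di:di + H, dj:dj + W]
        out[o] += b[o]
    return out

def decoding_unit(u, params):
    """DU: v = ReLU(Conv3x3(u)); w = ReLU(Conv3x3(v))."""
    w1, b1, w2, b2 = params
    v = np.maximum(conv3x3(u, w1, b1), 0.0)
    return np.maximum(conv3x3(v, w2, b2), 0.0)

rng = np.random.default_rng(2)
u = rng.normal(size=(4, 8, 8))   # (C, H, W) input feature, toy sizes
params = (rng.normal(0, 0.1, (32, 4, 3, 3)), np.zeros(32),
          rng.normal(0, 0.1, (32, 32, 3, 3)), np.zeros(32))
out = decoding_unit(u, params)
```

The output preserves the spatial size (the role of the "same" padding) and is nonnegative after the final ReLU.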
The invention discloses a multispectral image fusion system based on frequency domain enhancement in a degradation environment. The system comprises: a first processing unit configured to: take a visible light image and an infrared image as the multispectral images to be fused, and extract multi-scale features of the visible light image and of the infrared image respectively with a feature encoder, so as to obtain an original feature map formed by the multi-scale features of the visible light image and the multi-scale features of the infrared image; a second processing unit configured to: extract a high-frequency information attention vector and a low-frequency information attention vector in the original feature map with a high-frequency information attention module and a low-frequency information attention module respectively, and obtain the attention vector of the original feature map by concatenating the high-frequency information attention vector and the low-frequency information attention vector; a third processing unit configured to: concatenate the multi-scale features of the visible light image and the multi-scale features of the infrared image, multiply them with the attention vector of the original feature map to obtain the multispectral fusion feature, and perform multi-scale upsampling on the multispectral fusion feature to obtain the multispectral fusion image.
The system according to the second aspect, wherein the feature encoder is a vision Transformer based on a multi-head self-attention architecture, comprising a plurality of self-attention layers and a fully-connected layer; wherein each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping as its output; the outputs of the self-attention layers are concatenated and further subjected to normalization processing and a fully-connected layer with a GELU activation function to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image, so as to obtain the original feature map.
The system according to the second aspect, wherein the high-frequency information attention module extracts the high-frequency information attention vector in the original feature map using a 3×3 window, the high-frequency information attention vector characterizing line and shape information in the original feature map; and the low-frequency information attention module extracts the low-frequency signal in the original feature map using a two-dimensional discrete cosine transform and captures the low-frequency information attention vector from the low-frequency signal based on a standard attention mechanism, the low-frequency information attention vector characterizing texture and color information in the original feature map.
The system according to the second aspect, wherein the multi-scale upsampling process is performed using a cross-layer densely connected structure comprising a number of basic decoding units, each of the basic decoding units comprising two 3×3 convolutional layers.
A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, the memory storing a computer program; the processor, when executing the computer program, implements the steps in the multispectral image fusion method based on frequency domain enhancement in a degradation environment according to the first aspect of the disclosure.
Fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 3, the electronic device includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory; the nonvolatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The communication interface of the electronic device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, near field communication (NFC), or other technologies. The display screen of the electronic device can be a liquid crystal display or an electronic ink display; the input device can be a touch layer covering the display screen, keys, a trackball, or a touchpad arranged on the housing of the electronic device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 3 is merely a diagram of the portion relevant to the technical solution of the present disclosure and does not limit the electronic device to which the present application is applied; a specific electronic device may include more or fewer components than shown, combine some components, or arrange the components differently.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer readable storage medium stores a computer program, which when executed by a processor, implements the steps in a multispectral image fusion method based on frequency domain enhancement in a degradation environment according to the first aspect of the disclosure.
In summary, the technical scheme provided by the invention uses the state-of-the-art Vision Transformer (ViT) deep learning model as a backbone network to extract multi-scale multispectral features, designs high-frequency and low-frequency modules to extract frequency information in the spectrum, and fully utilizes deep features through a nested connection architecture, retaining more information from the different-scale features extracted by the encoder network. Finally, high-precision and robust multispectral image fusion is realized.
Note that the technical features of the above embodiments may be combined arbitrarily; for brevity, not all possible combinations are described, but as long as a combination of technical features is free of contradiction, it should be regarded as within the scope of this description. The above examples merely represent a few embodiments of the present application and are described in some detail, but they are not therefore to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the spirit of the present application, all of which fall within the protection scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (10)
1. The multispectral image fusion method based on frequency domain enhancement in a degradation environment is characterized by comprising the following steps of:
step S1, taking a visible light image and an infrared image as multispectral images to be fused, and respectively extracting multiscale characteristics of the visible light image and the infrared image by utilizing a characteristic encoder so as to obtain an original characteristic diagram formed by the multiscale characteristics of the visible light image and the multiscale characteristics of the infrared image;
s2, extracting a high-frequency information attention vector and a low-frequency information attention vector in the original feature map based on a high-frequency information attention module and a low-frequency information attention module respectively, and obtaining the attention vector of the original feature map by cascading the high-frequency information attention vector and the low-frequency information attention vector;
and S3, cascading the multi-scale features of the visible light image and the multi-scale features of the infrared image, multiplying the cascaded features with the attention vector of the original feature map to obtain the multispectral fusion feature, and performing multi-scale up-sampling processing on the multispectral fusion feature to obtain the multispectral fusion image.
2. The method according to claim 1, wherein in the step S1, the feature encoder is a vision Transformer based on a multi-head self-attention architecture, and comprises a plurality of self-attention layers and a fully-connected layer; wherein:
each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping as its output; the outputs of the self-attention layers are concatenated and further subjected to normalization processing and a fully-connected layer with a GELU activation function to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image, so as to obtain the original feature map.
3. The multispectral image fusion method based on frequency domain enhancement in a degradation environment according to claim 2, wherein in the step S2:
the high-frequency information attention module extracts the high-frequency information attention vector in the original feature map using a 3×3 window, wherein the high-frequency information attention vector characterizes line and shape information in the original feature map;
the low frequency information attention module extracts a low frequency signal in the original feature map by utilizing a two-dimensional discrete cosine transform, and captures the low frequency information attention vector from the low frequency signal based on a standard attention mechanism, wherein the low frequency information attention vector characterizes texture and color information in the original feature map.
4. The multispectral image fusion method based on frequency domain enhancement in a degradation environment according to claim 3, wherein in the step S3, the multi-scale up-sampling process is performed using a cross-layer densely connected structure comprising a number of basic decoding units, each of the basic decoding units comprising two 3×3 convolution layers.
5. A multispectral image fusion system based on frequency domain enhancement in a degraded environment, the system comprising:
a first processing unit configured to: taking a visible light image and an infrared image as multispectral images to be fused, and respectively extracting multiscale characteristics of the visible light image and the infrared image by utilizing a characteristic encoder so as to obtain an original characteristic diagram formed by the multiscale characteristics of the visible light image and the multiscale characteristics of the infrared image;
a second processing unit configured to: extracting a high-frequency information attention vector and a low-frequency information attention vector in the original feature map based on a high-frequency information attention module and a low-frequency information attention module respectively, and obtaining the attention vector of the original feature map by cascading the high-frequency information attention vector and the low-frequency information attention vector;
a third processing unit configured to: and carrying out cascade connection on the multi-scale features of the visible light image and the multi-scale features of the infrared image, multiplying the multi-scale features with the attention vector of the original feature image to obtain a multi-spectrum fusion feature, and carrying out multi-scale up-sampling processing on the multi-spectrum fusion feature to obtain a multi-spectrum fusion image.
6. The system of claim 5, wherein the feature encoder is a vision Transformer based on a multi-head self-attention architecture, comprising a plurality of self-attention layers and a fully-connected layer; wherein:
each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping as its output; the outputs of the self-attention layers are concatenated and further subjected to normalization processing and a fully-connected layer with a GELU activation function to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image, so as to obtain the original feature map.
7. The system for frequency domain enhancement based multi-spectral image fusion in a degraded environment of claim 6, wherein:
the high-frequency information attention module extracts the high-frequency information attention vector in the original feature map using a 3×3 window, wherein the high-frequency information attention vector characterizes line and shape information in the original feature map;
the low frequency information attention module extracts a low frequency signal in the original feature map by utilizing a two-dimensional discrete cosine transform, and captures the low frequency information attention vector from the low frequency signal based on a standard attention mechanism, wherein the low frequency information attention vector characterizes texture and color information in the original feature map.
8. The frequency domain enhancement based multispectral image fusion system according to claim 7, wherein the multi-scale upsampling process is performed using a cross-layer densely connected structure comprising a number of basic decoding units, each of the basic decoding units comprising two 3×3 convolution layers.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps in the multispectral image fusion method based on frequency domain enhancement in a degradation environment according to any of claims 1-4.
10. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, the computer program, when executed by a processor, implementing the steps in a multi-spectral image fusion method based on frequency domain enhancement in a degradation environment according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310311387.1A CN116452930A (en) | 2023-03-28 | 2023-03-28 | Multispectral image fusion method and multispectral image fusion system based on frequency domain enhancement in degradation environment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310311387.1A CN116452930A (en) | 2023-03-28 | 2023-03-28 | Multispectral image fusion method and multispectral image fusion system based on frequency domain enhancement in degradation environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116452930A true CN116452930A (en) | 2023-07-18 |
Family
ID=87119415
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310311387.1A Pending CN116452930A (en) | 2023-03-28 | 2023-03-28 | Multispectral image fusion method and multispectral image fusion system based on frequency domain enhancement in degradation environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116452930A (en) |
-
2023
- 2023-03-28 CN CN202310311387.1A patent/CN116452930A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117314757A (en) * | 2023-11-30 | 2023-12-29 | 湖南大学 | Space spectrum frequency multi-domain fused hyperspectral computed imaging method, system and medium |
CN117314757B (en) * | 2023-11-30 | 2024-02-09 | 湖南大学 | Space spectrum frequency multi-domain fused hyperspectral computed imaging method, system and medium |
CN117893871A (en) * | 2024-03-14 | 2024-04-16 | 深圳市日多实业发展有限公司 | Spectrum segment fusion method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||