CN116452930A - Multispectral image fusion method and multispectral image fusion system based on frequency domain enhancement in degradation environment - Google Patents
- Publication number
- CN116452930A (application number CN202310311387.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- attention
- frequency information
- fusion
- visible light
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
- G06V10/54—Extraction of image or video features relating to texture
- G06V10/56—Extraction of image or video features relating to colour
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a multispectral image fusion method and system based on frequency domain enhancement in a degradation environment, relating to the technical field of image processing. According to the invention, a ViT backbone network is used as the feature encoder to extract multi-scale multispectral features, and high-frequency and low-frequency modules are designed that improve the self-attention structure in the ViT model so that frequency information in the multispectral images is captured. Deep features are fully utilized through a nested connection architecture, the different-scale information extracted by the encoder network is further fused, and a high-resolution fused image is output through layer-by-layer up-sampling. As a result, the complementary information of the multispectral images is fully utilized in a degradation environment, background clutter is effectively suppressed, target characteristics are markedly enhanced, and high-quality, reliable image data are provided for downstream tasks such as target detection, tracking, and segmentation.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a multispectral image fusion method and system based on frequency domain enhancement in a degradation environment.
Background
Multispectral image fusion is an important technique in image processing that aims to generate a single image containing the significant features and complementary information of the source images, by using appropriate feature extraction methods and fusion strategies. The most advanced fusion algorithms are now widely used in many applications, such as autonomous vehicles, visual tracking, and intelligent security.
Fusion algorithms can be broadly divided into two categories: traditional methods and deep learning-based methods. Most conventional methods are based on signal processing operators that have achieved good performance. In recent years, deep learning-based approaches have shown tremendous potential in image fusion tasks and are believed to offer better performance than traditional algorithms.
Conventional methods generally fall into two groups: multi-scale methods and methods based on sparse and low-rank representation (LRR) learning.
Multi-scale methods typically decompose the source image into different scales to extract features, fuse the features at each scale using an appropriate fusion strategy, and then reconstruct the fused image with the inverse operator. Although these methods exhibit good fusion performance, their results depend heavily on the chosen multi-scale decomposition.
Before deep learning-based fusion methods were developed, sparse representation (SR) and LRR attracted considerable attention. Methods based on joint sparse representation (JSR) extract common information and complementary features from the source images.
One multi-focus image fusion method based on LRR and dictionary learning first divides the source image into image blocks and classifies each block using histogram of oriented gradients (HOG) features; a global dictionary is then learned by K-singular value decomposition (K-SVD). Many other methods combine SR with further operators, such as pulse-coupled neural networks (PCNNs) and shearlet transforms.
Conventional fusion methods have the following disadvantages: their efficiency depends heavily on structural operators such as dictionary learning; and when the input images are complex, they generalize poorly, leading to degraded fusion performance.
To address these shortcomings, many deep learning-based fusion approaches have been proposed over the past few years. Because a model can be trained specifically for the image fusion task to obtain better fusion performance, the latest deep learning methods follow this training strategy. In the field of infrared and visible light image fusion, dense-block and auto-encoder architectures have been adopted to fuse multispectral data; however, without downsampling operations these deep learning methods cannot extract multi-scale features, so deep multispectral features are not fully exploited, and a well-designed fusion module for fusing multi-scale multispectral deep features is lacking.
In addition, conventional methods fuse multispectral images using hand-crafted low-level features, so they often fail in complex scenes. Some approaches attempt to design texture enhancement modules or employ attention mechanisms to direct the model toward complementary regions in the multispectral images. All of these recent deep learning-based methods share a common trait: however sophisticated the technique, they enhance information only in the image (spatial) domain, and make no effective use of frequency domain information.
Therefore, in order to better model the spectrum-invariant and spectrum-specific information in RGB-T images, a solution is needed that enables robust multispectral image fusion.
Disclosure of Invention
In recent years, the progress and development of sensor technology provides more spectrum data for target perception in a degraded environment, and in order to effectively utilize the characteristics of multispectral images and solve the technical problems in the prior art, the invention provides a multispectral image fusion scheme based on frequency domain enhancement in the degraded environment; the scheme combines the high-frequency local detail information and the low-frequency global structure information of the image, effectively suppresses background clutter under the degradation condition, enhances the useful target characteristics, and provides high-quality and reliable image data for the downstream specific target recognition tasks such as target detection, tracking and segmentation.
The invention discloses a multispectral image fusion method based on frequency domain enhancement in a degradation environment. The method comprises the following steps: step S1, taking a visible light image and an infrared image as multispectral images to be fused, and respectively extracting multiscale characteristics of the visible light image and the infrared image by utilizing a characteristic encoder so as to obtain an original characteristic diagram formed by the multiscale characteristics of the visible light image and the multiscale characteristics of the infrared image; s2, extracting a high-frequency information attention vector and a low-frequency information attention vector in the original feature map based on a high-frequency information attention module and a low-frequency information attention module respectively, and obtaining the attention vector of the original feature map by cascading the high-frequency information attention vector and the low-frequency information attention vector; and S3, cascading the visible light image multiscale features and the infrared image multiscale features, multiplying the cascaded visible light image multiscale features and the attention vector of the original feature image to obtain multispectral fusion features, and performing multiscale up-sampling processing on the multispectral fusion features to obtain a multispectral fusion image.
According to the method of the first aspect, in the step S1, the feature encoder is a visual Transformer based on a multi-head self-attention architecture, comprising a plurality of self-attention layers and one fully-connected layer; wherein each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping calculation as its output, and the output features of the self-attention layers are cascaded and further subjected to normalization processing and processing by a fully-connected layer with a GELU activation function, so as to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image and thereby the original feature map.
According to the method of the first aspect, in said step S2: the high-frequency information attention module extracts the high-frequency information attention vector in the original feature map by utilizing a 3×3 window, wherein the high-frequency information attention vector represents line and shape information in the original feature map; the low-frequency information attention module extracts a low-frequency signal in the original feature map by utilizing a two-dimensional discrete cosine transform, and captures the low-frequency information attention vector from the low-frequency signal based on a standard attention mechanism, wherein the low-frequency information attention vector characterizes texture and color information in the original feature map.
According to the method of the first aspect, in said step S3, said multi-scale up-sampling process is performed with a cross-layer dense connection structure comprising a number of basic decoding units, each comprising two 3×3 convolutional layers.
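A basic decoding unit of the kind just described (two stacked 3×3 convolutional layers) can be sketched in numpy for a single channel; the zero padding, the ReLU nonlinearity, and the random kernels are illustrative assumptions, not the patent's learned layers:

```python
import numpy as np

def conv3x3(x, kernel):
    """'Same' 2-D convolution of a single-channel map with a 3x3 kernel
    (zero padding), standing in for one learned convolutional layer."""
    h, w = x.shape
    padded = np.pad(x, 1)                      # zero-pad by 1 on each side
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def basic_decoding_unit(x, k1, k2):
    """Two 3x3 convolutions with an assumed ReLU after each, as in one
    basic decoding unit of the cross-layer densely connected decoder."""
    y = np.maximum(conv3x3(x, k1), 0.0)
    return np.maximum(conv3x3(y, k2), 0.0)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8))
out = basic_decoding_unit(feat,
                          rng.standard_normal((3, 3)),
                          rng.standard_normal((3, 3)))
print(out.shape)  # (8, 8): spatial size preserved by 'same' padding
```

The 'same' padding means stacking such units preserves resolution, so the decoder can raise resolution purely through the interleaved up-sampling steps.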
The invention discloses a multispectral image fusion system based on frequency domain enhancement in a degradation environment. The system comprises: a first processing unit configured to: taking a visible light image and an infrared image as multispectral images to be fused, and respectively extracting multiscale characteristics of the visible light image and the infrared image by utilizing a characteristic encoder so as to obtain an original characteristic diagram formed by the multiscale characteristics of the visible light image and the multiscale characteristics of the infrared image; a second processing unit configured to: extracting a high-frequency information attention vector and a low-frequency information attention vector in the original feature map based on a high-frequency information attention module and a low-frequency information attention module respectively, and obtaining the attention vector of the original feature map by cascading the high-frequency information attention vector and the low-frequency information attention vector; a third processing unit configured to: and carrying out cascade connection on the multi-scale features of the visible light image and the multi-scale features of the infrared image, multiplying the multi-scale features with the attention vector of the original feature image to obtain a multi-spectrum fusion feature, and carrying out multi-scale up-sampling processing on the multi-spectrum fusion feature to obtain a multi-spectrum fusion image.
The system according to the second aspect, wherein the feature encoder is a visual Transformer based on a multi-head self-attention architecture, comprising a plurality of self-attention layers and one fully-connected layer; wherein each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping calculation as its output, and the output features of the self-attention layers are cascaded and further subjected to normalization processing and processing by a fully-connected layer with a GELU activation function, so as to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image and thereby the original feature map.
The system according to the second aspect, wherein the high-frequency information attention module extracts the high-frequency information attention vector in the original feature map using a 3×3 window, the high-frequency information attention vector characterizing line and shape information in the original feature map; and the low-frequency information attention module extracts a low-frequency signal in the original feature map by utilizing a two-dimensional discrete cosine transform, and captures the low-frequency information attention vector from the low-frequency signal based on a standard attention mechanism, the low-frequency information attention vector characterizing texture and color information in the original feature map.
The system according to the second aspect, wherein the multi-scale up-sampling process is performed using a cross-layer dense connection structure comprising a number of basic decoding units, each of the basic decoding units comprising two 3×3 convolutional layers.
A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps in the multispectral image fusion method based on frequency domain enhancement in the degradation environment according to the first aspect of the disclosure when executing the computer program.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer readable storage medium stores a computer program, which when executed by a processor, implements the steps in a multispectral image fusion method based on frequency domain enhancement in a degradation environment according to the first aspect of the disclosure.
In summary, the technical scheme provided by the invention utilizes the state-of-the-art Vision Transformer (ViT) deep learning model as a backbone network to extract multi-scale multispectral features, designs high-frequency and low-frequency modules to extract frequency information in the spectrum, fully utilizes deep features through a nested connection architecture, and retains more information from the different-scale features extracted by the encoder network. Finally, high-precision and robust multispectral image fusion is realized.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are some embodiments of the invention and that other drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a multispectral image fusion method based on frequency domain enhancement in a degradation environment according to an embodiment of the invention;
fig. 2 is a flow diagram of performing a multi-scale upsampling process using the nested structure of a UNet++ network according to an embodiment of the present invention;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a multispectral image fusion scheme based on frequency domain enhancement in a degradation environment. According to the invention, a ViT backbone network is used as the feature encoder to extract multi-scale multispectral features, and high-frequency and low-frequency modules are designed that improve the self-attention structure in the ViT model so that frequency information in the multispectral images is captured. Deep features are fully utilized through a nested connection architecture, the different-scale information extracted by the encoder network is further fused, and a high-resolution fused image is output through layer-by-layer up-sampling. As a result, the complementary information of the multispectral images is fully utilized in a degradation environment, background clutter is effectively suppressed, target characteristics are markedly enhanced, and high-quality, reliable image data are provided for downstream tasks such as target detection, tracking and segmentation.
The invention discloses a multispectral image fusion method based on frequency domain enhancement in a degradation environment. Fig. 1 is a flow chart of a multispectral image fusion method based on frequency domain enhancement in a degradation environment according to an embodiment of the invention; as shown in connection with fig. 1, the method comprises: step S1, taking a visible light image and an infrared image as multispectral images to be fused, and respectively extracting multiscale characteristics of the visible light image and the infrared image by utilizing a characteristic encoder so as to obtain an original characteristic diagram formed by the multiscale characteristics of the visible light image and the multiscale characteristics of the infrared image; s2, extracting a high-frequency information attention vector and a low-frequency information attention vector in the original feature map based on a high-frequency information attention module and a low-frequency information attention module respectively, and obtaining the attention vector of the original feature map by cascading the high-frequency information attention vector and the low-frequency information attention vector; and S3, cascading the visible light image multiscale features and the infrared image multiscale features, multiplying the cascaded visible light image multiscale features and the attention vector of the original feature image to obtain multispectral fusion features, and performing multiscale up-sampling processing on the multispectral fusion features to obtain a multispectral fusion image.
Specifically, by enhancing the frequency domain characteristics, the method effectively combines the high-frequency local detail information and the low-frequency global structure information of the image and realizes high-quality, robust fusion of multispectral images. In a degradation environment, the background clutter of an image is large and the effective target information is interfered with and weakened. To effectively utilize the characteristics of the multispectral images, a ViT backbone network is used as the feature encoder to extract multispectral features, and high-frequency and low-frequency modules are designed that improve the self-attention structure in the ViT model and extract the frequency information in the multispectral images. Deep features are fully utilized through a nested connection architecture, the different-scale information extracted by the encoder network is further fused, and a high-resolution fused image is output through layer-by-layer up-sampling. Finally, background clutter is effectively suppressed in a degradation environment, target characteristics are markedly enhanced, and high-quality, reliable target image data are provided for downstream target sensing tasks such as target detection, tracking and segmentation.
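In miniature, steps S1–S3 reduce to: concatenate the visible and infrared feature stacks along channels, scale the channels by an attention vector, and up-sample. The numpy sketch below illustrates only these shape-level operations; the tensor sizes, the random attention vector, and the nearest-neighbour up-sampling are illustrative assumptions standing in for the learned encoder, attention modules, and decoder:

```python
import numpy as np

def fuse_multispectral(vis_feat, ir_feat, attn, scale=2):
    """Steps S1-S3 in miniature: concatenate visible-light and infrared
    feature maps along channels (S3 cascade), weight them by a per-channel
    attention vector (product with the attention of S2), then up-sample
    (nearest-neighbour, standing in for the multi-scale decoder)."""
    stacked = np.concatenate([vis_feat, ir_feat], axis=0)   # (2C, H, W)
    weighted = stacked * attn[:, None, None]                # channel-wise scaling
    return weighted.repeat(scale, axis=1).repeat(scale, axis=2)

rng = np.random.default_rng(1)
vis = rng.standard_normal((4, 16, 16))   # assumed C=4 visible-light features
ir = rng.standard_normal((4, 16, 16))    # assumed infrared features
attn = rng.uniform(size=8)               # attention vector for 2C = 8 channels
fused = fuse_multispectral(vis, ir, attn)
print(fused.shape)  # (8, 32, 32)
```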
In some embodiments, in the step S1, the feature encoder is a visual Transformer based on a multi-head self-attention architecture, including a plurality of self-attention layers and one fully-connected layer; wherein each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping calculation as its output, and the output features of the self-attention layers are cascaded and further subjected to normalization processing and processing by a fully-connected layer with a GELU activation function, so as to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image and thereby the original feature map.
Specifically, the Transformer architecture was originally proposed for sequence-to-sequence learning, such as machine translation. Owing to its efficiency, the Transformer has since become the model of choice for various natural language processing tasks. In the field of computer vision, self-attention (SA) is used instead of convolution: the visual Transformer (Vision Transformer, ViT) extracts patches from an image and feeds them into a Transformer encoder to obtain a global representation, which is finally transformed for classification. Models based on the Transformer architecture show better scalability than CNNs; that is, ViT performs significantly better than ResNet models when larger models are trained on larger data sets. Transformer-based models are becoming a powerful backbone network in the field of computer vision.
The Transformer is based on multi-head self-attention (MSA), which captures long-term relationships between tokens at different locations. Specifically, let the feature X ∈ R^{C×H×W} be reshaped into the input sequence X ∈ R^{N×D} of the standard MSA layer, where N is the length of the input sequence and D is the hidden dimension. Each self-attention head computes query, key, and value matrices using linear transformations of X:

Q = XW_q, K = XW_k, V = XW_v,

wherein W_q, W_k, and W_v are learnable parameters, all of dimension R^{D×D_h}, and D_h is the hidden size of each head. The output of the SA is then a weighted sum of the N value vectors:

Attention(X) = softmax(QK^T / √D_h) V.

For an MSA layer with N_h heads, the final output is obtained by a linear mapping of the cascaded outputs of the individual SA heads:

MSA(X) = concat[Attention_1(X), …, Attention_{N_h}(X)] W_o,

wherein W_o ∈ R^{(N_h·D_h)×D} is a learnable parameter. A Transformer block thus comprises an MSA layer and a fully-connected layer, which can be expressed as:

X′ = MSA(LN(X)) + X,
X″ = FC(LN(X′)) + X′,

where LN represents a layer normalization (LayerNorm) operation and FC represents a fully-connected layer with a GELU activation function.
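The MSA and Transformer-block computations above can be checked with a small numpy sketch using random weights and two heads; the 0.1 weight scaling and the tanh approximation of GELU are illustrative choices, not part of the patent:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def attention(X, Wq, Wk, Wv):
    """One self-attention head: softmax(QK^T / sqrt(D_h)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def transformer_block(X, heads, Wo, W1, W2):
    """MSA with residual connection, then a GELU fully-connected layer
    with residual, each branch preceded by LayerNorm."""
    msa = np.concatenate([attention(layer_norm(X), *h) for h in heads],
                         axis=-1) @ Wo
    X = X + msa                                  # X' = MSA(LN(X)) + X
    return X + gelu(layer_norm(X) @ W1) @ W2     # X'' = FC(LN(X')) + X'

rng = np.random.default_rng(0)
N, D, Dh = 6, 8, 4                    # sequence length, hidden dim, head dim
heads = [tuple(rng.standard_normal((D, Dh)) * 0.1 for _ in range(3))
         for _ in range(2)]           # two heads, each with (Wq, Wk, Wv)
Wo = rng.standard_normal((2 * Dh, D)) * 0.1
W1 = rng.standard_normal((D, 2 * D)) * 0.1
W2 = rng.standard_normal((2 * D, D)) * 0.1
Y = transformer_block(rng.standard_normal((N, D)), heads, Wo, W1, W2)
print(Y.shape)  # (6, 8): the block preserves the sequence shape
```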
The ViT backbone is divided into 4 stages to generate pyramid feature maps for dense prediction tasks. For the multispectral inputs R and T, multi-scale features Φ_R^i and Φ_T^i (i = 1, …, 4) are obtained via the ViT multi-scale feature extractor E, defined as Φ_R^i = E^i(R) and Φ_T^i = E^i(T).
in some embodiments, in said step S2: the high-frequency information attention module extracts high-frequency information attention vectors in the original feature map by utilizing a window of 3*3, wherein the high-frequency information attention vectors represent lines and shape information in the original feature map; the low frequency information attention module extracts a low frequency signal in the original feature map by utilizing a two-dimensional discrete cosine transform, and captures the low frequency information attention vector from the low frequency signal based on a standard attention mechanism, wherein the low frequency information attention vector characterizes texture and color information in the original feature map.
Specifically, the two-dimensional discrete cosine transform (Discrete Cosine Transform, DCT) is

F_{h,w} = Σ_{i=1}^{H} Σ_{j=1}^{W} x_{i,j} cos(π(h−1)(2i−1)/(2H)) cos(π(w−1)(2j−1)/(2W)),

where h ∈ {1, 2, 3, …, H}, w ∈ {1, 2, 3, …, W}, and F ∈ R^(H×W) is the two-dimensional DCT spectrum; x ∈ R^(H×W) is the input feature, and H and W represent the height and width of x, respectively (normalization constants are omitted).
Specifically, the inverse DCT transform is

x_{i,j} = Σ_{h=1}^{H} Σ_{w=1}^{W} F_{h,w} cos(π(h−1)(2i−1)/(2H)) cos(π(w−1)(2j−1)/(2W)),

where i ∈ {1, 2, 3, …, H} and j ∈ {1, 2, 3, …, W}, again up to normalization constants.
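The DCT pair can be sketched in numpy. This is the orthonormal (type-II) variant, so it carries the normalization constants that the patent's formulas omit; note that the lowest coefficient is proportional to the feature mean, which is the connection between the DCT spectrum and global average pooling made later in the text.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix: M[k, i] = s_k * cos(pi*k*(2i+1)/(2n)),
    with s_0 = sqrt(1/n) and s_k = sqrt(2/n) for k > 0."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def dct2(x):
    """Two-dimensional DCT, applied separably over rows and columns."""
    return dct_matrix(x.shape[0]) @ x @ dct_matrix(x.shape[1]).T

def idct2(F):
    """Inverse 2D DCT: the basis is orthonormal, so inversion is transposition."""
    return dct_matrix(F.shape[0]).T @ F @ dct_matrix(F.shape[1])

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # toy feature map of height 8, width 16
F = dct2(x)                    # 2D DCT spectrum
```

The round trip idct2(dct2(x)) recovers x exactly up to floating-point error, and F[0, 0] equals the mean of x scaled by √(H·W).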
In particular, the channel attention mechanism is widely used in CNNs. It uses a scalar to represent and evaluate the importance of each channel. Suppose X ∈ R^(C×H×W) is an image feature tensor in the network, where C is the number of channels, H is the feature height, and W is the feature width. The present invention treats the scalar representation in channel attention as a compression problem, since an entire channel must be represented by a single scalar. The attention mechanism can thus be written as:
Att = σ(f1(f2(X)))
where Att is the channel attention vector, σ is the sigmoid function, f1 is a mapping function (either a fully-connected layer or a one-dimensional convolution may be chosen), and f2 is a compression operation realizing R^(C×H×W) → R^C. After the attention vector Att for all C channels is obtained, each channel feature map of the input X is scaled by the corresponding attention value:
X̃_{i,:,:} = Att_i · X_{i,:,:},

where X̃ is the output of the attention mechanism, Att_i is the i-th element of the attention vector, and X_{i,:,:} is the i-th channel of the input feature. In general, global average pooling (Global Average Pooling, GAP) is a common choice for channel compression due to its simplicity and efficiency. Other compression methods, such as global max pooling (Global Max Pooling, GMP) and global standard deviation pooling (Global Standard Deviation Pooling, GSDP), can also realize channel compression.
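The compress-then-scale pipeline above can be sketched in numpy, with GAP, GMP, and GSDP as interchangeable choices for f2 and a single fully-connected layer (one of the options the text names) for f1. The weight matrix is random here, purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(X, W, compress="gap"):
    """Att = sigma(f1(f2(X))): f2 compresses each channel of X in
    R^{C,H,W} to one scalar (GAP/GMP/GSDP); f1 is a learnable mapping,
    here a single fully-connected layer W in R^{C,C}."""
    if compress == "gap":
        z = X.mean(axis=(1, 2))   # global average pooling
    elif compress == "gmp":
        z = X.max(axis=(1, 2))    # global max pooling
    else:                         # "gsdp"
        z = X.std(axis=(1, 2))    # global standard-deviation pooling
    att = sigmoid(z @ W)          # attention vector, one scalar per channel
    return att[:, None, None] * X, att  # scale each channel map by its weight

rng = np.random.default_rng(1)
C, H, W_ = 8, 16, 16
X = rng.normal(size=(C, H, W_))
Y, att = channel_attention(X, rng.normal(0, 0.1, (C, C)))
```

Each output channel is exactly the input channel multiplied by its attention scalar, matching the per-channel scaling formula above.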
It is known that using GAP in the channel attention mechanism retains only the lowest-frequency information: active components at other frequencies are discarded, even though those frequency components also contain information useful for channel learning. Therefore, to compress the channels better and introduce more information, the present invention extends GAP to more frequency components of the 2D DCT.
Natural images contain rich frequencies: high frequencies capture local details of objects (e.g., lines and shapes), while low frequencies encode global structures (e.g., textures and colors). However, the global self-attention in a typical MSA layer does not take the characteristics of these different frequencies into account. For this purpose, the invention processes the high-frequency and the low-frequency information in the feature map separately at the attention layer. High-frequency attention focuses on the local details of objects, so global attention over the whole feature map is unnecessary; this effectively reduces computational complexity and improves efficiency. The high-frequency information attention module (High-Frequency Information Attention Module, HIAM) uses 3×3 windows with local window self-attention to capture fine-grained high-frequency information, which saves substantial computing resources. The global attention in MSA helps capture low-frequency information, but applying MSA directly to high-resolution feature maps incurs a significant computational cost. The low-frequency information attention module (Low-Frequency Information Attention Module, LIAM) therefore first applies a two-dimensional DCT to each window to obtain the low-frequency signal in the input X. The DCT feature map is then mapped to the key K and the value V, while the query vector Q in LIAM still comes from the original feature map X. A standard attention mechanism is then applied to capture the rich low-frequency information in the feature map:
Q = XW_q,
K = DCT(X)W_k,
V = DCT(X)W_v.
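A minimal numpy sketch of this low-frequency attention follows, with queries from the original feature map and keys/values from its DCT spectrum, as the text describes. Treating each pixel as a 1-dimensional token and using 4-dimensional projections are illustrative assumptions, not the patent's actual configuration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dct2(x):
    """Orthonormal 2D DCT-II via separable basis matrices."""
    def basis(n):
        k, i = np.arange(n)[:, None], np.arange(n)[None, :]
        M = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
        M[0] *= np.sqrt(1.0 / n)
        M[1:] *= np.sqrt(2.0 / n)
        return M
    return basis(x.shape[0]) @ x @ basis(x.shape[1]).T

def liam_attention(X, Wq, Wk, Wv):
    """Queries from the original feature map X; keys and values from its
    DCT spectrum, following Q = XWq, K = DCT(X)Wk, V = DCT(X)Wv."""
    tok = X.reshape(-1, 1)            # pixels as 1-dim tokens (illustrative)
    spec = dct2(X).reshape(-1, 1)     # flattened DCT spectrum
    Q, K, V = tok @ Wq, spec @ Wk, spec @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # standard scaled dot-product
    return A @ V, A

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 8))
out, A = liam_attention(X, *[rng.normal(0, 0.5, (1, 4)) for _ in range(3)])
```

Each attention row sums to one, so every output token is a convex combination of the DCT-domain value vectors.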
The final high-frequency and low-frequency information attention vectors are obtained as:
HLA(X)=concat(HIAM(X),LIAM(X))
where concat() represents the concatenation of the high-frequency information attention vector and the low-frequency information attention vector.
The multi-scale features extracted by ViT are fed into the HIAM and LIAM modules to obtain the high-frequency and low-frequency attention vectors. The visible and infrared multi-scale features are multiplied with the attention vectors to yield the enhanced multispectral fusion features:
y_i = HLA(x_i) * x_i
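As a toy illustration of HLA(X) = concat(HIAM(X), LIAM(X)) followed by y_i = HLA(x_i) * x_i, the sketch below uses hypothetical per-channel stand-ins for the HIAM and LIAM outputs (the real modules are attention networks, not these heuristics). The pairing of the high-frequency half of the attention vector with the visible features and the low-frequency half with the infrared features is our assumption, made so that the cascaded dimensions line up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hiam_vec(X):
    """Hypothetical stand-in for the HIAM output: one weight per channel
    from local high-frequency energy (difference to the 4-neighbour mean)."""
    nb = 0.25 * (np.roll(X, 1, 1) + np.roll(X, -1, 1)
                 + np.roll(X, 1, 2) + np.roll(X, -1, 2))
    return sigmoid(np.abs(X - nb).mean(axis=(1, 2)))

def liam_vec(X):
    """Hypothetical stand-in for the LIAM output: one weight per channel
    from the low-frequency (channel-mean) component."""
    return sigmoid(X.mean(axis=(1, 2)))

def hla_enhance(feat_vis, feat_ir):
    """y = HLA(x) * x: cascade the visible and infrared features, cascade
    the high- and low-frequency attention vectors, and scale channel-wise."""
    x = np.concatenate([feat_vis, feat_ir], axis=0)
    att = np.concatenate([hiam_vec(feat_vis), liam_vec(feat_ir)])
    return att[:, None, None] * x

rng = np.random.default_rng(4)
vis, ir = rng.normal(size=(6, 8, 8)), rng.normal(size=(6, 8, 8))
y = hla_enhance(vis, ir)
```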
In some embodiments, in the step S3, the multi-scale upsampling process is performed using a cross-layer densely connected structure comprising a number of basic decoding units, each of which consists of two 3×3 convolutional layers. In some alternative embodiments, the multi-scale upsampling process is performed using the nested structure of a UNet++ network, which likewise contains several basic decoding units of two 3×3 convolutional layers each.
Specifically, as shown in fig. 2, the nested structure of the UNet++ network is adopted to fully retain multi-scale information. Consider first the basic decoding unit (Decoding Unit, DU), which consists mainly of two 3×3 convolutional layers and is defined as:
v = ReLU(Conv_{3×3}(u))
w = ReLU(Conv_{3×3}(v))
where u is the input feature, v ∈ R^(32×H×W) is the output of the first layer, and w is the output of the DU. First, the two input images are fed separately into the encoder network to obtain multi-scale depth features. For each scale, the proposed fusion strategy is used to fuse the corresponding features. Finally, a decoder network based on nested connections reconstructs the fusion image from the fused multi-scale depth features, yielding the final multispectral fusion image.
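The DU above can be sketched with a naive "same"-padded 3×3 convolution in numpy. The 32-channel width follows the definition above; the random weights and the 4-channel input are illustrative only, and a real implementation would use an optimized convolution.

```python
import numpy as np

def conv3x3(x, w, b):
    """Naive 3x3 'same' convolution: x is (C_in, H, W), w is
    (C_out, C_in, 3, 3), b is (C_out,)."""
    C_in, H, W = x.shape
    C_out = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero padding keeps H, W
    out = np.zeros((C_out, H, W))
    for o in range(C_out):
        for c in range(C_in):
            for di in range(3):
                for dj in range(3):
                    out[o] += w[o, c, di, dj] * xp[c, di:di + H, dj:dj + W]
        out[o] += b[o]
    return out

def decoding_unit(u, params):
    """DU: v = ReLU(Conv3x3(u)); w = ReLU(Conv3x3(v))."""
    w1, b1, w2, b2 = params
    v = np.maximum(conv3x3(u, w1, b1), 0.0)
    return np.maximum(conv3x3(v, w2, b2), 0.0)

rng = np.random.default_rng(2)
u = rng.normal(size=(4, 8, 8))   # (C, H, W) input feature, toy sizes
params = (rng.normal(0, 0.1, (32, 4, 3, 3)), np.zeros(32),
          rng.normal(0, 0.1, (32, 32, 3, 3)), np.zeros(32))
out = decoding_unit(u, params)
```

The output preserves the spatial size (the role of the "same" padding) and is nonnegative after the final ReLU.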
The invention discloses a multispectral image fusion system based on frequency domain enhancement in a degradation environment. The system comprises: a first processing unit configured to: take a visible light image and an infrared image as the multispectral images to be fused, and extract multi-scale features of the visible light image and of the infrared image respectively with a feature encoder, so as to obtain an original feature map formed by the multi-scale features of the visible light image and the multi-scale features of the infrared image; a second processing unit configured to: extract a high-frequency information attention vector and a low-frequency information attention vector in the original feature map with a high-frequency information attention module and a low-frequency information attention module respectively, and obtain the attention vector of the original feature map by concatenating the high-frequency information attention vector and the low-frequency information attention vector; a third processing unit configured to: concatenate the multi-scale features of the visible light image and the multi-scale features of the infrared image, multiply them with the attention vector of the original feature map to obtain the multispectral fusion feature, and perform multi-scale upsampling on the multispectral fusion feature to obtain the multispectral fusion image.
The system according to the second aspect, wherein the feature encoder is a vision Transformer based on a multi-head self-attention architecture, comprising a plurality of self-attention layers and a fully-connected layer; wherein each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping as its output; the outputs of the self-attention layers are concatenated and further subjected to normalization processing and a fully-connected layer with a GELU activation function to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image, so as to obtain the original feature map.
The system according to the second aspect, wherein the high-frequency information attention module extracts the high-frequency information attention vector in the original feature map using a 3×3 window, the high-frequency information attention vector characterizing line and shape information in the original feature map; and the low-frequency information attention module extracts the low-frequency signal in the original feature map using a two-dimensional discrete cosine transform and captures the low-frequency information attention vector from the low-frequency signal based on a standard attention mechanism, the low-frequency information attention vector characterizing texture and color information in the original feature map.
The system according to the second aspect, wherein the multi-scale upsampling process is performed using a cross-layer densely connected structure comprising a number of basic decoding units, each of the basic decoding units comprising two 3×3 convolutional layers.
A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, the memory storing a computer program; the processor, when executing the computer program, implements the steps in the multispectral image fusion method based on frequency domain enhancement in a degradation environment according to the first aspect of the disclosure.
Fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 3, the electronic device includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a nonvolatile storage medium and an internal memory; the nonvolatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for their operation. The communication interface of the electronic device is used for wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, near field communication (NFC), or other technologies. The display screen of the electronic device can be a liquid crystal display or an electronic ink display; the input device can be a touch layer covering the display screen, keys, a trackball, or a touchpad arranged on the housing of the electronic device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 3 is merely a diagram of the portion relevant to the technical solution of the present disclosure and does not limit the electronic device to which the present application is applied; a specific electronic device may include more or fewer components than shown, combine some components, or arrange the components differently.
A fourth aspect of the invention discloses a computer-readable storage medium. The computer readable storage medium stores a computer program, which when executed by a processor, implements the steps in a multispectral image fusion method based on frequency domain enhancement in a degradation environment according to the first aspect of the disclosure.
In summary, the technical scheme provided by the invention uses the state-of-the-art Vision Transformer (ViT) deep learning model as a backbone network to extract multi-scale multispectral features, designs high-frequency and low-frequency modules to extract frequency information in the spectrum, and fully utilizes deep features through a nested connection architecture, retaining more information from the different-scale features extracted by the encoder network. Finally, high-precision and robust multispectral image fusion is realized.
Note that the technical features of the above embodiments may be combined arbitrarily; for brevity, not all possible combinations are described, but as long as a combination of technical features is free of contradiction, it should be regarded as within the scope of this description. The above examples merely represent a few embodiments of the present application and are described in some detail, but they are not therefore to be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the spirit of the present application, all of which fall within the protection scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (10)
1. The multispectral image fusion method based on frequency domain enhancement in a degradation environment is characterized by comprising the following steps of:
step S1, taking a visible light image and an infrared image as multispectral images to be fused, and respectively extracting multiscale characteristics of the visible light image and the infrared image by utilizing a characteristic encoder so as to obtain an original characteristic diagram formed by the multiscale characteristics of the visible light image and the multiscale characteristics of the infrared image;
s2, extracting a high-frequency information attention vector and a low-frequency information attention vector in the original feature map based on a high-frequency information attention module and a low-frequency information attention module respectively, and obtaining the attention vector of the original feature map by cascading the high-frequency information attention vector and the low-frequency information attention vector;
and S3, cascading the multi-scale features of the visible light image and the multi-scale features of the infrared image, multiplying the cascaded features with the attention vector of the original feature map to obtain the multispectral fusion feature, and performing multi-scale up-sampling processing on the multispectral fusion feature to obtain the multispectral fusion image.
2. The method according to claim 1, wherein in the step S1, the feature encoder is a vision Transformer based on a multi-head self-attention architecture, and comprises a plurality of self-attention layers and a fully-connected layer; wherein:
each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping as its output; the outputs of the self-attention layers are concatenated and further subjected to normalization processing and a fully-connected layer with a GELU activation function to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image, so as to obtain the original feature map.
3. The multispectral image fusion method based on frequency domain enhancement in a degradation environment according to claim 2, wherein in the step S2:
the high-frequency information attention module extracts the high-frequency information attention vector in the original feature map using a 3×3 window, wherein the high-frequency information attention vector characterizes line and shape information in the original feature map;
the low frequency information attention module extracts a low frequency signal in the original feature map by utilizing a two-dimensional discrete cosine transform, and captures the low frequency information attention vector from the low frequency signal based on a standard attention mechanism, wherein the low frequency information attention vector characterizes texture and color information in the original feature map.
4. The multispectral image fusion method based on frequency domain enhancement in a degradation environment according to claim 3, wherein in the step S3, the multi-scale up-sampling process is performed using a cross-layer densely connected structure comprising a number of basic decoding units, each of the basic decoding units comprising two 3×3 convolution layers.
5. A multispectral image fusion system based on frequency domain enhancement in a degraded environment, the system comprising:
a first processing unit configured to: taking a visible light image and an infrared image as multispectral images to be fused, and respectively extracting multiscale characteristics of the visible light image and the infrared image by utilizing a characteristic encoder so as to obtain an original characteristic diagram formed by the multiscale characteristics of the visible light image and the multiscale characteristics of the infrared image;
a second processing unit configured to: extracting a high-frequency information attention vector and a low-frequency information attention vector in the original feature map based on a high-frequency information attention module and a low-frequency information attention module respectively, and obtaining the attention vector of the original feature map by cascading the high-frequency information attention vector and the low-frequency information attention vector;
a third processing unit configured to: and carrying out cascade connection on the multi-scale features of the visible light image and the multi-scale features of the infrared image, multiplying the multi-scale features with the attention vector of the original feature image to obtain a multi-spectrum fusion feature, and carrying out multi-scale up-sampling processing on the multi-spectrum fusion feature to obtain a multi-spectrum fusion image.
6. The system of claim 5, wherein the feature encoder is a vision Transformer based on a multi-head self-attention architecture, comprising a plurality of self-attention layers and a fully-connected layer; wherein:
each self-attention layer extracts features of the visible light image and the infrared image at different scales through linear mapping as its output; the outputs of the self-attention layers are concatenated and further subjected to normalization processing and a fully-connected layer with a GELU activation function to obtain the multi-scale features of the visible light image and the multi-scale features of the infrared image, so as to obtain the original feature map.
7. The system for frequency domain enhancement based multi-spectral image fusion in a degraded environment of claim 6, wherein:
the high-frequency information attention module extracts the high-frequency information attention vector in the original feature map using a 3×3 window, wherein the high-frequency information attention vector characterizes line and shape information in the original feature map;
the low frequency information attention module extracts a low frequency signal in the original feature map by utilizing a two-dimensional discrete cosine transform, and captures the low frequency information attention vector from the low frequency signal based on a standard attention mechanism, wherein the low frequency information attention vector characterizes texture and color information in the original feature map.
8. The frequency domain enhancement based multispectral image fusion system according to claim 7, wherein the multi-scale upsampling process is performed using a cross-layer densely connected structure comprising a number of basic decoding units, each of the basic decoding units comprising two 3×3 convolution layers.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps in the multispectral image fusion method based on frequency domain enhancement in a degradation environment according to any of claims 1-4.
10. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, the computer program, when executed by a processor, implementing the steps in a multi-spectral image fusion method based on frequency domain enhancement in a degradation environment according to any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310311387.1A CN116452930A (en) | 2023-03-28 | 2023-03-28 | Multispectral image fusion method and multispectral image fusion system based on frequency domain enhancement in degradation environment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310311387.1A CN116452930A (en) | 2023-03-28 | 2023-03-28 | Multispectral image fusion method and multispectral image fusion system based on frequency domain enhancement in degradation environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116452930A true CN116452930A (en) | 2023-07-18 |
Family
ID=87119415
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310311387.1A Pending CN116452930A (en) | 2023-03-28 | 2023-03-28 | Multispectral image fusion method and multispectral image fusion system based on frequency domain enhancement in degradation environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116452930A (en) |
-
2023
- 2023-03-28 CN CN202310311387.1A patent/CN116452930A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117314757A (en) * | 2023-11-30 | 2023-12-29 | 湖南大学 | Space spectrum frequency multi-domain fused hyperspectral computed imaging method, system and medium |
CN117314757B (en) * | 2023-11-30 | 2024-02-09 | 湖南大学 | Space spectrum frequency multi-domain fused hyperspectral computed imaging method, system and medium |
CN117893871A (en) * | 2024-03-14 | 2024-04-16 | 深圳市日多实业发展有限公司 | Spectrum segment fusion method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||