CN115331112A - Infrared and visible light image fusion method and system based on multi-granularity word elements - Google Patents
- Publication number
- CN115331112A (application CN202211054722.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- visible light
- infrared
- fusion
- granularity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications (all under G—Physics / G06—Computing; calculating or counting / G06V—Image or video recognition or understanding)
- G06V20/10 — Terrestrial scenes (G06V20/00 Scenes; scene-specific elements)
- G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion (G06V10/20 Image preprocessing)
- G06V10/42 — Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation (G06V10/40 Extraction of image or video features)
- G06V10/77 — Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; blind source separation (G06V10/70 Pattern recognition or machine learning)
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
Abstract
The invention provides an infrared and visible light image fusion method and system based on multi-granularity tokens (word elements), comprising the following steps: S1, acquire an infrared image and a visible light image and decompose each at no fewer than 2 different scales; S2, extract multi-granularity token global features by computing the long-range dependencies of the infrared image and the visible light image through no fewer than 2 independent Transformer branches; S3, design a loss function with preset logic to supervise the training of a preset multi-granularity token fusion model; S4, fuse the infrared image and the visible light image through the multi-granularity token fusion module to obtain the multi-granularity token fusion output image. The invention solves the technical problems of poor fusion-model performance, high complexity, and limited feature extraction and representation.
Description
Technical Field
The invention relates to the field of image fusion, in particular to the technical field of multi-type image fusion.
Background
To describe real-world scenes comprehensively, combining multi-source images acquired by different sensors is key to many applications. Infrared and visible image fusion has therefore been widely used for information acquisition and analysis in fields such as military, public security, and smart cities. An infrared sensor captures the thermal radiation emitted by heat sources, highlighting thermal infrared targets; however, it cannot resolve detail or texture in the background, because background objects carry almost identical thermal information. A visible light sensor, by contrast, images reflected light and therefore preserves abundant background texture and detail. Image fusion can thus synthesize a single image combining the advantages of multi-source images, and many fusion methods have been proposed, including conventional methods and deep-learning-based methods. Conventional methods mainly extract features through mathematical transformations and then combine them with a designed fusion strategy; they include multi-scale transformation methods, sparse-representation-based methods, saliency-based methods, subspace-based methods, and other hybrid methods. Conventional fusion methods aim to obtain a satisfactory fused image and, to some extent, satisfy certain applications. However, they still face bottlenecks.
For example, prior invention patent application CN110332934A, "A system and method for robot trajectory tracking under hybrid network fusion", comprises a visual image sensor, a wireless transmitter, a wireless receiver, a visible light source, an optical communication device, an inertial navigation device, and a hybrid data fusion unit. The visual image sensor, wireless transmitter, and visible light source are installed in the robot's movement area; the wireless receiver, optical communication device, and inertial navigation device are mounted on the robot body; and a hybrid data processing unit in the control center resolves and evaluates the trajectory-tracking data. As its specific embodiments show, this prior art fuses heterogeneous multi-source data obtained from multiple sensors, but its visible light image is used only for coarse positioning and is then repeatedly corrected with data gathered by the other multi-source sensors. This single representation ignores the differences between multi-source images and limits the performance of the fusion model; moreover, its fusion is mainly a combination of various data-processing procedures, which raises algorithmic complexity. Furthermore, conventional fusion methods also require intricately designed fusion strategies. Deep learning (DL) has therefore been introduced to address these tasks.
Deep-learning-based methods have nonlinear fitting capability and can model the complex correlations of the source images. DL-based fusion methods can be classified, by fusion framework, into CNN-based methods and generative adversarial network (GAN)-based methods. CNN-based methods use parallel convolution kernels to extract multiple features and a carefully designed loss function to reconstruct the fusion result. The GAN framework can also perform image fusion by setting up an adversarial game that models the distribution of the source images. Typically, Ma et al. first fused infrared and visible images using a GAN and then proposed various GAN-based methods, such as a dual-discriminator GAN and a multi-class-constrained GAN.
For example, prior invention patent application CN114240736A, "Method for simultaneously generating and editing arbitrary face attributes based on VAE and cGAN", builds on an encoder-decoder architecture combining a variational autoencoder (VAE) and a conditional generative adversarial network (cGAN), and develops a bidirectional-feedback generation network that simultaneously generates new faces and edits their attributes. Attribute-classification constraints on the generated image ensure that the specified attributes change correctly, and face images with multiple attributes are generated by sampling attribute codes from a latent space; attribute strengths are modeled to support attribute interpolation and flexible handling of multiple face attributes. Such CNN- or GAN-based prior art extracts image features with convolution operations over a small receptive field, and the uniform convolution operation also limits feature extraction and representation. Moreover, although CNN- or GAN-based methods extract local features through convolution kernels, they cannot capture long-range dependency information. Transformers can therefore be introduced to solve these problems; for example, Li et al. and Vibashan et al. combine a Transformer with a CNN to extract both local features and long-range dependency information from images.
However, existing Transformer-based methods ignore the difference between the attention weights of the infrared and visible tokens at the same position, which degrades fusion performance because the infrared and visible tokens at the same position differ in importance.
In summary, the prior art has the technical problems of poor performance of a fusion model, high complexity and limited feature extraction and representation.
Disclosure of Invention
The invention aims to solve the technical problems of poor performance, high complexity and limited feature extraction and representation of a fusion model in the prior art.
The invention adopts the following technical scheme to solve these problems: an infrared and visible light image fusion method based on multi-granularity tokens, comprising the following steps:
S1, acquire an infrared image and a visible light image and decompose each at no fewer than 2 different scales;
S2, multi-granularity token global feature extraction: compute the long-range dependencies of the infrared image and the visible light image through no fewer than 2 independent Transformer branches, where no fewer than 2 Transformer models are designed within each branch to extract comprehensive multi-scale long-range dependencies; step S2 comprises the following steps:
S21, divide the infrared image and the visible light image into multi-scale sub-region patches;
S22, convert the infrared image and the visible light image into an infrared sequence and a visible light sequence;
S23, embed the infrared sequence and the visible light sequence with a preset linear projection E and add coded position information to each sequence, obtaining an encoded infrared sequence and an encoded visible light sequence;
S24, apply a preset embedding operation to the encoded infrared sequence and the encoded visible light sequence with a fully connected layer to obtain relation-extraction parameters;
S25, process the relation-extraction parameters with the multi-head self-attention mechanism (MSA) using preset logic, extract long-range dependencies from the infrared image and the visible light image, and obtain from these dependencies the multi-head self-attention fusion parameters, namely the infrared tokens and the visible light tokens;
S3, design a loss function with preset logic and supervise the training of a preset multi-granularity token fusion model according to the loss function;
S4, fuse the infrared image and the visible light image through the multi-granularity token fusion module to obtain the multi-granularity token fusion output image, where step S4 comprises:
S41, obtain learnable attention weights with preset weight-definition logic and capture the multi-granularity token correlation of the infrared tokens and the visible light tokens with preset relation-capture logic;
S42, process the multi-granularity token correlation and the multi-scale features with preset reconstruction logic to obtain the multi-granularity token fusion output image.
By introducing learnable attention weights to capture the correlation of corresponding tokens, the method fuses the infrared and visible light images in the multi-granularity token dimension and can perceive the multi-modal information differences between infrared and visible tokens at the same position. The invention extracts each image's long-range dependencies at multiple scales, embeds multi-granularity tokens from locally divided sub-images at multiple scales, reconstructs multi-scale features through multi-granularity fusion, and computes the fused image with a dimension-reduction mapping. Fusion of infrared and visible images in the multi-scale token dimension is thus achieved with a pure Transformer, yielding better fusion performance than alternative schemes.
In a more specific embodiment, step S21 comprises: using the following logic, the infrared image is divided at each scale s into N multi-scale patches of size P_s × P_s, where s is the scale applied to the divided image and s takes the values 1, 2, and 3.
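The multi-scale partition of step S21 can be sketched as follows. The image size and the per-scale patch sizes are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def partition_patches(img, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch regions
    and flatten each into a vector, giving a sequence of N = (H/P)*(W/P) tokens."""
    H, W, C = img.shape
    n_h, n_w = H // patch, W // patch
    return (img[:n_h * patch, :n_w * patch]
            .reshape(n_h, patch, n_w, patch, C)   # separate the patch grid
            .swapaxes(1, 2)                       # group rows/cols of patches
            .reshape(n_h * n_w, patch * patch * C))

# Three scales s = 1, 2, 3 with hypothetical patch sizes 8, 16, 32.
img = np.random.rand(64, 64, 1)
sequences = {p: partition_patches(img, p) for p in (8, 16, 32)}
```

At the finest assumed scale (P = 8) a 64 × 64 single-channel image yields 64 patch vectors of dimension 64; coarser scales yield fewer, larger tokens, which is what gives the branches their different granularities.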
In a more specific embodiment, in step S23, the infrared sequence and the visible light sequence are embedded with the preset linear projection E using the following logic, and coded position information is added to each sequence, where the resulting terms denote the encoded sequences of the two source images at each scale s.
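Step S23 amounts to a learned linear projection per flattened patch plus an additive position code. A minimal sketch follows; all shapes and the initialisation are assumed for illustration (in the patent, E and the position codes are trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_sequence(seq, E, pos):
    """z = seq @ E + pos: project each flattened patch to the model
    dimension and add a per-position code ((N, D_patch) -> (N, D))."""
    return seq @ E + pos

N, D_patch, D = 16, 64, 32                    # hypothetical sizes
seq = rng.standard_normal((N, D_patch))       # one patch sequence
E = rng.standard_normal((D_patch, D)) * 0.02  # learned projection (stand-in)
pos = rng.standard_normal((N, D)) * 0.02      # learned position codes (stand-in)
z = embed_sequence(seq, E, pos)
```

The same projection is applied to the infrared and the visible sequence of each scale before the Transformer blocks see them.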
In a more specific embodiment, in step S24, the fully connected layer applies the embedding operation to the encoded infrared sequence and the encoded visible light sequence with the following logic, where the resulting terms denote the query, key, and value of the infrared and visible image sequences and LN denotes the fully connected layer.
In a more specific technical solution, in step S25, the relation-extraction parameters are processed with the following logic to extract the long-range dependencies from the infrared image and the visible light image:
The method extends fusion to multi-granularity token fusion so as to extract the multi-scale long-range dependencies of each source image and capture the attention correlation of corresponding tokens at different scales. In addition, token-based fusion can extract local-region features: the tokens embedded from the locally divided sub-images contain local-region features, which optimizes the model's fusion performance and improves the representational quality of the fused image's features.
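The multi-head self-attention step that extracts the long-range dependencies can be sketched as below. The head count and weight shapes are illustrative, and a production model would also include layer normalisation and an output projection:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(z, Wq, Wk, Wv, heads=4):
    """Project tokens to Q, K, V, apply scaled dot-product attention
    independently in each head's subspace, and concatenate the results."""
    N, D = z.shape
    d = D // heads
    Q, K, V = z @ Wq, z @ Wk, z @ Wv
    out = np.empty_like(z)
    for h in range(heads):
        sl = slice(h * d, (h + 1) * d)
        # (N, N) weights: every patch token attends to every other token
        A = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d))
        out[:, sl] = A @ V[:, sl]
    return out
```

The (N, N) attention matrix is what gives the Transformer branch its long-range reach: every patch token can weight every other, unlike a small convolution kernel.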
In a more specific technical solution, step S3 includes:
S31, compute the loss between the infrared image and the fused image in the intensity domain with the following logic:
S32, obtain the L1 loss and the total loss with the following logic, thereby preserving the detail and brightness information of the visible light image:
where L denotes the total loss value, M denotes the total number of pixels, F, I, and V denote the fusion result, the infrared image, and the visible light image respectively, ||·||_F denotes the matrix Frobenius norm, and λ is a balance parameter.
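Since the loss equations themselves are rendered as images in the source, the following is a hedged reading of S31–S32: a Frobenius-norm intensity term against the infrared image plus a λ-weighted L1 term against the visible image, both averaged over the M pixels. The exact normalisation used by the patent may differ:

```python
import numpy as np

def fusion_loss(F, I, V, lam=0.5):
    """Assumed form: L = (1/M)||F - I||_F^2 + lam * (1/M)||F - V||_1."""
    M = F.size
    intensity = np.linalg.norm(F - I, 'fro') ** 2 / M  # keeps infrared brightness
    l1 = np.abs(F - V).sum() / M                       # keeps visible detail
    return intensity + lam * l1
```

A fused result identical to both sources gives zero loss; λ trades infrared intensity fidelity against visible detail retention.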
The invention extracts the long-range dependencies between the infrared image and the visible light image through two independent Transformer branches; the fusion module searches for the optimal fusion result through the intensity loss of the infrared image and the L1 loss of the visible light image, retaining more detail and brightness information of the visible light image, finally achieving fusion of the infrared and visible light images and optimizing fusion precision and effect.
In a more specific technical solution, step S41 includes:
S411, capture the multi-granularity token correlation of the infrared tokens and the visible light tokens with the learnable attention weights using the following logic:
where f denotes the features computed by token fusion at scale s, R denotes the reshaping operation, and the remaining terms denote the learnable attention weights of the infrared tokens and the visible light tokens;
In a more specific embodiment, in step S412, the importance of the infrared tokens and the visible light tokens at the same position is balanced with the following logic:
In a more specific technical solution, in step S42, the multi-granularity token correlation and the multi-scale features are processed with the following logic to obtain the multi-granularity token fusion output image:
f = g(f_1, …, f_s),
where f denotes the fused image and g denotes the dimension-reduction mapping of a convolutional layer with a 1 × 1 kernel.
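The dimension-reduction mapping g with a 1 × 1 kernel is simply a per-pixel linear map across channels. A minimal sketch, with channel counts and the kernel values assumed:

```python
import numpy as np

def conv1x1(feats, W):
    """1x1 convolution: (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out),
    mixing the stacked multi-scale features f_1..f_s into one image."""
    return feats @ W

stacked = np.ones((8, 8, 3))        # e.g. three per-scale feature maps
W = np.full((3, 1), 1.0 / 3.0)      # hypothetical learned kernel
fused = conv1x1(stacked, W)
```

Because the kernel covers a single pixel, this layer changes only the channel dimension, collapsing the s per-scale feature maps into the final fused image.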
In a more specific technical scheme, an infrared and visible light image fusion system based on multi-granularity tokens comprises:
a multi-scale decomposition module for acquiring the infrared image and the visible light image and decomposing each at no fewer than 2 different scales;
a multi-granularity token global feature extraction module for computing the long-range dependencies of the infrared image and the visible light image through no fewer than 2 independent Transformer branches, where no fewer than 2 Transformer models are designed within each branch to extract comprehensive multi-scale long-range dependencies; this module is connected with the multi-scale decomposition module and comprises:
an image segmentation module for segmenting the infrared image and the visible light image into multi-scale patches;
an image conversion module for converting the infrared image and the visible light image into an infrared sequence and a visible light sequence, connected with the image segmentation module;
a linear projection embedding module for embedding the infrared sequence and the visible light sequence with the preset linear projection E and adding coded position information to each sequence to obtain the encoded infrared and visible light sequences, connected with the image conversion module;
a relation extraction module for applying the preset embedding operation to the encoded infrared and visible light sequences through the fully connected layer to obtain the relation-extraction parameters, connected with the linear projection embedding module;
a multi-head self-attention fusion module for processing the relation-extraction parameters with the multi-head self-attention mechanism (MSA) using preset logic, extracting long-range dependencies from the infrared and visible light images and obtaining the multi-head self-attention fusion parameters, namely the infrared tokens and the visible light tokens, connected with the relation extraction module;
a token fusion model training module for supervising and training a preset multi-granularity token fusion model with a loss function designed by preset logic, connected with the multi-head self-attention fusion module;
a multi-granularity token fusion output module for fusing the infrared image and the visible light image through the multi-granularity token fusion module to obtain the multi-granularity token fusion output image, connected with the token fusion model training module and comprising:
a token correlation module for obtaining learnable attention weights with preset weight-definition logic and capturing the multi-granularity token correlation of the infrared tokens and the visible light tokens with preset relation-capture logic;
an image reconstruction module for processing the multi-granularity token correlation and the multi-scale features with preset reconstruction logic to obtain the multi-granularity token fusion output image, connected with the token correlation module.
Compared with the prior art, the invention has the following advantages: by introducing learnable attention weights to capture the correlation of corresponding tokens, the infrared and visible light images are fused in the multi-granularity token dimension, and the method can perceive the multi-modal information differences between infrared and visible tokens at the same position. The invention extracts each image's long-range dependencies at multiple scales, embeds multi-granularity tokens from locally divided sub-images at multiple scales, reconstructs multi-scale features through multi-granularity fusion, and computes the fused image with a dimension-reduction mapping. Fusion of infrared and visible images in the multi-scale token dimension is achieved with a pure Transformer, yielding better fusion performance than alternative schemes.
The method extends fusion to multi-granularity token fusion so as to extract the multi-scale long-range dependencies of each source image and capture the attention correlation of corresponding tokens at different scales. In addition, token-based fusion can extract local-region features: the tokens embedded from the locally divided sub-images contain local-region features, which optimizes model fusion performance and improves the representational quality of the fused image's features.
The invention extracts the long-range dependencies between the infrared image and the visible light image through two independent Transformer branches; the fusion module searches for the optimal fusion result through the intensity loss of the infrared image and the L1 loss of the visible light image while retaining more detail and brightness information of the visible light image, finally achieving fusion of the infrared and visible images and optimizing fusion precision and effect. The method solves the prior art's technical problems of poor fusion-model performance, high complexity, and limited feature extraction and representation.
Drawings
Fig. 1 is a schematic diagram of the Transformer-based infrared and visible light image fusion framework in the multi-granularity token-based infrared and visible light image fusion method of embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the basic steps of the multi-granularity token-based infrared and visible light image fusion method of embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of multi-granularity token global feature extraction in embodiment 1 of the present invention;
Fig. 4 is a diagram of the specific steps of multi-granularity token fusion in embodiment 1 of the present invention;
Fig. 5 compares the results of the ablation experiment on the learnable attention module in embodiment 2 of the present invention;
Fig. 6 compares the results of the ablation experiment on the multi-granularity token module in embodiment 2 of the present invention;
Fig. 7 compares the processing effects of the methods on the TNO data set in embodiment 2 of the present invention;
Fig. 8a is a schematic diagram of the first index analysis of the images fused by each method on the TNO data set in embodiment 2 of the present invention;
Fig. 8b is a schematic diagram of the second index analysis of the images fused by each method on the TNO data set in embodiment 2 of the present invention;
Fig. 8c is a schematic diagram of the third index analysis of the images fused by each method on the TNO data set in embodiment 2 of the present invention;
Fig. 8d is a schematic diagram of the fourth index analysis of the images fused by each method on the TNO data set in embodiment 2 of the present invention;
Fig. 8e is a schematic diagram of the fifth index analysis of the images fused by each method on the TNO data set in embodiment 2 of the present invention;
Fig. 8f is a schematic diagram of the sixth index analysis of the images fused by each method on the TNO data set in embodiment 2 of the present invention;
Fig. 9 shows the image fusion effect of each method on night scenes from the Roadscene data set in embodiment 2 of the present invention;
Fig. 10 compares the image fusion effect of the methods on night scenes from the LLVIP data set in embodiment 2 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
Example 1
As shown in fig. 1, the Transformer-based infrared and visible light image fusion framework underlies the multi-granularity token-based fusion method provided by the present invention. The long-range dependencies between the infrared image and the visible light image are extracted through two independent Transformer branches, and the fusion module finds the optimal fusion result through the intensity loss of the infrared image and the L1 loss of the visible light image, finally achieving fusion of the infrared and visible light images.
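The framework of fig. 1 — patchify each source image, embed, run self-attention per branch, then fuse — can be condensed into a toy single-scale sketch. All sizes, initialisations, and the fixed 0.5 fusion weights are illustrative stand-ins for the trained components:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def branch(img, patch, E, pos, Wq, Wk, Wv):
    """One Transformer branch: patchify -> embed -> single-head self-attention."""
    n = img.shape[0] // patch
    seq = (img.reshape(n, patch, n, patch)
              .swapaxes(1, 2).reshape(n * n, patch * patch))
    z = seq @ E + pos
    Q, K, V = z @ Wq, z @ Wk, z @ Wv
    return softmax(Q @ K.T / np.sqrt(z.shape[1])) @ V

rng = np.random.default_rng(1)
P, D, N = 8, 16, 16                        # patch size, model dim, token count
ir, vi = rng.random((32, 32)), rng.random((32, 32))
E = rng.standard_normal((P * P, D)) * 0.02
pos = rng.standard_normal((N, D)) * 0.02
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
t_ir = branch(ir, P, E, pos, Wq, Wk, Wv)
t_vi = branch(vi, P, E, pos, Wq, Wk, Wv)
fused_tokens = 0.5 * t_ir + 0.5 * t_vi     # stand-in for the learnable-weight fusion
```

In the patented system each branch runs at three scales and the 0.5 weights are replaced by the learnable per-token attention weights before the final dimension-reduction mapping.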
In this embodiment, the method for fusing infrared and visible light images based on multi-granularity lemmas includes the following steps:
s1, image multi-scale decomposition: the infrared image and the visible light image are decomposed on three different scales respectively to obtain three different scale branches.
S2, extracting global features of multiple granularity word elements;
as shown in fig. 3, in this embodiment, the step S2 further includes the following specific steps:
S21, dividing the infrared image and the visible light image into multi-scale sub-region patches;
S22, converting the infrared image and the visible light image into an infrared sequence X_I^s and a visible light sequence X_V^s;
S23, embedding the infrared sequence X_I^s and the visible light sequence X_V^s by using a preset linear projection E, and adding coded position information to each sequence to obtain an encoded infrared sequence Z_I^s and an encoded visible light sequence Z_V^s;
S24, performing an embedding operation on the encoded infrared sequence Z_I^s and the encoded visible light sequence Z_V^s through the fully connected layer with preset embedding logic to obtain relation extraction parameters (query, key and value);
S25, processing the relation extraction parameters with preset logic by using a multi-head self-attention mechanism MSA, extracting the long-range dependency relationship from the infrared image and the visible light image, and obtaining multi-head self-attention fusion parameters according to the long-range dependency relationship, wherein the multi-head self-attention fusion parameters comprise: the infrared lemma TokenI_s^j and the visible light lemma TokenV_s^j.
In this embodiment, the long-range dependence of the infrared and visible images is calculated by two independent Transformer branches, respectively. In order to comprehensively extract the multi-scale long-range dependence, 3 Transformer models are designed in each branch. Given an infrared image I and a visible light image V, where I, V ∈ R^(H×W×C) and H, W and C respectively denote the height, width and channel size of the source images, the invention first divides the two source images into multi-scale patches: at each scale s, the infrared image is divided into N_P patches of size sP_s × sP_s, where s is the scale applied to the segmented image and s is defined as 1, 2 and 3, respectively. In addition, in the operation of the present invention, P_s also satisfies P_1 = P_2 = P_3. On this basis, the infrared and visible light images are converted into sequences X_I^s and X_V^s, which are then embedded by the linear projection E, and coded position information E_pos^s is incorporated into each sequence. The expression is shown in formula (1):

Z_I^s = E(X_I^s) + E_pos^s, Z_V^s = E(X_V^s) + E_pos^s, (1)
In the formula, Z_I^s and Z_V^s represent the encoded sequences of the two source images at the different scales s.
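The patch partition and position-embedded token sequence of formula (1) can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the patent's implementation: the embedding dimension (64), the patch side s·P_s, and the random matrices standing in for the learned projection E and position codes are all assumptions.

```python
import numpy as np

def to_token_sequence(img, s, p=8, rng=None):
    """Divide a single-channel image (H, W) into patches of side s*p at
    scale s, flatten each patch, apply a linear projection E, and add a
    positional encoding, yielding the encoded sequence Z^s of formula (1).
    Random weights stand in for the learned projection and position codes."""
    rng = rng or np.random.default_rng(0)
    side = s * p                                      # patch side s * P_s
    h, w = img.shape
    # rearrange into N_P flattened patches of length side*side
    patches = (img.reshape(h // side, side, w // side, side)
                  .transpose(0, 2, 1, 3)
                  .reshape(-1, side * side))          # (N_P, (s*P_s)^2)
    d = 64                                            # embedding dimension
    E = rng.standard_normal((side * side, d)) * 0.02  # linear projection E
    pos = rng.standard_normal((patches.shape[0], d)) * 0.02  # E_pos^s
    return patches @ E + pos                          # encoded sequence Z^s
```

At scale s = 2 with P_s = 8, a 48×48 image yields 9 patches of side 16, so the sequence has shape (9, 64).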
In addition, the invention utilizes the fully connected layer to embed Z_I^s and Z_V^s into query, key and value, as follows:

Q_X^s = LN(Z_X^s), K_X^s = LN(Z_X^s), V_X^s = LN(Z_X^s), X ∈ {I, V}, (2)

where Q_I^s, K_I^s, V_I^s and Q_V^s, K_V^s, V_V^s represent the query, key and value of the infrared and visible light image sequences, and LN represents the fully connected layer. In addition, the invention utilizes a multi-head self-attention mechanism (MSA) to extract the long-range dependence between the infrared image and the visible light image; its expression is shown in formula (3):

TokenX_s = MSA(Q_X^s, K_X^s, V_X^s) = softmax(Q_X^s (K_X^s)^T / √d_k) · V_X^s, X ∈ {I, V}, (3)
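The embedding of formula (2) and the multi-head self-attention of formula (3) amount to standard scaled dot-product attention. A minimal sketch follows, with random weights in place of the trained fully connected layers and 4 heads assumed (the patent does not state a head count):

```python
import numpy as np

def msa(z, n_heads=4, rng=None):
    """Multi-head self-attention over an encoded sequence z of shape (n, d):
    fully connected maps produce query/key/value, and per-head softmax
    attention captures the long-range dependence between all token pairs."""
    rng = rng or np.random.default_rng(1)
    n, d = z.shape
    dh = d // n_heads                          # per-head dimension d_k
    Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
    q, k, v = z @ Wq, z @ Wk, z @ Wv           # LN(.) embeddings, formula (2)
    out = np.empty_like(z)
    for h in range(n_heads):
        sl = slice(h * dh, (h + 1) * dh)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(dh)
        a = np.exp(scores - scores.max(axis=1, keepdims=True))
        a /= a.sum(axis=1, keepdims=True)      # row-wise softmax weights
        out[:, sl] = a @ v[:, sl]              # formula (3) per head
    return out                                 # token features Token_s
```

Each output token is a weighted mixture of all input tokens, which is exactly the long-range dependence the branch is meant to extract.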
S3, setting a loss function: a loss function is designed to supervise the training of the proposed method so that it simulates the data distribution of the original images. Since the infrared image is obtained by capturing thermal radiation, the content of the infrared image is characterized by pixel intensity; the loss between the infrared image and the fused image is therefore calculated in the intensity domain, i.e. L_int = (1/M)·||f − I||_F². Meanwhile, the visible light sensor describes the scene by capturing reflected light; in order to retain more detail and brightness information of the visible light image, the L1 loss is used to constrain the fused image to have a data distribution similar to that of the visible light image, defined as L_1 = (1/M)·||f − V||_1. The total loss function is shown in formula (4):

L = L_int + λ·L_1, (4)
where L represents the total loss value, M represents the total number of pixels, f, I and V respectively represent the fusion result, the infrared image and the visible light image, ||·||_F represents the matrix Frobenius norm, and λ is designed to balance the two terms.
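A direct reading of the total loss in formula (4) — an intensity (Frobenius) term toward the infrared image plus a λ-weighted L1 term toward the visible image — can be written as follows. The exact normalization by M and the default λ = 0.5 are assumptions consistent with the definitions above, not values stated by the patent:

```python
import numpy as np

def fusion_loss(f, ir, vis, lam=0.5):
    """Total training loss of formula (4): an intensity term pulling the
    fused image f toward the infrared image ir (squared Frobenius norm),
    plus a lambda-weighted L1 term pulling it toward the visible image vis.
    M is the total number of pixels; lam is the balance parameter."""
    m = f.size
    l_int = np.linalg.norm(f - ir) ** 2 / m   # intensity loss in formula (4)
    l_1 = np.abs(f - vis).sum() / m           # L1 loss on visible detail
    return l_int + lam * l_1
```

When the fused image equals both sources the loss is zero; any intensity deviation from the infrared image or detail deviation from the visible image raises it.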
As shown in fig. 4, step S4, multi-granularity lemma fusion: and fusing the infrared image and the visible light image through a multi-granularity word element fusion module. In this embodiment, step S4 further includes:
In this embodiment, learnable attention weights are introduced to capture the correlation of the infrared lemma TokenI_s^j and the visible light lemma TokenV_s^j; their definition is shown in formula (5):

f_s = R(w_I^s ⊙ TokenI_s + w_V^s ⊙ TokenV_s), (5)

where f_s represents the feature calculated by fusing the lemmas at scale s, R represents the reshaping operation, and w_I^s and w_V^s represent the learnable attention weights of the infrared lemma TokenI_s^j and the visible light lemma TokenV_s^j. The invention defines w_I^s + w_V^s = 1 to balance the importance of the infrared and visible light lemmas at the same position.
The invention reconstructs the fused image by using the characteristics of different scales, and the definition is shown as a formula (6):
f = g(f_1, …, f_s), (6)
where f represents the fused image, g represents the dimensionality reduction mapping operation of the convolutional layer, and the kernel is 1 × 1.
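Formulas (5) and (6) — weighted token blending with w_I + w_V = 1, reshaping, and a 1×1-convolution dimensionality reduction — can be sketched as below. The nearest-neighbour upsampling used to bring the per-scale features to a common size and the random 1×1 kernel are illustrative assumptions, not details given by the patent:

```python
import numpy as np

def fuse_tokens(tok_ir, tok_vis, w_ir):
    """Formula (5) sketch: blend co-located infrared and visible tokens
    with learnable attention weights constrained to sum to one, then
    reshape (R) the fused tokens into a spatial feature map f_s."""
    w_vis = 1.0 - w_ir                          # enforce w_I + w_V = 1
    fused = w_ir[:, None] * tok_ir + w_vis[:, None] * tok_vis
    n, d = fused.shape
    side = int(np.sqrt(n))                      # assume a square token grid
    return fused.reshape(side, side, d)

def reconstruct(features, rng=None):
    """Formula (6) sketch: g brings the per-scale features to a common
    size (nearest-neighbour repeat here), concatenates them along the
    channel axis, and reduces dimensionality with a 1x1 convolution,
    i.e. a per-pixel linear map."""
    rng = rng or np.random.default_rng(2)
    target = max(f.shape[0] for f in features)
    ups = [np.repeat(np.repeat(f, target // f.shape[0], 0),
                     target // f.shape[1], 1) for f in features]
    stack = np.concatenate(ups, axis=-1)        # (H, W, total channels)
    k = rng.standard_normal((stack.shape[-1], 1)) * 0.02  # 1x1 conv kernel
    return (stack @ k)[..., 0]                  # fused image f
```

With equal weights of 0.5, an all-ones infrared token grid and an all-zeros visible grid blend to 0.5 everywhere, and two feature maps of sides 6 and 3 reconstruct a 6×6 fused image.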
Example 2
The effectiveness of the fusion strategy of the invention is proved by ablation experiments, and qualitative and quantitative comparison and generalization experiments are carried out on three public data sets.
A. Ablation experiment
1) Ablation analysis of the learnable attention weights: the invention introduces learnable attention to estimate the importance of the corresponding lemmas in the lemma-based fusion process. Therefore, a model trained with the learnable attention weights discarded (no weights) is compared against the full model to prove their validity.
As shown in fig. 5, it can be seen that the learnable attention weight plays a key role in the lemma-based fusion, for example, in the red square of fig. 5, the fusion model added with the learnable attention weight contains more detail and edge information, which proves the effectiveness of the learnable attention weight.
2) Multi-granularity lemma fusion ablation analysis: in this work, the invention fuses the infrared and visible light images by extracting the multi-scale long-range dependency and capturing the attention correlation of corresponding lemmas at different sizes. Therefore, a fusion model trained with the multi-granularity operations removed (no multi-granularity) is compared to prove their effectiveness.

As shown in fig. 6, in this embodiment the thermal infrared details are richer and the visual effect is better after the multi-granularity module is introduced, for example in the area marked by the red square in fig. 6. In addition, the full model captures more background texture information than the variant without multi-granularity lemmas, which shows the rationality and necessity of multi-granularity lemma fusion.
B. Comparative experiment
In the experiments, the method of the invention is evaluated in a qualitative and quantitative manner. In the qualitative analysis, the fusion results are assessed with the human visual system, mainly in terms of brightness, sharpness and contrast. In the quantitative analysis, the methods are evaluated with 6 metrics: mutual information (MI), standard deviation (SD), average gradient (AG), spatial frequency (SF), edge intensity (EI) and peak signal-to-noise ratio (PSNR). MI measures the information that the fused image retains from the source images; SD reflects the contrast of the fused image; AG mainly measures its texture information; SF measures the grey-level change rate of the fused image and reflects its sharpness; EI evaluates the edge information of the fused image. For all of the above indices, the larger the value, the better the performance of the fusion method.
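Several of these reference-free metrics have simple closed forms. The sketches of SD, AG, SF and PSNR below follow the usual definitions in the fusion literature; the patent does not spell out its exact variants, so the normalization details here are assumptions:

```python
import numpy as np

def sd(img):
    """Standard deviation: reflects the contrast of the fused image."""
    return float(img.std())

def ag(img):
    """Average gradient: measures texture via mean local gradient magnitude."""
    gx = np.diff(img.astype(float), axis=1)[:-1, :]
    gy = np.diff(img.astype(float), axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2)))

def sf(img):
    """Spatial frequency: row/column grey-level change rate (sharpness)."""
    rf = np.mean(np.diff(img.astype(float), axis=1) ** 2)
    cf = np.mean(np.diff(img.astype(float), axis=0) ** 2)
    return float(np.sqrt(rf + cf))

def psnr(fused, ref, peak=255.0):
    """Peak signal-to-noise ratio of the fused image against a reference."""
    mse = np.mean((fused.astype(float) - ref.astype(float)) ** 2)
    return float(10 * np.log10(peak ** 2 / mse)) if mse else float("inf")
```

A flat image scores zero on SD, AG and SF, while any intensity variation raises them, matching the "larger is better" reading above.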
1) And (3) qualitative analysis: in comparative experiments, the present invention compared the method of the present invention with 8 fusion methods on 35 TNO image pairs.
As shown in fig. 7, the conventional method, the CNN-based method and the GAN-based method cannot maintain clear background texture information, while the Transformer-based methods perform better at preserving background details; the method of the invention retains more detail and texture information than even the method that combines CNN and Transformer (CGTF). For example, the regions enlarged in the red squares show that the results of the invention have the clearest edges and details. In addition, compared with the other methods, the invention also preserves more salient thermal infrared information: a region is selected from each image and enlarged in the green squares, which shows that in the face region the results of the invention are brighter than those of the other methods.
2) Quantitative analysis: in a quantitative experiment, 35 TNO image pairs are selected and compared, and six indexes are selected for objectively evaluating the method. As can be seen from Table 1, the method of the present invention performed well on all 6 criteria.
As shown in fig. 8, the figure presents the quantitative analysis of the 35 image pairs in more detail. The largest MI value indicates that the method retains rich source-image information, and the largest SD value indicates that the method has higher contrast than the other methods. In addition, the values of the model on AG, SF, EI and PSNR are also the largest, which indicates that the method retains more texture and detail information with less noise.
Table 1: indicators of different scenarios on the TNO dataset
C. Generalization experiment
In order to verify the generalization ability of the proposed model, in addition to the TNO data set, the invention selects 100 infrared and visible light image pairs from each of the Roadscene data set and the LLVIP data set for qualitative and quantitative experiments.
As shown in fig. 9, qualitative experiments on the Roadscene data set show that the method of the invention not only retains more background texture in night scenes but also exhibits more salient thermal infrared features and details; in daytime scenes it likewise retains more thermal infrared features and details. The night scene comparison results are shown in fig. 9. Likewise, through quantitative experiments, it can be seen from table 2 that the results of the invention are the largest in MI, SD, AG, EI and SF and still acceptable in PSNR.
Table 2: indicators of different schemes on the Roadscene dataset
As shown in fig. 10, qualitative experiments on the LLVIP data set show that the method of the invention contains more details and higher contrast in night scenes, providing clearer outlines and edges than the other methods; in daytime scenes it preserves a certain amount of texture together with prominent infrared targets. The night scene comparison results are shown in fig. 10. Meanwhile, through quantitative experiments, it can be seen from table 3 that the results of the invention are the largest in MI, SD, AG, EI, SF and PSNR, which proves the superiority of the method.
Table 3: indicators of different scenarios on LLVIP dataset
D. Efficiency comparison
In this work, the present invention also makes efficiency comparisons by providing an average run time of each method over three data sets. The traditional method is realized by a CPU, and other methods are realized by a GPU. As can be seen from table 4, the conventional MSVD and wavelet based methods are less time consuming than most DL based methods, and the Transformer based methods are more time consuming than CNN and GAN based methods.
Table 4: average run time of different methods on three datasets
Combining the above experiments: the ablation experiments show that introducing the learnable attention weights and the multi-granularity lemma fusion module is reasonable and necessary, and the comparison and generalization experiments show that the proposed method has clear advantages both quantitatively and qualitatively. In computational efficiency, the method has a certain advantage over CGTF, which is likewise based on a Transformer framework, but it is still far slower than the CNN-based and GAN-based methods; given its excellent fusion quality, it nevertheless has broad application prospects.
According to the method, the relevance of the corresponding lemmas is captured by introducing the learnable attention weight, so that the infrared image and the visible light image are fused under the dimension of the multi-granularity lemmas, and the method can sense the difference of multi-modal information of the infrared and visible light lemmas at the same position. The invention extracts the long-range dependency relationship of each image on multiple scales, embeds multi-granularity lemmas by locally dividing sub-images on multiple scales, reconstructs multi-scale characteristics by utilizing multi-granularity fusion, and calculates the fusion image by dimension reduction mapping operation. The method can realize the fusion of the infrared image and the visible light image under the multi-scale lemma dimensionality based on the pure Transformer, and has better fusion performance compared with other schemes.
The method expands the fusion to the fusion based on multi-granularity lemmas so as to extract the multi-scale long-range dependency relationship of each source image and capture the attention relevance of the corresponding lemmas under different scales. In addition, local region features can be extracted based on the fusion of the word elements, and the word elements embedded in the local segmentation subimages contain the local region features, so that the model fusion performance is optimized, and the characterization degree of the fusion image features is improved.
The invention extracts the long-range dependence between the infrared image and the visible light image through two independent Transformer branches; the intensity loss of the infrared image and the L1 loss of the visible light image are computed in the fusion module, the optimal fusion result is found through these losses while more detail and brightness information of the visible light image is retained, and the fusion of the infrared image and the visible light image is finally realized.
Meanwhile, the image fusion precision and the fusion effect are optimized. The method solves the technical problems of poor performance, high complexity and limited feature extraction and representation of the fusion model in the prior art.
The above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. The infrared and visible light image fusion method based on the multi-granularity lemmas is characterized by comprising the following steps:
s1, acquiring an infrared image and a visible light image, and decomposing the infrared image and the visible light image on not less than 2 different scales respectively;
S2, extracting multi-granularity lemma global features: calculating the long-range dependency relationship of the infrared image and the visible light image respectively through not less than 2 independent Transformer branches, wherein not less than 2 Transformer models are designed in each independent Transformer branch to extract a comprehensive multi-scale long-range dependency relationship, and the step S2 comprises:
S21, dividing the infrared image and the visible light image into multi-scale sub-region patches;
S22, converting the infrared image and the visible light image into an infrared sequence X_I^s and a visible light sequence X_V^s;
S23, embedding the infrared sequence X_I^s and the visible light sequence X_V^s by using a preset linear projection E, and adding coded position information to each sequence to obtain an encoded infrared sequence Z_I^s and an encoded visible light sequence Z_V^s;
S24, performing an embedding operation on the encoded infrared sequence Z_I^s and the encoded visible light sequence Z_V^s through the fully connected layer with preset embedding logic to obtain relation extraction parameters;
S25, processing the relation extraction parameters with preset logic by using a multi-head self-attention mechanism MSA to extract the long-range dependency relationship from the infrared image and the visible light image so as to obtain multi-head self-attention fusion parameters, wherein the multi-head self-attention fusion parameters comprise: the infrared lemma TokenI_s^j and the visible light lemma TokenV_s^j;
S3, designing a loss function by using preset logic, and supervising and training a preset multi-granularity word element fusion model according to the loss function;
s4, fusing the infrared image and the visible light image through a multi-granularity lemma fusion module to obtain a multi-granularity lemma fusion output image, wherein the step S4 comprises the following steps of:
S41, obtaining learnable attention weights through preset weight definition logic, and capturing the multi-granularity lemma correlation of the infrared lemma TokenI_s^j and the visible light lemma TokenV_s^j by using preset relation capture logic;
and S42, processing the correlation and the difference scale characteristics of the multi-granularity lemmas by using preset reconstruction logic to obtain the multi-granularity lemma fusion output image.
2. The infrared and visible light image fusion method based on multi-granularity lemmas according to claim 1, wherein the step S21 comprises: dividing each image at every scale into N_P multi-scale sub-region patches of size sP_s × sP_s using the following logic, where s is the scale applied to the segmented image, and s is defined as 1, 2 and 3, respectively.
3. The method of claim 1, wherein in step S23, the preset linear projection E is used to embed the infrared sequence X_I^s and the visible light sequence X_V^s, and coded position information is added to each sequence: Z_I^s = E(X_I^s) + E_pos^s, Z_V^s = E(X_V^s) + E_pos^s.
4. The method as claimed in claim 1, wherein in step S24, the encoded infrared sequence Z_I^s and the encoded visible light sequence Z_V^s are subjected to an embedding operation by the fully connected layer with the following logic: Q_X^s = LN(Z_X^s), K_X^s = LN(Z_X^s), V_X^s = LN(Z_X^s), X ∈ {I, V}.
5. The method as claimed in claim 1, wherein in step S25, the relation extraction parameters are processed with the following logic to extract the long-range dependency relationship from the infrared image and the visible light image: TokenX_s = MSA(Q_X^s, K_X^s, V_X^s) = softmax(Q_X^s (K_X^s)^T / √d_k) · V_X^s, X ∈ {I, V}.
6. The infrared and visible light image fusion method based on multi-granularity lemmas according to claim 1, wherein the step S3 comprises:
S31, calculating the loss of the infrared image and the fused image in the intensity domain with the following logic: L_int = (1/M)·||f − I||_F²;
S32, obtaining the L1 loss and the total loss with the following logic, so as to retain the detail and brightness information of the visible light image: L_1 = (1/M)·||f − V||_1, L = L_int + λ·L_1;
where L represents the total loss value, M represents the total number of pixels, f, I and V respectively represent the fusion result, the infrared image and the visible light image, ||·||_F represents the matrix Frobenius norm, and λ is the balance parameter.
7. The method for fusing infrared and visible light images based on multi-granularity lemmas according to claim 1, wherein the step S41 comprises:
S411, capturing the multi-granularity lemma correlation of the infrared lemma TokenI_s^j and the visible light lemma TokenV_s^j with the learnable attention weights using the following logic: f_s = R(w_I^s ⊙ TokenI_s + w_V^s ⊙ TokenV_s), where f_s represents the feature calculated by fusing the lemmas at scale s, R represents the reshaping operation, and w_I^s and w_V^s represent the learnable attention weights of the infrared lemma TokenI_s^j and the visible light lemma TokenV_s^j;
9. The method as claimed in claim 1, wherein in step S42, the multi-granularity lemma correlation and the difference scale features are processed with the following logic to obtain the multi-granularity lemma fusion output image:
f = g(f_1, …, f_s),
where f denotes the fused image, g denotes the dimensionality-reduction mapping operation of the convolutional layer, and the kernel is 1 × 1.
10. An infrared and visible light image fusion system based on multi-granularity lemmas is characterized by comprising:
the difference scale decomposition module is used for acquiring an infrared image and a visible light image, and decomposing the infrared image and the visible light image on not less than 2 difference scales respectively;
the multi-granularity morphological global feature extraction module is used for calculating the long-range dependency relationship of the infrared image and the visible light image respectively through not less than 2 independent Transformer branches, wherein not less than 2 Transformer models are designed in each independent Transformer branch to extract the comprehensive multi-scale long-range dependency relationship, the multi-granularity morphological global feature extraction module is connected with the difference scale decomposition module, and the multi-granularity morphological global feature extraction module comprises:
the image segmentation module is used for segmenting the infrared image and the visible light image into a multi-scale subarea patch;
an image conversion module, used for converting the infrared image and the visible light image into an infrared sequence X_I^s and a visible light sequence X_V^s, the image conversion module being connected with the image segmentation module;
a linear projection embedding module, used for embedding the infrared sequence X_I^s and the visible light sequence X_V^s by means of a preset linear projection E and adding coded position information to each sequence to obtain an encoded infrared sequence Z_I^s and an encoded visible light sequence Z_V^s, the linear projection embedding module being connected with the image conversion module;
a relation extraction module, used for performing an embedding operation on the encoded infrared sequence Z_I^s and the encoded visible light sequence Z_V^s through the fully connected layer with preset embedding logic to obtain relation extraction parameters, the relation extraction module being connected with the linear projection embedding module;
a multi-head self-attention fusion module, used for processing the relation extraction parameters with preset logic by using a multi-head self-attention mechanism MSA so as to extract the long-range dependency relationship from the infrared image and the visible light image and obtain multi-head self-attention fusion parameters, wherein the multi-head self-attention fusion parameters comprise: the infrared lemma TokenI_s^j and the visible light lemma TokenV_s^j; the multi-head self-attention fusion module is connected with the relation extraction module;
the lemma fusion model training module is used for designing a loss function with preset logic and supervising the training of a preset multi-granularity lemma fusion model according to the loss function, and the lemma fusion model training module is connected with the multi-head self-attention fusion module;
a multi-granularity lemma fusion output module, configured to fuse the infrared image and the visible light image through a multi-granularity lemma fusion module, so as to obtain a multi-granularity lemma fusion output image, where the multi-granularity lemma fusion output module is connected to the lemma fusion model training module, and the multi-granularity lemma fusion output module includes:
a word element correlation module, used for obtaining learnable attention weights through preset weight definition logic and capturing the multi-granularity lemma correlation of the infrared lemma TokenI_s^j and the visible light lemma TokenV_s^j by using preset relation capture logic;
and the image reconstruction module is used for processing the multi-granularity word element correlation and the difference scale characteristics by utilizing preset reconstruction logic so as to obtain a multi-granularity word element fusion output image, and is connected with the word element correlation module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211054722.6A CN115331112A (en) | 2022-08-30 | 2022-08-30 | Infrared and visible light image fusion method and system based on multi-granularity word elements |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115331112A true CN115331112A (en) | 2022-11-11 |
Family
ID=83928840
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211054722.6A Pending CN115331112A (en) | 2022-08-30 | 2022-08-30 | Infrared and visible light image fusion method and system based on multi-granularity word elements |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115331112A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116523969A (en) * | 2023-06-28 | 2023-08-01 | 云南联合视觉科技有限公司 | MSCFM and MGFE-based infrared-visible light cross-mode pedestrian re-identification method |
CN116523969B (en) * | 2023-06-28 | 2023-10-03 | 云南联合视觉科技有限公司 | MSCFM and MGFE-based infrared-visible light cross-mode pedestrian re-identification method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Tang et al. | YDTR: Infrared and visible image fusion via Y-shape dynamic transformer | |
CN110097528B (en) | Image fusion method based on joint convolution self-coding network | |
CN112347859A (en) | Optical remote sensing image saliency target detection method | |
CN111126202A (en) | Optical remote sensing image target detection method based on void feature pyramid network | |
Komorowski et al. | Minkloc++: lidar and monocular image fusion for place recognition | |
CN115713679A (en) | Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map | |
CN113762201A (en) | Mask detection method based on yolov4 | |
CN113870160B (en) | Point cloud data processing method based on transformer neural network | |
CN113792641A (en) | High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism | |
CN115188066A (en) | Moving target detection system and method based on cooperative attention and multi-scale fusion | |
CN117830788B (en) | Image target detection method for multi-source information fusion | |
CN115331112A (en) | Infrared and visible light image fusion method and system based on multi-granularity word elements | |
CN117238034A (en) | Human body posture estimation method based on space-time transducer | |
CN115393404A (en) | Double-light image registration method, device and equipment and storage medium | |
Yuan et al. | STransUNet: A siamese TransUNet-based remote sensing image change detection network | |
Wang et al. | PACCDU: Pyramid attention cross-convolutional dual UNet for infrared and visible image fusion | |
Baoyuan et al. | Research on object detection method based on FF-YOLO for complex scenes | |
CN114639002A (en) | Infrared and visible light image fusion method based on multi-mode characteristics | |
Xu et al. | JCa2Co: A joint cascade convolution coding network based on fuzzy regional characteristics for infrared and visible image fusion | |
Li et al. | TFIV: Multi-grained Token Fusion for Infrared and Visible Image via Transformer | |
CN116778346A (en) | Pipeline identification method and system based on improved self-attention mechanism | |
CN115984714A (en) | Cloud detection method based on double-branch network model | |
CN115393735A (en) | Remote sensing image building extraction method based on improved U-Net | |
Zhu et al. | PD-SegNet: Semantic Segmentation of Small Agricultural Targets in Complex Environments | |
CN113449611B (en) | Helmet recognition intelligent monitoring system based on YOLO network compression algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||