CN115331112A

CN115331112A - Infrared and visible light image fusion method and system based on multi-granularity word elements

Info

Publication number: CN115331112A
Application number: CN202211054722.6A
Authority: CN
Inventors: 窦浩; 伍政华; 代泽洋; 倪向东; 李静; 王潇
Original assignee: CETC 38 Research Institute
Current assignee: CETC 38 Research Institute
Priority date: 2022-08-30
Filing date: 2022-08-30
Publication date: 2022-11-11

Abstract

The invention provides an infrared and visible light image fusion method and system based on multi-granularity lemmas, which comprises the following steps: s1, acquiring an infrared image and a visible light image, and decomposing the infrared image and the visible light image on not less than 2 different scales respectively; extracting multi-granularity word global features: respectively calculating the long-range dependence relationship of the infrared image and the visible light image through not less than 2 independent Transformer branches; utilizing a preset logic design loss function to supervise and train a preset multi-granularity word element fusion model; and fusing the infrared image and the visible light image through a multi-granularity word element fusion module to obtain a multi-granularity word element fusion output image. The invention solves the technical problems of poor performance, high complexity and limited feature extraction and representation of the fusion model.

Description

Infrared and visible light image fusion method and system based on multi-granularity word elements

Technical Field

The invention relates to the field of image fusion, in particular to the technical field of multi-type image fusion.

Background

To comprehensively describe real-world scenes, the combination of multi-source images acquired by different sensors is a key to the application. Therefore, infrared and visible image fusion has been widely used for information acquisition or analysis, such as military affairs, public security, smart city, etc. The infrared sensor is intended to capture thermal radiation emitted by the heat source, enhancing the thermal infrared target. However, the infrared sensor cannot detect the presence of detail or texture information in the background because the objects in the background have almost the same thermal information. And the visible light sensor captures reflected light to generate an image, so that abundant background texture or detail information can be saved. Therefore, image fusion can simultaneously synthesize a single information image using the advantages of multi-source images, and many fusion methods including a conventional method and a deep learning-based method have been proposed. The traditional method mainly extracts features through mathematical transformation and then combines the features through designing a fusion strategy. Traditional methods include multi-scale transformation methods, sparse representation-based methods, saliency-based methods, subspace-based methods, and other hybrid methods. Conventional fusion methods aim to obtain a satisfactory fused image and to some extent satisfy certain applications. However, the traditional fusion method still has a bottleneck.

For example, the prior invention patent application document CN110332934A, the patent application document "a system and a method for tracking a track of a robot under hybrid network fusion" includes a visual image sensor, a wireless transmitter, a wireless receiver, a visible light source, an optical communication device, an inertial navigation device, and a hybrid data fusion unit. A visual image sensor, a wireless transmitter and a visible light source are installed in a moving area of the robot, a wireless receiver, an optical communication device and an inertial navigation device are installed on a robot body, and a hybrid data processing unit in a control center is used for resolving and evaluating track tracking data. This multisource miscellaneous data that prior art fuses multisensor and obtains can learn from the concrete implementation mode of this prior art scheme, and this prior art's visible light image is only used for thick unit, later corrects many times with the data that other multisource sensors gathered, and the difference of multisource image has been neglected in single expression, has restricted the performance of fusing the model, and this prior art's mode of fusion mainly is the combination of various kinds of data processing process simultaneously, has improved the algorithm complexity of this prior art. Furthermore, conventional fusion methods also require complex designed fusion strategies. Therefore, deep Learning (DL) is introduced to solve these tasks.

The method based on deep learning has the nonlinear fitting capability and can model the complex correlation of the source images. The DL-based fusion method may be classified into a CNN-based fusion method and a generation countermeasure network (GAN) -based fusion method according to a difference of a fusion framework. The CNN-based method utilizes parallel convolution kernels to extract multiple features, and uses a fine loss function to reconstruct a fusion result. In addition, the GAN framework can also perform an image fusion task, and the framework designs a antagonistic game to simulate the distribution of the source images. Typically, ma et al first fuse the infrared image and the visible image using GAN, and then propose various GAN-based methods such as a dual discriminator GAN method and a multi-class constraint GAN method.

For example, in the prior invention patent application publication No. CN114240736A, "method for simultaneously generating and editing arbitrary face attributes based on VAE and cGAN", we focus on the encoder-decoder architecture based on a Variational Automatic Encoder (VAE) and a conditional antagonistic neural network (cGAN), and develop a bidirectional feedback generation network for simultaneously generating a new face and performing attribute editing. And using attribute classification constraints on the generated image to ensure the correct change of the specified attributes, and generating a human face image with a plurality of attributes by sampling attribute codes from a hidden space. The method includes modeling attribute strengths to support attribute interpolation and flexible handling of multiple face attributes. The CNN or GAN-based methods in the prior art use convolution operation to extract image features in a small-range perceptual domain, and the unified convolution operation also limits the extraction and representation of features. In addition, the CNN or GAN based method aims at extracting local features through convolution kernel, but this kind of existing scheme also has a defect that long-range dependency information cannot be extracted. Therefore, transformers can be introduced to solve these problems. For example, li et al and Vibashan et al combine Transformer and CNN to extract local features and long-range dependency information of images.

However, existing Transformer-based methods ignore the difference in attention weights of infrared and visible lemmas at the same location, which affects fusion performance because the importance of infrared and visible lemmas at the same location is different.

In summary, the prior art has the technical problems of poor performance of a fusion model, high complexity and limited feature extraction and representation.

Disclosure of Invention

The invention aims to solve the technical problems of poor performance, high complexity and limited feature extraction and representation of a fusion model in the prior art.

The invention adopts the following technical scheme to solve the technical problems: the infrared and visible light image fusion method based on the multi-granularity lemmas comprises the following steps:

s1, acquiring an infrared image and a visible light image, and decomposing the infrared image and the visible light image on not less than 2 different scales respectively;

s2, multi-granularity word element global feature extraction: respectively calculating the long-range dependency relationship of the infrared image and the visible light image through not less than 2 independent Transformer branches, wherein not less than 2 Transformer models are designed in each independent Transformer branch to extract the comprehensive multi-scale long-range dependency relationship, and the step S2 comprises the following steps:

s21, dividing the infrared image and the visible light image into multi-scale subarea batch;

s22, converting the infrared image and the visible light image into an infrared sequence

And visible light sequence

S23, embedding and processing infrared sequence by utilizing preset linear projection E

And visible light sequence

And adding coded position information into each sequence to obtain coded infrared sequence

And encoding the visible light sequence

S24, utilizing the full connection layer to preset the embedded logic pair to code the infrared sequence

And encoding the visible light sequence

Performing embedding operation to obtain a relation extraction parameter;

s25, extracting parameters by using a multi-head self-attention mechanism MSA (multiple advanced self-attention mechanism) according to a preset logic processing relation, extracting a long-range dependency relationship from the infrared image and the visible light image, and acquiring multi-head self-attention fusion parameters according to the long-range dependency relationship, wherein the multi-head self-attention fusion parameters comprise: infrared morpheme

And visible light lemma

S3, designing a loss function by using preset logic, and supervising and training a preset multi-granularity word element fusion model according to the loss function;

s4, fusing the infrared image and the visible light image through the multi-granularity lemma fusion module to obtain a multi-granularity lemma fusion output image, wherein the step S4 comprises the following steps:

s41, defining logic with preset weight to obtain learnable attention weight, and capturing infrared morpheme by using preset relation to capture logic

And the visual light word element

Multi-granularity lemma relevance of (2);

and S42, processing the multi-granularity lemma correlation and the difference scale characteristics by using preset reconstruction logic to obtain a multi-granularity lemma fusion output image.

According to the method, the relevance of the corresponding lemmas is captured by introducing the learnable attention weight, so that the infrared image and the visible light image are fused under the multi-granularity lemma dimension, and the method can sense the difference of multi-modal information of the infrared and visible light lemmas at the same position. The invention extracts the long-range dependency relationship of each image on multiple scales, embeds multi-granularity lemmas by locally dividing sub-images on multiple scales, reconstructs multi-scale characteristics by utilizing multi-granularity fusion, and calculates the fusion image by dimension reduction mapping operation. The method can realize the fusion of the infrared image and the visible light image under the multi-scale lemma dimension based on the pure Transformer, and has better fusion performance compared with other schemes.

In a more specific embodiment, step S21 includes: the infrared image is divided into N P's per scale using the following logic _s ² Size multiscale partition region patch:

where s is a scale applied to the divided image, and s is defined as 1, 2, and 3, respectively.

In a more specific embodiment, in step S23, the infrared sequence is embedded and processed by using the preset linear projection E with the following logic

And visible light sequence

And adding coded position information to each sequence:

in the formula (I), the compound is shown in the specification,

and

representing the coded sequences of the two original images at different scales of s.

In a more specific embodiment, in step S24, the full connectivity layer is used to encode the ir sequence with the following logic pairs

And encoding the visible light sequence

Carrying out embedding operation:

in the formula (I), the compound is shown in the specification,

and

query, key and value representing infrared and visible image sequences, LN representing fully connected layers.

In a more specific technical solution, in step S25, the relationship extraction parameters are processed with the following logic to extract the long-range dependency relationship from the infrared image and the visible light image:

in the formula (I), the compound is shown in the specification,

and

is the output of the MSA.

The method expands the fusion to the fusion based on multi-granularity lemmas so as to extract the multi-scale long-range dependency relationship of each source image and capture the attention relevance of the corresponding lemmas under different scales. In addition, local region features can be extracted based on the fusion of the lemmas, and the lemmas embedded in the local segmentation subimages contain the local region features, so that the model fusion performance is optimized, and the representation degree of the fusion image features is improved.

In a more specific technical solution, step S3 includes:

s31, calculating the loss of the infrared image and the fused image in an intensity domain by utilizing the following logics:

s32, obtaining the loss L by the following logic processing ₁ And total loss, thereby preserving the details and brightness information of the visible light image:

in the formula, L represents the total loss value, M represents the total number of pixels, f, I and V respectively represent the fusion result, the infrared image and the visible light image, | | · | calcualting _F Representing the matrix Forbenius norm, with λ being the equilibrium parameter.

The invention extracts the long-range dependence relationship between the infrared image and the visible light image through two independent Transformer branches, and the intensity loss of the infrared image and the visible light image L are passed through the fusion module ₁ And the optimal fusion result is searched for by loss, more details and brightness information of the visible light image are retained, the fusion of the infrared image and the visible light image is finally realized, and the image fusion precision and the fusion effect are optimized.

In a more specific technical solution, step S41 includes:

s411, utilizing the following logic to capture infrared word elements by the learnable attention weight

And the visual light word element

The multi-granularity lemma relevance of (2):

in the formula, f represents the characteristics calculated by fusing the lemma under the s scale, R represents the remodeling operation,

and

representing infrared lemma

And visible light word element

Learnable attention weight of (a);

s412, balancing the infrared lemmas

And visible light word element

Importance at the same location.

In a more specific embodiment, in step S412, the infrared lemma is balanced by using the following logic

And visible light word element

Importance at the same location:

in a more specific technical solution, in step S42, the following logic is used to process the correlation between multiple-granularity lemmas and the difference scale features to obtain a multi-granularity lemma fusion output image:

f＝g(f ₁ ,…f _s ),

where f denotes the fused image, g denotes the dimensionality reduction mapping operation of the convolutional layer, and the kernel is 1 × 1.

In a more specific technical scheme, the infrared and visible light image fusion system based on multi-granularity lemmas comprises:

the difference scale decomposition module is used for acquiring the infrared image and the visible light image and decomposing the infrared image and the visible light image on not less than 2 difference scales respectively;

the multi-granularity morphological global feature extraction module is used for respectively calculating the long-range dependency relationship of the infrared image and the visible light image through not less than 2 independent transform branches, wherein not less than 2 transform models are designed in each independent transform branch to extract the comprehensive multi-scale long-range dependency relationship, the multi-granularity morphological global feature extraction module is connected with the difference scale decomposition module, and the multi-granularity morphological global feature extraction module comprises:

the image segmentation module is used for segmenting the infrared image and the visible light image into multi-scale subareas patch;

an image conversion module for converting the infrared image and the visible light image into an infrared sequence

And visible light sequence

The image conversion module is connected with the image segmentation module;

linear projection embedding module for embedding processing infrared sequence by using preset linear projection E

And visible light sequence

And encoding the visible light sequence

The linear projection embedding module is connected with the image conversion module;

a relation extraction module for encoding the infrared sequence with preset embedded logic pairs by using the full connection layer

And weavingCode visible light sequence

Embedding operation is carried out to obtain a relation extraction parameter, and the relation extraction module is connected with the linear projection embedding module;

the multi-head self-attention fusion module is used for extracting parameters by utilizing a multi-head self-attention mechanism MSA (Multi-head self-attention mechanism) and using a preset logic processing relation to extract a long-range dependency relation from an infrared image and a visible light image so as to obtain multi-head self-attention fusion parameters, wherein the multi-head self-attention fusion parameters comprise: infrared word element

And visible light word element

The multi-head self-attention fusion module is connected with the relation extraction module;

the training module of the word element fusion model is used for monitoring and training a preset multi-granularity word element fusion model by utilizing a preset logic design loss function, and the training module of the word element fusion model is connected with the multi-head self-attention fusion module;

the multi-granularity lemma fusion output module is used for fusing the infrared image and the visible light image through the multi-granularity lemma fusion module so as to obtain a multi-granularity lemma fusion output image, the multi-granularity lemma fusion output module is connected with the lemma fusion model training module, and the multi-granularity lemma fusion output module comprises:

a word element correlation module for obtaining learnable attention weight value by preset weight definition logic and capturing infrared word element by using preset relation capture logic

And the visual light word element

The multi-granularity lemma relevance of (2);

and the image reconstruction module is used for processing the multi-granularity word element correlation and the difference scale characteristics by utilizing preset reconstruction logic so as to obtain a multi-granularity word element fusion output image, and is connected with the word element correlation module.

Compared with the prior art, the invention has the following advantages: according to the method, the relevance of the corresponding lemmas is captured by introducing the learnable attention weight, so that the infrared image and the visible light image are fused under the multi-granularity lemma dimension, and the method can sense the difference of multi-modal information of the infrared and visible light lemmas at the same position. The invention extracts the long-range dependency relationship of each image on multiple scales, embeds multi-granularity lemmas by locally dividing sub-images on multiple scales, reconstructs multi-scale characteristics by utilizing multi-granularity fusion, and calculates the fusion image by dimension reduction mapping operation. The method can realize the fusion of the infrared image and the visible light image under the multi-scale lemma dimension based on the pure Transformer, and has better fusion performance compared with other schemes.

The method expands the fusion to the fusion based on multi-granularity lemmas so as to extract the multi-scale long-range dependency relationship of each source image and capture the attention relevance of the corresponding lemmas under different scales. In addition, local region features can be extracted based on the fusion of the word elements, and the word elements embedded in the local segmentation subimages contain the local region features, so that the model fusion performance is optimized, and the characterization degree of the fusion image features is improved.

The invention extracts the long-range dependence relationship between the infrared image and the visible light image through two independent Transformer branches, and the intensity loss of the infrared image and the visible light image L are passed through the fusion module ₁ And searching for the optimal fusion result of the infrared image and the visible image in a loss mode, and simultaneously keeping more details and brightness information of the visible image to finally realize the fusion of the infrared image and the visible image. Meanwhile, the image fusion precision and the fusion effect are optimized. The method solves the technical problems of poor performance, high complexity and limited feature extraction and representation of the fusion model in the prior art.

Drawings

Fig. 1 is a schematic diagram of a transform-based infrared and visible light image fusion framework in a multi-granularity lemma-based infrared and visible light image fusion method according to embodiment 1 of the present invention;

FIG. 2 is a schematic diagram of basic steps of a multi-granularity lemma-based infrared and visible light image fusion method according to embodiment 1 of the present invention;

fig. 3 is a schematic diagram of multi-granularity lemma global feature extraction in embodiment 1 of the present invention;

FIG. 4 is a diagram illustrating a specific step of multi-granularity lemma fusion in embodiment 1 of the present invention;

FIG. 5 is a comparison of the results of the ablation experiment with the learnable attention module of example 2 of the present invention;

FIG. 6 is a comparison chart of the results of the ablation experiments of the multi-granularity lemma module in example 2 of the present invention;

FIG. 7 is a schematic diagram showing the comparison of the processing effects of the methods on the TNO data set of example 2 of the present invention;

fig. 8a is a schematic diagram of a first index analysis of a fused image obtained by each method on the TNO data set according to embodiment 2 of the present invention;

FIG. 8b is a schematic diagram of a second index analysis of the fused image of each method on the TNO data set according to embodiment 2 of the present invention;

fig. 8c is a schematic diagram of a third index analysis of the fused image obtained by the methods in the TNO data set according to embodiment 2 of the present invention;

fig. 8d is a diagram illustrating a fourth index analysis of an image fused by each method on the TNO data set according to embodiment 2 of the present invention;

fig. 8e is a schematic diagram of a fifth index analysis of the fused image obtained by the methods in the TNO data set according to embodiment 2 of the present invention;

fig. 8f is a schematic diagram of a sixth index analysis of fused images of each method on the TNO data set according to embodiment 2 of the present invention;

fig. 9 is a diagram of an image fusion effect of each method of a night scene on a Roadscene data set in embodiment 2 of the present invention;

fig. 10 is a comparison graph of the image fusion effect of the methods of nighttime scenes on the LLVIP data set in embodiment 2 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.

Example 1

As shown in fig. 1, the infrared and visible light image fusion framework based on Transformer related to the infrared and visible light image fusion method based on multi-granularity lemmas provided by the present invention. Extracting the long-range dependence relationship between the infrared image and the visible light image through two independent Transformer branches, and obtaining the visible light image L and the intensity loss of the infrared image in the fusion module ₁ And finding the optimal fusion result by loss, and finally realizing the fusion of the infrared image and the visible light image.

In this embodiment, the method for fusing infrared and visible light images based on multi-granularity lemmas includes the following steps:

s1, image multi-scale decomposition: the infrared image and the visible light image are decomposed on three different scales respectively to obtain three different scale branches.

S2, extracting global features of multiple granularity word elements;

as shown in fig. 3, in this embodiment, the step S2 further includes the following specific steps:

And visible light sequence

And visible light sequence

And encoding the visible light sequence

S24, utilizing the full connection layer to code the infrared sequence by preset embedded logic pairs

And encoding the visible light sequence

Performing embedding operation to obtain a relation extraction parameter;

s25, extracting parameters by utilizing multi-head self-attention mechanism MSA processing to preset a logic processing relation, extracting a long-range dependency relation from the infrared image and the visible light image, and obtaining multi-head self-attention fusion parameters according to the long-range dependency relation, wherein the multi-head self-attention fusion parameters comprise: infrared morpheme

And visible light word element

In this embodiment, the long-range dependence of the infrared and visible images is calculated by two independent transform branches, respectively. In order to comprehensively extract the multi-scale long-range dependency relationship, 3 Transformer models are designed in each branch. Given an infrared image (I) and a visible light image (V), wherein I, V E R ^H×W×C H, W and C respectively represent the height, width and channel size of the source image, the invention firstly divides the two source images into multiple scalesDegree patch, dividing the infrared image into N P pieces on each scale _s ² Size of

Where s is the scale applied to the segmented image and s is defined as 1, 2, and 3, respectively. In addition, in the operation of the present invention, P _S Also satisfies P ₁ ＝P ₂ ＝P ₃ . Based on this, the infrared and visible light images are converted into a sequence

And

then embedded by linear projection E

And

and incorporating coded position information in each sequence

And

the expression is shown in formula (1):

in the formula (I), the compound is shown in the specification,

and

representing the coded sequence of the two original images at different scales of s.

In addition, the invention utilizes the full connecting layer

And

embedding into query, key and value, as follows:

wherein

And

query, key and value, LN, representing infrared and visible image sequences, represent fully connected layers. In addition, the invention utilizes a multi-head self attention Mechanism (MSA) to extract the long-range dependence relationship between the infrared image and the visible light image, and the expression thereof is shown as a formula (3):

wherein

And

is the output of the MSA, applied to the next fusion stage.

S3, setting a loss function: a loss function is designed to supervise the training of the proposed method to simulate the data distribution of the original image. Since the infrared image is obtained by capturing thermal radiation, the content present in the infrared image is characterized by pixel intensity. Thus calculating the loss of the infrared image and the fused image in the intensity domain, i.e.

At the same time, the visible light is transmittedThe sensor describes the scene by capturing the reflected light, using L in order to retain more detail and brightness information of the visible light image ₁ The loss is used to constrain the fused image to have a data distribution similar to that of the visible light image, defined as

The total loss function is shown in equation (4):

wherein L represents the total loss value, M represents the total number of pixels, f, I and V respectively represent the fusion result, the infrared image and the visible light image, | | · | | _F The matrix Forbenius norm is expressed and lambda is designed to balance the two terms.

As shown in fig. 4, step S4, multi-granularity lemma fusion: and fusing the infrared image and the visible light image through a multi-granularity word element fusion module. In this embodiment, step S4 further includes:

in this embodiment, a learnable attention weight is introduced to capture

And

their definition is shown in equation (5):

wherein f represents the characteristics calculated by fusing the lemma under the s scale, R represents the reshaping operation,

and

representing infrared lemma

And visible light word element

The learnable attention weight of (1), the present invention defines

To balance the importance of infrared and visible lemmas being co-located.

The invention reconstructs the fused image by using the characteristics of different scales, and the definition is shown as a formula (6):

f＝g(f ₁ ,…f _s ),(6)

where f represents the fused image, g represents the dimensionality reduction mapping operation of the convolutional layer, and the kernel is 1 × 1.

Example 2

The effectiveness of the fusion strategy of the invention is proved by ablation experiments, and qualitative and quantitative comparison and generalization experiments are carried out on three public data sets.

A. Ablation experiment

1) Ablation analysis of the learnable attention weight: the invention introduces the learnable attention to estimate the importance of the corresponding word element in the fusion process based on the word element. Therefore, models were trained by discarding learnable attention weights (no weights), and compared to prove their validity.

As shown in fig. 5, it can be seen that the learnable attention weight plays a key role in the lemma-based fusion, for example, in the red square of fig. 5, the fusion model added with the learnable attention weight contains more detail and edge information, which proves the effectiveness of the learnable attention weight.

2) Multi-granularity word element fusion ablation analysis: in the subject, the invention fuses infrared images and visible light images, extracts multi-scale long-range dependency, and captures attention relevance of corresponding word elements under different sizes. Thus, fusion model comparisons can be trained by removing multi-granular operations (no multi-granular) to prove their effectiveness.

As shown in fig. 6, in the present embodiment, the thermal infrared details are richer and have better visual effect after the multi-granularity module is introduced, such as the area existing in the red square in fig. 6. In addition, compared with the method that more background texture information is captured when no multi-granularity lemmas exist, the method shows the rationality and the necessity of multi-granularity lemma fusion.

B. Comparative experiment

In the experiments of the present invention, the present invention evaluated the method of the present invention in a qualitative and quantitative manner. In qualitative analysis, the fusion result is evaluated by using a human visual system, and the analysis is mainly performed from the aspects of brightness, sharpness, contrast and the like. In quantitative analysis, these methods were evaluated using 6 evaluation indexes. Metrics include Mutual Information (MI), standard Deviation (SD), average Gradient (AG), spatial Frequency (SF), edge strength (EI), and peak signal-to-noise ratio (PSNR). The purpose of the fused image is to measure the information retained by the fused image from the source image. Wherein SD can reflect the contrast of the fused image, and AG mainly measures the texture information of the fused image. The SF can measure the gray change rate of the fused image and reflect the definition of the image. EI may be used to evaluate edge information of the fused image. For the above indices, the larger the value, the better the performance of the fusion method.

1) And (3) qualitative analysis: in comparative experiments, the present invention compared the method of the present invention with 8 fusion methods on 35 TNO image pairs.

As shown in fig. 7, the conventional method, the CNN-based method, and the GAN-based method cannot maintain clear background texture information, while the Transformer-based method performs better in maintaining background details, and the method of the present invention maintains more details and texture information than the conventional method (CGTF) that combines CNN and Transformer. For example, the area of the invention enlarged with red squares may reflect the most clear edges and details of the results of the invention. In addition, compared with other methods, the method can also save more obvious thermal infrared information. The present invention selects a region from each image and scales them in green squares, which illustrates that the processing results of the present invention are brighter in the face region than other methods.

2) Quantitative analysis: in a quantitative experiment, 35 TNO image pairs are selected and compared, and six indexes are selected for objectively evaluating the method. As can be seen from Table 1, the method of the present invention performed well on all 6 criteria.

As shown in fig. 8, the figure shows more detail for quantitative analysis of 35 image pairs. The maximum MI value represents that the method retains rich original image information, and the maximum SD value represents that the method has higher contrast ratio than other methods. In addition, the values of the model of the present invention on AG, SF, EI and PSNR are also the largest, which indicates that the method of the present invention retains more texture and detail information with less noise.

Table 1: indicators of different scenarios on the TNO dataset

C. Generalization experiment

In order to verify the generalization ability of the proposed model, in addition to the TNO data set, the invention selects 100 infrared and visible light images of the Roadscene data set and LLVIP data set respectively for qualitative and quantitative experiments.

As shown in fig. 9, on the Roadscene data set, through qualitative experiments, it can be seen that the method of the present invention not only retains more background textures on a night scene, but also has more obvious thermal infrared characteristics and details; more thermal infrared characteristics and details can be retained on daytime scenes, with night time scene comparison results as shown in fig. 9. Also by quantitative experiments, it can be seen from table 2 that the results of the present invention are the largest in all of MI, SD, AG, EI and SF and still acceptable in PSNR.

Table 2: indicators of different schemes on the Roadscene dataset

As shown in fig. 10, on LLVIP data set, through qualitative experiments, it can be seen that the method of the present invention contains more details and higher contrast on nighttime scenes, which can also provide clearer outlines and edges than other methods; a certain amount of texture and prominent infrared targets can be saved simultaneously on a day scene, wherein a night scene comparison result is shown in fig. 8. Meanwhile, through quantitative experiments, it can be seen from table 3 that the results of the present invention are the greatest in values of MI, SD, AG, EI, SF and PSNR, which can prove the superiority of the method of the present invention.

Table 3: indicators of different scenarios on LLVIP dataset

D. Efficiency comparison

In this work, the present invention also makes efficiency comparisons by providing an average run time of each method over three data sets. The traditional method is realized by a CPU, and other methods are realized by a GPU. As can be seen from table 4, the conventional MSVD and wavelet based methods are less time consuming than most DL based methods, and the Transformer based methods are more time consuming than CNN and GAN based methods.

Table 4: average run time of different methods on three datasets

By combining the above experiments, it can be seen through the ablation experiments that it is reasonable and necessary to introduce the learnable attention weight and the multi-granularity lemma fusion module. As can be seen from comparison experiments and generalization experiments, the method provided by the invention has certain advantages in both quantitative and qualitative aspects. The method has certain advantages in computational efficiency compared with the same CGTF based on a Transformer framework, but is far lower than methods based on CNN and GAN, and due to the excellent processing effect, the method has a great application scene.

According to the method, the relevance of the corresponding lemmas is captured by introducing the learnable attention weight, so that the infrared image and the visible light image are fused under the dimension of the multi-granularity lemmas, and the method can sense the difference of multi-modal information of the infrared and visible light lemmas at the same position. The invention extracts the long-range dependency relationship of each image on multiple scales, embeds multi-granularity lemmas by locally dividing sub-images on multiple scales, reconstructs multi-scale characteristics by utilizing multi-granularity fusion, and calculates the fusion image by dimension reduction mapping operation. The method can realize the fusion of the infrared image and the visible light image under the multi-scale lemma dimensionality based on the pure Transformer, and has better fusion performance compared with other schemes.

The invention extracts the long-range dependence relationship between the infrared image and the visible light image through two independent Transformer branches, and the intensity loss of the infrared image and the visible light image L are obtained in a fusion module ₁ And finding the optimal fusion result of the infrared image and the visible image through loss, and simultaneously keeping more details and brightness information of the visible image to finally realize the fusion of the infrared image and the visible image.

Meanwhile, the image fusion precision and the fusion effect are optimized. The method solves the technical problems of poor performance, high complexity and limited feature extraction and representation of the fusion model in the prior art.

The above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. The infrared and visible light image fusion method based on the multi-granularity lemmas is characterized by comprising the following steps:

s2, extracting multi-granularity word global features: calculating the long-range dependency relationship of the infrared image and the visible light image respectively through not less than 2 independent transform branches, wherein not less than 2 transform models are designed in each independent transform branch to extract a comprehensive multi-scale long-range dependency relationship, and the step S2 comprises:

s21, dividing the infrared image and the visible light image into multi-scale subareas patch;

And visible light sequence

S23, embedding and processing the infrared sequence by utilizing preset linear projection E

And the visible light sequence

And encoding the visible light sequence

S24, utilizing the full connection layer to preset the embedded logic to the coded infrared sequence

And the coded visible light sequence

Performing embedding operation to obtain a relation extraction parameter;

s25, processing the relationship extraction parameters by preset logic by using a multi-head self-attention mechanism MSA to extract the long-range dependency relationship from the infrared image and the visible light image so as to obtain multi-head self-attention fusion parameters, wherein the multi-head self-attention fusion parameters comprise: infrared word element

And visible light token TokenV _s ^j ；

s4, fusing the infrared image and the visible light image through a multi-granularity lemma fusion module to obtain a multi-granularity lemma fusion output image, wherein the step S4 comprises the following steps of:

s41, defining logic with preset weight to obtain learnable attention weight, and capturing the infrared word element by using preset relation capture logic

And the visible light lemma TokenV _s ^j The multi-granularity lemma relevance of (2);

and S42, processing the correlation and the difference scale characteristics of the multi-granularity lemmas by using preset reconstruction logic to obtain the multi-granularity lemma fusion output image.

2. The infrared and visible light image fusion method based on multi-granularity lemmas according to claim 1, wherein the step S21 comprises: dividing the infrared image into N P's per scale using the following logic _s ² Size the multiscale partition region patch:

3. The method of claim 1, wherein in step S23, the preset linear projection E is used to embed the infrared sequence

And the visible light sequence

And adding coded position information to each sequence:

in the formula (I), the compound is shown in the specification,

and

4. The method as claimed in claim 1, wherein in step S24, the encoded ir sequence is processed by the following logic using the full-concatenation layer

And the coded visible light sequence

Carrying out embedding operation:

in the formula (I), the compound is shown in the specification,

and

5. The method as claimed in claim 1, wherein in step S25, the relationship extraction parameters are processed with the following logic to extract the long-range dependency relationship from the infrared image and the visible light image:

in the formula (I), the compound is shown in the specification,

and TokenV _s ^j Is the output of the MSA.

6. The infrared and visible light image fusion method based on multi-granularity lemmas according to claim 1, wherein the step S3 comprises:

s31, calculating the loss of the infrared image and the fused image in an intensity domain by using the following logics:

s32, obtaining the loss L by the following logic processing ₁ And total loss, so as to retain the detail and brightness information of the visible light image:

in the formula, L represents the total loss value, M represents the total number of pixels, f, I and V respectively represent the fusion result, the infrared image and the visible light image, | | _F Representing the matrix Forbenius norm, with λ being the balance parameter.

7. The method for fusing infrared and visible light images based on multi-granularity lemmas according to claim 1, wherein the step S41 comprises:

s411, capturing the infrared word element by the learnable attention weight value by utilizing the following logic

And the visible light lemma TokenV _s ^j The multi-granularity lemma relevance of (a):

wherein f represents the calculated characteristics of the fusion word element under the s scale, R represents the remodeling operation,

and

representing said infrared lemma

And the visible light lemma TokenV _s ^j Learnable attention weights of (a);

s412, balancing the infrared lemmas

And said visible light token TokenV _s ^j Importance at the same location.

8. The method for fusing infrared and visible images based on multi-granularity lemmas according to claim 1, wherein the step S412 is performed by balancing the infrared lemmas using the following logic

And said visible light token TokenV _s ^j Importance at the same location:

9. the method as claimed in claim 1, wherein in step S42, the correlation and difference scale features of the multi-granular lemmas are processed by the following logic to obtain the multi-granular lemma fusion output image:

f＝g(f ₁ ,…f _s ),

10. An infrared and visible light image fusion system based on multi-granularity lemmas is characterized by comprising:

the difference scale decomposition module is used for acquiring an infrared image and a visible light image, and decomposing the infrared image and the visible light image on not less than 2 difference scales respectively;

the multi-granularity morphological global feature extraction module is used for calculating the long-range dependency relationship of the infrared image and the visible light image respectively through not less than 2 independent transform branches, wherein not less than 2 transform models are designed in each independent transform branch to extract the comprehensive multi-scale long-range dependency relationship, the multi-granularity morphological global feature extraction module is connected with the difference scale decomposition module, and the multi-granularity morphological global feature extraction module comprises:

the image segmentation module is used for segmenting the infrared image and the visible light image into a multi-scale subarea patch;

And visible light sequence

The image conversion module is connected with the image segmentation module;

linear projection embedded moduleFor processing the infrared sequence by means of preset linear projection E embedding

And the visible light sequence

And encoding the visible light sequence

a relation extraction module for using the full connection layer to preset the embedded logic to the coded infrared sequence

And the coded visible light sequence

a multi-head self-attention fusion module, configured to utilize a multi-head self-attention mechanism MSA to process the relationship extraction parameters with preset logic, so as to extract the long-range dependency relationship from the infrared image and the visible light image, so as to obtain multi-head self-attention fusion parameters, where the multi-head self-attention fusion parameters include: infrared morpheme

And visible light token TokenV _s ^j The multi-head self-attention fusion module is connected with the relation extraction module;

a multi-granularity lemma fusion output module, configured to fuse the infrared image and the visible light image through a multi-granularity lemma fusion module, so as to obtain a multi-granularity lemma fusion output image, where the multi-granularity lemma fusion output module is connected to the lemma fusion model training module, and the multi-granularity lemma fusion output module includes:

a word element correlation module for obtaining learnable attention weight value by preset weight definition logic and capturing the infrared word element by using preset relation capture logic

And the visible light lemma TokenV _s ^j Multi-granularity lemma relevance of (2);