CN115035003A - Infrared and visible light image adversarial fusion method with interactive compensation attention - Google Patents

Infrared and visible light image adversarial fusion method with interactive compensation attention

Info

Publication number
CN115035003A
Authority
CN
China
Prior art keywords
attention
infrared
image
interactive
visible light
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210376347.0A
Other languages
Chinese (zh)
Inventor
王志社
邵文禹
陈彦林
杨帆
孙婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Science and Technology
Original Assignee
Taiyuan University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Science and Technology filed Critical Taiyuan University of Science and Technology
Priority to CN202210376347.0A priority Critical patent/CN115035003A/en
Publication of CN115035003A publication Critical patent/CN115035003A/en
Withdrawn legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G06T5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10048 - Infrared image
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20212 - Image combination
    • G06T2207/20221 - Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an infrared and visible light image adversarial fusion method with interactive compensation attention. A triple-path multi-scale encoder-decoder network is constructed in an interactive compensation generator. Under the action of the network's interactive attention module and compensation attention module, the infrared path and the visible light path provide additional intensity and gradient information for the connection path, so that the fused image retains more prominent infrared targets and richer texture details, the feature extraction and feature reconstruction capabilities are enhanced, and the resulting attention feature maps focus on infrared image target perception and visible light image texture detail representation. During training, the interactive compensation generator is optimized by dual discriminators, which constrain the data-distribution similarity between the fusion result and the source images more evenly, so that the interactive compensation generator produces a more balanced fusion result.

Description

Infrared and visible light image adversarial fusion method with interactive compensation attention
Technical Field
The invention relates to the technical field of image processing, and in particular to an infrared and visible light image adversarial fusion method with interactive compensation attention.
Background
Infrared and visible light image fusion aims to integrate the advantages of the two types of sensors: the fused image generated from their complementary information has better target perception and scene representation, which benefits both human observation and subsequent computational processing. Infrared sensors, which are sensitive to heat-source radiation, can capture salient target regions, but the resulting infrared images typically lack structural features and texture detail. Conversely, visible light sensors acquire rich scene information and texture detail through reflected-light imaging; visible light images have higher spatial resolution and abundant texture detail, but they cannot effectively highlight targets, are easily affected by the external environment, and degrade severely under low-illumination conditions. Because the infrared and visible imaging mechanisms differ, the two types of images carry strong complementary information, and only by applying fusion technology can the cooperative detection capability of the two sensors be effectively improved. Such fusion is widely used in remote sensing, medical diagnosis, intelligent driving, security monitoring and other fields.
Currently, infrared and visible light image fusion techniques can be broadly divided into traditional fusion methods and deep learning fusion methods. Traditional methods usually extract image features with the same feature transformation or feature representation, combine them with a suitable fusion rule, and then reconstruct the final fused image by inverse transformation. Because infrared and visible sensors image differently, infrared images characterize target features through pixel intensity, while visible light images characterize scene texture through edges and gradients. Traditional methods ignore these inherently different characteristics and extract features from both source images indiscriminately with the same transformation or representation model, which inevitably leads to low fusion performance and poor visual quality. In addition, the fusion rules are designed manually, are increasingly complex and computationally expensive, and limit the practical application of image fusion.
In recent years, deep learning fusion methods have achieved satisfactory results, because convolution has strong feature extraction capability and model parameters can be learned from large amounts of data. Nevertheless, these methods still have shortcomings. First, they rely blindly on convolution to extract image features and do not consider the interaction between the internal features of the two image types, so the local feature extraction capability is insufficient, which easily reduces target brightness and blurs texture details in the fused image. Second, they depend entirely on convolution to extract local image features and ignore the global dependency of image features, so global feature information cannot be extracted effectively and is easily lost in the fused image.
In summary, a method is urgently needed that can extract local and global features of both image types simultaneously, effectively enhance the representation capability of depth features, suppress irrelevant information while enhancing useful information, and thereby further improve the fusion performance of infrared and visible light images.
Disclosure of Invention
The invention provides an infrared and visible light image adversarial fusion method with interactive compensation attention. It aims to solve the technical problem that existing deep learning fusion methods extract only local image features and cannot model the local feature interaction and global feature compensation relationships between the two image types, which easily leads to an unbalanced fusion result, i.e. the fused image cannot effectively retain typical infrared targets and visible texture details at the same time. The technical scheme is as follows:
an infrared and visible light image adversarial fusion method with interactive compensation attention, comprising:
S1, taking as the input of a pre-trained interactive compensation generator the triple paths consisting of the infrared path and the visible light path corresponding to the infrared image and the visible light image to be fused, and the connection path obtained by channel-connecting the two images; the interactive compensation generator establishes a triple-path multi-scale encoding-decoding network framework comprising an interactive attention coding network, a fusion layer and a compensation attention decoding network;
S2, extracting the multi-scale depth features of the triple paths with the four 3×3 convolutional layers of the interactive attention coding network, where the first and second convolutional layers use stride 1 and extract shallow image features, and the third and fourth convolutional layers use stride 2 and extract multi-scale depth features of the image; the shallow features and multi-scale depth features pass through three levels of interactive attention to obtain the final interactive attention map;
S3, directly channel-connecting, through the fusion layer, the final interactive attention map with the compensation attention maps obtained from the fourth convolutional layers of the infrared path and the visible light path, to obtain a fused attention feature map;
S4, reconstructing features with the four 3×3 convolutional layers of the compensation attention decoding network, where the first and second convolutional layers are each accompanied by an upsampling operation; after the upsampling and convolution of the first convolutional layer, the fused attention feature map is channel-connected with the infrared-path and visible-path compensation attention maps of the corresponding scale, to obtain the fused image.
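Purely as an illustration of how steps S1 to S4 are invoked at inference time, the following Python (PyTorch) sketch assembles the triple-path input and passes it to a trained generator; the names fuse and generator are illustrative and not taken from the filing:

```python
import torch

def fuse(ir, vis, generator):
    """ir, vis: single-channel tensors of shape (1, 1, H, W), scaled to [-1, 1].
    generator: a trained interactive compensation generator (hypothetical module)."""
    concat = torch.cat([ir, vis], dim=1)    # S1: connection path from channel concatenation
    generator.eval()
    with torch.no_grad():
        fused = generator(ir, vis, concat)  # S2-S4 run inside the generator
    return fused
```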
Optionally, the numbers of input channels of the four convolutional layers of the infrared path and the visible light path of the interactive attention coding network are 1, 16 and 32, respectively, and the numbers of output channels are 16, 32 and 64, respectively; the numbers of input channels of the four convolutional layers of the connection path are 2, 16, 64 and 128, and the numbers of output channels are 16, 32 and 64, respectively; the activation function is PReLU. Starting from the second convolutional layer, the features of the infrared path and the visible light path are each channel-connected with the features of the connection path, denoted Φ_m and Φ_n, and then input to the interactive attention module of the interactive attention coding network to generate an interactive attention fusion map, denoted Φ_F.
Optionally, the numbers of input channels of the four convolutional layers of the compensation attention decoding network are 384, 192, 96 and 32, respectively, the numbers of output channels are 128, 64, 32 and 1, respectively, and the activation function is PReLU.
Optionally, the interactive attention module operates on the input features Φ_m and Φ_n ∈ R^(H×W×C) as follows. First, in a channel attention model, the depth features are mapped to channel vectors by global average pooling and maximum pooling; after two convolutional layers and a PReLU activation layer, the output feature vectors are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the initial channel weighting coefficients Ŵ_m^C and Ŵ_n^C, respectively expressed as Ŵ_m^C = δ(Conv(Con(σ(Conv(AP(Φ_m))), σ(Conv(MP(Φ_m)))))) and Ŵ_n^C = δ(Conv(Con(σ(Conv(AP(Φ_n))), σ(Conv(MP(Φ_n)))))), where Conv denotes the convolution operation, Con denotes channel connection, AP(·) and MP(·) denote global average pooling and maximum pooling, σ and δ denote the PReLU and Sigmoid activation functions, H and W denote the height and width of the image, and C denotes the number of input channels;
then a Softmax operation is applied to obtain the corresponding final channel weighting coefficients W_m^C and W_n^C, respectively expressed as W_m^C = Softmax(Ŵ_m^C) and W_n^C = Softmax(Ŵ_n^C);
the final channel weighting coefficients are multiplied with the respective input features to obtain the corresponding channel interactive attention maps Φ_m^C and Φ_n^C, respectively expressed as Φ_m^C = W_m^C ⊗ Φ_m and Φ_n^C = W_n^C ⊗ Φ_n;
then the corresponding channel interactive attention maps are taken as the input of a spatial attention model; global average pooling and maximum pooling are applied, the output spatial feature maps are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the respective initial spatial weighting coefficients Ŵ_m^S and Ŵ_n^S, respectively expressed as Ŵ_m^S = δ(Conv(Con(AP(Φ_m^C), MP(Φ_m^C)))) and Ŵ_n^S = δ(Conv(Con(AP(Φ_n^C), MP(Φ_n^C))));
then a Softmax operation is used to obtain the final spatial weighting coefficients W_m^S and W_n^S, respectively expressed as W_m^S = Softmax(Ŵ_m^S) and W_n^S = Softmax(Ŵ_n^S);
the final spatial weighting coefficients are multiplied with the corresponding channel attention maps to obtain the corresponding spatial interactive attention maps Φ_m^S and Φ_n^S, respectively expressed as Φ_m^S = W_m^S ⊗ Φ_m^C and Φ_n^S = W_n^S ⊗ Φ_n^C;
finally, the two spatial interactive attention maps are channel-connected to obtain the interactive attention fusion map Φ_F, expressed as Φ_F = Con(Φ_m^S, Φ_n^S).
Optionally, the compensation attention module operates on an input infrared image feature or visible light image feature Φ_m ∈ R^(H×W×C) as follows. First, in a channel attention model, the feature map is converted to channel vectors by global average pooling and maximum pooling; after two convolutional layers and a PReLU activation layer, the output feature vectors are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the channel weighting coefficient W_m^C, expressed as W_m^C = δ(Conv(Con(σ(Conv(AP(Φ_m))), σ(Conv(MP(Φ_m)))))), where H and W denote the height and width of the image and C denotes the number of input channels;
then the channel weighting coefficient is multiplied with the input feature to obtain the corresponding channel attention map Φ_m^C, expressed as Φ_m^C = W_m^C ⊗ Φ_m;
then the channel attention map is taken as the input of a spatial attention model; global average pooling and maximum pooling are applied, the output spatial feature maps are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the spatial weighting coefficient W_m^S, expressed as W_m^S = δ(Conv(Con(AP(Φ_m^C), MP(Φ_m^C))));
finally, the spatial weighting coefficient is multiplied with the input channel attention map to obtain the corresponding spatial attention map Φ_m^S, expressed as Φ_m^S = W_m^S ⊗ Φ_m^C.
Optionally, before S1, the method further comprises:
S01, constructing the interactive compensation generator: with the infrared path, the visible light path and the connection path formed by channel-connecting the infrared image and the visible light image as input, establishing a triple-path multi-scale encoding-decoding network framework comprising an interactive attention coding network, a fusion layer and a compensation attention decoding network, used to generate an initial fused image;
the interactive attention coding network extracts the multi-scale depth features of the triple paths with four 3×3 convolutional layers, where the first and second convolutional layers use stride 1 and extract shallow image features, and the third and fourth convolutional layers use stride 2 and extract multi-scale depth features of the image; the numbers of input channels of the four convolutional layers of the infrared path and the visible light path are 1, 16 and 32, respectively, and the numbers of output channels are 16, 32 and 64, respectively; the numbers of input channels of the four convolutional layers of the connection path are 2, 16, 64 and 128, and the numbers of output channels are 16, 32 and 64, respectively; the activation function is PReLU; starting from the second convolutional layer, the features of the infrared path and the visible light path are each channel-connected with the features of the connection path, denoted Φ_m and Φ_n, and then input to the interactive attention module to generate an interactive attention fusion map, denoted Φ_F; the final interactive attention map is obtained after three levels of interactive attention;
the fusion layer directly connects the final interactive attention map with the compensation attention map of the fourth convolution layer of the infrared path and the visible light path through a channel to obtain a fusion attention feature map;
the compensation attention decoding network respectively adopts convolution layers with 4 convolution kernels of 3 multiplied by 3 to reconstruct characteristics, wherein the first convolution layer and the second convolution layer are accompanied by an up-sampling operation; the number of input channels of the four convolutional layers is 384, 192, 96 and 32 respectively, the number of output channels is 128, 64, 32 and 1 respectively, and the activation function is PReLU; the fused attention feature map is subjected to up-sampling operation and first-layer convolution, and the obtained output is in channel connection with the infrared path compensation attention map and the visible path compensation attention map of the corresponding scales, so that an initial fused image is finally obtained;
S02, constructing a dual-discriminator model comprising an infrared discriminator and a visible light discriminator; during training, the initial fused image produced by the interactive compensation generator is input, together with the infrared image and the visible light image, to the corresponding discriminators, so as to constrain the fused image to have data distributions similar to those of the infrared image and the visible light image, respectively; when the adversarial game among the interactive compensation generator, the infrared discriminator and the visible light discriminator reaches equilibrium, the final fusion result is obtained;
the infrared discriminator and the visible light discriminator have the same network structure and are composed of 4 convolutional layers and 1 full-connection layer, all convolutional layers adopt 3 x3 kernel size and LeakyRelu activation function, the step length is 2, the input channels of the corresponding convolutional layers are respectively 1, 16, 32 and 64, and the number of output channels is respectively 16, 32, 64 and 128;
s03, training a network model: taking the infrared image and the visible light image as training data sets, and adopting a loss function representing the pixel intensity of the infrared image and the edge gradient of the visible light image to supervise network model training to obtain optimal network model parameters;
the loss function comprises an interaction compensation generator loss function and a discriminator loss function; in the interactive compensation generator, the loss function is formed by a competing loss function L adv And a content loss function L con Composition, represented as L G =L adv +L con (ii) a The content loss function of the interaction compensation generator may be expressed as
Figure BDA0003589223530000061
Wherein H and W respectively represent the height and width of the image, | · | | purple F And | · | non-conducting phosphor 1 Representing the Frobenius norm, the L1 norm,
Figure BDA0003589223530000062
representing gradient operators, I f Representing the initial fused image, I ir Representing an infrared image, I vis Representing a visible light image; in the infrared discriminator and the visible light discriminator, the resistance loss function is expressed as
Figure BDA0003589223530000063
N represents the number of training images; meanwhile, the respective loss functions of the infrared discriminator and the visible light discriminator are respectively expressed as
Figure BDA0003589223530000064
And
Figure BDA0003589223530000065
wherein λ is a regularization parameter, | | · | computation circuitry 2 Represents the L2 norm; first item representationThe wasserstein distance between the fusion result and the infrared or visible light image, and the second term is a gradient penalty for limiting the learning ability of the infrared discriminator and the visible light discriminator.
Optionally, the training data set uses 25 pairs of infrared and visible light images from the TNO data set; the original images are segmented into 128×128 patches with a sliding window of stride 12 and their gray values are converted to the range [-1, 1], finally yielding 18813 image pairs as the training set;
during training, an Adam optimizer is used to update the network model parameters, and the batch size and the number of epochs are set to 4 and 16, respectively; the learning rates of the interactive compensation generator and the discriminators are set to 1×10^-4 and 4×10^-4, respectively, with the corresponding iteration numbers set to 1 and 2; the regularization parameter λ is set to 10.
By means of the scheme, the invention has the following characteristics:
1. In the interactive compensation generator, a triple-path multi-scale encoder-decoder network is constructed. Under the action of the network's interactive attention module and compensation attention module, the infrared path and the visible light path provide additional intensity and gradient information for the connection path, so that more prominent infrared targets and richer texture details can be retained in the fused image.
2. The invention develops an interactive attention module and a compensation attention module to transfer path features and model global features along the channel and spatial dimensions, enhancing the feature extraction and feature reconstruction capabilities; the resulting attention feature maps focus on infrared image target perception and visible light image texture detail representation.
3. When training the interactive compensation generator, the invention designs dual discriminators comprising an infrared discriminator and a visible light discriminator and optimizes the interactive compensation generator through them. Using the infrared and visible light discriminators constrains the data-distribution similarity between the fusion result and the source images more evenly, so that the interactive compensation generator produces a more balanced fusion result and acquires more similar pixel distributions and finer texture detail information from the source images.
4. The invention provides an end-to-end (i.e. the pre-trained network model is identical to the test network model, and no additional fusion rule needs to be added at test time) generative adversarial fusion method for infrared and visible light images; the fusion effect is significantly improved, the method can also be applied to the fusion of multi-modal, multi-focus and medical images, and it has high application value in the field of image fusion.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a schematic diagram of a process of fusing an infrared image to be fused and a visible light image to be fused through an interactive attention coding network, a fusion layer and a compensation attention decoding network.
FIG. 3 is a data processing diagram of the interactive attention module.
FIG. 4 is a data processing diagram of the compensate attention module.
FIG. 5 is a schematic diagram of a training process of the interactive compensation generator.
FIG. 6 is a schematic comparison of the first set of Solider_with_jeep fusion results.
FIG. 7 is a comparison of the second set of Street fusion results.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention, but are not intended to limit the scope of the invention.
As shown in FIG. 1, the present invention provides an infrared and visible light image adversarial fusion method with interactive compensation attention, which comprises:
S1, taking as the input of a pre-trained interactive compensation generator the triple paths consisting of the infrared path and the visible light path corresponding to the infrared image and the visible light image to be fused, and the connection path obtained by channel-connecting the two images; the interactive compensation generator establishes a triple-path multi-scale encoding-decoding network framework comprising an interactive attention coding network, a fusion layer and a compensation attention decoding network.
S2, extracting the multi-scale depth features of the triple paths with the four 3×3 convolutional layers of the interactive attention coding network, where the first and second convolutional layers use stride 1 and extract shallow image features, and the third and fourth convolutional layers use stride 2 and extract multi-scale depth features of the image; the shallow features and multi-scale depth features pass through three levels of interactive attention to obtain the final interactive attention map.
The numbers of input channels of the four convolutional layers of the infrared path and the visible light path of the interactive attention coding network are 1, 16 and 32, respectively, and the numbers of output channels are 16, 32 and 64, respectively; the numbers of input channels of the four convolutional layers of the connection path are 2, 16, 64 and 128, and the numbers of output channels are 16, 32 and 64, respectively; the activation function is PReLU. Starting from the second convolutional layer, the features of the infrared path and the visible light path are each channel-connected with the features of the connection path (corresponding to C in FIGS. 2 to 5), denoted Φ_m and Φ_n, and then input to the interactive attention module of the interactive attention coding network (Inter_Att in FIG. 2) to generate an interactive attention fusion map, denoted Φ_F.
S3, directly channel-connecting, through the fusion layer, the final interactive attention map with the compensation attention maps obtained from the fourth convolutional layers of the infrared path and the visible light path, to obtain a fused attention feature map.
S4, reconstructing features with the four 3×3 convolutional layers of the compensation attention decoding network, where the first and second convolutional layers are each accompanied by an upsampling operation (Upsampling in FIG. 2); after the upsampling and convolution of the first convolutional layer, the fused attention feature map is channel-connected with the infrared-path and visible-path compensation attention maps of the corresponding scale, to obtain the fused image.
The numbers of input channels of the four convolutional layers of the compensation attention decoding network are 384, 192, 96 and 32, and the numbers of output channels are 128, 64, 32 and 1; the activation function is PReLU. In the compensation attention decoding network, the features of different scales obtained by the infrared path and the visible light path in the interactive attention coding network through the compensation attention module (Comp_Att in FIG. 2) are channel-connected with the features of the corresponding scale in the connection path, and the reconstruction of the feature map is completed together with the upsampling operations to obtain the initial fused image. The infrared path and the visible light path provide additional intensity and gradient information for the connection path, improving the feature decoding capability.
FIG. 2 is a schematic diagram of the process of fusing the infrared image and the visible light image to be fused through the interactive attention coding network, the fusion layer and the compensation attention decoding network. In FIG. 2, Conv denotes a convolution operation, k3 denotes a 3×3 convolution kernel, s1 denotes a convolution with stride 1, and In16 denotes 16 output channels; the remaining parameters in FIG. 2 follow the same convention.
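As a rough PyTorch sketch of the layer layout just described (four 3×3 convolutions per path, stride 1 then stride 2 in the encoder, upsampling in the first two decoder stages), the following illustrative modules may help; the channel widths only partially follow the counts listed above and are otherwise assumptions:

```python
import torch
import torch.nn as nn

def conv_prelu(in_ch, out_ch, stride=1):
    # basic 3x3 convolution + PReLU block used throughout the generator
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.PReLU(),
    )

class PathEncoder(nn.Module):
    """One encoder path: layers 1-2 use stride 1 (shallow features),
    layers 3-4 use stride 2 (multi-scale depth features)."""
    def __init__(self, in_ch=1, widths=(16, 32, 64, 128)):   # widths partly assumed
        super().__init__()
        self.conv1 = conv_prelu(in_ch, widths[0], stride=1)
        self.conv2 = conv_prelu(widths[0], widths[1], stride=1)
        self.conv3 = conv_prelu(widths[1], widths[2], stride=2)
        self.conv4 = conv_prelu(widths[2], widths[3], stride=2)

    def forward(self, x):
        f1 = self.conv1(x)
        f2 = self.conv2(f1)
        f3 = self.conv3(f2)
        f4 = self.conv4(f3)
        return f1, f2, f3, f4   # kept for the compensation attention skip connections

class DecoderStage(nn.Module):
    """One decoder stage: optional 2x upsampling (first two stages only) followed by a
    3x3 convolution; its output is channel-concatenated with the compensation
    attention map of the matching scale before the next stage."""
    def __init__(self, in_ch, out_ch, upsample=True):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='nearest') if upsample else nn.Identity()
        self.conv = conv_prelu(in_ch, out_ch, stride=1)

    def forward(self, x, skip=None):
        x = self.conv(self.up(x))
        if skip is not None:
            x = torch.cat([x, skip], dim=1)
        return x
```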
Optionally, as shown in FIG. 3, the interactive attention module operates on the input features Φ_m and Φ_n ∈ R^(H×W×C) as follows. First, in a channel attention model, the depth features are mapped to channel vectors by global average pooling and maximum pooling; after two convolutional layers and a PReLU activation layer, the output feature vectors are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the initial channel weighting coefficients Ŵ_m^C and Ŵ_n^C, respectively expressed as Ŵ_m^C = δ(Conv(Con(σ(Conv(AP(Φ_m))), σ(Conv(MP(Φ_m)))))) and Ŵ_n^C = δ(Conv(Con(σ(Conv(AP(Φ_n))), σ(Conv(MP(Φ_n)))))), where Conv denotes the convolution operation, Con denotes channel connection, AP(·) and MP(·) denote global average pooling and maximum pooling, σ and δ denote the PReLU and Sigmoid activation functions, H and W denote the height and width of the image, and C denotes the number of input channels.
Then a Softmax operation is applied to obtain the corresponding final channel weighting coefficients W_m^C and W_n^C, respectively expressed as W_m^C = Softmax(Ŵ_m^C) and W_n^C = Softmax(Ŵ_n^C).
The final channel weighting coefficients are multiplied with the respective input features to obtain the corresponding channel interactive attention maps Φ_m^C and Φ_n^C, respectively expressed as Φ_m^C = W_m^C ⊗ Φ_m and Φ_n^C = W_n^C ⊗ Φ_n.
Next, the corresponding channel interactive attention maps are taken as the input of a spatial attention model; global average pooling and maximum pooling are applied, the output spatial feature maps are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the respective initial spatial weighting coefficients Ŵ_m^S and Ŵ_n^S, respectively expressed as Ŵ_m^S = δ(Conv(Con(AP(Φ_m^C), MP(Φ_m^C)))) and Ŵ_n^S = δ(Conv(Con(AP(Φ_n^C), MP(Φ_n^C)))).
Then a Softmax operation is used to obtain the final spatial weighting coefficients W_m^S and W_n^S, respectively expressed as W_m^S = Softmax(Ŵ_m^S) and W_n^S = Softmax(Ŵ_n^S).
The final spatial weighting coefficients are multiplied with the corresponding channel attention maps to obtain the corresponding spatial interactive attention maps Φ_m^S and Φ_n^S, respectively expressed as Φ_m^S = W_m^S ⊗ Φ_m^C and Φ_n^S = W_n^S ⊗ Φ_n^C.
Finally, the two spatial interactive attention maps are channel-connected to obtain the interactive attention fusion map Φ_F, expressed as Φ_F = Con(Φ_m^S, Φ_n^S).
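The following PyTorch sketch mirrors the processing order described above (dual pooling, convolutions with PReLU, Sigmoid weighting, Softmax, channel then spatial attention, final concatenation). It is only an interpretation: the exact formulas appear as figures in the original filing, and the channel reduction ratio, the 7×7 spatial convolution and the joint Softmax across the two paths are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveAttention(nn.Module):
    """Illustrative interactive attention module for two feature maps of equal shape."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        # two convolutional layers with a PReLU, applied to each pooled channel vector
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.PReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.channel_fuse = nn.Conv2d(2 * channels, channels, 1)  # after concatenating the AP/MP branches
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def _channel_weight(self, x):
        ap = self.channel_mlp(F.adaptive_avg_pool2d(x, 1))
        mp = self.channel_mlp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(self.channel_fuse(torch.cat([ap, mp], dim=1)))

    def _spatial_weight(self, x):
        avg_map = x.mean(dim=1, keepdim=True)
        max_map, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))

    def forward(self, phi_m, phi_n):
        # channel attention: initial weights per path, then Softmax across the two paths
        w = torch.softmax(torch.stack([self._channel_weight(phi_m),
                                       self._channel_weight(phi_n)]), dim=0)
        phi_m_c, phi_n_c = w[0] * phi_m, w[1] * phi_n
        # spatial attention on the channel-interactive maps, again normalised across paths
        s = torch.softmax(torch.stack([self._spatial_weight(phi_m_c),
                                       self._spatial_weight(phi_n_c)]), dim=0)
        phi_m_s, phi_n_s = s[0] * phi_m_c, s[1] * phi_n_c
        # interactive attention fusion map: channel concatenation of the two branches
        return torch.cat([phi_m_s, phi_n_s], dim=1)
```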
Optionally, as shown in FIG. 4, the compensation attention module operates on an input infrared image feature or visible light image feature Φ_m ∈ R^(H×W×C) as follows. First, in a channel attention model, the feature map is converted to channel vectors by global average pooling and maximum pooling; after two convolutional layers and a PReLU activation layer, the output feature vectors are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the channel weighting coefficient W_m^C, expressed as W_m^C = δ(Conv(Con(σ(Conv(AP(Φ_m))), σ(Conv(MP(Φ_m)))))), where H and W denote the height and width of the image and C denotes the number of input channels.
Then the channel weighting coefficient is multiplied with the input feature to obtain the corresponding channel attention map Φ_m^C, expressed as Φ_m^C = W_m^C ⊗ Φ_m.
Next, the channel attention map is taken as the input of a spatial attention model; global average pooling and maximum pooling are applied, the output spatial feature maps are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the spatial weighting coefficient W_m^S, expressed as W_m^S = δ(Conv(Con(AP(Φ_m^C), MP(Φ_m^C)))).
Finally, the spatial weighting coefficient is multiplied with the input channel attention map to obtain the corresponding spatial attention map Φ_m^S, expressed as Φ_m^S = W_m^S ⊗ Φ_m^C.
The interactive attention module and the compensation attention module are used for establishing a global dependency relationship of local features, realizing feature interaction and compensation of triple paths and enhancing feature extraction and feature reconstruction capabilities.
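To make the sequential channel-then-spatial weighting of the compensation attention module concrete, here is a hedged PyTorch sketch for a single path; as with the previous sketch, the layer shapes (reduction ratio, 7×7 spatial convolution) are assumptions, since the filing gives the formulas only as figures:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompensationAttention(nn.Module):
    """Illustrative compensation attention for one path (infrared or visible)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.PReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.channel_fuse = nn.Conv2d(2 * channels, channels, 1)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, phi):
        # channel weighting coefficient from average- and max-pooled channel descriptors
        ap = self.channel_mlp(F.adaptive_avg_pool2d(phi, 1))
        mp = self.channel_mlp(F.adaptive_max_pool2d(phi, 1))
        w_c = torch.sigmoid(self.channel_fuse(torch.cat([ap, mp], dim=1)))
        phi_c = w_c * phi                       # channel attention map
        # spatial weighting coefficient from channel-wise average and maximum maps
        avg_map = phi_c.mean(dim=1, keepdim=True)
        max_map, _ = phi_c.max(dim=1, keepdim=True)
        w_s = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return w_s * phi_c                      # spatial (compensation) attention map
```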
The above describes how the infrared image and the visible light image to be fused are fused. Before the interactive compensation generator can be used for this purpose, it must be trained in advance; the following describes the process of training the interactive compensation generator.
Specifically, the method for training the interactive compensation generator comprises the following steps:
S01, constructing the interactive compensation generator: with the infrared path, the visible light path and the connection path formed by channel-connecting the infrared image and the visible light image as input, establishing a triple-path multi-scale encoding-decoding network framework comprising an interactive attention coding network, a fusion layer and a compensation attention decoding network, used to generate an initial fused image.
The interactive attention coding network extracts the multi-scale depth features of the triple paths with four 3×3 convolutional layers, where the first and second convolutional layers use stride 1 and extract shallow image features, and the third and fourth convolutional layers use stride 2 and extract multi-scale depth features of the image; the numbers of input channels of the four convolutional layers of the infrared path and the visible light path are 1, 16 and 32, respectively, and the numbers of output channels are 16, 32 and 64, respectively; the numbers of input channels of the four convolutional layers of the connection path are 2, 16, 64 and 128, and the numbers of output channels are 16, 32 and 64, respectively; the activation function is PReLU. Starting from the second convolutional layer, the features of the infrared path and the visible light path are each channel-connected with the features of the connection path, denoted Φ_m and Φ_n, and then input to the interactive attention module to generate an interactive attention fusion map, denoted Φ_F; the final interactive attention map is obtained after three levels of interactive attention;
the fusion layer directly connects the final interactive attention map with the compensation attention map of the fourth convolution layer of the infrared path and the visible light path through a channel to obtain a fusion attention feature map;
the compensation attention decoding network respectively adopts convolution layers with 4 convolution kernels of 3 multiplied by 3 to reconstruct characteristics, wherein the first convolution layer and the second convolution layer are accompanied by an up-sampling operation; the number of input channels of the four convolutional layers is 384, 192, 96 and 32 respectively, the number of output channels is 128, 64, 32 and 1 respectively, and the activation function is PReLU; the fused attention feature map is subjected to up-sampling operation and first-layer convolution, and the obtained output is in channel connection with the infrared path compensation attention map and the visible path compensation attention map of the corresponding scales, so that an initial fused image is finally obtained.
S02, constructing a dual-discriminator model comprising an infrared discriminator and a visible light discriminator; during training, the initial fused image produced by the interactive compensation generator is input, together with the infrared image and the visible light image, to the corresponding discriminators, so as to constrain the fused image to have data distributions similar to those of the infrared image and the visible light image, respectively; when the adversarial game among the interactive compensation generator, the infrared discriminator and the visible light discriminator reaches equilibrium, the final fusion result is obtained.
The infrared discriminator drives the fused image to retain as much infrared pixel intensity information as possible, while the visible light discriminator drives it to contain as much visible light detail information as possible. The final fusion result, obtained when the adversarial game is balanced, therefore possesses both the infrared pixel intensity and the visible light texture detail information of the source images.
The infrared discriminator and the visible light discriminator have the same network structure, each consisting of 4 convolutional layers and 1 fully connected layer; all convolutional layers use a 3×3 kernel, a LeakyReLU activation function and stride 2; the numbers of input channels of the corresponding convolutional layers are 1, 16, 32 and 64, respectively, and the numbers of output channels are 16, 32, 64 and 128, respectively.
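A minimal PyTorch sketch of this discriminator structure, assuming 128×128 input patches (matching the training crops) and a LeakyReLU slope of 0.2 (the slope is not stated in the filing):

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Shared structure of the infrared and visible light discriminators:
    four 3x3 stride-2 convolutions with LeakyReLU, then one fully connected layer."""
    def __init__(self, patch_size=128):
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        layers = []
        for i in range(4):
            layers += [nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
        self.features = nn.Sequential(*layers)
        # a 128x128 input becomes an 8x8 feature map after four stride-2 convolutions
        self.fc = nn.Linear(chans[-1] * (patch_size // 16) ** 2, 1)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))   # scalar critic score, no sigmoid
```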
S03, training the network model: the infrared image and the visible light image are used as the training data set, and loss functions representing the pixel intensity of the infrared image and the edge gradient of the visible light image are adopted to supervise network model training, obtaining the optimal network model parameters, i.e. the parameters of the optimal interactive compensation generator.
The loss function comprises the interactive compensation generator loss function and the discriminator loss functions. In the interactive compensation generator, the loss function consists of an adversarial loss L_adv and a content loss L_con, expressed as L_G = L_adv + L_con. Considering that the infrared image represents target features through pixel intensity while the visible light image represents scene texture through edges and gradients, the Frobenius norm is used to constrain the similarity of pixel intensity between the infrared image and the fused image, and the L1 norm is used to constrain the similarity of gradient changes between the visible light image and the fused image. The content loss of the interactive compensation generator can therefore be expressed as
L_con = (1/(H·W)) · ( ‖I_f − I_ir‖_F² + ‖∇I_f − ∇I_vis‖_1 ),
where H and W denote the height and width of the image, ‖·‖_F and ‖·‖_1 denote the Frobenius norm and the L1 norm, ∇ denotes the gradient operator, I_f denotes the initial fused image, I_ir denotes the infrared image and I_vis denotes the visible light image. In the dual discriminators, the infrared discriminator and the visible light discriminator aim to balance the authenticity of the fused image against the source images, forcing the generated fused image to approach the real data distributions of both the infrared image and the visible light image. For the infrared discriminator and the visible light discriminator, the adversarial loss is expressed as
L_adv = −(1/N) · Σ_{n=1..N} [ D_ir(I_f^n) + D_vis(I_f^n) ],
where N denotes the number of training images. The respective loss functions of the infrared discriminator and the visible light discriminator are expressed as
L_Dir = (1/N) · Σ_{n=1..N} [ D_ir(I_f^n) − D_ir(I_ir^n) ] + λ·( ‖∇_Î D_ir(Î)‖_2 − 1 )² and
L_Dvis = (1/N) · Σ_{n=1..N} [ D_vis(I_f^n) − D_vis(I_vis^n) ] + λ·( ‖∇_Î D_vis(Î)‖_2 − 1 )²,
where λ is a regularization parameter and ‖·‖_2 denotes the L2 norm; the first term represents the Wasserstein distance between the fusion result and the infrared or visible light image, and the second term is a gradient penalty that limits the learning ability of the infrared and visible light discriminators.
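The following PyTorch sketch shows one way to realise these losses under a WGAN-GP reading of the description (Wasserstein distance term plus gradient penalty); the exact formulas are given as figures in the original filing, and the equal weighting of the intensity and gradient terms in the content loss is an assumption:

```python
import torch
import torch.nn.functional as F

def image_gradient(img):
    # simple finite-difference gradient as a stand-in for the gradient operator
    dx = F.pad(img[..., :, 1:] - img[..., :, :-1], (0, 1))
    dy = F.pad(img[..., 1:, :] - img[..., :-1, :], (0, 0, 0, 1))
    return dx.abs() + dy.abs()

def content_loss(fused, ir, vis):
    # Frobenius-norm constraint on infrared pixel intensity, L1 constraint on visible gradients
    h, w = fused.shape[-2:]
    intensity = torch.norm(fused - ir, p='fro') ** 2
    texture = torch.norm(image_gradient(fused) - image_gradient(vis), p=1)
    return (intensity + texture) / (h * w)

def generator_adversarial_loss(d_ir, d_vis, fused):
    # Wasserstein-style adversarial term driven by both discriminators
    return -(d_ir(fused).mean() + d_vis(fused).mean())

def discriminator_loss(critic, real, fake, lam=10.0):
    # Wasserstein distance term plus a gradient penalty limiting the critic
    w_dist = critic(fake).mean() - critic(real).mean()
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    inter = (alpha * real + (1.0 - alpha) * fake.detach()).requires_grad_(True)
    grad = torch.autograd.grad(critic(inter).sum(), inter, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    return w_dist + lam * penalty
```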
The training data set uses 25 pairs of infrared and visible light images from the TNO data set; the original images are segmented into 128×128 patches with a sliding window of stride 12 and their gray values are converted to the range [-1, 1], finally yielding 18813 image pairs as the training set. During training, an Adam optimizer is used to update the network model parameters, and the batch size and the number of epochs are set to 4 and 16, respectively; the learning rates of the interactive compensation generator and the discriminators (the infrared discriminator and the visible light discriminator) are set to 1×10^-4 and 4×10^-4, respectively, with the corresponding iteration numbers set to 1 and 2; in the loss function, the regularization parameter λ is set to 10. The experimental training platform is an Intel i9-10850K CPU, 64 GB of memory and an NVIDIA GeForce GTX3090 GPU; the implementation environment is Python with PyTorch.
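As a small illustration of this training-data preparation (128×128 crops with a stride-12 sliding window, gray values mapped to [-1, 1]), consider the sketch below; the function name and the assumption of 8-bit grayscale input are illustrative:

```python
import numpy as np

def make_patches(img, size=128, stride=12):
    """img: 2-D uint8 grayscale array. Returns an array of normalised patches."""
    patches = []
    for top in range(0, img.shape[0] - size + 1, stride):
        for left in range(0, img.shape[1] - size + 1, stride):
            patch = img[top:top + size, left:left + size].astype(np.float32)
            patches.append(patch / 127.5 - 1.0)   # map [0, 255] to [-1, 1]
    return np.stack(patches)
```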
Further, in order to verify the image fusion effect of the interactive compensation generator obtained by training through the method, the embodiment of the invention also verifies the trained interactive compensation generator.
Specifically, in the testing phase, 22 image pairs from the TNO data set were selected for test validation. Nine representative methods were selected for comparison, including MDLatLRR, DenseFuse, IFCNN, Res2Fusion, SEDRFuse, RFN-Nest, PMGI, FusionGAN and GANMCC. In addition, eight metrics, namely average gradient (AG), information entropy (EN), standard deviation (SD), mutual information (MI), spatial frequency (SF), nonlinear correlation information entropy (NCIE), Qabf and visual information fidelity (VIF), were used as objective evaluation metrics. The verification results include the following two aspects.
(1) Subjective evaluation. FIG. 6 and FIG. 7 show the subjective comparison results for the two image pairs Solider_with_jeep and Street. By comparison, the fusion method of the invention has three advantages. First, the fusion result retains the high-brightness target information of the infrared image: for typical infrared targets, such as the car in FIG. 6 and the pedestrian in FIG. 7, the fusion results of the invention have brighter target features than the other methods. Second, the fusion result preserves the texture details of the visible light image: for representative details such as the house edge in FIG. 6 and the billboard in FIG. 7, the fusion results of the invention are more distinct and more precise than those of the other methods. Finally, the fusion result achieves higher contrast and a better visual effect. Compared with the source images and the other fusion results, the method of the invention better retains prominent target features and rich scene detail information and obtains a more balanced fusion result.
(2) Objective evaluation. Table 1 gives the objective comparison results on the 22 image pairs of the TNO data set. The optimal and suboptimal means are marked in bold and underlined, respectively. It can be seen that the method of the invention achieves the optimal mean on the metrics AG, EN, MI, SF, NCIE and VIF, and the suboptimal mean on SD and Qabf. The objective experiments show that the method has better fusion performance than the other methods. The maximal EN indicates that the abundant useful information of the source images is preserved; this is because the method uses a triple path, with the infrared path and the visible light path providing additional intensity and gradient information for the connection path. The maximal MI and NCIE show that the fusion result has strong correlation and similarity with the source images; this is because the method uses dual discriminators to supervise and optimize the interactive compensation generator, which produces a more balanced fusion result. The maximal AG, SF and VIF indicate that better image contrast and visual effects are obtained. By adopting the interactive attention module and the compensation attention module, the method establishes long-range dependencies of local features, and the obtained attention feature maps focus on infrared target perception and visible texture detail representation.
TABLE 1: Objective comparison results on the 22 image pairs of the TNO data set (the optimal and suboptimal means are marked in bold and underlined, respectively).
All of the above optional technical solutions may be combined arbitrarily; the combined structures are not described in detail here.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and it should be noted that, for those skilled in the art, many modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (7)

1. An infrared and visible light image adversarial fusion method with interactive compensation attention, characterized by comprising:
S1, taking as the input of a pre-trained interactive compensation generator the triple paths consisting of the infrared path and the visible light path corresponding to the infrared image and the visible light image to be fused, and the connection path obtained by channel-connecting the two images, wherein the interactive compensation generator establishes a triple-path multi-scale encoding-decoding network framework comprising an interactive attention coding network, a fusion layer and a compensation attention decoding network;
S2, extracting the multi-scale depth features of the triple paths with the four 3×3 convolutional layers of the interactive attention coding network, wherein the first and second convolutional layers use stride 1 and extract shallow image features, and the third and fourth convolutional layers use stride 2 and extract multi-scale depth features of the image; the shallow features and multi-scale depth features pass through three levels of interactive attention to obtain the final interactive attention map;
S3, directly channel-connecting, through the fusion layer, the final interactive attention map with the compensation attention maps obtained from the fourth convolutional layers of the infrared path and the visible light path, to obtain a fused attention feature map;
S4, reconstructing features with the four 3×3 convolutional layers of the compensation attention decoding network, wherein the first and second convolutional layers are each accompanied by an upsampling operation; after the upsampling and convolution of the first convolutional layer, the fused attention feature map is channel-connected with the infrared-path and visible-path compensation attention maps of the corresponding scale, to obtain the fused image.
2. The method of claim 1, wherein the numbers of input channels of the four convolutional layers of the infrared path and the visible light path of the interactive attention coding network are 1, 16 and 32, respectively, the numbers of output channels are 16, 32 and 64, respectively, the numbers of input channels of the four convolutional layers of the connection path are 2, 16, 64 and 128, respectively, the numbers of output channels are 16, 32 and 64, respectively, and the activation function is PReLU; starting from the second convolutional layer, the features of the infrared path and the visible light path are each channel-connected with the features of the connection path, denoted Φ_m and Φ_n, and then input to the interactive attention module of the interactive attention coding network to generate an interactive attention fusion map, denoted Φ_F.
3. The method as claimed in claim 1, wherein the numbers of input channels of the four convolutional layers of the compensation attention decoding network are 384, 192, 96 and 32, the numbers of output channels are 128, 64, 32 and 1, and the activation function is PReLU.
4. The infrared and visible light image anti-fusion method for interactively compensating attention of claim 2, wherein in the interactive attention module, for the input features $\Phi_m$ and $\Phi_n$:
first, in the channel attention model, the depth features are mapped to channel vectors by a global average pooling operation and a maximum pooling operation; after each pooled vector passes through two convolutional layers with a PReLU activation layer, the output feature vectors are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the initial channel weighting coefficients $\tilde{W}_c^m$ and $\tilde{W}_c^n$, respectively expressed as
$\tilde{W}_c^m = \delta\big(\mathrm{Conv}\big(\mathrm{Con}\big[\mathrm{Conv}(\sigma(\mathrm{Conv}(\mathrm{AP}(\Phi_m)))),\ \mathrm{Conv}(\sigma(\mathrm{Conv}(\mathrm{MP}(\Phi_m))))\big]\big)\big)$ and
$\tilde{W}_c^n = \delta\big(\mathrm{Conv}\big(\mathrm{Con}\big[\mathrm{Conv}(\sigma(\mathrm{Conv}(\mathrm{AP}(\Phi_n)))),\ \mathrm{Conv}(\sigma(\mathrm{Conv}(\mathrm{MP}(\Phi_n))))\big]\big)\big)$,
wherein Conv denotes a convolution operation, Con denotes a channel connection operation, AP(·) and MP(·) denote the global average pooling and maximum pooling operations, respectively, σ and δ denote the PReLU and Sigmoid activation functions, H and W denote the height and width of the image, respectively, and C denotes the number of input channels;
then, a Softmax operation is adopted to obtain the corresponding final channel weighting coefficients $W_c^m$ and $W_c^n$, respectively expressed as
$W_c^m = e^{\tilde{W}_c^m}/(e^{\tilde{W}_c^m}+e^{\tilde{W}_c^n})$ and $W_c^n = e^{\tilde{W}_c^n}/(e^{\tilde{W}_c^m}+e^{\tilde{W}_c^n})$;
the final channel weighting coefficients are multiplied with the respective input features to obtain the corresponding channel interactive attention maps $\Phi_c^m$ and $\Phi_c^n$, respectively expressed as
$\Phi_c^m = W_c^m \otimes \Phi_m$ and $\Phi_c^n = W_c^n \otimes \Phi_n$;
then, the corresponding channel interactive attention maps are taken as the input of the spatial attention model: a global average pooling operation and a maximum pooling operation are performed, and the output spatial feature maps are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the respective initial spatial weighting coefficients $\tilde{W}_s^m$ and $\tilde{W}_s^n$, respectively expressed as
$\tilde{W}_s^m = \delta(\mathrm{Conv}(\mathrm{Con}[\mathrm{AP}(\Phi_c^m),\ \mathrm{MP}(\Phi_c^m)]))$ and $\tilde{W}_s^n = \delta(\mathrm{Conv}(\mathrm{Con}[\mathrm{AP}(\Phi_c^n),\ \mathrm{MP}(\Phi_c^n)]))$;
then, the final spatial weighting coefficients $W_s^m$ and $W_s^n$ are obtained by a Softmax operation, respectively expressed as
$W_s^m = e^{\tilde{W}_s^m}/(e^{\tilde{W}_s^m}+e^{\tilde{W}_s^n})$ and $W_s^n = e^{\tilde{W}_s^n}/(e^{\tilde{W}_s^m}+e^{\tilde{W}_s^n})$;
the final spatial weighting coefficients are multiplied with the corresponding channel interactive attention maps to obtain the corresponding spatial interactive attention maps $\Phi_s^m$ and $\Phi_s^n$, respectively expressed as
$\Phi_s^m = W_s^m \otimes \Phi_c^m$ and $\Phi_s^n = W_s^n \otimes \Phi_c^n$;
finally, the two spatial interactive attention maps are channel-connected to obtain the interactive attention fusion map $\Phi_F$, expressed as
$\Phi_F = \mathrm{Con}[\Phi_s^m,\ \Phi_s^n]$.
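The channel-then-spatial interactive attention of claim 4 can be pictured with the sketch below. It is written in PyTorch and reconstructed from the prose only (the filed formulas are image placeholders above), so the separate pooling branches, the reduction ratio, the 7×7 spatial kernel and the exact form of the cross-branch softmax coupling are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveAttention(nn.Module):
    """Sketch of the interactive attention module of claim 4 (assumed details).

    Channel step: AP/MP pooled vectors -> two 1x1 convs with a PReLU -> concatenated
    -> 1x1 conv + Sigmoid -> initial weights; a softmax across the two branches gives
    the final, mutually normalised ("interactive") weights. Spatial step: the same
    pattern on channel-wise average/max maps.
    """

    def __init__(self, channels, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        def channel_mlp():
            return nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1),
                nn.PReLU(),
                nn.Conv2d(channels // reduction, channels, 1),
            )
        self.mlp_ap = channel_mlp()
        self.mlp_mp = channel_mlp()
        self.channel_fuse = nn.Conv2d(2 * channels, channels, 1)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)  # kernel size assumed

    def _channel_weight(self, x):
        ap = self.mlp_ap(F.adaptive_avg_pool2d(x, 1))
        mp = self.mlp_mp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(self.channel_fuse(torch.cat([ap, mp], dim=1)))

    def _spatial_weight(self, x):
        ap = x.mean(dim=1, keepdim=True)
        mp = x.max(dim=1, keepdim=True).values
        return torch.sigmoid(self.spatial_conv(torch.cat([ap, mp], dim=1)))

    def forward(self, phi_m, phi_n):
        # Final channel weights: softmax over the two branches (the "interaction").
        wc_m, wc_n = self._channel_weight(phi_m), self._channel_weight(phi_n)
        wc = torch.softmax(torch.stack([wc_m, wc_n]), dim=0)
        phi_c_m, phi_c_n = wc[0] * phi_m, wc[1] * phi_n

        # Final spatial weights, again softmax-coupled across the two branches.
        ws_m, ws_n = self._spatial_weight(phi_c_m), self._spatial_weight(phi_c_n)
        ws = torch.softmax(torch.stack([ws_m, ws_n]), dim=0)
        phi_s_m, phi_s_n = ws[0] * phi_c_m, ws[1] * phi_c_n

        # Interactive attention fusion map: channel concatenation of the two maps.
        return torch.cat([phi_s_m, phi_s_n], dim=1)
```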
5. The infrared and visible light image anti-fusion method for interactively compensating attention of claim 3, wherein in the attention compensation module, for the input infrared image features or visible light image features $\Phi$:
first, in the channel attention model, the feature maps are mapped to a channel vector by a global average pooling operation and a maximum pooling operation; after each pooled vector passes through two convolutional layers with a PReLU activation layer, the output feature vectors are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the channel weighting coefficient $W_c$, expressed as
$W_c = \delta\big(\mathrm{Conv}\big(\mathrm{Con}\big[\mathrm{Conv}(\sigma(\mathrm{Conv}(\mathrm{AP}(\Phi)))),\ \mathrm{Conv}(\sigma(\mathrm{Conv}(\mathrm{MP}(\Phi))))\big]\big)\big)$,
wherein H and W denote the height and width of the image, respectively, and C denotes the number of input channels;
then, the channel weighting coefficient is multiplied with the input features to obtain the corresponding channel attention map $\Phi_c$, expressed as
$\Phi_c = W_c \otimes \Phi$;
then, the channel attention map is taken as the input of the spatial attention model: a global average pooling operation and a maximum pooling operation are performed, and the output spatial feature maps are channel-connected and input to a convolutional layer and a Sigmoid activation layer to obtain the spatial weighting coefficient $W_s$, expressed as
$W_s = \delta(\mathrm{Conv}(\mathrm{Con}[\mathrm{AP}(\Phi_c),\ \mathrm{MP}(\Phi_c)]))$;
finally, the spatial weighting coefficient is multiplied with the input channel attention map to obtain the corresponding spatial attention map $\Phi_s$, expressed as
$\Phi_s = W_s \otimes \Phi_c$.
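Claim 5's compensation attention module is the same channel-then-spatial pattern applied to a single path, without the cross-branch softmax. Below is a minimal sketch under the same assumptions as the previous block (reduction ratio and spatial kernel size are guesses).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompensationAttention(nn.Module):
    """Single-branch channel + spatial attention as described in claim 5 (assumed details)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        def channel_mlp():
            return nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1),
                nn.PReLU(),
                nn.Conv2d(channels // reduction, channels, 1),
            )
        self.mlp_ap, self.mlp_mp = channel_mlp(), channel_mlp()
        self.channel_fuse = nn.Conv2d(2 * channels, channels, 1)
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, phi):
        # Channel weighting: pooled vectors -> conv/PReLU/conv -> concat -> conv -> sigmoid.
        ap = self.mlp_ap(F.adaptive_avg_pool2d(phi, 1))
        mp = self.mlp_mp(F.adaptive_max_pool2d(phi, 1))
        w_c = torch.sigmoid(self.channel_fuse(torch.cat([ap, mp], dim=1)))
        phi_c = w_c * phi                      # channel attention map

        # Spatial weighting computed on the channel attention map.
        sp = torch.cat([phi_c.mean(dim=1, keepdim=True),
                        phi_c.max(dim=1, keepdim=True).values], dim=1)
        w_s = torch.sigmoid(self.spatial_conv(sp))
        return w_s * phi_c                     # compensation (spatial) attention map
```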
6. The infrared and visible light image anti-fusion method for interactively compensating attention of claim 1, further comprising, before the step S1:
S01, constructing an interactive compensation generator: establishing a three-path multi-scale coding-decoding network framework whose inputs are an infrared path, a visible light path, and a connection path formed by channel-connecting the infrared image and the visible light image, the framework comprising an interactive attention coding network, a fusion layer and a compensation attention decoding network and being used for generating an initial fused image;
the interactive attention coding network extracts the multi-scale depth features of the three paths using four convolutional layers with 3×3 kernels in each path, wherein the first and second convolutional layers have a stride of 1 and extract shallow image features, and the third and fourth convolutional layers have a stride of 2 and extract multi-scale depth features of the image; the numbers of input channels of the four convolutional layers of the infrared path and the visible light path are 1, 16 and 32, respectively, the numbers of output channels are 16, 32 and 64, respectively, the numbers of input channels of the four convolutional layers of the connection path are 2, 16, 64 and 128, respectively, the numbers of output channels are 16, 32 and 64, respectively, and the activation function is PReLU; starting from the second convolutional layer, the features of the infrared path and the visible light path are each channel-connected with the features of the connection path, recorded as $\Phi_m$ and $\Phi_n$, and then input to the interactive attention module to generate an interactive attention fusion map, recorded as $\Phi_F$; the final interactive attention map is obtained after three levels of interactive attention;
the fusion layer directly channel-connects the final interactive attention map with the compensation attention maps of the fourth convolutional layer of the infrared path and the visible light path to obtain a fused attention feature map;
the compensation attention decoding network reconstructs the features using four convolutional layers with 3×3 kernels, wherein the first and second convolutional layers are each combined with an up-sampling operation; the numbers of input channels of the four convolutional layers are 384, 192, 96 and 32, respectively, the numbers of output channels are 128, 64, 32 and 1, respectively, and the activation function is PReLU; the fused attention feature map is up-sampled and passed through the first convolutional layer, and the resulting output is channel-connected with the infrared path compensation attention map and the visible light path compensation attention map of the corresponding scale, so that the initial fused image is finally obtained;
S02, constructing a dual discriminator comprising an infrared discriminator and a visible light discriminator: in the training process, the initial fused image obtained by the interactive compensation generator, together with the infrared image and the visible light image, is input to the corresponding discriminators so as to constrain the fused image to have a data distribution similar to that of the infrared image and of the visible light image, respectively; when the adversarial game among the interactive compensation generator, the infrared discriminator and the visible light discriminator reaches equilibrium, the final fusion result is obtained;
the infrared discriminator and the visible light discriminator have the same network structure, each consisting of four convolutional layers and one fully connected layer; all convolutional layers use a 3×3 kernel, a LeakyReLU activation function and a stride of 2, the numbers of input channels of the corresponding convolutional layers are 1, 16, 32 and 64, respectively, and the numbers of output channels are 16, 32, 64 and 128, respectively;
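A minimal sketch of the shared discriminator topology just described (four stride-2 3×3 convolutions plus one fully connected layer); the LeakyReLU slope and the 128×128 input size used to dimension the fully connected layer are assumptions, the latter borrowed from the training patches of claim 7.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Shared topology of the infrared and visible light discriminators (sketch)."""

    def __init__(self, negative_slope=0.2):  # slope not given in the claims
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(negative_slope)]
        self.features = nn.Sequential(*layers)
        # 128x128 input patches shrink to 8x8 after four stride-2 convolutions.
        self.fc = nn.Linear(128 * 8 * 8, 1)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

# Usage sketch: one instance per modality; the raw score (no sigmoid) suits the
# Wasserstein-style losses described in the next step.
d_ir, d_vis = Discriminator(), Discriminator()
score = d_ir(torch.randn(4, 1, 128, 128))
```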
S03, training the network model: taking infrared and visible light images as the training data set, and supervising the training of the network model with a loss function that represents the pixel intensity of the infrared image and the edge gradient of the visible light image, to obtain the optimal network model parameters;
the loss function comprises an interaction compensation generator loss function and a discriminator loss function; in the interaction compensation generator, the loss function consists of an adversarial loss function $L_{adv}$ and a content loss function $L_{con}$, expressed as $L_G = L_{adv} + L_{con}$; the content loss function of the interaction compensation generator may be expressed as
$L_{con} = \frac{1}{HW}\big(\|I_f - I_{ir}\|_F^2 + \|\nabla I_f - \nabla I_{vis}\|_1\big)$,
wherein H and W respectively denote the height and width of the image, $\|\cdot\|_F$ and $\|\cdot\|_1$ denote the Frobenius norm and the L1 norm, $\nabla$ denotes the gradient operator, $I_f$ denotes the initial fused image, $I_{ir}$ denotes the infrared image, and $I_{vis}$ denotes the visible light image; with respect to the infrared discriminator and the visible light discriminator, the adversarial loss function is expressed as
$L_{adv} = -\frac{1}{N}\sum_{i=1}^{N}\big(D_{ir}(I_f^i) + D_{vis}(I_f^i)\big)$,
wherein N denotes the number of training images and $D_{ir}(\cdot)$ and $D_{vis}(\cdot)$ denote the outputs of the infrared and visible light discriminators; meanwhile, the respective loss functions of the infrared discriminator and the visible light discriminator are expressed as
$L_{D_{ir}} = \frac{1}{N}\sum_{i=1}^{N}\big(D_{ir}(I_f^i) - D_{ir}(I_{ir}^i)\big) + \lambda\big(\|\nabla_{\hat{I}} D_{ir}(\hat{I})\|_2 - 1\big)^2$ and
$L_{D_{vis}} = \frac{1}{N}\sum_{i=1}^{N}\big(D_{vis}(I_f^i) - D_{vis}(I_{vis}^i)\big) + \lambda\big(\|\nabla_{\hat{I}} D_{vis}(\hat{I})\|_2 - 1\big)^2$,
wherein λ is a regularization parameter, $\|\cdot\|_2$ denotes the L2 norm, and $\hat{I}$ is sampled between the fused image and the corresponding source image; the first term represents the Wasserstein distance between the fusion result and the infrared or visible light image, and the second term is a gradient penalty that limits the learning ability of the infrared discriminator and the visible light discriminator.
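The loss terms of step S03 could be implemented along the following lines. Because the filed formulas are image placeholders, the equal weighting of the intensity and gradient terms, the finite-difference gradient, and the WGAN-GP form of the discriminator losses are reconstructions and assumptions rather than the exact filed definitions.

```python
import torch
import torch.nn.functional as F

def gradient(img):
    """Simple finite-difference stand-in for the claim's gradient operator."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return F.pad(dx, (0, 1, 0, 0)).abs() + F.pad(dy, (0, 0, 0, 1)).abs()

def content_loss(i_f, i_ir, i_vis, mu=1.0):
    """Pixel-intensity term toward the infrared image plus gradient term toward the
    visible image; the trade-off weight mu is an assumption (not given in the claims)."""
    h, w = i_f.shape[-2:]
    intensity = torch.norm(i_f - i_ir, p='fro') ** 2
    edge = torch.norm(gradient(i_f) - gradient(i_vis), p=1)
    return (intensity + mu * edge) / (h * w)

def generator_adv_loss(d_ir, d_vis, i_f):
    """Generator adversarial term: raise the discriminator scores of the fused image."""
    return -(d_ir(i_f).mean() + d_vis(i_f).mean())

def discriminator_loss(d, real, fake, lam=10.0):
    """WGAN-GP style critic loss: Wasserstein term plus gradient penalty
    (lambda = 10 as stated in claim 7)."""
    fake = fake.detach()  # the critic update does not backpropagate into the generator
    w_dist = d(fake).mean() - d(real).mean()
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(d(x_hat).sum(), x_hat, create_graph=True)[0]
    gp = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return w_dist + lam * gp
```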
7. The infrared and visible light image anti-fusion method for interactively compensating attention of claim 6, wherein the training data set uses 25 pairs of infrared and visible light images from the TNO dataset; the original images are divided into 128×128 patches using a sliding window with a step size of 12, the gray value range is converted to [-1, 1], and 18813 image pairs are finally obtained as the training set;
in the training process, an Adam optimizer is used to update the network model parameters, with the batch size and the number of epochs set to 4 and 16, respectively; the learning rates of the interactive compensation generator and the dual discriminator are set to 1×10^-4 and 4×10^-4, respectively, and the corresponding numbers of iterations are set to 1 and 2, respectively;
in the loss function, the regularization parameter λ is set to 10.
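Putting the hyper-parameters of claim 7 together, the training setup would look roughly like the sketch below. The generator and discriminator objects are placeholder modules introduced only so the optimizer wiring runs, and all names are illustrative; the negative exponents on the learning rates follow the corrected values above, which assume the published text lost a minus sign during extraction.

```python
import torch
from torch import nn, optim

# Placeholder modules standing in for the interactive compensation generator and the
# two discriminators; only the optimizer and hyper-parameter wiring is the point here.
generator = nn.Conv2d(2, 1, 3, padding=1)
d_ir, d_vis = nn.Conv2d(1, 1, 3), nn.Conv2d(1, 1, 3)

# Learning rates 1e-4 (generator) and 4e-4 (dual discriminator), per claim 7 as edited.
opt_g = optim.Adam(generator.parameters(), lr=1e-4)
opt_d = optim.Adam(list(d_ir.parameters()) + list(d_vis.parameters()), lr=4e-4)

batch_size, epochs = 4, 16
g_steps, d_steps = 1, 2   # iterations per training step: generator once, discriminators twice
lambda_gp = 10.0          # regularization parameter of the gradient penalty
```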
CN202210376347.0A 2022-04-11 2022-04-11 Infrared and visible light image anti-fusion method for interactively compensating attention Withdrawn CN115035003A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210376347.0A CN115035003A (en) 2022-04-11 2022-04-11 Infrared and visible light image anti-fusion method for interactively compensating attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210376347.0A CN115035003A (en) 2022-04-11 2022-04-11 Infrared and visible light image anti-fusion method for interactively compensating attention

Publications (1)

Publication Number Publication Date
CN115035003A true CN115035003A (en) 2022-09-09

Family

ID=83119944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210376347.0A Withdrawn CN115035003A (en) 2022-04-11 2022-04-11 Infrared and visible light image anti-fusion method for interactively compensating attention

Country Status (1)

Country Link
CN (1) CN115035003A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200394409A1 (en) * 2019-01-03 2020-12-17 Lucomm Technologies, Inc. System for physical-virtual environment fusion
CN113706406A (en) * 2021-08-11 2021-11-26 武汉大学 Infrared and visible light image fusion method based on feature space multi-classification countermeasure mechanism
CN114187214A (en) * 2021-11-12 2022-03-15 国网辽宁省电力有限公司电力科学研究院 Infrared and visible light image fusion system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHISHE WANG et al.: "Infrared and Visible Image Fusion via Interactive Compensatory Attention Adversarial Learning", ARXIV, 29 March 2022 (2022-03-29), pages 1 - 13 *
RAN XIN; REN LEI: "Detection method for weak and small targets on water based on visible light video image processing", Journal of Shanghai Maritime University, no. 02, 15 June 2010 (2010-06-15) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311186A (en) * 2022-10-09 2022-11-08 济南和普威视光电技术有限公司 Cross-scale attention confrontation fusion method for infrared and visible light images and terminal
CN115311186B (en) * 2022-10-09 2023-02-03 济南和普威视光电技术有限公司 Cross-scale attention confrontation fusion method and terminal for infrared and visible light images
CN115423734A (en) * 2022-11-02 2022-12-02 国网浙江省电力有限公司金华供电公司 Infrared and visible light image fusion method based on multi-scale attention mechanism
CN115546489A (en) * 2022-11-23 2022-12-30 南京理工大学 Multi-modal image semantic segmentation method based on cross-modal feature enhancement and interaction
CN116363036A (en) * 2023-05-12 2023-06-30 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on visual enhancement
CN116363036B (en) * 2023-05-12 2023-10-10 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on visual enhancement
CN116664462A (en) * 2023-05-19 2023-08-29 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN116664462B (en) * 2023-05-19 2024-01-19 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN118411575A (en) * 2024-07-03 2024-07-30 华东交通大学 Chest pathology image classification method and system based on observation mode and feature fusion
CN118411575B (en) * 2024-07-03 2024-08-23 华东交通大学 Chest pathology image classification method and system based on observation mode and feature fusion
CN118446912A (en) * 2024-07-11 2024-08-06 江西财经大学 Multi-mode image fusion method and system based on multi-scale attention sparse cascade
CN118446912B (en) * 2024-07-11 2024-09-27 江西财经大学 Multi-mode image fusion method and system based on multi-scale attention sparse cascade

Similar Documents

Publication Publication Date Title
CN115035003A (en) Infrared and visible light image anti-fusion method for interactively compensating attention
Ren et al. Single image dehazing via multi-scale convolutional neural networks with holistic edges
Li et al. Underwater scene prior inspired deep underwater image and video enhancement
CN111709902B (en) Infrared and visible light image fusion method based on self-attention mechanism
US20200265597A1 (en) Method for estimating high-quality depth maps based on depth prediction and enhancement subnetworks
CN111145131A (en) Infrared and visible light image fusion method based on multi-scale generation type countermeasure network
CN115311186B (en) Cross-scale attention confrontation fusion method and terminal for infrared and visible light images
CN113592018B (en) Infrared light and visible light image fusion method based on residual dense network and gradient loss
CN109255358B (en) 3D image quality evaluation method based on visual saliency and depth map
CN114049335B (en) Remote sensing image change detection method based on space-time attention
CN113283444B (en) Heterogeneous image migration method based on generation countermeasure network
CN109255774A (en) A kind of image interfusion method, device and its equipment
CN113762277B (en) Multiband infrared image fusion method based on Cascade-GAN
CN112991371B (en) Automatic image coloring method and system based on coloring overflow constraint
CN114782298B (en) Infrared and visible light image fusion method with regional attention
Singh et al. Weighted least squares based detail enhanced exposure fusion
CN113781375B (en) Vehicle-mounted vision enhancement method based on multi-exposure fusion
CN113920171B (en) Bimodal target tracking method based on feature level and decision level fusion
CN117292117A (en) Small target detection method based on attention mechanism
Kumar et al. Underwater image enhancement using deep learning
CN117495718A (en) Multi-scale self-adaptive remote sensing image defogging method
CN110689510B (en) Sparse representation-based image fusion method introducing dictionary information
CN107578406A (en) Based on grid with Wei pool statistical property without with reference to stereo image quality evaluation method
CN115457265B (en) Image defogging method and system based on generation of countermeasure network and multi-scale fusion
CN116980549A (en) Video frame processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220909