CN115601282A - Infrared and visible light image fusion method based on multi-discriminator generation countermeasure network - Google Patents


Info

Publication number
CN115601282A
CN115601282A
Authority
CN
China
Prior art keywords
image
dif
infrared
visible light
vis
Prior art date
Legal status
Pending
Application number
CN202211405079.7A
Other languages
Chinese (zh)
Inventor
康家银
武凌霄
张文娟
姬云翔
马寒雁
Current Assignee
Jiangsu Ocean University
Original Assignee
Jiangsu Ocean University
Priority date
Filing date
Publication date
Application filed by Jiangsu Ocean University
Priority to CN202211405079.7A
Publication of CN115601282A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network, whose main process is as follows: a differential image is calculated and preprocessed; the source images (the infrared image I_ir and the visible light image I_vis) and the differential images (the infrared differential image I_dif-ir and the visible light differential image I_dif-vis) are taken as input to train the network models (a generator model and discriminator models); and a fused image is generated using the trained generator model. In the practical application of infrared and visible light image fusion, the method can not only fully retain the thermal radiation information in the infrared image, but also effectively reproduce the texture details in the visible light image.

Description

Infrared and visible light image fusion method based on multi-discriminator generation countermeasure network
Technical Field
The invention relates to the field of image processing, in particular to an infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network.
Background
The image fusion task aims to fuse the different information in images obtained by various sensors into a single image so as to meet various application requirements. An infrared image contains rich thermal radiation information and allows targets to be distinguished from the background by their different thermal responses, but its resolution is generally low and it lacks texture details; a visible light image has higher resolution and rich details and matches human visual perception well, but it is easily affected by external factors; a fused image obtained by image fusion technology combines the advantages of both images.
In existing multi-modal image fusion research, the fusion of infrared and visible light images is one of the key branches. Researchers have proposed different fusion methods and strategies for the characteristics of these images. According to the machine learning technique used in the fusion strategy, fusion methods can be roughly divided into methods based on conventional algorithms and methods based on deep learning. Common fusion methods based on conventional algorithms include multi-scale transform based methods, sparse representation based methods, subspace based methods, and the like. Conventional methods usually apply the same transformation to different images and cannot extract key features from an image in a manner targeted to its characteristics; in addition, their fusion strategies cannot selectively retain various detailed information, which affects the fusion result. In contrast, deep learning based fusion methods can effectively solve these problems. Common deep learning based fusion methods include methods based on auto-encoders (AE), convolutional neural networks (CNN), and generative adversarial networks (GAN). Among existing methods, AE-based image fusion requires the encoder and decoder of the network to be trained on a public data set to reach their best performance, and it depends on a manually preset feature fusion strategy, which limits the fusion effect of infrared and visible light images to a certain extent. CNN-based infrared and visible light image fusion requires ground-truth images for training the deep learning model; in fact, the infrared and visible light image fusion task has no ground truth, and fused images are assessed mainly by subjective human visual evaluation assisted by objective evaluation indexes, so CNN-based fusion strategies are limited and the fusion result is affected. GAN-based fusion strategies perform unsupervised training by establishing an adversarial game between a generator, which aims to generate an image with infrared intensities and additional visible light gradients, and a discriminator, which aims to distinguish the generated image from the source images, so that the final fused image has both the clear thermal radiation intensity of the infrared image and the texture details of the visible light image. The GAN-based fusion strategy therefore makes up for the shortcomings of the above methods and is better suited to the infrared and visible light image fusion task.
Among existing GAN-based infrared and visible light image fusion methods, some researchers use a single-generator, single-discriminator structure, in which the discriminator distinguishes the fused image from the visible light image so as to guide the generator to retain as many texture details of the visible light image as possible. To solve the information imbalance in the fusion result caused by a single discriminator, some researchers have proposed a single-generator, dual-discriminator structure, in which the source images of the two modalities are discriminated by two discriminators. In addition, some researchers have introduced a differential discriminator on top of existing work, proposing a single-generator, three-discriminator structure in which the differential image serves as an additional network input, thereby improving fusion performance. As GAN models for image fusion have developed, increasing the number of discriminators constrains the generator from multiple angles and improves fusion performance to some extent. Moreover, the differential image focuses on the information unique to each source image and helps the fusion network retain more source image information. However, most existing methods apply a thresholding operation with a threshold of 0, or take absolute values, on the differential image to avoid negative gray values. In practice, the thresholding operation loses part of the source image information, while the absolute value operation retains all information but does not highlight the information unique to each modality.
Aiming at these problems of existing GAN-based methods for fusing infrared and visible light images, the invention provides a novel infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network. The proposed network model adopts a single-generator, four-discriminator structure: two differential discriminators are added on top of state-of-the-art algorithms to establish adversarial training with the generator, further constraining the optimization direction of the generator. First, the generator adopts a dual-encoder, single-decoder structure, in which different encoders extract the features of images of different modalities and the decoder reconstructs the fused image from the fused features. Second, unlike the absolute value operation applied to the differential image in other methods, the method normalizes the differential image so as to highlight the information unique to each of the two source modalities. Finally, to avoid the generator converging poorly because of excessive constraints from the discriminators, the loss function is designed with the source image loss as the main term and the differential image loss as an auxiliary term. Experimental results on public data sets show that the proposed algorithm not only fully retains the thermal radiation information in the infrared image but also effectively reproduces the texture details in the visible light image.
Disclosure of Invention
The present invention aims to solve the above-mentioned problems of the background art by providing an infrared and visible light image fusion method based on a multi-discriminator-generated countermeasure network.
In order to achieve the purpose, the invention provides the following technical scheme: the method for fusing the infrared and visible light images based on the multi-discriminator generation countermeasure network is characterized in that: the method comprises the following steps 1 to 3 to complete the fusion of the infrared and visible light images:
Step 1: calculate and preprocess the differential images: compute the differences between the infrared image I_ir and the visible light image I_vis and normalize them to obtain the differential images I_dif-ir and I_dif-vis;
Step 2: train the network model with the source images and the differential images as input, where the training process comprises the following steps 2-1 to 2-4:
Step 2-1: concatenate the infrared image I_ir with the differential image I_dif-ir, and the visible light image I_vis with the differential image I_dif-vis, as the inputs of the two encoders of generator G in step 2-2;
Step 2-2: generator G performs feature extraction and fusion on the data from step 2-1 and then reconstructs a fused image I_F from the fused features;
Step 2-3: input the fused image, together with the infrared image I_ir, the visible light image I_vis and the differential images I_dif-ir and I_dif-vis, into the discriminators (D_ir, D_vis, D_dif-ir and D_dif-vis), establishing adversarial training with generator G;
Step 2-4: loop steps 2-1 to 2-3 for iterative training; when the adversarial training approaches equilibrium, i.e. the discriminators can no longer distinguish whether an input sample comes from an image generated by the generator or from a real image, the training is terminated and the generator G required for fusion is obtained;
Step 3: generate the fused image using the trained generator model G. Specifically, the infrared image and the visible light image are each concatenated with the corresponding differential image and input together into the generator trained in step 2 to obtain the final fusion result.
As a preferred technical scheme of the invention, the calculation of the differential images in step 1 comprises the following specific steps:
Step 1-1: subtract the visible light image I_vis from the infrared image I_ir to obtain an infrared differential image I_dif-ir that emphasizes thermal radiation intensity;
Step 1-2: subtract the infrared image I_ir from the visible light image I_vis to obtain a visible light differential image I_dif-vis that highlights texture details.
As a preferred technical scheme of the invention, the normalization in the preprocessing of the differential images in step 1 maps the pixel gray values into the range 0 to 1, according to the following formula:
v'(i, j) = (v(i, j) - v_min) / (v_max - v_min)
where v(i, j) is the gray value of the pixel at (i, j) in the differential image, and v_min and v_max are the minimum and maximum gray values in the differential image, respectively.
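As an illustration only (the patent specifies no implementation; the function and variable names below are hypothetical), step 1 could be realized roughly as follows in Python:

```python
import numpy as np

def min_max_normalize(img):
    """Map the gray values of a differential image into [0, 1]."""
    v_min, v_max = img.min(), img.max()
    if v_max == v_min:  # guard against division by zero on a constant image
        return np.zeros_like(img, dtype=np.float32)
    return (img - v_min) / (v_max - v_min)

def compute_differential_images(i_ir, i_vis):
    """Step 1: I_dif-ir = normalize(I_ir - I_vis), I_dif-vis = normalize(I_vis - I_ir)."""
    i_ir = i_ir.astype(np.float32)
    i_vis = i_vis.astype(np.float32)
    i_dif_ir = min_max_normalize(i_ir - i_vis)   # emphasizes thermal radiation intensity
    i_dif_vis = min_max_normalize(i_vis - i_ir)  # highlights texture details
    return i_dif_ir, i_dif_vis
```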
As a preferred technical scheme of the invention, the generator network G in step 2-2 adopts a dual-encoder, single-decoder structure, as follows: first, each source image is concatenated with the differential image that highlights the information unique to that modality, and the result is used as the input of one encoder branch; second, the two encoder branches are responsible for extracting the features of the infrared image and of the visible light image, respectively; finally, the high-dimensional features of the different modalities are concatenated, input into the decoder, and reconstructed into the fused image.
As a preferred technical scheme of the invention, the four discriminators in step 2-3 adopt the same network structure, a five-layer convolutional neural network: the first four layers use 3×3 convolution kernels with the stride set to 2; batch normalization layers are added to the second through fourth layers; in the last layer, the features extracted by the convolutional layers are first integrated by a fully-connected layer, and a scalar is then computed with a Tanh activation function.
As a preferred technical scheme of the invention, during the iterative training in step 2-4, loss functions are used to evaluate the prediction error of the model. They consist of a generator loss function and discriminator loss functions. The generator loss function L_G is mainly composed of an adversarial loss L_adv and a content loss L_content and feeds back the training loss of the generator network; the four discriminators use similar loss functions L_D, which feed the discriminators' judgments of their inputs back to the generator and thereby establish adversarial training with it. The specific formulas are as follows:
L_G = L_adv + λ·L_content
where λ is a weight parameter; the discriminator losses L_D-ir, L_D-vis, L_D-dif-ir and L_D-dif-vis (defined below) correspond to the four discriminators D_ir, D_vis, D_dif-ir and D_dif-vis, respectively.
Specifically, the adversarial loss L_adv is mainly used to constrain the optimization direction of the generator and is defined as:
L_adv = E[log(1 - D_ir(I_F))] + E[log(1 - D_vis(I_F))] + E[log(1 - D_dif-ir(I_F))] + E[log(1 - D_dif-vis(I_F))]
where E[·] denotes expectation and D(·) is the classification probability that the discriminator assigns to its input image;
Specifically, the content loss L_content guides the generator, by comparing the differences between the fused image and the input images, to produce a fusion result that retains both the thermal radiation information of the infrared image and the texture information of the visible light image. It is defined as:
L_content = α·L_int + β·L_grad + γ·L_SSIM
where α, β and γ are weight parameters, L_int is the intensity loss, L_grad is the gradient loss, and L_SSIM is the structural similarity loss; L_int, L_grad and L_SSIM are defined as follows:
L_int = (1/(H·W)) · [ω·L_int-img + (1 - ω)·L_int-dif]
L_grad = (1/(H·W)) · [ω·L_grad-img + (1 - ω)·L_grad-dif]
L_SSIM = ω·L_SSIM-img + (1 - ω)·L_SSIM-dif
where ω is a weight parameter, H and W are the height and width of the input image, L_int-img is the source image intensity loss, L_int-dif is the differential image intensity loss, L_grad-img is the source image gradient loss, L_grad-dif is the differential image gradient loss, L_SSIM-img is the source image structural similarity loss, L_SSIM-dif is the differential image structural similarity loss, and L_SSIM(·,·) is the structural similarity between two images; L_int-img, L_int-dif, L_grad-img, L_grad-dif, L_SSIM-img and L_SSIM-dif are defined as follows:
L_int-img = a·||I_F - I_ir||_F + (1 - a)·||I_F - I_vis||_F
L_int-dif = a·||I_F - I_dif-ir||_F + (1 - a)·||I_F - I_dif-vis||_F
L_grad-img = a·||∇I_F - ∇I_ir||_F + (1 - a)·||∇I_F - ∇I_vis||_F
L_grad-dif = a·||∇I_F - ∇I_dif-ir||_F + (1 - a)·||∇I_F - ∇I_dif-vis||_F
L_SSIM-img = (1 - L_SSIM(I_F, I_ir)) + (1 - L_SSIM(I_F, I_vis))
L_SSIM-dif = (1 - L_SSIM(I_F, I_dif-ir)) + (1 - L_SSIM(I_F, I_dif-vis))
where a is a weight parameter, ||·||_F denotes the Frobenius norm, and ∇ denotes the gradient operator.
Specifically, the loss function of each discriminator is defined as:
L_D-ir = E[-log(D_ir(I_ir))] + E[-log(1 - D_ir(I_F))]
L_D-vis = E[-log(D_vis(I_vis))] + E[-log(1 - D_vis(I_F))]
L_D-dif-ir = E[-log(D_dif-ir(I_dif-ir))] + E[-log(1 - D_dif-ir(I_F))]
L_D-dif-vis = E[-log(D_dif-vis(I_dif-vis))] + E[-log(1 - D_dif-vis(I_F))]
where the input image of discriminators D_ir and D_vis is a source image (I_ir or I_vis) or the fused image (I_F), and the input image of discriminators D_dif-ir and D_dif-vis is a differential image (I_dif-ir or I_dif-vis) or the fused image (I_F).
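For illustration, a minimal PyTorch sketch of the generator loss is given below. It assumes the cross-entropy-style losses and the ω/(H·W) weighting reconstructed above; all weight values are placeholders, and the gradient and SSIM terms are omitted for brevity (they follow the same source-image/differential-image weighting pattern).

```python
import torch

def adversarial_loss(discriminators, i_f):
    """L_adv: sum over the four discriminators of E[log(1 - D(I_F))]."""
    eps = 1e-8
    return sum(torch.mean(torch.log(torch.clamp(1.0 - d(i_f), min=eps)))
               for d in discriminators)

def frobenius(x):
    """Frobenius norm of an image tensor."""
    return torch.norm(x, p='fro')

def weighted_intensity(i_f, ref_a, ref_b, a=0.5):
    """a*||I_F - ref_a||_F + (1 - a)*||I_F - ref_b||_F."""
    return a * frobenius(i_f - ref_a) + (1.0 - a) * frobenius(i_f - ref_b)

def generator_loss(discriminators, i_f, i_ir, i_vis, i_dif_ir, i_dif_vis,
                   lam=1.0, alpha=1.0, omega=0.7, a=0.5):
    """L_G = L_adv + lambda * L_content, keeping only the intensity part of L_content here."""
    h, w = i_f.shape[-2:]
    l_int_img = weighted_intensity(i_f, i_ir, i_vis, a)          # source image intensity loss
    l_int_dif = weighted_intensity(i_f, i_dif_ir, i_dif_vis, a)  # differential image intensity loss
    l_int = (omega * l_int_img + (1.0 - omega) * l_int_dif) / (h * w)
    l_content = alpha * l_int  # + beta * l_grad + gamma * l_ssim in the full loss
    return adversarial_loss(discriminators, i_f) + lam * l_content
```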
As a preferred technical scheme of the invention, in the dual-encoder, single-decoder structure, the dual encoder comprises two branches used to extract infrared thermal radiation intensity features and visible light texture features, respectively. Each branch consists of four convolutional layers that are densely connected in the manner of DenseNet. Specifically, the first layer consists of a 3×3 convolution kernel, a switchable normalization layer and a Leaky ReLU activation function; Convolutional Block Attention Modules (CBAM) are added to the last three layers; the number of channels in all convolutional layers is set to 64 and the convolution stride is set to 1. CBAM is introduced to improve the feature extraction capability and mainly consists of the following steps A to F (an illustrative code sketch follows the steps):
Step A: first input the feature map into the channel attention module, and perform global max pooling and global average pooling over the width and height of the input feature map to obtain two feature maps;
Step B: input the two feature maps into a multilayer perceptron with shared parameters to generate the respective channel attention maps, then obtain the final channel attention map through element-wise summation and a Sigmoid activation function;
Step C: multiply the original input features element-wise by the channel attention map and input the result into the spatial attention module;
Step D: perform max pooling and global average pooling along the channel dimension of the feature map input into the spatial attention module to obtain two feature maps;
Step E: concatenate them along the channel dimension, apply a convolutional layer, and generate the spatial attention map through a Sigmoid activation function;
Step F: multiply the input of the spatial attention module element-wise by the spatial attention map to obtain the final output features.
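The following PyTorch sketch illustrates steps A to F. It is only one possible implementation under assumptions: the reduction ratio and the spatial convolution kernel size are not specified in the patent.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by spatial attention."""
    def __init__(self, channels=64, reduction=8, spatial_kernel=7):
        super().__init__()
        # Shared MLP for the channel attention (steps A-B)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Convolution over the concatenated channel-wise max and mean maps (step E)
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Steps A-B: channel attention map from global max and average pooling
        max_pool = torch.amax(x, dim=(2, 3))
        avg_pool = torch.mean(x, dim=(2, 3))
        channel_attn = torch.sigmoid(self.mlp(max_pool) + self.mlp(avg_pool)).view(b, c, 1, 1)
        # Step C: apply channel attention to the input features
        x = x * channel_attn
        # Steps D-E: spatial attention map from channel-wise max and mean
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        mean_map = torch.mean(x, dim=1, keepdim=True)
        spatial_attn = torch.sigmoid(self.spatial_conv(torch.cat([max_map, mean_map], dim=1)))
        # Step F: apply spatial attention to obtain the output features
        return x * spatial_attn
```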
As a preferred technical scheme of the invention, in the dual-encoder, single-decoder structure, the single-decoder network consists of two convolutional layers: the first layer consists of a 3×3 convolution kernel, a switchable normalization layer and a Leaky ReLU activation function, and the second layer consists of a 3×3 convolution kernel and a Tanh activation function.
Compared with the prior art, the infrared and visible light image fusion method based on the multi-discriminator generation countermeasure network of the invention has the following technical effects.
The beneficial effects of the invention are as follows: the proposed adversarial fusion framework comprises one generator and four discriminators, and the differential images are used as auxiliary information to further improve the fusion performance of the network. In the proposed method, the differential images serve not only as additional information accompanying the source images, guiding the generator to focus on the information unique to images of different modalities, but also as real data distributions that assist the adversarial training between the differential discriminators and the generator. In the proposed network model, the generator adopts a dual-encoder, single-decoder structure, in which the encoders extract the features of the different modalities, mainly through a densely connected structure combined with an attention module, and the decoder reconstructs the fused image from the concatenated high-dimensional features. Each discriminator judges whether its input image comes from a real image or from an image generated by the generator, and the generator is constrained and optimized according to this judgment.
Drawings
FIG. 1 is a flow chart of the infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network according to the invention;
FIG. 2 is the overall fusion framework of the infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network according to the invention;
FIG. 3 shows the differential images and the effect of their preprocessing in the method of the invention;
FIG. 4 is an example of the generator network structure in the method of the invention;
FIG. 5 is an example of the Convolutional Block Attention Module (CBAM) structure in the method of the invention;
FIG. 6 is an example of the discriminator network structure in the method of the invention.
Detailed Description
The following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings, will make the advantages and features of the present invention more comprehensible and clear for those skilled in the art, and thus define the scope of the present invention more clearly.
The embodiment is as follows: referring to FIG. 1, the invention provides a technical solution: the method for fusing infrared and visible light images based on a multi-discriminator generation countermeasure network completes the fusion of infrared and visible light images according to steps 1 to 3: step 1, calculate and preprocess the differential images; step 2, train the network model with the source images and the differential images as input; and step 3, generate the fused image using the trained generator model.
Experimental groups: referring to fig. 2, the method for fusing infrared and visible light images based on a multi-discriminator-generated countermeasure network includes the following steps:
As shown in FIG. 3, step 1 is performed as follows: subtracting the visible light image I_vis from the infrared image I_ir yields an infrared differential image that emphasizes thermal radiation intensity, as shown in FIG. 3(c); subtracting the infrared image I_ir from the visible light image I_vis yields a visible light differential image that highlights texture details, as shown in FIG. 3(d). The differential images are normalized to map the pixel gray values into the range 0 to 1, as follows:
v'(i, j) = (v(i, j) - v_min) / (v_max - v_min)
where v(i, j) is the gray value of the pixel at (i, j) in the differential image, and v_min and v_max are the minimum and maximum gray values in the differential image, respectively. FIGS. 3(e) and (f) show the normalized infrared differential image I_dif-ir and visible light differential image I_dif-vis.
The training task of step 2 is executed according to steps 2-1 to 2-4:
Step 2-1: concatenate the infrared image I_ir with the differential image I_dif-ir, and the visible light image I_vis with the differential image I_dif-vis, as the inputs of the two encoders of generator G in step 2-2;
Step 2-2: generator G performs feature extraction and fusion on the data from step 2-1 and then reconstructs a fused image I_F from the fused features. As a specific embodiment of the invention, the network structure of the generator is shown in FIG. 4. The dual encoder comprises two branches for extracting infrared thermal radiation intensity features and visible light texture features, respectively; each branch consists of four convolutional layers that are densely connected in the manner of DenseNet so that multi-layer features are fully utilized. Specifically, the first layer extracts shallow features of the image and consists of a 3×3 convolution kernel, a switchable normalization layer and a Leaky ReLU activation function; the last three layers extract deep features of the image and add a Convolutional Block Attention Module on top of the structure of the first layer. The number of channels in all convolutional layers is set to 64 and the convolution stride is set to 1.
In the deep feature layers of the encoder, CBAM is introduced to improve the feature extraction capability. CBAM comprises two sub-modules, a channel attention module and a spatial attention module, where the channel attention module describes the relationships between channels and the spatial attention module describes the spatial relationships of the deep features; its structure is shown in FIG. 5. Specifically, 1) the feature map is input into the channel attention module, and global max pooling and global average pooling are performed over the width and height of the input feature map to obtain two feature maps; the two feature maps are input into a multilayer perceptron with shared parameters to generate the respective channel attention maps, and the final channel attention map is then obtained through element-wise summation and a Sigmoid activation function; 2) the original input features are multiplied element-wise by the channel attention map and the result is input into the spatial attention module; max pooling and global average pooling are performed along the channel dimension of the feature map input into the spatial attention module to obtain two feature maps; these are concatenated along the channel dimension, convolved by a convolutional layer, and passed through a Sigmoid activation function to generate the spatial attention map; finally, the input of the spatial attention module is multiplied element-wise by the spatial attention map to obtain the final output features.
After the two encoder branches extract the features of the source images of different modalities, the features are concatenated along the channel dimension and input into the decoder. The decoder reconstructs the fused image from the concatenated high-dimensional features, and its network structure consists of two convolutional layers: the first layer consists of a 3×3 convolution kernel, a switchable normalization layer and a Leaky ReLU activation function, and the second layer consists of a 3×3 convolution kernel and a Tanh activation function.
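As an illustration, the generator described above could be sketched in PyTorch as follows, reusing the CBAM class from the earlier sketch. This is an assumed implementation: nn.InstanceNorm2d stands in for the switchable normalization layer, and the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """3x3 conv + normalization + Leaky ReLU (InstanceNorm2d stands in for switchable normalization)."""
    def __init__(self, in_ch, out_ch=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class DenseEncoder(nn.Module):
    """Four densely connected conv layers; CBAM follows each of the last three layers."""
    def __init__(self, in_ch=2, ch=64):
        super().__init__()
        self.layers = nn.ModuleList(ConvBlock(in_ch + i * ch, ch) for i in range(4))
        self.attn = nn.ModuleList(CBAM(ch) for _ in range(3))

    def forward(self, x):
        feats = [x]
        for i, layer in enumerate(self.layers):
            out = layer(torch.cat(feats, dim=1))  # dense connection over all previous outputs
            if i > 0:
                out = self.attn[i - 1](out)       # CBAM on the last three layers
            feats.append(out)
        return feats[-1]

class Generator(nn.Module):
    """Dual-encoder, single-decoder generator reconstructing the fused image I_F."""
    def __init__(self, ch=64):
        super().__init__()
        self.enc_ir = DenseEncoder(in_ch=2, ch=ch)   # input: I_ir concatenated with I_dif-ir
        self.enc_vis = DenseEncoder(in_ch=2, ch=ch)  # input: I_vis concatenated with I_dif-vis
        self.decoder = nn.Sequential(
            ConvBlock(2 * ch, ch),                                  # conv + norm + Leaky ReLU
            nn.Conv2d(ch, 1, kernel_size=3, stride=1, padding=1),   # second decoder layer
            nn.Tanh(),
        )

    def forward(self, ir_pair, vis_pair):
        fused = torch.cat([self.enc_ir(ir_pair), self.enc_vis(vis_pair)], dim=1)
        return self.decoder(fused)
```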
Step 2-3: input the fused image, together with the infrared image I_ir, the visible light image I_vis and the differential images I_dif-ir and I_dif-vis, into the discriminators (D_ir, D_vis, D_dif-ir and D_dif-vis), establishing adversarial training with generator G;
As a specific embodiment of the invention, the four discriminators adopt the same network structure, a five-layer convolutional neural network whose details are shown in FIG. 6. Specifically, the first four layers use 3×3 convolution kernels with the stride set to 2; batch normalization layers are added to the second through fourth layers; in the last layer, the features extracted by the convolutional layers are integrated by a fully-connected layer (FC), and a scalar is then computed with a Tanh activation function to reflect the probability, as judged by the discriminator, that the input image comes from a source image or a differential image rather than from the fused image.
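A minimal PyTorch sketch of such a discriminator is given below for illustration. The channel width, the input image size and the Leaky ReLU activations after the convolutional layers are assumptions; the patent only fixes the kernel size, the stride, the batch normalization layers and the FC + Tanh output.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Five-layer discriminator: four 3x3 stride-2 conv layers (BN on layers 2-4), then FC + Tanh."""
    def __init__(self, in_ch=1, ch=32, img_size=128):
        super().__init__()
        layers = []
        for i in range(4):
            layers.append(nn.Conv2d(in_ch if i == 0 else ch, ch,
                                    kernel_size=3, stride=2, padding=1))
            if i >= 1:                              # batch normalization on the 2nd-4th layers
                layers.append(nn.BatchNorm2d(ch))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
        self.features = nn.Sequential(*layers)
        feat_size = img_size // 16                  # four stride-2 convs shrink H and W by 16x
        self.fc = nn.Linear(ch * feat_size * feat_size, 1)

    def forward(self, x):
        f = self.features(x).flatten(start_dim=1)
        return torch.tanh(self.fc(f))               # scalar score per input image
```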
Step 2-4: perform iterative training by looping steps 2-1 to 2-3, and terminate the training when the adversarial training approaches equilibrium, i.e. the discriminators can no longer distinguish whether an input sample comes from an image generated by the generator or from a real image.
The loss functions consist of a generator loss function and discriminator loss functions. The generator loss function L_G is mainly composed of an adversarial loss L_adv and a content loss L_content and feeds back the training loss of the generator network; the four discriminators use similar loss functions L_D, which feed the discriminators' judgments of their inputs back to the generator and thereby establish adversarial training with it. The specific formulas are as follows:
L_G = L_adv + λ·L_content
where λ is a weight parameter; the discriminator losses L_D-ir, L_D-vis, L_D-dif-ir and L_D-dif-vis (defined below) correspond to the four discriminators D_ir, D_vis, D_dif-ir and D_dif-vis, respectively.
The adversarial loss L_adv is mainly used to constrain the optimization direction of the generator and is defined as:
L_adv = E[log(1 - D_ir(I_F))] + E[log(1 - D_vis(I_F))] + E[log(1 - D_dif-ir(I_F))] + E[log(1 - D_dif-vis(I_F))]
where E[·] denotes expectation and D(·) is the classification probability that the discriminator assigns to its input image.
The content loss L_content guides the generator, by comparing the differences between the fused image and the input images, to produce a fusion result that retains both the thermal radiation information of the infrared image and the texture information of the visible light image. It is defined as:
L_content = α·L_int + β·L_grad + γ·L_SSIM
where α, β and γ are weight parameters, L_int is the intensity loss, L_grad is the gradient loss, and L_SSIM is the structural similarity loss; L_int, L_grad and L_SSIM are defined as follows:
L_int = (1/(H·W)) · [ω·L_int-img + (1 - ω)·L_int-dif]
L_grad = (1/(H·W)) · [ω·L_grad-img + (1 - ω)·L_grad-dif]
L_SSIM = ω·L_SSIM-img + (1 - ω)·L_SSIM-dif
where ω is a weight parameter, H and W are the height and width of the input image, L_int-img is the source image intensity loss, L_int-dif is the differential image intensity loss, L_grad-img is the source image gradient loss, L_grad-dif is the differential image gradient loss, L_SSIM-img is the source image structural similarity loss, L_SSIM-dif is the differential image structural similarity loss, and L_SSIM(·,·) is the structural similarity between two images; L_int-img, L_int-dif, L_grad-img, L_grad-dif, L_SSIM-img and L_SSIM-dif are defined as follows:
L_int-img = a·||I_F - I_ir||_F + (1 - a)·||I_F - I_vis||_F
L_int-dif = a·||I_F - I_dif-ir||_F + (1 - a)·||I_F - I_dif-vis||_F
L_grad-img = a·||∇I_F - ∇I_ir||_F + (1 - a)·||∇I_F - ∇I_vis||_F
L_grad-dif = a·||∇I_F - ∇I_dif-ir||_F + (1 - a)·||∇I_F - ∇I_dif-vis||_F
L_SSIM-img = (1 - L_SSIM(I_F, I_ir)) + (1 - L_SSIM(I_F, I_vis))
L_SSIM-dif = (1 - L_SSIM(I_F, I_dif-ir)) + (1 - L_SSIM(I_F, I_dif-vis))
where a is a weight parameter, ||·||_F denotes the Frobenius norm, and ∇ denotes the gradient operator.
The loss function of each discriminator is defined as:
L_D-ir = E[-log(D_ir(I_ir))] + E[-log(1 - D_ir(I_F))]
L_D-vis = E[-log(D_vis(I_vis))] + E[-log(1 - D_vis(I_F))]
L_D-dif-ir = E[-log(D_dif-ir(I_dif-ir))] + E[-log(1 - D_dif-ir(I_F))]
L_D-dif-vis = E[-log(D_dif-vis(I_dif-vis))] + E[-log(1 - D_dif-vis(I_F))]
where the input image of discriminators D_ir and D_vis is a source image (I_ir or I_vis) or the fused image (I_F), and the input image of discriminators D_dif-ir and D_dif-vis is a differential image (I_dif-ir or I_dif-vis) or the fused image (I_F).
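The adversarial training loop of step 2-4 could then look roughly like the sketch below. It reuses generator_loss from the earlier loss sketch, assumes the cross-entropy discriminator loss reconstructed above, and uses a single optimizer over all four discriminators for brevity; all of these are assumptions rather than details fixed by the patent.

```python
import torch

def train_step(gen, discs, opt_g, opt_d, batch, lam=1.0):
    """One adversarial iteration of step 2-4; `discs` is (D_ir, D_vis, D_dif-ir, D_dif-vis)."""
    i_ir, i_vis, i_dif_ir, i_dif_vis = batch
    ir_pair = torch.cat([i_ir, i_dif_ir], dim=1)     # encoder input for the infrared branch
    vis_pair = torch.cat([i_vis, i_dif_vis], dim=1)  # encoder input for the visible branch
    eps = 1e-8

    # Update the four discriminators: real images (source or differential) vs. the fused image
    i_f = gen(ir_pair, vis_pair).detach()
    opt_d.zero_grad()
    d_loss = 0.0
    for d, real in zip(discs, (i_ir, i_vis, i_dif_ir, i_dif_vis)):
        # clamp keeps the arguments of log positive for numerical safety
        d_loss = d_loss - torch.mean(torch.log(torch.clamp(d(real), min=eps))) \
                        - torch.mean(torch.log(torch.clamp(1.0 - d(i_f), min=eps)))
    d_loss.backward()
    opt_d.step()

    # Update the generator with L_G = L_adv + lambda * L_content
    opt_g.zero_grad()
    i_f = gen(ir_pair, vis_pair)
    g_loss = generator_loss(discs, i_f, i_ir, i_vis, i_dif_ir, i_dif_vis, lam=lam)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```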
Step 3: generate the fused image using the trained generator model. Specifically, the infrared image and the visible light image are each concatenated with the corresponding differential image and input together into the generator trained in step 2 to obtain the final fusion result.
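For completeness, a minimal sketch of this inference step is given below (again an assumption; the helper names are hypothetical):

```python
import torch

def min_max_normalize(x):
    """Torch counterpart of the min-max normalization applied to the differential images."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min + 1e-8)

@torch.no_grad()
def fuse(gen, i_ir, i_vis):
    """Step 3: build the differential images, concatenate, and run the trained generator."""
    gen.eval()
    i_dif_ir = min_max_normalize(i_ir - i_vis)
    i_dif_vis = min_max_normalize(i_vis - i_ir)
    ir_pair = torch.cat([i_ir, i_dif_ir], dim=1)
    vis_pair = torch.cat([i_vis, i_dif_vis], dim=1)
    return gen(ir_pair, vis_pair)  # the fused image I_F
```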
Experimental conclusion: the invention provides an infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network, an end-to-end network model consisting of one generator and four discriminators. Experiments on public infrared and visible light image data sets show that, compared with existing methods, the fusion results obtained by the algorithm of the invention contain richer texture information and have better subjective visual quality. In addition, the objective evaluation shows that the average values achieved by the algorithm are 6.02%, 25.93%, 7.61% and 16.77% better than the averages of the comparison methods on indexes including information entropy, average gradient, correlation coefficient and the sum of correlations of differences, respectively. The proposed method can therefore better fuse the texture information of the visible light image while effectively retaining the thermal radiation information of the infrared image, improving on the performance of existing infrared and visible light image fusion algorithms.
The above examples only show some embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.

Claims (8)

1. An infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network, characterized in that the method comprises the following steps 1 to 3 to complete the fusion of the infrared and visible light images:
Step 1: calculate and preprocess the differential images: compute the differences between the infrared image I_ir and the visible light image I_vis and normalize them to obtain the differential images I_dif-ir and I_dif-vis;
Step 2: train the network model with the source images and the differential images as input, where the training process comprises the following steps 2-1 to 2-4:
Step 2-1: concatenate the infrared image I_ir with the differential image I_dif-ir, and the visible light image I_vis with the differential image I_dif-vis, as the inputs of the two encoders of generator G in step 2-2;
Step 2-2: generator G performs feature extraction and fusion on the data from step 2-1 and then reconstructs a fused image I_F from the fused features;
Step 2-3: input the fused image I_F, together with the infrared image I_ir, the visible light image I_vis and the differential images I_dif-ir and I_dif-vis, into the discriminators (D_ir, D_vis, D_dif-ir and D_dif-vis), establishing adversarial training with generator G;
Step 2-4: loop steps 2-1 to 2-3 for iterative training; when the adversarial training approaches equilibrium, i.e. the discriminators can no longer distinguish whether an input sample comes from an image generated by the generator or from a real image, the training is terminated, yielding the generator G required for fusion;
Step 3: generate the fused image using the trained generator G. Specifically, the infrared image and the visible light image are each concatenated with the corresponding differential image and input together into the generator trained in step 2 to obtain the final fusion result.
2. The infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network according to claim 1, characterized in that the calculation of the differential images in step 1 comprises the following specific steps:
Step 1-1: subtract the visible light image I_vis from the infrared image I_ir to obtain an infrared differential image I_dif-ir that emphasizes thermal radiation intensity;
Step 1-2: subtract the infrared image I_ir from the visible light image I_vis to obtain a visible light differential image I_dif-vis that highlights texture details.
3. The infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network according to claim 1, characterized in that the normalization in the preprocessing of the differential images in step 1 maps the pixel gray values into the range 0 to 1, according to the following formula:
v'(i, j) = (v(i, j) - v_min) / (v_max - v_min)
where v(i, j) is the gray value of the pixel at (i, j) in the differential image, and v_min and v_max are the minimum and maximum gray values in the differential image, respectively.
4. The infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network according to claim 1, characterized in that the generator network G in step 2-2 adopts a dual-encoder, single-decoder structure, as follows: first, each source image is concatenated with the differential image that highlights the information unique to that modality, and the result is used as the input of one encoder branch; second, the two encoder branches are responsible for extracting the features of the infrared image and of the visible light image, respectively; finally, the high-dimensional features of the different modalities are concatenated, input into the decoder, and reconstructed into the fused image.
5. The infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network according to claim 1, characterized in that the four discriminators in step 2-3 adopt the same network structure, a five-layer convolutional neural network: the first four layers use 3×3 convolution kernels with the stride set to 2; batch normalization layers are added to the second through fourth layers; in the last layer, the features extracted by the convolutional layers are first integrated by a fully-connected layer, and a scalar is then computed with a Tanh activation function.
6. The infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network according to claim 1, characterized in that during the iterative training in step 2-4, loss functions are used to evaluate the prediction error of the model; they consist of a generator loss function and discriminator loss functions, where the generator loss function L_G is mainly composed of an adversarial loss L_adv and a content loss L_content and feeds back the training loss of the generator network, and the four discriminators use similar loss functions L_D, which feed the discriminators' judgments of their inputs back to the generator and thereby establish adversarial training with it; the specific formulas are as follows:
L_G = L_adv + λ·L_content
where λ is a weight parameter, and the discriminator losses L_D-ir, L_D-vis, L_D-dif-ir and L_D-dif-vis (defined below) correspond to the four discriminators D_ir, D_vis, D_dif-ir and D_dif-vis, respectively;
specifically, the adversarial loss L_adv is mainly used to constrain the optimization direction of the generator and is defined as:
L_adv = E[log(1 - D_ir(I_F))] + E[log(1 - D_vis(I_F))] + E[log(1 - D_dif-ir(I_F))] + E[log(1 - D_dif-vis(I_F))]
where E[·] denotes expectation and D(·) is the classification probability that the discriminator assigns to its input image;
specifically, the content loss L_content guides the generator, by comparing the differences between the fused image and the input images, to produce a fusion result that retains both the thermal radiation information of the infrared image and the texture information of the visible light image, and is defined as:
L_content = α·L_int + β·L_grad + γ·L_SSIM
where α, β and γ are weight parameters, L_int is the intensity loss, L_grad is the gradient loss, and L_SSIM is the structural similarity loss; L_int, L_grad and L_SSIM are defined as follows:
L_int = (1/(H·W)) · [ω·L_int-img + (1 - ω)·L_int-dif]
L_grad = (1/(H·W)) · [ω·L_grad-img + (1 - ω)·L_grad-dif]
L_SSIM = ω·L_SSIM-img + (1 - ω)·L_SSIM-dif
where ω is a weight parameter, H and W are the height and width of the input image, L_int-img is the source image intensity loss, L_int-dif is the differential image intensity loss, L_grad-img is the source image gradient loss, L_grad-dif is the differential image gradient loss, L_SSIM-img is the source image structural similarity loss, L_SSIM-dif is the differential image structural similarity loss, and L_SSIM(·,·) is the structural similarity between two images; L_int-img, L_int-dif, L_grad-img, L_grad-dif, L_SSIM-img and L_SSIM-dif are defined as follows:
L_int-img = a·||I_F - I_ir||_F + (1 - a)·||I_F - I_vis||_F
L_int-dif = a·||I_F - I_dif-ir||_F + (1 - a)·||I_F - I_dif-vis||_F
L_grad-img = a·||∇I_F - ∇I_ir||_F + (1 - a)·||∇I_F - ∇I_vis||_F
L_grad-dif = a·||∇I_F - ∇I_dif-ir||_F + (1 - a)·||∇I_F - ∇I_dif-vis||_F
L_SSIM-img = (1 - L_SSIM(I_F, I_ir)) + (1 - L_SSIM(I_F, I_vis))
L_SSIM-dif = (1 - L_SSIM(I_F, I_dif-ir)) + (1 - L_SSIM(I_F, I_dif-vis))
where a is a weight parameter, ||·||_F denotes the Frobenius norm, and ∇ denotes the gradient operator;
specifically, the loss function of each discriminator is defined as:
L_D-ir = E[-log(D_ir(I_ir))] + E[-log(1 - D_ir(I_F))]
L_D-vis = E[-log(D_vis(I_vis))] + E[-log(1 - D_vis(I_F))]
L_D-dif-ir = E[-log(D_dif-ir(I_dif-ir))] + E[-log(1 - D_dif-ir(I_F))]
L_D-dif-vis = E[-log(D_dif-vis(I_dif-vis))] + E[-log(1 - D_dif-vis(I_F))]
where the input image of discriminators D_ir and D_vis is a source image (I_ir or I_vis) or the fused image (I_F), and the input image of discriminators D_dif-ir and D_dif-vis is a differential image (I_dif-ir or I_dif-vis) or the fused image (I_F).
7. The infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network according to claim 4, characterized in that in the dual-encoder, single-decoder structure, the dual encoder comprises two branches used to extract infrared thermal radiation intensity features and visible light texture features, respectively; each branch consists of four convolutional layers that are densely connected in the manner of DenseNet; specifically, the first layer consists of a 3×3 convolution kernel, a switchable normalization layer and a Leaky ReLU activation function; Convolutional Block Attention Modules (CBAM) are added to the last three layers; the number of channels in all convolutional layers is set to 64 and the convolution stride is set to 1; CBAM is introduced to improve the feature extraction capability and mainly consists of the following steps A to F:
Step A: first input the feature map into the channel attention module, and perform global max pooling and global average pooling over the width and height of the input feature map to obtain two feature maps;
Step B: input the two feature maps into a multilayer perceptron with shared parameters to generate the respective channel attention maps, then obtain the final channel attention map through element-wise summation and a Sigmoid activation function;
Step C: multiply the original input features element-wise by the channel attention map and input the result into the spatial attention module;
Step D: perform max pooling and global average pooling along the channel dimension of the feature map input into the spatial attention module to obtain two feature maps;
Step E: concatenate them along the channel dimension, apply a convolutional layer, and generate the spatial attention map through a Sigmoid activation function;
Step F: multiply the input of the spatial attention module element-wise by the spatial attention map to obtain the final output features.
8. The infrared and visible light image fusion method based on a multi-discriminator generation countermeasure network according to claim 4, characterized in that in the dual-encoder, single-decoder structure, the single-decoder network consists of two convolutional layers: the first layer consists of a 3×3 convolution kernel, a switchable normalization layer and a Leaky ReLU activation function, and the second layer consists of a 3×3 convolution kernel and a Tanh activation function.
CN202211405079.7A 2022-11-10 2022-11-10 Infrared and visible light image fusion method based on multi-discriminator generation countermeasure network Pending CN115601282A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211405079.7A CN115601282A (en) 2022-11-10 2022-11-10 Infrared and visible light image fusion method based on multi-discriminator generation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211405079.7A CN115601282A (en) 2022-11-10 2022-11-10 Infrared and visible light image fusion method based on multi-discriminator generation countermeasure network

Publications (1)

Publication Number Publication Date
CN115601282A true CN115601282A (en) 2023-01-13

Family

ID=84853703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211405079.7A Pending CN115601282A (en) 2022-11-10 2022-11-10 Infrared and visible light image fusion method based on multi-discriminator generation countermeasure network

Country Status (1)

Country Link
CN (1) CN115601282A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109539A (en) * 2023-03-21 2023-05-12 智洋创新科技股份有限公司 Infrared image texture information enhancement method and system based on generation of countermeasure network
CN116664462A (en) * 2023-05-19 2023-08-29 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN116664462B (en) * 2023-05-19 2024-01-19 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN117455774A (en) * 2023-11-17 2024-01-26 武汉大学 Image reconstruction method and system based on differential output
CN117455774B (en) * 2023-11-17 2024-05-14 武汉大学 Image reconstruction method and system based on differential output
CN117635418A (en) * 2024-01-25 2024-03-01 南京信息工程大学 Training method for generating countermeasure network, bidirectional image style conversion method and device
CN117635418B (en) * 2024-01-25 2024-05-14 南京信息工程大学 Training method for generating countermeasure network, bidirectional image style conversion method and device
CN117934978A (en) * 2024-03-22 2024-04-26 安徽大学 Hyperspectral and laser radar multilayer fusion classification method based on countermeasure learning

Similar Documents

Publication Publication Date Title
CN115601282A (en) Infrared and visible light image fusion method based on multi-discriminator generation countermeasure network
CN111325155B (en) Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN112784764B (en) Expression recognition method and system based on local and global attention mechanism
CN111046964B (en) Convolutional neural network-based human and vehicle infrared thermal image identification method
CN111275618A (en) Depth map super-resolution reconstruction network construction method based on double-branch perception
CN112967178B (en) Image conversion method, device, equipment and storage medium
CN110443162B (en) Two-stage training method for disguised face recognition
CN112801015A (en) Multi-mode face recognition method based on attention mechanism
CN110032925A (en) A kind of images of gestures segmentation and recognition methods based on improvement capsule network and algorithm
CN110351548B (en) Stereo image quality evaluation method guided by deep learning and disparity map weighting
CN114299559A (en) Finger vein identification method based on lightweight fusion global and local feature network
CN115393225A (en) Low-illumination image enhancement method based on multilevel feature extraction and fusion
CN116385832A (en) Bimodal biological feature recognition network model training method
CN113807497B (en) Unpaired image translation method for enhancing texture details
CN114743162A (en) Cross-modal pedestrian re-identification method based on generation of countermeasure network
CN109583406B (en) Facial expression recognition method based on feature attention mechanism
CN113706404A (en) Depression angle human face image correction method and system based on self-attention mechanism
CN111382684B (en) Angle robust personalized facial expression recognition method based on antagonistic learning
CN107909565A (en) Stereo-picture Comfort Evaluation method based on convolutional neural networks
CN113222879B (en) Generation countermeasure network for fusion of infrared and visible light images
CN115358961A (en) Multi-focus image fusion method based on deep learning
CN114898429A (en) Thermal infrared-visible light cross-modal face recognition method
CN116977455A (en) Face sketch image generation system and method based on deep two-way learning
CN114911967A (en) Three-dimensional model sketch retrieval method based on adaptive domain enhancement
CN114066844A (en) Pneumonia X-ray image analysis model and method based on attention superposition and feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination