CN112001868A - Infrared and visible light image fusion method and system based on a generative adversarial network - Google Patents

Infrared and visible light image fusion method and system based on a generative adversarial network

Info

Publication number
CN112001868A
Authority
CN
China
Prior art keywords
image
loss function
infrared
fusion
fused
Prior art date
Legal status
Granted
Application number
CN202010751222.2A
Other languages
Chinese (zh)
Other versions
CN112001868B (en)
Inventor
隋晓丹
王亚茹
冯飞燕
王雪梅
许源
丁维康
赵艳娜
Current Assignee
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202010751222.2A priority Critical patent/CN112001868B/en
Publication of CN112001868A publication Critical patent/CN112001868A/en
Application granted granted Critical
Publication of CN112001868B publication Critical patent/CN112001868B/en
Status: Active

Classifications

    • G06T 5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T 5/73 — Deblurring; Sharpening
    • G06T 2207/10048 — Image acquisition modality: Infrared image
    • G06T 2207/20081 — Special algorithmic details: Training; Learning
    • G06T 2207/20084 — Special algorithmic details: Artificial neural networks [ANN]
    • G06T 2207/20192 — Image enhancement details: Edge enhancement; Edge preservation
    • G06T 2207/20221 — Image combination: Image fusion; Image merging
    • Y02T 10/40 — Engine management systems

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an infrared and visible light image fusion method and system based on a generative adversarial network, which comprises the following steps: acquiring an infrared image and a visible light image to be fused; simultaneously inputting the infrared image and the visible light image to be fused into a pre-trained generative adversarial network, and outputting the fused image. As one or more embodiments, the generative adversarial network comprises a total loss function, the total loss function comprising: a content loss function, a detail loss function, a target edge enhancement loss function, and an adversarial loss function.

Description

Infrared and visible light image fusion method and system based on a generative adversarial network
Technical Field
The application relates to the technical field of image processing, in particular to an infrared and visible light image fusion method and system based on a generative adversarial network.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
The infrared image is captured by an infrared sensor and records the thermal radiation emitted by different targets; it is widely applied to target detection and surface parameter inversion. Infrared images are less affected by illumination variation and camouflage, and they are easily captured both day and night. However, infrared images generally lack texture and mainly reflect the amount of heat emitted by objects. In contrast, a visible light image records the reflectance information of different objects and therefore contains distinctive characteristic information; it also provides a perceptually natural scene description for the human eye. However, objects in a visible image are not readily observable under the influence of the external environment, such as nighttime conditions, camouflage, objects obscured by smoke, cluttered backgrounds, and the like.
In the process of implementing the present application, the inventors found that the following technical problems exist in the prior art:
In recent years, various infrared and visible light image fusion methods have been proposed, which can be classified into seven major categories, namely multi-scale transformation, sparse representation, neural networks, subspace methods, saliency, hybrid models, and deep learning. In general, existing fusion methods involve three challenges, namely image transformation, activity level measurement, and fusion rule design. These three components have become more and more complex, especially the manual design of fusion rules, which strongly limits the development of fusion methods. In addition, existing fusion methods generally select the same salient features, such as edges and lines, from the source images and fuse them into the fused image so that it contains more detailed information. However, this approach may not be suitable for the fusion of infrared and visible light images. In particular, infrared thermal radiation information is characterized by pixel intensity, while texture detail information in visible light images is typically characterized by edges and gradients. These two kinds of information are different and cannot be represented in the same way.
Disclosure of Invention
In order to overcome the defects of the prior art, the application provides an infrared and visible light image fusion method and system based on a generative adversarial network; a single image that combines the effective target regions of the infrared image with the abundant detail information of the visible light image is obtained.
In a first aspect, the present application provides an infrared and visible light image fusion method based on a generative adversarial network;
the infrared and visible light image fusion method based on the generative adversarial network comprises the following steps:
acquiring an infrared image and a visible light image to be fused;
simultaneously inputting the infrared image and the visible light image to be fused into a pre-trained generative adversarial network, and outputting the fused image; as one or more embodiments, the generative adversarial network comprises a total loss function, the total loss function comprising: a content loss function, a detail loss function, a target edge enhancement loss function, and an adversarial loss function.
In a second aspect, the present application provides an infrared and visible light image fusion system based on a generative adversarial network;
an infrared and visible light image fusion system based on a generative adversarial network, comprising:
an acquisition module configured to: acquiring an infrared image and a visible light image to be fused;
a fusion module configured to: simultaneously input the infrared image and the visible light image to be fused into a pre-trained generative adversarial network and output the fused image; as one or more embodiments, the generative adversarial network comprises a total loss function, the total loss function comprising: a content loss function, a detail loss function, a target edge enhancement loss function, and an adversarial loss function.
In a third aspect, the present application further provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the present application also provides a computer program (product) comprising a computer program for implementing the method of any of the preceding first aspects when run on one or more processors.
Compared with the prior art, the beneficial effects of this application are:
the application provides a novel infrared and visible light image fusion method based on GAN, which can simultaneously reserve heat radiation information in an infrared image and abundant texture details in a visible light image. The method is an end-to-end model, and avoids manual design and complex design of activity level measurement and fusion rules of the traditional fusion strategy. In particular, the application designs two loss functions of detail loss and target edge enhancement loss to improve the fusion performance. The loss of detail is introduced to better utilize the texture detail in the source image, while the loss of target edge enhancement is to sharpen the edges of infrared targets. By utilizing the two loss functions, the result of the application can well store thermal radiation information, infrared target boundary and texture detail information at the same time. The present application demonstrates the effectiveness of detail loss and target edge enhancement loss in experiments. Qualitative and quantitative comparisons show that the strategy of the present application outperforms the most advanced methods. Furthermore, the method of the present application not only produces a relatively good visual effect, but also generally preserves the maximum or about maximum amount of information in the source image.
The present application proposes an end-to-end model, a fusion GAN, which fuses infrared and visible images based on a generative adversarial network. The method avoids complex activity level measurement and the manual design of fusion rules, and the fusion result resembles an infrared image with clear targets and rich details.
The method designs two loss functions, namely a detail loss and a target edge enhancement loss, to improve the quality of the detail information and sharpen the edges of the infrared target. On the one hand, this solves the technical problem that FusionGAN relies only on adversarial training, which is uncertain and unstable, to add extra detail information, so that a large amount of detail information is lost. On the other hand, it solves the technical problem that the content loss in FusionGAN relies only on the edge information of the visible light image while ignoring the edge information in the infrared image, so that target edges are blurred in the fusion result.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of a method of the first embodiment;
FIGS. 2(a) and 2(b) are schematic diagrams of the generator and the discriminator of the first embodiment;
FIGS. 3(a)-3(p) are graphs showing the test results of the D-model and the VGG model of the first embodiment;
FIGS. 4(a)-4(f) are schematic diagrams of the results of three different scenario evaluations from the TNO dataset for the first embodiment;
FIGS. 5(a)-5(f) are schematic diagrams illustrating filtering of the edge map using Gaussian kernels with different radii according to the first embodiment;
FIGS. 6(a)-6(f) are schematic diagrams illustrating the quantitative comparison of six fusion indicators on the TNO dataset according to the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
In recent years, with the development of image representation technology and the rapid increase of demand, many infrared and visible light image fusion methods have been proposed. Fusion methods can be divided into seven major categories based on theoretical basis.
The multi-scale transform based fusion method is the most common fusion method, which assumes that the source image can be decomposed into several layers. And fusing the multi-layer images according to a specific fusion rule to obtain a final target fusion image.
The transformations most commonly used for decomposition and reconstruction are wavelet, pyramid, curvelet and variants thereof.
The second category is sparse representation-based methods. Research has found that an image can be represented by a sparse linear combination over an over-complete dictionary, which is the key factor ensuring the good performance of such methods.
The third type is neural network-based methods, which have the advantages of strong adaptability, fault tolerance, and noise resistance, and can simulate the perception mechanism of the human brain when processing information.
The fourth category is subspace-based methods, which aim to project the high-dimensional input image into a low-dimensional subspace.
Given the often redundant information present in an image, the low-dimensional subspace may help capture the intrinsic structure of the original image.
The fifth category is saliency-based methods. Human visual attention often captures objects or pixels that are more important than neighbors. For such methods, the intensity of the salient object regions is highlighted, thereby improving the visual quality of the fused image.
The sixth type is a hybrid method that combines the advantages of various methods to further improve the image fusion performance.
In particular, one hybrid approach introduces an interesting fusion process under the framework of a joint sparse representation model, guided by an integrated saliency map.
In recent years, with the development of deep learning technology, some fusion methods based on deep learning have been developed. However, existing approaches typically apply a deep learning framework only to certain parts of the fusion process, such as feature extraction or the learning of fusion strategies, while the overall fusion process remains within the traditional framework and is not end-to-end. In the field of exposure fusion, Prabhakar proposed an end-to-end fusion model and obtained good fusion performance.
Generally, the purpose of the above methods is to ensure that the fused image contains rich detail information. Therefore, they typically use the same image representation and select the same salient features, such as texture, from the source images for fusion. In infrared and visible image fusion, however, the thermal radiation in the infrared image is characterized at the pixel level, so considering only texture causes the fusion result to lose the high-contrast property of the infrared target.
To solve this problem, a gradient transfer fusion (GTF) method was proposed to maintain the main intensity distribution of the infrared image and the gradient variation of the visible light image. The result is highly similar to the infrared image while keeping a detailed appearance.
The detail information in a visible light image includes gradient, contrast, saturation, and the like; therefore, gradient changes alone cannot sufficiently retain the useful detail information in the visible light image.
To alleviate this problem, a GAN framework based on the GTF loss design was further proposed. However, pure adversarial training may still cause information loss due to its uncertainty and instability, and the GTF-style loss ignores the edge information of the infrared image, blurring the target in the fusion result. To overcome the above challenges, the present application proposes a new end-to-end model that employs two specially designed loss functions based on detail-preserving adversarial learning.
Example one
The embodiment provides an infrared and visible light image fusion method based on a generative adversarial network;
the infrared and visible light image fusion method based on the generative adversarial network comprises the following steps:
S101: acquiring an infrared image and a visible light image to be fused;
S102: simultaneously inputting the infrared image and the visible light image to be fused into a pre-trained generative adversarial network, and outputting the fused image; as one or more embodiments, the generative adversarial network comprises a total loss function, the total loss function comprising: a content loss function, a detail loss function, a target edge enhancement loss function, and an adversarial loss function.
As one or more embodiments, the generative adversarial network comprises a generator and a discriminator; the generator is based on a ResNet network, and the discriminator is based on a VGG-Net neural network.
Further, the content loss function constrains the fused image to have pixel intensities similar to those of the infrared image and gradient changes similar to those of the visible light image.
Further, the content loss function is equal to the sum of the image loss function and the gradient loss function.
Further, the detail loss function constrains the difference between the discriminator feature maps of the fused image and the visible image.
Further, the target edge enhancement loss function is used to sharpen the edges of the targets highlighted in the fused image.
Further, the adversarial loss function is defined according to the discriminator's output probabilities over all training samples.
As one or more embodiments, the infrared image and the visible light image to be fused are both images captured in the same scene.
The total loss function is expressed as follows:

L_total = L_content + L_adv + β·L_detail + γ·L_tee    (1)

where the content loss L_content constrains the fused image to have pixel intensities similar to those of the infrared image and gradient changes similar to those of the visible image, which is similar to the objective function of GTF. The detail loss L_detail and the adversarial loss L_adv aim to add richer detail information to the fused image. The target edge enhancement loss L_tee is used to sharpen the edges of the targets highlighted in the fused image. The content loss is the sum of the image loss L_image and the gradient loss L_gradient. The weighting parameters β and γ control the trade-off among the different loss terms in the generator.
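The following minimal sketch shows how such a weighted combination of the four terms might be assembled; since the exact weighting scheme of Eq. (1) is partly reconstructed from the surrounding text, the function signature and the default values of β and γ are illustrative assumptions, not the patented configuration.

```python
def total_loss(l_content, l_adv, l_detail, l_tee, beta=0.2, gamma=0.005):
    """Weighted sum of the four generator loss terms of Eq. (1).

    beta and gamma trade off the detail and target-edge-enhancement terms;
    the default values merely echo the training section and are assumptions.
    """
    return l_content + l_adv + beta * l_detail + gamma * l_tee
```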
The MSE-based per-pixel image loss L_image is defined as follows:

L_image = (1/(W·H)) Σ_{x=1}^{W} Σ_{y=1}^{H} (I_f(x, y) − I_r(x, y))²    (2)

where I_r is the raw infrared image, I_f is the final output of the generator, and W and H represent the width and height of the image. The image loss aligns the fused image with the infrared image in terms of pixel intensity distribution. Note that the l2 norm is chosen because it is quadratic: compared with the l1 norm, it is differentiable and easy to optimize.
In order to fuse rich texture information, a gradient loss inspired by GTF is designed as follows:

L_gradient = (1/(W·H)) Σ_{x=1}^{W} Σ_{y=1}^{H} (D_v(x, y) − D_f(x, y))²    (3)

where D_v(x, y) represents the gradient of the visible image and D_f(x, y) represents the gradient of the fused image. The gradient loss is defined as the MSE between D_v(x, y) and D_f(x, y).
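A sketch of the two content-loss terms in PyTorch is given below, assuming single-channel image tensors of shape (N, 1, H, W); the patent does not name the gradient operator, so the Laplacian filter and the trade-off weight xi are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# 3x3 Laplacian kernel used here as an assumed gradient operator D(.).
_LAPLACIAN = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)

def gradient(img):
    # img: (N, 1, H, W) tensor; returns an approximate gradient map.
    return F.conv2d(img, _LAPLACIAN.to(img.device), padding=1)

def content_loss(fused, infrared, visible, xi=5.0):
    # L_image (Eq. 2): per-pixel MSE between the fused and infrared images.
    l_image = F.mse_loss(fused, infrared)
    # L_gradient (Eq. 3): MSE between gradients of the fused and visible images.
    l_gradient = F.mse_loss(gradient(fused), gradient(visible))
    # xi is an assumed weight balancing the two terms of the content loss.
    return l_image + xi * l_gradient
```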
The difference between the discriminator feature maps of the fused image and the visible image is defined as the detail loss, as follows:

L_detail = (1/(N·M)) Σ_{x=1}^{N} Σ_{y=1}^{M} (φ_v(x, y) − φ_f(x, y))²    (4)

where φ denotes a feature map obtained by convolution within the discriminator, φ_v and φ_f denote the feature representations of the visible image and the fused image, and N and M represent the width and height of the feature map.
For other computer vision tasks, the perceptual loss produced by a pre-trained VGG-Net is often used to improve performance, and VGG-Net is a good choice when high-level features need to be extracted. However, the VGG-Net pre-trained on the ImageNet dataset has never seen infrared images, and there is no established way for VGG-Net to extract high-level features from the fused image, which contains both thermal radiation information and visible texture information. Therefore, feeding the visible image and the fused image into VGG-Net would be problematic. In fact, the discriminator of the present network is trained on fused images and visible images; during training it can therefore extract comparatively better features from both, so the discriminator, rather than VGG-Net, is used to extract high-level features. Furthermore, when the detail loss is optimized, the gradient loss is also reduced.
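A minimal sketch of this detail-loss computation is shown below; `discriminator.features` is a hypothetical accessor for an intermediate (pool5-level) feature map, since the patent does not expose a concrete interface.

```python
import torch.nn.functional as F

def detail_loss(discriminator, fused, visible):
    # phi_f and phi_v: intermediate discriminator feature maps (Eq. 4).
    # `features` is an assumed accessor; the real network defines its own.
    phi_f = discriminator.features(fused)
    phi_v = discriminator.features(visible)
    return F.mse_loss(phi_f, phi_v)
```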
The target edge enhancement loss L_tee is as follows:

L_tee = (1/(W·H)) Σ_{x=1}^{W} Σ_{y=1}^{H} G(x, y)·(I_f(x, y) − I_r(x, y))²    (5)

In fact, this term is similar to L_image. To make the target boundary clearer, a weight map G is designed to focus more on the target boundary region and is multiplied into the L_image term, where G is defined as follows:
G(x, y) = N_{k=3}(D_r(x, y)) + N_{k=5}(D_r(x, y)) + N_{k=7}(D_r(x, y))    (6)
where N denotes a Gaussian kernel, k denotes the kernel radius, and D_r(x, y) denotes the gradient of the infrared image. The combination k = 3, 5, 7 is empirically chosen as the default configuration because of its satisfactory visual effect. The G map has three notable features. First, most regions receive a weight of 0, since they can be well optimized by L_image and do not need to be optimized again in L_tee. Second, the infrared target boundary regions receive large weights, which enables the model to focus during training on infrared target boundaries that may be absent from the visible image. Third, regions close to the boundary receive smaller weights, realizing a smooth transition on both sides of the edge region.
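A sketch of Eqs. (5) and (6) is given below, assuming (N, 1, H, W) tensors; mapping a kernel radius r to a (2r+1)×(2r+1) Gaussian via torchvision's `gaussian_blur` is an assumption, since the patent only states the radii.

```python
import torch
from torchvision.transforms.functional import gaussian_blur

def edge_weight_map(ir_gradient, radii=(3, 5, 7)):
    # G (Eq. 6): the infrared gradient map D_r filtered with Gaussian kernels
    # of several radii and summed into one continuous, smooth weight map.
    g = torch.zeros_like(ir_gradient)
    for r in radii:
        g = g + gaussian_blur(ir_gradient, kernel_size=2 * r + 1)
    return g

def target_edge_enhancement_loss(fused, infrared, g_map):
    # L_tee (Eq. 5): the per-pixel squared error of L_image re-weighted by G.
    return torch.mean(g_map * (fused - infrared) ** 2)
```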
Adversarial loss:
The generator network employs an adversarial loss against the discriminator to generate better fused images. The adversarial loss is defined according to the discriminator probabilities log D_θD(G_θG(I_mix)) over all training samples, as follows:

L_adv = −(1/N) Σ_{n=1}^{N} log D_θD(G_θG(I_mix^n))    (7)

where I_mix is the channel-wise stack of the infrared and visible images, D_θD(G_θG(I_mix)) is the probability that the fused image looks like a visible image, and N is the batch size.
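The generator-side adversarial term could be computed as in the sketch below; the discriminator is assumed to output a probability in (0, 1), and the small epsilon is only for numerical safety.

```python
import torch

def adversarial_loss(generator, discriminator, ir, vis, eps=1e-8):
    # Eq. (7): negative log-probability that the fused image is judged "visible".
    i_mix = torch.cat([ir, vis], dim=1)      # stack along the channel dimension
    fused = generator(i_mix)
    d_out = discriminator(fused)             # assumed probability in (0, 1)
    return -torch.log(d_out + eps).mean()
```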
As one or more embodiments, the training of the pre-trained generative adversarial network comprises the following steps:
constructing the generative adversarial network;
constructing a training set, wherein the training set comprises infrared images, visible light images, and the corresponding fused images;
simultaneously inputting the infrared image and the visible light image of the training set into a generator to obtain a generated image;
inputting the generated image and the known fusion image into a discriminator, and outputting a discrimination result by the discriminator;
stopping training when the accuracy of distinguishing the generated image from the known fused image by the discriminator is 50%;
and obtaining a trained generative adversarial network and a trained generator.
As one or more embodiments, the generative adversarial network operates on the following principle:
In the training phase, the infrared image I_r and the visible image I_v are first stacked along the channel dimension, and the stacked image is then put into the generator G. Under the guidance of the loss function, a preliminary fused image I_f is obtained from the generator G. Then, I_f and I_v are input to the discriminator D to determine which sample comes from the source data. The above training process is repeated until the discriminator D cannot distinguish the fused image from the visible image. Finally, the application obtains a generator G with a powerful capability to generate fused images with prominent, sharp-edged targets and richer texture.
First, the generator G is trained. In the first step, the infrared image I_r and the visible image I_v are stacked together. In the second step, the stacked image is put into the generator G. In the third step, the fused image I_f is obtained through the generator G. In the fourth step, the discriminator D distinguishes the fused image I_f from the visible image I_v. The training of the generator G is repeated, continuously improving its capability, and stops when the accuracy with which the discriminator D distinguishes the fused image from the visible image drops to 50%. The discriminator D is then retrained; through continued training it improves itself until it can again accurately distinguish the fused image from the visible image, after which the generator G is trained again. This alternating training of the generator G and the discriminator D is repeated until the discriminator D cannot distinguish the fused image from the visible image, at which point training is finished. A schematic training loop is sketched below.
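The following is a minimal sketch of such an alternating training loop under stated assumptions: the 2:1 discriminator/generator step ratio, the Adam optimizer, and the learning rate echo the training details given later, while the loss callables `g_loss_fn` and `d_loss_fn` are placeholders for the generator and discriminator objectives described above.

```python
import torch

def train(generator, discriminator, loader, g_loss_fn, d_loss_fn,
          iterations=400, lr=1e-5, d_steps=2):
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    for _ in range(iterations):
        for ir, vis in loader:
            i_mix = torch.cat([ir, vis], dim=1)
            # Train the discriminator to tell fused images from visible ones.
            for _ in range(d_steps):
                fused = generator(i_mix).detach()
                loss_d = d_loss_fn(discriminator(fused), discriminator(vis))
                opt_d.zero_grad()
                loss_d.backward()
                opt_d.step()
            # Train the generator against the current discriminator.
            fused = generator(i_mix)
            loss_g = g_loss_fn(fused, ir, vis, discriminator)
            opt_g.zero_grad()
            loss_g.backward()
            opt_g.step()
```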
The generative adversarial network (GAN) was first proposed by Goodfellow et al. to generate more realistic images. The main idea is to create a min-max two-player game between a generator and a discriminator. The generator takes noise as input and attempts to convert the input noise into realistic image samples. Meanwhile, the discriminator takes a generated sample or a real sample as input, with the purpose of determining whether the input comes from the generated or the real data. The adversarial game between the generator and the discriminator continues until the generated samples can no longer be recognized by the discriminator. The generator can then produce relatively more realistic image samples.
Although the original GAN can be used to generate digital images, such as those resembling MNIST, the results still contain noisy and unintelligible information. To improve the quality of the generated images, the Laplacian pyramid has been used to generate high-resolution images supervised by low-resolution images, but this method is not suitable for images containing unstable objects. Gregor and Dosovitskiy successfully generated natural images, but they did not use generators for supervised learning. Radford proposed applying deeper generators and formulated rules for designing the convolutional network structures of the generator and discriminator for stable training. InfoGAN can learn more interpretable representations.
To address the poor stability of GANs during training, the objective function of GANs has been modified, yielding the WGAN model, which relaxes the stability requirements of training, although it converges more slowly than a normal model. The most widely used variant of GAN is the conditional GAN. Many studies based on conditional GANs include image inpainting, image style conversion, image-to-image translation, product photo generation, and the like. The method proposed in this application is also based primarily on conditional GANs.
Pixel-level loss functions, such as the mean square error (MSE), are widely used for image generation. However, such loss functions tend to make the generated results too smooth, resulting in poor perceived image quality. In recent years, more and more researchers have exploited the perceptual loss to solve image style transfer and image super-resolution problems. The perceptual loss typically compares high-level features extracted from a convolutional network rather than the pixels themselves. The application introduces a detail term in the loss function to improve fusion performance. However, unlike the perceptual loss calculated with a typical VGG network, the present application uses the discriminator as the feature extractor to calculate the detail loss.
Given an infrared image and a visible light image, the method aims to fuse the two image types and construct a fused image which can keep the significance of the target in the infrared image and can keep rich detail information in the visible light image. The fusion image is generated by utilizing the convolutional neural network, so that the difficulty of artificially designing activity level measurement and fusion rules can be overcome. However, this approach presents two challenges:
on the one hand, in the deep learning domain, training an excellent network requires a large amount of labeled data.
To solve this problem, the present application converts the fusion problem into a regression problem, where a loss function is needed to guide the regression process. The present application uses the objective function of the GTF, i.e., the objective function that preserves thermal radiation information and visible texture details, in view of the fusion objectives of the present application.
On the other hand, the detail information is only represented as gradient changes, which means that other important detail information, such as contrast and saturation, etc., is discarded. However, such detailed information is generally not handled as a mathematical model.
The present application initially uses the generator, by solving the GTF objective function, to generate a fused image that looks like the GTF result, i.e. one that preserves the thermal radiation information and visible texture details. The result, together with the visible image, is then sent to the discriminator to determine whether the image comes from the source data. By establishing an adversarial game between the generator and the discriminator, the application assumes that the fused image contains sufficient detail information once the discriminator cannot distinguish the fused image from the visible image. With this method, the detail information is automatically represented and selected by a neural network rather than by manually designed rules. Furthermore, the loss function of the present application contains an additional detail loss and a target edge enhancement loss in addition to the adversarial loss. These terms keep the model stable during adversarial training and yield very promising fusion performance.
By establishing a min-max game between the generator and the discriminator, the method can better solve the style conversion problem. The application first generates a fused image, which looks like the GTF result, by using the generator to solve the GTF objective function. The result, together with the visible image, is then sent to the discriminator to determine whether the image comes from the source data. By constructing the adversarial relationship between the generator and the discriminator, the application assumes that the fused image contains sufficient detail information when the discriminator cannot distinguish the fused image from the visible image; with this method, the detail information is automatically represented and selected by the neural network rather than being designed by hand.
In addition, the loss function of the present application includes a detail loss and a target edge enhancement loss in addition to the adversarial loss. These terms keep the model stable during adversarial training and lead to good fusion performance.
The framework of the method is illustrated in FIG. 1. In the training phase, the application first stacks the infrared image I_r and the visible image I_v along the channel dimension and then puts the stacked image into a generator G, whose structure is similar to ResNet.
Under the guidance of the loss function, the preliminary fused image I_f is obtained from G and is later input to a discriminator D, whose structure is close to that of VGG-Net, to judge which sample comes from the source data.
The training process is repeated until the fused image is indistinguishable from the visible image. Finally, the resulting G has a powerful capability to generate fused images with prominent, sharp-edged targets and richer textures.
Network architecture: the model consists of a generator and a discriminator based on different network structures, as shown in FIGS. 2(a) and 2(b). Compared with prior fusion algorithms, the method uses deeper generator and discriminator networks with stronger feature representation capability, thereby improving fusion performance. In particular, the generator is designed based on ResNet. In the generator network, the activation function of the residual blocks is a parametric rectified linear unit (PReLU) rather than the typical ReLU.
The activation function is defined as:

f(y_i) = max(0, y_i) + a_i · min(0, y_i)

where y_i is the input of the nonlinear activation f on the i-th channel and a_i is a coefficient controlling the slope of the negative part. The subscript i in a_i indicates that the nonlinear activation is allowed to vary across different channels. When a_i = 0, the function becomes the ReLU; when a_i is a learnable parameter, the activation function is referred to as the parametric ReLU (PReLU).
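PyTorch ships a PReLU layer with exactly this behaviour, so a minimal usage sketch is shown below; the channel count of 64 is an assumed example, while the 0.25 initialisation and the 88×88 crop size come from the text.

```python
import torch
import torch.nn as nn

# Channel-wise PReLU: f(y_i) = max(0, y_i) + a_i * min(0, y_i),
# with one learnable slope a_i per channel, initialised to 0.25.
prelu = nn.PReLU(num_parameters=64, init=0.25)   # 64 channels is an assumption

x = torch.randn(1, 64, 88, 88)                   # 88x88 matches the training crops
y = prelu(x)                                     # negative inputs are scaled by a_i
```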
If a_i is small and fixed, the PReLU becomes the leaky ReLU (LReLU), e.g. with a_i = 0.01. The motivation of the LReLU is to avoid zero gradients. Experiments show that the LReLU has a negligible effect on accuracy compared with the ReLU. In contrast, the present method adaptively learns the PReLU parameters together with the whole model, and end-to-end training is expected to yield more specialized activations.
The PReLU introduces very few additional parameters: their number equals the total number of channels and is negligible compared with the total number of weights. Thus, no additional risk of overfitting is expected. A channel-shared variant is also considered: f(y_i) = max(0, y_i) + a · min(0, y_i), where the coefficient a is shared by all channels of one layer. This variant introduces only one additional parameter per layer.
The PReLU may be trained using back propagation and optimized simultaneously with other layers.
The update formula of {a_i} is derived directly from the chain rule. The gradient of a_i for one layer is:

∂E/∂a_i = Σ_{y_i} (∂E/∂f(y_i)) · (∂f(y_i)/∂a_i)

where E represents the objective function and ∂E/∂f(y_i) is the gradient propagated from the deeper layer. The gradient of the activation is given by:

∂f(y_i)/∂a_i = 0 if y_i > 0, and ∂f(y_i)/∂a_i = y_i if y_i ≤ 0,

where the summation Σ_{y_i} runs over all positions of the feature map. For the channel-shared variant, the gradient of a is:

∂E/∂a = Σ_i Σ_{y_i} (∂E/∂f(y_i)) · (∂f(y_i)/∂a)

where Σ_i sums over all channels of the layer. The time complexity introduced by the PReLU is negligible for both forward and backward propagation. A momentum method is adopted when updating a_i:

Δa_i := μ·Δa_i + ε·(∂E/∂a_i)

where μ is the momentum and ε is the learning rate. Notably, weight decay (l2 regularization) is not used when updating a_i, since weight decay tends to push a_i toward zero and thus biases the PReLU toward the ReLU. Even without regularization, the learned coefficients rarely have a magnitude greater than 1 in the experiments. The initialization a_i = 0.25 is used.
The parametric ReLU is the same as the leaky ReLU except that its slope is a parameter adaptively learned through back propagation. In the fusion network, 1 × 1 convolutional layers replace the fully connected layers, and a fully convolutional network that is not limited by the size of the input image is constructed, so that valuable information can be extracted from the source infrared and visible light images. This approach also differs from general models in that it contains no deconvolution or pooling layers: a pooling layer loses some detail information, while a deconvolution layer inserts additional information into the input, and both cases lead to an inaccurate description of the actual information in the source images. A sketch of such a generator is given after this paragraph.
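A minimal sketch of a ResNet-style, fully convolutional generator of this kind is given below; the five residual blocks, the PReLU activations, the 1×1 output convolution, and the absence of pooling/deconvolution follow the description above, while the channel widths, the use of batch normalisation inside the blocks, and the tanh output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # One residual block with PReLU activation.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.PReLU(ch),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    # Fully convolutional: no pooling or deconvolution layers, and a final
    # 1x1 convolution in place of fully connected layers.
    def __init__(self, in_ch=2, ch=64, n_blocks=5):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1), nn.PReLU(ch))
        self.blocks = nn.Sequential(*[ResidualBlock(ch) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(ch, 1, 1)          # single-channel fused output

    def forward(self, x):
        # x: stacked infrared + visible input of shape (N, 2, H, W).
        return torch.tanh(self.tail(self.blocks(self.head(x))))
```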
The design of the discriminator is based on the VGG11 network. VGG11 employs five convolutional layers and five max-pooling layers; in contrast, in the present network each convolutional layer is followed by a batch normalization layer, which has proven effective in accelerating network training. For the activation function, the ordinary ReLU is replaced with a parametric ReLU to adjust the degree of leakage during back propagation. A convolutional layer with 1 × 1 filters is then added to reduce the dimension, which means the fully connected layers of VGG can be dropped.
The discriminator only needs to classify whether an image is a visible image, so the large fully connected layers are replaced by a simple convolutional layer, and both the generator network and the discriminator network can be regarded as fully convolutional networks that are robust to input images of different sizes. A sketch of such a discriminator is given after this paragraph.
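The following is a minimal sketch of a VGG11-style discriminator as described above (convolution + batch normalisation + PReLU per stage, max pooling, and a final 1×1 convolution instead of fully connected layers); the channel widths and the sigmoid read-out are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Convolution + batch normalisation + PReLU, followed by max pooling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.PReLU(out_ch),
        nn.MaxPool2d(2))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 64), conv_block(64, 128), conv_block(128, 256),
            conv_block(256, 512), conv_block(512, 512))
        self.head = nn.Conv2d(512, 1, 1)         # 1x1 convolution reduces the dimension

    def forward(self, x):
        score = self.head(self.features(x))
        # Average the spatial map and squash to a probability of "visible image".
        return torch.sigmoid(score.mean(dim=(1, 2, 3)))
```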
Loss function: the model proposed in this application was trained on TNO data containing 45 different scenes and 45 was chosen to train both infrared and visible images. The image pairs have been previously aligned and image registration is required for the unregistered image pairs. During the training process, 88 × 88 random cropping is used as input for the original infrared image and the visible light image for each iteration. The input (i.e. pixel intensity) is normalized to a range between-1 and 1. During the training process, the application optimizes the loss function by using an adam solver. Each iteration, the generator and the arbiter update their parameters. During testing, the present application places the entire stack image into the generator, and then obtains a fused image of the same size as the input.
In order to evaluate the fusion performance of the method, experiments are performed on public datasets such as TNO and INO, and the method is compared with 10 fusion methods, namely ADF, dual-tree complex wavelet transform (DTCWT), fourth-order partial differential equation (FPDE), image fusion based on multi-resolution singular value decomposition (IMSDV), infrared and visible light image fusion based on a deep learning framework, two-scale image fusion based on visual saliency (TSIFVS), wavelet fusion, GTF, DenseFuse, and FusionGAN. All comparisons are based on published code, and the parameters are set according to the original reports. The experiments are performed on a notebook computer configured with a 3.3 GHz Intel Xeon CPU i5-4590, a GeForce GTX 1080 Ti GPU, and 11 GB of memory.
The training parameters are set as follows: the batch size is 64, the number of training iterations is 400, and the discriminator training step is 2. The weighting parameters are set to 100 (α), 0.2 (β), 5, and 0.005 (γ). The learning rate is set to 10^-5. All models are trained on the TNO dataset.
Fusion performance is often difficult to judge by subjective evaluation alone. Therefore, quantitative fusion metrics are considered for objective evaluation. Six indexes, namely Entropy (EN), Standard Deviation (SD), Correlation Coefficient (CC), Spatial Frequency (SF), Structural Similarity Index (SSIM) and Visual Information Fidelity (VIF), are selected. They are defined as follows: EN is based on information theory, which defines and measures the amount of information contained in an image. SD is based on a statistical concept that reflects the distribution and contrast of the image. CC measures the linear correlation degree of the fusion image and the source image. The SF metric is established based on horizontal and vertical gradients, which effectively measures the gradient distribution, reflecting the details and texture of the image. SSIM measures the structural similarity of the source image and the fused image. The VIF measures the information fidelity of the fused image. For these six indices, larger values indicate better performance.
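For reference, common textbook definitions of three of these indices (EN, SD, CC) could be computed as in the sketch below; these are generic implementations assuming 8-bit grayscale arrays, not the exact formulations used in the patent's experiments.

```python
import numpy as np

def entropy(img, bins=256):
    # EN: Shannon entropy (in bits) of the grey-level histogram.
    hist, _ = np.histogram(img, bins=bins, range=(0, 255))
    p = hist.astype(np.float64) / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def standard_deviation(img):
    # SD: spread of the intensity distribution, reflecting contrast.
    return float(np.std(img))

def correlation_coefficient(fused, source):
    # CC: linear correlation between the fused image and one source image.
    return float(np.corrcoef(fused.ravel(), source.ravel())[0, 1])
```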
The detail loss plays an important role in the proposed method. By introducing the detail loss, the model becomes more stable and the fusion performance is improved. Therefore, this section focuses on verifying the detail loss added to L_total, without the target edge enhancement loss. Specific experiments examine how characteristic feature information is extracted from the fused image and how this improves the detail preservation of the image.
The perceptual loss has a wide range of applications in image style conversion. Existing methods typically use a pre-trained VGG network as a feature extractor and compare the pool5-layer feature maps extracted from the generated image and the target image. The perceptual loss makes the generated image similar to the target image not only at the pixel level but also semantically. In the method proposed here, the function of the detail loss is almost the same as that of the perceptual loss. However, since the pre-trained VGG network is trained only on visible light images and can hardly extract high-level features of infrared information, the pre-trained VGG network and its pool5-layer feature map may not be suitable for infrared and visible light image fusion. In contrast, the discriminator of the present application is trained on fused images and visible light images, so infrared information can be extracted by the discriminator. Therefore, it is more suitable to use the discriminator as the feature extractor for the detail loss calculation.
To verify the above, two different models are trained in the following experiment. The first model is called the VGG model, in which a pre-trained VGG network is used as the feature extractor. The second is called the D-model, in which the discriminator is used as the feature extractor. The pool5-layer feature maps of the fused image and the visible image are compared. Since the feature map of pool5 may not contain useful information for infrared and visible image fusion, a second experiment compares feature maps from different levels, namely pool5, pool4, pool3, and pool2. Finally, a third experiment verifies the contribution of the detail loss to fusion performance.
FIGS. 3(a)-3(p) illustrate some typical fusion results, where the pre-trained VGG-Net and the discriminator are used as feature extractors, respectively. The first two rows show the raw infrared and visible light images from four scenes of the TNO dataset, such as smoke, people, benches, and trees. The remaining two rows correspond to the fusion results of the VGG model and the D-model. The results show that the fusion result of the VGG model only utilizes the texture information in the visible light image and can hardly maintain the high-contrast characteristic of the target in the infrared image. For example, in the first example, the person behind the smoke is completely invisible, while in the remaining three examples the fusion results retain only blurred contours of the person, which are no longer apparent. In contrast, the results of the D-model retain the salient targets well and contain rich details from the visible light image, especially the sharp branches in the first two examples. This indicates that the pre-trained VGG network has a strong ability to extract high-level features in visible light images, but not in infrared images. Thus, the detail loss of the VGG model makes the fusion results focus on retaining more detailed information rather than highlighting the target. In contrast, the D-model is better suited to preserving both thermal radiation and texture detail information.
The D-model and the VGG model are tested using the feature maps of the pool2, pool3, pool4, and pool5 layers to calculate the detail loss. An image pair named sand path is used for evaluation, and the results show that, for the D-model, the characteristics of the four fused images are basically the same: the target is clear, the detail information is rich, and the image looks like a sharpened infrared image. In the result at the pool5 level, however, the fence along the road is clearer. For the VGG model, it is observed that no matter which layer of the VGG network is used to calculate the detail loss, the fusion result cannot maintain the high-contrast characteristic of the target in the infrared image. This indicates that the VGG network pre-trained on visible light images cannot extract high-level features of infrared information. To comprehensively evaluate the optimal layer selection for the D-model, the four candidate layers are tested on the infrared and visible light image sequence pairs of the INO dataset, and six fusion indices, namely EN, SD, CC, SF, SSIM, and VIF, are calculated and compared. The results are shown in FIGS. 4(a)-4(f). The pool5 layer clearly gives the best overall performance for most image pairs. Therefore, the discriminator is used as the feature extractor, and the feature map of the pool5 layer is used to calculate the detail loss.
The fusion results of the model with and without the detail loss are further compared to verify the improvement that the detail loss brings to fusion performance. Three different scenes from the TNO dataset, namely Kaptein, sand path, and bush, are used for the evaluation. From the Kaptein tent, the fences on the sand path, and the leaves in the bush, it can be seen that the detail information in the results of the model with the detail loss is significantly richer, although both models retain the salient targets in the infrared image well.
In addition, the six fusion metrics are quantitatively evaluated on 40 samples of the TNO dataset, with the results shown in FIGS. 5(a)-5(f). From the results, the model outperforms the model without the detail loss in all six indicators for each image pair. Therefore, the detail loss improves both the visual effect of the fused image and the quantitative fusion indices.
Verification of the target edge enhancement loss: this section explains why the G map is designed for calculating the target edge enhancement loss, and verifies the effect of the target edge enhancement loss based on the D-model.
To effectively preserve the edges of the target, an effective weighting method is devised. However, since the infrared image always contains a lot of noise, which affects the fusion performance, the edge map of the infrared image is discrete and cluttered. Therefore, Gaussian kernels with different radii are adopted to filter the edge map and obtain a continuous, smooth G map. The kernel radii are empirically set to 3, 5, and 7. Some qualitative results for different combinations of kernel radii are also provided; the combination of radii 3, 5, and 7 produces the best visual effect and is therefore used as the default setting.
Some representative fusion results are presented, namely the D-model (FusionGAN with the detail loss) and the method of the present application (FusionGAN with the detail loss and the target edge enhancement loss). In both FusionGAN and the D-model, the edges of the infrared target are significantly blurred, such as the forehead edge in bush and the elbow edge in Kaptein. In contrast, the target edge enhancement loss of the present application solves this problem well, and the target boundaries in the results are well preserved and sharpened. In addition to the sharpened infrared target edges, it is also found that the detail loss and the target edge enhancement loss can be optimized simultaneously without significant conflict. The fusion results of the present application also contain much of the detail information retained by the D-model, such as the leaves in the bush, the fences on the sand path, and the stripes in Kaptein_1123. This demonstrates the effectiveness of the target edge enhancement loss.
Impact of different architectures: the impact of different architectures in the framework is studied. On the one hand, the impact of network depth is studied. Considering that the 5-residual-block network is deep enough, a shallower network called ShallowNet, i.e. a 4-residual-block network, is selected for comparison. On the other hand, the impact of a different type of architecture using dense connections, called DenseNet, is investigated; specifically, dense connections are added to the 4-residual-block network. It is found that all three architectures can preserve the radiation information well, but there are differences in detail preservation. For example, the details in the red box in the ShallowNet results are difficult to identify, but they are clear in the other two architectures. Moreover, the targets in the fusion results of the deeper network are more prominent than in DenseNet, as is the case for the people in the three scenes. Thus, both deeper networks and dense connections can improve the detail quality of the fused images, and a deeper structure can better preserve the infrared information than dense connections.
The TNO dataset contains multispectral (e.g., enhanced vision, near-infrared, and long-wavelength infrared or thermal) nighttime images of different military-related scenes, registered with different multi-band camera systems. 45 pairs of infrared and visible images are selected from the dataset as the training set and 12 pairs as the test set. Five typical pairs, namely bunker, smoke, lake, Kaptein, and sand path, are selected from the test set for the qualitative description.
All the methods can fuse the information of the two source images to a certain extent, and in this sense it is difficult to judge which method is the best. However, in the results of the methods other than GTF and FusionGAN, salient regions such as the shelter, the window, the lake, and the human body have low saliency in the fused image, which indicates that the thermal radiation information in the infrared image is not well preserved. This observation can be attributed to the tendency of those methods to exploit the detailed information in the source images, which makes subsequent target detection and localization difficult.
Experimental results show that GTF (gradient transfer fusion), FusionGAN, and the method of the present application can all highlight the target well in the fused image.
However, the fusion results of the proposed method contain more detail information and sharpened infrared target edges. For example, in Kaptein, the outline of the trees is much clearer and sharper in the results of the present application compared with GTF, while the stripes on the road are evident in the results of the present application but hardly observable in the results of FusionGAN. On the sand path, the fences are fully fused in the results of the present application but are difficult to identify in the results of GTF and FusionGAN. Similar phenomena can be observed in the other three examples. This demonstrates that the proposed method is superior to the other existing methods in simultaneously preserving thermal radiation information, infrared target boundary information, and texture detail information. In addition, a quantitative comparison of the 11 methods is carried out on the 12 pairs of infrared and visible light images in the test set. Regarding SSIM, the parking-line area in the fused image is darkened according to the ground radiation information while the edge texture of the parking lines is kept, so it looks like neither the infrared image nor the visible light image; similar phenomena can be observed in the truck and flipper areas. Therefore, the goal of maintaining both thermal radiation and rich texture detail inevitably reduces the SSIM index.
Results on the INO dataset: to verify generalizability, the method trained on the TNO dataset is tested on the INO dataset. The INO dataset is provided by the National Optics Institute of Canada and contains several pairs of visible and infrared videos representing different scenes taken under different weather conditions. 90 pairs of infrared and visible images are captured from the videos named trees and runner for comparison. The method of the present application achieves the best SD, CC, SF, and VIF for all image pairs, and its average evaluation indices are clearly the largest compared with the other ten methods. For the EN metric, the method is second only to GTF by a small margin; constrained by the content loss, it also fails to achieve the optimal SSIM. Furthermore, the metrics of IVIFDLF vary greatly between different frames, especially SSIM and VIF. This is because, in IVIFDLF, the image reconstruction after the downsampling operation may cause a registration error between the fusion result and the source images, and such mis-registration may vary from frame to frame, resulting in unstable results. A run-time comparison of the 11 methods shows that the proposed method achieves efficiency comparable to the other 10 methods. The quantitative comparison of the six fusion indices on the TNO dataset is shown in FIGS. 6(a)-6(f).
The present application provides a novel GAN-based infrared and visible light image fusion method that simultaneously preserves the thermal radiation information of the infrared image and the rich texture details of the visible light image. The method is an end-to-end model and avoids the manual, complicated design of activity-level measurements and fusion rules required by traditional fusion strategies. In particular, the present application designs two loss functions, a detail loss and a target edge enhancement loss, to improve fusion performance. The detail loss is introduced to better exploit the texture details of the source images, while the target edge enhancement loss sharpens the edges of infrared targets. With these two loss functions, the results of the present application preserve thermal radiation information, infrared target boundaries, and texture detail information at the same time. The experiments demonstrate the effectiveness of the detail loss and the target edge enhancement loss. Qualitative and quantitative comparisons show that the strategy of the present application outperforms the state-of-the-art methods. Furthermore, the method of the present application not only produces a comparatively good visual effect but also generally preserves the maximum, or nearly the maximum, amount of information from the source images.
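To make the roles of the two proposed loss terms concrete, the following PyTorch sketch shows one plausible form of a detail loss (a distance between discriminator feature maps of the fused and visible images) and a target edge enhancement loss (matching the gradients of the fused image to the infrared gradients inside a target region). The Sobel operator, the L1/L2 distances, and the `target_mask` input are illustrative assumptions; the patent does not prescribe these exact formulas.

```python
import torch
import torch.nn.functional as F

def sobel_gradient(x: torch.Tensor) -> torch.Tensor:
    """Gradient magnitude via Sobel filters; x has shape (N, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def detail_loss(d_feat_fused: torch.Tensor, d_feat_visible: torch.Tensor) -> torch.Tensor:
    """Penalise the distance between discriminator feature maps of the
    fused image and of the visible image (illustrative L2 choice)."""
    return F.mse_loss(d_feat_fused, d_feat_visible)

def target_edge_loss(fused: torch.Tensor, infrared: torch.Tensor,
                     target_mask: torch.Tensor) -> torch.Tensor:
    """Encourage sharp infrared-target edges by matching fused-image
    gradients to infrared gradients inside a (hypothetical) target mask."""
    return F.l1_loss(sobel_gradient(fused) * target_mask,
                     sobel_gradient(infrared) * target_mask)
```

In a training loop these two terms would be weighted and added to the content loss and the adversarial loss to form the total loss.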
The second embodiment of the invention provides an infrared and visible light image fusion system based on a generative antagonistic network. The system comprises:
an acquisition module configured to: acquire an infrared image and a visible light image to be fused;
a fusion module configured to: input the infrared image and the visible light image to be fused simultaneously into a pre-trained generative antagonistic network and output the fused image. As one or more embodiments, the generative antagonistic network comprises a total loss function, and the total loss function comprises: a content loss function, a detail loss function, a target edge enhancement loss function, and an antagonism loss function.
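A minimal inference sketch of the two modules might look as follows. The channel-wise concatenation of the two source images at the generator input and the greyscale normalisation are assumptions made for illustration; the actual generator interface is defined by the trained network.

```python
import numpy as np
import torch
from PIL import Image

def acquire(path: str) -> torch.Tensor:
    """Acquisition module: load a greyscale image as a 1x1xHxW tensor in [0, 1]."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float32) / 255.0
    return torch.from_numpy(img)[None, None]

@torch.no_grad()
def fuse(generator: torch.nn.Module, ir_path: str, vis_path: str) -> np.ndarray:
    """Fusion module: feed both source images to the pre-trained generator."""
    ir, vis = acquire(ir_path), acquire(vis_path)
    fused = generator(torch.cat([ir, vis], dim=1))  # assumed 2-channel input
    return fused.squeeze().clamp(0, 1).cpu().numpy()
```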
It should be noted that the acquisition module and the fusion module correspond to steps S101 to S102 of the first embodiment; the examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
The third embodiment of the present invention further provides an electronic device comprising one or more processors, one or more memories, and one or more computer programs, wherein the processor is connected with the memory and the one or more computer programs are stored in the memory; when the electronic device runs, the processor executes the one or more computer programs stored in the memory so that the electronic device performs the method of the first embodiment. The fourth embodiment further provides a computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the present application shall fall within the protection scope of the present application.

Claims (10)

1. An infrared and visible light image fusion method based on a generative antagonistic network, characterized by comprising the following steps:
acquiring an infrared image and a visible light image to be fused;
inputting the infrared image and the visible light image to be fused simultaneously into a pre-trained generative antagonistic network and outputting a fused image, wherein the generative antagonistic network comprises a total loss function, the total loss function comprising: a content loss function, a detail loss function, a target edge enhancement loss function, and an antagonism loss function.
2. The method of claim 1, wherein the generative antagonistic network comprises a generator and a discriminator, the generator adopting a ResNet classification network and the discriminator adopting a VGG-Net neural network.
3. The method of claim 1, wherein the content loss function constrains the fused image to have pixel intensities similar to those of the infrared image and gradient variations similar to those of the visible light image.
4. The method of claim 1, wherein the content loss function is equal to a sum of an image loss function and a gradient loss function.
5. The method of claim 1, wherein the detail loss function is used to constrain the difference between the discriminator feature maps of the fused image and the visible light image.
6. The method of claim 1, wherein the target edge enhancement loss function is used to sharpen edges and highlight targets in the fused image.
7. The method of claim 1, wherein the antagonism loss function is defined according to the discriminator probabilities of all the training samples.
8. An infrared and visible light image fusion system based on a generative antagonistic network, characterized by comprising:
an acquisition module configured to: acquire an infrared image and a visible light image to be fused;
a fusion module configured to: input the infrared image and the visible light image to be fused simultaneously into a pre-trained generative antagonistic network and output the fused image, wherein the generative antagonistic network comprises a total loss function, the total loss function comprising: a content loss function, a detail loss function, a target edge enhancement loss function, and an antagonism loss function.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein the processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory to cause the electronic device to perform the method of any one of claims 1 to 7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202010751222.2A 2020-07-30 2020-07-30 Infrared and visible light image fusion method and system based on generation of antagonism network Active CN112001868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010751222.2A CN112001868B (en) 2020-07-30 2020-07-30 Infrared and visible light image fusion method and system based on generation of antagonism network

Publications (2)

Publication Number Publication Date
CN112001868A true CN112001868A (en) 2020-11-27
CN112001868B CN112001868B (en) 2024-06-11

Family

ID=73463327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010751222.2A Active CN112001868B (en) 2020-07-30 2020-07-30 Infrared and visible light image fusion method and system based on generation of antagonism network

Country Status (1)

Country Link
CN (1) CN112001868B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446828A (en) * 2021-01-29 2021-03-05 成都东方天呈智能科技有限公司 Thermal imaging super-resolution reconstruction method fusing visible image gradient information
CN112785661A (en) * 2021-01-12 2021-05-11 山东师范大学 Depth semantic segmentation image compression method and system based on fusion perception loss
CN113095358A (en) * 2021-03-05 2021-07-09 北京中电联达信息技术有限公司 Image fusion method and system
CN113191393A (en) * 2021-04-07 2021-07-30 山东师范大学 Contrast-enhanced energy spectrum mammography classification method and system based on multi-modal fusion
CN113222879A (en) * 2021-07-08 2021-08-06 中国工程物理研究院流体物理研究所 Generation countermeasure network for fusion of infrared and visible light images
CN113744173A (en) * 2021-07-30 2021-12-03 山东师范大学 Image fusion method, system, computer readable storage medium and electronic device
CN113762277A (en) * 2021-09-09 2021-12-07 东北大学 Multi-band infrared image fusion method based on Cascade-GAN
CN113781377A (en) * 2021-11-03 2021-12-10 南京理工大学 Infrared and visible light image fusion method based on antagonism semantic guidance and perception
US20220130139A1 (en) * 2022-01-05 2022-04-28 Baidu Usa Llc Image processing method and apparatus, electronic device and storage medium
CN115423734A (en) * 2022-11-02 2022-12-02 国网浙江省电力有限公司金华供电公司 Infrared and visible light image fusion method based on multi-scale attention mechanism
CN115879516A (en) * 2023-03-02 2023-03-31 南昌大学 Data evidence obtaining method
WO2023070695A1 (en) * 2021-10-26 2023-05-04 烟台艾睿光电科技有限公司 Infrared image conversion training method and apparatus, device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
CN109118467A (en) * 2018-08-31 2019-01-01 武汉大学 Based on the infrared and visible light image fusion method for generating confrontation network
CN109614996A (en) * 2018-11-28 2019-04-12 桂林电子科技大学 The recognition methods merged based on the weakly visible light for generating confrontation network with infrared image
CN111260594A (en) * 2019-12-22 2020-06-09 天津大学 Unsupervised multi-modal image fusion method

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112785661B (en) * 2021-01-12 2022-12-06 山东师范大学 Depth semantic segmentation image compression method and system based on fusion perception loss
CN112785661A (en) * 2021-01-12 2021-05-11 山东师范大学 Depth semantic segmentation image compression method and system based on fusion perception loss
CN112446828B (en) * 2021-01-29 2021-04-13 成都东方天呈智能科技有限公司 Thermal imaging super-resolution reconstruction method fusing visible image gradient information
CN112446828A (en) * 2021-01-29 2021-03-05 成都东方天呈智能科技有限公司 Thermal imaging super-resolution reconstruction method fusing visible image gradient information
CN113095358A (en) * 2021-03-05 2021-07-09 北京中电联达信息技术有限公司 Image fusion method and system
CN113191393A (en) * 2021-04-07 2021-07-30 山东师范大学 Contrast-enhanced energy spectrum mammography classification method and system based on multi-modal fusion
CN113222879B (en) * 2021-07-08 2021-09-21 中国工程物理研究院流体物理研究所 Generation countermeasure network for fusion of infrared and visible light images
CN113222879A (en) * 2021-07-08 2021-08-06 中国工程物理研究院流体物理研究所 Generation countermeasure network for fusion of infrared and visible light images
CN113744173A (en) * 2021-07-30 2021-12-03 山东师范大学 Image fusion method, system, computer readable storage medium and electronic device
CN113744173B (en) * 2021-07-30 2024-06-21 山东师范大学 Image fusion method, system, computer readable storage medium and electronic device
CN113762277A (en) * 2021-09-09 2021-12-07 东北大学 Multi-band infrared image fusion method based on Cascade-GAN
CN113762277B (en) * 2021-09-09 2024-05-24 东北大学 Multiband infrared image fusion method based on Cascade-GAN
WO2023070695A1 (en) * 2021-10-26 2023-05-04 烟台艾睿光电科技有限公司 Infrared image conversion training method and apparatus, device and storage medium
CN113781377A (en) * 2021-11-03 2021-12-10 南京理工大学 Infrared and visible light image fusion method based on antagonism semantic guidance and perception
US11756288B2 (en) * 2022-01-05 2023-09-12 Baidu Usa Llc Image processing method and apparatus, electronic device and storage medium
US20220130139A1 (en) * 2022-01-05 2022-04-28 Baidu Usa Llc Image processing method and apparatus, electronic device and storage medium
CN115423734A (en) * 2022-11-02 2022-12-02 国网浙江省电力有限公司金华供电公司 Infrared and visible light image fusion method based on multi-scale attention mechanism
CN115879516A (en) * 2023-03-02 2023-03-31 南昌大学 Data evidence obtaining method
CN115879516B (en) * 2023-03-02 2023-05-16 南昌大学 Data evidence obtaining method

Also Published As

Publication number Publication date
CN112001868B (en) 2024-06-11

Similar Documents

Publication Publication Date Title
CN112001868B (en) Infrared and visible light image fusion method and system based on generation of antagonism network
Chen et al. Denoising hyperspectral image with non-iid noise structure
Ma et al. Infrared and visible image fusion via detail preserving adversarial learning
Ma et al. GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
Kuang et al. Single infrared image optical noise removal using a deep convolutional neural network
CN110458844B (en) Semantic segmentation method for low-illumination scene
Zhong et al. Jointly learning the hybrid CRF and MLR model for simultaneous denoising and classification of hyperspectral imagery
CN110580472B (en) Video foreground detection method based on full convolution network and conditional countermeasure network
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
Wang et al. Cycle-snspgan: Towards real-world image dehazing via cycle spectral normalized soft likelihood estimation patch gan
CN112288668B (en) Infrared and visible light image fusion method based on depth unsupervised dense convolution network
CN104680510A (en) RADAR parallax image optimization method and stereo matching parallax image optimization method and system
Panigrahy et al. Parameter adaptive unit-linking dual-channel PCNN based infrared and visible image fusion
CN103020933B (en) A kind of multisource image anastomosing method based on bionic visual mechanism
Liu et al. Unsupervised Deep Hyperspectral Video Target Tracking and High Spectral-Spatial-Temporal Resolution (H³) Benchmark Dataset
CN104657951A (en) Multiplicative noise removal method for image
Chen et al. Image denoising via deep network based on edge enhancement
Yuan et al. FLGC‐Fusion GAN: An Enhanced Fusion GAN Model by Importing Fully Learnable Group Convolution
Liu et al. An attention-guided and wavelet-constrained generative adversarial network for infrared and visible image fusion
CN113762277B (en) Multiband infrared image fusion method based on Cascade-GAN
Su et al. GeFuNet: A knowledge-guided deep network for the infrared and visible image fusion
Wu et al. DCFusion: A dual-frequency cross-enhanced fusion network for infrared and visible image fusion
Kim et al. Infrared and visible image fusion using a guiding network to leverage perceptual similarity
CN116883303A (en) Infrared and visible light image fusion method based on characteristic difference compensation and fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant