CN113781377B - Infrared and visible light image fusion method based on adversarial semantic guidance and perception - Google Patents


Info

Publication number
CN113781377B
CN113781377B (application CN202111292602.5A)
Authority
CN
China
Prior art keywords
image
fusion
network
discriminator
visible light
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111292602.5A
Other languages
Chinese (zh)
Other versions
CN113781377A (en)
Inventor
滕之杰
韩静
陈霄宇
李怡然
冯琳
张权
魏驰恒
张靖远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202111292602.5A priority Critical patent/CN113781377B/en
Publication of CN113781377A publication Critical patent/CN113781377A/en
Application granted granted Critical
Publication of CN113781377B publication Critical patent/CN113781377B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an infrared and visible light image fusion method based on adversarial semantic guidance and perception, which comprises the following steps: 1. constructing the fusion network ASGGAN; 2. learning optimization, in which a segmentation network is used as a discriminator and forms a generative adversarial relationship with the fusion network, the two being continuously optimized during adversarial learning; 3. obtaining global and local GAN loss functions; 4. adding segmentation labels as a spatial prior of the discriminator to optimize fusion; 5. comprehensive evaluation. By using the segmentation network to transfer semantic information into image fusion, the invention enhances the target saliency of the fused image; the U-shaped discriminator preserves the global structural features and local textures of the image, so that the fused image has a natural appearance.

Description

Infrared and visible light image fusion method based on adversarial semantic guidance and perception
Technical Field
The invention relates to an infrared and visible light image fusion method based on adversarial semantic guidance and perception, and belongs to the technical field of image processing.
Background
Throughout the development of the image processing field, image fusion has remained a central research topic. Infrared and visible image fusion is among the most important problems in image fusion because of its widespread use in remote sensing, medicine and autonomous driving. Owing to differences in detector principles and properties, visible and infrared images differ markedly, and each has its own advantages and disadvantages. A visible light image usually contains rich texture detail and has higher resolution than an infrared image, but its quality is easily degraded by the external environment; for example, important target information is often lost under insufficient illumination at night, low visibility in fog, or vegetation occlusion. In contrast, an infrared image is produced by a detector imaging the temperature characteristics or emissivity of objects. This imaging mechanism makes infrared images stable against the external environment, so they often capture information missed by the visible image, and targets whose thermal radiation differs markedly from the background appear more salient and are easier to detect. However, infrared images lack detail information, their quality does not match human visual perception, and their resolution is usually lower than that of visible images. Fusing the two complementary images into a single high-quality image is therefore a practical requirement in many applications, including remote sensing, military use, video surveillance and medical treatment.
In the past few years, many image fusion methods have been proposed. They can be broadly divided into two main categories: traditional image fusion algorithms and deep-learning-based image fusion algorithms. Traditional methods can be further subdivided into multi-scale transform methods, sparse representation methods, low-rank representation methods, subspace-based methods, saliency-based methods, and so on. However, traditional algorithms rely on hand-crafted design; to obtain better fusion results their design rules have become increasingly complex, which makes practical application difficult and computation time excessive. At the same time, many traditional algorithms ignore the semantic information of the image and the saliency of targets, so that infrared targets in the fused image are blurred and difficult to recognize.
With the recent rise of deep learning and neural networks, methods applying deep learning to infrared and visible light image fusion have also emerged. Deep-learning-based methods can be broadly divided into two types: 1. image fusion methods based on end-to-end trained neural networks, and 2. image fusion methods based on generative adversarial networks. Although existing deep-learning-based infrared and visible image fusion methods achieve good results to a certain extent, they still have shortcomings: 1. Because infrared and visible image fusion is an unsupervised task without fusion-image labels, existing fusion networks often use subjective losses directly, attach importance only to the global structure while ignoring local spatial information, and the fused image often contains noise. 2. Existing fusion networks ignore the high-level semantic information in the image and usually focus on global fusion; target and background are mixed together, attention to the target is lost, local fusion is poor, and target saliency is reduced.
Among these methods, generative adversarial networks (GAN) are particularly effective for such unsupervised image fusion tasks. They do not require complex fusion criteria: the adversarial discriminator is typically used to guide the generated image toward the appearance of the visible image, while a suitably designed loss controls the infrared and visible components, yielding excellent fusion results. However, current GAN-based fusion networks also neglect target saliency and the local fusion quality of the image, ignoring the role of high-level image semantics and the importance of the target, which reduces target saliency.
Thanks to the development of deep learning in recent years, semantic segmentation has made great progress. Deep-learning-based semantic segmentation networks aim to mine high-level semantic features of an image while restoring the resolution of the original image. An image that is easy to segment is also an image with good target saliency. As a classical pixel-level task, semantic segmentation can mine image semantics and plays an important auxiliary role in other tasks; this property has been used to improve the performance of many deep-learning-based tasks. Applied to image fusion, semantic segmentation has also been shown to guide the fusion task effectively. However, existing methods of this kind must obtain segmentation labels of the images before fusion at test time and add them as a prior during fusion, so labeling effort is still required during testing.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an infrared and visible light image fusion method based on adversarial semantic guidance and perception, with the following specific technical scheme:
The infrared and visible light image fusion method based on adversarial semantic guidance and perception comprises the following steps:
Step 1: constructing the fusion network ASGGAN, namely building a two-path visible and infrared image fusion network ASGGAN on top of a simply structured generative adversarial network and optimizing it through the guidance of a discriminator and a loss function;
Step 2: learning optimization, namely using a segmentation network as a discriminator so that the segmentation network and the fusion network form a generative adversarial relationship; the two are continuously optimized during adversarial learning, and the loss between the segmentation prediction and the segmentation label guides the fused image toward target saliency;
Step 3: obtaining global and local GAN loss functions by using a U-shaped discriminator structure, so that the fusion network attends not only to the global information of the image but also to its local information;
Step 4: adding segmentation labels, namely feeding the segmentation label to the discriminator as a spatial prior to optimize fusion;
Step 5: comprehensive evaluation, namely showing through qualitative subjective evaluation and quantitative objective evaluation indexes that the proposed ASGGAN achieves a better image fusion effect than other infrared and visible light image fusion methods.
Further, the fusion network ASGGAN in step 1 includes a generator and a discriminator. The generator generates the image and adopts a fully convolutional structure with two Encoder paths and a single Decoder path; the discriminator distinguishes the false image produced by the generator from the real image. The generator and the discriminator are optimized continuously, so that the generator learns to produce false images that deceive the discriminator, while the discriminator strengthens its ability to distinguish the generated false images from real images.
Further, the discriminator comprises a perception discriminator and a semantic discriminator. The perception discriminator shortens the distribution distance between the visible image and the fused image, so that the visible-light appearance of the fused image is more natural; the semantic discriminator segments the fused image, and the segmentation loss produced by the segmentation network in turn drives the fusion network to perform image fusion. The perception discriminator is a U-shaped discriminator consisting of an Encoder and a Decoder, through which it performs global and local discrimination of the image; the semantic discriminator uses an RPNet segmentation network to compute the segmentation loss.
Further, in step 3, the GAN family includes derived variants such as DCGAN and LSGAN: DCGAN replaces the multi-layer perceptrons of the generator and discriminator in the original GAN with convolutional neural networks for feature extraction, and LSGAN replaces the cross-entropy loss of the GAN with a least-squares loss, which improves the quality of generated images and stabilizes GAN training.
Further, in step 4, the segmentation label is input into the network structure only during network training to perform image fusion.
Further, the loss function includes a discriminator loss function, a segmentation network loss function and a generator loss function, where the discriminator loss function is used to train the discriminator, as shown in equation (1):

L_{D_U} = L_{D_U^{enc}} + L_{D_U^{dec}}    (1)

where L_{D_U} denotes the loss function of the overall discriminator, L_{D_U^{enc}} denotes the loss function of the global information output by the Encoder of the discriminator, and L_{D_U^{dec}} denotes the loss function of the local information output by the Decoder of the discriminator.

The segmentation network loss function is shown in equation (2):

L_{seg} = -\frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{c=1}^{N} I_{label}^{(i,j,c)} \log I_{pred}^{(i,j,c)}    (2)

where I_{label}^{(i,j,c)} denotes the value of the one-hot vector of the segmentation label I_{label} at pixel (i, j) in the c-th channel, I_{pred}^{(i,j,c)} denotes the output probability of the c-th channel of the output probability map at pixel (i, j), N is the number of channels, and W and H are the width and height of the image.

The generator loss function includes a perceptual adversarial loss L_{adv}, a semantic adversarial loss L_{sem} and a detail loss L_{detail}, as shown in equation (3):

L_G = L_{adv} + \lambda_1 L_{sem} + \lambda_2 L_{detail}    (3)

where L_G denotes the overall loss function of the generator, and \lambda_1 and \lambda_2 are hyperparameters balancing the weights of the three losses.
Further, the objective evaluation indexes comprise AG, EI, SF and EN. The AG index measures the sharpness of the fused image and is defined as

AG = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} \sqrt{\frac{\Delta F_x(i,j)^2 + \Delta F_y(i,j)^2}{2}}    (4)

where M and N are the width and height of the fused image, (i, j) is the position of a pixel in the fused image F, and \Delta F_x and \Delta F_y are the grey-level differences of F along the horizontal and vertical directions; the larger the AG value, the sharper the fused image and the better its quality.

The EI index computes the edge strength of the fused image:

EI = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} \sqrt{s_x(i,j)^2 + s_y(i,j)^2}    (5)

s_x = F * h_x, \quad s_y = F * h_y    (6)

where h_x and h_y are the Sobel operators in the x and y directions and s_x and s_y are the corresponding edge responses; the larger the EI value, the better the quality of the fused image.

The SF index computes the rate of change of the image grey levels:

RF = \sqrt{\frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=2}^{N} \big(F(i,j) - F(i,j-1)\big)^2}    (7)

CF = \sqrt{\frac{1}{M \times N} \sum_{i=2}^{M} \sum_{j=1}^{N} \big(F(i,j) - F(i-1,j)\big)^2}    (8)

SF = \sqrt{RF^2 + CF^2}    (9)

where RF is the row spatial frequency and CF is the column spatial frequency; the larger the SF value, the better the quality of the fused image.

The EN index computes the amount of information contained in the image:

EN = -\sum_{l=0}^{255} p_l \log_2 p_l    (10)

where p_l is the statistical probability of grey level l in the grey-level histogram; the larger the EN value, the more information the image contains and the better the fusion quality.
The invention has the following beneficial effects:
By using the segmentation network to transfer semantic information into image fusion, the target saliency of the fused image is enhanced; by using the U-shaped discriminator, the global structural features and local textures of the image are preserved during fusion, so that the generated fused image has a natural appearance; by adding the segmentation label as prior information of the discriminator, the adversarial learning of the fusion process is improved.
Drawings
Figure 1 is a flow chart of the method of the present invention,
Figure 2 is a schematic diagram of the ASGGAN network architecture of the present invention,
Figure 3 is a schematic diagram of the generator structure of the present invention,
Figure 4 is a schematic diagram of the perception discriminator of the present invention,
Figure 5 is a schematic comparison of a visible light image and the corresponding infrared image of the present invention,
Figure 6 is a comparison of Figure 5 with and without the U-shaped discriminator,
Figure 7 is a comparison of Figure 5 with and without the label-conditioned discriminator.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic representations which merely illustrate the basic structure of the invention and therefore show only the structures which are relevant to the invention.
As shown in Fig. 1, the invention discloses an infrared and visible light image fusion method based on adversarial semantic guidance and proposes an infrared and visible image fusion network ASGGAN (Adversarial Semantic Guiding GAN). First, the method is built on a simply structured generative adversarial network: no complex fusion rules need to be designed as in traditional algorithms, and the fusion network is optimized through the guidance of a discriminator and the loss, yielding a fusion network with better performance. Second, the segmentation network is used as a discriminator and forms an adversarial relationship with the fusion network; the two are continuously optimized during adversarial learning, and the loss between the segmentation prediction and the segmentation label guides the fused image toward target saliency. Third, a U-shaped discriminator structure is used to obtain global and local GAN losses, so that the fusion network attends not only to global image information but also to local image information. Meanwhile, the segmentation label is added to the discriminator as a spatial prior to optimize fusion. Finally, qualitative subjective evaluation and quantitative objective evaluation indexes show that ASGGAN achieves a better image fusion effect than other infrared and visible light image fusion methods.
First, generative adversarial networks. Goodfellow et al. first proposed the concept of the generative adversarial network, which has had a profound and widespread impact in the field of image generation. A GAN generally consists of a generator G and a discriminator D. For the image generation task, the generator G is responsible for producing images: during its optimization it takes random noise z as input, and its goal is to generate a fake image G(z) that can deceive the discriminator D. The task of the discriminator D is to distinguish the fake image G(z) produced by the generator from the real image x; its goal during optimization is to continuously strengthen this ability. Generator and discriminator are optimized alternately in this two-sided adversarial process: the generator pushes its fake images toward the real ones, i.e. it continuously reduces the distance between the data distribution P_z of generated images and the data distribution P_data of real images, while the discriminator keeps improving its ability to tell the two apart, which in turn drives the two distributions closer. The adversarial play between generator and discriminator forms a zero-sum game that completes the optimization task: the generation ability of the generator and the discrimination ability of the discriminator are both enhanced, and the quality of the generated fake images approaches that of the real images. To make the generated images better and adversarial training more stable, a series of GAN variants has been derived. DCGAN replaces the multi-layer perceptrons of the generator and discriminator in the original GAN with convolutional neural networks for feature extraction. LSGAN replaces the cross-entropy loss with a least-squares loss, improving image quality and stabilizing training. CGAN adds extra conditions to the GAN so that the generation process becomes controllable. WGAN introduces the Wasserstein distance; with only small adjustments to the original GAN it achieves surprisingly good results, alleviating mode collapse, training difficulty and instability to a certain extent, and its generator loss can indicate training progress. BigGAN maximizes GAN performance by increasing parameters and batch size, applies the truncation trick to make training more stable, and strikes a balance between training stability and network performance. U-Net GAN is a GAN with excellent image generation quality in recent years, achieving state-of-the-art results on several datasets. Building on BigGAN, U-Net GAN changes the discriminator into a UNet structure: the Encoder performs global discrimination of the input image and the Decoder performs detailed, pixel-level discrimination, so the generated images are of higher quality and more realistic in texture detail.
The discriminator of the network structure proposed by the invention adopts the idea of U-Net GAN: it is designed as a simple U-shaped network that classifies and discriminates the global fusion effect, improving the texture detail of the fused image to a certain extent.
Second, semantic segmentation. Semantic segmentation is a fundamental topic throughout the development of computer vision. It refers to classifying pixels at fine granularity after the original image has been processed by traditional methods or neural networks. Globally, image segmentation recognizes the content of the image, segments and localizes that content, and performs a classification task. With the spread of deep learning, the performance of semantic segmentation has taken a great leap, and semantic segmentation in turn plays a role in other deep-learning-based tasks. FCN was the earliest network to make a major breakthrough in image segmentation with deep learning: it turns the fully connected layers of a neural network into convolutional layers, producing a fully convolutional network that accepts inputs of any size, and finally restores the original image size with transposed convolution, enabling pixel-level segmentation prediction. UNet was the earliest neural network used for medical image segmentation, and later segmentation networks borrow more or less from its architecture. The two biggest features of the UNet structure proposed by Olaf et al. are its U-shaped network design and its skip connections: each time the Encoder of UNet downsamples, the Decoder upsamples once; the U-shaped structure better extracts high-level semantic features, and the skip connections continuously supply shallow detail information to the Decoder. SegNet, to recover the detail lost during downsampling, stores the positions of the maxima during max pooling and uses these positions at the Decoder during upsampling to restore image information, improving segmentation metrics. PSPNet designs a pyramid pooling module (PPM) that fuses pyramid features of different scales for better scene parsing, using image context to assist segmentation. The Deeplab series proposed by Google aims to solve the loss of spatial resolution that commonly accompanies feature extraction in segmentation, and first introduced dilated (atrous) convolution into image segmentation. The latest Deeplabv3+ integrates a spatial pyramid pooling module (SPPM) with an Encoder-Decoder, makes full use of the Xception depthwise separable convolution structure, and refines edge details. Because semantic segmentation networks tend to have huge numbers of parameters while practical engineering applications often demand real-time performance, the use of many large segmentation networks is limited; against this background, many real-time semantic segmentation networks have been proposed, such as ENet, ERFNet, ICNet and BiseNet.
RPNet proposes a feature residual pyramid: the shallow levels of the residual pyramid attend more to detailed textures while the higher levels attend more to semantic attributes, and the complete scene is finally synthesized in a pyramid fashion, which helps the network improve detail and edge information. The semantic segmentation network used here is the real-time segmentation network RPNet; its small parameter count facilitates fast joint training with ASGGAN, making gradient backpropagation easier and network convergence faster. Because semantic segmentation can extract semantic features of the image and perform pixel-level classification, it is often used to assist other deep-learning-based networks and improve the performance of other tasks. In the field of image fusion, Hou et al. also use a semantic segmentation mask as a prior of the image fusion network, dividing the image into foreground and background for high-quality fusion so that the fused image retains more information. The ASGGAN disclosed by the invention likewise uses the semantic information extracted by the segmentation network to guide the fusion network in generating the fused image.
In the network structure disclosed by the invention, the label is fed into the network only during training for image fusion; no segmentation label needs to be added at test time, which avoids the extra manual labeling work that would otherwise be introduced in actual use. The ASGGAN framework of the invention, shown in Fig. 2, consists of three main parts: a generator, a perception discriminator and a semantic discriminator. In the training phase, the generator takes the four-channel RGB-T image as input: the RGB image I_vis and the infrared image I_ir pass through the two Encoders of the generator respectively, the outputs of the two Encoders are concatenated and fed to the Decoder, and the output is a single-channel fused image I_f_y. This single-channel fused image is used as the luminance channel; together with the color channels of the visible image it is converted to an RGB image, giving the final fused image I_f. The perception discriminator takes only the luminance channel I_vis_y of the visible image and the fused image I_f_y as input for discrimination, so that the fused image acquires a natural overall appearance closer to the visible image; the loss of the perception discriminator is not mainly used to enhance visible detail but to shorten the distribution distance between the visible and fused images, giving the fused image a natural visible-light appearance. The semantic discriminator can be regarded as another discriminator: it segments the RGB fused image I_f to obtain the segmentation prediction map I_pred, and the segmentation loss produced by the segmentation network in turn drives the fusion network, so that the fusion network tends to generate images that improve the segmentation metrics, i.e. fused images containing more salient semantic information. The two thus establish a pair of adversarial relationships analogous to generator and discriminator. It is assumed here that the infrared image contains richer and more salient target information than the visible image, so through this set of adversarial relationships the fused image incorporates the infrared components with greater target saliency, generating an image that is easy to segment, which also represents an improvement of target saliency over the visible image. In the test phase, the perception discriminator and the semantic discriminator are discarded; the RGB-T image is fed into the generator network to obtain the fused luminance channel image, which is then combined with the visible color channels and converted to RGB to obtain the final fused image. Thus, feeding segmentation labels as prior information at the input is avoided during testing.
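The following is a minimal test-time sketch of this pipeline, assuming a trained generator that maps a visible luminance channel and an infrared image to a fused luminance channel in [-1, 1]; the function name fuse_rgbt and the normalisation convention are illustrative assumptions, not part of the patent.

```python
import cv2
import numpy as np
import torch

def fuse_rgbt(generator: torch.nn.Module, vis_rgb: np.ndarray, ir: np.ndarray) -> np.ndarray:
    """vis_rgb: HxWx3 uint8 visible image, ir: HxW uint8 infrared image -> HxWx3 uint8 fused RGB."""
    # Split the visible image into luminance (Y) and colour (Cr, Cb) channels.
    vis_y, cr, cb = cv2.split(cv2.cvtColor(vis_rgb, cv2.COLOR_RGB2YCrCb))

    # Normalise to [-1, 1] (an assumed convention matching a tanh-ending generator).
    to_tensor = lambda a: torch.from_numpy(a).float().div(127.5).sub(1.0)[None, None]
    with torch.no_grad():
        fused_y = generator(to_tensor(vis_y), to_tensor(ir))      # 1x1xHxW in [-1, 1]
    fused_y = ((fused_y.squeeze().numpy() + 1.0) * 127.5).clip(0, 255).astype(np.uint8)

    # Recombine the fused luminance with the visible colour channels and convert back to RGB.
    return cv2.cvtColor(cv2.merge([fused_y, cr, cb]), cv2.COLOR_YCrCb2RGB)
```

At test time only the generator is needed, which is why no segmentation label appears in this sketch.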
The network structure of the generator, shown in Fig. 3, uses two Encoder paths and a single Decoder path, and the whole network is fully convolutional. The visible image I_vis and the infrared image I_ir are fed into the two Encoders respectively; the two Encoders have essentially the same structure, every convolution layer uses 3x3 convolution, the scale of the feature maps is kept unchanged, and the number of convolution channels inside the Encoder increases continuously. To prevent loss of image information there is no pooling layer in the whole process. At the same time, drawing on DenseNet, each path in the Encoder is densely connected backwards, continuously supplementing the information of earlier features so that shallow features can be reused effectively in deep convolutions, which helps the fused image retain more detail. The feature maps output by the visible Encoder path and the infrared Encoder path are fused by concatenation and fed into the Decoder. In the Decoder, the number of channels of the feature maps decreases gradually; finally a sigmoid activation produces a two-channel probability map serving as the probability distribution of the visible and infrared components in the fused image. The visible luminance channel image I_vis_y and the infrared image I_ir are multiplied pointwise by the corresponding channels, the two products are added, and the final fused image is obtained through a tanh activation. Experiments confirm that this operation prevents targets from turning black under the action of the segmentation network. In the overall architecture of the generator, every convolution layer uses spectral normalization (SN). To prevent gradient explosion or vanishing and to accelerate convergence, batch normalization (BN) is added. The activation function is LeakyReLU: in contrast to ReLU, which discards the negative part of the feature maps during fusion and thus loses information for the fusion task, LeakyReLU retains this information.
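A condensed PyTorch sketch of a generator of this shape follows for illustration; the channel widths, growth rate and number of dense layers are assumptions, since the text does not fix them.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def conv_block(in_ch, out_ch):
    # 3x3 convolution, stride 1 (no pooling), with spectral norm, batch norm and LeakyReLU
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_ch, out_ch, 3, padding=1)),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class DenseEncoder(nn.Module):
    """Each layer receives the concatenation of the input and all previous outputs (DenseNet-style)."""
    def __init__(self, in_ch=1, growth=16, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [conv_block(in_ch + i * growth, growth) for i in range(n_layers)]
        )

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats[1:], dim=1)        # N x (growth * n_layers) x H x W

class FusionGenerator(nn.Module):
    def __init__(self, growth=16, n_layers=4):
        super().__init__()
        self.enc_vis = DenseEncoder(1, growth, n_layers)
        self.enc_ir = DenseEncoder(1, growth, n_layers)
        self.decoder = nn.Sequential(             # channel count shrinks toward the output
            conv_block(2 * growth * n_layers, 64),
            conv_block(64, 32),
            conv_block(32, 16),
            spectral_norm(nn.Conv2d(16, 2, 3, padding=1)),
            nn.Sigmoid(),                         # 2-channel probability map: visible vs. infrared weight
        )

    def forward(self, vis_y, ir):
        w = self.decoder(torch.cat([self.enc_vis(vis_y), self.enc_ir(ir)], dim=1))
        # Pointwise weighting of the two sources, then tanh, gives the fused luminance channel.
        return torch.tanh(w[:, 0:1] * vis_y + w[:, 1:2] * ir)
```

The weighted-sum output corresponds to the described design choice of predicting per-pixel visible/infrared weights instead of synthesizing the fused pixels directly.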
The structure of the perception discriminator is shown in Fig. 4. The discriminator follows the structure of U-Net GAN and adopts a conditional U-shaped discriminator. Unlike other conventional discriminators that have only an Encoder, a Decoder is added, building a simple U-shaped discriminator. The discriminator comprises an Encoder part and a Decoder part and can perform global and local discrimination of images, so that the fused image tends toward the appearance of the visible image. The Encoder part takes the luminance channel I_vis_y of the visible image or the fused single-channel image I_f_y as input in an unpaired manner; the segmentation label is concatenated with the input and fed to the discriminator simultaneously as an auxiliary condition. By adding the segmentation label, the discriminator can judge the fused image on this higher-quality basis, which helps optimize the details of the fused image, makes a reasonable spatial judgment of the fused image based on high-level semantics, and imposes a certain constraint on pixel-level fusion. In other words, the U-shaped discriminator is given certain high-level semantic information, drives image fusion based on this semantic information, and increases the amount of information in the fused image. After the input enters the Encoder, its fully convolutional structure keeps increasing the number of channels while halving the size of the feature map at each convolution; the Encoder extracts global features throughout, and finally a global pooling layer and a fully connected layer give the global discrimination result. This global result judges the overall appearance of the fused image against the visible image, constraining the whole-image characteristics of the fused image to strengthen its overall appearance and make it more natural. At the Decoder end, the high-level features of the Encoder undergo successive transposed convolutions; each transposed convolution reduces the number of channels and enlarges the feature map, forming a symmetric counterpart to the Encoder. Skip connections are used at every layer to supplement earlier information, effectively reusing information lost in the Encoder as the feature size shrank. After the feature map is restored to the original image size, one convolution tidies it up to yield a decision at the original image size. Such a decision can be understood as a pixel-level decision in image space: it judges the local texture of the fused image and gives the generator spatial feedback. In the fusion task, this spatial decision lets the local texture details of the fused image take on the appearance of the visible image and enhances the naturalness of the fused image from a local perspective. Every layer of the discriminator uses spectral normalization, which improves the stability of GAN training. As in the generator, batch normalization is used in every layer and LeakyReLU serves as the activation function.
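For illustration, a simplified sketch of such a conditional U-shaped discriminator is given below; the number of stages (three here), the channel widths and the label channel count are assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def down(in_ch, out_ch):      # stride-2 convolution: halves the spatial size, grows the channels
    return nn.Sequential(spectral_norm(nn.Conv2d(in_ch, out_ch, 4, 2, 1)),
                         nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2, True))

def up(in_ch, out_ch):        # transposed convolution: doubles the spatial size, shrinks the channels
    return nn.Sequential(spectral_norm(nn.ConvTranspose2d(in_ch, out_ch, 4, 2, 1)),
                         nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2, True))

class UShapedDiscriminator(nn.Module):
    def __init__(self, in_ch=1, label_ch=4, base=32):
        super().__init__()
        c = in_ch + label_ch                      # the image is conditioned on the segmentation label
        self.d1, self.d2, self.d3 = down(c, base), down(base, 2 * base), down(2 * base, 4 * base)
        self.global_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                         nn.Linear(4 * base, 1))            # global, image-level score
        self.u1, self.u2, self.u3 = up(4 * base, 2 * base), up(4 * base, base), up(2 * base, base)
        self.local_head = spectral_norm(nn.Conv2d(base, 1, 3, padding=1))    # per-pixel (local) score map

    def forward(self, image, seg_label):
        x = torch.cat([image, seg_label], dim=1)
        e1 = self.d1(x); e2 = self.d2(e1); e3 = self.d3(e2)
        g = self.global_head(e3)                                 # N x 1
        y = self.u1(e3)
        y = self.u2(torch.cat([y, e2], dim=1))                   # skip connections reuse encoder features
        y = self.u3(torch.cat([y, e1], dim=1))
        return g, self.local_head(y)                             # (global score, H x W local score map)
```

The two outputs correspond to the global and local decisions described above; the label concatenation at the input is the conditional prior.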
The semantic discriminator uses the RPNet segmentation network, which is based on a residual pyramid, has a small number of parameters, fast inference and good segmentation performance. The fused single-channel image is combined with the visible color channels and converted to an RGB image, which is fed into the RPNet segmentation network to obtain a probability map with as many channels as classes, from which the segmentation loss is computed. The segmentation network acts as a discriminator: on the one hand it continuously strengthens its ability to mine the semantic features of the fused image, and on the other hand the semantic information guides the generator to fuse images with better target saliency. The constraint of the segmentation loss drives the segmentation network to learn the semantics of the fused image, thereby guiding the fused image to be fused appropriately in space and achieving high-quality fusion. Compared with the many fusion networks that directly apply an MSE loss between the fused image and the infrared image, the semantic discriminator analyses the fused image through its high-level features to guide fusion; it takes the spatial distribution of the fused image into account rather than a coarse global mean-squared-error penalty.
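A short sketch of how this semantic discriminator can be wired up is shown below; here seg_net stands in for any real-time segmentation model returning per-class logits (RPNet in the patent), and the function name is an assumption.

```python
import torch
import torch.nn.functional as F

def semantic_adversarial_loss(seg_net, fused_rgb, seg_label):
    """fused_rgb: Nx3xHxW fused image, seg_label: NxHxW integer class map."""
    logits = seg_net(fused_rgb)                   # NxCxHxW class scores at full resolution
    return F.cross_entropy(logits, seg_label)     # pixel-wise cross-entropy, i.e. Eq. (2)

# The same loss is used twice per training iteration:
#   1) with the generator frozen, it updates seg_net, which keeps mining semantics from fused images;
#   2) with seg_net frozen, it updates the generator, so fused images become easier to segment and
#      infrared target saliency is injected where it helps segmentation.
```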
The loss functions in the ASGGAN disclosed by the invention comprise a discriminator loss function, a segmentation network loss function and a generator loss function, used to train the discriminator, the segmentation network and the generator respectively. During training the perception discriminator continuously strengthens its ability to distinguish the visible image from the fused image and continuously feeds this back to the generator: when the visible luminance channel image I_vis_y is input the discriminator should judge true, and when the fused image I_f_y is input it should judge false. Denote the discriminator by D_U; it consists of an Encoder and a Decoder, written D_U^{enc} and D_U^{dec}, which output two losses: the Encoder of the discriminator outputs the global-information loss L_{D_U^{enc}} and the Decoder outputs the local-information loss L_{D_U^{dec}}, as shown in equation (1):

L_{D_U} = L_{D_U^{enc}} + L_{D_U^{dec}}    (1)

Equation (1) is the loss function of the overall discriminator. Let the input visible luminance channel image be I_vis_y and the input fused image be I_f_y; then the loss of the Encoder is

L_{D_U^{enc}} = \mathbb{E}\big[\max(0, 1 - D_U^{enc}(I_{vis\_y}))\big] + \mathbb{E}\big[\max(0, 1 + D_U^{enc}(I_{f\_y}))\big]

and the loss output by the Decoder end of the discriminator is

L_{D_U^{dec}} = \mathbb{E}\Big[\sum_{i,j}\max\big(0, 1 - D_U^{dec}(I_{vis\_y})_{(i,j)}\big)\Big] + \mathbb{E}\Big[\sum_{i,j}\max\big(0, 1 + D_U^{dec}(I_{f\_y})_{(i,j)}\big)\Big]

where D_U^{enc}(·) and D_U^{dec}(·)_{(i,j)} denote the decisions of the discriminator globally and at pixel (i, j) respectively; the specific form follows the hinge loss used in U-Net GAN. The two losses of the discriminator represent the global and local decision distances respectively, so as the discriminator is continuously strengthened it learns to make both global and local decisions.
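As a concrete illustration of the two hinge terms above, the following sketch assumes a U-shaped discriminator that returns a global score and a per-pixel score map (as in the discriminator sketch earlier); the function names are assumptions.

```python
import torch
import torch.nn.functional as F

def d_hinge(real_score, fake_score):
    # hinge loss: push real scores above +1 and fake scores below -1, averaged over all elements
    return F.relu(1.0 - real_score).mean() + F.relu(1.0 + fake_score).mean()

def discriminator_loss(disc, vis_y, fused_y, seg_onehot):
    """vis_y: visible luminance (real), fused_y: generator output (fake), seg_onehot: label condition."""
    g_real, l_real = disc(vis_y, seg_onehot)
    g_fake, l_fake = disc(fused_y.detach(), seg_onehot)        # detach: only the discriminator is updated
    return d_hinge(g_real, g_fake) + d_hinge(l_real, l_fake)   # L_D = L_D_enc + L_D_dec
```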
For the RPNet segmentation network, the auxiliary loss in RPNet is not employed, for simplicity. The fused image I_f is fed into RPNet to obtain the output segmentation result I_pred, and the ordinary cross-entropy loss of a segmentation network is computed between this result and I_label; the ASG module loss function is:

L_{seg} = -\frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{c=1}^{N} I_{label}^{(i,j,c)} \log I_{pred}^{(i,j,c)}    (2)

where I_{label}^{(i,j,c)} denotes the value of the one-hot vector of I_label at pixel (i, j) in the c-th channel, I_{pred}^{(i,j,c)} denotes the output probability of the c-th channel of the output probability map at pixel (i, j), N is the number of channels, and W and H are the width and height of the image. The loss function of the generator is composed of three parts: the perceptual adversarial loss L_{adv}, the semantic adversarial loss L_{sem} and the detail loss L_{detail}. The perceptual adversarial loss L_{adv} drives the fused image to be judged true by the discriminator, so that both the whole image and its local details tend toward the appearance of visible light. The semantic adversarial loss L_{sem} drives the fused image to be easy to segment; since the infrared image contains richer semantic information, L_{sem} is equivalent to adding the target-saliency information of the infrared image into the fused image, improving its target saliency. The detail loss L_{detail} is used to enhance the visible-light detail information of the fused image. When the adversarial loss is computed, the parameters of the discriminator are fixed and the parameters of the generator are trained; the goal of the generator is to produce fused images that fool the discriminator, i.e. make the discriminator judge true. The perceptual adversarial loss is calculated as:
L_{adv} = -\mathbb{E}\big[D_U^{enc}(I_{f\_y})\big] - \mathbb{E}\Big[\sum_{i,j} D_U^{dec}(I_{f\_y})_{(i,j)}\Big]

Training the generator with this adversarial loss gradually shortens the distance between the fused image and the visible image, giving the fused image the visual appearance of the visible image. Because the Encoder and the Decoder output two loss terms, the fused image is constrained by the visible image both globally and locally; as in the discriminator, the hinge loss of U-Net GAN is used during training. When the semantic adversarial loss L_{sem} is computed, the parameters of the segmentation network are fixed and the parameters of the generator are trained. The generator is adjusted continuously during training and gradually outputs fused images with higher segmentation metrics: the target saliency of the image becomes more obvious, the semantic information is more easily expressed in the fused image, and the effective information of the infrared image also gradually increases in the fused image. The formula of L_{sem} in the generator is the same cross-entropy formula used when training the semantic discriminator, i.e. equation (2). The global content loss of FusionGAN, a mean-squared error on infrared pixel values, can blur the overall fused image because of excessive infrared content; compared with that approach, the adversarial segmentation loss L_{sem} adds infrared components region by region, making the targets of the fused image more pronounced. The detail loss L_{detail} measures the distance between the gradients of the fused image and the visible image: we compute the gradients of the two and take the mean of the squared L2 norm of their difference,

L_{detail} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \big\| \nabla I_{f}(i,j) - \nabla I_{vis}(i,j) \big\|_2^2

where \nabla denotes the image gradient operation, (i, j) the position of a pixel, and W and H the width and height of the image. L_{detail} pulls the gradient of the fused image toward that of the visible image, so that the fused image contains richer detail. Combining the three losses gives the generator loss function:

L_G = L_{adv} + \lambda_1 L_{sem} + \lambda_2 L_{detail}    (3)

where \lambda_1 and \lambda_2 are hyperparameters balancing the weights of the three losses.
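A sketch of the combined generator objective of equation (3) follows; the default weights (10000 and 100) are taken from the experimental section below, while the Sobel-based gradient operator and its normalisation are assumptions.

```python
import torch
import torch.nn.functional as F

def image_gradient(x):
    # Sobel gradient magnitude of a single-channel batch Nx1xHxW
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3).contiguous()
    gx, gy = F.conv2d(x, kx, padding=1), F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def generator_loss(disc, seg_net, fused_y, fused_rgb, vis_y, seg_onehot, seg_label,
                   lam1=10000.0, lam2=100.0):
    g_fake, l_fake = disc(fused_y, seg_onehot)
    adv = -(g_fake.mean() + l_fake.mean())                                # hinge generator loss, global + local
    sem = F.cross_entropy(seg_net(fused_rgb), seg_label)                  # semantic adversarial term, Eq. (2)
    detail = F.mse_loss(image_gradient(fused_y), image_gradient(vis_y))   # gradient / detail term
    return adv + lam1 * sem + lam2 * detail                               # Eq. (3)
```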
The MFNet semantic segmentation dataset is adopted; it was created for semantic segmentation of visible and infrared images, and the scene is vehicle-mounted. The dataset contains 1569 RGB-T image pairs, of which 820 sets were captured in the daytime and 749 at night. Training is carried out with the night RGB-T data: the training set has 374 RGB-T image pairs, the validation set 187 and the test set 188. The image size is 640 x 480. Although this dataset was built for semantic segmentation, its RGB-T image pairs are mostly aligned and can therefore be used for image fusion. The night scenes are also rich, and situations common in driving such as dim light and glare appear frequently. Each RGB-T image pair has a corresponding label; the original segmentation data contain eight categories besides the unlabeled one, but only three classes common in driving are used here as segmentation labels: car, person and bicycle, of which the person class is the most evident in the infrared modality. In the training phase the images are augmented with random cropping to 400 x 400 pixels together with random translation and horizontal flipping. For the network loss hyperparameters, the segmentation loss weight is set to 10000 and the gradient loss weight to 100. In the training stage, with batch size s, the discriminator is trained M times first, then the segmentation network N times, and finally the generator once. The optimizer used in the invention is Adam, and the total number of training epochs is K. Through experiments, the parameters are set as s = 4, M = 2, N = 2, K = 300, and the number of training images num = 374. In the test stage, the two discriminators are discarded and only the generator is kept; data augmentation such as random cropping is removed, and the RGB-T test images are input at their original size to obtain the fusion result. The graphics card used for training and testing is an NVIDIA TITAN RTX, and the memory used is 32 GB.
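The training schedule described above (per batch: M perception-discriminator steps, N segmentation steps, one generator step) could be sketched as follows; the optimiser settings, the normalisation conventions and the differentiable ycrcb_to_rgb helper are illustrative assumptions, reusing the loss sketches given earlier.

```python
import torch
import torch.nn.functional as F

def ycrcb_to_rgb(y, cr, cb):
    # hypothetical differentiable YCrCb -> RGB conversion for tensors scaled to [0, 1]
    r = y + 1.403 * (cr - 0.5)
    g = y - 0.714 * (cr - 0.5) - 0.344 * (cb - 0.5)
    b = y + 1.773 * (cb - 0.5)
    return torch.cat([r, g, b], dim=1).clamp(0.0, 1.0)

def train(generator, disc, seg_net, loader, device, M=2, N=2, epochs=300, lr=2e-4, n_classes=4):
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_d = torch.optim.Adam(disc.parameters(), lr=lr)
    opt_s = torch.optim.Adam(seg_net.parameters(), lr=lr)
    for _ in range(epochs):
        for vis_y, ir, vis_cr, vis_cb, seg_label in loader:          # 400x400 random RGB-T crops
            vis_y, ir = vis_y.to(device), ir.to(device)
            vis_cr, vis_cb = vis_cr.to(device), vis_cb.to(device)
            seg_label = seg_label.to(device)
            seg_onehot = F.one_hot(seg_label, n_classes).permute(0, 3, 1, 2).float()

            for _ in range(M):                                       # 1) perception discriminator steps
                fused_y = generator(vis_y, ir)
                opt_d.zero_grad()
                discriminator_loss(disc, vis_y, fused_y, seg_onehot).backward()
                opt_d.step()

            for _ in range(N):                                       # 2) segmentation network steps
                fused01 = generator(vis_y, ir).mul(0.5).add(0.5)     # tanh output mapped back to [0, 1]
                fused_rgb = ycrcb_to_rgb(fused01, vis_cr, vis_cb).detach()
                opt_s.zero_grad()
                semantic_adversarial_loss(seg_net, fused_rgb, seg_label).backward()
                opt_s.step()

            fused_y = generator(vis_y, ir)                           # 3) one generator step
            fused_rgb = ycrcb_to_rgb(fused_y.mul(0.5).add(0.5), vis_cr, vis_cb)
            opt_g.zero_grad()
            generator_loss(disc, seg_net, fused_y, fused_rgb, vis_y,
                           seg_onehot, seg_label).backward()
            opt_g.step()
```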
The invention adopts AG (average gradient), EI (edge intensity), SF (spatial frequency) and EN (entropy); these indexes quantitatively evaluate image quality on the basis of image characteristics and information theory respectively, so the image fusion quality can be evaluated comprehensively. Because the fused image is described in terms of saliency, image quality is compared with other methods under the mask of the segmentation label. The AG index measures the sharpness of the fused image:

AG = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} \sqrt{\frac{\Delta F_x(i,j)^2 + \Delta F_y(i,j)^2}{2}}    (4)

where M and N are the width and height of the fused image, (i, j) is the position of a pixel in the fused image F, and \Delta F_x and \Delta F_y are the grey-level differences of F along the horizontal and vertical directions; the larger the AG value, the sharper the fused image and the better its quality. The EI index computes the edge strength of the fused image:

EI = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} \sqrt{s_x(i,j)^2 + s_y(i,j)^2}    (5)

s_x = F * h_x, \quad s_y = F * h_y    (6)

where h_x and h_y are the Sobel operators in the x and y directions and s_x and s_y are the corresponding edge responses; the larger the EI value, the better the quality of the fused image. The SF index computes the rate of change of the image grey levels:

RF = \sqrt{\frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=2}^{N} \big(F(i,j) - F(i,j-1)\big)^2}    (7)

CF = \sqrt{\frac{1}{M \times N} \sum_{i=2}^{M} \sum_{j=1}^{N} \big(F(i,j) - F(i-1,j)\big)^2}    (8)

SF = \sqrt{RF^2 + CF^2}    (9)

where RF is the row spatial frequency and CF is the column spatial frequency; the larger the SF value, the better the quality of the fused image. The EN index computes the amount of information contained in the image:

EN = -\sum_{l=0}^{255} p_l \log_2 p_l    (10)

where p_l is the statistical probability of grey level l in the grey-level histogram; the larger the EN value, the more information the image contains and the better the fusion quality. Because the image fusion task has many evaluation indexes that are hard to unify across methods, and a high objective score does not always coincide with good visual quality, the fused images are also evaluated subjectively and compared qualitatively with other fusion methods. Comparison on actual pictures shows that pedestrians, whose infrared radiation characteristics are obvious, are more salient in the proposed fusion method; compared with other image fusion methods, the information about people in the fused image is more remarkable.
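A NumPy sketch of the four metrics of equations (4)-(10) follows; the optional mask restricts AG and EI to the segmentation-label region as described above, while SF and EN are computed over the whole image here for simplicity (an assumption of this sketch).

```python
import cv2
import numpy as np

def fusion_metrics(img, mask=None):
    """img: HxW uint8 fused image; mask: optional HxW boolean region (e.g. the segmentation label)."""
    f = img.astype(np.float64)
    m = np.ones(f.shape, bool) if mask is None else mask.astype(bool)

    gy, gx = np.gradient(f)                                   # AG: average gradient, Eq. (4)
    ag = np.sqrt((gx ** 2 + gy ** 2) / 2.0)[m].mean()

    sx = cv2.Sobel(f, cv2.CV_64F, 1, 0, ksize=3)              # EI: edge intensity, Eqs. (5)-(6)
    sy = cv2.Sobel(f, cv2.CV_64F, 0, 1, ksize=3)
    ei = np.sqrt(sx ** 2 + sy ** 2)[m].mean()

    rf = np.sqrt(np.mean(np.diff(f, axis=1) ** 2))            # SF: spatial frequency, Eqs. (7)-(9)
    cf = np.sqrt(np.mean(np.diff(f, axis=0) ** 2))
    sf = np.sqrt(rf ** 2 + cf ** 2)

    hist, _ = np.histogram(img, bins=256, range=(0, 256))     # EN: entropy of the grey histogram, Eq. (10)
    p = hist / hist.sum()
    p = p[p > 0]
    en = -np.sum(p * np.log2(p))

    return {"AG": ag, "EI": ei, "SF": sf, "EN": en}
```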
Ablation experiments. To strengthen the detail appearance of the fused image, the invention designs a U-shaped discriminator so that the fused image is discriminated from both the global and the local perspective, enhancing the local details and appearance of the image. To prove its effectiveness, an experiment was run in which only the Decoder end of the U-shaped discriminator was removed while the overall framework of the original network was kept unchanged. As can be seen from Fig. 5 and Fig. 6, with the U-shaped discriminator the fused image contains richer visible-light detail, for example the visible information of billboards and lamplight. This shows that the U-shaped discriminator enhances the visible details of the fused image to a certain extent and strengthens its visible-light appearance. Infrared and visible information are in fact a pair of opposing components; nevertheless, without the Decoder the fused image does not gain more infrared detail than with the U-shaped discriminator. Conditional discriminator experiment. The segmentation label is fed to the input of the discriminator as a prior, so that the discriminator obtains certain spatial target-prior information and the network can select fusion rules region by region, giving it a degree of high-level semantic and spatial perception. The comparison is shown in Fig. 7. Clearly, after adding the label as the discriminator prior, the contrast of the lines on the ground is obviously improved and the detail information is richer, which means that adding the segmentation label as prior information optimizes the spatial distribution and appearance of the fusion network: the contrast of the image is enhanced, the sense of detail is improved, and more information enters the picture. Semantic discriminator experiment. The invention introduces local infrared information through the region-wise segmentation network; to show the effect of the semantic discriminator, the hyperparameter \lambda_1 of the generator's segmentation loss is set to 1000, 5000, 10000 and 20000 respectively. Clearly, when \lambda_1 is small the fused image has noticeably more visible detail but the infrared thermal features are less pronounced, meaning fewer infrared components are added. As \lambda_1 increases, the semantic guidance of the segmentation network gradually strengthens, and more components of the infrared image containing salient semantic information are added to the fused image. When \lambda_1 becomes too large, some visible detail is lost and the infrared component increases further. The hyperparameter therefore has to be adjusted reasonably; the invention selects \lambda_1 = 10000 to balance the detail information of visible light against the salient information of infrared.
The invention provides a novel infrared and visible light image fusion method, ASGGAN. Based on a generative adversarial network, it uses a segmentation network as one path of the discriminator and lets semantics guide the generator to perform spatially selective image fusion, so that the targets of the fused image have excellent saliency. A U-shaped discriminator performs both global and local discrimination, giving the image a better visible-light appearance in terms of both the whole and the local details. Meanwhile, the segmentation label is used as a prior to give the discriminator high-level semantic and spatial perception. Experiments on the MFNet dataset demonstrate that, both objectively and subjectively, the method of the invention has better image fusion performance than currently popular methods.
Taking the above-described preferred embodiments of the present invention as an illustration, persons skilled in the relevant art can make various changes and modifications without departing from the scope of the technical idea of the present invention. The technical scope of the present invention is not limited to the description, but must be determined according to the scope of the claims.

Claims (6)

1. An infrared and visible light image fusion method based on adversarial semantic guidance and perception, characterized by comprising the following steps:
Step 1: generating a visible light and infrared image fusion network ASGGAN, and optimizing and generating a visible light and infrared image fusion network ASGGAN of a double-path discriminator through the guidance of the discriminator and a loss function based on a simple-structure generation countermeasure network; wherein ASGGAN represents ADVERSARIAL SEMANTIC guiding GAN;
the loss function comprises a discriminator loss function, a segmentation network loss function and a generator loss function, wherein the discriminator loss function is used for training the discriminator, as shown in formula (1):
$L_D = L_{D_{En}} + L_{D_{De}}$ (1)
where $L_D$ denotes the overall loss function of the discriminator, $L_{D_{En}}$ denotes the loss function of the global information output by the Encoder of the discriminator, and $L_{D_{De}}$ denotes the loss function of the local information output by the Decoder of the discriminator;
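As an illustration only, the following minimal PyTorch-style sketch shows one way the global (Encoder) and local (Decoder) terms of formula (1) could be combined; the hinge form of the two terms and all names (u_discriminator_loss, d_enc_*, d_dec_*) are assumptions made for this example and are not taken from the patent.

```python
import torch.nn.functional as F

def u_discriminator_loss(d_enc_real, d_dec_real, d_enc_fake, d_dec_fake):
    """Sum of a global term (per-image Encoder scores) and a local term
    (per-pixel Decoder scores), mirroring the structure of formula (1)."""
    # Global term: the Encoder outputs one realness score per image.
    loss_enc = F.relu(1.0 - d_enc_real).mean() + F.relu(1.0 + d_enc_fake).mean()
    # Local term: the Decoder outputs one realness score per pixel.
    loss_dec = F.relu(1.0 - d_dec_real).mean() + F.relu(1.0 + d_dec_fake).mean()
    return loss_enc + loss_dec
```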
the segmentation network loss function is shown in formula (2):
$L_{seg} = -\frac{1}{W \times H}\sum_{i=1}^{W}\sum_{j=1}^{H}\sum_{c=1}^{N} I_{label}(i,j,c)\,\log p(i,j,c)$ (2)
where $I_{label}(i,j,c)$ denotes the value of the c-th channel of the one-hot vector of the segmentation label $I_{label}$ at pixel (i, j), $p(i,j,c)$ denotes the output probability of the c-th channel of the output probability map at pixel (i, j), N is the number of channels, and W and H are the width and height of the image;
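For concreteness, a minimal sketch of the pixel-wise cross-entropy of formula (2) is given below; the function name, the (channels, height, width) tensor layout and the normalization by W x H are assumptions made for this example.

```python
import torch

def segmentation_loss(prob_map, one_hot_label, eps=1e-8):
    """Pixel-wise cross-entropy in the spirit of formula (2).

    prob_map:      (N, H, W) per-channel output probabilities for one image.
    one_hot_label: (N, H, W) one-hot segmentation label.
    """
    _, h, w = prob_map.shape
    ce = -(one_hot_label * torch.log(prob_map + eps)).sum()
    return ce / (w * h)  # summed over channels, averaged over pixels
```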
the generator loss function includes a perceptual adversarial loss $L_{per}$, a semantic adversarial loss $L_{sem}$ and a detail loss $L_{detail}$; the generator loss function is shown in formula (3):
$L_G = L_{per} + \lambda_1 L_{sem} + \lambda_2 L_{detail}$ (3)
where $L_G$ denotes the overall loss function of the generator, and $\lambda_1$ and $\lambda_2$ are hyperparameters used to balance the weights of the three loss terms;
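A one-line sketch of the weighted sum in formula (3) follows; the argument names and default weights are placeholders and not values fixed by the claims (the description above reports sweeping the segmentation-loss weight over 1000, 5000, 10000 and 20000 and settling on 10000).

```python
def generator_loss(l_per, l_sem, l_detail, lambda1=1.0, lambda2=1.0):
    """Weighted combination of the three generator loss terms of formula (3);
    inputs are scalar loss tensors, lambda1/lambda2 are balancing weights."""
    return l_per + lambda1 * l_sem + lambda2 * l_detail
```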
Step 2: learning optimization: a segmentation network is used as one path of the discriminator to form an adversarial relationship; the segmentation network and the fusion network are continuously optimized during adversarial learning, and the loss between the segmentation prediction and the segmentation label is used as guidance so that the fused image has target saliency;
Step 3: obtaining global and local GAN loss terms: a U-shaped discriminator structure is used to obtain global and local GAN network loss terms, so that the fusion network pays attention not only to the global information of the image but also to its local information;
Step 4: adding a segmentation label: the segmentation label is added as a prior of the discriminator, optimizing the fusion through spatial selection;
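The conditioning of step 4 amounts to handing the discriminator the label alongside its image input; a minimal sketch follows, where channel-wise concatenation is an assumed implementation choice and the function name is illustrative.

```python
import torch

def conditional_discriminator_input(image, one_hot_label):
    """Concatenate the one-hot segmentation label with the image fed to the
    discriminator so that it receives spatial prior information on the targets.

    image:         (B, C, H, W) fused or visible light image batch
    one_hot_label: (B, K, H, W) one-hot segmentation labels
    """
    return torch.cat([image, one_hot_label], dim=1)  # (B, C+K, H, W)
```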
Step 5: comprehensive evaluation: qualitative subjective evaluation and quantitative objective evaluation indexes show that ASGGAN achieves an excellent image fusion effect compared with other infrared and visible light image fusion methods.
2. The infrared and visible light image fusion method based on antagonism semantic guidance and perception according to claim 1, characterized in that: the visible light and infrared image fusion network ASGGAN in step 1 comprises a generator and a discriminator; the generator generates images and adopts a fully convolutional network structure with two Encoder paths and a single Decoder path; the discriminator distinguishes the fake images generated by the generator from real images; the generator and the discriminator are continuously optimized, so that the generator can generate fake images that deceive the discriminator while the discriminator strengthens its ability to distinguish the fake images generated by the generator from real images.
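To make the two-Encoder/one-Decoder layout of claim 2 concrete, here is a minimal PyTorch sketch; the class name, channel widths and layer counts are illustrative placeholders and do not reproduce the patented architecture.

```python
import torch
import torch.nn as nn

class TwoStreamFusionGenerator(nn.Module):
    """Fully convolutional generator: two Encoder branches (visible / infrared)
    whose features are concatenated and decoded by a single Decoder."""
    def __init__(self, base=32):
        super().__init__()
        def encoder():
            return nn.Sequential(
                nn.Conv2d(1, base, 3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(base, base * 2, 3, padding=1), nn.LeakyReLU(0.2),
            )
        self.enc_vis = encoder()  # visible light branch
        self.enc_ir = encoder()   # infrared branch
        self.dec = nn.Sequential(
            nn.Conv2d(base * 4, base * 2, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, 1, 3, padding=1), nn.Tanh(),
        )

    def forward(self, vis, ir):
        feats = torch.cat([self.enc_vis(vis), self.enc_ir(ir)], dim=1)
        return self.dec(feats)  # single-channel fused image
```

Calling the module on two aligned single-channel tensors of the same size returns a fused image of that size.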
3. The infrared and visible light image fusion method based on antagonism semantic guidance and perception according to claim 1, characterized in that: the discriminator comprises a perception discriminator and a semantic discriminator; the perception discriminator narrows the distance between the distributions of the visible light image and the fused image so that the visible light appearance of the fused image is more natural; the semantic discriminator segments the fused image, and the segmentation loss produced by the segmentation network drives the fusion network to perform image fusion; the perception discriminator adopts a U-shaped structure comprising an Encoder and a Decoder, through which it performs global discrimination and local discrimination on images; and the semantic discriminator performs segmentation loss calculation through an RPNet segmentation network.
4. The infrared and visible light image fusion method based on antagonism semantic guidance and perception according to claim 1, characterized in that: in step 3, the GAN network is developed and derived into DCGAN and LSGAN; DCGAN replaces the multi-layer perceptrons of the generator and the discriminator in the original GAN with convolutional neural networks for feature extraction, and LSGAN replaces the cross-entropy loss of the GAN network with a least-squares loss, which improves the quality of the generated pictures and makes the training of the GAN network more stable.
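For reference, the least-squares objective mentioned in claim 4 is commonly written as below; this is the generic LSGAN form (target label 1 for real samples and 0 for fake samples), not a formulation specific to the patent.

```python
def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares GAN loss for the discriminator (inputs are torch tensors
    of discriminator outputs on real and generated samples)."""
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_generator_loss(d_fake):
    """Least-squares GAN loss for the generator."""
    return 0.5 * ((d_fake - 1.0) ** 2).mean()
```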
5. The infrared and visible light image fusion method based on antagonism semantic guidance and perception according to claim 1, characterized in that: in step 4, the segmentation label is input into the network structure during network training to perform image fusion.
6. The infrared and visible light image fusion method based on antagonism semantic guidance and perception according to claim 1, characterized in that: the objective evaluation indexes comprise AG, EI, SF and EN; the AG evaluation index measures the sharpness of the fused image and is given by
$AG = \frac{1}{M \times N}\sum_{i=1}^{M}\sum_{j=1}^{N}\sqrt{\frac{1}{2}\left[\left(\frac{\partial F(i,j)}{\partial x}\right)^{2} + \left(\frac{\partial F(i,j)}{\partial y}\right)^{2}\right]}$ (4)
where M and N denote the width and height of the fused image respectively, and (i, j) denotes the position of a pixel point in the fused image with grey value F(i, j); the larger the AG value in the formula, the better the sharpness of the fused image and the better the quality of the fused image;
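A compact NumPy version of the AG metric follows; the use of forward differences and the exact normalization are common conventions assumed here, since published implementations of AG differ slightly.

```python
import numpy as np

def average_gradient(img):
    """Average gradient (AG) of a grey-scale image, cf. formula (4)."""
    img = img.astype(np.float64)
    dx = img[:-1, 1:] - img[:-1, :-1]  # horizontal forward difference
    dy = img[1:, :-1] - img[:-1, :-1]  # vertical forward difference
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))
```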
the EI evaluation index calculates the edge strength of the fused image and is given by
$EI = \frac{1}{M \times N}\sum_{i=1}^{M}\sum_{j=1}^{N}\sqrt{s_{x}(i,j)^{2} + s_{y}(i,j)^{2}}$ (5)
$s_{x} = F * h_{x}, \quad s_{y} = F * h_{y}$ (6)
where $h_{x}$ and $h_{y}$ are the Sobel operators in the x and y directions, and $s_{x}$ and $s_{y}$ are the corresponding edge response maps obtained by convolving the fused image F with them; the larger the value of the evaluation index EI, the better the quality of the fused image;
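A short NumPy/SciPy sketch of EI is given below; scipy.ndimage.sobel supplies the two directional responses, and averaging the per-pixel edge magnitude is an assumed convention.

```python
import numpy as np
from scipy.ndimage import sobel

def edge_intensity(img):
    """Edge intensity (EI): magnitude of the Sobel response, cf. formulas (5)-(6)."""
    img = img.astype(np.float64)
    sx = sobel(img, axis=1)  # response of the horizontal Sobel operator
    sy = sobel(img, axis=0)  # response of the vertical Sobel operator
    return float(np.mean(np.sqrt(sx ** 2 + sy ** 2)))
```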
the SF evaluation index calculates the rate of change of the image grey levels and is given by
$RF = \sqrt{\frac{1}{M \times N}\sum_{i=1}^{M}\sum_{j=2}^{N}\left(F(i,j) - F(i,j-1)\right)^{2}}$ (7)
$CF = \sqrt{\frac{1}{M \times N}\sum_{i=2}^{M}\sum_{j=1}^{N}\left(F(i,j) - F(i-1,j)\right)^{2}}$ (8)
$SF = \sqrt{RF^{2} + CF^{2}}$ (9)
where RF is the row spatial frequency and CF is the column spatial frequency; the larger the SF value, the better the quality of the fused image;
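The spatial frequency of formulas (7)-(9) translates directly into a few lines of NumPy; the treatment of the image border (dropping the first row/column difference) follows the usual convention and is assumed here.

```python
import numpy as np

def spatial_frequency(img):
    """Spatial frequency (SF) from row (RF) and column (CF) frequencies."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))  # row frequency
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))  # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))
```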
the EN evaluation index calculates the amount of information contained in the image and is given by
$EN = -\sum_{l=0}^{L-1} p_{l}\log_{2} p_{l}$ (10)
where $p_{l}$ is the statistical probability of grey level l in the grey-level histogram and L is the number of grey levels; the larger the EN value, the greater the amount of information in the image and the better the quality of the fused image.
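Finally, a minimal NumPy sketch of the entropy metric in formula (10); an 8-bit grey-scale input with 256 histogram bins is assumed.

```python
import numpy as np

def entropy(img, levels=256):
    """Information entropy (EN) computed from the grey-level histogram."""
    hist, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]  # ignore empty bins so that log2 is well defined
    return float(-np.sum(p * np.log2(p)))
```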
CN202111292602.5A 2021-11-03 2021-11-03 Infrared and visible light image fusion method based on antagonism semantic guidance and perception Active CN113781377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111292602.5A CN113781377B (en) 2021-11-03 2021-11-03 Infrared and visible light image fusion method based on antagonism semantic guidance and perception


Publications (2)

Publication Number Publication Date
CN113781377A CN113781377A (en) 2021-12-10
CN113781377B true CN113781377B (en) 2024-08-13

Family

ID=78873619

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111292602.5A Active CN113781377B (en) 2021-11-03 2021-11-03 Infrared and visible light image fusion method based on antagonism semantic guidance and perception

Country Status (1)

Country Link
CN (1) CN113781377B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11756288B2 (en) * 2022-01-05 2023-09-12 Baidu Usa Llc Image processing method and apparatus, electronic device and storage medium
CN115550570B (en) * 2022-01-10 2023-09-01 荣耀终端有限公司 Image processing method and electronic equipment
CN114758202B (en) * 2022-04-01 2024-05-24 山东大学 Short wave infrared ship detection method and system based on semantic perception feature enhancement
CN114862896A (en) * 2022-04-13 2022-08-05 北京航空航天大学 Depth model-based visible light-infrared image conversion method
CN114973164B (en) * 2022-06-17 2024-08-20 西北工业大学 Ship target fusion recognition method based on image style migration
CN114882444B (en) * 2022-07-01 2022-12-02 浙江智慧视频安防创新中心有限公司 Image fusion processing method, device and medium
CN116309913B (en) * 2023-03-16 2024-01-26 沈阳工业大学 Method for generating image based on ASG-GAN text description of generation countermeasure network
CN116664462B (en) * 2023-05-19 2024-01-19 兰州交通大学 Infrared and visible light image fusion method based on MS-DSC and I_CBAM
CN116757988B (en) * 2023-08-17 2023-12-22 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on semantic enrichment and segmentation tasks
CN118552424B (en) * 2024-07-30 2024-09-24 大连理工大学 Infrared and visible light image fusion method with balanced pixel-level and target-level characteristics

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767384A (en) * 2017-11-03 2018-03-06 电子科技大学 A kind of image, semantic dividing method based on dual training
AU2020100178A4 (en) * 2020-02-04 2020-03-19 Huang, Shuying DR Multiple decision maps based infrared and visible image fusion

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109118467B (en) * 2018-08-31 2021-11-16 武汉大学 Infrared and visible light image fusion method based on generation countermeasure network
CN112488970A (en) * 2019-09-12 2021-03-12 四川大学 Infrared and visible light image fusion method based on coupling generation countermeasure network
CN111145131B (en) * 2019-11-28 2023-05-26 中国矿业大学 Infrared and visible light image fusion method based on multiscale generation type countermeasure network
CN111291885B (en) * 2020-01-20 2023-06-09 北京百度网讯科技有限公司 Near infrared image generation method, training method and device for generation network
CN111709903B (en) * 2020-05-26 2022-08-19 中国科学院长春光学精密机械与物理研究所 Infrared and visible light image fusion method
CN112184542A (en) * 2020-07-17 2021-01-05 湖南大学 Posture-guided style-preserving human body image generation method
CN112001868B (en) * 2020-07-30 2024-06-11 山东师范大学 Infrared and visible light image fusion method and system based on generation of antagonism network
CN113077471B (en) * 2021-03-26 2022-10-14 南京邮电大学 Medical image segmentation method based on U-shaped network
CN113298816A (en) * 2021-06-21 2021-08-24 江苏建筑职业技术学院 Remote sensing image semantic segmentation method and device and computer equipment
CN113450297A (en) * 2021-07-22 2021-09-28 山东澳万德信息科技有限责任公司 Fusion model construction method and system for infrared image and visible light image

Also Published As

Publication number Publication date
CN113781377A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN113781377B (en) Infrared and visible light image fusion method based on antagonism semantic guidance and perception
CN110929578B (en) Anti-shielding pedestrian detection method based on attention mechanism
Zheng et al. A novel background subtraction algorithm based on parallel vision and Bayesian GANs
Niu et al. 2D and 3D image quality assessment: A survey of metrics and challenges
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN112906485B (en) Visual impairment person auxiliary obstacle perception method based on improved YOLO model
CN109740419A (en) A kind of video behavior recognition methods based on Attention-LSTM network
CN112950477B (en) Dual-path processing-based high-resolution salient target detection method
CN111582095B (en) Light-weight rapid detection method for abnormal behaviors of pedestrians
CN113269133A (en) Unmanned aerial vehicle visual angle video semantic segmentation method based on deep learning
Xu et al. Image enhancement algorithm based on GAN neural network
CN113392711A (en) Smoke semantic segmentation method and system based on high-level semantics and noise suppression
Guo et al. Deep illumination-enhanced face super-resolution network for low-light images
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN113420703A (en) Dynamic facial expression recognition method based on multi-scale feature extraction and multi-attention mechanism modeling
CN114627269A (en) Virtual reality security protection monitoring platform based on degree of depth learning target detection
CN117726870A (en) Diffusion model-based small sample target detection model reinforcement learning method and device
CN116311254A (en) Image target detection method, system and equipment under severe weather condition
Luo et al. LatRAIVF: An infrared and visible image fusion method based on latent regression and adversarial training
CN116385926A (en) Human body space-time action detection method, system and equipment based on deep learning
CN114155165A (en) Image defogging method based on semi-supervision
CN111901610B (en) Parallel image description method based on multilayer encoder
CN117115058A (en) Low-light image fusion method based on light weight feature extraction and color recovery
Chen et al. DDGAN: Dense Residual Module and Dual-stream Attention-Guided Generative Adversarial Network for colorizing near-infrared images
CN116452472A (en) Low-illumination image enhancement method based on semantic knowledge guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant