CN117197624A - Infrared-visible light image fusion method based on attention mechanism - Google Patents

Infrared-visible light image fusion method based on attention mechanism

Info

Publication number
CN117197624A
Authority
CN
China
Prior art keywords
image
infrared
visible light
fusion
light image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311089192.3A
Other languages
Chinese (zh)
Inventor
徐骞
陈征
赵文杰
邵雪明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311089192.3A priority Critical patent/CN117197624A/en
Publication of CN117197624A publication Critical patent/CN117197624A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application discloses an infrared-visible light image fusion method based on an attention mechanism, which addresses the problem that existing methods cannot preserve, in a balanced way, the thermal radiation information of an infrared image and the detail and texture information of a visible light image. The method adopts an auto-encoder network architecture: convolution kernels of three sizes and a channel-spatial attention mechanism module are introduced into the encoder as feature extractors, where the convolution kernels of different sizes enhance the feature extraction capability of the network and the channel-spatial attention module makes the network focus more on salient target regions in the image, while a well-designed loss function guides the training of the network. The method can be trained on and perform fusion over publicly available infrared and visible light image datasets, and the fused result effectively retains the thermal radiation information of the infrared image and the texture information of the visible light image.

Description

Infrared-visible light image fusion method based on attention mechanism
Technical Field
The application relates to the technical field of image processing, in particular to an infrared-visible light image fusion method based on an attention mechanism.
Background
Due to the hardware limitations of imaging devices, sensors of a single type or single setting often cannot fully characterize an imaging scene. For example, visible light images typically contain rich texture and detail information, but extreme environments and occlusion can cause objects in the scene to be lost. In contrast, an infrared sensor can effectively highlight salient objects such as pedestrians and vehicles by capturing the thermal radiation emitted by objects, but it lacks a detailed description of the scene. Image fusion technology combines the complementary information of multi-modal images into a fused image that contains more information, providing higher-quality images for downstream vision tasks in fields such as automatic driving, security, and rescue.
Deep-learning-based fusion of infrared and visible light images is a hot spot of current research. According to the network architecture used, existing methods can be divided into auto-encoder-based image fusion frameworks, convolutional-neural-network-based image fusion frameworks, and image fusion frameworks based on generative adversarial networks.
Although the auto-encoder-based image fusion network shares the inherent weakness of deep learning, namely limited model interpretability, an auto-encoder comprises only two components, the encoder and the decoder, so its design difficulty is lower than that of the other two architectures. In addition, the manually designed fusion strategy enhances the interpretability of the network to a certain extent, which leaves more room for improving the performance of the fusion network.
Although the convolutional-neural-network-based image fusion framework can be trained end to end and does not require a manually designed fusion strategy, its internal mechanism is unclear because the neural network is a black box. As a result, the quality of the fused image obtained by such a framework depends entirely on the designer's experience, and the design is complex.
The image fusion network based on a generative adversarial network can generate a fused image without supervised data and is therefore well suited to infrared and visible light image fusion, but training a generative adversarial network is complex and the model is difficult to converge.
However, the current auto-encoder-based image fusion architecture is essentially a process of image decomposition and image reconstruction, which inevitably causes information loss, and the network's ability to retain information is key to its performance. In the fusion of infrared and visible light images, the network must extract and preserve the detail information of the visible light image while also retaining the thermal radiation information of the infrared image; both are indispensable. Most current methods cannot provide both capabilities, mainly because most networks use convolution kernels of only one size, which severely limits the receptive field of the model.
To solve the above problems, the present application proposes an auto-encoder-based image fusion network architecture that introduces an attention mechanism and multi-size convolution kernels.
Disclosure of Invention
Aiming at the defects and shortcomings of existing image fusion frameworks, the application provides an infrared and visible light image fusion network based on an attention mechanism, which solves the problems of insufficient image information extraction capability and weak retention of visible light texture information and infrared thermal radiation information.
The aim of the application is achieved by the following technical solution: an infrared-visible light image fusion method based on an attention mechanism, comprising the following steps:
step one: constructing an auto-encoder-based neural network model, wherein the neural network model comprises an encoder and a decoder, and the encoder comprises a multi-convolution kernel feature extraction module and an attention mechanism module;
step two: inputting the infrared-visible light image pair into the encoder, where the multi-convolution kernel feature extraction module performs feature extraction to obtain the respective feature maps of the infrared-visible light image pair; the infrared-visible light image pair comprises an infrared light image and the visible light image corresponding to the infrared light image;
step three: inputting the feature maps of the infrared-visible light image pair obtained in step two into the channel-spatial attention mechanism module to obtain the corresponding feature maps;
step four: inputting the feature maps obtained in step three into the decoder for feature dimension reduction to obtain a reconstructed image;
step five: taking the reconstructed image and the infrared-visible light image pair as inputs to the corresponding loss function, and continuously updating the network parameters through back-propagation to optimize the network until the loss value converges and stabilizes, indicating that training is complete and yielding a network model for generating the reconstructed image;
step six: inserting a fusion layer between the encoder and the decoder of the trained network model to obtain the final fusion image generation model; the fusion layer is provided with a fusion rule for fusing the feature maps of the infrared-visible light image pair output by the encoder;
step seven: inputting the infrared-visible light image pair into the fusion image generation model obtained in step six; the encoder outputs the feature maps of the infrared-visible light image pair, the fusion layer fuses these feature maps using the set fusion rule and outputs the fused feature map, and the fused feature map is input into the decoder to obtain the fused image.
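By way of illustration only, the sketch below outlines how the two-phase procedure of steps one to seven could be organized in PyTorch: a reconstruction phase that trains the encoder-decoder pair, followed by an inference phase in which a fusion layer is inserted between them. The tiny encoder and decoder, the plain MSE loss, the simple additive feature fusion, and all hyper-parameters are placeholder assumptions rather than the application's exact implementation.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the multi-kernel + attention encoder of steps one to three."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class TinyDecoder(nn.Module):
    """Stand-in for the reconstruction decoder of step four."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 1, 3, padding=1))
    def forward(self, f):
        return self.net(f)

encoder, decoder = TinyEncoder(), TinyDecoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

# Phase 1 (steps one to five): train the auto-encoder to reconstruct each input image.
for ir, vis in [(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64))]:   # dummy image pair
    loss = sum(nn.functional.mse_loss(decoder(encoder(img)), img) for img in (ir, vis))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2 (steps six and seven): insert a fusion layer between encoder and decoder.
with torch.no_grad():
    f_ir, f_vis = encoder(torch.rand(1, 1, 64, 64)), encoder(torch.rand(1, 1, 64, 64))
    fused_feature = 0.5 * f_ir + 0.5 * f_vis        # placeholder for the L1-norm fusion rule
    fused_image = decoder(fused_feature)
```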
As a preferred embodiment of the present application, in step two the infrared and visible light images are input into the multi-convolution kernel feature extraction module, which includes convolution kernels of three sizes, namely 3×3, 5×5 and 7×7; the features extracted by the three convolution kernels are concatenated along the channel dimension, as expressed by the following formula:
F_i = Concat(F_i^3×3, F_i^5×5, F_i^7×7)
where i denotes the type of input image, with i = 1 and i = 2 denoting the infrared and visible light image, respectively; I_i denotes the input image; F_i^3×3, F_i^5×5 and F_i^7×7 denote the feature maps output by each convolution layer; and Concat denotes concatenation of the feature maps along the channel direction.
Further, because the neural network is optimized through back propagation, a reasonably designed loss function is required to guide the optimization of the network parameters so that the network converges quickly. The designed loss function is as follows:
where O_i and I_i denote the reconstructed image and the infrared-visible light image pair, respectively; L denotes the total loss function; L_ssim denotes the structural similarity loss; L_ms-ssim denotes the multi-scale structural similarity loss; L_gradient denotes the gradient loss; and L_mse denotes the mean square error loss.
Compared with the prior art, the application has the following beneficial effects:
Because the application adopts convolution kernels of multiple sizes together with an attention mechanism, the network not only has the capability of extracting multi-scale features but also learns which regions of the image are salient. This overcomes the insufficient feature extraction capability and the weak retention of salient source-image information found in the prior art, and greatly improves the preservation of detail information in the visible light image and salient information in the infrared image.
Drawings
Fig. 1 is a training flow chart of an infrared light and visible light image fusion network.
Fig. 2 is a flow chart of infrared and visible light image fusion network testing.
FIG. 3 is a diagram of an attention mechanism module and a multi-convolution kernel cross-connect module.
FIG. 4 is a qualitative comparison of the present application with other advanced fusion methods.
Detailed Description
The application is further illustrated and described below in connection with specific embodiments. The described embodiments are merely exemplary of the present disclosure and do not limit the scope. The technical features of the embodiments of the application can be combined correspondingly on the premise of no mutual conflict.
As shown in fig. 1, the infrared light and visible light image fusion method based on the attention mechanism comprises the following steps:
Step one: Performing data enhancement operations such as cropping and flipping on the visible light and infrared light images.
Step two: constructing a neural network model based on a self-encoder, wherein the neural network model comprises
Step three: inputting the preprocessed image in the first step into an encoder for feature extraction to obtain feature images of infrared light and visible light images
As shown in fig. 3, the following substeps are specifically included.
3.1 Obtain preliminary feature information.
The encoder comprises a basic convolution layer with a 3×3 convolution kernel and a cross-connected feature extraction module containing convolution kernels of three sizes. The source image is first input into the basic convolution layer, which extracts only preliminary features of the source image:
F_i^1 = Conv_3×3(I_i)
where F_i^1 denotes the feature information of the first layer of the encoder and i denotes the type of image.
3.2 Inputting the feature map obtained in sub-step 3.1) into the multi-convolution kernel cross-connected feature extraction module, which comprises convolution kernels of several sizes (3×3, 5×5, 7×7). The different convolution kernels give the network different receptive fields (for example, if a 3×3 convolution is applied to a 100×100 image, the receptive field of a neuron in the output feature map is a 3×3 region of the original image), so the network can extract more diverse feature information. Meanwhile, the cross connections between the different convolution kernels enhance the interaction between convolution layers, which reduces the information loss of the network to a certain extent. The formula is as follows:
F_i = Concat(F_i^3×3, F_i^5×5, F_i^7×7)
where i denotes the type of input image, with i = 1 and i = 2 denoting the infrared and visible light image, respectively; I_i denotes the input image; F_i^3×3, F_i^5×5 and F_i^7×7 denote the feature maps output by each convolution layer; and Concat denotes concatenation of the feature maps along the channel direction.
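For illustration, a minimal PyTorch sketch of a multi-convolution-kernel block consistent with the description above is given below: parallel 3×3, 5×5 and 7×7 convolutions whose outputs are concatenated along the channel dimension, with a simple cross connection in which each later branch also sees an earlier branch's output. The exact wiring of the cross connections, the channel counts, and the activation are assumptions rather than the application's implementation.

```python
import torch
import torch.nn as nn

class MultiKernelBlock(nn.Module):
    """Parallel 3x3 / 5x5 / 7x7 convolutions with a simple cross connection;
    the three outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch=16, ch=16):
        super().__init__()
        self.conv3 = nn.Conv2d(in_ch, ch, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_ch + ch, ch, kernel_size=5, padding=2)  # also sees the 3x3 output
        self.conv7 = nn.Conv2d(in_ch + ch, ch, kernel_size=7, padding=3)  # also sees the 5x5 output
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        f3 = self.act(self.conv3(x))
        f5 = self.act(self.conv5(torch.cat([x, f3], dim=1)))
        f7 = self.act(self.conv7(torch.cat([x, f5], dim=1)))
        return torch.cat([f3, f5, f7], dim=1)   # channel-wise concatenation

# Example: a 16-channel feature map from the basic 3x3 layer, 64x64 pixels.
feat = MultiKernelBlock()(torch.rand(1, 16, 64, 64))   # -> shape (1, 48, 64, 64)
```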
3.3 During feature extraction, the network is required not only to extract as much useful information as possible but also to know which information is important. For this purpose, an attention mechanism module is introduced, comprising two components, a channel attention module and a spatial attention module, whose expressions are as follows:
where F_i^CA and F_i^SA denote the output feature maps of the channel attention module and the spatial attention module, respectively; MLP and Conv_1×1 denote a multi-layer perceptron and a convolution with a 1×1 kernel; MaxPooling(·) and AvgPooling(·) denote the maximum pooling and average pooling operations; and ⊗ denotes pixel-wise multiplication.
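By way of illustration, the sketch below implements a channel-spatial attention module in the widely used CBAM style, which is consistent with the operations named above (a shared MLP over max- and average-pooled channel descriptors, a 1×1 convolution over pooled spatial maps, and pixel-wise multiplication with the input). The sigmoid gating, the reduction ratio, and the sequential channel-then-spatial ordering are assumptions not stated in the description.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style channel attention followed by spatial attention (assumed form)."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(),
                                 nn.Linear(ch // reduction, ch))
        self.conv1x1 = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, f):
        b, c, _, _ = f.shape
        # Channel attention: shared MLP over max- and average-pooled channel descriptors.
        mx = self.mlp(torch.amax(f, dim=(2, 3)))
        av = self.mlp(torch.mean(f, dim=(2, 3)))
        f_ca = torch.sigmoid(mx + av).view(b, c, 1, 1) * f
        # Spatial attention: 1x1 convolution over channel-wise max and mean maps.
        pooled = torch.cat([f_ca.amax(dim=1, keepdim=True),
                            f_ca.mean(dim=1, keepdim=True)], dim=1)
        f_sa = torch.sigmoid(self.conv1x1(pooled)) * f_ca
        return f_sa

attended = ChannelSpatialAttention(ch=48)(torch.rand(1, 48, 64, 64))
```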
Step four: inputting the feature map obtained in the second step into a decoder to obtain a reconstructed image:
O = Decoder(F^De)
where O denotes the reconstructed image, Decoder(·) denotes the decoder, and F^De denotes the feature map that combines the outputs of the spatial attention and channel attention mechanisms.
Step five: and taking the reconstructed image and the visible-infrared light image pair as the input of a corresponding loss function, continuously updating network parameters by utilizing the counter-propagation property of the neural network, continuously optimizing the network, and finally obtaining a network model capable of generating high-quality reconstructed images. The loss function expression is as follows:
where O_i and I_i denote the reconstructed image and the visible-infrared light image pair, respectively; L denotes the total loss function; L_ssim denotes the structural similarity loss; L_ms-ssim denotes the multi-scale structural similarity loss; L_gradient denotes the gradient loss; and L_mse denotes the mean square error loss.
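A hedged sketch of a composite reconstruction loss built from the four named terms is given below. The equal weighting of the terms, the use of the third-party pytorch_msssim package for the SSIM and MS-SSIM terms, the Sobel-based gradient loss, and the assumption of single-channel images in the [0, 1] range are all assumptions; the application does not specify these details.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim, ms_ssim   # third-party package, assumed available

def gradient_loss(recon, target):
    """L1 difference of Sobel gradient magnitudes (assumed form of L_gradient).
    Both tensors are (N, 1, H, W) single-channel images."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=recon.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    def grad(img):
        return F.conv2d(img, kx, padding=1).abs() + F.conv2d(img, ky, padding=1).abs()
    return F.l1_loss(grad(recon), grad(target))

def reconstruction_loss(recon, target):
    """Total loss combining L_ssim, L_ms-ssim, L_gradient and L_mse (weights assumed to be 1)."""
    l_ssim = 1.0 - ssim(recon, target, data_range=1.0)
    l_msssim = 1.0 - ms_ssim(recon, target, data_range=1.0)
    return l_ssim + l_msssim + gradient_loss(recon, target) + F.mse_loss(recon, target)
```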
Step six: insertion in an encoder and decoder for generating a network model of a reconstructed imageA fusion layer for obtaining a fusion image generation model capable of generating a fusion image; the fusion layer is provided with a fusion rule for fusing the characteristic images of the infrared-visible light image pair output by the encoder, and the fusion rule of the fusion layer uses L 1 A norm;
according to L 1 Fusion feature map obtained by norm fusion rule:
where ‖·‖_1 denotes the L1 norm; r denotes the size of the filter; the weight map is obtained by average filtering; (x, y) denotes the pixel position in the feature map; F^f denotes the fused feature map; and ω_i denotes the weight of the corresponding feature map.
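For illustration, the sketch below follows the fusion rule as described: the per-pixel L1 norm of each encoder feature map over the channel dimension gives an activity map, average filtering with a window of size r smooths it into a weight map, the weight maps are normalized so that they sum to one at every pixel, and the fused feature map is the weighted sum of the two feature maps. This mirrors the well-known DenseFuse-style strategy; the window size and the exact normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def l1_norm_fusion(f_ir, f_vis, r=3):
    """Fuse two encoder feature maps with an L1-norm / average-filtering rule."""
    weights = []
    for f in (f_ir, f_vis):
        activity = f.abs().sum(dim=1, keepdim=True)            # per-pixel L1 norm over channels
        weights.append(F.avg_pool2d(activity, kernel_size=r,   # average filter of size r
                                    stride=1, padding=r // 2))
    w_ir, w_vis = weights
    total = w_ir + w_vis + 1e-8                                # normalize: omega_ir + omega_vis = 1
    return (w_ir / total) * f_ir + (w_vis / total) * f_vis

fused_feat = l1_norm_fusion(torch.rand(1, 48, 64, 64), torch.rand(1, 48, 64, 64))
```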
Step seven: As shown in fig. 2, the infrared-visible light image pair is input into the fusion image generation model obtained in step six; the encoder outputs the feature maps of the infrared-visible light image pair, the fusion layer fuses these feature maps using the set fusion rule and outputs the fused feature map, and the fused feature map is input into the decoder to obtain the fused image.
Fig. 4 shows fused images produced by the fusion method of the present application and by other advanced fusion methods. To facilitate comparison, the same position in each fused image is enlarged, as indicated by the framed regions in the figure. The present application has clear advantages over the other methods. First, the fusion result of the present application preserves more detail information of the visible light image, such as the texture of the branches in the figure; this benefits from the multi-convolution kernel feature extraction, which enhances the feature extraction capability of the network. Second, because the attention mechanism is introduced in the feature extraction stage, the network pays more attention to the salient regions of the source images; compared with the other methods, the method disclosed in the application retains more infrared thermal radiation information, such as the thermal radiation of the person in the figure. Table 1 shows the average values of the quantitative analysis of the present application.
TABLE 1 quantitative comparison results of the present application and other methods
In addition to the qualitative analysis, a quantitative analysis was performed on the TNO dataset using 8 image evaluation indexes: entropy (EN), mutual information (MI), average gradient (AG), standard deviation (SD), spatial frequency (SF), structural similarity (SSIM), visual information fidelity (VIF), and the sum of the correlations of differences (SCD).
As can be seen from Table 1, the fusion results of the present application achieve the best values on four indexes: entropy, average gradient, standard deviation, and spatial frequency. Entropy measures the richness of the information contained in the image; a larger value means the image contains more information, which indirectly reflects the information-retention capability of the fusion network. The average gradient evaluates the sharpness and detail of the image; a larger value means the fused image is clearer and contains more detail and texture information, indicating a better fusion effect. The standard deviation evaluates the contrast of the fused image; a larger value means more contrast and texture information is preserved, indicating a better fusion effect. The spatial frequency describes the rate of change of image detail; a larger value indicates that the fused image contains more texture and detail information. Although the best values are not obtained on the remaining indexes, the method achieves sub-optimal values on indexes such as the sum of the correlations of differences and the structural similarity.
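For reference, the four indexes on which the best values are reported can be computed from a fused image with their standard textbook definitions, as sketched below (Shannon entropy of the grey-level histogram, mean forward-difference gradient magnitude, pixel standard deviation, and spatial frequency); this is illustrative code, not the evaluation implementation used for Table 1.

```python
import numpy as np

def entropy(img, bins=256):
    """EN: Shannon entropy of the grey-level histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def average_gradient(img):
    """AG: mean magnitude of horizontal/vertical forward differences."""
    img = img.astype(np.float64)
    gx = img[:-1, 1:] - img[:-1, :-1]
    gy = img[1:, :-1] - img[:-1, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

def spatial_frequency(img):
    """SF: root of the summed squared row and column frequencies."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))
    return float(np.sqrt(rf ** 2 + cf ** 2))

fused = (np.random.rand(256, 256) * 255).astype(np.uint8)   # stand-in for a fused image
print(entropy(fused), average_gradient(fused), fused.std(), spatial_frequency(fused))
```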
In this application, an infrared and visible light image fusion network based on an attention mechanism is provided. The auto-encoder structure adopted by the application has the characteristics of simple design and good fusion effect. For the image fusion task, it contains three components: feature extraction, feature fusion, and feature reconstruction. Feature extraction is the most important: the quality of the fusion depends on whether the network can sufficiently extract the information in the source images, so the application adopts a cross-connected multi-convolution-kernel structure in the feature extraction layer. In addition, the network needs to know which information is important and which is noise, so the attention mechanism module is introduced; this module not only makes the network pay more attention to important regions but also removes noise to a certain extent. Qualitative and quantitative analysis on the TNO dataset shows that the method provided by the application can effectively fuse infrared and visible light images.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit of the application.

Claims (7)

1. An infrared-visible light image fusion method based on an attention mechanism, characterized by comprising the following steps:
step one: constructing an auto-encoder-based neural network model, wherein the neural network model comprises an encoder and a decoder, and the encoder comprises a multi-convolution kernel feature extraction module and an attention mechanism module;
step two: inputting the infrared-visible light image pair into the encoder, where the multi-convolution kernel feature extraction module performs feature extraction to obtain the respective feature maps of the infrared-visible light image pair; the infrared-visible light image pair comprises an infrared light image and the visible light image corresponding to the infrared light image;
step three: inputting the feature maps of the infrared-visible light image pair obtained in step two into the channel-spatial attention mechanism module to obtain the corresponding feature maps;
step four: inputting the feature maps obtained in step three into the decoder for feature dimension reduction to obtain a reconstructed image;
step five: taking the reconstructed image and the infrared-visible light image pair as inputs to the corresponding loss function, continuously updating the network parameters through back-propagation, and continuously optimizing the neural network model until the loss function value converges and stabilizes, which indicates that training of the neural network model is complete and yields the neural network model for generating the reconstructed image;
step six: inserting a fusion layer between the encoder and the decoder of the neural network model for generating the reconstructed image to obtain the final fusion image generation model capable of generating a fused image; the fusion layer is provided with a fusion rule for fusing the feature maps of the infrared-visible light image pair output by the encoder;
step seven: inputting the infrared-visible light image pair into the fusion image generation model obtained in step six; the encoder outputs the feature maps of the infrared-visible light image pair, the fusion layer fuses the feature maps output by the encoder using the set fusion rule and outputs the fused feature map, and the fused feature map is input into the decoder to obtain the fused image.
2. The method of claim 1, wherein in step two, feature extraction is performed on the infrared-visible light image pair by the multi-convolution kernel feature extraction module, which comprises convolution kernels of three sizes, namely 3×3, 5×5 and 7×7; the features extracted by the three convolution kernels are concatenated along the channel dimension, as expressed by the following formula:
F_i = Concat(F_i^3×3, F_i^5×5, F_i^7×7)
where i denotes the type of input image, with i = 1 and i = 2 denoting the infrared image and the visible light image, respectively; I_i denotes the input image; F_i^3×3, F_i^5×5 and F_i^7×7 denote the feature maps output by each convolution layer; and Concat denotes concatenation of the feature maps along the channel direction.
3. The method of claim 1, wherein in step three, the feature map obtained through the attention mechanism module may be expressed as:
where F_i^CA and F_i^SA denote the output feature maps of the channel attention module and the spatial attention module, respectively; MLP and Conv_1×1 denote a multi-layer perceptron and a convolution with a 1×1 kernel; MaxPooling(·) and AvgPooling(·) denote the maximum pooling and average pooling operations; and ⊗ denotes pixel-wise multiplication.
4. The infrared-visible light image fusion method according to claim 1, wherein in step four, the obtained feature map is input into a decoder:
O = Decoder(F^De)
where O denotes the reconstructed image, Decoder(·) denotes the decoder, and F^De denotes the feature map that combines the outputs of the spatial attention and channel attention mechanisms.
5. The method for fusing infrared-visible light images according to claim 1, wherein in step five, the image output by the decoder is used as the predicted value and the infrared-visible light image pair is used as the reference value for calculating the loss, the loss function being expressed as follows:
where O_i and I_i denote the reconstructed image and the infrared-visible light image pair, respectively; L denotes the total loss function; L_ssim denotes the structural similarity loss; L_ms-ssim denotes the multi-scale structural similarity loss; L_gradient denotes the gradient loss; and L_mse denotes the mean square error loss.
6. The method according to claim 5, wherein in step five, the reconstructed image and the infrared-visible light image pair are input into the loss function, which adds a constraint that guides the training of the neural network model; the network parameters are updated by back propagation, and training of the neural network model is complete when the loss function value approaches zero and no longer changes.
7. The method according to claim 1, wherein in step six, the fusion rule of the fusion layer uses the L1 norm, and the fused feature map is obtained according to the L1-norm fusion rule, expressed as follows:
where ‖·‖_1 denotes the L1 norm; r denotes the size of the filter; the weight map is obtained by average filtering; (x, y) denotes the pixel position in the feature map; F^f denotes the fused feature map; and ω_i denotes the weight of the corresponding feature map.
CN202311089192.3A 2023-08-28 2023-08-28 Infrared-visible light image fusion method based on attention mechanism Pending CN117197624A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311089192.3A CN117197624A (en) 2023-08-28 2023-08-28 Infrared-visible light image fusion method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311089192.3A CN117197624A (en) 2023-08-28 2023-08-28 Infrared-visible light image fusion method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN117197624A true CN117197624A (en) 2023-12-08

Family

ID=89002705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311089192.3A Pending CN117197624A (en) 2023-08-28 2023-08-28 Infrared-visible light image fusion method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN117197624A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117517335A (en) * 2023-12-27 2024-02-06 国网辽宁省电力有限公司电力科学研究院 System and method for monitoring pollution of insulator of power transformation equipment
CN117517335B (en) * 2023-12-27 2024-03-29 国网辽宁省电力有限公司电力科学研究院 System and method for monitoring pollution of insulator of power transformation equipment
CN117726979A (en) * 2024-02-18 2024-03-19 合肥中盛水务发展有限公司 Piping lane pipeline management method based on neural network

Similar Documents

Publication Publication Date Title
Ma et al. GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion
CN108986050B (en) Image and video enhancement method based on multi-branch convolutional neural network
Li et al. Survey of single image super‐resolution reconstruction
CN111275618A (en) Depth map super-resolution reconstruction network construction method based on double-branch perception
CN117197624A (en) Infrared-visible light image fusion method based on attention mechanism
CN111754446A (en) Image fusion method, system and storage medium based on generation countermeasure network
CN110472634A (en) Change detecting method based on multiple dimensioned depth characteristic difference converged network
CN112241939B (en) Multi-scale and non-local-based light rain removal method
Wang et al. FusionGRAM: An infrared and visible image fusion framework based on gradient residual and attention mechanism
CN111242173A (en) RGBD salient object detection method based on twin network
Wang et al. TF-SOD: a novel transformer framework for salient object detection
CN117474781A (en) High spectrum and multispectral image fusion method based on attention mechanism
CN113538243A (en) Super-resolution image reconstruction method based on multi-parallax attention module combination
CN105931189A (en) Video ultra-resolution method and apparatus based on improved ultra-resolution parameterized model
Zhou et al. MSAR‐DefogNet: Lightweight cloud removal network for high resolution remote sensing images based on multi scale convolution
CN114022356A (en) River course flow water level remote sensing image super-resolution method and system based on wavelet domain
Shit et al. An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection
CN111861870B (en) End-to-end parallel generator network construction method for image translation
CN113628143A (en) Weighted fusion image defogging method and device based on multi-scale convolution
CN117314808A (en) Infrared and visible light image fusion method combining transducer and CNN (carbon fiber network) double encoders
CN111598841B (en) Example significance detection method based on regularized dense connection feature pyramid
CN115965844B (en) Multi-focus image fusion method based on visual saliency priori knowledge
CN116883303A (en) Infrared and visible light image fusion method based on characteristic difference compensation and fusion
CN116863285A (en) Infrared and visible light image fusion method for multiscale generation countermeasure network
Chen et al. Multi‐scale single image dehazing based on the fusion of global and local features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination