CN117197624A - Infrared-visible light image fusion method based on attention mechanism - Google Patents

Infrared-visible light image fusion method based on attention mechanism

Info

Publication number
CN117197624A
Authority
CN
China
Prior art keywords
image
infrared
visible light
fusion
light image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311089192.3A
Other languages
Chinese (zh)
Inventor
徐骞
陈征
赵文杰
邵雪明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202311089192.3A priority Critical patent/CN117197624A/en
Publication of CN117197624A publication Critical patent/CN117197624A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application discloses an infrared-visible light image fusion method based on an attention mechanism, which addresses the problem that existing methods cannot preserve, in a balanced way, the thermal radiation information of an infrared image and the detail and texture information of a visible light image. The method adopts an auto-encoder network architecture: convolution kernels of three sizes and a channel-spatial attention mechanism module are introduced into the encoder as feature extractors, where the convolution kernels of different sizes enhance the feature extraction capability of the network and the channel-spatial attention module makes the network focus more on salient target regions in the image, while a well-designed loss function guides the training of the network. The method can be trained on and perform fusion over publicly available infrared and visible light image datasets, and the fused result effectively retains the thermal radiation information of the infrared image and the texture information of the visible light image.

Description

Infrared-visible light image fusion method based on attention mechanism
Technical Field
The application relates to the technical field of image processing, in particular to an infrared-visible light image fusion method based on an attention mechanism.
Background
Due to the hardware limitations of imaging devices, sensors of a single type or single setting often cannot fully characterize an imaging scene. For example, visible light images typically contain rich texture and detail information, but extreme environments and occlusion can cause objects in the scene to be lost. In contrast, an infrared sensor can effectively highlight salient objects such as pedestrians and vehicles by capturing the thermal radiation emitted by objects, but it lacks a detailed description of the scene. Image fusion technology combines the complementary information of multi-modal images into a fused image that contains more information, providing higher-quality images for downstream vision tasks in fields such as automatic driving, security, and rescue.
Deep-learning-based fusion of infrared and visible light images is a hot spot of current research. According to the network architecture used, existing methods can be divided into auto-encoder-based image fusion frameworks, convolutional-neural-network-based image fusion frameworks, and image fusion frameworks based on generative adversarial networks.
Although the auto-encoder-based image fusion network shares the inherent weakness of deep learning, namely limited model interpretability, an auto-encoder comprises only two components, the encoder and the decoder, so its design difficulty is lower than that of the other two architectures. In addition, the manually designed fusion strategy enhances the interpretability of the network to a certain extent, which leaves more room for improving the performance of the fusion network.
Although the convolutional-neural-network-based image fusion framework can be trained end to end and does not require a manually designed fusion strategy, its internal mechanism is unclear because the neural network is a black box. As a result, the quality of the fused image obtained by such a framework depends entirely on the designer's experience, and the design is complex.
The image fusion network based on a generative adversarial network can generate a fused image without supervised data and is therefore well suited to infrared and visible light image fusion, but training a generative adversarial network is complex and the model is difficult to converge.
However, the current auto-encoder-based image fusion architecture is essentially a process of image decomposition and image reconstruction, which inevitably causes information loss, and the network's ability to retain information is key to its performance. In the fusion of infrared and visible light images, the network must extract and preserve the detail information of the visible light image while also retaining the thermal radiation information of the infrared image; both are indispensable. Most current methods cannot provide both capabilities, mainly because most networks use convolution kernels of only one size, which severely limits the receptive field of the model.
To solve the above problems, the present application proposes an auto-encoder-based image fusion network architecture that introduces an attention mechanism and multi-size convolution kernels.
Disclosure of Invention
Aiming at the defects and shortcomings of existing image fusion frameworks, the application provides an infrared and visible light image fusion network based on an attention mechanism, which solves the problems of insufficient image information extraction capability and weak retention of visible light texture information and infrared thermal radiation information.
The aim of the application is achieved by the following technical solution: an infrared-visible light image fusion method based on an attention mechanism, comprising the following steps:
step one: constructing an auto-encoder-based neural network model, wherein the neural network model comprises an encoder and a decoder, and the encoder comprises a multi-convolution kernel feature extraction module and an attention mechanism module;
step two: inputting the infrared-visible light image pair into the encoder, where the multi-convolution kernel feature extraction module performs feature extraction to obtain the respective feature maps of the infrared-visible light image pair; the infrared-visible light image pair comprises an infrared light image and the visible light image corresponding to the infrared light image;
step three: inputting the feature maps of the infrared-visible light image pair obtained in step two into the channel-spatial attention mechanism module to obtain the corresponding feature maps;
step four: inputting the feature maps obtained in step three into the decoder for feature dimension reduction to obtain a reconstructed image;
step five: taking the reconstructed image and the infrared-visible light image pair as inputs to the corresponding loss function, and continuously updating the network parameters through back-propagation to optimize the network until the loss value converges and stabilizes, indicating that training is complete and yielding a network model for generating the reconstructed image;
step six: inserting a fusion layer between the encoder and the decoder of the trained network model to obtain the final fusion image generation model; the fusion layer is provided with a fusion rule for fusing the feature maps of the infrared-visible light image pair output by the encoder;
step seven: inputting the infrared-visible light image pair into the fusion image generation model obtained in step six; the encoder outputs the feature maps of the infrared-visible light image pair, the fusion layer fuses these feature maps using the set fusion rule and outputs the fused feature map, and the fused feature map is input into the decoder to obtain the fused image.
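By way of illustration only, the sketch below outlines how the two-phase procedure of steps one to seven could be organized in PyTorch: a reconstruction phase that trains the encoder-decoder pair, followed by an inference phase in which a fusion layer is inserted between them. The tiny encoder and decoder, the plain MSE loss, the simple additive feature fusion, and all hyper-parameters are placeholder assumptions rather than the application's exact implementation.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the multi-kernel + attention encoder of steps one to three."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class TinyDecoder(nn.Module):
    """Stand-in for the reconstruction decoder of step four."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 1, 3, padding=1))
    def forward(self, f):
        return self.net(f)

encoder, decoder = TinyEncoder(), TinyDecoder()
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

# Phase 1 (steps one to five): train the auto-encoder to reconstruct each input image.
for ir, vis in [(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64))]:   # dummy image pair
    loss = sum(nn.functional.mse_loss(decoder(encoder(img)), img) for img in (ir, vis))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2 (steps six and seven): insert a fusion layer between encoder and decoder.
with torch.no_grad():
    f_ir, f_vis = encoder(torch.rand(1, 1, 64, 64)), encoder(torch.rand(1, 1, 64, 64))
    fused_feature = 0.5 * f_ir + 0.5 * f_vis        # placeholder for the L1-norm fusion rule
    fused_image = decoder(fused_feature)
```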
As a preferred embodiment of the present application, in step two the infrared and visible light images are input into the multi-convolution kernel feature extraction module, which includes convolution kernels of three sizes, namely 3×3, 5×5 and 7×7; the features extracted by the three convolution kernels are concatenated along the channel dimension, as expressed by the following formula:
F_i = Concat(F_i^3×3, F_i^5×5, F_i^7×7)
where i denotes the type of input image, with i = 1 and i = 2 denoting the infrared and visible light image, respectively; I_i denotes the input image; F_i^3×3, F_i^5×5 and F_i^7×7 denote the feature maps output by each convolution layer; and Concat denotes concatenation of the feature maps along the channel direction.
Further, because the neural network is optimized through back propagation, a reasonably designed loss function is required to guide the optimization of the network parameters so that the network converges quickly. The designed loss function is as follows:
where O_i and I_i denote the reconstructed image and the infrared-visible light image pair, respectively; L denotes the total loss function; L_ssim denotes the structural similarity loss; L_ms-ssim denotes the multi-scale structural similarity loss; L_gradient denotes the gradient loss; and L_mse denotes the mean square error loss.
Compared with the prior art, the application has the following beneficial effects:
Because the application adopts convolution kernels of multiple sizes together with an attention mechanism, the network not only has the capability of extracting multi-scale features but also learns which regions of the image are salient. This overcomes the insufficient feature extraction capability and the weak retention of salient source-image information found in the prior art, and greatly improves the preservation of detail information in the visible light image and salient information in the infrared image.
Drawings
Fig. 1 is a training flow chart of an infrared light and visible light image fusion network.
Fig. 2 is a flow chart of infrared and visible light image fusion network testing.
FIG. 3 is a diagram of an attention mechanism module and a multi-convolution kernel cross-connect module.
FIG. 4 is a qualitative comparison of the present application with other advanced fusion methods.
Detailed Description
The application is further illustrated and described below in connection with specific embodiments. The described embodiments are merely exemplary of the present disclosure and do not limit the scope. The technical features of the embodiments of the application can be combined correspondingly on the premise of no mutual conflict.
As shown in fig. 1, the infrared light and visible light image fusion method based on the attention mechanism comprises the following steps:
Step one: Performing data enhancement operations such as cropping and flipping on the visible light and infrared light images.
Step two: constructing a neural network model based on a self-encoder, wherein the neural network model comprises
Step three: inputting the preprocessed image in the first step into an encoder for feature extraction to obtain feature images of infrared light and visible light images
As shown in fig. 3, the following substeps are specifically included.
3.1 Obtain preliminary feature information.
The encoder comprises a basic convolution layer with a 3×3 convolution kernel and a cross-connected feature extraction module containing convolution kernels of three sizes. The source image is first input into the basic convolution layer, which extracts only preliminary features of the source image:
F_i^1 = Conv_3×3(I_i)
where F_i^1 denotes the feature information of the first layer of the encoder and i denotes the type of image.
3.2 Inputting the feature map obtained in sub-step 3.1) into the multi-convolution kernel cross-connected feature extraction module, which comprises convolution kernels of several sizes (3×3, 5×5, 7×7). The different convolution kernels give the network different receptive fields (for example, if a 3×3 convolution is applied to a 100×100 image, the receptive field of a neuron in the output feature map is a 3×3 region of the original image), so the network can extract more diverse feature information. Meanwhile, the cross connections between the different convolution kernels enhance the interaction between convolution layers, which reduces the information loss of the network to a certain extent. The formula is as follows:
F_i = Concat(F_i^3×3, F_i^5×5, F_i^7×7)
where i denotes the type of input image, with i = 1 and i = 2 denoting the infrared and visible light image, respectively; I_i denotes the input image; F_i^3×3, F_i^5×5 and F_i^7×7 denote the feature maps output by each convolution layer; and Concat denotes concatenation of the feature maps along the channel direction.
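For illustration, a minimal PyTorch sketch of a multi-convolution-kernel block consistent with the description above is given below: parallel 3×3, 5×5 and 7×7 convolutions whose outputs are concatenated along the channel dimension, with a simple cross connection in which each later branch also sees an earlier branch's output. The exact wiring of the cross connections, the channel counts, and the activation are assumptions rather than the application's implementation.

```python
import torch
import torch.nn as nn

class MultiKernelBlock(nn.Module):
    """Parallel 3x3 / 5x5 / 7x7 convolutions with a simple cross connection;
    the three outputs are concatenated along the channel dimension."""
    def __init__(self, in_ch=16, ch=16):
        super().__init__()
        self.conv3 = nn.Conv2d(in_ch, ch, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_ch + ch, ch, kernel_size=5, padding=2)  # also sees the 3x3 output
        self.conv7 = nn.Conv2d(in_ch + ch, ch, kernel_size=7, padding=3)  # also sees the 5x5 output
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        f3 = self.act(self.conv3(x))
        f5 = self.act(self.conv5(torch.cat([x, f3], dim=1)))
        f7 = self.act(self.conv7(torch.cat([x, f5], dim=1)))
        return torch.cat([f3, f5, f7], dim=1)   # channel-wise concatenation

# Example: a 16-channel feature map from the basic 3x3 layer, 64x64 pixels.
feat = MultiKernelBlock()(torch.rand(1, 16, 64, 64))   # -> shape (1, 48, 64, 64)
```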
3.3 During feature extraction, the network is required not only to extract as much useful information as possible but also to know which information is important. For this purpose, an attention mechanism module is introduced, comprising two components, a channel attention module and a spatial attention module, whose expressions are as follows:
where F_i^CA and F_i^SA denote the output feature maps of the channel attention module and the spatial attention module, respectively; MLP and Conv_1×1 denote a multi-layer perceptron and a convolution with a 1×1 kernel; MaxPooling(·) and AvgPooling(·) denote the maximum pooling and average pooling operations; and ⊗ denotes pixel-wise multiplication.
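By way of illustration, the sketch below implements a channel-spatial attention module in the widely used CBAM style, which is consistent with the operations named above (a shared MLP over max- and average-pooled channel descriptors, a 1×1 convolution over pooled spatial maps, and pixel-wise multiplication with the input). The sigmoid gating, the reduction ratio, and the sequential channel-then-spatial ordering are assumptions not stated in the description.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style channel attention followed by spatial attention (assumed form)."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(ch, ch // reduction), nn.ReLU(),
                                 nn.Linear(ch // reduction, ch))
        self.conv1x1 = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, f):
        b, c, _, _ = f.shape
        # Channel attention: shared MLP over max- and average-pooled channel descriptors.
        mx = self.mlp(torch.amax(f, dim=(2, 3)))
        av = self.mlp(torch.mean(f, dim=(2, 3)))
        f_ca = torch.sigmoid(mx + av).view(b, c, 1, 1) * f
        # Spatial attention: 1x1 convolution over channel-wise max and mean maps.
        pooled = torch.cat([f_ca.amax(dim=1, keepdim=True),
                            f_ca.mean(dim=1, keepdim=True)], dim=1)
        f_sa = torch.sigmoid(self.conv1x1(pooled)) * f_ca
        return f_sa

attended = ChannelSpatialAttention(ch=48)(torch.rand(1, 48, 64, 64))
```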
Step four: inputting the feature map obtained in the second step into a decoder to obtain a reconstructed image:
O = Decoder(F^De)
where O denotes the reconstructed image, Decoder(·) denotes the decoder, and F^De denotes the feature map that combines the outputs of the spatial attention and channel attention mechanisms.
Step five: and taking the reconstructed image and the visible-infrared light image pair as the input of a corresponding loss function, continuously updating network parameters by utilizing the counter-propagation property of the neural network, continuously optimizing the network, and finally obtaining a network model capable of generating high-quality reconstructed images. The loss function expression is as follows:
where O_i and I_i denote the reconstructed image and the visible-infrared light image pair, respectively; L denotes the total loss function; L_ssim denotes the structural similarity loss; L_ms-ssim denotes the multi-scale structural similarity loss; L_gradient denotes the gradient loss; and L_mse denotes the mean square error loss.
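A hedged sketch of a composite reconstruction loss built from the four named terms is given below. The equal weighting of the terms, the use of the third-party pytorch_msssim package for the SSIM and MS-SSIM terms, the Sobel-based gradient loss, and the assumption of single-channel images in the [0, 1] range are all assumptions; the application does not specify these details.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim, ms_ssim   # third-party package, assumed available

def gradient_loss(recon, target):
    """L1 difference of Sobel gradient magnitudes (assumed form of L_gradient).
    Both tensors are (N, 1, H, W) single-channel images."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=recon.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    def grad(img):
        return F.conv2d(img, kx, padding=1).abs() + F.conv2d(img, ky, padding=1).abs()
    return F.l1_loss(grad(recon), grad(target))

def reconstruction_loss(recon, target):
    """Total loss combining L_ssim, L_ms-ssim, L_gradient and L_mse (weights assumed to be 1)."""
    l_ssim = 1.0 - ssim(recon, target, data_range=1.0)
    l_msssim = 1.0 - ms_ssim(recon, target, data_range=1.0)
    return l_ssim + l_msssim + gradient_loss(recon, target) + F.mse_loss(recon, target)
```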
Step six: insertion in an encoder and decoder for generating a network model of a reconstructed imageA fusion layer for obtaining a fusion image generation model capable of generating a fusion image; the fusion layer is provided with a fusion rule for fusing the characteristic images of the infrared-visible light image pair output by the encoder, and the fusion rule of the fusion layer uses L 1 A norm;
according to L 1 Fusion feature map obtained by norm fusion rule:
where ‖·‖_1 denotes the L1 norm; r denotes the size of the filter; the weight map is obtained by average filtering; (x, y) denotes the pixel position in the feature map; F^f denotes the fused feature map; and ω_i denotes the weight of the corresponding feature map.
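For illustration, the sketch below follows the fusion rule as described: the per-pixel L1 norm of each encoder feature map over the channel dimension gives an activity map, average filtering with a window of size r smooths it into a weight map, the weight maps are normalized so that they sum to one at every pixel, and the fused feature map is the weighted sum of the two feature maps. This mirrors the well-known DenseFuse-style strategy; the window size and the exact normalization are assumptions.

```python
import torch
import torch.nn.functional as F

def l1_norm_fusion(f_ir, f_vis, r=3):
    """Fuse two encoder feature maps with an L1-norm / average-filtering rule."""
    weights = []
    for f in (f_ir, f_vis):
        activity = f.abs().sum(dim=1, keepdim=True)            # per-pixel L1 norm over channels
        weights.append(F.avg_pool2d(activity, kernel_size=r,   # average filter of size r
                                    stride=1, padding=r // 2))
    w_ir, w_vis = weights
    total = w_ir + w_vis + 1e-8                                # normalize: omega_ir + omega_vis = 1
    return (w_ir / total) * f_ir + (w_vis / total) * f_vis

fused_feat = l1_norm_fusion(torch.rand(1, 48, 64, 64), torch.rand(1, 48, 64, 64))
```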
Step seven: As shown in fig. 2, the infrared-visible light image pair is input into the fusion image generation model obtained in step six; the encoder outputs the feature maps of the infrared-visible light image pair, the fusion layer fuses these feature maps using the set fusion rule and outputs the fused feature map, and the fused feature map is input into the decoder to obtain the fused image.
Fig. 4 shows fused images produced by the fusion method of the present application and by other advanced fusion methods. To facilitate comparison, the same position in each fused image is enlarged, as indicated by the framed regions in the figure. The present application has clear advantages over the other methods. First, the fusion result of the present application preserves more detail information of the visible light image, such as the texture of the branches in the figure; this benefits from the multi-convolution kernel feature extraction, which enhances the feature extraction capability of the network. Second, because the attention mechanism is introduced in the feature extraction stage, the network pays more attention to the salient regions of the source images; compared with the other methods, the method disclosed in the application retains more infrared thermal radiation information, such as the thermal radiation of the person in the figure. Table 1 shows the average values of the quantitative analysis of the present application.
TABLE 1 quantitative comparison results of the present application and other methods
In addition to the qualitative analysis, a quantitative analysis was performed on the TNO dataset using 8 image evaluation indexes: entropy (EN), mutual information (MI), average gradient (AG), standard deviation (SD), spatial frequency (SF), structural similarity (SSIM), visual information fidelity (VIF), and the sum of the correlations of differences (SCD).
As can be seen from Table 1, the fusion results of the present application achieve the best values on four indexes: entropy, average gradient, standard deviation, and spatial frequency. Entropy measures the richness of the information contained in the image; a larger value means the image contains more information, which indirectly reflects the information-retention capability of the fusion network. The average gradient evaluates the sharpness and detail of the image; a larger value means the fused image is clearer and contains more detail and texture information, indicating a better fusion effect. The standard deviation evaluates the contrast of the fused image; a larger value means more contrast and texture information is preserved, indicating a better fusion effect. The spatial frequency describes the rate of change of image detail; a larger value indicates that the fused image contains more texture and detail information. Although the best values are not obtained on the remaining indexes, the method achieves sub-optimal values on indexes such as the sum of the correlations of differences and the structural similarity.
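For reference, the four indexes on which the best values are reported can be computed from a fused image with their standard textbook definitions, as sketched below (Shannon entropy of the grey-level histogram, mean forward-difference gradient magnitude, pixel standard deviation, and spatial frequency); this is illustrative code, not the evaluation implementation used for Table 1.

```python
import numpy as np

def entropy(img, bins=256):
    """EN: Shannon entropy of the grey-level histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def average_gradient(img):
    """AG: mean magnitude of horizontal/vertical forward differences."""
    img = img.astype(np.float64)
    gx = img[:-1, 1:] - img[:-1, :-1]
    gy = img[1:, :-1] - img[:-1, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

def spatial_frequency(img):
    """SF: root of the summed squared row and column frequencies."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))
    return float(np.sqrt(rf ** 2 + cf ** 2))

fused = (np.random.rand(256, 256) * 255).astype(np.uint8)   # stand-in for a fused image
print(entropy(fused), average_gradient(fused), fused.std(), spatial_frequency(fused))
```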
In this application, an infrared and visible light image fusion network based on an attention mechanism is provided. The auto-encoder structure adopted by the application has the characteristics of simple design and good fusion effect. For the image fusion task, it contains three components: feature extraction, feature fusion, and feature reconstruction. Feature extraction is the most important: the quality of the fusion depends on whether the network can sufficiently extract the information in the source images, so the application adopts a cross-connected multi-convolution-kernel structure in the feature extraction layer. In addition, the network needs to know which information is important and which is noise, so the attention mechanism module is introduced; this module not only makes the network pay more attention to important regions but also removes noise to a certain extent. Qualitative and quantitative analysis on the TNO dataset shows that the method provided by the application can effectively fuse infrared and visible light images.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit of the application.

Claims (7)

1. An infrared-visible light image fusion method based on an attention mechanism, characterized by comprising the following steps:
step one: constructing an auto-encoder-based neural network model, wherein the neural network model comprises an encoder and a decoder, and the encoder comprises a multi-convolution kernel feature extraction module and an attention mechanism module;
step two: inputting the infrared-visible light image pair into the encoder, where the multi-convolution kernel feature extraction module performs feature extraction to obtain the respective feature maps of the infrared-visible light image pair; the infrared-visible light image pair comprises an infrared light image and the visible light image corresponding to the infrared light image;
step three: inputting the feature maps of the infrared-visible light image pair obtained in step two into the channel-spatial attention mechanism module to obtain the corresponding feature maps;
step four: inputting the feature maps obtained in step three into the decoder for feature dimension reduction to obtain a reconstructed image;
step five: taking the reconstructed image and the infrared-visible light image pair as inputs to the corresponding loss function, continuously updating the network parameters through back-propagation, and continuously optimizing the neural network model until the loss function value converges and stabilizes, which indicates that training of the neural network model is complete and yields the neural network model for generating the reconstructed image;
step six: inserting a fusion layer between the encoder and the decoder of the neural network model for generating the reconstructed image to obtain the final fusion image generation model capable of generating a fused image; the fusion layer is provided with a fusion rule for fusing the feature maps of the infrared-visible light image pair output by the encoder;
step seven: inputting the infrared-visible light image pair into the fusion image generation model obtained in step six; the encoder outputs the feature maps of the infrared-visible light image pair, the fusion layer fuses the feature maps output by the encoder using the set fusion rule and outputs the fused feature map, and the fused feature map is input into the decoder to obtain the fused image.
2. The method of claim 1, wherein in step two, feature extraction is performed on the infrared-visible light image pair by the multi-convolution kernel feature extraction module, which comprises convolution kernels of three sizes, namely 3×3, 5×5 and 7×7; the features extracted by the three convolution kernels are concatenated along the channel dimension, as expressed by the following formula:
F_i = Concat(F_i^3×3, F_i^5×5, F_i^7×7)
where i denotes the type of input image, with i = 1 and i = 2 denoting the infrared image and the visible light image, respectively; I_i denotes the input image; F_i^3×3, F_i^5×5 and F_i^7×7 denote the feature maps output by each convolution layer; and Concat denotes concatenation of the feature maps along the channel direction.
3. The method of claim 1, wherein in step three, the feature map obtained through the attention mechanism module may be expressed as:
where F_i^CA and F_i^SA denote the output feature maps of the channel attention module and the spatial attention module, respectively; MLP and Conv_1×1 denote a multi-layer perceptron and a convolution with a 1×1 kernel; MaxPooling(·) and AvgPooling(·) denote the maximum pooling and average pooling operations; and ⊗ denotes pixel-wise multiplication.
4. The infrared-visible light image fusion method according to claim 1, wherein in step four, the obtained feature map is input into a decoder:
O = Decoder(F^De)
where O denotes the reconstructed image, Decoder(·) denotes the decoder, and F^De denotes the feature map that combines the outputs of the spatial attention and channel attention mechanisms.
5. The method for fusing infrared-visible light images according to claim 1, wherein in step five, the image output by the decoder is used as the predicted value and the infrared-visible light image pair is used as the reference value for calculating the loss, the loss function being expressed as follows:
where O_i and I_i denote the reconstructed image and the infrared-visible light image pair, respectively; L denotes the total loss function; L_ssim denotes the structural similarity loss; L_ms-ssim denotes the multi-scale structural similarity loss; L_gradient denotes the gradient loss; and L_mse denotes the mean square error loss.
6. The method according to claim 5, wherein in step five, the reconstructed image and the infrared-visible light image pair are input into the loss function, which adds a constraint that guides the training of the neural network model; the network parameters are updated by back propagation, and training of the neural network model is complete when the loss function value approaches zero and no longer changes.
7. The method according to claim 1, wherein in step six, the fusion rule of the fusion layer uses the L1 norm, and the fused feature map is obtained according to the L1-norm fusion rule, expressed as follows:
where ‖·‖_1 denotes the L1 norm; r denotes the size of the filter; the weight map is obtained by average filtering; (x, y) denotes the pixel position in the feature map; F^f denotes the fused feature map; and ω_i denotes the weight of the corresponding feature map.
CN202311089192.3A 2023-08-28 2023-08-28 Infrared-visible light image fusion method based on attention mechanism Pending CN117197624A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311089192.3A CN117197624A (en) 2023-08-28 2023-08-28 Infrared-visible light image fusion method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311089192.3A CN117197624A (en) 2023-08-28 2023-08-28 Infrared-visible light image fusion method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN117197624A true CN117197624A (en) 2023-12-08

Family

ID=89002705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311089192.3A Pending CN117197624A (en) 2023-08-28 2023-08-28 Infrared-visible light image fusion method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN117197624A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117517335A (en) * 2023-12-27 2024-02-06 国网辽宁省电力有限公司电力科学研究院 System and method for monitoring pollution of insulator of power transformation equipment
CN117517335B (en) * 2023-12-27 2024-03-29 国网辽宁省电力有限公司电力科学研究院 System and method for monitoring pollution of insulator of power transformation equipment
CN117726979A (en) * 2024-02-18 2024-03-19 合肥中盛水务发展有限公司 Piping lane pipeline management method based on neural network

Similar Documents

Publication Publication Date Title
Ma et al. GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion
CN108986050B (en) Image and video enhancement method based on multi-branch convolutional neural network
Li et al. Survey of single image super‐resolution reconstruction
CN111275618A (en) Depth map super-resolution reconstruction network construction method based on double-branch perception
CN117197624A (en) Infrared-visible light image fusion method based on attention mechanism
CN111754446A (en) Image fusion method, system and storage medium based on generation countermeasure network
CN110472634A (en) Change detecting method based on multiple dimensioned depth characteristic difference converged network
CN112241939B (en) Multi-scale and non-local-based light rain removal method
Wang et al. FusionGRAM: An infrared and visible image fusion framework based on gradient residual and attention mechanism
CN111242173A (en) RGBD salient object detection method based on twin network
Wang et al. TF-SOD: a novel transformer framework for salient object detection
CN117474781A (en) High spectrum and multispectral image fusion method based on attention mechanism
CN113538243A (en) Super-resolution image reconstruction method based on multi-parallax attention module combination
CN105931189A (en) Video ultra-resolution method and apparatus based on improved ultra-resolution parameterized model
Zhou et al. MSAR‐DefogNet: Lightweight cloud removal network for high resolution remote sensing images based on multi scale convolution
CN114022356A (en) River course flow water level remote sensing image super-resolution method and system based on wavelet domain
Shit et al. An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection
CN111861870B (en) End-to-end parallel generator network construction method for image translation
CN113628143A (en) Weighted fusion image defogging method and device based on multi-scale convolution
CN117314808A (en) Infrared and visible light image fusion method combining transducer and CNN (carbon fiber network) double encoders
CN111598841B (en) Example significance detection method based on regularized dense connection feature pyramid
CN115965844B (en) Multi-focus image fusion method based on visual saliency priori knowledge
CN116883303A (en) Infrared and visible light image fusion method based on characteristic difference compensation and fusion
CN116863285A (en) Infrared and visible light image fusion method for multiscale generation countermeasure network
Chen et al. Multi‐scale single image dehazing based on the fusion of global and local features

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination