CN114782298A - Infrared and visible light image fusion method with regional attention - Google Patents

Infrared and visible light image fusion method with regional attention

Info

Publication number
CN114782298A
Authority
CN
China
Prior art keywords
image
fusion
infrared
encoder
visible light
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210434625.3A
Other languages
Chinese (zh)
Other versions
CN114782298B (en)
Inventor
杜友田
蓝宇
王航
王雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University filed Critical Xi'an Jiaotong University
Priority to CN202210434625.3A priority Critical patent/CN114782298B/en
Publication of CN114782298A publication Critical patent/CN114782298A/en
Application granted granted Critical
Publication of CN114782298B publication Critical patent/CN114782298B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 5/00: Image enhancement or restoration
            • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
          • G06T 2207/00: Indexing scheme for image analysis or image enhancement
            • G06T 2207/10: Image acquisition modality
              • G06T 2207/10048: Infrared image
            • G06T 2207/20: Special algorithmic details
              • G06T 2207/20081: Training; Learning
              • G06T 2207/20084: Artificial neural networks [ANN]
              • G06T 2207/20212: Image combination
                • G06T 2207/20221: Image fusion; Image merging
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
              • G06N 3/08: Learning methods
                • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Infrared and visible light image fusion exploits the complementarity of the two modalities, combining thermal radiation, texture detail and other information from the same scene so that the fused image is more comprehensive and clearer, which benefits human observation and subsequent tasks. Image fusion typically proceeds in three steps: feature extraction, feature fusion and image reconstruction. The invention provides a fusion method with regional attention. First, high-dimensional features are extracted with an encoder; then a fusion strategy with salient-region attention fuses the features; finally, a decoder reconstructs the image. The invention aims to solve the problem of image fusion in scenes with insufficient illumination. The results show that the method fully retains the good texture details of the visible light image and uses the infrared image to supplement the content of underexposed areas. In addition, because salient regions receive attention, regions that are highlighted in the source images remain highlighted in the fused image, achieving a good complementary effect between the infrared and visible light images.

Description

Infrared and visible light image fusion method with regional attention
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an infrared and visible light image fusion method with regional attention.
Background
With the steady development of the hardware and software industries, the ability to collect information with sensors and to transmit and process it has grown steadily. In this context, vision-based sensors are widely used because they provide rich environmental information. A single type of sensor characterizes only one aspect of a scene and cannot meet the requirement of a comprehensive description of the monitored environment, so multi-sensor systems have attracted increasing attention and application. Multi-source imaging systems make up for the limited expressive capability of a single sensor. At present, image fusion technology has great application value in remote sensing, safe navigation, medical image analysis, anti-terrorism inspection, environmental protection, traffic monitoring, clear image reconstruction, disaster detection and forecasting, and especially in computer vision.
For visual multi-source sensor systems, infrared and visible light images can be acquired with relatively simple equipment, and their fusion is the most typical case. Because the two images have different imaging mechanisms, the visible light image generally has higher spatial resolution and contrast and suits human visual perception, but it is easily degraded by harsh conditions such as insufficient brightness, heavy rain and haze. The infrared image, by contrast, has better resistance to scene interference and displays objects hotter than their surroundings, such as pedestrians, more prominently; however, infrared images generally have low resolution and poor detail. Fusing the two images displays both kinds of information in one image, highlights the target, and yields richer detail and greater robustness to harsh environments than either single image. Therefore, infrared and visible light image fusion aims to fuse the details of infrared and visible light images of the same scene while preserving both the highlighted thermal-radiation targets of the infrared image and the high-resolution background texture details of the visible light image, so that the fused image is more informative and better suited to human recognition and aesthetics, automatic machine detection, and subsequent computer image processing.
The prior art and its defects are as follows.
The general steps of image fusion are feature extraction, feature fusion and feature reconstruction, where feature reconstruction is the inverse of feature extraction; feature extraction and fusion are the two most critical elements of image fusion. Among conventional methods, the multi-scale transform (MST) is the most common image fusion approach; its main strengths are an accurate representation of the spatial structure of an image and consistency between space and spectrum. Many multi-scale transforms have been proposed, such as pyramid transforms, wavelet transforms, contourlet transforms and related variants. In addition, fusion algorithms based on sparse representation (SR) and subspace-based methods such as principal component analysis and independent component analysis have also been proposed.
In recent years, deep learning has demonstrated state-of-the-art performance in many fields and has also been applied successfully to image fusion. These algorithms can be roughly divided into three types: auto-encoder (AE) based methods, CNN-based methods and GAN-based methods. Li et al. proposed a simple auto-encoder fusion architecture that includes an encoder, a fusion layer and a decoder. Later they increased the complexity of the encoder and proposed a nested, auto-encoder-based fusion method to obtain a more comprehensive feature fusion. The drawback of these methods is that fusion performance is limited by the manually designed fusion strategy. Zhang et al. developed a general image fusion framework with a generic network structure, i.e., a feature extraction layer, a fusion layer and an image reconstruction layer, and learned feature extraction, feature fusion and image reconstruction under the guidance of a class of complex loss functions. Such methods only address fusion at the global level and do not highlight the target regions of interest. Ma et al. introduced GANs into the image fusion community, using a discriminator to force the generator to synthesize a fused image with rich texture. They also introduced a detail loss and an edge-enhancement loss to improve the quality of detail information and sharpen the edges of hot objects. Because GANs are difficult to train, this approach fails to achieve good fusion quality and also fails to highlight salient information.
Disclosure of Invention
In order to overcome the above drawbacks of the prior art, an object of the present invention is to provide an infrared and visible light image fusion method with regional attention, used to solve the problem of fusing infrared and visible light images in scenes with insufficient illumination. The proposed method fully exploits the respective advantages of infrared and visible light images for scene representation. By extracting and fusing high-dimensional image features, the thermal radiation information of the infrared image and the texture information of the visible light image can be fully fused in scenes with insufficient illumination. Moreover, the regional attention module in the fusion network focuses on salient regions in the high-dimensional features, such as highlighted targets in the infrared image and well-exposed regions in the visible light image, and increases the pixel intensity of these parts during fusion to realize image fusion with regional attention, thereby achieving complementary advantages of the infrared and visible light images.
To achieve the above purpose, the invention adopts the following technical scheme:
an infrared and visible image fusion method with regional attention, comprising:
step 1, training an auto-encoder (Auto Encoder), wherein the auto-encoder comprises an encoder and a decoder;
step 1.1: reading an image I in the training set in RGB format, resizing the image, and converting it into the YCbCr color space;
step 1.2: inputting the luminance channel I_Y of the image into the encoder to obtain a high-dimensional feature map F;
step 1.3: inputting the high-dimensional feature map F into the decoder and outputting a luminance channel map O_Y;
step 1.4: calculating the loss between I_Y and O_Y according to the loss function, then optimizing the gradient, back-propagating, and updating the model parameters of the auto-encoder;
step 1.5: repeating steps 1.1 to 1.4 until the number of iterations over the whole training set reaches a set threshold, obtaining a trained auto-encoder;
step 2: making a fused image training set
acquiring infrared and visible light image pairs for training and performing sub-image cropping to expand the data set, wherein the cropping size is consistent with the image size adjusted in step 1, obtaining the fused image training set;
step 3: training the fusion network
step 3.1: converting the infrared and visible light image pairs (I_R, I_V) in the fused image training set into the YCbCr color space and extracting their respective luminance channel maps to obtain (I_RY, I_VY);
step 3.2: inputting I_RY and I_VY separately into the encoder trained in step 1 and computing the feature maps (F_R, F_V);
step 3.3: concatenating (F_R, F_V) along the feature dimension, inputting them into the fusion network, and computing the fused feature map F_F;
step 3.4: inputting F_F into the decoder for decoding to obtain the fused luminance-channel image O_FY;
step 3.5: calculating the loss value according to the loss function, then optimizing the gradient, back-propagating, and updating the model parameters of the fusion network;
step 3.6: repeating steps 3.1 to 3.5 until the number of iterations over the whole fused image training set reaches a set value, obtaining a trained fusion network;
step 4, obtaining the fused image
step 4.1: obtaining the fused luminance-channel image O_FY from the infrared and visible light image pair to be fused according to the method of steps 3.1 to 3.4;
step 4.2: concatenating O_FY with the CbCr channels of the visible light image along the feature dimension to obtain an image in YCbCr format, and then converting it into RGB format to obtain the fused image.
In one embodiment, the encoder has four convolutional layers with dense connections, and the decoder uses four directly connected convolutional layers.
In one embodiment, in the encoder and the decoder, the convolution kernel size is 3 × 3, the stride is 1, the padding is 1, and the ReLU activation function is used. In step 1.2, the input size is 256 × 256 × 1 and the size of the obtained high-dimensional feature map F is 256 × 256 × 128; in step 1.3, the luminance channel map O_Y has size 256 × 256 × 1.
In one embodiment, after step 1.5, the training data is replaced with test data and steps 1.1 to 1.3 are performed to obtain O_Y; O_Y is then concatenated with the CbCr channels from step 1.1 along the feature dimension to obtain an image in YCbCr format, which is converted into RGB format to obtain an output image O; whether O is consistent with I is verified subjectively.
In one embodiment, the calculation steps of step 3.3 are as follows:
(1) (F_R, F_V) are concatenated along the feature dimension and passed through the convolutional layers Conv_1, Conv_2 and Conv_3 to obtain the global information fusion feature map F_F_0;
(2) F_R and F_V are separately input into the same regional attention module RAB to compute the attention feature maps (M_R, M_V); (M_R, M_V) are concatenated along the feature dimension and input into the convolutional layer Conv_Att to obtain the fused attention feature map M_RV;
(3) the fused feature map is computed as F_F = F_F_0 + M_RV, i.e., the pixel values at corresponding positions are added.
In one embodiment, step 1.4 and step 3.5 both use the Adam optimizer to optimize the gradient; in step 3.5, the model parameters of the auto-encoder are fixed and only the model parameters of the fusion network are updated.
In one embodiment, in step 2, images containing scenes with insufficient illumination and salient targets are selected from the public data set TNO to form a training set and a test set, and the training set is expanded offline by sub-image cropping of the original infrared and visible light images, where the sub-image cropping size is 256 × 256 and the cropping stride is 16.
Compared with the prior art, the invention has the beneficial effects that:
First: in scenes with insufficient illumination, the texture information of the visible light image and the thermal radiation information of the infrared image can be fully fused. After training, the encoder can fully extract the high-dimensional features of the image, and because the loss is computed on the high-dimensional features, deep fusion of the features of all dimensions during fusion is guaranteed.
Second, on the basis of fusing the global content, attention is paid to regions that are prominently highlighted in the source images, and these regions remain highlighted in the fused image. The fusion network comprises two fusion paths, global fusion and salient-region fusion. The regional attention module extracts salient regions of the image at multiple scales, and the results of the two fusion paths are added, so that the salient regions receive higher-intensity luminance values and remain highlighted.
Third, the fused image has good contrast and clarity. During training, the structural loss is measured in terms of gray level, contrast and structural similarity. The gradient loss gives the fused image good texture details and increases sharpness. In addition, the strategy of fusing only the luminance channel of the image allows the invention to process both gray-scale and color images. Because the CbCr channels of the visible light image do not participate in the computation, the fusion result restores the color of the visible light image well.
Drawings
Figure 1 gives an overall block diagram of the scheme. The input is the infrared and visible light images to be fused and the output is the fused image. The network consists of an Encoder, an Attention Fusion Net and a Decoder. The dashed box indicates that the loss function consists of three parts: feature loss, structural similarity loss and gradient loss.
Fig. 2 gives the structure of the auto-encoder and the composition of the loss function required for training.
Fig. 3 gives the structure of the attention fusion network. The input is the feature maps (F_R, F_V) and the output is the fused feature map F_F.
Fig. 4 gives the network structure of the regional attention module. A feature map F is input and an attention map M is output.
Fig. 5 gives three sets of fused image examples. The boxes mark the fusion effect on salient targets.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the drawings and examples.
A visible light sensor can capture an image that is sufficiently clear and matches human viewing habits when lighting is adequate. The setting that best highlights the advantages of infrared and visible light image fusion is therefore the scene with insufficient illumination. How to make the fusion result compensate for underexposure and highlight the targets of interest, so as to better serve human observation and subsequent high-level tasks, is a current problem.
Most previous fusion methods design the fusion strategy from a global perspective and focus on fusing content such as image texture details; for targets that are salient in the infrared image, such as people and vehicles, brightness is reduced in the fused image because components of the visible light image are introduced. Some methods introduce attention to salient targets but require additional algorithms to obtain a binary target-segmentation map in advance. Moreover, existing methods pay insufficient attention to nighttime scenes, which are the most widespread application of infrared imaging.
Based on this, the invention provides an infrared and visible light image fusion method with regional attention; the overall structure is shown in Fig. 1, and the steps are as follows:
Step 1: train the auto-encoder (Auto Encoder). The structure of the auto-encoder is shown in Fig. 2; it includes an Encoder and a Decoder. Each rectangle in the figure represents a layer, and both the Encoder and the Decoder are composed of convolutional layers and activation layers. The loss includes a structural loss (ssim loss) and a content loss (pixel loss). In this embodiment, the Encoder has four convolutional layers with dense connections; the Decoder uses four directly connected convolutional layers. The convolution kernel size is 3 × 3, the stride is 1, and the padding is 1. The activation layers use the ReLU activation function. The parameters of each layer of the Encoder and the Decoder are set as follows:
Layer | Encoder | Decoder
L1 | Conv(I1, O32, K3, S1, P1), ReLU | Conv(I128, O64, K3, S1, P1), ReLU
L2 | Conv(I32, O32, K3, S1, P1), ReLU | Conv(I64, O32, K3, S1, P1), ReLU
L3 | Conv(I64, O32, K3, S1, P1), ReLU | Conv(I32, O16, K3, S1, P1), ReLU
L4 | Conv(I96, O32, K3, S1, P1), ReLU | Conv(I16, O1, K3, S1, P1), ReLU
(I: input channels, O: output channels, K: kernel size, S: stride, P: padding)
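To make the table concrete, the following PyTorch sketch reproduces the Encoder and Decoder with the channel widths listed above. The exact dense-connection wiring is an assumption read off the channel counts: each Encoder layer takes the concatenation of all previous 32-channel outputs, and the Encoder output is the concatenation of all four outputs, which gives the 128-channel feature map F mentioned in step 1.2. Class and variable names are chosen for illustration only.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Densely connected encoder: each layer sees the concatenation of all previous outputs."""
    def __init__(self):
        super().__init__()
        # Conv(I, O, K3, S1, P1) + ReLU, as in the layer table
        self.conv1 = nn.Sequential(nn.Conv2d(1,  32, 3, 1, 1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(32, 32, 3, 1, 1), nn.ReLU())
        self.conv3 = nn.Sequential(nn.Conv2d(64, 32, 3, 1, 1), nn.ReLU())
        self.conv4 = nn.Sequential(nn.Conv2d(96, 32, 3, 1, 1), nn.ReLU())

    def forward(self, y):                              # y: B x 1 x 256 x 256 luminance
        f1 = self.conv1(y)
        f2 = self.conv2(f1)
        f3 = self.conv3(torch.cat([f1, f2], dim=1))
        f4 = self.conv4(torch.cat([f1, f2, f3], dim=1))
        return torch.cat([f1, f2, f3, f4], dim=1)      # B x 128 x 256 x 256 (feature map F)

class Decoder(nn.Module):
    """Plainly stacked decoder: 128 -> 64 -> 32 -> 16 -> 1 channels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(128, 64, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(64,  32, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(32,  16, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(16,   1, 3, 1, 1), nn.ReLU(),
        )

    def forward(self, f):                              # f: B x 128 x 256 x 256
        return self.net(f)                             # B x 1 x 256 x 256 (luminance map O_Y)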
Step 1.1: an image I in the training set is read with the OpenCV imread function; the read image I is in RGB format and is resized to 256 × 256 × 3. It is then converted from RGB to the YCbCr color space; the conversion can use the OpenCV library function cvtColor. Finally, each pixel of the image is divided by 255, normalizing the pixel values to [0, 1], which gives the input image.
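A minimal sketch of this preprocessing with OpenCV follows; note that cv2.imread returns BGR data and that OpenCV's YCrCb conversion orders the channels Y, Cr, Cb. The function name and return convention are assumptions for illustration.

import cv2
import numpy as np

def load_luminance_and_chroma(path, size=(256, 256)):
    """Read an image, resize to 256x256, convert to YCbCr and normalize to [0, 1]."""
    bgr = cv2.imread(path)                             # OpenCV loads images in BGR order
    bgr = cv2.resize(bgr, size)                        # 256 x 256 x 3
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)     # channels ordered Y, Cr, Cb
    ycrcb = ycrcb.astype(np.float32) / 255.0           # normalize pixel values to [0, 1]
    y = ycrcb[:, :, 0:1]                               # luminance channel I_Y, 256 x 256 x 1
    crcb = ycrcb[:, :, 1:3]                            # chroma channels, kept for steps 1.7 / 4.2
    return y, crcb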
Step 1.2: luminance channel I of imageYThe input Encoder encorder has an input size of 256 × 256 × 1, resulting in a high-dimensional feature map F with a size of 256 × 256 × 128.
Step 1.3: inputting the high-dimensional feature map F into a Decoder to obtain an output brightness channel map OYThe size is 256 × 256 × 1.
Step 1.4: calculating I from the loss functionYAnd OYThe loss function is defined as:
Figure BDA0003612506110000071
wherein, mu (1-SSIM (O)Y,IY) Is the structural loss, and SSIM (. cndot.) is the structural similarity function.
Figure BDA0003612506110000072
For content loss, i.e. calculating IYAnd OYThe euclidean distance of (c). Mu is a hyperparameter used to balance the two losses. H and W are the height and width of the image, respectively.
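A sketch of this training loss in PyTorch is given below; the SSIM term uses the third-party pytorch_msssim package as one possible differentiable SSIM implementation (an assumption), and the content term is the Euclidean distance normalized by H × W as reconstructed above.

import torch
from pytorch_msssim import ssim   # assumed third-party differentiable SSIM

def autoencoder_loss(o_y, i_y, mu=1.0):
    """Structural loss + content loss for the auto-encoder (step 1.4).
    o_y, i_y: tensors of shape B x 1 x H x W with values in [0, 1]."""
    _, _, h, w = i_y.shape
    structural = mu * (1.0 - ssim(o_y, i_y, data_range=1.0))
    content = torch.norm(o_y - i_y, p=2) / (h * w)     # Euclidean distance, normalized by H*W
    return structural + content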
Step 1.5: and optimizing the gradient in a mode of an Adam optimizer and the like, reversely propagating, and updating the model parameters of the self-encoder.
Step 1.6: steps 1.1 to 1.5 are repeated. And obtaining the trained self-encoder until the iteration times on the whole training set reach a set threshold value.
In the present embodiment, an open source color image data set MS-COCO is used, and 80000 images are included in total. The algorithm is implemented with python and pytorch, based on GPU training of a block NVIDIA TITAN V, epoch set to 2, batch size set to 16, and hyperparameter μ set to 1.
Step 1.7: to verify the training, the training data can be changed into test data, and steps 1.1 to 1.3 are performed to obtain OY. Then adding OYConnecting with the CbCr channel image in the step 1.1 in characteristic dimension to obtain an image in YCbCr format, and then converting into RGB format to obtain an output image O; subjectively verifying whether the output image O coincides with the input image I.
And 2, making a fusion image training set and a test set.
From the disclosed infrared and visible image fusion data set TNO, images containing low-light scenes and having significant targets are selected to constitute a training set and a test set. This example picks 41 pairs of darker bright images as the training set and 25 pairs as the test set. Then, the training set is expanded in an off-line mode, wherein the expansion mode is as follows: and (3) performing sub-image cropping on the original infrared and visible light images, wherein the size of each sub-image is consistent with the size of the image adjusted in the step (1), namely 256 multiplied by 256, and the cropping moving step size is 16, so that 13940 pairs of infrared and visible light images are obtained finally.
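A sketch of this offline sliding-window cropping (256 × 256 patches, stride 16) might look as follows; the pairing and registration of the TNO files is assumed to have been handled beforehand, and the images are NumPy arrays as read by OpenCV.

def crop_pairs(ir_img, vis_img, patch=256, stride=16):
    """Cut aligned 256x256 sub-image pairs from a registered IR/visible pair
    with a sliding window of stride 16 (step 2 data augmentation)."""
    h, w = ir_img.shape[:2]
    pairs = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            ir_crop = ir_img[top:top + patch, left:left + patch]
            vis_crop = vis_img[top:top + patch, left:left + patch]
            pairs.append((ir_crop, vis_crop))
    return pairs

# Example: ir and vis are one registered training pair read with cv2.imread
# sub_pairs = crop_pairs(ir, vis)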
Step 3: train the fusion network.
Step 3.1: the infrared and visible light image pairs (I_R, I_V) in the fused image training set are read, and then the same operations as in step 1.1 are performed, i.e., conversion to the YCbCr color space and extraction of the respective luminance channel maps, to obtain (I_RY, I_VY).
Step 3.2: I_RY and I_VY are separately input into the Encoder trained in step 1, and the feature maps (F_R, F_V) are computed.
Step 3.3: will (F)R,FV) Connecting in characteristic dimensions, inputting into a fusion network, and calculating to obtain a fusion characteristic diagram FF. The structure of the fused layer is shown in FIG. 3. The fusion process includes two paths, namely global information fusion and attention feature map fusion, namely a global information fusion network and an attention feature map fusion network. The former contains three convolutional layers, Conv _1, Conv _2 and Conv _3, and the latter contains a regional attention module RAB and a convolutional layer Conv _ Att, in this embodiment, the network layer parameters may be set as:
[Table of fusion-network layer parameters, provided as images in the original publication.]
the calculation steps in the converged network are as follows:
(1) will (F)R,FV) Connecting in feature dimensions, and then calculating by Conv _1, Conv _2 and Conv _3 to obtain a global information fusion feature map FF_0
(2) The attention feature maps are calculated.
F_R and F_V are separately input into the regional attention module RAB to obtain the attention feature maps (M_R, M_V) of size 256 × 256 × 128. Note that the same RAB module is used for the two separate calculations. The structure of the regional attention module RAB is shown in Fig. 4. It comprises max pooling, global average pooling, a fully connected layer, an activation layer, an upsampling operation and a normalization operation. In Fig. 4, the multiplication symbol represents multiplying the weights with the feature map and the addition symbol represents feature map addition. To extract feature map weights at multiple scales, the module uses three max-pooling kernel sizes.
The specific calculation steps are as follows. A feature map F is input and max pooling is performed to obtain feature maps F_s of size (H/s) × (W/s) × 128, where H and W are the image height and width (both 256 in this example), s = 1, 2, 4, and the corresponding pooling kernel sizes are 1 × 1, 2 × 2 and 4 × 4. A global average pooling operation is then applied to F_s to obtain a vector of dimension 1 × 1 × 128, which is passed through a fully connected layer and an activation layer to finally obtain a weight vector ω_s of dimension 1 × 1 × 128. The weight ω_s^k of the k-th feature dimension measures the importance of the k-th feature layer F_s^k. On the other hand, to obtain a feature map of the same size as F, an upsampling operation is applied to F_s, and ω_s is then multiplied with the upsampled feature map dimension by dimension to obtain the weighted feature map
F'_s^k = ω_s^k · H_up(F_s^k),
where k denotes the k-th feature dimension and H_up(·) denotes the upsampling function. Finally, the feature maps of the three scales are added and normalized to obtain an attention feature map of dimensions H × W × 128:
M = σ(F'_1 + F'_2 + F'_4),
where σ(·) denotes the normalization operation. (A code sketch of this module and of the two fusion paths is given after step (4) below.)
(3) (M_R, M_V) are concatenated along the feature dimension and input into the convolutional layer Conv_Att to obtain the fused attention feature map M_RV of size H × W × 128.
(4) The final fused feature map is computed as F_F = F_F_0 + M_RV, i.e., the pixel values at corresponding positions are added.
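As announced in step (2), the following PyTorch sketch puts the regional attention module and the two fusion paths together. It follows the description above, but several details are assumptions: the channel widths of Conv_1 to Conv_3 and Conv_Att (the exact layer table is available only as an image in the original), the use of one fully connected layer with ReLU per scale, and the choice of a sigmoid for the normalization σ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RAB(nn.Module):
    """Regional attention block: multi-scale max pooling -> per-channel weights
    -> weighted, upsampled maps summed over scales and normalized."""
    def __init__(self, channels=128, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # one fully connected layer + activation per scale, producing a 1x1xC weight vector
        self.fc = nn.ModuleList(
            [nn.Sequential(nn.Linear(channels, channels), nn.ReLU()) for _ in scales]
        )

    def forward(self, feat):                           # feat: B x 128 x H x W
        b, c, h, w = feat.shape
        out = 0
        for fc, s in zip(self.fc, self.scales):
            fs = F.max_pool2d(feat, kernel_size=s)     # B x C x H/s x W/s
            wgt = fc(F.adaptive_avg_pool2d(fs, 1).flatten(1))     # global avg pool -> FC -> B x C
            up = F.interpolate(fs, size=(h, w), mode='nearest')   # back to H x W
            out = out + wgt.view(b, c, 1, 1) * up      # weight each channel, add over scales
        return torch.sigmoid(out)                      # normalization sigma (assumed sigmoid)

class AttentionFusionNet(nn.Module):
    """Global path (Conv_1..Conv_3 on the concatenated features) plus
    attention path (shared RAB + Conv_Att), summed element-wise."""
    def __init__(self, channels=128):
        super().__init__()
        self.global_path = nn.Sequential(              # Conv_1, Conv_2, Conv_3 (widths assumed)
            nn.Conv2d(2 * channels, channels, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, 1, 1), nn.ReLU(),
        )
        self.rab = RAB(channels)                       # same module applied to both inputs
        self.conv_att = nn.Conv2d(2 * channels, channels, 3, 1, 1)   # Conv_Att

    def forward(self, f_r, f_v):                       # F_R, F_V: B x 128 x H x W
        f_f0 = self.global_path(torch.cat([f_r, f_v], dim=1))        # global fusion map F_F_0
        m_rv = self.conv_att(torch.cat([self.rab(f_r), self.rab(f_v)], dim=1))
        return f_f0 + m_rv                             # F_F = F_F_0 + M_RV

In step 3.3, the two 128-channel encoder outputs F_R and F_V would be passed to AttentionFusionNet to obtain F_F, which is then decoded in step 3.4.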
Step 3.4: f is to beFDecoding by an input Decoder to obtain a fused image O of a brightness channelFY
Step 3.5: and calculating a loss value according to the loss function L, optimizing the loss gradient by using an Adam optimizer and the like, reversely propagating, and updating the model parameters of the fusion network.
The loss function L comprises three parts, the structure loss LssimCharacteristic loss LpixelAnd gradient loss LgradientThe calculation formula is as follows:
L=ωLssim+λLpixel+Lgradient
wherein, omega and lambda are hyper-parameters used for balancing various losses.
The structural loss L_ssim is calculated as:
L_ssim = δ(1 − SSIM(I_RY, O_FY)) + (1 − δ)(1 − SSIM(I_VY, O_FY))
where δ is a hyper-parameter used to balance the two loss terms.
The feature loss L_pixel is calculated as follows:
[Equation provided as an image in the original publication.]
where η is a hyper-parameter, the feature map size is H × W × C, and ‖·‖₂ denotes the Euclidean distance between feature maps.
The gradient loss L_gradient is calculated as follows:
[Equation provided as an image in the original publication.]
where the gradient operator in the formula denotes the Sobel gradient calculation, used to measure the fine-grained texture information of the image.
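The sketch below assembles the step 3.5 loss in PyTorch. The structural term follows the formula above and the Sobel gradients use fixed 3 × 3 kernels; because the exact L_pixel and L_gradient formulas are given only as images in the original, the forms used here (Euclidean distances between fused and source feature maps, and between fused and maximal source gradients) are assumptions, as are the default hyper-parameter values.

import torch
import torch.nn.functional as F
from pytorch_msssim import ssim   # assumed third-party differentiable SSIM

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel(img):
    """Sobel gradient magnitude of a B x 1 x H x W image (fine-grained texture)."""
    kx = SOBEL_X.to(img.device, img.dtype)
    ky = SOBEL_Y.to(img.device, img.dtype)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

def fusion_loss(o_fy, i_ry, i_vy, f_f, f_r, f_v,
                omega=1.0, lam=1.0, delta=0.5, eta=1.0):
    """L = omega*L_ssim + lambda*L_pixel + L_gradient (step 3.5).
    Default hyper-parameter values are placeholders, not the patent's settings."""
    _, c, h, w = f_f.shape
    l_ssim = delta * (1 - ssim(i_ry, o_fy, data_range=1.0)) \
             + (1 - delta) * (1 - ssim(i_vy, o_fy, data_range=1.0))
    # feature loss: distance between the fused features and both source features (assumed form)
    l_pixel = (torch.norm(f_f - f_r) + eta * torch.norm(f_f - f_v)) / (h * w * c)
    # gradient loss: fused gradients should track the stronger source gradient (assumed form)
    l_gradient = torch.norm(sobel(o_fy) - torch.max(sobel(i_ry), sobel(i_vy))) / (h * w)
    return omega * l_ssim + lam * l_pixel + l_gradient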
Step 3.6: and (5) repeating the steps 3.1 to 3.5 until the iteration times reach a set threshold value on the whole fusion image training set, thereby obtaining a trained fusion network. In this example, a GPU based on a block of NVIDIA TITAN V was trained, using an Adam optimizer, with a batch size and epoch of 4 and 2, respectively. The initial learning rate is set to 1 × 10-4The hyperparameters ω, λ, δ, η of the loss function are set to 1, 2.7, 0.5, respectively.
And 4, step 4: and inputting test data to obtain a fusion image.
Step 4.1: obtaining a fused image O of a brightness channel by the method of the steps 3.1 to 3.4 according to the test data or the infrared and visible light image pair to be fusedFY
And 4.2: is prepared from OFYAnd connecting with a CbCr channel of the visible image in a characteristic dimension to obtain an image in a YCbCr format, and converting the image into an RGB format to obtain a fused image.
Three sets of fused images were selected from the test, as shown in Fig. 5. As can be seen from the figure, the fused image incorporates the texture details of the visible light image, as marked by the dashed boxes, and the overall brightness of the image is improved to a certain extent. Meanwhile, the salient regions of the infrared image are also well preserved in the fused image, as marked by the solid boxes.

Claims (7)

1. A method for fusing infrared and visible images with regional attention, comprising:
step 1, training an auto-encoder, wherein the auto-encoder comprises an encoder and a decoder;
step 1.1: reading an image I in the training set in RGB format, resizing the image, and converting it into the YCbCr color space;
step 1.2: inputting the luminance channel I_Y of the image into the encoder to obtain a high-dimensional feature map F;
step 1.3: inputting the high-dimensional feature map F into the decoder and outputting a luminance channel map O_Y;
step 1.4: calculating the loss between I_Y and O_Y according to the loss function, then optimizing the gradient, back-propagating, and updating the model parameters of the auto-encoder;
step 1.5: repeating steps 1.1 to 1.4 until the number of iterations over the whole training set reaches a set threshold, obtaining a trained auto-encoder;
step 2: making a fused image training set
acquiring infrared and visible light image pairs for training and performing sub-image cropping to expand the data set, wherein the cropping size is consistent with the image size adjusted in step 1, obtaining the fused image training set;
step 3: training the fusion network
step 3.1: converting the infrared and visible light image pairs (I_R, I_V) in the fused image training set into the YCbCr color space and extracting their respective luminance channel maps to obtain (I_RY, I_VY);
step 3.2: inputting I_RY and I_VY separately into the encoder trained in step 1 and computing the feature maps (F_R, F_V);
step 3.3: concatenating (F_R, F_V) along the feature dimension, inputting them into the fusion network, and computing the fused feature map F_F;
step 3.4: inputting F_F into the decoder for decoding to obtain the fused luminance-channel image O_FY;
step 3.5: calculating the loss value according to the loss function, then optimizing the gradient, back-propagating, and updating the model parameters of the fusion network;
step 3.6: repeating steps 3.1 to 3.5 until the number of iterations over the whole fused image training set reaches a set value, obtaining a trained fusion network;
step 4, obtaining the fused image
step 4.1: obtaining the fused luminance-channel image O_FY from the infrared and visible light image pair to be fused according to the method of steps 3.1 to 3.4;
step 4.2: concatenating O_FY with the CbCr channels of the visible light image along the feature dimension to obtain an image in YCbCr format, and then converting it into RGB format to obtain the fused image.
2. The method of claim 1, wherein the encoder has four convolutional layers with dense connections; the decoder uses four directly connected convolutional layers; and in the encoder and the decoder, the convolution kernel size is 3 × 3, the stride is 1, the padding is 1, and the ReLU activation function is used.
3. The method for fusing infrared and visible light images with regional attention according to claim 2, wherein in step 1.2 the input size is 256 × 256 × 1 and the size of the obtained high-dimensional feature map F is 256 × 256 × 128, and in step 1.3 the luminance channel map O_Y has size 256 × 256 × 1.
4. The method for fusing infrared and visible light images with regional attention according to claim 1, wherein after step 1.5, the training data is replaced with test data and steps 1.1 to 1.3 are performed to obtain O_Y; O_Y is then concatenated with the CbCr channels from step 1.1 along the feature dimension to obtain an image in YCbCr format, which is converted into RGB format to obtain an output image O; and whether O is consistent with I is verified subjectively.
5. The method for fusing infrared and visible light images with regional attention according to claim 1, wherein the calculation steps of step 3.3 are as follows:
(1) concatenating (F_R, F_V) along the feature dimension and computing through the convolutional layers Conv_1, Conv_2 and Conv_3 to obtain the global information fusion feature map F_F_0;
(2) inputting F_R and F_V separately into the same regional attention module RAB to compute the attention feature maps (M_R, M_V); concatenating (M_R, M_V) along the feature dimension and inputting them into the convolutional layer Conv_Att to obtain the fused attention feature map M_RV;
(3) computing the fused feature map F_F = F_F_0 + M_RV, i.e., adding the pixel values at corresponding positions.
6. The method of claim 1, wherein step 1.4 and step 3.5 both use the Adam optimizer to optimize the gradient, and in step 3.5 the model parameters of the auto-encoder are fixed and only the model parameters of the fusion network are updated.
7. The method for fusing infrared and visible light images with regional attention according to claim 1, wherein in step 2, images containing scenes with insufficient illumination and salient targets are selected from the public data set TNO to form a training set and a test set, and the training set is expanded offline by sub-image cropping of the original infrared and visible light images, wherein the sub-image size is 256 × 256 and the cropping stride is 16.
CN202210434625.3A 2022-04-24 2022-04-24 Infrared and visible light image fusion method with regional attention Active CN114782298B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210434625.3A CN114782298B (en) 2022-04-24 2022-04-24 Infrared and visible light image fusion method with regional attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210434625.3A CN114782298B (en) 2022-04-24 2022-04-24 Infrared and visible light image fusion method with regional attention

Publications (2)

Publication Number Publication Date
CN114782298A (en) 2022-07-22
CN114782298B (en) 2024-03-12

Family

ID=82433252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210434625.3A Active CN114782298B (en) 2022-04-24 2022-04-24 Infrared and visible light image fusion method with regional attention

Country Status (1)

Country Link
CN (1) CN114782298B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161201A (en) * 2019-12-06 2020-05-15 北京理工大学 Infrared and visible light image fusion method based on detail enhancement channel attention
US20220044374A1 (en) * 2019-12-17 2022-02-10 Dalian University Of Technology Infrared and visible light fusion method
CN111709902A (en) * 2020-05-21 2020-09-25 江南大学 Infrared and visible light image fusion method based on self-attention mechanism
CN111797779A (en) * 2020-07-08 2020-10-20 兰州交通大学 Remote sensing image semantic segmentation method based on regional attention multi-scale feature fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
何勇: "Periocular gender attribute recognition based on attention mechanism" (基于注意力机制的眼周性别属性识别), Enterprise Science and Technology & Development (企业科技与发展), no. 06, 10 June 2020 (2020-06-10) *
陈潮起; 孟祥超; 邵枫; 符冉迪: "An infrared and visible light image fusion method based on multi-scale low-rank decomposition" (一种基于多尺度低秩分解的红外与可见光图像融合方法), Acta Optica Sinica (光学学报), no. 11, 10 June 2020 (2020-06-10) *
陈艳菲; 桑农; 王洪伟; 但志平: "Visible and infrared image fusion algorithm based on visual attention" (基于视觉注意的可见光与红外图像融合算法), Journal of Huazhong University of Science and Technology (Natural Science Edition) (华中科技大学学报(自然科学版)), no. 1, 10 January 2014 (2014-01-10) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115311186A (en) * 2022-10-09 2022-11-08 济南和普威视光电技术有限公司 Cross-scale attention confrontation fusion method for infrared and visible light images and terminal
CN115423734A (en) * 2022-11-02 2022-12-02 国网浙江省电力有限公司金华供电公司 Infrared and visible light image fusion method based on multi-scale attention mechanism
CN116363036A (en) * 2023-05-12 2023-06-30 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on visual enhancement
CN116363036B (en) * 2023-05-12 2023-10-10 齐鲁工业大学(山东省科学院) Infrared and visible light image fusion method based on visual enhancement

Also Published As

Publication number Publication date
CN114782298B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN112949565B (en) Single-sample partially-shielded face recognition method and system based on attention mechanism
CN110909690B (en) Method for detecting occluded face image based on region generation
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN113673590B (en) Rain removing method, system and medium based on multi-scale hourglass dense connection network
CN114782298A (en) Infrared and visible light image fusion method with regional attention
CN110263705A (en) Towards two phase of remote sensing technology field high-resolution remote sensing image change detecting method
CN116071243B (en) Infrared image super-resolution reconstruction method based on edge enhancement
CN111222396A (en) All-weather multispectral pedestrian detection method
CN109034184B (en) Grading ring detection and identification method based on deep learning
CN114066831B (en) Remote sensing image mosaic quality non-reference evaluation method based on two-stage training
CN112686207A (en) Urban street scene target detection method based on regional information enhancement
CN111931857B (en) MSCFF-based low-illumination target detection method
CN113610905B (en) Deep learning remote sensing image registration method based on sub-image matching and application
CN114140672A (en) Target detection network system and method applied to multi-sensor data fusion in rainy and snowy weather scene
CN115841438A (en) Infrared image and visible light image fusion method based on improved GAN network
CN116486074A (en) Medical image segmentation method based on local and global context information coding
CN113095358A (en) Image fusion method and system
CN116645569A (en) Infrared image colorization method and system based on generation countermeasure network
CN114913337A (en) Camouflage target frame detection method based on ternary cascade perception
CN117576402B (en) Deep learning-based multi-scale aggregation transducer remote sensing image semantic segmentation method
CN114155165A (en) Image defogging method based on semi-supervision
CN116977747B (en) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN114331931A (en) High dynamic range multi-exposure image fusion model and method based on attention mechanism
CN111832508B (en) DIE _ GA-based low-illumination target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant