CN117292244A - Infrared and visible light image fusion method based on multilayer convolution - Google Patents

Infrared and visible light image fusion method based on multilayer convolution

Info

Publication number
CN117292244A
CN117292244A (application CN202311352355.2A)
Authority
CN
China
Prior art keywords
convolution
layer
fusion
image
infrared
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311352355.2A
Other languages
Chinese (zh)
Inventor
陈海秀
房威志
陆康
黄仔洁
陈嘉越
褚羽婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202311352355.2A priority Critical patent/CN117292244A/en
Publication of CN117292244A publication Critical patent/CN117292244A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/86Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an infrared and visible light image fusion method based on multilayer convolution. The network structure comprises an encoding network, a decoding network and a multilayer convolution fusion network; the encoder is formed by mutually nesting multilayer convolution blocks and an ECA attention mechanism; the decoder is mainly composed of decoding blocks, each consisting of two convolution layers; the multilayer convolution fusion network mainly comprises a gradient convolution block, a downsampling convolution block, a convolution spatial channel attention mechanism and a plurality of convolution layers. The method comprises the following steps: S1, the registered infrared and visible light source images are sent into the encoder in pairs, and the encoder extracts source image features; S2, the multilayer convolution fusion network fuses the source image features to obtain fused features; S3, the decoder reconstructs the fused features and outputs an image. The fused image has a prominent target, clear details and an obvious outline, shows a marked improvement in evaluation indexes, and accords with human visual perception.

Description

Infrared and visible light image fusion method based on multilayer convolution
Technical Field
The invention relates to the field of image processing, in particular to an infrared and visible light image fusion method based on multilayer convolution.
Background
The infrared and visible light images reflect different characteristics of the target scene under different photographing instruments. The infrared image has strong penetrability, is not influenced by illumination intensity, but lacks texture information. The visible light image contains abundant structural information, has a good visual effect, is easily influenced by factors such as weather, illumination conditions and the like, and has poor anti-interference capability. Therefore, the infrared and visible light images are fused, the complementarity of the information is fully utilized, and the method has wide application value in various fields.
Existing image fusion methods mainly comprise traditional methods and deep learning methods. A typical representative of the traditional methods is the multi-scale transform based approach: multi-scale transformation is applied to the source images to obtain their multi-scale features, the multi-scale features of different images are then fused according to a specific rule, and finally the fused image is reconstructed by the inverse multi-scale transformation. Such methods can capture features at different scales and understand the image more comprehensively. However, it is difficult to select and design proper transformation rules, and the fusion rules are complex, requiring the derivation and calculation of mathematical formulas.
Infrared and visible light image fusion methods based on deep learning fall into three categories: methods based on convolutional neural networks, methods based on generative adversarial networks, and methods based on auto-encoders. Auto-encoder based methods encode, fuse, decode and reconstruct the source images to generate the fused image. In 2018, Li et al. (Li H, Wu X J. DenseFuse: A Fusion Approach to Infrared and Visible Images [J]. IEEE Trans. Image Processing, 2019, 28(5)) introduced dense connections to extract image depth features in the encoder network and proposed DenseFuse for infrared and visible light image fusion. In 2020, Li et al. (Li H, Wu X J, Durrani T. NestFuse: An Infrared and Visible Image Fusion Architecture Based on Nest Connection and Spatial/Channel Attention Models [J]. IEEE Transactions on Instrumentation and Measurement, 2020, PP(99)) devised the NestFuse network, whose fusion strategy based on spatial and channel attention can fuse multi-scale features. In 2021, Li et al. (Li H, Wu X J, Kittler J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images [J]. Information Fusion, 2021, 73: 72-86) proposed the end-to-end RFN-Nest network on the basis of NestFuse, and many scholars have subsequently proposed further auto-encoder based infrared and visible image fusion methods. However, current auto-encoder based fusion methods still have the following disadvantages:
1) The target of the fused image is not prominent, and details are lost;
2) The visual perception of the fused image is poor;
3) Under a complex background, the texture detail information of the fused image is easily lost.
Disclosure of Invention
Purpose of the invention: the invention aims to provide an infrared and visible light image fusion method based on multilayer convolution, which can better retain the thermal radiation information in the infrared image and the texture details in the visible light image under a complex background and improve the visual effect.
The technical scheme is as follows: the invention relates to an infrared and visible light image fusion method based on multilayer convolution. The adopted network structure comprises an encoding network, a decoding network and a multilayer convolution fusion network; the encoder is formed by mutually nesting multilayer convolution blocks and an ECA attention mechanism; the decoder is mainly composed of decoding blocks, each consisting of two convolution layers; the multilayer convolution fusion network mainly comprises a gradient convolution block, a downsampling convolution block, a convolution spatial channel attention mechanism and a plurality of convolution layers. The method comprises the following steps:
s1, the registered infrared source images and visible light source images are sent into an encoder in pairs, and source image features are extracted by the encoder;
s2, fusing the source image features by a multi-layer convolution fusion network to obtain fused features;
s3, reconstructing the fused features by a decoder, and outputting an image.
Further, the encoder performs feature extraction on the infrared source image and the visible light source image at four scales;
C, k, W and H in the ECA attention mechanism represent the channel dimension, the convolution kernel size, and the width and height of the feature map, respectively; the convolution kernel size is determined by:
k = |log_2(C)/γ + b/γ|_odd
where |·|_odd indicates that k can only take an odd value, and b and γ are used to adjust the ratio between the number of channels and the convolution kernel size.
Further, in the multilayer convolution fusion network, the downsampling convolution block is formed by interleaving a max pooling layer, a 3×3 convolution layer and a convolution layer with an activation function; after the input image passes through the max pooling layer, the feature information is processed twice, by a convolution layer with an activation function and by a 3×3 convolution layer;
a convolution block consisting of a 3×3 convolution layer and a 3×3 convolution layer with an LReLU activation function is adopted to directly extract features from the source image information; the features extracted from the source image information are then integrated with the feature information extracted by the gradient convolution block and the downsampling convolution block.
Furthermore, the gradient convolution block is mainly formed by combining a convolution layer with an LReLU activation function, 3×3 convolution layers, a 1×1 convolution layer and a gradient operator; the main body is densely connected, and feature extraction is carried out using blocks spliced from two 3×3 convolution layers and 3×3 convolution layers with an LReLU activation function; the residual stream adopts a gradient operation to calculate the gradient magnitude of the features, and a 1×1 convolution layer is used to eliminate the channel dimension difference; finally, the deep features extracted by the main dense stream are integrated with the fine-grained detail information acquired by the residual gradient stream.
Further, in the decoder, each decoding block is composed of two 3×3 convolutional layers; a short connection is used for the connection in each row.
Further, an auto-encoder loss function L_auto is adopted to train the auto-encoder network; the auto-encoder loss function L_auto is defined as follows:
L_auto = L_pixel + 100·L_ssim
L_ssim = 1 − SSIM(Output, Input)
where L_pixel represents the pixel loss between the input and output images, computed with the Frobenius norm ‖·‖_F of their difference; L_ssim represents the structural similarity loss between the input image and the output image; SSIM(·) is a structural similarity measure that quantifies the structural similarity of two images.
Further, a fusion strategy loss function L_MCFN is adopted to train the multilayer convolution fusion network; the fusion strategy loss function L_MCFN is defined as follows:
L_MCFN = α·L_detail + L_feature
L_detail = 1 − SSIM(O, I_vi)
where L_detail and L_feature represent the background detail retention loss function and the target feature enhancement loss function, respectively; α is a trade-off parameter; M is the number of fusion networks; w_1 is a trade-off parameter vector used to balance the loss magnitudes at different scales; w_vi controls the relative influence of the infrared features in the fused feature map, and w_ir controls the relative influence of the visible light features.
Compared with the prior art, the invention has the following remarkable effects:
1. The invention introduces an ECA attention mechanism into the encoder, designs the CSCA, GCB and DSCB fusion blocks, and constructs the MCFN fusion network on this basis, which to a certain extent solves the problems of unobtrusive fusion targets, loss of texture detail information under complex backgrounds, and poor visual perception;
2. The fusion network designed by the invention can better retain the thermal radiation information in the infrared image and the texture details in the visible light image under a complex background. Experimental results comparing the invention with 5 existing fusion algorithms, in both subjective and objective terms on two public data sets, show that: objectively, the evaluation indexes of the fused images are obviously improved; subjectively, the invention shows certain superiority in highlighting target information, retaining texture detail information under a complex background, and improving the visual effect.
Drawings
FIG. 1 is a schematic diagram of the overall network architecture of the present invention;
FIG. 2 is a detailed structure diagram of the codec;
FIG. 3 is a diagram of the mechanism of attention of the ECA;
fig. 4 is a structural diagram of an MCFN;
fig. 5 is a structural view of CB;
fig. 6 is a structural diagram of CSCA;
FIG. 7 is a graph of results for different alpha values;
FIG. 8 is a graph of fused image results for a helicopter under different algorithms;
FIG. 9 is a graph of fusion results for soldiers under different algorithms;
FIG. 10 is a graph of the fusion results of roads under different algorithms;
FIG. 11 is a graph of the fusion results of a tent and a person under different algorithms;
FIG. 12 is a graph of the fusion of streets under different algorithms.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
The invention provides an infrared and visible light image fusion algorithm based on multilayer convolution. In the encoder stage, an efficient channel attention mechanism (Efficient Channel Attention, ECA) is introduced to improve the quality of the fused image. Gradient convolution blocks (Gradient Convolution Block, GCB), downsampling convolution blocks (DownSampling Convolution Block, DSCB) and convolution spatial channel attention mechanisms (Convolution Spatial Channel Attention, CSCA) are designed in a multi-layer convolution fusion network (hereinafter simply referred to as "fusion network") MCFN (Multilayer Convolutional Fusion Network), which can better preserve texture detail information of images in complex backgrounds and highlight infrared targets. Finally, the decoder decodes the reconstructed output.
Network architecture design
As shown in FIG. 1, the overall network structure of the present invention is composed of an encoding network, a decoding network and a fusion network. First, the registered infrared and visible light images are sent into the encoder in pairs and the encoder extracts the features of the source images; then the fusion network MCFN fuses the source image features; finally the decoder reconstructs the fused features and outputs the fused image.
The invention adopts a two-stage training method: in the first stage the encoder, the fusion network and the decoder are trained as a whole, and in the second stage the parameter weights from the first stage are used directly while the fusion network MCFN is trained separately. The fusion network MCFN is designed to better retain the detail information and background information of the fused image and to improve its visual effect. Ablation experiments show that the designed MCFN is important for improving the objective evaluation indexes and the visual effect of the fused image.
The detailed structure of the encoder and decoder is shown in FIG. 2; the left side is the encoder, where I_vis and I_ir denote the input visible light source image and infrared source image. The encoder is formed by nesting multilayer convolution blocks with ECA attention mechanisms. In the figure, annotations of the form "(in, out)" on the convolution layers indicate the input and output channel counts; for example, (16, 8) indicates 16 input channels and 8 output channels. The encoder extracts features of the infrared source image and the visible light source image at four scales, and the extracted feature information is fused by the fusion network.
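As a reading aid for the "(in, out)" channel annotations, a minimal PyTorch sketch of one encoder convolution stage is given below. The class name, the activation function and the way the attention module is attached are assumptions, since FIG. 2 is not reproduced here; the ECA module itself is sketched after the FIG. 3 discussion.

```python
import torch.nn as nn

class EncoderConvBlock(nn.Module):
    """One encoder stage (sketch): a 3x3 convolution annotated '(in, out)' in FIG. 2,
    followed by the nested attention module. nn.Identity() stands in for ECA here."""
    def __init__(self, in_ch, out_ch, attention=None):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)          # activation choice is an assumption
        self.attention = attention if attention is not None else nn.Identity()

    def forward(self, x):
        return self.attention(self.act(self.conv(x)))

# the '(16, 8)' annotation in FIG. 2 would correspond to:
block = EncoderConvBlock(16, 8)
```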
The decoder, shown on the right side of FIG. 2, is mainly composed of Decoding Blocks (DB), each consisting of two 3×3 convolutional layers. In each row, the blocks are connected by short connections similar to a dense-block architecture. In addition, the decoder adopts cross-layer connections to retain more multi-scale deep features and detail information from the source images; the output of the network is the fused image reconstructed from the multi-scale features.
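A minimal sketch of one decoding block DB, assuming the short and cross-layer connections deliver feature maps that are concatenated along the channel dimension; the activation function is an assumption, since the text only specifies the two 3×3 convolutions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Decoding block DB (sketch): two 3x3 convolutions applied to the concatenation
    of the feature maps arriving over the short/cross-layer connections."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, *features):
        return self.body(torch.cat(features, dim=1))  # fuse the incoming connections, then decode
```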
A lightweight ECA attention mechanism is introduced into the encoder; ablation experiments show that this attention mechanism has a positive effect on improving the fused image indexes.
The structure of the ECA attention mechanism is shown in FIG. 3, where C, k, W and H represent the channel dimension, the convolution kernel size, and the width and height of the feature map, respectively. The convolution kernel size is determined by:
k = |log_2(C)/γ + b/γ|_odd
where |·|_odd indicates that k can only take an odd value; b and γ are set to 2 and 1, respectively, and adjust the ratio between the number of channels and the convolution kernel size.
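A sketch of the ECA attention mechanism consistent with the formula above: global average pooling, then a 1-D convolution whose kernel size k is adapted to the channel count C, then a sigmoid gate that re-weights the channels. Only the formula and the b, γ values come from the text; the rest follows the standard ECA design and should be read as an assumption.

```python
import math
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention (sketch): channel re-weighting with an adaptive 1-D kernel."""
    def __init__(self, channels, gamma=1, b=2):   # b = 2, gamma = 1 as stated in the text
        super().__init__()
        # k = |log2(C)/gamma + b/gamma|_odd  -> nearest odd value
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        n, c, _, _ = x.shape
        y = self.pool(x).view(n, 1, c)                 # (N, 1, C): one descriptor per channel
        y = self.gate(self.conv(y)).view(n, c, 1, 1)   # cross-channel interaction over k neighbours
        return x * y                                   # channel-wise re-weighting
```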
The structure of the fusion network MCFN is shown in FIG. 4. The MCFN mainly comprises the gradient convolution block GCB, the downsampling convolution block DSCB, the convolution spatial channel attention mechanism CSCA and several convolution layers. The downsampling convolution block DSCB is formed by interleaving a max pooling layer, a 3×3 convolution layer and a convolution layer with an activation function. This design reduces the heavy computation brought by the gradient convolution block GCB, while the max pooling layer better retains detailed texture information. After the max pooling layer, the feature information is processed twice, by a 3×3 convolution and by a convolution layer with an activation function, which enhances the detail features again and retains the texture information.
Outside the main body of the fusion network MCFN, a convolution block consisting of a 3×3 convolution layer and a 3×3 convolution layer with an LReLU activation function is used to directly extract features from the source image information, and the features extracted from the source images are integrated with the feature information extracted by the gradient convolution block GCB and the downsampling convolution block DSCB. This operation retains more source information and enriches the information content of the fused image. A sketch of these two blocks follows.
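Minimal sketches of the downsampling convolution block DSCB and of the plain convolution block used outside the MCFN body, following the layer lists given above; the exact ordering of the layers and the LReLU slope are assumptions.

```python
import torch.nn as nn

class DSCB(nn.Module):
    """DownSampling Convolution Block (sketch): max pooling, then a convolution with an
    activation function and a 3x3 convolution, per the description above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.MaxPool2d(kernel_size=2),                        # halves spatial size, keeps salient detail
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.body(x)

class CB(nn.Module):
    """Plain convolution block (sketch): a 3x3 convolution followed by a 3x3 convolution
    with an LReLU activation, used to extract features directly from the source images."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.body(x)
```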
FIG. 5 shows the detailed structure of the gradient convolution block GCB, which is mainly formed by combining a convolution layer with an LReLU activation function, 3×3 convolution layers, a 1×1 convolution layer and a gradient operator. LReLU is an unsaturated activation function; its use alleviates the vanishing-gradient problem and also accelerates convergence and improves computational efficiency. This operation is applied to the feature information first to extract shallow information from the feature map. The main body of the GCB adopts dense connections and performs feature extraction using blocks spliced from two 3×3 convolution layers and 3×3 convolution layers with an LReLU activation function. The dense connections introduced into the main body make full use of the features extracted by the various convolution layers. In addition, the residual stream uses a gradient operation to calculate the gradient magnitude of the features and a 1×1 convolution layer to eliminate the channel dimension difference. Finally, the deep features extracted by the main dense stream are integrated with the fine-grained detail information acquired by the residual gradient stream.
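A sketch of the GCB, assuming a Sobel filter for the unnamed gradient operator and a two-step dense main stream; FIG. 5 is not reproduced, so the exact number and wiring of the dense layers are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientOperator(nn.Module):
    """Per-channel gradient magnitude (a Sobel kernel is assumed; the text only says 'gradient operator')."""
    def __init__(self, channels):
        super().__init__()
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", kx.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.register_buffer("ky", kx.t().reshape(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.channels = channels

    def forward(self, x):
        gx = F.conv2d(x, self.kx, padding=1, groups=self.channels)
        gy = F.conv2d(x, self.ky, padding=1, groups=self.channels)
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

class GCB(nn.Module):
    """Gradient Convolution Block (sketch): shallow conv + LReLU, a densely connected main
    stream of 3x3 convolutions, and a residual gradient stream closed by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.shallow = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                     nn.LeakyReLU(0.2, inplace=True))
        self.dense1 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                    nn.LeakyReLU(0.2, inplace=True))
        self.dense2 = nn.Conv2d(2 * out_ch, out_ch, 3, padding=1)   # sees the concatenated earlier outputs
        self.grad = GradientOperator(out_ch)
        self.proj = nn.Conv2d(out_ch, out_ch, 1)                    # 1x1 conv removes the channel mismatch

    def forward(self, x):
        s = self.shallow(x)                                   # shallow feature information
        d = self.dense2(torch.cat([s, self.dense1(s)], dim=1))
        g = self.proj(self.grad(s))                           # fine-grained gradient detail
        return d + g                                          # integrate dense stream and gradient stream
```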
FIG. 6 shows the detailed structure of the convolution spatial channel attention mechanism CSCA. The attention mechanisms contained in this structure reduce attention to irrelevant information in both the channel and spatial dimensions, focus on high-value information, alleviate the information-overload problem, and improve the efficiency and accuracy of the task. Applying both within the convolution block makes it possible to better acquire the detail information of the target regions that need attention and to retain more detail information. The ablation experiments show that this module is important for improving the quality of the fused image.
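The text does not give the internal layout of the CSCA block, so the sketch below assumes a CBAM-style design: a 3×3 convolution whose output is re-weighted first by a channel-attention branch and then by a spatial-attention branch. Names and the reduction ratio are hypothetical.

```python
import torch
import torch.nn as nn

class CSCA(nn.Module):
    """Convolution Spatial-Channel Attention (sketch): convolution + channel gate + spatial gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.channel_gate = nn.Sequential(              # squeeze to 1x1, excite back to C channels
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(              # 2-channel (mean, max) map -> 1-channel mask
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.conv(x)
        f = f * self.channel_gate(f)                                        # suppress low-value channels
        stats = torch.cat([f.mean(dim=1, keepdim=True),
                           f.max(dim=1, keepdim=True).values], dim=1)
        return f * self.spatial_gate(stats)                                 # focus on salient regions
```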
The loss functions of this embodiment are classified into an automatic encoder loss function and a fusion strategy loss function.
(A) Automatic encoder loss function
This embodiment uses the following loss function to train the auto-encoder network, with L_auto defined as:
L_auto = L_pixel + 100·L_ssim (1)
where L_pixel represents the pixel loss between the input and output images and L_ssim represents the structural similarity loss between the input image and the output image.
The pixel loss L_pixel is given by equation (2) as the Frobenius norm ‖·‖_F of the difference between the output and input images; it constrains the similarity of the output image to the input image at the pixel level.
The loss L_ssim is calculated from equation (3):
L_ssim = 1 − SSIM(Output, Input) (3)
where SSIM(·) is a structural similarity measure that quantifies the structural similarity of two images.
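A minimal sketch of the first-stage training loss, assuming images scaled to [0, 1], a squared-Frobenius-norm pixel term, and the `pytorch_msssim` package for a differentiable SSIM; all three choices are assumptions, since equation (2) is not reproduced in the text.

```python
import torch
from pytorch_msssim import ssim   # assumed dependency; any differentiable SSIM works

def auto_encoder_loss(output: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L_auto = L_pixel + 100 * L_ssim, per equations (1) and (3)."""
    l_pixel = torch.norm(output - target, p="fro") ** 2            # assumed squared Frobenius norm
    l_ssim = 1.0 - ssim(output, target, data_range=1.0)            # L_ssim = 1 - SSIM(Output, Input)
    return l_pixel + 100.0 * l_ssim
```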
(B) Fusion policy loss function
Training of the fusion network MCFN aims to realize a fully learnable fusion strategy. In the second stage, with the encoder and decoder fixed, the fusion network MCFN is trained with an appropriate loss function. To better train the fusion network MCFN, this embodiment uses the loss function L_MCFN, defined as follows:
L_MCFN = α·L_detail + L_feature (4)
where L_detail and L_feature represent the background detail retention loss function and the target feature enhancement loss function, respectively. α is a trade-off parameter, set to 700 by the parameter-setting experiment. Since most of the background detail information of the fused image comes from the visible light image, L_detail aims to preserve the detail information and structural features of the visible light image and is defined as:
L_detail = 1 − SSIM(O, I_vi) (5)
Since the infrared image contains more salient target features than the visible image, the loss function L_feature is designed to constrain the fused deep features so as to preserve the salient target features. L_feature is defined by equation (6), where M is the number of fusion networks (set to M = 4); w_1 is a trade-off parameter vector used to balance the loss magnitudes, set to {1, 10, 100, 1000} to balance the magnitude differences at different scales; w_vi and w_ir control the relative influence of the infrared and visible light features in the fused feature map and are set to 6.0 and 3.0, respectively.
(II) results of experiments and analysis
In this embodiment, after the experimental settings of the training phase and the testing phase are described, a parameter-setting experiment is performed on α in the loss function. Ablation experiments are then performed on the attention mechanism and the fusion network MCFN, fully demonstrating the effectiveness of the invention. Finally, the invention is compared with five other algorithms published in recent years, including: the end-to-end residual fusion network for infrared and visible images (RFN-Nest), the generative adversarial network with multi-classification constraints for infrared and visible image fusion (GANMcC), the fusion generative adversarial network (FusionGAN), deep image decomposition for infrared and visible image fusion (DIDFuse), and unsupervised misaligned infrared and visible image fusion based on cross-modality image generation and registration (UMF-CMGR). Eight indexes are used to objectively evaluate the quality of the fused images: information entropy (EN), spatial frequency (SF), average gradient (AG), standard deviation (SD), correlation coefficient (CC), sum of differential correlation (SCD), visual information fidelity (VIF) and peak signal-to-noise ratio (PSNR).
(21) Experimental setup
The TNO dataset (Toet A. The TNO Multiband Image Data Collection [J]. Data in Brief, 2017, 15: 249-251) contains rich military scenes, such as helicopters, houses, tanks, people, forests and vehicles, so it meets the dataset requirements of the infrared and visible light image fusion field very well and is the most authoritative dataset in this research field. 45000 pairs of infrared and visible light images obtained by expanding the dataset are used as the training set. The MSRS dataset, commonly used in the field of infrared and visible image fusion, contains 1444 pairs of high-quality registered infrared and visible images, including both daytime and nighttime image pairs. To make the test results more authoritative, 42 pairs of images were selected from the TNO dataset as the TNO test set, 10 pairs of images were selected from the MSRS dataset as the MSRS test set, and the results were averaged. The algorithm is verified in an experimental environment built on a Windows 10 system; training is carried out on an NVIDIA RTX 3080 GPU, the initial learning rate is set to 0.0001, and the batch size and epoch are both set to 4.
(22) Parameter setting experiment
The present example analyzes the experimental results of different alpha values in subjective as well as objective terms by experimental methods. Fig. 7 shows a graph of fusion results for different alpha values.
It can be intuitively seen from FIG. 7 that the α value has a direct effect on the experimental result; when α is too large or too small, the infrared information of the fused image is lost or even absent. From a comparison of the boxed detail regions in FIG. 7, it can be found that texture details are retained better when α is set to 600, 700 or 800. To further determine the α value, Table 1 shows the index results for different α values on the TNO test set.
TABLE 1 results for different alpha values
In table 1, the top three values are marked using bold fonts, and it can be clearly observed that when α=700, the number of top three values is 6, and when α=800, the number of top three values is 5. In the experiment, therefore, α was set to 700.
(23) Ablation experiments
In order to verify the effectiveness of the ECA attention mechanism introduced by the invention and of the designed modules, ablation experiments are carried out on the ECA attention mechanism and on the CSCA, GCB and DSCB modules, fully illustrating the effectiveness of the designed network. The results of the ablation experiments are shown in Table 2.
Table 2 ablation experimental results
In Table 2, the best result values are marked in bold. It can be seen that the CSCA block brings the most obvious improvement across the 8 indexes. The ECA attention mechanism plays an important role in improving EN, SF, AG and SD. The GCB block, although it has a relatively large negative effect on SD, has a non-negligible positive effect on the other metrics. In summary, the fusion blocks designed by the invention and the attention mechanism introduced are important for the quality of the fused image.
(24) Comparative experiments
To illustrate the effectiveness of the present invention, it is compared with the 5 published algorithms in both objective and subjective terms.
(241) Subjective evaluation
As shown in FIG. 8, the fusion of a helicopter image from the TNO dataset under different algorithms is selected. In this scene, the infrared image has no background information but has an obvious outline, the target is prominent, and the rotor characteristics are shown well. The background texture information of the visible light image is well preserved and the visibility of the whole image is good, but the detail information of the helicopter is severely lost.
FIG. 8 shows the fusion results of the 5 existing algorithms and of the present invention for this image. In the fusion result of the GANMcC algorithm, the background texture information is severely lost, the tail rotor is hardly visible, and the landing gear is not obvious. In the fusion result of the DIDFuse algorithm, the tail fin and main rotor information is severely lost; the background texture details are retained relatively well but deviate from the background information of the visible light image. The detail information in the fusion result of the FusionGAN algorithm is preserved better, but every part of the helicopter shows ghosting, so the whole image is rather blurred. The fusion result of the UMF-CMGR algorithm performs well in overall detail, but the landing gear is relatively unclear, background information is lost, and the overall image is dark. In the fusion result of the RFN-Nest algorithm, the main rotor information is almost completely lost and only the outline of the tail rotor can be observed; the landing gear carries more visible light information and less infrared information, giving a poor visual impression, and the background texture information is severely lost. The fusion result of the present invention retains background texture details well and almost coincides with the background information of the visible light image; the landing gear, tail rotor and main rotor details are clearly visible, and the whole image has no ghosting and a better visual effect. Therefore, from a subjective point of view, compared with the other algorithms, the invention has advantages in target information retention, background texture details and the overall visual effect of the image.
As shown in FIG. 9, the fusion of a soldier in a jungle from the TNO dataset under different algorithms is selected. The background texture detail information of the fused images under the GANMcC, FusionGAN and RFN-Nest algorithms is severely lost. The target information of the soldier in the fused images under the DIDFuse and UMF-CMGR algorithms is not well highlighted, so the outline of the person is not clear enough. In the fusion result of the present invention, the background texture information is better preserved and the outline of the person is clearer.
To better illustrate the universality of the present invention, a representative fusion result from the MSRS dataset is shown in FIG. 10; the present invention has certain advantages over the other 5 algorithms.
FIGS. 11 and 12 show two further results from the TNO dataset. The subjective comparison shows that the invention has a certain superiority in highlighting target information, detail retention and visual perception.
(242) Objective evaluation
In order to better verify the effectiveness of the invention, 5 representative algorithms and 8 objective evaluation indexes are selected for objective evaluation. For all 8 evaluation indexes, a larger value indicates better image quality. To ensure fairness and reliability, the averages over the 42 images of the TNO test set and the 10 images of the MSRS test set are compared respectively, which eliminates human subjective factors to a certain extent and makes the evaluation results more objective.
TABLE 3 evaluation of fusion Effect of TNO datasets
To enable a clearer analysis of the evaluation index data in Table 3, the first-ranked values are marked in bold. Among the 8 evaluation indexes used, the invention ranks first in 4 indexes and second in 3 indexes. The invention ranks above the mean for the different indexes, so in objective terms it has a certain superiority on the TNO dataset compared with the other algorithms. In particular, SF is improved by 25.9% and AG by 38.6% compared with the second-ranked method.
TABLE 4 evaluation of fusion Effect of MSRS datasets
In Table 4, the first-ranked values are marked in bold. It can be clearly seen that, among the 8 evaluation indexes, the invention performs best in 6 indexes on the MSRS dataset. Therefore, in objective terms, the invention has a certain superiority on the MSRS dataset compared with the other algorithms; in particular, SD is 7.4% higher than the second-ranked method. Overall, compared with the other algorithms, the indexes on both the TNO dataset and the MSRS dataset are obviously improved, so the invention performs well in objective terms.

Claims (7)

1. The infrared and visible light image fusion method based on the multilayer convolution is characterized in that a network structure comprises an encoding network, a decoding network and a multilayer convolution fusion network, and the encoder is formed by mutually nesting a multilayer convolution block and an ECA attention mechanism; the decoder is mainly composed of decoding blocks, and each decoding block is composed of two convolution layers; the multi-layer convolution fusion network mainly comprises a gradient convolution block, a downsampling convolution block, a convolution space channel attention mechanism and a plurality of convolution layers; the method comprises the following steps:
s1, the registered infrared source images and visible light source images are sent into an encoder in pairs, and source image features are extracted by the encoder;
s2, fusing the source image features by a multi-layer convolution fusion network to obtain fused features;
s3, reconstructing the fused features by a decoder, and outputting an image.
2. The infrared and visible light image fusion method based on multi-layer convolution according to claim 1, wherein the encoder performs feature extraction on the infrared source image and the visible light source image at four scales;
C, k, W and H in the ECA attention mechanism represent the channel dimension, the convolution kernel size, and the width and height of the feature map, respectively; the convolution kernel size is determined by:
k = |log_2(C)/γ + b/γ|_odd
where |·|_odd indicates that k can only take an odd value, and b and γ are set to 2 and 1, respectively.
3. The infrared and visible light image fusion method based on multi-layer convolution according to claim 1, wherein in the multi-layer convolution fusion network the downsampling convolution block is formed by interleaving a max pooling layer, a 3×3 convolution layer and a convolution layer with an activation function; after the input image passes through the max pooling layer, the feature information is processed twice, by a convolution layer with an activation function and by a 3×3 convolution layer;
a convolution block consisting of a 3×3 convolution layer and a 3×3 convolution layer with an LReLU activation function is adopted to directly extract features from the source image information; the features extracted from the source image information are integrated with the feature information extracted by the gradient convolution block and the downsampling convolution block.
4. The infrared and visible light image fusion method based on multi-layer convolution according to claim 3, wherein the gradient convolution block is mainly formed by combining a convolution layer with an LReLU activation function, 3×3 convolution layers, a 1×1 convolution layer and a gradient operator; the main body adopts dense connections and performs feature extraction using blocks spliced from two 3×3 convolution layers and 3×3 convolution layers with an LReLU activation function; the residual stream adopts a gradient operation to calculate the gradient magnitude of the features, and a 1×1 convolution layer is used to eliminate the channel dimension difference; finally, the deep features extracted by the main dense stream are integrated with the fine-grained detail information acquired by the residual gradient stream.
5. The infrared and visible light image fusion method based on multi-layer convolution according to claim 1, wherein in said decoder each decoding block consists of two 3×3 convolution layers, and short connections are used for the connections in each row.
6. The method for fusion of infrared and visible light images based on multi-layer convolution according to claim 1, characterized in that an auto-encoder loss function L_auto is adopted to train the auto-encoder network, the auto-encoder loss function L_auto being defined as follows:
L_auto = L_pixel + 100·L_ssim
L_ssim = 1 − SSIM(Output, Input)
where L_pixel represents the pixel loss between the input and output images, computed with the Frobenius norm ‖·‖_F of their difference; L_ssim represents the structural similarity loss between the input image and the output image; SSIM(·) is a structural similarity measure that quantifies the structural similarity of two images.
7. The method for fusing infrared and visible light images based on multi-layer convolution as claimed in claim 1, wherein a fusion strategy loss function L_MCFN is adopted to train the multi-layer convolution fusion network, the fusion strategy loss function L_MCFN being defined as follows:
L_MCFN = α·L_detail + L_feature
L_detail = 1 − SSIM(O, I_vi)
where L_detail and L_feature represent the background detail retention loss function and the target feature enhancement loss function, respectively; α is a trade-off parameter; M is the number of fusion networks; w_1 is a trade-off parameter vector used to balance the loss magnitudes at different scales; w_vi controls the relative influence of the infrared features in the fused feature map, and w_ir controls the relative influence of the visible light features.
CN202311352355.2A 2023-10-18 2023-10-18 Infrared and visible light image fusion method based on multilayer convolution Pending CN117292244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311352355.2A CN117292244A (en) 2023-10-18 2023-10-18 Infrared and visible light image fusion method based on multilayer convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311352355.2A CN117292244A (en) 2023-10-18 2023-10-18 Infrared and visible light image fusion method based on multilayer convolution

Publications (1)

Publication Number Publication Date
CN117292244A true CN117292244A (en) 2023-12-26

Family

ID=89257076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311352355.2A Pending CN117292244A (en) 2023-10-18 2023-10-18 Infrared and visible light image fusion method based on multilayer convolution

Country Status (1)

Country Link
CN (1) CN117292244A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118090743A (en) * 2024-04-22 2024-05-28 山东浪潮数字商业科技有限公司 Porcelain winebottle quality detection system based on multi-mode image recognition technology



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination