CN117292244A - Infrared and visible light image fusion method based on multilayer convolution - Google Patents

Infrared and visible light image fusion method based on multilayer convolution

Info

Publication number
CN117292244A
CN117292244A (application CN202311352355.2A)
Authority
CN
China
Prior art keywords
convolution
layer
fusion
image
infrared
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311352355.2A
Other languages
Chinese (zh)
Inventor
陈海秀
房威志
陆康
黄仔洁
陈嘉越
褚羽婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202311352355.2A priority Critical patent/CN117292244A/en
Publication of CN117292244A publication Critical patent/CN117292244A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/86Arrangements for image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an infrared and visible light image fusion method based on multilayer convolution. The network structure comprises an encoding network, a decoding network and a multilayer convolution fusion network; the encoder is formed by mutually nesting multilayer convolution blocks and an ECA attention mechanism; the decoder is mainly composed of decoding blocks, each consisting of two convolution layers; the multilayer convolution fusion network mainly comprises a gradient convolution block, a downsampling convolution block, a convolution spatial channel attention mechanism and a plurality of convolution layers. The method comprises the following steps: S1, the registered infrared and visible light source images are sent into the encoder in pairs, and the encoder extracts source image features; S2, the multilayer convolution fusion network fuses the source image features to obtain fused features; S3, the decoder reconstructs the fused features and outputs an image. The fused image has a prominent target, clear details and an obvious outline, shows a marked improvement in evaluation indexes, and accords with human visual perception.

Description

Infrared and visible light image fusion method based on multilayer convolution
Technical Field
The invention relates to the field of image processing, in particular to an infrared and visible light image fusion method based on multilayer convolution.
Background
The infrared and visible light images reflect different characteristics of the target scene under different photographing instruments. The infrared image has strong penetrability, is not influenced by illumination intensity, but lacks texture information. The visible light image contains abundant structural information, has a good visual effect, is easily influenced by factors such as weather, illumination conditions and the like, and has poor anti-interference capability. Therefore, the infrared and visible light images are fused, the complementarity of the information is fully utilized, and the method has wide application value in various fields.
Existing image fusion methods mainly comprise traditional methods and deep learning methods. A typical representative of the traditional methods is the multi-scale transform based approach: multi-scale transformation is applied to the source images to obtain their multi-scale features, the multi-scale features of different images are then fused according to a specific rule, and finally the fused image is reconstructed by the inverse multi-scale transformation. Such methods can capture features at different scales and understand the image more comprehensively. However, it is difficult to select and design proper transformation rules, and the fusion rules are complex, requiring the derivation and calculation of mathematical formulas.
Infrared and visible light image fusion methods based on deep learning fall into three categories: methods based on convolutional neural networks, methods based on generative adversarial networks, and methods based on auto-encoders. Auto-encoder based methods encode, fuse, decode and reconstruct the source images to generate the fused image. In 2018, Li et al. (Li H, Wu X J. DenseFuse: A Fusion Approach to Infrared and Visible Images [J]. IEEE Trans. Image Processing, 2019, 28(5)) introduced dense connections to extract image depth features in the encoder network and proposed DenseFuse for infrared and visible light image fusion. In 2020, Li et al. (Li H, Wu X J, Durrani T. NestFuse: An Infrared and Visible Image Fusion Architecture Based on Nest Connection and Spatial/Channel Attention Models [J]. IEEE Transactions on Instrumentation and Measurement, 2020, PP(99)) devised the NestFuse network, whose fusion strategy based on spatial and channel attention can fuse multi-scale features. In 2021, Li et al. (Li H, Wu X J, Kittler J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images [J]. Information Fusion, 2021, 73: 72-86) proposed the end-to-end RFN-Nest network on the basis of NestFuse, and many scholars have subsequently proposed further auto-encoder based infrared and visible image fusion methods. However, current auto-encoder based fusion methods still have the following disadvantages:
1) The target of the fused image is not prominent, and details are lost;
2) The visual perception of the fused image is poor;
3) Under a complex background, the texture detail information of the fused image is easily lost.
Disclosure of Invention
Purpose of the invention: the invention aims to provide an infrared and visible light image fusion method based on multilayer convolution, which can better retain the thermal radiation information in the infrared image and the texture details in the visible light image under a complex background and improve the visual effect.
The technical scheme is as follows: the invention relates to an infrared and visible light image fusion method based on multilayer convolution. The adopted network structure comprises an encoding network, a decoding network and a multilayer convolution fusion network; the encoder is formed by mutually nesting multilayer convolution blocks and an ECA attention mechanism; the decoder is mainly composed of decoding blocks, each consisting of two convolution layers; the multilayer convolution fusion network mainly comprises a gradient convolution block, a downsampling convolution block, a convolution spatial channel attention mechanism and a plurality of convolution layers. The method comprises the following steps:
s1, the registered infrared source images and visible light source images are sent into an encoder in pairs, and source image features are extracted by the encoder;
s2, fusing the source image features by a multi-layer convolution fusion network to obtain fused features;
s3, reconstructing the fused features by a decoder, and outputting an image.
Further, the encoder performs feature extraction on the infrared source image and the visible light source image at four scales;
C, k, W and H in the ECA attention mechanism represent the channel dimension, the convolution kernel size, and the width and height of the feature map, respectively; the convolution kernel size is determined by:
k = |log_2(C)/γ + b/γ|_odd
where |·|_odd indicates that k can only take an odd value, and b and γ are used to adjust the ratio between the number of channels and the convolution kernel size.
Further, in the multilayer convolution fusion network, the downsampling convolution block is formed by interleaving a max pooling layer, a 3×3 convolution layer and a convolution layer with an activation function; after the input image passes through the max pooling layer, the feature information is processed twice, by a convolution layer with an activation function and by a 3×3 convolution layer;
a convolution block consisting of a 3×3 convolution layer and a 3×3 convolution layer with an LReLU activation function is adopted to directly extract features from the source image information; the features extracted from the source image information are then integrated with the feature information extracted by the gradient convolution block and the downsampling convolution block.
Furthermore, the gradient convolution block is mainly formed by combining a convolution layer with an LReLU activation function, 3×3 convolution layers, a 1×1 convolution layer and a gradient operator; the main body is densely connected, and feature extraction is carried out using blocks spliced from two 3×3 convolution layers and 3×3 convolution layers with an LReLU activation function; the residual stream adopts a gradient operation to calculate the gradient magnitude of the features, and a 1×1 convolution layer is used to eliminate the channel dimension difference; finally, the deep features extracted by the main dense stream are integrated with the fine-grained detail information acquired by the residual gradient stream.
Further, in the decoder, each decoding block is composed of two 3×3 convolutional layers; a short connection is used for the connection in each row.
Further, an auto-encoder loss function L_auto is adopted to train the auto-encoder network; the auto-encoder loss function L_auto is defined as follows:
L_auto = L_pixel + 100·L_ssim
L_ssim = 1 − SSIM(Output, Input)
where L_pixel represents the pixel loss between the input and output images, computed with the Frobenius norm ‖·‖_F of their difference; L_ssim represents the structural similarity loss between the input image and the output image; SSIM(·) is a structural similarity measure that quantifies the structural similarity of two images.
Further, a fusion strategy loss function L_MCFN is adopted to train the multilayer convolution fusion network; the fusion strategy loss function L_MCFN is defined as follows:
L_MCFN = α·L_detail + L_feature
L_detail = 1 − SSIM(O, I_vi)
where L_detail and L_feature represent the background detail retention loss function and the target feature enhancement loss function, respectively; α is a trade-off parameter; M is the number of fusion networks; w_1 is a trade-off parameter vector used to balance the loss magnitudes at different scales; w_vi controls the relative influence of the infrared features in the fused feature map, and w_ir controls the relative influence of the visible light features.
Compared with the prior art, the invention has the following remarkable effects:
1. The invention introduces an ECA attention mechanism into the encoder, designs the CSCA, GCB and DSCB fusion blocks, and constructs the MCFN fusion network on this basis, which to a certain extent solves the problems of unobtrusive fusion targets, loss of texture detail information under complex backgrounds, and poor visual perception;
2. The fusion network designed by the invention can better retain the thermal radiation information in the infrared image and the texture details in the visible light image under a complex background. Experimental results comparing the invention with 5 existing fusion algorithms, in both subjective and objective terms on two public data sets, show that: objectively, the evaluation indexes of the fused images are obviously improved; subjectively, the invention shows certain superiority in highlighting target information, retaining texture detail information under a complex background, and improving the visual effect.
Drawings
FIG. 1 is a schematic diagram of the overall network architecture of the present invention;
FIG. 2 is a detailed structure diagram of the codec;
FIG. 3 is a diagram of the mechanism of attention of the ECA;
fig. 4 is a structural diagram of an MCFN;
fig. 5 is a structural view of CB;
fig. 6 is a structural diagram of CSCA;
FIG. 7 is a graph of results for different alpha values;
FIG. 8 is a graph of fused image results for a helicopter under different algorithms;
FIG. 9 is a graph of fusion results for soldiers under different algorithms;
FIG. 10 is a graph of the fusion results of roads under different algorithms;
FIG. 11 is a graph of the fusion results of a tent and a person under different algorithms;
FIG. 12 is a graph of the fusion of streets under different algorithms.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
The invention provides an infrared and visible light image fusion algorithm based on multilayer convolution. In the encoder stage, an efficient channel attention mechanism (Efficient Channel Attention, ECA) is introduced to improve the quality of the fused image. Gradient convolution blocks (Gradient Convolution Block, GCB), downsampling convolution blocks (DownSampling Convolution Block, DSCB) and convolution spatial channel attention mechanisms (Convolution Spatial Channel Attention, CSCA) are designed in a multi-layer convolution fusion network (hereinafter simply referred to as "fusion network") MCFN (Multilayer Convolutional Fusion Network), which can better preserve texture detail information of images in complex backgrounds and highlight infrared targets. Finally, the decoder decodes the reconstructed output.
Network architecture design
As shown in FIG. 1, the overall network structure of the present invention is composed of an encoding network, a decoding network and a fusion network. First, the registered infrared and visible light images are sent into the encoder in pairs and the encoder extracts the features of the source images; then the fusion network MCFN fuses the source image features; finally the decoder reconstructs the fused features and outputs the fused image.
The invention adopts a two-stage training method: in the first stage the encoder, the fusion network and the decoder are trained as a whole, and in the second stage the parameter weights from the first stage are used directly while the fusion network MCFN is trained separately. The fusion network MCFN is designed to better retain the detail information and background information of the fused image and to improve its visual effect. Ablation experiments show that the designed MCFN is important for improving the objective evaluation indexes and the visual effect of the fused image.
The detailed structure of the encoder and decoder is shown in FIG. 2; the left side is the encoder, where I_vis and I_ir denote the input visible light source image and infrared source image. The encoder is formed by nesting multilayer convolution blocks with ECA attention mechanisms. In the figure, annotations of the form "(in, out)" on the convolution layers indicate the input and output channel counts; for example, (16, 8) indicates 16 input channels and 8 output channels. The encoder extracts features of the infrared source image and the visible light source image at four scales, and the extracted feature information is fused by the fusion network.
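As a reading aid for the "(in, out)" channel annotations, a minimal PyTorch sketch of one encoder convolution stage is given below. The class name, the activation function and the way the attention module is attached are assumptions, since FIG. 2 is not reproduced here; the ECA module itself is sketched after the FIG. 3 discussion.

```python
import torch.nn as nn

class EncoderConvBlock(nn.Module):
    """One encoder stage (sketch): a 3x3 convolution annotated '(in, out)' in FIG. 2,
    followed by the nested attention module. nn.Identity() stands in for ECA here."""
    def __init__(self, in_ch, out_ch, attention=None):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)          # activation choice is an assumption
        self.attention = attention if attention is not None else nn.Identity()

    def forward(self, x):
        return self.attention(self.act(self.conv(x)))

# the '(16, 8)' annotation in FIG. 2 would correspond to:
block = EncoderConvBlock(16, 8)
```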
The decoder, shown on the right side of FIG. 2, is mainly composed of Decoding Blocks (DB), each consisting of two 3×3 convolutional layers. In each row, the blocks are connected by short connections similar to a dense-block architecture. In addition, the decoder adopts cross-layer connections to retain more multi-scale deep features and detail information from the source images; the output of the network is the fused image reconstructed from the multi-scale features.
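A minimal sketch of one decoding block DB, assuming the short and cross-layer connections deliver feature maps that are concatenated along the channel dimension; the activation function is an assumption, since the text only specifies the two 3×3 convolutions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Decoding block DB (sketch): two 3x3 convolutions applied to the concatenation
    of the feature maps arriving over the short/cross-layer connections."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, *features):
        return self.body(torch.cat(features, dim=1))  # fuse the incoming connections, then decode
```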
A lightweight ECA attention mechanism is introduced into the encoder; ablation experiments show that this attention mechanism has a positive effect on improving the fused image indexes.
The structure of the ECA attention mechanism is shown in FIG. 3, where C, k, W and H represent the channel dimension, the convolution kernel size, and the width and height of the feature map, respectively. The convolution kernel size is determined by:
k = |log_2(C)/γ + b/γ|_odd
where |·|_odd indicates that k can only take an odd value; b and γ are set to 2 and 1, respectively, and adjust the ratio between the number of channels and the convolution kernel size.
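A sketch of the ECA attention mechanism consistent with the formula above: global average pooling, then a 1-D convolution whose kernel size k is adapted to the channel count C, then a sigmoid gate that re-weights the channels. Only the formula and the b, γ values come from the text; the rest follows the standard ECA design and should be read as an assumption.

```python
import math
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention (sketch): channel re-weighting with an adaptive 1-D kernel."""
    def __init__(self, channels, gamma=1, b=2):   # b = 2, gamma = 1 as stated in the text
        super().__init__()
        # k = |log2(C)/gamma + b/gamma|_odd  -> nearest odd value
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        n, c, _, _ = x.shape
        y = self.pool(x).view(n, 1, c)                 # (N, 1, C): one descriptor per channel
        y = self.gate(self.conv(y)).view(n, c, 1, 1)   # cross-channel interaction over k neighbours
        return x * y                                   # channel-wise re-weighting
```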
The structure of the fusion network MCFN is shown in FIG. 4. The MCFN mainly comprises the gradient convolution block GCB, the downsampling convolution block DSCB, the convolution spatial channel attention mechanism CSCA and several convolution layers. The downsampling convolution block DSCB is formed by interleaving a max pooling layer, a 3×3 convolution layer and a convolution layer with an activation function. This design reduces the heavy computation brought by the gradient convolution block GCB, while the max pooling layer better retains detailed texture information. After the max pooling layer, the feature information is processed twice, by a 3×3 convolution and by a convolution layer with an activation function, which enhances the detail features again and retains the texture information.
Outside the main body of the fusion network MCFN, a convolution block consisting of a 3×3 convolution layer and a 3×3 convolution layer with an LReLU activation function is used to directly extract features from the source image information, and the features extracted from the source images are integrated with the feature information extracted by the gradient convolution block GCB and the downsampling convolution block DSCB. This operation retains more source information and enriches the information content of the fused image. A sketch of these two blocks follows.
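Minimal sketches of the downsampling convolution block DSCB and of the plain convolution block used outside the MCFN body, following the layer lists given above; the exact ordering of the layers and the LReLU slope are assumptions.

```python
import torch.nn as nn

class DSCB(nn.Module):
    """DownSampling Convolution Block (sketch): max pooling, then a convolution with an
    activation function and a 3x3 convolution, per the description above."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.MaxPool2d(kernel_size=2),                        # halves spatial size, keeps salient detail
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.body(x)

class CB(nn.Module):
    """Plain convolution block (sketch): a 3x3 convolution followed by a 3x3 convolution
    with an LReLU activation, used to extract features directly from the source images."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x):
        return self.body(x)
```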
FIG. 5 shows the detailed structure of the gradient convolution block GCB, which is mainly formed by combining a convolution layer with an LReLU activation function, 3×3 convolution layers, a 1×1 convolution layer and a gradient operator. LReLU is an unsaturated activation function; its use alleviates the vanishing-gradient problem and also accelerates convergence and improves computational efficiency. This operation is applied to the feature information first to extract shallow information from the feature map. The main body of the GCB adopts dense connections and performs feature extraction using blocks spliced from two 3×3 convolution layers and 3×3 convolution layers with an LReLU activation function. The dense connections introduced into the main body make full use of the features extracted by the various convolution layers. In addition, the residual stream uses a gradient operation to calculate the gradient magnitude of the features and a 1×1 convolution layer to eliminate the channel dimension difference. Finally, the deep features extracted by the main dense stream are integrated with the fine-grained detail information acquired by the residual gradient stream.
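A sketch of the GCB, assuming a Sobel filter for the unnamed gradient operator and a two-step dense main stream; FIG. 5 is not reproduced, so the exact number and wiring of the dense layers are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientOperator(nn.Module):
    """Per-channel gradient magnitude (a Sobel kernel is assumed; the text only says 'gradient operator')."""
    def __init__(self, channels):
        super().__init__()
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("kx", kx.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.register_buffer("ky", kx.t().reshape(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.channels = channels

    def forward(self, x):
        gx = F.conv2d(x, self.kx, padding=1, groups=self.channels)
        gy = F.conv2d(x, self.ky, padding=1, groups=self.channels)
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

class GCB(nn.Module):
    """Gradient Convolution Block (sketch): shallow conv + LReLU, a densely connected main
    stream of 3x3 convolutions, and a residual gradient stream closed by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.shallow = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                     nn.LeakyReLU(0.2, inplace=True))
        self.dense1 = nn.Sequential(nn.Conv2d(out_ch, out_ch, 3, padding=1),
                                    nn.LeakyReLU(0.2, inplace=True))
        self.dense2 = nn.Conv2d(2 * out_ch, out_ch, 3, padding=1)   # sees the concatenated earlier outputs
        self.grad = GradientOperator(out_ch)
        self.proj = nn.Conv2d(out_ch, out_ch, 1)                    # 1x1 conv removes the channel mismatch

    def forward(self, x):
        s = self.shallow(x)                                   # shallow feature information
        d = self.dense2(torch.cat([s, self.dense1(s)], dim=1))
        g = self.proj(self.grad(s))                           # fine-grained gradient detail
        return d + g                                          # integrate dense stream and gradient stream
```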
FIG. 6 shows the detailed structure of the convolution spatial channel attention mechanism CSCA. The attention mechanisms contained in this structure reduce attention to irrelevant information in both the channel and spatial dimensions, focus on high-value information, alleviate the information-overload problem, and improve the efficiency and accuracy of the task. Applying both within the convolution block makes it possible to better acquire the detail information of the target regions that need attention and to retain more detail information. The ablation experiments show that this module is important for improving the quality of the fused image.
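The text does not give the internal layout of the CSCA block, so the sketch below assumes a CBAM-style design: a 3×3 convolution whose output is re-weighted first by a channel-attention branch and then by a spatial-attention branch. Names and the reduction ratio are hypothetical.

```python
import torch
import torch.nn as nn

class CSCA(nn.Module):
    """Convolution Spatial-Channel Attention (sketch): convolution + channel gate + spatial gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.channel_gate = nn.Sequential(              # squeeze to 1x1, excite back to C channels
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(              # 2-channel (mean, max) map -> 1-channel mask
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.conv(x)
        f = f * self.channel_gate(f)                                        # suppress low-value channels
        stats = torch.cat([f.mean(dim=1, keepdim=True),
                           f.max(dim=1, keepdim=True).values], dim=1)
        return f * self.spatial_gate(stats)                                 # focus on salient regions
```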
The loss functions of this embodiment are classified into an automatic encoder loss function and a fusion strategy loss function.
(A) Automatic encoder loss function
This embodiment uses the following loss function to train the auto-encoder network, with L_auto defined as:
L_auto = L_pixel + 100·L_ssim (1)
where L_pixel represents the pixel loss between the input and output images and L_ssim represents the structural similarity loss between the input image and the output image.
The pixel loss L_pixel is given by equation (2) as the Frobenius norm ‖·‖_F of the difference between the output and input images; it constrains the similarity of the output image to the input image at the pixel level.
The loss L_ssim is calculated from equation (3):
L_ssim = 1 − SSIM(Output, Input) (3)
where SSIM(·) is a structural similarity measure that quantifies the structural similarity of two images.
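A minimal sketch of the first-stage training loss, assuming images scaled to [0, 1], a squared-Frobenius-norm pixel term, and the `pytorch_msssim` package for a differentiable SSIM; all three choices are assumptions, since equation (2) is not reproduced in the text.

```python
import torch
from pytorch_msssim import ssim   # assumed dependency; any differentiable SSIM works

def auto_encoder_loss(output: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L_auto = L_pixel + 100 * L_ssim, per equations (1) and (3)."""
    l_pixel = torch.norm(output - target, p="fro") ** 2            # assumed squared Frobenius norm
    l_ssim = 1.0 - ssim(output, target, data_range=1.0)            # L_ssim = 1 - SSIM(Output, Input)
    return l_pixel + 100.0 * l_ssim
```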
(B) Fusion policy loss function
Training of the fusion network MCFN aims to realize a fully learnable fusion strategy. In the second stage, with the encoder and decoder fixed, the fusion network MCFN is trained with an appropriate loss function. To better train the fusion network MCFN, this embodiment uses the loss function L_MCFN, defined as follows:
L_MCFN = α·L_detail + L_feature (4)
where L_detail and L_feature represent the background detail retention loss function and the target feature enhancement loss function, respectively. α is a trade-off parameter, set to 700 by the parameter-setting experiment. Since most of the background detail information of the fused image comes from the visible light image, L_detail aims to preserve the detail information and structural features of the visible light image and is defined as:
L_detail = 1 − SSIM(O, I_vi) (5)
Since the infrared image contains more salient target features than the visible image, the loss function L_feature is designed to constrain the fused deep features so as to preserve the salient target features. L_feature is defined by equation (6), where M is the number of fusion networks (set to M = 4); w_1 is a trade-off parameter vector used to balance the loss magnitudes, set to {1, 10, 100, 1000} to balance the magnitude differences at different scales; w_vi and w_ir control the relative influence of the infrared and visible light features in the fused feature map and are set to 6.0 and 3.0, respectively.
(II) results of experiments and analysis
In this embodiment, after the experimental settings of the training phase and the testing phase are described, a parameter-setting experiment is performed on α in the loss function. Ablation experiments are then performed on the attention mechanism and the fusion network MCFN, fully demonstrating the effectiveness of the invention. Finally, the invention is compared with five other algorithms published in recent years, including: the end-to-end residual fusion network for infrared and visible images (RFN-Nest), the generative adversarial network with multi-classification constraints for infrared and visible image fusion (GANMcC), the fusion generative adversarial network (FusionGAN), deep image decomposition for infrared and visible image fusion (DIDFuse), and unsupervised misaligned infrared and visible image fusion based on cross-modality image generation and registration (UMF-CMGR). Eight indexes are used to objectively evaluate the quality of the fused images: information entropy (EN), spatial frequency (SF), average gradient (AG), standard deviation (SD), correlation coefficient (CC), sum of differential correlation (SCD), visual information fidelity (VIF) and peak signal-to-noise ratio (PSNR).
(21) Experimental setup
The TNO dataset (Toet A. The TNO Multiband Image Data Collection [J]. Data in Brief, 2017, 15: 249-251) contains rich military scenes, such as helicopters, houses, tanks, people, forests and vehicles, so it meets the dataset requirements of the infrared and visible light image fusion field very well and is the most authoritative dataset in this research field. 45000 pairs of infrared and visible light images obtained by expanding the dataset are used as the training set. The MSRS dataset, commonly used in the field of infrared and visible image fusion, contains 1444 pairs of high-quality registered infrared and visible images, including both daytime and nighttime image pairs. To make the test results more authoritative, 42 pairs of images were selected from the TNO dataset as the TNO test set, 10 pairs of images were selected from the MSRS dataset as the MSRS test set, and the results were averaged. The algorithm is verified in an experimental environment built on a Windows 10 system; training is carried out on an NVIDIA RTX 3080 GPU, the initial learning rate is set to 0.0001, and the batch size and epoch are both set to 4.
(22) Parameter setting experiment
The present example analyzes the experimental results of different alpha values in subjective as well as objective terms by experimental methods. Fig. 7 shows a graph of fusion results for different alpha values.
It can be intuitively seen from FIG. 7 that the α value has a direct effect on the experimental result; when α is too large or too small, the infrared information of the fused image is lost or even absent. From a comparison of the boxed detail regions in FIG. 7, it can be found that texture details are retained better when α is set to 600, 700 or 800. To further determine the α value, Table 1 shows the index results for different α values on the TNO test set.
TABLE 1 results for different alpha values
In table 1, the top three values are marked using bold fonts, and it can be clearly observed that when α=700, the number of top three values is 6, and when α=800, the number of top three values is 5. In the experiment, therefore, α was set to 700.
(23) Ablation experiments
In order to verify the effectiveness of the ECA attention mechanism introduced by the invention and of the designed modules, ablation experiments are carried out on the ECA attention mechanism and on the CSCA, GCB and DSCB modules, fully illustrating the effectiveness of the designed network. The results of the ablation experiments are shown in Table 2.
Table 2 ablation experimental results
In Table 2, the best result values are marked in bold. It can be seen that the CSCA block brings the most obvious improvement across the 8 indexes. The ECA attention mechanism plays an important role in improving EN, SF, AG and SD. The GCB block, although it has a relatively large negative effect on SD, has a non-negligible positive effect on the other metrics. In summary, the fusion blocks designed by the invention and the attention mechanism introduced are important for the quality of the fused image.
(24) Comparative experiments
To illustrate the effectiveness of the present invention, it is compared with the 5 published algorithms in both objective and subjective terms.
(241) Subjective evaluation
As shown in FIG. 8, the fusion of a helicopter image from the TNO dataset under different algorithms is selected. In this scene, the infrared image has no background information but has an obvious outline, the target is prominent, and the rotor characteristics are shown well. The background texture information of the visible light image is well preserved and the visibility of the whole image is good, but the detail information of the helicopter is severely lost.
FIG. 8 shows the fusion results of the 5 existing algorithms and of the present invention for this image. In the fusion result of the GANMcC algorithm, the background texture information is severely lost, the tail rotor is hardly visible, and the landing gear is not obvious. In the fusion result of the DIDFuse algorithm, the tail fin and main rotor information is severely lost; the background texture details are retained relatively well but deviate from the background information of the visible light image. The detail information in the fusion result of the FusionGAN algorithm is preserved better, but every part of the helicopter shows ghosting, so the whole image is rather blurred. The fusion result of the UMF-CMGR algorithm performs well in overall detail, but the landing gear is relatively unclear, background information is lost, and the overall image is dark. In the fusion result of the RFN-Nest algorithm, the main rotor information is almost completely lost and only the outline of the tail rotor can be observed; the landing gear carries more visible light information and less infrared information, giving a poor visual impression, and the background texture information is severely lost. The fusion result of the present invention retains background texture details well and almost coincides with the background information of the visible light image; the landing gear, tail rotor and main rotor details are clearly visible, and the whole image has no ghosting and a better visual effect. Therefore, from a subjective point of view, compared with the other algorithms, the invention has advantages in target information retention, background texture details and the overall visual effect of the image.
As shown in FIG. 9, the fusion of a soldier in a jungle from the TNO dataset under different algorithms is selected. The background texture detail information of the fused images under the GANMcC, FusionGAN and RFN-Nest algorithms is severely lost. The target information of the soldier in the fused images under the DIDFuse and UMF-CMGR algorithms is not well highlighted, so the outline of the person is not clear enough. In the fusion result of the present invention, the background texture information is better preserved and the outline of the person is clearer.
To better illustrate the universality of the present invention, a representative fusion result from the MSRS dataset is shown in FIG. 10; the present invention has certain advantages over the other 5 algorithms.
FIGS. 11 and 12 show two further results from the TNO dataset. The subjective comparison shows that the invention has a certain superiority in highlighting target information, detail retention and visual perception.
(242) Objective evaluation
In order to better verify the effectiveness of the invention, 5 representative algorithms and 8 objective evaluation indexes are selected for objective evaluation. For all 8 evaluation indexes, a larger value indicates better image quality. To ensure fairness and reliability, the averages over the 42 images of the TNO test set and the 10 images of the MSRS test set are compared respectively, which eliminates human subjective factors to a certain extent and makes the evaluation results more objective.
TABLE 3 evaluation of fusion Effect of TNO datasets
To enable a clearer analysis of the evaluation index data in Table 3, the first-ranked values are marked in bold. Among the 8 evaluation indexes used, the invention ranks first in 4 indexes and second in 3 indexes. The invention ranks above the mean for the different indexes, so in objective terms it has a certain superiority on the TNO dataset compared with the other algorithms. In particular, SF is improved by 25.9% and AG by 38.6% compared with the second-ranked method.
TABLE 4 evaluation of fusion Effect of MSRS datasets
In Table 4, the first-ranked values are marked in bold. It can be clearly seen that, among the 8 evaluation indexes, the invention performs best in 6 indexes on the MSRS dataset. Therefore, in objective terms, the invention has a certain superiority on the MSRS dataset compared with the other algorithms; in particular, SD is 7.4% higher than the second-ranked method. Overall, compared with the other algorithms, the indexes on both the TNO dataset and the MSRS dataset are obviously improved, so the invention performs well in objective terms.

Claims (7)

1. The infrared and visible light image fusion method based on the multilayer convolution is characterized in that a network structure comprises an encoding network, a decoding network and a multilayer convolution fusion network, and the encoder is formed by mutually nesting a multilayer convolution block and an ECA attention mechanism; the decoder is mainly composed of decoding blocks, and each decoding block is composed of two convolution layers; the multi-layer convolution fusion network mainly comprises a gradient convolution block, a downsampling convolution block, a convolution space channel attention mechanism and a plurality of convolution layers; the method comprises the following steps:
s1, the registered infrared source images and visible light source images are sent into an encoder in pairs, and source image features are extracted by the encoder;
s2, fusing the source image features by a multi-layer convolution fusion network to obtain fused features;
s3, reconstructing the fused features by a decoder, and outputting an image.
2. The infrared and visible light image fusion method based on multi-layer convolution according to claim 1, wherein the encoder performs feature extraction on the infrared source image and the visible light source image at four scales;
C, k, W and H in the ECA attention mechanism represent the channel dimension, the convolution kernel size, and the width and height of the feature map, respectively; the convolution kernel size is determined by:
k = |log_2(C)/γ + b/γ|_odd
where |·|_odd indicates that k can only take an odd value, and b and γ are set to 2 and 1, respectively.
3. The infrared and visible light image fusion method based on multi-layer convolution according to claim 1, wherein in the multi-layer convolution fusion network the downsampling convolution block is formed by interleaving a max pooling layer, a 3×3 convolution layer and a convolution layer with an activation function; after the input image passes through the max pooling layer, the feature information is processed twice, by a convolution layer with an activation function and by a 3×3 convolution layer;
a convolution block consisting of a 3×3 convolution layer and a 3×3 convolution layer with an LReLU activation function is adopted to directly extract features from the source image information; the features extracted from the source image information are integrated with the feature information extracted by the gradient convolution block and the downsampling convolution block.
4. The infrared and visible light image fusion method based on multi-layer convolution according to claim 3, wherein the gradient convolution block is mainly formed by combining a convolution layer with an LReLU activation function, 3×3 convolution layers, a 1×1 convolution layer and a gradient operator; the main body adopts dense connections and performs feature extraction using blocks spliced from two 3×3 convolution layers and 3×3 convolution layers with an LReLU activation function; the residual stream adopts a gradient operation to calculate the gradient magnitude of the features, and a 1×1 convolution layer is used to eliminate the channel dimension difference; finally, the deep features extracted by the main dense stream are integrated with the fine-grained detail information acquired by the residual gradient stream.
5. The infrared and visible light image fusion method based on multi-layer convolution according to claim 1, wherein in said decoder each decoding block consists of two 3×3 convolution layers, and short connections are used for the connections in each row.
6. The method for fusion of infrared and visible light images based on multi-layer convolution according to claim 1, characterized in that an auto-encoder loss function L_auto is adopted to train the auto-encoder network, the auto-encoder loss function L_auto being defined as follows:
L_auto = L_pixel + 100·L_ssim
L_ssim = 1 − SSIM(Output, Input)
where L_pixel represents the pixel loss between the input and output images, computed with the Frobenius norm ‖·‖_F of their difference; L_ssim represents the structural similarity loss between the input image and the output image; SSIM(·) is a structural similarity measure that quantifies the structural similarity of two images.
7. The method for fusing infrared and visible light images based on multi-layer convolution as claimed in claim 1, wherein a fusion strategy loss function L_MCFN is adopted to train the multi-layer convolution fusion network, the fusion strategy loss function L_MCFN being defined as follows:
L_MCFN = α·L_detail + L_feature
L_detail = 1 − SSIM(O, I_vi)
where L_detail and L_feature represent the background detail retention loss function and the target feature enhancement loss function, respectively; α is a trade-off parameter; M is the number of fusion networks; w_1 is a trade-off parameter vector used to balance the loss magnitudes at different scales; w_vi controls the relative influence of the infrared features in the fused feature map, and w_ir controls the relative influence of the visible light features.
CN202311352355.2A 2023-10-18 2023-10-18 Infrared and visible light image fusion method based on multilayer convolution Pending CN117292244A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311352355.2A CN117292244A (en) 2023-10-18 2023-10-18 Infrared and visible light image fusion method based on multilayer convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311352355.2A CN117292244A (en) 2023-10-18 2023-10-18 Infrared and visible light image fusion method based on multilayer convolution

Publications (1)

Publication Number Publication Date
CN117292244A true CN117292244A (en) 2023-12-26

Family

ID=89257076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311352355.2A Pending CN117292244A (en) 2023-10-18 2023-10-18 Infrared and visible light image fusion method based on multilayer convolution

Country Status (1)

Country Link
CN (1) CN117292244A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118090743A (en) * 2024-04-22 2024-05-28 山东浪潮数字商业科技有限公司 Porcelain winebottle quality detection system based on multi-mode image recognition technology



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination