CN116681636A - Light infrared and visible light image fusion method based on convolutional neural network - Google Patents


Info

Publication number
CN116681636A
Authority
CN
China
Prior art keywords
image
fusion
visible light
network
infrared
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310924379.4A
Other languages
Chinese (zh)
Other versions
CN116681636B (en)
Inventor
彭成磊
洪宇宸
苏鸿丽
刘知豪
潘红兵
王宇宣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202310924379.4A priority Critical patent/CN116681636B/en
Publication of CN116681636A publication Critical patent/CN116681636A/en
Application granted granted Critical
Publication of CN116681636B publication Critical patent/CN116681636B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/30Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G06T7/337Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a lightweight infrared and visible light image fusion method based on a convolutional neural network, belonging to the fields of image processing and computer vision. The method first performs image registration, decides whether to apply the enhancement network EnhanceNet according to whether the average brightness of the visible light image is below a threshold, inputs the visible light Y component and the infrared image in grayscale format into the fusion network FusionNet to obtain a fusion result Y', and performs format conversion to obtain the final fused image. The invention couples low-illumination image enhancement with image fusion, so that the algorithm achieves a good fusion effect in low-illumination scenes. The enhancement network and the fusion network are lightweight convolutional neural networks with small parameter counts, low computational cost and high inference speed, and are suitable for deployment on resource-constrained embedded devices.

Description

Light infrared and visible light image fusion method based on convolutional neural network
Technical Field
The invention relates to a lightweight infrared and visible light image fusion method based on a convolutional neural network, and belongs to the fields of image processing and computer vision.
Background
Convolutional neural networks have long been one of the core algorithms in the field of image recognition, and they perform stably when sufficient training data are available. For general large-scale image classification problems, a convolutional neural network can be used to build a hierarchical classifier; in fine-grained recognition it can also be used to extract discriminative features of an image for other classifiers to learn from. In the latter case, features can be provided manually by feeding different parts of the image into the convolutional neural network separately, or the network can extract them itself through unsupervised learning.
Whether a convolutional neural network is lightweight is judged by its parameter count. When the parameter count of a deep learning model is small, academia speaks of it as lightweight without an exact, quantitative definition; currently recognized lightweight convolutional neural networks include MobileNet-v1, with 4.2M parameters, and ShuffleNetV2, with 2M parameters.
Existing infrared and visible light image fusion methods are designed for normal illumination conditions and therefore ignore the illumination degradation present in night-time visible light images. In particular, under weak-light conditions the existing fusion methods only use infrared information to fill the scene defects caused by illumination degradation in the visible light image. As a result, the abundant scene information in night-time visible light images cannot be expressed in the fused image, which deviates from the original purpose of the infrared and visible light image fusion task.
An intuitive solution is to pre-enhance the visible light image with an advanced dim-light enhancement algorithm and then merge the source images with a fusion method. However, treating image enhancement and image fusion as separate tasks often leads to incompatibility problems and poor fusion results, mainly because the night-time visible light image already has slight color distortion, the low-light enhancement algorithm changes the color distribution around light sources and amplifies the color distortion of the whole image to some extent, and, in addition, during fusion the strategy applied in the Y channel changes the saturation distribution of the source image, so the fused image can also suffer color distortion. It is therefore necessary to design a fusion method suitable for night scenes that improves the fusion effect of infrared and visible light images and reduces noise, color distortion and artifacts in the fused image, while keeping the model lightweight and highly real-time.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, the invention provides a lightweight convolutional neural network suitable for fusing infrared and visible light images in low-illumination scenes. The invention designs an enhancement network that alleviates the illumination degradation problem, optimizes the brightness distribution of the image, enhances contrast and texture details, and enriches the visible light scene information in the fused image without excessively amplifying noise. In addition, the architecture fully considers the internal relationship between low-illumination image enhancement and image fusion, which not only reduces color distortion but also effectively couples the two problems together.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the light infrared and visible light image fusion method based on the convolutional neural network comprises the following steps:
s1: image registration is carried out on a pair of infrared images and visible light images which are acquired in similar positions under the same scene, so that matching points of the two images are aligned in space;
s2: converting the visible light image from RGB format to YUV format;
s3: judging whether the average brightness of the visible light image is lower than a certain threshold value, if so, carrying out low-illumination image enhancement on the Y component by using an enhancement network enhancement Net; otherwise, directly entering the next step;
s4: respectively inputting the enhanced or un-enhanced visible light Y component and the infrared image in grayscale format into the visible light branch and the infrared branch of the fusion network FusionNet, and processing them to obtain a fusion result Y';
s5: and converting the YUV format image formed by the fusion result Y' and the UV component of the original visible light image into an RGB format to obtain a final fusion image.
In one embodiment, the specific image registration procedure in step S1 is as follows: edges are extracted with the Canny operator; feature points of the two edge images are detected with the SURF algorithm; feature-point matching is performed using the prior knowledge that correct matching pairs have consistent slope (direction); mismatched points are removed with the random sample consensus (RANSAC) algorithm and a homography matrix for the coordinate-system transformation is estimated; one of the images to be registered is multiplied by this matrix and cropped to obtain a result aligned with the other image. Because the visible light image is in RGB format and the infrared image is in grayscale format, the problem belongs to multi-modal registration, which is more challenging than general image registration; the present invention only considers infrared and visible light image registration methods involving rigid transformations.
In one embodiment, in step S2 the brightness values of all pixels are divided by 255 and the RGB format is converted into YUV format (more precisely, the YCbCr format of the YUV color-space family) with a linear color-space transform.
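The transform matrix itself is not spelled out in this text; a common choice, assumed here purely for illustration, is the full-range BT.601 RGB-to-YCbCr conversion applied to the normalized values (the exact coefficients used by the invention may differ):

```latex
% Assumed full-range BT.601 RGB -> YCbCr transform on values in [0,1];
% not necessarily the exact coefficients used by the invention.
\begin{aligned}
Y  &= 0.299\,R + 0.587\,G + 0.114\,B \\
Cb &= -0.169\,R - 0.331\,G + 0.500\,B + 0.5 \\
Cr &= 0.500\,R - 0.419\,G - 0.081\,B + 0.5
\end{aligned}
```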
in one embodiment, the step S3 needs to perform enhancement processing on the image with lower average brightness by using enhancement network enhancement net, but does not perform enhancement processing on the image photographed under better illumination condition.
In one embodiment, the average luminance in S3 is $\bar{I}_{vi} = \frac{1}{H\times W\times C}\sum I_{vi}$, where $I_{vi}$ denotes the normalized visible light image and H, W, C denote the height, width and number of channels of the visible light image, respectively; if $\bar{I}_{vi}$ is less than 0.25, the Y component is enhanced with the enhancement network EnhanceNet.
In one embodiment, the enhancement network EnhanceNet is a lightweight convolutional neural network for low-light image enhancement with a very compact and simple structure; it consists only of convolutional layers and residual connections, and all activation functions are ReLU except for the output layers of the last three branches.
In one embodiment, the low-light image, with its pixel values normalized to [0,1], is input into the enhancement network EnhanceNet. The feature map produced by three convolutional layers is processed by two multi-receptive-field convolution modules, which widen the network at the same level and perform feature extraction and feature aggregation at different scales; the resulting feature map is then split into three branches along the channel dimension, each branch undergoes two further convolution operations, and finally five single-channel brightness-mapping-curve parameter maps for adjusting the image pixel values are output through the activation functions of the three branches, denoted A, G1, G2, W1 and W2.
In one embodiment, in the multi-receptive-field convolution module, 1×1 convolutions are used for dimensionality reduction, and a 5×5 receptive field is realized on one branch by cascading two 3×3 convolution kernels. Several branches at the same level extract features from the input feature map with convolution-kernel combinations of different receptive fields; the features are aggregated by channel concatenation, reduced in dimension by a 1×1 convolution kernel, and combined with the original feature map through a residual connection to obtain the final output feature map.
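As an illustration of this design, the sketch below implements a multi-receptive-field convolution module of the kind described; the channel counts and the choice of two parallel branches (a 3×3 branch and a 5×5 branch built from two cascaded 3×3 convolutions) are assumptions for illustration, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class MultiReceptiveFieldBlock(nn.Module):
    """Sketch of a multi-receptive-field convolution module: parallel branches
    with different receptive fields, channel concatenation, 1x1 dimensionality
    reduction and a residual connection (channel counts are illustrative)."""
    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 2
        # 1x1 convolutions reduce the channel dimension before each branch
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
        )
        # a 5x5 receptive field obtained by cascading two 3x3 convolutions
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
        )
        # 1x1 convolution fuses the concatenated branches back to `channels`
        self.reduce = nn.Conv2d(2 * mid, channels, 1)

    def forward(self, x):
        feats = torch.cat([self.branch3(x), self.branch5(x)], dim=1)
        return x + self.reduce(feats)   # residual connection
```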
In one embodiment, each brightness-mapping-curve parameter map is a two-dimensional matrix used for the non-linear transformation of the luminance values of the low-light image, wherein: A is the parameter of a cubic polynomial that is applied to the input image iteratively, so that a higher-order brightness mapping curve can be constructed; the curve maps the brightness of each pixel of the original image to a new brightness, and the stronger expressive power of the higher-order polynomial gives the adjustment result a higher dynamic range. G1 and G2 are used for the logarithmic transformation and the gamma transformation, respectively. The three intermediate maps obtained by applying these brightness mappings to the input low-light image are combined by a weighted summation using the weight parameter maps W1 and W2 (with the remaining weight given by 1 - W1 - W2), which yields the final enhanced image.
In one embodiment, the fusion network FusionNet in step S4 is a lightweight convolutional neural network for fusing the infrared and visible light images; its network structure comprises a feature extraction part, a fusion operation and an image reconstruction part. The feature extraction part is composed of two convolutional layers and a residual convolution module, where the residual convolution module consists of two convolutional layers and a residual connection; its weight parameters are shared between the infrared branch and the visible light branch, which helps reduce the number of parameters and guides the convolutional layers to learn the similarity between the infrared and visible light images. The fusion operation introduces a squeeze-and-excitation channel attention mechanism: the feature maps output by the infrared branch and the visible light branch are concatenated along the channel dimension; a side branch applies global average pooling to reduce the spatial dimensions to 1×1, then computes a weight coefficient for each channel through a fully connected layer, a ReLU, another fully connected layer and a Hard Sigmoid, and finally each channel of the original feature map is multiplied by its coefficient, completing the channel-attention recalibration of the original features. The image reconstruction part is composed of two residual convolution modules and two convolutional layers; unlike the feature extraction part, the residual convolution modules here use the squeeze-and-excitation channel attention mechanism.
In one embodiment, each convolutional layer of the fusion network FusionNet is composed of three basic operations: a depthwise separable convolution, a batch normalization operation and a ReLU activation function.
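For reference, the basic convolutional layer described here can be sketched as follows; the kernel size and padding are assumptions made for illustration.

```python
import torch.nn as nn

class DSConvBNReLU(nn.Module):
    """Sketch of FusionNet's basic convolutional layer:
    depthwise separable convolution + batch normalization + ReLU."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.block = nn.Sequential(
            # depthwise convolution: one filter per input channel
            nn.Conv2d(in_ch, in_ch, kernel_size, padding=pad, groups=in_ch),
            # pointwise 1x1 convolution mixes channels
            nn.Conv2d(in_ch, out_ch, 1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```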
In one embodiment, in step S5 the YUV image formed by the fusion result Y' and the UV components of the original visible light image is converted into RGB format (the inverse of the conversion in step S2) to obtain the final fused image.
The invention also provides a neural network training method for the above convolutional-neural-network-based lightweight infrared and visible light image fusion method. It is a two-stage joint training method for the enhancement network EnhanceNet and the fusion network FusionNet, which couples the two sub-networks into an overall network by exploiting the inherent relationship between the image enhancement and image fusion problems.
In one embodiment, the training method of the enhancement network EnhanceNet is to first perform supervised pre-training on a low-light image dataset in RGB format, where the loss function consists of a reconstruction loss, a structural similarity loss and a smoothness loss (the smoothness loss is introduced to suppress noise), and then to fine-tune the network in an unsupervised manner on the Y components of the low-illumination visible light images in the infrared and visible light image fusion dataset, where the loss function consists of a spatial consistency loss, an exposure control loss and a smoothness loss.
In one embodiment, the training method of the fusion network FusionNet adopts self-supervised learning; the loss function consists of an intensity loss, a texture loss, a color consistency loss and an adaptive structural similarity loss. Although the enhanced visible light Y component is what is input into FusionNet, the color consistency loss refers to the RGB-format visible light image before enhancement; introducing this loss term not only reduces color distortion but also effectively couples the enhancement problem and the fusion problem together.
The invention also provides a lightweight infrared and visible light image fusion system based on a convolutional neural network, for executing the infrared and visible light image fusion method of the invention; the system comprises:
the enhancement network enhancement Net is used for carrying out low-illumination image enhancement on the visible light image with lower overall illumination;
the fusion network fusion Net is used for fusing the Y component of the visible light image and the infrared gray image into Y';
the RGB-to-YUV module and the YUV-to-RGB module are used for completing the mutual conversion between the YUV format and the RGB format of the color image.
The invention also provides an electronic device, which comprises a memory, a processor and computer instructions stored on the memory and running on the processor, wherein the computer instructions complete the infrared and visible light image fusion method when being run by the processor.
The lightweight infrared and visible light image fusion method based on a convolutional neural network disclosed by the invention mainly involves three background technologies: image registration, low-illumination image enhancement, and infrared and visible light image fusion, which are briefly introduced in turn below.
(1) Image registration. The two images to be fused need to be geometrically aligned exactly to ensure a proper fusion effect, which requires a preprocessing step called image registration before image fusion. Image registration is the process of aligning and overlaying images (two or more) of the same scene taken at different times, from different angles and/or by different sensors. The image fusion effect often depends heavily on the quality of image registration, so a good image registration algorithm must be used. Registration of infrared and visible light images belongs to multi-modal registration: an infrared grayscale image and a visible light RGB color image, with large differences in texture detail, edges, salient regions and other geometric structures, must be aligned. In general, there are two main categories of methods for registering infrared and visible light images: region-based registration methods (also known as template matching; representative algorithms include correlation methods, Fourier transform, mutual information and gradient information) and feature-based registration methods (common variants are based on point features, contour edges, visually salient features and region features). A feature-based method first extracts two sets of salient structures (e.g., feature points), then determines the correspondences between them and estimates the spatial transform accordingly, which is then used to register the given image pair. Compared with region-based methods, feature-based methods are more robust to typical appearance changes and scene motion, and, with a reasonable software implementation, are also computationally more efficient. In general, a feature-based method includes two main steps: feature extraction, which refers to detecting salient structures, and feature matching, which refers to establishing correspondences between the detected features.
(2) Low-light image enhancement. A visible light image captured in a poorly lit scene is called a low-illumination image. To the subjective perception of the human eye, such an image generally suffers from low overall brightness and contrast, and blurred or even lost color and texture details. If a low-illumination image is fused with an infrared image directly, the fused image will often fail to exploit the complementary information from the visible light image, resulting in low visual quality; it is therefore necessary to consider enhancing the visible light image before fusion. Low-illumination image enhancement is a low-level task in the field of computer vision: computer techniques are used to address the low brightness, low contrast, noise, artifacts and other problems of insufficiently illuminated images, improving their visual quality while preserving the texture, structure and color information of the objects in the original image.
(3) Infrared and visible light image fusion. By imaging the thermal radiation of objects, an infrared sensor can effectively highlight salient targets even under insufficient light, extreme conditions, bad weather and partial occlusion. However, infrared images cannot provide sufficient environmental information such as texture details and ambient lighting. In contrast, a visible light sensor images by collecting the light reflected from object surfaces, so the visible light image contains rich texture detail and better matches human visual perception. Infrared and visible light image fusion aims to integrate the complementary information of the source images and generate a high-contrast fused image that highlights salient targets and contains rich texture details. Experiments show that such multi-modal fused images benefit subsequent high-level vision tasks such as object detection, object tracking, pattern recognition and semantic segmentation. Mainstream infrared and visible light image fusion methods are commonly divided into six categories according to the underlying theory: multi-scale-transform-based, sparse-representation-based, neural-network-based, subspace-based and saliency-based methods, plus hybrid methods combining the above. Traditional methods that do not use neural networks require manually designed fusion rules (such as pixel-wise addition, pixel-wise weighted summation, maximum-selection strategies and coefficient combination). Manually designed fusion rules are generally coarse: the fusion and reconstruction step is poorly linked to the feature extraction step, the fusion result is easily biased by subjective priors, important information of the source images may not be preserved effectively, and flaws such as artifacts may even appear. An end-to-end image fusion framework based on a convolutional neural network is a technical route that avoids the drawbacks caused by manually designed fusion rules.
The invention has the advantages and effects that:
(1) The two convolutional neural networks in the invention (the enhancement network and the fusion network) are very lightweight, with parameter counts of only 5.006K and 4.348K respectively; they are computationally efficient and highly real-time, so the algorithm occupies few resources and reaches a high inference speed when deployed on resource-constrained embedded devices.
(2) The method of the invention does not simply chain the enhancement network and the fusion network together; rather, they are two sub-modules coupled within one overall network, and the method can decide whether to apply low-light enhancement to the visible light image according to its average brightness, because enhancing an already bright visible light image would destroy its color distribution. By coupling low-illumination image enhancement with image fusion, the algorithm achieves a good fusion effect in low-illumination scenes.
(3) In general, the fusion method provided by the invention can enhance the brightness and contrast of the low-illumination visible light image, enrich the visible light scene information in the fused image, and improve the recognizability of salient thermal targets while preserving texture details and the original colors.
Drawings
FIG. 1 is a schematic flow chart of the method of the invention for fusing infrared and visible light images.
Fig. 2 is a schematic diagram of the architecture of an enhanced network enhancement net, with the numbers above the convolution kernel representing the number of channels of the output signature.
Fig. 3 is a schematic diagram of the fusion network fusion net structure, with the numbers above the convolution kernel representing the number of channels of the output profile.
FIG. 4 is an infrared image taken at night selected from a data set in one embodiment of the invention.
FIG. 5 is a visible light image, paired with FIG. 4, selected from a data set in one embodiment of the invention.
Fig. 6 is the result of low-light image enhancement of fig. 5 with the enhancement network enhancement net of the present invention.
Fig. 7 is the result of infrared and visible image fusion of fig. 4 and 6 using fusion network fusion net of the present invention.
Detailed Description
The following describes the scheme of the invention in detail with reference to the accompanying drawings.
Example 1:
As shown in FIG. 1, which is a schematic flow chart of infrared and visible light image fusion with the method of the invention and also the overall network structure diagram, the overall network consists of two sub-networks: the enhancement network and the fusion network.
The light infrared and visible light image fusion method based on the convolutional neural network comprises the following steps:
S1: and carrying out image registration on a pair of infrared images and visible light images which are acquired in similar positions under the same scene, so that the matching points of the two images are aligned in space. The adopted registration algorithm comprises the following specific processes: the method comprises the steps of respectively extracting edges of an infrared image and a visible light image by a Canny operator, detecting characteristic points of the two edge images by a SURF algorithm, carrying out characteristic point matching according to prior knowledge of slope (direction) consistency between correct matching point pairs, finally removing mismatching points by a random sampling consistency RANSAC algorithm, estimating a homography matrix for coordinate system transformation, multiplying one image to be registered by the homography matrix, and cutting to obtain a result aligned with the other image.
S2: the visible light image is normalized, i.e. the luminance values of all pixels are divided by 255 and the RGB format is converted into YUV format (more precisely YCbCr format in the YUV color space family) as follows:
s3: calculating normalized average brightness of visible light imageWherein I vi Represents visible light image, H, W, C represents the height, width and channel number of visible light image, respectively, if +.>Less than 0.25, the Y component is enhanced with the enhancement network enhancement net, and the workflow of the enhancement network enhancement net is briefly described below.
The enhancement network EnhanceNet contains only convolutional layers and residual connections, and uses ReLU activations everywhere except at the output layers of the last three branches, which use Tanh. As shown in FIG. 2, the low-light image, with its pixel values normalized to [0,1], is input into EnhanceNet. The feature map produced by three convolutional layers (3×3 conv) is processed by two multi-receptive-field convolution modules, which widen the network at the same level and perform feature extraction and aggregation at different scales. The resulting feature map is split into three branches along the channel dimension, each branch passes through two further convolutions, and five brightness-mapping-curve parameter maps for adjusting the image pixel values are finally output: the first branch outputs A, and the second and third branches output the remaining maps G1, G2, W1 and W2 in pairs, as shown in FIG. 2. Note that the outputs W1 and W2 use an activation that limits their values to [0, 0.5], while the outputs G1 and G2 use an activation whose value range is [0, 1].
The brightness-mapping-curve parameter maps are two-dimensional matrices used for the non-linear transformation of the luminance values of the low-light image. A is used repeatedly to construct a higher-order brightness mapping curve: 4 iterations are performed in total, and each iteration maps the image brightness with a cubic function serving as the brightness enhancement curve.
In this iteration, $LE_n$ denotes the luminance enhancement map, x denotes the pixel coordinates, and the subscript n denotes the iteration index, taking the values 0, 1, 2, 3, 4 for a total of 4 iterations; $LE_0$ is the input low-light image, and A is the brightness-mapping-curve parameter map output by one branch of the enhancement network. Each iteration further adjusts the brightness enhancement map of the previous iteration with the same cubic function, i.e. the cubic function maps the brightness value of each pixel in every channel of the image to a new value, which is the meaning of the brightness mapping curve. The higher-order curve generated by 4 iterations of the cubic function is smoother yet can locally produce larger curvature, so it has a stronger brightness-adjustment capability and can cope with a variety of complex low-illumination scenes.
The higher-order brightness mapping curve generated by iterating the cubic function satisfies the following 3 conditions: (1) each pixel value of the enhanced image should fall within the normalized range [0,1] to avoid information loss caused by overflow truncation; (2) the curve should remain monotonic to preserve the differences (contrast) between neighboring pixels; (3) the form of the curve should be as simple as possible and remain differentiable during gradient back-propagation.
The logarithmic transformation maps the low-light image with a logarithmic brightness curve parameterized by G1, where I denotes the low-light image, $I_{max}$ denotes the maximum pixel value in the low-light image, $L_{log}$ denotes the logarithmically transformed image, and ⊙ denotes the Hadamard product, i.e. element-wise matrix multiplication.
G2 parameterizes the gamma transformation of the low-light image: for each pixel coordinate x, the pixel value of the low-light image is raised to the power given by the value of G2 at the corresponding position, i.e. $L_{\gamma}(x) = I(x)^{G_2(x)}$.
The luminance optimization results obtained by applying the non-linear brightness mappings defined by A, G1 and G2 to the low-light image are denoted $LE$, $L_{log}$ and $L_{\gamma}$, respectively. Weighting them with the weight parameter maps and summing integrates the advantages of the three non-linear brightness mapping curves: $E = W_1 \odot LE + W_2 \odot L_{log} + (1 - W_1 - W_2) \odot L_{\gamma}$, where ⊙ denotes the Hadamard product (element-wise matrix multiplication) and the scalar 1 in $(1 - W_1 - W_2)$ is broadcast during the calculation to a matrix of ones with the same shape as the low-light image; E is the enhancement result produced by the enhancement network EnhanceNet for the low-light image.
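The output stage described above can be sketched as follows. The patent's exact cubic curve and the exact placement of G1 in the logarithmic transform are not reproduced in this text, so the sketch substitutes the well-known Zero-DCE quadratic curve and a simple G1-scaled logarithmic transform as stand-ins, and assumes the third weight is 1 - W1 - W2; it illustrates the structure of the computation, not the patent's exact formulas.

```python
import torch

def apply_curves(I, A, G1, G2, W1, W2, iters=4):
    """Sketch of EnhanceNet's output stage.
    I              : low-light image in [0,1], shape (B,1,H,W)
    A,G1,G2,W1,W2  : single-channel parameter maps predicted by the network
    Stand-ins: Zero-DCE quadratic curve instead of the patent's cubic
    polynomial, and an assumed G1-scaled logarithmic transform."""
    # iterated brightness-mapping curve (Zero-DCE-style stand-in)
    LE = I
    for _ in range(iters):
        LE = LE + A * LE * (1.0 - LE)

    # logarithmic transform parameterized by G1 (assumed form)
    I_max = I.amax(dim=(2, 3), keepdim=True)
    L_log = G1 * torch.log1p(I) / torch.log1p(I_max)

    # gamma transform: pixel-wise exponent given by G2
    L_gamma = I.clamp(min=1e-6) ** G2

    # weighted combination; third weight assumed to be 1 - W1 - W2
    E = W1 * LE + W2 * L_log + (1.0 - W1 - W2) * L_gamma
    return E.clamp(0.0, 1.0)
```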
The input of EnhanceNet is a single-channel map: when processing a grayscale image or the Y component of a YUV-format color image, the component can be fed into the network directly. The pre-training stage, however, uses RGB images; in that case the R, G and B color channels are each fed into the network separately, and the three output single-channel maps are assembled into an RGB image to obtain the enhancement result.
Table 1 shows the evaluation metrics obtained by testing the enhancement network EnhanceNet on a real dataset with a resolution of 640 x 480 (↑ indicates that a larger value is better, whereas ↓ indicates that a smaller value is better); the test platform is Linux with an NVIDIA GeForce RTX 3090. The data in the table show that even with a small number of parameters, the PSNR and SSIM obtained by EnhanceNet are still considerable. Because the model is lightweight, the inference speed of EnhanceNet is very high, which is a great advantage.
Table 1 Evaluation metrics of the enhancement network EnhanceNet
S4: the enhanced or unreinforced visible light Y component and the infrared gray level diagram (i.e. the infrared image in the gray level diagram format) are respectively input into the visible light branch and the infrared branch of the fusion network fusion net, and the fusion result Y' is obtained after processing, and the working flow of the fusion network fusion net is briefly described below.
As shown in FIG. 3, the fusion network FusionNet is a lightweight convolutional neural network for fusing infrared and visible light images. Its network structure is composed of a feature extraction part, a fusion operation and an image reconstruction part, and its basic convolutional layer consists of three basic operations: a depthwise separable convolution, a batch normalization operation and a ReLU activation function.
The feature extraction part of FusionNet comprises an infrared branch and a visible light branch; the two branches use the same set of convolutional layers with identical structure and shared weights. Specifically, the feature extraction part consists of two convolutional layers and a residual convolution module, where the residual convolution module consists of two convolutional layers and a residual connection.
The fusion operation of FusionNet introduces a squeeze-and-excitation channel attention mechanism: the feature maps output by the infrared branch and the visible light branch are concatenated along the channel dimension; a side branch applies global average pooling to reduce the spatial dimensions to 1×1, then computes a weight coefficient for each channel through a fully connected layer, a ReLU, another fully connected layer and a Hard Sigmoid, and finally each channel of the original feature map is multiplied by its coefficient, completing the channel-attention recalibration of the original features.
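A sketch of this squeeze-and-excitation fusion step is given below; the reduction ratio of the bottleneck is an assumed value, and the fully connected layers are realized as 1×1 convolutions.

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """Sketch of the fusion operation: concatenate the infrared and visible
    feature maps along the channel axis, then recalibrate the channels with
    squeeze-and-excitation attention (GAP -> FC -> ReLU -> FC -> Hard Sigmoid)."""
    def __init__(self, channels: int, reduction: int = 4):  # reduction is an assumption
        super().__init__()
        c = 2 * channels                       # channels after concatenation
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # global average pooling -> 1x1
            nn.Conv2d(c, c // reduction, 1),   # first "fully connected" layer
            nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1),   # second "fully connected" layer
            nn.Hardsigmoid(),                  # per-channel weight in [0,1]
        )

    def forward(self, feat_ir, feat_vis):
        x = torch.cat([feat_ir, feat_vis], dim=1)
        return x * self.attn(x)                # channel-wise re-weighting
```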
The image reconstruction part of FusionNet is composed of two residual convolution modules and two convolutional layers, where, unlike in the feature extraction part, the residual convolution modules use the squeeze-and-excitation channel attention mechanism. The output layer uses a 1×1 convolution whose activation function limits the pixel values to [0, 1]. The output of this activation function is the fusion result Y' = FusionNet(Y, I_ir), where I_ir denotes the infrared image.
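Putting the pieces together, a simplified structural sketch of the fusion network is shown below; it reuses the DSConvBNReLU and SEFusion classes sketched earlier, omits the residual convolution modules, and assumes a Sigmoid output and illustrative channel counts, so it captures the data flow Y' = FusionNet(Y, I_ir) rather than the patent's exact architecture.

```python
import torch.nn as nn

class FusionNet(nn.Module):
    """Structural sketch: a weight-shared feature extractor for the visible Y
    component and the infrared image, SE-based fusion of the concatenated
    features, and a reconstruction head squashed to [0,1]."""
    def __init__(self, ch: int = 16):
        super().__init__()
        self.extract = nn.Sequential(            # shared by both branches
            DSConvBNReLU(1, ch), DSConvBNReLU(ch, ch),
        )
        self.fuse = SEFusion(ch)                 # channel-attention fusion
        self.reconstruct = nn.Sequential(
            DSConvBNReLU(2 * ch, ch), DSConvBNReLU(ch, ch),
            nn.Conv2d(ch, 1, 1),                 # 1x1 output convolution
            nn.Sigmoid(),                        # limit pixel values to [0,1]
        )

    def forward(self, y_vis, i_ir):
        f_vis = self.extract(y_vis)
        f_ir = self.extract(i_ir)                # shared weights
        return self.reconstruct(self.fuse(f_ir, f_vis))   # Y' = FusionNet(Y, I_ir)
```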
Table 2 shows the evaluation metrics obtained by testing the fusion network FusionNet on a real dataset with a resolution of 640 x 480 (↑ indicates that a larger value is better, whereas ↓ indicates that a smaller value is better); the test platform is Linux with an NVIDIA GeForce RTX 3090. It should be noted that, for images taken in night scenes, the visible light image referenced when computing the fusion-effect metrics of FusionNet is the RGB image processed by the enhancement network EnhanceNet, while for images taken under good illumination conditions the original RGB visible light image is referenced. The data in the table show that, similarly to the enhancement network, FusionNet achieves good performance metrics with a small number of parameters and a short average inference time.
Table 2 Evaluation metrics of the fusion network FusionNet
S5: and converting the YUV format image formed by the fusion result Y' and the UV component of the original visible light image into an RGB format according to the following formula to obtain a final fusion image.
In the invention, a two-stage joint training strategy is adopted when training the enhancement network EnhanceNet and the fusion network FusionNet; the two sub-networks are coupled into an overall network by exploiting the inherent relationship between the image enhancement and image fusion problems. This is described in detail below.
(1) Training enhanced network enhancement net
Supervised pre-training is first performed with a low-light image dataset in RGB format; the loss function consists of a reconstruction loss, a structural similarity loss and a smoothness loss, where the smoothness loss is introduced to suppress noise. The network is then fine-tuned in an unsupervised manner on the Y components of the low-illumination visible light images in the infrared and visible light image fusion dataset, with a loss function consisting of a spatial consistency loss, an exposure control loss and a smoothness loss. Specifically:
Supervised pre-training is needed first; the dataset consists of low-illumination images in RGB format, and the loss function is $L_{pre} = \lambda_1 L_{rec} + \lambda_2 L_{ssim} + \lambda_3 L_{smooth}$, where $\lambda_1$, $\lambda_2$, $\lambda_3$ are the adjustable weight coefficients of the loss terms. $L_{rec} = \frac{1}{HW}\sum \lvert E - GT \rvert$ is the reconstruction loss, where GT is the ground-truth normally exposed image, E is the image after low-light enhancement, and H and W are the height and width of the image; it measures the pixel-by-pixel absolute error between the output image and the real image. $L_{ssim} = 1 - SSIM(E, GT)$ is the structural similarity loss, with $SSIM(E,GT) = \frac{(2\mu_{GT}\mu_{E} + c_1)(2\sigma_{GT,E} + c_2)}{(\mu_{GT}^2 + \mu_{E}^2 + c_1)(\sigma_{GT}^2 + \sigma_{E}^2 + c_2)}$, where $\mu_{GT}$ and $\mu_{E}$ are the means of the real image and the output image, $\sigma_{GT,E}$ is their covariance, $\sigma_{GT}^2$ and $\sigma_{E}^2$ are their variances, and $c_1$ and $c_2$ are constants; this loss characterizes the structural similarity between the output image and the real image. $L_{smooth} = \frac{1}{HW}\sum (\lvert \nabla_h E \rvert + \lvert \nabla_v E \rvert)$ is the smoothness loss, where $\nabla_h$ and $\nabla_v$ are gradient operators in the horizontal and vertical directions; it is used to suppress noise and prevent the noise in the low-light image from being excessively amplified after enhancement.
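A sketch of this pre-training loss, assuming the standard forms of the three terms and placeholder weights, is given below; the SSIM term is delegated to the third-party pytorch_msssim package.

```python
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed third-party SSIM implementation

def gradient_hv(img):
    """Horizontal and vertical finite-difference gradients."""
    dh = img[..., :, 1:] - img[..., :, :-1]
    dv = img[..., 1:, :] - img[..., :-1, :]
    return dh, dv

def pretrain_loss(E, GT, w_rec=1.0, w_ssim=1.0, w_smooth=1.0):
    """Sketch of the supervised pre-training loss:
    L1 reconstruction + structural similarity + smoothness (noise suppression).
    Weights are placeholders, not the patent's tuned values."""
    l_rec = F.l1_loss(E, GT)                             # pixel-wise absolute error
    l_ssim = 1.0 - ssim(E, GT, data_range=1.0)           # structural similarity loss
    dh, dv = gradient_hv(E)
    l_smooth = dh.abs().mean() + dv.abs().mean()         # suppress amplified noise
    return w_rec * l_rec + w_ssim * l_ssim + w_smooth * l_smooth
```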
After pre-training is complete, during the two-stage joint training with the infrared and visible light image fusion dataset, FusionNet is trained from scratch while EnhanceNet is fine-tuned in an unsupervised manner; unsupervised learning is adopted because the visible light images in the fusion dataset are not paired with normally illuminated images. In addition, the fine-tuning stage converts the visible light image from RGB to YUV format and feeds only the Y component into EnhanceNet, i.e. the pre-training stage enhances RGB images while the fine-tuning stage enhances the Y component of the YUV image; the enhanced Y component is used directly as the input of the fusion network FusionNet. The loss function used in the fine-tuning stage is as follows:
$L_{ft} = \omega_1 L_{spa} + \omega_2 L_{exp} + \omega_3 L_{smooth}$, where the weight coefficients $\omega_1$, $\omega_2$, $\omega_3$ of the loss terms are adjustable parameters used to balance the scales of the different losses. $L_{spa}$ is the spatial consistency loss, which promotes spatial coherence of the enhanced image by preserving the differences between neighboring regions of the input image and the enhanced image: with $Y$ and $\hat{Y}$ denoting the visible light Y component before and after enhancement, the image is divided into K non-overlapping square regions of shape 4 x 4, each region is compared with its 4 neighboring regions of the same shape (up, down, left and right), and the quantities compared are the average luminance values of the enhanced image and of the input image within those regions. $L_{exp}$ is the exposure control loss, which measures the difference between the average intensity value of each 16 x 16 local region in the enhanced image and a well-exposed level $E_0$, where $E_0$ is an adjustable prior value, k is the index of the 16 x 16 non-overlapping local regions in the enhanced image, and M is the total number of such regions. $L_{smooth}$ is the smoothness loss, with the same role as in the pre-training stage, where the horizontal and vertical gradient values are computed with the Sobel gradient operators in the horizontal and vertical directions, respectively.
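The exposure control and spatial consistency terms can be sketched as follows, using Zero-DCE-style formulations as stand-ins; the patch sizes follow the description above, while the prior exposure level, the use of average pooling and the wrap-around neighbour handling via torch.roll are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def exposure_control_loss(E, patch=16, well_exposed=0.6):
    """Sketch of the exposure control loss: distance between the mean intensity
    of each 16x16 region and an adjustable prior level (0.6 is a placeholder)."""
    mean_patches = F.avg_pool2d(E, patch)
    return (mean_patches - well_exposed).abs().mean()

def spatial_consistency_loss(E, Y, region=4):
    """Sketch of the spatial consistency loss: the enhanced image should preserve
    the differences between each 4x4 region and its four neighbours that exist
    in the input Y component (neighbours wrap around at the borders here)."""
    e = F.avg_pool2d(E, region)
    y = F.avg_pool2d(Y, region)
    loss = 0.0
    for shift in ((0, 1), (0, -1), (1, 0), (-1, 0)):   # right, left, down, up
        e_n = torch.roll(e, shifts=shift, dims=(2, 3))
        y_n = torch.roll(y, shifts=shift, dims=(2, 3))
        loss = loss + ((e - e_n).abs() - (y - y_n).abs()).pow(2).mean()
    return loss
```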
Compared with the pre-training stage, the main changes in the loss function design of the fine-tuning stage are: the reconstruction loss and the structural similarity loss, which depend on reference images, are omitted; the spatial consistency loss is added; and the exposure control loss with the adjustable prior parameter $E_0$ is introduced. Adjusting this parameter can make the brightness distribution of the visible-light part of the fusion result better match the subjective perception of the human eye.
(2) Training fusion network fusion net
Self-supervised learning is adopted when training the fusion network FusionNet. The loss function consists of an intensity loss, a texture loss, a color consistency loss and an adaptive structural similarity loss. Although the enhanced visible light Y component is what is input into FusionNet, the color consistency loss refers to the RGB-format visible light image before enhancement; introducing this loss term not only reduces color distortion but also effectively couples the enhancement problem and the fusion problem together. Specifically:
the loss function employed is represented by the following formula:
wherein the method comprises the steps of、/>、/>、/>The weight coefficient of each loss term is an adjustable parameter; Is loss of strength, < >>、/>Representing the height and width of the image, max representing the maximum value selection strategy element by element, < >>、/>And->Respectively representing a Y component of the fusion image, an enhanced visible light Y component and an infrared image;it is a texture penalty that experiments find that the best texture of the fused image can be expressed as a set of element-wise maxima of the infrared and visible image textures, so the texture penalty is used to force the fused image to contain more texture details, where +.>Representing a Sobel gradient operator; />Is a color consistency loss, +.>Representing the original visible image before entering the whole network, i.e. without enhancement,/>Representation->The format-converted RGB image is spliced with the original UV,/or->、/>Representing three-dimensional vectors composed of RGB values at (i, j) positions of the fused image and the visible light image, respectively, the term loss functionThe two problems are coupled together by constraining the color distortion that may be caused by image enhancement and image fusion, respectively; />The self-adaptive structure similarity loss after the improvement of the image fusion problem is aimed at, wherein N represents the total number of sliding windows in the process of calculating SSIM, and the starting point is as follows: the structural similarity index SSIM is a sliding window with a fixed size>To calculate AND +. >And for the fusion problem, judging which source image has better image quality in the window by using a non-reference image evaluation index, and then calculating the SSIM by using the optimal and fused images. />The SSIM value representing the i-th sliding window area is calculated as follows:
Here the average gradient of an image within a sliding-window region measures how pronounced the texture information in that region is: the larger the average gradient, the richer the texture. For smooth regions where the pixel values change little, however, the average gradients of the corresponding windows in the infrared image and in the visible light Y component may differ only slightly, so brightness information is needed as an additional cue when judging which source image has the better quality within the window. Concretely, the proportion of pixels within the region at which the visible light Y component is larger than the infrared image is used as a weight coefficient for the average gradient of the visible light Y component; it is computed as $n / (H_w \times W_w)$, where n is the number of pixels within the region at which the value of the enhanced visible light Y component exceeds that of the infrared image, and $H_w$ and $W_w$ are the height and width of the sliding window. If the weighted average gradient of the visible light Y component within the region is greater than the average gradient of the infrared image, the SSIM of that region is computed between the visible light Y component and the fused image; otherwise it is computed between the infrared image and the fused image. The adaptive structural similarity loss designed in this way encourages the fused image to retain sufficient texture features, to capture regions with larger local brightness, and to fuse the salient characteristics of the source images.
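For the intensity and texture terms, whose forms follow directly from the definitions above, a sketch is given below; the color-consistency and adaptive-SSIM terms are omitted here, and the Sobel-gradient formulation is an assumption.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_grad(img):
    """Sobel gradient magnitude (|gx| + |gy|) of a single-channel image."""
    gx = F.conv2d(img, SOBEL_X.to(img.device), padding=1)
    gy = F.conv2d(img, SOBEL_Y.to(img.device), padding=1)
    return gx.abs() + gy.abs()

def intensity_texture_loss(Y_f, Y_vis_enh, I_ir, w_int=1.0, w_tex=1.0):
    """Sketch of the intensity and texture terms of the fusion loss: the fused
    image should follow the element-wise maximum of the source intensities and
    of the source gradient magnitudes."""
    l_int = (Y_f - torch.maximum(Y_vis_enh, I_ir)).abs().mean()
    g_f, g_vis, g_ir = sobel_grad(Y_f), sobel_grad(Y_vis_enh), sobel_grad(I_ir)
    l_tex = (g_f - torch.maximum(g_vis, g_ir)).abs().mean()
    return w_int * l_int + w_tex * l_tex
```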
In summary, the infrared and visible light image fusion method provided by the embodiment of the invention comprises two core modules, the enhancement network EnhanceNet and the fusion network FusionNet, and achieves a good fusion effect even when processing low-light images while keeping the model lightweight and fast at inference: it can enhance the brightness and contrast of the low-illumination visible light image, enrich the scene information, and improve the recognizability of salient thermal targets while preserving texture details and the original colors.
In this embodiment, a combination experiment of two core modules, namely an enhanced network enhancement net and a fusion network fusion net, is performed. Fig. 4 is an infrared image taken at night selected from a certain data set in the present embodiment, fig. 5 is a visible image paired with fig. 4 selected from a certain data set in an embodiment of the present invention, fig. 6 is a result of low-light image enhancement of fig. 5 by the enhancement network enhancement net of the present invention, and fig. 7 is a result of infrared and visible image fusion of fig. 4 and 6 by the fusion network fusion net of the present invention.
The performance metrics obtained with or without the enhancement step should be essentially indistinguishable, because in this embodiment the metric computation takes an infrared image, a visible light image and a fused image as inputs, and for images shot in night scenes the visible light image used when computing the fusion-effect metrics of FusionNet is the RGB image processed by the enhancement network EnhanceNet; in other words, when evaluating the performance of the fusion network it cannot be perceived whether the visible light image has been enhanced. The enhancement network is introduced mainly so that the fusion result is better suited to human visual perception, not to improve the fusion metrics of the fusion network itself. For pictures whose average brightness is relatively high, enhancement would instead destroy the original color distribution and cause color distortion, so an empirical threshold is set to prevent all images from being enhanced.
Embodiment 2 an electronic device with infrared and visible light image fusion
An electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the infrared and visible light image fusion method of embodiment 1.
The processor may also be referred to as a CPU (Central Processing Unit ). The processor may be an integrated circuit chip having signal processing capabilities. The processor may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Further, the memory may store the instructions and data necessary for the processor to operate.
Alternatively, the electronic device may be a notebook computer, a server, a development board.
Embodiment 3 a computer-readable storage medium
A computer readable storage medium storing computer instructions that, when executed by a processor, perform the method of infrared and visible light image fusion of embodiment 1.
In a computer readable storage medium, instructions/program data are stored, which when executed, implement the method described in embodiment 2 of the present application. Wherein the instructions/program data may be stored in the form of program files on the storage medium in the form of a software product for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or processor (processor) to perform all or a portion of the steps of a method according to various embodiments of the application.
Alternatively, the aforementioned storage medium may be a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk or any other medium capable of storing program code, or a device such as a computer, server, mobile phone or tablet.
Example 4 an infrared and visible image fusion System
A convolutional neural network-based lightweight infrared and visible light image fusion system, comprising:
the enhancement network enhancement Net is used for carrying out low-illumination image enhancement on the visible light image with lower overall illumination;
the fusion network fusion Net is used for fusing the Y component of the visible light image and the infrared gray image into Y';
the RGB-to-YUV module and the YUV-to-RGB module are used for completing the mutual conversion between the YUV format and the RGB format of the color image.
The infrared and visible light image fusion system can realize the following method; a code sketch of the pipeline follows step S5:
S1: performing image registration on a pair of infrared and visible light images acquired at similar positions in the same scene, so that the matching points of the two images are spatially aligned;
S2: converting the visible light image from RGB format to YUV format;
S3: judging whether the average brightness of the visible light image is lower than a threshold value, and if so, performing low-illumination image enhancement on the Y component with the enhancement network Enhance Net;
S4: inputting the enhanced or unenhanced visible light Y component and the infrared grayscale image into the visible light branch and the infrared branch of the fusion network Fusion Net respectively, and processing them to obtain the fusion result Y';
S5: converting the YUV image formed by the fusion result Y' and the UV components of the original visible light image into RGB format to obtain the final fused image.
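A minimal end-to-end sketch of steps S2-S5 for an already registered image pair, assuming PyTorch-style enhance_net and fusion_net callables, the conversion helpers sketched above, and an assumed brightness threshold of 0.4; the names, tensor shapes and threshold are illustrative only:

```python
import numpy as np
import torch

def fuse_pair(ir_gray: np.ndarray, vis_rgb: np.ndarray,
              enhance_net, fusion_net, threshold: float = 0.4) -> np.ndarray:
    """Steps S2-S5 for a registered infrared / visible pair (S1 is done upstream)."""
    y, cb, cr = rgb_to_y_cbcr(vis_rgb)                          # S2: RGB -> YCbCr
    y_t = torch.from_numpy(y).float().div(255.0)[None, None]    # 1x1xHxW, pixels in [0, 1]
    ir_t = torch.from_numpy(ir_gray).float().div(255.0)[None, None]

    with torch.no_grad():
        if y_t.mean().item() < threshold:                       # S3: enhance only dark images
            y_t = enhance_net(y_t)
        y_fused = fusion_net(y_t, ir_t)                         # S4: two-branch fusion -> Y'

    y_out = (y_fused.squeeze().clamp(0, 1) * 255).byte().numpy()
    return y_cbcr_to_rgb(y_out, cb, cr)                         # S5: Y'CbCr -> RGB
```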
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of this description.
The above embodiments are only specific implementations of the present invention and are not intended to limit it in any way. It should be noted that the image acquisition apparatus, the image registration method, the image resolution, the image content, the application scenario, the hardware platform on which the algorithm is deployed, the deep learning framework employed, and the like do not limit the present invention. The scope of the present invention is not limited to the above; any change or substitution that could readily occur to a person skilled in the art within the scope disclosed herein is intended to be covered. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A lightweight infrared and visible light image fusion method based on a convolutional neural network, characterized by comprising the following steps:
S1: performing image registration on a pair of infrared and visible light images acquired at similar positions in the same scene, so that the matching points of the two images are spatially aligned;
S2: converting the visible light image from RGB format to YUV format;
S3: judging whether the average brightness of the visible light image is lower than a threshold value; if so, performing low-illumination image enhancement on the Y component with the enhancement network Enhance Net; otherwise, proceeding directly to the next step;
S4: inputting the enhanced or unenhanced visible light Y component and the infrared image in grayscale format into the visible light branch and the infrared branch of the fusion network Fusion Net respectively, and processing them to obtain the fusion result Y';
S5: converting the YUV image formed by the fusion result Y' and the UV components of the original visible light image into RGB format to obtain the final fused image.
2. The convolutional neural network-based lightweight infrared and visible light image fusion method according to claim 1, characterized in that in step S3 only images whose average brightness is lower than the threshold value are enhanced with the enhancement network Enhance Net, and otherwise no enhancement is performed; the enhancement network Enhance Net contains only convolution layers and residual connections, and all layers except the output layers of its last three branches use the ReLU activation function.
3. The convolutional neural network-based lightweight infrared and visible light image fusion method according to claim 2, characterized in that the low-illumination image in step S3 first has its pixel values normalized and is then input into the enhancement network Enhance Net; the feature map produced by three ordinary convolution layers is processed by two multi-receptive-field convolution modules, which widen the network at the same level, to perform feature extraction and feature aggregation at different scales; the resulting feature map is split into three branches along the channel dimension, each branch undergoes two convolution operations, and five single-channel brightness-mapping-curve parameter maps for adjusting the pixel values of the image are finally output through a Tanh activation function; in the multi-receptive-field convolution module, 1×1 convolutions are used for dimensionality reduction, a 5×5 receptive field is obtained on one branch by cascading two 3×3 convolution kernels, several branches at the same level extract features from the input feature map with convolution-kernel combinations of different receptive fields, the extracted features are aggregated by channel concatenation, and after dimensionality reduction by a 1×1 convolution kernel the result is combined with the original feature map through a residual connection to obtain the final output feature map; each brightness-mapping-curve parameter map is a two-dimensional matrix used to apply a nonlinear transformation to the brightness values of the low-illumination image, wherein one parameter map supplies the coefficients of a cubic polynomial of the input image, which is applied iteratively, and two further parameter maps are used for a logarithmic transformation and a gamma transformation respectively; the input low-illumination image is passed through these brightness mapping curves, and three weight parameter maps are used to compute a weighted sum of the three intermediate brightness-mapped images, yielding the final enhanced image.
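As an illustrative, non-limiting sketch of the brightness mapping described in this claim: the parameter-map names (a, b, g, w1-w3), the number of polynomial iterations, and the exact curve and weighting forms below are assumptions, since the claim only fixes the cubic-polynomial, logarithmic and gamma curve families and the weighted summation.

```python
import torch

def apply_brightness_curves(y, a, b, g, w1, w2, w3, iters=4, eps=1e-6):
    """Illustrative per-pixel brightness mapping with three curve families.

    y          : low-light Y component in [0, 1], shape (N, 1, H, W)
    a, b, g    : Tanh-range parameter maps for the polynomial, logarithmic and
                 gamma curves (names and value ranges are assumptions)
    w1, w2, w3 : per-pixel weight maps that blend the three intermediate images
    """
    # Curve 1: cubic polynomial applied iteratively (exact form assumed)
    y1 = y
    for _ in range(iters):
        y1 = y1 + a * (y1 ** 3 - y1)

    # Curve 2: logarithmic mapping; map b from [-1, 1] to a positive strength
    beta = 1.0 + 9.0 * (b + 1.0) / 2.0
    y2 = torch.log1p(beta * y) / torch.log1p(beta)

    # Curve 3: gamma mapping; map g from [-1, 1] to an exponent in [0.25, 4]
    y3 = y.clamp(min=eps) ** (2.0 ** (2.0 * g))

    # Per-pixel weighted sum of the three brightness-mapped images
    return (w1 * y1 + w2 * y2 + w3 * y3).clamp(0.0, 1.0)
```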
4. The convolutional neural network-based lightweight infrared and visible light image fusion method according to claim 1, characterized in that the fusion network Fusion Net in step S4 is a lightweight convolutional neural network for fusing infrared and visible light images, and its network structure comprises a feature extraction part, a fusion operation and an image reconstruction part; the feature extraction part consists of two convolution layers and a residual convolution module, where the residual convolution module consists of two convolution layers and a residual connection; the weight parameters of the feature extraction part are shared by the infrared branch and the visible light branch, which helps reduce the number of parameters and guides the convolution layers to learn the similarity between the infrared and visible light images; the fusion operation introduces a squeeze-and-excitation channel attention mechanism: the feature maps output by the infrared branch and the visible light branch are concatenated along the channel dimension, a side branch is led out in which global average pooling reduces the spatial dimensions of the feature map to 1×1, a weight coefficient for each channel is then computed by a fully connected layer-ReLU-fully connected layer-Hard Sigmoid sequence, and each channel of the original feature map is finally multiplied by its coefficient, completing the channel-attention recalibration of the original features; the image reconstruction part consists of two residual convolution modules and two convolution layers, the difference being that the residual convolution modules in the image reconstruction part use the squeeze-and-excitation channel attention mechanism; each convolution layer of the fusion network Fusion Net actually consists of three basic operations: depthwise separable convolution, batch normalization and a ReLU activation function.
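A minimal PyTorch sketch of the two building blocks named in this claim, the depthwise separable convolution layer (depthwise separable convolution, batch normalization, ReLU) and the squeeze-and-excitation channel attention, with the kernel size and reduction ratio chosen arbitrarily for illustration:

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution + batch normalization + ReLU."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch, bias=False),  # depthwise
            nn.Conv2d(in_ch, out_ch, 1, bias=False),                               # pointwise
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class SEAttention(nn.Module):
    """Squeeze-and-excitation: global average pooling -> FC -> ReLU -> FC -> Hard Sigmoid -> rescale."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Hardsigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(n, c, 1, 1)  # one weight per channel
        return x * w                                       # recalibrate the original features
```

In the fusion operation, the feature maps of the infrared and visible light branches would first be concatenated along the channel dimension and then passed through such an attention block before image reconstruction.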
5. The convolutional neural network-based lightweight infrared and visible light image fusion method according to claim 1, characterized in that the method adopts a two-stage joint training strategy for the enhancement network Enhance Net and the fusion network Fusion Net, and couples the two sub-networks into a whole by exploiting the inherent connection between the image enhancement and image fusion problems;
in the first stage, the enhancement network Enhance Net is trained: supervised pre-training is first performed on an RGB low-light image dataset, with a loss function composed of a reconstruction loss, a structural similarity loss and a smoothness loss; unsupervised fine-tuning is then performed on the Y component of the low-illumination visible light images in the infrared and visible light image fusion dataset, with a loss function composed of a spatial consistency loss, an exposure control loss and a smoothness loss;
in the second stage, the fusion network Fusion Net is trained with self-supervised learning, and the loss function is composed of an intensity loss, a texture loss, a color consistency loss and an adaptive structural similarity loss, where the color consistency loss refers to the RGB visible light image before enhancement.
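The second-stage intensity and texture terms can be sketched as follows; this is one common self-supervised formulation under assumed definitions (element-wise maximum of the sources for intensity, Sobel gradients for texture, unit weights), and the color consistency and adaptive structural similarity terms are omitted for brevity:

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)

def gradient(img):
    """Sobel gradient magnitude of a single-channel image batch (N, 1, H, W)."""
    gx = F.conv2d(img, SOBEL_X, padding=1)
    gy = F.conv2d(img, SOBEL_X.transpose(2, 3), padding=1)
    return gx.abs() + gy.abs()

def fusion_loss(fused, ir, vis, w_int=1.0, w_tex=1.0):
    """Intensity + texture terms of the second-stage loss (weights assumed)."""
    loss_int = F.l1_loss(fused, torch.maximum(ir, vis))               # keep the brighter source
    loss_tex = F.l1_loss(gradient(fused),
                         torch.maximum(gradient(ir), gradient(vis)))  # keep the sharper texture
    return w_int * loss_int + w_tex * loss_tex
```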
6. The convolutional neural network-based lightweight infrared and visible light image fusion method according to claim 1, characterized in that step S2 converts the RGB format into the YCbCr form of the YUV format, and step S5 converts the YCbCr-format image back into the RGB format.
7. The convolutional neural network-based lightweight infrared and visible light image fusion method according to claim 1, characterized in that the specific image registration procedure in step S1 is as follows: edges are extracted with the Canny operator; feature points of the two edge images are detected with the SURF algorithm; feature point matching is performed using the prior knowledge that the slopes between correct matching point pairs are consistent; mismatched points are then removed with the random sample consensus (RANSAC) algorithm and a homography matrix for the coordinate transformation is estimated; finally, the image to be registered is multiplied by the homography matrix and cropped, giving a result aligned with the other image.
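An illustrative OpenCV sketch of this registration pipeline; SURF requires the opencv-contrib package, the Canny and RANSAC thresholds are assumed values, and the slope-consistency pre-filtering of matches is omitted here for brevity:

```python
import cv2
import numpy as np

def register_ir_to_vis(ir_gray: np.ndarray, vis_gray: np.ndarray) -> np.ndarray:
    """Edge-guided registration: Canny -> SURF -> matching -> RANSAC homography -> warp."""
    ir_edges = cv2.Canny(ir_gray, 50, 150)
    vis_edges = cv2.Canny(vis_gray, 50, 150)

    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)   # opencv-contrib only
    kp1, des1 = surf.detectAndCompute(ir_edges, None)
    kp2, des2 = surf.detectAndCompute(vis_edges, None)

    matches = sorted(cv2.BFMatcher(cv2.NORM_L2).match(des1, des2),
                     key=lambda m: m.distance)[:200]
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC removes remaining mismatches while estimating the homography
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = vis_gray.shape[:2]
    return cv2.warpPerspective(ir_gray, H, (w, h))              # infrared aligned to the visible image
```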
8. A lightweight infrared and visible light image fusion system based on a convolutional neural network for performing the infrared and visible light image fusion method of any one of claims 1-7, comprising:
the enhancement network Enhance Net, used for performing low-illumination image enhancement on visible light images whose overall illumination is low;
the fusion network Fusion Net, used for fusing the Y component of the visible light image and the infrared grayscale image into the fusion result Y';
the RGB-to-YUV module and the YUV-to-RGB module, used for converting color images between the YUV and RGB formats.
9. An electronic device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, characterized in that the computer instructions, when executed by the processor, perform the infrared and visible light image fusion method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of infrared and visible light image fusion of any one of claims 1-7.
CN202310924379.4A 2023-07-26 2023-07-26 Light infrared and visible light image fusion method based on convolutional neural network Active CN116681636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310924379.4A CN116681636B (en) 2023-07-26 2023-07-26 Light infrared and visible light image fusion method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310924379.4A CN116681636B (en) 2023-07-26 2023-07-26 Light infrared and visible light image fusion method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN116681636A true CN116681636A (en) 2023-09-01
CN116681636B CN116681636B (en) 2023-12-12

Family

ID=87789429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310924379.4A Active CN116681636B (en) 2023-07-26 2023-07-26 Light infrared and visible light image fusion method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN116681636B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106600572A (en) * 2016-12-12 2017-04-26 长春理工大学 Adaptive low-illumination visible image and infrared image fusion method
US20220014447A1 (en) * 2018-07-03 2022-01-13 Kabushiki Kaisha Ubitus Method for enhancing quality of media
CN114841904A (en) * 2022-03-03 2022-08-02 浙江大华技术股份有限公司 Image fusion method, electronic equipment and storage device
CN115034974A (en) * 2022-05-06 2022-09-09 电子科技大学 Method and equipment for restoring natural color of visible light and infrared fusion image and storage medium
CN115063329A (en) * 2022-06-10 2022-09-16 中国人民解放军国防科技大学 Visible light and infrared image fusion enhancement method and system under low-illumination environment
CN115457249A (en) * 2022-09-13 2022-12-09 北方民族大学 Method and system for fusing and matching infrared image and visible light image

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHANFEI LI et al., "Color image fusion utilizing NSCT and YUV transform", 2019 IEEE 3rd Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), pp. 1470-1474 *
CHEN WEI et al., "Deep Retinex Decomposition for Low-Light Enhancement", https://arxiv.org/abs/1808.04560, pp. 1-12 *
LINBO WANG et al., "Near-infrared fusion for deep lightness enhancement", International Journal of Machine Learning and Cybernetics, pp. 1621-1633 *
WANG Jian et al., "Visible and infrared image fusion based on YUV and wavelet transform", Journal of Xi'an Technological University, vol. 33, no. 3, pp. 208-211 *
DONG Yifeng, "Research on fusion technology of visible light and infrared images in EFVS", China Master's Theses Full-Text Database (Information Science and Technology), pp. 138-1395 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351049A (en) * 2023-12-04 2024-01-05 四川金信石信息技术有限公司 Thermal imaging and visible light fusion measuring point registration guiding method, device and medium
CN117351049B (en) * 2023-12-04 2024-02-13 四川金信石信息技术有限公司 Thermal imaging and visible light fusion measuring point registration guiding method, device and medium
CN117808721A (en) * 2024-02-28 2024-04-02 深圳市瓴鹰智能科技有限公司 Low-illumination image enhancement method, device, equipment and medium based on deep learning
CN117808721B (en) * 2024-02-28 2024-05-03 深圳市瓴鹰智能科技有限公司 Low-illumination image enhancement method, device, equipment and medium based on deep learning
CN117893413A (en) * 2024-03-15 2024-04-16 博创联动科技股份有限公司 Vehicle-mounted terminal man-machine interaction method based on image enhancement

Also Published As

Publication number Publication date
CN116681636B (en) 2023-12-12


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant