CN113033630A - Infrared and visible light image deep learning fusion method based on double non-local attention models - Google Patents

Infrared and visible light image deep learning fusion method based on double non-local attention models

Info

Publication number
CN113033630A
CN113033630A (application CN202110258048.2A)
Authority
CN
China
Prior art keywords
fusion
features
attention
infrared
visible light
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110258048.2A
Other languages
Chinese (zh)
Inventor
王志社
武园园
王君尧
邵文禹
陈彦林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Science and Technology
Original Assignee
Taiyuan University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Science and Technology filed Critical Taiyuan University of Science and Technology
Priority to CN202110258048.2A priority Critical patent/CN113033630A/en
Publication of CN113033630A publication Critical patent/CN113033630A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to an infrared and visible light image deep learning fusion method based on a double non-local attention model. The method specifically comprises the following steps: constructing a multi-scale depth network and extracting the multi-scale depth features of the two types of images with an encoding-decoding network model, wherein the encoding network comprises a common convolution module and two multi-scale convolution modules; a fusion layer uses a dual spatial and channel non-local attention model to enhance and combine the multi-scale depth features into the final fusion feature; and a decoding network reconstructs the final fusion feature to obtain the fused image. The method addresses the loss of target regions and texture details in the fused image that arises because existing deep learning fusion methods rely only on convolution operations to extract local features and do not fully consider their multi-scale characteristics and global correlation, and it can be applied to fields such as remote sensing detection, medical diagnosis, intelligent driving and security monitoring.

Description

Infrared and visible light image deep learning fusion method based on double non-local attention models
Technical Field
The invention relates to an image fusion method in the field of image processing and artificial intelligence, in particular to an infrared and visible light image deep learning fusion method based on a double non-local attention model.
Background
Infrared and visible light image fusion technology is widely applied in fields such as remote sensing detection, medical diagnosis, intelligent driving and security monitoring. Because of differences in imaging mechanisms and hardware conditions, a single image sensor can only capture part of the scene information. Infrared image sensors rely on thermal radiation imaging and reflect the radiation characteristics of background targets, but generally lack structural features and texture details. Visible light image sensors rely on reflected light for imaging and can describe the scene information of the environment, but are easily affected by changes in illumination and weather. The purpose of image fusion is to obtain a composite image with rich scene details and prominent targets, suitable for human visual perception and other vision tasks. Image fusion technology is therefore an important prerequisite for improving the level of infrared and visible light image detection and recognition.
Currently, infrared and visible light image fusion techniques can be broadly classified into conventional fusion methods and deep learning fusion methods. Conventional fusion methods usually adopt a unified feature transformation or feature representation to solve the fusion problem, and hand-designed feature extraction and representation are one of their main difficulties. Convolution operations have strong image feature extraction capability, and deep-learning-based methods driven by big data have achieved satisfactory fusion results. However, current deep learning fusion methods still have several shortcomings. First, limited by the filter kernel size and the network depth, most methods process only the depth features of the last layer and cannot effectively extract multi-scale depth features. In fact, the multi-scale characteristics of an image play an important role in machine vision, and a single-scale depth feature cannot effectively represent the spatial information of large targets in the image. Second, these methods focus only on extracting local features and ignore their global correlation; the global correlation of local features can effectively highlight useful information and suppress irrelevant information, which is important for the fusion task. Third, current methods directly use the depth features obtained by the convolutional neural network for combination and reconstruction, but these depth features are not refined or enhanced, which leads to the loss of target regions and texture details in the fused image.
In summary, there is an urgent need for a method that can effectively extract the multi-scale depth features of an image, establish the global correlation of those multi-scale depth features, suppress irrelevant information while enhancing useful information, and strengthen the depth feature representation capability, thereby improving the fusion effect of infrared and visible light images.
Disclosure of Invention
To solve the problem that target regions and texture details are lost in the fused image because existing deep learning fusion methods rely only on convolution operations to extract local features and do not fully consider the multi-scale characteristics and global correlation of those local features, the invention provides an infrared and visible light image deep learning fusion method based on a double non-local attention model.
The invention relates to an infrared and visible light image deep learning fusion method based on a double non-local attention model, which comprises the following steps:
designing and constructing a multi-scale depth network: the multi-scale depth network comprises two sub-networks of encoding and decoding, the encoding sub-network is used for extracting multi-scale depth features of a source image, and the decoding sub-network is used for reconstructing final fusion features to obtain a fusion image;
designing a fusion layer: for the input infrared and visible light images, the multi-scale depth features of the two types of images are respectively extracted by the encoding sub-network; the spatial and channel attention features of the two types of images are respectively obtained using the spatial and channel non-local attention modules; the spatial attention features of the two types of images are fused to obtain a spatial attention fusion feature, and the channel attention features of the two types of images are fused to obtain a channel attention fusion feature; finally, the spatial and channel attention fusion features are weighted to obtain the final fusion feature.
In the infrared and visible light image deep learning fusion method based on the double non-local attention model, the encoding sub-network for extracting the multi-scale depth features consists of a common convolution module and two multi-scale convolution modules. The common convolution module uses a 3 × 3 convolution kernel with stride 1, padding 0 and 16 filters, and the convolution layer is followed by a rectified linear unit (ReLU); it is used to extract the shallow depth features of the image. The two multi-scale convolution modules are used to extract the multi-scale depth features of the image, with a scale of 4 and 64 output channels. Dense connections between the common convolution module and the two multi-scale convolution modules enhance the feature characterization capability.
According to the infrared and visible light image deep learning fusion method based on the double non-local attention model, the spatial attention feature fusion process is as follows: for the multi-scale depth features Φ_I and Φ_V of the infrared and visible light images, the corresponding spatial attention features Φ_I^S and Φ_V^S are obtained through the spatial attention module; the spatial attention features are then used to calculate the weighting coefficients of the infrared and visible light images, denoted W_I^S and W_V^S respectively; finally, the weighting coefficients of the infrared and visible light images are multiplied with the corresponding depth features to obtain the spatial attention fusion feature F^S.
In the above infrared and visible light image deep learning fusion method based on the double non-local attention model, the channel attention feature fusion process is as follows: for the multi-scale depth features Φ_I and Φ_V of the infrared and visible light images, the corresponding channel attention features Φ_I^C and Φ_V^C are obtained through the channel attention module; the channel attention features are then used to calculate the weighting coefficients of the infrared and visible light images, denoted W_I^C and W_V^C respectively; finally, the weighting coefficients of the infrared and visible light images are multiplied with the corresponding depth features to obtain the channel attention fusion feature F^C.
According to the infrared and visible light image deep learning fusion method based on the double non-local attention model, an 8 × 8 pooling operation is selected for the spatial non-local attention model.
Compared with the existing deep learning fusion technology, the invention has the following advantages:
1. according to the invention, a multi-scale convolution module is embedded into the encoding sub-network; the encoding network can effectively extract the multi-scale depth features of an image, the dense connections retain the depth features of the middle layers of the network, and together the multi-scale features and the dense connections effectively enhance the depth feature characterization capability;
2. according to the method, a dual spatial and channel non-local attention model is adopted to enhance and fuse the multi-scale depth features from the spatial and channel dimensions, enhancing useful information and suppressing irrelevant information; the fused image highlights the infrared target information while retaining the abundant detail information of the visible light image;
3. the invention establishes a deep learning fusion method oriented to the double non-local attention model, which significantly improves the fusion effect; it can also be applied to the fusion of multi-modal images, multi-focus images and medical images, and has high application value in the field of image fusion.
Drawings
FIG. 1 is a schematic fusion diagram of the method of the present invention.
FIG. 2 is a schematic diagram of the training of the method of the present invention.
FIG. 3 is a schematic diagram of a multi-scale convolution module of the method of the present invention.
FIG. 4 is a schematic diagram of a fusion of dual non-local attention features according to the method of the present invention.
FIG. 5 is a schematic view of a spatial and channel attention model of the method of the present invention.
FIG. 6 is a first set of graphs of an infrared and visible image fusion experiment.
FIG. 7 is a second set of infrared and visible light image fusion experimental graphs.
FIG. 8 is a third set of experimental graphs of infrared and visible light image fusion.
Detailed Description
A deep learning fusion method for infrared and visible light images based on a double non-local attention model comprises the following steps:
s1: and designing and constructing a multi-scale depth network. The multi-scale depth network comprises two sub-networks of encoding and decoding. The coding network is used for extracting the multi-scale depth features of the source image. And the decoding network is used for reconstructing the final fusion characteristics to obtain a fusion image.
S11: Encoding network composition. The encoding network consists of a common convolution module and two multi-scale convolution modules. The common convolution module uses a 3 × 3 convolution kernel with stride 1, padding 0 and 16 filters, and the convolution layer is followed by a rectified linear unit (ReLU); it extracts the shallow depth features of the image. The two multi-scale convolution modules extract the multi-scale depth features of the image. To make full use of the intermediate features of the deep network, dense connections are applied between the common convolution module and the two multi-scale convolution modules.
S12: Multi-scale convolution module. The depth feature F_x output by the common convolution module is passed through a 1 × 1 convolution and divided into s sub-features x_i, i = 1, 2, …, s. Except for x_1, each sub-feature is processed by a 3 × 3 convolution followed by a rectified linear unit (ReLU), denoted C_i(·), and the output of C_{i-1}(·) is added to the input of C_i(·). The output y_i can then be expressed as
y_i = x_i for i = 1; y_i = C_i(x_i) for i = 2; y_i = C_i(x_i + y_{i-1}) for 2 < i ≤ s,
where s is a scale control parameter. The outputs y_i are then concatenated along the channel dimension (concat) and added to the input F_x to obtain the final output feature F_y = concat(y_i) + F_x. Experiments show that the scale control parameter s takes the value 4, and the number of output channels of the multi-scale convolution module is 64.
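As an illustration, a minimal PyTorch sketch of this multi-scale convolution module (a Res2Net-style split, which the patent cites) is given below. The padding of the 3 × 3 convolutions and the 1 × 1 projection that lets the residual addition with F_x match channel counts are assumptions, not details taken from the patent text.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Sketch of the multi-scale convolution module of S12."""
    def __init__(self, in_channels=16, out_channels=64, scale=4):
        super().__init__()
        assert out_channels % scale == 0
        self.scale = scale
        width = out_channels // scale                       # channels per sub-feature x_i
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # 3x3 conv + ReLU for every sub-feature except x_1
        self.convs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(scale - 1)
        )
        # assumption: 1x1 projection so the residual addition with F_x matches channels
        self.project = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, f_x):
        xs = torch.chunk(self.reduce(f_x), self.scale, dim=1)   # split into s sub-features
        ys = [xs[0]]                                            # y_1 = x_1
        for i in range(1, self.scale):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]           # y_i = C_i(x_i + y_{i-1})
            ys.append(self.convs[i - 1](inp))
        return torch.cat(ys, dim=1) + self.project(f_x)         # F_y = concat(y_i) + F_x
```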
Owing to the dense connections, the shallow depth features output by the common convolution module serve as the input of the first multi-scale convolution module; these shallow depth features and the output features of the first multi-scale convolution module are concatenated along the channel dimension (concat) as the input of the second multi-scale convolution module; finally, the shallow depth features output by the common convolution module and the output features of the two multi-scale convolution modules are concatenated along the channel dimension to obtain the multi-scale depth feature Φ.
S13: Decoding network composition. The decoding sub-network comprises 4 common convolution modules. Each convolution uses a 3 × 3 kernel with stride 1 and padding 0, each convolution layer is followed by a rectified linear unit (ReLU), and the numbers of convolution filters are 112, 64, 32 and 16, respectively.
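For orientation, the encoder/decoder wiring of S11-S13 might look like the sketch below, which assumes the MultiScaleConv class from the previous sketch is in scope. The input-channel counts of the dense connections (16 + 64 and 16 + 64 + 64), the padding of 1 that keeps the feature maps at full size, and the final single-channel reconstruction layer are assumptions inferred from the stated filter numbers.

```python
import torch
import torch.nn as nn

def conv_relu(in_ch, out_ch):
    # common convolution module: 3x3 convolution followed by ReLU
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.common = conv_relu(1, 16)            # shallow depth features
        self.ms1 = MultiScaleConv(16, 64)         # first multi-scale module
        self.ms2 = MultiScaleConv(16 + 64, 64)    # dense connection: concat(common, ms1)

    def forward(self, x):
        f0 = self.common(x)
        f1 = self.ms1(f0)
        f2 = self.ms2(torch.cat([f0, f1], dim=1))
        return torch.cat([f0, f1, f2], dim=1)     # multi-scale depth feature Φ

class Decoder(nn.Module):
    def __init__(self, in_ch=16 + 64 + 64):
        super().__init__()
        self.net = nn.Sequential(
            conv_relu(in_ch, 112), conv_relu(112, 64),
            conv_relu(64, 32), conv_relu(32, 16),
            nn.Conv2d(16, 1, 3, padding=1),       # assumed single-channel reconstruction layer
        )

    def forward(self, phi):
        return self.net(phi)
```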
S2: Fusion layer design. For the input infrared and visible light images, the multi-scale depth features of the two images are respectively extracted by the encoding network, and the final fusion feature is obtained using the dual spatial and channel non-local attention modules.
S21: Spatial non-local attention model. Spatial non-local attention establishes the global correlation of local features along the spatial dimension. For a multi-scale depth feature Φ ∈ R^(C×H×W), the feature is first reshaped and transposed to obtain Φ_A ∈ R^(HW×C); an n × n pooling operation and reshaping are then applied to the multi-scale depth feature to obtain two further depth features Φ_B and Φ_C. Next, Φ_A and Φ_B are matrix-multiplied and a SoftMax operation yields the spatial attention map M^S, whose element m^S_ij indicates the effect of the i-th position on the j-th position. Finally, M^S and Φ_C are matrix-multiplied and the result is reshaped to obtain the spatial non-local attention feature Φ^S.
In the spatial non-local attention module, the n × n pooling operation is used to reduce the amount of computation. Experiments verify that 1 × 1, 2 × 2, 4 × 4 and 8 × 8 pooling operations do not affect the image fusion performance, so the 8 × 8 pooling operation is finally selected.
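A minimal PyTorch sketch of this spatial non-local attention computation is given below, under the assumption that the n × n pooling is average pooling and that any learnable projection layers are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialNonLocalAttention(nn.Module):
    """Sketch of the spatial non-local attention model of S21 (n = 8 pooling)."""
    def __init__(self, pool=8):
        super().__init__()
        self.pool = pool

    def forward(self, phi):                                      # phi: (B, C, H, W)
        b, c, h, w = phi.shape
        phi_a = phi.view(b, c, h * w).transpose(1, 2)            # Φ_A: (B, HW, C)
        pooled = F.avg_pool2d(phi, self.pool)                    # n x n pooling
        hw_p = pooled.shape[2] * pooled.shape[3]
        phi_b = pooled.view(b, c, hw_p)                          # Φ_B: (B, C, HW/n^2)
        phi_c = pooled.view(b, c, hw_p)                          # Φ_C: (B, C, HW/n^2)
        attn = torch.softmax(phi_a @ phi_b, dim=-1)              # spatial attention map M^S
        out = (attn @ phi_c.transpose(1, 2)).transpose(1, 2)     # (B, C, HW)
        return out.view(b, c, h, w)                              # spatial attention feature Φ^S
```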
S22: Channel non-local attention model. Channel non-local attention establishes the global correlation of local features along the channel dimension. For a multi-scale depth feature Φ ∈ R^(C×H×W), the feature is first reshaped to obtain Φ ∈ R^(C×HW); the feature Φ ∈ R^(C×HW) and its transpose are then matrix-multiplied and a SoftMax operation yields the channel attention map M^C, whose element m^C_ij indicates the effect of the i-th channel on the j-th channel. Finally, M^C and Φ are matrix-multiplied and the result is reshaped to obtain the channel non-local attention feature Φ^C.
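The corresponding channel non-local attention can be sketched as follows; since the attention map here is only C × C, no pooling is needed.

```python
import torch
import torch.nn as nn

class ChannelNonLocalAttention(nn.Module):
    """Sketch of the channel non-local attention model of S22."""
    def forward(self, phi):                                          # phi: (B, C, H, W)
        b, c, h, w = phi.shape
        flat = phi.view(b, c, h * w)                                 # Φ: (B, C, HW)
        attn = torch.softmax(flat @ flat.transpose(1, 2), dim=-1)    # channel attention map M^C
        out = attn @ flat                                            # (B, C, HW)
        return out.view(b, c, h, w)                                  # channel attention feature Φ^C
```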
S23: Spatial attention feature fusion. For the multi-scale depth features Φ_I and Φ_V of the infrared and visible light images, the corresponding spatial attention features Φ_I^S and Φ_V^S are obtained through the spatial attention module; the spatial attention features are then used to calculate the weighting coefficients of the infrared and visible light images, denoted W_I^S and W_V^S respectively; finally, the weighting coefficients of the infrared and visible light images are multiplied with the corresponding depth features to obtain the spatial attention fusion feature F^S.
S24: Channel attention feature fusion. For the multi-scale depth features Φ_I and Φ_V of the infrared and visible light images, the corresponding channel attention features Φ_I^C and Φ_V^C are obtained through the channel attention module; the channel attention features are then used to calculate the weighting coefficients of the infrared and visible light images, denoted W_I^C and W_V^C respectively; finally, the weighting coefficients of the infrared and visible light images are multiplied with the corresponding depth features to obtain the channel attention fusion feature F^C.
S25: Spatial and channel attention feature fusion. The spatial and channel attention fusion features of the infrared and visible light images are weighted to obtain the final fusion feature F. Finally, the fused image is obtained by passing the final fusion feature through the decoding network.
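A sketch of the fusion in S23-S25 is shown below. The exact rule for deriving the weighting coefficients from the attention features, and the final weighting of the spatial and channel fusion features, appear only in the patent's equation images; the per-position softmax weighting and the equal averaging used here are therefore assumptions.

```python
import torch

def attention_weighted_fusion(phi_i, phi_v, att_i, att_v):
    # Activity maps from the attention features: L1 norm over channels (an assumption).
    a_i = att_i.abs().sum(dim=1, keepdim=True)
    a_v = att_v.abs().sum(dim=1, keepdim=True)
    w = torch.softmax(torch.stack([a_i, a_v], dim=0), dim=0)   # per-position softmax weights
    return w[0] * phi_i + w[1] * phi_v                         # W_I * Φ_I + W_V * Φ_V

def dual_attention_fusion(phi_i, phi_v, spatial_att, channel_att):
    # spatial_att / channel_att: modules returning Φ^S and Φ^C (see the sketches above)
    f_s = attention_weighted_fusion(phi_i, phi_v, spatial_att(phi_i), spatial_att(phi_v))
    f_c = attention_weighted_fusion(phi_i, phi_v, channel_att(phi_i), channel_att(phi_v))
    return 0.5 * (f_s + f_c)   # final fusion feature F, fed to the decoder (equal weighting assumed)
```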
S3: Network model training. The MS-COCO image dataset is adopted, the input visible light images are converted to grayscale and resized, and the network model is trained with the structural similarity and root mean square error as loss functions to obtain the network model parameters.
S31: Select the training dataset. 80000 visible light images are selected from the MS-COCO image dataset as the training set; the image grayscale range is converted to [0, 255] and the size is converted to 256 × 256.
S32: Set the training parameters. The overall loss function consists of the structural similarity loss and the root mean square error loss, L_Total = L_MSE + β·L_SSIM, where the structural similarity loss function is L_SSIM = 1 − SSIM(O, I) and the error loss function is L_MSE = Σ‖O − I‖², with I and O the input and output images, SSIM the structural similarity operator, and β a hyper-parameter controlling the network balance; β takes the value 1 in the invention. The batch size and number of epochs are 5 and 4, respectively, and the learning rate is 0.0001.
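For concreteness, the preprocessing of S31 and the loss of S32 might be sketched as below. The uniform-window SSIM is a simplified stand-in for the structural similarity operator; its window size and constants (which assume images scaled to [0, 1]) are assumptions, as is the use of torchvision for the grayscale/resize steps.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# S31: grayscale conversion and resizing to 256 x 256 (ToTensor scales pixels to [0, 1]).
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

def ssim(x, y, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified SSIM using a uniform window instead of a Gaussian one.
    pad = window // 2
    mu_x, mu_y = F.avg_pool2d(x, window, 1, pad), F.avg_pool2d(y, window, 1, pad)
    sigma_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).mean()

def total_loss(output, target, beta=1.0):
    # L_Total = L_MSE + beta * L_SSIM, with L_MSE as the sum of squared differences.
    l_mse = F.mse_loss(output, target, reduction='sum')
    l_ssim = 1.0 - ssim(output, target)
    return l_mse + beta * l_ssim
```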

Claims (5)

1. A method for deep learning fusion of infrared and visible light images based on a double non-local attention model, characterized by comprising the following steps:
designing and constructing a multi-scale depth network: the multi-scale depth network comprises two sub-networks of encoding and decoding, the encoding sub-network is used for extracting multi-scale depth features of a source image, and the decoding sub-network is used for reconstructing final fusion features to obtain a fusion image;
designing a fusion layer: for the input infrared and visible light images, the multi-scale depth features of the two types of images are respectively extracted by the encoding sub-network; the spatial and channel attention features of the two types of images are respectively obtained using the spatial and channel non-local attention modules; the spatial attention features of the two types of images are fused to obtain a spatial attention fusion feature, and the channel attention features of the two types of images are fused to obtain a channel attention fusion feature; finally, the spatial and channel attention fusion features are weighted to obtain the final fusion feature.
2. The infrared and visible light image deep learning fusion method based on the double non-local attention model according to claim 1, characterized in that: the encoding sub-network for extracting the multi-scale depth features consists of a convolution module and two multi-scale convolution modules; the convolution module uses a 3 × 3 convolution kernel with stride 1, padding 0 and 16 filters, the convolution layer is followed by a rectified linear unit, and the module is used to extract the shallow depth features of the image; the two multi-scale convolution modules are used to extract the multi-scale depth features of the image, with a scale of 4 and 64 output channels, and dense connections between the convolution module and the two multi-scale convolution modules are adopted to enhance the feature characterization capability.
3. The infrared and visible light image deep learning fusion method based on the double non-local attention model according to claim 2, characterized in that: the spatial attention feature fusion process is as follows: for the multi-scale depth features Φ_I and Φ_V of the infrared and visible light images, the corresponding spatial attention features Φ_I^S and Φ_V^S are obtained through the spatial attention module; the spatial attention features are then used to calculate the weighting coefficients of the infrared and visible light images, denoted W_I^S and W_V^S respectively; finally, the weighting coefficients of the infrared and visible light images are multiplied with the corresponding depth features to obtain the spatial attention fusion feature F^S.
4. The infrared and visible light image deep learning fusion method based on the double non-local attention model according to claim 3, characterized in that: the channel attention feature fusion process is as follows: for the multi-scale depth features Φ_I and Φ_V of the infrared and visible light images, the corresponding channel attention features Φ_I^C and Φ_V^C are obtained through the channel attention module; the channel attention features are then used to calculate the weighting coefficients of the infrared and visible light images, denoted W_I^C and W_V^C respectively; finally, the weighting coefficients of the infrared and visible light images are multiplied with the corresponding depth features to obtain the channel attention fusion feature F^C.
5. The infrared and visible light image deep learning fusion method based on the double non-local attention model according to claim 1, 2, 3 or 4, characterized in that: the spatial non-local attention model selects an 8 × 8 pooling operation.
CN202110258048.2A 2021-03-09 2021-03-09 Infrared and visible light image deep learning fusion method based on double non-local attention models Pending CN113033630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110258048.2A CN113033630A (en) 2021-03-09 2021-03-09 Infrared and visible light image deep learning fusion method based on double non-local attention models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110258048.2A CN113033630A (en) 2021-03-09 2021-03-09 Infrared and visible light image deep learning fusion method based on double non-local attention models

Publications (1)

Publication Number Publication Date
CN113033630A true CN113033630A (en) 2021-06-25

Family

ID=76468676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110258048.2A Pending CN113033630A (en) 2021-03-09 2021-03-09 Infrared and visible light image deep learning fusion method based on double non-local attention models

Country Status (1)

Country Link
CN (1) CN113033630A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343953A (en) * 2021-08-05 2021-09-03 南京信息工程大学 FGR-AM method and system for remote sensing scene recognition
CN113487530A (en) * 2021-08-02 2021-10-08 广东工业大学 Infrared and visible light fusion imaging method based on deep learning
CN113837353A (en) * 2021-08-17 2021-12-24 中国地质大学(武汉) Convolutional neural network feature fusion algorithm based on feature screening and deep fusion
CN114972812A (en) * 2022-06-02 2022-08-30 华侨大学 Non-local attention learning method based on structural similarity
CN115311186A (en) * 2022-10-09 2022-11-08 济南和普威视光电技术有限公司 Cross-scale attention confrontation fusion method for infrared and visible light images and terminal
CN115330658A (en) * 2022-10-17 2022-11-11 中国科学技术大学 Multi-exposure image fusion method, device, equipment and storage medium
CN117115442A (en) * 2023-08-17 2023-11-24 浙江航天润博测控技术有限公司 Semantic segmentation method based on visible light-infrared photoelectric reconnaissance image fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111080629A (en) * 2019-12-20 2020-04-28 河北工业大学 Method for detecting image splicing tampering
CN111563418A (en) * 2020-04-14 2020-08-21 浙江科技学院 Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN111709902A (en) * 2020-05-21 2020-09-25 江南大学 Infrared and visible light image fusion method based on self-attention mechanism

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111080629A (en) * 2019-12-20 2020-04-28 河北工业大学 Method for detecting image splicing tampering
CN111563418A (en) * 2020-04-14 2020-08-21 浙江科技学院 Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN111709902A (en) * 2020-05-21 2020-09-25 江南大学 Infrared and visible light image fusion method based on self-attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GAO HUANG ET AL.: "Densely Connected Convolutional Networks", arXiv *
SHANG-HUA GAO ET AL.: "Res2Net: A New Multi-scale Backbone Architecture", arXiv *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113487530A (en) * 2021-08-02 2021-10-08 广东工业大学 Infrared and visible light fusion imaging method based on deep learning
CN113487530B (en) * 2021-08-02 2023-06-16 广东工业大学 Infrared and visible light fusion imaging method based on deep learning
CN113343953A (en) * 2021-08-05 2021-09-03 南京信息工程大学 FGR-AM method and system for remote sensing scene recognition
CN113837353A (en) * 2021-08-17 2021-12-24 中国地质大学(武汉) Convolutional neural network feature fusion algorithm based on feature screening and deep fusion
CN114972812A (en) * 2022-06-02 2022-08-30 华侨大学 Non-local attention learning method based on structural similarity
CN115311186A (en) * 2022-10-09 2022-11-08 济南和普威视光电技术有限公司 Cross-scale attention confrontation fusion method for infrared and visible light images and terminal
CN115311186B (en) * 2022-10-09 2023-02-03 济南和普威视光电技术有限公司 Cross-scale attention confrontation fusion method and terminal for infrared and visible light images
CN115330658A (en) * 2022-10-17 2022-11-11 中国科学技术大学 Multi-exposure image fusion method, device, equipment and storage medium
CN115330658B (en) * 2022-10-17 2023-03-10 中国科学技术大学 Multi-exposure image fusion method, device, equipment and storage medium
CN117115442A (en) * 2023-08-17 2023-11-24 浙江航天润博测控技术有限公司 Semantic segmentation method based on visible light-infrared photoelectric reconnaissance image fusion

Similar Documents

Publication Publication Date Title
CN113033630A (en) Infrared and visible light image deep learning fusion method based on double non-local attention models
CN103020933B (en) A kind of multisource image anastomosing method based on bionic visual mechanism
CN110189286B (en) Infrared and visible light image fusion method based on ResNet
CN113139585B (en) Infrared and visible light image fusion method based on unified multi-scale dense connection network
CN109410135B (en) Anti-learning image defogging and fogging method
CN112967178B (en) Image conversion method, device, equipment and storage medium
CN113283444B (en) Heterogeneous image migration method based on generation countermeasure network
CN114548265B (en) Crop leaf disease image generation model training method, crop leaf disease identification method, electronic equipment and storage medium
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN115311186B (en) Cross-scale attention confrontation fusion method and terminal for infrared and visible light images
CN115035003A (en) Infrared and visible light image anti-fusion method for interactively compensating attention
CN115511767B (en) Self-supervised learning multi-modal image fusion method and application thereof
CN104408697B (en) Image Super-resolution Reconstruction method based on genetic algorithm and canonical prior model
CN116258936A (en) Infrared and visible light image fusion method based on multi-scale features
CN112767243A (en) Hyperspectral image super-resolution implementation method and system
CN114463172A (en) Light field image super-resolution reconstruction method oriented to view consistency
CN115546442A (en) Multi-view stereo matching reconstruction method and system based on perception consistency loss
CN116757986A (en) Infrared and visible light image fusion method and device
CN113298744B (en) End-to-end infrared and visible light image fusion method
CN114639002A (en) Infrared and visible light image fusion method based on multi-mode characteristics
CN113011438B (en) Bimodal image significance detection method based on node classification and sparse graph learning
CN117292244A (en) Infrared and visible light image fusion method based on multilayer convolution
CN116051947A (en) Multi-scale feature fusion method for multi-source heterogeneous data of transformer
CN113160104B (en) Image fusion method based on densely connected network
CN113362281A (en) Infrared and visible light image fusion method based on WSN-LatLRR

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210625