CN116309215A - Image fusion method based on double decoders - Google Patents

Image fusion method based on double decoders

Info

Publication number
CN116309215A
CN116309215A
Authority
CN
China
Prior art keywords
image
fusion
decoding
fusion method
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310165488.2A
Other languages
Chinese (zh)
Inventor
邱怀彬
刘晓宋
邸江磊
秦玉文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology
Priority to CN202310165488.2A
Publication of CN116309215A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the field of image fusion and discloses an image fusion method based on double decoders, which is used to solve the problem that deep-learning-based image fusion has poor feature extraction capability and poor fusion quality when processing complex multi-modal images captured by cameras with different imaging modes. The method comprises the following steps: the multi-modal images A₁ and A₂ are passed through a large receptive field feature extraction module to extract features and then through two interactive decoding modules respectively; during decoding, the decoding information of the two modality-specific decoding modules is concatenated on the channel dimension and interactively fused, and a fused image C is reconstructed; the loss between the fused image C and the multi-modal images A₁ and A₂ is calculated to update the network model parameters. The invention can effectively fuse complex multi-modal images and is characterized by strong feature information extraction, a small parameter count, high reconstruction accuracy, and strong robustness.

Description

Image fusion method based on double decoders
Technical field:
the invention relates to an image fusion method, in particular to an image fusion method based on double decoders.
Background art:
With the progress of technology, the information provided by a single source image can no longer satisfy the requirements of human vision or of target recognition and detection. Cameras with different imaging modes are therefore used to capture multi-modal images, and fused images with richer detail information are obtained by means of image fusion.
Image fusion technology integrates the information of two or more images of the same scene, acquired by different sensors or at different positions, times, or brightness levels, into a single fused image through superposition and complementation, so as to characterize the imaging scene comprehensively and support subsequent vision tasks. Compared with a single source image, the fused image presents the scene and target information more clearly, and the quality and clarity of the image are markedly improved.
Traditional image fusion methods are relatively mature, but they require manually designed and often complex fusion rules, so the labor and computational costs of image fusion are high. For complex multi-modal images, designing a general feature extraction method is very difficult and depends heavily on hand-crafted features. With the rise of deep learning in recent years, deep-learning-based image fusion methods have also emerged and provided new ideas for image fusion. However, current deep-learning-based image fusion methods have high network complexity and a large computational burden, and may still suffer from inaccurate feature extraction and poor fusion quality on complex multi-modal images.
Summary of the invention:
The invention aims to overcome the shortcomings of the prior art and provides an image fusion method based on double decoders, which can fuse complex multi-modal images and is characterized by strong feature information extraction, a small parameter count, high reconstruction accuracy, and strong robustness.
The technical scheme for solving the technical problems is as follows:
an image fusion method based on double decoders, comprising the following steps:
(S1) capturing multi-modal images with cameras of different imaging modes, and denoting them as images A₁ and A₂;
(S2) taking the multi-modal images A₁ and A₂ as the input of the network, and obtaining multi-modal feature maps through a convolution layer followed by N large receptive field feature extraction modules;
(S3) passing the two multi-modal feature maps through two interactive decoding modules respectively; during decoding, the decoding information of the two modality-specific decoding modules is concatenated on the channel dimension for interactive fusion; this is repeated N times for gradual fusion, and a fused image C is then obtained through a convolution layer;
(S4) constructing the neural network through the above process, calculating the loss function value between the fused image output by the network and the input images, and back-propagating the gradient of the loss function value to update the network model parameters; when the loss function value converges, parameter updating stops and the trained neural network is obtained.
Preferably, in step (S1), the multi-modal images include, but are not limited to, visible light images, short-wave infrared images, medium-wave infrared images, long-wave infrared images, and polarized images.
Preferably, in step (S1), the multi-modal image A₁ is a visible light image and A₂ is one of a short-wave, medium-wave, or long-wave infrared image, or a polarized image.
Preferably, in step (S2), the number N of module repetitions is in the range 4 ≤ N ≤ 6.
Preferably, in step (S2), the large receptive field feature extraction module uses a residual connection and comprises a convolution layer with a 1×1 kernel, a Gaussian error nonlinear activation function (GELU), a depthwise convolution layer with a 5×5 kernel, a depthwise convolution layer with a 5×5 kernel and a dilation value of 3, and pixel normalization.
Preferably, in step (S3), the interactive decoding module takes the feature information of each level and the fused decoding information of the previous level, performs pixel-wise superposition, and carries out interactive decoding.
Preferably, in step (S3), the interactive decoding module comprises a convolution layer with a 3×3 kernel, channel attention, and interpolation upsampling.
Preferably, in step (S4), the loss function of the neural network measures the degree of similarity between the fusion result image and the images before fusion. The loss Loss is a combination of an SSIM loss, a background content loss, and a saliency target loss, and is expressed as follows:

L_SSIM = 1 - k·SSIM(A₁, C) - (1 - k)·SSIM(A₂, C)    (1)

[Equation (2): background content loss L_back, given only as an image in the source and defined in terms of the gradient operator ∇ and the image height h and width w]

[Equation (3): saliency target loss L_salient, given only as an image in the source]

Loss = δ₁·L_SSIM + δ₂·L_back + δ₃·L_salient    (4)

where ∇ denotes the gradient operator, h and w are the height and width of the image, k may take different values for different input modality images within the range 0 < k < 1, and δ₁ + δ₂ + δ₃ = 1.
Compared with the prior art, the invention has the following beneficial effects:
1. The image fusion method based on double decoders adopts a large receptive field feature extraction module that uses depthwise-separable large-kernel convolution to reduce the model size while enlarging the receptive field. A large convolution kernel gathers information from a wide area and therefore extracts semantic information more effectively, and the designed depthwise-separable form removes most of the computational burden and parameters that a large kernel would otherwise introduce, yielding better feature extraction performance (a rough parameter comparison is sketched after this list).
2. The image fusion method based on double decoders uses two decoders and performs multi-modal fusion in the decoding stage. Existing methods typically fuse modalities in the encoding stage, but such fusion strategies are harder to optimize than fusion in the decoding stage: during back-propagation, the gradient path through the decoder is shorter than the path through the encoder, so the decoder's optimization is less affected by vanishing or exploding gradients and the decoder is easier to optimize than the encoder.
3. The dual decoder of the image fusion method based on double decoders adopts interactive decoding modules in order to exploit the complementarity of different modalities and the multiple kinds of information in the image content, including modality fusion information and context information. Instead of using each kind of information separately to improve the decoded features, as in existing work, the interactive decoding module is used as the basic block of the decoder and combines multiple kinds of information: channel attention adaptively selects useful information, and pixel-wise addition merges the different kinds of information for feature reconstruction. In addition, the two decoders interact with each other; the output of each decoding step contains both fusion information and the specific information of the two modalities and is passed to the two interactive decoding modules of the next level, so the decoded feature information is progressively refined, which further improves the quality of multi-modal image fusion.
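As a rough illustration of the parameter saving described in point 1 above, the following sketch compares an ordinary 5×5 convolution with the depthwise-separable form; the 64-channel width used here is an assumption chosen for illustration and is not taken from the patent text.

```python
# Hypothetical parameter count for a 64-channel feature map (channel width assumed).
channels = 64
standard_5x5 = 5 * 5 * channels * channels             # ordinary 5x5 convolution: 102,400 weights
depthwise_5x5 = 5 * 5 * channels                       # depthwise 5x5 convolution:   1,600 weights
pointwise_1x1 = channels * channels                    # 1x1 pointwise convolution:   4,096 weights
print(standard_5x5 / (depthwise_5x5 + pointwise_1x1))  # roughly 18x fewer parameters

# Receptive field: the module stacks a 5x5 depthwise convolution and a 5x5 depthwise
# convolution with dilation 3 (effective extent 3*(5-1)+1 = 13), so the pair covers
# 5 + 13 - 1 = 17 pixels per side, versus 5 for a single standard 5x5 convolution.
```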
Description of the drawings:
Fig. 1 is a block flow diagram of the dual decoder-based image fusion method of the present invention.
Fig. 2 is a schematic diagram of a large receptive field feature extraction module of the image fusion method based on dual decoders of the invention.
Fig. 3 is a block diagram of an interactive decoding module of the image fusion method based on the dual decoders of the present invention.
Detailed description of embodiments:
the present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Referring to fig. 1 to 3, the dual decoder-based image fusion method of the present invention includes the steps of:
(S1) capturing multi-modal images with cameras of different imaging modes, and denoting them as images A₁ and A₂;
(S2) taking the multi-modal images A₁ and A₂ as the input of the network, and obtaining multi-modal feature maps through a convolution layer followed by N large receptive field feature extraction modules;
(S3) passing the two multi-modal feature maps through two interactive decoding modules respectively; during decoding, the decoding information of the two modality-specific decoding modules is concatenated on the channel dimension for interactive fusion; this is repeated N times for gradual fusion, and a fused image C is then obtained through a convolution layer;
(S4) constructing the neural network through the above process, calculating the loss function value between the fused image output by the network and the input images, and back-propagating the gradient of the loss function value to update the network model parameters; when the loss function value converges, parameter updating stops and the trained neural network is obtained.
Referring to Fig. 2, in step (S2), the large receptive field feature extraction module uses a residual connection and comprises a convolution layer with a 1×1 kernel, a Gaussian error nonlinear activation function (GELU), a depthwise convolution layer with a 5×5 kernel, a depthwise convolution layer with a 5×5 kernel and a dilation value of 3, and pixel normalization.
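A minimal PyTorch sketch of such a module is given below for illustration. The operation list (1×1 convolution, GELU, 5×5 depthwise convolution, dilated 5×5 depthwise convolution, pixel normalization, residual connection) comes from the text; the ordering of the operations, the channel width, and the exact form of the pixel normalization are assumptions.

```python
# A minimal sketch of the large receptive field feature extraction module (assumptions
# noted above): per-pixel channel normalization, 1x1 convolution, GELU, 5x5 depthwise
# convolution, dilated 5x5 depthwise convolution, wrapped in a residual connection.
import torch
import torch.nn as nn


class PixelNorm(nn.Module):
    """Normalize each pixel's feature vector across the channel dimension (assumed form)."""
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(dim=1, keepdim=True) + 1e-6)


class LargeReceptiveFieldBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            PixelNorm(),
            nn.Conv2d(channels, channels, kernel_size=1),       # 1x1 convolution
            nn.GELU(),                                          # Gaussian error nonlinear activation
            nn.Conv2d(channels, channels, kernel_size=5,
                      padding=2, groups=channels),              # 5x5 depthwise convolution
            nn.Conv2d(channels, channels, kernel_size=5,
                      dilation=3, padding=6, groups=channels),  # 5x5 depthwise convolution, dilation 3
        )

    def forward(self, x):
        return x + self.body(x)                                 # residual connection
```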
Referring to Fig. 3, in step (S3), the interactive decoding module takes the feature information of each level and the fused decoding information of the previous level, performs pixel-wise superposition, and carries out interactive decoding.
Referring to Fig. 3, in step (S3), the interactive decoding module comprises a convolution layer with a 3×3 kernel, channel attention, and interpolation upsampling.
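The sketch below illustrates one possible reading of this module. The 3×3 convolution, channel attention, interpolation upsampling, pixel-wise superposition with the previous level's fused decoding information, and channel concatenation of the two modality branches come from the text; the squeeze-and-excitation form of the channel attention, the wiring of the inputs, and the optional upsampling flag are assumptions.

```python
# A minimal sketch of the interactive decoding module (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (assumed form)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(F.adaptive_avg_pool2d(x, 1))


class InteractiveDecodingBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Decoding information of the two modality branches is concatenated on the channel axis.
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)  # 3x3 convolution
        self.attn = ChannelAttention(channels)

    def forward(self, own_feat, other_feat, prev_fused=None, upsample=False):
        # Pixel-wise superposition of this level's feature information with the fused
        # decoding information of the previous level.
        if prev_fused is not None:
            own_feat = own_feat + prev_fused
        x = torch.cat([own_feat, other_feat], dim=1)   # cross-branch channel concatenation
        x = self.attn(self.conv(x))
        if upsample:                                   # interpolation upsampling
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return x
```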
In addition, the loss function of the neural network in this embodiment measures the degree of similarity between the fusion result image and the images before fusion. The loss Loss is a combination of an SSIM loss, a background content loss, and a saliency target loss, and is expressed as follows:

L_SSIM = 1 - k·SSIM(A₁, C) - (1 - k)·SSIM(A₂, C)    (1)

[Equation (2): background content loss L_back, given only as an image in the source and defined in terms of the gradient operator ∇ and the image height h and width w]

[Equation (3): saliency target loss L_salient, given only as an image in the source]

Loss = δ₁·L_SSIM + δ₂·L_back + δ₃·L_salient    (4)

where ∇ denotes the gradient operator, h and w are the height and width of the image, k may take different values for different input modality images within the range 0 < k < 1, and δ₁ + δ₂ + δ₃ = 1.
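A sketch of the training loss follows. Equation (1) and the weighted combination (4) are taken from the text; because equations (2) and (3) are given only as images, the background content loss and the saliency target loss below are assumed to be an L1 loss on image gradients and an L1 loss on pixel intensities, each normalized by h·w, and the SSIM implementation is assumed to come from the pytorch_msssim package. The default values of k and the δ weights are placeholders.

```python
# A minimal sketch of the training loss under the assumptions stated above, for
# single-channel inputs scaled to [0, 1].
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed external SSIM implementation


def sobel_gradient(x: torch.Tensor) -> torch.Tensor:
    """Approximate the gradient operator with Sobel filters (absolute magnitude)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    return F.conv2d(x, kx, padding=1).abs() + F.conv2d(x, ky, padding=1).abs()


def fusion_loss(a1, a2, c, k=0.5, deltas=(0.4, 0.3, 0.3)):
    d1, d2, d3 = deltas                      # delta_1 + delta_2 + delta_3 = 1 (example values)
    hw = c.shape[-2] * c.shape[-1]           # normalization by image height x width

    # Equation (1): SSIM loss weighted by k (0 < k < 1).
    l_ssim = 1 - k * ssim(a1, c, data_range=1.0) - (1 - k) * ssim(a2, c, data_range=1.0)

    # Assumed form of equation (2): background content loss on image gradients.
    l_back = torch.norm(sobel_gradient(c)
                        - torch.max(sobel_gradient(a1), sobel_gradient(a2)), p=1) / hw

    # Assumed form of equation (3): saliency target loss on pixel intensities.
    l_salient = torch.norm(c - torch.max(a1, a2), p=1) / hw

    # Equation (4): weighted combination.
    return d1 * l_ssim + d2 * l_back + d3 * l_salient
```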
In addition, the multi-modal images described in this embodiment include, but are not limited to, visible light images, short-wave infrared images, medium-wave infrared images, long-wave infrared images, and polarized images.
In addition, in this embodiment the multi-modal image A₁ is a visible light image and A₂ is a medium-wave or long-wave infrared image, with an image resolution of 640 × 512.
In addition, the number N of module repetitions in this embodiment may be 4.
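Putting the pieces together, the sketch below assembles the LargeReceptiveFieldBlock and InteractiveDecodingBlock classes from the sketches above into an end-to-end model with N = 4 and single-channel 640 × 512 inputs, as in this embodiment. Whether the two modalities share one encoder, the channel width, and the way the final convolution combines the two decoder outputs are all assumptions.

```python
# An end-to-end sketch (assumptions noted above); reuses LargeReceptiveFieldBlock and
# InteractiveDecodingBlock from the sketches in the preceding sections.
import torch
import torch.nn as nn


class DualDecoderFusionNet(nn.Module):
    def __init__(self, channels: int = 32, n_blocks: int = 4):
        super().__init__()
        # One convolution layer followed by N large receptive field modules per modality.
        self.head1 = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.head2 = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.encoder1 = nn.Sequential(*[LargeReceptiveFieldBlock(channels) for _ in range(n_blocks)])
        self.encoder2 = nn.Sequential(*[LargeReceptiveFieldBlock(channels) for _ in range(n_blocks)])
        # N interactive decoding modules per branch.
        self.decoder1 = nn.ModuleList([InteractiveDecodingBlock(channels) for _ in range(n_blocks)])
        self.decoder2 = nn.ModuleList([InteractiveDecodingBlock(channels) for _ in range(n_blocks)])
        # Final convolution layer reconstructing the fused image C.
        self.tail = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)

    def forward(self, a1, a2):
        f1 = self.encoder1(self.head1(a1))
        f2 = self.encoder2(self.head2(a2))
        d1, d2 = f1, f2
        prev1 = prev2 = None
        for blk1, blk2 in zip(self.decoder1, self.decoder2):
            # Each decoding level exchanges information between the two branches and
            # passes its output to the next level (gradual fusion).
            n1 = blk1(f1, d2, prev1)
            n2 = blk2(f2, d1, prev2)
            prev1, prev2, d1, d2 = n1, n2, n1, n2
        return torch.sigmoid(self.tail(torch.cat([d1, d2], dim=1)))


# Usage at the resolution quoted in this embodiment (640 x 512), in (B, C, H, W) layout:
model = DualDecoderFusionNet(n_blocks=4)
a1 = torch.rand(1, 1, 512, 640)   # visible light image A1
a2 = torch.rand(1, 1, 512, 640)   # infrared image A2
c = model(a1, a2)                 # fused image C with the same spatial size
```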
The foregoing is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to the above example; all technical solutions falling under the concept of the present invention fall within its scope of protection. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention are also regarded as falling within the scope of protection of the present invention.

Claims (8)

1. A dual decoder-based image fusion method, comprising the steps of:
(S1) capturing multi-modal images with cameras of different imaging modes, and denoting them as images A₁ and A₂;
(S2) taking the multi-modal images A₁ and A₂ as the input of the network, and obtaining multi-modal feature maps through a convolution layer followed by N large receptive field feature extraction modules;
(S3) passing the two multi-modal feature maps through two interactive decoding modules respectively; during decoding, the decoding information of the two modality-specific decoding modules is concatenated on the channel dimension for interactive fusion; this is repeated N times for gradual fusion, and a fused image C is then obtained through a convolution layer;
(S4) constructing the neural network through the above process, calculating the loss function value between the fused image output by the network and the input images, and back-propagating the gradient of the loss function value to update the network model parameters; when the loss function value converges, parameter updating stops and the trained neural network is obtained.
2. The dual decoder-based image fusion method of claim 1, wherein in step (S1), the multi-modal images include, but are not limited to, visible light images, short-wave infrared images, medium-wave infrared images, long-wave infrared images, and polarized images.
3. The dual decoder-based image fusion method according to claim 1, wherein in step (S1), the multi-modal image A₁ is a visible light image and A₂ is one of a short-wave, medium-wave, or long-wave infrared image, or a polarized image.
4. The dual decoder-based image fusion method of claim 1, wherein in step (S2), the number N of module repetitions is preferably in the range 4 ≤ N ≤ 6.
5. The dual decoder-based image fusion method of claim 1, wherein in step (S2), the large receptive field feature extraction module uses a residual connection and comprises a convolution layer with a 1×1 kernel, a Gaussian error nonlinear activation function, a depthwise convolution layer with a 5×5 kernel, a depthwise convolution layer with a 5×5 kernel and a dilation value of 3, and pixel normalization.
6. The dual decoder-based image fusion method according to claim 1, wherein in step (S3), the interactive decoding module takes the feature information of each level and the fused decoding information of the previous level, performs pixel-wise superposition, and carries out interactive decoding.
7. The dual decoder-based image fusion method of claim 1, wherein in step (S3), the interactive decoding module comprises a convolution layer with a 3×3 kernel, channel attention, and interpolation upsampling.
8. The dual decoder-based image fusion method according to claim 1, wherein in step (S4) the loss function of the neural network measures the degree of similarity between the fusion result image and the images before fusion, the loss Loss being a combination of an SSIM loss, a background content loss, and a saliency target loss, expressed as follows:

L_SSIM = 1 - k·SSIM(A₁, C) - (1 - k)·SSIM(A₂, C)    (1)

[Equation (2): background content loss L_back, given only as an image in the source and defined in terms of the gradient operator ∇ and the image height h and width w]

[Equation (3): saliency target loss L_salient, given only as an image in the source]

Loss = δ₁·L_SSIM + δ₂·L_back + δ₃·L_salient    (4)

where ∇ denotes the gradient operator, h and w are the height and width of the image, k may take different values for different input modality images within the range 0 < k < 1, and δ₁ + δ₂ + δ₃ = 1.
CN202310165488.2A 2023-02-24 2023-02-24 Image fusion method based on double decoders Pending CN116309215A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310165488.2A CN116309215A (en) 2023-02-24 2023-02-24 Image fusion method based on double decoders

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310165488.2A CN116309215A (en) 2023-02-24 2023-02-24 Image fusion method based on double decoders

Publications (1)

Publication Number Publication Date
CN116309215A true CN116309215A (en) 2023-06-23

Family

ID=86817870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310165488.2A Pending CN116309215A (en) 2023-02-24 2023-02-24 Image fusion method based on double decoders

Country Status (1)

Country Link
CN (1) CN116309215A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721112A (en) * 2023-08-10 2023-09-08 南开大学 Underwater camouflage object image segmentation method based on double-branch decoder network
CN116721112B (en) * 2023-08-10 2023-10-24 南开大学 Underwater camouflage object image segmentation method based on double-branch decoder network

Similar Documents

Publication Publication Date Title
Guo et al. Dense scene information estimation network for dehazing
CN111915619A (en) Full convolution network semantic segmentation method for dual-feature extraction and fusion
CN110717868B (en) Video high dynamic range inverse tone mapping model construction and mapping method and device
CN110675328A (en) Low-illumination image enhancement method and device based on condition generation countermeasure network
CN102982520B (en) Robustness face super-resolution processing method based on contour inspection
CN111709900A (en) High dynamic range image reconstruction method based on global feature guidance
CN111008608B (en) Night vehicle detection method based on deep learning
CN110189268A (en) Underwater picture color correcting method based on GAN network
CN116309215A (en) Image fusion method based on double decoders
WO2024017093A1 (en) Image generation method, model training method, related apparatus, and electronic device
CN115330620A (en) Image defogging method based on cyclic generation countermeasure network
CN116757986A (en) Infrared and visible light image fusion method and device
KS et al. Deep multi-stage learning for hdr with large object motions
CN115731597A (en) Automatic segmentation and restoration management platform and method for mask image of face mask
Song et al. Real-scene reflection removal with raw-rgb image pairs
US11783454B2 (en) Saliency map generation method and image processing system using the same
CN112712481B (en) Structure-texture sensing method aiming at low-light image enhancement
CN116109538A (en) Image fusion method based on simple gate unit feature extraction
KR20230036343A (en) Apparatus for Image Fusion High Quality
Zhang et al. Single image dehazing via reinforcement learning
CN116071281A (en) Multi-mode image fusion method based on characteristic information interaction
CN115829868A (en) Underwater dim light image enhancement method based on illumination and noise residual error image
CN111476721B (en) Wasserstein distance-based image rapid enhancement method
CN110189370B (en) Monocular image depth estimation method based on full convolution dense connection neural network
CN114463192A (en) Infrared video distortion correction method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination