CN113159236A

CN113159236A - Multi-focus image fusion method and device based on multi-scale transformation

Info

Publication number: CN113159236A
Application number: CN202110581448.7A
Authority: CN
Inventors: 田赛赛; 老伟雄; 苏喆; 高佩忻
Original assignee: Industrial and Commercial Bank of China Ltd ICBC
Current assignee: Industrial and Commercial Bank of China Ltd ICBC
Priority date: 2021-05-26
Filing date: 2021-05-26
Publication date: 2021-07-23

Abstract

The disclosure provides a multi-focus image fusion method and device based on multi-scale transformation, and relates to the field of artificial intelligence or computer vision. The method comprises the following steps: acquiring two input images in the same imaging scene to acquire different depths of the input images; respectively extracting k-level scale image features at different depths from the input image by using a k-level context feature extraction model; performing primary fusion on the image features of the k-level scale by using a same-level scale fusion mode to obtain primary fusion features; fusing the primary fusion feature of each level of scale with the primary fusion feature of the previous level after inverse transformation to obtain a refined fusion feature; reconstructing the refined fusion characteristics by using an image reconstruction model to obtain a fusion image; the multi-focus image fusion network model is trained using the input images and fused images of the input images as training data.

Description

Multi-focus image fusion method and device based on multi-scale transformation

Technical Field

The present disclosure relates to the field of artificial intelligence or computer vision, and in particular, to a multi-focus image fusion method and apparatus based on multi-scale transformation.

Background

The multi-focus image fusion technology aims to fuse a plurality of images containing the same scene under different focus settings to form a full-definition image with more complete information content, and the obtained full-definition image is convenient for subsequent computer vision tasks such as identification and supervision. In order to obtain excellent fusion effect, researchers have proposed various image fusion methods, which can be broadly divided into two main categories according to the principles of algorithms: a conventional image fusion method and an image fusion method based on deep learning.

Conventional image fusion methods can be further divided into fusion methods based on a spatial domain and fusion methods based on a transform domain. However, this kind of multi-focus image fusion method requires human design of activity level detection and fusion rules, which greatly increases the difficulty of algorithm design. In recent years, with the wide application of deep learning, multi-focus image fusion based on deep learning is greatly developed, and compared with the traditional multi-focus image fusion method, the algorithm complexity is greatly reduced, and the fusion performance is greatly improved.

At present, most of multi-focus image fusion algorithms design a convolutional neural network as a classifier, and a label is set for a pair of training data, so that the situation that misclassification occurs at the edges of a focus area and a non-focus area can be caused. In addition, only the focus point detection is completed by the network, and the rest parts also need to set the judgment criterion manually, so that the complexity of the algorithm is increased. In addition, the existing neural network training mode makes the network difficult to learn the edges of the focusing area and the non-focusing area. Also, when defining a variety of masks for network learning of the edges of the in-focus and out-of-focus regions, such regularly shaped masks are not sufficient to simulate real-world situations.

Disclosure of Invention

In view of the above-mentioned deficiencies of the existing multi-focus image fusion technology, the present disclosure provides a multi-focus image fusion method and apparatus based on multi-scale transformation, so as to solve the problem that the existing multi-focus image fusion technology is inaccurate in edge fusion between focused and unfocused images.

One aspect of the present disclosure provides a multi-focus image fusion method based on multi-scale transformation, including: acquiring two input images A under the same imaging scene_n(n ═ 1, 2), acquiring different depths of the input image; extracting k-level scale image features at different depths from an input image respectively by using a k-level context feature extraction model

Image characteristics of k-level scale by using same-level scale fusion mode

Performing preliminary fusion to obtain a preliminary fusion characteristic U^d(ii) a Preliminarily fusing the characteristics U of each level of scale^dPreliminarily fusing the characteristics U with the previous stage after inverse transformation^d+1Performing fusion to obtain refined fusion characteristic U'^d(ii) a Refining fused feature U 'by using image reconstruction model'^dReconstructing to obtain a fused image F_recon(ii) a Using input image A_nFused image F with input image_reconAnd training a multi-focus image fusion network model as training data.

According to an embodiment of the present disclosure, the two input images are two multi-focus images to be fused and pre-registered.

According to the embodiment of the disclosure, each level of context feature extraction model is composed of 3 parallel convolution modules with different receptive fields, and the image features containing context information under each level of scale are obtained

Wherein: the parallel convolution module of 3 different receptive fields includes: original image characteristic branch, receptive field expansion branch and attention weight branch; the receptive field enlarging branch enlarges the receptive field by using hole convolution to acquire image relative global information, and the attention weight branch refers to self-attentionThe mechanism performs weight processing on the receptive field expansion branch.

According to an embodiment of the present disclosure, in a k-level context feature extraction model, an input image a_nD scale feature of

Is input as an input image A_nCharacteristic of the previous scale

Representing an input image; by reference to the self-attention mechanism

Conversion to transitional image features

Adopting Sigmoid function as transition image characteristic

Each pixel point of (a) is assigned a weight.

According to embodiments of the present disclosure, a self-attentive mechanism will be cited

Conversion to transitional image features

The method comprises the following steps: using a filter pair of size 1 and outputting 32 channels

Performing convolution operation to obtain transition image characteristics

Adopting Sigmoid function as transition image characteristic

Each pixel point of (1) allocationThe weight is calculated according to the following formula;

wherein, (i, j) represents a row coordinate and a column coordinate, respectively; H. w represents the pixel width and height of the image feature, respectively;

representing transitional image features

The weight assigned to each pixel point.

According to an embodiment of the present disclosure, the receptive field enlarging branch comprises two partial image sub-features, each partial image sub-feature being formed by two consecutive hole convolutions, wherein: the sub-features of the two partial images are calculated according to the following formulas respectively:

wherein the content of the first and second substances,

a first partial image sub-feature and a second partial image sub-feature, respectively;

respectively representing a first hole convolution operation and a second hole convolution operation;

respectively representing parameter sets of the label filter corresponding to the first hole convolution operation and the second hole convolution operation; Θ denotes the Relu activation function;

to represent

And the converted corresponding pixel points are distributed with weights.

According to an embodiment of the present disclosure, an image A is input_nD scale feature of

Calculated according to the following formula:

wherein the content of the first and second substances,

a first volume operation is shown as a first volume operation,

a parameter set representing a label filter corresponding to the first convolution operation; cat (·) denotes cascade operation; pooling (. cndot.) represents a pooling operation, with a pooling step size of 2 being set.

According to the embodiment of the disclosure, the image features of k-level scale are subjected to the same-level scale fusion mode

Performing preliminary fusion to obtain a preliminary fusion characteristic U^dThe method comprises the following steps: image features with same scale from two input images

Cascading, and then obtaining a weight value graph through a softmax layer; d scale feature

Adding the product multiplied by the weight value graph corresponding to the d-th scale to obtainPreliminary fusion features U to d-th scale^d。

According to an embodiment of the present disclosure, the features U are preliminarily fused^dCalculated according to the following formula:

wherein, mapn (n is 1, 2) represents a weight map obtained through softmax layer operation; cat (.) denotes cascade operation; denotes multiplication operations at the pixel level; + represents an addition operation at the pixel level.

According to the embodiment of the disclosure, the primary fusion characteristics U of each level scale are combined^dPreliminarily fusing the characteristics U with the previous stage after inverse transformation^d+1Performing fusion to obtain refined fusion characteristic U'^dThe method comprises the following steps: adopting a step-by-step scale reverse transmission mode to enable the preliminary fusion characteristic U of the d-level scale^dThe preliminary fusion feature of the previous-level scale obtained after inverse transformation is changed into a preliminary fusion feature U^d+1(ii) a To the previous level preliminary fusion characteristics U^d+1Post-upsampling and preliminary fusion features U^dPerforming fusion to obtain refined fusion characteristic U'^d。

Refining fused feature U 'according to an embodiment of the present disclosure'^dCalculated according to the following formula:

wherein the content of the first and second substances,

it is shown that the second convolution operation is,

representing a second convolution operationA set of parameters corresponding to the label filter; Θ denotes the Relu activation function; cat (·) denotes cascade operation; sample (·) represents an upsampling operation.

According to an embodiment of the disclosure, refined fused feature U 'is pair of image reconstruction models'^dReconstructing to obtain a fused image F_reconThe method comprises the following steps: fused image F_reconCalculated according to the following formula:

wherein conv (; θ)_recon) Denotes a third convolution operation, θ_reconA parameter set representing a label filter corresponding to the third convolution operation; Θ denotes the Relu activation function.

According to an embodiment of the present disclosure, two input images are constructed by: inputting an image training set containing a plurality of source images, setting a first mask and a second mask with complementary scenes, and acquiring a target region in the source images through the first mask; performing dot multiplication operation on each source image by using a first mask and a second mask respectively to obtain a target image and a background image; continuously and repeatedly blurring the target image and the background image through a blurring filter to obtain a plurality of groups of target blurred images and background blurred images with different blurring degrees; and respectively adding the target blurred image and the background blurred image which have the same degree of blurring in each group to obtain a plurality of groups of artificially synthesized multi-focus images.

According to an embodiment of the present disclosure, the blur filter is a gaussian filter with a sliding window size of 7 × 7 and a standard deviation of 2.

According to an embodiment of the disclosure, the method further comprises: determining a predictive fused image of one of the plurality of input images using a multi-focus image fusion network model; calculating a fused image F of a predicted fused image and an input image determined using a multi-focus image fusion network model using a joint loss function_reconA loss value in between; judging whether the loss value meets a preset loss threshold value, if not, judging according to the loss valueAnd adjusting parameters of the multi-focus image fusion network model, and returning to the step of determining a fusion output image by using the multi-focus image fusion network model aiming at another input image in the plurality of input images.

According to an embodiment of the present disclosure, a joint loss function is jointly constructed according to a structural similarity loss function and a mean square error loss function, wherein; structural similarity loss function L_MSESum mean square error loss function L_SSIMRespectively calculated according to the following formula:

L_SSIM＝1-SSIM(G，P)

wherein H, W represents the pixel height and width of the image, respectively; (i, j) represent row and column coordinates in the image; g (i, j) and P (i, j) respectively represent the color values of the true value image and the prediction fusion image of the corresponding pixel coordinates; i | · | purple wind₂Representing a two-norm operation; mu.s_G、μ_PRespectively represent the fused images F_reconThe color mean value of the image is fused with the prediction; c₁、C₂Respectively representing two constants which are used for preventing zero-removing errors and are respectively set to be 0.01 and 0.03 in training;

respectively represent the fused images F_reconColor variance of the fused image with the prediction; sigma_GPRepresenting a fused image F_reconAnd the covariance of the image fused with the prediction.

According to An embodiment of the present disclosure, a fused image F using An input image An and An input image_reconBefore the step of training the multi-focus image fusion network model as training data, the method further comprises: will input image A_nZooming to a preset size.

Another aspect of the present disclosure provides a multi-focus image fusion apparatus based on multi-scale transformation, including: an image acquisition module for acquiring two input images A under the same imaging scene_n(n ═ 1, 2), acquiring different depths of the input image; a feature extraction module for extracting k-level scale image features at different depths from the input image respectively by using a k-level context feature extraction model

A preliminary fusion module for using the same-level scale fusion mode to the image characteristics of k-level scale

Performing preliminary fusion to obtain a preliminary fusion characteristic U^d(ii) a The inverse transformation fusion module is used for fusing the primary fusion characteristics U of each level of scale^dPreliminarily fusing the characteristics U with the previous stage after inverse transformation^d+1Performing fusion to obtain refined fusion characteristic U'^d(ii) a An image reconstruction module for using the image reconstruction model to refine the fused feature U'^dReconstructing to obtain a fused image F_recon(ii) a And a network training module for using the input image A_nFused image F with input image_reconAnd training a multi-focus image fusion network model as training data.

Another aspect of the present disclosure provides an electronic device including: one or more processors; a storage device to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method as described above.

Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.

Another aspect of the disclosure provides a computer program comprising computer executable instructions for implementing the method as described above when executed.

Compared with the prior art, the multi-focus image fusion method and device based on multi-scale transformation provided by the disclosure at least have the following beneficial effects:

(1) according to the method, the characteristic extraction and the characteristic fusion do not need to be artificially designed, the accurate fusion of the multi-focus images can be realized, and a simulation result shows that the fused image containing abundant information can be obtained;

(2) the method uses the basic network to extract the depth image features which are multi-level and different in scale and contain rich detail information and context information, and can effectively utilize the context information in the image to assist the network in judging the focusing area in the multi-focus image, thereby improving feature extraction;

(3) according to the method, the image features of the same scale and different branches are fused together by constructing the feature fusion module, then the fusion image features under the current scale are fused together with the fusion image features subjected to inverse multi-scale transformation, and the step-by-step refinement of the features is realized in a reverse transmission mode, so that the detail information in the fusion image is effectively improved.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:

fig. 1 schematically illustrates a flow chart of a multi-focus image fusion method based on multi-scale transformation according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates an operational flow diagram of a multi-focus image fusion method based on multi-scale transformation according to an embodiment of the present disclosure;

FIG. 3 schematically illustrates a flow diagram for manual synthesis of an input image according to an embodiment of the disclosure;

FIG. 4 schematically illustrates a sequential multi-pass blur processing procedure according to an embodiment of the present disclosure;

FIG. 5 schematically illustrates a process of attention weight branching according to an embodiment of the present disclosure;

FIG. 6 schematically illustrates a process of preliminary fusion feature processing according to an embodiment of the disclosure;

FIG. 7 schematically illustrates a process of refining a fused feature according to an embodiment of the disclosure;

FIG. 8 schematically illustrates a flow diagram for training a multi-focus image fusion network model with training data, in accordance with an embodiment of the present disclosure;

fig. 9 schematically illustrates a block diagram of a multi-focus image fusion apparatus based on multi-scale transformation according to an embodiment of the present disclosure; and

FIG. 10 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.). Where a convention analogous to "A, B or at least one of C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).

Before describing in detail specific embodiments of the present disclosure, technical terms are first explained to facilitate a better understanding of the present disclosure.

Convolutional Neural Network (CNN): the supervised deep learning model is a trainable multi-layer structure and aims to learn multi-level feature representation of input data, and each level of feature comprises a plurality of feature maps. The coefficients in the feature map are called neurons, and the feature map is connected by performing several different types of computations, such as convolution, nonlinear activation, and spatial pooling. Generally, nonlinear processing is performed immediately after the input passes through the convolutional layer, and nonlinear functions such as a Sigmoid function, a Relu activation function, and a tanh function are generally adopted, so that the convergence rate during network training can be increased, and the network training process can be accelerated.

Receptive field: the area size of the pixel points on the feature map (feature map) output by each layer of the convolutional neural network is mapped on the original image. The relevant theory and method of the receptive field are researched, the size of the receptive field of each layer in the convolutional neural network is quantified, a reliable optimization direction can be provided for image processing tasks such as target detection and the like, and the method has important significance for improving the precision of the target detection.

Multi-scale: in fact, sampling the signal at different granularities, and we can observe different features at different scales, so as to accomplish different tasks. Generally, more detail can be seen with less/denser granularity sampling, and the overall trend can be seen with more/sparser granularity sampling.

Hole convolution (punctured convolution): by convolving at sparsely sampled locations, the convolution kernel is enlarged with the original weights, thereby increasing the size of the receptive field without adding additional cost. Moreover, the hole convolution can integrate a large amount of context information in semantic segmentation.

softmax layer: and (4) for classification, converting the output result of the neural network into a probability expression through a softmax function, finding the maximum probability item, and classifying the maximum probability item. The softmax layer is typically used at the last level in the neural network for final classification and normalization.

Image masking: refers to the control of the area or process of image processing by occluding (wholly or partially) the processed image with selected images, graphics or objects. In digital image processing, masks may be used to mask certain areas of the image from processing or from processing parameter calculations, or to process or count only the masked areas.

The embodiment of the disclosure provides a multi-focus image fusion method based on multi-scale transformation, which comprises the following steps: acquiring two input images A under the same imaging scene_n(n ═ 1, 2), acquiring different depths of the input image; extracting k-level scale image features at different depths from an input image respectively by using a k-level context feature extraction model

Image characteristics of k-level scale by using same-level scale fusion mode

Fig. 1 schematically shows a flowchart of a multi-focus image fusion method based on multi-scale transformation according to an embodiment of the present disclosure. Fig. 2 schematically illustrates an operational flow diagram of a multi-focus image fusion method based on multi-scale transformation according to an embodiment of the present disclosure.

Referring to fig. 2, the method shown in fig. 1 will be described in detail. The multi-focus image fusion method based on multi-scale transformation according to an embodiment of the present disclosure may include the following operations S101 to S106.

In operation S101, two input images A of the same imaging scene are acquired_n(n ═ 1, 2), the different depths of the input image are acquired.

In operation S102, image features of k-scale at different depths are respectively extracted from an input image using a k-scale context feature extraction model

In operation S103, image features of k-level scale are subjected to a same-level scale fusion method

Performing preliminary fusion to obtain a preliminary fusion characteristic U^d。

In operation S104, the preliminary fusion features U of each level scale are combined^dPreliminarily fusing the characteristics U with the previous stage after inverse transformation^d+1Performing fusion to obtain refined fusion characteristic U'^d。

In operation S105, the refined fusion feature U 'is fused using the image reconstruction model'^dReconstructing to obtain a fused image F_recon。

In operation S106, an input image A is used_nFused image F with input image_reconAnd training a multi-focus image fusion network model as training data.

According to the embodiment of the disclosure, the multi-level depth and multi-scale image features are respectively extracted from the multi-focus image by using the k-level context feature extraction model, and the inverse transformation is performed on the image features of different scales, so that the fusion of the image features of different levels and different scales is realized, the accurate fusion of the multi-focus image can be realized without manually designing feature extraction and feature fusion, a large amount of objective and accurate training data is mined, and the training efficiency and accuracy of the multi-focus image fusion network model are improved.

According to the embodiment of the disclosure, the preliminary fusion features are obtained by using the same-level fusion mode for the k-level image features with different scales obtained in the step S102, and then the refinement of the fusion features is realized by adopting a coarse-to-fine mode.

In the embodiment of the present disclosure, the two input images are two multi-focus images to be fused and pre-registered.

Due to the depth of field of the imaging device and when the image acquisition is actually performed, the target is acquired in a targeted manner, so that the target in the focus area is relatively sharp and the rest of the focus area is relatively blurred. To be able to better simulate the object properties of the in-focus region and to increase the learning of the network for the edges of the in-focus and out-of-focus regions, the present disclosure artificially synthesizes an input image, for example, using the data set MSRA10K for salient object detection as a base data set, which contains 10000 images. It should be noted that, in other embodiments, both the source and the image size of the image training set may be set according to the actual training process, which is not limited by the present disclosure.

Fig. 3 schematically illustrates a flow diagram for manual synthesis of an input image according to an embodiment of the present disclosure. FIG. 4 schematically illustrates a sequential multi-pass blur processing procedure according to an embodiment of the present disclosure.

Referring to fig. 4, the process of artificially synthesizing the input image shown in fig. 3 will be further described. In some embodiments, the two input images may be constructed by the following sub-operations S110-S140:

in operation S110, an image training set including a plurality of source images is input, a first mask and a second mask having complementary scenes are set, and a target region in the source images is obtained through the first mask.

The target area refers to the area where the target object is located, and the target area is selected by acting on the source image through the first mask. For example, the source image may be a face image and the target region may be a face region.

In the embodiment of the present disclosure, after the first mask is set, the second mask can be obtained by inverting the first mask.

In operation S120, a dot product operation is performed on each source image by using the first mask and the second mask, so as to obtain a target image and a background image.

The target image is obtained as the first mask acts on the source image to select the target area. The second mask and the first mask have complementary scenes, and the second mask acts on the source image through dot product operation to obtain the remaining region of the target region, namely the background image.

In operation S130, a plurality of sets of target blurred images and background blurred images with different degrees of blur are obtained by performing a plurality of consecutive blurring processes on the target image and the background image respectively through a blurring filter.

For example, the blurring filter may be a gaussian filter having a sliding window size of 7 × 7 and a standard deviation of 2, and the consecutive multi-pass blurring process may be a gaussian blurring process that operates 5 times in succession.

As shown in fig. 4, 5 blurred images of the target with different blurring degrees can be obtained by continuously operating the selected target part 5 times with a gaussian filter with a window size of 7 and a parameter of 2. Accordingly, 5 background blurred images with different blurring degrees can be obtained by performing the same continuous blurring processing on the background image.

In operation S140, each group of target blurred images and background blurred images with the same blur degree are added to obtain a plurality of artificially synthesized multi-focus images.

The target blurred images and the background blurred images are overlapped based on the same blurring degree, so that multi-focus images with different blurring degrees can be obtained, and therefore the input image has different depths. It should be further noted that the depth, i.e., the blur degree, of the embodiments of the present disclosure is discussed below in terms of depth in order to adapt to the image processing field in computer vision.

In this embodiment of the disclosure, in operation S102, each level of context feature extraction model is formed by 3 parallel convolution modules with different receptive fields, and an image feature including context information at each level of scale is obtained

Wherein: the parallel convolution module for 3 different receptive fields includes three branches: original image characteristic branch, receptive field expansion branch and attention weight branch; the receptive field expanding branch expands the receptive field by using hole convolution to acquire image relative global information, and the attention weight branch performs weight processing on the receptive field expanding branch by using a self-attention mechanism.

Each level of context feature extraction model is composed of 3 parallel convolution modules with different receptive fields, and image features containing context information under a single scale are obtained by fusing image features of different levels together.

In the original image feature branch, input image A_nD scale feature of

Is input as an input image A_nCharacteristic of the previous scale

Representing the input image.

For ease of understanding, assume that k is 4, and an arbitrary input image a is input_nD scale feature of

Can be expressed as:

wherein the content of the first and second substances,

representing an input image A_nThe d-1 th order scale feature of (a),

representing an input image;

representing the feature extraction function in the context feature extraction model.

Thus, the previous-level scale features need to be input for calculation

Performing a series of processing equivalent to converting the computation input into computation output by means of a feature extraction function, i.e. the d-th scale feature

Fig. 5 schematically illustrates a process of attention weight branching according to an embodiment of the present disclosure.

As shown in fig. 5, the process of the attention weight branch may include the following sub-operations S210-S220.

In operation S210, a self-attentive mechanism is introduced

Conversion to transitional image features

For example, the self-attention mechanism may employ a filter pair of size 1 and outputting 32 channels

Performing convolution operation to obtain transition image characteristics

In operation S220, adoptUsing Sigmoid function as transition image characteristic

Each pixel point of (a) is assigned a weight.

Specifically, the transition image features are calculated according to the following formula

The distribution weight of each pixel point:

representing transitional image features

The weight assigned to each pixel point.

In the embodiment of the present disclosure, the receptive field enlarging branch includes two partial image sub-features, each partial image sub-feature is formed by convolution of two continuous holes, where:

the sub-features of the two partial images are calculated according to the following formulas respectively:

wherein the content of the first and second substances,

respectively a first partial image sub-feature and a second partial imageA sub-feature;

to represent

And the converted corresponding pixel points are distributed with weights.

Thus, the output of the computation, namely the d-th-order scale feature, is synthesized by integrating the three branches included in the parallel convolution module

It can be calculated according to the following formula:

wherein the content of the first and second substances,

a first volume operation is shown as a first volume operation,

FIG. 6 schematically illustrates a process of preliminary fusion feature processing according to an embodiment of the disclosure.

As shown in fig. 6, the image features of k-level scale are fused by using the same level scale

Performing preliminary fusion to obtain a preliminary fusion characteristic U^dThe following sub-operations S310-S320 may be included.

In operation S310, image features having the same scale from two input images

And cascading, and then obtaining a weight value graph through a softmax layer.

In operation S320, a d-th order scale feature

Adding the products multiplied by the weight value graph corresponding to the d-th scale to obtain the preliminary fusion characteristic U under the d-th scale^d。

Therefore, the embodiment of the disclosure adopts a cascading mode for image features of different scales, obtains pixel-level weight information through the softmax layer, performs pixel-level multiplication operation with the original image features respectively, and then obtains preliminary fusion features through pixel-level addition operation.

Specifically, the preliminary fusion features U^dCalculated according to the following formula:

wherein, mapn (n is 1, 2) represents a weight map obtained through softmax layer operation; cat (·) denotes cascade operation; denotes multiplication operations at the pixel level; + represents an addition operation at the pixel level.

FIG. 7 schematically illustrates a process of refining a fused feature according to an embodiment of the disclosure.

As shown in fig. 7, the preliminary fusion features U at each level scale are combined^dPreliminary integration with the previous stage through inverse transformationCombined characteristic U^d+1Performing fusion to obtain refined fusion characteristic U'^dThe following sub-operations S410-S420 may be included.

In operation S410, a step-by-step scale backward transfer manner is adopted to enable the preliminary fusion feature U of the d-th step scale^dThe preliminary fusion feature of the previous-level scale obtained after inverse transformation is changed into a preliminary fusion feature U^d+1。

In operation S420, a feature U is preliminarily fused to a previous stage^d+1Post-upsampling and preliminary fusion features U^dPerforming fusion to obtain refined fusion characteristic U'^d。

Specifically, refining fused feature U'^dIt can be calculated according to the following formula:

wherein the content of the first and second substances,

it is shown that the second convolution operation is,

a parameter set representing a second convolution operation corresponding to the label filter; Θ denotes the Relu activation function; cat (·) denotes cascade operation; sample (·) represents an upsampling operation.

Refining fusion characteristic U 'obtained in step S420'^dInputting into image reconstruction model to obtain fused image F_reconIt can be calculated according to the following formula:

Therefore, the embodiment of the disclosure performs inverse transformation on the obtained preliminary fusion features through a depth supervision mechanism and a back propagation rule, and realizes fusion of image features of different scales by adopting a back transmission mode.

In the embodiment of the present disclosure, the multi-focus image fusion network model may be a mobilene series neural network model or a Resnet series neural network model. The model of the Mobilene series neural network is a neural network model based on deep level separable convolution, and the model of the Resnet series neural network is a neural network model based on residual errors.

In some embodiments, the number of the input images may be greater than two, and for more than two input multi-focus images, the two multi-focus images may be fused first by performing the above steps S101 to S106, and then the same fusion step is repeated for the other two multi-focus images until all the multi-focus images are fused.

Fig. 8 schematically illustrates a flow diagram for training a multi-focus image fusion network model with training data according to an embodiment of the present disclosure. In this example, the training data includes a plurality of input images and a fused image F for each input image_recon。

Determining a predictive fusion image of one input image among the plurality of input images using the multi-focus image fusion network model in operation S610;

in operation S620, a fused image F of the predicted fused image determined using the multi-focus image fusion network model and the input image is calculated using a joint loss function_reconA loss value in between;

in operation S630, it is determined whether the loss value satisfies a preset loss threshold, and if not, parameters of the multi-focus image fusion network model are adjusted according to the loss value, and the step of determining a fusion output image using the multi-focus image fusion network model is returned for another input image among the plurality of input images.

In the embodiment of the present disclosure, the joint loss function in operation S620 is jointly constructed according to a structural similarity loss function and a mean square error loss function, where the structural similarity loss function L_MSESum mean square error loss function L_SSIMRespectively calculated according to the following formula:

L_SSIM＝1-SSIM(G，P)

Therefore, the combined loss function constructed by the embodiment of the disclosure comprises the loss (structural similarity loss function) based on the image block level and the loss (mean square error loss function) based on the pixel level, optimizes the quality of image fusion, completes the training of the network, obtains the network model parameters, and realizes the accurate reconstruction of the image by optimizing the loss function.

In some embodiments, input image A is used_nFused image F with input image_reconBefore the step of training the multi-focus image fusion network model as training data, the method further comprises: will input image A_nZooming to a preset size.

Since the images in the data training set have arbitrary sizes, the images are uniformly transformed into preset sizes of 180 × 180 pixels, for example, so as to adapt to a multi-focus image fusion network model used subsequently.

Fig. 9 schematically shows a block diagram of a multi-focus image fusion apparatus based on multi-scale transformation according to an embodiment of the present invention.

As shown in fig. 9, the multi-focus image fusion apparatus 900 based on multi-scale transformation may include an image acquisition module 910, a feature extraction module 920, a preliminary fusion module 930, an inverse transformation fusion module 940, an image reconstruction module 950, and a network training module 960.

An image collecting module 910, configured to collect two input images a in the same imaging scene_n(n ═ 1, 2), acquiring different depths of the input image;

a feature extraction module 920, configured to extract k-scale image features at different depths from the input image using a k-level context feature extraction model

A preliminary fusion module 930, configured to apply a same-level scale fusion mode to the k-level scale image features

Performing preliminary fusion to obtain a preliminary fusion characteristic U^d；

An inverse transform fusion module 940 for fusing the preliminary fusion features U of each level of scale^dPreliminarily fusing the characteristics U with the previous stage after inverse transformation^d+1Performing fusion to obtain refined fusion characteristic U'^d；

An image reconstruction module 950 for fusing the refined feature U 'with an image reconstruction model'^dReconstructing to obtain a fused image F_recon(ii) a And

a network training module 960 for fusing images F using the input image An and the input image_reconAnd training a multi-focus image fusion network model as training data.

It should be noted that the apparatus part of the embodiment of the present disclosure corresponds to the method part of the embodiment of the present disclosure, and the description of the multi-focus image fusion apparatus part based on multi-scale transformation specifically refers to the multi-focus image fusion method part based on multi-scale transformation, and is not described herein again.

Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.

For example, any plurality of the image acquisition module 910, the feature extraction module 920, the preliminary fusion module 930, the inverse transform fusion module 940, the image reconstruction module 950, and the network training module 960 may be combined into one module/unit/sub-unit to be implemented, or any one of the modules/units/sub-units may be split into a plurality of modules/units/sub-units. Alternatively, at least part of the functionality of one or more of these modules/units/sub-units may be combined with at least part of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to an embodiment of the present disclosure, at least one of the image acquisition module 910, the feature extraction module 920, the preliminary fusion module 930, the inverse transform fusion module 940, the image reconstruction module 950, and the network training module 960 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or implemented by any one of three implementations of software, hardware, and firmware, or by a suitable combination of any of them. Alternatively, at least one of the image acquisition module 910, the feature extraction module 920, the preliminary fusion module 930, the inverse transform fusion module 940, the image reconstruction module 950 and the network training module 960 may be at least partially implemented as a computer program module, which when executed, may perform corresponding functions.

FIG. 10 schematically shows a block diagram of an electronic device according to an embodiment of the invention. The electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 10, the electronic device 1000 includes a processor 1010, a computer-readable storage medium 1020. The electronic device 1000 may perform a multi-focus image fusion method based on multi-scale transformation according to an embodiment of the present disclosure.

In particular, processor 1010 may include, for example, a general purpose microprocessor, an instruction set processor and/or related chip set and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), and/or the like. The processor 1010 may also include on-board memory for caching purposes. Processor 1010 may be a single processing unit or multiple processing units for performing different acts of a method flow according to embodiments of the disclosure.

Computer-readable storage media 1020, for example, may be non-volatile computer-readable storage media, specific examples including, but not limited to: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and so on.

The computer-readable storage medium 1020 may comprise a computer program 1021, which computer program 1021 may comprise code/computer-executable instructions that, when executed by the processor 1010, cause the processor 1010 to perform a method according to an embodiment of the disclosure, or any variant thereof.

The computer program 1021 may be configured with computer program code, for example, comprising computer program modules. For example, in an example embodiment, code in computer program 1021 may include one or more program modules, including, for example, 1021A, modules 1021B, … …. It should be noted that the division and number of modules are not fixed, and those skilled in the art may use suitable program modules or program module combinations according to actual situations, and when the program modules are executed by the processor 1010, the processor 1010 may execute the method according to the embodiment of the present disclosure or any variation thereof.

According to an embodiment of the present disclosure, at least one of the image acquisition module 910, the feature extraction module 920, the preliminary fusion module 930, the inverse transform fusion module 940, the image reconstruction module 950 and the network training module 960 may be implemented as a computer program module as described with reference to fig. 10, which, when executed by the processor 1010, may implement the respective operations described above.

The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement a multi-focus image fusion method based on multi-scale transformation according to an embodiment of the present disclosure.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.

The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims

1. A multi-focus image fusion method based on multi-scale transformation comprises the following steps:

acquiring two input images A under the same imaging scene_n(n ═ 1, 2) obtaining the different depths of the input image;

extracting k-scale image features at different depths from the input image respectively using a k-scale context feature extraction model

Using a same-level scale fusion mode to perform on the image characteristics of the k-level scale

Preliminarily fusing the characteristics U of each level of scale^dPreliminarily fusing the characteristics U with the previous stage after inverse transformation^d+1Performing fusion to obtain refined fusion characteristic U'^d；

Refining the fused feature U 'by using an image reconstruction model'^dReconstructing to obtain a fused image F_recon；

Using the input image A_nFused image F with input image_reconAnd training a multi-focus image fusion network model as training data.

2. The method of claim 1, wherein the two input images are two multi-focus images to be fused and pre-registered.

3. The method as claimed in claim 1, wherein each stage of the context feature extraction model is composed of 3 parallel convolution modules with different receptive fields, and the image features containing context information at each stage of scale are obtained

Wherein:

the parallel convolution module of the 3 different receptive fields comprises: original image characteristic branch, receptive field expansion branch and attention weight branch;

the receptive field expanding branch expands the receptive field by using cavity convolution to acquire image relative global information, and the attention weight branch performs weight processing on the receptive field expanding branch by using a self-attention mechanism.

4. The method of claim 3, wherein in the k-level contextual feature extraction model, input image A_nD scale feature of

Is input as an input image A_nCharacteristic of the previous scale

Representing an input image;

by reference to the self-attention mechanism

Conversion to transitional image features

Adopting Sigmoid function as transition image characteristic

Each pixel point of (a) is assigned a weight.

5. The method of claim 4, wherein the self-attentiveness by reference mechanism is to

Conversion to transitional image features

The method comprises the following steps:

using a filter pair of size 1 and outputting 32 channels

Performing convolution operation to obtain transition image characteristics

The Sigmoid function is adopted as the transition image characteristic

Each pixel point of (1) is assigned with a weight, and is calculated according to the following formula;

representing transitional image features

The weight assigned to each pixel point.

6. The method of claim 3, wherein the expanded receptor field branch comprises two partial image sub-features, each partial image sub-feature being comprised of two successive hole convolutions, wherein:

the two partial image sub-features are calculated according to the following formulas respectively:

wherein the content of the first and second substances,

to represent

And the converted corresponding pixel points are distributed with weights.

7. The method of claim 6, wherein the input image A_nD scale feature of

Calculated according to the following formula:

wherein the content of the first and second substances,

a first volume operation is shown as a first volume operation,

a parameter set representing a label filter corresponding to the first convolution operation; cat (·) denotes cascade operation; poThe vibrating (. cndot.) represents a pooling operation, setting the pooling step size to 2.

8. The method of claim 1, wherein the image features at the k-scale are fused using a sibling scale

Performing preliminary fusion to obtain a preliminary fusion characteristic U^dThe method comprises the following steps:

image features with same scale from two input images

Cascading, and then obtaining a weight value graph through a softmax layer;

d scale feature

9. The method of claim 8, wherein the preliminary fusion features U^dCalculated according to the following formula:

10. The method of claim 1, wherein the step of adding each of the plurality ofPreliminary fusion features U of level scale^dPreliminarily fusing the characteristics U with the previous stage after inverse transformation^d+1Performing fusion to obtain refined fusion characteristic U'^dThe method comprises the following steps:

adopting a step-by-step scale reverse transmission mode to enable the preliminary fusion feature U of the d-th step scale^dThe preliminary fusion feature of the previous-level scale obtained after inverse transformation is changed into a preliminary fusion feature U^d+1；

To the previous level preliminary fusion characteristics U^d+1Post-upsampling and preliminary fusion features U^dPerforming fusion to obtain refined fusion characteristic U'^d。

11. The method of claim 10, wherein the refined fused feature U'^dCalculated according to the following formula:

wherein the content of the first and second substances,

it is shown that the second convolution operation is,

12. The method of claim 1, wherein the refined fused feature U 'is modeled using an image reconstruction model'^dReconstructing to obtain a fused image F_reconThe method comprises the following steps:

the fusion image F_reconCalculated according to the following formula:

13. The method of claim 1, wherein the two input images are constructed by:

inputting an image training set containing a plurality of source images, setting a first mask and a second mask with complementary scenes, and acquiring a target region in the source images through the first mask;

performing dot multiplication operation on each source image by using a first mask and a second mask respectively to obtain a target image and a background image;

continuously and repeatedly blurring the target image and the background image through a blurring filter to obtain a plurality of groups of target blurred images and background blurred images with different blurring degrees;

and respectively adding the target blurred image and the background blurred image which have the same degree of blurring in each group to obtain a plurality of groups of artificially synthesized multi-focus images.

14. The method of claim 13, wherein the blur filter is a gaussian filter with a sliding window size of 7 x 7 and a standard deviation of 2.

15. The method of claim 1, wherein the method further comprises:

determining a predictive fused image of one of the plurality of input images using a multi-focus image fusion network model;

calculating a fused image F of a predicted fused image and an input image determined using a multi-focus image fusion network model using a joint loss function_reconA loss value in between;

and judging whether the loss value meets a preset loss threshold value, if not, adjusting parameters of the multi-focus image fusion network model according to the loss value, and returning to the step of determining a fusion output image by using the multi-focus image fusion network model for another input image in the plurality of input images.

16. The method of claim 15, wherein the joint loss function is jointly constructed from a structural similarity loss function and a mean square error loss function, wherein;

the structural similarity loss function L_MSESum mean square error loss function L_SSIMRespectively calculated according to the following formula:

L_SSIM＝1-SSIM(G，P)

wherein H, W represents the pixel height and width of the image, respectively; (i, j) represent row and column coordinates in the image; g (i, j) and P (i, j) respectively represent the color values of the true value image and the prediction fusion image of the corresponding pixel coordinates; i | · | purple wind₂Representing a two-norm operation;

μ_G、μ_Prespectively represent the fused images F_reconThe color mean value of the image is fused with the prediction; c₁、C₂Respectively representing two constants which are used for preventing zero-removing errors and are respectively set to be 0.01 and 0.03 in training;

17. The method of claim 1, whereinUsing the input image A_nFused image F with input image_reconBefore the step of training the multi-focus image fusion network model as training data, the method further comprises:

the input image A is processed_nZooming to a preset size.

18. A multi-focus image fusion device based on multi-scale transformation comprises:

an image acquisition module for acquiring two input images A under the same imaging scene_n(n ═ 1, 2) obtaining the different depths of the input image;

a feature extraction module for extracting k-scale image features at different depths from the input image respectively using a k-level context feature extraction model

A preliminary fusion module for using the same-level scale fusion mode to the image characteristics of the k-level scale

The inverse transformation fusion module is used for fusing the primary fusion characteristics U of each level of scale^dPreliminarily fusing the characteristics U with the previous stage after inverse transformation^d+1Performing fusion to obtain refined fusion characteristic U'^d；

An image reconstruction module for reconstructing the refined fused feature U 'using an image reconstruction model'^dReconstructing to obtain a fused image F_recon(ii) a And

a network training module for using the input image A_nFused image F with input image_reconAnd training a multi-focus image fusion network model as training data.

19. An electronic device, comprising:

one or more processors;

a memory for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-17.

20. A computer-readable storage medium storing computer-executable instructions for implementing the method of any one of claims 1 to 17 when executed.