CN113313663A - Multi-focus image fusion method based on zero sample learning - Google Patents

Multi-focus image fusion method based on zero sample learning Download PDF

Info

Publication number
CN113313663A
CN113313663A CN202110644185.XA CN202110644185A CN113313663A CN 113313663 A CN113313663 A CN 113313663A CN 202110644185 A CN202110644185 A CN 202110644185A CN 113313663 A CN113313663 A CN 113313663A
Authority
CN
China
Prior art keywords
net
image
focus
information
fused
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110644185.XA
Other languages
Chinese (zh)
Other versions
CN113313663B (en
Inventor
江俊君
胡星宇
刘贤明
马佳义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110644185.XA priority Critical patent/CN113313663B/en
Publication of CN113313663A publication Critical patent/CN113313663A/en
Application granted granted Critical
Publication of CN113313663B publication Critical patent/CN113313663B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a multi-focus image fusion method based on zero-sample learning. A multi-focus image fusion network, IM-Net, fuses the information contained in the input multi-focus images. IM-Net comprises two joint sub-networks, I-Net and M-Net: I-Net models a depth prior for the fused image, and M-Net models a depth prior for the focus map, so that zero-sample learning is realized through the extracted prior information. A reconstruction constraint is applied to IM-Net to ensure that the information of the source image pair is well transferred to the fused image, high-level semantic information keeps the brightness of adjacent pixels consistent, and a guidance loss provides guidance for IM-Net to find sharp regions. The effectiveness of the method is demonstrated by the experimental results.

Description

Multi-focus image fusion method based on zero sample learning
Technical Field
The invention belongs to the field of multi-focus image fusion, and particularly relates to a multi-focus image fusion method based on zero-sample learning.
Background
In imaging systems, objects outside the camera focal plane become blurred due to depth-of-field (DOF) limitations, making it difficult to obtain a fully focused image and causing a substantial degradation of imaging quality. In recent years, researchers have proposed a variety of multi-focus image fusion (MFIF) algorithms to solve this problem. A multi-focus image fusion algorithm fuses the focused areas of source images captured with different focal settings in the same scene to obtain a high-quality all-in-focus image, and has wide applications, including digital photography, microscopic imaging, high-level vision tasks, or simply obtaining better visual perception.
In recent years, remarkable progress has been made on the multi-focus image fusion problem. Generally, multi-focus image fusion methods can be classified into three categories: transform-domain-based methods, spatial-domain-based methods and deep-learning-based methods. Transform-domain-based methods use an image decomposition algorithm to convert the original images into a transform domain, so that the geometric features that distinguish a sharp image can be encoded more effectively; the transformed images are then fused, and finally an inverse transform is performed to obtain the fused image.
These transform-domain-based methods have been widely used because they avoid artifacts caused by direct manipulation of pixels, but they are also prone to image distortion due to their sensitivity to high-frequency components. Spatial-domain-based methods estimate a binary focus map and then compute a weighted sum of the source images based on the obtained focus map. They can be further divided into block-based and pixel-based methods. Block-based algorithms calculate an activity metric for a block centered at each pixel and are therefore time-consuming. Because they can potentially provide more accurate classification, pixel-based methods have become popular.
These traditional prior-based methods take the design of activity metrics and fusion rules as their main task, and many hand-crafted activity metrics based on low-level features have been proposed, e.g., edge or gradient information, pixel intensity, or contrast. However, such manually extracted features do not accurately indicate whether an image region is in sharp focus. Many deep-learning-based methods have been proposed to mitigate the reliance on manually designed priors (including hand-crafted image decomposition methods or hand-crafted features); with deep learning, both the activity metric and the fusion rule can be optimized for better results. Deep-learning-based MFIF methods can be further classified into supervised-learning-based and unsupervised-learning-based methods.
Supervised-learning-based models use artificially synthesized data sets, which may differ from the real imaging process; the real process must consider the point spread function (PSF) and the distance between objects and the lens, but estimating both the point spread function and depth is a severely ill-posed problem. Therefore, unsupervised learning becomes a natural solution. While some deep-learning-based approaches have achieved state-of-the-art (SOTA) performance, most of them work under a supervised-learning framework or are trained on large-scale image sets. To address this challenging and less-discussed problem, a new type of deep neural network is developed that works in an unsupervised and untrained way while achieving better performance.
Disclosure of Invention
The invention provides a multi-focus image fusion method based on zero-sample learning, which aims to ensure that the information of the source image pair is better transferred to the fused image, while avoiding time-consuming and labor-intensive data collection and problems of model generalization.
The invention is realized by the following scheme:
a multi-focus image fusion method based on zero sample learning,
fusing information contained in the input multi-focus images by using a multi-focus image fusion network structure IM-Net, wherein IM-Net comprises two joint sub-networks I-Net and M-Net, I-Net models the depth prior of the fused image, M-Net models the depth prior of the focus map, and zero-sample learning is realized through the extracted prior information;
the method comprises the following steps:
Step one: randomly sample two input noises Z_i and Z_m from a uniform distribution;
Step two: the input noise Z_i passes through the sub-network I-Net to obtain an estimated fused image I_fused, and the input noise Z_m passes through the sub-network M-Net to obtain an estimated focus map I_m;
Step three: a reconstruction loss function ensures that all information in the generated image comes from the source images;
Step four: a guidance loss is used to guide the network to learn sharp information, and a perceptual loss is used to enhance the visual perception of I_fused; the overall loss function is calculated;
Step five: the transfer from the source images to the fused image is completed.
Further,
the source images are I_A, I_B ∈ ℝ^(M×N×C), and the focus map to be estimated is I_m ∈ [0,1]^(M×N), where M and N respectively denote the height and width of the source images and C denotes the number of source image channels;
the estimated fused image I_fused is calculated as:
I_fused = I_A ⊙ I_m + I_B ⊙ (1 − I_m)
s.t. 0 ≤ (I_m)_{i,j} ≤ 1,
where ⊙ denotes the element-wise product and 1 denotes an all-ones matrix of size M × N.
Further,
the reconstruction loss is derived from the calculation formula of the fused image I_fused:
L_rec = || f_I(Z_i) − ( I_A ⊙ f_M(Z_m) + I_B ⊙ (1 − f_M(Z_m)) ) ||_1
where f_I(·) and f_M(·) denote I-Net and M-Net, respectively.
Further,
the guidance loss is used to guide the network to learn sharp information:
L_guide = || f_M(Z_m) − I_s ||_1
where I_s is an initial focus map calculated from the gradient information of the source images:
I_s = sign( abs(I_A − lp(I_A)) − min( abs(I_A − lp(I_A)), abs(I_B − lp(I_B)) ) )
where lp denotes low-pass filtering;
a perceptual loss L_per is used to enhance the visual perception of I_fused and is computed on deep features extracted by Φ(·), a pre-trained ResNet-101 [the formula is given as an image in the original];
the overall loss function is:
L_total = α·L_rec + β·L_guide + L_per
where α and β are the weights of the reconstruction loss and the guidance loss, respectively.
The invention has the following beneficial effects:
(1) The method is one of the first to realize zero-sample multi-focus image fusion, and can predict a clear fused image without supervision information or a large-scale image set;
(2) Inspired by DIP, two generative networks are applied to simultaneously estimate the depth priors of the clear fused image and the focus map; the invention combines the advantage of focus-map-estimation-based methods, which preserve the information of the source images well, with the advantage of fused-image-generation-based methods, which provide good visual quality;
(3) The invention avoids time-consuming and labor-intensive data collection and problems of model generalization, while achieving good results;
(4) The method is compared with several SOTA methods in the experiments, demonstrating the effectiveness of IM-Net.
Drawings
FIG. 1 shows the overall structure of IM-Net of the present invention; two generative networks, I-Net and M-Net, are combined to simultaneously estimate the fused image I_fused and the focus map I_m;
FIG. 2 is a comparison of the subjective results of the method of the present invention with the other 5 SOTA methods;
FIG. 3 is a comparison of the subjective results of the method of the present invention when the guidance loss or the perceptual loss is removed.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Because an untrained network can serve as a prior for many low-level vision tasks and does not require any training data, a multi-focus image fusion network IM-Net can be used, which includes two joint sub-networks I-Net and M-Net, where I-Net models the depth prior of the fused image and M-Net models the depth prior of the focus map, enabling zero-sample learning through the extracted prior information.
Fig. 1 shows the main structure of the method.
First, two input noises Z_i and Z_m are randomly sampled from a uniform distribution; then two hourglass-shaped networks based on U-Net, namely I-Net and M-Net, are used to obtain the fused image I_fused and the estimated focus map I_m.
A multi-focus image fusion method based on zero sample learning,
Information contained in the input multi-focus images is fused by using the multi-focus image fusion network structure IM-Net, wherein IM-Net comprises two joint sub-networks I-Net and M-Net, I-Net models the depth prior of the fused image, M-Net models the depth prior of the focus map, and zero-sample learning is realized through the extracted prior information;
the method comprises the following steps:
Step one: randomly sample two input noises Z_i and Z_m from a uniform distribution;
Step two: the input noise Z_i passes through the sub-network I-Net to obtain an estimated fused image I_fused, and the input noise Z_m passes through the sub-network M-Net to obtain an estimated focus map I_m;
Step three: a reconstruction loss function ensures that all information in the generated image comes from the source images;
Step four: a guidance loss is used to guide the network to learn sharp information, and a perceptual loss is used to enhance the visual perception of I_fused; the overall loss function is calculated;
Step five: the transfer from the source images to the fused image is completed.
Multi-focus image fusion is thus converted into a zero-sample, self-supervised learning problem consisting of two generative networks.
Multi-focus image fusion (MFIF) can be viewed as estimating a focus map I_m ∈ [0,1]^(M×N) from the source images I_A, I_B ∈ ℝ^(M×N×C), where M and N respectively denote the height and width of the source images and C denotes the number of source image channels;
the estimated fused image I_fused is calculated as:
I_fused = I_A ⊙ I_m + I_B ⊙ (1 − I_m)
s.t. 0 ≤ (I_m)_{i,j} ≤ 1,
where ⊙ denotes the element-wise product and 1 denotes an all-ones matrix of size M × N.
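As a minimal illustration of this weighting formula (a sketch only, not the claimed network; tensor names and shapes are assumptions), the fused image can be computed in PyTorch as follows:

    import torch

    def fuse_with_focus_map(i_a: torch.Tensor, i_b: torch.Tensor, i_m: torch.Tensor) -> torch.Tensor:
        """I_fused = I_A ⊙ I_m + I_B ⊙ (1 - I_m).

        i_a, i_b: source images of shape (B, C, M, N), values in [0, 1].
        i_m:      estimated focus map of shape (B, 1, M, N), values in [0, 1].
        """
        i_m = i_m.clamp(0.0, 1.0)             # enforce 0 <= (I_m)_{i,j} <= 1
        return i_a * i_m + i_b * (1.0 - i_m)  # element-wise products, broadcast over channels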
DIP uses a randomly initialized deep network to fit a single image and uses the extracted features as a depth prior for that image. The lack of a large-scale data set makes the model prone to overfitting, but a well-designed hourglass-type network can greatly alleviate this problem. U-Net is good at extracting both low-level and high-level information, so DIP uses U-Net as its backbone.
The asymmetric design between the downsampling and upsampling blocks effectively avoids trivial solutions, i.e., I_fused looking identical to I_A or I_B, or I_m appearing completely white or black, which would make the algorithm difficult to optimize. The large number of BatchNorm layers also helps the network fit high-frequency components better. The network architecture of the proposed IM-Net is shown in FIG. 1.
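The patent does not disclose the exact layer configuration of the two hourglass networks; the following is a much-simplified, assumed U-Net-style generator (depth, channel counts and activations are illustrative only) of the kind described, with BatchNorm layers and a skip connection between the downsampling and upsampling paths:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_block(in_ch, out_ch, stride=1):
        # Conv + BatchNorm + LeakyReLU: the basic unit of this assumed hourglass backbone.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2, inplace=True),
        )

    class HourglassGenerator(nn.Module):
        """Simplified U-Net-style generator; layer sizes are illustrative assumptions."""

        def __init__(self, in_ch=32, out_ch=3):
            super().__init__()
            self.down1 = conv_block(in_ch, 64, stride=2)   # downsampling (encoder) path
            self.down2 = conv_block(64, 128, stride=2)
            self.up2 = conv_block(128, 64)                 # upsampling (decoder) path
            self.up1 = conv_block(64 + 64, 64)             # skip connection from down1
            self.head = nn.Sequential(nn.Conv2d(64, out_ch, kernel_size=1), nn.Sigmoid())

        def forward(self, z):
            d1 = self.down1(z)
            d2 = self.down2(d1)
            u2 = F.interpolate(self.up2(d2), scale_factor=2, mode="bilinear", align_corners=False)
            u1 = self.up1(torch.cat([u2, d1], dim=1))
            u1 = F.interpolate(u1, scale_factor=2, mode="bilinear", align_corners=False)
            return self.head(u1)

    # I-Net and M-Net would be two such instances, e.g. HourglassGenerator(out_ch=3) for the fused
    # image and HourglassGenerator(out_ch=1) for the focus map (an assumption consistent with the text).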
Since there is no supervision information to guide pixel-to-pixel generation, the reconstruction loss is derived from the calculation formula of the fused image I_fused:
L_rec = || f_I(Z_i) − ( I_A ⊙ f_M(Z_m) + I_B ⊙ (1 − f_M(Z_m)) ) ||_1
where f_I(·) and f_M(·) denote I-Net and M-Net, respectively.
The reconstruction loss ensures that the information of the generated image is all from the source image, thereby avoiding the occurrence of artifacts.
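A hedged sketch of this reconstruction constraint, assuming the two networks are callables like the generator above and that an L1 penalty is used (consistent with the implementation details given later):

    import torch
    import torch.nn.functional as F

    def reconstruction_loss(i_net, m_net, z_i, z_m, i_a, i_b):
        """L_rec ties I-Net's output to the mask-weighted combination produced via M-Net (L1 assumed)."""
        i_fused = i_net(z_i)                       # estimated fused image
        i_m = m_net(z_m).clamp(0.0, 1.0)           # estimated focus map
        target = i_a * i_m + i_b * (1.0 - i_m)     # I_A ⊙ I_m + I_B ⊙ (1 - I_m)
        return F.l1_loss(i_fused, target)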
At the same time, under the guidance of high-level semantic information, the generation-based method can eliminate the brightness-inconsistency noise caused by selection-based fusion strategies, in which pixels are easily misclassified, especially in the boundary region between in-focus and out-of-focus areas. Thus, the method of the present invention combines the advantages of focus-map-estimation-based methods and fused-image-generation-based methods.
However, there is no constraint on whether the information obtained by I-Net is sharp or blurred, and experiments show that DIP tends to select blurred regions rather than sharp ones.
The guidance loss is used to guide the network to learn sharp information:
L_guide = || f_M(Z_m) − I_s ||_1
where I_s is an initial focus map calculated from the gradient information of the source images:
I_s = sign( abs(I_A − lp(I_A)) − min( abs(I_A − lp(I_A)), abs(I_B − lp(I_B)) ) )
where lp denotes low-pass filtering;
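A possible implementation of this initial focus map, assuming a Gaussian blur as the unspecified low-pass filter lp (kernel size, sigma, and the final channel averaging are assumptions):

    import torch
    import torch.nn.functional as F

    def gaussian_kernel(ksize: int = 9, sigma: float = 2.0) -> torch.Tensor:
        ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2.0
        g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
        k = torch.outer(g, g)
        return (k / k.sum()).view(1, 1, ksize, ksize)

    def low_pass(img: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
        # Depth-wise Gaussian blur as the low-pass filter lp(.)
        c = img.shape[1]
        return F.conv2d(img, kernel.expand(c, 1, -1, -1), padding=kernel.shape[-1] // 2, groups=c)

    def initial_focus_map(i_a: torch.Tensor, i_b: torch.Tensor) -> torch.Tensor:
        """I_s = sign( |I_A - lp(I_A)| - min(|I_A - lp(I_A)|, |I_B - lp(I_B)|) )."""
        k = gaussian_kernel().to(i_a.device)
        hf_a = (i_a - low_pass(i_a, k)).abs()      # high-frequency (gradient-like) energy of I_A
        hf_b = (i_b - low_pass(i_b, k)).abs()      # high-frequency energy of I_B
        i_s = torch.sign(hf_a - torch.minimum(hf_a, hf_b))
        return i_s.mean(dim=1, keepdim=True)       # collapse channels to a single-channel map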
A perceptual loss L_per is used to enhance the visual perception of I_fused and is computed on deep features extracted by Φ(·), a pre-trained ResNet-101 [the formula is given as an image in the original];
the overall loss function is:
L_total = α·L_rec + β·L_guide + L_per
where α and β are the weights of the reconstruction loss and the guidance loss, respectively.
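The exact perceptual term is only given as an image in the original, so the sketch below is an assumed reading: ResNet-101 features of the estimated fused image are compared against features of the mask-weighted source combination, and the three terms are combined with the weights α and β as stated above. The class and function names are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision

    class PerceptualExtractor(nn.Module):
        """Frozen ResNet-101 trunk used as a feature extractor; the cut depth (layer2) is an assumption."""
        def __init__(self):
            super().__init__()
            backbone = torchvision.models.resnet101(weights=torchvision.models.ResNet101_Weights.DEFAULT)
            self.trunk = nn.Sequential(*list(backbone.children())[:6])  # conv1 ... layer2
            for p in self.trunk.parameters():
                p.requires_grad_(False)

        def forward(self, x):
            # ImageNet input normalization is omitted here for brevity.
            return self.trunk(x)

    def total_loss(i_fused, i_m, i_s, i_a, i_b, phi, alpha=20.0, beta=0.05):
        """alpha and beta weight the reconstruction and guidance terms, as in the formula above."""
        target = i_a * i_m + i_b * (1.0 - i_m)         # mask-weighted source combination
        l_rec = F.l1_loss(i_fused, target)             # reconstruction loss
        l_guide = F.l1_loss(i_m, i_s)                  # guidance loss (assumed L1 form)
        l_per = F.l1_loss(phi(i_fused), phi(target))   # perceptual loss (assumed feature pairing)
        return alpha * l_rec + beta * l_guide + l_per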
Implementation details:
the proposed algorithm was implemented using a Pytorch framework, all experiments were performed on a server equipped with NVIDIA RTX 1080Ti GPU. The learning rate is set to 0.01, and α and β are set to 20 and 0.05, respectively. It is noted that after a certain number of iterations (e.g., 800 for a total number of 1500), α is set to 0, mainly because I is assumed to be early in the fusionsThe guiding information of (a) enables DIP to find clear areas, while in later stages, the guiding loss will limit further optimization of I-Net.
Since minimizing the KL divergence in a generative model is equivalent to optimizing the L2 norm under Gaussian assumptions, the L2 norm has smooth optimization characteristics and is better suited to image classification tasks; the L1 norm is widely used in the experiments because it better preserves edge information. In addition, in the last 700 iterations, the L1 term in the reconstruction loss is replaced with SSIM, which is more consistent with the human visual system.
Experimental setup:
data set and indices: the widely used real dataset Lytro was used to demonstrate the effectiveness and generalization of IM-Net. The Lytro dataset contains 20 pairs of multi-focus images, with a size of 520 x 520 pixels. Since no supervisory information is available, the results cannot be directly compared to the underlying real data. Therefore, many works propose various indexes to objectively evaluate MFIF. Selection of Qen,Qabf,Qscd,QsdThe reason for the quantization index of the present invention is as follows. QenCalculation of IfusedMay represent the amount of information retained. QabfIs a novel objective non-reference quality evaluation index for image fusion, and uses local characteristics to estimate the storage degree of significant information and obtain QabfA higher value indicates better quality of the fused image. QscdIs the sum of the difference correlations, higher QscdThe value represents less spurious information. QsdIs a standard deviation reflecting the preservation of high frequency information.
Comparison methods: in the experiments of the present invention, the proposed IM-Net is compared with several SOTA methods, including DCTVar (multi-focus image fusion for visual sensor networks in the DCT domain), DSIFT (multi-focus image fusion with dense SIFT), CNN (multi-focus image fusion with a deep convolutional neural network), MFF-GAN (an unsupervised generative adversarial network with adaptive and gradient joint constraints for multi-focus image fusion) and SESF (SESF-Fuse: an unsupervised deep model for multi-focus image fusion). Among them, DCTVar and DSIFT are traditional methods, based on the transform domain and the spatial domain, respectively. CNN is the first supervised-learning-based method; it is trained on a synthetic data set consisting of 1,000,000 pairs of 16 × 16 image blocks containing positive and negative examples. MFF-GAN is an unsupervised, image-generation-based approach trained on the real data set Lytro and a synthetic data set without supervision information. SESF is an unsupervised method using a pre-trained auto-encoder. In contrast to these works, the method of the present invention is both unsupervised and untrained (no image data set is used for training).
Comparative experiment:
subjective results: subjective results of six different multi-focus image fusion methods on four representative source images selected by Lytro. It can be seen that in the results of the methods based on focus map estimation (including DSIFT, CNN and SESF), artifacts appear in the focus-defocus boundary region. In addition, CNN and SESF use many post-processing steps to make the focus map smoother over the boundary regions, which is inaccurate and may cause certain scenes to become blurred. As for DCTVar, it cannot eliminate severe blocking effects. MFF-GAN can cancel blur well, but has a tendency to produce noise and ringing easily due to instability of GAN training. Thanks to the architectural design that preserves the consistency between adjacent pixels and the reconstruction penalty design that preserves the edge and texture information, the method of the present invention achieves more accurate results and better visual perception.
TABLE 1. Objective comparison of the method of the invention with 5 other SOTA methods; the two best results are marked in bold
Objective results: table 1 lists the objective performance of the different fusion methods using the above indices. For QabfThe method based on deep learning has better performance than the traditional method. This is because QabfVisual perception can be reflected, and the image quality can be greatly improved by the deep learning-based method. For Qen,QscdAnd QsdThe method of the present invention can provide results comparable to SOTA. Both subjective and objective results indicate that the method of the present invention can greatly preserve texture, edge information and image quality. Despite the absence of training data (zero sample learning), the method of the invention still achieves competitive objective performance.
Ablation analysis
To demonstrate the effectiveness of the guidance loss and the perceptual loss, ablation experiments are performed. FIG. 3 shows two example source-image pairs and the corresponding I_s. FIG. 3 also shows that without the guidance from I_s, IM-Net tends to generate a completely blurred fused image, which is of course an undesirable result. This is believed to occur because the noise-reduction characteristic of DIP makes it easier to model low-frequency information than high-frequency information.
By comparing the guide image with the focus map obtained without using the perceptual loss, it can be found that the proposed strategy, in which the guidance loss acts only in the early stage, plays an important role in obtaining a more consistent focus map. Thanks to the high-level semantic information provided by the U-Net structure, the focus map generated by DIP tends to be more concentrated and block-like, regardless of whether guidance information is present. Also, if the focus maps obtained with and without the perceptual loss are further compared, it is found that the perceptual loss also helps to maintain the integrity of the focus map while improving image quality.
The multi-focus image fusion method based on zero-sample learning provided by the invention has been described in detail above. A numerical simulation example is used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (4)

1. A multi-focus image fusion method based on zero-sample learning, characterized in that
information contained in the input multi-focus images is fused by using a multi-focus image fusion network structure IM-Net, wherein IM-Net comprises two joint sub-networks I-Net and M-Net, I-Net models the depth prior of the fused image, M-Net models the depth prior of the focus map, and zero-sample learning is realized through the extracted prior information;
the method comprises the following steps:
Step one: randomly sample two input noises Z_i and Z_m from a uniform distribution;
Step two: the input noise Z_i passes through the sub-network I-Net to obtain an estimated fused image I_fused, and the input noise Z_m passes through the sub-network M-Net to obtain an estimated focus map I_m;
Step three: a reconstruction loss function ensures that all information in the generated image comes from the source images;
Step four: a guidance loss is used to guide the network to learn sharp information, and a perceptual loss is used to enhance the visual perception of I_fused; the overall loss function is calculated;
Step five: the transfer from the source images to the fused image is completed.
2. The method of claim 1, further comprising:
the source images are I_A, I_B ∈ ℝ^(M×N×C), and the focus map to be estimated is I_m ∈ [0,1]^(M×N), where M and N respectively denote the height and width of the source images and C denotes the number of source image channels;
the estimated fused image I_fused is calculated as:
I_fused = I_A ⊙ I_m + I_B ⊙ (1 − I_m)
s.t. 0 ≤ (I_m)_{i,j} ≤ 1,
where ⊙ denotes the element-wise product and 1 denotes an all-ones matrix of size M × N.
3. The method of claim 2, further comprising: the reconstruction loss is derived from the calculation formula of the fused image I_fused:
L_rec = || f_I(Z_i) − ( I_A ⊙ f_M(Z_m) + I_B ⊙ (1 − f_M(Z_m)) ) ||_1
where f_I(·) and f_M(·) denote I-Net and M-Net, respectively.
4. The method of claim 3, further comprising: the guidance loss is used to guide the network to learn sharp information:
L_guide = || f_M(Z_m) − I_s ||_1
where I_s is an initial focus map calculated from the gradient information of the source images:
I_s = sign( abs(I_A − lp(I_A)) − min( abs(I_A − lp(I_A)), abs(I_B − lp(I_B)) ) )
where lp denotes low-pass filtering;
a perceptual loss L_per is used to enhance the visual perception of I_fused and is computed on deep features extracted by Φ(·), a pre-trained ResNet-101 [the formula is given as an image in the original];
the overall loss function is:
L_total = α·L_rec + β·L_guide + L_per
where α and β are the weights of the reconstruction loss and the guidance loss, respectively.
CN202110644185.XA 2021-06-09 2021-06-09 Multi-focus image fusion method based on zero sample learning Active CN113313663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110644185.XA CN113313663B (en) 2021-06-09 2021-06-09 Multi-focus image fusion method based on zero sample learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110644185.XA CN113313663B (en) 2021-06-09 2021-06-09 Multi-focus image fusion method based on zero sample learning

Publications (2)

Publication Number Publication Date
CN113313663A true CN113313663A (en) 2021-08-27
CN113313663B CN113313663B (en) 2022-09-09

Family

ID=77378364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110644185.XA Active CN113313663B (en) 2021-06-09 2021-06-09 Multi-focus image fusion method based on zero sample learning

Country Status (1)

Country Link
CN (1) CN113313663B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762484A (en) * 2021-09-22 2021-12-07 辽宁师范大学 Multi-focus image fusion method for deep distillation
CN114612362A (en) * 2022-03-18 2022-06-10 四川大学 Large-depth-of-field imaging method and system for generating countermeasure network based on multipoint spread function
CN115511767A (en) * 2022-11-07 2022-12-23 中国科学技术大学 Self-supervised learning multi-modal image fusion method and application thereof

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084773A (en) * 2019-03-25 2019-08-02 西北工业大学 A kind of image interfusion method based on depth convolution autoencoder network
CN111260594A (en) * 2019-12-22 2020-06-09 天津大学 Unsupervised multi-modal image fusion method
CN111275655A (en) * 2020-01-20 2020-06-12 上海理工大学 Multi-focus multi-source image fusion method
KR20200080966A (en) * 2018-12-27 2020-07-07 인천대학교 산학협력단 Multi focused image fusion method
US20200252545A1 (en) * 2019-02-01 2020-08-06 Electronics And Telecommunications Research Institute Method and apparatus for generating all-in-focus image using multi-focus image
CN112200887A (en) * 2020-10-10 2021-01-08 北京科技大学 Multi-focus image fusion method based on gradient perception
CN112785539A (en) * 2021-01-30 2021-05-11 西安电子科技大学 Multi-focus image fusion method based on image adaptive decomposition and parameter adaptive

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200080966A (en) * 2018-12-27 2020-07-07 인천대학교 산학협력단 Multi focused image fusion method
US20200252545A1 (en) * 2019-02-01 2020-08-06 Electronics And Telecommunications Research Institute Method and apparatus for generating all-in-focus image using multi-focus image
CN110084773A (en) * 2019-03-25 2019-08-02 西北工业大学 A kind of image interfusion method based on depth convolution autoencoder network
CN111260594A (en) * 2019-12-22 2020-06-09 天津大学 Unsupervised multi-modal image fusion method
CN111275655A (en) * 2020-01-20 2020-06-12 上海理工大学 Multi-focus multi-source image fusion method
CN112200887A (en) * 2020-10-10 2021-01-08 北京科技大学 Multi-focus image fusion method based on gradient perception
CN112785539A (en) * 2021-01-30 2021-05-11 西安电子科技大学 Multi-focus image fusion method based on image adaptive decomposition and parameter adaptive

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
HAN XU; JIAYI MA; JUNJUN JIANG; XIAOJIE GUO; HAIBIN LING: "U2Fusion: A Unified Unsupervised Image Fusion Network", 《IEEE》, 28 July 2020 (2020-07-28) *
JIAYI MA; ZHULIANG LE; XIN TIAN; JUNJUN JIANG: "SMFuse: Multi-Focus Image Fusion Via Self-Supervised Mask-Optimization", 《IEEE》, 4 March 2021 (2021-03-04) *
JUN CHEN; XUEJIAO LI; LINBO LUO; JIAYI MA: "Multi-Focus Image Fusion Based on Multi-Scale Gradients and Image Matting", 《IEEE》, 5 February 2021 (2021-02-05) *
ZHIYU CHEN; DONG WANG; SHAOYAN GONG; FENG ZHAO: "Application of multi-focus image fusion in visual power patrol inspection", 《IEEE》, 2 October 2017 (2017-10-02) *
LIU Ziwen et al.: "Multi-focus image fusion under self-learned rules", Journal of Image and Graphics, no. 08, 12 August 2020 (2020-08-12) *
LUO Xiaoqing et al.: "Multi-focus image fusion method based on a joint convolutional auto-encoder network", Control and Decision, no. 07, 19 March 2019 (2019-03-19) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762484A (en) * 2021-09-22 2021-12-07 辽宁师范大学 Multi-focus image fusion method for deep distillation
CN114612362A (en) * 2022-03-18 2022-06-10 四川大学 Large-depth-of-field imaging method and system for generating countermeasure network based on multipoint spread function
CN115511767A (en) * 2022-11-07 2022-12-23 中国科学技术大学 Self-supervised learning multi-modal image fusion method and application thereof

Also Published As

Publication number Publication date
CN113313663B (en) 2022-09-09

Similar Documents

Publication Publication Date Title
CN113313663B (en) Multi-focus image fusion method based on zero sample learning
US11055828B2 (en) Video inpainting with deep internal learning
CN109360156B (en) Single image rain removing method based on image block generation countermeasure network
Tran et al. GAN-based noise model for denoising real images
Fan et al. Two-layer Gaussian process regression with example selection for image dehazing
Yin et al. Highly accurate image reconstruction for multimodal noise suppression using semisupervised learning on big data
CN113658051A (en) Image defogging method and system based on cyclic generation countermeasure network
Lin et al. Real photographs denoising with noise domain adaptation and attentive generative adversarial network
Hu et al. ZMFF: Zero-shot multi-focus image fusion
Wang et al. Joint iterative color correction and dehazing for underwater image enhancement
Fan et al. Multi-scale depth information fusion network for image dehazing
KR20070016909A (en) Apparatus and method of processing super-resolution enhancement
CN111724400A (en) Automatic video matting method and system
Lin et al. Global structure-guided learning framework for underwater image enhancement
Chen et al. Image denoising via deep network based on edge enhancement
CN113763300B (en) Multi-focusing image fusion method combining depth context and convolution conditional random field
Xiao et al. DMDN: Degradation model-based deep network for multi-focus image fusion
CN113487526B (en) Multi-focus image fusion method for improving focus definition measurement by combining high-low frequency coefficients
Hou et al. Joint learning of image deblurring and depth estimation through adversarial multi-task network
Chaudhry et al. Underwater visibility restoration using dehazing, contrast enhancement and filtering
Liu et al. Underwater optical image enhancement based on super-resolution convolutional neural network and perceptual fusion
Liu et al. A fast multi-focus image fusion algorithm by DWT and focused region decision map
Gu et al. Continuous bidirectional optical flow for video frame sequence interpolation
Lin et al. SMNet: synchronous multi-scale low light enhancement network with local and global concern
Zhang et al. Multi-scale attentive feature fusion network for single image dehazing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant