CN117078510B - Single image super-resolution reconstruction method of latent features - Google Patents

Single image super-resolution reconstruction method of latent features

Info

Publication number
CN117078510B
CN117078510B (application CN202310373066.4A)
Authority
CN
China
Prior art keywords: image, noise, encoder, feature, net
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310373066.4A
Other languages
Chinese (zh)
Other versions
CN117078510A (en)
Inventor
王鑫
颜靖柯
蔡竟业
邓建华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Publication of CN117078510A
Application granted
Publication of CN117078510B
Legal status: Active

Classifications

    • G06T3/4053 - Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06N3/0455 - Auto-encoder networks; encoder-decoder networks
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N3/094 - Adversarial learning
    • G06T2207/20081 - Training; learning
    • G06T2207/20084 - Artificial neural networks [ANN]


Abstract

The invention discloses a single-image super-resolution reconstruction method based on latent features, belonging to the technical field of image processing. To ensure that the diffusion probability model samples with high quality in a small number of sampling steps, the high-resolution image is reconstructed with a multi-modal distribution model built on a generator and a normalizing flow, so that the high-frequency detail of the high-resolution image is recovered within a small number of iteration steps. The low-resolution image is converted into a hidden condition through an adaptive multi-head attention mechanism and a variational autoencoder and supplied as the conditional input of the model; this enables fast sampling while reducing the negative influence of mode collapse, so that complex, diverse and high-quality high-resolution images are generated. The prediction randomness introduced by maximizing the variational lower bound in the diffusion probability model is limited through the adaptive multi-head attention mechanism and the variational autoencoder, so that model training is stable and images consistent with the style and content of the original high-resolution image can be generated.

Description

Single image super-resolution reconstruction method of latent features
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a single-image super-resolution reconstruction method based on latent features.
Background
Single image super-resolution (SISR) reconstruction is a very important task in research fields such as computer vision and image processing. The SISR task is to reconstruct the corresponding high-resolution (HR) image from a low-resolution (LR) image. Because LR images lose a great deal of detail and texture during image degradation, the reconstructed HR image must supply rich image detail and sharp texture. However, an infinite number of HR images may correspond to a single LR image, and a single HR image may likewise be recovered from a variety of differently degraded LR images, so SISR is a typical ill-posed problem. Researchers have proposed various conventional SISR methods such as iterative back-projection, projection onto convex sets and sparse representation, but conventional methods typically estimate the blur kernel explicitly before reconstructing the HR image. Errors in the estimated blur kernel therefore propagate into the reconstruction, so the reconstructed HR images are often less than ideal.
The SISR task can also be regarded as a typical generation task, whose aim is to fit the probability distribution of the data with a generator so that the generated distribution is as close as possible to the real data distribution. Deep-learning methods for this task can be divided into five categories: CNN-based methods, methods based on generative adversarial networks (GAN), flow-based methods, methods based on variational autoencoders (VAE), and methods based on diffusion probability models (DDPM). These generative models face three main dilemmas on SISR tasks: high-quality sampling, fast sampling, and diversity of detail after sampling. CNN-based methods can fit arbitrary functions but not arbitrary probability distributions, so methods built only on CNNs have difficulty avoiding perceptually unrealistic reconstructions, artifacts and similar problems. GAN-based methods, which reconstruct images using perceptual and adversarial losses, are also common in SISR; although they sample quickly, they suffer from mode collapse, unstable training and related issues. Flow-based methods can infer latent variables exactly through the log-likelihood, increasing the diversity of the generated images, but the generated images are overly smooth. VAE-based methods can generate more diverse data using additional conditions and also sample quickly; however, their sample quality is low, and the reconstructed HR images lose detail and texture.
Recently, DDPM has achieved strong results on generation tasks such as image synthesis and speech synthesis. DDPM uses a Markov chain to transform latent variables drawn from a Gaussian distribution into data from a complex distribution, thereby addressing the "one-to-many" problem of the SISR task and improving the quality of the reconstructed data. However, unlike other generation tasks, applying DDPM to SISR also requires solving the following problems:
(1) The reverse diffusion process of DDPM on SISR tasks requires a complex probability distribution to model the denoising distribution, so DDPM needs thousands of evaluation steps to draw a single sample. If DDPM uses only a small number of sampling steps, the quality of the sampled image is low.
(2) DDPM takes unconditional or simply conditioned model inputs, while SISR tasks require the LR image to be fully exploited as the model's conditional input in order to constrain the solution space of the HR image.
Disclosure of Invention
The invention provides a single-image super-resolution reconstruction method based on latent features, which improves the super-resolution reconstruction performance for a single image.
The invention adopts the technical scheme that:
A single-image super-resolution reconstruction method of latent features, the method comprising:
step 1, constructing and training a latent-feature-oriented diffusion probability model;
the training data of the latent-feature-oriented diffusion probability model comprise a number of image groups, each consisting of a high-resolution image Y and a low-resolution image X;
the latent-feature-oriented diffusion probability model comprises a first image encoder, a discriminator, a noise U-Net network and a noise generator;
the input of the first image encoder is the low-resolution image X, and its output is the encoding feature τ_θ(X) of the low-resolution image with a specified dimension;
the noise generator is used to output the noise distribution ε_i of the i-th step;
the noise image X_i of the i-th step is obtained as X_i = X_{i-1} + ε_i, where i = 1, …, T, T denotes a preset number of processing steps, and X_0 = Y;
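By way of illustration only, the following sketch shows this additive forward-noising schedule; the tensor shapes and the per-step noise scale are assumptions of the illustration, not limitations of the method:

```python
import torch

def forward_noising(Y: torch.Tensor, T: int, noise_std: float = 0.1):
    """Additive forward noising X_i = X_{i-1} + eps_i with X_0 = Y.
    The per-step noise scale is an assumption of this illustration."""
    X = Y.clone()
    noise_images = []
    for i in range(1, T + 1):
        eps_i = noise_std * torch.randn_like(Y)  # noise distribution of step i
        X = X + eps_i                            # X_i = X_{i-1} + eps_i
        noise_images.append(X.clone())
    return noise_images

# usage: Y is a (batch, channels, height, width) high-resolution image tensor
Y = torch.rand(1, 3, 64, 64)
xs = forward_noising(Y, T=10)
```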
The noise U-Net network is used to predict noise and comprises a splicing layer, a conditional variational autoencoder, a U-Net encoder, a convolution layer, a flow generation model and a U-Net decoder; the splicing layer concatenates the encoding feature τ_θ(X) and the noise image X_i along the channel dimension and feeds the result into the conditional variational autoencoder and the U-Net encoder respectively; the conditional variational autoencoder comprises, in order, a first convolution layer, a feature encoder, a feature decoder and a second convolution layer, wherein the input of the feature encoder is the output of the first convolution layer, and the feature encoder is used to extract the mean μ_θ and variance σ_θ of the Gaussian distribution of the input features; the feature decoder outputs a probability distribution over the pixels of the image based on the input mean and variance; the second convolution layer projects that probability distribution into the spatial domain and outputs the conditional probability map feature F_R;
The U-Net encoder adopts an adaptive attention mechanism, wherein the input of the U-Net encoder comprises the output of the splicing layer and the encoding feature τ_θ(X); the encoding feature τ_θ(X) serves as the input of the key K and value V operations of the U-Net encoder's adaptive attention mechanism, the output of the splicing layer serves as the input of the query Q operation, and the feature map output by the U-Net encoder is denoted F_X; the spatial mean F_μ and spatial variance F_δ of the feature map F_X are learned through a convolution layer, and the conditional probability map feature F_R output by the conditional variational autoencoder is combined with the spatial variance F_δ by Hadamard product and then added to the spatial mean F_μ to obtain the fusion feature F_g;
The input of the flow generation model comprises the fusion feature F_g and the encoding feature τ_θ(X), and it is used to output a second probability distribution C over the pixels of the image;
The U-Net decoder is used to output the prediction noise (ε_θ)_i of the i-th step; the U-Net decoder adopts an adaptive attention mechanism, and its input comprises the second probability distribution C and the encoding feature τ_θ(X), wherein the encoding feature τ_θ(X) serves as the input of the key K and value V operations of the U-Net decoder's adaptive attention mechanism, and the second probability distribution C serves as the input of the query Q operation;
based on the noise ε_i generated by the noise generator and the prediction noise (ε_θ)_i output by the noise U-Net network, the noise image estimate X̂_i of the i-th step and the noise image estimate X̂_{i-1} of the (i-1)-th step are calculated and input into the discriminator, which outputs the confidence between X̂_i and X̂_{i-1}; during training, the discriminator also outputs the confidence between the noise images X_i and X_{i-1};
here α_i denotes the Gaussian-distribution transformation matrix of the current variance σ_θ, and ᾱ_i denotes the accumulation of α_1 through α_i;
based on a preset training loss function, the latent-feature-oriented diffusion probability model is trained on the training data until a preset stopping condition is reached, giving the trained latent-feature-oriented diffusion probability model;
the prediction noise (ε_θ)_T of the last predicted T-th step is retained and recorded as the noise image X̂_T; alternatively, the several prediction noises (ε_θ)_T whose confidence between X̂_T and X̂_{T-1} satisfies a specified condition are retained, and their average is recorded as the noise image X̂_T;
Step 2, reconstructing the high-resolution image of the target low-resolution image based on the first image encoder and the noise U-Net network of the trained latent-feature-oriented diffusion probability model:
inputting the target low-resolution image into the first image encoder to obtain a first image encoding feature;
inputting the first image encoding feature into the splicing layer, the U-Net encoder, the flow generation model and the U-Net decoder of the noise U-Net network respectively; starting from the T-th step, the target noise image of the current step is input into the splicing layer of the noise U-Net network, and the target prediction noise of the previous step is obtained from the network output; the initial value of the target noise image of the current step is the noise image X̂_T;
subtracting the target prediction noise of the previous step from the target noise image of the current step gives the target noise image of the previous step, which is fed back into the noise U-Net network; the prediction noise of each previous step is output iteratively in this way until the target prediction noise of step 1 is obtained;
and subtracting the target prediction noise of step 1 from the target noise image of step 1 gives the reconstructed high-resolution image.
Further, in step 2, the target noise image X̂_{t-1} of the previous step is calculated from the target noise image X̂_t of the current step and the target prediction noise (ε_θ)_t using the coefficient ᾱ_t, where ᾱ_t denotes the accumulation of α_T through α_t, and α_t denotes the Gaussian-distribution transformation matrix of the variance σ_θ output by the feature encoder of the conditional variational autoencoder at the t-th calculation step; the currently obtained target noise image X̂_{t-1} is fed back into the noise U-Net network, and the prediction noise of each previous step is output iteratively until the target prediction noise (ε_θ)_1 of step 1 is obtained; the reconstructed high-resolution image Ŷ is then obtained from X̂_1 and (ε_θ)_1.
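By way of illustration only, the inference loop of step 2 can be sketched as follows; the module names `encoder` and `unet` and their call signatures are assumptions of the illustration, not the patented implementation:

```python
import torch

@torch.no_grad()
def reconstruct_hr(x_lr, encoder, unet, T, x_T):
    """Step-2 inference loop: start from the retained noise image X_T-hat
    and strip one predicted noise per step until step 1 is reached."""
    cond = encoder(x_lr)          # first image encoding feature tau_theta(X)
    x = x_T.clone()               # initial target noise image
    for t in range(T, 0, -1):
        eps_pred = unet(x, cond)  # target prediction noise of the current step
        x = x - eps_pred          # target noise image of the previous step
    return x                      # reconstructed high-resolution image
```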
The technical scheme provided by the invention has at least the following beneficial effects:
(1) The invention reconstructs the HR image through a Markov chain with complex multi-modal distribution modeling, which reduces the negative influence of mode collapse on modeling the HR image while keeping sampling fast, so that complex, diverse and high-quality HR images can be generated.
(2) The invention limits the prediction randomness introduced by maximizing the variational lower bound in DDPM through the designed condition encoder, so that model training is stable and images consistent with the style and content of the original HR image can be generated.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of the network architecture of the LDDPM model according to an embodiment of the present invention.
FIG. 2 is an overview of the forward and reverse diffusion processes of the DDPM model in an embodiment of the present invention. The forward diffusion process runs right to left, the reverse diffusion process left to right, and θ denotes a learnable parameter.
FIG. 3 illustrates the LDDPM denoising process based on a generative adversarial network in accordance with an embodiment of the present invention.
FIG. 4 is a visualization of results obtained by different models on the Urban100 and Set14 datasets (2×) in an embodiment of the present invention.
Fig. 5 is a visualization of face SR (8×) on the CeleHQ dataset according to an embodiment of the present invention.
FIG. 6 is a visualization of gray-level histograms of pixel features of different models on the CeleHQ dataset in an embodiment of the present invention.
FIG. 7 is a visualization of LDDPM feature sampling under the same number of steps in an embodiment of the present invention.
FIG. 8 illustrates HR images generated by LDDPM at different step numbers, where t is the time to generate the HR image in seconds, in an embodiment of the present invention.
FIG. 9 shows experimental results of different models on a real-world dataset in an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
The embodiment of the invention provides a single-image super-resolution reconstruction method of latent features, which realizes super-resolution reconstruction based on a latent-feature-oriented diffusion probability model (LDDPM), thereby solving the problems DDPM faces on the SISR task, as follows:
(1) To ensure that DDPM performs high-quality sampling with a small number of sampling steps, the invention designs a multi-modal-distribution modeling method for HR images, in which the distribution is realized with a GAN and a normalizing flow, so that LDDPM focuses on reconstructing the high-frequency details of the HR image within a small number of iteration steps.
(2) To extract the feature information in the LR image and thereby constrain the solution space of the HR image, the invention designs an adaptive multi-head attention mechanism and a variational autoencoder that convert the LR image into a hidden condition serving as the conditional input of the model.
The LDDPM model provided by the invention has the following advantages:
(1) Fast and high-quality sampling: the HR image is reconstructed through a Markov chain with complex multi-modal distribution modeling, which reduces the negative influence of mode collapse on modeling the HR image while keeping sampling fast, so that complex, diverse and high-quality HR images can be generated.
(2) Stable style and content consistency: although the probability distribution of HR images is difficult to predict, the invention limits the prediction randomness introduced by maximizing the variational lower bound in DDPM by designing a novel condition encoder (an adaptive multi-head attention mechanism and a variational autoencoder), so that model training is stable and images consistent with the style and content of the original HR image can be generated.
In the embodiment of the invention, the latent-feature-oriented diffusion probability model is described as follows:
In the SISR task, an LR image X ∈ R^{w×h×c} is given and restored to the corresponding HR image Y ∈ R^{ws↑×hs↑×c}, where X is also called the source image, Y the target image, w, h, c are the width, height and number of channels of image X, and s↑ is the upsampling factor, i.e. after upsampling the LR image the image size (resolution) becomes ws↑ × hs↑. The super-resolution problem for a single image can then be described as in formula (1):

X = (Y ⊗ k)↓_s + n  (1)

where n denotes white Gaussian noise, k denotes the downsampling convolution kernel, and (⊗ k)↓_s denotes convolution followed by s-fold downsampling. The aim of the SISR task is to model formula (1) as a maximum a posteriori probability problem, as shown in formula (2):

Ŷ = argmax_Y log q(X|Y) + log q(Y)  (2)
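As an illustrative aside (not part of the claimed method), the degradation of formula (1) can be simulated as follows; the 5×5 box blur kernel, the scale factor and the noise level are assumptions of the example:

```python
import torch
import torch.nn.functional as F

def degrade(Y, kernel, s, noise_std=0.01):
    """LR = (Y conv k) downsampled by factor s, plus white Gaussian noise n."""
    c = Y.shape[1]
    k = kernel.expand(c, 1, *kernel.shape)                 # one blur kernel per channel
    blurred = F.conv2d(Y, k, padding=kernel.shape[-1] // 2, groups=c)
    lr = blurred[..., ::s, ::s]                            # s-fold downsampling
    return lr + noise_std * torch.randn_like(lr)           # additive white noise n

# usage with an assumed 5x5 box kernel and scale factor 2
Y = torch.rand(1, 3, 64, 64)
kernel = torch.full((5, 5), 1.0 / 25)
X = degrade(Y, kernel, s=2)
```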
where Ŷ denotes the reconstructed HR image, log q(Y) denotes the prior over HR images that the model optimizes, and log q(X|Y) denotes the log-likelihood of the LR image given the HR image. In the conventional SISR task, however, the model is not only prone to collapse but also fails to recover image detail well. DDPM converts a standard normal distribution into the empirical data distribution through a series of refinement steps (similar to Langevin dynamics), which reduces model collapse and retains more image detail. The invention therefore uses DDPM's stochastic iterative refinement process to learn the parameters of log q(X_1, X_2, …, X_T|Y) as an approximation, gradually mapping the source images X_1, X_2, …, X_T to the target image Y and thereby achieving a "one-to-many" mapping, with the target image Y kept as consistent as possible with the source images X_1, X_2, …, X_T. Formula (2) can thus be changed into the DDPM-based modeling of formula (3):

Ŷ = argmax_Y log q(X_1, X_2, …, X_T|Y) + log q(Y)  (3)

where X_i (i = 1, …, T) contains the LR image X together with the Gaussian noise added to X_{i-1} at step i, i.e. X_i denotes the noise image obtained at step i, and T is the total number of diffusion steps. In DDPM, Gaussian noise is added to Y step by step to generate the latent variables X_1, X_2, …, X_T.
The LDDPM of the invention is shown in Fig. 1; LDDPM is built on a T-stage DDPM. Instead of reconstructing the HR image directly, at each iteration step LDDPM uses the UNet network to predict the noise ε in the noise image X_i of the current i-th step. The invention also adds a conditional encoding mechanism to the LDDPM model, divided into conditional encoding based on an adaptive multi-head attention mechanism and conditional encoding based on a VAE. In the conditional encoding based on the adaptive multi-head attention mechanism, the LR image features encoded by the condition encoder are mapped to the intermediate layers of the UNet through the multi-head attention mechanism, guiding the UNet network to learn more latent features of the LR image. In the VAE-based conditional encoding, the VAE samples random feature vectors from the LR image X as the conditional feature F_R and combines them with the spatial mean map F_μ and variance map F_δ decomposed from the UNet encoder's feature map F_X, thereby transferring the conditional feature into the hidden space. The VAE not only fills in the information missing when the LR image is enlarged, but also constrains the solution space of the reconstructed HR image, making it easier for the model to learn the noise ε at the current step. For the fused encoder output feature F_g of the UNet network, the invention applies a normalizing flow, enabling the model to account for more complex probability-distribution biases. During training, to ensure high-quality sampling of the LDDPM model, a GAN is adopted to learn the multi-modal distribution of X_{i-1}, X_i, replacing the simple Gaussian distribution learned by the original DDPM, thereby reducing the Kullback-Leibler divergence (KL divergence) between the noise probability distributions of the denoising model and the real model.
In DDPM, the invention defines the HR image Y as the target variable, and q(Y) is the data probability distribution of that target variable. As shown in Fig. 2, DDPM consists of a forward diffusion process and a reverse diffusion process. The forward diffusion of DDPM aims to map Y to a multidimensional normal distribution (Gaussian noise) through a Markov chain, calculated as in formula (4):

q(X_1, …, X_T|X_0) = ∏_{i=1}^{T} q(X_i|X_{i-1}),  q(X_i|X_{i-1}) = N(X_i; √(1-β_i) X_{i-1}, β_i I)  (4)

where X_0 is defined as Y; X_i and Y are variables of the same dimension; T is the number of diffusion steps; q(X_1, …, X_T|X_0) denotes the forward diffusion process; q(X_i|X_{i-1}) is defined as the forward Gaussian distribution governed by the constant β_i; and I denotes the identity matrix of the unit-variance Gaussian. In this process a small amount of Gaussian noise is added at each diffusion step, and the final HR image is converted into a multidimensional Gaussian distribution whose dimensions are mutually independent.
The reverse diffusion of DDPM generates the HR image from samples of that Gaussian distribution, calculated as in formula (5):

p_θ(X_0, …, X_{T-1}|X_T) = p(X_T) ∏_{i=1}^{T} p_θ(X_{i-1}|X_i),  p_θ(X_{i-1}|X_i) = N(X_{i-1}; μ_θ(X_i, i), σ_θ(X_i, i) I)  (5)

where p_θ(X_0, …, X_{T-1}|X_T) denotes the reverse diffusion process; p_θ(X_{i-1}|X_i) denotes the reverse Gaussian distribution; μ_θ(·) and σ_θ(·) denote the mean and variance functions of the reverse diffusion process; and p(X_T) denotes the Gaussian distribution with mean 0 and variance 1 obeyed by X_T. Through this process the model gradually eliminates the Gaussian noise and finally generates an HR image conforming to the target distribution. Notably, during model training the model can sample and generate HR images by training only the mean function μ_θ(·) and the variance function σ_θ(·). In addition, the invention sets the function value of σ_θ(·) to a constant, so by re-parameterization μ_θ(·) can be rewritten as in formula (6):

μ_θ(X_i, i) = (1/√α_i) · ( X_i - ((1-α_i)/√(1-ᾱ_i)) · ε_θ(X_i, i) )  (6)

where α_i denotes the Gaussian-distribution transformation matrix of the variance, obtained from values preset in advance; ᾱ_i denotes the accumulation of α_1 through α_i; and ε_θ(·) denotes the noise predicted by the UNet network. Finally, DDPM can be interpreted as extracting the noise ε added at step i from X_i, given the image Y, the noise ε and the step number i. To achieve this, the model must learn effective feature information from X_i, ε and the step number i, so as to gradually map the HR image Y to the corresponding noise values according to the specified rule and, during reverse diffusion, generate a distribution close to Y from those noise values. The loss function L_DDPM of DDPM is therefore defined as in formula (7):

L_DDPM = E_{Y,ε,i} [ ‖ε - ε_θ(√ᾱ_i · Y + √(1-ᾱ_i) · ε, i)‖² ]  (7)

where E denotes the mathematical expectation.
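For illustration, a minimal sketch of the training objective of formula (7), assuming a `unet(x, i)` module that returns the predicted noise and a precomputed ᾱ table; both names are assumptions of the example:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(unet, Y, alphas_bar):
    """L_DDPM = E || eps - eps_theta(sqrt(abar_i)*Y + sqrt(1-abar_i)*eps, i) ||^2."""
    b = Y.shape[0]
    i = torch.randint(0, len(alphas_bar), (b,))        # random diffusion step per sample
    abar = alphas_bar[i].view(b, 1, 1, 1)
    eps = torch.randn_like(Y)                          # ground-truth noise
    x_i = abar.sqrt() * Y + (1 - abar).sqrt() * eps    # noised image of step i
    return F.mse_loss(unet(x_i, i), eps)
```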
In the present invention, the purpose of LDDPM on the SISR task is to model the conditional distribution p(X|Y). The synthesis of the HR image can therefore be controlled by encoding the LR image X as the conditional input of the function ε_θ(·). However, if X and the current noise image X_i were simply stacked together in the UNet network for conditional sampling, the UNet would not only tend to ignore perceptually relevant detail features, but would also require expensive function evaluations in pixel space to extract the noise features well. The LR image X is therefore encoded in DDPM and supplied as the conditional input of the model, so that the noise distribution is learned better.
For the conditional modeling of LDDPM, the invention not only stacks the current noise image X_i and the LR image X together, but also designs a conditional encoding mode based on an adaptive multi-head attention mechanism. First, an LR-image encoder τ_θ(·) is designed so that the features of the LR image X projected by τ_θ have the same dimension as the features of the UNet intermediate layer; the intermediate-layer features are then dot-multiplied with the projected features through the multi-head attention mechanism in the UNet. The conditional adaptive multi-head attention is therefore computed as in formula (8):

Attention(Q, K, V) = softmax(Q K^T / √d) · V, with Q = W_Q^{(k)} · φ(X_i), K = W_K^{(k)} · τ_θ(X), V = W_V^{(k)} · τ_θ(X)  (8)

where φ(X_i) denotes the flattening operation that changes X_i from a three-dimensional matrix into a two-dimensional matrix; W_Q^{(k)}, W_K^{(k)} and W_V^{(k)} denote respectively the query, key and value projection matrices of the k-th intermediate layer of the UNet; Q denotes the query matrix, K the key matrix and V the value matrix; d denotes a scaling parameter that rescales the similarity weights of the Q and K matrices so that the inner product does not grow too large; and τ_θ(X) denotes the encoding of X by the condition encoder. Notably, in the normalizing-flow-based Glow model (the flow generation model), the invention replaces the mask matrix in Glow (the values of the mask matrix determine the influence of neighbouring pixels on a new pixel value: the larger the mask value, the larger the influence) with the feature matrix output by the condition encoder τ_θ(·), so that the model can better learn the noise ε of the current i-th step.
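A minimal sketch of the conditional attention of formula (8) follows; the projection matrices are plain tensors here, and the exact patented module may differ:

```python
import torch

def conditional_attention(feat, cond, Wq, Wk, Wv):
    """Adaptive attention of formula (8): queries come from the UNet feature
    map, keys/values from the LR encoding tau_theta(X)."""
    b, c, h, w = feat.shape
    q = feat.flatten(2).transpose(1, 2) @ Wq   # flatten X_i to a 2-D matrix, then project
    k = cond @ Wk                              # cond: (b, n_tokens, c_cond)
    v = cond @ Wv
    d = q.shape[-1]                            # scaling parameter d
    attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
    out = attn @ v                             # (b, h*w, d)
    return out.transpose(1, 2).reshape(b, -1, h, w)
```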
The invention also designs a conditional variational autoencoder (CVAE) for conditional encoding in DDPM. Using the re-parameterization technique, the CVAE models the variational inference between the hidden variable Z and the observed variable C ∈ X_{1,2,…,T}. The CVAE projects the condition X_i into a latent space, thereby learning a latent conditional probability distribution, and its output is the conditional feature F_R. The decoder of the UNet refers to the conditional feature F_R output by the CVAE when learning the feature map F_g used for the UNet's prediction of the noise ε. The CVAE model consists mainly of a feature encoder, the hidden variable Z and a feature decoder, and is used to fit the likelihood function, as shown in formula (9):

p_θ(C|Z) = N(C; μ_vae(Z), σ_vae(Z))  (9)

where the value of the likelihood function depends on the μ_θ and σ_θ function calculations; μ_vae and σ_vae learn respectively the mean and variance of the CVAE's Gaussian distribution, through which the CVAE learns the relationship between pixels and represents it with a probability model; and p_θ(C|Z) denotes the likelihood function of the second probability distribution C for the fusion feature F_g.
In the CVAE, μ_vae and σ_vae can be used to represent the hidden variable Z through re-parameterization, as in formula (10):

Z_i = μ_vae + σ_vae ⊙ δ, δ ∼ N(0, I)  (10)

where Z_i denotes the hidden variable of step i, the hidden variable Z is sampled from the Gaussian distribution q(Z) = N(0, I), and δ obeys a Gaussian distribution with mean 0 and variance 1. To ensure that randomness is introduced during sampling and that the probability distribution learned by the CVAE stays close to a Gaussian distribution, the CVAE is optimized with the KL divergence, calculated as in formula (11):

L_cvae = (1/2) Σ ( μ_i² + σ_i² - log σ_i² - 1 )  (11)

where E denotes the mathematical expectation, and μ_i, σ_i denote respectively the mean and variance of X_i in the VAE.
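For illustration, the re-parameterization of formula (10) and the KL objective of formula (11) in a minimal sketch; the log-variance parameterization is an assumption of the example:

```python
import torch

def reparameterize(mu, logvar):
    """Z_i = mu + sigma * delta, with delta ~ N(0, I)  (formula (10))."""
    return mu + (0.5 * logvar).exp() * torch.randn_like(mu)

def kl_loss(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)) used to optimise the CVAE (formula (11));
    the closed form below is the standard VAE expression, assumed here."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar)
```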
In the forward diffusion of LDDPM, the Gaussian input of the CVAE's feature decoder is first replaced with the hidden variable Z. The probability distribution output by the feature decoder is then projected into the spatial domain with a convolution layer, giving the conditional probability map feature F_R. Finally, to map the conditional probabilities F_R onto the output of the UNet encoder, a convolution layer learns the spatial mean F_μ and spatial variance F_δ of the feature map F_X output by the UNet encoder, and F_R is fused with them to obtain the fusion feature F_g. Note that the mean F_μ and variance F_δ are spatial statistics of the feature map, not parameters of a Gaussian distribution. In addition, during LDDPM's reverse diffusion the feature encoder of the CVAE is removed, and a random Gaussian sample is used as the input variable Z of the feature decoder.
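A one-line sketch of this fusion, under the interpretation F_g = F_R ⊙ F_δ + F_μ described above (an interpretation of the text, not a verbatim quotation of the patented operator):

```python
import torch

def fuse(F_R, F_mu, F_delta):
    """F_g = F_R (Hadamard) F_delta + F_mu: the CVAE's conditional probability
    map modulates the spatial variance of the UNet feature map and is shifted
    by its spatial mean."""
    return F_R * F_delta + F_mu
```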
In the problem setting of LDDPM on the SISR task, the aim of the invention is to let the posterior encoder of LDDPM (the encoder of the test phase, i.e. of actual use after LDDPM training is completed) reconstruct the HR image accurately. The better the probability distribution the prior encoder of LDDPM (the training-phase encoder) learns, the better the posterior encoder can reconstruct the HR image. Since the prior and posterior encoders of the UNet are parameterized with Gaussian distributions in the CVAE of LDDPM, a normalizing-flow-based Glow model is applied to the feature map F_g. Glow is a flow model composed of several blocks, each containing a squeeze operation and a sequence of flow steps; each flow step contains an activation normalization, an invertible 1×1 convolution and a coupling layer. Based on the Gaussian distribution output by the CVAE, the Glow designed in the embodiment of the invention uses a bijective function f_θ(·) to convert the simple distribution into a more complex distribution C according to the rule of continuous change of the noise, so that the posterior encoder can more easily reconstruct the complex probability distribution p_θ(C|F_g) of the HR image; Glow is computed as in formula (12):

C = f_θ(F_g),  log p_θ(C|F_g) = log p(C) + log |det(∂f_θ/∂F_g)|  (12)
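For illustration, a simplified affine coupling layer of the kind used in each Glow flow step; actnorm and the invertible 1×1 convolution are omitted, and the small conditioning network is an assumption of the example:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One coupling layer of a Glow-style flow step (a sketch only)."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels // 2, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, 3, padding=1))

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)                # split channels in half
        log_s, t = self.net(xa).chunk(2, dim=1)  # scale and shift from xa
        yb = xb * log_s.exp() + t                 # bijective: invertible given xa
        logdet = log_s.flatten(1).sum(dim=1)      # log|det df/dx| contribution
        return torch.cat([xa, yb], dim=1), logdet
```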
The KL divergence D_KL(C, Y) in DDPM describes the information lost when the distribution of C replaces the distribution of Y. LDDPM should therefore keep the KL divergence between the probability distribution p_θ(X_{i-1}|X_i) of the denoising model in the reverse diffusion process and the probability distribution q(X_{i-1}|X_i) of the forward diffusion process as small as possible, so as to ensure a high degree of matching between the probability distributions of the real HR image and the reconstructed HR image. Many studies have shown that the data distribution approaches a single-mode Gaussian distribution as Gaussian noise is gradually added during forward diffusion, whereas during reverse diffusion the data distribution departs from the Gaussian and becomes more complex as the step size increases. The invention therefore designs a conditional GAN to estimate the real denoising distribution q(X_{i-1}|X_i), giving LDDPM stronger expressive power for modeling the multi-modal denoising distribution. The goal of the conditional GAN is to minimize the adversarial loss function, thereby minimizing D_KL(C, Y) and improving the match between the probability distribution p_θ(X_{i-1}|X_i) of LDDPM's reverse diffusion process and the true denoising distribution q(X_{i-1}|X_i) of the forward process. The conditional GAN proposed by the invention, shown in Fig. 3, is equipped with a time-dependent discriminator D_φ(·) whose inputs are the estimates X̂_{i-1} and X̂_i computed from the noise ε, and whose output is the confidence of X̂_{i-1} and X̂_i. The discriminator is trained by formula (13):

min_θ max_φ E_{q(X_i)} [ E_{q(X_{i-1}|X_i)} [log D_φ(X_{i-1}, X_i, i)] + E_{p_θ(X̂_{i-1}|X_i)} [log(1 - D_φ(X̂_{i-1}, X_i, i))] ]  (13)

Notably, the output of the GAN generator is the distribution of the noise ε, so X̂_{i-1} and X̂_i can be calculated using formula (14):

X̂_i = √ᾱ_i · Y + √(1-ᾱ_i) · ε_i,  X̂_{i-1} = √ᾱ_{i-1} · Y + √(1-ᾱ_{i-1}) · (ε_θ)_i  (14)
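A minimal sketch of the adversarial objective used to match p_θ(X_{i-1}|X_i) to the real denoising distribution; the discriminator signature D(x_prev, x_i, i) is an assumption of the example, and the patented discriminator may differ:

```python
import torch
import torch.nn.functional as F

def d_loss(D, x_prev, x_i, x_prev_fake, i):
    """Discriminator loss: real pairs (X_{i-1}, X_i) vs. generated pairs."""
    real = D(x_prev, x_i, i)
    fake = D(x_prev_fake.detach(), x_i, i)
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def g_loss(D, x_prev_fake, x_i, i):
    """Non-saturating generator loss pushing fakes toward 'real'."""
    fake = D(x_prev_fake, x_i, i)
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
```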
The distribution of the HR image reconstructed by the reverse diffusion process of LDDPM is more complex than that of DDPM, and LDDPM is an implicit model. The forward diffusion of LDDPM remains an additive-Gaussian-noise process, so the forward distribution q(X_{i-1}|X_i) obeys Gaussian properties regardless of the step size or the complexity of the data distribution. The p_θ(X_{i-1}|X_i) of LDDPM can therefore be expressed by formula (15):

p_θ(X_{i-1}|X_i) = ∫ p_θ(ε_θ|X_i) · q(X_{i-1}|X_i, ε_θ) dε_θ  (15)

where p_θ(ε_θ|X_i) is the implicit distribution introduced by the GAN generator G_θ(·), whose inputs are X_i and a random latent variable; q(X_{i-1}|X_i, ε_θ) denotes the corresponding conditional prior probability distribution; p_θ(ε_θ|X_i) denotes the posterior probability distribution of ε_θ conditioned on X_i; and p_θ(X_{i-1}|X_i) denotes the posterior probability distribution of X_{i-1} conditioned on X_i. During training, the LDDPM of the embodiment of the invention gradually maps Y to a Gaussian distribution through the Markov chain. However, the noise error grows with the number of iteration steps; to reduce the perceptual distance between the reconstructed HR image Ŷ and the real HR image Y caused by the noise error during training, the invention uses content perception and style perception to guide LDDPM in reconstructing the HR image. The idea of the invention is that the true denoised image X_i of each iteration step should pass its style information to the denoised image X̂_i reconstructed by LDDPM while preserving its content information.
Content loss: ledig et al demonstrate that the Mean Square Error (MSE) loss function is prone to lack of high frequency detail information, resulting in the reconstructed image in SISR task being prone to excessive smooth texture, in the present embodiment, the image X i and the image X i are calculated from the content perceived lossThereby preserving more detailed features of image X i. Unlike Ledig et al, the present invention additionally calculates the true HR image Y and reconstructed HR image/>, for the differences between pixelsIs a loss value of (2). Therefore, LDDPM is calculated as shown in formula (15).
Representing a loss network (i.e., representing the loss of noise Unet network output),/>Representing the output of the generator GAN.
Style loss: the calculation is shown in formula (16).
Thus, the total loss function of the LDDPM model is given by formula (18):

L_total = L_DDPM + L_cvae + L_adv + L_content + L_style  (18)
To further verify the performance of the latent-feature single-image super-resolution reconstruction method provided by the invention, LDDPM is compared with existing, well-performing image super-resolution models. This embodiment evaluates the reconstruction effect of LDDPM on a face image super-resolution dataset (8×: the image is reconstructed at 8 times the original resolution) and on general image super-resolution datasets (2×: the image is reconstructed at 2 times the original resolution).
For the general image super-resolution datasets (2×), the existing image super-resolution models EDSR, RCAN, SAN, IGNN, HAN and NLSA, as well as the LDDPM model of the invention, were first trained on the DIV2K dataset. The existing models SwinIR, SwinFIR and EDT, as well as LDDPM, were then trained on the DF2K dataset. Finally, with CelebA as a pre-training dataset, the existing model SR3 and LDDPM were trained, and the model parameters were fine-tuned on the DF2K dataset. Here EDSR, RCAN, SAN, IGNN and HAN are mainly CNN-based models; SwinIR, EDT and SwinFIR are mainly Transformer-based models; and SR3 and LDDPM are mainly DDPM-based models. Table 1 shows the experimental results on classical single-image super-resolution benchmarks. LDDPM reconstructs HR images with higher performance than the other advanced models over multiple test sets. In particular, on Urban100 the PSNR and SSIM of LDDPM improve on those of the EDT model by 2.07 dB and 1.95% respectively, demonstrating the effectiveness of LDDPM and providing a new approach to SISR tasks. Meanwhile, with CelebA as the pre-training dataset for LDDPM and SR3, Table 1 shows that LDDPM attains a PSNR of 42.96 dB and an SSIM of 97% on the Manga109 dataset; compared with the DDPM-based SR3 model, the PSNR improves by 6.57 dB and the SSIM by 1.75%, indicating that adding the pre-training dataset improves the reconstruction effect of LDDPM to a certain extent.
Finally, the HR images recovered by several models are visualized in Fig. 4. It can be seen from Fig. 4 that existing Transformer- and CNN-based models still leave considerable room for improvement in reconstructing the details and textures of complex images, whereas LDDPM handles these problems well and reconstructs HR images with high-frequency detail.
Table 1. Quantitative comparison of the LDDPM model with the most advanced models on classical image super-resolution data (2×)
Meanwhile, in the embodiment of the invention, the LDDPM model is compared experimentally with the existing image super-resolution models ESRGAN, ProgFSR, SRFlow, SRDiff and SR3 on the CeleHQ dataset, as shown in Table 2. Here LDDPM, SRDiff and SR3 are DDPM-based models, RRDB and ProgFSR are CNN-based models, ESRGAN is a GAN-based model, and SRFlow is a normalizing-flow-based model. As can be seen from Table 2, LDDPM is superior to all the models above on the evaluation indices: it increases PSNR by 0.96 dB and SSIM by 2.31% compared with the advanced SRDiff, demonstrating that LDDPM can generate high-quality, diverse HR images that remain strongly consistent with the LR input. As can be seen from Fig. 5, compared with other models the wrinkles on the elderly person's forehead and the woman's hair reconstructed by LDDPM look more natural and carry rich detail and texture.
Furthermore, the LDDPM of the invention (43M model parameters) uses fewer parameters than SR3 (98M) and SRDiff (52M) and needs only about 20 hours to converge on the CeleHQ dataset, whereas SRDiff takes 34 hours and SR3 takes 40 hours, indicating that LDDPM trains efficiently and obtains better performance with less computational overhead.
In this embodiment, as shown in Fig. 6, the important detail pixels of the woman's face in the HR images reconstructed by LDDPM, SR3 and SRDiff are displayed with gray-level histograms. Fig. 6 shows that LDDPM learns a more regular feature distribution, capturing good detail and texture features and hence better performance.
Table 2. Quantitative comparison of LDDPM with the most advanced models on the CeleHQ face dataset (8×)
Models PSNR(dB) SSIM(%)
ESRGAN(ECCV,2018) 23.24 66.45
ProgFSR(arXiv,2019) 24.21 72.24
SRFlow(ECCV,2019) 25.32 72.45
SRDiff(NC,2019) 25.38 74.21
SR3(T-PAMI,2022) 24.92 70.95
LDDPM 26.07 76.52
To demonstrate the effectiveness of the modules added to LDDPM, this embodiment also performed ablation experiments on the proposed modules on the CeleHQ dataset.
Conditional encoding: three models are defined to discuss the effect of conditional encoding on LDDPM. The conditional encoding of the first model (V1) simply stacks the low-resolution LR image with the noise picture of each stage before reconstructing the HR image. The second model (V2) adds the conditional encoding based on the adaptive multi-head attention mechanism to V1. The third model (V3) adds the conditional encoding based on the variational autoencoder to V2. From Table 3, the PSNR and SSIM of the V2 model with the adaptive multi-head attention mechanism improve on V1 by 1.05 dB and 1.23% respectively, showing that the adaptive multi-head attention mechanism provides more conditional features to guide the model in learning the probability distribution of HR images, keeping the reconstructed HR consistent with the real HR. The V3 model with the VAE improves on the PSNR and SSIM of V2 by 1.14 dB and 0.83%, showing that adding the VAE lets the model learn more latent conditional features in the LR image and use them to narrow the solution space of the HR image, thereby constraining the feature information of the image space.
Glow- and GAN-based model optimization: from row 2 of Table 4, adding Glow to LDDPM raises PSNR and SSIM by 1.03 dB and 3.19% respectively, demonstrating that Glow enables LDDPM to capture a more complex noise probability distribution. From row 3 of Table 4, adding the GAN raises PSNR and SSIM by a further 0.87 dB and 0.97%, demonstrating that the multi-modal distribution learned by the GAN makes the HR image reconstructed in the reverse process of LDDPM more realistic. In Fig. 7, this embodiment visualizes the features extracted by LDDPM with Glow and the GAN added. As can be seen from Fig. 7, relative to the original LDDPM, adding Glow and the GAN allows sampling with a relatively small total number of steps while learning a better probability distribution.
Optimizing the experimental hyper-parameters: to investigate the effect of the total number of diffusion steps and of the loss functions on LDDPM, this embodiment also performed hyper-parameter ablation experiments. As shown in Fig. 8, image quality improves as the total number of diffusion steps increases; however, a larger total number of diffusion steps slows down training and inference, so this embodiment chooses T = 1000 as the default setting. Finally, the influence of the content loss and the style loss on the experimental results is compared. As can be seen from Table 5, adding the content loss (CL) to LDDPM raises PSNR and SSIM by 0.56 dB and 0.49% respectively over row 1 of Table 5, and adding the style loss (SL) raises them by 0.18 dB and 0.64% over row 2. These results demonstrate that adding the content and style losses better guides LDDPM to learn more image feature information, making training more stable.
Table 3. Comparison of PSNR and SSIM metrics for the conditional encoding modules added to LDDPM; the best results are bolded.
Models PSNR(dB) SSIM(%)
V1 23.10 71.09
V2 24.15 72.32
V3 25.29 73.15
Table 4. Comparison of PSNR (dB) and SSIM (%) for LDDPM with Glow and GAN added; the best results are bolded.
Models PSNR(dB) SSIM(%)
LDDPM 23.43 71.23
LDDPM+Glow 24.46 74.42
LDDPM+Glow+GAN 25.33 75.39
Table 5. Comparison of PSNR (dB) and SSIM (%) for LDDPM with content loss and style loss added; the best results are bolded.
To evaluate the performance of the LDDPM model more fully, this embodiment also collected some low-resolution pictures from the real world. As shown in Fig. 9, the quality of the HR images reconstructed by LDDPM is superior to the reconstruction effect of SR3, EDF and SRDiff. Specifically, in Fig. 9 the images reconstructed by SR3, EDF and SRDiff all exhibit blurring and missing detail and texture, while the LDDPM model reconstructs clear images with complete detail and texture. The experiments on real-world datasets demonstrate that LDDPM generalizes strongly and can be applied well to SISR tasks in natural environments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
What has been described above is merely some embodiments of the present invention. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit of the invention.

Claims (10)

1. A single-image super-resolution reconstruction method of latent features, characterized by comprising the following steps:
step 1, constructing and training a latent-feature-oriented diffusion probability model;
the training data of the latent-feature-oriented diffusion probability model comprise a number of image groups, each consisting of a high-resolution image Y and a low-resolution image X;
the latent-feature-oriented diffusion probability model comprises a first image encoder, a discriminator, a noise U-Net network and a noise generator;
the input of the first image encoder is the low-resolution image X, and its output is the encoding feature τ_θ(X) of the low-resolution image with a specified dimension;
the noise generator is used to output the noise distribution ε_i of the i-th step;
the noise image X_i of the i-th step is obtained as X_i = X_{i-1} + ε_i, where i = 1, …, T, T denotes a preset number of processing steps, and X_0 = Y;
the noise U-Net network is used to predict noise and comprises a splicing layer, a conditional variational autoencoder, a U-Net encoder, a convolution layer, a flow generation model and a U-Net decoder; the splicing layer concatenates the encoding feature τ_θ(X) and the noise image X_i along the channel dimension and feeds the result into the conditional variational autoencoder and the U-Net encoder respectively; the conditional variational autoencoder comprises, in order, a first convolution layer, a feature encoder, a feature decoder and a second convolution layer, wherein the input of the feature encoder is the output of the first convolution layer, and the feature encoder is used to extract the mean μ_θ and variance σ_θ of the Gaussian distribution of the input features; the feature decoder outputs a probability distribution over the pixels of the image based on the input mean and variance; the second convolution layer projects that probability distribution into the spatial domain and outputs the conditional probability map feature F_R;
the U-Net encoder adopts an adaptive attention mechanism, wherein the input of the U-Net encoder comprises the output of the splicing layer and the encoding feature τ_θ(X); the encoding feature τ_θ(X) serves as the input of the key K and value V operations of the U-Net encoder's adaptive attention mechanism, the output of the splicing layer serves as the input of the query Q operation, and the feature map output by the U-Net encoder is denoted F_X; the spatial mean F_μ and spatial variance F_δ of the feature map F_X are learned through a convolution layer, and the conditional probability map feature F_R output by the conditional variational autoencoder is combined with the spatial variance F_δ by Hadamard product and then added to the spatial mean F_μ to obtain the fusion feature F_g;
the input of the flow generation model comprises the fusion feature F_g and the encoding feature τ_θ(X), and it is used to output a second probability distribution C over the pixels of the image;
the U-Net decoder is used to output the prediction noise (ε_θ)_i of the i-th step; the U-Net decoder adopts an adaptive attention mechanism, and its input comprises the second probability distribution C and the encoding feature τ_θ(X), wherein the encoding feature τ_θ(X) serves as the input of the key K and value V operations of the U-Net decoder's adaptive attention mechanism, and the second probability distribution C serves as the input of the query Q operation;
based on the noise ε_i generated by the noise generator and the prediction noise (ε_θ)_i output by the noise U-Net network, the noise image estimate X̂_i of the i-th step and the noise image estimate X̂_{i-1} of the (i-1)-th step are calculated and input into the discriminator, which outputs the confidence between X̂_i and X̂_{i-1}; during training, the discriminator also outputs the confidence between the noise images X_i and X_{i-1};
here α_{i-1} denotes the Gaussian-distribution transformation matrix of the variance σ_θ of step i-1, and ᾱ_{i-1} denotes the accumulation of α_1 through α_{i-1};
based on a preset training loss function, the latent-feature-oriented diffusion probability model is trained on the training data until a preset stopping condition is reached, giving the trained latent-feature-oriented diffusion probability model;
and the prediction noise (ε_θ)_T of the last computed T-th step is retained and recorded as the noise image X̂_T;
step 2, reconstructing the high-resolution image Ŷ of a target low-resolution image based on the first image encoder and the noise U-Net network of the trained latent-feature-oriented diffusion probability model:
inputting the target low-resolution image into the first image encoder to obtain a first image encoding feature;
inputting the first image encoding feature into the splicing layer, the U-Net encoder, the flow generation model and the U-Net decoder of the noise U-Net network respectively; starting from the T-th step, the target noise image X̂_t of the current step is input into the splicing layer of the noise U-Net network, and the target prediction noise (ε_θ)_t of the current step is obtained from the network output, where the initial value of t is T and the initial target noise image is X̂_T;
removing the target prediction noise of the current step from the target noise image of the current step gives the target noise image of the previous step, which is fed back into the noise U-Net network; the prediction noise of each previous step is output iteratively in this way until the target prediction noise of step 1 is obtained;
removing the target prediction noise of step 1 from the target noise image of step 1 gives the reconstructed high-resolution image Ŷ.
2. The method according to claim 1, wherein in step 2 the target noise image X̂_{t-1} of the previous step is calculated from the target noise image X̂_t of the current step and the target prediction noise (ε_θ)_t using the coefficient ᾱ_t, where ᾱ_t denotes the accumulation of α_T through α_t, and α_t denotes the Gaussian-distribution transformation matrix of the variance σ_θ output by the feature encoder of the conditional variational autoencoder at the t-th calculation step;
the currently obtained target noise image X̂_{t-1} is fed back into the noise U-Net network, and the prediction noise of each previous step is output iteratively until the target prediction noise (ε_θ)_1 of step 1 is obtained; the reconstructed high-resolution image Ŷ is then obtained from X̂_1 and (ε_θ)_1.
3. The method of claim 1, wherein in step 2, when the trained noise U-Net network is used, the feature encoder of the conditional variational autoencoder of the noise U-Net network is removed, and a random Gaussian sample is used directly as the input variable of the feature decoder of the conditional variational autoencoder.
4. The method of claim 1, wherein the second probability distribution C is calculated through the bijective function $f_\theta(\cdot)$ of the flow generation model, where $p_\theta(C\mid F_g)$ represents the likelihood function of the second probability distribution C with respect to the fusion feature $F_g$, and $\mu_\theta(\cdot)$, $\sigma_\theta(\cdot)$ represent the mean function and the variance function of the Gaussian distribution, respectively.
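A hedged sketch of evaluating a likelihood of this shape, using a single affine bijection as a stand-in for $f_\theta$ (the patent's actual bijection and its conditioning on $F_g$ are not reproduced here); `mu` and `log_sigma` would be the outputs of $\mu_\theta(F_g)$ and $\log\sigma_\theta(F_g)$.

```python
import math
import torch

def flow_log_likelihood(c: torch.Tensor,
                        mu: torch.Tensor,
                        log_sigma: torch.Tensor) -> torch.Tensor:
    """log p_theta(C | F_g) for an affine bijection z = (C - mu) / sigma that
    maps C to a standard normal, via the change-of-variables formula."""
    z = (c - mu) * torch.exp(-log_sigma)                  # bijective map f_theta
    log_base = -0.5 * (z ** 2 + math.log(2 * math.pi))    # N(0, I) log-density
    log_det = -log_sigma                                  # log-determinant of dz/dC
    return (log_base + log_det).sum(dim=-1)
```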
5. The method of claim 1, wherein the conditional variational autoencoder is optimized using the Kullback-Leibler divergence during training.
6. The method according to any one of claims 1 to 5, wherein in step 1 the total loss function $L_{total}$ of the latent-feature-oriented diffusion probability model during training is:
$L_{total} = L_{DDPM} + L_{CVAE} + L_{adv} + L_{content} + L_{style}$
where $L_{DDPM}$ represents the diffusion probability loss, $L_{CVAE}$ represents the divergence loss of the conditional variational autoencoder, $L_{adv}$ represents the discriminator loss, and $L_{content}$ and $L_{style}$ represent, respectively, the content loss and style loss of the latent-feature-oriented diffusion probability model.
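A sketch of assembling the total training loss from the five components listed above; the per-term weights are hypothetical hyper-parameters, not taken from the patent, and the unweighted sum matches the reconstruction of claim 6 above.

```python
def total_loss(l_ddpm, l_cvae, l_adv, l_content, l_style,
               w_ddpm=1.0, w_cvae=1.0, w_adv=1.0, w_content=1.0, w_style=1.0):
    """Combine the five loss terms of claim 6; weights are assumed
    hyper-parameters, defaulting to a plain sum."""
    return (w_ddpm * l_ddpm + w_cvae * l_cvae + w_adv * l_adv
            + w_content * l_content + w_style * l_style)
```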
7. The method of claim 6, wherein the diffusion probability loss $L_{DDPM}$ is set as the expected error between the noise $\varepsilon_i$ generated by the noise generator and the prediction noise $(\varepsilon_\theta)_i$ output by the noise U-Net network over the diffusion steps.
8. The method of claim 6, wherein the divergence loss $L_{CVAE}$ of the conditional variational autoencoder is set as the Kullback-Leibler divergence of the Gaussian distribution with mean $\mu_i$ and variance $\sigma_i$, where $\mu_i$, $\sigma_i$ represent, respectively, the mean and variance output by the feature encoder of the conditional variational autoencoder when the noise image $X_i$ is input.
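Consistent with claim 5's Kullback-Leibler optimization, the closed-form KL between $\mathcal{N}(\mu_i, \sigma_i^2)$ and the standard normal is sketched below as a reference form; the patent's exact expression is not reproduced here, so this is an assumption about its shape.

```python
import torch

def gaussian_kl_to_standard_normal(mu: torch.Tensor,
                                   log_var: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, sigma^2) || N(0, I) ) in closed form, summed over latent
    dimensions; mu and log_var would come from the CVAE feature encoder
    for the noise image X_i."""
    return 0.5 * torch.sum(mu ** 2 + torch.exp(log_var) - log_var - 1.0, dim=-1)
```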
9. The method of claim 6, wherein the discriminator loss $L_{adv}$ is set as an adversarial loss in which the discriminator distinguishes the noise image pairs drawn from the posterior distribution from the estimated pairs drawn from the prior distribution, where $\phi$ represents the parameters of the discriminator, $D_\phi(\cdot)$ represents the confidence output by the discriminator, $\mathbb{E}$ represents the mathematical expectation whose subscript denotes the range of the processed data, $q(X_i)$ represents the prior probability distribution of $X_i$, $p_\theta(\hat{X}_{i-1}\mid\hat{X}_i)$ represents the prior probability distribution of $\hat{X}_{i-1}$ conditioned on $\hat{X}_i$, and $q(X_{i-1}\mid X_i)$ represents the posterior probability distribution of $X_{i-1}$ conditioned on $X_i$.
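A minimal generic sketch of a discriminator loss over real pairs $(X_{i-1}, X_i)$ and estimated pairs $(\hat{X}_{i-1}, \hat{X}_i)$; this binary-cross-entropy form is a common reference choice, not the patent's exact $L_{adv}$.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """BCE discriminator loss: d_real is D_phi's logit on real pairs
    (X_{i-1}, X_i); d_fake is D_phi's logit on estimated pairs
    (X_hat_{i-1}, X_hat_i)."""
    real_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
    fake_loss = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss
```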
10. The method of claim 6, wherein the content loss $L_{content}$ represents the loss of the noise U-Net network output, where the subscripts x, y respectively denote the spatial coordinates of the high-resolution image, W and H respectively denote the width and height of the high-resolution image, and $G(\cdot)$ represents the output of the generator; $L_{style}$ represents the corresponding style loss.
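A sketch of generic content and style losses of the shape described in claim 10: a per-pixel content term over the W x H spatial grid of the high-resolution image, and a Gram-matrix style term. The Gram-matrix formulation is an assumption for illustration; the patent's own style loss is not spelled out here.

```python
import torch

def content_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Mean per-pixel L1 distance over the spatial coordinates (x, y)
    of the W x H high-resolution image."""
    return (sr - hr).abs().mean()

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    # feat: (B, C, H, W) feature map; returns normalized Gram matrices (B, C, C).
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(sr_feat: torch.Tensor, hr_feat: torch.Tensor) -> torch.Tensor:
    """Squared distance between Gram matrices of generator-output features
    and reference features; a common style-loss form, assumed here."""
    return (gram_matrix(sr_feat) - gram_matrix(hr_feat)).pow(2).mean()
```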
CN202310373066.4A 2022-11-16 2023-04-10 Single image super-resolution reconstruction method of potential features Active CN117078510B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211451156 2022-11-16
CN2022114511562 2022-11-16

Publications (2)

Publication Number Publication Date
CN117078510A (en) 2023-11-17
CN117078510B (en) 2024-04-30

Family

ID=88718129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310373066.4A Active CN117078510B (en) 2022-11-16 2023-04-10 Single image super-resolution reconstruction method of potential features

Country Status (1)

Country Link
CN (1) CN117078510B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117314755B (en) * 2023-11-29 2024-02-13 之江实验室 Multi-view plant generation method and device based on cross-modal image generation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6782132B1 (en) * 1998-08-12 2004-08-24 Pixonics, Inc. Video coding and reconstruction apparatus and methods
CN113177882A (en) * 2021-04-29 2021-07-27 浙江大学 Single-frame image super-resolution processing method based on diffusion model
CN113920013A (en) * 2021-10-14 2022-01-11 中国科学院深圳先进技术研究院 Small image multi-target detection method based on super-resolution
CN115131214A (en) * 2022-08-31 2022-09-30 南京邮电大学 Indoor aged person image super-resolution reconstruction method and system based on self-attention

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ETDNet: An Efficient Transformer Deraining Model; Qin Qin et al.; IEEE Access; 2021-08-27; Vol. 9; pp. 1-13 *
Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model; Xiulong Yang et al.; arXiv; 2022-08-16; pp. 1-23 *
Diffusion-based adaptive super-resolution reconstruction; Fu Long et al.; Modern Electronics Technique; 2017-06-16; Vol. 40 (No. 10); pp. 107-110 *

Also Published As

Publication number Publication date
CN117078510A (en) 2023-11-17

Similar Documents

Publication Publication Date Title
CN109859147B (en) Real image denoising method based on generation of antagonistic network noise modeling
CN113658051B (en) Image defogging method and system based on cyclic generation countermeasure network
CN113177882B (en) Single-frame image super-resolution processing method based on diffusion model
CN113673307A (en) Light-weight video motion recognition method
CN109685716B (en) Image super-resolution reconstruction method for generating countermeasure network based on Gaussian coding feedback
CN113516601B (en) Image recovery method based on deep convolutional neural network and compressed sensing
Wang et al. Channel and space attention neural network for image denoising
CN110796622B (en) Image bit enhancement method based on multi-layer characteristics of series neural network
CN105427264A (en) Image reconstruction method based on group sparsity coefficient estimation
CN117078510B (en) Single image super-resolution reconstruction method of potential features
CN114445420A (en) Image segmentation model with coding and decoding structure combined with attention mechanism and training method thereof
CN115546060A (en) Reversible underwater image enhancement method
CN116524048A (en) Natural image compressed sensing method based on potential diffusion model
CN115829876A (en) Real degraded image blind restoration method based on cross attention mechanism
Liu et al. Facial image inpainting using multi-level generative network
US20240054605A1 (en) Methods and systems for wavelet domain-based normalizing flow super-resolution image reconstruction
Lu et al. Underwater image enhancement method based on denoising diffusion probabilistic model
CN113947538A (en) Multi-scale efficient convolution self-attention single image rain removing method
Wu et al. Image super-resolution reconstruction based on a generative adversarial network
Wen et al. The power of complementary regularizers: Image recovery via transform learning and low-rank modeling
CN117151990B (en) Image defogging method based on self-attention coding and decoding
CN117611701A (en) Alzheimer's disease 3D MRI acceleration sampling generation method based on diffusion model
Xu et al. Depth map super-resolution via joint local gradient and nonlocal structural regularizations
CN110569763B (en) Glasses removing method for fine-grained face recognition
CN116630448A (en) Image compression method based on neural data dependent transformation of window attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant