CN117495714A - Face image restoration method and device based on diffusion generation priori and readable medium - Google Patents
- Publication number
- CN117495714A CN117495714A CN202410004081.6A CN202410004081A CN117495714A CN 117495714 A CN117495714 A CN 117495714A CN 202410004081 A CN202410004081 A CN 202410004081A CN 117495714 A CN117495714 A CN 117495714A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06T5/10 — Image enhancement or restoration using non-spatial domain filtering
- G06N3/0455 — Auto-encoder networks; encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T2207/20056 — Discrete and fast Fourier transform [DFT, FFT]
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20221 — Image fusion; image merging
Abstract
The invention discloses a face image restoration method, device and readable medium based on diffusion generation prior, relating to the field of image processing, and comprising the following steps: constructing a face image restoration model based on a pre-trained diffusion model; inputting the face image to be restored into a forward noise-adding module to gradually add noise, obtaining a noise image; inputting the noise image into a reverse denoising module to gradually remove noise, generating the final restored face image; inputting the noise image of step t and the timestamp of step t into a noise predictor to predict the noise of step t; in the forward noise-adding module, inputting the noise image of step t and the noise of step t into a forward diffusion formula combined with fusion inversion to obtain the noise image of step t+1; and in the reverse denoising module, performing zero-value-range decomposition on the noise image of step t and inputting the decomposition feature and the noise of step t into a reverse diffusion formula to obtain the noise image of step t-1. This solves the problem that restored images generated by the prior art are poor in authenticity and consistency.
Description
Technical Field
The invention relates to the field of image processing, in particular to a face image restoration method and device based on diffusion generation prior and a readable medium.
Background
Current image restoration methods based on generation priors fall mainly into three categories: image restoration methods based on explicit-modeling generation priors, image restoration methods based on generative adversarial network (Generative Adversarial Networks, abbreviated GAN) generation priors, and image restoration methods based on diffusion generation priors.
Image restoration methods based on explicit-modeling generation priors perform restoration with generative models that adopt a maximum likelihood estimation strategy, mainly including generation-prior restoration methods based on autoregressive models, variational autoencoders, flow models, and the like. These methods model the data distribution explicitly and, used as carriers of prior knowledge, offer better interpretability but weaker modeling capability.
GAN-based generation-prior image restoration methods exploit the generative prior capability of a generative adversarial network. They offer good restoration quality and strong modeling capability. However, GAN models are poorly interpretable, and their training follows the adversarial learning paradigm, requiring a generator and a discriminator to be trained separately; this places high demands on model tuning, and regularization terms often have to be added to prevent GAN training from collapsing.
Image restoration methods based on diffusion generation priors use the generative prior knowledge of denoising diffusion probabilistic models (Denoising Diffusion Probabilistic Models, abbreviated DDPM), an explicit-modeling generative approach. DDPM shows strong distribution-modeling capability and, compared with GAN, its variational-inference training objective is simpler and easier to optimize. At the same time, existing methods require many iterations to obtain a restored image with high authenticity, incurring large resource overhead, and the generated restored images still leave considerable room for improvement in authenticity and consistency. Weighing the drawbacks of the other approaches against these advantages makes the diffusion generation prior the most promising basis for image restoration among the three.
Disclosure of Invention
To address the technical problems above, an embodiment of the present application provides a face image restoration method, device and readable medium based on diffusion generation prior, so as to solve the technical problems mentioned in the Background section.
In a first aspect, the present invention provides a face image restoration method based on diffusion generation prior, including the steps of:
acquiring a face image to be restored;
constructing a face image restoration model based on a pre-trained diffusion model, wherein the face image restoration model comprises a forward noise-adding module, a reverse denoising module and a noise predictor; inputting the face image to be restored into the forward noise-adding module to gradually add noise, obtaining a noise image; inputting the noise image into the reverse denoising module to gradually remove noise, generating the final restored face image; inputting the noise image of step t and the timestamp of step t into the noise predictor to predict the noise of step t; in the forward noise-adding module, inputting the noise image of step t and the noise of step t into a forward diffusion formula combined with fusion inversion to obtain the noise image of step t+1; and in the reverse denoising module, performing zero-value-range decomposition on the noise image of step t to obtain the decomposition feature of step t, and inputting the decomposition feature of step t and the noise of step t into a reverse diffusion formula to obtain the noise image of step t-1.
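The two-phase pipeline described above can be sketched as follows. This is a minimal NumPy toy, not the patent's implementation: the schedule length, image size, and the random stand-in for the noise predictor are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50
beta = np.linspace(1e-4, 0.02, T)      # noise schedule beta_t (assumed values)
alpha_bar = np.cumprod(1.0 - beta)     # cumulative noise schedule

def predict_noise(x_t, t):
    # Stand-in for the trained noise predictor epsilon_theta(x_t, t);
    # the patent uses a dual-branch regulated Unet here.
    return rng.standard_normal(x_t.shape)

def forward_step(x_t, t):
    # Forward noise-adding module: one step of gradually increasing noise.
    eps = rng.standard_normal(x_t.shape)
    return np.sqrt(1.0 - beta[t]) * x_t + np.sqrt(beta[t]) * eps

def reverse_step(x_t, t, eps_pred):
    # Reverse denoising module: estimate the restored image x0_hat from the
    # predicted noise, then step deterministically to the previous timestep.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
    if t == 0:
        return x0_hat
    return np.sqrt(alpha_bar[t - 1]) * x0_hat + np.sqrt(1.0 - alpha_bar[t - 1]) * eps_pred

x = rng.standard_normal((8, 8))        # toy stand-in for a face image
for t in range(T):                     # forward: x_0 -> x_T
    x = forward_step(x, t)
for t in reversed(range(T)):           # reverse: x_T -> restored image
    x = reverse_step(x, t, predict_noise(x, t))
```

With a real predictor, the reverse loop would progressively recover a clean face image; here it only demonstrates the control flow of the two modules.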
Preferably, the forward diffusion formula combined with fusion inversion is:

$$x_{t+1}=\sqrt{\bar{\alpha}_{t+1}}\,\hat{x}_0^t+\sqrt{1-\bar{\alpha}_{t+1}}\left[(1-\eta)\,\epsilon_\theta(x_t,t)+\eta\,\epsilon\right];$$

wherein $x_t$ denotes the noise image of step t, $x_{t+1}$ the noise image of step t+1, $\hat{x}_0^t$ the restored face image of step t, $\epsilon_\theta(x_t,t)$ the noise of step t predicted by the noise predictor, $\beta_t$ the noise schedule, $\bar{\alpha}_t=\prod_{i=1}^{t}(1-\beta_i)$ the cumulative noise schedule, $\eta\in(0,1)$ the regulation coefficient, and $\epsilon\sim\mathcal{N}(0,\mathbf{I})$ noise sampled from the standard normal distribution.
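One fusion-inversion step under the formula above can be sketched in NumPy. The schedule, timestep, and predictor output below are illustrative stand-ins, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)     # cumulative noise schedule

def fusion_inversion_step(x_t, t, eps_pred, eta):
    # Estimate the restored image of step t from the predicted noise.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])
    # Convex mix of the predicted noise (DDIM inversion, eta = 0) and fresh
    # standard-normal noise (DDPM inversion, eta = 1).
    eps = rng.standard_normal(x_t.shape)
    mixed = (1.0 - eta) * eps_pred + eta * eps
    return np.sqrt(alpha_bar[t + 1]) * x0_hat + np.sqrt(1.0 - alpha_bar[t + 1]) * mixed

x_t = rng.standard_normal((16, 16))
eps_pred = rng.standard_normal((16, 16))   # stand-in for the noise predictor
x_next = fusion_inversion_step(x_t, 10, eps_pred, eta=0.3)
```

Setting `eta=0.0` reduces the step to pure DDIM inversion and `eta=1.0` to pure DDPM inversion, matching the two limiting cases discussed later in the description.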
Preferably, zero-value-range decomposition is performed on the noise image of step t to obtain the decomposition feature of step t, specifically:

$$\hat{x}_t=\mathbf{A}^{\dagger}\mathbf{A}\,x_t+(\mathbf{I}-\mathbf{A}^{\dagger}\mathbf{A})\,x_t;$$

wherein $x_t$ denotes the noise image of step t, $\hat{x}_t$ the decomposition feature of step t, $\mathbf{A}^{\dagger}\mathbf{A}\,x_t$ the range-space part, $(\mathbf{I}-\mathbf{A}^{\dagger}\mathbf{A})\,x_t$ the null-space part, $\mathbf{A}\in\mathbb{R}^{d\times D}$ the transformation matrix, $\mathbf{A}^{\dagger}$ its pseudo-inverse, $\mathbb{R}$ the real multidimensional space, $D$ the dimension of the face image to be restored, $d$ the dimension of the degraded image, and $\mathbf{I}$ the identity matrix.
Preferably, for a given transformation matrix $\mathbf{A}$, the pseudo-inverse $\mathbf{A}^{\dagger}$ is constructed by singular value decomposition or Fourier transform, satisfying $\mathbf{A}\mathbf{A}^{\dagger}\mathbf{A}=\mathbf{A}$.
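The defining property of the pseudo-inverse and the range/null-space split can be checked numerically with a toy matrix (the dimensions below are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

D, d = 16, 4                      # dims of the clean and degraded signals (toy values)
A = rng.standard_normal((d, D))   # toy transformation (degradation) matrix
A_pinv = np.linalg.pinv(A)        # pseudo-inverse A^+, computed here via SVD

x = rng.standard_normal(D)

range_part = A_pinv @ A @ x       # range-space component A^+ A x
null_part = x - A_pinv @ A @ x    # null-space component (I - A^+ A) x

# A^+ satisfies the defining identity A A^+ A = A; the two parts recombine
# to the original vector, and the null-space part is invisible to A.
```

Because `A @ null_part` is (numerically) zero, the null-space component can be altered freely by the generator without changing the degraded observation — the property the back-projection mechanism relies on.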
Preferably, the noise predictor adopts a dual-branch adjusting Unet network, the dual-branch adjusting Unet network comprises a main branch and a jump connection branch, the main branch comprises a first convolution layer, a second convolution layer, a first maximum pooling layer, a third convolution layer, a fourth convolution layer, a second maximum pooling layer, a fifth convolution layer, a sixth convolution layer, a first deconvolution layer, a seventh convolution layer, a second deconvolution layer and an eighth convolution layer, and the jump connection branch comprises a first Fourier transform module, a first inverse Fourier transform module, a second Fourier transform module and a second inverse Fourier transform module.
Preferably, the noise image input double-branch adjusting Unet network in the t step sequentially goes through an encoding stage and a decoding stage;
the encoding stage comprises: the noise image in the step t sequentially passes through a first convolution layer and a second convolution layer to obtain a first intermediate coding image; the first intermediate coded image sequentially passes through a first maximum pooling layer, a third convolution layer and a fourth convolution layer to obtain a second intermediate coded image; the second intermediate coded image sequentially passes through a second maximum pooling layer, a fifth convolution layer and a sixth convolution layer to obtain a third intermediate coded image;
the decoding stage comprises: the third intermediate coded image is multiplicatively regulated by the trunk feature regulation coefficient and passed through the first deconvolution layer to obtain the third intermediate image; the second intermediate coded image is converted to the frequency domain by the first Fourier transform module, low-frequency masking is applied using the jump feature regulation coefficient to obtain a first processed image in which high-frequency components are preserved, and the first processed image is converted back to the spatial domain by the first inverse Fourier transform module to obtain the fourth intermediate image; the third intermediate image and the fourth intermediate image are concatenated to obtain the second intermediate decoded image; the second intermediate decoded image passes through the seventh convolution layer, multiplicative regulation by the trunk feature regulation coefficient, and the second deconvolution layer to obtain the first intermediate image; the first intermediate coded image is converted to the frequency domain by the second Fourier transform module, low-frequency masking is applied using the jump feature regulation coefficient to obtain a second processed image in which high-frequency components are preserved, and the second processed image is converted back to the spatial domain by the second inverse Fourier transform module to obtain the second intermediate image; the first intermediate image and the second intermediate image are concatenated to obtain the first intermediate decoded image; and the first intermediate decoded image passes through the eighth convolution layer to obtain the noise of step t.
Preferably, in the decoding stage, the feature maps are decomposed into the $k$ feature maps $h_1,\dots,h_k$ obtained by the trunk branch and the $k$ feature maps $g_1,\dots,g_k$ obtained by the skip connection branch;

a trunk feature regulation coefficient $b$ is set in the trunk branch, and the feature maps obtained by the trunk branch are regulated by it, as in the following formula:

$$\tilde{h}_i^{(c)}=b\cdot h_i^{(c)},\qquad c=1,\dots,C_i;$$

wherein $h_i^{(c)}$ denotes the feature map on the $c$-th channel of the $i$-th trunk feature map, $C_i$ denotes the number of channels of the $i$-th feature map, and $\tilde{h}_i^{(c)}$ denotes the feature map on the $c$-th channel after regulation by the trunk feature regulation coefficient;

a jump feature regulation coefficient $s$ is set in the skip connection branch, and the feature maps obtained by the skip connection branch are regulated by it, as in the following formulas:

$$G_i=\mathcal{F}(g_i);\qquad G_i'=G_i\odot M;\qquad g_i'=\mathcal{F}^{-1}(G_i');$$

wherein $g_i$ denotes the $i$-th feature map of the skip connection branch, $G_i$ the feature map obtained after Fourier transform, $G_i'$ the feature map regulated in the frequency domain, $g_i'$ the feature map obtained after inverse Fourier transform of the frequency-domain-regulated map, and $M$ a Fourier mask given by the following formula:

$$M(r)=\begin{cases}s,&r<\tau\\ 1,&\text{otherwise;}\end{cases}$$

wherein $r$ denotes the radius in the frequency plane, $\tau$ denotes the threshold frequency, and $s$ denotes the jump feature regulation coefficient.
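The skip-branch regulation can be sketched in NumPy: transform to the frequency domain, attenuate radii below a threshold frequency by the coefficient $s$, and transform back. The threshold and coefficient values below are illustrative assumptions:

```python
import numpy as np

def low_freq_mask(shape, tau, s):
    # Fourier mask M(r): scale radii below the threshold frequency tau by the
    # jump regulation coefficient s; leave high frequencies untouched.
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fx**2 + fy**2)           # radius in the frequency plane
    return np.where(r < tau, s, 1.0)

def regulate_skip(feature, tau=0.1, s=0.5):
    # FFT -> multiply by the mask -> inverse FFT back to the spatial domain.
    F = np.fft.fft2(feature)
    F_masked = F * low_freq_mask(feature.shape, tau, s)
    return np.real(np.fft.ifft2(F_masked))

rng = np.random.default_rng(0)
g = rng.standard_normal((32, 32))        # toy skip-connection feature map
g_reg = regulate_skip(g)
```

Because the DC component (radius 0) falls under the mask, the regulated map's mean is scaled by exactly $s$, while high-frequency detail passes through unchanged — the "high-frequency components reserved" behavior described above.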
In a second aspect, the present invention provides a face image restoration device for generating a priori based on diffusion, including:
the data acquisition module is configured to acquire a face image to be restored;
the execution module is configured to construct a face image restoration model based on a pre-trained diffusion model, wherein the face image restoration model comprises a forward noise-adding module, a reverse denoising module and a noise predictor; the face image to be restored is input into the forward noise-adding module to gradually add noise, obtaining a noise image; the noise image is input into the reverse denoising module to gradually remove noise, generating the final restored face image; the noise image of step t and the timestamp of step t are input into the noise predictor to predict the noise of step t; in the forward noise-adding module, the noise image of step t and the noise of step t are input into a forward diffusion formula combined with fusion inversion to obtain the noise image of step t+1; and in the reverse denoising module, zero-value-range decomposition is performed on the noise image of step t to obtain the decomposition feature of step t, and the decomposition feature of step t and the noise of step t are input into a reverse diffusion formula to obtain the noise image of step t-1.
In a third aspect, the present invention provides an electronic device comprising one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
(1) According to the face image restoration method based on diffusion generation prior provided by the invention, by adopting fusion inversion in the forward noise-adding module, introducing zero-value-range decomposition in the reverse denoising module, and introducing the trunk feature regulation coefficient and the jump feature regulation coefficient into the dual-branch regulated Unet (RUnet) architecture, consistency can be guaranteed while high generation authenticity is preserved, realizing zero-shot image restoration (Zero-shot Image Restoration).
(2) The face image restoration method based on diffusion generation prior provided by the invention greatly improves sampling speed: even with a low number of iteration steps it still generates restored images that satisfy consistency and authenticity, which both speeds up image restoration and reduces latency overhead.
(3) The face image restoration method based on diffusion generation prior provided by the invention achieves excellent restoration effects on a variety of face image restoration tasks such as super-resolution, deblurring and coloring; the quality of the restored face images is greatly improved, the reconstructed texture details are more realistic, and good consistency is maintained in the underlying structure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an exemplary device frame pattern to which an embodiment of the present application may be applied;
fig. 2 is a flow chart of a face image restoration method based on diffusion generation prior according to an embodiment of the present application;
FIG. 3 (a) is a schematic diagram of a network framework for fusion inversion;
FIG. 3 (b) is a difference plot of DDIM inversion, DDPM inversion and fusion inversion;
FIG. 4 (a) is a schematic diagram of a network framework of a back projection mechanism based on zero-value-range decomposition;
FIG. 4 (b) is a schematic flow chart of an implementation of a back projection mechanism based on zero-value-range decomposition;
FIG. 5 is a schematic diagram of a two-branch adjusting Unet network (RUnet) framework of a face image restoration method based on diffusion generation prior according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a face image restoration model of a face image restoration method based on diffusion generation prior in an embodiment of the present application;
fig. 7 is an effect diagram of a face image restoration method based on diffusion generation prior in the embodiment of the present application on four restoration tasks of super resolution (x 4), super resolution (x 8), coloring, and deblurring, respectively;
FIG. 8 is a schematic diagram of a face image restoration device based on diffusion generation prior in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device suitable for use in implementing the electronic device of the embodiments of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 illustrates an exemplary device architecture 100 to which a face image restoration method based on diffusion generation prior or a face image restoration device based on diffusion generation prior of embodiments of the present application may be applied.
As shown in fig. 1, the apparatus architecture 100 may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages, etc. Various applications, such as a data processing class application, a file processing class application, and the like, may be installed on the terminal device one 101, the terminal device two 102, and the terminal device three 103.
The first terminal device 101, the second terminal device 102 and the third terminal device 103 may be hardware or software. When the first terminal device 101, the second terminal device 102, and the third terminal device 103 are hardware, they may be various electronic devices, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like. When the first terminal apparatus 101, the second terminal apparatus 102, and the third terminal apparatus 103 are software, they can be installed in the above-listed electronic apparatuses. Which may be implemented as multiple software or software modules (e.g., software or software modules for providing distributed services) or as a single software or software module. The present invention is not particularly limited herein.
The server 105 may be a server that provides various services, such as a background data processing server that processes files or data uploaded by the terminal device one 101, the terminal device two 102, and the terminal device three 103. The background data processing server can process the acquired file or data to generate a processing result.
It should be noted that, the face image restoration method based on the diffusion generation priori provided in the embodiment of the present application may be executed by the server 105, or may be executed by the first terminal device 101, the second terminal device 102, or the third terminal device 103, and accordingly, the face image restoration device based on the diffusion generation priori may be set in the server 105, or may be set in the first terminal device 101, the second terminal device 102, or the third terminal device 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the processed data does not need to be acquired from a remote location, the above-described apparatus architecture may not include a network, but only a server or terminal device.
Fig. 2 shows a face image restoration method based on diffusion generation prior provided by an embodiment of the present application, which includes the following steps:
S1, acquiring a face image to be restored.
Specifically, a face image to be restored is acquired.
S2, constructing a face image restoration model based on a pre-trained diffusion model, wherein the face image restoration model comprises a forward noise-adding module, a reverse denoising module and a noise predictor; inputting the face image to be restored into the forward noise-adding module to gradually add noise, obtaining a noise image; inputting the noise image into the reverse denoising module to gradually remove noise, generating the final restored face image; inputting the noise image of step t and the timestamp of step t into the noise predictor to predict the noise of step t; in the forward noise-adding module, inputting the noise image of step t and the noise of step t into a forward diffusion formula combined with fusion inversion to obtain the noise image of step t+1; and in the reverse denoising module, performing zero-value-range decomposition on the noise image of step t to obtain the decomposition feature of step t, and inputting the decomposition feature of step t and the noise of step t into a reverse diffusion formula to obtain the noise image of step t-1.
In a specific embodiment, the forward diffusion formula combined with fusion inversion is:

$$x_{t+1}=\sqrt{\bar{\alpha}_{t+1}}\,\hat{x}_0^t+\sqrt{1-\bar{\alpha}_{t+1}}\left[(1-\eta)\,\epsilon_\theta(x_t,t)+\eta\,\epsilon\right];$$

wherein $x_t$ denotes the noise image of step t, $x_{t+1}$ the noise image of step t+1, $\hat{x}_0^t$ the restored face image of step t, $\epsilon_\theta(x_t,t)$ the noise of step t predicted by the noise predictor, $\beta_t$ the noise schedule, $\bar{\alpha}_t=\prod_{i=1}^{t}(1-\beta_i)$ the cumulative noise schedule, $\eta\in(0,1)$ the regulation coefficient, and $\epsilon\sim\mathcal{N}(0,\mathbf{I})$ noise sampled from the standard normal distribution.
Specifically, referring to fig. 3 (a) and 3 (b), fusion inversion (Fusion Inversion) is a combination of DDIM inversion (DDIM Inversion) and DDPM inversion (DDPM Inversion). DDIM inversion is a deterministic mapping from the input image to a hidden-space vector: because the DDIM generation process is deterministic, its update equation can be reversed to establish the inversion, with the following expression:

$$x_{t+1}=\sqrt{\bar{\alpha}_{t+1}}\,\hat{x}_0^t+\sqrt{1-\bar{\alpha}_{t+1}}\,\epsilon_\theta(x_t,t);$$

wherein $x_t$ denotes the noise image of step t, $x_{t+1}$ the noise image of step t+1, $\hat{x}_0^t$ the restored face image of step t, $\epsilon_\theta(x_t,t)$ the noise of step t predicted by the noise predictor, and $\bar{\alpha}_t$ the cumulative noise schedule.
The hidden-space vector obtained by DDIM inversion retains a large amount of information from the input space, so a generated image obtained by iteratively sampling from it stays highly consistent with the original image. However, because the noise predicted by the noise predictor during DDIM inversion deviates from the standard normal distribution, the pre-trained model may receive an out-of-domain input distribution during generation and thus produce unrealistic results.
DDPM inversion was designed from observations of DDIM inversion; it improves the authenticity of the reconstruction by making the injected noise closer to the standard normal distribution. When the noise predicted by the noise predictor is replaced by noise randomly sampled from a Gaussian distribution, a special inversion technique called DDPM inversion is obtained, as seen below:

$$x_{t+1}=\sqrt{\bar{\alpha}_{t+1}}\,\hat{x}_0^t+\sqrt{1-\bar{\alpha}_{t+1}}\,\epsilon;$$

wherein $x_t$ denotes the noise image of step t, $x_{t+1}$ the noise image of step t+1, $\hat{x}_0^t$ the restored face image of step t, $\beta_t$ the noise schedule, $\bar{\alpha}_t=\prod_{i=1}^{t}(1-\beta_i)$ the cumulative noise schedule, and $\epsilon\sim\mathcal{N}(0,\mathbf{I})$ noise sampled from the standard normal distribution.
The hidden-space vector obtained by DDPM inversion is closer to the standard normal distribution, but compared with the hidden-space vector obtained by DDIM inversion it lacks information from the original input image, thereby generating realistic but inconsistent results.
Fusion inversion is designed on the basis of observing DDIM inversion and DDPM inversion. The deterministic inversion produces faithful but not realistic results, while the stochastic inversion produces realistic but not faithful results. Weighing the advantages and disadvantages of DDIM inversion and DDPM inversion, the embodiment of the application designs a novel inversion technique, called fusion inversion. By adding a regulation factor λ to control the added random disturbance z, fusion inversion gains the ability to generate highly realistic images while maintaining consistency. This is because the regulation factor λ controls the magnitude of the added random disturbance so that only a portion of the prediction noise is replaced, which both brings the hidden-space vector close to the standard normal distribution and lets the obtained hidden-space vector retain a large amount of information from the original input image.
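A minimal sketch of a single fusion-inversion step under the interpolation described above (the function name and the toy array shapes are illustrative assumptions); setting the regulation factor λ = 0 recovers the DDIM inversion step:

```python
import numpy as np

def fusion_inversion_step(x_t, eps_pred, ab_t, ab_t1, lam, rng):
    x0_hat = (x_t - np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(ab_t)
    z = rng.standard_normal(x_t.shape)                 # z ~ N(0, I)
    # the regulation factor lam replaces only part of the predicted noise
    eps_mix = np.sqrt(1 - lam) * eps_pred + np.sqrt(lam) * z
    return np.sqrt(ab_t1) * x0_hat + np.sqrt(1 - ab_t1) * eps_mix

rng = np.random.default_rng(1)
x_t, eps = rng.normal(size=(3, 8, 8)), rng.normal(size=(3, 8, 8))
# lam = 0: the random disturbance is switched off, reducing to DDIM inversion
ddim = fusion_inversion_step(x_t, eps, 0.9, 0.8, lam=0.0, rng=np.random.default_rng(2))
x0_hat = (x_t - np.sqrt(0.1) * eps) / np.sqrt(0.9)
assert np.allclose(ddim, np.sqrt(0.8) * x0_hat + np.sqrt(0.2) * eps)
```

Intermediate values of λ trade off the faithfulness of DDIM inversion against the realism of DDPM inversion.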
Rewriting the prediction-noise part of the fusion inversion gives the specific expression:

ε̃_t = √(1 − λ) · ε_θ(x_t, t) + √λ · z;
Analysing the rewritten prediction noise ε̃_t demonstrates that the added random perturbation brings the predicted noise closer to a standard normal distribution. Suppose the initially predicted noise deviates from the standard normal distribution:

ε_θ(x_t, t) ~ N(μ, σ²);
where μ denotes the mean and σ² the variance of the initial disturbance. Because the regulation factor λ ∈ (0, 1), the following expressions hold:

μ̃ = √(1 − λ) · μ;

σ̃² = (1 − λ) · σ² + λ;
where μ̃ denotes the mean and σ̃² the variance of the disturbance constructed by fusion inversion. Since |μ̃| ≤ |μ| and |σ̃² − 1| = (1 − λ)·|σ² − 1| ≤ |σ² − 1|, after the random perturbation is added, N(μ̃, σ̃²) becomes closer to the standard normal distribution N(0, I). Fusion inversion thus still retains the prior ability of the diffusion generation model while preserving comparable original-image information, so the realism of the generated image is inherited. Moreover, the embodiment of the application achieves acceleration by performing the inversion only up to an intermediate time step t₀ rather than until the last time step T: fusion inversion yields a better hidden code, and reverse denoising iterations from that hidden code restore the image truly and faithfully. Fusion inversion thereby reduces the number of iterative sampling steps to t₀; a true and faithful restored image is obtained without running to the last time step T. The shortened number of iteration steps t₀ is called the execution time step, and the design of fusion inversion means that only t₀ iteration steps need to be performed.
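The moment analysis can be checked empirically; the values μ = 0.5, σ = 1.5, λ = 0.5 below are illustrative, not from the embodiment:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, lam, n = 0.5, 1.5, 0.5, 1_000_000
eps = rng.normal(mu, sigma, n)          # biased "predicted" noise
z = rng.standard_normal(n)              # standard normal perturbation
eps_mix = np.sqrt(1 - lam) * eps + np.sqrt(lam) * z

# theory: mean -> sqrt(1 - lam) * mu, variance -> (1 - lam) * sigma^2 + lam
assert abs(eps_mix.mean() - np.sqrt(1 - lam) * mu) < 0.01
assert abs(eps_mix.var() - ((1 - lam) * sigma**2 + lam)) < 0.02
# the mixed noise is closer to N(0, 1) than the original prediction
assert abs(eps_mix.mean()) < abs(mu) and abs(eps_mix.var() - 1) < abs(sigma**2 - 1)
```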
In a specific embodiment, performing zero-range decomposition on the noise image of the t step to obtain the decomposition features of the t step specifically comprises:

x_t = A†A·x_t + (I − A†A)·x_t;

where x_t denotes the noise image at step t, the right-hand side is the decomposition feature of the t step, A†A·x_t is the range part, (I − A†A)·x_t is the null-space part, A ∈ R^{d×D} denotes the transformation matrix, A† ∈ R^{D×d} denotes its pseudo-inverse, R denotes the real multidimensional space, D denotes the dimension of the face image to be restored, d denotes the dimension of the degraded image, and I denotes the identity matrix.
In a particular embodiment, for a given transformation matrix A, singular value decomposition or Fourier transform is used to construct the pseudo-inverse A† of the transformation matrix A, satisfying A·A†·A = A.
Specifically, the zero-range decomposition (Range-Null space Decomposition, RND) is understood as follows. Given a transformation matrix A, its pseudo-inverse A† is constructed by singular value decomposition (Singular Value Decomposition, SVD) or Fourier transform (Fourier transform, FT), satisfying A·A†·A = A. Once the transformation matrix A and its pseudo-inverse A† are obtained, any variable x can undergo the following identity decomposition:

x ≡ A†A·x + (I − A†A)·x;
where A denotes the transformation matrix, A† the pseudo-inverse of the transformation matrix A, I the identity matrix, and x the original image. The decomposition has a useful property when A·x is calculated:

A·x = A·A†A·x + A·(I − A†A)·x = A·x + 0;

Analysis shows that the left part A†A·x, after transformation by the matrix A, equals A·x, while the right part (I − A†A)·x, after transformation by the matrix A, equals 0. A†A·x is therefore called the range part (Range-space Part) of the matrix A, (I − A†A)·x is called the null-space part (Null-space Part) of the matrix A, and the above decomposition is called RND.
Introducing the RND concept into the image restoration task provides a new angle for handling realness (Realness) and data consistency (Data Consistency). For linear image restoration tasks (Linear IR Tasks), the problem can be generalized to y = A·x. If the transformation matrix A and its pseudo-inverse A† are known, RND decomposition can be performed on the original image as follows:

x = A†·y + (I − A†A)·x;

Although the complete original image x is unknown, its range part A†·y is known: after the action of the degradation operator A, the part of the original image that survives is the range part, and the part that is lost is the null-space part. A general solution can thus be constructed: x̂ = A†·y + (I − A†A)·x̄, where x̄ is the null-space extraction term.
Checking the authenticity and consistency of the constructed general solution:
consistency test:
A·x̂ = A·A†·y + A·(I − A†A)·x̄;

= A·A†A·x + (A − A·A†A)·x̄;

= A·x + 0 = y;
From the above derivation it can be seen that no matter how the null-space extraction term x̄ takes its values, the consistency constraint A·x̂ = y is satisfied.
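The same setup shows that the general solution satisfies the consistency constraint for every choice of the null-space extraction term (random matrices are used purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 6))
A_pinv = np.linalg.pinv(A)
x_true = rng.normal(size=6)
y = A @ x_true                     # observed degraded image

for _ in range(5):                 # any null-space extraction term works
    x_bar = rng.normal(size=6)
    x_hat = A_pinv @ y + (np.eye(6) - A_pinv @ A) @ x_bar
    assert np.allclose(A @ x_hat, y)   # data consistency always holds
```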
Authenticity check: the image restoration problem leaves only the null-space extraction term x̄ free; that is, an x̄ is required such that x̂ = A†·y + (I − A†A)·x̄ follows the distribution of the target image domain. Analysis shows that the authenticity of the restored image depends only on the null-space part, and the diffusion model is an ideal tool to produce the desired null-space part. Using a pre-trained diffusion model, the null-space part (Null-space) is iteratively refined during the sampling process; combining the RND concept with a back-diffusion module carrying a back-projection mechanism (Back Projection) generates results that satisfy realism and data consistency at the same time, thereby realizing the consistency constraint on the input image.
Referring to figures 4 (a) and 4 (b), the term A†·y is obtained from the input low-quality image y through the known transformation A†; it serves as the range part, realizing the consistency limitation and ensuring that the low-dimensional structural information of the restored image is consistent with the low-quality image. The null-space term (I − A†A)·x̄ is derived from the pre-trained diffusion model, which guarantees the authenticity of the high-dimensional texture information of the reconstructed image. By the zero-range decomposition theory, superimposing the range filling term A†·y and the null-space filling term (I − A†A)·x̄ yields a restored image x̂ that is both realistic and faithful.
In a specific embodiment, the noise predictor uses a dual-branch regulated Unet network. The dual-branch regulated Unet network includes a main branch and a jump connection branch; the main branch includes a first convolution layer, a second convolution layer, a first max-pooling layer, a third convolution layer, a fourth convolution layer, a second max-pooling layer, a fifth convolution layer, a sixth convolution layer, a first deconvolution layer, a seventh convolution layer, a second deconvolution layer, and an eighth convolution layer, and the jump connection branch includes a first Fourier transform module, a first inverse Fourier transform module, a second Fourier transform module, and a second inverse Fourier transform module.
In a specific embodiment, the noise image of the t step input into the dual-branch regulated Unet network sequentially goes through an encoding stage and a decoding stage;
The encoding stage comprises: the noise image in the step t sequentially passes through a first convolution layer and a second convolution layer to obtain a first intermediate coding image; the first intermediate coded image sequentially passes through a first maximum pooling layer, a third convolution layer and a fourth convolution layer to obtain a second intermediate coded image; the second intermediate coded image sequentially passes through a second maximum pooling layer, a fifth convolution layer and a sixth convolution layer to obtain a third intermediate coded image;
the decoding stage comprises: the third intermediate coded image undergoes multiplicative regulation by the trunk feature regulation coefficient and the first deconvolution layer to obtain a third intermediate image; the second intermediate coded image is converted into the frequency domain by the first Fourier transform module, low-frequency masking is performed using the jump feature regulation coefficient to obtain a first processed image with the high-frequency components retained, and the first processed image is converted into the spatial domain by the first inverse Fourier transform module to obtain a fourth intermediate image; the third intermediate image and the fourth intermediate image are spliced to obtain a second intermediate decoded image; the second intermediate decoded image passes through the seventh convolution layer, multiplicative regulation by the trunk feature regulation coefficient, and the second deconvolution layer to obtain a first intermediate image; the first intermediate coded image is converted into the frequency domain by the second Fourier transform module, low-frequency masking is performed using the jump feature regulation coefficient to obtain a second processed image with the high-frequency components retained, and the second processed image is converted into the spatial domain by the second inverse Fourier transform module to obtain a second intermediate image; the first intermediate image and the second intermediate image are spliced to obtain a first intermediate decoded image; the first intermediate decoded image passes through the eighth convolution layer to obtain the noise of the t step.
In a particular embodiment, the i-th feature map obtained in the decoding stage is decomposed into the part h_b^i contributed by the trunk branch and the part h_s^i contributed by the jump connection branch;

a trunk feature regulation coefficient b is set in the trunk branch, and the feature map obtained by the trunk branch is regulated by the trunk feature regulation coefficient, as shown in the following formula:

h̃_{b,j}^i = b · h_{b,j}^i, j = 1, …, C_i;

where h_{b,j}^i denotes the feature map on the j-th channel of the i-th trunk-branch feature map, C_i denotes the number of channels in the i-th feature map, and h̃_{b,j}^i denotes the feature map on the j-th channel of the i-th feature map after regulation by the trunk feature regulation coefficient b;
a jump feature regulation coefficient s is set in the jump connection branch, and the feature map obtained by the jump connection branch is regulated by the jump feature regulation coefficient, as shown in the following formulas:

F^i = FFT(h_s^i);

F̃^i = F^i ⊙ M;

h̃_s^i = IFFT(F̃^i);

where h_s^i denotes the feature map on the i-th jump connection branch, F^i denotes the feature map obtained after Fourier transform of the jump connection branch, F̃^i denotes the frequency-domain regulated feature map, h̃_s^i denotes the feature map obtained after inverse Fourier transform of the frequency-domain regulated feature map, ⊙ denotes element-wise multiplication, and M denotes the Fourier mask, given by the following formula:
M(r) = s, if r < r_thresh; M(r) = 1, otherwise;

where r denotes the radius in the frequency spectrum, r_thresh denotes the threshold frequency, and s denotes the jump feature regulation coefficient.
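A sketch of the frequency-domain regulation on the jump connection branch, assuming the mask multiplies radii below the threshold frequency by the jump feature regulation coefficient s and leaves higher frequencies unchanged (function names and the toy sizes are illustrative):

```python
import numpy as np

def fourier_mask(h, w, r_thresh, s):
    # radius measured from the spectrum centre (after fftshift)
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h // 2, xx - w // 2)
    return np.where(r < r_thresh, s, 1.0)

def regulate_jump_features(feat, r_thresh, s):
    F = np.fft.fftshift(np.fft.fft2(feat))                # to frequency domain
    F = F * fourier_mask(*feat.shape, r_thresh, s)        # attenuate low frequencies
    return np.fft.ifft2(np.fft.ifftshift(F)).real         # back to spatial domain

# a constant map is pure low frequency, so it is scaled down by s
const = np.full((16, 16), 4.0)
assert np.allclose(regulate_jump_features(const, r_thresh=3, s=0.25), 1.0)
```

High-frequency content such as edges and texture passes through the mask unchanged, which is what lets the jump branch compensate for the smoothing of the trunk branch.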
Specifically, in the noise predictor part, a dual-branch regulated Unet network (Double Branch Regulated Unet, abbreviated RUnet) is constructed to replace the original Unet network; the RUnet is obtained by adjusting the original Unet network. Referring to fig. 5, the specific workflow of the dual-branch regulated Unet network (RUnet) is as follows. In the encoding stage: first, the input noise image of the t step, with 3 channels of size 256×256, passes through the first and second convolution layers, each of size 3×3, to obtain a first intermediate coded image with 64 channels of size 256×256. Next, the first intermediate coded image passes through the first max-pooling layer of size 2×2 and the third and fourth convolution layers, each of size 3×3, to obtain a second intermediate coded image with 128 channels of size 120×120. Finally, the second intermediate coded image passes through the second max-pooling layer of size 2×2 and the fifth and sixth convolution layers, each of size 3×3, to obtain a third intermediate coded image with 256 channels of size 60×60.
In the decoding stage: first, the third intermediate coded image with 256 channels of size 60×60 undergoes multiplicative regulation by the trunk feature regulation coefficient b and passes through the first deconvolution layer of size 2×2 to obtain a third intermediate image with 128 channels of size 120×120; the second intermediate coded image is converted into the frequency domain by the first Fourier transform module, low-frequency masking is performed using the jump feature regulation coefficient s to obtain a first processed image with the high-frequency components retained, and the first processed image is finally converted into the spatial domain by the first inverse Fourier transform module to obtain a fourth intermediate image with 128 channels of size 120×120; the third intermediate image and the fourth intermediate image are spliced to obtain a second intermediate decoded image with 256 channels of size 120×120.
Second, the second intermediate decoded image passes through the seventh convolution layer of size 3×3, multiplicative regulation by the trunk feature regulation coefficient b, and the second deconvolution layer of size 2×2 to obtain a first intermediate image with 64 channels of size 252×252; the first intermediate coded image is then converted into the frequency domain by the second Fourier transform module, low-frequency masking is performed using the jump feature regulation coefficient s to obtain a second processed image with the high-frequency components retained, and the second processed image is finally converted into the spatial domain by the second inverse Fourier transform module to obtain a second intermediate image with 64 channels of size 252×252; the first intermediate image and the second intermediate image are spliced to obtain a first intermediate decoded image with 128 channels of size 252×252. Finally, the first intermediate decoded image passes through the eighth convolution layer of size 3×3 to predict the noise of the t step with 3 channels of size 256×256.
The main branch and the jump connection branch in the RUnet are respectively discussed, and the specific discussion is as follows:
first, the i-th feature map obtained in the decoding stage is decomposed into the feature map h_b^i obtained by the trunk branch and the feature map h_s^i obtained by the jump connection branch, in order to explore separately the contribution of each of the two branches to the denoising ability of the noise predictor ε_θ;

secondly, two regulation coefficients b and s are introduced respectively for adjusting the contributions of the trunk branch and the jump connection branch. The trunk feature regulation coefficient b regulates the feature maps on the channels of the i-th trunk-branch feature map. The jump feature regulation coefficient s screens the feature map obtained by the jump connection branch in the frequency domain, retaining only the high-frequency components.
Finally, for two regulation coefficientsAnd->How to adjust the contribution condition of the trunk branch and the jump connection branch, and further to influence the problem of the image generation result to analyze:
analysis approach:
1. fixing the trunk feature regulation coefficient b and varying the jump feature regulation coefficient s, in order to analyse the influence of the jump connection branch on the denoising ability of the noise predictor ε_θ;

2. fixing the jump feature regulation coefficient s and varying the trunk feature regulation coefficient b, in order to analyse the influence of the trunk branch on the denoising ability of the noise predictor ε_θ.
Analysis results:
1. the analysis finds that the trunk branch has an obvious effect on the denoising ability of the noise predictor ε_θ: when the trunk feature regulation coefficient b is increased, the quality of the generated image improves, but the high-frequency components in the image are suppressed and the image becomes smoother. This means that enhancing the trunk features effectively strengthens the noise-removal ability of the U-Net architecture, which helps achieve better fidelity and detail preservation.

2. the analysis finds that the jump connection branch has a weak effect on the denoising ability of the noise predictor ε_θ. Setting the jump feature regulation coefficient s extracts the high-frequency components on the jump branch, which compensates for the over-smoothing of the trunk-branch feature map. Thus, jointly regulating the trunk branch and the jump connection branch through the two regulation coefficients b and s can better enhance the image quality.
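The joint regulation of the two branches can be sketched as follows; the coefficient values b = 1.2 and s = 0.2 are illustrative, with b > 1 amplifying the trunk features and s < 1 suppressing only the low-frequency content of the jump features before splicing:

```python
import numpy as np

def regulate_and_splice(trunk, jump, b=1.2, s=0.2, r_thresh=4):
    trunk_reg = b * trunk                                  # multiplicative trunk regulation
    h, w = jump.shape[-2], jump.shape[-1]
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h // 2, xx - w // 2)
    mask = np.where(r < r_thresh, s, 1.0)                  # keep high frequencies intact
    F = np.fft.fftshift(np.fft.fft2(jump), axes=(-2, -1)) * mask
    jump_reg = np.fft.ifft2(np.fft.ifftshift(F, axes=(-2, -1))).real
    return np.concatenate([trunk_reg, jump_reg], axis=0)   # channel-wise splice

trunk = np.ones((2, 8, 8))
jump = np.ones((2, 8, 8))
out = regulate_and_splice(trunk, jump)
assert out.shape == (4, 8, 8)
assert np.allclose(out[:2], 1.2)   # trunk features amplified by b
assert np.allclose(out[2:], 0.2)   # constant (pure low-frequency) jump map attenuated by s
```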
Referring to fig. 6, the above fusion inversion, the back-projection mechanism based on zero-range decomposition, and the dual-branch regulated Unet network RUnet are combined to obtain the face image restoration model of the embodiment of the application. The model is evaluated on four restoration tasks: super-resolution (×4), super-resolution (×8), colouring, and deblurring. As shown in fig. 7, real and faithful restored images are obtained for the restoration tasks under the different conditions.
To evaluate the performance of the model, embodiments of the present application performed experiments on two data sets with different distribution characteristics: a CelebA 256 x 256 dataset of face images and an ImageNet 256 x 256 dataset of natural images, both datasets containing 1k verification images independent of the training dataset; removing images with problems of blurring, incorrect data format, excessive similarity and the like in the image data set to obtain a primary screened face image and a natural image; and carrying out data enhancement on the primary screened face image and the natural image, wherein the data enhancement operation is to divide the primary screened face image and the natural image into blocks, then overturn the block images so as to improve the number and diversity of samples, further adjust the samples into images with the scale of 256 multiplied by 256 and the channel number of 3, and take the images as a data set.
To validate the CelebA 256 x 256 dataset of face images, embodiments of the present application employ a pre-training model provided by a de-noising network VE-SDE pre-trained on CelebA. To validate the image net 256 x 256 dataset of natural images, embodiments of the present application employ a pre-training model provided by a denoising network guide-diffusion pre-trained on image net. And respectively carrying out degradation operation on the face image and the natural image in the data set by using different degradation operators (Degradation Operators), wherein the types of the degradation operators comprise: the method comprises the steps of performing bicubic filtering degradation operator aiming at a super-resolution (x 4) and super-resolution (x 8) restoration task, performing three-channel homogenization degradation operator aiming at an image coloring restoration task, performing Gaussian blur kernel aiming at an image deblurring restoration task, wherein the size of the Gaussian blur kernel is 9 x 9, and the variance is set to sigma=13.0, and respectively obtaining various types of degraded images of two data sets.
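As one concrete degradation operator, the three-channel homogenisation used for the colouring task fits the RND framework directly; the replication pseudo-inverse below is the standard Moore-Penrose choice for channel averaging, not necessarily the embodiment's exact construction:

```python
import numpy as np

def A_gray(x):                      # three-channel homogenisation degradation
    return x.mean(axis=0, keepdims=True)

def A_pinv_gray(y):                 # pseudo-inverse: replicate the grey channel
    return np.repeat(y, 3, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 4))      # a small stand-in for a colour image
y = A_gray(x)

# pseudo-inverse property A A† A = A, applied to an image
assert np.allclose(A_gray(A_pinv_gray(A_gray(x))), A_gray(x))
# the null-space part of x is exactly what the degradation destroys
null_part = x - A_pinv_gray(A_gray(x))
assert np.allclose(A_gray(null_part), 0)
```

Here the range part is the shared luminance structure, and the colour information the diffusion prior must supply lives entirely in the null-space part.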
The weights of the two pre-training models are used as the pre-trained diffusion prior of the embodiment of the application, and the performance on the 1k verification images of the two data sets under the various degradation conditions is verified. The test indexes mainly include: peak signal-to-noise ratio (Peak Signal to Noise Ratio, PSNR, higher is better), Fréchet inception distance score (Frechet Inception Distance score, FID, lower is better), and learned perceptual image patch similarity (Learned Perceptual Image Patch Similarity, LPIPS, lower is better); the test results are shown in Table 1. Fusion inversion (30 steps and 100 steps) is compared with some current advanced methods. The quantitative evaluation results shown in Table 1 indicate that the method proposed in the embodiments of the present application obtains competitive results compared with the most advanced methods. When the NFE is set to 30, fusion inversion-30 is superior to other known fast sampling methods, such as DDRM-30, and achieves better FID and LPIPS metrics than the best method (DDNM with 100 NFEs). When the NFE is set to 100, fusion inversion-100 achieves optimal performance on many image restoration tasks, including four-times super-resolution reconstruction and the colouring task. The method provided by the embodiment of the application can obtain optimal restoration results on the tested data sets and tasks.
TABLE 1
With further reference to fig. 8, as an implementation of the method shown in the foregoing drawings, the present application provides an embodiment of a face image restoration device based on a diffusion-generated prior; the device embodiment corresponds to the method embodiment shown in fig. 2, and the device may be specifically applied to various electronic devices.
The embodiment of the application provides a face image restoration device based on diffusion generation prior, which comprises the following steps:
a data acquisition module 1 configured to acquire a face image to be restored;
the execution module 2 is configured to construct a face image restoration model based on a pre-trained diffusion model, wherein the face image restoration model comprises a forward noise adding module, a reverse denoising module and a noise predictor, and the face image to be restored is input into the forward noise adding module to gradually increase noise so as to obtain a noise image; inputting the noise image into the reverse denoising module to gradually denoise, and generating a final restored face image; inputting the noise image of the t step and the timestamp of the t step into the noise predictor, and predicting to obtain the noise of the t step; in the forward noise adding module, inputting the noise image of the t step and the noise of the t step into the forward diffusion formula of fusion inversion to obtain the noise image of the t+1 step; in the reverse denoising module, zero-range decomposition is carried out on the noise image of the t step to obtain decomposition features of the t step, and the decomposition features of the t step and the noise of the t step are input into the reverse diffusion formula to obtain the noise image of the t-1 step.
Referring now to fig. 9, there is illustrated a schematic diagram of a computer apparatus 900 suitable for use in implementing an electronic device (e.g., a server or terminal device as illustrated in fig. 1) of an embodiment of the present application. The electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in fig. 9, the computer apparatus 900 includes a Central Processing Unit (CPU) 901 and a Graphics Processor (GPU) 902, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 903 or a program loaded from a storage section 909 into a Random Access Memory (RAM) 904. In the RAM 904, various programs and data required for the operation of the computer device 900 are also stored. The CPU 901, GPU902, ROM 903, and RAM 904 are connected to each other by a bus 905. An input/output (I/O) interface 906 is also connected to bus 905.
The following components are connected to the I/O interface 906: an input section 907 including a keyboard, a mouse, and the like; an output section 908 including a liquid crystal display (LCD), a speaker, and the like; a storage section 909 including a hard disk and the like; and a communication section 910 including a network interface card such as a LAN card, a modem, and the like. The communication section 910 performs communication processing via a network such as the Internet. A drive 911 may also be connected to the I/O interface 906 as needed. A removable medium 912, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 911 as needed, so that a computer program read therefrom is installed into the storage section 909 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 910, and/or installed from the removable medium 912. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 901 and a Graphics Processor (GPU) 902.
It should be noted that the computer readable medium described in the present application may be a computer readable signal medium or a computer readable medium, or any combination of the two. The computer readable medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor apparatus, device, or means, or a combination of any of the foregoing. More specific examples of the computer-readable medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution apparatus, device, or apparatus. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may be any computer readable medium that is not a computer readable medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution apparatus, device, or apparatus. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments described in the present application may be implemented by software, or may be implemented by hardware. The described modules may also be provided in a processor.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiments, or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a face image to be restored; construct a face image restoration model based on a pre-trained diffusion model, wherein the face image restoration model comprises a forward noise adding module, a reverse denoising module and a noise predictor, and the face image to be restored is input into the forward noise adding module to gradually increase noise so as to obtain a noise image; input the noise image into the reverse denoising module to gradually denoise, and generate a final restored face image; input the noise image of the t step and the timestamp of the t step into the noise predictor, and predict to obtain the noise of the t step; in the forward noise adding module, input the noise image of the t step and the noise of the t step into the forward diffusion formula of fusion inversion to obtain the noise image of the t+1 step; in the reverse denoising module, carry out zero-range decomposition on the noise image of the t step to obtain decomposition features of the t step, and input the decomposition features of the t step and the noise of the t step into the reverse diffusion formula to obtain the noise image of the t-1 step.
The foregoing description presents only the preferred embodiments of the present application and an explanation of the technical principles employed. Persons skilled in the art will appreciate that the scope of the invention referred to in this application is not limited to the specific combinations of the features described above, and is also intended to cover other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the invention, for example, embodiments formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.
Claims (10)
1. A face image restoration method based on diffusion generation prior, characterized by comprising the following steps:
acquiring a face image to be restored;
constructing a face image restoration model based on a pre-trained diffusion model, wherein the face image restoration model comprises a forward noise-adding module, a reverse denoising module and a noise predictor; inputting the face image to be restored into the forward noise-adding module to gradually add noise and obtain a noise image; inputting the noise image into the reverse denoising module to gradually remove noise and generate a final restored face image; inputting the noise image of step t and the time step t into the noise predictor to predict the noise of step t; in the forward noise-adding module, inputting the noise image of step t and the noise of step t into the fusion-inversion forward diffusion formula to obtain the noise image of step t+1; and in the reverse denoising module, performing zero threshold decomposition on the noise image of step t to obtain the decomposition feature of step t, and inputting the decomposition feature of step t and the noise of step t into the reverse diffusion formula to obtain the noise image of step t-1.
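For illustration only, the two-phase pipeline of claim 1 can be sketched as follows. This is a structural skeleton, not the patented implementation: the `noise_predictor` stub and the step-function signatures are assumptions (in the patent, the predictor is the dual-branch Unet of claim 5, and the step formulas are those of claims 2-3).

```python
import numpy as np

def noise_predictor(z_t, t):
    # Stub standing in for the dual-branch adjusting Unet of claim 5;
    # a real predictor would estimate the noise content of z_t at step t.
    return np.zeros_like(z_t)

def restore(x, T, forward_step, reverse_step):
    """Forward noise-adding phase followed by reverse denoising phase."""
    z = x
    for t in range(T):                  # forward noise-adding module
        eps_t = noise_predictor(z, t)
        z = forward_step(z, eps_t, t)   # fusion-inversion forward formula
    for t in range(T - 1, -1, -1):      # reverse denoising module
        eps_t = noise_predictor(z, t)
        z = reverse_step(z, eps_t, t)   # reverse diffusion formula
    return z

# Toy usage: with mutually inverse step functions, the pipeline
# returns the input unchanged after T forward and T reverse steps.
x = np.zeros((4, 4))
out = restore(x, 3, lambda z, e, t: z + 1.0, lambda z, e, t: z - 1.0)
```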
2. The face image restoration method based on diffusion generation prior of claim 1, wherein the forward diffusion formula combined with fusion inversion is:
z_{t+1} = √(ᾱ_{t+1})·x̂_0 + √(1 − ᾱ_{t+1} − σ_{t+1}²)·ε_t + σ_{t+1}·ε ;
wherein z_t represents the noise image of step t, z_{t+1} represents the noise image of step t+1, x̂_0 represents the face image restored at step t, ε_t represents the noise of step t predicted by the noise predictor, β_t represents the noise table, ᾱ_t represents the cumulative noise table, ᾱ_t = ∏_{i=1}^{t}(1 − β_i), η represents the regulating factor, σ_{t+1} = η·√((1 − ᾱ_t)/(1 − ᾱ_{t+1}))·√(1 − ᾱ_{t+1}/ᾱ_t), and ε represents noise sampled from the standard normal distribution, ε ~ N(0, I).
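A minimal numerical sketch of one fusion-inversion forward step, reading claim 2 in the DDIM-inversion style suggested by the cited DDPM/DDIM reference. The symbol names and the σ schedule here are assumptions, not fixed by the claim text.

```python
import numpy as np

def forward_inversion_step(z_t, x0_hat, eps_t, t, betas, eta, rng):
    """One forward step z_t -> z_{t+1}, DDIM-inversion-style (sketch)."""
    ab = np.cumprod(1.0 - betas)      # cumulative noise table from noise table
    ab_t, ab_next = ab[t], ab[t + 1]
    # eta interpolates between deterministic inversion (0) and stochastic (1)
    sigma = eta * np.sqrt((1 - ab_t) / (1 - ab_next)) * np.sqrt(1 - ab_next / ab_t)
    eps = rng.standard_normal(z_t.shape)   # fresh standard-normal noise
    return (np.sqrt(ab_next) * x0_hat
            + np.sqrt(max(1.0 - ab_next - sigma**2, 0.0)) * eps_t
            + sigma * eps)

betas = np.full(10, 0.02)
rng = np.random.default_rng(0)
# With eta = 0 and zero predicted noise the step is deterministic.
z1 = forward_inversion_step(np.ones((2, 2)), np.ones((2, 2)),
                            np.zeros((2, 2)), 0, betas, eta=0.0, rng=rng)
```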
3. The face image restoration method based on diffusion generation prior according to claim 1, wherein performing zero threshold decomposition on the noise image of step t to obtain the decomposition feature of step t specifically comprises:
z_t = A†A·z_t + (I − A†A)·z_t ;
wherein z_t represents the noise image of step t, A†A·z_t is the range-space part and (I − A†A)·z_t is the null-space part, which together form the decomposition feature of step t; A represents the transformation matrix, A† represents the pseudo-inverse matrix of A, A ∈ ℝ^(d×D), ℝ represents a real multidimensional space, D represents the dimension of the face image to be restored, d represents the dimension of the degraded image, and I represents the identity matrix.
4. The face image restoration method based on diffusion generation prior according to claim 3, wherein for a given transformation matrix A, the pseudo-inverse matrix A† is constructed by singular value decomposition or Fourier transformation and satisfies A·A†·A = A.
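The decomposition of claims 3-4, read as a DDNM-style range/null-space split (consistent with the cited Wang et al. reference), can be sketched with an SVD-based pseudo-inverse. The function name `decompose` and the toy dimensions are illustrative.

```python
import numpy as np

def decompose(z, A):
    """Split z into the part A can observe (range space) and the rest."""
    A_pinv = np.linalg.pinv(A)       # pseudo-inverse via SVD; A A^+ A = A
    range_part = A_pinv @ (A @ z)    # A^+ A z : range-space component
    null_part = z - range_part       # (I - A^+ A) z : null-space component
    return range_part, null_part

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 5))      # d=2 degraded dims, D=5 image dims
z = rng.standard_normal(5)
r, n = decompose(z, A)
```

The null-space part is invisible to the degradation operator (A maps it to zero), which is why it can be freely replaced by generated content while the range-space part preserves data consistency.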
5. The face image restoration method based on diffusion generation prior according to claim 1, wherein the noise predictor adopts a dual-branch adjusting Unet network, the dual-branch adjusting Unet network comprises a main branch and a jump connection branch, the main branch comprises a first convolution layer, a second convolution layer, a first maximum pooling layer, a third convolution layer, a fourth convolution layer, a second maximum pooling layer, a fifth convolution layer, a sixth convolution layer, a first deconvolution layer, a seventh convolution layer, a second deconvolution layer and an eighth convolution layer, and the jump connection branch comprises a first Fourier transform module, a first inverse Fourier transform module, a second Fourier transform module and a second inverse Fourier transform module.
6. The face image restoration method based on diffusion generation priori according to claim 5, wherein the noise image input in the t-th step is sequentially subjected to an encoding stage and a decoding stage by the dual-branch adjusting Unet network;
the encoding stage comprises: the noise image in the t step sequentially passes through the first convolution layer and the second convolution layer to obtain a first intermediate coding image; the first intermediate coded image sequentially passes through the first maximum pooling layer, the third convolution layer and the fourth convolution layer to obtain a second intermediate coded image; the second intermediate coded image sequentially passes through the second maximum pooling layer, the fifth convolution layer and the sixth convolution layer to obtain a third intermediate coded image;
The decoding stage comprises: the third intermediate coded image is multiplicatively regulated by the trunk feature regulation coefficient and passed through the first deconvolution layer to obtain a third intermediate image; the second intermediate coded image is converted into the frequency domain by the first Fourier transform module, low-frequency masking is performed using the jump feature regulation coefficient to obtain a first processed image in which high-frequency components are retained, and the first processed image is converted back into the spatial domain by the first inverse Fourier transform module to obtain a fourth intermediate image; the third intermediate image and the fourth intermediate image are spliced to obtain a second intermediate decoded image; the second intermediate decoded image passes through the seventh convolution layer, multiplicative regulation by the trunk feature regulation coefficient, and the second deconvolution layer to obtain a first intermediate image; the first intermediate image is converted into the frequency domain by the second Fourier transform module, low-frequency masking is performed using the jump feature regulation coefficient to obtain a second processed image in which high-frequency components are retained, and the second processed image is converted back into the spatial domain by the second inverse Fourier transform module to obtain a second intermediate image; the first intermediate image and the second intermediate image are spliced to obtain a first intermediate decoded image; and the first intermediate decoded image passes through the eighth convolution layer to obtain the noise of step t.
7. The face image restoration method based on diffusion generation prior of claim 6, wherein in the decoding stage the l-th feature map is decomposed into the feature map h_l^i obtained from the trunk branch and the feature map s_l^i obtained from the jump connection branch;
a trunk feature regulation coefficient b is set in the trunk branch, and the feature map obtained from the trunk branch is regulated by b, as shown in the following formula:
h̃_l^i = b·h_l^i ;
wherein h_l^i represents the feature map on the i-th channel of the l-th trunk feature, C represents the number of channels of the whole feature map, i = 1, …, C, and h̃_l^i represents the feature map on the i-th channel after regulation by the trunk feature regulation coefficient b;
a jump feature regulation coefficient is set in the jump connection branch, and the feature map obtained from the jump connection branch is regulated by it, as shown in the following formulas:
S_l^i = FFT(s_l^i) ;
S̃_l^i = S_l^i ⊙ M ;
s̃_l^i = IFFT(S̃_l^i) ;
wherein s_l^i represents the feature map on the i-th channel of the l-th jump-connection feature, S_l^i represents the feature map obtained after Fourier transformation FFT in the jump connection branch, S̃_l^i represents the frequency-domain regulated feature map, s̃_l^i represents the feature map obtained after inverse Fourier transformation IFFT of the frequency-domain regulated feature map, and M represents a Fourier mask given by the following formula:
M(r) = s, if r < r₀ ; M(r) = 1, otherwise ;
wherein r represents the radius in the frequency domain, r₀ represents the threshold frequency, and s represents the jump feature regulation coefficient.
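The skip-branch regulation of claims 6-7 amounts to a Fourier-domain low-frequency mask: with a coefficient below 1, low frequencies are attenuated while high-frequency detail passes through unchanged. The sketch below assumes a 2-D single-channel feature map; the function name and parameters are illustrative.

```python
import numpy as np

def skip_regulate(feat, r_thresh, s):
    """FFT -> scale low frequencies by s -> inverse FFT (sketch)."""
    H, W = feat.shape
    F = np.fft.fftshift(np.fft.fft2(feat))     # centre the spectrum
    yy, xx = np.mgrid[:H, :W]
    r = np.hypot(yy - H / 2.0, xx - W / 2.0)   # radial frequency
    mask = np.where(r < r_thresh, s, 1.0)      # M(r) = s if r < r_thresh else 1
    return np.fft.ifft2(np.fft.ifftshift(F * mask)).real

rng = np.random.default_rng(2)
feat = rng.standard_normal((8, 8))
same = skip_regulate(feat, r_thresh=3.0, s=1.0)   # s = 1 leaves features intact
```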
8. A face image restoration device based on diffusion generation prior, characterized by comprising:
the data acquisition module is configured to acquire a face image to be restored;
the execution module is configured to construct a face image restoration model based on a pre-trained diffusion model, wherein the face image restoration model comprises a forward noise-adding module, a reverse denoising module and a noise predictor; input the face image to be restored into the forward noise-adding module to gradually add noise and obtain a noise image; input the noise image into the reverse denoising module to gradually remove noise and generate a final restored face image; input the noise image of step t and the time step t into the noise predictor to predict the noise of step t; in the forward noise-adding module, input the noise image of step t and the noise of step t into the fusion-inversion forward diffusion formula to obtain the noise image of step t+1; and in the reverse denoising module, perform zero threshold decomposition on the noise image of step t to obtain the decomposition feature of step t, and input the decomposition feature of step t and the noise of step t into the reverse diffusion formula to obtain the noise image of step t-1.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410004081.6A CN117495714B (en) | 2024-01-03 | 2024-01-03 | Face image restoration method and device based on diffusion generation priori and readable medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117495714A true CN117495714A (en) | 2024-02-02 |
CN117495714B CN117495714B (en) | 2024-04-12 |
Family
ID=89680498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410004081.6A Active CN117495714B (en) | 2024-01-03 | 2024-01-03 | Face image restoration method and device based on diffusion generation priori and readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117495714B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118097363A (en) * | 2024-04-28 | 2024-05-28 | 南昌大学 | Face image generation and recognition method and system based on near infrared imaging |
CN118279178A (en) * | 2024-05-29 | 2024-07-02 | 华侨大学 | Quick face image restoration method and system based on diffusion generation model |
CN118506125A (en) * | 2024-05-16 | 2024-08-16 | 大湾区大学(筹) | Method and device for generating countermeasure sample |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115239593A (en) * | 2022-07-29 | 2022-10-25 | 平安科技(深圳)有限公司 | Image restoration method, image restoration device, electronic device, and storage medium |
WO2022226886A1 (en) * | 2021-04-29 | 2022-11-03 | 深圳高性能医疗器械国家研究院有限公司 | Image processing method based on transform domain denoising autoencoder as a priori |
US20230153949A1 (en) * | 2021-11-12 | 2023-05-18 | Nvidia Corporation | Image generation using one or more neural networks |
CN116309682A (en) * | 2023-03-21 | 2023-06-23 | 复旦大学 | Medical image segmentation method based on conditional Bernoulli diffusion |
CN116597039A (en) * | 2023-05-22 | 2023-08-15 | 阿里巴巴(中国)有限公司 | Image generation method and server |
US11755888B1 (en) * | 2023-01-09 | 2023-09-12 | Fudan University | Method and system for accelerating score-based generative models with preconditioned diffusion sampling |
CN116978132A (en) * | 2023-07-19 | 2023-10-31 | 中国科学院自动化研究所 | Living body detection method, living body detection device, electronic equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
DEJA VU: "Understanding DDPM and DDIM in one article (with simple derivations of the principles and PyTorch code)", Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/666552214> *
YINHUAI WANG: "ZERO-SHOT IMAGE RESTORATION USING DENOISING DIFFUSION NULL-SPACE MODEL", ARXIV, 1 December 2022 (2022-12-01) * |
Also Published As
Publication number | Publication date |
---|---|
CN117495714B (en) | 2024-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN117495714B (en) | Face image restoration method and device based on diffusion generation priori and readable medium | |
US20200104640A1 (en) | Committed information rate variational autoencoders | |
US20220398697A1 (en) | Score-based generative modeling in latent space | |
Khmag | Digital image noise removal based on collaborative filtering approach and singular value decomposition | |
Lee et al. | Multi-architecture multi-expert diffusion models | |
CN113409307A (en) | Image denoising method, device and medium based on heterogeneous noise characteristics | |
Luo et al. | Entropy-based Training Methods for Scalable Neural Implicit Samplers | |
CN117558288A (en) | Training method, device, equipment and storage medium of single-channel voice enhancement model | |
Zhong et al. | Reconstruction for block-based compressive sensing of image with reweighted double sparse constraint | |
CN115222839A (en) | Method for accelerating multi-step greedy expansion of sparse Kaczmarz | |
Zhang et al. | Sals-gan: Spatially-adaptive latent space in stylegan for real image embedding | |
CN113823312B (en) | Speech enhancement model generation method and device, and speech enhancement method and device | |
CN117011130A (en) | Method, apparatus, device, medium and program product for image super resolution | |
CN116391190A (en) | Signal encoding and decoding using generative model and potential domain quantization | |
Zhao et al. | Local-and-nonlocal spectral prior regularized tensor recovery for Cauchy noise removal | |
Feng et al. | Hierarchical guided network for low‐light image enhancement | |
Petković et al. | An iterative method for optimal resolution‐constrained polar quantizer design | |
CN116310660B (en) | Enhanced sample generation method and device | |
CN116704588B (en) | Face image replacing method, device, equipment and storage medium | |
CN116862803B (en) | Reverse image reconstruction method, device, equipment and readable storage medium | |
US20220405583A1 (en) | Score-based generative modeling in latent space | |
Satapathy et al. | A novel low contrast image enhancement using adaptive multi-resolution technique and SVD | |
US11783184B2 (en) | Kernel prediction with kernel dictionary in image denoising | |
JP7544359B2 (en) | Optimization device, training device, synthesis device, their methods, and programs | |
US20240354569A1 (en) | Method, program and device for improving quality of medical data based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||