CN115018954A

CN115018954A - Image generation method and device and electronic equipment

Info

Publication number: CN115018954A
Application number: CN202210941909.1A
Authority: CN
Inventors: 吴凌翔; 王金桥; 牛蕴方
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Zhongke Zidong Taichu Beijing Technology Co ltd
Priority date: 2022-08-08
Filing date: 2022-08-08
Publication date: 2022-09-06
Anticipated expiration: 2042-08-08
Also published as: CN115018954B

Abstract

The invention provides an image generation method, an image generation device and electronic equipment, relates to the technical field of image generation, and addresses the problem of how to generate a target noiseless image that matches the text content of a target text. The method comprises: obtaining a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage and a random non-empty text; and denoising the noise image based on the noise image, the target text, the target noise adding stage and the random non-empty text to generate a target noiseless image, wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold. Because the target text and the random non-empty text both serve as guidance information during denoising, the generated target noiseless image matches the text content of the target text, which improves the accuracy of the generated target noiseless image.

Description

Image generation method and device and electronic equipment

Technical Field

The present invention relates to the field of image generation technologies, and in particular, to an image generation method and apparatus, and an electronic device.

Background

In the image denoising and sampling process of a diffusion model, the image generated by the diffusion model typically matches the target text poorly.

Therefore, how to generate a target noiseless image matching the text content of the target text, thereby improving the accuracy of the generated target noiseless image is a problem to be solved by those skilled in the art.

Disclosure of Invention

The invention provides an image generation method, which can generate a target noiseless image matched with the text content of a target text, thereby improving the accuracy of the generated target noiseless image.

The invention provides an image generation method, which comprises the following steps:

obtaining a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage and a random non-empty text;

denoising the noise image based on the noise image, the target text, the target noise adding stage and the random non-empty text to generate a target noiseless image; wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold.

According to an image generation method provided by the present invention, the denoising of the noise image based on the noise image, the target text, the target noise adding stage and the random non-empty text to generate a target noiseless image includes:

S1, inputting the noise image, the target text and the target noise adding stage into an image denoising model in a diffusion model to obtain a first parameter; and inputting the noise image, the random non-empty text and the target noise adding stage into the image denoising model to obtain a second parameter.

S2, generating a noiseless image corresponding to the noise image in the target noise adding stage according to the first parameter and the second parameter.

S3, updating the target noise adding stage, and judging whether the updated noise adding stage is equal to a second threshold.

S4, determining the noiseless image corresponding to the noise image in the target noise adding stage as the target noiseless image when the updated noise adding stage is equal to the second threshold.

S5, when the updated noise adding stage is greater than the second threshold, determining a noise image corresponding to the previous noise adding stage of the target noise adding stage, taking the noise image corresponding to the previous noise adding stage as the noise image to be processed, taking the updated noise adding stage as the target noise adding stage, taking a new random non-empty text as the random non-empty text, and repeating S1-S5 until the updated noise adding stage is equal to the second threshold, at which point the noiseless image corresponding to the noise image in the updated noise adding stage is determined as the target noiseless image.

According to an image generating method provided by the present invention, the first parameter includes a first mean and a first variance, the second parameter includes a second mean, and the generating of the noise-free image corresponding to the noise image in the target noise adding stage according to the first parameter and the second parameter includes:

fusing the first mean and the second mean to obtain a corresponding target mean;

generating a noiseless image corresponding to the noise image in the target noise adding stage according to the target mean and the first variance.

According to an image generation method provided by the present invention, the fusing of the first mean and the second mean to obtain a corresponding target mean includes:

determining a difference between the first mean and the second mean, and determining a product between the difference and its corresponding weight.

Determining the sum of the second mean and the product as the target mean.

According to an image generating method provided by the present invention, the determining a noise image corresponding to a previous noise adding stage of the target noise adding stage includes:

and generating a noise image corresponding to the previous noise adding stage according to the noise image, the noiseless image corresponding to the noise image in the target noise adding stage and the first variance in the first parameter.

According to an image generating method provided by the present invention, the generating a noise image corresponding to the previous noise adding stage according to the noise image, the noise-free image corresponding to the noise image in the target noise adding stage, and the first variance includes:

determining a third mean according to the noise image, the noiseless image corresponding to the noise image in the target noise adding stage, and the first variance.

A second variance is determined based on the first variance.

And generating a noise image corresponding to the previous noise adding stage according to the third mean value and the second variance.

According to the image generation method provided by the invention, the image denoising model is obtained by training an initial image denoising model in an initial diffusion model based on a plurality of noise image samples and on the text and the noise adding stage corresponding to each of the noise image samples.

The present invention also provides an image generating apparatus, which may include:

an acquisition unit, configured to acquire a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage and a random non-empty text;

a generating unit, configured to denoise the noise image based on the noise image, the target text, the target noise adding stage and the random non-empty text to generate a target noiseless image; wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold.

According to an image generating apparatus provided by the present invention, the generating unit is specifically configured to execute:

S1, inputting the noise image, the target text and the target noise adding stage into an image denoising model in a diffusion model to obtain a first parameter; and inputting the noise image, the random non-empty text and the target noise adding stage into the image denoising model to obtain a second parameter.

According to the image generation device provided by the invention, the first parameter comprises a first mean value and a first variance, the second parameter comprises a second mean value, and the generation unit is specifically configured to fuse the first mean value and the second mean value to obtain a corresponding target mean value; and generating a noiseless image corresponding to the noise image in the target noise adding stage according to the target mean value and the first variance.

According to the image generating device provided by the invention, the generating unit is specifically configured to determine a difference value between the first mean value and the second mean value, and determine a product between the difference value and a corresponding weight; determining the sum of the second mean and the product as the target mean.

According to the image generating device provided by the invention, the generating unit is specifically configured to generate the noise image corresponding to the previous noise adding stage according to the noise image, the non-noise image corresponding to the noise image at the target noise adding stage and the first variance in the first parameter.

According to the image generating device provided by the invention, the generating unit is specifically configured to determine a third mean value according to the noise image, the noise-free image corresponding to the noise image in the target noise adding stage, and the first variance; determining a second variance based on the first variance; and generating a noise image corresponding to the previous noise adding stage according to the third mean value and the second variance.

According to the image generation device provided by the invention, the image denoising model is obtained by training an initial image denoising model in an initial diffusion model based on a plurality of noise image samples and on the text and the noise adding stage corresponding to each of the noise image samples.

The present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the image generation method as described in any of the above when executing the program.

The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image generation method as described in any of the above.

The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the image generation method as described in any one of the above.

According to the image generation method, the image generation device and the electronic equipment provided by the invention, when a target noiseless image matching the text content of a target text is to be generated, a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage and a random non-empty text are obtained first; the noise image is then denoised based on the noise image, the target text, the target noise adding stage and the random non-empty text to generate the target noiseless image, wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold. Because the target text and the random non-empty text both serve as guidance information during denoising, the generated target noiseless image matches the text content of the target text, which improves the accuracy of the generated target noiseless image.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a schematic flowchart of an image generation method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a process of denoising a noise image according to an embodiment of the present invention;

Fig. 3 is a schematic flow chart of a training method of an image denoising model according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present invention;

Fig. 5 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the embodiments of the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of the associated objects and covers three cases; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B can each be singular or plural. In the description of the present invention, the character "/" generally indicates that the objects before and after it are in an "or" relationship.

The technical scheme provided by the embodiment of the invention can be applied to an image generation scene, and in particular to a denoising scene in a diffusion model. In the image denoising and sampling process of a diffusion model, the image generated by the diffusion model typically matches the target text poorly.

To generate a target noiseless image that matches the text content of a target text, and thereby improve the accuracy of the generated target noiseless image, the embodiment of the invention provides an image generation method. In the image denoising and sampling process, two texts are determined in advance: one is the description text of the noiseless image to be generated from the noise image, and the other is a randomly selected non-empty text. Both texts serve as guidance information in the denoising process, so that denoising the noise image yields a target noiseless image that matches the text content of the target text, which improves the accuracy of the generated target noiseless image.

Hereinafter, the image generating method provided by the present invention will be described in detail by the following specific examples. It is to be understood that the following detailed description may be combined with other embodiments, and that the same or similar concepts or processes may not be repeated in some embodiments.

Fig. 1 is a schematic flowchart of an image generation method according to an embodiment of the present invention, where the image generation method may be executed by software and/or a hardware device. For example, referring to fig. 1, the image generating method may include:

S101, acquiring a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage and a random non-empty text.

Here, the target text can be understood as the description text of the noiseless image to be generated from the noise image, and the random non-empty text is a randomly selected non-empty text different from the target text. It can be understood that, in the embodiment of the present invention, the target text corresponding to the noise image serves as the guidance condition and the random non-empty text serves as the guidance basis; that is, the two texts together serve as guidance information in the denoising process, so that a target noiseless image matching the text content of the target text is generated in combination with this guidance information.

For example, to select a non-empty text at random, a text set may be specified in advance. The text set may be all the texts in a text training set or a subset of them, and can be set according to actual needs; the embodiment of the present invention places no specific limit on it. It should be noted that, so that the random non-empty texts used during denoising differ from one another as much as possible, which better assists the denoising, the text set needs to be of a certain size, that is, it needs to contain a certain number of texts.
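The selection of a random non-empty text from a pre-specified text set can be sketched as follows. This is a minimal illustration only: the function name `sample_random_text`, the toy text set and the seeded generator are assumptions made for the example and do not come from the source.

```python
import random

def sample_random_text(text_set, target_text, rng=random):
    """Pick a non-empty text from text_set that differs from the target text."""
    candidates = [t for t in text_set if t.strip() and t != target_text]
    if not candidates:
        raise ValueError("text set has no usable non-empty text")
    return rng.choice(candidates)

# Hypothetical text set; in practice it could be a text training set or a subset.
text_set = ["a dog on grass", "", "a red car", "a cat on a sofa"]
rng = random.Random(0)
picks = [sample_random_text(text_set, "a cat on a sofa", rng) for _ in range(5)]
```

A larger text set makes it more likely that the texts drawn at different noise adding stages differ from one another, which is the property the paragraph above asks for.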

For example, the noise image to be processed may be received from another electronic device, retrieved from local storage, or obtained from a third-party database; this can be set according to actual needs, and the embodiment of the present invention does not specifically limit the method for acquiring the noise image to be processed.

The noise image can be understood as the result of adding noise to an original clean image according to the target noise adding stage. Assuming that, in a diffusion model scene, the number of diffusion steps is set to T and the target noise adding stage is denoted t, the noise adding stage ranges from 0 to T, and the target noise adding stage t is a random number in that range. For example, if the randomly determined target noise adding stage is t = 5, the noise adding algorithm in the diffusion model adds noise to the original clean image five times in succession, each pass operating on the noise image produced by the previous pass: the second on the result of the first, the third on the result of the second, and so on through the fifth. The result is the noise image of the original clean image at the target noise adding stage t = 5.

Illustratively, when the noise adding processing algorithm in the diffusion model is used for adding noise to the original clean image, the noise adding method in a cosine mode can be adopted for gradually adding noise, so that the noise adding process is more stable, and the related information of the original clean image can be well kept in the diffusion process; of course, other noise adding methods may be used to perform the noise adding process, and may be specifically set according to actual needs, and here, the embodiment of the present invention is only described by taking the cosine-based noise adding method as an example, but the embodiment of the present invention is not limited thereto.
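The step-by-step noise adding described above can be sketched as follows. The source only states that a cosine-based noise adding method may be used; the specific squared-cosine curve, the offset 0.008, the cap 0.999, the per-step update rule and the names `cosine_betas` and `add_noise_stepwise` are assumptions taken from common diffusion-model practice, not from the source.

```python
import math
import random

def cosine_betas(T, offset=0.008, max_beta=0.999):
    """Per-stage noise variances beta_1..beta_T derived from a squared-cosine curve."""
    def alpha_bar(t):
        # Fraction of the clean signal assumed to remain after t stages.
        return math.cos((t / T + offset) / (1 + offset) * math.pi / 2) ** 2
    betas = []
    for t in range(1, T + 1):
        beta = 1 - alpha_bar(t) / alpha_bar(t - 1)
        betas.append(min(beta, max_beta))  # cap to avoid a variance of 1
    return betas

def add_noise_stepwise(x0, t, betas, rng):
    """Apply t successive noise adding passes, each on the previous result."""
    x = list(x0)
    for s in range(t):  # betas[s] holds the variance of stage s + 1
        keep = math.sqrt(1 - betas[s])
        spread = math.sqrt(betas[s])
        x = [keep * v + spread * rng.gauss(0.0, 1.0) for v in x]
    return x

betas = cosine_betas(T=1000)
rng = random.Random(0)
clean = [0.5, -0.25, 1.0]  # toy "image" with three pixel values
noisy = add_noise_stepwise(clean, t=5, betas=betas, rng=rng)
```

With this schedule the early stages add very little noise and the late stages add much more, which is one way to keep information about the original clean image for longer during diffusion.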

After the noise image to be processed, the target text corresponding to the noise image, the target noise adding stage and the random non-empty text are respectively obtained, the following S102 may be executed:

S102, denoising the noise image based on the noise image, the target text, the target noise adding stage and the random non-empty text to generate a target noiseless image, wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold.

The value of the first threshold may be set according to actual needs, and the specific value of the first threshold is not specifically limited in the embodiments of the present invention.

In general, in a diffusion model scene, a target noise adding stage conforming to a Gaussian probability distribution may be randomly initialized, and the noise adding algorithm in the diffusion model adds noise to the original clean image step by step according to that stage. Once the noise image is obtained, it can be denoised layer by layer through the reverse diffusion process to obtain the predicted target noiseless image. In one example, the layer-by-layer denoising is driven by the value of the target noise adding stage, that is, the denoising operation is executed in a loop until the predicted target noiseless image is obtained. It can be understood that, during this layer-by-layer denoising, the target text is the same for the noise images at all noise adding stages, whereas the random non-empty text differs from stage to stage, which better assists the denoising.

For example, when denoising the noise image based on the noise image, the target text, the target denoising stage and the random non-empty text, the denoising process may include:

S1, inputting the noise image, the target text and the target noise adding stage into an image denoising model in the diffusion model to obtain a first parameter; and inputting the noise image, the random non-empty text and the target noise adding stage into the image denoising model to obtain a second parameter. For example, see fig. 2, which is a schematic diagram of a process for denoising the noise image according to the embodiment of the present invention. After the first parameter and the second parameter are obtained, the following S2 may be performed:

S2, generating a noiseless image corresponding to the noise image in the target noise adding stage according to the first parameter and the second parameter.

S3, updating the target noise adding stage, and judging whether the updated noise adding stage is equal to the second threshold.

S4, determining the noiseless image corresponding to the noise image in the target noise adding stage as the target noiseless image when the updated noise adding stage is equal to the second threshold.

S5, when the updated noise adding stage is greater than the second threshold, determining a noise image corresponding to the previous noise adding stage of the target noise adding stage, taking the noise image corresponding to the previous noise adding stage as the noise image to be processed, taking the updated noise adding stage as the target noise adding stage, taking a new random non-empty text as the random non-empty text, and repeating S1-S5 until the updated noise adding stage is equal to the second threshold, at which point the noiseless image corresponding to the noise image in the updated noise adding stage is determined as the target noiseless image.

The value of the second threshold can be set according to actual needs. Illustratively, the second threshold is 0 when the noise adding stage ranges from 0 to T.
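The control flow of the denoising loop can be sketched as follows, assuming the stage is updated by subtracting 1 and the second threshold is 0. The callables `denoise_model`, `predict_clean`, `previous_noise_image` and `sample_text` are hypothetical placeholders for the model and the per-step computations; only the looping logic is illustrated here.

```python
def generate_image(x_t, target_text, t, denoise_model, predict_clean,
                   previous_noise_image, sample_text, second_threshold=0):
    """Repeat the denoising steps until the updated stage reaches the threshold."""
    if t <= second_threshold:
        raise ValueError("target noise adding stage must exceed the threshold")
    while True:
        random_text = sample_text()                     # new random text per stage
        first = denoise_model(x_t, target_text, t)      # first parameter
        second = denoise_model(x_t, random_text, t)     # second parameter
        x0 = predict_clean(x_t, first, second, t)       # noiseless image at stage t
        updated = t - 1                                 # update the stage
        if updated == second_threshold:                 # finished: return the result
            return x0
        x_t = previous_noise_image(x_t, x0, first, t)   # step back one stage
        t = updated

# Toy stand-ins that only exercise the control flow.
calls = []
def toy_model(x, text, t):
    calls.append((text, t))
    return {"mean": x, "variance": 1.0}

result = generate_image(
    x_t=[1.0], target_text="a cat", t=3,
    denoise_model=toy_model,
    predict_clean=lambda x, first, second, t: [v * 0.5 for v in x],
    previous_noise_image=lambda x, x0, first, t: x0,
    sample_text=lambda: "random caption",
)
```

In this toy run the model is queried twice per stage, once with the target text and once with the random text, for stages 3, 2 and 1.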

Illustratively, the image denoising model may be a UNet neural network model, which mainly includes an input layer, an intermediate layer and an output layer. The input layer is formed mainly by stacking a plurality of residual blocks, attention blocks and downsampling blocks, and is used to extract features of the noise image. The intermediate layer consists of a residual block, an attention block and another residual block, and is used to further integrate and process the extracted features. The output layer is formed by stacking a plurality of residual blocks, attention blocks and upsampling blocks, and is used to restore the features integrated and processed by the intermediate layer, so as to obtain the mean and the variance of the noise contained in the noise image.

Illustratively, the image denoising model is obtained by training an initial image denoising model in the initial diffusion model based on a plurality of noise image samples and on the text and the noise adding stage corresponding to each of the noise image samples; the training process of the initial image denoising model will be described in detail later.

Exemplarily, in S1 above, the noise image, the target text and the target noise adding stage are input into the image denoising model in the diffusion model. The image denoising model may use a sinusoidal time encoder to encode the target noise adding stage, and the resulting encoding features are added to each residual block of the model. The model may use a tokenizer to split the target text into tokens and serialize them; the serialized text features are fed into the attention blocks, where the text features are fused using an attention mechanism. The model finally outputs the mean and the variance of the noise contained in the noise image at the target noise adding stage, namely the first parameter, whose mean and variance are denoted the first mean and the first variance. Similarly, the noise image, the random non-empty text and the target noise adding stage are input into the image denoising model and processed in the same way, with the random non-empty text in place of the target text; the output is the second parameter. The mean in the second parameter is denoted the second mean. The variance in the second parameter is not used in subsequent processing, so it is not given a separate name.
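The encoding of the noise adding stage can be sketched as follows. The source only names a sinusoidal time encoder; the layout used here (a sine half followed by a cosine half, with geometrically spaced frequencies and a maximum period of 10000) is a common convention assumed for illustration, and `sinusoidal_stage_encoding` is a hypothetical name.

```python
import math

def sinusoidal_stage_encoding(t, dim, max_period=10000.0):
    """Encode an integer noise adding stage t as dim sine and cosine features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies fall geometrically from 1 towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

enc0 = sinusoidal_stage_encoding(0, 8)
enc5 = sinusoidal_stage_encoding(5, 8)
```

Each stage maps to a distinct vector of fixed length, which a network can add to the features inside its residual blocks.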

For example, in S2 above, when the noiseless image corresponding to the noise image in the target noise adding stage is generated according to the first parameter and the second parameter, the diffusion model may adopt the idea of random difference guidance: the first mean and the second mean are first fused to obtain the corresponding target mean, and the noiseless image corresponding to the noise image in the target noise adding stage is then generated according to the fused target mean and the first variance. Compared with generation without guidance, this can improve the image quality and accuracy of the generated noiseless image to a certain extent.

For example, when the first mean and the second mean are fused, the difference between the first mean and the second mean may be determined, together with the product of that difference and its corresponding weight; the sum of the second mean and the product is then determined as the target mean. For ease of understanding, assume that the target noise adding stage is $t$, and let $x_t$ denote the noise image at the target noise adding stage $t$, $c_r$ the random non-empty text, and $c$ the target text. The noise image $x_t$, the target text $c$ and the target noise adding stage $t$ are input into the image denoising model, and the first mean it outputs is recorded as $\mu_\theta(x_t, t, c)$. The noise image $x_t$, the random non-empty text $c_r$ and the target noise adding stage $t$ are input into the image denoising model, and the second mean it outputs is recorded as $\mu_\theta(x_t, t, c_r)$. The first mean $\mu_\theta(x_t, t, c)$ and the second mean $\mu_\theta(x_t, t, c_r)$ are then fused to obtain the target mean according to the following formula 1:

$$\hat{\mu}_\theta(x_t, t, c) = \mu_\theta(x_t, t, c_r) + w \cdot \big(\mu_\theta(x_t, t, c) - \mu_\theta(x_t, t, c_r)\big) \tag{1}$$

wherein $\hat{\mu}_\theta(x_t, t, c)$ represents the target mean obtained by the fusion, and $w$ represents the weight corresponding to the difference between the first mean and the second mean.

After the first mean value and the second mean value are fused to obtain a corresponding target mean value, a non-noise image corresponding to the noise image in the target noise adding stage can be generated according to the target mean value and the first variance.
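The fusion of the two means (the second mean plus the weighted difference between the first mean and the second mean) can be sketched on toy values as follows. The function name `fuse_means`, the two-element lists and the weight 3.0 are illustrative assumptions; the source does not state a value for the weight.

```python
def fuse_means(first_mean, second_mean, weight):
    """Return second + weight * (first - second), computed element by element."""
    return [m2 + weight * (m1 - m2) for m1, m2 in zip(first_mean, second_mean)]

first_mean = [0.8, 0.2]    # predicted with the target text
second_mean = [0.5, 0.4]   # predicted with a random non-empty text
target_mean = fuse_means(first_mean, second_mean, weight=3.0)
```

A weight of 0 returns the second mean unchanged and a weight of 1 returns the first mean, while a weight above 1 moves the result further in the direction that separates the target text from the random text.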

After S2 is executed to generate the noiseless image corresponding to the noise image in the target noise adding stage, that noiseless image cannot be taken directly as the final target noiseless image. The noise adding stage must first be updated; for example, each update may subtract 1 from the value of the target noise adding stage to obtain the updated noise adding stage. The updated noise adding stage is then compared with the second threshold, which may for example be set to 0 and can be set according to actual needs. When the updated noise adding stage is equal to the second threshold, the noiseless image corresponding to the noise image in the target noise adding stage is determined as the final target noiseless image. When the updated noise adding stage is greater than the second threshold, S5 is executed: the noise image $x_{t-1}$ corresponding to the previous noise adding stage of the target noise adding stage is determined and taken as the noise image to be processed in S1, the updated noise adding stage is taken as the target noise adding stage, a new random non-empty text is taken as the random non-empty text, and S1-S5 are repeated until the updated noise adding stage is equal to the second threshold, at which point the noiseless image corresponding to the noise image in the updated noise adding stage is determined as the final target noiseless image.

For example, in S5, the noise image corresponding to the previous noise adding stage of the target noise adding stage may be generated according to the noise image, the noiseless image corresponding to the noise image in the target noise adding stage, and the first variance in the first parameter.

Specifically, a third mean may be determined according to the noise image, the noiseless image corresponding to the noise image in the target noise adding stage, and the first variance; a second variance may be determined based on the first variance; and the noise image corresponding to the previous noise adding stage may then be generated according to the third mean and the second variance.

For example, when determining the third mean according to the noise image, the noiseless image corresponding to the noise image at the target noise adding stage, and the first variance, see formula 2:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t$$

wherein $\tilde{\mu}_t$ represents the third mean, $x_t$ represents the noise image at the target noise adding stage t, $x_0$ represents the noiseless image corresponding to the noise image $x_t$ at the target noise adding stage t, $\beta_t$ represents the first variance of the noise contained in the noise image $x_t$ at the target noise adding stage t, $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$, s denotes the s-th noise adding stage among the noise adding stages 1 to t, and $\beta_s$ represents the variance of the noise contained in the noise image $x_s$ at the noise adding stage s.

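For example, when determining the second variance based on the first variance, see formula 3:

$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

wherein $\tilde{\beta}_t$ represents the second variance of the noise contained in the noise image $x_t$ at the target noise adding stage t, $\beta_t$ represents the first variance, and $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.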

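Illustratively, when generating the noise image $x_{t-1}$ corresponding to the previous noise adding stage according to the third mean and the second variance, see formula 4:

$$x_{t-1} = \tilde{\mu}_t + \sqrt{\tilde{\beta}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

wherein $\tilde{\mu}_t$ represents the third mean, $\tilde{\beta}_t$ represents the second variance, $\mathcal{N}$ represents a Gaussian distribution, and $\epsilon$ represents random noise that follows a normal distribution.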

By combining the above equations 2, 3 and 4, a noise image corresponding to the previous noise adding stage can be generated.
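The combination of formulas 2, 3 and 4 can be sketched numerically as follows. This is a minimal sketch that assumes the formulas take the form of the standard diffusion posterior (third mean as a weighted sum of the noiseless image and the noise image, second variance as a rescaled first variance); the function name `previous_stage_sample`, the array `betas` holding the per-stage variances, and the 1-indexed stage convention are assumptions made for illustration.

```python
import numpy as np

def previous_stage_sample(x_t, x0_pred, t, betas, rng):
    """Draw the noise image of stage t-1 from the noise image of stage t.

    Assumed conventions: t is 1-indexed and betas[s - 1] is the noise
    variance of stage s. Returns (sample, third_mean, second_variance).
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)                  # running product over stages 1..t
    ab_t = alpha_bar[t - 1]
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0    # empty product for stage 0
    beta_t = betas[t - 1]                           # first variance

    # Third mean: weighted sum of the predicted noiseless image and x_t.
    third_mean = (np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)) * x0_pred \
        + (np.sqrt(alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - ab_t)) * x_t
    # Second variance: first variance rescaled by the cumulative products.
    second_variance = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    # Sample from a Gaussian with that mean and variance.
    noise = rng.standard_normal(np.shape(x_t))
    sample = third_mean + np.sqrt(second_variance) * noise
    return sample, third_mean, second_variance
```

Under these assumptions the second variance is zero at stage 1, so the last step returns the predicted noiseless image without adding noise.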

It can be seen that, in the embodiment of the present invention, when generating a target noiseless image that matches the text content of a target text, a noise image to be processed, the target text corresponding to the noise image, a target noise adding stage and a random non-empty text may be obtained first; the noise image is then denoised based on the noise image, the target text, the target noise adding stage and the random non-empty text to generate a target noiseless image, where the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold. Because the target text and the random non-empty text are both used as guide information during denoising, a target noiseless image that matches the text content of the target text can be generated, which improves the accuracy of the generated target noiseless image.

The embodiment shown in fig. 1 above describes in detail how to generate, during image denoising sampling, a target noiseless image that matches the text content of the target text. How the image denoising model is obtained by training is described in detail below through the embodiment shown in fig. 3.

Fig. 3 is a flowchart illustrating a method for training an image denoising model according to an embodiment of the present invention, where the method may be implemented by software and/or hardware. For example, referring to fig. 3, the training method of the image denoising model may include:

S301, obtaining a plurality of clean image samples, and the text and noise adding stage corresponding to each of the plurality of clean image samples.

The text corresponding to the clean image sample can be understood as the description text of the clean image sample, that is, the text description information of the clean image sample.

For example, the plurality of clean image samples and their corresponding texts may be received from another electronic device, retrieved from local storage, or obtained from a third-party database, as set according to actual needs.

For example, when the clean image samples and their corresponding texts are obtained from a third-party database, an image-text data set may be downloaded from that database, and each download uniform resource locator (URL) is mapped to a hash value that is stored as the file name of the clean image sample. To facilitate reading large-scale data, the file name and corresponding text information of each clean image sample are extracted and stored in a document; loading this document yields the index information and corresponding text information of all files, and thereby the plurality of clean image samples and their corresponding texts.
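The URL-to-file-name mapping and the index document described above might look as follows. This is a sketch under stated assumptions: the text does not name the hash function or the document format, so SHA-256 and a JSON Lines file are chosen here purely for illustration, and the helper names `url_to_filename`, `write_index` and `load_index` are hypothetical.

```python
import hashlib
import json

def url_to_filename(url: str, suffix: str = ".jpg") -> str:
    """Map a download URL to a stable file name (SHA-256 is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest() + suffix

def write_index(pairs, index_path):
    """Store one {file, text} record per line for each (url, text) pair."""
    with open(index_path, "w", encoding="utf-8") as f:
        for url, text in pairs:
            record = {"file": url_to_filename(url), "text": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def load_index(index_path):
    """Load all records; each gives a file name and its description text."""
    with open(index_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping the index in a single line-oriented document means the file names and texts of a large data set can be loaded without scanning the image directory itself.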

After the plurality of clean image samples and the text and noise adding stage corresponding to each clean image sample are obtained, the following S302 may be performed:

S302, inputting the plurality of clean image samples and the noise adding stages corresponding to the plurality of clean image samples into a noise adding processing algorithm in the initial diffusion model, to obtain the noise image sample corresponding to each clean image sample at its noise adding stage.

Assuming that the set number of diffusion steps is T and the noise adding stage is denoted by t, the noise adding stage ranges from 0 to T, the noise adding stage t is a random number within 0 to T, and the noise image sample corresponding to a clean image sample at the noise adding stage t can be denoted as $x_t$.

For example, assuming that the noise adding stage corresponding to a clean image sample is t = 5, the clean image sample and its noise adding stage are input into the noise adding processing algorithm in the initial diffusion model, and the noise adding processing algorithm may apply noise to the original clean image five times in succession. The second noise adding pass is performed on the noise image resulting from the first pass, the third on the result of the second, the fourth on the result of the third, and the fifth on the result of the fourth. After these five successive passes, the noise image sample of the clean image sample at the noise adding stage t = 5 is obtained. In a similar manner, the noise image sample corresponding to each of the plurality of clean image samples at its noise adding stage can be obtained.
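The successive noise adding passes can be sketched as below. The text does not specify the per-pass noise rule, so this sketch assumes the usual Gaussian diffusion step, in which each pass scales the previous image by $\sqrt{1-\beta_s}$ and adds Gaussian noise with variance $\beta_s$; the function name `add_noise_stepwise` and the schedule array `betas` are hypothetical.

```python
import numpy as np

def add_noise_stepwise(x0, t, betas, rng):
    """Apply t successive noise adding passes to a clean image x0.

    Assumption: pass s uses the Gaussian step
        x_s = sqrt(1 - beta_s) * x_{s-1} + sqrt(beta_s) * noise,
    with betas[s - 1] the noise variance of stage s.
    """
    x = np.asarray(x0, dtype=np.float64)
    for s in range(1, t + 1):
        beta_s = betas[s - 1]
        noise = rng.standard_normal(x.shape)
        # Each pass starts from the noise image produced by the previous pass.
        x = np.sqrt(1.0 - beta_s) * x + np.sqrt(beta_s) * noise
    return x
```

For t = 5 the loop body runs five times, each pass operating on the output of the one before it, which mirrors the example in the preceding paragraph.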

S303, inputting the noise image sample corresponding to each of the plurality of clean image samples at its noise adding stage, together with the target text and the noise adding stage, into an initial image denoising model in the initial diffusion model, to obtain the mean and variance of the noise contained in the noise image corresponding to each clean image sample at its noise adding stage.

Illustratively, the initial image denoising model may be a UNet neural network model, which mainly includes an input layer, an intermediate layer, and an output layer. The input layer is formed mainly by stacking a plurality of residual blocks, attention blocks and down-sampling blocks, and is used to extract features of the noise image sample. The intermediate layer consists of a residual block, an attention block and another residual block, and is used to further integrate and process the extracted features. The output layer is formed by stacking a plurality of residual blocks, attention blocks and up-sampling blocks, and is used to restore the features integrated and processed by the intermediate layer, so as to obtain the mean and variance of the noise contained in the noise image sample.

Exemplarily, in an embodiment of the present invention, the input layer may consist of 15 residual blocks, 9 attention blocks and 6 down-sampling blocks; the intermediate layer may consist of 2 residual blocks and 1 attention block; and the output layer may consist of 19 residual blocks, 12 attention blocks and 6 up-sampling blocks.

Exemplarily, the noise image sample, the target text and the noise adding stage are input into the initial image denoising model in the diffusion model. The initial image denoising model may use a sinusoidal time encoder to encode the noise adding stage, and the resulting encoding features are added to each residual block of the model. The model may also use a tokenizer to split the target text into tokens and serialize it; the serialized text features are input into the attention blocks, where they are fused by the attention mechanism. Finally, the model outputs the mean and variance of the noise contained in the noise image corresponding to the noise adding stage.
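A sinusoidal encoding of the noise adding stage can be sketched as follows. The text names only a "sinusoidal time encoder", so the concrete frequency schedule, the parameter `max_period`, the even embedding dimension and the function name `sinusoidal_stage_encoding` are assumptions borrowed from the common Transformer-style positional encoding.

```python
import numpy as np

def sinusoidal_stage_encoding(t, dim, max_period=10000.0):
    """Encode a scalar noise adding stage t as a vector of length dim.

    Assumes dim is even: the first half holds cosines and the second half
    sines, evaluated at geometrically spaced frequencies.
    """
    half = dim // 2
    # Frequencies decay from 1 down towards 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = float(t) * freqs
    return np.concatenate([np.cos(args), np.sin(args)])
```

The resulting vector would typically be passed through a small projection before being added to the features of each residual block; that projection is omitted here.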

After the mean and variance of the noise contained in the noise image corresponding to each clean image sample at its noise adding stage are obtained, the model parameters of the initial image denoising model may be updated accordingly, that is, the following S304 is executed:

S304, updating the model parameters of the initial image denoising model according to each clean image sample and the mean and variance of the noise contained in the noise image corresponding to each clean image sample at its noise adding stage, so as to obtain a final image denoising model.

It can be understood that, in the embodiment of the present invention, when the initial diffusion model is updated, only the model parameters of the initial image denoising model in the initial diffusion model are updated; the parameters of the noise adding processing algorithm and of the other calculation methods in the initial diffusion model are generally not updated.

For example, when the model parameters of the initial image denoising model are updated in this way, the diffusion model may first calculate a predicted clean image for each clean image sample at its noise adding stage, according to the mean and variance of the noise contained in the corresponding noise image; the model parameters of the initial image denoising model are then updated according to each clean image sample and its predicted clean image.

Illustratively, for each clean image sample, a mean square error loss and a variational lower bound loss may be constructed from the clean image sample and its predicted clean image at the noise adding stage, and the target loss corresponding to the clean image sample is determined from these two losses. Once the target loss of every clean image sample has been determined, the average loss over the plurality of clean image samples is computed from the individual target losses, and the model parameters of the initial image denoising model are updated based on this average loss until the updated image denoising model converges. The converged image denoising model is determined as the image denoising model finally obtained by training.
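The way the two losses are combined and averaged can be sketched as follows. The text does not give the exact form of either loss or how they are weighted, so this sketch makes several assumptions: the variational lower bound term is represented by a KL divergence between two diagonal Gaussians, the target loss is a weighted sum with a hypothetical coefficient `vlb_weight`, and all function names are illustrative.

```python
import numpy as np

def gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between diagonal Gaussians, averaged over all elements."""
    kl = 0.5 * (np.log(var_q / var_p)
                + (var_p + (mean_p - mean_q) ** 2) / var_q
                - 1.0)
    return float(np.mean(kl))

def target_loss(clean, predicted_clean, vlb_term, vlb_weight=0.001):
    """Combine the mean square error with a weighted lower-bound term.

    vlb_weight is a hypothetical coefficient; the source does not state one.
    """
    mse = float(np.mean((clean - predicted_clean) ** 2))
    return mse + vlb_weight * vlb_term

def average_loss(per_sample_losses):
    """Average the target losses of all clean image samples in a batch."""
    return float(np.mean(per_sample_losses))
```

The averaged value is what an optimizer would minimize when updating the parameters of the image denoising model; the optimizer and convergence test are outside the scope of this sketch.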

It can be seen that, in the embodiment of the present invention, when the image denoising model is trained, a plurality of clean image samples and the text and noise adding stage corresponding to each clean image sample may be obtained first; the plurality of clean image samples and their noise adding stages are input into the noise adding processing algorithm in the initial diffusion model to obtain the noise image sample corresponding to each clean image sample at its noise adding stage; the noise image samples, the target texts and the noise adding stages are input into the initial image denoising model in the initial diffusion model to obtain the mean and variance of the noise contained in the noise image corresponding to each clean image sample at its noise adding stage; and the model parameters of the initial image denoising model are updated according to each clean image sample and the corresponding mean and variance, so as to obtain the final image denoising model. In this way, the training efficiency of the image denoising model can be improved, and its denoising performance can be effectively enhanced.

The image generating apparatus provided by the present invention is described below, and the image generating apparatus described below and the image generating method described above may be referred to in correspondence with each other.

Fig. 4 is a schematic structural diagram of an image generating apparatus 40 according to an embodiment of the present invention, and for example, referring to fig. 4, the image generating apparatus 40 may include:

An acquiring unit 401, configured to acquire a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage, and a random non-empty text.

A generating unit 402, configured to denoise the noise image based on the noise image, the target text, the target noise adding stage, and the random non-empty text, to obtain a target noiseless image; wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold.

Optionally, the generating unit 402 is specifically configured to perform:

S1, inputting the noise image, the target text and the target noise adding stage into an image denoising model in the diffusion model to obtain a first parameter; and inputting the noise image, the random non-empty text and the target noise adding stage into the image denoising model to obtain a second parameter.

Optionally, the first parameter includes a first mean and a first variance, the second parameter includes a second mean, and the generating unit 402 is specifically configured to fuse the first mean and the second mean to obtain a corresponding target mean, and to generate the noiseless image corresponding to the noise image at the target noise adding stage according to the target mean and the first variance.

Optionally, the generating unit 402 is specifically configured to determine a difference between the first mean value and the second mean value, and determine a product between the difference and a corresponding weight; and determining the sum of the second mean value and the product as a target mean value.

Optionally, the generating unit 402 is specifically configured to generate the noise image corresponding to the previous noise adding stage according to the noise image, the noiseless image corresponding to the noise image at the target noise adding stage, and the first variance in the first parameter.

Optionally, the generating unit 402 is specifically configured to determine a third mean according to the noise image, the noiseless image corresponding to the noise image at the target noise adding stage, and the first variance; determine a second variance based on the first variance; and generate the noise image corresponding to the previous noise adding stage according to the third mean and the second variance.

Optionally, the image denoising model is obtained by training an initial image denoising model in the initial diffusion model based on a plurality of noise image samples and the texts and noise adding stages corresponding to the plurality of noise image samples.
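The mean fusion performed by the generating unit 402, in which the difference between the first mean and the second mean is multiplied by a weight and added to the second mean, can be sketched in one line. The function name `fuse_means` and the example weight are illustrative assumptions; the text does not state a concrete weight value.

```python
import numpy as np

def fuse_means(first_mean, second_mean, weight):
    """Target mean = second mean + weight * (first mean - second mean).

    first_mean comes from the target text, second_mean from the random
    non-empty text; weight is the corresponding fusion weight.
    """
    return second_mean + weight * (first_mean - second_mean)

# Example: a weight above 1 pushes the result past the text-conditioned mean.
target_mean = fuse_means(np.array([1.0, 2.0]), np.array([0.0, 1.0]), 3.0)
```

A weight of 1 reproduces the first mean and a weight of 0 reproduces the second mean, so the weight controls how strongly the target text steers the result relative to the random non-empty text.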

The image generation apparatus 40 provided in the embodiment of the present invention can execute the technical solution of the image generation method in any of the above embodiments. Its implementation principle and beneficial effects are similar to those of the image generation method; reference may be made to the description of the method, and details are not repeated here.

Fig. 5 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device may include: a processor 510, a communication interface 520, a memory 530 and a communication bus 540, wherein the processor 510, the communication interface 520 and the memory 530 communicate with each other via the communication bus 540. The processor 510 may invoke logic instructions in the memory 530 to perform an image generation method comprising: acquiring a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage and a random non-empty text; and denoising the noise image based on the noise image, the target text, the target noise adding stage and the random non-empty text to generate a target noiseless image, wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold.

Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When executed by a processor, the computer program is capable of performing the image generation method provided by the above method embodiments, the method comprising: acquiring a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage and a random non-empty text; and denoising the noise image based on the noise image, the target text, the target noise adding stage and the random non-empty text to generate a target noiseless image, wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image generation method provided by the above method embodiments, the method comprising: acquiring a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage and a random non-empty text; and denoising the noise image based on the noise image, the target text, the target noise adding stage and the random non-empty text to generate a target noiseless image, wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the various embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. An image generation method, comprising:

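acquiring a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage, and a random non-empty text;

denoising the noise image based on the noise image, the target text, the target noise adding stage, and the random non-empty text to generate a target noiseless image; wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold.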

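2. The image generation method of claim 1, wherein the denoising the noise image based on the noise image, the target text, the target noise adding stage, and the random non-empty text to generate a target noiseless image comprises:

S1, inputting the noise image, the target text and the target noise adding stage into an image denoising model in a diffusion model to obtain a first parameter; inputting the noise image, the random non-empty text and the target noise adding stage into the image denoising model to obtain a second parameter;

S2, generating a noiseless image corresponding to the noise image at the target noise adding stage according to the first parameter and the second parameter;

S3, updating the target noise adding stage, and judging whether the updated noise adding stage is equal to a second threshold;

S4, determining the noiseless image corresponding to the noise image at the target noise adding stage as the target noiseless image in the case that the updated noise adding stage is equal to the second threshold;

S5, in the case that the updated noise adding stage is greater than the second threshold, determining a noise image corresponding to a previous noise adding stage of the target noise adding stage, determining the noise image corresponding to the previous noise adding stage as the noise image to be processed in S1, determining the updated noise adding stage as the target noise adding stage, determining a new random non-empty text as the random non-empty text, and repeating S1 to S5 until the updated noise adding stage is equal to the second threshold, and determining the noiseless image corresponding to the noise image at the updated noise adding stage as the target noiseless image.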

3. The image generation method according to claim 2, wherein the first parameter includes a first mean and a first variance, the second parameter includes a second mean, and the generating the noiseless image corresponding to the noise image at the target noise adding stage according to the first parameter and the second parameter includes:

fusing the first mean and the second mean to obtain a corresponding target mean;

generating the noiseless image corresponding to the noise image at the target noise adding stage according to the target mean and the first variance.

4. The image generation method according to claim 3, wherein the fusing the first mean and the second mean to obtain a corresponding target mean comprises:

determining a difference value between the first mean value and the second mean value, and determining a product between the difference value and a corresponding weight;

determining the sum of the second mean and the product as the target mean.

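5. The image generation method according to any one of claims 2 to 4, wherein the determining a noise image corresponding to a previous noise adding stage of the target noise adding stage includes:

generating the noise image corresponding to the previous noise adding stage according to the noise image, the noiseless image corresponding to the noise image at the target noise adding stage, and the first variance in the first parameter.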

6. The image generation method according to claim 5, wherein the generating the noise image corresponding to the previous noise adding stage according to the noise image, the noiseless image corresponding to the noise image at the target noise adding stage, and the first variance comprises:

determining a third mean according to the noise image, the noiseless image corresponding to the noise image at the target noise adding stage, and the first variance;

determining a second variance based on the first variance;

generating the noise image corresponding to the previous noise adding stage according to the third mean and the second variance.

7. The image generation method according to any one of claims 2 to 4, wherein the image denoising model is obtained by training an initial image denoising model in an initial diffusion model based on a plurality of noise image samples and the texts and noise adding stages corresponding to the noise image samples.

8. An image generation apparatus, comprising:

an acquisition unit, configured to acquire a noise image to be processed, a target text corresponding to the noise image, a target noise adding stage, and a random non-empty text;

a generating unit, configured to denoise the noise image based on the noise image, the target text, the target noise adding stage, and the random non-empty text to generate a target noiseless image; wherein the matching degree between the image content of the target noiseless image and the text content of the target text is greater than a first threshold.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the image generation method according to any of claims 1 to 7 when executing the program.

10. A non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the image generation method of any one of claims 1 to 7.