CN117710504A - Image generation method, training method, device and equipment of image generation model - Google Patents


Info

Publication number
CN117710504A
CN117710504A
Authority
CN
China
Prior art keywords
features
image
noise
sample
denoising
Prior art date
Legal status
Pending
Application number
CN202311733691.1A
Other languages
Chinese (zh)
Inventor
李建伟
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311733691.1A priority Critical patent/CN117710504A/en
Publication of CN117710504A publication Critical patent/CN117710504A/en

Landscapes

  • Image Processing (AREA)

Abstract

The disclosure provides an image generation method and a training method, apparatus, and device for an image generation model, and relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, deep learning, large models, and the like. The image generation method comprises the following steps: acquiring an initial noise image; acquiring a plurality of inference step information corresponding to a plurality of preset denoising inference steps; respectively fusing the initial noise image with the plurality of inference step information to obtain a plurality of first features; processing the plurality of first features in parallel using a self-attention mechanism to obtain a plurality of second features corresponding to the plurality of first features, wherein the plurality of second features characterize a plurality of images obtained by iteratively performing the plurality of denoising inference steps on the initial noise image; and obtaining a target image that does not include noise based on the plurality of second features.

Description

Image generation method, training method, device and equipment of image generation model
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the technical fields of computer vision, deep learning, large models, and the like, and more particularly, to an image generation method, an image generation model training method, an image generation apparatus, an image generation model training apparatus, an electronic device, a computer readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline of studying how to make a computer mimic certain thought processes and intelligent behaviors of a person (e.g., learning, reasoning, thinking, planning), covering both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
In the last two years, diffusion models have attracted enormous attention in both academia and industry. A diffusion model is an image generation technique that can generate a sharp image by continually iterating on a noisy image.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides an image generation method, an image generation model training method, an image generation apparatus, an image generation model training apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided an image generating method including: acquiring an initial noise image; acquiring a plurality of inference step information corresponding to a plurality of preset denoising inference steps; respectively fusing the initial noise image and the information of the plurality of reasoning steps to obtain a plurality of first features; the method comprises the steps of carrying out parallel processing on a plurality of first features by using a self-attention mechanism to obtain a plurality of second features corresponding to the plurality of first features, wherein the plurality of second features characterize a plurality of images obtained by iteratively executing a plurality of denoising reasoning steps on an initial noise image; and obtaining a target image that does not include noise based on the plurality of second features.
According to another aspect of the present disclosure, there is provided a training method of an image generation model, including: acquiring a sample initial noise image, a plurality of sample intermediate images and a sample target image, wherein the sample target image does not contain noise, and the plurality of sample intermediate images represent sample target images containing noise with different degrees; acquiring a plurality of inference step information corresponding to a plurality of preset denoising inference steps; respectively fusing the sample initial noise image and the information of the plurality of reasoning steps to obtain a plurality of third features; parallel processing is carried out on the plurality of third features by using a deep learning model based on a self-attention mechanism, a plurality of fourth features corresponding to the plurality of third features are obtained, wherein the plurality of fourth features characterize a plurality of images obtained by iteratively executing a plurality of denoising reasoning steps on the initial noise image of the sample; and training the deep learning model based on the plurality of fourth features, the plurality of sample intermediate images and the sample target image to obtain an image generation model.
According to another aspect of the present disclosure, there is provided an image generating apparatus including: a first image acquisition unit configured to acquire an initial noise image; a first inference step information acquisition unit configured to acquire a plurality of inference step information corresponding to a plurality of preset denoising inference steps; the first fusion unit is configured to fuse the initial noise image and the information of the plurality of reasoning steps respectively to obtain a plurality of first features; the first parallel processing unit is configured to process the first features in parallel by using a self-attention mechanism to obtain a plurality of second features corresponding to the first features, wherein the second features characterize a plurality of images obtained by performing a plurality of denoising reasoning steps on the initial noise image in an iterative manner; and a generating unit configured to obtain a target image containing no noise based on the plurality of second features.
According to another aspect of the present disclosure, there is provided a training apparatus of an image generation model, including: a second image acquisition unit configured to acquire a sample initial noise image, a plurality of sample intermediate images, and a sample target image, wherein the plurality of sample intermediate images characterize the sample target image containing noise of different degrees; a second inference step information acquisition unit configured to acquire a plurality of inference step information corresponding to a plurality of preset denoising inference steps, the plurality of denoising inference rounds corresponding to a plurality of noise addition rounds; the second fusion unit is configured to fuse the sample initial noise image and the information of the plurality of reasoning steps respectively to obtain a plurality of third characteristics; the second parallel processing unit is configured to perform parallel processing on the plurality of third features by using a deep learning model based on a self-attention mechanism to obtain a plurality of fourth features corresponding to the plurality of third features, wherein the plurality of fourth features characterize a plurality of images obtained by performing a plurality of denoising reasoning steps on the sample initial noise image iteration; and a training unit configured to train the deep learning model based on the plurality of fourth features, the plurality of sample intermediate images, and the sample target image, resulting in an image generation model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described method.
According to one or more embodiments of the present disclosure, the present disclosure respectively fuses a plurality of inference step information corresponding to a plurality of preset denoising inference steps with an initial noise image to obtain a plurality of first features, and processes the plurality of first features fused with the initial noise image and different inference step information in parallel by using a self-attention mechanism to obtain a plurality of second features, where the second features can characterize a plurality of images obtained by iteratively performing a plurality of denoising inference steps on the initial noise image, and finally obtains an image generation result, that is, a target image, based on the second features. By the method, the target image without noise can be obtained without iteratively executing a plurality of denoising reasoning steps, so that the denoising reasoning efficiency of the noise image is improved, and the time consumption of the image generation process is reduced.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flowchart of an image generation method according to an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a flowchart of a process for fusing an initial noise image with a plurality of inference step information, respectively, resulting in a plurality of first features, according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a flowchart of a training method of an image generation model according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of an image generation model according to an exemplary embodiment of the present disclosure;
fig. 6 shows a block diagram of an image generating apparatus according to an exemplary embodiment of the present disclosure;
FIG. 7 shows a block diagram of a training apparatus of an image generation model according to an exemplary embodiment of the present disclosure; and
fig. 8 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In the related art, a clear image can be obtained by iteratively performing denoising inference steps on a noisy image, but such an iterative process is very time-consuming.
In order to solve the above problems, the present disclosure respectively fuses a plurality of inference step information corresponding to a plurality of preset denoising inference steps with an initial noise image to obtain a plurality of first features, and processes the plurality of first features fused with the initial noise image and different inference step information in parallel by using a self-attention mechanism to obtain a plurality of second features, where the second features can characterize a plurality of images obtained by iteratively performing a plurality of denoising inference steps on the initial noise image, and finally obtains an image generation result, that is, a target image, based on the second features. By the method, the target image without noise can be obtained without iteratively executing a plurality of denoising reasoning steps, so that the denoising reasoning efficiency of the noise image is improved, and the time consumption of the image generation process is reduced.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable execution of the methods of the present disclosure.
In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be offered as web-based services or cloud services, for example provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a Software as a Service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 for human-machine interaction. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. Gaming systems may include various handheld gaming devices, Internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system, intended to overcome the drawbacks of difficult management and weak service scalability found in traditional physical host and Virtual Private Server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the data store used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
In some embodiments, a diffusion model may be used to generate the image. The diffusion model may include two processes: forward diffusion processes and reverse reasoning processes.
In the forward diffusion process, a real picture $x_T$ is given, and noise with a preset probability distribution (e.g., a Gaussian distribution) is added to it through a plurality of preset noise-adding steps accumulated $T$ times, sequentially obtaining $x_{T-1}, x_{T-2}, \ldots, x_0$. This process can be expressed as $q(x_t \mid x_{t+1})$. Here, a series of hyper-parameters of the preset probability distribution (e.g., the Gaussian variances $\beta_t$) is given. The forward process can also be considered a Markov process, since each time $t$ is related only to time $t+1$. The specific formula is as follows:

$$q(x_t \mid x_{t+1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t+1},\ \beta_t I\right)$$

where $\alpha_t = 1 - \beta_t$. During forward diffusion, as $t$ decreases, $x_t$ more closely approaches pure noise; when $T$ tends to infinity, $x_0$ is entirely Gaussian noise. In practice, $\beta_t$ decreases with $t$, i.e., $\beta_0 > \beta_1 > \cdots > \beta_T$. Through a series of derivations, $x_t$ can be quickly computed from $x_T$ and $\beta$. The specific formula is as follows:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_T + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$

where $\bar{\alpha}_t = \prod_{s=t}^{T-1} \alpha_s$ and $\epsilon$ is noise sampled from $\mathcal{N}(0, I)$.
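As a concreteness check, the closed-form noising formula above can be sketched in a few lines of NumPy. This is an illustrative sketch only — the function name, the linear β schedule, and the array shapes are assumptions, not details from the disclosure — following the document's convention that $x_T$ is the clean image and smaller $t$ is noisier:

```python
import numpy as np

def forward_noise(x_T, t, betas, rng=None):
    """Jump directly from the clean image x_T to the noisy image x_t.

    Implements x_t = sqrt(alpha_bar_t) * x_T + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t is the product of alpha_s = 1 - beta_s over the
    steps s = t .. T-1 (document convention: noise accumulates as t
    decreases toward 0).
    """
    rng = np.random.default_rng() if rng is None else rng
    alphas = 1.0 - betas                  # alpha_s = 1 - beta_s
    alpha_bar = np.prod(alphas[t:])       # accumulate factors from step t to T-1
    eps = rng.standard_normal(x_T.shape)  # Gaussian noise sample
    return np.sqrt(alpha_bar) * x_T + np.sqrt(1.0 - alpha_bar) * eps

# Example: 10 noising steps over a tiny 4x4 "image"
T = 10
betas = np.linspace(0.02, 1e-4, T)        # beta decreases with t (doc convention)
x_clean = np.ones((4, 4))
x_noisy = forward_noise(x_clean, t=0, betas=betas)
```

With `t = T` the product over an empty range is 1, so the clean image is returned unchanged; with `t = 0` the output is close to pure noise.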
the forward diffusion process may be considered as a noise-adding process, and the reverse reasoning process may be considered as a denoising reasoning process. When the diffusion model is trained, forward diffusion noise adding is needed to be carried out on the original image, then reverse reasoning denoising is carried out by utilizing the diffusion model, and the result output by the model is enabled to be as close as possible to the result obtained in the forward diffusion process. In this process, the image obtained by performing the noise adding step T times on the clear original image can be regarded as a pure noise image.
When the diffusion model is used for generating the image, the diffusion model can be used for carrying out reverse reasoning and reasoning on the initial noise image, so that a clear target image without noise is obtained.
It should be noted that inferring a completely denoised image from the initial noise image is typically accomplished by iteratively performing a plurality of denoising inference steps. If the reversed distribution can be obtained step by step, it is possible to start from a distribution that is entirely the preset probability distribution (e.g., a Gaussian distribution, i.e., the initial noise image) and restore the original image distribution $x_T$ (or a new image meeting additional conditions). However, since the reversed distribution cannot be inferred simply, a deep learning model (with parameters $\theta$, e.g., using a U-Net structure) can be used to predict this reversed distribution $p_\theta$, thereby achieving the purpose of denoising.
In some embodiments, for an image to be inferred with time step number $t$ (i.e., an image obtained by performing $t$ rounds of denoising inference steps on a pure noise image, or an image obtained by adding noise $T-t$ times to an original image without noise), a reverse distribution $p_\theta$ (e.g., a Gaussian distribution) may be predicted:

$$p_\theta(x_{t+1} \mid x_t) = \mathcal{N}\!\left(x_{t+1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$

where $x_t$ is the image to be inferred with time step number $t$ and $\mu_\theta(x_t, t)$ is the mean of the inferred image. The variance $\sigma_t^2$ can be computed based on the preset hyper-parameters of the preset probability distribution. The mean, expressed through the difference between the image to be inferred and the intermediate noise image (the predicted result of the noise added at the corresponding step of the forward diffusion process), can be written as:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, z_\theta(x_t, t)\right)$$

where $z_\theta(x_t, t)$ is the deep learning model's prediction, based on the image to be inferred $x_t$ and the time step number $t$, of the noise added at the corresponding step of the forward diffusion process. By executing the denoising inference step with this formula, the image to be inferred with time step number $t+1$ can be obtained. By repeating the above denoising inference step, a target image containing no noise can be obtained step by step from the initial noise image.
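A single denoising inference step of the kind just described can be sketched as follows. This is a minimal sketch: the model prediction $z_\theta$ is stood in for by an arbitrary callable, and the choice $\sigma_t^2 = \beta_t$ is one common simple option — all names here are illustrative assumptions, not from the disclosure:

```python
import numpy as np

def denoise_step(x_t, t, betas, predict_noise, rng=None):
    """One denoising inference step (document convention: step t -> t+1).

    predict_noise(x_t, t) stands in for the trained model z_theta; any
    callable with that signature works for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    alphas = 1.0 - betas
    alpha_bar_t = np.prod(alphas[t:])      # cumulative product over s = t..T-1
    z = predict_noise(x_t, t)              # predicted forward-process noise
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar_t) * z) / np.sqrt(alphas[t])
    sigma = np.sqrt(betas[t])              # simple variance choice sigma_t^2 = beta_t
    return mean + sigma * rng.standard_normal(x_t.shape)
```

Iterating this function from $t = 0$ up to $T-1$ reproduces the sequential inference loop that the disclosed method aims to replace with a single parallel pass.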
According to an aspect of the present disclosure, fig. 2 shows a flowchart of an image generation method 200 according to an exemplary embodiment of the present disclosure. The image generation method 200 includes: step S201, acquiring an initial noise image; step S202, obtaining a plurality of inference step information corresponding to a plurality of preset denoising inference steps; step S203, respectively fusing the initial noise image and the information of the plurality of reasoning steps to obtain a plurality of first features; step S204, carrying out parallel processing on the plurality of first features by using a self-attention mechanism to obtain a plurality of second features corresponding to the plurality of first features, wherein the plurality of second features characterize a plurality of images obtained by iteratively executing a plurality of denoising reasoning steps on the initial noise image; and step S205, obtaining a target image which does not contain noise based on the plurality of second features.
In this way, the target image containing no noise can be obtained without iteratively executing the plurality of denoising inference steps, which improves the efficiency of denoising inference on the noise image and reduces the time consumed by the image generation process.
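For illustration, the parallel idea of steps S203 to S205 can be sketched with a toy single-head self-attention in NumPy. The identity projections, the concatenation-based fusion, and the readout of the target image from the last step's feature are all simplifying assumptions for the sketch, not details from the disclosure:

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention over the step axis.

    X has shape (T, d): one "first feature" per denoising inference step.
    Q/K/V projection matrices are omitted (identity) purely for
    illustration; a real model would learn them.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)      # softmax: each row sums to 1
    return w @ X                            # "second features", one per step

def generate(initial_noise, step_infos):
    """One parallel pass (steps S203-S205) instead of T sequential iterations."""
    flat = initial_noise.ravel()
    # S203: fuse the flattened noise image with each step's info vector
    first = np.stack([np.concatenate([flat, np.asarray(info, float)])
                      for info in step_infos])
    second = self_attention(first)          # S204: parallel over all steps
    n = initial_noise.size
    # S205: toy readout of the target image from the final step's feature
    return second[-1, :n].reshape(initial_noise.shape)
```

The point of the sketch is structural: all T step features pass through attention in one call, rather than T dependent iterations.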
In some embodiments, the initial noise image acquired in step S201 may also be obtained by performing a plurality of noise-adding steps on a specific original image that contains no noise. As described above, the closed-form formula provided earlier may be utilized to quickly obtain the image after a specified number of noise-adding steps, based on the original image and the noise coefficients (e.g., the hyper-parameters of the preset probability distribution of the noise) desired to be added at each noise-adding step. In this way, the finally generated target image can have a certain correlation with the original image, for example including the same object, content, or theme.
According to some embodiments, the initial noise image may be, for example, a pure noise image obtained by random sampling based on a second preset probability distribution. The second preset probability distribution may be, for example, a gaussian distribution. In this way, a completely new image can be generated.
In some embodiments, the number of the plurality of denoising inference steps, and the inference step information corresponding to each denoising inference step, may be predetermined. The number of denoising inference steps may be, for example, 50 (cf. Denoising Diffusion Implicit Models, DDIM) to 1000 (cf. Denoising Diffusion Probabilistic Models, DDPM). In the present disclosure, T will be used to denote the number of the preset plurality of denoising inference steps.
In some embodiments, the plurality of inference step information acquired at step S202 may include information about the number of time steps of the denoising inference step corresponding to each of the inference step information. The number of time steps may refer to the number of inferred steps of the corresponding denoising inference step. For example, the number of inferred steps for the first denoising inference step for the initial noise image may be 1, the number of inferred steps for the denoising inference step for the image obtained after the first denoising inference step may be 2, and so on, until the number of inferred steps for the last denoising inference step may be T.
According to some embodiments, the inference step information includes a time step feature indicating the inferred number of steps of the denoising inference step corresponding to that inference step information. In some embodiments, the time step feature may be derived, for example, by embedding the time step number, which may be performed using a trained time step embedding network.
Therefore, by using the time step feature, the time step number of the denoising reasoning step is fused into the first feature obtained subsequently, which can improve the effect of the denoising reasoning process performed on the first features using the self-attention mechanism, and finally yield a higher-quality target image that does not contain noise.
According to some embodiments, the inferencing step information may include a noise figure. The noise figure may comprise a hyper-parameter of a first preset probability distribution, which may characterize the noise desired to be removed in the de-noising reasoning step corresponding to the reasoning step information.
Therefore, by using the noise coefficient, information about the noise desired to be removed in the denoising reasoning step is fused into the first feature obtained subsequently, which can improve the effect of the denoising reasoning process performed on the first features using the self-attention mechanism, and finally yield a higher-quality target image that does not contain noise.
In some embodiments, the noise figure for the denoising reasoning step of time step number t may include, for example, the β_t described above, i.e., the hyper-parameter of the first preset probability distribution indicating the variance of the Gaussian distribution corresponding to the time step number t. The noise figures included in the plurality of inference step information decrease with the time step number t. In some embodiments, the noise figure may include, for example, a noise feature obtained by embedding the hyper-parameter of the first preset probability distribution, which embedding may be performed using a trained noise parameter embedding network.
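As a concrete illustration, a DDPM-style linear schedule over forward (noise-adding) time can be indexed in reverse so that the noise figure decreases with the inference step number t, as described above (a sketch under that indexing assumption; the variable names are illustrative):

```python
import numpy as np

T = 1000  # number of denoising inference steps (DDPM-style)
# Linear beta schedule over forward (noise-adding) time, as in DDPM.
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention product
# Denoising inference step t removes the noise added at forward time T - t + 1,
# so reading the schedule in reverse yields noise figures that decrease with t.
noise_coeffs = betas[::-1]
```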
In some embodiments, the inference step information may also include other information related to the corresponding denoising inference step, not limited herein.
In some embodiments, in step S203, the initial noise image may be fused with each of the plurality of inference step information, respectively, to obtain a plurality of first features. The first features can be in one-to-one correspondence with a plurality of preset denoising reasoning steps, and each first feature can represent images required to be inferred by the corresponding denoising reasoning step.
In some embodiments, the initial noise image may be fused with the inference step information using addition, multiplication, weighted summation, processing using a small neural network, or the like, or any combination thereof, to obtain the corresponding first feature. In some embodiments, the initial noise image may be converted to an image token or image feature prior to fusion.
Fig. 3 illustrates a flow chart of a process 300 for fusing an initial noise image with a plurality of inference step information, respectively, resulting in a plurality of first features, according to an exemplary embodiment of the present disclosure. Step S203 described in fig. 2 may be implemented using the process 300 shown in fig. 3.
In step S301, the initial noise image is multiplied by the noise coefficient included in each of the plurality of inference step information, respectively, to obtain a plurality of intermediate features corresponding to the plurality of denoising inference steps. Since the noise coefficient includes a hyper-parameter (for example, the Gaussian distribution variance) characterizing the preset probability distribution of the noise desired to be removed in the corresponding denoising reasoning step, multiplication better fuses the information of the noise desired to be removed with the noise in the initial noise image, improving the effect of the self-attention-based denoising reasoning process on the first features and finally yielding a higher-quality target image that does not contain noise.
In step S302, the plurality of intermediate features are added to the time step features included in the plurality of inference step information, respectively, to obtain the plurality of first features. Since the time step number describes the inferred step count of the corresponding denoising inference step, and the inferred step counts of different denoising inference steps are related (an incremental relation), extracting time step features (in particular, by embedding) and fusing them with the initial noise image by addition allows the fused first features to better reflect the relation between the denoising inference steps. Reflecting this relation assists the subsequent parallel processing of the plurality of first features using the self-attention mechanism, yielding more effective second features and finally a higher-quality, noise-free target image.
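Process 300 can be sketched as follows (a minimal numpy sketch; the flattened image feature, the shapes, and the function name are illustrative assumptions):

```python
import numpy as np

def fuse(image_feat, noise_coeffs, time_embs):
    """Build the T first features: multiply the (flattened) initial noise
    image feature by each step's noise coefficient (step S301), then add
    that step's time step feature (step S302)."""
    # image_feat: (D,), noise_coeffs: (T,), time_embs: (T, D)
    intermediate = noise_coeffs[:, None] * image_feat[None, :]  # step S301
    return intermediate + time_embs                             # step S302

rng = np.random.default_rng(0)
T, D = 50, 64
first_feats = fuse(rng.standard_normal(D),
                   rng.uniform(0, 1, T),
                   rng.standard_normal((T, D)))
```

Each of the T rows of `first_feats` corresponds to one denoising reasoning step and becomes one element of the input sequence for the self-attention processing in step S204.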
In some embodiments, at step S204, the plurality of first features may be processed in parallel using a self-attention mechanism. When a sequence (e.g., a sequence of a plurality of first features) is processed using a self-attention mechanism, each feature in the sequence may be associated with other features in the sequence, rather than just features that depend on neighboring locations. It adaptively captures long-range dependencies between features by computing the relative importance between the features.
Specifically, for each feature in the sequence, the self-attention mechanism calculates its similarities with the other features and normalizes these similarities into attention weights. The output of the self-attention mechanism, i.e., the processed feature corresponding to each feature, can then be obtained by a weighted sum of the features with the corresponding attention weights. The processed features fuse information from other features in the sequence and therefore have stronger predictive capability.
The self-attention mechanism calculation process for each feature in the sequence can be performed in parallel, so that the processed feature corresponding to each feature in the sequence can be quickly obtained.
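The similarity/softmax/weighted-sum computation described above can be written out as a single-head sketch (projection matrices and sizes are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (T, D):
    pairwise similarities are computed between queries and keys,
    normalized to attention weights by softmax, and used to form a
    weighted sum of the values -- all rows processed in parallel."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # attention weights
    return weights @ V                               # weighted sum

rng = np.random.default_rng(0)
T, D = 8, 16
X = rng.standard_normal((T, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)  # one processed feature per input feature
```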
In some embodiments, the plurality of second features obtained at step S204 may characterize a plurality of images obtained by iteratively performing a plurality of denoising reasoning steps on the initial noise image. From another perspective, the plurality of second features may characterize a plurality of images resulting from iteratively performing a plurality of noise addition steps on the resulting target image. Each second feature may correspond to a denoising inference step, a noise addition step, or a number of time steps. The second feature may be transformed, decoded, or otherwise processed to obtain a corresponding image. It will be appreciated that the last of the plurality of second features may characterize a target image that does not contain noise, while other ones of the plurality of second features may characterize images that fuse the target image with varying degrees of noise.
Therefore, by utilizing the self-attention mechanism to process the plurality of first features in parallel, a plurality of second features of a plurality of images obtained by iteratively executing a plurality of denoising reasoning steps on the initial noise image can be obtained without iteratively executing the plurality of denoising reasoning steps on the initial noise image, and further a target image without noise is obtained, so that the denoising reasoning efficiency of the noise image is improved, and the time consumption of the image generation process is reduced.
In some embodiments, at step S204, the plurality of first features may be processed in parallel using a deep learning model based on a self-attention mechanism. The model may include one or more stacked Transformer encoders to process the plurality of first features in parallel multiple times and obtain more predictive second features. This model is also referred to in this disclosure as a Transformer model, and an exemplary training manner will be described below.
According to some embodiments, step S204, performing parallel processing on the plurality of first features using the self-attention mechanism, and obtaining a plurality of second features corresponding to the plurality of first features may include: for one of the first features, a second feature corresponding to the first feature is derived using a self-attention mechanism based on the first feature and other first features preceding the first feature.
When the diffusion model iteratively executes a plurality of denoising reasoning steps, each step's denoising reasoning is performed on the image obtained by the previous denoising reasoning step. Therefore, when each of the plurality of first features is processed, obtaining the second feature corresponding to that first feature based on the first feature and the other first features preceding it, using the self-attention mechanism, better reflects the logical relation among the plurality of denoising reasoning steps and yields a plurality of more accurate second features.
In some embodiments, at step S204, for one of the plurality of first features, a self-attention mechanism may be utilized to calculate similarities between the first feature and other first features preceding the first feature, and normalize the similarities to an attention weight. The output of the self-attention mechanism, i.e. the second feature corresponding to the first feature, can then be obtained by weighted summing the first feature and the other first features preceding the first feature with the corresponding attention weights.
In some embodiments, the attention weights in the Transformer model may be multiplied by a time mask matrix of size T×T. For each first feature, the positions of the lower triangular matrix up to the time step number corresponding to that first feature have a value of 1, and all other positions have a value of 0, so that each first feature only attends to itself and the first features preceding it.
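Such a lower-triangular time mask can be constructed directly (a sketch; applying it additively as −∞ to the scores before the softmax is the other common Transformer implementation):

```python
import numpy as np

T = 6  # number of denoising inference steps (small, for illustration)
# Lower-triangular T x T time mask: position (i, j) is 1 iff step j is not
# after step i, so each first feature attends only to itself and earlier steps.
mask = np.tril(np.ones((T, T)))
# Applied multiplicatively to the attention weights (with renormalization),
# or additively as -inf to the pre-softmax scores.
```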
In some embodiments, the second feature may be transformed, decoded, or otherwise processed to obtain a corresponding image, as described above. The last second feature of the plurality of second features may characterize the target image that does not contain noise, and thus in step S205, the last second feature of the plurality of second features may be processed accordingly to obtain the target image.
In some embodiments, at step S205, the plurality of second features may be further processed using a self-attention mechanism or otherwise to obtain further enhanced features, and a target image that does not include noise may be obtained based on the features.
In accordance with another aspect of the present disclosure, FIG. 4 shows a flowchart of a training method 400 of an image generation model in accordance with an exemplary embodiment of the present disclosure. The training method 400 of the image generation model includes: step S401, acquiring a sample initial noise image, a plurality of sample intermediate images and a sample target image, wherein the sample target image does not contain noise, and the plurality of sample intermediate images represent sample target images containing noise with different degrees; step S402, obtaining a plurality of inference step information corresponding to a plurality of preset denoising inference steps; step S403, respectively fusing the sample initial noise image and the information of the plurality of reasoning steps to obtain a plurality of third features; step S404, performing parallel processing on the plurality of third features by using a deep learning model based on a self-attention mechanism to obtain a plurality of fourth features corresponding to the plurality of third features, wherein the plurality of fourth features characterize a plurality of images obtained by performing a plurality of denoising reasoning steps on the sample initial noise image in an iterative manner; and step S405, training the deep learning model based on the plurality of fourth features, the plurality of sample intermediate images and the sample target image to obtain an image generation model.
Therefore, a sample initial noise image, a noise-free sample target image, and a plurality of sample intermediate images containing different degrees of noise are obtained, and the self-attention mechanism is used to process in parallel the plurality of third features that fuse the sample initial noise image with different inference step information, yielding a plurality of fourth features that characterize a plurality of images obtained by iteratively performing a plurality of denoising reasoning steps on the sample initial noise image. Finally, the plurality of sample intermediate images and the sample target image are used to supervise the plurality of fourth features when training the deep learning model, so that the trained deep learning model can obtain a noise-free target image without iteratively performing a plurality of denoising reasoning steps, improving denoising reasoning efficiency for noise images and reducing the time consumed by the image generation process.
In some embodiments, in step S401, a sample target image containing no noise and a plurality of images containing different degrees of noise on the basis of the sample target image, that is, a plurality of sample intermediate images, may be acquired. The sample initial noise image may be a pure noise image or an image containing the most noise on the basis of the sample target image (may be regarded as a pure noise image).
It may be appreciated that the operations of step S402 to step S404 in the method 400 may refer to step S202 to step S204 in the method 200, and the denoising inference step, the inference step information, the third feature and the fourth feature referred to in the method 400 may refer to the denoising inference step, the inference step information, the first feature and the second feature in the method 200, respectively, which are not described herein.
In some embodiments, in step S405, the plurality of sample intermediate images and the sample target image may be used as a supervisory image sequence. A loss value is then obtained based on the plurality of fourth features and the supervisory image sequence, the loss value being positively correlated with the degree of difference between the plurality of fourth features and the supervisory image sequence. Finally, the parameters of the deep learning model are adjusted based on the loss value to obtain the image generation model.
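The patent only requires the loss to be positively correlated with the difference; mean squared error over the sequence is one natural instantiation (a sketch; feature shapes and names are hypothetical):

```python
import numpy as np

def sequence_mse(fourth_feats, supervision_feats):
    """Loss positively correlated with the difference between the model's
    fourth features and features of the supervisory image sequence (the
    sample intermediate images followed by the sample target image)."""
    return np.mean((fourth_feats - supervision_feats) ** 2)

rng = np.random.default_rng(0)
pred = rng.standard_normal((50, 64))      # 50 fourth features of dim 64
loss_far = sequence_mse(pred, pred + 1.0)  # large difference -> large loss
loss_near = sequence_mse(pred, pred + 0.1)  # small difference -> small loss
```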
In some embodiments, the sample target image may be a predetermined clear image that does not include noise, and noise may be added to the sample target image based on a plurality of inference step information corresponding to a preset plurality of denoising inference steps to obtain a plurality of sample intermediate images and a sample initial noise image.
According to some embodiments, the inference step information may include a temporal step feature, which may indicate an inferred number of steps of the denoising inference step corresponding to the inference step information. The meaning and implementation of the time step feature may be referred to the corresponding description above. By using the time step number feature, the time step number of the denoising reasoning step is fused in the third feature obtained later, so that the denoising reasoning capability of the image generation model obtained after training by using a self-attention mechanism can be improved, and finally, a target image with higher quality and without noise is obtained.
In some embodiments, the time step feature may be obtained by embedding the time steps using a time step embedding network that includes a learnable parameter. In step S405, the time step number embedding network may be trained based on the plurality of fourth features, the plurality of sample intermediate images, and the sample target image to obtain a trained time step number embedding network.
According to some embodiments, the inference step information may include a noise figure, which may include a hyper-parameter of a fourth preset probability distribution, which may characterize noise desired to be removed in the denoising inference step corresponding to the inference step information. The meaning and implementation of the noise figure may be referred to the respective description above and the fourth preset probability distribution may be identical to the first preset probability distribution above. By using the noise coefficient, the relevant information of the noise expected to be removed in the denoising reasoning step is fused in the third characteristic obtained later, so that the denoising reasoning capacity of the image generation model obtained after training by using a self-attention mechanism can be improved, and a target image with higher quality and without noise is finally obtained.
In some embodiments, the noise characteristic may be obtained by embedding a super-parameter of the fourth predetermined probability distribution using a noise parameter embedding network comprising a learnable parameter. In step S405, the noise parameter embedding network may be trained based on the plurality of fourth features, the plurality of sample intermediate images, and the sample target image to obtain a trained noise parameter embedding network.
According to some embodiments, the plurality of sample intermediate images and the sample initial noise image may represent a plurality of images sequentially obtained by performing a plurality of noise adding steps corresponding to a preset plurality of denoising reasoning steps on the sample target image. The noise added in one of the noise adding steps of the plurality of noise adding steps may have the same probability distribution as the noise desired to be removed in the denoising reasoning step corresponding to the noise adding step.
Since the plurality of fourth features outputted by the deep learning model characterize a plurality of images obtained by iteratively performing a plurality of denoising reasoning steps on the sample initial noise image, and the denoising reasoning steps are reverse operations of the noise adding steps, using the plurality of sample intermediate images obtained in the manner as described above and the sample target image containing no noise as supervisory signals of the plurality of fourth features can enable the trained image generation model to generate a plurality of image features that more accurately characterize an image sequence obtained by iteratively performing the plurality of denoising reasoning steps on the initial noise image.
In some embodiments, the formula above may be used to directly calculate the plurality of sample intermediate images and the sample initial noise image, where the time step number corresponding to the sample initial noise image is 0.
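The referenced formula did not survive extraction here; assuming it is the standard DDPM closed form x_t = √(ᾱ_t)·x_0 + √(1 − ᾱ_t)·ε, the direct calculation looks like this (a sketch under that assumption; names and shapes are illustrative):

```python
import numpy as np

def q_sample(x0, t, alphas_bar, rng):
    """Directly compute the image after t noise-adding steps from the clean
    sample target image x0, assuming the standard DDPM closed form
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alphas_bar[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

betas = np.linspace(1e-4, 0.02, 1000)
alphas_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))        # stand-in for a clean target image
x500 = q_sample(x0, 500, alphas_bar, rng)  # halfway-noised sample intermediate
```

Because each sample intermediate image is a single closed-form expression in x0, the whole supervisory sequence can be produced without running the forward chain step by step.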
In some embodiments, the deep learning model may be trained using a plurality of sample sequences, wherein each sample sequence includes one sample initial noise image, a plurality of sample intermediate images, and a sample target image. In some embodiments, the number of intermediate images of the plurality of samples included in the different sample sequences is the same, and the length of the input sequence of the deep learning model (the number of the plurality of first features) may be the same as the number of all images included in the sample sequences, thereby enabling a more stable and predictive image generation model.
According to some embodiments, step S403, respectively fusing the sample initial noise image with the plurality of inference step information, and obtaining the plurality of third features may include: multiplying the sample initial noise image by noise coefficients respectively included in the plurality of reasoning step information to obtain a plurality of sample intermediate features corresponding to the plurality of denoising reasoning steps; and respectively adding the plurality of sample intermediate features and the time step number features respectively included in the plurality of reasoning step information to obtain a plurality of third features. It is to be understood that the operation and effect of step S403 may refer to the descriptions of step S203 and process 300 above, and will not be described herein.
According to some embodiments, step S404, performing parallel processing on the plurality of third features using a deep learning model based on a self-attention mechanism, and obtaining a plurality of fourth features corresponding to the plurality of third features may include: for one of the plurality of third features, a fourth feature corresponding to the third feature is derived using a self-attention mechanism based on the third feature and other third features preceding the third feature.
It is to be understood that the operation and effect of step S404 may refer to the description of step S204 above, and will not be described herein.
Returning to step S401. In other embodiments, the sample initial noise image may be a pure noise image obtained by random sampling according to a third preset probability distribution. The sample initial noise image may be processed using the preset plurality of denoising inference steps to obtain the sample target image and the plurality of sample intermediate images. The third preset probability distribution of the sample initial noise image may be consistent with the second preset probability distribution described above.
According to some embodiments, the plurality of sample intermediate images and the sample target image may be sequentially obtained by iteratively performing a plurality of denoising reasoning steps on the sample initial noise image using a trained diffusion model. When the trained diffusion model performs denoising reasoning, starting from a pure noise image, each iteration step generates a noise-containing image (i.e., a sample intermediate image), and these images become clearer as the iteration proceeds. The noisy images can be stored or cached during iteration, i.e., a plurality of sample intermediate images and a noise-free sample target image can be obtained for supervising the deep learning model.
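The collection of supervisory images from a trained diffusion model can be sketched as below, where `denoise_step` is a stand-in for the trained model's one-step denoising update (a toy shrinking function is used here purely so the sketch runs; names are hypothetical):

```python
import numpy as np

def collect_supervision(x_T, num_steps, denoise_step):
    """Iteratively run a trained diffusion model's one-step denoising on a
    pure noise image x_T, caching every intermediate. Returns the
    num_steps - 1 sample intermediate images and the final noise-free
    sample target image."""
    images, x = [], x_T
    for t in range(num_steps, 0, -1):  # time steps T, T-1, ..., 1
        x = denoise_step(x, t)
        images.append(x)
    return images[:-1], images[-1]     # (intermediates, target)

# Toy stand-in for the trained model: shrink the noise a little each step.
toy_step = lambda x, t: 0.9 * x
rng = np.random.default_rng(0)
intermediates, target = collect_supervision(rng.standard_normal((8, 8)),
                                            num_steps=50, denoise_step=toy_step)
```

With a DDIM-style schedule of 50 steps this yields 49 sample intermediate images plus 1 sample target image, matching the one-to-one supervision described below.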
In an exemplary embodiment, the denoising diffusion probability model DDPM has 1000 iterations, that is, 1000 preset denoising reasoning steps, so that 999 sample intermediate images and 1 sample target image can be obtained by iterating the sample initial noise image by using the DDPM, and 1000 third features can be generated. When the deep learning model is utilized for denoising, the fourth feature corresponding to each third feature can correspond to one noisy image (sample intermediate image or sample target image) generated by iteration of the DDPM, so that one-to-one supervision training is realized.
In another exemplary embodiment, the denoising diffusion implicit model DDIM has 50 iterations, that is, 50 preset denoising reasoning steps, and then 49 sample intermediate images and 1 sample target image can be obtained by iterating the sample initial noise image with DDIM, and 50 third features can be generated. When the deep learning model is utilized for denoising, the fourth feature corresponding to each third feature can correspond to one noisy image (sample intermediate image or sample target image) generated by iteration of the DDIM, so that one-to-one supervision training is realized.
By using the plurality of sample intermediate images and the sample target image generated by the trained diffusion model to supervise the plurality of fourth features generated by the deep learning model, the deep learning model can absorb the knowledge learned by the trained diffusion model during training, thereby accelerating training and improving the image generation effect of the image generation model obtained after training.
Fig. 5 shows a schematic diagram of an image generation model 500 according to an exemplary embodiment of the present disclosure. The image generation model 500 may be, for example, a deep learning model used in the image generation method 200, or a deep learning model trained by the image generation model training method 400 or a training image generation model.
As shown in fig. 5, an initial noise image 502 may be multiplied by a noise figure sequence 504 to obtain a plurality of intermediate features, which are in turn added to a time step feature sequence 506 to obtain the input of the self-attention-mechanism-based deep learning model 508, i.e., the plurality of first features. The noise figure sequence 504 may be the noise figures included in each of the above-mentioned preset plurality of inference step information, and the time step feature sequence 506 may be the time step features included in each of the above-mentioned preset plurality of inference step information. The deep learning model 508 processes the plurality of first features in parallel to obtain a plurality of second features that characterize an image sequence 510 obtained by iteratively performing a plurality of denoising reasoning steps on the initial noise image. In the prediction stage, a target image that does not contain noise can be obtained based on the plurality of second features; in the training stage, the plurality of second features may be supervised using a noise-free sample target image and a plurality of sample intermediate images containing different degrees of noise.
According to another aspect of the present disclosure, an image generating apparatus is provided. As shown in fig. 6, the apparatus 600 includes: a first image acquisition unit 610 configured to acquire an initial noise image; a first inference step information acquiring unit 620 configured to acquire a plurality of inference step information corresponding to a plurality of preset denoising inference steps; a first fusing unit 630 configured to fuse the initial noise image and the plurality of inference step information, respectively, to obtain a plurality of first features; a first parallel processing unit 640 configured to perform parallel processing on the plurality of first features by using a self-attention mechanism, so as to obtain a plurality of second features corresponding to the plurality of first features, where the plurality of second features characterize a plurality of images obtained by iteratively performing a plurality of denoising reasoning steps on the initial noise image; and a generating unit 650 configured to obtain a target image containing no noise based on the plurality of second features.
It is understood that the operations of the units 610-650 in the apparatus 600 are similar to the operations of the steps S201-S205 in fig. 2, respectively, and are not described herein.
According to another aspect of the present disclosure, a training apparatus for generating a model of an image is provided. As shown in fig. 7, the apparatus 700 includes: a second image acquisition unit 710 configured to acquire a sample initial noise image, a plurality of sample intermediate images, and a sample target image, wherein the sample target image does not contain noise, the plurality of sample intermediate images characterizing the sample target image containing noise of different degrees; a second inference step information obtaining unit 720 configured to obtain a plurality of inference step information corresponding to a plurality of preset denoising inference steps; a second fusing unit 730 configured to fuse the sample initial noise image and the plurality of inference step information, respectively, to obtain a plurality of third features; a second parallel processing unit 740 configured to perform parallel processing on the plurality of third features by using a deep learning model based on a self-attention mechanism, so as to obtain a plurality of fourth features corresponding to the plurality of third features, wherein the plurality of fourth features characterize a plurality of images obtained by iteratively performing a plurality of denoising reasoning steps on the sample initial noise image; and a training unit 750 configured to train the deep learning model based on the plurality of fourth features, the plurality of sample intermediate images, and the sample target image, resulting in an image generation model.
It is understood that the operations of the units 710 to 750 in the apparatus 700 are similar to those of the steps S401 to S405 in fig. 4, respectively, and are not described herein.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the user's personal information involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.
Referring to fig. 8, a block diagram of an electronic device 800, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information to the device 800; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 807 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 808 may include, but is not limited to, magnetic disks and optical disks. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning network algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as an image generation method and/or a training method of an image generation model. For example, in some embodiments, the image generation method and/or the training method of the image generation model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the image generation method and/or the training method of the image generation model described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the image generation method and/or the training method of the image generation model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved, which is not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatuses are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples, but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims (19)

1. An image generation method, comprising:
acquiring an initial noise image;
acquiring a plurality of inference step information corresponding to a plurality of preset denoising inference steps;
fusing the initial noise image with each of the plurality of inference step information respectively to obtain a plurality of first features;
processing the plurality of first features in parallel by using a self-attention mechanism to obtain a plurality of second features corresponding to the plurality of first features, wherein the plurality of second features characterize a plurality of images obtained by iteratively performing the plurality of denoising inference steps on the initial noise image; and
obtaining a target image that does not contain noise based on the plurality of second features.
2. The method of claim 1, wherein the inference step information comprises a time step number feature indicating the inference step number of the denoising inference step corresponding to the inference step information.
3. The method of claim 2, wherein the inference step information further comprises a noise coefficient, the noise coefficient being a hyper-parameter of a first preset probability distribution that characterizes the noise desired to be removed in the denoising inference step corresponding to the inference step information.
4. The method of claim 3, wherein fusing the initial noise image with the plurality of inference step information respectively to obtain the plurality of first features comprises:
multiplying the initial noise image by the noise coefficients respectively included in the plurality of inference step information to obtain a plurality of intermediate features corresponding to the plurality of denoising inference steps; and
adding the plurality of intermediate features and the time step number features respectively included in the plurality of inference step information to obtain the plurality of first features.
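The fusion recited in claim 4 — multiply the initial noise image by each step's noise coefficient, then add that step's time step number feature — can be sketched in numpy as follows (function and variable names are illustrative assumptions, not from the patent):

```python
import numpy as np

def fuse_features(noise_image, noise_coeffs, step_embeddings):
    """Fuse an initial noise image with per-step inference information.

    For denoising inference step i, the first feature is
        noise_coeffs[i] * noise_image + step_embeddings[i],
    i.e. multiply by the step's noise coefficient, then add the step's
    time step number feature, as recited in claim 4.
    """
    first_features = [coeff * noise_image + emb
                      for coeff, emb in zip(noise_coeffs, step_embeddings)]
    return np.stack(first_features)  # shape: (num_steps, *image_shape)

# Toy example: a 2x2 "image" and three denoising inference steps.
feats = fuse_features(np.ones((2, 2)),
                      noise_coeffs=[0.9, 0.5, 0.1],
                      step_embeddings=[np.full((2, 2), float(i)) for i in range(3)])
print(feats.shape)  # (3, 2, 2)
```

All steps are computed in one pass over the same initial noise image, which is what makes the later attention stage parallelizable.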
5. The method of any of claims 1-4, wherein processing the plurality of first features in parallel using a self-attention mechanism to obtain the plurality of second features corresponding to the plurality of first features comprises:
for each of the plurality of first features, obtaining a second feature corresponding to the first feature by using a self-attention mechanism, based on the first feature and the other first features preceding the first feature.
6. The method of any of claims 1-4, wherein the initial noise image is a pure noise image randomly sampled based on a second preset probability distribution.
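The parallel processing of claims 1 and 5 amounts to a causally masked self-attention pass: all first features are processed at once, and the second feature for step i attends only to the first features at steps up to and including i. A minimal sketch, with the learned Q/K/V projections of a real model omitted (identity) for brevity:

```python
import numpy as np

def causal_self_attention(first_features):
    """Compute all second features in one parallel pass.

    A causal mask blocks attention to future steps, so output i depends
    only on first features 0..i, mirroring claim 5. Projections are
    identity here; a trained model would learn Q/K/V weights.
    """
    x = np.asarray(first_features, dtype=float)  # (num_steps, dim)
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                # scaled dot-product scores
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[future] = -np.inf                     # mask out future steps
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x                           # (num_steps, dim) second features
```

Because step 0 can only attend to itself, its output equals its input; later steps mix in all preceding steps, which is what lets one forward pass stand in for an iterative denoising loop.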
7. A training method of an image generation model, comprising:
acquiring a sample initial noise image, a plurality of sample intermediate images and a sample target image, wherein the sample target image does not contain noise, and the plurality of sample intermediate images characterize the sample target image containing noise of different degrees;
acquiring a plurality of inference step information corresponding to a plurality of preset denoising inference steps;
fusing the sample initial noise image with each of the plurality of inference step information respectively to obtain a plurality of third features;
processing the plurality of third features in parallel by using a deep learning model based on a self-attention mechanism to obtain a plurality of fourth features corresponding to the plurality of third features, wherein the plurality of fourth features characterize a plurality of images obtained by iteratively performing the plurality of denoising inference steps on the sample initial noise image; and
training the deep learning model based on the plurality of fourth features, the plurality of sample intermediate images and the sample target image to obtain an image generation model.
8. The method of claim 7, wherein the sample initial noise image is a pure noise image randomly sampled based on a third preset probability distribution.
9. The method of claim 8, wherein the plurality of sample intermediate images and the sample target image are sequentially obtained by iteratively performing the plurality of denoising inference steps on the sample initial noise image using a trained diffusion model.
10. The method of claim 7, wherein the inference step information comprises a time step number feature indicating the inference step number of the denoising inference step corresponding to the inference step information.
11. The method of claim 10, wherein the inference step information further comprises a noise coefficient, the noise coefficient being a hyper-parameter of a fourth preset probability distribution that characterizes the noise desired to be removed in the denoising inference step corresponding to the inference step information.
12. The method of claim 11, wherein the plurality of sample intermediate images and the sample initial noise image characterize a plurality of images obtained by sequentially performing, on the sample target image, a plurality of noise addition steps corresponding to the plurality of denoising inference steps, wherein the noise added in one of the plurality of noise addition steps and the noise desired to be removed in the denoising inference step corresponding to the noise addition step have the same probability distribution.
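The noise addition of claim 12 can be sketched as a forward trajectory in which each step draws Gaussian noise whose distribution matches the noise the corresponding denoising step is expected to remove. The per-step variance schedule `betas` below is a hypothetical hyper-parameter; the claim only requires matching distributions, not this particular schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise_trajectory(target_image, betas):
    """Sequentially add Gaussian noise to a noise-free sample target image.

    One noise addition step per denoising inference step: step i draws
    noise from N(0, betas[i]), so the added noise and the noise to be
    removed in the matching denoising step share a distribution.
    Returns the sample intermediate images, noisiest last.
    """
    images = []
    x = np.asarray(target_image, dtype=float)
    for beta in betas:
        x = x + rng.normal(0.0, np.sqrt(beta), size=x.shape)
        images.append(x.copy())
    return images

traj = add_noise_trajectory(np.zeros((4, 4)), betas=[0.1, 0.1, 0.1])
print(len(traj))  # 3
```

The final element of the trajectory plays the role of the sample initial noise image; the earlier elements are the sample intermediate images that supervise the model's intermediate outputs.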
13. The method of claim 11, wherein fusing the sample initial noise image with the plurality of inference step information respectively to obtain the plurality of third features comprises:
multiplying the sample initial noise image by the noise coefficients respectively included in the plurality of inference step information to obtain a plurality of sample intermediate features corresponding to the plurality of denoising inference steps; and
adding the plurality of sample intermediate features and the time step number features respectively included in the plurality of inference step information to obtain the plurality of third features.
14. The method of any of claims 7-13, wherein processing the plurality of third features in parallel using the deep learning model based on a self-attention mechanism to obtain the plurality of fourth features corresponding to the plurality of third features comprises:
for each of the plurality of third features, obtaining a fourth feature corresponding to the third feature by using a self-attention mechanism, based on the third feature and the other third features preceding the third feature.
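Training per claims 7 and 14 compares the model's parallel outputs (fourth features) against the teacher trajectory. A minimal sketch of the loss, assuming the first outputs pair with the sample intermediate images and the last with the noise-free sample target image (both the pairing and the mean-squared-error objective are assumptions; the patent does not fix a specific loss):

```python
import numpy as np

def training_loss(fourth_features, sample_intermediates, sample_target):
    """MSE between the parallel denoising outputs and the teacher images.

    The fourth features are matched position-wise against the sample
    intermediate images followed by the sample target image, so every
    step of the iterative trajectory supervises one parallel output.
    """
    targets = np.stack(list(sample_intermediates) + [sample_target])
    preds = np.asarray(fourth_features, dtype=float)
    return float(np.mean((preds - targets) ** 2))
```

Gradient descent on this scalar with respect to the deep learning model's parameters (not shown) would then yield the image generation model of claim 7.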
15. An image generating apparatus comprising:
a first image acquisition unit configured to acquire an initial noise image;
a first inference step information acquisition unit configured to acquire a plurality of inference step information corresponding to a plurality of preset denoising inference steps;
a first fusion unit configured to fuse the initial noise image with the plurality of inference step information respectively to obtain a plurality of first features;
a first parallel processing unit configured to process the plurality of first features in parallel by using a self-attention mechanism to obtain a plurality of second features corresponding to the plurality of first features, wherein the plurality of second features characterize a plurality of images obtained by iteratively performing the plurality of denoising inference steps on the initial noise image; and
a generation unit configured to obtain a target image that does not contain noise based on the plurality of second features.
16. A training apparatus for an image generation model, comprising:
a second image acquisition unit configured to acquire a sample initial noise image, a plurality of sample intermediate images, and a sample target image, wherein the sample target image does not contain noise, the plurality of sample intermediate images characterizing the sample target image containing noise to different extents;
a second inference step information acquisition unit configured to acquire a plurality of inference step information corresponding to a plurality of preset denoising inference steps;
a second fusion unit configured to fuse the sample initial noise image with the plurality of inference step information respectively to obtain a plurality of third features;
a second parallel processing unit configured to process the plurality of third features in parallel by using a deep learning model based on a self-attention mechanism to obtain a plurality of fourth features corresponding to the plurality of third features, wherein the plurality of fourth features characterize a plurality of images obtained by iteratively performing the plurality of denoising inference steps on the sample initial noise image; and
a training unit configured to train the deep learning model based on the plurality of fourth features, the plurality of sample intermediate images and the sample target image to obtain an image generation model.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-14.
18. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-14.
19. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-14.
CN202311733691.1A 2023-12-15 2023-12-15 Image generation method, training method, device and equipment of image generation model Pending CN117710504A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311733691.1A CN117710504A (en) 2023-12-15 2023-12-15 Image generation method, training method, device and equipment of image generation model


Publications (1)

Publication Number Publication Date
CN117710504A (en) 2024-03-15

Family

ID=90161915



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination