CN117292007A - Image generation method and device - Google Patents

Image generation method and device

Info

Publication number
CN117292007A
Authority
CN
China
Prior art keywords
model
image
diffusion model
basic
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311274061.2A
Other languages
Chinese (zh)
Inventor
曹佳炯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202311274061.2A
Publication of CN117292007A
Status: Pending

Classifications

    • G06T 11/00: 2D [Two Dimensional] image generation
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/0475: Generative networks
    • G06N 3/094: Adversarial learning
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of this specification disclose an image generation method and device. The image generation method trains a primary model comprising a diffusion model of the image space and a diffusion model of the hidden space, determines the associated primary model supported under a given prompt, constructs a double diffusion model for model training, then trains a path weight optimization model in combination with resource information, and adaptively searches the image generation path to obtain a target image.

Description

Image generation method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of artificial intelligence, and in particular, to an image generating method and apparatus.
Background
With the development of the Internet, image generation methods based on diffusion models have made major breakthroughs in the AIGC field and have pushed AIGC technology from academia into industry.
However, AIGC image generation currently follows two main technical paths: AIGC image generation based on hidden-space diffusion, and AIGC image generation that diffuses directly in the image space. Each has advantages the other lacks: the former cannot control some image details well, while the latter often consumes a great deal of computing resources and time.
Disclosure of Invention
To address these problems in the prior art, the embodiments of this specification provide an image generation method and device.
In a first aspect, embodiments of the present disclosure provide an image generating method, including:
constructing a primary model of image training, wherein the primary model comprises a diffusion model of an image space and a diffusion model of a hidden space;
inputting a prompt into the primary model, acquiring a first loss function in the primary model training process, and training by taking the first loss function as a convergence function to obtain an associated primary model;
performing joint training on the associated primary models to obtain a double diffusion model;
inputting the prompt into the double diffusion model, acquiring a second loss function in the double diffusion model training process, and training by taking the second loss function as a convergence function to obtain an output standard image;
and acquiring resource information, calculating layer weights of the double diffusion model in combination with the standard image and the resource information, and optimizing the double diffusion model training process with the layer weights to obtain a target image output by the optimized double diffusion model.
In a second aspect, embodiments of the present specification provide an image generating apparatus including:
a primary model building module configured to build a primary model of image training, the primary model comprising a diffusion model of image space and a diffusion model of hidden space;
the primary model training module is configured to input a prompt into the primary model, acquire a first loss function in the primary model training process, and train by taking the first loss function as a convergence function to acquire an associated primary model;
the joint training module is configured to perform joint training on the associated primary models to obtain a double diffusion model;
the double diffusion model training module is configured to input the prompt into a double diffusion model, acquire a second loss function in the double diffusion model training process, train by taking the second loss function as a convergence function, and acquire an output standard image;
the double diffusion model optimizing module is configured to acquire resource information, calculate layer weights of the double diffusion model by combining the standard image and the resource information, and optimize the double diffusion model training process by the layer weights to obtain a target image output by the optimized double diffusion model.
In a third aspect, embodiments of the present disclosure provide an electronic device including a processor and a memory;
the processor is connected with the memory;
the memory is used for storing executable program codes;
the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code for performing the method of one or more embodiments.
In a fourth aspect, embodiments of the present description provide a computer-readable storage medium, on which is stored a computer program that, when executed by a processor, performs the method of one or more embodiments.
In view of the above, in one or more embodiments of the present disclosure, a primary model comprising a diffusion model of the image space and a diffusion model of the hidden space is trained; under the condition of reduced model complexity, the associated primary model supported under the prompt is determined, and a double diffusion model is constructed for model training, guaranteeing generation of standard images along the various paths. A path weight optimization model is then trained in combination with resource information, and the image generation path is adaptively optimized to obtain the target image. The advantages of hidden-space diffusion and image-space diffusion can thus be combined in double training, and dynamic path selection based on resource information during diffusion achieves better detail control with lower resource consumption.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present description, the drawings that are required in the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present description, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a system architecture to which one embodiment of the present description applies.
Fig. 2 is a flowchart of an image generating method according to an embodiment of the present disclosure.
Fig. 3 is a flowchart of yet another image generation method provided in an embodiment of the present disclosure.
Fig. 4 is a flowchart of still another image generation method provided in an embodiment of the present specification.
Fig. 5 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present specification.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It should be appreciated that these embodiments are discussed only to enable a person skilled in the art to better understand and thereby practice the subject matter described herein, and are not limiting of the scope, applicability, or examples set forth in the claims. Changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure as set forth in the specification. Various examples may omit, replace, or add various procedures or components as desired. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. In addition, features described with respect to some examples may be combined in other examples as well.
As used herein, the term "comprising" and variations thereof are open-ended terms, meaning "including, but not limited to". The term "based on" means "based at least in part on". The terms "one embodiment" and "an embodiment" mean "at least one embodiment". The term "another embodiment" means "at least one other embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other definitions, whether explicit or implicit, may be included below. Unless the context clearly indicates otherwise, the definition of a term is consistent throughout this specification.
Before describing the image generation method in detail in connection with one or more embodiments, this specification briefly explains how AIGC images are generated in the prior art.
With the development of the Internet, image generation methods based on diffusion models have made major breakthroughs in the AIGC field, and image AIGC applications have grown explosively. There are many image-based AIGC products and applications both domestically and abroad. Currently, AIGC image generation follows two main technical paths. The first is AIGC image generation based on hidden-space diffusion. Such methods first map the high-dimensional image to a low-dimensional hidden space through a VAE and then diffuse in that low-dimensional hidden space. After diffusion produces a result, the VAE maps the hidden-space result back to the image space to obtain the generated image. The advantage of diffusing in the hidden space is high algorithm speed and efficiency; the disadvantage is that some image details cannot be controlled well. The second is AIGC image generation that diffuses directly in the image space, obtaining the generated image as soon as diffusion finishes. The advantage of this class of methods is that image details can be controlled well; however, diffusing in a high-dimensional image space tends to consume a great deal of computing resources and time.
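To make the two paths concrete, the following is a minimal, illustrative sketch, not the patent's implementation; the toy VAE, the stand-in denoisers, the step count, and all tensor shapes are assumptions chosen only to show where the diffusion loop runs in each path.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Toy VAE mapping 3x64x64 images to a 4x8x8 hidden (latent) space."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(3, 4, kernel_size=8, stride=8)            # image -> hidden space
        self.dec = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)   # hidden space -> image

    def encode(self, x):
        return self.enc(x)

    def decode(self, z):
        return self.dec(z)

def denoise(unet, x, steps):
    # Placeholder reverse-diffusion loop: each step predicts and removes noise.
    for _ in range(steps):
        x = x - 0.1 * unet(x)
    return x

vae = TinyVAE()
latent_unet = nn.Conv2d(4, 4, 3, padding=1)   # stand-in denoiser, hidden space
pixel_unet = nn.Conv2d(3, 3, 3, padding=1)    # stand-in denoiser, image space

# Path 1: hidden-space diffusion (fast, coarser detail control).
z = torch.randn(1, 4, 8, 8)                   # noise in the low-dimensional hidden space
img_latent_path = vae.decode(denoise(latent_unet, z, steps=10))

# Path 2: image-space diffusion (fine detail, compute-heavy).
x = torch.randn(1, 3, 64, 64)                 # noise directly in pixel space
img_pixel_path = denoise(pixel_unet, x, steps=10)
```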
In view of the above, the embodiments of this specification propose an image generation scheme: train a primary model comprising a diffusion model of the image space and a diffusion model of the hidden space; under the condition of reduced model complexity, determine the associated primary model supported under the prompt; construct a double diffusion model for model training, guaranteeing generation of standard images along the various paths; then train a path weight optimization model in combination with resource information and adaptively optimize the image generation path to obtain the target image. The advantages of hidden-space diffusion and image-space diffusion can thus be combined in double training, and dynamic path selection based on resource information during diffusion achieves better detail control with lower resource consumption.
An image generation scheme according to an embodiment of the present specification will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 illustrates an exemplary application scenario of an AIGC model according to an embodiment of the present specification.
In FIG. 1, the AIGC model is trained by a server/terminal and, after training is completed, is deployed to a platform or terminal that receives the user's prompt (the input text or instructions the user provides to the AIGC model). The server/terminal may be any device or combination of devices capable of training the model, such as a dedicated server, a graphics processor, a cloud server, a personal computer, a GPU server, or an edge device.
The platform or terminal that receives the user's prompt hosts the AIGC model and receives tasks related to the prompt; it may be a command-line terminal, a Jupyter Notebook platform, a web application, a mobile application, or the like, or any combination thereof.
It should be appreciated that all network entities shown in fig. 1 are exemplary and that any other network entity may be involved in an application scenario depending on the particular application requirements.
Fig. 2 shows a flowchart of an image generation process according to an embodiment of the present specification, as shown in fig. 2, including:
step 202, constructing an image-trained primary model, wherein the primary model comprises a diffusion model of an image space and a diffusion model of a hidden space.
In step 202, an image-trained primary model is constructed from a diffusion model of the image space and a diffusion model of the hidden space. A diffusion model based on the image space takes the image data itself as input and generates new content by manipulating or transforming the image. Such models typically use image processing techniques such as convolutional neural networks (CNNs) and generative adversarial networks (GANs) to learn and generate content related to the input image. Generation happens mainly in the image space, which yields high-quality images but tends to consume substantial computing resources and time. A diffusion model based on the hidden space generates new content by manipulating or transforming hidden variables: potential high-dimensional vector representations that encode and generate content in the AIGC model. Such models typically use techniques such as auto-encoders and variational auto-encoders (VAEs) to learn and manipulate hidden variables, producing content with diversity and continuity. Generation happens mainly in the hidden space, which yields vivid and diverse images but cannot control some image details well. When the primary model is built, the diffusion model of the image space and the diffusion model of the hidden space need to be configured and fused, for example through parallel fusion, serial fusion, or enhancement fusion, so that the two models are configured as a primary model in which they complement and reinforce each other.
In addition, the primary model can specifically comprise a large-scale model, a basic model, and an association mapping module. The large-scale model is a deep learning model trained on a huge dataset: by training on a large amount of data it learns the latent patterns and rules of the data, and the generated content can be diverse and creative, but it consumes large amounts of memory and computing resources. The basic model may likewise be a deep learning model; for image generation, the basic model used to generate image content may be a variant of generative adversarial networks (GANs), such as DCGAN (Deep Convolutional GAN) or a style-based GAN. These models generate realistic images by learning the statistical features of real images and the representations of specific image styles. The association mapping module establishes an association between input features and the generator network so that the generator can produce a corresponding output from the input; in image generation, it can connect one or more input features (such as noise vectors, text descriptions, or conditional images) with an input layer or intermediate layer of the generator network, so the generator network can generate related image content from the input features. The large-scale model and the basic model each comprise a diffusion model of the image space and a diffusion model of the hidden space; that is, the primary model can be configured from four models plus the association mapping module: a large-scale diffusion model of the image space, a large-scale diffusion model of the hidden space, a basic diffusion model of the image space, and a basic diffusion model of the hidden space, with the association mapping module detecting the association between the basic diffusion models of the image space and the hidden space.
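As a hedged illustration of this composition, the sketch below wires four stub diffusion models and a stub association mapping module into one primary model; the stub modules, the shapes, and the way the prompt embedding is threaded through are assumptions, not the patent's code.

```python
import torch
import torch.nn as nn

class StubDiffusion(nn.Module):
    """Stand-in denoiser; a real large-scale or basic diffusion model goes here."""
    def __init__(self, ch):
        super().__init__()
        self.net = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, prompt_emb, noise):
        return self.net(noise)  # prompt conditioning omitted in this sketch

class StubAssoc(nn.Module):
    """Stand-in association mapping: fuses the two hidden-space outputs."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(8, 3, 3, padding=1)

    def forward(self, h_large, h_base):
        return self.fuse(torch.cat([h_large, h_base], dim=1))

class PrimaryModel(nn.Module):
    """Four diffusion models plus an association mapping module."""
    def __init__(self):
        super().__init__()
        self.large_img = StubDiffusion(3)     # large-scale diffusion, image space
        self.large_hidden = StubDiffusion(4)  # large-scale diffusion, hidden space
        self.base_img = StubDiffusion(3)      # basic diffusion, image space
        self.base_hidden = StubDiffusion(4)   # basic diffusion, hidden space
        self.assoc_map = StubAssoc()          # correlates the two hidden-space branches

    def forward(self, prompt_emb, img_noise, hidden_noise):
        large_image = self.large_img(prompt_emb, img_noise)
        large_hidden = self.large_hidden(prompt_emb, hidden_noise)
        base_image = self.base_img(prompt_emb, img_noise)
        base_hidden = self.base_hidden(prompt_emb, hidden_noise)
        primary_image = self.assoc_map(large_hidden, base_hidden)
        return large_image, base_image, primary_image

model = PrimaryModel()
outs = model(None, torch.randn(1, 3, 64, 64), torch.randn(1, 4, 8, 8))
```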
Step 204, inputting the prompt into the primary model, obtaining a first loss function in the primary model training process, and training by taking the first loss function as a convergence function to obtain the associated primary model.
In step 204, the prompt is input into the primary model for training; original noise may be input at the same time. The prompt and the original noise are then trained through the primary model. The purpose of training is to obtain the degree of association between the diffusion model of the image space and the diffusion model of the hidden space in the primary model, and the associated primary model is obtained according to that degree of association. The training may be iterative, with the corresponding convergence function being the first loss function generated during primary model training. As shown in fig. 3, the primary model training process may be implemented by the following steps.
Step 302, inputting the prompt into the diffusion model of the image space and the diffusion model of the hidden space in the large-scale model, and outputting a corresponding large-scale image. In this step, the prompt can be input separately into the large-scale diffusion models based on the image space and on the hidden space. The large-scale diffusion model of the image space may be built with techniques such as convolutional neural networks (CNNs) or generative adversarial networks (GANs); it processes the input prompt and noise through a series of image processing operations and transformations and outputs a large-scale image. The large-scale diffusion model of the hidden space encodes and decodes the input prompt and noise by manipulating and transforming hidden variables: the encoding process converts the prompt and noise into a potential high-dimensional vector representation, and the decoding process converts the potential vectors into large-scale images. The large-scale images thus comprise the outputs of both the image-space and hidden-space large-scale diffusion models.
Step 304, inputting the prompt into the diffusion model of the image space and the diffusion model of the hidden space in the basic model, and outputting a corresponding basic image. In this step, the prompt can be input separately into the basic diffusion models based on the image space and on the hidden space. Compared with a large-scale diffusion model, a basic diffusion model of the image space has roughly 15%-25% of its size and compute cost; it processes the input prompt and noise through a series of image processing operations and transformations and outputs a basic image, which may be a low-resolution image or a basic image meeting specific requirements. The basic diffusion model of the hidden space may be built with techniques such as auto-encoders (Autoencoder) or variational auto-encoders (VAEs); it encodes and decodes the input prompt and noise by operating on and transforming hidden variables, with the encoding process converting the prompt and noise into a potential high-dimensional vector representation and the decoding process converting the potential vectors into basic images. The basic images thus comprise the outputs of both the image-space and hidden-space basic diffusion models.
Step 306, inputting the hidden-space outputs of the large-scale image and the basic image into the association mapping module, and outputting a primary image. In this step, the association mapping module receives the hidden-space vectors of the large-scale image and the basic image as input, and may employ a set of convolutional layers, fully connected layers, or other types of layers for feature extraction and fusion. By correlating and mapping the hidden-space vectors of the large-scale image and the basic image, the association mapping module generates a representation of the primary image. The specific primary image generation process may be a fitting process: the large-scale image and the basic image are each converted into a corresponding hidden-space representation by a hidden-space encoder, and these hidden-space outputs serve as the input data of the association mapping module. The associated image representation generated by the module can then be converted into a primary image using a decoder, deconvolution network, or similar technique; the decoder translates the representation of the associated image into a pixel-level image that matches the original input large-scale image and basic image. The final fitted primary image is the output of the model.
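A minimal sketch of this fitting process follows, under the assumption that the association mapping module is a small encode-fuse-decode network over the two images' hidden-space representations; the layer sizes and the choice of a shared encoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AssociationMapping(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 16, 4, stride=4)            # image -> hidden space
        self.fuse = nn.Conv2d(32, 16, 3, padding=1)             # correlate both branches
        self.decoder = nn.ConvTranspose2d(16, 3, 4, stride=4)   # hidden space -> primary image

    def forward(self, large_image, base_image):
        h_large = self.encoder(large_image)   # hidden representation, large-scale branch
        h_base = self.encoder(base_image)     # hidden representation, basic branch
        fused = self.fuse(torch.cat([h_large, h_base], dim=1))
        return self.decoder(fused)            # pixel-level primary image

assoc = AssociationMapping()
primary = assoc(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```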
In addition, during primary model training, the generated first loss function is used as the convergence function. The first loss function can comprise a diffusion loss measuring the difference between the generated image and the target image, a distillation loss measuring the similarity between the generated image and the basic image, and an association loss measuring the degree of correlation between the generated image and the primary image. In the iterative training process, the first loss function serves as the iteration target; when the target is reached, the corresponding basic diffusion model of the image space and basic diffusion model of the hidden space in the primary model are determined, i.e., the associated primary model.
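The three loss terms can be sketched as follows; the use of MSE for the diffusion and distillation terms, cosine similarity for the association term, and the weighting coefficients are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def first_loss(gen_image, target_image, base_image, primary_image,
               w_diff=1.0, w_distill=0.5, w_assoc=0.5):
    diffusion_loss = F.mse_loss(gen_image, target_image)   # generated vs target image
    distill_loss = F.mse_loss(gen_image, base_image)       # similarity to the basic image
    assoc_loss = 1 - F.cosine_similarity(                  # correlation with the primary image
        gen_image.flatten(1), primary_image.flatten(1)).mean()
    return w_diff * diffusion_loss + w_distill * distill_loss + w_assoc * assoc_loss

g = torch.randn(2, 3, 64, 64)
loss = first_loss(g, torch.randn_like(g), torch.randn_like(g), torch.randn_like(g))
```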
Step 206, performing joint training on the associated primary models to obtain a double diffusion model.
In step 206, after the associated basic diffusion model of the image space and basic diffusion model of the hidden space in the primary model are determined, the two are jointly trained, for example through parallel fusion, serial fusion, or enhanced fusion, to obtain the corresponding double diffusion model.
In addition, the double diffusion model is specifically structured as a basic diffusion model of the image space and a basic diffusion model of the hidden space, plus a path interaction model and a fusion output model. The path interaction model is a model for image processing or computer-vision tasks whose purpose is to enhance feature representations by exchanging information across different paths; the fusion output model fuses the outputs of multiple models to produce a final result. Such fusion models are typically used to integrate the predictions or feature representations of multiple models, improving overall performance or reducing the bias of any single model.
And step 208, inputting the prompt into a double diffusion model, acquiring a second loss function in the training process of the double diffusion model, and training by taking the second loss function as a convergence function to obtain an output standard image.
In step 208, the prompt is input into the double diffusion model for training; original noise may be input at the same time. The prompt and the original noise are then trained through the double diffusion model, the purpose being to obtain the standard image output by the double diffusion model. The training may be iterative, with the corresponding convergence function being the second loss function generated during double diffusion model training. As shown in fig. 4, the double diffusion model training process may be implemented by the following steps.
Step 402, inputting the prompt into the basic diffusion model of the image space and the basic diffusion model of the hidden space for training, and obtaining the intermediate layer features of both models. In this step, the input of the image-space basic diffusion model is the prompt and random noise, and its output is a corresponding generated image; the input of the hidden-space basic diffusion model is likewise the prompt and random noise, and its output is a corresponding generated image. The intermediate layer features of both models during training are then acquired. For the image-space basic diffusion model, the intermediate layer features are extracted at the image pixel level; they may be the activation outputs of convolutional layers or the pooling results of pooling layers, capture local and global information in the image, and can serve tasks such as image classification and object detection. For the hidden-space basic diffusion model, the intermediate layer features are extracted over the potential space, obtained by mapping the input image into the potential space with a hidden-space encoder; they carry higher-level semantic representations and can serve tasks such as image generation and image reconstruction. These intermediate layer features can be further extracted and fused in the fusion output model, using convolutional layers, fully connected layers, or other types of layers to obtain richer and more useful information.
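One possible way to capture these intermediate layer features (an assumption; the patent does not specify the mechanism) is to register PyTorch forward hooks on a chosen internal layer of each basic diffusion model:

```python
import torch
import torch.nn as nn

features = {}

def capture(name):
    def hook(module, inputs, output):
        features[name] = output   # store the intermediate activation
    return hook

base_img_model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(8, 3, 3, padding=1))
base_hidden_model = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(8, 4, 3, padding=1))

# Hook the first conv of each model as the "intermediate layer".
base_img_model[0].register_forward_hook(capture("img_mid"))
base_hidden_model[0].register_forward_hook(capture("hidden_mid"))

_ = base_img_model(torch.randn(1, 3, 64, 64))
_ = base_hidden_model(torch.randn(1, 4, 8, 8))
img_mid, hidden_mid = features["img_mid"], features["hidden_mid"]
```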
Step 404, inputting the intermediate layer features into the path interaction model, outputting the interacted intermediate layer features, and replacing the original intermediate layer features with them. In this step, the intermediate layer features of the image-space basic diffusion model and of the hidden-space basic diffusion model are both fed into the path interaction model so the two branches can interact. Through learning and training, the path interaction model interacts the input intermediate layer features and generates interacted intermediate layer features; the interaction may be fusion, combination, or another form of exchange, designed according to the needs of the task and dataset. The interacted intermediate layer features are fed back into the image-space and hidden-space basic diffusion models as their new intermediate layer features, so that the basic models can exploit the output of the path interaction model, further improving performance and generalization.
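A hedged sketch of one such interaction: the two intermediate feature maps are brought to a common resolution, fused by convolution, and split back so each branch receives an interacted feature to replace its original one. The interpolate-and-concatenate design is an illustrative choice, not the patent's specified method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathInteraction(nn.Module):
    def __init__(self, img_ch=8, hid_ch=8):
        super().__init__()
        self.mix = nn.Conv2d(img_ch + hid_ch, img_ch + hid_ch, 3, padding=1)

    def forward(self, img_feat, hid_feat):
        # Align the hidden-space feature map to the image-space resolution.
        hid_up = F.interpolate(hid_feat, size=img_feat.shape[-2:], mode="nearest")
        mixed = self.mix(torch.cat([img_feat, hid_up], dim=1))
        new_img, new_hid_up = torch.split(
            mixed, [img_feat.shape[1], hid_feat.shape[1]], dim=1)
        # Return the interacted features; these replace the originals in each branch.
        new_hid = F.interpolate(new_hid_up, size=hid_feat.shape[-2:], mode="nearest")
        return new_img, new_hid

interact = PathInteraction()
new_img_mid, new_hid_mid = interact(torch.randn(1, 8, 64, 64), torch.randn(1, 8, 8, 8))
```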
Step 406, obtaining the process images output by the basic diffusion model of the image space and the basic diffusion model of the hidden space, and inputting them into the fusion output model to obtain the corresponding standard image. In this step, the two process images output by the two basic diffusion models after the new intermediate layer features have been substituted are acquired and fused: the output image of the image-space basic diffusion model and the output image of the hidden-space basic diffusion model are passed into the fusion output model, which fuses the two images by splicing, weighted summation, or another fusion method to obtain a fused input image. Training then proceeds through the fusion output model: it receives the fused input image as input, may use convolutional layers, fully connected layers, or other types of layers for feature extraction and information fusion, and through learning and training converts the fused input image into the final output result, the standard image output by the double diffusion model.
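Below is a minimal sketch of the fusion output model using the splicing (concatenation) option named above, followed by a small convolutional head; weighted summation would be an equally valid fusion choice, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class FusionOutput(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1))

    def forward(self, img_space_out, hidden_space_out):
        # Splice the two process images along the channel dimension.
        fused_input = torch.cat([img_space_out, hidden_space_out], dim=1)
        return self.head(fused_input)   # the standard image

fusion = FusionOutput()
standard_image = fusion(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```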
In addition, during double diffusion model training, the generated second loss function is used as the convergence function. The second loss function can comprise a diffusion loss measuring the difference between the standard image and the target image; in the iterative training process, the second loss function serves as the iteration target, and when the target is reached, the corresponding output standard image is determined.
Step 210, acquiring resource information, calculating layer weights of the double diffusion model in combination with the standard image and the resource information, and optimizing the double diffusion model training process with the layer weights to obtain a target image output by the optimized double diffusion model.
In step 210, during actual training the computing resource information of the current training environment needs to be analyzed, and some layers are skipped adaptively according to the computing resource conditions to achieve a better trade-off between quality and efficiency. Resource information is acquired and, combined with the standard images produced during double diffusion model training, the resource-usage weight of each layer of the double diffusion model is determined. Once each layer's weight is determined, the training process is optimized with these layer weights, and model training with the optimized double diffusion model yields the optimized standard image, i.e., the target image output for the prompt.
In addition, the specific double diffusion model optimization process involves two models. The first is the double diffusion model under training, whose input and output are standard images. The second is the path weight optimization model, whose input is the environmental resource information during double diffusion model training (CPU and GPU models, occupancy, and the like) and whose output is the weight of each layer. The path weight optimization model can first preset an optimization target; for example, a weight threshold of 0.5 can be set, and layers below 0.5 are skipped. A suitable optimization algorithm is then selected according to the optimization target and constraints: common choices include gradient descent, genetic algorithms, and particle swarm optimization, and a suitable algorithm allows the weight of each layer to be searched and updated effectively. Depending on the algorithm selected, the layer weights are updated using gradients or other methods, iterating until the optimization target is reached or the constraints are satisfied. The convergence function in this iteration can comprise the second loss function generated by the double diffusion model during training and the weight sparse loss generated by the path weight optimization model; these losses serve as the iteration target, and reaching the target indicates that double diffusion model optimization is complete.
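The following sketch puts the path weight idea together: a small model (here an assumed MLP) maps resource information to per-layer weights, layers whose weight falls below the 0.5 threshold from the example above are skipped, and an L1 term stands in for the weight sparse loss. The shapes, the residual formulation, and the feature encoding of the resource information are all assumptions.

```python
import torch
import torch.nn as nn

class PathWeightModel(nn.Module):
    def __init__(self, resource_dim=4, num_layers=12):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(resource_dim, 32), nn.ReLU(),
                                 nn.Linear(32, num_layers), nn.Sigmoid())

    def forward(self, resource_info):
        return self.mlp(resource_info)   # one weight in (0, 1) per layer

layers = nn.ModuleList(nn.Conv2d(3, 3, 3, padding=1) for _ in range(12))
weight_model = PathWeightModel()

resource_info = torch.tensor([[0.7, 0.2, 0.5, 0.9]])   # e.g. encoded CPU/GPU load
layer_weights = weight_model(resource_info)[0]
sparse_loss = layer_weights.abs().mean()                # weight sparse loss (L1)

x = torch.randn(1, 3, 64, 64)
for layer, w in zip(layers, layer_weights):
    if w < 0.5:            # skip low-weight layers to save compute
        continue
    x = x + w * layer(x)   # weighted residual pass through the kept layers
```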
In the embodiments of the present disclosure, the aim is to train a primary model comprising a diffusion model of the image space and a diffusion model of the hidden space, determine the associated primary model supported under the prompt while reducing model complexity, construct a double diffusion model for model training to guarantee generation of standard images along the various paths, then train a path weight optimization model in combination with resource information and adaptively optimize the image generation path to obtain the target image. The advantages of hidden-space diffusion and image-space diffusion can thus be combined in double training, and dynamic path selection based on resource information during diffusion achieves better detail control with lower resource consumption.
In one or more embodiments of the present disclosure, after the optimized double diffusion model is obtained through training, it is deployed to a cloud server or terminal device. Once deployed, it receives the content generation instruction (prompt) that a user inputs in real time to the relevant platform/APP, acquires the resource information of the current cloud server or terminal device, further optimizes the double diffusion model in real time according to that resource information, inputs the content generation instruction into the real-time-optimized double diffusion model, and outputs the target image corresponding to the content generation instruction.
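An illustrative deployment-time loop (all names hypothetical) showing how each incoming prompt would trigger a fresh read of resource information and a re-derived layer-skipping plan before generation:

```python
import torch
import torch.nn as nn

def get_resource_info():
    # Stand-in for querying the current CPU/GPU model and occupancy.
    return torch.tensor([[0.7, 0.2, 0.5, 0.9]])

class DeployedDoubleDiffusion(nn.Module):
    """Toy stand-in for the optimized double diffusion model."""
    def __init__(self, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(3, 3, 3, padding=1) for _ in range(num_layers))

    def forward(self, noise, layer_weights):
        x = noise
        for layer, w in zip(self.layers, layer_weights):
            if w < 0.5:              # real-time optimization: skip this layer
                continue
            x = x + w * layer(x)
        return x

weight_model = nn.Sequential(nn.Linear(4, 12), nn.Sigmoid())
model = DeployedDoubleDiffusion()

# For each content generation instruction (prompt) received in real time:
layer_weights = weight_model(get_resource_info())[0]
target_image = model(torch.randn(1, 3, 64, 64), layer_weights)
```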
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The image generating apparatus according to the embodiments of the present application will be described in detail with reference to fig. 5. It should be noted that the image generating apparatus shown in fig. 5 is used to perform the method embodiments described above; for convenience of explanation, only the portions relevant to the embodiments of the present application are shown, and for specific technical details not disclosed, please refer to those method embodiments.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
a primary model construction module 501 configured to construct an image-trained primary model comprising a diffusion model of an image space and a diffusion model of a hidden space;
The primary model training module 502 is configured to input a prompt into the primary model, acquire a first loss function in the primary model training process, and train by taking the first loss function as a convergence function to acquire an associated primary model;
a joint training module 503 configured to perform joint training on the associated primary models to obtain a double diffusion model;
the double diffusion model training module 504 is configured to input the prompt into a double diffusion model, acquire a second loss function in the double diffusion model training process, and train by taking the second loss function as a convergence function to acquire an output standard image;
the double diffusion model optimizing module 505 is configured to acquire resource information, calculate layer weights of the double diffusion model in combination with the standard image, and optimize the double diffusion model training process by using the layer weights to obtain a target image output by the optimized double diffusion model.
In some possible embodiments, the image generating apparatus further comprises at least:
the primary model configuration module is configured to enable the primary model to comprise a large-scale model, a basic model and an association mapping module, wherein the large-scale model and the basic model respectively comprise a diffusion model of an image space and a diffusion model of a hidden space.
In some possible embodiments, the image generating apparatus further comprises at least:
the large-scale image training module is configured to input the prompt into the diffusion model of the image space and the diffusion model of the hidden space in the large-scale model and output a corresponding large-scale image;
the basic image training module is configured to input the prompt into the diffusion model of the image space and the diffusion model of the hidden space in the basic model and output a corresponding basic image;
and the primary image training module is configured to input the hidden space in the large-scale image and the basic image into the association mapping module and output a primary image.
In some possible embodiments, the image generating apparatus further comprises at least:
and the fitting training module is configured to take hidden space output in the large-scale image and the basic image as input data of the association mapping module, take image space output in the large-scale image and the basic image as a fitting object and output a primary image after fitting.
In some possible embodiments, the image generating apparatus further comprises at least:
and the first loss function generation module is configured to acquire diffusion loss, distillation loss and association loss generated in the primary model training process and generate a corresponding first loss function.
In some possible embodiments, the image generating apparatus further comprises at least:
and the double diffusion model configuration module is configured to enable the double diffusion model to comprise a basic diffusion model of an image space, a basic diffusion model of a hidden space, a path interaction model and a fusion output model.
In some possible embodiments, the image generating apparatus further comprises at least:
the middle layer feature acquisition module is configured to train a prompt input into the basic diffusion model of the image space and the basic diffusion model of the hidden space to acquire middle layer features of the basic diffusion model of the image space and the basic diffusion model of the hidden space;
the middle layer feature replacement module is configured to input the middle layer features into the path interaction model, output the interacted middle layer features, and replace the original middle layer features with the interacted middle layer features;
and the standard image output module is configured to acquire a process image output by the basic diffusion model of the image space and the basic diffusion model of the hidden space, and input the process image into the fusion output model to obtain a corresponding standard image.
In some possible embodiments, the image generating apparatus further comprises at least:
The standard image training module is configured to input the standard image into the double diffusion model, output a trained standard image and acquire a generated second loss function;
the weight training module is configured to input the resource information into the path weight optimizing module, output layer weights and acquire generated weight sparse loss;
and the weight optimization module is configured to train by taking the second loss function and the weight sparse loss as convergence functions until the convergence functions converge, so as to obtain an optimized double diffusion model.
It will be apparent to those skilled in the art that the embodiments of the present application may be implemented in software and/or hardware. "Unit" and "module" in this specification refer to software and/or hardware capable of performing a specific function, either alone or in combination with other components, such as Field programmable gate arrays (Field-Programmable Gate Array, FPGAs), integrated circuits (Integrated Circuit, ICs), etc.
The processing units and/or modules of the embodiments of the present application may be implemented by an analog circuit that implements the functions described in the embodiments of the present application, or may be implemented by software that executes the functions described in the embodiments of the present application.
Referring to fig. 6, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown; the electronic device may be used to implement the methods in the embodiments described above. As shown in fig. 6, the electronic device 600 may include: at least one processor 601, at least one network interface 604, a user interface 603, a memory 605, and at least one communication bus 602.
Wherein the communication bus 602 is used to enable connected communications between these components.
The user interface 603 may include a Display screen (Display), a Camera (Camera), and the optional user interface 603 may further include a standard wired interface, a wireless interface.
The network interface 604 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
The processor 601 may include one or more processing cores. Using various interfaces and lines, the processor 601 connects the parts of the electronic device 600, and performs the various functions of the terminal 600 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 605 and by invoking data stored in the memory 605. Optionally, the processor 601 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA) form. The processor 601 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs, and so on; the GPU renders and draws the content to be displayed on the display screen; the modem handles wireless communication. The modem may also not be integrated into the processor 601 and may instead be implemented by a separate chip.
The Memory 605 may include a random access Memory (Random Access Memory, RAM) or a Read-Only Memory (Read-Only Memory). Optionally, the memory 605 includes a non-transitory computer readable medium (non-transitory computer-readable storage medium). Memory 605 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 605 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, etc.; the storage data area may store data or the like referred to in the above respective method embodiments. The memory 605 may also optionally be at least one storage device located remotely from the processor 601. As shown in fig. 6, an operating system, network communication modules, user interface modules, and program instructions may be included in memory 605, which is a type of computer storage medium.
In the electronic device 600 shown in fig. 6, the user interface 603 is mainly used to provide an input interface for the user and acquire the data the user inputs, while the processor 601 may be configured to invoke the image-based interactive application stored in the memory 605 and specifically perform the following operations: constructing a primary model of image training, wherein the primary model comprises a diffusion model of the image space and a diffusion model of the hidden space; inputting the prompt into the primary model, acquiring a first loss function in the primary model training process, and training by taking the first loss function as a convergence function to obtain the associated primary model; performing joint training on the associated primary models to obtain a double diffusion model; inputting the prompt into the double diffusion model, acquiring a second loss function in the double diffusion model training process, and training by taking the second loss function as a convergence function to obtain an output standard image; and acquiring resource information, calculating the layer weights of the double diffusion model in combination with the standard image and the resource information, and optimizing the double diffusion model training process with the layer weights to obtain a target image output by the optimized double diffusion model.
The present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method. The computer readable storage medium may include, among other things, any type of disk including floppy disks, optical disks, DVDs, CD-ROMs, micro-drives, and magneto-optical disks, ROM, RAM, EPROM, EEPROM, DRAM, VRAM, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, such as the division of the units, merely a logical function division, and there may be additional manners of dividing the actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some service interface, device or unit indirect coupling or communication connection, electrical or otherwise.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be performed by hardware associated with a program that is stored in a computer readable memory, which may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.

Claims (18)

1. An image generation method, comprising:
constructing a primary model of image training, wherein the primary model comprises a diffusion model of an image space and a diffusion model of a hidden space;
inputting a prompt into the primary model, acquiring a first loss function in the primary model training process, and training by taking the first loss function as a convergence function to obtain an associated primary model;
performing joint training on the associated primary models to obtain a double diffusion model;
inputting the prompt into the double diffusion model, acquiring a second loss function in the double diffusion model training process, and training by taking the second loss function as a convergence function to obtain an output standard image;
and acquiring resource information, calculating layer weights of the double diffusion model in combination with the standard image and the resource information, and optimizing the double diffusion model training process with the layer weights to obtain a target image output by the optimized double diffusion model.
2. The method of claim 1, wherein the method further comprises:
the primary model comprises a large-scale model, a basic model and an associated mapping module, wherein the large-scale model and the basic model respectively comprise a diffusion model of an image space and a diffusion model of a hidden space.
3. The method of claim 2, wherein the primary model training process comprises:
inputting the prompt into the diffusion model of the image space and the diffusion model of the hidden space in the large-scale model, and outputting a corresponding large-scale image;
inputting the prompt into the diffusion model of the image space and the diffusion model of the hidden space in the basic model, and outputting a corresponding basic image;
and inputting the hidden spaces in the large-scale image and the basic image into the association mapping module, and outputting a primary image.
4. A method according to claim 3, wherein said inputting the hidden space in the large-scale image and the basic image into the association mapping module and outputting a primary image comprises:
And taking hidden space output in the large-scale image and the basic image as input data of the association mapping module, taking image space output in the large-scale image and the basic image as a fitting object, and outputting a primary image after fitting.
5. A method according to claim 3, wherein the first loss function comprises:
and obtaining diffusion loss, distillation loss and association loss generated in the primary model training process, and generating a corresponding first loss function.
6. The method of claim 1, wherein the method further comprises:
the double diffusion model comprises a basic diffusion model of an image space, a basic diffusion model of a hidden space, a path interaction model and a fusion output model.
7. The method of claim 6, wherein the double diffusion model training process comprises:
inputting the prompt into the basic diffusion model of the image space and the basic diffusion model of the hidden space for training, and acquiring intermediate layer characteristics of the basic diffusion model of the image space and the basic diffusion model of the hidden space;
inputting the intermediate layer characteristics into the path interaction model, outputting the interacted intermediate layer characteristics, and replacing the original intermediate layer characteristics with the interacted intermediate layer characteristics;
And acquiring a process image output by the basic diffusion model of the image space and the basic diffusion model of the hidden space, and inputting the process image into the fusion output model to obtain a corresponding standard image.
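
One possible, purely illustrative reading of the claim 7 training step follows, with single linear layers standing in for the two basic diffusion models, the path interaction model, and the fusion output model; all names and dimensions are assumptions of this sketch:

```python
# Toy sketch of the claim 7 step: intermediate features of the two
# diffusion paths are exchanged through a path interaction model and
# replace the originals; the two process images are then merged by a
# fusion output model.
import torch
import torch.nn as nn

dim = 16
image_branch = nn.Linear(dim, dim)         # basic diffusion model, image space
latent_branch = nn.Linear(dim, dim)        # basic diffusion model, hidden space
interaction = nn.Linear(2 * dim, 2 * dim)  # path interaction model
fusion = nn.Linear(2 * dim, dim)           # fusion output model

prompt_emb = torch.randn(4, dim)           # toy prompt embedding

# intermediate-layer features of the two paths
feat_img = image_branch(prompt_emb)
feat_lat = latent_branch(prompt_emb)

# interact, then split the result and replace the original features
mixed = interaction(torch.cat([feat_img, feat_lat], dim=-1))
feat_img, feat_lat = mixed.chunk(2, dim=-1)

# process images from each path, fused into the standard image
standard_image = fusion(torch.cat([feat_img, feat_lat], dim=-1))
```
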
8. The method of claim 1, wherein the calculating the layer weights of the double diffusion model by combining the standard image and the resource information, and optimizing the double diffusion model training process with the layer weights, comprises:
inputting the standard image into the double diffusion model, outputting the trained standard image, and acquiring a generated second loss function;
inputting the resource information into a path weight optimizing module, outputting layer weights, and acquiring the generated weight sparsity loss;
And training with the second loss function and the weight sparsity loss as convergence functions until the convergence functions converge, so as to obtain the optimized double diffusion model.
9. An image generating apparatus, comprising:
a primary model building module configured to build a primary model for image training, the primary model comprising a diffusion model of image space and a diffusion model of hidden space;
the primary model training module is configured to input a prompt into the primary model, acquire a first loss function in the primary model training process, and train by taking the first loss function as a convergence function to acquire an associated primary model;
The joint training module is configured to perform joint training on the associated primary models to obtain a double diffusion model;
the double diffusion model training module is configured to input the prompt into a double diffusion model, acquire a second loss function in the double diffusion model training process, train by taking the second loss function as a convergence function, and acquire an output standard image;
the double diffusion model optimizing module is configured to acquire resource information, calculate layer weights of the double diffusion model by combining the standard image and the resource information, and optimize the double diffusion model training process by the layer weights to obtain a target image output by the optimized double diffusion model.
10. The apparatus of claim 9, comprising:
the primary model configuration module is configured to enable the primary model to comprise a large-scale model, a basic model and an association mapping module, wherein the large-scale model and the basic model respectively comprise a diffusion model of an image space and a diffusion model of a hidden space.
11. The apparatus of claim 10, wherein the primary model training module comprises:
the large-scale image training module is configured to input a prompt into a diffusion model of an image space and a diffusion model of a hidden space in the large-scale model and output a corresponding large-scale image;
The basic image training module is configured to input a prompt into a diffusion model of an image space and a diffusion model of a hidden space in the basic model and output a corresponding basic image;
and the primary image training module is configured to input the hidden space in the large-scale image and the basic image into the association mapping module and output a primary image.
12. The apparatus of claim 11, comprising:
and the fitting training module is configured to take hidden space output in the large-scale image and the basic image as input data of the association mapping module, take image space output in the large-scale image and the basic image as a fitting object and output a primary image after fitting.
13. The apparatus of claim 11, comprising:
and the first loss function generation module is configured to acquire diffusion loss, distillation loss and association loss generated in the primary model training process and generate a corresponding first loss function.
14. The apparatus of claim 9, comprising:
and the double diffusion model configuration module is configured to enable the double diffusion model to comprise a basic diffusion model of an image space, a basic diffusion model of a hidden space, a path interaction model and a fusion output model.
15. The apparatus of claim 14, comprising:
the middle layer feature acquisition module is configured to input a prompt into the basic diffusion model of the image space and the basic diffusion model of the hidden space for training, and acquire middle layer features of the basic diffusion model of the image space and the basic diffusion model of the hidden space;
the middle layer feature replacement module is configured to input the middle layer features into the path interaction model, output the interacted middle layer features, and replace the original middle layer features with the interacted middle layer features;
and the standard image output module is configured to acquire a process image output by the basic diffusion model of the image space and the basic diffusion model of the hidden space, and input the process image into the fusion output model to obtain a corresponding standard image.
16. The apparatus of claim 9, comprising:
the standard image training module is configured to input the standard image into the double diffusion model, output a trained standard image and acquire a generated second loss function;
the weight training module is configured to input the resource information into the path weight optimizing module, output layer weights, and acquire the generated weight sparsity loss;
And the weight optimization module is configured to train with the second loss function and the weight sparsity loss as convergence functions until the convergence functions converge, so as to obtain an optimized double diffusion model.
17. An electronic device, comprising a processor and a memory;
the processor is connected with the memory;
the memory is used for storing executable program codes;
the processor runs a program corresponding to executable program code stored in the memory by reading the executable program code for performing the method according to any one of claims 1-8.
18. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1-8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311274061.2A CN117292007A (en) 2023-09-28 2023-09-28 Image generation method and device


Publications (1)

Publication Number Publication Date
CN117292007A true CN117292007A (en) 2023-12-26

Family

ID=89240534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311274061.2A Pending CN117292007A (en) 2023-09-28 2023-09-28 Image generation method and device

Country Status (1)

Country Link
CN (1) CN117292007A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117893838A * 2024-03-14 2024-04-16 Xiamen University Target detection method using diffusion detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination