CN116778011A - Image generating method - Google Patents

Image generating method

Info

Publication number
CN116778011A
CN116778011A
Authority
CN
China
Prior art keywords
text
vector
image
image generation
objects
Prior art date
Legal status
Pending
Application number
CN202310594509.2A
Other languages
Chinese (zh)
Inventor
刘志恒
张轶飞
沈宇军
郑可成
朱凯
冯睿蠡
刘宇
赵德丽
周靖人
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202310594509.2A
Publication of CN116778011A
Legal status: Pending


Abstract

An embodiment of the present specification provides an image generation method, wherein the image generation method includes: inputting an object description text containing at least two objects into an image generation model, and determining an initial text vector of the object description text; determining pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated from the training text vectors of the at least two objects during training of the image generation model; and processing the initial text vector through the pre-stored object vectors to obtain a target text vector, and generating an object image containing the at least two objects according to the target text vector. This avoids the problems of low neural network training efficiency and heavy waste of computing resources caused by the huge number of multi-concept customized sentences, while ensuring smooth operation of the image generation model and providing stable image generation capability.

Description

Image generating method
Technical Field
The embodiment of the specification relates to the technical field of computers, in particular to an image generation method.
Background
With the continuous development of Artificial Intelligence (AI) technology, in the field of image generation, the technique of automatically generating an image from concept description sentences is widely applied in various computer service scenarios. The text-to-image technique is realized based on artificial intelligence: a neural network model is trained on a large number of samples, so that a concept description sentence input into the neural network model is automatically turned into an image. For example, a multi-concept customized sentence containing the two concepts "sunglasses" and "man" is input into the neural network model, and an image of a man wearing sunglasses is generated.
In the prior art, in order to enable the neural network model to generate multi-concept customized images, the neural network model is made to learn each multi-concept customized sentence separately during training. However, since there are many concept words in real scenarios, the number of multi-concept customized sentences formed by combining several concept words is extremely large, so that the training efficiency of the neural network model is low and a great amount of computing resources is wasted.
Disclosure of Invention
In view of this, the present embodiment provides an image generation method. One or more embodiments of the present specification relate to an image generation model training method, another image generation method, an image generation model training apparatus, two image generation apparatuses, a computing device, a computer-readable storage medium, and a computer program, to solve the technical drawbacks existing in the prior art.
According to a first aspect of embodiments of the present specification, there is provided an image generation method including:
inputting an object description text containing at least two objects into an image generation model, and determining an initial text vector of the object description text;
determining pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated from the training text vectors of the at least two objects during training of the image generation model;
and processing the initial text vector through the pre-stored object vector to obtain a target text vector, and generating an object image containing at least two objects according to the target text vector.
According to a second aspect of embodiments of the present specification, there is provided an image generating apparatus comprising:
a first vector determination module configured to input an object description text containing at least two objects into an image generation model, determine an initial text vector of the object description text;
a second vector determination module configured to determine pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated from the training text vectors of the at least two objects during training of the image generation model;
and the image generation module is configured to process the initial text vector through the pre-stored object vector, obtain a target text vector and generate an object image containing the at least two objects according to the target text vector.
According to a third aspect of embodiments of the present specification, there is provided an image generation model training method, including:
determining a training sample aiming at an image generation model, wherein the training sample comprises a sample object description text and a reference object image corresponding to the sample object description text;
inputting the sample object description text into the image generation model, and processing the sample object description text by using a text coding module in the image generation model to obtain a training text vector corresponding to the sample object description text;
decoding the training text vector by using a vector processing module in the image generation model to obtain a sample object image corresponding to the sample object description text, and generating a pre-stored object vector corresponding to the sample object according to the training text vector;
and determining a loss function of the image generation model according to the reference object image and the sample object image, and training the image generation model based on the loss function to obtain a trained image generation model.
According to a fourth aspect of embodiments of the present specification, there is provided an image generation model training apparatus comprising:
a sample determination module configured to determine a training sample for an image generation model, wherein the training sample comprises a sample object description text and a reference object image corresponding to the sample object description text;
the vector acquisition module is configured to input the sample object description text into the image generation model, and process the sample object description text by utilizing a text encoding module in the image generation model to obtain a training text vector corresponding to the sample object description text;
the image acquisition module is configured to decode the training text vector by utilizing the vector processing module in the image generation model, obtain a sample object image corresponding to the sample object description text, and generate a pre-stored object vector corresponding to the sample object according to the training text vector;
and the loss determination module is configured to determine a loss function of the image generation model according to the reference object image and the sample object image, train the image generation model based on the loss function and obtain a trained image generation model.
According to a fifth aspect of embodiments of the present specification, there is provided an image generating method applied to a cloud-side apparatus, including:
receiving an object description text sent by an end-side device, wherein the object description text comprises at least two objects;
inputting the object description text into an image generation model, and determining an initial text vector of the object description text;
determining pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated from the training text vectors of the at least two objects during training of the image generation model;
processing the initial text vector through the pre-stored object vector to obtain a target text vector, and generating an object image containing at least two objects according to the target text vector;
and sending the object image to the end-side device.
According to a sixth aspect of embodiments of the present specification, there is provided an image generating apparatus applied to a cloud-side device, including:
a text receiving module configured to receive an object description text sent by the end-side device, wherein the object description text comprises at least two objects;
a first vector determination module configured to input the object description text into an image generation model, determining an initial text vector of the object description text;
a second vector determination module configured to determine pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated from the training text vectors of the at least two objects during training of the image generation model;
the image generation module is configured to process the initial text vector through the pre-stored object vector, obtain a target text vector and generate an object image containing the at least two objects according to the target text vector;
an image transmission module configured to transmit the object image to the end-side device.
According to a seventh aspect of embodiments of the present specification, there is provided a computing device comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions that, when executed by the processor, implement the steps of the two image generation methods and the image generation model training method.
According to an eighth aspect of embodiments of the present specification, there is provided a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the two image generation methods and the image generation model training method described above.
According to a ninth aspect of embodiments of the present specification, there is provided a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the two image generation methods and the image generation model training method described above.
The image generation method provided by this specification includes the following steps: inputting an object description text containing at least two objects into an image generation model, and determining an initial text vector of the object description text; determining pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated from the training text vectors of the at least two objects during training of the image generation model; and processing the initial text vector through the pre-stored object vectors to obtain a target text vector, and generating an object image containing the at least two objects according to the target text vector.
Specifically, the method does not need to enable the image generation model to learn each multi-concept customized sentence, but trains the image generation model to learn each individual object (concept), so that the problems that the training efficiency of the neural network model is low and a large amount of computing resources are wasted due to the large number of the multi-concept customized sentences are avoided. Meanwhile, in the model training process, a pre-stored object vector corresponding to each individual object is generated and stored. Therefore, in practical application, under the condition that the object description text containing at least two objects is input into the image generation model, the image generation model can determine pre-stored object vectors corresponding to the at least two objects, combine the pre-stored object vector of each object with the initial text vector of the object description text, and then generate an object image containing the at least two objects according to the combined target text vector, thereby ensuring smooth operation of the image generation model and providing stable image generation capability.
Drawings
FIG. 1 is a schematic diagram of a multi-concept customization model provided in one embodiment of the present disclosure;
fig. 2 is a schematic application scenario diagram of an image generating method according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of an image generation method provided by one embodiment of the present description;
FIG. 4 is a process flow diagram of an image generation method according to one embodiment of the present disclosure;
FIG. 5 is a flow chart of another image generation method provided by one embodiment of the present disclosure;
FIG. 6 is a flow chart of an image generation model training method provided in one embodiment of the present disclosure;
fig. 7 is a schematic structural view of an image generating apparatus according to an embodiment of the present specification;
FIG. 8 is a schematic structural diagram of an image generation model training apparatus according to an embodiment of the present disclosure;
fig. 9 is a schematic structural view of another image generating apparatus provided in one embodiment of the present specification;
FIG. 10 is a block diagram of a computing device provided in one embodiment of the present description.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. This specification may be embodied in many forms other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; therefore, this specification is not limited by the specific implementations disclosed below.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first may also be referred to as a second, and similarly, a second may also be referred to as a first, without departing from the scope of one or more embodiments of the present description. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
First, terms related to one or more embodiments of the present specification will be explained.
Diffusion model: a deep generative model inspired by non-equilibrium thermodynamics; diffusion models are the current SOTA among deep generative models.
Customized generation (Customized Generation): a generation task conditioned on concepts given by the user; multi-subject customization generates an image containing several user-provided customized concepts.
SOTA (state-of-the-art): used in machine learning to describe the model that currently achieves the best result on a given task.
U-Net: an image segmentation network.
Because the text-to-image technique realized by neural network models is widely applied in various computer service scenarios, in order to enable the neural network model to generate multi-concept customized images, the model is made to learn each multi-concept customized sentence separately during training. However, since there are many concept words in real scenarios, the number of multi-concept customized sentences formed by combining several concept words is extremely large; as a result, some multi-subject image generation methods are prone to attribute confusion when synthesizing several subjects (such as the concepts "sunglasses" and "man"), which reduces the similarity between the generated image and the reference image. In addition, most of these methods train the multi-concept data jointly, and must be retrained whenever a new object needs to be customized, wasting a great deal of computation.
For example, the diffusion model is a parametric neural network that learns the image distribution by gradual denoising. To further explore the scalability of diffusion models, much work has been devoted to diffusion-based conditional generation, which can be broadly divided into two categories. The first, called classifier guidance, utilizes a classifier to steer the sampling process of a pre-trained unconditional model. The second, called classifier-free guidance, directly collects a large number of data pairs for joint optimization under the guarantee of conditional probability derivation; this approach can produce very detailed results, but requires a large amount of data and computing resources. As language and cross-modal foundation models evolve, much text-to-image work using classifier-free guidance has appeared, enabling explicit control over the corresponding semantics and styles. However, the expressive power of text is still limited, and further work guides global control with additional conditional information such as reference images, backgrounds and sketches. The text-to-image diffusion model guides the denoising process with text prompts describing the image content; it is typically trained by denoising noisy images, which makes model training inefficient.
It should be specifically noted that the present specification describes a multi-concept customization model for text-to-image diffusion, whose basic idea is to obtain better customized generation through fine-tuning of the cross-attention mechanism (cross attention), but it likewise suffers from low training efficiency and wastes a great deal of computing resources. The purpose of customized generation is to implant a given subject into the diffusion model and bind a unique text identifier to it to indicate its existence, so that the model can vividly generate various reproductions of the subject under the guidance of text prompts. Referring to fig. 1, fig. 1 is a schematic structural diagram of a multi-concept customization model provided in an embodiment of the present disclosure; the multi-concept customization model is a diffusion model with a U-Net, and it adopts a joint training approach that trains the model directly on multi-concept customized sentences containing multiple concepts.
The specific process is as follows: the multi-concept customized sentence "photo of a v* dog" is input to a text converter (Text Transformer), which maps it into an embedding vector (embedding). The dimension of the embedding vector output by the text converter may be 77 x 1024, where the 77 positions correspond to the tokens of the sentence "photo of a v* dog" (for example, the first few positions correspond to the words "photo", "of", "a", "v*", "dog", respectively), and each token embedding has 1024 dimensions. After the sentence embedding vector is obtained, it is input into different Attention layers (attention neural network layers). Through the Attention layers and residual networks (ResNet), the noisy image x_t is restored to the puppy image x_{t-1}. The trainable vector (Trainable) refers to the vector that needs to be trained and fine-tuned in the diffusion model, that is, the "v*" vector in fig. 1; the frozen vector (Frozen) refers to a vector that is fixed in the diffusion model and does not need training, namely the vectors of "photo", "of", "a", "dog" and the like in fig. 1.
Based on this, in the present specification, an image generation method, an image generation model training method, another image generation method, an image generation model training apparatus, two kinds of image generation apparatuses, a computing device, a computer-readable storage medium, and a computer program are provided, which are described in detail one by one in the following embodiments.
Fig. 2 is a schematic application scenario of an image generating method according to an embodiment of the present disclosure. Referring to fig. 2, a user inputs the text "a cat and a dog" to a terminal; after receiving the text, the terminal inputs it into a diffusion model, and an encoder in the diffusion model encodes the text to obtain a text embedding vector. According to the words "cat" and "dog" in the text, the residual vectors of the cat and dog objects are determined from the pre-cached residual vectors. Adding these residual vectors to the text embedding vector causes the corresponding objects (the cat and the dog) to appear in the generated image, thereby obtaining a multi-concept customized image. Based on this, the image generation method provided in this specification finds one residual vector for each concept (object), which can then be reused indefinitely. Moreover, no combination of multiple concepts needs to be trained, unlike the joint multi-concept training used in other methods, which greatly saves computing resources.
Fig. 3 shows a flowchart of an image generating method according to an embodiment of the present specification, which specifically includes the following steps.
Step 302: an object description text comprising at least two objects is input into an image generation model, and an initial text vector of the object description text is determined.
Where an object may be understood as a particular subject that needs to be generated as an image, the object includes, but is not limited to, an article, an animal, a plant, a person, a landscape, and the like. For example, the subject may be a sunglasses, cat, dog, male, etc., which is not particularly limited in this specification. The object can be understood as a concept, a body, or the like in the above-described embodiments. Object description text containing at least two objects, which can be understood as language for describing two or more objects, for example, two objects of "sunglasses" and "men" are described in the sentence "sunglasses-wearing men"; the phrases "a cat and a dog" describe two subjects, "cat" and "dog". In practical applications, the object description text may be set as needed, which is not specifically limited in this specification. The image generation model may be understood as a model capable of generating an object image comprising at least two objects from an object description text comprising at least two objects. For example, the image generation model may be a neural network model, a diffusion model, or the like that implements the text-to-image technique, which is not particularly limited in this specification. The initial text vector may be understood as a feature vector that characterizes each term in the object description text. For example, the initial text vector is a text embedded vector corresponding to the sentence "one cat and one dog"; it should be noted that the dimension of the text embedding vector may be set according to the actual application scenario.
Specifically, the image generating method provided by the specification can obtain the object description text containing at least two objects, and input the object description text into the image generating model, so as to determine the initial text vector corresponding to the object description text. Further, in an embodiment provided in the present disclosure, in order to ensure that the image generation model can generate a vivid and accurate object image, a text encoding module in the image generation model is required to encode the object description text, so that the accurate object image can be generated based on the encoded text embedded vector. Specifically, the inputting the object description text containing at least two objects into the image generation model, determining the initial text vector of the object description text, includes:
determining an object description text containing at least two objects;
and inputting the object description text into an image generation model, and processing the object description text by using a text coding module in the image generation model to obtain an initial text vector corresponding to the object description text.
In an embodiment provided in the present specification, the object description text including at least two objects may be provided by a user, based on which the determining the object description text including at least two objects includes:
And receiving an image generation request sent by a user, and acquiring an object description text containing at least two objects from the image generation request. The image generation request may be understood as a request indicating a computing device to which the image generation method is applied to generate an object image. It should be noted that, the image generating method provided in the present specification may be applied to a computing device such as a computer, a mobile phone, a client, a mobile terminal, or a server, which is not specifically limited in the present specification.
Based on the above, the image generation method provided in the present specification can receive an image generation request sent by a user, where the image generation request carries an object description text containing at least two objects. And subsequently, an object description text containing at least two objects is input into the image generation model in response to the image generation request, and an object image containing at least two objects is obtained, so that corresponding object images are generated according to different requirements of users.
Furthermore, in some special computer service scenarios, object description text containing at least two objects may be pre-stored in a computing device to which the image generation method is applied. When receiving the image generation request, the method can quickly obtain the locally stored object description text to generate the corresponding object image. Specifically, based on this, the determining the object description text containing at least two objects includes:
And acquiring a text identifier from the received image generation request, and determining an object description text corresponding to the text identifier from pre-stored object description texts, wherein the object description text comprises at least two objects.
The text identification may be understood as information that uniquely identifies an object description text, for example, the object description text "a cat and a dog" may be represented by the character a. Based on this, the image processing method provided in the present specification, when an image generation request sent by a user or other computing device is received, acquires a text identifier from the image generation request. And quickly determining the object description text corresponding to the text identifier from the locally stored object description text containing at least two objects. The object description text is then input into an image generation model, and an object image containing at least two objects is obtained.
The text encoding module may be understood as a network layer for encoding the object description text in the image generation model, for example, the text encoding module may be a text encoder (text encoder).
Specifically, the image generating method provided by the present specification can input an object description text containing at least two objects into an image generating model after determining the object description text, and encode the object description text by using a text encoding module in the image generating model, so as to obtain word text vectors corresponding to each word in the object description text; the word text vector corresponding to each word constitutes the initial text vector corresponding to the object description text.
For example, the object description text is "a cat and a dog". Based on this, by inputting the object description text into a text encoder in a diffusion model, the text encoder is utilized to encode "a cat and a dog" to obtain a text embedded vector corresponding to each word.
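For illustration only, a minimal sketch of this encoding step, assuming an open-source CLIP-style text encoder (the patent does not name a specific encoder; the model name, embedding width and library calls below are assumptions):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Hypothetical choice of text encoding module; any encoder producing per-token
# embeddings (the description mentions a 77 x 1024 output) would play this role.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

text = "a cat and a dog"  # object description text containing two objects
tokens = tokenizer(text, padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    # initial text vector: one embedding per token position, here of shape (1, 77, 768)
    initial_text_vector = text_encoder(**tokens).last_hidden_state
```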
Step 304: determining pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated from the training text vectors of the at least two objects during training of the image generation model.
The pre-stored object vector can be understood as the residual vector corresponding to a single object, generated during training of the image generation model. The residual vector may be stored in the computing device to which the image generation method is applied. In practical applications, the pre-stored object vector may also be understood as the residual token embedding of each individual object.
The training text vector can be understood as the text embedding vector generated by encoding each sample object description text during training of the image generation model.
Specifically, in an embodiment provided in the present specification, the determining the pre-stored object vectors of the at least two objects includes:
Determining an object identification word in the object description text, wherein the object description text is composed of the object identification word and object feature words except the object identification word, and one object identification word represents one object;
determining a target pre-stored object vector associated with the object identification word from pre-stored object vectors;
and determining the target pre-stored object vector as the pre-stored object vector of the at least two objects.
The object identification word can be understood as a word directly representing a specific object in the object description text, and one object identification word directly represents an object, through which the object can be determined. For example, in the object description text of "sunglasses-wearing men", the terms "sunglasses" and "men" directly denote the two objects of sunglasses and men, and thus, the terms "sunglasses" and "men" are object identification words. For another example, in the object description text of "a cat and a dog," the terms "cat" and "dog" directly refer to the two objects cat and dog, and thus "cat" and "dog" are object identification words.
The object feature word may be understood as a word describing a feature of an object in the object description text, from which a part of the attribute features of a specific object may be determined. For example, in the object description text of "a cat and a dog," the word "a" describes that the cat and the dog are one, respectively, and the word "and" describes that both the cat and the dog are displayed in the image. Thus, "a" and "are object feature words. It should be noted that, in one embodiment, words other than the object identification word in one object description text may be object feature words.
Specifically, the image generating method provided in the present specification needs to determine, by using the image generating model, an object identifier in the object description text, and determine, from among pre-stored object vectors stored in advance locally, a target pre-stored object vector associated with the object identifier. And then determining the target pre-stored object vector as a pre-stored object vector of at least two objects. Along the above example, after the object description text "a cat and a dog" is input to the diffusion model, the diffusion model determines two words "cat" and "dog" from "a cat and a dog", and determines residual vectors corresponding to the two words "cat" and "dog" from a plurality of locally stored residual vectors, where the residual vectors are residual vectors corresponding to the two concepts of cat and dog.
Based on this, the image generation method provided in this specification finds a residual vector for each concept (object) and stores it, so that it can be reused indefinitely. In addition, the residual vector of each concept requires only a small amount of storage, for example about 5 KB. Other multi-concept customized generation methods require the parameters of the diffusion model to be stored at full numerical precision, which can incur considerable cost for large-scale applications on mobile devices. The image generation method provided in this specification stores only the text-embedding residual vector of each concept, requiring little storage and reducing cost.
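A minimal sketch of how such per-concept residual vectors might be stored and looked up; the file layout and names are illustrative assumptions, not part of the patent:

```python
import os
import torch

emb_dim = 768  # assumed embedding width; must match the text encoder used above
os.makedirs("residuals", exist_ok=True)

# During training, one residual embedding is learned per concept and saved
# separately; each file is only a few kilobytes, as noted above.
torch.save(torch.randn(emb_dim), "residuals/cat.pt")  # placeholders for learned residuals
torch.save(torch.randn(emb_dim), "residuals/dog.pt")

def load_residuals(object_words):
    """Look up the pre-stored residual vector for each object identification word."""
    return {w: torch.load(f"residuals/{w}.pt") for w in object_words}

residuals = load_residuals(["cat", "dog"])  # e.g. for the text "a cat and a dog"
```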
Step 306: and processing the initial text vector through the pre-stored object vector to obtain a target text vector, and generating an object image containing at least two objects according to the target text vector.
The target text vector may be understood as a vector obtained by adjusting the initial text vector based on the pre-stored object vectors. An object image containing the at least two objects may be understood as an image containing two or more objects, e.g., an image containing a cat and a dog, or an image of a man wearing sunglasses.
In an embodiment of the present disclosure, an image generating method for generating a multi-subject image is provided, where the method implements custom synthesis of a plurality of subjects by learning residual vectors of a single subject, so as to generate a multi-concept customized image based on a combination of the plurality of residual vectors, specifically, the processing the initial text vector by the pre-stored object vector to obtain a target text vector, including:
according to the object identification words in the object description text, determining vectors to be adjusted corresponding to the at least two objects from the initial text vectors;
and adding the pre-stored object vector and the vector to be adjusted in the initial text vector to obtain a target text vector.
The vector to be adjusted may be understood as a text embedded vector corresponding to each object.
Specifically, in the image generation method provided by this specification, the image generation model determines the object identification words in the object description text, and determines the vectors to be adjusted corresponding to the at least two objects from the initial text vector according to the object identification words; the vectors to be adjusted in the initial text vector are then adjusted based on the pre-stored object vectors to obtain the target text vector. For example, after the diffusion model determines the residual vectors of the two objects "cat" and "dog", it takes the text embedding vectors corresponding to the words "cat" and "dog" within the text embedding vector of the object description text "a cat and a dog" as the text embedding vectors corresponding to the two objects. The residual vectors of the cat and dog objects are then added to the corresponding text embedding vectors, that is, the residuals are added onto the basic category embeddings of "a cat and a dog", generating the final text embedding vector containing the object residual vectors. The image generated from this final text embedding vector then contains the corresponding subjects (the cat and the dog).
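Continuing the sketches above, the target text vector could be formed by adding each concept's pre-stored residual onto the embedding at that concept's token position; the helper below is an illustrative assumption, not the patent's implementation:

```python
def build_target_text_vector(initial_text_vector, tokens, tokenizer, residuals):
    """Add each pre-stored residual onto the embedding of its object identification word."""
    target = initial_text_vector.clone()
    input_ids = tokens["input_ids"][0].tolist()
    for word, residual in residuals.items():
        # token id of the object identification word (single-token words assumed)
        word_id = tokenizer(word, add_special_tokens=False)["input_ids"][0]
        for pos, tok in enumerate(input_ids):
            if tok == word_id:
                target[0, pos] += residual  # vector to be adjusted + pre-stored object vector
    return target

target_text_vector = build_target_text_vector(initial_text_vector, tokens, tokenizer, residuals)
```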
In one embodiment provided herein, after a target text vector is determined, the target text vector is processed using a vector processing module in the image generation model to generate an object image comprising at least two objects. Specifically, the generating an object image including the at least two objects according to the target text vector includes:
and processing the target text vector by using a vector processing module in the image generation model to obtain an object image containing the at least two objects.
The vector processing module may be understood as a module that decodes the target text vector to generate the object image; for example, it can be a U-Net model. After the final text embedding vector containing the object residual vectors is obtained, it is input to the U-Net model for decoding, thereby obtaining an image containing a cat and a dog.
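For instance, with an open-source latent diffusion pipeline used purely as an illustrative stand-in for the vector processing module (the pipeline name and the `prompt_embeds` argument are assumptions about a particular library, not the patent's API), the edited embedding can be passed in place of the prompt:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# target_text_vector plays the role of the target text vector; the U-Net denoises
# a random latent conditioned on it to produce the object image.
image = pipe(prompt_embeds=target_text_vector.to("cuda", torch.float16),
             num_inference_steps=50).images[0]
image.save("cat_and_dog.png")
```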
Further, in the embodiment provided in this specification, the image generation method not only reduces the computing resources consumed during model training, but also avoids attribute confusion when generating multi-subject images (i.e., multi-concept customized images) at the model application stage. Other multi-subject image generation methods are prone to attribute confusion when synthesizing several subjects, which reduces the similarity between the generated image and the reference image; for example, the puppy and the kitten may overlap in the image. Therefore, a method based on a customized diffusion model is provided, which realizes customized synthesis of multiple subjects by learning the residual vector of each single subject, avoiding attribute confusion. Specifically, the processing the target text vector by using a vector processing module in the image generation model to obtain an object image containing the at least two objects includes:
Decoding the target text vector by using the vector processing module to obtain image features corresponding to the object description text;
according to the object display position parameters of the at least two objects, adjusting the image characteristics to obtain adjusted image characteristics;
and continuing to decode the adjusted image features to obtain an object image containing the at least two objects.
The image feature may be understood as the attention map determined for each word in the object description text by the vector processing module using an attention mechanism when generating the object image based on the target text vector. That is, the image feature may be understood as an attention map.
The object display position parameter is understood to be a parameter indicating the display position of an object in the object image, and may be coordinate information or region information. In practical applications, to achieve high-quality multi-subject (i.e., object) generation, the method uses a layout (i.e., the object display position parameters), an abstract and readily accessible prior, as spatial guidance for arranging the subjects in the attention maps. The user may define it as a set of subject bounding boxes that describe the spatial composition of the subjects. Specifically, the present solution emphasizes the required regions corresponding to these subjects in the attention maps, while imposing restrictions on regions related to other subjects according to the layout prior, thereby obtaining adjusted attention maps; an accurate object image is subsequently generated based on these attention maps, avoiding attribute confusion.
For example, in the image generation method provided in this specification, after the text embedding vector containing the residual vectors is obtained, it is first input into the U-Net model (i.e., the vector processing module). The U-Net model decodes the text embedding vector and determines the attention maps of the text embedding vector via the attention mechanism, where each word corresponds to one attention map. Second, according to the layout provided by the user for the two objects "cat" and "dog", the two objects are arranged into their spatial locations in the attention maps, resulting in adjusted attention maps. Finally, the U-Net model generates an image containing the cat and the dog according to the adjusted attention maps, so that each object is accurately displayed in the image and attribute confusion between objects is avoided.
Further, in an embodiment provided in the present disclosure, the adjusting the image feature according to the object display position parameters of the at least two objects to obtain an adjusted image feature includes:
acquiring object display position parameters of the at least two objects, and determining target object display position parameters corresponding to target objects from the object display position parameters, wherein the target objects are any one of the at least two objects;
Determining target object image characteristics corresponding to the target object from the image characteristics;
and determining the target position of the target object in the target object image characteristic based on the object display position parameter, and adjusting the object display weight of the target object in the target position to obtain the adjusted image characteristic.
Wherein a target object may be understood as any one of at least two objects. For example, the at least two subjects are two subjects, a cat and a dog, and the target subject is the cat or the dog.
The target object image feature refers to the cross-attention map corresponding to the target object among the plurality of attention maps. For example, each word in the object description text "a cat and a dog" has a corresponding cross-attention map, and the cross-attention maps corresponding to the words "cat" and "dog" are the cross-attention maps corresponding to the cat and dog objects.
The object display weight may be understood as a weight that a target object is displayed in a specific area, and may be understood as a signal that the target object is in an image area where a user wants it to appear. The stronger the weight or signal, the higher the accuracy with which the object is displayed in the image area; the weaker the weight or signal, the lower the accuracy with which the object is displayed in that image region.
Continuing the above example, the first problem in multi-subject customization is that some subjects may not be displayed as described in the text prompt. The method attributes this to insufficient activation values in the cross-attention maps of these subjects. To avoid this, the present solution strengthens the signal of the target subject in the region where the user wishes it to appear and weakens the signals of unrelated subjects. The second problem in multi-subject customization is attribute confusion: a subject in the generated image may contain features of other subjects. The present solution attributes this to overlapping activation regions of different subjects in the cross-attention maps; to avoid it, the method attenuates the signal component of each customized object that appears in the regions of other customized objects.
Based on the above, the method obtains the layout of the two objects, the cat and the dog, sent by the user, and determines the cross-attention maps corresponding to the cat and the dog, i.e., the cross-attention maps corresponding to the words "cat" and "dog", from the plurality of cross-attention maps corresponding to the text embedding vector of "a cat and a dog". The display region of each object in its corresponding cross-attention map is determined based on the layout prior provided by the user: for example, the display region of the dog is determined in the dog's cross-attention map, and the display region of the cat in the cat's cross-attention map. Then, the weight of the corresponding object is strengthened within its own region, while the weight of each object in other regions is reduced. For example, the signal (i.e., weight) of the dog is enhanced in the region corresponding to the dog and lowered in the display regions of other objects. In this way, the signal of the target subject (object) is strengthened in the region where it is expected to appear, signals of unrelated subjects are weakened, and the signal component of each customized object appearing in other customized objects' regions is attenuated, thereby resolving attribute confusion through the attention mechanism.
In particular, the manner in which the target subject signals in different regions of a cross-attention map are adjusted based on the cross-attention mechanism can be seen in the following formula (1), which shows how to enhance the signal components of the target object in its own region and attenuate the signal components of the target object in the regions of other objects.
Where EditedCA denotes the edited cross-attention, CA denotes the attention map, softmax is the activation function, t is the timestep, c denotes the embedding (embedded vector) of the input text, s denotes the dimension (token position) of the object (e.g., dog), E denotes an average value, and W denotes a weight matrix. M refers to the user-provided layout mask, i.e., each region designated by the layout is marked 1 and all other regions 0; η is a weight parameter.
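A minimal sketch of the idea behind this attention editing (not the patent's exact formula (1), which is not reproduced in this text): each object's cross-attention is strengthened inside its own layout region and weakened in the regions of other objects, then renormalized. All tensor shapes and the scaling choice are assumptions:

```python
import torch

def edit_cross_attention(attn_scores, token_positions, layout_masks, eta=1.0):
    """attn_scores: (heads, pixels, tokens) pre-softmax cross-attention scores.
    token_positions: {object word: token index}.
    layout_masks: {object word: (pixels,) 0/1 mask from the user-provided layout M}."""
    edited = attn_scores.clone()
    for word, pos in token_positions.items():
        mask = layout_masks[word]               # 1 inside this object's box, 0 elsewhere
        boost = eta * edited[..., pos].mean()   # adjustment scale (illustrative, cf. E and eta)
        edited[..., pos] += mask * boost        # strengthen the signal in the object's own region
        edited[..., pos] -= (1 - mask) * boost  # weaken it in regions belonging to other objects
    return torch.softmax(edited, dim=-1)        # renormalize into an edited attention map
```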
In an embodiment provided in this specification, aiming at the problem that joint training of multi-concept data must be repeated whenever a new customized object is added, the method combines multiple concepts in a training-free manner: a residual vector is found for each concept only once during model training, and can then be reused indefinitely. Based on this, the training process of the image generation model in this method is as follows. Before inputting an object description text containing at least two objects into the image generation model and determining the initial text vector of the object description text, the method further comprises:
Determining a training sample aiming at an image generation model, wherein the training sample comprises a sample object description text and a reference object image corresponding to the sample object description text;
inputting the sample object description text into the image generation model, and processing the sample object description text by using a text coding module in the image generation model to obtain a training text vector corresponding to the sample object description text;
decoding the training text vector by using a vector processing module in the image generation model to obtain a sample object image corresponding to the sample object description text, and generating a pre-stored object vector corresponding to the sample object according to the training text vector;
and determining a loss function of the image generation model according to the reference object image and the sample object image, and training the image generation model based on the loss function to obtain a trained image generation model.
The sample object description text is understood to be the description text of a single object as a sample. For example, "image of one dog", "photograph of one cat", "a photo of dog". The reference object image may be understood as an image of an object described in the sample object description text. For example, the sample object description text is "an image of a dog", and the reference object image is an image of a dog.
The training text vector can be understood as the text embedding vector corresponding to the sample object description text.
For example, the present method provides an image generation model training scheme that trains the diffusion model with the goal of generating vivid, accurate new images containing any combination of these subjects. To this end, the method determines a residual vector for each object; at generation time, the residual vector of a specific subject is combined with the pre-trained diffusion model and a layout-guided generation process is used.
Training of the diffusion model includes: a subject description (sample object description text) such as "a photo of dog" and its 3-5 corresponding sample photographs of the dog taken from different angles (reference object images) are given as training samples. "a photo of dog" is input to the text encoder in the diffusion model, which outputs the corresponding text embedding vector (training text vector).
And inputting the text embedded vector into a U-Net model in the diffusion model to obtain a corresponding generated image. Meanwhile, the scheme can generate a residual vector corresponding to the dog object according to the text embedding vector corresponding to the 'a photo of dog'.
After the image generated by the U-Net model is obtained, a loss function for the diffusion model is determined based on the generated image and the sample photographs, and the diffusion model is trained based on the loss function until a training stop condition is reached, thereby obtaining the trained diffusion model.
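A schematic of this training loop, assuming a standard noise-prediction diffusion objective and that `text_encoder`, `unet`, `scheduler` and a `data_iter` yielding (latents, prompt_ids) pairs have already been set up (all names are illustrative assumptions). Only the text encoder is tuned while the U-Net stays frozen, matching the scheme of fig. 4; in the full method the regularization term of formula (4) below would be added to this loss:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-5)  # only the encoder is tuned
unet.requires_grad_(False)                                          # vector processing module frozen

for step in range(1000):
    latents, prompt_ids = next(data_iter)          # reference dog images + "a photo of dog"
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy = scheduler.add_noise(latents, noise, t)

    text_emb = text_encoder(prompt_ids).last_hidden_state          # training text vector
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample   # predicted noise

    loss = F.mse_loss(pred, noise)                 # first loss: predicted vs. true noise
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```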
In an embodiment provided in this specification, in order to reduce the waste of computing resources, the pre-stored object vector corresponding to each object needs to be determined. The pre-stored object vector is generated in the following manner (see also the sketch after this list); the generating a pre-stored object vector corresponding to the sample object according to the training text vector includes:
determining a first sample object text vector corresponding to the sample object from the training text vector according to the sample object identification word in the sample object description text;
inputting the sample object description text into a text coding module in a pre-training image generation model to obtain an untrained text vector of the sample object description text;
determining a second sample object text vector corresponding to the sample object from the untrained text vector according to the sample object identification word;
and taking the average difference value of the first sample object text vector and the second sample object text vector as a pre-stored object vector of the sample object.
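A sketch of how this average difference could be computed once fine-tuning is done, assuming access to the tuned encoder (`tuned_encoder`), the frozen original encoder (`frozen_encoder`) and the tokenizer from the earlier sketches; the prompt list and all names are illustrative:

```python
import torch

prompts = ["a photo of dog", "one dog on beach", "one dog sleeping"]  # dog-related texts

def object_embedding(encoder, tokenizer, text, word="dog"):
    """Return the embedding at the token position of the object identification word."""
    tokens = tokenizer(text, padding="max_length", max_length=77, return_tensors="pt")
    emb = encoder(**tokens).last_hidden_state[0]
    word_id = tokenizer(word, add_special_tokens=False)["input_ids"][0]
    pos = tokens["input_ids"][0].tolist().index(word_id)
    return emb[pos]

with torch.no_grad():
    diffs = [object_embedding(tuned_encoder, tokenizer, p) -
             object_embedding(frozen_encoder, tokenizer, p) for p in prompts]
    residual_dog = torch.stack(diffs).mean(dim=0)   # pre-stored object vector for "dog"
torch.save(residual_dog, "residuals/dog.pt")
```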
A sample object is understood to be an object that is a sample, and is identical to the object in the above embodiment. The sample object identifier word may be understood as a word in the sample description text that directly represents a specific sample object, and a sample object identifier word directly represents a sample object. For example, in the sample object description text of "a photo of dog," the word "dog" directly indicates the sample object. The first sample object text vector may be understood as a text embedding vector corresponding to the sample object in the training text vector. The pre-training image generation model refers to an image generation model before the current model training is performed on the image generation model. The untrained text vector refers to a text embedded vector corresponding to the sample object description text obtained through a text coding module in the pre-training image generation model.
Following the above example, in the text-to-image model (diffusion model), the text given by the user is encoded by a text encoder (text encoder) to obtain the corresponding text embedding vector (text embedding). In the process of determining the residual vector of each object, the method seeks an average variation: the average, over various texts (such as "one dog on beach", "one dog sleeping", etc.), of the difference at the customized object's dimension (e.g., the dog's) between the embeddings computed by the trained text encoder and by the frozen (training-fixed) text encoder, where frozen means that the encoder is kept fixed in the diffusion model and is not trained. By computing the average difference of these texts between the embedding vectors output by the two text encoders (i.e., the trained text vector and the untrained text vector) at that vector dimension (i.e., the first sample object text vector and the second sample object text vector), this average difference is taken as the residual vector. For the calculation of the residual vector, refer to the following formula (2), which computes, over different texts, the category-to-specific-object residual vector from the trained text encoder and the frozen original text encoder:
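The original equation image is not reproduced in this text; a plausible reconstruction of formula (2) from the symbol descriptions below (an assumption, not the authoritative form) is:

\Delta_{\mathrm{dog}} \;=\; \mathbb{E}_{c \in C_{\mathrm{dog}}}\big[\, E_{\mathrm{custom}}(c)_{s} - E_{\mathrm{frozen}}(c)_{s} \,\big]

i.e., the residual is the average, over dog-related texts, of the difference between the trained and frozen text encoders' embeddings at the object's token position s.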
where Δ_dog denotes the residual vector of the dog object, C_dog denotes all the dog-related texts used to average this change, E_custom denotes the trained text encoder, and c is the input text.
In the embodiments provided in this specification, a small number of images of the customized subject are given during training of the image generation model, and the text encoder is fine-tuned to learn a residual embedding on top of the basic embedding of the original subject. The loss used for this fine-tuning is calculated as follows. The determining a loss function of the image generation model according to the reference object image and the sample object image includes:
determining a first loss function of the image generation model according to the reference object image and the sample object image;
determining a second loss function of the image generation model according to the untrained text vector and the trained text vector;
and adding the first loss function and the second loss function to obtain a loss function of the image generation model.
In the above example, in the process of training the diffusion model, the present specification calculates a first loss function from the image generated by the U-Net model and the reference object image; this loss function can be expressed as the following formula (3).
\mathcal{L}_{1} = \mathbb{E}_{x, c, \epsilon, t}\left[ \lVert \epsilon - \hat{\epsilon} \rVert_{2}^{2} \right], \quad \hat{\epsilon} = x_{\theta}(x_{t}, c, t)   (3)

wherein \mathbb{E} denotes the mean value (expectation), x_{\theta} is the diffusion model with its parameters, \epsilon is the true noise, \hat{\epsilon} is the predicted noise, c is the input text, and t is the time step.
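As an illustrative sketch only, formula (3) corresponds to a standard noise-prediction objective; the U-Net call signature below is an assumption and not part of the original disclosure.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(unet, noisy_latents, timesteps, text_embeddings, true_noise):
    """First loss (formula (3)): mean squared error between the noise actually added
    to the latents and the noise predicted by the U-Net, conditioned on the text."""
    predicted_noise = unet(noisy_latents, timesteps, text_embeddings)  # assumed signature
    return F.mse_loss(predicted_noise, true_noise)
```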
After the first loss function is determined, the embedding vectors output by the trained text encoder and by the training-fixed text encoder are considered respectively; from these two embedding vectors, the embedding vectors other than the one corresponding to the sample object are taken, and a second loss function is calculated based on them, as shown in the following formula (4). Formula (4) constrains the embedding dimensions other than the object's own dimension, so that the fine-tuned encoder stays close to the original encoder everywhere except at the object word:
\mathcal{L}_{2} = \mathbb{E}_{c \in C_{dog}} \left[ \sum_{p \neq s} \lVert E_{custom}(c)_{p} - E_{frozen}(c)_{p} \rVert_{2}^{2} \right]   (4)

wherein E_{custom} refers to the trained text encoder, E_{frozen} refers to the frozen original text encoder, p indexes the dimensions (token positions) of the text embedding, s refers to the dimension of the object (e.g., dog), c is the input text, \mathbb{E} represents the average value, and C_{dog} refers to all dog-related texts.
Finally, the overall loss of the diffusion model is the sum of these two losses.
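The following is a minimal sketch, under the same assumptions as the previous snippets, of how the second loss and the overall loss could be assembled; the tensor shapes and names are illustrative only.

```python
import torch

def embedding_regularization_loss(trained_emb, frozen_emb, object_position):
    """Second loss (formula (4)): penalize, at every token position except the
    object's, the drift of the fine-tuned text encoder from the frozen encoder."""
    seq_len = trained_emb.shape[0]                         # trained_emb: (seq_len, dim)
    mask = torch.ones(seq_len, dtype=torch.bool)
    mask[object_position] = False                          # exclude the object dimension s
    diff = trained_emb[mask] - frozen_emb[mask]
    return (diff ** 2).sum(dim=-1).mean()

# Overall loss of the diffusion model: the sum of the two losses, e.g.
# total_loss = reconstruction_loss(...) + embedding_regularization_loss(...)
```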
According to the image generation method provided by the specification, the image generation model does not need to learn each multi-concept customized sentence, but rather the image generation model is trained to learn each individual object (concept), so that the problems that the training efficiency of the neural network model is low and a large amount of computing resources are wasted due to the large number of the multi-concept customized sentences are avoided. Meanwhile, in the model training process, a pre-stored object vector corresponding to each individual object is generated and stored. Therefore, in practical application, under the condition that the object description text containing at least two objects is input into the image generation model, the image generation model can determine pre-stored object vectors corresponding to the at least two objects, combine the pre-stored object vector of each object with the initial text vector of the object description text, and then generate an object image containing the at least two objects according to the combined target text vector, thereby ensuring smooth operation of the image generation model and providing stable image generation capability.
The image generation method provided in the present specification will be further described with reference to fig. 4 by taking an application of the image generation method in a scene customized for multiple concepts in a diffusion model as an example. Fig. 4 is a flowchart of a processing procedure of an image generating method according to an embodiment of the present disclosure, and as can be seen from fig. 4, the image generating method according to the present disclosure includes two stages: training phase and application phase.
The processing procedure for the training phase comprises the following steps:
First, a set of texts from different angles (e.g., the text "a photo of dog" in fig. 4) is given as training samples, and the reference images corresponding to the training samples (2-5 images of the dog) are determined. The training samples are input into a text encoder for training (tuning) to obtain the text embedding vectors (text embedding) corresponding to the training samples. The text embedding vector is input into a training-fixed (frozen) U-Net model, which generates an image using a noise image and the text embedding vector, thereby obtaining the image output by the model. The diffusion model comprises the text encoder and the U-Net model.
Next, the training sample "a photo of dog" is input into a training-fixed (frozen) text encoder to obtain the corresponding text embedding vector, wherein the training-fixed text encoder is the original text encoder before this round of training.
The text embedding vectors output by the two text encoders are determined, the average difference between the two text embedding vectors at the dimension of "dog" is calculated, and this average difference is taken as the residual vector of the object "dog" (the residual vector shown in fig. 4).
Finally, a loss function (namely formula (3) above) is calculated based on the generated image output by the U-Net model and the reference image; from the two text embedding vectors, another loss function (namely formula (4) above) is calculated from the text embedding vectors at all dimensions except the dimension "dog". The sum of the two loss functions is taken as the overall loss function of the diffusion model, and the diffusion model is trained based on this overall loss function to obtain the trained diffusion model. A consolidated sketch of one such training step is given below.
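For illustration only, the following sketch puts the pieces above together into a single fine-tuning step. It assumes a diffusers-style noise scheduler and reuses the two loss functions sketched earlier; only the text encoder is updated, and the U-Net weights stay frozen while gradients still flow through its text-embedding input. All interfaces are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def training_step(unet, trained_encoder, frozen_encoder, token_ids, object_position,
                  latents, noise_scheduler, optimizer):
    """One fine-tuning step: reconstruction loss (formula (3)) plus embedding
    regularization (formula (4)); only the text encoder's parameters are optimized."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)   # assumed scheduler API

    trained_emb = trained_encoder(token_ids)               # fine-tuned encoder (requires grad)
    with torch.no_grad():
        frozen_emb = frozen_encoder(token_ids)              # original frozen encoder

    predicted_noise = unet(noisy_latents, t, trained_emb)   # U-Net weights kept frozen elsewhere
    loss_rec = F.mse_loss(predicted_noise, noise)            # formula (3)

    mask = torch.ones(trained_emb.shape[1], dtype=torch.bool)
    mask[object_position] = False
    loss_reg = ((trained_emb[:, mask] - frozen_emb[:, mask]) ** 2).sum(-1).mean()  # formula (4)

    loss = loss_rec + loss_reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```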
The processing procedure for the application phase comprises the following steps:
first, text "a cat and a dog" sent by a user is received, and the text is input into a diffusion model to train a fixed text encoder, so as to obtain a text embedding vector.
Text embedding vectors of the two objects "cat" and "dog" are determined from the text embedding vector, and the residual vectors corresponding to "cat" and "dog" are obtained from the locally cached residual vectors (the residual vectors shown in fig. 4). Each residual vector is added to the text embedding vector of its object, thereby incorporating the residual vectors into the text embedding vector corresponding to "a cat and a dog".
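A minimal application-phase sketch of this step follows; the positions of the object words and the structure of the cache are assumptions for the example.

```python
import torch

def inject_residual_vectors(text_embedding, object_positions, cached_residuals):
    """Add each object's pre-stored residual vector to the token embedding at that
    object's position in the prompt embedding (e.g. for "a cat and a dog")."""
    adjusted = text_embedding.clone()                      # (seq_len, dim)
    for obj, pos in object_positions.items():              # e.g. {"cat": 2, "dog": 5}
        adjusted[pos] = adjusted[pos] + cached_residuals[obj]
    return adjusted                                         # the target text vector
```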
Next, the text embedding vector is processed through an attention mechanism to determine the cross-attention map corresponding to each word in "a cat and a dog". For the cross-attention maps of the two objects "cat" and "dog", the regions of the two objects are determined based on a user-supplied layout (an abstract and readily accessible prior that the user may define as a set of subject bounding boxes); within each object's region the weight of the corresponding object is strengthened, and in the other regions the weight of each object is reduced.
Finally, after the cross-attention maps have been edited in this way, an image containing the cat and the dog is generated based on the re-weighted cross-attention maps.
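For illustration, a minimal sketch of such layout-guided re-weighting of one object's cross-attention map is given below; the boost and suppression factors and the map layout are assumptions, not values from the original disclosure.

```python
import torch

def reweight_cross_attention(attn_map, box, boost=2.0, suppress=0.5):
    """Strengthen an object's cross-attention inside its user-supplied bounding box
    and weaken it elsewhere. attn_map is an (H, W) map for one object token; box is
    (x0, y0, x1, y1) in the map's coordinate system."""
    x0, y0, x1, y1 = box
    mask = torch.zeros_like(attn_map, dtype=torch.bool)
    mask[y0:y1, x0:x1] = True
    edited = torch.where(mask, attn_map * boost, attn_map * suppress)
    return edited / edited.sum()                           # re-normalize the edited map
```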
Based on the above, the image generation method provided in this specification combines the residual token embeddings (i.e., residual vectors) of specific subjects with a pre-trained diffusion model and uses a layout-guided generation process, so that new images containing any combination of subjects can be generated vividly and accurately. Each subject is represented as a residual token embedding (i.e., residual vector) transferred from its base class; adding the residual token embedding to the base category embedding (i.e., the embedding obtained after "a cat and a dog" is passed through the text encoder) causes the corresponding subject to appear in the generated image. In the inference process, this overcomes two key problems in multi-subject customized generation: subjects failing to appear at all, and attribute confusion between different subjects.
Fig. 5 shows a flowchart of another image generation method provided according to an embodiment of the present specification, which is applied to a cloud-side device, and specifically includes the following steps.
Step 502: and receiving an object description text sent by the terminal side equipment, wherein the object description text comprises at least two objects.
Step 504: and inputting the object description text into an image generation model, and determining an initial text vector of the object description text.
Step 506: and determining pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated for training text vectors of the at least two objects in the training process of generating a model according to the images.
Step 508: and processing the initial text vector through the pre-stored object vector to obtain a target text vector, and generating an object image containing at least two objects according to the target text vector.
Step 510: and sending the object image to the end-side device.
The cloud-side device may be understood as a device that is located in the cloud and is capable of providing cloud services for the end-side device. For example, the cloud-side device may be one or more servers, or one or more hosts. In an embodiment provided in this specification, the cloud-side device may further be composed of a cloud-side computing device and/or a cloud-side storage device. The cloud-side computing device may be understood as a device that is located in the cloud and is capable of providing computing services for the end-side device, for example, one or more servers. The cloud-side storage device may be understood as a device that is located in the cloud and is capable of providing storage services for the end-side device, for example, one or more database storage servers, cloud disks, and the like. The end-side device may be understood as a device that exists opposite to the cloud-side device and is capable of using the cloud services provided by the cloud-side device. The end-side device includes, but is not limited to, a client, a terminal, a computer, a server, a mobile phone or other smart mobile device, and the like.
Specifically, the other image generating method provided in the present specification can be applied to a cloud side device, and when receiving an object description text sent by an end side device, the object description text including at least two objects is input into an image generating model, and an initial text vector of the object description text is obtained by using a text encoding module in the image generating model. And then, determining pre-stored object vectors of at least two objects, and adjusting the initial text vector based on the pre-stored object vectors, so that the pre-stored object vectors are added into the initial text vector, and a target text vector is obtained. The target text vector is processed using the image generation model to obtain an object image comprising at least two objects. After the object image is obtained, it is transmitted to the end-side apparatus.
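Purely as an illustration of such a cloud-side deployment, a minimal sketch of an HTTP endpoint is given below. The framework (Flask), endpoint name, request fields, and the placeholder model call are all assumptions and are not part of the original disclosure.

```python
import base64
import io

from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

def run_image_generation_model(text: str) -> Image.Image:
    # Placeholder for the trained image generation model described above; a real
    # deployment would build the target text vector and run the diffusion model.
    return Image.new("RGB", (512, 512))

@app.route("/generate", methods=["POST"])
def generate():
    # Step 502: receive the object description text sent by the end-side device.
    text = request.get_json()["object_description_text"]   # e.g. "a cat and a dog"
    # Steps 504-508: generate the object image containing the described objects.
    image = run_image_generation_model(text)
    # Step 510: send the object image back to the end-side device.
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return jsonify({"image": base64.b64encode(buf.getvalue()).decode("utf-8")})
```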
Based on this, according to the another image generation method applied to the cloud-side device provided by the specification, the image generation model does not need to learn each multi-concept customized sentence, but rather the image generation model is trained to learn each individual object (concept), so that the problems that the training efficiency of the neural network model is low and a large amount of computing resources are wasted due to the huge number of the multi-concept customized sentences are avoided. Meanwhile, in the model training process, a pre-stored object vector corresponding to each individual object is generated and stored. In practical application, after the object description text sent by the terminal side equipment is obtained, the object description text containing at least two objects is input into the image generation model, the image generation model can determine pre-stored object vectors corresponding to the at least two objects, the pre-stored object vectors of each object are combined with initial text vectors of the object description text, then an object image containing the at least two objects is generated according to the combined target text vectors, and the object image is returned to the terminal side equipment, so that smooth operation of the image generation model is ensured, and the cloud side equipment can stably provide image generation capability for the terminal side equipment.
The above is a schematic version of another image generation method of the present embodiment. It should be noted that, the technical solution of the other image generating method and the technical solution of the one image generating method belong to the same concept, and details of the technical solution of the other image generating method, which are not described in detail, can be referred to the description of the technical solution of the one image generating method, and are not described in detail herein.
FIG. 6 shows a flowchart of an image generation model training method provided in accordance with one embodiment of the present description, including the following steps in particular.
Step 602: and determining a training sample aiming at the image generation model, wherein the training sample comprises sample object description text and a reference object image corresponding to the sample object description text.
Step 604: and inputting the sample object description text into the image generation model, and processing the sample object description text by using a text coding module in the image generation model to obtain a training text vector corresponding to the sample object description text.
Step 606: and decoding the training text vector by using a vector processing module in the image generation model to obtain a sample object image corresponding to the sample object description text, and generating a pre-stored object vector corresponding to the sample object according to the training text vector.
Step 608: and determining a loss function of the image generation model according to the reference object image and the sample object image, and training the image generation model based on the loss function to obtain a trained image generation model.
Specifically, for a description of the technical solution of the image generation model training method, reference may be made to the corresponding description in the foregoing image generation method, which is not repeated here.
According to the image generation model training method provided by the specification, the image generation model does not need to learn each multi-concept customized sentence, but rather the training image generation model learns each individual object (concept), so that the problems that the training efficiency of the neural network model is low and a large amount of computing resources are wasted due to the large number of the multi-concept customized sentences are avoided. Meanwhile, in the model training process, a pre-stored object vector corresponding to each individual object is generated and stored. In practical application, when the object description text containing at least two objects is input into the image generation model, the image generation model can determine pre-stored object vectors corresponding to the at least two objects, combine the pre-stored object vector of each object with the initial text vector of the object description text, and then generate an object image containing the at least two objects according to the combined target text vector, thereby ensuring smooth operation of the image generation model and providing stable image generation capability.
The above is a schematic scheme of an image generation model training method of the present embodiment. It should be noted that, the technical solution of the image generation model training method and the technical solution of the image generation method belong to the same concept, and details of the technical solution of the image generation model training method which are not described in detail can be referred to the description of the technical solution of the image generation method, and are not repeated herein.
Corresponding to the above method embodiments, the present disclosure further provides an embodiment of an image generating apparatus, and fig. 7 shows a schematic structural diagram of an image generating apparatus according to one embodiment of the present disclosure. As shown in fig. 7, the apparatus includes:
a first vector determination module 702 configured to input an object description text comprising at least two objects into an image generation model, determine an initial text vector of the object description text;
a second vector determination module 704 configured to determine pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated for training text vectors of the at least two objects in a training process according to the image generation model;
The image generation module 706 is configured to process the initial text vector through the pre-stored object vector, obtain a target text vector, and generate an object image including the at least two objects according to the target text vector.
Optionally, the second vector determination module 704 is further configured to:
determining an object identification word in the object description text, wherein the object description text is composed of the object identification word and object feature words except the object identification word, and one object identification word represents one object;
determining a target pre-stored object vector associated with the object identification word from pre-stored object vectors;
and determining the target pre-stored object vector as the pre-stored object vector of the at least two objects.
Optionally, the image generation module 706 is further configured to:
according to the object identification words in the object description text, determining vectors to be adjusted corresponding to the at least two objects from the initial text vectors;
and adding the pre-stored object vector and the vector to be adjusted in the initial text vector to obtain a target text vector.
Optionally, the image generation module 706 is further configured to:
and processing the target text vector by using a vector processing module in the image generation model to obtain an object image containing the at least two objects.
Optionally, the image generation module 706 is further configured to:
decoding the target text vector by using the vector processing module to obtain image features corresponding to the object description text;
according to the object display position parameters of the at least two objects, adjusting the image characteristics to obtain adjusted image characteristics;
and continuing to decode the adjusted image features to obtain an object image containing the at least two objects.
Optionally, the image generation module 706 is further configured to:
acquiring object display position parameters of the at least two objects, and determining target object position parameters corresponding to target objects from the object display position parameters, wherein the target objects are any one of the at least two objects;
determining target object image characteristics corresponding to the target object from the image characteristics;
And determining the target position of the target object in the target object image characteristic based on the object display position parameter, and adjusting the object display weight of the target object in the target position to obtain the adjusted image characteristic.
Optionally, the first vector determination module 702 is further configured to:
determining an object description text containing at least two objects;
and inputting the object description text into an image generation model, and processing the object description text by using a text coding module in the image generation model to obtain an initial text vector corresponding to the object description text.
Optionally, the first vector determination module 702 is further configured to:
receiving an image generation request sent by a user, and acquiring an object description text containing at least two objects from the image generation request; or alternatively
And acquiring a text identifier from the received image generation request, and determining an object description text corresponding to the text identifier from pre-stored object description texts, wherein the object description text comprises at least two objects.
Optionally, the image generating apparatus further comprises a model training module configured to:
Determining a training sample aiming at an image generation model, wherein the training sample comprises a sample object description text and a reference object image corresponding to the sample object description text;
inputting the sample object description text into the image generation model, and processing the sample object description text by using a text coding module in the image generation model to obtain a training text vector corresponding to the sample object description text;
decoding the training text vector by using a vector processing module in the image generation model to obtain a sample object image corresponding to the sample object description text, and generating a pre-stored object vector corresponding to the sample object according to the training text vector;
and determining a loss function of the image generation model according to the reference object image and the sample object image, and training the image generation model based on the loss function to obtain a trained image generation model.
Optionally, the model training module is further configured to:
determining a first sample object text vector corresponding to the sample object from the training text vector according to the sample object identification word in the sample object description text;
Inputting the sample object description text into a text coding module in a pre-training image generation model to obtain an untrained text vector of the sample object description text;
determining a second sample object text vector corresponding to the sample object from the untrained text vector according to the sample object identification word;
and taking the average difference value of the first sample object text vector and the second sample object text vector as a pre-stored object vector of the sample object.
Optionally, the model training module is further configured to:
determining a first loss function of the image generation model according to the reference object image and the sample object image;
determining a second loss function of the image generation model according to the untrained text vector and the trained text vector;
and adding the first loss function and the second loss function to obtain a loss function of the image generation model.
According to the image generation device provided by the embodiment of the specification, the image generation model does not need to learn each multi-concept customized sentence, but rather the image generation model is trained to learn each single object (concept), so that the problems that the training efficiency of the neural network model is low and a large amount of computing resources are wasted due to the large number of the multi-concept customized sentences are avoided. Meanwhile, in the model training process, a pre-stored object vector corresponding to each individual object is generated and stored. Therefore, in practical application, under the condition that the object description text containing at least two objects is input into the image generation model, the image generation model can determine pre-stored object vectors corresponding to the at least two objects, combine the pre-stored object vector of each object with the initial text vector of the object description text, and then generate an object image containing the at least two objects according to the combined target text vector, thereby ensuring smooth operation of the image generation model and providing stable image generation capability.
The above is a schematic scheme of an image generating apparatus of the present embodiment. It should be noted that, the technical solution of the image generating apparatus and the technical solution of the image generating method described above belong to the same concept, and details of the technical solution of the image generating apparatus that are not described in detail may be referred to the description of the technical solution of the image generating method described above.
Corresponding to the method embodiment, the present disclosure further provides an embodiment of an image generation model training device, and fig. 8 shows a schematic structural diagram of an image generation model training device provided in one embodiment of the present disclosure. As shown in fig. 8, the apparatus includes:
a sample determining module 802 configured to determine a training sample for an image generation model, where the training sample includes a sample object description text and a reference object image corresponding to the sample object description text;
a vector obtaining module 804, configured to input the sample object description text into the image generation model, and process the sample object description text by using a text encoding module in the image generation model to obtain a training text vector corresponding to the sample object description text;
An image obtaining module 806, configured to decode the training text vector by using a vector processing module in the image generating model, obtain a sample object image corresponding to the sample object description text, and generate a pre-stored object vector corresponding to the sample object according to the training text vector;
a loss determination module 808 is configured to determine a loss function of the image generation model from the reference object image and the sample object image, and train the image generation model based on the loss function, obtaining a trained image generation model.
According to the image generation model training device provided by the specification, the image generation model does not need to learn each multi-concept customized sentence, but rather the training image generation model learns each single object (concept), so that the problems that the training efficiency of the neural network model is low and a large amount of computing resources are wasted due to the large number of the multi-concept customized sentences are avoided. Meanwhile, in the model training process, a pre-stored object vector corresponding to each individual object is generated and stored. In practical application, when the object description text containing at least two objects is input into the image generation model, the image generation model can determine pre-stored object vectors corresponding to the at least two objects, combine the pre-stored object vector of each object with the initial text vector of the object description text, and then generate an object image containing the at least two objects according to the combined target text vector, thereby ensuring smooth operation of the image generation model and providing stable image generation capability.
The above is a schematic scheme of an image generation model training apparatus of the present embodiment. It should be noted that, the technical solution of the image generation model training device and the technical solution of the image generation model training method described above belong to the same concept, and details of the technical solution of the image generation model training device which are not described in detail can be referred to the description of the technical solution of the image generation model training method described above.
Corresponding to the above method embodiments, the present disclosure further provides another embodiment of an image generating apparatus, and fig. 9 shows a schematic structural diagram of another image generating apparatus provided in one embodiment of the present disclosure. As shown in fig. 9, the apparatus is applied to cloud-side equipment, and includes:
the text receiving module 902 is configured to receive an object description text sent by the terminal side device, where the object description text includes at least two objects;
a first vector determination module 904 configured to input the object description text into an image generation model, determining an initial text vector of the object description text;
a second vector determination module 906 configured to determine pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated for training text vectors of the at least two objects in a training process according to the image generation model;
An image generation module 908 configured to process the initial text vector by the pre-stored object vector, obtain a target text vector, and generate an object image containing the at least two objects according to the target text vector;
an image transmitting module 910 configured to transmit the object image to the end-side device.
The image generation model is not required to learn each multi-concept customized sentence, but is trained to learn each single object (concept), so that the problems that the training efficiency of the neural network model is low and a large amount of computing resources are wasted due to the large number of the multi-concept customized sentences are avoided. Meanwhile, in the model training process, a pre-stored object vector corresponding to each individual object is generated and stored. In practical application, after the object description text sent by the terminal side equipment is obtained, the object description text containing at least two objects is input into the image generation model, the image generation model can determine pre-stored object vectors corresponding to the at least two objects, the pre-stored object vectors of each object are combined with initial text vectors of the object description text, then an object image containing the at least two objects is generated according to the combined target text vectors, and the object image is returned to the terminal side equipment, so that smooth operation of the image generation model is ensured, and the cloud side equipment can stably provide image generation capability for the terminal side equipment.
The above is another exemplary embodiment of the image generating apparatus of the present embodiment. It should be noted that, the technical solution of the other image generating apparatus and the technical solution of the other image generating method belong to the same concept, and details of the technical solution of the other image generating apparatus, which are not described in detail, can be referred to the description of the technical solution of the other image generating method.
Fig. 10 illustrates a block diagram of a computing device 1000 provided in accordance with one embodiment of the present description. The components of the computing device 1000 include, but are not limited to, a memory 1010 and a processor 1020. Processor 1020 is coupled to memory 1010 via bus 1030 and database 1050 is used to store data.
Computing device 1000 also includes access device 1040, which access device 1040 enables computing device 1000 to communicate via one or more networks 1060. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the internet. The access device 1040 may include one or more of any type of network interface, wired or wireless (e.g., a Network Interface Card (NIC)), such as an IEEE802.11 Wireless Local Area Network (WLAN) wireless interface, a worldwide interoperability for microwave access (Wi-MAX) interface, an ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 1000, as well as other components not shown in FIG. 10, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device illustrated in FIG. 10 is for exemplary purposes only and is not intended to limit the scope of the present description. Those skilled in the art may add or replace other components as desired.
Computing device 1000 may be any type of stationary or mobile computing device including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smart phone), wearable computing device (e.g., smart watch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 1000 may also be a mobile or stationary server.
The processor 1020 is configured to execute computer-executable instructions that, when executed by the processor 1020, implement the steps of the two image generation methods and the image generation model training method described above.
The foregoing is a schematic illustration of a computing device of this embodiment. It should be noted that, the technical solution of the computing device belongs to the same concept as the technical solutions of the two image generating methods and the image generating model training method, and details of the technical solution of the computing device, which are not described in detail, can be referred to the description of the technical solutions of the two image generating methods and the image generating model training method.
An embodiment of the present disclosure also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the two image generation methods and the image generation model training method.
The above is an exemplary version of a computer-readable storage medium of the present embodiment. It should be noted that, the technical solution of the storage medium belongs to the same concept as the technical solutions of the two image generating methods and the image generating model training method, and details of the technical solution of the storage medium which are not described in detail can be referred to the description of the technical solutions of the two image generating methods and the image generating model training method.
An embodiment of the present specification also provides a computer program, wherein the computer program, when executed in a computer, causes the computer to perform the steps of the two image generation methods and the image generation model training method.
The above is an exemplary version of a computer program of the present embodiment. It should be noted that, the technical solution of the computer program, the two image generating methods and the technical solution of the image generating model training method belong to the same conception, and details of the technical solution of the computer program, which are not described in detail, can be referred to the description of the two image generating methods and the technical solution of the image generating model training method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code that may be in source code form, object code form, executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the computer readable medium contains content that can be appropriately scaled according to the requirements of jurisdictions in which such content is subject to legislation and patent practice, such as in certain jurisdictions in which such content is subject to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the embodiments are not limited by the order of actions described, as some steps may be performed in other order or simultaneously according to the embodiments of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily all required for the embodiments described in the specification.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are merely used to help clarify the present specification. Alternative embodiments are not intended to be exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the teaching of the embodiments. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention. This specification is to be limited only by the claims and the full scope and equivalents thereof.

Claims (14)

1. An image generation method, comprising:
inputting an object description text containing at least two objects into an image generation model, and determining an initial text vector of the object description text;
determining pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated for training text vectors of the at least two objects in the training process of generating a model according to the images;
And processing the initial text vector through the pre-stored object vector to obtain a target text vector, and generating an object image containing at least two objects according to the target text vector.
2. The image generation method according to claim 1, the determining pre-stored object vectors of the at least two objects, comprising:
determining an object identification word in the object description text, wherein the object description text is composed of the object identification word and object feature words except the object identification word, and one object identification word represents one object;
determining a target pre-stored object vector associated with the object identification word from pre-stored object vectors;
and determining the target pre-stored object vector as the pre-stored object vector of the at least two objects.
3. The image generating method according to claim 1, wherein the processing the initial text vector by the pre-stored object vector to obtain a target text vector comprises:
according to the object identification words in the object description text, determining vectors to be adjusted corresponding to the at least two objects from the initial text vectors;
And adding the pre-stored object vector and the vector to be adjusted in the initial text vector to obtain a target text vector.
4. The image generation method according to claim 1, the generating an object image containing the at least two objects from the target text vector, comprising:
and processing the target text vector by using a vector processing module in the image generation model to obtain an object image containing the at least two objects.
5. The image generating method according to claim 4, wherein the processing the target text vector by using a vector processing module in the image generating model to obtain an object image including the at least two objects includes:
decoding the target text vector by using the vector processing module to obtain image features corresponding to the object description text;
according to the object display position parameters of the at least two objects, adjusting the image characteristics to obtain adjusted image characteristics;
and continuing to decode the adjusted image features to obtain an object image containing the at least two objects.
6. The image generating method according to claim 5, wherein the adjusting the image feature according to the object display position parameters of the at least two objects to obtain the adjusted image feature includes:
Acquiring object display position parameters of the at least two objects, and determining target object position parameters corresponding to target objects from the object display position parameters, wherein the target objects are any one of the at least two objects;
determining target object image characteristics corresponding to the target object from the image characteristics;
and determining the target position of the target object in the target object image characteristic based on the object display position parameter, and adjusting the object display weight of the target object in the target position to obtain the adjusted image characteristic.
7. The image generation method according to claim 1, the inputting an object description text containing at least two objects into an image generation model, determining an initial text vector of the object description text, comprising:
determining an object description text containing at least two objects;
and inputting the object description text into an image generation model, and processing the object description text by using a text coding module in the image generation model to obtain an initial text vector corresponding to the object description text.
8. The image generation method according to claim 1, the determining object description text containing at least two objects, comprising:
Receiving an image generation request sent by a user, and acquiring an object description text containing at least two objects from the image generation request; or alternatively
And acquiring a text identifier from the received image generation request, and determining an object description text corresponding to the text identifier from pre-stored object description texts, wherein the object description text comprises at least two objects.
9. The image generation method according to claim 1, wherein the inputting of the object description text containing at least two objects into the image generation model, before determining the initial text vector of the object description text, further comprises:
determining a training sample aiming at an image generation model, wherein the training sample comprises a sample object description text and a reference object image corresponding to the sample object description text;
inputting the sample object description text into the image generation model, and processing the sample object description text by using a text coding module in the image generation model to obtain a training text vector corresponding to the sample object description text;
decoding the training text vector by using a vector processing module in the image generation model to obtain a sample object image corresponding to the sample object description text, and generating a pre-stored object vector corresponding to the sample object according to the training text vector;
And determining a loss function of the image generation model according to the reference object image and the sample object image, and training the image generation model based on the loss function to obtain a trained image generation model.
10. The image generating method according to claim 9, wherein the generating a pre-stored object vector corresponding to a sample object according to the training text vector includes:
determining a first sample object text vector corresponding to the sample object from the training text vector according to the sample object identification word in the sample object description text;
inputting the sample object description text into a text coding module in a pre-training image generation model to obtain an untrained text vector of the sample object description text;
determining a second sample object text vector corresponding to the sample object from the untrained text vector according to the sample object identification word;
and taking the average difference value of the first sample object text vector and the second sample object text vector as a pre-stored object vector of the sample object.
11. The image generation method according to claim 10, the determining a loss function of the image generation model from the reference object image and the sample object image, comprising:
Determining a first loss function of the image generation model according to the reference object image and the sample object image;
determining a second loss function of the image generation model according to the untrained text vector and the trained text vector;
and adding the first loss function and the second loss function to obtain a loss function of the image generation model.
12. An image generation model training method, comprising:
determining a training sample aiming at an image generation model, wherein the training sample comprises a sample object description text and a reference object image corresponding to the sample object description text;
inputting the sample object description text into the image generation model, and processing the sample object description text by using a text coding module in the image generation model to obtain a training text vector corresponding to the sample object description text;
decoding the training text vector by using a vector processing module in the image generation model to obtain a sample object image corresponding to the sample object description text, and generating a pre-stored object vector corresponding to the sample object according to the training text vector;
And determining a loss function of the image generation model according to the reference object image and the sample object image, and training the image generation model based on the loss function to obtain a trained image generation model.
13. An image generation method applied to cloud side equipment comprises the following steps:
receiving an object description text sent by a terminal side device, wherein the object description text comprises at least two objects;
inputting the object description text into an image generation model, and determining an initial text vector of the object description text;
determining pre-stored object vectors of the at least two objects, wherein the pre-stored object vectors are generated for training text vectors of the at least two objects in the training process of generating a model according to the images;
processing the initial text vector through the pre-stored object vector to obtain a target text vector, and generating an object image containing at least two objects according to the target text vector;
and sending the object image to the end-side device.
14. A computer readable storage medium storing computer executable instructions which when executed by a processor implement the steps of the image generation method of any one of claims 1 to 11, the image generation model training method of claim 12, and the image generation method of claim 13.
CN202310594509.2A 2023-05-22 2023-05-22 Image generating method Pending CN116778011A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310594509.2A CN116778011A (en) 2023-05-22 2023-05-22 Image generating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310594509.2A CN116778011A (en) 2023-05-22 2023-05-22 Image generating method

Publications (1)

Publication Number Publication Date
CN116778011A true CN116778011A (en) 2023-09-19

Family

ID=88010670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310594509.2A Pending CN116778011A (en) 2023-05-22 2023-05-22 Image generating method

Country Status (1)

Country Link
CN (1) CN116778011A (en)

Patent Citations (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050219577A1 (en) * 2003-12-31 2005-10-06 Edge Christopher J Selective flattening of page description files to support color correction
CN109359196A (en) * 2018-10-22 2019-02-19 北京百度网讯科技有限公司 Text Multimodal presentation method and device
EP3712812A1 (en) * 2019-03-20 2020-09-23 Sap Se Recognizing typewritten and handwritten characters using end-to-end deep learning
CN110472688A (en) * 2019-08-16 2019-11-19 北京金山数字娱乐科技有限公司 The method and device of iamge description, the training method of image description model and device
CN111062451A (en) * 2019-12-30 2020-04-24 合肥工业大学 Image description generation method based on text guide graph model
CN111327934A (en) * 2020-02-28 2020-06-23 海信集团有限公司 Communication terminal, control equipment and video multi-equipment synchronous playing method
CN112669215A (en) * 2021-01-05 2021-04-16 北京金山云网络技术有限公司 Training text image generation model, text image generation method and device
CN113076433A (en) * 2021-04-26 2021-07-06 支付宝(杭州)信息技术有限公司 Retrieval method and device for retrieval object with multi-modal information
CN113590858A (en) * 2021-06-30 2021-11-02 北京百度网讯科技有限公司 Target object generation method and device, electronic equipment and storage medium
CN113449135A (en) * 2021-08-31 2021-09-28 阿里巴巴达摩院(杭州)科技有限公司 Image generation system and method
CN114298121A (en) * 2021-10-09 2022-04-08 腾讯科技(深圳)有限公司 Multi-mode-based text generation method, model training method and device
CN114638960A (en) * 2022-03-22 2022-06-17 平安科技(深圳)有限公司 Model training method, image description generation method and device, equipment and medium
CN114648631A (en) * 2022-03-22 2022-06-21 平安科技(深圳)有限公司 Image description generation method and device, electronic equipment and storage medium
CN114821223A (en) * 2022-03-30 2022-07-29 阿里巴巴(中国)有限公司 Pre-training image text model processing method and image-text retrieval system
CN114495129A (en) * 2022-04-18 2022-05-13 阿里巴巴(中国)有限公司 Character detection model pre-training method and device
CN114723996A (en) * 2022-04-20 2022-07-08 平安科技(深圳)有限公司 Model training method, image description generation method and device, equipment and medium
CN115131289A (en) * 2022-05-24 2022-09-30 阿里巴巴(中国)有限公司 Training method of image processing model
CN115170825A (en) * 2022-06-30 2022-10-11 阿里巴巴(中国)有限公司 Image generation system training method, image generation method, and image generation system
CN115331150A (en) * 2022-08-29 2022-11-11 北京达佳互联信息技术有限公司 Image recognition method, image recognition device, electronic equipment and storage medium
CN115496550A (en) * 2022-08-30 2022-12-20 阿里巴巴(中国)有限公司 Text generation method and device
CN115481246A (en) * 2022-09-15 2022-12-16 蚂蚁区块链科技(上海)有限公司 Text detection model training method and device
CN115563334A (en) * 2022-09-20 2023-01-03 阿里巴巴(中国)有限公司 Method and processor for processing image-text data
CN115687664A (en) * 2022-10-26 2023-02-03 阿里巴巴(中国)有限公司 Chinese image-text retrieval method and data processing method for Chinese image-text retrieval
CN115661829A (en) * 2022-10-26 2023-01-31 阿里巴巴(中国)有限公司 Image-text recognition method and data processing method of image-text recognition model
CN115641485A (en) * 2022-11-02 2023-01-24 阿里巴巴(中国)有限公司 Generative model training method and device
CN115731425A (en) * 2022-12-05 2023-03-03 广州欢聚时代信息科技有限公司 Commodity classification method, commodity classification device, commodity classification equipment and commodity classification medium
CN115810068A (en) * 2022-12-05 2023-03-17 京东科技控股股份有限公司 Image description generation method and device, storage medium and electronic equipment
CN115984874A (en) * 2022-12-19 2023-04-18 北京航天云路有限公司 Text generation method and device, electronic equipment and storage medium
CN116051668A (en) * 2022-12-30 2023-05-02 北京百度网讯科技有限公司 Training method of diffusion model of draft map and image generation method based on text
CN116110099A (en) * 2023-01-19 2023-05-12 北京百度网讯科技有限公司 Head portrait generating method and head portrait replacing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WENJIE LIAO ET AL.: "Intelligent generative structural design method for shear wall building based on "fused-text-image-to-image" generative adversarial networks", 《EXPERT SYSTEMS WITH APPLICATIONS》, vol. 210, pages 118530 *
赖丽娜等: "生成对抗网络与文本图像生成方法综述", 《计算机工程与应用》, vol. 59, no. 19, pages 21 - 39 *

Similar Documents

Publication Publication Date Title
CN110390108B (en) Task type interaction method and system based on deep reinforcement learning
WO2021082953A1 (en) Machine reading understanding method and apparatus, storage medium, and device
CN110209774A (en) Handle the method, apparatus and terminal device of session information
CN111538818B (en) Data query method, device, electronic equipment and storage medium
CN110147806A (en) Training method, device and the storage medium of image description model
CN112131883A (en) Language model training method and device, computer equipment and storage medium
CN117173504A (en) Training method, training device, training equipment and training storage medium for text-generated graph model
CN113379045B (en) Data enhancement method and device
CN114723843B (en) Method, device, equipment and storage medium for generating virtual clothing through multi-mode fusion
CN114972823A (en) Data processing method, device, equipment and computer medium
CN113052090A (en) Method and apparatus for generating subtitle and outputting subtitle
CN115687664A (en) Chinese image-text retrieval method and data processing method for Chinese image-text retrieval
CN116050405A (en) Text processing, question-answer text processing and text processing model training method
CN116883236B (en) Image superdivision method and image data processing method
CN117218346A (en) Image generation method, device, computer readable storage medium and computer equipment
CN117093864A (en) Text generation model training method and device
CN116737897A (en) Intelligent building knowledge extraction model and method based on multiple modes
CN116778011A (en) Image generating method
CN114443916B (en) Supply and demand matching method and system for test data
CN113869518A (en) Visual common sense reasoning method and device, electronic equipment and storage medium
CN111460169B (en) Semantic expression generation method, device and equipment
CN117573842B (en) Document retrieval method and automatic question-answering method
CN116468907B (en) Method and device for image processing, image classification and image detection
CN116932731B (en) Multi-mode knowledge question-answering method and system for 5G message
CN117392379B (en) Method and device for detecting target

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination