CN117689749A - Image generation method, electronic device, and computer-readable storage medium - Google Patents

Image generation method, electronic device, and computer-readable storage medium

Info

Publication number: CN117689749A
Authority: CN (China)
Application number: CN202311613798.2A
Other languages: Chinese (zh)
Inventors: 吴志凡, 黄梁华, 王威, 魏延恒, 刘宇
Current Assignee: Zhejiang Alibaba Robot Co., Ltd.
Legal status: Pending
Prior art keywords: image, information, text, target object, image generation
Application filed by Zhejiang Alibaba Robot Co., Ltd.; priority to CN202311613798.2A.
Abstract

The application discloses an image generation method, an electronic device, and a computer-readable storage medium, relating to the technical fields of large model technology and image processing. The method comprises the following steps: acquiring multi-mode prompt information, wherein the multi-mode prompt information comprises text information and enhancement mark information, the text information is used for describing the image content to be generated, the image content comprises at least one target object, and the enhancement mark information is used for determining the position features and the image features of the at least one target object; and carrying out multi-mode image generation on the multi-mode prompt information by adopting an image generation model to obtain a target image, wherein the image generation model generates the target image in a multi-mode image generation manner. The method and the device solve the technical problems in the related art that generating images through text prompts alone results in low accuracy and a limited image generation range.

Description

Image generation method, electronic device, and computer-readable storage medium
Technical Field
The present application relates to the field of large model technology and image processing technology, and in particular, to an image generation method, an electronic device, and a computer readable storage medium.
Background
With the development of artificial intelligence, the field of text-to-image generation has made great progress, and corresponding high-quality images can be generated based on text prompts, so that the function of automatically generating images is realized.
Currently, objects are described only through text prompts to generate the corresponding images. Because it is difficult to describe objects precisely with text prompts, the accuracy of the generated images is low; moreover, this approach generally requires fine-tuning or only supports using a single object as a constraint condition, so the image generation range is limited.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides an image generation method, electronic equipment and a computer readable storage medium, which at least solve the technical problems that in the related art, the generated image has low accuracy and the image generation range is limited because a corresponding image is generated only through text prompt.
According to an aspect of the embodiments of the present application, there is provided an image generation method, including: acquiring multi-mode prompt information, wherein the multi-mode prompt information comprises text information and enhancement mark information, the text information is used for describing the image content to be generated, the image content comprises at least one target object, and the enhancement mark information is used for determining the position features and the image features of the at least one target object; and carrying out multi-mode image generation on the multi-mode prompt information by adopting an image generation model to obtain a target image, wherein the image generation model is used for generating the target image in a multi-mode image generation manner.
According to another aspect of the embodiments of the present application, there is provided another image generating method, providing, by a terminal device, a graphical user interface, where content displayed by the graphical user interface at least partially includes an image generating scene, including: inputting text information in response to a first control operation performed on the graphical user interface, wherein the text information is used for describing image content to be generated, and the image content comprises: at least one target object; responding to a second control operation executed on the graphical user interface, respectively generating position features of at least one target object and image features of at least one target object based on text information to obtain enhanced mark information, and adopting an image generation model to generate multi-mode prompt information to obtain a target image, wherein the enhanced mark information is used for determining the position features and the image features of the at least one target object, the image generation model is used for generating the target image in a multi-mode image generation mode, and the multi-mode prompt information comprises: text information and enhanced mark information; the target image is presented within a graphical user interface.
According to another aspect of the embodiments of the present application, there is also provided another image generating method, including: receiving a currently input multi-modal dialog request, wherein the information carried in the multi-modal dialog request comprises: the multi-modal dialog text information and the multi-modal dialog enhancement tagging information are used for describing multi-modal dialog image content to be generated, the multi-modal dialog image content comprising: the multi-modal dialog enhancement tagging information is used to determine location features and image features of the at least one object; carrying out multi-mode image generation on the multi-mode dialogue request by adopting an image generation model to obtain a multi-mode dialogue image, wherein the image generation model is used for generating the multi-mode dialogue image by adopting a multi-mode image generation mode; the multi-mode dialogue reply is fed back, wherein the information carried in the multi-mode dialogue reply comprises: multimodal dialog image.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including: a memory storing an executable program; and a processor for running a program, wherein the program executes any one of the image generation methods described above.
According to another aspect of the embodiments of the present application, there is further provided a computer readable storage medium, where the computer readable storage medium includes a stored executable program, and when the executable program runs, the apparatus on which the computer readable storage medium is controlled to execute any one of the image generating methods described above.
According to the embodiments of the present application, multi-mode prompt information comprising text information and enhancement mark information is obtained; the image content to be generated can be determined according to the text information, and the position features and the image features of at least one target object in the image content can be determined according to the enhancement mark information. The multi-mode prompt information is then subjected to multi-mode image generation by an image generation model to obtain a target image, achieving the aim of accurately generating a high-quality target image containing multiple target objects through multi-mode prompt information. In addition, more accurate control can be exercised over the generated image, images of higher accuracy are generated, and generation of images of multiple target objects is supported, so that the image generation range is wider. This realizes more accurate and diversified high-quality image generation for multiple target objects, meets the image generation requirements of different fields, and thereby solves the technical problems in the related art that generating images through text prompts alone results in low accuracy and a limited image generation range.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a schematic view of an application scenario of an image generating method according to embodiment 1 of the present application;
FIG. 2 is a flow chart of an image generation method according to embodiment 1 of the present application;
FIG. 3 is a flowchart of another image generation method according to embodiment 1 of the present application;
FIG. 4 is a flow chart of an image generation method according to embodiment 2 of the present application;
FIG. 5 is a flowchart of an image generation method according to embodiment 3 of the present application;
FIG. 6 is a schematic diagram of a human-machine conversation scenario in accordance with embodiment 3 of the present application;
fig. 7 is a schematic structural view of an image generating apparatus according to embodiment 4 of the present application;
fig. 8 is a schematic structural view of another image generating apparatus according to embodiment 4 of the present application;
fig. 9 is a schematic structural view of still another image generating apparatus according to embodiment 4 of the present application;
Fig. 10 is a block diagram of a computer terminal according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solution provided by the application is mainly realized by adopting large model technology, where a large model refers to a deep learning model with large-scale model parameters, generally containing hundreds of millions, billions, or even trillions of model parameters. A large model may also be called a foundation model: it is trained on large-scale unlabeled corpora to produce a pre-trained model with more than one hundred million parameters, can adapt to a wide range of downstream tasks, and has good generalization capability, for example, large language models (Large Language Model, LLM for short) and multi-modal pre-trained models.
It should be noted that, in practical applications, the pre-trained large model can be fine-tuned with a small number of samples so that it can be applied to different tasks. For example, large models are widely applied in fields such as natural language processing (Natural Language Processing, NLP for short), computer vision, and speech processing; in computer vision they can be applied to tasks such as visual question answering (Visual Question Answering, VQA for short), image captioning (Image Captioning, IC for short), and image generation, and in natural language processing to tasks such as text-based sentiment classification, text summarization, and machine translation. Accordingly, the main application scenarios of large models include, but are not limited to, digital assistants, intelligent robots, search, online education, office software, e-commerce, intelligent design, and the like.
First, some of the terms or terminology appearing in the description of the embodiments of the present application are explained as follows:
Diffusion Model: a mathematical model for describing a diffusion process, and a commonly used generative model. In the embodiments of the present application, the diffusion model can denoise a noisy picture based on the diffusion process, thereby obtaining the generated picture.
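As a rough illustration of the denoising idea (a minimal sketch, not the patent's implementation; `eps_model` and the noise schedule are hypothetical placeholders), a deterministic reverse-diffusion loop in Python might look like this:

```python
import torch

def sample(eps_model, x_T, alphas_cumprod):
    """Minimal DDIM-style reverse diffusion sketch: start from pure noise x_T
    and repeatedly remove the predicted noise to recover an image."""
    x = x_T
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = eps_model(x, t)                                # predicted noise at step t
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # current estimate of the clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # deterministic step toward t-1
    return x
```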
Constituency tree: a tree structure for representing the syntactic structure of a natural language sentence, composed of nodes representing words or phrases in the sentence and edges representing the grammatical relations between them. In a constituency tree, a sentence is decomposed into different phrase structures, such as noun phrases, verb phrases, and adjective phrases; these phrase structures may be further decomposed into smaller ones, down to the single-word level. A constituency tree helps in understanding the structure and grammatical relations of a sentence, facilitates syntactic and semantic analysis, and can also be used for natural language processing tasks such as parsing, machine translation, and information extraction. By analyzing the constituency tree, the grammatical structure and semantic meaning of a sentence can be better understood.
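For illustration only, a small constituency tree can be written down and inspected with NLTK's `Tree` class; the bracketing below is a plausible hand-written parse of the sentence "a cat and a dog on the grass", not the output of any particular parser used by the application:

```python
from nltk import Tree

# Hand-written constituency parse of "a cat and a dog on the grass".
tree = Tree.fromstring(
    "(NP (NP (DT a) (NN cat)) (CC and) (NP (DT a) (NN dog))"
    " (PP (IN on) (NP (DT the) (NN grass))))"
)
nouns = [word for word, tag in tree.pos() if tag == "NN"]
print(nouns)  # ['cat', 'dog', 'grass']
```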
Natural language processing (Natural Language Processing, NLP) parser: a tool that converts natural language text into structured data and performs tasks such as syntactic analysis, part-of-speech tagging, and named entity recognition. NLP parsers are typically based on machine learning and deep learning techniques; they can automatically analyze text and extract information from it, and can help in understanding and processing large amounts of natural language data, e.g., for speech recognition, sentiment analysis, and text classification.
Text Encoder: a model or tool for converting text data into a numerical representation. In the field of natural language processing (NLP), text encoders are commonly used to convert text data into vector or matrix form for machine learning and deep learning tasks such as text classification, semantic similarity calculation, and sentiment analysis.
Fine-tuning (Fine-tune): further training and adjusting a pre-trained model on a specific task to improve the model's performance and adaptability on that task. Typically, the pre-trained model is trained on a large-scale data set, while fine-tuning is performed on a specific small-scale data set so that the model better adapts to the specific task requirements. Fine-tuning is commonly used for deep learning models in fields such as natural language processing and computer vision.
In related-art text-to-image generation methods, the text-to-image model generally uses word embeddings extracted from the text prompt as its condition. However, text has a high level of abstraction and limited information density, making it difficult to describe objects accurately, so the accuracy of the generated images is low; moreover, such methods generally require fine-tuning or only support using a single object as a constraint, so the image generation range is limited.
At present, the BLIP-Diffusion model has been proposed, which supports mixed image-text prompts to generate corresponding images, but it only supports the generation of single objects, and the generated images have low accuracy. In addition, the Kosmos-G model has also been proposed, which likewise supports mixed image-text prompts to generate corresponding images, but the accuracy of its generated images is still low.
The related-art methods of generating corresponding images based on text prompts have the following drawbacks.
Defect 1: objects are described only through text prompts to generate corresponding images, and the accuracy of the generated images is low because it is difficult to describe objects precisely with text prompts.
Defect 2: fine-tuning is often required, or only a single object is supported as a constraint condition, so the model training cost is high and the image generation range is limited.
Defect 3: the BLIP-Diffusion model and the Kosmos-G model support mixed image-text generation, but the accuracy of the generated images is low, and the BLIP-Diffusion model only supports single-object generation.
In view of the above drawbacks, no effective solution has been proposed before the present application.
Example 1
According to an embodiment of the present application, there is provided an image generation method. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that herein.
Considering that the model parameters of the large model are huge and the operation resources of the mobile terminal are limited, the image generation method provided in the embodiment of the present application can be applied to the application scenario shown in fig. 1, but is not limited thereto. In the application scenario illustrated in fig. 1, the large model is deployed in a server 10, and the server 10 may connect to one or more client devices 20 via a local area network connection, a wide area network connection, an internet connection, or other type of data network, where the client devices 20 may include, but are not limited to: smart phones, tablet computers, notebook computers, palm computers, personal computers, smart home devices, vehicle-mounted devices and the like. The client device 20 can interact with a user through a graphical user interface to realize the invocation of the large model, thereby realizing the method provided by the embodiment of the application.
In the embodiment of the application, a system formed by the client device and the server may perform the following steps: the client device acquires the multi-mode prompt information input by a user in the graphical user interface and sends it to the server; the server performs multi-mode image generation on the received multi-mode prompt information by adopting the image generation model to obtain a target image, and returns the target image to the client device. Where the computing resources of the client device are sufficient for deploying and running the large model, the above steps may also be carried out entirely on the client device.
In the above-described operating environment, the present application provides an image generation method as shown in fig. 2. Fig. 2 is a flowchart of an image generation method according to embodiment 1 of the present application. As shown in fig. 2, the method may include the steps of:
step S21, multi-mode prompt information is obtained, wherein the multi-mode prompt information comprises text information and enhancement mark information, the text information is used for describing the image content to be generated, the image content comprises at least one target object, and the enhancement mark information is used for determining the position features and the image features of the at least one target object;
Step S22, performing multi-mode image generation on the multi-mode prompt information by adopting an image generation model to obtain a target image, wherein the image generation model is used for generating the target image by adopting a multi-mode image generation mode.
Text information may be understood as a text prompt, i.e., the image content to be generated described in natural language words. By way of example, the natural language text may be Chinese, English, Japanese, etc., which is not limited herein. In the embodiment of the present application, the image content described by the text information may include at least one target object, where a target object may be understood as a person, an animal, or an object in the image content to be generated; that is, the present application can support multi-object image generation. Illustratively, a target object may be a real character such as a boy, girl, doctor, or teacher, a virtual character such as a game player or non-player character (Non-Player Character, NPC for short), an animal such as a cat, dog, peacock, or elephant, or an object such as a table, car, lawn, tree, or train station, which is not limited herein.
For example, the text information may be written according to the user's image generation requirements. If the user wants an image of a cat and a dog on a lawn, the corresponding text information may be "a cat and a dog on the grass", in which case the text information contains three target objects, namely the cat, the dog, and the lawn. It will be appreciated that the text information could equally be phrased as "a dog and a cat on the grass"; the manner of description of the text information is not limited herein.
Considering that describing objects only through text prompts leads to inaccurate image generation, the present application takes the enhancement mark information as additional information on top of the text information and generates the target image jointly from the text information and the enhancement mark information, thereby improving the accuracy of image generation.
The enhancement mark information may include coordinate information and image information, where the coordinate information may be the coordinates of the at least one target object contained in the text information, used for determining the position features of the at least one target object. Illustratively, if the user wishes to have an image of a cat and a dog on the lawn, the coordinate information includes the coordinates of the cat (e.g., [coordinates: 12, 15, 100, 200]), the coordinates of the dog (e.g., [coordinates: 100, 150, 220, 240]), and the coordinates of the lawn (e.g., [coordinates: 500, 500, 500, 500]).
For example, the coordinates of the at least one target object may be coordinates in a two-dimensional rectangular coordinate system, which can represent the coordinate range of the target object; other two-dimensional or three-dimensional coordinate systems may also be used, and the coordinate system adopted in the application is not limited.
The image information may be images of the at least one target object contained in the text information, used for determining the image features of the at least one target object. Illustratively, if the user wishes to have an image of a cat and a dog on the lawn, the image information includes an image of a cat (e.g., [image: a picture of the user's own cat uploaded by the user]), an image of a dog (e.g., [image: a picture of the user's own dog uploaded by the user]), and an image of the lawn (e.g., [image: a picture of a lawn uploaded by the user]).
The multi-modal prompt message may be understood as a prompt message of multiple modes, including the text message and the enhancement mark message. In the embodiment of the application, the multi-mode prompt information includes prompt information of a text mode, a coordinate mode and an image mode.
For example, the multi-modal prompt information of the same target object may be bound together to avoid affecting other objects or global conditions; that is, the present application introduces an enhanced token, where each enhanced token contains the object-level text, coordinates, and image information used to describe one object in the generated image.
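A minimal sketch of what one enhanced token could carry per object, assuming a record of the object noun, a bounding box, and a reference image (all field names and values here are hypothetical illustrations, not the patent's data layout):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class EnhancedToken:
    """Object-level entry of a multi-modal prompt (hypothetical layout)."""
    text: str                                         # object noun, e.g. "cat"
    bbox: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2) in image coordinates
    image_path: Optional[str] = None                  # user-supplied reference image

prompt_text = "a cat and a dog on the grass"
tokens = [
    EnhancedToken("cat", bbox=(12, 15, 100, 200), image_path="cat.jpg"),
    EnhancedToken("dog", bbox=(100, 150, 220, 240), image_path="dog.jpg"),
    EnhancedToken("grass", bbox=(0, 0, 500, 500), image_path="grass.jpg"),
]
```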
In this embodiment of the present application, the image generation model is a model suitable for generating images; for example, it may be a pre-trained image generation model based on the diffusion process, i.e., a diffusion model, or an autoregressive model, etc., which is not limited herein. The image generation model can accurately generate a high-quality target image based on the multi-mode prompt information of multiple target objects, i.e., it can generate a high-quality image based on multiple object-level multi-modal prompts, thereby realizing accurate control over the generated image.
Illustratively, if the multi-mode prompt information of the above example is input to the image generation model, i.e., the text information "a cat and a dog on the grass", the coordinate information (the cat coordinates, the dog coordinates, and the grass coordinates), and the image information (the cat image, the dog image, and the grass image), then the image generation model can accurately output the corresponding target image according to the multi-mode prompt information, namely an image of a cat and a dog on the grass.
It can be appreciated that, in order to integrate with extensions of existing models, the image generation model in the embodiment of the present application keeps the original structure of the pre-trained text-to-image model as far as possible: it only changes the input of the model and does not need to change its architecture, and can therefore preserve the usability of techniques built on top of the model.
According to the embodiments of the present application, multi-mode prompt information comprising text information and enhancement mark information is obtained; the image content to be generated can be determined according to the text information, and the position features and the image features of at least one target object in the image content can be determined according to the enhancement mark information. The multi-mode prompt information is then subjected to multi-mode image generation by the image generation model to obtain the target image, achieving the purpose of accurately generating a high-quality target image containing multiple target objects through multi-mode prompt information; more accurate control can thus be exercised over the generated image, images of higher accuracy are generated, and the generation of images of multiple target objects is supported, so that the image generation range is wider.
The image generation method provided by the embodiment of the application can be applied to the application scenes related to image generation in the fields of scenario service, design service, game service, e-commerce service, education service, legal service, medical service, conference service, social network service, financial product service, logistics service, navigation service and the like, for example: the scenario service generates the required image materials according to the scenario or the story line, the design service generates the design drawings according to the text, the image and the coordinate description of the user, the game generates the corresponding character drawings according to the text description of the character image, the e-commerce service generates the corresponding commodity drawings according to the commodity description, and the like, and the scenario service is not limited herein.
By adopting the embodiment of the application, multi-mode prompt information comprising text information and enhancement mark information is obtained; the image content to be generated can be determined according to the text information, the position features and the image features of at least one target object in the image content can be determined according to the enhancement mark information, and the multi-mode prompt information is then subjected to multi-mode image generation by the image generation model to obtain the target image, thereby achieving the purpose of accurately generating a high-quality target image containing multiple target objects through multi-mode prompt information. In addition, more accurate control can be exercised over the generated image, images of higher accuracy are generated, and the generation of images of multiple target objects is supported, so that the image generation range is wider. This realizes more accurate and diversified high-quality image generation for multiple target objects, meets the image generation requirements of different fields, and thereby solves the technical problems in the related art that generating images through text prompts alone results in low accuracy and a limited image generation range.
In an alternative embodiment, in step S21, the multi-mode prompt information is obtained, including the following method steps:
Step S211, obtaining text information;
step S212, respectively generating the position characteristics of at least one target object and the image characteristics of at least one target object based on the text information to obtain enhanced mark information;
and S213, combining the text information and the enhancement mark information to obtain the multi-mode prompt information.
Considering that it may be difficult to obtain multi-modal prompt information for all target objects during the inference process, for example, only text information for the target object may be obtained, or only multi-modal prompt information for a portion of the target objects may be obtained (e.g., when the user wishes to combine a real object and a generated object in an image). Therefore, in the embodiment of the application, the enhancement mark information can be acquired based on the text information, namely the enhancement mark information missing from the target object can be generated, so that flexible combination of a plurality of modes is realized.
In the embodiment of the application, when the multi-mode prompt information is acquired, the text information can be acquired first, then the position feature of at least one target object and the image feature of at least one target object are generated based on the text information respectively, namely, the coordinates and the images of the object level are generated according to the text prompt, so that corresponding enhancement mark information is obtained, and then the text information and the enhancement mark information are combined, so that the multi-mode prompt information can be obtained.
In an alternative embodiment, in step S212, the location feature and the image feature of at least one target object are generated based on the text information, respectively, to obtain enhanced marking information, which includes the following method steps:
step S2121, performing part-of-speech analysis on text information, and selecting at least one target word, wherein the at least one target word meets a preset part-of-speech requirement, and the at least one target word is used for determining at least one target object;
step S2122, generating a position feature of at least one target object based on the text information and the at least one target word, and generating an image feature of at least one target object based on the text information and the position feature of the at least one target object;
step S2123, determining enhancement mark information by using the at least one target word, the position feature of the at least one target object, and the image feature of the at least one target object.
It is understood that parts of speech in a language include nouns, verbs, adjectives, adverbs, pronouns, numerals, quantifiers, conjunctions, prepositions, particles, and interjections; in general, nouns are used to represent the persons, things, and places that need to be generated in an image, and thus nouns may be used to represent the target objects.
In this embodiment of the present application, the preset part of speech may be a noun, and the at least one target word is at least one noun in the text information. Through part-of-speech analysis of the text information, at least one target word serving as a noun in the text information is selected, so that at least one target object contained in the text information can be determined, and multi-mode prompt information corresponding to each target object can be acquired conveniently.
Illustratively, a constituency tree may be used to identify objects in the text information, i.e., to identify the target objects in the text information. For example, if the text information is "a cat and a dog on the grass", part-of-speech analysis is performed on it through the constituency tree to obtain "a (determiner) cat (noun) and (conjunction) a (determiner) dog (noun) on (preposition) the (determiner) grass (noun)", and the nouns in the text information, i.e., cat, dog, and grass, are identified; the identified nouns cat, dog, and grass are then taken as the selected target words.
Alternatively, the text may be part-of-speech tagged by training a model using conditional random fields (Conditional Random Fields, CRF for short), maximum entropy models (Maximum Entropy Model), or the like, or tagged and recognized using recurrent neural networks (Recurrent Neural Networks, RNN for short), long short-term memory networks (Long Short-Term Memory, LSTM for short), attention mechanisms (Attention Mechanism), or the like, which is not limited herein.
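As one concrete off-the-shelf possibility (an assumption for illustration, not necessarily the tagger used by the application), NLTK's part-of-speech tagger is enough to pull the noun target words out of the prompt:

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "a cat and a dog on the grass"
tagged = nltk.pos_tag(nltk.word_tokenize(text))   # [('a', 'DT'), ('cat', 'NN'), ...]
target_words = [word for word, tag in tagged if tag.startswith("NN")]
print(target_words)  # expected: ['cat', 'dog', 'grass']
```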
In the embodiment of the application, when the position feature and the image feature of at least one target object are respectively generated based on the text information to obtain the enhanced mark information, part-of-speech analysis can be performed on the text information, and at least one target word with part-of-speech as a noun is selected from the text information. The location feature and the image feature of the at least one target object are then generated based on the text information and the at least one target word. And determining enhancement mark information corresponding to the at least one target object respectively according to the at least one target word, the position characteristic of the at least one target object and the image characteristic of the at least one target object.
In an alternative embodiment, in step S2122, a location feature of at least one target object is generated based on the text information and the at least one target word, comprising the method steps of:
step S21221, performing position feature generation on the text information and the at least one target word by using the position feature generation model to obtain the position feature of the at least one target object.
In the embodiment of the present application, the location feature generation model is a model suitable for generating location features, for example: diffusion model, autoregressive model, etc. The position feature generation model may be a coordinate model, and can generate position coordinates corresponding to each object based on text content of text information and objects corresponding to at least one target word.
In the embodiment of the application, when generating the position feature of at least one target object based on the text information and the at least one target word, the text information and the at least one target word may be input into the position feature generation model, and the position feature generation model is adopted to perform position feature generation on the text information and the at least one target word, so as to obtain the position feature corresponding to the at least one target object, and thus obtain the coordinate information corresponding to the at least one target object.
For example, if the text information is "a cat and a dog", the target words are "cat" and "dog", and the three text features of "a cat and a dog", "cat", and "dog" are spliced together and jointly used as the input of the location feature generation model, so that the coordinates of the cat and the coordinates of the dog are obtained from the output of the location feature generation model.
It can be seen that when the coordinate mode information is missing, the position feature generation model can be adopted to automatically generate position coordinates corresponding to a plurality of objects based on the text prompt and the objects, so that the missing coordinate mode information is complemented.
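A sketch of how such a coordinate model might be wired, assuming the global text feature is concatenated with each object's text feature and a small network regresses one bounding box per object (the architecture and dimensions are hypothetical stand-ins, not the patent's concrete model):

```python
import torch
import torch.nn as nn

class CoordinateModel(nn.Module):
    """Hypothetical: [global text feature; object text feature] -> (x1, y1, x2, y2)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, global_feat: torch.Tensor, object_feats: torch.Tensor) -> torch.Tensor:
        # Broadcast the whole-prompt feature to every object, then regress one box each.
        g = global_feat.expand(object_feats.size(0), -1)
        return self.mlp(torch.cat([g, object_feats], dim=-1))  # (num_objects, 4)

boxes = CoordinateModel()(torch.randn(1, 512), torch.randn(3, 512))  # e.g. cat, dog, grass
```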
In an alternative embodiment, in step S2122, image features of at least one target object are generated based on the text information and the position features of the at least one target object, comprising the method steps of:
In step S21222, image feature generation is performed on the text information and the position feature of the at least one target object by using the image feature generation model, so as to obtain the image feature of the at least one target object.
In the embodiment of the present application, the image feature generation model is a model suitable for generating image features, for example: diffusion model, autoregressive model, etc. The image feature generation model may be an image feature model capable of generating image features corresponding to at least one target object based on text content of text information and position features of the at least one target object.
In the embodiment of the present application, when generating the image feature of the at least one target object based on the text information and the position feature of the at least one target object, the text information and the position feature of the at least one target object may be input to the image feature generation model, and the image feature generation model is used to perform image feature generation on the text information and the coordinates of the at least one target object, so as to obtain the image feature corresponding to the at least one target object, and thus obtain the image information corresponding to the at least one target object.
For example, if the text information is "one cat and one dog" and the coordinates of the cat and the coordinates of the dog have been obtained, the present application obtains the image features of the cat and the image features of the dog from the output of the image feature generation model by inputting "one cat and one dog (text)", "cat (text) +cat coordinates (position features)", "dog (text) +dog coordinates (position features)", to the image feature generation model.
It can be seen that when the image mode information is missing, the image feature generation model can be adopted to automatically generate image information corresponding to a plurality of objects based on the text prompt and the position coordinates of the plurality of objects, so that the missing image mode information is complemented.
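Analogously, a hypothetical image feature model can condition on each object's text feature together with its generated box (again a sketch under assumed shapes, not the patent's concrete network):

```python
import torch
import torch.nn as nn

class ImageFeatureModel(nn.Module):
    """Hypothetical: [object text feature; box coordinates] -> object image feature."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim + 4, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, object_feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([object_feats, boxes], dim=-1))  # (num_objects, dim)

image_feats = ImageFeatureModel()(torch.randn(3, 512), torch.rand(3, 4))
```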
It can be seen that the image generation model in the embodiment of the application can support not only generating the target image according to the text prompt, the coordinate information and the image information, but also generating the target image according to the text prompt. The method and the device can generate pictures based on multi-object multi-mode prompt, and meanwhile, when the problem of modal missing occurs, the generated model can be used for complementing the missing modes, so that the application range is wider.
In an alternative embodiment, the image generation method further comprises the method steps of:
in step S2124, text encoding is performed on the text information by using a text encoder to obtain global text features, and text encoding is performed on at least one target word by using a text encoder to obtain object text features.
It is considered that a text encoder can convert text data into vector or matrix form for machine learning and deep learning tasks such as part-of-speech tagging. Therefore, in the embodiment of the application, when the text information is processed, the text encoder may be used to perform text encoding on the text information to obtain a global text encoding result, i.e., the global text feature; at the same time, the text encoder may also be used to perform text encoding on the at least one target word to obtain noun-level encoding results, i.e., the object text features.
It can be appreciated that, when processing the image information in the enhancement mark information, an image encoder may be used to perform image feature encoding on the image information to obtain the image features, which is not described herein in detail.
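For instance, assuming CLIP-style encoders (one plausible choice; the patent does not name a specific encoder), Hugging Face `transformers` provides ready-made text and image encoders that yield both the global and object-level features:

```python
import torch
from PIL import Image
from transformers import CLIPTokenizer, CLIPTextModel, CLIPProcessor, CLIPVisionModel

name = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModel.from_pretrained(name)
processor = CLIPProcessor.from_pretrained(name)
image_encoder = CLIPVisionModel.from_pretrained(name)

with torch.no_grad():
    # One global feature for the whole prompt plus one feature per target word.
    inputs = tokenizer(["a cat and a dog on the grass", "cat", "dog", "grass"],
                       padding=True, return_tensors="pt")
    text_feats = text_encoder(**inputs).pooler_output       # (4, hidden_dim)

    # Object image feature for a user-supplied reference picture ("cat.jpg" is hypothetical).
    pixels = processor(images=Image.open("cat.jpg"), return_tensors="pt")
    image_feat = image_encoder(**pixels).pooler_output      # (1, hidden_dim)
```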
In an alternative embodiment, in step S2123, enhanced marker information is determined using at least one target word, a location feature of at least one target object, and an image feature of at least one target object, comprising the method steps of:
in step S21231, the object text feature, the position feature of the at least one target object, and the image feature of the at least one target object are combined to obtain the enhanced marker information.
In the embodiment of the present application, when the enhancement mark information is determined by using at least one target word, the position feature of at least one target object, and the image feature of at least one target object, the object text feature obtained by performing text encoding processing on at least one target word, the position feature of at least one target object, and the image feature of at least one target object may be combined, so as to obtain the enhancement mark information. For example, the object text feature, the position feature of the at least one target object, and the image feature of the at least one target object may be horizontally stitched to obtain enhanced marker information, which is not limited herein.
For example, if the text information is "one cat and one dog on the lawn", the object text features "cat", "dog" and "lawn", the position features "cat coordinates", "dog coordinates" and "lawn coordinates" of the at least one target object, and the image features "one cat image", "one dog image" and "lawn image" of the at least one target object may be feature-combined to obtain the enhanced marker information.
In an alternative embodiment, in step S213, the text information and the enhancement mark information are combined to obtain the multi-mode prompt information, which includes the following method steps:
in step S2131, the global text feature, the object text feature, the position feature of at least one target object, and the image feature of at least one target object are combined to obtain the multimodal prompt message.
In the embodiment of the application, when the text information and the enhancement mark information are combined to obtain the multi-mode prompt information, the global text feature, the object text feature, the position feature of at least one target object and the image feature of at least one target object can be combined to obtain the multi-mode prompt information. For example, the object text feature, the position feature of at least one target object, and the image feature of at least one target object may be horizontally stitched and then vertically stitched with the global text feature, so as to obtain the multi-mode prompt information, which is not limited herein.
For example, if the text information is "one cat and one dog are on the lawn", the global text features "one cat and one dog are on the lawn", the object text features "cat", "dog" and "lawn", the position features "cat coordinates", "dog coordinates" and "lawn coordinates" of at least one object, and the image features "one cat image", "one dog image" and "lawn image" of at least one object may be feature-combined, thereby obtaining the multi-modal prompt information.
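One plausible reading of this "horizontal then vertical" combination, with random `torch` tensors standing in for the encoded features (all shapes are assumptions made for the sketch):

```python
import torch

num_objects, dim = 3, 512
obj_text = torch.randn(num_objects, dim)    # "cat", "dog", "grass" text features
obj_pos = torch.randn(num_objects, dim)     # encoded coordinates of each object
obj_image = torch.randn(num_objects, dim)   # image feature of each object
global_text = torch.randn(1, 3 * dim)       # whole-prompt feature, width-matched here

# Horizontal stitching: one enhanced token per object.
enhanced_tokens = torch.cat([obj_text, obj_pos, obj_image], dim=1)    # (3, 3*dim)
# Vertical stitching: prepend the global text feature to the enhanced tokens.
multimodal_prompt = torch.cat([global_text, enhanced_tokens], dim=0)  # (4, 3*dim)
```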
In an alternative embodiment, in step S21, the multi-mode prompt information is obtained, including the following method steps:
step S211, acquiring text information and additional information, wherein the additional information includes at least one of the following: position information of at least one target object, image information of at least one target object;
step S212, enhancement mark information is determined based on the text information and the additional information;
and S213, combining the text information and the enhancement mark information to obtain the multi-mode prompt information.
Text information may be understood as a text prompt entered by a user for describing the image content to be generated.
The additional information may be understood as coordinates and/or images entered by the user, and it may be understood that the additional information includes at least one of coordinates of at least one target object entered by the user, images of at least one target object entered by the user.
According to the method and the device for determining the multi-mode prompt information, the text information and the additional information input by the user are acquired, so that the enhanced marker information for determining the position feature and the image feature of at least one target object can be determined according to the acquired text information and the additional information, and then the multi-mode prompt information can be obtained by combining the text information and the enhanced marker information.
For example, the text information and the enhancement mark information may be vertically spliced, so as to obtain multi-mode prompt information, which is not limited herein.
Fig. 3 is a flowchart of another image generating method according to embodiment 1 of the present application, and as shown in fig. 3, the image generating model of the present application supports the input of text information and enhancement mark information together, so as to generate a target image, that is, supports the combination of a text mode and an image mode, and generates the target image based on multi-mode prompt information. In addition, the image generation model also supports that when the mode is missing, only text information is used as input, so that corresponding enhancement mark information is generated according to the text information, namely, corresponding coordinate mode and image mode are generated according to the text mode, and further, a target image is generated according to the text mode and the generated coordinate mode and image mode.
Further, in the image generation process based on the multi-modal prompt, at least one target word (i.e., noun), the coordinates of the at least one target object, and the images of the at least one target object can be determined from the input text information. The text encoder then determines the object text features corresponding to the at least one target word, the position features of the at least one target object are determined from its coordinates, and the image encoder determines the image features of the at least one target object. The object text features, the position features of the at least one target object, and the image features of the at least one target object are combined to obtain the enhancement mark information; the text information and the enhancement mark information are embedded, combined, and input into the image generation model, finally obtaining the target image.
By way of example, the text information may be "one cat and one dog on the lawn", and thus it can be determined that the target words include "dog", "cat", and "lawn", the coordinates of the at least one target object include [coordinates of the dog], [coordinates of the cat], and [coordinates of the lawn], and the images of the at least one target object include an image of a dog, an image of a cat, and an image of a lawn. The text encoder then determines the object text features corresponding to the target words, the position features of the at least one target object are determined from its coordinates, and the image encoder determines the image features of the at least one target object. The object text features, the position features, and the image features are combined to obtain the enhancement mark information, which is then combined with the text information "one cat and one dog on the lawn" and input into the image generation model, finally obtaining the target image.
In the image generation process based on the plain-text prompt, in order to cope with modality absence, only the text information may be input; the at least one target word in the text information is determined by the word segmentation model, and the object text features corresponding to the at least one target word are determined by the text encoder. Then, the coordinate model performs position feature generation on the text information and the at least one target word to obtain the position features of the at least one target object, and the image feature model performs image feature generation on the text information and the position features of the at least one target object to obtain the image features of the at least one target object. The object text features, the generated position features, and the generated image features are combined to obtain the enhancement mark information; the text information and the enhancement mark information are embedded, combined, and input into the image generation model, finally obtaining the target image.
For example, the text information may be "one cat and one dog on the lawn", so that the word segmentation model can determine that the at least one target word includes "dog", "cat", and "lawn", and the text encoder determines the object text features corresponding to the at least one target word. Then, the coordinate model performs position feature generation on the text information and the at least one target word to obtain the position features corresponding to the coordinates of the at least one target object (the coordinates of the dog, the cat, and the lawn), and the image feature model performs image feature generation on the text information and these position features to obtain the image features corresponding to the images of the at least one target object (the images of the dog, the cat, and the lawn). The object text features, the generated position features, and the generated image features are combined to obtain the enhancement mark information, which is combined with the text information and input into the image generation model, finally obtaining the target image.
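Putting the two paths together, the inference flow described above can be condensed into the following sketch; every callable passed in is a hypothetical stand-in for the components discussed in this section:

```python
import torch

def generate_image(text, nouns, text_enc, coord_model, img_feat_model, diffusion,
                   boxes=None, image_feats=None):
    """Sketch of both inference paths: use the user's boxes/images when given,
    otherwise generate the missing modalities from the text prompt."""
    global_feat = text_enc(text)                                  # (1, dim)
    obj_feats = torch.cat([text_enc(n) for n in nouns])           # (num_objects, dim)
    if boxes is None:                                             # coordinate modality missing
        boxes = coord_model(global_feat, obj_feats)
    if image_feats is None:                                       # image modality missing
        image_feats = img_feat_model(obj_feats, boxes)
    enhanced = torch.cat([obj_feats, boxes, image_feats], dim=1)  # enhanced tokens
    return diffusion(global_feat, enhanced)                       # conditioned sampling
```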
It can be seen that, given an image-text pair, the present application takes the object-level text, coordinates, and image and integrates this information into the "enhanced token" of each object. The enhanced tokens are trained in the diffusion model as additional conditions alongside the text prompt, so that the image generation model of the present application can handle multi-object, multi-modal prompts.
In addition, in order to deal with modality absence during the inference process, i.e., to solve the problem of zero-shot image generation, the application proposes using a coordinate model and an image feature model to generate object-level coordinates and image features from the text prompt. Thus, the present application can generate a target image through a text prompt alone or through flexible combinations of multi-modal prompts covering various modalities. A large number of qualitative and quantitative experiments prove that the method is superior to the image generation methods of the related art and can complete a wider range of image generation tasks.
It is easy to understand that the beneficial effects of the image generation method provided in the present application include the following points.
Beneficial effect (1): generation of high-quality images based on multi-object-level multi-modal prompts is supported, so that the generated image can be controlled more accurately and the application range is wider;

Beneficial effect (2): a coordinate model and an image feature model are designed to support generating the coordinate and image modalities from the text modality, so that the possible problem of missing modalities can be overcome;

Beneficial effect (3): image generation models in the related art mostly generate from the text modality alone, whereas the image generation model of the present application can generate a target image from the text modality, the image modality and the coordinate modality simultaneously, so that the generated image has higher accuracy;

Beneficial effect (4): if an image generation model in the related art is required to generate a given object, such as a dog, then because there are many breeds of dog, generating a specific dog generally requires providing 3-5 pictures and fine-tuning the model before the model can learn it. The image generation model of the present application does not need this fine-tuning process: during inference, only one picture of the dog needs to be provided to generate a dog of the corresponding breed. That is, zero-shot generation can be realized, and the target object does not need to participate in the training process, so the model training cost is lower;

Beneficial effect (5): after the enhanced tokens are appended to the text embedding, they serve as the sampling condition for image generation by the diffusion model, so the present application does not need to change the architecture of the diffusion model, only its input, which further improves the usability of techniques built on top of the diffusion model.
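As a purely illustrative sketch of beneficial effect (5), the conditioning change could look as follows; the tensor shapes and the function name are assumptions made for the example, not the actual implementation of this application.

    # Hedged sketch: only the conditioning input changes, not the denoiser.
    import torch

    def build_condition(text_embedding: torch.Tensor,
                        enhanced_tokens: torch.Tensor) -> torch.Tensor:
        # text_embedding:  (seq_len, d)   - the usual text-prompt embedding
        # enhanced_tokens: (n_objects, d) - one token per object, same width d
        # Concatenating along the sequence axis leaves the diffusion model's
        # architecture untouched; it simply sees a longer condition sequence.
        return torch.cat([text_embedding, enhanced_tokens], dim=0)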
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus a necessary general hardware platform, but that it may also be implemented by means of hardware. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method described in the embodiments of the present application.
Example 2
In the operating environment of embodiment 1, the present application provides an image generation method as shown in fig. 4. A graphical user interface is provided by a terminal device, and the content displayed by the graphical user interface at least partially includes an image generation scene. Fig. 4 is a flowchart of an image generation method according to embodiment 2 of the present application; as shown in fig. 4, the method includes:

Step S41: in response to a first control operation performed on the graphical user interface, text information is input, where the text information is used to describe the image content to be generated, and the image content includes: at least one target object;

Step S42: in response to a second control operation performed on the graphical user interface, position features of the at least one target object and image features of the at least one target object are respectively generated based on the text information to obtain enhancement mark information, and multi-mode image generation is performed on the multi-mode prompt information by using an image generation model to obtain a target image, where the enhancement mark information is used to determine the position features and the image features of the at least one target object, the image generation model is used to generate the target image in a multi-mode image generation manner, and the multi-mode prompt information includes: the text information and the enhancement mark information;

Step S43: the target image is displayed in the graphical user interface.
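As a hedged illustration of steps S41 to S43 only, the two control operations could be wired to the generation pipeline as sketched below; the class, method and pipeline names are assumptions made for the example, not part of this application.

    # Hedged sketch of the GUI flow in steps S41-S43; names are illustrative.
    class ImageGenerationScene:
        def on_first_control(self, text_box_value: str):
            # S41: the first control operation supplies the text information.
            self.text = text_box_value

        def on_second_control(self, pipeline):
            # S42: derive the enhancement mark information from the text and
            # run multi-mode image generation on the combined prompt.
            target_image = pipeline.run(self.text)
            # S43: display the generated target image back to the user.
            self.display(target_image)

        def display(self, image):
            ...  # render the target image in the graphical user interface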
In this embodiment of the present application, at least an image generation scene is displayed in the graphical user interface. By performing control operations in the image generation scene, a user can input text information describing the image content to be generated, trigger the generation of the position features and the image features of the at least one target object based on the text information to obtain the enhancement mark information, and trigger the image generation model to perform multi-mode image generation on the multi-mode prompt information to obtain the target image. It is to be appreciated that the above image generation scene may be, but is not limited to, an application scenario involving image generation in the fields of scripts, design, games, e-commerce, education, medical treatment, meetings, social networks, financial products, logistics, navigation, and the like.
The graphical user interface further includes a first control (or a first touch area); when a first touch operation acting on the first control (or the first touch area) is detected, the text information input by the user can be acquired. For example, the user may input the text information into a text box in the graphical user interface through the first touch operation. The first touch operation may be an operation such as clicking, selecting, ticking a checkbox or setting filter conditions, which is not limited here.
Text information may be understood as a text prompt, that is, the image content to be generated described in natural language. By way of example, the natural language may be Chinese, English, Japanese, etc., without limitation. In this embodiment of the present application, the image content described by the text information may include at least one target object, where a target object may be understood as a person, an animal or a thing in the image content to be generated; that is, the present application can support multi-object image generation. Illustratively, the target object may include real characters such as boys, girls, doctors and teachers, virtual characters such as game players and non-player characters (NPCs), animals such as cats, dogs, peacocks and elephants, and things such as tables, cars, lawns, trees and train stations, without limitation.

For example, the text information may be generated according to the user's requirement for the image. If the user wants to obtain an image of one cat and one dog on the lawn, the corresponding text information may be "one cat and one dog on the lawn (i.e. a cat and a dog on the grass)", where the text information includes three target objects, namely the cat, the dog and the lawn. It will be appreciated that the text information could equally be phrased as "one dog and one cat on the lawn" or "a cat and a dog are on the lawn"; the present application does not limit the way the text information is worded.
The graphical user interface further includes a second control (or a second touch area); when a second touch operation acting on the second control (or the second touch area) is detected, the position features of the at least one target object and the image features of the at least one target object can be respectively generated based on the text information to obtain the enhancement mark information, and the image generation model can be used to perform multi-mode image generation on the multi-mode prompt information to obtain the target image. The second touch operation may be an operation such as clicking, selecting, ticking a checkbox or setting filter conditions, which is not limited here.

Considering that describing objects only through a text prompt leads to inaccurate image generation, the present application takes the enhancement mark information as additional information on top of the text information, and generates the target image jointly from the text information and the enhancement mark information, thereby improving the accuracy of image generation.

The enhancement mark information may include coordinate information and image information, where the coordinate information may be the coordinates of the at least one target object included in the text information and is used to determine the position features of the at least one target object. Illustratively, if the user wants an image of a cat and a dog on the lawn, the coordinate information includes the coordinates of the cat (e.g., [coordinates: 12, 15, 100, 200]), the coordinates of the dog (e.g., [coordinates: 100, 150, 220, 240]) and the coordinates of the lawn (e.g., [coordinates: 500, 500, 500, 500]).

For example, the coordinates of the at least one target object may be coordinates in a two-dimensional rectangular coordinate system, representing the coordinate range occupied by the target object, or may be coordinates in another two-dimensional coordinate system or in a three-dimensional coordinate system; the present application does not limit the coordinate system adopted.

The image information may be images of the at least one target object included in the text information and is used to determine the image features of the at least one target object. Illustratively, if the user wants an image of a cat and a dog on the lawn, the image information includes an image of the cat (e.g., [image: a picture of the user's own cat uploaded by the user]), an image of the dog (e.g., [image: a picture of the user's own dog uploaded by the user]) and an image of the lawn (e.g., [image: a picture of a lawn uploaded by the user]).
The multi-mode prompt information may be understood as prompt information in multiple modalities, including the text information and the enhancement mark information. In this embodiment of the present application, the multi-mode prompt information includes prompt information in the text modality, the coordinate modality and the image modality.

For example, the multi-modal prompt information of the same target object may be bound together to avoid affecting other objects or global conditions; that is, the present application introduces an enhanced token, which contains object-level text, coordinate and image information at the same time and is used to describe one object in the generated image.
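By way of a non-authoritative illustration, one enhanced token per object could be represented as follows before encoding; the field names and the [x1, y1, x2, y2] box convention are assumptions made for the example.

    # Hedged sketch of one object-level "enhanced token" before encoding.
    from dataclasses import dataclass

    @dataclass
    class EnhancedToken:
        text: str          # object-level text, e.g. "cat"
        coordinates: list  # e.g. [12, 15, 100, 200], a box for the object
        image_path: str    # e.g. a user-uploaded picture of that object

    prompt = "one cat and one dog on the lawn"
    tokens = [
        EnhancedToken("cat",  [12, 15, 100, 200],   "user_cat.png"),
        EnhancedToken("dog",  [100, 150, 220, 240], "user_dog.png"),
        EnhancedToken("lawn", [500, 500, 500, 500], "lawn.png"),
    ]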
In this embodiment of the present application, the image generation model is a model suitable for generating images; for example, it may be a pre-trained image generation model based on a diffusion process, that is, a diffusion model, or an autoregressive model, etc., which is not limited here. The image generation model can accurately generate a high-quality target image based on the multi-mode prompt information of multiple target objects, that is, it can generate high-quality images from multi-object-level multi-modal prompts, thereby achieving accurate control over the generated image.

Illustratively, if the multi-mode prompt information of the above example, namely the text information "one cat and one dog on the lawn", the coordinate information (the coordinates of the cat, the coordinates of the dog and the coordinates of the lawn) and the image information (the image of the cat, the image of the dog and the image of the lawn), is input to the image generation model, then the image generation model can accurately output the corresponding target image, namely an image of one cat and one dog on the lawn, according to the multi-mode prompt information.

It can be appreciated that, in order to be integrated as an extension of existing models, the image generation model in this embodiment of the present application keeps the original structure of the pre-trained text-to-image model as much as possible: it only changes the input of the model and does not need to change the model's architecture, so the usability of techniques built on top of the base model is preserved.
In this embodiment of the present application, a graphical user interface is provided through a terminal device, and the content displayed on the graphical user interface at least partially includes an image generation scene. If the user performs a first control operation on the graphical user interface, text information describing the image content of the at least one target object to be generated is input. If the user performs a second control operation on the graphical user interface, such as a submit operation, the position features of the at least one target object and the image features of the at least one target object can be respectively generated based on the text information to obtain the enhancement mark information, the image generation model can be used to perform multi-mode image generation on the multi-mode prompt information to obtain the target image, and the generated target image is then displayed in the graphical user interface as feedback to the user. This achieves the aim of accurately generating, through the multi-mode prompt information, a high-quality target image that includes multiple target objects; it enables more accurate control over the generated image, produces images with higher accuracy, and supports generating images of multiple target objects, so that the range of image generation is wider.
It should be noted that the first touch operation and the second touch operation may be operations in which the user touches the display screen of the terminal device with a finger. The touch operations may include single-point touch and multi-point touch, where the operation at each touch point may include clicking, long pressing, pressing with force, swiping, and the like. The first touch operation and the second touch operation may also be implemented through an input device such as a mouse or a keyboard, which is not limited here.
The image generation method provided in this embodiment of the present application can be applied to application scenarios involving image generation in the fields of scenario services, design services, game services, e-commerce services, education services, legal services, medical services, conference services, social network services, financial product services, logistics services, navigation services, and the like, for example: a scenario service generates the required image materials according to a script or storyline, a design service generates design drawings according to the user's text, image and coordinate descriptions, a game generates a corresponding character image according to a textual description of the character, an e-commerce service generates a corresponding product image according to a product description, and so on, without limitation here.
By adopting this embodiment of the present application, a graphical user interface is provided through the terminal device, and the content displayed on the graphical user interface at least partially includes an image generation scene. The user inputs, through the first control operation, text information describing the image content of the at least one target object to be generated; through the second control operation, such as a submit operation, the position features and the image features of the at least one target object are respectively generated based on the text information to obtain the enhancement mark information, the image generation model is used to perform multi-mode image generation on the multi-mode prompt information to obtain the target image, and the generated target image is then displayed in the graphical user interface as feedback to the user.
It should be noted that, the preferred implementation manner of this embodiment may be referred to the related description in embodiment 1, and will not be repeated here.
Example 3
In the operating environment of embodiment 1, the present application provides an image generation method as shown in fig. 5.
Fig. 5 is a flowchart of an image generating method according to embodiment 3 of the present application, as shown in fig. 5, the method includes:
Step S51: a currently input multi-modal dialog request is received, where the information carried in the multi-modal dialog request includes: multi-modal dialog text information and multi-modal dialog enhancement mark information, the multi-modal dialog text information is used to describe the multi-modal dialog image content to be generated, the multi-modal dialog image content includes: at least one object, and the multi-modal dialog enhancement mark information is used to determine the position features and the image features of the at least one object;

Step S52: multi-modal image generation is performed on the multi-modal dialog request by using an image generation model to obtain a multi-modal dialog image, where the image generation model is used to generate the multi-modal dialog image in a multi-modal image generation manner;

Step S53: a multi-modal dialog reply is fed back, where the information carried in the multi-modal dialog reply includes: the multi-modal dialog image.
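For illustration, steps S51 to S53 can be read as a simple request/reply round trip, sketched below under the assumption of a hypothetical image_generation_model object and a dictionary-based message format; neither is the actual interface of this application.

    # Hedged sketch of the multi-modal dialog round trip (steps S51-S53).
    def handle_dialog_request(request, image_generation_model):
        # S51: the request carries text plus enhancement mark information.
        text = request["text"]               # e.g. "one cat and one dog on the lawn"
        enhancement = request["enhancement"] # object-level coordinates and images

        # S52: multi-modal image generation on the whole request.
        image = image_generation_model.generate(text, enhancement)

        # S53: feed back a reply carrying the generated dialog image.
        return {"image": image}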
A multi-modal dialog request may be understood as a dialog request initiated by a user to a computer or a robot; the multi-modal dialog request carries the multi-modal dialog text information and the multi-modal dialog enhancement mark information. The multi-modal dialog text information may be understood as a text prompt, that is, the multi-modal dialog image content to be generated described in natural language. By way of example, the natural language may be Chinese, English, Japanese, etc., without limitation.

In this embodiment of the present application, the multi-modal dialog image content described by the multi-modal dialog text information may include at least one object, where an object may be understood as a person, an animal or a thing in the multi-modal dialog image content to be generated; that is, the present application can support multi-object image generation. By way of example, the object may include real characters such as boys, girls, doctors and teachers, virtual characters such as game players and non-player characters (NPCs), animals such as cats, dogs, peacocks and elephants, and things such as tables, cars, lawns, trees and train stations, without limitation.

For example, the multi-modal dialog text information may be the text information corresponding to the multi-modal dialog request input by the user. If the user wants to obtain an image of one cat and one dog on the lawn, the corresponding multi-modal dialog text information may be "one cat and one dog on the lawn (i.e. a cat and a dog on the grass)", where the multi-modal dialog text information includes three objects, namely the cat, the dog and the lawn. It will be appreciated that the multi-modal dialog text information could equally be phrased as "one dog and one cat on the lawn" or "a cat and a dog are on the lawn"; the present application does not limit the way the multi-modal dialog text information is worded.
Considering that describing objects only through the multi-modal dialog text prompt leads to inaccurate generation of the multi-modal dialog image, the present application takes the multi-modal dialog enhancement mark information as additional information on top of the multi-modal dialog text information, and generates the multi-modal dialog image jointly from the multi-modal dialog text information and the multi-modal dialog enhancement mark information, thereby improving the accuracy of multi-modal dialog image generation.
The multi-modal dialog enhancement mark information may include coordinate information and image information, where the coordinate information may be the coordinates of the at least one object included in the multi-modal dialog text information and is used to determine the position features of the at least one object. Illustratively, if the user wants an image of a cat and a dog on the lawn, the coordinate information includes the coordinates of the cat (e.g., [coordinates: 12, 15, 100, 200]), the coordinates of the dog (e.g., [coordinates: 100, 150, 220, 240]) and the coordinates of the lawn (e.g., [coordinates: 500, 500, 500, 500]).

For example, the coordinates of the at least one object may be coordinates in a two-dimensional rectangular coordinate system, representing the coordinate range occupied by the object, or may be coordinates in another two-dimensional coordinate system or in a three-dimensional coordinate system; the present application does not limit the coordinate system adopted.

The image information may be images of the at least one object included in the multi-modal dialog text information and is used to determine the image features of the at least one object. Illustratively, if the user wants an image of a cat and a dog on the lawn, the image information includes an image of the cat (e.g., [image: a picture of the user's own cat uploaded by the user]), an image of the dog (e.g., [image: a picture of the user's own dog uploaded by the user]) and an image of the lawn (e.g., [image: a picture of a lawn uploaded by the user]).
It can be seen that the multi-modal dialog request in this embodiment of the present application carries prompt information in multiple modalities, including the multi-modal dialog text information and the multi-modal dialog enhancement mark information; that is, the multi-modal prompt information includes prompt information in the text modality, the coordinate modality and the image modality.

For example, the prompt information of the same object across the multiple modalities may be bound together to avoid affecting other objects or global conditions; that is, the present application introduces an enhanced token, which contains object-level text, coordinate and image information at the same time and is used to describe one object in the generated image.

In this embodiment of the present application, the image generation model is a model suitable for generating images; for example, it may be a pre-trained image generation model based on a diffusion process, that is, a diffusion model, or an autoregressive model, etc., which is not limited here. The image generation model can accurately generate a high-quality multi-modal dialog image based on the multi-modal dialog request, that is, it can generate high-quality images from multi-object-level multi-modal prompts, thereby achieving accurate control over the generation of the multi-modal dialog image.

Illustratively, if the image generation model performs multi-modal image generation on the multi-modal dialog request of the above example, that is, the multi-modal dialog text information "one cat and one dog on the lawn", the coordinate information (the coordinates of the cat, the coordinates of the dog and the coordinates of the lawn) and the image information (the image of the cat, the image of the dog and the image of the lawn) are input into the image generation model, then the image generation model of the present application can accurately output the corresponding multi-modal dialog image, namely an image of one cat and one dog on the lawn, according to the multi-modal dialog request.

It can be appreciated that, in order to be integrated as an extension of existing models, the image generation model in this embodiment of the present application keeps the original structure of the pre-trained text-to-image model as much as possible: it only changes the input of the model and does not need to change the model's architecture, so the usability of techniques built on top of the base model is preserved.
A multi-modal dialog reply may be understood as the reply content fed back by the computer or the robot to the user in response to the multi-modal dialog request input by the user. The multi-modal dialog reply carries the multi-modal dialog image, which is an image including the multi-modal dialog image content described in the multi-modal dialog request.

The above steps S51 to S53 may be applied to a human-machine dialog scenario, that is, a scenario in which a dialog takes place between a user and a computer or a robot. Fig. 6 is a schematic diagram of a human-machine dialog scenario according to embodiment 3 of the present application; as shown in fig. 6, the user inputs a multi-modal dialog request to the computer or the robot, and the computer or the robot can feed back to the user a multi-modal dialog reply corresponding to the multi-modal dialog request.
It will be appreciated that the dialog between the user and the computer or robot may be carried out by speech recognition and natural language processing techniques, or may be carried out by way of text communication, without limitation.
According to this embodiment of the present application, by receiving a multi-modal dialog request that is input by the user and includes the multi-modal dialog text information and the multi-modal dialog enhancement mark information, the multi-modal dialog image content to be generated can be determined according to the multi-modal dialog text information, and the position features and the image features of the at least one object in that content can be determined according to the multi-modal dialog enhancement mark information. The image generation model is then used to perform multi-modal image generation on the multi-modal dialog request to obtain the multi-modal dialog image, and a multi-modal dialog reply carrying the multi-modal dialog image is fed back to the user. This achieves the aim of accurately generating, through the multi-modal dialog request, a high-quality multi-modal dialog image that includes multiple objects; it enables more accurate control over the generated multi-modal dialog image, produces multi-modal dialog images with higher accuracy, and supports generating multi-modal dialog images of multiple objects, so that the generation range of multi-modal dialog images is wider.
The image generation method provided in this embodiment of the present application may be applied to, but is not limited to, application scenarios involving image generation in the fields of scenario services, design services, game services, e-commerce services, education services, legal services, medical services, conference services, social network services, financial product services, logistics services, navigation services, and the like, for example: a design service generates design drawings according to the user's text, image and coordinate descriptions, a game generates a corresponding character image according to a textual description of the character, an e-commerce service generates a corresponding product image according to a product description, and so on, without limitation here.
By adopting this embodiment of the present application, a multi-modal dialog request including the multi-modal dialog text information and the multi-modal dialog enhancement mark information is received from the user; the multi-modal dialog image content to be generated is determined according to the multi-modal dialog text information, and the position features and the image features of the at least one object in that content are determined according to the multi-modal dialog enhancement mark information; the image generation model is then used to perform multi-modal image generation on the multi-modal dialog request to obtain the multi-modal dialog image, and a multi-modal dialog reply carrying the multi-modal dialog image is fed back to the user. The generated multi-modal dialog image can thus be controlled more accurately, has higher accuracy, and may include multiple objects, so that the generation range is wider; this solves the technical problems in the related art that the generated image has low accuracy and the image generation range is limited because the corresponding image is generated only through a text prompt.
In an alternative embodiment, in step S51, a currently entered multimodal dialog request is received, comprising the following method steps:
Step S511: acquiring the multi-modal dialog text information;

Step S512: respectively generating the position features of the at least one object and the image features of the at least one object based on the multi-modal dialog text information to obtain the multi-modal dialog enhancement mark information;

Step S513: combining the multi-modal dialog text information and the multi-modal dialog enhancement mark information to obtain the multi-modal dialog request.
In this embodiment of the present application, when the currently input multi-modal dialog request is received, the multi-modal dialog text information can be acquired first; the position features of the at least one object and the image features of the at least one object are then respectively generated based on the multi-modal dialog text information, that is, object-level coordinates and images are generated from the multi-modal dialog text prompt, so as to obtain the corresponding multi-modal dialog enhancement mark information; and the multi-modal dialog text information and the multi-modal dialog enhancement mark information are combined to obtain the multi-modal dialog request.
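A minimal sketch of steps S511 to S513 follows, assuming hypothetical word_segmentation_model, coordinate_model and image_feature_model objects that fill in the missing modalities from the text alone:

    # Hedged sketch: building a full multi-modal dialog request from text only.
    def build_dialog_request(text, word_segmentation_model,
                             coordinate_model, image_feature_model):
        # S511: the text information is all the user has provided.
        objects = word_segmentation_model.extract_objects(text)

        # S512: generate the missing coordinate and image modalities per object,
        # i.e. the multi-modal dialog enhancement mark information.
        positions = coordinate_model.generate(text, objects)
        images = image_feature_model.generate(text, positions)
        enhancement = list(zip(objects, positions, images))

        # S513: combine the text and the enhancement mark information.
        return {"text": text, "enhancement": enhancement}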
It should be noted that, the preferred implementation manner of this embodiment may be referred to the related description in embodiment 1, and will not be repeated here.
Example 4
According to an embodiment of the present application, there is also provided an embodiment of an apparatus for implementing the above-mentioned image generation. Fig. 7 is a schematic structural view of an image generating apparatus according to embodiment 4 of the present application, as shown in fig. 7, the apparatus including:
the obtaining module 701 is configured to obtain multi-mode prompt information, where the multi-mode prompt information includes: text information and enhancement mark information, the text information is used for describing image content to be generated, and the image content comprises: the enhancement mark information is used for determining the position characteristics and the image characteristics of the at least one target object;
the image generation module 702 is configured to generate a multi-mode image for the multi-mode prompt information by using an image generation model to obtain a target image, where the image generation model is configured to generate the target image by using a multi-mode image generation mode.
Optionally, the acquiring module 701 is further configured to: acquiring text information; generating the position feature of at least one target object and the image feature of at least one target object based on the text information respectively to obtain enhancement mark information; and combining the text information and the enhancement mark information to obtain the multi-mode prompt information.
Optionally, the acquiring module 701 is further configured to: performing part-of-speech analysis on the text information, and selecting at least one target word, wherein the at least one target word meets the preset part-of-speech requirement, and the at least one target word is used for determining at least one target object; generating a location feature of at least one target object based on the text information and the at least one target word, and generating an image feature of the at least one target object based on the text information and the location feature of the at least one target object; enhancement tag information is determined using the at least one target word, the location feature of the at least one target object, and the image feature of the at least one target object.
Optionally, the acquiring module 701 is further configured to: use a position feature generation model to perform position feature generation on the text information and the at least one target word to obtain the position features of the at least one target object.

Optionally, the acquiring module 701 is further configured to: use an image feature generation model to perform image feature generation on the text information and the position features of the at least one target object to obtain the image features of the at least one target object.

Optionally, the apparatus further includes an encoding module configured to: perform text encoding on the text information by using a text encoder to obtain global text features, and perform text encoding on the at least one target word by using the text encoder to obtain object text features.

Optionally, the acquiring module 701 is further configured to: combine the object text features, the position features of the at least one target object and the image features of the at least one target object to obtain the enhancement mark information.

Optionally, the acquiring module 701 is further configured to: combine the global text features, the object text features, the position features of the at least one target object and the image features of the at least one target object to obtain the multi-mode prompt information.
Optionally, the acquiring module 701 is further configured to: acquiring text information and additional information, wherein the additional information comprises at least one of the following: position information of at least one target object, image information of at least one target object; determining enhanced marking information based on the text information and the additional information; and combining the text information and the enhancement mark information to obtain the multi-mode prompt information.
By adopting this embodiment of the present application, multi-mode prompt information including text information and enhancement mark information is acquired; the image content to be generated can be determined according to the text information, and the position features and the image features of the at least one target object in the image content can be determined according to the enhancement mark information. An image generation model is then used to perform multi-mode image generation on the multi-mode prompt information to obtain a target image, achieving the aim of accurately generating, through the multi-mode prompt information, a high-quality target image that includes multiple target objects. In addition, more accurate control can be exercised over the generated image, images with higher accuracy are generated, and generation of images of multiple target objects is supported, so that the range of image generation is wider. This realizes more accurate and diversified high-quality image generation for multiple target objects, achieves the technical effect of meeting image generation requirements in different fields, and further solves the technical problems in the related art that the generated image has low accuracy and the image generation range is limited because the corresponding image is generated only through a text prompt.
Here, the acquisition module 701 and the image generation module 702 correspond to step S21 and step S22 in embodiment 1, and the examples and application scenarios implemented by the two modules and their corresponding steps are the same, but are not limited to the content disclosed in embodiment 1. It should be noted that the above modules or units may be hardware components, or software components stored in a memory and processed by one or more processors; the above modules may also be run in the server 10 provided in embodiment 1.
According to an embodiment of the present application, there is also provided another embodiment of an apparatus for implementing the above-mentioned image generation. Fig. 8 is a schematic structural diagram of another image generating apparatus according to embodiment 4 of the present application, in which a graphical user interface is provided by a terminal device, and content displayed on the graphical user interface at least partially includes an image generating scene, as shown in fig. 8, the apparatus includes:
a first response module 801, configured to input text information in response to a first control operation performed on the graphical user interface, where the text information is used to describe image content to be generated, and the image content includes: at least one target object;
a second response module 802, configured to: in response to a second control operation performed on the graphical user interface, generate, based on the text information, position features of the at least one target object and image features of the at least one target object to obtain enhancement mark information, and perform multi-mode image generation on the multi-mode prompt information by using an image generation model to obtain a target image, where the enhancement mark information is used to determine the position features and the image features of the at least one target object, the image generation model is used to generate the target image in a multi-mode image generation manner, and the multi-mode prompt information includes: the text information and the enhancement mark information;
A display module 803 for displaying the target image in the graphical user interface.
By adopting this embodiment of the present application, a graphical user interface is provided through the terminal device, and the content displayed on the graphical user interface at least partially includes an image generation scene. If the user performs a first control operation on the graphical user interface, text information describing the image content of the at least one target object to be generated is input; if the user performs a second control operation on the graphical user interface, such as a submit operation, the position features and the image features of the at least one target object can be respectively generated based on the text information to obtain the enhancement mark information, the image generation model can be used to perform multi-mode image generation on the multi-mode prompt information to obtain the target image, and the generated target image is then displayed in the graphical user interface as feedback to the user.
Here, it should be noted that the first response module 801, the second response module 802 and the display module 803 correspond to steps S41 to S43 in embodiment 2, and the examples and application scenarios implemented by the three modules and their corresponding steps are the same, but are not limited to the content disclosed in embodiment 2. It should be noted that the above modules or units may be hardware components, or software components stored in a memory and processed by one or more processors; the above modules may also be run in the server 10 provided in embodiment 1.
According to an embodiment of the present application, there is also provided another embodiment of an apparatus for implementing the above-mentioned image generation. Fig. 9 is a schematic structural view of still another image generating apparatus according to embodiment 4 of the present application, as shown in fig. 9, including:
the receiving module 901 is configured to receive a currently input multi-modal dialog request, where the information carried in the multi-modal dialog request includes: multi-modal dialog text information and multi-modal dialog enhancement mark information, the multi-modal dialog text information is used to describe the multi-modal dialog image content to be generated, the multi-modal dialog image content includes: at least one object, and the multi-modal dialog enhancement mark information is used to determine the position features and the image features of the at least one object;

the generating module 902 is configured to perform multi-modal image generation on the multi-modal dialog request by using an image generation model to obtain a multi-modal dialog image, where the image generation model is used to generate the multi-modal dialog image in a multi-modal image generation manner;

the feedback module 903 is configured to feed back a multi-modal dialog reply, where the information carried in the multi-modal dialog reply includes: the multi-modal dialog image.

Optionally, the receiving module 901 is further configured to: acquire the multi-modal dialog text information; respectively generate the position features of the at least one object and the image features of the at least one object based on the multi-modal dialog text information to obtain the multi-modal dialog enhancement mark information; and combine the multi-modal dialog text information and the multi-modal dialog enhancement mark information to obtain the multi-modal dialog request.
By adopting this embodiment of the present application, a multi-modal dialog request including the multi-modal dialog text information and the multi-modal dialog enhancement mark information is received from the user; the multi-modal dialog image content to be generated is determined according to the multi-modal dialog text information, and the position features and the image features of the at least one object in that content are determined according to the multi-modal dialog enhancement mark information; the image generation model is then used to perform multi-modal image generation on the multi-modal dialog request to obtain the multi-modal dialog image, and a multi-modal dialog reply carrying the multi-modal dialog image is fed back to the user. The generated multi-modal dialog image can thus be controlled more accurately, has higher accuracy, and may include multiple objects, so that the generation range is wider; this solves the technical problems in the related art that the generated image has low accuracy and the image generation range is limited because the corresponding image is generated only through a text prompt.
Here, the above receiving module 901, generating module 902 and feedback module 903 correspond to steps S51 to S53 in embodiment 3, and the examples and application scenarios implemented by the three modules and their corresponding steps are the same, but are not limited to the content disclosed in embodiment 3. It should be noted that the above modules or units may be hardware components, or software components stored in a memory and processed by one or more processors; the above modules may also be run in the server 10 provided in embodiment 1.
It should be noted that the preferred implementations, application scenarios and implementation processes of the foregoing examples of the present application are the same as those provided in embodiment 1, but are not limited to what embodiment 1 provides.
Example 5
Embodiments of the present application may provide a computer terminal, which may be any one of a group of computer terminals. Alternatively, in the present embodiment, the above-described computer terminal may be replaced with a terminal device such as a mobile terminal.
Alternatively, in this embodiment, the above-mentioned computer terminal may be located in at least one network device among a plurality of network devices of the computer network.
In this embodiment, the above-described computer terminal may execute the program code of the following steps in the image generation method: acquiring multi-mode prompt information, wherein the multi-mode prompt information comprises: text information and enhancement mark information, the text information is used for describing image content to be generated, and the image content comprises: the enhancement mark information is used for determining the position characteristics and the image characteristics of the at least one target object; and carrying out multi-mode image generation on the multi-mode prompt information by adopting an image generation model to obtain a target image, wherein the image generation model is used for generating the target image by adopting a multi-mode image generation mode.
Alternatively, fig. 10 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 10, the computer terminal a may include: one or more (only one is shown) processors 1002, memory 1004, a memory controller, and a peripheral interface, wherein the peripheral interface is coupled to a radio frequency module, an audio module, and a display.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the image generating method and apparatus in the embodiments of the present application, and the processor executes the software programs and modules stored therein, thereby executing various functional applications and data processing, that is, implementing the image generating method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further comprise memory remotely located from the processor, the remote memory being connectable to the computer terminal a through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring multi-mode prompt information, wherein the multi-mode prompt information comprises: text information and enhancement mark information, the text information is used for describing image content to be generated, and the image content comprises: the enhancement mark information is used for determining the position characteristics and the image characteristics of the at least one target object; and carrying out multi-mode image generation on the multi-mode prompt information by adopting an image generation model to obtain a target image, wherein the image generation model is used for generating the target image by adopting a multi-mode image generation mode.
Optionally, the above processor may further execute program code for: acquiring text information; generating the position feature of at least one target object and the image feature of at least one target object based on the text information respectively to obtain enhancement mark information; and combining the text information and the enhancement mark information to obtain the multi-mode prompt information.
Optionally, the above processor may further execute program code for: performing part-of-speech analysis on the text information, and selecting at least one target word, wherein the at least one target word meets the preset part-of-speech requirement, and the at least one target word is used for determining at least one target object; generating a location feature of at least one target object based on the text information and the at least one target word, and generating an image feature of the at least one target object based on the text information and the location feature of the at least one target object; enhancement tag information is determined using the at least one target word, the location feature of the at least one target object, and the image feature of the at least one target object.
Optionally, the above processor may further execute program code for: using a position feature generation model to perform position feature generation on the text information and the at least one target word to obtain the position features of the at least one target object.

Optionally, the above processor may further execute program code for: using an image feature generation model to perform image feature generation on the text information and the position features of the at least one target object to obtain the image features of the at least one target object.

Optionally, the above processor may further execute program code for: performing text encoding on the text information by using a text encoder to obtain global text features, and performing text encoding on the at least one target word by using the text encoder to obtain object text features.

Optionally, the above processor may further execute program code for: combining the object text features, the position features of the at least one target object and the image features of the at least one target object to obtain the enhancement mark information.

Optionally, the above processor may further execute program code for: combining the global text features, the object text features, the position features of the at least one target object and the image features of the at least one target object to obtain the multi-mode prompt information.

Optionally, the above processor may further execute program code for: acquiring text information and additional information, where the additional information includes at least one of the following: position information of the at least one target object and image information of the at least one target object; determining the enhancement mark information based on the text information and the additional information; and combining the text information and the enhancement mark information to obtain the multi-mode prompt information.
By adopting this embodiment of the present application, multi-mode prompt information including text information and enhancement mark information is acquired; the image content to be generated can be determined according to the text information, and the position features and the image features of the at least one target object in the image content can be determined according to the enhancement mark information. An image generation model is then used to perform multi-mode image generation on the multi-mode prompt information to obtain a target image, achieving the aim of accurately generating, through the multi-mode prompt information, a high-quality target image that includes multiple target objects. In addition, more accurate control can be exercised over the generated image, images with higher accuracy are generated, and generation of images of multiple target objects is supported, so that the range of image generation is wider. This realizes more accurate and diversified high-quality image generation for multiple target objects, achieves the technical effect of meeting image generation requirements in different fields, and further solves the technical problems in the related art that the generated image has low accuracy and the image generation range is limited because the corresponding image is generated only through a text prompt.
It will be appreciated by those skilled in the art that the structure shown in fig. 10 is only illustrative, and the computer terminal A may be a terminal device such as a smartphone (e.g. an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID) or a PAD. The structure shown in fig. 10 does not limit the structure of the above electronic device. For example, the computer terminal A may also include more or fewer components (such as a network interface, a display device, etc.) than shown in fig. 10, or have a different configuration from that shown in fig. 10.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing hardware related to a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Example 6
Embodiments of the present application also provide a computer-readable storage medium. Optionally, in this embodiment, the computer-readable storage medium may be used to store the program code for executing the image generation method provided in embodiment 1 above.
Alternatively, in this embodiment, the above-mentioned computer-readable storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.
Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: acquiring multi-mode prompt information, wherein the multi-mode prompt information comprises: text information and enhancement mark information, the text information is used for describing image content to be generated, and the image content comprises: the enhancement mark information is used for determining the position characteristics and the image characteristics of the at least one target object; and carrying out multi-mode image generation on the multi-mode prompt information by adopting an image generation model to obtain a target image, wherein the image generation model is used for generating the target image by adopting a multi-mode image generation mode.
Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: acquiring text information; generating the position feature of at least one target object and the image feature of at least one target object based on the text information respectively to obtain enhancement mark information; and combining the text information and the enhancement mark information to obtain the multi-mode prompt information.
Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: performing part-of-speech analysis on the text information, and selecting at least one target word, wherein the at least one target word meets the preset part-of-speech requirement, and the at least one target word is used for determining at least one target object; generating a location feature of at least one target object based on the text information and the at least one target word, and generating an image feature of the at least one target object based on the text information and the location feature of the at least one target object; enhancement tag information is determined using the at least one target word, the location feature of the at least one target object, and the image feature of the at least one target object.
Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: and adopting a position feature generation model to generate position features of the text information and the at least one target word to obtain the position features of the at least one target object.
Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: and adopting an image feature generation model to generate image features of the text information and the position features of the at least one target object to obtain the image features of the at least one target object.
Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: and performing text coding on the text information by adopting a text coder to obtain global text characteristics, and performing text coding on at least one target word by adopting the text coder to obtain object text characteristics.
Optionally, in the present embodiment, the computer readable storage medium is configured to store program code for performing the steps of: and combining the text characteristics of the object, the position characteristics of at least one target object and the image characteristics of at least one target object to obtain the enhanced marking information.
Optionally, in the present embodiment, the computer-readable storage medium is configured to store program code for performing the following step: combining the global text features, the object text features, the position features of the at least one target object, and the image features of the at least one target object to obtain the multi-modal prompt information.
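The two "combining" steps above can be sketched as concatenation and stacking, under the assumption that all features are first projected to a common width; the patent leaves the exact combination operation open.

```python
import torch

def build_enhancement_mark(object_text_feat, position_feat, image_feat):
    # Assumed combination: concatenate the per-object features into one vector.
    return torch.cat([object_text_feat, position_feat, image_feat], dim=-1)

def build_multimodal_prompt(global_text_feat, enhancement_marks):
    # Assumed combination: treat the global text feature and each per-object
    # mark as tokens of one prompt sequence (shape: batch x tokens x width),
    # assuming all inputs were projected to the same width beforehand.
    return torch.stack([global_text_feat, *enhancement_marks], dim=1)
```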
Optionally, in the present embodiment, the computer-readable storage medium is configured to store program code for performing the following steps: acquiring the text information and additional information, wherein the additional information comprises at least one of the following: position information of the at least one target object and image information of the at least one target object; determining the enhancement mark information based on the text information and the additional information; and combining the text information and the enhancement mark information to obtain the multi-modal prompt information.
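When the user supplies additional information directly, the generated features can simply be replaced by the user-provided ones. A hypothetical sketch follows; the dictionary layout and the placeholder image are illustrative, not prescribed by the patent.

```python
from PIL import Image

def marks_from_additional_info(target_words, boxes=None, reference_images=None):
    """Assumed flow: user-supplied boxes and/or reference images take the
    place of the generated position and image features."""
    marks = []
    for i, word in enumerate(target_words):
        marks.append({
            "word": word,
            "position": boxes[i] if boxes else None,  # e.g. (x1, y1, x2, y2)
            "image": reference_images[i] if reference_images else None,
        })
    return marks

marks = marks_from_additional_info(
    ["cat"],
    boxes=[(0.1, 0.2, 0.5, 0.8)],
    reference_images=[Image.new("RGB", (256, 256))],  # stand-in for a user photo
)
```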
The sequence numbers of the foregoing embodiments of the present application are merely for description and do not imply any preference among the embodiments.
In the foregoing embodiments of the present application, each embodiment is described with its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between units or modules may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present application, and such improvements and modifications shall also fall within the protection scope of the present application.

Claims (14)

1. An image generation method, comprising:
acquiring multi-modal prompt information, wherein the multi-modal prompt information comprises: text information and enhancement mark information, wherein the text information is used for describing image content to be generated, the image content comprises: at least one target object, and the enhancement mark information is used to determine position features and image features of the at least one target object;
and performing multi-modal image generation on the multi-modal prompt information by adopting an image generation model to obtain a target image, wherein the image generation model is used for generating the target image in a multi-modal image generation manner.
2. The image generation method according to claim 1, wherein acquiring the multi-modal prompt information includes:
acquiring the text information;
respectively generating, based on the text information, the position features of the at least one target object and the image features of the at least one target object, to obtain the enhancement mark information;
and combining the text information and the enhancement mark information to obtain the multi-modal prompt information.
3. The image generation method according to claim 2, wherein respectively generating the position features and the image features of the at least one target object based on the text information to obtain the enhancement mark information includes:
performing part-of-speech analysis on the text information and selecting at least one target word, wherein the at least one target word meets a preset part-of-speech requirement and is used for determining the at least one target object;
generating the position features of the at least one target object based on the text information and the at least one target word, and generating the image features of the at least one target object based on the text information and the position features of the at least one target object;
and determining the enhancement mark information by using the at least one target word, the position features of the at least one target object, and the image features of the at least one target object.
4. The image generation method of claim 3, wherein generating the position features of the at least one target object based on the text information and the at least one target word comprises:
adopting a position feature generation model to perform position feature generation on the text information and the at least one target word, to obtain the position features of the at least one target object.
5. The image generation method of claim 3, wherein generating the image features of the at least one target object based on the text information and the position features of the at least one target object comprises:
adopting an image feature generation model to perform image feature generation on the text information and the position features of the at least one target object, to obtain the image features of the at least one target object.
6. The image generation method according to claim 3, characterized in that the image generation method further comprises:
performing text encoding on the text information by adopting a text encoder to obtain global text features, and performing text encoding on the at least one target word by adopting the text encoder to obtain object text features.
7. The image generation method according to claim 6, wherein determining the enhancement mark information by using the at least one target word, the position features of the at least one target object, and the image features of the at least one target object comprises:
combining the object text features, the position features of the at least one target object, and the image features of the at least one target object to obtain the enhancement mark information.
8. The image generation method according to claim 6, wherein combining the text information and the enhancement mark information to obtain the multi-modal prompt information includes:
combining the global text features, the object text features, the position features of the at least one target object, and the image features of the at least one target object to obtain the multi-modal prompt information.
9. The image generation method according to claim 1, wherein acquiring the multi-modal prompt information includes:
acquiring the text information and additional information, wherein the additional information comprises at least one of the following: position information of the at least one target object and image information of the at least one target object;
determining the enhancement mark information based on the text information and the additional information;
and combining the text information and the enhancement mark information to obtain the multi-modal prompt information.
10. An image generation method, characterized in that a graphical user interface is provided by a terminal device, the content displayed by the graphical user interface at least partially containing an image generation scene, the image generation method comprising:
inputting text information in response to a first control operation performed on the graphical user interface, wherein the text information is used for describing image content to be generated, and the image content comprises: at least one target object;
in response to a second control operation performed on the graphical user interface, respectively generating the position features of the at least one target object and the image features of the at least one target object based on the text information to obtain enhancement mark information, and performing multi-modal image generation on multi-modal prompt information by adopting an image generation model to obtain a target image, wherein the enhancement mark information is used for determining the position features and the image features of the at least one target object, the image generation model is used for generating the target image in a multi-modal image generation manner, and the multi-modal prompt information comprises: the text information and the enhancement mark information;
and presenting the target image within the graphical user interface.
11. An image generation method, comprising:
receiving a currently input multi-modal dialog request, wherein information carried in the multi-modal dialog request comprises: multi-modal dialog text information and multi-modal dialog enhancement mark information, wherein the multi-modal dialog text information is used for describing multi-modal dialog image content to be generated, the multi-modal dialog image content comprises: at least one object, and the multi-modal dialog enhancement mark information is used to determine position features and image features of the at least one object;
performing multi-modal image generation on the multi-modal dialog request by adopting an image generation model to obtain a multi-modal dialog image, wherein the image generation model is used for generating the multi-modal dialog image in a multi-modal image generation manner;
and feeding back a multi-modal dialog reply, wherein the information carried in the multi-modal dialog reply comprises: the multi-modal dialog image.
12. The image generation method of claim 11, wherein receiving the currently input multi-modal dialog request comprises:
acquiring the multi-modal dialog text information;
respectively generating, based on the multi-modal dialog text information, the position features of the at least one object and the image features of the at least one object, to obtain the multi-modal dialog enhancement mark information;
and combining the multi-modal dialog text information and the multi-modal dialog enhancement mark information to obtain the multi-modal dialog request.
13. An electronic device, comprising:
a memory storing an executable program;
a processor for executing the program, wherein the program when executed performs the image generation method of any one of claims 1 to 12.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored executable program, wherein the executable program, when run, controls a device in which the computer-readable storage medium is located to perform the image generation method of any one of claims 1 to 12.
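As a non-normative illustration of the dialog-style flow of claims 11 and 12, a request/reply round trip might be organized as follows; all class and method names are hypothetical, and the patent prescribes no message format.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class MultiModalDialogRequest:
    text: str                    # describes the dialog image content to be generated
    marks: List[Any] = field(default_factory=list)  # per-object position/image features

@dataclass
class MultiModalDialogReply:
    image: Any                   # the generated multi-modal dialog image

def handle_dialog_turn(model: Any, request: MultiModalDialogRequest) -> MultiModalDialogReply:
    """Receive the request, generate the dialog image, and feed back a reply."""
    image = model.generate(text=request.text, marks=request.marks)
    return MultiModalDialogReply(image=image)
```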
CN117689749A (en) Image generation method, electronic device, and computer-readable storage medium

Priority Applications (1)

Application Number: CN202311613798.2A
Priority Date / Filing Date: 2023-11-28
Title: Image generation method, electronic device, and computer-readable storage medium
Status: Pending

Publications (1)

Publication Number: CN117689749A
Publication Date: 2024-03-12

Family ID: 90125631

Country Status (1)

Country: CN
Link: CN117689749A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination