CN116012492A - Prompt word intelligent optimization method and system for text-generated images - Google Patents

Prompt word intelligent optimization method and system for text-generated images

Info

Publication number
CN116012492A
Authority
CN
China
Prior art keywords
text
input
word
intention
prompt
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211603803.7A
Other languages
Chinese (zh)
Inventor
范凌 (Fan Ling)
王建楠 (Wang Jiannan)
裴子龙 (Pei Zilong)
王喆 (Wang Zhe)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tezign Shanghai Information Technology Co Ltd
Original Assignee
Tezign Shanghai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tezign Shanghai Information Technology Co Ltd filed Critical Tezign Shanghai Information Technology Co Ltd
Priority to CN202211603803.7A priority Critical patent/CN116012492A/en
Publication of CN116012492A publication Critical patent/CN116012492A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a prompt word intelligent optimization method and system for text-generated images. The method acquires input text entered by a user and feeds the input text into a text classification model to obtain the subject intention of the input text, wherein the subject intention characterizes whether the drawing intention of the user includes a person or not; the input text is fed into a pre-trained text-to-text generation model, and a supplemented prompt word text is output based on the generation parameters of the text-to-text generation model; if the subject intention does not include a person, artist names with a person-drawing tendency attribute are removed from the prompt word text according to a pre-generated artist list, yielding the optimized input prompt word. The method and system solve the technical problems in the related art that prompt supplementation and optimization methods are inefficient, carry high trial-and-error cost, and cannot accurately supplement the intention of the user's input prompt, and achieve accurate supplementation of the user's prompt intention with high efficiency and low trial-and-error cost.

Description

Prompt word intelligent optimization method and system for text-generated images
Technical Field
The application belongs to the technical field of computers, and particularly relates to a prompt word intelligent optimization method and system for text-generated images.
Background
In recent years, the capability of AI (artificial intelligence) to generate images from text has matured; in particular, the Stable Diffusion model has greatly improved image generation speed and quality. However, within this text-to-image AI capability, the prompt (the input prompt text) has a great influence on the quality of the generated images (prompt engineering is the systematic search for prompts that elicit the desired results from a language model, where the desired results depend on the final task and the end user). In particular, some modifier prompt words can greatly improve the visual aesthetics of the generated results. Moreover, most user-entered prompts are not described in enough detail to fully exploit Stable Diffusion's creative capabilities, so supplementation of the prompt is necessary.
In the related art, methods for supplementing and optimizing the prompt are inefficient, carry high trial-and-error cost, and cannot accurately supplement the intention of the user's input prompt.
Aiming at these technical problems in the related art, no effective solution has been proposed so far.
Disclosure of Invention
Therefore, embodiments of the present application provide a prompt word intelligent optimization method and system for text-generated images, aiming to solve at least one of the problems in the prior art.
In order to achieve the above object, in a first aspect, the present application provides a prompt word intelligent optimization method for text-generated images, including:
acquiring input text entered by a user, and feeding the input text into a text classification model to obtain a subject intention of the input text, wherein the subject intention characterizes whether the drawing intention of the user includes a person or not;
feeding the input text into a pre-trained text-to-text generation model, and outputting a supplemented prompt word text based on the generation parameters of the text-to-text generation model;
if the subject intention does not include a person, removing artist names with a person-drawing tendency attribute from the prompt word text according to a pre-generated artist list to obtain an optimized input prompt word; wherein the artist list includes a plurality of artist names having a person-drawing tendency attribute.
In one embodiment, the method further comprises: acquiring a training sample set, extracting the subject intention word of each input prompt word text in the sample set, and determining the subject intention of each input prompt word text according to the subject intention word;
obtaining target input prompt word texts to form pre-training sample data, wherein the target input prompt word texts are those input prompt word texts whose sentence head is the subject intention word; and processing the pre-training sample data with an undersampling technique so that the proportions of samples whose subject intention includes a person and samples whose subject intention does not include a person are consistent, obtaining the training sample data;
training with the subject intention word of each input prompt word text in the training sample data as the input of a text-to-text generation model, to obtain the pre-trained text-to-text generation model.
In one embodiment, training with the subject intention word of each input prompt word text in the training sample data as the input of the text-to-text generation model to obtain the pre-trained text-to-text generation model includes: taking the subject intention word of each input prompt word text in the training sample data as the input of the text-to-text generation model, having the model continue the subject intention word into a complete training prompt, calculating the cross-entropy loss between the training prompt and each word of the corresponding input prompt word text, propagating the cross-entropy loss back to the text-to-text generation model, and correcting the model weights with an optimizer, until training stops when the cross-entropy loss has not decreased for N steps, where N is a preset positive integer, obtaining the text-to-text generation model.
In one embodiment, extracting the subject intention word of each input prompt word text in the sample set includes: splitting each input prompt word text at symbols into a plurality of fragments, counting the occurrence frequency of each fragment, and taking a fragment whose frequency meets a threshold and which sits at the sentence head of the input prompt word text as the subject intention word.
In one embodiment, the text classification model is a zero-shot (zero sample) text classification model.
In one embodiment, the method further comprises: fitting the artist names in the input prompt word texts of the sample set against the corresponding subject intentions using a decision tree, to obtain an influence degree value of each artist name on images having a person attribute; filtering out artist names whose occurrence count in the sample set is smaller than a preset value; calculating the person-drawing frequency of each artist name from the number of times the artist name draws a person and the number of times it appears in the sample set; and adding artist names whose influence value is larger than the influence mean and whose person-drawing frequency is higher than a preset value into a list, to generate the artist list.
In one embodiment, if the subject intention includes a person, the prompt word text is directly used as the input prompt word.
In a second aspect, the present application further provides a prompt word intelligent optimization system for text-generated images, including:
an acquisition unit, configured to acquire input text entered by a user, and feed the input text into a text classification model to obtain a subject intention of the input text, wherein the subject intention characterizes whether the drawing intention of the user includes a person or not;
a processing unit, configured to feed the input text into a pre-trained text-to-text generation model, and output a supplemented prompt word text based on the generation parameters of the text-to-text generation model;
a generating unit, configured to, if the subject intention does not include a person, remove artist names with a person-drawing tendency attribute from the prompt word text according to a pre-generated artist list, to obtain an optimized input prompt word; wherein the artist list includes a plurality of interfering artist names having a person-drawing tendency attribute.
According to the prompt word intelligent optimization method and system for text-generated images, input text entered by a user is acquired and fed into a text classification model to obtain the subject intention of the input text, the subject intention characterizing whether the drawing intention of the user includes a person or not; the input text is fed into a pre-trained text-to-text generation model, and a supplemented prompt word text is output based on the generation parameters of the text-to-text generation model; if the subject intention does not include a person, artist names with a person-drawing tendency attribute are removed from the prompt word text according to a pre-generated artist list, obtaining an optimized input prompt word; wherein the artist list includes a plurality of interfering artist names having a person-drawing tendency attribute. The technical problems in the related art that prompt supplementation and optimization methods are inefficient, carry high trial-and-error cost, and cannot accurately supplement the intention of the user's input prompt are thereby solved, with the following beneficial effects: by analyzing the drawing subject intention of the text entered by the user, the supplemented input prompt word is guaranteed to accurately express the user's intention; pre-training an advanced text generation model reduces the user's trial-and-error cost and improves the efficiency and accuracy of text supplementation; and filtering artist names prevents the user's intention from being changed by an artist's style. The method thus supplements the intention of the user's input prompt accurately, with high efficiency and low trial-and-error cost.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application, its other features, objects, and advantages. The drawings of the illustrative embodiments of the present application and their descriptions serve to illustrate the present application and are not to be construed as unduly limiting it. In the drawings:
fig. 1 is an implementation flow chart of a prompt word intelligent optimization method for text-generated images according to an embodiment of the present application;
FIG. 2 is an overall flowchart of a prompt word intelligent optimization method for text-generated images according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the main modules of a prompt word intelligent optimization system for text-generated images according to an embodiment of the present application;
FIG. 4 is a diagram of an exemplary system architecture to which embodiments of the present application may be applied;
fig. 5 is a schematic diagram of a computer system suitable for use in implementing the terminal device or server of the embodiments of the present application.
Detailed Description
In order to make the solution of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present application, not all of them. All other embodiments obtained by one of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the present application described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the present application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal" and the like indicate an azimuth or a positional relationship based on that shown in the drawings. These terms are used primarily to better describe the present application and its embodiments and are not intended to limit the indicated device, element or component to a particular orientation or to be constructed and operated in a particular orientation.
Also, some of the terms described above may be used to indicate other meanings in addition to orientation or positional relationships, for example, the term "upper" may also be used to indicate some sort of attachment or connection in some cases. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art as appropriate.
In addition, the term "plurality" shall mean two as well as more than two.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows an implementation flow chart of the prompt word intelligent optimization method for text-generated images according to an embodiment of the present application, and fig. 2 shows the overall flow of the method. For convenience of explanation, only the parts related to the embodiments of the present application are shown, described in detail below:
In recent years, the capability of AI to generate images from text has matured; in particular, the Stable Diffusion model (Stable Diffusion is an open-source image generation model from Stability AI) has greatly improved image generation speed and quality. However, within this text-to-image AI capability, the prompt (the input prompt text) has a great influence on the quality of the generated images (prompt engineering is the systematic search for prompts that elicit the desired results from a language model, where the desired results depend on the final task and the end user). In particular, some modifier prompt words can greatly improve the visual aesthetics of the generated results. Moreover, most user-entered prompt descriptions are inadequate and cannot fully exploit Stable Diffusion's creative capabilities, so supplementation of the prompt is necessary. For example: the original input text is "Nezha", and the optimized prompt is "Nezha, D&D, fantasy, portrait, high-resolution, digital painting, trending on artstation, concept art, sharp focus, illustration, art by Artgerm and Greg Rutkowski and Alphonse Mucha"; the image styles generated from the original input text and from the optimized prompt differ greatly. Some refined picture modifier words can thus greatly improve the quality of image generation. Accordingly, the present application provides an intelligent completion and optimization capability for Stable Diffusion prompts, so that Stable Diffusion can create higher-quality images.
Here, existing approaches to prompt optimization and supplementation can be divided into the following:
1. providing supplemental word choices: after the user inputs the original prompt, a list of supplementary modifiers is provided, and the user autonomously selects prompt words to fill in (such as the promptoMANIA prompt builder);
2. finding an approximate high-quality prompt: searching prompt collection websites for text-generated images for a prompt related to the user's intention, to use as the input of the current generation task;
3. supplementing the prompt with a generation model: a prompt generation model (e.g., the open-source model MagicPrompt) is trained on public text-to-image prompt datasets to supplement the user's basic input description.
The above methods can each alleviate the user's lack of supplementary words when drawing, but they have different problems:
1. the method of providing supplemental word choices:
a. some tools directly expose all optional prompt words, often offering the user thousands of candidates; since different word combinations also greatly affect generation quality, this approach adds considerable trial-and-error cost for the user;
b. some tools organize and classify the modifier vocabulary and supplementary artists, which makes the offered vocabulary more structured, but this organization requires a great deal of manual knowledge and processing; the method is not an end-to-end, one-step completion and still incurs some trial-and-error cost;
2. the method of finding an approximate high-quality prompt does provide the user with a ready-made, high-quality prompt, but:
a. on the one hand, the user's original intention is diluted or changed to varying degrees; for example, the user's intention is "Iron Man with Venom", but the approximate prompt found may be "venom in a venom inspired ironman suit, black and red, dynamic lighting, photorealistic fantasy concept art, trending on artstation, stunning visual, terrifying, creation, cinematic";
b. on the other hand, it also faces a cold-start problem when no approximate prompt exists;
3. the method of supplementing the prompt with a generation model (such as the open-source model MagicPrompt) can supplement intention-appropriate words and largely ensure the quality of the generated images, but because the intentions in the training data are imbalanced and the main intention is often buried in the middle or at the end of the text, this method leads to:
a. the output prompt may deviate from the user's original intention, adding redundant intention subjects; for example, for the original input "An apple", the model outputs: "An apple, green forest, clearing, beige leaves, in style of Yoji Shinkawa and Hyung-tae Kim, trending on ArtStation, dark fantasy, great composition, concept art, highly detailed". The intention "An apple" is diluted, and subjects such as "green forest" are additionally added to the generated image;
b. the supplemented words tend to include artist names that frequently draw figures; in the "An apple" case, added artists such as "Yoji Shinkawa" make the image generation model more inclined to draw figures rather than an "apple", and with arbitrary text input this mismatch between input and generated result is common. Therefore, the scheme disclosed in this application provides an intelligent prompt completion and optimization capability: an advanced text generation model is fine-tuned on a massive, self-collected prompt word dataset, with data preprocessing and post-processing, overcoming the intention-changing and figure-drawing biases of the other techniques.
In the present application, the intelligent prompt completion and optimization method may comprise the following steps:
1. data analysis and cleaning: distinguish the drawing intention type of each prompt in the dataset, i.e., whether the drawing intention includes a person or not;
2. based on the intention type of the prompt, analyze each artist's tendency to cause persons to appear in the generated result;
3. based on the intention type of the prompt, balance the data between prompts containing persons and prompts not containing persons;
4. on the cleaned and balanced data, use the subject intention, i.e., the drawing intention, as the model input and the complete prompt as the model training target, and fine-tune a T5 model (T5: Text-to-Text Transfer Transformer, a text-to-text generation model);
5. in the model application stage, first distinguish the intention type of the user input, and filter out of the generated prompt those artist names that are highly likely to cause persons to appear in the result when the intention does not contain a person.
In one embodiment of the present application, a prompt word intelligent optimization method for text-generated images is provided, comprising the following steps:
S101: acquiring input text entered by a user, and feeding the input text into a text classification model to obtain a subject intention of the input text, wherein the subject intention characterizes whether the drawing intention of the user includes a person or not;
S102: feeding the input text into a pre-trained text-to-text generation model, and outputting a supplemented prompt word text based on the generation parameters of the text-to-text generation model;
S103: if the subject intention does not include a person, removing artist names with a person-drawing tendency attribute from the prompt word text according to a pre-generated artist list, to obtain an optimized input prompt word; wherein the artist list includes a plurality of interfering artist names having a person-drawing tendency attribute.
In step S101: the input text entered by the user is acquired and fed into a text classification model to obtain the subject intention of the input text, the subject intention characterizing whether the drawing intention of the user includes a person or not. The input text is the original prompt text entered by the user. Before image generation, after the user enters the original prompt text, this input text is acquired and fed into a preset text classification model to obtain the subject intention of the user's input text; the subject intention is the drawing intention of the input text, namely whether the user intends an image containing a person or an image not containing a person.
Here, the text classification model may be a zero-shot (zero sample) text classification model (no training required) used to distinguish the intention of the input sentence. The target categories of the zero-shot text classification model are as follows:
1["human",
2"portrait",
3"comics animation movie characters",
4"goods",
5"food",
6"fruit",
7"electronic product",
8"product",
9"traffic",
10"city",
11"natural",
12"animal",
13"tools"];
when the model determines that the subject intention is one of ["human", "portrait", "comics animation movie characters"], the subject intention is judged to include a person; the remaining categories are judged as not including a person.
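Illustratively, this classification step can be reproduced with an off-the-shelf zero-shot pipeline. The following is a minimal sketch assuming the HuggingFace transformers library and a generic NLI backbone; the application does not name a specific checkpoint, so facebook/bart-large-mnli here is an assumption:

```python
from transformers import pipeline

# The 13 target categories defined above; the first three imply a person.
PERSON_LABELS = {"human", "portrait", "comics animation movie characters"}
ALL_LABELS = [
    "human", "portrait", "comics animation movie characters", "goods",
    "food", "fruit", "electronic product", "product", "traffic",
    "city", "natural", "animal", "tools",
]

# Assumed checkpoint; any NLI-based zero-shot model works the same way.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def intention_includes_person(input_text: str) -> bool:
    """Return True if the drawing intention of the input text includes a person."""
    result = classifier(input_text, candidate_labels=ALL_LABELS)
    return result["labels"][0] in PERSON_LABELS  # labels come sorted by score

# e.g. intention_includes_person("Nezha") -> True
#      intention_includes_person("An apple") -> False
```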
In step S102: the input text is fed into the pre-trained text-to-text generation model, and the supplemented prompt word text is output based on the generation parameters of the text-to-text generation model. After the intention of the user input text has been classified, the input text is fed into the pre-trained text generation model, and the user's input text is intelligently supplemented and optimized based on preset generation parameters, obtaining an optimized complete prompt word text, i.e., an optimized prompt input, from which a high-quality image conforming to the user's drawing intention can be generated. Here, the generation parameters of the text-to-text generation model are the settings used at inference time when sampling the output logits of the model into a language sequence. Given the input, the model weights are applied and the model produces a final output matrix; at this point a final complete output sentence has not yet been obtained. The matrix output by the model is given to a sequence sampler, which decodes the vector at each position into concrete tokens using a greedy search or beam search strategy, finally yielding the output text sequence; the generation parameters are the parameters of this sequence-sampling strategy.
In one embodiment, the generation parameters of the text-to-text generation model are tuned based on the observed accuracy of the model. Specifically, the parameters of the generation part of the model are set as follows (see the sketch after this list):
do_sample: True (increases the diversity of the generated results);
early_stopping: True (generate results only within the highest-probability generation range);
length_penalty: 1.0 (the model will be more inclined to produce short, complete results);
and, in combination with the subject intention classification, the generation length of the model is controlled:
i. when the intention is one of ["human", "portrait", "comics animation movie characters"], the minimum completion length is set to 70 (the text encoder of Stable Diffusion accepts at most 77 tokens), so that more portrait-type artists and modifiers can be supplemented;
ii. when the intention is one of ["food", "fruit", "electronic product", "product", "tools"], the minimum completion length is set to 30; the model outputs fewer and more precise modifiers, so that Stable Diffusion avoids persons and rarely supplements artists;
iii. for the remaining intentions, such as "city", the minimum completion length is set to 50, balancing supplemented modifiers and artists.
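A minimal sketch of how these generation settings might be passed to a fine-tuned T5 checkpoint via the transformers generate() API; the checkpoint name and the intention-group mapping are illustrative assumptions, not the application's released model:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Minimum completion length per intention group, as listed above.
MIN_LENGTH = {"person": 70, "object": 30, "other": 50}

def complete_prompt(input_text: str, intent_group: str) -> str:
    inputs = tokenizer(input_text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        do_sample=True,          # increase diversity of the generated results
        early_stopping=True,
        length_penalty=1.0,
        min_length=MIN_LENGTH.get(intent_group, 50),
        max_length=77,           # Stable Diffusion's text encoder takes at most 77 tokens
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# e.g. complete_prompt("An apple", "object")
```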
Therefore, once the subject intention of the user's input text has been distinguished, prompt word supplementation can be performed directly with the model's generation parameter settings; that is, the text entered by the user (such as "An apple") is given to the trained T5 model, and the model supplements the input text into a complete prompt.
Note that the pre-trained text-to-text generation model may be obtained by fine-tuning a T5 model (T5: Text-to-Text Transfer Transformer, a text-to-text generation model).
Before training the model, data preparation is performed first: this application uses a massive, pre-collected prompt word dataset containing 140,000+ prompt texts capable of generating high-quality images, for example: prompt 1 "Anthropomorphized pineapple pizza,D&D,fantasy,cinematic lighting,highly detailed,digital painting,artstation,concept art,smooth,sharp focus,illustration,warm light,cozy warm tint,magic the gathering artwork,volumetric lighting,8k,no gold,no gold colours,art by Akihiko Yoshida,Greg Rutkowski"; prompt 2 "a fox fursona, trending on pixiv, by kawacy, furry art, digital art, cyberpunk, high quality, backlighting".
Then, data analysis is performed on the prepared data: as the examples above show, a high-quality prompt generally contains three parts (taking prompt 1 as an example):
1. the main intention, which dominates the main picture content in the generated result and is also the main intention entered by the user, for example: "Anthropomorphized pineapple pizza";
2. picture style modifiers, whose addition raises picture quality, for example: "D&D, fantasy, cinematic lighting, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, warm light, cozy warm tint, magic the gathering artwork, volumetric lighting, 8k, no gold, no gold colours";
3. artist supplements; artist names may dominate the overall style of the picture, for example: "art by Akihiko Yoshida, Greg Rutkowski".
Here, the 140,000+ prompt dataset exhibits the following regularities:
1. the picture style modification is mostly a concatenation of multiple picture description words, each of which recurs many times across the whole dataset;
2. the artist supplement is likewise a concatenation of multiple artist names, each repeated many times throughout the dataset;
3. in the main intention descriptions of the dataset, person-drawing intentions account for more than 50%.
In order to avoid biasing the output of the text generation model toward person-drawing prompts, some preprocessing and cleaning operations are required on the data.
Specifically, after preparing and analyzing the sample dataset, the T5 model may be trained with the following method: acquire a training sample set, extract the subject intention word of each input prompt word text in the sample set, and determine the subject intention of each input prompt word text from the subject intention word; obtain the target input prompt word texts, i.e., those input prompt word texts whose sentence head is the subject intention word, to form the pre-training sample data; process the pre-training sample data with an undersampling technique so that the proportions of samples whose subject intention includes a person and samples whose subject intention does not are consistent, keeping the ratio at 1:1 in the training sample data (see the sketch after this paragraph); and train with the subject intention word of each input prompt word text in the training sample data as the input of the text-to-text generation model, obtaining the pre-trained text-to-text generation model. Here, only sample data whose subject intention sits at the beginning of the sentence is taken, preventing the generation model from learning to supplement additional subject vocabulary during training.
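The data-balancing step might look like the following sketch; the record field names (prompt, intent_word, has_person) are assumptions for illustration:

```python
import random

def balance_samples(samples, seed=0):
    """samples: list of dicts like {"prompt": str, "intent_word": str, "has_person": bool}."""
    # Keep only prompts that literally start with their subject-intention word.
    head = [s for s in samples if s["prompt"].startswith(s["intent_word"])]
    person = [s for s in head if s["has_person"]]
    no_person = [s for s in head if not s["has_person"]]
    # Undersample the majority class to a 1:1 ratio.
    n = min(len(person), len(no_person))
    rng = random.Random(seed)
    return rng.sample(person, n) + rng.sample(no_person, n)
```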
Further, training with the subject intention word of each input prompt word text in the training sample data as the input of the text-to-text generation model, to obtain the pre-trained text-to-text generation model, includes: taking the subject intention word of each input prompt word text in the training sample data as the input of the text-to-text generation model, having the model continue the subject intention word into a complete training prompt, calculating the cross-entropy loss between the training prompt and each word of the corresponding input prompt word text, and judging whether the cross-entropy loss has decreased compared with the previous cross-entropy loss; if so, propagating the cross-entropy loss back to the text-to-text generation model and repeating the above steps after correcting the model weights with an optimizer, until training stops when the cross-entropy loss has not decreased for N steps, obtaining the text-to-text generation model, where N is a preset positive integer. Here, correcting the model weights refers to the weight optimization strategy of deep learning training; mainly, an AdamW optimizer computes the loss gradients and corrects the model weights.
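A sketch of this fine-tuning loop under PyTorch + transformers assumptions; T5 computes the token-level cross-entropy internally when labels are supplied, and the N-step stopping rule is implemented as a patience counter:

```python
import torch
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = AdamW(model.parameters(), lr=1e-4)      # assumed learning rate

def fine_tune(pairs, n_patience=1000):
    """pairs: iterable of (subject_intention_word, full_prompt) string tuples."""
    best_loss, steps_since_best = float("inf"), 0
    model.train()
    for intent, full_prompt in pairs:
        inputs = tokenizer(intent, return_tensors="pt")
        labels = tokenizer(full_prompt, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # cross-entropy over each target token
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if loss.item() < best_loss:
            best_loss, steps_since_best = loss.item(), 0
        else:
            steps_since_best += 1
        if steps_since_best >= n_patience:  # loss has not decreased for N steps
            break
```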
In one embodiment, extracting the subject intention word of each input prompt word text in the sample set includes: splitting each input prompt word text at symbols into a plurality of fragments, counting the occurrence frequency of each fragment, and taking a fragment whose frequency meets the threshold and which sits at the sentence head of the input prompt word text as the subject intention word.
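A sketch of this extraction step, assuming commas as the splitting symbol and the five-occurrence threshold used later in this description (style modifiers and artist names repeat across the corpus, while genuine subject intentions rarely do):

```python
from collections import Counter

def extract_intent_words(prompts, max_freq=5):
    """Map each prompt to its sentence-head fragment when that fragment is rare."""
    freq = Counter(seg.strip() for p in prompts for seg in p.split(","))
    intents = {}
    for p in prompts:
        head = p.split(",")[0].strip()
        if freq[head] <= max_freq:  # rare sentence-head fragment -> subject intention
            intents[p] = head
    return intents
```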
In one embodiment, the method further comprises: fitting the artist names in the input prompt word texts of the sample set against the corresponding subject intentions using a decision tree, to obtain an influence degree value of each artist name on images having a person attribute; filtering out artist names whose occurrence count in the sample set is smaller than a preset value; calculating the person-drawing frequency of each artist name from the number of times the artist name draws a person and the number of times it appears in the sample set; and adding artist names whose influence value is larger than the influence mean and whose person-drawing frequency is higher than a preset value into a list, to generate the artist list. Here, fitting the artist names and the corresponding subject intentions with a decision tree may proceed by first vectorizing the artist names with a TF-IDF vectorizer (feature extraction based on TF-IDF, term frequency-inverse document frequency), and then feeding the vectorized artist names into a decision tree whose classification targets are "contains person" and "does not contain person". The influence value may range from 0 to 1, with 1 being the strongest influence.
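A sketch of this artist analysis under scikit-learn assumptions. One simplification: TfidfVectorizer tokenizes on single words, so multi-word artist names are treated per token here; the occurrence and person-drawing counts are assumed to be precomputed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

def build_artist_list(artist_strings, has_person, person_count, total_count,
                      min_occurrences=30, min_person_freq=0.5):
    """artist_strings: the artist part of each prompt; has_person: bool label per prompt."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(artist_strings)
    tree = DecisionTreeClassifier(random_state=0).fit(X, has_person)
    importance = tree.feature_importances_   # influence of each token on the label
    mean_importance = importance.mean()
    artist_list = []
    for name, idx in vec.vocabulary_.items():
        if total_count.get(name, 0) < min_occurrences:
            continue  # too rare to be statistically meaningful
        person_freq = person_count.get(name, 0) / total_count[name]
        if importance[idx] > mean_importance and person_freq > min_person_freq:
            artist_list.append(name)
    return artist_list
```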
In a specific embodiment, the intentions of the sample dataset are classified after the dataset has been prepared and analyzed. Specifically, to identify the intention type of a prompt, the main intention part of the prompt is first cleaned, as follows:
1. split each prompt into small fragments at symbols;
2. count the occurrence frequency of each fragment;
3. take the fragments that appear five times or fewer and sit at the beginning of the prompt as the subject intention.
Then, the subject intention is classified as follows: a zero-shot text classification model (no training required) is used to distinguish the intention of the sentences; when the model determines that the subject intention is one of ["human", "portrait", "comics animation movie characters"], the intention is judged to include a person; the remaining categories are judged as not including a person.
In addition, the sample dataset needs to be balanced and cleaned. Specifically, after the intention classification, the following operations are performed on the data:
1. keep only the sample data whose subject intention is at the beginning of the sentence, preventing the generation model from learning to supplement additional subject vocabulary during training;
2. on the data cleaned as above, use the undersampling technique to maintain the ratio of samples containing persons to samples not containing persons at 1:1. In this way, 30,000+ cleaned and balanced high-quality prompt entries are obtained.
Thereby, the training sample data described above is formed.
It should be noted that the backbone pre-trained language model used in this application is T5 (Text-to-Text Transfer Transformer), an open-source generation model released by Google. T5 is an advanced generative model with an Encoder and a Decoder; it can expand or translate an input text into a new text and is applied to downstream tasks such as classification, summarization, question answering, translation, and continuation.
Specifically, the training process and the supervision method of the application are as follows:
1. in the cleaned and balanced dataset, use the subject intention of each prompt as the model input;
2. the T5 model continues the subject intention into the complete prompt;
3. the supervision process evaluates the cross-entropy loss between the prompt generated at each step and each token (word) of the real complete prompt, and propagates the loss back to the model, correcting the model weights with a weight optimization method, until the model loss no longer decreases within N steps and training stops.
Further, during the model training phase, the artists in the sample dataset may be analyzed at the same time. Experiments show that even if the input prompt contains no person information, a prompt carrying the name of an artist who frequently draws figures will, with high probability, generate a picture containing a figure. Therefore, the person-drawing tendency of artist names needs to be analyzed, as follows:
1. fit the artist names in the prompts against the intention categories (containing person / not containing person) with a decision tree, obtaining each artist's feature_importance, i.e., the degree of the artist's influence on drawing persons;
2. count the frequency with which each artist draws persons: number of person drawings / number of artist occurrences;
3. filter out artists that appear fewer than 30 times in the statistics (their frequency is too low to be statistically meaningful);
4. when an artist's feature_importance is larger than the mean and the person-drawing frequency is higher than 50%, it is judged that a prompt containing this artist will output an image with a person;
5. add the artist names obtained in step 4 into a list, obtaining the artist list with person-drawing tendency.
In step S103: if the subject intention does not include a person, artist names with a person-drawing tendency attribute are removed from the prompt word text according to the pre-generated artist list, obtaining the optimized input prompt word; wherein the artist list includes a plurality of interfering artist names having a person-drawing tendency attribute. If the subject intention includes a person, the prompt word text is directly used as the input prompt word. That is, if the user input is judged not to contain a person, the prompt generated by the model is further post-processed, filtering out the person-leaning artists obtained in the artist analysis; if the user input is judged to contain a person, no processing is done. The final prompt result is then output.
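The post-processing step reduces to a simple segment filter; a sketch, assuming comma-separated prompt segments and the artist list built earlier:

```python
def filter_artists(prompt_text: str, artist_list) -> str:
    """Drop any comma-separated segment that mentions a person-leaning artist."""
    segments = [seg.strip() for seg in prompt_text.split(",")]
    kept = [seg for seg in segments
            if not any(artist.lower() in seg.lower() for artist in artist_list)]
    return ", ".join(kept)

# e.g. filter_artists("An apple, concept art, art by Yoji Shinkawa",
#                     ["Yoji Shinkawa"]) -> "An apple, concept art"
```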
Compared with the methods of providing optional words and of searching for approximate prompts, this intelligent prompt optimization is end-to-end; compared with existing open-source generative prompt optimization methods, the generation model is trained on data balanced by sampling, so it does not change the original drawing intention and, without changing the intention, supplements prompt words that improve Stable Diffusion's drawing results. A post-processing step is added so that the Stable Diffusion results are more stable and no picture with a figure is output when the input contains no figure. Furthermore, the parts of a prompt are distinguished by splitting and frequency statistics: the main intention, the picture style modifiers, and the artist supplements; an open-source pre-trained model is used to classify the drawing subject intention into 13 categories, and each artist's influence on whether the generated image contains a person is analyzed and used in the post-processing.
In this way, the prompt word intelligent optimization method for text-generated images acquires the input text entered by the user and feeds it into the text classification model to obtain the subject intention of the input text, the subject intention characterizing whether the drawing intention of the user includes a person or not; feeds the input text into the pre-trained text-to-text generation model and outputs the supplemented prompt word text based on the generation parameters of the model; and, if the subject intention does not include a person, removes artist names with a person-drawing tendency attribute from the prompt word text according to the pre-generated artist list, obtaining the optimized input prompt word, wherein the artist list includes a plurality of artist names having a person-drawing tendency attribute. The technical problems in the related art that prompt supplementation and optimization methods are inefficient, carry high trial-and-error cost, and cannot accurately supplement the intention of the user's input prompt are thereby solved, with the following beneficial effects: by analyzing the drawing subject intention of the text entered by the user, the supplemented input prompt word is guaranteed to accurately express the user's intention; pre-training an advanced text generation model reduces the user's trial-and-error cost and improves the efficiency and accuracy of text supplementation; and filtering artist names prevents the user's intention from being changed by an artist's style. The method thus supplements the intention of the user's input prompt accurately, with high efficiency and low trial-and-error cost.
Fig. 3 is a schematic diagram of the main modules of the prompt word intelligent optimization system for text-generated images according to an embodiment of the present application. For convenience of explanation, only the parts relevant to the embodiments of the present application are shown, described in detail below:
A prompt word intelligent optimization system 200 for text-generated images, comprising:
an acquisition unit 201, configured to acquire input text entered by a user, and feed the input text into a text classification model to obtain a subject intention of the input text, wherein the subject intention characterizes whether the drawing intention of the user includes a person or not;
a processing unit 202, configured to feed the input text into a pre-trained text-to-text generation model, and output a supplemented prompt word text based on the generation parameters of the text-to-text generation model;
a generating unit 203, configured to, if the subject intention does not include a person, remove artist names with a person-drawing tendency attribute from the prompt word text according to a pre-generated artist list, to obtain an optimized input prompt word; wherein the artist list includes a plurality of interfering artist names having a person-drawing tendency attribute.
For the acquisition unit 201: it acquires the input text entered by the user and feeds it into the text classification model to obtain the subject intention of the input text, the subject intention characterizing whether the drawing intention of the user includes a person or not. The input text is the original prompt text entered by the user. Before image generation, after the user enters the original prompt text, this input text is acquired and fed into a preset text classification model to obtain the subject intention of the user's input text; the subject intention is the drawing intention of the input text, namely whether the user intends an image containing a person or an image not containing a person.
Here, the text classification model may be a zero-shot text classification model (no training required) used to distinguish the intention of the input sentence. The target categories of the zero-shot text classification model are as follows:
1["human",
2"portrait",
3"comics animation movie characters",
4"goods",
5"food",
6"fruit",
7"electronic product",
8"product",
9"traffic",
10"city",
11"natural",
12"animal",
13"tools"];
when the model determines that the subject intention is one of ["human", "portrait", "comics animation movie characters"], the subject intention is judged to include a person; the remaining categories are judged as not including a person.
For the processing unit 202: it feeds the input text into the pre-trained text-to-text generation model and outputs the supplemented prompt word text based on the generation parameters of the model. After the intention of the user input text has been classified, the input text is fed into the pre-trained text generation model, and the user's input text is intelligently supplemented and optimized based on the preset generation parameters, obtaining the optimized prompt word text, i.e., the optimized prompt input, from which a high-quality image conforming to the user's drawing intention can be generated.
Note that the pre-trained text-to-text generation model may be obtained by fine-tuning a T5 model (T5: Text-to-Text Transfer Transformer).
Before training the model, data preparation is performed first: this application uses a massive, pre-collected prompt word dataset containing 140,000+ prompt texts capable of generating high-quality images, for example: prompt 1 "Anthropomorphized pineapple pizza,D&D,fantasy,cinematic lighting,highly detailed,digital painting,artstation,concept art,smooth,sharp focus,illustration,warm light,cozy warm tint,magic the gathering artwork,volumetric lighting,8k,no gold,no gold colours,art by Akihiko Yoshida,Greg Rutkowski"; prompt 2 "a fox fursona, trending on pixiv, by kawacy, furry art, digital art, cyberpunk, high quality, backlighting".
Then, data analysis is performed on the prepared data: as the examples above show, a high-quality prompt generally contains three parts (taking prompt 1 as an example):
1. the main intention, which dominates the main picture content in the generated result and is also the main intention entered by the user, for example: "Anthropomorphized pineapple pizza";
2. picture style modifiers, whose addition raises picture quality, for example: "D&D, fantasy, cinematic lighting, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, warm light, cozy warm tint, magic the gathering artwork, volumetric lighting, 8k, no gold, no gold colours";
3. artist supplements; artist names may dominate the overall style of the picture, for example: "art by Akihiko Yoshida, Greg Rutkowski".
Here, the 140,000+ prompt dataset exhibits the following regularities:
1. the picture style modification is mostly a concatenation of multiple picture description words, each of which recurs many times across the whole dataset;
2. artist replenishment is also a concatenation of multiple artist names, each repeated multiple times throughout the dataset;
3. in the main intention descriptions of the dataset, person-drawing intentions account for more than 50%.
In order to avoid biasing the output of the text generation model toward person-drawing prompts, some preprocessing and cleaning operations are required on the data.
Specifically, after preparing and analyzing the sample dataset, the T5 model may be trained with the following method: acquire a training sample set, extract the subject intention word of each input prompt word text in the sample set, and determine the subject intention of each input prompt word text from the subject intention word; obtain the target input prompt word texts, i.e., those input prompt word texts whose sentence head is the subject intention word, to form the pre-training sample data; process the pre-training sample data with an undersampling technique so that the proportions of samples whose subject intention includes a person and samples whose subject intention does not are consistent, keeping the ratio at 1:1 in the training sample data; and train with the subject intention word of each input prompt word text in the training sample data as the input of the text-to-text generation model, obtaining the pre-trained text-to-text generation model. Here, only sample data whose subject intention sits at the beginning of the sentence is taken, preventing the generation model from learning to supplement additional subject vocabulary during training.
Further, training with the subject intention word of each input prompt word text in the training sample data as the input of the text-to-text generation model, to obtain the pre-trained text-to-text generation model, includes: taking the subject intention word of each input prompt word text in the training sample data as the input of the text-to-text generation model, having the model continue the subject intention word into a complete training prompt, calculating the cross-entropy loss between the training prompt and each word of the corresponding input prompt word text, and judging whether the cross-entropy loss has decreased compared with the previous cross-entropy loss; if so, propagating the cross-entropy loss back to the text-to-text generation model and repeating the above steps after correcting the model weights, until training stops when the cross-entropy loss has not decreased for N steps, obtaining the text-to-text generation model, where N is a preset positive integer.
In one embodiment, extracting the subject intention word of each input prompt word text in the sample set includes: splitting each input prompt word text at symbols into a plurality of fragments, counting the occurrence frequency of each fragment, and taking a fragment whose frequency meets the threshold and which sits at the sentence head of the input prompt word text as the subject intention word.
In one embodiment, the method further comprises: fitting the artist names in the input prompt word texts of the sample set against the corresponding subject intentions using a decision tree, to obtain an influence degree value of each artist name on images having a person attribute; filtering out artist names whose occurrence count in the sample set is smaller than a preset value; calculating the person-drawing frequency of each artist name from the number of times the artist name draws a person and the number of times it appears in the sample set; and adding artist names whose influence value is larger than the influence mean and whose person-drawing frequency is higher than a preset value into a list, to generate the artist list.
In a specific embodiment, the intentions of the sample dataset are classified after the dataset has been prepared and analyzed. Specifically, to identify the intention type of a prompt, the main intention part of the prompt is first cleaned, as follows:
1. split each prompt into small fragments at symbols;
2. count the occurrence frequency of each fragment;
3. take the fragments that appear five times or fewer and sit at the beginning of the prompt as the subject intention.
Then, the subject intention is classified as follows: a zero-shot text classification model (no training required) is used to distinguish the intention of the sentences; when the model determines that the subject intention is one of ["human", "portrait", "comics animation movie characters"], the intention is judged to include a person; the remaining categories are judged as not including a person.
In addition, the sample dataset needs to be balanced and cleaned. Specifically, after the intention classification, the following operations are performed on the data:
1. keep only the sample data whose subject intention is at the beginning of the sentence, preventing the generation model from learning to supplement additional subject vocabulary during training;
2. on the data cleaned as above, use the undersampling technique to maintain the ratio of samples containing persons to samples not containing persons at 1:1. In this way, 30,000+ cleaned and balanced high-quality prompt entries are obtained.
Thereby, the training sample data described above is formed.
It should be noted that the backbone pre-trained language model used in this application is T5 (Text-to-Text Transfer Transformer), an open-source generation model released by Google. T5 is an advanced generative model with an Encoder and a Decoder; it can expand or translate an input text into a new text and is applied to downstream tasks such as classification, summarization, question answering, translation, and continuation.
Specifically, the training process and the supervision method of the application are as follows:
1. in the cleaned and balanced dataset, use the subject intention of each prompt as the model input;
2. the T5 model continues the subject intention into the complete prompt;
3. the supervision process evaluates the cross-entropy loss between the prompt generated at each step and each token (word) of the real complete prompt, and propagates the loss back to the model, correcting the model weights with a weight optimization method, until the model loss no longer decreases within N steps and training stops.
Further, during the model training phase, the artists in the sample dataset may be analyzed at the same time. Experiments show that even if the input prompt contains no person information, a prompt carrying the name of an artist who frequently draws figures will, with high probability, generate a picture containing a figure. Therefore, the person-drawing tendency of artist names needs to be analyzed, as follows:
1. fitting the artist names in the prompts against the intent categories (contains person / no person) with a decision tree to obtain each artist's feature_importance, i.e. the degree to which the artist influences whether a person is drawn;
2. counting each artist's figure-drawing frequency: number of times figures are drawn / number of times the artist appears;
3. filtering out artists that appear fewer than 30 times in the statistics (too infrequent to be statistically meaningful);
4. when an artist's feature_importance exceeds the mean and the figure-drawing frequency exceeds 50%, judging that a prompt containing this artist will output an image with a person;
5. adding the artist names obtained in step 4 to a list, yielding the artist list with a figure-drawing tendency.
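A sketch of this analysis with scikit-learn, assuming a simple artist-presence feature per prompt; the encoding details are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_artist_list(prompts, contains_person, artists, min_count=30):
    """Fit a decision tree on artist-presence features vs. the person label,
    then keep artists whose feature_importance exceeds the mean and whose
    figure-drawing frequency exceeds 50%."""
    # one binary column per artist: does the prompt mention this artist?
    X = np.array([[artist in p for artist in artists] for p in prompts],
                 dtype=int)
    y = np.array(contains_person, dtype=int)
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    importances = tree.feature_importances_
    mean_importance = importances.mean()

    trending = []
    for j, artist in enumerate(artists):
        occurrences = X[:, j].sum()
        if occurrences < min_count:  # too rare to be statistical
            continue
        person_freq = (X[:, j] & y).sum() / occurrences
        if importances[j] > mean_importance and person_freq > 0.5:
            trending.append(artist)
    return trending
```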
In one embodiment, the generation parameters of the text-to-text generation model are tuned according to the model's accuracy during training. Specifically, the generation parameters of the model are set as follows:
do_sample: True (increases the diversity of the generated results);
early_stopping: True (generates results only within the highest-probability range);
length_penalty: 1.0 (the model leans toward shorter, neater results);
and, in combination with the subject intent classification, control the length of model generation:
i. when the intent is one of ["human", "portrait", "comics animation movie characters"], the minimum completion length is set to 70 (the Stable Diffusion text encoder accepts at most 77 tokens), so more portrait-type artists and modifiers can be supplemented;
ii. when the intent is one of ["food", "fruit", "electronic product", "product", "tools"], the minimum completion length is set to 30; the model outputs fewer, more precise modifiers, so that Stable Diffusion avoids people and rarely supplements artists;
iii. for the remaining intents, such as "city", the minimum completion length is set to 50, balancing supplemented modifiers and artists.
Therefore, once the subject intent of the user's input text has been determined, the input text can be supplemented with prompt words directly under the generation parameters set above: the text entered by the user (e.g. "An apple") is given to the trained T5 model, which completes it into a full prompt.
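Reusing the model and tokenizer from the training sketch above, the completion step could look like the following; the label sets and minimum lengths mirror the description, while the helper names and the 77-token cap are assumptions:

```python
PERSON_INTENTS = {"human", "portrait", "comics animation movie characters"}
OBJECT_INTENTS = {"food", "fruit", "electronic product", "product", "tools"}

def min_length_for(intent_label: str) -> int:
    if intent_label in PERSON_INTENTS:
        return 70  # SD's text encoder takes at most 77 tokens
    if intent_label in OBJECT_INTENTS:
        return 30  # fewer, more precise modifiers; avoids people/artists
    return 50      # e.g. "city": balance modifiers and artists

def complete_prompt(input_text: str, intent_label: str) -> str:
    inputs = tokenizer(input_text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        do_sample=True,        # diversity, as set above
        early_stopping=True,
        length_penalty=1.0,
        min_length=min_length_for(intent_label),
        max_length=77,         # assumption: cap at the SD token budget
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# e.g. complete_prompt("An apple", "food") might yield
# "An apple, product photography, studio lighting, high detail, ..."
```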
For the generating unit 203: if the subject intent does not contain a person, artist names with a figure-drawing tendency attribute are removed from the prompt text according to the pre-generated artist list, yielding the optimized input prompt, where the artist list contains a plurality of interfering artist names with a figure-drawing tendency attribute; if the subject intent contains a person, the prompt text is used directly as the input prompt. In other words, if the user input is judged not to contain a person, the sample result generated by the model is further post-processed, and the artists with a figure-drawing tendency identified in the artist analysis are filtered out of it; if the user input is judged to contain a person, no processing is done. The final prompt result is then output.
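A sketch of this post-processing filter, assuming comma-separated prompt fragments; the matching rule is illustrative:

```python
def filter_artists(prompt: str, artist_list, contains_person: bool) -> str:
    """If the intent contains no person, drop fragments mentioning an artist
    with a figure-drawing tendency; otherwise pass the prompt through."""
    if contains_person:
        return prompt
    fragments = [frag.strip() for frag in prompt.split(",")]
    kept = [f for f in fragments
            if not any(artist.lower() in f.lower() for artist in artist_list)]
    return ", ".join(kept)
```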
Therefore, the prompt word intelligent optimization system for text-generated images provided by the embodiments of the application analyzes the drawing subject intent of the user's input text so that the supplemented input prompt accurately expresses the user's intent; pre-training an advanced text generation model reduces the user's trial-and-error cost and improves the efficiency and accuracy of prompt supplementation; and filtering artist names prevents an artist's style from altering the user's intent. The system supplements the intent of the user's input prompt accurately, efficiently, and at low trial-and-error cost.
An embodiment of the application further provides an electronic device, comprising: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the above prompt word intelligent optimization method for text-generated images.
An embodiment of the application further provides a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the above prompt word intelligent optimization method for text-generated images.
FIG. 4 illustrates an exemplary system architecture 300 to which embodiments of the prompt word intelligent optimization method or system for text-generated images can be applied.
As shown in fig. 4, the system architecture 300 may include terminal devices 301, 302, 303, a network 304, and a server 305. The network 304 is used as a medium to provide communication links between the terminal devices 301, 302, 303 and the server 305. The network 304 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 305 via the network 304 using the terminal devices 301, 302, 303 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc., may be installed on the terminal devices 301, 302, 303.
The terminal devices 301, 302, 303 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 305 may be a server providing various services, such as a background management server providing support for messages that users send and receive with the terminal devices 301, 302, 303. After receiving a request from a terminal device, the background management server can perform analysis and other processing, and feeds the processing result back to the terminal device.
It should be noted that the prompt word intelligent optimization method for text-generated images provided in the embodiments of the present application is generally executed by the terminal devices 301, 302, 303 or by the server 305; accordingly, the prompt word intelligent optimization system for text-generated images is generally deployed in the terminal devices 301, 302, 303 or in the server 305.
It should be understood that the number of terminal devices, networks and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, a schematic diagram of a computer system 400 suitable for use in implementing the electronic device of the present embodiments is shown. The computer system illustrated in fig. 5 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present application.
As shown in fig. 5, the computer system 400 includes a Central Processing Unit (CPU) 401, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In RAM 403, various programs and data required for the operation of system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 408 including a hard disk or the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read from it can be installed into the storage section 408 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments disclosed herein include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 409 and/or installed from the removable medium 411. The above-described functions defined in the system of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 401.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments described in the present application may be implemented in software or in hardware. The described modules may also be provided in a processor, for example described as: a processor comprising a determination module, an extraction module, a training module, and a screening module. In some cases the name of a module does not limit the module itself; for example, the determination module might equally be described as "a module for determining a candidate user set".
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (8)

1. A prompt word intelligent optimization method for text-generated images, characterized by comprising the following steps:
acquiring an input text entered by a user, and inputting the input text into a text classification model to obtain a subject intent of the input text, wherein the subject intent characterizes whether the user's drawing intent contains a person or does not contain a person;
inputting the input text into a pre-trained text-to-text generation model, and outputting a supplemented prompt word text based on the generation parameters of the text-to-text generation model; and
if the subject intent does not contain a person, removing artist names with a figure-drawing tendency attribute from the prompt word text according to a pre-generated artist list to obtain an optimized input prompt word, wherein the artist list includes a plurality of artist names having a figure-drawing tendency attribute.
2. The prompt word intelligent optimization method for text-generated images of claim 1, further comprising: acquiring a training sample set, extracting the subject intent word of each input prompt word text in the sample set, and determining the subject intent of each input prompt word text from the subject intent word;
taking, as target input prompt word texts, all input prompt word texts whose sentence head is the subject intent word, to obtain pre-training sample data; processing the pre-training sample data with undersampling so that samples whose subject intent contains a person and samples whose subject intent does not contain a person are kept in equal proportion, obtaining training sample data; and
training a text-to-text generation model with the subject intent word of each input prompt word text in the training sample data as input, to obtain the pre-trained text-to-text generation model.
3. The prompt word intelligent optimization method for text-generated images of claim 2, wherein training the text-to-text generation model with the subject intent word of each input prompt word text in the training sample data as input to obtain the pre-trained text-to-text generation model comprises: taking the subject intent word of each input prompt word text in the training sample data as input to the text-to-text generation model; continuing the subject intent word with the text-to-text generation model to obtain a complete training prompt word; calculating the cross-entropy loss between the training prompt word and each word of the corresponding input prompt word text; back-propagating the cross-entropy loss to the text-to-text generation model and correcting the model weights with an optimizer; and stopping training to obtain the text-to-text generation model when the cross-entropy loss has not decreased for N steps, wherein N is a preset positive integer.
4. The prompt word intelligent optimization method for text-generated images of claim 2, wherein extracting the subject intent word of each input prompt word text in the sample set comprises: dividing each input prompt word text into a plurality of fragments at delimiter symbols, counting the occurrence frequency of each fragment, and taking a fragment whose frequency meets a threshold and which is the sentence head of the input prompt word text as the subject intent word.
5. The prompt word intelligent optimization method for text-generated images of claim 2, wherein the text classification model is a zero-shot text classification model.
6. The prompt word intelligent optimization method for text-generated images of claim 2, further comprising: fitting the artist names in the input prompt word texts of the sample set against their corresponding subject intents with a decision tree to obtain, for each artist name, an influence value on whether the generated image contains a person; filtering out artist names whose occurrence count in the sample set is below a preset value; calculating each artist name's figure-drawing frequency from the number of times the artist draws figures and the number of times the artist appears in the sample set; and adding artist names whose influence value exceeds the mean influence value and whose figure-drawing frequency exceeds a preset value to a list, to generate the artist list.
7. The prompt word intelligent optimization method for text-generated images of claim 1, wherein if the subject intent contains a person, the prompt word text is directly used as the input prompt word.
8. A prompt word intelligent optimization system for text-generated images, comprising:
an acquisition unit, configured to acquire an input text entered by a user and input the input text into a text classification model to obtain a subject intent of the input text, wherein the subject intent characterizes whether the user's drawing intent contains a person or does not contain a person;
a processing unit, configured to input the input text into a pre-trained text-to-text generation model and output a supplemented prompt word text based on the generation parameters of the text-to-text generation model; and
a generating unit, configured to, if the subject intent does not contain a person, remove artist names with a figure-drawing tendency attribute from the prompt word text according to a pre-generated artist list to obtain an optimized input prompt word, wherein the artist list includes a plurality of interfering artist names having a figure-drawing tendency attribute.