CN116935144B - Training of image generation model, sample set construction method and image generation large model - Google Patents



Publication number
CN116935144B
CN116935144B (application CN202311192236.5A)
Authority
CN
China
Prior art keywords
sample
target
image
attribute
attribute value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311192236.5A
Other languages
Chinese (zh)
Other versions
CN116935144A (en)
Inventor
张建安
刘微
郑维学
赵越
Current Assignee
Hisense Group Holding Co Ltd
Original Assignee
Hisense Group Holding Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Group Holding Co Ltd filed Critical Hisense Group Holding Co Ltd
Priority to CN202311192236.5A priority Critical patent/CN116935144B/en
Publication of CN116935144A publication Critical patent/CN116935144A/en
Application granted granted Critical
Publication of CN116935144B publication Critical patent/CN116935144B/en
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata automatically derived from the content
    • G06F16/5846 - Retrieval characterised by using metadata automatically derived from the content using extracted text
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application relates to the technical field of artificial intelligence, in particular to a training method for an image generation model, a sample set construction method and an image generation large model. When the image generation model is trained, the sample texts and sample images in a sample set of the target scene are grouped according to the number of preset attributes each sample text contains, so that attribute variation within a single sample group is limited. The sample groups are sorted in ascending order of attribute count, and the image generation model to be trained is trained on the sorted groups in sequence to obtain the target model. By progressing from simple to detailed sample texts, the training avoids the slow model convergence caused by diverse attribute variation across sample texts, and an image generation model for a specific scene is trained efficiently. The technical scheme is robust, interpretable, reliable and universal, and accords with trustworthiness characteristics.

Description

Training of image generation model, sample set construction method and image generation large model
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a training method of an image generation model, a sample set construction method and an image generation large model.
Background
With the development of science and technology, image generation is receiving more and more attention. Existing image generation techniques are classified, according to their input conditions, into image-to-image generation models, text-to-image generation models, and so on. Text-to-image generation fits the usage scenario of everyday users, is simple and convenient, and has therefore attracted wide attention. In a text-to-image generation model, the user inputs a prompt text and the model generates an image that matches the semantics of the received prompt text.
In the related art, an image generation model is usually trained on wide-domain, general-purpose sample images and sample texts, so that the resulting target model possesses general knowledge. However, because the target model is not trained on sample images and texts focused on a specific scene, its generation quality for that scene is poor; for example, its rendering of pedestrian details leaves room for improvement. Moreover, the sample texts paired with the sample images are written manually from experience and follow no unified standard. One annotator may describe the person attributes in an image extremely finely, e.g. "a man aged 20-40, wearing a mask, an army green coat and army green trousers", while another describes them coarsely, e.g. "a man wearing an army green coat". The types and number of attributes thus differ across sample texts, and this diversity of attribute variation makes the convergence of the model to be trained extremely slow.
Therefore, how to efficiently train an image generation model for a specific scene, so as to improve the quality of the generated images, is an urgent problem to be solved.
Disclosure of Invention
The embodiments of the application provide a training method for an image generation model, a sample set construction method and an image generation large model, which are used to solve the prior-art problem that training an image generation model for a specific scene converges slowly.
In a first aspect, the present application provides a training method of an image generation model, the method comprising:
acquiring a sample set corresponding to a target scene, wherein the sample set comprises a plurality of sample images, and the sample images correspond to sample texts;
determining the attribute quantity of preset attributes included in the sample texts in the sample set, and grouping the sample texts and the corresponding sample images according to the attribute quantity to obtain sample groups;
and carrying out ascending order sorting on the sample groups according to the attribute quantity corresponding to the sample groups, and training the image generation model to be trained by sequentially using the sorted sample groups to obtain a target model.
In the embodiment of the application, when the image generation model is trained, a sample set corresponding to a target scene is acquired, the number of preset attributes included in each sample text in the sample set is determined, and the sample texts and corresponding sample images are grouped by that number to obtain sample groups. Because the attribute count of every sample text in a group is the same, attribute variation within a group is limited. The sample groups are then sorted in ascending order of attribute count, and the image generation model to be trained is trained on the sorted groups in sequence to obtain the target model. That is, training progresses from simple to detailed sample texts, which avoids the extremely slow convergence caused by diverse attribute variation across sample texts, so that an image generation model for a specific scene is trained efficiently.
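The grouping-and-ordering step can be sketched as follows. The preset attribute list, the naive substring-based attribute counting and all names here are illustrative assumptions, not the patent's exact implementation:

```python
from collections import defaultdict

# Hypothetical list of preset attributes; the patent does not fix a concrete set.
PRESET_ATTRIBUTES = ["gender", "age", "coat", "trousers", "mask"]

def count_attributes(sample_text, presets=PRESET_ATTRIBUTES):
    """Count how many preset attributes the sample text mentions
    (naive substring matching, for illustration only)."""
    return sum(1 for attr in presets if attr in sample_text)

def build_curriculum(samples):
    """Group (image_id, sample_text) pairs by attribute count and return
    the groups sorted in ascending order of attribute count, so that
    attribute variation within each group is limited."""
    groups = defaultdict(list)
    for image_id, text in samples:
        groups[count_attributes(text)].append((image_id, text))
    return [groups[k] for k in sorted(groups)]

samples = [
    ("img1", "a man wearing a coat"),
    ("img2", "a woman, age 20-40, wearing a mask, a coat and trousers"),
    ("img3", "a man wearing a mask and a coat"),
]
curriculum = build_curriculum(samples)
# curriculum[0] holds the simplest texts (1 attribute), curriculum[-1] the most detailed.
```

The returned list of groups would then be consumed in order by the training loop, from the fewest-attribute group to the most detailed one.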
Further, the process of obtaining the sample set corresponding to the target scene includes:
acquiring an image set corresponding to the target scene and a text to be filled which is stored for the target scene, wherein the text to be filled comprises at least one target field to be written with an attribute value;
determining a target attribute value corresponding to the target field in the sample image aiming at any sample image in the image set, and correspondingly filling the target attribute value into the target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
Further, a plurality of texts to be filled are stored for the target scene, and if the number of target fields included in different texts to be filled is the same, the target attributes corresponding to the target fields included in the texts to be filled with the same number are different.
Further, the determining, for any sample image in the image set, a target attribute value corresponding to the target field in the sample image includes:
determining a target attribute corresponding to a target field in the text to be filled;
outputting the sample image and the target attribute aiming at any sample image in the image set, and prompting to input an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining the attribute value as the target attribute value.
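The two acquisition paths just described, manual annotation versus image recognition, can be sketched as a single dispatch function. Both `recognizer` and `prompt_fn` are hypothetical stand-ins for the components the embodiment describes, not names from the patent:

```python
def get_target_attribute_value(sample_image, target_attribute,
                               recognizer=None, prompt_fn=input):
    """Obtain the value of `target_attribute` for `sample_image`.

    If an image-recognition callable is supplied, use the automatic path;
    otherwise show the image and the target attribute to an annotator and
    read the value back.
    """
    if recognizer is not None:
        # automatic path: image recognition yields the attribute value
        return recognizer(sample_image, target_attribute)
    # manual path: output the image and attribute, prompt for the value
    return prompt_fn(f"Enter the value of '{target_attribute}' for {sample_image}: ")

# Automatic path, shown here with a stub recognizer:
value = get_target_attribute_value("img1.jpg", "coat color",
                                   recognizer=lambda img, attr: "army green")
```

In a real system `recognizer` would wrap an attribute-recognition model and `prompt_fn` an annotation UI; the audit step described below could then be layered on top of the automatic path.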
Further, before the outputting the sample image and the target attribute, the method further includes:
acquiring candidate attribute values stored aiming at the target attributes;
said outputting said sample image and said target property comprises:
and outputting the sample image, the target attribute and the candidate attribute value.
Further, after the obtaining the attribute value corresponding to the target attribute in the sample image, before the determining the attribute value as the target attribute value, the method further includes:
outputting the sample image and the obtained attribute value corresponding to the target attribute, and prompting to audit the attribute value corresponding to the target attribute to obtain a first attribute value corresponding to the target attribute after audit;
and receiving the first attribute value corresponding to the target attribute after verification, and updating the attribute value corresponding to the target attribute obtained through image identification by using the first attribute value.
Further, training the image generation model to be trained by sequentially using the sequenced sample groups comprises:
sequentially acquiring a first sample group from the sequenced sample groups;
and if the first sample group is the first group in the ordered sample groups, training the image generation model to be trained by using the first sample group.
Further, the method further comprises:
if the first sample group is a non-first group in the sorted sample groups, acquiring a set number of sample images and corresponding sample texts from other sample groups before the first sample group;
and training the image generation model to be trained by using the set number of sample images, sample texts corresponding to the set number of sample images and the first sample packets.
Further, the acquiring the set number of sample images and the corresponding sample text includes:
obtaining a target ratio stored in advance;
and acquiring a corresponding number of sample images and corresponding sample texts of the target ratio in the other sample groups.
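A minimal sketch of this curriculum schedule, assuming the "set number" taken from each earlier group is derived from a pre-stored target ratio as described above. The 0.2 default and all names are illustrative:

```python
import random

def curriculum_stages(sorted_groups, target_ratio=0.2, seed=0):
    """Yield the training samples used at each stage of the curriculum.

    The first (simplest) group is used on its own; each later stage mixes
    in `target_ratio` of the samples from every earlier group, so the
    model keeps seeing the simpler sample texts while learning the more
    detailed ones.
    """
    rng = random.Random(seed)
    for i, group in enumerate(sorted_groups):
        stage = list(group)
        for earlier in sorted_groups[:i]:
            k = max(1, int(len(earlier) * target_ratio))  # the "set number"
            stage.extend(rng.sample(list(earlier), k))
        yield stage

groups = [["a", "b", "c", "d", "e"], ["f", "g"], ["h"]]
stages = list(curriculum_stages(groups))
```

Each yielded stage would then be fed to the trainer in turn; how the per-stage training itself is run (optimizer, epochs and so on) is outside what this excerpt specifies.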
In a second aspect, embodiments of the present application further provide a sample set construction method, where the method includes:
acquiring an image set corresponding to a target scene and a text to be filled which is stored for the target scene, wherein the text to be filled comprises at least one target field to be written with an attribute value;
Determining a target attribute value corresponding to the target field in the sample image aiming at any sample image in the image set, and correspondingly filling the target attribute value into the target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
In the embodiment of the application, texts to be filled are stored for different scenes. When the sample set of a target scene is constructed, the image set corresponding to the target scene and the text to be filled stored for that scene are acquired, where the text to be filled comprises at least one target field into which an attribute value is to be written. For each sample image in the image set, the target attribute value corresponding to each target field is determined and filled into that field, yielding the sample text corresponding to the sample image.
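The template-filling step can be sketched as follows, assuming the text to be filled is represented as a Python format string; the field names and template wording are hypothetical, not the patent's exact representation:

```python
def fill_template(text_to_fill, attribute_values):
    """Fill the target fields of a text to be filled with the attribute
    values determined for one sample image, yielding its sample text."""
    return text_to_fill.format(**attribute_values)

# Hypothetical text to be filled for a pedestrian scene; real templates
# and their target fields would be stored per scene.
template = "a {gender} wearing an {coat_color} coat and {trousers_color} trousers"
sample_text = fill_template(template, {
    "gender": "man",
    "coat_color": "army green",
    "trousers_color": "army green",
})
```

Because every sample text for a given template contains the same target fields, the resulting texts share a uniform structure, which is what later makes grouping by attribute count well defined.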
Further, a plurality of texts to be filled are stored for the target scene, and if the number of target fields included in different texts to be filled is the same, the target attributes corresponding to the target fields included in the texts to be filled with the same number are different.
Further, the determining, for any sample image in the image set, a target attribute value corresponding to the target field in the sample image includes:
determining a target attribute corresponding to a target field in the text to be filled;
outputting the sample image and the target attribute aiming at any sample image in the image set, and prompting to input an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining the attribute value as the target attribute value.
Further, before the outputting the sample image and the target attribute, the method further includes:
acquiring candidate attribute values stored aiming at the target attributes;
said outputting said sample image and said target property comprises:
and outputting the sample image, the target attribute and the candidate attribute value.
Further, after the obtaining the attribute value corresponding to the target attribute in the sample image, before the determining the attribute value as the target attribute value, the method further includes:
outputting the sample image and the obtained attribute value corresponding to the target attribute, and prompting to audit the attribute value corresponding to the target attribute to obtain a first attribute value corresponding to the target attribute after audit;
and receiving the first attribute value corresponding to the target attribute after verification, and updating the attribute value corresponding to the target attribute obtained through image identification by using the first attribute value.
In a third aspect, an embodiment of the present application further provides an image generation large model, where the image generation large model is obtained by the training method of the image generation model described above.
In a fourth aspect, an embodiment of the present application further provides a training apparatus for generating a model by using an image, where the apparatus includes:
the acquisition module is used for acquiring a sample set corresponding to a target scene, wherein the sample set comprises a plurality of sample images, and the sample images correspond to sample texts;
The determining module is used for determining the attribute quantity of preset attributes included in the sample text in the sample set;
the grouping module is used for grouping the sample texts and the corresponding sample images according to the attribute quantity to obtain sample groups;
and the training module is used for carrying out ascending order on the sample groups according to the attribute quantity corresponding to the sample groups, and training the image generation model to be trained by sequentially using the ordered sample groups to obtain a target model.
Further, the acquiring module is specifically configured to acquire an image set corresponding to the target scene, and a text to be filled stored for the target scene, where the text to be filled includes at least one target field to which an attribute value is to be written; determining a target attribute value corresponding to the target field in the sample image aiming at any sample image in the image set, and correspondingly filling the target attribute value into the target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
Further, a plurality of texts to be filled are stored for the target scene, and if the number of target fields included in different texts to be filled is the same, the target attributes corresponding to the target fields included in the texts to be filled with the same number are different.
Further, the determining module is specifically configured to determine a target attribute corresponding to a target field in the text to be filled;
the output module is used for outputting the sample image and the target attribute aiming at any sample image in the image set, and prompting to input an attribute value corresponding to the target attribute in the sample image;
the determining module is further configured to determine, as the target attribute value, the received attribute value corresponding to the target attribute; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining the attribute value as the target attribute value.
Further, the obtaining module is further configured to obtain a candidate attribute value stored for the target attribute;
the output module is specifically configured to output the sample image, the target attribute, and the candidate attribute value.
Further, the output module is further configured to output the sample image and an attribute value corresponding to the obtained target attribute, and prompt to audit the attribute value corresponding to the target attribute, so as to obtain a first attribute value corresponding to the target attribute after audit;
And the receiving updating module is used for receiving the first attribute value corresponding to the target attribute after the verification and updating the attribute value corresponding to the target attribute obtained through image identification by using the first attribute value.
Further, the training module is specifically configured to sequentially obtain a first sample packet from the sequenced sample packets; and if the first sample group is the first group in the ordered sample groups, training the image generation model to be trained by using the first sample group.
Further, the obtaining module is further configured to obtain a set number of sample images and corresponding sample texts in other sample packets before the first sample packet if the first sample packet is a non-first packet in the sorted sample packets;
the training module is further configured to train the image generation model to be trained using the set number of sample images, sample text corresponding to the set number of sample images, and the first sample packet.
Further, the acquiring module is specifically configured to acquire a target ratio stored in advance; and acquiring a corresponding number of sample images and corresponding sample texts of the target ratio in the other sample groups.
In a fifth aspect, embodiments of the present application further provide a sample set building apparatus, including:
the system comprises an acquisition module, a storage module and a storage module, wherein the acquisition module is used for acquiring an image set corresponding to a target scene and a text to be filled which is stored for the target scene, and the text to be filled comprises at least one target field to be written with an attribute value;
the construction module is used for determining a target attribute value corresponding to the target field in the sample image aiming at any sample image in the image set, and correspondingly filling the target attribute value into the target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
Further, a plurality of texts to be filled are stored for the target scene, and if the number of target fields included in different texts to be filled is the same, the target attributes corresponding to the target fields included in the texts to be filled with the same number are different.
Further, the construction module is specifically configured to determine a target attribute corresponding to a target field in the text to be filled; outputting the sample image and the target attribute aiming at any sample image in the image set, and prompting to input an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining the attribute value as the target attribute value.
Further, the construction module is further configured to obtain a candidate attribute value saved for the target attribute; and outputting the sample image, the target attribute and the candidate attribute value.
Further, the construction module is further configured to output the sample image and an attribute value corresponding to the obtained target attribute, and prompt to audit the attribute value corresponding to the target attribute, so as to obtain a first attribute value corresponding to the target attribute after audit; and receiving the first attribute value corresponding to the target attribute after verification, and updating the attribute value corresponding to the target attribute obtained through image identification by using the first attribute value.
In a sixth aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes at least a processor and a memory, where the processor is configured to implement, when executing a computer program stored in the memory, a step of a training method of an image generation model according to any one of the foregoing, or a step of a sample set building method according to any one of the foregoing.
In a seventh aspect, embodiments of the present application further provide a computer readable storage medium storing a computer program, which when executed by a processor, implements the steps of the training method of the image generation model according to any one of the above, or the steps of the sample set building method according to any one of the above.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description cover only some embodiments of the present application, and a person skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a flow chart of a training method of an image generation model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an image generated from a prompt text provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of candidate attributes according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a page display provided in an embodiment of the present application;
FIG. 5 is a schematic view of a sample image according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a training process of another image generation model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of model training provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of a generic image generation model fine tuning training process provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of a training device for an image generation model according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a sample set constructing apparatus according to an embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application fall within the scope of protection of the present application.
The embodiments of the application provide a training method for an image generation model, a sample set construction method and an image generation large model. When the image generation model is trained, a sample set corresponding to a target scene is acquired, where the sample set comprises a plurality of sample images and each sample image corresponds to a sample text; the number of preset attributes included in each sample text is determined, and the sample texts and corresponding sample images are grouped by attribute count to obtain sample groups; the sample groups are sorted in ascending order of attribute count, and the image generation model to be trained is trained on the sorted groups in sequence to obtain a target model.
Fig. 1 is a flowchart of a training method of an image generation model according to an embodiment of the present application, including the following steps:
s101: and acquiring a sample set corresponding to the target scene, wherein the sample set comprises a plurality of sample images, and the sample images correspond to sample texts.
The training method of the image generation model provided by the embodiments is applied to an electronic device, which may be a server, a PC, or the like.
With the development of technology, image generation models can produce a wide variety of images. For example, when the prompt text "a sheep in a wine glass" is input into an image generation model, the model can generate an image such as the one shown in fig. 2, a schematic diagram of an image generated from a prompt text provided in an embodiment of the present application. As can be seen from fig. 2, a sheep is lying in a goblet, a scene that no image acquisition device could actually capture in real life; image generation models thus open up possibilities for further technical development. In the related art, image generation models can be used to remedy the lack of data for special attributes in domain models such as video structuring. For example, in video-structured attribute recognition, the scarcity of data on pedestrians wearing sandals makes the attribute recognition model less robust; an image generation model can generate a batch of images of pedestrians wearing sandals, effectively supplementing the data and enhancing the robustness of the attribute recognition model.
In order to improve the quality of the images the model generates for a specific scene, the image generation model to be trained can be trained for that scene. In the embodiments of the present application, the image generation model to be trained may be a brand-new model, i.e. a newly developed model that has not undergone any training, or it may be a pre-trained model; in the latter case, the pre-trained model is fine-tuned using the sample set of the specific scene. The image generation model to be trained may be, for example, a Stable Diffusion model (SD), ControlNet (CN), DreamBooth, or the like.
In order to train an image generation model with a better image generation effect, in the embodiment of the present application, a sample set corresponding to a target scene may be obtained, where the target scene may be input by a user of an electronic device, or may be sent to the electronic device by another electronic device. The target scene may be a weather scene such as a snowy day, a rainy day, or a gale, or may be a scene containing a person, a scene containing an animal, a scene containing an automobile, or the like, which is not limited in the embodiment of the present application.
Since the present application is intended to train a model that generates an image according to text, in the embodiment of the present application, the obtained sample set corresponding to the target scene may include a large number of sample images and sample texts corresponding to the sample images, where a sample text describes the content included in the corresponding sample image; the sample text may be understood as a prompt text for prompting the content of the image the model is to generate. The sample images are images related to the target scene. For example, when the target scene is a snowy day, the sample images in the sample set are all images related to snowy days, and when the target scene is a scene containing pedestrians, the sample images in the sample set all contain pedestrians. In the embodiment of the present application, the sample images included in the sample set may be acquired from a publicly disclosed image set, may be acquired by an image acquisition device, or may be generated by another image generation model. In addition, the sample images included in the sample set may be collected by manual or automated means; how to collect the sample images belongs to the prior art, which is not limited in the embodiments of the present application. Each sample image in the sample set may correspond to one sample text, or may correspond to a plurality of sample texts.
S102: determining the number of attributes of preset attributes included in the sample texts in the sample set, and grouping the sample texts and the corresponding sample images according to the number of attributes to obtain sample groups.
After a sample set corresponding to a target scene is acquired, the sample texts in the sample set are acquired one by one, and the number of attributes of preset attributes included in each acquired sample text is determined.
In the embodiment of the present application, the preset attributes may be configured in advance according to the characteristics of the target scene. For example, when the target scene is a scene including pedestrians, the preset attributes may be gender, age, head-shoulder attribute, upper body attribute, lower body attribute, foot attribute, and the like, where the head-shoulder attribute may be understood as whether the pedestrian in the image wears a cap, wears glasses, wears earrings, and the like; the upper body attribute may be understood as whether the pedestrian in the image is wearing a T-shirt, a down jacket, or the like; the lower body attribute may be understood as whether the pedestrian in the image is wearing jeans, shorts, casual pants, or the like; the foot attribute may be understood as whether the pedestrian in the image is wearing high-heeled shoes, leather shoes, sandals, or the like. It should be noted that, in the embodiment of the present application, the configuration of the preset attributes is not limited, and those skilled in the art may configure them as needed.
In determining the number of attributes of the preset attributes included in any sample text, it may be determined whether the sample text includes preset fields, for example, whether the sample text includes preset fields such as "man", "woman", "hat", "jeans", etc. If the sample text includes the two fields "man" and "jeans", it may be determined that the sample text includes two preset attributes, namely the gender attribute and the lower body attribute, and the number of attributes corresponding to the sample text is 2.
It should be noted that there are numerous processing methods in the related art for identifying the attributes included in a text, and the method for identifying the attributes included in a text is not limited in the embodiment of the present application.
After determining the number of attributes corresponding to each sample text in the sample set, the sample texts and their corresponding sample images may be grouped according to the number of attributes corresponding to each sample text. That is, sample texts with the same number of attributes, together with their corresponding sample images, are divided into one group, so that within each resulting sample group every sample text includes the same number of preset attributes.
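The counting and grouping of S102 can be sketched as follows. The attribute-to-field mapping and the function names are hypothetical; a real system would use the preset fields configured for the target scene, and a more robust field match than simple substring containment.

```python
from collections import defaultdict

# Hypothetical mapping from preset attributes to the fields that indicate them;
# the real field lists are configured per target scene (see S102 above).
PRESET_ATTRIBUTE_FIELDS = {
    "gender": ["man", "woman"],
    "lower_body": ["jeans", "shorts", "casual pants"],
    "head_shoulder": ["hat", "glasses", "earrings"],
}

def count_attributes(sample_text):
    """Count how many preset attributes appear in a sample text.

    Substring matching is a simplification: e.g. "woman" contains "man",
    which happens to still count the gender attribute exactly once here.
    """
    count = 0
    for fields in PRESET_ATTRIBUTE_FIELDS.values():
        if any(field in sample_text for field in fields):
            count += 1
    return count

def group_samples(samples):
    """Group (sample_text, sample_image) pairs by number of preset attributes."""
    groups = defaultdict(list)
    for text, image in samples:
        groups[count_attributes(text)].append((text, image))
    return dict(groups)
```

For example, "a man" falls into the group with one attribute, while "a man wearing jeans" falls into the group with two.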
S103: sorting the sample groups in ascending order according to the number of attributes corresponding to the sample groups, and training the image generation model to be trained by sequentially using the sorted sample groups to obtain a target model.
If the sample text used in one round of training is very simple, for example "a man", and the sample text used in the next round is very complex, for example "a man wearing a mask and blue jeans", then the convergence rate of the image generation model to be trained is slow and its robustness is poor. Therefore, in the embodiment of the present application, after the sample groups are obtained, they are sorted in ascending order according to their corresponding numbers of attributes. After the ascending sort, the image generation model to be trained is trained by sequentially using the sorted sample groups; once training meets the convergence condition, an image generation model for the target scene is obtained, and for convenience of description, the image generation model obtained by training is referred to as the target model. That is, when the image generation model to be trained is trained, it is first trained using sample texts with fewer preset attributes and then using sample texts with more preset attributes; in other words, low-information image-text pairs are input first and high-information image-text pairs are gradually added, simulating curriculum learning from simple to complex.
By way of example, assuming that the image generation model to be trained is trained for 1000 rounds in total, sample text and sample images in a sample group with an attribute number of 1 can be used for training on rounds 1-300; training is performed on rounds 301-600 using sample text and sample images in a sample group with an attribute number of 2, and so on.
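The round assignment in the example above can be sketched as follows, splitting the total training rounds evenly across the sorted groups; the function name and the even split are assumptions, and in practice the per-group budget would be tuned.

```python
def curriculum_schedule(groups, total_rounds):
    """Assign training rounds to sample groups in ascending order of attribute count.

    `groups` maps attribute count -> list of (text, image) pairs.
    Rounds are split evenly, mirroring the 1000-round example above;
    the last group absorbs any remainder.
    """
    ordered = [groups[k] for k in sorted(groups)]
    rounds_per_group = total_rounds // len(ordered)
    schedule = []
    for i, group in enumerate(ordered):
        start = i * rounds_per_group + 1
        end = (i + 1) * rounds_per_group if i < len(ordered) - 1 else total_rounds
        schedule.append((start, end, group))
    return schedule
```

With three groups and 900 rounds, the group with one attribute trains on rounds 1-300, the group with two attributes on rounds 301-600, and so on.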
How to train the image generation model to be trained by using the sample texts and the corresponding sample images belongs to the prior art, and details are not repeated in the embodiments of the present application.
In the embodiment of the application, when the image generation model is trained, a sample set corresponding to a target scene is acquired, and the number of attributes of preset attributes included in the sample texts in the sample set is determined. The sample texts and corresponding sample images are grouped according to the number of attributes to obtain sample groups, in which the number of attributes of each sample text is consistent, so that the attribute variation of sample texts within the same sample group is limited. The sample groups are then sorted in ascending order according to their corresponding numbers of attributes, and the sorted sample groups are sequentially used to train the image generation model to be trained to obtain the target model. That is, the sample texts used progress from simple to detailed, which avoids the problem that the convergence speed of the image generation model to be trained becomes extremely slow due to the diversity of attribute variation in the sample texts, so that an image generation model for a specific scene is trained efficiently.
In order to improve the efficiency of model training, in the embodiment of the present application, the process of obtaining the sample set corresponding to the target scene includes:
acquiring an image set corresponding to the target scene and a text to be filled which is stored for the target scene, wherein the text to be filled comprises at least one target field to be written with an attribute value;
determining a target attribute value corresponding to the target field in the sample image aiming at any sample image in the image set, and correspondingly filling the target attribute value into the target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
Because the sample set used for training the image generation model to be trained has an extremely large data volume, if the sample text corresponding to each sample image were entirely input manually word by word, the training efficiency of the image generation model to be trained would be seriously affected; moreover, different workers think differently, and the level of detail with which they describe a sample image also differs. Therefore, in order to standardize the description manner of the sample text, in the embodiment of the present application, the text to be filled is pre-stored. The text to be filled may include at least one target field to be written with an attribute value, and when a worker marks a sample image, the worker only needs to write the corresponding attribute value into the target field in the text to be filled.
The efficiency of determining the sample set can be improved by determining the sample text corresponding to the sample image based on the pre-saved text to be filled, so that in the embodiment of the application, when the sample set corresponding to the target scene is obtained, the image set corresponding to the target scene and the pre-saved text to be filled can be obtained.
Since the content included in sample images of different scenes differs, the content described by the sample texts corresponding to sample images of different scenes also differs; conversely, the content included in sample images of the same scene is similar, so the content described by their corresponding sample texts is also similar. Therefore, in the embodiment of the present application, texts to be filled may be pre-saved for different scenes; that is, a plurality of texts to be filled are pre-saved, different texts to be filled correspond to different scenes, and one or more texts to be filled may be saved for the same scene.
In the embodiment of the present application, when acquiring the text to be filled, the text to be filled stored for the target scene may be acquired, and one or more texts to be filled may be acquired. When a plurality of texts to be filled are saved for the target scene, the numbers of target fields included in different texts to be filled may be the same or different. When the numbers of target fields included in different texts to be filled are the same, the target attributes corresponding to the target fields in those texts to be filled differ.
In this embodiment of the present application, each text to be filled includes at least one target field to be written with an attribute value, where the target field is a field to be filled subsequently, and in order to make it convenient for the subsequent explicit that the target field needs to fill in an attribute value corresponding to which preset attribute, in this embodiment of the present application, the preset attribute corresponding to each target field may be identified in the text to be filled.
Specifically, assuming that the target scene is a scene containing pedestrians, the text to be filled saved for the target scene may be "a { gender } person", "a person aged { age }", "a { gender } person aged { age } wearing { foot attribute } on the feet", or the like. The fields enclosed in "{ }" in the text to be filled are the target fields into which attribute values are to be written; that is, "{ gender }", "{ age }", and "{ foot attribute }" in the text to be filled are target fields, and the content inside each "{ }" is the preset attribute corresponding to that target field. When filling the text to be filled, the worker can subsequently determine, according to the content included in each "{ }", which preset attribute's attribute value needs to be filled into the target field at that position. It should be noted that how to mark a target field to be written with an attribute value and its corresponding preset attribute is not limited to the above example, and those skilled in the art may configure this as required.
After the image set corresponding to the target scene and the text to be filled stored for the target scene are acquired, a sample text corresponding to each sample image in the image set can be determined. In the embodiment of the application, when determining the sample text corresponding to any sample image in the image set, the target attribute value corresponding to the target field in the sample image can be determined, and the sample text corresponding to the sample image can be obtained by correspondingly filling the target attribute value into the target field. After obtaining the sample text corresponding to the sample image, the sample text corresponding to the sample image may be saved in a sample set, where the sample set is to be used for subsequent training of the image generation model to be trained.
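The "{ }" template convention described above can be handled with a simple regular expression. This is a minimal sketch with hypothetical function names, assuming the preset attribute name appears between the braces:

```python
import re

def extract_target_attributes(text_to_fill):
    """Return the preset attribute names marked by '{ }' in a text to be filled."""
    return [name.strip() for name in re.findall(r"\{(.*?)\}", text_to_fill)]

def fill_text(text_to_fill, attribute_values):
    """Fill each target field with the target attribute value of its preset attribute.

    `attribute_values` maps preset attribute names to the target attribute
    values determined for one sample image.
    """
    return re.sub(
        r"\{(.*?)\}",
        lambda match: attribute_values[match.group(1).strip()],
        text_to_fill,
    )
```

Filling "a {gender} person aged {age}" with the values determined for a sample image then yields one sample text for that image.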
In order to improve accuracy of sample text determination, in the embodiments of the present application, for any sample image in the image set, determining a target attribute value corresponding to the target field in the sample image includes:
determining a target attribute corresponding to a target field in the text to be filled;
outputting the sample image and the target attribute aiming at any sample image in the image set, and prompting to input an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining the attribute value as the target attribute value.
In the embodiment of the present application, when determining the target attribute value corresponding to the target field in the sample image, the target attribute corresponding to the target field in each acquired text to be filled may first be determined. Specifically, whether each text to be filled includes a preset character may be identified based on a text identification technology, where the preset character is a character used to mark a preset attribute. For example, whether each text to be filled includes "{ }" is identified; if "{ }" is identified, the field inside the "{ }" is the target attribute.
In order to facilitate the subsequent filling of the target attribute value into the corresponding text to be filled, after determining the target attribute corresponding to the target field in any text to be filled, the determined target attribute may be saved corresponding to the text to be filled.
After determining the target attribute corresponding to the target field in the text to be filled, any sample image in the image set may be acquired, and the sample image and each determined target attribute may be output; for example, the sample image and the determined target attributes may be displayed on a display, and the user of the electronic device may be prompted to input the attribute value corresponding to each target attribute in the currently displayed sample image. The user of the electronic device may input the corresponding attribute value in the input box corresponding to each target attribute, click a button such as "submit", "complete", or "save" after input is completed, and send the attribute value corresponding to each target attribute to the electronic device. After receiving the attribute value corresponding to a target attribute, the electronic device may determine that attribute value as the target attribute value.
Specifically, it is assumed that sample image 1 is output together with target attribute A, target attribute B, and target attribute C, and that the attribute value input by the user of the electronic device for target attribute A is attribute value a, the attribute value input for target attribute B is attribute value b, and the attribute value input for target attribute C is attribute value c. After receiving the attribute value corresponding to each target attribute, the electronic device may determine attribute value a as the target attribute value corresponding to target attribute A, attribute value b as the target attribute value corresponding to target attribute B, and attribute value c as the target attribute value corresponding to target attribute C.
In order to further improve the efficiency of sample text determination, in the embodiments of the present application, before outputting the sample image and the target attribute, the method further includes:
acquiring candidate attribute values stored aiming at the target attributes;
said outputting said sample image and said target property comprises:
and outputting the sample image, the target attribute and the candidate attribute value.
Since the categories of the target attributes are fixed, the candidate range of attribute values corresponding to each target attribute is limited; for example, the candidate attribute values corresponding to the target attribute of gender are only male and female. Therefore, in the embodiment of the application, in order to further improve the efficiency of determining the sample text, before outputting the sample image and the target attribute, the candidate attribute values stored for the target attribute may be obtained, where the candidate attribute values may be attribute values that may occur, listed in advance by a worker based on experience.
After the candidate attribute values stored for the target attributes are acquired, when the sample image and the target attributes are output, the acquired candidate attribute values can be output corresponding to the target attributes, so that a worker can select among the candidate attribute values without inputting the corresponding attribute values one by one.
Specifically, fig. 3 is a schematic diagram of candidate attributes provided in the embodiment of the present application. As shown in fig. 3, the determined target attributes corresponding to the target fields in the text to be filled may include gender, age, head-shoulder attribute, upper body attribute, lower body attribute, and foot attribute. The candidate attribute values stored for the target attribute of gender may include male and female; the candidate attribute values stored for the target attribute of age may include less than 20 years old, 20-40 years old, 40-60 years old, and greater than 60 years old; the candidate attribute values stored for the head-shoulder attribute may include wearing a hat, wearing glasses, wearing earrings, wearing jewelry, and the like; the candidate attribute values stored for the upper body attribute may include T-shirts, shirts, down jackets, etc.; the candidate attribute values stored for the lower body attribute may include jeans, shorts, casual pants, and the like; the candidate attribute values stored for the foot attribute may include leather shoes, slippers, etc. As can be seen from fig. 3, the candidate attribute values corresponding to the upper body attribute, the lower body attribute, and the foot attribute are further associated with a color attribute, where the color attribute serves as a modifier of the candidate attribute values and can be combined with the candidate attribute values of those attributes, for example red T-shirt, blue shirt, yellow down jacket, green down jacket, and the like. In the embodiment of the application, the color attribute corresponds to a plurality of candidate attribute values, which may include red, orange, yellow, green, cyan, blue, violet, and other colors.
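The candidate attribute values of fig. 3, with color as a modifier for the clothing attributes, can be sketched as a simple mapping; the names and value lists below merely mirror the figure and are illustrative, not an exhaustive configuration.

```python
# Candidate attribute values per target attribute, following fig. 3
# (a sketch; the real lists are configured by workers as needed).
CANDIDATE_VALUES = {
    "gender": ["male", "female"],
    "age": ["less than 20", "20-40", "40-60", "greater than 60"],
    "head_shoulder": ["hat", "glasses", "earrings", "jewelry"],
    "upper_body": ["T-shirt", "shirt", "down jacket"],
    "lower_body": ["jeans", "shorts", "casual pants"],
    "foot": ["leather shoes", "slippers"],
}
COLORS = ["red", "orange", "yellow", "green", "cyan", "blue", "violet"]

def with_colors(attribute):
    """Expand clothing attribute values with color modifiers, e.g. 'red T-shirt'.

    Only the upper body, lower body, and foot attributes take the color
    modifier, as described for fig. 3 above.
    """
    if attribute not in ("upper_body", "lower_body", "foot"):
        return CANDIDATE_VALUES[attribute]
    return [f"{color} {value}" for color in COLORS
            for value in CANDIDATE_VALUES[attribute]]
```

A worker auditing a sample image would then pick from these lists instead of typing each attribute value.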
Because the number of sample images for training the image generation model to be trained is extremely large, and each sample image includes a plurality of attributes, even though workers do not need to input sample texts word by word, a user of the electronic device still needs to input attribute values one by one, which consumes a large amount of manpower and material resources. Therefore, in the embodiment of the application, for any sample image in the image set, image recognition may be performed on the sample image to obtain the attribute value corresponding to the target attribute in the sample image, and that attribute value is determined as the target attribute value. When identifying the attribute value corresponding to the target attribute in the sample image, the sample image can be recognized based on a trained attribute identification model. It should be noted that identifying the attribute value corresponding to an attribute in an image belongs to the prior art, which is not described in the embodiment of the present application.
In order to ensure accuracy of sample text determination, in the embodiments of the present application, after obtaining the attribute value corresponding to the target attribute in the sample image, before determining the attribute value as the target attribute value, the method further includes:
Outputting the sample image and the obtained attribute value corresponding to the target attribute, and prompting to audit the attribute value corresponding to the target attribute to obtain a first attribute value corresponding to the target attribute after audit;
and receiving the first attribute value corresponding to the target attribute after verification, and updating the attribute value corresponding to the target attribute obtained through image identification by using the first attribute value.
Since the accuracy of the samples included in the sample set used to train the image generation model to be trained severely affects the resulting target model, if the content described in a sample text is inconsistent with the content included in the corresponding sample image, the image generation model to be trained will be trained erroneously. Therefore, in the embodiment of the present application, after image recognition is performed on the sample image, the recognized attribute value corresponding to the target attribute in the sample image may not be directly determined as the target attribute value; instead, the attribute value corresponding to the target attribute obtained by image recognition and the corresponding sample image may be output. For example, the attribute value corresponding to the target attribute obtained by image recognition and the corresponding sample image may be displayed on a display, with a prompt to audit the recognized attribute value corresponding to the target attribute.
After receiving the audit prompt, the worker can audit the attribute value corresponding to each target attribute of each sample image. If the worker determines that an attribute value obtained by image recognition is wrong, the worker can modify the corresponding attribute value, or mark the wrong attribute value; if a certain attribute value is marked as wrong, the electronic device, after receiving the mark, can re-recognize the target attribute corresponding to the mark, or simply not store the marked attribute value. After finishing the audit, the worker can send an audit completion instruction and send the audited attribute value corresponding to each target attribute to the electronic device. When sending the first attribute value to the electronic device, the worker may click a button such as "submit", "complete", or "save".
After the electronic device receives the audited first attribute value corresponding to the target attribute, the electronic device can consider the content included in the first attribute value to be correct, update the attribute value corresponding to the target attribute obtained through image recognition by using the first attribute value, and then perform other operations based on the confirmed correct first attribute value.
Specifically, assuming that the attribute value corresponding to target attribute A included in sample image 1 identified by the image identification technology is a, and the attribute value corresponding to target attribute B is b, sample image 1 and the attribute value corresponding to each target attribute may be displayed on a visualization page. For example, fig. 4 is a schematic diagram of a display page provided in the embodiment of the present application. As shown in fig. 4, the sample image to be audited may be displayed on the left side of the page (left and right as oriented in the drawing), that is, sample image 1 is displayed on the left side of the page, and each target attribute together with its corresponding attribute value is displayed in sequence on the right side of the page, that is, target attribute A with attribute value a and target attribute B with attribute value b are displayed in sequence on the right side of the page. When the worker determines that the attribute value corresponding to target attribute B is recognized incorrectly (for example, the attribute value corresponding to target attribute B in sample image 1 is actually M), the worker may modify the attribute value b corresponding to target attribute B into M on the page. After finishing auditing each target attribute, the worker can click the "save" button and send the audited first attribute values to the electronic device.
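The audit-and-update flow above can be sketched as merging the audited first attribute values over the recognized ones; the function and parameter names are hypothetical, and dropping unreplaced marked values follows the "not stored" branch described earlier.

```python
def apply_audit(recognized, audited, marked_wrong=()):
    """Merge audited first attribute values over image-recognized values.

    `recognized` maps target attributes to values from image recognition;
    `audited` holds the worker's corrections (first attribute values);
    `marked_wrong` lists attributes flagged wrong without a correction,
    whose values are dropped rather than stored.
    """
    updated = {attr: value for attr, value in recognized.items()
               if attr not in marked_wrong}
    updated.update(audited)  # audited corrections override recognition
    return updated
```

In the fig. 4 example, correcting target attribute B from b to M leaves attribute value a untouched and stores M for B.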
The configuration of text to be filled in will be described below in connection with a specific embodiment, and based on the target attribute included in fig. 3, in this embodiment of the present application, text to be filled in as shown in table 1 may be configured in advance:
TABLE 1
As can be seen from table 1, in the embodiment of the present application, a plurality of texts to be filled are constructed according to the number of target attributes: texts to be filled including only one type of target attribute, texts to be filled including any two types of target attributes, and likewise texts to be filled including any three, any four, any five, or all six types of target attributes. Configuring the texts to be filled in the above manner for the six target attributes yields 63 types of texts to be filled, one for each non-empty combination of the six target attributes.
For example, a text to be filled with a single type of target attribute shown in table 1 may be "a { gender } person"; then, according to the candidate attribute values shown in fig. 3, this text to be filled can generate the sample texts "a man" and "a woman".
Illustratively, a text to be filled with any two types of target attributes shown in table 1 may be "a person wearing { head-shoulder attribute } and { upper body attribute }"; then, according to the candidate attribute values shown in fig. 3, this text to be filled can generate the following sample texts:
a person wearing a hat and a white T-shirt;
a person wearing a hat and a red shirt;
a person wearing a hat and a black down jacket;
a person wearing glasses and a white T-shirt;
……
In addition, since the candidate attribute values can be collocated with various kinds of color information, color input can also be added to the text to be filled, so that more sample texts can be obtained.
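The combinatorial expansion of one text to be filled into its sample texts can be sketched as follows; the template syntax mirrors the "{ }" convention above, and the function name and candidate lists are assumptions for illustration.

```python
import itertools
import re

def generate_sample_texts(template, candidates):
    """Enumerate every sample text a text to be filled can produce.

    `template` uses '{attribute}' target fields; `candidates` maps each
    target attribute to its candidate attribute values (as in fig. 3).
    """
    attrs = [a.strip() for a in re.findall(r"\{(.*?)\}", template)]
    texts = []
    for combo in itertools.product(*(candidates[attr] for attr in attrs)):
        text = template
        for attr, value in zip(attrs, combo):
            # Replace one occurrence at a time so repeated attributes work.
            text = re.sub(r"\{\s*" + re.escape(attr) + r"\s*\}", value, text, count=1)
        texts.append(text)
    return texts
```

Applied to the two-attribute template above, two head-shoulder values and two upper-body values yield four sample texts.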
Fig. 5 is a schematic diagram of a sample image provided in an embodiment of the present application, as shown in fig. 5, where fig. 5 includes a pedestrian, then, based on the text to be filled shown in table 1, at least the following sample text is determined for the sample image:
a man;
a person wearing a hat and a black coat;
a man wearing a hat and a black coat;
a person wearing a mask and a hat, wearing black trousers;
a man aged 20-40 wearing a hat and a black coat;
a person wearing a mask, a black coat, and black trousers;
a man wearing a mask and black trousers;
a man aged 20-40 wearing a mask, a black coat, and black trousers;
……
Through the above sample text generation mode, training data can be effectively increased, further improving model robustness; meanwhile, the variety of expressions can enhance the generalization capability of the model.
In order to ensure the model training effect, in the embodiment of the present application, training the image generation model to be trained by sequentially using the sequenced sample groups includes:
sequentially acquiring a first sample group from the sequenced sample groups;
and if the first sample group is the first group in the ordered sample groups, training the image generation model to be trained by using the first sample group.
Because model training is subject to forgetting, in order to ensure the effect of model training, in the embodiment of the application, when the image generation model to be trained is trained by sequentially using the sorted sample groups, before training on a certain sample group, the image generation model to be trained can first review the content learned before and then continue to learn new content.
In the embodiment of the present application, after the sample groups are sorted in ascending order, the sample groups may be acquired sequentially from the sorted sample groups; for convenience of description, in the embodiment of the present application, the currently acquired sample group is referred to as the first sample group.
After the first sample group is obtained, whether the first sample group is the first group in the sorted sample groups can be determined. If the first sample group is the first group, it means that the image generation model to be trained has not previously learned any other sample images and corresponding sample texts, so no "review" is needed, and the image generation model to be trained can be trained directly by using the first sample group.
In order to further ensure the effect of model training, on the basis of the above embodiments, in the embodiments of the present application, the method further includes:
if the first sample group is a non-first group in the sorted sample groups, acquiring a set number of sample images and corresponding sample texts from the other sample groups before the first sample group;
and training the image generation model to be trained by using the set number of sample images, the sample texts corresponding to the set number of sample images, and the first sample group.
If the acquired first sample group is not the first group in the sorted sample groups, this indicates that the image generation model to be trained has already learned the sample images and sample texts in other sample groups. Therefore, in order to prevent the image generation model to be trained from forgetting the previously learned content, in the embodiment of the application, it may first be determined which one or more of the sorted sample groups are ranked before the first sample group. After these other sample groups are determined, a set number of sample images and the sample texts corresponding to them may be selected from each of them; these sample images and sample texts can be understood as the "review" content required by the image generation model to be trained. The set number may be any positive integer, for example 10, 30, 55, or 100; this is not limited in this embodiment and may be configured by those skilled in the art as needed.
After the set number of sample images and their corresponding sample texts are acquired, the set number of sample images, the sample texts corresponding to the set number of sample images, and the first sample group may be used to train the image generation model to be trained.
Specifically, assume that 1000 rounds of training of the image generation model to be trained are required, and that the 100th to 200th rounds train the image generation model to be trained using the sample group with attribute number 3. To prevent the image generation model to be trained from forgetting previously learned content, in the embodiment of the application, a set number of sample images and corresponding sample texts may be acquired from the sample group with attribute number 1 and the sample group with attribute number 2. Assuming the set number is 50, 50 sample images and the sample texts corresponding to each of them can be acquired from the sample group with attribute number 1, and another 50 sample images and their corresponding sample texts from the sample group with attribute number 2. After these sample images and corresponding sample texts are obtained, the 100 sample images and their corresponding sample texts, together with the first sample group, can be used to train the image generation model to be trained. Of course, since the set number is 50, in the embodiment of the present application the 50 sample images and their corresponding sample texts may be obtained randomly from each of the other sample groups before the first sample group.
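As an illustrative sketch of the review-sampling step above (the data layout, function name, and parameter names are assumptions, not the patent's implementation), a set number of samples can be drawn from each sample group that precedes the current one:

```python
import random

def build_review_batch(sorted_groups, current_idx, set_number=50, seed=None):
    """Collect up to `set_number` (image, text) pairs from every sample group
    that precedes the current group, as "review" content for the model."""
    rng = random.Random(seed)
    review = []
    for group in sorted_groups[:current_idx]:
        k = min(set_number, len(group))  # a group may hold fewer than set_number samples
        review.extend(rng.sample(group, k))
    return review

# Groups sorted by attribute number 1, 2, 3; the current group has index 2,
# so 50 samples are drawn from each of the two earlier groups (100 in total).
groups = [
    [(f"img1_{i}", f"txt1_{i}") for i in range(100)],
    [(f"img2_{i}", f"txt2_{i}") for i in range(100)],
    [(f"img3_{i}", f"txt3_{i}") for i in range(100)],
]
review_batch = build_review_batch(groups, current_idx=2, set_number=50, seed=0)
print(len(review_batch))  # 100
```

The review batch would then be trained on together with the first sample group itself.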
The following describes the training process of an image generation model to be trained in connection with a specific embodiment. Fig. 6 is a schematic diagram of another image generation model training process provided in an embodiment of the present application; as shown in fig. 6, the process includes the following steps:
s601: and carrying out ascending sort on the sample groups obtained by grouping according to the attribute quantity corresponding to the sample groups.
S602: sequentially acquiring a first sample group from the sorted sample groups.
S603: determining whether the first sample group is the first group in the sorted sample groups; if so, S604 is performed, otherwise S605 is performed.
S604: and training the image generation model to be trained by using the first sample group.
S605: acquiring a set number of sample images and corresponding sample texts from the other sample groups before the first sample group, and training the image generation model to be trained by using the set number of sample images, the sample texts corresponding to them, and the first sample group.
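Steps S601-S605 can be sketched as a single training loop. Here `train_step` stands in for one pass of model training, and the record layout is an assumption made for illustration:

```python
import random

def curriculum_train(sample_groups, train_step, set_number=50, seed=0):
    """S601: sort groups ascending by attribute number; S602-S605: train on
    the first group alone, and mix review samples from all earlier groups
    into every later group."""
    rng = random.Random(seed)
    sorted_groups = sorted(sample_groups, key=lambda g: g["attribute_number"])  # S601
    for idx, group in enumerate(sorted_groups):  # S602
        if idx == 0:  # S603 -> S604: first group, no review needed
            batch = list(group["samples"])
        else:         # S603 -> S605: prepend review samples from earlier groups
            batch = []
            for earlier in sorted_groups[:idx]:
                k = min(set_number, len(earlier["samples"]))
                batch.extend(rng.sample(earlier["samples"], k))
            batch.extend(group["samples"])
        train_step(batch)  # one block of training rounds on this batch

# Record the batch size seen at each stage.
sizes = []
curriculum_train(
    [{"attribute_number": 2, "samples": list(range(80))},
     {"attribute_number": 1, "samples": list(range(60))}],
    train_step=lambda batch: sizes.append(len(batch)))
print(sizes)  # [60, 130]
```

The first stage trains on the 60 single-attribute samples alone; the second stage trains on its own 80 samples plus 50 review samples, hence a batch of 130.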
In order to further ensure the effect of model training, in the embodiments of the present application, the obtaining a set number of sample images and corresponding sample texts includes:
Obtaining a target ratio stored in advance;
and acquiring, from the other sample groups, a number of sample images corresponding to the target ratio and their corresponding sample texts.
Since different sample groups include different numbers of sample images, if only a set number is configured in advance, the number of sample images included in a certain sample group may well be smaller than the set number. Therefore, in the embodiment of the present application, in order to further ensure the effect of model training, a target ratio may be configured in advance; the target ratio means that, from each sample group, the sample images accounting for that ratio of the group and their corresponding sample texts are acquired. In this embodiment of the present application, there may be one or more target ratios. When there is one target ratio, the same target ratio of sample images and their corresponding sample texts is acquired from each sample group; when there are multiple target ratios, different target ratios may be configured for sample groups with different attribute numbers, for example, a target ratio of 10% for the sample group with attribute number 1, 30% for the sample group with attribute number 2, 40% for the sample group with attribute number 3, and so on.
After the pre-saved target ratio is acquired, a number of sample images corresponding to the target ratio, and their corresponding sample texts, may be acquired from the other sample groups before the first sample group.
Specifically, assume that the pre-saved target ratio is 100% and that 1000 rounds of training of the image generation model to be trained are required. Fig. 7 is a schematic diagram of model training provided in the embodiment of the present application. As shown in fig. 7, when the image generation model to be trained is trained, in rounds 1-100 the sample group with attribute number 1 is used, that is, the model is trained with sample texts containing a single attribute and their corresponding sample images. After 100 rounds of training the image generation model to be trained has acquired certain knowledge, so in the following rounds 101-200 the sample group with attribute number 2 can be used, that is, the model is trained with sample texts containing any two types of attributes and their corresponding sample images. However, in rounds 101-200 the image generation model to be trained should also revisit the knowledge in the sample texts and corresponding sample images included in the sample group with attribute number 1, so as to prevent it from forgetting the learned knowledge; therefore, the target ratio of sample images and their corresponding sample texts are acquired from the sample group with attribute number 1. Since the pre-saved target ratio is 100%, in rounds 101-200 the image generation model to be trained is in effect trained with the sample groups with attribute numbers 1 and 2. Similarly, in the following rounds 201-300 the model can be trained with the sample groups with attribute numbers 1, 2 and 3, and so on.
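The cumulative schedule described above can be sketched as follows; the stage length of 100 rounds, the function names, and the pool layout are assumptions made for illustration:

```python
import random

def groups_for_round(round_number, rounds_per_stage=100):
    """With a 100% target ratio, rounds 1-100 use the group with attribute
    number 1, rounds 101-200 use groups 1 and 2, and so on (cumulative)."""
    stage = (round_number - 1) // rounds_per_stage
    return list(range(1, stage + 2))

def stage_pool(sorted_groups, stage_idx, target_ratio=1.0, seed=0):
    """Training pool for one stage: the current group plus a target-ratio
    share of every earlier group (all of each earlier group at 100%)."""
    rng = random.Random(seed)
    pool = list(sorted_groups[stage_idx])
    for earlier in sorted_groups[:stage_idx]:
        k = int(len(earlier) * target_ratio)
        pool.extend(rng.sample(earlier, k))
    return pool

print(groups_for_round(50))   # [1]
print(groups_for_round(150))  # [1, 2]
print(groups_for_round(250))  # [1, 2, 3]
```

With a target ratio below 100%, `stage_pool` draws only the configured share of each earlier group instead of the whole group.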
The following describes, in connection with another embodiment, the process of fine-tuning a general image generation model to obtain an image generation model specific to pedestrian images. Fig. 8 is a schematic diagram of a fine-tuning training process of a general image generation model provided in an embodiment of the present application. As shown in fig. 8, in the embodiment of the present application, fine-tuning training may be performed on a general text-to-image generation model to obtain a pedestrian text-to-image generation model. The training data used in fine-tuning is text-image pairs, where each image used for training contains a pedestrian, and the pedestrians in the training images are varied in appearance, clothing, posture, and the like. Each image used for training corresponds to a sample text, and the sample text is determined based on a multi-level text prompt template, which is the text to be filled in the embodiments described above. After the data for training the general text-to-image generation model is obtained, the general text-to-image generation model can be fine-tuned based on a curriculum-learning fine-tuning training method, so as to obtain the pedestrian text-to-image generation model.
The embodiment of the application provides a method for converting discrete attribute values of a specific scene into multi-level text prompts, constructs text-image pair data based on those prompts, and fine-tunes a general text-to-image generation model using a curriculum learning method, thereby converting the general text-to-image generation model into a dedicated pedestrian image generation model and realizing controllable generation of pedestrian images.
On the basis of the above embodiments, there is also provided in an embodiment of the present application a sample set construction method, where the method includes:
acquiring an image set corresponding to a target scene and a text to be filled which is stored for the target scene, wherein the text to be filled comprises at least one target field to be written with an attribute value;
determining a target attribute value corresponding to the target field in the sample image aiming at any sample image in the image set, and correspondingly filling the target attribute value into the target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
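A minimal sketch of the template-filling step above; the template wording, field names, and attribute values are illustrative assumptions rather than the patent's actual template:

```python
# A text to be filled: each {field} is a target field awaiting an attribute value.
TEMPLATE = "a pedestrian wearing a {top_color} top and {bottom_color} trousers, {posture}"

def fill_template(template, target_attribute_values):
    """Write each target attribute value into its target field, yielding the
    sample text for one sample image."""
    return template.format(**target_attribute_values)

sample_text = fill_template(
    TEMPLATE, {"top_color": "red", "bottom_color": "black", "posture": "walking"})
print(sample_text)
# a pedestrian wearing a red top and black trousers, walking
```

The resulting (sample text, sample image) pair is then stored into the sample set.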
In one possible implementation manner, a plurality of texts to be filled are stored for the target scene; if different texts to be filled include the same number of target fields, the target attributes corresponding to the target fields in those texts to be filled are different.
In one possible implementation manner, the determining, for any sample image in the image set, a target attribute value corresponding to the target field in the sample image includes:
determining a target attribute corresponding to a target field in the text to be filled;
outputting the sample image and the target attribute aiming at any sample image in the image set, and prompting to input an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining the attribute value as the target attribute value.
In a possible implementation manner, before the outputting the sample image and the target attribute, the method further includes:
acquiring candidate attribute values stored aiming at the target attributes;
said outputting said sample image and said target property comprises:
and outputting the sample image, the target attribute and the candidate attribute value.
In a possible implementation manner, after the obtaining the attribute value corresponding to the target attribute in the sample image and before the determining the attribute value as the target attribute value, the method further includes:
outputting the sample image and the obtained attribute value corresponding to the target attribute, and prompting to audit the attribute value corresponding to the target attribute to obtain a first attribute value corresponding to the target attribute after audit;
and receiving the first attribute value corresponding to the target attribute after verification, and updating the attribute value corresponding to the target attribute obtained through image identification by using the first attribute value.
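The audit-and-update step can be sketched as follows; the reviewer callback, attribute names, and values are assumptions made for illustration:

```python
def audit_attribute_values(recognized, review_fn):
    """Present each recognized (attribute, value) pair to a reviewer; replace
    the value with the reviewer's corrected first attribute value when one is
    returned, otherwise keep the value obtained by image recognition."""
    audited = {}
    for attribute, value in recognized.items():
        corrected = review_fn(attribute, value)  # None means the value passes audit
        audited[attribute] = corrected if corrected is not None else value
    return audited

recognized = {"top_color": "red", "posture": "standing"}
# The reviewer corrects the posture and accepts the colour.
audited = audit_attribute_values(
    recognized, lambda attr, val: "walking" if attr == "posture" else None)
print(audited)  # {'top_color': 'red', 'posture': 'walking'}
```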
The sample set construction process provided in the embodiment of the present application is consistent with the process of obtaining the sample set corresponding to the target scene in the training process of the image generation model, and since the process has been described in detail in the foregoing embodiments, the sample set construction process is not repeated in the embodiment of the present application.
On the basis of the above embodiments, in the embodiments of the present application, there is also provided an image generation large model obtained based on the training method of the image generation model, so that the above embodiments can be referred to for training of the image generation large model.
In the present embodiment, the "large model" in the image generation large model may be understood as a model based on the transformer architecture; the "large model" may also be understood as a machine learning model with a huge parameter scale and complexity, for example, a neural network model with millions to billions of parameters, or even more; the "large model" may also be understood as a deep learning model trained on large-scale training data by semi-supervised (weakly supervised), fully supervised, self-supervised or unsupervised techniques. In the embodiment of the application, the large model can process a plurality of different tasks; when the large model is trained, training is generally performed based on training data from a certain target task field, and the large model obtained through training can generally be migrated to other task fields similar to the target task field for use.
The technical scheme has the characteristics of robustness, interpretability, reliability and universality, and accords with the credibility characteristic.
Fig. 9 is a schematic structural diagram of a training device for an image generation model according to an embodiment of the present application, as shown in fig. 9, where the device includes:
An obtaining module 901, configured to obtain a sample set corresponding to a target scene, where the sample set includes a plurality of sample images, and the sample images correspond to sample texts;
a determining module 902, configured to determine an attribute number of preset attributes included in a sample text in the sample set;
the grouping module 903 is configured to group the sample text and the corresponding sample image according to the attribute number, so as to obtain a sample group;
the training module 904 is configured to sort the sample groups in ascending order according to the attribute numbers corresponding to the sample groups, and train the image generation model to be trained by sequentially using the sorted sample groups to obtain a target model.
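As an illustrative sketch of the grouping and ascending sort performed by the grouping module 903 and training module 904 (the record layout is an assumption: each sample carries the attribute values its text contains), sample texts can be grouped by attribute number:

```python
from collections import defaultdict

def group_and_sort(samples):
    """Group (sample_text, sample_image, attributes) records by the number of
    preset attributes the sample text contains, then return the groups in
    ascending order of attribute number."""
    groups = defaultdict(list)
    for text, image, attributes in samples:
        groups[len(attributes)].append((text, image))
    return [groups[n] for n in sorted(groups)]

samples = [
    ("a pedestrian wearing a red top", "img_a", {"top_color": "red"}),
    ("a pedestrian wearing a red top, walking", "img_b",
     {"top_color": "red", "posture": "walking"}),
    ("a pedestrian wearing a blue top", "img_c", {"top_color": "blue"}),
]
sorted_groups = group_and_sort(samples)
print([len(g) for g in sorted_groups])  # [2, 1]
```

The single-attribute group (two samples) comes first, followed by the two-attribute group, matching the curriculum order used during training.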
In a possible implementation manner, the obtaining module 901 is specifically configured to obtain an image set corresponding to the target scene, and a text to be filled stored for the target scene, where the text to be filled includes at least one target field to which an attribute value is to be written; determining a target attribute value corresponding to the target field in the sample image aiming at any sample image in the image set, and correspondingly filling the target attribute value into the target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
In one possible implementation manner, a plurality of texts to be filled are stored for the target scene; if different texts to be filled include the same number of target fields, the target attributes corresponding to the target fields in those texts to be filled are different.
In a possible implementation manner, the determining module 902 is specifically configured to determine a target attribute corresponding to a target field in the text to be filled;
an output module 905, configured to output, for any sample image in the image set, the sample image and the target attribute, and prompt to input an attribute value corresponding to the target attribute in the sample image;
the determining module 902 is further configured to determine a received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining the attribute value as the target attribute value.
In a possible implementation manner, the obtaining module 901 is further configured to obtain a candidate attribute value saved for the target attribute;
The output module 905 is specifically configured to output the sample image, the target attribute, and the candidate attribute value.
In a possible implementation manner, the output module 905 is further configured to output the sample image and the obtained attribute value corresponding to the target attribute, and prompt to audit the attribute value corresponding to the target attribute, so as to obtain a first attribute value corresponding to the target attribute after audit;
the receiving update module 906 is configured to receive the first attribute value corresponding to the target attribute after the verification, and update the attribute value corresponding to the target attribute obtained through image recognition by using the first attribute value.
In a possible implementation manner, the training module 904 is specifically configured to sequentially obtain a first sample group from the sorted sample groups; and if the first sample group is the first group in the sorted sample groups, train the image generation model to be trained by using the first sample group.
In a possible implementation manner, the obtaining module 901 is further configured to, if the first sample group is a non-first group in the sorted sample groups, obtain a set number of sample images and corresponding sample texts from the other sample groups before the first sample group;
The training module 904 is further configured to train the image generation model to be trained using the set number of sample images, the sample texts corresponding to the set number of sample images, and the first sample group.
In a possible implementation manner, the obtaining module 901 is specifically configured to obtain a target ratio stored in advance; and acquire, from the other sample groups, a number of sample images corresponding to the target ratio and their corresponding sample texts.
Fig. 10 is a schematic structural diagram of a sample set constructing apparatus according to an embodiment of the present application, as shown in fig. 10, where the apparatus includes:
an obtaining module 1001, configured to obtain an image set corresponding to a target scene, and a text to be filled stored for the target scene, where the text to be filled includes at least one target field to which an attribute value is to be written;
a construction module 1002, configured to determine, for any sample image in the image set, a target attribute value corresponding to the target field in the sample image, and fill the target attribute value into the target field correspondingly, so as to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
In one possible implementation manner, a plurality of texts to be filled are stored for the target scene; if different texts to be filled include the same number of target fields, the target attributes corresponding to the target fields in those texts to be filled are different.
In a possible implementation manner, the construction module 1002 is specifically configured to determine a target attribute corresponding to a target field in the text to be filled; outputting the sample image and the target attribute aiming at any sample image in the image set, and prompting to input an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining the attribute value as the target attribute value.
In a possible implementation manner, the construction module 1002 is further configured to obtain a candidate attribute value saved for the target attribute; and outputting the sample image, the target attribute and the candidate attribute value.
In a possible implementation manner, the construction module 1002 is further configured to output the sample image and the obtained attribute value corresponding to the target attribute, and prompt to audit the attribute value corresponding to the target attribute, so as to obtain a first attribute value corresponding to the target attribute after audit; and receiving the first attribute value corresponding to the target attribute after verification, and updating the attribute value corresponding to the target attribute obtained through image identification by using the first attribute value.
Fig. 11 is a schematic structural diagram of an electronic device provided in an embodiment of the present application, and on the basis of the foregoing embodiments, the present application further provides an electronic device, as shown in fig. 11, including: the device comprises a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, wherein the processor 1101, the communication interface 1102 and the memory 1103 are in communication with each other through the communication bus 1104;
the memory 1103 has stored therein a computer program which, when executed by the processor 1101, causes the processor 1101 to perform the steps of:
acquiring a sample set corresponding to a target scene, wherein the sample set comprises a plurality of sample images, and the sample images correspond to sample texts;
Determining the attribute quantity of preset attributes included in the sample texts in the sample set, and grouping the sample texts and the corresponding sample images according to the attribute quantity to obtain sample groups;
and carrying out ascending order sorting on the sample groups according to the attribute quantity corresponding to the sample groups, and training the image generation model to be trained by sequentially using the sorted sample groups to obtain a target model.
In a possible implementation manner, the processor 1101 is configured to obtain an image set corresponding to the target scene, and a text to be filled stored for the target scene, where the text to be filled includes at least one target field to be written with an attribute value;
determining a target attribute value corresponding to the target field in the sample image aiming at any sample image in the image set, and correspondingly filling the target attribute value into the target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
In one possible implementation manner, a plurality of texts to be filled are stored for the target scene; if different texts to be filled include the same number of target fields, the target attributes corresponding to the target fields in those texts to be filled are different.
In a possible implementation manner, the processor 1101 is configured to determine a target attribute corresponding to a target field in the text to be filled;
outputting the sample image and the target attribute aiming at any sample image in the image set, and prompting to input an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining the attribute value as the target attribute value.
In a possible implementation manner, the processor 1101 is configured to obtain a candidate attribute value saved for the target attribute;
said outputting said sample image and said target property comprises:
and outputting the sample image, the target attribute and the candidate attribute value.
In a possible implementation manner, the processor 1101 is configured to output the sample image and the obtained attribute value corresponding to the target attribute, and prompt to audit the attribute value corresponding to the target attribute, so as to obtain a first attribute value corresponding to the target attribute after audit;
And receiving the first attribute value corresponding to the target attribute after verification, and updating the attribute value corresponding to the target attribute obtained through image identification by using the first attribute value.
In a possible implementation manner, the processor 1101 is configured to sequentially obtain a first sample group from the sorted sample groups;
and if the first sample group is the first group in the sorted sample groups, train the image generation model to be trained by using the first sample group.
In a possible implementation manner, the processor 1101 is configured to, if the first sample group is a non-first group in the sorted sample groups, obtain a set number of sample images and corresponding sample texts from the other sample groups before the first sample group;
and train the image generation model to be trained by using the set number of sample images, the sample texts corresponding to the set number of sample images, and the first sample group.
In one possible implementation, the processor 1101 is configured to obtain a pre-saved target ratio;
and acquiring, from the other sample groups, a number of sample images corresponding to the target ratio and their corresponding sample texts.
Alternatively, when executed by the processor 1101, the computer program causes the processor 1101 to perform the steps of:
acquiring an image set corresponding to a target scene and a text to be filled which is stored for the target scene, wherein the text to be filled comprises at least one target field to be written with an attribute value;
determining a target attribute value corresponding to the target field in the sample image aiming at any sample image in the image set, and correspondingly filling the target attribute value into the target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
In a possible implementation manner, a plurality of texts to be filled are stored for the target scene; if different texts to be filled include the same number of target fields, the target attributes corresponding to the target fields in those texts to be filled are different.
In a possible implementation manner, the processor 1101 is configured to determine a target attribute corresponding to a target field in the text to be filled; outputting the sample image and the target attribute aiming at any sample image in the image set, and prompting to input an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining the attribute value as the target attribute value.
In a possible implementation manner, the processor 1101 is configured to obtain a candidate attribute value saved for the target attribute; and outputting the sample image, the target attribute and the candidate attribute value.
In a possible implementation manner, the processor 1101 is configured to output the sample image and the obtained attribute value corresponding to the target attribute, and prompt to audit the attribute value corresponding to the target attribute, so as to obtain a first attribute value corresponding to the target attribute after audit;
and receiving the first attribute value corresponding to the target attribute after verification, and updating the attribute value corresponding to the target attribute obtained through image identification by using the first attribute value.
The communication bus mentioned above for the electronic device may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus. The communication interface 1102 is used for communication between the electronic device and other devices. The memory may include random access memory (Random Access Memory, RAM) or non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor. The processor may be a general-purpose processor, including a central processing unit, a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processing, DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
On the basis of the above embodiments, the present application further provides a computer readable storage medium having stored therein a computer program executable by a processor, which when run on the processor, causes the processor to perform the steps of:
acquiring a sample set corresponding to a target scene, wherein the sample set comprises a plurality of sample images, and the sample images correspond to sample texts;
determining the attribute quantity of preset attributes included in the sample texts in the sample set, and grouping the sample texts and the corresponding sample images according to the attribute quantity to obtain sample groups;
and carrying out ascending order sorting on the sample groups according to the attribute quantity corresponding to the sample groups, and training the image generation model to be trained by sequentially using the sorted sample groups to obtain a target model.
In one possible implementation manner, the process of obtaining the sample set corresponding to the target scene includes:
acquiring an image set corresponding to the target scene and a text to be filled which is stored for the target scene, wherein the text to be filled comprises at least one target field to be written with an attribute value;
Determining a target attribute value corresponding to the target field in the sample image aiming at any sample image in the image set, and correspondingly filling the target attribute value into the target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
In one possible implementation manner, a plurality of texts to be filled are stored for the target scene, and if different texts to be filled include the same number of target fields, the target attributes corresponding to the target fields in those texts to be filled differ.
In one possible implementation manner, the determining, for any sample image in the image set, a target attribute value corresponding to the target field in the sample image includes:
determining a target attribute corresponding to a target field in the text to be filled;
for any sample image in the image set, outputting the sample image and the target attribute, and prompting for input of an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining that attribute value as the target attribute value.
In a possible implementation manner, before the outputting the sample image and the target attribute, the method further includes:
acquiring candidate attribute values stored for the target attribute;
the outputting the sample image and the target attribute comprises:
and outputting the sample image, the target attribute and the candidate attribute value.
In a possible implementation manner, after obtaining the attribute value corresponding to the target attribute in the sample image and before determining the attribute value as the target attribute value, the method further includes:
outputting the sample image and the obtained attribute value corresponding to the target attribute, and prompting for review of the attribute value corresponding to the target attribute, so as to obtain a reviewed first attribute value corresponding to the target attribute;
and receiving the reviewed first attribute value corresponding to the target attribute, and using the first attribute value to update the attribute value corresponding to the target attribute that was obtained through image recognition.
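The recognition-then-review flow described above can be sketched as below; the function names and the convention that a reviewer returns `None` to approve the recognized value are illustrative assumptions:

```python
def resolve_attribute_values(image, attributes, recognize, review):
    # recognize(image, attribute) -> attribute value proposed by image recognition.
    # review(image, attribute, value) -> corrected ("first") attribute value,
    # or None if the reviewer approves the recognized value as-is.
    values = {}
    for attr in attributes:
        proposed = recognize(image, attr)
        corrected = review(image, attr, proposed)
        # The reviewed value replaces the one obtained by image recognition.
        values[attr] = corrected if corrected is not None else proposed
    return values
```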
In a possible implementation manner, the training the image generation model to be trained by sequentially using the sorted sample groups includes:
sequentially acquiring a first sample group from the sorted sample groups;
and if the first sample group is the first group in the sorted sample groups, training the image generation model to be trained using the first sample group.
In one possible embodiment, the method further comprises:
if the first sample group is a non-first group in the sorted sample groups, acquiring a set number of sample images and corresponding sample texts from other sample groups before the first sample group;
and training the image generation model to be trained using the set number of sample images, the sample texts corresponding to the set number of sample images, and the first sample group.
In one possible implementation, the acquiring a set number of sample images and corresponding sample text includes:
obtaining a target ratio stored in advance;
and acquiring, from the other sample groups, a number of sample images and corresponding sample texts determined by the target ratio.
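The ratio-based replay described above (mixing a stored fraction of earlier groups' samples into a non-first group's training data) might look like the following sketch; the function name, the per-group interpretation of the ratio, and the use of random sampling are assumptions, not the patent's implementation:

```python
import random

def build_training_batch(current_group, earlier_groups, target_ratio, rng=None):
    # current_group: (image, text) pairs of the group being trained on (a non-first group).
    # earlier_groups: the sample groups that precede it in the curriculum order.
    # target_ratio: pre-stored fraction of the current group's size to replay
    # from each earlier group, to counter forgetting of simpler samples.
    rng = rng or random.Random()
    replay = []
    for group in earlier_groups:
        k = min(len(group), int(len(current_group) * target_ratio))
        replay.extend(rng.sample(group, k))
    return current_group + replay
```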
When run on the processor, the computer program further causes the processor to perform the following steps:
acquiring an image set corresponding to a target scene and a text to be filled which is stored for the target scene, wherein the text to be filled comprises at least one target field to be written with an attribute value;
for any sample image in the image set, determining a target attribute value corresponding to the target field in the sample image, and filling the target attribute value into the corresponding target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
In one possible implementation manner, a plurality of texts to be filled are stored for the target scene, and if different texts to be filled include the same number of target fields, the target attributes corresponding to the target fields in those texts to be filled differ.
In one possible implementation manner, the determining, for any sample image in the image set, a target attribute value corresponding to the target field in the sample image includes:
determining a target attribute corresponding to a target field in the text to be filled;
for any sample image in the image set, outputting the sample image and the target attribute, and prompting for input of an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining that attribute value as the target attribute value.
In a possible implementation manner, before the outputting the sample image and the target attribute, the method further includes:
acquiring candidate attribute values stored for the target attribute;
the outputting the sample image and the target attribute comprises:
and outputting the sample image, the target attribute and the candidate attribute value.
In a possible implementation manner, after obtaining the attribute value corresponding to the target attribute in the sample image and before determining the attribute value as the target attribute value, the method further includes:
outputting the sample image and the obtained attribute value corresponding to the target attribute, and prompting for review of the attribute value corresponding to the target attribute, so as to obtain a reviewed first attribute value corresponding to the target attribute;
and receiving the reviewed first attribute value corresponding to the target attribute, and using the first attribute value to update the attribute value corresponding to the target attribute that was obtained through image recognition.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Since the system/device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, reference may be made to the description of the method embodiments.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (13)

1. A method of training an image generation model, the method comprising:
acquiring a sample set corresponding to a target scene, wherein the sample set comprises a plurality of sample images, and the sample images correspond to sample texts;
determining the attribute quantity of preset attributes included in the sample texts in the sample set, and grouping the sample texts and the corresponding sample images according to the attribute quantity to obtain sample groups;
the sample groups are ordered in ascending order according to the attribute quantity corresponding to the sample groups, and the ordered sample groups are used for training the image generation model to be trained in sequence to obtain a target model;
the process of obtaining the sample set corresponding to the target scene comprises the following steps:
acquiring an image set corresponding to the target scene and a text to be filled which is stored for the target scene, wherein the text to be filled comprises at least one target field to be written with an attribute value; the texts to be filled include texts to be filled having different numbers of attributes;
determining, for any sample image in the image set, a target attribute value corresponding to the target field in the sample image, and filling the target attribute value into the corresponding target field to obtain a sample text corresponding to the sample image; and storing the sample text corresponding to the sample image into a sample set.
2. The method of claim 1, wherein a plurality of texts to be filled are stored for the target scene, and if different texts to be filled include the same number of target fields, the target attributes corresponding to the target fields in those texts to be filled differ.
3. The method according to claim 1 or 2, wherein the determining, for any sample image in the set of images, a target attribute value corresponding to the target field in the sample image comprises:
determining a target attribute corresponding to a target field in the text to be filled;
for any sample image in the image set, outputting the sample image and the target attribute, and prompting for input of an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining that attribute value as the target attribute value.
4. The method according to claim 3, wherein before the outputting the sample image and the target attribute, the method further comprises:
acquiring candidate attribute values stored for the target attribute;
the outputting the sample image and the target attribute comprises:
and outputting the sample image, the target attribute and the candidate attribute value.
5. The method according to claim 3, wherein after obtaining the attribute value corresponding to the target attribute in the sample image and before determining the attribute value as the target attribute value, the method further comprises:
outputting the sample image and the obtained attribute value corresponding to the target attribute, and prompting for review of the attribute value corresponding to the target attribute, so as to obtain a reviewed first attribute value corresponding to the target attribute;
and receiving the reviewed first attribute value corresponding to the target attribute, and using the first attribute value to update the attribute value corresponding to the target attribute that was obtained through image recognition.
6. The method of claim 1, wherein training the image generation model to be trained using the ordered sample groups in turn comprises:
sequentially acquiring a first sample group from the sorted sample groups;
and if the first sample group is the first group in the sorted sample groups, training the image generation model to be trained using the first sample group.
7. The method of claim 6, wherein the method further comprises:
if the first sample group is a non-first group in the sorted sample groups, acquiring a set number of sample images and corresponding sample texts from other sample groups before the first sample group;
and training the image generation model to be trained using the set number of sample images, the sample texts corresponding to the set number of sample images, and the first sample group.
8. The method of claim 7, wherein the acquiring a set number of sample images and corresponding sample text comprises:
obtaining a target ratio stored in advance;
and acquiring, from the other sample groups, a number of sample images and corresponding sample texts determined by the target ratio.
9. A method of sample set construction, the method comprising:
acquiring an image set corresponding to a target scene and a text to be filled which is stored for the target scene, wherein the text to be filled comprises at least one target field to be written with an attribute value;
determining, for any sample image in the image set, a target attribute value corresponding to the target field in the sample image, and filling the target attribute value into the corresponding target field to obtain a sample text corresponding to the sample image; storing the sample text corresponding to the sample image into a sample set;
wherein, for any sample image in the image set, determining the target attribute value corresponding to the target field in the sample image includes:
determining a target attribute corresponding to a target field in the text to be filled;
for any sample image in the image set, outputting the sample image and the target attribute, and prompting for input of an attribute value corresponding to the target attribute in the sample image; determining the received attribute value corresponding to the target attribute as the target attribute value; or, for any sample image in the image set, performing image recognition on the sample image to obtain an attribute value corresponding to the target attribute in the sample image, and determining that attribute value as the target attribute value.
10. The method of claim 9, wherein a plurality of texts to be filled are stored for the target scene, and if different texts to be filled include the same number of target fields, the target attributes corresponding to the target fields in those texts to be filled differ.
11. The method of claim 9, wherein before the outputting the sample image and the target attribute, the method further comprises:
acquiring candidate attribute values stored for the target attribute;
the outputting the sample image and the target attribute comprises:
and outputting the sample image, the target attribute and the candidate attribute value.
12. The method according to claim 9, wherein after obtaining the attribute value corresponding to the target attribute in the sample image and before determining the attribute value as the target attribute value, the method further comprises:
outputting the sample image and the obtained attribute value corresponding to the target attribute, and prompting for review of the attribute value corresponding to the target attribute, so as to obtain a reviewed first attribute value corresponding to the target attribute;
and receiving the reviewed first attribute value corresponding to the target attribute, and using the first attribute value to update the attribute value corresponding to the target attribute that was obtained through image recognition.
13. An electronic device comprising at least a processor and a memory, the processor being adapted to implement the steps of the training method of the image generation model according to any of the preceding claims 1-8 or the steps of the sample set construction method according to any of the preceding claims 9-12 when executing a computer program stored in the memory.
CN202311192236.5A 2023-09-15 2023-09-15 Training of image generation model, sample set construction method and image generation large model Active CN116935144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311192236.5A CN116935144B (en) 2023-09-15 2023-09-15 Training of image generation model, sample set construction method and image generation large model


Publications (2)

Publication Number Publication Date
CN116935144A CN116935144A (en) 2023-10-24
CN116935144B true CN116935144B (en) 2024-02-06

Family

ID=88384661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311192236.5A Active CN116935144B (en) 2023-09-15 2023-09-15 Training of image generation model, sample set construction method and image generation large model

Country Status (1)

Country Link
CN (1) CN116935144B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062426A (en) * 2019-12-11 2020-04-24 北京金山云网络技术有限公司 Method, device, electronic equipment and medium for establishing training set
WO2022134588A1 (en) * 2020-12-21 2022-06-30 深圳壹账通智能科技有限公司 Method for constructing information review classification model, and information review method
CN116630465A (en) * 2023-07-24 2023-08-22 海信集团控股股份有限公司 Model training and image generating method and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yan Jingzhe. Research on a text recognition algorithm using curriculum learning methods. Fujian Computer. 2020, (04), full text. *


Similar Documents

Publication Publication Date Title
CN108052984B (en) Method of counting and device
CN111783505A (en) Method and device for identifying forged faces and computer-readable storage medium
CN111310057B (en) Online learning mining method and device, online learning system and server
CN106228183A (en) A kind of semi-supervised learning sorting technique and device
CN111414540B (en) Online learning recommendation method and device, online learning system and server
CN110399487A (en) A kind of file classification method, device, electronic equipment and storage medium
CN105528620A (en) Joint robustness principal component feature learning and visual classification method and system
CN109447273A (en) Model training method, advertisement recommended method, relevant apparatus, equipment and medium
CN109800776A (en) Material mask method, device, terminal and computer readable storage medium
CN110889718A (en) Method and apparatus for screening program, medium, and electronic device
CN112767038B (en) Poster CTR prediction method and device based on aesthetic characteristics
CN110232397A (en) A kind of multi-tag classification method of combination supporting vector machine and projection matrix
CN116935144B (en) Training of image generation model, sample set construction method and image generation large model
CN112100509B (en) Information recommendation method, device, server and storage medium
CN111914772A (en) Method for identifying age, and training method and device of age identification model
CN112836899A (en) Attribute prediction method and system based on block chain and cloud computing and service center
CN112269875A (en) Text classification method and device, electronic equipment and storage medium
CN108805290B (en) Entity category determination method and device
CN115878891A (en) Live content generation method, device, equipment and computer storage medium
CN110147497B (en) Individual content recommendation method for teenager group
CN114119142A (en) Information recommendation method, device and system
CN113284257A (en) Modularized generation and display method and system for virtual scene content
CN112559589A (en) Remote surveying and mapping data processing method and system
CN111984842A (en) Bank client data processing method and device
CN111612023A (en) Classification model construction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant