CN117456039A - AIGC magic head portrait generation method, device and equipment based on joint training - Google Patents

AIGC magic head portrait generation method, device and equipment based on joint training

Info

Publication number
CN117456039A
Authority
CN
China
Prior art keywords
training
model
stable
image
diffusion
Prior art date
Legal status
Granted
Application number
CN202311794176.4A
Other languages
Chinese (zh)
Other versions
CN117456039B (en)
Inventor
赖坤锋
Current Assignee
Shenzhen Moshi Technology Co ltd
Original Assignee
Shenzhen Moshi Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Moshi Technology Co ltd
Priority to CN202311794176.4A
Publication of CN117456039A
Application granted
Publication of CN117456039B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00: 2D [Two Dimensional] image generation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161: Detection; Localisation; Normalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168: Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method, a device and equipment for generating an AIGC magic head portrait based on joint training, wherein the method comprises the following steps: adding a preprocessed image set and an image description text set to a training sample set to update the training sample set; performing two-stage training on an initial Stable Diffusion model based on the training sample set to obtain a second-stage trained Stable Diffusion model; and if it is determined that the average loss value and the loss standard deviation of the second-stage trained Stable Diffusion model meet a preset loss condition, taking the second-stage trained Stable Diffusion model as the target model. The embodiment of the invention can perform multi-round joint training on the initial Stable Diffusion model based on the user images currently collected by the user and the stored training sample set, fully train the text encoder and the U-Net network in the model, and improve the prompt similarity and the face similarity of the generated images.

Description

AIGC magic head portrait generation method, device and equipment based on joint training
Technical Field
The invention relates to the technical field of generative artificial intelligence, in particular to an AIGC magic head portrait generation method, device and equipment based on joint training.
Background
With the development of artificial intelligence technology, techniques for generating magic head portraits have also advanced. At present, the generation process of a magic head portrait is as follows: the user inputs several photos, typically 3 to more than 10; the user's likeness is bound to a label through several minutes of model training (typically between 10 and 20 minutes); and various artistic photos bearing the user's likeness are then generated from text prompts. This generation process can produce highly imaginative pictures for a specified character, which brings great convenience to content creation, artistic design, novel illustration and influencer image creation. In general, a good magic head portrait needs to have the following features:
A1) Person similarity: person similarity is particularly important; if the person looks dissimilar or unattractive, the user will be dissatisfied;
A2) Story sense of the picture: the picture cannot be one the user could obtain simply by taking a photo; it must have a certain sense of story, such as an artistic, sketch or oil-painting style, or a certain story background, such as a super hero or a flower fairy;
A3) Artistic quality of the photo: the quality and composition of the photo need a certain artist's brushwork and artistic flavor, so as to improve the overall aesthetic feeling of the photo;
A4) Lighting and atmosphere: proper lighting, such as soft light, portrait light or theatrical light, makes the photo more aesthetically pleasing.
Items A2) to A4) need to match the user's prompt (a prompt may be understood as a hint, cue or instruction), and these may be collectively defined as prompt similarity.
Two steps are required to create a magic head portrait. B1) Text-to-image generation: this step generally uses the Stable Diffusion model. B2) The person's likeness needs to be fused into the Stable Diffusion model, since a photo of that specific person cannot already be present in the Stable Diffusion model. For example, a character string such as abc is used as the person's name and stands in for the person's likeness during text-to-image creation. This step is called object transformation, and the industry currently has several methods for object transformation:
C1) The Textual Inversion scheme, proposed by NVIDIA in August 2022; its core is to train only the text encoder part of the Stable Diffusion model. The person similarity achieved by this method is limited;
C2) The Dreambooth training scheme: this scheme was proposed by Google in August 2022; its core is to train the U-Net part of the Stable Diffusion model (generally only the U-Net part is trained) with regularization and class samples, and this method can ensure person similarity. However, the generated person image is strongly affected by stylization, so the similarity is still insufficient to a certain extent;
C3) The Dreambooth + Textual Inversion (DB+TI) scheme: DB+TI trains the text encoder part and the U-Net part at the same time, so the similarity is greatly improved as a whole, but its performance on prompt similarity and stylization is poor.
Disclosure of Invention
The embodiment of the invention provides a method, a device and equipment for generating an AIGC magic head portrait based on joint training, which aim to solve the problem that the various schemes adopted for the Stable Diffusion model in the prior-art magic head portrait generation process cannot simultaneously take into account the training of the text encoder and the U-Net network, the improvement of the prompt similarity and the improvement of the face similarity.
In a first aspect, an embodiment of the present invention provides a method for generating an AIGC magic head portrait based on joint training, including:
responding to a magic head portrait generation model training instruction, acquiring a plurality of user images corresponding to the magic head portrait generation model training instruction and a preprocessed image set obtained by correspondingly preprocessing the user images;
acquiring an image description text set corresponding to the preprocessed image set;
acquiring a pre-stored training sample set, and adding the pre-processed image set and the image description text set to the training sample set to update the training sample set;
determining a first-stage number of training rounds according to a preset total number of training rounds and a first preset training proportion, and performing model training for the first-stage number of training rounds on the text encoder and the U-Net network in an initial Stable Diffusion model based on the training sample set, to obtain a first-stage trained Stable Diffusion model;
determining a second-stage number of training rounds according to the total number of training rounds and a second preset training proportion, and performing model training for the second-stage number of training rounds on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set, to obtain a second-stage trained Stable Diffusion model;
and if it is determined that the average loss value and the loss standard deviation obtained after the training sample set is input to the second-stage trained Stable Diffusion model meet a preset loss condition, taking the second-stage trained Stable Diffusion model as a target Stable Diffusion model and using it for generating an AIGC magic head portrait.
In a second aspect, an embodiment of the present invention further provides an AIGC magic head portrait generating apparatus based on joint training, including:
the image preprocessing unit is used for responding to the magic head portrait generation model training instruction, acquiring a plurality of user images corresponding to the magic head portrait generation model training instruction and a preprocessed image set obtained by correspondingly preprocessing the user images;
An image description text acquisition unit for acquiring an image description text set corresponding to the preprocessed image set;
a training sample set updating unit, configured to acquire a pre-stored training sample set, and add the preprocessed image set and the image description text set to the training sample set to update the training sample set;
the first-stage training unit is used for determining the first-stage number of training rounds according to the preset total number of training rounds and the first preset training proportion, and performing model training for the first-stage number of training rounds on the text encoder and the U-Net network in the initial Stable Diffusion model based on the training sample set, to obtain the first-stage trained Stable Diffusion model;
the second-stage training unit is used for determining the second-stage number of training rounds according to the total number of training rounds and the second preset training proportion, and performing model training for the second-stage number of training rounds on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set, to obtain the second-stage trained Stable Diffusion model;
and the target model acquisition unit is used for, if it is determined that the average loss value and the loss standard deviation of the second-stage trained Stable Diffusion model meet a preset loss condition, taking the second-stage trained Stable Diffusion model as a target Stable Diffusion model for generating an AIGC magic head portrait.
In a third aspect, an embodiment of the present invention further provides a computer device; the computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the method of the first aspect.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, implement the method of the first aspect.
The embodiment of the invention provides a method, a device and equipment for generating an AIGC magic head portrait based on joint training, wherein the method comprises the following steps: responding to a magic head portrait generation model training instruction, acquiring a plurality of user images corresponding to the magic head portrait generation model training instruction and a preprocessed image set obtained by correspondingly preprocessing the user images; acquiring an image description text set corresponding to the preprocessed image set; acquiring a pre-stored training sample set, and adding the preprocessed image set and the image description text set to the training sample set to update the training sample set; determining a first-stage number of training rounds according to a preset total number of training rounds and a first preset training proportion, and performing model training for the first-stage number of training rounds on the text encoder and the U-Net network in an initial Stable Diffusion model based on the training sample set, to obtain a first-stage trained Stable Diffusion model; determining a second-stage number of training rounds according to the total number of training rounds and a second preset training proportion, and performing model training for the second-stage number of training rounds on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set, to obtain a second-stage trained Stable Diffusion model; and if it is determined that the average loss value and the loss standard deviation obtained after the training sample set is input to the second-stage trained Stable Diffusion model meet a preset loss condition, taking the second-stage trained Stable Diffusion model as a target Stable Diffusion model and using it for generating an AIGC magic head portrait. The embodiment of the invention can perform multi-round joint training on the initial Stable Diffusion model based on the user images currently collected by the user and the stored training sample set, fully train the text encoder and the U-Net network in the model, and improve the prompt similarity and the person similarity of the generated images.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an application scenario schematic diagram of an AIGC magic head portrait generating method based on joint training provided by an embodiment of the present invention;
fig. 2 is a flow chart of an AIGC magic head portrait generating method based on joint training according to an embodiment of the present invention;
FIG. 3 is a schematic sub-flowchart of an AIGC magic head portrait generating method based on joint training according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a model structure of the Stable Diffusion model in the AIGC magic head portrait generating method based on joint training according to the embodiment of the present invention;
FIG. 5 is a schematic diagram of another sub-flowchart of the AIGC magic head portrait generating method based on joint training according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of another sub-flowchart of the AIGC magic head portrait generating method based on joint training according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of an AIGC magic head portrait generating device based on joint training according to an embodiment of the present invention;
fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Before the technical scheme of the application is introduced, the technical terms involved in the application are introduced below.
AIGC, short for Artificial Intelligence Generated Content, refers to generative artificial intelligence. AIGC is a new way of content creation following professionally generated content (PGC, Professional-Generated Content) and user-generated content (UGC, User-Generated Content), and can create new forms of digital content generation and interaction in conversation, story, image, video and music production.
The Stable Diffusion model is a text-to-image generation model based on latent diffusion models (Latent Diffusion Models) and can generate high-quality, high-resolution and high-fidelity images from any text input.
Magic Avatar specifically refers to the use of AIGC text-to-image generation technology. In a magic head portrait generation scenario, a user uploads several single-person portrait head images, and several imaginative head portraits are generated from the uploaded portrait images on the basis of the text-to-image technology of the Stable Diffusion model.
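For orientation only, the text-to-image step can be sketched with a publicly available Stable Diffusion checkpoint through the Hugging Face diffusers library, as shown below; the checkpoint name, the prompt and the placeholder string "abc" standing for a learned person are illustrative assumptions and not part of the claimed method.

# Minimal text-to-image sketch with the diffusers library (illustrative only).
# The checkpoint name and the person placeholder "abc" are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# "abc" stands in for the person learned during fine-tuning, as described above.
prompt = "a portrait of abc person as a super hero, oil painting, dramatic lighting"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("magic_avatar.png")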
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of a scenario of an AIGC magic head image generating method based on joint training according to an embodiment of the present invention, and fig. 2 is a schematic flow chart of an AIGC magic head image generating method based on joint training according to an embodiment of the present invention. As shown in fig. 2, the method includes the following steps S110 to S160.
S110, responding to a magic head portrait generation model training instruction, acquiring a plurality of user images corresponding to the magic head portrait generation model training instruction and a preprocessed image set obtained by correspondingly preprocessing the user images.
In this embodiment, the technical scheme is described with the server 10 as the execution subject. Because the initial Stable Diffusion model stored in the server 10 has not been trained with the user's own head images, when the user needs the initial Stable Diffusion model in the server 10 to generate magic head portraits, the user may first operate the user interaction interface of the user terminal 20, for example, click a model training virtual button on the user interaction interface, to trigger the generation of the magic head portrait generation model training instruction. Of course, in order to make the training of the initial Stable Diffusion model in the server relate to the user, after the user collects a plurality of user images containing the user's head portrait with the user terminal 20, the user terminal 20 may upload the plurality of user images to the server 10, and the server 10 preprocesses the plurality of user images to obtain the preprocessed image set.
In one embodiment, as shown in fig. 3, step S110 includes:
s111, acquiring a plurality of user images corresponding to the magic head portrait generation model training instruction, and forming an initial user image set;
S112, performing image clipping processing meeting a first preset clipping condition on each initial user image in the initial user image set to obtain a preprocessed image corresponding to each initial user image, the preprocessed images forming the preprocessed image set; the first preset clipping condition is that the proportion of the face region in the clipped preprocessed image, relative to the whole preprocessed image, lies within a first preset proportion interval.
In this embodiment, after the server receives the plurality of user images uploaded by the user terminal, the plurality of user images may be sequentially numbered based on the acquisition time of each user image (e.g., sorted and numbered in ascending order of acquisition time) to form the initial user image set. For example, the user sequentially collects N1 user images (N1 is a positive integer) with the user terminal, and these compose the initial user image set. The clipping process is described by taking one initial user image in the initial user image set as an example, say the initial user image numbered 3. When it is initially judged that the proportion of the face region in the initial user image numbered 3, relative to the whole initial user image, is not within the first preset proportion interval (for example, the first preset proportion interval is set to 20%-30%), it may be further judged whether the proportion is smaller than the lower limit of the first preset proportion interval or greater than its upper limit. When the proportion of the face region in the initial user image numbered 3 is smaller than the lower limit of the first preset proportion interval, this indicates that the face region occupies too small a share of the initial user image; the margin regions of the initial user image may then be cropped to enlarge the proportion of the face region, until the proportion of the face region in the preprocessed image falls within the first preset proportion interval. When the proportion of the face region in the initial user image numbered 3 is greater than the upper limit of the first preset proportion interval, the face region occupies too large a share, and the margin regions of the initial user image may be enlarged to reduce the proportion of the face region, until the proportion of the face region in the preprocessed image falls within the first preset proportion interval. Based on this preprocessing, the plurality of user images uploaded by the user terminal can be processed into a preprocessed image set that is convenient for subsequent extraction of the face region.
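A minimal sketch of such face-proportion-driven preprocessing is given below; the OpenCV Haar-cascade detector, the 20%-30% interval and the cropping/padding heuristics are assumptions used purely for illustration and are not the exact preprocessing prescribed by the invention.

# Illustrative sketch: crop or pad an image so that the face region occupies a
# target proportion of the whole image (assumed interval 20%-30%).
import cv2

FACE_RATIO_LOW, FACE_RATIO_HIGH = 0.20, 0.30  # assumed first preset proportion interval

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess(path: str):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return img                                      # no face found: keep the image as is
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest detected face
    ratio = (w * h) / (img.shape[0] * img.shape[1])
    if ratio < FACE_RATIO_LOW:
        # Face proportion too small: crop a square around the face to enlarge it.
        side = int((w * h / FACE_RATIO_LOW) ** 0.5)
        cx, cy = x + w // 2, y + h // 2
        x0, y0 = max(cx - side // 2, 0), max(cy - side // 2, 0)
        img = img[y0:y0 + side, x0:x0 + side]
    elif ratio > FACE_RATIO_HIGH:
        # Face proportion too large: pad the borders to reduce it.
        side = int((w * h / FACE_RATIO_HIGH) ** 0.5)
        pad_x = max((side - img.shape[1]) // 2, 0)
        pad_y = max((side - img.shape[0]) // 2, 0)
        img = cv2.copyMakeBorder(img, pad_y, pad_y, pad_x, pad_x, cv2.BORDER_REPLICATE)
    return img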
S120, acquiring an image description text set corresponding to the preprocessed image set.
In this embodiment, the preprocessed image set obtained by the server is by itself insufficient to form a training set with enough information dimensions; at this point, an automatically obtained image description text may be associated with each preprocessed image in the preprocessed image set.
In one embodiment, step S120 includes:
and acquiring the image description text corresponding to each preprocessed image in the preprocessed image set based on a pre-trained bootstrapping multi-modal model for unified understanding and generation.
In this embodiment, a bootstrapping multi-modal model for unified understanding and generation, namely a BLIP model, is pre-stored and pre-trained in the server, and the image description text corresponding to each preprocessed image in the preprocessed image set can be automatically acquired through the BLIP model. The acquisition of the image description text is described below taking the preprocessed image corresponding to the initial user image numbered 3 as an example. For example, the preprocessed image corresponding to the initial user image numbered 3 shows a boy wearing a Spider-Man suit, and the image description text obtained for this image based on the BLIP model is, for example, 'a boy, male, alone, lifelike, Spider-Man suit, looking at the camera, brown hair, super hero, upper body, spider-web print, mouth closed'. Each preprocessed image in the preprocessed image set is handled by the above process, and the image description text corresponding to each preprocessed image is obtained based on the bootstrapping multi-modal model for unified understanding and generation. Therefore, the image description text corresponding to each preprocessed image can be automatically acquired by this process, without manual intervention.
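As an illustration, this captioning step can be sketched with an open-source BLIP checkpoint from the transformers library; the checkpoint name and the example file name are assumptions, and the BLIP model actually deployed may be trained differently.

# Illustrative sketch: generate an image description text for each preprocessed
# image with an open-source BLIP captioning checkpoint (assumed model name).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def describe(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(out[0], skip_special_tokens=True)

captions = {p: describe(p) for p in ["user_003.png"]}  # hypothetical file name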
S130, acquiring a pre-stored training sample set, and adding the preprocessing image set and the image description text set to the training sample set to update the training sample set.
In this embodiment, since the training sample set pre-stored in the server does not include the user images and description texts corresponding to the user terminal, in order to make the model training more targeted, the preprocessed image set and the image description text set may be added to the training sample set to update the training sample set. Adding the user's images and description texts to the training sample set introduces the user's own influencing factors into the training sample set.
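A minimal sketch of this bookkeeping step is shown below; the JSON storage format and the record layout (image path plus description text) are assumptions chosen only to make the example concrete.

# Illustrative sketch: append the user's (preprocessed image, description text)
# pairs to the pre-stored training sample set. The JSON layout is an assumption.
import json
from pathlib import Path

def update_training_set(store: Path, new_images: list, captions: dict) -> list:
    samples = json.loads(store.read_text()) if store.exists() else []
    for path in new_images:
        samples.append({"image": path, "text": captions[path]})
    store.write_text(json.dumps(samples, ensure_ascii=False, indent=2))
    return samples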
S140, determining a first-stage number of training rounds according to a preset total number of training rounds and a first preset training proportion, and performing model training for the first-stage number of training rounds on the text encoder and the U-Net network in the initial Stable Diffusion model based on the training sample set, to obtain a first-stage trained Stable Diffusion model.
In this embodiment, in order to understand the model structure of the Stable Diffusion model more clearly, a detailed description is given below with reference to fig. 4. As shown in fig. 4, the Stable Diffusion model includes at least a Text Encoder, a U-Net network and a variational autoencoder (i.e., the VAE in fig. 4). The text encoder plays the role of a text understanding module in the Stable Diffusion model and converts the text input to the Stable Diffusion model into vectors that the model can understand and interpret. The U-Net network can be regarded as a semantic segmentation model; it is the module that progressively denoises the text-conditioned information into image composition information. The variational autoencoder converts the image composition information output by the U-Net network into an image, thereby realizing image generation.
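For orientation, the three sub-modules named above can be loaded as separate components of a public Stable Diffusion checkpoint with the diffusers and transformers libraries, as sketched below; the checkpoint name is an assumption and need not be the initial model used by the invention.

# Illustrative sketch: load the text encoder, U-Net and VAE of a public
# Stable Diffusion checkpoint as separate modules (assumed checkpoint name).
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

base = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
noise_scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")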
During the first-stage training of the initial Stable Diffusion model, the text encoder and the U-Net network are trained simultaneously, i.e., the text encoder is not frozen. The first-stage training of the initial Stable Diffusion model allows both the text encoder and the U-Net network to be trained, but the training of the U-Net network does not yet reach the required level at this point, and training continues in the next stage.
In one embodiment, step S140 includes:
and based on the training sample set, performing model training for at least the first-stage number of training rounds on the text encoder and the U-Net network in the initial Stable Diffusion model in a mode combining Dreambooth training and Textual Inversion training, to obtain the first-stage trained Stable Diffusion model.
In this embodiment, during the first-stage training of the initial Stable Diffusion model, the server adopts a mode combining Dreambooth training and Textual Inversion training. For example, the total number of training rounds is set to 1000, and the first preset training proportion is set to 15%-20%, so the first-stage number of training rounds is 150 to 200. More specifically, when the mode combining Dreambooth training and Textual Inversion training is adopted, each training round trains the text encoder and the U-Net network in the initial Stable Diffusion model simultaneously; after 150 to 200 rounds of training of the text encoder and the U-Net network in the initial Stable Diffusion model are completed, the model training for the first-stage number of rounds can be considered complete, and the first-stage trained Stable Diffusion model is obtained.
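Continuing the component-loading sketch above, the shape of one such first-stage training round is illustrated below: both the text encoder and the U-Net receive gradients, and the number of first-stage rounds is derived from the total rounds and the first preset proportion. The data loader, learning rate and the 15% proportion are assumptions, and the Dreambooth regularization samples and Textual Inversion token handling relied on by the scheme are omitted.

# Illustrative sketch of first-stage joint training: the text encoder and the
# U-Net are both trainable. The loader, learning rate and proportions are
# assumptions; Dreambooth class/regularization images and the Textual Inversion
# placeholder token are not shown.
import torch
import torch.nn.functional as F

TOTAL_ROUNDS = 1000
FIRST_RATIO = 0.15                               # assumed first preset training proportion
first_rounds = int(TOTAL_ROUNDS * FIRST_RATIO)   # 150 rounds in this example

optimizer = torch.optim.AdamW(
    list(unet.parameters()) + list(text_encoder.parameters()), lr=1e-6)

def train_round(batch):
    """One step on a batch with keys "pixel_values" and "input_ids" (assumed layout)."""
    latents = vae.encode(batch["pixel_values"]).latent_dist.sample() * 0.18215
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    text_states = text_encoder(batch["input_ids"])[0]
    pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_states).sample
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

vae.requires_grad_(False)                        # the VAE is never fine-tuned
text_encoder.train()
unet.train()
for round_idx in range(first_rounds):            # first-stage rounds only
    for batch in train_dataloader:               # train_dataloader is assumed
        train_round(batch)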
S150, determining a second-stage number of training rounds according to the total number of training rounds and a second preset training proportion, and performing model training for the second-stage number of training rounds on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set, to obtain a second-stage trained Stable Diffusion model.
In this embodiment, since the U-Net network was not sufficiently trained in the preceding first-stage training, the second-stage number of training rounds may be determined according to the total number of training rounds (e.g., the total number of training rounds equals 1000) and the second preset training proportion (e.g., the second preset training proportion is set to 1 minus the first preset training proportion), and model training for the second-stage number of training rounds may be performed on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set, to obtain the second-stage trained Stable Diffusion model.
When the second-stage training is performed on the first-stage trained Stable Diffusion model, the U-Net network is trained alone, i.e., the text encoder is frozen. Through the second-stage training of the first-stage trained Stable Diffusion model, the U-Net network can be trained more fully, and the second-stage trained Stable Diffusion model is obtained.
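Continuing the same sketch, the second stage can be read as freezing the text encoder and updating only the U-Net for the remaining rounds; the split of rounds follows the example values above and remains an illustrative assumption.

# Illustrative sketch of second-stage training: the text encoder is frozen and
# only the U-Net keeps receiving gradients (continues the sketch above).
second_rounds = TOTAL_ROUNDS - first_rounds      # e.g. 850 of the 1000 rounds

text_encoder.requires_grad_(False)
text_encoder.eval()
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-6)  # U-Net parameters only

for round_idx in range(second_rounds):
    for batch in train_dataloader:               # same assumed data loader
        train_round(batch)                       # gradients now reach only the U-Net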
In an embodiment, as shown in fig. 5, as a first embodiment of step S150, step S150 includes:
s151a, acquiring training samples of the training sample set based on a first preset extraction proportion so as to form a first training sample set;
S152a, acquiring the character name keywords included in the image description texts of the first training sample set, and performing model training for the second-stage number of training rounds on the U-Net network in the first-stage trained Stable Diffusion model with the character name keywords, to obtain the second-stage trained Stable Diffusion model.
In this embodiment, when the second-stage training is performed on the first-stage trained Stable Diffusion model, the first training sample set may be selected from the training sample set based on the first preset extraction proportion. Then, model training for the second-stage number of training rounds is performed on the U-Net network in the first-stage trained Stable Diffusion model with the character name keywords included in the image description text of each training sample in the first training sample set, to obtain the second-stage trained Stable Diffusion model, thereby completing the remaining training rounds for the U-Net network in the first-stage trained Stable Diffusion model. It can be seen that this training process freezes the text encoder instead of training it, and trains only the U-Net network in the first-stage trained Stable Diffusion model.
In an embodiment, as shown in fig. 6, as a second embodiment of step S150, step S150 includes:
s151b, acquiring training samples of the training sample set based on a second preset extraction ratio to form a second training sample set;
S152b, acquiring the preprocessed images and the image description texts included in the second training sample set, and performing model training for the second-stage number of training rounds on the U-Net network in the first-stage trained Stable Diffusion model, to obtain the second-stage trained Stable Diffusion model.
In this embodiment, similarly to the first embodiment of step S150, when the second-stage training is performed on the first-stage trained Stable Diffusion model, the second training sample set may be selected from the training sample set based on the second preset extraction proportion. Then, model training for the second-stage number of training rounds is performed on the U-Net network in the first-stage trained Stable Diffusion model with the preprocessed images and image description texts included in each training sample of the second training sample set, to obtain the second-stage trained Stable Diffusion model, thereby completing the remaining training rounds for the U-Net network in the first-stage trained Stable Diffusion model. It can be seen that this training process also freezes the text encoder instead of training it, and trains only the U-Net network in the first-stage trained Stable Diffusion model.
Of course, in a specific implementation, step S150 may also combine the training manners of the first embodiment and the second embodiment, that is, the first embodiment may serve as the first sub-stage of the second-stage training and the second embodiment as the second sub-stage. In either of the above modes, the second-stage training of the first-stage trained Stable Diffusion model can be achieved, and the second-stage trained Stable Diffusion model is obtained.
S160, if it is determined that the average loss value and the loss standard deviation obtained after the training sample set is input to the second-stage trained Stable Diffusion model meet a preset loss condition, taking the second-stage trained Stable Diffusion model as a target Stable Diffusion model and using it for generating an AIGC magic head portrait.
In this embodiment, after the first-stage training and the second-stage training of the initial Stable Diffusion model are completed, it is further necessary to determine whether the average loss value and the loss standard deviation obtained after the training sample set is input to the second-stage trained Stable Diffusion model meet the preset loss condition. If it is determined that they meet the preset loss condition, training can be stopped. At this point, the second-stage trained Stable Diffusion model is taken as the target Stable Diffusion model and used for generating AIGC magic head portraits.
In one embodiment, step S160 includes:
and if it is determined that the average loss value and the loss standard deviation obtained after the training sample set is input to the second-stage trained Stable Diffusion model meet the preset loss condition, taking the second-stage trained Stable Diffusion model with the minimum average loss value and the minimum loss standard deviation as the target Stable Diffusion model.
In this embodiment, more specifically, when it is determined that the average loss value and the loss standard deviation obtained after the training sample set is input to the second-stage trained Stable Diffusion model satisfy the preset loss condition, the second-stage trained Stable Diffusion model with the minimum average loss value and the minimum loss standard deviation is selected as the target Stable Diffusion model. The target Stable Diffusion model obtained through this screening generates AIGC magic head portraits that better fit the user's requirements.
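One way to picture this selection rule is sketched below: periodically saved second-stage checkpoints are scored by the mean and standard deviation of the per-sample loss over the training sample set, and the checkpoint with the smallest values is kept. The thresholds, the sum-based ranking and the evaluate_losses helper are assumptions introduced only for illustration.

# Illustrative sketch: score saved second-stage checkpoints by average loss and
# loss standard deviation over the training sample set, then keep the best one.
# evaluate_losses(), saved_checkpoints and the thresholds are assumptions.
import statistics

LOSS_MEAN_MAX, LOSS_STD_MAX = 0.15, 0.05         # assumed preset loss condition

def score(checkpoint):
    losses = evaluate_losses(checkpoint, train_dataloader)  # per-sample losses (assumed helper)
    return statistics.mean(losses), statistics.stdev(losses)

candidates = []
for ckpt in saved_checkpoints:                   # checkpoints saved during stage two (assumed)
    mean, std = score(ckpt)
    if mean <= LOSS_MEAN_MAX and std <= LOSS_STD_MAX:    # preset loss condition met
        candidates.append((mean + std, ckpt))

# The checkpoint with the smallest combined mean and standard deviation wins.
target_model = min(candidates, key=lambda c: c[0])[1] if candidates else None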
Therefore, the embodiment of the method can perform multi-round joint training on the initial Stable Diffusion model based on the user images currently collected by the user and the stored training sample set, fully train the text encoder and the U-Net network in the model, and improve the prompt similarity and the face similarity of the generated images.
Fig. 7 is a schematic block diagram of an AIGC magic head portrait generating device based on joint training according to an embodiment of the present invention. As shown in fig. 7, corresponding to the above AIGC magic head portrait generating method based on joint training, the present invention also provides an AIGC magic head portrait generating apparatus 100 based on joint training. The AIGC magic head portrait generating device based on joint training comprises: an image preprocessing unit 110, an image description text acquisition unit 120, a training sample set updating unit 130, a first-stage training unit 140, a second-stage training unit 150, and a target model acquisition unit 160.
The image preprocessing unit 110 is configured to respond to a magic head portrait generation model training instruction, and to obtain a plurality of user images corresponding to the magic head portrait generation model training instruction and a preprocessed image set obtained by correspondingly preprocessing the user images.
In this embodiment, referring to fig. 1, the technical scheme is described with the server 10 as the execution subject. Because the initial Stable Diffusion model stored in the server 10 has not been trained with the user's own head images, when the user needs the initial Stable Diffusion model in the server 10 to generate magic head portraits, the user may first operate the user interaction interface of the user terminal 20, for example, click a model training virtual button on the user interaction interface, to trigger the generation of the magic head portrait generation model training instruction. Of course, in order to make the training of the initial Stable Diffusion model in the server relate to the user, after the user collects a plurality of user images containing the user's head portrait with the user terminal 20, the user terminal 20 may upload the plurality of user images to the server 10, and the server 10 preprocesses the plurality of user images to obtain the preprocessed image set.
In an embodiment, the image preprocessing unit 110 is configured to:
acquiring a plurality of user images corresponding to the magic head portrait generation model training instructions, and forming an initial user image set;
performing image clipping processing meeting a first preset clipping condition on each initial user image in the initial user image set to obtain a preprocessed image corresponding to each initial user image, the preprocessed images forming the preprocessed image set; the first preset clipping condition is that the proportion of the face region in the clipped preprocessed image, relative to the whole preprocessed image, lies within a first preset proportion interval.
In this embodiment, after the server receives the plurality of user images uploaded by the user terminal, the plurality of user images may be sequentially numbered based on the acquisition time of each user image (e.g., sorted and numbered in ascending order of acquisition time) to form the initial user image set. For example, the user sequentially collects N1 user images (N1 is a positive integer) with the user terminal, and these compose the initial user image set. The clipping process is described by taking one initial user image in the initial user image set as an example, say the initial user image numbered 3. When it is initially judged that the proportion of the face region in the initial user image numbered 3, relative to the whole initial user image, is not within the first preset proportion interval (for example, the first preset proportion interval is set to 20%-30%), it may be further judged whether the proportion is smaller than the lower limit of the first preset proportion interval or greater than its upper limit. When the proportion of the face region in the initial user image numbered 3 is smaller than the lower limit of the first preset proportion interval, this indicates that the face region occupies too small a share of the initial user image; the margin regions of the initial user image may then be cropped to enlarge the proportion of the face region, until the proportion of the face region in the preprocessed image falls within the first preset proportion interval. When the proportion of the face region in the initial user image numbered 3 is greater than the upper limit of the first preset proportion interval, the face region occupies too large a share, and the margin regions of the initial user image may be enlarged to reduce the proportion of the face region, until the proportion of the face region in the preprocessed image falls within the first preset proportion interval. Based on this preprocessing, the plurality of user images uploaded by the user terminal can be processed into a preprocessed image set that is convenient for subsequent extraction of the face region.
An image description text obtaining unit 120 is configured to obtain an image description text set corresponding to the preprocessed image set.
In this embodiment, the preprocessed image set obtained by the server is by itself insufficient to form a training set with enough information dimensions; at this point, an automatically obtained image description text may be associated with each preprocessed image in the preprocessed image set.
In an embodiment, the image description text obtaining unit 120 is configured to:
and acquiring the image description text corresponding to each preprocessed image in the preprocessed image set based on a pre-trained bootstrapping multi-modal model for unified understanding and generation.
In this embodiment, a bootstrapping multi-modal model for unified understanding and generation, namely a BLIP model, is pre-stored and pre-trained in the server, and the image description text corresponding to each preprocessed image in the preprocessed image set can be automatically acquired through the BLIP model. The acquisition of the image description text is described below taking the preprocessed image corresponding to the initial user image numbered 3 as an example. For example, the preprocessed image corresponding to the initial user image numbered 3 shows a boy wearing a Spider-Man suit, and the image description text obtained for this image based on the BLIP model is, for example, 'a boy, male, alone, lifelike, Spider-Man suit, looking at the camera, brown hair, super hero, upper body, spider-web print, mouth closed'. Each preprocessed image in the preprocessed image set is handled by the above process, and the image description text corresponding to each preprocessed image is obtained based on the bootstrapping multi-modal model for unified understanding and generation. Therefore, the image description text corresponding to each preprocessed image can be automatically acquired by this process, without manual intervention.
A training sample set updating unit 130, configured to obtain a pre-stored training sample set, and add the preprocessed image set and the image description text set to the training sample set to update the training sample set.
In this embodiment, since the training sample set pre-stored in the server does not include the user images and description texts corresponding to the user terminal, in order to make the model training more targeted, the preprocessed image set and the image description text set may be added to the training sample set to update the training sample set. Adding the user's images and description texts to the training sample set introduces the user's own influencing factors into the training sample set.
The first-stage training unit 140 is configured to determine a first-stage number of training rounds according to a preset total number of training rounds and a first preset training proportion, and to perform model training for the first-stage number of training rounds on the text encoder and the U-Net network in the initial Stable Diffusion model based on the training sample set, to obtain a first-stage trained Stable Diffusion model.
In this embodiment, in order to understand the model structure of the Stable Diffusion model more clearly, a detailed description is given below with reference to fig. 4. As shown in fig. 4, the Stable Diffusion model includes at least a Text Encoder, a U-Net network and a variational autoencoder (i.e., the VAE in fig. 4). The text encoder plays the role of a text understanding module in the Stable Diffusion model and converts the text input to the Stable Diffusion model into vectors that the model can understand and interpret. The U-Net network can be regarded as a semantic segmentation model; it is the module that progressively denoises the text-conditioned information into image composition information. The variational autoencoder converts the image composition information output by the U-Net network into an image, thereby realizing image generation.
During the first-stage training of the initial Stable Diffusion model, the text encoder and the U-Net network are trained simultaneously, i.e., the text encoder is not frozen. The first-stage training of the initial Stable Diffusion model allows both the text encoder and the U-Net network to be trained, but the training of the U-Net network does not yet reach the required level at this point, and training continues in the next stage.
In an embodiment, the first training unit 140 is configured to:
and based on the training sample set, performing model training for at least the first-stage number of training rounds on the text encoder and the U-Net network in the initial Stable Diffusion model in a mode combining Dreambooth training and Textual Inversion training, to obtain the first-stage trained Stable Diffusion model.
In this embodiment, during the first-stage training of the initial Stable Diffusion model, the server adopts a mode combining Dreambooth training and Textual Inversion training. For example, the total number of training rounds is set to 1000, and the first preset training proportion is set to 15%-20%, so the first-stage number of training rounds is 150 to 200. More specifically, when the mode combining Dreambooth training and Textual Inversion training is adopted, each training round trains the text encoder and the U-Net network in the initial Stable Diffusion model simultaneously; after 150 to 200 rounds of training of the text encoder and the U-Net network in the initial Stable Diffusion model are completed, the model training for the first-stage number of rounds can be considered complete, and the first-stage trained Stable Diffusion model is obtained.
The second-stage training unit 150 is configured to determine a second-stage number of training rounds according to the total number of training rounds and a second preset training proportion, and to perform model training for the second-stage number of training rounds on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set, to obtain a second-stage trained Stable Diffusion model.
In this embodiment, since the U-Net network was not sufficiently trained in the preceding first-stage training, the second-stage number of training rounds may be determined according to the total number of training rounds (e.g., the total number of training rounds equals 1000) and the second preset training proportion (e.g., the second preset training proportion is set to 1 minus the first preset training proportion), and model training for the second-stage number of training rounds may be performed on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set, to obtain the second-stage trained Stable Diffusion model.
When the second-stage training is performed on the first-stage trained Stable Diffusion model, the U-Net network is trained alone, i.e., the text encoder is frozen. Through the second-stage training of the first-stage trained Stable Diffusion model, the U-Net network can be trained more fully, and the second-stage trained Stable Diffusion model is obtained.
In an embodiment, as a first embodiment of the second stage training unit 150, the second stage training unit 150 is configured to:
acquiring training samples of the training sample set based on a first preset extraction ratio to form a first training sample set;
and acquiring the character name keywords included in the image description texts of the first training sample set, and performing model training for the second-stage number of training rounds on the U-Net network in the first-stage trained Stable Diffusion model with the character name keywords, to obtain the second-stage trained Stable Diffusion model.
In this embodiment, when the second-stage training is performed on the first-stage trained Stable Diffusion model, the first training sample set may be selected from the training sample set based on the first preset extraction proportion. Then, model training for the second-stage number of training rounds is performed on the U-Net network in the first-stage trained Stable Diffusion model with the character name keywords included in the image description text of each training sample in the first training sample set, to obtain the second-stage trained Stable Diffusion model, thereby completing the remaining training rounds for the U-Net network in the first-stage trained Stable Diffusion model. It can be seen that this training process freezes the text encoder instead of training it, and trains only the U-Net network in the first-stage trained Stable Diffusion model.
In an embodiment, as a second embodiment of the second stage training unit 150, the second stage training unit 150 is configured to:
acquiring training samples of the training sample set based on a second preset extraction ratio to form a second training sample set;
and acquiring the preprocessed images and the image description texts included in the second training sample set, and performing model training for the second-stage number of training rounds on the U-Net network in the first-stage trained Stable Diffusion model, to obtain the second-stage trained Stable Diffusion model.
In this embodiment, similarly to the first embodiment of the second-stage training unit 150, when the second-stage training is performed on the first-stage trained Stable Diffusion model, the second training sample set may be selected from the training sample set based on the second preset extraction proportion. Then, model training for the second-stage number of training rounds is performed on the U-Net network in the first-stage trained Stable Diffusion model with the preprocessed images and image description texts included in each training sample of the second training sample set, to obtain the second-stage trained Stable Diffusion model, thereby completing the remaining training rounds for the U-Net network in the first-stage trained Stable Diffusion model. It can be seen that this training process also freezes the text encoder instead of training it, and trains only the U-Net network in the first-stage trained Stable Diffusion model.
Of course, in a specific implementation, the second-stage training unit 150 may also combine the training manners of the first embodiment and the second embodiment, that is, the first embodiment may serve as the first sub-stage of the second-stage training and the second embodiment as the second sub-stage. In either of the above modes, the second-stage training of the first-stage trained Stable Diffusion model can be achieved, and the second-stage trained Stable Diffusion model is obtained.
The target model acquisition unit 160 is configured to, if it is determined that the average loss value and the loss standard deviation obtained after the training sample set is input to the second-stage trained Stable Diffusion model meet a preset loss condition, take the second-stage trained Stable Diffusion model as a target Stable Diffusion model and use it for generating an AIGC magic head portrait.
In this embodiment, after the first-stage training and the second-stage training of the initial Stable Diffusion model are completed, it is further necessary to determine whether the average loss value and the loss standard deviation obtained after the training sample set is input to the second-stage trained Stable Diffusion model meet the preset loss condition. If it is determined that they meet the preset loss condition, training can be stopped. At this point, the second-stage trained Stable Diffusion model is taken as the target Stable Diffusion model and used for generating AIGC magic head portraits.
In an embodiment, the target model acquisition unit 160 is configured to:
and if it is determined that the average loss value and the loss standard deviation obtained after the training sample set is input to the second-stage trained Stable Diffusion model meet the preset loss condition, taking the second-stage trained Stable Diffusion model with the minimum average loss value and the minimum loss standard deviation as the target Stable Diffusion model.
In this embodiment, more specifically, when it is determined that the average loss value and the loss standard deviation obtained after the training sample set is input to the second-stage trained Stable Diffusion model satisfy the preset loss condition, the second-stage trained Stable Diffusion model with the minimum average loss value and the minimum loss standard deviation is selected as the target Stable Diffusion model. The target Stable Diffusion model obtained through this screening generates AIGC magic head portraits that better fit the user's requirements.
Therefore, this device embodiment can perform multi-round joint training on the initial Stable Diffusion model based on the user images currently collected from the user and the stored training sample set, fully train the text encoder and the U-Net network in the model, and improve the prompt similarity and the face similarity of the generated images.
The above-described AIGC magic head portrait generation apparatus based on joint training may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer equipment integrates any AIGC magic head portrait generating device based on joint training provided by the embodiment of the invention.
With reference to fig. 8, the computer device includes a processor 402, a memory, and a network interface 405, which are connected by a system bus 401, wherein the memory may include a storage medium 403 and an internal memory 404.
The storage medium 403 may store an operating system 4031 and a computer program 4032. The computer program 4032 includes program instructions that, when executed, cause the processor 402 to perform the above-described AIGC magic head portrait generation method based on joint training.
The processor 402 is used to provide computing and control capabilities to support the operation of the overall computer device.
The internal memory 404 provides an environment for the execution of a computer program 4032 in the storage medium 403, which computer program 4032, when executed by the processor 402, causes the processor 402 to perform the above-described joint training based AIGC magic head portrait generating method.
The network interface 405 is used for network communication with other devices. It will be appreciated by those skilled in the art that the structure shown in fig. 8 is merely a block diagram of part of the structure related to the solution of the present invention and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine some of the components, or have a different arrangement of components.
Wherein the processor 402 is configured to execute the computer program 4032 stored in the memory to implement the AIGC magic head portrait generating method based on the joint training as described above.
It should be appreciated that in embodiments of the present invention, the processor 402 may be a central processing unit (Central Processing Unit, CPU); the processor 402 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Those skilled in the art will appreciate that all or part of the flow of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program comprises program instructions and may be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the steps of the above method embodiments.
Accordingly, the present invention also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, wherein the computer program includes program instructions. The program instructions, when executed by a processor, cause the processor to perform the AIGC magic head portrait generation method based on joint training as described above.
The computer-readable storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disk, or any other medium that can store program code.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented in hardware or software depends on the particular application and design constraints of the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and there may be other ways of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit may be stored in a storage medium if implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. An AIGC magic head portrait generation method based on joint training, characterized by comprising the following steps:
responding to a magic head portrait generation model training instruction, and acquiring a plurality of user images corresponding to the magic head portrait generation model training instruction and a preprocessed image set obtained by corresponding preprocessing;
acquiring an image description text set corresponding to the preprocessed image set;
acquiring a pre-stored training sample set, and adding the pre-processed image set and the image description text set to the training sample set to update the training sample set;
determining a first training round number according to a preset total training round number and a first preset training proportion, and performing model training of the first training round number on a text encoder and a U-Net network in an initial Stable Diffusion model based on the training sample set to obtain a first-stage trained Stable Diffusion model;
determining a second training round number according to the total training round number and a second preset training proportion, and performing model training of the second training round number on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set to obtain a second-stage trained Stable Diffusion model;
and if it is determined that the average loss value and the loss standard deviation of the training sample set input to the second-stage trained Stable Diffusion model meet a preset loss condition, taking the second-stage trained Stable Diffusion model as a target Stable Diffusion model and using the target Stable Diffusion model for generating an AIGC magic head portrait.
2. The method of claim 1, wherein the acquiring a plurality of user images corresponding to the magic head portrait generation model training instruction and a preprocessed image set obtained by corresponding preprocessing comprises:
acquiring a plurality of user images corresponding to the magic head portrait generation model training instruction to form an initial user image set;
and performing, on each initial user image in the initial user image set, image clipping processing that meets a first preset clipping condition to obtain a preprocessed image corresponding to each initial user image, the preprocessed images forming the preprocessed image set; wherein the first preset clipping condition is that the occupancy ratio of the face area in the clipped preprocessed image, relative to the whole preprocessed image, falls within a first preset occupancy ratio interval.
3. The method of claim 1, wherein the acquiring the set of image description text corresponding to the set of preprocessed images comprises:
and acquiring, based on a pre-trained bootstrapping multi-modal model for unified understanding and generation, an image description text corresponding to each preprocessed image in the preprocessed image set.
4. The method according to claim 1, wherein the performing model training of the first training round number on the text encoder and the U-Net network in the initial Stable Diffusion model based on the training sample set to obtain the first-stage trained Stable Diffusion model comprises:
and performing, based on the training sample set, model training of at least the first training round number on the text encoder and the U-Net network in the initial Stable Diffusion model in a mode combining both DreamBooth training and Textual Inversion training, to obtain the first-stage trained Stable Diffusion model.
5. The method of claim 1, wherein the performing model training of the second training round number on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set to obtain the second-stage trained Stable Diffusion model comprises:
acquiring training samples from the training sample set based on a first preset extraction ratio to form a first training sample set;
and acquiring character name keywords included in the image description texts of the first training sample set, and performing model training of the second training round number on the U-Net network in the first-stage trained Stable Diffusion model by using the character name keywords to obtain the second-stage trained Stable Diffusion model.
6. The method of claim 1, wherein the performing model training of the second training round number on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set to obtain the second-stage trained Stable Diffusion model comprises:
acquiring training samples from the training sample set based on a second preset extraction ratio to form a second training sample set;
and acquiring the preprocessed images and the image description texts included in the second training sample set, and performing model training of the second training round number on the U-Net network in the first-stage trained Stable Diffusion model to obtain the second-stage trained Stable Diffusion model.
7. The method of claim 1, wherein the taking the second-stage trained Stable Diffusion model as the target Stable Diffusion model if it is determined that the average loss value and the loss standard deviation of the training sample set input to the second-stage trained Stable Diffusion model meet the preset loss condition comprises:
and if it is determined that the average loss value and the loss standard deviation of the training sample set input to the second-stage trained Stable Diffusion model meet the preset loss condition, taking the second-stage trained Stable Diffusion model with the minimum average loss value and the minimum loss standard deviation as the target Stable Diffusion model.
8. An AIGC magic head portrait generating device based on joint training, characterized by comprising:
the image preprocessing unit is used for responding to the magic head portrait generation model training instruction, acquiring a plurality of user images corresponding to the magic head portrait generation model training instruction and a preprocessed image set obtained by corresponding preprocessing;
an image description text acquisition unit for acquiring an image description text set corresponding to the preprocessed image set;
a training sample set updating unit, configured to acquire a pre-stored training sample set, and add the preprocessed image set and the image description text set to the training sample set to update the training sample set;
the first-stage training unit is used for determining the first training round number according to the preset total training round number and the first preset training proportion, and performing model training of the first training round number on the text encoder and the U-Net network in the initial Stable Diffusion model based on the training sample set to obtain the first-stage trained Stable Diffusion model;
the second-stage training unit is used for determining the second training round number according to the total training round number and the second preset training proportion, and performing model training of the second training round number on the U-Net network in the first-stage trained Stable Diffusion model based on the training sample set to obtain the second-stage trained Stable Diffusion model;
and the target model acquisition unit is used for, if it is determined that the average loss value and the loss standard deviation of the training sample set input to the second-stage trained Stable Diffusion model meet a preset loss condition, taking the second-stage trained Stable Diffusion model as the target Stable Diffusion model and using it for generating an AIGC magic head portrait.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the AIGC magic head portrait generation method based on joint training of any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, implement the AIGC magic head portrait generation method based on joint training of any one of claims 1 to 7.
CN202311794176.4A 2023-12-25 2023-12-25 AIGC magic head portrait generation method, device and equipment based on joint training Active CN117456039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311794176.4A CN117456039B (en) 2023-12-25 2023-12-25 AIGC magic head portrait generation method, device and equipment based on joint training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311794176.4A CN117456039B (en) 2023-12-25 2023-12-25 AIGC magic head portrait generation method, device and equipment based on joint training

Publications (2)

Publication Number Publication Date
CN117456039A true CN117456039A (en) 2024-01-26
CN117456039B CN117456039B (en) 2024-02-27

Family

ID=89580351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311794176.4A Active CN117456039B (en) 2023-12-25 2023-12-25 AIGC magic head portrait generation method, device and equipment based on joint training

Country Status (1)

Country Link
CN (1) CN117456039B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230214450A1 (en) * 2021-12-31 2023-07-06 Dell Products L.P. Method, electronic device, and computer program product for training model
CN114944002A (en) * 2022-06-16 2022-08-26 中国科学技术大学 Text description assisted gesture perception facial expression recognition method
US20230325975A1 (en) * 2023-04-25 2023-10-12 Lemon Inc. Augmentation and layer freezing for neural network model training
CN116862902A (en) * 2023-07-31 2023-10-10 厦门微图软件科技有限公司 Method for generating defects based on DreamBooth fine-tuning Stable Diffusion model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
R. W. ALEXANDROWICZ: "The diffusion model visualizer: an interactive tool to understand the diffusion model parameters", Psychological Research, 25 October 2018, page 1157, XP037143639, DOI: 10.1007/s00426-018-1112-6 *
WANG Yi et al.: "Robust anisotropic diffusion model based on time variation" (基于时间变化的鲁棒各向异性扩散模型), Acta Automatica Sinica (自动化学报), no. 09, 15 September 2009, pages 1253-1256 *

Also Published As

Publication number Publication date
CN117456039B (en) 2024-02-27

Similar Documents

Publication Publication Date Title
KR102296906B1 (en) Virtual character generation from image or video data
CN106843463A (en) A kind of interactive output intent for robot
CN110969137A (en) Household image description generation method, device and system and storage medium
Li The Impact of Artificial Intelligence Painting on Contemporary Art From Disco Diffusion's Painting Creation Experiment
CN117456039B (en) AIGC magic head portrait generation method, device and equipment based on joint training
US11809688B1 (en) Interactive prompting system for multimodal personalized content generation
CN117115303A (en) Method, system, computing device and storage medium for content generation
KR102049359B1 (en) Method and system for providing intelligent 3D character model based on cloud search engine
Ortikov The importance of the artist’s work in cinema art
Wu et al. The Art of Artificial Intelligent Generated Content for Mobile Photography
CN111866406A (en) Method for manufacturing AR photo album for children
Manu Development Of Instagram Filter Using Spark AR In An Effort To Preserve Kupang Malay Language
Miner Biased Render: Indigenous Algorithmic Embodiment in 3D Worlds
Männig The Tableau Vivant and Social Media Culture
CN117244239A (en) Interface image processing method, device, equipment and storage medium
Weiner Frankenfilm: Classical Monstrosity in Bill Morrison’s Spark of Being
Elitaş Presentation of Visual Culture Elements in Digital Environments With Special Effect Technologies
Buffkin Campbell & The Cryptid: Mindfulness and Mediality
Wei Media Review: Modern Faces of Taiwanese Gezai xi Opera: Two Plays by Sunhope Taiwanese Opera Troupe
Chen et al. Interdisciplinary praxis in Interactive Visual and Dance: A Case Study of Nu Shu GPS
Rosenthal Nine Easy Dances
Surman CGI animation: Pseudorealism, perception and possible worlds
Wise From Minstrel to Instagram: The Power of Blackface Make-Up in Performative Spaces
Summers et al. Parody, Pastiche and the Patchwork World: DreamWorks and Genre
Gyanchandani StyledGIF: Create Fun GIFs in Seconds

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant