CN117372587A - Digital person generation method and device, and small sample model training method and device


Info

Publication number
CN117372587A
Authority
CN
China
Prior art keywords
image
model
small sample
face
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311330431.XA
Other languages
Chinese (zh)
Inventor
王甜甜
张宁
赵以诚
刘佳颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311330431.XA priority Critical patent/CN117372587A/en
Publication of CN117372587A publication Critical patent/CN117372587A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure provides a digital person generation method and apparatus, relating to the technical field of artificial intelligence, and in particular to the technical fields of natural language processing, computer vision, deep learning, and the like. A specific implementation scheme is as follows: acquiring an input image and a style template input by a user; cropping the face region of the input image to obtain a face image; obtaining, based on the face image, a small sample model and an intermediate image corresponding to the small sample model; and obtaining, based on the intermediate image and the style template, a digital person image of the corresponding digital person. This embodiment improves the efficiency of digital person generation.

Description

Digital person generation method and device, and small sample model training method and device
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the technical fields of natural language processing, computer vision, deep learning, and the like, and more particularly, to a digital person generation method and apparatus, a small sample model training method and apparatus, an electronic device, a computer readable medium, and a computer program product.
Background
With the advance of AI (Artificial Intelligence) photo technology, users are no longer satisfied with portrait photos taken by cameras or cell phones, and more and more users are trying to create online personal digital personas.
In the conventional technology, a face image is generally mapped onto a fixed template based on a conventional mathematical model or a simple machine learning model to obtain a digital person. This approach struggles to extract high-level features from the original image, which limits the accuracy and robustness of the algorithm.
Disclosure of Invention
A digital person generation method and apparatus, a small sample model training method and apparatus, an electronic device, a computer readable storage medium, and a computer program product are provided.
According to a first aspect, there is provided a digital person generation method, the method comprising: acquiring an input image and a style template input by a user; cropping a face region of the input image to obtain a face image; obtaining, based on the face image, a small sample model and an intermediate image corresponding to the small sample model, wherein the small sample model is used for characterizing the correspondence between the face image and the intermediate image; and obtaining, based on the intermediate image and the style template, a digital person image of the corresponding digital person.
According to a second aspect, there is provided a small sample model training method, the method comprising: obtaining a preset sample set, wherein the sample set comprises at least one sample, and a sample comprises a sample image and a sample text for the sample image; acquiring a pre-established digital person network, wherein the digital person network comprises a base model and a small sample network connected in parallel with the base model, the base model being used for characterizing the correspondence between texts and digital person images, and the small sample network being used for characterizing the correspondence between images and image features; and performing the following training steps: selecting a sample from the sample set; inputting the sample into the digital person network to obtain a digital person image output by the digital person network; and in response to the small sample network satisfying a training completion condition, obtaining a small sample model corresponding to the small sample network.
According to a third aspect, there is provided a digital person generating apparatus, the apparatus comprising: an information acquisition unit configured to acquire an input image and a style template input by a user; a cropping unit configured to crop a face region of the input image to obtain a face image; an image obtaining unit configured to obtain, based on the face image, a small sample model and an intermediate image corresponding to the small sample model, wherein the small sample model is used for characterizing the correspondence between the face image and the intermediate image; and a result obtaining unit configured to obtain, based on the intermediate image and the style template, a digital person image of the corresponding digital person.
According to a fourth aspect, there is provided a small sample model training apparatus, the apparatus comprising: a set acquisition unit configured to acquire a preset sample set including at least one sample, a sample including a sample image and a sample text for the sample image; a network acquisition unit configured to acquire a pre-established digital person network including a base model and a small sample network connected in parallel with the base model, the base model being used for characterizing the correspondence between texts and digital person images, and the small sample network being used for characterizing the correspondence between images and image features; a selecting unit configured to select a sample from the sample set; an input unit configured to input the sample into the digital person network to obtain a digital person image output by the digital person network; and a model obtaining unit configured to obtain a small sample model corresponding to the small sample network in response to the small sample network satisfying a training completion condition.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first or second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as described in any implementation of the first or second aspect.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first or second aspects.
The digital person generation method and device provided by the embodiments of the disclosure proceed as follows: first, an input image and a style template input by a user are acquired; second, the face region of the input image is cropped to obtain a face image; third, based on the face image, a small sample model and an intermediate image corresponding to the small sample model are obtained, wherein the small sample model is used for characterizing the correspondence between the face image and the intermediate image; and finally, a digital person image of the corresponding digital person is obtained based on the intermediate image and the style template. In this way, a small sample model is obtained from the face image input by the user, and the features of the input image can be effectively extracted based on the small sample model; fusing the intermediate image with the style template lets the generated digital person image fully present the features of the input image, improving the effect of digital person image generation.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of one embodiment of a digital person generation method according to the present disclosure;
FIG. 2 is a schematic diagram of a stable diffusion model in a digital person generation method according to the present disclosure;
FIG. 3 is a flow chart of one embodiment of a small sample model training method according to the present disclosure;
FIG. 4 is a schematic diagram of a structure of one embodiment of a digital person generating apparatus according to the present disclosure;
FIG. 5 is a schematic diagram of the structure of one embodiment of a small sample model training apparatus according to the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing a digital person generation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In this embodiment, "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature.
An online digital portrait preserves the realism and aesthetics of the user's face, while the photo background can be generated by AI, allowing varied styles. In the conventional technology, a face photo is mapped onto a fixed template based on a convolutional neural network or a conventional mathematical model to realize face fusion. Compared with digital person generation based on a diffusion model, traditional face fusion algorithms mainly have the following problems:
Weak feature extraction capability: traditional fusion algorithms are usually based on hand-crafted designs or simple machine learning models and struggle to extract high-level features from original images, which limits the accuracy and robustness of the algorithm.
Insufficient capacity for big data: when facing large-scale, high-dimensional data, traditional fusion algorithms often suffer from low computational efficiency and inaccurate feature extraction, and cannot effectively handle complex problems in a big-data environment.
Lack of automation and intelligence: traditional fusion algorithms lack automated and intelligent characteristics and often require manual intervention and parameter tuning, which increases the complexity and cost of the algorithm and limits its application range and development prospects.
Limited fusion capability for multi-source data: traditional fusion algorithms generally fuse a single type of data source and have difficulty handling fusion across multi-source data, which limits the scope of the algorithm in practical applications.
Poor real-time performance: traditional fusion algorithms have high computational complexity, consume a large amount of time and computing resources, perform poorly in real time, and can hardly meet the speed and efficiency requirements of practical applications.
The present disclosure provides a digital person generation method that processes an input image through a small sample model, improving the quality of the digital person image. FIG. 1 shows a flow 100 of one embodiment of the digital person generation method of the present disclosure, which includes the following steps:
step 101, obtaining an input image and a style template input by a user.
In the present embodiment, the input image is the image the user wishes to convert into a digital person image; for example, the input image may be a self-portrait of the user, or an image of another person provided by the user. The input image is provided voluntarily by the user and does not involve the user's privacy.
In this embodiment, the style template is an image layout with a digital person style, through which the style of the digital person can be determined; for example, the style template may be an image layout in cartoon form or in oil-painting form.
As shown in FIG. 2, through a user input operation, the execution subject on which the digital person generation method runs can obtain the input image and the style template at the same time; from the input image and style template input by the user, the style of digital person the user wishes to obtain and the image to be transformed can be determined.
Step 102, cropping the face region of the input image to obtain a face image.
In this embodiment, the input image is an image containing a face region, and the input image is cropped to remove the parts that do not belong to the face region, thereby obtaining the face image.
In this embodiment, the face image is the core region of the digital person image. The face image is fused with style images of different styles from the style template, so that the final digital person image is a composite that carries both the style of the style template and the face image.
And step 103, obtaining a small sample model and an intermediate image corresponding to the small sample model based on the face image.
In this embodiment, the small sample model is used to characterize the correspondence between the image features of the face image and the intermediate image, where the image features are the features the small sample model extracts from the face image, and the intermediate image is the image generated from those image features.
In this embodiment, the small sample model may be a model trained in real time on the face image. Since the small sample model is trained on the face image, it can extract the features of the face image, combine the extracted features, and obtain the intermediate image Z shown in FIG. 2 through a module that processes the features (for example, a diffusion model or a pre-trained neural network model).
Alternatively, the intermediate image may be obtained directly from the small sample model; in that case the small sample model may be a model trained on user images that share the same identity information as the face image, and through this small sample model an intermediate image of the user with the same identity as the face image can be obtained.
In this embodiment, step 103 includes: based on the face image, detecting whether an initial network (e.g., a neural network model) for extracting features of the user's face image has been provided in advance; and in response to detecting that the initial network is preset, training the initial network based on the face image, and taking the initial network as the small sample model when the initial network satisfies a training completion condition (for example, the number of training iterations reaches a preset number).
Optionally, when the intermediate image is directly obtained from the small sample model, the step 103 may further include: based on the face image, acquiring the identity of the face image; selecting an image of a user from a preset image set based on the identity; detecting whether a small sample model matched with the user image exists or not; in response to detecting that there is a small sample model that matches the user image, the face image is input to the small sample model, resulting in an intermediate image.
Optionally, when the intermediate image is obtained from the image features output by the small sample model, the small sample model may be a model obtained by training a fine-tuning framework of a stable diffusion model, and step 103 may further include: determining a stable diffusion model based on the face image; determining whether a trained fine-tuning framework matching the stable diffusion model is available; in response to a trained fine-tuning framework matching the stable diffusion model being available, obtaining the fine-tuning framework matching the face image as the small sample model; and inputting the face image into the stable diffusion model corresponding to the small sample model to obtain the intermediate image output by the stable diffusion model.
Step 104, obtaining a digital person image corresponding to the digital person based on the intermediate image and the style template.
In this embodiment, as shown in fig. 2, based on the style template and the intermediate image Z, the digital person image S corresponding to the digital person is obtained by face template fusion. Wherein the intermediate image Z is an image output by the small sample model.
In this embodiment, step 104 includes: removing the face region of the style template to obtain a style image with a blank face region; and synthesizing the style image and the intermediate image to obtain the digital person image of the corresponding digital person. Synthesizing the style image and the intermediate image includes: inputting the style image and the intermediate image into a pre-trained neural network model, and obtaining the digital person image of the corresponding digital person output by the neural network model.
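As a rough illustration of this fusion step, the sketch below blanks the template's face region and pastes the intermediate image into it. This is a toy Python example assuming a rectangular face region; the disclosure instead synthesizes the two images with a trained neural network that blends rather than pastes.

```python
from PIL import Image

def naive_face_template_fusion(style_template: Image.Image,
                               intermediate: Image.Image,
                               face_box: tuple) -> Image.Image:
    """Blank the style template's face region and paste the intermediate image.
    face_box is the (left, top, right, bottom) box of the template's face region."""
    result = style_template.copy()
    left, top, right, bottom = face_box
    resized = intermediate.resize((right - left, bottom - top))
    result.paste(resized, (left, top))  # a trained model would blend, not paste
    return result
```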
In this embodiment, the digital person image represents the digital person's appearance. Since there may be multiple input images, multiple digital person images can be produced; rendering the multiple digital person images into a video yields a digital person with an animated appearance, and similarly, performing three-dimensional modeling based on multiple digital person images yields a three-dimensional digital person.
Optionally, when the small sample model is a fine-tuning framework of a stable diffusion model, step 104 may further include: removing the face region of the style template to obtain a style image with a blank face region; and inputting the style image and the intermediate image into the stable diffusion model to obtain the digital person image of the corresponding digital person output by the stable diffusion model.
In this alternative implementation, the stable diffusion model is a deep learning model for image generation, characterized by the ability to gradually generate realistic high-resolution images. In this embodiment, a stable diffusion model is used to realize text-to-image generation under diffusion conditions, and the stable diffusion model includes an image-text matching model and a diffusion generation model. The image-text matching model consists of a text encoder and an image encoder based on the Transformer structure; it encodes the input text (related to the small sample model, triggered automatically according to rules and not exposed to the user) and the style image (related to the style template) separately, and computes image-text similarity from the encoded representations. The diffusion generation model adopts a UNet (U-shaped) network structure and gradually generates the final digital person image, predicting the next step from the intermediate feature-map representation of the current step. Throughout the diffusion process, feature fusion guides diffusion based on the similarity between the intermediate diffusion image and the input text, so that the diffusion model generates an image consistent with the text semantics. The stable diffusion model is trained by training the diffusion generation model and the image-text matching model separately on a large number of pre-collected text-image data pairs.
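For context only, this kind of text-conditioned diffusion generation can be exercised with the open-source diffusers library; the checkpoint name and prompt below are illustrative assumptions, not the disclosure's internal pipeline:

```python
import torch
from diffusers import StableDiffusionPipeline

# Example public checkpoint; any Stable Diffusion weights would do.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The text encoder embeds the prompt; the UNet denoises step by step,
# guided by cross-attention between image features and the text embedding.
image = pipe("a portrait of a digital person, oil painting style",
             num_inference_steps=30).images[0]
image.save("digital_person.png")
```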
This embodiment provides a digital person generation method that combines an image-text matching model and a diffusion generation model to realize text-to-image generation under diffusion conditions; a user can quickly generate a personal digital person image simply by uploading their own pictures (input images).
The digital person generation method provided by the embodiments of the disclosure thus proceeds as follows: first, an input image and a style template input by a user are acquired; second, the face region of the input image is cropped to obtain a face image; third, based on the face image, a small sample model and an intermediate image corresponding to the small sample model are obtained, wherein the small sample model is used for characterizing the correspondence between the face image and the intermediate image; and finally, a digital person image of the corresponding digital person is obtained based on the intermediate image and the style template. In this way, a small sample model is obtained from the face image input by the user, and the features of the input image can be effectively extracted based on the small sample model; fusing the intermediate image with the style template lets the generated digital person image fully present the features of the input image, improving the effect of digital person image generation.
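Purely as an orientation aid, the four steps above can be summarized in a Python skeleton; every helper here is a hypothetical placeholder for the modules discussed in this disclosure, not an actual implementation:

```python
from PIL import Image

def crop_face(image: Image.Image) -> Image.Image:
    """Step 102 placeholder: crop the face region (see the detector discussion below)."""
    raise NotImplementedError

def get_small_sample_model(face: Image.Image):
    """Step 103 placeholder: look up or train a small sample model for this face."""
    raise NotImplementedError

def fuse_with_template(intermediate: Image.Image,
                       template: Image.Image) -> Image.Image:
    """Step 104 placeholder: fuse the intermediate image Z with the style template."""
    raise NotImplementedError

def generate_digital_person(input_image: Image.Image,
                            style_template: Image.Image) -> Image.Image:
    face_image = crop_face(input_image)                      # step 102
    small_sample_model = get_small_sample_model(face_image)  # step 103
    intermediate = small_sample_model(face_image)            # intermediate image Z
    return fuse_with_template(intermediate, style_template)  # step 104
```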
In some embodiments of the present disclosure, obtaining the small sample model and the intermediate image corresponding to the small sample model based on the face image includes: determining, based on the face image, whether the model library contains a small sample model corresponding to the face image; and in response to determining that the model library contains a small sample model corresponding to the face image, obtaining the intermediate image based on the face image and the small sample model.
In this optional implementation, determining whether the model library contains a small sample model corresponding to the face image includes: obtaining a storage unit in the model library, where the storage unit comprises a stored image and a stored model; comparing the similarity between the face image and the stored image in the storage unit; and in response to the similarity between the face image and the stored image being greater than a similarity threshold, taking the stored model in the storage unit as the small sample model and determining that the model library contains a small sample model corresponding to the face image. The similarity threshold may be set according to model requirements; for example, it may be a value greater than 80%.
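A minimal sketch of this library lookup, assuming face images are compared via embeddings and each storage unit pairs a stored embedding with a stored model (the data layout and the 0.8 threshold are assumptions consistent with the text above):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.8  # e.g., "greater than 80%"

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_small_sample_model(face_embedding: np.ndarray, model_library: list):
    """Return the stored model of the first storage unit whose stored image
    is similar enough to the face image, or None if no unit matches."""
    for unit in model_library:
        if cosine_similarity(face_embedding, unit["stored_embedding"]) > SIMILARITY_THRESHOLD:
            return unit["stored_model"]
    return None  # no matching small sample model; one must be trained instead
```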
Optionally, in response to the similarity of the face image and the stored images in the respective storage units in the model library being less than or equal to a similarity threshold, it is determined that the model library does not have a small sample model corresponding to the face image.
Optionally, determining whether the model library has a small sample model corresponding to the face image includes: a storage unit in a model library is obtained, wherein the storage unit comprises a storage text and a storage model; converting the face image into an image text, wherein the image text is a text describing the input image and the format of the input image; comparing the similarity between the image text and the storage text in the storage unit; in response to the similarity of the image text and the stored text being greater than a similarity threshold, the stored model in the storage unit is taken as a small sample model, and it is determined that the model library has a small sample model corresponding to the face image.
In this optional implementation manner, the obtaining the intermediate image based on the face image and the small sample model includes: and inputting the face image into the small sample model to obtain an intermediate image corresponding to the small sample model.
In this optional implementation manner, the small sample model is a model for extracting features of the face image, and after the feature extraction is performed on the face image by the small sample model, an intermediate image which fully characterizes the features of the face image can be obtained.
Alternatively, the small sample model may also be a feature extraction model generated based on a face image, specifically, the face image or an image related to the face image is taken as a sample, and the feature extraction network is trained, so that the small sample model may be obtained.
Alternatively, the small sample model may be a model trained in conjunction with the diffusion model; for example, the small sample model is a fine-tuning framework of the diffusion model, and while the small sample model is trained, the diffusion model is frozen. After training is completed, the small sample model can fine-tune the diffusion model's output so that, on the basis of the diffusion model, the generated image becomes the intermediate image desired by the small sample model.
In this optional implementation manner, the intermediate image is an image obtained after feature learning is performed on the face image by the small sample model.
The method for outputting the intermediate image provided by this embodiment first determines, based on the face image, whether the model library contains a small sample model corresponding to the face image; when it does, that small sample model is used directly to output the intermediate image, which improves the convenience of obtaining the small sample model.
In some optional implementations of this embodiment, obtaining the small sample model and the intermediate image corresponding to the small sample model based on the face image further includes: in response to determining that the model library has no small sample model corresponding to the face image, training a fine-tuning framework of a stable diffusion model based on the face image to obtain the small sample model; and obtaining the intermediate image based on the face image and the small sample model. Obtaining the digital person image of the corresponding digital person based on the intermediate image and the style template then includes: cropping out the face region of the style template to obtain a style image; and inputting the style image and the intermediate image into the stable diffusion model to obtain the digital person image output by the stable diffusion model.
According to the digital person generation method provided by the embodiments of the disclosure, when the face image is obtained and no corresponding small sample model exists in the model library, the fine-tuning framework of the stable diffusion model is trained on the face image, so that the trained small sample model matches the face image closely, can extract the features of the face image, and yields the intermediate image corresponding to the face image. During training of the fine-tuning framework, the parameters of the parts of the stable diffusion model not related to the small sample model are frozen and do not participate in training.
In this optional implementation, the fine-tuning framework is a network for fine-tuning the stable diffusion model and is a module connected in parallel with the stable diffusion model. When the fine-tuning framework is trained, the parameters of the stable diffusion model outside the fine-tuning framework are frozen and do not participate in training; the fine-tuning framework is trained only on the features of the input image, extracting those features to obtain the intermediate image.
In this alternative implementation, the fine-tuning framework may be a LoRA (Low-Rank Adaptation of Large Language Models) model. The fine-tuning framework can be understood as a plug-in to the stable diffusion model that needs only a small amount of data to train; when a digital person image is generated, the LoRA model is used in combination with the stable diffusion model to adjust the image result output by the stable diffusion model.
In stable diffusion fine-tuning, the LoRA model can be applied to the cross-attention layers that relate image representations to the prompts describing them. Advantages of LoRA fine-tuning: faster training; lower compute requirements; and smaller trained weights, because the original model is frozen and new trainable layers are injected, so the weights of the new layers can be saved as a file of only tens of MB, nearly a thousand times smaller than the original UNet model.
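The low-rank idea can be sketched in a few lines of PyTorch (a toy illustration of the technique, not the disclosed framework): the frozen base weight is supplemented by a trainable rank-r product B·A, so only the small factors are trained and saved.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = base(x) + scale * x A^T B^T, with the base layer frozen."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only the low-rank factors A and B are trainable
```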
In this alternative implementation, when the digital person image is obtained, the style image and the intermediate image are input into the stable diffusion model, which fuses them to obtain the digital person image of the corresponding digital person. Through the stable diffusion model, the style of the style image and the features of the intermediate image can be fully combined, producing a vivid digital person image.
According to this digital person generation method, when the model library has no small sample model corresponding to the face image, the fine-tuning framework of the stable diffusion model is trained on the face image to obtain the small sample model. The small sample model can thus fully absorb the features of the face image and, combined with the parameters of the stable diffusion model, produce an intermediate image suited to the stable diffusion model. When the digital person image is obtained, the stable diffusion model synthesizes the digital person image of the corresponding digital person, improving the generation quality of the digital person image and ensuring the reliability of digital person generation.
Optionally, obtaining the small sample model and the intermediate image corresponding to the small sample model based on the face image includes: acquiring a small sample network; using the face images as samples for the small sample network, sequentially selecting sample images from them, and iteratively training the small sample network; in response to the small sample network satisfying the training completion condition, obtaining the small sample model; and selecting the sharpest of the input images and inputting it into the small sample model to obtain the intermediate image.
In some embodiments of the present disclosure, training the fine-tuning framework of the stable diffusion model based on the face image to obtain the small sample model includes: performing a set number of training iterations on the fine-tuning framework of the stable diffusion model based on the face image to obtain a plurality of model parameters; and determining the small sample model based on the face image, the model parameters, and the stable diffusion model.
In this embodiment, the set number of times may be determined by the training requirements; for example, the set number is 5 thousand. Across the set number of training iterations of the fine-tuning framework, the sample images input to it (there are multiple input images, and the sample images are selected from them) may be the same image or different images.
In this embodiment, each training iteration of the fine-tuning framework yields a set of model parameters; after the set number of iterations, the fine-tuning framework therefore has that many sets of model parameters. The model parameters are the parameters of the fine-tuning framework; the parameters of the stable diffusion model outside the fine-tuning framework are frozen during training and do not change.
In this embodiment, determining the small sample model based on the face image, the model parameters, and the stable diffusion model includes: determining the image text corresponding to the face image based on the face image; for the stable diffusion model under each set of model parameters, inputting the same image text into the stable diffusion model to obtain a plurality of result images; determining whether the result images meet a preset image quality requirement; and taking the fine-tuning framework corresponding to a result image that meets the image quality requirement as the small sample model under the stable diffusion model.
According to the digital person generation method provided by this alternative implementation, a plurality of model parameters are obtained by iterating the fine-tuning framework the set number of times; a small sample model is then obtained based on the model parameters and the stable diffusion model under each set of parameters, providing a reliable way to obtain the small sample model.
Optionally, after obtaining the small sample model, the method further comprises: inputting the face image into a small sample model to obtain an intermediate image corresponding to the small sample model; and detecting whether the intermediate image meets the image quality requirement, and if so, determining that the small sample model meets the requirement.
In some embodiments of the present disclosure, determining the small sample model based on the face image, the model parameters, and the stable diffusion model includes: determining a plurality of initial images based on the model parameters and the stable diffusion model; a small sample model is determined based on the initial image and the face image.
In this embodiment, determining the plurality of initial images based on the model parameters and the stable diffusion model includes: for the stable diffusion model under each set of model parameters, determining an initial image output by the stable diffusion model. Determining the small sample model based on the initial images and the face image includes: comparing each initial image with the face image, and taking the model parameters corresponding to the initial image that best matches the face image as the parameters of the small sample model.
According to the digital person generation method provided by the alternative implementation mode, a plurality of initial images are determined through model parameters and a stable diffusion model; based on the initial image and the face image, parameters of the small sample model are determined, and a reliable implementation mode is provided for determining the small sample model.
Images uploaded by different target users differ in portrait proportion, illumination, background, and so on, which makes training the small sample model unevenly difficult. Unlike the conventional approach of selecting model parameters from the same training iteration for all target digital person images, the present disclosure therefore performs automatic model selection during small sample model training to choose the parameters of the small sample model; as shown in FIG. 2, the best-performing small sample model can be obtained, for the same target user, from models at different training iterations.
In some optional implementations of the disclosure, determining the small sample model based on the initial images and the face image includes: computing the face similarity between each initial image and the face image to obtain a similarity value for each initial image; computing the average similarity for each training iteration from the number of initial images and their similarity values; and selecting the model parameters of the training iteration with the highest average similarity among all iterations as the parameters of the small sample model.
In this embodiment, there may be multiple input images, and the number of generated face images need not correspond one-to-one to the number of input images. For multiple face images and multiple initial images, the average similarity between the initial images and the face images is calculated, preferably against a frontal image: a frontal image is the face image among them in which the face is clearly frontal, the facial features are distinct, there is no occlusion, and the lighting is good. When there are multiple face images, calculating the face similarity between an initial image and the face images includes selecting a frontal image from the face images and calculating the similarity between the initial image and that frontal image.
The specific flow of this alternative implementation is as follows: first, the small sample model is trained for N (N > 1) iterations on all of the images input by the user; second, using the model parameters saved at each of the N iterations, text-to-image generation is performed with the stable diffusion model using a fixed random seed and fixed prompt words, obtaining K (K > 1) initial images O of the target digital person under each set of weights. Meanwhile, a frontal image T is obtained from the input images uploaded by the user for training the small sample model, and the face similarity S between each of the K initial images and the frontal image is computed:

$$S_{ij} = f(O_{ij}, T),$$

where $f(\cdot)$ denotes the face similarity algorithm, $i \in [1, N]$, $j \in [1, K]$. After the similarity between each initial image and the frontal image is obtained, the average similarity under each iteration is computed by averaging over its K initial images:

$$\bar{S}_i = \frac{1}{K} \sum_{j=1}^{K} S_{ij},$$

and the iteration with the highest average similarity is selected to provide the parameters of the optimal small sample model, thereby obtaining the optimal small sample model.
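A compact sketch of this selection step (NumPy; the similarity matrix is assumed to have already been filled in by the face similarity algorithm f):

```python
import numpy as np

def select_best_checkpoint(similarity: np.ndarray) -> int:
    """similarity[i, j] = S_ij between the j-th initial image generated under
    the i-th checkpoint and the frontal image T (i in [1, N], j in [1, K])."""
    avg = similarity.mean(axis=1)   # average similarity per training iteration
    return int(np.argmax(avg))      # index of the optimal small sample model

# Example: N = 3 checkpoints, K = 4 initial images each.
S = np.array([[0.71, 0.68, 0.70, 0.69],
              [0.82, 0.79, 0.84, 0.80],
              [0.77, 0.75, 0.78, 0.76]])
print(select_best_checkpoint(S))  # -> 1: the second checkpoint's weights win
```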
The method for determining the small sample model provided by the alternative implementation mode takes the model parameters of the iterative training with the maximum average similarity as the parameters of the small sample model, and provides a reliable and accurate implementation mode for the implementation of the small sample model.
In another embodiment of the present disclosure, cropping the face region of the input image to obtain the face image includes: distributing a set number of anchor points over the input image; determining the face region position based on the image region where each anchor point is located; and cropping the input image based on the face region position to obtain the face image.
In this embodiment, the anchor point is an image detection point, and whether the pixels around the anchor point belong to pixels preset in the face can be determined through the image detection point; the determining the face region position based on the image region where each anchor point is located includes: judging whether an image area where each anchor point is located is a part of a face image or not through a pre-trained neural network; and collecting all anchor points belonging to the face image together to obtain the face region position in the image.
In this embodiment, cropping the input image based on the face region position to obtain the face image includes: cropping out the face region position from the input image to obtain the face image.
Optionally, cropping the input image based on the face region position to obtain the face image may further include: based on the face region position, predicting the key point positions of the cropped face using a pre-trained key point detection model; computing a transformation matrix from the detected key point positions and the preset standard key point positions of the face; and applying an affine transformation to the input image through the transformation matrix to obtain an aligned face image.
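This alignment step can be illustrated with OpenCV (a sketch; the landmark detector is assumed to exist, and the standard positions below are example values, not the disclosure's):

```python
import cv2
import numpy as np

# Assumed standard positions (pixels) of left eye, right eye, and nose tip
# in a 112x112 aligned crop; real systems tune these empirically.
STANDARD_PTS = np.float32([[38.3, 51.7], [73.5, 51.6], [56.0, 71.7]])

def align_face(image: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    """keypoints: detected (x, y) positions of the same three landmarks.
    Estimates a similarity transform to the standard positions and warps."""
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(keypoints), STANDARD_PTS)
    return cv2.warpAffine(image, matrix, (112, 112))
```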
According to the method for obtaining the face image, the anchor point is set first, the face area is determined based on the image area where the anchor point is located, the input image is cut based on the face area, the face image is obtained, and accuracy of obtaining the face image is improved.
Optionally, cropping the face region of the input image to obtain the face image includes: acquiring pre-marked face key points; matching the face key points against the pixel points in the input image; and in response to the matching value between the face key points and pixel points in the input image being greater than a matching-degree threshold (which can be set according to matching requirements, for example 80%), taking the region where the matched pixel points are located as the face region and cropping that region from the input image to obtain the face image.
In one embodiment of the present disclosure, the digital person generation method further includes: performing data augmentation on the input image to obtain an augmented image; and using the augmented image in place of the input image.
In this embodiment, data augmentation of the input image means increasing the diversity of the input images; this is achieved by applying color conversion and changes of brightness, contrast, and saturation to the input image.
Optionally, the above data augmentation of the input image further includes: performing a geometric transformation on the input image, wherein the geometric transformation comprises: random cutting, rotating, translating, scaling and the like.
Optionally, the above data augmentation of the input image may further include: and performing image synthesis on the input image.
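Such augmentations can be expressed, for instance, with torchvision transforms (an illustrative pipeline; the particular parameters are assumptions, not values from the disclosure):

```python
from torchvision import transforms

# Color and geometric augmentations matching the operations described above.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(degrees=10),
    transforms.RandomResizedCrop(size=512, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
])

# Usage on a PIL image: augmented = augment(face_image)
```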
According to this digital person generation method, applying data augmentation to the input images increases their diversity and hence the diversity of the face images; when the small sample model is trained on these diverse face images, it can fully extract the features of the sample images, improving the accuracy of the resulting small sample model.
In some embodiments of the present disclosure, before cropping the face region of the input image, the digital person generation method further includes: removing blurred images from the input images, and preprocessing the input images to obtain processed input images.
In this embodiment, the main purpose of preprocessing the input image is to eliminate irrelevant information in the input image, recover useful real information, enhance the detectability of relevant information and simplify data to the maximum extent, thereby improving the reliability of feature extraction, image segmentation, matching and recognition. The preprocessing process generally includes the steps of digitizing, geometric transformation, normalization, smoothing, restoration, enhancement and the like.
In this embodiment, to better train the small sample model, there are generally multiple input images. Before the face images are cropped, blurred images among the input images are removed so that the face images are relatively sharp; preprocessing the input images leaves them with more useful information, increasing their information content.
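A common heuristic for this blur filtering is the variance of the Laplacian, sketched below (the threshold is an assumption to be tuned on representative data):

```python
import cv2

BLUR_THRESHOLD = 100.0  # assumed; tune on representative images

def is_blurred(image_path: str) -> bool:
    """Low variance of the Laplacian means few sharp edges, i.e. a blurry image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < BLUR_THRESHOLD

# Keep only sufficiently sharp input images before cropping the face region:
# sharp = [p for p in image_paths if not is_blurred(p)]
```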
In some optional implementations of this embodiment, the digital person generation method further includes: removing blurred images from the digital person images.
In this embodiment, since there may be multiple input images, each input image may correspond to one digital person image; removing the blurred images among the multiple digital person images gives the user a better digital person result and improves the user experience.
The present disclosure provides a small sample model training method that processes an input image through a small sample model, improving the quality of the digital person image. FIG. 3 shows a flow 300 of one embodiment of the small sample model training method of the present disclosure, which includes the following steps:
step 301, a preset sample set is acquired.
In this embodiment, the sample set includes at least one sample, the sample including: sample image and sample text for the sample image.
In this embodiment, the sample image is an image containing a person's face, and the sample text describes the person, the scene around the person, and the image format of the sample image, so that the digital person network can fully understand the content depicted in the sample image through the sample text.
Step 302, a pre-established digital person network is obtained.
In this embodiment, the digital person network includes a base model and a small sample network connected in parallel with the base model, where the base model is used for characterizing the correspondence between texts and digital person images, and the small sample network is used for characterizing the correspondence between images and image features.
In this embodiment, the base model may be a large model representing the correspondence between text and digital person images; that is, inputting text into the base model yields a digital person image output by the base model. Optionally, the base model is a trained stable diffusion model. The small sample network is a network branch that can be trained independently on top of the base model's image processing; it extracts image features from the image and inputs them into the base model, and the base model analyzes the image features on the basis of its understanding of the text to obtain the digital person image it outputs.
Optionally, in order to enable the digital person to have corresponding style characteristics, the style image corresponding to the style characteristics, the text and the image characteristics of the small sample network can be input into the base model together, so that the digital person image with the corresponding style characteristics output by the base model can be obtained.
In this embodiment, the loss functions of the base model and the small sample network may be the same, and parameters of the base model may be directly frozen when training the small sample network, so as to achieve the purpose of training the small sample model alone. Optionally, the base model is an image feature processing module without training, the image features output by the small sample model are input to the base model, and the base model processes the sample text and the image features to obtain the digital human image.
The image features are features obtained by extracting face images through a small sample network, and the intermediate images are images obtained by extracting image features through the small sample network.
At step 303, a sample is selected from the sample set.
In this embodiment, the executing body may select a sample from the sample set obtained in step 301, and execute the training steps of steps 303 to 305. The selection manner and the selection number of the samples are not limited in the application. For example, at least one sample may be selected randomly, or a sample with better definition (i.e., higher pixels) may be selected from the samples.
Step 304, the sample is input into a digital person network, and a digital person image output by the digital person network is obtained.
In this embodiment, step 304 includes: inputting the sample text of the sample into the digital person network and controlling the digital person network to obtain a predicted digital person image from the input sample text; comparing the predicted digital person image with the sample image of the sample; and determining whether the small sample network in the digital person network satisfies the training completion condition according to the number of training iterations of steps 303 to 305 or the loss value of the loss function set for the small sample network.
In step 305, in response to the small sample network meeting the training completion condition, a small sample model corresponding to the small sample network is obtained.
In this embodiment, the training completion condition includes at least one of: the number of training iterations reaches a predetermined iteration threshold, or the loss value is less than a predetermined loss threshold. For example, the training iterations reach 5 thousand, or the loss value is less than 0.05. After training is completed, only the small sample network is retained as the small sample model. When the small sample model is used, it can be combined with the base model to obtain a digital person image that meets the requirements.
In this embodiment, if the small sample network does not satisfy the training completion condition, the relevant parameters in the small sample network are adjusted so that the loss value converges, and steps 303-304 are repeated with the adjusted small sample network until the condition in step 305 is satisfied.
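The loop of steps 303-305 can be sketched schematically in PyTorch (the digital person network, loss function, and data loader are abstracted away; the thresholds echo the examples above, and only the small sample branch is updated while the base model stays frozen):

```python
import torch

MAX_ITERS, LOSS_THRESHOLD = 50_000, 0.05  # example training completion conditions

def train_small_sample_network(base_model, small_sample_net, sample_loader, loss_fn):
    for p in base_model.parameters():
        p.requires_grad_(False)            # freeze the base model (step 302 setup)
    optimizer = torch.optim.AdamW(small_sample_net.parameters(), lr=1e-4)
    for step, (sample_image, sample_text) in enumerate(sample_loader, start=1):
        features = small_sample_net(sample_image)      # image features (step 304)
        predicted = base_model(sample_text, features)  # predicted digital person image
        loss = loss_fn(predicted, sample_image)        # compare with the sample image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step >= MAX_ITERS or loss.item() < LOSS_THRESHOLD:
            break                                      # training completion (step 305)
    return small_sample_net                            # retained as the small sample model
```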
The small sample model training method provided by the embodiments of the disclosure proceeds as follows: first, a preset sample set is acquired; second, a pre-established digital person network is acquired; next, a sample is selected from the sample set; then, the sample is input into the digital person network to obtain the digital person image output by the network; and finally, in response to the small sample network satisfying the training completion condition, the small sample model corresponding to the small sample network is obtained. In this way, during training of the small sample network, the images output by the base model gradually conform to the texts in the samples, which improves the training efficiency of the small sample model.
With further reference to fig. 4, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a digital person generating apparatus, which corresponds to the method embodiment shown in fig. 1 and is particularly applicable to various electronic devices.
As shown in fig. 4, the digital person generating apparatus 400 provided in this embodiment includes: an information acquisition unit 401, a clipping unit 402, an image obtaining unit 403, and a result obtaining unit 404. The information acquisition unit 401 may be configured to acquire an input image and a style template input by a user. The clipping unit 402 may be configured to clip the face region of the input image to obtain a face image. The image obtaining unit 403 may be configured to obtain, based on the face image, a small sample model and an intermediate image corresponding to the small sample model, where the small sample model is used to characterize the correspondence between the face image and the intermediate image. The result obtaining unit 404 may be configured to obtain a digital person image of the corresponding digital person based on the intermediate image and the style template.
In the present embodiment, in the digital person generating apparatus 400: the specific processing of the information acquisition unit 401, the clipping unit 402, the image obtaining unit 403, and the result obtaining unit 404 and the technical effects thereof may refer to the relevant descriptions of steps 101 to 104 in the corresponding embodiment of fig. 1, and are not repeated herein.
In some optional implementations of the present embodiment, the image obtaining unit 403 is configured to: determine, based on the face image, whether a small sample model corresponding to the face image exists in the model library; and in response to determining that the model library has a small sample model corresponding to the face image, obtain an intermediate image based on the face image and the small sample model.
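A sketch of such a model-library lookup is given below; keying the library by a face identifier and storing one model file per identifier are assumptions for illustration, not details fixed by the present disclosure:

```python
from pathlib import Path

MODEL_LIBRARY = Path("model_library")  # assumed storage location

def find_small_sample_model(face_id: str):
    """Return the stored small sample model path for this face, or None."""
    candidate = MODEL_LIBRARY / (face_id + ".pt")
    return candidate if candidate.exists() else None
```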
In some optional implementations of the present embodiment, the image obtaining unit 403 is configured to: in response to determining that the model library does not have a small sample model corresponding to the face image, train a fine tuning framework of the stable diffusion model based on the face image to obtain the small sample model; and obtain an intermediate image based on the face image and the small sample model. The result obtaining unit 404 is configured to: intercept the face region of the style template to obtain a style image; and input the style image and the intermediate image into the stable diffusion model to obtain the digital human image output by the stable diffusion model.
In some optional implementations of the present embodiment, the image obtaining unit 403 is configured to: perform iterative training for a set number of times on the fine tuning framework of the stable diffusion model based on the face image to obtain a plurality of model parameters; and determine a small sample model based on the model parameters and the stable diffusion model.
In some optional implementations of the present embodiment, the image obtaining unit 403 is configured to: determine a plurality of initial images based on the model parameters and the stable diffusion model; and determine a small sample model based on the initial images and the face image.
In some optional implementations of the present embodiment, the image obtaining unit 403 is configured to: perform face similarity calculation between each initial image and the face image to obtain a similarity value corresponding to each initial image; calculate the average similarity corresponding to the initial images based on the number of initial images and their similarity values; and select, as the parameters of the small sample model, the model parameters of the iterative training with the largest average similarity among all the iterative trainings.
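This parameter selection by average face similarity may be sketched as follows; `embed_face` stands in for an unspecified face-embedding model and is an assumption of this example:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_best_parameters(checkpoints, face_image, embed_face):
    """`checkpoints` pairs each saved iteration's model parameters with the
    initial images generated under those parameters; the checkpoint whose
    initial images are on average most similar to the input face wins."""
    face_emb = embed_face(face_image)
    best_params, best_avg = None, -1.0
    for params, initial_images in checkpoints:
        sims = [cosine_similarity(embed_face(img), face_emb)
                for img in initial_images]
        avg = sum(sims) / len(sims)  # average similarity for this iteration
        if avg > best_avg:
            best_params, best_avg = params, avg
    return best_params
```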
In some optional implementations of the present embodiment, the clipping unit 402 is configured to: distribute a set number of anchor points in the input image; determine the face region position based on the image region where each anchor point is located; and crop the input image based on the face region position to obtain the face image.
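One way the anchor-based cropping could be realized is sketched below; the face-scoring function is an assumed stand-in for whatever detector judges whether the region around an anchor point belongs to a face:

```python
import numpy as np

def crop_face(input_image: np.ndarray, anchors, face_score_fn,
              threshold: float = 0.5):
    """Keep the anchors whose surrounding region scores as face, take their
    bounding box as the face region position, and crop the image to it."""
    face_points = [(x, y) for (x, y) in anchors
                   if face_score_fn(input_image, x, y) > threshold]
    if not face_points:
        return None  # no face region found
    xs, ys = zip(*face_points)
    h, w = input_image.shape[:2]
    x0, x1 = max(min(xs), 0), min(max(xs), w)
    y0, y1 = max(min(ys), 0), min(max(ys), h)
    return input_image[y0:y1, x0:x1]
```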
In some optional implementations of the present embodiment, the digital person generating apparatus 400 further includes an augmentation unit (not shown in the figure). The augmentation unit is configured to: perform data augmentation on the input image to obtain an augmented image; and replace the input image with the augmented image.
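A plausible augmentation recipe, sketched with torchvision; the specific transforms and their parameters are illustrative choices, since the disclosure does not fix them, and `input_image` is assumed to be a PIL image:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.RandomRotation(degrees=10),
])

# PIL image in, augmented PIL image out; the result replaces the input image.
augmented_image = augment(input_image)
```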
In some optional implementations of the present embodiment, the digital person generating apparatus 400 further includes a pre-removal unit (not shown in the figure) configured to: remove blurred images from the input images, and preprocess the input images to obtain processed input images.
In some optional implementations of the present embodiment, the digital person generating apparatus 400 further includes a post-removal unit (not shown in the figure) configured to: remove blurred images from the digital person images.
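Both removal units could rely on a variance-of-Laplacian blur check, a common heuristic sketched below; the disclosure does not specify the detection method, and the threshold is an assumption to be tuned per dataset:

```python
import cv2

def is_blurred(image_bgr, threshold: float = 100.0) -> bool:
    # Low variance of the Laplacian indicates few sharp edges, i.e. blur.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

# Pre-removal: filter user inputs; post-removal: filter generated images.
# sharp_images = [img for img in images if not is_blurred(img)]
```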
In the digital person generating apparatus provided by the embodiment of the present disclosure, first, the information acquisition unit 401 acquires an input image and a style template input by a user; second, the clipping unit 402 clips the face region of the input image to obtain a face image; third, the image obtaining unit 403 obtains, based on the face image, a small sample model and an intermediate image corresponding to the small sample model, where the small sample model is used to characterize the correspondence between the face image and the intermediate image; finally, the result obtaining unit 404 obtains a digital person image of the corresponding digital person based on the intermediate image and the style template. Thus, a small sample model is obtained from the face image input by the user, and the features of the input image can be effectively extracted based on the small sample model; the intermediate image is fused with the style template, so that the generated digital person image fully presents the features of the input image, and the effect of generating the digital person image is improved.
With further reference to fig. 5, as an implementation of the method illustrated in the above figures, the present disclosure provides one embodiment of a small sample model training apparatus, which corresponds to the method embodiment illustrated in fig. 3 and is particularly applicable to various electronic devices.
As shown in fig. 5, the small sample model training apparatus 500 provided in this embodiment includes: a set acquisition unit 501, a network acquisition unit 502, a selection unit 503, an input unit 504, and a model obtaining unit 505. The set acquisition unit 501 may be configured to acquire a preset sample set, where the sample set includes at least one sample, and the sample includes: a sample image and sample text for the sample image. The network acquisition unit 502 may be configured to acquire a pre-established digital person network, where the digital person network includes: a base model and a small sample network connected in parallel with the base model, the base model being used to characterize the correspondence between texts and digital human images, and the small sample network being used to characterize the correspondence between images and image features. The selection unit 503 may be configured to select a sample from the sample set. The input unit 504 may be configured to input the sample into the digital person network and obtain the digital person image output by the digital person network. The model obtaining unit 505 may be configured to obtain the small sample model corresponding to the small sample network in response to the small sample network satisfying the training completion condition.
In the present embodiment, in the small sample model training apparatus 500: the specific processing of the set acquisition unit 501, the network acquisition unit 502, the selection unit 503, the input unit 504, and the model obtaining unit 505 and the technical effects thereof may refer to the relevant descriptions of steps 301 to 305 in the corresponding embodiment of fig. 3, and are not repeated herein.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other processing of the user's personal information comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as the digital person generation method or the small sample model training method. For example, in some embodiments, the digital person generation method or the small sample model training method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the digital person generation method or the small sample model training method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the digital person generation method or the small sample model training method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable digital person generating device or small sample model training device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed herein can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (25)

1. A digital person generation method, the method comprising:
acquiring an input image and a style template input by a user;
cutting the face area of the input image to obtain a face image;
based on the face image, a small sample model and an intermediate image corresponding to the small sample model are obtained, wherein the small sample model is used for representing the corresponding relation between the face image and the intermediate image;
and obtaining a digital person image corresponding to the digital person based on the intermediate image and the style template.
2. The method of claim 1, wherein the obtaining, based on the face image, a small sample model and an intermediate image corresponding to the small sample model comprises:
determining, based on the face image, whether a small sample model corresponding to the face image exists in a model library;
and in response to determining that the model library has a small sample model corresponding to the face image, obtaining an intermediate image based on the face image and the small sample model.
3. The method of claim 2, wherein the obtaining, based on the face image, a small sample model and an intermediate image corresponding to the small sample model further comprises:
in response to determining that the model library does not have a small sample model corresponding to the face image, training a fine tuning framework of a stable diffusion model based on the face image to obtain a small sample model;
obtaining an intermediate image based on the face image and the small sample model;
the obtaining the digital person image corresponding to the digital person based on the intermediate image and the style template comprises the following steps:
intercepting the face area of the style template to obtain a style image;
and inputting the style image and the intermediate image into the stable diffusion model to obtain the digital human image output by the stable diffusion model.
4. The method of claim 3, wherein the training a fine tuning framework of a stable diffusion model based on the face image to obtain a small sample model comprises:
performing iterative training for a set number of times on the fine tuning framework of the stable diffusion model based on the face image to obtain a plurality of model parameters;
and determining a small sample model based on the face image, the model parameters and the stable diffusion model.
5. The method of claim 4, wherein the determining a small sample model based on the face image, the model parameters, and the stable diffusion model comprises:
determining a plurality of initial images based on the model parameters and the stable diffusion model;
a small sample model is determined based on the initial image and the face image.
6. The method of claim 5, wherein the determining a small sample model based on the initial image and the face image comprises:
performing face similarity calculation between each initial image and the face image to obtain a similarity value corresponding to each initial image;
calculating the average similarity corresponding to the initial images based on the number of initial images and their similarity values;
and selecting, as the parameters of the small sample model, the model parameters of the iterative training with the largest average similarity among all the iterative trainings.
7. The method of one of claims 1-6, wherein the clipping the face region of the input image to obtain a face image comprises:
distributing a set number of anchor points in the input image;
determining the position of a face region based on the image region where each anchor point is located;
and cutting the input image based on the face region position to obtain a face image.
8. The method of claim 7, the method further comprising:
performing data augmentation on the input image to obtain an augmented image;
replacing the input image with the augmented image.
9. The method of claim 7, prior to clipping the face region of the input image, the method further comprising:
removing blurred images from the input image, and preprocessing the input image to obtain a processed input image.
10. The method of claim 7, the method further comprising:
removing blurred images from the digital human image.
11. A method of small sample model training, the method comprising:
obtaining a preset sample set, the sample set comprising at least one sample, the sample comprising: a sample image and sample text for the sample image;
acquiring a pre-established digital person network, wherein the digital person network comprises: a base model and a small sample network connected in parallel with the base model, wherein the base model is used for characterizing the correspondence between texts and digital human images, and the small sample network is used for characterizing the correspondence between images and image features;
the following training steps are performed: selecting a sample from the sample set; inputting the sample into the digital human network to obtain a digital human image output by the digital human network; and responding to the small sample network to meet the training completion condition, and obtaining a small sample model corresponding to the small sample network.
12. A digital person generating apparatus, the apparatus comprising:
an information acquisition unit configured to acquire an input image and a style template input by a user;
the clipping unit is configured to clip the face area of the input image to obtain a face image;
the image obtaining unit is configured to obtain a small sample model and an intermediate image corresponding to the small sample model based on the face image, wherein the small sample model is used for representing the corresponding relation between the face image and the intermediate image;
a result obtaining unit configured to obtain a digital person image of the corresponding digital person based on the intermediate image and the style template.
13. The apparatus of claim 12, wherein the image obtaining unit is configured to: determine, based on the face image, whether a small sample model corresponding to the face image exists in a model library; and in response to determining that the model library has a small sample model corresponding to the face image, obtain an intermediate image based on the face image and the small sample model.
14. The apparatus of claim 13, wherein the image obtaining unit is configured to: in response to determining that the model library does not have a small sample model corresponding to the face image, train a fine tuning framework of a stable diffusion model based on the face image to obtain a small sample model; and obtain an intermediate image based on the face image and the small sample model;
the result obtaining unit is configured to: intercept the face region of the style template to obtain a style image; and input the style image and the intermediate image into the stable diffusion model to obtain the digital human image output by the stable diffusion model.
15. The apparatus of claim 14, wherein the image obtaining unit is configured to: perform iterative training for a set number of times on the fine tuning framework of the stable diffusion model based on the face image to obtain a plurality of model parameters; and determine a small sample model based on the model parameters and the stable diffusion model.
16. The apparatus of claim 15, wherein the image obtaining unit is configured to: determine a plurality of initial images based on the model parameters and the stable diffusion model; and determine a small sample model based on the initial images and the face image.
17. The apparatus of claim 16, wherein the image obtaining unit is configured to: perform face similarity calculation between each initial image and the face image to obtain a similarity value corresponding to each initial image; calculate the average similarity corresponding to the initial images based on the number of initial images and their similarity values; and select, as the parameters of the small sample model, the model parameters of the iterative training with the largest average similarity among all the iterative trainings.
18. The apparatus of one of claims 12-17, wherein the clipping unit is configured to: distribute a set number of anchor points in the input image; determine the face region position based on the image region where each anchor point is located; and crop the input image based on the face region position to obtain a face image.
19. The apparatus of claim 18, the apparatus further comprising: an augmentation unit;
the augmentation unit is configured to: perform data augmentation on the input image to obtain an augmented image; and replace the input image with the augmented image.
20. The apparatus of claim 18, the apparatus further comprising: a pre-removal unit; the pre-removal unit is configured to: remove blurred images from the input image, and preprocess the input image to obtain a processed input image.
21. The apparatus of claim 18, the apparatus further comprising: a post-removal unit; the post-removal unit is configured to: remove blurred images from the digital human image.
22. A small sample model training apparatus, the apparatus comprising:
a set acquisition unit configured to acquire a preset sample set including at least one sample including: a sample image and sample text for the sample image;
a network acquisition unit configured to acquire a pre-established digital person network, the digital person network including: a base model and a small sample network connected in parallel with the base model, wherein the base model is used for characterizing the correspondence between texts and digital human images, and the small sample network is used for characterizing the correspondence between images and image features;
A selecting unit configured to select a sample from the sample set;
an input unit configured to input the sample into the digital person network to obtain a digital person image output by the digital person network;
and a model obtaining unit configured to obtain a small sample model corresponding to the small sample network in response to the small sample network satisfying a training completion condition.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.
24. A non-transitory computer readable storage medium storing computer instructions, wherein the computer instructions are for causing a computer to perform the method of any one of claims 1-11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-11.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311330431.XA CN117372587A (en) 2023-10-13 2023-10-13 Digital person generation method and device, and small sample model training method and device

Publications (1)

Publication Number Publication Date
CN117372587A 2024-01-09



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination