CN117456062A - Digital person generation model generator training method, digital person generation method and device

Digital person generation model generator training method, digital person generation method and device

Info

Publication number
CN117456062A
CN117456062A
Authority
CN
China
Prior art keywords
loss function
voice
mouth
generator
current frame
Prior art date
Legal status
Pending
Application number
CN202311421191.4A
Other languages
Chinese (zh)
Inventor
叶志坚
肖龙源
李海洲
李稀敏
Current Assignee
Xiamen Kuaishangtong Technology Co Ltd
Original Assignee
Xiamen Kuaishangtong Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Kuaishangtong Technology Co Ltd filed Critical Xiamen Kuaishangtong Technology Co Ltd
Priority to CN202311421191.4A priority Critical patent/CN117456062A/en
Publication of CN117456062A publication Critical patent/CN117456062A/en
Pending legal-status Critical Current

Classifications

    • G06T 13/205: 3D [Three Dimensional] animation driven by audio data
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/045: Combinations of networks
    • G06N 3/0475: Generative networks
    • G06N 3/094: Adversarial learning
    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T 13/80: 2D [Two Dimensional] animation, e.g. using sprites
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/171: Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G10L 21/10: Transforming into visible information
    • G10L 2021/105: Synthesis of the lips movements from speech, e.g. for talking heads
    • Y02T 10/40: Engine management systems


Abstract

The invention discloses a generator training method for a digital person generation model, a digital person generation method, and corresponding devices. The method comprises the following steps: inputting a current frame, a reference frame and the voice corresponding to the current frame in a sample video into a generator of the digital person generation model to generate a voice-driven face image; extracting facial key-point features of the current frame with a pre-trained model, the facial key-point features including the outer-lip key-point features, forming a closed lip mask from the outer-lip key-point features, applying the lip mask to the current frame and to the voice-driven face image respectively, and constructing a mouth reconstruction loss function; and calculating the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function, combining them with the mouth reconstruction loss function to obtain a total loss function, and training the generator of the digital person generation model based on the total loss function, so that the lips and teeth of the face images generated by the trained generator are clearer.

Description

Digital person generation model generator training method, digital person generation method and device
Technical Field
The invention relates to the technical field of digital persons, in particular to a generator training method of a digital person generation model, a digital person generation method and a digital person generation device.
Background
Existing 2D photorealistic digital humans are mainly realized by driving mouth-shape changes with voice, and the common method is wav2lip. Once trained, the model can be applied to any face, i.e. no speaker-specific training is required, and the generated mouth shapes match the voice well. However, the method has a drawback: the details of the teeth and lips are blurred, which limits its practical use. The approach therefore needs to be refined so that the resulting digital person is more realistic and lifelike.
In the current wav2lip training procedure, the entire generated face is compared with the current frame to compute an L1 reconstruction loss, and the generator is optimized by gradient descent. On the one hand, teeth and lips occupy only a small proportion of the whole face image, so they come out blurred. On the other hand, the L1 reconstruction loss only measures per-pixel differences and cannot capture important properties such as image structure and texture, which is another cause of the blurred teeth and lips.
Disclosure of Invention
In view of the technical problem that the teeth and lips in face images generated by the generator of existing digital person generation models are blurred, embodiments of the present application aim to provide a generator training method for a digital person generation model, a digital person generation method and corresponding devices, so as to solve the technical problems mentioned in the Background section.
In a first aspect, the present invention provides a method for training a generator of a digital person generation model, comprising the steps of:
acquiring training data, wherein the training data comprises a current frame, a reference frame and voice corresponding to the current frame in a sample video, inputting the training data into a generator of a digital human generation model, and generating a voice-driven face image;
extracting facial key-point features of the current frame with a pre-trained model, wherein the facial key-point features include the outer-lip key-point features; forming a closed lip mask from the outer-lip key-point features; applying the lip mask to the current frame and to the voice-driven face image respectively to obtain a first image and a second image; and constructing a mouth reconstruction loss function according to the first image and the second image;
calculating an L1 reconstruction loss function, an SSIM loss function, an adversarial loss function and a lip-sync loss function between the voice-driven face image and the current frame;
and calculating a total loss function according to the mouth reconstruction loss function, the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function, and training the generator of the digital human generation model based on the total loss function to obtain a trained generator of the digital human generation model.
Preferably, the construction of the mouth reconstruction loss function according to the first image and the second image specifically includes:
the mouth reconstruction loss function is calculated using the following formula:

L_1_mouth = (1/n)·Σ_{i=1}^{n} | L_G_mouth,i - L_g_mouth,i |

where L_G_mouth,i denotes the i-th pixel of the first image, L_g_mouth,i denotes the i-th pixel of the second image, and n denotes the total number of pixels of the first image (or of the second image), i = 1, 2, …, n.
Preferably, calculating the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function between the voice-driven face image and the current frame specifically comprises the following steps:
calculating the L1 reconstruction loss function between the voice-driven face image and the current frame by the following formula:

L_1 = (1/n)·Σ_{i=1}^{n} | L_G,i - L_g,i |

where L_G,i denotes the i-th pixel of the current frame and L_g,i denotes the i-th pixel of the voice-driven face image;
calculating the SSIM loss function between the voice-driven face image and the current frame by the following formulas:

SSIM(L_G, L_g) = [ (2·μ_G·μ_g + c_1)·(2·σ_Gg + c_2) ] / [ (μ_G^2 + μ_g^2 + c_1)·(σ_G^2 + σ_g^2 + c_2) ]

L_ssim = 1 - SSIM(L_G, L_g)

where L_G and L_g denote the current frame and the voice-driven face image respectively, μ_G and μ_g denote their means, σ_G and σ_g denote their standard deviations, σ_Gg denotes their covariance, and c_1 and c_2 are constants;
judging the voice and the voice-driven face images with a lip-sync discriminator, and correspondingly obtaining the lip-sync loss function shown in the following formula:

L_sync = (1/M)·Σ_{m=1}^{M} ( -log( P_sync^m ) )

where P_sync^m denotes the lip-sync discriminator's judgement of the mouth shape and the voice for the m-th voice-driven face image, namely the probability that they are synchronized, and M is the total number of voice-driven face images;
forming a GAN network with an image quality discriminator and the generator of the digital human generation model, and discriminating between the current frame and the voice-driven face image, the adversarial loss function being shown in the following formula:

L_gen = E_{x~L_g}[ log(1 - D(x)) ]

and the loss function of the image quality discriminator being shown in the following formula:

L_disc = -( E_{x~L_G}[ log D(x) ] + E_{x~L_g}[ log(1 - D(x)) ] )

where D(·) denotes the image quality discriminator, L_g is the voice-driven face image, and L_G is the current frame.
Preferably, the total loss function L_total is given by the following formula:

L_total = (1 - s_w - s_g - s_mouth - s_ssim)·L_1 + s_w·L_sync + s_g·L_gen + s_mouth·L_1_mouth + s_ssim·L_ssim

where s_w, s_g, s_mouth and s_ssim denote the lip-sync loss weight, the adversarial loss weight, the mouth reconstruction loss weight and the SSIM loss weight, respectively.
Preferably, the digital human generation model comprises a wav2lip model and the pre-training model comprises a dlib model.
In a second aspect, the present invention provides a digital person generation method that uses a generator of a digital person generation model trained with the generator training method described in any implementation of the first aspect, and comprises the following steps:
the method comprises the steps of obtaining a face image, target voice and a reference frame in a video to be synthesized, inputting the face image, the target voice and the reference frame in the video to be synthesized into a generator of a digital human generation model, obtaining a face image driven by the target voice, and generating a digital human video according to the face image driven by the target voice and the target voice.
In a third aspect, the present invention provides a generator training apparatus for a digital human generation model, comprising:
the image generation module is configured to acquire training data, wherein the training data comprises a current frame, a reference frame and voice corresponding to the current frame in a sample video, and the training data is input into a generator of a digital human generation model to generate a voice-driven human face image;
the first loss construction module is configured to extract facial key-point features of the current frame with a pre-trained model, wherein the facial key-point features include the outer-lip key-point features, form a closed lip mask from the outer-lip key-point features, apply the lip mask to the current frame and to the voice-driven face image respectively to obtain a first image and a second image, and construct a mouth reconstruction loss function according to the first image and the second image;
the second loss construction module is configured to calculate an L1 reconstruction loss function, an SSIM loss function, an adversarial loss function and a lip-sync loss function between the voice-driven face image and the current frame;
the total loss construction module is configured to calculate a total loss function according to the mouth reconstruction loss function, the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function, and to train the generator of the digital human generation model based on the total loss function to obtain a trained generator of the digital human generation model.
In a fourth aspect, the present invention provides a digital person generation apparatus that uses a generator of a digital person generation model trained with the generator training method described in any implementation of the first aspect, comprising:
the execution module is configured to acquire a face image, target voice and a reference frame in the video to be synthesized, input the face image, the target voice and the reference frame in the video to be synthesized into a generator of a digital human generation model, obtain a face image driven by the target voice, and generate the digital human video according to the face image driven by the target voice and the target voice.
In a fifth aspect, the present invention provides an electronic device comprising one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a sixth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
(1) In the generator training method of the digital human generation model provided by the invention, facial key-point features are extracted from the current frame with a pre-trained model and formed into a closed lip mask; the lip mask is applied to the current frame and to the voice-driven face image, and the L1 reconstruction loss between the resulting first and second masked images is computed to obtain the mouth reconstruction loss; an SSIM loss between the voice-driven face image and the current frame is further introduced so that important properties such as image structure and texture can be captured. The sharpness of the teeth and lips in the generated face images and videos is thereby effectively improved.
(2) The generator training method of the digital human generation model provided by the invention also retains the L1 reconstruction loss function, the adversarial loss function and the lip-sync loss function used in the original generator training process, so that the problem of unclear lips and teeth in the generated face images or videos is solved while the advantages of the original digital human generation model are preserved.
(3) The digital person generating method adopts the digital person generating model generator trained by the digital person generating model generator training method, so that the lips and teeth of the generated digital person video are clearer, and the effect is more vivid and natural.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is an exemplary device architecture diagram to which an embodiment of the present application may be applied;
FIG. 2 is a flow chart of a generator training method of a digital person generation model of an embodiment of the present application;
FIG. 3 is a schematic diagram of a digital person generation model in a generator training method of the digital person generation model of an embodiment of the present application;
FIG. 4 is a graph of digital person results generated by a digital person generation model that was not trained using the generator training method of the digital person generation model of an embodiment of the present application;
FIG. 5 is a graph of digital person results generated by a digital person generation model obtained by training a digital person generation model generator training method of an embodiment of the present application;
FIG. 6 is a schematic diagram of a generator training device of a digital person generation model of an embodiment of the present application;
fig. 7 is a schematic structural diagram of a computer device suitable for use in implementing the embodiments of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 illustrates an exemplary device architecture 100 to which embodiments of the generator training method of a digital person generation model, or of the generator training device of a digital person generation model, of the present application may be applied.
As shown in fig. 1, the apparatus architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various applications, such as a data processing class application, a file processing class application, and the like, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (for example, software or software modules for providing distributed services) or as a single piece of software or software module. The present invention is not particularly limited in this respect.
The server 105 may be a server providing various services, such as a background data processing server processing files or data uploaded by the terminal devices 101, 102, 103. The background data processing server can process the acquired file or data to generate a processing result.
It should be noted that, the method for training the generator of the digital person generation model provided in the embodiment of the present application may be executed by the server 105, or may be executed by the terminal devices 101, 102, 103, and accordingly, the device for training the generator of the digital person generation model may be set in the server 105, or may be set in the terminal devices 101, 102, 103.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case where the processed data does not need to be acquired from a remote location, the above-described apparatus architecture may not include a network, but only a server or terminal device.
Fig. 2 shows a method for training a generator of a digital person generation model according to an embodiment of the present application, including the following steps:
s1, training data are acquired, wherein the training data comprise a current frame, a reference frame and voice corresponding to the current frame in a sample video, the training data are input into a generator of a digital human generation model, and a voice-driven face image is generated.
In a specific embodiment, the digital person generation model comprises a wav2lip model. The wav2lip model comprises a generator and two discriminators, the two discriminators being a lip-sync discriminator and an image quality discriminator.
Specifically, the generator of the wav2lip model includes a face encoder, a voice encoder and a face decoder.
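For illustration only, the three sub-networks can be sketched in PyTorch roughly as follows; the layer counts, channel widths and the 96x96 input resolution are assumptions made for this sketch and are not fixed by the embodiment:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1):
    # Conv + BatchNorm + ReLU building block (layer sizes are illustrative).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    """Face encoder + voice encoder + face decoder, in the style of a wav2lip generator."""
    def __init__(self):
        super().__init__()
        # Face encoder: input is the masked current frame concatenated with the
        # reference frame along the channel axis (3 + 3 = 6 channels).
        self.face_encoder = nn.Sequential(
            conv_block(6, 32, 2), conv_block(32, 64, 2),
            conv_block(64, 128, 2), conv_block(128, 256, 2),
        )
        # Voice encoder: input is a mel-spectrogram chunk treated as a 1-channel image.
        self.voice_encoder = nn.Sequential(
            conv_block(1, 32, 2), conv_block(32, 64, 2),
            conv_block(64, 128, 2), nn.AdaptiveAvgPool2d(1),
        )
        # Face decoder: upsamples the fused embedding back to an RGB face image.
        self.face_decoder = nn.Sequential(
            nn.ConvTranspose2d(256 + 128, 256, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, face_input, mel):
        face_emb = self.face_encoder(face_input)            # (B, 256, H/16, W/16)
        voice_emb = self.voice_encoder(mel)                 # (B, 128, 1, 1)
        voice_emb = voice_emb.expand(-1, -1, face_emb.size(2), face_emb.size(3))
        fused = torch.cat([face_emb, voice_emb], dim=1)     # fusion embedding
        return self.face_decoder(fused)                     # voice-driven face image
```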
Inputting training data into a generator of a digital person generation model to generate a voice-driven face image, comprising the following steps:
extracting features from the voice to obtain voice features;
inputting the voice features into a voice encoder to extract a voice feature embedding;
masking the lower half of the current frame, and concatenating the masked current frame with the reference frame to obtain a face input frame;
inputting the face input frame into a face encoder to extract a face feature embedding;
concatenating the face feature embedding and the voice feature embedding to obtain a fusion embedding;
and inputting the fusion embedding into a face decoder to generate the voice-driven face image.
Specifically, referring to fig. 3, a sample video is acquired, and a current frame and a reference frame are taken from it. The lower half of the current frame is masked, and the masked current frame and the reference frame are concatenated to form a face input frame. The voice corresponding to the current frame is acquired and voice features are extracted from it. The face input frame is fed to the face encoder to obtain the face feature embedding, and the voice features are fed to the voice encoder to obtain the voice feature embedding. The face feature embedding and the voice feature embedding are concatenated to form a fusion embedding, which contains both the face features and the voice features. The fusion embedding is then fed to the face decoder to generate the voice-driven face image.
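A minimal sketch of this data preparation, assuming BGR frames of equal size and librosa for the mel-spectrogram features; the sampling rate and mel parameters are placeholders, not values fixed by the embodiment:

```python
import numpy as np
import librosa

def make_face_input(current_frame, reference_frame):
    """Mask the lower half of the current frame and concatenate it with the reference frame."""
    masked = current_frame.copy()
    h = masked.shape[0]
    masked[h // 2:, :, :] = 0                                        # cover the mouth region
    face_input = np.concatenate([masked, reference_frame], axis=2)   # 6-channel face input frame
    return face_input

def make_voice_features(wav_path, sr=16000, n_mels=80):
    """Extract mel-spectrogram features for the voice corresponding to the current frame."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)

# Example of one forward pass with the Generator sketched earlier:
# face_input tensor of shape (B, 6, 96, 96), mel chunk tensor of shape (B, 1, 80, 16)
# generated = generator(face_input_tensor, mel_chunk_tensor)
```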
S2, extracting facial key-point features of the current frame with a pre-trained model, wherein the facial key-point features include the outer-lip key-point features; forming a closed lip mask from the outer-lip key-point features; applying the lip mask to the current frame and to the voice-driven face image respectively to obtain a first image and a second image; and constructing a mouth reconstruction loss function according to the first image and the second image.
In a specific embodiment, the pre-training model comprises a dlib model.
In a specific embodiment, constructing the mouth reconstruction loss function from the first image and the second image specifically includes:
the mouth reconstruction loss function is calculated using the following formula:

L_1_mouth = (1/n)·Σ_{i=1}^{n} | L_G_mouth,i - L_g_mouth,i |

where L_G_mouth,i denotes the i-th pixel of the first image, L_g_mouth,i denotes the i-th pixel of the second image, and n denotes the total number of pixels of the first image (or of the second image), i = 1, 2, …, n.
Specifically, the facial key-point features of the current frame are extracted with a pre-trained model, including the outer-ring and inner-ring key points of the lips. A closed lip mask is formed from the outer-ring lip key points. The lip mask is applied to the current frame to obtain the first image and to the voice-driven face image to obtain the second image, and the L1 reconstruction loss between the first image and the second image is then computed, yielding the mouth (including teeth) reconstruction loss function L_1_mouth. The pre-trained model may be a dlib model or another facial key-point model.
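The following sketch illustrates one way to build the closed lip mask with dlib and compute the masked L1 difference; the predictor file name is a placeholder, and points 48-59 are the outer-lip contour in dlib's standard 68-point layout:

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # pre-trained dlib model

def lip_mask(frame_bgr):
    """Build a closed binary mask from the outer-lip key points (dlib points 48-59)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    face = detector(gray)[0]                       # assume one face per frame
    shape = predictor(gray, face)
    outer_lip = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 60)],
                         dtype=np.int32)
    mask = np.zeros(frame_bgr.shape[:2], dtype=np.uint8)
    cv2.fillPoly(mask, [outer_lip], 255)           # closed polygon over the mouth region
    return mask

def masked_pair(current_frame, generated_frame):
    """Apply the lip mask of the current frame to both images (first and second image)."""
    mask = lip_mask(current_frame)
    first = cv2.bitwise_and(current_frame, current_frame, mask=mask)
    second = cv2.bitwise_and(generated_frame, generated_frame, mask=mask)
    return first, second

def mouth_reconstruction_loss(first, second):
    """Mean absolute (L1) difference over the masked images, i.e. L_1_mouth."""
    return np.abs(first.astype(np.float32) - second.astype(np.float32)).mean()
```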
S3, calculating the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function between the voice-driven face image and the current frame.
In a specific embodiment, calculating the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function between the voice-driven face image and the current frame specifically includes:
calculating the L1 reconstruction loss function between the voice-driven face image and the current frame by the following formula:

L_1 = (1/n)·Σ_{i=1}^{n} | L_G,i - L_g,i |

where L_G,i denotes the i-th pixel of the current frame and L_g,i denotes the i-th pixel of the voice-driven face image;
calculating the SSIM loss function between the voice-driven face image and the current frame by the following formulas:

SSIM(L_G, L_g) = [ (2·μ_G·μ_g + c_1)·(2·σ_Gg + c_2) ] / [ (μ_G^2 + μ_g^2 + c_1)·(σ_G^2 + σ_g^2 + c_2) ]

L_ssim = 1 - SSIM(L_G, L_g)

where L_G and L_g denote the current frame and the voice-driven face image respectively, μ_G and μ_g denote their means, σ_G and σ_g denote their standard deviations, σ_Gg denotes their covariance, and c_1 and c_2 are constants;
judging the voice and the voice-driven face images with a lip-sync discriminator, and correspondingly obtaining the lip-sync loss function shown in the following formula:

L_sync = (1/M)·Σ_{m=1}^{M} ( -log( P_sync^m ) )

where P_sync^m denotes the lip-sync discriminator's judgement of the mouth shape and the voice for the m-th voice-driven face image, namely the probability that they are synchronized, and M is the total number of voice-driven face images;
forming a GAN network with an image quality discriminator and the generator of the digital human generation model, and discriminating between the current frame and the voice-driven face image, the adversarial loss function being shown in the following formula:

L_gen = E_{x~L_g}[ log(1 - D(x)) ]

and the loss function of the image quality discriminator being shown in the following formula:

L_disc = -( E_{x~L_G}[ log D(x) ] + E_{x~L_g}[ log(1 - D(x)) ] )

where D(·) denotes the image quality discriminator, L_g is the voice-driven face image, and L_G is the current frame.
Specifically, the current frame is taken as the real frame, the voice-driven face image is compared with it, and the generator is optimized through the L1 reconstruction loss: one objective of generator training is to minimize the L1 reconstruction loss between the voice-driven face image and the current frame. In addition, an SSIM loss, a structural loss between the voice-driven face image and the current frame, is introduced; constructing the SSIM loss makes it possible to capture important properties of the image such as structure and texture.
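As a sketch of these two image losses in PyTorch (the SSIM here is computed from global image statistics rather than a sliding window, a simplification of the usual SSIM, and c1, c2 assume pixel values scaled to [0, 1]):

```python
import torch
import torch.nn.functional as F

def l1_reconstruction_loss(current, generated):
    # Mean absolute per-pixel difference between the current frame and the generated image.
    return F.l1_loss(generated, current)

def ssim_loss(current, generated, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified (global, non-windowed) SSIM between two images in [0, 1].
    mu_x, mu_y = current.mean(), generated.mean()
    var_x, var_y = current.var(), generated.var()
    cov_xy = ((current - mu_x) * (generated - mu_y)).mean()
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim
```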
Further, the wav2lip model optimizes the match between the mouth shape and the voice through a pre-trained lip-sync discriminator. The pre-trained lip-sync discriminator may adopt SyncNet, which outputs a high probability value when the mouth shape and the voice are synchronized and a low probability value when they are not. The lip-sync discriminator is therefore used to judge the synchronization between the voice and the voice-driven face images, and the lip-sync loss function is established accordingly.
Furthermore, the wav2lip model pairs the image quality discriminator with the generator to form a GAN network and constructs an adversarial loss function, so as to improve the visual quality of the generated face.
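A sketch of the lip-sync loss and the adversarial losses; sync_probs is assumed to be the synchronization probability output by a pre-trained SyncNet-style discriminator, and quality_disc a discriminator that outputs the probability that an image is real:

```python
import torch

def lip_sync_loss(sync_probs):
    # L_sync: mean of -log(P_sync) over the M generated frames.
    return (-torch.log(sync_probs.clamp(min=1e-7))).mean()

def generator_adversarial_loss(quality_disc, generated):
    # Generator term of the GAN objective: push D(generated) towards 1.
    d_fake = quality_disc(generated).clamp(1e-7, 1 - 1e-7)
    return torch.log(1.0 - d_fake).mean()

def discriminator_loss(quality_disc, real, generated):
    # Discriminator objective, negated so it can be minimized by gradient descent.
    d_real = quality_disc(real).clamp(1e-7, 1 - 1e-7)
    d_fake = quality_disc(generated.detach()).clamp(1e-7, 1 - 1e-7)
    return -(torch.log(d_real).mean() + torch.log(1.0 - d_fake).mean())
```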
S4, calculating a total loss function according to the mouth reconstruction loss function, the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function, and training the generator of the digital human generation model based on the total loss function to obtain a trained generator of the digital human generation model.
In a particular embodiment, the total loss function L_total is:

L_total = (1 - s_w - s_g - s_mouth - s_ssim)·L_1 + s_w·L_sync + s_g·L_gen + s_mouth·L_1_mouth + s_ssim·L_ssim

where s_w, s_g, s_mouth and s_ssim denote the lip-sync loss weight, the adversarial loss weight, the mouth reconstruction loss weight and the SSIM loss weight, respectively.
Specifically, the total loss function of the generator is constructed by combining the newly introduced mouth reconstruction loss and SSIM loss with the L1 reconstruction loss, the lip-sync loss and the adversarial loss. By optimizing the total loss function, a trained generator of the digital human generation model is obtained, with which voice-driven face images with clearer lips and teeth can be produced.
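Combining the five losses, the generator objective can be sketched as follows; the weight values are placeholders chosen only for illustration:

```python
# Placeholder loss weights; the embodiment leaves the actual values to the practitioner.
s_w, s_g, s_mouth, s_ssim = 0.03, 0.07, 0.3, 0.2

def total_loss(l1, l_sync, l_gen, l1_mouth, l_ssim):
    """L_total as a weighted combination of the five generator losses."""
    return ((1 - s_w - s_g - s_mouth - s_ssim) * l1
            + s_w * l_sync
            + s_g * l_gen
            + s_mouth * l1_mouth
            + s_ssim * l_ssim)

# Typical use inside a training loop (the individual loss tensors are computed
# with helpers like those sketched earlier in this description):
# l_total = total_loss(l1, l_sync, l_gen, l1_mouth, l_ssim)
# optimizer.zero_grad(); l_total.backward(); optimizer.step()
```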
An embodiment of the present application further provides a digital person generation method that uses the generator of the digital human generation model obtained by training with the above generator training method, and comprises the following steps:
the method comprises the steps of obtaining a face image, target voice and a reference frame in a video to be synthesized, inputting the face image, the target voice and the reference frame in the video to be synthesized into a generator of a digital human generation model, obtaining a face image driven by the target voice, and generating a digital human video according to the face image driven by the target voice and the target voice.
Specifically, in the generator of the digital human generation model, the voice features of the target voice are extracted; the reference frame is concatenated with the face image in the video to be synthesized whose lower half is masked to obtain a spliced image; the spliced image is fed to the face encoder in the generator to obtain the face feature embedding; the voice features of the target voice are fed to the voice encoder to obtain the voice feature embedding; and the face feature embedding and the voice feature embedding are fed to the face decoder, which outputs the face image driven by the target voice. In the resulting image the lips are not only synchronized with the target voice but also clearer, as are the teeth, and the synthesized digital human video is correspondingly better.
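An illustrative inference loop under the same assumptions as the earlier sketches; make_face_input is the helper sketched above, frames are assumed to be face crops already at the generator's input resolution, and muxing the target voice with the written frames is left to an external tool such as ffmpeg:

```python
import cv2
import torch

def generate_digital_human(generator, frames, reference_frame, mel_chunks, out_path, fps=25):
    """Drive each frame of the video to be synthesized with the target voice."""
    generator.eval()
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    with torch.no_grad():
        for frame, mel in zip(frames, mel_chunks):
            face_input = make_face_input(frame, reference_frame)   # mask lower half + concat
            x = torch.from_numpy(face_input).permute(2, 0, 1).float().unsqueeze(0) / 255.0
            m = torch.from_numpy(mel).float().unsqueeze(0).unsqueeze(0)
            generated = generator(x, m)[0].permute(1, 2, 0).cpu().numpy() * 255.0
            writer.write(generated.astype("uint8"))
    writer.release()
    # The target voice is then combined with the written frames (e.g. with ffmpeg)
    # to obtain the final digital human video.
```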
As shown in fig. 4, when the digital human generation model is not trained with the generator training method provided by the embodiment of the present application, the mouth of the generated digital human is blurred and the synthesized digital human video hardly looks real; as shown in fig. 5, when it is, the mouth of the digital human is clear, the details of the teeth and lips are clearly visible, and the synthesized digital human video looks very realistic. The generator of the digital human generation model obtained by training with the generator training method provided by the embodiment of the present application therefore performs well.
The labels S1-S4 above are merely step designations and do not by themselves represent an order between the steps.
With further reference to fig. 6, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of a digital human generation model generator training apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
The embodiment of the application provides a generator training device of a digital person generation model, which comprises the following components:
the image generation module 1 is configured to acquire training data, wherein the training data comprises a current frame, a reference frame and voice corresponding to the current frame in a sample video, and the training data is input into a generator of a digital human generation model to generate a voice-driven human face image;
the first loss construction module 2 is configured to extract facial key-point features of the current frame with a pre-trained model, wherein the facial key-point features include the outer-lip key-point features, form a closed lip mask from the outer-lip key-point features, apply the lip mask to the current frame and to the voice-driven face image respectively to obtain a first image and a second image, and construct a mouth reconstruction loss function according to the first image and the second image;
a second loss construction module 3 configured to calculate an L1 reconstruction loss function, an SSIM loss function, an adversarial loss function and a lip-sync loss function between the voice-driven face image and the current frame;
the total loss construction module 4 is configured to calculate a total loss function according to the mouth reconstruction loss function, the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function, and to train the generator of the digital human generation model based on the total loss function to obtain a trained generator of the digital human generation model.
An embodiment of the present application further provides a digital person generation apparatus that uses a generator of a digital human generation model trained with the above generator training method, comprising:
the execution module is configured to acquire a face image, target voice and a reference frame in the video to be synthesized, input the face image, the target voice and the reference frame in the video to be synthesized into a generator of a digital human generation model, obtain a face image driven by the target voice, and generate the digital human video according to the face image driven by the target voice and the target voice.
Referring now to fig. 7, there is illustrated a schematic diagram of a computer apparatus 700 suitable for use in implementing an electronic device (e.g., a server or terminal device as illustrated in fig. 1) of an embodiment of the present application. The electronic device shown in fig. 7 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
As shown in fig. 7, the computer apparatus 700 includes a Central Processing Unit (CPU) 701 and a Graphics Processor (GPU) 702, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 703 or a program loaded from a storage section 709 into a Random Access Memory (RAM) 704. In the RAM 704, various programs and data required for the operation of the apparatus 700 are also stored. The CPU 701, the GPU702, the ROM 703, and the RAM 704 are connected to each other through a bus 705. An input/output (I/O) interface 706 is also connected to the bus 705.
The following components are connected to the I/O interface 706: an input section 707 including a keyboard, a mouse, and the like; an output section 708 including a display such as a liquid crystal display (LCD), a speaker, and the like; a storage section 709 including a hard disk or the like; and a communication section 710 including a network interface card such as a LAN card, a modem, and the like. The communication section 710 performs communication processing via a network such as the Internet. A drive 711 may also be connected to the I/O interface 706 as needed. A removable medium 712, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 711 as needed, so that a computer program read out therefrom is installed into the storage section 709 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 710, and/or installed from the removable media 712. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 701 and a Graphics Processor (GPU) 702.
It should be noted that the computer readable medium described in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can transmit, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wireline, optical fiber cable, RF, or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments described in the present application may be implemented by software, or may be implemented by hardware. The described modules may also be provided in a processor.
As another aspect, the present application also provides a computer readable medium that may be contained in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire training data, wherein the training data comprise a current frame, a reference frame and the voice corresponding to the current frame in a sample video, input the training data into a generator of a digital human generation model, and generate a voice-driven face image; extract facial key-point features of the current frame with a pre-trained model, wherein the facial key-point features include the outer-lip key-point features, form a closed lip mask from the outer-lip key-point features, apply the lip mask to the current frame and to the voice-driven face image respectively to obtain a first image and a second image, and construct a mouth reconstruction loss function according to the first image and the second image; calculate an L1 reconstruction loss function, an SSIM loss function, an adversarial loss function and a lip-sync loss function between the voice-driven face image and the current frame; and calculate a total loss function according to the mouth reconstruction loss function, the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function, and train the generator of the digital human generation model based on the total loss function to obtain a trained generator of the digital human generation model.
The foregoing description covers only the preferred embodiments of the present application and explains the technical principles employed. Persons skilled in the art will appreciate that the scope of the invention referred to in this application is not limited to the specific combinations of the above technical features, and is also intended to cover other embodiments formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example embodiments in which the above features are replaced with (but not limited to) technical features with similar functions disclosed in the present application.

Claims (10)

1. A method of training a generator of a digital person generation model, comprising the steps of:
acquiring training data, wherein the training data comprises a current frame, a reference frame and voice corresponding to the current frame in a sample video, and inputting the training data into a generator of a digital human generation model to generate a voice-driven human face image;
extracting facial key-point features of the current frame by adopting a pre-trained model, wherein the facial key-point features comprise the outer-lip key-point features; forming a closed lip mask from the outer-lip key-point features; applying the lip mask to the current frame and to the voice-driven face image respectively to obtain a first image and a second image; and constructing a mouth reconstruction loss function according to the first image and the second image;
calculating an L1 reconstruction loss function, an SSIM loss function, an adversarial loss function and a lip-sync loss function between the voice-driven face image and the current frame;
and calculating a total loss function according to the mouth reconstruction loss function, the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function, and training the generator of the digital human generation model based on the total loss function to obtain a trained generator of the digital human generation model.
2. The method for training a generator of a digital person generation model according to claim 1, wherein the constructing a mouth reconstruction loss function from the first image and the second image specifically comprises:
the mouth reconstruction loss function is calculated using the following formula:

L_1_mouth = (1/n)·Σ_{i=1}^{n} | L_G_mouth,i - L_g_mouth,i |

where L_G_mouth,i denotes the i-th pixel of the first image, L_g_mouth,i denotes the i-th pixel of the second image, and n denotes the total number of pixels of the first image (or of the second image), i = 1, 2, …, n.
3. The method for training a generator of a digital person generation model according to claim 2, wherein said calculating the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function between the voice-driven face image and the current frame comprises:
calculating the L1 reconstruction loss function between the voice-driven face image and the current frame by the following formula:

L_1 = (1/n)·Σ_{i=1}^{n} | L_G,i - L_g,i |

where L_G,i denotes the i-th pixel of the current frame and L_g,i denotes the i-th pixel of the voice-driven face image;
calculating the SSIM loss function between the voice-driven face image and the current frame by the following formulas:

SSIM(L_G, L_g) = [ (2·μ_G·μ_g + c_1)·(2·σ_Gg + c_2) ] / [ (μ_G^2 + μ_g^2 + c_1)·(σ_G^2 + σ_g^2 + c_2) ]

L_ssim = 1 - SSIM(L_G, L_g)

where L_G and L_g denote the current frame and the voice-driven face image respectively, μ_G and μ_g denote their means, σ_G and σ_g denote their standard deviations, σ_Gg denotes their covariance, and c_1 and c_2 are constants;
judging the voice and the voice-driven face images with a lip-sync discriminator, and correspondingly obtaining the lip-sync loss function shown in the following formula:

L_sync = (1/M)·Σ_{m=1}^{M} ( -log( P_sync^m ) )

where P_sync^m denotes the lip-sync discriminator's judgement of the mouth shape and the voice for the m-th voice-driven face image, namely the probability that they are synchronized, and M is the total number of voice-driven face images;
forming a GAN network with an image quality discriminator and the generator of the digital human generation model, and discriminating between the current frame and the voice-driven face image, wherein the adversarial loss function is shown in the following formula:

L_gen = E_{x~L_g}[ log(1 - D(x)) ]

and the loss function of the image quality discriminator is shown in the following formula:

L_disc = -( E_{x~L_G}[ log D(x) ] + E_{x~L_g}[ log(1 - D(x)) ] )

where D(·) denotes the image quality discriminator, L_g is the voice-driven face image, and L_G is the current frame.
4. A method of training a generator of a digital human generation model according to claim 3, characterized in that the total loss function L_total is given by the following formula:

L_total = (1 - s_w - s_g - s_mouth - s_ssim)·L_1 + s_w·L_sync + s_g·L_gen + s_mouth·L_1_mouth + s_ssim·L_ssim

where s_w, s_g, s_mouth and s_ssim denote the lip-sync loss weight, the adversarial loss weight, the mouth reconstruction loss weight and the SSIM loss weight, respectively.
5. The method of generator training of a digital person generation model of claim 1, wherein the digital person generation model comprises a wav2lip model and the pre-training model comprises a dlib model.
6. A digital person generation method, characterized in that it uses a generator of a digital person generation model trained by the generator training method of a digital person generation model according to any one of claims 1 to 5, and comprises the following steps:
and acquiring a face image, target voice and a reference frame in the video to be synthesized, inputting the face image, the target voice and the reference frame in the video to be synthesized into a generator of the digital human generation model to obtain a face image driven by the target voice, and generating a digital human video according to the face image driven by the target voice and the target voice.
7. A generator training device for a digital person generation model, comprising:
the image generation module is configured to acquire training data, wherein the training data comprises a current frame, a reference frame and voice corresponding to the current frame in a sample video, and the training data is input into a generator of a digital human generation model to generate a voice-driven human face image;
the first loss construction module is configured to extract facial key-point features of the current frame by adopting a pre-trained model, wherein the facial key-point features comprise the outer-lip key-point features, form a closed lip mask from the outer-lip key-point features, apply the lip mask to the current frame and to the voice-driven face image respectively to obtain a first image and a second image, and construct a mouth reconstruction loss function according to the first image and the second image;
a second loss construction module configured to calculate an L1 reconstruction loss function, an SSIM loss function, an adversarial loss function and a lip-sync loss function between the voice-driven face image and the current frame;
and a total loss construction module configured to calculate a total loss function according to the mouth reconstruction loss function, the L1 reconstruction loss function, the SSIM loss function, the adversarial loss function and the lip-sync loss function, and to train the generator of the digital human generation model based on the total loss function to obtain a trained generator of the digital human generation model.
8. A digital person generation apparatus, characterized in that it uses a generator of a digital person generation model trained by the generator training method of a digital person generation model according to any one of claims 1 to 5, and comprises:
the execution module is configured to acquire a face image, target voice and a reference frame in the video to be synthesized, input the face image, the target voice and the reference frame in the video to be synthesized into the generator of the digital human generation model, obtain a face image driven by the target voice, and generate the digital human video according to the face image driven by the target voice and the target voice.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
when executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-5.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-5.
CN202311421191.4A 2023-10-30 Digital person generation model generator training method, digital person generation method and device (pending, published as CN117456062A)

Priority Applications (1)

CN202311421191.4A, priority date 2023-10-30, filing date 2023-10-30: Digital person generation model generator training method, digital person generation method and device

Publications (1)

CN117456062A, published 2024-01-26

Family

ID=89596090

Country Status (1)

CN: CN117456062A


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination