CN113269872A - Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization


Info

Publication number: CN113269872A
Application number: CN202110610539.9A
Authority: CN (China)
Prior art keywords: face, video, frame, network, audio
Legal status: Pending
Original language: Chinese (zh)
Inventors: 杨志景, 李为杰, 温瑞冕, 徐永宗, 李凯, 凌永权
Current and original assignee: Guangdong University of Technology
Application filed by Guangdong University of Technology; priority to CN202110610539.9A; published as CN113269872A


Classifications

    • G06T 17/00 - Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06F 18/214 - Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 - Neural networks: combinations of networks
    • G06N 3/08 - Neural networks: learning methods
    • G06T 15/005 - 3D image rendering: general purpose rendering architectures
    • G06V 40/161 - Human faces: detection; localisation; normalisation
    • G06V 40/168 - Human faces: feature extraction; face representation
    • G06T 2200/04 - Indexing scheme for image data processing involving 3D image data


Abstract

The invention discloses a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization, which comprises the following steps: fitting all parameters of a three-dimensional face deformation model to the input face image with a convolutional neural network; training a speech-to-expression and head pose mapping network H using the target video and the face model parameters; using the trained mapping network H to obtain facial expression and head pose parameters from the input audio; synthesizing a face and rendering it to generate realistic face video frames; training a rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames, the rendering network generating a background for each frame's face image; and performing face background rendering and video synthesis based on video key frame optimization. The background of each frame of the synthesized face video output by the invention transitions naturally and realistically, which greatly enhances the usability and practicability of the synthesized face video.

Description

Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
Technical Field
The invention relates to the field of three-dimensional face reconstruction and face synthesis migration in deep learning, in particular to a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization.
Background
With the improvement of living standards in China, the popularization of mobile intelligent terminals and the rapid development of mobile internet technology, video has become an indispensable part of people's life, study, entertainment and work. Compared with traditional image-and-text forms of expression, video combines hearing and vision, and its production threshold is relatively low. Most current applications of video synthesis are still in entertainment, such as face-swap photography in Meitu-style apps, AR avatar creation on the iPhone, iSwap Faces and similar applications. In essence, most of these applications detect, locate and segment the face in an image with a deep-learning neural network and then exchange a source face and a target face. These functions require a neural network trained on a large amount of face data, offer poor controllability, and make it difficult to decouple the individual attributes of the face.
Audio-driven synthesis of face speech videos is currently a key problem in realizing virtual anchors and intelligent face speech video synthesis. Its key effect is that a face speech video with a realistic face and natural transitions between video frames can be generated given only the source audio and a video of the target person. Traditional recording of a person's speech video requires a large amount of labour and time, and necessarily requires the target person to participate in the recording. The method therefore uses three-dimensional face reconstruction together with a rendering network under video background frame optimization to generate realistic synthesized face images and hence a realistic synthesized face video, which is of great practical value for virtual anchors, recording of character programmes, online course recording and the like.
At present, neural networks are applied to face video synthesis in deep learning. Based on three-dimensional face reconstruction and a rendering network under video key frame optimization, features such as expression, head pose, shape and texture can be extracted from a face model; the expression and head pose features are extracted from the source audio and substituted into the face model of the target person, and the required realistic synthesized face frames are generated through the optimized rendering of video key frames, thereby achieving face synthesis. In recent years a large number of researchers have devoted themselves to scientific research in the field of face synthesis. However, because face generation based on neural networks requires a large amount of training data, data acquisition for such networks is a great challenge. Moreover, owing to the quality of the input data and the inherent instability of generative models, the pictures and videos synthesized in this way may have low image quality and cannot support large-scale head pose control. The manner of face synthesis has always been a difficulty and a hot topic in research on face video synthesis. In the patent "Training method for generative adversarial network, image face-swapping method and video face-swapping device" (application number: 202010592443.X) of Peninsula (Beijing) Information Technology Co., Ltd., a generator and a discriminator based on a generative adversarial network are proposed; the adversarial network is trained with massive data pairs, the attribute feature map of the person in the target image is extracted, and the generated mixed feature map is decoded to obtain the synthesized face. Although this method can also preserve the attribute features of the original image and the identity features of the target image, it has low stability in obtaining a realistic synthesized face, cannot obtain a synthesized face speech video given only a person's voice and a video, and produces synthesized face videos with blurred face backgrounds and low, unnatural video frame quality.
Disclosure of Invention
The invention provides a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization, which enables the background transition of each frame of the output synthetic face video to be natural and vivid.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization comprises the following steps:
optimizing and fitting each parameter of the three-dimensional face deformation model to the input face image by adopting a convolutional neural network to realize the parameterized reconstruction of the face model;
training a speech-to-expression and head pose mapping network H using the target video and the parameters of the face model, and using the trained mapping network H to obtain facial expression and head pose parameters from the input audio, wherein the target video comprises video frames and the audio corresponding to the video frames, and the video frames contain face images;
replacing parameters in the parameterized face image according to the acquired facial expression and head posture parameters, synthesizing to obtain a face image of each frame, rendering, and generating a vivid face video frame;
training a rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames, wherein the rendering network is used to generate a background for each frame's face image;
and performing face background rendering and video synthesis based on video key frame optimization to obtain high-quality synthesized face video frames that contain the source-audio facial expression and head pose parameters and have a clear portrait and background, and synthesizing the complete speech video of the synthesized face in video-frame order (a high-level sketch of this pipeline is given after these steps).
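For clarity only, the Python-style sketch below summarizes how the steps above fit together. Every function passed in (fit_3dmm, mapping_net, render_face, rendering_gan, assemble_video) is a hypothetical placeholder for the corresponding step described above, not an implementation provided by the invention.

    def generate_synthetic_video(frames, source_mfcc, fit_3dmm, mapping_net,
                                 render_face, rendering_gan, assemble_video):
        # 1. Parameterized reconstruction: fit the 3D face deformation model to every target frame.
        params = [fit_3dmm(f) for f in frames]                 # alpha, beta, delta, gamma, P per frame
        # 2./3. Predict expression (beta) and head pose (P) from the source audio features.
        beta_audio, pose_audio = mapping_net(source_mfcc)
        # 4. Replace beta/P in each frame's parameters and render the synthesized face.
        synth_faces = [render_face({**p, "beta": b, "P": q})
                       for p, b, q in zip(params, beta_audio, pose_audio)]
        # 5./6. GAN-based background rendering with key-frame optimization, then video assembly.
        rendered = [rendering_gan(face, frame) for face, frame in zip(synth_faces, frames)]
        return assemble_video(rendered)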
Preferably, the method for realizing the parametric reconstruction of the face model by optimally fitting each parameter of the three-dimensional face deformation model to the input face image by adopting the convolutional neural network specifically comprises the following steps:
recognizing the face in the face image, marking it with 68 landmark points, representing the face with the three-dimensional face deformation model, and parameterizing it as a triangular mesh model consisting of 35709 vertices;
converting a two-dimensional face image I into a three-dimensional face parameterized model X, expressed as:

X = (α, β, δ, γ, P)

where α is the shape parameter of the three-dimensional face model, δ is the texture parameter, β is the expression parameter, γ is the illumination parameter, and P is the head pose parameter, represented by the rotation parameter R and the translation parameter T derived from the camera model; the shape of any face picture can be represented by the three-dimensional face shape parameterized model as:

S(α, β) = S̄ + B_shape α + B_exp β

where B_shape is the face shape basis vector and B_exp is the facial expression basis vector;
the texture of the face is represented as:

T(δ) = T̄ + B_tex δ

where B_tex is the face texture basis vector, and S̄ and T̄ are respectively the average shape and average texture of the face model; the illumination model of the face is represented as:

C(n_i, t_i | γ) = t_i · Σ_{b=1}^{B²} γ_b Φ_b(n_i)

where γ is the illumination parameter of the face model, n_i is the normal vector of any vertex v_i of the face model, t_i is the texture value of vertex v_i, the irradiance of vertex v_i is denoted C(n_i, t_i | γ), Φ_b are the spherical harmonic basis functions, and γ_b are the spherical harmonic coefficients;

therefore, the reconstruction process of the three-dimensional face model can be expressed as an optimization solution for the face model parameters, and the training process of the three-dimensional face model based on the convolutional neural network can be expressed as the optimization problem of the following formulas (1) and (2):

L_coef(x) = ω_α ‖α‖² + ω_β ‖β‖² + ω_δ ‖δ‖²    (1)

L_tex(x) = Σ_{c∈{r,g,b}} var(T_c(R(x)))    (2)

where L_coef and L_tex are the regularized optimization functions of the coefficients of the three-dimensional face model, ω_α, ω_β and ω_δ are the weights corresponding to the face shape coefficient, expression coefficient and texture coefficient respectively, c ∈ {r, g, b} indicates that the picture is an RGB picture, T_c denotes the face texture vector of channel c, var() denotes the variance, and R(x) denotes the skin area of the face containing the cheeks, nose and forehead.
Preferably, the speech-to-expression and head pose mapping network H is trained using the target video and the parameters of the face model, specifically:

extracting the audio in the target video and converting it into Mel-frequency cepstral coefficients; inputting the converted Mel-frequency cepstral coefficients into a pre-trained audio high-level feature extraction network to obtain the high-level feature F_t; then using F_t, together with the β_t and P_t generated by fitting the three-dimensional face deformation model to the input face images with the convolutional neural network, as the training data set to train the speech-to-expression and head pose mapping network H; the trained mapping network H extracts two face estimation parameters from the audio, corresponding respectively to the expression coefficients β = {β^(1), ..., β^(64)} and the head pose coefficients P = {P^(1), ..., P^(6)} of the three-dimensional face deformation model; the training process of the mapping network H can be regarded as the optimization of the mean square error loss of the expression parameters L_exp and the mean square error loss of the head pose parameters L_pose, as shown in formulas (3) and (4):

L_exp = MSE(H(F_t)_β, β_t)    (3)

L_pose = MSE(H(F_t)_P, P_t)    (4)

where MSE() denotes the mean square error function, F_t denotes the high-level feature input to the network at time t, β_t is the expression parameter of the target video at time t, and P_t is the head pose parameter of the target video at time t.
Preferably, the pre-trained audio high-level feature extraction network is specifically as follows:
the audio high-level feature extraction network uses an AT-net network as its backbone and is trained on The Oxford-BBC Lip Reading in the Wild dataset.
Preferably, the method for acquiring a parameterized face image specifically includes the following steps:
extracting the target video into video frames and cropping the whole face from each frame to obtain the face images I^(1), I^(2), ..., I^(n); after the n face images are optimized by the convolutional neural network, for any image I^(k), k ∈ n, the face image represented by the parameterized model and the corresponding shape parameter α_video, expression parameter β_video, texture parameter δ_video, illumination parameter γ_video and head pose parameter P_video are obtained.
Preferably, the parameters in the parameterized face image are replaced according to the acquired parameters of facial expression and head pose, specifically:
the expression coefficients β_audio and head pose coefficients P_audio obtained from the audio replace the corresponding face parameters in each video frame in time-frame order, synthesizing the new face X = {α_video, β_audio, δ_video, γ_video, P_audio}.
Preferably, the synthesizing obtains a face image of each frame and renders the face image to generate a realistic face video frame, specifically:
a 3D mesh renderer is used so that each mesh transitions uniformly and smoothly, thereby obtaining realistic and natural face images, and the further-rendered face images are stored as the corresponding video frames.
Preferably, the rendering network based on a generative adversarial network is trained using the parameterized face images and the face images in the video frames, specifically:

the parameterized face image I^(k) obtained from the target video frame and the original video frame Î^(k) corresponding to the same moment are combined into sequence data pairs (I^(k), Î^(k)) and input into a rendering network consisting of a generator G and a discriminator D for pre-training; the pre-training process is the overall optimization process of the generator and discriminator of the generative adversarial network and can be represented by formula (5):

G* = arg min_G max_D L_GAN(G, D)    (5)

where G* denotes the overall optimization objective of the rendering network;
the complete training objective comprises the reconstruction loss L_rec and the adversarial loss L_adv, and training is represented as formula (6):

G* = arg min_G max_D L_adv(G, D) + λ L_rec(G)    (6)

where the weighting parameter λ = 100; the reconstruction loss L_rec of formula (7) and the adversarial loss L_adv of formula (8) are as follows:

L_rec(G) = E[ ‖ Î^(k) − G(I^(k)) ‖₁ ]    (7)

L_adv(G, D) = E[ log D(Î^(k)) ] + E[ log(1 − D(G(I^(k)))) ]    (8)

The rendering network after this optimization training can generate a background for each frame's face image.
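A minimal PyTorch sketch of the objective in formulas (5) to (8) is shown below; only the loss wiring (adversarial term plus λ = 100 times the reconstruction term) follows the description above, while the generator G and discriminator D themselves are assumed to be provided elsewhere.

    import torch
    import torch.nn as nn

    def gan_training_step(G, D, opt_G, opt_D, synth_face, real_frame, lam=100.0):
        bce = nn.BCEWithLogitsLoss()
        l1 = nn.L1Loss()

        # Discriminator step: real original frames -> 1, generated frames -> 0.
        fake = G(synth_face).detach()
        pred_real, pred_fake = D(real_frame), D(fake)
        d_loss = bce(pred_real, torch.ones_like(pred_real)) + bce(pred_fake, torch.zeros_like(pred_fake))
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Generator step: adversarial loss plus lambda * L1 reconstruction loss (formula (6)).
        fake = G(synth_face)
        pred_fake = D(fake)
        g_loss = bce(pred_fake, torch.ones_like(pred_fake)) + lam * l1(fake, real_frame)
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()
        return d_loss.item(), g_loss.item()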
Preferably, face background rendering and video synthesis are performed based on video key frame optimization; when the number of original video frames exceeds a threshold, the steps are specifically:

calculating the Euclidean distance H between the face image I_t synthesized at time t and the original video frames in which only the head region remains after mask processing, and taking the frame for which H attains its minimum value, i.e. the matching frame in the original video whose head pose deviation is closest to the pose of the synthesized face, to form a rendering data pair with I_t; obtaining the matching frames of all frames in this way and combining them into rendering sequence data pairs; inputting the obtained rendering sequence data pairs into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the source-audio expression coefficients β_audio and head pose coefficients P_audio; finally, the complete speech video of the synthesized face, which on the basis of the target person's identity contains the source audio, the audio-driven expression and the head pose, is synthesized in video-frame order.
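One straightforward way to implement the matching-frame search described above is sketched below: each synthesized face frame is compared against the masked original frames, and the frame with the smallest Euclidean distance is kept as its rendering partner. The masking itself (keeping only the head region) is assumed to have been done beforehand, and all frames are assumed to share the same size.

    import numpy as np

    def find_matching_frame(synth_face, masked_originals):
        # masked_originals: original frames with everything except the head region zeroed out.
        synth = synth_face.astype(np.float32).ravel()
        dists = [np.linalg.norm(synth - m.astype(np.float32).ravel()) for m in masked_originals]
        return int(np.argmin(dists))        # index of the frame with minimum Euclidean distance H

    def build_render_pairs(synth_frames, masked_originals, originals):
        pairs = []
        for synth in synth_frames:
            k = find_matching_frame(synth, masked_originals)
            pairs.append((synth, originals[k]))   # rendering data pair (I_t, matched original frame)
        return pairs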
Preferably, face background rendering and video synthesis are performed based on video key frame optimization; when the number of original video frames does not exceed the threshold, the steps are specifically:

selecting the video frames with large head pose offsets from the synthesized face video frames as key frames and arranging them into a key frame sequence in time order; calculating the Euclidean distance H between the face image I_t synthesized at time t and the original video frames in which only the head region remains after mask processing, and taking the frame for which H attains its minimum value as the most closely matching frame, so as to obtain the matching frame sequence; then obtaining the video frames between the matching frames through an OpenCV-based linear interpolation algorithm, thereby synthesizing a complete video frame sequence; this video frame sequence and the synthesized face video frames form the rendering sequence data pairs, which are input into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the source-audio expression coefficients β_audio and head pose coefficients P_audio; finally, the complete speech video of the synthesized face, which on the basis of the target person's identity contains the source audio, the audio-driven expression and the head pose, is synthesized in video-frame order.
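The OpenCV-based linear interpolation between successive matched key frames could be realized as in the following sketch, where cv2.addWeighted blends the two frames; the number of in-between frames is an illustrative parameter rather than a value specified by the patent.

    import cv2

    def interpolate_frames(frame_a, frame_b, n_between):
        # Linearly blend two matched key frames to fill the gap between them.
        out = []
        for i in range(1, n_between + 1):
            w = i / float(n_between + 1)
            out.append(cv2.addWeighted(frame_a, 1.0 - w, frame_b, w, 0))
        return out

    def fill_sequence(matched_frames, n_between=4):
        full = []
        for a, b in zip(matched_frames[:-1], matched_frames[1:]):
            full.append(a)
            full.extend(interpolate_frames(a, b, n_between))
        full.append(matched_frames[-1])
        return full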
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
compared with the traditional method for recording the talking video of the person at present, the highlight of the invention can correspondingly generate the talking video of A and the talking video of B (matching the audio of A) under certain special scenes, such as only the audio of A and the talking video recorded before A or only the audio of A and the talking video of B, which is obviously impossible in the traditional method; secondly, the invention can train the network well in advance and store the network locally, and then the needed composite video can be obtained quickly only by inputting the corresponding audio and video; third, the present invention is greatly reduced in both time and labor costs over conventional methods. Compared with the existing face synthesis video method, firstly, the invention uses the parameterized model of the three-dimensional face and the face synthesis of the audio conversion expression and head posture mapping network to ensure that the synthesized face has the audio expression and head posture parameter information completely, so that the mouth shape of the character in the generated face synthesis video is matched with the audio, and the video character is controlled by the expression and the head posture parameter in the audio; secondly, the invention carries out targeted optimization on the character background in the synthesized video frame, and can obtain the synthesized face video frame with high quality and clear portrait and background after being rendered by the background rendering network.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic flow chart of acquiring facial expression and head pose parameters from input audio.
Fig. 3 is a schematic flow chart of optimized rendering of video key frames when the number of original video frames exceeds a threshold.
Fig. 4 is a schematic flow chart of optimized rendering of video key frames when the number of original video frames does not exceed a threshold.
Fig. 5 is a schematic view of a face synthesis process.
Fig. 6 is a schematic view of a flow of rendering a human face background and synthesizing a video.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
A method for generating a composite video based on three-dimensional face reconstruction and video key frame optimization is disclosed, as shown in FIG. 1, and comprises the following steps:
optimizing and fitting each parameter of the three-dimensional face deformation model to the input face image by adopting a convolutional neural network to realize the parameterized reconstruction of the face model;
training a voice-to-expression and head pose mapping network H by using parameters of a target video and a face model, and acquiring facial expression and head pose parameters from input audio by using the trained voice-to-expression and head pose mapping network H, wherein the target video comprises a video frame and audio corresponding to the video frame, and the video frame comprises a face image;
replacing parameters in the parameterized face image according to the acquired facial expression and head posture parameters, synthesizing to obtain a face image of each frame, rendering, and generating a vivid face video frame;
training a rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames, wherein the rendering network is used to generate a background for each frame's face image;
and performing face background rendering and video synthesis based on video key frame optimization to obtain high-quality synthesized face video frames that contain the source-audio facial expression and head pose parameters and have a clear portrait and background, and synthesizing the complete speech video of the synthesized face in video-frame order.
This embodiment considers that, at the present stage, recording videos of people speaking requires a large amount of labour and time in areas such as video anchoring and online education course recording. At present, speech videos of a person are mainly produced by traditional means: a real person must be recorded in front of the camera, which is time-consuming and labour-intensive and places high demands on the clothing and appearance of the person on camera. In this embodiment, model parameters describing expression and head pose are extracted from the person's voice, the face in each video frame is reconstructed with a three-dimensional face model, and a synthesized face video is generated that carries the identity information of the target face and matches the expression and head pose information in the source audio. The method is also purposefully optimized for the realism of the face images and the naturalness of the background transitions in the synthesized face video. In some past research, the synthesized faces have strange facial structures and are not realistic enough, or the face is detached from the background, the background is blurred, and so on, lacking the naturalness of a real face. In this embodiment, after the face is synthesized, the 3D mesh renderer further renders the facial texture of the synthesized face image, so that the texture details of the synthesized face are visually richer and more realistic. At the same time, some key frames are selected from the synthesized face video frames, and an interpolation frame-filling algorithm is used between adjacent key frames, so that the background of each frame of the output synthesized face video transitions naturally and realistically. Therefore, the usability and practicability of the synthesized face video can be greatly enhanced by this method.
The method adopts the convolutional neural network to optimally fit all parameters of the three-dimensional face deformation model to the input face image so as to realize the parametric reconstruction of the face model, and specifically comprises the following steps:
recognizing the face in the face image, marking it with 68 landmark points, representing the face with the three-dimensional face deformation model, and parameterizing it as a triangular mesh model consisting of 35709 vertices;
converting a two-dimensional face image I into a three-dimensional face parameterized model X, expressed as:

X = (α, β, δ, γ, P)

where α is the shape parameter of the three-dimensional face model, δ is the texture parameter, β is the expression parameter, γ is the illumination parameter, and P is the head pose parameter, represented by the rotation parameter R and the translation parameter T derived from the camera model; the shape of any face picture is represented by the three-dimensional face shape parameterized model as:

S(α, β) = S̄ + B_shape α + B_exp β

where B_shape is the face shape basis vector and B_exp is the facial expression basis vector;
the texture of the face is represented as:

T(δ) = T̄ + B_tex δ

where B_tex is the face texture basis vector, and S̄ and T̄ are respectively the average shape and average texture of the face model;
the illumination model of the face is represented as:

C(n_i, t_i | γ) = t_i · Σ_{b=1}^{B²} γ_b Φ_b(n_i)

where γ is the illumination parameter of the face model, n_i is the normal vector of any vertex v_i of the face model, t_i is the texture value of vertex v_i, the irradiance of vertex v_i is denoted C(n_i, t_i | γ), Φ_b are the spherical harmonic basis functions, and γ_b are the spherical harmonic coefficients (B = 3);

therefore, the reconstruction process of the three-dimensional face model can be expressed as an optimization solution for the face model parameters, and the training process of the three-dimensional face model based on the convolutional neural network can be expressed as the optimization problem of the following formulas (1) and (2):

L_coef(x) = ω_α ‖α‖² + ω_β ‖β‖² + ω_δ ‖δ‖²    (1)

L_tex(x) = Σ_{c∈{r,g,b}} var(T_c(R(x)))    (2)

where L_coef and L_tex are the regularized optimization functions of the coefficients of the three-dimensional face model, ω_α, ω_β and ω_δ are the weights corresponding to the face shape coefficient, expression coefficient and texture coefficient respectively, c ∈ {r, g, b} indicates that the picture is an RGB picture, T_c denotes the face texture vector of channel c, var() denotes the variance, and R(x) denotes the skin area of the face containing the cheeks, nose and forehead.
As shown in FIG. 2, the speech-to-expression and head pose mapping network H is trained using the target video and the parameters of the face model, specifically:

extracting the audio in the target video and converting it into Mel-frequency cepstral coefficients; inputting the converted Mel-frequency cepstral coefficients into the pre-trained audio high-level feature extraction network to obtain the high-level feature F_t; then using F_t, together with the β_t and P_t of the face model parameters obtained after the convolutional neural network optimization, as the training data set to train the speech-to-expression and head pose mapping network H; the trained mapping network H extracts two face estimation parameters from the audio, corresponding respectively to the expression coefficients β = {β^(1), ..., β^(64)} and the head pose coefficients P = {P^(1), ..., P^(6)} of the three-dimensional face deformation model; the training process of the mapping network H can be regarded as the optimization of the mean square error loss of the expression parameters L_exp and the mean square error loss of the head pose parameters L_pose, as shown in formulas (3) and (4):

L_exp = MSE(H(F_t)_β, β_t)    (3)

L_pose = MSE(H(F_t)_P, P_t)    (4)

where MSE() denotes the mean square error function, F_t denotes the high-level feature input to the network at time t, β_t is the expression parameter of the target video at time t, and P_t is the head pose parameter of the target video at time t.
The pre-trained audio high-level feature extraction network is specifically as follows:
the audio high-level feature extraction network uses an AT-net network as its backbone and is trained on The Oxford-BBC Lip Reading in the Wild dataset.
As shown in fig. 5 to 6, the method for acquiring a parameterized face image specifically includes the following steps:
extracting the target video into video frames and cropping the whole face from each frame to obtain the face images I^(1), I^(2), ..., I^(n); after the n face images are optimized by the convolutional neural network, for any image I^(k), k ∈ n, the face image represented by the parameterized model and the corresponding shape parameter α_video, expression parameter β_video, texture parameter δ_video, illumination parameter γ_video and head pose parameter P_video are obtained.
Replacing parameters in the parameterized face image according to the acquired parameters of the facial expression and the head posture, which specifically comprises the following steps:
the expression coefficients β_audio and head pose coefficients P_audio obtained from the audio replace the corresponding face parameters in each video frame in time-frame order, synthesizing the new face X = {α_video, β_audio, δ_video, γ_video, P_audio}.
The synthesizing obtains the face image of each frame and renders the face image to generate a vivid face video frame, which specifically comprises the following steps:
a 3D mesh renderer is used so that each mesh transitions uniformly and smoothly, thereby obtaining realistic and natural face images, and the further-rendered face images are stored as the corresponding video frames.
The rendering network based on a generative adversarial network is trained using the parameterized face images and the face images in the video frames, specifically:

the parameterized face image I^(k) obtained from the target video frame and the original video frame Î^(k) corresponding to the same moment are combined into sequence data pairs (I^(k), Î^(k)) and input into a rendering network consisting of a generator G and a discriminator D for pre-training; the pre-training process is the overall optimization process of the generator and discriminator of the generative adversarial network and can be represented by formula (5):

G* = arg min_G max_D L_GAN(G, D)    (5)

where G* denotes the overall optimization objective of the rendering network;
the complete training objective comprises the reconstruction loss L_rec and the adversarial loss L_adv, and training is represented as formula (6):

G* = arg min_G max_D L_adv(G, D) + λ L_rec(G)    (6)

where the weighting parameter λ = 100; the reconstruction loss L_rec of formula (7) and the adversarial loss L_adv of formula (8) are as follows:

L_rec(G) = E[ ‖ Î^(k) − G(I^(k)) ‖₁ ]    (7)

L_adv(G, D) = E[ log D(Î^(k)) ] + E[ log(1 − D(G(I^(k)))) ]    (8)

The rendering network after this optimization training can generate a background for each frame's face image.
Performing face background rendering and video synthesis based on video keyframe optimization, as shown in fig. 3, when the number of original video frames exceeds a threshold, specifically:
calculating the Euclidean distance H between the face image I_t synthesized at time t and the original video frames in which only the head region remains after mask processing, and taking the frame for which H attains its minimum value, i.e. the matching frame in the original video whose head pose deviation is closest to the pose of the synthesized face, to form a rendering data pair with I_t; obtaining the matching frames of all frames in this way and combining them into rendering sequence data pairs; inputting the obtained rendering sequence data pairs into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the source-audio expression coefficients β_audio and head pose coefficients P_audio; finally, the complete speech video of the synthesized face, which on the basis of the target person's identity contains the source audio, the audio-driven expression and the head pose, is synthesized in video-frame order.
Performing face background rendering and video synthesis based on video keyframe optimization, as shown in fig. 4, when the number of original video frames does not exceed a threshold, specifically:
selecting the video frames with large head pose offsets from the synthesized face video frames as key frames and arranging them into a key frame sequence in time order; calculating the Euclidean distance H between the face image I_t synthesized at time t and the original video frames in which only the head region remains after mask processing, and taking the frame for which H attains its minimum value as the most closely matching frame, so as to obtain the matching frame sequence; then obtaining the video frames between the matching frames through an OpenCV-based linear interpolation algorithm, thereby synthesizing a complete video frame sequence; this video frame sequence and the synthesized face video frames form the rendering sequence data pairs, which are input into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the source-audio expression coefficients β_audio and head pose coefficients P_audio; finally, the complete speech video of the synthesized face, which on the basis of the target person's identity contains the source audio, the audio-driven expression and the head pose, is synthesized in video-frame order.
In a specific embodiment:
the advanced feature extraction network for extracting The audio advanced feature F _ t is trained on The Oxford-BBC Lip Reading in The Wild (LRW) Dataset by taking an AT-net network as a backbone, and The Dataset contains 1000 pronunciations of up to 500 different words and is spoken by hundreds of different speakers. Our audio-to-expression and head pose mapping network is pre-trained with the data set as the sum P _ t reconstructed from the target video frame and F _ t extracted from the speech of the target video. The three-dimensional Face reconstruction Model based on the convolutional neural network takes a ResNet-50 network as a backbone, represents a Face based on an Expression Basis Model in a Basel Face Model 2009 and a Facewarehouse, further renders a reconstructed synthetic Face by a mesh render in a tensoflow, and is trained on a data set 300 WLP. The face background rendering network based on video key frame optimization is composed of a generator and a discriminator, and a training set is a data pair composed of an original video frame and a synthesized face image reconstructed from the original video frame.
Pre-training the audio-to-expression and head pose mapping network and the convolutional-neural-network-based three-dimensional face reconstruction model: since the LRW training set and the 300W-LP training set are public, the data sets are downloaded separately and input into the two networks for pre-training, and the model parameters are saved once learning has finished after a small number of iterations of the two networks.
Processing the input data: first, the input audio is converted into Mel-frequency cepstral coefficients (MFCCs) and stored; the audio and video frames are extracted from the input target video, the face in each video frame is identified, and the face is cropped and stored as a face video frame; the image cropped from each face video frame is a 256 × 256 RGB image, denoted I_t.
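This preprocessing step could be implemented roughly as follows, assuming librosa for the MFCC conversion and an OpenCV Haar-cascade detector for the face crop; both library choices and the 16 kHz sampling rate are assumptions for the example, since the patent does not name specific tools for this step.

    import cv2
    import librosa

    def audio_to_mfcc(wav_path, n_mfcc=13):
        y, sr = librosa.load(wav_path, sr=16000)          # mono waveform at an assumed 16 kHz
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    def extract_face_frames(video_path, size=256):
        detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        cap = cv2.VideoCapture(video_path)
        faces = []
        ok, frame = cap.read()
        while ok:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            boxes = detector.detectMultiScale(gray, 1.1, 5)
            if len(boxes) > 0:
                x, y, w, h = boxes[0]
                crop = cv2.resize(frame[y:y + h, x:x + w], (size, size))  # 256x256 face crop I_t
                faces.append(crop)
            ok, frame = cap.read()
        cap.release()
        return faces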
Face synthesis stage: the 256 × 256 RGB face image I_t is input into the pre-trained convolutional-neural-network-based three-dimensional face deformation model to generate the reconstructed three-dimensional face image of each frame together with the corresponding shape parameter α_video, expression parameter β_video, texture parameter δ_video, illumination parameter γ_video and head pose parameter P_video (where the head pose parameter P is composed of the rotation parameter R and the translation parameter T among the camera parameters), and these parameters are saved. Then the MFCCs are input into the audio-to-expression and head pose mapping network to obtain the expression parameter β_audio and head pose parameter P_audio, which are stored; both then replace, in time-frame order, the expression parameter β_video and head pose parameter P_video reconstructed from the target video, and the face parameters {α_video, β_audio, δ_video, γ_video, P_audio} are input into the face synthesis model, rendered by the 3D mesh renderer, and the new face X = {α_video, β_audio, δ_video, γ_video, P_audio} is output and stored in time-frame order.
Face background rendering and video synthesis based on video key frame optimization: the image I_t and the synthesized face image corresponding to time t from the face synthesis stage above are set as a data pair, and these data pairs are used to train the rendering network based on the generative adversarial network for 100 epochs; after learning finishes, the model is saved. The number of frames of the target video is then calculated. If it is below 5000 frames, mode 2 of the video key frame optimization method is adopted: the key frame sequence is obtained, the matching frame sequence in the target video frames is obtained by calculating the minimum Euclidean distance, the sequence is input into the pre-trained rendering network, and the video frames between the matching frames of the output face frames are obtained through the OpenCV-based linear interpolation algorithm, thereby synthesizing the complete synthesized-face video frame sequence. If the target video has more than 5000 frames, mode 1 is adopted, and the complete synthesized-face video frame sequence is obtained in the same way. Finally, the video frames and the audio are combined into a complete video, generating a speech video which, on the basis of the target video, contains the source-audio expression and head pose, with a natural synthesized face and a clear background.
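Writing the synthesized frames to a video file and attaching the source audio track could be done as in the following sketch, using OpenCV's VideoWriter and an external ffmpeg call; the codec, frame rate and file names are illustrative assumptions, not values specified by the patent.

    import cv2
    import subprocess

    def write_video(frames, out_path="synth_silent.mp4", fps=25):
        h, w = frames[0].shape[:2]
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        for f in frames:
            writer.write(f)
        writer.release()
        return out_path

    def mux_audio(video_path, audio_path, out_path="synth_final.mp4"):
        # Combine the silent frame sequence with the source audio track.
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
                        "-c:v", "copy", "-c:a", "aac", "-shortest", out_path], check=True)
        return out_path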
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization is characterized by comprising the following steps:
optimizing and fitting each parameter of the three-dimensional face deformation model to the input face image by adopting a convolutional neural network to realize the parameterized reconstruction of the face model;
training a speech-to-expression and head pose mapping network H using the target video and the parameters of the face model, and using the trained mapping network H to obtain facial expression and head pose parameters from the input audio, wherein the target video comprises video frames and the audio corresponding to the video frames, and the video frames contain face images;
replacing parameters in the parameterized face image according to the acquired facial expression and head posture parameters, synthesizing to obtain a face image of each frame, rendering, and generating a vivid face video frame;
training a rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames, wherein the rendering network is used to generate a background for each frame's face image;
and performing face background rendering and video synthesis based on video key frame optimization to obtain high-quality synthesized face video frames that contain the source-audio facial expression and head pose parameters and have a clear portrait and background, and synthesizing the complete speech video of the synthesized face in video-frame order.
2. The method for generating a composite video based on three-dimensional face reconstruction and video keyframe optimization according to claim 1, wherein the method for optimizing and fitting each parameter of the three-dimensional face deformation model to the input face image by using the convolutional neural network to realize the parameterized reconstruction of the face model comprises the following steps:
recognizing the face in the face image, marking it with 68 landmark points, representing the face with the three-dimensional face deformation model, and parameterizing it as a triangular mesh model consisting of 35709 vertices;
converting a two-dimensional face image I into a three-dimensional face parameterized model X, expressed as:

X = (α, β, δ, γ, P)

where α is the shape parameter of the three-dimensional face model, δ is the texture parameter, β is the expression parameter, γ is the illumination parameter, and P is the head pose parameter, represented by the rotation parameter R and the translation parameter T derived from the camera model; the shape of any face picture is represented by the three-dimensional face shape parameterized model as:

S(α, β) = S̄ + B_shape α + B_exp β

where B_shape is the face shape basis vector and B_exp is the facial expression basis vector;
the texture of the face is represented as:

T(δ) = T̄ + B_tex δ

where B_tex is the face texture basis vector, and S̄ and T̄ are respectively the average shape and average texture of the face model;
the illumination model of the face is represented as:

C(n_i, t_i | γ) = t_i · Σ_{b=1}^{B²} γ_b Φ_b(n_i)

where γ is the illumination parameter of the face model, n_i is the normal vector of any vertex v_i of the face model, t_i is the texture value of vertex v_i, the irradiance of vertex v_i is denoted C(n_i, t_i | γ), Φ_b are the spherical harmonic basis functions, and γ_b are the spherical harmonic coefficients;

therefore, the reconstruction process of the three-dimensional face model can be expressed as an optimization solution for the face model parameters, and the training process of the three-dimensional face model based on the convolutional neural network can be expressed as the optimization problem of the following formulas (1) and (2):

L_coef(x) = ω_α ‖α‖² + ω_β ‖β‖² + ω_δ ‖δ‖²    (1)

L_tex(x) = Σ_{c∈{r,g,b}} var(T_c(R(x)))    (2)

where L_coef and L_tex are the regularized optimization functions of the coefficients of the three-dimensional face model, ω_α, ω_β and ω_δ are the weights corresponding to the face shape coefficient, expression coefficient and texture coefficient respectively, c ∈ {r, g, b} indicates that the picture is an RGB picture, T_c denotes the face texture vector of channel c, var() denotes the variance, and R(x) denotes the skin area of the face containing the cheeks, nose and forehead.
3. The method of claim 2, wherein the speech-to-expression and head pose mapping network H is trained using the target video and the parameters of the face model, specifically:

extracting the audio in the target video and converting it into Mel-frequency cepstral coefficients; inputting the converted Mel-frequency cepstral coefficients into a pre-trained audio high-level feature extraction network to obtain the high-level feature F_t; then using F_t, together with the β_t and P_t generated by fitting the three-dimensional face deformation model to the input face images with the convolutional neural network, as the training data set to train the speech-to-expression and head pose mapping network H; the trained mapping network H extracts two face estimation parameters from the audio, corresponding respectively to the expression coefficients β = {β^(1), ..., β^(64)} and the head pose coefficients P = {P^(1), ..., P^(6)} of the three-dimensional face deformation model; the training process of the mapping network H can be regarded as the optimization of the mean square error loss of the expression parameters L_exp and the mean square error loss of the head pose parameters L_pose, as shown in formulas (3) and (4):

L_exp = MSE(H(F_t)_β, β_t)    (3)

L_pose = MSE(H(F_t)_P, P_t)    (4)

where MSE() denotes the mean square error function, F_t denotes the high-level feature input to the network at time t, β_t is the expression parameter of the target video at time t, and P_t is the head pose parameter of the target video at time t.
4. The method for generating a composite video based on three-dimensional face reconstruction and video keyframe optimization according to claim 3, wherein the pre-trained audio high-level feature extraction network is specifically as follows:
the audio high-level feature extraction network uses an AT-net network as its backbone and is trained on The Oxford-BBC Lip Reading in the Wild dataset.
5. The method for generating a composite video based on three-dimensional face reconstruction and video keyframe optimization according to claim 4, wherein the method for obtaining the parameterized face image specifically comprises the following steps:
extracting the target video into video frames and cropping the whole face from each frame to obtain the face images I^(1), I^(2), ..., I^(n); after the n face images are optimized by the convolutional neural network, for any image I^(k), k ∈ n, the face image represented by the parameterized model and the corresponding shape parameter α_video, expression parameter β_video, texture parameter δ_video, illumination parameter γ_video and head pose parameter P_video are obtained.
6. The method for generating a composite video based on three-dimensional face reconstruction and video keyframe optimization according to claim 5, wherein the parameters in the parameterized face images are replaced according to the acquired facial expression and head pose parameters, specifically:
the expression coefficients β_audio and head pose coefficients P_audio obtained from the audio replace the corresponding face parameters in each video frame in time-frame order, synthesizing the new face X = {α_video, β_audio, δ_video, γ_video, P_audio}.
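This replacement step amounts to swapping the expression and head-pose coefficients of each fitted frame while keeping identity, texture and illumination. A small sketch follows; the container type and names are hypothetical.

```python
from dataclasses import dataclass, replace
import numpy as np

@dataclass
class FaceParams:
    alpha: np.ndarray   # shape coefficients (identity), kept from the video
    beta:  np.ndarray   # expression coefficients (64,)
    delta: np.ndarray   # texture coefficients, kept from the video
    gamma: np.ndarray   # illumination coefficients, kept from the video
    pose:  np.ndarray   # head pose coefficients (6,)

def drive_with_audio(video_params, beta_audio, pose_audio):
    """Replace expression and head pose of each fitted video frame with the
    audio-predicted coefficients, forming X = {alpha_video, beta_audio,
    delta_video, gamma_video, P_audio} frame by frame."""
    return [replace(p, beta=b, pose=q)
            for p, b, q in zip(video_params, beta_audio, pose_audio)]
```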
7. The method for generating a composite video based on three-dimensional face reconstruction and video key frame optimization according to claim 6, wherein the synthesized face image of each frame is rendered to generate a realistic face video frame, specifically:
a 3D mesh renderer is used so that each mesh transitions uniformly and smoothly, thereby obtaining vivid and natural face images, and the further-rendered face images are stored as the corresponding video frames.
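The claim only specifies "a 3D mesh renderer" without naming one. The sketch below assumes PyTorch3D as one possible choice, with camera, lighting and head-pose setup reduced to defaults for brevity; it is an illustration, not the patent's renderer.

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRenderer, MeshRasterizer,
    SoftPhongShader, PointLights, TexturesVertex,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def render_face(verts, faces, vert_colors, image_size=256):
    """Rasterize and shade a reconstructed face mesh into an RGBA image.
    verts: (V, 3) float tensor, faces: (F, 3) long tensor, vert_colors: (V, 3)."""
    mesh = Meshes(verts=[verts], faces=[faces],
                  textures=TexturesVertex(verts_features=[vert_colors]))
    cameras = FoVPerspectiveCameras(device=device)   # default camera; pose omitted
    raster_settings = RasterizationSettings(image_size=image_size,
                                            blur_radius=0.0, faces_per_pixel=1)
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
        shader=SoftPhongShader(device=device, cameras=cameras,
                               lights=PointLights(device=device)),
    )
    return renderer(mesh.to(device))   # (1, H, W, 4) rendered face frame
```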
8. The method for generating a composite video based on three-dimensional face reconstruction and video keyframe optimization according to claim 7, wherein the rendering network based on a generative adversarial network is trained using the parameterized face images and the face images in the video frames, specifically:
the parameterized face image I^(k) obtained from the target video frame and the original video frame corresponding to the same time are combined into sequence data pairs and input into a rendering network consisting of a generator G and a discriminator D for pre-training; the pre-training process is the overall optimization process of the generator and discriminator of the generative adversarial network and can be represented by formula (5):
G* = arg min_G max_D L(G, D)    (5)
where G* represents the overall optimization objective of the rendering network;
the complete training objective includes the reconstruction loss L_rec and the adversarial loss L_adv, and training is represented as formula (6):
L(G, D) = L_adv(G, D) + λ L_rec(G)    (6)
where the weighting parameter λ = 100; the reconstruction loss L_rec of formula (7) and the adversarial loss L_adv of formula (8) are as follows:
L_rec(G) = E_(x,y)[‖y − G(x)‖_1]    (7)
L_adv(G, D) = E_(x,y)[log D(x, y)] + E_x[log(1 − D(x, G(x)))]    (8)
where x denotes the input parameterized face image, y denotes the corresponding real video frame, and G(x) denotes the frame generated by the rendering network; the rendering network subjected to this optimization training can generate the background for the face image of each frame.
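A minimal training-step sketch consistent with formulas (5)-(8): a conditional GAN whose generator loss combines the adversarial term with an L1 reconstruction term weighted by λ = 100. The generator and discriminator architectures below are placeholders, since the claim does not specify them.

```python
import torch
import torch.nn as nn

# Placeholder generator and discriminator; the patent does not specify architectures.
G = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())
D = nn.Sequential(nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(64, 1, 4, stride=2, padding=1))   # patch-style critic

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
lam = 100.0   # weighting parameter lambda from formula (6)

def gan_train_step(x, y):
    """x: parameterized face frame, y: real video frame, both (N, 3, H, W)."""
    # Discriminator: real pair (x, y) vs. generated pair (x, G(x)).
    fake = G(x)
    d_real = D(torch.cat([x, y], dim=1))
    d_fake = D(torch.cat([x, fake.detach()], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: adversarial loss + lambda * L1 reconstruction loss.
    d_fake = D(torch.cat([x, fake], dim=1))
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + lam * l1(fake, y)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```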
9. The method for generating a composite video based on three-dimensional face reconstruction and video key frame optimization according to claim 8, wherein face background rendering and video composition are performed based on video keyframe optimization, and when the number of original video frames exceeds a threshold, the method specifically comprises:
calculating the Euclidean distance H between the face image I_t synthesized at time t and the video frame in which only the head region remains after masking the original video frame, and taking the frame for which H attains its minimum value, that is, the matching frame in the original video whose head pose deviation is closest to the synthesized face pose, to form a rendering data pair with the synthesized frame; obtaining the matching frames of all frames in this way and combining them into rendering sequence data pairs;
inputting the obtained rendering sequence data pairs into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the expression coefficients β_audio and head pose coefficients P_audio of the source audio, and finally, according to the order of the video frames, synthesizing, on the basis of the identity of the target person, a complete talking video of the synthesized face containing the source audio, the audio-driven expression and the head pose.
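A brute-force sketch of the matching-frame search described above: for each synthesized face frame, the original frame whose masked head region has the smallest Euclidean distance is selected. Applying the same mask to the synthesized frame is an assumption made here for a like-for-like comparison; all names are illustrative.

```python
import numpy as np

def find_matching_frames(synth_faces, original_frames, head_masks):
    """For each synthesized face frame, pick the original frame whose masked
    head region minimizes the Euclidean distance H (the 'matching frame')."""
    matches = []
    for face in synth_faces:                        # each (H, W, 3) float array
        best_idx, best_dist = None, np.inf
        for idx, (frame, mask) in enumerate(zip(original_frames, head_masks)):
            head_only = frame * mask                # keep only the head region
            dist = np.linalg.norm((face * mask - head_only).ravel())
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        matches.append(best_idx)
    return matches   # index of the matching frame for each synthesized frame
```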
10. The method for generating a composite video based on three-dimensional face reconstruction and video key frame optimization according to claim 8, wherein face background rendering and video composition are performed based on video keyframe optimization, and when the number of original video frames does not exceed the threshold, the method specifically comprises:
selecting the video frames with large head pose offset from the synthesized face video frames as keyframes and arranging them in time order to obtain a keyframe sequence; calculating the Euclidean distance H between the face image I_t synthesized at time t and the video frame in which only the head region remains after masking the original video frame, and taking the frame for which H attains its minimum value as the most closely matching frame, thereby obtaining a matching-frame sequence; then obtaining the video frames between the matching frames through an OpenCV-based linear interpolation algorithm so as to synthesize a complete video frame sequence; combining this video frame sequence and the synthesized-face video frames into rendering sequence data pairs;
inputting the obtained rendering sequence data pairs into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the expression coefficients β_audio and head pose coefficients P_audio of the source audio, and finally, according to the order of the video frames, synthesizing, on the basis of the identity of the target person, a complete talking video of the synthesized face containing the source audio, the audio-driven expression and the head pose.
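The claim refers to an "OpenCV-based linear interpolation algorithm" without further detail. The sketch below shows one simple reading of it, a linear cross-dissolve between consecutive matched frames using cv2.addWeighted; the number of in-between frames is a hypothetical parameter.

```python
import cv2

def interpolate_between_keyframes(matched_frames, steps_between=4):
    """Linearly blend between consecutive matched frames to fill in the
    missing background frames (simple cross-dissolve interpolation)."""
    filled = []
    for a, b in zip(matched_frames[:-1], matched_frames[1:]):
        filled.append(a)
        for i in range(1, steps_between + 1):
            t = i / (steps_between + 1)
            # weighted blend: (1 - t) * a + t * b
            filled.append(cv2.addWeighted(a, 1.0 - t, b, t, 0.0))
    filled.append(matched_frames[-1])
    return filled   # complete video frame sequence between keyframes
```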
CN202110610539.9A 2021-06-01 2021-06-01 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization Pending CN113269872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110610539.9A CN113269872A (en) 2021-06-01 2021-06-01 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization

Publications (1)

Publication Number Publication Date
CN113269872A true CN113269872A (en) 2021-08-17

Family

ID=77233988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110610539.9A Pending CN113269872A (en) 2021-06-01 2021-06-01 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization

Country Status (1)

Country Link
CN (1) CN113269872A (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254336A (en) * 2011-07-14 2011-11-23 清华大学 Method and device for synthesizing face video
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
CN103279970A (en) * 2013-05-10 2013-09-04 中国科学技术大学 Real-time human face animation driving method by voice
CN107067429A (en) * 2017-03-17 2017-08-18 徐迪 Video editing system and method that face three-dimensional reconstruction and face based on deep learning are replaced
CN108230438A (en) * 2017-12-28 2018-06-29 清华大学 The facial reconstruction method and device of sound driver secondary side face image
CN108510437A (en) * 2018-04-04 2018-09-07 科大讯飞股份有限公司 A kind of virtual image generation method, device, equipment and readable storage medium storing program for executing
CN109035394A (en) * 2018-08-22 2018-12-18 广东工业大学 Human face three-dimensional model method for reconstructing, device, equipment, system and mobile terminal
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
CN110599573A (en) * 2019-09-03 2019-12-20 电子科技大学 Method for realizing real-time human face interactive animation based on monocular camera
CN110942502A (en) * 2019-11-29 2020-03-31 中山大学 Voice lip fitting method and system and storage medium
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system
CN111294665A (en) * 2020-02-12 2020-06-16 百度在线网络技术(北京)有限公司 Video generation method and device, electronic equipment and readable storage medium
CN111508064A (en) * 2020-04-14 2020-08-07 北京世纪好未来教育科技有限公司 Expression synthesis method and device based on phoneme driving and computer storage medium
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN112188304A (en) * 2020-09-28 2021-01-05 广州酷狗计算机科技有限公司 Video generation method, device, terminal and storage medium
CN112420014A (en) * 2020-11-17 2021-02-26 平安科技(深圳)有限公司 Virtual face construction method and device, computer equipment and computer readable medium
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics
CN112766160A (en) * 2021-01-20 2021-05-07 西安电子科技大学 Face replacement method based on multi-stage attribute encoder and attention mechanism

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CAO FAXIAN, YANG ZHIJING, et al.: "Convolutional neural network extreme learning machine for effective classification of hyperspectral images", 《JOURNAL OF APPLIED REMOTE SENSING》, vol. 12, no. 3, 5 July 2018 (2018-07-05), pages 1 - 20 *
RAN YI, ZIPENG YE, et al.: "Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose", 《HTTPS://ARXIV.ORG/ABS/2002.10137》, 5 March 2020 (2020-03-05), pages 1 - 12 *
YU DENG, JIAOLONG YANG, et al.: "Accurate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image to Image Set", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS》, 9 April 2020 (2020-04-09), pages 285 - 295 *
LI XINYI et al.: "A survey of speech-driven facial animation" (in Chinese), 《COMPUTER ENGINEERING AND APPLICATIONS》, no. 22, 15 November 2017 (2017-11-15), pages 26 - 33 *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11574655B2 (en) 2021-05-26 2023-02-07 Flawless Holdings Limited Modification of objects in film
GB2607140A (en) * 2021-05-26 2022-11-30 Flawless Holdings Ltd Modification of objects in film
US11699464B2 (en) 2021-05-26 2023-07-11 Flawless Holdings Limited Modification of objects in film
GB2607140B (en) * 2021-05-26 2024-04-10 Flawless Holdings Ltd Modification of objects in film
US11715495B2 (en) 2021-05-26 2023-08-01 Flawless Holdings Limited Modification of objects in film
CN113822969A (en) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 Method, device and server for training neural radiance field model and face generation
CN113822969B (en) * 2021-09-15 2023-06-09 宿迁硅基智能科技有限公司 Training neural radiance field model, face generation method, device and server
CN113838173A (en) * 2021-09-23 2021-12-24 厦门大学 Virtual human head motion synthesis method driven by voice and background sound
CN113838173B (en) * 2021-09-23 2023-08-22 厦门大学 Virtual human head motion synthesis method driven by combination of voice and background sound
CN114419702B (en) * 2021-12-31 2023-12-01 南京硅基智能科技有限公司 Digital person generation model, training method of model, and digital person generation method
CN114419702A (en) * 2021-12-31 2022-04-29 南京硅基智能科技有限公司 Digital human generation model, training method of model, and digital human generation method
CN114049678A (en) * 2022-01-11 2022-02-15 之江实验室 Facial motion capturing method and system based on deep learning
CN114332136A (en) * 2022-03-15 2022-04-12 南京甄视智能科技有限公司 Face attribute data labeling method, computer equipment and storage medium
CN114782864B (en) * 2022-04-08 2023-07-21 马上消费金融股份有限公司 Information processing method, device, computer equipment and storage medium
CN114898244B (en) * 2022-04-08 2023-07-21 马上消费金融股份有限公司 Information processing method, device, computer equipment and storage medium
CN114782864A (en) * 2022-04-08 2022-07-22 马上消费金融股份有限公司 Information processing method and device, computer equipment and storage medium
CN114821404A (en) * 2022-04-08 2022-07-29 马上消费金融股份有限公司 Information processing method and device, computer equipment and storage medium
CN114898244A (en) * 2022-04-08 2022-08-12 马上消费金融股份有限公司 Information processing method and device, computer equipment and storage medium
CN114821404B (en) * 2022-04-08 2023-07-25 马上消费金融股份有限公司 Information processing method, device, computer equipment and storage medium
CN114648613A (en) * 2022-05-18 2022-06-21 杭州像衍科技有限公司 Three-dimensional head model reconstruction method and device based on deformable neural radiance field
CN114648613B (en) * 2022-05-18 2022-08-23 杭州像衍科技有限公司 Three-dimensional head model reconstruction method and device based on deformable neural radiance field
CN115294622A (en) * 2022-06-15 2022-11-04 北京邮电大学 Method, system and storage medium for synthesizing and enhancing voice-driven speaker head motion video
WO2024055379A1 (en) * 2022-09-16 2024-03-21 粤港澳大湾区数字经济研究院(福田) Video processing method and system based on character avatar model, and related device
WO2024078243A1 (en) * 2022-10-13 2024-04-18 腾讯科技(深圳)有限公司 Training method and apparatus for video generation model, and storage medium and computer device
US11830159B1 (en) 2022-12-08 2023-11-28 Flawless Holding Limited Generative films
CN115984943A (en) * 2023-01-16 2023-04-18 支付宝(杭州)信息技术有限公司 Facial expression capturing and model training method, device, equipment, medium and product
CN115984943B (en) * 2023-01-16 2024-05-14 支付宝(杭州)信息技术有限公司 Facial expression capturing and model training method, device, equipment, medium and product
CN115909015A (en) * 2023-02-15 2023-04-04 苏州浪潮智能科技有限公司 Construction method and device of deformable neural radiance field network
WO2024169314A1 (en) * 2023-02-15 2024-08-22 苏州元脑智能科技有限公司 Method and apparatus for constructing deformable neural radiance field network
CN116091668B (en) * 2023-04-10 2023-07-21 广东工业大学 Talking head video generation method based on emotion feature guidance
CN116091668A (en) * 2023-04-10 2023-05-09 广东工业大学 Talking head video generation method based on emotion feature guidance
CN116152447B (en) * 2023-04-21 2023-09-26 科大讯飞股份有限公司 Face modeling method and device, electronic equipment and storage medium
CN116152447A (en) * 2023-04-21 2023-05-23 科大讯飞股份有限公司 Face modeling method and device, electronic equipment and storage medium
CN116342760A (en) * 2023-05-25 2023-06-27 南昌航空大学 Three-dimensional facial animation synthesis method, system, electronic equipment and storage medium
CN116721194A (en) * 2023-08-09 2023-09-08 瀚博半导体(上海)有限公司 Face rendering method and device based on generation model
CN116721194B (en) * 2023-08-09 2023-10-24 瀚博半导体(上海)有限公司 Face rendering method and device based on generation model
CN117392292A (en) * 2023-10-20 2024-01-12 联通在线信息科技有限公司 3D digital person generation method and system
CN117392292B (en) * 2023-10-20 2024-04-30 联通在线信息科技有限公司 3D digital person generation method and system
CN117292030A (en) * 2023-10-27 2023-12-26 海看网络科技(山东)股份有限公司 Method and system for generating three-dimensional digital human animation
CN117593442A (en) * 2023-11-28 2024-02-23 拓元(广州)智慧科技有限公司 Portrait generation method based on multi-stage fine grain rendering
CN117593442B (en) * 2023-11-28 2024-05-03 拓元(广州)智慧科技有限公司 Portrait generation method based on multi-stage fine grain rendering

Similar Documents

Publication Publication Date Title
CN113269872A (en) Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
Guo et al. Ad-nerf: Audio driven neural radiance fields for talking head synthesis
CN112887698B (en) High-quality face voice driving method based on neural radiance field
CN101055647B (en) Method and device for processing image
CN101324961B (en) Human face portion three-dimensional picture pasting method in computer virtual world
CN117496072B (en) Three-dimensional digital person generation and interaction method and system
CN113362422B (en) Shadow robust makeup transfer system and method based on decoupling representation
KR102353556B1 (en) Apparatus for Generating Facial expressions and Poses Reappearance Avatar based in User Face
CN115209180A (en) Video generation method and device
CN115914505B (en) Video generation method and system based on voice-driven digital human model
JP2009104570A (en) Data structure for image formation and method of forming image
CN111640172A (en) Attitude migration method based on generation of countermeasure network
CN115239857B (en) Image generation method and electronic device
CN116721190A (en) Voice-driven three-dimensional face animation generation method
CN116385606A (en) Speech signal driven personalized three-dimensional face animation generation method and application thereof
CN112862672B (en) Liu-bang generation method, device, computer equipment and storage medium
CN113947520A (en) Method for realizing face makeup conversion based on generation of confrontation network
Tang et al. Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar
Nguyen-Phuoc et al. Alteredavatar: Stylizing dynamic 3d avatars with fast style adaptation
Sun et al. SSAT $++ $: A Semantic-Aware and Versatile Makeup Transfer Network With Local Color Consistency Constraint
CN117671090A (en) Expression processing method and device, electronic equipment and storage medium
Wang et al. Uncouple generative adversarial networks for transferring stylized portraits to realistic faces
WO2024055379A1 (en) Video processing method and system based on character avatar model, and related device
CN116402928B (en) Virtual talking digital person generating method
Kumar et al. Multi modal adaptive normalization for audio to video generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210817