CN113269872A - Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization - Google Patents
- Publication number: CN113269872A
- Application number: CN202110610539.9A
- Authority: CN (China)
- Prior art keywords: face, video, frame, network, audio
- Prior art date: 2021-06-01
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T17/00 — Three-dimensional [3D] modelling, e.g. data description of 3D objects (G: Physics; G06: Computing; G06T: Image data processing or generation, in general)
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting (G06F: Electric digital data processing; G06F18/00: Pattern recognition; G06F18/21: Design or setup of recognition systems or techniques)
- G06N3/045 — Combinations of networks (G06N: Computing arrangements based on specific computational models; G06N3/02: Neural networks; G06N3/04: Architecture, e.g. interconnection topology)
- G06N3/08 — Learning methods (G06N3/02: Neural networks)
- G06T15/005 — General purpose rendering architectures (G06T15/00: 3D [three-dimensional] image rendering)
- G06V40/161 — Detection; localisation; normalisation (G06V: Image or video recognition or understanding; G06V40/16: Human faces, e.g. facial parts, sketches or expressions)
- G06V40/168 — Feature extraction; face representation (G06V40/16: Human faces, e.g. facial parts, sketches or expressions)
- G06T2200/04 — Indexing scheme for image data processing or generation, in general, involving 3D image data
Abstract
The invention discloses a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization, which comprises the following steps: fitting the parameters of a three-dimensional face morphable model to the input face image with a convolutional neural network; training a speech-to-expression and head pose mapping network using the target video and the face model parameters, and using the trained mapping network to acquire facial expression and head pose parameters from the input audio; synthesizing a face and rendering it to generate realistic face video frames; training a rendering network based on a generative adversarial network with the parameterized face images and the face images in the video frames, the rendering network generating a background for each frame's face image; and performing face background rendering and video synthesis based on video key frame optimization. The background of each frame of the synthesized face video output by the invention transitions naturally and realistically, which greatly enhances the usability and practicality of the synthesized face video.
Description
Technical Field
The invention relates to the field of three-dimensional face reconstruction and face synthesis and transfer in deep learning, and in particular to a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization.
Background
With the rising standard of living in China, the popularization of mobile smart terminals, and the rapid development of mobile internet technology, video has become an indispensable part of people's daily life, study, entertainment, and work; compared with traditional text-and-image formats, video combines hearing and vision, and its production threshold keeps falling. Most current applications of video synthesis are still in entertainment, such as the face-swapping camera in Meitu, AR avatar creation on the iPhone, and apps such as istap Faces. Most of these applications essentially use deep-learning neural networks to detect, locate, and segment the face in an image and then exchange a source face and a target face. These functions require neural networks trained on large amounts of face data, offer poor controllability, and have difficulty decoupling the individual attributes of the face.
Audio-driven synthesis of face talking videos is currently a key problem for realizing virtual anchors and intelligent face talking-video synthesis. Its key effect is that, given only the source audio and a video of the target person, it can generate a talking video with a realistic face and natural transitions between video frames. Traditional recording of a person's talking video requires significant labor and time costs and necessarily requires the target person to participate in the recording. The present method therefore uses three-dimensional face reconstruction together with a rendering network under video background-frame optimization to generate realistic synthesized face images, and thus realistic synthesized face videos, which addresses a very practical problem for virtual anchors, presenter-led program recording, online course recording, and the like.
Neural networks are currently applied to face video synthesis in deep learning. Based on three-dimensional face reconstruction and a rendering network under video key frame optimization, features such as expression, head pose, shape, and texture can be extracted from a face model; the expression and head pose features are extracted from the source audio and substituted into the face model of the target person, and the required realistic synthesized face frames are generated through the optimized rendering of video key frames, achieving face synthesis. In recent years a large number of researchers have devoted themselves to scientific research in the field of face synthesis. However, since neural-network-based face generation requires a large amount of training data, collecting data for the network is a great challenge. Moreover, owing to the quality of the input data and the inherent instability of generative models, the pictures and videos synthesized in this way may have low image quality and cannot support large-scale head pose control. The face synthesis approach has always been a difficulty and a hotspot in research on face video synthesis. In the patent "Training method for generative adversarial network, image face-swapping method and video face-swapping device" (application number 202010592443.X) of Peninsula (Beijing) Information Technology Co., Ltd., a generator and a discriminator based on a generative adversarial network are proposed; the adversarial network is trained with massive paired data, the attribute feature map of the person in the target image is extracted, and the generated mixed feature map is decoded to obtain the synthesized face. Although this method can preserve the attribute features of the original image and the identity features of the target image, it is not stable at producing a realistic synthesized face, cannot obtain a synthesized face talking video given only a voice and a video, and tends to generate synthesized face videos with blurred face backgrounds, low frame quality, and unnatural appearance.
Disclosure of Invention
The invention provides a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization which makes the background transition of each frame of the output synthesized face video natural and realistic.
In order to solve the above technical problems, the technical solution of the invention is as follows:
a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization comprises the following steps:
optimizing and fitting each parameter of the three-dimensional face deformation model to the input face image by adopting a convolutional neural network to realize the parameterized reconstruction of the face model;
training a speech-to-expression and head pose mapping network using parameters of a target video and a face modelMapping network using trained speech to expressions and head gesturesAcquiring facial expression and head posture parameters from input audio, wherein the target video comprises a video frame and audio corresponding to the video frame, and the video frame comprises a face image;
replacing parameters in the parameterized face image according to the acquired facial expression and head posture parameters, synthesizing to obtain a face image of each frame, rendering, and generating a vivid face video frame;
training a rendering network based on a generated confrontation network by using the parameterized face images and the face images in the video frames, wherein the rendering network is used for generating a background for the face image of each frame;
and performing face background rendering and video synthesis based on video key frame optimization to obtain a high-quality synthesized face video frame which contains source audio facial expression and head posture parameters and is used for synthesizing a synthesized face with clear portrait and background, and synthesizing a complete speech video of the synthesized face according to the sequence of the video frames.
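To make the overall flow concrete, the following is a minimal Python sketch of how the five stages above could be chained; every function name in it (extract_face_frames, fit_3dmm, mapping_network, and so on) is a hypothetical placeholder for the corresponding network or step described in this document, not an actual implementation or library call.

```python
# Hypothetical end-to-end pipeline sketch; each function below stands in
# for one of the five stages described above.

def generate_synthetic_video(source_audio, target_video, frame_threshold=5000):
    frames = extract_face_frames(target_video)           # face crops per frame
    # Stage 1: fit 3DMM parameters (alpha, beta, delta, gamma, P) per frame
    params = [fit_3dmm(f) for f in frames]
    # Stage 2: map audio features to expression / head-pose sequences
    mfcc = audio_to_mfcc(source_audio)
    beta_audio, pose_audio = mapping_network(mfcc)
    # Stage 3: swap in audio-driven expression and pose, render each face
    synth_faces = [
        render_mesh(replace_params(p, beta, pose))
        for p, beta, pose in zip(params, beta_audio, pose_audio)
    ]
    # Stages 4-5: background rendering with key frame optimization
    if len(frames) > frame_threshold:
        pairs = match_frames(synth_faces, frames)            # mode 1
    else:
        pairs = interpolate_keyframes(synth_faces, frames)   # mode 2
    rendered = [rendering_network(face, bg) for face, bg in pairs]
    return mux_video(rendered, source_audio)
```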
Preferably, fitting each parameter of the three-dimensional face morphable model to the input face image with a convolutional neural network to realize the parameterized reconstruction of the face model specifically comprises:

detecting the face in the face image, marking it with 68 landmark points, representing the face with the three-dimensional face morphable model, and parameterizing it as a triangular mesh model consisting of 35709 vertices;

converting the two-dimensional face image I into a three-dimensional parametric face model X, expressed as:

$$X = (\alpha, \beta, \delta, \gamma, P)$$

where $\alpha$ is the shape parameter of the three-dimensional face model, $\delta$ is the texture parameter, $\beta$ is the expression parameter, $\gamma$ is the illumination parameter, and the head pose parameter $P$ consists of the rotation parameter $R$ and the translation parameter $t$ derived from the camera model; the shape of any face picture can be represented with the three-dimensional face shape parameterized model as:

$$S = \bar{S} + B_{shape}\,\alpha + B_{exp}\,\beta$$

where $B_{shape}$ is the face shape basis and $B_{exp}$ is the facial expression basis;

the texture of the face is represented as:

$$T = \bar{T} + B_{tex}\,\delta$$

where $B_{tex}$ is the face texture basis, and $\bar{S}$ and $\bar{T}$ are the average shape and average texture of the face model, respectively;

the illumination model of the face is represented as:

$$C(n_i, t_i \mid \gamma) = t_i \cdot \sum_{b=1}^{B^2} \gamma_b\, \Phi_b(n_i)$$

where $\gamma$ is the illumination parameter of the face model, $n_i$ is the normal vector of any vertex $v_i$ of the face model, $t_i$ is the texture parameter of vertex $v_i$, $C(n_i, t_i \mid \gamma)$ is the irradiance of vertex $v_i$, $\Phi_b$ is a spherical harmonic basis function, $\gamma_b$ is a spherical harmonic coefficient, and $B = 3$;

therefore the reconstruction of the three-dimensional face model can be expressed as an optimized solution of the face model parameters, and the training of the convolutional-neural-network-based three-dimensional face model can be expressed as the optimization problem of equations (1) and (2):

$$\mathcal{L}_{coef}(X) = \omega_{\alpha} \lVert \alpha \rVert^{2} + \omega_{\beta} \lVert \beta \rVert^{2} + \omega_{\delta} \lVert \delta \rVert^{2} \tag{1}$$

$$\mathcal{L}_{tex}(X) = \sum_{c \in \{r,g,b\}} \operatorname{var}\big(T_{c}(R(X))\big) \tag{2}$$

where $\mathcal{L}_{coef}$ is the regularization function over the coefficients of the three-dimensional face model, $\omega_{\alpha}$, $\omega_{\beta}$, $\omega_{\delta}$ are the weights of the face shape, expression, and texture coefficients respectively, $c \in \{r, g, b\}$ indicates that the picture is an RGB picture, $T_{c}$ is the face texture vector of channel $c$, $\operatorname{var}(\cdot)$ is the variance, and $R(X)$ is the skin region of the face containing the cheeks, nose, and forehead.
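As an illustration of the parameterized model above, here is a short NumPy sketch of the linear shape and texture model and the spherical-harmonics shading; the basis matrices correspond to the $B_{shape}$, $B_{exp}$, $B_{tex}$ defined above, while the coefficient dimensions, the per-channel treatment of the shading, and the illustrative regularization weights are assumptions rather than values taken from this patent.

```python
import numpy as np

N_VERTS = 35709  # triangular mesh vertices, as stated above

def reconstruct_face(alpha, beta, delta, S_mean, B_shape, B_exp, T_mean, B_tex):
    """Linear 3DMM: S = S_mean + B_shape@alpha + B_exp@beta,
    T = T_mean + B_tex@delta (flattened per-vertex xyz / rgb)."""
    S = S_mean + B_shape @ alpha + B_exp @ beta   # (3*N_VERTS,)
    T = T_mean + B_tex @ delta                    # (3*N_VERTS,)
    return S.reshape(N_VERTS, 3), T.reshape(N_VERTS, 3)

def sh_irradiance(normals, texture, gamma, sh_basis):
    """Per-vertex shading C(n_i, t_i | gamma) = t_i * sum_b gamma_b * Phi_b(n_i).
    sh_basis(normals) is assumed to return the B^2 = 9 spherical-harmonic
    basis values Phi_b per vertex normal (B = 3 bands, one color channel)."""
    phi = sh_basis(normals)                # (N_VERTS, 9)
    shading = phi @ gamma                  # (N_VERTS,)
    return texture * shading[:, None]      # modulate per-vertex albedo

def coef_regularization(alpha, beta, delta, w_a=1.0, w_b=0.8, w_d=1.7e-3):
    # Equation (1); these weight values are illustrative, not from the patent.
    return w_a * np.sum(alpha**2) + w_b * np.sum(beta**2) + w_d * np.sum(delta**2)
```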
Preferably, training the speech-to-expression and head pose mapping network H using the target video and the face model parameters specifically comprises:

extracting the audio from the target video, converting it to Mel-frequency cepstral coefficients, and feeding the converted coefficients to a pre-trained audio high-level feature extraction network to obtain the high-level feature $F_t$; then using $F_t$ together with the $\beta_t$ and $P_t$ produced by fitting the three-dimensional face morphable model to the input face images with the convolutional neural network as the training dataset $\{F_t, \beta_t, P_t\}$ to train the speech-to-expression and head pose mapping network H; the trained network H extracts two face estimation parameters from the audio, corresponding respectively to the expression coefficients $\beta = \{\beta^{(1)}, \ldots, \beta^{(64)}\}$ and the head pose coefficients $P = \{P^{(1)}, \ldots, P^{(6)}\}$ of the three-dimensional face morphable model; the training of the mapping network H can be regarded as the optimization of the mean square error loss of the expression parameters $\mathcal{L}_{exp}$ and the mean square error loss of the head pose parameters $\mathcal{L}_{pose}$, as shown in equations (3) and (4):

$$\mathcal{L}_{exp} = \operatorname{MSE}\big(H_{exp}(F_t), \beta_t\big) \tag{3}$$

$$\mathcal{L}_{pose} = \operatorname{MSE}\big(H_{pose}(F_t), P_t\big) \tag{4}$$

where $\operatorname{MSE}(\cdot)$ is the mean square error function, $F_t$ is the high-level feature input to the network at time t, $\beta_t$ is the expression parameter of the target video at time t, and $P_t$ is the head pose parameter of the target video at time t.
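A minimal PyTorch sketch of the mapping network H with its two regression heads and the MSE objectives of equations (3) and (4); the use of a simple MLP over $F_t$ and the hidden sizes are assumptions, since the patent does not specify the architecture of H.

```python
import torch
import torch.nn as nn

class SpeechToFaceParams(nn.Module):
    """Maps the audio high-level feature F_t to 64 expression coefficients
    and 6 head-pose coefficients, per equations (3) and (4)."""
    def __init__(self, feat_dim=256, hidden=512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.exp_head = nn.Linear(hidden, 64)   # beta^(1..64)
        self.pose_head = nn.Linear(hidden, 6)   # P^(1..6): rotation + translation

    def forward(self, f_t):
        h = self.trunk(f_t)
        return self.exp_head(h), self.pose_head(h)

def mapping_loss(model, f_t, beta_t, p_t):
    """Sum of the expression and head-pose MSE losses, eqs. (3) + (4)."""
    beta_pred, p_pred = model(f_t)
    mse = nn.functional.mse_loss
    return mse(beta_pred, beta_t) + mse(p_pred, p_t)
```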
Preferably, the pre-trained audio high-level feature extraction network is specifically:

the audio high-level feature extraction network uses the AT-net network as its backbone and is trained on The Oxford-BBC Lip Reading in the Wild dataset.
Preferably, acquiring the parameterized face images specifically comprises the following steps:

extracting the target video into video frames and cropping the whole face from each frame to obtain the face images $I^{(1)}, I^{(2)}, \ldots, I^{(n)}$; after these n face images are fitted by the convolutional neural network, any picture $I^{(k)}$, $k \in \{1, \ldots, n\}$, yields a face image represented by the parameterized model together with the corresponding shape parameter $\alpha_{video}$, expression parameter $\beta_{video}$, texture parameter $\delta_{video}$, illumination parameter $\gamma_{video}$, and head pose parameter $P_{video}$.
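A sketch of the frame extraction and whole-face cropping step using OpenCV; the Haar-cascade detector and the 256 × 256 crop size (the size quoted in the embodiment below) are stand-ins, since the patent does not name a specific face detector.

```python
import cv2

def extract_face_crops(video_path, size=256):
    """Split the target video into frames and crop the whole face from each,
    yielding the face images I^(1), ..., I^(n)."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    crops = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, 1.1, 5)
        if len(faces) == 0:
            continue
        x, y, w, h = max(faces, key=lambda r: r[2] * r[3])  # largest face
        crops.append(cv2.resize(frame[y:y + h, x:x + w], (size, size)))
    cap.release()
    return crops
```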
Preferably, the parameters in the parameterized face images are replaced with the acquired facial expression and head pose parameters, specifically:

the expression coefficients $\beta_{audio}$ and head pose coefficients $P_{audio}$ obtained from the audio replace the corresponding face parameters in each video frame in time-frame order, synthesizing a new face $X = \{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$.
Preferably, synthesizing the face image of each frame and rendering it to generate realistic face video frames is specifically:

using a 3D mesh renderer so that every mesh transitions uniformly and smoothly, obtaining realistic and natural face images, and saving the further-rendered face images as the corresponding video frames.
Preferably, the rendering network based on a generative adversarial network is trained using the parameterized face images and the face images in the video frames, specifically:

the parameterized face image $I^{(k)}$ obtained from a target video frame and the original video frame $\hat{I}^{(k)}$ of the corresponding time are combined into sequence data pairs $\{(I^{(k)}, \hat{I}^{(k)})\}$ and input to a rendering network consisting of a generator G and a discriminator D for pre-training; the pre-training process is the overall optimization of the generator and discriminator of the generative adversarial network and can be represented by equation (5):

$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}(G, D) \tag{5}$$

where $G^{*}$ is the overall optimization objective of the rendering network;

the complete training objective includes the reconstruction loss $\mathcal{L}_{recon}$ and the adversarial loss $\mathcal{L}_{adv}$, and training is represented as equation (6):

$$\mathcal{L}(G, D) = \mathcal{L}_{adv}(G, D) + \lambda\, \mathcal{L}_{recon}(G) \tag{6}$$

where the weighting parameter $\lambda = 100$; with x the parameterized face image and y the corresponding original frame, the reconstruction loss of equation (7) and the adversarial loss of equation (8) are:

$$\mathcal{L}_{recon}(G) = \mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_{1}\big] \tag{7}$$

$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big] \tag{8}$$

The rendering network obtained by this optimization training can generate a background for each frame's face image.
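The following PyTorch sketch shows one training step for the generator and discriminator under equations (5) to (8), with λ = 100 as stated above; the paired-input (conditional) discriminator is an assumption about the exact architecture that matches the loss form given here.

```python
import torch
import torch.nn.functional as F

LAMBDA = 100  # weighting parameter from equation (6)

def train_step(G, D, opt_g, opt_d, x, y):
    """x: parameterized face image, y: paired original video frame."""
    # Discriminator step: maximize log D(x,y) + log(1 - D(x, G(x))), eq. (8)
    with torch.no_grad():
        fake = G(x)
    d_real, d_fake = D(x, y), D(x, fake)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: adversarial term + lambda * L1 reconstruction, eqs. (6)-(7)
    fake = G(x)
    d_fake = D(x, fake)
    loss_g = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + LAMBDA * F.l1_loss(fake, y))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```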
Preferably, face background rendering and video synthesis are performed based on video key frame optimization; when the number of original video frames exceeds a threshold, the method is specifically:

computing the Euclidean distance H between the face image $I_t$ synthesized at time t and each original video frame after mask processing has left only the head region, and taking the frame $\hat{I}_t$ at which H attains its minimum, that is, the matching frame in the original video whose head pose deviation is closest to the synthesized face pose, as a rendering data pair $(I_t, \hat{I}_t)$; obtaining the matched frame of every frame in this way and combining them into rendering sequence data pairs; inputting the obtained rendering sequence data pairs to the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face $X = \{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$ containing the source audio's expression coefficients $\beta_{audio}$ and head pose coefficients $P_{audio}$; and finally synthesizing, according to the order of the video frames, the complete talking video of the synthesized face that contains the source audio, the audio's expression, and the head pose on the basis of the target person's identity.
Preferably, face background rendering and video synthesis are performed based on video key frame optimization; when the number of original video frames does not exceed the threshold, the method is specifically:

selecting the video frames with large head pose offsets from the synthesized face video frames as key frames and ordering them in time to obtain a key frame sequence; computing the Euclidean distance H between the face image $I_t$ synthesized at time t and each original video frame after mask processing has left only the head region, and taking the frame at which H attains its minimum as the most closely matching frame, thereby obtaining a matched frame sequence; then obtaining the video frames between matched frames through an OpenCV-based linear interpolation algorithm, thereby synthesizing a complete video frame sequence; forming rendering sequence data pairs from this video frame sequence and the synthesized face video frames; inputting the obtained rendering sequence data pairs to the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face $X = \{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$ containing the source audio's expression coefficients $\beta_{audio}$ and head pose coefficients $P_{audio}$; and finally synthesizing, according to the order of the video frames, the complete talking video of the synthesized face that contains the source audio, the audio's expression, and the head pose on the basis of the target person's identity.
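A sketch of the two building blocks of the key frame optimization described above: the minimum-Euclidean-distance matched-frame search and OpenCV-based linear interpolation between matched frames; the use of cv2.addWeighted for blending intermediate frames is an assumption about how the described interpolation could be realized.

```python
import cv2
import numpy as np

def best_match(synth_face, masked_frames):
    """Return the index of the masked original frame (head region only)
    with the minimum Euclidean distance H to the synthesized face I_t."""
    dists = [np.linalg.norm(synth_face.astype(np.float32)
                            - m.astype(np.float32)) for m in masked_frames]
    return int(np.argmin(dists))

def interpolate_between(frame_a, frame_b, n_mid):
    """Linearly blend n_mid in-between frames with cv2.addWeighted."""
    out = []
    for i in range(1, n_mid + 1):
        t = i / (n_mid + 1)
        out.append(cv2.addWeighted(frame_a, 1.0 - t, frame_b, t, 0.0))
    return out
```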
Compared with the prior art, the technical solution of the invention has the following beneficial effects:

Compared with the traditional way of recording a person's talking video, the highlight of the invention is, first, that in certain special scenarios, such as having only A's audio and a previously recorded talking video of A, or only A's audio and a talking video of B, it can correspondingly generate a talking video of A or a talking video of B matched to A's audio, which is plainly impossible with traditional recording; second, the networks can be trained well in advance and stored locally, after which the desired synthetic video can be obtained quickly just by inputting the corresponding audio and video; third, the invention greatly reduces both time and labor costs compared with conventional methods. Compared with existing face video synthesis methods, first, the invention uses the parameterized three-dimensional face model and face synthesis driven by the audio-to-expression and head pose mapping network, so that the synthesized face fully carries the expression and head pose information of the audio; the mouth shape of the person in the generated video therefore matches the audio, and the person in the video is controlled by the expression and head pose parameters in the audio; second, the invention specifically optimizes the person's background in the synthesized video frames, and after rendering by the background rendering network, high-quality synthesized face video frames with a clear portrait and background are obtained.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic flow chart of acquiring facial expression and head pose parameters from input audio.
Fig. 3 is a schematic flow chart of optimized rendering of video key frames when the number of original video frames exceeds a threshold.
Fig. 4 is a schematic flow chart of optimized rendering of video key frames when the number of original video frames does not exceed a threshold.
Fig. 5 is a schematic view of a face synthesis process.
Fig. 6 is a schematic view of a flow of rendering a human face background and synthesizing a video.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
A method for generating a synthetic video based on three-dimensional face reconstruction and video key frame optimization, as shown in FIG. 1, comprises the following steps:

fitting each parameter of the three-dimensional face morphable model to the input face image with a convolutional neural network to realize the parameterized reconstruction of the face model;

training a speech-to-expression and head pose mapping network H using the target video and the face model parameters, and using the trained network H to acquire facial expression and head pose parameters from the input audio, wherein the target video comprises video frames and the audio corresponding to those frames, and the video frames contain face images;

replacing the parameters in the parameterized face images with the acquired facial expression and head pose parameters, synthesizing the face image of each frame, and rendering it to generate realistic face video frames;

training a rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames, the rendering network being used to generate a background for each frame's face image;

and performing face background rendering and video synthesis based on video key frame optimization to obtain high-quality synthesized face video frames that carry the source audio's facial expression and head pose parameters and show a clear portrait and background, and synthesizing the complete talking video of the synthesized face according to the order of the video frames.
This embodiment addresses the fact that, at the present stage, recording videos of people for video anchoring, online course recording, and the like requires a great deal of labor and time. Talking videos of a person are currently produced mainly by the traditional method: a real person must be recorded in front of the camera, which is time-consuming and laborious and places high demands on the on-camera person's dress and appearance. In this embodiment, model parameters for expression and head pose are extracted from the person's speech, the face in each video frame is reconstructed with a three-dimensional face model, and a synthesized face video is generated that carries the target face's identity information and matches the expression and head pose information in the source audio. The method is also specifically optimized for the realism of the synthesized face images and the natural transition of the video background. In some past research, the synthesized faces had odd facial structures and were not realistic enough, or the face was detached from the background and the background was blurred, lacking the naturalness of a real face. In this embodiment, after the face is synthesized, a 3D mesh renderer further renders the facial texture of the synthesized face image, making the texture details of the synthesized face visually richer and more realistic. Meanwhile, key frames are selected from the synthesized face video frames and an interpolation frame-filling algorithm is applied between adjacent key frames, so that the background of every frame of the output synthesized face video transitions naturally and realistically. Using this method can therefore greatly enhance the usability and practicality of the synthesized face video.
Fitting each parameter of the three-dimensional face morphable model to the input face image with a convolutional neural network to realize the parameterized reconstruction of the face model specifically comprises:

detecting the face in the face image, marking it with 68 landmark points, representing the face with the three-dimensional face morphable model, and parameterizing it as a triangular mesh model consisting of 35709 vertices;

converting the two-dimensional face image I into a three-dimensional parametric face model X, expressed as:

$$X = (\alpha, \beta, \delta, \gamma, P)$$

where $\alpha$ is the shape parameter of the three-dimensional face model, $\delta$ is the texture parameter, $\beta$ is the expression parameter, $\gamma$ is the illumination parameter, and the head pose parameter $P$ consists of the rotation parameter $R$ and the translation parameter $t$ derived from the camera model; the shape of any face picture can be represented with the three-dimensional face shape parameterized model as:

$$S = \bar{S} + B_{shape}\,\alpha + B_{exp}\,\beta$$

where $B_{shape}$ is the face shape basis and $B_{exp}$ is the facial expression basis;

the texture of the face is represented as:

$$T = \bar{T} + B_{tex}\,\delta$$

where $B_{tex}$ is the face texture basis, and $\bar{S}$ and $\bar{T}$ are the average shape and average texture of the face model, respectively;

the illumination model of the face is represented as:

$$C(n_i, t_i \mid \gamma) = t_i \cdot \sum_{b=1}^{B^2} \gamma_b\, \Phi_b(n_i)$$

where $\gamma$ is the illumination parameter of the face model, $n_i$ is the normal vector of any vertex $v_i$ of the face model, $t_i$ is the texture parameter of vertex $v_i$, $C(n_i, t_i \mid \gamma)$ is the irradiance of vertex $v_i$, $\Phi_b$ is a spherical harmonic basis function, $\gamma_b$ is a spherical harmonic coefficient, and $B = 3$;

therefore the reconstruction of the three-dimensional face model can be expressed as an optimized solution of the face model parameters, and the training of the convolutional-neural-network-based three-dimensional face model can be expressed as the optimization problem of equations (1) and (2):

$$\mathcal{L}_{coef}(X) = \omega_{\alpha} \lVert \alpha \rVert^{2} + \omega_{\beta} \lVert \beta \rVert^{2} + \omega_{\delta} \lVert \delta \rVert^{2} \tag{1}$$

$$\mathcal{L}_{tex}(X) = \sum_{c \in \{r,g,b\}} \operatorname{var}\big(T_{c}(R(X))\big) \tag{2}$$

where $\mathcal{L}_{coef}$ is the regularization function over the coefficients of the three-dimensional face model, $\omega_{\alpha}$, $\omega_{\beta}$, $\omega_{\delta}$ are the weights of the face shape, expression, and texture coefficients respectively, $c \in \{r, g, b\}$ indicates that the picture is an RGB picture, $T_{c}$ is the face texture vector of channel $c$, $\operatorname{var}(\cdot)$ is the variance, and $R(X)$ is the skin region of the face containing the cheeks, nose, and forehead.
As shown in FIG. 2, training the speech-to-expression and head pose mapping network H using the target video and the face model parameters specifically comprises:

extracting the audio from the target video, converting it to Mel-frequency cepstral coefficients, and feeding the converted coefficients to a pre-trained audio high-level feature extraction network to obtain the high-level feature $F_t$; then using $F_t$ together with the $\beta_t$ and $P_t$ of the face model parameters produced by the convolutional neural network fitting as the training dataset $\{F_t, \beta_t, P_t\}$ to train the speech-to-expression and head pose mapping network H; the trained network H extracts two face estimation parameters from the audio, corresponding respectively to the expression coefficients $\beta = \{\beta^{(1)}, \ldots, \beta^{(64)}\}$ and the head pose coefficients $P = \{P^{(1)}, \ldots, P^{(6)}\}$ of the three-dimensional face morphable model; the training of the mapping network H can be regarded as the optimization of the mean square error loss of the expression parameters $\mathcal{L}_{exp}$ and the mean square error loss of the head pose parameters $\mathcal{L}_{pose}$, as shown in equations (3) and (4):

$$\mathcal{L}_{exp} = \operatorname{MSE}\big(H_{exp}(F_t), \beta_t\big) \tag{3}$$

$$\mathcal{L}_{pose} = \operatorname{MSE}\big(H_{pose}(F_t), P_t\big) \tag{4}$$

where $\operatorname{MSE}(\cdot)$ is the mean square error function, $F_t$ is the high-level feature input to the network at time t, $\beta_t$ is the expression parameter of the target video at time t, and $P_t$ is the head pose parameter of the target video at time t.
The pre-trained audio high-level feature extraction network is specifically:

the audio high-level feature extraction network uses the AT-net network as its backbone and is trained on The Oxford-BBC Lip Reading in the Wild dataset.
As shown in fig. 5 to 6, acquiring the parameterized face images specifically comprises the following steps:

extracting the target video into video frames and cropping the whole face from each frame to obtain the face images $I^{(1)}, I^{(2)}, \ldots, I^{(n)}$; after these n face images are fitted by the convolutional neural network, any picture $I^{(k)}$, $k \in \{1, \ldots, n\}$, yields a face image represented by the parameterized model together with the corresponding shape parameter $\alpha_{video}$, expression parameter $\beta_{video}$, texture parameter $\delta_{video}$, illumination parameter $\gamma_{video}$, and head pose parameter $P_{video}$.
The parameters in the parameterized face images are replaced with the acquired facial expression and head pose parameters, specifically:

the expression coefficients $\beta_{audio}$ and head pose coefficients $P_{audio}$ obtained from the audio replace the corresponding face parameters in each video frame in time-frame order, synthesizing a new face $X = \{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$.
Synthesizing the face image of each frame and rendering it to generate realistic face video frames is specifically:

using a 3D mesh renderer so that every mesh transitions uniformly and smoothly, obtaining realistic and natural face images, and saving the further-rendered face images as the corresponding video frames.
Training the rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames is specifically:

the parameterized face image $I^{(k)}$ obtained from a target video frame and the original video frame $\hat{I}^{(k)}$ of the corresponding time are combined into sequence data pairs $\{(I^{(k)}, \hat{I}^{(k)})\}$ and input to a rendering network consisting of a generator G and a discriminator D for pre-training; the pre-training process is the overall optimization of the generator and discriminator of the generative adversarial network and can be represented by equation (5):

$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}(G, D) \tag{5}$$

where $G^{*}$ is the overall optimization objective of the rendering network;

the complete training objective includes the reconstruction loss $\mathcal{L}_{recon}$ and the adversarial loss $\mathcal{L}_{adv}$, and training is represented as equation (6):

$$\mathcal{L}(G, D) = \mathcal{L}_{adv}(G, D) + \lambda\, \mathcal{L}_{recon}(G) \tag{6}$$

where the weighting parameter $\lambda = 100$; with x the parameterized face image and y the corresponding original frame, the reconstruction loss of equation (7) and the adversarial loss of equation (8) are:

$$\mathcal{L}_{recon}(G) = \mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_{1}\big] \tag{7}$$

$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big] \tag{8}$$

The rendering network obtained by this optimization training can generate a background for each frame's face image.
Face background rendering and video synthesis are performed based on video key frame optimization; as shown in fig. 3, when the number of original video frames exceeds a threshold, the method is specifically:

computing the Euclidean distance H between the face image $I_t$ synthesized at time t and each original video frame after mask processing has left only the head region, and taking the frame $\hat{I}_t$ at which H attains its minimum, that is, the matching frame in the original video whose head pose deviation is closest to the synthesized face pose, as a rendering data pair $(I_t, \hat{I}_t)$; obtaining the matched frame of every frame in this way and combining them into rendering sequence data pairs; inputting the obtained rendering sequence data pairs to the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face $X = \{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$ containing the source audio's expression coefficients $\beta_{audio}$ and head pose coefficients $P_{audio}$; and finally synthesizing, according to the order of the video frames, the complete talking video of the synthesized face that contains the source audio, the audio's expression, and the head pose on the basis of the target person's identity.
Face background rendering and video synthesis are performed based on video key frame optimization; as shown in fig. 4, when the number of original video frames does not exceed the threshold, the method is specifically:

selecting the video frames with large head pose offsets from the synthesized face video frames as key frames and ordering them in time to obtain a key frame sequence; computing the Euclidean distance H between the face image $I_t$ synthesized at time t and each original video frame after mask processing has left only the head region, and taking the frame at which H attains its minimum as the most closely matching frame, thereby obtaining a matched frame sequence; then obtaining the video frames between matched frames through an OpenCV-based linear interpolation algorithm, thereby synthesizing a complete video frame sequence; forming rendering sequence data pairs from this video frame sequence and the synthesized face video frames; inputting the obtained rendering sequence data pairs to the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face $X = \{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$ containing the source audio's expression coefficients $\beta_{audio}$ and head pose coefficients $P_{audio}$; and finally synthesizing, according to the order of the video frames, the complete talking video of the synthesized face that contains the source audio, the audio's expression, and the head pose on the basis of the target person's identity.
In a specific embodiment:
the advanced feature extraction network for extracting The audio advanced feature F _ t is trained on The Oxford-BBC Lip Reading in The Wild (LRW) Dataset by taking an AT-net network as a backbone, and The Dataset contains 1000 pronunciations of up to 500 different words and is spoken by hundreds of different speakers. Our audio-to-expression and head pose mapping network is pre-trained with the data set as the sum P _ t reconstructed from the target video frame and F _ t extracted from the speech of the target video. The three-dimensional Face reconstruction Model based on the convolutional neural network takes a ResNet-50 network as a backbone, represents a Face based on an Expression Basis Model in a Basel Face Model 2009 and a Facewarehouse, further renders a reconstructed synthetic Face by a mesh render in a tensoflow, and is trained on a data set 300 WLP. The face background rendering network based on video key frame optimization is composed of a generator and a discriminator, and a training set is a data pair composed of an original video frame and a synthesized face image reconstructed from the original video frame.
Pre-training the audio-to-expression and head pose mapping network and the convolutional-neural-network-based three-dimensional face reconstruction model: since the LRW and 300W-LP training sets are public, the datasets are downloaded and input to the two networks for pre-training; after the two networks have run their set number of training iterations, learning ends and the model parameters are saved.
Processing input data: first, the input audio is converted into Mel-frequency cepstral coefficients (MFCC) and stored; the audio and video frames are extracted from the input target video, the face in each video frame is detected, cropped, and saved as a face video frame, and each cropped face video frame is a 256 × 256 RGB image denoted $I_t$.
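A sketch of the MFCC conversion step using librosa; the sample rate and the number of coefficients are common defaults, not values specified in the patent.

```python
import librosa

def audio_to_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load the input audio and convert it to Mel-frequency cepstral
    coefficients; sr and n_mfcc are illustrative defaults."""
    signal, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # (time, n_mfcc)
```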
Face synthesis stage: the 256 × 256 RGB face image $I_t$ is input to the pre-trained convolutional-neural-network-based three-dimensional face morphable model to generate the reconstructed three-dimensional face image of each frame together with the corresponding shape parameter $\alpha_{video}$, expression parameter $\beta_{video}$, texture parameter $\delta_{video}$, illumination parameter $\gamma_{video}$, and head pose parameter $P_{video}$ (where the head pose parameter P consists of the rotation parameter R and the translation parameter T among the camera parameters), and these parameters are saved. Then the MFCC are input to the audio-to-expression and head pose mapping network to obtain the expression parameter $\beta_{audio}$ and head pose parameter $P_{audio}$, which are stored and then, in time-frame order, substituted for the expression parameter $\beta_{video}$ and head pose parameter $P_{video}$ reconstructed from the target video. These face parameters $\{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$ are input to the face synthesis model, rendered by the 3D mesh renderer, and the new face $X = \{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$ is output and stored in time-frame order.
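A small sketch of the parameter substitution that produces X = {α_video, β_audio, δ_video, γ_video, P_audio}; the dict-based parameter representation is purely illustrative.

```python
def replace_params(video_params, beta_audio_t, pose_audio_t):
    """Keep the target person's identity (shape, texture, illumination) and
    substitute the audio-derived expression and head pose for this frame."""
    return {
        "alpha": video_params["alpha"],   # shape, from target video
        "beta":  beta_audio_t,            # expression, from source audio
        "delta": video_params["delta"],   # texture, from target video
        "gamma": video_params["gamma"],   # illumination, from target video
        "pose":  pose_audio_t,            # head pose, from source audio
    }
```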
Face background rendering and video synthesis based on video key frame optimization: the image $I_t$ and the synthesized face image corresponding to time t are set as a data pair and input to the rendering network based on the generative adversarial network for training; training runs for 100 epochs, after which the model is saved. The frame count of the target video is then computed. If it is below 5000 frames, mode 2 of the video key frame optimization method is adopted: a key frame sequence is obtained, the matched frame sequence in the target video frames is found by computing the minimum Euclidean distance, the rendering sequence data pairs formed from these frames and the synthesized face video frames of step 3 are input to the pre-trained rendering network, and the video frames between matched frames are obtained through the OpenCV-based linear interpolation algorithm, yielding the complete synthesized face video frame sequence. If the target video exceeds 5000 frames, mode 1 is adopted and the complete synthesized face video frame sequence is obtained in the same way. Finally, the video frames and the audio are combined into a whole video, generating a talking video that, on the basis of the target video, contains the source audio's expression and head pose, with a natural synthesized face and a clear background.
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (10)
1. A synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization is characterized by comprising the following steps:
fitting each parameter of the three-dimensional face morphable model to the input face image with a convolutional neural network to realize the parameterized reconstruction of the face model;

training a speech-to-expression and head pose mapping network H using the target video and the face model parameters, and using the trained network H to acquire facial expression and head pose parameters from the input audio, wherein the target video comprises video frames and the audio corresponding to those frames, and the video frames contain face images;

replacing the parameters in the parameterized face images with the acquired facial expression and head pose parameters, synthesizing the face image of each frame, and rendering it to generate realistic face video frames;

training a rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames, the rendering network being used to generate a background for each frame's face image;

and performing face background rendering and video synthesis based on video key frame optimization to obtain high-quality synthesized face video frames that carry the source audio's facial expression and head pose parameters and show a clear portrait and background, and synthesizing the complete talking video of the synthesized face according to the order of the video frames.
2. The method for generating a synthetic video based on three-dimensional face reconstruction and video key frame optimization according to claim 1, wherein fitting each parameter of the three-dimensional face morphable model to the input face image with a convolutional neural network to realize the parameterized reconstruction of the face model specifically comprises:

detecting the face in the face image, marking it with 68 landmark points, representing the face with the three-dimensional face morphable model, and parameterizing it as a triangular mesh model consisting of 35709 vertices;

converting the two-dimensional face image I into a three-dimensional parametric face model X, expressed as:

$$X = (\alpha, \beta, \delta, \gamma, P)$$

where $\alpha$ is the shape parameter of the three-dimensional face model, $\delta$ is the texture parameter, $\beta$ is the expression parameter, $\gamma$ is the illumination parameter, and the head pose parameter $P$ consists of the rotation parameter $R$ and the translation parameter $t$ derived from the camera model; the shape of any face picture can be represented with the three-dimensional face shape parameterized model as:

$$S = \bar{S} + B_{shape}\,\alpha + B_{exp}\,\beta$$

where $B_{shape}$ is the face shape basis and $B_{exp}$ is the facial expression basis;

the texture of the face is represented as:

$$T = \bar{T} + B_{tex}\,\delta$$

where $B_{tex}$ is the face texture basis, and $\bar{S}$ and $\bar{T}$ are the average shape and average texture of the face model, respectively;

the illumination model of the face is represented as:

$$C(n_i, t_i \mid \gamma) = t_i \cdot \sum_{b=1}^{B^2} \gamma_b\, \Phi_b(n_i)$$

where $\gamma$ is the illumination parameter of the face model, $n_i$ is the normal vector of any vertex $v_i$ of the face model, $t_i$ is the texture parameter of vertex $v_i$, $C(n_i, t_i \mid \gamma)$ is the irradiance of vertex $v_i$, $\Phi_b$ is a spherical harmonic basis function, $\gamma_b$ is a spherical harmonic coefficient, and $B = 3$;

therefore the reconstruction of the three-dimensional face model can be expressed as an optimized solution of the face model parameters, and the training of the convolutional-neural-network-based three-dimensional face model can be expressed as the optimization problem of equations (1) and (2):

$$\mathcal{L}_{coef}(X) = \omega_{\alpha} \lVert \alpha \rVert^{2} + \omega_{\beta} \lVert \beta \rVert^{2} + \omega_{\delta} \lVert \delta \rVert^{2} \tag{1}$$

$$\mathcal{L}_{tex}(X) = \sum_{c \in \{r,g,b\}} \operatorname{var}\big(T_{c}(R(X))\big) \tag{2}$$

where $\mathcal{L}_{coef}$ is the regularization function over the coefficients of the three-dimensional face model, $\omega_{\alpha}$, $\omega_{\beta}$, $\omega_{\delta}$ are the weights of the face shape, expression, and texture coefficients respectively, $c \in \{r, g, b\}$ indicates that the picture is an RGB picture, $T_{c}$ is the face texture vector of channel $c$, $\operatorname{var}(\cdot)$ is the variance, and $R(X)$ is the skin region of the face containing the cheeks, nose, and forehead.
3. The method of claim 2, wherein the speech-to-expression and head pose mapping network H is trained using the target video and the face model parameters, specifically comprising:

extracting the audio from the target video, converting it to Mel-frequency cepstral coefficients, and feeding the converted coefficients to a pre-trained audio high-level feature extraction network to obtain the high-level feature $F_t$; then using $F_t$ together with the $\beta_t$ and $P_t$ produced by fitting the three-dimensional face morphable model to the input face images with the convolutional neural network as the training dataset $\{F_t, \beta_t, P_t\}$ to train the speech-to-expression and head pose mapping network H; the trained network H extracts two face estimation parameters from the audio, corresponding respectively to the expression coefficients $\beta = \{\beta^{(1)}, \ldots, \beta^{(64)}\}$ and the head pose coefficients $P = \{P^{(1)}, \ldots, P^{(6)}\}$ of the three-dimensional face morphable model; the training of the mapping network H can be regarded as the optimization of the mean square error loss of the expression parameters $\mathcal{L}_{exp}$ and the mean square error loss of the head pose parameters $\mathcal{L}_{pose}$, as shown in equations (3) and (4):

$$\mathcal{L}_{exp} = \operatorname{MSE}\big(H_{exp}(F_t), \beta_t\big) \tag{3}$$

$$\mathcal{L}_{pose} = \operatorname{MSE}\big(H_{pose}(F_t), P_t\big) \tag{4}$$

where $\operatorname{MSE}(\cdot)$ is the mean square error function, $F_t$ is the high-level feature input to the network at time t, $\beta_t$ is the expression parameter of the target video at time t, and $P_t$ is the head pose parameter of the target video at time t.
4. The method for generating a synthetic video based on three-dimensional face reconstruction and video key frame optimization according to claim 3, wherein the pre-trained audio high-level feature extraction network is specifically:

the audio high-level feature extraction network uses the AT-net network as its backbone and is trained on The Oxford-BBC Lip Reading in the Wild dataset.
5. The method for generating a synthetic video based on three-dimensional face reconstruction and video key frame optimization according to claim 4, wherein acquiring the parameterized face images specifically comprises the following steps:

extracting the target video into video frames and cropping the whole face from each frame to obtain the face images $I^{(1)}, I^{(2)}, \ldots, I^{(n)}$; after these n face images are fitted by the convolutional neural network, any picture $I^{(k)}$, $k \in \{1, \ldots, n\}$, yields a face image represented by the parameterized model together with the corresponding shape parameter $\alpha_{video}$, expression parameter $\beta_{video}$, texture parameter $\delta_{video}$, illumination parameter $\gamma_{video}$, and head pose parameter $P_{video}$.
6. The method for generating a synthetic video based on three-dimensional face reconstruction and video key frame optimization according to claim 5, wherein the parameters in the parameterized face images are replaced with the acquired facial expression and head pose parameters, specifically: the expression coefficients $\beta_{audio}$ and head pose coefficients $P_{audio}$ obtained from the audio replace the corresponding face parameters in each video frame in time-frame order, synthesizing a new face $X = \{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$.
7. The method for generating a synthetic video based on three-dimensional face reconstruction and video key frame optimization according to claim 6, wherein the face image of each frame is synthesized and rendered to generate realistic face video frames, specifically:

using a 3D mesh renderer so that every mesh transitions uniformly and smoothly, obtaining realistic and natural face images, and saving the further-rendered face images as the corresponding video frames.
8. The method for generating a synthetic video based on three-dimensional face reconstruction and video key frame optimization according to claim 7, wherein the rendering network based on a generative adversarial network is trained using the parameterized face images and the face images in the video frames, specifically:

the parameterized face image $I^{(k)}$ obtained from a target video frame and the original video frame $\hat{I}^{(k)}$ of the corresponding time are combined into sequence data pairs $\{(I^{(k)}, \hat{I}^{(k)})\}$ and input to a rendering network consisting of a generator G and a discriminator D for pre-training; the pre-training process is the overall optimization of the generator and discriminator of the generative adversarial network and can be represented by equation (5):

$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}(G, D) \tag{5}$$

where $G^{*}$ is the overall optimization objective of the rendering network;

the complete training objective includes the reconstruction loss $\mathcal{L}_{recon}$ and the adversarial loss $\mathcal{L}_{adv}$, and training is represented as equation (6):

$$\mathcal{L}(G, D) = \mathcal{L}_{adv}(G, D) + \lambda\, \mathcal{L}_{recon}(G) \tag{6}$$

where the weighting parameter $\lambda = 100$; with x the parameterized face image and y the corresponding original frame, the reconstruction loss of equation (7) and the adversarial loss of equation (8) are:

$$\mathcal{L}_{recon}(G) = \mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_{1}\big] \tag{7}$$

$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big] \tag{8}$$

The rendering network obtained by this optimization training can generate a background for each frame's face image.
9. The method for generating a composite video based on three-dimensional face reconstruction and video key frame optimization according to claim 8, wherein face background rendering and video composition are performed based on video key frame optimization, and when the number of original video frames exceeds a threshold, the method specifically comprises:
calculating the Euclidean distance $H$ between the face image $I_t$ synthesized at time $t$ and the video frame in which only the head region remains after mask processing of the original video frame; the frame at which $H$ attains its minimum, i.e., the matching frame whose head pose in the original video is closest to the pose of the synthesized face, forms a rendering data pair with the synthesized frame; the matching frames of all frames are obtained in this way and combined into rendering sequence data pairs, which are input into the pre-trained rendering network based on the generative adversarial network, obtaining the synthetic face $X = \{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$ containing the expression coefficient $\beta_{audio}$ and head pose coefficient $P_{audio}$ of the source audio; finally, a complete talking video of the synthesized face, carrying the source audio, the audio-driven expression, and the head pose on the basis of the target person's identity, is composed according to the order of the video frames.
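The matching-frame search can be sketched as a straightforward nearest-neighbor scan over the masked original frames; mask computation is assumed to happen upstream, and all names here are illustrative:

```python
# Sketch of the Euclidean-distance matching-frame search of claim 9.
import numpy as np

def find_matching_frame(synth_face, original_frames, head_masks):
    """Return the index of the masked original frame nearest to synth_face."""
    best_idx, best_dist = -1, np.inf
    for i, (frame, mask) in enumerate(zip(original_frames, head_masks)):
        head_only = frame * mask                      # keep only head region
        dist = np.linalg.norm(                        # Euclidean distance H
            head_only.astype(np.float32) - synth_face.astype(np.float32))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx
```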
10. The method for generating a composite video based on three-dimensional face reconstruction and video key frame optimization according to claim 8, wherein face background rendering and video composition are performed based on video key frame optimization, and when the number of original video frames does not exceed a threshold, the method specifically comprises:
selecting, from the synthesized face video frames, the frames with large head pose offsets as key frames and arranging them in temporal order to obtain a key frame sequence; calculating the Euclidean distance $H$ between the face image $I_t$ synthesized at time $t$ and the video frame in which only the head region remains after mask processing of the original video frame, and taking the frame at which $H$ attains its minimum as the most closely matching frame, thereby obtaining a matching frame sequence; then obtaining the video frames between the matching frames through an OpenCV-based linear interpolation algorithm so as to synthesize a complete video frame sequence; combining this video frame sequence with the synthesized face video frames into rendering sequence data pairs and inputting them into the pre-trained rendering network based on the generative adversarial network, obtaining the synthetic face $X = \{\alpha_{video}, \beta_{audio}, \delta_{video}, \gamma_{video}, P_{audio}\}$ containing the source-audio expression coefficient $\beta_{audio}$ and head pose coefficient $P_{audio}$; finally, a complete talking video of the synthesized face, carrying the source audio, the audio-driven expression, and the head pose on the basis of the target person's identity, is composed according to the order of the video frames.
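One plausible reading of the OpenCV-based linear interpolation step is a per-pixel linear blend between consecutive matched key frames, e.g.:

```python
# Sketch of linearly interpolating the frames between two matched key
# frames with OpenCV, one possible reading of the claim's interpolation.
import cv2

def interpolate_frames(frame_a, frame_b, n_between):
    """Return n_between frames linearly blended from frame_a to frame_b."""
    frames = []
    for k in range(1, n_between + 1):
        t = k / (n_between + 1)                     # blend weight in (0, 1)
        frames.append(cv2.addWeighted(frame_a, 1.0 - t, frame_b, t, 0.0))
    return frames
```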
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110610539.9A CN113269872A (en) | 2021-06-01 | 2021-06-01 | Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113269872A (en) | 2021-08-17 |
Family
ID=77233988
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110610539.9A Pending CN113269872A (en) | 2021-06-01 | 2021-06-01 | Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113269872A (en) |
Patent Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102254336A (en) * | 2011-07-14 | 2011-11-23 | 清华大学 | Method and device for synthesizing face video |
CN103218842A (en) * | 2013-03-12 | 2013-07-24 | 西南交通大学 | Voice synchronous-drive three-dimensional face mouth shape and face posture animation method |
CN103279970A (en) * | 2013-05-10 | 2013-09-04 | 中国科学技术大学 | Real-time human face animation driving method by voice |
CN107067429A (en) * | 2017-03-17 | 2017-08-18 | 徐迪 | Video editing system and method that face three-dimensional reconstruction and face based on deep learning are replaced |
CN108230438A (en) * | 2017-12-28 | 2018-06-29 | 清华大学 | The facial reconstruction method and device of sound driver secondary side face image |
CN108510437A (en) * | 2018-04-04 | 2018-09-07 | 科大讯飞股份有限公司 | A kind of virtual image generation method, device, equipment and readable storage medium storing program for executing |
CN109035394A (en) * | 2018-08-22 | 2018-12-18 | 广东工业大学 | Human face three-dimensional model method for reconstructing, device, equipment, system and mobile terminal |
CN109308731A (en) * | 2018-08-24 | 2019-02-05 | 浙江大学 | The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM |
CN110599573A (en) * | 2019-09-03 | 2019-12-20 | 电子科技大学 | Method for realizing real-time human face interactive animation based on monocular camera |
CN110942502A (en) * | 2019-11-29 | 2020-03-31 | 中山大学 | Voice lip fitting method and system and storage medium |
CN111243626A (en) * | 2019-12-30 | 2020-06-05 | 清华大学 | Speaking video generation method and system |
CN111294665A (en) * | 2020-02-12 | 2020-06-16 | 百度在线网络技术(北京)有限公司 | Video generation method and device, electronic equipment and readable storage medium |
CN111508064A (en) * | 2020-04-14 | 2020-08-07 | 北京世纪好未来教育科技有限公司 | Expression synthesis method and device based on phoneme driving and computer storage medium |
CN112215927A (en) * | 2020-09-18 | 2021-01-12 | 腾讯科技(深圳)有限公司 | Method, device, equipment and medium for synthesizing face video |
CN112188304A (en) * | 2020-09-28 | 2021-01-05 | 广州酷狗计算机科技有限公司 | Video generation method, device, terminal and storage medium |
CN112420014A (en) * | 2020-11-17 | 2021-02-26 | 平安科技(深圳)有限公司 | Virtual face construction method and device, computer equipment and computer readable medium |
CN112562722A (en) * | 2020-12-01 | 2021-03-26 | 新华智云科技有限公司 | Audio-driven digital human generation method and system based on semantics |
CN112766160A (en) * | 2021-01-20 | 2021-05-07 | 西安电子科技大学 | Face replacement method based on multi-stage attribute encoder and attention mechanism |
Non-Patent Citations (4)
Title |
---|
CAO FAXIAN, YANG ZHIJING, et al.: "Convolutional neural network extreme learning machine for effective classification of hyperspectral images", Journal of Applied Remote Sensing, vol. 12, no. 3, 5 July 2018 (2018-07-05), pages 1-20 *
RAN YI, ZIPENG YE, et al.: "Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose", https://arxiv.org/abs/2002.10137, 5 March 2020 (2020-03-05), pages 1-12 *
YU DENG, JIAOLONG YANG, et al.: "Accurate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image to Image Set", 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 9 April 2020 (2020-04-09), pages 285-295 *
LI XINYI et al.: "A Survey of Research on Speech-Driven Facial Animation" (in Chinese), Computer Engineering and Applications, no. 22, 15 November 2017 (2017-11-15), pages 26-33 *
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11574655B2 (en) | 2021-05-26 | 2023-02-07 | Flawless Holdings Limited | Modification of objects in film |
GB2607140A (en) * | 2021-05-26 | 2022-11-30 | Flawless Holdings Ltd | Modification of objects in film |
US11699464B2 (en) | 2021-05-26 | 2023-07-11 | Flawless Holdings Limited | Modification of objects in film |
GB2607140B (en) * | 2021-05-26 | 2024-04-10 | Flawless Holdings Ltd | Modification of objects in film |
US11715495B2 (en) | 2021-05-26 | 2023-08-01 | Flawless Holdings Limited | Modification of objects in film |
CN113822969A (en) * | 2021-09-15 | 2021-12-21 | 宿迁硅基智能科技有限公司 | Method, device and server for training nerve radiation field model and face generation |
CN113822969B (en) * | 2021-09-15 | 2023-06-09 | 宿迁硅基智能科技有限公司 | Training neural radiation field model, face generation method, device and server |
CN113838173A (en) * | 2021-09-23 | 2021-12-24 | 厦门大学 | Virtual human head motion synthesis method driven by voice and background sound |
CN113838173B (en) * | 2021-09-23 | 2023-08-22 | 厦门大学 | Virtual human head motion synthesis method driven by combination of voice and background sound |
CN114419702B (en) * | 2021-12-31 | 2023-12-01 | 南京硅基智能科技有限公司 | Digital person generation model, training method of model, and digital person generation method |
CN114419702A (en) * | 2021-12-31 | 2022-04-29 | 南京硅基智能科技有限公司 | Digital human generation model, training method of model, and digital human generation method |
CN114049678A (en) * | 2022-01-11 | 2022-02-15 | 之江实验室 | Facial motion capturing method and system based on deep learning |
CN114332136A (en) * | 2022-03-15 | 2022-04-12 | 南京甄视智能科技有限公司 | Face attribute data labeling method, computer equipment and storage medium |
CN114782864B (en) * | 2022-04-08 | 2023-07-21 | 马上消费金融股份有限公司 | Information processing method, device, computer equipment and storage medium |
CN114898244B (en) * | 2022-04-08 | 2023-07-21 | 马上消费金融股份有限公司 | Information processing method, device, computer equipment and storage medium |
CN114782864A (en) * | 2022-04-08 | 2022-07-22 | 马上消费金融股份有限公司 | Information processing method and device, computer equipment and storage medium |
CN114821404A (en) * | 2022-04-08 | 2022-07-29 | 马上消费金融股份有限公司 | Information processing method and device, computer equipment and storage medium |
CN114898244A (en) * | 2022-04-08 | 2022-08-12 | 马上消费金融股份有限公司 | Information processing method and device, computer equipment and storage medium |
CN114821404B (en) * | 2022-04-08 | 2023-07-25 | 马上消费金融股份有限公司 | Information processing method, device, computer equipment and storage medium |
CN114648613A (en) * | 2022-05-18 | 2022-06-21 | 杭州像衍科技有限公司 | Three-dimensional head model reconstruction method and device based on deformable nerve radiation field |
CN114648613B (en) * | 2022-05-18 | 2022-08-23 | 杭州像衍科技有限公司 | Three-dimensional head model reconstruction method and device based on deformable nerve radiation field |
CN115294622A (en) * | 2022-06-15 | 2022-11-04 | 北京邮电大学 | Method, system and storage medium for synthesizing and enhancing voice-driven speaker head motion video |
WO2024055379A1 (en) * | 2022-09-16 | 2024-03-21 | 粤港澳大湾区数字经济研究院(福田) | Video processing method and system based on character avatar model, and related device |
WO2024078243A1 (en) * | 2022-10-13 | 2024-04-18 | 腾讯科技(深圳)有限公司 | Training method and apparatus for video generation model, and storage medium and computer device |
US11830159B1 (en) | 2022-12-08 | 2023-11-28 | Flawless Holding Limited | Generative films |
CN115984943A (en) * | 2023-01-16 | 2023-04-18 | 支付宝(杭州)信息技术有限公司 | Facial expression capturing and model training method, device, equipment, medium and product |
CN115984943B (en) * | 2023-01-16 | 2024-05-14 | 支付宝(杭州)信息技术有限公司 | Facial expression capturing and model training method, device, equipment, medium and product |
CN115909015A (en) * | 2023-02-15 | 2023-04-04 | 苏州浪潮智能科技有限公司 | Construction method and device of deformable nerve radiation field network |
WO2024169314A1 (en) * | 2023-02-15 | 2024-08-22 | 苏州元脑智能科技有限公司 | Method and apparatus for constructing deformable neural radiance field network |
CN116091668B (en) * | 2023-04-10 | 2023-07-21 | 广东工业大学 | Talking head video generation method based on emotion feature guidance |
CN116091668A (en) * | 2023-04-10 | 2023-05-09 | 广东工业大学 | Talking head video generation method based on emotion feature guidance |
CN116152447B (en) * | 2023-04-21 | 2023-09-26 | 科大讯飞股份有限公司 | Face modeling method and device, electronic equipment and storage medium |
CN116152447A (en) * | 2023-04-21 | 2023-05-23 | 科大讯飞股份有限公司 | Face modeling method and device, electronic equipment and storage medium |
CN116342760A (en) * | 2023-05-25 | 2023-06-27 | 南昌航空大学 | Three-dimensional facial animation synthesis method, system, electronic equipment and storage medium |
CN116721194A (en) * | 2023-08-09 | 2023-09-08 | 瀚博半导体(上海)有限公司 | Face rendering method and device based on generation model |
CN116721194B (en) * | 2023-08-09 | 2023-10-24 | 瀚博半导体(上海)有限公司 | Face rendering method and device based on generation model |
CN117392292A (en) * | 2023-10-20 | 2024-01-12 | 联通在线信息科技有限公司 | 3D digital person generation method and system |
CN117392292B (en) * | 2023-10-20 | 2024-04-30 | 联通在线信息科技有限公司 | 3D digital person generation method and system |
CN117292030A (en) * | 2023-10-27 | 2023-12-26 | 海看网络科技(山东)股份有限公司 | Method and system for generating three-dimensional digital human animation |
CN117593442A (en) * | 2023-11-28 | 2024-02-23 | 拓元(广州)智慧科技有限公司 | Portrait generation method based on multi-stage fine grain rendering |
CN117593442B (en) * | 2023-11-28 | 2024-05-03 | 拓元(广州)智慧科技有限公司 | Portrait generation method based on multi-stage fine grain rendering |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113269872A (en) | Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization | |
Guo et al. | Ad-nerf: Audio driven neural radiance fields for talking head synthesis | |
CN112887698B (en) | High-quality face voice driving method based on nerve radiation field | |
CN101055647B (en) | Method and device for processing image | |
CN101324961B (en) | Human face portion three-dimensional picture pasting method in computer virtual world | |
CN117496072B (en) | Three-dimensional digital person generation and interaction method and system | |
CN113362422B (en) | Shadow robust makeup transfer system and method based on decoupling representation | |
KR102353556B1 (en) | Apparatus for Generating Facial expressions and Poses Reappearance Avatar based in User Face | |
CN115209180A (en) | Video generation method and device | |
CN115914505B (en) | Video generation method and system based on voice-driven digital human model | |
JP2009104570A (en) | Data structure for image formation and method of forming image | |
CN111640172A (en) | Attitude migration method based on generation of countermeasure network | |
CN115239857B (en) | Image generation method and electronic device | |
CN116721190A (en) | Voice-driven three-dimensional face animation generation method | |
CN116385606A (en) | Speech signal driven personalized three-dimensional face animation generation method and application thereof | |
CN112862672B | Bangs (hair fringe) generation method, device, computer equipment and storage medium | |
CN113947520A (en) | Method for realizing face makeup conversion based on generation of confrontation network | |
Tang et al. | Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar | |
Nguyen-Phuoc et al. | Alteredavatar: Stylizing dynamic 3d avatars with fast style adaptation | |
Sun et al. | SSAT $++ $: A Semantic-Aware and Versatile Makeup Transfer Network With Local Color Consistency Constraint | |
CN117671090A (en) | Expression processing method and device, electronic equipment and storage medium | |
Wang et al. | Uncouple generative adversarial networks for transferring stylized portraits to realistic faces | |
WO2024055379A1 (en) | Video processing method and system based on character avatar model, and related device | |
CN116402928B (en) | Virtual talking digital person generating method | |
Kumar et al. | Multi modal adaptive normalization for audio to video generation |
Legal Events
Date | Code | Title | Description
---|---|---|---
2021-08-17 | PB01 | Publication | Application publication date: 20210817
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | |