CN113269872A - Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization


Info

Publication number: CN113269872A
Application number: CN202110610539.9A
Authority: CN (China)
Prior art keywords: face, video, frame, network, audio
Legal status: Pending
Original language: Chinese (zh)
Inventors: 杨志景, 李为杰, 温瑞冕, 徐永宗, 李凯, 凌永权
Current and original assignee: Guangdong University of Technology
Application filed by Guangdong University of Technology; priority to CN202110610539.9A; published as CN113269872A


Classifications

    • G06T 17/00 - Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06F 18/214 - Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 - Neural networks: combinations of networks
    • G06N 3/08 - Neural networks: learning methods
    • G06T 15/005 - 3D image rendering: general purpose rendering architectures
    • G06V 40/161 - Human faces: detection; localisation; normalisation
    • G06V 40/168 - Human faces: feature extraction; face representation
    • G06T 2200/04 - Indexing scheme for image data processing involving 3D image data


Abstract

The invention discloses a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization, which comprises the following steps: fitting all parameters of a three-dimensional face deformation model to the input face image with a convolutional neural network; training a speech-to-expression and head pose mapping network H using the target video and the face model parameters; using the trained mapping network H to obtain facial expression and head pose parameters from the input audio; synthesizing a face and rendering it to generate realistic face video frames; training a rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames, the rendering network generating a background for each frame's face image; and performing face background rendering and video synthesis based on video key frame optimization. The background of each frame of the synthesized face video output by the invention transitions naturally and realistically, which greatly enhances the usability and practicability of the synthesized face video.

Description

Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
Technical Field
The invention relates to the field of three-dimensional face reconstruction and face synthesis migration in deep learning, in particular to a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization.
Background
With the improvement of living standards in China, the popularization of mobile intelligent terminals and the rapid development of mobile internet technology, video has become an indispensable part of people's life, study, entertainment and work. Compared with traditional image-and-text forms of expression, video combines hearing and vision, and its production threshold is relatively low. Most current applications of video synthesis are still in entertainment, such as face-swap photography in Meitu-style apps, AR avatar creation on the iPhone, iSwap Faces and similar applications. In essence, most of these applications detect, locate and segment the face in an image with a deep-learning neural network and then exchange a source face and a target face. These functions require a neural network trained on a large amount of face data, offer poor controllability, and make it difficult to decouple the individual attributes of the face.
Audio-driven synthesis of face speech videos is currently a key problem in realizing virtual anchors and intelligent face speech video synthesis. Its key effect is that a face speech video with a realistic face and natural transitions between video frames can be generated given only the source audio and a video of the target person. Traditional recording of a person's speech video requires a large amount of labour and time, and necessarily requires the target person to participate in the recording. The method therefore uses three-dimensional face reconstruction together with a rendering network under video background frame optimization to generate realistic synthesized face images and hence a realistic synthesized face video, which is of great practical value for virtual anchors, recording of character programmes, online course recording and the like.
At present, neural networks are applied to face video synthesis in deep learning. Based on three-dimensional face reconstruction and a rendering network under video key frame optimization, features such as expression, head pose, shape and texture can be extracted from a face model; the expression and head pose features are extracted from the source audio and substituted into the face model of the target person, and the required realistic synthesized face frames are generated through the optimized rendering of video key frames, thereby achieving face synthesis. In recent years a large number of researchers have devoted themselves to scientific research in the field of face synthesis. However, because face generation based on neural networks requires a large amount of training data, data acquisition for such networks is a great challenge. Moreover, owing to the quality of the input data and the inherent instability of generative models, the pictures and videos synthesized in this way may have low image quality and cannot support large-scale head pose control. The manner of face synthesis has always been a difficulty and a hot topic in research on face video synthesis. In the patent "Training method for generative adversarial network, image face-swapping method and video face-swapping device" (application number: 202010592443.X) of Peninsula (Beijing) Information Technology Co., Ltd., a generator and a discriminator based on a generative adversarial network are proposed; the adversarial network is trained with massive data pairs, the attribute feature map of the person in the target image is extracted, and the generated mixed feature map is decoded to obtain the synthesized face. Although this method can also preserve the attribute features of the original image and the identity features of the target image, it has low stability in obtaining a realistic synthesized face, cannot obtain a synthesized face speech video given only a person's voice and a video, and produces synthesized face videos with blurred face backgrounds and low, unnatural video frame quality.
Disclosure of Invention
The invention provides a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization, which enables the background transition of each frame of the output synthetic face video to be natural and vivid.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization comprises the following steps:
optimizing and fitting each parameter of the three-dimensional face deformation model to the input face image by adopting a convolutional neural network to realize the parameterized reconstruction of the face model;
training a speech-to-expression and head pose mapping network H using the target video and the parameters of the face model, and using the trained mapping network H to obtain facial expression and head pose parameters from the input audio, wherein the target video comprises video frames and the audio corresponding to the video frames, and the video frames contain face images;
replacing parameters in the parameterized face image according to the acquired facial expression and head posture parameters, synthesizing to obtain a face image of each frame, rendering, and generating a vivid face video frame;
training a rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames, wherein the rendering network is used to generate a background for each frame's face image;
and performing face background rendering and video synthesis based on video key frame optimization to obtain high-quality synthesized face video frames that contain the source-audio facial expression and head pose parameters and have a clear portrait and background, and synthesizing the complete speech video of the synthesized face in video-frame order (a high-level sketch of this pipeline is given after these steps).
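For clarity only, the Python-style sketch below summarizes how the steps above fit together. Every function passed in (fit_3dmm, mapping_net, render_face, rendering_gan, assemble_video) is a hypothetical placeholder for the corresponding step described above, not an implementation provided by the invention.

    def generate_synthetic_video(frames, source_mfcc, fit_3dmm, mapping_net,
                                 render_face, rendering_gan, assemble_video):
        # 1. Parameterized reconstruction: fit the 3D face deformation model to every target frame.
        params = [fit_3dmm(f) for f in frames]                 # alpha, beta, delta, gamma, P per frame
        # 2./3. Predict expression (beta) and head pose (P) from the source audio features.
        beta_audio, pose_audio = mapping_net(source_mfcc)
        # 4. Replace beta/P in each frame's parameters and render the synthesized face.
        synth_faces = [render_face({**p, "beta": b, "P": q})
                       for p, b, q in zip(params, beta_audio, pose_audio)]
        # 5./6. GAN-based background rendering with key-frame optimization, then video assembly.
        rendered = [rendering_gan(face, frame) for face, frame in zip(synth_faces, frames)]
        return assemble_video(rendered)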
Preferably, the method for realizing the parametric reconstruction of the face model by optimally fitting each parameter of the three-dimensional face deformation model to the input face image by adopting the convolutional neural network specifically comprises the following steps:
recognizing the face in the face image, marking it with 68 landmark points, representing the face with the three-dimensional face deformation model, and parameterizing it as a triangular mesh model consisting of 35709 vertices;
converting a two-dimensional face image I into a three-dimensional face parameterized model X, expressed as:

X = (α, β, δ, γ, P)

where α is the shape parameter of the three-dimensional face model, δ is the texture parameter, β is the expression parameter, γ is the illumination parameter, and P is the head pose parameter, represented by the rotation parameter R and the translation parameter T derived from the camera model; the shape of any face picture can be represented by the three-dimensional face shape parameterized model as:

S(α, β) = S̄ + B_shape α + B_exp β

where B_shape is the face shape basis vector and B_exp is the facial expression basis vector;
the texture of the face is represented as:

T(δ) = T̄ + B_tex δ

where B_tex is the face texture basis vector, and S̄ and T̄ are respectively the average shape and average texture of the face model; the illumination model of the face is represented as:

C(n_i, t_i | γ) = t_i · Σ_{b=1}^{B²} γ_b Φ_b(n_i)

where γ is the illumination parameter of the face model, n_i is the normal vector of any vertex v_i of the face model, t_i is the texture value of vertex v_i, the irradiance of vertex v_i is denoted C(n_i, t_i | γ), Φ_b are the spherical harmonic basis functions, and γ_b are the spherical harmonic coefficients;

therefore, the reconstruction process of the three-dimensional face model can be expressed as an optimization solution for the face model parameters, and the training process of the three-dimensional face model based on the convolutional neural network can be expressed as the optimization problem of the following formulas (1) and (2):

L_coef(x) = ω_α ‖α‖² + ω_β ‖β‖² + ω_δ ‖δ‖²    (1)

L_tex(x) = Σ_{c∈{r,g,b}} var(T_c(R(x)))    (2)

where L_coef and L_tex are the regularized optimization functions of the coefficients of the three-dimensional face model, ω_α, ω_β and ω_δ are the weights corresponding to the face shape coefficient, expression coefficient and texture coefficient respectively, c ∈ {r, g, b} indicates that the picture is an RGB picture, T_c denotes the face texture vector of channel c, var() denotes the variance, and R(x) denotes the skin area of the face containing the cheeks, nose and forehead.
Preferably, the speech-to-expression and head pose mapping network H is trained using the target video and the parameters of the face model, specifically:

extracting the audio in the target video and converting it into Mel-frequency cepstral coefficients; inputting the converted Mel-frequency cepstral coefficients into a pre-trained audio high-level feature extraction network to obtain the high-level feature F_t; then using F_t, together with the β_t and P_t generated by fitting the three-dimensional face deformation model to the input face images with the convolutional neural network, as the training data set to train the speech-to-expression and head pose mapping network H; the trained mapping network H extracts two face estimation parameters from the audio, corresponding respectively to the expression coefficients β = {β^(1), ..., β^(64)} and the head pose coefficients P = {P^(1), ..., P^(6)} of the three-dimensional face deformation model; the training process of the mapping network H can be regarded as the optimization of the mean square error loss of the expression parameters L_exp and the mean square error loss of the head pose parameters L_pose, as shown in formulas (3) and (4):

L_exp = MSE(H(F_t)_β, β_t)    (3)

L_pose = MSE(H(F_t)_P, P_t)    (4)

where MSE() denotes the mean square error function, F_t denotes the high-level feature input to the network at time t, β_t is the expression parameter of the target video at time t, and P_t is the head pose parameter of the target video at time t.
Preferably, the pre-trained audio high-level feature extraction network is specifically as follows:
the audio high-level feature extraction network uses an AT-net network as its backbone and is trained on The Oxford-BBC Lip Reading in the Wild dataset.
Preferably, the method for acquiring a parameterized face image specifically includes the following steps:
extracting the target video into video frames and cropping the whole face from each frame to obtain the face images I^(1), I^(2), ..., I^(n); after the n face images are optimized by the convolutional neural network, for any image I^(k), k ∈ n, the face image represented by the parameterized model and the corresponding shape parameter α_video, expression parameter β_video, texture parameter δ_video, illumination parameter γ_video and head pose parameter P_video are obtained.
Preferably, the parameters in the parameterized face image are replaced according to the acquired parameters of facial expression and head pose, specifically:
the expression coefficients β_audio and head pose coefficients P_audio obtained from the audio replace the corresponding face parameters in each video frame in time-frame order, synthesizing the new face X = {α_video, β_audio, δ_video, γ_video, P_audio}.
Preferably, the synthesizing obtains a face image of each frame and renders the face image to generate a realistic face video frame, specifically:
a 3D mesh renderer is used so that each mesh transitions uniformly and smoothly, thereby obtaining realistic and natural face images, and the further-rendered face images are stored as the corresponding video frames.
Preferably, the rendering network based on a generative adversarial network is trained using the parameterized face images and the face images in the video frames, specifically:

the parameterized face image I^(k) obtained from the target video frame and the original video frame Î^(k) corresponding to the same moment are combined into sequence data pairs (I^(k), Î^(k)) and input into a rendering network consisting of a generator G and a discriminator D for pre-training; the pre-training process is the overall optimization process of the generator and discriminator of the generative adversarial network and can be represented by formula (5):

G* = arg min_G max_D L_GAN(G, D)    (5)

where G* denotes the overall optimization objective of the rendering network;
the complete training objective comprises the reconstruction loss L_rec and the adversarial loss L_adv, and training is represented as formula (6):

G* = arg min_G max_D L_adv(G, D) + λ L_rec(G)    (6)

where the weighting parameter λ = 100; the reconstruction loss L_rec of formula (7) and the adversarial loss L_adv of formula (8) are as follows:

L_rec(G) = E[ ‖ Î^(k) − G(I^(k)) ‖₁ ]    (7)

L_adv(G, D) = E[ log D(Î^(k)) ] + E[ log(1 − D(G(I^(k)))) ]    (8)

The rendering network after this optimization training can generate a background for each frame's face image.
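A minimal PyTorch sketch of the objective in formulas (5) to (8) is shown below; only the loss wiring (adversarial term plus λ = 100 times the reconstruction term) follows the description above, while the generator G and discriminator D themselves are assumed to be provided elsewhere.

    import torch
    import torch.nn as nn

    def gan_training_step(G, D, opt_G, opt_D, synth_face, real_frame, lam=100.0):
        bce = nn.BCEWithLogitsLoss()
        l1 = nn.L1Loss()

        # Discriminator step: real original frames -> 1, generated frames -> 0.
        fake = G(synth_face).detach()
        pred_real, pred_fake = D(real_frame), D(fake)
        d_loss = bce(pred_real, torch.ones_like(pred_real)) + bce(pred_fake, torch.zeros_like(pred_fake))
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Generator step: adversarial loss plus lambda * L1 reconstruction loss (formula (6)).
        fake = G(synth_face)
        pred_fake = D(fake)
        g_loss = bce(pred_fake, torch.ones_like(pred_fake)) + lam * l1(fake, real_frame)
        opt_G.zero_grad(); g_loss.backward(); opt_G.step()
        return d_loss.item(), g_loss.item()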
Preferably, face background rendering and video synthesis are performed based on video key frame optimization; when the number of original video frames exceeds a threshold, the steps are specifically:

calculating the Euclidean distance H between the face image I_t synthesized at time t and the original video frames in which only the head region remains after mask processing, and taking the frame for which H attains its minimum value, i.e. the matching frame in the original video whose head pose deviation is closest to the pose of the synthesized face, to form a rendering data pair with I_t; obtaining the matching frames of all frames in this way and combining them into rendering sequence data pairs; inputting the obtained rendering sequence data pairs into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the source-audio expression coefficients β_audio and head pose coefficients P_audio; finally, the complete speech video of the synthesized face, which on the basis of the target person's identity contains the source audio, the audio-driven expression and the head pose, is synthesized in video-frame order.
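One straightforward way to implement the matching-frame search described above is sketched below: each synthesized face frame is compared against the masked original frames, and the frame with the smallest Euclidean distance is kept as its rendering partner. The masking itself (keeping only the head region) is assumed to have been done beforehand, and all frames are assumed to share the same size.

    import numpy as np

    def find_matching_frame(synth_face, masked_originals):
        # masked_originals: original frames with everything except the head region zeroed out.
        synth = synth_face.astype(np.float32).ravel()
        dists = [np.linalg.norm(synth - m.astype(np.float32).ravel()) for m in masked_originals]
        return int(np.argmin(dists))        # index of the frame with minimum Euclidean distance H

    def build_render_pairs(synth_frames, masked_originals, originals):
        pairs = []
        for synth in synth_frames:
            k = find_matching_frame(synth, masked_originals)
            pairs.append((synth, originals[k]))   # rendering data pair (I_t, matched original frame)
        return pairs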
Preferably, face background rendering and video synthesis are performed based on video key frame optimization; when the number of original video frames does not exceed the threshold, the steps are specifically:

selecting the video frames with large head pose offsets from the synthesized face video frames as key frames and arranging them into a key frame sequence in time order; calculating the Euclidean distance H between the face image I_t synthesized at time t and the original video frames in which only the head region remains after mask processing, and taking the frame for which H attains its minimum value as the most closely matching frame, so as to obtain the matching frame sequence; then obtaining the video frames between the matching frames through an OpenCV-based linear interpolation algorithm, thereby synthesizing a complete video frame sequence; this video frame sequence and the synthesized face video frames form the rendering sequence data pairs, which are input into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the source-audio expression coefficients β_audio and head pose coefficients P_audio; finally, the complete speech video of the synthesized face, which on the basis of the target person's identity contains the source audio, the audio-driven expression and the head pose, is synthesized in video-frame order.
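The OpenCV-based linear interpolation between successive matched key frames could be realized as in the following sketch, where cv2.addWeighted blends the two frames; the number of in-between frames is an illustrative parameter rather than a value specified by the patent.

    import cv2

    def interpolate_frames(frame_a, frame_b, n_between):
        # Linearly blend two matched key frames to fill the gap between them.
        out = []
        for i in range(1, n_between + 1):
            w = i / float(n_between + 1)
            out.append(cv2.addWeighted(frame_a, 1.0 - w, frame_b, w, 0))
        return out

    def fill_sequence(matched_frames, n_between=4):
        full = []
        for a, b in zip(matched_frames[:-1], matched_frames[1:]):
            full.append(a)
            full.extend(interpolate_frames(a, b, n_between))
        full.append(matched_frames[-1])
        return full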
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
compared with the traditional method for recording the talking video of the person at present, the highlight of the invention can correspondingly generate the talking video of A and the talking video of B (matching the audio of A) under certain special scenes, such as only the audio of A and the talking video recorded before A or only the audio of A and the talking video of B, which is obviously impossible in the traditional method; secondly, the invention can train the network well in advance and store the network locally, and then the needed composite video can be obtained quickly only by inputting the corresponding audio and video; third, the present invention is greatly reduced in both time and labor costs over conventional methods. Compared with the existing face synthesis video method, firstly, the invention uses the parameterized model of the three-dimensional face and the face synthesis of the audio conversion expression and head posture mapping network to ensure that the synthesized face has the audio expression and head posture parameter information completely, so that the mouth shape of the character in the generated face synthesis video is matched with the audio, and the video character is controlled by the expression and the head posture parameter in the audio; secondly, the invention carries out targeted optimization on the character background in the synthesized video frame, and can obtain the synthesized face video frame with high quality and clear portrait and background after being rendered by the background rendering network.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic flow chart of acquiring facial expression and head pose parameters from input audio.
Fig. 3 is a schematic flow chart of optimized rendering of video key frames when the number of original video frames exceeds a threshold.
Fig. 4 is a schematic flow chart of optimized rendering of video key frames when the number of original video frames does not exceed a threshold.
Fig. 5 is a schematic view of a face synthesis process.
Fig. 6 is a schematic view of a flow of rendering a human face background and synthesizing a video.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
A method for generating a composite video based on three-dimensional face reconstruction and video key frame optimization is disclosed, as shown in FIG. 1, and comprises the following steps:
optimizing and fitting each parameter of the three-dimensional face deformation model to the input face image by adopting a convolutional neural network to realize the parameterized reconstruction of the face model;
training a voice-to-expression and head pose mapping network H by using parameters of a target video and a face model, and acquiring facial expression and head pose parameters from input audio by using the trained voice-to-expression and head pose mapping network H, wherein the target video comprises a video frame and audio corresponding to the video frame, and the video frame comprises a face image;
replacing parameters in the parameterized face image according to the acquired facial expression and head posture parameters, synthesizing to obtain a face image of each frame, rendering, and generating a vivid face video frame;
training a rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames, wherein the rendering network is used to generate a background for each frame's face image;
and performing face background rendering and video synthesis based on video key frame optimization to obtain high-quality synthesized face video frames that contain the source-audio facial expression and head pose parameters and have a clear portrait and background, and synthesizing the complete speech video of the synthesized face in video-frame order.
This embodiment considers that, at the present stage, recording videos of people speaking requires a large amount of labour and time in areas such as video anchoring and online education course recording. At present, speech videos of a person are mainly produced by traditional means: a real person must be recorded in front of the camera, which is time-consuming and labour-intensive and places high demands on the clothing and appearance of the person on camera. In this embodiment, model parameters describing expression and head pose are extracted from the person's voice, the face in each video frame is reconstructed with a three-dimensional face model, and a synthesized face video is generated that carries the identity information of the target face and matches the expression and head pose information in the source audio. The method is also purposefully optimized for the realism of the face images and the naturalness of the background transitions in the synthesized face video. In some past research, the synthesized faces have strange facial structures and are not realistic enough, or the face is detached from the background, the background is blurred, and so on, lacking the naturalness of a real face. In this embodiment, after the face is synthesized, the 3D mesh renderer further renders the facial texture of the synthesized face image, so that the texture details of the synthesized face are visually richer and more realistic. At the same time, some key frames are selected from the synthesized face video frames, and an interpolation frame-filling algorithm is used between adjacent key frames, so that the background of each frame of the output synthesized face video transitions naturally and realistically. Therefore, the usability and practicability of the synthesized face video can be greatly enhanced by this method.
The method adopts the convolutional neural network to optimally fit all parameters of the three-dimensional face deformation model to the input face image so as to realize the parametric reconstruction of the face model, and specifically comprises the following steps:
recognizing the face in the face image, marking it with 68 landmark points, representing the face with the three-dimensional face deformation model, and parameterizing it as a triangular mesh model consisting of 35709 vertices;
converting a two-dimensional face image I into a three-dimensional face parameterized model X, expressed as:

X = (α, β, δ, γ, P)

where α is the shape parameter of the three-dimensional face model, δ is the texture parameter, β is the expression parameter, γ is the illumination parameter, and P is the head pose parameter, represented by the rotation parameter R and the translation parameter T derived from the camera model; the shape of any face picture is represented by the three-dimensional face shape parameterized model as:

S(α, β) = S̄ + B_shape α + B_exp β

where B_shape is the face shape basis vector and B_exp is the facial expression basis vector;
the texture of the face is represented as:

T(δ) = T̄ + B_tex δ

where B_tex is the face texture basis vector, and S̄ and T̄ are respectively the average shape and average texture of the face model;
the illumination model of the face is represented as:

C(n_i, t_i | γ) = t_i · Σ_{b=1}^{B²} γ_b Φ_b(n_i)

where γ is the illumination parameter of the face model, n_i is the normal vector of any vertex v_i of the face model, t_i is the texture value of vertex v_i, the irradiance of vertex v_i is denoted C(n_i, t_i | γ), Φ_b are the spherical harmonic basis functions, and γ_b are the spherical harmonic coefficients (B = 3);

therefore, the reconstruction process of the three-dimensional face model can be expressed as an optimization solution for the face model parameters, and the training process of the three-dimensional face model based on the convolutional neural network can be expressed as the optimization problem of the following formulas (1) and (2):

L_coef(x) = ω_α ‖α‖² + ω_β ‖β‖² + ω_δ ‖δ‖²    (1)

L_tex(x) = Σ_{c∈{r,g,b}} var(T_c(R(x)))    (2)

where L_coef and L_tex are the regularized optimization functions of the coefficients of the three-dimensional face model, ω_α, ω_β and ω_δ are the weights corresponding to the face shape coefficient, expression coefficient and texture coefficient respectively, c ∈ {r, g, b} indicates that the picture is an RGB picture, T_c denotes the face texture vector of channel c, var() denotes the variance, and R(x) denotes the skin area of the face containing the cheeks, nose and forehead.
As shown in FIG. 2, the speech-to-expression and head pose mapping network H is trained using the target video and the parameters of the face model, specifically:

extracting the audio in the target video and converting it into Mel-frequency cepstral coefficients; inputting the converted Mel-frequency cepstral coefficients into the pre-trained audio high-level feature extraction network to obtain the high-level feature F_t; then using F_t, together with the β_t and P_t of the face model parameters obtained after the convolutional neural network optimization, as the training data set to train the speech-to-expression and head pose mapping network H; the trained mapping network H extracts two face estimation parameters from the audio, corresponding respectively to the expression coefficients β = {β^(1), ..., β^(64)} and the head pose coefficients P = {P^(1), ..., P^(6)} of the three-dimensional face deformation model; the training process of the mapping network H can be regarded as the optimization of the mean square error loss of the expression parameters L_exp and the mean square error loss of the head pose parameters L_pose, as shown in formulas (3) and (4):

L_exp = MSE(H(F_t)_β, β_t)    (3)

L_pose = MSE(H(F_t)_P, P_t)    (4)

where MSE() denotes the mean square error function, F_t denotes the high-level feature input to the network at time t, β_t is the expression parameter of the target video at time t, and P_t is the head pose parameter of the target video at time t.
The pre-trained audio high-level feature extraction network is specifically as follows:
the audio high-level feature extraction network uses an AT-net network as its backbone and is trained on The Oxford-BBC Lip Reading in the Wild dataset.
As shown in fig. 5 to 6, the method for acquiring a parameterized face image specifically includes the following steps:
extracting the target video into video frames and cropping the whole face from each frame to obtain the face images I^(1), I^(2), ..., I^(n); after the n face images are optimized by the convolutional neural network, for any image I^(k), k ∈ n, the face image represented by the parameterized model and the corresponding shape parameter α_video, expression parameter β_video, texture parameter δ_video, illumination parameter γ_video and head pose parameter P_video are obtained.
Replacing parameters in the parameterized face image according to the acquired parameters of the facial expression and the head posture, which specifically comprises the following steps:
the expression coefficients β_audio and head pose coefficients P_audio obtained from the audio replace the corresponding face parameters in each video frame in time-frame order, synthesizing the new face X = {α_video, β_audio, δ_video, γ_video, P_audio}.
The synthesizing obtains the face image of each frame and renders the face image to generate a vivid face video frame, which specifically comprises the following steps:
a 3D mesh renderer is used so that each mesh transitions uniformly and smoothly, thereby obtaining realistic and natural face images, and the further-rendered face images are stored as the corresponding video frames.
The rendering network based on a generative adversarial network is trained using the parameterized face images and the face images in the video frames, specifically:

the parameterized face image I^(k) obtained from the target video frame and the original video frame Î^(k) corresponding to the same moment are combined into sequence data pairs (I^(k), Î^(k)) and input into a rendering network consisting of a generator G and a discriminator D for pre-training; the pre-training process is the overall optimization process of the generator and discriminator of the generative adversarial network and can be represented by formula (5):

G* = arg min_G max_D L_GAN(G, D)    (5)

where G* denotes the overall optimization objective of the rendering network;
the complete training objective comprises the reconstruction loss L_rec and the adversarial loss L_adv, and training is represented as formula (6):

G* = arg min_G max_D L_adv(G, D) + λ L_rec(G)    (6)

where the weighting parameter λ = 100; the reconstruction loss L_rec of formula (7) and the adversarial loss L_adv of formula (8) are as follows:

L_rec(G) = E[ ‖ Î^(k) − G(I^(k)) ‖₁ ]    (7)

L_adv(G, D) = E[ log D(Î^(k)) ] + E[ log(1 − D(G(I^(k)))) ]    (8)

The rendering network after this optimization training can generate a background for each frame's face image.
Performing face background rendering and video synthesis based on video keyframe optimization, as shown in fig. 3, when the number of original video frames exceeds a threshold, specifically:
calculating the Euclidean distance H between the face image I_t synthesized at time t and the original video frames in which only the head region remains after mask processing, and taking the frame for which H attains its minimum value, i.e. the matching frame in the original video whose head pose deviation is closest to the pose of the synthesized face, to form a rendering data pair with I_t; obtaining the matching frames of all frames in this way and combining them into rendering sequence data pairs; inputting the obtained rendering sequence data pairs into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the source-audio expression coefficients β_audio and head pose coefficients P_audio; finally, the complete speech video of the synthesized face, which on the basis of the target person's identity contains the source audio, the audio-driven expression and the head pose, is synthesized in video-frame order.
Performing face background rendering and video synthesis based on video keyframe optimization, as shown in fig. 4, when the number of original video frames does not exceed a threshold, specifically:
selecting the video frames with large head pose offsets from the synthesized face video frames as key frames and arranging them into a key frame sequence in time order; calculating the Euclidean distance H between the face image I_t synthesized at time t and the original video frames in which only the head region remains after mask processing, and taking the frame for which H attains its minimum value as the most closely matching frame, so as to obtain the matching frame sequence; then obtaining the video frames between the matching frames through an OpenCV-based linear interpolation algorithm, thereby synthesizing a complete video frame sequence; this video frame sequence and the synthesized face video frames form the rendering sequence data pairs, which are input into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the source-audio expression coefficients β_audio and head pose coefficients P_audio; finally, the complete speech video of the synthesized face, which on the basis of the target person's identity contains the source audio, the audio-driven expression and the head pose, is synthesized in video-frame order.
In a specific embodiment:
the advanced feature extraction network for extracting The audio advanced feature F _ t is trained on The Oxford-BBC Lip Reading in The Wild (LRW) Dataset by taking an AT-net network as a backbone, and The Dataset contains 1000 pronunciations of up to 500 different words and is spoken by hundreds of different speakers. Our audio-to-expression and head pose mapping network is pre-trained with the data set as the sum P _ t reconstructed from the target video frame and F _ t extracted from the speech of the target video. The three-dimensional Face reconstruction Model based on the convolutional neural network takes a ResNet-50 network as a backbone, represents a Face based on an Expression Basis Model in a Basel Face Model 2009 and a Facewarehouse, further renders a reconstructed synthetic Face by a mesh render in a tensoflow, and is trained on a data set 300 WLP. The face background rendering network based on video key frame optimization is composed of a generator and a discriminator, and a training set is a data pair composed of an original video frame and a synthesized face image reconstructed from the original video frame.
Pre-training the audio-to-expression and head pose mapping network and the convolutional-neural-network-based three-dimensional face reconstruction model: since the LRW training set and the 300W-LP training set are public, the data sets are downloaded separately and input into the two networks for pre-training, and the model parameters are saved once learning has finished after a small number of iterations of the two networks.
Processing the input data: first, the input audio is converted into Mel-frequency cepstral coefficients (MFCCs) and stored; the audio and video frames are extracted from the input target video, the face in each video frame is identified, and the face is cropped and stored as a face video frame; the image cropped from each face video frame is a 256 × 256 RGB image, denoted I_t.
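This preprocessing step could be implemented roughly as follows, assuming librosa for the MFCC conversion and an OpenCV Haar-cascade detector for the face crop; both library choices and the 16 kHz sampling rate are assumptions for the example, since the patent does not name specific tools for this step.

    import cv2
    import librosa

    def audio_to_mfcc(wav_path, n_mfcc=13):
        y, sr = librosa.load(wav_path, sr=16000)          # mono waveform at an assumed 16 kHz
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    def extract_face_frames(video_path, size=256):
        detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        cap = cv2.VideoCapture(video_path)
        faces = []
        ok, frame = cap.read()
        while ok:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            boxes = detector.detectMultiScale(gray, 1.1, 5)
            if len(boxes) > 0:
                x, y, w, h = boxes[0]
                crop = cv2.resize(frame[y:y + h, x:x + w], (size, size))  # 256x256 face crop I_t
                faces.append(crop)
            ok, frame = cap.read()
        cap.release()
        return faces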
Face synthesis stage: the 256 × 256 RGB face image I_t is input into the pre-trained convolutional-neural-network-based three-dimensional face deformation model to generate the reconstructed three-dimensional face image of each frame together with the corresponding shape parameter α_video, expression parameter β_video, texture parameter δ_video, illumination parameter γ_video and head pose parameter P_video (where the head pose parameter P is composed of the rotation parameter R and the translation parameter T among the camera parameters), and these parameters are saved. Then the MFCCs are input into the audio-to-expression and head pose mapping network to obtain the expression parameter β_audio and head pose parameter P_audio, which are stored; both then replace, in time-frame order, the expression parameter β_video and head pose parameter P_video reconstructed from the target video, and the face parameters {α_video, β_audio, δ_video, γ_video, P_audio} are input into the face synthesis model, rendered by the 3D mesh renderer, and the new face X = {α_video, β_audio, δ_video, γ_video, P_audio} is output and stored in time-frame order.
Face background rendering and video synthesis based on video key frame optimization: the image I_t and the synthesized face image corresponding to time t from the face synthesis stage above are set as a data pair, and these data pairs are used to train the rendering network based on the generative adversarial network for 100 epochs; after learning finishes, the model is saved. The number of frames of the target video is then calculated. If it is below 5000 frames, mode 2 of the video key frame optimization method is adopted: the key frame sequence is obtained, the matching frame sequence in the target video frames is obtained by calculating the minimum Euclidean distance, the sequence is input into the pre-trained rendering network, and the video frames between the matching frames of the output face frames are obtained through the OpenCV-based linear interpolation algorithm, thereby synthesizing the complete synthesized-face video frame sequence. If the target video has more than 5000 frames, mode 1 is adopted, and the complete synthesized-face video frame sequence is obtained in the same way. Finally, the video frames and the audio are combined into a complete video, generating a speech video which, on the basis of the target video, contains the source-audio expression and head pose, with a natural synthesized face and a clear background.
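Writing the synthesized frames to a video file and attaching the source audio track could be done as in the following sketch, using OpenCV's VideoWriter and an external ffmpeg call; the codec, frame rate and file names are illustrative assumptions, not values specified by the patent.

    import cv2
    import subprocess

    def write_video(frames, out_path="synth_silent.mp4", fps=25):
        h, w = frames[0].shape[:2]
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        for f in frames:
            writer.write(f)
        writer.release()
        return out_path

    def mux_audio(video_path, audio_path, out_path="synth_final.mp4"):
        # Combine the silent frame sequence with the source audio track.
        subprocess.run(["ffmpeg", "-y", "-i", video_path, "-i", audio_path,
                        "-c:v", "copy", "-c:a", "aac", "-shortest", out_path], check=True)
        return out_path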
The same or similar reference numerals correspond to the same or similar parts;
the terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization is characterized by comprising the following steps:
optimizing and fitting each parameter of the three-dimensional face deformation model to the input face image by adopting a convolutional neural network to realize the parameterized reconstruction of the face model;
training a speech-to-expression and head pose mapping network H using the target video and the parameters of the face model, and using the trained mapping network H to obtain facial expression and head pose parameters from the input audio, wherein the target video comprises video frames and the audio corresponding to the video frames, and the video frames contain face images;
replacing parameters in the parameterized face image according to the acquired facial expression and head posture parameters, synthesizing to obtain a face image of each frame, rendering, and generating a vivid face video frame;
training a rendering network based on a generative adversarial network using the parameterized face images and the face images in the video frames, wherein the rendering network is used to generate a background for each frame's face image;
and performing face background rendering and video synthesis based on video key frame optimization to obtain high-quality synthesized face video frames that contain the source-audio facial expression and head pose parameters and have a clear portrait and background, and synthesizing the complete speech video of the synthesized face in video-frame order.
2. The method for generating a composite video based on three-dimensional face reconstruction and video keyframe optimization according to claim 1, wherein the method for optimizing and fitting each parameter of the three-dimensional face deformation model to the input face image by using the convolutional neural network to realize the parameterized reconstruction of the face model comprises the following steps:
recognizing the face in the face image, marking it with 68 landmark points, representing the face with the three-dimensional face deformation model, and parameterizing it as a triangular mesh model consisting of 35709 vertices;
converting a two-dimensional face image I into a three-dimensional face parameterized model X, expressed as:

X = (α, β, δ, γ, P)

where α is the shape parameter of the three-dimensional face model, δ is the texture parameter, β is the expression parameter, γ is the illumination parameter, and P is the head pose parameter, represented by the rotation parameter R and the translation parameter T derived from the camera model; the shape of any face picture is represented by the three-dimensional face shape parameterized model as:

S(α, β) = S̄ + B_shape α + B_exp β

where B_shape is the face shape basis vector and B_exp is the facial expression basis vector;
the texture of the face is represented as:

T(δ) = T̄ + B_tex δ

where B_tex is the face texture basis vector, and S̄ and T̄ are respectively the average shape and average texture of the face model;
the illumination model of the face is represented as:

C(n_i, t_i | γ) = t_i · Σ_{b=1}^{B²} γ_b Φ_b(n_i)

where γ is the illumination parameter of the face model, n_i is the normal vector of any vertex v_i of the face model, t_i is the texture value of vertex v_i, the irradiance of vertex v_i is denoted C(n_i, t_i | γ), Φ_b are the spherical harmonic basis functions, and γ_b are the spherical harmonic coefficients;

therefore, the reconstruction process of the three-dimensional face model can be expressed as an optimization solution for the face model parameters, and the training process of the three-dimensional face model based on the convolutional neural network can be expressed as the optimization problem of the following formulas (1) and (2):

L_coef(x) = ω_α ‖α‖² + ω_β ‖β‖² + ω_δ ‖δ‖²    (1)

L_tex(x) = Σ_{c∈{r,g,b}} var(T_c(R(x)))    (2)

where L_coef and L_tex are the regularized optimization functions of the coefficients of the three-dimensional face model, ω_α, ω_β and ω_δ are the weights corresponding to the face shape coefficient, expression coefficient and texture coefficient respectively, c ∈ {r, g, b} indicates that the picture is an RGB picture, T_c denotes the face texture vector of channel c, var() denotes the variance, and R(x) denotes the skin area of the face containing the cheeks, nose and forehead.
3. The method of claim 2, wherein the speech-to-expression and head pose mapping network H is trained using the target video and the parameters of the face model, specifically:

extracting the audio in the target video and converting it into Mel-frequency cepstral coefficients; inputting the converted Mel-frequency cepstral coefficients into a pre-trained audio high-level feature extraction network to obtain the high-level feature F_t; then using F_t, together with the β_t and P_t generated by fitting the three-dimensional face deformation model to the input face images with the convolutional neural network, as the training data set to train the speech-to-expression and head pose mapping network H; the trained mapping network H extracts two face estimation parameters from the audio, corresponding respectively to the expression coefficients β = {β^(1), ..., β^(64)} and the head pose coefficients P = {P^(1), ..., P^(6)} of the three-dimensional face deformation model; the training process of the mapping network H can be regarded as the optimization of the mean square error loss of the expression parameters L_exp and the mean square error loss of the head pose parameters L_pose, as shown in formulas (3) and (4):

L_exp = MSE(H(F_t)_β, β_t)    (3)

L_pose = MSE(H(F_t)_P, P_t)    (4)

where MSE() denotes the mean square error function, F_t denotes the high-level feature input to the network at time t, β_t is the expression parameter of the target video at time t, and P_t is the head pose parameter of the target video at time t.
4. The method for generating a composite video based on three-dimensional face reconstruction and video keyframe optimization according to claim 3, wherein the pre-trained audio high-level feature extraction network is specifically as follows:
the audio high-level feature extraction network uses an AT-net network as its backbone and is trained on The Oxford-BBC Lip Reading in the Wild dataset.
5. The method for generating a composite video based on three-dimensional face reconstruction and video keyframe optimization according to claim 4, wherein the method for obtaining the parameterized face image specifically comprises the following steps:
extracting the target video into video frames and cropping the whole face from each frame to obtain the face images I^(1), I^(2), ..., I^(n); after the n face images are optimized by the convolutional neural network, for any image I^(k), k ∈ n, the face image represented by the parameterized model and the corresponding shape parameter α_video, expression parameter β_video, texture parameter δ_video, illumination parameter γ_video and head pose parameter P_video are obtained.
6. The method for generating a composite video based on three-dimensional face reconstruction and video keyframe optimization according to claim 5, wherein the parameters in the parameterized face images are replaced according to the acquired facial expression and head pose parameters, specifically:
the expression coefficients β_audio and head pose coefficients P_audio obtained from the audio replace the corresponding face parameters in each video frame in time-frame order, synthesizing the new face X = {α_video, β_audio, δ_video, γ_video, P_audio}.
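This replacement step amounts to swapping the expression and head-pose coefficients of each fitted frame while keeping identity, texture and illumination. A small sketch follows; the container type and names are hypothetical.

```python
from dataclasses import dataclass, replace
import numpy as np

@dataclass
class FaceParams:
    alpha: np.ndarray   # shape coefficients (identity), kept from the video
    beta:  np.ndarray   # expression coefficients (64,)
    delta: np.ndarray   # texture coefficients, kept from the video
    gamma: np.ndarray   # illumination coefficients, kept from the video
    pose:  np.ndarray   # head pose coefficients (6,)

def drive_with_audio(video_params, beta_audio, pose_audio):
    """Replace expression and head pose of each fitted video frame with the
    audio-predicted coefficients, forming X = {alpha_video, beta_audio,
    delta_video, gamma_video, P_audio} frame by frame."""
    return [replace(p, beta=b, pose=q)
            for p, b, q in zip(video_params, beta_audio, pose_audio)]
```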
7. The method for generating a composite video based on three-dimensional face reconstruction and video key frame optimization according to claim 6, wherein the synthesized face image of each frame is rendered to generate a realistic face video frame, specifically:
a 3D mesh renderer is used so that each mesh transitions uniformly and smoothly, thereby obtaining vivid and natural face images, and the further-rendered face images are stored as the corresponding video frames.
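The claim only specifies "a 3D mesh renderer" without naming one. The sketch below assumes PyTorch3D as one possible choice, with camera, lighting and head-pose setup reduced to defaults for brevity; it is an illustration, not the patent's renderer.

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRenderer, MeshRasterizer,
    SoftPhongShader, PointLights, TexturesVertex,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def render_face(verts, faces, vert_colors, image_size=256):
    """Rasterize and shade a reconstructed face mesh into an RGBA image.
    verts: (V, 3) float tensor, faces: (F, 3) long tensor, vert_colors: (V, 3)."""
    mesh = Meshes(verts=[verts], faces=[faces],
                  textures=TexturesVertex(verts_features=[vert_colors]))
    cameras = FoVPerspectiveCameras(device=device)   # default camera; pose omitted
    raster_settings = RasterizationSettings(image_size=image_size,
                                            blur_radius=0.0, faces_per_pixel=1)
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
        shader=SoftPhongShader(device=device, cameras=cameras,
                               lights=PointLights(device=device)),
    )
    return renderer(mesh.to(device))   # (1, H, W, 4) rendered face frame
```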
8. The method for generating a composite video based on three-dimensional face reconstruction and video keyframe optimization according to claim 7, wherein the rendering network based on a generative adversarial network is trained using the parameterized face images and the face images in the video frames, specifically:
the parameterized face image I^(k) obtained from the target video frame and the original video frame corresponding to the same time are combined into sequence data pairs and input into a rendering network consisting of a generator G and a discriminator D for pre-training; the pre-training process is the overall optimization process of the generator and discriminator of the generative adversarial network and can be represented by formula (5):
G* = arg min_G max_D L(G, D)    (5)
where G* represents the overall optimization objective of the rendering network;
the complete training objective includes the reconstruction loss L_rec and the adversarial loss L_adv, and training is represented as formula (6):
L(G, D) = L_adv(G, D) + λ L_rec(G)    (6)
where the weighting parameter λ = 100; the reconstruction loss L_rec of formula (7) and the adversarial loss L_adv of formula (8) are as follows:
L_rec(G) = E_(x,y)[‖y − G(x)‖_1]    (7)
L_adv(G, D) = E_(x,y)[log D(x, y)] + E_x[log(1 − D(x, G(x)))]    (8)
where x denotes the input parameterized face image, y denotes the corresponding real video frame, and G(x) denotes the frame generated by the rendering network; the rendering network subjected to this optimization training can generate the background for the face image of each frame.
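A minimal training-step sketch consistent with formulas (5)-(8): a conditional GAN whose generator loss combines the adversarial term with an L1 reconstruction term weighted by λ = 100. The generator and discriminator architectures below are placeholders, since the claim does not specify them.

```python
import torch
import torch.nn as nn

# Placeholder generator and discriminator; the patent does not specify architectures.
G = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())
D = nn.Sequential(nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(64, 1, 4, stride=2, padding=1))   # patch-style critic

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
lam = 100.0   # weighting parameter lambda from formula (6)

def gan_train_step(x, y):
    """x: parameterized face frame, y: real video frame, both (N, 3, H, W)."""
    # Discriminator: real pair (x, y) vs. generated pair (x, G(x)).
    fake = G(x)
    d_real = D(torch.cat([x, y], dim=1))
    d_fake = D(torch.cat([x, fake.detach()], dim=1))
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: adversarial loss + lambda * L1 reconstruction loss.
    d_fake = D(torch.cat([x, fake], dim=1))
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + lam * l1(fake, y)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```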
9. The method for generating a composite video based on three-dimensional face reconstruction and video key frame optimization according to claim 8, wherein face background rendering and video composition are performed based on video keyframe optimization, and when the number of original video frames exceeds a threshold, the method specifically comprises:
calculating the Euclidean distance H between the face image I_t synthesized at time t and the video frame in which only the head region remains after masking the original video frame, and taking the frame for which H attains its minimum value, that is, the matching frame in the original video whose head pose deviation is closest to the synthesized face pose, to form a rendering data pair with the synthesized frame; obtaining the matching frames of all frames in this way and combining them into rendering sequence data pairs;
inputting the obtained rendering sequence data pairs into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the expression coefficients β_audio and head pose coefficients P_audio of the source audio, and finally, according to the order of the video frames, synthesizing, on the basis of the identity of the target person, a complete talking video of the synthesized face containing the source audio, the audio-driven expression and the head pose.
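A brute-force sketch of the matching-frame search described above: for each synthesized face frame, the original frame whose masked head region has the smallest Euclidean distance is selected. Applying the same mask to the synthesized frame is an assumption made here for a like-for-like comparison; all names are illustrative.

```python
import numpy as np

def find_matching_frames(synth_faces, original_frames, head_masks):
    """For each synthesized face frame, pick the original frame whose masked
    head region minimizes the Euclidean distance H (the 'matching frame')."""
    matches = []
    for face in synth_faces:                        # each (H, W, 3) float array
        best_idx, best_dist = None, np.inf
        for idx, (frame, mask) in enumerate(zip(original_frames, head_masks)):
            head_only = frame * mask                # keep only the head region
            dist = np.linalg.norm((face * mask - head_only).ravel())
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        matches.append(best_idx)
    return matches   # index of the matching frame for each synthesized frame
```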
10. The method for generating a composite video based on three-dimensional face reconstruction and video key frame optimization according to claim 8, wherein face background rendering and video composition are performed based on video keyframe optimization, and when the number of original video frames does not exceed the threshold, the method specifically comprises:
selecting the video frames with large head pose offset from the synthesized face video frames as keyframes and arranging them in time order to obtain a keyframe sequence; calculating the Euclidean distance H between the face image I_t synthesized at time t and the video frame in which only the head region remains after masking the original video frame, and taking the frame for which H attains its minimum value as the most closely matching frame, thereby obtaining a matching-frame sequence; then obtaining the video frames between the matching frames through an OpenCV-based linear interpolation algorithm so as to synthesize a complete video frame sequence; combining this video frame sequence and the synthesized-face video frames into rendering sequence data pairs;
inputting the obtained rendering sequence data pairs into the pre-trained rendering network based on the generative adversarial network to obtain the synthesized face X = {α_video, β_audio, δ_video, γ_video, P_audio} containing the expression coefficients β_audio and head pose coefficients P_audio of the source audio, and finally, according to the order of the video frames, synthesizing, on the basis of the identity of the target person, a complete talking video of the synthesized face containing the source audio, the audio-driven expression and the head pose.
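The claim refers to an "OpenCV-based linear interpolation algorithm" without further detail. The sketch below shows one simple reading of it, a linear cross-dissolve between consecutive matched frames using cv2.addWeighted; the number of in-between frames is a hypothetical parameter.

```python
import cv2

def interpolate_between_keyframes(matched_frames, steps_between=4):
    """Linearly blend between consecutive matched frames to fill in the
    missing background frames (simple cross-dissolve interpolation)."""
    filled = []
    for a, b in zip(matched_frames[:-1], matched_frames[1:]):
        filled.append(a)
        for i in range(1, steps_between + 1):
            t = i / (steps_between + 1)
            # weighted blend: (1 - t) * a + t * b
            filled.append(cv2.addWeighted(a, 1.0 - t, b, t, 0.0))
    filled.append(matched_frames[-1])
    return filled   # complete video frame sequence between keyframes
```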
CN202110610539.9A 2021-06-01 2021-06-01 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization Pending CN113269872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110610539.9A CN113269872A (en) 2021-06-01 2021-06-01 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization

Publications (1)

Publication Number Publication Date
CN113269872A true CN113269872A (en) 2021-08-17

Family

ID=77233988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110610539.9A Pending CN113269872A (en) 2021-06-01 2021-06-01 Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization

Country Status (1)

Country Link
CN (1) CN113269872A (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254336A (en) * 2011-07-14 2011-11-23 清华大学 Method and device for synthesizing face video
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
CN103279970A (en) * 2013-05-10 2013-09-04 中国科学技术大学 Real-time human face animation driving method by voice
CN107067429A (en) * 2017-03-17 2017-08-18 徐迪 Video editing system and method that face three-dimensional reconstruction and face based on deep learning are replaced
CN108230438A (en) * 2017-12-28 2018-06-29 清华大学 The facial reconstruction method and device of sound driver secondary side face image
CN108510437A (en) * 2018-04-04 2018-09-07 科大讯飞股份有限公司 A kind of virtual image generation method, device, equipment and readable storage medium storing program for executing
CN109035394A (en) * 2018-08-22 2018-12-18 广东工业大学 Human face three-dimensional model method for reconstructing, device, equipment, system and mobile terminal
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
CN110599573A (en) * 2019-09-03 2019-12-20 电子科技大学 Method for realizing real-time human face interactive animation based on monocular camera
CN110942502A (en) * 2019-11-29 2020-03-31 中山大学 Voice lip fitting method and system and storage medium
CN111243626A (en) * 2019-12-30 2020-06-05 清华大学 Speaking video generation method and system
CN111294665A (en) * 2020-02-12 2020-06-16 百度在线网络技术(北京)有限公司 Video generation method and device, electronic equipment and readable storage medium
CN111508064A (en) * 2020-04-14 2020-08-07 北京世纪好未来教育科技有限公司 Expression synthesis method and device based on phoneme driving and computer storage medium
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN112188304A (en) * 2020-09-28 2021-01-05 广州酷狗计算机科技有限公司 Video generation method, device, terminal and storage medium
CN112420014A (en) * 2020-11-17 2021-02-26 平安科技(深圳)有限公司 Virtual face construction method and device, computer equipment and computer readable medium
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics
CN112766160A (en) * 2021-01-20 2021-05-07 西安电子科技大学 Face replacement method based on multi-stage attribute encoder and attention mechanism

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CAO FAXIAN, YANG ZHIJING, et al.: "Convolutional neural network extreme learning machine for effective classification of hyperspectral images", 《JOURNAL OF APPLIED REMOTE SENSING》, vol. 12, no. 3, 5 July 2018 (2018-07-05), pages 1 - 20 *
RAN YI, ZIPENG YE, et al.: "Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose", 《HTTPS://ARXIV.ORG/ABS/2002.10137》, 5 March 2020 (2020-03-05), pages 1 - 12 *
YU DENG, JIAOLONG YANG, et al.: "Accurate 3D Face Reconstruction with Weakly-Supervised Learning: From Single Image to Image Set", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS》, 9 April 2020 (2020-04-09), pages 285 - 295 *
LI XINYI et al.: "A survey of speech-driven facial animation" (in Chinese), 《COMPUTER ENGINEERING AND APPLICATIONS》, no. 22, 15 November 2017 (2017-11-15), pages 26 - 33 *

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11574655B2 (en) 2021-05-26 2023-02-07 Flawless Holdings Limited Modification of objects in film
GB2607140A (en) * 2021-05-26 2022-11-30 Flawless Holdings Ltd Modification of objects in film
US11699464B2 (en) 2021-05-26 2023-07-11 Flawless Holdings Limited Modification of objects in film
GB2607140B (en) * 2021-05-26 2024-04-10 Flawless Holdings Ltd Modification of objects in film
US11715495B2 (en) 2021-05-26 2023-08-01 Flawless Holdings Limited Modification of objects in film
CN113822969A (en) * 2021-09-15 2021-12-21 宿迁硅基智能科技有限公司 Method, device and server for training neural radiance field model and face generation
CN113822969B (en) * 2021-09-15 2023-06-09 宿迁硅基智能科技有限公司 Training neural radiance field model, face generation method, device and server
CN113838173A (en) * 2021-09-23 2021-12-24 厦门大学 Virtual human head motion synthesis method driven by voice and background sound
CN113838173B (en) * 2021-09-23 2023-08-22 厦门大学 Virtual human head motion synthesis method driven by combination of voice and background sound
CN114419702B (en) * 2021-12-31 2023-12-01 南京硅基智能科技有限公司 Digital person generation model, training method of model, and digital person generation method
CN114419702A (en) * 2021-12-31 2022-04-29 南京硅基智能科技有限公司 Digital human generation model, training method of model, and digital human generation method
CN114049678A (en) * 2022-01-11 2022-02-15 之江实验室 Facial motion capturing method and system based on deep learning
CN114332136A (en) * 2022-03-15 2022-04-12 南京甄视智能科技有限公司 Face attribute data labeling method, computer equipment and storage medium
CN114782864B (en) * 2022-04-08 2023-07-21 马上消费金融股份有限公司 Information processing method, device, computer equipment and storage medium
CN114898244B (en) * 2022-04-08 2023-07-21 马上消费金融股份有限公司 Information processing method, device, computer equipment and storage medium
CN114782864A (en) * 2022-04-08 2022-07-22 马上消费金融股份有限公司 Information processing method and device, computer equipment and storage medium
CN114821404A (en) * 2022-04-08 2022-07-29 马上消费金融股份有限公司 Information processing method and device, computer equipment and storage medium
CN114898244A (en) * 2022-04-08 2022-08-12 马上消费金融股份有限公司 Information processing method and device, computer equipment and storage medium
CN114821404B (en) * 2022-04-08 2023-07-25 马上消费金融股份有限公司 Information processing method, device, computer equipment and storage medium
CN114648613A (en) * 2022-05-18 2022-06-21 杭州像衍科技有限公司 Three-dimensional head model reconstruction method and device based on deformable neural radiance field
CN114648613B (en) * 2022-05-18 2022-08-23 杭州像衍科技有限公司 Three-dimensional head model reconstruction method and device based on deformable neural radiance field
CN115294622A (en) * 2022-06-15 2022-11-04 北京邮电大学 Method, system and storage medium for synthesizing and enhancing voice-driven speaker head motion video
WO2024055379A1 (en) * 2022-09-16 2024-03-21 粤港澳大湾区数字经济研究院(福田) Video processing method and system based on character avatar model, and related device
WO2024078243A1 (en) * 2022-10-13 2024-04-18 腾讯科技(深圳)有限公司 Training method and apparatus for video generation model, and storage medium and computer device
US11830159B1 (en) 2022-12-08 2023-11-28 Flawless Holding Limited Generative films
CN115984943A (en) * 2023-01-16 2023-04-18 支付宝(杭州)信息技术有限公司 Facial expression capturing and model training method, device, equipment, medium and product
CN115984943B (en) * 2023-01-16 2024-05-14 支付宝(杭州)信息技术有限公司 Facial expression capturing and model training method, device, equipment, medium and product
CN115909015A (en) * 2023-02-15 2023-04-04 苏州浪潮智能科技有限公司 Construction method and device of deformable neural radiance field network
WO2024169314A1 (en) * 2023-02-15 2024-08-22 苏州元脑智能科技有限公司 Method and apparatus for constructing deformable neural radiance field network
CN116091668B (en) * 2023-04-10 2023-07-21 广东工业大学 Talking head video generation method based on emotion feature guidance
CN116091668A (en) * 2023-04-10 2023-05-09 广东工业大学 Talking head video generation method based on emotion feature guidance
CN116152447B (en) * 2023-04-21 2023-09-26 科大讯飞股份有限公司 Face modeling method and device, electronic equipment and storage medium
CN116152447A (en) * 2023-04-21 2023-05-23 科大讯飞股份有限公司 Face modeling method and device, electronic equipment and storage medium
CN116342760A (en) * 2023-05-25 2023-06-27 南昌航空大学 Three-dimensional facial animation synthesis method, system, electronic equipment and storage medium
CN116721194A (en) * 2023-08-09 2023-09-08 瀚博半导体(上海)有限公司 Face rendering method and device based on generation model
CN116721194B (en) * 2023-08-09 2023-10-24 瀚博半导体(上海)有限公司 Face rendering method and device based on generation model
CN117392292A (en) * 2023-10-20 2024-01-12 联通在线信息科技有限公司 3D digital person generation method and system
CN117392292B (en) * 2023-10-20 2024-04-30 联通在线信息科技有限公司 3D digital person generation method and system
CN117292030A (en) * 2023-10-27 2023-12-26 海看网络科技(山东)股份有限公司 Method and system for generating three-dimensional digital human animation
CN117593442A (en) * 2023-11-28 2024-02-23 拓元(广州)智慧科技有限公司 Portrait generation method based on multi-stage fine grain rendering
CN117593442B (en) * 2023-11-28 2024-05-03 拓元(广州)智慧科技有限公司 Portrait generation method based on multi-stage fine grain rendering

Similar Documents

Publication Publication Date Title
CN113269872A (en) Synthetic video generation method based on three-dimensional face reconstruction and video key frame optimization
Guo et al. Ad-nerf: Audio driven neural radiance fields for talking head synthesis
CN112887698B (en) High-quality face voice driving method based on neural radiance field
CN101055647B (en) Method and device for processing image
CN101324961B (en) Human face portion three-dimensional picture pasting method in computer virtual world
CN117496072B (en) Three-dimensional digital person generation and interaction method and system
CN113362422B (en) Shadow robust makeup transfer system and method based on decoupling representation
KR102353556B1 (en) Apparatus for Generating Facial expressions and Poses Reappearance Avatar based in User Face
CN115209180A (en) Video generation method and device
CN115914505B (en) Video generation method and system based on voice-driven digital human model
JP2009104570A (en) Data structure for image formation and method of forming image
CN111640172A (en) Attitude migration method based on generation of countermeasure network
CN115239857B (en) Image generation method and electronic device
CN116721190A (en) Voice-driven three-dimensional face animation generation method
CN116385606A (en) Speech signal driven personalized three-dimensional face animation generation method and application thereof
CN112862672B (en) Liu-bang generation method, device, computer equipment and storage medium
CN113947520A (en) Method for realizing face makeup conversion based on generation of confrontation network
Tang et al. Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar
Nguyen-Phuoc et al. Alteredavatar: Stylizing dynamic 3d avatars with fast style adaptation
Sun et al. SSAT $++ $: A Semantic-Aware and Versatile Makeup Transfer Network With Local Color Consistency Constraint
CN117671090A (en) Expression processing method and device, electronic equipment and storage medium
Wang et al. Uncouple generative adversarial networks for transferring stylized portraits to realistic faces
WO2024055379A1 (en) Video processing method and system based on character avatar model, and related device
CN116402928B (en) Virtual talking digital person generating method
Kumar et al. Multi modal adaptive normalization for audio to video generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210817