
Three-dimensional face model driving method based on voice and related device

Info

Publication number
CN116188649B
CN116188649B
Authority
CN
China
Prior art keywords
dimensional
model
voice
offset
target
Prior art date
Legal status
Active
Application number
CN202310472056.6A
Other languages
Chinese (zh)
Other versions
CN116188649A (en)
Inventor
杨硕
何山
殷兵
刘聪
周良
胡金水
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN202310472056.6A
Publication of CN116188649A
Application granted
Publication of CN116188649B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 - Animation
    • G06T13/20 - 3D [Three Dimensional] animation
    • G06T13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a voice-based three-dimensional face model driving method and a related device. Based on the voice features and target emotion features of a target voice, three-dimensional model vertex offset prediction is performed according to offset prediction parameters, and a three-dimensional basic face model is driven according to the predicted three-dimensional model vertex offset data to obtain the three-dimensional face animation corresponding to the target voice. The offset prediction parameters are determined by performing three-dimensional model vertex offset prediction processing on the 4D synthesized data, audio features and emotion features corresponding to a sample video; the 4D synthesized data are data obtained by assembling, at the frame rate of the sample video, the three-dimensional reconstructed face models corresponding to the individual frames of the sample video. Because the 4D synthesized data obtained by reconstructing each frame of the sample video into a three-dimensional face model serve as the sample data for determining the offset prediction parameters, the amount and emotional diversity of the sample data are increased, which improves the accuracy and emotional effect of the voice-driven three-dimensional face model.

Description

Three-dimensional face model driving method based on voice and related device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a three-dimensional face model driving method based on voice and a related device.
Background
Voice-driven facial animation generation aims to use voice information to drive a 2D facial image or a 3D face model to produce the corresponding mouth shapes or expressions. In recent years, 3D facial animation generation has attracted increasing attention in industries such as film and television production and games, and has broad application prospects.
Existing methods for driving a three-dimensional face model with voice use offset prediction parameters to predict the offset data of the three-dimensional model vertices when the voice drives the model; for example, a deep-learning-based three-dimensional offset prediction model serves as the offset prediction parameters and predicts the vertex offset data. Determining the offset prediction parameters requires performing offset analysis of the voice-driven three-dimensional model with a large number of sample three-dimensional face animations (i.e., sample 4D data); for example, the three-dimensional offset prediction model learns the mapping from voice to three-dimensional model vertex offsets directly from 4D data and therefore needs a large amount of 4D data for training. However, 4D data is generally acquired frame by frame with three-dimensional scanning equipment, and the acquisition cost is high, so the amount and diversity of the collected 4D data are insufficient. When the 4D data lack diversity, facial attributes such as emotion are difficult to control, so the emotional effect of the face model driven by the offset data predicted with the offset prediction parameters is poor; when the sample 4D data are scarce, the accuracy of the predicted offset data is low. Both factors degrade the emotional effect and accuracy of the voice-driven three-dimensional model.
Disclosure of Invention
Based on the defects and shortcomings of the prior art, the application provides a three-dimensional face model driving method based on voice and a related device, which can improve the emotional effect and accuracy of the voice-driven three-dimensional face model.
The technical scheme provided by the application is as follows:
according to a first aspect of an embodiment of the present application, there is provided a method for driving a three-dimensional face model based on speech, including:
based on the voice characteristics and the target emotion characteristics of the target voice, carrying out three-dimensional model vertex offset prediction according to predetermined offset prediction parameters to obtain three-dimensional model vertex offset data corresponding to the target voice;
driving a three-dimensional basic face model according to the three-dimensional model vertex offset data to obtain a three-dimensional face animation corresponding to the target voice;
the offset prediction parameters are determined by performing three-dimensional model vertex offset prediction processing on the 4D synthesized data, audio features and emotion features corresponding to a sample video; the 4D synthesized data corresponding to the sample video are data obtained by assembling, at the frame rate of the sample video, the three-dimensional reconstructed face models corresponding to the individual frames of the sample video.
Optionally, based on the voice feature and the target emotion feature of the target voice, performing three-dimensional model vertex offset prediction according to a predetermined offset prediction parameter to obtain three-dimensional model vertex offset data corresponding to the target voice, including:
fusing the emotion fusion features corresponding to each voice frame in the target voice with the voice features to obtain the coding features corresponding to each voice frame; the emotion fusion features corresponding to a voice frame are obtained by fusing the target emotion features with the three-dimensional model vertex offset data corresponding to the preceding voice frame;
and decoding the coding features corresponding to the voice frames in the target voice to obtain three-dimensional model vertex offset data corresponding to the voice frames in the target voice.
Optionally, based on the voice feature and the target emotion feature of the target voice, performing three-dimensional model vertex offset prediction according to a predetermined offset prediction parameter to obtain three-dimensional model vertex offset data corresponding to the target voice, including:
inputting the voice characteristics and the target emotion characteristics of the target voice into a pre-trained three-dimensional offset prediction model to obtain three-dimensional model vertex offset data corresponding to the target voice;
The three-dimensional offset prediction model is obtained by performing three-dimensional model vertex offset prediction training with the 4D synthesized data, audio features and emotion features corresponding to a sample video; the 4D synthesized data corresponding to the sample video are data obtained by assembling, at the frame rate of the sample video, the three-dimensional reconstructed face models corresponding to the individual frames of the sample video.
Optionally, the target emotion feature is obtained by extracting emotion features from a video containing the target emotion.
Optionally, the training process of the three-dimensional offset prediction model includes:
determining 4D synthesized data, audio characteristics and emotion characteristics corresponding to a pre-collected sample video;
inputting the audio features and the emotion features into the three-dimensional offset prediction model to obtain sample vertex offset data output by the three-dimensional offset prediction model;
performing model parameter adjustment on the three-dimensional offset prediction model based on a first loss function between a first three-dimensional animation and the 4D synthesized data; the first three-dimensional animation is the three-dimensional facial animation obtained after driving the three-dimensional basic facial model according to the sample vertex offset data.
Optionally, after determining the 4D synthesized data, the audio feature and the emotion feature corresponding to the pre-collected sample video, the method further includes:
and linearly interpolating the audio features according to the frame rate of the sample video, the frame rate of the audio features and a preset frame rate alignment rule.
Optionally, the training process of the three-dimensional offset prediction model further includes:
rendering the first three-dimensional animation as a 2D image sequence;
and taking the error between the 2D image sequence and the video frame sequence of the sample video as a second loss function, and performing model parameter adjustment on the three-dimensional offset prediction model.
Optionally, the training process of the three-dimensional offset prediction model further includes:
determining a sample three-dimensional face animation acquired in advance, and an animation audio feature and an animation emotion feature corresponding to the sample three-dimensional face animation;
inputting the animation audio features and the animation emotion features into a trained three-dimensional offset prediction model to obtain animation vertex offset data corresponding to the sample three-dimensional face animation;
based on a third loss function between the three-dimensional face offset animation corresponding to the animation vertex offset data and the sample three-dimensional face animation, performing model parameter adjustment on the trained three-dimensional offset prediction model; the three-dimensional face offset animation is a three-dimensional face animation after the three-dimensional basic face model is driven according to the animation vertex offset data.
Optionally, the training process of the three-dimensional offset prediction model further includes:
acquiring driving characteristics of the audio corresponding to the sample video on a face in the sample video;
calculating mutual information between the driving features and emotion features corresponding to the sample video;
and carrying out parameter adjustment on the three-dimensional offset prediction model by taking the minimization of the mutual information as a target.
Optionally, determining the 4D synthesized data, the audio feature and the emotion feature corresponding to the pre-acquired sample video includes:
carrying out three-dimensional face reconstruction on each video frame in the pre-acquired sample video to obtain three-dimensional deformation parameters corresponding to each video frame, and determining a three-dimensional reconstruction face model restored by the three-dimensional deformation parameters corresponding to each video frame;
synthesizing the three-dimensional reconstruction face model corresponding to each video frame according to the frame rate corresponding to the sample video to obtain 4D synthesized data corresponding to the sample video;
and extracting voice features from the audio corresponding to the sample video to obtain audio features corresponding to the sample video, and extracting emotion features from the sample video to obtain emotion features corresponding to the sample video.
Optionally, inputting the voice feature and the target emotion feature of the target voice into a pre-trained three-dimensional offset prediction model to obtain three-dimensional model vertex offset data corresponding to the target voice, and driving a three-dimensional basic face model according to the three-dimensional model vertex offset data to obtain a three-dimensional face animation corresponding to the target voice, including:
inputting the voice features and the target emotion features of the target voice into a pre-trained voice driving model, predicting the three-dimensional model vertex offset data corresponding to the target voice with the three-dimensional offset prediction model, and driving the three-dimensional basic face model according to the three-dimensional model vertex offset data to obtain the three-dimensional face animation corresponding to the target voice.
According to a second aspect of an embodiment of the present application, there is provided a three-dimensional face model driving apparatus based on voice, including:
the offset prediction module is configured to perform three-dimensional model vertex offset prediction according to predetermined offset prediction parameters based on the voice features and the target emotion features of the target voice, to obtain three-dimensional model vertex offset data corresponding to the target voice;
The driving module is used for driving the three-dimensional basic face model according to the three-dimensional model vertex offset data to obtain a three-dimensional face animation corresponding to the target voice;
the offset prediction parameters are determined by performing three-dimensional model vertex offset prediction processing on the 4D synthesized data, audio features and emotion features corresponding to a sample video; the 4D synthesized data corresponding to the sample video are data obtained by assembling, at the frame rate of the sample video, the three-dimensional reconstructed face models corresponding to the individual frames of the sample video.
According to a third aspect of an embodiment of the present application, there is provided an electronic apparatus including: a memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor is used for realizing the three-dimensional face model driving method based on the voice by running the program in the memory.
According to a fourth aspect of the embodiments of the present application, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described voice-based three-dimensional face model driving method.
The application provides a voice-based three-dimensional face model driving method, which includes: based on the voice features and the target emotion features of the target voice, performing three-dimensional model vertex offset prediction according to predetermined offset prediction parameters to obtain three-dimensional model vertex offset data corresponding to the target voice; and driving the three-dimensional basic face model according to the three-dimensional model vertex offset data to obtain the three-dimensional face animation corresponding to the target voice. The offset prediction parameters are determined by performing three-dimensional model vertex offset prediction processing on the 4D synthesized data, audio features and emotion features corresponding to a sample video; the 4D synthesized data corresponding to the sample video are data obtained by assembling, at the frame rate of the sample video, the three-dimensional reconstructed face models corresponding to the individual frames of the sample video. With this technical solution, each frame of the sample video is reconstructed into a three-dimensional face model, and the resulting 4D synthesized data serve as the sample data for determining the offset prediction parameters. Compared with acquiring 4D data with three-dimensional scanning equipment, this is less costly and increases the amount and diversity of the sample data, which improves the prediction effect of the offset prediction parameters and thus the accuracy of the voice-driven three-dimensional face model; the input of emotion features further improves the emotional effect of the voice-driven three-dimensional face model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application; a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of a three-dimensional face model driving method based on voice according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a process flow for training a three-dimensional offset prediction model according to an embodiment of the present application.
Fig. 3 is a schematic diagram of another process for training a three-dimensional offset prediction model according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a process for training a three-dimensional offset prediction model according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a three-dimensional face model driving device based on voice according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical scheme of the embodiment of the application is suitable for application scenes driven by the three-dimensional face model, and can improve the accuracy and the emotional effect of the three-dimensional face model driven by voice.
Voice driving of a 3D face model means driving the mouth-shape changes and/or expression changes of the 3D face model with voice so as to obtain a speaking animation of the 3D face model. At present, voice driving of a 3D face model is generally achieved by determining offset prediction parameters in advance, performing offset prediction for the 3D face model with these parameters, and driving the vertices of the 3D face model to move with the predicted offset data. The offset prediction parameters are determined by performing offset analysis of the voice-driven 3D face model on a large amount of sample data, where the sample data are animation data of 3D face models, i.e., 4D data.
At present, 4D data are generally acquired frame by frame with three-dimensional scanning equipment, and the acquisition cost is high, so the amount and diversity of the collected 4D data are insufficient. If the 4D data lack diversity, facial attributes such as emotion are difficult to control when the offset prediction parameters are determined by performing offset analysis of the voice-driven 3D face model on the 4D data, so the emotional effect of the 3D face model driven by the offset data predicted with these parameters is poor, which degrades the emotional effect of the voice-driven three-dimensional face model. If the amount of 4D data is small, the offset prediction capability of the determined offset prediction parameters is weak, which degrades the accuracy of the voice-driven three-dimensional face model.
Therefore, how to improve the emotional effect and accuracy of the voice-driven three-dimensional face model is a technical problem that needs to be solved by those skilled in the art.
In view of this, the application provides a voice-based three-dimensional face model driving method. With this technical solution, each frame of a sample video can be reconstructed into a three-dimensional face model, and the resulting 4D synthesized data serve as the sample data for determining the offset prediction parameters. Because the sample video is a 2D video, it is easy to collect and its amount and diversity are high, so the amount and diversity of the sample data are increased. This improves the prediction effect of the offset prediction parameters and the emotional effect of driving the three-dimensional face model, and solves the problem that the accuracy and emotional effect of the voice-driven three-dimensional face model are low in the prior art.
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Exemplary method
The embodiment of the application provides a three-dimensional face model driving method based on voice, which can be executed by electronic equipment, wherein the electronic equipment can be any equipment with data and instruction processing functions, such as a computer, an intelligent terminal, a server and the like. Referring to fig. 1, the method includes:
s101, based on the voice characteristics and the target emotion characteristics of the target voice, carrying out three-dimensional model vertex offset prediction according to predetermined offset prediction parameters to obtain three-dimensional model vertex offset data corresponding to the target voice.
In this embodiment, when a three-dimensional face model is to be driven with a target voice, the voice features of the target voice are first extracted, and the emotion that the three-dimensional face model should display while being driven is taken as the target emotion, so that the target emotion features corresponding to the target emotion can be determined. Voice feature extraction from the target voice can be performed with a speech recognition algorithm, or with a voice feature extraction model, for example a wav2vec or wav2vec 2.0 pre-trained model, which extracts the audio representation of the target voice, i.e., the voice features. The target voice contains multiple voice frames, so the extracted voice features of the target voice form a sequence of voice features corresponding to the individual voice frames. The target emotion features also need to correspond one-to-one with the voice features of the target voice, i.e., the target emotion features are likewise a sequence of emotion features corresponding to each voice frame.
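For illustration, the following is a minimal sketch of frame-level voice feature extraction with a wav2vec 2.0 pre-trained model; the transformers library, the specific checkpoint name and the 16 kHz sampling assumption are choices made for this sketch and are not mandated by the method.

```python
# Sketch: per-frame speech features from a wav2vec 2.0 pre-trained model.
# The checkpoint name and the 16 kHz input are assumptions, not requirements of the method.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def extract_speech_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """waveform_16k: 1-D tensor of the target voice sampled at 16 kHz."""
    inputs = processor(waveform_16k.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(inputs.input_values).last_hidden_state  # (1, T, 768)
    return hidden.squeeze(0)  # sequence of per-frame voice features
```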
Specifically, in this embodiment, for extracting the target emotion feature, a video including the target emotion may be collected in advance, and the video may be subjected to emotion feature extraction, so that the target emotion feature may be extracted. In addition, the number of frames of the collected video containing the target emotion needs to be the same as the number of frames of the target voice, so that the target emotion characteristics can be corresponding to the voice characteristics of the target voice. The method for extracting emotional characteristics of video may use an existing emotional characteristic extraction network, which is not specifically described in this embodiment.
After the voice features and the target emotion features of the target voice are determined, three-dimensional model vertex offset prediction is performed on them according to the predetermined offset prediction parameters to obtain the three-dimensional model vertex offset data corresponding to the target voice. The offset prediction parameters are determined by performing three-dimensional model vertex offset prediction processing on the 4D synthesized data, audio features and emotion features corresponding to a sample video. The 4D synthesized data corresponding to the sample video are data obtained by assembling, at the frame rate of the sample video, the three-dimensional reconstructed face models corresponding to the individual frames of the sample video. When the offset prediction parameters are determined, three-dimensional model vertex offset prediction is performed on the audio features and emotion features corresponding to the video frames of the sample video, and the offset prediction parameters are adjusted according to the deviation between the three-dimensional face model driven by the predicted offset data and the three-dimensional reconstructed face model obtained by three-dimensional face reconstruction of the corresponding video frame, so that the difference between the two is minimized.
Specifically, the offset prediction parameters include encoding parameters, decoding parameters, and the like. Performing three-dimensional model vertex offset prediction on the voice features and target emotion features of the target voice according to the predetermined offset prediction parameters means performing the prediction, frame by frame, on the voice features and target emotion features of each voice frame of the target voice. When predicting for the current voice frame, feature fusion is first performed between the three-dimensional model vertex offset data corresponding to the preceding voice frame (i.e., the offset data obtained by performing vertex offset prediction on the voice features of the preceding voice frame and the target emotion features according to the predetermined offset prediction parameters) and the target emotion features, yielding the emotion fusion features corresponding to the current voice frame. Then, using the encoding parameters among the offset prediction parameters, the emotion fusion features of the current voice frame and the voice features of the current voice frame are fused on the basis of an attention mechanism to obtain the coding features of the current voice frame. Finally, the coding features of the current voice frame are decoded with the decoding parameters to obtain the three-dimensional model vertex offset data of the current voice frame. Any voice frame of the target voice can be taken as the current voice frame for such prediction, so the three-dimensional model vertex offset data corresponding to every voice frame can be obtained; combining the offset data of all voice frames in their order within the target voice yields the three-dimensional model vertex offset data corresponding to the target voice.
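As an illustrative sketch only, the per-frame autoregressive step described above could be organized as follows; the layer sizes, the use of a single linear decoder, and the 5023-vertex count (borrowed from FLAME) are assumptions for this sketch, not the claimed parameters.

```python
# Sketch of the per-frame autoregressive offset prediction described above.
# Module choices and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class OffsetPredictor(nn.Module):
    def __init__(self, speech_dim=768, emo_dim=128, n_vertices=5023, d_model=256):
        super().__init__()
        self.offset_embed = nn.Linear(n_vertices * 3, d_model)   # embeds previous-frame offsets
        self.emo_fuse = nn.Linear(d_model + emo_dim, d_model)    # emotion fusion feature
        self.speech_proj = nn.Linear(speech_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.decoder = nn.Linear(d_model, n_vertices * 3)        # "decoding parameters"

    def forward(self, speech_feats, emo_feats):
        """speech_feats: (T, speech_dim); emo_feats: (T, emo_dim), aligned per voice frame."""
        prev_offset = torch.zeros(1, self.decoder.out_features)
        outputs = []
        for t in range(speech_feats.size(0)):
            # 1) fuse the previous frame's vertex offsets with the target emotion feature
            fused = self.emo_fuse(torch.cat([self.offset_embed(prev_offset),
                                             emo_feats[t:t + 1]], dim=-1))
            # 2) attention-based fusion with the current frame's voice feature -> coding feature
            speech_t = self.speech_proj(speech_feats[t:t + 1]).unsqueeze(0)  # (1, 1, d_model)
            enc, _ = self.cross_attn(fused.unsqueeze(0), speech_t, speech_t)
            # 3) decode the coding feature into this frame's vertex offsets
            prev_offset = self.decoder(enc.squeeze(0))
            outputs.append(prev_offset)
        return torch.cat(outputs, dim=0)  # (T, n_vertices * 3)
```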
Further, the predetermined offset prediction parameters in this embodiment may be the model parameters of a three-dimensional offset prediction model, and the three-dimensional model vertex offset prediction based on the voice features and target emotion features of the target voice proceeds as follows:
The voice features and target emotion features of the target voice are input into a pre-trained three-dimensional offset prediction model to perform offset prediction for the three-dimensional face model, and the model outputs the three-dimensional model vertex offset data for driving the three-dimensional face model with the target voice. The target emotion features are input into the three-dimensional offset prediction model as an embedding layer, so that when the model performs vertex offset prediction on the voice features of the target voice, it can also draw on the facial characteristics represented by the target emotion; the three-dimensional model vertex offset data output by the model therefore include not only the offsets driven by the target voice but also the offsets that express the target emotion.
Specifically, in this embodiment, training the three-dimensional offset prediction model requires collecting training data in advance and then performing three-dimensional model vertex offset prediction training with these data to obtain the model. The training data include the 4D synthesized data corresponding to a sample video, the audio features corresponding to the sample video, and the emotion features corresponding to the sample video. The sample video is a pre-collected two-dimensional video of a speaker's face while speaking, and every image frame in it is a two-dimensional image. To obtain the 4D data required for model training, three-dimensional face reconstruction is performed on every frame of the sample video to obtain the three-dimensional reconstructed face model corresponding to each frame (which is three-dimensional data), and the reconstructed models of all frames are then assembled into a model sequence at the frame rate of the sample video, yielding four-dimensional data, i.e., the 4D synthesized data. Compared with obtaining the 4D training data frame by frame with three-dimensional scanning equipment, synthesizing four-dimensional data from a two-dimensional sample video and using the resulting 4D synthesized data as training data is less costly; two-dimensional video is easy to collect, so the amount and diversity of the sample video are high, and the amount and diversity of the 4D synthesized data are correspondingly high. Training with the 4D synthesized data therefore yields a more accurate three-dimensional offset prediction model, and because the high diversity of the 4D synthesized data covers more varied emotions, the emotional effect reflected by the offset data predicted by the trained model is better. The audio features corresponding to the sample video are obtained by voice feature extraction from the audio of the sample video, and the emotion features corresponding to the sample video are obtained by extracting emotion features from the sample video with an emotion extraction network.
In this embodiment, the three-dimensional offset prediction model preferably adopts an audio2mesh network; specifically, a FaceFormer network may be used as the baseline network. FaceFormer uses a Transformer architecture: it takes the voice features of the target voice as input and autoregressively outputs the three-dimensional model vertex offset data corresponding to the voice features of each voice frame of the target voice.
S102, driving the three-dimensional basic face model according to the three-dimensional model vertex offset data to obtain the three-dimensional face animation corresponding to the target voice.
Specifically, after the three-dimensional offset prediction model predicts the three-dimensional model vertex offset data corresponding to the target voice, the three-dimensional basic face model is driven according to these data (i.e., the vertices of the basic face model are offset according to the three-dimensional model vertex offset data), yielding the three-dimensional face animation driven by the target voice. The three-dimensional basic face model is preferably a three-dimensional face model with a neutral expression. The three-dimensional model vertex offset data corresponding to the target voice form a sequence of offset data corresponding to the individual voice frames of the target voice; driving the basic face model with these data produces a three-dimensional face model driven by each voice frame, and combining these models into a sequence in the order of the voice frames yields the three-dimensional face animation.
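A minimal sketch of this driving step is given below; the tensor shapes and the flat-to-(V, 3) reshaping convention are assumptions made for illustration.

```python
# Sketch: driving the neutral (basic) face model with per-frame vertex offsets
# to assemble the three-dimensional face animation.
import torch

def drive_base_model(base_vertices: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """
    base_vertices: (V, 3) neutral-expression basic face model.
    offsets:       (T, V, 3) predicted per-frame vertex offsets for the target voice.
    Returns the mesh sequence (T, V, 3), i.e. the 3D facial animation.
    """
    return base_vertices.unsqueeze(0) + offsets  # add the offsets frame by frame

# Usage: flat predictions (T, V*3) from the predictor are reshaped to (T, V, 3) first, e.g.
# animation = drive_base_model(base_vertices, pred.view(-1, base_vertices.shape[0], 3))
```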
Further, in order to implement voice driving of the end-to-end three-dimensional face model, the embodiment constructs a voice driving model in advance, where the voice driving model includes a three-dimensional offset prediction model. The voice characteristics and the target emotion characteristics of the target voice are input into the voice driving model, the three-dimensional deviation prediction model in the voice driving model can predict three-dimensional model vertex deviation data corresponding to the target voice according to the voice characteristics and the target emotion characteristics of the target voice, and then the voice driving model drives the three-dimensional basic face model according to the three-dimensional model vertex deviation data, so that three-dimensional face animation corresponding to the target voice is obtained. Thus, after the voice characteristics and the target emotion characteristics of the target voice are input into the voice driving model, the driven three-dimensional facial animation can be directly obtained, and the end-to-end three-dimensional facial model driving is realized.
As can be seen from the above description, in the voice-based three-dimensional face model driving method provided by the embodiment of the present application, three-dimensional model vertex offset prediction is performed according to predetermined offset prediction parameters based on the voice features and target emotion features of the target voice to obtain the three-dimensional model vertex offset data corresponding to the target voice, and the three-dimensional basic face model is driven according to these data to obtain the three-dimensional face animation corresponding to the target voice. The offset prediction parameters are determined by performing three-dimensional model vertex offset prediction processing on the 4D synthesized data, audio features and emotion features corresponding to a sample video, and the 4D synthesized data are obtained by assembling, at the frame rate of the sample video, the three-dimensional reconstructed face models corresponding to the individual frames of the sample video. With this technical solution, each frame of the sample video is reconstructed into a three-dimensional face model, and the resulting 4D synthesized data serve as the sample data for determining the offset prediction parameters. Compared with acquiring 4D data with three-dimensional scanning equipment, this is less costly and increases the amount and diversity of the sample data, which improves the prediction effect of the offset prediction parameters and thus the accuracy of the voice-driven three-dimensional face model; the input of emotion features further improves the emotional effect of the voice-driven three-dimensional face model.
As an alternative implementation manner, referring to fig. 2, in another embodiment of the present application, a training process of a three-dimensional offset prediction model is disclosed, which may specifically include the following steps:
s201, determining 4D synthesized data, audio characteristics and emotion characteristics corresponding to a pre-collected sample video.
Specifically, to perform three-dimensional model vertex offset prediction training on the three-dimensional offset prediction model, a two-dimensional sample video is first collected; the sample video may be a face video (such as an RGB video) of a speaker while speaking, shot with a monocular camera. Three-dimensional face reconstruction is then performed on every frame of the sample video (e.g., every RGB frame) to obtain the three-dimensional reconstructed face model corresponding to each frame, and the reconstructed models of all frames are assembled into a model sequence at the frame rate of the sample video to obtain the 4D synthesized data. In addition, while collecting the sample video, the corresponding audio, i.e., the audio spoken by the speaker in the sample video, is also collected. Voice feature extraction is performed on this audio to obtain the audio features corresponding to the sample video, and emotion feature extraction is performed on the sample video to obtain the corresponding emotion features. The specific steps are as follows:
Firstly, carrying out three-dimensional face reconstruction on each video frame in a sample video acquired in advance to obtain three-dimensional deformation parameters corresponding to each video frame, and determining a three-dimensional reconstruction face model restored by the three-dimensional deformation parameters corresponding to each video frame.
In this embodiment, a monocular three-dimensional face reconstruction model is used to perform three-dimensional face reconstruction on each two-dimensional image frame of the sample video, yielding the three-dimensional deformation parameters corresponding to each video frame. FLAME is a parameterized three-dimensional face model and can be expressed by the following formula:

M(β, θ, ψ): R^(|β| × |θ| × |ψ|) → R^(3N)

where β is the parameter controlling changes in face shape, ψ is the parameter controlling facial expression, θ is the parameter controlling the joints and overall rotation, and N is the number of mesh vertices.
Each video frame of the sample video is input into the monocular three-dimensional face reconstruction model, which fits the three parameters controlling the deformation of the FLAME face model as the three-dimensional deformation parameters corresponding to that video frame; the corresponding three-dimensional reconstructed face model can then be restored from these parameters, so a three-dimensional reconstructed face model is obtained for every video frame of the sample video.
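The following sketch shows how the per-frame FLAME parameters could be turned back into meshes and stacked into 4D synthesized data; `flame_layer` stands in for any FLAME implementation, and its call signature is an assumption of this sketch.

```python
# Sketch: restoring a 3D reconstructed face model from the fitted FLAME parameters of
# each video frame and stacking the meshes at the video frame rate into 4D synthesized data.
import torch

def build_4d_synthesized_data(per_frame_params, flame_layer):
    """
    per_frame_params: list of (shape, expression, pose) tensors, one tuple per video frame,
                      as fitted by the monocular 3D face reconstruction model.
    flame_layer:      stand-in callable returning (V, 3) vertices for one parameter set.
    Returns a (T, V, 3) vertex sequence ordered at the sample video's frame rate.
    """
    meshes = []
    for shape, expression, pose in per_frame_params:
        vertices = flame_layer(shape, expression, pose)  # restored reconstructed face model
        meshes.append(vertices)
    return torch.stack(meshes, dim=0)  # 4D synthesized data: 3D meshes over time
```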
Secondly, synthesizing the three-dimensional reconstruction face model corresponding to each video frame according to the frame rate corresponding to the sample video to obtain 4D synthesized data corresponding to the sample video.
After the three-dimensional reconstruction face models corresponding to all video frames in the sample video are determined, all the three-dimensional reconstruction face models are combined according to the frame rate corresponding to the sample video, and 4D synthetic data corresponding to the sample video is obtained.
Thirdly, extracting voice features from the audio corresponding to the sample video to obtain audio features corresponding to the sample video, and extracting emotion features from the sample video to obtain emotion features corresponding to the sample video.
In this embodiment, a voice feature extraction network (e.g., a wav2vec 2.0 network) is used to extract voice features from the audio corresponding to the sample video, yielding the audio features corresponding to the sample video. An existing emotion extraction network is used to extract emotion features from the sample video, yielding the emotion features corresponding to the sample video. When the emotion extraction network extracts emotion features from the sample video, it can also map them to a high-dimensional space through a fully connected network to increase the expressive capacity of the emotion features.
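A minimal sketch of the fully connected projection mentioned above follows; the emotion extraction network itself is reused from existing work and is not shown, and the feature dimensions are assumptions.

```python
# Sketch: mapping extracted emotion features to a higher-dimensional space with a
# fully connected network to increase their expressive capacity. Dimensions are assumed.
import torch.nn as nn

emotion_projection = nn.Sequential(
    nn.Linear(64, 128),   # raw emotion feature -> hidden layer
    nn.ReLU(),
    nn.Linear(128, 128),  # high-dimensional emotion embedding fed to the predictor
)
```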
Further, the frame rate of the collected sample video may differ from the frame rate of its corresponding audio, so the frame rate of the audio features differs from that of the sample video and the video frames cannot be aligned with the audio features. In this case, a linear interpolation operation may be performed on the audio features according to a preset frame-rate alignment rule, based on the frame rate of the sample video and the frame rate of the audio features. The preset rule may be that the frame rate of the audio features should be a multiple of the frame rate of the sample video, and the audio features are linearly interpolated according to that multiple. For example, if the frame rate of the audio features corresponding to the sample video is 49 Hz and the frame rate of the sample video is 30 FPS, the alignment rule in this embodiment is preferably that the audio-feature frame rate should be twice the video frame rate; the audio features are therefore linearly interpolated so that their frame rate becomes 60 Hz, and when the sample video is aligned with the audio features, each video frame corresponds to two frames of audio features.
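The alignment step can be sketched as a simple linear interpolation along the time axis; the function name and the use of F.interpolate are choices of this sketch, not the claimed procedure.

```python
# Sketch: linearly interpolating ~49 Hz audio features to twice the 30 FPS video frame
# rate (60 Hz), so each video frame corresponds to two audio-feature frames.
import torch
import torch.nn.functional as F

def align_audio_features(audio_feats: torch.Tensor, num_video_frames: int,
                         multiple: int = 2) -> torch.Tensor:
    """audio_feats: (T_audio, D); returns (num_video_frames * multiple, D)."""
    x = audio_feats.t().unsqueeze(0)  # (1, D, T_audio): F.interpolate expects (N, C, L)
    x = F.interpolate(x, size=num_video_frames * multiple, mode="linear", align_corners=True)
    return x.squeeze(0).t()           # time-aligned audio features
```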
S202, inputting the audio features and the emotion features into the three-dimensional offset prediction model to obtain sample vertex offset data output by the three-dimensional offset prediction model.
After the 4D synthesized data, audio features and emotion features corresponding to the sample video are determined, the 4D synthesized data are used as the label carried by the audio features; the audio features and emotion features corresponding to the sample video are then input into the three-dimensional offset prediction model, which performs three-dimensional model vertex offset prediction and autoregressively outputs the sample vertex offset data corresponding to the audio features (including the sample vertex offset data corresponding to each frame of audio features). The emotion features corresponding to the sample video are input into the model as an embedding layer, so that when the model performs vertex offset prediction on the audio features, it can also draw on the facial characteristics represented by the speaker's emotion in the sample video; the sample vertex offset data output by the model therefore include not only the offsets driven by the audio of the sample video but also the offsets expressing the speaker's emotion.
S203, performing model parameter adjustment on the three-dimensional offset prediction model based on a first loss function between the first three-dimensional animation and the 4D synthesized data.
Specifically, the sample vertex offset data comprise the vertex offset data corresponding to each frame of audio features of the sample video. Three-dimensional model vertex offsets are applied to the three-dimensional basic face model according to the offset data of each audio-feature frame to obtain the three-dimensional face model driven by that frame, and the driven models of all frames are combined in the order of the audio features to obtain the three-dimensional face animation driven by the audio features corresponding to the sample video, which serves as the first three-dimensional animation. The three-dimensional basic face model is preferably the three-dimensional face model of the speaker in the sample video under a neutral expression.
Because the 4D synthesized data corresponding to the sample video are likewise data assembled from the three-dimensional reconstructed face models of the individual frames of the sample video, the 4D synthesized data constitute the real three-dimensional face animation driven by the audio features and emotion features of the sample video, while the first three-dimensional animation is the three-dimensional face animation predicted from those features. Ideally, the first three-dimensional animation driven by the sample vertex offset data predicted by the three-dimensional offset prediction model should be identical to the 4D synthesized data carried as the label of the audio features; the model's prediction would then be most accurate. Therefore, to improve the prediction accuracy of the three-dimensional offset prediction model, a first loss function between the first three-dimensional animation and the 4D synthesized data is calculated, and the model parameters are adjusted according to this loss so that the first three-dimensional animation gradually approaches the 4D synthesized data.
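A sketch of this first loss is given below; the L2 form and the optimizer step are assumptions, since the description only requires some loss between the first three-dimensional animation and the 4D synthesized data.

```python
# Sketch of the first loss: a vertex-level error between the first 3D animation
# (driven by the predicted sample vertex offsets) and the 4D synthesized data.
import torch
import torch.nn.functional as F

def first_loss(pred_offsets, base_vertices, synthesized_4d):
    """
    pred_offsets:   (T, V, 3) sample vertex offset data output by the model.
    base_vertices:  (V, 3) neutral-expression basic face model of the speaker.
    synthesized_4d: (T, V, 3) 4D synthesized data built from the sample video.
    """
    first_animation = base_vertices.unsqueeze(0) + pred_offsets
    return F.mse_loss(first_animation, synthesized_4d)  # assumed L2 form

# Assumed training step:
# loss = first_loss(model(audio_feats, emo_feats), base_vertices, synthesized_4d)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```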
As an alternative implementation manner, referring to fig. 3, in another embodiment of the present application, a training process of a three-dimensional offset prediction model is disclosed, and specifically the method may further include the following steps:
and S301, rendering the first three-dimensional animation into a 2D image sequence.
Specifically, the three-dimensional offset prediction model predicts the sample vertex offset data from the audio features and emotion features corresponding to the sample video, and the three-dimensional basic face model is driven according to these data to obtain the first three-dimensional animation. Each three-dimensional face model in the first three-dimensional animation is then rendered into a 2D image (such as an RGB image) with a differentiable rendering technique, and all rendered 2D images are combined to obtain the 2D image sequence.
S302, taking the error between the 2D image sequence and the video frame sequence of the sample video as a second loss function, and carrying out model parameter adjustment on the three-dimensional offset prediction model.
Ideally, the 2D image sequence rendered from the first three-dimensional animation driven by the predicted sample vertex offset data should be very close, at the pixel level, to the image sequence formed by the video frames of the sample video; the model's prediction would then be most accurate. Therefore, to improve the prediction accuracy of the three-dimensional offset prediction model, the error between the 2D image sequence and the video frame sequence of the sample video (i.e., the pixel-level error) is calculated as a second loss function, and the model parameters are adjusted according to it so that the rendered 2D image sequence gradually approaches the image sequence composed of the video frames of the sample video. Using the second loss function to correct the model's prediction of fine facial offsets makes the facial details in the three-dimensional face animation driven by the predicted sample vertex offset data more accurate.
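For illustration, the second loss could be sketched as below; `differentiable_render` is a stand-in for any differentiable rendering routine (for example a PyTorch3D-style renderer), and both its interface and the L1 pixel error are assumptions of this sketch.

```python
# Sketch of the second loss: the first 3D animation is rendered into a 2D image sequence
# with a differentiable renderer and compared to the sample video frames at pixel level.
import torch
import torch.nn.functional as F

def second_loss(first_animation, video_frames, differentiable_render):
    """
    first_animation: (T, V, 3) driven mesh sequence.
    video_frames:    (T, H, W, 3) frames of the sample video.
    differentiable_render: stand-in callable mapping one (V, 3) mesh to an (H, W, 3) image.
    """
    rendered = torch.stack([differentiable_render(mesh) for mesh in first_animation])
    return F.l1_loss(rendered, video_frames)  # pixel-level error; L1 is an assumed choice
```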
As an alternative implementation manner, referring to fig. 4, in another embodiment of the present application, a training process of a three-dimensional offset prediction model is disclosed, and specifically the method may further include the following steps:
s401, determining a sample three-dimensional face animation acquired in advance, and an animation audio feature and an animation emotion feature corresponding to the sample three-dimensional face animation.
In this embodiment, because the 4D training data used are synthesized from a two-dimensional sample video, their quality falls somewhat short of real 4D data collected with three-dimensional scanning equipment, which affects the accuracy of the three-dimensional offset prediction model trained on the 4D synthesized data. Therefore, after the model has been trained with the 4D synthesized data, it can be fine-tuned with a small amount of real 4D data to further improve its prediction accuracy.
First, real 4D data are collected with three-dimensional scanning equipment, i.e., a three-dimensional face animation is collected as the sample three-dimensional face animation (which is 4D data). The animation audio features are extracted from the audio corresponding to the sample three-dimensional face animation, and the animation emotion features corresponding to it are extracted with a three-dimensional facial expression feature extraction method. The three-dimensional facial expression extraction may be performed by pre-training an emotion extraction network based on three-dimensional faces, which is similar to an emotion extraction network based on two-dimensional images and is not described in detail in this embodiment.
S402, inputting the animation audio features and the animation emotion features into the trained three-dimensional offset prediction model to obtain animation vertex offset data corresponding to the sample three-dimensional face animation.
In this embodiment, the sample three-dimensional face animation serves as real 4D data and is carried as the label of the animation audio features. The animation audio features and the animation emotion features corresponding to the sample three-dimensional face animation are input into the three-dimensional offset prediction model that has been trained with the 4D synthesized data, audio features and emotion features corresponding to the sample video; the model performs three-dimensional model vertex offset prediction and autoregressively outputs the animation vertex offset data corresponding to the sample three-dimensional face animation. The prediction process with the sample three-dimensional face animation, animation audio features and animation emotion features is the same as that with the 4D synthesized data, audio features and emotion features corresponding to the sample video, and is not described in detail in this embodiment.
S403, based on a third loss function between the three-dimensional face offset animation corresponding to the animation vertex offset data and the sample three-dimensional face animation, performing model parameter adjustment on the trained three-dimensional offset prediction model.
In this embodiment, after the three-dimensional offset prediction model predicts the animation vertex offset data corresponding to the sample three-dimensional face animation, the three-dimensional basic face model is driven according to these data. The animation vertex offset data include the vertex offset data driven by each frame of animation audio features, so the basic face model is offset according to each frame's data to obtain the three-dimensional face model driven by that frame, and the driven models of all frames are combined to obtain the three-dimensional face offset animation driven by the animation audio features. The three-dimensional basic face model is preferably the three-dimensional face model of the speaker in the sample three-dimensional face animation under a neutral expression.
Ideally, the three-dimensional face offset animation obtained by driving the three-dimensional basic face model with the animation vertex offset data predicted by the three-dimensional offset prediction model is identical to the sample three-dimensional face animation carried as the label of the animation audio feature; in that case the model's prediction is perfectly accurate. Therefore, to improve the prediction accuracy of the three-dimensional offset prediction model, a third loss function between the three-dimensional face offset animation and the sample three-dimensional face animation is calculated, and the model parameters are adjusted according to this third loss function, so that the three-dimensional face offset animation driven by the predicted animation vertex offset data gradually approaches the sample three-dimensional face animation, thereby fine-tuning the three-dimensional offset prediction model.
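The disclosure does not fix a concrete form for the third loss function; a per-vertex L2 distance between the driven animation and the captured 4D sample is one plausible choice. A minimal sketch under that assumption (shapes and names are illustrative):

```python
import torch

def third_loss(pred_offsets, base_vertices, sample_animation):
    """Per-vertex L2 loss between the driven animation and the real 4D sample.

    pred_offsets:     (T, V, 3) vertex offsets predicted by the offset model
    base_vertices:    (V, 3)    neutral-expression three-dimensional basic face model
    sample_animation: (T, V, 3) captured sample three-dimensional face animation
    """
    driven_animation = base_vertices.unsqueeze(0) + pred_offsets  # drive the base model frame by frame
    return torch.mean((driven_animation - sample_animation) ** 2)

# hypothetical fine-tuning step:
# loss = third_loss(pred_offsets, base_vertices, sample_animation)
# loss.backward(); optimizer.step()
```

The same pattern applies to the first loss function, with the 4D synthesized data of the sample video taking the place of the captured sample animation.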
As an alternative implementation manner, in another embodiment of the present application, a training process of the three-dimensional offset prediction model is disclosed, and specifically the method may further include the following steps:
First, the driving characteristics of the audio corresponding to the sample video on the face in the sample video are obtained.
In this embodiment, when the emotion extraction network extracts emotion features from the sample video, the extracted features may be mixed with motion features of the face under audio driving, that is, the movements the face makes while speaking the audio, for example the lip and mouth movements produced as the audio is uttered. When the three-dimensional offset prediction model predicts three-dimensional model vertex offsets from the audio features and the emotion features, any emotion should be insertable into the audio-driven three-dimensional face model without the inserted emotion interfering with how the audio drives the model; the model therefore needs to be trained with purer emotion features, i.e. features unrelated to the audio-driven face offsets. For this reason, the emotion features extracted from the sample video must be decoupled from the representation of the audio driving. First, the driving characteristics of the audio corresponding to the sample video on the face in the sample video are obtained; for example, the offset of the lip region in the video frame corresponding to each audio frame, measured relative to the face's neutral expression, can serve as the driving characteristic of that audio frame on the face.
Second, mutual information between the driving feature and the emotional feature corresponding to the sample video is calculated.
Because the emotion features extracted from the sample video by the emotion extraction network may be mixed with the driving characteristics of the audio on the face, mutual information exists between the emotion features corresponding to the sample video and those driving characteristics. In this embodiment, the mutual information between the driving characteristics and the emotion features corresponding to the sample video is therefore computed using an existing mutual information estimation method.
Third, parameter adjustment is carried out on the three-dimensional offset prediction model with minimization of the mutual information as the target.
During training of the three-dimensional offset prediction model with the 4D synthesized data corresponding to the sample video, the audio features corresponding to the sample video and the emotion features corresponding to the sample video, the computed mutual information between the driving characteristics and the emotion features corresponding to the sample video can be added to the loss function, and the parameters of the three-dimensional offset prediction model are adjusted with minimization of this mutual information as the target, thereby reducing the influence of the emotion features on the voice-driven three-dimensional face model and improving the emotional effect when the three-dimensional face model is driven.
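The embodiment only requires that the mutual information be computed with an existing estimation method; MINE (mutual information neural estimation) is one commonly used choice. The sketch below is an illustrative assumption of how such an estimator could supply an MI term that is added to the training loss and minimized (network structure, dimensions and the weighting factor are not taken from the patent):

```python
import torch
import torch.nn as nn

class MINEEstimator(nn.Module):
    """MINE-style statistics network for estimating mutual information between
    the audio-driving features and the emotion features (illustrative choice)."""
    def __init__(self, drive_dim=32, emo_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(drive_dim + emo_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, drive_feats, emo_feats):
        # drive_feats: (N, drive_dim), emo_feats: (N, emo_dim), paired per frame
        joint = self.net(torch.cat([drive_feats, emo_feats], dim=1)).mean()
        # shuffle one side to approximate the product of marginals
        shuffled = emo_feats[torch.randperm(emo_feats.shape[0])]
        marginal = torch.logsumexp(
            self.net(torch.cat([drive_feats, shuffled], dim=1)).squeeze(1), dim=0
        ) - torch.log(torch.tensor(float(emo_feats.shape[0])))
        return joint - marginal  # lower-bound estimate of I(drive; emotion)

# hypothetical use: add the MI estimate to the training loss so that minimizing
# the total loss decouples the emotion features from the audio-driving features
# total_loss = first_loss + lambda_mi * mine(drive_feats, emo_feats)
```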
As an alternative implementation, in another embodiment of the present application, when the target voice drives the three-dimensional face model, a speaker code corresponding to the speaker to be driven may be input to the pre-trained three-dimensional offset prediction model together with the voice feature and the target emotion feature of the target voice. The speaker code records the speaker's personal speaking habits, so that when the three-dimensional offset prediction model predicts the three-dimensional model vertex offsets it can take into account not only the emotion feature but also the speaker's personal habits, which makes the driven three-dimensional face animation more lifelike. When training a three-dimensional offset prediction model that performs vertex offset prediction with a speaker code, the training inputs must likewise include the speaker code; training proceeds in the same way as after the emotion features of the sample video are input, with the speaker code also handled through an embedding layer of the model, and is not described in detail in this embodiment.
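A speaker code handled through an embedding layer, as described above, might look like the following sketch (hypothetical class, dimensions and speaker count; it only illustrates how a learned per-speaker vector can be concatenated with the other conditioning features before offset prediction):

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Illustrative speaker-code embedding for the offset prediction model."""
    def __init__(self, num_speakers=8, speaker_dim=32):
        super().__init__()
        # one learned vector per known speaker, trained jointly with the model
        self.speaker_embed = nn.Embedding(num_speakers, speaker_dim)

    def forward(self, audio_feat, emo_feat, speaker_id):
        # audio_feat: (audio_dim,), emo_feat: (emo_dim,), speaker_id: scalar LongTensor
        spk = self.speaker_embed(speaker_id)           # (speaker_dim,) speaker code
        return torch.cat([audio_feat, emo_feat, spk])  # conditioning vector for offset prediction

# usage sketch: cond = SpeakerConditioning()(audio_feat, emo_feat, torch.tensor(3))
```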
Exemplary apparatus
Corresponding to the above-mentioned three-dimensional face model driving method based on voice, the embodiment of the application also discloses a three-dimensional face model driving device based on voice, as shown in fig. 5, which comprises:
the offset prediction module 100 is configured to perform three-dimensional model vertex offset prediction according to predetermined offset prediction parameters based on the voice feature and the target emotion feature of the target voice, so as to obtain three-dimensional model vertex offset data corresponding to the target voice;
the driving module 110 is configured to drive the three-dimensional basic face model according to the three-dimensional model vertex offset data, so as to obtain a three-dimensional face animation corresponding to the target voice;
the offset prediction parameters are determined by carrying out three-dimensional model vertex offset prediction processing on 4D synthesized data, audio characteristics and emotion characteristics corresponding to the sample video; the 4D synthesized data corresponding to the sample video is synthesized data according to the frame rate of the sample video by using a three-dimensional reconstruction face model corresponding to each frame of image of the sample video.
With the voice-based three-dimensional face model driving device described above, each frame of image of a sample video can be reconstructed into a three-dimensional face model, so that 4D synthesized data are obtained as the sample data for determining the offset prediction parameters.
As an alternative implementation manner, in another embodiment of the present application, the offset prediction module 100 of the above embodiment is disclosed, and is specifically configured to:
fusing emotion fusion characteristics corresponding to the voice frames in the target voice with the voice characteristics to obtain coding characteristics corresponding to the voice frames; the emotion fusion features corresponding to the voice frames are obtained by fusing the target emotion features with three-dimensional model vertex offset data corresponding to the voice frame before the voice frame;
and decoding the coding features corresponding to the voice frames in the target voice to obtain three-dimensional model vertex offset data corresponding to the voice frames in the target voice.
As an alternative implementation manner, in another embodiment of the present application, the offset prediction module 100 of the above embodiment is disclosed, and is specifically further configured to:
inputting the voice characteristics and the target emotion characteristics of the target voice into a pre-trained three-dimensional offset prediction model to obtain three-dimensional model vertex offset data corresponding to the target voice;
the three-dimensional offset prediction model is obtained by performing three-dimensional model vertex offset prediction training by using 4D synthetic data, audio features and emotion features corresponding to the sample video; the 4D synthesized data corresponding to the sample video is synthesized data according to the frame rate of the sample video by using a three-dimensional reconstruction face model corresponding to each frame of image of the sample video.
As an alternative implementation manner, in another embodiment of the present application, it is disclosed that the target emotion feature of the above embodiment is extracted from a video containing the target emotion.
As an optional implementation manner, in another embodiment of the present application, the voice-based three-dimensional face model driving device of the above embodiment further includes: the system comprises a training data determining module, a data input module and a model parameter adjusting module.
The training data determining module is used for determining 4D synthesized data, audio characteristics and emotion characteristics corresponding to the pre-collected sample video;
the data input module is used for inputting the audio features and the emotion features into the three-dimensional offset prediction model to obtain sample vertex offset data output by the three-dimensional offset prediction model;
the model parameter adjustment module is used for adjusting model parameters of the three-dimensional offset prediction model based on a first loss function between the first three-dimensional animation and the 4D synthesized data; the first three-dimensional animation is a three-dimensional face animation after the three-dimensional basic face model is driven according to the sample vertex offset data.
As an optional implementation manner, in another embodiment of the present application, the voice-based three-dimensional face model driving device of the above embodiment further includes: and an interpolation module.
And the interpolation module is used for carrying out linear interpolation on the audio features according to the frame rate of the sample video, the frame rate of the audio features and a preset frame rate alignment rule.
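Linear interpolation that aligns the audio feature frame rate to the video frame rate could be performed roughly as follows; the patent does not spell out the alignment rule, so the simple "resample to the video frame count" rule used here is an assumption:

```python
import numpy as np

def align_audio_features(audio_feats, audio_fps, video_fps):
    """Linearly interpolate per-frame audio features so that one audio feature
    corresponds to each video frame (illustrative alignment rule).

    audio_feats: (T_audio, D) array of audio features sampled at audio_fps
    returns:     (T_video, D) array resampled to video_fps
    """
    T_audio, D = audio_feats.shape
    duration = T_audio / audio_fps
    T_video = int(round(duration * video_fps))
    src_t = np.arange(T_audio) / audio_fps            # timestamps of the audio features
    dst_t = np.arange(T_video) / video_fps            # timestamps of the video frames
    aligned = np.stack(
        [np.interp(dst_t, src_t, audio_feats[:, d]) for d in range(D)], axis=1
    )
    return aligned

# e.g. 100 fps audio features resampled to 25 fps video frames:
# aligned = align_audio_features(feats, audio_fps=100, video_fps=25)
```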
As an optional implementation manner, in another embodiment of the present application, the voice-based three-dimensional face model driving device of the above embodiment further includes: and a rendering module.
The rendering module is used for rendering the first three-dimensional animation into a 2D image sequence;
the model parameter adjustment module is further configured to perform model parameter adjustment on the three-dimensional offset prediction model by using an error between the 2D image sequence and the video frame sequence of the sample video as a second loss function.
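The second loss compares a rendered 2D image sequence of the driven animation with the original video frames. A minimal sketch of this idea, assuming some differentiable renderer is available (the renderer is passed in as a callable because the patent does not name a specific one):

```python
import torch

def second_loss(render_fn, driven_vertices, video_frames):
    """Pixel-space loss between rendered driven meshes and the sample video.

    render_fn:       differentiable renderer, mesh vertices (V, 3) -> image (H, W, 3)
    driven_vertices: (T, V, 3) vertices of the first three-dimensional animation
    video_frames:    (T, H, W, 3) frames of the sample video, values in [0, 1]
    """
    rendered = torch.stack([render_fn(v) for v in driven_vertices])  # (T, H, W, 3) 2D image sequence
    return torch.mean(torch.abs(rendered - video_frames))            # L1 image error

# the resulting scalar is backpropagated into the offset prediction model,
# provided render_fn is differentiable with respect to the vertices
```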
As an alternative implementation manner, in another embodiment of the present application, a three-dimensional face model driving device based on voice in the above embodiment is disclosed:
the training data determining module is also used for determining a sample three-dimensional face animation acquired in advance, and an animation audio feature and an animation emotion feature corresponding to the sample three-dimensional face animation;
the data input module is also used for inputting the animation audio features and the animation emotion features into the trained three-dimensional offset prediction model to obtain animation vertex offset data corresponding to the sample three-dimensional face animation;
The model parameter adjustment module is further used for adjusting model parameters of the trained three-dimensional offset prediction model based on a third loss function between the three-dimensional face offset animation corresponding to the animation vertex offset data and the sample three-dimensional face animation; the three-dimensional face offset animation is a three-dimensional face animation after driving the three-dimensional basic face model according to the animation vertex offset data.
As an optional implementation manner, in another embodiment of the present application, the voice-based three-dimensional face model driving device of the above embodiment further includes: the device comprises an acquisition module and a calculation module.
The acquisition module is used for acquiring driving characteristics of the audio corresponding to the sample video on the face in the sample video;
the computing module is used for computing mutual information between the driving characteristics and emotion characteristics corresponding to the sample video;
and the model parameter adjustment module is also used for carrying out parameter adjustment on the three-dimensional offset prediction model by taking the minimization of the mutual information as the target.
As an optional implementation manner, in another embodiment of the present application, a three-dimensional face model driving device based on voice in the above embodiment is disclosed, where the training data determining module is specifically configured to:
Carrying out three-dimensional face reconstruction on each video frame in the pre-acquired sample video to obtain three-dimensional deformation parameters corresponding to each video frame, and determining a three-dimensional reconstruction face model restored by the three-dimensional deformation parameters corresponding to each video frame;
synthesizing the three-dimensional reconstruction face model corresponding to each video frame according to the frame rate corresponding to the sample video to obtain 4D synthesized data corresponding to the sample video;
extracting voice features from the audio corresponding to the sample video to obtain audio features corresponding to the sample video, and extracting emotion features from the sample video to obtain emotion features corresponding to the sample video.
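An end-to-end sketch of this data preparation is given below; the four callables are placeholders for whichever 3DMM fitting tool, mesh decoder, speech feature extractor and emotion extraction network are actually used, none of which are named in the patent:

```python
def prepare_training_data(video_frames, audio, video_fps,
                          fit_3dmm, reconstruct_face,
                          extract_audio_features, extract_emotion_features):
    """Illustrative pipeline assembling the three kinds of training data."""
    morph_params = [fit_3dmm(frame) for frame in video_frames]   # 3D deformation parameters per frame
    meshes = [reconstruct_face(p) for p in morph_params]         # per-frame reconstructed face models
    synthetic_4d = {"fps": video_fps, "meshes": meshes}          # 4D synthesized data at the video frame rate
    audio_feats = extract_audio_features(audio)                  # audio features of the sample video
    emo_feats = extract_emotion_features(video_frames)           # emotion features of the sample video
    return synthetic_4d, audio_feats, emo_feats
```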
As an alternative implementation manner, in another embodiment of the present application, in the voice-based three-dimensional face model driving device in the above embodiment, the offset prediction module 100 inputs the voice feature and the target emotion feature of the target voice into a pre-trained three-dimensional offset prediction model to obtain three-dimensional model vertex offset data corresponding to the target voice, and the driving module 110 drives the three-dimensional basic face model according to the three-dimensional model vertex offset data to obtain a three-dimensional face animation corresponding to the target voice, including:
Inputting the voice features and the target emotion features of the target voice into a pre-trained voice driving model, predicting three-dimensional model vertex offset data corresponding to the target voice with the three-dimensional offset prediction model, and driving the three-dimensional basic face model according to the three-dimensional model vertex offset data to obtain a three-dimensional face animation corresponding to the target voice.
The voice-based three-dimensional face model driving device provided in the embodiments of the present application belongs to the same inventive concept as the voice-based three-dimensional face model driving method provided in the embodiments of the present application; it can execute the voice-based three-dimensional face model driving method provided in any embodiment of the present application and has the functional modules and beneficial effects corresponding to executing that method. Technical details not described in detail in this embodiment may be found in the specific processing content of the voice-based three-dimensional face model driving method provided in the foregoing embodiments of the present application, and are not repeated here.
Exemplary electronic device, storage medium, and computer program product
Corresponding to the above-mentioned three-dimensional face model driving method based on voice, the embodiment of the application also discloses an electronic device, as shown in fig. 6, which comprises:
A memory 200 and a processor 210;
wherein the memory 200 is connected to the processor 210 for storing a program;
the processor 210 is configured to implement the three-dimensional face model driving method based on voice disclosed in any of the above embodiments by running the program stored in the memory 200.
Specifically, the electronic device may further include: a bus, a communication interface 220, an input device 230, and an output device 240.
The processor 210, the memory 200, the communication interface 220, the input device 230, and the output device 240 are interconnected by a bus. Wherein:
a bus may comprise a path that communicates information between components of a computer system.
The processor 210 may be a general-purpose processor, such as a general-purpose central processing unit (CPU) or a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the technical solution of the present application. It may also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Processor 210 may include a main processor, and may also include a baseband chip, modem, and the like.
The memory 200 stores programs implementing the technical solution of the present application and may also store an operating system and other key services. In particular, the programs may include program code comprising computer operation instructions. More specifically, the memory 200 may include a read-only memory (ROM), other types of static storage devices capable of storing static information and instructions, a random access memory (RAM), other types of dynamic storage devices capable of storing information and instructions, disk storage, flash memory, and the like.
The input device 230 may include means for receiving data and information entered by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor, among others.
Output device 240 may include means, such as a display screen, printer, speakers, etc., that allow information to be output to a user.
The communication interface 220 may include any transceiver-type device for communicating with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
The processor 210 executes the program stored in the memory 200 and invokes other devices, which can be used to implement the steps of the three-dimensional face model driving method based on voice provided in the above embodiment of the present application.
Another embodiment of the present application further provides a storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the steps of the three-dimensional face model driving method based on voice provided in any one of the foregoing embodiments.
For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present application is not limited by the order of acts, as some steps may, in accordance with the present application, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all of the preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
It should be noted that, in the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described as different from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other. For the apparatus class embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
The steps in the method of each embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.
In the embodiments of the present application, the modules and sub-modules in the terminal may be combined, divided, and pruned according to actual needs.
In the embodiments provided in the present application, it should be understood that the disclosed terminal, apparatus and method may be implemented in other manners. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of modules or sub-modules is merely a logical function division, and there may be other manners of division in actual implementation, for example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules or sub-modules illustrated as separate components may or may not be physically separate, and components that are modules or sub-modules may or may not be physical modules or sub-modules, i.e., may be located in one place, or may be distributed over multiple network modules or sub-modules. Some or all of the modules or sub-modules may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional module or sub-module in the embodiments of the present application may be integrated in one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated in one module. The integrated modules or sub-modules may be implemented in hardware or in software functional modules or sub-modules.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. The three-dimensional face model driving method based on the voice is characterized by comprising the following steps of:
based on the voice characteristics and the target emotion characteristics of the target voice, carrying out three-dimensional model vertex offset prediction according to predetermined offset prediction parameters to obtain three-dimensional model vertex offset data corresponding to the target voice;
driving a three-dimensional basic face model according to the three-dimensional model vertex offset data to obtain a three-dimensional face animation corresponding to the target voice;
the offset prediction parameters are determined by carrying out three-dimensional model vertex offset prediction processing on 4D synthesized data, audio features and emotion features corresponding to the sample video; the 4D synthesized data corresponding to the sample video is synthesized data according to the frame rate of the sample video by using a three-dimensional reconstruction face model corresponding to each frame of image of the sample video; the three-dimensional reconstruction face model corresponding to the image is obtained by reconstructing the three-dimensional face of the image;
the method for predicting the three-dimensional model vertex offset based on the voice characteristics and the target emotion characteristics of the target voice according to the predetermined offset prediction parameters to obtain three-dimensional model vertex offset data corresponding to the target voice comprises the following steps:
Fusing emotion fusion characteristics corresponding to the voice frames in the target voice with the voice characteristics to obtain coding characteristics corresponding to the voice frames; the emotion fusion features corresponding to the voice frames are obtained by fusing the target emotion features with three-dimensional model vertex offset data corresponding to the voice frame before the voice frame;
and decoding the coding features corresponding to the voice frames in the target voice to obtain three-dimensional model vertex offset data corresponding to the voice frames in the target voice.
2. The method of claim 1, wherein performing three-dimensional model vertex shift prediction based on the speech features and the target emotion features of the target speech according to predetermined shift prediction parameters to obtain three-dimensional model vertex shift data corresponding to the target speech, comprises:
inputting the voice characteristics and the target emotion characteristics of the target voice into a pre-trained three-dimensional offset prediction model to obtain three-dimensional model vertex offset data corresponding to the target voice;
the three-dimensional offset prediction model is obtained by performing three-dimensional model vertex offset prediction training by using 4D synthetic data, audio features and emotion features corresponding to a sample video; the 4D synthesized data corresponding to the sample video is synthesized data according to the frame rate of the sample video by using a three-dimensional reconstruction face model corresponding to each frame of image of the sample video; the three-dimensional reconstruction face model corresponding to the image is obtained by reconstructing the three-dimensional face of the image.
3. The method of claim 1, wherein the target emotional characteristic is extracted from a video containing the target emotion.
4. The method of claim 2, wherein the training process of the three-dimensional offset prediction model comprises:
determining 4D synthesized data, audio characteristics and emotion characteristics corresponding to a pre-collected sample video;
inputting the audio features and the emotion features into the three-dimensional offset prediction model to obtain sample vertex offset data output by the three-dimensional offset prediction model;
performing model parameter adjustment on the three-dimensional offset prediction model based on a first loss function between a first three-dimensional animation and the 4D synthesis data; the first three-dimensional animation is a three-dimensional facial animation after the three-dimensional basic facial model is driven according to the sample vertex offset data.
5. The method of claim 4, wherein after determining the 4D synthesized data, audio features, and emotional features corresponding to the pre-acquired sample video, further comprising:
and linearly interpolating the audio features according to the frame rate of the sample video, the frame rate of the audio features and a preset frame rate alignment rule.
6. The method of claim 4, wherein the training process of the three-dimensional offset prediction model further comprises:
rendering the first three-dimensional animation as a 2D image sequence;
and taking the error between the 2D image sequence and the video frame sequence of the sample video as a second loss function, and performing model parameter adjustment on the three-dimensional offset prediction model.
7. The method of claim 4, wherein the training process of the three-dimensional offset prediction model further comprises:
determining a sample three-dimensional face animation acquired in advance, and an animation audio feature and an animation emotion feature corresponding to the sample three-dimensional face animation;
inputting the animation audio features and the animation emotion features into a trained three-dimensional offset prediction model to obtain animation vertex offset data corresponding to the sample three-dimensional face animation;
based on a third loss function between the three-dimensional face offset animation corresponding to the animation vertex offset data and the sample three-dimensional face animation, performing model parameter adjustment on the trained three-dimensional offset prediction model; the three-dimensional face offset animation is a three-dimensional face animation after the three-dimensional basic face model is driven according to the animation vertex offset data.
8. The method of claim 4, wherein the training process of the three-dimensional offset prediction model further comprises:
acquiring driving characteristics of the audio corresponding to the sample video on a face in the sample video;
calculating mutual information between the driving features and emotion features corresponding to the sample video;
and carrying out parameter adjustment on the three-dimensional offset prediction model by taking the minimization of the mutual information as a target.
9. The method of claim 4, wherein determining 4D synthesized data, audio features, and emotional features corresponding to the pre-acquired sample video comprises:
carrying out three-dimensional face reconstruction on each video frame in the pre-acquired sample video to obtain three-dimensional deformation parameters corresponding to each video frame, and determining a three-dimensional reconstruction face model restored by the three-dimensional deformation parameters corresponding to each video frame;
synthesizing the three-dimensional reconstruction face model corresponding to each video frame according to the frame rate corresponding to the sample video to obtain 4D synthesized data corresponding to the sample video;
and extracting voice features from the audio corresponding to the sample video to obtain audio features corresponding to the sample video, and extracting emotion features from the sample video to obtain emotion features corresponding to the sample video.
10. The method according to claim 2, wherein inputting the speech features and the target emotion features of the target speech into a pre-trained three-dimensional deviation prediction model to obtain three-dimensional model vertex deviation data corresponding to the target speech, and driving a three-dimensional basic face model according to the three-dimensional model vertex deviation data to obtain a three-dimensional face animation corresponding to the target speech, comprising:
inputting the voice characteristics and the target emotion characteristics of the target voice into a pre-trained voice driving model, predicting three-dimensional model vertex deviation data corresponding to the target voice by using a three-dimensional deviation prediction model, and driving a three-dimensional basic face model according to the three-dimensional model vertex deviation data to obtain a three-dimensional face animation corresponding to the target voice.
11. A three-dimensional face model driving device based on voice, comprising:
the deviation prediction module is used for carrying out three-dimensional model vertex deviation prediction according to predetermined deviation prediction parameters based on the voice characteristics and the target emotion characteristics of the target voice to obtain three-dimensional model vertex deviation data corresponding to the target voice;
The driving module is used for driving the three-dimensional basic face model according to the three-dimensional model vertex offset data to obtain a three-dimensional face animation corresponding to the target voice;
the offset prediction parameters are determined by carrying out three-dimensional model vertex offset prediction processing on 4D synthesized data, audio features and emotion features corresponding to the sample video; the 4D synthesized data corresponding to the sample video is synthesized data according to the frame rate of the sample video by using a three-dimensional reconstruction face model corresponding to each frame of image of the sample video; the three-dimensional reconstruction face model corresponding to the image is obtained by reconstructing the three-dimensional face of the image;
the offset prediction module is specifically configured to fuse emotion fusion features corresponding to a voice frame in the target voice with voice features to obtain coding features corresponding to the voice frame; the emotion fusion features corresponding to the voice frames are obtained by fusing the target emotion features with three-dimensional model vertex offset data corresponding to the voice frame before the voice frame; and decoding the coding features corresponding to the voice frames in the target voice to obtain three-dimensional model vertex offset data corresponding to the voice frames in the target voice.
12. An electronic device, comprising: a memory and a processor;
the memory is connected with the processor and used for storing programs;
the processor is configured to implement the three-dimensional face model driving method based on voice according to any one of claims 1 to 10 by running a program in the memory.
13. A storage medium having stored thereon a computer program which, when executed by a processor, implements the speech-based three-dimensional face model driving method according to any one of claims 1 to 10.
CN202310472056.6A 2023-04-27 2023-04-27 Three-dimensional face model driving method based on voice and related device Active CN116188649B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310472056.6A CN116188649B (en) 2023-04-27 2023-04-27 Three-dimensional face model driving method based on voice and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310472056.6A CN116188649B (en) 2023-04-27 2023-04-27 Three-dimensional face model driving method based on voice and related device

Publications (2)

Publication Number Publication Date
CN116188649A CN116188649A (en) 2023-05-30
CN116188649B true CN116188649B (en) 2023-10-13

Family

ID=86440713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310472056.6A Active CN116188649B (en) 2023-04-27 2023-04-27 Three-dimensional face model driving method based on voice and related device

Country Status (1)

Country Link
CN (1) CN116188649B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664731B (en) * 2023-06-21 2024-03-29 华院计算技术(上海)股份有限公司 Face animation generation method and device, computer readable storage medium and terminal
CN117012198B (en) * 2023-09-28 2023-12-19 中影年年(北京)文化传媒有限公司 Voice interaction method and system based on artificial intelligence

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724458A (en) * 2020-05-09 2020-09-29 天津大学 Voice-driven three-dimensional human face animation generation method and network structure
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN112785670A (en) * 2021-02-01 2021-05-11 北京字节跳动网络技术有限公司 Image synthesis method, device, equipment and storage medium
CN113838174A (en) * 2021-11-25 2021-12-24 之江实验室 Audio-driven face animation generation method, device, equipment and medium
CN115049016A (en) * 2022-07-20 2022-09-13 聚好看科技股份有限公司 Model driving method and device based on emotion recognition
CN115330911A (en) * 2022-08-09 2022-11-11 北京通用人工智能研究院 Method and system for driving mimicry expression by using audio
WO2023011221A1 (en) * 2021-08-06 2023-02-09 南京硅基智能科技有限公司 Blend shape value output method, storage medium and electronic apparatus
CN115984933A (en) * 2022-12-29 2023-04-18 浙江极氪智能科技有限公司 Training method of human face animation model, and voice data processing method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3915108B1 (en) * 2019-01-25 2023-11-29 Soul Machines Limited Real-time generation of speech animation
US11049308B2 (en) * 2019-03-21 2021-06-29 Electronic Arts Inc. Generating facial position data based on audio data
CN110531860B (en) * 2019-09-02 2020-07-24 腾讯科技(深圳)有限公司 Animation image driving method and device based on artificial intelligence

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724458A (en) * 2020-05-09 2020-09-29 天津大学 Voice-driven three-dimensional human face animation generation method and network structure
CN112215927A (en) * 2020-09-18 2021-01-12 腾讯科技(深圳)有限公司 Method, device, equipment and medium for synthesizing face video
CN112785670A (en) * 2021-02-01 2021-05-11 北京字节跳动网络技术有限公司 Image synthesis method, device, equipment and storage medium
WO2023011221A1 (en) * 2021-08-06 2023-02-09 南京硅基智能科技有限公司 Blend shape value output method, storage medium and electronic apparatus
CN113838174A (en) * 2021-11-25 2021-12-24 之江实验室 Audio-driven face animation generation method, device, equipment and medium
CN115049016A (en) * 2022-07-20 2022-09-13 聚好看科技股份有限公司 Model driving method and device based on emotion recognition
CN115330911A (en) * 2022-08-09 2022-11-11 北京通用人工智能研究院 Method and system for driving mimicry expression by using audio
CN115984933A (en) * 2022-12-29 2023-04-18 浙江极氪智能科技有限公司 Training method of human face animation model, and voice data processing method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speech-Driven 3D Facial Animation With Implicit Emotional Awareness: A Deep Learning Approach;Hai X. Pham, Samuel Cheung, Vladimir Pavlovic;Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops;80-88 *
A Survey of Speech-Driven Facial Animation Research (语音驱动的人脸动画研究现状综述); 李欣怡; 张志超; Computer Engineering and Applications (计算机工程与应用) (22); 26-33 *

Also Published As

Publication number Publication date
CN116188649A (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN116188649B (en) Three-dimensional face model driving method based on voice and related device
CN110531860B (en) Animation image driving method and device based on artificial intelligence
CN107861938B (en) POI (Point of interest) file generation method and device and electronic equipment
CN113256821B (en) Three-dimensional virtual image lip shape generation method and device and electronic equipment
US20210312671A1 (en) Method and apparatus for generating video
CN113592985B (en) Method and device for outputting mixed deformation value, storage medium and electronic device
CN110751708A (en) Method and system for driving face animation in real time through voice
CN108492322A (en) A method of user&#39;s visual field is predicted based on deep learning
CN114339409B (en) Video processing method, device, computer equipment and storage medium
EP4143787A1 (en) Photometric-based 3d object modeling
CN114782661B (en) Training method and device for lower body posture prediction model
CN113111812A (en) Mouth action driving model training method and assembly
CN116228979A (en) Voice-driven editable face replay method, device and storage medium
CN115512014A (en) Method for training expression driving generation model, expression driving method and device
CN115393480A (en) Speaker synthesis method, device and storage medium based on dynamic nerve texture
CN114429611A (en) Video synthesis method and device, electronic equipment and storage medium
CN116704084B (en) Training method of facial animation generation network, facial animation generation method and device
CN112053278A (en) Image processing method and device and electronic equipment
CN116883524A (en) Image generation model training, image generation method and device and computer equipment
CN116863042A (en) Motion generation method of virtual object and training method of motion generation model
CN115082636A (en) Single image three-dimensional reconstruction method and equipment based on hybrid Gaussian network
Wong Exploiting Regularities to Recover 3D Scene Geometry
CN115761065A (en) Intermediate frame generation method, device, equipment and medium
Bozhilov et al. Lip-Synchronized 3D Facial Animation Using Audio-Driven Graph Convolutional Autoencoder
Gu A journey to photo-realistic facial animation synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant