WO2022267380A1 - Voice-driven facial action synthesis method, electronic device and storage medium - Google Patents

Voice-driven facial action synthesis method, electronic device and storage medium Download PDF

Info

Publication number
WO2022267380A1
WO2022267380A1 (PCT/CN2021/137489)
Authority
WO
WIPO (PCT)
Prior art keywords
facial
movement
face
muscles
parameter
Prior art date
Application number
PCT/CN2021/137489
Other languages
English (en)
French (fr)
Inventor
彭飞
马世奎
Original Assignee
达闼科技(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 达闼科技(北京)有限公司 filed Critical 达闼科技(北京)有限公司
Publication of WO2022267380A1 publication Critical patent/WO2022267380A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks

Definitions

  • Embodiments of the present invention relate to the field of computer information technology, and in particular to a voice-driven facial action synthesis method, an electronic device, and a storage medium.
  • VOCA (Voice Operated Character Animation)
  • the target data for VOCA model training are the corner point positions of a character model generated by 3D visual effects synthesis software such as FLAME.
  • FLAME (3D visual effects synthesis software)
  • the VOCA model usually models only mouth movement, while many other parts of the face, such as eyebrow raising and blinking, do not move, which makes the output facial actions look stiff.
  • the purpose of the embodiments of the present invention is to provide a voice-driven facial action synthesis method, an electronic device, and a storage medium that can be generally applied to character models with various numbers of corner points and that produce rich facial actions with natural expression effects.
  • the embodiment of the present invention provides a voice-driven face action synthesis method, including:
  • the audio vector is input into the parameter recognition model for processing, and the facial muscle movement parameters corresponding to the facial action to be recognized are output;
  • the parameter recognition model is obtained after training based on the sample audio vectors and the predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and the loss function used when training the parameter recognition model is formed based on a facial muscle movement loss;
  • the movement of the corner points on the multiple elastic bodies divided according to the facial muscle distribution in the face model is controlled to obtain the result of the facial action to be recognized.
  • Embodiments of the present invention also provide an electronic device, including:
  • the memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform the above-mentioned voice-driven facial action synthesis method.
  • Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the voice-driven face action synthesis method as described above is implemented.
  • the embodiment of the present application also provides a computer program which, when executed by a processor, implements the above-mentioned voice-driven facial action synthesis method.
  • the embodiment of the present invention processes the speech signal of the facial action to be recognized to obtain the audio vector corresponding to the speech signal, and inputs the audio vector into the parameter recognition model for processing; the parameter recognition model is obtained after training based on the sample audio vectors and the predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and the loss function used when training it is formed from a facial muscle movement loss, so that the parameter recognition model builds the correspondence between the speech signal and the movement parameters of the facial muscles; through this correspondence, the speech signal of the facial action to be recognized is converted into facial muscle movement parameters, which are used to control the movement of the corner points on the multiple elastic bodies divided according to the facial muscle distribution in the face model, yielding the result of the facial action to be recognized.
  • the direct construction of a relationship between the speech signal and specific corner points of the face model used in traditional voice-driven facial action models is abandoned; instead, the correspondence between the speech signal and the facial muscle movement parameters is first established through the parameter recognition model, and the facial muscle movement parameters are then associated with the corner point movements on the multiple elastic bodies divided according to the facial muscle distribution in the face model, so that corner point movement is controlled based on the facial muscle movement parameters.
  • because the facial action is simulated through facial muscle movement, which follows biological behavior, the output action is vivid and lifelike.
  • because the corner point movement is controlled based on the facial muscle movement parameters, there is no limit on the number of corner points, and the method can be applied to a variety of face models with different numbers of corner points, with good portability.
  • Fig. 1 is the specific flow chart of the human face action synthesis method based on speech drive according to the first embodiment of the present invention
  • Fig. 2 is a specific flowchart of a voice-driven face action synthesis method according to the second embodiment of the present invention
  • Fig. 3 is the concrete flowchart of another kind of voice-driven face action synthesis method according to the second embodiment of the present invention.
  • Fig. 4 is the specific flow chart of the human face action synthesis method based on speech drive according to the third embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
  • the mouth movement is mainly driven by speech.
  • the mouth movement of the virtual character should be basically the same as that of a real person, so that people feel the sentence is spoken by the virtual character; if the mouth shape does not match, it gives a "fake" impression. This is also the problem that the lip movement of many "virtual anchors" has to solve.
  • VOCA Voice Operated Character Animation
  • the target data for training the VOCA model are the corner point positions in a character model generated by 3D visual effects synthesis software such as FLAME. Since the number of corner points of a character model synthesized by FLAME is fixed, it is difficult to migrate the target data to a custom virtual character, so the effect of training once and applying to many scenarios cannot be achieved.
  • the VOCA model usually models only mouth movement, while many other parts of the face, such as eyebrow raising and blinking, do not move, which makes the output facial actions look stiff.
  • this solution abandons the training on the corner points of the virtual character's face used in traditional solutions, and instead trains on the "muscle" parameters of the virtual character.
  • when constructing the face model of a virtual character, the facial muscles of a real person are used as a reference to build a facial muscle model; that is, elastic bodies are modeled on the virtual character's face according to the distribution of facial muscles to imitate the facial muscles of a real person (such as the orbicularis oculi, corrugator supercilii, levator labii superioris, and zygomaticus).
  • based on these "muscle" parameters, the corner point positions on the corresponding elastic bodies are controlled, so as to achieve voice-driven synthesis of facial actions, and the synthesized facial actions are not limited to mouth movement.
  • the first embodiment of the present invention relates to a voice-driven facial action synthesis method, which is applicable to scenes based on voice-driven facial actions, such as virtual characters, robot scenes, and the like.
  • this voice-driven face action synthesis method includes the following steps:
  • Step 101 Process the speech signal of the facial action to be recognized to obtain an audio vector corresponding to the speech signal.
  • the speech signal of the facial action to be recognized can be a continuous speech signal generated in real time, either when the user speaks and the voice is captured by a recording device, or when the speech is produced by a speech synthesis system.
  • the audio vector corresponding to the speech signal is obtained by digitally encoding the signal frame of the speech signal, for example, using a Deep Speech model for encoding.
  • Step 102 Input the audio vector into the parameter recognition model for processing, and output the facial muscle movement parameters corresponding to the facial action to be recognized.
  • the parameter recognition model is obtained after training based on the sample audio vectors and the predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and the loss function during the parameter recognition model training is formed based on the facial muscle movement loss.
  • the audio vector is input into a pre-trained parameter recognition model, and the model outputs the facial muscle movement parameters corresponding to the facial movements to be recognized.
  • the parameter recognition model is obtained after training based on sample audio vectors and predetermined facial muscle movement parameter labels corresponding to each sample audio vector. In order to ensure the accuracy of the trained model, a large number of sample audio vectors is usually used for training.
  • the training samples are formed as follows: a depth camera is used to collect speech signals and face image data of people speaking in ordinary indoor scenes; the speech signals are encoded to obtain sample audio vectors, and each sample audio vector is associated with the face image data according to the time at which it was acquired; the label corresponding to each sample audio vector is the facial muscle movement parameter label in the face image data associated with that sample audio vector, obtained by annotating the facial muscle movement in the associated face image data with parameters.
  • the labeling process can be manually marked or marked with a preset algorithm.
  • the facial muscle motion parameters are used to describe the positional parameters of the facial muscle motion, such as the contraction displacement along the muscle texture direction.
  • the facial muscles that can highlight facial movements can be selected to set the facial muscle movement parameter labels.
  • the facial muscle movement parameters include the contraction parameter, along the muscle-fiber contraction direction, of at least one of the following muscles: the left and right frontalis, corrugator supercilii, orbicularis oculi, levator labii superioris alaeque nasi, orbicularis oris, depressor labii inferioris, and risorius (smile muscles).
  • the loss function during the training of the parameter recognition model is based on the loss of facial muscle movement.
  • the difference between the facial muscle movement parameters predicted by the neural network and the predetermined facial muscle movement parameter labels constrains the model training process.
  • Step 103 Control the movement of corner points on multiple elastic bodies in the face model according to the facial muscle distribution according to the facial muscle movement parameters of the facial movement to be recognized, and obtain the result of the facial movement to be recognized.
  • the face model to be controlled can be a 3D face model, and the corner points on the face model are pre-divided into multiple elastic bodies according to the distribution of real facial muscles; each elastic body corresponds to one facial muscle and corresponds one-to-one with the facial muscle movement parameter of that muscle.
  • the corner movement on the elastic body corresponding to each facial muscle in the human face model is controlled, so as to realize the output of human facial movement results on the human face model.
  • if the facial muscle movement parameter specifies a displacement for the smile muscle, the parameter is mapped to the corresponding displacement of the corner points on the elastic body corresponding to the smile muscle in the face model, and a smile action is output.
  • this embodiment processes the speech signal of the facial action to be recognized to obtain the audio vector corresponding to the speech signal, and inputs the audio vector into the parameter recognition model for processing; the parameter recognition model is obtained after training based on the sample audio vectors and the predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and the loss function used when training it is formed from a facial muscle movement loss, so that the parameter recognition model builds the correspondence between the speech signal and the movement parameters of the facial muscles; through this correspondence, the speech signal of the facial action to be recognized is converted into facial muscle movement parameters;
  • the movement of the corner points on the multiple elastic bodies divided according to the facial muscle distribution yields the result of the facial action to be recognized.
  • the direct construction of a relationship between the speech signal and specific corner points of the face model used in traditional voice-driven facial action models is abandoned; instead, the correspondence between the speech signal and the facial muscle movement parameters is first established through the parameter recognition model, and the facial muscle movement parameters are then associated with the corner point movements on the multiple elastic bodies divided according to the facial muscle distribution in the face model, so that corner point movement is controlled based on the facial muscle movement parameters.
  • because the facial action is simulated through facial muscle movement, which follows biological behavior, the output action is vivid and lifelike.
  • because the corner point movement is controlled based on the facial muscle movement parameters, there is no limit on the number of corner points, and the method can be applied to a variety of face models with different numbers of corner points, with good portability.
  • the second embodiment of the present invention relates to a voice-driven face action synthesis method.
  • the second embodiment is an improvement on the first embodiment.
  • the improvement is to refine the internal structure of the parameter recognition model and, based on this refinement, describe the data processing performed by the parameter recognition model.
  • the above parameter recognition model is a neural network model, and the neural network model includes three convolutional layers and two fully connected layers.
  • the above-mentioned step 102 specifically includes the following sub-steps:
  • Sub-step 1021 The audio vector is sequentially passed through three layers of convolutional layers for sample space feature extraction to obtain convolutional layer feature data.
  • each audio vector sequentially passes through three convolutional layers to complete sample space feature extraction to obtain convolutional layer feature data.
  • the convolutional layer feature data is a numeric feature vector with a specified dimension.
  • Sub-step 1022 After classifying the feature data of the convolutional layer through two fully-connected layers in sequence, output the facial muscle movement parameters corresponding to the facial movements to be recognized.
  • the above-mentioned two layers of fully-connected layers are both one-dimensional fully-connected layers, and the dimension of the vector output after the feature data of the convolutional layer is processed by the two layers of fully-connected layers is the same as or close to the dimension of the above-mentioned audio vector.
  • the above-mentioned neural network model can also include two pooling layers, which are used to reduce the dimensionality of the intermediate vector data output by the convolutional layers, so that the size and number of convolution kernels in the convolutional layers can be set more flexibly.
  • the above-mentioned sub-step 1021 may specifically include:
  • the processed audio vector is processed by a layer of pooling layer, and the audio vector processed by the pooling layer is input to the next convolutional layer for processing.
  • after the audio vector is processed by the first convolutional layer to form the first convolutional feature data, it can be processed by one pooling layer for dimensionality reduction; the reduced first convolutional feature data is input into the second convolutional layer to form the second convolutional feature data, which is then processed by another pooling layer for dimensionality reduction.
  • the reduced second convolutional feature data is input into the third convolutional layer for processing to obtain the final convolutional-layer feature data.
  • step 101 may include the following sub-steps:
  • Sub-step 1011 Encode the speech signal of the facial action to be recognized using the Deep Speech model, encoding every 32 frames of the speech signal into a 29-dimensional vector as an audio vector.
  • the pre-trained deep speech model is used to encode the continuous speech signal to be recognized, and the speech signal frame is encoded into an audio vector every 32 frames in time order, and the dimension number of each audio vector is 29.
  • decentralization and normalization processing can be further performed on the encoded audio vectors to obtain optimized audio vectors.
  • step 102 may specifically include the following substeps.
  • Sub-step 1022 Extract n audio vectors from the audio vectors each time and process them with the first convolutional layer to obtain the first convolutional feature data; the first convolutional layer contains 32 convolution kernels and the kernel size is 3.
  • the first convolutional layer contains 32 convolution kernels with a size of 3, the input vector dimension is n×29, and the output vector dimension is n×32×29.
  • Sub-step 1023 Process the first convolution feature data through the first pooling layer to obtain the first pooling feature data; the size of the first pooling layer is 2.
  • the output vector dimension of the first convolutional feature data after passing through the first pooling layer is n×32×15.
  • Sub-step 1024 Process the first pooled feature data through the second convolutional layer to obtain the second convolutional feature data; the second convolutional layer contains 64 convolution kernels and the size of the convolution kernel is 3,
  • the dimension of the output vector after the first pooled feature data is processed by the second convolutional layer is n×64×15.
  • Sub-step 1025 Process the second convolution feature data through the second pooling layer to obtain the second pooling feature data; the size of the second pooling layer is 2.
  • the dimension of the output vector after the second convolutional feature data is processed by the second pooling layer is n×64×8.
  • Sub-step 1026 Process the second pooled feature data through the third convolutional layer to obtain the third convolutional feature data; the third convolutional layer includes 128 convolution kernels and the size of the convolution kernel is 4.
  • the dimension of the output vector after the second pooled feature data is processed by the third convolutional layer is n×128×8, which is then flattened to obtain an n×1024-dimensional vector.
  • Sub-step 1027 After classifying the third convolutional feature data through the two fully connected layers in sequence, output n vectors each with 28 dimensions, each vector being a group of facial muscle movement parameters corresponding to the facial action to be recognized.
  • the third convolution feature data is sent to the fully connected layer.
  • the input vector of the first fully connected layer is n×1024 and its output vector dimension is n×256, and the output vector dimension of the second fully connected layer is n×28.
  • a ReLU activation function and a dropout layer can further be added to the parameter recognition model of this embodiment, with a dropout probability of 0.25.
  • a loss function is then constructed on the predicted values.
  • the loss function is a quadratic function, constructed from the squared difference between the predicted vector and the ground-truth vector.
  • the above-mentioned facial muscle movement parameters may include: movement displacement parameters of the facial muscles and movement velocity parameters of the facial muscles, wherein a movement velocity parameter is the parameter increment between two adjacent groups of movement displacement parameters (the later movement displacement parameter minus the preceding one). Therefore, the movement displacement parameters and the movement velocity parameters have the same dimensionality.
  • the first 14 dimensions are movement displacement parameters
  • the last 14 dimensions are movement speed parameters.
  • the loss function used when training the above-mentioned parameter recognition model can be formed based on the movement displacement loss and the movement velocity loss of the facial muscles: E_total = a_1·E_p + a_2·E_v, where E_p is the movement displacement loss, E_v is the movement velocity loss, and a_j (j = 1, 2) is the weight of the corresponding loss term.
  • by optimizing this objective function, the method not only requires minimizing the difference between the movement displacement parameters output by the neural network and the corresponding target parameters, but also further requires minimizing the difference of their first-order differentials (after discretization, the first differences, i.e., the movement velocity parameters); this requirement reflects the motion similarity of facial actions.
  • the movement displacement loss E_p is calculated by the following formula: E_p = ||y_i - f_i||^2,
  • where y_i is the movement displacement information of the reference facial muscles corresponding to the i-th sample audio vector, obtained by inputting the i-th sample audio vector into the annotation algorithm,
  • f_i is the movement displacement information of the facial muscles of the i-th sample audio vector predicted by the neural network during training of the parameter recognition model, and i is an integer greater than 0.
  • the movement velocity loss E_v is calculated by the following formula: E_v = ||(y_i - y_{i-1}) - (f_i - f_{i-1})||^2.
  • training the parameter recognition model with a movement displacement loss function that describes the static facial expression constraint and a movement velocity loss function that describes the dynamic facial expression constraint plays an important role in making the virtual character's speech look authentic.
  • the third embodiment of the present invention relates to a voice-driven face action synthesis method.
  • the third embodiment is an improvement on the first embodiment and the second embodiment.
  • the improvement is that, based on a preset face model, the movement directions of the corner points on the elastic bodies correspond to the facial muscle movement parameters, so that the movement state of the corresponding corner points in the face model is controlled according to the facial muscle movement parameters.
  • the above-mentioned step 103 specifically includes the following sub-steps:
  • Sub-step 1031 Determine the corner points on the elastic body corresponding to the movement parameters of the facial muscles.
  • the facial muscles can be associated with the corner points of the three-dimensional model of the human face.
  • the number of corner points of the face models built in different scenes is not the same.
  • the correspondence between the facial muscle movement parameters and the corner points on the elastic bodies at the corresponding muscle positions in the face model should be established in advance.
  • for example, after the face model is determined, the regions controlled by different facial muscles can be divided according to the distribution of facial muscles, and the corner points contained in each region are the corner points corresponding to that facial muscle.
  • Sub-step 1032 Determine the movement direction of the corner point on the elastic body.
  • the corner points corresponding to the movement parameters of the facial muscles define a vector along the direction of the muscle lines in advance, and the vector is the movement direction of these corner points.
  • Sub-step 1033 Control the corresponding corner points on the elastic body to move along the movement direction through the movement parameters of the facial muscles.
  • the corner points corresponding to the facial muscle can move in a predefined movement direction, and the magnitude of the movement is the muscle contraction parameter, that is, the facial muscle movement parameter.
  • the parameter recognition model calculates and outputs the facial muscle movement parameters corresponding to the facial movements.
  • the left side of the human body is the x direction
  • the top side is the y direction
  • the front side is the z direction
  • the movement direction selected for the corner points corresponding to the smile muscle is: (-1, 1, 0.8).
  • when the speech signal is input, the movement displacement and movement velocity of the corresponding corner points are determined from the facial muscle movement parameters calculated by the parameter recognition model, and the corresponding corner points in the face model are contracted along the predefined movement direction to obtain the corresponding facial action (expression) effect.
  • this embodiment processes the speech signal of the facial action to be recognized to obtain the audio vector corresponding to the speech signal, and inputs the audio vector into the parameter recognition model for processing; the parameter recognition model is obtained after training based on the sample audio vectors and the predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and the loss function used when training it is formed from a facial muscle movement loss, so that the parameter recognition model builds the correspondence between the speech signal and the movement parameters of the facial muscles; through this correspondence, the speech signal of the facial action to be recognized is converted into facial muscle movement parameters;
  • the movement of the corner points on the multiple elastic bodies divided according to the facial muscle distribution yields the result of the facial action to be recognized.
  • the direct construction of a relationship between the speech signal and specific corner points of the face model used in traditional voice-driven facial action models is abandoned; instead, the correspondence between the speech signal and the facial muscle movement parameters is first established through the parameter recognition model, and the facial muscle movement parameters are then associated with the corner point movements on the multiple elastic bodies divided according to the facial muscle distribution in the face model, so that corner point movement is controlled based on the facial muscle movement parameters.
  • because the facial action is simulated through facial muscle movement, which follows biological behavior, the output action is vivid and lifelike.
  • because the corner point movement is controlled based on the facial muscle movement parameters, there is no limit on the number of corner points, and the method can be applied to a variety of face models with different numbers of corner points, with good portability.
  • the fourth embodiment of the present invention relates to an electronic device, as shown in FIG. 5, including at least one processor 202 and a memory 201 communicatively connected to the at least one processor 202; the memory 201 stores instructions executable by the at least one processor 202, and the instructions are executed by the at least one processor 202, so that the at least one processor 202 can execute any one of the foregoing method embodiments.
  • the memory 201 and the processor 202 are connected by a bus, and the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors 202 and various circuits of the memory 201 together.
  • the bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • the data processed by the processor 202 is transmitted on the wireless medium through the antenna, and further, the antenna also receives the data and transmits the data to the processor 202 .
  • Processor 202 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfacing, voltage regulation, power management, and other control functions. The memory 201 may be used to store data used by the processor 202 when performing operations.
  • a fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program.
  • the computer program is executed by the processor, any one of the above method embodiments is implemented.
  • a sixth embodiment of the present invention relates to a computer program.
  • the computer program is executed by the processor, any one of the above method embodiments is implemented.
  • a storage medium includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Embodiments of the present invention relate to the field of computer information technology and disclose a voice-driven facial action synthesis method, an electronic device, and a storage medium. A speech signal of a facial action to be recognized is processed to obtain an audio vector corresponding to the speech signal; the audio vector is input into a parameter recognition model for processing, which outputs the facial muscle movement parameters corresponding to the facial action to be recognized; and, using the facial muscle movement parameters of the facial action to be recognized, the movement of corner points on a plurality of elastic bodies in a face model, divided according to the distribution of facial muscles, is controlled to obtain the result of the facial action to be recognized. The solution is generally applicable to character models with various numbers of corner points, and the output facial actions are rich with natural expression effects.

Description

Voice-driven facial action synthesis method, electronic device, and storage medium
Cross-Reference
This application is based on and claims priority to the Chinese patent application with application No. 2021107122777 filed on June 25, 2021, the entire content of which is incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to the field of computer information technology, and in particular to a voice-driven facial action synthesis method, an electronic device, and a storage medium.
Background
Whether for a real-world robot or a virtual character or model inside a computer, automatically matching the lip shape of the virtual character or model to audio is a difficult problem for the industry; even after years of research and development, it still troubles practitioners in the field.
At present, there are many ways to drive a virtual character's lip shape from speech, the most common being the VOCA (Voice Operated Character Animation) model. The target data for training the VOCA model are the corner point positions of a character model generated by 3D visual effects synthesis software such as FLAME. Since the number of corner points of a character model synthesized by FLAME is fixed, it is difficult to migrate the target data to a custom virtual character, so the effect of training once and applying to many scenarios cannot be achieved. In addition, the VOCA model usually models only mouth movement; many other parts of the face, such as eyebrow raising and blinking, do not move, which makes the output facial actions look stiff.
Summary of the Invention
The purpose of the embodiments of the present invention is to provide a voice-driven facial action synthesis method, an electronic device, and a storage medium that are generally applicable to character models with various numbers of corner points and that produce rich facial actions with natural expression effects.
To solve the above technical problem, an embodiment of the present invention provides a voice-driven facial action synthesis method, including:
processing a speech signal of a facial action to be recognized to obtain an audio vector corresponding to the speech signal;
inputting the audio vector into a parameter recognition model for processing, and outputting facial muscle movement parameters corresponding to the facial action to be recognized;
wherein the parameter recognition model is obtained after training based on sample audio vectors and predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and the loss function used when training the parameter recognition model is formed based on a facial muscle movement loss; and
controlling, according to the facial muscle movement parameters of the facial action to be recognized, the movement of corner points on a plurality of elastic bodies in a face model divided according to the distribution of facial muscles, to obtain the result of the facial action to be recognized.
An embodiment of the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the voice-driven facial action synthesis method described above.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the voice-driven facial action synthesis method described above.
An embodiment of the present application also provides a computer program that, when executed by a processor, implements the voice-driven facial action synthesis method described above.
Compared with the prior art, the embodiments of the present invention process the speech signal of the facial action to be recognized to obtain the corresponding audio vector, and input the audio vector into the parameter recognition model for processing; the parameter recognition model is obtained after training on sample audio vectors and predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and the loss function used when training it is formed from a facial muscle movement loss, so that the parameter recognition model builds the correspondence between the speech signal and the movement parameters of the facial muscles. Through this correspondence, the speech signal of the facial action to be recognized is converted into facial muscle movement parameters, which are used to control the movement of the corner points on the plurality of elastic bodies in the face model divided according to the facial muscle distribution, yielding the result of the facial action to be recognized. In this solution, the direct construction of a relationship between the speech signal and specific corner points of the face model used in traditional voice-driven facial action (e.g., lip shape) models is abandoned; instead, the correspondence between the speech signal and the facial muscle movement parameters is first established through the parameter recognition model, and the facial muscle movement parameters are then associated with the corner point movements on the elastic bodies divided according to the facial muscle distribution, so that corner point movement is controlled based on the facial muscle movement parameters. Because facial actions are simulated through facial muscle movement, which follows biological behavior, the output actions are vivid and lifelike. And because corner point movement is controlled based on the facial muscle movement parameters, there is no restriction on the number of corner points, so the method is applicable to face models with different numbers of corner points and has good portability.
Brief Description of the Drawings
One or more embodiments are exemplarily illustrated by the figures in the corresponding drawings; these exemplary illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings denote similar elements, and unless otherwise stated, the figures are not drawn to scale.
Fig. 1 is a flowchart of a voice-driven facial action synthesis method according to the first embodiment of the present invention;
Fig. 2 is a flowchart of a voice-driven facial action synthesis method according to the second embodiment of the present invention;
Fig. 3 is a flowchart of another voice-driven facial action synthesis method according to the second embodiment of the present invention;
Fig. 4 is a flowchart of a voice-driven facial action synthesis method according to the third embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an electronic device according to the fourth embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. However, those of ordinary skill in the art will appreciate that many technical details are set forth in the embodiments so that the reader can better understand the present application; the technical solutions claimed in the present application can still be implemented without these technical details and with various changes and modifications based on the following embodiments.
In existing voice-driven facial action synthesis solutions, it is mainly the mouth movement that is driven by speech. For example, when a virtual character is made to say "The weather is really nice today", the character's mouth movement during speech playback should be basically the same as a real person's, so that the sentence feels as if it were spoken by the virtual character; if the mouth shape does not match, the result feels "fake". This is also the problem that the lip movement of many current "virtual anchors" has to solve.
At present, there are many ways to drive a virtual character's lip shape from speech, the most common being the VOCA (Voice Operated Character Animation) model. The target data for training the VOCA model are the corner point positions in a character model generated by 3D visual effects synthesis software such as FLAME. Since the number of corner points of a character model synthesized by FLAME is fixed, it is difficult to migrate the target data to a custom virtual character, so the effect of training once and applying to many scenarios cannot be achieved. In addition, the VOCA model usually models only mouth movement; many other parts of the face, such as eyebrow raising and blinking, do not move, which makes the output facial actions look stiff.
Therefore, to solve the above problems, this solution abandons the training on the corner points of the virtual character's face used in traditional solutions and instead trains on the "muscle" parameters of the virtual character. When building the face model of the virtual character, the facial muscles of a real person are used as a reference to construct a facial muscle model; that is, elastic bodies are modeled on the virtual character's face according to the distribution of facial muscles to imitate the facial muscles of a real person (such as the orbicularis oculi, corrugator supercilii, levator labii superioris, and zygomaticus). Then, based on these "muscle" parameters, the corner point positions on the corresponding elastic bodies are controlled, achieving voice-driven synthesis of facial actions that are not limited to mouth movement.
The first embodiment of the present invention relates to a voice-driven facial action synthesis method that is applicable to scenarios in which facial actions are driven by speech, such as virtual characters and robots. As shown in Fig. 1, the voice-driven facial action synthesis method includes the following steps:
Step 101: Process the speech signal of the facial action to be recognized to obtain the audio vector corresponding to the speech signal.
The speech signal of the facial action to be recognized may be a continuous speech signal generated in real time, either when the user speaks and the voice is captured by a recording device, or when speech is produced by a speech synthesis system. The audio vector corresponding to the speech signal is obtained by digitally encoding the signal frames of the speech signal, for example with a Deep Speech model.
Step 102: Input the audio vector into the parameter recognition model for processing, and output the facial muscle movement parameters corresponding to the facial action to be recognized.
The parameter recognition model is obtained after training based on sample audio vectors and predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and the loss function used when training the parameter recognition model is formed based on a facial muscle movement loss.
Specifically, the audio vector is input into the pre-trained parameter recognition model, which outputs the facial muscle movement parameters corresponding to the facial action to be recognized. The parameter recognition model is obtained after training based on sample audio vectors and predetermined facial muscle movement parameter labels corresponding to each sample audio vector; to ensure the accuracy of the trained model, a large number of sample audio vectors is usually used for training.
The training samples are formed as follows: speech signals and face image data of people speaking are collected with a depth camera in ordinary indoor scenes; the speech signals are encoded to obtain sample audio vectors, and each sample audio vector is associated with the face image data according to the time at which it was acquired. The label corresponding to each sample audio vector is the facial muscle movement parameter label in the face image data associated with that sample audio vector, obtained by annotating the facial muscle movement in the associated face image data with parameters. The annotation can be done manually or with a preset algorithm. The facial muscle movement parameters describe the positional parameters of facial muscle movement, such as the contraction displacement along the direction of the muscle fibers.
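As a concrete illustration of the pairing step described above, the following Python sketch associates each sample audio vector with the facial muscle movement parameters annotated on the depth-camera frame closest to it in time. The timestamps, the choice of 14 muscle dimensions, and the nearest-frame association are illustrative assumptions rather than details fixed by the text.

    import numpy as np

    def build_training_pairs(audio_vectors, audio_times, frame_params, frame_times):
        # audio_vectors: (num_vectors, 29) encoded sample audio vectors
        # audio_times:   (num_vectors,) acquisition times of the audio windows, in seconds
        # frame_params:  (num_frames, 14) annotated muscle contraction displacements per face frame
        # frame_times:   (num_frames,) capture times of the depth-camera face frames
        labels = []
        for t in audio_times:
            nearest = int(np.argmin(np.abs(frame_times - t)))  # associate by acquisition time
            labels.append(frame_params[nearest])
        return audio_vectors, np.stack(labels)                 # inputs (N, 29), labels (N, 14)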
To vividly describe the effect of facial actions while a person speaks, facial muscles that prominently display facial actions can be selected for setting the facial muscle movement parameter labels. In this embodiment, the facial muscle movement parameters include the contraction parameter, along the muscle-fiber contraction direction, of at least one of the following muscles:
the left and right frontalis, left and right corrugator supercilii, left and right orbicularis oculi, left and right levator labii superioris alaeque nasi, left and right orbicularis oris, left and right depressor labii inferioris, and left and right risorius (smile muscles).
In addition, it is further specified that the loss function used when training the parameter recognition model is formed based on a facial muscle movement loss; this loss function mainly constrains the training process with the difference between the facial muscle movement parameters predicted by the neural network used to build the parameter recognition model and the predetermined facial muscle movement parameter labels.
Step 103: According to the facial muscle movement parameters of the facial action to be recognized, control the movement of the corner points on the plurality of elastic bodies in the face model divided according to the distribution of facial muscles, and obtain the result of the facial action to be recognized.
The face model to be controlled may be a 3D face model whose corner points are divided in advance onto a plurality of elastic bodies according to the distribution of real facial muscles; each elastic body corresponds to one facial muscle and corresponds one-to-one with the facial muscle movement parameter of that muscle.
Specifically, according to the facial muscle movement parameters of the facial action to be recognized, the corner point movement on the elastic body corresponding to each facial muscle in the face model is controlled, so that the facial action result is output on the face model. For example, if a facial muscle movement parameter specifies a displacement for the smile muscle, the parameter is mapped onto the face model as the corresponding displacement of the corner points on the elastic body corresponding to the smile muscle, and a smiling action is output.
Compared with the prior art, this embodiment processes the speech signal of the facial action to be recognized to obtain the corresponding audio vector, and inputs the audio vector into the parameter recognition model for processing; the parameter recognition model is obtained after training on sample audio vectors and predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and the loss function used when training it is formed from a facial muscle movement loss, so that the parameter recognition model builds the correspondence between the speech signal and the movement parameters of the facial muscles. Through this correspondence, the speech signal of the facial action to be recognized is converted into facial muscle movement parameters, which are used to control the movement of the corner points on the plurality of elastic bodies in the face model divided according to the facial muscle distribution, yielding the result of the facial action to be recognized. In this solution, the direct construction of a relationship between the speech signal and specific corner points of the face model used in traditional voice-driven facial action (e.g., lip shape) models is abandoned; instead, the correspondence between the speech signal and the facial muscle movement parameters is first established through the parameter recognition model, and the facial muscle movement parameters are then associated with the corner point movements on the elastic bodies divided according to the facial muscle distribution, so that corner point movement is controlled based on the facial muscle movement parameters. Because facial actions are simulated through facial muscle movement, which follows biological behavior, the output actions are vivid and lifelike. And because corner point movement is controlled based on the facial muscle movement parameters, there is no restriction on the number of corner points, so the method is applicable to face models with different numbers of corner points and has good portability.
The second embodiment of the present invention relates to a voice-driven facial action synthesis method. The second embodiment is an improvement of the first embodiment; the improvement lies in refining the internal structure of the parameter recognition model and, based on this refinement, describing the data processing performed by the parameter recognition model. Specifically, the parameter recognition model is a neural network model comprising three convolutional layers and two fully connected layers. As shown in Fig. 2, step 102 specifically includes the following sub-steps:
Sub-step 1021: Pass the audio vector through the three convolutional layers in sequence for sample-space feature extraction to obtain convolutional-layer feature data.
All three convolutional layers are one-dimensional convolutional layers, and the size and number of convolution kernels in each layer are not limited. Specifically, each audio vector passes through the three convolutional layers in sequence to complete the sample-space feature extraction, yielding convolutional-layer feature data, which is a numeric feature vector with a specified number of dimensions.
Sub-step 1022: Classify the convolutional-layer feature data through the two fully connected layers in sequence, and output the facial muscle movement parameters corresponding to the facial action to be recognized.
Both fully connected layers are one-dimensional fully connected layers, and the dimension of the vector output after the convolutional-layer feature data has been processed by the two fully connected layers is the same as or close to the dimension of the audio vector.
In one example, the neural network model may further include two pooling layers, which are used to reduce the dimensionality of the intermediate vector data output by the convolutional layers, so that the size and number of convolution kernels in the convolutional layers can be set more flexibly. Accordingly, sub-step 1021 may specifically include:
when the audio vector is processed by the first two convolutional layers in sequence, after each convolutional layer, processing the resulting vector with one pooling layer and inputting the pooled vector into the next convolutional layer for processing.
Specifically, after the audio vector has been processed by the first convolutional layer to form the first convolutional feature data, it can first be processed by one pooling layer for dimensionality reduction; the reduced first convolutional feature data is input into the second convolutional layer to form the second convolutional feature data, which is then processed by the other pooling layer for dimensionality reduction; and the reduced second convolutional feature data is input into the third convolutional layer for processing to obtain the final convolutional-layer feature data.
On the basis of the method steps shown in Fig. 2, in one example, Fig. 3 shows the data processing of the parameter recognition model in more detail. In the method steps shown in Fig. 3, step 101 may include the following sub-step:
Sub-step 1011: Encode the speech signal of the facial action to be recognized with a Deep Speech model, encoding every 32 frames of the speech signal into one 29-dimensional vector as an audio vector.
Specifically, the continuous speech signal to be recognized is encoded with a pre-trained Deep Speech model; in chronological order, every 32 speech signal frames are encoded into one audio vector, and each audio vector has 29 dimensions.
In addition, decentering (mean removal) and normalization can further be applied to the encoded audio vectors to obtain optimized audio vectors.
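A minimal Python sketch of sub-step 1011 follows. It assumes that the per-frame output of a pre-trained Deep Speech model is already a 29-dimensional vector and that one audio vector summarizes a window of 32 consecutive frames by averaging; the averaging and the normalization constants are assumptions, since the text does not fix how the 32 frames are combined into one vector.

    import numpy as np

    def to_audio_vectors(frame_features):
        # frame_features: (num_frames, 29) per-frame Deep Speech outputs
        n_windows = len(frame_features) // 32
        windows = frame_features[:n_windows * 32].reshape(n_windows, 32, 29)
        vectors = windows.mean(axis=1)                 # one 29-dimensional audio vector per 32 frames
        vectors = vectors - vectors.mean(axis=0)       # decentering (mean removal)
        return vectors / (vectors.std(axis=0) + 1e-8)  # normalization

    audio_vectors = to_audio_vectors(np.random.rand(3200, 29))  # -> shape (100, 29)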
On the basis of sub-step 1011, step 102 may specifically include the following sub-steps.
Sub-step 1022: Extract n audio vectors at a time from the audio vectors and process them with the first convolutional layer to obtain the first convolutional feature data; the first convolutional layer contains 32 convolution kernels with a kernel size of 3.
Specifically, assuming the number of audio vectors input into the parameter recognition model each time is n (n is an integer greater than 0) and the first convolutional layer contains 32 convolution kernels of size 3, the input vector dimension is n×29 and the output vector dimension is n×32×29.
Sub-step 1023: Process the first convolutional feature data with the first pooling layer to obtain the first pooled feature data; the size of the first pooling layer is 2.
Specifically, the output vector dimension of the first convolutional feature data after the first pooling layer is n×32×15.
Sub-step 1024: Process the first pooled feature data with the second convolutional layer to obtain the second convolutional feature data; the second convolutional layer contains 64 convolution kernels with a kernel size of 3.
Specifically, the output vector dimension after the first pooled feature data is processed by the second convolutional layer is n×64×15.
Sub-step 1025: Process the second convolutional feature data with the second pooling layer to obtain the second pooled feature data; the size of the second pooling layer is 2.
Specifically, the output vector dimension after the second convolutional feature data is processed by the second pooling layer is n×64×8.
Sub-step 1026: Process the second pooled feature data with the third convolutional layer to obtain the third convolutional feature data; the third convolutional layer contains 128 convolution kernels with a kernel size of 4.
Specifically, the output vector dimension after the second pooled feature data is processed by the third convolutional layer (a deep convolutional layer) is n×128×8, which is then flattened into an n×1024-dimensional vector.
Sub-step 1027: Classify the third convolutional feature data through the two fully connected layers in sequence, and output n vectors each with 28 dimensions, each vector being one group of facial muscle movement parameters corresponding to the facial action to be recognized.
Specifically, the third convolutional feature data is fed into the fully connected layers. The input vector of the first fully connected layer is n×1024 and its output vector dimension is n×256; the output vector dimension of the second fully connected layer is n×28.
In addition, to reduce overfitting, a ReLU activation function and a dropout layer with a dropout probability of 0.25 can further be added to the parameter recognition model of this embodiment. A loss function is then constructed on the predicted values; the loss function is quadratic, built from the squared difference between the predicted vector and the ground-truth vector.
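A minimal PyTorch sketch of the parameter recognition network described above (three Conv1d layers, two pooling layers, two fully connected layers) is given below. The padding values, the ceil_mode pooling, and the placement of the ReLU and dropout layers are assumptions chosen so that the intermediate sizes match the dimensions stated in the text (29 -> 15 -> 8, flattened to 1024).

    import torch
    import torch.nn as nn

    class ParameterRecognitionModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=3, padding=1),         # n x 32 x 29
                nn.ReLU(),
                nn.MaxPool1d(2, ceil_mode=True),                    # n x 32 x 15
                nn.Conv1d(32, 64, kernel_size=3, padding=1),        # n x 64 x 15
                nn.ReLU(),
                nn.MaxPool1d(2, ceil_mode=True),                    # n x 64 x 8
                nn.Conv1d(64, 128, kernel_size=4, padding='same'),  # n x 128 x 8
                nn.ReLU(),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),                  # n x 1024
                nn.Linear(1024, 256),
                nn.ReLU(),
                nn.Dropout(p=0.25),
                nn.Linear(256, 28),            # 14 displacement + 14 velocity parameters
            )

        def forward(self, x):                  # x: a batch of n 29-dimensional audio vectors
            x = x.unsqueeze(1)                 # treat each audio vector as a 1-channel sequence
            return self.classifier(self.features(x))

    out = ParameterRecognitionModel()(torch.randn(8, 29))
    print(out.shape)                           # torch.Size([8, 28])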
In one example, the facial muscle movement parameters may include movement displacement parameters of the facial muscles and movement velocity parameters of the facial muscles, where a movement velocity parameter is the parameter increment between two adjacent groups of movement displacement parameters (the later movement displacement parameter minus the preceding one). The movement displacement parameters and the movement velocity parameters therefore have the same dimensionality. In the 28-dimensional facial muscle movement parameters above, the first 14 dimensions are movement displacement parameters and the last 14 dimensions are movement velocity parameters.
Accordingly, the loss function used when training the parameter recognition model may be formed from the movement displacement loss and the movement velocity loss of the facial muscles;
where the loss function used when training the parameter recognition model is:
E_total = a_1·E_p + a_2·E_v    (1)
where E_p is the movement displacement loss, E_v is the movement velocity loss, and a_j (j = 1, 2) is the weight of the corresponding loss term.
In this embodiment, optimizing this objective function not only requires minimizing the difference between the movement displacement parameters output by the neural network and the corresponding target parameters, but also further requires minimizing the difference of their first-order differentials (after discretization, the first differences, i.e., the movement velocity parameters); this requirement reflects the motion similarity of facial actions. By adjusting the weights between the movement displacement loss and the movement velocity loss, the facial actions achieve both static similarity and dynamic similarity.
In one example, the movement displacement loss E_p is calculated with the following formula:
E_p = ||y_i - f_i||^2    (2)
where y_i is the movement displacement information of the reference facial muscles corresponding to the i-th sample audio vector, obtained by inputting the i-th sample audio vector into the annotation algorithm, f_i is the movement displacement information of the facial muscles of the i-th sample audio vector predicted by the neural network during training of the parameter recognition model, and i is an integer greater than 0.
The movement velocity loss E_v is calculated with the following formula:
E_v = ||(y_i - y_{i-1}) - (f_i - f_{i-1})||^2    (3)
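A minimal PyTorch sketch of equations (1) to (3) follows, assuming the batch dimension indexes consecutive samples i, that the first 14 output dimensions are the displacement parameters, and that the mean over the batch is used as the reduction; a1 and a2 are the loss-term weights.

    import torch

    def muscle_motion_loss(f, y, a1=1.0, a2=1.0):
        # f: predicted movement displacement parameters, shape (n, 14)
        # y: labelled movement displacement parameters,  shape (n, 14)
        e_p = ((y - f) ** 2).sum(dim=1).mean()      # E_p = ||y_i - f_i||^2
        dy, df = y[1:] - y[:-1], f[1:] - f[:-1]     # discretized first-order differentials
        e_v = ((dy - df) ** 2).sum(dim=1).mean()    # E_v = ||(y_i - y_{i-1}) - (f_i - f_{i-1})||^2
        return a1 * e_p + a2 * e_v                  # E_total = a1*E_p + a2*E_v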
In this embodiment, training the parameter recognition model with a movement displacement loss function that describes the static facial expression constraint and a movement velocity loss function that describes the dynamic facial expression constraint plays an important role in making the virtual character's speech look realistic.
The third embodiment of the present invention relates to a voice-driven facial action synthesis method. The third embodiment is an improvement of the first and second embodiments; the improvement lies in that the movement directions of the corner points on the elastic bodies in the face model are set in advance to correspond to the facial muscle movement parameters, so that the movement state of the corresponding corner points in the face model is controlled according to the facial muscle movement parameters. As shown in Fig. 4, step 103 specifically includes the following sub-steps:
Sub-step 1031: Determine the corner points on the elastic bodies corresponding to the facial muscle movement parameters.
To associate the facial muscle movement effect with the action effect of the face model, the facial muscles can be associated with the corner points of the three-dimensional face model. In practical application scenarios, face models built in different scenes have different numbers of corner points; when controlling facial actions for different face models, the correspondence between the facial muscle movement parameters and the corner points on the elastic bodies at the corresponding muscle positions in the face model should be established in advance. For example, after the face model is determined, the regions controlled by the different facial muscles can first be divided according to the distribution of the facial muscles, and the corner points contained in each region are the corner points corresponding to that facial muscle.
For example, taking the right smile muscle, the facial corner points near the lips that are affected by the smile muscle are circled as the corner points corresponding to the smile-muscle movement parameter.
Sub-step 1032: Determine the movement direction of the corner points on the elastic body.
A vector along the direction of the muscle fibers is defined in advance for the corner points corresponding to each facial muscle movement parameter; this vector is the movement direction of those corner points.
Sub-step 1033: Control the corner points on the corresponding elastic body to move along the movement direction according to the facial muscle movement parameters.
When a facial muscle contracts (producing a group of facial muscle movement parameters), the corner points corresponding to that facial muscle move in the predefined movement direction, and the magnitude of the movement is the muscle contraction parameter, i.e., the facial muscle movement parameter.
For example, Table 1 below shows the facial muscle movement parameters corresponding to a facial action output by the parameter recognition model in one calculation.
Table 1. Facial muscle movement parameters
[Table 1 is provided as an image in the original publication and is not reproduced here.]
For example, in the face model, the left of the human body is predefined as the x direction, up as the y direction, and front as the z direction, and the movement direction selected for the corner points corresponding to the smile muscle is (-1, 1, 0.8). The same applies to the other muscles: with reference to the muscles of the human face, the corresponding corner points are selected and their movement directions are predefined along the direction of the muscle fibers. When a speech signal is input, the movement displacement and movement velocity of the corresponding corner points are determined from the facial muscle movement parameters calculated by the parameter recognition model, and the corresponding corner points in the face model are contracted along the predefined movement directions to obtain the corresponding facial action (expression) effect.
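A minimal Python sketch of sub-steps 1031 to 1033 follows: each facial muscle owns a circled set of face-model corner points and a predefined movement direction along the muscle fibers, and a muscle movement parameter moves those corner points along that direction. The vertex indices, the rest pose, and the contraction value are illustrative assumptions; only the direction (-1, 1, 0.8) for the smile muscle comes from the text.

    import numpy as np

    rest_vertices = np.random.rand(5000, 3)      # any face model; the number of corner points is not restricted

    muscle_corner_points = {                     # corner points circled for each muscle (assumed indices)
        "right_smile_muscle": np.array([101, 102, 103, 250, 251]),
    }
    muscle_directions = {                        # predefined movement direction along the muscle fibers
        "right_smile_muscle": np.array([-1.0, 1.0, 0.8]),
    }

    def apply_muscle_parameters(vertices, params):
        # params: muscle name -> contraction displacement output by the parameter recognition model
        posed = vertices.copy()
        for name, displacement in params.items():
            direction = muscle_directions[name]
            direction = direction / np.linalg.norm(direction)  # unit vector along the fibers
            posed[muscle_corner_points[name]] += displacement * direction
        return posed

    posed_vertices = apply_muscle_parameters(rest_vertices, {"right_smile_muscle": 0.012})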
Compared with the prior art, this embodiment processes the speech signal of the facial action to be recognized to obtain the corresponding audio vector, and inputs the audio vector into the parameter recognition model for processing; the parameter recognition model is obtained after training on sample audio vectors and predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and the loss function used when training it is formed from a facial muscle movement loss, so that the parameter recognition model builds the correspondence between the speech signal and the movement parameters of the facial muscles. Through this correspondence, the speech signal of the facial action to be recognized is converted into facial muscle movement parameters, which are used to control the movement of the corner points on the plurality of elastic bodies in the face model divided according to the facial muscle distribution, yielding the result of the facial action to be recognized. In this solution, the direct construction of a relationship between the speech signal and specific corner points of the face model used in traditional voice-driven facial action (e.g., lip shape) models is abandoned; instead, the correspondence between the speech signal and the facial muscle movement parameters is first established through the parameter recognition model, and the facial muscle movement parameters are then associated with the corner point movements on the elastic bodies divided according to the facial muscle distribution, so that corner point movement is controlled based on the facial muscle movement parameters. Because facial actions are simulated through facial muscle movement, which follows biological behavior, the output actions are vivid and lifelike. And because corner point movement is controlled based on the facial muscle movement parameters, there is no restriction on the number of corner points, so the method is applicable to face models with different numbers of corner points and has good portability.
The fourth embodiment of the present invention relates to an electronic device, as shown in Fig. 5, including at least one processor 202 and a memory 201 communicatively connected to the at least one processor 202; the memory 201 stores instructions executable by the at least one processor 202, and the instructions are executed by the at least one processor 202 so that the at least one processor 202 can perform any of the above method embodiments.
The memory 201 and the processor 202 are connected by a bus, and the bus may include any number of interconnected buses and bridges; the bus connects one or more processors 202 and the various circuits of the memory 201 together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore not further described herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor 202 is transmitted over a wireless medium through an antenna, and the antenna also receives data and transfers it to the processor 202.
The processor 202 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 201 may be used to store data used by the processor 202 when performing operations.
The fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, any of the above method embodiments is implemented.
The sixth embodiment of the present invention relates to a computer program. When the computer program is executed by a processor, any of the above method embodiments is implemented.
That is, those skilled in the art will understand that all or some of the steps of the methods in the above embodiments can be completed by instructing the relevant hardware with a program stored in a storage medium, which includes several instructions for causing a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will understand that the above embodiments are specific examples for implementing the present invention, and that in practical applications various changes in form and detail may be made thereto without departing from the spirit and scope of the present invention.

Claims (11)

  1. A voice-driven facial action synthesis method, characterized in that it comprises:
    processing a speech signal of a facial action to be recognized to obtain an audio vector corresponding to the speech signal;
    inputting the audio vector into a parameter recognition model for processing, and outputting facial muscle movement parameters corresponding to the facial action to be recognized;
    wherein the parameter recognition model is obtained after training based on sample audio vectors and predetermined facial muscle movement parameter labels corresponding to each sample audio vector, and a loss function used when training the parameter recognition model is formed based on a facial muscle movement loss; and
    controlling, according to the facial muscle movement parameters of the facial action to be recognized, movement of corner points on a plurality of elastic bodies in a face model divided according to a distribution of facial muscles, to obtain a result of the facial action to be recognized.
  2. The method according to claim 1, characterized in that the parameter recognition model is a neural network model comprising three convolutional layers and two fully connected layers; and the inputting the audio vector into the parameter recognition model for processing and outputting the facial muscle movement parameters corresponding to the facial action to be recognized comprises:
    passing the audio vector through the three convolutional layers in sequence for sample-space feature extraction to obtain convolutional-layer feature data; and
    classifying the convolutional-layer feature data through the two fully connected layers in sequence, and then outputting the facial muscle movement parameters corresponding to the facial action to be recognized.
  3. The method according to claim 2, characterized in that the neural network model further comprises two pooling layers; and
    the passing the audio vector through the three convolutional layers in sequence for sample-space feature extraction to obtain the convolutional-layer feature data comprises:
    when the audio vector is processed by the first two convolutional layers in sequence, after processing by each of the convolutional layers, processing the processed audio vector with one of the pooling layers, and inputting the audio vector processed by the pooling layer into the next convolutional layer for processing.
  4. The method according to claim 3, characterized in that the processing the speech signal of the facial action to be recognized to obtain the audio vector corresponding to the speech signal comprises:
    encoding the speech signal of the facial action to be recognized with a deep speech model, and encoding every 32 frames of the speech signal into one 29-dimensional vector as one of the audio vectors; and
    the inputting the audio vector into the parameter recognition model for processing and outputting the facial muscle movement parameters corresponding to the facial action to be recognized comprises:
    extracting n audio vectors at a time from the audio vectors, and processing them with a first convolutional layer to obtain first convolutional feature data, the first convolutional layer containing 32 convolution kernels with a kernel size of 3;
    processing the first convolutional feature data with a first pooling layer to obtain first pooled feature data, the first pooling layer having a size of 2;
    processing the first pooled feature data with a second convolutional layer to obtain second convolutional feature data, the second convolutional layer containing 64 convolution kernels with a kernel size of 3;
    processing the second convolutional feature data with a second pooling layer to obtain second pooled feature data, the second pooling layer having a size of 2;
    processing the second pooled feature data with a third convolutional layer to obtain third convolutional feature data, the third convolutional layer containing 128 convolution kernels with a kernel size of 4; and
    classifying the third convolutional feature data through the two fully connected layers in sequence, and then outputting n vectors each having 28 dimensions, each vector being one group of the facial muscle movement parameters corresponding to the facial action to be recognized.
  5. The method according to claim 1, characterized in that the facial muscle movement parameters comprise movement displacement parameters of facial muscles and movement velocity parameters of facial muscles, the movement velocity parameters being parameter increments between two adjacent groups of movement displacement parameters; and the loss function used when training the parameter recognition model is formed based on a movement displacement loss and a movement velocity loss of the facial muscles;
    wherein the loss function used when training the parameter recognition model is E_total = a_1·E_p + a_2·E_v, where E_p is the movement displacement loss, E_v is the movement velocity loss, and a_j (j = 1, 2) is the weight of the corresponding loss term.
  6. The method according to claim 5, characterized in that
    the movement displacement loss E_p is calculated with the following formula:
    E_p = ||y_i - f_i||^2,
    where y_i is movement displacement information of reference facial muscles corresponding to an i-th sample audio vector, obtained by inputting the i-th sample audio vector into an annotation algorithm, f_i is movement displacement information of facial muscles of the i-th sample audio vector predicted by the neural network during training of the parameter recognition model, and i is an integer greater than 0; and
    the movement velocity loss E_v is calculated with the following formula:
    E_v = ||(y_i - y_{i-1}) - (f_i - f_{i-1})||^2.
  7. The method according to claim 1, characterized in that the controlling, according to the facial muscle movement parameters of the facial action to be recognized, the movement of the corner points on the plurality of elastic bodies in the face model divided according to the distribution of facial muscles, to obtain the result of the facial action to be recognized, comprises:
    determining the corner points on the elastic bodies corresponding to the facial muscle movement parameters;
    determining a movement direction of the corner points on the elastic bodies; and
    controlling, by the facial muscle movement parameters, the corresponding corner points on the elastic bodies to move along the movement direction.
  8. The method according to any one of claims 1 to 7, characterized in that the facial muscle movement parameters comprise a contraction parameter, along a muscle-fiber contraction direction, of at least one of the following muscles:
    the left and right frontalis, left and right corrugator supercilii, left and right orbicularis oculi, left and right levator labii superioris alaeque nasi, left and right orbicularis oris, left and right depressor labii inferioris, and left and right risorius (smile muscles).
  9. An electronic device, characterized in that it comprises:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform the voice-driven facial action synthesis method according to any one of claims 1 to 8.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the voice-driven facial action synthesis method according to any one of claims 1 to 8.
  11. A computer program, characterized in that the computer program, when executed by a processor, implements the voice-driven facial action synthesis method according to any one of claims 1 to 8.
PCT/CN2021/137489 2021-06-25 2021-12-13 基于语音驱动的人脸动作合成方法、电子设备及存储介质 WO2022267380A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110712277.7A CN113408449B (zh) 2021-06-25 2021-06-25 基于语音驱动的人脸动作合成方法、电子设备及存储介质
CN202110712277.7 2021-06-25

Publications (1)

Publication Number Publication Date
WO2022267380A1 true WO2022267380A1 (zh) 2022-12-29

Family

ID=77679655

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137489 WO2022267380A1 (zh) 2021-06-25 2021-12-13 基于语音驱动的人脸动作合成方法、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN113408449B (zh)
WO (1) WO2022267380A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912373A (zh) * 2023-05-23 2023-10-20 苏州超次元网络科技有限公司 一种动画处理方法和系统

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408449B (zh) * 2021-06-25 2022-12-06 达闼科技(北京)有限公司 基于语音驱动的人脸动作合成方法、电子设备及存储介质
CN114697568B (zh) * 2022-04-07 2024-02-20 脸萌有限公司 特效视频确定方法、装置、电子设备及存储介质
CN115100329B (zh) * 2022-06-27 2023-04-07 太原理工大学 基于多模态驱动的情感可控面部动画生成方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218842A (zh) * 2013-03-12 2013-07-24 西南交通大学 一种语音同步驱动三维人脸口型与面部姿势动画的方法
CN109523616A (zh) * 2018-12-04 2019-03-26 科大讯飞股份有限公司 一种面部动画生成方法、装置、设备及可读存储介质
CN112614212A (zh) * 2020-12-16 2021-04-06 上海交通大学 联合语气词特征的视音频驱动人脸动画实现方法及系统
CN112907706A (zh) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 基于多模态的声音驱动动漫视频生成方法、装置及系统
CN113408449A (zh) * 2021-06-25 2021-09-17 达闼科技(北京)有限公司 基于语音驱动的人脸动作合成方法、电子设备及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258340B (zh) * 2013-04-17 2015-12-09 中国科学技术大学 富有情感表达能力的三维可视化中文普通话发音词典的发音方法
US11568864B2 (en) * 2018-08-13 2023-01-31 Carnegie Mellon University Processing speech signals of a user to generate a visual representation of the user
CN110866968A (zh) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 基于神经网络生成虚拟人物视频的方法及相关设备
CN112215926A (zh) * 2020-09-28 2021-01-12 北京华严互娱科技有限公司 一种语音驱动的人脸动作实时转移方法和系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218842A (zh) * 2013-03-12 2013-07-24 西南交通大学 一种语音同步驱动三维人脸口型与面部姿势动画的方法
CN109523616A (zh) * 2018-12-04 2019-03-26 科大讯飞股份有限公司 一种面部动画生成方法、装置、设备及可读存储介质
CN112614212A (zh) * 2020-12-16 2021-04-06 上海交通大学 联合语气词特征的视音频驱动人脸动画实现方法及系统
CN112907706A (zh) * 2021-01-31 2021-06-04 云知声智能科技股份有限公司 基于多模态的声音驱动动漫视频生成方法、装置及系统
CN113408449A (zh) * 2021-06-25 2021-09-17 达闼科技(北京)有限公司 基于语音驱动的人脸动作合成方法、电子设备及存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912373A (zh) * 2023-05-23 2023-10-20 苏州超次元网络科技有限公司 一种动画处理方法和系统
CN116912373B (zh) * 2023-05-23 2024-04-16 苏州超次元网络科技有限公司 一种动画处理方法和系统

Also Published As

Publication number Publication date
CN113408449A (zh) 2021-09-17
CN113408449B (zh) 2022-12-06

Similar Documents

Publication Publication Date Title
WO2022267380A1 (zh) 基于语音驱动的人脸动作合成方法、电子设备及存储介质
Hong et al. Real-time speech-driven face animation with expressions using neural networks
KR101558202B1 (ko) 아바타를 이용한 애니메이션 생성 장치 및 방법
Sifakis et al. Simulating speech with a physics-based facial muscle model
US8224652B2 (en) Speech and text driven HMM-based body animation synthesis
Fan et al. A deep bidirectional LSTM approach for video-realistic talking head
Deng et al. Expressive facial animation synthesis by learning speech coarticulation and expression spaces
CN111045582B (zh) 一种个性化虚拟人像活化互动系统及方法
CN115116109B (zh) 虚拟人物说话视频的合成方法、装置、设备及存储介质
JP2023545642A (ja) 目標対象の動作駆動方法、装置、機器及びコンピュータプログラム
US20060009978A1 (en) Methods and systems for synthesis of accurate visible speech via transformation of motion capture data
WO2021196643A1 (zh) 交互对象的驱动方法、装置、设备以及存储介质
Ma et al. Styletalk: One-shot talking head generation with controllable speaking styles
WO2023284435A1 (zh) 生成动画的方法及装置
CN111243065B (zh) 一种语音信号驱动的脸部动画生成方法
CN110910479B (zh) 视频处理方法、装置、电子设备及可读存储介质
CN113781610A (zh) 一种虚拟人脸的生成方法
CN113228163A (zh) 基于文本和音频的实时面部再现
CN115953521B (zh) 远程数字人渲染方法、装置及系统
CN116597857A (zh) 一种语音驱动图像的方法、系统、装置及存储介质
CN111939558A (zh) 一种实时语音驱动虚拟人物动作的方法和系统
CN116400806A (zh) 个性化虚拟人的生成方法及系统
CN108908353B (zh) 基于平滑约束逆向机械模型的机器人表情模仿方法及装置
CN116665695B (zh) 虚拟对象口型驱动方法、相关装置和介质
Tang et al. Real-time conversion from a single 2D face image to a 3D text-driven emotive audio-visual avatar

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946854

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE