WO2024124680A1 - Speech-signal-driven personalized three-dimensional facial animation generation method and application thereof - Google Patents

Speech-signal-driven personalized three-dimensional facial animation generation method and application thereof - Download PDF

Info

Publication number
WO2024124680A1
WO2024124680A1 (PCT/CN2023/075515)
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
feature
speech
personalized
style
Prior art date
Application number
PCT/CN2023/075515
Other languages
English (en)
French (fr)
Inventor
周昆 (Kun Zhou)
柴宇进 (Yujin Chai)
翁彦琳 (Yanlin Weng)
邵天甲 (Tianjia Shao)
Original Assignee
浙江大学 (Zhejiang University)
杭州相芯科技有限公司 (Hangzhou Faceunity Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江大学 (Zhejiang University) and 杭州相芯科技有限公司 (Hangzhou Faceunity Technology Co., Ltd.)
Publication of WO2024124680A1 publication Critical patent/WO2024124680A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • G10L2021/105Synthesis of the lips movements from speech, e.g. for talking heads

Definitions

  • the present invention relates to the field of facial animation, and in particular to a method for generating personalized three-dimensional facial animation driven by a voice signal and an application thereof.
  • Traditional procedural, speech-driven personalized facial animation techniques (e.g., Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. JALI: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics (TOG), 35(4):127, 2016.) automatically identify phoneme sequences that reflect pronunciation (such as syllables in English and pinyin in Chinese) from the speech signal, group the phonemes into visemes according to the lip shapes humans form during pronunciation, and create personalized animation keyframes of the target person for each viseme; the entire sequence is then connected by manually formulated rules to obtain a coherent personalized facial animation.
  • These technologies require animation keyframes to be created for each target person, which requires a lot of repeated manual work; and the quality of the generated animation is usually limited by the accuracy of phoneme recognition and the rationality of the manually formulated rules.
  • In recent years, some techniques use deep neural networks (DNNs) to generate high-quality personalized facial animation for a target person from speech signals; for example, Taylor et al. (Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 36(4):93, 2017.) collected frontal speech videos of more than 2000 sentences for one target person and trained a dedicated deep neural network that maps the phoneme sequence of the speech to a sequence of Active Appearance Model (AAM) coefficients of the face.
  • Suwajanakorn et al. (Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.) collected dozens of hours of speech videos of Barack Obama to train a dedicated Long Short-Term Memory (LSTM) network that maps the Mel-scale Frequency Cepstral Coefficients (MFCC) sequence of the speech signal to the motion trajectories of lip key points.
  • These technologies rely on style control methods to generate personalized facial animations for a specific target person. Although these technologies distinguish the personalized styles of different characters, they do not explicitly distinguish the personalized style information and speech content pronunciation action information within each character’s data, resulting in the inability of the trained network model to accurately learn the personalized style of the character.
  • the purpose of the present invention is to address the deficiencies of the prior art and provide a method for generating personalized three-dimensional facial animation driven by a voice signal.
  • a speech signal driven personalized 3D facial animation generation method reconstructs a 3D facial action sequence for a target person's frontal speech video, and extracts a speech feature sequence from the speech signal of the video; the reconstructed 3D facial action sequence is decomposed into a content feature sequence and a personalized style feature through a decoupling network, wherein the content feature sequence contains the necessary action information required for the pronunciation of the speech content in the 3D facial action, and the personalized style feature contains the style information reflecting the personality of the person in the 3D facial action; at the same time, the decomposed personalized style features are combined with the extracted speech feature sequence through another speech animation network to generate a personalized 3D facial animation.
  • the present invention uses the existing technology to reconstruct a three-dimensional face action sequence from the video, and uses the existing speech recognition technology to extract a speech feature sequence from the speech signal of the video.
  • the present invention decomposes the reconstructed three-dimensional face action sequence into two parts: a content feature sequence and a personalized style feature through a deep neural network (called a decoupling network), wherein the content feature sequence contains the necessary action information required for the pronunciation of the speech content in the three-dimensional face action, and the personalized style feature contains the style information reflecting the personality of the person in the three-dimensional face action; and the present invention combines the decomposed personalized style features with the extracted speech feature sequence through another deep neural network (called a speech animation network) to generate a personalized three-dimensional face animation.
  • a speech animation network another deep neural network
  • a method for generating a personalized three-dimensional face animation driven by a voice signal comprising the following steps:
  • each frame of the target person's frontal face speech video is reconstructed in three dimensions, and the head movement is removed to obtain the target person's three-dimensional face model template and three-dimensional face action sequence;
  • the model template is a two-dimensional tensor composed of vertex dimension and space dimension;
  • the three-dimensional face action sequence is a vertex offset sequence relative to the model template, and is a three-dimensional tensor composed of sequence dimension, vertex dimension and space dimension; extracting the speech signal from the given video.
  • Obtaining auxiliary character data: auxiliary character data are obtained from an existing public voice-synchronized 3D facial animation database, wherein the data of each auxiliary character include a 3D facial model template, a 3D facial action sequence, and a synchronized voice signal; the voice-synchronized 3D facial animation database does not contain the 3D data of the target character.
  • step (3) Extracting speech feature sequence: For the speech signals obtained in step (1) and step (2), use existing speech recognition technology to extract speech feature sequence; the speech feature sequence is a three-dimensional tensor composed of sequence dimension, window dimension, and feature map dimension.
  • a facial action sequence is decomposed into a content feature sequence and a personalized style feature
  • the content feature sequence is a two-dimensional tensor composed of a sequence dimension and a feature map dimension, and contains necessary action information required for pronunciation of speech content in three-dimensional facial actions
  • the personalized style feature is a one-dimensional tensor composed of a feature map dimension, and contains style information reflecting the personality of a character in three-dimensional facial actions
  • the speech animation network combines the decomposed personalized style features with the speech feature sequence to output a personalized three-dimensional facial action sequence.
  • Obtaining the personalized style feature of the target person: for the three-dimensional facial action sequence of the target person obtained in step (1), the personalized style feature of the target person is decomposed using the decoupling network trained in step (4).
  • Generating a personalized three-dimensional facial animation synchronized with speech: extract a speech feature sequence from any input speech signal using the same method as in step (3); use the speech animation network trained in step (4) to combine the extracted speech feature sequence with the personalized style feature of the target person obtained in step (5) to output a personalized three-dimensional facial action sequence; add the obtained three-dimensional facial action sequence to the three-dimensional facial model template of the target person obtained in step (1) to obtain a personalized three-dimensional facial animation; the personalized three-dimensional facial animation is synchronized with the input speech and has the personalized style of the target person.
  • step (4) comprises the following sub-steps:
  • a deep neural network is trained: a decoupled network; the decoupled network is composed of a content encoder, a style encoder, and an action decoder.
  • the content encoder first applies three spiral convolutions to each frame in the three-dimensional face action sequence; after each spiral convolution the vertices are downsampled and activated using a leaky rectified linear unit with a negative slope of 0.2; then all vertex features after the three spiral convolutions are concatenated into a one-dimensional vector and mapped to the content feature through a linear matrix; after all frames in the three-dimensional face action sequence are mapped, a content feature sequence is obtained; the content feature sequence is a two-dimensional tensor composed of a sequence dimension and a feature map dimension.
  • the style encoder applies to each frame in the three-dimensional face action sequence the same three spiral convolutions, vertex downsampling, activation and subsequent linear mapping operations as the aforementioned content encoder, but uses different parameters to map each frame to an intermediate style feature; after all frames in the three-dimensional face action sequence are mapped to the intermediate style feature sequence, a standard long short-term memory unit processes the intermediate style feature sequence recurrently to obtain the personalized style feature; the personalized style feature is a one-dimensional vector composed of a feature map dimension.
  • the action decoder applies three one-dimensional convolutions to the content feature sequence obtained by the content encoder; before each convolution, the personalized style feature obtained by the style encoder is concatenated with each input frame feature, and the front of the sequence is padded with zero feature vectors to ensure that the sequence length after the convolution remains unchanged; after each convolution, a leaky rectified linear unit with a negative slope of 0.2 is applied; the result is then mapped through five fully connected layers to output a personalized three-dimensional face action sequence.
  • the training process uses a standard Adam optimizer to optimize the trainable parameters in the network to minimize the decoupled objective function;
  • the decoupled objective function includes: a reconstruction term, a style exchange term, and a cycle-consistent term;
  • the reconstruction term uses a content encoder and a style encoder to encode the three-dimensional facial action sequence obtained in step (1) and step (2) into a content feature sequence and a personalized style feature, and uses the original data to supervise the action decoder to decode and output the personalized three-dimensional facial action sequence from the content feature sequence and the personalized style feature;
  • the style exchange term uses a content encoder and a style encoder to encode the three-dimensional facial action sequence obtained in step (1) and step (2) into a content feature sequence and a personalized style feature, and then exchanges the personalized style features of any two sequence data, so that they are combined with content feature sequences from different sources and output the personalized three-dimensional facial action sequence after the personalized style feature exchange through the action decoder, and the style exchange term supervises the output;
  • Using the speech feature sequence obtained in step (3) and the personalized style features decomposed by the decoupling network in step (4.1), another deep neural network, called the speech animation network, is trained. This step is performed simultaneously with step (4.1).
  • the speech animation network consists of a speech encoder and an action decoder.
  • for each frame's feature window in the speech feature sequence, the speech encoder uses the entire window as the source and the middle frame of the window as the query and encodes it with a standard Transformer network; all frames in the sequence are encoded to obtain an encoded speech feature sequence; the encoded speech feature sequence is a two-dimensional tensor consisting of a sequence dimension and a feature map dimension.
  • the action decoder applies three one-dimensional convolutions to the encoded speech feature sequence; before each convolution, the personalized style features decomposed in step (4.1) are concatenated with each input frame feature, and the front of the sequence is padded with zero feature vectors to ensure that the sequence length after the convolution remains unchanged; after each convolution, a leaky rectified linear unit with a negative slope of 0.2 is applied; the result is then mapped through five fully connected layers to output a personalized three-dimensional face action sequence.
  • the action decoder is exactly the same as the action decoder in the decoupled network in step (4.1) except for the input, that is, the decoupled network in step (4.1) and the speech animation network in this step share the same action decoder.
  • the training process uses a standard Adam optimizer to optimize the trainable parameters in the network to minimize the speech animation objective function;
  • the speech animation objective function includes: a speech animation reconstruction term, a speech animation style exchange term, and a speech animation cycle consistency term;
  • the speech animation reconstruction term is computed in the same way as the reconstruction term in step (4.1), except that the output of the decoupling network is replaced by the corresponding output of the speech animation network;
  • the speech animation style exchange term is computed in the same way as the style exchange term in step (4.1), except that the output of the decoupling network is replaced by the corresponding output of the speech animation network;
  • the speech animation cycle consistency term is computed in the same way as the cycle consistency term in step (4.1), except that the output of the decoupling network is replaced by the corresponding output of the speech animation network.
  • the present invention discloses a method for generating personalized three-dimensional facial animation driven by voice signals: given a frontal speech video of the target person of about one minute, the personalized three-dimensional facial action style of that person can be learned, and a voice-synchronized three-dimensional facial animation with the personalized style of the person can be generated for any input voice signal; the quality of the generated animation matches the current state of the art in voice-signal-driven personalized three-dimensional facial animation.
  • the method is mainly divided into six steps: processing the target person video data, obtaining auxiliary person data, extracting voice feature sequences, training a deep neural network, obtaining the personalized style features of the target person, and generating voice-synchronized personalized three-dimensional facial animation.
  • step (2) of obtaining the auxiliary person data only needs to be executed once, and under the premise that the amount of target person video data is small (only about one minute), the auxiliary person data can effectively expand the data volume, which is conducive to the execution of the subsequent step (4).
  • a decoupling network is trained to explicitly decompose the three-dimensional facial action sequence into a content feature sequence and a personalized style feature, so that the personalized style feature of the target person obtained in step (5) can accurately reflect the personalized style information of the target person without being affected by the pronunciation of the speech content;
  • another speech animation network trained in step (4) can combine the personalized style feature and the speech feature sequence, so that the personalized three-dimensional facial animation generated in step (6) can accurately reflect the personalized style of the target person and maintain synchronization with the input speech.
  • the present invention can be used for personalized three-dimensional facial animation generation tasks driven by voice signals in different scenarios, such as VR virtual social interaction, virtual voice assistants, and games.
  • Fig. 1 is a schematic flow chart of the method of the present invention
  • FIG2 is a schematic diagram of the computation flow of the reconstruction term in sub-step (4.1) of step (4) in the method of the present invention
  • FIG3 is a schematic diagram of the computation flow of the style exchange term in sub-step (4.1) of step (4) in the method of the present invention
  • FIG4 is a schematic diagram of the computation flow of the cycle consistency term in sub-step (4.1) of step (4) in the method of the present invention.
  • FIG5 is an excerpt of animation key frames generated by driving personalized three-dimensional facial animation of different target persons with input voice signals in an embodiment of the present invention
  • the core technology of the present invention trains a deep neural network (decoupling network) to decompose the three-dimensional face action into a content feature sequence and a personalized style feature, and trains another deep neural network (speech animation network) to combine the decomposed personalized style features with the speech feature sequence extracted from the speech signal and output a personalized three-dimensional face action synchronized with speech.
  • the method is mainly divided into six steps: processing the target person video data, obtaining the auxiliary person data, extracting the speech feature sequence, training the deep neural network, obtaining the personalized style features of the target person, and generating a personalized three-dimensional face animation synchronized with speech.
  • Processing the target person's video data: for each frame of the target person's frontal speech video, use existing 3D morphable face model technology (e.g., FLAME, URL: https://flame.is.tue.mpg.de/, reference: Tianye Li, Timo Bolkart, Michael J Black, Hao Li and Javier Romero. FLAME: Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph., 36(6):194:1-194:17, 2017) to perform 3D reconstruction and remove all head movements to obtain the target person's 3D face model template I 0 and 3D face action sequence.
  • here 0 is the index of the target person and n denotes the frame index set {1, 2, ..., |n|}; each frame of the sequence is the vertex offset relative to the model template; I 0 is a tensor of shape V×3 and the action sequence is a tensor of shape |n|×V×3, where |n| is the sequence length, V is the number of vertices of the 3D face model, and 3 is the spatial dimension.
  • the audio signal X 0 of the speech is separated from the video of the target person.
  • auxiliary character data Obtain auxiliary character data from an existing public voice-synchronized 3D facial animation database (e.g., VOCASET, URL: https://voca.is.tue.mpg.de/ , reference: Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101-10111, 2019.).
  • the data of each auxiliary character in the database include a 3D face model template I u , a 3D face action sequence, and a synchronized speech signal X u ; here u is the index of the person corresponding to the data, m denotes the frame index set {1, 2, ..., |m|}, I u is a tensor of shape V×3 and the action sequence is a tensor of shape |m|×V×3.
  • the speech-synchronized three-dimensional face animation database does not contain the three-dimensional data of the target person, that is, it satisfies u>0, and the topological structure of the three-dimensional face model of its data is consistent with the topological structure of the three-dimensional face model used in step (1).
  • Extracting speech feature sequences: for the speech signals Xi obtained in step (1) and step (2), use existing speech recognition technology (such as DeepSpeech, website: https://github.com/mozilla/DeepSpeech , reference: Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng. DeepSpeech: Scaling up end-to-end speech recognition [J]. arXiv preprint arXiv:1412.5567, 2014.) to extract the speech feature sequence.
  • Training the deep neural networks: using the three-dimensional facial action sequences obtained in step (1) and step (2) and the speech feature sequences obtained in step (3), two deep neural networks are trained simultaneously, respectively called the decoupling network and the speech animation network. This includes the following sub-steps:
  • where C i is the content feature sequence obtained by encoding, s k is the personalized style feature obtained by encoding, and Ŷ is the personalized 3D facial action sequence generated by combining s k with C i and decoding.
  • the content encoder E C first applies three spiral convolutions (SpiralConv) to the t-th frame of the 3D face action sequence; after each spiral convolution, the vertices are downsampled and activated using a leaky rectified linear unit (Leaky ReLU) with a negative slope of 0.2; then all vertex features obtained by the convolutions are concatenated into a one-dimensional vector and mapped to the content feature c t of the t-th frame through a trainable linear matrix; after all frames in the 3D face action sequence are mapped, the content feature sequence C i = {c t } t∈i is obtained; the content feature sequence C i is a tensor of shape |i|×C c , where |i| is the sequence length and C c is the number of feature maps.
  • the spiral convolution is defined on the vertex dimension of the input and takes the form v̂_i = γ( ‖_{j∈N(i)} v_j ), where v j denotes the feature of the j-th input vertex, a vector of length C, with C the number of feature channels; N(i) denotes the set of L adjacent vertices predefined for the i-th vertex; ‖_{j∈N(i)} v_j denotes concatenating the features of all vertices in that adjacency set into a one-dimensional vector of length LC; γ is a trainable linear mapping; and v̂_i denotes the feature of the i-th vertex output by the spiral convolution.
  • the predefined adjacency sets are precomputed on the 3D face model template: for the i-th vertex of the model template, the vertex itself and the vertices on the surrounding topological ring are taken, L vertices in total.
  • the vertex downsampling is defined on the vertex dimension and takes the form V* = M d V + , where V + denotes all the vertices output by the spiral convolution, N being the number of vertices output by the spiral convolution; M d is the downsampling matrix, precomputed on the 3D face model template; and V* is the result after downsampling.
  • the style encoder E S first applies three spiral convolutions (SpiralConv) to the t-th frame of the 3D face action sequence; after each spiral convolution, the vertices are downsampled and activated using a leaky rectified linear unit (Leaky ReLU) with a negative slope of 0.2; then all vertex features obtained by the convolutions are concatenated into a one-dimensional vector and mapped through a trainable linear matrix to the intermediate style feature of the t-th frame; after all frames in the 3D face action sequence are mapped to intermediate style features, a long short-term memory unit processes the intermediate style feature sequence recurrently to obtain the personalized style feature s k ; the personalized style feature s k is a vector of length C s , where C s is the number of feature maps.
  • the spiral convolution and vertex downsampling are consistent with the methods in the content encoder, but use different parameters.
  • the long short-term memory unit has a state cell that stores historical information and three gates: the input gate i t acts on the intermediate style feature of the t-th frame and the output h t-1 of the unit at frame t-1, and indicates whether new intermediate style feature information is allowed into the state cell; its value lies between 0 and 1: if the input gate value is 1 (gate open), the new information is added; if it is 0 (gate closed), a zero vector is added; for an intermediate value, the new information is multiplied by the gate value before being added.
  • the forget gate f t acts on the state cell of the memory unit and indicates whether the historical information S t-1 of frame t-1 stored in the cell is retained; its value lies between 0 and 1: if the forget gate value is 1 (gate open), the stored information is retained; if it is 0 (gate closed), the stored information is reset to a zero vector; for an intermediate value, the stored information is multiplied by the gate value before being retained.
  • the output gate o t acts on the state cell of the memory unit and indicates whether the current state S t of the unit at frame t is used as output; its value lies between 0 and 1: if it is 1 (gate open), the current state of the memory unit is used as output; if it is 0 (gate closed), a zero vector is output; for an intermediate value, the current state is multiplied by the gate value before being used as output.
  • the specific values of the three gates are obtained by concatenating the input of the current frame t with the output h t-1 of the memory unit at frame t-1 and projecting.
  • here h t-1 is the output of the memory unit at frame t-1; i t is the input gate value, and W i and b i are the weight and bias of the input gate; f t is the forget gate value, and W f and b f are the weight and bias of the forget gate; o t is the output gate value, and W o and b o are the weight and bias of the output gate; the projection of the current frame input and the previous frame output uses weight W x and bias b x ; S t-1 and S t are the states of the memory cell at frame t-1 and at the current frame t; h t is the output of the memory unit at frame t; W i , W f , W o , W x are all matrices of shape C s ×C s , and b i , b f , b o , b x are all vectors of length C s .
  • before each convolution, the personalized style feature s k obtained in the above steps is concatenated with each input frame feature, and the front of the sequence is padded with zero feature vectors to ensure that the sequence length after the convolution remains unchanged; after each convolution, a leaky rectified linear unit with a negative slope of 0.2 is applied; for the t-th frame of the sequence after the three convolutions, five fully connected layers generate the t-th frame of the 3D face action; the final output is the 3D face action sequence.
  • the training process uses a standard Adam optimizer to optimize the trainable parameters in the network to minimize the decoupled objective function L decomp .
  • the decoupling objective function L decomp includes: a reconstruction term L rec , a style exchange term L swp , and a cycle consistency term L cyc :
  • L decomp = λ_rec·L_rec + λ_swp·L_swp + λ_cyc·L_cyc, where λ_rec , λ_swp , λ_cyc are the corresponding weights.
  • the reconstruction term calculation process is shown in FIG2 , and is defined as follows:
  • Lseq is the supervision loss function defined for the three-dimensional face action sequence, which is defined as follows:
  • the computation of the style exchange term is shown in FIG3 and is defined on a pair of 3D face action sequences, where p≥0 and q≥0 denote person indices covering the target person and the auxiliary persons, and i, j denote the frame index sets of the corresponding sequences. The two sequences are encoded with the content encoder and the style encoder respectively:
  • the personalized style features s p and s q obtained from the two sequences are exchanged, combined with the content feature sequence of the other sequence, and the 3D face action sequences after the exchange of personalized style features are generated.
  • the first case is p = q, i.e., the two 3D face action sequences come from the same person; in this case the input sequences are used directly as supervision. The second case is p ≠ q, i.e., the two 3D face action sequences come from different persons; in this case only some sequence pairs satisfy the requirement for the computation:
  • the speech content spoken by person p must also have been spoken by person q, i.e., there exists a sequence of person q whose speech content is the same; however, its length may differ, so the standard dynamic time warping algorithm aligns it to the supervised sequence, and the aligned sequence is used for supervision; similarly, an aligned sequence is used to supervise the other output. The second case is computed only when this requirement is met.
  • the calculation process of the cycle consistency term is shown in FIG4.
  • the three-dimensional face action sequences generated after the aforementioned exchange of personalized style features are encoded again with the content encoder and the style encoder; the personalized style features s q′ and s p′ obtained from this encoding are exchanged again and combined with the content feature sequence of the other sequence to generate the three-dimensional face action sequences after two exchanges of personalized style features.
  • after two exchanges, each personalized style feature is combined with its originally matched content feature sequence, so the output should recover the original input sequence;
  • the cycle consistency term L cyc therefore uses the original input sequences for supervision.
  • where A i is the speech feature sequence obtained by encoding W i , and the output is the personalized 3D facial action sequence obtained by combining s k with A i and decoding.
  • before each convolution, the personalized style feature s k obtained in step (4.1) is concatenated with each input frame feature, and the front of the sequence is padded with zero feature vectors to ensure that the sequence length after the convolution remains unchanged; after each convolution, a leaky rectified linear unit with a negative slope of 0.2 is applied; for the t-th frame of the sequence after the three convolutions, five fully connected layers generate the t-th frame of the 3D face action.
  • the final output is the 3D face action sequence.
  • the action decoder is exactly the same as the action decoder in the decoupled network in step (4.1) except for the input, that is, the decoupled network in step (4.1) and the speech animation network in this step share the same action decoder.
  • the training process uses the standard Adam optimizer to optimize the trainable parameters in the network to minimize the speech animation objective function L anime .
  • the speech animation objective function is similar to the decoupling objective function described in step (4.1) and consists of three analogous terms: a speech animation reconstruction term, a speech animation style exchange term, and a speech animation cycle consistency term; replacing the decoupling network output in formula (6) with the output generated by the speech animation network yields the speech animation reconstruction term.
  • the speech animation objective function L anime is expressed as a weighted sum of three items:
  • the training process is performed synchronously with the training process in step (4.1), that is, L decomp and L anime form a joint objective function L joint :
  • Obtaining the personalized style feature of the target person: for the target person's 3D facial action sequence obtained in step (1), the decoupling network trained in step (4) is used to decompose out the personalized style feature s 0 of the target person.
  • Generate personalized three-dimensional facial animation synchronized with speech extract a speech feature sequence from any speech signal using the same method as in step (3); use the speech animation network trained in step (4) to combine the extracted speech feature sequence with the personalized style feature s 0 of the target person obtained in step (5), and output a personalized three-dimensional facial action sequence; add the obtained personalized three-dimensional facial action sequence to the three-dimensional facial model template I 0 of the target person obtained in step (1) to obtain a personalized three-dimensional facial animation; the personalized three-dimensional facial animation is synchronized with the input speech and has the personalized style of the target person.
  • Training example The inventors implemented the example of the present invention on a computer equipped with an Intel Core i7-8700K CPU (3.70GHz) and an NVIDIA GTX1080Ti graphics processor (11GB video memory).
  • the target person video in step (1) comes from the Internet and personal photography;
  • the auxiliary person data in step (2) comes from the public database VOCASET (Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101–10111, 2019.).
  • The speech recognition technology used in step (3) is DeepSpeech (URL: https://github.com/mozilla/DeepSpeech , reference: Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng. DeepSpeech: Scaling up end-to-end speech recognition [J]. arXiv preprint arXiv:1412.5567, 2014.).
  • The predefined adjacency sets and downsampling matrices of the spiral convolutions in step (4) are precomputed using SpiralNet++ (URL: https://github.com/sw-gong/spiralnet_plus , reference: Shunwang Gong, Lei Chen, Michael Bronstein, Stefanos Zafeiriou. SpiralNet++: A Fast and Highly Efficient Mesh Convolution Operator. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019).
  • Animation excerpt The inventor implemented the present invention and used speech signals to drive the generation of personalized 3D facial animation.
  • the key frame excerpt of the generated results shows five different target persons speaking the English word "climate" in their own personalized ways (the key frames correspond in turn to the syllables /k/ and /m/).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present invention relates to the field of facial animation, and in particular to a speech-signal-driven personalized three-dimensional facial animation generation method and its application. In the speech-signal-driven personalized three-dimensional facial animation generation method, a three-dimensional facial action sequence is reconstructed from a frontal speech video of a target person, and a speech feature sequence is extracted from the speech signal of the video; a decoupling network decomposes the reconstructed three-dimensional facial action sequence into a content feature sequence and a personalized style feature, where the content feature sequence contains the action information necessary for pronouncing the speech content in the three-dimensional facial actions, and the personalized style feature contains the style information reflecting the personality of the person in the three-dimensional facial actions; at the same time, another speech animation network combines the decomposed personalized style feature with the extracted speech feature sequence to generate a personalized three-dimensional facial animation.

Description

Speech-signal-driven personalized three-dimensional facial animation generation method and application thereof
Technical field
The present invention relates to the field of facial animation, and in particular to a speech-signal-driven personalized three-dimensional facial animation generation method and its application.
Background art
Traditional procedural, speech-signal-driven personalized facial animation techniques (Yuyu Xu, Andrew W Feng, Stacy Marsella, and Ari Shapiro. A practical and configurable lip sync method for games. In Proceedings of Motion on Games, pages 131-140. ACM, 2013.) (Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. JALI: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics (TOG), 35(4):127, 2016.) automatically recognize from the speech signal the phoneme sequence that reflects pronunciation (for example, syllables in English and pinyin in Chinese), group the phonemes into visemes according to the lip shapes humans form during pronunciation, and create personalized animation keyframes of the target person for each viseme; the whole sequence is then connected by manually formulated rules to obtain a coherent personalized facial animation. These techniques require animation keyframes to be created for every target person, which involves a large amount of repetitive manual work, and the quality of the generated animation is usually limited by the accuracy of phoneme recognition and the soundness of the manually formulated rules.
In recent years, some techniques have used deep neural networks (DNNs) to generate high-quality personalized facial animation for a target person from speech signals. For example, Taylor et al. (Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. A deep learning approach for generalized speech animation. ACM Transactions on Graphics (TOG), 36(4):93, 2017.) collected frontal speech videos of more than 2000 sentences for one target person and trained a dedicated deep neural network for that person which maps the phoneme sequence of the speech to a sequence of Active Appearance Model (AAM) coefficients of the face; Suwajanakorn et al. (Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95, 2017.) collected dozens of hours of speech videos of Barack Obama to train a dedicated Long Short-Term Memory (LSTM) network that maps the Mel-scale Frequency Cepstral Coefficients (MFCC) sequence of the speech signal to the motion trajectories of lip key points. Although these techniques can generate high-quality personalized facial animation for the target person, their excessive data requirements make them hard to apply to arbitrary target persons.
When the data of a single target person are limited, some techniques enlarge the overall amount of training data by mixing several target persons and generate the personalized facial animation of one specific target person by controlling the style. For example, Cudeiro et al. (Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101-10111, 2019.) captured speech 3D facial animations of 40 sentences for each of twelve different target persons and trained a convolutional neural network that maps the speech signal to 3D facial animation, using the one-hot vector corresponding to the person index to control the output to be the personalized facial animation of the corresponding target person; Thies et al. (Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, Matthias Nießner. Neural voice puppetry: Audio-driven facial reenactment. European Conference on Computer Vision (ECCV), pages 716-731, Springer, Cham, 2020.) collected frontal speech video data of 116 target persons from German news commentary videos, trained a shared convolutional neural network on the data of all target persons to map the speech signal to a shared sequence of blend shape model coefficients, and then optimized for each target person a linear mapping matrix that maps the shared blend shape coefficients to the personalized blend shape coefficients of that person. These techniques rely on style control to generate the personalized facial animation of a specific target person; although they distinguish the personalized styles of different persons, they do not explicitly separate, within each person's data, the personalized style information from the action information required for pronouncing the speech content, so the trained network models cannot accurately learn the personalized style of the person.
Summary of the invention
The purpose of the present invention is to address the deficiencies of the prior art and provide a speech-signal-driven personalized three-dimensional facial animation generation method.
In the speech-signal-driven personalized three-dimensional facial animation generation method, a three-dimensional facial action sequence is reconstructed from the frontal speech video of a target person, and a speech feature sequence is extracted from the speech signal of the video; a decoupling network decomposes the reconstructed three-dimensional facial action sequence into a content feature sequence and a personalized style feature, where the content feature sequence contains the action information necessary for pronouncing the speech content in the three-dimensional facial actions, and the personalized style feature contains the style information reflecting the personality of the person in the three-dimensional facial actions; at the same time, another speech animation network combines the decomposed personalized style feature with the extracted speech feature sequence to generate a personalized three-dimensional facial animation.
For a frontal speech video of the target person of roughly one minute, the present invention reconstructs a three-dimensional facial action sequence from the video using existing technology and extracts a speech feature sequence from the speech signal of the video using existing speech recognition technology. The present invention decomposes the reconstructed three-dimensional facial action sequence into a content feature sequence and a personalized style feature through a deep neural network (called the decoupling network), where the content feature sequence contains the action information necessary for pronouncing the speech content in the three-dimensional facial actions and the personalized style feature contains the style information reflecting the personality of the person in the three-dimensional facial actions; and the present invention combines the decomposed personalized style feature with the extracted speech feature sequence through another deep neural network (called the speech animation network) to generate a personalized three-dimensional facial animation.
Specifically, the purpose of the present invention is achieved through the following technical solution: a speech-signal-driven personalized three-dimensional facial animation generation method comprising the following steps:
(1) Processing the target person's video data: perform three-dimensional reconstruction on every frame of the given frontal speech video of the target person using existing three-dimensional morphable face model technology, and remove the head motion to obtain the target person's three-dimensional face model template and three-dimensional facial action sequence; the model template is a two-dimensional tensor composed of a vertex dimension and a spatial dimension; the three-dimensional facial action sequence is a sequence of vertex offsets relative to the model template and is a three-dimensional tensor composed of a sequence dimension, a vertex dimension and a spatial dimension; extract the speech signal from the given video.
(2) Obtaining auxiliary person data: obtain auxiliary person data from an existing public speech-synchronized three-dimensional facial animation database, where the data of each auxiliary person include a three-dimensional face model template, a three-dimensional facial action sequence and the synchronized speech signal; the speech-synchronized three-dimensional facial animation database does not contain three-dimensional data of the target person.
(3) Extracting speech feature sequences: for the speech signals obtained in step (1) and step (2), extract speech feature sequences using existing speech recognition technology; the speech feature sequence is a three-dimensional tensor composed of a sequence dimension, a window dimension and a feature map dimension.
(4) Training the deep neural networks: using the three-dimensional facial action sequences obtained in steps (1) and (2) and the speech feature sequences obtained in step (3), simultaneously train two deep neural networks, called the decoupling network and the speech animation network respectively; the decoupling network decomposes a three-dimensional facial action sequence into a content feature sequence and a personalized style feature; the content feature sequence is a two-dimensional tensor composed of a sequence dimension and a feature map dimension and contains the action information necessary for pronouncing the speech content in the three-dimensional facial actions; the personalized style feature is a one-dimensional tensor composed of a feature map dimension and contains the style information reflecting the personality of the person in the three-dimensional facial actions; the speech animation network combines the decomposed personalized style feature with a speech feature sequence and outputs a personalized three-dimensional facial action sequence.
(5) Obtaining the personalized style feature of the target person: for the target person's three-dimensional facial action sequence obtained in step (1), decompose out the personalized style feature of the target person using the decoupling network trained in step (4).
(6) Generating speech-synchronized personalized three-dimensional facial animation: extract a speech feature sequence from any input speech signal using the same method as in step (3); use the speech animation network trained in step (4) to combine the extracted speech feature sequence with the personalized style feature of the target person obtained in step (5) and output a personalized three-dimensional facial action sequence; add the obtained three-dimensional facial action sequence to the target person's three-dimensional face model template obtained in step (1) to obtain a personalized three-dimensional facial animation; the personalized three-dimensional facial animation is synchronized with the input speech and has the personalized style of the target person.
The step (4) comprises the following sub-steps:
(4.1) Train a deep neural network, the decoupling network, using the three-dimensional facial action sequences obtained in steps (1) and (2); the decoupling network is composed of a content encoder, a style encoder and an action decoder. The content encoder first applies three spiral convolutions to each frame of the three-dimensional facial action sequence; after each spiral convolution the vertices are downsampled and activated with a leaky rectified linear unit with a negative slope of 0.2; the vertex features after the three spiral convolutions are then concatenated into a one-dimensional vector and mapped to the content feature through a linear matrix; after all frames of the sequence are mapped, the content feature sequence is obtained; the content feature sequence is a two-dimensional tensor composed of a sequence dimension and a feature map dimension. The style encoder applies to each frame the same three spiral convolutions, vertex downsampling, activation and subsequent linear mapping as the content encoder, but with different parameters, mapping each frame to an intermediate style feature; after all frames are mapped to the intermediate style feature sequence, a standard long short-term memory unit processes the intermediate style feature sequence recurrently to obtain the personalized style feature; the personalized style feature is a one-dimensional vector composed of a feature map dimension. The action decoder applies three one-dimensional convolutions to the content feature sequence produced by the content encoder; before each convolution, the personalized style feature produced by the style encoder is concatenated with each input frame feature, and the front of the sequence is padded with zero feature vectors so that the sequence length is unchanged after convolution; after each convolution a leaky rectified linear unit with a negative slope of 0.2 is applied; the result is then mapped through five fully connected layers to output a personalized three-dimensional facial action sequence. The training uses a standard Adam optimizer to optimize the trainable parameters of the network so as to minimize the decoupling objective function; the decoupling objective function includes a reconstruction term, a style exchange term and a cycle consistency term. The reconstruction term uses the content encoder and the style encoder to encode the three-dimensional facial action sequences obtained in steps (1) and (2) into content feature sequences and personalized style features, and uses the original data to supervise the personalized three-dimensional facial action sequences decoded by the action decoder from those content feature sequences and personalized style features. The style exchange term uses the content encoder and the style encoder to encode the three-dimensional facial action sequences obtained in steps (1) and (2) into content feature sequences and personalized style features, then exchanges the personalized style features of any two sequences so that they are combined with content feature sequences from different sources, and the action decoder outputs the personalized three-dimensional facial action sequences after the exchange of personalized style features; the style exchange term supervises this output. The cycle consistency term encodes the personalized three-dimensional facial action sequences after the style exchange once more with the content encoder and the style encoder, exchanges the encoded personalized style features again, and the action decoder outputs the personalized three-dimensional facial action sequences after two exchanges of personalized style features; the cycle consistency term supervises this output.
(4.2) Train another deep neural network, the speech animation network, using the speech feature sequences obtained in step (3) and the personalized style features decomposed by the decoupling network in step (4.1); this step is performed simultaneously with step (4.1). The speech animation network is composed of a speech encoder and an action decoder. For each frame's feature window in the speech feature sequence, the speech encoder uses the whole window as the source and the middle frame of the window as the query, and encodes it with a standard Transformer network; encoding all frames of the sequence yields the encoded speech feature sequence, which is a two-dimensional tensor composed of a sequence dimension and a feature map dimension. The action decoder applies three one-dimensional convolutions to the encoded speech feature sequence; before each convolution, the personalized style feature decomposed in step (4.1) is concatenated with each input frame feature, and the front of the sequence is padded with zero feature vectors so that the sequence length is unchanged after convolution; after each convolution a leaky rectified linear unit with a negative slope of 0.2 is applied; the result is then mapped through five fully connected layers to output a personalized three-dimensional facial action sequence. This action decoder is identical to the action decoder of the decoupling network in step (4.1) except for its input; that is, the decoupling network in step (4.1) and the speech animation network in this step share the same action decoder. The training uses a standard Adam optimizer to optimize the trainable parameters of the network so as to minimize the speech animation objective function; the speech animation objective function includes a speech animation reconstruction term, a speech animation style exchange term and a speech animation cycle consistency term; each of them is computed in the same way as the corresponding term in step (4.1), except that the output of the decoupling network is replaced by the corresponding output of the speech animation network.
The present invention discloses a speech-signal-driven personalized three-dimensional facial animation generation method: given a frontal speech video of the target person of about one minute, the personalized three-dimensional facial action style of that person can be learned, and a speech-synchronized three-dimensional facial animation with the personalized style of the person can be generated for any input speech signal; the quality of the generated animation matches the current state of the art in speech-signal-driven personalized three-dimensional facial animation. The method consists of six main steps: processing the target person's video data, obtaining the auxiliary person data, extracting the speech feature sequences, training the deep neural networks, obtaining the personalized style feature of the target person, and generating speech-synchronized personalized three-dimensional facial animation. Step (2), obtaining the auxiliary person data, only needs to be executed once, and when the amount of target person video data is small (only about one minute), the auxiliary person data effectively enlarge the data volume, which benefits the execution of the subsequent step (4). In step (4), a decoupling network is trained to explicitly decompose the three-dimensional facial action sequence into a content feature sequence and a personalized style feature, so that the personalized style feature of the target person obtained in step (5) accurately reflects the personalized style information of the target person without being affected by the pronunciation of the speech content; the other network trained in step (4), the speech animation network, combines the personalized style feature with the speech feature sequence, so that the personalized three-dimensional facial animation generated in step (6) both accurately reflects the personalized style of the target person and stays synchronized with the input speech.
The present invention can be used for speech-signal-driven personalized three-dimensional facial animation generation tasks in different scenarios, such as VR virtual social interaction, virtual voice assistants, and games.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the method of the present invention;
Fig. 2 is a schematic diagram of the computation flow of the reconstruction term in sub-step (4.1) of step (4) of the method of the present invention;
Fig. 3 is a schematic diagram of the computation flow of the style exchange term in sub-step (4.1) of step (4) of the method of the present invention;
Fig. 4 is a schematic diagram of the computation flow of the cycle consistency term in sub-step (4.1) of step (4) of the method of the present invention;
Fig. 5 shows excerpted animation key frames generated in an embodiment of the present invention, in which an input speech signal drives the personalized three-dimensional facial animation of different target persons;
in Fig. 5, five different target persons speak the English word "climate" in their own personalized ways.
Detailed description of the embodiments
The core of the present invention trains a deep neural network (the decoupling network) to decompose three-dimensional facial actions into a content feature sequence and a personalized style feature, and simultaneously trains another deep neural network (the speech animation network) to combine the decomposed personalized style feature with a speech feature sequence extracted from the speech signal and output speech-synchronized personalized three-dimensional facial actions. As shown in Fig. 1, the method consists of six main steps: processing the target person's video data, obtaining the auxiliary person data, extracting the speech feature sequences, training the deep neural networks, obtaining the personalized style feature of the target person, and generating speech-synchronized personalized three-dimensional facial animation.
(1) Processing the target person's video data: perform three-dimensional reconstruction on every frame of the target person's frontal speech video using existing three-dimensional morphable face model technology (for example FLAME, URL: https://flame.is.tue.mpg.de/, reference: Tianye Li, Timo Bolkart, Michael J Black, Hao Li and Javier Romero. FLAME: Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph., 36(6):194:1-194:17, 2017), and remove all head motion to obtain the target person's three-dimensional face model template I_0 and three-dimensional facial action sequence Y_n^0 = {y_t}_{t∈n}, where 0 is the index of the target person, n denotes the frame index set {1, 2, ..., |n|}, and y_t denotes the facial action of the t-th frame of the sequence, i.e., the vertex offsets relative to the model template; I_0 is a tensor of shape V×3 and Y_n^0 is a tensor of shape |n|×V×3, where |n| is the sequence length, V is the number of vertices of the three-dimensional face model, and 3 is the spatial dimension. At the same time, the audio signal X_0 of the speech is separated from the target person's video.
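As an illustration of the head-motion removal and per-vertex offsets described above, the following is a minimal NumPy sketch; it uses a rigid Procrustes (Kabsch) fit on hypothetical arrays and does not reproduce the FLAME reconstruction itself, so the array names and shapes are assumptions.

    import numpy as np

    def remove_head_motion(frames, template):
        """Rigidly align each reconstructed frame (V x 3) to the template (V x 3)
        with a Kabsch/Procrustes fit, then return vertex offsets y_t = frame - template."""
        offsets = []
        c_tpl = template - template.mean(axis=0)            # centered template
        for frame in frames:
            c_frm = frame - frame.mean(axis=0)               # centered frame
            u, _, vt = np.linalg.svd(c_frm.T @ c_tpl)        # optimal rotation via SVD
            d = np.sign(np.linalg.det(vt.T @ u.T))           # avoid reflections
            rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
            aligned = c_frm @ rot.T + template.mean(axis=0)  # head pose removed
            offsets.append(aligned - template)               # per-vertex offset y_t
        return np.stack(offsets)                             # shape |n| x V x 3

    # toy usage: V = 5023 vertices, as in the FLAME template used in the embodiment
    V = 5023
    template = np.random.rand(V, 3)
    frames = [template + 0.01 * np.random.randn(V, 3) for _ in range(4)]
    Y0 = remove_head_motion(frames, template)
    print(Y0.shape)  # (4, 5023, 3)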
(2) Obtaining auxiliary person data: obtain auxiliary person data from an existing public speech-synchronized three-dimensional facial animation database (for example VOCASET, URL: https://voca.is.tue.mpg.de/, reference: Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101-10111, 2019.). The data of each auxiliary person in the database include a three-dimensional face model template I_u, a three-dimensional facial action sequence Y_m^u = {y_t}_{t∈m}, and the synchronized speech signal X_u, where u is the index of the person corresponding to the data, m denotes the frame index set {1, 2, ..., |m|}, and y_t denotes the facial action of the t-th frame of the sequence; I_u is a tensor of shape V×3 and Y_m^u is a tensor of shape |m|×V×3, where |m| is the sequence length, V is the number of vertices of the three-dimensional face model, and 3 is the spatial dimension. The speech-synchronized three-dimensional facial animation database does not contain three-dimensional data of the target person, i.e., u>0 holds, and the topology of the three-dimensional face models in its data is consistent with the topology of the three-dimensional face model used in step (1).
(3) Extracting speech feature sequences: for the speech signals X_i obtained in step (1) and step (2), use existing speech recognition technology (for example DeepSpeech, URL: https://github.com/mozilla/DeepSpeech, reference: Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng. DeepSpeech: Scaling up end-to-end speech recognition [J]. arXiv preprint arXiv:1412.5567, 2014.) to extract intermediate features x_i, a tensor of shape |i|×C_x, and then apply a windowing operation to obtain the speech feature sequence W_i = {w_t}_{t∈i}, a tensor of shape |i|×W×C_x; here i≥0 is the person index covering the target person and the auxiliary persons, i denotes the frame index set {1, 2, ..., |i|}, w_t denotes the speech feature of the t-th frame, |i| is the sequence length and equals the length of the corresponding three-dimensional facial action sequence, W is the window length of each frame's features, and C_x is the number of feature maps. The windowing operation takes, for each frame of the sequence x_i, the neighbouring frames before and after it as a window, padding with zeros where the window extends beyond the sequence.
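The windowing operation can be made concrete with the short NumPy sketch below; the exact placement of the window around frame t is an assumption, while the window length W = 16 and C_x = 29 follow the embodiment.

    import numpy as np

    def window_features(x, W=16):
        """Turn per-frame speech features x of shape (T, Cx) into windows of shape
        (T, W, Cx): for frame t take W neighbouring frames centred on t, zero-padding
        beyond the sequence boundaries."""
        T, Cx = x.shape
        half = W // 2
        padded = np.concatenate([np.zeros((half, Cx)), x, np.zeros((W - half, Cx))], axis=0)
        return np.stack([padded[t:t + W] for t in range(T)])

    # toy usage with DeepSpeech-like features: Cx = 29, W = 16 as in the embodiment
    feats = np.random.randn(100, 29)
    windows = window_features(feats, W=16)
    print(windows.shape)  # (100, 16, 29)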
(4) Training the deep neural networks: using the three-dimensional facial action sequences obtained in steps (1) and (2) and the speech feature sequences obtained in step (3), simultaneously train two deep neural networks, called the decoupling network and the speech animation network respectively. This comprises the following sub-steps:
(4.1) Training the decoupling network: train a deep neural network, called the decoupling network, using the three-dimensional facial action sequences Y_i^k obtained in steps (1) and (2), where k≥0 denotes the person index covering the target person and the auxiliary persons, i denotes the frame index set {1, 2, ..., |i|}, and y_t is the three-dimensional facial action of the t-th frame of the sequence; the decoupling network is composed of a content encoder E_C, a style encoder E_S, and an action decoder D, and its computation is defined as follows:
C_i = E_C(Y_i^k), s_k = E_S(Y_i^k),      (1)
Ŷ_i^k = D(C_i, s_k),      (2)
where C_i is the content feature sequence obtained by encoding, s_k is the personalized style feature obtained by encoding, and Ŷ_i^k is the personalized three-dimensional facial action sequence generated by combining s_k with C_i and decoding.
The content encoder E_C first applies three spiral convolutions (SpiralConv) to the t-th frame y_t of the three-dimensional facial action sequence; after each spiral convolution the vertices are downsampled and activated with a leaky rectified linear unit (Leaky ReLU) with a negative slope of 0.2; all vertex features produced by the convolutions are then concatenated into a one-dimensional vector and mapped to the content feature c_t of the t-th frame by a trainable linear matrix; after all frames of the sequence are mapped, the content feature sequence C_i = {c_t}_{t∈i} is obtained; C_i is a tensor of shape |i|×C_c, where |i| is the sequence length and C_c is the number of feature maps. The spiral convolution is defined over the vertex dimension of its input and has the following form:
v̂_i = γ( ‖_{j∈N(i)} v_j ),
where v_j denotes the feature of the j-th input vertex, a vector of length C, with C the number of feature channels; N(i) denotes the set of L adjacent vertices predefined for the i-th vertex; ‖_{j∈N(i)} v_j denotes concatenating the features of all vertices in the adjacency set of the i-th vertex into a one-dimensional vector of length LC; γ is a trainable linear mapping; and v̂_i denotes the feature of the i-th vertex output by the spiral convolution. The predefined adjacency sets are precomputed on the three-dimensional face model template: for the i-th vertex of the model template, the vertex itself and the vertices on the surrounding topological ring are taken, L vertices in total. The vertex downsampling is defined over the vertex dimension and has the following form:
V* = M_d V^+      (3)
where V^+ denotes all the vertices output by the spiral convolution, with N in the subscript denoting the number of vertices output by the spiral convolution; M_d is the downsampling matrix, precomputed on the three-dimensional face model template; and V* is the result after downsampling.
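To make the spiral convolution and the vertex downsampling V* = M_d V^+ concrete, the following is a minimal PyTorch sketch; the spiral adjacency indices and the downsampling matrix are random placeholders here, whereas the embodiment precomputes them on the face template with SpiralNet++.

    import torch
    import torch.nn as nn

    class SpiralConv(nn.Module):
        """Spiral convolution: concatenate the features of the L spiral-neighbour
        vertices of each vertex and apply one trainable linear map (gamma)."""
        def __init__(self, in_ch, out_ch, spiral_idx):
            super().__init__()
            self.spiral_idx = spiral_idx                 # (V, L) neighbour indices
            self.gamma = nn.Linear(in_ch * spiral_idx.shape[1], out_ch)

        def forward(self, x):                            # x: (B, V, C)
            b, v, c = x.shape
            gathered = x[:, self.spiral_idx]             # (B, V, L, C)
            return self.gamma(gathered.reshape(b, v, -1))

    # toy shapes: V = 12 vertices, L = 12 neighbours, 3 -> 16 channels
    V, L = 12, 12
    spiral_idx = torch.randint(0, V, (V, L))             # placeholder spiral ordering
    conv = SpiralConv(3, 16, spiral_idx)
    x = torch.randn(2, V, 3)                             # batch of per-vertex offsets
    h = torch.nn.functional.leaky_relu(conv(x), 0.2)     # activation with slope 0.2

    # vertex downsampling V* = M_d V+ as a precomputed matrix product
    Md = torch.rand(V // 2, V)                           # placeholder downsampling matrix
    h_down = torch.einsum('dv,bvc->bdc', Md, h)          # (B, V/2, 16)
    print(h_down.shape)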
The style encoder E_S first applies three spiral convolutions (SpiralConv) to the t-th frame y_t of the three-dimensional facial action sequence; after each spiral convolution the vertices are downsampled and activated with a leaky rectified linear unit (Leaky ReLU) with a negative slope of 0.2; all vertex features produced by the convolutions are then concatenated into a one-dimensional vector and mapped by a trainable linear matrix to the intermediate style feature x̃_t of the t-th frame. After all frames of the sequence are mapped to intermediate style features, a long short-term memory unit processes the intermediate style feature sequence recurrently to obtain the personalized style feature s_k; s_k is a vector of shape C_s, where C_s is the number of feature maps. The spiral convolution and vertex downsampling are the same as in the content encoder but use different parameters. The long short-term memory unit has a state cell that stores historical information and three gates: the input gate i_t acts on the intermediate style feature x̃_t of the t-th frame and the output h_{t-1} of the unit at frame t-1, and indicates whether new intermediate style feature information is allowed into the state cell; its value lies between 0 and 1: if the input gate value is 1 (gate open), the new information is added; if it is 0 (gate closed), a zero vector is added; for an intermediate value, the new information is multiplied by the gate value before being added. The forget gate f_t acts on the state cell and indicates whether the historical information S_{t-1} of frame t-1 stored in the cell is retained: if the forget gate value is 1 (gate open), the stored information is kept; if it is 0 (gate closed), the stored information is reset to a zero vector; for an intermediate value, the stored information is multiplied by the gate value before being retained. The output gate o_t acts on the state cell and indicates whether the current state S_t of the unit at frame t is used as output: if it is 1 (gate open), the current state is used as output; if it is 0 (gate closed), a zero vector is output; for an intermediate value, the current state is multiplied by the gate value before being used as output. The values of the three gates are obtained by concatenating the input x̃_t of the current frame t with the output h_{t-1} of the unit at frame t-1 and projecting, as follows:
i_t = σ(W_i·[x̃_t, h_{t-1}] + b_i), f_t = σ(W_f·[x̃_t, h_{t-1}] + b_f), o_t = σ(W_o·[x̃_t, h_{t-1}] + b_o),
x̂_t = tanh(W_x·[x̃_t, h_{t-1}] + b_x), S_t = f_t ⊙ S_{t-1} + i_t ⊙ x̂_t, h_t = o_t ⊙ S_t,      (4)
where x̃_t is the intermediate style feature input at the current frame t, h_{t-1} is the output of the memory unit at frame t-1, and [x̃_t, h_{t-1}] denotes concatenating the feature maps of x̃_t and h_{t-1}; i_t is the input gate value, and W_i, b_i are the weight and bias of the input gate; f_t is the forget gate value, and W_f, b_f are the weight and bias of the forget gate; o_t is the output gate value, and W_o, b_o are the weight and bias of the output gate; x̂_t is the projection of the current frame input and the previous frame output, and W_x, b_x are the weight and bias of this projection; S_{t-1} and S_t are the states of the memory cell at frame t-1 and at the current frame t; h_t is the output of the memory unit at frame t; W_i, W_f, W_o, W_x are matrices of shape C_s×C_s, b_i, b_f, b_o, b_x are vectors of shape C_s, and W_i, W_f, W_o, W_x, b_i, b_f, b_o, b_x are all trainable parameters.
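The overall structure of the style encoder (per-frame features followed by a recurrent unit whose final output is the style feature) can be sketched as follows; the per-frame linear map stands in for the spiral-convolution stack above, and the use of a stock nn.LSTM is an assumption.

    import torch
    import torch.nn as nn

    class StyleEncoder(nn.Module):
        """Per-frame feature extractor (placeholder for the SpiralConv stack) followed
        by an LSTM whose last output is the personalized style feature s_k (Cs = 32)."""
        def __init__(self, frame_dim, cs=32):
            super().__init__()
            self.per_frame = nn.Linear(frame_dim, cs)      # placeholder for SpiralConv stack
            self.lstm = nn.LSTM(input_size=cs, hidden_size=cs, batch_first=True)

        def forward(self, seq):                            # seq: (B, T, frame_dim)
            inter = torch.tanh(self.per_frame(seq))        # intermediate style features
            out, _ = self.lstm(inter)                      # recurrent processing
            return out[:, -1]                              # style feature s_k: (B, Cs)

    enc = StyleEncoder(frame_dim=5023 * 3)
    seq = torch.randn(2, 30, 5023 * 3)                     # 30 frames of flattened offsets
    s_k = enc(seq)
    print(s_k.shape)                                       # torch.Size([2, 32])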
The action decoder D applies three one-dimensional convolutions to the content feature sequence C_i = {c_t}_{t∈i} obtained above; before each convolution, the personalized style feature s_k obtained above is concatenated with each input frame feature, and the front of the sequence is padded with zero feature vectors so that the sequence length is unchanged after convolution; after each convolution a leaky rectified linear unit with a negative slope of 0.2 is applied; for the t-th frame of the sequence after the three convolutions, five fully connected layers generate the three-dimensional facial action ŷ_t of the t-th frame; the final output three-dimensional facial action sequence is Ŷ_i^k = {ŷ_t}_{t∈i}.
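The following PyTorch sketch mirrors the action decoder just described; the kernel sizes (5, 3, 3) and channel counts (64, 128, 256) follow the embodiment, while the hidden widths of the five fully connected layers and their activations are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ActionDecoder(nn.Module):
        """Three 1D convolutions over the frame sequence, each preceded by concatenating
        the style feature to every frame and by left zero-padding to preserve length,
        followed by five fully connected layers."""
        def __init__(self, in_ch=64, style_ch=32, out_dim=5023 * 3):
            super().__init__()
            chans, kernels = [64, 128, 256], [5, 3, 3]
            self.convs = nn.ModuleList()
            prev = in_ch
            for ch, k in zip(chans, kernels):
                self.convs.append(nn.Conv1d(prev + style_ch, ch, kernel_size=k))
                prev = ch
            dims = [prev, 256, 256, 256, 256, out_dim]      # five FC layers (widths assumed)
            self.fcs = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))

        def forward(self, content, style):                  # content: (B, T, C), style: (B, Cs)
            x = content.transpose(1, 2)                     # (B, C, T) for Conv1d
            for conv in self.convs:
                s = style.unsqueeze(-1).expand(-1, -1, x.shape[-1])
                x = torch.cat([x, s], dim=1)                # concat style to every frame
                x = F.pad(x, (conv.kernel_size[0] - 1, 0))  # pad front with zeros
                x = F.leaky_relu(conv(x), 0.2)
            x = x.transpose(1, 2)                           # back to (B, T, C)
            for i, fc in enumerate(self.fcs):
                x = fc(x) if i == len(self.fcs) - 1 else F.leaky_relu(fc(x), 0.2)
            return x                                        # predicted offsets per frame

    dec = ActionDecoder()
    out = dec(torch.randn(2, 30, 64), torch.randn(2, 32))
    print(out.shape)                                        # torch.Size([2, 30, 15069])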
The training uses a standard Adam optimizer to optimize the trainable parameters of the network so as to minimize the decoupling objective function L_decomp. The decoupling objective function L_decomp includes a reconstruction term L_rec, a style exchange term L_swp, and a cycle consistency term L_cyc:
L_decomp = λ_rec·L_rec + λ_swp·L_swp + λ_cyc·L_cyc,      (5)
where λ_rec, λ_swp, λ_cyc are the corresponding weights.
The computation of the reconstruction term is shown in Fig. 2 and is defined as follows:
L_rec = L_seq(Y_i^k, Ŷ_i^k),      (6)
where L_seq is the supervision loss function defined on three-dimensional facial action sequences, defined as follows:
L_seq(Y_i, Ŷ_i) = Σ_{t∈i} ( ‖ŷ_t − y_t‖_2 + λ_m·‖(ŷ_t − ŷ_{t−1}) − (y_t − y_{t−1})‖_2 + λ_l·‖LipH(ŷ_t) − LipH(y_t)‖_2 ),      (7)
where the notation in the above formula omits the person index; y_t is the t-th frame of the supervision data sequence Y_i and ŷ_t is the t-th frame of the generated action sequence; the first term computes the l2 distance between the t-th generated frame and the t-th supervision frame to supervise the accuracy of the generated actions; the second term computes the l2 distance between the change from frame t-1 to frame t of the generated actions and that of the supervision data, to supervise the smoothness of the generated actions; the third term computes the l2 distance between the lip opening height of the t-th generated frame and that of the t-th supervision frame, to supervise the generated actions to have accurate lip motion, where LipH(·) computes the average height difference along the y axis of pre-selected lip vertices to approximate the lip opening height; λ_m and λ_l are the corresponding weights.
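A small sketch of the sequence supervision loss L_seq follows (position accuracy, motion smoothness and lip-opening terms); the lip vertex indices are hypothetical, and the use of squared distances and mean reduction is an assumption, while the weights λ_m = 5 and λ_l = 1 follow the embodiment.

    import torch

    def lip_height(y, upper_idx, lower_idx):
        """Approximate lip opening: mean y-axis gap between selected upper/lower lip vertices."""
        return (y[..., upper_idx, 1] - y[..., lower_idx, 1]).mean(dim=-1)

    def seq_loss(y_hat, y, upper_idx, lower_idx, lam_m=5.0, lam_l=1.0):
        """Position accuracy + motion smoothness + lip-opening supervision.
        y_hat, y: (T, V, 3) predicted and ground-truth vertex offsets."""
        pos = ((y_hat - y) ** 2).sum(dim=-1).mean()
        d_hat, d_gt = y_hat[1:] - y_hat[:-1], y[1:] - y[:-1]
        motion = ((d_hat - d_gt) ** 2).sum(dim=-1).mean()
        lip = ((lip_height(y_hat, upper_idx, lower_idx)
                - lip_height(y, upper_idx, lower_idx)) ** 2).mean()
        return pos + lam_m * motion + lam_l * lip

    # toy usage with hypothetical lip vertex indices
    T, V = 30, 5023
    upper, lower = torch.arange(10), torch.arange(10, 20)
    loss = seq_loss(torch.randn(T, V, 3), torch.randn(T, V, 3), upper, lower)
    print(loss.item())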
The computation of the style exchange term is shown in Fig. 3 and is defined on a pair of three-dimensional facial action sequences Y_i^p and Y_j^q, where p≥0 and q≥0 denote person indices covering the target person and the auxiliary persons, and i, j denote the frame index sets of the corresponding sequences. The two sequences are encoded with the content encoder and the style encoder respectively:
C_i = E_C(Y_i^p), s_p = E_S(Y_i^p); C_j = E_C(Y_j^q), s_q = E_S(Y_j^q),      (8)
then the personalized style features s_p and s_q obtained from the two sequences are exchanged and combined with the content feature sequence of the other sequence, and the three-dimensional facial action sequences after the exchange of personalized style features are generated:
Ŷ_i^(p→q) = D(C_i, s_q), Ŷ_j^(q→p) = D(C_j, s_p).      (9)
For the three-dimensional facial action sequences after the exchange of personalized style features, the style exchange term L_swp is computed by considering two cases:
L_swp = L_seq(Y_i^p, Ŷ_i^(p→q)) + L_seq(Y_j^q, Ŷ_j^(q→p)) if p = q; L_swp = L_seq(Y̅_i^q, Ŷ_i^(p→q)) + L_seq(Y̅_j^p, Ŷ_j^(q→p)) if p ≠ q,      (10)
where the first case is p = q, i.e., the two three-dimensional facial action sequences come from the same person; in this case the input sequences are used directly as supervision data to compute the loss. The second case is p ≠ q, i.e., the two sequences come from different persons; in this case only some sequence pairs satisfy the requirement for the computation: the speech content spoken by person p in Y_i^p must also have been spoken by person q, that is, there exists a sequence Y_{i′}^q whose speech content is the same as that of Y_i^p; however, the length of i′ may differ from i, so the standard dynamic time warping algorithm aligns Y_{i′}^q to Y_i^p, and the aligned sequence, denoted Y̅_i^q, is used to supervise Ŷ_i^(p→q); similarly, an aligned sequence Y̅_j^p is used to supervise Ŷ_j^(q→p). The second case is computed only when this requirement is satisfied.
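The dynamic time warping alignment used in the p ≠ q case can be sketched as follows; this is a plain NumPy implementation of standard DTW on hypothetical feature sequences, returning one matched frame of the other person's sequence per reference frame.

    import numpy as np

    def dtw_align(ref, other):
        """Align `other` (T2, D) to `ref` (T1, D) with standard dynamic time warping and
        return a sequence of length T1 built from `other` (one matched frame per ref frame)."""
        t1, t2 = len(ref), len(other)
        dist = np.linalg.norm(ref[:, None] - other[None, :], axis=-1)   # (T1, T2) cost
        acc = np.full((t1 + 1, t2 + 1), np.inf)
        acc[0, 0] = 0.0
        for a in range(1, t1 + 1):
            for b in range(1, t2 + 1):
                acc[a, b] = dist[a - 1, b - 1] + min(acc[a - 1, b - 1], acc[a - 1, b], acc[a, b - 1])
        # backtrack the warping path, keeping one frame of `other` per frame of `ref`
        a, b, match = t1, t2, [0] * t1
        while a > 0 and b > 0:
            match[a - 1] = b - 1
            step = np.argmin([acc[a - 1, b - 1], acc[a - 1, b], acc[a, b - 1]])
            a, b = (a - 1, b - 1) if step == 0 else (a - 1, b) if step == 1 else (a, b - 1)
        return other[match]

    aligned = dtw_align(np.random.randn(30, 8), np.random.randn(45, 8))
    print(aligned.shape)  # (30, 8)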
The computation of the cycle consistency term is shown in Fig. 4: the three-dimensional facial action sequences generated after the exchange of personalized style features are encoded again with the content encoder and the style encoder, the personalized style features s_{q′} and s_{p′} obtained from this encoding are exchanged again and combined with the content feature sequence of the other sequence, and the three-dimensional facial action sequences after two exchanges of personalized style features are generated:
C_{i′} = E_C(Ŷ_i^(p→q)), s_{q′} = E_S(Ŷ_i^(p→q)); C_{j′} = E_C(Ŷ_j^(q→p)), s_{p′} = E_S(Ŷ_j^(q→p)),      (11)
Ŷ_i^(p→q→p) = D(C_{i′}, s_{p′}), Ŷ_j^(q→p→q) = D(C_{j′}, s_{q′}).
After the two exchanges, each personalized style feature is combined with its originally matched content feature sequence, so the output should recover the original input sequence; the cycle consistency term L_cyc therefore uses the original input sequences for supervision:
L_cyc = L_seq(Y_i^p, Ŷ_i^(p→q→p)) + L_seq(Y_j^q, Ŷ_j^(q→p→q)).      (12)
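The wiring of the reconstruction, style-exchange and cycle-consistency terms for a same-person pair (p = q) can be sketched as below; the encoder/decoder/loss callables are hypothetical stand-ins for the modules defined above, and the toy lambdas only demonstrate that the wiring runs.

    import torch

    def decoupling_losses(enc_c, enc_s, dec, seq_loss, Yp, Yq, lam=(1.0, 3.0, 1.0)):
        """Decoupling objective for a same-person pair: reconstruction + style exchange
        + cycle consistency, with weights as in the embodiment (1, 3, 1)."""
        Cp, sp = enc_c(Yp), enc_s(Yp)
        Cq, sq = enc_c(Yq), enc_s(Yq)
        # reconstruction: decode each sequence from its own content and style
        l_rec = seq_loss(dec(Cp, sp), Yp) + seq_loss(dec(Cq, sq), Yq)
        # style exchange: swap the two style features before decoding
        Yp_swp, Yq_swp = dec(Cp, sq), dec(Cq, sp)
        l_swp = seq_loss(Yp_swp, Yp) + seq_loss(Yq_swp, Yq)     # p = q, so inputs supervise
        # cycle consistency: re-encode the swapped outputs, swap styles back, decode again
        Cp2, sq2 = enc_c(Yp_swp), enc_s(Yp_swp)
        Cq2, sp2 = enc_c(Yq_swp), enc_s(Yq_swp)
        l_cyc = seq_loss(dec(Cp2, sp2), Yp) + seq_loss(dec(Cq2, sq2), Yq)
        lr, ls, lc = lam
        return lr * l_rec + ls * l_swp + lc * l_cyc

    # toy usage with stand-in callables (the real encoders/decoder are sketched earlier)
    enc_c = lambda Y: Y.mean(dim=-1)            # fake content encoder
    enc_s = lambda Y: Y.mean(dim=(0, 1))        # fake style feature
    dec = lambda C, s: C.unsqueeze(-1) + s      # fake decoder
    mse = lambda a, b: ((a - b) ** 2).mean()
    Y1, Y2 = torch.randn(30, 20, 3), torch.randn(30, 20, 3)
    print(decoupling_losses(enc_c, enc_s, dec, mse, Y1, Y2).item())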
(4.2) Training the speech animation network: train another deep neural network, called the speech animation network, using the speech feature sequences W_i = {w_t}_{t∈i} obtained in step (3) and the personalized style features s_k decomposed by the decoupling network in step (4.1); here W_i is synchronized with the aforementioned three-dimensional facial action sequence and has the same sequence length and frame indices. The speech animation network is composed of a speech encoder E_A and an action decoder D:
A_i = E_A(W_i),      (13)
Ỹ_i^k = D(A_i, s_k),      (14)
where A_i is the speech feature sequence obtained by encoding W_i, and Ỹ_i^k is the personalized three-dimensional facial action sequence output by combining s_k with A_i and decoding.
For the feature window w_t of the t-th frame of the speech feature sequence W_i = {w_t}_{t∈i}, the speech encoder E_A uses the whole window as the source and the middle frame of the window as the query and encodes it with a standard Transformer network, obtaining the encoded speech feature a_t of the t-th frame; repeating this for the whole sequence yields the encoded speech feature sequence A_i = {a_t}_{t∈i}; A_i is a two-dimensional tensor of shape |i|×C_a, where |i| is the sequence length and C_a is the number of feature maps.
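A minimal sketch of the window-wise Transformer encoding follows, using the sizes given in the embodiment (model dimension 64, 4 heads, 3 encoder layers, 1 decoder layer); projecting the DeepSpeech features to the model dimension and taking the query from the projected window are assumptions.

    import torch
    import torch.nn as nn

    class SpeechEncoder(nn.Module):
        """Each W-frame feature window is fed to a standard Transformer with the whole
        window as source and the middle frame as query."""
        def __init__(self, cx=29, d_model=64):
            super().__init__()
            self.proj = nn.Linear(cx, d_model)
            self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                              num_encoder_layers=3, num_decoder_layers=1,
                                              dim_feedforward=256, batch_first=True)

        def forward(self, windows):                     # windows: (B, T, W, Cx)
            b, t, w, cx = windows.shape
            src = self.proj(windows.reshape(b * t, w, cx))      # whole window as source
            query = src[:, w // 2 : w // 2 + 1]                 # middle frame as query
            out = self.transformer(src, query)                  # (B*T, 1, d_model)
            return out.reshape(b, t, -1)                        # encoded sequence A_i

    enc = SpeechEncoder()
    A = enc(torch.randn(2, 50, 16, 29))                         # 50 frames, window W = 16
    print(A.shape)                                              # torch.Size([2, 50, 64])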
The action decoder D applies three one-dimensional convolutions to the encoded speech feature sequence A_i = {a_t}_{t∈i}; before each convolution, the personalized style feature s_k obtained in step (4.1) is concatenated with each input frame feature, and the front of the sequence is padded with zero feature vectors so that the sequence length is unchanged after convolution; after each convolution a leaky rectified linear unit with a negative slope of 0.2 is applied; for the t-th frame of the sequence after the three convolutions, five fully connected layers generate the three-dimensional facial action ỹ_t of the t-th frame; the final output three-dimensional facial action sequence is Ỹ_i^k = {ỹ_t}_{t∈i}. This action decoder is identical to the action decoder of the decoupling network in step (4.1) except for its input, i.e., the decoupling network in step (4.1) and the speech animation network in this step share the same action decoder.
The training uses a standard Adam optimizer to optimize the trainable parameters of the network so as to minimize the speech animation objective function L_anime. The speech animation objective function is similar to the decoupling objective function described in step (4.1) and consists of three analogous terms: a speech animation reconstruction term, a speech animation style exchange term, and a speech animation cycle consistency term. Replacing Ŷ_i^k in formula (6) with the Ỹ_i^k generated by the speech animation network yields the speech animation reconstruction term.
The speech features W_i and W_j that are synchronized with the two sequences in formula (8) are encoded into A_i and A_j, then respectively combined with the exchanged personalized style features s_q and s_p from formula (8) and decoded,
and the speech animation style exchange term is then computed in the same way as formula (10).
A_i and A_j are respectively combined with the twice-exchanged personalized style features s_{p′} and s_{q′} from formula (11) and decoded,
and the speech animation cycle consistency term is then computed in the same way as formula (12).
The speech animation objective function L_anime is expressed as a weighted sum of the three terms:
L_anime = λ̃_rec·(speech animation reconstruction term) + λ̃_swp·(speech animation style exchange term) + λ̃_cyc·(speech animation cycle consistency term),      (19)
where λ̃_rec, λ̃_swp and λ̃_cyc are the corresponding weights. The training is performed synchronously with the training in step (4.1), i.e., L_decomp and L_anime form the joint objective function L_joint:
L_joint = L_decomp + L_anime.      (20)
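The joint optimization of formula (20) with a single Adam optimizer (learning rate 0.0001 as in the embodiment) can be sketched as below; the two loss functions are toy stand-ins for L_decomp and L_anime so that the loop runs end to end.

    import torch
    import torch.nn as nn

    decoupling_net = nn.Linear(8, 8)      # stand-in for content/style encoders + decoder
    speech_anim_net = nn.Linear(8, 8)     # stand-in for speech encoder + shared decoder

    opt = torch.optim.Adam(list(decoupling_net.parameters()) + list(speech_anim_net.parameters()),
                           lr=1e-4)

    def l_decomp(y):                      # placeholder decoupling objective
        return ((decoupling_net(y) - y) ** 2).mean()

    def l_anime(w, y):                    # placeholder speech animation objective
        return ((speech_anim_net(w) - y) ** 2).mean()

    for step in range(3):                 # toy joint training loop
        y, w = torch.randn(16, 8), torch.randn(16, 8)
        loss = l_decomp(y) + l_anime(w, y)        # L_joint = L_decomp + L_anime
        opt.zero_grad()
        loss.backward()
        opt.step()
        print(step, loss.item())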
(5) Obtaining the personalized style feature of the target person: for the target person's three-dimensional facial action sequence obtained in step (1), the decoupling network trained in step (4) is used to decompose out the personalized style feature s_0 of the target person.
(6) Generating speech-synchronized personalized three-dimensional facial animation: extract a speech feature sequence from an arbitrary speech signal using the same method as in step (3); use the speech animation network trained in step (4) to combine the extracted speech feature sequence with the personalized style feature s_0 of the target person obtained in step (5) and output a personalized three-dimensional facial action sequence; add the obtained personalized three-dimensional facial action sequence to the target person's three-dimensional face model template I_0 obtained in step (1) to obtain a personalized three-dimensional facial animation; the personalized three-dimensional facial animation is synchronized with the input speech and has the personalized style of the target person.
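The inference path of steps (5)-(6) can be sketched as follows; it reuses the SpeechEncoder and ActionDecoder sketches given earlier (themselves assumptions) and adds the decoded offsets to the template I_0.

    import torch

    def generate_animation(speech_features, style_s0, template_I0, speech_encoder, decoder):
        """Windowed speech features are encoded, combined with the target person's style
        feature s_0 and decoded to vertex offsets, which are added to the template I_0."""
        with torch.no_grad():
            A = speech_encoder(speech_features.unsqueeze(0))        # (1, T, Ca)
            offsets = decoder(A, style_s0.unsqueeze(0))             # (1, T, V*3)
            T = offsets.shape[1]
            meshes = offsets.reshape(1, T, -1, 3) + template_I0     # add template per frame
        return meshes.squeeze(0)                                    # (T, V, 3) animation

    # toy usage with the sketched SpeechEncoder / ActionDecoder modules from above
    V = 5023
    anim = generate_animation(torch.randn(50, 16, 29), torch.randn(32), torch.randn(V, 3),
                              SpeechEncoder(), ActionDecoder())
    print(anim.shape)   # torch.Size([50, 5023, 3])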
Implementation example
Training example: the inventors implemented an example of the present invention on a computer equipped with an Intel Core i7-8700K CPU (3.70 GHz) and an NVIDIA GTX 1080Ti graphics processor (11 GB of video memory). In the implementation, the target person videos in step (1) come from the Internet and from personal recordings; the auxiliary person data in step (2) come from the public database VOCASET (Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101-10111, 2019.).
Model parameters: when implementing the example of the present invention, the inventors used the following parameters in steps (1) to (4):
(1) Processing the target person's video data: the existing three-dimensional morphable face model technology used is FLAME (URL: https://flame.is.tue.mpg.de/, reference: Tianye Li, Timo Bolkart, Michael J Black, Hao Li and Javier Romero. FLAME: Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph., 36(6):194:1-194:17, 2017); the number of vertices of the model is V=5023.
(2) Obtaining auxiliary person data: the existing public speech-synchronized three-dimensional facial animation database used is VOCASET (URL: https://voca.is.tue.mpg.de/, reference: Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael Black. Capture, learning, and synthesis of 3D speaking styles. Computer Vision and Pattern Recognition (CVPR), pages 10101-10111, 2019.).
(3) Extracting speech feature sequences: the existing speech recognition technology used is DeepSpeech (URL: https://github.com/mozilla/DeepSpeech, reference: Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, Andrew Y. Ng. DeepSpeech: Scaling up end-to-end speech recognition [J]. arXiv preprint arXiv:1412.5567, 2014.); the speech feature window size is W=16 and the number of feature maps is C_x=29; the standard Transformer network used has model dimension 64, 4 attention heads, 3 encoder layers, and 1 decoder layer.
(4) Training the deep neural networks: the spiral convolutions use L=12 adjacent vertices, and the numbers of feature maps of the three spiral convolution layers are 16, 32 and 32; the predefined adjacency sets and the precomputed downsampling matrices of the spiral convolutions use existing technology (URL: https://github.com/sw-gong/spiralnet_plus, reference: Shunwang Gong, Lei Chen, Michael Bronstein, Stefanos Zafeiriou. SpiralNet++: A Fast and Highly Efficient Mesh Convolution Operator. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019); the number of feature maps of the content feature sequence is C_c=64; the number of feature maps of the personalized style feature is C_s=32; the number of feature maps of the encoded speech feature sequence is C_a=64; the kernel sizes of the three one-dimensional convolutions in the action decoder are 5, 3 and 3, and their numbers of feature maps are 64, 128 and 256; the weights in formula (5) are λ_rec=1, λ_swp=3, λ_cyc=1; the weights in formula (7) are λ_m=5, λ_l=1; the learning rate of the Adam optimizer is 0.0001.
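For convenience, the hyperparameters listed above can be gathered into one Python configuration dict; the key names are assumptions, the values are taken from the text above.

    CONFIG = {
        "num_vertices": 5023,            # FLAME template vertex count V
        "speech_window": 16,             # W, DeepSpeech feature window length
        "speech_channels": 29,           # Cx, DeepSpeech feature maps
        "transformer": {"d_model": 64, "heads": 4, "enc_layers": 3, "dec_layers": 1},
        "spiral_neighbors": 12,          # L adjacent vertices per spiral convolution
        "spiral_channels": [16, 32, 32],
        "content_channels": 64,          # Cc
        "style_channels": 32,            # Cs
        "audio_channels": 64,            # Ca
        "decoder_kernels": [5, 3, 3],
        "decoder_channels": [64, 128, 256],
        "loss_weights": {"rec": 1.0, "swp": 3.0, "cyc": 1.0, "motion": 5.0, "lip": 1.0},
        "adam_lr": 1e-4,
    }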
Animation excerpt: the inventors implemented an example of the present invention and used speech signals to drive the generation of personalized three-dimensional facial animations. Fig. 5 shows excerpted key frames of the generated results, in which five different target persons each speak the English word "climate" in their own personalized way (the key frames correspond in turn to the syllables /k/ and /m/).

Claims (5)

  1. A speech-signal-driven personalized three-dimensional facial animation generation method, characterized in that a three-dimensional facial action sequence is reconstructed from a frontal speech video of a target person, and a speech feature sequence is extracted from the speech signal of the video; a decoupling network decomposes the reconstructed three-dimensional facial action sequence into a content feature sequence and a personalized style feature, wherein the content feature sequence contains the action information necessary for pronouncing the speech content in the three-dimensional facial actions and the personalized style feature contains the style information reflecting the personality of the person in the three-dimensional facial actions; at the same time, another speech animation network combines the decomposed personalized style feature with the extracted speech feature sequence to generate a personalized three-dimensional facial animation.
  2. The speech-signal-driven personalized three-dimensional facial animation generation method according to claim 1, characterized by comprising the following steps:
    (1) Processing the target person's video data: perform three-dimensional reconstruction on every frame of the given frontal speech video of the target person using existing three-dimensional morphable face model technology, and remove the head motion to obtain the target person's three-dimensional face model template and three-dimensional facial action sequence; the model template is a two-dimensional tensor composed of a vertex dimension and a spatial dimension; the three-dimensional facial action sequence is a sequence of vertex offsets relative to the model template and is a three-dimensional tensor composed of a sequence dimension, a vertex dimension and a spatial dimension; extract the speech signal from the given video;
    (2) Obtaining auxiliary person data: obtain auxiliary person data from an existing public speech-synchronized three-dimensional facial animation database, wherein the data of each auxiliary person include a three-dimensional face model template, a three-dimensional facial action sequence and a synchronized speech signal; the speech-synchronized three-dimensional facial animation database does not contain three-dimensional data of the target person;
    (3) Extracting speech feature sequences: for the speech signals obtained in step (1) and step (2), extract speech feature sequences using existing speech recognition technology; the speech feature sequence is a three-dimensional tensor composed of a sequence dimension, a window dimension and a feature map dimension;
    (4) Training the deep neural networks: using the three-dimensional facial action sequences obtained in steps (1) and (2) and the speech feature sequences obtained in step (3), simultaneously train two deep neural networks, called the decoupling network and the speech animation network respectively; the decoupling network decomposes a three-dimensional facial action sequence into a content feature sequence and a personalized style feature; the content feature sequence is a two-dimensional tensor composed of a sequence dimension and a feature map dimension and contains the action information necessary for pronouncing the speech content in the three-dimensional facial actions; the personalized style feature is a one-dimensional tensor composed of a feature map dimension and contains the style information reflecting the personality of the person in the three-dimensional facial actions; the speech animation network combines the decomposed personalized style feature with a speech feature sequence and outputs a personalized three-dimensional facial action sequence;
    (5) Obtaining the personalized style feature of the target person: for the target person's three-dimensional facial action sequence obtained in step (1), decompose out the personalized style feature of the target person using the decoupling network trained in step (4);
    (6) Generating speech-synchronized personalized three-dimensional facial animation: extract a speech feature sequence from any input speech signal using the same method as in step (3); use the speech animation network trained in step (4) to combine the extracted speech feature sequence with the personalized style feature of the target person obtained in step (5) and output a personalized three-dimensional facial action sequence; add the obtained three-dimensional facial action sequence to the target person's three-dimensional face model template obtained in step (1) to obtain a personalized three-dimensional facial animation; the personalized three-dimensional facial animation is synchronized with the input speech and has the personalized style of the target person.
  3. The speech-signal-driven personalized three-dimensional facial animation generation method according to claim 2, characterized in that step (4) comprises the following sub-steps:
    (4.1) Train a deep neural network, the decoupling network, using the three-dimensional facial action sequences obtained in steps (1) and (2); the decoupling network is composed of a content encoder, a style encoder and an action decoder; the content encoder first applies three spiral convolutions to each frame of the three-dimensional facial action sequence; after each spiral convolution the vertices are downsampled and activated with a leaky rectified linear unit with a negative slope of 0.2; the vertex features after the three spiral convolutions are then concatenated into a one-dimensional vector and mapped to the content feature through a linear matrix; after all frames of the sequence are mapped, the content feature sequence is obtained; the content feature sequence is a two-dimensional tensor composed of a sequence dimension and a feature map dimension; the style encoder applies to each frame the same three spiral convolutions, vertex downsampling, activation and subsequent linear mapping as the content encoder, but with different parameters, mapping each frame to an intermediate style feature; after all frames are mapped to the intermediate style feature sequence, a standard long short-term memory unit processes the intermediate style feature sequence recurrently to obtain the personalized style feature; the personalized style feature is a one-dimensional vector composed of a feature map dimension. The action decoder applies three one-dimensional convolutions to the content feature sequence produced by the content encoder; before each convolution, the personalized style feature produced by the style encoder is concatenated with each input frame feature, and the front of the sequence is padded with zero feature vectors so that the sequence length is unchanged after convolution; after each convolution a leaky rectified linear unit with a negative slope of 0.2 is applied; the result is then mapped through five fully connected layers to output a personalized three-dimensional facial action sequence. The training uses a standard Adam optimizer to optimize the trainable parameters of the network so as to minimize the decoupling objective function; the decoupling objective function includes a reconstruction term, a style exchange term and a cycle consistency term; the reconstruction term uses the content encoder and the style encoder to encode the three-dimensional facial action sequences obtained in steps (1) and (2) into content feature sequences and personalized style features, and uses the original data to supervise the personalized three-dimensional facial action sequences decoded by the action decoder from those content feature sequences and personalized style features; the style exchange term uses the content encoder and the style encoder to encode the three-dimensional facial action sequences obtained in steps (1) and (2) into content feature sequences and personalized style features, then exchanges the personalized style features of any two sequences so that they are combined with content feature sequences from different sources, and the action decoder outputs the personalized three-dimensional facial action sequences after the exchange of personalized style features; the style exchange term supervises this output; the cycle consistency term encodes the personalized three-dimensional facial action sequences after the style exchange once more with the content encoder and the style encoder, exchanges the encoded personalized style features again, and the action decoder outputs the personalized three-dimensional facial action sequences after two exchanges of personalized style features; the cycle consistency term supervises this output.
    (4.2) Train another deep neural network, the speech animation network, using the speech feature sequences obtained in step (3) and the personalized style features decomposed by the decoupling network in step (4.1); this step is performed simultaneously with step (4.1). The speech animation network is composed of a speech encoder and an action decoder. For each frame's feature window in the speech feature sequence, the speech encoder uses the whole window as the source and the middle frame of the window as the query, and encodes it with a standard Transformer network; encoding all frames of the sequence yields the encoded speech feature sequence, which is a two-dimensional tensor composed of a sequence dimension and a feature map dimension. The action decoder applies three one-dimensional convolutions to the encoded speech feature sequence; before each convolution, the personalized style feature decomposed in step (4.1) is concatenated with each input frame feature, and the front of the sequence is padded with zero feature vectors so that the sequence length is unchanged after convolution; after each convolution a leaky rectified linear unit with a negative slope of 0.2 is applied; the result is then mapped through five fully connected layers to output a personalized three-dimensional facial action sequence. This action decoder is identical to the action decoder of the decoupling network in step (4.1) except for its input; that is, the decoupling network in step (4.1) and the speech animation network in this step share the same action decoder. The training uses a standard Adam optimizer to optimize the trainable parameters of the network so as to minimize the speech animation objective function; the speech animation objective function includes a speech animation reconstruction term, a speech animation style exchange term and a speech animation cycle consistency term; each of them is computed in the same way as the corresponding term in step (4.1), except that the output of the decoupling network is replaced by the corresponding output of the speech animation network.
  4. The speech-signal-driven method for generating personalized 3D facial animation according to claim 2, characterized in that the specific steps are as follows:
    (1) Processing the target person's video data: each frame of the target person's frontal speech video is reconstructed in 3D using an existing 3D morphable face model technique, and all head motion is removed, yielding the target person's 3D face model template I_0 and 3D facial motion sequence Y^0 = {y_t^0}_{t∈n}, where 0 is the index of the target person, n denotes the set of frame indices {1, 2, …, |n|} of the sequence, and y_t^0 denotes the facial motion of the t-th frame of the sequence, i.e., the vertex offsets relative to the model template; I_0 is a tensor of shape V×3 and Y^0 is a tensor of shape |n|×V×3, where |n| is the sequence length, V is the number of vertices of the 3D face model, and 3 is the dimension of 3D space; at the same time, the speech audio signal X_0 is separated from the target person's video;
    (2) Obtaining auxiliary person data: auxiliary person data are obtained from an existing public speech-synchronized 3D facial animation database; each auxiliary person's data in the database comprise a 3D face model template I_u, a 3D facial motion sequence Y^u = {y_t^u}_{t∈m}, and the synchronized speech signal X_u, where u is the index of the person to whom the data belong, m denotes the set of frame indices {1, 2, …, |m|} of the sequence, and y_t^u denotes the facial motion of the t-th frame of the sequence; I_u is a tensor of shape V×3 and Y^u is a tensor of shape |m|×V×3, where |m| is the sequence length, V is the number of vertices of the 3D face model, and 3 is the dimension of 3D space; the speech-synchronized 3D facial animation database contains no 3D data of the target person, i.e., u > 0, and the topology of its 3D face models is identical to the topology of the 3D face model used in step (1);
    (3) Extracting speech feature sequences: for the speech signals X_i obtained in steps (1) and (2), an existing speech recognition technique extracts intermediate features x_i, a tensor of shape |i|×C_x, which is then windowed to obtain the speech feature sequence W_i = {w_t}_{t∈i}, a tensor of shape |i|×W×C_x; here i ≥ 0 is the person index covering the target person and the auxiliary persons, i also denotes the set of frame indices {1, 2, …, |i|} of the sequence, w_t denotes the speech feature of the t-th frame, |i| is the sequence length and equals the length of the corresponding 3D facial motion sequence, W is the window length of each frame's feature, and C_x is the number of feature maps; the windowing operation takes, for each frame of the sequence x_i, a window consisting of that frame together with the frames before and after it, zero-padding the parts that fall outside the sequence;
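A minimal sketch of the windowing operation of step (3), assuming a centred window of odd length W and zero padding beyond the sequence ends (function and argument names are illustrative):

```python
import numpy as np

def window_features(x, win_len):
    """For each frame t, gather a window of win_len consecutive frames centred
    on t, zero-padding beyond the sequence ends.
    x: (T, Cx) intermediate speech features -> returns (T, win_len, Cx)."""
    T, Cx = x.shape
    half = win_len // 2
    padded = np.concatenate([np.zeros((half, Cx), x.dtype), x,
                             np.zeros((half, Cx), x.dtype)], axis=0)
    return np.stack([padded[t:t + win_len] for t in range(T)], axis=0)
```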
    (4) Training the deep neural networks: the 3D facial motion sequences obtained in steps (1) and (2) and the speech feature sequences obtained in step (3) are used to simultaneously train two deep neural networks, called the decoupling network and the speech-animation network, through the following sub-steps:
    (4.1) Training the decoupling network: a deep neural network, called the decoupling network, is trained with the 3D facial motion sequences Y^k = {y_t^k}_{t∈i} obtained in steps (1) and (2), where k ≥ 0 denotes the person index covering the target person and the auxiliary persons, i denotes the set of frame indices {1, 2, …, |i|} of the sequence, and y_t^k is the 3D facial motion of the t-th frame of the sequence; the decoupling network consists of a content encoder E_C, a style encoder E_S and a motion decoder D, and its operation is defined as follows:
    C_i = E_C(Y^k),  s_k = E_S(Y^k),  Ŷ^k = D(C_i, s_k),   (1)
    where C_i is the encoded content feature sequence, s_k is the encoded personalized style feature, and Ŷ^k is the personalized 3D facial motion sequence generated by combining s_k with C_i and decoding;
    The content encoder E_C first applies three spiral convolutions to the t-th frame of the 3D facial motion sequence; each spiral convolution is followed by vertex downsampling and activation with a leaky rectified linear unit with negative slope 0.2; the resulting vertex features are then concatenated into a one-dimensional vector and mapped by a trainable linear matrix to the content feature c_t of the t-th frame; mapping all frames of the 3D facial motion sequence yields the content feature sequence C_i = {c_t}_{t∈i}, a tensor of shape |i|×C_c, where |i| is the sequence length and C_c is the number of feature maps; the spiral convolution is defined on the vertex dimension of its input and takes the following form:
    v'_i = γ( concat_{j∈S(i)} v_j ),   (2)
    where v_j denotes the feature of the j-th vertex input to the spiral convolution, a vector of shape C, with C the number of features; S(i) denotes the predefined set of L adjacent vertices of the i-th vertex; concat_{j∈S(i)} v_j denotes concatenating the features of all vertices in the adjacency set of the i-th input vertex into a one-dimensional vector of shape LC; γ is a trainable linear map; and v'_i denotes the output feature of the i-th vertex of the spiral convolution; the predefined adjacency set is precomputed on the 3D face model template by taking, for the i-th vertex of the model template, the vertex itself together with the surrounding topological rings, L vertices in total; the vertex downsampling is defined on the vertex dimension and takes the following form:
    V* = M_d V,   (3)
    where V = {v'_1, …, v'_N} denotes all vertices output by the spiral convolution, the N in the subscript being the number of vertices output by the spiral convolution; M_d is the downsampling matrix, precomputed on the 3D face model template; V* is the downsampled result, whose number of vertices is V^+;
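A minimal sketch of the spiral convolution of Eq. (2) and the vertex downsampling of Eq. (3), assuming the spiral index table and the downsampling matrix are precomputed on the template as in the referenced SpiralNet++ implementation; class and argument names are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class SpiralConv(nn.Module):
    """Eq. (2): concatenate the features of the L predefined spiral neighbours of
    every vertex and apply a trainable linear map gamma."""
    def __init__(self, in_ch, out_ch, spiral_idx):
        super().__init__()
        self.spiral_idx = spiral_idx                  # (V, L) long tensor, precomputed on the template
        self.gamma = nn.Linear(in_ch * spiral_idx.shape[1], out_ch)

    def forward(self, v):                             # v: (B, V, C)
        B, V, C = v.shape
        L = self.spiral_idx.shape[1]
        gathered = v[:, self.spiral_idx.reshape(-1), :].reshape(B, V, L * C)
        return self.gamma(gathered)                   # (B, V, out_ch)

def downsample(v, Md):
    """Eq. (3): V* = Md V, with Md a precomputed (V+, N) downsampling matrix."""
    return torch.einsum('pn,bnc->bpc', Md, v)
```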
    The style encoder E_S first applies three spiral convolutions to the t-th frame of the 3D facial motion sequence; each spiral convolution is followed by vertex downsampling and activation with a leaky rectified linear unit with negative slope 0.2; the resulting vertex features are then concatenated into a one-dimensional vector and mapped by a trainable linear matrix to the intermediate style feature s̃_t of the t-th frame; after all frames of the 3D facial motion sequence have been mapped to intermediate style features, a long short-term memory (LSTM) unit processes the intermediate style feature sequence recurrently to obtain the personalized style feature s_k, a vector of shape C_s, where C_s is the number of feature maps; the spiral convolution and vertex downsampling are the same as in the content encoder but use different parameters; the LSTM unit has a state cell that stores historical information and three gates: the input gate i_t acts on the intermediate style feature s̃_t of the t-th frame and the unit's output h_{t-1} at frame t-1 and indicates whether new intermediate style information is added to the state cell, taking a value between 0 and 1: a value of 1 (gate open) adds the new information, 0 (gate closed) adds a zero vector, and a value between 0 and 1 adds the new information multiplied by the gate value; the forget gate f_t acts on the state cell and indicates whether the historical information S_{t-1} stored at frame t-1 is kept, taking a value between 0 and 1: a value of 1 (gate open) keeps the stored information, 0 (gate closed) resets it to a zero vector, and a value between 0 and 1 keeps the stored information multiplied by the gate value; the output gate o_t acts on the state cell and indicates whether the current state S_t at frame t is emitted as the output, taking a value between 0 and 1: a value of 1 (gate open) emits the current state, 0 (gate closed) emits a zero vector, and a value between 0 and 1 emits the current state multiplied by the gate value; the values of the three gates are obtained by concatenating the input s̃_t of the current frame t with the unit's output h_{t-1} at frame t-1 and projecting, as follows:
    i_t = σ(W_i [s̃_t, h_{t-1}] + b_i),
    f_t = σ(W_f [s̃_t, h_{t-1}] + b_f),
    o_t = σ(W_o [s̃_t, h_{t-1}] + b_o),
    x̃_t = W_x [s̃_t, h_{t-1}] + b_x,
    S_t = f_t ⊙ S_{t-1} + i_t ⊙ x̃_t,
    h_t = o_t ⊙ S_t,   (4)
    where s̃_t is the intermediate style feature input at the current frame t, h_{t-1} is the unit's output at frame t-1, and [s̃_t, h_{t-1}] denotes concatenating the feature maps of s̃_t and h_{t-1}; i_t is the input gate value, with W_i, b_i the input gate's weight and bias; f_t is the forget gate value, with W_f, b_f the forget gate's weight and bias; o_t is the output gate value, with W_o, b_o the output gate's weight and bias; x̃_t is the projection of the current frame's input and the previous frame's output, with W_x, b_x the projection's weight and bias; S_{t-1} and S_t are the states of the unit's state cell at frames t-1 and t; h_t is the unit's output at frame t; W_i, W_f, W_o, W_x are matrices of shape C_s×C_s, b_i, b_f, b_o, b_x are vectors of shape C_s, and W_i, W_f, W_o, W_x, b_i, b_f, b_o, b_x are all trainable parameters;
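For illustration, the recurrent aggregation in the style encoder can be sketched with a standard LSTM as below; taking the last hidden state as s_k, and relying on PyTorch's built-in gate formulation (which adds a tanh on the candidate and on the cell output), are assumptions of this sketch rather than details stated in the claim:

```python
import torch
import torch.nn as nn

class StyleAggregator(nn.Module):
    """Recurrent part of the style encoder E_S: a standard LSTM (gates as in
    Eq. (4)) runs over the per-frame intermediate style features, and its final
    output is taken here as the personalized style feature s_k."""
    def __init__(self, style_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(style_dim, style_dim, batch_first=True)

    def forward(self, s_tilde):                   # (B, T, Cs) intermediate style features
        out, (h_n, c_n) = self.lstm(s_tilde)
        return h_n[-1]                            # (B, Cs) personalized style feature
```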
    The motion decoder D applies three one-dimensional convolutions to the content feature sequence C_i = {c_t}_{t∈i} obtained above; before each convolution, the personalized style feature s_k obtained above is concatenated to every input frame feature, and the front of the sequence is padded with zero feature vectors so that the sequence length is unchanged after convolution; each convolution is followed by activation with a leaky rectified linear unit with negative slope 0.2; the t-th frame of the sequence after the three convolutions is then mapped through five fully connected layers to generate the 3D facial motion ŷ_t^k of the t-th frame; the final output 3D facial motion sequence is Ŷ^k = {ŷ_t^k}_{t∈i};
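A minimal PyTorch sketch of the shared motion decoder D; the kernel sizes and channel counts follow the embodiment above, while the widths of the five fully connected layers, the intermediate activations of the FC stack, and the default vertex count are assumptions of this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionDecoder(nn.Module):
    """Shared motion decoder D: three causal 1D convolutions with the style
    feature concatenated to every frame, then five fully connected layers."""
    def __init__(self, in_dim=64, style_dim=32, n_verts=5023):
        super().__init__()
        chans, kernels = [64, 128, 256], [5, 3, 3]
        convs, prev = [], in_dim
        for c, k in zip(chans, kernels):
            convs.append(nn.Conv1d(prev + style_dim, c, kernel_size=k))
            prev = c
        self.convs, self.kernels = nn.ModuleList(convs), kernels
        # Five fully connected layers mapping to per-frame vertex offsets (V x 3).
        dims = [prev, 256, 256, 512, 1024, n_verts * 3]
        self.fcs = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))
        self.n_verts = n_verts

    def forward(self, feat_seq, style):
        # feat_seq: (B, T, C_in) content or encoded speech features; style: (B, C_s)
        x = feat_seq.transpose(1, 2)                       # (B, C_in, T)
        for conv, k in zip(self.convs, self.kernels):
            s = style.unsqueeze(-1).expand(-1, -1, x.shape[-1])
            x = torch.cat([x, s], dim=1)                   # concat style to every frame
            x = F.pad(x, (k - 1, 0))                       # zero-pad at the front only
            x = F.leaky_relu(conv(x), 0.2)
        x = x.transpose(1, 2)                              # (B, T, 256)
        for i, fc in enumerate(self.fcs):
            x = fc(x) if i == len(self.fcs) - 1 else F.leaky_relu(fc(x), 0.2)
        return x.view(x.shape[0], x.shape[1], self.n_verts, 3)
```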
    The training uses the standard Adam optimizer to optimize the trainable parameters of the network so as to minimize the decoupling objective function L_decomp, which comprises a reconstruction term L_rec, a style-swap term L_swp and a cycle-consistency term L_cyc:
    L_decomp = λ_rec·L_rec + λ_swp·L_swp + λ_cyc·L_cyc,   (5)
    where λ_rec, λ_swp and λ_cyc are the corresponding weights;
    The reconstruction term is defined as follows:
    L_rec = L_seq(Ŷ^k, Y^k),   (6)
    where L_seq is the supervision loss function defined on 3D facial motion sequences:
    L_seq(Ŷ_i, Y_i) = Σ_{t∈i} ||ŷ_t − y_t|| + λ_m Σ_{t∈i} ||(ŷ_t − ŷ_{t-1}) − (y_t − y_{t-1})|| + λ_l Σ_{t∈i} ||LipH(ŷ_t) − LipH(y_t)||,   (7)
    where the notation omits the person index; y_t is the t-th frame of the supervision data sequence Y_i and ŷ_t is the t-th frame of the generated motion sequence Ŷ_i; the first term computes the l2 distance between the t-th generated frame and the t-th supervision frame, supervising the accuracy of the generated motion; the second term computes the l2 distance between the change from frame t-1 to frame t of the generated motion and the corresponding change of the supervision data, supervising the smoothness of the generated motion; the third term computes the l2 distance between the lip-opening height of the t-th generated frame and that of the t-th supervision frame, supervising the accuracy of the lip motion, where LipH(·) computes the average height difference along the y-axis of pre-selected lip vertices to approximate the lip-opening height; λ_m and λ_l are the corresponding weights;
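A minimal sketch of the supervision loss L_seq of Eq. (7); the use of squared distances with mean reduction and the choice of lip-vertex index lists are assumptions of this sketch:

```python
import torch

def lip_height(y, upper_idx, lower_idx):
    """Approximate lip opening: mean y-axis gap between pre-selected upper- and
    lower-lip vertices. y: (B, T, V, 3); the index lists are assumptions."""
    return y[..., upper_idx, 1].mean(-1) - y[..., lower_idx, 1].mean(-1)

def seq_loss(y_hat, y, upper_idx, lower_idx, lam_m=5.0, lam_l=1.0):
    """Eq. (7) as a sketch: position accuracy, frame-to-frame motion smoothness,
    and lip-opening accuracy."""
    pos = ((y_hat - y) ** 2).sum(-1).mean()
    vel_hat = y_hat[:, 1:] - y_hat[:, :-1]
    vel = y[:, 1:] - y[:, :-1]
    mot = ((vel_hat - vel) ** 2).sum(-1).mean()
    lip = ((lip_height(y_hat, upper_idx, lower_idx)
            - lip_height(y, upper_idx, lower_idx)) ** 2).mean()
    return pos + lam_m * mot + lam_l * lip
```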
    The style-swap term is computed on a pair of 3D facial motion sequences Y^p = {y_t^p}_{t∈i} and Y^q = {y_t^q}_{t∈j}, where p ≥ 0 and q ≥ 0 denote person indices covering the target person and the auxiliary persons, and i, j denote the frame index sets of the corresponding sequences; the two sequences are encoded with the content encoder and the style encoder respectively:
    C_i = E_C(Y^p), s_p = E_S(Y^p), C_j = E_C(Y^q), s_q = E_S(Y^q),   (8)
    the personalized style features s_p and s_q obtained from the two sequences are then exchanged, each combined with the content feature sequence of the other sequence, and decoded to generate the 3D facial motion sequences with swapped personalized style features:
    Ŷ_i^q = D(C_i, s_q), Ŷ_j^p = D(C_j, s_p),   (9)
    For the 3D facial motion sequences Ŷ_i^q and Ŷ_j^p with swapped personalized style features, the style-swap term L_swp is computed by distinguishing two cases:
    L_swp = L_seq(Ŷ_i^q, Y^p) + L_seq(Ŷ_j^p, Y^q),   if p = q;
    L_swp = L_seq(Ŷ_i^q, Ỹ_i^q) + L_seq(Ŷ_j^p, Ỹ_j^p),   if p ≠ q,   (10)
    in the first case, p = q, the two 3D facial motion sequences come from the same person, so the input sequences themselves are used as supervision data to compute the loss function; in the second case, p ≠ q, the two 3D facial motion sequences come from different persons, and only some sequence pairs satisfy the requirement for computing the loss: the spoken content of person p in Y^p must also have been spoken by person q, i.e., there exists a sequence of person q whose spoken content is the same as that of Y^p; however, its length may differ from that of sequence i, so it is aligned to sequence i with the standard dynamic time warping algorithm, and the aligned sequence, denoted Ỹ_i^q, is used to supervise Ŷ_i^q; similarly, the aligned sequence Ỹ_j^p is used to supervise Ŷ_j^p; in this second case, the term is computed only when the requirement is satisfied;
    The cycle-consistency term encodes the style-swapped 3D facial motion sequences generated above once more with the content encoder and the style encoder, exchanges the resulting personalized style features s_q′ and s_p′ again, combines each with the content feature sequence of the other sequence, and generates the 3D facial motion sequences after two style swaps:
    C_i′ = E_C(Ŷ_i^q), s_q′ = E_S(Ŷ_i^q), C_j′ = E_C(Ŷ_j^p), s_p′ = E_S(Ŷ_j^p),   (11)
    after the two swaps, each personalized style feature is again combined with its originally matching content feature sequence, so the output should recover the original input sequence; the cycle-consistency term L_cyc therefore uses the original input sequences for supervision:
    L_cyc = L_seq(D(C_i′, s_p′), Y^p) + L_seq(D(C_j′, s_q′), Y^q).   (12)
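The style-swap and cycle-consistency construction of Eqs. (8)-(12) can be sketched as follows; E_C, E_S and D are assumed callables, and the DTW alignment needed for the cross-person supervision of Eq. (10) is omitted from this sketch:

```python
def swap_and_cycle(Ec, Es, D, Yp, Yq):
    """Encode two motion sequences, exchange their personalized style features,
    decode (Eqs. (8)-(9)), then re-encode the swapped outputs and swap back
    (Eq. (11)); the cycled outputs should recover the inputs (Eq. (12))."""
    Cp, sp = Ec(Yp), Es(Yp)          # Eq. (8): content + style of sequence p
    Cq, sq = Ec(Yq), Es(Yq)
    Y_p_with_sq = D(Cp, sq)          # Eq. (9): first style swap
    Y_q_with_sp = D(Cq, sp)
    # Re-encode the swapped outputs and exchange the styles again (Eq. (11)).
    Cp2, sq2 = Ec(Y_p_with_sq), Es(Y_p_with_sq)
    Cq2, sp2 = Ec(Y_q_with_sp), Es(Y_q_with_sp)
    Y_p_cycled = D(Cp2, sp2)         # should reproduce Yp
    Y_q_cycled = D(Cq2, sq2)         # should reproduce Yq
    return (Y_p_with_sq, Y_q_with_sp), (Y_p_cycled, Y_q_cycled)
```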
    (4.2) Training the speech-animation network: another deep neural network, called the speech-animation network, is trained with the speech feature sequences W_i = {w_t}_{t∈i} obtained in step (3) and the personalized style features s_k decomposed by the decoupling network in step (4.1), where W_i is synchronized with the aforementioned 3D facial motion sequence and shares its sequence length and frame numbering; the speech-animation network consists of a speech encoder E_A and a motion decoder D:
    A_i = E_A(W_i),  Ŷ_a^k = D(A_i, s_k),   (13)
    where A_i is the speech feature sequence obtained by encoding W_i, and Ŷ_a^k is the personalized 3D facial motion sequence output by combining s_k with A_i and decoding;
    The speech encoder E_A encodes the t-th frame's feature window w_t of the speech feature sequence W_i = {w_t}_{t∈i} with a standard transformer network, using the whole window as the source and the middle frame of the window as the query, obtaining the encoded speech feature a_t of the t-th frame; repeating this for the whole sequence yields the encoded speech feature sequence A_i = {a_t}_{t∈i}, a 2D tensor of shape |i|×C_a, where |i| is the sequence length and C_a is the number of feature maps;
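A minimal sketch of the speech encoder E_A, in which a single multi-head attention layer stands in for the "standard transformer" of the claim, with the centre frame of each window as the query; depth, head count and the feed-forward block are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class SpeechWindowEncoder(nn.Module):
    """For every per-frame window, the whole window serves as source (key/value)
    and the centre frame as query of an attention block, producing one encoded
    speech feature a_t per frame."""
    def __init__(self, in_dim, out_dim=64, heads=4):
        super().__init__()
        self.proj_in = nn.Linear(in_dim, out_dim)
        self.attn = nn.MultiheadAttention(out_dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(out_dim, out_dim), nn.ReLU(),
                                nn.Linear(out_dim, out_dim))

    def forward(self, windows):                             # (B, T, W, Cx)
        B, T, W, Cx = windows.shape
        src = self.proj_in(windows.reshape(B * T, W, Cx))   # (B*T, W, Ca)
        query = src[:, W // 2 : W // 2 + 1]                 # centre frame as query
        a, _ = self.attn(query, src, src)                   # (B*T, 1, Ca)
        a = self.ff(a.squeeze(1))
        return a.reshape(B, T, -1)                          # encoded speech features A_i
```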
    The motion decoder D applies three one-dimensional convolutions to the encoded speech feature sequence A_i = {a_t}_{t∈i}; before each convolution, the personalized style feature s_k obtained in step (4.1) is concatenated to every input frame feature, and the front of the sequence is padded with zero feature vectors so that the sequence length is unchanged after convolution; each convolution is followed by activation with a leaky rectified linear unit with negative slope 0.2; the t-th frame of the sequence after the three convolutions is then mapped through five fully connected layers to generate the 3D facial motion of the t-th frame, and the final output is the 3D facial motion sequence Ŷ_a^k; this motion decoder is identical to the motion decoder of the decoupling network in step (4.1) except for its input, i.e., the decoupling network of step (4.1) and the speech-animation network of this step share the same motion decoder;
    The training uses the standard Adam optimizer to optimize the trainable parameters of the network so as to minimize the speech-animation objective function L_anime; the speech-animation objective function is similar to the decoupling objective function of step (4.1) and consists of three analogous terms: a speech-animation reconstruction term L_rec^a, a speech-animation style-swap term L_swp^a and a speech-animation cycle-consistency term L_cyc^a; replacing Ŷ^k in Eq. (6) with the output Ŷ_a^k generated by the speech-animation network gives the speech-animation reconstruction term L_rec^a;
    The speech features W_i and W_j synchronized respectively with the two sequences of Eq. (8) are encoded into A_i and A_j, then combined respectively with the exchanged personalized style features s_q and s_p from Eq. (8) and decoded into D(A_i, s_q) and D(A_j, s_p);
    the speech-animation style-swap term L_swp^a is then computed with the same method as Eq. (10);
    A_i and A_j are combined respectively with the twice-swapped personalized style features s_p′ and s_q′ from Eq. (11) and decoded into D(A_i, s_p′) and D(A_j, s_q′);
    the speech-animation cycle-consistency term L_cyc^a is then computed with the same method as Eq. (12);
    The speech-animation objective function L_anime is expressed as the weighted sum of the three terms:
    L_anime = λ_rec^a·L_rec^a + λ_swp^a·L_swp^a + λ_cyc^a·L_cyc^a,   (19)
    where λ_rec^a, λ_swp^a and λ_cyc^a are the corresponding weights; this training process proceeds synchronously with the training process of step (4.1), i.e., L_decomp and L_anime form the joint objective function L_joint:
    L_joint = L_decomp + L_anime.   (20)
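A minimal sketch of one joint optimization step for Eq. (20); the loss callables and batch layout are assumptions of this sketch:

```python
import torch

def joint_training_step(optimizer, decomp_loss_fn, anime_loss_fn, batch):
    """Minimize L_joint = L_decomp + L_anime with a single Adam step
    (learning rate 1e-4 in the embodiment)."""
    optimizer.zero_grad()
    L_decomp = decomp_loss_fn(batch)   # Eq. (5): reconstruction + swap + cycle terms
    L_anime = anime_loss_fn(batch)     # Eq. (19): the speech-animation counterparts
    L_joint = L_decomp + L_anime       # Eq. (20)
    L_joint.backward()
    optimizer.step()
    return L_joint.item()
```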
    (5) Obtaining the target person's personalized style feature: the decoupling network trained in step (4) is used to decompose the target person's 3D facial motion sequence Y^0 obtained in step (1) and extract the target person's personalized style feature s_0;
    (6) Generating speech-synchronized personalized 3D facial animation: for an arbitrary speech signal, a speech feature sequence is extracted with the same method as in step (3); the speech-animation network trained in step (4) combines the extracted speech feature sequence with the target person's personalized style feature s_0 obtained in step (5) and outputs a personalized 3D facial motion sequence; adding the resulting personalized 3D facial motion sequence to the target person's 3D face model template I_0 obtained in step (1) yields the personalized 3D facial animation; the personalized 3D facial animation stays synchronized with the input speech and carries the target person's personalized style.
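For illustration, steps (5)-(6) at inference time can be sketched as follows; all callables and tensor layouts are assumptions of this sketch:

```python
import torch

@torch.no_grad()
def generate_animation(Es, Ea, D, Y0, W_new, I0):
    """Extract the target person's personalized style s_0 from the reconstructed
    motion sequence Y0 once (step (5)), then drive the shared decoder with the
    speech features of any new utterance and add the template (step (6))."""
    s0 = Es(Y0)                 # personalized style feature of the target person
    A = Ea(W_new)               # encode the new speech feature sequence
    offsets = D(A, s0)          # (T, V, 3) personalized motion sequence
    return offsets + I0         # animated meshes: template plus vertex offsets
```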
  5. Application of the speech-signal-driven method for generating personalized 3D facial animation according to any one of claims 1 to 4 in VR virtual social interaction, virtual voice assistants, or games.
PCT/CN2023/075515 2022-12-16 2023-02-10 一种语音信号驱动的个性化三维人脸动画生成方法及其应用 WO2024124680A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211621760.5A CN116385606A (zh) 2022-12-16 2022-12-16 一种语音信号驱动的个性化三维人脸动画生成方法及其应用
CN202211621760.5 2022-12-16

Publications (1)

Publication Number Publication Date
WO2024124680A1 true WO2024124680A1 (zh) 2024-06-20

Family

ID=86977431

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/075515 WO2024124680A1 (zh) 2022-12-16 2023-02-10 一种语音信号驱动的个性化三维人脸动画生成方法及其应用

Country Status (2)

Country Link
CN (1) CN116385606A (zh)
WO (1) WO2024124680A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115312B (zh) * 2023-10-17 2023-12-19 天度(厦门)科技股份有限公司 一种语音驱动面部动画方法、装置、设备及介质


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111724458A (zh) * 2020-05-09 2020-09-29 天津大学 一种语音驱动的三维人脸动画生成方法及网络结构
US11263796B1 (en) * 2020-11-11 2022-03-01 Sony Interactive Entertainment Inc. Binocular pose prediction
CN112581569A (zh) * 2020-12-11 2021-03-30 中国科学院软件研究所 自适应情感表达的说话人面部动画生成方法及电子装置
CN115330911A (zh) * 2022-08-09 2022-11-11 北京通用人工智能研究院 一种利用音频驱动拟态表情的方法与系统

Also Published As

Publication number Publication date
CN116385606A (zh) 2023-07-04

Similar Documents

Publication Publication Date Title
Lu et al. Live speech portraits: real-time photorealistic talking-head animation
Zhang et al. Facial: Synthesizing dynamic talking face with implicit attribute learning
Brand Voice puppetry
Cao et al. Expressive speech-driven facial animation
Chuang et al. Mood swings: expressive speech animation
Hong et al. Real-time speech-driven face animation with expressions using neural networks
CN110880315A (zh) 一种基于音素后验概率的个性化语音和视频生成系统
KR102509666B1 (ko) 텍스트 및 오디오 기반 실시간 얼굴 재연
US11354841B2 (en) Speech-driven facial animation generation method
Ma et al. Styletalk: One-shot talking head generation with controllable speaking styles
Tian et al. Audio2face: Generating speech/face animation from single audio with attention-based bidirectional lstm networks
CN112581569B (zh) 自适应情感表达的说话人面部动画生成方法及电子装置
Zhang et al. Text2video: Text-driven talking-head video synthesis with personalized phoneme-pose dictionary
CN111243065B (zh) 一种语音信号驱动的脸部动画生成方法
Taylor et al. Audio-to-visual speech conversion using deep neural networks
WO2024124680A1 (zh) 一种语音信号驱动的个性化三维人脸动画生成方法及其应用
Yu et al. Mining audio, text and visual information for talking face generation
Wang et al. 3d-talkemo: Learning to synthesize 3d emotional talking head
CN113838174A (zh) 一种音频驱动人脸动画生成方法、装置、设备与介质
Deena et al. Visual speech synthesis using a variable-order switching shared Gaussian process dynamical model
Liu et al. Real-time speech-driven animation of expressive talking faces
Lan et al. Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar
Filntisis et al. Photorealistic adaptation and interpolation of facial expressions using HMMS and AAMS for audio-visual speech synthesis
Kim et al. 3D Lip‐Synch Generation with Data‐Faithful Machine Learning
Chuang Analysis, synthesis, and retargeting of facial expressions