CN111724458B - Voice-driven three-dimensional face animation generation method and network structure - Google Patents
Voice-driven three-dimensional face animation generation method and network structure
- Publication number
- CN111724458B (Application No. CN202010387250.0A)
- Authority
- CN
- China
- Prior art keywords
- voice
- constraint
- driven
- intermediate variable
- encoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/20—Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
Abstract
The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional face animation generation method comprising the following steps: 1) extracting voice features, and embedding the identity information of the voice into a feature matrix; 2) mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable; 3) mapping the intermediate variable to the high-dimensional space of 3D vertex displacements by using a decoder, and applying a 3D graph geometry constraint to the intermediate variable to obtain the displacement in 3D space; 4) driving the template with the acquired 3D displacement to produce the facial animation. Compared with the prior art, the invention innovatively uses the characteristics of the 3D geometry to constrain the intermediate variable; by introducing a nonlinear geometric representation and two constraints from different perspectives, the generated 3D facial expressions become more vivid and intuitive. In addition, the invention also provides a voice-driven three-dimensional face animation generation network structure.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional face animation generation method and a network structure.
Background
Speech contains rich information; by simulating facial expressions and motions from speech, animations that carry an individual speaking style and match a personal identity can be produced. Creating 3D facial animation that conforms to the speech has wide-ranging applications in movies, games, augmented reality and virtual reality. It is therefore very important to understand the correlation between speech and facial deformation.
Voice-driven 3D facial animation can be classified as speaker-dependent or speaker-independent, according to whether it generalizes across characters. Speaker-dependent animation mainly uses large amounts of data to learn a particular subject and can only animate that fixed individual. Current speaker-dependent methods generally rely on high-quality motion capture data, generate video from the voice and footage of a fixed speaker, or generate facial animation in real time with an end-to-end network; being tied to a specific subject, these methods are inconvenient and cannot be applied broadly. More research is therefore directed at speaker-independent animation, where the prior art mainly relies on neural networks for effective feature learning. Examples include a nonlinear mapping from phoneme labels to mouth motion (Taylor et al.: A deep learning approach for generalized speech animation. ACM Trans. Graph. 36, 93:1-93:11 (2017)); estimating the rotation and activation parameters of 3D blendshapes with long short-term memory networks (Pham et al.: Speech-driven 3D facial animation with implicit emotional awareness: a deep learning approach. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2328-2336 (2017)); further learning acoustic feature representations with an end-to-end network (Pham et al.: End-to-end learning for 3D facial animation from speech. In: ICMI '18 (2018)); animating cartoon characters with a three-stage network (Zhou et al.: VisemeNet: audio-driven animator-centric speech animation. ACM Trans. Graph. 37, 161:1-161:10 (2018)); and a generic voice-driven 3D face framework that works across a range of identities, trained on a multi-subject 4D face dataset (Cudeiro et al.: Capture, learning, and synthesis of 3D speaking styles. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)). However, none of these approaches considers the impact of the geometric representation on voice-driven 3D facial animation.
In view of this, it is necessary to propose a new voice-driven three-dimensional face animation generation method.
Disclosure of Invention
The aim of the invention is to overcome the defects of the prior art: the voice-driven three-dimensional facial animation generation method realizes a speaker-independent voice-driven facial animation network guided by 3D geometry, and by introducing a nonlinear geometric representation and two constraints from different perspectives, the generated 3D facial expressions become more vivid and intuitive.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) Extracting voice characteristics, and embedding the identity information of the voice into a characteristic matrix;
2) Mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) Mapping the intermediate variable to the high-dimensional space of 3D vertex displacement by using a decoder, and applying a 3D graph geometry constraint to the intermediate variable to obtain the displacement in 3D space;
4) Driving the template to produce the facial animation according to the acquired displacement in 3D space.
As an improvement of the voice-driven three-dimensional face animation generation method, the DeepSpeech engine is adopted to extract the voice features in step 1).
As an improvement to the voice-driven three-dimensional face animation generation method, the encoder comprises four convolution layers, and the i-th convolution layer receives the feature maps of all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps generated in layers 0 to i-1, and H_i denotes a composite function with a 3×1 filter, a convolution of stride 2×1 and a ReLU linear activation unit. The main purpose of the encoder in the present invention is to map the speech features to a latent representation, i.e. the intermediate vector. The encoder uses 4 convolution layers; unlike a general convolution stack, a more densely connected model is used here, so that deep features and shallow features can be effectively combined.
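The following is a minimal sketch, in TensorFlow/Keras, of such a densely connected 1-D convolutional encoder. The per-layer channel counts and the 16×29 input window (DeepSpeech features without the identity embedding) are illustrative assumptions rather than the patented implementation, and the temporal pooling used to keep the concatenation shapes compatible is likewise an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_speech_encoder(win=16, feat=29, latent_dim=64, channels=(32, 64, 128, 256)):
    """Densely connected 1-D conv encoder: layer i receives the concatenation of
    all previous layers, x_i = H_i([x_0, ..., x_{i-1}]).  Channel counts are assumptions."""
    inp = layers.Input(shape=(win, feat))                  # e.g. 16 frames x 29 DeepSpeech features
    feats = [inp]                                          # x_0
    for ch in channels:
        x = layers.Concatenate()(feats) if len(feats) > 1 else feats[0]
        x = layers.Conv1D(ch, kernel_size=3, strides=2,    # 3x1 filter, stride 2x1, ReLU
                          padding="same", activation="relu")(x)
        # assumption: pool the earlier maps temporally so their shapes stay concatenable;
        # the channel-reducing pooling described in the text is not reproduced here
        feats = [layers.AveragePooling1D(2, padding="same")(f) for f in feats] + [x]
    z = layers.Dense(latent_dim)(layers.Flatten()(layers.Concatenate()(feats)))
    return tf.keras.Model(inp, z, name="dense_speech_encoder")  # z is the intermediate variable r
```

Because every layer sees the concatenation of all earlier feature maps, shallow audio features remain available to the deeper layers, which is the point of the dense connectivity described above.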
As an improvement to the voice-driven three-dimensional facial animation generation method, a pooling layer is added after each convolution layer to reduce the number of feature maps. In general the number of feature maps doubles after each convolution layer; to keep the concatenation process smooth, a pooling layer is added after each convolution layer to reduce the number of feature maps, so that the output of each convolution layer can be effectively reused and the encoder learns richer features.
As an improvement to the voice-driven three-dimensional face animation generation method of the present invention, the decoder comprises two fully connected layers with tanh activation functions and a final output layer, with an attention mechanism arranged between the two fully connected layers. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where σ denotes the ReLU function, δ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block. The output of the attention layer is:

x̃_i = a_i ⊗ x_i;

where ⊗ denotes element-wise multiplication. The attention module adaptively selects the important features of the current input sample and generates different attention responses for different inputs. The final output layer is a fully connected layer with linear activation that produces an N×3 output corresponding to the three-dimensional displacement vectors of N vertices. The attention mechanism allows the network to focus its learning on the important information.
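A minimal sketch of this attention block follows. The bottleneck reduction ratio and the surrounding decoder layer widths are assumptions (they are not given in the text), and the activations follow the formula exactly as written above, with the outer σ as ReLU and the inner δ as sigmoid.

```python
import tensorflow as tf
from tensorflow.keras import layers

class DecoderAttention(layers.Layer):
    """a_i = sigma(W_2 delta(W_1 x_i)), output = a_i (element-wise) x_i.
    As written in the text, sigma is ReLU (outer) and delta is sigmoid (inner)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.w1 = layers.Dense(channels // reduction, activation="sigmoid")  # delta(W_1 x_i)
        self.w2 = layers.Dense(channels, activation="relu")                  # sigma(W_2 ...)

    def call(self, x):
        a = self.w2(self.w1(x))   # attention values a_i
        return a * x              # re-weight the input features element-wise

# illustrative decoder skeleton: FC(tanh) -> attention -> FC(tanh) -> linear output of N x 3
decoder = tf.keras.Sequential([
    layers.Dense(128, activation="tanh"),
    DecoderAttention(128),
    layers.Dense(50, activation="tanh"),
    layers.Dense(5023 * 3, activation=None),   # vertex displacements, N = 5023 in the embodiment
])
```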
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the weights of the output layer are initialized by 50 PCA (principal component analysis) components computed from the vertex displacements of the training data, and the biases are initialized to 0. This setting improves the stability of network training.
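A sketch of this initialization is given below, assuming the training vertex displacements are available as a NumPy array of shape (num_frames, num_vertices*3) and that the final decoder layer maps a 50-dimensional input to the N×3 output; the array and layer names are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_output_init(train_displacements, n_components=50):
    """Weights = 50 PCA components of the training vertex displacements, biases = 0."""
    pca = PCA(n_components=n_components)
    pca.fit(train_displacements)                         # (num_frames, num_vertices * 3)
    kernel = pca.components_.astype(np.float32)          # (50, num_vertices * 3)
    bias = np.zeros(train_displacements.shape[1], dtype=np.float32)
    return kernel, bias

# usage with a Keras Dense output layer mapping 50 -> num_vertices * 3:
# out_layer.build((None, 50)); out_layer.set_weights([kernel, bias])
```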
As an improvement to the speech-driven three-dimensional face animation generation method of the present invention, the 3D graph geometry constraint on the intermediate variable is applied as follows: a mesh corresponding to the frames of each audio clip is provided, an auto-encoder is used to automatically obtain the corresponding geometric representation, and this geometric representation is used to constrain the intermediate variable. The constraints include a Huber constraint and a Hilbert-Schmidt independence criterion constraint, where the Huber constraint is expressed as: assume there are two vectors r and r̂; then

L_Huber(r, r̂) = (1/2)·||r − r̂||²  if ||r − r̂|| ≤ δ, and
L_Huber(r, r̂) = δ·||r − r̂|| − (1/2)·δ²  otherwise;

where δ is the threshold of the Huber function. In the present invention, the mesh encoder encodes the input face mesh into the geometric representation r̂, and its decoder decodes it back into a 3D mesh; the present invention uses a multi-column multi-scale graph convolution network (MGCN) to extract the geometric representation of each training mesh. By providing the mesh corresponding to the frames in each audio clip and using this auto-encoder, the corresponding geometric representation can be obtained automatically and is used to effectively constrain the intermediate variable, so that the encoder output is closely related to the 3D geometric representation.
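A short NumPy sketch of the Huber constraint between the speech-side intermediate variable r and the geometric code r̂ follows; the threshold value δ = 1.0 is an assumption, since the text does not specify it.

```python
import numpy as np

def huber_constraint(r, r_hat, delta=1.0):
    """Huber loss between the intermediate variable r and the geometric code r_hat:
    quadratic for small residuals, linear for large ones (delta is assumed)."""
    diff = np.linalg.norm(r - r_hat)
    if diff <= delta:
        return 0.5 * diff ** 2
    return delta * diff - 0.5 * delta ** 2
```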
As an improvement to the voice-driven three-dimensional face animation generation method, the Hilbert-Schmidt independence criterion (HSIC) constraint is used to measure nonlinear and higher-order correlations, and can estimate the dependence between representations without explicitly estimating the joint distribution of the random variables. Assume there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size. Define a mapping φ(r) that maps the intermediate variable r to a kernel space F, with the inner product given by the kernel function k_1(r_i, r_j) = <φ(r_i), φ(r_j)>, and likewise a mapping ψ(r̂) to a kernel space G with kernel k_2(r̂_i, r̂_j) = <ψ(r̂_i), ψ(r̂_j)>. The Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(R, R̂) = ||C_{RR̂}||²_HS;

where k_1 and k_2 are the kernel functions, F and G are the Hilbert spaces of R and R̂, and C_{RR̂} is the cross-covariance operator taken over the joint distribution of R and R̂. Letting the samples (r_i, r̂_i) be drawn from this joint distribution, the empirical derivation of HSIC is:

HSIC(R, R̂) = (M − 1)⁻² tr(K_1 H K_2 H);

where tr denotes the trace of a square matrix, K_1 and K_2 are the kernel matrices with entries k_{1,ij} = k_1(r_i, r_j) and k_{2,ij} = k_2(r̂_i, r̂_j), and H = I − (1/M)·11ᵀ centers the data to mean 0 in the feature space.
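A minimal NumPy sketch of the empirical HSIC between a batch of intermediate variables R and geometric codes R̂ is given below; a Gaussian (RBF) kernel is assumed here, since the kernel choice is not specified in the text.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    """Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1, keepdims=True)
    return np.exp(-(sq + sq.T - 2.0 * X @ X.T) / (2.0 * sigma ** 2))

def hsic(R, R_hat, sigma=1.0):
    """Empirical HSIC(R, R_hat) = (M-1)^(-2) tr(K1 H K2 H), H = I - (1/M) 1 1^T."""
    M = R.shape[0]                                  # batch size
    K1, K2 = rbf_kernel(R, sigma), rbf_kernel(R_hat, sigma)
    H = np.eye(M) - np.ones((M, M)) / M             # centering matrix (zero mean in feature space)
    return np.trace(K1 @ H @ K2 @ H) / (M - 1) ** 2
```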
As an improvement of the voice-driven three-dimensional face animation generation method of the present invention, the loss function of steps 1) to 4) comprises a reconstruction loss, a constraint loss and a velocity loss, expressed as:

L = L_r + λ_1·L_c + λ_2·L_v;

where λ_1 and λ_2 are positive numbers that balance the loss terms, with λ_1 set to 0.1 and λ_2 set to 10.0. L_r is the reconstruction loss, which measures the distance between the ground-truth and predicted values. The constraint loss L_c constrains the intermediate variable, using the Huber or Hilbert-Schmidt independence criterion, against the 3D graph geometric representation obtained from the mesh. The velocity loss L_v penalizes the difference between the displacements of consecutive frames so as to guarantee temporal stability.
the invention also provides a voice-driven three-dimensional facial animation generation network structure which comprises an encoder, a decoder and a constraint of3D graph geometry for intermediate variables.
The beneficial effects of the invention are as follows. Compared with traditional reconstruction methods, the method innovatively uses the characteristics of the 3D geometry to constrain the intermediate variable. In the encoder stage, densely connected convolution layers are designed to enhance feature propagation and the reuse of audio features; in the decoder stage, an attention mechanism enables the network to adaptively adjust the key regions; for the intermediate variable, a geometry-guided training strategy with two constraints from different perspectives is provided to achieve a more powerful animation effect. In addition, the three-dimensional face animation generation network has high precision, the generated animation is more accurate and reasonable, and it generalizes well.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a workflow diagram of the present invention;
FIG. 2 is a diagram of a network architecture model according to the present invention;
FIG. 3 is a schematic comparison, according to an embodiment of the present invention, of reconstruction results obtained on the VOCASET dataset with other methods, showing, from top to bottom, the result reconstructed by the method of Cudeiro et al., the error visualization of the method of Cudeiro et al., and the error visualization of the present invention.
Detailed Description
Certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will appreciate that a hardware manufacturer may refer to the same component by different names. The description and claims do not distinguish components by differences in name but by differences in function. As used throughout the specification and claims, the word "comprise" is an open-ended term and should be interpreted to mean "include, but not limited to". "Substantially" means that, within an acceptable error range, a person skilled in the art can solve the technical problem and substantially achieve the technical effect.
In the description of the present invention, it should be understood that the directions or positional relationships indicated by the terms "upper", "lower", "front", "rear", "left", "right", "horizontal", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The present invention will be described in further detail below with reference to the drawings, but is not limited thereto.
Example 1
As shown in fig. 1 to 3, a voice-driven three-dimensional face animation generation method includes the following steps:
1) Extracting the voice features by the DeepSpeech engine, converting the identity information of the voice into a one-hot vector, and embedding the one-hot vector into the feature matrix (a sketch of this step is given after this list of steps);
2) Mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) Mapping the intermediate variable to the high-dimensional space of 3D vertex displacement by using a decoder, and applying a 3D graph geometry constraint to the intermediate variable to obtain the displacement in 3D space;
4) Driving the template to produce the facial animation according to the acquired displacement in 3D space.
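As referenced in step 1), the sketch below illustrates one way to assemble the feature matrix: per-frame DeepSpeech features are stacked over a 16-frame window and the speaker's one-hot identity vector is appended to every frame. The DeepSpeech extraction interface itself is not shown, and the windowing and padding details are assumptions.

```python
import numpy as np

def build_feature_matrix(deepspeech_feats, speaker_id, num_speakers, win=16):
    """deepspeech_feats: (num_frames, 29) DeepSpeech features of the audio clip.
    Returns one (win, 29 + num_speakers) feature matrix per frame, with the
    one-hot identity embedded into every row of the window."""
    one_hot = np.zeros(num_speakers, dtype=np.float32)
    one_hot[speaker_id] = 1.0
    half = win // 2
    padded = np.pad(deepspeech_feats, ((half, half), (0, 0)), mode="edge")
    windows = []
    for t in range(deepspeech_feats.shape[0]):
        w = padded[t:t + win]                               # 16-frame window centred at frame t
        ident = np.tile(one_hot, (win, 1))                  # repeat the identity for every row
        windows.append(np.concatenate([w, ident], axis=1))  # embed identity into the feature matrix
    return np.stack(windows)                                # (num_frames, 16, 29 + num_speakers)
```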
Preferably, the primary purpose of the encoder is to map the speech features to a latent representation, i.e. the intermediate vector. The encoder uses 4 convolution layers, each producing its feature maps through downsampling and the ReLU activation function; unlike a general convolution stack, a denser model is adopted here so that deep features and shallow features can be effectively combined. Specifically, the i-th convolution layer receives the feature maps of all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps generated in layers 0 to i-1, and H_i denotes a composite function with a 3×1 filter, a convolution of stride 2×1 and a ReLU linear activation unit. In general the number of feature maps doubles after each convolution layer; to keep the concatenation process smooth, a pooling layer is added after each convolution layer to reduce the number of feature maps, so that each convolution layer can be effectively reused and the encoder learns richer features.
Preferably, the decoder comprises two fully connected layers with tanh activation functions and a final output layer, with an attention mechanism arranged between the two fully connected layers so that the network focuses its learning on the important information. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where σ denotes the ReLU function, δ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block. The output of the attention layer is:

x̃_i = a_i ⊗ x_i;

where ⊗ denotes element-wise multiplication. The attention module adaptively selects the important features of the current input sample and generates different attention responses for different inputs. The final output layer is a fully connected layer with linear activation that produces an N×3 output corresponding to the three-dimensional displacement vectors of N vertices. To make training more stable, in this embodiment the weights of the output layer are initialized by 50 PCA components computed from the vertex displacements of the training data, and the biases are initialized to 0.
Preferably, the above encoder-decoder structure can be regarded as a cross-modal process, in which the intermediate variable r is referred to as a cross-modal representation: it represents both a particular identity and the deformation geometry. A mesh encoder encodes the input face mesh into the geometric representation r̂, and the corresponding decoder decodes it back into a 3D mesh; the present invention uses a multi-column multi-scale graph convolution network (MGCN) to extract the geometric representation of each training mesh. During training, by providing the mesh corresponding to the frames in each audio clip and using this encoder, the corresponding geometric representation can be obtained automatically and is used to constrain the cross-modal representation, so that the encoder output is closely related to the 3D geometric representation. For this purpose, the Huber constraint and the Hilbert-Schmidt independence criterion constraint are employed in the present invention.
Preferably, the Hilbert-Schmidt independence criterion (HSIC) constraint is used to measure nonlinear and higher-order correlations, and can estimate the dependence between representations without explicitly estimating the joint distribution of the random variables. Assume there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size. Define a mapping φ(r) that maps the intermediate variable r to a kernel space F, with the inner product given by the kernel function k_1(r_i, r_j) = <φ(r_i), φ(r_j)>, and likewise a mapping ψ(r̂) to a kernel space G with kernel k_2(r̂_i, r̂_j) = <ψ(r̂_i), ψ(r̂_j)>. The Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(R, R̂) = ||C_{RR̂}||²_HS;

where C_{RR̂} is the cross-covariance operator taken over the joint distribution of R and R̂. The empirical derivation of HSIC is:

HSIC(R, R̂) = (M − 1)⁻² tr(K_1 H K_2 H);

where tr denotes the trace of a square matrix, K_1 and K_2 are the kernel matrices with entries k_{1,ij} = k_1(r_i, r_j) and k_{2,ij} = k_2(r̂_i, r̂_j), and H = I − (1/M)·11ᵀ centers the data to mean 0 in the feature space.
in deep, the present embodiment employs a window of w=16, a voice feature of d=29, and the size of the intermediate variable is set to 64. As previously described, the network is divided into encoder and decoder parts and the constraints on the intermediate variables are imposed on the 3D geometry. The encoder has a 4-layer convolution, a 3 x1 filter, a convolution of 2 x1 steps, and a linear activation unit ReLU. The number of feature maps is doubled after each convolution layer, and in order to make the concatenation process smooth, a 2×1 pooling layer is added after each convolution layer to reduce the number of feature maps. The first two fully connected layers of the decoder use the tanh activation function and the final output layer is the fully connected layer with linear activation function, which produces 5023 x 3 outputs corresponding to the three-dimensional displacement vectors of 5023 vertices. The weights of this layer are initialized by 50 PCA components calculated from the vertex displacements of the training data, and the deviations are initialized by 0. Wherein the loss function includes a reconstruction loss, a constraint loss, and a speed loss, and the expression is:
L = L_r + λ_1·L_c + λ_2·L_v;

where λ_1 and λ_2 are positive numbers that balance the loss terms, with λ_1 set to 0.1 and λ_2 set to 10.0. L_r is the reconstruction loss, which measures the distance between the ground-truth and predicted values; the constraint loss L_c constrains the intermediate variable, using the Huber or Hilbert-Schmidt independence criterion, against the 3D graph geometric representation obtained from the mesh; and the velocity loss L_v penalizes the difference between the displacements of consecutive frames to guarantee temporal stability.
it should be noted that the invention is realized based on Tensorflow, and runs on an Indellovely GTX1080Ti video card to train with an Adam optimizer with momentum of 0.9, and trains 50 stages with a fixed learning rate of 1 e-4. For efficient training and testing, the present embodiment divides 12 subjects into a training set, a validation set and a test set. In addition, the remaining objects are also divided into 2 validation sets, 2 test sets. The training set includes all sentences of the eight objects. For the validation set and the test set, 20 unique sentences are selected so that they are not shared with other objects. There is no overlap between training, validation and test set for an object or sentence.
Example 2
A voice-driven three-dimensional face animation generation network structure comprises an encoder, a decoder and a 3D graph geometry constraint on the intermediate variable, wherein the encoder and the decoder are those of Example 1. The network is guided by the 3D geometry: given a segment of speech as input, a low-dimensional speech representation is obtained through the encoder and constrained by the 3D geometry, the facial displacements in 3D space are then obtained through the decoder, and the animation is generated by driving the template.
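To illustrate this inference path end to end, the following sketch drives a template mesh from a speech clip; it assumes the `build_feature_matrix`, `encoder` and `decoder` objects from the earlier sketches, and all names and shapes are illustrative rather than the patented implementation.

```python
import numpy as np

def animate_template(template_vertices, deepspeech_feats, speaker_id, num_speakers,
                     encoder, decoder):
    """template_vertices: (N, 3) neutral face; returns the animated sequence (T, N, 3)."""
    feats = build_feature_matrix(deepspeech_feats, speaker_id, num_speakers)   # (T, 16, 29 + S)
    frames = []
    for f in feats:
        r = encoder(f[np.newaxis])                    # low-dimensional speech representation
        disp = np.asarray(decoder(r)).reshape(-1, 3)  # (N, 3) vertex displacements
        frames.append(template_vertices + disp)       # drive the template with the displacement
    return np.stack(frames)
```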
While the foregoing description illustrates and describes several preferred embodiments of the present invention, it is to be understood that the invention is not limited to the forms disclosed herein; it is capable of use in various other combinations, modifications and environments, and of changes or modifications within the scope of the inventive concept described herein, whether resulting from the foregoing teachings or from the knowledge or skill in the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.
Claims (7)
1. A voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) Extracting voice characteristics, and embedding the identity information of the voice into a characteristic matrix;
2) Mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) Mapping the intermediate variable to the high-dimensional space of 3D vertex displacement by using a decoder, and applying a 3D graph geometry constraint to the intermediate variable to obtain the displacement in 3D space; the decoder comprises two fully connected layers with tanh activation functions and a final output layer, with an attention mechanism arranged between the two fully connected layers; let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where σ denotes the ReLU function, δ denotes the sigmoid function, and W_1 and W_2 denote the real-valued weights of the attention block; the output of the attention layer is:

x̃_i = a_i ⊗ x_i;

where ⊗ denotes element-wise multiplication; the attention module adaptively selects important features of the current input sample and generates different attention responses for different inputs; the final output layer is a fully connected layer with a linear activation function that produces an N×3 output corresponding to the three-dimensional displacement vectors of N vertices;

the 3D graph geometry constraint on the intermediate variable is applied as follows: a mesh corresponding to the frames of each audio clip is provided, an auto-encoder is used to automatically obtain the corresponding geometric representation, and this geometric representation is used to constrain the intermediate variable; the constraints include a Huber constraint and a Hilbert-Schmidt independence criterion constraint, where the Huber constraint is expressed as: assume there are two vectors r and r̂; then

L_Huber(r, r̂) = (1/2)·||r − r̂||²  if ||r − r̂|| ≤ δ, and
L_Huber(r, r̂) = δ·||r − r̂|| − (1/2)·δ²  otherwise;

where δ is the threshold of the Huber function;

the Hilbert-Schmidt independence criterion constraint is used to measure nonlinear and higher-order correlations and can estimate the dependence between representations without explicitly estimating the joint distribution of the random variables; assume there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size; define a mapping φ(r) that maps the intermediate variable r to a kernel space F, with the inner product given by the kernel function k_1(r_i, r_j) = <φ(r_i), φ(r_j)>, and likewise a mapping ψ(r̂) to a kernel space G with kernel k_2(r̂_i, r̂_j) = <ψ(r̂_i), ψ(r̂_j)>; the Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(R, R̂) = ||C_{RR̂}||²_HS;

where k_1 and k_2 are the kernel functions, F and G are the Hilbert spaces of R and R̂, C_{RR̂} is the cross-covariance operator taken over the joint distribution of R and R̂, and the samples (r_i, r̂_i) are drawn from this joint distribution; the empirical derivation of HSIC is:

HSIC(R, R̂) = (M − 1)⁻² tr(K_1 H K_2 H);

the two HSIC formulas above are successive refinements, and the final derived formula is adopted; in it, tr denotes the trace of a square matrix, K_1 and K_2 are the kernel matrices with entries k_{1,ij} = k_1(r_i, r_j) and k_{2,ij} = k_2(r̂_i, r̂_j), and H = I − (1/M)·11ᵀ centers the data to mean 0 in the feature space;

4) Driving the template to produce the facial animation according to the acquired displacement in 3D space.
2. The voice-driven three-dimensional facial animation generation method of claim 1, wherein: the DeepSpeech engine is adopted to extract the voice features in step 1).
3. The method of claim 1, wherein the encoder comprises four convolution layers, and the i-th convolution layer receives the feature maps of all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps generated in layers 0 to i-1, and H_i denotes a composite function with a 3×1 filter, a convolution of stride 2×1 and a ReLU linear activation unit.
4. The voice-driven three-dimensional face animation generation method according to claim 3, wherein: the method further comprises adding a pooling layer after each convolution layer, the number of feature maps being reduced through the pooling layer.
5. The voice-driven three-dimensional facial animation generation method of claim 1, wherein: the weights of the output layer are initialized by 50 PCA components calculated from the vertex displacements of the training data, and the biases are initialized to 0.
6. The voice-driven three-dimensional facial animation generation method according to claim 1, wherein the loss function of steps 1) to 4) comprises a reconstruction loss, a constraint loss and a velocity loss, expressed as:

L = L_r + λ_1·L_c + λ_2·L_v;

where λ_1 and λ_2 are positive numbers that balance the loss terms, with λ_1 set to 0.1 and λ_2 set to 10.0; L_r is the reconstruction loss, which measures the distance between the ground-truth and predicted values; the constraint loss L_c constrains the intermediate variable, using the Huber or Hilbert-Schmidt independence criterion, against the 3D graph geometric representation obtained from the mesh; and the velocity loss penalizes the difference between the displacements of consecutive frames to guarantee temporal stability.
7. A voice-driven three-dimensional face animation generation network structure, characterized by comprising an encoder, a decoder and a 3D graph geometry constraint on the intermediate variable, wherein the encoder is the encoder according to any one of claims 1 to 6 and the decoder is the decoder according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010387250.0A CN111724458B (en) | 2020-05-09 | 2020-05-09 | Voice-driven three-dimensional face animation generation method and network structure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010387250.0A CN111724458B (en) | 2020-05-09 | 2020-05-09 | Voice-driven three-dimensional face animation generation method and network structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111724458A CN111724458A (en) | 2020-09-29 |
CN111724458B true CN111724458B (en) | 2023-07-04 |
Family
ID=72564794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010387250.0A Active CN111724458B (en) | 2020-05-09 | 2020-05-09 | Voice-driven three-dimensional face animation generation method and network structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111724458B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113763532B (en) * | 2021-04-19 | 2024-01-19 | 腾讯科技(深圳)有限公司 | Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object |
CN113838174B (en) * | 2021-11-25 | 2022-06-10 | 之江实验室 | An audio-driven face animation generation method, device, device and medium |
CN114332315B (en) * | 2021-12-07 | 2022-11-08 | 北京百度网讯科技有限公司 | 3D video generation method, model training method and device |
CN114612594A (en) * | 2022-02-23 | 2022-06-10 | 上海暖叠网络科技有限公司 | Virtual character animation expression generation system and method |
CN116385606A (en) * | 2022-12-16 | 2023-07-04 | 浙江大学 | Speech signal driven personalized three-dimensional face animation generation method and application thereof |
CN116309988A (en) * | 2023-02-09 | 2023-06-23 | 华南理工大学 | A 3D facial animation generation method, device and medium based on audio drive |
CN116188649B (en) * | 2023-04-27 | 2023-10-13 | 科大讯飞股份有限公司 | Three-dimensional face model driving method based on voice and related device |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109308731A (en) * | 2018-08-24 | 2019-02-05 | 浙江大学 | Speech-driven lip-syncing face video synthesis algorithm based on cascaded convolutional LSTM |
CN110728219A (en) * | 2019-09-29 | 2020-01-24 | 天津大学 | 3D face generation method based on multi-column multi-scale graph convolution neural network |
Also Published As
Publication number | Publication date |
---|---|
CN111724458A (en) | 2020-09-29 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |