CN111724458A - Voice-driven three-dimensional human face animation generation method and network structure - Google Patents

Voice-driven three-dimensional human face animation generation method and network structure

Info

Publication number
CN111724458A
Authority
CN
China
Prior art keywords
voice
constraint
driven
dimensional
intermediate variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010387250.0A
Other languages
Chinese (zh)
Other versions
CN111724458B (en)
Inventor
李坤
刘云珂
刘景瑛
惠彬原
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202010387250.0A priority Critical patent/CN111724458B/en
Publication of CN111724458A publication Critical patent/CN111724458A/en
Application granted granted Critical
Publication of CN111724458B publication Critical patent/CN111724458B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Architecture (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention belongs to the technical field of computers and particularly relates to a voice-driven three-dimensional face animation generation method, which comprises the following steps: 1) extracting voice features and embedding the identity information of the voice into a feature matrix; 2) mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable; 3) mapping the intermediate variable to the high-dimensional space of 3D vertex displacements with a decoder, while applying a 3D geometric constraint to the intermediate variable, to obtain the displacements in 3D space; 4) driving a template to simulate the facial animation according to the acquired 3D-space displacements. Compared with the prior art, the method innovatively uses 3D geometric features to constrain the intermediate variable; by introducing a nonlinear geometric representation and two constraints from different perspectives, the generated 3D facial expressions become more vivid and lifelike. In addition, the invention also provides a voice-driven three-dimensional human face animation generation network structure.

Description

Voice-driven three-dimensional human face animation generation method and network structure
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional human face animation generation method and a network structure.
Background
Speech contains rich information: by using speech to drive facial expressions and motions, animations that carry a distinctive speaking style and match an individual's identity can be produced. Creating 3D facial animation that matches speech has wide application in movies, games, augmented reality, and virtual reality. It is therefore very important to understand the correlation between speech and facial deformation.
Speech-driven 3D facial animation can be classified as speaker-dependent or speaker-independent according to whether it generalizes across characters. Speaker-dependent animation learns a particular setting, mainly from large amounts of data, to generate animation for a fixed individual. Current speaker-dependent methods generally generate video from high-quality motion-capture data, generate video from a fixed speaker's voice and footage, or produce real-time facial animation with an end-to-end network, but these case-specific methods are inconvenient and cannot be applied broadly. Much current research therefore targets speaker-independent animation, where prior work mainly uses neural networks for effective feature learning. Examples include a nonlinear mapping from phoneme labels to mouth movements (Taylor et al.: A deep learning approach for generalized speech animation. ACM Trans. Graph. 36, 93:1-93:11 (2017)); estimating the rotation and activation parameters of 3D blendshapes with a long short-term memory network (Pham et al.: Speech-driven 3D facial animation with implicit emotional awareness: A deep learning approach. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2328-2336 (2017)); further learning an acoustic feature representation with a network (Pham et al.: End-to-end learning for 3D facial animation from speech. In: ICMI '18 (2018)); a three-stage network used to animate cartoon characters (Zhou et al.: VisemeNet: Audio-driven animator-centric speech animation. ACM Trans. Graph. 37, 161:1-161:10 (2018)); and a generic speech-driven 3D face framework that works across a range of identities, built on a proposed multi-subject 4D face dataset (Cudeiro et al.: Capture, learning, and synthesis of 3D speaking styles. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)). However, none of these methods considers the effect of the geometric representation on speech-driven 3D facial animation.
In view of the above, there is a need for a new method for generating a voice-driven three-dimensional human face animation.
Disclosure of Invention
The invention aims to address the shortcomings of the prior art by providing a voice-driven three-dimensional face animation generation method that realizes a speaker-independent, speech-driven facial animation network guided by 3D geometry; by introducing a nonlinear geometric representation and two constraints from different perspectives, the generated 3D facial expressions become more vivid and lifelike.
In order to achieve the purpose, the invention adopts the following technical scheme:
a voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) extracting voice characteristics and embedding the identity information of the voice into a characteristic matrix;
2) mapping the characteristic matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) mapping the intermediate variable to the high-dimensional space of 3D vertex displacements with a decoder, while applying a 3D geometric constraint to the intermediate variable, to obtain the displacements in 3D space;
4) driving the template to simulate the facial animation according to the acquired 3D-space displacements.
As an improvement of the voice-driven three-dimensional human face animation generation method, a DeepSpeech engine is adopted to extract voice features in the step 1).
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the encoder includes four convolutional layers, and the i-th convolutional layer receives the feature maps of all previous layers x_0, ..., x_{i-1} as input:
x_i = H_i([x_0, x_1, ..., x_{i-1}]);
where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps produced by layers 0 to i-1, and H_i denotes a composite function consisting of a 3 × 1 filter, a convolution with stride 2 × 1, and a ReLU linear activation unit. The main purpose of the encoder in the present invention is to map the speech features to a latent representation, i.e. the intermediate vector. The encoder uses 4 convolutional layers; unlike the general convolution process, a denser connection pattern is used here, enabling deep and shallow features to be combined effectively.
As an improvement of the voice-driven three-dimensional face animation generation method, a pooling layer is added after each convolutional layer, and the number of feature maps is reduced through the pooling layer. Generally, the number of feature maps is doubled after each convolutional layer; in order to make the concatenation proceed smoothly, a pooling layer is added after each convolutional layer to reduce the number of feature maps, which allows each convolutional layer's output to be reused effectively and makes the encoder's learning richer.
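For concreteness, the following is a minimal sketch of such a densely connected encoder, written with TensorFlow/Keras (the framework the embodiment states it uses). The channel widths, the use of average pooling, and the way earlier feature maps are downsampled before concatenation are illustrative assumptions; the patent only fixes four convolutional layers with 3 × 1 filters, stride 2 × 1, ReLU activations, and a pooling layer after each convolution.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_speech_encoder(time_steps=16, feat_dim=29, latent_dim=64):
    """Maps a (16, 29) DeepSpeech feature window to a 64-dimensional intermediate vector."""
    inp = layers.Input(shape=(time_steps, feat_dim, 1))     # DeepSpeech window as a 1-channel map
    previous = [inp]                                        # feature maps of every earlier layer
    for i in range(4):
        # Dense connection: each layer sees the concatenation of all earlier feature maps.
        x = previous[0] if len(previous) == 1 else layers.Concatenate(axis=-1)(previous)
        x = layers.Conv2D(32 * 2 ** i, kernel_size=(3, 1), strides=(2, 1),
                          padding="same", activation="relu")(x)
        # A pooling layer follows every convolution, as described above.
        x = layers.AveragePooling2D(pool_size=(2, 1), padding="same")(x)
        # Downsample the stored maps so the next concatenation has matching shapes
        # (an implementation assumption; the patent does not say how this is handled).
        previous = [layers.AveragePooling2D(pool_size=(4, 1), padding="same")(f)
                    for f in previous] + [x]
    out = layers.Dense(latent_dim)(layers.Flatten()(x))     # the intermediate variable r
    return tf.keras.Model(inp, out, name="dense_speech_encoder")
```

Calling dense_speech_encoder().summary() shows how the temporal resolution shrinks while the channel count doubles at each layer in this sketch.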
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the decoder includes two fully-connected layers with tanh activation functions and a final output layer, and an attention mechanism is set between the two fully-connected layers. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:
a_i = σ(W_2 δ(W_1 x_i));
where δ denotes the ReLU function, σ denotes the sigmoid function, and W_1 and W_2 are the weights of the attention block. The output of the attention layer is:
x̂_i = a_i ⊙ x_i;
where ⊙ denotes element-wise multiplication. Through the attention module, the current input sample adaptively selects important features, and different inputs generate different attention responses. The final output layer is a fully-connected layer with a linear activation function, which produces an N × 3 output corresponding to the three-dimensional displacement vectors of the N vertices.
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the weights of the output layer are initialized by the 50 PCA components calculated from the vertex displacements of the training data, and the biases are initialized to 0. PCA denotes principal component analysis; this initialization improves the stability of training the network model.
As an improvement to the speech-driven three-dimensional face animation generation method of the present invention, the 3D geometric constraint on the intermediate variable is implemented as follows: a mesh corresponding to each frame of the audio is set, a corresponding geometric representation is obtained automatically with an autoencoder, and this geometric representation is used to constrain the intermediate variable; the constraints include a Huber constraint and a Hilbert-Schmidt independence criterion constraint. The Huber constraint is expressed as follows: suppose there are two vectors r and r̂; then
L_Huber = H_δ(r - r̂);
where
H_δ(a) = 0.5·||a||² if ||a|| ≤ δ, and H_δ(a) = δ·(||a|| - 0.5·δ) otherwise,
with δ the threshold of the Huber function. In the invention, the encoder of the mesh autoencoder encodes the input face mesh into a latent vector r̂, and its decoder decodes it back into a 3D mesh. The invention uses a multi-column multi-scale graph convolutional network (MGCN) to extract the geometric representation of each training mesh; by pairing a mesh with each frame of the audio and obtaining the corresponding geometric representation with the autoencoder, the geometric representation effectively constrains the intermediate variable, so that the encoder output is closely related to the 3D geometric representation.
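A minimal sketch of this Huber constraint between the cross-modal intermediate variable r and the geometric representation r̂ is given below in TensorFlow; the threshold value delta = 1.0 is an assumption, as the patent does not state it.

```python
import tensorflow as tf

def huber_constraint(r, r_hat, delta=1.0):
    """Huber distance between two batches of latent vectors, e.g. shape (M, 64)."""
    dist = tf.norm(r - r_hat, axis=-1)            # ||r - r_hat|| for each sample in the batch
    quadratic = 0.5 * tf.square(dist)             # used where the distance is <= delta
    linear = delta * (dist - 0.5 * delta)         # used where the distance is >  delta
    return tf.reduce_mean(tf.where(dist <= delta, quadratic, linear))
```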
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the Hilbert-Schmidt independence criterion (HSIC) constraint is used to measure nonlinear and higher-order correlations; it enables estimation of the dependency between representations without explicitly estimating the joint distribution of the random variables. Suppose there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size. A mapping φ(r) maps the intermediate variable r to a kernel space H_R, whose inner product is expressed through the kernel function k_R(r_i, r_j) = <φ(r_i), φ(r_j)>; a mapping to the kernel space H_R̂ with kernel k_R̂ is defined analogously. The Hilbert-Schmidt independence criterion constraint is expressed as:
HSIC(R, R̂) = E_{r r' r̂ r̂'}[k_R(r, r')·k_R̂(r̂, r̂')] + E_{r r'}[k_R(r, r')]·E_{r̂ r̂'}[k_R̂(r̂, r̂')] - 2·E_{r r̂}[E_{r'}[k_R(r, r')]·E_{r̂'}[k_R̂(r̂, r̂')]];
where k_R and k_R̂ are kernel functions, H_R and H_R̂ are Hilbert spaces, and E_{r r' r̂ r̂'} denotes the expectation over independent pairs (r, r̂) and (r', r̂') drawn from the joint distribution of R and R̂. The empirical estimate of HSIC is:
HSIC(R, R̂) = (M - 1)^(-2)·tr(K_1 H K_2 H);
where tr denotes the trace of a square matrix, K_1 and K_2 are the Gram matrices with entries k_{1,ij} = k_1(r_i, r_j) and k_{2,ij} = k_2(r̂_i, r̂_j), and H = I - (1/M)·1·1^T centers the Gram matrices so that they have zero mean in the feature space.
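The empirical HSIC term above can be computed directly from the two batches of latent vectors. The following sketch assumes Gaussian (RBF) kernels for k_1 and k_2 with a fixed bandwidth; the patent does not state which kernels are used.

```python
import tensorflow as tf

def rbf_gram(x, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for a batch x of shape (M, d)."""
    sq = tf.reduce_sum(tf.square(x), axis=1, keepdims=True)                # (M, 1)
    dist2 = sq - 2.0 * tf.matmul(x, x, transpose_b=True) + tf.transpose(sq)
    return tf.exp(-dist2 / (2.0 * sigma ** 2))

def hsic(r, r_hat, sigma=1.0):
    """Empirical HSIC(R, R_hat) = (M - 1)^-2 * tr(K1 @ H @ K2 @ H)."""
    m = tf.cast(tf.shape(r)[0], tf.float32)
    k1, k2 = rbf_gram(r, sigma), rbf_gram(r_hat, sigma)
    h = tf.eye(tf.shape(r)[0]) - tf.ones_like(k1) / m                      # H = I - (1/M) * 1 1^T
    return tf.linalg.trace(k1 @ h @ k2 @ h) / tf.square(m - 1.0)
```

Since HSIC measures dependence, using it as a constraint presumably means encouraging a large value (for example, adding -HSIC to the loss); the patent text does not spell out the sign convention, so that choice is an assumption.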
as an improvement of the speech-driven three-dimensional face animation generation method in the present invention, the loss functions of steps 1) to 4) include reconstruction loss, constraint loss, and speed loss, and the expression is:
L=Lr1Lc2Lv
wherein λ is1And λ2Is a positive number to balance the loss term, setting λ1Is 0.1, lambda2Is 10.0, LrFor reconstruction losses, the distance between the true and predicted values is calculated:
Figure BDA00024842559300000612
constraint loss LcAcquiring a 3D graphic geometric intermediate variable through the grid, and constraining the existing intermediate variable by using a Huber or Hilbert-Schmidt independence criterion, wherein the speed loss is expressed as follows, so that the stability of time is ensured:
Figure BDA00024842559300000613
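A minimal sketch of the total loss L = L_r + λ_1·L_c + λ_2·L_v with λ_1 = 0.1 and λ_2 = 10.0 follows. The exact per-vertex form of the reconstruction and velocity terms is an assumption (mean squared distances over vertices and consecutive frames), and the constraint term is passed in from whichever of the two constraints above is used.

```python
import tensorflow as tf

def reconstruction_loss(v_true, v_pred):
    """Distance between true and predicted vertex displacements, shape (T, N, 3)."""
    return tf.reduce_mean(tf.reduce_sum(tf.square(v_true - v_pred), axis=-1))

def velocity_loss(v_true, v_pred):
    """Penalizes the difference between true and predicted frame-to-frame motion."""
    dv_true = v_true[1:] - v_true[:-1]
    dv_pred = v_pred[1:] - v_pred[:-1]
    return tf.reduce_mean(tf.reduce_sum(tf.square(dv_true - dv_pred), axis=-1))

def total_loss(v_true, v_pred, constraint_term, lam1=0.1, lam2=10.0):
    """L = Lr + lambda1 * Lc + lambda2 * Lv."""
    return (reconstruction_loss(v_true, v_pred)
            + lam1 * constraint_term          # Huber distance or (negative) HSIC on the latent
            + lam2 * velocity_loss(v_true, v_pred))
```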
the invention also provides a voice-driven three-dimensional human face animation generation network structure which comprises an encoder, a decoder and 3D graphic geometric constraint on the intermediate variables.
The invention has the following beneficial effects: compared with traditional reconstruction methods, the method innovatively uses 3D geometric features to constrain the intermediate variable. In the encoder stage, densely connected convolutional layers are designed to enhance feature propagation and the reuse of audio features; in the decoder stage, an attention mechanism lets the network adaptively focus on key regions; for the intermediate variable, a geometry-guided training strategy with two constraints from different perspectives is provided to achieve a stronger animation effect. In addition, the three-dimensional face animation generation network has high accuracy, the generated animations are more accurate and plausible, and the network generalizes well.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a work flow diagram of the present invention;
FIG. 2 is a diagram of a network architecture model according to the present invention;
FIG. 3 is a schematic diagram comparing reconstruction results on the VOCASET dataset between an embodiment of the present invention and other methods; shown are the ground-truth input mesh, the result reconstructed by Cudeiro et al., the reconstruction estimated by the present invention, an error visualization of the Cudeiro et al. method, and an error visualization of the present invention.
Detailed Description
As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and the claims do not intend to distinguish between components that differ in name but not in function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion and thus should be interpreted to mean "including, but not limited to". "Substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem and substantially achieve the technical result.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "horizontal", and the like indicate orientations or positional relationships based on those shown in the drawings and are used only for convenience in describing the present invention and simplifying the description; they do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
The present invention will be described in further detail below with reference to the accompanying drawings, but the present invention is not limited thereto.
Example 1
As shown in fig. 1 to 3, a method for generating a voice-driven three-dimensional human face animation includes the following steps:
1) extracting voice features with a DeepSpeech engine, converting the identity information of the voice into a one-hot vector, and embedding the one-hot vector into the feature matrix (a sketch of this step is shown after this list);
2) mapping the characteristic matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) mapping the intermediate variable to the high-dimensional space of 3D vertex displacements with a decoder, while applying a 3D geometric constraint to the intermediate variable, to obtain the displacements in 3D space;
4) driving the template to simulate the facial animation according to the acquired 3D-space displacements.
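As referenced in step 1), the sketch below illustrates one plausible way of embedding the identity information into the feature matrix: the one-hot identity vector is tiled over the time axis and concatenated to the DeepSpeech feature window. The concatenation layout and the window size are assumptions based on the dimensions given later in this embodiment (W = 16, D = 29, eight training subjects).

```python
import numpy as np

def build_feature_matrix(deepspeech_feats, subject_id, num_subjects=8):
    """deepspeech_feats: (16, 29) window of DeepSpeech outputs; returns the encoder input."""
    one_hot = np.zeros(num_subjects, dtype=np.float32)
    one_hot[subject_id] = 1.0
    identity = np.tile(one_hot, (deepspeech_feats.shape[0], 1))     # repeat identity per time step
    return np.concatenate([deepspeech_feats, identity], axis=1)     # (16, 29 + num_subjects)
```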
Preferably, the main purpose of the encoder is to map the speech features to a latent representation, i.e. the intermediate vector. The encoder uses 4 convolutional layers connected in series through downsampling, each producing a feature map through a ReLU activation function; unlike a general convolution stack, a denser connection pattern is used so that deep and shallow features can be combined effectively. Specifically, the i-th convolutional layer receives the feature maps of all previous layers x_0, ..., x_{i-1} as input:
x_i = H_i([x_0, x_1, ..., x_{i-1}]);
where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps produced by layers 0 to i-1, and H_i denotes a composite function consisting of a 3 × 1 filter, a convolution with stride 2 × 1, and a ReLU linear activation unit. Generally, the number of feature maps is doubled after each convolutional layer; in order to make the concatenation proceed smoothly, a pooling layer is added after each convolutional layer to reduce the number of feature maps, which allows each convolutional layer's output to be reused effectively and makes the encoder's learning richer.
Preferably, the decoder comprises two fully-connected layers with tanh activation functions and a final output layer, and an attention mechanism is arranged between the two fully-connected layers so that the network focuses on learning important information. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:
a_i = σ(W_2 δ(W_1 x_i));
where δ denotes the ReLU function, σ denotes the sigmoid function, and W_1 and W_2 are the weights of the attention block. The output of the attention layer is:
x̂_i = a_i ⊙ x_i;
where ⊙ denotes element-wise multiplication. Through the attention module, the current input sample adaptively selects important features, and different inputs generate different attention responses. The final output layer is a fully-connected layer with a linear activation function, which produces an N × 3 output corresponding to the three-dimensional displacement vectors of the N vertices.
Preferably, the encoder-decoder structure described above can be viewed as a cross-modal process, and the intermediate variable r is referred to in this embodiment as the cross-modal representation: it encodes both a particular identity and the geometry of a deformation. A mesh autoencoder encodes an input face mesh into a latent vector r̂, and its decoder decodes it back into a 3D mesh; the present invention uses a multi-column multi-scale graph convolutional network (MGCN) as this autoencoder to extract the geometric representation of each training mesh. During training, a mesh is associated with each frame of the audio and its geometric representation is obtained automatically with the autoencoder; this representation is used to constrain the cross-modal representation, so that the encoder output stays closely related to the 3D geometric representation. Two constraints are adopted in the invention: the Huber constraint and the Hilbert-Schmidt independence criterion constraint.
The Huber constraint is expressed as follows: suppose there are two vectors r and r̂; then
L_Huber = H_δ(r - r̂);
where
H_δ(a) = 0.5·||a||² if ||a|| ≤ δ, and H_δ(a) = δ·(||a|| - 0.5·δ) otherwise,
with δ the threshold of the Huber function.
preferably, the Hilbert-Schmidt independence criterion constraint for measuring non-linear and higher order correlations enables the estimation of the dependency between the representations without explicitly estimating the joint distribution of the random variables, assuming that there are two variables R ═ R1,...,ri,...,rM]And
Figure BDA0002484255930000113
m is batch size, and the definition mapping phi (r) maps the intermediate variable r to kernel space
Figure BDA0002484255930000114
Also the inner product is expressed as
Figure BDA0002484255930000115
The Hilbert-Schmidt independence criteria constraint is expressed as:
Figure BDA0002484255930000116
wherein k isRAnd
Figure BDA0002484255930000117
in the form of a kernel function, the kernel function,
Figure BDA0002484255930000118
and
Figure BDA0002484255930000119
is a space of Hilbert, and is provided with a plurality of Hilbert,
Figure BDA00024842559300001110
is as to R and
Figure BDA00024842559300001111
expectation of (1), order
Figure BDA00024842559300001112
Is taken from
Figure BDA00024842559300001113
The empirical derivation of HSIC is:
Figure BDA00024842559300001114
where tr denotes the trace of the square matrix, K1And K2Is k1,ij=k1(ri,rj) And
Figure BDA00024842559300001115
h centers the Gram matrix with a mean value of 0 in the feature space:
Figure BDA00024842559300001116
in deep speech, the present embodiment adopts a speech feature of W ═ 16, D ═ 29, and sets the size of the intermediate variable to 64. As previously described, the network is divided into encoder and decoder sections, and the intermediate variables are 3D geometry constrained. The encoder has 4 layers of convolutions, a 3 x1 filter, 2 x1 steps of convolutions and a linear activation unit ReLU. The number of features is doubled after each convolutional layer, and in order to make the concatenation process smooth, a 2 × 1 pooling layer is added after each convolutional layer to reduce the number of features. The first two fully-connected layers of the decoder use the tanh activation function, and the final output layer is a fully-connected layer with linear activation function, which produces 5023 × 3 outputs, corresponding to 5023 vertex three-dimensional displacement vectors. The weights for this layer are initialized by the 50 PCA components calculated from the vertex displacements of the training data, and the bias is initialized by 0. The loss function comprises reconstruction loss, constraint loss and speed loss, and the expression is as follows:
L=Lr1Lc2Lv
wherein λ is1And λ2Is a positive number to balance the loss term, setting λ1Is 0.1, lambda2Is 10.0, LrFor reconstruction losses, the distance between the true and predicted values is calculated:
Figure BDA0002484255930000121
constraint loss LcAcquiring a 3D graphic geometric intermediate variable through the grid, and constraining the existing intermediate variable by using a Huber or Hilbert-Schmidt independence criterion, wherein the speed loss is expressed as follows, so that the stability of time is ensured:
Figure BDA0002484255930000122
the invention is realized based on Tensorflow, and is operated in Adam optimizer with 0.9 momentum for the image card GTX1080Ti in great, and 50 stages are trained with a fixed learning rate of 1 e-4. For efficient training and testing, the present embodiment divides the 12 subjects into a training set, a validation set, and a test set. In addition, the remaining objects are also divided into 2 validation sets, 2 test sets. The training set includes all sentences of eight subjects. For the validation set and test set, 20 unique sentences were selected so that they were not shared with other objects. There is no overlap between the training, validation and test sets of subjects or sentences.
Example 2
A speech-driven three-dimensional human face animation generation network structure comprises an encoder, a decoder, and a 3D geometric constraint on the intermediate variable, where the encoder and decoder are those of Example 1. The network is guided by 3D geometry: given a segment of speech as input, the encoder produces a low-dimensional representation of the speech features, the 3D geometry is used as a constraint, the decoder then produces the facial displacements in 3D space, and the animation is generated by driving a template.
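The following sketch shows how such a network structure could be used at inference time: speech features are encoded into the intermediate variable, decoded into per-vertex displacements, and added to a neutral template mesh to produce the animation. Function and variable names are illustrative rather than the patent's API; the vertex count 5023 comes from the embodiment above.

```python
import numpy as np

def animate_template(encoder, decoder, speech_windows, template_vertices):
    """speech_windows: sequence of per-frame feature matrices; template_vertices: (5023, 3)."""
    frames = []
    for feats in speech_windows:
        latent = encoder(feats[None, ...])                  # (1, 64) intermediate variable
        displacement = decoder(latent)                      # (1, 5023, 3) vertex offsets
        frames.append(template_vertices + np.asarray(displacement)[0])
    return np.stack(frames)                                 # (T, 5023, 3) animated mesh sequence
```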
The foregoing description shows and describes several preferred embodiments of the invention, but, as noted above, it is to be understood that the invention is not limited to the forms disclosed herein; it is not to be construed as excluding other embodiments, and it is capable of use in various other combinations, modifications, and environments and of changes within the scope of the inventive concept described herein, commensurate with the above teachings or with the skill or knowledge of the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) extracting voice characteristics and embedding the identity information of the voice into a characteristic matrix;
2) mapping the characteristic matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) mapping the intermediate variable to the high-dimensional space of 3D vertex displacements with a decoder, while applying a 3D geometric constraint to the intermediate variable, to obtain the displacements in 3D space;
4) driving the template to simulate the facial animation according to the acquired 3D-space displacements.
2. The speech-driven three-dimensional face animation generation method according to claim 1, characterized in that: in step 1), a DeepSpeech engine is adopted to extract the voice features.
3. The method of claim 1, wherein the encoder comprises four convolutional layers, and the i-th convolutional layer receives the feature maps of all previous layers x_0, ..., x_{i-1} as input:
x_i = H_i([x_0, x_1, ..., x_{i-1}]);
where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps produced by layers 0 to i-1, and H_i denotes a composite function consisting of a 3 × 1 filter, a convolution with stride 2 × 1, and a ReLU linear activation unit.
4. A speech-driven three-dimensional face animation generation method according to claim 3, characterized in that: a pooling layer is added after each convolutional layer, and the number of feature maps is reduced through the pooling layer.
5. The method of claim 1, wherein the decoder comprises two fully-connected layers with tanh activation functions and a final output layer, and an attention mechanism is provided between the two fully-connected layers; that is, letting x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps, the attention value a_i can be expressed as:
a_i = σ(W_2 δ(W_1 x_i));
where δ denotes the ReLU function, σ denotes the sigmoid function, and W_1 and W_2 are the weights of the attention block; the output of the attention layer is:
x̂_i = a_i ⊙ x_i;
where ⊙ denotes element-wise multiplication; through the attention module, the current input sample adaptively selects important features, and different inputs generate different attention responses; the final output layer is a fully-connected layer with a linear activation function, which produces an N × 3 output corresponding to the three-dimensional displacement vectors of N vertices.
6. The speech-driven three-dimensional face animation generation method according to claim 5, wherein the weights of the output layer are initialized by the 50 PCA components calculated from the vertex displacements of the training data, and the biases are initialized to 0.
7. The method of claim 1, wherein the 3D geometric constraint on the intermediate variable specifically comprises: setting a mesh corresponding to each frame in the audio, automatically obtaining a corresponding geometric representation with an autoencoder, and using the geometric representation to constrain the intermediate variable, the constraints including a Huber constraint and a Hilbert-Schmidt independence criterion constraint, wherein the Huber constraint is expressed as follows: suppose there are two vectors r and r̂; then
L_Huber = H_δ(r - r̂);
where
H_δ(a) = 0.5·||a||² if ||a|| ≤ δ, and H_δ(a) = δ·(||a|| - 0.5·δ) otherwise,
with δ the threshold of the Huber function.
8. The method of claim 7, wherein the Hilbert-Schmidt independence criterion constraint is applied to measure nonlinear and higher-order correlations and enables estimation of the dependency between representations without explicitly estimating the joint distribution of the random variables; assuming that there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size, a mapping φ(r) maps the intermediate variable r to a kernel space H_R whose inner product is expressed through the kernel function k_R(r_i, r_j) = <φ(r_i), φ(r_j)>, and a mapping to the kernel space H_R̂ with kernel k_R̂ is defined analogously; the Hilbert-Schmidt independence criterion constraint is expressed as:
HSIC(R, R̂) = E_{r r' r̂ r̂'}[k_R(r, r')·k_R̂(r̂, r̂')] + E_{r r'}[k_R(r, r')]·E_{r̂ r̂'}[k_R̂(r̂, r̂')] - 2·E_{r r̂}[E_{r'}[k_R(r, r')]·E_{r̂'}[k_R̂(r̂, r̂')]];
where k_R and k_R̂ are kernel functions, H_R and H_R̂ are Hilbert spaces, and E_{r r' r̂ r̂'} denotes the expectation over independent pairs (r, r̂) and (r', r̂') drawn from the joint distribution of R and R̂; the empirical estimate of HSIC is:
HSIC(R, R̂) = (M - 1)^(-2)·tr(K_1 H K_2 H);
where tr denotes the trace of a square matrix, K_1 and K_2 are the Gram matrices with entries k_{1,ij} = k_1(r_i, r_j) and k_{2,ij} = k_2(r̂_i, r̂_j), and H = I - (1/M)·1·1^T centers the Gram matrices so that they have zero mean in the feature space.
9. The method for generating a voice-driven three-dimensional face animation according to claim 7, wherein the loss function used for the steps 1) to 4) comprises a reconstruction loss, a constraint loss, and a velocity loss:
L = L_r + λ_1·L_c + λ_2·L_v;
where λ_1 and λ_2 are positive numbers that balance the loss terms, λ_1 being set to 0.1 and λ_2 to 10.0; L_r is the reconstruction loss, which measures the distance between the true and predicted vertex displacements:
L_r = (1/N)·Σ_{i=1..N} ||v_i - v̂_i||²;
the constraint loss L_c obtains the 3D geometric representation from the mesh and constrains the intermediate variable with the Huber or Hilbert-Schmidt independence criterion, and the velocity loss ensures temporal stability:
L_v = Σ_t ||(v_t - v_{t-1}) - (v̂_t - v̂_{t-1})||².
10. A voice-driven three-dimensional human face animation generation network structure, characterized in that it comprises an encoder as claimed in any one of claims 1 to 9, a decoder as claimed in any one of claims 1 to 9, and a 3D geometric constraint on the intermediate variable.
CN202010387250.0A 2020-05-09 2020-05-09 Voice-driven three-dimensional face animation generation method and network structure Active CN111724458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010387250.0A CN111724458B (en) 2020-05-09 2020-05-09 Voice-driven three-dimensional face animation generation method and network structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010387250.0A CN111724458B (en) 2020-05-09 2020-05-09 Voice-driven three-dimensional face animation generation method and network structure

Publications (2)

Publication Number Publication Date
CN111724458A true CN111724458A (en) 2020-09-29
CN111724458B CN111724458B (en) 2023-07-04

Family

ID=72564794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010387250.0A Active CN111724458B (en) 2020-05-09 2020-05-09 Voice-driven three-dimensional face animation generation method and network structure

Country Status (1)

Country Link
CN (1) CN111724458B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763532A (en) * 2021-04-19 2021-12-07 腾讯科技(深圳)有限公司 Human-computer interaction method, device, equipment and medium based on three-dimensional virtual object
CN113838174A (en) * 2021-11-25 2021-12-24 之江实验室 Audio-driven face animation generation method, device, equipment and medium
CN114332315A (en) * 2021-12-07 2022-04-12 北京百度网讯科技有限公司 3D video generation method, model training method and device
CN114612594A (en) * 2022-02-23 2022-06-10 上海暖叠网络科技有限公司 Virtual character animation expression generation system and method
CN116188649A (en) * 2023-04-27 2023-05-30 科大讯飞股份有限公司 Three-dimensional face model driving method based on voice and related device
WO2024124680A1 (en) * 2022-12-16 2024-06-20 浙江大学 Speech signal-driven personalized three-dimensional facial animation generation method, and application thereof
WO2024164748A1 (en) * 2023-02-09 2024-08-15 华南理工大学 Three-dimensional facial animation generation method and apparatus based on audio driving, and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
CN110728219A (en) * 2019-09-29 2020-01-24 天津大学 3D face generation method based on multi-column multi-scale graph convolution neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109308731A (en) * 2018-08-24 2019-02-05 浙江大学 The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM
CN110728219A (en) * 2019-09-29 2020-01-24 天津大学 3D face generation method based on multi-column multi-scale graph convolution neural network

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763532A (en) * 2021-04-19 2021-12-07 腾讯科技(深圳)有限公司 Human-computer interaction method, device, equipment and medium based on three-dimensional virtual object
CN113763532B (en) * 2021-04-19 2024-01-19 腾讯科技(深圳)有限公司 Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object
CN113838174A (en) * 2021-11-25 2021-12-24 之江实验室 Audio-driven face animation generation method, device, equipment and medium
CN113838174B (en) * 2021-11-25 2022-06-10 之江实验室 Audio-driven face animation generation method, device, equipment and medium
CN114332315A (en) * 2021-12-07 2022-04-12 北京百度网讯科技有限公司 3D video generation method, model training method and device
CN114332315B (en) * 2021-12-07 2022-11-08 北京百度网讯科技有限公司 3D video generation method, model training method and device
CN114612594A (en) * 2022-02-23 2022-06-10 上海暖叠网络科技有限公司 Virtual character animation expression generation system and method
WO2024124680A1 (en) * 2022-12-16 2024-06-20 浙江大学 Speech signal-driven personalized three-dimensional facial animation generation method, and application thereof
WO2024164748A1 (en) * 2023-02-09 2024-08-15 华南理工大学 Three-dimensional facial animation generation method and apparatus based on audio driving, and medium
CN116188649A (en) * 2023-04-27 2023-05-30 科大讯飞股份有限公司 Three-dimensional face model driving method based on voice and related device
CN116188649B (en) * 2023-04-27 2023-10-13 科大讯飞股份有限公司 Three-dimensional face model driving method based on voice and related device

Also Published As

Publication number Publication date
CN111724458B (en) 2023-07-04

Similar Documents

Publication Publication Date Title
CN111724458B (en) Voice-driven three-dimensional face animation generation method and network structure
CN110728219B (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
CN112581569B (en) Adaptive emotion expression speaker facial animation generation method and electronic device
Cao et al. Expressive speech-driven facial animation
CN112562722A (en) Audio-driven digital human generation method and system based on semantics
CN110910479B (en) Video processing method, device, electronic equipment and readable storage medium
Liu et al. Geometry-guided dense perspective network for speech-driven facial animation
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN111028319A (en) Three-dimensional non-photorealistic expression generation method based on facial motion unit
CN113140023A (en) Text-to-image generation method and system based on space attention
CN114724224A (en) Multi-mode emotion recognition method for medical care robot
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
CN116311472A (en) Micro-expression recognition method and device based on multi-level graph convolution network
CN111767842B (en) Micro-expression type discrimination method based on transfer learning and self-encoder data enhancement
Li et al. A survey of computer facial animation techniques
Liu et al. Emotional facial expression transfer based on temporal restricted Boltzmann machines
CN113076918A (en) Video-based facial expression cloning method
Balayn et al. Data-driven development of virtual sign language communication agents
CN117951763A (en) Multi-mode data-driven generation type fashion compatible clothing design method and system
CN114783039B (en) Motion migration method driven by 3D human body model
CN117636106A (en) Fashion commodity image generation method based on attention generation countermeasure network
Huang et al. Visual speech emotion conversion using deep learning for 3D talking head
CN110210336B (en) Low-resolution single-sample face recognition method
Zhong et al. Style-preserving lip sync via audio-aware style reference
CN113706650A (en) Image generation method based on attention mechanism and flow model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant