CN111724458A - Voice-driven three-dimensional human face animation generation method and network structure - Google Patents
- Publication number: CN111724458A (application CN202010387250.0A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06T 13/20: 3D [Three Dimensional] animation
- G06T 13/40: 3D animation of characters, e.g. humans, animals or virtual beings
- G06T 19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
Abstract
The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional face animation generation method, which comprises the following steps: 1) extracting voice features and embedding the identity information of the voice into a feature matrix; 2) mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable; 3) mapping the intermediate variable to the high-dimensional space of 3D vertex displacements by using a decoder, while applying a 3D geometric constraint to the intermediate variable, to obtain the displacements in 3D space; 4) driving the template to simulate the facial animation according to the acquired 3D displacements. Compared with the prior art, the method innovatively uses 3D geometric features to constrain the intermediate variable; by introducing a nonlinear geometric representation and two constraints from different perspectives, the generated 3D facial expressions are more vivid and lifelike. In addition, the invention also provides a voice-driven three-dimensional human face animation generation network structure.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional human face animation generation method and a network structure.
Background
Speech carries rich information; by using it to drive the expressions and motions of a face, animation with a distinctive speaking style that matches an individual's identity can be produced. Creating 3D facial animation that conforms to speech features has wide application in movies, games, augmented reality, and virtual reality. It is therefore important to understand the correlation between speech and facial deformation.
Speech-driven 3D facial animation can be classified as speaker-dependent or speaker-independent according to whether it generalizes across characters. Speaker-dependent animation mainly uses large amounts of data to learn a specific case and generate animation for a fixed individual. Current speaker-dependent methods generally generate video from high-quality motion capture data, generate video from a fixed speaker's voice and footage, or produce real-time facial animation with an end-to-end network, but these case-specific methods are inconvenient and cannot be applied broadly. Much current research is therefore directed at speaker-independent animation, for which the prior art mainly uses neural networks for effective feature learning. Examples include nonlinear mapping from phoneme labels to mouth movements (Taylor et al.: A deep learning approach for generalized speech animation. ACM Trans. Graph. 36, 93:1-93:11 (2017)); estimating the rotation and activation parameters of 3D blendshapes with a long short-term memory network (Pham et al.: Speech-driven 3D facial animation with implicit emotional awareness: A deep learning approach. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2328-2336 (2017)); further learning an acoustic feature representation with an end-to-end network (Pham et al.: End-to-end learning for 3D facial animation from speech. In: ICMI '18 (2018)); a three-stage network used to animate cartoon characters (Zhou et al.: VisemeNet: audio-driven animator-centric speech animation. ACM Trans. Graph. 37, 161:1-161:10 (2018)); and a generic speech-driven 3D face framework trained on a proposed multi-subject 4D face dataset, which works across a range of identities (Cudeiro et al.: Capture, learning, and synthesis of 3D speaking styles. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)). However, none of these methods considers the effect of the geometric representation on speech-driven 3D facial animation.
In view of the above, there is a need for a new voice-driven three-dimensional face animation generation method.
Disclosure of Invention
The invention aims to: aiming at the defects of the prior art, the provided voice-driven three-dimensional face animation generation method realizes a speaker-independent, 3D-geometry-guided voice-driven face animation network; by introducing a nonlinear geometric representation and two constraints from different perspectives, the generated 3D facial expressions are more vivid and lifelike.
In order to achieve the purpose, the invention adopts the following technical scheme:
a voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) extracting voice characteristics and embedding the identity information of the voice into a characteristic matrix;
2) mapping the characteristic matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) mapping the intermediate variable to a high-dimensional space of 3D vertex displacement by using a decoder, and carrying out 3D geometric constraint on the intermediate variable to obtain the displacement of a 3D space;
4) and driving the template to simulate the facial animation according to the acquired displacement of the 3D space.
As an improvement of the voice-driven three-dimensional human face animation generation method, a DeepSpeech engine is adopted to extract voice features in the step 1).
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the encoder comprises four convolution layers, and the i-th convolution layer receives the feature maps of all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps produced in layers 0 to i-1, and H_i denotes a composite function consisting of a convolution with a 3 × 1 filter and 2 × 1 stride followed by a ReLU linear activation unit. The main purpose of the encoder in the present invention is to map the speech features to a latent representation, i.e. the intermediate variable. Unlike a plain convolution stack, the encoder uses four densely connected convolution layers, which allows deep and shallow features to be combined effectively.
As an improvement of the voice-driven three-dimensional face animation generation method, a pooling layer is added after each convolution layer, and the number of feature maps is reduced by the pooling layer. Generally, the number of feature maps doubles after each convolution layer; to keep the concatenation manageable, a pooling layer is added after each convolution layer to reduce the number of feature maps, so that each convolution layer is effectively reused and the encoder learns richer features.
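The following is a minimal sketch, not the patented implementation, of the densely connected encoder idea described above, written in TensorFlow/Keras. Stride-1 convolutions with "same" padding are used so that feature maps can be concatenated along the channel axis; the 2 × 1 strides, the pooling layers, the channel counts, and the final global pooling plus dense projection to a 64-dimensional intermediate variable are simplifications and assumptions.

```python
import tensorflow as tf

def dense_speech_encoder(window=16, feat_dim=29, latent_dim=64, n_layers=4, n_filters=32):
    # speech: one DeepSpeech feature window (W frames x D features, identity already appended)
    speech = tf.keras.Input(shape=(window, feat_dim))
    features = [speech]
    for _ in range(n_layers):
        # each layer sees the concatenation of all earlier feature maps: H_i([x_0, ..., x_{i-1}])
        x = features[0] if len(features) == 1 else tf.keras.layers.Concatenate(axis=-1)(features)
        x = tf.keras.layers.Conv1D(n_filters, kernel_size=3, padding="same", activation="relu")(x)
        features.append(x)
    pooled = tf.keras.layers.GlobalAveragePooling1D()(features[-1])
    latent = tf.keras.layers.Dense(latent_dim)(pooled)   # intermediate variable r
    return tf.keras.Model(speech, latent, name="dense_encoder")

# encoder = dense_speech_encoder()   # maps a (16, 29) feature window to a 64-d intermediate variable
```

The dense connectivity is what lets deeper layers see the shallow feature maps directly, which is the stated motivation for the design.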
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the decoder comprises two fully connected layers with tanh activation functions and a final output layer, with an attention mechanism between the two fully connected layers. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where δ denotes the ReLU function, σ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block. The output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication. The attention module lets the current input sample adaptively select important features and produces different attention responses for different inputs. The final output layer is a fully connected layer with a linear activation function; it produces an N × 3 output corresponding to the three-dimensional displacement vectors of the N vertices.
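A minimal sketch of such a feature-attention block, assuming the squeeze-and-excitation form a_i = σ(W_2 δ(W_1 x_i)) with δ = ReLU and σ = sigmoid; the reduction ratio of the hidden layer (4 here) is an assumed hyperparameter not stated in the text.

```python
import tensorflow as tf

class FeatureAttention(tf.keras.layers.Layer):
    """a_i = sigmoid(W2 relu(W1 x_i)); the output is the element-wise product a_i * x_i."""
    def __init__(self, channels, reduction=4, **kwargs):
        super().__init__(**kwargs)
        self.squeeze = tf.keras.layers.Dense(channels // reduction, activation="relu")   # W1 + ReLU
        self.excite = tf.keras.layers.Dense(channels, activation="sigmoid")              # W2 + sigmoid

    def call(self, x):
        attention = self.excite(self.squeeze(x))   # per-feature attention values a_i in (0, 1)
        return x * attention                       # re-weight the input features
```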
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the weights of the output layer are initialized by 50 PCA components calculated from the vertex displacements of the training data, and the biases are initialized to 0. PCA refers to principal component analysis; initializing the output layer in this way improves the stability of training the network model.
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the 3D geometric constraint on the intermediate variable is applied as follows: a mesh is set to correspond to each frame in the audio, and a corresponding geometric representation is obtained automatically with an autoencoder; this geometric representation is used to constrain the intermediate variable, including a Huber constraint and a Hilbert-Schmidt independence criterion constraint. The Huber constraint is expressed as: suppose there are two vectors r and r̂; the Huber penalty is applied to their difference, with each component d of r - r̂ contributing 0.5 d^2 when |d| ≤ δ and δ(|d| - 0.5 δ) otherwise, where δ is the Huber threshold. In the invention, the autoencoder encodes the input face mesh into the geometric representation r̂ and its decoder decodes this representation back into a 3D mesh; a multi-column multi-scale graph convolutional network (MGCN) is used to extract the geometric representation of each training mesh. By setting a mesh corresponding to each frame in the audio and obtaining the corresponding geometric representation with the autoencoder, the geometric representation effectively constrains the intermediate variable, so that the encoder output is closely related to the 3D geometric representation.
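A sketch of the Huber constraint between the intermediate variable r (from the speech encoder) and the geometric representation r̂ (from the mesh autoencoder); the threshold δ (delta below) is an assumed hyperparameter, since the text does not state its value.

```python
import tensorflow as tf

def huber_constraint(r, r_hat, delta=1.0):
    # r: intermediate variable from the speech encoder, r_hat: geometric representation
    diff = r - r_hat
    abs_diff = tf.abs(diff)
    quadratic = 0.5 * tf.square(diff)              # used where |r - r_hat| <= delta
    linear = delta * (abs_diff - 0.5 * delta)      # used where |r - r_hat| >  delta
    return tf.reduce_mean(tf.where(abs_diff <= delta, quadratic, linear))
```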
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the Hilbert-Schmidt independence criterion (HSIC) constraint, which measures nonlinear and higher-order correlations, makes it possible to estimate the dependence between representations without explicitly estimating the joint distribution of the random variables. Suppose there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size. A mapping φ(r) maps the intermediate variable r to a kernel space F, and a mapping ψ(r̂) maps r̂ to a kernel space G, with the inner products expressed through the kernel functions k_R(r_i, r_j) = <φ(r_i), φ(r_j)> and k_R̂(r̂_i, r̂_j) = <ψ(r̂_i), ψ(r̂_j)>. The Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(R, R̂) = ||C_{RR̂}||_HS^2;

where k_R and k_R̂ are the kernel functions, F and G are Hilbert spaces, and C_{RR̂} is the cross-covariance operator between the mapped variables, defined through expectations over R and R̂. Let Z = {(r_1, r̂_1), ..., (r_M, r̂_M)} be a batch of samples drawn from the joint distribution; the empirical estimate of HSIC is:

HSIC(Z) = (M - 1)^(-2) tr(K_1 H K_2 H);

where tr denotes the trace of a square matrix, K_1 and K_2 are the Gram matrices with entries k_1,ij = k_1(r_i, r_j) and k_2,ij = k_2(r̂_i, r̂_j), and H = I - (1/M) 1 1^T centers the Gram matrices so that they have zero mean in the feature space.
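A sketch of the empirical HSIC estimate HSIC = (M - 1)^(-2) tr(K_1 H K_2 H) over a batch of M intermediate variables R and geometric representations R̂. A Gaussian (RBF) kernel is assumed, since the text does not fix the kernel choice; and when this term is used as a constraint, dependence would presumably be maximized (for example by minimizing its negative), which is likewise an interpretation rather than something stated in the text.

```python
import tensorflow as tf

def rbf_kernel(x, sigma=1.0):
    # pairwise Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dist = tf.reduce_sum(tf.square(x[:, None, :] - x[None, :, :]), axis=-1)
    return tf.exp(-sq_dist / (2.0 * sigma ** 2))

def hsic(R, R_hat, sigma=1.0):
    # empirical HSIC: (M - 1)^{-2} tr(K1 H K2 H) over a batch of size M
    m = tf.shape(R)[0]
    m_f = tf.cast(m, tf.float32)
    K1 = rbf_kernel(R, sigma)          # K1_ij = k1(r_i, r_j)
    K2 = rbf_kernel(R_hat, sigma)      # K2_ij = k2(r_hat_i, r_hat_j)
    H = tf.eye(m) - 1.0 / m_f          # centering matrix I - (1/M) 1 1^T (via broadcasting)
    return tf.linalg.trace(K1 @ H @ K2 @ H) / tf.square(m_f - 1.0)
```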
As an improvement of the speech-driven three-dimensional face animation generation method in the present invention, the loss function of steps 1) to 4) comprises a reconstruction loss, a constraint loss, and a velocity loss, expressed as:

L = L_r + λ_1 L_c + λ_2 L_v;

where λ_1 and λ_2 are positive numbers that balance the loss terms, set to λ_1 = 0.1 and λ_2 = 10.0. L_r is the reconstruction loss, which computes the distance between the true and predicted vertex displacements. The constraint loss L_c obtains the 3D geometric representation from the mesh and constrains the existing intermediate variable using the Huber or Hilbert-Schmidt independence criterion. The velocity loss L_v ensures temporal stability.
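A sketch of the total objective L = L_r + λ_1 L_c + λ_2 L_v with λ_1 = 0.1 and λ_2 = 10.0 as stated above. The exact distance used for L_r and the exact form of the velocity term are not given in the text, so mean squared error and frame-to-frame difference matching are assumptions; the constraint term is passed in precomputed, for example from the Huber or HSIC sketches above.

```python
import tensorflow as tf

def total_loss(pred_disp, true_disp, constraint_loss, lambda1=0.1, lambda2=10.0):
    # pred_disp, true_disp: (T, N, 3) predicted / ground-truth vertex displacements over T frames;
    # constraint_loss: L_c, e.g. the Huber or HSIC term computed on the intermediate variables.
    l_r = tf.reduce_mean(tf.square(pred_disp - true_disp))   # reconstruction loss L_r (MSE assumed)
    pred_vel = pred_disp[1:] - pred_disp[:-1]                # predicted frame-to-frame motion
    true_vel = true_disp[1:] - true_disp[:-1]                # ground-truth frame-to-frame motion
    l_v = tf.reduce_mean(tf.square(pred_vel - true_vel))     # velocity loss L_v (assumed form)
    return l_r + lambda1 * constraint_loss + lambda2 * l_v   # L = L_r + lambda1 L_c + lambda2 L_v
```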
the invention also provides a voice-driven three-dimensional human face animation generation network structure which comprises an encoder, a decoder and 3D graphic geometric constraint on the intermediate variables.
The invention has the beneficial effects that: compared with traditional reconstruction methods, the method innovatively uses 3D geometric features to constrain the intermediate variable. In the encoder stage, tightly connected convolution layers are designed to enhance feature propagation and the reuse of audio features; in the decoder stage, an attention mechanism lets the network adaptively adjust the key regions; and a geometry-guided training strategy with two constraints from different angles is provided for the intermediate variable, achieving a stronger animation effect. In addition, the three-dimensional face animation generation network has high precision, the generated animation is more accurate and reasonable, and the network generalizes well.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a work flow diagram of the present invention;
FIG. 2 is a diagram of a network architecture model according to the present invention;
FIG. 3 is a schematic comparison, on the VOCASET dataset, between the reconstruction results of an embodiment of the present invention and other methods, showing the ground-truth input mesh, the result reconstructed by Cudeiro et al., the reconstruction estimated by the present invention, an error visualization of the Cudeiro et al. method, and an error visualization of the present invention.
Detailed Description
As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to. "substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem to substantially achieve the technical result.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "horizontal", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience in describing the present invention and simplifying the description; they do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
The present invention will be described in further detail below with reference to the accompanying drawings, but the present invention is not limited thereto.
Example 1
As shown in fig. 1 to 3, a method for generating a voice-driven three-dimensional human face animation includes the following steps:
1) extracting voice features with a DeepSpeech engine, converting the identity information of the voice into a one-hot vector, and embedding this one-hot vector into the feature matrix (a sketch of this step is given after this list);
2) mapping the characteristic matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) mapping the intermediate variable to a high-dimensional space of 3D vertex displacement by using a decoder, and carrying out 3D geometric constraint on the intermediate variable to obtain the displacement of a 3D space;
4) and driving the template to simulate the facial animation according to the acquired displacement of the 3D space.
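A minimal sketch of one plausible construction of the input feature matrix in step 1), in which the one-hot identity vector is tiled and concatenated to each row of the DeepSpeech window. Concatenation along the feature axis is an assumption: the text only states that the identity is embedded into the feature matrix.

```python
import numpy as np

def build_feature_matrix(deepspeech_window, speaker_id, num_speakers):
    # deepspeech_window: (W, D) array, e.g. W = 16 frames x D = 29 DeepSpeech characters
    one_hot = np.zeros(num_speakers, dtype=np.float32)
    one_hot[speaker_id] = 1.0
    identity = np.tile(one_hot, (deepspeech_window.shape[0], 1))    # (W, num_speakers)
    return np.concatenate([deepspeech_window, identity], axis=1)    # (W, D + num_speakers)
```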
Preferably, the main purpose of the encoder is to map the speech features to a latent representation, i.e. the intermediate variable. Unlike a plain convolution stack, the encoder uses four densely connected convolution layers: each convolution layer is downsampled and concatenated with the outputs of the earlier layers, and feature maps are obtained through a ReLU activation function, so that deep and shallow features can be combined effectively. Specifically, the i-th convolution layer receives the feature maps of all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps produced in layers 0 to i-1, and H_i denotes a composite function consisting of a convolution with a 3 × 1 filter and 2 × 1 stride followed by a ReLU activation. Generally, the number of feature maps doubles after each convolution layer; to keep the concatenation manageable, a pooling layer is added after each convolution layer to reduce the number of feature maps, which effectively reuses each convolution layer and makes the encoder's learning richer.
Preferably, the decoder comprises two fully connected layers with tanh activation functions and a final output layer, with an attention mechanism between the two fully connected layers so that the network emphasizes important information. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where δ denotes the ReLU function, σ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block. The output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication. The attention module lets the current input sample adaptively select important features and produces different attention responses for different inputs. Finally, the output layer is a fully connected layer with a linear activation function, producing an N × 3 output corresponding to the three-dimensional displacement vectors of the N vertices.
Preferably, the encoder-decoder structure described above can be viewed as a cross-modal process, and the intermediate variable r is referred to in this embodiment as the cross-modal representation: it captures a particular identity and the geometry of the deformation. A mesh autoencoder encodes an input face mesh into the geometric representation r̂ and decodes it back into a 3D mesh; the present invention uses a multi-column multi-scale graph convolutional network (MGCN) to extract the geometric representation of each training mesh. During training, a mesh is set to correspond to each frame in the audio, the corresponding geometric representation is obtained automatically with the autoencoder, and this representation is used to constrain the cross-modal representation, so that the encoder output is closely related to the 3D geometric representation. The Huber constraint and the Hilbert-Schmidt independence criterion constraint are adopted in the invention for this purpose.
Preferably, the Hilbert-Schmidt independence criterion (HSIC) constraint, which measures nonlinear and higher-order correlations, makes it possible to estimate the dependence between representations without explicitly estimating the joint distribution of the random variables. Suppose there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size. A mapping φ(r) maps the intermediate variable r to a kernel space F, and a mapping ψ(r̂) maps r̂ to a kernel space G, with the inner products expressed through the kernel functions k_R(r_i, r_j) = <φ(r_i), φ(r_j)> and k_R̂(r̂_i, r̂_j) = <ψ(r̂_i), ψ(r̂_j)>. The Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(R, R̂) = ||C_{RR̂}||_HS^2;

where k_R and k_R̂ are the kernel functions, F and G are Hilbert spaces, and C_{RR̂} is the cross-covariance operator between the mapped variables, defined through expectations over R and R̂. Let Z = {(r_1, r̂_1), ..., (r_M, r̂_M)} be a batch of samples drawn from the joint distribution; the empirical estimate of HSIC is:

HSIC(Z) = (M - 1)^(-2) tr(K_1 H K_2 H);

where tr denotes the trace of a square matrix, K_1 and K_2 are the Gram matrices with entries k_1,ij = k_1(r_i, r_j) and k_2,ij = k_2(r̂_i, r̂_j), and H = I - (1/M) 1 1^T centers the Gram matrices so that they have zero mean in the feature space.
in deep speech, the present embodiment adopts a speech feature of W ═ 16, D ═ 29, and sets the size of the intermediate variable to 64. As previously described, the network is divided into encoder and decoder sections, and the intermediate variables are 3D geometry constrained. The encoder has 4 layers of convolutions, a 3 x1 filter, 2 x1 steps of convolutions and a linear activation unit ReLU. The number of features is doubled after each convolutional layer, and in order to make the concatenation process smooth, a 2 × 1 pooling layer is added after each convolutional layer to reduce the number of features. The first two fully-connected layers of the decoder use the tanh activation function, and the final output layer is a fully-connected layer with linear activation function, which produces 5023 × 3 outputs, corresponding to 5023 vertex three-dimensional displacement vectors. The weights for this layer are initialized by the 50 PCA components calculated from the vertex displacements of the training data, and the bias is initialized by 0. The loss function comprises reconstruction loss, constraint loss and speed loss, and the expression is as follows:
L = L_r + λ_1 L_c + λ_2 L_v;
where λ_1 and λ_2 are positive numbers that balance the loss terms, set to λ_1 = 0.1 and λ_2 = 10.0. L_r is the reconstruction loss, which computes the distance between the true and predicted vertex displacements. The constraint loss L_c obtains the 3D geometric representation from the mesh and constrains the existing intermediate variable using the Huber or Hilbert-Schmidt independence criterion. The velocity loss L_v ensures temporal stability.
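A minimal sketch of a decoder matching the dimensions in this embodiment: two tanh fully connected layers with the attention block between them and a linear output layer producing the 5023 × 3 vertex displacements, whose weights can be initialized from 50 PCA components. The hidden widths (128 and the attention bottleneck of 32) and the choice of a 50-unit layer before the output, so that the PCA basis fits, are assumptions.

```python
import numpy as np
import tensorflow as tf

def build_decoder(latent_dim=64, n_vertices=5023, pca_basis=None):
    latent = tf.keras.Input(shape=(latent_dim,))
    x = tf.keras.layers.Dense(128, activation="tanh")(latent)      # first FC layer (width assumed)
    # attention block between the two FC layers: a = sigmoid(W2 relu(W1 x)), output = a * x
    a = tf.keras.layers.Dense(32, activation="relu")(x)
    a = tf.keras.layers.Dense(128, activation="sigmoid")(a)
    x = tf.keras.layers.Multiply()([x, a])
    x = tf.keras.layers.Dense(50, activation="tanh")(x)            # 50 units so the PCA init below fits
    output_layer = tf.keras.layers.Dense(n_vertices * 3)           # linear output: N x 3 displacements
    out = tf.keras.layers.Reshape((n_vertices, 3))(output_layer(x))
    model = tf.keras.Model(latent, out, name="decoder")
    if pca_basis is not None:                                      # pca_basis: (50, n_vertices * 3)
        output_layer.set_weights([pca_basis.astype(np.float32),
                                  np.zeros(n_vertices * 3, dtype=np.float32)])
    return model
```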
the invention is realized based on Tensorflow, and is operated in Adam optimizer with 0.9 momentum for the image card GTX1080Ti in great, and 50 stages are trained with a fixed learning rate of 1 e-4. For efficient training and testing, the present embodiment divides the 12 subjects into a training set, a validation set, and a test set. In addition, the remaining objects are also divided into 2 validation sets, 2 test sets. The training set includes all sentences of eight subjects. For the validation set and test set, 20 unique sentences were selected so that they were not shared with other objects. There is no overlap between the training, validation and test sets of subjects or sentences.
Example 2
A speech-driven three-dimensional human face animation generation network structure comprises an encoder, a decoder, and a 3D geometric constraint on the intermediate variable, where the encoder and decoder are those of Embodiment 1. The network is guided by 3D geometry: given a segment of speech as input, the network obtains a low-dimensional representation of the speech features through the encoder, constrains it with the 3D geometric representation, obtains the facial displacements in 3D space through the decoder, and generates the animation by driving the template.
The foregoing description shows and describes several preferred embodiments of the invention, but as aforementioned, it is to be understood that the invention is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept as expressed herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) extracting voice characteristics and embedding the identity information of the voice into a characteristic matrix;
2) mapping the characteristic matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) mapping the intermediate variable to a high-dimensional space of 3D vertex displacement by using a decoder, and carrying out 3D geometric constraint on the intermediate variable to obtain the displacement of a 3D space;
4) and driving the template to simulate the facial animation according to the acquired displacement of the 3D space.
2. The speech-driven three-dimensional face animation generation method according to claim 1, characterized in that: and in the step 1), a DeepSpeech engine is adopted to extract the voice features.
3. The method of claim 1, wherein the encoder comprises four convolution layers, and the i-th convolution layer receives the feature maps of all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps produced in layers 0 to i-1, and H_i denotes a composite function consisting of a convolution with a 3 × 1 filter and 2 × 1 stride followed by a ReLU linear activation unit.
4. A speech-driven three-dimensional face animation generation method according to claim 3, characterized in that: and adding a pooling layer after each convolution layer, wherein the number of the characteristic graphs is reduced through the pooling layer.
5. The method of claim 1, wherein the decoder comprises two fully connected layers with tanh activation functions and a final output layer, with an attention mechanism between the two fully connected layers; that is, let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps, and the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where δ denotes the ReLU function, σ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block; the output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication; the attention module lets the current input sample adaptively select important features and produces different attention responses for different inputs; and the output layer is a fully connected layer with a linear activation function that produces an N × 3 output corresponding to the three-dimensional displacement vectors of the N vertices.
6. The speech-driven three-dimensional face animation generation method according to claim 5, wherein: the weights of the output layer are initialized by 50 PCA components calculated from the vertex displacements of the training data, and the biases are initialized to 0.
7. The method of claim 1, wherein the 3D geometric constraint on the intermediate variable is applied by setting a mesh corresponding to each frame in the audio and automatically obtaining a corresponding geometric representation with an autoencoder, the geometric representation being used to constrain the intermediate variable, including a Huber constraint and a Hilbert-Schmidt independence criterion constraint, wherein the Huber constraint is expressed as: suppose there are two vectors r and r̂; the Huber penalty is applied to their difference, with each component d of r - r̂ contributing 0.5 d^2 when |d| ≤ δ and δ(|d| - 0.5 δ) otherwise, where δ is the Huber threshold.
8. The method of claim 7, wherein the Hilbert-Schmidt independence criterion constraint, which measures nonlinear and higher-order correlations, makes it possible to estimate the dependence between representations without explicitly estimating the joint distribution of the random variables; suppose there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size, a mapping φ(r) maps the intermediate variable r to a kernel space F, and a mapping ψ(r̂) maps r̂ to a kernel space G, with the inner products expressed through the kernel functions k_R(r_i, r_j) = <φ(r_i), φ(r_j)> and k_R̂(r̂_i, r̂_j) = <ψ(r̂_i), ψ(r̂_j)>; the Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(R, R̂) = ||C_{RR̂}||_HS^2;

where k_R and k_R̂ are the kernel functions, F and G are Hilbert spaces, and C_{RR̂} is the cross-covariance operator between the mapped variables, defined through expectations over R and R̂; letting Z = {(r_1, r̂_1), ..., (r_M, r̂_M)} be a batch of samples drawn from the joint distribution, the empirical estimate of HSIC is:

HSIC(Z) = (M - 1)^(-2) tr(K_1 H K_2 H);

where tr denotes the trace of a square matrix, K_1 and K_2 are the Gram matrices with entries k_1,ij = k_1(r_i, r_j) and k_2,ij = k_2(r̂_i, r̂_j), and H = I - (1/M) 1 1^T centers the Gram matrices so that they have zero mean in the feature space.
9. The method for generating a voice-driven three-dimensional human face animation according to claim 7, wherein the loss function of steps 1) to 4) comprises a reconstruction loss, a constraint loss, and a velocity loss, expressed as:

L = L_r + λ_1 L_c + λ_2 L_v;

where λ_1 and λ_2 are positive numbers that balance the loss terms, set to λ_1 = 0.1 and λ_2 = 10.0; L_r is the reconstruction loss, which computes the distance between the true and predicted vertex displacements; the constraint loss L_c obtains the 3D geometric representation from the mesh and constrains the existing intermediate variable using the Huber or Hilbert-Schmidt independence criterion; and the velocity loss L_v ensures temporal stability.
10. a voice-driven three-dimensional human face animation generation network structure is characterized in that: comprising an encoder as claimed in any one of claims 1 to 9, a decoder as claimed in any one of claims 1 to 9, and a constraint for 3D graphics geometry on the intermediate variables.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010387250.0A CN111724458B (en) | 2020-05-09 | 2020-05-09 | Voice-driven three-dimensional face animation generation method and network structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111724458A (en) | 2020-09-29
CN111724458B CN111724458B (en) | 2023-07-04 |
Family
ID=72564794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010387250.0A Active CN111724458B (en) | 2020-05-09 | 2020-05-09 | Voice-driven three-dimensional face animation generation method and network structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111724458B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109308731A (en) * | 2018-08-24 | 2019-02-05 | 浙江大学 | The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM |
CN110728219A (en) * | 2019-09-29 | 2020-01-24 | 天津大学 | 3D face generation method based on multi-column multi-scale graph convolution neural network |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113763532A (en) * | 2021-04-19 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Human-computer interaction method, device, equipment and medium based on three-dimensional virtual object |
CN113763532B (en) * | 2021-04-19 | 2024-01-19 | 腾讯科技(深圳)有限公司 | Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object |
CN113838174A (en) * | 2021-11-25 | 2021-12-24 | 之江实验室 | Audio-driven face animation generation method, device, equipment and medium |
CN113838174B (en) * | 2021-11-25 | 2022-06-10 | 之江实验室 | Audio-driven face animation generation method, device, equipment and medium |
CN114332315A (en) * | 2021-12-07 | 2022-04-12 | 北京百度网讯科技有限公司 | 3D video generation method, model training method and device |
CN114332315B (en) * | 2021-12-07 | 2022-11-08 | 北京百度网讯科技有限公司 | 3D video generation method, model training method and device |
CN114612594A (en) * | 2022-02-23 | 2022-06-10 | 上海暖叠网络科技有限公司 | Virtual character animation expression generation system and method |
WO2024124680A1 (en) * | 2022-12-16 | 2024-06-20 | 浙江大学 | Speech signal-driven personalized three-dimensional facial animation generation method, and application thereof |
WO2024164748A1 (en) * | 2023-02-09 | 2024-08-15 | 华南理工大学 | Three-dimensional facial animation generation method and apparatus based on audio driving, and medium |
CN116188649A (en) * | 2023-04-27 | 2023-05-30 | 科大讯飞股份有限公司 | Three-dimensional face model driving method based on voice and related device |
CN116188649B (en) * | 2023-04-27 | 2023-10-13 | 科大讯飞股份有限公司 | Three-dimensional face model driving method based on voice and related device |
Also Published As
Publication number | Publication date |
---|---|
CN111724458B (en) | 2023-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |