CN111724458A - Voice-driven three-dimensional human face animation generation method and network structure - Google Patents
- Publication number: CN111724458A (application CN202010387250.0A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06T 13/20: 3D [Three Dimensional] animation
- G06T 13/40: 3D animation of characters, e.g. humans, animals or virtual beings
- G06T 19/20: Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
Abstract
The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional face animation generation method, which comprises the following steps: 1) extracting voice features and embedding the identity information of the voice into a feature matrix; 2) mapping the feature matrix to a low-dimensional space through an encoder to obtain an intermediate variable; 3) mapping the intermediate variable to the high-dimensional space of 3D vertex displacements by using a decoder, while applying a 3D geometric constraint to the intermediate variable, to obtain the displacements in 3D space; 4) driving the template to simulate the facial animation according to the acquired 3D displacements. Compared with the prior art, the method innovatively uses 3D geometric features to constrain the intermediate variable; by introducing a nonlinear geometric representation and two constraints from different perspectives, the generated 3D facial expressions are more vivid and lifelike. In addition, the invention also provides a voice-driven three-dimensional human face animation generation network structure.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a voice-driven three-dimensional human face animation generation method and a network structure.
Background
Speech carries rich information; by using it to drive the expressions and motions of a face, animation with a distinctive speaking style that matches an individual's identity can be produced. Creating 3D facial animation that conforms to speech features has wide application in movies, games, augmented reality, and virtual reality. It is therefore important to understand the correlation between speech and facial deformation.
Speech-driven 3D facial animation can be classified as speaker-dependent or speaker-independent according to whether it generalizes across characters. Speaker-dependent animation mainly uses large amounts of data to learn a specific case and generate animation for a fixed individual. Current speaker-dependent methods generally generate video from high-quality motion capture data, generate video from a fixed speaker's voice and footage, or produce real-time facial animation with an end-to-end network, but these case-specific methods are inconvenient and cannot be applied broadly. Much current research is therefore directed at speaker-independent animation, for which the prior art mainly uses neural networks for effective feature learning. Examples include nonlinear mapping from phoneme labels to mouth movements (Taylor et al.: A deep learning approach for generalized speech animation. ACM Trans. Graph. 36, 93:1-93:11 (2017)); estimating the rotation and activation parameters of 3D blendshapes with a long short-term memory network (Pham et al.: Speech-driven 3D facial animation with implicit emotional awareness: A deep learning approach. 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2328-2336 (2017)); further learning an acoustic feature representation with an end-to-end network (Pham et al.: End-to-end learning for 3D facial animation from speech. In: ICMI '18 (2018)); a three-stage network used to animate cartoon characters (Zhou et al.: VisemeNet: audio-driven animator-centric speech animation. ACM Trans. Graph. 37, 161:1-161:10 (2018)); and a generic speech-driven 3D face framework trained on a proposed multi-subject 4D face dataset, which works across a range of identities (Cudeiro et al.: Capture, learning, and synthesis of 3D speaking styles. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)). However, none of these methods considers the effect of the geometric representation on speech-driven 3D facial animation.
In view of the above, there is a need for a new voice-driven three-dimensional face animation generation method.
Disclosure of Invention
The invention aims to: aiming at the defects of the prior art, the provided voice-driven three-dimensional face animation generation method realizes a speaker-independent, 3D-geometry-guided voice-driven face animation network; by introducing a nonlinear geometric representation and two constraints from different perspectives, the generated 3D facial expressions are more vivid and lifelike.
In order to achieve the purpose, the invention adopts the following technical scheme:
a voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) extracting voice characteristics and embedding the identity information of the voice into a characteristic matrix;
2) mapping the characteristic matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) mapping the intermediate variable to a high-dimensional space of 3D vertex displacement by using a decoder, and carrying out 3D geometric constraint on the intermediate variable to obtain the displacement of a 3D space;
4) and driving the template to simulate the facial animation according to the acquired displacement of the 3D space.
As an improvement of the voice-driven three-dimensional human face animation generation method, a DeepSpeech engine is adopted to extract voice features in the step 1).
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the encoder comprises four convolution layers, and the i-th convolution layer receives the feature maps of all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps produced in layers 0 to i-1, and H_i denotes a composite function consisting of a convolution with a 3 × 1 filter and 2 × 1 stride followed by a ReLU linear activation unit. The main purpose of the encoder in the present invention is to map the speech features to a latent representation, i.e. the intermediate variable. Unlike a plain convolution stack, the encoder uses four densely connected convolution layers, which allows deep and shallow features to be combined effectively.
As an improvement of the voice-driven three-dimensional face animation generation method, a pooling layer is added after each convolution layer, and the number of feature maps is reduced by the pooling layer. Generally, the number of feature maps doubles after each convolution layer; to keep the concatenation manageable, a pooling layer is added after each convolution layer to reduce the number of feature maps, so that each convolution layer is effectively reused and the encoder learns richer features.
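The following is a minimal sketch, not the patented implementation, of the densely connected encoder idea described above, written in TensorFlow/Keras. Stride-1 convolutions with "same" padding are used so that feature maps can be concatenated along the channel axis; the 2 × 1 strides, the pooling layers, the channel counts, and the final global pooling plus dense projection to a 64-dimensional intermediate variable are simplifications and assumptions.

```python
import tensorflow as tf

def dense_speech_encoder(window=16, feat_dim=29, latent_dim=64, n_layers=4, n_filters=32):
    # speech: one DeepSpeech feature window (W frames x D features, identity already appended)
    speech = tf.keras.Input(shape=(window, feat_dim))
    features = [speech]
    for _ in range(n_layers):
        # each layer sees the concatenation of all earlier feature maps: H_i([x_0, ..., x_{i-1}])
        x = features[0] if len(features) == 1 else tf.keras.layers.Concatenate(axis=-1)(features)
        x = tf.keras.layers.Conv1D(n_filters, kernel_size=3, padding="same", activation="relu")(x)
        features.append(x)
    pooled = tf.keras.layers.GlobalAveragePooling1D()(features[-1])
    latent = tf.keras.layers.Dense(latent_dim)(pooled)   # intermediate variable r
    return tf.keras.Model(speech, latent, name="dense_encoder")

# encoder = dense_speech_encoder()   # maps a (16, 29) feature window to a 64-d intermediate variable
```

The dense connectivity is what lets deeper layers see the shallow feature maps directly, which is the stated motivation for the design.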
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the decoder comprises two fully connected layers with tanh activation functions and a final output layer, with an attention mechanism between the two fully connected layers. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where δ denotes the ReLU function, σ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block. The output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication. The attention module lets the current input sample adaptively select important features and produces different attention responses for different inputs. The final output layer is a fully connected layer with a linear activation function; it produces an N × 3 output corresponding to the three-dimensional displacement vectors of the N vertices.
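A minimal sketch of such a feature-attention block, assuming the squeeze-and-excitation form a_i = σ(W_2 δ(W_1 x_i)) with δ = ReLU and σ = sigmoid; the reduction ratio of the hidden layer (4 here) is an assumed hyperparameter not stated in the text.

```python
import tensorflow as tf

class FeatureAttention(tf.keras.layers.Layer):
    """a_i = sigmoid(W2 relu(W1 x_i)); the output is the element-wise product a_i * x_i."""
    def __init__(self, channels, reduction=4, **kwargs):
        super().__init__(**kwargs)
        self.squeeze = tf.keras.layers.Dense(channels // reduction, activation="relu")   # W1 + ReLU
        self.excite = tf.keras.layers.Dense(channels, activation="sigmoid")              # W2 + sigmoid

    def call(self, x):
        attention = self.excite(self.squeeze(x))   # per-feature attention values a_i in (0, 1)
        return x * attention                       # re-weight the input features
```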
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the weights of the output layer are initialized by 50 PCA components calculated from the vertex displacements of the training data, and the biases are initialized to 0. PCA refers to principal component analysis; initializing the output layer in this way improves the stability of training the network model.
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the 3D geometric constraint on the intermediate variable is applied as follows: a mesh is set to correspond to each frame in the audio, and a corresponding geometric representation is obtained automatically with an autoencoder; this geometric representation is used to constrain the intermediate variable, including a Huber constraint and a Hilbert-Schmidt independence criterion constraint. The Huber constraint is expressed as: suppose there are two vectors r and r̂; the Huber penalty is applied to their difference, with each component d of r - r̂ contributing 0.5 d^2 when |d| ≤ δ and δ(|d| - 0.5 δ) otherwise, where δ is the Huber threshold. In the invention, the autoencoder encodes the input face mesh into the geometric representation r̂ and its decoder decodes this representation back into a 3D mesh; a multi-column multi-scale graph convolutional network (MGCN) is used to extract the geometric representation of each training mesh. By setting a mesh corresponding to each frame in the audio and obtaining the corresponding geometric representation with the autoencoder, the geometric representation effectively constrains the intermediate variable, so that the encoder output is closely related to the 3D geometric representation.
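A sketch of the Huber constraint between the intermediate variable r (from the speech encoder) and the geometric representation r̂ (from the mesh autoencoder); the threshold δ (delta below) is an assumed hyperparameter, since the text does not state its value.

```python
import tensorflow as tf

def huber_constraint(r, r_hat, delta=1.0):
    # r: intermediate variable from the speech encoder, r_hat: geometric representation
    diff = r - r_hat
    abs_diff = tf.abs(diff)
    quadratic = 0.5 * tf.square(diff)              # used where |r - r_hat| <= delta
    linear = delta * (abs_diff - 0.5 * delta)      # used where |r - r_hat| >  delta
    return tf.reduce_mean(tf.where(abs_diff <= delta, quadratic, linear))
```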
As an improvement to the speech-driven three-dimensional face animation generation method described in the present invention, the Hilbert-Schmidt independence criterion (HSIC) constraint, which measures nonlinear and higher-order correlations, makes it possible to estimate the dependence between representations without explicitly estimating the joint distribution of the random variables. Suppose there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size. A mapping φ(r) maps the intermediate variable r to a kernel space F, and a mapping ψ(r̂) maps r̂ to a kernel space G, with the inner products expressed through the kernel functions k_R(r_i, r_j) = <φ(r_i), φ(r_j)> and k_R̂(r̂_i, r̂_j) = <ψ(r̂_i), ψ(r̂_j)>. The Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(R, R̂) = ||C_{RR̂}||_HS^2;

where k_R and k_R̂ are the kernel functions, F and G are Hilbert spaces, and C_{RR̂} is the cross-covariance operator between the mapped variables, defined through expectations over R and R̂. Let Z = {(r_1, r̂_1), ..., (r_M, r̂_M)} be a batch of samples drawn from the joint distribution; the empirical estimate of HSIC is:

HSIC(Z) = (M - 1)^(-2) tr(K_1 H K_2 H);

where tr denotes the trace of a square matrix, K_1 and K_2 are the Gram matrices with entries k_1,ij = k_1(r_i, r_j) and k_2,ij = k_2(r̂_i, r̂_j), and H = I - (1/M) 1 1^T centers the Gram matrices so that they have zero mean in the feature space.
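A sketch of the empirical HSIC estimate HSIC = (M - 1)^(-2) tr(K_1 H K_2 H) over a batch of M intermediate variables R and geometric representations R̂. A Gaussian (RBF) kernel is assumed, since the text does not fix the kernel choice; and when this term is used as a constraint, dependence would presumably be maximized (for example by minimizing its negative), which is likewise an interpretation rather than something stated in the text.

```python
import tensorflow as tf

def rbf_kernel(x, sigma=1.0):
    # pairwise Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dist = tf.reduce_sum(tf.square(x[:, None, :] - x[None, :, :]), axis=-1)
    return tf.exp(-sq_dist / (2.0 * sigma ** 2))

def hsic(R, R_hat, sigma=1.0):
    # empirical HSIC: (M - 1)^{-2} tr(K1 H K2 H) over a batch of size M
    m = tf.shape(R)[0]
    m_f = tf.cast(m, tf.float32)
    K1 = rbf_kernel(R, sigma)          # K1_ij = k1(r_i, r_j)
    K2 = rbf_kernel(R_hat, sigma)      # K2_ij = k2(r_hat_i, r_hat_j)
    H = tf.eye(m) - 1.0 / m_f          # centering matrix I - (1/M) 1 1^T (via broadcasting)
    return tf.linalg.trace(K1 @ H @ K2 @ H) / tf.square(m_f - 1.0)
```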
As an improvement of the speech-driven three-dimensional face animation generation method in the present invention, the loss function of steps 1) to 4) comprises a reconstruction loss, a constraint loss, and a velocity loss, expressed as:

L = L_r + λ_1 L_c + λ_2 L_v;

where λ_1 and λ_2 are positive numbers that balance the loss terms, set to λ_1 = 0.1 and λ_2 = 10.0. L_r is the reconstruction loss, which computes the distance between the true and predicted vertex displacements. The constraint loss L_c obtains the 3D geometric representation from the mesh and constrains the existing intermediate variable using the Huber or Hilbert-Schmidt independence criterion. The velocity loss L_v ensures temporal stability.
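A sketch of the total objective L = L_r + λ_1 L_c + λ_2 L_v with λ_1 = 0.1 and λ_2 = 10.0 as stated above. The exact distance used for L_r and the exact form of the velocity term are not given in the text, so mean squared error and frame-to-frame difference matching are assumptions; the constraint term is passed in precomputed, for example from the Huber or HSIC sketches above.

```python
import tensorflow as tf

def total_loss(pred_disp, true_disp, constraint_loss, lambda1=0.1, lambda2=10.0):
    # pred_disp, true_disp: (T, N, 3) predicted / ground-truth vertex displacements over T frames;
    # constraint_loss: L_c, e.g. the Huber or HSIC term computed on the intermediate variables.
    l_r = tf.reduce_mean(tf.square(pred_disp - true_disp))   # reconstruction loss L_r (MSE assumed)
    pred_vel = pred_disp[1:] - pred_disp[:-1]                # predicted frame-to-frame motion
    true_vel = true_disp[1:] - true_disp[:-1]                # ground-truth frame-to-frame motion
    l_v = tf.reduce_mean(tf.square(pred_vel - true_vel))     # velocity loss L_v (assumed form)
    return l_r + lambda1 * constraint_loss + lambda2 * l_v   # L = L_r + lambda1 L_c + lambda2 L_v
```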
the invention also provides a voice-driven three-dimensional human face animation generation network structure which comprises an encoder, a decoder and 3D graphic geometric constraint on the intermediate variables.
The invention has the beneficial effects that: compared with traditional reconstruction methods, the method innovatively uses 3D geometric features to constrain the intermediate variable. In the encoder stage, tightly connected convolution layers are designed to enhance feature propagation and the reuse of audio features; in the decoder stage, an attention mechanism lets the network adaptively adjust the key regions; and a geometry-guided training strategy with two constraints from different angles is provided for the intermediate variable, achieving a stronger animation effect. In addition, the three-dimensional face animation generation network has high precision, the generated animation is more accurate and reasonable, and the network generalizes well.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a work flow diagram of the present invention;
FIG. 2 is a diagram of a network architecture model according to the present invention;
FIG. 3 is a schematic comparison, on the VOCASET dataset, between the reconstruction results of an embodiment of the present invention and other methods, showing the ground-truth input mesh, the result reconstructed by Cudeiro et al., the reconstruction estimated by the present invention, an error visualization of the Cudeiro et al. method, and an error visualization of the present invention.
Detailed Description
As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. This specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to. "substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem to substantially achieve the technical result.
In the description of the present invention, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "horizontal", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience in describing the present invention and simplifying the description; they do not indicate or imply that the referred device or element must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
The present invention will be described in further detail below with reference to the accompanying drawings, but the present invention is not limited thereto.
Example 1
As shown in fig. 1 to 3, a method for generating a voice-driven three-dimensional human face animation includes the following steps:
1) extracting voice features with a DeepSpeech engine, converting the identity information of the voice into a one-hot vector, and embedding this one-hot vector into the feature matrix (a sketch of this step is given after this list);
2) mapping the characteristic matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) mapping the intermediate variable to a high-dimensional space of 3D vertex displacement by using a decoder, and carrying out 3D geometric constraint on the intermediate variable to obtain the displacement of a 3D space;
4) and driving the template to simulate the facial animation according to the acquired displacement of the 3D space.
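A minimal sketch of one plausible construction of the input feature matrix in step 1), in which the one-hot identity vector is tiled and concatenated to each row of the DeepSpeech window. Concatenation along the feature axis is an assumption: the text only states that the identity is embedded into the feature matrix.

```python
import numpy as np

def build_feature_matrix(deepspeech_window, speaker_id, num_speakers):
    # deepspeech_window: (W, D) array, e.g. W = 16 frames x D = 29 DeepSpeech characters
    one_hot = np.zeros(num_speakers, dtype=np.float32)
    one_hot[speaker_id] = 1.0
    identity = np.tile(one_hot, (deepspeech_window.shape[0], 1))    # (W, num_speakers)
    return np.concatenate([deepspeech_window, identity], axis=1)    # (W, D + num_speakers)
```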
Preferably, the main purpose of the encoder is to map the speech features to a latent representation, i.e. the intermediate variable. Unlike a plain convolution stack, the encoder uses four densely connected convolution layers: each convolution layer is downsampled and concatenated with the outputs of the earlier layers, and feature maps are obtained through a ReLU activation function, so that deep and shallow features can be combined effectively. Specifically, the i-th convolution layer receives the feature maps of all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps produced in layers 0 to i-1, and H_i denotes a composite function consisting of a convolution with a 3 × 1 filter and 2 × 1 stride followed by a ReLU activation. Generally, the number of feature maps doubles after each convolution layer; to keep the concatenation manageable, a pooling layer is added after each convolution layer to reduce the number of feature maps, which effectively reuses each convolution layer and makes the encoder's learning richer.
Preferably, the decoder comprises two fully connected layers with tanh activation functions and a final output layer, with an attention mechanism between the two fully connected layers so that the network emphasizes important information. Let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps; the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where δ denotes the ReLU function, σ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block. The output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication. The attention module lets the current input sample adaptively select important features and produces different attention responses for different inputs. Finally, the output layer is a fully connected layer with a linear activation function, producing an N × 3 output corresponding to the three-dimensional displacement vectors of the N vertices.
Preferably, the encoder-decoder structure described above can be viewed as a cross-modal process, and the intermediate variable r is referred to in this embodiment as the cross-modal representation: it captures a particular identity and the geometry of the deformation. A mesh autoencoder encodes an input face mesh into the geometric representation r̂ and decodes it back into a 3D mesh; the present invention uses a multi-column multi-scale graph convolutional network (MGCN) to extract the geometric representation of each training mesh. During training, a mesh is set to correspond to each frame in the audio, the corresponding geometric representation is obtained automatically with the autoencoder, and this representation is used to constrain the cross-modal representation, so that the encoder output is closely related to the 3D geometric representation. The Huber constraint and the Hilbert-Schmidt independence criterion constraint are adopted in the invention for this purpose.
Preferably, the Hilbert-Schmidt independence criterion (HSIC) constraint, which measures nonlinear and higher-order correlations, makes it possible to estimate the dependence between representations without explicitly estimating the joint distribution of the random variables. Suppose there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size. A mapping φ(r) maps the intermediate variable r to a kernel space F, and a mapping ψ(r̂) maps r̂ to a kernel space G, with the inner products expressed through the kernel functions k_R(r_i, r_j) = <φ(r_i), φ(r_j)> and k_R̂(r̂_i, r̂_j) = <ψ(r̂_i), ψ(r̂_j)>. The Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(R, R̂) = ||C_{RR̂}||_HS^2;

where k_R and k_R̂ are the kernel functions, F and G are Hilbert spaces, and C_{RR̂} is the cross-covariance operator between the mapped variables, defined through expectations over R and R̂. Let Z = {(r_1, r̂_1), ..., (r_M, r̂_M)} be a batch of samples drawn from the joint distribution; the empirical estimate of HSIC is:

HSIC(Z) = (M - 1)^(-2) tr(K_1 H K_2 H);

where tr denotes the trace of a square matrix, K_1 and K_2 are the Gram matrices with entries k_1,ij = k_1(r_i, r_j) and k_2,ij = k_2(r̂_i, r̂_j), and H = I - (1/M) 1 1^T centers the Gram matrices so that they have zero mean in the feature space.
in deep speech, the present embodiment adopts a speech feature of W ═ 16, D ═ 29, and sets the size of the intermediate variable to 64. As previously described, the network is divided into encoder and decoder sections, and the intermediate variables are 3D geometry constrained. The encoder has 4 layers of convolutions, a 3 x1 filter, 2 x1 steps of convolutions and a linear activation unit ReLU. The number of features is doubled after each convolutional layer, and in order to make the concatenation process smooth, a 2 × 1 pooling layer is added after each convolutional layer to reduce the number of features. The first two fully-connected layers of the decoder use the tanh activation function, and the final output layer is a fully-connected layer with linear activation function, which produces 5023 × 3 outputs, corresponding to 5023 vertex three-dimensional displacement vectors. The weights for this layer are initialized by the 50 PCA components calculated from the vertex displacements of the training data, and the bias is initialized by 0. The loss function comprises reconstruction loss, constraint loss and speed loss, and the expression is as follows:
L = L_r + λ_1 L_c + λ_2 L_v;
where λ_1 and λ_2 are positive numbers that balance the loss terms, set to λ_1 = 0.1 and λ_2 = 10.0. L_r is the reconstruction loss, which computes the distance between the true and predicted vertex displacements. The constraint loss L_c obtains the 3D geometric representation from the mesh and constrains the existing intermediate variable using the Huber or Hilbert-Schmidt independence criterion. The velocity loss L_v ensures temporal stability.
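A minimal sketch of a decoder matching the dimensions in this embodiment: two tanh fully connected layers with the attention block between them and a linear output layer producing the 5023 × 3 vertex displacements, whose weights can be initialized from 50 PCA components. The hidden widths (128 and the attention bottleneck of 32) and the choice of a 50-unit layer before the output, so that the PCA basis fits, are assumptions.

```python
import numpy as np
import tensorflow as tf

def build_decoder(latent_dim=64, n_vertices=5023, pca_basis=None):
    latent = tf.keras.Input(shape=(latent_dim,))
    x = tf.keras.layers.Dense(128, activation="tanh")(latent)      # first FC layer (width assumed)
    # attention block between the two FC layers: a = sigmoid(W2 relu(W1 x)), output = a * x
    a = tf.keras.layers.Dense(32, activation="relu")(x)
    a = tf.keras.layers.Dense(128, activation="sigmoid")(a)
    x = tf.keras.layers.Multiply()([x, a])
    x = tf.keras.layers.Dense(50, activation="tanh")(x)            # 50 units so the PCA init below fits
    output_layer = tf.keras.layers.Dense(n_vertices * 3)           # linear output: N x 3 displacements
    out = tf.keras.layers.Reshape((n_vertices, 3))(output_layer(x))
    model = tf.keras.Model(latent, out, name="decoder")
    if pca_basis is not None:                                      # pca_basis: (50, n_vertices * 3)
        output_layer.set_weights([pca_basis.astype(np.float32),
                                  np.zeros(n_vertices * 3, dtype=np.float32)])
    return model
```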
the invention is realized based on Tensorflow, and is operated in Adam optimizer with 0.9 momentum for the image card GTX1080Ti in great, and 50 stages are trained with a fixed learning rate of 1 e-4. For efficient training and testing, the present embodiment divides the 12 subjects into a training set, a validation set, and a test set. In addition, the remaining objects are also divided into 2 validation sets, 2 test sets. The training set includes all sentences of eight subjects. For the validation set and test set, 20 unique sentences were selected so that they were not shared with other objects. There is no overlap between the training, validation and test sets of subjects or sentences.
Example 2
A speech-driven three-dimensional human face animation generation network structure comprises an encoder, a decoder, and a 3D geometric constraint on the intermediate variable, where the encoder and decoder are those of Embodiment 1. The network is guided by 3D geometry: given a segment of speech as input, the network obtains a low-dimensional representation of the speech features through the encoder, constrains it with the 3D geometric representation, obtains the facial displacements in 3D space through the decoder, and generates the animation by driving the template.
The foregoing description shows and describes several preferred embodiments of the invention, but as aforementioned, it is to be understood that the invention is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept as expressed herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A voice-driven three-dimensional face animation generation method is characterized by comprising the following steps:
1) extracting voice characteristics and embedding the identity information of the voice into a characteristic matrix;
2) mapping the characteristic matrix to a low-dimensional space through an encoder to obtain an intermediate variable;
3) mapping the intermediate variable to a high-dimensional space of 3D vertex displacement by using a decoder, and carrying out 3D geometric constraint on the intermediate variable to obtain the displacement of a 3D space;
4) and driving the template to simulate the facial animation according to the acquired displacement of the 3D space.
2. The speech-driven three-dimensional face animation generation method according to claim 1, characterized in that: and in the step 1), a DeepSpeech engine is adopted to extract the voice features.
3. The method of claim 1, wherein the encoder comprises four convolution layers, and the i-th convolution layer receives the feature maps of all preceding layers x_0, ..., x_{i-1} as input:

x_i = H_i([x_0, x_1, ..., x_{i-1}]);

where [x_0, x_1, ..., x_{i-1}] denotes the concatenation of the feature maps produced in layers 0 to i-1, and H_i denotes a composite function consisting of a convolution with a 3 × 1 filter and 2 × 1 stride followed by a ReLU linear activation unit.
4. A speech-driven three-dimensional face animation generation method according to claim 3, characterized in that: and adding a pooling layer after each convolution layer, wherein the number of the characteristic graphs is reduced through the pooling layer.
5. The method of claim 1, wherein the decoder comprises two fully connected layers with tanh activation functions and a final output layer, with an attention mechanism between the two fully connected layers; that is, let x_i ∈ R^C denote the input of the attention layer, where C is the number of feature maps, and the attention value a_i can be expressed as:

a_i = σ(W_2 δ(W_1 x_i));

where δ denotes the ReLU function, σ denotes the sigmoid function, and W_1 and W_2 denote the weights of the attention block; the output of the attention layer is:

x̃_i = a_i ⊙ x_i;

where ⊙ denotes element-wise multiplication; the attention module lets the current input sample adaptively select important features and produces different attention responses for different inputs; and the output layer is a fully connected layer with a linear activation function that produces an N × 3 output corresponding to the three-dimensional displacement vectors of the N vertices.
6. The speech-driven three-dimensional face animation generation method according to claim 5, wherein: the weights of the output layer are initialized by 50 PCA components calculated from the vertex displacements of the training data, and the biases are initialized to 0.
7. The method of claim 1, wherein the 3D geometric constraint on the intermediate variable is applied by setting a mesh corresponding to each frame in the audio and automatically obtaining a corresponding geometric representation with an autoencoder, the geometric representation being used to constrain the intermediate variable, including a Huber constraint and a Hilbert-Schmidt independence criterion constraint, wherein the Huber constraint is expressed as: suppose there are two vectors r and r̂; the Huber penalty is applied to their difference, with each component d of r - r̂ contributing 0.5 d^2 when |d| ≤ δ and δ(|d| - 0.5 δ) otherwise, where δ is the Huber threshold.
8. The method of claim 7, wherein the Hilbert-Schmidt independence criterion constraint, which measures nonlinear and higher-order correlations, makes it possible to estimate the dependence between representations without explicitly estimating the joint distribution of the random variables; suppose there are two variables R = [r_1, ..., r_i, ..., r_M] and R̂ = [r̂_1, ..., r̂_i, ..., r̂_M], where M is the batch size, a mapping φ(r) maps the intermediate variable r to a kernel space F, and a mapping ψ(r̂) maps r̂ to a kernel space G, with the inner products expressed through the kernel functions k_R(r_i, r_j) = <φ(r_i), φ(r_j)> and k_R̂(r̂_i, r̂_j) = <ψ(r̂_i), ψ(r̂_j)>; the Hilbert-Schmidt independence criterion constraint is expressed as:

HSIC(R, R̂) = ||C_{RR̂}||_HS^2;

where k_R and k_R̂ are the kernel functions, F and G are Hilbert spaces, and C_{RR̂} is the cross-covariance operator between the mapped variables, defined through expectations over R and R̂; letting Z = {(r_1, r̂_1), ..., (r_M, r̂_M)} be a batch of samples drawn from the joint distribution, the empirical estimate of HSIC is:

HSIC(Z) = (M - 1)^(-2) tr(K_1 H K_2 H);

where tr denotes the trace of a square matrix, K_1 and K_2 are the Gram matrices with entries k_1,ij = k_1(r_i, r_j) and k_2,ij = k_2(r̂_i, r̂_j), and H = I - (1/M) 1 1^T centers the Gram matrices so that they have zero mean in the feature space.
9. The method for generating a voice-driven three-dimensional human face animation according to claim 7, wherein the loss function of steps 1) to 4) comprises a reconstruction loss, a constraint loss, and a velocity loss, expressed as:

L = L_r + λ_1 L_c + λ_2 L_v;

where λ_1 and λ_2 are positive numbers that balance the loss terms, set to λ_1 = 0.1 and λ_2 = 10.0; L_r is the reconstruction loss, which computes the distance between the true and predicted vertex displacements; the constraint loss L_c obtains the 3D geometric representation from the mesh and constrains the existing intermediate variable using the Huber or Hilbert-Schmidt independence criterion; and the velocity loss L_v ensures temporal stability.
10. a voice-driven three-dimensional human face animation generation network structure is characterized in that: comprising an encoder as claimed in any one of claims 1 to 9, a decoder as claimed in any one of claims 1 to 9, and a constraint for 3D graphics geometry on the intermediate variables.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010387250.0A CN111724458B (en) | 2020-05-09 | 2020-05-09 | Voice-driven three-dimensional face animation generation method and network structure |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111724458A (en) | 2020-09-29
CN111724458B CN111724458B (en) | 2023-07-04 |
Family
ID=72564794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010387250.0A Active CN111724458B (en) | 2020-05-09 | 2020-05-09 | Voice-driven three-dimensional face animation generation method and network structure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111724458B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109308731A (en) * | 2018-08-24 | 2019-02-05 | 浙江大学 | The synchronous face video composition algorithm of the voice-driven lip of concatenated convolutional LSTM |
CN110728219A (en) * | 2019-09-29 | 2020-01-24 | 天津大学 | 3D face generation method based on multi-column multi-scale graph convolution neural network |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113763532A (en) * | 2021-04-19 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Human-computer interaction method, device, equipment and medium based on three-dimensional virtual object |
CN113763532B (en) * | 2021-04-19 | 2024-01-19 | 腾讯科技(深圳)有限公司 | Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object |
CN113838174A (en) * | 2021-11-25 | 2021-12-24 | 之江实验室 | Audio-driven face animation generation method, device, equipment and medium |
CN113838174B (en) * | 2021-11-25 | 2022-06-10 | 之江实验室 | Audio-driven face animation generation method, device, equipment and medium |
CN114332315A (en) * | 2021-12-07 | 2022-04-12 | 北京百度网讯科技有限公司 | 3D video generation method, model training method and device |
CN114332315B (en) * | 2021-12-07 | 2022-11-08 | 北京百度网讯科技有限公司 | 3D video generation method, model training method and device |
CN114612594A (en) * | 2022-02-23 | 2022-06-10 | 上海暖叠网络科技有限公司 | Virtual character animation expression generation system and method |
WO2024124680A1 (en) * | 2022-12-16 | 2024-06-20 | 浙江大学 | Speech signal-driven personalized three-dimensional facial animation generation method, and application thereof |
WO2024164748A1 (en) * | 2023-02-09 | 2024-08-15 | 华南理工大学 | Three-dimensional facial animation generation method and apparatus based on audio driving, and medium |
CN116188649A (en) * | 2023-04-27 | 2023-05-30 | 科大讯飞股份有限公司 | Three-dimensional face model driving method based on voice and related device |
CN116188649B (en) * | 2023-04-27 | 2023-10-13 | 科大讯飞股份有限公司 | Three-dimensional face model driving method based on voice and related device |
Also Published As
Publication number | Publication date |
---|---|
CN111724458B (en) | 2023-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |