CN112215926A - Voice-driven human face action real-time transfer method and system - Google Patents
- Publication number: CN112215926A
- Application number: CN202011027777.9A
- Authority: CN (China)
- Prior art keywords: audio, dimensional, frame, audio features, human face
- Prior art date: 2020-09-28
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a voice-driven method and system for real-time transfer of human face actions, wherein the method comprises the following steps: inputting an audio sequence of a source character; estimating an audio signal characterization for each frame in the audio sequence; driving a three-dimensional face model to act according to the estimated characterization of each audio frame; acquiring a target video frame; predicting the human face action on the target video frame image based on the driven three-dimensional face model to obtain a face action prediction result; and synthesizing the prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions. The invention greatly improves the realism of the driven face action, greatly reduces the complexity of the face-driving algorithm, and effectively guarantees the real-time performance of driving the face action.
Description
Technical Field
The invention relates to the technical field of face action driving, and in particular to a voice-driven face action real-time transfer method and system.
Background
Voice-driven human face animation is a research hotspot in the field of animation simulation. Its technical core is driving a face model animation with externally input voice information. The currently popular technique establishes a correspondence between voice information and face animation videos: all face animation videos are stored in a face animation material library; the externally input voice information is recognized; the face animation video corresponding to the recognized voice is matched from the library according to the established correspondence; and the matched video is called up directly and displayed to the user. This method cannot achieve real-time voice-driven face animation.
In addition, although some existing voice-driven face animation methods guarantee real-time face driving to a certain extent, their algorithms are complex, the real-time effect is not ideal, and the fidelity of the driven face is poor, so they cannot meet application requirements.
Disclosure of Invention
The invention aims to provide a voice-driven real-time transfer method and system for human face actions, so as to solve the above technical problems.
In order to achieve the purpose, the invention adopts the following technical scheme:
the method for transferring the human face action driven by the voice in real time comprises the following steps:
inputting an audio sequence of a source character;
estimating an audio signal characterization for each frame in the audio sequence;
driving a three-dimensional face model to act according to the estimated audio signal characterization of each audio frame;
acquiring a target video frame;
predicting the human face action on the target video frame image based on the driven three-dimensional face model to obtain a human face action prediction result;
and synthesizing the face action prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions.
Preferably, the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
Preferably, the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 × 29, where the number "16" represents a time window in which each frame of audio contains 16 audio features;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
Preferably, the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers cascaded in sequence; the input 16 × 29-dimensional audio features pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
Preferably, the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
The invention also provides a voice-driven human face action real-time transfer system capable of implementing the above real-time transfer method. The system comprises:
the audio sequence input module is used for inputting an audio sequence of a source character;
the audio signal expression estimation module is connected with the audio sequence input module and used for estimating the representation of the audio signal of each frame in the audio sequence;
the model action driving module is connected with the audio signal expression estimation module and used for driving a three-dimensional face model action according to the representation of the audio signal of each audio frame;
the target video frame acquisition module is used for acquiring a target video frame;
the target frame human face action prediction module is respectively connected with the model action driving module and the target video frame acquisition module and is used for predicting human face actions on the target video frame images based on the driven three-dimensional face model to obtain a human face action prediction result;
and the face action transfer module is connected with the target frame face action prediction module and used for synthesizing the face action prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions.
Preferably, the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
Preferably, the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 × 29, where the number "16" represents a time window in which each frame of audio contains 16 audio features;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
Preferably, the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers cascaded in sequence; the input 16 × 29-dimensional audio features pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
Preferably, the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
The invention first drives the three-dimensional face model to act through the estimated audio signal characterization, then predicts the face action of the target frame image based on the driven three-dimensional face model, and synthesizes the prediction result onto the target frame image, realizing real-time transfer of voice-driven face actions.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a step diagram of a voice-driven real-time transfer method for human face actions according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a voice-driven real-time human face motion transfer system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the architecture of the FacialSpeech speech recognition framework employed in the present invention;
fig. 4 is a schematic diagram of the invention for realizing real-time transfer of human face actions based on voice driving.
Detailed Description
The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.
The drawings are for the purpose of illustration only, do not depict actual form, and are not to be construed as limiting the present patent; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if the terms "upper", "lower", "left", "right", "inner", "outer", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not indicated or implied that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and the specific meanings of the terms may be understood by those skilled in the art according to specific situations.
In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" or the like, if appearing to indicate a connection relationship between the components, is to be understood broadly, for example, as being fixed or detachable or integral; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or may be connected through one or more other components or may be in an interactive relationship with one another. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
An embodiment of the present invention provides a method for transferring a face action in real time driven by voice, as shown in fig. 1, including the following steps:
step S1, inputting an audio sequence of a source character;
step S2, estimating an audio signal characterization (an expression characterizing the audio features) of each frame in the audio sequence;
step S3, driving a three-dimensional face model to act according to the estimated audio signal characterization of each audio frame (a schematic diagram of the three-dimensional face model appears in fig. 4);
step S4, acquiring a target video frame;
step S5, based on the driven three-dimensional face model, predicting the face action on the target video frame image to obtain the face action prediction result;
and step S6, synthesizing the face action prediction result onto the corresponding frame image in the target video, realizing real-time transfer of voice-driven face actions.
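To make the flow of steps S1 to S6 concrete, a minimal Python sketch of how the pipeline might be wired together is given below; the helper names (audio_net, face_model, and so on) are illustrative assumptions and not part of the claimed method.

```python
# Illustrative sketch of steps S1-S6; all helper names are assumptions.
def transfer_face_actions(audio_frames, target_frames, audio_net, face_model):
    """Drive a 3D face model from per-frame audio and transfer its action
    onto the corresponding target video frames."""
    output_frames = []
    for audio_frame, target_frame in zip(audio_frames, target_frames):
        # S2: estimate the audio signal characterization of this frame
        characterization = audio_net(audio_frame)            # e.g. a 32-dim vector
        # S3: drive the three-dimensional face model with the characterization
        face_model.drive(characterization)
        # S5: predict the face action on the target frame from the driven model
        prediction = face_model.predict_action(target_frame)
        # S6: synthesize the prediction onto the target frame image
        output_frames.append(face_model.composite(prediction, target_frame))
    return output_frames
```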
In step S2, the present invention estimates the audio signal characterization of each frame in the audio sequence based on the FacialSpeech speech recognition framework. FacialSpeech is a speech recognition system developed by Baidu in China. The present invention improves the network architecture for estimating the audio signal characterization on the basis of the FacialSpeech framework, and first sets the feature dimension of each frame of audio in the input audio sequence to 16 × 29, where
the number "16" represents a time window containing 16 audio features per frame of audio;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
Referring to fig. 3, the improved FacialSpeech speech recognition framework of the present invention includes 4 convolutional layers and 3 fully-connected layers cascaded in sequence; the input 16 × 29-dimensional audio features pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
The convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
The specific estimation process of the audio signal characterization is not elaborated herein.
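Although the estimation process is not elaborated, a minimal PyTorch sketch of a network with the stated dimensions is given below. With kernel size 3 and stride 2, a padding of 1 reproduces the 16 → 8 → 4 → 2 → 1 length reduction exactly; the padding and the ReLU activations are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class AudioCharacterizationNet(nn.Module):
    """Sketch of the described 4-conv / 3-FC audio network.
    Input:  (batch, 29, 16) -- 16 time steps of 29-dim alphabet features.
    Output: (batch, 32)     -- the audio characterization vector."""
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(29, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16x29 -> 8x32
            nn.Conv1d(32, 32, 3, stride=2, padding=1), nn.ReLU(),  # 8x32  -> 4x32
            nn.Conv1d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 4x32  -> 2x64
            nn.Conv1d(64, 64, 3, stride=2, padding=1), nn.ReLU(),  # 2x64  -> 1x64
        )
        self.fcs = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),  # 64 -> 128
            nn.Linear(128, 64), nn.ReLU(),  # 128 -> 64
            nn.Linear(64, 32),              # 64 -> 32 characterization vector
        )

    def forward(self, x):
        x = self.convs(x)              # (batch, 64, 1)
        return self.fcs(x.flatten(1))  # (batch, 32)
```

A quick shape check: `AudioCharacterizationNet()(torch.randn(1, 29, 16)).shape` yields `torch.Size([1, 32])`, matching the 32-length characterization vector described above.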
In step S3, a preset three-dimensional face model is driven to act based on the audio signal characterization, which is beneficial both to improving the fidelity of the target video frame's face action and to reducing the complexity of the face action synthesis algorithm. If the audio signal characterization (the expression representing the audio features) were estimated incorrectly, a directly driven face in the target video frame could look unrealistic or even distorted; the invention therefore first drives the three-dimensional face model, and transfers the model's facial action onto the target video frame image with the fidelity guaranteed, which improves the realism of the face action in the target frame. Moreover, an algorithm that drives the target frame's face directly is considerably more complex and would impair the real-time performance of the face driving.
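The patent leaves the driving mechanism of step S3 open; one common realization, sketched below purely as an assumption, is to regress blendshape weights from the 32-dim characterization and deform the model's neutral geometry with a linear blendshape model.

```python
import numpy as np

def drive_face_model(characterization, W, b, neutral_vertices, blendshapes):
    """Hypothetical step-S3 driver; the regression and blendshape model are
    assumptions, not the patented mechanism.
    W: (K, 32) learned regression matrix (assumed pre-trained), b: (K,)
    neutral_vertices: (V, 3) neutral 3D face geometry
    blendshapes: (K, V, 3) per-blendshape vertex offsets."""
    weights = np.clip(W @ characterization + b, 0.0, 1.0)   # K blendshape weights
    # Linear blendshape model: neutral face plus weighted expression offsets.
    return neutral_vertices + np.tensordot(weights, blendshapes, axes=1)
```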
Referring to fig. 4, the principle of transferring the facial action of the three-dimensional face model to the target video frame image is briefly explained below:
a face region is extracted from the target video frame image based on the three-dimensional face model; the model's action is then mapped to the target face region (many existing face mapping methods are available and are not described in detail here); finally, the action-mapped target face is synthesized onto the target video frame image, realizing real-time transfer of voice-driven face actions.
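As one possible realization of the final synthesis step (the patent does not fix a particular method), the sketch below blends an action-mapped face crop back into the target frame with OpenCV's seamless cloning; the rectangular mask and the bounding-box interface are simplifying assumptions.

```python
import cv2
import numpy as np

def composite_face(mapped_face, target_frame, face_bbox):
    """Blend an action-mapped face region into the target video frame.
    face_bbox: (x, y, w, h) of the extracted face region in the frame."""
    x, y, w, h = face_bbox
    face = cv2.resize(mapped_face, (w, h))
    # Full-crop mask; a tighter, face-shaped mask would reduce halo artifacts.
    mask = 255 * np.ones(face.shape[:2], dtype=np.uint8)
    center = (x + w // 2, y + h // 2)
    # Poisson blending hides the seam between the mapped face and the frame.
    return cv2.seamlessClone(face, target_frame, mask, center, cv2.NORMAL_CLONE)
```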
The present invention also provides a voice-driven real-time transfer system for human face actions, which can implement the above-mentioned real-time transfer method. As shown in fig. 2, the system includes:
the audio sequence input module 1 is used for inputting an audio sequence of a source character;
the audio signal expression estimation module 2 is connected with the audio sequence input module 1 and is used for estimating the representation of the audio signal of each frame in the audio sequence;
the model action driving module 3 is connected with the audio signal expression estimation module 2 and used for driving a three-dimensional face model action according to the representation of the audio signal of each audio frame;
a target video frame obtaining module 4, configured to obtain a target video frame;
the target frame human face action prediction module 5 is respectively connected with the model action driving module 3 and the target video frame acquisition module 4 and is used for predicting human face actions on a target video frame image based on the driven three-dimensional face model to obtain a human face action prediction result;
and the human face action transfer module 6 is connected with the target frame human face action prediction module 5 and is used for synthesizing the face action prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions.
The human face action real-time transfer system provided by the invention estimates the audio signal characterization of each frame in the audio sequence based on the FacialSpeech speech recognition framework.
Specifically, as shown in fig. 3, the improved FacialSpeech framework of the present invention includes 4 convolutional layers and 3 fully-connected layers cascaded in sequence; the input 16 × 29-dimensional audio features (the number "16" represents a time window in which each frame of audio contains 16 audio features; the number "29" represents that the length of the FacialSpeech alphabet is 29) pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
The convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention; such variations are within the scope of the invention as long as they do not depart from its spirit. In addition, certain terms used in the specification and claims of the present application are not limiting and are used merely for convenience of description.
Claims (10)
1. A voice-driven human face action real-time transfer method is characterized by comprising the following steps:
inputting an audio sequence of a source character;
estimating an audio signal characterization for each frame in the audio sequence;
driving a three-dimensional face model to act according to the estimated audio signal characterization of each audio frame;
acquiring a target video frame;
predicting the human face action on the target video frame image based on the driven three-dimensional face model to obtain a human face action prediction result;
and synthesizing the face action prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions.
2. The method of claim 1, wherein the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
3. The method of claim 2, wherein the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 x 29,
the number "16" represents a time window containing 16 audio features per frame of audio;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
4. The real-time human face action transfer method according to claim 3, wherein the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers cascaded in sequence, and the input 16 × 29-dimensional audio features pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
5. The method according to claim 4, wherein the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
6. A voice-driven real-time human face action transfer system, which can implement the method of any one of claims 1-5, and comprises:
the audio sequence input module is used for inputting an audio sequence of a source character;
the audio signal expression estimation module is connected with the audio sequence input module and used for estimating the representation of the audio signal of each frame in the audio sequence;
the model action driving module is connected with the audio signal expression estimation module and used for driving a three-dimensional face model action according to the representation of the audio signal of each audio frame;
the target video frame acquisition module is used for acquiring a target video frame;
the target frame human face action prediction module is respectively connected with the model action driving module and the target video frame acquisition module and is used for predicting human face actions on the target video frame images based on the driven three-dimensional face model to obtain a human face action prediction result;
and the face action transfer module is connected with the target frame face action prediction module and used for synthesizing the face action prediction result onto the corresponding frame image in the target video, thereby realizing real-time transfer of voice-driven face actions.
7. The system of claim 6, wherein the audio signal characterization for each frame in the audio sequence is estimated based on a FacialSpeech speech recognition framework.
8. The system of claim 7, wherein the feature dimension of each frame of audio input into the FacialSpeech speech recognition framework is 16 x 29,
the number "16" represents a time window containing 16 audio features per frame of audio;
the number "29" indicates that the FacialSpeech alphabet is 29 in length.
9. The system of claim 8, wherein the FacialSpeech speech recognition framework comprises 4 convolutional layers and 3 fully-connected layers cascaded in sequence, and the input 16 × 29-dimensional audio features pass through the one-dimensional feature convolution of the first convolutional layer, which outputs 8 × 32-dimensional audio features;
the 8 × 32-dimensional audio features pass through the one-dimensional feature convolution of the second convolutional layer, which outputs 4 × 32-dimensional audio features;
the 4 × 32-dimensional audio features pass through the one-dimensional feature convolution of the third convolutional layer, which outputs 2 × 64-dimensional audio features;
the 2 × 64-dimensional audio features pass through the one-dimensional feature convolution of the fourth convolutional layer, which outputs 64 audio features;
the first fully-connected layer maps the 64 audio features to 128;
the second fully-connected layer maps the 128 audio features to 64;
the third fully-connected layer maps the 64 audio features into an audio characterization vector of length 32.
10. The system of claim 9, wherein the convolution kernel size of each of the 4 convolutional layers is 3, and the stride is 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011027777.9A CN112215926A (en) | 2020-09-28 | 2020-09-28 | Voice-driven human face action real-time transfer method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011027777.9A CN112215926A (en) | 2020-09-28 | 2020-09-28 | Voice-driven human face action real-time transfer method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112215926A true CN112215926A (en) | 2021-01-12 |
Family
ID=74051267
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011027777.9A Pending CN112215926A (en) | 2020-09-28 | 2020-09-28 | Voice-driven human face action real-time transfer method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112215926A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102054287A (en) * | 2009-11-09 | 2011-05-11 | 腾讯科技(深圳)有限公司 | Facial animation video generating method and device |
CN106485774A (en) * | 2016-12-30 | 2017-03-08 | 当家移动绿色互联网技术集团有限公司 | Expression based on voice Real Time Drive person model and the method for attitude |
CN109308731A (en) * | 2018-08-24 | 2019-02-05 | 浙江大学 | Speech-driven lip-syncing face video synthesis algorithm based on cascaded convolutional LSTM |
CN111243065A (en) * | 2019-12-26 | 2020-06-05 | 浙江大学 | Voice signal driven face animation generation method |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113035198A (en) * | 2021-02-26 | 2021-06-25 | 北京百度网讯科技有限公司 | Lip movement control method, device and medium for three-dimensional face |
CN113035198B (en) * | 2021-02-26 | 2023-11-21 | 北京百度网讯科技有限公司 | Three-dimensional face lip movement control method, equipment and medium |
CN113132815A (en) * | 2021-04-22 | 2021-07-16 | 北京房江湖科技有限公司 | Video generation method and device, computer-readable storage medium and electronic equipment |
CN113160799A (en) * | 2021-04-22 | 2021-07-23 | 北京房江湖科技有限公司 | Video generation method and device, computer-readable storage medium and electronic equipment |
CN113408449A (en) * | 2021-06-25 | 2021-09-17 | 达闼科技(北京)有限公司 | Face action synthesis method based on voice drive, electronic equipment and storage medium |
CN113408449B (en) * | 2021-06-25 | 2022-12-06 | 达闼科技(北京)有限公司 | Face action synthesis method based on voice drive, electronic equipment and storage medium |
WO2023088080A1 (en) * | 2021-11-22 | 2023-05-25 | 上海商汤智能科技有限公司 | Speaking video generation method and apparatus, and electronic device and storage medium |
CN117729298A (en) * | 2023-12-15 | 2024-03-19 | 北京中科金财科技股份有限公司 | Photo driving method based on action driving and mouth shape driving |
CN117831126A (en) * | 2024-01-02 | 2024-04-05 | 暗物质(北京)智能科技有限公司 | Voice-driven 3D digital human action generation method, system, equipment and medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 2021-01-12 |