CN114495908A - Method and system for driving mouth shape by voice based on time sequence convolution - Google Patents

Method and system for driving mouth shape by voice based on time sequence convolution

Info

Publication number
CN114495908A
CN114495908A (application CN202210116972.1A)
Authority
CN
China
Prior art keywords
mouth
fourier transform
unit
time sequence
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210116972.1A
Other languages
Chinese (zh)
Inventor
王松坡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Shenzhi Technology Co ltd
Original Assignee
Beijing Zhongke Shenzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Shenzhi Technology Co ltd filed Critical Beijing Zhongke Shenzhi Technology Co ltd
Priority to CN202210116972.1A
Publication of CN114495908A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 25/45 Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a method and a system for driving a mouth shape by voice based on time sequence convolution, comprising the following steps: representing the movement of the mouth with blendshapes, outputting the weights of a plurality of blendshapes through a neural network, and combining the blendshape values to obtain a reasonable representation of the mouth movement; the mouth-movement representation and the sound signal are both discretized, the discretized sound signal is a time-domain signal, and the time-domain signal is converted into the frequency domain through a Fourier transform to complete the feature conversion. The invention introduces time sequence convolution and uses a time sequence convolutional network to process the voice spectrum features, better addressing the problems of dependence on time sequence information and of a single, averaged generation mode.

Description

Method and system for driving mouth shape by voice based on time sequence convolution
Technical Field
The invention belongs to the technical field of animation production, and particularly relates to a method and a system for driving a mouth shape by voice based on time sequence convolution.
Background
Speech-driven mouth shapes are typically implemented with either linguistics-based models or neural-network-based models.
The linguistics-based approach divides the audio into phonemes according to its characteristics and sculpts a corresponding mouth shape for each phoneme; the resulting mouth shape is a weighted average of these phoneme shapes. The neural-network-based approach does not need to extract specific phoneme classes from the data: thanks to the strong function-fitting capability of neural networks, audio data can be mapped directly to a mouth shape, and the network output can take any form depending on the task setting. For a neural-network scheme, the most important choices are a reasonable data representation and a network structure. In the currently common schemes, the facial mouth shape is represented with a mesh, and the network structure is a convolutional neural network or a recurrent neural network. Convolutional neural networks have proven their powerful feature-extraction ability in computer vision, and speech can be processed by a convolutional network once it is converted into a spectrum; however, audio signals are continuous in time, and a plain convolutional network loses information in the time dimension. A recurrent neural network can make good use of past temporal features, but as a generative network it easily suffers from a single generation mode, with outputs that tend toward the average.
Therefore, how to provide a method and a system for driving a mouth shape by voice based on time sequence convolution has become an urgent problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the present invention provides a method and a system for driving a mouth shape by voice based on time sequence convolution, which introduce time sequence convolution and use a time sequence convolutional network to process the voice spectrum features, better addressing the problems of dependence on time sequence information and of a single generation mode.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method of speech-driven mouth-shaping based on time-series convolution, comprising: representing the mouth action by adopting a blendshape, outputting weights of a plurality of blendshapes through a neural network, and combining the values of the blendshapes to obtain reasonable representation of the mouth action; the reasonable representation of mouth movements needs discretization, the discretized sound signals are time domain signals, and the time domain signals are converted into a frequency domain through Fourier transformation to complete feature conversion.
Further, the feature conversion method is: pre-emphasize and window the voice signal, perform a discrete Fourier transform, pass the result through a Mel filter bank and take the logarithm, and then perform a discrete cosine transform to obtain the MFCC features.
Further, the neural network adopts a time sequence convolution network, the loss adopts mean square error loss MSE, and the calculation formula is as follows:
MSE = (1/T) · Σ_{i=1}^{T} (y_i - ŷ_i)²
where T is the period (number of frames), y_i is the true value and ŷ_i is the predicted value; the network is constrained by measuring the Euclidean distance between the predicted value and the true value.
A system for driving a mouth shape by voice based on time sequence convolution comprises a data acquisition module and an audio feature processing module. The data acquisition module represents the mouth movement with blendshapes, outputs the weights of a plurality of blendshapes through a neural network, and combines the blendshape values to obtain a reasonable representation of the mouth movement. The audio feature processing module discretizes the mouth-movement representation and the sound signal; the discretized sound signal is a time-domain signal, which is converted into the frequency domain through a Fourier transform to complete the feature conversion.
Further, the audio feature processing module comprises a pre-emphasis unit, a windowing unit, a discrete Fourier transform unit, a Mel filter bank, a logarithm calculation unit and a discrete cosine transform unit: the pre-emphasis unit emphasizes the energy of the high-frequency part of the speech signal; the windowing unit weights the data within a sliding window; the discrete Fourier transform unit performs a discrete Fourier transform on the weighted data; the Mel filter bank converts the spectrum after the discrete Fourier transform to the Mel scale; the logarithm calculation unit handles the conversion between the Mel scale and Hertz; and the discrete cosine transform unit performs an inverse discrete Fourier transform to obtain the MFCC features.
The invention has the beneficial effects that:
the invention introduces time sequence convolution, uses the time sequence convolution network for processing the voice frequency spectrum characteristics, and better solves the problems of time sequence information dependence and single generation mode; in the network part, compared with a cyclic neural network and a traditional convolutional network, the method gives consideration to information dependence on a time sequence and also reflects the accuracy of data generation. The use of the blendshape is simpler than the mesh, and the complex mouth action representation can be represented by using less data.
Drawings
In order to illustrate the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only the present embodiments of the invention; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flow chart of a method of feature transformation according to the present invention.
Fig. 2 is a schematic diagram of the feature transformation after pre-emphasis according to the present invention.
FIG. 3 is a graphical representation of several window functions of the present invention.
FIG. 4 is a schematic diagram of the nonlinear relationship between the Mel scale and the Hertz scale according to the present invention.
FIG. 5 is a diagram illustrating the conversion performed by the Mel filter bank according to the present invention.
FIG. 6 is a partial structural diagram of the TCN of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
Example 1
Referring to fig. 1, the present invention provides a method for driving a mouth shape by voice based on time sequence convolution, comprising: representing the mouth movement with 27 blendshapes, outputting the weights of these 27 blendshapes through a neural network, and combining the blendshape values to obtain a reasonable representation of the mouth movement; the sound and the corresponding face blendshape data can be obtained by recording with a mobile phone or similar device. The mouth-movement representation and the sound signal are both discretized, the discretized sound signal is a time-domain signal, and the time-domain signal is converted into the frequency domain through a Fourier transform to complete the feature conversion.
For blendshape acquisition, the facial expression can be captured with an iPhone (iPhone X and above) and represented with 51 blendshapes (reference link: https://developer.apple.com/documentation/arkit/arfaceanchor/blendshapelocation); the present invention uses only the blendshapes of the mouth, 27 in total (see the Mouth and Jaw sections of the above link).
The neural network used here is based on the TCN, with the inputs and outputs modified to fit the data structure of the present invention; the structure is shown in fig. 6, and a more detailed reference is the paper "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling". The present invention uses this network structure to model the processed speech features and the mouth-movement blendshapes, so that the trained TCN can complete the mapping between speech and mouth movement.
Each blendshape takes a value in the range 0-100, and the TCN outputs the specific values of the 27 corresponding blendshapes; together these values represent a particular mouth movement.
The discretization mentioned above is the change from the continuous representation of the real world to the discontinuous representation of the digital world. Specifically, in the present invention the mouth movement is discretized into a mouth-blendshape representation at 30 frames per second.
The discretized mouth-movement representation and the discretized sound signal are in one-to-one correspondence, and the mapping between them is implemented by the TCN network: the input of the TCN is the discretized sound signal, and the output is the discretized mouth-movement representation.
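The combination of blendshape weights into one mouth pose is a simple linear blend. The following is a minimal sketch under the assumption that each blendshape is stored as a per-vertex offset from the neutral mouth; the names neutral_mouth and blend_deltas are illustrative and not taken from the patent.
    import numpy as np

    def combine_blendshapes(neutral_mouth, blend_deltas, weights):
        # neutral_mouth: (V, 3) neutral mouth vertex positions
        # blend_deltas:  (27, V, 3) per-blendshape vertex offsets from the neutral pose
        # weights:       (27,) values in [0, 100] as output by the TCN, one per blendshape
        w = np.clip(np.asarray(weights, dtype=np.float64), 0.0, 100.0) / 100.0
        return neutral_mouth + np.tensordot(w, blend_deltas, axes=1)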
Sound is a wave; to store and represent it, the invention needs to discretize it, and this discretization inevitably causes some loss of information. Several key parameters are involved:
(1) Sampling rate: the number of sampling points per second, commonly 16 kHz, 44.1 kHz and so on. The higher the sampling rate, the higher the sound-wave frequencies that can be described and the more faithful and natural the reconstructed sound.
(2) Number of channels: commonly left and right (stereo) channels, 5.1 channels and so on; a mono channel is used here when processing the audio data.
(3) Bit depth: the resolution with which the loudness (amplitude) of the sound is sampled; it affects the signal-to-noise ratio and dynamic range, and 16-bit and 32-bit depths are common.
(4) Bit rate: the number of bits processed per second; for example, with a 16 kHz sampling rate and a 16-bit depth the bit rate is 16000 × 16 = 256 kbit/s (a short check of this arithmetic follows below).
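The bit-rate relation in item (4) can be checked with a trivial calculation (a mono channel is assumed, as stated above):
    sampling_rate = 16000   # samples per second
    bit_depth = 16          # bits per sample
    channels = 1            # mono audio is used when processing the data
    bit_rate = sampling_rate * bit_depth * channels
    print(bit_rate)         # 256000 bit/s, i.e. 256 kbit/s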
Discretization alone is not a sufficient representation; the discretized sound signal is a time-domain signal, and to extract more information it is converted into the frequency domain through a Fourier transform.
After conversion into the frequency domain, further feature conversions are performed. The benefits of these conversions are: the speech information is exposed more readily, which lowers the difficulty of algorithm optimization; the robustness of the signal with respect to speakers, noise, channels and the like is enhanced; and the dimensionality is reduced, for example a 25 ms frame at a 16 kHz sampling rate contains 400 sample values, but only 40 feature dimensions remain after conversion.
"Fourier transforming the time-domain signal into the frequency domain" is only one step of the feature conversion; all of the steps are shown in fig. 1.
In this embodiment, the feature conversion method is: pre-emphasize and window the voice signal, perform a discrete Fourier transform, pass the result through a Mel filter bank and take the logarithm, and then perform a discrete cosine transform to obtain the MFCC features.
Pre-emphasis emphasizes the energy of the high-frequency part of the speech signal, because the low-frequency part of the original speech signal carries more energy; this imbalance is called spectral tilt.
Calculation method
x′[t]=x[t]-αx[t-1]
Where t is the time step, α is the weighting factor, x is the discrete speech signal, x [ t ] is the signal value at time t, x [ t-1] is the signal value at time t-1, and x' [ t ] is the signal value at time t after pre-emphasis.
The diagram of the feature transformation after pre-emphasis is shown in fig. 2.
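A minimal NumPy sketch of this pre-emphasis step follows; the coefficient alpha = 0.97 is a common choice and an assumption here, not a value specified by the patent.
    import numpy as np

    def pre_emphasis(x, alpha=0.97):
        # x'[t] = x[t] - alpha * x[t-1]; the first sample is kept unchanged
        x = np.asarray(x, dtype=np.float64)
        return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))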
Windowing refers to weighting the data within a sliding window; it is needed because an implicit rectangular window would cause spectral leakage in the FFT calculation.
Calculation method
x′[n]=w[n]x[n]
Several window functions are shown schematically in FIG. 3. In the formula, w[n] is the window coefficient at index n, N is the window size, x[n] is the signal value at index n within the window, and x′[n] is the windowed signal value at index n.
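As a sketch, framing and Hamming windowing could look as follows; the 25 ms frame length and 10 ms hop at 16 kHz (400 and 160 samples) are illustrative assumptions rather than values fixed by the patent.
    import numpy as np

    def frame_and_window(x, frame_len=400, hop=160):
        # slice the signal into overlapping frames and weight each frame with a Hamming window
        n_frames = 1 + (len(x) - frame_len) // hop
        window = np.hamming(frame_len)                                    # w[n]
        frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])
        return frames * window                                            # x'[n] = w[n] * x[n]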
Discrete Fourier transform: after the above preprocessing, the discrete Fourier transform is applied. For an N-point sequence {x[n]}, 0 ≤ n < N, the transform is
X[k] = Σ_{n=0}^{N-1} x[n] · exp(-j·2πkn/N), 0 ≤ k < N
where X is the Fourier-transformed sequence, k is the frequency index, N is the window size, x[n] is the signal value at index n within the window, exp denotes the exponential function, j is the imaginary unit, and π is the circle constant.
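In practice the transform of each windowed frame is computed with a fast Fourier transform; a short sketch of the resulting power spectrum using NumPy's real FFT (the 512-point FFT size is an assumption):
    import numpy as np

    def power_spectrum(frames, n_fft=512):
        # X[k] for 0 <= k <= N/2 of every frame, returned as the magnitude-squared (power) spectrum
        spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)
        return (np.abs(spectrum) ** 2) / n_fft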
Mel filter bank: the sensitivity of the human ear to sounds of different frequencies is not linear, and the mel scale is a depiction of this sensitivity.
The conversion relations between the Mel scale and Hertz are
m = 2595 · log10(1 + f/700)
f = 700 · (10^(m/2595) - 1)
The non-linear relationship between the two scales is shown in fig. 4.
The mel filter bank converts the frequency spectrum to the mel scale, which is schematically shown in fig. 5.
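The two conversion formulas and a triangular Mel filter bank can be sketched as follows; 40 filters, a 512-point FFT and a 16 kHz sampling rate are illustrative assumptions.
    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filter_bank(n_filters=40, n_fft=512, sample_rate=16000):
        # triangular filters whose centres are equally spaced on the Mel scale
        mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, center, right = bins[i - 1], bins[i], bins[i + 1]
            fbank[i - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
            fbank[i - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        return fbank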
MFCC features: the output of the previous step is called the FilterBank feature; removing the harmonic fine structure from the FilterBank feature yields the MFCC feature. The specific steps are as follows:
The envelope and the harmonics can be separated by performing an inverse discrete Fourier transform (equivalent here to a discrete cosine transform) on the FilterBank feature.
The low-order coefficients of the result (usually the first 13) are retained; these are the MFCC features, also called Mel-frequency cepstral coefficients.
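A sketch of these last two steps, assuming SciPy is available for the discrete cosine transform and that power_frames and fbank come from the earlier sketches:
    import numpy as np
    from scipy.fftpack import dct

    def mfcc_from_power(power_frames, fbank, n_mfcc=13):
        # log Mel filter-bank energies followed by a DCT; keep the first 13 coefficients
        filterbank_feats = np.log(power_frames @ fbank.T + 1e-10)
        return dct(filterbank_feats, type=2, axis=-1, norm='ortho')[:, :n_mfcc]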
The audio data processing is now complete and the data can be taken directly to the neural network for processing.
The network structure of the invention adopts a time sequence convolutional network (TCN). Unlike a traditional convolutional neural network, it can extract features over the whole input sequence: as shown in fig. 6, the input layer samples every point, the second layer samples every other point, and the receptive field of the last layer can cover the entire input. Compared with a recurrent neural network, it can complete feature extraction and training over the whole sequence in one pass, which greatly improves training and inference efficiency; the TCN is also light, simple and effective.
As in traditional deep convolutional networks, a residual connection structure is also adopted; this ensures that the gradient does not vanish when training a deep network.
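A minimal sketch of one dilated causal convolution block with a residual connection, in the spirit of the structure in fig. 6; PyTorch is assumed, and the channel sizes, kernel width and dilation are illustrative rather than the patent's exact configuration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TemporalBlock(nn.Module):
        # dilated causal Conv1d -> ReLU -> dilated causal Conv1d, plus a residual (skip) connection
        def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
            super().__init__()
            self.pad = (kernel_size - 1) * dilation   # left padding keeps the convolution causal
            self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
            self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
            self.downsample = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
            self.relu = nn.ReLU()

        def forward(self, x):                          # x: (batch, channels, time)
            y = self.relu(self.conv1(F.pad(x, (self.pad, 0))))
            y = self.relu(self.conv2(F.pad(y, (self.pad, 0))))
            return self.relu(y + self.downsample(x))   # residual connection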
The neural network adopts a time sequence convolution network, the loss adopts mean square error loss MSE, and the calculation formula is as follows:
MSE = (1/T) · Σ_{i=1}^{T} (y_i - ŷ_i)²
where T is the period (number of frames), y_i is the true value and ŷ_i is the predicted value; the network is constrained by measuring the Euclidean distance between the predicted value and the true value.
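As a sketch, the loss measures the squared Euclidean distance between the 27 predicted and ground-truth blendshape weights, averaged over the T frames of a training clip:
    import numpy as np

    def mse_loss(y_true, y_pred):
        # (1/T) * sum over i of ||y_i - y_hat_i||^2, with y of shape (T, 27)
        y_true = np.asarray(y_true, dtype=np.float64)
        y_pred = np.asarray(y_pred, dtype=np.float64)
        return np.mean(np.sum((y_true - y_pred) ** 2, axis=-1))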
Example 2
This embodiment provides a system for driving a mouth shape by voice based on time sequence convolution, comprising a data acquisition module and an audio feature processing module. The data acquisition module represents the mouth movement with blendshapes, outputs the weights of a plurality of blendshapes through a neural network, and combines the blendshape values to obtain a reasonable representation of the mouth movement. The audio feature processing module discretizes the mouth-movement representation and the sound signal; the discretized sound signal is a time-domain signal, which is converted into the frequency domain through a Fourier transform to complete the feature conversion.
The audio feature processing module comprises a pre-emphasis unit, a windowing unit, a discrete Fourier transform unit, a Mel filter bank, a logarithm calculation unit and a discrete cosine transform unit: the pre-emphasis unit emphasizes the energy of the high-frequency part of the speech signal; the windowing unit weights the data within a sliding window; the discrete Fourier transform unit performs a discrete Fourier transform on the weighted data; the Mel filter bank converts the spectrum after the discrete Fourier transform to the Mel scale; the logarithm calculation unit handles the conversion between the Mel scale and Hertz; and the discrete cosine transform unit performs an inverse discrete Fourier transform to obtain the MFCC features.
The invention introduces time sequence convolution and uses a time sequence convolutional network to process the voice spectrum features, better addressing the problems of dependence on time sequence information and of a single generation mode. In the network part, compared with a recurrent neural network and a traditional convolutional network, the method both respects the information dependence along the time sequence and preserves the accuracy of data generation. Using blendshapes is simpler than using a mesh, and complex mouth movements can be represented with less data.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (5)

1. A method for driving a mouth shape by voice based on time sequence convolution, comprising: representing the mouth movement with blendshapes, outputting the weights of a plurality of blendshapes through a neural network, and combining the blendshape values to obtain a reasonable representation of the mouth movement; the mouth-movement representation and the sound signal are both discretized, the discretized sound signal is a time-domain signal, and the time-domain signal is converted into the frequency domain through a Fourier transform to complete the feature conversion.
2. The method for driving a mouth shape by voice based on time sequence convolution as claimed in claim 1, wherein the feature conversion method is: pre-emphasize and window the voice signal, perform a discrete Fourier transform, pass the result through a Mel filter bank and take the logarithm, and then perform a discrete cosine transform to obtain the MFCC features.
3. The method for driving mouth shape by voice based on time series convolution according to claim 1, wherein the neural network adopts a time series convolution network, the loss adopts mean square error loss (MSE), and the calculation formula is as follows:
MSE = (1/T) · Σ_{i=1}^{T} (y_i - ŷ_i)²
wherein T is the period (number of frames), y_i is the true value and ŷ_i is the predicted value; the network is constrained by measuring the Euclidean distance between the predicted value and the true value.
4. A system for driving a mouth shape by voice based on time sequence convolution, characterized by comprising a data acquisition module and an audio feature processing module, wherein the data acquisition module represents the mouth movement with blendshapes, outputs the weights of a plurality of blendshapes through a neural network, and combines the blendshape values to obtain a reasonable representation of the mouth movement; and the audio feature processing module discretizes the mouth-movement representation and the sound signal, the discretized sound signal being a time-domain signal that is converted into the frequency domain through a Fourier transform to complete the feature conversion.
5. The system of claim 4, wherein the audio feature processing module comprises a pre-emphasis unit, a windowing unit, a discrete Fourier transform unit, a Mel filter bank, a logarithm calculation unit and a discrete cosine transform unit; the pre-emphasis unit emphasizes the energy of the high-frequency part of the speech signal; the windowing unit weights the data within a sliding window; the discrete Fourier transform unit performs a discrete Fourier transform on the weighted data; the Mel filter bank converts the spectrum after the discrete Fourier transform to the Mel scale; the logarithm calculation unit handles the conversion between the Mel scale and Hertz; and the discrete cosine transform unit performs an inverse discrete Fourier transform to obtain the MFCC features.
CN202210116972.1A 2022-02-08 2022-02-08 Method and system for driving mouth shape by voice based on time sequence convolution Pending CN114495908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210116972.1A CN114495908A (en) 2022-02-08 2022-02-08 Method and system for driving mouth shape by voice based on time sequence convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210116972.1A CN114495908A (en) 2022-02-08 2022-02-08 Method and system for driving mouth shape by voice based on time sequence convolution

Publications (1)

Publication Number Publication Date
CN114495908A true CN114495908A (en) 2022-05-13

Family

ID=81477987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210116972.1A Pending CN114495908A (en) 2022-02-08 2022-02-08 Method and system for driving mouth shape by voice based on time sequence convolution

Country Status (1)

Country Link
CN (1) CN114495908A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019174131A1 (en) * 2018-03-12 2019-09-19 平安科技(深圳)有限公司 Identity authentication method, server, and computer readable storage medium
CN110277099A (en) * 2019-06-13 2019-09-24 北京百度网讯科技有限公司 Voice-based nozzle type generation method and device
CN113035198A (en) * 2021-02-26 2021-06-25 北京百度网讯科技有限公司 Lip movement control method, device and medium for three-dimensional face
CN113314145A (en) * 2021-06-09 2021-08-27 广州虎牙信息科技有限公司 Sample generation method, model training method, mouth shape driving device, mouth shape driving equipment and mouth shape driving medium
CN113592985A (en) * 2021-08-06 2021-11-02 宿迁硅基智能科技有限公司 Method and device for outputting mixed deformation value, storage medium and electronic device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220513)