WO2019196306A1 - Device and method for speech-based mouth shape animation blending, and readable storage medium - Google Patents

Device and method for speech-based mouth shape animation blending, and readable storage medium

Info

Publication number
WO2019196306A1
WO2019196306A1 (PCT/CN2018/102209)
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature
model
neural network
voice data
Application number
PCT/CN2018/102209
Other languages
French (fr)
Chinese (zh)
Inventor
梁浩
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2019196306A1


Classifications

    • G06T 13/205: 3D [Three Dimensional] animation driven by audio data
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/08: Learning methods (computing arrangements based on biological models; neural networks)
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a voice-based lip animation synthesis apparatus, method, and readable storage medium.
  • Speech synthesis, also known as text-to-speech technology, is a technology that converts text information into speech and reads it aloud. It involves acoustics, linguistics, digital signal processing, computer science and other disciplines, and is a cutting-edge technology in the field of Chinese information processing; the main problem it solves is how to convert text information into audible sound information.
  • in some application scenarios, such as computer-assisted pronunciation training, the speaker's mouth-shape changes need to be displayed dynamically while the voice data is played, to help the user practise pronunciation. In the prior art, when the played voice data is synthesized, there is no corresponding real speaker mouth data available, so a realistic mouth-shape animation that matches the synthesized voice data cannot be displayed.
  • the present application provides a voice-based lip animation synthesis apparatus, method, and readable storage medium, the main purpose of which is to solve the prior-art problem that a realistic mouth-shape animation matching synthesized speech data cannot be displayed.
  • the present application provides a voice-based lip animation synthesis device comprising a memory and a processor, wherein the memory stores a lip animation synthesis program executable on the processor;
  • when the lip animation synthesis program is executed by the processor, the following steps are implemented: acquiring target text data and acquiring phoneme features in the target text data according to a pronunciation dictionary; inputting the phoneme features into a pre-trained deep neural network model and outputting acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency; inputting the acoustic features into a speech synthesizer and outputting speech data corresponding to the target text data; acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, mouth-shape data corresponding to the speech data and the speaker identification information; and generating a mouth-shape animation corresponding to the speech data from the mouth-shape data, so that the animation is displayed while the speech data is played.
  • the present application further provides a voice-based lip animation synthesis method comprising the same steps: acquiring target text data and its phoneme features according to a pronunciation dictionary; inputting the phoneme features into a pre-trained deep neural network model and outputting the corresponding acoustic features (MFCC features, pronunciation duration and pronunciation fundamental frequency); inputting the acoustic features into a speech synthesizer to output the speech data; acquiring mouth-shape data from the speech data, a pre-trained tensor model and preset speaker identification information; and generating the corresponding mouth-shape animation.
  • the present application further provides a computer-readable storage medium on which a lip animation synthesis program is stored; the program can be executed by one or more processors to implement the steps of the speech-based lip animation synthesis method described above.
  • FIG. 1 is a schematic diagram of a preferred embodiment of the voice-based lip animation synthesis device of the present application;
  • FIG. 2 is a schematic diagram of the program modules of the lip animation synthesis program in an embodiment of the voice-based lip animation synthesis device of the present application;
  • FIG. 3 is a flowchart of a preferred embodiment of the voice-based lip animation synthesis method of the present application.
  • the present application provides a voice-based lip animation synthesis device.
  • referring to FIG. 1, a schematic diagram of a preferred embodiment of the speech-based lip animation synthesis apparatus of the present application is shown.
  • the voice-based lip animation synthesis device may be a PC (Personal Computer), or may be a terminal device such as a smart phone, a tablet computer, or a portable computer.
  • the speech-based lip animation synthesis apparatus includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (for example, an SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may be, in some embodiments, an internal storage unit of the voice-based lip animation synthesis device, such as the hard disk of the device.
  • in other embodiments, the memory 11 may also be an external storage device of the voice-based lip animation synthesis device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the device.
  • further, the memory 11 may include both the internal storage unit of the voice-based lip animation synthesis device and an external storage device.
  • the memory 11 can be used not only for storing application software and various types of data installed in the voice-based lip animation synthesizing device, such as code of a lip animation synthesis program, but also for temporarily storing data that has been output or is to be output.
  • the processor 12 may, in some embodiments, be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the lip animation synthesis program.
  • Communication bus 13 is used to implement connection communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device and other electronic devices.
  • FIG. 1 shows only a speech-based lip animation synthesis device having the components 11-14 and the lip animation synthesis program; it should be understood that not all illustrated components are required, and more or fewer components may be implemented instead.
  • optionally, the device may further include a user interface; the user interface may include a display and an input unit such as a keyboard, and may further include a standard wired interface and a wireless interface.
  • in some embodiments the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used to show the information processed in the voice-based lip animation synthesis device and to present a visual user interface.
  • in the device embodiment shown in FIG. 1, a lip animation synthesis program is stored in the memory 11; when the processor 12 executes the lip animation synthesis program stored in the memory 11, the following steps are implemented: target text data is acquired, and the phoneme features in the target text data are acquired according to a pronunciation dictionary.
  • the phoneme feature is input into a pre-trained deep neural network model, and an acoustic feature corresponding to the phoneme feature is output, the acoustic feature including a Mel cepstrum coefficient MFCC feature, a pronunciation duration, and a pronunciation fundamental frequency.
  • the acoustic feature is input to a speech synthesizer, and speech data corresponding to the target text data is output.
  • in the solution proposed in this embodiment, the target text data is converted into speech data through a pre-established deep neural network model, and the speech data is converted into mouth-shape data through a pre-established tensor model.
  • specifically, the target text data to be synthesized is obtained, the target text data is split into characters or words by a word segmentation tool, and the resulting characters or words are then split into phonemes through the pronunciation dictionary, thereby obtaining the phoneme features; for Chinese, the phonemes include initial (consonant) phonemes and final (vowel) phonemes.
  • taking Chinese as an example, for each phoneme the phoneme features mainly include: the pronunciation features of the current phoneme, the previous phoneme and the next phoneme; the position of the current phoneme in the character; the syllable features of the current, previous and next phonemes; and the position in the sentence of the character containing the current phoneme.
  • a deep neural network model for expressing the correlation between phoneme features and acoustic features is trained in advance; the above feature vector is input into the model to obtain the corresponding acoustic features, which include time-series features and the duration of each sound. The time-series features comprise a 25-dimensional vector of Mel-frequency cepstral coefficients (MFCC) per frame and the fundamental frequency. The MFCC features, the pronunciation duration and the pronunciation fundamental frequency are then synthesized into a speech signal by a speech synthesizer.
  • before the deep neural network model of this embodiment is applied, it needs to be trained. First, a corpus is collected and a sample library is constructed from the corpus of at least one speaker; the corpus includes voice data together with the corresponding text data and mouth-shape data, i.e. the voice data obtained when one or more speakers read the same text aloud and the corresponding mouth-shape data. The mouth-shape data are physiological electromagnetic articulography (EMA) recordings that capture mouth-movement information and reflect the speaker's mouth shape during pronunciation.
  • the deep neural network model is then trained on the text data and the voice data in the sample library, and the model parameters of the deep neural network model are obtained. The pronunciation duration can be predicted from the length and syllable-position features in the phoneme features, and the pronunciation fundamental frequency can be predicted from pronunciation features such as pitch and accent position in the phoneme features.
  • it should be noted that the mouth-shape data in this embodiment are physiological electromagnetic articulography data obtained by capturing mouth-movement information; these data mainly include the coordinate information of a specific mouth shape and the corresponding mouth-shape image.
  • during model training, the mouth-position features in the mouth-shape data are used directly; they mainly include the coordinate information of the following positions: tongue tip, tongue body, tongue dorsum, upper lip, lower lip, upper incisors and lower incisors.
  • based on the voice data and mouth-shape data in the sample library, a tensor model expressing the correlation between the acoustic features and the mouth-shape data is trained in advance; the tensor model is a third-order tensor model whose three dimensions correspond to pronunciation features, mouth-shape data and speaker identification information, respectively.
  • the pronunciation features of the voice data in the sample library are extracted, the pronunciation features and the speaker identification information are used as input features of the third-order tensor model and the mouth-shape data as its output features, and the third-order tensor model is trained to obtain its model parameters.
  • specifically, the third-order tensor model in this embodiment is constructed and trained as follows: the set of pronunciation features is taken as one parameter space, and the set of mouth-shape data corresponding to the pronunciation features as another parameter space; a multilinear space transformation is built over these parameter spaces, and from it a third-order tensor is constructed whose three dimensions correspond to the acoustic (pronunciation) features, the mouth-shape data and the speaker identification information.
  • in the resulting expression, the left-hand side contains the model parameters to be solved, mainly the weights of the individual features in the two parameter spaces; the right-hand side contains the features used to train the model, i.e. the pronunciation features and mouth-position features extracted from the text data and mouth-shape data in the database. Here C is the tensor operator and μ is the averaged mouth-position information across speakers: taking the sound "a" as an example, the corresponding μ is the average of the mouth-position information of the different speakers when producing "a".
  • since tensor decomposition is generally performed with a higher-order singular value decomposition, the third-order tensor model is trained using an HOSVD algorithm to solve for the model parameters on the left-hand side of the above expression.
  • after the speech data are obtained from the deep neural network model, the speech data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the mouth-shape data corresponding to the speech data. In other words, when the sample library used to train the third-order tensor model contains the corpora of several speakers, the user can select the speaker identification information in advance, and the generated mouth-shape data will then be closer to that speaker's mouth-shape data.
  • in the solution of this embodiment, the deep neural network model is used to model the mapping from phoneme features to acoustic features; this mapping is nonlinear, and a deep neural network achieves better feature mining and representation, so the speech synthesis system produces more accurate and more natural output. In addition, by constructing a tensor model that expresses the correlation between pronunciation features and mouth-shape features, mouth-shape data that match the synthesized speech and look realistic can be obtained, enabling a dynamic display of the mouth shape while the voice data are played.
  • the speech-based lip animation synthesis device of this embodiment obtains the phoneme features of the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the corresponding acoustic features, which include the MFCC features, the pronunciation duration and the pronunciation fundamental frequency; these acoustic features are input into a speech synthesizer to obtain the speech data corresponding to the target text data. According to the speech data, the pre-trained tensor model and the preset speaker identification information, the mouth-shape data corresponding to the speech data and the speaker identification information are acquired, and a mouth-shape animation corresponding to the speech data is generated from the mouth-shape data, so that the animation is displayed while the speech data are played.
  • this scheme uses the deep neural network model to transform the target text data into acoustic features, which enables better feature mining and gives the speech synthesis system more accurate and more natural output; at the same time, the tensor model that expresses the relation between acoustic features and mouth-shape data converts the synthesized speech data into corresponding mouth-shape data, from which a mouth-shape animation corresponding to the target text data is generated. This solves the technical problem that the prior art cannot display a realistic mouth-shape animation matching synthesized speech data.
  • alternatively, the lip animation synthesis program may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application; the module referred to in the present application is a series of computer program instruction segments capable of performing a specific function, used to describe the execution process of the lip animation synthesis program in the voice-based lip animation synthesis device.
  • referring to FIG. 2, it is a schematic diagram of the program modules of the lip animation synthesis program in an embodiment of the speech-based lip animation synthesis device of the present application. In this embodiment, the lip animation synthesis program can be divided into a feature extraction module 10, a feature conversion module 20, a speech synthesis module 30, a mouth-shape generation module 40 and an animation synthesis module 50, which exemplarily operate as follows:
  • the feature extraction module 10 is configured to: acquire target text data, and acquire phoneme features in the target text data according to a pronunciation dictionary;
  • the feature conversion module 20 is configured to: input the phoneme features into a pre-trained deep neural network model and output the acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency;
  • the speech synthesis module 30 is configured to: input the acoustic feature into a speech synthesizer, and output speech data corresponding to the target text data;
  • the mouth-shape generation module 40 is configured to: acquire, according to the voice data, the pre-trained tensor model and the preset speaker identification information, the mouth-shape data corresponding to the voice data and the speaker identification information,
  • the tensor model expresses a correlation between the pronunciation features of the speech data and the lip position characteristics of the lip data;
  • the animation synthesis module 50 is configured to: generate a lip animation corresponding to the voice data according to the mouth-shape data, so as to display the lip animation while the voice data is played (a minimal sketch of how these modules could be wired together follows).
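As an illustration only, the five modules above could be connected along the following lines. This is a sketch under assumptions: the class name, parameter names and the callables passed in are invented for the example and are not taken from the patent.

```python
from typing import Callable, List, Sequence, Tuple

class LipAnimationPipeline:
    """Wires together the five modules described above (illustrative sketch)."""

    def __init__(self,
                 extract_phoneme_features: Callable[[str], List[Sequence[float]]],      # module 10
                 phonemes_to_acoustics: Callable[[Sequence[float]], Sequence[float]],   # module 20
                 acoustics_to_waveform: Callable[[List[Sequence[float]]], bytes],       # module 30
                 acoustics_to_mouth: Callable[[Sequence[float], int], Sequence[float]], # module 40
                 render_animation: Callable[[List[Sequence[float]], bytes], object]):   # module 50
        self.extract_phoneme_features = extract_phoneme_features
        self.phonemes_to_acoustics = phonemes_to_acoustics
        self.acoustics_to_waveform = acoustics_to_waveform
        self.acoustics_to_mouth = acoustics_to_mouth
        self.render_animation = render_animation

    def run(self, text: str, speaker_id: int) -> Tuple[bytes, object]:
        phoneme_feats = self.extract_phoneme_features(text)                       # step S10
        acoustic_feats = [self.phonemes_to_acoustics(f) for f in phoneme_feats]   # step S20
        audio = self.acoustics_to_waveform(acoustic_feats)                        # step S30
        mouth = [self.acoustics_to_mouth(f, speaker_id) for f in acoustic_feats]  # step S40
        animation = self.render_animation(mouth, audio)                           # step S50
        return audio, animation
```

Each callable corresponds to one module; concrete implementations (pronunciation dictionary lookup, the trained deep neural network, the speech synthesizer and the tensor model) would be supplied by the caller.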
  • the present application also provides a voice-based lip animation synthesis method.
  • referring to FIG. 3, it is a flowchart of a preferred embodiment of the speech-based lip animation synthesis method of the present application. The method may be performed by a device, which may be implemented by software and/or hardware; in the following, the voice-based lip animation synthesis device is taken as the execution subject to describe the method of this embodiment.
  • the voice-based lip animation synthesis method includes:
  • Step S10: acquire target text data, and acquire phoneme features in the target text data according to the pronunciation dictionary.
  • Step S20: input the phoneme features into a pre-trained deep neural network model, and output the acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency.
  • Step S30: input the acoustic features into a speech synthesizer, and output the speech data corresponding to the target text data.
  • in the solution proposed in this embodiment, the target text data is converted into speech data through a pre-established deep neural network model, and the speech data is converted into mouth-shape data through a pre-established tensor model.
  • specifically, the target text data to be synthesized is obtained, the target text data is split into characters or words by a word segmentation tool, and the resulting characters or words are then split into phonemes through the pronunciation dictionary, thereby obtaining the phoneme features; for Chinese, the phonemes include initial (consonant) phonemes and final (vowel) phonemes.
  • taking Chinese as an example, for each phoneme the phoneme features mainly include: the pronunciation features of the current phoneme, the previous phoneme and the next phoneme; the position of the current phoneme in the character; the syllable features of the current, previous and next phonemes; and the position in the sentence of the character containing the current phoneme.
  • a deep neural network model for expressing the correlation between phoneme features and acoustic features is trained in advance; the above feature vector is input into the model to obtain the corresponding acoustic features, which include time-series features and the duration of each sound. The time-series features comprise a 25-dimensional vector of Mel-frequency cepstral coefficients (MFCC) per frame and the fundamental frequency. The MFCC features, the pronunciation duration and the pronunciation fundamental frequency are then synthesized into a speech signal by a speech synthesizer.
  • before the deep neural network model of this embodiment is applied, it needs to be trained. First, a corpus is collected and a sample library is constructed from the corpus of at least one speaker; the corpus includes voice data together with the corresponding text data and mouth-shape data, i.e. the voice data obtained when one or more speakers read the same text aloud and the corresponding mouth-shape data. The mouth-shape data are physiological electromagnetic articulography (EMA) recordings that capture mouth-movement information and reflect the speaker's mouth shape during pronunciation.
  • the deep neural network model is then trained on the text data and the voice data in the sample library, and the model parameters of the deep neural network model are obtained. The pronunciation duration can be predicted from the length and syllable-position features in the phoneme features, and the pronunciation fundamental frequency can be predicted from pronunciation features such as pitch and accent position in the phoneme features.
  • Step S40: acquire, according to the voice data, the pre-trained tensor model and the preset speaker identification information, the mouth-shape data corresponding to the voice data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of the speech data and the mouth-position features of the mouth-shape data.
  • it should be noted that the mouth-shape data in this embodiment are physiological electromagnetic articulography data obtained by capturing mouth-movement information; these data mainly include the coordinate information of a specific mouth shape and the corresponding mouth-shape image.
  • during model training, the mouth-position features in the mouth-shape data are used directly; they mainly include the coordinate information of the following positions: tongue tip, tongue body, tongue dorsum, upper lip, lower lip, upper incisors and lower incisors.
  • based on the voice data and mouth-shape data in the sample library, a tensor model expressing the correlation between the acoustic features and the mouth-shape data is trained in advance; the tensor model is a third-order tensor model whose three dimensions correspond to pronunciation features, mouth-shape data and speaker identification information, respectively.
  • the pronunciation features of the voice data in the sample library are extracted, the pronunciation features and the speaker identification information are used as input features of the third-order tensor model and the mouth-shape data as its output features, and the third-order tensor model is trained to obtain its model parameters.
  • specifically, the third-order tensor model in this embodiment is constructed and trained as follows: the set of pronunciation features is taken as one parameter space, and the set of mouth-shape data corresponding to the pronunciation features as another parameter space; a multilinear space transformation is built over these parameter spaces, and from it a third-order tensor is constructed whose three dimensions correspond to the acoustic (pronunciation) features, the mouth-shape data and the speaker identification information.
  • in the resulting expression, the left-hand side contains the model parameters to be solved, mainly the weights of the individual features in the two parameter spaces; the right-hand side contains the features used to train the model, i.e. the pronunciation features and mouth-position features extracted from the text data and mouth-shape data in the database. Here C is the tensor operator and μ is the averaged mouth-position information across speakers: taking the sound "a" as an example, the corresponding μ is the average of the mouth-position information of the different speakers when producing "a".
  • since tensor decomposition is generally performed with a higher-order singular value decomposition, the third-order tensor model is trained using an HOSVD algorithm to solve for the model parameters on the left-hand side of the above expression.
  • after the speech data are obtained from the deep neural network model, the speech data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the mouth-shape data corresponding to the speech data. In other words, when the sample library used to train the third-order tensor model contains the corpora of several speakers, the user can select the speaker identification information in advance, and the generated mouth-shape data will then be closer to that speaker's mouth-shape data.
  • Step S50: generate a lip animation corresponding to the voice data according to the mouth-shape data, so that the lip animation is displayed while the voice data is played.
  • in the solution of this embodiment, the deep neural network model is used to model the mapping from phoneme features to acoustic features; this mapping is nonlinear, and a deep neural network achieves better feature mining and representation, so the speech synthesis system produces more accurate and more natural output. In addition, by constructing a tensor model that expresses the correlation between pronunciation features and mouth-shape features, mouth-shape data that match the synthesized speech and look realistic can be obtained, enabling a dynamic display of the mouth shape while the voice data are played.
  • the speech-based lip animation synthesis method proposed in this embodiment obtains the phoneme features of the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the corresponding acoustic features, which include the MFCC features, the pronunciation duration and the pronunciation fundamental frequency; these acoustic features are input into a speech synthesizer to obtain the speech data corresponding to the target text data. According to the speech data, the pre-trained tensor model and the preset speaker identification information, the mouth-shape data corresponding to the speech data and the speaker identification information are acquired, and a mouth-shape animation corresponding to the speech data is generated from the mouth-shape data, so that the animation is displayed while the speech data are played.
  • this scheme uses the deep neural network model to transform the target text data into acoustic features, which enables better feature mining and gives the speech synthesis system more accurate and more natural output; at the same time, the tensor model that expresses the relation between acoustic features and mouth-shape data converts the synthesized speech data into corresponding mouth-shape data, from which a mouth-shape animation corresponding to the target text data is generated. This solves the technical problem that the prior art cannot display a realistic mouth-shape animation matching synthesized speech data.
  • an embodiment of the present application further provides a computer-readable storage medium on which a lip animation synthesis program is stored; the lip animation synthesis program can be executed by one or more processors to implement the following operations: acquiring target text data and acquiring its phoneme features according to a pronunciation dictionary; inputting the phoneme features into a pre-trained deep neural network model and outputting the corresponding acoustic features, including MFCC features, pronunciation duration and pronunciation fundamental frequency; inputting the acoustic features into a speech synthesizer to output the speech data corresponding to the target text data; acquiring mouth-shape data from the speech data, a pre-trained tensor model and preset speaker identification information; and generating the corresponding mouth-shape animation for display while the speech data are played.
  • the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as the ROM/RAM, magnetic disk or optical disk described above) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A device and method for speech-based mouth shape animation blending. The device comprises a memory and a processor. A mouth shape animation blending program that can run on the processor is stored in the memory. The program implements the following steps when executed by the processor: acquiring target text data, acquiring phoneme features in the target text data on the basis of a pronunciation dictionary (S10); inputting the phoneme features into a pre-trained deep neural network model, outputting acoustic features (S20); inputting the acoustic features into a speech synthesizer and outputting speech data (S30); acquiring mouth shape data on the basis of the speech data, of a pretrained tensor model, and of speaker identification information (S40); and generating a corresponding mouth shape animation on the basis of the mouth shape data and of the speech data (S50). The device and method solve the technical problem in the prior art in which a mouth shape animation matching speech data and having a real feel could not be presented.

Description

Speech-based lip animation synthesis device, method and readable storage medium
This application claims priority to Chinese Patent Application No. 201810327672.1, entitled "Speech-based lip animation synthesis device, method and readable storage medium", filed with the Chinese Patent Office on April 12, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technologies, and in particular to a speech-based lip animation synthesis device, method and readable storage medium.
Background
Speech synthesis, also known as text-to-speech technology, is a technology that converts text information into speech and reads it aloud. It involves acoustics, linguistics, digital signal processing, computer science and other disciplines, and is a cutting-edge technology in the field of Chinese information processing; the main problem it solves is how to convert text information into audible sound information.
In some application scenarios, such as computer-assisted pronunciation training, the speaker's mouth-shape changes need to be displayed dynamically while the voice data is played, to help the user practise pronunciation. In the prior art, when the played voice data is synthesized, there is no corresponding real speaker mouth data available for display, so a realistic mouth-shape animation that matches the synthesized voice data cannot be shown.
Summary of the Invention
The present application provides a speech-based lip animation synthesis device, method and readable storage medium, the main purpose of which is to solve the prior-art problem that a realistic mouth-shape animation matching synthesized speech data cannot be displayed.
To achieve the above object, the present application provides a speech-based lip animation synthesis device comprising a memory and a processor, wherein the memory stores a lip animation synthesis program executable on the processor, and the lip animation synthesis program, when executed by the processor, implements the following steps:
acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary;
inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency;
inputting the acoustic features into a speech synthesizer, and outputting speech data corresponding to the target text data;
acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, mouth-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of the speech data and the mouth-position features of the mouth-shape data;
generating a mouth-shape animation corresponding to the speech data according to the mouth-shape data, so that the mouth-shape animation is displayed while the speech data is played.
In addition, to achieve the above object, the present application further provides a speech-based lip animation synthesis method, the method comprising:
acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary;
inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency;
inputting the acoustic features into a speech synthesizer, and outputting speech data corresponding to the target text data;
acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, mouth-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of the speech data and the mouth-position features of the mouth-shape data;
generating a mouth-shape animation corresponding to the speech data according to the mouth-shape data, so that the mouth-shape animation is displayed while the speech data is played.
In addition, to achieve the above object, the present application further provides a computer-readable storage medium on which a lip animation synthesis program is stored; the lip animation synthesis program can be executed by one or more processors to implement the steps of the speech-based lip animation synthesis method described above.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a preferred embodiment of the speech-based lip animation synthesis device of the present application;
FIG. 2 is a schematic diagram of the program modules of the lip animation synthesis program in an embodiment of the speech-based lip animation synthesis device of the present application;
FIG. 3 is a flowchart of a preferred embodiment of the speech-based lip animation synthesis method of the present application.
The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description of the Embodiments
It should be understood that the specific embodiments described herein are merely illustrative of the application and are not intended to limit it.
The present application provides a speech-based lip animation synthesis device. Referring to FIG. 1, a schematic diagram of a preferred embodiment of the speech-based lip animation synthesis device of the present application is shown.
In this embodiment, the speech-based lip animation synthesis device may be a PC (Personal Computer), or a terminal device such as a smartphone, tablet computer or portable computer. The device includes at least a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments the memory 11 may be an internal storage unit of the speech-based lip animation synthesis device, such as its hard disk; in other embodiments it may be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the device. Further, the memory 11 may include both the internal storage unit and an external storage device. The memory 11 can be used not only to store the application software installed in the device and various types of data, such as the code of the lip animation synthesis program, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the lip animation synthesis program.
The communication bus 13 is used to implement connection and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device and other electronic devices.
FIG. 1 shows only a speech-based lip animation synthesis device having the components 11-14 and the lip animation synthesis program; it should be understood that not all illustrated components are required, and more or fewer components may be implemented instead.
Optionally, the device may further include a user interface, which may comprise a display and an input unit such as a keyboard, and optionally also a standard wired interface and a wireless interface. In some embodiments the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used to show the information processed in the speech-based lip animation synthesis device and to present a visual user interface.
In the device embodiment shown in FIG. 1, a lip animation synthesis program is stored in the memory 11; when the processor 12 executes the lip animation synthesis program stored in the memory 11, the following steps are implemented:
Acquire target text data, and acquire phoneme features in the target text data according to the pronunciation dictionary.
Input the phoneme features into a pre-trained deep neural network model, and output the acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency.
Input the acoustic features into a speech synthesizer, and output the speech data corresponding to the target text data.
In the solution proposed in this embodiment, the target text data is converted into speech data through a pre-established deep neural network model, and the speech data is converted into mouth-shape data through a pre-established tensor model. Specifically, the target text data to be synthesized is obtained, the target text data is split into characters or words by a word segmentation tool, and the resulting characters or words are then split into phonemes through the pronunciation dictionary, thereby obtaining the phoneme features; for Chinese, the phonemes include initial (consonant) phonemes and final (vowel) phonemes. In this embodiment, taking Chinese as an example, for each phoneme the phoneme features mainly include: the pronunciation features of the current phoneme, the previous phoneme and the next phoneme; the position of the current phoneme in the character; the syllable features of the current phoneme, the previous phoneme and the next phoneme; and the position in the sentence of the character containing the current phoneme. The pronunciation features include the phoneme type (vowel or consonant), length, pitch, accent position, position of the final, place of articulation, and whether the final is voiced; the syllable features include the syllable position, the position of the phoneme in the syllable, and the position of the syllable in the character. The phoneme features can thus be expressed as a 3*7+3*3+2=32-dimensional feature vector.
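As a concrete illustration of the 32-dimensional layout described above (seven pronunciation features for each of the previous, current and next phoneme, three syllable features for each of the three phonemes, plus two position features), a minimal sketch follows. The particular field names and encodings are assumptions for the example; the patent does not specify them.

```python
import numpy as np

def pronunciation_features(p: dict) -> list:
    # 7 values per phoneme: type, length, pitch, accent position, final position,
    # place of articulation, whether the final is voiced (numeric encodings assumed).
    return [p["type"], p["length"], p["pitch"], p["accent_pos"],
            p["final_pos"], p["articulation"], p["final_voiced"]]

def syllable_features(p: dict) -> list:
    # 3 values per phoneme: syllable position, position of the phoneme in the
    # syllable, position of the syllable in the character/word.
    return [p["syllable_pos"], p["pos_in_syllable"], p["syllable_in_word"]]

def phoneme_feature_vector(prev: dict, cur: dict, nxt: dict,
                           pos_in_word: float, word_pos_in_sentence: float) -> np.ndarray:
    # 3*7 + 3*3 + 2 = 32-dimensional feature vector, matching the count in the text.
    vec = (pronunciation_features(prev) + pronunciation_features(cur) + pronunciation_features(nxt)
           + syllable_features(prev) + syllable_features(cur) + syllable_features(nxt)
           + [pos_in_word, word_pos_in_sentence])
    assert len(vec) == 32
    return np.asarray(vec, dtype=np.float32)
```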
A deep neural network model for expressing the correlation between phoneme features and acoustic features is trained in advance; the above feature vector is input into the model to obtain the corresponding acoustic features, which include time-series features and the duration of each sound. The time-series features consist of a 25-dimensional feature vector and the fundamental frequency; the 25-dimensional vector contains 25 Mel-frequency cepstral coefficients (MFCC) and represents the acoustic features of one 10 ms frame of speech. The MFCC features, the pronunciation duration and the pronunciation fundamental frequency are then synthesized into a speech signal by a speech synthesizer.
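For reference, acoustic features of the kind described here (25 MFCCs per 10 ms frame plus a fundamental-frequency track) could be extracted from recorded training speech roughly as follows. librosa is used purely as an example tool and is not mentioned in the patent; the sampling rate and pitch range are assumptions.

```python
import librosa
import numpy as np

def extract_acoustic_features(wav_path: str, sr: int = 16000):
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(0.010 * sr)                      # 10 ms frame shift, as in the description
    # 25 Mel-frequency cepstral coefficients per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25, hop_length=hop)   # (25, n_frames)
    # Fundamental frequency (F0) track; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"),
                            sr=sr, hop_length=hop)
    f0 = np.nan_to_num(f0)                     # replace NaN (unvoiced frames) with 0
    n = min(mfcc.shape[1], len(f0))
    return mfcc[:, :n].T, f0[:n]               # (n_frames, 25) and (n_frames,)
```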
Before the deep neural network model of this embodiment is applied, it needs to be trained. First, a corpus is collected and a sample library is constructed from the corpus of at least one speaker; the corpus includes speech data together with the corresponding text data and mouth-shape data, i.e. the speech data obtained when one or more speakers read the same text aloud and the corresponding mouth-shape data. The mouth-shape data are physiological electromagnetic articulography (EMA) recordings that capture mouth-movement information and reflect the speaker's mouth shape during pronunciation. The deep neural network model is then trained on the text data and the speech data in the sample library, and the model parameters of the deep neural network model are obtained.
Specifically, the training process of the deep neural network model is as follows: the phoneme features are extracted from the text data in the sample library in combination with the pronunciation dictionary, forming the 3*7+3*3+2=32-dimensional feature vector; the acoustic features, mainly the MFCC features, the pronunciation duration and the pronunciation fundamental frequency, are extracted from the speech data corresponding to the text data and used as the reference for training; the two are fed into the deep neural network for training, yielding the model parameters to be solved, i.e. the weights of the individual phoneme and acoustic features between a particular phoneme and its corresponding pronunciation. The pronunciation duration can be predicted from the length and syllable-position features in the phoneme features, and the pronunciation fundamental frequency can be predicted from pronunciation features such as pitch and accent position in the phoneme features.
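The patent does not specify the network architecture or training setup. As a hedged illustration only, a simple feed-forward network mapping the 32-dimensional phoneme features to a 27-dimensional acoustic target (25 MFCCs, duration, fundamental frequency) could be trained along the following lines; PyTorch, the layer sizes and the hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

# Assumed shapes: X is (N, 32) phoneme feature vectors, Y is (N, 27) acoustic
# targets (25 MFCCs + duration + F0) built from the sample library.
def train_acoustic_model(X: torch.Tensor, Y: torch.Tensor,
                         epochs: int = 50, lr: float = 1e-3) -> nn.Module:
    model = nn.Sequential(              # depth/width are assumptions, not from the patent
        nn.Linear(32, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 27),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), Y)     # compare predictions with corpus targets
        loss.backward()                 # backpropagation through the network
        opt.step()
    return model
```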
According to the speech data, the pre-trained tensor model and the preset speaker identification information, the mouth-shape data corresponding to the speech data and the speaker identification information are acquired; the tensor model expresses the correlation between the pronunciation features of the speech data and the mouth-position features of the mouth-shape data.
It should be noted that the mouth-shape data in this embodiment are physiological electromagnetic articulography data obtained by capturing mouth-movement information; these data mainly include the coordinate information of a specific mouth shape and the corresponding mouth-shape image. During model training, the mouth-position features in the mouth-shape data are used directly; they mainly include the coordinate information of the following positions: tongue tip, tongue body, tongue dorsum, upper lip, lower lip, upper incisors and lower incisors.
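As a small illustration, the seven articulator positions named here can be flattened into one vector per recorded frame. Whether the coordinates are 2-D or 3-D is not entirely clear from the text, so 2-D coordinates are assumed below.

```python
import numpy as np

ARTICULATORS = ["tongue_tip", "tongue_body", "tongue_dorsum",
                "upper_lip", "lower_lip", "upper_incisor", "lower_incisor"]

def mouth_position_vector(ema_frame: dict) -> np.ndarray:
    """Flatten one articulography frame {articulator: (x, y)} into a 14-dimensional vector."""
    return np.asarray([c for name in ARTICULATORS for c in ema_frame[name]],
                      dtype=np.float32)

# Example usage (coordinates are made up):
frame = {name: (0.0, 0.0) for name in ARTICULATORS}
assert mouth_position_vector(frame).shape == (14,)
```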
Based on the speech data and mouth-shape data in the sample library, a tensor model expressing the correlation between the acoustic features and the mouth-shape data is trained in advance. The tensor model is a third-order tensor model whose three dimensions correspond to pronunciation features, mouth-shape data and speaker identification information, respectively. The pronunciation features of the speech data in the sample library are extracted, the pronunciation features and the speaker identification information are used as the input features of the third-order tensor model and the mouth-shape data as its output features, and the third-order tensor model is trained with a higher-order singular value decomposition (HOSVD) algorithm to obtain its model parameters.
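The patent names higher-order singular value decomposition as the training algorithm but does not give an implementation. A minimal NumPy sketch of HOSVD applied to a (speaker x pronunciation-unit x mouth-coordinate) data tensor, centred on the per-unit mean μ, is given below; the tensor layout and the example shapes are assumptions.

```python
import numpy as np

def unfold(T: np.ndarray, mode: int) -> np.ndarray:
    # Mode-n unfolding: bring the chosen axis to the front and flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T: np.ndarray, M: np.ndarray, mode: int) -> np.ndarray:
    # Multiply tensor T by matrix M along the chosen mode.
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=(1, 0)), 0, mode)

def hosvd(T: np.ndarray):
    # Higher-order SVD: one orthonormal factor matrix per mode plus a core tensor.
    factors = [np.linalg.svd(unfold(T, n), full_matrices=False)[0] for n in range(T.ndim)]
    core = T
    for n, U in enumerate(factors):
        core = mode_dot(core, U.T, n)
    return core, factors

# D[s, p, c]: averaged mouth coordinate c of speaker s for pronunciation unit p
# (shapes are illustrative only).
D = np.random.rand(5, 40, 14)
mu = D.mean(axis=0)                      # per-unit average mouth position (the mu of the text)
core, (U_spk, U_pron, U_coord) = hosvd(D - mu)

# Reconstruction: mouth data for all speakers/units from the learned factors.
recon = mu + mode_dot(mode_dot(mode_dot(core, U_spk, 0), U_pron, 1), U_coord, 2)
assert np.allclose(recon, D)
```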
Specifically, the third-order tensor model in this embodiment is constructed and trained as follows. The set of pronunciation features is taken as one parameter space [formula image PCTCN2018102209-appb-000001], and the set of lip-shape data corresponding to the pronunciation features as another parameter space [formula image PCTCN2018102209-appb-000002]. A multilinear space transform is built on these parameter spaces, with the expression [formula image PCTCN2018102209-appb-000003], where [formula image PCTCN2018102209-appb-000004] is a grid structure used to store lip-shape data [formula image PCTCN2018102209-appb-000005]. V stores the three-dimensional coordinate information of a specific mouth shape, of which two dimensions are the mouth-shape coordinates and the remaining one is the speaker identification information (the speaker ID), since the mouth positions of different speakers differ slightly; F stores the mouth image of the specific mouth shape. This space transform expresses the correlation between pronunciation features and lip-position features. From this multilinear space transform a third-order tensor is constructed whose three dimensions correspond to acoustic features, lip-shape data and speaker identification information respectively, with the expression [formula image PCTCN2018102209-appb-000006].
The left-hand side of the equation contains the model parameters to be solved, mainly the weights of the individual features in the parameter space [formula image PCTCN2018102209-appb-000007] and the parameter space [formula image PCTCN2018102209-appb-000008]; the right-hand side contains the features supplied during training, namely the pronunciation features and lip-position features obtained by feature extraction from the text data and lip-shape data in the database. Here C is the tensor notation, and μ is the mouth-position information averaged over different speakers; taking the sound "a" as an example, its μ is the average of the mouth positions of different speakers when producing "a". Since tensor decomposition generally uses a higher-order singular value decomposition algorithm, in this embodiment the third-order tensor model is trained with a higher-order singular value decomposition algorithm to solve the model parameters on the left-hand side of the expression.
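Because the tensor expressions themselves appear only as formula images in the published text, the sketch below shows a generic higher-order SVD of a third-order data tensor whose modes are pronunciation features, speakers and lip-position features. The tensor sizes and the way the data tensor is filled are assumptions, and the real model additionally works with the averaged lip position μ; this is only a sketch of the decomposition technique named above, not the patent's exact formulation:

import numpy as np

def unfold(tensor: np.ndarray, mode: int) -> np.ndarray:
    """Mode-n unfolding: move `mode` to the front and flatten the other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor: np.ndarray):
    """Higher-order SVD: one orthonormal factor per mode plus a core tensor."""
    factors = []
    for mode in range(tensor.ndim):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)
    # Core tensor: contract each mode of the data tensor with the transpose of its factor.
    core = tensor
    for mode, u in enumerate(factors):
        core = np.moveaxis(
            np.tensordot(u.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# Assumed sizes: 32 pronunciation features, 5 speakers, 21 lip-position values.
rng = np.random.default_rng(0)
data = rng.standard_normal((32, 5, 21))   # stand-in for the training tensor
core, (U_pron, U_speaker, U_lip) = hosvd(data)

# Reconstruction check: multiplying the core back by every factor recovers the data.
recon = core
for mode, u in enumerate((U_pron, U_speaker, U_lip)):
    recon = np.moveaxis(np.tensordot(u, np.moveaxis(recon, mode, 0), axes=1), 0, mode)
print(np.allclose(recon, data))  # True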
After the speech data has been obtained from the deep neural network model, the speech data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the lip-shape data corresponding to that speech data. That is, when the sample library used to train the third-order tensor model contains the corpora of several speakers, the user can select the speaker identification information in advance, and the generated lip-shape data will then be closer to that speaker's own lip-shape data.
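One plausible reading of this lookup step, assuming the factors from the trained tensor model above are available, is sketched below; the frame-level pronunciation feature, the way the speaker row is selected and the use of μ as an additive average are all illustrative assumptions rather than the patent's exact computation:

import numpy as np

def predict_lip_frame(core, U_pron, U_speaker, U_lip, mu,
                      pron_feat: np.ndarray, speaker_id: int) -> np.ndarray:
    """Project one frame of pronunciation features through the trained tensor
    model for the chosen speaker and return a lip-position vector."""
    # Contract the core tensor with the pronunciation and speaker factors...
    w_pron = pron_feat @ U_pron        # weights in the pronunciation-feature mode
    w_spk = U_speaker[speaker_id]      # row of the speaker factor for the chosen ID
    lip_latent = np.einsum("i,j,ijk->k", w_pron, w_spk, core)
    # ...then map back to lip coordinates and add the averaged mouth position.
    return lip_latent @ U_lip.T + mu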
A lip animation corresponding to the speech data is generated from the lip-shape data, so that the lip animation is displayed while the speech data is played. From the lip-shape data obtained for each phoneme in the target text data and a preset three-dimensional lip-region model, a lip animation that can be displayed dynamically is generated, and it is shown while the synthesized speech corresponding to the target text data is played. In the solution of this embodiment, a deep neural network model is used to model the mapping from phoneme features to acoustic features; this mapping is a non-linear problem, and a deep neural network achieves better feature mining and representation, so the speech synthesis system produces more accurate and more natural output. Moreover, by building a tensor model to express the correlation between pronunciation features and lip-shape features, lip-shape data that matches the synthesized speech and looks realistic can be obtained, so that the mouth shape is displayed dynamically while the speech data is played.
The speech-based lip animation synthesis device proposed in this embodiment obtains the phoneme features of the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the corresponding acoustic features, which comprise MFCC features, pronunciation duration and fundamental frequency. These acoustic features are fed into a speech synthesizer to obtain the speech data corresponding to the target text data. According to the speech data, the pre-trained tensor model and the preset speaker identification information, lip-shape data corresponding to the speech data and the speaker identification information is obtained, and a lip animation corresponding to the speech data is generated from the lip-shape data so that it can be displayed while the speech data is played. By using a deep neural network model to convert the target text data into acoustic features, better feature mining is achieved and the speech synthesis system produces more accurate and natural output; and by using a tensor model that can express the relation between acoustic features and lip-shape data to convert the synthesized speech data into corresponding lip-shape data, and generating from it a lip animation corresponding to the target text data, the solution addresses the prior-art problem of being unable to display a realistic lip animation that matches synthesized speech.
Optionally, in other embodiments, the lip animation synthesis program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the application. A module in this application refers to a series of computer program instruction segments capable of performing a specific function, used to describe the execution process of the lip animation synthesis program in the speech-based lip animation synthesis device.
For example, referring to FIG. 2, which is a schematic diagram of the program modules of the lip animation synthesis program in an embodiment of the speech-based lip animation synthesis device of this application, the lip animation synthesis program can be divided into a feature extraction module 10, a feature conversion module 20, a speech synthesis module 30, a lip generation module 40 and an animation synthesis module 50, exemplarily:
The feature extraction module 10 is configured to: acquire target text data, and acquire the phoneme features in the target text data according to a pronunciation dictionary;
The feature conversion module 20 is configured to: input the phoneme features into a pre-trained deep neural network model and output the acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency;
The speech synthesis module 30 is configured to: input the acoustic features into a speech synthesizer and output the speech data corresponding to the target text data;
The lip generation module 40 is configured to: acquire, according to the speech data, the pre-trained tensor model and the preset speaker identification information, the lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data;
The animation synthesis module 50 is configured to: generate, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played. (A structural sketch of these five modules is given after this list.)
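Purely as an illustration of the module split listed above, the five program modules could be arranged as plain Python classes; every class, method and parameter name here is invented for the sketch and does not come from the patent:

class FeatureExtractionModule:
    def __init__(self, pronunciation_dict):
        self.pronunciation_dict = pronunciation_dict

    def extract(self, target_text: str):
        """Split the text into words and look up their phoneme features."""
        ...

class FeatureConversionModule:
    def __init__(self, dnn_model):
        self.dnn_model = dnn_model

    def convert(self, phoneme_features):
        """Run the pre-trained DNN to get MFCCs, durations and F0."""
        ...

class SpeechSynthesisModule:
    def synthesize(self, acoustic_features):
        """Feed the acoustic features to a speech synthesizer."""
        ...

class LipGenerationModule:
    def __init__(self, tensor_model, speaker_id):
        self.tensor_model = tensor_model
        self.speaker_id = speaker_id

    def generate(self, speech_data):
        """Map the synthesized speech to lip-position data for the chosen speaker."""
        ...

class AnimationSynthesisModule:
    def render(self, lip_data):
        """Drive a 3-D lip model with the lip data while the audio plays."""
        ...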
The functions or operation steps implemented when the program modules such as the feature extraction module 10, the feature conversion module 20, the speech synthesis module 30, the lip generation module 40 and the animation synthesis module 50 are executed are substantially the same as in the embodiments above and are not repeated here.
In addition, this application also provides a speech-based lip animation synthesis method. Referring to FIG. 3, which is a flowchart of a preferred embodiment of the speech-based lip animation synthesis method of this application, the method may be performed by a device implemented in software and/or hardware; in the following, the speech-based lip animation synthesis device is taken as the executing body to describe the method of this embodiment.
In this embodiment, the speech-based lip animation synthesis method comprises:
Step S10: acquire target text data, and acquire the phoneme features in the target text data according to a pronunciation dictionary.
Step S20: input the phoneme features into a pre-trained deep neural network model and output the acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency.
Step S30: input the acoustic features into a speech synthesizer and output the speech data corresponding to the target text data.
In the solution of this embodiment, the target text data is converted into speech data through a pre-established deep neural network model, and the speech data is converted into lip-shape data through a pre-established tensor model. Specifically, the target text data to be synthesized is obtained and split into words with a word segmentation tool; the resulting words are then split into phonemes through the pronunciation dictionary, and the phoneme features are obtained. For Chinese, the phonemes comprise initial phonemes and final phonemes. Taking Chinese as an example, for each phoneme the phoneme features mainly comprise: the pronunciation features of the current phoneme, the previous phoneme and the next phoneme; the position of the current phoneme in the word; the syllable features of the current, previous and next phoneme; and the position in the sentence of the word containing the current phoneme. The pronunciation features comprise the phoneme type (vowel or consonant), length, pitch, stress position, position of the final, place of articulation and whether the final is voiced; the syllable features comprise the syllable position, the position of the phoneme in the syllable and the position of the syllable in the word. The phoneme features can be expressed as a 3*7+3*3+2=32-dimensional feature vector.
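To make the 3*7+3*3+2=32 layout concrete, the sketch below packs the per-phoneme features into a vector of that shape; the exact numeric encodings (for example how phoneme type or pitch is represented) are not specified in the text and are assumptions here:

import numpy as np

# 7 pronunciation features for each of the previous / current / next phoneme,
# 3 syllable features for each of those three phonemes,
# 2 position features for the current phoneme: 3*7 + 3*3 + 2 = 32.
PRON_KEYS = ["phoneme_type", "length", "pitch", "stress_pos",
             "final_pos", "articulation_place", "final_voiced"]
SYLL_KEYS = ["syllable_pos", "pos_in_syllable", "syllable_pos_in_word"]

def phoneme_feature_vector(prev_p, cur_p, next_p, pos_in_word, word_pos_in_sentence):
    """Build the 32-dimensional phoneme feature vector used as the DNN input."""
    vec = []
    for p in (prev_p, cur_p, next_p):
        vec += [p[k] for k in PRON_KEYS]        # 3 * 7 pronunciation features
    for p in (prev_p, cur_p, next_p):
        vec += [p[k] for k in SYLL_KEYS]        # 3 * 3 syllable features
    vec += [pos_in_word, word_pos_in_sentence]  # 2 position features
    return np.asarray(vec, dtype=float)         # shape (32,)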
A deep neural network model expressing the correlation between phoneme features and acoustic features is trained in advance; the above feature vector is input into this model to obtain the corresponding acoustic features, which contain time-series features and the pronunciation length of each sound. The time-series features comprise a 25-dimensional feature vector and the fundamental frequency; the 25-dimensional vector contains 25 Mel-frequency cepstral coefficients (MFCC) and represents the acoustic features of one 10 ms speech frame. The MFCC features, pronunciation length and fundamental frequency are then synthesized into a speech signal by a speech synthesizer.
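The handoff from acoustic features to waveform can be sketched as follows. `synthesize_waveform` is a stand-in for whatever synthesizer is used, since the text does not name one, the model is assumed to return a dictionary of outputs, and the frame arrangement (one 25-dimensional MFCC vector per 10 ms frame plus F0) follows the description above:

import numpy as np

def acoustics_to_speech(dnn_model, phoneme_vectors, synthesize_waveform):
    """Run the trained DNN phoneme by phoneme and hand the resulting
    frame-level acoustic features to a speech synthesizer."""
    mfcc_frames, f0_frames = [], []
    for vec in phoneme_vectors:                       # one 32-dim vector per phoneme
        out = dnn_model(vec)                          # assumed to return a dict
        n_frames = max(1, int(round(out["duration_ms"] / 10)))  # 10 ms per frame
        mfcc_frames += [out["mfcc"]] * n_frames       # 25 MFCCs per frame
        f0_frames += [out["f0"]] * n_frames
    mfcc = np.stack(mfcc_frames)                      # (T, 25)
    f0 = np.asarray(f0_frames)                        # (T,)
    return synthesize_waveform(mfcc, f0)              # stand-in synthesizer call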
Before the deep neural network model of this embodiment is applied, the model needs to be trained. First, corpora are collected to build samples: a sample library is built from the corpus of at least one speaker, the corpus comprising speech data together with the corresponding text data and lip-shape data, i.e. the speech data obtained when one or more speakers read the same text data, and the corresponding lip-shape data. The lip-shape data is physiological electromagnetic articulography data that captures mouth-movement information and reflects the speaker's mouth shape during pronunciation. Then the deep neural network model is trained from the text data and speech data in the sample library to obtain its model parameters.
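A sample library of this kind might be organized as parallel records per utterance and speaker; everything in the sketch below (record fields, file layout, grouping) is an assumption for illustration only:

from dataclasses import dataclass
from typing import List

@dataclass
class CorpusRecord:
    """One aligned training example in the sample library."""
    speaker_id: str      # which speaker read the sentence
    text: str            # the text data that was read
    audio_path: str      # the recorded speech data
    ema_path: str        # the EMA (lip-shape) capture recorded at the same time

def build_sample_library(records: List[CorpusRecord]):
    """Group records so the DNN sees (text, audio) pairs and the tensor
    model sees (audio, lip data, speaker) triples."""
    dnn_pairs = [(r.text, r.audio_path) for r in records]
    tensor_triples = [(r.audio_path, r.ema_path, r.speaker_id) for r in records]
    return dnn_pairs, tensor_triples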
Specifically, the deep neural network model is trained as follows. Phoneme features are extracted from the text data in the sample library with the help of the pronunciation dictionary; these features form a 3*7+3*3+2=32-dimensional feature vector. Acoustic features are extracted from the speech data corresponding to that text data, mainly MFCC features, pronunciation length and fundamental frequency, and serve as the reference information against which training is compared. Both are fed into the deep neural network model for training to obtain the model parameters to be solved, namely the weights of the individual phoneme features and acoustic features linking a particular phoneme to its corresponding pronunciation. The pronunciation duration can be predicted from the length and syllable-position features among the phoneme features, and the fundamental frequency from pronunciation features such as pitch and stress position.
Step S40: according to the speech data, the pre-trained tensor model and the preset speaker identification information, acquire the lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data.
It should be noted that the lip-shape data in this embodiment is physiological electromagnetic articulography (EMA) data that captures mouth-movement information; the EMA data mainly comprises the coordinate information of a specific mouth shape and the corresponding mouth image. During model training, the lip-position features in the lip-shape data are used directly; they mainly comprise the coordinates of the following positions: tongue tip, tongue body, tongue dorsum, upper lip, lower lip, upper incisors and lower incisors.
Based on the speech data and lip-shape data in the sample library, a tensor model expressing the correlation between acoustic features and lip-shape data is trained in advance. The tensor model is a third-order tensor model whose three dimensions correspond to pronunciation features, lip-shape data and speaker identification information respectively. The pronunciation features of the speech data in the sample library are obtained; the pronunciation features and the speaker identification information are used as the input features of the third-order tensor model and the lip-shape data as its output features, and the third-order tensor model is trained with a higher-order singular value decomposition algorithm to obtain its model parameters.
Specifically, the third-order tensor model in this embodiment is constructed and trained as follows. The set of pronunciation features is taken as one parameter space [formula image PCTCN2018102209-appb-000009], and the set of lip-shape data corresponding to the pronunciation features as another parameter space [formula image PCTCN2018102209-appb-000010]. A multilinear space transform is built on these parameter spaces, with the expression [formula image PCTCN2018102209-appb-000011], where [formula image PCTCN2018102209-appb-000012] is a grid structure used to store lip-shape data [formula image PCTCN2018102209-appb-000013]. V stores the three-dimensional coordinate information of a specific mouth shape, of which two dimensions are the mouth-shape coordinates and the remaining one is the speaker identification information (the speaker ID), since the mouth positions of different speakers differ slightly; F stores the mouth image of the specific mouth shape. This space transform expresses the correlation between pronunciation features and lip-position features. From this multilinear space transform a third-order tensor is constructed whose three dimensions correspond to acoustic features, lip-shape data and speaker identification information respectively, with the expression [formula image PCTCN2018102209-appb-000014].
The left-hand side of the equation contains the model parameters to be solved, mainly the weights of the individual features in the parameter space [formula image PCTCN2018102209-appb-000015] and the parameter space [formula image PCTCN2018102209-appb-000016]; the right-hand side contains the features supplied during training, namely the pronunciation features and lip-position features obtained by feature extraction from the text data and lip-shape data in the database. Here C is the tensor notation, and μ is the mouth-position information averaged over different speakers; taking the sound "a" as an example, its μ is the average of the mouth positions of different speakers when producing "a". Since tensor decomposition generally uses a higher-order singular value decomposition algorithm, in this embodiment the third-order tensor model is trained with a higher-order singular value decomposition algorithm to solve the model parameters on the left-hand side of the expression.
After the speech data has been obtained from the deep neural network model, the speech data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the lip-shape data corresponding to that speech data. That is, when the sample library used to train the third-order tensor model contains the corpora of several speakers, the user can select the speaker identification information in advance, and the generated lip-shape data will then be closer to that speaker's own lip-shape data.
Step S50: generate, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played.
From the lip-shape data obtained for each phoneme in the target text data and a preset three-dimensional lip-region model, a lip animation that can be displayed dynamically is generated, and it is shown while the synthesized speech corresponding to the target text data is played. In the solution of this embodiment, a deep neural network model is used to model the mapping from phoneme features to acoustic features; this mapping is a non-linear problem, and a deep neural network achieves better feature mining and representation, so the speech synthesis system produces more accurate and more natural output. Moreover, by building a tensor model to express the correlation between pronunciation features and lip-shape features, lip-shape data that matches the synthesized speech and looks realistic can be obtained, so that the mouth shape is displayed dynamically while the speech data is played.
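One simple way to turn per-phoneme lip data into an animation is to place each lip-position vector at the midpoint of its phoneme and interpolate at the video frame rate, as sketched below; the interpolation scheme and frame rate are assumptions, and driving the actual 3-D lip-region model is left as a stand-in step:

import numpy as np

def lip_animation_frames(lip_keys, durations_ms, fps=25):
    """Interpolate per-phoneme lip-position vectors into per-frame poses.

    lip_keys     : (N, D) array, one lip-position vector per phoneme
    durations_ms : (N,) predicted duration of each phoneme in milliseconds
    """
    lip_keys = np.asarray(lip_keys, dtype=float)
    durations = np.asarray(durations_ms, dtype=float)
    ends = np.cumsum(durations)
    key_times = ends - durations / 2.0            # centre of each phoneme
    frame_times = np.arange(0.0, ends[-1], 1000.0 / fps)
    # Linear interpolation of every lip coordinate over time.
    frames = np.stack([np.interp(frame_times, key_times, lip_keys[:, d])
                       for d in range(lip_keys.shape[1])], axis=1)
    return frames                                  # (num_frames, D)

# Each frame would then be applied to the preset 3-D lip model and shown
# in sync with audio playback, e.g. apply_to_lip_model(frame)  # stand-in call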
The speech-based lip animation synthesis method proposed in this embodiment obtains the phoneme features of the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the corresponding acoustic features, which comprise MFCC features, pronunciation duration and fundamental frequency. These acoustic features are fed into a speech synthesizer to obtain the speech data corresponding to the target text data. According to the speech data, the pre-trained tensor model and the preset speaker identification information, lip-shape data corresponding to the speech data and the speaker identification information is obtained, and a lip animation corresponding to the speech data is generated from the lip-shape data so that it can be displayed while the speech data is played. By using a deep neural network model to convert the target text data into acoustic features, better feature mining is achieved and the speech synthesis system produces more accurate and natural output; and by using a tensor model that can express the relation between acoustic features and lip-shape data to convert the synthesized speech data into corresponding lip-shape data, and generating from it a lip animation corresponding to the target text data, the solution addresses the prior-art problem of being unable to display a realistic lip animation that matches synthesized speech.
In addition, an embodiment of this application further provides a computer-readable storage medium on which a lip animation synthesis program is stored; the lip animation synthesis program can be executed by one or more processors to implement the following operations:
acquiring target text data, and acquiring the phoneme features in the target text data according to a pronunciation dictionary;
inputting the phoneme features into a pre-trained deep neural network model and outputting the acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency;
inputting the acoustic features into a speech synthesizer and outputting the speech data corresponding to the target text data;
acquiring, according to the speech data, the pre-trained tensor model and the preset speaker identification information, the lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data;
generating, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played.
The specific embodiments of the computer-readable storage medium of this application are substantially the same as the embodiments of the speech-based lip animation synthesis device and method above and are not repeated here.
It should be noted that the serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, device, article or method that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, device, article or method that comprises the element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as ROM/RAM, magnetic disk, optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, computer, server, network device, etc.) to perform the methods described in the various embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit the patent scope of this application; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect use in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. A speech-based lip animation synthesis device, characterized in that the device comprises a memory and a processor, the memory storing a lip animation synthesis program executable on the processor, and the lip animation synthesis program, when executed by the processor, implementing the following steps:
    acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary;
    inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency;
    inputting the acoustic features into a speech synthesizer, and outputting speech data corresponding to the target text data;
    acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data;
    generating, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played.
  2. The speech-based lip animation synthesis device according to claim 1, characterized in that the step of acquiring target text data and acquiring phoneme features in the target text data according to a pronunciation dictionary comprises:
    acquiring target text data and performing word segmentation on the target text data to obtain a word segmentation result;
    converting the words in the word segmentation result into phoneme features through the pronunciation dictionary.
  3. The speech-based lip animation synthesis device according to claim 1, characterized in that the lip animation synthesis program is further executable by the processor to implement the following steps:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  4. The speech-based lip animation synthesis device according to claim 2, characterized in that the lip animation synthesis program is further executable by the processor to implement the following steps:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  5. The speech-based lip animation synthesis device according to claim 3, characterized in that the step of training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model comprises:
    extracting phoneme features from the text data in the sample library according to the pronunciation dictionary, and extracting acoustic features from the speech data corresponding to the text data;
    using the phoneme features as input features of the deep neural network model and the acoustic features as output features of the deep neural network model, training the deep neural network model to obtain the model parameters of the deep neural network model.
  6. The speech-based lip animation synthesis device according to claim 4, characterized in that the step of training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model comprises:
    extracting phoneme features from the text data in the sample library according to the pronunciation dictionary, and extracting acoustic features from the speech data corresponding to the text data;
    using the phoneme features as input features of the deep neural network model and the acoustic features as output features of the deep neural network model, training the deep neural network model to obtain the model parameters of the deep neural network model.
  7. The speech-based lip animation synthesis device according to claim 5 or 6, characterized in that the tensor model is a third-order tensor model, and the step of training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model comprises:
    constructing a third-order tensor model whose three dimensions correspond to pronunciation features, lip-shape data and speaker identification information respectively;
    acquiring the pronunciation features corresponding to the speech data in the sample library, using the pronunciation features and the speaker identification information as input features of the third-order tensor model and the lip-shape data corresponding to the speech data as output features of the third-order tensor model, and training the third-order tensor model with a higher-order singular value decomposition algorithm to obtain the model parameters of the third-order tensor model.
  8. A speech-based lip animation synthesis method, characterized in that the method comprises:
    acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary;
    inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency;
    inputting the acoustic features into a speech synthesizer, and outputting speech data corresponding to the target text data;
    acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data;
    generating, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played.
  9. The speech-based lip animation synthesis method according to claim 8, characterized in that the step of acquiring target text data and acquiring phoneme features in the target text data according to a pronunciation dictionary comprises:
    acquiring target text data and performing word segmentation on the target text data to obtain a word segmentation result;
    converting the words in the word segmentation result into phoneme features through the pronunciation dictionary.
  10. The speech-based lip animation synthesis method according to claim 8, characterized in that the method further comprises:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  11. The speech-based lip animation synthesis method according to claim 9, characterized in that the method further comprises:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  12. The speech-based lip animation synthesis method according to claim 10, characterized in that the step of training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model comprises:
    extracting phoneme features from the text data in the sample library according to the pronunciation dictionary, and extracting acoustic features from the speech data corresponding to the text data;
    using the phoneme features as input features of the deep neural network model and the acoustic features as output features of the deep neural network model, training the deep neural network model to obtain the model parameters of the deep neural network model.
  13. The speech-based lip animation synthesis method according to claim 11, characterized in that the step of training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model comprises:
    extracting phoneme features from the text data in the sample library according to the pronunciation dictionary, and extracting acoustic features from the speech data corresponding to the text data;
    using the phoneme features as input features of the deep neural network model and the acoustic features as output features of the deep neural network model, training the deep neural network model to obtain the model parameters of the deep neural network model.
  14. The speech-based lip animation synthesis method according to claim 12 or 13, characterized in that the tensor model is a third-order tensor model, and the step of training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model comprises:
    constructing a third-order tensor model whose three dimensions correspond to pronunciation features, lip-shape data and speaker identification information respectively;
    acquiring the pronunciation features corresponding to the speech data in the sample library, using the pronunciation features and the speaker identification information as input features of the third-order tensor model and the lip-shape data corresponding to the speech data as output features of the third-order tensor model, and training the third-order tensor model with a higher-order singular value decomposition algorithm to obtain the model parameters of the third-order tensor model.
  15. A computer-readable storage medium, characterized in that a lip animation synthesis program is stored on the computer-readable storage medium, and the lip animation synthesis program is executable by one or more processors to implement the following steps:
    acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary;
    inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency;
    inputting the acoustic features into a speech synthesizer, and outputting speech data corresponding to the target text data;
    acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data;
    generating, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played.
  16. The computer-readable storage medium according to claim 15, characterized in that the step of acquiring target text data and acquiring phoneme features in the target text data according to a pronunciation dictionary comprises:
    acquiring target text data and performing word segmentation on the target text data to obtain a word segmentation result;
    converting the words in the word segmentation result into phoneme features through the pronunciation dictionary.
  17. The computer-readable storage medium according to claim 15, characterized in that the method further comprises:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  18. The computer-readable storage medium according to claim 16, characterized in that the method further comprises:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  19. The computer-readable storage medium according to claim 17 or 18, characterized in that the step of training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model comprises:
    extracting phoneme features from the text data in the sample library according to the pronunciation dictionary, and extracting acoustic features from the speech data corresponding to the text data;
    using the phoneme features as input features of the deep neural network model and the acoustic features as output features of the deep neural network model, training the deep neural network model to obtain the model parameters of the deep neural network model.
  20. The computer-readable storage medium according to claim 19, characterized in that the tensor model is a third-order tensor model, and the step of training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model comprises:
    constructing a third-order tensor model whose three dimensions correspond to pronunciation features, lip-shape data and speaker identification information respectively;
    acquiring the pronunciation features corresponding to the speech data in the sample library, using the pronunciation features and the speaker identification information as input features of the third-order tensor model and the lip-shape data corresponding to the speech data as output features of the third-order tensor model, and training the third-order tensor model with a higher-order singular value decomposition algorithm to obtain the model parameters of the third-order tensor model.
PCT/CN2018/102209 2018-04-12 2018-08-24 Device and method for speech-based mouth shape animation blending, and readable storage medium WO2019196306A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810327672.1 2018-04-12
CN201810327672.1A CN108763190B (en) 2018-04-12 2018-04-12 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing

Publications (1)

Publication Number Publication Date
WO2019196306A1 true WO2019196306A1 (en) 2019-10-17

Family

ID=63981728

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102209 WO2019196306A1 (en) 2018-04-12 2018-08-24 Device and method for speech-based mouth shape animation blending, and readable storage medium

Country Status (2)

Country Link
CN (1) CN108763190B (en)
WO (1) WO2019196306A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827799A (en) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
EP3866166A1 (en) * 2020-02-13 2021-08-18 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for predicting mouth-shape feature, electronic device, storage medium and computer program product
CN117173292A (en) * 2023-09-07 2023-12-05 河北日凌智能科技有限公司 Digital human interaction method and device based on vowel slices

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288077B (en) * 2018-11-14 2022-12-16 腾讯科技(深圳)有限公司 Method and related device for synthesizing speaking expression based on artificial intelligence
CN109523616B (en) * 2018-12-04 2023-05-30 科大讯飞股份有限公司 Facial animation generation method, device, equipment and readable storage medium
CN111326141A (en) * 2018-12-13 2020-06-23 南京硅基智能科技有限公司 Method for processing and acquiring human voice data
CN109801349B (en) * 2018-12-19 2023-01-24 武汉西山艺创文化有限公司 Sound-driven three-dimensional animation character real-time expression generation method and system
CN109599113A (en) * 2019-01-22 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for handling information
CN110136698B (en) * 2019-04-11 2021-09-24 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for determining mouth shape
CN110189394B (en) * 2019-05-14 2020-12-29 北京字节跳动网络技术有限公司 Mouth shape generation method and device and electronic equipment
CN110288682B (en) * 2019-06-28 2023-09-26 北京百度网讯科技有限公司 Method and apparatus for controlling changes in a three-dimensional virtual portrait mouth shape
CN112181127A (en) * 2019-07-02 2021-01-05 上海浦东发展银行股份有限公司 Method and device for man-machine interaction
CN111133506A (en) * 2019-12-23 2020-05-08 深圳市优必选科技股份有限公司 Training method and device of speech synthesis model, computer equipment and storage medium
CN110992926B (en) * 2019-12-26 2022-06-10 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium
CN111340920B (en) * 2020-03-02 2024-04-09 长沙千博信息技术有限公司 Semantic-driven two-dimensional animation automatic generation method
CN111698552A (en) * 2020-05-15 2020-09-22 完美世界(北京)软件科技发展有限公司 Video resource generation method and device
CN112331184B (en) * 2020-10-29 2024-03-15 网易(杭州)网络有限公司 Voice mouth shape synchronization method and device, electronic equipment and storage medium
CN112837401B (en) * 2021-01-27 2024-04-09 网易(杭州)网络有限公司 Information processing method, device, computer equipment and storage medium
CN113079328B (en) * 2021-03-19 2023-03-28 北京有竹居网络技术有限公司 Video generation method and device, storage medium and electronic equipment
CN113314094B (en) * 2021-05-28 2024-05-07 北京达佳互联信息技术有限公司 Lip model training method and device and voice animation synthesis method and device
CN113707124A (en) * 2021-08-30 2021-11-26 平安银行股份有限公司 Linkage broadcasting method and device of voice operation, electronic equipment and storage medium
CN113870396B (en) * 2021-10-11 2023-08-15 北京字跳网络技术有限公司 Mouth shape animation generation method and device, computer equipment and storage medium
CN114420088A (en) * 2022-01-20 2022-04-29 安徽淘云科技股份有限公司 Display method and related equipment thereof
CN114581567B (en) * 2022-05-06 2022-08-02 成都市谛视无限科技有限公司 Method, device and medium for driving mouth shape of virtual image by sound
CN116257762B (en) * 2023-05-16 2023-07-14 世优(北京)科技有限公司 Training method of deep learning model and method for controlling mouth shape change of virtual image

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080312930A1 (en) * 1997-08-05 2008-12-18 At&T Corp. Method and system for aligning natural and synthetic video to speech synthesis
CN104361620A (en) * 2014-11-27 2015-02-18 韩慧健 Mouth shape animation synthesis method based on comprehensive weighted algorithm
US9262857B2 (en) * 2013-01-16 2016-02-16 Disney Enterprises, Inc. Multi-linear dynamic hair or clothing model with efficient collision handling
CN106297792A (en) * 2016-09-14 2017-01-04 厦门幻世网络科技有限公司 The recognition methods of a kind of voice mouth shape cartoon and device
CN106531150A (en) * 2016-12-23 2017-03-22 上海语知义信息技术有限公司 Emotion synthesis method based on deep neural network model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080312930A1 (en) * 1997-08-05 2008-12-18 At&T Corp. Method and system for aligning natural and synthetic video to speech synthesis
US9262857B2 (en) * 2013-01-16 2016-02-16 Disney Enterprises, Inc. Multi-linear dynamic hair or clothing model with efficient collision handling
CN104361620A (en) * 2014-11-27 2015-02-18 韩慧健 Mouth shape animation synthesis method based on comprehensive weighted algorithm
CN106297792A (en) * 2016-09-14 2017-01-04 厦门幻世网络科技有限公司 Speech-driven mouth shape animation recognition method and device
CN106531150A (en) * 2016-12-23 2017-03-22 上海语知义信息技术有限公司 Emotion synthesis method based on deep neural network model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GRALEWSKI, L. ET AL.: "Using a Tensor Framework for the Analysis of Facial Dynamics", 7TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FGR06), 24 April 2006 (2006-04-24), pages 217-222, XP010911558, DOI: 10.1109/FGR.2006.108 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827799A (en) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
CN110827799B (en) * 2019-11-21 2022-06-10 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
EP3866166A1 (en) * 2020-02-13 2021-08-18 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for predicting mouth-shape feature, electronic device, storage medium and computer program product
US11562732B2 (en) 2020-02-13 2023-01-24 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for predicting mouth-shape feature, and electronic device
CN117173292A (en) * 2023-09-07 2023-12-05 河北日凌智能科技有限公司 Digital human interaction method and device based on vowel slices

Also Published As

Publication number Publication date
CN108763190A (en) 2018-11-06
CN108763190B (en) 2019-04-02

Similar Documents

Publication Publication Date Title
WO2019196306A1 (en) Device and method for speech-based mouth shape animation blending, and readable storage medium
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN106575500B (en) Method and apparatus for synthesizing speech based on facial structure
US9361722B2 (en) Synthetic audiovisual storyteller
WO2017067206A1 (en) Training method for multiple personalized acoustic models, and voice synthesis method and device
WO2019056500A1 (en) Electronic apparatus, speech synthesis method, and computer readable storage medium
KR102116309B1 (en) Synchronization animation output system of virtual characters and text
JP6206960B2 (en) Pronunciation operation visualization device and pronunciation learning device
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
JP2018146803A (en) Voice synthesizer and program
JP5913394B2 (en) Audio synchronization processing apparatus, audio synchronization processing program, audio synchronization processing method, and audio synchronization system
CN109949791A (en) Emotional speech synthesizing method, device and storage medium based on HMM
Karpov et al. Automatic technologies for processing spoken sign languages
CN112599113B (en) Dialect voice synthesis method, device, electronic equipment and readable storage medium
WO2024088321A1 (en) Virtual image face driving method and apparatus, electronic device and medium
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
CN112735371A (en) Method and device for generating speaker video based on text information
JP5807921B2 (en) Quantitative F0 pattern generation device and method, model learning device for F0 pattern generation, and computer program
TWI574254B (en) Speech synthesis method and apparatus for electronic system
Mukherjee et al. A Bengali speech synthesizer on Android OS
WO2023142413A1 (en) Audio data processing method and apparatus, electronic device, medium, and program product
JP2016142936A (en) Preparing method for data for speech synthesis, and preparing device data for speech synthesis
CN112634861A (en) Data processing method and device, electronic equipment and readable storage medium
JP6475572B2 (en) Utterance rhythm conversion device, method and program
JP6137708B2 (en) Quantitative F0 pattern generation device, model learning device for F0 pattern generation, and computer program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 18914626
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: pct application non-entry in european phase
Ref document number: 18914626
Country of ref document: EP
Kind code of ref document: A1