CN113539240A - Animation generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113539240A
Authority
CN
China
Prior art keywords
phoneme
mouth shape
pronunciation
language
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110812403.6A
Other languages
Chinese (zh)
Inventor
王海新
杜峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202110812403.6A priority Critical patent/CN113539240A/en
Publication of CN113539240A publication Critical patent/CN113539240A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/005 — Language recognition
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 — Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 — Transforming into visible information
    • G10L2021/105 — Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiment of the invention discloses an animation generation method and device, electronic equipment, and a storage medium. The animation generation method includes: acquiring target voice data and target text data corresponding to the target voice data, where the target voice data includes voice data of different languages; analyzing and recognizing the target text data to obtain each phoneme included in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation period of each phoneme; determining the language to which each phoneme belongs; querying the mouth shape configuration table of the language to which each phoneme belongs to obtain the mouth shape configured for each phoneme; and driving the avatar according to the corresponding mouth shape within the pronunciation period of each phoneme to generate a mouth shape animation. The embodiment of the invention can improve the degree of fit between the avatar's mouth shapes and the expressed sentence, so that the avatar's mouth shape changes are richer and its expression is smoother and more natural.

Description

Animation generation method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to computer technology, in particular to an animation generation method and device, electronic equipment and a storage medium.
Background
With the development of the live-streaming industry, major platforms have been rushing to release their own avatars, which speak interactive sentences to interact with users; for example, an avatar says "welcome xx" to welcome a user entering the live-streaming room, or explains product information to users in the room. When an avatar is used to interact with users, the interactive sentences usually contain, besides Chinese, words in other languages (such as an English user name or an English product name). For the other-language words in an interactive sentence, the common practice at present is to replace their mouth shapes with Chinese mouth shapes and drive the avatar accordingly.
In the process of implementing the invention, the inventors found that driving the avatar by substituting Chinese mouth shapes for the mouth shapes of other languages leads to problems such as mouth shapes that do not match the interactive sentence, monotonous mouth shape changes, and stiff, unnatural expression.
Disclosure of Invention
The embodiment of the invention provides an animation generation method and device, electronic equipment, and a storage medium, which can improve the degree of fit between the avatar's mouth shapes and the expressed sentence, so that the avatar's mouth shape changes are richer and its expression is smoother and more natural.
In a first aspect, an embodiment of the present invention provides an animation generation method, including:
acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
analyzing and identifying the target text data to obtain each phoneme included in the target text data, and analyzing and identifying the target voice data to obtain a pronunciation time period of each phoneme in each phoneme;
determining the language to which each phoneme belongs;
inquiring a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme;
and driving the virtual character according to the corresponding mouth shape in the pronunciation period of each phoneme to generate mouth shape animation.
In a second aspect, an embodiment of the present invention provides an animation generation apparatus, including:
the acquisition module is used for acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
the recognition module is used for analyzing and recognizing the target text data to obtain each phoneme included in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation time period of each phoneme in each phoneme;
the determining module is used for determining the language to which each phoneme belongs;
the query module is used for querying a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme;
and the generating module is used for driving the virtual image according to the corresponding mouth shape in the pronunciation time interval of each phoneme so as to generate mouth shape animation.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the animation generation method according to any one of the embodiments of the present invention.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the animation generation method according to any one of the embodiments of the present invention.
In the embodiment of the present invention, the speech data (i.e., the target speech data) and the text data (i.e., the target text data) of a target sentence composed of different languages can be analyzed and recognized to obtain each phoneme included in the target text data and the pronunciation period of each phoneme, and the language to which each phoneme belongs can be determined; the mouth shape configuration table of the language to which each phoneme belongs is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven according to the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation. In other words, the embodiment of the present invention can identify the phonemes of the different languages contained in the target text data and query the mouth shape configuration table of each language, so that each phoneme is given a mouth shape of its own language and the avatar is driven with mouth shapes of different languages. This makes the avatar's mouth shape changes richer and avoids the problems caused by substituting Chinese mouth shapes for those of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth shape changes, and stiff, unnatural expression; it thereby improves the degree of fit between the avatar's mouth shapes and the expressed sentence and makes the avatar's expression smoother and more natural.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flowchart of an animation generation method according to an embodiment of the present invention.
Fig. 2 is a schematic flowchart of a method for creating a mouth shape configuration table according to an embodiment of the present invention.
Fig. 3 is a diagram of an example of a mouth shape configuration table provided by an embodiment of the invention.
Fig. 4 is another exemplary diagram of a mouth shape configuration table provided by the embodiment of the invention.
Fig. 5 is a flowchart illustrating a method for driving an avatar according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a vector variation law provided by an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of an animation generation apparatus according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Fig. 1 is a schematic flowchart of an animation generating method according to an embodiment of the present invention, which may be implemented by an animation generating apparatus according to an embodiment of the present invention, and the apparatus may be implemented in software and/or hardware. In a particular embodiment, the apparatus may be integrated in an electronic device, which may be, for example, a computer. The following embodiments will be described by taking as an example that the apparatus is integrated in an electronic device, and referring to fig. 1, the method may specifically include the following steps:
step 101, acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages.
Illustratively, taking the case where the avatar needs to be driven to express a target sentence, the target speech data can be understood as the target sentence expressed in the form of speech, and the target text data as the target sentence expressed in the form of text; the target sentence may be composed of different languages such as Chinese, English, French, and Russian. A specific target sentence (represented by its text data) is, for example, "hello Sam", i.e., a target sentence composed of Chinese and English.
In a specific implementation, the avatar to be driven may be a virtual character; the specific character is not limited, and the avatar may be two-dimensional or three-dimensional. Specifically, the target voice data and the target text data can be obtained from real-time user input: for example, speech uttered by the user can be picked up through a microphone to obtain the target voice data, and text provided by the user can be acquired through a keyboard or a screen to obtain the corresponding target text data. Alternatively, preset target voice data and the corresponding target text data can be acquired; or the target voice data can be obtained first and then converted to obtain the corresponding target text data.
Specifically, after the target text data is obtained, it may be segmented by language to obtain segmented text data; for example, when the target text data contains both Chinese and English, the Chinese and English parts can be separated. For example, when the target text data is "hello Sam", it may be divided into "hello" and "Sam". After segmentation, the segmented text data is preprocessed to obtain the text data to be recognized, so as to facilitate subsequent recognition. The preprocessing may include, but is not limited to, converting special symbols into words of the corresponding language and splitting contractions; for example, the special symbol "&" may be converted to "and", the special symbol "#" may be converted to "well", and the contraction "what's" may be split into "what is". A minimal sketch along these lines is given below.
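For illustration only, the following Python sketch segments a mixed-language sentence and applies the preprocessing described above. The splitting rule (runs of Han characters versus runs of Latin characters), the symbol map, and the contraction table are assumptions made for the example, not the exact implementation of the method.

import re

# Hypothetical conversion tables; the concrete entries merely mirror the examples in the text.
SPECIAL_SYMBOLS = {"&": " and ", "#": " well "}
CONTRACTIONS = {"what's": "what is"}

def segment_by_language(text):
    # Split the sentence into runs of Han characters and runs of Latin/symbol characters,
    # e.g. a Chinese greeting followed by "Sam" becomes two segments.
    return re.findall("[\u4e00-\u9fff]+|[A-Za-z'&#]+", text)

def preprocess(segments):
    # Convert special symbols into words, split contractions, and return the tokens to recognize.
    tokens = []
    for seg in segments:
        for symbol, word in SPECIAL_SYMBOLS.items():
            seg = seg.replace(symbol, word)
        for full, expanded in CONTRACTIONS.items():
            seg = seg.replace(full, expanded)
        tokens.extend(seg.split())
    return tokens

print(preprocess(segment_by_language("你好Sam")))  # ['你好', 'Sam']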
Step 102, analyzing and recognizing the target text data to obtain each phoneme included in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation time interval of each phoneme in each phoneme.
Specifically, the preprocessed target text data, i.e., the text data to be recognized, may be analyzed and recognized to obtain each phoneme included in the target text data. In a specific implementation, pre-established pronunciation dictionary libraries of the respective languages can be used to analyze and recognize the text data to be recognized, so as to obtain each phoneme included in the target text data. For example, the pronunciation dictionary library of the corresponding language may be searched for each character or word in the text data to be recognized, thereby obtaining each phoneme.
Illustratively, the pronunciation dictionary libraries of the respective languages include a Chinese pronunciation dictionary library, an English pronunciation dictionary library, a French pronunciation dictionary library, and the like. The Chinese pronunciation dictionary library may include each Chinese character and its pronunciation (phonemes), and the English pronunciation dictionary library may include each word and its pronunciation (phonemes).
In one specific embodiment, part of the data in the Chinese pronunciation dictionary library may be as follows, wherein the Arabic numerals represent the pronunciation tones of the Chinese phonemes:
through (穿过): ch uan1 g uo4
shuttle (穿梭): ch uan1 s uo1
wearing (穿着): ch uan1 zh uo2
pass (传): ch uan2
In one specific embodiment, part of the data in the English pronunciation dictionary library may be as follows, wherein the Arabic numerals represent the lexical stress of the English phonemes:
ABILENE AE1 B IH0 L IY2 N
ABILITIES AH0 B IH1 L AH0 T IY0 Z
ABILITY AH0 B IH1 L AH0 T IY0
ABIMELECH AE0 B IH0 M EH0 L EH1 K
ABINADAB AH0 B AY1 N AH0 D AE1 B
ABINGDON AE1 B IH0 NG D AH0 N
ABINGDON'S AE1 B IH0 NG D AH0 N Z
ABINGER AE1 B IH0 NG ER0
In addition, to broaden the application scenarios of the invention, English brand names, English personal names, and the like can be added to the English pronunciation dictionary library.
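As a rough illustration of how such dictionary libraries might be queried, the sketch below builds tiny in-memory dictionaries and looks up each token. The entries, the initial/final split of the Chinese pinyin, and the tone numbering are all assumptions made for the example, modeled loosely on the dictionary fragments shown above.

# Illustrative-only dictionary fragments; real pronunciation dictionary libraries are far larger.
CHINESE_DICT = {"你": ["n", "i3"], "好": ["h", "ao3"]}          # assumed entries
ENGLISH_DICT = {"SAM": ["S", "AE1", "M"],
                "ABILITY": ["AH0", "B", "IH1", "L", "AH0", "T", "IY0"]}

def lookup_phonemes(token):
    """Return (phonemes, language) for one token by querying each language's dictionary."""
    if all("\u4e00" <= ch <= "\u9fff" for ch in token):
        phonemes = []
        for ch in token:                      # Chinese tokens are looked up character by character
            phonemes.extend(CHINESE_DICT.get(ch, []))
        return phonemes, "zh"
    return ENGLISH_DICT.get(token.upper(), []), "en"

all_phonemes = []
for tok in ["你好", "Sam"]:
    phonemes, lang = lookup_phonemes(tok)
    all_phonemes.extend(phonemes)
print(all_phonemes)  # ['n', 'i3', 'h', 'ao3', 'S', 'AE1', 'M']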
For example, the target speech data may be analyzed and recognized by a pre-trained acoustic model, which may include Hidden Markov Model–Gaussian Mixture Model (HMM-GMM) and Deep Neural Network (DNN) models, to obtain the pronunciation period of each phoneme. For example, the target speech data may be framed to obtain a plurality of audio frames; the acoustic features of each audio frame are extracted and input into the pre-trained acoustic model, so that the acoustic model predicts the probabilities of candidate phonemes in each audio frame; the phoneme sequence corresponding to the target speech data is determined according to these probabilities and the phonemes obtained by recognizing the target text data; and the pronunciation period of each phoneme is obtained from its pronunciation start time and pronunciation end time in the audio sequence.
Among them, the extracted acoustic features may be Mel-Frequency Cepstral Coefficients (MFCC). Experiments on human auditory perception show that human hearing focuses on certain specific frequency regions rather than the whole spectrum, so MFCC features, which are designed around the characteristics of human hearing, are well suited to speech recognition scenarios.
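A minimal feature-extraction sketch is shown below. It assumes the librosa library for MFCC extraction and leaves the acoustic model itself out, showing only how per-frame phoneme labels could be collapsed into pronunciation periods; the frame and hop lengths are illustrative choices, not values fixed by the method.

import librosa  # assumed third-party library for MFCC extraction; not mandated by the method

def extract_mfcc(wav_path, sr=16000, frame_ms=25, hop_ms=10, n_mfcc=13):
    """Frame the audio and extract one MFCC vector per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(sr * frame_ms / 1000),
        hop_length=int(sr * hop_ms / 1000),
    )
    return mfcc.T  # shape: (num_frames, n_mfcc)

def phoneme_periods_from_frames(frame_labels, hop_ms=10):
    """Collapse per-frame phoneme labels (as produced by the acoustic model constrained by the
    text-derived candidate phonemes) into (phoneme, start_seconds, end_seconds) periods."""
    periods, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            periods.append((frame_labels[start], start * hop_ms / 1000.0, i * hop_ms / 1000.0))
            start = i
    return periods

print(phoneme_periods_from_frames(["n", "n", "i", "i", "i", "h"]))
# [('n', 0.0, 0.02), ('i', 0.02, 0.05), ('h', 0.05, 0.06)]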
The acoustic model can be trained on a mixed corpus, which may include corpora of multiple languages, such as a Chinese corpus, an English corpus, and a French corpus. The Chinese corpus can use the open-source corpus Aishell, which was recorded by 400 Chinese speakers from different dialect areas, has 16000 Hz audio and about 170 hours in total, and provides the text corresponding to each utterance. The English corpus can use the Ireland English Dialect Speech Data Set, which consists of English sentences recorded by volunteers with different dialects; its 48000 Hz audio is downsampled to 16000 Hz for training, and the text corresponding to each utterance is provided.
In a specific implementation, data can be extracted from the mixed corpus according to a first proportion to construct a training sample set, where each sample includes speech data and its corresponding text data. The phonemes in the text data of each training sample are obtained from the pronunciation dictionary libraries of the respective languages; acoustic features are extracted from the speech data of each training sample and recognized to obtain the pronunciation durations of the corresponding phonemes; the phonemes included in each training sample and their pronunciation durations are annotated as the label of that sample; the labeled training sample set is then used for model training, and the model is optimized via back-propagation of a loss function to obtain the acoustic model. Illustratively, the loss function may be a cross-entropy loss function.
In addition, data can be extracted from the mixed corpus according to a second proportion to construct a test sample set; the trained acoustic model is tested with this set, and if the test results meet the requirements, the model is put into use. The first and second proportions may be set according to actual needs or experience; for example, the first proportion may be set to 80% and the second proportion to 5%.
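A simple sketch of drawing the training and test sets by the first and second proportions might look as follows; the shuffling and the fixed seed are assumptions added for reproducibility, not part of the described method.

import random

def split_corpus(samples, train_ratio=0.8, test_ratio=0.05, seed=0):
    """Draw a training set (first proportion) and a test set (second proportion) from the mixed corpus."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

train_set, test_set = split_corpus([("wav_%d" % i, "text_%d" % i) for i in range(100)])
print(len(train_set), len(test_set))  # 80 5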
In a specific embodiment, the recognized pronunciation period of each phoneme may include a pronunciation start time and a pronunciation end time. Taking the target text data "hello" as an example, the phonemes obtained and their pronunciation periods may be as shown in Table 1 below:
TABLE 1
[Table 1 is reproduced as an image in the original publication; it lists each phoneme of "hello" together with its pronunciation start time and pronunciation end time, and its numerical values are not reproduced here.]
The data shown in table 1 is only an example, and does not ultimately limit the actual data processing.
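Since Table 1 is reproduced only as an image, a plain data-structure sketch of what such recognition output could look like is given below; the phoneme sequence follows the worked example later in the description, while the start and end times are placeholder values, not the table's actual data.

# Placeholder timings; only the phoneme order (n, in, h, ao) comes from the example in the description.
phoneme_periods = [
    {"phoneme": "n",  "start": 0.00, "end": 0.10},
    {"phoneme": "in", "start": 0.10, "end": 0.30},
    {"phoneme": "h",  "start": 0.30, "end": 0.40},
    {"phoneme": "ao", "start": 0.40, "end": 0.70},
]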
Step 103, determining the language to which each phoneme belongs.
For example, the pronunciation dictionary library of each language may be queried with each phoneme, and the language of the dictionary library that matches a phoneme is determined as the language to which that phoneme belongs; a dictionary library matches a phoneme if it contains that phoneme. For example, if a phoneme is contained in the Chinese pronunciation dictionary library, the language to which the phoneme belongs can be determined to be Chinese; likewise, if a phoneme is contained in the English pronunciation dictionary library, the language to which it belongs can be determined to be English.
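A sketch of this language lookup, with tiny illustrative phoneme sets standing in for the full dictionary libraries:

# Illustrative phoneme inventories only; real dictionary libraries contain the complete sets.
PHONEME_SETS = {
    "zh": {"n", "in", "h", "ao", "b", "p", "m"},
    "en": {"S", "AE1", "M", "L", "R"},
}

def language_of(phoneme):
    """Return the language whose pronunciation dictionary library contains the phoneme."""
    for lang, phonemes in PHONEME_SETS.items():
        if phoneme in phonemes:
            return lang
    return None

print(language_of("ao"), language_of("AE1"))  # zh en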
And 104, inquiring a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme.
Specifically, a mouth shape configuration table may be created in advance for each language; the table created for a language includes all phonemes of that language and the mouth shape corresponding to each phoneme. After the phonemes included in the target sentence are obtained, the mouth shape configuration table of the corresponding language can be queried according to the language of each phoneme, so as to obtain the mouth shape configured for that phoneme. For example, if the language to which a certain phoneme belongs is Chinese, the Chinese mouth shape configuration table may be queried to obtain the Chinese mouth shape configured for the phoneme; likewise, if the language to which a certain phoneme belongs is English, the English mouth shape configuration table may be queried to obtain the English mouth shape configured for the phoneme.
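A corresponding lookup sketch follows. The Chinese entries mirror the mouth shape numbers used in the worked example later in the description; the English entries and the table layout are assumptions made for illustration.

# Hypothetical per-language mouth shape configuration tables.
MOUTH_SHAPE_TABLES = {
    "zh": {"n": "mouth shape five", "in": "mouth shape eight",
           "h": "mouth shape six", "ao": "mouth shape three"},
    "en": {"AE1": "mouth shape A", "S": "mouth shape B"},   # made-up identifiers
}

def mouth_shape_for(phoneme, language):
    """Query the mouth shape configuration table of the phoneme's language."""
    return MOUTH_SHAPE_TABLES.get(language, {}).get(phoneme)

print(mouth_shape_for("ao", "zh"))  # mouth shape three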
And 105, driving the virtual character according to the corresponding mouth shape in the pronunciation period of each phoneme to generate mouth shape animation.
In the embodiment of the present invention, the speech data (i.e., the target speech data) and the text data (i.e., the target text data) of a target sentence composed of different languages can be analyzed and recognized to obtain each phoneme included in the target text data and the pronunciation period of each phoneme, and the language to which each phoneme belongs can be determined; the mouth shape configuration table of the language to which each phoneme belongs is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven according to the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation. In other words, the embodiment of the present invention can identify the phonemes of the different languages contained in the target text data and query the mouth shape configuration table of each language, so that each phoneme is given a mouth shape of its own language and the avatar is driven with mouth shapes of different languages. This makes the avatar's mouth shape changes richer and avoids the problems caused by substituting Chinese mouth shapes for those of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth shape changes, and stiff, unnatural expression; it thereby improves the degree of fit between the avatar's mouth shapes and the expressed sentence and makes the avatar's expression smoother and more natural.
The following describes a method for creating a mouth shape configuration table according to an embodiment of the present invention, as shown in fig. 2, the method may include the following steps:
step 201, collecting phonemes of each language, and determining a speech phoneme of each language.
Illustratively, the languages may include Chinese, English, French, Russian, and so on, and each language may include a plurality of phonemes, a phoneme being the smallest pronunciation unit of speech. For example, Chinese includes phonemes such as b, p, m, f, z, c, s, a, o, e, i, u, and lu, and English includes phonemes such as L, R, S, F, V, CH, SH, and ZH.
A speech viseme represents the facial and oral position when a word or phrase is spoken; it is the visual equivalent of a phoneme (the basic acoustic unit that forms words) and is the basic visual building block of speech. In each language, every phoneme has a corresponding speech viseme that represents the shape of the mouth when it is pronounced.
In step 202, phonemes with the same speech visemes in each language are categorized.
In a specific implementation, different phonemes may share the same speech viseme, and the phonemes of each language that have the same speech viseme can be grouped into one category. For example, among the phonemes of Chinese, the speech visemes of the phonemes in, ing, and ie are all in, so in, ing, and ie can be classified into one category; among the phonemes of English, the speech visemes of the phonemes AE and AY are both ai, so AE and AY can be classified into one category.
And step 203, classifying the phonemes in each language to configure the mouth shape for each type of phoneme so as to obtain a mouth shape configuration table created for each language.
The same mouth shape can be configured for phonemes that have the same speech viseme in a language. For example, the speech visemes of the Chinese phonemes in, ing, and ie are the same, so the same mouth shape can be configured for in, ing, and ie; the speech visemes of the English phonemes AE and AY are the same, so the same mouth shape can be configured for AE and AY. In a specific implementation, the configured mouth shapes can be produced by combining various blendshape deformers.
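A sketch of building such a table by grouping phonemes with the same speech viseme is given below; the phoneme-to-viseme pairs are taken from the examples in this section, and the numeric mouth shape identifiers are assigned arbitrarily for illustration.

from collections import defaultdict

# Phoneme -> viseme pairs from the examples above; everything beyond them is illustrative.
PHONEME_VISEMES = {
    "zh": {"in": "in", "ing": "in", "ie": "in"},
    "en": {"AE": "ai", "AY": "ai"},
}

def build_mouth_shape_tables(phoneme_visemes):
    """Group phonemes sharing a speech viseme and give each group one mouth shape identifier."""
    tables = {}
    for lang, mapping in phoneme_visemes.items():
        groups = defaultdict(list)
        for phoneme, viseme in mapping.items():
            groups[viseme].append(phoneme)
        table = {}
        for mouth_id, (viseme, phonemes) in enumerate(sorted(groups.items()), start=1):
            for phoneme in phonemes:
                table[phoneme] = {"mouth_id": mouth_id, "viseme": viseme}
        tables[lang] = table
    return tables

print(build_mouth_shape_tables(PHONEME_VISEMES)["zh"]["ing"])  # {'mouth_id': 1, 'viseme': 'in'}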
In a specific embodiment, taking the case where the languages include Chinese and English, the mouth shape configuration table created for Chinese may be as shown in FIG. 3, and the one created for English may be as shown in FIG. 4. The mouth shape configuration table created for each language may include a mouth shape identifier, the phoneme, the speech viseme, the mouth shape, and the like. Note that the phonemes, speech visemes, mouth shapes, and the like in FIG. 3 and FIG. 4 are merely examples and do not constitute a final limitation on the actual configuration.
The method for creating the mouth shape configuration table provided by the embodiment of the invention can simplify data processing and improve creation efficiency by classifying phonemes with the same speech viseme in each language and creating the mouth shape configuration table for each language according to categories.
The avatar driving method provided in the embodiment of the present invention is described below, as shown in fig. 5, that is, step 105 in fig. 1 may specifically include the following steps:
step 1051, determine the multi-dimensional state vector of the mouth shape configured for the current phoneme and determine the multi-dimensional state vector of the mouth shape configured for the previous phoneme.
Specifically, the previous phoneme may be a phoneme whose pronunciation period is before and adjacent to the pronunciation period of the current phoneme. For example, a phoneme 1, a phoneme 2 and a phoneme 3 are respectively arranged on a time axis according to the sequence of pronunciation time, and if the current phoneme is the phoneme 2, the previous phoneme of the current phoneme is the phoneme 1; if the current phoneme is phoneme 3, the previous phoneme of the current phoneme is phoneme 2. In addition, if the current phoneme is the first phoneme in the chronological order, the previous phoneme of the current phoneme may be considered as null, and the multi-dimensional state vector of the mouth shape of the previous phoneme may be 0.
Each mouth shape may be represented by a multidimensional state vector, which includes a state vector of multiple dimensions, each representing a state feature value of a portion constituting a mouth shape, such as an upper lip, a lower lip, a tongue tip, a tongue position, a tongue surface, and the like, and different mouth shapes have different multidimensional state vectors with different values.
Step 1052, calculating, by using the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme, to obtain the multi-dimensional state vectors at all moments within the pronunciation period of the current phoneme.
Illustratively, the easing function in the embodiment of the present invention may include an ease-out function, which may be as follows:
f(x_i) = -x_i^2 + 2x_i
where x_i denotes the i-th moment within the pronunciation period of the current phoneme, and f(x_i) denotes the vector change rate at the i-th moment within that period. For example, if the pronunciation duration corresponding to the pronunciation period of the current phoneme is 5 seconds, the i-th moment may be the 1st, 2nd, 3rd, 4th, or 5th second. It can be seen that the easing function provided by the embodiment of the invention describes a variable-speed motion: the change is fast at the beginning, which feels smooth, and then gradually decelerates, so the motion does not appear to stop abruptly.
For convenience of calculation, when performing the vector calculation, the pronunciation duration of the current phoneme may be determined from its pronunciation start time and pronunciation end time, each moment x_i within the pronunciation period of the current phoneme may be normalized according to that duration, and the normalized x_i may be substituted into the easing function for calculation.
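A direct transcription of the ease-out function and the normalization step, as one possible reading of the formula above:

def ease_out(x):
    """Ease-out function f(x) = -x^2 + 2x, applied to the normalized moment x in [0, 1]."""
    return -x * x + 2 * x

def normalize_moment(t, start, end):
    """Normalize an absolute moment t within the pronunciation period [start, end] to [0, 1]."""
    return (t - start) / (end - start)

# Example: the 2nd second of a 5-second pronunciation period starting at 0.
x = normalize_moment(2.0, 0.0, 5.0)   # 0.4
print(round(ease_out(x), 2))          # 0.64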
Specifically, the easing function can be used to calculate the vector change rate at each moment within the pronunciation period of the current phoneme; the vector difference, in each dimension, between the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme is calculated from their multi-dimensional state vectors; the vector of each dimension at each moment within the pronunciation period of the current phoneme is calculated from this per-dimension vector difference, the vector change rate at each moment, and the multi-dimensional state vector of the mouth shape configured for the previous phoneme; and the multi-dimensional state vector at each moment within the pronunciation period of the current phoneme is determined from the per-dimension vectors at that moment.
For example, the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme, and the multi-dimensional state vector of the mouth shape configured for the previous phoneme may be processed according to the following formulas to obtain the multi-dimensional state vectors at the respective moments within the pronunciation period of the current phoneme:
E_ij = -Δ_j · f(x_i) + s_j
E_i = (E_i1, E_i2, ..., E_ij)
where E_ij denotes the vector of the j-th dimension at the i-th moment within the pronunciation period of the current phoneme, Δ_j denotes the vector difference in the j-th dimension between the multi-dimensional state vector of the mouth shape configured for the current phoneme and that of the mouth shape configured for the previous phoneme, s_j denotes the vector of the j-th dimension in the multi-dimensional state vector of the mouth shape configured for the previous phoneme, and E_i denotes the multi-dimensional state vector at the i-th moment within the pronunciation period of the current phoneme.
By analyzing the formula E_ij = -Δ_j · f(x_i) + s_j, it can be seen that if Δ_j > 0, the vector changes as shown in (a) of FIG. 6, and if Δ_j < 0, the vector changes as shown in (b) of FIG. 6. Reflected in the animation, a mouth shape changes quickly when it starts to change and then transitions gradually to the next mouth shape, so this design avoids the stutter caused by a mouth shape stopping abruptly and gives better animation transitions.
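The sketch below interpolates the per-dimension state vectors from the previous mouth shape toward the current one. The sign convention is chosen so that the numbers reproduce the worked example later in the description (39.2, 27.2, 79.2); it is one consistent reading of the formula above, with Δ_j taken as the previous-minus-current difference, and is not the only possible reading.

def mouth_state_at(x, prev_vec, curr_vec):
    """Multi-dimensional state vector at normalized moment x, easing from the previous
    mouth shape's vector toward the current mouth shape's vector."""
    rate = -x * x + 2 * x                      # ease-out f(x)
    return [s + (c - s) * rate for s, c in zip(prev_vec, curr_vec)]

# Mouth shape six -> mouth shape three at the 2nd second of a 5-second period (x = 0.4).
print(mouth_state_at(0.4, [20, 40, 60], [50, 20, 90]))  # approximately [39.2, 27.2, 79.2]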
And 1053, providing the multidimensional state vector of each moment in the pronunciation time interval of the current phoneme to the deformer with the corresponding dimension, so as to drive the virtual image by using the deformer with the corresponding dimension, and generate the mouth shape animation.
The state vector of each dimension in the multi-dimensional state vector corresponds to one deformer, and each deformer is used for driving a corresponding part of the virtual image. For example, the multidimensional state vector includes three-dimensional state vectors of an upper lip, a lower lip and a tongue tip, the upper lip, the lower lip and the tongue tip respectively correspond to one deformer, when the multidimensional state vector at a certain moment is obtained through calculation, the state vector of the upper lip in the multidimensional state vector can be provided for the deformer corresponding to the upper lip, the state vector of the lower lip is provided for the deformer corresponding to the lower lip, and the state vector of the tongue tip is provided for the deformer corresponding to the tongue tip, so that the corresponding deformer drives the corresponding part of the virtual image according to the vector of the corresponding dimension, and the mouth shape animation is generated. In addition, when the mouth-shaped animation is generated, the mouth-shaped animation and the target voice data can be synchronously played, so that the animation effect of expressing the target voice data by the virtual image is presented.
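A stand-in sketch for feeding each dimension to its deformer follows; the BlendshapeDeformer class is a placeholder for the rendering engine's real blendshape interface, and the part names simply follow the example above.

class BlendshapeDeformer:
    """Placeholder for a single blendshape deformer driving one facial part of the avatar."""
    def __init__(self, part):
        self.part = part
        self.weight = 0.0
    def set_weight(self, value):
        self.weight = value
        print(f"{self.part} -> {value:.1f}")

deformers = [BlendshapeDeformer(part) for part in ("upper_lip", "lower_lip", "tongue_tip")]

def drive_avatar(state_vector, deformers):
    """Provide each dimension of the multi-dimensional state vector to the deformer of that dimension."""
    for value, deformer in zip(state_vector, deformers):
        deformer.set_weight(value)

drive_avatar([39.2, 27.2, 79.2], deformers)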
Calculating the multi-dimensional state vector at each moment within the pronunciation period of the current phoneme with the easing function provided by the embodiment of the invention, and driving the avatar accordingly, makes the avatar's mouth shape changes more realistic and natural and improves the animation display effect. In practical applications, other types of easing functions may also be used to calculate the multi-dimensional state vectors at the respective moments, for example a linear easing function, which is not limited here.
A specific example of the avatar driving method according to the embodiment of the present invention is described below. Take the target text data "hello" as an example, with the recognized phonemes and their pronunciation durations as shown in Table 1, all of them Chinese phonemes. As can be seen from FIG. 3, the mouth shape corresponding to the phoneme n is mouth shape five, the mouth shape corresponding to the phoneme in is mouth shape eight, the mouth shape corresponding to the phoneme h is mouth shape six, and the mouth shape corresponding to the phoneme ao is mouth shape three; that is, when the avatar expresses "hello", the mouth shape driving the avatar changes in sequence through mouth shape five, mouth shape eight, mouth shape six, and mouth shape three.
For example, suppose the current phoneme is ao, whose configured mouth shape is mouth shape three; the mouth shape configured for the previous phoneme is mouth shape six; the multi-dimensional state vector of mouth shape six is [20, 40, 60] and that of mouth shape three is [50, 20, 90]; and the pronunciation duration of the current phoneme is 5 seconds. Taking the calculation of the multi-dimensional state vector at the 2nd second of the pronunciation period of the current phoneme as an example, the steps are as follows:
the utterance moment (the 2nd second) is normalized: 2/5 = 0.4;
the state vector of the first dimension at the 2nd second is: -30 × 0.4 × (0.4 - 2) + 20 = 39.2;
the state vector of the second dimension at the 2nd second is: -(-20) × 0.4 × (0.4 - 2) + 40 = 27.2;
the state vector of the third dimension at the 2nd second is: -30 × 0.4 × (0.4 - 2) + 60 = 79.2;
that is, the multidimensional state vector of the 2 nd second in the pronunciation period of the phoneme ao is (39.2, 27.2, 79.2), and assuming that the dimensions corresponding to the multidimensional state vector are the upper lip, the lower lip and the tongue tip, respectively, 39.2 may be provided to the deformer corresponding to the upper lip, 27.2 may be provided to the deformer corresponding to the lower lip, and 79.2 may be provided to the deformer corresponding to the tongue tip, so that the corresponding deformer drives the corresponding portion of the avatar according to the vector of the corresponding dimension.
The other moments within the pronunciation period of the current phoneme, and all moments within the pronunciation periods of the other phonemes, are calculated in a similar way, so that the multi-dimensional state vector at every moment within the pronunciation period of every phoneme is obtained; the state vector of each dimension is provided to the deformer of that dimension, and the deformers drive the avatar in time order, thereby achieving the effect of the avatar expressing "hello".
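Putting the pieces together, a compact sketch of the per-phoneme driving loop is shown below; the timings, vectors, and frame rate are illustrative, and the resulting keyframes would be handed to the deformers as sketched above.

def generate_keyframes(phoneme_periods, mouth_vectors, fps=25):
    """For each phoneme, ease from the previous mouth shape to its own over its pronunciation period."""
    keyframes = []
    prev_vec = None
    for phoneme, start, end in phoneme_periods:
        curr_vec = mouth_vectors[phoneme]
        if prev_vec is None:
            prev_vec = [0.0] * len(curr_vec)    # no previous phoneme: start from the zero vector
        duration = end - start
        n_frames = max(1, int(duration * fps))
        for k in range(1, n_frames + 1):
            x = k / n_frames                    # normalized moment within the pronunciation period
            rate = -x * x + 2 * x               # ease-out f(x)
            vec = [s + (c - s) * rate for s, c in zip(prev_vec, curr_vec)]
            keyframes.append((start + k * duration / n_frames, vec))
        prev_vec = curr_vec
    return keyframes

# Illustrative input: two phonemes with made-up timings and mouth shape vectors.
periods = [("h", 0.0, 0.2), ("ao", 0.2, 0.7)]
vectors = {"h": [20, 40, 60], "ao": [50, 20, 90]}
for t, vec in generate_keyframes(periods, vectors, fps=5):
    print(round(t, 2), [round(v, 1) for v in vec])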
Fig. 7 is a block diagram of an animation generation apparatus according to an embodiment of the present invention, which is adapted to execute an animation generation method according to an embodiment of the present invention. As shown in fig. 7, the apparatus may specifically include:
an obtaining module 401, configured to obtain target speech data and target text data corresponding to the target speech data, where the target speech data includes speech data of different languages;
the recognition module 402 is configured to analyze and recognize the target text data to obtain each phoneme included in the target text data, and analyze and recognize the target voice data to obtain a pronunciation time period of each phoneme in each phoneme;
a determining module 403, configured to determine a language to which each phoneme belongs;
a query module 404, configured to query a mouth shape configuration table of a language to which each phoneme belongs, so as to obtain a mouth shape configured for each phoneme;
a generating module 405, configured to drive the avatar according to the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation.
In an embodiment, the generating module 405 is specifically configured to:
determining a multi-dimensional state vector of a mouth shape configured for a current phoneme, and determining a multi-dimensional state vector of a mouth shape configured for a previous phoneme, wherein the previous phoneme is a phoneme of which a pronunciation period is before and adjacent to a pronunciation period of the current phoneme;
calculating, by using an easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme, to obtain the multi-dimensional state vector at each moment within the pronunciation period of the current phoneme;
and providing the multidimensional state vectors of all moments in the pronunciation time interval of the current phoneme to deformers with corresponding dimensions, so as to drive the virtual image by using the deformers with the corresponding dimensions and generate the mouth shape animation.
In one embodiment, the easing function includes an ease-out function (ease-out).
In an embodiment, when calculating, by using the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme to obtain the multi-dimensional state vectors at the respective moments within the pronunciation period of the current phoneme, the generating module 405 is configured for:
calculating the vector change rate at each moment within the pronunciation period of the current phoneme by using the easing function;
calculating a vector difference of the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme in each dimension according to the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme;
calculating vectors of all dimensions at all times in the pronunciation period of the current phoneme according to the vector difference of the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme in all dimensions, the vector change rate of all times in the pronunciation period of the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme;
and determining the multidimensional state vector of each moment in the pronunciation time period of the current phoneme according to the vector of each dimension of each moment in the pronunciation time period of the current phoneme.
In one embodiment, the pronunciation period of the current phoneme includes a pronunciation start time and a pronunciation end time of the current phoneme, and the generating module 405 is further configured to, before calculating, by using the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme to obtain the multi-dimensional state vectors at the respective moments within the pronunciation period of the current phoneme:
determining the pronunciation duration of the current phoneme according to the pronunciation starting time and the pronunciation ending time of the current phoneme;
and carrying out normalization processing on each moment in the pronunciation time interval of the current phoneme according to the pronunciation duration of the current phoneme.
In an embodiment, the determining module 403 is specifically configured to:
inquiring a pronunciation dictionary library of each language based on each phoneme;
and determining the language corresponding to the pronunciation dictionary library matched with each phoneme as the language to which the corresponding phoneme belongs.
In one embodiment, the apparatus further comprises:
and a creating module, configured to collect the phonemes of each language, categorize the phonemes that have the same speech viseme in each language, and configure a mouth shape for each category of phonemes according to the categorization of the phonemes in each language, so as to obtain the mouth shape configuration table created for each language.
In one embodiment, the apparatus further comprises:
the preprocessing module is used for segmenting the target text data according to languages to obtain segmented text data; preprocessing the segmented text data to obtain text data to be recognized;
the identification module 402 analyzes and identifies the target text data to obtain each phoneme included in the target text data, including:
and analyzing and identifying the text data to be identified to obtain each phoneme included in the target text data.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the functional module, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
The device of the embodiment of the invention can analyze and recognize the speech data (i.e., the target speech data) and the text data (i.e., the target text data) of a target sentence composed of different languages, obtain each phoneme included in the target text data and the pronunciation period of each phoneme, and determine the language to which each phoneme belongs; the mouth shape configuration table of the language to which each phoneme belongs is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven according to the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation. In other words, the embodiment of the present invention can identify the phonemes of the different languages contained in the target text data and query the mouth shape configuration table of each language, so that each phoneme is given a mouth shape of its own language and the avatar is driven with mouth shapes of different languages. This makes the avatar's mouth shape changes richer and avoids the problems caused by substituting Chinese mouth shapes for those of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth shape changes, and stiff, unnatural expression; it thereby improves the degree of fit between the avatar's mouth shapes and the expressed sentence and makes the avatar's expression smoother and more natural.
The embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program that is stored in the memory and can be run on the processor, and when the processor executes the computer program, the animation generation method provided in any of the above embodiments is implemented.
The embodiment of the invention also provides a computer readable medium, on which a computer program is stored, and the program is executed by a processor to implement the animation generation method provided by any one of the above embodiments.
Referring now to FIG. 8, shown is a block diagram of a computer system 500 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 8, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units described in the embodiments of the present invention may be implemented by software, and may also be implemented by hardware. The described modules and/or units may also be provided in a processor, and may be described as: a processor includes an acquisition module, a recognition module, a determination module, a query module, and a generation module. Wherein the names of the modules do not in some cases constitute a limitation of the module itself.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise:
acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages; analyzing and identifying the target text data to obtain each phoneme included in the target text data, and analyzing and identifying the target voice data to obtain the pronunciation time interval of each phoneme in each phoneme; determining the language to which each phoneme belongs; inquiring a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme; and driving the virtual character according to the corresponding mouth shape in the pronunciation period of each phoneme to generate mouth shape animation.
According to the technical solution of the embodiment of the invention, the speech data (i.e., the target speech data) and the text data (i.e., the target text data) of a target sentence composed of different languages can be analyzed and recognized to obtain each phoneme included in the target text data and the pronunciation period of each phoneme, and the language to which each phoneme belongs can be determined; the mouth shape configuration table of the language to which each phoneme belongs is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven according to the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation. In other words, the embodiment of the present invention can identify the phonemes of the different languages contained in the target text data and query the mouth shape configuration table of each language, so that each phoneme is given a mouth shape of its own language and the avatar is driven with mouth shapes of different languages. This makes the avatar's mouth shape changes richer and avoids the problems caused by substituting Chinese mouth shapes for those of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth shape changes, and stiff, unnatural expression; it thereby improves the degree of fit between the avatar's mouth shapes and the expressed sentence and makes the avatar's expression smoother and more natural.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. An animation generation method, comprising:
acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
analyzing and identifying the target text data to obtain each phoneme included in the target text data, and analyzing and identifying the target voice data to obtain a pronunciation period of each of the phonemes;
determining the language to which each phoneme belongs;
querying a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme;
and driving an avatar according to the corresponding mouth shape within the pronunciation period of each phoneme to generate a mouth shape animation.
2. The animation generation method as claimed in claim 1, wherein the driving of the avatar in accordance with the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation comprises:
determining a multi-dimensional state vector of a mouth shape configured for a current phoneme, and determining a multi-dimensional state vector of a mouth shape configured for a previous phoneme, wherein the previous phoneme is a phoneme whose pronunciation period precedes and is adjacent to the pronunciation period of the current phoneme;
calculating the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme by using an easing function, so as to obtain the multi-dimensional state vector of each moment within the pronunciation period of the current phoneme;
and providing the multi-dimensional state vector of each moment within the pronunciation period of the current phoneme to deformers of the corresponding dimensions, so as to drive the avatar by using the deformers of the corresponding dimensions and generate the mouth shape animation.
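For illustration only, the sketch below shows how the per-moment mouth shape vectors recited in claim 2 might be handed to per-dimension deformers; the Deformer class and its apply method are hypothetical stand-ins for whatever rig or blend-shape interface an animation engine actually exposes.

```python
from typing import Iterable, Sequence

class Deformer:
    """Hypothetical stand-in for a deformer of one dimension
    (for example, one blend-shape channel of the avatar's face rig)."""
    def __init__(self, name: str) -> None:
        self.name = name

    def apply(self, weight: float) -> None:
        # A real engine would move the rig here; this sketch only logs the weight.
        print(f"{self.name} -> {weight:.3f}")

def feed_deformers(vectors_per_moment: Iterable[Sequence[float]],
                   deformers: Sequence[Deformer]) -> None:
    """Provide the multi-dimensional state vector of each moment in the
    pronunciation period to the deformer of the corresponding dimension."""
    for vector in vectors_per_moment:
        for deformer, weight in zip(deformers, vector):
            deformer.apply(float(weight))
```

The per-moment vectors themselves can be produced by an easing-function interpolation such as the one sketched after claim 4 below.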
3. The animation generation method as claimed in claim 2, wherein the easing function comprises an ease-out function (ease-out).
4. The animation generation method as claimed in claim 3, wherein the calculating the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme by using an easing function, so as to obtain the multi-dimensional state vector of each moment within the pronunciation period of the current phoneme, comprises:
calculating a vector change rate of each moment within the pronunciation period of the current phoneme by using the easing function;
calculating a vector difference, in each dimension, between the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme according to the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme;
calculating a vector of each dimension at each moment within the pronunciation period of the current phoneme according to the vector difference in each dimension, the vector change rate of each moment within the pronunciation period of the current phoneme, and the multi-dimensional state vector of the mouth shape configured for the previous phoneme;
and determining the multi-dimensional state vector of each moment within the pronunciation period of the current phoneme according to the vector of each dimension at each moment within the pronunciation period of the current phoneme.
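A minimal sketch, under assumed conventions, of the computation recited in claim 4: the easing function supplies a rate of change for each (normalized) moment, a per-dimension difference is taken between the two mouth shape vectors, and each moment's vector is the previous shape plus that difference scaled by the rate. The quadratic ease-out used here is only one common choice of easing curve, not necessarily the one used in the embodiments.

```python
import numpy as np

def ease_out(t: float) -> float:
    """A common ease-out curve: changes quickly at first, then slows down."""
    return 1.0 - (1.0 - t) ** 2

def mouth_vectors_for_period(prev_vec, cur_vec, normalized_moments):
    """Return one multi-dimensional state vector per moment of the current
    phoneme's pronunciation period, following the steps of claim 4."""
    prev_vec = np.asarray(prev_vec, dtype=float)
    cur_vec = np.asarray(cur_vec, dtype=float)
    diff = cur_vec - prev_vec                                    # vector difference in each dimension
    rates = np.array([ease_out(t) for t in normalized_moments])  # vector change rate per moment
    # previous mouth shape plus rate-scaled difference, one row per moment
    return prev_vec + rates[:, None] * diff
```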
5. The animation generation method as claimed in claim 3, wherein the pronunciation period of the current phoneme comprises a pronunciation start time and a pronunciation end time of the current phoneme, and before the calculating the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme by using an easing function, so as to obtain the multi-dimensional state vector of each moment within the pronunciation period of the current phoneme, the method further comprises:
determining a pronunciation duration of the current phoneme according to the pronunciation start time and the pronunciation end time of the current phoneme;
and normalizing each moment within the pronunciation period of the current phoneme according to the pronunciation duration of the current phoneme.
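For illustration, a one-function sketch of the normalization step of claim 5; combined with the claim 4 sketch above, the normalized moments become the input to the easing function. The sample times in the example are hypothetical values.

```python
def normalize_moments(moments, start_time, end_time):
    """Map each moment of the pronunciation period into [0, 1] using the
    pronunciation duration (end time minus start time)."""
    duration = end_time - start_time
    if duration <= 0:
        raise ValueError("the pronunciation end time must be later than the start time")
    return [(t - start_time) / duration for t in moments]

# Example (hypothetical values): a phoneme pronounced from 0.40 s to 0.52 s,
# sampled at three moments.
# normalize_moments([0.40, 0.46, 0.52], 0.40, 0.52) -> [0.0, 0.5, 1.0]
```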
6. The animation generation method as claimed in any one of claims 1 to 5, wherein the determining the language to which each phoneme belongs comprises:
querying a pronunciation dictionary library of each language based on each phoneme;
and determining the language corresponding to the pronunciation dictionary library that matches each phoneme as the language to which the corresponding phoneme belongs.
7. The animation generation method as claimed in any one of claims 1 to 5, wherein the mouth shape configuration table is created by:
collecting the phonemes of each language and determining the speech viseme of each phoneme of each language;
classifying, in each language, the phonemes having the same speech viseme into one class;
and configuring a mouth shape for each class of phonemes in each language, so as to obtain the mouth shape configuration table created for each language.
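As an illustrative sketch of the table-creation process of claim 7 (the viseme labels and mouth shape names below are hypothetical, and real viseme inventories differ by language):

```python
from collections import defaultdict

def build_mouth_shape_table(phoneme_to_viseme, viseme_to_mouth_shape):
    """Create a mouth shape configuration table for one language: phonemes
    sharing the same speech viseme form one class, and each class is
    configured with a single mouth shape."""
    classes = defaultdict(list)
    for phoneme, viseme in phoneme_to_viseme.items():
        classes[viseme].append(phoneme)              # classify by identical viseme
    table = {}
    for viseme, phonemes in classes.items():
        mouth_shape = viseme_to_mouth_shape[viseme]  # one mouth shape per class
        for phoneme in phonemes:
            table[phoneme] = mouth_shape
    return table

# Example with made-up data:
# build_mouth_shape_table({"P": "bilabial", "B": "bilabial", "AA": "open"},
#                         {"bilabial": "closed_lips", "open": "wide_open"})
# -> {"P": "closed_lips", "B": "closed_lips", "AA": "wide_open"}
```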
8. The animation generation method as claimed in any one of claims 1 to 5, further comprising, before analyzing and identifying the target text data:
segmenting the target text data according to languages to obtain segmented text data;
preprocessing the segmented text data to obtain text data to be recognized;
the analyzing and identifying the target text data to obtain each phoneme included in the target text data includes:
and analyzing and identifying the text data to be identified to obtain each phoneme included in the target text data.
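A rough, non-authoritative sketch of the segmentation and preprocessing steps of claim 8, assuming for simplicity that the mixed text contains only Chinese characters and Latin-script text; the regular expression and the trimming rule are illustrative assumptions, not the claimed preprocessing.

```python
import re

def segment_by_language(text: str):
    """Split mixed-language text into alternating runs of Chinese characters
    and runs of other characters, as a rough stand-in for segmenting the
    target text data by language."""
    return [m.group(0) for m in re.finditer(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+", text)]

def preprocess(segments):
    """Toy preprocessing: trim whitespace and drop empty segments to obtain
    the text data to be recognized."""
    return [s.strip() for s in segments if s.strip()]

# Example: preprocess(segment_by_language("你好 hello 世界"))
# -> ["你好", "hello", "世界"]
```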
9. An animation generation device, comprising:
the acquisition module is used for acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
the recognition module is used for analyzing and recognizing the target text data to obtain each phoneme included in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation period of each of the phonemes;
the determining module is used for determining the language to which each phoneme belongs;
the query module is used for querying a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme;
and the generation module is used for driving an avatar according to the corresponding mouth shape within the pronunciation period of each phoneme to generate a mouth shape animation.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the animation generation method as claimed in any one of claims 1 to 8 when executing the program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the animation generation method as claimed in any one of claims 1 to 8.
CN202110812403.6A 2021-07-19 2021-07-19 Animation generation method and device, electronic equipment and storage medium Pending CN113539240A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110812403.6A CN113539240A (en) 2021-07-19 2021-07-19 Animation generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110812403.6A CN113539240A (en) 2021-07-19 2021-07-19 Animation generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113539240A true CN113539240A (en) 2021-10-22

Family

ID=78128611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110812403.6A Pending CN113539240A (en) 2021-07-19 2021-07-19 Animation generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113539240A (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539354B1 (en) * 2000-03-24 2003-03-25 Fluent Speech Technologies, Inc. Methods and devices for producing and using synthetic visual speech based on natural coarticulation
KR20040076524A (en) * 2003-02-26 2004-09-01 주식회사 메세지 베이 아시아 Method to make animation character and System for Internet service using the animation character
KR20050026774A (en) * 2003-09-09 2005-03-16 정보통신연구진흥원 The lip sync controling method
CN1604072A (en) * 2004-11-09 2005-04-06 北京中星微电子有限公司 Karaoke broadcasting method for mobile terminals
KR20090051851A (en) * 2007-11-20 2009-05-25 경원대학교 산학협력단 Mobile communication terminal for outputting voice received text message to voice using avatar and method thereof
CN102637071A (en) * 2011-02-09 2012-08-15 英华达(上海)电子有限公司 Multimedia input method applied to multimedia input device
CN106504304A (en) * 2016-09-14 2017-03-15 厦门幻世网络科技有限公司 A kind of method and device of animation compound
CN107945253A (en) * 2017-11-21 2018-04-20 腾讯数码(天津)有限公司 A kind of animation effect implementation method, device and storage device
CN108447474A (en) * 2018-03-12 2018-08-24 北京灵伴未来科技有限公司 A kind of modeling and the control method of virtual portrait voice and Hp-synchronization
CN110853614A (en) * 2018-08-03 2020-02-28 Tcl集团股份有限公司 Virtual object mouth shape driving method and device and terminal equipment
CN110874557A (en) * 2018-09-03 2020-03-10 阿里巴巴集团控股有限公司 Video generation method and device for voice-driven virtual human face
WO2021023869A1 (en) * 2019-08-08 2021-02-11 Universite De Lorraine Audio-driven speech animation using recurrent neutral network
CN111260761A (en) * 2020-01-15 2020-06-09 北京猿力未来科技有限公司 Method and device for generating mouth shape of animation character
CN111915707A (en) * 2020-07-01 2020-11-10 天津洪恩完美未来教育科技有限公司 Mouth shape animation display method and device based on audio information and storage medium
CN112634861A (en) * 2020-12-30 2021-04-09 北京大米科技有限公司 Data processing method and device, electronic equipment and readable storage medium
CN112837401A (en) * 2021-01-27 2021-05-25 网易(杭州)网络有限公司 Information processing method and device, computer equipment and storage medium
CN112734889A (en) * 2021-02-19 2021-04-30 北京中科深智科技有限公司 Mouth shape animation real-time driving method and system for 2D character
CN113112575A (en) * 2021-04-08 2021-07-13 深圳市山水原创动漫文化有限公司 Mouth shape generation method and device, computer equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420088A (en) * 2022-01-20 2022-04-29 安徽淘云科技股份有限公司 Display method and related equipment thereof
CN114581567A (en) * 2022-05-06 2022-06-03 成都市谛视无限科技有限公司 Method, device and medium for driving mouth shape of virtual image by sound
CN114581567B (en) * 2022-05-06 2022-08-02 成都市谛视无限科技有限公司 Method, device and medium for driving mouth shape of virtual image by sound
CN115222856A (en) * 2022-05-20 2022-10-21 一点灵犀信息技术(广州)有限公司 Expression animation generation method and electronic equipment
CN115222856B (en) * 2022-05-20 2023-09-26 一点灵犀信息技术(广州)有限公司 Expression animation generation method and electronic equipment
CN115662388A (en) * 2022-10-27 2023-01-31 维沃移动通信有限公司 Avatar face driving method, apparatus, electronic device and medium
WO2024088321A1 (en) * 2022-10-27 2024-05-02 维沃移动通信有限公司 Virtual image face driving method and apparatus, electronic device and medium
CN117275485A (en) * 2023-11-22 2023-12-22 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium
CN117275485B (en) * 2023-11-22 2024-03-12 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination