CN113539240A - Animation generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113539240A
Authority
CN
China
Prior art keywords
phoneme
mouth shape
pronunciation
language
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110812403.6A
Other languages
Chinese (zh)
Inventor
王海新
杜峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202110812403.6A priority Critical patent/CN113539240A/en
Publication of CN113539240A publication Critical patent/CN113539240A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/005 — Language recognition
    • G10L21/00 — Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 — Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 — Transforming into visible information
    • G10L2021/105 — Synthesis of the lips movements from speech, e.g. for talking heads

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiment of the invention discloses an animation generation method and device, electronic equipment, and a storage medium. The animation generation method includes: acquiring target voice data and target text data corresponding to the target voice data, where the target voice data includes voice data of different languages; analyzing and recognizing the target text data to obtain each phoneme included in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation period of each phoneme; determining the language to which each phoneme belongs; querying the mouth shape configuration table of the language to which each phoneme belongs to obtain the mouth shape configured for each phoneme; and driving the avatar according to the corresponding mouth shape within the pronunciation period of each phoneme to generate a mouth shape animation. The embodiment of the invention can improve the degree of fit between the avatar's mouth shapes and the expressed sentence, so that the avatar's mouth shape changes are richer and its expression is smoother and more natural.

Description

Animation generation method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to computer technology, in particular to an animation generation method and device, electronic equipment and a storage medium.
Background
With the development of the live-streaming industry, major platforms have been rushing to release their own avatars, which speak interactive sentences to interact with users; for example, an avatar says "welcome xx" to welcome a user entering the live-streaming room, or explains product information to users in the room. When an avatar is used to interact with users, the interactive sentences usually contain, besides Chinese, words in other languages (such as an English user name or an English product name). For the other-language words in an interactive sentence, the common practice at present is to replace their mouth shapes with Chinese mouth shapes and drive the avatar accordingly.
In the process of implementing the invention, the inventors found that driving the avatar by substituting Chinese mouth shapes for the mouth shapes of other languages leads to problems such as mouth shapes that do not match the interactive sentence, monotonous mouth shape changes, and stiff, unnatural expression.
Disclosure of Invention
The embodiment of the invention provides an animation generation method and device, electronic equipment, and a storage medium, which can improve the degree of fit between the avatar's mouth shapes and the expressed sentence, so that the avatar's mouth shape changes are richer and its expression is smoother and more natural.
In a first aspect, an embodiment of the present invention provides an animation generation method, including:
acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
analyzing and identifying the target text data to obtain each phoneme included in the target text data, and analyzing and identifying the target voice data to obtain a pronunciation time period of each phoneme in each phoneme;
determining the language to which each phoneme belongs;
inquiring a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme;
and driving the virtual character according to the corresponding mouth shape in the pronunciation period of each phoneme to generate mouth shape animation.
In a second aspect, an embodiment of the present invention provides an animation generation apparatus, including:
the acquisition module is used for acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
the recognition module is used for analyzing and recognizing the target text data to obtain each phoneme included in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation time period of each phoneme in each phoneme;
the determining module is used for determining the language to which each phoneme belongs;
the query module is used for querying a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme;
and the generating module is used for driving the virtual image according to the corresponding mouth shape in the pronunciation time interval of each phoneme so as to generate mouth shape animation.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the animation generation method according to any one of the embodiments of the present invention.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the animation generation method according to any one of the embodiments of the present invention.
In the embodiment of the present invention, the speech data (i.e., the target speech data) and the text data (i.e., the target text data) of a target sentence composed of different languages can be analyzed and recognized to obtain each phoneme included in the target text data and the pronunciation period of each phoneme, and the language to which each phoneme belongs can be determined; the mouth shape configuration table of the language to which each phoneme belongs is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven according to the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation. In other words, the embodiment of the present invention can identify the phonemes of the different languages contained in the target text data and query the mouth shape configuration table of each language, so that each phoneme is given a mouth shape of its own language and the avatar is driven with mouth shapes of different languages. This makes the avatar's mouth shape changes richer and avoids the problems caused by substituting Chinese mouth shapes for those of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth shape changes, and stiff, unnatural expression; it thereby improves the degree of fit between the avatar's mouth shapes and the expressed sentence and makes the avatar's expression smoother and more natural.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flowchart of an animation generation method according to an embodiment of the present invention.
Fig. 2 is a schematic flowchart of a method for creating a mouth shape configuration table according to an embodiment of the present invention.
Fig. 3 is a diagram of an example of a mouth shape configuration table provided by an embodiment of the invention.
Fig. 4 is another exemplary diagram of a mouth shape configuration table provided by the embodiment of the invention.
Fig. 5 is a flowchart illustrating a method for driving an avatar according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a vector variation law provided by an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of an animation generation apparatus according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Fig. 1 is a schematic flowchart of an animation generating method according to an embodiment of the present invention, which may be implemented by an animation generating apparatus according to an embodiment of the present invention, and the apparatus may be implemented in software and/or hardware. In a particular embodiment, the apparatus may be integrated in an electronic device, which may be, for example, a computer. The following embodiments will be described by taking as an example that the apparatus is integrated in an electronic device, and referring to fig. 1, the method may specifically include the following steps:
step 101, acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages.
Illustratively, taking the case where the avatar needs to be driven to express a target sentence, the target speech data can be understood as the target sentence expressed in the form of speech, and the target text data as the target sentence expressed in the form of text; the target sentence may be composed of different languages such as Chinese, English, French, and Russian. A specific target sentence (represented by its text data) is, for example, "hello Sam", i.e., a target sentence composed of Chinese and English.
In a specific implementation, the avatar to be driven may be a virtual character; the specific character is not limited, and the avatar may be two-dimensional or three-dimensional. Specifically, the target voice data and the target text data can be obtained from real-time user input: for example, speech uttered by the user can be picked up through a microphone to obtain the target voice data, and text provided by the user can be acquired through a keyboard or a screen to obtain the corresponding target text data. Alternatively, preset target voice data and the corresponding target text data can be acquired; or the target voice data can be obtained first and then converted to obtain the corresponding target text data.
Specifically, after the target text data is obtained, it may be segmented by language to obtain segmented text data; for example, when the target text data contains both Chinese and English, the Chinese and English parts can be separated. For example, when the target text data is "hello Sam", it may be divided into "hello" and "Sam". After segmentation, the segmented text data is preprocessed to obtain the text data to be recognized, so as to facilitate subsequent recognition. The preprocessing may include, but is not limited to, converting special symbols into words of the corresponding language and splitting contractions; for example, the special symbol "&" may be converted to "and", the special symbol "#" may be converted to "well", and the contraction "what's" may be split into "what is". A minimal sketch along these lines is given below.
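For illustration only, the following Python sketch segments a mixed-language sentence and applies the preprocessing described above. The splitting rule (runs of Han characters versus runs of Latin characters), the symbol map, and the contraction table are assumptions made for the example, not the exact implementation of the method.

import re

# Hypothetical conversion tables; the concrete entries merely mirror the examples in the text.
SPECIAL_SYMBOLS = {"&": " and ", "#": " well "}
CONTRACTIONS = {"what's": "what is"}

def segment_by_language(text):
    # Split the sentence into runs of Han characters and runs of Latin/symbol characters,
    # e.g. a Chinese greeting followed by "Sam" becomes two segments.
    return re.findall("[\u4e00-\u9fff]+|[A-Za-z'&#]+", text)

def preprocess(segments):
    # Convert special symbols into words, split contractions, and return the tokens to recognize.
    tokens = []
    for seg in segments:
        for symbol, word in SPECIAL_SYMBOLS.items():
            seg = seg.replace(symbol, word)
        for full, expanded in CONTRACTIONS.items():
            seg = seg.replace(full, expanded)
        tokens.extend(seg.split())
    return tokens

print(preprocess(segment_by_language("你好Sam")))  # ['你好', 'Sam']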
Step 102, analyzing and recognizing the target text data to obtain each phoneme included in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation time interval of each phoneme in each phoneme.
Specifically, the preprocessed target text data, i.e., the text data to be recognized, may be analyzed and recognized to obtain each phoneme included in the target text data. In a specific implementation, pre-established pronunciation dictionary libraries of the respective languages can be used to analyze and recognize the text data to be recognized, so as to obtain each phoneme included in the target text data. For example, the pronunciation dictionary library of the corresponding language may be searched for each character or word in the text data to be recognized, thereby obtaining each phoneme.
Illustratively, the pronunciation dictionary libraries of the respective languages include a Chinese pronunciation dictionary library, an English pronunciation dictionary library, a French pronunciation dictionary library, and the like. The Chinese pronunciation dictionary library may include each Chinese character and its pronunciation (phonemes), and the English pronunciation dictionary library may include each word and its pronunciation (phonemes).
In one specific embodiment, part of the data in the Chinese pronunciation dictionary library may be as follows, wherein the Arabic numerals represent the pronunciation tones of the Chinese phonemes:
through (穿过): ch uan1 g uo4
shuttle (穿梭): ch uan1 s uo1
wearing (穿着): ch uan1 zh uo2
pass (传): ch uan2
In one specific embodiment, part of the data in the English pronunciation dictionary library may be as follows, wherein the Arabic numerals represent the lexical stress of the English phonemes:
ABILENE AE1 B IH0 L IY2 N
ABILITIES AH0 B IH1 L AH0 T IY0 Z
ABILITY AH0 B IH1 L AH0 T IY0
ABIMELECH AE0 B IH0 M EH0 L EH1 K
ABINADAB AH0 B AY1 N AH0 D AE1 B
ABINGDON AE1 B IH0 NG D AH0 N
ABINGDON'S AE1 B IH0 NG D AH0 N Z
ABINGER AE1 B IH0 NG ER0
In addition, to broaden the application scenarios of the invention, English brand names, English personal names, and the like can be added to the English pronunciation dictionary library.
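As a rough illustration of how such dictionary libraries might be queried, the sketch below builds tiny in-memory dictionaries and looks up each token. The entries, the initial/final split of the Chinese pinyin, and the tone numbering are all assumptions made for the example, modeled loosely on the dictionary fragments shown above.

# Illustrative-only dictionary fragments; real pronunciation dictionary libraries are far larger.
CHINESE_DICT = {"你": ["n", "i3"], "好": ["h", "ao3"]}          # assumed entries
ENGLISH_DICT = {"SAM": ["S", "AE1", "M"],
                "ABILITY": ["AH0", "B", "IH1", "L", "AH0", "T", "IY0"]}

def lookup_phonemes(token):
    """Return (phonemes, language) for one token by querying each language's dictionary."""
    if all("\u4e00" <= ch <= "\u9fff" for ch in token):
        phonemes = []
        for ch in token:                      # Chinese tokens are looked up character by character
            phonemes.extend(CHINESE_DICT.get(ch, []))
        return phonemes, "zh"
    return ENGLISH_DICT.get(token.upper(), []), "en"

all_phonemes = []
for tok in ["你好", "Sam"]:
    phonemes, lang = lookup_phonemes(tok)
    all_phonemes.extend(phonemes)
print(all_phonemes)  # ['n', 'i3', 'h', 'ao3', 'S', 'AE1', 'M']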
For example, the target speech data may be analyzed and recognized by a pre-trained acoustic model, which may include Hidden Markov Model–Gaussian Mixture Model (HMM-GMM) and Deep Neural Network (DNN) models, to obtain the pronunciation period of each phoneme. For example, the target speech data may be framed to obtain a plurality of audio frames; the acoustic features of each audio frame are extracted and input into the pre-trained acoustic model, so that the acoustic model predicts the probabilities of candidate phonemes in each audio frame; the phoneme sequence corresponding to the target speech data is determined according to these probabilities and the phonemes obtained by recognizing the target text data; and the pronunciation period of each phoneme is obtained from its pronunciation start time and pronunciation end time in the audio sequence.
Among them, the extracted acoustic features may be Mel-Frequency Cepstral Coefficients (MFCC). Experiments on human auditory perception show that human hearing focuses on certain specific frequency regions rather than the whole spectrum, so MFCC features, which are designed around the characteristics of human hearing, are well suited to speech recognition scenarios.
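A minimal feature-extraction sketch is shown below. It assumes the librosa library for MFCC extraction and leaves the acoustic model itself out, showing only how per-frame phoneme labels could be collapsed into pronunciation periods; the frame and hop lengths are illustrative choices, not values fixed by the method.

import librosa  # assumed third-party library for MFCC extraction; not mandated by the method

def extract_mfcc(wav_path, sr=16000, frame_ms=25, hop_ms=10, n_mfcc=13):
    """Frame the audio and extract one MFCC vector per frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(sr * frame_ms / 1000),
        hop_length=int(sr * hop_ms / 1000),
    )
    return mfcc.T  # shape: (num_frames, n_mfcc)

def phoneme_periods_from_frames(frame_labels, hop_ms=10):
    """Collapse per-frame phoneme labels (as produced by the acoustic model constrained by the
    text-derived candidate phonemes) into (phoneme, start_seconds, end_seconds) periods."""
    periods, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            periods.append((frame_labels[start], start * hop_ms / 1000.0, i * hop_ms / 1000.0))
            start = i
    return periods

print(phoneme_periods_from_frames(["n", "n", "i", "i", "i", "h"]))
# [('n', 0.0, 0.02), ('i', 0.02, 0.05), ('h', 0.05, 0.06)]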
The acoustic model can be trained on a mixed corpus, which may include corpora of multiple languages, such as a Chinese corpus, an English corpus, and a French corpus. The Chinese corpus can use the open-source corpus Aishell, which was recorded by 400 Chinese speakers from different dialect areas, has 16000 Hz audio and about 170 hours in total, and provides the text corresponding to each utterance. The English corpus can use the Ireland English Dialect Speech Data Set, which consists of English sentences recorded by volunteers with different dialects; its 48000 Hz audio is downsampled to 16000 Hz for training, and the text corresponding to each utterance is provided.
In a specific implementation, data can be extracted from the mixed corpus according to a first proportion to construct a training sample set, where each sample includes speech data and its corresponding text data. The phonemes in the text data of each training sample are obtained from the pronunciation dictionary libraries of the respective languages; acoustic features are extracted from the speech data of each training sample and recognized to obtain the pronunciation durations of the corresponding phonemes; the phonemes included in each training sample and their pronunciation durations are annotated as the label of that sample; the labeled training sample set is then used for model training, and the model is optimized via back-propagation of a loss function to obtain the acoustic model. Illustratively, the loss function may be a cross-entropy loss function.
In addition, data can be extracted from the mixed corpus according to a second proportion to construct a test sample set; the trained acoustic model is tested with this set, and if the test results meet the requirements, the model is put into use. The first and second proportions may be set according to actual needs or experience; for example, the first proportion may be set to 80% and the second proportion to 5%.
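A simple sketch of drawing the training and test sets by the first and second proportions might look as follows; the shuffling and the fixed seed are assumptions added for reproducibility, not part of the described method.

import random

def split_corpus(samples, train_ratio=0.8, test_ratio=0.05, seed=0):
    """Draw a training set (first proportion) and a test set (second proportion) from the mixed corpus."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

train_set, test_set = split_corpus([("wav_%d" % i, "text_%d" % i) for i in range(100)])
print(len(train_set), len(test_set))  # 80 5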
In a specific embodiment, the recognized pronunciation period of each phoneme may include a pronunciation start time and a pronunciation end time. Taking the target text data "hello" as an example, the phonemes obtained and their pronunciation periods may be as shown in Table 1 below:
TABLE 1
[Table 1 is reproduced as an image in the original publication; it lists each phoneme of "hello" together with its pronunciation start time and pronunciation end time, and its numerical values are not reproduced here.]
The data shown in table 1 is only an example, and does not ultimately limit the actual data processing.
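Since Table 1 is reproduced only as an image, a plain data-structure sketch of what such recognition output could look like is given below; the phoneme sequence follows the worked example later in the description, while the start and end times are placeholder values, not the table's actual data.

# Placeholder timings; only the phoneme order (n, in, h, ao) comes from the example in the description.
phoneme_periods = [
    {"phoneme": "n",  "start": 0.00, "end": 0.10},
    {"phoneme": "in", "start": 0.10, "end": 0.30},
    {"phoneme": "h",  "start": 0.30, "end": 0.40},
    {"phoneme": "ao", "start": 0.40, "end": 0.70},
]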
Step 103, determining the language to which each phoneme belongs.
For example, the pronunciation dictionary library of each language may be queried with each phoneme, and the language of the dictionary library that matches a phoneme is determined as the language to which that phoneme belongs; a dictionary library matches a phoneme if it contains that phoneme. For example, if a phoneme is contained in the Chinese pronunciation dictionary library, the language to which the phoneme belongs can be determined to be Chinese; likewise, if a phoneme is contained in the English pronunciation dictionary library, the language to which it belongs can be determined to be English.
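A sketch of this language lookup, with tiny illustrative phoneme sets standing in for the full dictionary libraries:

# Illustrative phoneme inventories only; real dictionary libraries contain the complete sets.
PHONEME_SETS = {
    "zh": {"n", "in", "h", "ao", "b", "p", "m"},
    "en": {"S", "AE1", "M", "L", "R"},
}

def language_of(phoneme):
    """Return the language whose pronunciation dictionary library contains the phoneme."""
    for lang, phonemes in PHONEME_SETS.items():
        if phoneme in phonemes:
            return lang
    return None

print(language_of("ao"), language_of("AE1"))  # zh en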
And 104, inquiring a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme.
Specifically, a mouth shape configuration table may be created in advance for each language; the table created for a language includes all phonemes of that language and the mouth shape corresponding to each phoneme. After the phonemes included in the target sentence are obtained, the mouth shape configuration table of the corresponding language can be queried according to the language of each phoneme, so as to obtain the mouth shape configured for that phoneme. For example, if the language to which a certain phoneme belongs is Chinese, the Chinese mouth shape configuration table may be queried to obtain the Chinese mouth shape configured for the phoneme; likewise, if the language to which a certain phoneme belongs is English, the English mouth shape configuration table may be queried to obtain the English mouth shape configured for the phoneme.
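A corresponding lookup sketch follows. The Chinese entries mirror the mouth shape numbers used in the worked example later in the description; the English entries and the table layout are assumptions made for illustration.

# Hypothetical per-language mouth shape configuration tables.
MOUTH_SHAPE_TABLES = {
    "zh": {"n": "mouth shape five", "in": "mouth shape eight",
           "h": "mouth shape six", "ao": "mouth shape three"},
    "en": {"AE1": "mouth shape A", "S": "mouth shape B"},   # made-up identifiers
}

def mouth_shape_for(phoneme, language):
    """Query the mouth shape configuration table of the phoneme's language."""
    return MOUTH_SHAPE_TABLES.get(language, {}).get(phoneme)

print(mouth_shape_for("ao", "zh"))  # mouth shape three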
And 105, driving the virtual character according to the corresponding mouth shape in the pronunciation period of each phoneme to generate mouth shape animation.
In the embodiment of the present invention, the speech data (i.e., the target speech data) and the text data (i.e., the target text data) of a target sentence composed of different languages can be analyzed and recognized to obtain each phoneme included in the target text data and the pronunciation period of each phoneme, and the language to which each phoneme belongs can be determined; the mouth shape configuration table of the language to which each phoneme belongs is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven according to the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation. In other words, the embodiment of the present invention can identify the phonemes of the different languages contained in the target text data and query the mouth shape configuration table of each language, so that each phoneme is given a mouth shape of its own language and the avatar is driven with mouth shapes of different languages. This makes the avatar's mouth shape changes richer and avoids the problems caused by substituting Chinese mouth shapes for those of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth shape changes, and stiff, unnatural expression; it thereby improves the degree of fit between the avatar's mouth shapes and the expressed sentence and makes the avatar's expression smoother and more natural.
The following describes a method for creating a mouth shape configuration table according to an embodiment of the present invention, as shown in fig. 2, the method may include the following steps:
step 201, collecting phonemes of each language, and determining a speech phoneme of each language.
Illustratively, the languages may include Chinese, English, French, Russian, and so on, and each language may include a plurality of phonemes, a phoneme being the smallest pronunciation unit of speech. For example, Chinese includes phonemes such as b, p, m, f, z, c, s, a, o, e, i, u, and lu, and English includes phonemes such as L, R, S, F, V, CH, SH, and ZH.
A speech viseme represents the facial and oral position when a word or phrase is spoken; it is the visual equivalent of a phoneme (the basic acoustic unit that forms words) and is the basic visual building block of speech. In each language, every phoneme has a corresponding speech viseme that represents the shape of the mouth when it is pronounced.
In step 202, phonemes with the same speech visemes in each language are categorized.
In a specific implementation, different phonemes may share the same speech viseme, and the phonemes of each language that have the same speech viseme can be grouped into one category. For example, among the phonemes of Chinese, the speech visemes of the phonemes in, ing, and ie are all in, so in, ing, and ie can be classified into one category; among the phonemes of English, the speech visemes of the phonemes AE and AY are both ai, so AE and AY can be classified into one category.
And step 203, classifying the phonemes in each language to configure the mouth shape for each type of phoneme so as to obtain a mouth shape configuration table created for each language.
The same mouth shape can be configured for phonemes that have the same speech viseme in a language. For example, the speech visemes of the Chinese phonemes in, ing, and ie are the same, so the same mouth shape can be configured for in, ing, and ie; the speech visemes of the English phonemes AE and AY are the same, so the same mouth shape can be configured for AE and AY. In a specific implementation, the configured mouth shapes can be produced by combining various blendshape deformers.
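A sketch of building such a table by grouping phonemes with the same speech viseme is given below; the phoneme-to-viseme pairs are taken from the examples in this section, and the numeric mouth shape identifiers are assigned arbitrarily for illustration.

from collections import defaultdict

# Phoneme -> viseme pairs from the examples above; everything beyond them is illustrative.
PHONEME_VISEMES = {
    "zh": {"in": "in", "ing": "in", "ie": "in"},
    "en": {"AE": "ai", "AY": "ai"},
}

def build_mouth_shape_tables(phoneme_visemes):
    """Group phonemes sharing a speech viseme and give each group one mouth shape identifier."""
    tables = {}
    for lang, mapping in phoneme_visemes.items():
        groups = defaultdict(list)
        for phoneme, viseme in mapping.items():
            groups[viseme].append(phoneme)
        table = {}
        for mouth_id, (viseme, phonemes) in enumerate(sorted(groups.items()), start=1):
            for phoneme in phonemes:
                table[phoneme] = {"mouth_id": mouth_id, "viseme": viseme}
        tables[lang] = table
    return tables

print(build_mouth_shape_tables(PHONEME_VISEMES)["zh"]["ing"])  # {'mouth_id': 1, 'viseme': 'in'}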
In a specific embodiment, taking the case where the languages include Chinese and English, the mouth shape configuration table created for Chinese may be as shown in FIG. 3, and the one created for English may be as shown in FIG. 4. The mouth shape configuration table created for each language may include a mouth shape identifier, the phoneme, the speech viseme, the mouth shape, and the like. Note that the phonemes, speech visemes, mouth shapes, and the like in FIG. 3 and FIG. 4 are merely examples and do not constitute a final limitation on the actual configuration.
The method for creating the mouth shape configuration table provided by the embodiment of the invention can simplify data processing and improve creation efficiency by classifying phonemes with the same speech viseme in each language and creating the mouth shape configuration table for each language according to categories.
The avatar driving method provided in the embodiment of the present invention is described below, as shown in fig. 5, that is, step 105 in fig. 1 may specifically include the following steps:
step 1051, determine the multi-dimensional state vector of the mouth shape configured for the current phoneme and determine the multi-dimensional state vector of the mouth shape configured for the previous phoneme.
Specifically, the previous phoneme may be a phoneme whose pronunciation period is before and adjacent to the pronunciation period of the current phoneme. For example, a phoneme 1, a phoneme 2 and a phoneme 3 are respectively arranged on a time axis according to the sequence of pronunciation time, and if the current phoneme is the phoneme 2, the previous phoneme of the current phoneme is the phoneme 1; if the current phoneme is phoneme 3, the previous phoneme of the current phoneme is phoneme 2. In addition, if the current phoneme is the first phoneme in the chronological order, the previous phoneme of the current phoneme may be considered as null, and the multi-dimensional state vector of the mouth shape of the previous phoneme may be 0.
Each mouth shape may be represented by a multidimensional state vector, which includes a state vector of multiple dimensions, each representing a state feature value of a portion constituting a mouth shape, such as an upper lip, a lower lip, a tongue tip, a tongue position, a tongue surface, and the like, and different mouth shapes have different multidimensional state vectors with different values.
Step 1052, calculating, by using the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme, to obtain the multi-dimensional state vectors at all moments within the pronunciation period of the current phoneme.
Illustratively, the easing function in the embodiment of the present invention may include an ease-out function, which may be as follows:
f(x_i) = -x_i^2 + 2x_i
where x_i denotes the i-th moment within the pronunciation period of the current phoneme, and f(x_i) denotes the vector change rate at the i-th moment within that period. For example, if the pronunciation duration corresponding to the pronunciation period of the current phoneme is 5 seconds, the i-th moment may be the 1st, 2nd, 3rd, 4th, or 5th second. It can be seen that the easing function provided by the embodiment of the invention describes a variable-speed motion: the change is fast at the beginning, which feels smooth, and then gradually decelerates, so the motion does not appear to stop abruptly.
For convenience of calculation, when performing the vector calculation, the pronunciation duration of the current phoneme may be determined from its pronunciation start time and pronunciation end time, each moment x_i within the pronunciation period of the current phoneme may be normalized according to that duration, and the normalized x_i may be substituted into the easing function for calculation.
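A direct transcription of the ease-out function and the normalization step, as one possible reading of the formula above:

def ease_out(x):
    """Ease-out function f(x) = -x^2 + 2x, applied to the normalized moment x in [0, 1]."""
    return -x * x + 2 * x

def normalize_moment(t, start, end):
    """Normalize an absolute moment t within the pronunciation period [start, end] to [0, 1]."""
    return (t - start) / (end - start)

# Example: the 2nd second of a 5-second pronunciation period starting at 0.
x = normalize_moment(2.0, 0.0, 5.0)   # 0.4
print(round(ease_out(x), 2))          # 0.64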
Specifically, the easing function can be used to calculate the vector change rate at each moment within the pronunciation period of the current phoneme; the vector difference, in each dimension, between the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme is calculated from their multi-dimensional state vectors; the vector of each dimension at each moment within the pronunciation period of the current phoneme is calculated from this per-dimension vector difference, the vector change rate at each moment, and the multi-dimensional state vector of the mouth shape configured for the previous phoneme; and the multi-dimensional state vector at each moment within the pronunciation period of the current phoneme is determined from the per-dimension vectors at that moment.
For example, the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme, and the multi-dimensional state vector of the mouth shape configured for the previous phoneme may be processed according to the following formulas to obtain the multi-dimensional state vectors at the respective moments within the pronunciation period of the current phoneme:
E_ij = -Δ_j · f(x_i) + s_j
E_i = (E_i1, E_i2, ..., E_ij)
where E_ij denotes the vector of the j-th dimension at the i-th moment within the pronunciation period of the current phoneme, Δ_j denotes the vector difference in the j-th dimension between the multi-dimensional state vector of the mouth shape configured for the current phoneme and that of the mouth shape configured for the previous phoneme, s_j denotes the vector of the j-th dimension in the multi-dimensional state vector of the mouth shape configured for the previous phoneme, and E_i denotes the multi-dimensional state vector at the i-th moment within the pronunciation period of the current phoneme.
By analyzing the formula E_ij = -Δ_j · f(x_i) + s_j, it can be seen that if Δ_j > 0, the vector changes as shown in (a) of FIG. 6, and if Δ_j < 0, the vector changes as shown in (b) of FIG. 6. Reflected in the animation, a mouth shape changes quickly when it starts to change and then transitions gradually to the next mouth shape, so this design avoids the stutter caused by a mouth shape stopping abruptly and gives better animation transitions.
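The sketch below interpolates the per-dimension state vectors from the previous mouth shape toward the current one. The sign convention is chosen so that the numbers reproduce the worked example later in the description (39.2, 27.2, 79.2); it is one consistent reading of the formula above, with Δ_j taken as the previous-minus-current difference, and is not the only possible reading.

def mouth_state_at(x, prev_vec, curr_vec):
    """Multi-dimensional state vector at normalized moment x, easing from the previous
    mouth shape's vector toward the current mouth shape's vector."""
    rate = -x * x + 2 * x                      # ease-out f(x)
    return [s + (c - s) * rate for s, c in zip(prev_vec, curr_vec)]

# Mouth shape six -> mouth shape three at the 2nd second of a 5-second period (x = 0.4).
print(mouth_state_at(0.4, [20, 40, 60], [50, 20, 90]))  # approximately [39.2, 27.2, 79.2]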
And 1053, providing the multidimensional state vector of each moment in the pronunciation time interval of the current phoneme to the deformer with the corresponding dimension, so as to drive the virtual image by using the deformer with the corresponding dimension, and generate the mouth shape animation.
The state vector of each dimension in the multi-dimensional state vector corresponds to one deformer, and each deformer is used for driving a corresponding part of the virtual image. For example, the multidimensional state vector includes three-dimensional state vectors of an upper lip, a lower lip and a tongue tip, the upper lip, the lower lip and the tongue tip respectively correspond to one deformer, when the multidimensional state vector at a certain moment is obtained through calculation, the state vector of the upper lip in the multidimensional state vector can be provided for the deformer corresponding to the upper lip, the state vector of the lower lip is provided for the deformer corresponding to the lower lip, and the state vector of the tongue tip is provided for the deformer corresponding to the tongue tip, so that the corresponding deformer drives the corresponding part of the virtual image according to the vector of the corresponding dimension, and the mouth shape animation is generated. In addition, when the mouth-shaped animation is generated, the mouth-shaped animation and the target voice data can be synchronously played, so that the animation effect of expressing the target voice data by the virtual image is presented.
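A stand-in sketch for feeding each dimension to its deformer follows; the BlendshapeDeformer class is a placeholder for the rendering engine's real blendshape interface, and the part names simply follow the example above.

class BlendshapeDeformer:
    """Placeholder for a single blendshape deformer driving one facial part of the avatar."""
    def __init__(self, part):
        self.part = part
        self.weight = 0.0
    def set_weight(self, value):
        self.weight = value
        print(f"{self.part} -> {value:.1f}")

deformers = [BlendshapeDeformer(part) for part in ("upper_lip", "lower_lip", "tongue_tip")]

def drive_avatar(state_vector, deformers):
    """Provide each dimension of the multi-dimensional state vector to the deformer of that dimension."""
    for value, deformer in zip(state_vector, deformers):
        deformer.set_weight(value)

drive_avatar([39.2, 27.2, 79.2], deformers)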
Calculating the multi-dimensional state vector at each moment within the pronunciation period of the current phoneme with the easing function provided by the embodiment of the invention, and driving the avatar accordingly, makes the avatar's mouth shape changes more realistic and natural and improves the animation display effect. In practical applications, other types of easing functions may also be used to calculate the multi-dimensional state vectors at the respective moments, for example a linear easing function, which is not limited here.
A specific example of the avatar driving method according to the embodiment of the present invention is described below. Take the target text data "hello" as an example, with the recognized phonemes and their pronunciation durations as shown in Table 1, all of them Chinese phonemes. As can be seen from FIG. 3, the mouth shape corresponding to the phoneme n is mouth shape five, the mouth shape corresponding to the phoneme in is mouth shape eight, the mouth shape corresponding to the phoneme h is mouth shape six, and the mouth shape corresponding to the phoneme ao is mouth shape three; that is, when the avatar expresses "hello", the mouth shape driving the avatar changes in sequence through mouth shape five, mouth shape eight, mouth shape six, and mouth shape three.
For example, suppose the current phoneme is ao, whose configured mouth shape is mouth shape three; the mouth shape configured for the previous phoneme is mouth shape six; the multi-dimensional state vector of mouth shape six is [20, 40, 60] and that of mouth shape three is [50, 20, 90]; and the pronunciation duration of the current phoneme is 5 seconds. Taking the calculation of the multi-dimensional state vector at the 2nd second of the pronunciation period of the current phoneme as an example, the steps are as follows:
the utterance moment (the 2nd second) is normalized: 2/5 = 0.4;
the state vector of the first dimension at the 2nd second is: -30 × 0.4 × (0.4 - 2) + 20 = 39.2;
the state vector of the second dimension at the 2nd second is: -(-20) × 0.4 × (0.4 - 2) + 40 = 27.2;
the state vector of the third dimension at the 2nd second is: -30 × 0.4 × (0.4 - 2) + 60 = 79.2;
that is, the multidimensional state vector of the 2 nd second in the pronunciation period of the phoneme ao is (39.2, 27.2, 79.2), and assuming that the dimensions corresponding to the multidimensional state vector are the upper lip, the lower lip and the tongue tip, respectively, 39.2 may be provided to the deformer corresponding to the upper lip, 27.2 may be provided to the deformer corresponding to the lower lip, and 79.2 may be provided to the deformer corresponding to the tongue tip, so that the corresponding deformer drives the corresponding portion of the avatar according to the vector of the corresponding dimension.
The other moments within the pronunciation period of the current phoneme, and all moments within the pronunciation periods of the other phonemes, are calculated in a similar way, so that the multi-dimensional state vector at every moment within the pronunciation period of every phoneme is obtained; the state vector of each dimension is provided to the deformer of that dimension, and the deformers drive the avatar in time order, thereby achieving the effect of the avatar expressing "hello".
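Putting the pieces together, a compact sketch of the per-phoneme driving loop is shown below; the timings, vectors, and frame rate are illustrative, and the resulting keyframes would be handed to the deformers as sketched above.

def generate_keyframes(phoneme_periods, mouth_vectors, fps=25):
    """For each phoneme, ease from the previous mouth shape to its own over its pronunciation period."""
    keyframes = []
    prev_vec = None
    for phoneme, start, end in phoneme_periods:
        curr_vec = mouth_vectors[phoneme]
        if prev_vec is None:
            prev_vec = [0.0] * len(curr_vec)    # no previous phoneme: start from the zero vector
        duration = end - start
        n_frames = max(1, int(duration * fps))
        for k in range(1, n_frames + 1):
            x = k / n_frames                    # normalized moment within the pronunciation period
            rate = -x * x + 2 * x               # ease-out f(x)
            vec = [s + (c - s) * rate for s, c in zip(prev_vec, curr_vec)]
            keyframes.append((start + k * duration / n_frames, vec))
        prev_vec = curr_vec
    return keyframes

# Illustrative input: two phonemes with made-up timings and mouth shape vectors.
periods = [("h", 0.0, 0.2), ("ao", 0.2, 0.7)]
vectors = {"h": [20, 40, 60], "ao": [50, 20, 90]}
for t, vec in generate_keyframes(periods, vectors, fps=5):
    print(round(t, 2), [round(v, 1) for v in vec])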
Fig. 7 is a block diagram of an animation generation apparatus according to an embodiment of the present invention, which is adapted to execute an animation generation method according to an embodiment of the present invention. As shown in fig. 7, the apparatus may specifically include:
an obtaining module 401, configured to obtain target speech data and target text data corresponding to the target speech data, where the target speech data includes speech data of different languages;
the recognition module 402 is configured to analyze and recognize the target text data to obtain each phoneme included in the target text data, and analyze and recognize the target voice data to obtain a pronunciation time period of each phoneme in each phoneme;
a determining module 403, configured to determine a language to which each phoneme belongs;
a query module 404, configured to query a mouth shape configuration table of a language to which each phoneme belongs, so as to obtain a mouth shape configured for each phoneme;
a generating module 405, configured to drive the avatar according to the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation.
In an embodiment, the generating module 405 is specifically configured to:
determining a multi-dimensional state vector of a mouth shape configured for a current phoneme, and determining a multi-dimensional state vector of a mouth shape configured for a previous phoneme, wherein the previous phoneme is a phoneme of which a pronunciation period is before and adjacent to a pronunciation period of the current phoneme;
calculating, by using an easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme, to obtain the multi-dimensional state vector at each moment within the pronunciation period of the current phoneme;
and providing the multidimensional state vectors of all moments in the pronunciation time interval of the current phoneme to deformers with corresponding dimensions, so as to drive the virtual image by using the deformers with the corresponding dimensions and generate the mouth shape animation.
In one embodiment, the easing function includes an ease-out function (ease-out).
In an embodiment, when calculating, by using the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme to obtain the multi-dimensional state vectors at the respective moments within the pronunciation period of the current phoneme, the generating module 405 is configured for:
calculating the vector change rate at each moment within the pronunciation period of the current phoneme by using the easing function;
calculating a vector difference of the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme in each dimension according to the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme;
calculating vectors of all dimensions at all times in the pronunciation period of the current phoneme according to the vector difference of the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme in all dimensions, the vector change rate of all times in the pronunciation period of the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme;
and determining the multidimensional state vector of each moment in the pronunciation time period of the current phoneme according to the vector of each dimension of each moment in the pronunciation time period of the current phoneme.
In one embodiment, the pronunciation period of the current phoneme includes a pronunciation start time and a pronunciation end time of the current phoneme, and the generating module 405 is further configured to, before calculating, by using the easing function, the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme to obtain the multi-dimensional state vectors at the respective moments within the pronunciation period of the current phoneme:
determining the pronunciation duration of the current phoneme according to the pronunciation starting time and the pronunciation ending time of the current phoneme;
and carrying out normalization processing on each moment in the pronunciation time interval of the current phoneme according to the pronunciation duration of the current phoneme.
In an embodiment, the determining module 403 is specifically configured to:
inquiring a pronunciation dictionary library of each language based on each phoneme;
and determining the language corresponding to the pronunciation dictionary library matched with each phoneme as the language to which the corresponding phoneme belongs.
In one embodiment, the apparatus further comprises:
and a creating module, configured to collect the phonemes of each language, categorize the phonemes that have the same speech viseme in each language, and configure a mouth shape for each category of phonemes according to the categorization of the phonemes in each language, so as to obtain the mouth shape configuration table created for each language.
In one embodiment, the apparatus further comprises:
the preprocessing module is used for segmenting the target text data according to languages to obtain segmented text data; preprocessing the segmented text data to obtain text data to be recognized;
the identification module 402 analyzes and identifies the target text data to obtain each phoneme included in the target text data, including:
and analyzing and identifying the text data to be identified to obtain each phoneme included in the target text data.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the functional module, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
The device of the embodiment of the invention can analyze and recognize the speech data (i.e., the target speech data) and the text data (i.e., the target text data) of a target sentence composed of different languages, obtain each phoneme included in the target text data and the pronunciation period of each phoneme, and determine the language to which each phoneme belongs; the mouth shape configuration table of the language to which each phoneme belongs is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven according to the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation. In other words, the embodiment of the present invention can identify the phonemes of the different languages contained in the target text data and query the mouth shape configuration table of each language, so that each phoneme is given a mouth shape of its own language and the avatar is driven with mouth shapes of different languages. This makes the avatar's mouth shape changes richer and avoids the problems caused by substituting Chinese mouth shapes for those of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth shape changes, and stiff, unnatural expression; it thereby improves the degree of fit between the avatar's mouth shapes and the expressed sentence and makes the avatar's expression smoother and more natural.
The embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program that is stored in the memory and can be run on the processor, and when the processor executes the computer program, the animation generation method provided in any of the above embodiments is implemented.
The embodiment of the invention also provides a computer readable medium, on which a computer program is stored, and the program is executed by a processor to implement the animation generation method provided by any one of the above embodiments.
Referring now to FIG. 8, shown is a block diagram of a computer system 500 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 8, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units described in the embodiments of the present invention may be implemented by software, and may also be implemented by hardware. The described modules and/or units may also be provided in a processor, and may be described as: a processor includes an acquisition module, a recognition module, a determination module, a query module, and a generation module. Wherein the names of the modules do not in some cases constitute a limitation of the module itself.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise:
acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages; analyzing and identifying the target text data to obtain each phoneme included in the target text data, and analyzing and identifying the target voice data to obtain the pronunciation time interval of each phoneme in each phoneme; determining the language to which each phoneme belongs; inquiring a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme; and driving the virtual character according to the corresponding mouth shape in the pronunciation period of each phoneme to generate mouth shape animation.
According to the technical solution of the embodiment of the invention, the speech data (i.e., the target speech data) and the text data (i.e., the target text data) of a target sentence composed of different languages can be analyzed and recognized to obtain each phoneme included in the target text data and the pronunciation period of each phoneme, and the language to which each phoneme belongs can be determined; the mouth shape configuration table of the language to which each phoneme belongs is queried to obtain the mouth shape configured for each phoneme; and the avatar is driven according to the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation. In other words, the embodiment of the present invention can identify the phonemes of the different languages contained in the target text data and query the mouth shape configuration table of each language, so that each phoneme is given a mouth shape of its own language and the avatar is driven with mouth shapes of different languages. This makes the avatar's mouth shape changes richer and avoids the problems caused by substituting Chinese mouth shapes for those of other languages, namely mouth shapes that do not match the interactive sentence, monotonous mouth shape changes, and stiff, unnatural expression; it thereby improves the degree of fit between the avatar's mouth shapes and the expressed sentence and makes the avatar's expression smoother and more natural.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. An animation generation method, comprising:
acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
analyzing and identifying the target text data to obtain each phoneme included in the target text data, and analyzing and identifying the target voice data to obtain a pronunciation period of each of the phonemes;
determining the language to which each phoneme belongs;
querying a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme;
and driving an avatar according to the corresponding mouth shape within the pronunciation period of each phoneme to generate a mouth shape animation.
2. The animation generation method as claimed in claim 1, wherein the driving of the avatar in accordance with the corresponding mouth shape within the pronunciation period of each phoneme to generate the mouth shape animation comprises:
determining a multi-dimensional state vector of a mouth shape configured for a current phoneme, and determining a multi-dimensional state vector of a mouth shape configured for a previous phoneme, wherein the previous phoneme is a phoneme whose pronunciation period precedes and is adjacent to the pronunciation period of the current phoneme;
calculating the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme by using an easing function, so as to obtain the multi-dimensional state vector of each moment within the pronunciation period of the current phoneme;
and providing the multi-dimensional state vector of each moment within the pronunciation period of the current phoneme to deformers of the corresponding dimensions, so as to drive the avatar by using the deformers of the corresponding dimensions and generate the mouth shape animation.
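For illustration only, the sketch below shows how the per-moment mouth shape vectors recited in claim 2 might be handed to per-dimension deformers; the Deformer class and its apply method are hypothetical stand-ins for whatever rig or blend-shape interface an animation engine actually exposes.

```python
from typing import Iterable, Sequence

class Deformer:
    """Hypothetical stand-in for a deformer of one dimension
    (for example, one blend-shape channel of the avatar's face rig)."""
    def __init__(self, name: str) -> None:
        self.name = name

    def apply(self, weight: float) -> None:
        # A real engine would move the rig here; this sketch only logs the weight.
        print(f"{self.name} -> {weight:.3f}")

def feed_deformers(vectors_per_moment: Iterable[Sequence[float]],
                   deformers: Sequence[Deformer]) -> None:
    """Provide the multi-dimensional state vector of each moment in the
    pronunciation period to the deformer of the corresponding dimension."""
    for vector in vectors_per_moment:
        for deformer, weight in zip(deformers, vector):
            deformer.apply(float(weight))
```

The per-moment vectors themselves can be produced by an easing-function interpolation such as the one sketched after claim 4 below.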
3. The animation generation method as claimed in claim 2, wherein the easing function comprises an ease-out function (ease-out).
4. The animation generation method as claimed in claim 3, wherein the calculating the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme by using an easing function, so as to obtain the multi-dimensional state vector of each moment within the pronunciation period of the current phoneme, comprises:
calculating a vector change rate of each moment within the pronunciation period of the current phoneme by using the easing function;
calculating a vector difference, in each dimension, between the mouth shape configured for the current phoneme and the mouth shape configured for the previous phoneme according to the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme;
calculating a vector of each dimension at each moment within the pronunciation period of the current phoneme according to the vector difference in each dimension, the vector change rate of each moment within the pronunciation period of the current phoneme, and the multi-dimensional state vector of the mouth shape configured for the previous phoneme;
and determining the multi-dimensional state vector of each moment within the pronunciation period of the current phoneme according to the vector of each dimension at each moment within the pronunciation period of the current phoneme.
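A minimal sketch, under assumed conventions, of the computation recited in claim 4: the easing function supplies a rate of change for each (normalized) moment, a per-dimension difference is taken between the two mouth shape vectors, and each moment's vector is the previous shape plus that difference scaled by the rate. The quadratic ease-out used here is only one common choice of easing curve, not necessarily the one used in the embodiments.

```python
import numpy as np

def ease_out(t: float) -> float:
    """A common ease-out curve: changes quickly at first, then slows down."""
    return 1.0 - (1.0 - t) ** 2

def mouth_vectors_for_period(prev_vec, cur_vec, normalized_moments):
    """Return one multi-dimensional state vector per moment of the current
    phoneme's pronunciation period, following the steps of claim 4."""
    prev_vec = np.asarray(prev_vec, dtype=float)
    cur_vec = np.asarray(cur_vec, dtype=float)
    diff = cur_vec - prev_vec                                    # vector difference in each dimension
    rates = np.array([ease_out(t) for t in normalized_moments])  # vector change rate per moment
    # previous mouth shape plus rate-scaled difference, one row per moment
    return prev_vec + rates[:, None] * diff
```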
5. The animation generation method as claimed in claim 3, wherein the pronunciation period of the current phoneme comprises a pronunciation start time and a pronunciation end time of the current phoneme, and before the calculating the multi-dimensional state vector of the mouth shape configured for the current phoneme and the multi-dimensional state vector of the mouth shape configured for the previous phoneme by using an easing function, so as to obtain the multi-dimensional state vector of each moment within the pronunciation period of the current phoneme, the method further comprises:
determining a pronunciation duration of the current phoneme according to the pronunciation start time and the pronunciation end time of the current phoneme;
and normalizing each moment within the pronunciation period of the current phoneme according to the pronunciation duration of the current phoneme.
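For illustration, a one-function sketch of the normalization step of claim 5; combined with the claim 4 sketch above, the normalized moments become the input to the easing function. The sample times in the example are hypothetical values.

```python
def normalize_moments(moments, start_time, end_time):
    """Map each moment of the pronunciation period into [0, 1] using the
    pronunciation duration (end time minus start time)."""
    duration = end_time - start_time
    if duration <= 0:
        raise ValueError("the pronunciation end time must be later than the start time")
    return [(t - start_time) / duration for t in moments]

# Example (hypothetical values): a phoneme pronounced from 0.40 s to 0.52 s,
# sampled at three moments.
# normalize_moments([0.40, 0.46, 0.52], 0.40, 0.52) -> [0.0, 0.5, 1.0]
```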
6. The animation generation method as claimed in any one of claims 1 to 5, wherein the determining the language to which each phoneme belongs comprises:
querying a pronunciation dictionary library of each language based on each phoneme;
and determining the language corresponding to the pronunciation dictionary library that matches each phoneme as the language to which the corresponding phoneme belongs.
7. The animation generation method as claimed in any one of claims 1 to 5, wherein the mouth shape configuration table is created by:
collecting the phonemes of each language and determining the speech viseme of each phoneme of each language;
classifying, in each language, the phonemes having the same speech viseme into one class;
and configuring a mouth shape for each class of phonemes in each language, so as to obtain the mouth shape configuration table created for each language.
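As an illustrative sketch of the table-creation process of claim 7 (the viseme labels and mouth shape names below are hypothetical, and real viseme inventories differ by language):

```python
from collections import defaultdict

def build_mouth_shape_table(phoneme_to_viseme, viseme_to_mouth_shape):
    """Create a mouth shape configuration table for one language: phonemes
    sharing the same speech viseme form one class, and each class is
    configured with a single mouth shape."""
    classes = defaultdict(list)
    for phoneme, viseme in phoneme_to_viseme.items():
        classes[viseme].append(phoneme)              # classify by identical viseme
    table = {}
    for viseme, phonemes in classes.items():
        mouth_shape = viseme_to_mouth_shape[viseme]  # one mouth shape per class
        for phoneme in phonemes:
            table[phoneme] = mouth_shape
    return table

# Example with made-up data:
# build_mouth_shape_table({"P": "bilabial", "B": "bilabial", "AA": "open"},
#                         {"bilabial": "closed_lips", "open": "wide_open"})
# -> {"P": "closed_lips", "B": "closed_lips", "AA": "wide_open"}
```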
8. The animation generation method as claimed in any one of claims 1 to 5, further comprising, before analyzing and identifying the target text data:
segmenting the target text data according to languages to obtain segmented text data;
preprocessing the segmented text data to obtain text data to be recognized;
the analyzing and identifying the target text data to obtain each phoneme included in the target text data includes:
and analyzing and identifying the text data to be identified to obtain each phoneme included in the target text data.
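A rough, non-authoritative sketch of the segmentation and preprocessing steps of claim 8, assuming for simplicity that the mixed text contains only Chinese characters and Latin-script text; the regular expression and the trimming rule are illustrative assumptions, not the claimed preprocessing.

```python
import re

def segment_by_language(text: str):
    """Split mixed-language text into alternating runs of Chinese characters
    and runs of other characters, as a rough stand-in for segmenting the
    target text data by language."""
    return [m.group(0) for m in re.finditer(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+", text)]

def preprocess(segments):
    """Toy preprocessing: trim whitespace and drop empty segments to obtain
    the text data to be recognized."""
    return [s.strip() for s in segments if s.strip()]

# Example: preprocess(segment_by_language("你好 hello 世界"))
# -> ["你好", "hello", "世界"]
```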
9. An animation generation device, comprising:
the acquisition module is used for acquiring target voice data and target text data corresponding to the target voice data, wherein the target voice data comprises voice data of different languages;
the recognition module is used for analyzing and recognizing the target text data to obtain each phoneme included in the target text data, and analyzing and recognizing the target voice data to obtain the pronunciation period of each of the phonemes;
the determining module is used for determining the language to which each phoneme belongs;
the query module is used for querying a mouth shape configuration table of the language to which each phoneme belongs to obtain a mouth shape configured for each phoneme;
and the generation module is used for driving an avatar according to the corresponding mouth shape within the pronunciation period of each phoneme to generate a mouth shape animation.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the animation generation method as claimed in any one of claims 1 to 8 when executing the program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the animation generation method as claimed in any one of claims 1 to 8.
CN202110812403.6A 2021-07-19 2021-07-19 Animation generation method and device, electronic equipment and storage medium Pending CN113539240A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110812403.6A CN113539240A (en) 2021-07-19 2021-07-19 Animation generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110812403.6A CN113539240A (en) 2021-07-19 2021-07-19 Animation generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113539240A true CN113539240A (en) 2021-10-22

Family

ID=78128611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110812403.6A Pending CN113539240A (en) 2021-07-19 2021-07-19 Animation generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113539240A (en)

Patent Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539354B1 (en) * 2000-03-24 2003-03-25 Fluent Speech Technologies, Inc. Methods and devices for producing and using synthetic visual speech based on natural coarticulation
KR20040076524A (en) * 2003-02-26 2004-09-01 주식회사 메세지 베이 아시아 Method to make animation character and System for Internet service using the animation character
KR20050026774A (en) * 2003-09-09 2005-03-16 정보통신연구진흥원 The lip sync controling method
CN1604072A (en) * 2004-11-09 2005-04-06 北京中星微电子有限公司 Karaoke broadcasting method for mobile terminals
KR20090051851A (en) * 2007-11-20 2009-05-25 경원대학교 산학협력단 Mobile communication terminal for outputting voice received text message to voice using avatar and method thereof
CN102637071A (en) * 2011-02-09 2012-08-15 英华达(上海)电子有限公司 Multimedia input method applied to multimedia input device
CN106504304A (en) * 2016-09-14 2017-03-15 厦门幻世网络科技有限公司 A kind of method and device of animation compound
CN107945253A (en) * 2017-11-21 2018-04-20 腾讯数码(天津)有限公司 A kind of animation effect implementation method, device and storage device
CN108447474A (en) * 2018-03-12 2018-08-24 北京灵伴未来科技有限公司 A kind of modeling and the control method of virtual portrait voice and Hp-synchronization
CN110853614A (en) * 2018-08-03 2020-02-28 Tcl集团股份有限公司 Virtual object mouth shape driving method and device and terminal equipment
CN110874557A (en) * 2018-09-03 2020-03-10 阿里巴巴集团控股有限公司 Video generation method and device for voice-driven virtual human face
WO2021023869A1 (en) * 2019-08-08 2021-02-11 Universite De Lorraine Audio-driven speech animation using recurrent neutral network
CN111260761A (en) * 2020-01-15 2020-06-09 北京猿力未来科技有限公司 Method and device for generating mouth shape of animation character
CN111915707A (en) * 2020-07-01 2020-11-10 天津洪恩完美未来教育科技有限公司 Mouth shape animation display method and device based on audio information and storage medium
CN112634861A (en) * 2020-12-30 2021-04-09 北京大米科技有限公司 Data processing method and device, electronic equipment and readable storage medium
CN112837401A (en) * 2021-01-27 2021-05-25 网易(杭州)网络有限公司 Information processing method and device, computer equipment and storage medium
CN112734889A (en) * 2021-02-19 2021-04-30 北京中科深智科技有限公司 Mouth shape animation real-time driving method and system for 2D character
CN113112575A (en) * 2021-04-08 2021-07-13 深圳市山水原创动漫文化有限公司 Mouth shape generation method and device, computer equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114420088A (en) * 2022-01-20 2022-04-29 安徽淘云科技股份有限公司 Display method and related equipment thereof
CN114581567A (en) * 2022-05-06 2022-06-03 成都市谛视无限科技有限公司 Method, device and medium for driving mouth shape of virtual image by sound
CN114581567B (en) * 2022-05-06 2022-08-02 成都市谛视无限科技有限公司 Method, device and medium for driving mouth shape of virtual image by sound
CN115222856A (en) * 2022-05-20 2022-10-21 一点灵犀信息技术(广州)有限公司 Expression animation generation method and electronic equipment
CN115222856B (en) * 2022-05-20 2023-09-26 一点灵犀信息技术(广州)有限公司 Expression animation generation method and electronic equipment
CN115662388A (en) * 2022-10-27 2023-01-31 维沃移动通信有限公司 Avatar face driving method, apparatus, electronic device and medium
WO2024088321A1 (en) * 2022-10-27 2024-05-02 维沃移动通信有限公司 Virtual image face driving method and apparatus, electronic device and medium
CN117275485A (en) * 2023-11-22 2023-12-22 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium
CN117275485B (en) * 2023-11-22 2024-03-12 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination