CN113609255A - Method, system and storage medium for generating facial animation - Google Patents

Method, system and storage medium for generating facial animation

Info

Publication number
CN113609255A
CN113609255A (application CN202110891111.6A)
Authority
CN
China
Prior art keywords
mouth shape
animation
generating
mouth
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110891111.6A
Other languages
Chinese (zh)
Inventor
顾文元
张雪源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yuanmeng Intelligent Technology Co.,Ltd.
Yuanmeng Human Intelligence International Co., Ltd
Original Assignee
Yuanmeng Human Intelligence International Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanmeng Human Intelligence International Co., Ltd.
Priority to CN202110891111.6A
Publication of CN113609255A

Classifications

    • G06F16/3344 Information retrieval of unstructured textual data; querying; query execution using natural language analysis
    • G06F16/35 Information retrieval of unstructured textual data; clustering; classification
    • G06F16/685 Information retrieval of audio data; retrieval characterised by metadata automatically derived from the content, e.g. an automatically derived transcript of audio data such as lyrics
    • G06F40/30 Handling natural language data; semantic analysis
    • G06N20/00 Machine learning
    • G06N3/04 Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N3/08 Computing arrangements based on biological models; neural networks; learning methods

Abstract

The invention discloses a method, a system and a storage medium for generating facial animation, wherein the method comprises the following steps: receiving audio information and text information; generating a plurality of mouth shapes according to the audio information; calculating the mouth shape change rate of the audio information; generating a mouth shape animation according to the mouth shape change rate and the mouth shapes; acquiring the global expression of the text information and the weights of a plurality of preset expressions for each part of the text information; generating an expression animation according to the global expression and those weights; and generating the facial animation from the mouth shape animation and the expression animation. The invention introduces the mouth shape change rate as a reference quantity to generate different mouth shape animations, combines the different mouth shape animations with different expressions to generate the facial animation, and takes into account the influence of each line of lyrics on the emotional expression, so that the facial animation of a virtual human is richer and more natural when it sings a song.

Description

Method, system and storage medium for generating facial animation
Technical Field
The invention relates to the technical field of animation, in particular to a method and a system for generating facial animation and a storage medium.
Background
With the popularization of virtual human (digital human) technology, virtual humans are widely used in many fields and fall mainly into two types: virtual anchors, centered on speech, and virtual idols, centered on singing and dancing. In virtual human research, facial animation synthesis has always been a focus, and two methods are mainly used at present: traditional face capture, and synthesis with machine learning or deep learning algorithms.
The traditional face capture method captures the facial movements of a performing actor with face capture software, and animators then refine the result in post-production. The final quality is good, but there are two drawbacks: first, low efficiency, since a great deal of manual adjustment by animators is needed and production time and cost are high; second, heavy dependence on the motion capture actor's performance, since each song must be recorded separately by the actor and quality is hard to guarantee given the actor's level of performance.
The second approach synthesizes animation with machine learning or deep learning algorithms. For mouth shape generation, the time information of the phonemes is obtained through alignment, and the facial animation is then generated automatically with a mouth shape fusion method. Expression generation mainly controls the changes of the eyes and lips through blendshapes and fuses the expressions to produce the facial expression. Although this method is efficient, it is difficult to capture the expressive details of mouth shapes and expressions together with the influence of multiple factors, so the synthesized animation feels strongly mechanical.
At present both methods are applied to virtual anchor speaking scenarios, but virtual singing, represented by virtual singers, still relies mainly on traditional face capture, and there is no algorithm specifically for automatically synthesizing mouth shapes and expressions in singing. Mouth shape synthesis for singing differs greatly from that for speaking, and speech mouth shape and expression algorithms cannot be transplanted directly to singing. The differences mainly include the following:
Singing covers a much wider range of speech rates: a fast song can be much faster than ordinary speech, while the lyrics of a slow song are sung much more slowly than speech. Algorithms designed for speech do not apply to very fast and very slow rates.
The expressions of singing and speaking also differ: singing produces more artistic expressions, such as deep affection or gazing into the distance, whereas expressions during speech are comparatively everyday.
Therefore, given the wide range of singing speech rates and the difference between singing and speaking expressions, a method for generating facial animation is needed that overcomes the inapplicability of traditional mouth shape and expression algorithms, efficiently synthesizes high-quality singing facial animation, and meets the demands of the entertainment and animation markets for virtual human singing performances.
Disclosure of Invention
The invention aims to provide a method, a system and a storage medium for generating facial animation, which solve the problem that existing virtual human facial animation generation algorithms cannot be applied directly to virtual human singing, and thereby efficiently generate virtual human facial animation with more vivid and lively mouth shapes and expressions.
In order to achieve the above object of the present invention, the present invention is achieved by the following technical solutions:
the invention provides a method for generating facial animation, which comprises the following steps:
receiving audio information and text information;
generating a plurality of mouth shapes according to the audio information;
calculating a rate of change of the mouth shape of the audio information;
generating a mouth shape animation according to the mouth shape change rate and the mouth shapes;
acquiring global expressions of the text information and weights of a plurality of preset expressions of all parts in the text information;
generating expression animation according to the global expression and the weights of a plurality of preset expressions in each part of the text information;
and generating facial animation according to the mouth shape animation and the expression animation.
The invention introduces the mouth shape change rate as a reference quantity to generate different mouth shape animations and combines the different mouth shape animations with different expressions to generate the facial animation. It thus accounts for the influence of the virtual human's varying singing speed on the facial expression and for the influence of each line of lyrics on the emotional expression, so that the facial animation of the virtual human is richer and more natural when it sings the song.
Further, the present invention also provides a method for generating a facial animation, wherein the generating a plurality of mouth shapes according to the audio information specifically includes:
converting all parts of the audio information into a plurality of phonemes;
generating a corresponding first mouth shape according to each phoneme;
respectively acquiring a plurality of complete pronunciation time periods of each part in the audio information;
time-aligning each of the complete pronunciation periods with the corresponding phoneme;
and identifying the first mouth shape corresponding to the phoneme in the middle position of each complete pronunciation period as a key mouth shape.
The invention introduces two kinds of mouth shapes for generating the mouth shape animation at different mouth shape change rates, so that the virtual human produces more vivid and natural facial expressions when singing songs at different tempos.
Further, the present invention provides a method for generating a facial animation, wherein the generating of the mouth shape animation according to the mouth shape change rate and a plurality of mouth shapes specifically comprises:
generating a plurality of segments of sub-mouth shape animations according to the mouth shape change rate and a plurality of mouth shapes corresponding to all parts in the audio information;
combining a plurality of sections of the sub-mouth animations in sequence to generate the mouth-shape animation;
the generating a plurality of segments of sub-mouth shape animations according to the mouth shape change rates and the plurality of mouth shapes corresponding to the parts in the audio information specifically comprises:
when the mouth shape change rate is smaller than a preset threshold value, generating the sub-mouth shape animation according to a plurality of first mouth shapes corresponding to all parts in the audio information;
and when the mouth shape change rate is larger than a preset threshold value, generating the sub-mouth shape animation according to a plurality of key mouth shapes corresponding to all parts in the audio information.
Further, the method for generating facial animation of the present invention, where the obtaining of the global expression of the text information specifically includes:
calculating the weights of a plurality of preset emotions in the text information according to a preset emotion classification model, and calculating the weights of a plurality of preset emotions in each part of the text information respectively;
converting a plurality of preset emotions into corresponding preset expressions according to a preset expression library;
and identifying the preset expression with the highest weight in the text information as the global expression.
Further, the present invention provides a method for generating a facial animation, where the generating of the facial animation according to the global expression and the weights of a plurality of preset expressions in each part of the text information specifically includes:
combining the global expression and the preset expressions by adopting a weighted average method to generate a plurality of local expressions;
and combining a plurality of local expressions in sequence to generate the expression animation.
According to the invention, multiple expressions are weighted-averaged and combined to generate the expression animation, so that as the emotion of the song changes during singing, the virtual human's facial expression varies more and remains natural, matching the emotional expression of the song.
Further, the present invention provides a method for generating a facial animation, wherein after generating a mouth shape animation according to the mouth shape change rate and the mouth shape, before acquiring the global expression of the text information and the weights of a plurality of preset expressions of each part in the text information, the method further comprises:
performing linear smoothing on the mouth shape animation by adopting a linear interpolation method;
defining a Gaussian smoothing window whose parameters depend on the mouth shape change rate k, the relative time scale i of the audio information, and the constant e;
and carrying out nonlinear smoothing on the mouth shape animation with this window, where l is the mouth shape generation parameter, N is the smoothing window width, N = 3fs/k, and fs is the audio sampling frequency.
The invention applies nonlinear smoothing while the mouth shape changes, which avoids mechanical mouth shape changes when syllables are sustained (dragged out) during singing, so that the virtual human's facial expression is more natural and fluent while it sings.
Further, the present invention provides a method for generating a facial animation, wherein after generating a plurality of mouth shapes according to the audio information, before generating a mouth shape animation according to the mouth shape change rate and the plurality of mouth shapes, the method further comprises:
calculating the first average energy of the audio information over the preceding M seconds, where t is the current time of the audio information in seconds;
calculating the modulation ratio as the normalized second average energy, where e is the vector of first average energies over seconds 1 to T of the audio information;
and adjusting the mouth shape according to the second average energy.
According to the invention, modulating the amplitude of the virtual human's mouth shapes by the song's volume improves the match between the mouth shapes and the song, so that the virtual human's facial expression follows the song more closely while it sings.
Additionally, the present invention also provides a facial animation generation system, comprising:
the receiving module is used for receiving audio information and text information, wherein the text information comprises a plurality of pieces of sub-text information;
the mouth shape generating module is connected with the receiving module and used for generating a plurality of mouth shapes according to the audio information;
the rate calculation module is connected with the receiving module and used for calculating the mouth shape change rate according to the audio information;
the mouth shape animation generating module is connected with the mouth shape generating module and the rate calculating module and is used for generating mouth shape animations according to the mouth shape change rate and the mouth shapes;
the acquiring module is connected with the receiving module and used for acquiring the global expression of the text information and the weights of a plurality of preset expressions of each part in the text information;
the expression animation generation module is connected with the acquisition module and is used for generating the expression animation according to the global expression and the weights of a plurality of preset expressions in each part of the text information;
and the facial animation generating module is used for generating facial animation according to the mouth shape animation and the expression animation.
The invention introduces the mouth type change rate as a reference quantity to generate different mouth type animations, and the different mouth type animations are combined with different expressions to generate the facial animation, thereby solving the influence of different speeds of the virtual human on the facial expressions when singing the song and the influence of each lyric in the reference song on the emotional expressions, and leading the facial animation of the virtual human to be richer and more natural when singing the song.
Further, the present invention provides a system for generating a facial animation, further comprising:
the smoothing module is connected with the mouth shape animation generation module and is used for carrying out nonlinear smoothing processing on the mouth shape animation;
the amplitude adjusting module is connected with the mouth shape generating module and used for adjusting the amplitudes of the mouth shapes according to the energy of the audio information;
the smoothing module comprises a linear smoothing unit, a smoothing window defining unit and a nonlinear smoothing unit, wherein the linear smoothing unit is used for performing linear smoothing on the mouth shape animation by adopting a linear interpolation method;
a smoothing window definition unit for defining a Gaussian smoothing window whose parameters depend on the mouth shape change rate k, the relative time scale i of the audio information, and the constant e;
and the nonlinear smoothing unit is used for carrying out nonlinear smoothing on the mouth shape animation with this window, where l is the mouth shape generation parameter, N is the smoothing window width, N = 3fs/k, and fs is the audio sampling frequency.
The amplitude adjusting module comprises a first average energy calculating unit, a second average energy calculating unit and a mouth shape debugging unit,
a first average energy calculation unit, configured to calculate the first average energy of the audio information over the preceding M seconds, where t is the current time of the audio information in seconds;
a second average energy calculation unit, configured to calculate the modulation ratio as the normalized second average energy, where e is the vector of first average energies over seconds 1 to T of the audio information;
and the mouth shape adjustment unit is used for adjusting the mouth shape according to the second average energy.
Additionally, the present invention also provides a storage medium, wherein the storage medium stores at least one instruction for implementing the operations performed by the method for generating facial animation as described above.
The invention provides a method, a system and a storage medium for generating facial animation, which have at least the following beneficial effects:
1) the invention introduces the mouth shape change rate as a reference quantity to generate different mouth shape animations and combines the different mouth shape animations with different expressions to generate the facial animation, which accounts for the influence of the virtual human's varying singing speed on the facial expression and the influence of each line of lyrics on the emotional expression, so that the facial animation of the virtual human is richer and more natural when it sings the song;
2) the mouth shape animation generation method introduces two mouth shapes for generating mouth shape animation under the condition of different mouth shape change rates, so that the virtual human can generate more vivid and natural facial expressions when singing songs with different speech rates;
3) according to the invention, multiple expressions are subjected to weighted average processing and combined to generate expression animation, so that the facial expression of the virtual human is more changeable and natural when the song is subjected to emotion change in the singing process of the virtual human, and the expression accords with the emotion expression of the song;
4) the mouth shape is subjected to nonlinear smoothing processing in the mouth shape changing process, so that the mechanical change of the mouth shape when the dragging phenomenon exists in the singing process of the virtual human is avoided, and the facial expression of the virtual human is more natural and smooth in the singing process;
5) according to the method, the mouth shape amplitude of the virtual person in the singing process is modulated according to the volume, so that the mouth shape and song matching degree can be improved, and the facial expression of the virtual person is closer to the song in the singing process;
6) the invention can automatically generate facial animation containing mouth shape and expression only according to the acquired target song and lyrics thereof, does not need human intervention, and can efficiently produce a large amount of virtual human singing animation.
Drawings
The above features, technical features, advantages and implementations of a method, system and storage medium for generating a facial animation will be further described in the following detailed description of preferred embodiments with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method of generating a facial animation of the present invention;
FIG. 2 is a flowchart of a mouth shape generating method in a face animation generating method according to the present invention;
FIG. 3 is a flowchart of a mouth shape animation generating method in a facial animation generating method according to the present invention;
FIG. 4 is a flowchart of a method for obtaining global expressions in a method for generating facial animation according to the present invention;
FIG. 5 is a flowchart of an expression animation generating method in a facial animation generating method according to the present invention;
FIG. 6 is another flow chart of a method of generating a facial animation of the present invention;
FIG. 7 is yet another flow chart of a method of generating a facial animation of the present invention;
FIG. 8 is a schematic diagram of a facial animation generation system of the present invention;
FIG. 9 is a schematic diagram of a mouth shape generation module in the facial animation generation system of the present invention;
FIG. 10 is a schematic diagram of a mouth-shape animation generation module in the facial animation generation system of the present invention;
FIG. 11 is a schematic diagram of an acquisition module in a facial animation generation system according to the present invention;
FIG. 12 is a schematic diagram of an expression animation generation module in the facial animation generation system of the present invention;
FIG. 13 is a schematic diagram of a smoothing module in the facial animation generation system of the present invention;
FIG. 14 is a schematic diagram of a magnitude adjustment module in a facial animation generation system of the present invention;
reference numbers in the figures: 10-receiving module, 20-mouth shape generating module, 30-rate calculating module, 40-mouth shape animation generating module, 50-obtaining module, 60-expression animation generating module, 70-facial animation generating module, 80-smoothing module, 90-amplitude adjusting module, 21-phoneme converting unit, 22-first mouth shape generating unit, 23-complete pronunciation period obtaining unit, 24-time aligning unit, 25-key mouth shape recognition unit, 41-judging unit, 42-sub mouth shape animation generating unit, 43-mouth shape animation combining unit, 51-emotion classification model establishing unit, 52-expression library establishing unit, 53-weight calculating unit, 54-expression converting unit, 55-global expression recognition unit, 61-local expression generating unit, 62-expression animation composing unit, 81-linear smoothing unit, 82-smooth window defining unit, 83-nonlinear smoothing unit, 91-first average energy calculating unit, 92-second average energy calculating unit and 93-mouth shape adjustment unit.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example 1
One embodiment of the present invention, as shown in fig. 1, provides a method for generating a facial animation, including the steps of:
s100 receives audio information and text information.
Specifically, the audio information includes singing audio, recitation audio and the like, and mainly refers to the singing audio for which expressions need to be generated while the virtual human sings; the text information includes lyrics, recitation scripts and the like, and mainly refers to the lyrics for which expressions need to be generated while the virtual human sings.
S200, a plurality of mouth shapes are generated according to the audio information.
Specifically, methods for generating the mouth shapes include traditional face capture, synthesis with machine learning or deep learning algorithms, conversion of audio into phonemes from which mouth shapes are generated, and the like.
S300 calculates a rate of change of the mouth shape of the audio information.
Specifically, the mouth shape change rate is the speed at which the mouth shape changes over time; by analogy with speech rate estimation in speaking, it is defined as the average number of lyric characters per second. Fast songs usually change mouth shape quickly and slow songs change slowly. This rate helps to globally set the speed of mouth shape changes and to control how fully each mouth shape is reached.
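For illustration only, the following Python sketch computes such a rate from time-aligned lyric lines; the function name, the (text, start, end) tuple layout and the example data are assumptions, not part of the disclosure.

    def mouth_shape_change_rate(aligned_lines):
        """aligned_lines: list of (lyric_text, start_sec, end_sec) tuples."""
        total_chars = sum(len(text) for text, _, _ in aligned_lines)       # characters sung
        total_secs = sum(end - start for _, start, end in aligned_lines)   # time spent singing them
        return total_chars / total_secs if total_secs > 0 else 0.0

    # Example: two lines totalling 10 characters sung over 6 seconds -> about 1.67 characters per second
    k = mouth_shape_change_rate([("好久不见", 0.0, 4.0), ("你最近还好吗", 4.0, 6.0)])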
S400, generating the mouth shape animation according to the mouth shape change rate and the mouth shapes.
Specifically, the mouth shape change rate globally sets the speed of mouth shape changes and controls how fully each mouth shape is reached. When the mouth shape change rate is large, the switching between mouth shapes is fast and carries a certain amount of jitter; when the rate is small, switching is slower, transitions are smooth, and there is less jitter.
S500, acquiring the global expression of the text message and the weights of a plurality of preset expressions of each part in the text message.
Specifically, the expression generation in the invention mainly comprises two parts, namely a global expression and a local expression. The average emotional type of the whole song determines the global expression of the whole song, and meanwhile, the emotion of each lyric influences the change of the local expression.
S600, generating an expression animation according to the global expression and the weights of a plurality of preset expressions in each part of the text information.
Specifically, the invention combines a global basic expression with local varying expressions: the global expression uses a winner-takes-all strategy, the local expressions use a weighted average strategy, and the two are combined to generate facial expressions that have a clear overall emotional style while remaining rich and variable in emotional expression.
S700, generating facial animation according to the mouth shape animation and the expression animation.
Specifically, the final facial animation is obtained by fusing the mouth shape animation with the expression animation after the latter is converted from per-lyric-sentence resolution to the time resolution of the mouth shape animation, qt. The fused animation is:
ft = lt + qt
where ft denotes the model parameters of the generated facial animation, lt the model parameters of the generated mouth shape animation, and qt the model parameters of the generated expression animation.
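For illustration, a minimal Python sketch of this fusion step is given below; the array shapes, the per-sentence frame counts and the function name are assumptions, while the addition itself follows ft = lt + qt above.

    import numpy as np

    def fuse_face_animation(mouth_params, expr_params_per_line, line_frame_counts):
        """mouth_params: (T, D) per-frame mouth shape parameters lt.
        expr_params_per_line: (num_lines, D) expression parameters, one row per lyric sentence.
        line_frame_counts: number of animation frames covered by each lyric sentence (sums to T)."""
        # Upsample the per-sentence expressions to the mouth animation's frame resolution (qt)
        expr_per_frame = np.repeat(expr_params_per_line, line_frame_counts, axis=0)
        # ft = lt + qt
        return mouth_params + expr_per_frame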
Specifically, the order of the steps of generating the mouth shape animation and generating the expression animation in this embodiment is not completely limited, and in the actual facial expression generating process, the mouth shape animation may be generated first and then the expression animation may be generated, and the expression animation may also be generated first and then the mouth shape animation may be generated, and the generation order of the mouth shape animation and the expression animation has no influence on the specific generating steps.
Specifically, in this embodiment, the sequence of the steps of generating the plurality of mouth shapes according to the audio information and calculating the mouth shape change rate of the audio information is not limited, in the actual facial expression generating process, the mouth shape change rate of the audio information may be calculated after the plurality of mouth shapes are generated according to the audio information, or the mouth shape change rate of the audio information may be calculated before the plurality of mouth shapes are generated according to the audio information, and the sequence of calculating the mouth shape change rate and generating the mouth shape according to the audio information has no influence on the specific generating step.
This embodiment provides a method for synthesizing facial animation for virtual human singing, filling the gap left by the absence of a complete algorithm for synthesizing singing faces. The method synthesizes and then fuses the two most important facial elements, mouth shape and expression, to generate rich, realistic and variable facial animation. The embodiment focuses on the virtual human singing scenario and offers a complete solution. It considers the influence of different speech rates on the facial expression animation, given the wide range of rates when a virtual human sings, generates facial expression changes according to those different rates, and combines them with the influence of each line of lyrics on the facial expression to produce the virtual human's facial animation during singing, so that the facial expression is richer and more natural when the virtual human sings the song.
Example 2
Based on the method for generating a facial animation in embodiment 1, as shown in fig. 2, generating a plurality of mouth shapes according to the audio information specifically includes:
s210 converts each part of the audio information into a plurality of phonemes.
Specifically, in this embodiment, pinyin is split into minimum units, and a phoneme library is constructed. As shown in table 1:
TABLE 1
Initials: b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s, y, w
Simple finals: a, o, e, i, u, ü, er
Final tails: nn, ng
The phoneme library contains all initials, simple finals and final tails, where n and nn denote the initial n and the final-tail n respectively, and each phoneme corresponds to one mouth shape.
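As a rough illustration of how a pinyin syllable could be split against such a library, the sketch below encodes one possible splitting rule; it is an assumption rather than the disclosed procedure, tones are ignored, and compound finals are simply split into single vowels.

    INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
                "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

    def split_pinyin(syllable):
        """Split one toneless pinyin syllable into phonemes: initial, single vowels, final tail."""
        initial = next((i for i in INITIALS if syllable.startswith(i)), "")
        rest = syllable[len(initial):]
        tail = ""
        if rest.endswith("ng"):
            rest, tail = rest[:-2], "ng"
        elif rest.endswith("n") and rest != "n":
            rest, tail = rest[:-1], "nn"   # final-tail n, written nn to distinguish it from the initial n
        vowels = ["er"] if rest == "er" else list(rest)   # compound finals split into single vowels
        return [p for p in ([initial] + vowels + [tail]) if p]

    print(split_pinyin("zhang"))   # ['zh', 'a', 'ng']
    print(split_pinyin("tian"))    # ['t', 'i', 'a', 'nn']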
S220 generates a corresponding first mouth shape according to each phoneme.
Specifically, the first mouth shape is the mouth shape corresponding to a phoneme: each song is divided into parts according to the lyric sentences, each lyric sentence is split into a number of phonemes, and each phoneme corresponds to one mouth shape.
S230 respectively obtains a plurality of complete utterance periods of each portion in the audio information.
Specifically, each completely pronounced character or word in each lyric sentence is taken as one complete pronunciation period.
S240 time-aligns each full-utterance period with a corresponding phoneme.
Specifically, each completely pronounced character or word is matched in time with its corresponding phonemes: during singing, the time needed to sing a single character is obtained, the phonemes contained in that character are found by matching, and the time over which each phoneme's mouth shape should be expressed is obtained. Alignment methods include traditional Viterbi-decoding-based alignment with an HMM and alignment based on a TDNN neural network.
S250 identifies the first mouth shape corresponding to the phoneme at the middle position of each complete pronunciation period as a key mouth shape.
Specifically, the key mouth shape (the second mouth shape) is the most salient mouth shape during pronunciation; it usually lies in the middle of the duration of a phoneme's complete pronunciation and is relatively little affected by the surrounding phonemes.
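A minimal sketch of selecting the key mouth shape from time-aligned phonemes follows; the tuple layout and the choice of the phoneme whose centre lies nearest the midpoint of the pronunciation period are assumptions.

    def key_mouth_shapes(pronunciation_periods):
        """pronunciation_periods: list of (start_sec, end_sec, phones), where phones is a list of
        (phoneme, phoneme_start_sec, phoneme_end_sec) aligned within that period."""
        keys = []
        for start, end, phones in pronunciation_periods:
            mid = (start + end) / 2.0
            # Phoneme whose own centre is closest to the middle of the complete pronunciation period
            key = min(phones, key=lambda p: abs((p[1] + p[2]) / 2.0 - mid))
            keys.append(key[0])   # its first mouth shape is taken as the key mouth shape
        return keys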
Example 3
Based on the method for generating a facial animation in embodiment 2, as shown in fig. 3, generating a mouth shape animation according to a mouth shape change rate and a plurality of mouth shapes specifically includes:
s410, generating a plurality of segments of sub-mouth shape animations according to the mouth shape change rate and a plurality of mouth shapes corresponding to all parts in the audio information.
S420, combining the plurality of segments of the sub-mouth shape animations in sequence to generate the mouth shape animation.
Wherein, the step S410 of generating the sub-mouth shape animation according to the mouth shape change rate and the corresponding plurality of mouth shapes specifically includes:
and judging the size relation between the mouth shape change rate and a preset threshold value.
Specifically, the mouth shape change rate globally sets the speed of mouth shape changes and controls how fully each mouth shape is reached. When the mouth shape change rate is larger than the preset threshold, the switching between mouth shapes is fast and a certain amount of jitter is allowed; when it is smaller than the preset threshold, switching is slower, transitions are smooth, and there is less jitter.
The switching speed is the number of mouth shapes presented per second, and the threshold and the switching speed are set manually. For example, with a preset threshold of 2 and a switching speed of 2, if the mouth shape change rate exceeds 2, mouth shapes are switched at a speed of two per second.
And when the mouth shape change rate is smaller than a preset threshold value, generating the sub-mouth shape animation according to a plurality of first mouth shapes corresponding to all parts of the audio information.
Specifically, when the mouth shape change rate is small, the mouth shapes corresponding to all phonemes are shown fully, switching is slow, transitions are smooth and there is less jitter. That is, when singing a slow song, the switching between mouth shapes is slow and each mouth shape can be fully expressed.
And when the mouth shape change rate is larger than a preset threshold value, generating the sub-mouth shape animation according to a plurality of key mouth shapes corresponding to all parts of the audio information.
Specifically, when the mouth shape change rate is high, only the key mouth shapes are formed during switching, the switching between key mouth shapes is fast, and a certain amount of jitter is allowed. That is, when singing a fast song, the switching between mouth shapes is fast, only the key mouth shapes are shown, and the fullness of each mouth shape is reduced accordingly.
This embodiment introduces two kinds of mouth shapes used at different mouth shape change rates to generate the mouth shape animation, so that the virtual human produces more vivid and natural facial expressions when singing songs at different tempos.
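As an illustration of the selection rule described above, the sketch below chooses which mouth shapes drive one sub-animation; the threshold value and the function signature are assumptions.

    def select_mouth_shapes(change_rate, first_shapes, key_shapes, threshold=2.0):
        """first_shapes: every phoneme's mouth shape for one lyric segment; key_shapes: only the key ones."""
        if change_rate < threshold:
            return first_shapes   # slow song: show every mouth shape, smooth and steady transitions
        return key_shapes         # fast song: keep only the key mouth shapes, faster switching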
Example 4
As shown in fig. 4, the method for generating a facial animation according to any one of embodiments 1 to 3, wherein the obtaining of the global expression of the text information specifically includes:
s510, emotion classification models and expression libraries of expressions corresponding to all emotions are established in advance.
Specifically, the emotion classification model uses Text CNN, which performs well on short-text semantic classification. The emotion data required for training comprise 73 Chinese songs, with the overall emotion of each song and the emotion of each lyric sentence labelled manually. The Text CNN training was verified with ten-fold cross-validation and achieved a single-sentence recognition accuracy of 88.32%.
This embodiment also defines the four emotions of a song as joyful, depressed, critical and affectionate. An animator creates a static expression for each emotion, denoted qj, qd, qc and qa respectively; these four expressions qj, qd, qc, qa corresponding to joyful, depressed, critical and affectionate form the expression library.
S520, calculating the weights of a plurality of preset emotions in the text information according to the preset emotion classification model, and calculating the weights of a plurality of preset emotions in each part of the text information respectively.
Specifically, the weights of the four preset emotions (joyful, depressed, critical, affectionate) expressed by the lyrics of the whole song are calculated with the emotion classification model, and the weights of the four preset emotions are also calculated for each lyric sentence.
S530, converting the preset emotions into corresponding preset expressions according to a preset expression library.
And S540, identifying the preset expression with the highest weight in the text information as a global expression.
Specifically, each emotion of the whole song is scored, and the expression corresponding to the highest-scoring emotion in the song-level emotion classification is recorded as qg.
Example 5
Based on the method for generating facial animation in embodiment 4, as shown in fig. 5, the step of generating the facial animation according to the global expression and the weights of a plurality of preset expressions in each part of the text information specifically includes:
s610, combining the global expression and the preset expressions to generate a plurality of local expressions by adopting a weighted average method.
Specifically, emotion and expression weight analysis is carried out for each lyric sentence of the whole song, giving the i-th lyric sentence its scores for the four emotions. The fused expression of that sentence is the weighted combination of the global expression, with proportion w, and the weighted average of the four preset expressions according to the sentence's emotion scores, with proportion (1 - w). Here w is the proportion given to the global expression: the higher w, the smaller the local variation; the lower w, the stronger the local emotional fluctuation. Because expressions are weighted, various compound and micro expressions can be combined, for example an expression that is mainly affectionate with a hint of sadness, which increases the diversity of expressions.
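A minimal Python sketch of this combination follows, assuming the per-sentence emotion scores are normalized to sum to 1 and taking w = 0.7 as an example value (the description does not fix w); the dictionaries and toy vectors are illustrative only.

    import numpy as np

    def fuse_expressions(global_scores, line_scores, expressions, w=0.7):
        """global_scores / line_scores: emotion name -> classifier weight;
        expressions: emotion name -> preset expression parameter vector (e.g. blendshape weights)."""
        global_emotion = max(global_scores, key=global_scores.get)    # winner takes all for the song
        q_g = expressions[global_emotion]
        total = sum(line_scores.values()) or 1.0
        q_local = sum((s / total) * expressions[e] for e, s in line_scores.items())
        return w * q_g + (1.0 - w) * q_local                          # local expression of this sentence

    # Toy usage with 3-dimensional expression parameters
    expr = {e: np.random.rand(3) for e in ("joyful", "depressed", "critical", "affectionate")}
    song = {"joyful": 0.1, "depressed": 0.2, "critical": 0.1, "affectionate": 0.6}
    line = {"joyful": 0.0, "depressed": 0.5, "critical": 0.1, "affectionate": 0.4}
    q_i = fuse_expressions(song, line, expr)   # mainly affectionate with a hint of sadness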
S620, combining the local expressions in sequence to generate expression animation.
Specifically, after the global expression generated from the lyrics of the whole song is weighted with the local expression of each lyric sentence, the resulting local expressions are combined into an animation following the song's singing order.
In this embodiment, multiple expressions are weighted-averaged and combined to generate the expression animation, so that as the emotion of the song changes during singing, the virtual human's facial expression varies more and remains natural, matching the emotional expression of the song.
Example 6
Based on the method for generating a facial animation in any one of embodiments 1 to 5, as shown in fig. 6, after generating a mouth shape animation according to a mouth shape change rate and a mouth shape, before acquiring a global expression of text information and weights of a plurality of preset expressions of each part in the text information, the method further includes:
s810, linear smoothing is carried out on the mouth shape animation by adopting a linear interpolation method.
S820 defines a Gaussian smoothing window whose parameters depend on the mouth shape change rate k, the relative time scale i of the audio information, and the constant e.
Specifically, k is the mouth shape change rate, i.e. the speed at which the mouth shape changes over time, defined (by analogy with speech rate estimation in speaking) as the average number of lyric characters per second. i is the relative time scale of the audio information, an offset on the song's time axis measured in seconds; for example, i = 100 refers to the Gaussian smoothing window parameters of the mouth shape animation 100 seconds after the current moment.
S830 performs nonlinear smoothing on the mouth shape animation with this window, where l is the mouth shape generation parameter and N is the smoothing window width, N = 3fs/k, with fs the audio sampling frequency.
Specifically, N corresponds to the sum of the time interval of the complete pronunciation of the single character at time t of the audio information and the time intervals of the complete pronunciations of the characters immediately before and after it, i.e. the time needed to pronounce the character sung at time t in the lyrics plus the time needed for the two neighbouring characters.
This embodiment addresses the fact that traditional speech mouth shape synthesis usually uses linear transitions, i.e. the transition between two pinyin mouth shapes is filled in by linear interpolation. When speaking, the interval between two pinyins is typically tens of milliseconds, so mouth stiffness is not visually noticeable. In singing, however, syllables are often dragged out and the interval between two pinyins can last several seconds; with linear transitions alone, mechanical mouth shape changes become clearly visible. The invention therefore applies nonlinear smoothing while the mouth shape changes, making the virtual human's facial expression more natural and fluent during singing.
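For illustration only, the sketch below smooths per-frame mouth parameters with a Gaussian window whose width follows N = 3·fs/k from the text; the exact Gaussian profile of the disclosure is not reproduced here, so the standard form used below, and the frame-rate interpretation of fs, are assumptions.

    import numpy as np

    def gaussian_smooth_mouth(params, k, fs):
        """params: (T, D) mouth shape parameters sampled at fs frames per second;
        k: mouth shape change rate in characters per second."""
        N = max(3, int(round(3 * fs / k)))             # window width from the text: N = 3*fs/k
        i = np.arange(N) - (N - 1) / 2.0               # relative time scale across the window
        window = np.exp(-0.5 * (i / (N / 6.0)) ** 2)   # assumed Gaussian profile
        window /= window.sum()
        pad = N // 2
        padded = np.pad(params, ((pad, pad), (0, 0)), mode="edge")
        smoothed = np.stack(
            [np.convolve(padded[:, d], window, mode="valid") for d in range(params.shape[1])],
            axis=1,
        )
        return smoothed[: params.shape[0]]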
Example 7
Based on the method for generating a facial animation according to any one of embodiments 1 to 6, as shown in fig. 7, after generating a plurality of mouth shapes from audio information, and before generating a mouth shape animation according to a mouth shape change rate and the plurality of mouth shapes, the method further includes:
S910, calculating the first average energy of the audio information over the preceding M seconds, where et is the first average energy and t is the current time of the audio information in seconds.
Illustratively, with i = 1, 2, 3, ..., T, a song is typically 4 to 5 minutes long and a verse or refrain lasts about 1 minute, so a 1-minute window is selected to calculate the average energy, i.e. M = 60; et is then the average energy of the audio over the minute preceding time t.
S920, calculating the modulation ratio rt as the normalized second average energy, where e is the vector of first average energies over seconds 1 to T of the audio information.
S930 adjusts the mouth shape according to the second average energy.
Specifically, the lip shape is adjusted by scaling: the modulated lip parameter is obtained from lt, the lip parameter before adjustment, scaled by the modulation ratio rt.
This embodiment addresses how widely the mouth should open during singing. In the verse of a song, i.e. from the beginning up to the climax, the singer usually sings more quietly than in the chorus, and correspondingly the mouth opens less widely in the verse than in the chorus. By modulating the amplitude of the virtual human's mouth shapes according to the volume, the match between the mouth shapes and the song is improved, so that the virtual human's facial expression follows the song more closely while it sings.
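A minimal sketch of this volume-based modulation is given below; the windowed-energy computation and the normalization by the song-wide maximum are assumptions standing in for the formulas not reproduced above.

    import numpy as np

    def modulate_mouth_amplitude(lip_params, audio, sr, frame_times, window_sec=60.0):
        """lip_params: (T, D) per-frame lip parameters lt; audio: mono samples at rate sr;
        frame_times: time in seconds of each of the T animation frames."""
        energy = audio.astype(np.float64) ** 2
        win = max(1, int(window_sec * sr))
        # Average energy over the preceding window at every sample (running mean via cumulative sums)
        csum = np.concatenate(([0.0], np.cumsum(energy)))
        idx = np.arange(len(energy))
        lo = np.maximum(0, idx - win + 1)
        avg = (csum[idx + 1] - csum[lo]) / (idx - lo + 1)
        ratio = avg / (avg.max() + 1e-12)                  # normalized modulation ratio rt
        sample_idx = np.clip((np.asarray(frame_times) * sr).astype(int), 0, len(ratio) - 1)
        rt = ratio[sample_idx]
        return lip_params * rt[:, None]                    # modulated lip parameters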
Example 8
An embodiment of the present invention, as shown in fig. 8, provides a facial animation generation system, including:
the receiving module 10 is configured to receive the audio information and the text information.
Specifically, the audio information includes singing audio, recitation audio and the like, and mainly refers to the singing audio for which expressions need to be generated while the virtual human sings; the text information includes lyrics, recitation scripts and the like, and mainly refers to the lyrics for which expressions need to be generated while the virtual human sings.
And the mouth shape generating module 20 is connected with the receiving module and is used for generating a plurality of mouth shapes according to the audio information.
Specifically, methods for generating the mouth shapes include traditional face capture, synthesis with machine learning or deep learning algorithms, conversion of audio into phonemes from which mouth shapes are generated, and the like.
As shown in fig. 9, the mouth shape generating module 20 specifically includes:
a phoneme converting unit 21 for converting each part of the audio information into a number of phonemes.
Specifically, in this embodiment pinyin is split into minimum units and a phoneme library is constructed. As shown in Table 1, the phoneme library contains all initials, simple finals and final tails, where n and nn denote the initial n and the final-tail n respectively, and each phoneme corresponds to one mouth shape.
The first mouth shape generating unit 22 is connected to the phoneme converting unit 21 and is used for generating a corresponding first mouth shape according to each phoneme.
Specifically, the first mouth shape is the mouth shape corresponding to a phoneme: each song is divided into parts according to the lyric sentences, each lyric sentence is split into a number of phonemes, and each phoneme corresponds to one mouth shape.
The complete utterance period obtaining unit 23 is configured to obtain a plurality of complete utterance periods of each portion in the audio information, respectively.
Specifically, each completely pronounced character or word in each lyric sentence is taken as one complete pronunciation period.
And a time alignment unit 24 connected to the phoneme conversion unit and the full-pronunciation period acquisition unit 23 for time-aligning each full-pronunciation period with the corresponding phoneme.
Specifically, each completely pronounced character or word is matched in time with its corresponding phonemes: during singing, the time needed to sing a single character is obtained, the phonemes contained in that character are found by matching, and the time over which each phoneme's mouth shape should be expressed is obtained. Alignment methods include traditional Viterbi-decoding-based alignment with an HMM and alignment based on a TDNN neural network.
And a key mouth shape recognition unit 25 connected to the time alignment unit 24 and the first mouth shape generation unit 22, for recognizing the first mouth shape corresponding to the phoneme at the middle position of each complete pronunciation period as the key mouth shape.
Specifically, the key mouth shape (the second mouth shape) is the most salient mouth shape during pronunciation; it usually lies in the middle of the duration of a phoneme's complete pronunciation and is relatively little affected by the surrounding phonemes.
And the rate calculating module 30 is connected with the receiving module and is used for calculating the mouth shape change rate according to the audio information.
Specifically, the mouth shape change rate is the speed at which the mouth shape changes over time; by analogy with speech rate estimation in speaking, it is defined as the average number of lyric characters per second. Fast songs usually change mouth shape quickly and slow songs change slowly. This rate helps to globally set the speed of mouth shape changes and to control how fully each mouth shape is reached.
And the mouth shape animation generating module 40 is connected with the mouth shape generating module and the rate calculating module and is used for generating mouth shape animation according to the mouth shape change rate and a plurality of mouth shapes.
Specifically, the mouth shape change rate globally sets the speed of mouth shape changes and controls how fully each mouth shape is reached.
As shown in fig. 10, the mouth shape animation generation module 40 specifically includes:
and the judging unit 41 is used for judging the size relation between the mouth shape change rate and a preset threshold value.
Specifically, the mouth shape change rate globally sets the speed of mouth shape changes and controls how fully each mouth shape is reached. When the mouth shape change rate is larger than the preset threshold, the switching between mouth shapes is fast and a certain amount of jitter is allowed; when it is smaller than the preset threshold, switching is slower, transitions are smooth, and there is less jitter.
The switching speed is the number of mouth shapes presented per second, and the threshold and the switching speed are set manually. For example, with a preset threshold of 2 and a switching speed of 2, if the mouth shape change rate exceeds 2, mouth shapes are switched at a speed of two per second.
And the sub-mouth shape animation generating unit 42 is connected with the judging unit 41 and is used for generating a plurality of segments of sub-mouth shape animations according to a plurality of first mouth shapes corresponding to all parts in the audio information when the mouth shape change rate is smaller than a preset threshold value. And when the mouth shape change rate is larger than a preset threshold value, generating a plurality of segments of sub-mouth shape animations according to a plurality of key mouth shapes corresponding to all parts in the audio information.
And the mouth shape animation combination unit 43 is connected with the sub-mouth shape animation generation unit 42 and is used for combining a plurality of segments of sub-mouth shape animations in sequence to generate the mouth shape animation.
And the obtaining module 50 is connected with the receiving module and is configured to obtain the global expression of the text information and the weights of the plurality of preset expressions of each part in the text information.
Specifically, expression generation in this embodiment has two parts, a global expression and local expressions. The average emotion of the whole song determines its global expression, while the emotion of each lyric line drives local expression changes. The global expression of the text information is obtained by deriving the overall emotion of the whole song from its lyrics and then converting that global emotion into a global expression.
As shown in fig. 11, the obtaining module 50 specifically includes:
and the emotion classification model establishing unit 51 is used for establishing an emotion classification model in advance.
Specifically, the emotion classification model uses Text CNN, which performs well on short-text semantic classification. The training data comprise 73 Chinese songs, with the overall emotion of each song and the emotion of each lyric line labelled manually. Training is validated with ten-fold cross-validation, reaching a single-sentence recognition accuracy of 88.32%.
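A minimal Text CNN sketch of such a 4-way lyric emotion classifier, written with PyTorch, is shown below; the embedding size, kernel sizes and channel counts are assumptions for illustration, since the patent only names Text CNN and the four emotion classes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LyricTextCNN(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 num_classes: int = 4, kernel_sizes=(2, 3, 4), channels: int = 64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, channels, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer indices of lyric characters
        x = self.embedding(token_ids).transpose(1, 2)           # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)                     # (batch, channels * n_kernels)
        return self.fc(features)                                # unnormalised scores for the 4 emotions

# a softmax over the output gives the per-emotion weights used later for expression fusion
```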
And an expression library establishing unit 52, configured to establish an expression library of expressions corresponding to the respective emotions.
Specifically, four song emotions are defined: cheerful (joyful), sad (depressed), critical, and affectionate. An animator produces one static expression per emotion, denoted q_j, q_d, q_c and q_a respectively, and these four expressions form the expression library.
And the weight calculating unit 53 is connected with the emotion classification model establishing unit 51 and is used for calculating the weights of a plurality of preset emotions in the text information according to the preset emotion classification model and calculating the weights of a plurality of preset emotions in each part of the text information respectively.
Specifically, the weights of the four preset emotions (cheerful, sad, critical, affectionate) are calculated for the lyrics of the whole song according to the emotion classification model, and the same four weights are also calculated for each individual lyric line.
And the expression conversion unit 54 is connected with the weight calculation unit 53 and the expression library establishing unit 52, and is configured to convert the plurality of preset emotions into corresponding preset expressions according to a preset expression library.
And the global expression recognition unit 55 is connected to the expression conversion unit 54 and is configured to recognize a preset expression with the highest weight in the text information as a global expression.
Specifically, each emotion of the whole song is scored, and the expression corresponding to the highest-scoring emotion in the song-level classification is recorded as the global expression q_g.
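A minimal sketch of this winner-takes-all selection; the dictionary-based expression library is only an illustrative stand-in for the animator-made expressions.

```python
# animator-made static expressions for the four song emotions (illustrative stand-ins)
EXPRESSION_LIBRARY = {"joyful": "q_j", "depressed": "q_d", "critical": "q_c", "affectionate": "q_a"}

def global_expression(song_emotion_weights: dict) -> str:
    """Winner-takes-all: return the expression of the highest-weighted song-level emotion."""
    best_emotion = max(song_emotion_weights, key=song_emotion_weights.get)
    return EXPRESSION_LIBRARY[best_emotion]

# e.g. {"joyful": 0.1, "depressed": 0.2, "critical": 0.1, "affectionate": 0.6} -> "q_a"
```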
And the expression animation generating module 60 is connected with the obtaining module 50 and is used for generating expression animation according to the global expression and the weights of a plurality of preset expressions in each part of the text information.
Specifically, the method combines a global base expression with locally varying expressions: the global expression follows a winner-takes-all strategy, the local expressions follow a weighted-average strategy, and combining the two yields facial expressions with a clear overall emotional style yet rich, varied emotional detail.
As shown in fig. 12, the expression animation generation module 60 specifically includes:
and the local expression generating unit 61 is configured to combine the global expression and the preset expressions by using a weighted average method to generate a plurality of local expressions.
Specifically, emotion and expression weight analysis is carried out on every lyric line of the whole song. The scores of the four emotions for the i-th lyric line are

s_i = (s_i^j, s_i^d, s_i^c, s_i^a),

and the fused expression of that lyric line is

q_i = w · q_g + (1 − w) · (s_i^j · q_j + s_i^d · q_d + s_i^c · q_c + s_i^a · q_a),
where w is the proportion of the global expression: the higher w is, the smaller the local variation; the lower w is, the stronger the local emotional swings. Because expressions are blended by weight, various compound and micro expressions can emerge, such as a predominantly affectionate expression tinged with sadness, which increases the variety of expressions.
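A minimal sketch of this weighted-average fusion, treating each expression as a vector of face-model parameters (an assumption; the patent does not fix the parameterisation):

```python
import numpy as np

def fuse_line_expression(global_expr: np.ndarray,
                         emotion_exprs: dict,
                         line_scores: dict,
                         w: float = 0.6) -> np.ndarray:
    """Blend the global expression (weight w) with the score-weighted local expressions (weight 1 - w)."""
    total = sum(line_scores.values()) or 1.0                   # normalise the four emotion scores
    local = sum((score / total) * emotion_exprs[name] for name, score in line_scores.items())
    return w * global_expr + (1.0 - w) * local

# a higher w keeps every line close to the song's global mood; a lower w lets each
# lyric line swing the expression more, and mixed scores produce compound expressions
# such as "mostly affectionate with a hint of sadness"
```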
And the expression animation composition unit 62 is connected with the local expression generating unit 61 and is used for combining a plurality of local expressions in sequence to generate the expression animation.
Specifically, after the global expression derived from the whole song's lyrics is blended with the local expression of each lyric line, the resulting local expressions are composed into an animation following the singing order of the song.
And the facial animation generating module 70 is connected with the expression animation generating module 60 and the mouth shape animation generating module 40 and is used for generating facial animation according to the mouth shape animation and the expression animation.
Specifically, the final facial animation is obtained by fusing the mouth shape animation with the expression animation. The expression animation is first converted from the per-lyric-line resolution to the temporal resolution of the mouth shape animation, giving q_t; the fused animation is then

f_t = l_t + q_t,

where f_t denotes the model parameters of the generated facial animation, l_t the model parameters of the generated mouth shape animation, and q_t the model parameters of the generated expression animation.
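A minimal sketch of this fusion step; the per-frame array layout and the idea of repeating each line's expression over its mouth-animation frames are illustrative assumptions.

```python
import numpy as np

def fuse_face_animation(mouth_frames: np.ndarray,
                        line_expressions: np.ndarray,
                        frames_per_line: np.ndarray) -> np.ndarray:
    """f_t = l_t + q_t after lifting the per-line expressions to mouth-frame resolution.

    mouth_frames: (T, D) per-frame mouth parameters l_t.
    line_expressions: (S, D) one fused expression per lyric line.
    frames_per_line: (S,) frames covered by each lyric line; must sum to T.
    """
    q_t = np.repeat(line_expressions, frames_per_line, axis=0)  # expression resampled per frame
    return mouth_frames + q_t                                    # fused facial parameters f_t
```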
And the smoothing module 80 is connected with the mouth shape animation generation module and is used for carrying out nonlinear smoothing processing on the mouth shape animation.
Specifically, as shown in fig. 13, the smoothing module 80 includes:
and the linear smoothing unit 81 is used for performing linear smoothing on the mouth shape animation by adopting a linear interpolation method.
A smoothing window defining unit 82, connected to the linear smoothing unit 81, defines a Gaussian-type smoothing window g(i) parameterised by φ_i. Here k is the mouth shape change rate, i.e. the speed at which the mouth shape changes over time, defined (by analogy with speech-rate estimation for speaking) as the average number of lyric characters per second; φ_i is the smoothing-window parameter; i is the relative time scale of the audio information, an offset on the song's time axis measured in seconds; and e is the base of the exponential. For example, φ_{i=100} is the Gaussian smoothing-window parameter of the mouth shape animation 100 seconds after the current moment. (The exact window expression appears only as a figure in the original filing.)
The nonlinear smoothing unit 83 is connected to the smoothing window defining unit 82 and performs nonlinear smoothing on the mouth shape animation by applying the Gaussian window g(i) to the mouth shape generation parameter l over a window of width N, where N = 3·f_s/k and f_s is the audio sampling frequency. Specifically, N corresponds to the total time needed, at time t of the audio information, to pronounce the lyric character being sung together with the characters immediately before and after it.
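A minimal sketch of the smoothing described by units 81-83, assuming a standard e^{-(i/σ)²} Gaussian window and a Gaussian-weighted moving average, since the exact expressions are given only as figures in the filing; the frame-rate parameterisation is also an assumption.

```python
import numpy as np

def smooth_mouth_animation(frames: np.ndarray, change_rate: float, frame_rate: float) -> np.ndarray:
    """Gaussian-weighted moving average over mouth parameter frames.

    frames: (T, D) mouth parameters over time; change_rate: k in characters/second;
    frame_rate: animation frames per second (stands in for f_s in N = 3*f_s/k).
    """
    frames = np.asarray(frames, dtype=float)
    n = max(int(3 * frame_rate / max(change_rate, 1e-6)), 3)   # window width N = 3*f_s/k
    half = max(n // 2, 1)
    offsets = np.arange(-half, half + 1)
    sigma = max(half / 2.0, 1.0)
    window = np.exp(-(offsets / sigma) ** 2)                   # assumed Gaussian-type window
    window /= window.sum()
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    out = np.empty_like(frames)
    win_len = 2 * half + 1
    for t in range(frames.shape[0]):
        out[t] = window @ padded[t:t + win_len]                # weighted average around frame t
    return out
```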
And the amplitude adjusting module 90 is connected with the mouth shape generating module and used for adjusting the amplitudes of the mouth shapes according to the energy of the audio information.
Specifically, as shown in fig. 14, the amplitude adjustment module 90 includes:
a first average energy calculating unit 91, configured to calculate the first average energy over the most recent M seconds of the audio information:

e_t = (1/M) · Σ_{i=t−M+1}^{t} E_i,

where e_t is the first average energy, t is the current time of the audio information in seconds, and E_i is the audio energy in the i-th second.

Illustratively, with i = 1, 2, 3, ..., T: a song is typically 4 to 5 minutes long and a verse or chorus lasts about 1 minute, so a 1-minute window is chosen for the average energy, i.e. M = 60, and the average energy at time t is

e_t = (1/60) · Σ_{i=t−59}^{t} E_i.
the second average energy calculating unit 92 is connected to the first average energy calculating unit 91, and is configured to calculate the modulation ratio as the normalized second average energy, where the formula is as follows:
Figure BDA0003196114920000243
wherein r istAnd e is the first average energy vector in 1-T seconds.
And the mouth shape adjusting unit 93 is connected with the second average energy calculating unit 92 and adjusts the mouth shapes according to the second average energy. Specifically, the adjusted lip shape is obtained by scaling the lip shape parameter with the modulation ratio:

l̂_t = r_t · l_t,

where l̂_t is the modulated lip shape parameter and l_t is the lip shape parameter before adjustment.
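A minimal sketch of the amplitude modulation of units 91-93; normalising by the maximum windowed energy and scaling the mouth parameters by r_t are assumptions consistent with, but not dictated by, the text above.

```python
import numpy as np

def modulate_mouth_amplitude(mouth_frames: np.ndarray,
                             per_second_energy: np.ndarray,
                             frame_rate: float,
                             window_seconds: int = 60) -> np.ndarray:
    """Scale mouth parameters by a ratio derived from the local audio energy.

    mouth_frames: (T_frames, D); per_second_energy: (T_seconds,) audio energy per second.
    """
    per_second_energy = np.asarray(per_second_energy, dtype=float)
    t_sec = len(per_second_energy)
    # first average energy e_t: mean energy over the last `window_seconds` seconds
    windowed = np.array([
        per_second_energy[max(0, t - window_seconds + 1): t + 1].mean()
        for t in range(t_sec)
    ])
    # second average energy / modulation ratio r_t: normalised into [0, 1]
    ratio = windowed / max(windowed.max(), 1e-9)
    out = np.asarray(mouth_frames, dtype=float).copy()
    for i in range(out.shape[0]):
        sec = min(int(i / frame_rate), t_sec - 1)              # second this animation frame falls in
        out[i] *= ratio[sec]                                    # quieter passages get smaller mouth amplitude
    return out
```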
This embodiment provides a system for synthesizing the facial animation of a singing virtual human, filling the gap of a complete algorithm for this task. In the system, separate modules synthesize and then fuse the two most important facial elements, the mouth shape and the expression, producing rich, realistic and varied facial animation. The embodiment focuses on the application scenario of a singing virtual human and offers a complete solution. Because the tempo of sung material varies widely, the influence of different singing speeds on the facial animation is taken into account: facial changes are generated according to the varying speed during singing, and the influence of each lyric line on the facial expression is combined to produce the virtual human's facial animation while it sings. Meanwhile, nonlinear smoothing is applied to the mouth shape changes and the mouth shape amplitude is modulated according to the volume, which improves how well the mouth shape matches the song and makes the virtual human's facial expression richer and more natural while singing.
Example 9
An embodiment of the present invention provides a storage medium having stored therein at least one instruction for implementing the operations performed by the facial animation generation methods described in embodiments 1 to 7.
These may be implemented as program code executable by a computing device, so that they are executed by the computing device, or implemented separately, or fabricated as individual integrated circuit modules, or with several of their modules or steps fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or recited in detail in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed facial animation generation method, system and storage medium may be implemented in other ways. For example, the above-described embodiments of facial animation generation method, system and storage medium are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units or integrated circuits, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
It should be noted that the above embodiments can be freely combined as needed. The foregoing is only a preferred embodiment of the present invention; for those skilled in the art, various modifications and refinements can be made without departing from the principle of the present invention, and these modifications and refinements should also fall within the protection scope of the present invention.

Claims (10)

1. A method for generating a facial animation, comprising the steps of:
receiving audio information and text information;
generating a plurality of mouth shapes according to the audio information;
calculating a rate of change of the mouth shape of the audio information;
generating a mouth shape animation according to the mouth shape change rate and the mouth shapes;
acquiring global expressions of the text information and weights of a plurality of preset expressions of all parts in the text information;
generating expression animation according to the global expression and the weights of a plurality of preset expressions in each part of the text information;
and generating facial animation according to the mouth shape animation and the expression animation.
2. A method for generating a facial animation as claimed in claim 1, wherein the generating a plurality of mouth shapes based on the audio information specifically comprises:
converting all parts of the audio information into a plurality of phonemes;
generating a corresponding first mouth shape according to each phoneme;
respectively acquiring a plurality of complete pronunciation time periods of each part in the audio information;
time-aligning each of the complete pronunciation periods with the corresponding phoneme;
and identifying the first mouth shape corresponding to the phoneme in the middle position of each complete pronunciation period as a key mouth shape.
3. A method as claimed in claim 2, wherein said generating a mouth shape animation according to said mouth shape change rate and a plurality of mouth shapes comprises:
generating a plurality of segments of sub-mouth shape animations according to the mouth shape change rate and a plurality of mouth shapes corresponding to all parts in the audio information;
combining a plurality of sections of the sub-mouth animations in sequence to generate the mouth-shape animation;
the generating a plurality of segments of sub-mouth shape animations according to the mouth shape change rates and the plurality of mouth shapes corresponding to the parts in the audio information specifically comprises:
when the mouth shape change rate is smaller than a preset threshold value, generating the sub-mouth shape animation according to a plurality of first mouth shapes corresponding to all parts in the audio information;
and when the mouth shape change rate is larger than a preset threshold value, generating the sub-mouth shape animation according to a plurality of key mouth shapes corresponding to all parts in the audio information.
4. The method for generating facial animation according to claim 1, wherein the obtaining of the global expression of the text information specifically includes:
calculating the weights of a plurality of preset emotions in the text information according to a preset emotion classification model, and calculating the weights of a plurality of preset emotions in each part of the text information respectively;
converting a plurality of preset emotions into corresponding preset expressions according to a preset expression library;
and identifying the preset expression with the highest weight in the text information as the global expression.
5. The method for generating facial animation according to claim 4, wherein the generating of the facial animation according to the global expression and the weights of the plurality of preset expressions in each part of the text information specifically comprises:
combining the global expression and the preset expressions by adopting a weighted average method to generate a plurality of local expressions;
and combining a plurality of local expressions in sequence to generate the expression animation.
6. The method for generating facial animation according to any one of claims 1 to 5, wherein after the generating of the mouth shape animation according to the rate of change of the mouth shape and the mouth shape, and before the obtaining of the weights of the global expression of the text message and the preset expressions of the parts in the text message, the method further comprises:
performing linear smoothing on the mouth shape animation by adopting a linear interpolation method;
defining a Gaussian-type smoothing window g(i), wherein k is the mouth shape change rate, φ_i is the parameter of the smoothing window, i is the relative time scale of the audio information, and e is a constant;
and performing nonlinear smoothing on the mouth shape animation by applying the window g(i) to the mouth shape generation parameter l over a window of width N, wherein N is the smoothing window width, N = 3·f_s/k, and f_s is the audio sampling frequency.
7. A method for generating a facial animation as claimed in any one of claims 1 to 5, wherein after generating a plurality of mouth shapes based on the audio information and before generating a mouth shape animation based on the rate of change of the mouth shapes and the plurality of mouth shapes, the method further comprises:
calculating a first average energy over M seconds of the audio information, e_t = (1/M)·Σ_{i=t−M+1}^{t} E_i, wherein e_t is the first average energy, t is the current time of the audio information in seconds, and E_i is the audio energy in the i-th second;
calculating the modulation ratio as the normalised second average energy, r_t = e_t / max(e), wherein r_t is the second average energy and e is the vector of first average energies over 1 to T seconds;
and adjusting the mouth shape according to the second average energy.
8. A system for generating a facial animation, comprising:
the receiving module is used for receiving the audio information and the text information;
the mouth shape generating module is connected with the receiving module and used for generating a plurality of mouth shapes according to the audio information;
the rate calculation module is connected with the receiving module and used for calculating the mouth shape change rate according to the audio information;
the mouth shape animation generating module is connected with the mouth shape generating module and the rate calculating module and is used for generating mouth shape animations according to the mouth shape change rate and the mouth shapes;
the acquiring module is connected with the receiving module and used for acquiring the global expression of the text information and the weights of a plurality of preset expressions of each part in the text information;
the expression animation generation module is connected with the acquisition module and used for generating expression animation according to the global expression and the weights of a plurality of preset expressions in each part of the text information;
and the facial animation generation module is connected with the mouth shape animation generation module and the expression animation generation module and is used for generating facial animation according to the mouth shape animation and the expression animation.
9. A facial animation generation system as claimed in claim 8, further comprising:
the smoothing module is connected with the mouth shape animation generating module and is used for carrying out nonlinear smoothing processing on the mouth shape animation;
the amplitude adjusting module is connected with the mouth shape generating module and used for adjusting the amplitudes of the mouth shapes according to the energy of the audio information;
the smoothing module comprises a linear smoothing unit, a smoothing window defining unit and a non-linear smoothing unit,
the linear smoothing unit is used for performing linear smoothing on the mouth shape animation by adopting a linear interpolation method;
the smoothing window defining unit is used for defining a Gaussian-type smoothing window g(i), wherein k is the mouth shape change rate, φ is the parameter of the smoothing window, i is the relative time scale of the audio information, and e is a constant;
the nonlinear smoothing unit is used for performing nonlinear smoothing on the mouth shape animation by applying the window g(i) to the mouth shape generation parameter l over a window of width N, wherein N is the smoothing window width, N = 3·f_s/k, and f_s is the audio sampling frequency;
the amplitude adjusting module comprises a first average energy calculating unit, a second average energy calculating unit and a mouth shape adjusting unit,
the first average energy calculating unit calculates a first average energy over M seconds of the audio information, e_t = (1/M)·Σ_{i=t−M+1}^{t} E_i, wherein e_t is the first average energy, t is the current time of the audio information in seconds, and E_i is the audio energy in the i-th second;
the second average energy calculating unit is used for calculating the modulation ratio as the normalised second average energy, r_t = e_t / max(e), wherein r_t is the second average energy and e is the vector of first average energies over 1 to T seconds;
and the mouth shape adjusting unit is used for adjusting the mouth shape according to the second average energy.
10. A storage medium, wherein at least one instruction is stored in the storage medium, and the instruction is used for realizing the operation executed by the facial animation generation method according to any one of claims 1 to 7.
CN202110891111.6A 2021-08-04 2021-08-04 Method, system and storage medium for generating facial animation Pending CN113609255A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110891111.6A CN113609255A (en) 2021-08-04 2021-08-04 Method, system and storage medium for generating facial animation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110891111.6A CN113609255A (en) 2021-08-04 2021-08-04 Method, system and storage medium for generating facial animation

Publications (1)

Publication Number Publication Date
CN113609255A true CN113609255A (en) 2021-11-05

Family

ID=78306778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110891111.6A Pending CN113609255A (en) 2021-08-04 2021-08-04 Method, system and storage medium for generating facial animation

Country Status (1)

Country Link
CN (1) CN113609255A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114928755A (en) * 2022-05-10 2022-08-19 咪咕文化科技有限公司 Video production method, electronic equipment and computer readable storage medium
CN114928755B (en) * 2022-05-10 2023-10-20 咪咕文化科技有限公司 Video production method, electronic equipment and computer readable storage medium
CN115311731A (en) * 2022-10-10 2022-11-08 之江实验室 Expression generation method and device for sign language digital person
CN115311731B (en) * 2022-10-10 2023-01-31 之江实验室 Expression generation method and device for sign language digital person
CN116561350A (en) * 2023-07-07 2023-08-08 腾讯科技(深圳)有限公司 Resource generation method and related device
CN116561350B (en) * 2023-07-07 2024-01-09 腾讯科技(深圳)有限公司 Resource generation method and related device
CN116580721A (en) * 2023-07-13 2023-08-11 中国电信股份有限公司 Expression animation generation method and device and digital human platform
CN116580721B (en) * 2023-07-13 2023-09-22 中国电信股份有限公司 Expression animation generation method and device and digital human platform

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20211130

Address after: 200000 room 206, No. 2, Lane 389, Jinkang Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: Shanghai Yuanmeng Intelligent Technology Co.,Ltd.

Applicant after: Yuanmeng humanistic Intelligence International Co., Ltd

Address before: Room 2807, 28th floor, Bank of America Center, 12 Harcourt Road, Central

Applicant before: Yuanmeng Human Intelligence International Co.,Ltd.

REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40053968

Country of ref document: HK