CN117011434A - Animation generation method, device, equipment, medium and product based on virtual character - Google Patents

Animation generation method, device, equipment, medium and product based on virtual character

Info

Publication number
CN117011434A
Authority
CN
China
Prior art keywords
expression
animation
virtual character
text content
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310956034.7A
Other languages
Chinese (zh)
Inventor
陈欢
陈长海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310956034.7A priority Critical patent/CN117011434A/en
Publication of CN117011434A publication Critical patent/CN117011434A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00: Animation
    • G06T 13/20: 3D [Three Dimensional] animation
    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses an animation generation method, device, equipment, medium and product based on a virtual character, and relates to the field of artificial intelligence. The method includes: acquiring expression tag data, where the expression tag data includes text content and expression tags corresponding to the text content; performing voice conversion on the text content to obtain audio data; performing mouth shape conversion on the audio data to generate a mouth shape animation of the virtual character; controlling facial muscle data of the virtual character based on the audio data and the expression tags corresponding to the text content to generate an expression animation of the virtual character; and fusing the mouth shape animation and the expression animation to obtain the facial animation of the virtual character. A facial animation containing a specified expression can be generated, so that the expression of the virtual character when expressing the text content is more natural and smooth, and the presentation effect of the animation is improved.

Description

Animation generation method, device, equipment, medium and product based on virtual character
Technical Field
The embodiments of the application relate to the field of artificial intelligence, and in particular to an animation generation method, device, equipment, medium and product based on a virtual character.
Background
Face animation of a virtual character can be generated by means of the Text-to-Face (T2F) technology: text content is input into a face animation generation model, and the corresponding face animation is output.
In the related art, corresponding audio is generated based on the text content; after the complete audio is obtained, a mouth shape animation of the virtual character is generated based on the audio; and finally the audio and the animation are integrated to obtain the complete face animation.
However, the facial animation generated in this way generally includes only the mouth shape animation of the virtual character: the expression of the virtual character shows no emotional fluctuation when expressing the text content, and the user cannot specify the expression of the virtual character. As a result, the expression of the virtual character when expressing the text content does not match the text content, and the presentation effect of the facial animation is poor.
Disclosure of Invention
The embodiments of the application provide an animation generation method, device, equipment, medium and product based on a virtual character, which can generate a facial animation containing a specified expression, so that the expression of the virtual character when expressing text content is more natural and smooth, and the presentation effect of the animation is improved. The technical solution is as follows:
In one aspect, there is provided a virtual character-based animation generation method, the method comprising:
acquiring expression tag data, wherein the expression tag data comprises text content and expression tags corresponding to the text content, and the expression tags are used for indicating the text content expressed by the virtual character and the facial expression when the text content is expressed;
performing voice conversion on the text content to obtain audio data;
performing mouth shape conversion on the audio data to generate mouth shape animation of the virtual character;
based on the audio data and the expression labels corresponding to the text content, controlling the facial muscle data of the virtual character to generate expression animation of the virtual character;
and fusing the mouth shape animation and the expression animation to obtain the facial animation of the virtual character.
In another aspect, there is provided an animation generation apparatus based on a virtual character, the apparatus comprising:
an acquisition module, configured to acquire expression label data, wherein the expression label data comprises text content and expression labels corresponding to the text content, and the expression labels are used for indicating the text content expressed by the virtual character and the facial expression when the text content is expressed;
The conversion module is used for performing voice conversion on the text content to obtain audio data;
the conversion module is also used for performing mouth shape conversion on the audio data to generate mouth shape animation of the virtual character;
the generation module is used for controlling the facial muscle data of the virtual character based on the audio data and the expression labels corresponding to the text content, and generating the expression animation of the virtual character;
and the fusion module is used for fusing the mouth shape animation and the expression animation to obtain the facial animation of the virtual character.
In another aspect, a computer device is provided, the computer device including a processor and a memory having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement a virtual character-based animation generation method as described in any of the embodiments of the application.
In another aspect, a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions loaded and executed by a processor to implement a virtual character based animation generation method as described in any of the above embodiments of the present application is provided.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the avatar-based animation generation method as described in any of the above embodiments.
The technical solution provided by the embodiments of the application has at least the following beneficial effects:
when the animation of the virtual character is generated through the expression label data, corresponding audio data can be generated based on the text content in the expression label data, and the mouth shape animation of the virtual character when expressing the text content is further obtained based on the audio data; the facial expression of the virtual character can be set based on the expression labels in the expression label data, so that the expression animation of the virtual character is obtained; and the expression animation and the mouth shape animation are fused to obtain the facial animation of the virtual character, so that the virtual character has rich expressions when expressing the text content, the matching degree between the text content and the facial animation is improved, and the presentation effect of the animation is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of generating a facial animation of a virtual character based on expression label data, provided in accordance with an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 3 is a flowchart of a virtual character-based animation generation method provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of reading audio data based on a window sliding step provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of the overall flow of a solution provided by an exemplary embodiment of the present application;
FIG. 6 is a flowchart of a method for generating an animation based on a buffer mechanism provided by an exemplary embodiment of the present application;
FIG. 7 is a flowchart of a method for determining an expressive animation of a virtual character based on expressive strength according to an illustrative embodiment of the application;
FIG. 8 is a block diagram of a virtual character-based animation generation device according to an exemplary embodiment of the present application;
FIG. 9 is a block diagram of a virtual character-based animation generation device according to another exemplary embodiment of the present application;
fig. 10 is a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be noted that, the information (including but not limited to information of the virtual character, text content in the expression tag data, etc.), the data (including but not limited to expression tag data, audio data, etc.) related to the present application are all information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
It should be understood that, although the terms first, second, etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first parameter may also be referred to as a second parameter, and similarly, a second parameter may also be referred to as a first parameter, without departing from the scope of the application. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to a determination", depending on the context.
First, a brief description will be made of terms involved in the embodiments of the present application:
artificial intelligence (Artificial Intelligence, AI): the system is a theory, a method, a technology and an application system which simulate, extend and extend human intelligence by using a digital computer or a machine controlled by the digital computer, sense environment, acquire knowledge and acquire an optimal result by using the knowledge. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level technologies and software-level technologies. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, and other directions. With the research and advancement of artificial intelligence technology, it has been researched and applied in various fields, such as smart home, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, digital twins, virtual humans, robots, artificial intelligence generated content (Artificial Intelligence Generated Content, AIGC), conversational interaction, smart medical treatment, smart customer service, and game AI. It is believed that with the development of technology, artificial intelligence technology will find application in more fields and play an increasingly important role.
Machine Learning (ML): is a multi-domain interdisciplinary, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
Phoneme (phone): the minimum speech unit divided according to the natural attributes of speech. It is analyzed according to the pronunciation actions in a syllable, and one action constitutes one phoneme. Phonemes are divided into two major classes: vowels and consonants.
T2F technology (Text-to-Face, text-driven Face animation generation): by inputting descriptive text, corresponding face animation is generated based on the text content.
Action Unit (AU): a basic model obtained from the basic movement of a single muscle or a group of muscles; different facial expressions can be obtained through different combinations of AUs.
TTS technology (Text-to-Speech): is a technology capable of converting text content into voice output. In the process of generating the facial animation, text contents expressed by the virtual characters can be converted into audio based on a TTS technology to dub the facial animation.
In the process of producing a three-dimensional virtual character, driving the generation of the facial animation of the virtual character is an important link. The generation of the face animation of a virtual character is generally divided into two aspects: (1) generating the audio played when the virtual character expresses the text content, and (2) generating the mouth shape animation when the virtual character expresses the text content.
In the related art, first, complete audio data is generated based on text contents, and then, a mouth shape animation of a virtual character is generated based on the audio data. And aligning the mouth shape animation with the audio data to obtain the complete face animation of the virtual character.
In the above manner, the facial expression of the virtual character is mostly fixed, i.e., the facial expression is flat and stiff, without emotional fluctuation, so the user's requirements on the expression of the virtual character cannot be met. In some cases, the text content expressed by the virtual character carries strong emotional changes or tones; at this time the facial expression of the virtual character remains unchanged, resulting in a low matching degree between the facial expression and the audio, and a poor animation presentation effect.
Moreover, the above method can drive the generation of the face animation only after the complete audio has been generated. When there is more text content, audio generation takes more time and the wait before the face animation can be driven is longer, which causes a larger time delay. For streaming application scenarios, the receiving, computing and generating of all data have high real-time requirements, so this method cannot generate the face animation in real time while the audio is being generated, and the generation efficiency of the face animation is low.
The application provides an animation generation method based on a virtual character. As shown in fig. 1, the facial animation 150 of the virtual character is generated from expression label data 110, and the expression label data 110 comprises text content 111 and expression labels 112 corresponding to the text content 111. Corresponding audio data 120 can be generated based on the text content 111 in the expression label data 110, and the mouth shape animation 130 of the virtual character when expressing the text content 111 is further obtained based on the audio data 120; the facial expression of the virtual character can be set based on the expression labels 112 in the expression label data 110, so that the expression animation 140 of the virtual character is obtained; and the facial animation 150 of the virtual character is obtained by fusing the expression animation 140 and the mouth shape animation 130, so that the virtual character has rich expressions when expressing the text content 111, the matching degree between the text content 111 and the facial animation 150 is improved, and the presentation effect of the animation is improved.
Meanwhile, in order to solve the time delay caused by generating the complete audio before generating the animation, in the method provided by the application, after all the expression label data are input into the relevant algorithm model, audio is generated from the text content in the expression label data per preset unit duration; the corresponding mouth shape animation and expression animation are immediately generated based on the audio data obtained within that unit duration and are fused; and the complete facial animation of the virtual character is obtained after all the mouth shape animations and expression animations corresponding to the text content have been generated. This segmented data processing manner can ensure the efficiency of the animation generation process and meet the real-time requirement in streaming application scenarios.
Next, the implementation environment involved in the embodiments of the present application is described. Schematically, as shown in fig. 2, the environment involves a terminal 210 and a server 220, and the terminal 210 and the server 220 are connected through a communication network 230.
In some embodiments, the terminal 210 is configured to send the expression tag data to the server 220. The expression tag data comprises text content and expression tags corresponding to the text content, wherein the expression tags are used for indicating the text content expressed by the virtual character and the facial expression when the text content is expressed. Notably, the virtual character may be a three-dimensional virtual character, such as a three-dimensional virtual game character, a three-dimensional virtual animated character, and the like.
After the server 220 obtains the expression tag data, voice conversion is performed on the text content in the expression tag data to obtain audio data, and mouth shape conversion is performed on the audio data to generate the mouth shape animation of the virtual character; the facial muscle data of the virtual character is controlled based on the audio data and the expression tags corresponding to the text content to generate the expression animation of the virtual character; and the mouth shape animation and the expression animation are fused to obtain the facial animation of the virtual character. In some embodiments, the server 220 inputs the expression tag data into a facial animation generation model, and the facial animation generation model performs audio generation, mouth shape animation generation and expression animation generation on the expression tag data, fuses the results, and outputs the resulting facial animation.
In some embodiments, the above-described process may also be performed by the terminal 210 alone.
The terminal may be a mobile phone, a tablet computer, a desktop computer, a portable notebook computer, an intelligent television, a vehicle-mounted terminal, an intelligent home device, or other terminal devices, which is not limited in the embodiment of the present application.
It should be noted that the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and basic cloud computing services such as big data and an artificial intelligence platform.
Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network to realize the calculation, storage, processing and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied based on the cloud computing business model; it can form a resource pool, which is flexible and convenient to use on demand. Cloud computing technology will become an important support. The background services of technical network systems require a large amount of computing and storage resources, such as video websites, picture websites and other portal websites. With the rapid development and application of the internet industry, each article may have its own identification mark in the future, which needs to be transmitted to the background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong system backing support, which can only be realized through cloud computing.
In some embodiments, the servers described above may also be implemented as nodes in a blockchain system.
In the embodiments of the present application, the method being executed by the terminal in the implementation environment shown in fig. 2 is taken as an example for description. Fig. 3 is a flowchart of the virtual character-based animation generation method provided by an exemplary embodiment of the present application. The method comprises the following steps.
In step 310, expression tag data is obtained.
The expression label data comprises text content and expression labels corresponding to the text content, wherein the expression labels are used for indicating the text content expressed by the virtual character and the facial expression when the text content is expressed.
Illustratively, one complete expression label data is as follows:
"< email sad=" 0.5"> text content 1, text content 2, </email > < email feature=" 0.6"> text content 3-! The < emission > < emission happy= "0.5" > text content 4, text content 5. "Emotion"
Wherein < project sad= "0.5" > text content 1, text content 2, </project > means: the expression of the virtual character when expressing the text content 1 and the text content 2 is sad, and < emotion sad= "0.5" > and </emotion > are expression labels in the expression label data. < emotion sad= "0.5" > is the "start" mark in the expression label, 0.5 is used to express the degree of emotion expressed by the "sad" expression (the range is 0-1, the bigger the numerical value is the more intense the emotion is), and < emotion > is the "end" mark in the expression label.
Wherein, < emission feature= "0.6" > text content 3-! The expression < element >: the expression of the virtual character when expressing the text content 3 is a feature (fear), < emotion feature= "0.6" > and </emotion > are expression tags in the expression tag data. < emotion feature= "0.6" > is the "start" mark in the expression label, 0.6 is used to express the degree of emotion expressed by the "feature" (the range is 0-1, the larger the value is, the more intense the emotion is), and < emotion > is the "end" mark in the expression label.
Wherein < emission happy= "0.5" > text content 4, text content 5. The expression < element >: the expression of the virtual character when expressing the text content 4 and the text content 5 is happy, and < transmission happy= "0.5" > and </transmission > are expression tags in the expression tag data. < emission happy= "0.5" > is the "start" mark in the expression label, 0.5 is used to express the degree of emotion expressed by the "happy" expression (range is 0-1, the bigger the numerical value is the more intense the emotion is), and < emission > is the "end" mark in the expression label.
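To make the tag structure concrete, the following is a minimal parsing sketch. It assumes the <emotion ...> tag format shown above; the parser, the TaggedSegment record and all function names are illustrative and not part of the patent.

```python
import re
from dataclasses import dataclass

# Hypothetical record for one tagged segment of the expression label data.
@dataclass
class TaggedSegment:
    emotion: str      # e.g. "sad", "fear", "happy"
    intensity: float  # 0-1; the larger the value, the more intense the emotion
    text: str         # text content the virtual character expresses

TAG_PATTERN = re.compile(r'<emotion\s+(\w+)="([\d.]+)">(.*?)</emotion>', re.S)

def parse_expression_label_data(label_data: str) -> list[TaggedSegment]:
    """Split expression label data into (emotion, intensity, text) segments."""
    return [TaggedSegment(m.group(1), float(m.group(2)), m.group(3).strip())
            for m in TAG_PATTERN.finditer(label_data)]

sample = ('<emotion sad="0.5">text content 1, text content 2,</emotion>'
          '<emotion fear="0.6">text content 3!</emotion>'
          '<emotion happy="0.5">text content 4, text content 5.</emotion>')
for segment in parse_expression_label_data(sample):
    print(segment)
```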
And step 320, performing voice conversion on the text content to obtain audio data.
In some embodiments, when performing voice conversion on the text content, the expression label data can be input, the text content is extracted from the expression label data, and voice conversion is performed on the text content to obtain the audio data; alternatively, the text content may be extracted from the expression label data in advance, or acquired in advance, and only the text content is input and then subjected to voice conversion to obtain the audio data.
In some embodiments, in the process of performing voice conversion on the text content, voice conversion is first performed on the text content to obtain candidate phoneme data corresponding to the text content, wherein the voice conversion process refers to the process of converting the text content into a phoneme representation, and the candidate phoneme data refers to the phoneme data corresponding to each text character in the text content; and start-stop marks are set in the candidate phoneme data based on the positions of the expression labels in the expression label data, so as to obtain audio data carrying the start-stop marks.
For example, if the text content includes 10000 phonemes and the occupation time of each phoneme is set to be 10 ms, the time length of the audio data converted from the text content is 100000 ms, that is, 100 seconds.
Illustratively, the received information of one phoneme in the audio data includes:
(1) "phn": "u3", the identification of the phoneme;
(2) "t1": "32900000", the start time of the phoneme;
(3) "t2": "33900000", the end time of the phoneme;
(4) "pitch": "4.72904", the pitch of the phoneme.
And 330, performing mouth shape conversion on the audio data to generate mouth shape animation of the virtual character.
In some embodiments, the mouth shape animation of the virtual character is generated after all the text content has been converted into audio data, i.e., after the complete audio data is obtained, mouth shape conversion is performed on the complete audio data to generate the complete mouth shape animation. When there is more text content, the time required to generate the audio data is longer. Therefore, in order to improve the efficiency of generating the mouth shape animation and reduce the time delay, the mouth shape animation of the virtual character can be generated while the audio data is being generated: the audio data is read based on an audio reading window and a window sliding step, wherein the audio reading window refers to the unit length for reading the audio data, and the window sliding step refers to the sliding length of the window between two adjacent audio reads; and mouth shape conversion is performed on the read audio data to generate the mouth shape animation of the virtual character.
By way of example, a buffer mechanism for processing audio data is set, and every time audio data of a unit duration is generated, the audio data in the unit duration is immediately received and subjected to mouth shape conversion, so that mouth shape animation corresponding to the audio data of the unit duration is generated. And repeatedly executing the step until all the audio data are generated and converted, and fusing the mouth shape animation of each unit time length to be used as a complete mouth shape animation.
Optionally, the unit duration is 0.1 seconds; in a streaming environment, each streaming sub-packet for receiving audio data may receive 0.1 seconds of audio data at a time. Each time 0.1 seconds of audio data is received, it is stored in the audio buffer. Every 2048 audio data form 1 audio frame; the audio data in the audio buffer is read by an audio reading window with a size of 2048 audio data (1 audio frame), and the window sliding step of the audio reading window is 512 audio data, so that a complete audio frame can be read each time the audio reading window slides. For example, the audio reading window first reads the 1st to 2048th audio data, i.e., reads a complete audio frame; after one slide it reads the 513th to 2560th audio data, i.e., reads another complete audio frame; and so on. After the audio data corresponding to each 0.1 seconds is read, mouth shape conversion is immediately performed on the complete audio frames within that 0.1 seconds of audio data. If the 0.1 seconds of audio data is less than 1 audio frame, conversion is temporarily not performed; when the next 0.1 seconds of audio data is received and the received audio data is enough for 1 audio frame, mouth shape conversion is performed on the complete audio frames in the received audio data to obtain the corresponding mouth shape animation.
After receiving the 1st 0.1 seconds of audio data, the method can continue to receive the 2nd 0.1 seconds of audio data while performing mouth shape conversion and generating the mouth shape animation. Therefore, compared with waiting for all the audio data to be generated and received before generating the mouth shape animation, the method provided by this embodiment can effectively reduce the time delay.
Since the audio reading window reads the audio data in the audio buffer, and the unit of the audio data in the actual use process is usually one audio frame, in order to ensure that the audio data read each time the audio reading window slides will not be missed, the size of the audio buffer is set to the least common multiple of the window sliding step size (length is 512 audio data) of the audio reading window and the length (2400 audio data) of the audio data received in the unit time length, for example: the audio buffer may buffer audio data in a maximum of 16 streaming sub-packets at a time. Accordingly, by sliding the audio reading window, the number of times the audio buffer is full can be determined and the number of times can be recorded, and when there is a need to acquire the characteristics of the audio data (i.e., when the mouth shape animation is generated based on the characteristics of the audio data and the expression animation is generated), the repeated calculation of the audio data can be avoided.
Schematically, as shown in fig. 4, fig. 4 is a schematic diagram of reading audio data based on a window sliding step.
The window sliding step of the audio read window 410 is 512 audio data, the audio data frame 420 is 2048 audio data, and the audio read window 410 is 2048 audio data (1 audio data frame 420) in size, so that one complete audio data frame 420 can be read each time the audio read window 410 slides.
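The read-by-window behaviour described above can be sketched as follows. The 2048-sample frame, 512-sample sliding step and 2400-sample (0.1 s) sub-packet come from the text; the class and method names are illustrative.

```python
import numpy as np

FRAME_SIZE = 2048   # audio read window: one full audio frame
HOP_SIZE = 512      # window sliding step between two adjacent reads
PACKET_SIZE = 2400  # samples per 0.1 s streaming sub-packet (from the text)

class AudioFrameReader:
    """Accumulates streaming packets and yields complete frames."""
    def __init__(self):
        self._buffer = np.zeros(0, dtype=np.float32)
        self._read_pos = 0  # start of the next read window

    def push_packet(self, packet: np.ndarray):
        """Append one 0.1 s streaming sub-packet to the buffer."""
        self._buffer = np.concatenate([self._buffer, packet.astype(np.float32)])

    def read_frames(self):
        """Yield every complete 2048-sample frame available so far."""
        while self._read_pos + FRAME_SIZE <= len(self._buffer):
            yield self._buffer[self._read_pos:self._read_pos + FRAME_SIZE]
            self._read_pos += HOP_SIZE

# Usage: feed packets as they arrive; incomplete frames wait for the next packet.
reader = AudioFrameReader()
reader.push_packet(np.random.randn(PACKET_SIZE))   # first 0.1 s of audio
print(sum(1 for _ in reader.read_frames()))        # 1 complete frame so far
```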
In some embodiments, the mouth shape animation may be generated based only on the text content in the expression label data, and the buffer mechanism for audio data processing may be used only to generate the mouth shape animation; alternatively, the mouth shape animation and the expression animation may both be generated based on the text content and the expression labels in the expression label data, and the buffer mechanism for audio data processing may be used to generate both the mouth shape animation and the expression animation, that is, each time audio data of a unit duration is received, the audio data within that unit duration is converted separately to obtain the mouth shape animation and the expression animation corresponding to that audio data, which can effectively reduce the time delay.
And 340, controlling facial muscle data of the virtual character based on the expression labels corresponding to the audio data and the text content, and generating the expression animation of the virtual character.
Determining the start-stop time of the expression label corresponding to the text content based on the start-stop identification carried by the audio data, and obtaining the aligned expression label data; and controlling facial muscle data of the virtual character based on the aligned expression label data, and generating expression animation of the virtual character.
Wherein generating the expression animation of the virtual character comprises the following aspects: (1) expression label time alignment: aligning the start-stop time of each expression label in the expression label data with the audio data; (2) expression label stream extraction: extracting, from the expression label data, the expression label corresponding to the text content currently being expressed; and (3) expression animation streaming transition optimization: that is, when a plurality of expression labels exist in the expression label data, the virtual character has emotional changes in the process of expressing the text content, and in order to make the emotional changes natural, expression animation streaming transition optimization needs to be performed.
After the above aspects are implemented based on the audio data and the expression labels corresponding to the text content, the expression corresponding to each piece of text content in the expression label data and its respective start time can be determined, so that the audio data and the expression animation are aligned. At this time, the facial muscle data of the virtual character is controlled to generate the expression animation of the virtual character.
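A minimal sketch of the time-alignment aspect: each expression label is mapped to a start and end time taken from the start-stop marks carried by the phoneme/audio data. The data structures and the time values below are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TagMarker:
    """Start/stop marker set in the phoneme stream where an expression label begins/ends."""
    emotion: str
    intensity: float
    start_phoneme: int  # index of the first phoneme covered by the label
    end_phoneme: int    # index of the last phoneme covered by the label

def align_tags_to_audio(markers, phoneme_times):
    """Map each label to (start_time, end_time) using per-phoneme (t1, t2) times."""
    aligned = []
    for m in markers:
        start = phoneme_times[m.start_phoneme][0]  # t1 of the first phoneme
        end = phoneme_times[m.end_phoneme][1]      # t2 of the last phoneme
        aligned.append((m.emotion, m.intensity, start, end))
    return aligned

# Two labels over a six-phoneme utterance (times in arbitrary units).
times = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
markers = [TagMarker("sad", 0.5, 0, 2), TagMarker("happy", 0.5, 3, 5)]
print(align_tags_to_audio(markers, times))
```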
Optionally, facial muscle data is controlled based on the expressive motion unit AU, such as: when the expression label is happy and the expression intensity is 0.5, the action unit AU is as follows:
AU1: "inner_brown_raiser, describing the extent to which the Inner eyebrow moves upward";
AU2: "outer_Brow_Raiser, describing the extent to which the Outer eyebrow moves upward";
AU6: "cheek_raiser, describing the extent to which the cheekbones move upward";
AU12: "lip_Corner_Puller, describing the degree to which the mouth angle moves upward";
AU25: "lips_part, describes the degree of lip separation";
AU26: "Jaw_drop, describing the extent to which the chin moves downward";
By configuring the parameters of these action units, the expression of the virtual character can be controlled to achieve the animation effect of the expression label happy.
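A sketch of driving such an AU configuration from an expression label and its intensity. The base AU weights below are illustrative assumptions (the text notes that the muscle regions and parameter values can be arbitrary); only the scaling-by-intensity idea is taken from the description.

```python
# Illustrative base AU weights for the "happy" expression at full intensity.
HAPPY_BASE_AUS = {
    "AU1": 0.3,   # Inner_Brow_Raiser
    "AU2": 0.3,   # Outer_Brow_Raiser
    "AU6": 0.8,   # Cheek_Raiser
    "AU12": 1.0,  # Lip_Corner_Puller
    "AU25": 0.4,  # Lips_Part
    "AU26": 0.2,  # Jaw_Drop
}

def aus_for_expression(base_aus: dict[str, float], intensity: float) -> dict[str, float]:
    """Scale base AU weights by the expression intensity (0-1)."""
    intensity = max(0.0, min(1.0, intensity))
    return {au: round(weight * intensity, 3) for au, weight in base_aus.items()}

print(aus_for_expression(HAPPY_BASE_AUS, 0.5))  # AU values for <emotion happy="0.5">
```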
It should be noted that the above manner of controlling the facial muscle data of the virtual character is only used as an example, and other manners may be used to control the facial muscle data of the virtual character, and the muscle area corresponding to each action unit may be arbitrary, and the value may be arbitrary when the parameter value of the action unit is set based on the expression label, which is not limited in this embodiment.
It should be noted that step 330 and step 340 are parallel steps: step 330 may be performed first and then step 340, step 340 may be performed first and then step 330, or step 330 and step 340 may be performed synchronously, which is not limited in this embodiment.
And 350, fusing the mouth shape animation and the expression animation to obtain the facial animation of the virtual character.
Schematically, as shown in fig. 5, fig. 5 is a schematic diagram of the overall flow of the scheme of the present embodiment.
The expression tag data is input into a facial animation generation model with the T2F function (or the expression tag data is processed using a facial animation generation algorithm). First, voice conversion is performed on the text content in the expression tag data to obtain audio data, the audio data is stored into an audio buffer 510, and a mouth shape animation 520 and an expression animation 530 are respectively generated based on the audio data. When generating the expression animation 530, the steps of expression label time alignment, expression label stream extraction, expression animation streaming transition optimization and the like need to be performed based on the audio data. Finally, the mouth shape animation 520 and the expression animation 530 are subjected to curve fusion, and the facial animation of the virtual character is output. When the facial animation is played, the audio data corresponds to the mouth shape of the virtual character expressing the text content, the facial expression of the virtual character when expressing the text content corresponds to the expression labels, and the emotional change of the virtual character is smooth when switching between different expressions.
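A sketch of the curve fusion step: per-AU mouth shape curves and expression curves are combined frame by frame into one facial animation curve. The patent does not specify the fusion rule, so a simple additive blend clamped to the AU parameter range is assumed here.

```python
def fuse_curves(mouth_curve: dict[str, list[float]],
                expression_curve: dict[str, list[float]],
                num_frames: int) -> dict[str, list[float]]:
    """Blend per-AU mouth and expression curves into one facial animation curve."""
    fused = {}
    for au in set(mouth_curve) | set(expression_curve):
        m = mouth_curve.get(au, [0.0] * num_frames)
        e = expression_curve.get(au, [0.0] * num_frames)
        # Additive blend, clamped to the valid AU parameter range [0, 1].
        fused[au] = [min(1.0, m[i] + e[i]) for i in range(num_frames)]
    return fused

mouth = {"AU25": [0.6, 0.2, 0.0], "AU26": [0.5, 0.3, 0.1]}
expr = {"AU12": [0.5, 0.5, 0.5], "AU25": [0.2, 0.2, 0.2]}
print(fuse_curves(mouth, expr, 3))
```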
Optionally, the face animation generation model is a model trained and generated based on a plurality of action units AU. Common categories of action units AU include, but are not limited to: AU1: describing the degree to which the inner eyebrow moves upward; AU2: describing the degree to which the outer eyebrow moves upward; AU4: describing the extent to which the eyebrows move downwardly; AU5: describing the extent to which the upper eyelid moves upward; AU6: describing the degree of upward movement of the cheekbones; AU7: describing the extent to which the eyelid is contracted; AU9: describing the extent of nose collapse; AU10: describing the extent to which the upper lip moves upward; AU12: describing the degree of upward movement of the mouth angle; AU14: describing the extent of depression of the cheek bones; AU15: describing the degree to which the mouth angle moves downward; AU17: describing the extent to which the chin moves upward; AU20: describing the lips being pulled laterally and flattened; AU23: describing the degree of lip compression; AU25: describing the degree of lip separation; AU26: describing the extent to which the chin moves downward; AU45: the number of blinks is described.
Illustratively, reference may be made to the facial animation 150 of FIG. 1, where AU4 is used to control the muscles of the eyebrow area of the virtual character. By setting the parameter values in each action unit AU, the facial muscles of the virtual character can be controlled to make the corresponding expression. For example, the parameter in AU1 is the degree of upward movement of the inner eyebrow, and the parameter range is (0-1): when the inner eyebrow is raised to the highest position, the corresponding parameter is 1; when the inner eyebrow is raised halfway, the corresponding parameter is 0.5.
The expression intensity in each expression label corresponds to a different weight combination of the parameters of the above action units AU. When generating the expression animation, the parameters of the key frames in the animation are specified, and then the parameters of all the frames in the animation can be obtained through a smooth transition operation.
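A sketch of the keyframe-plus-smooth-transition idea: AU parameters are specified at key frames and the remaining frames are filled by interpolation. The smoothstep easing used below is an assumption; the patent only states that a smooth transition operation is applied.

```python
def smoothstep(t: float) -> float:
    """Ease-in/ease-out curve used to smooth transitions between keyframes."""
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)

def interpolate_au_curve(keyframes: list[tuple[int, float]], num_frames: int) -> list[float]:
    """Fill every frame of one AU curve from sparse (frame_index, value) keyframes."""
    values = [0.0] * num_frames
    for (f0, v0), (f1, v1) in zip(keyframes, keyframes[1:]):
        for f in range(f0, min(f1, num_frames - 1) + 1):
            t = smoothstep((f - f0) / max(1, f1 - f0))
            values[f] = v0 + (v1 - v0) * t
    return values

# AU12 rises from neutral to a "happy 0.5" pose and back over 60 frames.
curve = interpolate_au_curve([(0, 0.0), (20, 0.5), (40, 0.5), (59, 0.0)], 60)
print(round(curve[20], 2), round(curve[30], 2))  # 0.5 at the peak, 0.5 while held
```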
It should be noted that the above-mentioned manner of processing the expression label data by the model/algorithm is only used as an example, and the manner of obtaining the face animation generation model may be arbitrary, which is not limited in this embodiment.
In summary, according to the method provided by the application, when the animation of the virtual character is generated through the expression label data, corresponding audio data can be generated based on the text content in the expression label data, and the mouth shape animation of the virtual character when expressing the text content is further obtained based on the audio data; the facial expression of the virtual character can be set based on the expression labels in the expression label data, so that the expression animation of the virtual character is obtained; and the expression animation and the mouth shape animation are fused to obtain the facial animation of the virtual character, so that the virtual character has rich expressions when expressing the text content, the matching degree between the text content and the facial animation is improved, and the presentation effect of the animation is improved.
According to the method provided by the embodiment, the candidate phoneme data are obtained by performing voice conversion on the text content, the mouth shape animation can be directly generated based on the candidate phoneme data, and the mouth shape animation is used as the final facial animation; and the start and stop marks can be set in the candidate phoneme data based on the position of the expression label in the expression label data, so that the audio data carrying the start and stop marks are obtained, the expression animation and the audio data can be aligned when the expression animation is generated later, various situations possibly existing in the process of generating the facial animation are considered, and the presentation effect of the facial animation is improved.
According to the method provided by this embodiment, the audio data is read based on the audio reading window and the window sliding step, wherein the audio reading window refers to the unit length for reading the audio data, and the window sliding step refers to the sliding length of the window between two adjacent audio reads; and mouth shape conversion is performed on the read audio data to generate the mouth shape animation of the virtual character, so that the mouth shape animation can be generated during the audio data generation process. Compared with the time delay caused in the related art by the long audio generation time when the complete audio data is generated before the mouth shape animation, this can meet the real-time requirement on the generation of audio data and mouth shape animation in a streaming environment, and improves the generation efficiency of the face animation of the virtual character.
According to the method provided by the embodiment, the start-stop time of the expression label corresponding to the text content is determined based on the start-stop identification carried by the phoneme data, and the aligned expression label data is obtained; and controlling the facial muscle data of the virtual character based on the aligned expression label data to generate the expression animation of the virtual character, so that the synchronism between the expression animation of the virtual character and the audio data can be ensured, and the facial display effect of the virtual character in the text content expression process can be improved.
In some embodiments, the user designates the expression of the virtual character, the expression label data includes text content and a corresponding expression label, and finally the generated facial animation includes mouth-shaped animation and expression animation, so that a buffer mechanism for processing audio data can be set to generate the mouth-shaped animation and the expression animation, that is, after receiving the audio data in unit time length, the audio data in the unit time length are respectively converted to obtain the mouth-shaped animation and the expression animation corresponding to the audio data in the unit time length, so that the time delay can be effectively reduced. As shown in FIG. 6, FIG. 6 is a flow chart of a method for generating an animation based on a buffer mechanism according to the present application, comprising the following steps.
In step 610, expression tag data is obtained.
The expression label data comprises text content and expression labels corresponding to the text content, wherein the expression labels are used for indicating facial expressions of the virtual characters when expressing the text content.
Illustratively, one complete expression label data is as follows:
"< email sad=" 0.5"> text content 1, text content 2, </email > < email feature=" 0.6"> text content 3-! The < emission > < emission happy= "0.5" > text content 4, text content 5. "Emotion"
And 620, performing voice conversion on the text content in the ith unit duration to obtain the ith section of audio data.
Wherein i is a positive integer.
Optionally, the unit duration is 0.1 seconds; that is, each time 0.1 seconds of audio data is generated, the contents of steps 620 to 650 are performed. The i-th piece of audio data is 0.1 seconds long regardless of the value of i.
And inputting the complete expression label data into a facial animation generation model (or using a facial animation generation algorithm), wherein the facial animation generation model firstly carries out voice conversion based on text content in the expression label data to obtain audio data.
A streaming sub-packet for receiving audio data in a streaming environment may receive audio data for 0.1 seconds at a time, a buffering mechanism for audio data processing is set, i.e. every time audio data for 0.1 seconds is received, the audio data for 0.1 seconds is stored in an audio buffer. Wherein 2400 pieces of audio data are included in every 0.1 seconds of audio data.
In the field of audio data processing, audio frames are the basic unit. The length of a common audio frame is 2048 audio data, that is, 2048 audio data form 1 frame of audio frame, and when the audio data in the audio buffer is processed, in order to ensure the integrity of the audio data, the audio data in the audio buffer is read based on an audio reading window.
The size of the audio reading window is 2048 audio data, the length of 512 audio data can be moved for each sliding of the audio reading window, and 2048 audio data, namely 1 audio frame data, can be read for each sliding of the audio reading window. In order to ensure the integrity of the received audio data in unit time length and ensure that 2048 audio data (1 frame of audio frame) can be completely read by each sliding of the audio reading window, the size of the audio buffer is the least common multiple of the window sliding step size (512 audio data) and the streaming sub-packet size (2400 audio data) of the audio reading window. The audio buffer is sized to buffer a maximum of 16 streaming sub-packets of data at a time, i.e. a maximum of 16 complete 0.1 seconds of audio data at a time. And the last sliding of the audio reading window in the audio buffer can just read all the audio data in the audio buffer.
If the size of the audio buffer is not an integer multiple of the window sliding step size of the audio read window, e.g., the size of the audio buffer is 20.5 window sliding steps, the audio buffer can be considered to be a container comprising 20.5 cells, then the first 16 slides of the audio read window in the audio buffer are complete, 2048 audio data (1 frame of audio frames) can be read at a time. At the 17 th sliding, the audio reading window can only cover (3+0.5) ×512=1792 audio data, less than 1 frame of audio frames (2048 audio data).
Therefore, the size of the audio buffer is set so that it can buffer at most the data of 16 streaming sub-packets at a time. During the processing of the audio data, the amount of audio data that has been read can be determined by recording the number of times the buffer is filled, so that repeated calculation of the audio data is avoided when the features of the audio data are acquired during data processing.
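The buffer sizing can be checked directly: the least common multiple of the 512-sample sliding step and the 2400-sample sub-packet is 38400 samples, which is exactly 16 streaming sub-packets.

```python
from math import lcm

HOP_SIZE = 512      # window sliding step of the audio read window
PACKET_SIZE = 2400  # audio samples per 0.1 s streaming sub-packet

buffer_size = lcm(HOP_SIZE, PACKET_SIZE)
print(buffer_size)                 # 38400 samples
print(buffer_size // PACKET_SIZE)  # 16 streaming sub-packets per full buffer
print(buffer_size // HOP_SIZE)     # 75 window slides span the full buffer
```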
And 630, performing mouth shape conversion on the ith section of audio data in the ith unit time length to generate the ith section of mouth shape animation of the virtual character.
After each 0.1 second of audio data is generated, the 0.1 second of audio data is subjected to a mouth-shape conversion. And in the ith unit duration, performing mouth shape conversion on the ith section of audio data to generate the ith section of mouth shape animation of the virtual character.
In some embodiments, the audio data generation process and the mouth shape animation generation process are synchronized. For example, in the (i+1)-th unit duration (i being a positive integer), the i-th piece of audio data has already been obtained, so the step of obtaining the (i+1)-th piece of audio data may be performed simultaneously with the step of generating the i-th section of mouth shape animation of the virtual character based on the i-th piece of audio data. Compared with generating the mouth shape animation based on the complete audio data after the complete audio data has been generated, directly generating the mouth shape animation from the audio data received within each unit duration and repeatedly executing step 630 can ensure the synchronization and real-time performance of the animation and audio generation processes, and can meet the real-time requirement on data processing in a streaming environment.
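One simple way to express this overlap is a producer-consumer arrangement in which audio segments are handed to the mouth shape converter as soon as they are produced. The thread-and-queue structure and the placeholder conversion functions below are illustrative assumptions, not the patent's implementation.

```python
import queue
import threading

UNIT_DURATION_S = 0.1  # each audio segment covers one unit duration

def audio_producer(text_segments, audio_queue):
    """Generate audio segment by segment and hand each one off immediately."""
    for i, text in enumerate(text_segments):
        audio = f"audio[{i}] for '{text}'"   # placeholder for real TTS output
        audio_queue.put((i, audio))
    audio_queue.put(None)                     # signal that all segments were produced

def mouth_animation_consumer(audio_queue, animations):
    """Convert each audio segment to a mouth shape animation as soon as it arrives."""
    while (item := audio_queue.get()) is not None:
        i, audio = item
        animations.append(f"mouth animation[{i}] from {audio}")  # placeholder conversion

segments = ["text content 1", "text content 2", "text content 3"]
q, result = queue.Queue(), []
consumer = threading.Thread(target=mouth_animation_consumer, args=(q, result))
consumer.start()
audio_producer(segments, q)   # producing segment i+1 overlaps with converting segment i
consumer.join()
print(result)
```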
It should be noted that, since the user designates the expression of the virtual character in this embodiment, the expression label data includes both text content and expression labels, and the finally generated facial animation includes a mouth shape animation and an expression animation, and in some embodiments, in order to ensure synchronicity in the mouth shape animation and the expression animation generating process, the steps 630 and 640 may be two steps in parallel. That is, step 630 may be performed first, followed by step 640; step 640 may be performed before step 630 is performed; step 630 and step 640 may also be performed simultaneously, which is not limited in this embodiment.
And 640, controlling the facial muscle data of the virtual character based on the ith section of audio data and the expression label corresponding to the text content in the ith unit time length, and generating the ith section of expression animation of the virtual character.
When the complete audio data and expression label data are processed to obtain the animation, the expression animation of the whole virtual character can be determined directly based on the expression labels in the expression label data. However, although all the expression label data is input to the model/algorithm for processing at one time, the audio data and expression label data received by a streaming sub-packet within a unit duration are incomplete.
Optionally, the unit duration is 0.1 seconds; that is, the expression label data is divided according to the size of the streaming sub-packets to obtain divided partial expression label data, which are received by the streaming sub-packets in order, and each streaming sub-packet receives 0.1 seconds of audio data. The text content is divided into at least one text clause by the expression labels corresponding to the text content. Since an expression label in the expression label data only exists at the head of each text clause, the emotional state and expression of the virtual character when expressing the next text clause can only be determined after receiving the streaming sub-packet containing the expression label at the head of that clause. Each streaming sub-packet transmits the phonemes of a text clause (i.e., the received text content) starting at the beginning of that text clause, and after the audio transmission of that text clause is completed, the phonemes of the next text clause are transmitted starting at the beginning of the next text clause.
Thus, generating the expression animation of the virtual character includes the following: (1) expression label time alignment: aligning the start and stop times of each expression label in the expression label data with the audio data; (2) expression label stream extraction: extracting, from the expression label data, the expression label corresponding to the text content currently being expressed; and (3) expression animation streaming transition optimization: when a plurality of expression labels exist in the expression label data, the virtual character undergoes emotion changes while expressing the text content, and in order to make these emotion changes natural, streaming transition optimization of the expression animation is required.
The expression label stream extraction method comprises the following specific processes:
1. Receiving the text content corresponding to the ith section of audio data in the aligned expression label data, wherein the received text content corresponding to the ith section of audio data is placed in a text window, and the text window is used for tracking the reception status of the text content in the expression label data;
Since the size of the streaming sub-packets is the same regardless of which data is being received, when the ith section of audio data is generated based on the text content in the expression label data, the text content corresponding to the ith section of audio data in the aligned expression label data can be determined.
2. Traversing the text clauses of the text content corresponding to the ith section of audio data from the text window based on a text sliding window, and extracting the expression labels corresponding to the text clauses, wherein the text sliding window is the unit length by which text clauses are traversed from the text window;
Specifically, a target text clause is searched for by traversing the text window based on the text sliding window to obtain a search result, and the expression label corresponding to the target text clause is extracted based on the search result.
For example, the expression label data is: "<emotion sad="0.5">text content 1, text content 2,</emotion><emotion fear="0.6">text content 3!</emotion><emotion happy="0.5">text content 4, text content 5.</emotion>"
In order of appearance, the first text clause is "text content 1", the second text clause is "text content 2", and so on. Each time a text clause is traversed from the text window based on the text sliding window, the traversal proceeds in the order "text content 1", "text content 2", …, "text content 5".
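As a minimal sketch (not the patent's implementation), and assuming the <emotion ...> markup shown in the example above, the division of the tagged text into clauses with their sentence-head labels could be done as follows; the punctuation set used for clause splitting is an assumption.

```python
import re

# Assumed markup: <emotion NAME="INTENSITY">clause, clause, ...</emotion>
TAG_RE = re.compile(r'<emotion\s+(\w+)="([\d.]+)">(.*?)</emotion>', re.S)

def split_into_clauses(expression_label_data):
    """Split tagged text into (clause, emotion, intensity) triples.

    Only the first clause of a tagged segment carries the label explicitly;
    the remaining clauses inherit it, mirroring the sentence-head convention.
    """
    clauses = []
    for emotion, intensity, segment in TAG_RE.findall(expression_label_data):
        # Clauses are assumed to be separated by common punctuation marks.
        for clause in re.split(r'[,.!?，。！？]', segment):
            clause = clause.strip()
            if clause:
                clauses.append((clause, emotion, float(intensity)))
    return clauses

data = ('<emotion sad="0.5">text content 1, text content 2,</emotion>'
        '<emotion fear="0.6">text content 3!</emotion>'
        '<emotion happy="0.5">text content 4, text content 5.</emotion>')
print(split_into_clauses(data))
# [('text content 1', 'sad', 0.5), ('text content 2', 'sad', 0.5),
#  ('text content 3', 'fear', 0.6), ('text content 4', 'happy', 0.5),
#  ('text content 5', 'happy', 0.5)]
```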
Optionally, the target text clause may be any text clause in the expression label data, and the search result falls into one of the following cases:
(1) In response to the search result indicating that the target text clause exists in the text window, extracting the expression label corresponding to the target text clause, and deleting the target text clause from the text window;
For example, take "text content 1" as the target text clause, with "text content 1", "text content 2" and "text content 3" present in the text window. The expression label <emotion sad="0.5"> exists at the head of "text content 1", so this expression label is extracted and stored. "Text content 1" is then deleted from the text window, leaving "text content 2" and "text content 3" in the text window.
Continuing with "text content 2" as the target text clause: the head of "text content 2" carries no expression label, so no expression label is extracted for it. However, the expression label sad has already been extracted, so sad is taken as the expression label corresponding to "text content 2", and "text content 2" is deleted from the text window, leaving only "text content 3" in the text window.
Continuing with "text content 3" as the target text clause: the head of "text content 3" carries the expression label <emotion fear="0.6">, so the expression label fear corresponding to "text content 3" is extracted and stored. "Text content 3" is then deleted from the text window, at which point the text window is empty.
(2) In response to the search result indicating that the target text clause does not exist in the text window, determining that the expression label corresponding to the target text clause has not been extracted;
For example, if the target text clause "text content 3" is searched for but does not exist in the text window, the search is stopped.
(3) In response to the search result indicating that the text window is empty, determining that the expression labels corresponding to all text clauses in the text window have been extracted.
For example, when a search is performed with "text content 1" as the target text clause and the target text clause is found not to exist in the text window, this alone does not mean extraction is complete: because storing data into the text window and traversing the text window based on the text sliding window proceed simultaneously, the text content in the text window keeps growing until all the text content in the expression label data has been written into the text window.
Therefore, only when the text window is empty can it be concluded that the expression labels corresponding to all the text clauses in the text window have been extracted.
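The three cases above can be summarised in a small sketch. This is an illustrative reading of the text-window mechanism, not the patent's actual code; the carry-forward of the most recently extracted label to unlabelled clauses follows the "text content 2" example, and the class and method names are hypothetical.

```python
class TextWindow:
    """Minimal sketch of the text window and text sliding window extraction."""

    def __init__(self):
        self.window = []          # list of (clause, sentence_head_label_or_None)
        self.last_label = None    # most recently extracted sentence-head label

    def push(self, clause, label=None):
        """Streaming side: append a newly received clause to the window."""
        self.window.append((clause, label))

    def extract(self, target_clause):
        """Traversal side: search the target clause and return its expression label."""
        if not self.window:
            return "window-empty"                 # case (3): everything extracted
        for idx, (clause, label) in enumerate(self.window):
            if clause == target_clause:
                if label is not None:             # sentence-head label present
                    self.last_label = label
                del self.window[idx]              # case (1): extract and delete
                return self.last_label
        return "not-found"                        # case (2): stop searching

w = TextWindow()
w.push("text content 1", ("sad", 0.5))
w.push("text content 2")
w.push("text content 3", ("fear", 0.6))
print(w.extract("text content 1"))   # ('sad', 0.5)
print(w.extract("text content 2"))   # ('sad', 0.5) -- inherits the previous label
print(w.extract("text content 3"))   # ('fear', 0.6)
print(w.extract("text content 4"))   # 'window-empty'
```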
Based on the above steps, after the expression label corresponding to a text clause is obtained, the facial muscle data of the virtual character is controlled based on the text clause and its corresponding expression label, and the ith section of expression animation of the virtual character is generated.
Optionally, the facial muscle data is controlled through expression action units (AUs). For example, when the expression label corresponding to the ith section of expression animation is happy and the expression intensity is 0.5, the eyes and mouth of the virtual character are controlled accordingly.
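For example, a hypothetical label-to-AU mapping might look like the following; the specific AU choices and weights are illustrative assumptions and not the mapping used by the application.

```python
# Hypothetical mapping from expression labels to facial action units (AUs).
EMOTION_TO_AUS = {
    "happy": {"AU6_cheek_raiser": 1.0, "AU12_lip_corner_puller": 1.0},
    "sad":   {"AU1_inner_brow_raiser": 1.0, "AU15_lip_corner_depressor": 1.0},
    "fear":  {"AU1_inner_brow_raiser": 1.0, "AU5_upper_lid_raiser": 1.0,
              "AU20_lip_stretcher": 1.0},
}

def facial_muscle_data(emotion, intensity):
    """Scale the base AU weights of an emotion by the expression intensity."""
    return {au: weight * intensity
            for au, weight in EMOTION_TO_AUS.get(emotion, {}).items()}

print(facial_muscle_data("happy", 0.5))
# {'AU6_cheek_raiser': 0.5, 'AU12_lip_corner_puller': 0.5}
```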
And step 650, fusing the ith section of mouth shape animation and the ith section of expression animation in the ith unit duration to obtain the ith section of facial animation of the virtual character.
Because the ith section of mouth shape animation and the ith section of expression animation are generated almost simultaneously, the mouth shape animation and the expression animation are fused immediately each time they have been generated from the 0.1 second of audio data, which ensures the real-time performance of facial animation generation and improves the efficiency of generating the facial animation.
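A minimal per-frame fusion sketch is shown below, assuming both animations are represented as per-frame dictionaries of AU/blendshape weights; the max-merge rule for overlapping entries is an assumption, not the fusion method specified by the application.

```python
def fuse_frames(mouth_frame, expression_frame):
    """Merge one mouth-shape frame and one expression frame.

    Assumption: mouth-shape AUs and expression AUs mostly control different
    facial regions; where they overlap, the larger weight is kept so neither
    the viseme nor the emotion is lost.
    """
    fused = dict(expression_frame)
    for au, weight in mouth_frame.items():
        fused[au] = max(weight, fused.get(au, 0.0))
    return fused

def fuse_segment(mouth_anim, expression_anim):
    """Fuse the i-th mouth shape segment and the i-th expression segment frame by frame."""
    return [fuse_frames(m, e) for m, e in zip(mouth_anim, expression_anim)]
```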
In summary, according to the method provided by the application, when the animation of the virtual character is generated through the expression label data, corresponding audio data can be generated based on the text content in the expression label data, and the mouth shape animation of the virtual character when expressing the text content can then be obtained based on the audio data; the facial expression of the virtual character can be set based on the expression labels in the expression label data, so as to obtain the expression animation of the virtual character; the expression animation and the mouth shape animation are then fused to obtain the facial animation of the virtual character, so that the virtual character has rich expressions when expressing the text content, the matching degree between the text content and the facial animation is improved, and the presentation effect of the animation is improved.
According to the method provided by this embodiment, a buffering mechanism for processing the audio data is provided: the audio data of the text content is generated within each unit duration to obtain the ith section of audio data, and the expression animation and the mouth shape animation are generated based on the ith section of audio data. This reduces the delay introduced in the audio data processing process, meets the real-time requirements for generating audio data and mouth shape animation in a streaming environment, and improves the efficiency of generating the facial animation of the virtual character.
According to the method provided by this embodiment, the text content is divided into text clauses and the expression label corresponding to each text clause is extracted, so that the expression of the virtual character can be determined when the expression animation is generated based on the ith section of audio data. This solves the problem that, when the phonemes of the text content in the expression label data are incomplete within a unit duration, the expression of the virtual character cannot be determined directly from the expression labels in the expression label data.
According to the method provided by this embodiment, the target text clause is searched for by traversing the text window based on the text sliding window, the extraction status of the expression labels can be determined according to whether the target text clause exists, and the efficiency of expression animation generation is improved.
In some embodiments, a plurality of expression labels in the expression label data correspond to the text content, and the virtual character expresses the text content based on the emotional states indicated by these expression labels, with gradual transitions between them. The expression intensity corresponding to each expression label can be specified by setting an expression intensity in the expression label data. The facial muscle data of the virtual character is then controlled based on the audio data, the expression labels corresponding to the text content and the expression intensities, to generate the expression animation of the virtual character. When expressing the text content, the virtual character presents different animation effects depending on the expression labels and their corresponding expression intensities. The expression intensity changes over time: a single expression has an expression intensity peak; while the intensity rises towards this peak, the expression intensity of the virtual character is in an increasing state, and while the intensity returns from the peak to the default value (the expression fades), the expression intensity of the virtual character is in a decreasing state. The expression labels also include a start tag and a termination tag, so the change of the expression corresponding to a single expression label over its start and end can be determined based on the expression intensity, the start tag and the termination tag. As shown in FIG. 7, FIG. 7 is a flowchart of a method for determining the expression animation of a virtual character based on expression intensity, which includes the following steps.
And step 710, controlling the facial muscle data of the virtual character based on the audio data and the start tag and expression intensity corresponding to the text content, to obtain a first expression animation.
The first expression animation is an animation showing the expression intensity of the virtual character rising from its initial intensity to the expression intensity set in the expression label data. For example, for the expression label <emotion sad="0.5">, 0.5 is the set expression intensity: as the virtual character expresses the text content, the expression gradually reaches 0.5 and is then maintained at 0.5, and the animation corresponding to this process is the first expression animation.
Optionally, one complete expression label data is as follows:
"< email sad=" 0.5"> text content 1, text content 2, </email > < email feature=" 0.6"> text content 3-! The < emission > < emission happy= "0.5" > text content 4, text content 5. "Emotion"
Here, <emotion sad="0.5"> is the start tag of the expression sad, and the following </emotion> is its termination tag; <emotion fear="0.6"> is the start tag of the expression fear, and the following </emotion> is its termination tag; <emotion happy="0.5"> is the start tag of the expression happy, and the following </emotion> is its termination tag.
Taking the generation of the expression animation for the expression label sad as an example, the expression intensity is 0.5. When the virtual character starts to express text content 1, the expression intensity corresponding to sad gradually increases, reaching and holding the intensity 0.5 within 1 to 1.5 seconds (this duration can be customized); from then on the virtual character continues to express text content 1 at the sad intensity of 0.5. The facial muscle data of the virtual character is controlled to obtain the first expression animation, i.e., the expression animation when the virtual character expresses text content 1.
And step 720, controlling the facial muscle data of the virtual character based on the audio data and the termination tag and expression intensity corresponding to the text content, to obtain a second expression animation.
The second expression animation is an animation showing the expression intensity of the virtual character in a downward trend from the expression intensity set in the expression label data.
Still taking the generation of the expression animation for the expression label sad as an example, the expression intensity set in the expression label data is 0.5. Since the sad intensity of the virtual character is maintained at 0.5 at this point, the virtual character continues to express text content 2 at an intensity of 0.5, and within the 1 to 1.5 seconds (customizable) before the expression of text content 2 ends, the sad intensity gradually falls from 0.5. The facial muscle data of the virtual character is controlled to obtain the second expression animation, i.e., the expression animation when the virtual character expresses text content 2.
And 730, splicing the first expression animation and the second expression animation according to a preset sequence to obtain the expression animation of the virtual character.
Optionally, the preset sequence is the order of the text content: the second expression animation is spliced after the first expression animation to obtain the expression animation of the virtual character corresponding to the expression label sad. Although the expression animation is spliced from these two parts, it covers the animation in which the expression of the virtual character gradually rises to the expression intensity peak, the animation in which the expression is held at the peak, and the animation in which the expression gradually falls from the peak. In general, the duration for which the expression intensity is held at the peak is longer than the rise/fall duration.
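The rise/hold/fall behaviour of steps 710 to 730 can be illustrated with a simple per-frame intensity envelope; the linear ramps, frame rate and the single ramp_seconds value standing in for the customizable 1-1.5 s duration are assumptions made for this sketch.

```python
def intensity_envelope(peak, total_frames, fps=30, ramp_seconds=1.0):
    """Per-frame expression intensity: ramp up to the peak, hold, ramp down."""
    ramp = max(1, int(ramp_seconds * fps))
    ramp = min(ramp, total_frames // 2)          # never longer than half the segment
    curve = []
    for frame in range(total_frames):
        if frame < ramp:                         # first expression animation: rising
            value = peak * (frame + 1) / ramp
        elif frame >= total_frames - ramp:       # second expression animation: falling
            value = peak * (total_frames - frame) / ramp
        else:                                    # held at the expression intensity peak
            value = peak
        curve.append(value)
    return curve

# Example: a 3-second sad segment at peak intensity 0.5, 30 frames per second.
print(intensity_envelope(peak=0.5, total_frames=90)[:5])   # rising frames
print(intensity_envelope(peak=0.5, total_frames=90)[-5:])  # falling frames
```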
In addition to the expression change within a single expression label, in some embodiments there is also an expression change between different expression labels. When the expression label data contains a plurality of expression labels and the text content separating different expression labels is short, an abrupt change in the virtual character's expression is likely to occur. Therefore, when the facial muscle data of the virtual character is controlled based on the audio data and the expression labels corresponding to the text content to generate the expression animation of the virtual character, expression animation streaming transition optimization can additionally be performed: when a plurality of expression labels exist in the expression label data and the virtual character's emotion changes during the expression of the text content, processing is performed so that the emotion changes naturally.
Typically, the virtual character has a default expression; when some text content in the expression label data is not marked by any expression label, that text content can be expressed using the default expression/emotional state when generating the facial animation of the virtual character.
Illustratively, one complete expression label data is:
"< project sad=" 0.5"> i can only send you here, the rest of the way gets away by itself,
< emission couple= "0.6" > return-! The < emission > < emission happy= "0.5" > i really is dream and does not think, your two have this change of turning down. "Emotion"
Here, a short fear expression is sandwiched between the sad expression and the happy expression, i.e., an abrupt expression change occurs. In the related art, such an abrupt expression change is not handled with the gradual transition that animation experience would call for.
Optionally, the expression label data further includes an expression extension signal, which can prevent an unfinished expression from fading because it has not been completely received. In response to receiving the expression extension signal, the expression animation of the virtual character is smoothly transitioned based on the expression extension signal to obtain a post-transition expression animation, in which the facial expression of the virtual character changes smoothly.
Empirically, after the "start" mark of an expression label (e.g. <emotion sad="0.5">) is received, the face of the virtual character is slowly transformed from a neutral face (the default expression) to the peak of the expression AU; when the "stop" mark of the expression label (</emotion>) is reached, the face has slowly recovered from the peak of the expression AU back to the neutral face.
In the above example, at the current receiving timestamp only "I can only send you here" has arrived, while the complete sad expression segment covers "I can only send you here, the rest of the way you have to walk yourself,"; that is, the current expression segment has not yet been fully transmitted. Considering the gradual rise and fall of the expression, in order to prevent the expression from fading prematurely, the "termination" mark of the expression needs to be placed at a future time. In other words, once it is detected that only part of an expression segment has been received, an expression extension signal is issued.
Optionally, if the expression extension signal exists, the end time of the last phoneme of the text content indicated by the expression extension signal is extended by 2 seconds, and this extended time is taken as the end time of the expression label, so that at the end of the receiving time the current expression animation still maintains a full expression state and does not recover to the default expression prematurely.
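A minimal sketch of how this 2-second extension could be applied, assuming phoneme end times are available; the function and parameter names are hypothetical.

```python
EXTENSION_SECONDS = 2.0  # the 2-second extension described above

def effective_label_end_time(label_end_time, last_phoneme_end_time, extension_received):
    """Return the end time to use for the current expression label.

    If an expression extension signal has been received (only part of the
    expression segment has arrived), push the end time 2 seconds past the last
    received phoneme so the expression does not fade prematurely.
    """
    if extension_received:
        return last_phoneme_end_time + EXTENSION_SECONDS
    return label_end_time

# Example: the label nominally ends at 3.2 s, but only phonemes up to 1.4 s
# have been received, so the expression is held until 3.4 s.
print(effective_label_end_time(3.2, 1.4, extension_received=True))  # 3.4
```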
In a non-streaming T2F (text-driven AI facial animation generation) scenario, an abrupt expression change does not require the previous expression to decay quickly to 0 or the next expression to reach its peak quickly. Instead, based on experience, the rise/fall duration of an expression is set to the range [1, 1.5] seconds, the actual rise/fall duration is taken linearly within this range according to the maximum expression intensity, and the expression AUs then transition linearly from the peak of the previous expression to the peak of the next expression at a reasonable speed.
However, in a streaming T2F scenario, all data is received and processed in real time, and this scheme cannot be used because the information of the next expression label has not yet been transmitted. Instead, the transition can be implemented based on the following two variables.
(1) Variable 1: the actual rise/fall duration of the expression, limited by the duration of the expression segment;
(2) Variable 2: the empirically expected rise/fall duration of the expression, which depends only on the expression AU intensity. After the AU peak of a short expression segment is scaled down based on the ratio of variable 1 to variable 2, the difference of the same AU between adjacent expressions is reduced, so that a smooth transition effect is achieved even between adjacent expression segments of short duration.
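An illustrative reading of the two-variable scheme follows, under the assumptions that variable 1 is bounded by half the segment duration and that variable 2 is taken linearly in [1, 1.5] seconds according to the AU intensity; these bounds are assumptions for the sketch, not values fixed by the application.

```python
def transition_scaled_peak(au_peak, segment_duration, au_intensity,
                           ramp_range=(1.0, 1.5)):
    """Scale down the AU peak of a short expression segment (streaming T2F case).

    variable 1: the rise/fall duration actually available, limited by the
                segment duration (assumed to be at most half of it here);
    variable 2: the empirically expected rise/fall duration, taken linearly
                in ramp_range according to the AU intensity.
    Scaling the peak by variable1 / variable2 shrinks the difference of the
    same AU between adjacent expressions, giving a smoother transition.
    """
    lo, hi = ramp_range
    expected_ramp = lo + (hi - lo) * min(max(au_intensity, 0.0), 1.0)  # variable 2
    actual_ramp = min(expected_ramp, segment_duration / 2.0)           # variable 1
    return au_peak * (actual_ramp / expected_ramp)

# A 0.8 s segment cannot fit a full 1.25 s rise, so the peak is depressed:
print(transition_scaled_peak(au_peak=0.6, segment_duration=0.8, au_intensity=0.5))
# 0.6 * (0.4 / 1.25) = 0.192
```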
In summary, according to the method provided by the application, when the animation of the virtual character is generated through the expression label data, corresponding audio data can be generated based on the text content in the expression label data, and the mouth shape animation of the virtual character when expressing the text content can then be obtained based on the audio data; the facial expression of the virtual character can be set based on the expression labels in the expression label data, so as to obtain the expression animation of the virtual character; the expression animation and the mouth shape animation are then fused to obtain the facial animation of the virtual character, so that the virtual character has rich expressions when expressing the text content, the matching degree between the text content and the facial animation is improved, and the presentation effect of the animation is improved.
According to the method provided by this embodiment, by performing expression animation streaming transition optimization, when a plurality of expression labels exist in the expression label data and the emotion changes while the text content is being expressed, the emotion change of the virtual character appears natural and the presentation effect of the animation is improved.
Fig. 8 is a block diagram showing the structure of an apparatus for generating an animation based on a virtual character according to an exemplary embodiment of the present application, which includes the following parts as shown in fig. 8.
An obtaining module 810, configured to obtain expression tag data, where the expression tag data includes text content and an expression tag corresponding to the text content, and the expression tag is used to indicate the facial expression of the virtual character when expressing the text content;
a conversion module 820, configured to perform voice conversion on the text content to obtain audio data;
the conversion module 820 is further configured to perform a mouth shape conversion on the audio data to generate a mouth shape animation of the virtual character;
a generating module 830, configured to control facial muscle data of the virtual character based on the audio data and the expression label corresponding to the text content, and generate an expression animation of the virtual character;
and a fusion module 840, configured to fuse the mouth shape animation and the expression animation to obtain the facial animation of the virtual character.
In an optional embodiment, the conversion module 820 is further configured to perform speech conversion on the text content to obtain candidate phoneme data corresponding to the text content; and setting a start-stop mark in the candidate phoneme data based on the position of the expression label in the expression label data to obtain the audio data carrying the start-stop mark.
In an alternative embodiment, the conversion module 820 is further configured to read the audio data based on an audio reading window and a window sliding step, where the audio reading window is a unit length for reading the audio data, and the window sliding step is a sliding length of the window between two adjacent audio readings; and performing mouth shape conversion on the read audio data to generate the mouth shape animation of the virtual character.
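For illustration, the audio reading window and window sliding step described here could be realised as a simple generator; the sample-based representation of the audio data is an assumption of this sketch.

```python
def read_audio_windows(samples, window_size, hop_size):
    """Yield fixed-length audio windows for mouth shape conversion.

    window_size plays the role of the audio reading window and hop_size the
    window sliding step between two adjacent reads.
    """
    for start in range(0, max(len(samples) - window_size + 1, 1), hop_size):
        yield samples[start:start + window_size]

# Example: a 10-sample buffer read with a window of 4 samples and a step of 2.
print([w for w in read_audio_windows(list(range(10)), window_size=4, hop_size=2)])
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```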
In an optional embodiment, the generating module 830 is further configured to determine a start-stop time of an expression tag corresponding to the text content based on the start-stop identifier carried by the audio data, to obtain aligned expression tag data; and controlling the facial muscle data of the virtual character based on the aligned expression label data, and generating the expression animation of the virtual character.
In an optional embodiment, the conversion module 820 is further configured to perform voice conversion on the text content in an ith unit duration to obtain an ith segment of audio data, where i is a positive integer;
the generating module 830 is further configured to control, in the ith unit duration, facial muscle data of the virtual character based on the ith section of audio data and the expression label corresponding to the text content, and generate an ith section of expression animation of the virtual character.
In an alternative embodiment, the text content is divided into at least one text clause by the expression label corresponding to the text content; as shown in fig. 9, the generating module 830 further includes:
a receiving unit 831, configured to receive the text content corresponding to the ith segment of audio data in the aligned expression tag data, where the received text content corresponding to the ith segment of audio data is placed in a text window, and the text window is used to track the reception status of the text content in the expression tag data;
an extracting unit 832, configured to traverse the text clause of the text content corresponding to the i-th segment of audio data from the text window based on a text sliding window, to extract an expression tag corresponding to the text clause, where the text sliding window refers to a unit length of traversing the text clause from the text window;
and a generating unit 833, configured to control the facial muscle data of the virtual character based on the text clause and an expression label corresponding to the text clause, and generate the ith segment of expression animation of the virtual character.
In an optional embodiment, the extracting unit 832 is further configured to traverse the search target text clause from the text window based on a text sliding window to obtain a search result; and extracting expression labels corresponding to the target text clauses based on the search result.
In an optional embodiment, the generating module 830 is further configured to, in response to receiving an expression extension signal, smoothly transition the expression animation of the virtual character based on the expression extension signal to obtain a post-transition expression animation, where the facial expression of the virtual character changes smoothly in the post-transition expression animation.
In an optional embodiment, the expression label data further includes expression intensity;
the generating module 830 is further configured to control the facial muscle data of the virtual character based on the audio data, the expression label corresponding to the text content, and the expression intensity, and generate an expression animation of the virtual character.
In an alternative embodiment, the expression label further includes a start tag and a termination tag;
the generating module 830 is further configured to control the facial muscle data of the virtual character based on the audio data and the start tag and the expression intensity corresponding to the text content, so as to obtain a first expression animation, where the first expression animation is an animation that represents that the expression intensity of the virtual character is in an ascending trend from an initial intensity until the expression intensity in the expression tag data; controlling the facial muscle data of the virtual character based on the termination tag and the expression intensity corresponding to the audio data and the text content to obtain a second expression animation, wherein the second expression animation is an animation showing that the expression intensity of the virtual character is in a descending trend by the expression intensity set in the expression tag data; and splicing the first expression animation and the second expression animation according to a preset sequence to obtain the expression animation of the virtual character.
In summary, according to the virtual character-based animation generation apparatus provided by the application, when the animation of the virtual character is generated through the expression tag data, corresponding audio data can be generated based on the text content in the expression tag data, and the mouth shape animation of the virtual character when expressing the text content can then be obtained based on the audio data; the facial expression of the virtual character can be set based on the expression labels in the expression tag data, so as to obtain the expression animation of the virtual character; the expression animation and the mouth shape animation are then fused to obtain the facial animation of the virtual character, so that the virtual character has rich expressions when expressing the text content, the matching degree between the text content and the facial animation is improved, and the presentation effect of the animation is improved.
It should be noted that the virtual character-based animation generation apparatus provided in the above embodiment is illustrated only with the division into the above functional modules. In practical applications, the above functions may be allocated to different functional modules as needed, i.e., the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the virtual character-based animation generation apparatus provided in the above embodiment and the embodiments of the virtual character-based animation generation method belong to the same concept; the detailed implementation process is described in the method embodiments and is not repeated here.
Fig. 10 shows a block diagram of a computer device 1000 provided by an exemplary embodiment of the present application. The computer device 1000 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion picture expert compression standard audio plane 3), an MP4 (Moving Picture Experts Group Audio Layer IV, motion picture expert compression standard audio plane 4) player, a notebook computer, or a desktop computer. The computer device 1000 may also be referred to by other names of user devices, portable terminals, laptop terminals, desktop terminals, and the like.
In general, the computer device 1000 includes: a processor 1001 and a memory 1002.
The processor 1001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 1001 may be implemented in at least one hardware form of DSP (Digital Signal Processing ), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array ). The processor 1001 may also include a main processor, which is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit ), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1001 may integrate a GPU (Graphics Processing Unit, image processor) for rendering and drawing of content required to be displayed by the display screen. In some embodiments, the processor 1001 may also include an AI (Artificial Intelligence ) processor for processing computing operations related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. Memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1002 is configured to store at least one instruction for execution by processor 1001 to implement the virtual character-based animation generation method provided by the method embodiments of the present application.
In some embodiments, computer device 1000 also includes other components, and those skilled in the art will appreciate that the structure illustrated in FIG. 10 is not limiting of computer device 1000, and may include more or less components than those illustrated, or may combine certain components, or employ a different arrangement of components.
Alternatively, the computer-readable storage medium may include: read Only Memory (ROM), random access Memory (RAM, random Access Memory), solid state disk (SSD, solid State Drives), or optical disk, etc. The random access memory may include resistive random access memory (ReRAM, resistance Random Access Memory) and dynamic random access memory (DRAM, dynamic Random Access Memory), among others. The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The embodiment of the application also provides a computer device, which comprises a processor and a memory, wherein at least one instruction, at least one section of program, code set or instruction set is stored in the memory, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by the processor to realize the animation generation method based on the virtual character according to any one of the embodiment of the application.
The embodiment of the application also provides a computer readable storage medium, wherein at least one instruction, at least one section of program, code set or instruction set is stored in the storage medium, and the at least one instruction, the at least one section of program, the code set or instruction set is loaded and executed by a processor to realize the virtual character-based animation generation method according to any one of the embodiments of the application.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the avatar-based animation generation method as described in any of the above embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing descriptions are merely preferred embodiments of the present application and are not intended to limit the application; the scope of protection of the application is defined by the appended claims.

Claims (14)

1. A virtual character-based animation generation method, the method comprising:
acquiring expression tag data, wherein the expression tag data comprises text content and expression tags corresponding to the text content, and the expression tags are used for indicating the facial expression of the virtual character when expressing the text content;
performing voice conversion on the text content to obtain audio data;
performing mouth shape conversion on the audio data to generate mouth shape animation of the virtual character;
Based on the audio data and the expression labels corresponding to the text content, controlling the facial muscle data of the virtual character to generate expression animation of the virtual character;
and fusing the mouth shape animation and the expression animation to obtain the facial animation of the virtual character.
2. The method of claim 1, wherein said speech converting the text content to audio data comprises:
performing voice conversion on the text content to obtain candidate phoneme data corresponding to the text content;
and setting a start-stop mark in the candidate phoneme data based on the position of the expression label in the expression label data to obtain the audio data carrying the start-stop mark.
3. The method of claim 2, wherein said performing a mouth-shape transformation on said audio data to generate a mouth-shape animation of said virtual character comprises:
reading the audio data based on an audio reading window and a window sliding step, wherein the audio reading window is a unit length for reading the audio data, and the window sliding step is a sliding length of the window between two adjacent audio readings;
And performing mouth shape conversion on the read audio data to generate the mouth shape animation of the virtual character.
4. The method according to claim 2, wherein the controlling facial muscle data of the virtual character based on the audio data and the expression label corresponding to the text content, and generating the expression animation of the virtual character, comprises:
determining the start-stop time of the expression label corresponding to the text content based on the start-stop identification carried by the audio data, and obtaining aligned expression label data;
and controlling the facial muscle data of the virtual character based on the aligned expression label data, and generating the expression animation of the virtual character.
5. The method according to any one of claims 1 to 4, wherein said speech converting the text content to obtain audio data comprises:
in the ith unit duration, performing voice conversion on the text content to obtain the ith section of audio data, wherein i is a positive integer;
the facial muscle data of the virtual character is controlled based on the expression labels corresponding to the audio data and the text content, and the expression animation of the virtual character is generated, which comprises the following steps:
and controlling the facial muscle data of the virtual character based on the ith section of audio data and the expression label corresponding to the text content in the ith unit duration, and generating the ith section of expression animation of the virtual character.
6. The method of claim 5, wherein the text content is divided into at least one text clause by an emoticon corresponding to the text content;
and the controlling the facial muscle data of the virtual character based on the ith section of audio data and the expression label corresponding to the text content in the ith unit duration to generate the ith section of expression animation of the virtual character comprises:
receiving text content corresponding to the ith section of audio data in the aligned expression label data, wherein the received text content corresponding to the ith section of audio data is positioned in a text window, and the text window is used for determining the condition of receiving the text content in the expression label data;
traversing the text clause of the text content corresponding to the ith section of audio data from the text window based on a text sliding window, and extracting to obtain an expression label corresponding to the text clause, wherein the text sliding window refers to the unit length of traversing the text clause from the text window;
and controlling the facial muscle data of the virtual character based on the text clause and the expression label corresponding to the text clause, and generating the ith section of expression animation of the virtual character.
7. The method of claim 6, wherein the traversing the text clause of the text content corresponding to the i-th segment of audio data from the text window based on the text sliding window, extracting the emoji tag corresponding to the text clause, comprises:
traversing and searching a target text clause from the text window based on a text sliding window to obtain a searching result;
and extracting expression labels corresponding to the target text clauses based on the search result.
8. The method according to any one of claims 1 to 4, wherein the controlling facial muscle data of the virtual character based on the audio data and the expression label corresponding to the text content to generate the expression animation of the virtual character further comprises:
and in response to receiving the expression extension signal, smoothly transitioning the expression animation of the virtual character based on the expression extension signal to obtain a post-transition expression animation, wherein the facial expression change of the virtual character in the post-transition expression animation is smooth.
9. The method according to any one of claims 1 to 4, wherein expression intensity is further included in the expression label data;
the facial muscle data of the virtual character is controlled based on the expression labels corresponding to the audio data and the text content, and the expression animation of the virtual character is generated, which comprises the following steps:
and controlling the facial muscle data of the virtual character based on the audio data, the expression label corresponding to the text content and the expression intensity, and generating the expression animation of the virtual character.
10. The method of claim 9, wherein the expression label further includes a start tag and a termination tag;
the controlling the facial muscle data of the virtual character based on the audio data, the expression label corresponding to the text content and the expression intensity, and generating the expression animation of the virtual character comprises:
controlling the facial muscle data of the virtual character based on the initial tag and the expression intensity corresponding to the audio data and the text content to obtain a first expression animation, wherein the first expression animation is an animation showing that the expression intensity of the virtual character is in an ascending trend from the initial intensity until the expression intensity in the expression tag data;
Controlling the facial muscle data of the virtual character based on the termination tag and the expression intensity corresponding to the audio data and the text content to obtain a second expression animation, wherein the second expression animation is an animation showing that the expression intensity of the virtual character is in a descending trend by the expression intensity set in the expression tag data;
and splicing the first expression animation and the second expression animation according to a preset sequence to obtain the expression animation of the virtual character.
11. An avatar-based animation generation apparatus, the apparatus comprising:
an acquisition module, configured to acquire expression label data, wherein the expression label data comprises text content and expression labels corresponding to the text content, and the expression labels are used for indicating the facial expression of the virtual character when expressing the text content;
the conversion module is used for performing voice conversion on the text content to obtain audio data;
the conversion module is also used for performing mouth shape conversion on the audio data to generate mouth shape animation of the virtual character;
The generation module is used for controlling the facial muscle data of the virtual character based on the audio data and the expression labels corresponding to the text content, and generating the expression animation of the virtual character;
and the fusion module is used for fusing the mouth shape animation and the expression animation to obtain the facial animation of the virtual character.
12. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the avatar-based animation generation method of any of claims 1-10.
13. A computer-readable storage medium, wherein at least one program is stored in the storage medium, the at least one program being loaded and executed by a processor to implement the virtual character-based animation generation method of any one of claims 1 to 10.
14. A computer program product comprising a computer program which when executed by a processor implements the virtual character based animation generation method of any of claims 1 to 10.