CN110880198A - Animation generation method and device

Animation generation method and device

Info

Publication number
CN110880198A
Authority
CN
China
Prior art keywords
input text
animation
voice
words
corresponding relation
Prior art date
Legal status
Pending
Application number
CN201811037239.0A
Other languages
Chinese (zh)
Inventor
陈昌滨
卞衍尧
傅宇韬
Current Assignee
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811037239.0A
Publication of CN110880198A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiment of the application discloses an animation generation method and device. One embodiment of the method comprises: in response to receiving an input text, acquiring an animation tag of the input text and obtaining a corresponding relation between words and actions in the input text; generating a voice corresponding to the input text; and merging the video generated by rendering a model of a preset character image based on the obtained corresponding relation with the generated voice, to generate an animation of the character image. This implementation enables the actions of the character image in the generated animation to naturally and accurately convey the meaning expressed by the input text.

Description

Animation generation method and device
Technical Field
Embodiments of the application relate to the field of multimedia, in particular to the field of computer vision, and more particularly to an animation generation method and device.
Background
With the development of artificial intelligence technology, more and more intelligent products capable of voice interaction with humans are emerging. However, these products lack an intuitive interactive image: the user only hears sound during the interaction. If an animated image is added to such an intelligent product, the user can have an experience similar to communicating with a natural person, which improves the user experience. To achieve this perception of communicating with a natural person, a lively image needs to be generated on the interactive device in real time, with natural expressions and movements and with mouth animation synchronized with the sound.
Existing animation production generally requires the sound, lip movements, expressions and movements of a character to be designed and drawn frame by frame. In the process of creating a 3D (three-dimensional) animation, sound and picture are combined either by adjusting the animation after dubbing or by dubbing to match the finished animation. In addition, the motions, expressions, and lip movements of the character figure are all drawn frame by frame by an animator, which requires a great deal of labor and time.
Disclosure of Invention
The embodiment of the application provides an animation generation method and device.
In a first aspect, an embodiment of the present application provides an animation generation method, including: responding to the received input text, acquiring an animation label of the input text, and obtaining a corresponding relation between words and actions in the input text; generating a voice corresponding to the input text; and combining the video generated by rendering the preset character image model based on the obtained corresponding relation and the generated voice to generate the animation of the character image.
In some embodiments, generating speech corresponding to the input text comprises: analyzing the input text to generate a phoneme sequence; based on the generated phoneme sequence, a speech corresponding to the input text is synthesized.
In some embodiments, generating speech corresponding to the input text further comprises: determining a mouth shape coefficient sequence corresponding to the phoneme sequence; rendering the preset character image based on the obtained corresponding relation and the voice, and generating the animation of the character image further comprises: and combining the video generated by rendering the preset character image model based on the obtained corresponding relation and the mouth shape coefficient sequence and the generated voice to generate the animation of the character image.
In some embodiments, in response to receiving the input text, obtaining an animation tag of the input text, and obtaining a correspondence between words and actions in the input text comprises: inputting an input text into a pre-trained animation label acquisition model to obtain a corresponding relation between words in the input text and animation labels; and generating an action coefficient sequence based on the corresponding relation between the words and the animation labels in the input text, and taking the action coefficient sequence as the corresponding relation between the words and the actions in the input text.
In some embodiments, the animation tag includes an expression tag and an action tag, the animation tag obtaining model includes an emotion prediction submodel and an action prediction submodel, and inputting the input text into the pre-trained animation tag obtaining model to obtain a correspondence between words in the input text and the animation tag includes: inputting the input text into a pre-trained emotion prediction submodel to obtain a corresponding relation between words contained in the input text and emotional tendencies of the words; determining a corresponding relation between the words contained in the input text and preset expression labels based on the corresponding relation between the words contained in the input text and the emotional tendencies of the words; and inputting the corresponding relation between the words contained in the input text and the preset expression labels of the words into the action prediction submodel to obtain an action coefficient sequence, wherein the action coefficient sequence is used for indicating a corresponding relation between the words contained in the input text and action labels of the words.
In some embodiments, the motion indicated by the motion tag includes at least one of a limb motion, a torso motion, and a head motion.
In some embodiments, generating speech corresponding to the input text further comprises: and inputting the input text into a pre-established voice mouth shape generation model to obtain a voice and mouth shape coefficient sequence corresponding to the input text.
In some embodiments, the pre-established speech mouth shape generation model is trained by: establishing a sequence-to-sequence model as an initial voice mouth shape generation model; inputting a training sample into the initial voice mouth shape generation model to obtain the output of the initial voice mouth shape generation model, wherein the training sample comprises a voice corpus, labels of the voice corpus, a video corpus and labels of the video corpus; and back propagating, in the initial voice mouth shape generation model, a loss value, determined based on a preset loss function, between the output and the labels of the training sample, to train the initial voice mouth shape generation model.
In some embodiments, generating speech corresponding to the input text further comprises: determining a voice corresponding to an input text from a pre-established audio database; and inputting the determined voice into a mouth shape coefficient generation model trained in advance to obtain a mouth shape coefficient sequence corresponding to the input text.
In a second aspect, an embodiment of the present application provides an animation generation apparatus, including: the animation label acquisition unit is configured to respond to the received input text, acquire an animation label of the input text and obtain a corresponding relation between words and actions in the input text; a voice generating unit configured to generate a voice corresponding to an input text; and an animation generation unit configured to combine the video generated by rendering the model of the preset character image based on the obtained correspondence and the generated voice to generate an animation of the character image.
In some embodiments, the speech generation unit is further configured to: analyzing the input text to generate a phoneme sequence; based on the generated phoneme sequence, a speech corresponding to the input text is synthesized.
In some embodiments, the speech generation unit is further configured to: determining a mouth shape coefficient sequence corresponding to the phoneme sequence; rendering the preset character image based on the obtained corresponding relation and the voice, and generating the animation of the character image further comprises: and combining the video generated by rendering the preset character image model based on the obtained corresponding relation and the mouth shape coefficient sequence and the generated voice to generate the animation of the character image.
In some embodiments, the animation tag obtaining unit is further configured to: inputting an input text into a pre-trained animation label acquisition model to obtain a corresponding relation between words in the input text and animation labels; and generating an action coefficient sequence based on the corresponding relation between the words and the animation labels in the input text, and taking the action coefficient sequence as the corresponding relation between the words and the actions in the input text.
In some embodiments, the animation tag includes an expression tag and an action tag, the animation tag obtaining model includes an emotion prediction submodel and an action prediction submodel, and the animation tag obtaining unit is further configured to: input the input text into a pre-trained emotion prediction submodel to obtain a corresponding relation between words contained in the input text and emotional tendencies of the words; determine a corresponding relation between the words contained in the input text and preset expression labels based on the corresponding relation between the words contained in the input text and the emotional tendencies of the words; and input the corresponding relation between the words contained in the input text and the preset expression labels of the words into the action prediction submodel to obtain an action coefficient sequence, wherein the action coefficient sequence is used for indicating a corresponding relation between the words contained in the input text and action labels of the words.
In some embodiments, the motion indicated by the motion tag includes at least one of a limb motion, a torso motion, and a head motion.
In some embodiments, the speech generation unit is further configured to: and inputting the input text into a pre-established voice mouth shape generation model to obtain a voice and mouth shape coefficient sequence corresponding to the input text.
In some embodiments, the pre-established speech mouth shape generation model is trained by: establishing a sequence-to-sequence model as an initial voice mouth shape generation model; inputting a training sample into the initial voice mouth shape generation model to obtain the output of the initial voice mouth shape generation model, wherein the training sample comprises a voice corpus, labels of the voice corpus, a video corpus and labels of the video corpus; and back propagating, in the initial voice mouth shape generation model, a loss value, determined based on a preset loss function, between the output and the labels of the training sample, to train the initial voice mouth shape generation model.
In some embodiments, the speech generation unit is further configured to: determining a voice corresponding to an input text from a pre-established audio database; and inputting the determined voice into a mouth shape coefficient generation model trained in advance to obtain a mouth shape coefficient sequence corresponding to the input text.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; storage means for storing one or more programs which, when executed by one or more processors, cause the one or more processors to carry out the method as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method as described in the first aspect.
According to the animation generation method and device provided by the embodiment of the application, firstly, in response to the received input text, the animation label of the input text is obtained, and the corresponding relation between the words and the actions in the input text is obtained; then, generating a voice corresponding to the input text; and finally, rendering the preset character image based on the obtained corresponding relation and the voice to generate the animation of the character image, so that the action of the character image in the generated animation can naturally and accurately embody the meaning expressed by the input text.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram to which the animation generation method of one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of an animation generation method according to the present application;
FIG. 3 is a schematic diagram of an application scenario of an animation generation method according to the application;
FIG. 4 is a flow diagram of yet another embodiment of an animation generation method according to the present application;
FIG. 5 is a block diagram of one embodiment of an animation generation device according to the present application;
fig. 6 is a schematic structural diagram of a computer system of an electronic device suitable for implementing the animation generation method according to the embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the animation generation methods or animation generation apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting multimedia playback, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may be a server providing various services, for example, a multimedia server capable of multimedia interaction with users using the terminal devices 101, 102, 103. The multimedia server can analyze the received data such as the video, voice or text of the user and feed back the processing result (for example, animation which is generated based on the analysis result and is used for multimedia interaction with the terminal equipment) to the terminal equipment.
It should be noted that the animation generation method provided in the embodiments of the present application may be executed by the server 105, and accordingly, the animation generation apparatus may be disposed in the server 105.
It should be understood that the number of terminal devices 101, 102, 103, network 104 and server 105 in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of an animation generation method according to the present application is shown. The animation generation method comprises the following steps:
step 201, in response to receiving an input text, acquiring an animation label of the input text, and obtaining a corresponding relationship between words and actions in the input text.
In the present embodiment, the execution subject (e.g., the server 105 shown in fig. 1) of the animation generation method of the present embodiment may actively acquire or receive the input text in various feasible ways.
For example, the pre-generated input text may be acquired, by way of a wired or wireless connection, from an electronic device communicatively connected to the execution subject. It should be noted that the wireless connection means may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (Ultra Wideband) connection, and other wireless connection means now known or developed in the future.
Here, the input text may be text stored in advance on the execution subject or on another electronic device communicatively connected to the execution subject, or may be a response text generated in real time for interactive content (for example, text, voice, and the like) transmitted by the user.
Further, the animation tag may be a tag for indicating any dynamic process of the character image in the animation, such as waving hands, rolling, laughing, twisting, or running.
In this step, for example, word segmentation may first be performed on the input text, and the animation tag corresponding to each segmented word may be looked up in a pre-established animation tag database. Words and the animation tags corresponding to the words may be stored in association in the pre-established animation tag database. For example, the word "hello" may correspond to the animation tag "wave hand". Here, the animation tag database may be preset manually, for example based on experience and/or the character features of the character image in the animation to be generated.
It is understood that the animation tag database may have a one-to-one correspondence relationship between words and animation tags, or may have a one-to-many correspondence relationship, or may have a many-to-many correspondence relationship.
Specifically, in the animation tag database, a certain word may uniquely correspond to a certain animation tag (one-to-one correspondence). For example, the word "hello" in the animation tag database corresponds only to the animation tag of "wave" and the other words in the animation tag database do not correspond to the animation tag of "wave".
Alternatively, a certain word may correspond to a certain number of animation tags (one-to-many correspondence), for example, the word "hello" may have an association relationship with two animation tags, i.e., "waving hand" and "smiling" in the animation tag database.
Alternatively, in the animation tag database, some words may have an association relationship (many-to-many correspondence relationship) with some animation tags. For example, in the animation tag database, the words "hello" and "hi" both have an association relationship with two animation tags, i.e., "waving hand" and "smiling".
Therefore, the animation tag corresponding to a segmented word in the input text can be found in the animation tag database. It will be appreciated that not every word necessarily has a corresponding animation tag. For example, for the input text "Hello, I am Xiao A", the word "hello" may correspond to the two animation tags "waving hand" and "smiling", while the words "I", "am" and "Xiao A" may have no corresponding animation tags.
The corresponding relation between words and actions in the input text generated by this step may be characterized, for example, as a character string obtained by inserting animation tags at the corresponding positions of the input text. For example, for the input text "Hello, I am Xiao A", the finally generated corresponding relation may be expressed as "<smile>Hello</smile>, I am Xiao A", thereby indicating that in the finally generated animation the character image shows a smiling expression while the word "Hello" is spoken.
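As an informal illustration of this lookup, the following Python sketch uses a small in-memory dictionary as a stand-in for the pre-established animation tag database and a trivial whitespace splitter in place of a real word segmenter; the database contents and the tag-insertion format are illustrative assumptions, not data from the application.

```python
# Minimal sketch of the tag lookup described above; the tag database,
# the segmenter and the tag-insertion format are illustrative assumptions.
from typing import Dict, List

# Pre-established animation tag database: word -> one or more animation tags.
ANIMATION_TAG_DB: Dict[str, List[str]] = {
    "hello": ["smile", "wave"],
    "goodbye": ["wave"],
}

def segment(text: str) -> List[str]:
    """Stand-in for a real word segmenter (e.g. for Chinese text)."""
    return text.replace(",", " ,").split()

def annotate(text: str) -> str:
    """Insert animation tags around words that have entries in the database."""
    pieces = []
    for word in segment(text):
        tags = ANIMATION_TAG_DB.get(word.lower(), [])
        if tags:
            # Wrap the word with its first tag, e.g. <smile>hello</smile>.
            pieces.append(f"<{tags[0]}>{word}</{tags[0]}>")
        else:
            pieces.append(word)
    return " ".join(pieces)

print(annotate("hello, I am Xiao A"))  # -> "<smile>hello</smile> , I am Xiao A"
```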
Step 202, generating a voice corresponding to the input text.
Any existing or future-developed speech synthesis techniques may be used herein to convert the input text into speech data.
For example, TTS (Text To Speech) technology may be used to convert the input text into speech data. If the input text is Chinese, the TTS technology may use Chinese prosody and other related knowledge to perform word segmentation, part-of-speech tagging, phonetic annotation and digit/symbol conversion on the Chinese sentence, and obtain speech data by querying a Chinese speech library.
In some alternative implementations, the generating of the speech corresponding to the input text of this step may be implemented as follows.
In step 202a, the input text is parsed to generate a sequence of phonemes.
Step 202b, synthesizing the speech corresponding to the input text based on the generated phoneme sequence.
A phoneme (phone) is the smallest unit of speech; phonemes are obtained by analyzing the articulatory actions within a syllable, with one action constituting one phoneme. Phonemes can be divided into two major categories: vowels and consonants.
It will be appreciated that a character or word may include a single phoneme or two or more phonemes. In addition, if the input text is determined, the phoneme sequence corresponding to the input text may also be unique.
After the phoneme sequence is obtained, the voice corresponding to the input text can be synthesized by determining the voice features of each phoneme in the phoneme sequence. Here, the voice features may include, for example, but are not limited to, at least one of sound intensity, loudness, pitch period, fundamental frequency, and the like.
For example, in some alternative implementations, the phoneme sequence may be input into a pre-established speech synthesis model to obtain the speech characteristics of each phoneme in the phoneme sequence.
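To make the idea concrete, the toy sketch below (Python/NumPy) maps each phoneme to assumed features, namely duration and fundamental frequency, and renders a crude waveform from them; the feature table and the sine/noise "synthesizer" are illustrative stand-ins for a real pre-established speech synthesis model and vocoder, not the method of this application.

```python
import numpy as np

SAMPLE_RATE = 16000

# Illustrative per-phoneme features (duration in seconds, fundamental frequency in Hz);
# a real system would predict these with a pre-established speech synthesis model.
PHONEME_FEATURES = {
    "n": (0.08, 140.0),
    "i": (0.15, 220.0),
    "h": (0.06, 0.0),    # unvoiced: no fundamental frequency
    "ao": (0.20, 180.0),
}

def synthesize(phonemes):
    """Render a crude waveform from per-phoneme duration/F0 features."""
    chunks = []
    for p in phonemes:
        duration, f0 = PHONEME_FEATURES[p]
        t = np.arange(int(duration * SAMPLE_RATE)) / SAMPLE_RATE
        if f0 > 0:
            chunks.append(0.3 * np.sin(2 * np.pi * f0 * t))   # voiced: sine at F0
        else:
            chunks.append(0.05 * np.random.randn(t.size))     # unvoiced: noise
    return np.concatenate(chunks)

audio = synthesize(["n", "i", "h", "ao"])   # phoneme sequence for "ni hao"
print(audio.shape)  # number of samples in the synthesized utterance
```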
And step 203, merging the video generated by rendering the preset character image model based on the obtained corresponding relation and the generated voice to generate the animation of the character image.
Through the above step 201, the corresponding relation between words and actions in the input text can be obtained; that is, the input text has been annotated with where the character image should make which action. In this step, a model of the preset character image may be rendered based on this corresponding relation between words and actions obtained through step 201, thereby obtaining a video of the character image while it speaks the voice corresponding to the input text.
In some alternative implementations, a two-dimensional or three-dimensional model may be generated in advance for the character. On the pre-generated model, the positions of one or more key points of the model are adjusted based on the corresponding relation between the words and the actions in the input text, so that the character image can make the actions indicated by the animation labels corresponding to the words in the input text at proper time.
On the other hand, in step 202 described above, a voice corresponding to the input text is generated. Each word in the input text corresponds to a portion of the generated voice, and step 201 provides the corresponding relation between the words and the actions in the input text. The action corresponding to a word of the input text can therefore be matched in time to a certain segment of the voice generated by step 202 (e.g., the action starts and ends at the same times as that segment). In this way, the rendered video and the generated voice can be accurately "aligned" in time, making the generated animation natural and reasonable.
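A minimal sketch of this temporal alignment is given below (Python); the per-word timings and the word-to-action mapping are assumed example inputs standing in for the outputs of steps 202 and 201, respectively.

```python
# Assumed inputs: per-word speech segment boundaries from speech generation (step 202)
# and the word-to-action correspondence from tag acquisition (step 201).
word_timings = [            # (word, start_s, end_s) within the generated speech
    ("hello", 0.00, 0.45),
    ("I", 0.45, 0.60),
    ("am", 0.60, 0.80),
    ("Xiao A", 0.80, 1.30),
]
word_actions = {"hello": "wave"}   # corresponding relation between words and actions

def action_schedule(timings, actions):
    """Return (action, start_s, end_s) so each action spans its word's speech segment."""
    return [(actions[w], s, e) for (w, s, e) in timings if w in actions]

print(action_schedule(word_timings, word_actions))  # [('wave', 0.0, 0.45)]
```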
In the animation generation method provided by this embodiment, first, in response to receiving an input text, an animation tag of the input text is obtained, and a correspondence between words and actions in the input text is obtained; then, generating a voice corresponding to the input text; and finally, rendering the preset character image based on the obtained corresponding relation and the voice to generate the animation of the character image, so that the action of the character image in the generated animation can naturally and accurately embody the meaning expressed by the input text.
With continued reference to fig. 3, fig. 3 is a schematic diagram 300 of an application scenario of the animation generation method according to the present embodiment.
In the application scenario shown in fig. 3, a user 301 interacts with a server 303 through a cell phone 302.
For example, the user speaks the wake word "Xiao A" to the mobile phone. After receiving the voice signal, the server 303 analyzes it and recognizes that the user intends to wake up the service with this voice input. A response sentence for the wake-up intention, such as "Hello, Xiao A is at your service", may then be generated.
The server 303 may first determine the text corresponding to the response sentence and use it as the input text. Next, as indicated by reference numeral 304, the server 303 can acquire animation tags for the input text to obtain the corresponding relation between words and actions in the input text; for example, the word "hello" in the input text may correspond to a "bow" action.
Next, as indicated by reference numeral 305, the server 303 may also generate speech corresponding to the input text.
Next, as indicated by reference numeral 306, the server 303 may merge the video generated by rendering the model of the preset character image based on the obtained corresponding relation with the generated voice, generating an animation of the character image. The server then sends the generated animation to the mobile phone 302 for playback, so that the user 301 receives a response to the input voice content presented in the form of an animation.
With further reference to FIG. 4, a flow 400 of yet another embodiment of an animation generation method is shown. The flow 400 of the animation generation method comprises the following steps:
step 401, in response to receiving the input text, obtaining an animation tag of the input text, and obtaining a corresponding relationship between words and actions in the input text. Step 401 of this embodiment may be performed in a similar manner to step 201 of the embodiment shown in fig. 2, and is not described herein again.
Step 402, analyzing an input text to generate a phoneme sequence;
step 403, synthesizing the speech corresponding to the input text based on the generated phoneme sequence.
The steps 402 and 403 may be performed in a manner similar to that in the optional implementation manner (step 202a and step 202b) of the step 202 in the embodiment shown in fig. 2, and are not described again here.
Step 404, determining a mouth shape coefficient sequence corresponding to the phoneme sequence.
Here, the mouth shape coefficients in the mouth shape coefficient sequence may be used to characterize the shape and posture of the mouth when the character image pronounces in the animation. Accordingly, the mouth shape coefficient sequence containing the mouth shape coefficient can be used for representing the dynamic process of the mouth shape change when the character image speaks in the animation.
It is to be understood that the mouth shape coefficient may be a coefficient for characterizing the overall shape of the mouth when speaking. Alternatively, the mouth shape coefficient may be a feature vector for characterizing a plurality of key points of the mouth during pronunciation.
Step 405, merging the video generated by rendering the preset character image model based on the obtained corresponding relation and the mouth shape coefficient sequence and the generated voice to generate the animation of the character image.
Similar to step 203 in the embodiment shown in fig. 2, in this step 405 a two-dimensional or three-dimensional model may also be generated in advance for the character image. On the pre-generated model, the positions of one or more key points of the model are adjusted based on the corresponding relation between the words and the actions in the input text and on the generated mouth shape coefficient sequence, so that the character image makes, at the proper times, the actions and mouth shapes indicated by the animation tags corresponding to the words in the input text.
Furthermore, it is understood that in some application scenarios, the action and the mouth shape for a certain word of the input text may both involve the same key point in the pre-established model. In these application scenarios, the position of the key point determined based on the corresponding relation between words and actions and the position determined based on the mouth shape coefficient sequence may be fused (e.g., averaged, or weighted and averaged based on a preset weighting coefficient) to determine the position of that key point in the rendered video when the character image pronounces the word.
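The fusion step might be sketched as follows (Python/NumPy); the keypoint coordinates and the weighting coefficient are illustrative values, not data from the application.

```python
import numpy as np

# Illustrative per-frame 2D positions of the same mouth key points, one set derived
# from the word/action correspondence and one from the mouth shape coefficient sequence.
keypoints_from_action = np.array([[0.10, 0.20], [0.30, 0.22]])
keypoints_from_mouth = np.array([[0.12, 0.18], [0.28, 0.24]])

def fuse_keypoints(a, b, weight=0.5):
    """Weighted average of two keypoint estimates; weight is a preset coefficient."""
    return weight * a + (1.0 - weight) * b

fused = fuse_keypoints(keypoints_from_action, keypoints_from_mouth, weight=0.6)
print(fused)  # positions used when rendering the frame
```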
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the animation generation method of the embodiment further highlights the corresponding relationship between the mouth shape change of the character image and the voice in the animation generation process, so that the finally generated animation is more natural and vivid.
In some optional implementation manners of this embodiment, the steps 402 to 404 of this embodiment may also be implemented in the following manners:
and inputting the input text into a pre-established voice mouth shape generation model to obtain a voice and mouth shape coefficient sequence corresponding to the input text.
In some application scenarios of these alternative implementations, the speech mouth shape generation model may be established in the following manner.
First, voice corpora are collected, and mouth movement videos of the speaker are collected at the same time. When the voice corpora are preprocessed, the mouth shape videos are processed in parallel to guarantee complete alignment of the audio and the video.
And then, labeling the voice corpus according to the standard of TTS, processing each frame of the video corpus, and extracting a mouth shape coefficient corresponding to each frame of mouth shape image.
Then, on the basis of establishing a TTS splicing library, a mouth shape coefficient library aligned with the TTS splicing library is established. Thus, when a phoneme in the TTS splicing library is indexed, the mouth shape coefficients corresponding to that phoneme can be indexed at the same time. For example, if there is a phoneme with a duration of t seconds in the phoneme library, then there are t × fps frames of mouth shape coefficients (where fps is the frame rate of the video to be generated) corresponding to it in the mouth shape coefficient library.
In these application scenarios, the process of establishing the TTS splicing library and the mouth shape coefficient library can be regarded as the process of pre-establishing the speech mouth shape generation model.
Thus, unit selection and similar algorithms can be used to retrieve, from the TTS splicing library and the mouth shape coefficient library respectively, the voice and the mouth shape coefficients corresponding to the input text.
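The alignment between the two libraries can be pictured with the sketch below (Python); the library contents, the frame rate and the lookup routine are illustrative assumptions rather than the actual library format.

```python
FPS = 25  # frame rate of the video to be generated (illustrative)

# Illustrative aligned libraries: each phoneme unit stores its audio samples and
# one mouth shape coefficient per video frame (duration_s * FPS frames).
tts_library = {
    "ni": {"duration_s": 0.20, "audio": "<0.20 s of waveform samples>"},
    "hao": {"duration_s": 0.32, "audio": "<0.32 s of waveform samples>"},
}
mouth_library = {
    "ni": [0.1] * int(0.20 * FPS),    # 5 frames of mouth shape coefficients
    "hao": [0.4] * int(0.32 * FPS),   # 8 frames of mouth shape coefficients
}

def lookup(phonemes):
    """Retrieve speech units and their aligned mouth shape coefficient frames."""
    audio = [tts_library[p]["audio"] for p in phonemes]
    mouth = [c for p in phonemes for c in mouth_library[p]]
    return audio, mouth

audio_units, mouth_coeffs = lookup(["ni", "hao"])
print(len(mouth_coeffs))  # 13 frames for 0.52 s of speech at 25 fps
```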
In other application scenarios of these alternative implementations, the pre-established speech mouth shape generation model can also be trained as follows.
First, a sequence-to-sequence (seq2seq) model is built as an initial speech mouth shape generation model.
Then, a training sample is input into the initial voice mouth shape generation model to obtain the output of the initial voice mouth shape generation model. The training sample comprises a voice corpus, labels of the voice corpus, a video corpus and labels of the video corpus.
Finally, the loss value between the output and the labels of the training sample, determined based on a preset loss function, is back propagated in the initial voice mouth shape generation model to train the initial voice mouth shape generation model.
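For orientation only, the following PyTorch sketch shows one training step of a generic sequence-to-sequence model under the scheme described above; the GRU-based architecture, the tensor shapes and the MSE loss are assumptions and do not reflect the actual model or loss function of the application.

```python
import torch
import torch.nn as nn

class SpeechMouthSeq2Seq(nn.Module):
    """Toy encoder-decoder: text features in, speech + mouth shape coefficients out."""
    def __init__(self, in_dim=64, hidden=128, out_dim=80 + 16):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)   # acoustic features + mouth coefficients

    def forward(self, x):
        enc, _ = self.encoder(x)
        dec, _ = self.decoder(enc)
        return self.proj(dec)

model = SpeechMouthSeq2Seq()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                       # stands in for the preset loss function

# One training step on a dummy batch (batch size 8, sequence length 50).
inputs = torch.randn(8, 50, 64)              # labeled voice-corpus features
targets = torch.randn(8, 50, 96)             # speech + mouth coefficient targets
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)       # loss between output and sample labels
loss.backward()                              # back propagate through the model
optimizer.step()
```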
In some optional implementation manners of this embodiment, the steps 402 to 404 of this embodiment may also be implemented in the following manners:
first, a voice corresponding to an input text is determined from a pre-established audio database.
And then, inputting the determined voice into a mouth shape coefficient generation model trained in advance to obtain a mouth shape coefficient sequence corresponding to the input text.
In these alternative implementations, the pre-established audio database may be established as follows:
First, voice corpora are collected, and then the voice corpora are labeled according to the TTS standard to obtain the audio database.
In addition, the mouth shape coefficient generation model can be obtained by training in the following way:
the method comprises the steps of establishing an audio database to collect voice corpora and collecting mouth shape movement videos of a speaker at the same time. When the voice corpus is preprocessed, the mouth shape videos are processed in parallel, and the complete alignment of the audio and the videos is guaranteed. Then, each frame of the video corpus is processed, and a mouth shape coefficient corresponding to each frame of mouth shape image is extracted. Then, an initial sequence-to-sequence model is established, and the labeled voice corpus and the labeled video corpus (i.e., the voice corpus is labeled according to the TTS standard and the video corpus labeled with the mouth shape coefficient for each frame) are used as training samples to train the initial sequence-to-sequence model, so that a mouth shape coefficient generation model is obtained.
Thus, in these alternative implementations, the speech corresponding to the input text may be determined first, and then the obtained speech is input into the mouth shape coefficient generation model, so as to obtain the mouth shape coefficient sequence corresponding to the input text.
In some optional implementations of embodiments of the present application, in response to receiving an input text, obtaining an animation tag of the input text, and obtaining a correspondence between words and actions in the input text (step 201 in the embodiment shown in fig. 2 and step 401 in the embodiment shown in fig. 4) may be further implemented by:
firstly, an input text is input into a pre-trained animation label acquisition model to obtain the corresponding relation between words and animation labels in the input text.
Here, for example, samples formed from texts and the animation tags labeled for those texts may be used to train a pre-established initial animation tag acquisition model, so as to obtain the trained animation tag acquisition model.
Then, based on the correspondence between the word in the input text and the animation tag, a motion coefficient sequence is generated, and the motion coefficient sequence is taken as the correspondence between the word and the motion in the input text.
It is understood that similar to the mouth shape coefficient sequence, the motion coefficients of the motion coefficient sequence can be used to characterize the corresponding motion of the character image in the animation, for example, the position of one or more key points in the pre-established character image model. Accordingly, the motion coefficient sequence can be used for representing the dynamic process of motion change when the character image speaks in the animation.
In some application scenarios of these alternative implementations, the animation tags may include emoji tags and action tags.
In these application scenarios, the animation tag acquisition model may include an emotion predictor model and an action predictor model. Inputting the input text into the pre-trained animation tag acquisition model to obtain the correspondence between the words in the input text and the animation tags may be further implemented in the following manner.
Firstly, an input text is input to a pre-trained emotion prediction submodel, and the corresponding relation between words contained in the input text and the emotion tendencies of the words is obtained.
Then, based on the corresponding relation between the words contained in the input text and the emotional tendencies of the words, the corresponding relation between the words contained in the input text and preset expression labels is determined.
And then, inputting a corresponding relation between words contained in the input text and preset expression labels of the words into the action prediction submodel to obtain an action coefficient sequence, wherein the action coefficient sequence is used for indicating the corresponding relation between the words contained in the input text and the action labels of the words.
It will be appreciated that in these application scenarios, the motion indicated by the motion tag may include, but is not limited to, at least one of limb motion, torso motion, and head motion.
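The two-stage pipeline described above might be wired together as in the sketch below (Python), where both submodels are replaced by trivial rule-based stand-ins and the label sets and coefficient values are illustrative assumptions rather than the trained models of the application.

```python
from typing import Dict, List, Tuple

# Stand-in for the emotion prediction submodel: word -> emotional tendency score.
def predict_emotion(words: List[str]) -> Dict[str, float]:
    lexicon = {"hello": 0.8, "sorry": -0.6}          # illustrative sentiment lexicon
    return {w: lexicon.get(w, 0.0) for w in words}

# Map emotional tendencies to preset expression labels.
def to_expression_labels(tendencies: Dict[str, float]) -> Dict[str, str]:
    def label(score: float) -> str:
        return "smile" if score > 0.3 else "frown" if score < -0.3 else "neutral"
    return {w: label(s) for w, s in tendencies.items()}

# Stand-in for the action prediction submodel: expression labels -> action coefficients.
def predict_action_coefficients(expr: Dict[str, str]) -> List[Tuple[str, float]]:
    amplitude = {"smile": 1.0, "frown": 0.7, "neutral": 0.2}
    return [(w, amplitude[e]) for w, e in expr.items()]

words = ["hello", "I", "am", "Xiao", "A"]
coeffs = predict_action_coefficients(to_expression_labels(predict_emotion(words)))
print(coeffs)  # e.g. [('hello', 1.0), ('I', 0.2), ...]
```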
With further reference to fig. 5, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of an animation generation apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the animation generation apparatus of the present embodiment includes an animation tag acquisition unit 501, a speech generation unit 502, and an animation generation unit 503.
The animation tag obtaining unit 501 may be configured to obtain an animation tag of an input text in response to receiving the input text, and obtain a correspondence between words and actions in the input text.
The speech generating unit 502 may be configured to generate speech corresponding to the input text.
The animation generation unit 503 is configured to combine the video generated by rendering the model of the preset character image based on the obtained correspondence and the generated voice, generating an animation of the character image.
In some optional implementations, the speech generating unit 502 may be further configured to: analyzing the input text to generate a phoneme sequence; based on the generated phoneme sequence, a speech corresponding to the input text is synthesized.
In some optional implementations, the speech generating unit 502 may be further configured to: determining a mouth shape coefficient sequence corresponding to the phoneme sequence; rendering the preset character image based on the obtained corresponding relation and the voice, and generating the animation of the character image further comprises: and combining the video generated by rendering the preset character image model based on the obtained corresponding relation and the mouth shape coefficient sequence and the generated voice to generate the animation of the character image.
In some optional implementations, the animation tag obtaining unit 501 may be further configured to: inputting an input text into a pre-trained animation label acquisition model to obtain a corresponding relation between words in the input text and animation labels; and generating an action coefficient sequence based on the corresponding relation between the words and the animation labels in the input text, and taking the action coefficient sequence as the corresponding relation between the words and the actions in the input text.
In some optional implementations, the animation tag may include an emoticon tag and an action tag, and the animation tag obtaining model may further include an emotion prediction submodel and an action prediction submodel.
In these alternative implementations, the animation tag obtaining unit 501 may be further configured to: input the input text into a pre-trained emotion prediction submodel to obtain a corresponding relation between words contained in the input text and emotional tendencies of the words; determine a corresponding relation between the words contained in the input text and preset expression labels based on the corresponding relation between the words contained in the input text and the emotional tendencies of the words; and input the corresponding relation between the words contained in the input text and the preset expression labels of the words into the action prediction submodel to obtain an action coefficient sequence, wherein the action coefficient sequence is used for indicating a corresponding relation between the words contained in the input text and action labels of the words.
In some alternative implementations, the motion indicated by the motion tag may include at least one of limb motion, torso motion, and head motion.
In some optional implementations, the speech generating unit 502 may be further configured to: and inputting the input text into a pre-established voice mouth shape generation model to obtain a voice and mouth shape coefficient sequence corresponding to the input text.
In these alternative implementations, the pre-established speech mouth shape generation model can be trained as follows: establishing a sequence-to-sequence model as an initial voice mouth shape generation model; inputting a training sample into an initial voice mouth shape generation model to obtain the output of the initial voice mouth shape generation model, wherein the training sample comprises voice corpora, marks of the voice corpora and marks of video corpora; and back propagating the loss value determined based on the preset loss function and between the output and the label of the training sample in the initial voice shape generation model to train the initial voice shape generation model.
In some optional implementations, the speech generating unit 502 may be further configured to: determining a voice corresponding to an input text from a pre-established audio database; and inputting the determined voice into a mouth shape coefficient generation model trained in advance to obtain a mouth shape coefficient sequence corresponding to the input text.
Referring now to FIG. 6, a block diagram of a computer system 600 suitable for use in an electronic device implementing the animation generation method of an embodiment of the present application is shown. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 606 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: a storage portion 606 including a hard disk and the like; and a communication section 607 including a network interface card such as a LAN card, a modem, or the like. The communication section 607 performs communication processing via a network such as the internet. Drivers 608 are also connected to the I/O interface 605 as needed. A removable medium 609 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 608 as necessary, so that a computer program read out therefrom is mounted into the storage section 606 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 607 and/or installed from the removable medium 609. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an animation tag acquisition unit, a voice generation unit, and an animation generation unit. The names of these units do not constitute a limitation to the unit itself in some cases, and for example, the animation tag acquisition unit may also be described as "a unit that acquires an animation tag of an input text and obtains a correspondence between a word and an action in the input text in response to receiving the input text".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: responding to the received input text, acquiring an animation label of the input text, and obtaining a corresponding relation between words and actions in the input text; generating a voice corresponding to the input text; and combining the video generated by rendering the preset character image model based on the obtained corresponding relation and the generated voice to generate the animation of the character image.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (20)

1. An animation generation method, comprising:
responding to the received input text, acquiring an animation label of the input text, and obtaining a corresponding relation between words and actions in the input text;
generating a voice corresponding to the input text;
and combining the video generated by rendering the preset character image model based on the obtained corresponding relation and the generated voice to generate the animation of the character image.
2. The animation generation method of claim 1, wherein the generating of the speech corresponding to the input text comprises:
analyzing the input text to generate a phoneme sequence;
synthesizing to obtain a speech corresponding to the input text based on the generated phoneme sequence.
3. The animation generation method of claim 2, wherein the generating speech corresponding to the input text further comprises:
determining a mouth shape coefficient sequence corresponding to the phoneme sequence;
rendering a preset character image based on the obtained corresponding relation and the voice, and generating the animation of the character image further comprises:
and combining the video generated by rendering the preset character image model based on the obtained corresponding relation and the mouth shape coefficient sequence and the generated voice to generate the animation of the character image.
4. The animation generation method according to any one of claims 1 to 3, wherein the obtaining of the animation tag of the input text in response to the reception of the input text and the obtaining of the correspondence between the words and the actions in the input text comprises:
inputting the input text into a pre-trained animation label acquisition model to obtain a corresponding relation between words in the input text and animation labels;
and generating an action coefficient sequence based on the corresponding relation between the words and the animation labels in the input text, and taking the action coefficient sequence as the corresponding relation between the words and the actions in the input text.
5. The method according to claim 4, wherein the animation labels comprise expression labels and action labels, the animation label acquisition model comprises an emotion prediction sub-model and an action prediction sub-model, and the inputting of the input text into the pre-trained animation label acquisition model to obtain the correspondence between the words in the input text and the animation labels comprises:
inputting the input text into the pre-trained emotion prediction sub-model to obtain a correspondence between the words contained in the input text and emotional tendencies of the words;
determining a correspondence between the words contained in the input text and preset expression labels based on the correspondence between the words and their emotional tendencies; and
inputting the correspondence between the words contained in the input text and the preset expression labels into the action prediction sub-model to obtain an action coefficient sequence, wherein the action coefficient sequence indicates a correspondence between the words contained in the input text and action labels of the words.
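For illustration, the two-stage prediction recited in claim 5 (emotional tendency, then expression label, then action coefficients) can be sketched with lookup tables standing in for the trained sub-models; SENTIMENT and ACTION_COEFFS below are assumptions of this sketch only.

```python
from typing import Dict, List, Tuple

# Hypothetical word-level sentiment lexicon standing in for the emotion
# prediction sub-model (scores in [-1, 1]).
SENTIMENT: Dict[str, float] = {"love": 0.9, "sorry": -0.7, "ok": 0.1}

# Hypothetical mapping from expression label to action coefficients,
# standing in for the action prediction sub-model.
ACTION_COEFFS: Dict[str, List[float]] = {
    "happy": [1.0, 0.2], "sad": [0.1, 0.9], "neutral": [0.0, 0.0],
}

def emotion_tendency(words: List[str]) -> List[Tuple[str, float]]:
    """Stage 1: word -> emotional tendency."""
    return [(w, SENTIMENT.get(w.lower(), 0.0)) for w in words]

def expression_label(score: float) -> str:
    """Map an emotional tendency to a preset expression label."""
    if score > 0.3:
        return "happy"
    if score < -0.3:
        return "sad"
    return "neutral"

def action_coefficient_sequence(text: str) -> List[Tuple[str, List[float]]]:
    """Stage 2: word + expression label -> action coefficient sequence."""
    seq = []
    for word, score in emotion_tendency(text.split()):
        seq.append((word, ACTION_COEFFS[expression_label(score)]))
    return seq

print(action_coefficient_sequence("I love this, sorry for the delay"))
```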
6. The method according to claim 5, wherein the action indicated by an action label comprises at least one of a limb action, a torso action, and a head action.
7. The method according to claim 3, wherein the generating of the speech corresponding to the input text further comprises:
inputting the input text into a pre-established speech and mouth-shape generation model to obtain the speech and the mouth-shape coefficient sequence corresponding to the input text.
8. The method according to claim 7, wherein the pre-established speech and mouth-shape generation model is trained by:
establishing a sequence-to-sequence model as an initial speech and mouth-shape generation model;
inputting a training sample into the initial speech and mouth-shape generation model to obtain an output of the initial speech and mouth-shape generation model, wherein the training sample comprises a speech corpus, labels of the speech corpus, a video corpus, and labels of the video corpus; and
back-propagating, through the initial speech and mouth-shape generation model, a loss value determined by a preset loss function between the output and the labels of the training sample, so as to train the initial speech and mouth-shape generation model.
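For illustration, the training procedure recited in claim 8 can be sketched with a small GRU encoder-decoder in PyTorch; the tensor shapes, the randomly generated stand-in corpus, and the mean-squared-error loss are assumptions of this sketch rather than the configuration of the embodiments.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy sequence-to-sequence model: text/phoneme features in,
    mouth-shape coefficients (and, in principle, acoustic features) out."""
    def __init__(self, in_dim=16, hidden=64, out_dim=4):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):
        enc_out, _ = self.encoder(x)        # (B, T, hidden)
        dec_out, _ = self.decoder(enc_out)  # (B, T, hidden)
        return self.proj(dec_out)           # (B, T, out_dim)

model = Seq2Seq()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()                      # the "preset loss function"

# Stand-in corpus: 8 utterances, 20 frames each (inputs and their labels).
inputs = torch.randn(8, 20, 16)
targets = torch.randn(8, 20, 4)

for step in range(100):
    optimizer.zero_grad()
    output = model(inputs)
    loss = loss_fn(output, targets)         # loss between output and labels
    loss.backward()                         # back-propagate the loss value
    optimizer.step()
```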
9. The method according to claim 3, wherein the generating of the speech corresponding to the input text further comprises:
determining the speech corresponding to the input text from a pre-established audio database; and
inputting the determined speech into a pre-trained mouth-shape coefficient generation model to obtain the mouth-shape coefficient sequence corresponding to the input text.
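For illustration, the database-lookup variant recited in claim 9 can be sketched as follows; the AUDIO_DB mapping and the stub standing in for the pre-trained mouth-shape coefficient generation model are assumptions of this sketch.

```python
import hashlib
from typing import Dict, List

# Hypothetical pre-established audio database: text -> path of recorded speech.
AUDIO_DB: Dict[str, str] = {
    "welcome back": "audio/welcome_back.wav",
    "goodbye": "audio/goodbye.wav",
}

def lookup_speech(text: str) -> str:
    """Determine the speech corresponding to the input text from the database."""
    key = text.strip().lower()
    if key not in AUDIO_DB:
        raise KeyError(f"no recorded speech for: {text!r}")
    return AUDIO_DB[key]

def mouth_shape_from_speech(audio_path: str, frames: int = 25) -> List[List[float]]:
    """Stub for the pre-trained mouth-shape coefficient model: derives a
    deterministic pseudo-sequence from the file name instead of the waveform."""
    digest = hashlib.sha1(audio_path.encode()).digest()
    return [[b / 255.0, (255 - b) / 255.0] for b in digest[:frames]]

path = lookup_speech("welcome back")
coeffs = mouth_shape_from_speech(path)
print(path, len(coeffs), "frames")
```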
10. An animation generation apparatus comprising:
an animation label acquisition unit configured to, in response to receiving an input text, acquire animation labels of the input text to obtain a correspondence between words in the input text and actions;
a speech generation unit configured to generate speech corresponding to the input text; and
an animation generation unit configured to combine a video, generated by rendering a preset character image model based on the obtained correspondence, with the generated speech to generate an animation of the character image.
11. The animation generation apparatus according to claim 10, wherein the speech generation unit is further configured to:
analyze the input text to generate a phoneme sequence; and
synthesize, based on the generated phoneme sequence, the speech corresponding to the input text.
12. The animation generation apparatus according to claim 11, wherein the speech generation unit is further configured to:
determine a mouth-shape coefficient sequence corresponding to the phoneme sequence;
and the animation generation unit is further configured to:
combine a video, generated by rendering the preset character image model based on the obtained correspondence and the mouth-shape coefficient sequence, with the generated speech to generate the animation of the character image.
13. The animation generation apparatus according to any one of claims 10 to 12, wherein the animation label acquisition unit is further configured to:
input the input text into a pre-trained animation label acquisition model to obtain a correspondence between the words in the input text and animation labels; and
generate an action coefficient sequence based on the correspondence between the words in the input text and the animation labels, and use the action coefficient sequence as the correspondence between the words in the input text and the actions.
14. The apparatus according to claim 13, wherein the animation labels comprise expression labels and action labels, the animation label acquisition model comprises an emotion prediction sub-model and an action prediction sub-model, and the animation label acquisition unit is further configured to:
input the input text into the pre-trained emotion prediction sub-model to obtain a correspondence between the words contained in the input text and emotional tendencies of the words;
determine a correspondence between the words contained in the input text and preset expression labels based on the correspondence between the words and their emotional tendencies; and
input the correspondence between the words contained in the input text and the preset expression labels into the action prediction sub-model to obtain an action coefficient sequence, wherein the action coefficient sequence indicates a correspondence between the words contained in the input text and action labels of the words.
15. The apparatus according to claim 14, wherein the action indicated by an action label comprises at least one of a limb action, a torso action, and a head action.
16. The apparatus according to claim 12, wherein the speech generation unit is further configured to:
input the input text into a pre-established speech and mouth-shape generation model to obtain the speech and the mouth-shape coefficient sequence corresponding to the input text.
17. The apparatus according to claim 16, wherein the pre-established speech and mouth-shape generation model is trained by:
establishing a sequence-to-sequence model as an initial speech and mouth-shape generation model;
inputting a training sample into the initial speech and mouth-shape generation model to obtain an output of the initial speech and mouth-shape generation model, wherein the training sample comprises a speech corpus, labels of the speech corpus, a video corpus, and labels of the video corpus; and
back-propagating, through the initial speech and mouth-shape generation model, a loss value determined by a preset loss function between the output and the labels of the training sample, so as to train the initial speech and mouth-shape generation model.
18. The apparatus according to claim 12, wherein the speech generation unit is further configured to:
determine the speech corresponding to the input text from a pre-established audio database; and
input the determined speech into a pre-trained mouth-shape coefficient generation model to obtain the mouth-shape coefficient sequence corresponding to the input text.
19. An electronic device, comprising:
one or more processors;
a storage device storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1 to 9.
20. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 9.
CN201811037239.0A 2018-09-06 2018-09-06 Animation generation method and device Pending CN110880198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811037239.0A CN110880198A (en) 2018-09-06 2018-09-06 Animation generation method and device

Publications (1)

Publication Number Publication Date
CN110880198A true CN110880198A (en) 2020-03-13

Family

ID=69727219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811037239.0A Pending CN110880198A (en) 2018-09-06 2018-09-06 Animation generation method and device

Country Status (1)

Country Link
CN (1) CN110880198A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004095308A1 (en) * 2003-04-21 2004-11-04 Eulen, Inc. Method and system for expressing avatar that correspond to message and sentence inputted of using natural language processing technology
CN106228119A (en) * 2016-07-13 2016-12-14 天远三维(天津)科技有限公司 A kind of expression catches and Automatic Generation of Computer Animation system and method
CN106599824A (en) * 2016-12-09 2017-04-26 厦门大学 GIF cartoon emotion identification method based on emotion pairs
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
CN107330961A (en) * 2017-07-10 2017-11-07 湖北燿影科技有限公司 A kind of audio-visual conversion method of word and system

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021232875A1 (en) * 2020-05-18 2021-11-25 北京搜狗科技发展有限公司 Method and apparatus for driving digital person, and electronic device
CN113689530B (en) * 2020-05-18 2023-10-20 北京搜狗科技发展有限公司 Method and device for driving digital person and electronic equipment
CN113689530A (en) * 2020-05-18 2021-11-23 北京搜狗科技发展有限公司 Method and device for driving digital person and electronic equipment
WO2022048405A1 (en) * 2020-09-01 2022-03-10 魔珐(上海)信息科技有限公司 Text-based virtual object animation generation method, apparatus, storage medium, and terminal
CN112184858B (en) * 2020-09-01 2021-12-07 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal
US11908451B2 (en) 2020-09-01 2024-02-20 Mofa (Shanghai) Information Technology Co., Ltd. Text-based virtual object animation generation method, apparatus, storage medium, and terminal
CN112633129A (en) * 2020-12-18 2021-04-09 深圳追一科技有限公司 Video analysis method and device, electronic equipment and storage medium
CN113744369A (en) * 2021-09-09 2021-12-03 广州梦映动漫网络科技有限公司 Animation generation method, system, medium and electronic terminal
WO2023050650A1 (en) * 2021-09-29 2023-04-06 平安科技(深圳)有限公司 Animation video generation method and apparatus, and device and storage medium
CN114173188A (en) * 2021-10-18 2022-03-11 深圳追一科技有限公司 Video generation method, electronic device, storage medium, and digital human server
WO2023124933A1 (en) * 2021-12-31 2023-07-06 魔珐(上海)信息科技有限公司 Virtual digital person video generation method and device, storage medium, and terminal
CN115129212A (en) * 2022-05-30 2022-09-30 腾讯科技(深圳)有限公司 Video editing method, video editing device, computer equipment, storage medium and product
CN116993873A (en) * 2023-07-31 2023-11-03 支付宝(杭州)信息技术有限公司 Digital human action arrangement method and device
CN116993873B (en) * 2023-07-31 2024-05-17 支付宝(杭州)信息技术有限公司 Digital human action arrangement method and device

Similar Documents

Publication Publication Date Title
CN110880198A (en) Animation generation method and device
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN107945786B (en) Speech synthesis method and device
CN107657017B (en) Method and apparatus for providing voice service
US11475897B2 (en) Method and apparatus for response using voice matching user category
CN109754783A (en) Method and apparatus for determining the boundary of audio sentence
CN112399258A (en) Live playback video generation playing method and device, storage medium and electronic equipment
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
CN112765971B (en) Text-to-speech conversion method and device, electronic equipment and storage medium
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
CN109697978B (en) Method and apparatus for generating a model
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
CN111354362A (en) Method and device for assisting hearing-impaired communication
CN113282791B (en) Video generation method and device
CN112383721B (en) Method, apparatus, device and medium for generating video
CN110930975A (en) Method and apparatus for outputting information
CN113223555A (en) Video generation method and device, storage medium and electronic equipment
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN112785667A (en) Video generation method, device, medium and electronic equipment
CN112381926A (en) Method and apparatus for generating video
CN111415662A (en) Method, apparatus, device and medium for generating video
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN114255737B (en) Voice generation method and device and electronic equipment
CN113079328B (en) Video generation method and device, storage medium and electronic equipment
CN115222857A (en) Method, apparatus, electronic device and computer readable medium for generating avatar

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200313)