CN116309965A - Animation generation method and device, computer readable storage medium and terminal

Animation generation method and device, computer readable storage medium and terminal

Info

Publication number
CN116309965A
CN116309965A
Authority
CN
China
Prior art keywords
word
target
action
action data
animation
Prior art date
Legal status
Pending
Application number
CN202211732261.3A
Other languages
Chinese (zh)
Inventor
施跇
刘博�
王斌
柴金祥
Current Assignee
Shanghai Movu Technology Co Ltd
Mofa Shanghai Information Technology Co Ltd
Original Assignee
Shanghai Movu Technology Co Ltd
Mofa Shanghai Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Movu Technology Co Ltd, Mofa Shanghai Information Technology Co Ltd filed Critical Shanghai Movu Technology Co Ltd
Priority to CN202211732261.3A
Publication of CN116309965A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

An animation generation method and device, a computer readable storage medium, and a terminal. The method comprises: acquiring a text; performing word segmentation on the text to obtain an initial set; judging whether each word in the initial set is a target word and, if so, adding it to a target set, where a target word is a word whose similarity value with at least one tag in a preset tag set is greater than or equal to a preset similarity threshold, the preset tag set comprising a plurality of preset tags; acquiring the action data of each target word from a preset action database according to the tag corresponding to that target word in the target set, where the action database stores the action data corresponding to each tag and each tag corresponds to at least one group of action data; and generating, based on the acquired action data matched with each target word, an action set corresponding to the text, the action set being used to generate the animation. With this scheme, a corresponding animation can be generated from the text, meeting the needs of application scenarios with interaction requirements.

Description

Animation generation method and device, computer readable storage medium and terminal
Technical Field
The embodiment of the invention relates to the technical field of natural language processing, in particular to an animation generation method and device, a computer readable storage medium and a terminal.
Background
With the rise of the metaverse concept and the development of multimedia technology, there is an increasing need to generate character animation from natural language. Existing generation schemes, however, are generally applicable only to narrative text that depicts character actions.
In conventional schemes, a character's action is described in the third person and the action is generated from that description; for example, for the text "He is running", the generated action is running. However, some fields involve application scenarios with interaction requirements, such as virtual live streaming, where the body movements of a presenter must be generated from the presenter's speech text. Conventional generation schemes, being suited only to descriptive text that narrates a character's actions, cannot meet the need to generate body movements in such interactive application scenarios.
Disclosure of Invention
The technical problem addressed by the embodiments of the invention is that existing generation schemes are generally suitable only for descriptive text that narrates a character's actions and cannot meet the need to generate a character's body movements in application scenarios with interaction requirements.
In order to solve the above technical problem, an embodiment of the present invention provides an animation generation method, including: acquiring a text; performing word segmentation on the text to obtain an initial set, the initial set comprising one or more words; judging whether each word in the initial set is a target word and, if so, adding it to a target set, where a target word is a word whose similarity value with at least one tag in a preset tag set is greater than or equal to a set similarity threshold, the preset tag set comprising a plurality of preset tags; acquiring the action data of each target word from a preset action database according to the tag corresponding to that target word in the target set, where the action database stores the action data corresponding to each tag and each tag corresponds to at least one group of action data; and generating, based on the acquired action data matched with each target word, an action set corresponding to the text, the action set being used to generate an animation.
Optionally, determining whether each word in the initial set is a target word includes: for each word in the initial set, comparing the vector of the word with the vector of each tag for similarity.
Optionally, the vector of a tag is obtained as follows: acquiring the keywords corresponding to each tag; and obtaining the vector of each tag according to the vectors of the keywords corresponding to that tag.
Optionally, for each word in the initial set, the vector of the word is obtained as follows: performing a weighted average of the vectors corresponding to all the characters in the word, and taking the weighted-average vector as the vector of the word.
Optionally, the determining whether each word in the initial set is a target word, if so, adding the target word into the target set includes: s1, adding words with similarity values with labels in the initial set being greater than or equal to the similarity threshold value into a candidate set; s2, taking the word with the maximum similarity value in the candidate set as a reference word, putting the word into a successful matching set, and comparing the rest words in the candidate set with the reference word; s3, eliminating words with similarity values smaller than the reference words and overlapping words with the reference words from the candidate set, and updating the candidate set; and S4, repeating the steps S2 and S3 until the candidate set is an empty set, and taking the successful matching set as the target set.
Optionally, the animation generation method further includes: matching each word in the initial set with first specific words in a first specific set, where a first specific word is a word that is not to be matched against tags, and removing from the initial set any word that is identical to a first specific word or is a subset of a first specific word; or, matching each word in the target set with first specific words in the first specific set and removing from the target set any word that is identical to a first specific word or is a subset of a first specific word.
Optionally, the animation generation method further includes: matching the text with second specific words in a second specific set by regular-expression matching to obtain a forced matching set, where the forced matching set comprises one or more forced matching words and a second specific word is a forced matching word; and, if a forced matching word overlaps a word in the initial set or the target set, overriding the overlapping word in the initial set or the target set with the forced matching word.
Optionally, generating, based on the acquired action data matched with each target word, the action set corresponding to the text, the action set being used to generate an animation, includes: converting the text into voice, and obtaining the word boundary of each target word and the position of each target word in the voice; and obtaining the action set corresponding to the text according to the word boundary of each target word, the position of each target word in the voice, and the action data matched with each target word.
Optionally, obtaining the action set corresponding to the text according to the word boundary of each target word, the position of each target word in the voice, and the action data matched with each target word includes: for two adjacent target words, according to the position of each target word in the voice, taking the action data corresponding to the earlier target word as a start key frame and the action data corresponding to the later target word as an end key frame; generating, based on the start key frame and the end key frame, a transition animation between the start key frame and the end key frame to obtain the transition animation corresponding to the adjacent target words; and obtaining the action set corresponding to the text based on the action data corresponding to each target word and the transition animation corresponding to the adjacent target words.
Optionally, generating, based on the start key frame and the end key frame, the transition animation between the start key frame and the end key frame to obtain the transition animation corresponding to the adjacent target words includes: acquiring the blank duration of the blank time period between the start key frame and the end key frame; calculating, according to the blank duration and a preset frame rate, the total number N of frames of the transition animation and the position of each transition frame, N being a positive integer; calculating the action data of each transition frame, where the action data of the first transition frame is obtained according to the action data of the start key frame, the action data of the end key frame, and the position of the first transition frame, and the action data of the (i+1)-th transition frame is obtained according to the action data of the start key frame, the action data of the end key frame, the action data of the i-th transition frame, and the position of the (i+1)-th transition frame, with 1 ≤ i ≤ N-1 and i a positive integer; and generating the transition animation based on the action data and positions of the N transition frames.
The embodiment of the invention also provides an animation generation device, comprising: a text acquisition unit configured to acquire a text; a word segmentation unit configured to perform word segmentation on the text to obtain an initial set, the initial set comprising one or more words; a determining unit configured to judge whether each word in the initial set is a target word and, if so, add it to a target set, where a target word is a word whose similarity value with at least one tag in a preset tag set is greater than or equal to a set similarity threshold, the preset tag set comprising a plurality of preset tags; an action data acquisition unit configured to acquire the action data of each target word from a preset action database according to the tag corresponding to that target word in the target set, where the action database stores the action data corresponding to each tag and each tag corresponds to at least one group of action data; and a generating unit configured to generate, based on the acquired action data matched with each target word, an action set corresponding to the text, the action set being used to generate an animation.
The embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the animation generation methods described above.
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of any animation generation method when running the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
in the embodiments of the invention, an initial set is obtained by performing word segmentation on the acquired text. Each word in the initial set is compared for similarity with each tag in a preset tag set; a word whose similarity is greater than or equal to a set similarity threshold is taken as a target word and added to a target set. According to the tag corresponding to each target word in the target set, the action data corresponding to that target word is acquired from a preset action database, which stores the action data corresponding to each tag, each tag corresponding to at least one group of action data. Based on the acquired action data matched with each target word, an action set corresponding to the text is generated and used to generate the animation. In this way, an animation corresponding to the text is generated from the text, the body movements corresponding to the text can be expressed through the animation, and the needs of application scenarios with interaction requirements can be met.
Drawings
FIG. 1 is a flow chart of an animation generation method in an embodiment of the present invention;
FIG. 2 is a flow chart of one embodiment of step 13 of FIG. 1;
FIG. 3 is a flow chart of a transitional animation generation method in an embodiment of the present invention;
FIG. 4 is a flow chart of one embodiment of step 34 of FIG. 3;
FIG. 5 is a schematic diagram of the generation of motion data in an embodiment of the invention;
FIG. 6 is a training flow diagram of a motion data generation network model in an embodiment of the invention;
fig. 7 is a schematic diagram of an animation generation device according to an embodiment of the present invention.
Detailed Description
As described above, in conventional schemes a character's action is described in the third person and the action is generated from that description; for example, for the text "He is running", the generated action is running. However, some fields involve application scenarios with interaction requirements, such as virtual live streaming, where the body movements of a presenter must be generated from the presenter's speech text. Conventional generation schemes, being suited only to descriptive text that narrates a character's actions, cannot meet the need to generate body movements in such interactive application scenarios.
In the embodiments of the invention, an initial set is obtained by performing word segmentation on the acquired text. Each word in the initial set is compared for similarity with each tag in a preset tag set; a word whose similarity is greater than or equal to a set similarity threshold is taken as a target word and added to a target set. According to the tag corresponding to each target word in the target set, the action data corresponding to that target word is acquired from a preset action database, which stores the action data corresponding to each tag, each tag corresponding to at least one group of action data. Based on the acquired action data matched with each target word, an action set corresponding to the text is generated and used to generate the animation. In this way, an animation corresponding to the text is generated from the text, the body movements corresponding to the text can be expressed through the animation, and the needs of application scenarios with interaction requirements can be met.
In order to make the above objects, features and advantages of the embodiments of the present invention more comprehensible, the following detailed description of the embodiments of the present invention refers to the accompanying drawings.
The embodiment of the invention provides an animation generation method which can generate an animation according to a text. Referring to fig. 1, a flowchart of an animation generation method in an embodiment of the present invention is provided, where the animation generation method specifically may include the following steps:
Step 11, acquiring a text;
step 12, word segmentation is carried out on the text to obtain an initial set, wherein the initial set comprises one or more words;
step 13, judging whether each word in the initial set is a target word, if so, adding the target word into the target set, wherein the target word refers to a word with a similarity value with at least one tag in a preset tag set being greater than or equal to a set similarity threshold, and the preset tag set comprises a plurality of preset tags;
step 14, according to the labels corresponding to the target words in the target set, obtaining the action data of the target words from a preset action database, wherein the action database is used for storing the action data corresponding to the labels, and each label corresponds to at least one group of action data;
and 15, generating an action set corresponding to the text based on the acquired action data matched with each target word, wherein the action set is used for generating animation.
In a specific implementation, the text in step 11 may be input text or text obtained from voice. The text may come from a viewer, a live-streaming subject or video editor, a virtual-person presenter, and so on.
In some embodiments, in a scenario where an animated video is generated from text, the text may be the lines to be delivered by a virtual person. The specific content of the lines can be preset according to the actual requirements.
In the live broadcast field, the text may be a speech commonly used in live broadcast. The specific content of the lines can be preset or input in real time according to the actual live broadcast scene.
In some embodiments, in a live scene, text may be entered through a human-machine interaction interface.
In other embodiments, common text is preset and stored, and the acquired text may be text selected by the user.
In still other embodiments, a plurality of types of operation keys are provided on the man-machine interaction interface, and the different operation keys are respectively associated with corresponding texts. When the operation key is clicked or touched, a corresponding text is obtained. For example, the operation key is a question key, and when the question key is clicked or touched, a question text is obtained. For another example, the operation key is a heart rate key, and when the heart rate key is clicked or touched, heart rate text is obtained.
In the specific implementation of step 12, word segmentation is performed on the text to obtain a word segmentation result, word pairing is then performed on the word segmentation result, and the initial set is obtained from the word pairing result.
In one non-limiting embodiment, the sentences in the text may be broken up character by character to obtain the word segmentation result. Take the text "你好，我是小明。" ("Hello, I am Xiaoming.") as an example: segmenting it by characters gives "你/好/，/我/是/小/明/。". The sentences in the text may also be broken up word by word, in which case the segmentation result is "你好/，/我/是/小明/。". It should be noted that word segmentation in the embodiments of the invention is meant in a broad sense, i.e., the text is divided into a number of segments: when characters are the unit, each segment is a single character; when words are the unit, each segment is a word, which may consist of one character or of several characters.
Furthermore, in order to obtain accurate word boundaries, word pairing can turn fine-grained segmentation into coarser-grained segmentation so as to obtain a segmentation result of suitable granularity. This helps the matching method obtain accurate word boundaries and thus accurately trigger positions on the action time axis, i.e., the positions of the target words on the time axis.
In one non-limiting embodiment, the initial set may be obtained in the following manner. Specifically, the text is input into a pre-trained language model to obtain a vector for each character; the number of characters equals the number of vectors. For Chinese, each written character is treated as one character, and a punctuation mark may also be treated as a character. The language model may be any Transformer-based language model that outputs character vectors, such as a RoFormer model, a BERT model, an ALBERT model, and the like.
The full name of the BERT model is Bidirectional Encoder Representations from Transformers (BERT), i.e., a Transformer-based bidirectional encoder; the BERT model is a pre-trained language model.
For example, for the text "你好，我是小明。", "你" corresponds to vector T1, "好" corresponds to vector T2, ..., and "。" corresponds to vector T8, giving 8 vectors T1 to T8 in total.
The language model can be trained on a large-scale corpus in a self-supervised manner. The training task is essentially a cloze (fill-in-the-blank) task: a character or word is masked and the model predicts what was masked from the context, so the language model learns the regularities of the language. Its output is set up to produce one vector for each input unit (for Chinese text, the unit is usually the character).
For example, the input to the language model is "（）好，我是小明。" with the first character masked, and the expected output is "（你）好，我是小明。". The output is compared with the correct result, and the language model is adjusted and trained accordingly. After training is completed, "你好，我是小明。" is input into the language model; taking characters as units, 8 vectors T1 to T8 are obtained in total. The language model outputs vectors, and each output vector corresponds to an input character.
A sequence of fragments (N-grams) of length N corresponding to the input text is then generated. The N-gram is an algorithm based on a statistical language model; its basic idea is to slide a window of size N over the content of the text, forming a sequence of fragments of length N. Each fragment is called a gram. The occurrence frequency of all grams is counted and filtered against a preset threshold to form a list of key grams, i.e., the vector feature space of the text, in which each gram is one feature-vector dimension.
Binary Bi-grams and ternary Tri-grams are commonly used. A Bi-gram groups every two adjacent characters/words of a sentence, from beginning to end, into one unit; a Tri-gram groups every three adjacent characters/words into one unit. Word pairs are obtained through such word pairing, and the initial set is obtained from the word pairs. The initial set thus includes the unigrams, bigrams, ..., up to the N-grams.
Take the text "你好，我是小明。" as an example. If the text is segmented by characters, word pairs are generated over the character vectors T1 to T8. Assuming N = 3, the resulting pairs include the bigrams <你好>, <好，>, <，我>, <我是>, <是小>, <小明>, <明。> and the trigrams <你好，>, <好，我>, <，我是>, <我是小>, <是小明>, <小明。>. The initial set then includes the unigrams <你>, <好>, <，>, <我>, <是>, <小>, <明>, <。> together with these bigrams and trigrams.
For another example, if the text is segmented by words, the segmentation result is "你好/，/我/是/小明/。". Assuming N = 3, the resulting word pairs include <你好，>, <，我>, <我是>, <是小明>, <小明。>, <你好，我>, <，我是>, <我是小明>, <是小明。>, and so on. The initial set then includes the word-level units <你好>, <，>, <我>, <是>, <小明>, <。> together with these word pairs.
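As a minimal sketch of the character segmentation and N-gram word pairing described above (assuming N = 3; the function names are illustrative and not part of the patent):

    # Minimal sketch of character-level segmentation plus N-gram word pairing.
    def char_segment(text):
        # Break the sentence into single units; punctuation counts as a character.
        return list(text)

    def ngram_word_pairs(units, n=3):
        # Slide windows of size 1..n over the segmentation result and collect
        # every contiguous span; the union of all spans forms the initial set.
        spans = []
        for size in range(1, n + 1):
            for start in range(len(units) - size + 1):
                spans.append("".join(units[start:start + size]))
        return spans

    text = "你好，我是小明。"
    initial_set = ngram_word_pairs(char_segment(text), n=3)
    # ['你', '好', ..., '你好', '好，', ..., '你好，', '好，我', ...]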
In a specific implementation of step 13, for each word in the initial set, a similarity comparison is performed on the vector of each word and the vector of each tag to determine whether each word is a target word.
For each word in the initial set, the vector of the word may be obtained as follows: the vectors of all characters contained in the word are acquired, a weighted average of those character vectors is computed, and the weighted-average vector is taken as the vector of the word.
For example, for the word "你好，", the character vector of "你" is T1, that of "好" is T2, and that of "，" is T3, so the word vector of "你好，" is (T1+T2+T3)/3.
For another example, for the word "是小明", the character vector of "是" is T5 and the vector of "小明" is T9, so the vector of "是小明" is (T5+T9)/2, where T9 may be equal to (T6+T7)/2.
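A small numpy sketch of the weighted-average rule above, assuming equal weights and an arbitrary vector dimension (both are assumptions; the patent fixes neither):

    import numpy as np

    # Character vectors as produced by the language model (random stand-ins here);
    # the 768-dimensional size is an assumption, not specified by the patent.
    char_vectors = {ch: np.random.rand(768) for ch in "你好，我是小明。"}

    def word_vector(word, char_vectors, weights=None):
        # Weighted average of the character vectors of the word; with equal
        # weights this reduces to the plain mean used in the examples above.
        vecs = np.stack([char_vectors[ch] for ch in word])
        if weights is None:
            weights = np.ones(len(vecs)) / len(vecs)
        return np.average(vecs, axis=0, weights=weights)

    v = word_vector("你好，", char_vectors)   # corresponds to (T1 + T2 + T3) / 3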
In a specific implementation, the preset tag set may include a plurality of preset tags. The tags are predefined for the product and are used to describe the semantics of the language; different tags represent different semantics. Each tag is also paired with actions in advance; pairing a tag with actions means pairing the tag with the corresponding action data, so that the corresponding action data can be found by looking up the tag. The tags may be, for example, greeting, promoting, praising, thanking, and so on. It will be appreciated that other tags may also be provided and configured as needed.
Each tag may correspond to one group of action data or to multiple groups of action data. For example, the action data corresponding to the greeting tag may be action data for waving one hand in greeting, or action data for waving both hands in greeting. A group of action data may include the action data of one frame or of multiple frames.
In a specific implementation, the vector of a tag may be obtained as follows: acquiring the keywords corresponding to each tag, and obtaining the vector of each tag from the vectors of the keywords corresponding to that tag.
A tag may correspond to one keyword or to multiple keywords. When a tag corresponds to one keyword, the vector of the tag is the vector of that keyword. When a tag corresponds to multiple keywords, the tag has multiple vectors, one for each keyword; the number of vectors of the tag equals the number of its keywords.
For example, the keywords corresponding to the greeting tag may include: "hello", "good afternoon", "good evening", and so on; in that case the greeting tag corresponds to four vectors. As another example, the keywords corresponding to the thank-you tag may include: "heartfelt thanks", "thank you very much", and so on; the thank-you tag then has two vectors.
Further, in order to improve the accuracy of matching the word with the tag, a corresponding example sentence can be provided for the keyword corresponding to the tag, wherein the example sentence comprises the keyword. By configuring example sentences for the tags, ambiguity can be effectively eliminated, and matching accuracy is improved.
When the labeled keyword corresponds to the example sentence, the example sentence can be input into the language model to obtain a vector corresponding to the keyword in the example sentence, and the vector of the keyword in the example sentence is used as the vector of the label.
When judging whether a word in the initial set is a target word, the similarity between the word and each tag is compared. Specifically, the vector of the word is compared with the vector of each tag; if a tag corresponds to several vectors, the vector of the word is compared with each of the tag's vectors in turn.
For example, suppose the preset tag set includes two tags: a greeting tag and a thank-you tag, where the greeting tag corresponds to four vectors and the thank-you tag to two. Each word in the initial set then needs to be compared for similarity with six vectors (the four vectors of the greeting tag and the two vectors of the thank-you tag).
When the similarity between each word in the initial set and each tag in the preset tag set is calculated, the calculation may be performed in matrix form, or one word may be compared with one tag vector at a time; this is not limited herein.
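As one way to picture the matrix-style computation, here is a minimal cosine-similarity sketch; the function names and the choice of cosine similarity are illustrative (the text also allows Euclidean or Manhattan distance):

    import numpy as np

    def cosine_similarity_matrix(word_vecs, tag_vecs):
        # word_vecs: (num_words, dim); tag_vecs: (num_tag_vectors, dim), where a
        # tag with several keywords contributes several rows. Returns the
        # all-pairs cosine similarity matrix in a single matrix product.
        w = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
        t = tag_vecs / np.linalg.norm(tag_vecs, axis=1, keepdims=True)
        return w @ t.T

    def best_similarity_per_word(word_vecs, tag_vecs):
        # A word is a candidate target word when its best similarity against any
        # vector of any tag reaches the preset similarity threshold.
        return cosine_similarity_matrix(word_vecs, tag_vecs).max(axis=1)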
In some embodiments, the words in the initial set, whose similarity value with the tag is greater than or equal to the similarity threshold, are used as target words, the obtained target words are added into the target set, and the words in the target set are words matched with the tag.
The target set obtained in this way may contain multiple target words corresponding to the same tag, with overlapping characters among them. In that case the word that best matches the tag needs to be selected from these words. To this end, in other embodiments of the invention the target set may be obtained based on non-maximum suppression (Non-Maximum Suppression, NMS). Specifically, referring to fig. 2, a flowchart of one embodiment of step 13 in fig. 1, step 13 specifically includes the following steps (a short code sketch of this selection procedure is given after step S4):
S1, adding words with similarity values with labels in the initial set being greater than or equal to the similarity threshold value into a candidate set.
In specific implementation, the similarity between the target word and the tag can be calculated by adopting cosine similarity, the similarity between the target word and the tag can be calculated by adopting Euclidean distance, and the similarity between the target word and the tag can be calculated by adopting Manhattan distance. It will be appreciated that other suitable algorithms may be used to calculate the similarity between the target word and the tag, and this is not illustrated here.
S2, taking the word with the maximum similarity value in the candidate set as a reference word, putting the word into a successful matching set, and comparing the rest words in the candidate set with the reference word.
Comparing the remaining words in the candidate set with the reference word involves both comparing their similarity values and checking, based on each word's beginning and end positions, whether the word overlaps the reference word.
And S3, eliminating words with similarity values smaller than the reference words and overlapping words with the reference words from the candidate set, and updating the candidate set.
And S4, repeating the steps S2 and S3 until the candidate set is an empty set, and taking the successful matching set as the target set.
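A minimal sketch of this selection procedure, assuming each candidate carries its character span and best tag similarity (the dictionary layout is illustrative, not prescribed by the patent):

    def select_target_words(candidates):
        # candidates: list of dicts with 'word', 'start', 'end' (character span in
        # the text) and 'score' (best similarity against any tag), already filtered
        # by the similarity threshold (step S1).
        def overlaps(a, b):
            return a['start'] < b['end'] and b['start'] < a['end']

        remaining = list(candidates)     # candidate set
        matched = []                     # successful matching set
        while remaining:                 # S4: repeat until the candidate set is empty
            best = max(remaining, key=lambda c: c['score'])   # S2: reference word
            matched.append(best)
            # S3: remove the reference word and every lower-scoring word whose
            # character span overlaps it, then continue with the updated set.
            remaining = [c for c in remaining
                         if c is not best and not overlaps(c, best)]
        return matched                   # used as the target set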
The action database can be preset according to the actual application scene requirement. The action data corresponding to each tag may be preset. When the target word accords with a certain tag, motion data corresponding to the corresponding tag is obtained from the motion database, and the obtained motion data is used as motion data corresponding to the target word. One tag may correspond to one set of motion data or may correspond to multiple sets of motion data.
When the tag corresponding to a target word corresponds to multiple groups of action data, one group may be selected at random as the action data for that target word; alternatively, the action data for the target word may be determined based on other target words or on action data already selected, i.e., the action data selected this time may be determined from action data that has already appeared or been selected. For example, action data different from the previously selected action data may be chosen this time.
Further, in some non-limiting embodiments, the initial set may also be processed as follows: each word in the initial set is matched against the first specific words in a first specific set, where a first specific word is a word that is not to be matched against tags, and any word that is identical to a first specific word or is a subset of a first specific word is removed from the initial set.
Further, in other non-limiting embodiments, the target set may also be processed as follows: each word in the target set is matched against the first specific words in the first specific set, and any word that is identical to a first specific word or is a subset of a first specific word is removed from the target set.
A first specific word in the first specific set is a word that should not be matched, and any word that is a subset of it is not matched either; here a subset means a word that appears within the first specific word. For example, if a first specific word is "do not want", then "want", which appears inside it, is a subset of "do not want" and is not matched against the tags.
Further, in still other non-limiting embodiments, the initial set or the target set may also be processed as follows: matching the text with a second specific word in a second specific set by using a regular matching mode to obtain a forced matching set, wherein the forced matching set comprises one or more forced matching words, and the second specific word is a forced matching word; and if the forced matching word has word overlapping with the word in the initial set or the target set, covering the word in the initial set or the target set with the forced matching word with word overlapping.
A second specific word in the second specific set is a colloquial or internet expression. Such expressions generally do not match well with the model, and using regular-expression matching improves the accuracy of the words in the resulting initial set or target set. In general, many internet expressions carry some ambiguity: new usages of existing words; usages or errors that have no formal Chinese semantics, or elliptical usages; and newly coined words absent from the training corpus. These three situations can cause missed or incorrect matches.
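A minimal sketch of the regular-expression forced matching follows; the example forced-match words and the returned span format are illustrative assumptions, since the actual second specific set is application-defined:

    import re

    # Illustrative forced-match vocabulary (slang or internet phrases that the
    # embedding model may miss).
    FORCED_WORDS = ["yyds", "绝绝子"]

    def forced_matches(text):
        # Regular-expression matching returns each occurrence with its character
        # span, so that any overlapping word in the initial set or target set can
        # be overridden by the forced matching word.
        pattern = re.compile("|".join(re.escape(w) for w in FORCED_WORDS))
        return [{"word": m.group(0), "start": m.start(), "end": m.end()}
                for m in pattern.finditer(text)]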
As can be seen from the above, the initial set is obtained by performing word segmentation on the acquired text. Each word in the initial set is compared for similarity with each tag in the preset tag set; words whose similarity is greater than or equal to the set similarity threshold are taken as target words and added to the target set. According to the tag corresponding to each target word in the target set, the action data corresponding to that target word is acquired from the preset action database, which stores the action data corresponding to each tag, each tag corresponding to at least one group of action data. Based on the acquired action data matched with each target word, an action set corresponding to the text is generated and used to generate the animation. In this way, an animation corresponding to the text is generated from the text, the body movements corresponding to the text can be expressed through the animation, and the needs of application scenarios with interaction requirements can be met.
In addition, by determining the initial set and the target set through the above scheme and combining them with the preset tag set, the target words in the text can be determined accurately, so that target words can be found and word boundaries determined accurately even when the word segmentation system is imperfect or unavailable. This animation generation method facilitates an end-to-end motion generation system based on natural language and enables vivid motion generation in live streaming and other scenarios.
In some embodiments, each frame of an action may be represented by the displacements of the bones and the rotation angles of the bones. The bones are predefined and may be represented by a root bone together with the other bones: the root bone has no parent node, every other bone has a parent node, and these parent-child relationships define the hierarchy among the bones. The position information of the root bone is given in the world coordinate system; every other bone has its own local coordinate system, and the local coordinate systems define the transformation between the bones and the world coordinate system.
The displacement and rotation angle of all bones constitute each frame of motion, i.e. each frame of motion can be represented by the displacement and rotation angle of all bones. Wherein displacement of bone refers to a change in position of the bone relative to the parent node. Rotation angle refers to the rotational change of each bone in its local coordinate system. The displacement and the rotation angle need to be associated with the bone in order to make sense, i.e. the action is determined by a combination of displacement and rotation angle. For each bone, there are some properties that can be changed, which can include displacement and rotation angle, i.e. the displacement and rotation angle of the bone for different actions are different.
In some embodiments of the invention, a group of action data may include rotation angles and displacements, where the rotation angles are the rotation angles of the bones and the displacements are the displacements of the bones. In particular, when the position of the subject performing the action does not change, the displacements of some or all of the bones are zero.
When the tag corresponds to a plurality of sets of motion data, each set of motion data (including rotation angle and displacement) corresponds to one motion, and the plurality of sets of motion data corresponds to a plurality of motions. A set of action data is selected for each successfully tagged target word.
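As an illustration of how a group of action data could be organized in code, here is a minimal sketch; the class names and the Euler-angle rotation format are assumptions, since the patent only states that each frame consists of bone displacements and rotation angles:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class BonePose:
        # Per-bone data for one frame: displacement relative to the parent bone
        # and rotation in the bone's local coordinate system.
        displacement: Tuple[float, float, float]
        rotation: Tuple[float, float, float]

    @dataclass
    class Frame:
        # One frame of an action: the pose of every bone in the predefined skeleton.
        poses: List[BonePose] = field(default_factory=list)

    @dataclass
    class ActionClip:
        # One group of action data attached to a tag: one or more frames.
        tag: str
        frames: List[Frame] = field(default_factory=list)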
In step 15, based on the obtained motion data matched with each target word, a motion set corresponding to the text is generated, and the motion set is used for generating an animation.
In a specific implementation of step 15, the text may be converted into voice, and the word boundary of each target word and the position of each target word in the voice are obtained. The action set corresponding to the text is then obtained according to the word boundary of each target word, the position of each target word in the voice, and the action data matched with each target word. A word boundary is the end of one word and the beginning of the next.
For example, a Text-To-Speech (TTS) module is used to convert the text into voice, and the TTS module outputs timestamp information calculated from the generated voice. The position of each target word, i.e. its time position, is obtained from this timestamp information.
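A minimal sketch of mapping target words onto the voice timeline follows; the per-character timestamp format and all names are assumptions, since TTS engines expose timing information in different ways:

    def locate_target_words(target_words, char_times):
        # target_words: dicts with 'word', 'start', 'end' (character span in the text).
        # char_times: one (start_sec, end_sec) pair per character of the text,
        # derived from the TTS timestamp output.
        located = []
        for tw in target_words:
            start_sec = char_times[tw['start']][0]
            end_sec = char_times[tw['end'] - 1][1]
            located.append({**tw,
                            'start_sec': start_sec,
                            'end_sec': end_sec,
                            # one possible single "position": the middle moment
                            'position_sec': (start_sec + end_sec) / 2})
        return located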
Further, in order to improve smoothness of the obtained animation, motion data corresponding to each target word can be fused, so that transitional animation between adjacent target words can be obtained.
In some embodiments, fusing the action data corresponding to each target word to obtain the transition animation between adjacent target words may be implemented as follows: for two adjacent target words, according to the position of each target word in the voice, the action data corresponding to the earlier target word is taken as a start key frame and the action data corresponding to the later target word as an end key frame. Based on the start key frame and the end key frame, the transition animation between them is generated, yielding the transition animation corresponding to the adjacent target words. The action set corresponding to the text is then obtained based on the transition animation corresponding to the adjacent target words and the action data corresponding to each target word.
A specific method for generating the transition animation between the start key frame and the end key frame is described below with reference to fig. 3, a flowchart of a transition animation generation method in an embodiment of the present invention; a short code sketch of step 33 is given after the steps. The transition animation generation method may include the following steps 31 to 35:
Step 31, acquiring a start key frame and an end key frame.
Step 32, obtaining a blank time length of a blank time period between the start key frame and the end key frame.
The position of each target word in the voice is the time position of each target word in the voice, and the time position can be a time point. In speech, a certain period of time is usually occupied by a target word, so that each target word has a start time and an end time in the speech, that is, the actual position of the target word is a period of time in the whole period of time of the speech.
The position of the target word in the voice may be any time (e.g., take the middle time between the start time and the end time of the target word) between the start time of the target word in the voice and the end time of the target word.
In some non-limiting embodiments, when the position of a target word in the voice is taken as the period from its start time to its end time, the position of the start key frame may be the end time of the target word corresponding to the start key frame, and the position of the end key frame may be the start time of the target word corresponding to the end key frame. In this case, the period between the end time of the target word corresponding to the start key frame and the start time of the target word corresponding to the end key frame is the blank time period, and its duration is the blank duration.
In one non-limiting embodiment, the location of the starting key frame may be the end time of the corresponding action of the target word corresponding to the starting key frame, and the location of the ending key frame may be the start time of the target word corresponding to the ending key frame. At this time, the time period between the end time of the corresponding action of the target word corresponding to the start key frame and the start time of the target word corresponding to the end key frame is a blank time period, and the duration corresponding to the time period is a blank duration.
When the position of the target word is any time between the starting time of the target word in the voice and the ending time of the target word, the position of the starting key frame is the time of the target word corresponding to the starting key frame, and the position of the ending key frame can be the time of the target word corresponding to the ending key frame. At this time, the time period between the time of the target word corresponding to the start key frame and the time of the target word corresponding to the end key frame is a blank time period, and the duration corresponding to the time period is a blank duration.
In one embodiment, when the number of the start key frames and the end key frames is multiple frames, the position of the start key frame, that is, the start time, is the time corresponding to the start key frame of the first frame, and the position of the end key frame, that is, the end time, is the time corresponding to the end key frame of the last frame, so that the motion transition of the obtained transition animation is natural and smooth.
It should be noted that, when the position of the start key frame and the position of the end key frame are determined in other manners, the determining manners of the blank time periods are correspondingly different, and the determination is specifically performed according to the actual situation, which is not repeated herein.
And step 33, calculating the total frame number N of the transition animation and the position of each transition frame according to the blank time length and the preset frame rate, wherein N is a positive integer.
And step 34, calculating the action data of each transition frame, where the action data of the first transition frame is obtained according to the action data of the start key frame, the action data of the end key frame, and the position of the first transition frame, and the action data of the (i+1)-th transition frame is obtained according to the action data of the start key frame, the action data of the end key frame, the action data of the i-th transition frame, and the position of the (i+1)-th transition frame, with 1 ≤ i ≤ N-1 and i a positive integer.
And 35, generating the transition animation based on the action data and the positions of the N transition frames.
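As an illustration of step 33, the sketch below derives N and the transition-frame positions from the blank period; the function name and the convention that the blank duration spans both key frames are assumptions, chosen to match the 45 - 2 = 43 example given later in the training section:

    def transition_frame_layout(start_key_sec, end_key_sec, frame_rate=30):
        # Step 33: derive the total number N of transition frames and the time
        # position of each one from the blank duration and a preset frame rate.
        total_frames = int(round((end_key_sec - start_key_sec) * frame_rate))
        n_transition = max(total_frames - 2, 0)   # exclude the two key frames
        dt = 1.0 / frame_rate
        positions = [start_key_sec + (i + 1) * dt for i in range(n_transition)]
        return n_transition, positions

    n, positions = transition_frame_layout(10.0, 11.5)   # n == 43 at 30 fps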
In a specific implementation, step 31 may also acquire the action data of the start key frame and the action data of the end key frame at the same time.
In particular implementations, in step 34, the motion data for each transition frame may be calculated using a motion data generation network model. Specifically, referring to FIG. 4, which shows a flowchart of one embodiment of step 34 of FIG. 3, step 34 may include the steps of:
step 41, the motion data generation network model generates a first hidden variable and a first vector of the first transition frame according to the start vector corresponding to the action data of the start key frame, the end vector corresponding to the action data of the end key frame, the position vector of the first transition frame, and a zeroth hidden variable, where the first vector indicates the action data of the first transition frame.
Wherein the position vector of each frame is used to characterize the sequence number of each frame. Specifically, the sequence number of each frame can be converted into a mathematically expressed vector, which becomes a position vector of each frame.
In some embodiments, each frame of motion data may be represented by a displacement of the bone and a rotation angle of the bone, which may be represented by vectors.
And step 42, the motion data generation network model obtains the (i+1)-th vector of the (i+1)-th transition frame according to the i-th hidden variable, the i-th vector of the i-th transition frame, the end vector, and the position vector of the (i+1)-th transition frame.
For easy understanding, a schematic diagram of generating motion data according to an embodiment of the present invention is provided below with reference to fig. 5, and a process of generating motion data is described below with reference to fig. 4 and 5.
The motion data of the start key frame and the motion data of the end key frame are input into a motion data generation network model (in fig. 5, simply referred to as generation network), wherein the total frame number N of the transition animation and the position of each transition frame can be informed to the generation network through a parameter configuration mode.
The generation network generates the first vector and the first hidden variable of the first transition frame according to the start vector corresponding to the action data of the start key frame (abbreviated as the start vector in fig. 5), the end vector corresponding to the action data of the end key frame (abbreviated as the end vector in fig. 5), the position vector of the first transition frame (abbreviated as the frame-1 position vector in fig. 5), and the zeroth hidden variable;
the generating network obtains a second vector of the second transition frame and a second hidden variable according to the first hidden variable, the first vector, the ending vector and a position vector of the second transition frame (the second frame position vector is simply called as a second frame position vector in fig. 5).
The generating network obtains a third vector of the third transition frame and a third hidden variable according to the second hidden variable, the second vector, the ending vector and the position vector of the third transition frame (the third frame position vector is simply called as a third frame position vector in fig. 5).
Similarly, the generating network obtains the N-th vector of the N-th transition frame according to the N-1 hidden variable, the N-1-th vector, the ending vector and the position vector of the N-th transition frame (abbreviated as the N-th frame position vector in fig. 5). Accordingly, the nth hidden variable is obtained at the same time.
Thus, the vector of each transition frame is obtained, and the motion data of each transition frame is obtained.
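A minimal sketch of this autoregressive loop follows. The patent only specifies a recurrent generation network (an RNN is mentioned in the training section), so the GRU cell, the dimensions, and the class name below are stand-in assumptions:

    import torch
    import torch.nn as nn

    class TransitionGenerator(nn.Module):
        # Stand-in for the motion data generation network of figs. 4 and 5: at
        # each step it consumes the previous frame's vector, the end vector and
        # the current frame's position encoding, updates a hidden state, and
        # emits the next transition frame's vector.
        def __init__(self, motion_dim=128, pos_dim=16, hidden_dim=256):
            super().__init__()
            self.cell = nn.GRUCell(motion_dim * 2 + pos_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, motion_dim)

        def forward(self, start_vec, end_vec, pos_vecs):
            # start_vec, end_vec: (1, motion_dim); pos_vecs: (N, pos_dim).
            h = torch.zeros(1, self.cell.hidden_size)   # zeroth hidden variable = 0
            prev = start_vec                            # step 1 starts from the start vector
            frames = []
            for pos in pos_vecs:                        # transition frames 1 .. N
                x = torch.cat([prev, end_vec, pos.unsqueeze(0)], dim=-1)
                h = self.cell(x, h)                     # i-th hidden variable
                prev = self.out(h)                      # i-th transition frame vector
                frames.append(prev)
            return torch.cat(frames, dim=0)             # (N, motion_dim)

Feeding the end vector at every step is what lets the generated transition settle onto the end key frame, mirroring the step-by-step use of the end vector in fig. 5.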
In a specific implementation, referring to fig. 6, a training flowchart of an action data generation network model in an embodiment of the present invention is provided, where training about the action data generation network model may specifically include the following steps:
step 61, obtaining a training sample set, wherein the training sample set comprises a plurality of training samples, and the training samples are animation sequence fragments.
In implementations, each training sample in the training sample set may be obtained in a variety of ways. For example, the transition animation sequence data between different key actions is produced by a motion capture system or an animator, and is an animation sequence segment of about 0.5-5 s. The key actions may be: clapping hands, comparing heart, pointing leftwards, pointing rightwards, praying, walking, running, jumping, and the like. The key actions may also be configured according to the actual application scenario requirements, which are not exemplified here. The database has a plurality of animation sequence segments.
The action data corresponding to each training sample can be obtained as follows: the bones in the action data may be built by modelers and riggers, and the rotation angles in the action data may be obtained from actor performances or from key frames made by hand by animators.
Step 62, for each training sample, extracting motion data from each frame of motion of the training sample.
In implementations, action data is extracted for each frame of action in the animation sequence segment. Taking the case where the action data includes the displacements of the bones and the rotation angles of the bones as an example, vectors may be used to represent the bone displacements and bone rotation angles in the action data.
Step 63, taking the first frame of each training sample as a starting frame, taking the last frame as an ending frame, and determining the total frame number of the transition animation to be predicted according to the time interval between the starting frame and the ending frame.
In a specific implementation, if the frame rate is 30fps and the duration of a certain training sample is 2 seconds(s), the training sample is 60 frames in total, the first frame of the training sample is taken as a starting frame, the last frame (60 th frame) is taken as an ending frame, and the motion data of the predicted frames from the 2 nd frame to the 59 th frame are predicted.
In step 64, during the m-th iteration of training, based on a supervised learning training mode, some or all of the training samples in the training sample set are input into the network model obtained in the (m-1)-th iteration of training, and a first loss is obtained from the deviation between the predicted result and the real result; based on an unsupervised learning training mode, some or all of the training samples in the training sample set are input into the network model obtained in the (m-1)-th iteration of training to obtain an adversarial loss.
In a specific implementation, the motion data includes a displacement of a bone and a rotation angle of the bone, the prediction result includes a prediction displacement of the bone of each prediction frame and a prediction rotation angle of the bone of each prediction frame, and the real result includes a real displacement of the bone of the annotation frame and a real rotation angle of the bone.
In some embodiments, a first deviation of a predicted displacement of a bone from a true displacement of the bone is calculated, a second deviation of a predicted rotational angle of the bone from a true rotational angle of the bone, and a sum of the first deviation and the second deviation is taken as the first loss.
In other embodiments, a first deviation of a predicted displacement of the bone from a true displacement of the bone is calculated, a second deviation of a predicted rotational angle of the bone from a true rotational angle of the bone, a predicted position of the bone is determined from the predicted displacement of the bone and the predicted rotational angle of the bone, a third deviation of the predicted position of the bone from the true position of the bone is calculated, and a sum of the first deviation, the second deviation, and the third deviation is taken as the first loss.
In some non-limiting embodiments, the first loss may be obtained from the deviation of the predicted result from the real result. Specifically, for each predicted frame, the deviation between the predicted bone displacement and the true bone displacement and the deviation between the predicted bone rotation angle and the true bone rotation angle are calculated; the displacement deviations over all predicted frames are summed to obtain the first deviation, the rotation-angle deviations over all predicted frames are summed to obtain the second deviation, and the sum of the first deviation and the second deviation is taken as the first loss.
In other non-limiting embodiments, a first deviation is calculated by summing, over all predicted frames, the deviation between the predicted displacement of the bone and the real displacement of the bone; a second deviation is calculated by summing, over all predicted frames, the deviation between the predicted rotation angle of the bone and the real rotation angle of the bone; the predicted position of the bone is determined from the predicted displacement and the predicted rotation angle of the bone, and the real position of the bone is determined from the real displacement and the real rotation angle of the bone; a third deviation is calculated by summing, over all predicted frames, the deviation between the predicted position of the bone and the real position of the bone; and the sum of the first deviation, the second deviation and the third deviation is taken as the first loss.
The first deviation, the second deviation or the third deviation may be calculated using the Manhattan distance; the Euclidean distance may also be used.
In some embodiments, the first loss may be the sum of the first deviation and the second deviation over all training samples that participate in the supervised learning training. The first loss may also be the sum of the weighted first deviation and the weighted second deviation over all training samples that participate in the supervised learning training.
In other embodiments, the first loss may be the sum of the first deviation, the second deviation and the third deviation over all training samples that participate in the supervised learning training, or the sum of the weighted first deviation, the weighted second deviation and the weighted third deviation over all training samples that participate in the supervised learning training.
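A minimal sketch of how such a first loss could be computed, assuming the displacements, rotation angles and (optionally) positions are stored as NumPy arrays of shape (predicted frames, bones, 3); the function names, array layout and default weights are assumptions for illustration rather than the patented implementation:

import numpy as np

def deviation(pred, true, metric="manhattan"):
    # Deviation between predicted and real arrays, summed over all predicted frames.
    diff = pred - true
    if metric == "manhattan":                       # Manhattan (L1) distance
        return np.abs(diff).sum()
    return np.sqrt((diff ** 2).sum(axis=-1)).sum()  # Euclidean (L2) distance

def first_loss(pred_disp, true_disp, pred_rot, true_rot,
               pred_pos=None, true_pos=None,
               w_disp=1.0, w_rot=1.0, w_pos=1.0, metric="manhattan"):
    # Weighted sum of the first, second and (optionally) third deviations
    # for one training sample.
    loss = w_disp * deviation(pred_disp, true_disp, metric)     # first deviation
    loss += w_rot * deviation(pred_rot, true_rot, metric)       # second deviation
    if pred_pos is not None and true_pos is not None:
        loss += w_pos * deviation(pred_pos, true_pos, metric)   # third deviation
    return loss

The batch-level first loss would then be the sum of this quantity over every training sample that takes part in the supervised learning pass.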
In step 65, the parameters of the network model obtained by the (m-1)-th iteration are adjusted according to the first loss and the adversarial loss to obtain the network model of the m-th iteration. Iterative training continues until the network model obtained by the (m+p)-th iteration meets a convergence condition, and the network model obtained by the (m+p)-th iteration is used as the action data generation network model. When m = 1, the initial network model is regarded as the network model obtained by the 0th iteration of training.
In particular implementations, the initial network model, regarded as the network model obtained by the 0th iteration of training, may be a recurrent neural network (Recurrent Neural Network, RNN). The start vector (including position and rotation angle), the end vector (including position and rotation angle), the position vector of the 1st frame and the zeroth hidden variable are input into the initial network model, and the vector (position and rotation angle) of the first predicted frame is output. Proceeding in this way frame by frame, the prediction result is obtained; for example, the prediction result consists of the vectors of predicted frames 2 to 59. The zeroth hidden variable corresponding to the first frame is set manually, and its content is all zeros.
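The autoregressive prediction described above can be sketched as follows (a non-limiting PyTorch sketch; the use of a GRU cell, the hidden size, and the way the start vector, end vector, previous frame and frame position are concatenated are all assumptions, since the description only specifies a recurrent neural network whose zeroth hidden variable is initialized to zero):

import torch
import torch.nn as nn

class TransitionRNN(nn.Module):
    # Each step consumes the start vector, the end vector, the previous frame's
    # vector and the index of the frame to predict, and emits that frame's
    # motion vector (bone displacement + rotation angle).
    def __init__(self, frame_dim, hidden_dim=256):
        super().__init__()
        # input = start vector + end vector + previous frame vector + frame position
        self.cell = nn.GRUCell(3 * frame_dim + 1, hidden_dim)
        self.out = nn.Linear(hidden_dim, frame_dim)

    def forward(self, start_vec, end_vec, num_frames):
        batch = start_vec.shape[0]
        h = start_vec.new_zeros(batch, self.cell.hidden_size)  # zeroth hidden variable = 0
        prev = start_vec
        frames = []
        for i in range(1, num_frames + 1):
            pos = start_vec.new_full((batch, 1), float(i) / num_frames)
            x = torch.cat([start_vec, end_vec, prev, pos], dim=-1)
            h = self.cell(x, h)
            prev = self.out(h)        # predicted motion vector of this frame
            frames.append(prev)
        return torch.stack(frames, dim=1)  # (batch, num_frames, frame_dim)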
During training based on unsupervised learning, two key action sequences may be randomly selected from the training sample set, and a duration may be randomly selected within a set range (for example, 1 to 2 seconds) as the blank time period, for example, 1.5 seconds. The network model generates a transition action sequence, which contains multi-frame action data, from the key action sequences and the blank duration. For example, if the blank duration is 1.5 seconds and the frame rate is 30 fps, then 45 - 2 = 43 frames of transition actions are predicted.
The motion data of the start frame, the motion data of each generated transition frame and the motion data of the end frame are fed into a discrimination network (the discrimination network is trained along with the generation network model), and an adversarial loss (Loss) is produced. The parameters of the generation network are adjusted according to the adversarial loss; using the adversarial loss makes the predicted actions smoother and more coherent.
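A minimal sketch of the discrimination network and the resulting adversarial loss (the GRU-based discriminator architecture and the binary cross-entropy formulation are assumptions for illustration; the description only states that a discrimination network is trained along with the generation network):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SequenceDiscriminator(nn.Module):
    # Scores a whole sequence (start frame + transition frames + end frame)
    # as real captured motion or generated motion.
    def __init__(self, frame_dim, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(frame_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, sequence):        # (batch, frames, frame_dim)
        _, h = self.rnn(sequence)
        return self.score(h[-1])        # one logit per sequence

def adversarial_loss(disc, fake_sequence):
    # Generator-side adversarial loss: the generator is rewarded when the
    # discriminator believes the generated sequence is real.
    logits = disc(fake_sequence)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))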
When training the action data generation network model, supervised learning and unsupervised learning are used together: the parameters of the action data generation network model are adjusted according to the loss obtained from the first loss and the adversarial loss until the loss converges. The benefit of incorporating the adversarial loss is that it makes the animation sequences generated by the action data generation network more realistic.
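Combining the two losses in one generator update could look like the following sketch, which reuses the TransitionRNN and adversarial-loss sketches above; the batch layout, the L1 choice for the first loss and the weighting factor lambda_adv are hypothetical:

import torch
import torch.nn.functional as F

def training_step(generator, discriminator, optimizer_g, batch, lambda_adv=0.1):
    # Predict the transition frames between the start and end key frames.
    pred = generator(batch["start_vec"], batch["end_vec"], batch["num_frames"])
    # Supervised first loss (here an L1 / Manhattan deviation).
    loss_sup = F.l1_loss(pred, batch["ground_truth"])
    # Adversarial loss on the full sequence: start frame + transitions + end frame.
    full_seq = torch.cat([batch["start_vec"].unsqueeze(1), pred,
                          batch["end_vec"].unsqueeze(1)], dim=1)
    loss_adv = adversarial_loss(discriminator, full_seq)
    loss = loss_sup + lambda_adv * loss_adv
    optimizer_g.zero_grad()
    loss.backward()
    optimizer_g.step()
    return loss.item()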
During training of the action data generation network model, the duration of the transition animation changes dynamically. For example, a transition animation of 1.2 s may be generated in one case and a transition animation of 0.8 s in another, which improves the generalization capability of the trained action data generation network model. It should be noted that the specific durations, frame rates, etc. in the foregoing examples are given only for ease of understanding; in practice, other values may be set according to actual requirements, which is not described further herein.
In step 15, after the action data and positions of the N transition frames are obtained, the order of the transition frames may be determined according to their positions, the action data of the N transition frames are arranged in that order, and the transition animation is then rendered and generated through timestamp alignment.
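A minimal sketch of this ordering and timestamp-alignment step (the dictionary layout, in which each transition frame carries its position index and action data, and the way timestamps are derived from the start time and frame rate are assumptions for illustration):

def assemble_transition(transition_frames, start_time_s, fps=30):
    # Order the N transition frames by their position within the blank period
    # and attach a timestamp to each so the rendered transition animation can
    # be aligned with the rest of the sequence.
    ordered = sorted(transition_frames, key=lambda f: f["position"])
    for frame in ordered:
        frame["timestamp"] = start_time_s + frame["position"] / fps
    return ordered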
Specifically, the transition animations between adjacent target words are combined to obtain the animation corresponding to the text.
The embodiment of the invention further provides an animation generation device. Referring to FIG. 7, which is a schematic structural diagram of the animation generation device in an embodiment of the invention, the animation generation device may include:
a text acquisition unit 71 for acquiring text;
a word segmentation unit 72, configured to perform word segmentation processing on the text, so as to obtain an initial set, where the initial set includes one or more words;
a determining unit 73, configured to determine whether each word in the initial set is a target word, and if so, add the target word to a target set, where a target word refers to a word whose similarity value with at least one tag in a preset tag set is greater than or equal to a set similarity threshold, the preset tag set including a plurality of preset tags (a minimal sketch of this similarity check is given after this list of units);
an action data obtaining unit 74, configured to obtain, according to the tags corresponding to each target word in the target set, the action data of each target word from a preset action database, where the action database is used to store the action data corresponding to each tag, and each tag corresponds to at least one group of action data;
And a generating unit 75, configured to generate an action set corresponding to the text based on the obtained action data matched with each target word, where the action set is used to generate an animation.
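As referenced in the determining unit above, a minimal sketch of the similarity check it performs (cosine similarity against each preset tag vector; the embedding source and the 0.7 threshold are assumptions for illustration):

import numpy as np

def is_target_word(word_vec, tag_vecs, threshold=0.7):
    # A word is a target word if its similarity with at least one preset tag
    # reaches the set similarity threshold; the best-matching tag is returned
    # so the action data obtaining unit can query the action database with it.
    word_vec = word_vec / np.linalg.norm(word_vec)
    best_tag, best_sim = None, -1.0
    for tag, vec in tag_vecs.items():
        sim = float(word_vec @ (vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best_tag, best_sim = tag, sim
    return best_sim >= threshold, best_tag, best_sim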
In specific implementations, for the specific working principle and workflow of the animation generation device, reference may be made to the description of the animation generation method provided in the foregoing embodiments, which is not repeated herein.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the animation generation method provided by any of the above embodiments.
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the animation generation method provided by any embodiment when running the computer program.
The memory is coupled to the processor and may be located within the terminal or external to the terminal. The memory and the processor may be connected by a communication bus.
The terminal can include, but is not limited to, a mobile phone, a computer, a tablet computer and other terminal equipment, and can also be a server, a cloud platform and the like.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer program may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired or wireless means from one website, computer, server or data center to another.
In the several embodiments provided in the present application, it should be understood that the disclosed method, apparatus and system may be implemented in other manners. For example, the device embodiments described above are merely illustrative: the division of the units is only a division by logical function, and other division manners may be adopted in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, each unit may exist physically on its own, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units. For a device or product applied to or integrated on a chip, each module/unit it contains may be implemented in hardware such as a circuit, or at least some of the modules/units may be implemented as a software program running on a processor integrated inside the chip, with the remaining modules/units (if any) implemented in hardware such as a circuit. For a device or product applied to or integrated in a chip module, each module/unit it contains may likewise be implemented in hardware such as a circuit, different modules/units may be located in the same component (such as a chip or a circuit module) or in different components of the chip module, or at least some of the modules/units may be implemented as a software program running on a processor integrated inside the chip module, with the remaining modules/units (if any) implemented in hardware such as a circuit. For a device or product applied to or integrated in a terminal, each module/unit it contains may be implemented in hardware such as a circuit, different modules/units may be located in the same component (such as a chip or a circuit module) or in different components within the terminal, or at least some of the modules/units may be implemented as a software program running on a processor integrated inside the terminal, with the remaining modules/units (if any) implemented in hardware such as a circuit.
It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein indicates that the associated objects before and after it are in an "or" relationship.
The term "plurality" as used in the embodiments herein refers to two or more.
The descriptions "first", "second", "third", etc. in the embodiments of the present application are only used to illustrate and distinguish the described objects; they imply no order, do not indicate any particular limitation on the number of devices in the embodiments of the present application, and should not be construed as limiting the embodiments of the present application in any way.
It should be noted that the serial numbers of the steps in the present embodiment do not represent a limitation on the execution sequence of the steps.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention should therefore be determined by the appended claims.

Claims (13)

1. An animation generation method, comprising:
Acquiring a text;
word segmentation is carried out on the text to obtain an initial set, wherein the initial set comprises one or more words;
judging whether each word in the initial set is a target word, and if so, adding the target word to a target set, wherein a target word refers to a word whose similarity value with at least one tag in a preset tag set is greater than or equal to a set similarity threshold, and the preset tag set comprises a plurality of preset tags;
according to the tags corresponding to the target words in the target set, obtaining action data of the target words from a preset action database, wherein the action database is used for storing the action data corresponding to the tags, and each tag corresponds to at least one group of action data;
and generating an action set corresponding to the text based on the acquired action data matched with each target word, wherein the action set is used for generating animation.
2. The animation generation method of claim 1, wherein the determining whether each word in the initial set is a target word comprises:
for each word in the initial set, separately comparing the vector of the word with the vector of each tag to obtain a similarity value.
3. The animation generation method of claim 2, wherein the vector of the tag is obtained by:
acquiring keywords corresponding to each label;
and obtaining the vector of each label according to the vector of the keyword corresponding to each label.
4. The animation generation method of claim 2, wherein, for each word in the initial set, a vector for each word is obtained by:
for each word in the initial set, carrying out weighted average on vectors corresponding to all characters in each word, and taking the vector after weighted average as the vector of each word.
5. The animation generation method of claim 1, wherein the determining whether each word in the initial set is a target word, and if so, adding the target word to the target set comprises:
S1, adding words in the initial set whose similarity values with the tags are greater than or equal to the similarity threshold into a candidate set;
S2, taking the word with the maximum similarity value in the candidate set as a reference word, putting the reference word into a successful matching set, and comparing the remaining words in the candidate set with the reference word;
S3, eliminating, from the candidate set, words whose similarity values are smaller than that of the reference word and which have word overlap with the reference word, and updating the candidate set;
and S4, repeating steps S2 and S3 until the candidate set is an empty set, and taking the successful matching set as the target set.
6. The animation generation method of claim 1, further comprising:
matching each word in the initial set with a first specific word in a first specific set, wherein the first specific word is a word that does not participate in tag matching, and eliminating, from the initial set, words that are the same as the first specific word or belong to a subset of the first specific word;
or, matching each word in the target set with a first specific word in a first specific set, wherein the first specific word is a word that does not participate in tag matching, and eliminating, from the target set, words that are the same as the first specific word or belong to a subset of the first specific word.
7. The animation generation method of claim 1, further comprising:
matching the text with a second specific word in a second specific set by means of regular expression matching to obtain a forced matching set, wherein the forced matching set comprises one or more forced matching words, and the second specific word is a forced matching word;
and if a forced matching word has word overlap with a word in the initial set or the target set, overwriting the overlapping word in the initial set or the target set with that forced matching word.
8. The animation generation method of claim 1, wherein the generating the action set corresponding to the text based on the obtained action data matched with each target word, the action set being used for generating an animation, comprises:
converting the text into voice, and obtaining word boundaries of each target word and the position of each target word in the voice;
and obtaining an action set corresponding to the text according to word boundaries of each target word, positions of each target word in the voice and action data matched with the target word.
9. The animation generation method of claim 8, wherein the obtaining the action set corresponding to the text based on the word boundary of each target word, the position of each target word in the voice, and the action data matched with the target word comprises:
for two adjacent target words, according to the position of each target word in the voice, taking the action data corresponding to the earlier target word as a start key frame and the action data corresponding to the later target word as an end key frame;
generating a transition animation between the start key frame and the end key frame based on the start key frame and the end key frame, so as to obtain the transition animation corresponding to the adjacent target words;
and obtaining an action set corresponding to the text based on the action data corresponding to each target word and the transition animation corresponding to the adjacent target word.
10. The animation generation method of claim 9, wherein the generating a transition animation between the start key frame and the end key frame based on the start key frame and the end key frame to obtain the transition animation corresponding to the adjacent target words comprises:
acquiring a blank duration of a blank time period between the start key frame and the end key frame;
according to the blank duration and a preset frame rate, calculating the total number N of frames of the transition animation and the position of each transition frame, wherein N is a positive integer;
calculating the action data of each transition frame, wherein the action data of the first transition frame is obtained according to the action data of the start key frame, the action data of the end key frame and the position of the first transition frame; the action data of the (i+1)-th transition frame is obtained according to the action data of the start key frame, the action data of the end key frame, the action data of the i-th transition frame and the position of the (i+1)-th transition frame, where 1 ≤ i ≤ N-1 and i is a positive integer;
And generating the transition animation based on the action data and the positions of the N transition frames.
11. An animation generation device, comprising:
a text acquisition unit configured to acquire a text;
the word segmentation unit is used for carrying out word segmentation processing on the text to obtain an initial set, wherein the initial set comprises one or more words;
the determining unit is used for judging whether each word in the initial set is a target word, and if so, adding the target word into a target set, wherein a target word refers to a word whose similarity value with at least one tag in a preset tag set is greater than or equal to a set similarity threshold, and the preset tag set comprises a plurality of preset tags;
the action data acquisition unit is used for acquiring action data of each target word from a preset action database according to the tags corresponding to each target word in the target set, wherein the action database is used for storing the action data corresponding to each tag, and each tag corresponds to at least one group of action data;
the generating unit is used for generating an action set corresponding to the text based on the acquired action data matched with each target word, and the action set is used for generating animation.
12. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the animation generation method of any of claims 1 to 10.
13. A terminal comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, characterized in that the processor executes the steps of the animation generation method according to any of claims 1 to 10 when the computer program is executed.
CN202211732261.3A 2022-12-30 2022-12-30 Animation generation method and device, computer readable storage medium and terminal Pending CN116309965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211732261.3A CN116309965A (en) 2022-12-30 2022-12-30 Animation generation method and device, computer readable storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211732261.3A CN116309965A (en) 2022-12-30 2022-12-30 Animation generation method and device, computer readable storage medium and terminal

Publications (1)

Publication Number Publication Date
CN116309965A true CN116309965A (en) 2023-06-23

Family

ID=86800275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211732261.3A Pending CN116309965A (en) 2022-12-30 2022-12-30 Animation generation method and device, computer readable storage medium and terminal

Country Status (1)

Country Link
CN (1) CN116309965A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993873A (en) * 2023-07-31 2023-11-03 支付宝(杭州)信息技术有限公司 Digital human action arrangement method and device
CN116993873B (en) * 2023-07-31 2024-05-17 支付宝(杭州)信息技术有限公司 Digital human action arrangement method and device

Similar Documents

Publication Publication Date Title
KR102577514B1 (en) Method, apparatus for text generation, device and storage medium
CN112668671B (en) Method and device for acquiring pre-training model
CN110852087B (en) Chinese error correction method and device, storage medium and electronic device
CN110717327B (en) Title generation method, device, electronic equipment and storage medium
CN111709248B (en) Training method and device for text generation model and electronic equipment
CN109582767B (en) Dialogue system processing method, device, equipment and readable storage medium
JP7170082B2 (en) Method and device for generating information, electronic device, storage medium and computer program
CN106776544B (en) Character relation recognition method and device and word segmentation method
CN110286778B (en) Chinese deep learning input method, device and electronic equipment
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN114787914A (en) System and method for streaming end-to-end speech recognition with asynchronous decoder
CN110163181B (en) Sign language identification method and device
KR20170022445A (en) Apparatus and method for speech recognition based on unified model
CN108228576B (en) Text translation method and device
CN116051688A (en) Transition animation generation method and device, computer readable storage medium and terminal
Elakkiya et al. Subunit sign modeling framework for continuous sign language recognition
CN108304376B (en) Text vector determination method and device, storage medium and electronic device
CN111859964A (en) Method and device for identifying named entities in sentences
JP5809381B1 (en) Natural language processing system, natural language processing method, and natural language processing program
CN113655893B (en) Word and sentence generation method, model training method and related equipment
CN113392265A (en) Multimedia processing method, device and equipment
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN111079418A (en) Named body recognition method and device, electronic equipment and storage medium
CN112270184A (en) Natural language processing method, device and storage medium
CN116309965A (en) Animation generation method and device, computer readable storage medium and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination