CN113920559A - Method and device for generating facial expressions and limb actions of virtual character - Google Patents
Method and device for generating facial expressions and limb actions of virtual character
- Publication number
- CN113920559A (application CN202111083019.3A)
- Authority
- CN
- China
- Prior art keywords
- sample
- limb
- words
- text data
- virtual character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Processing Or Creating Images (AREA)
Abstract
The disclosure relates to a method and a device for generating facial expressions and limb actions of a virtual character. The method comprises: acquiring text data; inputting the text data into a prosody model and outputting, through the prosody model, facial expressions or limb actions matched with the text data, wherein the prosody model is trained according to the correspondence between semantic features of sample text data and facial expressions or limb actions; and inserting the facial expressions and limb actions into a virtual character of a video sequence to generate the facial expressions and limb actions of the virtual character. Because the semantic information conveys the expressed emotion or intention, the facial expressions or limb actions of the virtual character can be predicted more accurately than in the prior art.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating facial expressions and body movements of virtual characters.
Background
Virtual characters are widely created in industries such as virtual reality, game entertainment, video telephony and film special effects. When creating a virtual character, portrait-driving technology is used to configure facial expressions and body actions that match the speaking content. In the related art, expressions or actions are matched to the virtual character based on a library of base animations: action segments are collected in advance, and a segment is read and played at random, which produces extremely unnatural facial expressions or limb actions. Rule-based expression and action generation methods are also used; they trigger corresponding expressions from the voice signal according to specific rules, but because the trigger conditions are relatively fixed, the resulting facial expressions or limb actions are still unnatural.
Therefore, a method capable of accurately generating the facial expressions and limb movements of a virtual character is needed.
Disclosure of Invention
To overcome at least one of the problems of the related art, the present disclosure provides a method and apparatus for generating facial expressions and body movements of a virtual character.
According to a first aspect of the embodiments of the present disclosure, there is provided a method for generating facial expressions and limb movements of a virtual character, including:
acquiring text data;
inputting the text data into a prosody model, and outputting facial expressions or limb actions matched with the text data through the prosody model, wherein the prosody model is set to be obtained through training according to the corresponding relation between semantic features of sample text data and the facial expressions or the limb actions;
inserting the facial expression and the limb actions into a virtual character of a video sequence to generate the facial expression and the limb actions of the virtual character.
In one possible implementation manner, the inserting the facial expression and the body movement into a virtual character of a video sequence to generate a facial expression and a body movement of the virtual character includes:
acquiring audio data corresponding to the text data;
determining the time information of the facial expressions and limb actions matched with the words according to the time information of the words in the audio data;
and inserting the facial expression and the limb actions into a virtual character of a video sequence according to the time information to generate the facial expression and the limb actions of the virtual character.
In one possible implementation, inserting the facial expression and the body movement into a virtual character of a video sequence, and generating the facial expression and the body movement of the virtual character includes:
acquiring expression parameters matched with the facial expression and action parameters matched with the limb action;
and adjusting the pixel position of the virtual character in the video sequence according to the expression parameters and the action parameters to generate the facial expression and the limb action of the virtual character.
In one possible implementation manner, the inserting the facial expression and the body movement into a virtual character of a video sequence to generate a facial expression and a body movement of the virtual character includes:
adjusting the facial expression or the limb movement when a plurality of continuous words in the text data have matched facial expressions or limb movements;
registering the adjusted facial expression or the adjusted limb action to a virtual character of the video sequence to generate the facial expression and the limb action of the virtual character.
In one possible implementation, the adjusting the facial expression or the limb movement includes:
and acquiring the facial expression or limb action matched with the word at a preset position among the continuous words, and removing the facial expressions or limb actions matched with the words at positions other than the preset position.
In one possible implementation, the adjusting the facial expression or the limb movement includes:
and acquiring, from the continuous words, the facial expression or limb action with the highest preset priority, and removing the facial expressions or limb actions matched with the other words.
In one possible implementation manner, the prosodic model is configured to be obtained through training according to the corresponding relationship between the semantic features of the sample text data and the sample facial expressions and the sample body movements, and includes:
obtaining sample text data, wherein the sample text data comprises a plurality of sample words marked with facial expressions and limb actions;
constructing a rhythm model, wherein training parameters are set in the rhythm model;
inputting the sample words into the prosodic model to generate a prediction result;
and iteratively adjusting the training parameters based on the difference between the prediction result and the labeled sample facial expression and sample limb actions until the difference meets the preset requirement to obtain the prosody model.
In one possible implementation, the sample text data set includes a plurality of sample words labeled with facial expressions and body movements, where the method for labeling the facial expressions and the body movements on the sample words includes:
acquiring sample audio data matched with the sample text data;
according to semantic information of sample text data, facial expressions and limb actions are labeled on sample words to obtain initial labeled sample words;
and supplementing the initial labeled sample words according to the volume information and the speech speed information of the sample audio data to obtain labeled sample words.
According to a second aspect of the embodiments of the present disclosure, there is provided an apparatus for generating facial expressions and body movements of a virtual character, including:
the acquisition module is used for acquiring text data;
the prediction module is used for inputting the text data into a prosody model and outputting facial expressions or limb actions matched with the text data through the prosody model, wherein the prosody model is obtained by training according to the corresponding relation between semantic features of sample text data and the facial expressions or the limb actions;
and the generating module is used for inserting the facial expression and the limb actions into a virtual character of a video sequence to generate the facial expression and the limb actions of the virtual character.
In one possible implementation, the generating module includes:
the obtaining submodule is used for obtaining audio data corresponding to the text data;
the determining submodule is used for determining the time information of the facial expression and the limb action matched with the words according to the time information of the words in the audio data;
and the generation submodule is used for inserting the facial expression and the limb action into a virtual character of a video sequence according to the time information to generate the facial expression and the limb action of the virtual character.
In one possible implementation manner, the prosodic model is configured to be obtained through training according to the corresponding relationship between the semantic features of the sample text data and the sample facial expressions and the sample body movements, and includes:
obtaining sample text data, wherein the sample text data comprises a plurality of sample words marked with facial expressions and limb actions;
constructing a rhythm model, wherein training parameters are set in the rhythm model;
inputting the sample words into the prosodic model to generate a prediction result;
and iteratively adjusting the training parameters based on the difference between the prediction result and the labeled sample facial expression and sample limb actions until the difference meets the preset requirement to obtain the prosody model.
In one possible implementation, the sample text data set includes a plurality of sample words labeled with facial expressions and body movements, where the method for labeling the facial expressions and the body movements on the sample words includes:
acquiring sample audio data matched with the sample text data;
according to semantic information of sample text data, facial expressions and limb actions are labeled on sample words to obtain initial labeled sample words;
and supplementing the initial sample labeled words according to the volume information and the speech speed information of the sample audio data to obtain labeled sample words.
According to a third aspect of the embodiments of the present disclosure, there is provided an apparatus for generating a facial expression and a limb movement of a virtual character, the apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any of the embodiments of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium, wherein instructions, when executed by a processor, enable the processor to perform the method according to any one of the embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects: in the embodiment of the disclosure, the prosodic model is obtained by training according to the corresponding relation between the semantic features of the sample text data and the facial expressions and the body movements, and compared with the prior art, the semantic information can convey the expressed emotion or intention information, so that the facial expressions or the body movements of the virtual character can be accurately predicted.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating a method for generating facial expressions and body movements of a virtual character according to an exemplary embodiment.
FIG. 2 is a schematic block diagram illustrating a prosody model according to an exemplary embodiment.
Fig. 3 is a schematic block diagram illustrating an apparatus for generating facial expressions and limb movements of a virtual character according to an exemplary embodiment.
Fig. 4 is a schematic block diagram illustrating an apparatus for generating facial expressions and limb movements of a virtual character according to an exemplary embodiment.
Fig. 5 is a schematic block diagram illustrating an apparatus for generating facial expressions and limb movements of a virtual character according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In order to facilitate those skilled in the art to understand the technical solutions provided by the embodiments of the present disclosure, a technical environment for implementing the technical solutions is described below.
Portrait-driving technology uses a computer to generate a virtual character that speaks with matching mouth movements or performs actions. In the related art, a prediction model is trained with artificial-intelligence algorithms by exploiting the correlation between audio, facial expressions and limb actions. Because the audio signal conveys limited information, the predicted facial expressions and body movements are not accurate enough.
Based on the actual technical needs similar to those described above, the present disclosure provides a method and an apparatus for generating facial expressions and body movements of a virtual character.
The method for generating facial expressions and body movements of a virtual character according to the present disclosure will be described in detail below with reference to fig. 1. Fig. 1 is a flowchart of an embodiment of the method for generating facial expressions and limb movements of a virtual character provided by the present disclosure. Although the present disclosure provides method steps as illustrated in the following examples or figures, the method may include more or fewer steps based on conventional or non-inventive effort. Where no necessary logical causal relationship exists between steps, the order of execution of the steps is not limited to that provided by the disclosed embodiments.
Specifically, an embodiment of the method for generating facial expressions and body movements of a virtual character provided by the present disclosure is shown in fig. 1, where the method may be applied to a terminal or a server, and includes:
step S101, text data is acquired.
In the embodiment of the present disclosure, the text data may include the content to be spoken by the virtual character, and may also include text information extracted from an audio file or a video file. For example, in an interview scenario, the content pre-scripted for the virtual character may include: "Welcome to the online interview of our Company X! I am the interviewer for this round and will guide you through the whole interview process today. It takes about forty-five minutes, so please find a quiet place with a stable network connection to complete the test, and make sure that the camera and the microphone work normally."
And S102, inputting the text data into a prosody model, and outputting facial expressions or limb actions matched with the text data through the prosody model, wherein the prosody model is obtained by training according to the corresponding relation between semantic features of sample text data and the facial expressions or the limb actions.
In the embodiment of the present disclosure, the prosody model may be a machine-learning-based neural network model, such as a convolutional neural network (CNN), a recurrent neural network (RNN), a graph neural network (GNN), an attention-based model, and the like. It may also be obtained by fine-tuning a base model such as a pretrained language model (PLM), where the PLM may be a model of the BERT series and its derivatives or of the GPT series and its derivatives; the disclosure is not limited in this respect. In one example, the prosody model is trained according to the correspondence between semantic features of the sample text data and facial expressions and body movements. The training method may be a supervised or self-supervised deep-learning method: for example, preset facial expressions or limb movements are labeled on the training samples, the text data is input into the prosody model to obtain, for each word in the text data, the probability of each facial expression or limb movement, and the parameters of the prosody model are adjusted based on the prediction result and the labeling result. In one example, inputting the text data into the prosody model may include performing word segmentation on the text data, extracting word-vector representations of the words, inputting the word vectors into the prosody model, and outputting the facial expressions and/or body movements corresponding to the word vectors. For the interview scenario described above, the output marks some of the words (shown underlined in the original example, e.g. "welcome", "interview", "I") as words to which a facial expression or action should be added, while the unmarked words require no facial expression or action.
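As an illustrative sketch only (not part of the claimed method), the prediction of step S102 can be pictured as follows: the text is segmented into words, each word is classified, and each word receives a facial-expression/limb-action label or "none". The `segment` function, the label set and the dummy classifier are assumptions added for illustration; a trained PLM-based prosody model replaces the dummy in practice.

```python
from typing import List, Tuple

# Hypothetical label set; the actual expressions/actions are not fixed by the disclosure.
LABELS = ["none", "smile", "nod", "raise_head", "wave_hand"]

def segment(text: str) -> List[str]:
    # Naive whitespace segmentation stands in for a real word segmenter (e.g. for Chinese text).
    return text.split()

def predict_prosody(text: str, classify) -> List[Tuple[str, str]]:
    """classify: callable mapping a word to a label index (a trained prosody model in practice)."""
    words = segment(text)
    return [(word, LABELS[classify(word)]) for word in words]

# Dummy classifier only so the sketch runs; it has no relation to the trained model's behavior.
demo_classifier = lambda w: 2 if w.lower().strip("!?.,") in {"welcome", "i"} else 0
for word, label in predict_prosody("Welcome to the online interview of our Company X!", demo_classifier):
    print(word, "->", label)
```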
Step S103, inserting the facial expression and the limb actions into a virtual character of a video sequence to generate the facial expression and the limb actions of the virtual character.
In the embodiment of the present disclosure, virtual characters with multiple preset facial expressions or body motions may be prepared in advance, and the character to be inserted into the video sequence is matched from these preset characters according to the facial expressions or body motions predicted by the model. Alternatively, animation parameters corresponding to the facial expressions or limb movements can be predetermined, images with the facial expressions or limb movements are generated from the animation parameters, and the images are inserted into the video sequence.
In the embodiment of the disclosure, the prosody model is trained according to the correspondence between the semantic features of the sample text data and the facial expressions and body actions. Compared with the prior art, the semantic information conveys the expressed emotion or intention, so the facial expressions or body actions of the virtual character can be predicted accurately; and because facial expressions and body actions are handled jointly, the trained model can predict both at the same time.
In one possible implementation manner, the inserting the facial expression and the body movement into a virtual character of a video sequence to generate a facial expression and a body movement of the virtual character includes:
acquiring audio data corresponding to the text data;
determining the time information of the facial expressions and limb actions matched with the words according to the time information of the words in the audio data;
and inserting the facial expression and the limb actions into a virtual character of a video sequence according to the time information to generate the facial expression and the limb actions of the virtual character.
In the embodiment of the present disclosure, the audio data corresponding to the text data may be obtained, in one example, by using a deep-learning-based text-to-speech (TTS) model: the text data is input and the corresponding audio data is output. In another example, matching audio data may also be recorded manually according to the text data; the disclosure is not limited herein.
In the embodiment of the present disclosure, the time information includes the start time and the end time at which a word is played, and the facial expression and body movement matched with the word share the word's time information. In one example, Table 1 shows the correspondence between the time information of the text data and the facial expressions and body movements; the corresponding facial expression and limb movement are inserted into the virtual character of the video sequence according to this time information. Because the video sequence contains some random factors, or in order to match the appearance of the text, the timing of the animation sequence may need to be adjusted; in that case the value of each facial-expression frame or limb-motion frame at its new, stretched position can be calculated by interpolation, and the facial expression or limb motion of the character is inserted using the interpolated values.
Table 1: Correspondence between time information of text data and facial expressions and body movements (the "Text content" column lists the literal translations of the word-by-word segments of the spoken sentence)

Text content | Start time (seconds) | End time (seconds) | Nodding head | Swinging head | Raising head
---|---|---|---|---|---
Eyes of a user | 1.03 | 1.17 | Y | |
Front side | 1.17 | 1.51 | | |
Is suitable for | 1.64 | 1.89 | Y | |
Combination of Chinese herbs | 1.89 | 2.06 | | |
You | 2.06 | 2.31 | | |
And is | 2.31 | 2.55 | Y | |
Harvesting machine | 2.55 | 2.75 | | |
Benefit to | 2.75 | 2.84 | | |
Most preferably | 2.84 | 3.01 | Y | |
Height of | 3.01 | 3.21 | | |
Is/are as follows | 3.21 | 3.43 | | |
Then is turned on | 3.43 | 3.60 | Y | |
Is that | 3.60 | 3.76 | | |
New | 3.76 | 3.99 | Y | |
Passenger(s) | 3.99 | 4.15 | | |
Theory of things | 4.15 | 4.27 | | |
Wealth | 4.27 | 4.50 | | |
To master | 4.50 | 4.64 | Y | |
According to the embodiment of the disclosure, the insertion point of the facial expression or limb action is determined from the time information of the audio data, which is easy to implement and allows the insertion to be aligned with the video sequence more accurately.
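The sketch below illustrates this timing logic under a few assumptions that are not stated in the disclosure: a fixed video frame rate, animation clips stored as per-frame parameter arrays, and linear interpolation (np.interp) for stretching a clip to the word's duration.

```python
import numpy as np

FPS = 25  # assumed video frame rate

def word_span_to_frames(start_s: float, end_s: float, fps: int = FPS):
    """Convert a word's start/end time in seconds into video frame indices."""
    return int(round(start_s * fps)), int(round(end_s * fps))

def stretch_clip(clip: np.ndarray, n_frames: int) -> np.ndarray:
    """Resample an animation clip (frames x parameters) to n_frames by linear interpolation."""
    src = np.linspace(0.0, 1.0, len(clip))
    dst = np.linspace(0.0, 1.0, n_frames)
    return np.stack([np.interp(dst, src, clip[:, p]) for p in range(clip.shape[1])], axis=1)

def insert_action(timeline: np.ndarray, clip: np.ndarray, start_s: float, end_s: float) -> None:
    """Write the stretched expression/action parameters into the video parameter timeline."""
    f0, f1 = word_span_to_frames(start_s, end_s)
    f1 = max(f1, f0 + 1)
    timeline[f0:f1] = stretch_clip(clip, f1 - f0)

# Example: insert a 10-frame "nod" clip at the word spanning 1.03 s - 1.17 s (cf. Table 1).
timeline = np.zeros((FPS * 5, 3))                        # 5 s of pitch/yaw/roll parameters
nod = np.zeros((10, 3)); nod[:, 0] = 0.2 * np.sin(np.linspace(0.0, np.pi, 10))
insert_action(timeline, nod, 1.03, 1.17)
```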
In one possible implementation, inserting the facial expression and the body movement into a virtual character of a video sequence, and generating the facial expression and the body movement of the virtual character includes:
acquiring expression parameters matched with the facial expression and action parameters matched with the limb action;
and adjusting the pixel position of the virtual character in the video sequence according to the expression parameters and the action parameters to generate the facial expression and the limb action of the virtual character.
In this disclosure, the expression parameters may include parameters of expression-creation software such as Character Builder, blendshapes and the like, and may also include expression parameters used for three-dimensional capture in animation, which is not limited in this disclosure. The corresponding expression, such as smiling, crying or anger, is obtained by setting the values of the expression parameters, for example by scaling a blendshape value as blendshape × (1.0 + s), where s is an adjustment coefficient. The motion parameters may be parameters related to the motion of the limbs, such as pitch, yaw and roll, and they may be scaled by a stretch coefficient s_body as in formulas (1) to (3) below, where roll', pitch' and yaw' respectively denote the adjusted roll, pitch and yaw parameters.
roll' = roll × (1.0 + s_{body,roll})    (1)

pitch' = pitch × (1.0 + s_{body,pitch})    (2)

yaw' = yaw × (1.0 + s_{body,yaw})    (3)
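A minimal sketch of applying formulas (1)-(3) to a head pose, together with the analogous scaling of a blendshape weight; the coefficient names and example values are illustrative assumptions only.

```python
from dataclasses import dataclass

@dataclass
class HeadPose:
    roll: float
    pitch: float
    yaw: float

def adjust_pose(pose: HeadPose, s_roll: float, s_pitch: float, s_yaw: float) -> HeadPose:
    # Formulas (1)-(3): each angle is scaled by its stretch coefficient s_{body,*}.
    return HeadPose(
        roll=pose.roll * (1.0 + s_roll),
        pitch=pose.pitch * (1.0 + s_pitch),
        yaw=pose.yaw * (1.0 + s_yaw),
    )

def adjust_blendshape(weight: float, s_face: float) -> float:
    # Analogous scaling of an expression (blendshape) weight, e.g. to strengthen a smile.
    return weight * (1.0 + s_face)

# Example: exaggerate a nod (pitch) by 20 % and a smile blendshape by 10 %.
print(adjust_pose(HeadPose(roll=0.0, pitch=8.0, yaw=0.0), 0.0, 0.2, 0.0))
print(adjust_blendshape(0.6, 0.1))
```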
In one possible implementation manner, the inserting the facial expression and the body movement into a virtual character of a video sequence to generate a facial expression and a body movement of the virtual character includes:
adjusting the facial expression or the limb movement when a plurality of continuous words in the text data have matched facial expressions or limb movements;
registering the adjusted facial expression or the adjusted limb action to a virtual character of the video sequence to generate the facial expression and the limb action of the virtual character.
In the disclosed embodiment, multiple consecutive words may each have a matching facial expression or limb action. Take the sentence "Welcome to the online interview of our Company X!" as an example: the prosody model may predict a body action for several consecutive words, e.g. "welcome" corresponds to nodding, "attend" corresponds to raising the head, and "our" corresponds to a slight lifting motion. When several consecutive words all have corresponding facial expressions or body actions, inserting all of them makes the expressions or actions incoherent, and therefore the facial expressions or body actions are adjusted. In one example, adjusting the facial expression or the limb movement comprises: acquiring the facial expression or limb action matched with the word at a preset position among the consecutive words, and removing the facial expressions or limb actions matched with the words at the other positions; for example, the facial expressions or limb actions corresponding to all words except the first word are removed. In another example, adjusting the facial expression or the limb movement comprises: acquiring, among the consecutive words, the facial expression or limb action with the highest preset priority, and removing the facial expressions or limb actions matched with the other words. For example, the expression matched with a stressed (emphasized) word among the consecutive words may take priority over the others, so the facial expressions or limb actions matched with the remaining words are removed.
According to the embodiment of the disclosure, the predicted facial expressions or limb movements are adjusted, and redundant ones are deleted when several consecutive words all have matching facial expressions and limb movements, which yields more natural expressions and movements.
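A small sketch of the two pruning strategies just described for runs of consecutive annotated words: keep only the word at a preset position (here, the first word of the run), or keep only the annotation with the highest priority. The priority table and label names are hypothetical.

```python
from typing import List, Optional, Tuple

# (word, annotation) pairs; annotation is None when no expression/action is matched.
Tagged = Tuple[str, Optional[str]]

# Hypothetical priority table: larger value = higher priority (e.g. stressed-word expressions first).
PRIORITY = {"stressed": 3, "nod": 2, "raise_head": 1, "lift_hand": 1}

def annotated_runs(tags: List[Tagged]):
    """Yield (start, end) index ranges of maximal runs in which every word has an annotation."""
    i = 0
    while i < len(tags):
        if tags[i][1] is None:
            i += 1
            continue
        j = i
        while j < len(tags) and tags[j][1] is not None:
            j += 1
        yield i, j
        i = j

def prune(tags: List[Tagged], strategy: str = "first") -> List[Tagged]:
    out = list(tags)
    for i, j in annotated_runs(tags):
        if j - i < 2:
            continue                      # a single annotated word needs no pruning
        if strategy == "first":
            keep = i                      # preset position: keep the first word of the run
        else:                             # "priority": keep the highest-priority annotation
            keep = max(range(i, j), key=lambda k: PRIORITY.get(tags[k][1], 0))
        for k in range(i, j):
            if k != keep:
                out[k] = (out[k][0], None)
    return out

print(prune([("welcome", "nod"), ("attend", "raise_head"), ("our", "lift_hand"), ("company", None)]))
```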
In one possible implementation manner, the prosodic model is configured to be obtained through training according to the corresponding relationship between the semantic features of the sample text data and the sample facial expressions and the sample body movements, and includes:
obtaining sample text data, wherein the sample text data comprises a plurality of sample words marked with facial expressions and limb actions;
constructing a rhythm model, wherein training parameters are set in the rhythm model;
inputting the sample words into the prosodic model to generate a prediction result;
and iteratively adjusting the training parameters based on the difference between the prediction result and the labeled sample facial expression and sample limb actions until the difference meets the preset requirement to obtain the prosody model.
In the embodiment of the present disclosure, the sample text data may be obtained in the following manner: a video segment is acquired, image data and audio data are separated from the video, and the corresponding text data is obtained from the audio data; the text data can be expressed in the form of the following array: array 1 (word, start time, end time). The facial expressions and limb actions in the image data, together with their start and end times, are obtained through face detection and limb detection, and can be expressed in the form of the following array: array 2 (facial expression, limb movement, start time, end time). The two groups of data are then integrated by aligning array 1 and array 2 in time, so as to obtain sample words labeled with facial expressions and limb movements.
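One simple way to align array 1 and array 2 is by temporal interval overlap, as sketched below; the overlap criterion (any intersection of the two time spans) is an assumption added for illustration, since the disclosure only requires the two arrays to be aligned in time.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]                             # array 1: (word, start, end)
Anno = Tuple[Optional[str], Optional[str], float, float]    # array 2: (expression, action, start, end)

def overlaps(a0: float, a1: float, b0: float, b1: float) -> bool:
    return max(a0, b0) < min(a1, b1)

def label_words(words: List[Word], annos: List[Anno]):
    """Attach to each word the expression/action whose time span overlaps the word's span."""
    labeled = []
    for w, ws, we in words:
        expr, act = None, None
        for e, a, s0, s1 in annos:
            if overlaps(ws, we, s0, s1):
                expr, act = e, a
                break
        labeled.append((w, expr, act, ws, we))
    return labeled

# Example with made-up times: the word spanning 1.03 s - 1.17 s overlaps a detected nod.
words = [("currently", 1.03, 1.17), ("suitable", 1.17, 1.51)]
annos = [(None, "nod", 1.00, 1.15)]
print(label_words(words, annos))
```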
FIG. 2 is a schematic block diagram illustrating a prosody model according to an exemplary embodiment. Referring to fig. 2, a prosody model is constructed in which training parameters are set. The prosody model in the embodiment of the disclosure may perform word segmentation on the text data, extract word vectors, and input the word vectors into a pretrained language model (PLM) to obtain the output of the last layer of the PLM. A preset number of fully connected layers are added after the PLM, followed by a logistic-regression layer that includes logistic regression and softmax regression, where the logistic regression yields the probability distribution over the facial-expression and limb-action types and the softmax regression outputs the corresponding predicted label. A cross-entropy loss layer calculates the difference between the prediction result and the labeled sample facial expressions and sample limb actions, and the training parameters are adjusted iteratively until the difference meets the preset requirement, so as to obtain the prosody model.
In the prosody model trained by this method, the semantic information of the words is extracted by the pretrained language model (PLM), and the words are then classified according to facial expressions and body actions.
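A compact PyTorch sketch of the classification head and training step described for Fig. 2, under the assumption that a PLM encoder already supplies one hidden vector per word; the hidden size, label count, head depth and optimizer settings are illustrative, and nn.CrossEntropyLoss folds the softmax into the loss.

```python
import torch
from torch import nn

NUM_LABELS = 12      # assumed number of facial-expression / limb-action classes (incl. "none")
HIDDEN = 768         # assumed PLM hidden size (e.g. a BERT-base-like encoder)

class ProsodyHead(nn.Module):
    """Fully connected layers plus a classification layer placed after the PLM's last hidden states."""
    def __init__(self, hidden: int = HIDDEN, num_labels: int = NUM_LABELS):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.cls = nn.Linear(hidden, num_labels)    # logits; softmax is applied in the loss / at inference

    def forward(self, word_states: torch.Tensor) -> torch.Tensor:
        return self.cls(self.fc(word_states))       # shape: (batch, seq_len, num_labels)

head = ProsodyHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def train_step(word_states: torch.Tensor, labels: torch.Tensor) -> float:
    """word_states: PLM last-layer outputs (batch, seq_len, hidden); labels: (batch, seq_len)."""
    logits = head(word_states)
    loss = loss_fn(logits.reshape(-1, NUM_LABELS), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy tensors only to exercise the step; in practice word_states come from the PLM encoder.
print(train_step(torch.randn(2, 8, HIDDEN), torch.randint(0, NUM_LABELS, (2, 8))))
```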
In one possible implementation, the sample text data set includes a plurality of sample words labeled with facial expressions and body movements, where the method for labeling the facial expressions and the body movements on the sample words includes:
acquiring sample audio data matched with the sample text data;
according to semantic information of sample text data, facial expressions and limb actions are labeled on sample words to obtain initial labeled sample words;
and supplementing the initial sample labeled words according to the volume information and the speech speed information of the sample audio data to obtain labeled sample words.
In the embodiment of the present disclosure, the sample audio data matched with the sample text data may be obtained by separating the audio data from a video segment as in the above embodiment, or by synthesizing the corresponding audio data from the sample text data with a TTS algorithm. The sample words are labeled with facial expressions and limb actions according to the semantic information of the sample text data to obtain initial labeled sample words; this may include aligning the facial expressions and body movements detected in the image data with the words of the audio data and labeling accordingly, so as to obtain the initial labeled sample words.
In the embodiment of the present disclosure, the volume information indicates the loudness. It may be measured, for example, by whether the volume of a word, computed at the word's position, exceeds a predetermined value vol_threshold. In one example, the predetermined value may be selected by a statistical method, for instance vol_threshold is taken as the N-th percentile of the volumes of a plurality of words, where the value of N is determined by the probability of occurrence of the related head movement and a value between 60 and 90 may be chosen. If the volume is larger than the preset value, a corresponding facial expression or action, such as nodding or waving, can be matched; when the volume is smaller than a certain preset value, the speaking voice becomes softer and corresponds to other facial expressions or actions. Supplementing the initial labeled sample words with the volume information therefore yields more labeled sample words. In another example, the speaking speed can also reflect the speaker's expressive state, such as urgency. The initial labeled sample words can be supplemented with the facial expressions and limb actions corresponding to the speech speed information, again yielding more labeled sample words.
According to the embodiment of the disclosure, the sample labeled words are expanded from the dimensions of the volume information and the speech speed information, so that more labeled sample words are obtained, and the prediction accuracy of the prosodic model is improved.
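A sketch of this supplementation under the percentile assumption above (an N-th percentile threshold with N between 60 and 90); the specific action names and the speech-rate rule are illustrative assumptions.

```python
import numpy as np

def volume_threshold(word_volumes, n_percentile: float = 75.0) -> float:
    """vol_threshold as the N-th percentile of the word volumes (N chosen between 60 and 90)."""
    return float(np.percentile(word_volumes, n_percentile))

def supplement_labels(samples, n_percentile: float = 75.0, fast_rate: float = 6.0):
    """samples: dicts with 'word', 'volume', 'rate' (syllables/s) and an optional 'action' label."""
    thr = volume_threshold([s["volume"] for s in samples], n_percentile)
    for s in samples:
        if s.get("action"):
            continue                          # keep any label obtained from semantic annotation
        if s["volume"] > thr:
            s["action"] = "nod"               # loud word: assume an emphasizing movement
        elif s["rate"] > fast_rate:
            s["action"] = "lean_forward"      # fast speech: assume an "urgent" action
    return samples

samples = [
    {"word": "welcome", "volume": 0.82, "rate": 4.0},
    {"word": "interview", "volume": 0.40, "rate": 7.2},
    {"word": "today", "volume": 0.35, "rate": 3.5},
]
print(supplement_labels(samples))
```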
Fig. 3 is a schematic block diagram illustrating an apparatus for generating facial expressions and limb movements of a virtual character according to an exemplary embodiment. Referring to fig. 3, the apparatus includes an acquisition module 301, a prediction module 302, and a generation module 303.
An obtaining module 301, configured to obtain text data;
the prediction module 302 is configured to input the text data into a prosody model, and output facial expressions or body movements matched with the text data through the prosody model, where the prosody model is trained according to corresponding relationships between semantic features of sample text data and the facial expressions or the body movements;
the generating module 303 is configured to insert the facial expression and the body movement into a virtual character of a video sequence, and generate a facial expression and a body movement of the virtual character.
In one possible implementation, the generating module includes:
the obtaining submodule is used for obtaining audio data corresponding to the text data;
the determining submodule is used for determining the time information of the facial expression and the limb action matched with the words according to the time information of the words in the audio data;
and the generation submodule is used for inserting the facial expression and the limb action into a virtual character of a video sequence according to the time information to generate the facial expression and the limb action of the virtual character.
In one possible implementation manner, the prosodic model is configured to be obtained through training according to the corresponding relationship between the semantic features of the sample text data and the sample facial expressions and the sample body movements, and includes:
obtaining sample text data, wherein the sample text data comprises a plurality of sample words marked with facial expressions and limb actions;
constructing a rhythm model, wherein training parameters are set in the rhythm model;
inputting the sample words into the prosodic model to generate a prediction result;
and iteratively adjusting the training parameters based on the difference between the prediction result and the labeled sample facial expression and sample limb actions until the difference meets the preset requirement to obtain the prosody model.
In one possible implementation, the sample text data set includes a plurality of sample words labeled with facial expressions and body movements, where the method for labeling the facial expressions and the body movements on the sample words includes:
acquiring sample audio data matched with the sample text data;
according to semantic information of sample text data, facial expressions and limb actions are labeled on sample words to obtain initial labeled sample words;
and supplementing the initial sample labeled words according to the volume information and the speech speed information of the sample audio data to obtain labeled sample words.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 4 is a block diagram illustrating an apparatus 800 for generating facial expressions and body movements of a virtual character according to an exemplary embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 4, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor assembly 814 may also detect a change in the position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 5 is a block diagram illustrating an apparatus 1900 for generating facial expressions and body movements of a virtual character according to an exemplary embodiment. For example, the apparatus 1900 may be provided as a server. Referring to FIG. 5, the device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The device 1900 may also include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided that includes instructions, such as the memory 1932 that includes instructions, which are executable by the processing component 1922 of the apparatus 1900 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (14)
1. A method for generating facial expressions and limb movements of a virtual character is characterized by comprising the following steps:
acquiring text data;
inputting the text data into a prosody model, and outputting facial expressions or limb actions matched with the text data through the prosody model, wherein the prosody model is set to be obtained through training according to the corresponding relation between semantic features of sample text data and the facial expressions or the limb actions;
inserting the facial expression and the limb actions into a virtual character of a video sequence to generate the facial expression and the limb actions of the virtual character.
2. The method of claim 1, wherein the inserting the facial expression and the body movement into a virtual character of a video sequence, and the generating the facial expression and the body movement of the virtual character comprises:
acquiring audio data corresponding to the text data;
determining the time information of facial expressions and limb actions matched with the words according to the time information of the audio data words;
and inserting the facial expression and the limb actions into a virtual character of a video sequence according to the time information to generate the facial expression and the limb actions of the virtual character.
3. The method of claim 1, wherein inserting the facial expression and the body movement into a virtual character of a video sequence, and generating the facial expression and the body movement of the virtual character comprises:
acquiring expression parameters matched with the facial expression and action parameters matched with the limb action;
and adjusting the pixel position of the virtual character in the video sequence according to the expression parameters and the action parameters to generate the facial expression and the limb action of the virtual character.
4. The method of claim 1, wherein the inserting the facial expression and the body movement into a virtual character of a video sequence, and the generating the facial expression and the body movement of the virtual character comprises:
adjusting the facial expression or the limb movement when a plurality of continuous words in the text data have matched facial expressions or limb movements;
registering the adjusted facial expression or the adjusted limb action to a virtual character of the video sequence to generate the facial expression and the limb action of the virtual character.
5. The method of claim 4, wherein the adjusting the facial expression or the limb movement comprises:
and acquiring facial expressions or limb actions matched with words at preset positions in the continuous words, and removing the facial expressions or limb actions matched with words except the preset positions.
6. The method of claim 4, wherein the adjusting the facial expression or the limb movement comprises:
and acquiring facial expressions or limb actions with the highest preset priority level matched with the words from the continuous words, and removing the facial expressions or the limb actions matched with the words except the words.
7. The method of claim 1, wherein the prosodic model is trained according to the semantic features of the sample text data and the corresponding relationship between the sample facial expressions and the sample body movements, and comprises:
obtaining sample text data, wherein the sample text data comprises a plurality of sample words marked with facial expressions and limb actions;
constructing a rhythm model, wherein training parameters are set in the rhythm model;
inputting the sample words into the prosodic model to generate a prediction result;
and iteratively adjusting the training parameters based on the difference between the prediction result and the labeled sample facial expression and sample limb actions until the difference meets the preset requirement to obtain the prosody model.
8. The method of claim 7, wherein the sample text data set comprises a plurality of sample words labeled with facial expressions and body movements, and wherein the method of labeling facial expressions and body movements on the sample words comprises:
acquiring sample audio data matched with the sample text data;
according to semantic information of sample text data, facial expressions and limb actions are labeled on sample words to obtain initial labeled sample words;
and supplementing the initial sample labeled words according to the volume information and the speech speed information of the sample audio data to obtain labeled sample words.
9. An apparatus for generating a facial expression and a body movement of a virtual character, comprising:
the acquisition module is used for acquiring text data;
the prediction module is used for inputting the text data into a prosody model and outputting facial expressions or limb actions matched with the text data through the prosody model, wherein the prosody model is obtained by training according to the corresponding relation between semantic features of sample text data and the facial expressions or the limb actions;
and the generating module is used for inserting the facial expression and the limb actions into a virtual character of a video sequence to generate the facial expression and the limb actions of the virtual character.
10. The apparatus of claim 9, wherein the generation module comprises:
an obtaining submodule configured to obtain audio data corresponding to the text data;
a determining submodule configured to determine time information of the facial expression and the limb action matched with each word according to time information of the words in the audio data;
and a generation submodule configured to insert the facial expression and the limb action into the virtual character of the video sequence according to the time information to generate the facial expression and the limb action of the virtual character.
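(Illustrative note, not part of the claims.) The timing logic of the determining and generation submodules could be sketched as below, assuming word-level timestamps are available for the audio (for example from forced alignment) and that the video runs at a fixed frame rate; both assumptions are mine, not the patent's.

```python
# Illustrative time alignment: map word-level timestamps from the audio to
# video frame indices at which the matched expression/action is inserted.
# Word timestamps, frame rate, and data shapes are assumptions.

def align_to_frames(word_timings, matches, fps=25):
    """word_timings: list of (word, start_sec, end_sec) from the audio.
    matches: dict mapping word -> matched expression/action name."""
    schedule = []
    for word, start, end in word_timings:
        if word in matches:
            schedule.append({
                "label": matches[word],
                "start_frame": int(round(start * fps)),
                "end_frame": int(round(end * fps)),
            })
    return schedule

word_timings = [("hello", 0.12, 0.48), ("everyone", 0.52, 1.05)]
matches = {"hello": "wave"}
print(align_to_frames(word_timings, matches))
# [{'label': 'wave', 'start_frame': 3, 'end_frame': 12}]
```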
11. The apparatus of claim 10, wherein the prosody model is obtained by training according to the correspondence between semantic features of sample text data and sample facial expressions and sample limb actions, and wherein the training comprises:
obtaining sample text data, wherein the sample text data comprises a plurality of sample words marked with facial expressions and limb actions;
constructing a prosody model, wherein training parameters are set in the prosody model;
inputting the sample words into the prosody model to generate a prediction result;
and iteratively adjusting the training parameters based on the difference between the prediction result and the labeled sample facial expressions and sample limb actions until the difference meets a preset requirement, so as to obtain the prosody model.
12. The apparatus of claim 11, wherein the sample text data comprises a plurality of sample words labeled with facial expressions and limb actions, and wherein labeling the sample words with facial expressions and limb actions comprises:
acquiring sample audio data matched with the sample text data;
labeling the sample words with facial expressions and limb actions according to semantic information of the sample text data to obtain initially labeled sample words;
and supplementing the initially labeled sample words according to volume information and speech rate information of the sample audio data to obtain the labeled sample words.
13. An apparatus for generating a facial expression and a limb action of a virtual character, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of any one of claims 1 to 8.
14. A non-transitory computer readable storage medium having instructions therein which, when executed by a processor, enable the processor to perform the method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111083019.3A CN113920559A (en) | 2021-09-15 | 2021-09-15 | Method and device for generating facial expressions and limb actions of virtual character |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113920559A (en) | 2022-01-11 |
Family
ID=79235179
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111083019.3A Pending CN113920559A (en) | 2021-09-15 | 2021-09-15 | Method and device for generating facial expressions and limb actions of virtual character |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113920559A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114898018A (en) * | 2022-05-24 | 2022-08-12 | 北京百度网讯科技有限公司 | Animation generation method and device for digital object, electronic equipment and storage medium |
CN115908722A (en) * | 2023-01-05 | 2023-04-04 | 杭州华鲤智能科技有限公司 | Method for generating 3D face modeling |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107784355A (en) * | 2017-10-26 | 2018-03-09 | 北京光年无限科技有限公司 | The multi-modal interaction data processing method of visual human and system |
CN111369687A (en) * | 2020-03-04 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Method and device for synthesizing action sequence of virtual object |
CN112330780A (en) * | 2020-11-04 | 2021-02-05 | 北京慧夜科技有限公司 | Method and system for generating animation expression of target character |
Non-Patent Citations (1)
Title |
---|
HOU Jin (侯进): "Research on personalized virtual human modeling and text-controlled synthesis of its actions and expressions" (个性化虚拟人建模及文本控制其动作表情合成研究), 《学术动态》 (Academic Trends), no. 4, 15 December 2012 (2012-12-15), pages 16-19 *
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109599128B (en) | Speech emotion recognition method and device, electronic equipment and readable medium | |
CN109189985B (en) | Text style processing method and device, electronic equipment and storage medium | |
CN109871896B (en) | Data classification method and device, electronic equipment and storage medium | |
CN107291690B (en) | Punctuation adding method and device and punctuation adding device | |
CN107221330B (en) | Punctuation adding method and device and punctuation adding device | |
CN110210310B (en) | Video processing method and device for video processing | |
CN112185389B (en) | Voice generation method, device, storage medium and electronic equipment | |
CN109961791B (en) | Voice information processing method and device and electronic equipment | |
CN109819288B (en) | Method and device for determining advertisement delivery video, electronic equipment and storage medium | |
CN107133354B (en) | Method and device for acquiring image description information | |
CN104077597B (en) | Image classification method and device | |
CN113920559A (en) | Method and device for generating facial expressions and limb actions of virtual character | |
CN111210844B (en) | Method, device and equipment for determining speech emotion recognition model and storage medium | |
CN112037756A (en) | Voice processing method, apparatus and medium | |
CN111144101A (en) | Wrongly written character processing method and device | |
CN112735396A (en) | Speech recognition error correction method, device and storage medium | |
CN113095085A (en) | Text emotion recognition method and device, electronic equipment and storage medium | |
CN112036174B (en) | Punctuation marking method and device | |
CN110619325A (en) | Text recognition method and device | |
CN111723606A (en) | Data processing method and device and data processing device | |
CN112579767A (en) | Search processing method and device for search processing | |
CN113420553A (en) | Text generation method and device, storage medium and electronic equipment | |
CN113553946A (en) | Information prompting method and device, electronic equipment and storage medium | |
CN110858099B (en) | Candidate word generation method and device | |
CN108346423B (en) | Method and device for processing speech synthesis model |
Legal Events
Code | Title | Date | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||