CN113689880A - Method, device, electronic equipment and medium for driving virtual human in real time - Google Patents

Method, device, electronic equipment and medium for driving virtual human in real time

Info

Publication number
CN113689880A
CN113689880A (application CN202010420720.9A)
Authority
CN
China
Prior art keywords
feature sequence
data
model
feature
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010420720.9A
Other languages
Chinese (zh)
Other versions
CN113689880B (en)
Inventor
樊博
陈伟
陈曦
孟凡博
刘恺
张克宁
段文君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010420720.9A priority Critical patent/CN113689880B/en
Priority to PCT/CN2021/078244 priority patent/WO2021232877A1/en
Publication of CN113689880A publication Critical patent/CN113689880A/en
Priority to US17/989,323 priority patent/US20230082830A1/en
Application granted granted Critical
Publication of CN113689880B publication Critical patent/CN113689880B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10 - Transforming into visible information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of this specification disclose a method for driving a virtual human in real time. The method includes: acquiring data to be processed for driving the virtual human, where the data to be processed includes at least one of text data and voice data; processing the data to be processed with an end-to-end model to determine an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed; and inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, which drives the virtual human. In this way, the acoustic, facial and limb feature sequences can be obtained in a shorter time through the end-to-end model, and feeding these sequences into the muscle model drives the virtual human directly, which greatly reduces the amount of computation and data transmission, improves computational efficiency, and thus greatly improves the real-time performance of driving the virtual human.

Description

Method, device, electronic equipment and medium for driving virtual human in real time
Technical Field
The embodiment of the specification relates to the technical field of virtual human processing, in particular to a method, a device, electronic equipment and a medium for driving a virtual human in real time.
Background
A Digital Human is a comprehensive rendering technology that uses a computer to simulate a real person; it is also called a virtual human, a hyper-realistic human or a photo-realistic human. Because people are extremely familiar with real humans, a 3D static model can be obtained only by spending a great deal of time, and whenever the static model is driven to move, even a subtle expression has to be modeled again. Because the model is highly realistic, this modeling requires computation over a large amount of data; the computation process is long, and a single movement of the model generally takes one hour or several hours of computation to realize, so the real-time performance of driving is very poor.
Disclosure of Invention
The embodiments of this specification provide a method, a device, electronic equipment and a medium for driving a virtual human in real time, with which the virtual human can be driven in real time.
The first aspect of the embodiments of the present specification provides a method for driving a virtual human in real time, including:
acquiring data to be processed for driving a virtual human, wherein the data to be processed comprises at least one of text data and voice data;
processing the data to be processed by using an end-to-end model, and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, and driving a virtual human through the muscle model;
wherein the processing the data to be processed using the end-to-end model comprises:
acquiring text features and duration features of the data to be processed;
determining the acoustic feature sequence according to the text feature and the duration feature;
and determining the facial feature sequence and the limb feature sequence according to the text features and the duration features.
Optionally, the obtaining the text feature and the duration feature of the data to be processed includes:
acquiring the text features through a FastSpeech model;
and acquiring the duration features through a duration model, where the duration model is a deep learning model.
Optionally, if the FastSpeech model trained to output the acoustic feature sequence is a first FastSpeech model and the FastSpeech model trained to output the facial feature sequence and the limb feature sequence is a second FastSpeech model, the determining the acoustic feature sequence according to the text feature and the duration feature includes:
inputting the text feature and the duration feature into the first FastSpeech model to obtain the acoustic feature sequence;
and the determining the facial feature sequence and the limb feature sequence according to the text feature and the duration feature includes:
inputting the text feature and the duration feature into the second FastSpeech model to obtain the facial feature sequence and the limb feature sequence.
Optionally, the inputting the acoustic feature sequence, the facial feature sequence, and the limb feature sequence into a trained muscle model includes:
fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence to obtain a fused feature sequence;
inputting the fused feature sequence into the muscle model.
Optionally, the fusing the acoustic feature sequence, the facial feature sequence, and the limb feature sequence to obtain a fused feature sequence includes:
and fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence based on the duration features to obtain a fused feature sequence.
Optionally, the facial features corresponding to the facial feature sequence include expressive features and lip features.
A second aspect of the embodiments of the present specification provides a method for driving a virtual human in real time, including:
acquiring data to be processed for driving a virtual human, wherein the data to be processed comprises at least one of text data and voice data;
processing the data to be processed by using an end-to-end model, and determining a fusion feature sequence corresponding to the data to be processed, where the fusion feature sequence is obtained by fusing an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
inputting the fusion feature sequence into a trained muscle model, and driving a virtual human through the muscle model;
wherein the processing the data to be processed using the end-to-end model comprises:
acquiring text features and duration features of the data to be processed;
determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature;
and obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence.
Optionally, the obtaining the text feature and the duration feature of the data to be processed includes:
acquiring the text features through a FastSpeech model;
and acquiring the duration features through a duration model, where the duration model is a deep learning model.
Optionally, the obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence, and the limb feature sequence includes:
and fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence based on the duration features to obtain a fused feature sequence.
A third aspect of the embodiments of the present specification provides an apparatus for driving a virtual human in real time, including:
the data acquisition module is used for acquiring data to be processed for driving the virtual human, wherein the data to be processed comprises at least one of text data and voice data;
the data processing module is used for processing the data to be processed by using an end-to-end model and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
the virtual human driving module is used for inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model and driving a virtual human through the muscle model;
the data processing module is specifically used for acquiring text features and duration features of the data to be processed; determining the acoustic feature sequence according to the text feature and the duration feature; and determining the facial feature sequence and the limb feature sequence according to the text features and the duration features.
A fourth aspect of the embodiments of the present specification provides a device for driving a virtual human in real time, including:
the data acquisition module is used for acquiring data to be processed for driving the virtual human, wherein the data to be processed comprises at least one of text data and voice data;
the data processing module is used for processing the data to be processed by using an end-to-end model and determining a fusion feature sequence corresponding to the data to be processed, where the fusion feature sequence is obtained by fusing an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
the virtual human driving module is used for inputting the fusion feature sequence into a trained muscle model and driving a virtual human through the muscle model;
the data processing module is specifically used for acquiring text features and duration features of the data to be processed; determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature; and obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence.
A fifth aspect of the embodiments of this specification provides an apparatus for driving a virtual human in real time, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the steps of the above method for driving a virtual human in real time.
A sixth aspect of embodiments of the present specification provides a machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the above-described steps of the method for driving a virtual human in real time.
The beneficial effects of the embodiment of the specification are as follows:
Based on the above technical scheme, after the data to be processed is obtained, it is processed with an end-to-end model to obtain an acoustic feature sequence, a facial feature sequence and a limb feature sequence, and these sequences are input into a trained muscle model that drives the virtual human. Because the end-to-end model takes the raw data to be processed as input and directly outputs the acoustic feature sequence, the facial feature sequence and the limb feature sequence, it can better exploit and adapt to the parallel computing capability of new hardware (such as a GPU) and runs faster; that is, the acoustic feature sequence, the facial feature sequence and the limb feature sequence can be obtained in a shorter time. These sequences are then input into the muscle model to drive the virtual human directly: after the virtual human is created, the acoustic feature sequence directly controls its voice output while the facial feature sequence and the limb feature sequence control its facial expressions and limb movements.
Drawings
FIG. 1 is a flowchart illustrating the training of an end-to-end model that outputs an acoustic feature sequence in an embodiment of the present disclosure;
FIG. 2 is a first flowchart of a method for driving a virtual human in real time according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating the steps by which a first FastSpeech model outputs an acoustic feature sequence in an embodiment of the present disclosure;
FIG. 4 is a second flowchart of a method for driving a virtual human in real time according to an embodiment of the present disclosure;
FIG. 5 is a first structural diagram of an apparatus for driving a virtual human in real time in an embodiment of the present specification;
FIG. 6 is a second structural diagram of an apparatus for driving a virtual human in real time in an embodiment of the present specification;
FIG. 7 is a block diagram of an apparatus for driving a virtual human in real time, configured as a device, in an embodiment of the present specification;
FIG. 8 is a block diagram of a server in some embodiments of the present disclosure.
Detailed Description
To better understand the technical solutions, the technical solutions of the embodiments of this specification are described in detail below with reference to the drawings and specific embodiments. It should be understood that the specific features of the embodiments and examples of this specification are detailed descriptions of the technical solutions rather than limitations on them, and that, where there is no conflict, the technical features of the embodiments and examples may be combined with one another.
To address the technical problem that driving a virtual human consumes a large amount of time, the embodiments of the invention provide a scheme for driving a virtual human in real time, which specifically includes: acquiring data to be processed for driving a virtual human, where the data to be processed includes at least one of text data and voice data; processing the data to be processed by using an end-to-end model, and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed; and inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, and driving the virtual human through the muscle model;
the processing the data to be processed by using the end-to-end model, and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed includes: acquiring text features and duration features of the data to be processed; determining the acoustic feature sequence according to the text feature and the duration feature; and determining the facial feature sequence and the limb feature sequence according to the text features and the duration features.
The virtual human in the embodiment of the invention may be a highly realistic virtual human that differs little from a real person; it can be applied to content-expression scenes such as news broadcasting, teaching, medical, customer service, legal and conference scenes.
In the embodiment of the invention, the data to be processed may be text data, voice data, or both text data and voice data at the same time, which is not specifically limited in this specification.
For example, in a news broadcasting scene, the news script to be broadcast for driving the virtual human needs to be acquired; this news script is the data to be processed. The news script may be text edited manually or by a machine, and after the text has been edited, the edited text is acquired as the news script.
In the embodiment of the invention, before the end-to-end model is used for processing the data to be processed, the end-to-end model is required to be trained through a sample to obtain a trained end-to-end model; and after the trained end-to-end model is obtained, processing the data to be processed by using the trained end-to-end model.
The end-to-end model in the embodiment of the invention involves two training methods: one trains the end-to-end model to output an acoustic feature sequence, and the other trains it to output a facial feature sequence and a limb feature sequence. The end-to-end model may specifically be a FastSpeech model.
When training the end-to-end model that outputs the acoustic feature sequence, the training samples may be text data, voice data and video data. For each training sample in the training sample set, the training steps are as shown in fig. 1. First, step A1 is executed to obtain the acoustic features 101 and the text features 102 of the training sample, where the text features 102 may be at the phoneme level; specifically, the feature data of the training sample can be mapped through an embedding layer in the end-to-end model to obtain the acoustic features 101 and the text features 102. Then step A2 is executed: the acoustic features 101 and the text features 102 are processed by a feed-forward Transformer 103 to obtain acoustic vectors 104 and text coding features 105, where the acoustic vectors 104 may be sentence-level or word-level acoustic vectors and the text coding features 105 are also at the phoneme level. Next, step A3 is executed to align the acoustic vectors 104 with the text coding features 105 to obtain aligned text coding features 106; a duration predictor may be used for this alignment, where the text coding features 105 are specifically phoneme features and the acoustic vectors 104 may be mel spectrograms, so the phoneme features and the mel spectrograms can be aligned by the duration predictor. Step A4 is then executed to decode (107) the aligned text coding features 106 to obtain the acoustic feature sequence 108. At this point, the speech rate can easily be controlled by lengthening or shortening the phoneme durations with a length regulator, which determines the length of the generated mel spectrogram, and part of the prosody can be controlled by adding intervals between adjacent phonemes; the acoustic feature sequence is obtained from the determined mel-spectrogram length and the phoneme interval times.
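As a loose illustration of steps A1-A4, the following PyTorch sketch shows a FastSpeech-style text-to-acoustic-feature model. It is a minimal sketch under stated assumptions: the module names (FeedForwardTransformer, DurationPredictor, TextToMel), layer sizes and the simple batch-size-1 length regulator are illustrative choices, not the implementation described in this specification.

```python
import torch
import torch.nn as nn

class FeedForwardTransformer(nn.Module):
    """Stack of self-attention blocks standing in for the FastSpeech-style FFT blocks."""
    def __init__(self, dim=256, heads=2, layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):                       # x: (batch, phonemes, dim)
        return self.encoder(x)

class DurationPredictor(nn.Module):
    """Predicts a frame count for every phoneme, used to align text and acoustics (step A3)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
                                 nn.Conv1d(dim, 1, 1))

    def forward(self, x):                       # x: (batch, phonemes, dim)
        d = self.net(x.transpose(1, 2)).squeeze(1)
        return d.clamp(min=1).round().long()    # (batch, phonemes) durations in frames

class TextToMel(nn.Module):
    """Embedding (A1) -> FFT encoder (A2) -> duration alignment (A3) -> decoder (A4) -> mel."""
    def __init__(self, n_phonemes=80, dim=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.encoder = FeedForwardTransformer(dim)
        self.duration = DurationPredictor(dim)
        self.decoder = FeedForwardTransformer(dim)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, phoneme_ids):             # phoneme_ids: (1, phonemes), batch size 1
        h = self.encoder(self.embed(phoneme_ids))
        durations = self.duration(h)[0]
        # Length regulator: repeat each phoneme encoding for its predicted duration.
        frames = torch.repeat_interleave(h, durations, dim=1)
        return self.to_mel(self.decoder(frames))  # (1, total_frames, n_mels)
```

During training, the predicted mel frames and phoneme durations would be supervised against ground-truth spectrograms and aligned durations; that loss code is omitted from the sketch.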
When training the end-to-end model that outputs the acoustic feature sequence, its training sample set may contain, for example, 13,100 audio clips and the corresponding text transcripts, with a total audio length of about 24 hours. The training sample set is randomly divided into three groups: 12,500 samples for training, 300 samples for validation and 300 samples for testing. To alleviate pronunciation errors, a phoneme conversion tool is used to convert each text sequence into a phoneme sequence; for the voice data, the original waveform is converted into a mel spectrogram. The 12,500 samples are then used to train the end-to-end model; after training, the 300 validation samples are used to validate the trained model, and once validation is satisfied, the 300 test samples are used to test it. If the test condition is met, the trained end-to-end model is obtained.
If the trained end-to-end model does not meet the validation requirement, it is trained again with the training samples until it does; the model that meets the validation requirement is then tested. Once the trained end-to-end model meets both the validation requirement and the test condition, it is taken as the final model, that is, the trained end-to-end model.
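The data preparation just described (the 12,500/300/300 split, phoneme conversion and waveform-to-mel conversion) might look roughly like the following sketch. It assumes librosa for the mel spectrogram; text_to_phonemes is a placeholder standing in for the phoneme conversion tool mentioned above, and all paths and parameters are illustrative.

```python
import random
import librosa

def text_to_phonemes(transcript):
    # Placeholder for the grapheme-to-phoneme conversion tool; a real system would use a G2P model.
    return transcript.split()

def split_dataset(pairs):
    """pairs: list of (wav_path, transcript) tuples, e.g. 13,100 clips (~24 h of audio)."""
    random.seed(0)
    random.shuffle(pairs)
    return pairs[:12500], pairs[12500:12800], pairs[12800:13100]  # train / valid / test

def preprocess(wav_path, transcript, sr=22050, n_mels=80):
    # Convert the raw waveform into a mel spectrogram (frames x n_mels).
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels).T
    # Convert the text sequence into a phoneme sequence to alleviate pronunciation errors.
    phonemes = text_to_phonemes(transcript)
    return phonemes, mel
```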
When training the end-to-end model that outputs the facial feature sequence and the limb feature sequence, the training samples may be real-person video data and real-person motion data. For each training sample in the training sample set, the training steps are as follows. First, step B1 is executed to obtain the facial features, limb features and text features of the training sample, where the text features may be at the phoneme level; specifically, the feature data of the training sample can be mapped through an embedding layer in the end-to-end model to obtain the facial features, limb features and text features. Then step B2 is executed: the facial features, limb features and text features are processed by a feed-forward Transformer to obtain facial feature vectors, limb feature vectors and text coding features, where the facial feature vectors represent facial expressions, the limb feature vectors may be muscle action vectors, and the text coding features are likewise at the phoneme level. Next, step B3 is executed to align the facial feature vectors and the limb feature vectors with the text coding features; a duration predictor may be used for this alignment, where the text coding features are specifically phoneme features. Finally, step B4 is executed to obtain the facial feature sequence and the limb feature sequence; at this point, the facial expressions and movements can be aligned by lengthening or shortening the phoneme durations with a length regulator, yielding the facial feature sequence and the limb feature sequence.
The text features in the embodiment of the invention may include phoneme features and/or semantic features, etc. A phoneme is the smallest unit of speech divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. Phonemes may include vowels and consonants. Optionally, a specific phoneme feature corresponds to a specific lip feature, expression feature or limb feature, etc.
Semantics refers to the meanings of the concepts represented by the real-world objects that the text to be processed corresponds to, together with the relationships between those meanings; it is the interpretation and logical representation of the text to be processed in a certain domain. Optionally, specific semantic features correspond to specific limb features, etc.
When training the end-to-end model that outputs the facial feature sequence and the limb feature sequence, the training sample set comprises real-person motion data or real-person video data; the training process is analogous to that of the end-to-end model that outputs the acoustic feature sequence and is not repeated here for brevity.
After the end-to-end model that outputs the acoustic feature sequence and the end-to-end model that outputs the facial feature sequence and the limb feature sequence are obtained through training, the former is taken as the first end-to-end model and the latter as the second end-to-end model.
Thus, after the data to be processed is obtained, the text feature of the data to be processed can be obtained by using the embedding layer of the first end-to-end model, the duration feature of the data to be processed is obtained, and the text feature and the duration feature are input into the first end-to-end model to obtain the acoustic feature sequence. Correspondingly, the text feature of the data to be processed can be obtained by using the embedding layer of the second end-to-end model, the duration feature is obtained, and the text feature and the duration feature are input into the second end-to-end model to obtain the facial feature sequence and the limb feature sequence; of course, the text feature and the duration feature obtained above may also be input directly into the second end-to-end model to obtain the facial feature sequence and the limb feature sequence. In this embodiment of this specification, the first end-to-end model and the second end-to-end model may process data simultaneously, or either one may process data first; this specification is not specifically limited.
In the embodiment of the present invention, the duration feature may be used to characterize the durations of the phonemes corresponding to the text. The duration feature can capture the cadence, pauses, stress and tempo of speech, and can therefore improve the expressiveness and naturalness of the synthesized speech. Optionally, a duration model may be used to determine the duration feature corresponding to the data to be processed. The input of the duration model may be phoneme features carrying accent labels, and its output the phoneme durations. The duration model may be obtained by learning from speech samples carrying duration information, and may be, for example, a deep learning model such as a Convolutional Neural Network (CNN) or a Deep Neural Network (DNN); the specific duration model is not limited in the embodiment of the present invention.
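A duration model of this kind could be trained roughly as sketched below, assuming it is any network (e.g. a small CNN or DNN) that maps phoneme features to per-phoneme durations and is supervised with frame counts obtained from aligned speech; the log-duration target and the optimizer settings are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def train_duration_model(model, batches, epochs=10, lr=1e-3):
    """model: network mapping phoneme features (with accent labels) -> per-phoneme durations.
    batches yields (phoneme_feats, gt_durations); durations are frame counts from aligned speech."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for phoneme_feats, gt_durations in batches:
            pred = model(phoneme_feats)                          # (batch, phonemes)
            # Regress log-durations for numerical stability (a common convention).
            loss = loss_fn(pred, torch.log(gt_durations.float() + 1.0))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```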
After the acoustic feature sequence, the facial feature sequence and the limb feature sequence are obtained, they are input into the trained muscle model, and the virtual human is driven through the muscle model.
In the embodiment of the invention, the facial features include expression features and lip features. An expression refers to the thoughts and emotions expressed on the face, and expression features typically concern the whole face. Lip features are specific to the lips and are related to the text content, the speech, the pronunciation manner and so on, so the facial features can make facial expressions more vivid and refined.
Limb features convey a person's thoughts through the coordinated movements of body parts such as the head, eyes, neck, hands, elbows, arms, torso, hips and feet, so as to express intentions vividly. Limb features may include turning the head, shrugging the shoulders, gestures and so on, and can enrich the expressions conveyed by the image sequence. For example, at least one arm hangs naturally when speaking, and at least one arm rests naturally on the abdomen when not speaking.
In the embodiment of the invention, before the trained muscle model is used, model training is required to be carried out to obtain the trained muscle model; and after the trained muscle model is obtained, processing the acoustic feature sequence, the facial feature sequence and the limb feature sequence by using the trained muscle model.
When training the muscle model, the muscle model is first created according to the facial muscles and limb muscles of the human body, and training samples for the muscle model are obtained; the training samples may be real-person video data and real-person motion data. For each training sample in the training sample set, the training steps include:
firstly, step C1 is executed, and facial muscle features and limb muscle features of each training sample are obtained; then, step C2 is executed, and the facial muscle features and the limb muscle features of each training sample are used for training the muscle model; and after the training is finished, executing a step C3, and verifying the trained muscle model by using the verification sample; and after the verification meets the verification requirement, testing the muscle model obtained by training by using the test sample, and if the test meets the test condition, obtaining the trained muscle model.
If the trained muscle model does not meet the verification requirement, it is trained again with the training samples until it does; the model that meets the verification requirement is then tested. Once the trained muscle model meets both the verification requirement and the test condition, it is taken as the final model, that is, the trained muscle model.
In creating the muscle model, taking the facial muscles as an example, a polygon mesh is used for approximate abstract muscle control, and two types of muscles can be used: a linear muscle for stretching and a sphincter muscle for squeezing. Both types of muscle are attached to the mesh space at only one point and have a specified direction (when they deform, the angular displacement and radial displacement of a given point are computed), so the muscle control is independent of the specific facial topology and facial expressions can be made more vivid and refined. Likewise, the limb muscles are also controlled by approximate abstract muscles on a polygon mesh, which helps ensure that the limb movements are more accurate.
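The single-attachment-point, direction-specified muscles described above resemble classical abstract muscle models for facial animation. The following sketch loosely illustrates how a linear muscle might displace nearby mesh vertices; the angular and radial falloff functions and all parameters are assumptions for illustration, not the formulation used in this specification.

```python
import numpy as np

def linear_muscle_displace(vertices, attach, direction, contraction,
                           influence_radius=2.0, falloff=2.0):
    """Pull mesh vertices toward a linear muscle attached at one point with a given direction.

    vertices: (N, 3) mesh vertex positions; attach: (3,) attachment point;
    direction: (3,) muscle direction; contraction in [0, 1]."""
    direction = direction / np.linalg.norm(direction)
    offsets = vertices - attach                         # radial offsets from the attachment
    dist = np.linalg.norm(offsets, axis=1)
    # Angular weighting: vertices lying roughly along the muscle direction move the most.
    angular = np.clip(offsets @ direction / np.maximum(dist, 1e-8), 0.0, 1.0)
    # Radial falloff: influence decays with distance from the attachment point.
    radial = np.clip(1.0 - dist / influence_radius, 0.0, 1.0) ** falloff
    weights = contraction * angular * radial
    return vertices - weights[:, None] * direction      # vertices pulled toward the attachment
```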
Because the self-attention mechanism adopted by the feed-forward Transformer of the end-to-end model understands the current word through its context, its semantic feature extraction capability is stronger. In practice, this means that for homophones or ambiguous words in a sentence, the algorithm can judge which one is intended from the surrounding words and the preceding and following sentences (for example, the Chinese homophones for "taking a bath" and "washing jujubes"), and so obtain a more accurate result. Second, the end-to-end model solves the problem that, in traditional speech recognition schemes, the individual sub-tasks are independent and cannot be jointly optimized; the framework of a single neural network is simpler, and accuracy increases as the model becomes deeper and the training data larger. Third, the end-to-end model adopts a new neural network structure that can better exploit and adapt to the parallel computing capability of new hardware (such as a GPU) and runs faster. This means that, with an algorithm model based on the new network structure, speech of the same duration can be transcribed in a shorter time, better meeting the requirement of real-time transcription.
After the data to be processed is obtained, it is processed with the end-to-end model to obtain an acoustic feature sequence, a facial feature sequence and a limb feature sequence, and these sequences are input into a trained muscle model that drives the virtual human. Because the end-to-end model takes the raw data to be processed as input and directly outputs the acoustic feature sequence, the facial feature sequence and the limb feature sequence, it can better exploit and adapt to the parallel computing capability of new hardware (such as a GPU) and runs faster; that is, the acoustic feature sequence, the facial feature sequence and the limb feature sequence can be obtained in a shorter time. These sequences are then input into the muscle model to drive the virtual human directly: after the virtual human is created, the acoustic feature sequence directly controls its voice output while the facial feature sequence and the limb feature sequence control its facial expressions and limb movements.
Moreover, the duration feature is used when the end-to-end model produces the acoustic feature sequence, the facial feature sequence and the limb feature sequence, and the duration feature improves the synchronization between the acoustic feature sequence and the facial and limb feature sequences. On this basis of improved synchronization, when these sequences are used to drive the virtual human, its voice output can be matched with its facial expressions and limb movements with higher accuracy.
Method embodiment one
Referring to fig. 2, a flowchart of the steps of a first embodiment of a method for driving a virtual human in real time according to the present invention is shown, which may specifically include the following steps:
s201, acquiring data to be processed for driving a virtual human, wherein the data to be processed comprises at least one of text data and voice data;
s202, processing the data to be processed by using an end-to-end model, and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
s203, inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, and driving a virtual human through the muscle model;
wherein step S202 includes:
step S2021, acquiring text features and duration features of the data to be processed;
step S2022, determining the acoustic feature sequence according to the text feature and the duration feature;
step S2023, determining the facial feature sequence and the limb feature sequence according to the text feature and the duration feature.
In step S201, for a client, the data to be processed uploaded by a user may be received; for a server, the data to be processed sent by the client may be received. It is to be understood that any first device may receive the data to be processed from a second device; the specific transmission manner of the data to be processed is not limited in the embodiments of the present invention.
If the data to be processed is text data, it is processed directly in step S202; if the data to be processed is voice data, it is first converted into text data, and the converted text data is then processed in step S202.
In step S202, the end-to-end model needs to be trained first. The end-to-end model involves two training methods: one trains the end-to-end model to output the acoustic feature sequence, and the other trains it to output the facial feature sequence and the limb feature sequence. The end-to-end model may specifically be a FastSpeech model.
The end-to-end model that outputs the acoustic feature sequence is trained as the first end-to-end model; its training process is described in steps A1-A4 above. The end-to-end model that outputs the facial feature sequence and the limb feature sequence is trained as the second end-to-end model; its training process is described in steps B1-B4 above.
In step S2021, the text feature may be acquired through a FastSpeech model, and the duration feature may be acquired through a duration model, where the duration model is a deep learning model.
If the end-to-end model is a FastSpeech model, after the first FastSpeech model and the second FastSpeech model are obtained through training, either FastSpeech model can be used to obtain the text feature of the data to be processed; the duration feature is obtained with a duration model, which may be a deep learning model such as a CNN (convolutional neural network) or a DNN (deep neural network).
In step S2022, if the FastSpeech model trained to output the acoustic feature sequence is the first FastSpeech model and the FastSpeech model trained to output the facial feature sequence and the limb feature sequence is the second FastSpeech model, the text feature and the duration feature may be input into the first FastSpeech model to obtain the acoustic feature sequence; and in step S2023, the text feature and the duration feature are input into the second FastSpeech model to obtain the facial feature sequence and the limb feature sequence.
Specifically, as shown in fig. 3, taking the first FastSpeech model acquiring the acoustic feature sequence as an example, the steps are: the text features 301 of the data to be processed are acquired through the embedding layer of the first FastSpeech model, and the text features 301 are encoded by a feed-forward Transformer 302 to obtain text coding features 303; the text coding features 303 are then processed by the duration model 304 to obtain duration features 305, where the duration features 305 can be used to characterize the duration of each phoneme in the text coding features 303; the text coding features 303 are aligned by means of the duration features 305 to obtain aligned text coding features 306; and the aligned text coding features 306 are decoded (307) and predicted to obtain the acoustic feature sequence.
The text coding features 303 are at a phoneme level, and the aligned text coding features 306 may be at a frame level or at a phoneme level.
Correspondingly, in the process of acquiring the facial feature sequence and the limb feature sequence with the second FastSpeech model, the text features of the data to be processed can be acquired through the embedding layer of the second FastSpeech model; the text features are encoded by a feed-forward Transformer to obtain text coding features; the text coding features are processed by the duration model to obtain duration features, and the text coding features are aligned by means of the duration features to obtain aligned text coding features; the aligned text coding features are then decoded, and face prediction and limb prediction are performed to obtain the facial feature sequence and the limb feature sequence.
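Under the same illustrative assumptions as the earlier sketch, the second FastSpeech-style model could decode the duration-aligned text coding features with two prediction heads, one for facial features and one for limb features. The class name, dimensions and the single-batch repeat_interleave alignment below are assumptions, not the implementation claimed in this specification.

```python
import torch
import torch.nn as nn

class FaceAndLimbPredictor(nn.Module):
    """Decodes duration-aligned text coding features into frame-level face and limb features."""
    def __init__(self, dim=256, face_dim=64, limb_dim=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=2,
                                           dim_feedforward=1024, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.face_head = nn.Linear(dim, face_dim)   # expression + lip features per frame
        self.limb_head = nn.Linear(dim, limb_dim)   # limb/body features per frame

    def forward(self, text_coding, durations):
        # Align phoneme-level text coding features to frame level using the duration feature.
        aligned = torch.repeat_interleave(text_coding, durations, dim=1)  # batch size 1 assumed
        decoded = self.decoder(aligned)
        return self.face_head(decoded), self.limb_head(decoded)
```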
Step S203 is executed next, and the acoustic feature sequence, the facial feature sequence, and the limb feature sequence are fused to obtain a fused feature sequence. Specifically, the acoustic feature sequence, the facial feature sequence, and the limb feature sequence may be fused according to the duration feature to obtain a fused feature sequence; after the fusion characteristic sequence is obtained, inputting the fusion characteristic sequence into a trained muscle model, and driving a virtual human through the muscle model.
Specifically, according to the duration features, the acoustic feature sequence, the facial feature sequence and the limb feature sequence are aligned to obtain the fusion feature sequence, the fusion feature sequence is input into a trained muscle model, and a virtual human is driven through the muscle model.
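A minimal sketch of this fusion step is given below. It assumes that any phoneme-level sequence is first expanded to frame level using the shared duration feature and that the three aligned sequences are then concatenated per frame; the per-frame concatenation is an assumption, since the specification only states that the sequences are fused after alignment.

```python
import numpy as np

def expand_by_duration(phoneme_feats, durations):
    """Repeat each phoneme-level feature vector for its duration in frames."""
    return np.repeat(phoneme_feats, durations, axis=0)             # (total_frames, dim)

def fuse_sequences(acoustic, facial, limb, durations):
    """Align the acoustic, facial and limb sequences on the shared durations and fuse them."""
    frames = [expand_by_duration(seq, durations) if len(seq) == len(durations) else seq
              for seq in (acoustic, facial, limb)]
    n_frames = min(len(f) for f in frames)                         # guard against rounding mismatch
    return np.concatenate([f[:n_frames] for f in frames], axis=1)  # (n_frames, A + F + L)
```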
Specifically, the muscle model training process is described in steps C1-C3. After the fusion feature sequence is obtained, the corresponding bound muscles in the muscle model are driven directly by the fusion feature sequence; when the bound muscles are driven by the fusion feature sequence to perform the corresponding movements, the facial expressions and limb actions of the virtual human change accordingly with the movements of the bound muscles.
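As a rough, hedged illustration of this driving loop, the sketch below feeds the fused sequence to the muscle model frame by frame; the muscle model call and the rendering interface are placeholders invented for this sketch, not APIs described in this specification.

```python
def drive_avatar(muscle_model, fused_sequence, renderer, fps=25):
    """Feed each fused frame to the muscle model and render the resulting pose in real time."""
    for frame in fused_sequence:                 # one (acoustic + facial + limb) vector per frame
        activations = muscle_model(frame)        # per-muscle activation values for the bound muscles
        renderer.apply_muscles(activations)      # placeholder: deform the face and body meshes
        renderer.present(1.0 / fps)              # placeholder: display the frame at the target rate
```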
For example, when the acoustic feature sequence says "goodbye", the facial expression is a smile and the limb action is a wave. According to the duration feature, the time span of "goodbye" can be aligned with the smiling facial feature sequence and the waving limb feature sequence to obtain an aligned feature sequence, which is the fusion feature sequence. The fusion feature sequence is then input into the muscle model, which controls the virtual human to smile and wave while saying "goodbye", so that the virtual human's voice matches its face and movement.
For another example, when the acoustic feature sequence says "come here", the facial expression is a smile and the limb action is a wave. According to the duration feature, the time span of "come here" can be aligned with the smiling facial feature sequence and the waving limb feature sequence to obtain an aligned feature sequence, which is the fusion feature sequence. The fusion feature sequence is then input into the muscle model, which controls the virtual human to smile and wave while saying "come here", so that the virtual human's voice matches its face and movement.
After the data to be processed is obtained, it is processed with the end-to-end model to obtain an acoustic feature sequence, a facial feature sequence and a limb feature sequence, and these sequences are input into a trained muscle model that drives the virtual human. Because the end-to-end model takes the raw data to be processed as input and directly outputs the acoustic feature sequence, the facial feature sequence and the limb feature sequence, it can better exploit and adapt to the parallel computing capability of new hardware (such as a GPU) and runs faster; that is, the acoustic feature sequence, the facial feature sequence and the limb feature sequence can be obtained in a shorter time. These sequences are then input into the muscle model to drive the virtual human directly: after the virtual human is created, the acoustic feature sequence directly controls its voice output while the facial feature sequence and the limb feature sequence control its facial expressions and limb movements.
Moreover, the duration feature is used when the end-to-end model produces the acoustic feature sequence, the facial feature sequence and the limb feature sequence, and the duration feature improves the synchronization between the acoustic feature sequence and the facial and limb feature sequences. On this basis of improved synchronization, when these sequences are used to drive the virtual human, its voice output can be matched with its facial expressions and limb movements with higher accuracy.
Method embodiment two
Referring to fig. 4, a flowchart of the steps of a second embodiment of a method for driving a virtual human in real time according to the present invention is shown, which may specifically include the following steps:
s401, acquiring data to be processed for driving the virtual human, wherein the data to be processed comprises at least one of text data and voice data;
s402, processing the data to be processed by using an end-to-end model, and determining fusion feature data corresponding to the data to be processed, wherein the fusion feature sequence is obtained by fusing an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
s403, inputting the fusion sequence into a trained muscle model, and driving a virtual human through the muscle model;
wherein, step S402 includes:
s4021, acquiring text characteristics and duration characteristics of the data to be processed;
step S4022, determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature;
and S4023, obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence.
In step S401, for a client, the data to be processed uploaded by a user may be received; for a server, the data to be processed sent by the client may be received. It is to be understood that any first device may receive the data to be processed from a second device; the specific transmission manner of the data to be processed is not limited in the embodiments of the present invention.
If the data to be processed is text data, it is processed directly in step S402; if the data to be processed is voice data, it is first converted into text data, and the converted text data is then processed in step S402.
In step S402, an end-to-end model needs to be trained first, so that the trained end-to-end model outputs a fusion feature sequence, and at this time, the end-to-end model outputting the fusion feature sequence may be used as a third end-to-end model.
When training the third end-to-end model, the training samples may be real-person video data and real-person motion data. For each training sample in the training sample set, the training steps are as follows. First, step D1 is executed to obtain the facial features, limb features and text features of the training sample, where the text features may be at the phoneme level; specifically, the feature data of the training sample can be mapped through an embedding layer in the end-to-end model to obtain the facial features, limb features and text features. Then step D2 is executed: the facial features, limb features and text features are processed by a feed-forward Transformer to obtain facial feature vectors, limb feature vectors and text coding features, where the facial feature vectors represent facial expressions, the limb feature vectors may be muscle action vectors, and the text coding features are likewise at the phoneme level. Next, step D3 is executed to align the facial feature vectors and the limb feature vectors with the text coding features; a duration predictor may be used for this alignment, where the text coding features are specifically phoneme features. Step D4 is then executed to obtain the acoustic feature sequence, the facial feature sequence and the limb feature sequence. Finally, step D5 is executed to fuse the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a fusion feature sequence; at this point, the speech, facial expressions and movements can be aligned by lengthening or shortening the phoneme durations with a length regulator to obtain the fusion feature sequence.
The text features in the embodiment of the invention may include phoneme features and/or semantic features, etc. A phoneme is the smallest unit of speech divided according to the natural attributes of speech; it is analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. Phonemes may include vowels and consonants. Optionally, a specific phoneme feature corresponds to a specific lip feature, expression feature or limb feature, etc.
Semantics refers to the meanings of the concepts represented by the real-world objects that the text to be processed corresponds to, together with the relationships between those meanings; it is the interpretation and logical representation of the text to be processed in a certain domain. Optionally, specific semantic features correspond to specific limb features, etc.
When training the third end-to-end model, the training sample set includes real-person motion data or real-person video data; the training process is analogous to that of the end-to-end model that outputs the acoustic feature sequence and is not repeated here for brevity.
Thus, after the data to be processed is obtained, the text feature of the data to be processed can be obtained by using the embedding layer of the third end-to-end model, the duration feature of the data to be processed is obtained, and the text feature and the duration feature are input into the third end-to-end model to obtain the acoustic feature sequence, the facial feature sequence and the limb feature sequence; the acoustic feature sequence, the facial feature sequence and the limb feature sequence are then fused according to the duration feature to obtain the fusion feature sequence.
In step S4021, if the end-to-end model is a FastSpeech model, the text feature may be obtained through the third FastSpeech model, and the duration feature may be obtained through a duration model, where the duration model is a deep learning model.
In step S4022, if the FastSpeech model trained to output the acoustic feature sequence, the facial feature sequence and the limb feature sequence is the third FastSpeech model, the text feature and the duration feature may be input into the third FastSpeech model to determine the acoustic feature sequence, the facial feature sequence and the limb feature sequence; and in step S4023, the fusion feature sequence is obtained according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence.
Specifically, the acoustic feature sequence, the facial feature sequence, and the limb feature sequence may be aligned according to the duration feature, so as to obtain the fused feature sequence.
If the end-to-end model is a FastSpeech model, after the third FastSpeech model is obtained through training, the third FastSpeech model is used to obtain the text features of the data to be processed; the duration features are obtained with a duration model, which may be a deep learning model such as a CNN (convolutional neural network) or a DNN (deep neural network).
Correspondingly, in the process of acquiring the fusion feature sequence with the third FastSpeech model, the text features of the data to be processed can be acquired through the embedding layer of the third FastSpeech model; the text features are encoded by a feed-forward Transformer to obtain text coding features; the text coding features are processed by the duration model to obtain duration features, and the text coding features are aligned by means of the duration features to obtain aligned text coding features; after the aligned text coding features are decoded, acoustic prediction, face prediction and limb prediction are performed to obtain the acoustic feature sequence, the facial feature sequence and the limb feature sequence; and the acoustic feature sequence, the facial feature sequence and the limb feature sequence are aligned according to the duration features, with the aligned sequences taken together as the fusion feature sequence.
Step S403 is executed next: after the fused feature sequence is acquired, it is input into a trained muscle model, and the virtual human is driven by the muscle model.
Specifically, the training process of the muscle model refers to the description of steps C1-C3. After the fused feature sequence is obtained, the corresponding bound muscles in the muscle model are driven directly by the fused feature sequence; as the bound muscles are driven to perform the corresponding movements, the facial expression and the body actions change accordingly with the movements of the bound muscles, as sketched below.
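As a minimal sketch only, and under assumed per-frame feature widths and an assumed 30 fps playback rate, the loop below shows how a fused feature sequence could be consumed frame by frame to drive bound muscles. The MuscleModelSketch stand-in merely reports the control values it would apply; the real driver is the trained muscle model of steps C1-C3.

```python
# Minimal sketch: slice each fused frame into acoustic/facial/limb segments and
# hand the facial and limb controls to a (stand-in) muscle model at a fixed rate.
import time
import numpy as np

D_ACOUSTIC, D_FACIAL, D_LIMB = 80, 52, 24            # assumed per-frame feature widths

class MuscleModelSketch:
    """Stand-in for the trained muscle model; a real model would move the bound muscles."""
    def drive(self, facial_weights, limb_params):
        print(f"facial controls: {len(facial_weights)}, limb controls: {len(limb_params)}")

def drive_virtual_human(fused_sequence, muscle_model, fps=30):
    for frame in fused_sequence:                      # one fused vector per frame
        acoustic = frame[:D_ACOUSTIC]                 # would be routed to speech playback
        facial = frame[D_ACOUSTIC:D_ACOUSTIC + D_FACIAL]
        limb = frame[D_ACOUSTIC + D_FACIAL:]
        muscle_model.drive(facial, limb)              # face and limbs move with the audio
        time.sleep(1.0 / fps)                         # hold each frame so sound and motion stay in step

fused_sequence = np.zeros((3, D_ACOUSTIC + D_FACIAL + D_LIMB))   # three dummy fused frames
drive_virtual_human(fused_sequence, MuscleModelSketch())
```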
For example, when the acoustic feature sequence says "goodbye", the facial expression is a smile and the limb movement is a wave; because the fused feature sequence is aligned according to the duration features, the virtual human smiles and waves while saying "goodbye", so that the virtual human's voice matches its face and movements.
For another example, when the acoustic feature sequence says "someone is injured", the facial expression is sad and the limb movement is bringing the two hands together; because the fused feature sequence is aligned according to the duration features, the virtual human appears sad and brings its hands together while saying "someone is injured", so that the virtual human's voice matches its face and movements.
After the data to be processed is obtained, it is processed with an end-to-end model to obtain a fused feature sequence in which the acoustic feature sequence, the facial feature sequence and the limb feature sequence are fused; the fused feature sequence is input into a trained muscle model, and the virtual human is driven by the muscle model. Because the end-to-end model takes the raw data to be processed as input and directly outputs the fused feature sequence, it can better utilize and adapt to the parallel computing capability of new hardware (such as a GPU) and computes faster; that is, the fused feature sequence can be obtained in a shorter time. The fused feature sequence is then input into the muscle model to drive the virtual human directly, so that once the virtual human has been created, the fused feature sequence directly controls its voice output while also controlling its facial expressions and limb movements.
Moreover, when the fused feature sequence is obtained with the end-to-end model, the duration features are used to fuse the acoustic feature sequence, the facial feature sequence and the limb feature sequence. The duration features improve the synchronization between the acoustic feature sequence and the facial and limb feature sequences, so that when the fused feature sequence drives the virtual human, its voice output matches its facial expressions and limb features more accurately.
Device embodiment I
Referring to fig. 5, a block diagram of an embodiment of a device for driving a virtual human in real time according to the present invention is shown, and specifically, the block diagram may include:
a data obtaining module 501, configured to obtain data to be processed for driving a virtual human, where the data to be processed includes at least one of text data and voice data;
a data processing module 502, configured to process the to-be-processed data by using an end-to-end model, and determine an acoustic feature sequence, a facial feature sequence, and a limb feature sequence corresponding to the to-be-processed data;
a virtual human driving module 503, configured to input the acoustic feature sequence, the facial feature sequence, and the limb feature sequence into a trained muscle model, and drive a virtual human through the muscle model;
the data processing module 502 is specifically configured to obtain a text feature and a duration feature of the data to be processed; determining the acoustic feature sequence according to the text feature and the duration feature; and determining the facial feature sequence and the limb feature sequence according to the text features and the duration features.
In an alternative embodiment, the data processing module 502 is configured to obtain the text features through a fastspeech model, and to obtain the duration features through a duration model, where the duration model is a deep learning model.
In an optional implementation, if the fastspeech model trained to output the acoustic feature sequence is a first fastspeech model and the fastspeech model trained to output the facial feature sequence and the limb feature sequence is a second fastspeech model, the data processing module 502 is configured to input the text features and the duration features into the first fastspeech model to obtain the acoustic feature sequence, and to input the text features and the duration features into the second fastspeech model to obtain the facial feature sequence and the limb feature sequence.
In an optional embodiment, the virtual human driving module 503 is configured to fuse the acoustic feature sequence, the facial feature sequence and the limb feature sequence to obtain a fused feature sequence, and to input the fused feature sequence into the muscle model.
In an optional implementation, the virtual human driving module 503 is configured to fuse the acoustic feature sequence, the facial feature sequence and the limb feature sequence based on the duration features to obtain the fused feature sequence.
In an alternative embodiment, the facial features corresponding to the facial feature sequence include expression features and lip features.
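Purely as an illustration of the data flow between modules 501, 502 and 503, the sketch below wires three Python classes together. The stub end-to-end model, the three-frames-per-character duration stub and all class and method names are assumptions for the example and do not reflect the actual implementation of the apparatus.

```python
# Hedged sketch of the module wiring of Fig. 5: acquire -> process -> drive.
import numpy as np

class StubEndToEndModel:
    """Stands in for the trained end-to-end model used by module 502."""
    def predict_sequences(self, text, durations):
        frames = int(durations.sum())
        return (np.zeros((frames, 80)),   # acoustic feature sequence
                np.zeros((frames, 52)),   # facial feature sequence
                np.zeros((frames, 24)))   # limb feature sequence

class DataAcquisitionModule:              # module 501
    def acquire(self, source):
        return {"text": source.get("text"), "voice": source.get("voice")}

class DataProcessingModule:               # module 502
    def __init__(self, model):
        self.model = model
    def process(self, data):
        text = data["text"] or ""
        durations = np.full(len(text), 3) # stub duration model: three frames per character
        return self.model.predict_sequences(text, durations)

class VirtualHumanDrivingModule:          # module 503
    def drive(self, acoustic, facial, limb):
        print(f"driving the virtual human for {len(acoustic)} frames")

acquisition = DataAcquisitionModule()
processing = DataProcessingModule(StubEndToEndModel())
driving = VirtualHumanDrivingModule()
data = acquisition.acquire({"text": "goodbye", "voice": None})
driving.drive(*processing.process(data))  # prints: driving the virtual human for 21 frames
```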
Since the device embodiment is substantially similar to the method embodiment, it is described briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Device embodiment II
Referring to fig. 6, a block diagram of an embodiment of a device for driving a virtual human in real time according to the present invention is shown, and specifically, the block diagram may include:
the data acquisition module 601 is configured to acquire data to be processed for driving the virtual human, where the data to be processed includes at least one of text data and voice data;
a data processing module 602, configured to process the data to be processed by using an end-to-end model and determine a fused feature sequence corresponding to the data to be processed, where the fused feature sequence is obtained by fusing an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
a virtual human driving module 603, configured to input the fused feature sequence into a trained muscle model and drive a virtual human through the muscle model;
the data processing module 602 is configured to obtain a text feature and a duration feature of the data to be processed; determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature; and obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence.
In an optional implementation, the data processing module 602 is configured to obtain the text features through a fastspeech model, and to obtain the duration features through a duration model, where the duration model is a deep learning model.
In an optional implementation manner, the data processing module 602 is configured to align the acoustic feature sequence, the facial feature sequence, and the limb feature sequence according to the duration feature to obtain the fused feature sequence.
In an alternative embodiment, the facial features corresponding to the facial feature sequence include expression features and lip features.
Since the device embodiment is substantially similar to the method embodiment, it is described briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 7 is a block diagram illustrating the configuration of an apparatus 900 for driving a virtual human in real time, according to an exemplary embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 7, apparatus 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
The processing component 902 generally controls overall operation of the device 900, such as operations associated with display, incoming calls, data communications, camera operations, and recording operations. Processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 906 provides power to the various components of the device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 comprises a screen providing an output interface between the device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when apparatus 900 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessments of various aspects of the apparatus 900. For example, the sensor component 914 may detect the open/closed state of the apparatus 900 and the relative positioning of components, such as the display and keypad of the apparatus 900; it may also detect a change in the position of the apparatus 900 or of one of its components, the presence or absence of user contact with the apparatus 900, the orientation or acceleration/deceleration of the apparatus 900, and a change in the temperature of the apparatus 900. The sensor component 914 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communications between the apparatus 900 and other devices in a wired or wireless manner. The apparatus 900 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the apparatus 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 8 is a block diagram of a server in some embodiments of the invention. The server 1900 may vary widely by configuration or performance and may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer-readable storage medium is provided in which instructions, when executed by a processor of an apparatus (a device or a server), enable the apparatus to perform a method for driving a virtual human in real time, the method comprising: determining duration features corresponding to a text to be processed, the text to be processed involving at least two languages; determining a target voice sequence corresponding to the text to be processed according to the duration features; determining a target image sequence corresponding to the text to be processed according to the duration features, where the target image sequence is obtained according to text samples and the image samples corresponding to the text samples, and the languages corresponding to the text samples include all languages involved in the text to be processed; and fusing the target voice sequence and the target image sequence to obtain a corresponding target video.
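As a minimal sketch of the final fusion step of this method, and only under the assumptions of a 16 kHz target voice sequence, 25 image frames per second and an in-memory list standing in for the muxed target video, the pairing could look as follows; a real implementation would write an actual audio/video container.

```python
# Hedged sketch: pair the target voice sequence with the target image sequence on
# a shared timeline to form a "target video". Sample rate, frame rate and the
# in-memory container are illustrative assumptions.
import numpy as np

def fuse_voice_and_images(voice_samples, image_frames, sample_rate=16000, fps=25):
    samples_per_frame = sample_rate // fps
    target_video = []
    for i, frame in enumerate(image_frames):
        start = i * samples_per_frame
        audio_chunk = voice_samples[start:start + samples_per_frame]
        target_video.append({"image": frame, "audio": audio_chunk})  # one muxed unit
    return target_video

voice = np.zeros(16000)                              # one second of (silent) audio samples
images = [np.zeros((64, 64, 3)) for _ in range(25)]  # 25 blank image frames
print(len(fuse_voice_and_images(voice, images)))     # 25 paired frame/audio units
```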
The description has been presented with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present specification have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all changes and modifications that fall within the scope of the specification.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present specification without departing from the spirit and scope of the specification. Thus, if such modifications and variations of the present specification fall within the scope of the claims of the present specification and their equivalents, the specification is intended to include such modifications and variations.

Claims (10)

1. A method for driving a virtual human in real time is characterized by comprising the following steps:
acquiring data to be processed for driving a virtual human, wherein the data to be processed comprises at least one of text data and voice data;
processing the data to be processed by using an end-to-end model, and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, and driving a virtual human through the muscle model;
wherein the processing the data to be processed using the end-to-end model comprises:
acquiring text features and duration features of the data to be processed;
determining the acoustic feature sequence according to the text feature and the duration feature;
and determining the facial feature sequence and the limb feature sequence according to the text features and the duration features.
2. The method of claim 1, wherein the obtaining text features and duration features of the data to be processed comprises:
acquiring the text features through a fastspeech model;
and acquiring the duration features through a duration model, wherein the duration model is a deep learning model.
3. The method of claim 2, wherein if the fastspeech model trained to output the acoustic feature sequence is a first fastspeech model and the fastspeech model trained to output the facial feature sequence and the limb feature sequence is a second fastspeech model, the determining the acoustic feature sequence according to the text features and the duration features comprises:
inputting the text features and the duration features into the first fastspeech model to obtain the acoustic feature sequence;
and the determining the facial feature sequence and the limb feature sequence according to the text features and the duration features comprises:
inputting the text features and the duration features into the second fastspeech model to obtain the facial feature sequence and the limb feature sequence.
4. The method of claim 1, wherein the inputting the acoustic, facial, and limb feature sequences into a trained muscle model comprises:
fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence to obtain a fused feature sequence;
inputting the fused feature sequence into the muscle model.
5. The method of claim 4, wherein said fusing the acoustic feature sequence, the facial feature sequence, and the limb feature sequence to obtain a fused feature sequence comprises:
and fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence based on the duration features to obtain a fused feature sequence.
6. A method for driving a virtual human in real time is characterized by comprising the following steps:
acquiring data to be processed for driving a virtual human, wherein the data to be processed comprises at least one of text data and voice data;
processing the data to be processed by using an end-to-end model, and determining a fusion feature sequence corresponding to the data to be processed, wherein the fusion feature sequence is obtained by fusing an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
inputting the fusion feature sequence into a trained muscle model, and driving a virtual human through the muscle model;
wherein the processing the data to be processed using the end-to-end model comprises:
acquiring text features and duration features of the data to be processed;
determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature;
and obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence.
7. An apparatus for driving a virtual human in real time, comprising:
the data acquisition module is used for acquiring data to be processed for driving the virtual human, wherein the data to be processed comprises at least one of text data and voice data;
the data processing module is used for processing the data to be processed by using an end-to-end model and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
the virtual human driving module is used for inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model and driving a virtual human through the muscle model;
the data processing module is specifically used for acquiring text features and duration features of the data to be processed; determining the acoustic feature sequence according to the text feature and the duration feature; and determining the facial feature sequence and the limb feature sequence according to the text features and the duration features.
8. An apparatus for driving a virtual human in real time, comprising:
the data acquisition module is used for acquiring data to be processed for driving the virtual human, wherein the data to be processed comprises at least one of text data and voice data;
the data processing module is used for processing the data to be processed by using an end-to-end model and determining a fusion feature sequence corresponding to the data to be processed, wherein the fusion feature sequence is obtained by fusing an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
the virtual human driving module is used for inputting the fusion feature sequence into a trained muscle model and driving a virtual human through the muscle model;
the data processing module is specifically used for acquiring text features and duration features of the data to be processed; determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature; and obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence.
9. An apparatus for driving a virtual human in real time, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, and comprise instructions for performing the steps of the method of any one of claims 1 to 5.
10. A machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the method for driving a virtual human in real time as recited in any one of claims 1 to 5.
CN202010420720.9A 2020-05-18 2020-05-18 Method, device, electronic equipment and medium for driving virtual person in real time Active CN113689880B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202010420720.9A CN113689880B (en) 2020-05-18 2020-05-18 Method, device, electronic equipment and medium for driving virtual person in real time
PCT/CN2021/078244 WO2021232877A1 (en) 2020-05-18 2021-02-26 Method and apparatus for driving virtual human in real time, and electronic device, and medium
US17/989,323 US20230082830A1 (en) 2020-05-18 2022-11-17 Method and apparatus for driving digital human, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010420720.9A CN113689880B (en) 2020-05-18 2020-05-18 Method, device, electronic equipment and medium for driving virtual person in real time

Publications (2)

Publication Number Publication Date
CN113689880A true CN113689880A (en) 2021-11-23
CN113689880B CN113689880B (en) 2024-05-28

Family

ID=78575574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010420720.9A Active CN113689880B (en) 2020-05-18 2020-05-18 Method, device, electronic equipment and medium for driving virtual person in real time

Country Status (2)

Country Link
CN (1) CN113689880B (en)
WO (1) WO2021232877A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111415677B (en) * 2020-03-16 2020-12-25 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100082345A1 (en) * 2008-09-26 2010-04-01 Microsoft Corporation Speech and text driven hmm-based body animation synthesis
US20120191460A1 (en) * 2011-01-26 2012-07-26 Honda Motor Co,, Ltd. Synchronized gesture and speech production for humanoid robots
CN104361620A (en) * 2014-11-27 2015-02-18 韩慧健 Mouth shape animation synthesis method based on comprehensive weighted algorithm
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
CN110162598A (en) * 2019-04-12 2019-08-23 北京搜狗科技发展有限公司 A kind of data processing method and device, a kind of device for data processing
CN110174942A (en) * 2019-04-30 2019-08-27 北京航空航天大学 Eye movement synthetic method and device
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN110689879A (en) * 2019-10-10 2020-01-14 中国科学院自动化研究所 Method, system and device for training end-to-end voice transcription model
CN110866968A (en) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 Method for generating virtual character video based on neural network and related equipment
CN110956948A (en) * 2020-01-03 2020-04-03 北京海天瑞声科技股份有限公司 End-to-end speech synthesis method, device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
REN, Yi, et al.: "FastSpeech: Fast, Robust and Controllable Text to Speech", 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), vol. 32 *
李冰锋; 谢磊; 周祥增; 付中华; 张艳宁: "Real-time speech-driven virtual speaker" (实时语音驱动的虚拟说话人), Journal of Tsinghua University (Science and Technology), No. 09 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115209180A (en) * 2022-06-02 2022-10-18 阿里巴巴(中国)有限公司 Video generation method and device
CN117152308A (en) * 2023-09-05 2023-12-01 南京八点八数字科技有限公司 Virtual person action expression optimization method and system
CN117152308B (en) * 2023-09-05 2024-03-22 江苏八点八智能科技有限公司 Virtual person action expression optimization method and system

Also Published As

Publication number Publication date
CN113689880B (en) 2024-05-28
WO2021232877A1 (en) 2021-11-25

Similar Documents

Publication Publication Date Title
US20200279553A1 (en) Linguistic style matching agent
WO2021169431A1 (en) Interaction method and apparatus, and electronic device and storage medium
TWI766499B (en) Method and apparatus for driving interactive object, device and storage medium
WO2021196645A1 (en) Method, apparatus and device for driving interactive object, and storage medium
WO2021232876A1 (en) Method and apparatus for driving virtual human in real time, and electronic device and medium
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
CN110162598B (en) Data processing method and device for data processing
WO2021196646A1 (en) Interactive object driving method and apparatus, device, and storage medium
CN113362812B (en) Voice recognition method and device and electronic equipment
WO2021196644A1 (en) Method, apparatus and device for driving interactive object, and storage medium
CN110210310A (en) A kind of method for processing video frequency, device and the device for video processing
CN110148406B (en) Data processing method and device for data processing
CN113362813B (en) Voice recognition method and device and electronic equipment
CN113689880B (en) Method, device, electronic equipment and medium for driving virtual person in real time
US20240022772A1 (en) Video processing method and apparatus, medium, and program product
CN112785667A (en) Video generation method, device, medium and electronic equipment
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN114155849A (en) Virtual object processing method, device and medium
CN113409765B (en) Speech synthesis method and device for speech synthesis
CN110166844B (en) Data processing method and device for data processing
CN112632262A (en) Conversation method, conversation device, computer equipment and storage medium
CN114093341A (en) Data processing method, apparatus and medium
CN114049873A (en) Voice cloning method, training method, device and medium
CN117351123A (en) Interactive digital portrait generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant