CN113689880B - Method, device, electronic equipment and medium for driving virtual person in real time - Google Patents
Method, device, electronic equipment and medium for driving virtual person in real time
- Publication number
- CN113689880B CN113689880B CN202010420720.9A CN202010420720A CN113689880B CN 113689880 B CN113689880 B CN 113689880B CN 202010420720 A CN202010420720 A CN 202010420720A CN 113689880 B CN113689880 B CN 113689880B
- Authority
- CN
- China
- Prior art keywords
- feature sequence
- sequence
- feature
- model
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Molecular Biology (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Acoustics & Sound (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Quality & Reliability (AREA)
- Processing Or Creating Images (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The embodiments of this specification disclose a method for driving a virtual person in real time. Data to be processed for driving the virtual person is acquired, the data to be processed comprising at least one of text data and voice data; the data to be processed is processed with an end-to-end model to determine an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to it; and the acoustic feature sequence, the facial feature sequence and the limb feature sequence are input into a trained muscle model, through which the virtual person is driven. In this way, the acoustic feature sequence, the facial feature sequence and the limb feature sequence can be acquired in a shorter time through the end-to-end model, and inputting the obtained sequences into the muscle model drives the virtual person directly, which greatly reduces the amount of computation and data transmission, improves computational efficiency, and greatly improves the real-time performance of driving the virtual person.
Description
Technical Field
The embodiments of this specification relate to the technical field of virtual human processing, and in particular to a method, a device, electronic equipment and a medium for driving a virtual person in real time.
Background
A digital human (Digital human), also known as a virtual human, hyper-realistic human or photo-level human, is a comprehensive rendering technique that uses a computer to simulate a real person. Because people are very familiar with real humans, a highly realistic static 3D model can be built given enough time; however, driving that model to act is much harder. Even a subtle expression may require the model to be re-posed, and because the model has very high fidelity, each such update involves a large amount of data and computation. The computation process is long, and a single action of the model often requires one or several hours of computation, so the real-time performance of driving is very poor.
Disclosure of Invention
The embodiments of this specification provide a method, a device, electronic equipment and a medium that can drive a virtual person in real time.
A first aspect of embodiments of the present disclosure provides a method for driving a virtual person in real time, including:
Acquiring data to be processed for driving a virtual person, wherein the data to be processed comprises at least one of text data and voice data;
Processing the data to be processed by using an end-to-end model, and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
Inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, and driving a virtual person through the muscle model;
The processing of the data to be processed by using the end-to-end model includes the following steps:
acquiring text characteristics and duration characteristics of the data to be processed;
determining the acoustic feature sequence according to the text feature and the duration feature;
And determining the facial feature sequence and the limb feature sequence according to the text features and the duration features.
Optionally, the acquiring the text feature and the duration feature of the data to be processed includes:
Acquiring the text features through a FastSpeech model;
and acquiring the duration features through a duration model, wherein the duration model is a deep learning model.
Optionally, where the FastSpeech model trained to output the acoustic feature sequence is a first FastSpeech model and the FastSpeech model trained to output the facial feature sequence and the limb feature sequence is a second FastSpeech model, the determining the acoustic feature sequence according to the text feature and the duration feature includes:
inputting the text feature and the duration feature into the first FastSpeech model to obtain the acoustic feature sequence;
and the determining the facial feature sequence and the limb feature sequence according to the text feature and the duration feature includes:
inputting the text feature and the duration feature into the second FastSpeech model to obtain the facial feature sequence and the limb feature sequence.
Optionally, the inputting the acoustic feature sequence, the facial feature sequence, and the limb feature sequence into a trained muscle model includes:
Fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence to obtain a fused feature sequence;
The fusion feature sequence is input into the muscle model.
Optionally, the fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence to obtain a fused feature sequence includes:
And based on the duration feature, fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence to obtain the fused feature sequence.
Optionally, the facial features corresponding to the facial feature sequence include an expression feature and a lip feature.
A second aspect of embodiments of the present specification provides a method for driving a virtual person in real time, including:
Acquiring data to be processed for driving a virtual person, wherein the data to be processed comprises at least one of text data and voice data;
processing the data to be processed by using an end-to-end model, and determining a fusion feature sequence corresponding to the data to be processed, wherein the fusion feature sequence is obtained by fusing an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
Inputting the fusion feature sequence into a trained muscle model, and driving a virtual person through the muscle model;
The processing of the data to be processed by using the end-to-end model includes the following steps:
acquiring text characteristics and duration characteristics of the data to be processed;
Determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature;
And obtaining the fusion characteristic sequence according to the acoustic characteristic sequence, the facial characteristic sequence and the limb characteristic sequence.
Optionally, the acquiring the text feature and the duration feature of the data to be processed includes:
Acquiring the text features through a FastSpeech model;
and acquiring the duration features through a duration model, wherein the duration model is a deep learning model.
Optionally, the obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence includes:
And based on the duration feature, fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence to obtain the fused feature sequence.
A third aspect of embodiments of the present disclosure provides an apparatus for driving a virtual person in real time, including:
The data acquisition module is used for acquiring data to be processed for driving the virtual person, wherein the data to be processed comprises at least one of text data and voice data;
the data processing module is used for processing the data to be processed by using an end-to-end model and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
A virtual person driving module for inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, and driving a virtual person through the muscle model;
The data processing module is specifically used for acquiring text characteristics and duration characteristics of the data to be processed; determining the acoustic feature sequence according to the text feature and the duration feature; and determining the facial feature sequence and the limb feature sequence according to the text features and the duration features.
A fourth aspect of embodiments of the present specification provides an apparatus for driving a virtual person in real time, including:
The data acquisition module is used for acquiring data to be processed for driving the virtual person, wherein the data to be processed comprises at least one of text data and voice data;
The data processing module is used for processing the data to be processed by using an end-to-end model and determining a fusion feature sequence corresponding to the data to be processed, wherein the fusion feature sequence is obtained by fusing an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
A virtual person driving module for inputting the fusion feature sequence into a trained muscle model, and driving a virtual person through the muscle model;
The data processing module is specifically used for acquiring text characteristics and duration characteristics of the data to be processed; determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature; and obtaining the fusion characteristic sequence according to the acoustic characteristic sequence, the facial characteristic sequence and the limb characteristic sequence.
A fifth aspect of the embodiments of the present specification provides an apparatus for driving a virtual person in real time, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the steps of the method for driving a virtual person described above.
A sixth aspect of the embodiments of the present specification provides a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform the steps of the above-described method for driving a virtual person in real time.
The beneficial effects of the embodiment of the specification are as follows:
Based on the technical solution above, after the data to be processed is acquired, it is processed with an end-to-end model to obtain an acoustic feature sequence, a facial feature sequence and a limb feature sequence; these sequences are input into a trained muscle model, through which the virtual person is driven. Because the end-to-end model takes the raw data to be processed as input and directly outputs the acoustic, facial and limb feature sequences, it can better exploit the parallel computing capability of new hardware (such as GPUs) and runs faster; that is, the acoustic feature sequence, the facial feature sequence and the limb feature sequence can be acquired in a shorter time. Inputting these sequences into the muscle model drives the virtual person directly, so that once the virtual person has been created, the acoustic feature sequence directly controls the virtual person's speech output while the facial feature sequence and the limb feature sequence control its facial expression and limb movements.
Drawings
FIG. 1 is a training flow diagram for training an end-to-end model of an output acoustic feature sequence in an embodiment of the present disclosure;
FIG. 2 is a first flowchart of a method for driving a virtual person in real time in an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating the steps by which a first FastSpeech model outputs an acoustic feature sequence in an embodiment of the present disclosure;
FIG. 4 is a second flowchart of a method for driving a virtual person in real time in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a first configuration of a device for driving a virtual person in real time according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram showing a second configuration of a device for driving a virtual person in real time according to an embodiment of the present invention;
fig. 7 is a block diagram showing the structure of an apparatus for driving a virtual person in real time as a device in the embodiment of the present specification;
fig. 8 is a block diagram of a server in some embodiments of the present disclosure.
Detailed Description
In order to better understand the technical solutions described above, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features of the embodiments of this specification are a detailed explanation of the technical solutions of this specification rather than a limitation of them, and that, where no conflict arises, the technical features of the embodiments may be combined with one another.
To address the technical problem that driving a virtual person consumes a great deal of time, the embodiments of the present invention provide a solution for driving a virtual person in real time, which specifically includes: acquiring data to be processed for driving the virtual person, wherein the data to be processed comprises at least one of text data and voice data; processing the data to be processed with an end-to-end model to determine an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed; and inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, through which the virtual person is driven.
The end-to-end model processes the data to be processed, and determines an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed, including: acquiring text characteristics and duration characteristics of the data to be processed; determining the acoustic feature sequence according to the text feature and the duration feature; and determining the facial feature sequence and the limb feature sequence according to the text features and the duration features.
The virtual person in the embodiments of the present invention may be a highly realistic simulated human that differs only slightly from a real person; it can be applied to content-expression scenarios such as news broadcasting, teaching, medical, customer service, legal and conference scenarios.
In the embodiments of the present invention, the data to be processed may be text data, voice data, or both text data and voice data; this specification does not specifically limit it.
For example, in a news broadcasting scenario, the news manuscript that the virtual person is to broadcast needs to be acquired; the news manuscript is then the data to be processed. The manuscript may be text edited by a person or by a machine, and after the editing is completed, the edited text is acquired as the news manuscript.
In the embodiment of the invention, before the end-to-end model is used for processing the data to be processed, the end-to-end model is trained through a sample, so that a trained end-to-end model is obtained; after obtaining the trained end-to-end model, the trained end-to-end model is used for processing the data to be processed.
The end-to-end model in the embodiments of the present invention has two training configurations: one trains the model to output an acoustic feature sequence, and the other trains the model to output a facial feature sequence and a limb feature sequence. The end-to-end model may specifically be a FastSpeech model.
When the end-to-end model that outputs the acoustic feature sequence is trained, the training samples may be paired text and voice data, or video data. For each training sample in the training sample set, the training steps are shown in fig. 1. First, step A1 is performed to obtain the acoustic feature 101 and the text feature 102 of the training sample, where the text feature 102 may be at the phoneme level; specifically, the feature data of the training sample may be mapped through an embedding layer in the end-to-end model to obtain the acoustic feature 101 and the text feature 102. Step A2 is then performed: the acoustic feature 101 and the text feature 102 are processed by the feed-forward Transformer 103 (Feed Forward Transformer) to obtain an acoustic vector 104 and a text encoding feature 105, where the acoustic vector 104 may be the acoustic vector of a sentence or of a word, and the text encoding feature 105 is also at the phoneme level. Step A3 is performed to align the acoustic vector 104 with the text encoding feature 105 to obtain an aligned text encoding feature 106; a duration predictor may be used for this alignment, where the text encoding feature 105 is specifically a phoneme feature and the acoustic vector 104 may be a Mel spectrogram, so the phoneme features and the Mel spectrogram can be aligned by the duration predictor. Next, step A4 is performed: the aligned text encoding feature 106 is decoded 107 to obtain the acoustic feature sequence 108. At this point a length regulator can easily control the speech rate by lengthening or shortening the phoneme durations, thereby determining the length of the generated Mel spectrogram, and part of the prosody can be controlled by adding pauses between adjacent phonemes; the acoustic feature sequence is then obtained from the determined length of the Mel spectrogram and the intervals between phonemes.
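For illustration, the following is a minimal sketch of the acoustic branch described in steps A1-A4, assuming a FastSpeech-style layout in PyTorch (phoneme embedding, feed-forward Transformer encoder, duration-based length regulator, decoder, Mel-spectrogram head); the module names, layer sizes and duration handling are assumptions for the sketch, not values taken from the patent.

```python
import torch
import torch.nn as nn


class LengthRegulator(nn.Module):
    """Expand each phoneme encoding by its predicted duration (in frames)."""

    def forward(self, phoneme_enc, durations):
        # phoneme_enc: (num_phonemes, hidden); durations: (num_phonemes,) frame counts
        expanded = [phoneme_enc[i].repeat(int(d), 1) for i, d in enumerate(durations)]
        return torch.cat(expanded, dim=0)  # (num_frames, hidden)


class AcousticFastSpeechSketch(nn.Module):
    def __init__(self, num_phonemes=80, hidden=256, n_mels=80, n_layers=4, n_heads=2):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, hidden)            # step A1
        layer = nn.TransformerEncoderLayer(hidden, n_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)          # step A2
        self.duration_predictor = nn.Sequential(                       # step A3
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.length_regulator = LengthRegulator()
        self.decoder = nn.TransformerEncoder(layer, n_layers)          # step A4
        self.mel_head = nn.Linear(hidden, n_mels)

    def forward(self, phoneme_ids):
        # phoneme_ids: (num_phonemes,) for one utterance
        enc = self.encoder(self.embedding(phoneme_ids).unsqueeze(0)).squeeze(0)
        durations = self.duration_predictor(enc).squeeze(-1).exp().round().clamp(min=1)
        frames = self.length_regulator(enc, durations)      # phoneme -> frame level
        dec = self.decoder(frames.unsqueeze(0)).squeeze(0)
        return self.mel_head(dec)                           # (num_frames, n_mels)


mel = AcousticFastSpeechSketch()(torch.randint(0, 80, (12,)))  # 12 phonemes -> Mel frames
```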
When the end-to-end model that outputs the acoustic feature sequence is trained, its training sample set may contain, for example, 13,100 audio clips and the corresponding text transcriptions, with a total audio length of about 24 hours. The training sample set is randomly divided into 3 groups: 12,500 samples for training, 300 samples for validation and 300 samples for testing. To alleviate pronunciation errors, a phoneme conversion tool is used to convert the text sequences into phoneme sequences, and for the voice data the original waveforms are converted into Mel spectrograms. The end-to-end model is trained with the 12,500 training samples; after training is completed, the trained model is verified with the 300 validation samples, and after the verification requirement is met, the model is tested with the 300 test samples. If the test condition is met, the trained end-to-end model is obtained.
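The dataset split described above can be illustrated with the following sketch; text_to_phonemes() and wav_to_mel() are placeholder stand-ins for the phoneme conversion tool and the waveform-to-Mel conversion mentioned in the text, not real library calls.

```python
import random


def text_to_phonemes(text):
    # Placeholder grapheme-to-phoneme step; a real system would call a G2P tool here.
    return list(text)


def wav_to_mel(audio_path):
    # Placeholder; a real system would load the waveform and compute its Mel spectrogram.
    return audio_path


def prepare_dataset(pairs, seed=0):
    """pairs: list of (audio_path, transcript) tuples, e.g. 13,100 clips."""
    samples = [{"phonemes": text_to_phonemes(t), "mel": wav_to_mel(a)} for a, t in pairs]
    random.Random(seed).shuffle(samples)
    # 12,500 for training, 300 for validation, 300 for testing.
    return samples[:12500], samples[12500:12800], samples[12800:13100]
```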
If the end-to-end model does not meet the verification requirement, it is trained again with the training samples until it does; the model that meets the verification requirement is then tested, and once the trained end-to-end model meets both the verification requirement and the test condition, it is taken as the final model, i.e. the trained end-to-end model.
When the end-to-end model that outputs the facial feature sequence and the limb feature sequence is trained, the training samples may be real-person video data and real-person motion data. For each training sample in the training sample set, the training steps are as follows. Step B1 is performed to obtain the facial features, limb features and text features of the training sample, where the text features may be at the phoneme level; specifically, the feature data of the training sample may be mapped through an embedding layer in the end-to-end model to obtain the facial features, limb features and text features. Step B2 is then performed: the facial features, limb features and text features are processed by a feed-forward Transformer (Feed Forward Transformer) to obtain facial feature vectors, limb feature vectors and text encoding features, where the facial feature vectors are feature representations of facial expressions, the limb feature vectors may be muscle action vectors, and the text encoding features are also at the phoneme level. Step B3 is performed to align the facial feature vectors and the limb feature vectors with the text encoding features; a duration predictor may be used to align them with the text encoding features, specifically the phoneme features. Next, step B4 is performed to obtain the facial feature sequence and the limb feature sequence; at this point a length regulator may be used to align facial expressions and actions by lengthening or shortening the phoneme durations, resulting in the facial feature sequence and the limb feature sequence.
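Under the assumption that this second model reuses the same encoder and length-regulator layout as the acoustic sketch above and only swaps the output head, the face and limb predictions can be sketched as two parallel heads over the frame-aligned text encodings; the feature dimensions below are illustrative.

```python
import torch
import torch.nn as nn


class FaceLimbHeadsSketch(nn.Module):
    def __init__(self, hidden=256, face_dim=52, limb_dim=64):
        super().__init__()
        # face_dim covers expression + lip features; limb_dim covers muscle action vectors.
        self.face_head = nn.Linear(hidden, face_dim)
        self.limb_head = nn.Linear(hidden, limb_dim)

    def forward(self, aligned_frames):
        # aligned_frames: (num_frames, hidden) text encodings already expanded to
        # frame level by the duration predictor / length regulator (steps B3-B4).
        return self.face_head(aligned_frames), self.limb_head(aligned_frames)


face_seq, limb_seq = FaceLimbHeadsSketch()(torch.randn(120, 256))  # 120 aligned frames
```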
The text features in the embodiments of the present invention may include phoneme features and/or semantic features, etc. A phoneme is the smallest phonetic unit obtained by dividing speech according to its natural attributes; analysed in terms of the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes include vowels and consonants. Optionally, specific phoneme features correspond to specific lip features, expression features, limb features, and the like.
Semantics refers to the meanings of the concepts represented by the real-world things that the text to be processed corresponds to, and the relationships among those meanings; it is the interpretation and logical representation of the text to be processed in a given domain. Optionally, a particular semantic feature corresponds to a particular limb feature, etc.
When the end-to-end model of the output facial feature sequence and the limb feature sequence is trained, the training sample set comprises real person action data or real person video data, and the training process refers to the training process of training the end-to-end model of the output acoustic feature sequence, so that the description is omitted for brevity.
After the end-to-end model that outputs the acoustic feature sequence and the end-to-end model that outputs the facial feature sequence and the limb feature sequence have been trained, the former is taken as the first end-to-end model and the latter as the second end-to-end model.
Thus, after the data to be processed is obtained, the text features of the data can be obtained with the embedding layer of the first end-to-end model, the duration features of the data are then obtained, and the text features and duration features are input into the first end-to-end model to obtain the acoustic feature sequence. Correspondingly, the text features of the data can be obtained with the embedding layer of the second end-to-end model, the duration features obtained, and both input into the second end-to-end model to obtain the facial feature sequence and the limb feature sequence; of course, the text features and duration features obtained in the previous step can also be input directly into the second end-to-end model to obtain the facial feature sequence and the limb feature sequence. In the embodiments of this specification, the first end-to-end model and the second end-to-end model may process the data simultaneously, or either one may process the data first; this specification does not specifically limit the order.
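A minimal orchestration sketch of this inference flow is given below, assuming the first and second models are callables that take the text features and duration features and return the corresponding sequences; because the two calls are independent, they can be submitted concurrently, as noted above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_end_to_end(text_features, duration_features, first_model, second_model):
    # first_model -> acoustic feature sequence; second_model -> (facial, limb) sequences.
    with ThreadPoolExecutor(max_workers=2) as pool:
        acoustic_future = pool.submit(first_model, text_features, duration_features)
        face_limb_future = pool.submit(second_model, text_features, duration_features)
    acoustic_seq = acoustic_future.result()
    face_seq, limb_seq = face_limb_future.result()
    return acoustic_seq, face_seq, limb_seq
```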
In the embodiments of the present invention, the duration feature can be used to characterize the durations of the phonemes corresponding to the text. Duration features describe the pauses and pace of the speaker, so they can improve the expressiveness and naturalness of the synthesized speech. Optionally, a duration model may be used to determine the duration feature corresponding to the data to be processed: the input of the duration model may be phoneme features with accent marks, and the output is the phoneme durations. The duration model may be obtained by learning from speech samples carrying duration information, and may be, for example, a deep learning model such as a convolutional neural network (Convolutional Neural Networks, hereinafter CNN) or a deep neural network (Deep Neural Networks, hereinafter DNN); the embodiments of the present invention do not limit this.
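As an illustration of such a duration model, the sketch below uses a small 1-D convolutional network that maps phoneme-level features to one predicted duration per phoneme; the layer sizes are assumptions, and in practice the model would be trained on speech samples carrying duration information.

```python
import torch
import torch.nn as nn


class DurationModelSketch(nn.Module):
    def __init__(self, in_dim=256, hidden=256, kernel=3):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2), nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, phoneme_features):
        # phoneme_features: (batch, num_phonemes, in_dim), e.g. accented phoneme features
        x = self.convs(phoneme_features.transpose(1, 2)).transpose(1, 2)
        return self.proj(x).squeeze(-1)  # (batch, num_phonemes) predicted durations


durations = DurationModelSketch()(torch.randn(1, 12, 256))
```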
After the acoustic feature sequence, the facial feature sequence and the limb feature sequence are acquired, they are input into a trained muscle model, and the virtual person is driven through the muscle model.
In the embodiments of the present invention, the facial features include expression features and lip features. Expression refers to the thoughts and emotions presented on the face, and expression features are typically directed at the whole face; lip features are directed specifically at the lips and are related to the text content, the speech, the pronunciation manner and so on. The facial features therefore help make facial expressions more lifelike and finer.
The limb features convey a person's thoughts through the coordinated activity of body parts such as the head, eyes, neck, hands, elbows, arms, torso, hips and feet, expressing them vividly. Limb features may include turning the head, shrugging, gestures and the like, which improve the richness of the expressions corresponding to the image sequence. For example, at least one arm hangs naturally while speaking, and at least one arm rests naturally on the abdomen while not speaking.
In the embodiments of the present invention, the muscle model also needs to be trained before it is used; after the trained muscle model is obtained, it is used to process the acoustic feature sequence, the facial feature sequence and the limb feature sequence.
When training the muscle model in the embodiments of the present invention, a muscle model is first created according to the facial muscles and limb muscles of a human, and training samples of the muscle model are obtained; the training samples may be real-person video data and real-person motion data. For each training sample in the training sample set, the training steps include:
First, step C1 is performed to obtain the facial muscle features and limb muscle features of each training sample; step C2 is then performed to train the muscle model with the facial muscle features and limb muscle features of each training sample; after training is completed, step C3 is performed to verify the trained muscle model with the validation samples; after the verification requirement is met, the trained muscle model is tested with the test samples, and if the test condition is met, the trained muscle model is obtained.
If the muscle model obtained through training does not meet the verification requirement, it is trained again with the training samples until it does; the muscle model that meets the verification requirement is then tested, and once the trained muscle model meets both the verification requirement and the test condition, it is taken as the final model, i.e. the trained muscle model.
When creating the muscle model, taking the facial muscles as an example, a polygonal mesh is used for approximate abstract muscle control with two types of muscles: linear muscles, which stretch, and sphincter muscles, which squeeze. Both types are attached to the mesh at only one point and have an assigned direction (when they deform, the angular displacement and radial displacement of a given point are computed), so the control of the muscles is independent of the specific facial topology and facial expressions can be made more vivid and finer. Likewise, the limb muscles also use a polygonal mesh for approximate abstract muscle control, ensuring more accurate limb movements.
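The two muscle types can be illustrated with the following sketch of per-vertex displacement, assuming a 2-D mesh, a single attachment point with an assigned direction, and a simple linear falloff; the falloff shape and constants are assumptions for illustration rather than the patent's formulation.

```python
import math


def linear_muscle_displacement(vertex, attach, direction, activation, radius=1.0):
    # A linear muscle stretches: vertices near the attachment point are pulled
    # along the muscle direction, with the pull fading out with distance.
    dx, dy = vertex[0] - attach[0], vertex[1] - attach[1]
    falloff = max(0.0, 1.0 - math.hypot(dx, dy) / radius)
    return (vertex[0] - activation * falloff * direction[0],
            vertex[1] - activation * falloff * direction[1])


def sphincter_muscle_displacement(vertex, centre, activation, radius=1.0):
    # A sphincter muscle squeezes: vertices inside its radius move radially
    # toward the centre (e.g. around the mouth).
    dx, dy = vertex[0] - centre[0], vertex[1] - centre[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0 or dist > radius:
        return vertex
    scale = 1.0 - activation * (1.0 - dist / radius)
    return (centre[0] + dx * scale, centre[1] + dy * scale)


# Contracting a linear muscle pulls a nearby vertex against its direction vector.
print(linear_muscle_displacement((0.3, 0.2), (0.0, 0.0), (1.0, 0.0), activation=0.5))
```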
Because the self-attention mechanism adopted by the feed-forward Transformer of the end-to-end model is an innovative way of understanding the current word through its context, its semantic feature extraction capability is stronger. In practical applications this means that for homophones or ambiguous words in a sentence, the new algorithm can determine which one is meant from the surrounding words and the preceding and following sentences (for example, the Chinese homophones for "taking a bath" and "washing jujubes"), and therefore obtains a more accurate result. Second, the end-to-end model avoids the problem in traditional speech recognition schemes that each sub-task is independent and cannot be jointly optimized; the framework of a single neural network is simpler, and accuracy increases as the model becomes deeper and the training data larger. Third, the end-to-end model adopts a new neural network structure that can better exploit the parallel computing capability of new hardware (such as GPUs) and runs faster. This means that for speech of the same duration, an algorithm model based on the new network structure can complete the transcription in a shorter time and can meet the requirement of real-time transcription.
After the data to be processed is acquired, it is processed with the end-to-end model to obtain an acoustic feature sequence, a facial feature sequence and a limb feature sequence; these sequences are input into a trained muscle model, through which the virtual person is driven. Because the end-to-end model takes the raw data to be processed as input and directly outputs the acoustic, facial and limb feature sequences, it can better exploit the parallel computing capability of new hardware (such as GPUs) and runs faster; that is, the acoustic feature sequence, the facial feature sequence and the limb feature sequence can be acquired in a shorter time. Inputting these sequences into the muscle model drives the virtual person directly, so that once the virtual person has been created, the acoustic feature sequence directly controls the virtual person's speech output while the facial feature sequence and the limb feature sequence control its facial expression and limb movements.
Moreover, because the duration feature is used when the end-to-end model obtains the acoustic feature sequence, the facial feature sequence and the limb feature sequence, and the duration feature improves the synchronization between the acoustic feature sequence and the facial and limb feature sequences, the virtual person's voice output matches its facial expression and limb movements more accurately when it is driven with these sequences, on the basis of that improved synchronization.
Method embodiment one
Referring to fig. 2, which shows a flowchart of the steps of a first embodiment of a method for driving a virtual person in real time according to the present invention, the method may specifically include the following steps:
S201, obtaining data to be processed for driving a virtual person, wherein the data to be processed comprises at least one of text data and voice data;
s202, processing the data to be processed by using an end-to-end model, and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
S203, inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, and driving a virtual person through the muscle model;
Wherein, step S202 includes:
step S2021, acquiring text characteristics and duration characteristics of the data to be processed;
Step S2022, determining the acoustic feature sequence according to the text feature and the duration feature;
Step S2023, determining the facial feature sequence and the limb feature sequence according to the text feature and the duration feature.
In step S201, for the client, the data to be processed uploaded by the user may be received; for the server, the data to be processed sent by the client may be received. It can be understood that any first device may receive the data to be processed from a second device; the embodiments of the present invention do not limit the specific transmission manner of the data to be processed.
If the data to be processed is text data, it is processed directly in step S202; if the data to be processed is voice data, it is first converted into text data, and the converted text data is then processed in step S202.
In step S202, an end-to-end model first needs to be trained. There are two training configurations: one trains the model to output the acoustic feature sequence, and the other trains the model to output the facial feature sequence and the limb feature sequence; the end-to-end model may specifically be a FastSpeech model.
The end-to-end model that outputs the acoustic feature sequence is trained as the first end-to-end model; for the training process, refer to the description of steps A1-A4. The end-to-end model that outputs the facial feature sequence and the limb feature sequence is trained as the second end-to-end model; for the training process, refer to steps B1-B4.
In step S2021, the text features may be obtained through a FastSpeech model, and the duration features may be obtained through a duration model, where the duration model is a deep learning model.
If the end-to-end model is a FastSpeech model, a first FastSpeech model and a second FastSpeech model are obtained by training; the text features of the data to be processed can then be obtained with either FastSpeech model, and the duration features can be obtained with a duration model, which may be a deep learning model such as a CNN or DNN.
In step S2022, if the FastSpeech model trained to output the acoustic feature sequence is the first FastSpeech model and the FastSpeech model trained to output the facial feature sequence and the limb feature sequence is the second FastSpeech model, the text features and the duration features may be input into the first FastSpeech model to obtain the acoustic feature sequence; and in step S2023, the text features and the duration features are input into the second FastSpeech model to obtain the facial feature sequence and the limb feature sequence.
Specifically, as shown in fig. 3, taking the first FastSpeech model obtaining the acoustic feature sequence as an example, the steps include: obtaining the text feature 301 of the data to be processed through the embedding layer of the first FastSpeech model, and encoding the text feature 301 through the feed-forward Transformer 302 to obtain the text encoding feature 303; processing the text encoding feature 303 through the duration model 304 to obtain the duration feature 305, where the duration feature 305 can be used to characterize the duration of each phoneme in the text encoding feature 303; then aligning the text encoding feature 303 according to the duration feature 305 to obtain the aligned text encoding feature 306; and decoding 307 the aligned text encoding feature 306 and performing prediction to obtain the acoustic feature sequence 308.
The text encoding feature 303 is at the phoneme level, and the aligned text encoding feature 306 may be at the frame level or the phoneme level.
Correspondingly, in the process of obtaining the facial feature sequence and the limb feature sequence with the second FastSpeech model, the text features of the data to be processed can be obtained through the embedding layer of the second FastSpeech model; the text features are encoded by the feed-forward Transformer to obtain the text encoding features; the text encoding features are processed by the duration model to obtain the duration features, which align the text encoding features to obtain the aligned text encoding features; and the aligned text encoding features are decoded and then facial prediction and limb prediction are performed to obtain the facial feature sequence and the limb feature sequence.
Step S203 is executed, where the acoustic feature sequence, the facial feature sequence, and the limb feature sequence are fused to obtain a fused feature sequence. Specifically, the acoustic feature sequence, the facial feature sequence and the limb feature sequence may be fused according to the duration feature to obtain the fused feature sequence; after the fusion feature sequence is acquired, the fusion feature sequence is input into a trained muscle model, and a virtual person is driven through the muscle model.
Specifically, according to the duration feature, the acoustic feature sequence, the facial feature sequence and the limb feature sequence are aligned to obtain the fusion feature sequence, and then the fusion feature sequence is input into a trained muscle model, and a virtual person is driven through the muscle model.
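A minimal sketch of this duration-based fusion is given below: the three sequences are resampled to a common frame count derived from the duration feature and concatenated per frame. The nearest-neighbour resampling and the frame rate are assumptions made for illustration; the text above only specifies that the sequences are aligned according to the duration feature.

```python
import torch


def fuse_sequences(acoustic_seq, face_seq, limb_seq, total_duration_s, fps=25):
    num_frames = int(round(total_duration_s * fps))

    def resample(seq):
        # seq: (T, D) -> (num_frames, D) by nearest-neighbour frame indices.
        idx = torch.linspace(0, seq.shape[0] - 1, num_frames).round().long()
        return seq[idx]

    # Per-frame concatenation: (num_frames, D_acoustic + D_face + D_limb).
    return torch.cat([resample(acoustic_seq), resample(face_seq), resample(limb_seq)], dim=-1)


fused = fuse_sequences(torch.randn(200, 80), torch.randn(120, 52), torch.randn(120, 64),
                       total_duration_s=4.8)
```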
For the training process of the muscle model, refer to the description of steps C1-C3. After the fusion feature sequence is obtained, the corresponding bound muscles in the muscle model are driven directly by the fusion feature sequence; as the fusion feature sequence drives the bound muscles to perform the corresponding movements, the virtual person's facial expression and actions change accordingly with the movement of the bound muscles.
For example, when the acoustic feature sequence says "bye", the facial expression is smiling and the limb motion is waving. According to the duration feature, the time span of saying "bye" can be aligned with the smiling facial feature sequence and the waving limb feature sequence to obtain an aligned feature sequence, i.e. the fusion feature sequence. The fusion feature sequence is then input into the muscle model, which controls the virtual person's face to smile and its hand to wave while it says "bye", so that the virtual person's voice matches its face and actions.
For another example, when the acoustic feature sequence says a greeting, the facial expression is smiling and the limb motion is a beckoning gesture. According to the duration feature, the smiling facial feature sequence and the beckoning limb feature sequence are aligned with the greeting to obtain an aligned feature sequence, i.e. the fusion feature sequence. The fusion feature sequence is then input into the muscle model, which controls the virtual person's face to smile and its hand to beckon while it speaks the greeting, so that the virtual person's voice matches its face and actions.
After the data to be processed is acquired, it is processed with the end-to-end model to obtain an acoustic feature sequence, a facial feature sequence and a limb feature sequence; these sequences are input into a trained muscle model, through which the virtual person is driven. Because the end-to-end model takes the raw data to be processed as input and directly outputs the acoustic, facial and limb feature sequences, it can better exploit the parallel computing capability of new hardware (such as GPUs) and runs faster; that is, the acoustic feature sequence, the facial feature sequence and the limb feature sequence can be acquired in a shorter time. Inputting these sequences into the muscle model drives the virtual person directly, so that once the virtual person has been created, the acoustic feature sequence directly controls the virtual person's speech output while the facial feature sequence and the limb feature sequence control its facial expression and limb movements.
Moreover, because the duration feature is used when the end-to-end model obtains the acoustic feature sequence, the facial feature sequence and the limb feature sequence, and the duration feature improves the synchronization between the acoustic feature sequence and the facial and limb feature sequences, the virtual person's voice output matches its facial expression and limb movements more accurately when it is driven with these sequences, on the basis of that improved synchronization.
Method embodiment II
Referring to fig. 4, which shows a flowchart of the steps of a second embodiment of a method for driving a virtual person in real time according to the present invention, the method may specifically include the following steps:
S401, acquiring data to be processed for driving a virtual person, wherein the data to be processed comprises at least one of text data and voice data;
S402, processing the data to be processed by using an end-to-end model, and determining a fusion feature sequence corresponding to the data to be processed, wherein the fusion feature sequence is obtained by fusing an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
S403, inputting the fusion feature sequence into a trained muscle model, and driving a virtual person through the muscle model;
wherein, step S402 includes:
step S4021, acquiring text characteristics and duration characteristics of the data to be processed;
Step S4022, determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature;
Step S4023, obtaining the fusion characteristic sequence according to the acoustic characteristic sequence, the facial characteristic sequence and the limb characteristic sequence.
In step S401, for the client, the data to be processed uploaded by the user may be received; for the server, the data to be processed sent by the client may be received. It can be understood that any first device may receive the data to be processed from a second device; the embodiments of the present invention do not limit the specific transmission manner of the data to be processed.
If the data to be processed is text data, directly processing the data to be processed by using the step S402; if the data to be processed is voice data, after the data to be processed is converted into text data, the converted text data is processed by using step S402.
In step S402, an end-to-end model needs to be trained first, so that the trained end-to-end model outputs a fusion feature sequence, and at this time, the end-to-end model outputting the fusion feature sequence may be used as a third end-to-end model.
When the third end-to-end model is trained, its training samples may be real-person video data and real-person motion data. For each training sample in the training sample set, the training steps are as follows. Step D1 is performed to obtain the facial features, limb features and text features of the training sample, where the text features may be at the phoneme level; specifically, the feature data of the training sample may be mapped through an embedding layer in the end-to-end model to obtain the facial features, limb features and text features. Step D2 is then performed: the facial features, limb features and text features are processed by a feed-forward Transformer (Feed Forward Transformer) to obtain facial feature vectors, limb feature vectors and text encoding features, where the facial feature vectors are feature representations of facial expressions, the limb feature vectors may be muscle action vectors, and the text encoding features are also at the phoneme level. Step D3 is performed to align the facial feature vectors and the limb feature vectors with the text encoding features; a duration predictor may be used to align them with the text encoding features, specifically the phoneme features. Step D4 is performed to obtain the acoustic feature sequence, the facial feature sequence and the limb feature sequence. Step D5 is then performed to fuse the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a fusion feature sequence; at this point a length regulator may be used to align the sound, facial expression and motion by lengthening or shortening the phoneme durations, resulting in the fusion feature sequence.
The text features in the embodiments of the present invention may include phoneme features and/or semantic features, etc. A phoneme is the smallest phonetic unit obtained by dividing speech according to its natural attributes; analysed in terms of the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes include vowels and consonants. Optionally, specific phoneme features correspond to specific lip features, expression features, limb features, and the like.
Semantics refers to the meanings of the concepts represented by the real-world things that the text to be processed corresponds to, and the relationships among those meanings; it is the interpretation and logical representation of the text to be processed in a given domain. Optionally, a particular semantic feature corresponds to a particular limb feature, etc.
When the third end-to-end model is trained, the training sample set includes real person action data or real person video data, and the training process refers to the training process of training the end-to-end model outputting the acoustic feature sequence, so that the description is omitted for brevity.
Thus, after the data to be processed is obtained, the text features of the data can be obtained with the embedding layer of the third end-to-end model, the duration features of the data are then obtained, and the text features and the duration features are input into the third end-to-end model to obtain the acoustic feature sequence, the facial feature sequence and the limb feature sequence; these are fused according to the duration features to obtain the fusion feature sequence.
In step S4021, if the end-to-end model is a FastSpeech model, the text features may be obtained through the third FastSpeech model, and the duration features may be obtained through a duration model, where the duration model is a deep learning model.
In step S4022, if the FastSpeech model trained to output the acoustic feature sequence is the third FastSpeech model, the text features and the duration features may be input into the third FastSpeech model to determine the acoustic feature sequence, the facial feature sequence and the limb feature sequence; and in step S4023, the fusion feature sequence is obtained according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence.
Specifically, the acoustic feature sequence, the facial feature sequence, and the limb feature sequence may be aligned according to the duration feature, so as to obtain the fusion feature sequence.
If the end-to-end model is a FastSpeech model, a third FastSpeech model is obtained through training, and the text features of the data to be processed are then obtained using the third FastSpeech model; the duration features are obtained using a duration model, where the duration model may be a deep learning model such as a CNN or DNN.
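A minimal sketch of such a duration model is given below, assuming a small 1-D CNN over the phoneme-level text coding features that regresses one frame count per phoneme; the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DurationModel(nn.Module):
    """Regresses one duration (in frames) per phoneme from the text coding
    features, using a small 1-D CNN followed by a linear projection."""

    def __init__(self, d_model=256, hidden=256, kernel_size=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, text_coding):
        # text_coding: (T_phoneme, d_model) -> durations: (T_phoneme,)
        h = self.conv(text_coding.t().unsqueeze(0))   # (1, hidden, T_phoneme)
        h = h.squeeze(0).t()                          # (T_phoneme, hidden)
        return self.proj(h).squeeze(-1)
```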
Correspondingly, in the process of acquiring the fusion feature sequence using the third FastSpeech model, the text features of the data to be processed may be obtained through the embedding layer of the third FastSpeech model; the text features are encoded by a feedforward transformer to obtain text coding features; the text coding features are then processed by the duration model to obtain the duration features, and the text coding features are aligned according to the duration features to obtain aligned text coding features; the aligned text coding features are decoded and then subjected to sound prediction, face prediction and limb prediction to obtain a sound feature sequence, a facial feature sequence and a limb feature sequence; and the sound feature sequence, the facial feature sequence and the limb feature sequence are aligned according to the duration features to obtain an aligned sound feature sequence, facial feature sequence and limb feature sequence, that is, the aligned sound feature sequence, facial feature sequence and limb feature sequence are taken as the fusion feature sequence.
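The final alignment-and-fusion step might be sketched as follows, assuming the three sequences are 2-D arrays (rows are phonemes or frames, columns are feature dimensions) and the duration features give the number of frames per phoneme; the function and variable names are illustrative.

```python
import numpy as np


def fuse_by_duration(acoustic, facial, limb, durations):
    """Aligns the three predicted streams on one frame-level timeline using
    the per-phoneme frame counts, then packages them frame by frame."""
    durations = np.asarray(durations, dtype=int)
    total_frames = int(durations.sum())
    aligned = []
    for seq in (acoustic, facial, limb):
        seq = np.asarray(seq, dtype=float)
        if len(seq) == len(durations):        # phoneme-level: expand per phoneme
            seq = np.repeat(seq, durations, axis=0)
        if len(seq) < total_frames:           # too short: hold the last frame
            pad = np.repeat(seq[-1:], total_frames - len(seq), axis=0)
            seq = np.concatenate([seq, pad], axis=0)
        aligned.append(seq[:total_frames])    # too long: trim to the timeline
    # The aligned triple is the fusion feature sequence; concatenating per
    # frame is just one convenient way to hand it to the muscle model.
    return np.concatenate(aligned, axis=1)
```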
Next, step S403 is performed: after the fusion feature sequence is acquired, the fusion feature sequence is input into a trained muscle model, and the virtual person is driven through the muscle model.
Specifically, for the training process of the muscle model, reference is made to the description of steps C1-C3. After the fusion feature sequence is obtained, the corresponding bound muscles in the muscle model are directly driven by the fusion feature sequence; when the bound muscles are driven by the fusion feature sequence to perform the corresponding motion, the facial expression and the actions of the virtual person change correspondingly with the motion of the bound muscles.
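The driving step might be sketched as follows, assuming each fused frame carries named facial and limb coefficients that map onto bound muscles of the rig; all rig and playback interfaces here (BoundMuscleRig, audio_player.queue) are hypothetical placeholders rather than the patent's muscle model.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class MuscleBinding:
    name: str
    weight: float = 0.0  # current contraction in [0, 1]


class BoundMuscleRig:
    """Hypothetical stand-in for the bound muscles of the virtual person."""

    def __init__(self, facial_names, limb_names):
        self.facial = {n: MuscleBinding(n) for n in facial_names}
        self.limb = {n: MuscleBinding(n) for n in limb_names}

    def apply_frame(self, facial_coeffs: Dict[str, float],
                    limb_coeffs: Dict[str, float]) -> None:
        # Each coefficient drives its bound muscle; the renderer then deforms
        # the face mesh and skeleton accordingly (rendering not shown).
        for name, value in facial_coeffs.items():
            self.facial[name].weight = max(0.0, min(1.0, value))
        for name, value in limb_coeffs.items():
            self.limb[name].weight = max(0.0, min(1.0, value))


def drive(rig: BoundMuscleRig, fused_frames: Iterable[dict], audio_player) -> None:
    """Play the sound and update the bound muscles frame by frame, so speech,
    facial expression and limb motion stay synchronized."""
    for frame in fused_frames:
        audio_player.queue(frame["acoustic"])   # placeholder vocoder playback
        rig.apply_frame(frame["facial"], frame["limb"])
```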
For example, when the acoustic feature sequence says "bye", the facial expression is a smile and the limb motion is waving; because the fusion feature sequence is aligned according to the duration features, the virtual person smiles and waves while saying "bye", so that the virtual person's sound matches its face and motion.
For another example, when the acoustic feature sequence says "someone is injured", the facial expression is sad and the limb motion is pressing the two hands together; because the fusion feature sequence is aligned according to the duration features, the virtual person's face is sad and its hands are pressed together while it says "someone is injured", so that the virtual person's sound matches its face and motion.
After the data to be processed is acquired, the data to be processed is processed using the end-to-end model to obtain a fusion feature sequence in which the acoustic feature sequence, the facial feature sequence and the limb feature sequence are fused; the fusion feature sequence is input into a trained muscle model, and the virtual person is driven through the muscle model. Because the end-to-end model takes the raw data to be processed as input and directly outputs the fusion feature sequence fused from the acoustic feature sequence, the facial feature sequence and the limb feature sequence, it can better exploit and adapt to the parallel computing capability of new hardware (such as a GPU) and has a higher computing speed; that is, the fusion feature sequence can be acquired in a shorter time. The fusion feature sequence is input into the muscle model to directly drive the virtual person: after the virtual person is created, the fusion feature sequence directly controls the virtual person to produce voice output while controlling the virtual person's facial expression and limb actions.
Moreover, when the fusion feature sequence is acquired using the end-to-end model, the acoustic feature sequence, the facial feature sequence and the limb feature sequence are fused using the duration features. The duration features improve the synchronization between the acoustic feature sequence, the facial feature sequence and the limb feature sequence, so that, on the basis of this improved synchronization, the virtual person's sound output matches its facial expression and limb actions more accurately when the virtual person is driven using the fusion feature sequence.
Device embodiment 1
Referring to fig. 5, there is shown a block diagram of an embodiment of a device for driving a virtual person in real time according to the present invention, which may include:
A data acquisition module 501, configured to acquire data to be processed for driving a virtual person, where the data to be processed includes at least one of text data and voice data;
The data processing module 502 is configured to process the data to be processed using an end-to-end model, and determine an acoustic feature sequence, a facial feature sequence, and a limb feature sequence corresponding to the data to be processed;
a virtual person driving module 503, configured to input the acoustic feature sequence, the facial feature sequence, and the limb feature sequence into a trained muscle model, and drive a virtual person through the muscle model;
The data processing module 502 is specifically configured to obtain a text feature and a duration feature of the data to be processed; determining the acoustic feature sequence according to the text feature and the duration feature; and determining the facial feature sequence and the limb feature sequence according to the text features and the duration features.
In an alternative embodiment, the data processing module 502 is configured to acquire the text features through a FastSpeech model, and to acquire the duration features through a duration model, where the duration model is a deep learning model.
In an alternative embodiment, the data processing module 502 is configured to, if the FastSpeech model trained to output the acoustic feature sequence is a first FastSpeech model and the FastSpeech model trained to output the facial feature sequence and the limb feature sequence is a second FastSpeech model, input the text features and the duration features into the first FastSpeech model to obtain the acoustic feature sequence, and input the text features and the duration features into the second FastSpeech model to obtain the facial feature sequence and the limb feature sequence.
In an alternative embodiment, the virtual person driving module 503 is configured to fuse the acoustic feature sequence, the facial feature sequence and the limb feature sequence to obtain a fusion feature sequence, and to input the fusion feature sequence into the muscle model.
In an alternative embodiment, the virtual person driving module 503 is configured to fuse the acoustic feature sequence, the facial feature sequence, and the limb feature sequence based on the duration feature, to obtain the fused feature sequence.
In an alternative embodiment, the facial features corresponding to the facial feature sequence include expression features and lip features.
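For illustration only, the three modules of this embodiment might be composed as follows, reusing the drive() helper sketched earlier; the class and method names are assumptions, and the text front-end (phonemization of the data to be processed) is omitted.

```python
class VirtualPersonDevice:
    """Illustrative composition of the modules shown in Fig. 5."""

    def __init__(self, end_to_end_model, muscle_model, audio_player):
        self.end_to_end_model = end_to_end_model  # used by data processing module 502
        self.muscle_model = muscle_model          # used by virtual person driving module 503
        self.audio_player = audio_player

    def acquire_data(self, text_data=None, voice_data=None):
        """Data acquisition module 501: at least one of text data or voice data."""
        if text_data is None and voice_data is None:
            raise ValueError("data to be processed must contain text data and/or voice data")
        return {"text": text_data, "voice": voice_data}

    def process_data(self, phoneme_ids):
        """Data processing module 502: the end-to-end model returns the
        acoustic, facial and limb feature sequences."""
        return self.end_to_end_model(phoneme_ids)

    def drive_virtual_person(self, fused_frames):
        """Virtual person driving module 503: feed fused frames to the rig."""
        drive(self.muscle_model, fused_frames, self.audio_player)
```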
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method, and will not be described in detail here.
Device example two
Referring to fig. 6, there is shown a block diagram of an embodiment of a device for driving a virtual person in real time according to the present invention, which may include:
A data acquisition module 601, configured to acquire data to be processed for driving a virtual person, where the data to be processed includes at least one of text data and voice data;
The data processing module 602 is configured to process the data to be processed using an end-to-end model and determine a fusion feature sequence corresponding to the data to be processed, where the fusion feature sequence is obtained by fusing an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
a virtual person driving module 603, configured to input the fusion feature sequence into a trained muscle model and drive a virtual person through the muscle model;
The data processing module 602 is configured to obtain a text feature and a duration feature of the data to be processed; determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature; and obtaining the fusion characteristic sequence according to the acoustic characteristic sequence, the facial characteristic sequence and the limb characteristic sequence.
In an alternative embodiment, the data processing module 602 is configured to acquire the text features through a FastSpeech model, and to acquire the duration features through a duration model, where the duration model is a deep learning model.
In an alternative embodiment, the data processing module 602 is configured to align the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the duration feature, to obtain the fusion feature sequence.
In an alternative embodiment, the facial features corresponding to the facial feature sequence include expression features and lip features.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method, and will not be described in detail here.
Fig. 7 is a block diagram illustrating the structure of an apparatus for driving a virtual person in real time when implemented as a device, according to an exemplary embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and so forth.
Referring to fig. 7, apparatus 900 may include one or more of the following components: a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls overall operations of the apparatus 900, such as operations associated with display, incoming call, data communication, camera operations, and recording operations. The processing element 902 may include one or more processors 920 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 902 can include one or more modules that facilitate interaction between the processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operations at the device 900. Examples of such data include instructions for any application or method operating on the device 900, contact data, phonebook data, messages, pictures, videos, and the like. The memory 904 may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 906 provides power to the various components of the device 900. Power supply components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for device 900.
The multimedia component 908 includes a screen that provides an output interface between the device 900 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a Microphone (MIC) configured to receive external audio signals when the device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 904 or transmitted via the communication component 916. In some embodiments, the audio component 910 further includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 914 includes one or more sensors for providing status assessments of various aspects of the apparatus 900. For example, the sensor assembly 914 may detect the on/off state of the device 900 and the relative positioning of components, such as the display and keypad of the apparatus 900; the sensor assembly 914 may also detect a change in position of the apparatus 900 or of one component of the apparatus 900, the presence or absence of user contact with the apparatus 900, the orientation or acceleration/deceleration of the apparatus 900, and a change in temperature of the apparatus 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communication between the apparatus 900 and other devices in a wired or wireless manner. The device 900 may access a wireless network based on a communication standard, such as WiFi,2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication part 916 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as a memory 904 including instructions executable by the processor 920 of the apparatus 900 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Fig. 8 is a block diagram of a server in some embodiments of the invention. The server 1900 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may provide transitory or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and the like.
A non-transitory computer-readable storage medium is also provided; when instructions in the storage medium are executed by a processor of an apparatus (device or server), the apparatus is enabled to perform a method of driving a virtual person in real time, the method comprising: determining a duration feature corresponding to a text to be processed, the text to be processed involving at least two languages; determining a target voice sequence corresponding to the text to be processed according to the duration feature; determining a target image sequence corresponding to the text to be processed according to the duration feature, the target image sequence being obtained from a text sample and a corresponding image sample, where the languages corresponding to the text sample include all languages involved in the text to be processed; and fusing the target voice sequence and the target image sequence to obtain a corresponding target video.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present description have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the disclosure.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present specification without departing from the spirit or scope of the specification. Thus, if such modifications and variations of the present specification fall within the scope of the claims and the equivalents thereof, the present specification is also intended to include such modifications and variations.
Claims (8)
1. A method of driving a virtual person in real time, comprising:
Acquiring data to be processed for driving a virtual person, wherein the data to be processed comprises at least one of text data and voice data;
Processing the data to be processed by using an end-to-end model, and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
Inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, and driving a virtual person through the muscle model;
The processing the data to be processed by using the end-to-end model comprises the following steps:
acquiring text features and duration features of the data to be processed, wherein the acquiring comprises: acquiring the text features through a FastSpeech model; and acquiring the duration features through a duration model, wherein the duration model is a deep learning model;
determining the acoustic feature sequence according to the text features and the duration features; and
determining the facial feature sequence and the limb feature sequence according to the text features and the duration features;
wherein, if the FastSpeech model trained to output the acoustic feature sequence is a first FastSpeech model and the FastSpeech model trained to output the facial feature sequence and the limb feature sequence is a second FastSpeech model, the determining the acoustic feature sequence according to the text features and the duration features comprises:
inputting the text features and the duration features into the first FastSpeech model to obtain the acoustic feature sequence;
and the determining the facial feature sequence and the limb feature sequence according to the text features and the duration features comprises:
inputting the text features and the duration features into the second FastSpeech model to obtain the facial feature sequence and the limb feature sequence, which specifically comprises: encoding the text features through a feedforward transformer to obtain text coding features; processing the text coding features through the duration model to obtain the duration features, and aligning the text coding features according to the duration features to obtain aligned text coding features; and decoding the aligned text coding features and then performing face prediction and limb prediction to obtain the facial feature sequence and the limb feature sequence.
2. The method of claim 1, wherein said inputting the acoustic feature sequence, the facial feature sequence, and the limb feature sequence into a trained muscle model comprises:
Fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence to obtain a fused feature sequence;
The fusion feature sequence is input into the muscle model.
3. The method of claim 2, wherein the fusing the acoustic feature sequence, the facial feature sequence, and the limb feature sequence to obtain a fused feature sequence comprises:
And based on the duration feature, fusing the acoustic feature sequence, the facial feature sequence and the limb feature sequence to obtain the fused feature sequence.
4. A method of driving a virtual person in real time, comprising:
Acquiring data to be processed for driving a virtual person, wherein the data to be processed comprises at least one of text data and voice data;
Processing the data to be processed by using an end-to-end model, and determining a fusion characteristic sequence corresponding to the data to be processed, wherein the fusion characteristic sequence is obtained by fusing an acoustic characteristic sequence, a facial characteristic sequence and a limb characteristic sequence corresponding to the data to be processed;
inputting the fusion characteristic sequence into a trained muscle model, and driving a virtual person through the muscle model;
The processing the data to be processed by using the end-to-end model comprises the following steps:
acquiring text features and duration features of the data to be processed, wherein the acquiring comprises: acquiring the text features through a third FastSpeech model; and acquiring the duration features through a duration model, wherein the duration model is a deep learning model;
determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text features and the duration features, which comprises: inputting the text features and the duration features into the third FastSpeech model to obtain the acoustic feature sequence, the facial feature sequence and the limb feature sequence, and specifically comprises: encoding the text features through a feedforward transformer to obtain text coding features; processing the text coding features through the duration model to obtain the duration features, and aligning the text coding features according to the duration features to obtain aligned text coding features; and decoding the aligned text coding features and then performing sound prediction, face prediction and limb prediction to obtain a sound feature sequence, a facial feature sequence and a limb feature sequence; and
obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence, which comprises: aligning the sound feature sequence, the facial feature sequence and the limb feature sequence according to the duration features to obtain an aligned sound feature sequence, facial feature sequence and limb feature sequence, and taking the aligned sound feature sequence, facial feature sequence and limb feature sequence as the fusion feature sequence.
5. An apparatus for driving a virtual person in real time, comprising:
The data acquisition module is used for acquiring data to be processed for driving the virtual person, wherein the data to be processed comprises at least one of text data and voice data;
the data processing module is used for processing the data to be processed by using an end-to-end model and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the data to be processed;
A virtual person driving module for inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, and driving a virtual person through the muscle model;
The data processing module is specifically configured to acquire text features and duration features of the data to be processed, including: acquiring the text features through a FastSpeech model, and acquiring the duration features through a duration model, wherein the duration model is a deep learning model; determine the acoustic feature sequence according to the text features and the duration features; and determine the facial feature sequence and the limb feature sequence according to the text features and the duration features;
wherein, if the FastSpeech model trained to output the acoustic feature sequence is a first FastSpeech model and the FastSpeech model trained to output the facial feature sequence and the limb feature sequence is a second FastSpeech model, the determining the acoustic feature sequence according to the text features and the duration features comprises:
inputting the text features and the duration features into the first FastSpeech model to obtain the acoustic feature sequence;
and the determining the facial feature sequence and the limb feature sequence according to the text features and the duration features comprises:
inputting the text features and the duration features into the second FastSpeech model to obtain the facial feature sequence and the limb feature sequence, which specifically comprises: encoding the text features through a feedforward transformer to obtain text coding features; processing the text coding features through the duration model to obtain the duration features, and aligning the text coding features according to the duration features to obtain aligned text coding features; and decoding the aligned text coding features and then performing face prediction and limb prediction to obtain the facial feature sequence and the limb feature sequence.
6. An apparatus for driving a virtual person in real time, comprising:
The data acquisition module is used for acquiring data to be processed for driving the virtual person, wherein the data to be processed comprises at least one of text data and voice data;
the data processing module is used for processing the data to be processed by using an end-to-end model and determining a fusion characteristic sequence corresponding to the data to be processed, wherein the fusion characteristic sequence is obtained by fusing an acoustic characteristic sequence, a facial characteristic sequence and a limb characteristic sequence corresponding to the data to be processed;
The virtual person driving module is used for inputting the fusion characteristic sequence into a trained muscle model, and driving a virtual person through the muscle model;
The data processing module is specifically configured to obtain a text feature and a duration feature of the data to be processed, and includes: acquiring the text features through a third fastspeech model; acquiring the time length characteristics through a time length model, wherein the time length model is a deep learning model; determining the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the text feature and the duration feature, wherein the acoustic feature sequence and the limb feature sequence comprise: inputting the text feature and the duration feature into a third fastspeech model to obtain the acoustic feature sequence, wherein the facial feature sequence and the limb feature sequence specifically comprise: encoding the text characteristics through a feedforward transformer to obtain text encoding characteristics; processing the text coding features through a time length model to obtain time length features, and aligning the text coding features through the time length features to obtain aligned text coding features; performing sound prediction, face prediction and limb prediction after decoding the aligned text coding features to obtain a sound feature sequence, a face feature sequence and a limb feature sequence; according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence, the fusion feature sequence is obtained, and the fusion feature sequence comprises: according to the time length characteristics, aligning the sound characteristic sequence, the facial characteristic sequence and the limb characteristic sequence to obtain an aligned sound characteristic sequence, a facial characteristic sequence and a limb characteristic sequence, and taking the aligned sound characteristic sequence, the facial characteristic sequence and the limb characteristic sequence as fusion characteristic sequences.
7. A device for driving a virtual person in real time, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the method of any one of claims 1 to 3.
8. A machine readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method of driving a virtual person in real time as claimed in any one of claims 1 to 3.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010420720.9A CN113689880B (en) | 2020-05-18 | 2020-05-18 | Method, device, electronic equipment and medium for driving virtual person in real time |
PCT/CN2021/078244 WO2021232877A1 (en) | 2020-05-18 | 2021-02-26 | Method and apparatus for driving virtual human in real time, and electronic device, and medium |
US17/989,323 US20230082830A1 (en) | 2020-05-18 | 2022-11-17 | Method and apparatus for driving digital human, and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010420720.9A CN113689880B (en) | 2020-05-18 | 2020-05-18 | Method, device, electronic equipment and medium for driving virtual person in real time |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113689880A CN113689880A (en) | 2021-11-23 |
CN113689880B true CN113689880B (en) | 2024-05-28 |
Family
ID=78575574
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010420720.9A Active CN113689880B (en) | 2020-05-18 | 2020-05-18 | Method, device, electronic equipment and medium for driving virtual person in real time |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113689880B (en) |
WO (1) | WO2021232877A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115209180B (en) * | 2022-06-02 | 2024-06-18 | 阿里巴巴(中国)有限公司 | Video generation method and device |
CN117152308B (en) * | 2023-09-05 | 2024-03-22 | 江苏八点八智能科技有限公司 | Virtual person action expression optimization method and system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8224652B2 (en) * | 2008-09-26 | 2012-07-17 | Microsoft Corporation | Speech and text driven HMM-based body animation synthesis |
US9431027B2 (en) * | 2011-01-26 | 2016-08-30 | Honda Motor Co., Ltd. | Synchronized gesture and speech production for humanoid robots using random numbers |
CN111415677B (en) * | 2020-03-16 | 2020-12-25 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating video |
- 2020-05-18: CN application CN202010420720.9A filed; granted as CN113689880B (status: Active)
- 2021-02-26: PCT application PCT/CN2021/078244 filed as WO2021232877A1 (status: Application Filing)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104361620A (en) * | 2014-11-27 | 2015-02-18 | 韩慧健 | Mouth shape animation synthesis method based on comprehensive weighted algorithm |
CN106653052A (en) * | 2016-12-29 | 2017-05-10 | Tcl集团股份有限公司 | Virtual human face animation generation method and device |
CN110162598A (en) * | 2019-04-12 | 2019-08-23 | 北京搜狗科技发展有限公司 | A kind of data processing method and device, a kind of device for data processing |
CN110174942A (en) * | 2019-04-30 | 2019-08-27 | 北京航空航天大学 | Eye movement synthetic method and device |
CN110688911A (en) * | 2019-09-05 | 2020-01-14 | 深圳追一科技有限公司 | Video processing method, device, system, terminal equipment and storage medium |
CN110689879A (en) * | 2019-10-10 | 2020-01-14 | 中国科学院自动化研究所 | Method, system and device for training end-to-end voice transcription model |
CN110866968A (en) * | 2019-10-18 | 2020-03-06 | 平安科技(深圳)有限公司 | Method for generating virtual character video based on neural network and related equipment |
CN110956948A (en) * | 2020-01-03 | 2020-04-03 | 北京海天瑞声科技股份有限公司 | End-to-end speech synthesis method, device and storage medium |
Non-Patent Citations (2)
Title |
---|
Yi Ren et al. "FastSpeech: Fast, Robust and Controllable Text to Speech." 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), vol. 32, 2019, full text. *
Li Bingfeng; Xie Lei; Zhou Xiangzeng; Fu Zhonghua; Zhang Yanning. "Real-time speech-driven virtual talking head." Journal of Tsinghua University (Science and Technology), no. 09, 2011, full text. *
Also Published As
Publication number | Publication date |
---|---|
CN113689880A (en) | 2021-11-23 |
WO2021232877A1 (en) | 2021-11-25 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN113689879B (en) | Method, device, electronic equipment and medium for driving virtual person in real time | |
WO2021169431A1 (en) | Interaction method and apparatus, and electronic device and storage medium | |
US20200279553A1 (en) | Linguistic style matching agent | |
TWI766499B (en) | Method and apparatus for driving interactive object, device and storage medium | |
JP7227395B2 (en) | Interactive object driving method, apparatus, device, and storage medium | |
CN113362812B (en) | Voice recognition method and device and electronic equipment | |
US20230082830A1 (en) | Method and apparatus for driving digital human, and electronic device | |
CN110162598B (en) | Data processing method and device for data processing | |
CN110210310A (en) | A kind of method for processing video frequency, device and the device for video processing | |
CN113362813B (en) | Voice recognition method and device and electronic equipment | |
CN110148406B (en) | Data processing method and device for data processing | |
CN113689880B (en) | Method, device, electronic equipment and medium for driving virtual person in real time | |
CN113691833A (en) | Virtual anchor face changing method and device, electronic equipment and storage medium | |
WO2023045716A1 (en) | Video processing method and apparatus, and medium and program product | |
CN111640424A (en) | Voice recognition method and device and electronic equipment | |
CN113345452B (en) | Voice conversion method, training method, device and medium of voice conversion model | |
CN113282791B (en) | Video generation method and device | |
CN111696536B (en) | Voice processing method, device and medium | |
CN117351123A (en) | Interactive digital portrait generation method, device, equipment and storage medium | |
CN112785667A (en) | Video generation method, device, medium and electronic equipment | |
CN110930977B (en) | Data processing method and device and electronic equipment | |
CN112151072A (en) | Voice processing method, apparatus and medium | |
CN113409765B (en) | Speech synthesis method and device for speech synthesis | |
CN114155849A (en) | Virtual object processing method, device and medium | |
CN110166844B (en) | Data processing method and device for data processing |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |