WO2021232877A1 - Method, apparatus, electronic device and medium for driving a virtual human in real time


Info

Publication number
WO2021232877A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
feature sequence
model
data
sequence
Application number
PCT/CN2021/078244
Other languages
English (en)
French (fr)
Inventor
樊博
陈伟
陈曦
孟凡博
刘恺
张克宁
段文君
Original Assignee
北京搜狗科技发展有限公司
Priority claimed from CN202010420720.9A (published as CN113689880B)
Application filed by 北京搜狗科技发展有限公司
Publication of WO2021232877A1
Priority to US17/989,323 (published as US20230082830A1)

Classifications

    • G — PHYSICS
    • G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L 21/06 — Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 — Transforming into visible information
    • G10L 25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G06N — Computing arrangements based on specific computational models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods

Definitions

  • the present disclosure relates to the field of virtual human processing technology, and in particular to a method, device, electronic device, and medium for driving virtual humans in real time.
  • A digital human (Digital Human) is a comprehensive rendering technology that uses computers to simulate a real person; digital humans are also called virtual humans, hyper-realistic humans, or photo-realistic humans. Because people are very familiar with real humans, producing a 3D static model that looks truly real takes a great deal of time, and when the 3D static model is driven to move, even a subtle expression requires re-modeling because the model is so realistic. As a result, modeling requires a large amount of data and computation, and the computation takes a long time: a single action of the model may take an hour or several hours of computation to realize, so the real-time performance of driving the model is very poor.
  • the purpose of the present disclosure is, at least in part, to provide a method, device, electronic device, and medium that can drive a virtual person in real time.
  • a method for driving a virtual person in real time includes: obtaining to-be-processed data for driving the virtual person, the to-be-processed data including at least one of text data and voice data; processing the to-be-processed data using an end-to-end model to determine the acoustic feature sequence, facial feature sequence, and limb feature sequence corresponding to the to-be-processed data; and inputting the acoustic feature sequence, the facial feature sequence, and the limb feature sequence into a trained muscle model, and driving the virtual person through the muscle model; wherein processing the to-be-processed data using the end-to-end model includes: acquiring the text feature and the duration feature of the to-be-processed data; determining the acoustic feature sequence based on the text feature and the duration feature; and determining the facial feature sequence and the limb feature sequence based on the text feature and the duration feature.
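  • The overall flow of this method can be summarized in the following illustrative sketch; every name in it (drive_virtual_human and the frontend, duration_model, acoustic_model, motion_model, and muscle_model callables) is a hypothetical placeholder rather than a term defined by this disclosure.

```python
# Illustrative sketch only: the callables passed in are hypothetical stand-ins
# for the trained components described in this disclosure.

def drive_virtual_human(to_be_processed, frontend, duration_model,
                        acoustic_model, motion_model, muscle_model):
    # 1. Obtain phoneme-level text features and per-phoneme duration features.
    text_feat = frontend(to_be_processed)
    duration_feat = duration_model(text_feat)

    # 2. End-to-end models map (text, duration) features to output sequences.
    acoustic_seq = acoustic_model(text_feat, duration_feat)      # e.g. mel frames
    face_seq, limb_seq = motion_model(text_feat, duration_feat)  # expression + body

    # 3. The trained muscle model consumes the three sequences to drive the avatar.
    return muscle_model(acoustic_seq, face_seq, limb_seq)
```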
  • the acquiring the text feature and duration feature of the data to be processed includes: acquiring the text feature through a fastspeech model; and acquiring the duration feature through a duration model, wherein the duration model is a deep learning model.
  • the fastspeech model that is trained to output the acoustic feature sequence is the first fastspeech model
  • the fastspeech model that is trained to output the facial feature sequence and limb feature sequence is the second fastspeech model
  • the determining the acoustic feature sequence according to the text feature and the duration feature includes: inputting the text feature and the duration feature into the first fastspeech model to obtain the acoustic feature sequence;
  • the determining the facial feature sequence and the limb feature sequence according to the text feature and the duration feature includes: inputting the text feature and the duration feature into the second fastspeech model to obtain the facial feature sequence and the limb feature sequence.
  • the inputting the acoustic feature sequence, the facial feature sequence, and the limb feature sequence into a trained muscle model includes: fusing the acoustic feature sequence, the facial feature sequence, and the limb feature sequence to obtain a fusion feature sequence; and inputting the fusion feature sequence into the muscle model.
  • the fusing the acoustic feature sequence, the facial feature sequence, and the limb feature sequence to obtain a fused feature sequence includes: fusing the acoustic feature sequence, the facial feature sequence, and the limb feature sequence based on the duration feature to obtain the fused feature sequence.
  • the facial features corresponding to the facial feature sequence include expression features and lip features.
  • a method for driving a virtual person in real time includes: obtaining to-be-processed data for driving the virtual person, the to-be-processed data including at least one of text data and voice data; processing the to-be-processed data using an end-to-end model and determining the fusion feature sequence corresponding to the to-be-processed data, where the fusion feature sequence is obtained by fusing the acoustic feature sequence, facial feature sequence, and limb feature sequence corresponding to the to-be-processed data; and inputting the fusion feature sequence into a trained muscle model, and driving the virtual person through the muscle model; wherein processing the to-be-processed data using the end-to-end model includes: acquiring the text feature and duration feature of the to-be-processed data; determining the acoustic feature sequence, the facial feature sequence, and the limb feature sequence according to the text feature and the duration feature; and obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence, and the limb feature sequence.
  • the acquiring the text feature and duration feature of the data to be processed includes: acquiring the text feature through a fastspeech model; and acquiring the duration feature through a duration model, wherein the duration model is a deep learning model.
  • the obtaining the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence, and the limb feature sequence includes: fusing the acoustic feature sequence, the facial feature sequence, and the limb feature sequence based on the duration feature to obtain the fused feature sequence.
  • a device for driving a virtual person in real time includes: a data acquisition module for acquiring to-be-processed data for driving the virtual person, the to-be-processed data including at least one of text data and voice data; a data processing module for processing the to-be-processed data using an end-to-end model to determine the acoustic feature sequence, facial feature sequence, and limb feature sequence corresponding to the to-be-processed data; and a virtual person driving module for inputting the acoustic feature sequence, the facial feature sequence, and the limb feature sequence into the trained muscle model and driving the virtual person through the muscle model; wherein the data processing module is specifically configured to acquire the text feature and duration feature of the to-be-processed data, determine the acoustic feature sequence based on the text feature and the duration feature, and determine the facial feature sequence and the limb feature sequence based on the text feature and the duration feature.
  • a device for driving a virtual person in real time includes: a data acquisition module for acquiring to-be-processed data for driving the virtual person, the to-be-processed data including at least one of text data and voice data; a data processing module for processing the to-be-processed data using an end-to-end model and determining the fusion feature sequence corresponding to the to-be-processed data, where the fusion feature sequence is obtained by fusing the acoustic feature sequence, facial feature sequence, and limb feature sequence corresponding to the to-be-processed data; and a virtual person driving module for inputting the fusion feature sequence into the trained muscle model and driving the virtual person through the muscle model; wherein the data processing module is specifically configured to obtain the text feature and duration feature of the to-be-processed data, determine the acoustic feature sequence, the facial feature sequence, and the limb feature sequence according to the text feature and the duration feature, and obtain the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence, and the limb feature sequence.
  • a device for driving a virtual person in real time includes a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the steps of the method for driving the virtual person in real time.
  • a machine-readable medium has instructions stored thereon which, when executed by one or more processors, cause a device to execute the steps of the method for driving a virtual person in real time.
  • the end-to-end model is used to process the to-be-processed data to obtain an acoustic feature sequence, a facial feature sequence, and a limb feature sequence; the acoustic feature sequence, the facial feature sequence, and the limb feature sequence are then input into the trained muscle model, and the virtual person is driven by the muscle model. Because the end-to-end model takes the raw data of the to-be-processed data as input and directly outputs the acoustic feature sequence, facial feature sequence, and limb feature sequence, it can better utilize and adapt to the parallel computing power of new hardware (such as GPUs) and computes faster; that is, the acoustic feature sequence, facial feature sequence, and limb feature sequence can be obtained in a shorter time, and these sequences are then input into the muscle model to directly drive the virtual person.
  • in this way, the virtual person's voice output is directly controlled by the acoustic feature sequence, while the facial feature sequence and limb feature sequence control the virtual person's facial expressions and body movements. Compared with re-modeling the virtual person, this greatly reduces the amount of calculation and data transmission and improves calculation efficiency, so the real-time performance of driving the virtual person is greatly improved and the virtual person can be driven in real time.
  • Fig. 1 shows a training flowchart of end-to-end model training according to one or more embodiments of the present disclosure
  • Fig. 2 shows a first flow chart of a method for driving a virtual person in real time according to one or more embodiments of the present disclosure
  • FIG. 3 shows a flow chart of the steps of outputting an acoustic feature sequence by the first fastspeech model according to one or more embodiments of the present disclosure
  • FIG. 4 shows a second flow chart of a method for driving a virtual person in real time according to one or more embodiments of the present disclosure
  • FIG. 5 shows a first structural schematic diagram of an apparatus for driving a virtual person in real time according to one or more embodiments of the present disclosure
  • FIG. 6 shows a schematic diagram of a second structure of an apparatus for driving a virtual person in real time according to one or more embodiments of the present disclosure
  • FIG. 7 shows a structural block diagram of an apparatus for driving a virtual person in real time as a device according to one or more embodiments of the present disclosure
  • Fig. 8 shows a structural block diagram of a server according to one or more embodiments of the present disclosure.
  • a solution for driving the virtual person in real time is provided.
  • the solution is used to drive the virtual person in real time, which may specifically include:
  • the end-to-end model is used to process the to-be-processed data, and the acoustic feature sequence, facial feature sequence, and limb feature sequence corresponding to the to-be-processed data are determined; the acoustic feature sequence, facial feature sequence, and limb feature sequence are then input into a trained muscle model, and the virtual person is driven through the muscle model;
  • the virtual person according to the embodiments of the present disclosure may be a high-fidelity virtual person that differs little from a real person; the virtual person can be applied to expression scenes such as news broadcast scenes, teaching scenes, medical scenes, customer service scenes, legal scenes, and conference scenes.
  • the data to be processed according to the embodiments of the present disclosure may be text data, voice data, or both text data and voice data, which is not specifically limited in this specification.
  • For example, in a news broadcast scene, the press release is the data to be processed; the press release can be text edited by humans or machines, and the edited text is obtained as the press release.
  • Before the end-to-end model is used to process the data to be processed, the end-to-end model needs to be trained on samples to obtain the trained end-to-end model; after the trained end-to-end model is obtained, it is used to process the to-be-processed data.
  • The end-to-end model involves two training methods: one trains an end-to-end model that outputs the acoustic feature sequence, and the other trains an end-to-end model that outputs the facial feature sequence and limb feature sequence; the end-to-end model can be a fastspeech model.
  • The training samples can be text and voice data, or video data; for each training sample in the training sample set, the specific training steps are shown in Fig. 1.
  • First, step A1 is performed to obtain the acoustic feature 101 and the text feature 102 of the training sample, where the text feature 102 may be at the phoneme level.
  • Specifically, the feature data of the training sample can be mapped through the embedding layer of the end-to-end model to obtain the acoustic feature 101 and the text feature 102; then step A2 is performed, and the feedforward transformer 103 (Feed Forward Transformer) processes the acoustic feature 101 and the text feature 102 to obtain the acoustic vector 104 and the text encoding feature 105, where the acoustic vector 104 can be the acoustic vector of a sentence or of a word.
  • The text encoding feature 105 is also at the phoneme level. Next, step A3 is performed to align the acoustic vector 104 with the text encoding feature 105 to obtain the aligned text encoding feature 106. A duration predictor can be used for this alignment: the text encoding feature 105 is specifically a phoneme feature and the acoustic vector 104 can be a mel spectrogram, so the duration predictor aligns the phoneme feature with the mel spectrogram. Then step A4 is performed, and the aligned text encoding feature 106 is decoded (107) to obtain the acoustic feature sequence 108.
  • In addition, the length adjuster can be used to easily control the speaking speed by extending or shortening the phoneme durations, so as to determine the length of the generated mel spectrogram, and intervals can be added between phonemes to control part of the prosody; the acoustic feature sequence is obtained according to the determined mel-spectrogram length and phoneme interval times.
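  • A minimal sketch of the phoneme-duration adjustment described above, assuming integer per-phoneme durations and a simple repeat-based expansion; PyTorch, the speed factor, and the tensor shapes are illustrative assumptions, not specified by this disclosure.

```python
import torch

def length_regulate(phoneme_encodings, durations, speed=1.0):
    """Expand phoneme-level encodings to frame level by repeating each phoneme
    encoding for its (optionally rescaled) predicted number of frames."""
    # phoneme_encodings: (num_phonemes, hidden_dim)
    # durations: (num_phonemes,) integer frame counts from the duration predictor
    scaled = torch.clamp((durations.float() * speed).round().long(), min=1)
    return torch.repeat_interleave(phoneme_encodings, scaled, dim=0)

# Example: 3 phonemes with hidden size 4 and durations of 2, 3 and 1 frames.
frames = length_regulate(torch.randn(3, 4), torch.tensor([2, 3, 1]))
print(frames.shape)  # torch.Size([6, 4])
```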
  • the training sample set may include 13,100 speech audio clips and corresponding text records, for example, and the total audio length is about 24 hours.
  • the training sample set is randomly divided into 3 groups: 12,500 samples for training, 300 samples for verification, and 300 samples for testing.
  • To alleviate pronunciation errors, a phoneme conversion tool is used to convert the text sequences into phoneme sequences; for the speech data, the original waveforms are converted to mel spectrograms. The 12,500 training samples are then used to train the end-to-end model; after training is complete, the 300 verification samples are used to verify the trained model, and once it meets the verification requirements, the 300 test samples are used to test it. If the test conditions are met, the trained end-to-end model is obtained.
  • If the end-to-end model does not meet the verification requirements, it is trained again using the training samples until it does; the model that meets the verification requirements is then tested, and only when the trained end-to-end model meets both the verification requirements and the test conditions is it taken as the final, trained end-to-end model.
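  • As an illustration of the random split and the verify-then-test acceptance loop described above, the sketch below assumes hypothetical fit/score methods and an arbitrary acceptance threshold; none of these specifics come from the disclosure.

```python
import random

def split_corpus(samples, n_train=12500, n_val=300, n_test=300, seed=0):
    """Randomly split the corpus into training / verification / test groups."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:n_train + n_val + n_test])

def train_until_accepted(model, train, val, test, threshold=0.95, max_rounds=5):
    """Re-train until verification passes, then check the test condition."""
    for _ in range(max_rounds):
        model.fit(train)                     # hypothetical training call
        if model.score(val) >= threshold:    # hypothetical verification metric
            break
    return model.score(test) >= threshold    # hypothetical test condition
```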
  • The training samples can be real-person video data and real-person action data; for each training sample in the training sample set, the training steps specifically include: first, step B1 is performed to obtain the facial features, body features, and text features of the training sample, where the text features can be at the phoneme level.
  • Specifically, the feature data of the training sample can be mapped through the embedding layer of the end-to-end model to obtain facial features, body features, and text features; then step B2 is performed, and the feedforward transformer (Feed Forward Transformer) processes the facial features, body features, and text features to obtain facial feature vectors, body feature vectors, and text encoding features.
  • The facial feature vectors are used to express facial expressions, the body feature vectors can be muscle action vectors, and the text encoding features are also at the phoneme level.
  • Next, step B3 is performed to align the facial feature vectors and body feature vectors with the text encoding features; then step B4 is performed to obtain the facial feature sequence and the body feature sequence.
  • Specifically, the length adjuster can be used to align facial expressions and body actions by extending or shortening the phoneme durations, thereby obtaining the facial feature sequence and the body feature sequence.
  • text features may include: phoneme features, and/or semantic features, and so on.
  • the phoneme is the smallest phonetic unit divided according to the natural attributes of the speech, and is analyzed according to the pronunciation actions in the syllable, and one action constitutes one phoneme.
  • Phonemes can include: vowels and consonants.
  • a specific phoneme feature corresponds to a specific lip feature, expression feature, or body feature.
  • semantics is the meaning of the concepts represented by things in the real world corresponding to the text to be processed, and the relationship between these meanings is the interpretation and logical representation of the text to be processed in a certain field.
  • specific semantic features correspond to specific body features and the like.
  • When the training sample set includes real-person action data or real-person video data, the training process refers to the training of the end-to-end model that outputs the acoustic feature sequence; for the sake of brevity of the specification, it will not be repeated here.
  • After the end-to-end model that outputs the acoustic feature sequence and the end-to-end model that outputs the facial feature sequence and limb feature sequence are obtained through training, the former is taken as the first end-to-end model and the latter as the second end-to-end model.
  • In this way, the embedding layer of the first end-to-end model can be used to obtain the text features of the data to be processed, the duration features of the data to be processed are then obtained, and the text features and duration features are input into the first end-to-end model to obtain the acoustic feature sequence. Similarly, the embedding layer of the second end-to-end model can be used to obtain the text features of the data to be processed, the duration features are then obtained, and the text features and duration features are input into the second end-to-end model to obtain the facial feature sequence and the limb feature sequence; of course, the previously obtained text features and duration features can also be input directly into the second end-to-end model to obtain the facial feature sequence and the limb feature sequence.
  • The first end-to-end model and the second end-to-end model can process data at the same time, or the first end-to-end model can process data first, or the second end-to-end model can process data first; this specification does not impose specific restrictions.
  • the duration feature may be used to characterize the duration of the phoneme corresponding to the text.
  • The duration feature can describe the cadence, the rises and falls, and the emphasis in the speech, which can improve the expressiveness and naturalness of the synthesized speech.
  • the duration model may be used to determine the duration characteristics corresponding to the data to be processed.
  • the input of the duration model can be: phoneme features with accent annotations, and the output is the phoneme duration.
  • the duration model can be obtained by learning the speech samples with duration information, for example, it can be deep learning models such as Convolutional Neural Networks (hereinafter referred to as CNN) and Deep Neural Networks (hereinafter referred to as DNN)
  • the specific duration model is not limited.
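  • As one illustration of what such a duration model could look like, the sketch below uses a small 1-D convolutional network over phoneme-level features to predict one duration (in frames) per phoneme; the architecture, dimensions, and log-domain output are assumptions rather than details taken from this disclosure, and the accent annotations mentioned above are omitted for brevity.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Illustrative CNN duration model: phoneme-level features in,
    one predicted duration (in frames) per phoneme out."""
    def __init__(self, in_dim=256, hidden=256, kernel_size=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, phoneme_feats):
        # phoneme_feats: (batch, num_phonemes, in_dim)
        x = self.conv(phoneme_feats.transpose(1, 2)).transpose(1, 2)
        log_dur = self.proj(x).squeeze(-1)          # (batch, num_phonemes)
        # Predict in the log domain, then convert to whole frame counts >= 1.
        return torch.clamp(torch.exp(log_dur).round(), min=1).long()

# Example: 2 utterances, 7 phonemes each, 256-dim phoneme features.
durations = DurationPredictor()(torch.randn(2, 7, 256))
print(durations.shape)  # torch.Size([2, 7])
```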
  • After the acoustic feature sequence, the facial feature sequence, and the limb feature sequence are obtained, they are input into the trained muscle model, and the virtual person is driven by the muscle model.
  • The facial features include expression features and lip features, where the expression features, expressing feelings and emotions, may refer to the thoughts and feelings shown on the face.
  • Expression features are usually for the entire face.
  • Lip features can be specifically for the lips, and are related to the text content, voice, pronunciation, etc. of the text, so that facial features can be used to make facial expressions more realistic and delicate.
  • The body features can convey the thoughts of a character through the coordinated activities of the head, eyes, neck, hands, elbows, arms, torso, hips, feet, and other parts of the human body, and express ideas vividly. Body features can include head turning, shoulder shrugging, gestures, and so on, which can increase the richness of the expression corresponding to the image sequence; for example, at least one arm hangs down naturally when speaking, and at least one arm is placed naturally on the abdomen when not speaking.
  • Before use, model training is required to obtain the trained muscle model; after the trained muscle model is obtained, it is used to process the acoustic feature sequence, the facial feature sequence, and the limb feature sequence.
  • During model training, the muscle model is first created according to the facial muscles and limb muscles of the human body.
  • The training samples can be real-person video data and real-person action data; for each training sample in the training sample set, the training steps include:
  • First, step C1 is performed to obtain the facial muscle features and limb muscle features of each training sample; then step C2 is performed to train the muscle model using the facial muscle features and limb muscle features of each training sample; and after the training is completed, step C3 is performed to verify the trained muscle model using the verification samples. After the model meets the verification requirements, the test samples are used to test the trained muscle model; if the test conditions are met, the trained muscle model is obtained.
  • If the trained muscle model does not meet the verification requirements, the training samples are used to train the muscle model again until it does; the muscle model that meets the verification requirements is then tested, and only when the trained muscle model meets both the verification requirements and the test conditions is it taken as the final, trained muscle model.
  • In the muscle model, taking the facial muscles as an example, a polygonal network is used for approximate, abstract muscle control.
  • The self-attention mechanism adopted by the feedforward transformer of the end-to-end model is an innovative way to understand the current word through its context, so its ability to extract semantic features is stronger.
  • In practice, this means that for homophones or ambiguous words in a sentence, the new algorithm can determine which one is intended based on the surrounding words and the preceding and following sentences (such as the Chinese homophones for "bathing" and "washing dates"), so as to obtain a more accurate result; the end-to-end model also solves the problem that the tasks of each part of a traditional speech recognition scheme are independent and cannot be jointly optimized.
  • The framework of a single neural network is also simpler: the deeper the model and the larger the training data, the higher the accuracy.
  • The end-to-end model adopts a new neural network structure, which can better utilize and adapt to the parallel computing power of new hardware (such as GPUs), so the computation is faster. This means that, for speech of the same length, an algorithm model based on the new network structure can complete transcription in a shorter time and can better meet the needs of real-time transcription.
  • In the embodiments of the present disclosure, an end-to-end model is used to process the data to be processed to obtain an acoustic feature sequence, a facial feature sequence, and a limb feature sequence; the acoustic feature sequence, the facial feature sequence, and the limb feature sequence are then input into the trained muscle model, and the virtual person is driven by the muscle model. Because the end-to-end model takes the raw data of the data to be processed as input and directly outputs the acoustic feature sequence, facial feature sequence, and limb feature sequence, it can better utilize and adapt to the parallel computing power of new hardware (such as GPUs) and computes faster; that is, the acoustic feature sequence, facial feature sequence, and limb feature sequence can be obtained in a shorter time, and these sequences are then input into the muscle model to directly drive the virtual person.
  • In this way, the virtual person's voice output is directly controlled by the acoustic feature sequence, while the facial feature sequence and the limb feature sequence control the virtual person's facial expressions and body movements. Compared with re-modeling the virtual person, this greatly reduces the amount of calculation and data transmission and improves calculation efficiency, so the real-time performance of driving the virtual person is greatly improved and the virtual person can be driven in real time.
  • In addition, the duration feature is used, and the duration feature can improve the synchronization between the acoustic feature sequence and the facial and limb feature sequences. Therefore, based on this improved synchronization, when the acoustic feature sequence together with the facial feature sequence and the limb feature sequence is used to drive the virtual person, the virtual person's voice output matches its facial expressions and body movements with higher accuracy.
  • an embodiment of the present disclosure provides a flow chart of steps of a method for driving a virtual person in real time, which may specifically include the following steps:
  • S201 Obtain to-be-processed data for driving the virtual person, the to-be-processed data including at least one of text data and voice data;
  • S202 Use an end-to-end model to process the to-be-processed data, and determine the acoustic feature sequence, facial feature sequence, and limb feature sequence corresponding to the to-be-processed data;
  • S203 Input the acoustic feature sequence, the facial feature sequence, and the limb feature sequence into a trained muscle model, and drive the virtual person through the muscle model;
  • wherein step S202 includes:
  • Step S2021 Acquire the text feature and duration feature of the data to be processed
  • Step S2022 Determine the acoustic feature sequence according to the text feature and the duration feature;
  • Step S2023 Determine the facial feature sequence and the limb feature sequence according to the text feature and the duration feature.
  • In step S201, for a client, the to-be-processed data uploaded by the user can be received; for a server, the to-be-processed data sent by the client can be received.
  • any first device can receive the text to be processed from the second device, and an embodiment of the present disclosure does not limit the specific transmission mode of the data to be processed.
  • If the to-be-processed data is text data, step S202 processes it directly; if the to-be-processed data is voice data, the voice data is first converted into text data, and step S202 then processes the converted text data.
  • the end-to-end model needs to be trained first.
  • Specifically, the end-to-end model involves two training methods: one trains the end-to-end model that outputs the acoustic feature sequence, and the other trains the end-to-end model that outputs the facial feature sequence and the limb feature sequence; the end-to-end model can be a fastspeech model.
  • the text feature may be acquired through a fastspeech model; and the duration feature may be acquired through a duration model, where the duration model is a deep learning model.
  • If the end-to-end model is a fastspeech model, then after the first fastspeech model and the second fastspeech model are trained, either fastspeech model can be used to obtain the text features of the data to be processed; the duration model is then used to obtain the duration features, where the duration model can be a deep learning model such as a CNN or DNN.
  • In step S2022, if the fastspeech model trained to output the acoustic feature sequence is the first fastspeech model and the fastspeech model trained to output the facial feature sequence and the limb feature sequence is the second fastspeech model, the text feature and the duration feature can be input into the first fastspeech model to obtain the acoustic feature sequence; and in step S2023, the text feature and the duration feature are input into the second fastspeech model to obtain the facial feature sequence and the limb feature sequence.
  • Specifically, the steps include: obtaining the text feature 301 of the data to be processed through the embedding layer of the first fastspeech model; encoding the text feature 301 with the feedforward transformer 302 to obtain the text encoding feature 303; processing the text encoding feature 303 with the duration model 304 to obtain the duration feature 305, where the duration feature 305 can be used to characterize the duration of each phoneme in the text encoding feature 303; aligning the text encoding feature 303 using the duration feature 305 to obtain the aligned text encoding feature 306; and decoding (307) and predicting from the aligned text encoding feature 306 to obtain the acoustic feature sequence.
  • the text encoding feature 303 is at the phoneme level, and the aligned text encoding feature 306 may be at the frame level or at the phoneme level, which is not specifically limited in the embodiment of the present disclosure.
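  • The inference path just described (embedding 301, feedforward-transformer encoding 302/303, duration prediction 304/305, alignment 306, decoding 307) can be sketched as one forward pass; the module choices, dimensions, and the use of standard transformer-encoder layers for both encoding and decoding are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class FirstFastSpeechSketch(nn.Module):
    """Illustrative forward pass: text feature -> text encoding -> predicted
    durations -> length-regulated encoding -> decoded acoustic (mel) frames."""
    def __init__(self, vocab=100, dim=256, n_mels=80):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.embedding = nn.Embedding(vocab, dim)                  # text feature 301
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # FFT 302 -> 303
        self.duration = nn.Linear(dim, 1)                          # duration 304/305
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)  # decoding 307
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, phoneme_ids):
        enc = self.encoder(self.embedding(phoneme_ids))            # (1, P, dim)
        dur = torch.clamp(self.duration(enc).squeeze(-1).round().long(), min=1)
        aligned = torch.repeat_interleave(enc[0], dur[0], dim=0)[None]  # aligned 306
        return self.to_mel(self.decoder(aligned))                  # acoustic frames

mel = FirstFastSpeechSketch()(torch.randint(0, 100, (1, 12)))
print(mel.shape)  # (1, total_frames, 80)
```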
  • Similarly, the text features of the data to be processed can be obtained through the embedding layer of the second fastspeech model; the text features are then encoded by the feedforward transformer to obtain the text encoding feature; the text encoding feature is processed through the duration model to obtain the duration feature, and the duration feature is used to align the text encoding feature to obtain the aligned text encoding feature; after the aligned text encoding feature is decoded, facial prediction and limb prediction are performed to obtain the facial feature sequence and the limb feature sequence.
  • step S203 is performed to fuse the acoustic feature sequence, the facial feature sequence, and the limb feature sequence to obtain a fusion feature sequence.
  • Specifically, the acoustic feature sequence, the facial feature sequence, and the limb feature sequence may be fused according to the duration feature to obtain the fused feature sequence; after fusion, the fusion feature sequence is input into the trained muscle model, and the virtual person is driven by the muscle model.
  • Specifically, according to the duration feature, the acoustic feature sequence, the facial feature sequence, and the limb feature sequence are aligned to obtain the fusion feature sequence, and the fusion feature sequence is then input into the trained muscle model, and the virtual person is driven by the muscle model.
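  • One plausible reading of this duration-based fusion is to bring the three sequences onto a common frame timeline and concatenate them per frame; the sketch below uses nearest-neighbour resampling onto the acoustic timeline and is an assumption for illustration, not the disclosure's exact alignment procedure.

```python
import torch

def fuse_by_duration(acoustic_seq, face_seq, limb_seq):
    """Align the three sequences to a common length and concatenate per frame.
    acoustic_seq: (T_a, D_a), face_seq: (T_f, D_f), limb_seq: (T_l, D_l)."""
    target_len = acoustic_seq.shape[0]

    def resample(seq):
        # Nearest-neighbour resampling onto the acoustic frame timeline.
        idx = torch.linspace(0, seq.shape[0] - 1, target_len).round().long()
        return seq[idx]

    # (T_a, D_a + D_f + D_l): the fusion feature sequence
    return torch.cat([acoustic_seq, resample(face_seq), resample(limb_seq)], dim=-1)

fused = fuse_by_duration(torch.randn(120, 80), torch.randn(30, 52), torch.randn(30, 24))
print(fused.shape)  # torch.Size([120, 156])
```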
  • For the training process of the muscle model, refer to the description of steps C1-C3.
  • The corresponding bound muscles in the muscle model are directly driven by the fusion feature sequence; when the bound muscles are driven by the fusion feature sequence to perform the corresponding movements, the virtual person's facial expressions and actions change accordingly with the movement of the bound muscles.
  • For example, if the acoustic feature sequence says "goodbye" while the facial expression is smiling and the body action is waving, the time period of saying "goodbye" can be aligned with the smiling facial feature sequence and the waving body action sequence to obtain the aligned feature sequence, that is, the fusion feature sequence; the fusion feature sequence is then input into the muscle model, and the muscle model controls the virtual person to smile and wave while saying "goodbye", so that the virtual person's voice matches its face and movement.
  • As another example, if the acoustic feature sequence says "come here" while the facial expression is a smile and the body action is beckoning, the time period of saying "come here" can be aligned with the smiling facial feature sequence and the beckoning body action sequence to obtain the aligned feature sequence, that is, the fusion feature sequence; the fusion feature sequence is then input into the muscle model, and the muscle model controls the virtual person to smile and beckon while saying "come here", so that the virtual person's voice matches its face and movement.
  • In the embodiments of the present disclosure, an end-to-end model is used to process the data to be processed to obtain an acoustic feature sequence, a facial feature sequence, and a limb feature sequence; the acoustic feature sequence, the facial feature sequence, and the limb feature sequence are then input into the trained muscle model, and the virtual person is driven by the muscle model. Because the end-to-end model takes the raw data of the data to be processed as input and directly outputs the acoustic feature sequence, facial feature sequence, and limb feature sequence, it can better utilize and adapt to the parallel computing power of new hardware (such as GPUs) and computes faster; that is, the acoustic feature sequence, facial feature sequence, and limb feature sequence can be obtained in a shorter time, and these sequences are then input into the muscle model to directly drive the virtual person.
  • In this way, the virtual person's voice output is directly controlled by the acoustic feature sequence, while the facial feature sequence and the limb feature sequence control the virtual person's facial expressions and body movements. Compared with re-modeling the virtual person, this greatly reduces the amount of calculation and data transmission and improves calculation efficiency, so the real-time performance of driving the virtual person is greatly improved and the virtual person can be driven in real time.
  • In addition, the duration feature is used, and the duration feature can improve the synchronization between the acoustic feature sequence and the facial and limb feature sequences. Therefore, based on this improved synchronization, when the acoustic feature sequence together with the facial feature sequence and the limb feature sequence is used to drive the virtual person, the virtual person's voice output matches its facial expressions and body movements with higher accuracy.
  • an embodiment of the present disclosure provides a step flow chart of a method for driving a virtual person in real time, which may specifically include the following steps:
  • S401. Obtain to-be-processed data for driving the virtual person, the to-be-processed data including at least one of text data and voice data;
  • S402. Use an end-to-end model to process the to-be-processed data, and determine the fusion feature sequence corresponding to the to-be-processed data, where the fusion feature sequence is obtained by fusing the acoustic feature sequence, facial feature sequence, and limb feature sequence corresponding to the to-be-processed data;
  • S403. Input the fusion feature sequence into a trained muscle model, and drive the virtual person through the muscle model;
  • step S402 includes:
  • Step S4021 obtain the text feature and duration feature of the to-be-processed data
  • Step S4022 Determine the acoustic feature sequence, the facial feature sequence, and the limb feature sequence according to the text feature and the duration feature;
  • Step S4023 Obtain the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence and the limb feature sequence.
  • In step S401, for a client, the to-be-processed data uploaded by the user can be received; for a server, the to-be-processed data sent by the client can be received.
  • any first device can receive the text to be processed from the second device, and an embodiment of the present disclosure does not limit the specific transmission mode of the data to be processed.
  • If the to-be-processed data is text data, step S402 processes it directly; if the to-be-processed data is voice data, the voice data is first converted into text data, and step S402 then processes the converted text data.
  • Before step S402 is performed, the end-to-end model needs to be trained first, so that the trained end-to-end model outputs the fusion feature sequence.
  • the end-to-end model that outputs the fusion feature sequence can be used as the third end-to-end model.
  • When the third end-to-end model is trained, its training samples may be real-person video data and real-person action data; for each training sample in the training sample set, the training steps specifically include: first, step D1 is performed to obtain the facial features, body features, and text features of the training sample, where the text features can be at the phoneme level.
  • Specifically, the feature data of the training sample can be mapped through the embedding layer of the end-to-end model to obtain facial features, body features, and text features; then step D2 is performed, and the feedforward transformer (Feed Forward Transformer) processes the facial features, body features, and text features to obtain facial feature vectors, body feature vectors, and text encoding features.
  • The facial feature vectors are used to express facial expressions, the body feature vectors can be muscle action vectors, and the text encoding features are also at the phoneme level.
  • Next, step D3 is performed to align the facial feature vectors and body feature vectors with the text encoding features; step D4 is then performed to obtain the voice feature sequence, facial feature sequence, and body feature sequence; and step D5 is performed to fuse the voice feature sequence, facial feature sequence, and body feature sequence to obtain the fusion feature sequence.
  • the length adjuster can be used to align the voice, facial expressions, and actions by extending or shortening the phoneme duration, so as to obtain the fusion feature sequence.
  • text features may include: phoneme features, and/or semantic features, and so on.
  • the phoneme is the smallest phonetic unit divided according to the natural attributes of the speech, and is analyzed according to the pronunciation actions in the syllable, and one action constitutes one phoneme.
  • Phonemes can include: vowels and consonants.
  • a specific phoneme feature corresponds to a specific lip feature, expression feature, or body feature.
  • semantics is the meaning of the concepts represented by things in the real world corresponding to the text to be processed, and the relationship between these meanings is the interpretation and logical representation of the text to be processed in a certain field.
  • specific semantic features correspond to specific body features and the like.
  • When the training sample set includes real-person action data or real-person video data, the training process refers to the training process of the end-to-end model that outputs the acoustic feature sequence; for the sake of brevity of the specification, it will not be repeated here.
  • In this way, the embedding layer of the third end-to-end model can be used to acquire the text features of the data to be processed, the duration features of the data to be processed are then obtained, and the text features and duration features are input into the third end-to-end model to obtain the acoustic feature sequence, facial feature sequence, and limb feature sequence; according to the duration feature, the acoustic feature sequence, facial feature sequence, and limb feature sequence are then fused to obtain the fusion feature sequence.
  • In step S4021, if the end-to-end model is a fastspeech model, the text feature can be acquired through the third fastspeech model, and the duration feature can be acquired through a duration model, where the duration model is a deep learning model.
  • In step S4022, if the trained fastspeech model is the third fastspeech model, the text feature and the duration feature may be input into the third fastspeech model to determine the acoustic feature sequence, the facial feature sequence, and the limb feature sequence; and in step S4023, the fusion feature sequence is obtained according to the acoustic feature sequence, the facial feature sequence, and the limb feature sequence.
  • the acoustic feature sequence, the facial feature sequence, and the limb feature sequence may be aligned according to the duration feature to obtain the fusion feature sequence.
  • If the end-to-end model is a fastspeech model, the third fastspeech model is used to obtain the text features of the data to be processed; the duration model is then used to obtain the duration features, where the duration model can be a deep learning model such as a CNN or DNN.
  • Specifically, the text features of the data to be processed can be obtained through the embedding layer of the third fastspeech model; the text features are then encoded by the feedforward transformer to obtain the text encoding feature; the text encoding feature is processed through the duration model to obtain the duration feature, and the duration feature is used to align the text encoding feature to obtain the aligned text encoding feature; after the aligned text encoding feature is decoded, voice prediction, face prediction, and body prediction are performed to obtain the voice feature sequence, facial feature sequence, and body feature sequence; then, according to the duration feature, the voice feature sequence, facial feature sequence, and body feature sequence are aligned, and the aligned voice feature sequence, facial feature sequence, and limb feature sequence are used as the fusion feature sequence.
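  • A minimal sketch of such a shared-backbone model: one encoder and one duration-based alignment feed three prediction heads (acoustic, facial, limb), whose frame-synchronized outputs are concatenated as the fusion feature sequence. Every module choice and dimension below is an assumption made for illustration, not a detail from this disclosure.

```python
import torch
import torch.nn as nn

class ThirdFastSpeechSketch(nn.Module):
    """Illustrative third model: shared encoding, three heads, fused output."""
    def __init__(self, vocab=100, dim=256, n_mels=80, n_face=52, n_body=24):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.embedding = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.duration = nn.Linear(dim, 1)
        self.mel_head = nn.Linear(dim, n_mels)    # acoustic feature sequence
        self.face_head = nn.Linear(dim, n_face)   # facial feature sequence
        self.body_head = nn.Linear(dim, n_body)   # limb feature sequence

    def forward(self, phoneme_ids):
        enc = self.encoder(self.embedding(phoneme_ids))             # (1, P, dim)
        dur = torch.clamp(self.duration(enc).squeeze(-1).round().long(), min=1)
        frames = torch.repeat_interleave(enc[0], dur[0], dim=0)     # frame level
        # All heads share the same duration-aligned timeline, so the three
        # sequences are already synchronized and can be concatenated directly.
        return torch.cat([self.mel_head(frames),
                          self.face_head(frames),
                          self.body_head(frames)], dim=-1)          # fusion sequence

fusion = ThirdFastSpeechSketch()(torch.randint(0, 100, (1, 10)))
print(fusion.shape)  # (total_frames, 80 + 52 + 24)
```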
  • step S403 is executed. After the fusion feature sequence is obtained, the fusion feature sequence is input into the trained muscle model, and the virtual person is driven by the muscle model.
  • the fusion feature sequence is input into a trained muscle model, and the virtual person is driven by the muscle model.
  • the training process of the muscle model specifically refers to the description of steps C1-C3.
  • The fusion feature sequence is directly used to drive the muscle model; when the bound muscles are driven by the fusion feature sequence to perform the corresponding movements, the virtual person's facial expressions and actions change accordingly with the movement of the bound muscles.
  • For example, if the facial expression is smiling and the body movement is waving, the virtual person will show a smile and wave its hands, so that the virtual person's voice matches its face and movement.
  • As another example, if the acoustic feature sequence says "someone is injured", the facial expression is sad, and the body action is folded hands, then after the fusion feature sequence is aligned according to the duration feature, the virtual person's face is sad and its hands are put together while it says "someone is injured", so that the virtual person's voice matches its face and movement.
  • In the embodiments of the present disclosure, an end-to-end model is used to process the data to be processed to obtain a fusion feature sequence in which the acoustic feature sequence, facial feature sequence, and limb feature sequence are fused; the fusion feature sequence is then input into the trained muscle model, and the virtual person is driven by the muscle model. Because the end-to-end model takes the raw data of the data to be processed as input and directly outputs the fusion of the acoustic feature sequence, the facial feature sequence, and the limb feature sequence, it can better utilize and adapt to the parallel computing power of new hardware (such as GPUs) and computes faster; that is, the fusion feature sequence can be obtained in a shorter time. The fusion feature sequence is then input into the muscle model to directly drive the virtual person: the fusion feature sequence directly controls the virtual person's voice output and at the same time controls its facial expressions and body movements.
  • In addition, the duration feature is used to fuse the acoustic feature sequence, the facial feature sequence, and the limb feature sequence, and the duration feature can improve the synchronization between the acoustic feature sequence and the facial and limb feature sequences, so that the virtual person's voice output matches its facial expressions and body movements with higher accuracy.
  • an embodiment of the present disclosure provides a structural block diagram of an embodiment of an apparatus for driving a virtual person in real time, which may specifically include:
  • the data acquisition module 501 is configured to acquire data to be processed for driving the virtual person, where the data to be processed includes at least one of text data and voice data;
  • the data processing module 502 is configured to process the to-be-processed data using an end-to-end model, and determine the acoustic feature sequence, facial feature sequence, and body feature sequence corresponding to the to-be-processed data;
  • the virtual person driving module 503 is configured to input the acoustic feature sequence, the facial feature sequence, and the limb feature sequence into the trained muscle model, and drive the virtual person through the muscle model;
  • the data processing module 502 is specifically configured to obtain the text feature and duration feature of the data to be processed; determine the acoustic feature sequence according to the text feature and the duration feature; according to the text feature and the The duration feature determines the facial feature sequence and the limb feature sequence.
  • the data processing module 502 is configured to acquire the text feature through a fastspeech model; acquire the duration feature through a duration model, where the duration model is a deep learning model.
  • If the fastspeech model trained to output the acoustic feature sequence is the first fastspeech model and the fastspeech model trained to output the facial feature sequence and the limb feature sequence is the second fastspeech model, the data processing module 502 is configured to input the text feature and the duration feature into the first fastspeech model to obtain the acoustic feature sequence, and to input the text feature and the duration feature into the second fastspeech model to obtain the facial feature sequence and the limb feature sequence.
  • the virtual person driving module 503 is configured to fuse the acoustic feature sequence, the facial feature sequence, and the limb feature sequence to obtain a fusion feature sequence, and to input the fusion feature sequence into the muscle model.
  • the virtual person driving module 503 is configured to fuse the acoustic feature sequence, the facial feature sequence, and the limb feature sequence based on the duration feature to obtain the fusion feature sequence.
  • the facial features corresponding to the facial feature sequence include expression features and lip features.
  • the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
  • an embodiment of the present disclosure provides a structural block diagram of an embodiment of an apparatus for driving a virtual person in real time, which may specifically include:
  • the data acquisition module 601 is configured to acquire data to be processed for driving the virtual person, where the data to be processed includes at least one of text data and voice data;
  • the data processing module 602 is configured to process the to-be-processed data using an end-to-end model and determine the fusion feature sequence corresponding to the to-be-processed data, where the fusion feature sequence is obtained by fusing the acoustic feature sequence, facial feature sequence, and limb feature sequence corresponding to the to-be-processed data;
  • the virtual person driving module 603 is configured to input the fusion sequence into the trained muscle model, and drive the virtual person through the muscle model;
  • the data processing module 602 is specifically configured to obtain the text feature and duration feature of the data to be processed; determine the acoustic feature sequence, the facial feature sequence, and the limb feature sequence according to the text feature and the duration feature; and obtain the fusion feature sequence according to the acoustic feature sequence, the facial feature sequence, and the limb feature sequence.
  • the data processing module 602 is configured to acquire the text feature through a fastspeech model; acquire the duration feature through a duration model, where the duration model is a deep learning model.
  • the data processing module 602 is configured to align the acoustic feature sequence, the facial feature sequence and the limb feature sequence according to the duration feature to obtain the fusion feature sequence.
  • the facial features corresponding to the facial feature sequence include expression features and lip features.
  • the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
  • Fig. 7 is a structural block diagram when a device for driving a virtual person in real time is provided as a device according to an embodiment of the present disclosure.
  • the device 900 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
  • the device 900 may include one or more of the following components: a processing component 902, a memory 904, a power supply component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, And the communication component 916.
  • the processing component 902 generally controls the overall operations of the device 900, such as operations associated with display, incoming call, data communication, camera operations, and recording operations.
  • the processing element 902 may include one or more processors 920 to execute instructions to complete all or part of the steps of the foregoing method.
  • the processing component 902 may include one or more modules to facilitate the interaction between the processing component 902 and other components.
  • the processing component 902 may include a multimedia module to facilitate the interaction between the multimedia component 908 and the processing component 902.
  • the memory 904 is configured to store various types of data to support the operation of the device 900. Examples of such data include instructions for any application or method operating on the device 900, contact data, phone book data, messages, pictures, videos, and so on.
  • the memory 904 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • the power supply component 906 provides power to various components of the device 900.
  • the power supply component 906 may include a power management system, one or more power supplies, and other components associated with the generation, management, and distribution of power for the device 900.
  • the multimedia component 908 includes a screen that provides an output interface between the device 900 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure related to the touch or swipe operation.
  • the multimedia component 908 includes a front camera and/or a rear camera. When the device 900 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 910 is configured to output and/or input audio signals.
  • the audio component 910 includes a microphone (MIC).
  • the microphone is configured to receive an external audio signal.
  • the received audio signal may be further stored in the memory 904 or transmitted via the communication component 916.
  • the audio component 910 further includes a speaker for outputting audio signals.
  • the I/O interface 912 provides an interface between the processing component 902 and a peripheral interface module.
  • the above-mentioned peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.
  • the sensor component 914 includes one or more sensors for providing status assessments of various aspects of the device 900.
  • the sensor component 914 can detect the on/off state of the device 900 and the relative positioning of components, for example, the display and the keypad of the device 900; the sensor component 914 can also detect a position change of the device 900 or of a component of the device 900, the presence or absence of contact between the user and the device 900, the orientation or acceleration/deceleration of the device 900, and the temperature change of the device 900.
  • the sensor component 914 may include a proximity sensor configured to detect the presence of nearby objects when there is no physical contact.
  • the sensor component 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 916 is configured to facilitate wired or wireless communication between the device 900 and other devices.
  • the device 900 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 916 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel.
  • the communication component 916 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the apparatus 900 may be implemented by one or more application specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
  • in some embodiments, a non-transitory computer-readable storage medium including instructions is also provided, for example, the memory 904 including instructions, which can be executed by the processor 920 of the device 900 to perform the foregoing methods.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • Fig. 8 is a structural block diagram of a server provided by an embodiment of the present disclosure.
  • the server 1900 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 1922 (for example, one or more processors), a memory 1932, and one or more storage media 1930 (for example, one or more mass storage devices) storing an application program 1942 or data 1944.
  • the memory 1932 and the storage medium 1930 may be short-term storage or permanent storage.
  • the program stored in the storage medium 1930 may include one or more modules (not shown in the figure), and each module may include a series of command operations on the server.
  • the central processing unit 1922 may be configured to communicate with the storage medium 1930, and execute a series of instruction operations in the storage medium 1930 on the server 1900.
  • the server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
  • a non-transitory computer-readable storage medium is provided; when instructions in the storage medium are executed by a processor of a device (terminal device or server), the device can execute a method for driving a virtual person in real time.
  • the method includes: determining the duration feature corresponding to the to-be-processed text, where the to-be-processed text involves at least two languages; determining the target speech sequence corresponding to the to-be-processed text based on the duration feature; determining the target image sequence corresponding to the to-be-processed text based on the duration feature, where the target image sequence is obtained based on text samples and their corresponding image samples, and the languages corresponding to the text samples include all languages involved in the to-be-processed text; and fusing the target speech sequence and the target image sequence to obtain the corresponding target video (see the sketch following this block).
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device; the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
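As a minimal illustration of the last step of the method above, fusing a target speech sequence with a target image sequence that were generated from the same duration features, the following sketch pairs each image frame with its slice of audio. The sample rate, frame rate, array shapes and the suggestion of an external muxer are assumptions, not details taken from the patent.

```python
import numpy as np

def fuse_to_video(speech: np.ndarray, images: np.ndarray,
                  sample_rate: int = 16000, fps: int = 25):
    """speech: 1-D waveform; images: (n_frames, H, W, 3) image sequence."""
    samples_per_frame = sample_rate // fps
    n_frames = min(len(images), len(speech) // samples_per_frame)
    # Pair each image frame with its slice of audio; a muxer (e.g. ffmpeg)
    # would write these pairs out as an actual video file.
    return [(images[i], speech[i * samples_per_frame:(i + 1) * samples_per_frame])
            for i in range(n_frames)]

if __name__ == "__main__":
    demo = fuse_to_video(np.zeros(16000), np.zeros((25, 4, 4, 3)))
    print(len(demo), "audio-visual frames")
```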

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A method for driving a virtual person in real time: acquiring to-be-processed data for driving a virtual person, the to-be-processed data including at least one of text data and speech data (S201); processing the to-be-processed data using an end-to-end model, and determining an acoustic feature sequence, a facial feature sequence and a limb feature sequence corresponding to the to-be-processed data (S202); and inputting the acoustic feature sequence, the facial feature sequence and the limb feature sequence into a trained muscle model, and driving the virtual person through the muscle model (S203). In this way, computational efficiency is improved, so that the real-time performance of driving the virtual person is greatly improved.

Description

实时驱动虚拟人的方法、装置、电子设备及介质
相关申请的交叉引用
本申请要求于2020年5月18日提交、申请号为202010420720.9且名称为“实时驱动虚拟人的方法、装置、电子设备及介质”的中国专利申请的优先权,其全部内容通过引用合并于此。
技术领域
本公开内容涉及虚拟人处理技术领域,尤其涉及一种实时驱动虚拟人的方法、装置、电子设备及介质。
背景技术
本公开内容数字人类(Digital Human)简称数字人,是利用计算机模拟真实人类的一种综合性的渲染技术,也被称为虚拟人类、超写实人类、照片级人类。由于人对真人太熟悉了,通过花费大量时间可以获取使得3D静态模型很真,但在驱动3D静态模型进行动作时,即使是一个细微的表情都会重新建模,由于模型的真实度非常高会导致建模会需要进行大量的数据进行计算,其计算过程较长,通常模型的一个动作可能需要一个小时或几个小时的计算才能实现,导致驱动的实时性能非常差。
发明内容
本公开内容的目的至少部分在于,提供一种实时驱动虚拟人的方法、装置、电子设备及介质,能够在实时驱动虚拟人。
在本公开内容的第一方面提供了一种实时驱动虚拟人的方法,包括:获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的声学特征序列、面部特征序列和肢体特征序列;将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;其中,所述使用端到端模型对所述待处理数据进行处理,包括:获取所述待处理数据的文本特征和时长特征;根据所述文本特征和所述时长特征,确定出所述声学特征序列;根据所述文本特征和所述时长特征,确定出所述面部特征序列和所述肢体特征序列。
在一些实施例中,所述获取所述待处理数据的文本特征和时长特征,包括:通过fastspeech模型获取所述文本特征;通过时长模型获取所述时长特征,其中,所述时长模型为深度学习模型。
在一些实施例中,若训练输出声学特征序列的fastspeech模型为第一fastspeech模型,以及训练输出面部特征序列和肢体特征序列的的fastspeech模型为第二fastspeech模型,所述根据所述文本特征和所述时长特征,确定出所述声学特征序列,包括:将所述文本特征和所述时长特征输入到第一fastspeech模型中,得到所述声学特征序列;
在一些实施例中,所述根据所述文本特征和所述时长特征,确定出所述声学特征序列,包括:将所述文本特征和所述时长特征输入到第二fastspeech模型中,得到所述面部特征序列和所述肢体特征序列。
在一些实施例中,所述将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,包括:将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到融合特征序列;将所述融合特征序列输入到所述肌肉模型中。
在一些实施例中,所述将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到融合特征序列,包括:基于所述时长特征,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到所述融合特征序列。
在一些实施例中,所述面部特征序列对应的面部特征包括表情特征和唇部特征。
在本公开内容的第二方面提供了一种实时驱动虚拟人的方法,包括:获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的融合特征数据,其中,所述融合特征序列是由所述待处理数据对应的声学特征序列,面部特征序列和肢体特征序列融合得到的;将所述融合序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;其中,所述使用端到端模型对所述待处理数据进行处理,包括:获取所述待处理数据的文本特征和时长特征;根据所述文本特征和所述时长特征,确定出所述声学特征序列,所述面部特征序列和所述肢体特征序列;根据所述声学特征序列,所述面部特征序列和所述肢体特征序列,得到所述融合特征序列。
在一些实施例中,所述获取所述待处理数据的文本特征和时长特征,包括:通过fastspeech模型获取所述文本特征;通过时长模型获取所述时长特征,其中,所述时长模型为深度学习模型。
在一些实施例中,所述根据所述声学特征序列,所述面部特征序列和所述肢体特征序列,得到所述融合特征序列,包括:基于所述时长特征,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到所述融合特征序列。
在本公开内容的第三方面提供了一种实时驱动虚拟人的装置,包括:数据获取模块,用于获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;数据处理模块,用于使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的声学特征序列、面部特征序列和肢体特征序列;虚拟人驱动模块,用于将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;其中,所述数据处理模块,具体用于获取所述待处理数据的文本特征和时长特征;根据所述文本特征和所述时长特征,确定出所述声学特征序列;根据所述文本特征和所述时长特征,确定出所述面部特征序列和所述肢体特征序列。
在本公开内容的第四方面提供了一种实时驱动虚拟人的装置,包括:数据获取模块,用于获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;数据处理模块,用于使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的融合特征数据,其中,所述融合特征序列是由所述待处理数据对应的声学特征序列,面部特征序列和肢体特征序列融合得到的;虚拟人驱动模块,用于将所述融合序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;其中,所述数据处理模块,具体用于获取所述待处理数据的文本特征和时长特征;根据所述文本特征和所述时长特征,确定出所述声学特征序列,所述面部特征序列和所述肢体特征序列;根据所述声学特征序列,所述面部特征序列和所述肢体特征序列,得到所述融合特征序列。
在本公开内容的第五方面提供了一种用于实时驱动虚拟人的装置,包括有存储器,以及一个或者一个以上的程序,其中一个或者一个以上程序存储于存储器中,且经配置以由一个或者一个以上处理器执行所述一个或者一个以上程序包含用于上述实时驱动虚拟人的方法步骤。
在本公开内容的第六方面提供了一种机器可读介质,其上存储有指令,当由一个或多个处理器执行时,使得装置执行上述实时驱动虚拟人的方法步骤的步骤。
基于上述技术方案,在获取待处理数据之后,使用端到端模型对待处理数据进行处理,得到声学特征序列、面部特征序列和肢体特征序列;再将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚 拟人;由于端到端模型输入的是待处理数据的原始数据,而直接输出声学特征序列、面部特征序列和肢体特征序列,其能够更好的利用和适应新的硬件(比如GPU)并行计算能力,运算速度更快;即,能够在更短时间内获取声学特征序列、面部特征序列和肢体特征序列;再将声学特征序列、面部特征序列和肢体特征序列输入到肌肉模型中,直接驱动虚拟人,是在创建虚拟人之后,直接通过声学特征序列来控制虚拟人进行语音输出,并同时通过面部特征序列和肢体特征序列控制虚拟人的面部表情和肢体动作,与需要重新对虚拟人建模相比,极大的降低了其计算量和数据传输量,且还提高了计算效率,使得驱动虚拟人的实时性得到极大的提高,从而能够实现实时驱动虚拟人。
附图说明
图1示出了依据本公开内容的一个或多个实施例的端到端模型进行训练的训练流程图;
图2示出了依据本公开内容的一个或多个实施例的实时驱动虚拟人的方法的第一种流程图;
图3示出了依据本公开内容的一个或多个实施例的第一fastspeech模型输出声学特征序列的步骤流程图;
图4示出了依据本公开内容的一个或多个实施例的实时驱动虚拟人的方法的第二种流程图;
图5示出了依据本公开内容的一个或多个实施例的实时驱动虚拟人的装置的第一种结构示意图;
图6示出了依据本公开内容的一个或多个实施例的实时驱动虚拟人的装置的第二种结构示意图;
图7示出了依据本公开内容的一个或多个实施例的用于实时驱动虚拟人的装置作为设备时的结构框图;
图8示出了依据本公开内容的一个或多个实施例的服务端的结构框图。
具体实施方式
为了更好的理解上述技术方案,下面通过附图以及具体实施例对本公开内容的技术方案做详细的说明,应当理解本公开内容的以及实施例中的具体特征是对本公开内容的 技术方案的详细的说明,而不是对本说明书技术方案的限定,在不冲突的情况下,本公开内容的以及实施例中的技术特征可以相互组合。
针对虚拟人在驱动时需要耗费大量时间的技术问题,依据本公开内容的一个或多个实施例提供了一种实时驱动虚拟人的方案,该方案用于实时驱动虚拟人,具体可以包括:获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的声学特征序列、面部特征序列和肢体特征序列;将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;其中,所述使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的声学特征序列、面部特征序列和肢体特征序列,包括:获取所述待处理数据的文本特征和时长特征;根据所述文本特征和所述时长特征,确定出所述声学特征序列;根据所述文本特征和所述时长特征,确定出所述面部特征序列和所述肢体特征序列。
依据本公开内容的实施例的虚拟人具体可以是高仿真虚拟人,与真人的差异较小;虚拟人可以应用于新闻播报场景、教学场景、医疗场景、客服场景、法律场景和会议场景等内容表达场景。
依据本公开内容的实施例的待处理数据可以是文本数据,也可以是语音数据,也可以是文本数据和语义数据同时存在,本说明书不作具体限制。
例如,在新闻播报场景,需要获取驱动虚拟人的待播报的新闻稿,此时,新闻稿为待处理数据,且新闻稿可以是由人工或机器编辑的文本,以及在人工或机器编辑文本之后,获取编辑的文本作为新闻稿。
依据本公开内容的一些实施例,在使用端到端模型对所述待处理数据进行处理之前,还需通过样本对端到端模型进行训练,得到已训练的端到端模型;在得到已训练的端到端模型之后,再使用已训练的端到端模型对所述待处理数据进行处理。
在本公开内容的一些实施例中端到端模型包括两种训练方法,其中一种训练方法训练出的端到端模型输出的声学特征序列,另一种训练方法训练出的端到端模型输出的面部特征序列和肢体特征序列;以及端到端模型具体可以为fastspeech模型。
本公开内容的一些实施例中,在对输出声学特征序列的端到端模型进行训练时,其训练样本可以是文本和语音数据,还可以视频数据;针对训练样本集中每个训练样本,其训练步骤具体如图1所示,首先执行步骤A1,获取训练样本的声学特征101和文本特征 102,其中,文本特征101可以为音素级别。
本公开内容的一种实施例中,可以将训练样本的特征数据映射到端到端模型中的嵌入(embedding)层中,得到声学特征101和文本特征102;然后执行步骤A2,通过前馈变压器103(Feed Forward Transformer)处理声学特征101和文本特征102,得到声学向量104和文本编码特征105,其中,声学向量104可以是句子的声学向量,也可以词的声学向量,文本编码特征105同样是音素级别;接下执行步骤A3,将声学向量104和文本编码特征105进行对齐,得到对齐后的文本编码特征106,可以使用持续时间预测器将声学向量104和文本编码特征105进行对齐,其中,文本编码特征105具体为音素特征,声学向量104可以是梅尔频谱图,如此,可以使用持续时间预测器将因素特征和梅尔频谱图进行对齐;接下来执行步骤A4,对对齐后的文本编码特征106进行解码107,获取声学特征序列108,此时,可以使用长度调节器通过延长或缩短音素持续时间来轻松确定语音速度,从而确定生成的梅尔频谱图的长度,还可以通过在相邻音素之间添加间隔来控制部分韵律;根据确定出的梅尔频谱图的长度和音素间隔时间,获取到声学特征序列。
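Steps A3-A4 above end with a length regulator that expands phoneme-level encodings to frame level according to the predicted durations, which also fixes the length of the generated mel-spectrogram. The following is a minimal sketch of such a length regulator; the array shapes and the speed-control factor are assumptions for illustration, not the patent's actual implementation.

```python
import numpy as np

def length_regulate(phoneme_enc: np.ndarray, durations: np.ndarray,
                    speed: float = 1.0) -> np.ndarray:
    """phoneme_enc: (n_phonemes, d) encodings; durations: (n_phonemes,) in frames."""
    scaled = np.maximum(1, np.round(durations / speed).astype(int))
    # Repeat each phoneme encoding for its duration -> (total_frames, d),
    # which a decoder then maps to mel-spectrogram frames.
    return np.repeat(phoneme_enc, scaled, axis=0)

if __name__ == "__main__":
    enc = np.random.randn(3, 8)          # 3 phonemes, 8-dim encodings
    frames = length_regulate(enc, np.array([4, 2, 6]))
    print(frames.shape)                  # (12, 8)
```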
在对输出声学特征序列的端到端模型进行训练时,其训练样本集例如可以包含13,100个语音频剪辑和相应的文本记录,音频总长度约为24小时。此时,将训练样本集随机分为3组:用于训练的12500个样本,用于验证的300个样本和用于测试的300个样本。为了减轻发音错误的问题,使用音素转换工具将文本序列转换为音素序列;对于语音数据,将原始波形转换为梅尔频谱图;然后使用12500个样本对端到端模型进行训练,在训练完成之后,使用300个验证样本对训练得到的端到端模型进行验证;在验证符合验证要求之后,使用300个测试样本对端到端模型进行测试,若测试符合测试条件,则得到已训练的端到端模型。
若对端到端模型进行验证未符合验证要求,则使用训练样本再次对端到端模型训练,直至训练后的端到端模型符合验证要求;并对验证符合要求的端到端模型进行测试,直至训练后的端到端模型既符合验证要求也符合测试条件,则将训练后的端到端模型作为最终的模型,即为已训练的端到端模型。
以及,在对输出面部特征序列和肢体特征序列的端到端模型进行训练时,其训练样本可以是真人视频数据和真人动作数据;针对训练样本集中每个训练样本,其训练步骤具体包括,首先执行步骤B1,获取训练样本的面部特征、肢体特征和文本特征,其中,文本特征可以为音素级别。
本公开内容的一种实施例中,可以将训练样本的特征数据映射到端到端模型中的嵌入(embedding)层中,得到面部特征、肢体特征和文本特征;然后执行步骤B2,通过前馈变压器(Feed Forward Transformer)处理面部特征、肢体特征和文本特征,得到面部特征向量、肢体特征向量和文本编码特征,其中,面部特征向量是用于进行面部表情的特征表示,肢体特征向量可以是肌肉动作向量,文本编码特征同样是音素级别;接下执行步骤B3,将面部特征向量和肢体特征向量,与文本编码特征进行对齐,可以使用持续时间预测器将面部特征向量和肢体特征向量,与文本编码特征进行对齐,其中,文本编码特征具体为音素特征;接下来执行步骤B4,获取面部特征序列和肢体特征序列,此时,可以使用长度调节器通过延长或缩短音素持续时间来对齐面部表情和动作,从而得到面部特征序列和肢体特征序列。
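Steps B1-B4 above map the duration-aligned text encodings to a facial feature sequence and a limb (muscle-action) feature sequence through separate prediction branches. A minimal sketch of such a two-headed projection follows; the encoding dimension, the 52/24 parameter counts and the linear heads are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_face = rng.standard_normal((8, 52)) * 0.1   # 52 facial parameters (assumed)
W_limb = rng.standard_normal((8, 24)) * 0.1   # 24 limb parameters (assumed)

def predict_face_and_limb(aligned_enc: np.ndarray):
    """aligned_enc: (n_frames, 8) duration-aligned text encodings."""
    face_seq = aligned_enc @ W_face    # facial feature sequence
    limb_seq = aligned_enc @ W_limb    # limb feature sequence
    return face_seq, limb_seq

if __name__ == "__main__":
    face, limb = predict_face_and_limb(np.zeros((12, 8)))
    print(face.shape, limb.shape)      # (12, 52) (12, 24)
```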
本公开内容的一种实施例中文本特征可以包括:音素特征、和/或、语义特征等。在一些实施例中,音素是根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个动作构成一个音素。音素可以包括:元音与辅音。在一些实施例中,特定的音素特征对应特定的唇部特征、表情特征或者肢体特征等。
以及,语义是待处理文本所对应的现实世界中的事物所代表的概念的含义,以及这些含义之间的关系,是待处理文本在某个领域上的解释和逻辑表示。在一些实施例中,特定的语义特征对应特定的肢体特征等。
在对输出面部特征序列和肢体特征序列的端到端模型进行训练时,其训练样本集包括的真人动作数据或者真人视频数据,其训练过程参考对输出声学特征序列的端到端模型进行训练的训练过程,为了说明书的简洁,在此就不再赘述了。
以及,在训练得到输出声学特征序列的端到端模型,且得到输出面部特征序列和肢体特征序列的端到端模型之后,将得到的输出声学特征序列的端到端模型作为第一端到端模型,以及将得到的输出面部特征序列和肢体特征序列的端到端模型作为第二端到端模型。
如此,在得到待处理数据之后,可以利用第一端到端模型的嵌入层获取所述待处理数据的文本特征,再获取所述待处理数据的时长特征,将所述文本特征和所述时长特征输入到第一端到端模型中,得到所述声学特征序列;本公开内容的一种实施例中,可以利用第二端端到端模型的嵌入层获取所述待处理数据的文本特征,再获取所述待处理数据的时长特征,将所述文本特征和所述时长特征输入到第二端到端模型中,得到面部特征序列 和肢体特征序列;当然,也可以直接利用前面获取的文本特征和时长特征直接输入到第二端到端模型中,得到面部特征序列和所述肢体特征序列。本公开内容的一种实施例中,第一端到端模型和第二端到端模型可以同时处理数据,也可以是第一端到端模型先处理数据,还可以是第二端到端模型先处理数据,本说明书不作具体限制。
本公开内容的一种实施例中,时长特征可用于表征文本所对应音素的时长。时长特征能够刻画出语音中的抑扬顿挫与轻重缓急,进而可以提高合成语音的表现力和自然度。在一些实施例中,可以利用时长模型,确定待处理数据对应的时长特征。时长模型的输入可以为:带有重音标注的音素特征,输出为音素时长。时长模型可以为对带有时长信息的语音样本进行学习得到,例如,可以是卷积神经网络(Convolutional Neural Networks,以下简称CNN)和深度神经网络(Deep Neural Networks,以下简称DNN)等深度学习模型,本公开内容的一种实施例中对于具体的时长模型不加以限制。
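The paragraph above describes a duration model whose input is stress-annotated phoneme features and whose output is per-phoneme durations, typically a CNN or DNN. The toy network below only illustrates that input/output shape; its weights are untrained and all dimensions are assumptions, not the patent's actual model.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((10, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.standard_normal((16, 1)) * 0.1, np.zeros(1)

def predict_durations(phoneme_feats: np.ndarray) -> np.ndarray:
    """phoneme_feats: (n_phonemes, 10), e.g. a phoneme encoding plus a stress flag."""
    h = np.maximum(0.0, phoneme_feats @ W1 + b1)        # ReLU hidden layer
    log_dur = h @ W2 + b2                               # predict log-duration
    return np.maximum(1, np.round(np.exp(log_dur))).astype(int).ravel()

if __name__ == "__main__":
    print(predict_durations(np.zeros((5, 10))))          # five phoneme durations in frames
```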
以及,在获取到所述声学特征序列,所述面部特征序列和所述肢体特征序列之后,将得到所述声学特征序列,所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人。
本公开内容的一种实施例中,面部特征包括表情特征和唇部特征,其中,表情,表达感情、情意,可以指表现在面部的思想感情。表情特征通常是针对整个面部的。唇部特征可以专门针对唇部,而且跟文本的文本内容、语音、发音方式等都有关系,从而可以通过面部特征能够促使面部表情更逼真且更细腻。
肢体特征可以通过头、眼、颈、手、肘、臂、身、胯、足等人体部位的协调活动来传达人物的思想,形象地借以表情达意。肢体特征可以包括:转头、耸肩、手势等,可以提高图像序列所对应表达的丰富度。例如,说话时至少一个手臂自然下垂,不说话时至少一个手臂自然放在腹部等。
本公开内容的一种实施例中,在使用已训练的肌肉模型之前,还需进行模型训练,得到已训练的肌肉模型;在得到已训练的肌肉模型之后,再使用已训练的肌肉模型对所述声学特征序列,所述面部特征序列和所述肢体特征序列进行处理。
本公开内容的一种实施例中已训练的肌肉模型在进行模型训练时,首先根据人脸的面部肌肉和肢体肌肉来创建肌肉模型,在获取其训练样本,其训练样本可以是真人视频数据和真人动作数据;针对训练样本集中每个训练样本,其训练步骤包括:
首先执行步骤C1,获取每个训练样本的面部肌肉特征和肢体肌肉特征;然后执行 步骤C2,使用每个训练样本的面部肌肉特征和肢体肌肉特征对肌肉模型进行训练;以及,在训练完成之后,执行步骤C3,使用验证样本对训练得到的肌肉模型进行验证;在验证符合验证要求之后,再使用测试样本对训练得到的肌肉模型进行测试,若测试符合测试条件,则得到已训练的肌肉模型。
若对训练得到的肌肉模型进行验证未符合验证要求,则使用训练样本再次对肌肉模型训练,直至训练后的肌肉模型符合验证要求;并对验证符合要求的肌肉模型进行测试,直至训练后的肌肉模型既符合验证要求也符合测试条件,则将训练后的肌肉模型作为最终的模型,即为已训练的肌肉模型。
以及,在创建肌肉模型时,以面部肌肉为例,使用多边形网络进行近似抽象的肌肉控制,可以使用两类肌肉,一种线性肌肉,用于拉伸;一种括约肌,用于挤压;两种肌肉只在一点与网格空间相联系,有方向指定(两种肌肉变形时都是计算某一点的角位移和径向位移),因此肌肉的控制独立于具体的面部拓扑,使得面部表情能够更逼真且更细腻;本公开内容的一种实施例中,肢体肌肉也使用多边形网络进行近似抽象的肌肉控制,从而能够确保肢体动作更准确。
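The paragraph above distinguishes linear muscles (used for stretching) from sphincter muscles (used for squeezing), each linked to the mesh at a single point and deforming it through angular and radial displacement. The geometric sketch below shows one plausible form of the two deformations; the falloff function and constants are assumptions, not the patent's muscle formulation.

```python
import numpy as np

def linear_muscle(verts: np.ndarray, attach: np.ndarray, contraction: float,
                  radius: float = 1.0) -> np.ndarray:
    """Pull vertices toward the attachment point, with a distance falloff."""
    d = attach - verts                                    # pull direction per vertex
    dist = np.linalg.norm(d, axis=1, keepdims=True) + 1e-8
    falloff = np.clip(1.0 - dist / radius, 0.0, 1.0)      # nearby vertices move more
    return verts + contraction * falloff * d / dist

def sphincter_muscle(verts: np.ndarray, centre: np.ndarray,
                     squeeze: float) -> np.ndarray:
    """Radial displacement toward a centre point, e.g. for pursing the lips."""
    return centre + (verts - centre) * (1.0 - squeeze)

if __name__ == "__main__":
    v = np.array([[0.0, 0.0, 0.0], [0.5, 0.2, 0.0]])
    print(linear_muscle(v, np.array([1.0, 0.0, 0.0]), 0.3))
    print(sphincter_muscle(v, np.array([0.25, 0.1, 0.0]), 0.2))
```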
由于端到端模型的前馈变压器采用的自注意力机制是一种通过其上下文来理解当前词的创新方法,语义特征的提取能力更强。在实际应用中,这个特性意味着对于句子中的同音字或词,新的算法能根据它周围的词和前后的句子来判断究竟应该是哪个(比如洗澡和洗枣),从而得到更准确的结果;而且端到端模型解决了传统的语音识别方案中各部分任务独立,无法联合优化的问题。单一神经网络的框架变得更简单,随着模型层数更深,训练数据越大,准确率越高;第三,端到端模型采用新的神经网络结构,其可以更好地利用和适应新的硬件(比如GPU)并行计算能力,运算速度更快。这意味着转写同样时长的语音,基于新网络结构的算法模型可以在更短的时间内完成,也更能满足实时转写的需求。
本公开内容的一种实施例中在获取待处理数据之后,使用端到端模型对待处理数据进行处理,得到声学特征序列、面部特征序列和肢体特征序列;再将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;由于端到端模型输入的是待处理数据的原始数据,而直接输出声学特征序列、面部特征序列和肢体特征序列,其能够更好的利用和适应新的硬件(比如GPU)并行计算能力,运算速度更快;即,能够在更短时间内获取声学特征序列、面部特征序列和肢体特征序列;再将声学特征序列、面部特征序列和肢体特征序列输入到肌肉模型中,直接驱动 虚拟人,是在创建虚拟人之后,直接通过声学特征序列来控制虚拟人进行语音输出,并同时通过面部特征序列和肢体特征序列控制虚拟人的面部表情和肢体动作,与需要重新对虚拟人建模相比,极大的降低了其计算量和数据传输量,且还提高了计算效率,使得驱动虚拟人的实时性得到极大的提高,从而能够实现实时驱动虚拟人。
而且,由于采用端到端模型来获取声学特征序列、面部特征序列和肢体特征序列时,使用了时长特征,而时长特征能够提高声学特征序列,与面部特征序列和肢体特征序列之间的同步性,从而在同步性的提高的基础上,使用声学特征序列,与面部特征序列和肢体特征序列来驱动虚拟人时,能够使得虚拟人的声音输出与面部表情和肢体特征匹配的精确度更高。
方法实施例一
参照图2,示出了本公开内容的一种实施例提供了一种实时驱动虚拟人的方法的步骤流程图,具体可以包括如下步骤:
S201、获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;
S202、使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的声学特征序列、面部特征序列和肢体特征序列;
S203、将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;
其中,步骤S201包括:
步骤S2021、获取所述待处理数据的文本特征和时长特征;
步骤S2022、根据所述文本特征和所述时长特征,确定出所述声学特征序列;
步骤S2023、根据所述文本特征和所述时长特征,确定出所述面部特征序列和所述肢体特征序列。
步骤S201中,对于客户端而言,可以接收用户上传的待处理数据;对于服务端而言,可以接收客户端发送的待处理数据。可以理解,任意的第一设备可以从第二设备接收待处理文本,本公开内容的一种实施例中对于待处理数据的具体传输方式不加以限制。
若待处理数据为文本数据,则直接使用步骤S202对待处理数据进行处理;若待处理数据为语音数据,则将待处理数据转换成文本数据之后,使用步骤S202对转换后的文本数据进行处理。
步骤S202中,首先需要训练出端到端模型,其中,端到端模型包括两种训练方法,其中一种训练方法训练出的端到端模型输出的声学特征序列,另一种训练方法训练出的端到端模型输出的面部特征序列和肢体特征序列;以及端到端模型具体可以为fastspeech模型。
以及训练出输出声学特征序列的端到端模型作为第一端到端模型,其训练过程中具体参考上述步骤A1-A4的叙述;训练出输出面部特征序列和肢体特征序列的端到端模型作为第二端到端模型,其训练过程参考步骤B1-B4的叙述。
步骤S2021中,可以通过fastspeech模型获取所述文本特征;以及通过时长模型获取所述时长特征,其中,所述时长模型为深度学习模型。
若端到端模型为fastspeech模型,则训练得到第一fastspeech模型和第二fastspeech模型之后,使用任意一个fastspeech模型获取到待处理数据的文本特征;再使用时长模型获取到时长特征,其中,时长模型可以是CNN和DNN等深度学习模型。
步骤S2022中,若训练输出声学特征序列的fastspeech模型为第一fastspeech模型,以及训练输出面部特征序列和肢体特征序列的的fastspeech模型为第二fastspeech模型,可以将所述文本特征和所述时长特征输入到第一fastspeech模型中,得到所述声学特征序列;以及步骤S2023中,将所述文本特征和所述时长特征输入到第二fastspeech模型中,得到所述面部特征序列和所述肢体特征序列。
若端到端模型为fastspeech模型,则训练得到第一fastspeech模型和第二fastspeech模型之后,使用任意一个fastspeech模型获取到待处理数据的文本特征;再使用时长模型获取到时长特征,其中,时长模型可以是CNN和DNN等深度学习模型。
本公开内容的一种实施例中,如图3所示,以第一fastspeech模型获取声学特征序列为例,其步骤包括:通过第一fastspeech模型的嵌入层获取待处理数据的文本特征301,通过前馈变压器302对文本特征301进行编码,得到文本编码特征303;此时,通过时长模型304对文本编码特征303处理,得到时长特征305,其中,时长特征304可用于表征文本编码特征30中每个音素的时长;然后通过时长特征305对文本编码特征303进行对齐,得到对齐后的文本编码特征306;对对齐后的文本编码特征306进行解码307并预测,得到声学特征序列307。
本公开内容的一种实施例中,文本编码特征303是音素级别,对齐后的文本编码特征306可以是帧级,也可以是音素级别,本公开内容的实施例中不作具体限制。
本公开内容的一种实施例中,使用第二fastspeech模型获取面部特征序列和肢体特征序列过程中,可以通过第二fastspeech模型的嵌入层获取待处理数据的文本特征;再通过前馈变压器对文本特征进行编码,得到文本编码特征;此时,通过时长模型对文本编码特征处理,得到时长特征,其中,时长特征对文本编码特征进行对齐,得到对齐后的文本编码特征;对对齐后的文本编码特征进行解码后进行面部预测和肢体预测,得到面部特征序列和肢体特征序列。
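The two paragraphs above (and Fig. 3) describe the same inference chain for both fastspeech branches: embed the text, encode it with the feed-forward transformer, predict durations, align by duration, then decode into the target sequences. The compact sketch below chains stubs in that order; every component stands in for a trained network and all shapes are assumptions.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    return np.eye(8)[[ord(c) % 8 for c in text]]          # (n_tokens, 8) toy embeddings

def encode(x: np.ndarray) -> np.ndarray:
    return np.tanh(x)                                     # stand-in for the FFT encoder

def durations(enc: np.ndarray) -> np.ndarray:
    return np.full(len(enc), 3)                           # stand-in duration model

def align(enc: np.ndarray, dur: np.ndarray) -> np.ndarray:
    return np.repeat(enc, dur, axis=0)                    # length regulation

def decode(frames: np.ndarray) -> np.ndarray:
    return frames @ np.ones((8, 80)) / 8.0                # toy decoder -> (n_frames, 80)

if __name__ == "__main__":
    enc = encode(embed("hi"))
    mel = decode(align(enc, durations(enc)))
    print(mel.shape)                                      # (6, 80)
```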
接下来执行步骤S203,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到融合特征序列。本公开内容的一种实施例中,可以根据所述时长特征,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到所述融合特征序列;在获取到所述融合特征序列之后,将所述融合特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人。
本公开内容的一种实施例中,根据所述时长特征,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行对齐,从而得到所述融合特征序列,再将将所述融合特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人。
所述肌肉模型的训练过程具体参考步骤C1-C3的叙述,此时,得到所述融合特征序列之后,直接通过所述融合特征序列驱动所述肌肉模型中对应的绑定肌肉,在绑定肌肉由所述融合特征序列驱动得进行相应运动时,其面部表情和动作均会根据绑定肌肉的运动而进行相应的改变。
例如,在声学特征序列在说“再见”时,面部表情是微笑且肢体动作是挥手,此时,根据所述时长特征,可以将说“再见”这个时间段,与所述面部特征序列为微笑的面部特征序列,以及所述肢体动作序列为挥手的肢体动作序列进行对齐,得到对齐后特征序列,即为融合特征序列;此时,将融合特征序列输入到肌肉模型中,通过肌肉模型控制虚拟人在说“再见”时面部呈现微笑且进行挥手,从而使得虚拟人的声音,与面部和动作相匹配。
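The "goodbye" example above aligns the spoken word, the smiling facial sequence and the waving limb sequence over the same time span given by the duration feature. A toy version of that alignment follows; the labels, frame rate and segment length are purely illustrative.

```python
FPS = 25

def align_segment(word: str, seconds: float, face: str, gesture: str):
    """Tile facial/limb labels over the frames of the spoken word so they stay in sync."""
    n_frames = int(round(seconds * FPS))
    return [{"word": word, "face": face, "gesture": gesture} for _ in range(n_frames)]

if __name__ == "__main__":
    fused = align_segment("goodbye", 0.6, face="smile", gesture="wave")
    print(len(fused), fused[0])          # 15 identical frames of fused labels
```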
又例如,在声学特征序列在说“过来”时,面部表情是微笑且肢体动作是招手,此时,根据所述时长特征,可以将说“过来”这个时间段内,与所述面部特征序列为微笑的面部特征序列,以及所述肢体动作序列为招手的肢体动作序列进行对齐,得到对齐后特征序列,即为融合特征序列;此时,将融合特征序列输入到肌肉模型中,通过肌肉模型控制虚拟人在说“过来”时面部呈现微笑且进行招手,从而使得虚拟人的声音,与面部和动 作相匹配。
本公开内容的一种实施例中在获取待处理数据之后,使用端到端模型对待处理数据进行处理,得到声学特征序列、面部特征序列和肢体特征序列;再将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;由于端到端模型输入的是待处理数据的原始数据,而直接输出声学特征序列、面部特征序列和肢体特征序列,其能够更好的利用和适应新的硬件(比如GPU)并行计算能力,运算速度更快;即,能够在更短时间内获取声学特征序列、面部特征序列和肢体特征序列;再将声学特征序列、面部特征序列和肢体特征序列输入到肌肉模型中,直接驱动虚拟人,是在创建虚拟人之后,直接通过声学特征序列来控制虚拟人进行语音输出,并同时通过面部特征序列和肢体特征序列控制虚拟人的面部表情和肢体动作,与需要重新对虚拟人建模相比,极大的降低了其计算量和数据传输量,且还提高了计算效率,使得驱动虚拟人的实时性得到极大的提高,从而能够实现实时驱动虚拟人。
而且,由于采用端到端模型来获取声学特征序列、面部特征序列和肢体特征序列时,使用了时长特征,而时长特征能够提高声学特征序列,与面部特征序列和肢体特征序列之间的同步性,从而在同步性的提高的基础上,使用声学特征序列,与面部特征序列和肢体特征序列来驱动虚拟人时,能够使得虚拟人的声音输出与面部表情和肢体特征匹配的精确度更高。
方法实施例二
参照图4,示出了本公开内容的一种实施例提供了一种实时驱动虚拟人的方法的步骤流程图,具体可以包括如下步骤:
S401、获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;
S402、使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的融合特征数据,其中,所述融合特征序列是由所述待处理数据对应的声学特征序列,面部特征序列和肢体特征序列融合得到的;
S403、将所述融合序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;
其中,步骤S402包括:
步骤S4021、获取所述待处理数据的文本特征和时长特征;
步骤S4022、根据所述文本特征和所述时长特征,确定出所述声学特征序列,所述面 部特征序列和所述肢体特征序列;
步骤S4023、根据所述声学特征序列,所述面部特征序列和所述肢体特征序列,得到所述融合特征序列。
步骤S401中,对于客户端而言,可以接收用户上传的待处理数据;对于服务端而言,可以接收客户端发送的待处理数据。可以理解,任意的第一设备可以从第二设备接收待处理文本,本公开内容的一种实施例中对于待处理数据的具体传输方式不加以限制。
若待处理数据为文本数据,则直接使用步骤S402对待处理数据进行处理;若待处理数据为语音数据,则将待处理数据转换成文本数据之后,使用步骤S402对转换后的文本数据进行处理。
步骤S402中,首先需要训练出端到端模型,以使得训练出的端到端模型输出的融合特征序列,此时,可以将输出融合特征序列的端到端模型作为第三端到端模型。
本公开内容的一种实施例中,在第三端到端模型进行训练时,其训练样本可以是真人视频数据和真人动作数据;针对训练样本集中每个训练样本,其训练步骤具体包括,首先执行步骤D1,获取训练样本的面部特征、肢体特征和文本特征,其中,文本特征可以为音素级别。
本公开内容的一种实施例中,可以将训练样本的特征数据映射到端到端模型中的嵌入(embedding)层中,得到面部特征、肢体特征和文本特征;然后执行步骤D2,通过前馈变压器(Feed Forward Transformer)处理面部特征、肢体特征和文本特征,得到面部特征向量、肢体特征向量和文本编码特征,其中,面部特征向量是用于进行面部表情的特征表示,肢体特征向量可以是肌肉动作向量,文本编码特征同样是音素级别;接下执行步骤D3,将面部特征向量和肢体特征向量,与文本编码特征进行对齐,可以使用持续时间预测器将面部特征向量和肢体特征向量,与文本编码特征进行对齐,其中,文本编码特征具体为音素特征;接下来执行步骤D4,获取声音特征序列,面部特征序列和肢体特征序列;接下执行步骤D5,将声音特序列,面部特征序列和肢体特征序列进行融合,得到融合特征序列,此时,可以使用长度调节器通过延长或缩短音素持续时间来对齐声音、面部表情和动作,从而得到融合特征序列。
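Steps D1-D5 above produce a single fused feature sequence once the acoustic, facial and limb sequences share a duration-aligned frame axis. The sketch below shows the simplest possible fusion under that assumption, per-frame concatenation; the dimensions are illustrative and the patent does not specify this exact operation.

```python
import numpy as np

def fuse_sequences(acoustic: np.ndarray, face: np.ndarray,
                   limb: np.ndarray) -> np.ndarray:
    """All inputs are (n_frames, d_*) and already aligned by the duration feature."""
    assert len(acoustic) == len(face) == len(limb), "sequences must be frame-aligned"
    return np.concatenate([acoustic, face, limb], axis=1)   # (n_frames, d_a + d_f + d_l)

if __name__ == "__main__":
    fused = fuse_sequences(np.zeros((12, 80)), np.zeros((12, 52)), np.zeros((12, 24)))
    print(fused.shape)                                       # (12, 156)
```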
本公开内容的一种实施例中文本特征可以包括:音素特征、和/或、语义特征等。在一些实施例中,音素是根据语音的自然属性划分出来的最小语音单位,依据音节里的发音动作来分析,一个动作构成一个音素。音素可以包括:元音与辅音。在一些实施例中, 特定的音素特征对应特定的唇部特征、表情特征或者肢体特征等。
以及,语义是待处理文本所对应的现实世界中的事物所代表的概念的含义,以及这些含义之间的关系,是待处理文本在某个领域上的解释和逻辑表示。在一些实施例中,特定的语义特征对应特定的肢体特征等。
在对第三端到端模型进行训练时,其训练样本集包括的真人动作数据或者真人视频数据,其训练过程参考对输出声学特征序列的端到端模型进行训练的训练过程,为了说明书的简洁,在此就不再赘述了。
如此,在得到待处理数据之后,可以利用第三端到端模型的嵌入层获取所述待处理数据的文本特征,再获取所述待处理数据的时长特征,将所述文本特征和所述时长特征输入到第三端到端模型中,得到声学特征序列,面部特征序列和肢体特征序列;并根据时长特征,将声学特征序列,面部特征序列和肢体特征序列进行融合,得到融合特征序列。
步骤S4021中,若端到端模型为fastspeech模型,则可以通过第三fastspeech模型获取所述文本特征;以及通过时长模型获取所述时长特征,其中,所述时长模型为深度学习模型。
步骤S4022中,若训练输出声学特征序列的fastspeech模型为第三fastspeech模型,可以将所述文本特征和所述时长特征输入到第一fastspeech模型中,确定出所述声学特征序列,所述面部特征序列和所述肢体特征序列;以及步骤S4023中,根据所述声学特征序列,所述面部特征序列和所述肢体特征序列,得到所述融合特征序列。
本公开内容的一种实施例中,可以根据所述时长特征,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行对齐,从而得到所述融合特征序列。
若端到端模型为fastspeech模型,则训练得到第三fastspeech模型之后,使用第三fastspeech模型获取到待处理数据的文本特征;再使用时长模型获取到时长特征,其中,时长模型可以是CNN和DNN等深度学习模型。
本公开内容的一种实施例中,使用第三fastspeech模型获取融合特征序列过程中,可以通过第三fastspeech模型的嵌入层获取待处理数据的文本特征;再通过前馈变压器对文本特征进行编码,得到文本编码特征;此时,通过时长模型对文本编码特征处理,得到时长特征,其中,时长特征对文本编码特征进行对齐,得到对齐后的文本编码特征;对对齐后的文本编码特征进行解码后进行声音预测,面部预测和肢体预测,得到声音特征序列,面部特征序列和肢体特征序列;再根据时长特征,将声音特征序列,面部特征序列和肢体 特征序列进行对齐,得到对齐后的声音特征序列,面部特征序列和肢体特征序列,即,将对齐后的声音特征序列,面部特征序列和肢体特征序列作为融合特征序列。
接下来执行步骤S403,在获取到所述融合特征序列之后,将所述融合特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人。
本公开内容的一种实施例中,将所述融合特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人。
本公开内容的一种实施例中,所述肌肉模型的训练过程具体参考步骤C1-C3的叙述,此时,得到所述融合特征序列之后,直接通过所述融合特征序列驱动所述肌肉模型中对应的绑定肌肉,在绑定肌肉由所述融合特征序列驱动得进行相应运动时,其面部表情和动作均会根据绑定肌肉的运动而进行相应的改变。
例如,在声学特征序列在说“再见”时,面部表情是微笑且肢体动作是挥手,此时,由于融合特征序列是根据时长特征进行对齐的,从而使得虚拟人在“再见”的同时,面部呈现微笑且进行挥手,从而使得虚拟人的声音,与面部和动作相匹配。
又例如,在声学特征序列在说“某人受伤”时,面部表情是微笑且肢体动作是双手合十,此时,此时,由于融合特征序列是根据时长特征进行对齐的,从而使得虚拟人在“某人受伤”的同时,面部呈现悲伤且进行双手合十,从而使得虚拟人的声音,与面部和动作相匹配。
本公开内容的一种实施例中在获取待处理数据之后,使用端到端模型对待处理数据进行处理,得到由声学特征序列、面部特征序列和肢体特征序列融合的融合特征序列;再将融合特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;由于端到端模型输入的是待处理数据的原始数据,而直接输出由声学特征序列、面部特征序列和肢体特征序列融合的融合特征序列,其能够更好的利用和适应新的硬件(比如GPU)并行计算能力,运算速度更快;即,能够在更短时间内获取融合特征序列;再将融合特征序列输入到肌肉模型中,直接驱动虚拟人,是在创建虚拟人之后,直接通过融合特征序列来控制虚拟人进行语音输出的同时,控制虚拟人的面部表情和肢体动作,与需要重新对虚拟人建模相比,极大的降低了其计算量和数据传输量,且还提高了计算效率,使得驱动虚拟人的实时性得到极大的提高,从而能够实现实时驱动虚拟人。
而且,由于采用端到端模型来获取融合特征序列时,使用了时长特征对声学特征序列、面部特征序列和肢体特征序列进行融合,而时长特征能够提高声学特征序列,与面 部特征序列和肢体特征序列之间的同步性,从而在同步性的提高的基础上,使用融合征序列来驱动虚拟人时,能够使得虚拟人的声音输出与面部表情和肢体特征匹配的精确度更高。装置实施例一
参照图5,示出了本公开内容的一种实施例提供了一种实时驱动虚拟人的装置实施例的结构框图,具体可以包括:
数据获取模块501,用于获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;
数据处理模块502,用于使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的声学特征序列、面部特征序列和肢体特征序列;
虚拟人驱动模块503,用于将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;
其中,数据处理模块502,具体用于获取所述待处理数据的文本特征和时长特征;根据所述文本特征和所述时长特征,确定出所述声学特征序列;根据所述文本特征和所述时长特征,确定出所述面部特征序列和所述肢体特征序列。
在一些实施例中,数据处理模块502,用于通过fastspeech模型获取所述文本特征;通过时长模型获取所述时长特征,其中,所述时长模型为深度学习模型。
在一些实施例中,数据处理模块502,若训练输出声学特征序列的fastspeech模型为第一fastspeech模型,以及训练输出面部特征序列和肢体特征序列的的fastspeech模型为第二fastspeech模型,用于将所述文本特征和所述时长特征输入到第一fastspeech模型中,得到所述声学特征序列;将所述文本特征和所述时长特征输入到第二fastspeech模型中,得到所述面部特征序列和所述肢体特征序列。
在一些实施例中,虚拟人驱动模块503,用于将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到融合特征序列;
将所述融合特征序列输入到所述肌肉模型中。
在一些实施例中,虚拟人驱动模块503,用于基于所述时长特征,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到所述融合特征序列。
在一些实施例中,所述面部特征序列对应的面部特征包括表情特征和唇部特征。
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
本公开内容的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
装置实施例二
参照图6,示出了本公开内容的一种实施例提供了一种实时驱动虚拟人的装置实施例的结构框图,具体可以包括:
数据获取模块601,用于获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;
数据处理模块602,用于使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的融合特征数据,其中,所述融合特征序列是由所述待处理数据对应的声学特征序列,面部特征序列和肢体特征序列融合得到的;
虚拟人驱动模块603,用于将所述融合序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;
其中,数据处理模块602,用于获取所述待处理数据的文本特征和时长特征;根据所述文本特征和所述时长特征,确定出所述声学特征序列,所述面部特征序列和所述肢体特征序列;根据所述声学特征序列,所述面部特征序列和所述肢体特征序列,得到所述融合特征序列。
在一些实施例中,数据处理模块602,用于通过fastspeech模型获取所述文本特征;通过时长模型获取所述时长特征,其中,所述时长模型为深度学习模型。
在一些实施例中,数据处理模块602,用于根据所述时长特征,对所述声学特征序列,所述面部特征序列和所述肢体特征序列进行对齐,得到所述融合特征序列。
在一些实施例中,所述面部特征序列对应的面部特征包括表情特征和唇部特征。
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
本公开内容的一种实施例中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。
关于本公开内容的实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
图7是根据本公开内容的一种实施例提供了一种用于实时驱动虚拟人的装置作为设备时的结构框图。例如,装置900可以是移动来电,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等。
参照图7,装置900可以包括以下一个或多个组件:处理组件902,存储器904,电源组件906,多媒体组件908,音频组件910,输入/输出(I/O)的接口912,传感器组件914,以及通信组件916。
处理组件902通常控制装置900的整体操作,诸如与显示,来电呼叫,数据通信,相机操作和记录操作相关联的操作。处理元件902可以包括一个或多个处理器920来执行指令,以完成上述的方法的全部或部分步骤。此外,处理组件902可以包括一个或多个模块,便于处理组件902和其他组件之间的交互。例如,处理组件902可以包括多媒体模块,以方便多媒体组件908和处理组件902之间的交互。
存储器904被配置为存储各种类型的数据以支持在设备900的操作。这些数据的示例包括用于在装置900上操作的任何应用程序或方法的指令,联系人数据,来电簿数据,消息,图片,视频等。存储器904可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。
电源组件906为装置900的各种组件提供电力。电源组件906可以包括电源管理系统,一个或多个电源,及其他与为装置900生成、管理和分配电力相关联的组件。
多媒体组件908包括在所述装置900和用户之间的提供一个输出接口的屏幕。在一些实施例中,屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动运动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。在一些实施例中,多媒体组件908包括一个前置摄像头和/或后置摄像头。当设备900处于操作模式,如拍摄模式或视频模式时,前置摄像头和/或后置摄像头可以接收外部的多媒体数据。每个前置摄像头和后置摄像头可以是一个固定的光学透镜系统或具有焦距和光学变焦能力。
音频组件910被配置为输出和/或输入音频信号。例如,音频组件910包括一个麦 克风(MIC),当装置900处于操作模式,如呼叫模式、记录模式和语音识别模式时,麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器904或经由通信组件916发送。在一些实施例中,音频组件910还包括一个扬声器,用于输出音频信号。
I/O接口912为处理组件902和外围接口模块之间提供接口,上述外围接口模块可以是键盘,点击轮,按钮等。这些按钮可包括但不限于:主页按钮、音量按钮、启动按钮和锁定按钮。
传感器组件914包括一个或多个传感器,用于为装置900提供各个方面的状态评估。例如,传感器组件914可以检测到设备900的打开/关闭状态,组件的相对定位,例如所述组件为装置900的显示器和小键盘,传感器组件914还可以检测装置900或装置900一个组件的位置改变,用户与装置900接触的存在或不存在,装置900方位或加速/减速和装置900的温度变化。传感器组件914可以包括接近传感器,被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件914还可以包括光传感器,如CMOS或CCD图像传感器,用于在成像应用中使用。在一些实施例中,该传感器组件914还可以包括加速度传感器,陀螺仪传感器,磁传感器,压力传感器或温度传感器。
通信组件916被配置为便于装置900和其他设备之间有线或无线方式的通信。装置900可以接入基于通信标准的无线网络,如WiFi,2G或3G,或它们的组合。在一些实施例中,通信部件916经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一些实施例中,所述通信部件916还包括近场通信(NFC)模块,以促进短程通信。例如,在NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。
在一些实施例中,装置900可以被一个或多个应用专用集成电路(ASIC)、数字信号处理器(DSP)、数字信号处理设备(DSPD)、可编程逻辑器件(PLD)、现场可编程门阵列(FPGA)、控制器、微控制器、微处理器或其他电子元件实现,用于执行上述方法。
在一些实施例中,还提供了一种包括指令的非临时性计算机可读存储介质,例如包括指令的存储器904,上述指令可由装置900的处理器920执行以完成上述方法。例如,所述非临时性计算机可读存储介质可以是ROM、随机存取存储器(RAM)、CD-ROM、磁带、软盘和光数据存储设备等。
图8是本公开内容的一种实施例提供的服务器的结构框图。该服务器1900可因配 置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1922(例如,一个或一个以上处理器)和存储器1932,一个或一个以上存储应用程序1942或数据1944的存储介质1930(例如一个或一个以上海量存储设备)。其中,存储器1932和存储介质1930可以是短暂存储或持久存储。存储在存储介质1930的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器1922可以设置为与存储介质1930通信,在服务器1900上执行存储介质1930中的一系列指令操作。
服务器1900还可以包括一个或一个以上电源1926,一个或一个以上有线或无线网络接口1950,一个或一个以上输入输出接口1958,一个或一个以上键盘1956,和/或,一个或一个以上操作系统1941,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
一种非临时性计算机可读存储介质,当所述存储介质中的指令由装置(设备或者服务器)的处理器执行时,使得装置能够执行一种实时驱动虚拟人的方法,所述方法包括:确定待处理文本对应的时长特征;所述待处理文本涉及至少两种语言;依据所述时长特征,确定所述待处理文本对应的目标语音序列;依据所述时长特征,确定所述待处理文本对应的目标图像序列;所述目标图像序列为依据文本样本及其对应的图像样本得到;所述文本样本对应的语言包括:所述待处理文本涉及的所有语言;对所述目标语音序列和所述目标图像序列进行融合,以得到对应的目标视频。
本说明书是参照根据在本公开内容的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的设备。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令设备的制造品,该指令设备实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
尽管已描述了本说明书的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例作出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本说明书范围的所有变更和修改。
显然,本领域的技术人员可以对本说明书进行各种改动和变型而不脱离本说明书的精神和范围。这样,倘若本说明书的这些修改和变型属于本说明书权利要求及其等同技术的范围之内,则本说明书也意图包含这些改动和变型在内。

Claims (20)

  1. 一种实时驱动虚拟人的方法,包括:
    获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;
    使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的声学特征序列、面部特征序列和肢体特征序列;
    将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;
    其中,所述使用端到端模型对所述待处理数据进行处理,包括:
    获取所述待处理数据的文本特征和时长特征;
    根据所述文本特征和所述时长特征,确定出所述声学特征序列;
    根据所述文本特征和所述时长特征,确定出所述面部特征序列和所述肢体特征序列。
  2. 如权利要求1所述的方法,其中,所述获取所述待处理数据的文本特征和时长特征,包括:
    通过fastspeech模型获取所述文本特征;
    通过时长模型获取所述时长特征,其中,所述时长模型为深度学习模型。
  3. 如权利要求2所述的方法,其中,若训练输出声学特征序列的fastspeech模型为第一fastspeech模型,以及训练输出面部特征序列和肢体特征序列的的fastspeech模型为第二fastspeech模型,所述根据所述文本特征和所述时长特征,确定出所述声学特征序列,包括:
    将所述文本特征和所述时长特征输入到第一fastspeech模型中,得到所述声学特征序列;
    所述根据所述文本特征和所述时长特征,确定出所述声学特征序列,包括:
    将所述文本特征和所述时长特征输入到第二fastspeech模型中,得到所述面部特征序列和所述肢体特征序列。
  4. 如权利要求1所述的方法,其中,所述将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,包括:
    将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到融合特 征序列;
    将所述融合特征序列输入到所述肌肉模型中。
  5. 如权利要求4所述的方法,其中,所述将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到融合特征序列,包括:
    基于所述时长特征,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到所述融合特征序列。
  6. 如权利要求1所述的方法,其中,所述面部特征序列对应的面部特征包括表情特征和唇部特征。
  7. 一种实时驱动虚拟人的方法,包括:
    获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;
    使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的融合特征数据,其中,所述融合特征序列是由所述待处理数据对应的声学特征序列,面部特征序列和肢体特征序列融合得到的;
    将所述融合序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;
    其中,所述使用端到端模型对所述待处理数据进行处理,包括:
    获取所述待处理数据的文本特征和时长特征;
    根据所述文本特征和所述时长特征,确定出所述声学特征序列,所述面部特征序列和所述肢体特征序列;
    根据所述声学特征序列,所述面部特征序列和所述肢体特征序列,得到所述融合特征序列。
  8. 如权利要求7所述的方法,其中,所述获取所述待处理数据的文本特征和时长特征,包括:
    通过fastspeech模型获取所述文本特征;
    通过时长模型获取所述时长特征,其中,所述时长模型为深度学习模型。
  9. 如权利要求8所述的方法,其中,所述根据所述声学特征序列,所述面部特征序列和所述肢体特征序列,得到所述融合特征序列,包括:
    基于所述时长特征,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到所述融合特征序列。
  10. 一种实时驱动虚拟人的装置,包括:
    数据获取模块,用于获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;
    数据处理模块,用于使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的声学特征序列、面部特征序列和肢体特征序列;
    虚拟人驱动模块,用于将所述声学特征序列、所述面部特征序列和所述肢体特征序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;
    其中,所述数据处理模块,具体用于获取所述待处理数据的文本特征和时长特征;根据所述文本特征和所述时长特征,确定出所述声学特征序列;根据所述文本特征和所述时长特征,确定出所述面部特征序列和所述肢体特征序列。
  11. 如权利要求10所述的装置,其中,所述数据处理模块,用于通过fastspeech模型获取所述文本特征;通过时长模型获取所述时长特征,其中,所述时长模型为深度学习模型。
  12. 如权利要求11所述的装置,其中,所述数据处理模块,若训练输出声学特征序列的fastspeech模型为第一fastspeech模型,以及训练输出面部特征序列和肢体特征序列的的fastspeech模型为第二fastspeech模型,用于将所述文本特征和所述时长特征输入到第一fastspeech模型中,得到所述声学特征序列;将所述文本特征和所述时长特征输入到第二fastspeech模型中,得到所述面部特征序列和所述肢体特征序列。
  13. 如权利要求11所述的装置,其中,所述虚拟人驱动模块,用于将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到融合特征序列;将所述融合特征序列输入到所述肌肉模型中。
  14. 如权利要求13所述的装置,其中,所述虚拟人驱动模块,用于基于所述时长特征,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到所述融合特征序列。
  15. 如权利要求11所述的装置,其中,所述面部特征序列对应的面部特征包括表情特征和唇部特征。
  16. 一种实时驱动虚拟人的装置,包括:
    数据获取模块,用于获取用于驱动虚拟人的待处理数据,所述待处理数据包括文本数据和语音数据中的至少一种;
    数据处理模块,用于使用端到端模型对所述待处理数据进行处理,确定出所述待处理数据对应的融合特征数据,其中,所述融合特征序列是由所述待处理数据对应的声学特征序列,面部特征序列和肢体特征序列融合得到的;
    虚拟人驱动模块,用于将所述融合序列输入到已训练的肌肉模型中,通过所述肌肉模型驱动虚拟人;
    其中,所述数据处理模块,具体用于获取所述待处理数据的文本特征和时长特征;根据所述文本特征和所述时长特征,确定出所述声学特征序列,所述面部特征序列和所述肢体特征序列;根据所述声学特征序列,所述面部特征序列和所述肢体特征序列,得到所述融合特征序列。
  17. 如权利要求16所述的装置,其中,所述数据处理模块,用于通过fastspeech模型获取所述文本特征;通过时长模型获取所述时长特征,其中,所述时长模型为深度学习模型。
  18. 如权利要求17所述的装置,其中,所所述数据处理模块,用于基于所述时长特征,将所述声学特征序列、所述面部特征序列和所述肢体特征序列进行融合,得到所述融合特征序列。
  19. 一种用于实时驱动虚拟人的装置,包括有存储器,以及一个或者一个以上的程序,其中一个或者一个以上程序存储于存储器中,且经配置以由一个或者一个以上处理器执行所述一个或者一个以上程序包含用于进行如权利要求1-9任一权项所述的方法步骤。
  20. 一种机器可读介质,其上存储有指令,当由一个或多个处理器执行时,使得装置执行如权利要求1至9中一个或多个所述的实时驱动虚拟人的方法。
PCT/CN2021/078244 2020-05-18 2021-02-26 实时驱动虚拟人的方法、装置、电子设备及介质 WO2021232877A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/989,323 US20230082830A1 (en) 2020-05-18 2022-11-17 Method and apparatus for driving digital human, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010420720.9A CN113689880B (zh) 2020-05-18 实时驱动虚拟人的方法、装置、电子设备及介质
CN202010420720.9 2020-05-18

Related Child Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078243 Continuation WO2021232876A1 (zh) 2020-05-18 2021-02-26 实时驱动虚拟人的方法、装置、电子设备及介质

Publications (1)

Publication Number Publication Date
WO2021232877A1 true WO2021232877A1 (zh) 2021-11-25

Family

ID=78575574

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078244 WO2021232877A1 (zh) 2020-05-18 2021-02-26 实时驱动虚拟人的方法、装置、电子设备及介质

Country Status (1)

Country Link
WO (1) WO2021232877A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120191460A1 (en) * 2011-01-26 2012-07-26 Honda Motor Co,, Ltd. Synchronized gesture and speech production for humanoid robots
CN104361620A (zh) * 2014-11-27 2015-02-18 韩慧健 一种基于综合加权算法的口型动画合成方法
CN106653052A (zh) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 虚拟人脸动画的生成方法及装置
CN110162598A (zh) * 2019-04-12 2019-08-23 北京搜狗科技发展有限公司 一种数据处理方法和装置、一种用于数据处理的装置
CN110866968A (zh) * 2019-10-18 2020-03-06 平安科技(深圳)有限公司 基于神经网络生成虚拟人物视频的方法及相关设备
CN111415677A (zh) * 2020-03-16 2020-07-14 北京字节跳动网络技术有限公司 用于生成视频的方法、装置、设备和介质

Also Published As

Publication number Publication date
CN113689880A (zh) 2021-11-23

Similar Documents

Publication Publication Date Title
CN110688911B (zh) 视频处理方法、装置、系统、终端设备及存储介质
US20200279553A1 (en) Linguistic style matching agent
WO2021232876A1 (zh) 实时驱动虚拟人的方法、装置、电子设备及介质
TWI778477B (zh) 互動方法、裝置、電子設備以及儲存媒體
JP7395792B2 (ja) 2レベル音声韻律転写
TWI766499B (zh) 互動物件的驅動方法、裝置、設備以及儲存媒體
JP7227395B2 (ja) インタラクティブ対象の駆動方法、装置、デバイス、及び記憶媒体
US20100082345A1 (en) Speech and text driven hmm-based body animation synthesis
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
CN110162598B (zh) 一种数据处理方法和装置、一种用于数据处理的装置
CN110210310A (zh) 一种视频处理方法、装置和用于视频处理的装置
CN110148406B (zh) 一种数据处理方法和装置、一种用于数据处理的装置
US20240105160A1 (en) Method and system for generating synthesis voice using style tag represented by natural language
WO2023246163A9 (zh) 一种虚拟数字人驱动方法、装置、设备和介质
EP4343755A1 (en) Method and system for generating composite speech by using style tag expressed in natural language
CN117275485B (zh) 一种音视频的生成方法、装置、设备及存储介质
US20240022772A1 (en) Video processing method and apparatus, medium, and program product
WO2021232877A1 (zh) 实时驱动虚拟人的方法、装置、电子设备及介质
CN113689880B (zh) 实时驱动虚拟人的方法、装置、电子设备及介质
CN110166844B (zh) 一种数据处理方法和装置、一种用于数据处理的装置
CN112632262A (zh) 一种对话方法、装置、计算机设备及存储介质
TWM652806U (zh) 互動虛擬人像系統
CN114155849A (zh) 一种虚拟对象的处理方法、装置和介质
CN115409923A (zh) 生成三维虚拟形象面部动画的方法、装置及系统
Pueblo Videorealistic facial animation for speech-based interfaces

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21809379

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.03.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21809379

Country of ref document: EP

Kind code of ref document: A1