WO2022048405A1 - Text-based virtual object animation generation method and apparatus, storage medium, and terminal - Google Patents

Text-based virtual object animation generation method and apparatus, storage medium, and terminal

Info

Publication number
WO2022048405A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
virtual object
information
object animation
text information
Prior art date
Application number
PCT/CN2021/111424
Other languages
English (en)
French (fr)
Inventor
王从艺
陈余
柴金祥
Original Assignee
魔珐(上海)信息科技有限公司
上海墨舞科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 魔珐(上海)信息科技有限公司, 上海墨舞科技有限公司
Priority to US18/024,021 (US11908451B2)
Publication of WO2022048405A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/80 2D [Two Dimensional] animation, e.g. using sprites
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L2013/105 Duration

Definitions

  • the invention relates to the technical field of virtual digital objects, in particular to a text-based virtual object animation generation method and device, a storage medium and a terminal.
  • virtual digital object may be referred to as virtual object
  • In the animation industry, market demand for the rapid, automatic generation of realistic virtual characters is increasing day by day.
  • the rapid generation system of virtual object animation is mainly embodied in how to quickly and efficiently generate emotional speech and corresponding virtual object animation from text.
  • the technical problem solved by the present invention is how to quickly and efficiently generate virtual object animation with emotional speech from text.
  • An embodiment of the present invention provides a text-based virtual object animation generation method, including: acquiring text information, wherein the text information includes the original text of the virtual object animation to be generated; analyzing emotional features and prosodic boundaries of the text information; performing speech synthesis according to the emotional features, the prosodic boundaries, and the text information to obtain audio information, wherein the audio information includes emotional speech converted from the original text; and generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation and the audio information are synchronized in time.
  • Analyzing the emotional features and prosodic boundaries of the text information includes: performing word segmentation on the text information; for each word obtained by word segmentation, performing sentiment analysis on the word to obtain the emotional feature of the word; and determining the prosodic boundary of each word.
  • Analyzing the emotional features and prosodic boundaries of the text information includes: analyzing the emotional features of the text information based on a preset text front-end prediction model, where the input of the preset text front-end prediction model is the text information, and its output is the emotional features, prosodic boundaries, and word segmentation of the text information.
  • Performing speech synthesis according to the emotional features, the prosodic boundaries, and the text information to obtain the audio information includes: inputting the text information, the emotional features, and the prosodic boundaries into a preset speech synthesis model, wherein the preset speech synthesis model is used to convert the input text sequence into a speech sequence in time order, and the speech in the speech sequence carries the emotion of the text at the corresponding time point; and obtaining the audio information output by the preset speech synthesis model.
  • The preset speech synthesis model is obtained by training on training data, wherein the training data includes text information samples and corresponding audio information samples, and the audio information samples are pre-recorded according to the text information samples.
  • The training data further includes extended samples, wherein the extended samples are obtained by slicing the text information samples and the corresponding audio information samples and recombining the speech and text slices.
  • Generating the corresponding virtual object animation based on the text information and the audio information includes: receiving input information, wherein the input information includes the text information and the audio information; converting the input information into a pronunciation unit sequence; performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence; and inputting the linguistic feature sequence into a preset time sequence mapping model to generate a corresponding virtual object animation based on the linguistic feature sequence.
  • the generating the corresponding virtual object animation based on the text information and the audio information includes: inputting the text information and the audio information into a preset time sequence mapping model to generate the corresponding virtual object animation.
  • the preset time sequence mapping model is used to map the input feature sequence to the expression parameters and/or action parameters of the virtual object in time sequence, so as to generate a corresponding virtual object animation.
  • The virtual object animation generation method further includes: normalizing the text information according to the context to obtain normalized text information.
  • The normalization processing includes number-reading processing and special-character-reading processing.
  • Generating the corresponding virtual object animation based on the text information and the audio information includes: generating the corresponding virtual object animation based on the text information, the emotional features and prosodic boundaries of the text information, and the audio information.
  • An embodiment of the present invention further provides a text-based virtual object animation generation device, including: an acquisition module for acquiring text information, wherein the text information includes the original text of the virtual object animation to be generated; an analysis module for analyzing emotional features and prosodic boundaries of the text information; a speech synthesis module for performing speech synthesis according to the emotional features, the prosodic boundaries, and the text information to obtain audio information, wherein the audio information includes emotional speech converted from the original text; and a processing module for generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation and the audio information are synchronized in time.
  • an embodiment of the present invention further provides a storage medium on which a computer program is stored, and the computer program executes the steps of the above method when the computer program is run by a processor.
  • An embodiment of the present invention further provides a terminal, including a memory and a processor, where the memory stores a computer program that can run on the processor, and the processor performs the steps of the above method when it runs the computer program.
  • An embodiment of the present invention provides a text-based virtual object animation generation method, comprising: acquiring text information, wherein the text information includes the original text of the virtual object animation to be generated; analyzing emotional features and prosodic boundaries of the text information; performing speech synthesis according to the emotional features, the prosodic boundaries, and the text information to obtain audio information, wherein the audio information includes emotional speech converted from the original text; and generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation and the audio information are synchronized in time.
  • This embodiment can quickly and efficiently generate virtual object animations with emotional speech from text, especially 3D animations, and is highly versatile: it does not need to be driven by a specific voice actor.
  • the speech with emotion is synthesized by analyzing the emotional features and prosodic boundaries of the text.
  • a corresponding virtual object animation is generated based on the text and the speech with emotion.
  • The generated virtual object animation data, arranged in time order, is synchronized in time with the audio information. This makes it possible to generate virtual object animation directly from text, and when the generated animation plays out over time it stays in sync with the emotional speech.
  • Generating the corresponding virtual object animation based on the text information and the audio information includes: receiving input information, wherein the input information includes the text information and the audio information; converting the input information into a pronunciation unit sequence; performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence; and inputting the linguistic feature sequence into a preset time sequence mapping model to generate a corresponding virtual object animation based on the linguistic feature sequence.
  • The corresponding linguistic feature sequence is extracted from the original audio or text and used as the input of the preset time sequence mapping model. Linguistic features relate only to the semantic content of the audio and are independent of features that vary from speaker to speaker, such as timbre, pitch, and fundamental frequency (F0). Therefore, the solution in this embodiment is not limited to a specific speaker, and original audio with different audio characteristics can be applied to the preset time sequence mapping model described in this embodiment. In other words, the solution does not analyze the audio features of the audio information; instead, it converts the audio information into pronunciation units and analyzes their linguistic features, so the neural network model can be driven to generate virtual object animation without depending on specific audio features.
  • The end-to-end virtual object animation generation method provided by this embodiment can be applied to any voice actor and any text. It removes the dependence on a specific voice actor found in existing end-to-end techniques for automatically synthesizing virtual object animation from speech, and so makes the technique genuinely universal.
  • a preset time sequence mapping model is constructed based on deep learning technology training, and based on the preset time sequence mapping model, the input linguistic feature sequence is mapped to the expression parameters and/or action parameters of the corresponding virtual object.
  • the originally received input information may be text information or audio information, so that the solution of this embodiment can generate corresponding virtual object animations according to different input modalities.
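The path from input to pronunciation units, linguistic features, and animation parameters can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the tiny `PRONUNCIATION` dictionary, the two features (`is_vowel`, `is_bilabial`), and the rule-based `map_to_expression_parameters` function, which merely stands in for the preset time sequence mapping model (a trained neural network in the text above).

```python
from dataclasses import dataclass
from typing import List

# Hypothetical pronunciation dictionary (word -> pronunciation units).
# A real system would use a full lexicon or a grapheme-to-phoneme model.
PRONUNCIATION = {
    "hello": ["HH", "AH", "L", "OW"],
    "bob": ["B", "AA", "B"],
}
VOWELS = {"AH", "OW", "AA"}
BILABIALS = {"B", "P", "M"}  # consonants produced with closed lips


@dataclass
class LinguisticFeature:
    unit: str
    is_vowel: bool
    is_bilabial: bool


def to_pronunciation_units(text: str) -> List[str]:
    """Convert text into a flat pronunciation unit sequence.

    Words missing from the toy dictionary are skipped.
    """
    units: List[str] = []
    for word in text.lower().split():
        units.extend(PRONUNCIATION.get(word, []))
    return units


def to_linguistic_features(units: List[str]) -> List[LinguisticFeature]:
    """Describe each unit by speaker-independent features only."""
    return [LinguisticFeature(u, u in VOWELS, u in BILABIALS) for u in units]


def map_to_expression_parameters(features: List[LinguisticFeature]) -> List[float]:
    """Stand-in for the time sequence mapping model.

    Returns one lip-opening value per unit: 0.0 closed, 1.0 fully open.
    """
    params = []
    for f in features:
        if f.is_bilabial:
            params.append(0.0)
        elif f.is_vowel:
            params.append(1.0)
        else:
            params.append(0.3)
    return params


units = to_pronunciation_units("hello Bob")
lip_opening = map_to_expression_parameters(to_linguistic_features(units))
```

Because the features depend only on which unit is pronounced, the same feature sequence results regardless of who speaks the words, which is the speaker-independence property described above.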
  • Fig. 1 is a flow chart of a text-based virtual object animation generation method according to an embodiment of the present invention
  • Fig. 2 is a flowchart of a specific implementation of step S104 in Fig. 1;
  • Fig. 3 is a flowchart of a specific implementation of step S1043 in Fig. 2;
  • Fig. 4 is a flowchart of a specific implementation of step S1044 in Fig. 2;
  • FIG. 5 is a schematic structural diagram of a text-based virtual object animation generation device according to an embodiment of the present invention.
  • Existing virtual object animation generation technology must be driven by a specific speaker and generalizes poorly.
  • Artists must provide manual support during production, so labor costs and time costs are both high.
  • An embodiment of the present invention provides a text-based virtual object animation generation method, including: acquiring text information, wherein the text information includes the original text of the virtual object animation to be generated; analyzing emotional features and prosodic boundaries of the text information; performing speech synthesis according to the emotional features, the prosodic boundaries, and the text information to obtain audio information, wherein the audio information includes emotional speech converted from the original text; and generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation and the audio information are synchronized in time.
  • This embodiment can quickly and efficiently generate virtual object animations with emotional speech from text, especially 3D animations, is highly versatile, and does not need to be driven by a specific voice actor.
  • the speech with emotion is synthesized by analyzing the emotional features and prosodic boundaries of the text.
  • a corresponding virtual object animation is generated based on the text and the speech with emotion.
  • The generated virtual object animation data, arranged in time order, is synchronized in time with the audio information. This makes it possible to generate virtual object animation directly from text, and when the generated animation plays out over time it stays in sync with the emotional speech.
  • FIG. 1 is a flowchart of a method for generating a text-based virtual object animation according to an embodiment of the present invention.
  • the solution of this embodiment can be applied to application scenarios such as virtual digital object generation and animation production.
  • The virtual object may be a virtual person, or another type of virtual object such as a virtual animal or a virtual plant; examples include virtual digital human voice assistants, virtual teachers, virtual consultants, and virtual newscasters. Virtual objects can be three-dimensional or two-dimensional.
  • the text-based virtual object animation generation method described in this embodiment can be understood as an end-to-end virtual object animation generation solution.
  • the user only needs to provide the original text and input it to the computer executing this embodiment, and then the corresponding virtual object animation and the synchronized emotional speech can be generated.
  • the user inputs the original text into the computer implementing this embodiment, and the corresponding three-dimensional (3D) virtual object animation and the synchronized emotional speech can be generated.
  • the virtual object image can be set according to the actual situation, including three-dimensional virtual objects and two-dimensional virtual objects.
  • End-to-end means that the computer handles every operation from the input end to the output end, with no human intervention (for example, by an animator) in between.
  • The input end refers to the port that receives the original audio and original text, and the output end refers to the port that generates and outputs the virtual object animation.
  • The virtual object animation output by the output end may include a controller for generating the virtual object animation, which takes the concrete form of a sequence of numeric vectors.
  • For example, the virtual object animation may include a lip animation. The lip animation controller output by the output end may include offset information for the lip feature points; when the controller is input into a rendering engine, the lips of the virtual object are driven to make the corresponding movements.
  • the controller for generating the virtual object animation may be a sequence of virtual object animation data, the data in the sequence is arranged in time sequence of the input information and synchronized with the audio data generated based on the input information.
  • the facial expression movement and human posture movement of the virtual object can be driven by the virtual object animation data.
  • the final virtual object animation can be obtained through the rendering engine.
  • the virtual object animation data may include facial expression motion data and body motion data of the virtual object.
  • The facial expression motion includes information such as expressions, gaze, and lip shapes, and the body motion may include information such as the body posture and gestures of the virtual object.
  • the facial expression motion data is referred to as the expression parameter of the virtual object, and the body motion data is referred to as the motion parameter of the virtual object.
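One possible in-memory shape for this time-ordered animation data is sketched below in Python. The `AnimationFrame` fields, the fixed frame rate, and the one-frame tolerance in `is_synchronized` are all assumptions for illustration; the text above only requires that the sequence be ordered in time and synchronized with the audio.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnimationFrame:
    time_s: float            # timestamp of the frame, in seconds
    expression: List[float]  # expression parameters, e.g. lip feature-point offsets
    motion: List[float] = field(default_factory=list)  # motion (body) parameters


def is_synchronized(frames: List[AnimationFrame], audio_duration_s: float,
                    fps: int, tolerance_frames: int = 1) -> bool:
    """Check that frames are strictly time-ordered and cover the audio duration."""
    times = [f.time_s for f in frames]
    in_order = all(a < b for a, b in zip(times, times[1:]))
    expected = round(audio_duration_s * fps)
    return in_order and abs(len(frames) - expected) <= tolerance_frames


# Two seconds of audio at 30 frames per second needs about 60 frames.
frames = [AnimationFrame(i / 30, [0.0]) for i in range(60)]
ok = is_synchronized(frames, audio_duration_s=2.0, fps=30)
```

A rendering engine would consume such a sequence frame by frame while the audio plays.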
  • the method for generating a text-based virtual object animation in this embodiment may include the following steps:
  • Step S101 acquiring text information, wherein the text information includes the original text of the virtual object animation to be generated;
  • Step S102 analyzing the emotional features and prosodic boundaries of the text information
  • Step S103 performing speech synthesis according to the emotional feature, the prosodic boundary and the text information to obtain audio information, wherein the audio information includes the voice with emotion converted based on the original text;
  • Step S104 generating a corresponding virtual object animation based on the text information and the audio information, and the virtual object animation and the audio information are synchronized in time.
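Steps S101 to S104 can be read as a four-stage pipeline. The Python sketch below shows only the data flow between the stages; the `TextAnalysis` container, the `generate_animation` function, and the three `toy_*` stages are hypothetical placeholders for the trained models described in this document, not an implementation of them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Audio = List[float]            # audio samples
Animation = List[List[float]]  # one parameter vector per animation frame


@dataclass
class TextAnalysis:
    words: List[str]
    emotions: List[str]     # one emotional feature per word
    boundaries: List[bool]  # True if a prosodic boundary follows the word


def generate_animation(
    text: str,  # S101: the text information is acquired by the caller
    analyze: Callable[[str], TextAnalysis],
    synthesize: Callable[[str, TextAnalysis], Audio],
    animate: Callable[[str, Audio], Animation],
) -> Tuple[Audio, Animation]:
    analysis = analyze(text)            # S102: emotional features and prosodic boundaries
    audio = synthesize(text, analysis)  # S103: emotional speech synthesis
    animation = animate(text, audio)    # S104: animation synchronized with the audio
    return audio, animation


# Placeholder stages so the data flow can be exercised end to end.
def toy_analyze(text: str) -> TextAnalysis:
    words = text.split()
    last = len(words) - 1
    return TextAnalysis(words, ["neutral"] * len(words),
                        [i == last for i in range(len(words))])


def toy_synthesize(text: str, analysis: TextAnalysis) -> Audio:
    return [0.0] * (100 * len(analysis.words))  # 100 silent samples per word


def toy_animate(text: str, audio: Audio) -> Animation:
    return [[0.0] for _ in range(len(audio) // 50)]  # one frame per 50 samples


audio, animation = generate_animation(
    "hello virtual world", toy_analyze, toy_synthesize, toy_animate)
```

Passing the stages in as callables keeps the sketch independent of any particular model; each placeholder could be swapped for a real component without changing the flow.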
  • the text information may be obtained from a client that needs to generate animation of a virtual object.
  • the original text may be a sentence or a paragraph including multiple sentences.
  • the original text may include common characters such as Chinese characters, English, numbers, and special characters.
  • the text information may be obtained by real-time input based on a device such as a keyboard.
  • the input information may be pre-collected text information, and is transmitted to the computing device executing the solution of this embodiment in a wired or wireless form when a corresponding virtual object animation needs to be generated.
  • The virtual object animation generation method in this embodiment may further include the step of: normalizing the text information according to the context to obtain normalized text information.
  • The normalization processing may include number-reading processing and special-character-reading processing.
  • The number-reading processing can determine the correct reading of numbers in the original text by rule matching.
  • For example, the number "110" can be read as "one hundred and ten" or as "one one zero"; the correct reading can be determined from the context before and after the number.
  • As another example, the number "1983" can be read as "one thousand nine hundred and eighty-three" or as "nineteen eighty-three". If the text following "1983" in the original text is "year", the correct reading here is "nineteen eighty-three".
  • The special-character-reading processing can determine the correct reading of special characters in the original text by rule matching.
  • A pronunciation dictionary of special characters can be built in advance and used to determine the reading of special characters in the original text. For example, the special character "¥" is the RMB symbol and can be read directly as "yuan".
  • The normalization processing may further include reading processing for polyphonic characters, which determines the correct reading of a polyphonic character according to the context.
  • the normalized text information can be used as the data processing basis of steps S102 to S104.
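The rule-matching idea behind number reading and special-character reading can be sketched in Python as follows. The specific rules (a neighbouring "year" selects the year reading, a preceding "dial" or "call" selects digit-by-digit reading), the whitespace tokenization, and the contents of `SPECIAL_READINGS` are assumptions chosen for illustration; they are not the rule set of the embodiment.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical pre-built pronunciation dictionary for special characters.
SPECIAL_READINGS = {"¥": "yuan", "%": "percent"}
# Hypothetical context words that call for digit-by-digit reading.
DIGIT_CONTEXT = {"dial", "call"}


def two_digits(n: int) -> str:
    """Read 0..99 as a cardinal number."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def cardinal(n: int) -> str:
    """Read 0..999 as a cardinal number."""
    if n < 100:
        return two_digits(n)
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head + (" and " + two_digits(rest) if rest else "")


def digit_by_digit(digits: str) -> str:
    return " ".join(ONES[int(c)] for c in digits)


def year(digits: str) -> str:
    """Read a four-digit year as two pairs (simplified; ignores years like 2005)."""
    return two_digits(int(digits[:2])) + " " + two_digits(int(digits[2:]))


def normalize(text: str) -> str:
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        prev = tokens[i - 1].lower() if i > 0 else ""
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if tok in SPECIAL_READINGS:
            out.append(SPECIAL_READINGS[tok])
        elif tok.isascii() and tok.isdigit():
            if len(tok) == 4 and "year" in (prev, nxt):
                out.append(year(tok))
            elif prev in DIGIT_CONTEXT or int(tok) >= 1000:
                out.append(digit_by_digit(tok))
            else:
                out.append(cardinal(int(tok)))
        else:
            out.append(tok)
    return " ".join(out)
```

With these rules, `normalize("dial 110")` gives "dial one one zero", while `normalize("110 people")` gives "one hundred and ten people".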
  • Step S102 may include the steps of: performing word segmentation on the text information; for each word obtained by word segmentation, performing sentiment analysis on the word to obtain the emotional feature of the word; and determining the prosodic boundary of each word.
  • word segmentation processing may be performed on the normalized text information based on natural language processing, so as to obtain the minimum unit of words.
  • the word of the smallest unit may be a single word, or may be a phrase, an idiom, etc. that can represent a specific meaning.
  • the emotional feature of each word obtained by the word segmentation process is determined, so as to obtain the emotional feature of the normalized text information.
  • Step S102 may be performed based on a preset text front-end prediction model, wherein the preset text front-end prediction model may include a coupled recurrent neural network (RNN) and conditional random field (CRF); the input of the preset text front-end prediction model is the text information, and its output is the emotional features, prosodic boundaries, and word segmentation of the text information.
  • That is, this specific implementation uses an RNN+CRF deep learning model to quickly predict the emotional features and prosodic boundaries of each word in the text information.
  • the preset text front-end prediction model may simultaneously output the emotional features, prosodic boundaries and word segmentation results of the text information.
  • Alternatively, the preset text front-end prediction model may follow the specific process of step S102 described above: first perform word segmentation, and then process the word segmentation result to obtain the corresponding emotional features and prosodic boundaries.
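The segment-then-annotate flow of step S102 can be illustrated with a toy Python sketch. It is deliberately not an RNN+CRF model: the regular-expression tokenizer, the `EMOTION_LEXICON` lookup, and the punctuation-based boundary rule are simple stand-ins, all assumed here, that only show what the outputs (word segmentation, per-word emotional feature, per-word prosodic boundary) might look like.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical emotion lexicon; a trained model would predict these labels.
EMOTION_LEXICON = {"happy": "joy", "great": "joy", "sad": "sadness", "angry": "anger"}
# Hypothetical rule: punctuation marks the strength of the prosodic boundary.
BOUNDARY_PUNCTUATION = {",": "minor", ";": "minor", ".": "major", "!": "major", "?": "major"}


@dataclass
class AnalyzedWord:
    word: str
    emotion: str   # emotional feature of the word
    boundary: str  # prosodic boundary after the word: "none", "minor" or "major"


def analyze_text(text: str) -> List[AnalyzedWord]:
    # Word segmentation: words and boundary punctuation become separate tokens.
    tokens = re.findall(r"[A-Za-z']+|[,;.!?]", text)
    result: List[AnalyzedWord] = []
    for tok in tokens:
        if tok in BOUNDARY_PUNCTUATION:
            if result:  # attach the boundary to the preceding word
                result[-1].boundary = BOUNDARY_PUNCTUATION[tok]
            continue
        emotion = EMOTION_LEXICON.get(tok.lower(), "neutral")
        result.append(AnalyzedWord(tok, emotion, "none"))
    return result


analysis = analyze_text("I am happy, but he is sad.")
```

For the sample sentence, "happy" is tagged with joy and a minor boundary, and "sad" with sadness and a major boundary; every other word is neutral with no boundary.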
  • Step S103 may include the steps of: inputting the text information, emotional features, and prosodic boundaries into a preset speech synthesis model, wherein the preset speech synthesis model is used to convert, based on deep learning, the input text sequence into a speech sequence in time order, and the speech in the speech sequence carries the emotion of the text at the corresponding time point; and obtaining the audio information output by the preset speech synthesis model.
  • the emotion of the text at the corresponding time point may include emotion features and prosodic boundaries of the text.
  • That is, this specific implementation takes the original text together with its emotional features and prosodic boundaries as input and converts them into emotional speech using the preset speech synthesis model.
  • The preset speech synthesis model may be a sequence-to-sequence (Seq-to-Seq) model.
  • the corresponding speech may be determined according to the text, emotional features and prosodic boundaries of the word.
  • a corresponding speech sequence with emotion can be obtained, and the speech sequence with emotion is also sequenced by time, and the speech sequence and the text sequence are synchronized.
  • the preset speech synthesis model can be run in real time or offline.
  • the real-time operation refers to inputting the text information generated in real time and the emotional features and prosodic boundaries predicted from the text information, and synthesizing the corresponding emotional speech, such as a virtual object animation live broadcast scene.
  • Offline operation refers to inputting complete text information and the emotional features and prosodic boundaries predicted from the text information, and synthesizing the corresponding emotional speech, such as offline production of animation scenes.
  • The preset speech synthesis model may be obtained by training on training data, wherein the training data may include text information samples and corresponding audio information samples, and the audio information samples are pre-recorded according to the text information samples.
  • the audio information sample may be recorded by a professional sound engineer in a recording studio according to the text information sample.
  • the emotional feature, prosodic boundary and word segmentation in the recorded audio information sample can be determined.
  • the emotion feature determined according to the audio information sample in combination with the text context is recorded as the standard emotion feature of the text information sample.
  • Voice is emotional when recorded, but text is not. Therefore, in order to ensure the synthesis of controllable emotional speech, it is necessary to add corresponding emotional information, prosodic boundary and other information to the input text information during synthesis. Therefore, in the training stage of the preset text front-end prediction model, it is necessary to ensure that the emotional features (referred to as predicted emotional features) predicted by the preset text front-end prediction model match the standard emotional features determined during speech recording.
  • When training the preset text front-end prediction model, the difference between the predicted emotional features output by the model and the standard emotional features can be compared, and the model parameters of the preset text front-end prediction model can then be adjusted accordingly.
  • The training process of the preset text front-end prediction model may be performed iteratively, that is, the parameters are continuously optimized and adjusted according to the difference between the predicted emotional features and the standard emotional features, so that the predicted emotional features output by the preset text front-end prediction model gradually approach the standard emotional features.
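  • As a minimal sketch of this iterative adjustment, assume a toy linear model that outputs a single emotion score per sample and is fitted with a squared-error loss. The source fixes neither the model form nor the loss, so `predict`, `train`, the feature vectors and the learning rate are all illustrative assumptions:

```python
# Hypothetical toy data: (text feature vector, standard emotion score).
SAMPLES = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]

def predict(weights, features):
    """Predicted emotion score: a plain dot product (an assumed model form)."""
    return sum(w * x for w, x in zip(weights, features))

def mean_squared_difference(weights):
    """Average squared gap between predicted and standard emotion scores."""
    return sum((predict(weights, x) - y) ** 2 for x, y in SAMPLES) / len(SAMPLES)

def train(steps=10, learning_rate=0.5):
    weights = [0.0, 0.0]
    losses = [mean_squared_difference(weights)]
    for _ in range(steps):
        # Gradient of the mean squared difference with respect to each weight.
        grads = [0.0] * len(weights)
        for x, y in SAMPLES:
            diff = predict(weights, x) - y
            for j, xj in enumerate(x):
                grads[j] += 2.0 * diff * xj / len(SAMPLES)
        # Move the parameters against the gradient, shrinking the difference.
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
        losses.append(mean_squared_difference(weights))
    return weights, losses

weights, losses = train()
```

  • In this toy setting the recorded `losses` shrink at every step, which is the behavior the iterative training described above aims for; a real front-end model would use a far richer architecture and data.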
  • The audio information samples may be speech with emotion.
  • The recorded audio information samples can carry emotional coloring that corresponds to the context of the text information.
  • The training data may further include extended samples, wherein the extended samples may be obtained by slicing and recombining the speech and text of the text information samples and the corresponding audio information samples.
  • Slicing and recombining speech and text may refer to slicing the audio information samples and the text information samples into minimum units respectively, and then permuting and combining those units. This expands and augments the sample data, which benefits the training of a deep learning model with strong generalization ability.
  • slices can be made according to emotional features and prosodic boundaries to obtain minimum units.
  • For example, the training data contains text A, "I am from the coast", with corresponding speech As, recorded as A<"I am from the coast", As>.
  • It also contains text B, "He is from Chongqing", with corresponding speech Bs, recorded as B<"He is from Chongqing", Bs>.
  • A can be cut into "I am from" and "the coast", denoted A1<"I am from", As1> and A2<"the coast", As2>.
  • B can be cut into "He is from" and "Chongqing", denoted B1<"He is from", Bs1> and B2<"Chongqing", Bs2>.
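  • The slices in this example can be recombined into new text and speech pairs. The following is a minimal sketch, assuming that slices occupying the same position in different samples are interchangeable; that rule, the `recombine` helper and the string placeholders standing in for audio fragments are illustrative assumptions rather than the method fixed by the source:

```python
from itertools import product

# Each slice pairs a text fragment with a placeholder name for its audio
# fragment; real data would hold waveform segments instead of strings.
sample_a = [("I am from", "As1"), ("the coast", "As2")]
sample_b = [("He is from", "Bs1"), ("Chongqing", "Bs2")]

def recombine(samples):
    """Build every sequence that fills slot i with slot i of some sample."""
    slots = zip(*samples)  # group slices by position: (A1, B1), (A2, B2)
    combos = []
    for choice in product(*slots):
        text = " ".join(fragment for fragment, _ in choice)
        audio = [clip for _, clip in choice]
        combos.append((text, audio))
    return combos

# Keep only the combinations that are not already in the training data.
originals = {"I am from the coast", "He is from Chongqing"}
extended = [c for c in recombine([sample_a, sample_b]) if c[0] not in originals]
```

  • Here `extended` holds the two new pairs "I am from Chongqing" (As1 followed by Bs2) and "He is from the coast" (Bs1 followed by As2). Joining real audio fragments would additionally need care at the slice boundaries, which this sketch does not address.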
  • the step S104 may include the step of: inputting the text information and the audio information into a preset time sequence mapping model to generate a corresponding virtual object animation.
  • the preset time sequence mapping model may be used to map the input feature sequence to the expression parameters and/or action parameters of the virtual object in time sequence, so as to generate the corresponding virtual object animation.
  • the text information, the emotional features and prosodic boundaries of the text information, and the audio information may be input into the preset time sequence mapping model together to generate a corresponding virtual object animation.
  • the step S104 may include the following steps:
  • Step S1041: receiving input information, wherein the input information includes the text information and the audio information;
  • Step S1042: converting the input information into a pronunciation unit sequence;
  • Step S1043: performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence;
  • Step S1044: inputting the linguistic feature sequence into a preset time sequence mapping model to generate a corresponding virtual object animation based on the linguistic feature sequence.
  • The preset time sequence mapping model described in this specific implementation can be applied to end-to-end virtual object animation generation with multimodal input and an arbitrary speaker.
  • Multimodal input can include voice input and text input.
  • An arbitrary speaker means that no restriction is placed on the audio characteristics of the speaker.
  • the linguistic feature sequence may include a plurality of linguistic features, wherein each linguistic feature includes at least the pronunciation feature of the corresponding pronunciation unit.
  • the preset time sequence mapping model may be used to map the input linguistic feature sequence to the expression parameters and/or action parameters of the virtual object in time sequence based on deep learning, so as to generate a corresponding virtual object animation.
  • the pronunciation unit sequence and the linguistic feature sequence are both time-aligned sequences.
  • the input information can be divided into pronunciation unit sequences composed of the smallest pronunciation units, which are used as the data basis for the subsequent linguistic feature analysis.
  • the step S1042 may include the steps of: converting the input information into a pronunciation unit and a corresponding time code; performing a time alignment operation on the pronunciation unit according to the time code to obtain the time aligned pronunciation unit sequence.
  • the time-aligned pronunciation unit sequence is simply referred to as a pronunciation unit sequence.
  • each group of data includes a single pronunciation unit and a corresponding time code.
  • the pronunciation units in the multiple sets of data can be aligned in time sequence, so as to obtain a time-aligned pronunciation unit sequence.
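  • A minimal sketch of this time alignment, assuming each group of data is a `(unit, start, end)` tuple with times in seconds; the tuple layout, the `align_units` helper and the sample values are assumptions, since the source does not specify the time code format:

```python
def align_units(groups):
    """Order (unit, start, end) groups by start time and reject overlaps."""
    ordered = sorted(groups, key=lambda group: group[1])
    for (_, _, prev_end), (_, start, _) in zip(ordered, ordered[1:]):
        if start < prev_end:
            raise ValueError("overlapping time codes")
    return ordered

# Hypothetical groups arriving out of order.
groups = [("ao", 0.15, 0.30), ("n", 0.00, 0.05), ("h", 0.12, 0.15), ("i", 0.05, 0.12)]
sequence = [unit for unit, _, _ in align_units(groups)]
```

  • After alignment, `sequence` lists the units in the order in which they are pronounced.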
  • the audio information may be converted into text information, and then the text information may be processed to obtain the pronunciation unit and the corresponding time code.
  • the text information can be directly processed to obtain the pronunciation unit and the corresponding time code.
  • text information may be expressed in the form of words, characters, pinyin, phonemes and the like.
  • The audio information can be converted into pronunciation units and corresponding time codes based on automatic speech recognition (ASR) technology and a preset pronunciation dictionary.
  • The basic pronunciation units in the text information, together with their arrangement and duration information in the time dimension, can be extracted based on the front-end module and the alignment module of text-to-speech (TTS) technology, so as to obtain the time-aligned basic pronunciation unit sequence.
  • The text information can guide the determination of the duration of each sound in the audio information.
  • When the input information is audio information, the audio information can be converted into pronunciation units and corresponding time codes based on speech recognition technology and the preset pronunciation dictionary, and a time alignment operation is then performed on the pronunciation units according to the time codes to obtain the time-aligned pronunciation unit sequence.
  • When the input information is text information, the text information can be converted into pronunciation units and corresponding time codes based on speech synthesis technology, and a time alignment operation is then performed on the pronunciation units according to the time codes to obtain the time-aligned pronunciation unit sequence.
  • Taking a phoneme as the pronunciation unit as an example, when the input information is audio information, the corresponding phoneme sequence and the duration information of each phoneme can be extracted from the original audio based on speech recognition technology and a pronunciation dictionary prepared in advance.
  • For text input, the front-end module and the attention-based alignment module in TTS technology can be used to obtain the non-time-aligned phoneme sequence of the original text and the alignment matrix between the phonemes and the mel spectrum of the output audio. The phoneme corresponding to each time segment can then be obtained based on a dynamic programming algorithm, so as to obtain a time-aligned phoneme sequence.
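  • The source does not say which dynamic programming algorithm is used. One common choice is to search for the best monotonic path through the alignment matrix, so that each audio frame is assigned one phoneme and the phoneme index never moves backwards. The following is a sketch under that assumption; `monotonic_align` and the toy matrix are hypothetical:

```python
import math

def monotonic_align(attn):
    """attn[t][p] is the weight linking audio frame t to phoneme p.

    Returns one phoneme index per frame. The path starts at the first
    phoneme, ends at the last one, and at each frame either stays on the
    same phoneme or advances by exactly one.
    """
    frames, phonemes = len(attn), len(attn[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * phonemes for _ in range(frames)]
    back = [[0] * phonemes for _ in range(frames)]
    score[0][0] = math.log(attn[0][0] + 1e-9)
    for t in range(1, frames):
        for p in range(phonemes):
            stay = score[t - 1][p]
            advance = score[t - 1][p - 1] if p > 0 else neg_inf
            best, source = (stay, p) if stay >= advance else (advance, p - 1)
            if best > neg_inf:
                score[t][p] = best + math.log(attn[t][p] + 1e-9)
                back[t][p] = source
    if score[-1][-1] == neg_inf:
        raise ValueError("need at least as many frames as phonemes")
    # Walk the back-pointers from the last frame to recover the path.
    path = [phonemes - 1]
    for t in range(frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy alignment matrix: 5 audio frames (rows) by 3 phonemes (columns).
attn = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
]
path = monotonic_align(attn)
durations = [path.count(p) for p in range(3)]  # frames spent on each phoneme
```

  • The per-frame path gives the phoneme for each time segment, and counting frames per phoneme yields duration information for a time-aligned phoneme sequence.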
  • Step S1043 may then be performed to carry out linguistic feature analysis on the basic pronunciation unit sequence obtained in step S1042, so as to obtain a time-aligned linguistic feature sequence (referred to as a linguistic feature sequence for short).
  • Step S1043 may include the following steps:
  • Step S10431: performing feature analysis on each pronunciation unit in the pronunciation unit sequence to obtain the linguistic feature of each pronunciation unit;
  • Step S10432: generating a corresponding linguistic feature sequence based on the linguistic features of each pronunciation unit.
  • the linguistic features can be used to characterize the pronunciation characteristics of the pronunciation units.
  • The pronunciation features include, but are not limited to, whether the pronunciation unit is a front nasal or a back nasal, whether it is a monophthong or a diphthong, whether it is aspirated or unaspirated, whether it is a fricative, whether it is an apical (tongue-tip) sound, and so on.
  • the linguistic features of the pronunciation unit may include independent linguistic features obtained by performing feature analysis on a single pronunciation unit.
  • Step S10431 may include the steps of: for each pronunciation unit, analyzing the pronunciation features of the pronunciation unit to obtain the independent linguistic feature of the pronunciation unit; and generating the linguistic feature of the pronunciation unit based on the independent linguistic feature.
  • the independent linguistic features can be used to characterize the pronunciation characteristics of a single pronunciation unit itself.
  • For each phoneme in the time-aligned phoneme sequence obtained in step S1042, feature analysis can be performed to obtain the pronunciation features of the phoneme.
  • The pronunciation features that need to be analyzed for each phoneme may include {whether it is a nasal; whether it is a front nasal; whether it is a back nasal; whether it is a monophthong; whether it is a diphthong; whether it is a voiced sound; whether it is a labial sound; whether it is an apical sound; whether it is a front apical sound; whether it is a back apical sound; vowel; whether it is a vowel containing I; whether it is a vowel containing O; whether it is a vowel containing U; whether it is a vowel containing V; whether it is a stop; whether it is silence; whether it is an initial; whether it is a final}.
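  • Such yes/no questions map naturally onto a binary vector per phoneme. The following is a minimal sketch that uses only a handful of the listed questions; the feature names, the tiny `PHONEME_TABLE` and the `independent_features` helper are illustrative assumptions, not the feature set or dictionary of the source:

```python
# A reduced, hypothetical question set (the source lists many more).
FEATURE_NAMES = [
    "is_nasal", "is_vowel", "is_fricative", "is_stop",
    "is_silence", "is_initial", "is_final",
]

# Hypothetical toy table mapping a phoneme to the questions answered "yes".
PHONEME_TABLE = {
    "n": {"is_nasal", "is_initial"},
    "f": {"is_fricative", "is_initial"},
    "b": {"is_stop", "is_initial"},
    "a": {"is_vowel", "is_final"},
    "sil": {"is_silence"},
}

def independent_features(phoneme):
    """Encode one phoneme as a 0/1 vector, one entry per question."""
    traits = PHONEME_TABLE[phoneme]
    return [1 if name in traits else 0 for name in FEATURE_NAMES]

vector_n = independent_features("n")
vector_a = independent_features("a")
```

  • Each position of the vector answers one question, so the vector length equals the number of questions analyzed.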
  • Adjoining, in time sequence, pronunciation units with different pronunciation features may affect the action features of the animation corresponding to the current pronunciation unit.
  • Accordingly, step S10431 may also include the steps of: for each pronunciation unit, analyzing the pronunciation features of the pronunciation unit to obtain its independent linguistic feature; analyzing the pronunciation features of the adjacent pronunciation units of the pronunciation unit to obtain its adjacent linguistic feature; and generating the linguistic feature of the pronunciation unit based on its independent linguistic feature and adjacent linguistic feature.
  • All adjacent pronunciation units of each pronunciation unit can be analyzed within a certain time window. The dimensions of the analysis include, but are not limited to, how many vowels or consonants are in the left window of the current pronunciation unit, and how many front nasals or back nasals are in the right window of the current pronunciation unit.
  • The types of pronunciation features of the adjacent pronunciation units, and the number of occurrences of each, are counted, and the adjacent linguistic features are obtained from the statistical results.
  • quantized statistical features can be used as the adjacent linguistic features of the current pronunciation unit.
  • The adjacent pronunciation units of a pronunciation unit may include a preset number of pronunciation units located before and after it in time sequence, with the pronunciation unit at the center.
  • the specific value of the preset number may be determined according to experiments, for example, according to the evaluation index during training of the preset time sequence mapping model.
  • the statistical features on the right side of the pronunciation unit are uniformly zeroed.
  • the statistical features on the left side of the pronunciation unit are uniformly zeroed.
  • For example, with the current phoneme as the center, 20 consecutive phonemes can be taken on each of the left and right sides, and the pronunciation features of all these phonemes can be counted.
  • The statistical dimensions for the pronunciation features of the 20 phonemes located on the left and right sides of the current phoneme may include {how many vowels are on the left side of the central pronunciation unit; how many consonants are on the left side of the central pronunciation unit; how many vowels are on the right side of the central pronunciation unit; how many consonants are on the right side of the central pronunciation unit; how many adjacent vowels are on the left side of the central pronunciation unit; how many adjacent consonants are on the left side of the central pronunciation unit; how many adjacent vowels are on the right side of the central pronunciation unit; how many adjacent consonants are on the right side of the central pronunciation unit; how many adjacent front nasals are on the left side of the central pronunciation unit; how many adjacent back nasals are on the left side of the central pronunciation unit; how many adjacent front nasals are on the right side of the central pronunciation unit; how many adjacent back nasals are on the right side of the central pronunciation unit}.
  • The independent linguistic features and the adjacent linguistic features of the pronunciation unit are combined to obtain the complete linguistic feature of the pronunciation unit.
  • The linguistic feature of the pronunciation unit can be obtained by splicing the independent linguistic features and the adjacent linguistic features in quantized, coded form. That is, the linguistic feature of the pronunciation unit is a long array consisting of a series of quantized values.
  • The linguistic features of the pronunciation units, arranged in time sequence, are spliced together to obtain a quantized linguistic feature sequence.
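  • A minimal sketch of the window statistics and the splicing step. It assumes a small window, a toy vowel set, and a one-entry independent feature; every name and value here is an illustrative assumption. Units near the start or end of the sequence simply have fewer neighbors, so the missing side counts as zero:

```python
VOWELS = {"a", "o", "e", "i", "u"}  # toy vowel set for illustration

def independent_features(unit):
    """Hypothetical one-entry independent feature: is the unit a vowel?"""
    return [1 if unit in VOWELS else 0]

def adjacent_features(sequence, index, window):
    """Count vowels and non-vowels on each side within the window."""
    left = sequence[max(0, index - window):index]
    right = sequence[index + 1:index + 1 + window]

    def count(units):
        vowels = sum(1 for unit in units if unit in VOWELS)
        return [vowels, len(units) - vowels]

    # Layout: [left vowels, left consonants, right vowels, right consonants]
    return count(left) + count(right)

def linguistic_feature_sequence(sequence, window):
    """Splice independent and adjacent features for every unit, in order."""
    return [
        independent_features(unit) + adjacent_features(sequence, i, window)
        for i, unit in enumerate(sequence)
    ]

features = linguistic_feature_sequence(["n", "i", "h", "a", "o"], window=2)
```

  • Each row of `features` is the quantized array for one pronunciation unit, and the rows together form the time-ordered linguistic feature sequence. Treating every non-vowel as a consonant is a simplification made for this sketch.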
  • the linguistic feature sequence is a quantitative expression of the features of the input information, and the expression mode is not restricted by a specific speaker, and does not need to be driven by a specific speaker.
  • step S1044 can be executed to input the linguistic feature sequence into the learned preset time sequence mapping model to obtain a corresponding virtual object animation data sequence.
  • the step S1044 may include the following steps:
  • Step S10441: performing multi-dimensional information extraction on the linguistic feature sequence based on the preset time sequence mapping model, wherein the multi-dimensional information includes a time dimension and a linguistic feature dimension;
  • Step S10442: performing feature domain mapping and feature dimension transformation on the multi-dimensional information extraction result based on the preset time sequence mapping model, to obtain expression parameters and/or action parameters of the virtual object;
  • the mapping of the feature domain refers to the mapping of the linguistic feature domain to the animation feature domain of the virtual object, and the animation feature domain of the virtual object includes the expression feature and/or the action feature of the virtual object.
  • Since the length of the audio information or text information input in step S1041 is not fixed, the variable-length sequence information (i.e., the linguistic feature sequence) obtained by processing the input information can be processed based on a recurrent neural network (RNN) and its variants, such as a long short-term memory network (LSTM), so as to extract feature information from the sequence as a whole.
  • feature mapping models usually involve feature domain transformation and feature dimension transformation.
  • this conversion function can be implemented based on a Fully Connected Network (FCN for short).
  • The RNN can process the input features along the time dimension. To process the features in more dimensions and extract higher-dimensional feature information, thereby enhancing the generalization ability of the model, the input information can also be processed based on a convolutional neural network (CNN) and its variants (such as dilated convolution, causal convolution, etc.).
  • Feature mapping models such as the preset time sequence mapping model usually involve feature domain transformation and feature dimension transformation.
  • this conversion function can be implemented based on a Fully Connected Network (FCN for short).
  • The model can be trained using training data prepared in advance and machine learning technology to find the optimal parameters of the preset time sequence mapping model, thereby realizing the mapping from linguistic feature sequences to virtual object animation sequences.
  • The preset time sequence mapping model may be a model that can use time sequence information (such as time-synchronized text information and audio information) to predict other time sequence information (such as virtual object animation).
  • the training data of the preset time sequence mapping model may include text information, voice data synchronized with the text information, and virtual object animation data.
  • A professional recording engineer and actors may perform the corresponding voice data and action data (with voice and action in one-to-one correspondence) according to rich, emotional text information.
  • The action data includes facial expression actions and body movements. Facial expression actions involve information such as expression and eye gaze.
  • the data of the virtual object facial expression controller is obtained.
  • Body movements can be obtained by capturing high-quality posture information data of actors' performances through the performance capture platform, and body movement data and expression data have temporal correspondence.
  • the corresponding virtual object animation data can be obtained by mapping based on the digitized vector sequence (ie, the linguistic feature sequence).
  • the driving of body movements can also be implemented based on the controller.
  • the driving of the limb movements may also be bone-driven.
  • the preset time sequence mapping model may be a convolutional network-long short-term memory network-deep neural network (Convolutional LSTM Deep Neural Networks, CLDNN for short).
  • The structure of the preset time sequence mapping model is not limited to this.
  • For example, the preset time sequence mapping model may be any one of the above three networks, or a combination of any two of them.
  • the preset time sequence mapping model may include: a multi-layer convolutional network, configured to receive the linguistic feature sequence and perform multi-dimensional information extraction on the linguistic feature sequence.
  • the multi-layer convolutional network may include a four-layer dilated convolutional network, which is used to perform multi-dimensional information extraction on the quantized linguistic feature sequence processed in step S1043.
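  • To illustrate why stacked dilated convolutions are useful along the time dimension, the following sketch computes how many input time steps one output step can see after four layers. The kernel size of 3 and the dilation rates 1, 2, 4 and 8 are assumptions chosen for illustration; the source states only that there are four dilated layers:

```python
def dilated_conv1d(x, kernel, dilation):
    """Same-length 1-D dilated convolution with zero padding.

    Assumes an odd kernel length so the kernel is centered on each step.
    """
    half = (len(kernel) // 2) * dilation
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            src = t - half + j * dilation
            if 0 <= src < len(x):
                acc += weight * x[src]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Input steps visible to one output step after stacking the layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

DILATIONS = [1, 2, 4, 8]  # assumed rates, one per layer

# Push a single impulse through the stack and count where it spreads.
signal = [0.0] * 101
signal[50] = 1.0
for d in DILATIONS:
    signal = dilated_conv1d(signal, [1.0, 1.0, 1.0], d)
touched = sum(1 for value in signal if value != 0.0)
```

  • With these assumed settings, `touched` equals `receptive_field(3, DILATIONS)`: four small layers already cover several dozen neighboring pronunciation units, which is how a shallow stack can gather wide temporal context.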
  • The linguistic feature sequence can be two-dimensional data. Assuming that each pronunciation unit is represented by a pronunciation feature of length 600 and there are 100 pronunciation units in total, the linguistic feature sequence input into the preset time sequence mapping model is a 100 × 600 two-dimensional array, where the dimension of size 100 is the time dimension and the dimension of size 600 is the linguistic feature dimension.
  • the multi-layer convolutional network performs feature operations in two dimensions, time and linguistic features.
  • The preset time sequence mapping model may further include: a long short-term memory network for performing information aggregation processing on the information extraction results in the time dimension.
  • The long short-term memory network may include a two-layer stacked bidirectional LSTM network, coupled with the multi-layer convolutional network to obtain the information extraction results in the time dimension that the multi-layer convolutional network outputs for the linguistic feature sequence. Further, the two-layer stacked bidirectional LSTM network performs high-dimensional information processing on these time-dimension information extraction results, so as to further obtain feature information in the time dimension.
  • The preset time sequence mapping model may further include: a deep neural network, coupled with the multi-layer convolutional network and the long short-term memory network, wherein the deep neural network is used to perform feature domain mapping and feature dimension transformation on the multi-dimensional information extraction results output by the multi-layer convolutional network and the long short-term memory network, so as to obtain the expression parameters and/or action parameters of the virtual object.
  • The deep neural network can receive the information extraction results in the linguistic feature dimension output by the multi-layer convolutional network, and can also receive the updated information extraction results in the time dimension output by the long short-term memory network.
  • the dimension transformation may refer to dimension reduction.
  • For example, the input of the preset time sequence mapping model has 600 features and the output has 100 features.
  • the deep neural network may include: multiple fully connected layers connected in series, wherein the first fully connected layer is used to receive the multi-dimensional information extraction results, and the last fully connected layer outputs the virtual object expression parameters and/or action parameters.
  • the number of the fully connected layers may be three.
  • The deep neural network may further include: a plurality of nonlinear transformation modules, each coupled between two adjacent fully connected layers (there being none after the last fully connected layer), wherein each nonlinear transformation module performs nonlinear transformation processing on the output of the preceding fully connected layer to which it is coupled and inputs the result into the following fully connected layer to which it is coupled.
  • The nonlinear transformation module may be a rectified linear unit (ReLU) activation function.
  • The nonlinear transformation module can improve the expressive ability and generalization ability of the preset time sequence mapping model.
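  • A minimal sketch of such a head: three fully connected layers in series, with ReLU applied between adjacent layers but not after the last one. The layer sizes and weights are tiny hand-picked values for illustration only; a trained model would learn its weights and use much larger layers:

```python
def relu(vector):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, value) for value in vector]

def linear(vector, weights, bias):
    """One fully connected layer: weights is a list of output rows."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]

def dnn_head(vector, layers):
    """Apply the layers in series, with ReLU between adjacent layers only."""
    for i, (weights, bias) in enumerate(layers):
        vector = linear(vector, weights, bias)
        if i < len(layers) - 1:  # no nonlinearity after the last layer
            vector = relu(vector)
    return vector

# Hypothetical weights: 4 inputs -> 3 -> 3 -> 2 outputs.
LAYERS = [
    ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]], [0.0, 0.0, 0.0]),
    ([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 1.0]], [0.0, 1.0, 0.0]),
    ([[1.0, 1.0, 1.0], [-1.0, 0.0, 0.0]], [0.0, 0.5]),
]
output = dnn_head([1.0, -2.0, 3.0, 5.0], LAYERS)
```

  • The output has fewer entries than the input, which mirrors the dimension reduction described above, and it may contain negative values because the last layer is left linear.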
  • The multi-layer convolutional network, the long short-term memory network and the deep neural network can be connected in series in that order. The information extraction results in the linguistic feature dimension output by the multi-layer convolutional network are passed via the long short-term memory network to the deep neural network, and the information extraction results in the time dimension output by the multi-layer convolutional network are processed by the long short-term memory network and then transmitted to the deep neural network.
  • Driving the model does not depend on a specific voice actor, which removes the dependence on a particular voice actor and helps reduce labor cost in the animation production process.
  • the solution of this embodiment can output high-quality virtual object animation, especially 3D animation, which reduces the labor cost and time cost of manual trimming of animation by animators and artists, and helps to improve animation production efficiency.
  • The solution of this embodiment can receive different types of input information, which widens its scope of application and helps to further reduce the cost and improve the efficiency of animation production.
  • the traditional end-to-end virtual object animation synthesis technology mainly generates two-dimensional animation, while the solution of this embodiment can generate high-quality three-dimensional animation, and can also generate two-dimensional animation.
  • the "virtual object animation sequence” described in this embodiment is a generalized expression of the quantized animation data or animation controller, and is not limited to two-dimensional or three-dimensional animation.
  • Speech with emotion is synthesized by analyzing the emotional features and prosodic boundaries of the text. Further, a corresponding virtual object animation is generated based on the text and the speech with emotion. Further, the time-ordered data of the generated virtual object animation is synchronized in time with the audio information, which makes it possible to generate a virtual object animation directly from text, and the generated virtual object animation can stay synchronized with the emotional speech as it moves in time sequence.
  • FIG. 5 is a schematic structural diagram of a text-based virtual object animation generation device according to an embodiment of the present invention.
  • the text-based virtual object animation generation device 5 in this embodiment can be used to implement the method and technical solution described in any of the embodiments in FIG. 1 to FIG. 4 .
  • The text-based virtual object animation generation device 5 in this embodiment may include: an acquisition module 51, configured to acquire text information, wherein the text information includes the original text of the virtual object animation to be generated;
  • an analysis module 52, configured to analyze the emotional features and prosodic boundaries of the text information;
  • a speech synthesis module 53, configured to perform speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information, wherein the audio information includes speech with emotion converted from the original text;
  • and a processing module 54, configured to generate a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information.
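  • The way the four modules hand data to one another can be sketched as follows. This is only a structural illustration: the class name, the callable signatures and the stub implementations are hypothetical, and the check that there is one animation frame per audio frame is a simplifying assumption about what time synchronization means:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TextToAnimationDevice:
    acquire: Callable[[], str]                                      # module 51
    analyze: Callable[[str], Tuple[List[str], List[int]]]           # module 52
    synthesize: Callable[[str, List[str], List[int]], List[float]]  # module 53
    animate: Callable[[str, List[float]], List[List[float]]]        # module 54

    def run(self):
        text = self.acquire()
        emotions, boundaries = self.analyze(text)
        audio = self.synthesize(text, emotions, boundaries)
        animation = self.animate(text, audio)
        # Simplified synchronization check: one animation frame per audio frame.
        if len(animation) != len(audio):
            raise RuntimeError("animation and audio are not time-synchronized")
        return audio, animation

# Placeholder stubs standing in for the real models.
device = TextToAnimationDevice(
    acquire=lambda: "hello world",
    analyze=lambda text: (["neutral"] * len(text.split()),
                          [len(word) for word in text.split()]),
    synthesize=lambda text, emotions, boundaries: [0.0] * (10 * len(emotions)),
    animate=lambda text, audio: [[0.0, 0.0] for _ in audio],
)
audio, animation = device.run()
```

  • Keeping each module behind a plain callable makes it straightforward to swap a stub for a trained model without changing the flow between modules.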
  • The text-based virtual object animation generation method described in this embodiment may be implemented by a text-based virtual object animation generation system.
  • The text-based virtual object animation generation system may include: a collection module for collecting the text information; and the text-based virtual object animation generation device 5 shown in FIG. 5, wherein the acquisition module 51 is coupled with the collection module to receive the text information, and the text-based virtual object animation generation device 5 executes the text-based virtual object animation generation method shown in FIG. 1 to FIG. 4 to generate the corresponding virtual object animation and emotional speech.
  • the collection module may be a text input device such as a keyboard, and is used to collect the text information.
  • The text-based virtual object animation generation device 5 may be integrated in computing devices such as terminals and servers.
  • The text-based virtual object animation generation device 5 can be centrally integrated in the same server.
  • Alternatively, the text-based virtual object animation generation device 5 may be distributed across a plurality of terminals or servers that are coupled to each other.
  • the preset time sequence mapping model can be independently set on a terminal or server to ensure a better data processing speed.
  • Based on the text-based virtual object animation generation system described in this embodiment, the user provides input information at the collection module side, and the corresponding virtual object animation and the synchronized emotional speech can be obtained at the text-based virtual object animation generation device 5 side.
  • an embodiment of the present invention further discloses a storage medium on which a computer program is stored, and when the computer program is run by a processor, the method and technical solutions described in the embodiments shown in FIG. 1 to FIG. 4 are executed.
  • the storage medium may include a computer-readable storage medium such as a non-volatile memory or a non-transitory memory.
  • the storage medium may include ROM, RAM, magnetic or optical disks, and the like.
  • An embodiment of the present invention also discloses a terminal, including a memory and a processor, wherein the memory stores a computer program that can run on the processor, and the processor executes the technical solutions of the methods described in the embodiments shown in FIG. 1 to FIG. 4 when running the computer program.

Abstract

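A text-based virtual object animation generation method and device, a storage medium, and a terminal. The method includes: acquiring text information, wherein the text information includes the original text of the virtual object animation to be generated; analyzing the emotional features and prosodic boundaries of the text information; performing speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information, wherein the audio information includes speech with emotion converted from the original text; and generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information. The solution of the present invention can quickly and efficiently generate a virtual object animation with emotional speech from text, is highly versatile, and does not need to be driven by a specific voice actor.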

Description

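Text-based virtual object animation generation method and device, storage medium, and terminal
This application claims priority to Chinese patent application No. 202010905539.7, filed with the China Patent Office on September 1, 2020 and entitled "Text-based virtual object animation generation method and device, storage medium, and terminal", the entire contents of which are incorporated herein by reference.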
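Technical Field
The present invention relates to the technical field of virtual digital objects, and in particular to a text-based virtual object animation generation method and device, a storage medium, and a terminal.
Background
With the rapid development of virtual digital object (virtual object for short) technology, the animation industry and related fields, market demand for the fast, automatic generation of realistic, lifelike virtual characters is growing by the day. Specifically, a system for quickly generating virtual object animation is mainly concerned with how to quickly and efficiently generate speech with emotion, and the corresponding virtual object animation, from text.
When producing such data, a traditional system requires a professional recording engineer to do the voice-over, and artists to produce the corresponding virtual object expressions and body movements. This requires a large investment of labor and time.
Moreover, this production approach relies on voice actors with specific voice characteristics, which severely restricts the versatility of the technology and any further reduction in production cost. Artists also need to repair the actors' movements by hand, which is time-consuming.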
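Summary of the Invention
The technical problem solved by the present invention is how to quickly and efficiently generate a virtual object animation with emotional speech from text.
To solve the above technical problem, an embodiment of the present invention provides a text-based virtual object animation generation method, including: acquiring text information, wherein the text information includes the original text of the virtual object animation to be generated; analyzing the emotional features and prosodic boundaries of the text information; performing speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information, wherein the audio information includes speech with emotion converted from the original text; and generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information.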
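Optionally, analyzing the emotional features and prosodic boundaries of the text information includes: performing word segmentation on the text information; for each word obtained by the segmentation, performing emotion analysis on the word to obtain the emotional feature of the word; and determining the prosodic boundary of each word.
Optionally, analyzing the emotional features and prosodic boundaries of the text information includes: analyzing the emotional features of the text information based on a preset text front-end prediction model, wherein the input of the preset text front-end prediction model is the text information, and the output of the preset text front-end prediction model is the emotional features, prosodic boundaries and word segmentation of the text information.
Optionally, performing speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information includes: inputting the text information, the emotional features and the prosodic boundaries into a preset speech synthesis model, wherein the preset speech synthesis model is used to convert an input text sequence into a speech sequence in time sequence, and the speech in the speech sequence carries the emotion of the text at the corresponding point in time; and acquiring the audio information output by the preset speech synthesis model.
Optionally, the preset speech synthesis model is obtained by training based on training data, wherein the training data includes text information samples and corresponding audio information samples, and the audio information samples are pre-recorded according to the text information samples.
Optionally, the training data further includes extended samples, wherein the extended samples are obtained by slicing and recombining the speech and text of the text information samples and the corresponding audio information samples.
Optionally, generating a corresponding virtual object animation based on the text information and the audio information includes: receiving input information, wherein the input information includes the text information and the audio information; converting the input information into a pronunciation unit sequence; performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence; and inputting the linguistic feature sequence into a preset time sequence mapping model to generate the corresponding virtual object animation based on the linguistic feature sequence.
Optionally, generating a corresponding virtual object animation based on the text information and the audio information includes: inputting the text information and the audio information into a preset time sequence mapping model to generate the corresponding virtual object animation.
Optionally, the preset time sequence mapping model is used to map the input feature sequence, in time sequence, to expression parameters and/or action parameters of the virtual object, so as to generate the corresponding virtual object animation.
Optionally, after acquiring the text information and before analyzing the emotional features and prosodic boundaries of the text information, the virtual object animation generation method further includes: normalizing the text information according to its context to obtain normalized text information.
Optionally, the normalization includes handling how numbers are read and how special characters are read.
Optionally, generating a corresponding virtual object animation based on the text information and the audio information includes: generating the corresponding virtual object animation based on the text information, the emotional features and prosodic boundaries of the text information, and the audio information.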
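To solve the above technical problem, an embodiment of the present invention further provides a text-based virtual object animation generation device, including: an acquisition module, configured to acquire text information, wherein the text information includes the original text of the virtual object animation to be generated; an analysis module, configured to analyze the emotional features and prosodic boundaries of the text information; a speech synthesis module, configured to perform speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information, wherein the audio information includes speech with emotion converted from the original text; and a processing module, configured to generate a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information.
To solve the above technical problem, an embodiment of the present invention further provides a storage medium on which a computer program is stored, wherein the steps of the above method are performed when the computer program is run by a processor.
To solve the above technical problem, an embodiment of the present invention further provides a terminal, including a memory and a processor, wherein the memory stores a computer program that can run on the processor, and the processor performs the steps of the above method when running the computer program.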
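Compared with the prior art, the technical solutions of the embodiments of the present invention have the following beneficial effects:
An embodiment of the present invention provides a text-based virtual object animation generation method, including: acquiring text information, wherein the text information includes the original text of the virtual object animation to be generated; analyzing the emotional features and prosodic boundaries of the text information; performing speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information, wherein the audio information includes speech with emotion converted from the original text; and generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information.
Compared with existing technical solutions, which must rely on the specific audio characteristics of a voice actor to drive virtual object animation generation, this embodiment can quickly and efficiently generate a virtual object animation with emotional speech from text, in particular a 3D animation; it is highly versatile and does not need to be driven by a specific voice actor. Specifically, speech with emotion is synthesized by analyzing the emotional features and prosodic boundaries of the text. Further, a corresponding virtual object animation is generated based on the text and the speech with emotion. Further, the time-ordered data of the generated virtual object animation is synchronized in time with the audio information, which makes it possible to generate a virtual object animation directly from text, and the generated virtual object animation can stay synchronized with the emotional speech as it moves in time sequence.
Further, generating a corresponding virtual object animation based on the text information and the audio information includes: receiving input information, wherein the input information includes the text information and the audio information; converting the input information into a pronunciation unit sequence; performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence; and inputting the linguistic feature sequence into a preset time sequence mapping model to generate the corresponding virtual object animation based on the linguistic feature sequence.
With this embodiment, the corresponding linguistic feature sequence is extracted from the original audio or text and used as the input information of the preset time sequence mapping model. Linguistic features are related only to the semantic content of the audio and are unrelated to speaker-dependent features such as timbre, pitch, and fundamental frequency (F0). Therefore, the solution of this embodiment is not limited to a specific speaker, and original audio with different audio characteristics can all be used with the preset time sequence mapping model described in this embodiment. In other words, because the solution of this embodiment does not analyze the audio features in the audio information, but instead converts the audio information into pronunciation units and analyzes the linguistic features of those pronunciation units, it becomes possible to drive the neural network model to generate virtual object animation without relying on specific audio features. Thus, the end-to-end virtual object animation generation method provided by this embodiment is applicable to end-to-end virtual object animation generation for any voice actor and any text, solves the dependence on specific voice actors in existing end-to-end automated speech-driven virtual object animation synthesis technology, and truly achieves the "versatility" of the technology.
Further, the preset time sequence mapping model is built by training based on deep learning technology, and the input linguistic feature sequence is then mapped to the corresponding expression parameters and/or action parameters of the virtual object based on the preset time sequence mapping model. The animation generation process requires no participation by animators or artists and relies entirely on automatic computation, which greatly reduces labor and time costs and realizes end-to-end automated virtual object animation synthesis in the true sense.
Further, the originally received input information may be text information or audio information, so that the solution of this embodiment can generate the corresponding virtual object animation according to different input modalities.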
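Brief Description of the Drawings
FIG. 1 is a flowchart of a text-based virtual object animation generation method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a specific implementation of step S104 in FIG. 1;
FIG. 3 is a flowchart of a specific implementation of step S1043 in FIG. 2;
FIG. 4 is a flowchart of a specific implementation of step S1044 in FIG. 2;
FIG. 5 is a schematic structural diagram of a text-based virtual object animation generation apparatus according to an embodiment of the present invention.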
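Detailed Description
As stated in the background, existing virtual object animation generation techniques must be driven by a specific speaker and are therefore poorly generalizable. In addition, the production process needs manual support from artists, so both labor and time costs are very high.
To solve the above technical problem, an embodiment of the present invention provides a text-based virtual object animation generation method, comprising: acquiring text information, wherein the text information includes the original text for which a virtual object animation is to be generated; analyzing emotional features and prosodic boundaries of the text information; performing speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information, wherein the audio information includes emotional speech converted from the original text; and generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information.
This embodiment can quickly and efficiently generate, from text, a virtual object animation with emotional speech, in particular a 3D animation; it is highly general and needs no specific voice actor to drive it. Specifically, emotional speech is synthesized by analyzing the emotional features and prosodic boundaries of the text. Further, the corresponding virtual object animation is generated based on the text and the emotional speech. Further, the time-ordered data of the generated virtual object animation is synchronized in time with the audio information, which makes it possible to generate a virtual object animation directly from text, and the generated animation stays synchronized with the emotional speech as it plays out in time order.
To make the above objects, features and beneficial effects of the present invention easier to understand, specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.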
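FIG. 1 is a flowchart of a text-based virtual object animation generation method according to an embodiment of the present invention.
The solution of this embodiment can be applied to scenarios such as virtual digital object generation and animation production.
The virtual object may be a virtual human, or another type of virtual object such as a virtual animal or a virtual plant. Examples include a virtual digital human voice assistant, a virtual teacher, a virtual consultant and a virtual news anchor. The virtual object may be three-dimensional or two-dimensional.
The text-based virtual object animation generation method of this embodiment can be understood as an end-to-end virtual object animation generation solution. The user only needs to provide the original text and input it into a computer that executes this embodiment, and the corresponding virtual object animation and the synchronized emotional speech are generated.
For example, the user inputs the original text into a computer that executes this embodiment, and the corresponding three-dimensional (3D) virtual object animation and the synchronized emotional speech are generated. The appearance of the virtual object can be set according to the actual situation and may be a three-dimensional or a two-dimensional virtual object.
End-to-end may mean that every operation from the input end to the output end is carried out by the computer, with no human (such as an animator) intervening in between. The input end is the port that receives the original audio or original text, and the output end is the port that generates and outputs the virtual object animation.
The virtual object animation output by the output end may include a controller used to generate the virtual object animation, which takes the concrete form of a sequence of digitized vectors. For example, the virtual object animation may include a lip animation; the lip animation controller output by the output end may include offset information of lip feature points, and feeding this controller into a rendering engine drives the lips of the virtual object to make the corresponding movements.
In other words, the controller used to generate the virtual object animation may be a sequence of virtual object animation data, in which the data is arranged in the time order of the input information and is synchronized with the audio data generated from the input information. The virtual object animation data can drive the facial expression motion and body posture motion of the virtual object, and the final virtual object animation is obtained through the rendering engine.
The virtual object animation data may include facial expression motion data and body motion data of the virtual object. Facial expression motion includes information such as expression, gaze and lip shape, and body motion may include information such as the body posture and gestures of the virtual object. In this embodiment, the facial expression motion data is referred to as the expression parameters of the virtual object, and the body motion data is referred to as the action parameters of the virtual object.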
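Specifically, referring to FIG. 1, the text-based virtual object animation generation method of this embodiment may include the following steps:
Step S101: acquire text information, wherein the text information includes the original text for which a virtual object animation is to be generated;
Step S102: analyze emotional features and prosodic boundaries of the text information;
Step S103: perform speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information, wherein the audio information includes emotional speech converted from the original text;
Step S104: generate a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information.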
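The data flow of steps S101 to S104 can be sketched as follows. This is only an illustrative skeleton under stated assumptions: the names `FrontEndResult` and `generate_animation`, and the idea of passing the three models in as callables, are hypothetical and do not come from the source; the actual models are the preset front-end, speech synthesis and time-series mapping models described in this document.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class FrontEndResult:
    """Hypothetical container for the output of step S102."""
    words: List[str]        # word segmentation result
    emotions: List[str]     # one emotional feature per word
    boundaries: List[int]   # one prosodic-boundary label per word


def generate_animation(
    original_text: str,
    analyze: Callable[[str], FrontEndResult],                    # stands in for S102
    synthesize: Callable[[str, FrontEndResult], List[float]],    # stands in for S103
    animate: Callable[[str, List[float]], List[List[float]]],    # stands in for S104
) -> Tuple[List[float], List[List[float]]]:
    text = original_text                 # S101: acquire the text information
    front = analyze(text)                # S102: emotional features and prosodic boundaries
    audio = synthesize(text, front)      # S103: emotional speech
    frames = animate(text, audio)        # S104: animation data synchronized with the audio
    return audio, frames
```

Passing the models as parameters keeps the sketch independent of any particular network architecture; only the order of the four steps is taken from the text.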
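In a specific implementation, the text information may be acquired from a user terminal that needs to generate a virtual object animation.
Specifically, the original text may be one sentence or a passage containing multiple sentences.
Further, the original text may contain common characters such as Chinese characters, English letters, digits and special characters.
In a specific implementation, the text information may be entered in real time through a device such as a keyboard. Alternatively, the text information may be collected in advance and transmitted, in wired or wireless form, to the computing device that executes the solution of this embodiment when the corresponding virtual object animation needs to be generated.
In a specific implementation, after step S101 and before step S102, the virtual object animation generation method of this embodiment may further include the step of normalizing the text information according to its context to obtain normalized text information.
Specifically, the normalization may include number-reading processing and special-character-reading processing.
Number-reading processing can determine the correct reading of numbers in the original text by rule matching. For example, the number "110" can be read either as "一百一十" ("one hundred and ten") or as "幺幺零" (digit by digit, "one-one-zero"); when processing "110", its correct reading can be determined from the context before and after it. As another example, the number "1983" can be read either as "一九八三" (digit by digit) or as "一千九百八十三" ("one thousand nine hundred and eighty-three"); if the text following "1983" in the original text is "年" ("year"), the correct reading here is determined to be "一九八三".
Special-character-reading processing can determine the correct reading of special characters in the original text by rule matching. A reading dictionary for special characters can be built in advance and used to process the special characters in the original text. For example, the special character "¥" is the renminbi symbol and can be read directly as "元" ("yuan").
The normalization may also include reading processing for polyphonic characters, which determines the correct reading of a polyphonic character according to its context.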
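A minimal sketch of rule-based normalization, assuming a very small rule set. The two context rules ("年" after the number means a year, "拨打" before the number means a telephone-style code), the one-entry dictionary `SPECIAL_READINGS` and all function names are hypothetical illustrations; the source only states that readings are chosen by rule matching on the context.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]
SPECIAL_READINGS = {"¥": "元"}   # hypothetical reading dictionary


def read_digit_by_digit(num: str) -> str:
    return "".join(DIGITS[int(d)] for d in num)


def read_as_code(num: str) -> str:
    # Telephone-style reading: "1" is pronounced "幺".
    return "".join("幺" if d == "1" else DIGITS[int(d)] for d in num)


def read_as_quantity(num: str) -> str:
    # Simplified cardinal reading for values below 10000.
    digits = str(int(num))
    if digits == "0":
        return DIGITS[0]
    out, pending_zero = [], False
    for i, ch in enumerate(digits):
        d = int(ch)
        if d == 0:
            pending_zero = True
            continue
        if pending_zero and out:
            out.append(DIGITS[0])    # one "零" for a run of inner zeros
        pending_zero = False
        out.append(DIGITS[d] + UNITS[len(digits) - 1 - i])
    return "".join(out)


def normalize(text: str) -> str:
    def number_reading(m: re.Match) -> str:
        num = m.group(0)
        before = text[max(0, m.start() - 2):m.start()]
        after = text[m.end():m.end() + 1]
        if after == "年":                  # rule: years are read digit by digit
            return read_digit_by_digit(num)
        if before == "拨打":               # rule: dialled numbers are read as codes
            return read_as_code(num)
        if len(num) > 4:                   # outside the simplified cardinal range
            return read_digit_by_digit(num)
        return read_as_quantity(num)

    result = re.sub(r"[0-9]+", number_reading, text)
    for symbol, reading in SPECIAL_READINGS.items():
        result = result.replace(symbol, reading)
    return result
```

A production front end would need many more rules (decimals, dates, units, polyphonic characters); the sketch only shows how the surrounding context selects between competing readings of the same digits.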
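Further, the normalized text information can serve as the data basis for steps S102 to S104.
In a specific implementation, step S102 may include the steps of: performing word segmentation on the text information; for each word obtained by the segmentation, performing sentiment analysis on the word to obtain its emotional features; and determining the prosodic boundary of each word.
Specifically, word segmentation may be performed on the normalized text information using natural language processing to obtain words of the smallest unit. A word of the smallest unit may be a single character, or a phrase, idiom or the like that expresses a specific meaning.
Further, the emotional feature of each word obtained by the segmentation is determined, so as to obtain the emotional features of the normalized text information.
Further, when analyzing the emotional features and estimating the prosodic boundary of each word, the words before and after it may be taken into account.
In a specific implementation, step S102 may be performed based on a preset text front-end prediction model, which may include a coupled recurrent neural network (RNN) and conditional random field (CRF). The input of the preset text front-end prediction model is the text information, and its output is the emotional features, prosodic boundaries and word segmentation of the text information.
That is, this implementation uses an RNN+CRF deep learning model to quickly predict the emotional features and prosodic boundaries of each word in the text information.
It should be noted that the preset text front-end prediction model may output the emotional features, prosodic boundaries and word segmentation result of the text information at the same time. Inside the model, the procedure of step S102 described in the preceding implementation may be followed: word segmentation is performed first, and the segmentation result is then processed to obtain the corresponding emotional features and prosodic boundaries.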
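In a specific implementation, step S103 may include the steps of: inputting the text information, the emotional features and the prosodic boundaries into a preset speech synthesis model, wherein the preset speech synthesis model uses deep learning to convert the input text sequence into a speech sequence in time order, and the speech in the speech sequence carries the emotion of the text at the corresponding point in time; and acquiring the audio information output by the preset speech synthesis model.
Specifically, the emotion of the text at the corresponding point in time may include the emotional features and prosodic boundaries of the text.
Compared with existing speech synthesis solutions that synthesize speech from the original text alone, this implementation takes the original text together with its emotional features and prosodic boundaries as input and converts them into emotional speech using the preset speech synthesis model.
Further, the preset speech synthesis model may be a sequence-to-sequence (Seq-to-Seq) model.
For example, during speech synthesis, for each word obtained by the segmentation in step S102, the corresponding speech can be determined from the word's text, emotional feature and prosodic boundary. Passing all words of the text information through the speech synthesis model in time order yields the corresponding emotional speech sequence, which is likewise ordered in time and synchronized with the text sequence.
Further, the preset speech synthesis model can run in real time or offline. Running in real time means that emotional speech is synthesized while the text information, and the emotional features and prosodic boundaries predicted from it, are still being input as they are produced, as in live streaming of a virtual object animation. Running offline means that the complete text information, and the emotional features and prosodic boundaries predicted from it, are input before the corresponding emotional speech is synthesized, as in offline animation production.
As described above, the preset speech synthesis model can accurately and quickly convert text into high-quality emotional speech.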
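In a specific implementation, the preset speech synthesis model may be trained on training data, wherein the training data may include text information samples and corresponding audio information samples, and the audio information samples are pre-recorded according to the text information samples.
For example, the audio information samples may be recorded by a professional recording engineer in a recording studio according to the text information samples.
Further, the emotional features, prosodic boundaries and word segmentation of a recorded audio information sample can be determined from the textual context of the text information sample. The emotional features determined from the audio information sample in combination with the textual context are referred to as the standard emotional features of the text information sample.
Recorded speech carries emotion, whereas written text does not. Therefore, to ensure that emotional speech can be synthesized in a controllable way, information such as the corresponding emotion and prosodic boundaries must be added to the input text at synthesis time. Accordingly, in the training stage of the preset text front-end prediction model, it must be ensured that the emotional features predicted by the model (referred to as the predicted emotional features) match the standard emotional features determined when the speech was recorded.
Correspondingly, when training the preset text front-end prediction model, the difference between the predicted emotional features output by the model and the standard emotional features can be compared, and the model parameters adjusted accordingly.
Specifically, the training of the preset text front-end prediction model may be performed iteratively; that is, the parameters are continually optimized according to the difference between the predicted and standard emotional features, so that the predicted emotional features output by the model gradually approach the standard emotional features.
Further, the audio information samples may be emotional speech. A recorded audio information sample may carry the emotional coloring that suits the scenario of the text information.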
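In a specific implementation, the training data may further include extended samples, wherein the extended samples may be obtained by slicing and recombining the speech and text of the text information samples and the corresponding audio information samples.
Specifically, speech-text slicing and recombination may mean cutting the audio information samples and the text information samples into their smallest units and then rearranging and combining those units. This expands and augments the sample data, which helps train a deep learning model that generalizes well.
Further, the slicing may be performed according to the emotional features and prosodic boundaries to obtain the smallest units.
For example, suppose the training data already contains text A, "我来自沿海" ("I come from the coast"), with corresponding speech As, written as A<"我来自沿海", As>, and text B, "他来自重庆" ("He comes from Chongqing"), with speech Bs, written as B<"他来自重庆", Bs>. Suppose A can be cut into "我来自" and "沿海", written as A1<"我来自", As1> and A2<"沿海", As2>, and B can be cut into "他来自" and "重庆", written as B1<"他来自", Bs1> and B2<"重庆", Bs2>.
These can then be recombined into A1B2<"我来自重庆", As1Bs2> and B1A2<"他来自沿海", Bs1As2>.
The slicing and recombination described above must follow actual language usage, for example the order subject, predicate, object, rather than combining slices in arbitrary order.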
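The recombination in the example above can be sketched as follows. This is an illustrative sketch under the assumption that every sample has already been cut into the same number of ordered slots (here: subject plus predicate, then object), so that keeping the slot order preserves natural word order; the function name `recombine` and the use of string identifiers for audio slices are hypothetical.

```python
from itertools import product
from typing import List, Tuple

# (text slice, identifier of the matching audio slice)
Slice = Tuple[str, str]


def recombine(samples: List[List[Slice]]) -> List[Tuple[str, List[str]]]:
    """Build extended samples by mixing slices across samples, slot by slot."""
    n_slots = len(samples[0])
    if any(len(s) != n_slots for s in samples):
        raise ValueError("all samples must be cut into the same number of slots")
    originals = {tuple(s) for s in samples}
    slots = [[s[i] for s in samples] for i in range(n_slots)]
    extended = []
    for combo in product(*slots):          # slot order is never permuted
        if combo in originals:             # skip samples that already exist
            continue
        text = "".join(t for t, _ in combo)
        audio = [a for _, a in combo]
        extended.append((text, audio))
    return extended


A = [("我来自", "As1"), ("沿海", "As2")]
B = [("他来自", "Bs1"), ("重庆", "Bs2")]
extended_samples = recombine([A, B])
# extended_samples holds <"我来自重庆", As1 Bs2> and <"他来自沿海", Bs1 As2>
```

In a real pipeline the audio identifiers would be replaced by waveform segments that are concatenated, and the slots would come from the prosodic-boundary and emotional-feature slicing mentioned in the text.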
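In a specific implementation, step S104 may include the step of inputting the text information and the audio information into a preset time-series mapping model to generate the corresponding virtual object animation.
Specifically, the preset time-series mapping model may be used to map the input feature sequence, in time order, to expression parameters and/or action parameters of the virtual object, so as to generate the corresponding virtual object animation.
Further, in step S104, the text information, the emotional features and prosodic boundaries of the text information, and the audio information may be input together into the preset time-series mapping model to generate the corresponding virtual object animation.
The following describes in detail, as an example, virtual object animation generation based on linguistic feature analysis.
In a specific implementation, referring to FIG. 2, step S104 may include the following steps:
Step S1041: receive input information, wherein the input information includes the text information and the audio information;
Step S1042: convert the input information into a pronunciation unit sequence;
Step S1043: perform feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence;
Step S1044: input the linguistic feature sequence into a preset time-series mapping model to generate the corresponding virtual object animation based on the linguistic feature sequence.
Specifically, the preset time-series mapping model of this implementation can be applied to end-to-end virtual object animation generation with multimodal input and arbitrary speakers. Multimodal input may include speech input and text input. Arbitrary speaker may mean that no restriction is placed on the audio characteristics of the speaker.
More specifically, the linguistic feature sequence may include multiple linguistic features, each of which includes at least the pronunciation features of the corresponding pronunciation unit.
Further, the preset time-series mapping model may use deep learning to map the input linguistic feature sequence, in time order, to expression parameters and/or action parameters of the virtual object, so as to generate the corresponding virtual object animation.
Further, the pronunciation unit sequence and the linguistic feature sequence are both time-aligned sequences.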
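In a specific implementation, the input information may be divided into a pronunciation unit sequence composed of the smallest pronunciation units, which serves as the data basis for the subsequent linguistic feature analysis.
Specifically, step S1042 may include the steps of: converting the input information into pronunciation units and corresponding timecodes; and performing a time alignment operation on the pronunciation units according to the timecodes to obtain the time-aligned pronunciation unit sequence. For ease of description, this embodiment refers to the time-aligned pronunciation unit sequence simply as the pronunciation unit sequence.
A single pronunciation unit and its corresponding timecode are treated as one group of data. By performing step S1042, multiple such groups can be obtained from the input information, each containing a single pronunciation unit and its timecode. Using the timecodes, the pronunciation units in these groups can be aligned in time order to obtain the time-aligned pronunciation unit sequence.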
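One way to picture the time alignment is to expand the (unit, timecode) groups into one unit label per animation frame. The sketch below is an assumption-laden illustration: the source does not specify the timecode format or a frame rate, so the (start, end) timecode in seconds, the `frame_rate` parameter and the function name `align_units` are all hypothetical.

```python
from typing import List, Tuple

# (pronunciation unit, start time in seconds, end time in seconds)
TimedUnit = Tuple[str, float, float]


def align_units(units: List[TimedUnit], frame_rate: float) -> List[str]:
    """Return one pronunciation unit per frame, ordered by timecode."""
    if not units:
        return []
    ordered = sorted(units, key=lambda u: u[1])       # order groups by start timecode
    n_frames = round(ordered[-1][2] * frame_rate)
    frames: List[str] = []
    idx = 0
    for f in range(n_frames):
        t = (f + 0.5) / frame_rate                    # centre time of frame f
        # Advance to the unit whose time span covers this frame.
        while idx < len(ordered) - 1 and t >= ordered[idx][2]:
            idx += 1
        frames.append(ordered[idx][0])
    return frames
```

For instance, with the two groups ("n", 0.0 to 0.1 s) and ("i", 0.1 to 0.3 s) and a hypothetical rate of 10 frames per second, the first frame is labelled "n" and the next two are labelled "i", regardless of the order in which the groups were supplied.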
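When the input information is audio information, the audio information may first be converted into text information, which is then processed to obtain the pronunciation units and corresponding timecodes.
When the input information is text information, the text information may be processed directly to obtain the pronunciation units and corresponding timecodes.
Further, the text information may take textual forms such as words, characters, pinyin or phonemes.
When the input information is audio information, it may be converted into pronunciation units and corresponding timecodes based on automatic speech recognition (ASR) and a preset pronunciation dictionary.
When the input information is text information, the basic pronunciation units in the text, together with their arrangement in the time dimension and their durations, may be extracted using the front-end module and the alignment module of text-to-speech (TTS) technology, so as to obtain the time-aligned sequence of basic pronunciation units.
When the input information consists of both text information and audio information, the text information can serve as guidance for determining the duration of each speech segment in the audio information.
That is, in step S1042, when the input information is audio information, it may be converted into pronunciation units and corresponding timecodes based on speech recognition and a preset pronunciation dictionary, and the pronunciation units are then time-aligned according to the timecodes to obtain the time-aligned pronunciation unit sequence.
When the input information is text information, it may be converted into pronunciation units and corresponding timecodes based on speech synthesis technology, and the pronunciation units are then time-aligned according to the timecodes to obtain the time-aligned pronunciation unit sequence.
Taking phonemes as the pronunciation units, when the input information is audio information, the corresponding phoneme sequence and the duration of each phoneme can be extracted from the original audio based on speech recognition and a pronunciation dictionary prepared in advance.
As another example, when the input information is text information, the front-end module and the attention-based alignment module of TTS technology can be used to obtain the phoneme sequence of the original text, not yet time-aligned, together with the alignment matrix between the phonemes and the mel spectrogram of the output audio. A dynamic programming algorithm can then be used to find the phoneme corresponding to each time segment, yielding the time-aligned phoneme sequence.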
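The source does not say which dynamic programming algorithm is used. As one plausible illustration, the sketch below assumes a Viterbi-style search for the monotonic path through the alignment matrix that maximizes the summed log attention weights, where each time segment (frame) is assigned exactly one phoneme, the path starts at the first phoneme, ends at the last, and never moves backwards or skips a phoneme. The function name and these constraints are assumptions, not the patented procedure.

```python
import math
from typing import List


def monotonic_alignment(attn: List[List[float]]) -> List[int]:
    """attn[t][p] is the attention weight of phoneme p at frame t.

    Returns, for every frame, the index of the phoneme assigned to it.
    """
    n_frames, n_phonemes = len(attn), len(attn[0])
    if n_frames < n_phonemes:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_phonemes for _ in range(n_frames)]
    back = [[0] * n_phonemes for _ in range(n_frames)]
    score[0][0] = math.log(attn[0][0] + 1e-9)
    for t in range(1, n_frames):
        for p in range(min(t + 1, n_phonemes)):
            stay = score[t - 1][p]                            # same phoneme as previous frame
            move = score[t - 1][p - 1] if p > 0 else neg_inf  # advance to next phoneme
            best, prev = (stay, p) if stay >= move else (move, p - 1)
            if best == neg_inf:
                continue                                      # state not reachable
            score[t][p] = best + math.log(attn[t][p] + 1e-9)
            back[t][p] = prev
    # Trace back from the last phoneme at the last frame.
    path = [n_phonemes - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Counting how many consecutive frames map to each phoneme then gives the phoneme durations needed for the time-aligned phoneme sequence.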
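In a specific implementation, after the time-aligned pronunciation unit sequence is obtained, step S1043 may be performed to further improve the generalization ability of the preset time-series mapping model: linguistic feature analysis is carried out on the basic pronunciation unit sequence obtained in step S1042 to obtain the time-aligned linguistic feature sequence (referred to simply as the linguistic feature sequence).
Specifically, referring to FIG. 3, step S1043 may include the following steps:
Step S10431: perform feature analysis on each pronunciation unit in the pronunciation unit sequence to obtain the linguistic feature of each pronunciation unit;
Step S10432: generate the corresponding linguistic feature sequence based on the linguistic feature of each pronunciation unit.
More specifically, the linguistic features can be used to characterize the pronunciation features of a pronunciation unit. For example, the pronunciation features include, but are not limited to, whether the pronunciation unit is a front nasal or a back nasal, whether it is a monophthong or a diphthong, whether it is aspirated or unaspirated, whether it is a fricative, and whether it is an apical (tongue-tip) sound.
In a specific implementation, the linguistic feature of a pronunciation unit may include an independent linguistic feature obtained by analyzing the single pronunciation unit on its own.
Specifically, step S10431 may include the steps of: for each pronunciation unit, analyzing the pronunciation features of the unit to obtain its independent linguistic feature; and generating the linguistic feature of the unit based on its independent linguistic feature.
More specifically, the independent linguistic feature can be used to characterize the pronunciation features of the single pronunciation unit itself.
Taking phonemes as the pronunciation units, each phoneme in the time-aligned phoneme sequence obtained in step S1042 can be analyzed to obtain its pronunciation features.
The pronunciation features to be analyzed for each phoneme may include {whether it is a nasal; whether it is a front nasal; whether it is a back nasal; whether it is a monophthong; whether it is a diphthong; whether it is aspirated; whether it is a fricative; whether it is voiceless; whether it is voiced; whether it is a labial; whether it is an apical; whether it is a front apical; whether it is a back apical; whether it is a retroflex; whether it is a flat-tongue (non-retroflex) sound; whether it is a vowel containing A; whether it is a vowel containing E; whether it is a vowel containing I; whether it is a vowel containing O; whether it is a vowel containing U; whether it is a vowel containing V; whether it is a plosive; whether it is a silence marker; whether it is an initial; whether it is a final}.
For each phoneme, the answers to all of the above questions are determined, with 0 for "no" and 1 for "yes", so that the independent linguistic feature of each phoneme is generated in quantized, encoded form.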
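The 0/1 encoding described above can be sketched as follows. For brevity the sketch uses only a subset of the questions listed in the text, and the small property table `PHONEME_PROPERTIES` covering a handful of Mandarin pinyin units is a hypothetical stand-in for a full phoneme inventory; both it and the name `independent_features` are assumptions.

```python
from typing import Dict, List, Set

# A subset of the yes/no questions listed in the text, in a fixed order.
QUESTIONS: List[str] = [
    "nasal", "front_nasal", "back_nasal", "monophthong", "diphthong",
    "aspirated", "fricative", "plosive", "silence", "initial", "final",
]

# Hypothetical property table for a few pinyin units.
PHONEME_PROPERTIES: Dict[str, Set[str]] = {
    "p": {"aspirated", "plosive", "initial"},
    "s": {"fricative", "initial"},
    "a": {"monophthong", "final"},
    "ai": {"diphthong", "final"},
    "an": {"nasal", "front_nasal", "final"},
    "ang": {"nasal", "back_nasal", "final"},
    "sil": {"silence"},
}


def independent_features(phoneme: str) -> List[int]:
    """Answer every question with 1 (yes) or 0 (no) for a single phoneme."""
    props = PHONEME_PROPERTIES[phoneme]
    return [1 if q in props else 0 for q in QUESTIONS]
```

Because the vector depends only on the identity of the phoneme, the same phoneme spoken by different people yields the same encoding, which is the speaker independence the text relies on.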
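In a specific implementation, considering co-articulation and the continuity of the generated animation, the pronunciation units that are adjacent to a given unit in time and have different pronunciation features may affect the motion characteristics of the animation corresponding to the current unit. Therefore, step S10431 may further include the steps of: for each pronunciation unit, analyzing the pronunciation features of the unit to obtain its independent linguistic feature; analyzing the pronunciation features of the units adjacent to it to obtain its adjacent linguistic feature; and generating the linguistic feature of the unit based on its independent linguistic feature and adjacent linguistic feature.
Specifically, all adjacent pronunciation units of each pronunciation unit can be analyzed within a certain time window. The dimensions of the analysis include, but are not limited to, how many vowels or consonants lie in the window to the left of the current unit, and how many front nasals or back nasals lie in the window to its right.
For example, the types of pronunciation features of the adjacent units and the number of units sharing each feature are counted, and the adjacent linguistic feature is obtained from the statistics.
Further, the quantized statistical features can be used as the adjacent linguistic feature of the current pronunciation unit.
Further, the adjacent pronunciation units of a pronunciation unit may include a preset number of pronunciation units located before and after it in time, centered on that unit.
The specific value of the preset number can be determined experimentally, for example according to the evaluation metrics used when training the preset time-series mapping model.
For a pronunciation unit at the end of a sentence, the statistical features on its right side are all set to zero.
For a pronunciation unit at the beginning of a sentence, the statistical features on its left side are all set to zero.
Taking phonemes as the pronunciation units, for each phoneme in the time-aligned phoneme sequence obtained in step S1042, 20 consecutive phonemes may be taken on each side of the current phoneme, and the pronunciation features of all of these phonemes are counted.
The statistical dimensions for the 20 phonemes on each side of the current phoneme may include {how many vowels there are in total to the left of the central pronunciation unit; how many consonants there are in total to its left; how many vowels there are in total to its right; how many consonants there are in total to its right; how many adjacent vowels there are on its left; how many adjacent consonants there are on its left; how many adjacent vowels there are on its right; how many adjacent consonants there are on its right; how many adjacent front nasals there are on its left; how many adjacent back nasals there are on its left; how many adjacent front nasals there are on its right; how many adjacent back nasals there are on its right}.
Based on the above statistical dimensions, all adjacent phonemes of each phoneme are analyzed, and the quantized statistical features are used as the adjacent linguistic feature of the current phoneme.
Further, for each pronunciation unit, its independent linguistic feature and adjacent linguistic feature are combined to obtain the complete linguistic feature of the unit.
For example, the independent linguistic feature and the adjacent linguistic feature, both expressed in quantized, encoded form, can be concatenated to obtain the linguistic feature of the pronunciation unit. That is, the linguistic feature of a pronunciation unit is a long array composed of a series of quantized values.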
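The windowed statistics and the final concatenation can be sketched as follows. The sketch counts only vowels and consonants on each side (a small subset of the statistical dimensions listed above), treats every non-vowel unit as a consonant, and uses a deliberately simplified vowel set; `VOWELS`, `adjacent_features` and `linguistic_feature` are hypothetical names. The default window of 20 follows the example in the text.

```python
from typing import List

VOWELS = {"a", "o", "e", "i", "u", "v"}   # simplified, hypothetical vowel set


def adjacent_features(units: List[str], index: int, window: int = 20) -> List[int]:
    """Count vowels and consonants within `window` units on each side of units[index]."""
    left = units[max(0, index - window):index]
    right = units[index + 1:index + 1 + window]

    def counts(segment: List[str]) -> List[int]:
        n_vowels = sum(1 for u in segment if u in VOWELS)
        return [n_vowels, len(segment) - n_vowels]

    # At a sentence start (or end) the left (or right) segment is empty,
    # so the corresponding statistics are zero, as the text requires.
    return counts(left) + counts(right)


def linguistic_feature(independent: List[int], units: List[str],
                       index: int, window: int = 20) -> List[int]:
    """Concatenate the independent and adjacent features of one unit."""
    return independent + adjacent_features(units, index, window)
```

Stacking the resulting arrays for all units in time order gives a two-dimensional array with one row per pronunciation unit, which is the shape the time-series mapping model consumes.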
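In a specific implementation, in step S10432, the linguistic features of the pronunciation units are concatenated in time order to obtain the quantized linguistic feature sequence. The linguistic feature sequence is a quantized feature representation of the input information; this representation is not constrained by any specific speaker and does not need a specific speaker to drive it.
Further, once the quantized linguistic feature sequence is obtained, step S1044 may be performed to input the linguistic feature sequence into the trained preset time-series mapping model and obtain the corresponding virtual object animation data sequence.
In a specific implementation, referring to FIG. 4, step S1044 may include the following steps:
Step S10441: perform multi-dimensional information extraction on the linguistic feature sequence based on the preset time-series mapping model, wherein the multiple dimensions include the time dimension and the linguistic feature dimension;
Step S10442: based on the preset time-series mapping model, perform feature-domain mapping and feature-dimension transformation on the multi-dimensional information extraction results to obtain the expression parameters and/or action parameters of the virtual object;
The feature-domain mapping refers to the mapping from the linguistic feature domain to the virtual object animation feature domain, and the virtual object animation feature domain includes the expression features and/or action features of the virtual object.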
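Specifically, because the length of the audio information or text information input in step S1041 is not fixed, the variable-length sequence obtained from the input information (that is, the linguistic feature sequence) can be processed by a recurrent neural network (RNN) or one of its variants, such as a long short-term memory (LSTM) network, so that feature information is extracted from the sequence as a whole.
Further, an RNN processes the input features along the time dimension. To process the features in more dimensions, extract higher-dimensional feature information and thereby strengthen the generalization ability of the model, the input information can also be processed by a convolutional neural network (CNN) or one of its variants, such as dilated convolution or causal convolution.
Further, a feature mapping model such as the preset time-series mapping model usually involves feature-domain conversion and feature-dimension transformation. This conversion can be implemented with a fully connected network (FCN).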
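Further, after the preset time-series mapping model has been designed, it can be trained with training data prepared in advance and machine learning techniques to find its optimal parameters, thereby realizing the mapping from the linguistic feature sequence to the virtual object animation sequence.
Further, the preset time-series mapping model may be any model that can use time-series information (such as text information and audio information aligned in time) to predict other time-series information (such as virtual object animation).
In a specific implementation, the training data of the preset time-series mapping model may include text information, speech data synchronized with the text information, and virtual object animation data.
Specifically, a professional recording engineer (who also acts as the performer) may produce the corresponding speech data and motion data from rich, emotional text information, with speech and motion in one-to-one correspondence. The motion data includes facial expression motion and body motion. Facial expression motion involves information such as expression and gaze.
After a correspondence between facial expression motion and the virtual object controller is established, the facial expression controller data of the virtual object is obtained. Body motion can be obtained by capturing high-quality posture data of the performer on a performance capture platform, and the body motion data corresponds in time to the expression data. In this way, the corresponding virtual object animation data can be obtained by mapping from the digitized vector sequence (that is, the linguistic feature sequence).
Similar to the driving logic of facial expression motion, body motion can also be driven by controllers. Alternatively, body motion can be skeleton-driven.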
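In a specific implementation, the preset time-series mapping model may be a convolutional, long short-term memory, deep neural network (Convolutional LSTM Deep Neural Networks, CLDNN).
It should be noted that although this implementation is described in detail using a preset time-series mapping model composed of the above three networks as an example, in practice the structure of the model is not limited to this; for example, the preset time-series mapping model may be any one of the three networks, or a combination of any two of them.
Specifically, the preset time-series mapping model may include a multi-layer convolutional network for receiving the linguistic feature sequence and performing multi-dimensional information extraction on it.
For example, the multi-layer convolutional network may include a four-layer dilated convolutional network that performs multi-dimensional information extraction on the quantized linguistic feature sequence produced by step S1043. The linguistic feature sequence may be two-dimensional data: assuming each pronunciation unit is represented by a pronunciation feature of length 600 and there are 100 pronunciation units in total, the linguistic feature sequence input to the preset time-series mapping model is a 100×600 two-dimensional array, where the dimension of size 100 is the time dimension and the dimension of size 600 is the linguistic feature dimension. Correspondingly, the multi-layer convolutional network performs feature operations along both the time dimension and the linguistic feature dimension.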
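To illustrate why stacked dilated convolutions are useful along the time dimension, the sketch below applies a length-preserving dilated filter to a scalar sequence and computes the receptive field of a stack of such layers. The source only states that four dilated convolution layers may be used; the kernel size of 3 and the dilation rates 1, 2, 4, 8 are hypothetical choices for illustration, and the function names are assumptions.

```python
from typing import List


def dilated_conv1d(seq: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Zero-padded dilated filter along time; output length equals input length.

    (As in most deep learning libraries, this is technically a cross-correlation.)
    """
    half = (len(kernel) - 1) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - half) * dilation      # taps are `dilation` steps apart
            if 0 <= idx < len(seq):
                acc += w * seq[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input time steps that influence one output step of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Under the hypothetical settings above, one output step of the four-layer
# stack sees 31 consecutive pronunciation units.
four_layer_field = receptive_field(3, [1, 2, 4, 8])
```

Because each layer preserves the sequence length, an input with 100 time steps still has 100 time steps after the stack, which keeps the output aligned with the audio.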
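Further, the preset time-series mapping model may also include a long short-term memory network for aggregating the information extraction results along the time dimension. In this way, the features produced by the multi-layer convolutional network can be considered as a whole for continuity along the time dimension.
For example, the long short-term memory network may include two stacked bidirectional LSTM layers coupled to the multi-layer convolutional network to receive its time-dimension information extraction results for the linguistic feature sequence. Further, the two stacked bidirectional LSTM layers perform high-dimensional processing on these results to obtain further feature information in the time dimension.
Further, the preset time-series mapping model may also include a deep neural network coupled to the multi-layer convolutional network and the long short-term memory network. The deep neural network performs feature-domain mapping and feature-dimension transformation on the multi-dimensional information extraction results output by the multi-layer convolutional network and the long short-term memory network, so as to obtain the expression parameters and/or action parameters of the virtual object.
For example, the deep neural network may receive the information extraction results in the linguistic feature dimension output by the multi-layer convolutional network, and may also receive the updated information extraction results in the time dimension output by the long short-term memory network.
The dimension transformation may refer to dimensionality reduction; for example, the input of the preset time-series mapping model has 600 features and the output has 100 features.
For example, the deep neural network may include multiple fully connected layers connected in series, where the first fully connected layer receives the multi-dimensional information extraction results and the last fully connected layer outputs the expression parameters and/or action parameters of the virtual object.
The number of fully connected layers may be three.
Further, the deep neural network may also include multiple nonlinear transformation modules, each coupled between two adjacent fully connected layers (there is no such module after the last fully connected layer). Each nonlinear transformation module applies a nonlinear transformation to the output of the preceding fully connected layer and feeds the result into the following fully connected layer.
The nonlinear transformation module may be a rectified linear unit (ReLU) activation function.
The nonlinear transformation modules can improve the expressive power and generalization ability of the preset time-series mapping model.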
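The fully connected part can be sketched in plain Python as a forward pass through serially connected layers with ReLU between adjacent layers and no activation after the last one. The weights in the usage example are arbitrary toy values chosen for illustration, and the function names are hypothetical; a real model would learn its weights and reduce, for example, 600 input features to 100 output parameters.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Layer = Tuple[Matrix, List[float]]   # (weights, bias) of one fully connected layer


def linear(x: List[float], weights: Matrix, bias: List[float]) -> List[float]:
    """One fully connected layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def dnn_forward(x: List[float], layers: List[Layer]) -> List[float]:
    """Apply the layers in series, with ReLU after every layer except the last."""
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:
            x = relu(x)
    return x


# Toy three-layer network mapping 2 input features to 1 output parameter.
toy_layers: List[Layer] = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.5]),
]
toy_output = dnn_forward([2.0, 1.0], toy_layers)
```

Leaving the last layer linear lets the output take any real value, which suits regression targets such as controller offsets.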
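In a variation, the multi-layer convolutional network, the long short-term memory network and the deep neural network may be connected in series in that order. The information extraction results in the linguistic feature dimension output by the multi-layer convolutional network pass through the long short-term memory network unchanged to the deep neural network, while the information extraction results in the time dimension output by the multi-layer convolutional network are processed by the long short-term memory network and then passed to the deep neural network.
As described above, with the solution of this embodiment, multimodal input (audio and text) is taken as the original information. First, it is converted into linguistic pronunciation units and their features (that is, the linguistic features), which are unaffected by the speaker, audio characteristics and the like. Then, the linguistic features are synchronized with the audio in the time dimension to obtain the time-aligned linguistic feature sequence. Finally, this sequence is input into the trained preset time-series mapping model to obtain the virtual object animation corresponding to the input information.
The solution of this embodiment does not rely on a specific voice actor to drive the model, which removes the dependence on specific voice actors and helps reduce labor costs in animation production.
Further, the solution of this embodiment can output high-quality virtual object animation, in particular 3D animation, which reduces the labor and time that animators and artists spend on manually retouching animation and helps improve production efficiency.
Further, the solution of this embodiment can accept different types of input information, which widens its range of application and helps further reduce the cost and improve the efficiency of animation production.
Further, conventional end-to-end virtual object animation synthesis techniques mainly generate two-dimensional animation, whereas the solution of this embodiment can generate high-quality three-dimensional animation as well as two-dimensional animation.
In this embodiment, the "virtual object animation sequence" is a generalized term for quantized animation data or animation controllers. It is not limited to two-dimensional or three-dimensional animation; its form depends on the representation of the "virtual object animation sequence" in the training data used when the preset time-series mapping model learns its optimal parameters. Once the virtual object animation controller is obtained, it can be converted into the corresponding video animation with software such as Maya or UE.
As described above, a virtual object animation with emotional speech, in particular a three-dimensional animation, can be generated quickly and efficiently from text; the approach is highly general and needs no specific voice actor to drive it. Specifically, emotional speech is synthesized by analyzing the emotional features and prosodic boundaries of the text. Further, the corresponding virtual object animation is generated based on the text and the emotional speech. Further, the time-ordered data of the generated virtual object animation is synchronized in time with the audio information, which makes it possible to generate a virtual object animation directly from text, and the generated animation stays synchronized with the emotional speech as it plays out in time order.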
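FIG. 5 is a schematic structural diagram of a text-based virtual object animation generation apparatus according to an embodiment of the present invention. Those skilled in the art will understand that the text-based virtual object animation generation apparatus 5 of this embodiment can be used to implement the method described in any of the embodiments of FIG. 1 to FIG. 4 above.
Specifically, referring to FIG. 5, the text-based virtual object animation generation apparatus 5 of this embodiment may include: an acquisition module 51 for acquiring text information, wherein the text information includes the original text for which a virtual object animation is to be generated; an analysis module 52 for analyzing emotional features and prosodic boundaries of the text information; a speech synthesis module 53 for performing speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information, wherein the audio information includes emotional speech converted from the original text; and a processing module 54 for generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information.
For more details on the working principle and operation of the text-based virtual object animation generation apparatus 5, refer to the related descriptions of FIG. 1 to FIG. 4 above, which are not repeated here.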
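In a typical application scenario, the text-based virtual object animation generation method of this embodiment may be implemented by a text-based virtual object animation generation system.
Specifically, the text-based virtual object animation generation system may include: a collection module for collecting the text information; and the text-based virtual object animation generation apparatus 5 shown in FIG. 5, in which the acquisition module 51 is coupled to the collection module to receive the text information, and the apparatus 5 executes the text-based virtual object animation generation method shown in FIG. 1 to FIG. 4 to generate the corresponding virtual object animation and emotional speech.
Further, the collection module may be a text input device such as a keyboard, used to collect the text information.
Further, the text-based virtual object animation generation apparatus 5 may be integrated into a computing device such as a terminal or a server. For example, the apparatus 5 may be integrated centrally in a single server. Alternatively, the apparatus 5 may be distributed across multiple terminals or servers that are coupled to one another. For example, the preset time-series mapping model may be deployed separately on a terminal or server to ensure good data processing speed.
With the text-based virtual object animation generation system of this embodiment, the user provides input information at the collection module and obtains, at the text-based virtual object animation generation apparatus 5, the corresponding virtual object animation and the emotional speech synchronized with it.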
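Further, an embodiment of the present invention also discloses a storage medium storing a computer program which, when run by a processor, executes the method described in the embodiments shown in FIG. 1 to FIG. 4 above. Preferably, the storage medium may include a computer-readable storage medium such as a non-volatile memory or a non-transitory memory. The storage medium may include a ROM, a RAM, a magnetic disk, an optical disc or the like.
Further, an embodiment of the present invention also discloses a terminal comprising a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor, when running the computer program, executes the method described in the embodiments shown in FIG. 1 to FIG. 4 above.
Although the present invention is disclosed as above, it is not limited thereto. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the present invention, and the scope of protection of the present invention shall therefore be as defined by the claims.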

Claims (15)

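  1. A text-based virtual object animation generation method, characterized by comprising:
    acquiring text information, wherein the text information includes the original text for which a virtual object animation is to be generated;
    analyzing emotional features and prosodic boundaries of the text information;
    performing speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information, wherein the audio information includes emotional speech converted from the original text;
    generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information.
  2. The virtual object animation generation method according to claim 1, characterized in that analyzing the emotional features and prosodic boundaries of the text information comprises:
    performing word segmentation on the text information;
    for each word obtained by the segmentation, performing sentiment analysis on the word to obtain the emotional features of the word;
    determining the prosodic boundary of each word.
  3. The virtual object animation generation method according to claim 1 or 2, characterized in that analyzing the emotional features and prosodic boundaries of the text information comprises:
    analyzing the emotional features of the text information based on a preset text front-end prediction model, wherein the input of the preset text front-end prediction model is the text information, and the output of the preset text front-end prediction model is the emotional features, prosodic boundaries and word segmentation of the text information.
  4. The virtual object animation generation method according to claim 1, characterized in that performing speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information comprises:
    inputting the text information, the emotional features and the prosodic boundaries into a preset speech synthesis model, wherein the preset speech synthesis model is used to convert the input text sequence into a speech sequence in time order, and the speech in the speech sequence carries the emotion of the text at the corresponding point in time;
    acquiring the audio information output by the preset speech synthesis model.
  5. The virtual object animation generation method according to claim 4, characterized in that the preset speech synthesis model is trained on training data, wherein the training data includes text information samples and corresponding audio information samples, and the audio information samples are pre-recorded according to the text information samples.
  6. The virtual object animation generation method according to claim 5, characterized in that the training data further includes extended samples, wherein the extended samples are obtained by slicing and recombining the speech and text of the text information samples and the corresponding audio information samples.
  7. The virtual object animation generation method according to claim 1, characterized in that generating a corresponding virtual object animation based on the text information and the audio information comprises:
    receiving input information, wherein the input information includes the text information and the audio information;
    converting the input information into a pronunciation unit sequence;
    performing feature analysis on the pronunciation unit sequence to obtain a corresponding linguistic feature sequence;
    inputting the linguistic feature sequence into a preset time-series mapping model to generate the corresponding virtual object animation based on the linguistic feature sequence.
  8. The virtual object animation generation method according to claim 1, characterized in that generating a corresponding virtual object animation based on the text information and the audio information comprises:
    inputting the text information and the audio information into a preset time-series mapping model to generate the corresponding virtual object animation.
  9. The virtual object animation generation method according to claim 7 or 8, characterized in that the preset time-series mapping model is used to map the input feature sequence, in time order, to expression parameters and/or action parameters of the virtual object, so as to generate the corresponding virtual object animation.
  10. The virtual object animation generation method according to claim 1, characterized in that, after acquiring the text information and before analyzing the emotional features and prosodic boundaries of the text information, the method further comprises:
    normalizing the text information according to its context to obtain normalized text information.
  11. The virtual object animation generation method according to claim 10, characterized in that the normalization includes number-reading processing and special-character-reading processing.
  12. The virtual object animation generation method according to claim 1, characterized in that generating a corresponding virtual object animation based on the text information and the audio information comprises:
    generating the corresponding virtual object animation based on the text information, the emotional features and prosodic boundaries of the text information, and the audio information.
  13. A text-based virtual object animation generation apparatus, characterized by comprising:
    an acquisition module for acquiring text information, wherein the text information includes the original text for which a virtual object animation is to be generated;
    an analysis module for analyzing emotional features and prosodic boundaries of the text information;
    a speech synthesis module for performing speech synthesis according to the emotional features, the prosodic boundaries and the text information to obtain audio information, wherein the audio information includes emotional speech converted from the original text;
    a processing module for generating a corresponding virtual object animation based on the text information and the audio information, wherein the virtual object animation is synchronized in time with the audio information.
  14. A storage medium storing a computer program, characterized in that the computer program, when run by a processor, executes the steps of the method according to any one of claims 1 to 12.
  15. A terminal comprising a memory and a processor, the memory storing a computer program capable of running on the processor, characterized in that the processor, when running the computer program, executes the steps of the method according to any one of claims 1 to 12.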
PCT/CN2021/111424 2020-09-01 2021-08-09 Text-based virtual object animation generation method, apparatus, storage medium, and terminal WO2022048405A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/024,021 US11908451B2 (en) 2020-09-01 2021-08-09 Text-based virtual object animation generation method, apparatus, storage medium, and terminal

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010905539.7A CN112184858B (zh) 2020-09-01 2020-09-01 Text-based virtual object animation generation method, apparatus, storage medium, and terminal
CN202010905539.7 2020-09-01

Publications (1)

Publication Number Publication Date
WO2022048405A1 true WO2022048405A1 (zh) 2022-03-10

Family

ID=73925588

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/111424 WO2022048405A1 (zh) 2020-09-01 2021-08-09 Text-based virtual object animation generation method, apparatus, storage medium, and terminal

Country Status (3)

Country Link
US (1) US11908451B2 (zh)
CN (1) CN112184858B (zh)
WO (1) WO2022048405A1 (zh)

Also Published As

Publication number Publication date
US20230267916A1 (en) 2023-08-24
CN112184858B (zh) 2021-12-07
US11908451B2 (en) 2024-02-20
CN112184858A (zh) 2021-01-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21863472

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21863472

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 01.09.2023)