WO2019196306A1 - Device and method for speech-based mouth shape animation blending, and readable storage medium - Google Patents

Device and method for speech-based mouth shape animation blending, and readable storage medium

Info

Publication number
WO2019196306A1
WO2019196306A1 (PCT/CN2018/102209)
Authority
WO
WIPO (PCT)
Prior art keywords
data
feature
model
neural network
voice data
Application number
PCT/CN2018/102209
Other languages
French (fr)
Chinese (zh)
Inventor
梁浩
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2019196306A1


Classifications

    • G06T 13/205: 3D [Three Dimensional] animation driven by audio data
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/08: Learning methods (computing arrangements based on biological models; neural networks)
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 25/24: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a voice-based lip animation synthesis apparatus, method, and readable storage medium.
  • Speech synthesis, also known as text-to-speech technology, is a technology that converts text information into speech and reads it aloud. It involves acoustics, linguistics, digital signal processing, computer science and other disciplines, and is a cutting-edge technology in the field of Chinese information processing; the main problem it solves is how to convert text information into audible sound information.
  • in some application scenarios, such as computer-assisted pronunciation training, the speaker's mouth-shape changes need to be displayed dynamically while the voice data is played, to help the user practise pronunciation. In the prior art, when the played voice data is synthesized, there is no corresponding real speaker mouth data available, so a realistic mouth-shape animation that matches the synthesized voice data cannot be displayed.
  • the present application provides a voice-based lip animation synthesis apparatus, method, and readable storage medium, the main purpose of which is to solve the prior-art problem that a realistic mouth-shape animation matching synthesized speech data cannot be displayed.
  • the present application provides a voice-based lip animation synthesis device comprising a memory and a processor, wherein the memory stores a lip animation synthesis program executable on the processor;
  • when the lip animation synthesis program is executed by the processor, the following steps are implemented: acquiring target text data and acquiring phoneme features in the target text data according to a pronunciation dictionary; inputting the phoneme features into a pre-trained deep neural network model and outputting acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency; inputting the acoustic features into a speech synthesizer and outputting speech data corresponding to the target text data; acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, mouth-shape data corresponding to the speech data and the speaker identification information; and generating a mouth-shape animation corresponding to the speech data from the mouth-shape data, so that the animation is displayed while the speech data is played.
  • the present application further provides a voice-based lip animation synthesis method comprising the same steps: acquiring target text data and its phoneme features according to a pronunciation dictionary; inputting the phoneme features into a pre-trained deep neural network model and outputting the corresponding acoustic features (MFCC features, pronunciation duration and pronunciation fundamental frequency); inputting the acoustic features into a speech synthesizer to output the speech data; acquiring mouth-shape data from the speech data, a pre-trained tensor model and preset speaker identification information; and generating the corresponding mouth-shape animation.
  • the present application further provides a computer-readable storage medium on which a lip animation synthesis program is stored; the program can be executed by one or more processors to implement the steps of the speech-based lip animation synthesis method described above.
  • FIG. 1 is a schematic diagram of a preferred embodiment of the voice-based lip animation synthesis device of the present application;
  • FIG. 2 is a schematic diagram of the program modules of the lip animation synthesis program in an embodiment of the voice-based lip animation synthesis device of the present application;
  • FIG. 3 is a flowchart of a preferred embodiment of the voice-based lip animation synthesis method of the present application.
  • the present application provides a voice-based lip animation synthesis device.
  • referring to FIG. 1, a schematic diagram of a preferred embodiment of the speech-based lip animation synthesis apparatus of the present application is shown.
  • the voice-based lip animation synthesis device may be a PC (Personal Computer), or may be a terminal device such as a smart phone, a tablet computer, or a portable computer.
  • the speech-based lip animation synthesis apparatus includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (for example, an SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like.
  • the memory 11 may be, in some embodiments, an internal storage unit of the voice-based lip animation synthesis device, such as the hard disk of the device.
  • in other embodiments, the memory 11 may also be an external storage device of the voice-based lip animation synthesis device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the device.
  • further, the memory 11 may include both the internal storage unit of the voice-based lip animation synthesis device and an external storage device.
  • the memory 11 can be used not only for storing application software and various types of data installed in the voice-based lip animation synthesizing device, such as code of a lip animation synthesis program, but also for temporarily storing data that has been output or is to be output.
  • the processor 12 may, in some embodiments, be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the lip animation synthesis program.
  • Communication bus 13 is used to implement connection communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device and other electronic devices.
  • FIG. 1 shows only a speech-based lip animation synthesis device having the components 11-14 and the lip animation synthesis program; it should be understood that not all illustrated components are required, and more or fewer components may be implemented instead.
  • optionally, the device may further include a user interface; the user interface may include a display and an input unit such as a keyboard, and may further include a standard wired interface and a wireless interface.
  • in some embodiments the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used to show the information processed in the voice-based lip animation synthesis device and to present a visual user interface.
  • in the device embodiment shown in FIG. 1, a lip animation synthesis program is stored in the memory 11; when the processor 12 executes the lip animation synthesis program stored in the memory 11, the following steps are implemented: target text data is acquired, and the phoneme features in the target text data are acquired according to a pronunciation dictionary.
  • the phoneme feature is input into a pre-trained deep neural network model, and an acoustic feature corresponding to the phoneme feature is output, the acoustic feature including a Mel cepstrum coefficient MFCC feature, a pronunciation duration, and a pronunciation fundamental frequency.
  • the acoustic feature is input to a speech synthesizer, and speech data corresponding to the target text data is output.
  • in the solution proposed in this embodiment, the target text data is converted into speech data through a pre-established deep neural network model, and the speech data is converted into mouth-shape data through a pre-established tensor model.
  • specifically, the target text data to be synthesized is obtained, the target text data is split into characters or words by a word segmentation tool, and the resulting characters or words are then split into phonemes through the pronunciation dictionary, thereby obtaining the phoneme features; for Chinese, the phonemes include initial (consonant) phonemes and final (vowel) phonemes.
  • taking Chinese as an example, for each phoneme the phoneme features mainly include: the pronunciation features of the current phoneme, the previous phoneme and the next phoneme; the position of the current phoneme in the character; the syllable features of the current, previous and next phonemes; and the position in the sentence of the character containing the current phoneme.
  • a deep neural network model for expressing the correlation between phoneme features and acoustic features is trained in advance; the above feature vector is input into the model to obtain the corresponding acoustic features, which include time-series features and the duration of each sound. The time-series features comprise a 25-dimensional vector of Mel-frequency cepstral coefficients (MFCC) per frame and the fundamental frequency. The MFCC features, the pronunciation duration and the pronunciation fundamental frequency are then synthesized into a speech signal by a speech synthesizer.
  • before the deep neural network model of this embodiment is applied, it needs to be trained. First, a corpus is collected and a sample library is constructed from the corpus of at least one speaker; the corpus includes voice data together with the corresponding text data and mouth-shape data, i.e. the voice data obtained when one or more speakers read the same text aloud and the corresponding mouth-shape data. The mouth-shape data are physiological electromagnetic articulography (EMA) recordings that capture mouth-movement information and reflect the speaker's mouth shape during pronunciation.
  • the deep neural network model is then trained on the text data and the voice data in the sample library, and the model parameters of the deep neural network model are obtained. The pronunciation duration can be predicted from the length and syllable-position features in the phoneme features, and the pronunciation fundamental frequency can be predicted from pronunciation features such as pitch and accent position in the phoneme features.
  • it should be noted that the mouth-shape data in this embodiment are physiological electromagnetic articulography data obtained by capturing mouth-movement information; these data mainly include the coordinate information of a specific mouth shape and the corresponding mouth-shape image.
  • during model training, the mouth-position features in the mouth-shape data are used directly; they mainly include the coordinate information of the following positions: tongue tip, tongue body, tongue dorsum, upper lip, lower lip, upper incisors and lower incisors.
  • based on the voice data and mouth-shape data in the sample library, a tensor model expressing the correlation between the acoustic features and the mouth-shape data is trained in advance; the tensor model is a third-order tensor model whose three dimensions correspond to pronunciation features, mouth-shape data and speaker identification information, respectively.
  • the pronunciation features of the voice data in the sample library are extracted, the pronunciation features and the speaker identification information are used as input features of the third-order tensor model and the mouth-shape data as its output features, and the third-order tensor model is trained to obtain its model parameters.
  • specifically, the third-order tensor model in this embodiment is constructed and trained as follows: the set of pronunciation features is taken as one parameter space, and the set of mouth-shape data corresponding to the pronunciation features as another parameter space; a multilinear space transformation is built over these parameter spaces, and from it a third-order tensor is constructed whose three dimensions correspond to the acoustic (pronunciation) features, the mouth-shape data and the speaker identification information.
  • in the resulting expression, the left-hand side contains the model parameters to be solved, mainly the weights of the individual features in the two parameter spaces; the right-hand side contains the features used to train the model, i.e. the pronunciation features and mouth-position features extracted from the text data and mouth-shape data in the database. Here C is the tensor operator and μ is the averaged mouth-position information across speakers: taking the sound "a" as an example, the corresponding μ is the average of the mouth-position information of the different speakers when producing "a".
  • since tensor decomposition is generally performed with a higher-order singular value decomposition, the third-order tensor model is trained using an HOSVD algorithm to solve for the model parameters on the left-hand side of the above expression.
  • after the speech data are obtained from the deep neural network model, the speech data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the mouth-shape data corresponding to the speech data. In other words, when the sample library used to train the third-order tensor model contains the corpora of several speakers, the user can select the speaker identification information in advance, and the generated mouth-shape data will then be closer to that speaker's mouth-shape data.
  • in the solution of this embodiment, the deep neural network model is used to model the mapping from phoneme features to acoustic features; this mapping is nonlinear, and a deep neural network achieves better feature mining and representation, so the speech synthesis system produces more accurate and more natural output. In addition, by constructing a tensor model that expresses the correlation between pronunciation features and mouth-shape features, mouth-shape data that match the synthesized speech and look realistic can be obtained, enabling a dynamic display of the mouth shape while the voice data are played.
  • the speech-based lip animation synthesis device of this embodiment obtains the phoneme features of the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the corresponding acoustic features, which include the MFCC features, the pronunciation duration and the pronunciation fundamental frequency; these acoustic features are input into a speech synthesizer to obtain the speech data corresponding to the target text data. According to the speech data, the pre-trained tensor model and the preset speaker identification information, the mouth-shape data corresponding to the speech data and the speaker identification information are acquired, and a mouth-shape animation corresponding to the speech data is generated from the mouth-shape data, so that the animation is displayed while the speech data are played.
  • this scheme uses the deep neural network model to transform the target text data into acoustic features, which enables better feature mining and gives the speech synthesis system more accurate and more natural output; at the same time, the tensor model that expresses the relation between acoustic features and mouth-shape data converts the synthesized speech data into corresponding mouth-shape data, from which a mouth-shape animation corresponding to the target text data is generated. This solves the technical problem that the prior art cannot display a realistic mouth-shape animation matching synthesized speech data.
  • alternatively, the lip animation synthesis program may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present application; the module referred to in the present application is a series of computer program instruction segments capable of performing a specific function, used to describe the execution process of the lip animation synthesis program in the voice-based lip animation synthesis device.
  • referring to FIG. 2, it is a schematic diagram of the program modules of the lip animation synthesis program in an embodiment of the speech-based lip animation synthesis device of the present application. In this embodiment, the lip animation synthesis program can be divided into a feature extraction module 10, a feature conversion module 20, a speech synthesis module 30, a mouth-shape generation module 40 and an animation synthesis module 50, which exemplarily operate as follows:
  • the feature extraction module 10 is configured to: acquire target text data, and acquire phoneme features in the target text data according to a pronunciation dictionary;
  • the feature conversion module 20 is configured to: input the phoneme features into a pre-trained deep neural network model and output the acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency;
  • the speech synthesis module 30 is configured to: input the acoustic feature into a speech synthesizer, and output speech data corresponding to the target text data;
  • the mouth-shape generation module 40 is configured to: acquire, according to the voice data, the pre-trained tensor model and the preset speaker identification information, the mouth-shape data corresponding to the voice data and the speaker identification information,
  • the tensor model expresses a correlation between the pronunciation features of the speech data and the lip position characteristics of the lip data;
  • the animation synthesis module 50 is configured to: generate a lip animation corresponding to the voice data according to the mouth-shape data, so as to display the lip animation while the voice data is played (a minimal sketch of how these modules could be wired together follows).
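As an illustration only, the five modules above could be connected along the following lines. This is a sketch under assumptions: the class name, parameter names and the callables passed in are invented for the example and are not taken from the patent.

```python
from typing import Callable, List, Sequence, Tuple

class LipAnimationPipeline:
    """Wires together the five modules described above (illustrative sketch)."""

    def __init__(self,
                 extract_phoneme_features: Callable[[str], List[Sequence[float]]],      # module 10
                 phonemes_to_acoustics: Callable[[Sequence[float]], Sequence[float]],   # module 20
                 acoustics_to_waveform: Callable[[List[Sequence[float]]], bytes],       # module 30
                 acoustics_to_mouth: Callable[[Sequence[float], int], Sequence[float]], # module 40
                 render_animation: Callable[[List[Sequence[float]], bytes], object]):   # module 50
        self.extract_phoneme_features = extract_phoneme_features
        self.phonemes_to_acoustics = phonemes_to_acoustics
        self.acoustics_to_waveform = acoustics_to_waveform
        self.acoustics_to_mouth = acoustics_to_mouth
        self.render_animation = render_animation

    def run(self, text: str, speaker_id: int) -> Tuple[bytes, object]:
        phoneme_feats = self.extract_phoneme_features(text)                       # step S10
        acoustic_feats = [self.phonemes_to_acoustics(f) for f in phoneme_feats]   # step S20
        audio = self.acoustics_to_waveform(acoustic_feats)                        # step S30
        mouth = [self.acoustics_to_mouth(f, speaker_id) for f in acoustic_feats]  # step S40
        animation = self.render_animation(mouth, audio)                           # step S50
        return audio, animation
```

Each callable corresponds to one module; concrete implementations (pronunciation dictionary lookup, the trained deep neural network, the speech synthesizer and the tensor model) would be supplied by the caller.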
  • the present application also provides a voice-based lip animation synthesis method.
  • referring to FIG. 3, it is a flowchart of a preferred embodiment of the speech-based lip animation synthesis method of the present application. The method may be performed by a device, which may be implemented by software and/or hardware; in the following, the voice-based lip animation synthesis device is taken as the execution subject to describe the method of this embodiment.
  • the voice-based lip animation synthesis method includes:
  • Step S10: acquire target text data, and acquire phoneme features in the target text data according to the pronunciation dictionary.
  • Step S20: input the phoneme features into a pre-trained deep neural network model, and output the acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency.
  • Step S30: input the acoustic features into a speech synthesizer, and output the speech data corresponding to the target text data.
  • in the solution proposed in this embodiment, the target text data is converted into speech data through a pre-established deep neural network model, and the speech data is converted into mouth-shape data through a pre-established tensor model.
  • specifically, the target text data to be synthesized is obtained, the target text data is split into characters or words by a word segmentation tool, and the resulting characters or words are then split into phonemes through the pronunciation dictionary, thereby obtaining the phoneme features; for Chinese, the phonemes include initial (consonant) phonemes and final (vowel) phonemes.
  • taking Chinese as an example, for each phoneme the phoneme features mainly include: the pronunciation features of the current phoneme, the previous phoneme and the next phoneme; the position of the current phoneme in the character; the syllable features of the current, previous and next phonemes; and the position in the sentence of the character containing the current phoneme.
  • a deep neural network model for expressing the correlation between phoneme features and acoustic features is trained in advance; the above feature vector is input into the model to obtain the corresponding acoustic features, which include time-series features and the duration of each sound. The time-series features comprise a 25-dimensional vector of Mel-frequency cepstral coefficients (MFCC) per frame and the fundamental frequency. The MFCC features, the pronunciation duration and the pronunciation fundamental frequency are then synthesized into a speech signal by a speech synthesizer.
  • before the deep neural network model of this embodiment is applied, it needs to be trained. First, a corpus is collected and a sample library is constructed from the corpus of at least one speaker; the corpus includes voice data together with the corresponding text data and mouth-shape data, i.e. the voice data obtained when one or more speakers read the same text aloud and the corresponding mouth-shape data. The mouth-shape data are physiological electromagnetic articulography (EMA) recordings that capture mouth-movement information and reflect the speaker's mouth shape during pronunciation.
  • the deep neural network model is then trained on the text data and the voice data in the sample library, and the model parameters of the deep neural network model are obtained. The pronunciation duration can be predicted from the length and syllable-position features in the phoneme features, and the pronunciation fundamental frequency can be predicted from pronunciation features such as pitch and accent position in the phoneme features.
  • Step S40: acquire, according to the voice data, the pre-trained tensor model and the preset speaker identification information, the mouth-shape data corresponding to the voice data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of the speech data and the mouth-position features of the mouth-shape data.
  • it should be noted that the mouth-shape data in this embodiment are physiological electromagnetic articulography data obtained by capturing mouth-movement information; these data mainly include the coordinate information of a specific mouth shape and the corresponding mouth-shape image.
  • during model training, the mouth-position features in the mouth-shape data are used directly; they mainly include the coordinate information of the following positions: tongue tip, tongue body, tongue dorsum, upper lip, lower lip, upper incisors and lower incisors.
  • based on the voice data and mouth-shape data in the sample library, a tensor model expressing the correlation between the acoustic features and the mouth-shape data is trained in advance; the tensor model is a third-order tensor model whose three dimensions correspond to pronunciation features, mouth-shape data and speaker identification information, respectively.
  • the pronunciation features of the voice data in the sample library are extracted, the pronunciation features and the speaker identification information are used as input features of the third-order tensor model and the mouth-shape data as its output features, and the third-order tensor model is trained to obtain its model parameters.
  • specifically, the third-order tensor model in this embodiment is constructed and trained as follows: the set of pronunciation features is taken as one parameter space, and the set of mouth-shape data corresponding to the pronunciation features as another parameter space; a multilinear space transformation is built over these parameter spaces, and from it a third-order tensor is constructed whose three dimensions correspond to the acoustic (pronunciation) features, the mouth-shape data and the speaker identification information.
  • in the resulting expression, the left-hand side contains the model parameters to be solved, mainly the weights of the individual features in the two parameter spaces; the right-hand side contains the features used to train the model, i.e. the pronunciation features and mouth-position features extracted from the text data and mouth-shape data in the database. Here C is the tensor operator and μ is the averaged mouth-position information across speakers: taking the sound "a" as an example, the corresponding μ is the average of the mouth-position information of the different speakers when producing "a".
  • since tensor decomposition is generally performed with a higher-order singular value decomposition, the third-order tensor model is trained using an HOSVD algorithm to solve for the model parameters on the left-hand side of the above expression.
  • after the speech data are obtained from the deep neural network model, the speech data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the mouth-shape data corresponding to the speech data. In other words, when the sample library used to train the third-order tensor model contains the corpora of several speakers, the user can select the speaker identification information in advance, and the generated mouth-shape data will then be closer to that speaker's mouth-shape data.
  • Step S50: generate a lip animation corresponding to the voice data according to the mouth-shape data, so that the lip animation is displayed while the voice data is played.
  • in the solution of this embodiment, the deep neural network model is used to model the mapping from phoneme features to acoustic features; this mapping is nonlinear, and a deep neural network achieves better feature mining and representation, so the speech synthesis system produces more accurate and more natural output. In addition, by constructing a tensor model that expresses the correlation between pronunciation features and mouth-shape features, mouth-shape data that match the synthesized speech and look realistic can be obtained, enabling a dynamic display of the mouth shape while the voice data are played.
  • the speech-based lip animation synthesis method proposed in this embodiment obtains the phoneme features of the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the corresponding acoustic features, which include the MFCC features, the pronunciation duration and the pronunciation fundamental frequency; these acoustic features are input into a speech synthesizer to obtain the speech data corresponding to the target text data. According to the speech data, the pre-trained tensor model and the preset speaker identification information, the mouth-shape data corresponding to the speech data and the speaker identification information are acquired, and a mouth-shape animation corresponding to the speech data is generated from the mouth-shape data, so that the animation is displayed while the speech data are played.
  • this scheme uses the deep neural network model to transform the target text data into acoustic features, which enables better feature mining and gives the speech synthesis system more accurate and more natural output; at the same time, the tensor model that expresses the relation between acoustic features and mouth-shape data converts the synthesized speech data into corresponding mouth-shape data, from which a mouth-shape animation corresponding to the target text data is generated. This solves the technical problem that the prior art cannot display a realistic mouth-shape animation matching synthesized speech data.
  • an embodiment of the present application further provides a computer-readable storage medium on which a lip animation synthesis program is stored; the lip animation synthesis program can be executed by one or more processors to implement the following operations: acquiring target text data and acquiring its phoneme features according to a pronunciation dictionary; inputting the phoneme features into a pre-trained deep neural network model and outputting the corresponding acoustic features, including MFCC features, pronunciation duration and pronunciation fundamental frequency; inputting the acoustic features into a speech synthesizer to output the speech data corresponding to the target text data; acquiring mouth-shape data from the speech data, a pre-trained tensor model and preset speaker identification information; and generating the corresponding mouth-shape animation for display while the speech data are played.
  • the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as the ROM/RAM, magnetic disk or optical disk described above) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A device and method for speech-based mouth shape animation blending. The device comprises a memory and a processor. A mouth shape animation blending program that can run on the processor is stored in the memory. The program implements the following steps when executed by the processor: acquiring target text data, acquiring phoneme features in the target text data on the basis of a pronunciation dictionary (S10); inputting the phoneme features into a pre-trained deep neural network model, outputting acoustic features (S20); inputting the acoustic features into a speech synthesizer and outputting speech data (S30); acquiring mouth shape data on the basis of the speech data, of a pretrained tensor model, and of speaker identification information (S40); and generating a corresponding mouth shape animation on the basis of the mouth shape data and of the speech data (S50). The device and method solve the technical problem in the prior art in which a mouth shape animation matching speech data and having a real feel could not be presented.

Description

Speech-based lip animation synthesis device, method and readable storage medium
This application claims priority to Chinese Patent Application No. 201810327672.1, entitled "Speech-based lip animation synthesis device, method and readable storage medium", filed with the Chinese Patent Office on April 12, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technologies, and in particular to a speech-based lip animation synthesis device, method and readable storage medium.
Background
Speech synthesis, also known as text-to-speech technology, is a technology that converts text information into speech and reads it aloud. It involves acoustics, linguistics, digital signal processing, computer science and other disciplines, and is a cutting-edge technology in the field of Chinese information processing; the main problem it solves is how to convert text information into audible sound information.
In some application scenarios, such as computer-assisted pronunciation training, the speaker's mouth-shape changes need to be displayed dynamically while the voice data is played, to help the user practise pronunciation. In the prior art, when the played voice data is synthesized, there is no corresponding real speaker mouth data available for display, so a realistic mouth-shape animation that matches the synthesized voice data cannot be shown.
Summary of the Invention
The present application provides a speech-based lip animation synthesis device, method and readable storage medium, the main purpose of which is to solve the prior-art problem that a realistic mouth-shape animation matching synthesized speech data cannot be displayed.
To achieve the above object, the present application provides a speech-based lip animation synthesis device comprising a memory and a processor, wherein the memory stores a lip animation synthesis program executable on the processor, and the lip animation synthesis program, when executed by the processor, implements the following steps:
acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary;
inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency;
inputting the acoustic features into a speech synthesizer, and outputting speech data corresponding to the target text data;
acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, mouth-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of the speech data and the mouth-position features of the mouth-shape data;
generating a mouth-shape animation corresponding to the speech data according to the mouth-shape data, so that the mouth-shape animation is displayed while the speech data is played.
In addition, to achieve the above object, the present application further provides a speech-based lip animation synthesis method, the method comprising:
acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary;
inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency;
inputting the acoustic features into a speech synthesizer, and outputting speech data corresponding to the target text data;
acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, mouth-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of the speech data and the mouth-position features of the mouth-shape data;
generating a mouth-shape animation corresponding to the speech data according to the mouth-shape data, so that the mouth-shape animation is displayed while the speech data is played.
In addition, to achieve the above object, the present application further provides a computer-readable storage medium on which a lip animation synthesis program is stored; the lip animation synthesis program can be executed by one or more processors to implement the steps of the speech-based lip animation synthesis method described above.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a preferred embodiment of the speech-based lip animation synthesis device of the present application;
FIG. 2 is a schematic diagram of the program modules of the lip animation synthesis program in an embodiment of the speech-based lip animation synthesis device of the present application;
FIG. 3 is a flowchart of a preferred embodiment of the speech-based lip animation synthesis method of the present application.
The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description of the Embodiments
It should be understood that the specific embodiments described herein are merely illustrative of the application and are not intended to limit it.
The present application provides a speech-based lip animation synthesis device. Referring to FIG. 1, a schematic diagram of a preferred embodiment of the speech-based lip animation synthesis device of the present application is shown.
In this embodiment, the speech-based lip animation synthesis device may be a PC (Personal Computer), or a terminal device such as a smartphone, tablet computer or portable computer. The device includes at least a memory 11, a processor 12, a communication bus 13 and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments the memory 11 may be an internal storage unit of the speech-based lip animation synthesis device, such as its hard disk; in other embodiments it may be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card equipped on the device. Further, the memory 11 may include both the internal storage unit and an external storage device. The memory 11 can be used not only to store the application software installed in the device and various types of data, such as the code of the lip animation synthesis program, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip, used to run the program code stored in the memory 11 or to process data, for example to execute the lip animation synthesis program.
The communication bus 13 is used to implement connection and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is typically used to establish a communication connection between the device and other electronic devices.
FIG. 1 shows only a speech-based lip animation synthesis device having the components 11-14 and the lip animation synthesis program; it should be understood that not all illustrated components are required, and more or fewer components may be implemented instead.
Optionally, the device may further include a user interface, which may comprise a display and an input unit such as a keyboard, and optionally also a standard wired interface and a wireless interface. In some embodiments the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and is used to show the information processed in the speech-based lip animation synthesis device and to present a visual user interface.
In the device embodiment shown in FIG. 1, a lip animation synthesis program is stored in the memory 11; when the processor 12 executes the lip animation synthesis program stored in the memory 11, the following steps are implemented:
Acquire target text data, and acquire phoneme features in the target text data according to the pronunciation dictionary.
Input the phoneme features into a pre-trained deep neural network model, and output the acoustic features corresponding to the phoneme features, the acoustic features including Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and pronunciation fundamental frequency.
Input the acoustic features into a speech synthesizer, and output the speech data corresponding to the target text data.
In the solution proposed in this embodiment, the target text data is converted into speech data through a pre-established deep neural network model, and the speech data is converted into mouth-shape data through a pre-established tensor model. Specifically, the target text data to be synthesized is obtained, the target text data is split into characters or words by a word segmentation tool, and the resulting characters or words are then split into phonemes through the pronunciation dictionary, thereby obtaining the phoneme features; for Chinese, the phonemes include initial (consonant) phonemes and final (vowel) phonemes. In this embodiment, taking Chinese as an example, for each phoneme the phoneme features mainly include: the pronunciation features of the current phoneme, the previous phoneme and the next phoneme; the position of the current phoneme in the character; the syllable features of the current phoneme, the previous phoneme and the next phoneme; and the position in the sentence of the character containing the current phoneme. The pronunciation features include the phoneme type (vowel or consonant), length, pitch, accent position, position of the final, place of articulation, and whether the final is voiced; the syllable features include the syllable position, the position of the phoneme in the syllable, and the position of the syllable in the character. The phoneme features can thus be expressed as a 3*7+3*3+2=32-dimensional feature vector.
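As a concrete illustration of the 32-dimensional layout described above (seven pronunciation features for each of the previous, current and next phoneme, three syllable features for each of the three phonemes, plus two position features), a minimal sketch follows. The particular field names and encodings are assumptions for the example; the patent does not specify them.

```python
import numpy as np

def pronunciation_features(p: dict) -> list:
    # 7 values per phoneme: type, length, pitch, accent position, final position,
    # place of articulation, whether the final is voiced (numeric encodings assumed).
    return [p["type"], p["length"], p["pitch"], p["accent_pos"],
            p["final_pos"], p["articulation"], p["final_voiced"]]

def syllable_features(p: dict) -> list:
    # 3 values per phoneme: syllable position, position of the phoneme in the
    # syllable, position of the syllable in the character/word.
    return [p["syllable_pos"], p["pos_in_syllable"], p["syllable_in_word"]]

def phoneme_feature_vector(prev: dict, cur: dict, nxt: dict,
                           pos_in_word: float, word_pos_in_sentence: float) -> np.ndarray:
    # 3*7 + 3*3 + 2 = 32-dimensional feature vector, matching the count in the text.
    vec = (pronunciation_features(prev) + pronunciation_features(cur) + pronunciation_features(nxt)
           + syllable_features(prev) + syllable_features(cur) + syllable_features(nxt)
           + [pos_in_word, word_pos_in_sentence])
    assert len(vec) == 32
    return np.asarray(vec, dtype=np.float32)
```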
A deep neural network model for expressing the correlation between phoneme features and acoustic features is trained in advance; the above feature vector is input into the model to obtain the corresponding acoustic features, which include time-series features and the duration of each sound. The time-series features consist of a 25-dimensional feature vector and the fundamental frequency; the 25-dimensional vector contains 25 Mel-frequency cepstral coefficients (MFCC) and represents the acoustic features of one 10 ms frame of speech. The MFCC features, the pronunciation duration and the pronunciation fundamental frequency are then synthesized into a speech signal by a speech synthesizer.
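For reference, acoustic features of the kind described here (25 MFCCs per 10 ms frame plus a fundamental-frequency track) could be extracted from recorded training speech roughly as follows. librosa is used purely as an example tool and is not mentioned in the patent; the sampling rate and pitch range are assumptions.

```python
import librosa
import numpy as np

def extract_acoustic_features(wav_path: str, sr: int = 16000):
    y, sr = librosa.load(wav_path, sr=sr)
    hop = int(0.010 * sr)                      # 10 ms frame shift, as in the description
    # 25 Mel-frequency cepstral coefficients per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25, hop_length=hop)   # (25, n_frames)
    # Fundamental frequency (F0) track; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"),
                            sr=sr, hop_length=hop)
    f0 = np.nan_to_num(f0)                     # replace NaN (unvoiced frames) with 0
    n = min(mfcc.shape[1], len(f0))
    return mfcc[:, :n].T, f0[:n]               # (n_frames, 25) and (n_frames,)
```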
Before the deep neural network model of this embodiment is applied, it needs to be trained. First, a corpus is collected and a sample library is constructed from the corpus of at least one speaker; the corpus includes speech data together with the corresponding text data and mouth-shape data, i.e. the speech data obtained when one or more speakers read the same text aloud and the corresponding mouth-shape data. The mouth-shape data are physiological electromagnetic articulography (EMA) recordings that capture mouth-movement information and reflect the speaker's mouth shape during pronunciation. The deep neural network model is then trained on the text data and the speech data in the sample library, and the model parameters of the deep neural network model are obtained.
Specifically, the training process of the deep neural network model is as follows: the phoneme features are extracted from the text data in the sample library in combination with the pronunciation dictionary, forming the 3*7+3*3+2=32-dimensional feature vector; the acoustic features, mainly the MFCC features, the pronunciation duration and the pronunciation fundamental frequency, are extracted from the speech data corresponding to the text data and used as the reference for training; the two are fed into the deep neural network for training, yielding the model parameters to be solved, i.e. the weights of the individual phoneme and acoustic features between a particular phoneme and its corresponding pronunciation. The pronunciation duration can be predicted from the length and syllable-position features in the phoneme features, and the pronunciation fundamental frequency can be predicted from pronunciation features such as pitch and accent position in the phoneme features.
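The patent does not specify the network architecture or training setup. As a hedged illustration only, a simple feed-forward network mapping the 32-dimensional phoneme features to a 27-dimensional acoustic target (25 MFCCs, duration, fundamental frequency) could be trained along the following lines; PyTorch, the layer sizes and the hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

# Assumed shapes: X is (N, 32) phoneme feature vectors, Y is (N, 27) acoustic
# targets (25 MFCCs + duration + F0) built from the sample library.
def train_acoustic_model(X: torch.Tensor, Y: torch.Tensor,
                         epochs: int = 50, lr: float = 1e-3) -> nn.Module:
    model = nn.Sequential(              # depth/width are assumptions, not from the patent
        nn.Linear(32, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 27),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), Y)     # compare predictions with corpus targets
        loss.backward()                 # backpropagation through the network
        opt.step()
    return model
```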
According to the speech data, the pre-trained tensor model and the preset speaker identification information, the mouth-shape data corresponding to the speech data and the speaker identification information are acquired; the tensor model expresses the correlation between the pronunciation features of the speech data and the mouth-position features of the mouth-shape data.
It should be noted that the mouth-shape data in this embodiment are physiological electromagnetic articulography data obtained by capturing mouth-movement information; these data mainly include the coordinate information of a specific mouth shape and the corresponding mouth-shape image. During model training, the mouth-position features in the mouth-shape data are used directly; they mainly include the coordinate information of the following positions: tongue tip, tongue body, tongue dorsum, upper lip, lower lip, upper incisors and lower incisors.
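As a small illustration, the seven articulator positions named here can be flattened into one vector per recorded frame. Whether the coordinates are 2-D or 3-D is not entirely clear from the text, so 2-D coordinates are assumed below.

```python
import numpy as np

ARTICULATORS = ["tongue_tip", "tongue_body", "tongue_dorsum",
                "upper_lip", "lower_lip", "upper_incisor", "lower_incisor"]

def mouth_position_vector(ema_frame: dict) -> np.ndarray:
    """Flatten one articulography frame {articulator: (x, y)} into a 14-dimensional vector."""
    return np.asarray([c for name in ARTICULATORS for c in ema_frame[name]],
                      dtype=np.float32)

# Example usage (coordinates are made up):
frame = {name: (0.0, 0.0) for name in ARTICULATORS}
assert mouth_position_vector(frame).shape == (14,)
```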
Based on the speech data and mouth-shape data in the sample library, a tensor model expressing the correlation between the acoustic features and the mouth-shape data is trained in advance. The tensor model is a third-order tensor model whose three dimensions correspond to pronunciation features, mouth-shape data and speaker identification information, respectively. The pronunciation features of the speech data in the sample library are extracted, the pronunciation features and the speaker identification information are used as the input features of the third-order tensor model and the mouth-shape data as its output features, and the third-order tensor model is trained with a higher-order singular value decomposition (HOSVD) algorithm to obtain its model parameters.
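The patent names higher-order singular value decomposition as the training algorithm but does not give an implementation. A minimal NumPy sketch of HOSVD applied to a (speaker x pronunciation-unit x mouth-coordinate) data tensor, centred on the per-unit mean μ, is given below; the tensor layout and the example shapes are assumptions.

```python
import numpy as np

def unfold(T: np.ndarray, mode: int) -> np.ndarray:
    # Mode-n unfolding: bring the chosen axis to the front and flatten the rest.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T: np.ndarray, M: np.ndarray, mode: int) -> np.ndarray:
    # Multiply tensor T by matrix M along the chosen mode.
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=(1, 0)), 0, mode)

def hosvd(T: np.ndarray):
    # Higher-order SVD: one orthonormal factor matrix per mode plus a core tensor.
    factors = [np.linalg.svd(unfold(T, n), full_matrices=False)[0] for n in range(T.ndim)]
    core = T
    for n, U in enumerate(factors):
        core = mode_dot(core, U.T, n)
    return core, factors

# D[s, p, c]: averaged mouth coordinate c of speaker s for pronunciation unit p
# (shapes are illustrative only).
D = np.random.rand(5, 40, 14)
mu = D.mean(axis=0)                      # per-unit average mouth position (the mu of the text)
core, (U_spk, U_pron, U_coord) = hosvd(D - mu)

# Reconstruction: mouth data for all speakers/units from the learned factors.
recon = mu + mode_dot(mode_dot(mode_dot(core, U_spk, 0), U_pron, 1), U_coord, 2)
assert np.allclose(recon, D)
```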
Specifically, the third-order tensor model in this embodiment is constructed and trained as follows. The set of pronunciation features is taken as one parameter space [formula image PCTCN2018102209-appb-000001], and the set of lip-shape data corresponding to the pronunciation features as another parameter space [formula image PCTCN2018102209-appb-000002]. A multilinear space transform is built on these parameter spaces, with the expression [formula image PCTCN2018102209-appb-000003], where [formula image PCTCN2018102209-appb-000004] is a grid structure used to store lip-shape data [formula image PCTCN2018102209-appb-000005]. V stores the three-dimensional coordinate information of a specific mouth shape, of which two dimensions are the mouth-shape coordinates and the remaining one is the speaker identification information (the speaker ID), since the mouth positions of different speakers differ slightly; F stores the mouth image of the specific mouth shape. This space transform expresses the correlation between pronunciation features and lip-position features. From this multilinear space transform a third-order tensor is constructed whose three dimensions correspond to acoustic features, lip-shape data and speaker identification information respectively, with the expression [formula image PCTCN2018102209-appb-000006].
The left-hand side of the equation contains the model parameters to be solved, mainly the weights of the individual features in the parameter space [formula image PCTCN2018102209-appb-000007] and the parameter space [formula image PCTCN2018102209-appb-000008]; the right-hand side contains the features supplied during training, namely the pronunciation features and lip-position features obtained by feature extraction from the text data and lip-shape data in the database. Here C is the tensor notation, and μ is the mouth-position information averaged over different speakers; taking the sound "a" as an example, its μ is the average of the mouth positions of different speakers when producing "a". Since tensor decomposition generally uses a higher-order singular value decomposition algorithm, in this embodiment the third-order tensor model is trained with a higher-order singular value decomposition algorithm to solve the model parameters on the left-hand side of the expression.
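Because the tensor expressions themselves appear only as formula images in the published text, the sketch below shows a generic higher-order SVD of a third-order data tensor whose modes are pronunciation features, speakers and lip-position features. The tensor sizes and the way the data tensor is filled are assumptions, and the real model additionally works with the averaged lip position μ; this is only a sketch of the decomposition technique named above, not the patent's exact formulation:

import numpy as np

def unfold(tensor: np.ndarray, mode: int) -> np.ndarray:
    """Mode-n unfolding: move `mode` to the front and flatten the other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd(tensor: np.ndarray):
    """Higher-order SVD: one orthonormal factor per mode plus a core tensor."""
    factors = []
    for mode in range(tensor.ndim):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)
    # Core tensor: contract each mode of the data tensor with the transpose of its factor.
    core = tensor
    for mode, u in enumerate(factors):
        core = np.moveaxis(
            np.tensordot(u.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# Assumed sizes: 32 pronunciation features, 5 speakers, 21 lip-position values.
rng = np.random.default_rng(0)
data = rng.standard_normal((32, 5, 21))   # stand-in for the training tensor
core, (U_pron, U_speaker, U_lip) = hosvd(data)

# Reconstruction check: multiplying the core back by every factor recovers the data.
recon = core
for mode, u in enumerate((U_pron, U_speaker, U_lip)):
    recon = np.moveaxis(np.tensordot(u, np.moveaxis(recon, mode, 0), axes=1), 0, mode)
print(np.allclose(recon, data))  # True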
After the speech data has been obtained from the deep neural network model, the speech data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the lip-shape data corresponding to that speech data. That is, when the sample library used to train the third-order tensor model contains the corpora of several speakers, the user can select the speaker identification information in advance, and the generated lip-shape data will then be closer to that speaker's own lip-shape data.
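One plausible reading of this lookup step, assuming the factors from the trained tensor model above are available, is sketched below; the frame-level pronunciation feature, the way the speaker row is selected and the use of μ as an additive average are all illustrative assumptions rather than the patent's exact computation:

import numpy as np

def predict_lip_frame(core, U_pron, U_speaker, U_lip, mu,
                      pron_feat: np.ndarray, speaker_id: int) -> np.ndarray:
    """Project one frame of pronunciation features through the trained tensor
    model for the chosen speaker and return a lip-position vector."""
    # Contract the core tensor with the pronunciation and speaker factors...
    w_pron = pron_feat @ U_pron        # weights in the pronunciation-feature mode
    w_spk = U_speaker[speaker_id]      # row of the speaker factor for the chosen ID
    lip_latent = np.einsum("i,j,ijk->k", w_pron, w_spk, core)
    # ...then map back to lip coordinates and add the averaged mouth position.
    return lip_latent @ U_lip.T + mu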
A lip animation corresponding to the speech data is generated from the lip-shape data, so that the lip animation is displayed while the speech data is played. From the lip-shape data obtained for each phoneme in the target text data and a preset three-dimensional lip-region model, a lip animation that can be displayed dynamically is generated, and it is shown while the synthesized speech corresponding to the target text data is played. In the solution of this embodiment, a deep neural network model is used to model the mapping from phoneme features to acoustic features; this mapping is a non-linear problem, and a deep neural network achieves better feature mining and representation, so the speech synthesis system produces more accurate and more natural output. Moreover, by building a tensor model to express the correlation between pronunciation features and lip-shape features, lip-shape data that matches the synthesized speech and looks realistic can be obtained, so that the mouth shape is displayed dynamically while the speech data is played.
The speech-based lip animation synthesis device proposed in this embodiment obtains the phoneme features of the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the corresponding acoustic features, which comprise MFCC features, pronunciation duration and fundamental frequency. These acoustic features are fed into a speech synthesizer to obtain the speech data corresponding to the target text data. According to the speech data, the pre-trained tensor model and the preset speaker identification information, lip-shape data corresponding to the speech data and the speaker identification information is obtained, and a lip animation corresponding to the speech data is generated from the lip-shape data so that it can be displayed while the speech data is played. By using a deep neural network model to convert the target text data into acoustic features, better feature mining is achieved and the speech synthesis system produces more accurate and natural output; and by using a tensor model that can express the relation between acoustic features and lip-shape data to convert the synthesized speech data into corresponding lip-shape data, and generating from it a lip animation corresponding to the target text data, the solution addresses the prior-art problem of being unable to display a realistic lip animation that matches synthesized speech.
Optionally, in other embodiments, the lip animation synthesis program may also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the application. A module in this application refers to a series of computer program instruction segments capable of performing a specific function, used to describe the execution process of the lip animation synthesis program in the speech-based lip animation synthesis device.
For example, referring to FIG. 2, which is a schematic diagram of the program modules of the lip animation synthesis program in an embodiment of the speech-based lip animation synthesis device of this application, the lip animation synthesis program can be divided into a feature extraction module 10, a feature conversion module 20, a speech synthesis module 30, a lip generation module 40 and an animation synthesis module 50, exemplarily:
The feature extraction module 10 is configured to: acquire target text data, and acquire the phoneme features in the target text data according to a pronunciation dictionary;
The feature conversion module 20 is configured to: input the phoneme features into a pre-trained deep neural network model and output the acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency;
The speech synthesis module 30 is configured to: input the acoustic features into a speech synthesizer and output the speech data corresponding to the target text data;
The lip generation module 40 is configured to: acquire, according to the speech data, the pre-trained tensor model and the preset speaker identification information, the lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data;
The animation synthesis module 50 is configured to: generate, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played. (A structural sketch of these five modules is given after this list.)
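Purely as an illustration of the module split listed above, the five program modules could be arranged as plain Python classes; every class, method and parameter name here is invented for the sketch and does not come from the patent:

class FeatureExtractionModule:
    def __init__(self, pronunciation_dict):
        self.pronunciation_dict = pronunciation_dict

    def extract(self, target_text: str):
        """Split the text into words and look up their phoneme features."""
        ...

class FeatureConversionModule:
    def __init__(self, dnn_model):
        self.dnn_model = dnn_model

    def convert(self, phoneme_features):
        """Run the pre-trained DNN to get MFCCs, durations and F0."""
        ...

class SpeechSynthesisModule:
    def synthesize(self, acoustic_features):
        """Feed the acoustic features to a speech synthesizer."""
        ...

class LipGenerationModule:
    def __init__(self, tensor_model, speaker_id):
        self.tensor_model = tensor_model
        self.speaker_id = speaker_id

    def generate(self, speech_data):
        """Map the synthesized speech to lip-position data for the chosen speaker."""
        ...

class AnimationSynthesisModule:
    def render(self, lip_data):
        """Drive a 3-D lip model with the lip data while the audio plays."""
        ...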
The functions or operation steps implemented when the program modules such as the feature extraction module 10, the feature conversion module 20, the speech synthesis module 30, the lip generation module 40 and the animation synthesis module 50 are executed are substantially the same as in the embodiments above and are not repeated here.
In addition, this application also provides a speech-based lip animation synthesis method. Referring to FIG. 3, which is a flowchart of a preferred embodiment of the speech-based lip animation synthesis method of this application, the method may be performed by a device implemented in software and/or hardware; in the following, the speech-based lip animation synthesis device is taken as the executing body to describe the method of this embodiment.
In this embodiment, the speech-based lip animation synthesis method comprises:
Step S10: acquire target text data, and acquire the phoneme features in the target text data according to a pronunciation dictionary.
Step S20: input the phoneme features into a pre-trained deep neural network model and output the acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency.
Step S30: input the acoustic features into a speech synthesizer and output the speech data corresponding to the target text data.
In the solution of this embodiment, the target text data is converted into speech data through a pre-established deep neural network model, and the speech data is converted into lip-shape data through a pre-established tensor model. Specifically, the target text data to be synthesized is obtained and split into words with a word segmentation tool; the resulting words are then split into phonemes through the pronunciation dictionary, and the phoneme features are obtained. For Chinese, the phonemes comprise initial phonemes and final phonemes. Taking Chinese as an example, for each phoneme the phoneme features mainly comprise: the pronunciation features of the current phoneme, the previous phoneme and the next phoneme; the position of the current phoneme in the word; the syllable features of the current, previous and next phoneme; and the position in the sentence of the word containing the current phoneme. The pronunciation features comprise the phoneme type (vowel or consonant), length, pitch, stress position, position of the final, place of articulation and whether the final is voiced; the syllable features comprise the syllable position, the position of the phoneme in the syllable and the position of the syllable in the word. The phoneme features can be expressed as a 3*7+3*3+2=32-dimensional feature vector.
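To make the 3*7+3*3+2=32 layout concrete, the sketch below packs the per-phoneme features into a vector of that shape; the exact numeric encodings (for example how phoneme type or pitch is represented) are not specified in the text and are assumptions here:

import numpy as np

# 7 pronunciation features for each of the previous / current / next phoneme,
# 3 syllable features for each of those three phonemes,
# 2 position features for the current phoneme: 3*7 + 3*3 + 2 = 32.
PRON_KEYS = ["phoneme_type", "length", "pitch", "stress_pos",
             "final_pos", "articulation_place", "final_voiced"]
SYLL_KEYS = ["syllable_pos", "pos_in_syllable", "syllable_pos_in_word"]

def phoneme_feature_vector(prev_p, cur_p, next_p, pos_in_word, word_pos_in_sentence):
    """Build the 32-dimensional phoneme feature vector used as the DNN input."""
    vec = []
    for p in (prev_p, cur_p, next_p):
        vec += [p[k] for k in PRON_KEYS]        # 3 * 7 pronunciation features
    for p in (prev_p, cur_p, next_p):
        vec += [p[k] for k in SYLL_KEYS]        # 3 * 3 syllable features
    vec += [pos_in_word, word_pos_in_sentence]  # 2 position features
    return np.asarray(vec, dtype=float)         # shape (32,)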
A deep neural network model expressing the correlation between phoneme features and acoustic features is trained in advance; the above feature vector is input into this model to obtain the corresponding acoustic features, which contain time-series features and the pronunciation length of each sound. The time-series features comprise a 25-dimensional feature vector and the fundamental frequency; the 25-dimensional vector contains 25 Mel-frequency cepstral coefficients (MFCC) and represents the acoustic features of one 10 ms speech frame. The MFCC features, pronunciation length and fundamental frequency are then synthesized into a speech signal by a speech synthesizer.
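The handoff from acoustic features to waveform can be sketched as follows. `synthesize_waveform` is a stand-in for whatever synthesizer is used, since the text does not name one, the model is assumed to return a dictionary of outputs, and the frame arrangement (one 25-dimensional MFCC vector per 10 ms frame plus F0) follows the description above:

import numpy as np

def acoustics_to_speech(dnn_model, phoneme_vectors, synthesize_waveform):
    """Run the trained DNN phoneme by phoneme and hand the resulting
    frame-level acoustic features to a speech synthesizer."""
    mfcc_frames, f0_frames = [], []
    for vec in phoneme_vectors:                       # one 32-dim vector per phoneme
        out = dnn_model(vec)                          # assumed to return a dict
        n_frames = max(1, int(round(out["duration_ms"] / 10)))  # 10 ms per frame
        mfcc_frames += [out["mfcc"]] * n_frames       # 25 MFCCs per frame
        f0_frames += [out["f0"]] * n_frames
    mfcc = np.stack(mfcc_frames)                      # (T, 25)
    f0 = np.asarray(f0_frames)                        # (T,)
    return synthesize_waveform(mfcc, f0)              # stand-in synthesizer call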
Before the deep neural network model of this embodiment is applied, the model needs to be trained. First, corpora are collected to build samples: a sample library is built from the corpus of at least one speaker, the corpus comprising speech data together with the corresponding text data and lip-shape data, i.e. the speech data obtained when one or more speakers read the same text data, and the corresponding lip-shape data. The lip-shape data is physiological electromagnetic articulography data that captures mouth-movement information and reflects the speaker's mouth shape during pronunciation. Then the deep neural network model is trained from the text data and speech data in the sample library to obtain its model parameters.
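A sample library of this kind might be organized as parallel records per utterance and speaker; everything in the sketch below (record fields, file layout, grouping) is an assumption for illustration only:

from dataclasses import dataclass
from typing import List

@dataclass
class CorpusRecord:
    """One aligned training example in the sample library."""
    speaker_id: str      # which speaker read the sentence
    text: str            # the text data that was read
    audio_path: str      # the recorded speech data
    ema_path: str        # the EMA (lip-shape) capture recorded at the same time

def build_sample_library(records: List[CorpusRecord]):
    """Group records so the DNN sees (text, audio) pairs and the tensor
    model sees (audio, lip data, speaker) triples."""
    dnn_pairs = [(r.text, r.audio_path) for r in records]
    tensor_triples = [(r.audio_path, r.ema_path, r.speaker_id) for r in records]
    return dnn_pairs, tensor_triples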
Specifically, the deep neural network model is trained as follows. Phoneme features are extracted from the text data in the sample library with the help of the pronunciation dictionary; these features form a 3*7+3*3+2=32-dimensional feature vector. Acoustic features are extracted from the speech data corresponding to that text data, mainly MFCC features, pronunciation length and fundamental frequency, and serve as the reference information against which training is compared. Both are fed into the deep neural network model for training to obtain the model parameters to be solved, namely the weights of the individual phoneme features and acoustic features linking a particular phoneme to its corresponding pronunciation. The pronunciation duration can be predicted from the length and syllable-position features among the phoneme features, and the fundamental frequency from pronunciation features such as pitch and stress position.
Step S40: according to the speech data, the pre-trained tensor model and the preset speaker identification information, acquire the lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data.
It should be noted that the lip-shape data in this embodiment is physiological electromagnetic articulography (EMA) data that captures mouth-movement information; the EMA data mainly comprises the coordinate information of a specific mouth shape and the corresponding mouth image. During model training, the lip-position features in the lip-shape data are used directly; they mainly comprise the coordinates of the following positions: tongue tip, tongue body, tongue dorsum, upper lip, lower lip, upper incisors and lower incisors.
Based on the speech data and lip-shape data in the sample library, a tensor model expressing the correlation between acoustic features and lip-shape data is trained in advance. The tensor model is a third-order tensor model whose three dimensions correspond to pronunciation features, lip-shape data and speaker identification information respectively. The pronunciation features of the speech data in the sample library are obtained; the pronunciation features and the speaker identification information are used as the input features of the third-order tensor model and the lip-shape data as its output features, and the third-order tensor model is trained with a higher-order singular value decomposition algorithm to obtain its model parameters.
Specifically, the third-order tensor model in this embodiment is constructed and trained as follows. The set of pronunciation features is taken as one parameter space [formula image PCTCN2018102209-appb-000009], and the set of lip-shape data corresponding to the pronunciation features as another parameter space [formula image PCTCN2018102209-appb-000010]. A multilinear space transform is built on these parameter spaces, with the expression [formula image PCTCN2018102209-appb-000011], where [formula image PCTCN2018102209-appb-000012] is a grid structure used to store lip-shape data [formula image PCTCN2018102209-appb-000013]. V stores the three-dimensional coordinate information of a specific mouth shape, of which two dimensions are the mouth-shape coordinates and the remaining one is the speaker identification information (the speaker ID), since the mouth positions of different speakers differ slightly; F stores the mouth image of the specific mouth shape. This space transform expresses the correlation between pronunciation features and lip-position features. From this multilinear space transform a third-order tensor is constructed whose three dimensions correspond to acoustic features, lip-shape data and speaker identification information respectively, with the expression [formula image PCTCN2018102209-appb-000014].
The left-hand side of the equation contains the model parameters to be solved, mainly the weights of the individual features in the parameter space [formula image PCTCN2018102209-appb-000015] and the parameter space [formula image PCTCN2018102209-appb-000016]; the right-hand side contains the features supplied during training, namely the pronunciation features and lip-position features obtained by feature extraction from the text data and lip-shape data in the database. Here C is the tensor notation, and μ is the mouth-position information averaged over different speakers; taking the sound "a" as an example, its μ is the average of the mouth positions of different speakers when producing "a". Since tensor decomposition generally uses a higher-order singular value decomposition algorithm, in this embodiment the third-order tensor model is trained with a higher-order singular value decomposition algorithm to solve the model parameters on the left-hand side of the expression.
After the speech data has been obtained from the deep neural network model, the speech data and the preset speaker identification information are input into the pre-trained third-order tensor model to obtain the lip-shape data corresponding to that speech data. That is, when the sample library used to train the third-order tensor model contains the corpora of several speakers, the user can select the speaker identification information in advance, and the generated lip-shape data will then be closer to that speaker's own lip-shape data.
Step S50: generate, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played.
From the lip-shape data obtained for each phoneme in the target text data and a preset three-dimensional lip-region model, a lip animation that can be displayed dynamically is generated, and it is shown while the synthesized speech corresponding to the target text data is played. In the solution of this embodiment, a deep neural network model is used to model the mapping from phoneme features to acoustic features; this mapping is a non-linear problem, and a deep neural network achieves better feature mining and representation, so the speech synthesis system produces more accurate and more natural output. Moreover, by building a tensor model to express the correlation between pronunciation features and lip-shape features, lip-shape data that matches the synthesized speech and looks realistic can be obtained, so that the mouth shape is displayed dynamically while the speech data is played.
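One simple way to turn per-phoneme lip data into an animation is to place each lip-position vector at the midpoint of its phoneme and interpolate at the video frame rate, as sketched below; the interpolation scheme and frame rate are assumptions, and driving the actual 3-D lip-region model is left as a stand-in step:

import numpy as np

def lip_animation_frames(lip_keys, durations_ms, fps=25):
    """Interpolate per-phoneme lip-position vectors into per-frame poses.

    lip_keys     : (N, D) array, one lip-position vector per phoneme
    durations_ms : (N,) predicted duration of each phoneme in milliseconds
    """
    lip_keys = np.asarray(lip_keys, dtype=float)
    durations = np.asarray(durations_ms, dtype=float)
    ends = np.cumsum(durations)
    key_times = ends - durations / 2.0            # centre of each phoneme
    frame_times = np.arange(0.0, ends[-1], 1000.0 / fps)
    # Linear interpolation of every lip coordinate over time.
    frames = np.stack([np.interp(frame_times, key_times, lip_keys[:, d])
                       for d in range(lip_keys.shape[1])], axis=1)
    return frames                                  # (num_frames, D)

# Each frame would then be applied to the preset 3-D lip model and shown
# in sync with audio playback, e.g. apply_to_lip_model(frame)  # stand-in call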
The speech-based lip animation synthesis method proposed in this embodiment obtains the phoneme features of the target text data according to the pronunciation dictionary, inputs the phoneme features into the pre-trained deep neural network model, and outputs the corresponding acoustic features, which comprise MFCC features, pronunciation duration and fundamental frequency. These acoustic features are fed into a speech synthesizer to obtain the speech data corresponding to the target text data. According to the speech data, the pre-trained tensor model and the preset speaker identification information, lip-shape data corresponding to the speech data and the speaker identification information is obtained, and a lip animation corresponding to the speech data is generated from the lip-shape data so that it can be displayed while the speech data is played. By using a deep neural network model to convert the target text data into acoustic features, better feature mining is achieved and the speech synthesis system produces more accurate and natural output; and by using a tensor model that can express the relation between acoustic features and lip-shape data to convert the synthesized speech data into corresponding lip-shape data, and generating from it a lip animation corresponding to the target text data, the solution addresses the prior-art problem of being unable to display a realistic lip animation that matches synthesized speech.
In addition, an embodiment of this application further provides a computer-readable storage medium on which a lip animation synthesis program is stored; the lip animation synthesis program can be executed by one or more processors to implement the following operations:
acquiring target text data, and acquiring the phoneme features in the target text data according to a pronunciation dictionary;
inputting the phoneme features into a pre-trained deep neural network model and outputting the acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency;
inputting the acoustic features into a speech synthesizer and outputting the speech data corresponding to the target text data;
acquiring, according to the speech data, the pre-trained tensor model and the preset speaker identification information, the lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data;
generating, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played.
The specific embodiments of the computer-readable storage medium of this application are substantially the same as the embodiments of the speech-based lip animation synthesis device and method above and are not repeated here.
It should be noted that the serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, device, article or method that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, device, article or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, device, article or method that comprises the element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium as described above (such as ROM/RAM, magnetic disk, optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, computer, server, network device, etc.) to perform the methods described in the various embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit the patent scope of this application; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, or any direct or indirect use in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. A speech-based lip animation synthesis device, characterized in that the device comprises a memory and a processor, the memory storing a lip animation synthesis program executable on the processor, and the lip animation synthesis program, when executed by the processor, implementing the following steps:
    acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary;
    inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency;
    inputting the acoustic features into a speech synthesizer, and outputting speech data corresponding to the target text data;
    acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data;
    generating, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played.
  2. The speech-based lip animation synthesis device according to claim 1, characterized in that the step of acquiring target text data and acquiring phoneme features in the target text data according to a pronunciation dictionary comprises:
    acquiring target text data and performing word segmentation on the target text data to obtain a word segmentation result;
    converting the words in the word segmentation result into phoneme features through the pronunciation dictionary.
  3. The speech-based lip animation synthesis device according to claim 1, characterized in that the lip animation synthesis program is further executable by the processor to implement the following steps:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  4. The speech-based lip animation synthesis device according to claim 2, characterized in that the lip animation synthesis program is further executable by the processor to implement the following steps:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  5. The speech-based lip animation synthesis device according to claim 3, characterized in that the step of training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model comprises:
    extracting phoneme features from the text data in the sample library according to the pronunciation dictionary, and extracting acoustic features from the speech data corresponding to the text data;
    using the phoneme features as input features of the deep neural network model and the acoustic features as output features of the deep neural network model, training the deep neural network model to obtain the model parameters of the deep neural network model.
  6. The speech-based lip animation synthesis device according to claim 4, characterized in that the step of training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model comprises:
    extracting phoneme features from the text data in the sample library according to the pronunciation dictionary, and extracting acoustic features from the speech data corresponding to the text data;
    using the phoneme features as input features of the deep neural network model and the acoustic features as output features of the deep neural network model, training the deep neural network model to obtain the model parameters of the deep neural network model.
  7. The speech-based lip animation synthesis device according to claim 5 or 6, characterized in that the tensor model is a third-order tensor model, and the step of training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model comprises:
    constructing a third-order tensor model whose three dimensions correspond to pronunciation features, lip-shape data and speaker identification information respectively;
    acquiring the pronunciation features corresponding to the speech data in the sample library, using the pronunciation features and the speaker identification information as input features of the third-order tensor model and the lip-shape data corresponding to the speech data as output features of the third-order tensor model, and training the third-order tensor model with a higher-order singular value decomposition algorithm to obtain the model parameters of the third-order tensor model.
  8. A speech-based lip animation synthesis method, characterized in that the method comprises:
    acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary;
    inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency;
    inputting the acoustic features into a speech synthesizer, and outputting speech data corresponding to the target text data;
    acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data;
    generating, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played.
  9. The speech-based lip animation synthesis method according to claim 8, characterized in that the step of acquiring target text data and acquiring phoneme features in the target text data according to a pronunciation dictionary comprises:
    acquiring target text data and performing word segmentation on the target text data to obtain a word segmentation result;
    converting the words in the word segmentation result into phoneme features through the pronunciation dictionary.
  10. The speech-based lip animation synthesis method according to claim 8, characterized in that the method further comprises:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  11. The speech-based lip animation synthesis method according to claim 9, characterized in that the method further comprises:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  12. The speech-based lip animation synthesis method according to claim 10, characterized in that the step of training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model comprises:
    extracting phoneme features from the text data in the sample library according to the pronunciation dictionary, and extracting acoustic features from the speech data corresponding to the text data;
    using the phoneme features as input features of the deep neural network model and the acoustic features as output features of the deep neural network model, training the deep neural network model to obtain the model parameters of the deep neural network model.
  13. The speech-based lip animation synthesis method according to claim 11, characterized in that the step of training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model comprises:
    extracting phoneme features from the text data in the sample library according to the pronunciation dictionary, and extracting acoustic features from the speech data corresponding to the text data;
    using the phoneme features as input features of the deep neural network model and the acoustic features as output features of the deep neural network model, training the deep neural network model to obtain the model parameters of the deep neural network model.
  14. The speech-based lip animation synthesis method according to claim 12 or 13, characterized in that the tensor model is a third-order tensor model, and the step of training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model comprises:
    constructing a third-order tensor model whose three dimensions correspond to pronunciation features, lip-shape data and speaker identification information respectively;
    acquiring the pronunciation features corresponding to the speech data in the sample library, using the pronunciation features and the speaker identification information as input features of the third-order tensor model and the lip-shape data corresponding to the speech data as output features of the third-order tensor model, and training the third-order tensor model with a higher-order singular value decomposition algorithm to obtain the model parameters of the third-order tensor model.
  15. A computer-readable storage medium, characterized in that a lip animation synthesis program is stored on the computer-readable storage medium, and the lip animation synthesis program is executable by one or more processors to implement the following steps:
    acquiring target text data, and acquiring phoneme features in the target text data according to a pronunciation dictionary;
    inputting the phoneme features into a pre-trained deep neural network model, and outputting acoustic features corresponding to the phoneme features, the acoustic features comprising Mel-frequency cepstral coefficient (MFCC) features, pronunciation duration and fundamental frequency;
    inputting the acoustic features into a speech synthesizer, and outputting speech data corresponding to the target text data;
    acquiring, according to the speech data, a pre-trained tensor model and preset speaker identification information, lip-shape data corresponding to the speech data and the speaker identification information, the tensor model expressing the correlation between the pronunciation features of speech data and the lip-position features of lip-shape data;
    generating, from the lip-shape data, a lip animation corresponding to the speech data, so that the lip animation is displayed while the speech data is played.
  16. The computer-readable storage medium according to claim 15, characterized in that the step of acquiring target text data and acquiring phoneme features in the target text data according to a pronunciation dictionary comprises:
    acquiring target text data and performing word segmentation on the target text data to obtain a word segmentation result;
    converting the words in the word segmentation result into phoneme features through the pronunciation dictionary.
  17. The computer-readable storage medium according to claim 15, characterized in that the method further comprises:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  18. The computer-readable storage medium according to claim 16, characterized in that the method further comprises:
    building a sample library based on the corpus of at least one speaker, the corpus comprising speech data, and text data and lip-shape data corresponding to the speech data;
    training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model;
    training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model.
  19. The computer-readable storage medium according to claim 17 or 18, characterized in that the step of training the deep neural network model according to the text data and the speech data in the sample library to obtain the model parameters of the deep neural network model comprises:
    extracting phoneme features from the text data in the sample library according to the pronunciation dictionary, and extracting acoustic features from the speech data corresponding to the text data;
    using the phoneme features as input features of the deep neural network model and the acoustic features as output features of the deep neural network model, training the deep neural network model to obtain the model parameters of the deep neural network model.
  20. The computer-readable storage medium according to claim 19, characterized in that the tensor model is a third-order tensor model, and the step of training the tensor model according to the speech data and the lip-shape data in the sample library to obtain the model parameters of the tensor model comprises:
    constructing a third-order tensor model whose three dimensions correspond to pronunciation features, lip-shape data and speaker identification information respectively;
    acquiring the pronunciation features corresponding to the speech data in the sample library, using the pronunciation features and the speaker identification information as input features of the third-order tensor model and the lip-shape data corresponding to the speech data as output features of the third-order tensor model, and training the third-order tensor model with a higher-order singular value decomposition algorithm to obtain the model parameters of the third-order tensor model.
PCT/CN2018/102209 2018-04-12 2018-08-24 Device and method for speech-based mouth shape animation blending, and readable storage medium WO2019196306A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810327672.1 2018-04-12
CN201810327672.1A CN108763190B (en) 2018-04-12 2018-04-12 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing

Publications (1)

Publication Number Publication Date
WO2019196306A1 true WO2019196306A1 (en) 2019-10-17

Family

ID=63981728

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102209 WO2019196306A1 (en) 2018-04-12 2018-08-24 Device and method for speech-based mouth shape animation blending, and readable storage medium

Country Status (2)

Country Link
CN (1) CN108763190B (en)
WO (1) WO2019196306A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827799A (en) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
EP3866166A1 (en) * 2020-02-13 2021-08-18 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for predicting mouth-shape feature, electronic device, storage medium and computer program product
CN117173292A (en) * 2023-09-07 2023-12-05 河北日凌智能科技有限公司 Digital human interaction method and device based on vowel slices

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288077B (en) * 2018-11-14 2022-12-16 腾讯科技(深圳)有限公司 Method and related device for synthesizing speaking expression based on artificial intelligence
CN109523616B (en) * 2018-12-04 2023-05-30 科大讯飞股份有限公司 Facial animation generation method, device, equipment and readable storage medium
CN111326141A (en) * 2018-12-13 2020-06-23 南京硅基智能科技有限公司 Method for processing and acquiring human voice data
CN109801349B (en) * 2018-12-19 2023-01-24 武汉西山艺创文化有限公司 Sound-driven three-dimensional animation character real-time expression generation method and system
CN109599113A (en) * 2019-01-22 2019-04-09 北京百度网讯科技有限公司 Method and apparatus for handling information
CN110136698B (en) * 2019-04-11 2021-09-24 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for determining mouth shape
CN110189394B (en) * 2019-05-14 2020-12-29 北京字节跳动网络技术有限公司 Mouth shape generation method and device and electronic equipment
CN110288682B (en) * 2019-06-28 2023-09-26 北京百度网讯科技有限公司 Method and apparatus for controlling changes in a three-dimensional virtual portrait mouth shape
CN112181127A (en) * 2019-07-02 2021-01-05 上海浦东发展银行股份有限公司 Method and device for man-machine interaction
CN111133506A (en) * 2019-12-23 2020-05-08 深圳市优必选科技股份有限公司 Training method and device of speech synthesis model, computer equipment and storage medium
CN110992926B (en) * 2019-12-26 2022-06-10 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium
CN111340920B (en) * 2020-03-02 2024-04-09 长沙千博信息技术有限公司 Semantic-driven two-dimensional animation automatic generation method
CN111698552A (en) * 2020-05-15 2020-09-22 完美世界(北京)软件科技发展有限公司 Video resource generation method and device
CN112331184B (en) * 2020-10-29 2024-03-15 网易(杭州)网络有限公司 Voice mouth shape synchronization method and device, electronic equipment and storage medium
CN112837401B (en) * 2021-01-27 2024-04-09 网易(杭州)网络有限公司 Information processing method, device, computer equipment and storage medium
CN113079328B (en) * 2021-03-19 2023-03-28 北京有竹居网络技术有限公司 Video generation method and device, storage medium and electronic equipment
CN113314094B (en) * 2021-05-28 2024-05-07 北京达佳互联信息技术有限公司 Lip model training method and device and voice animation synthesis method and device
CN113707124A (en) * 2021-08-30 2021-11-26 平安银行股份有限公司 Linkage broadcasting method and device of voice operation, electronic equipment and storage medium
CN113870396B (en) * 2021-10-11 2023-08-15 北京字跳网络技术有限公司 Mouth shape animation generation method and device, computer equipment and storage medium
CN114420088A (en) * 2022-01-20 2022-04-29 安徽淘云科技股份有限公司 Display method and related equipment thereof
CN114581567B (en) * 2022-05-06 2022-08-02 成都市谛视无限科技有限公司 Method, device and medium for driving mouth shape of virtual image by sound
CN116257762B (en) * 2023-05-16 2023-07-14 世优(北京)科技有限公司 Training method of deep learning model and method for controlling mouth shape change of virtual image

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080312930A1 (en) * 1997-08-05 2008-12-18 At&T Corp. Method and system for aligning natural and synthetic video to speech synthesis
CN104361620A (en) * 2014-11-27 2015-02-18 韩慧健 Mouth shape animation synthesis method based on comprehensive weighted algorithm
US9262857B2 (en) * 2013-01-16 2016-02-16 Disney Enterprises, Inc. Multi-linear dynamic hair or clothing model with efficient collision handling
CN106297792A (en) * 2016-09-14 2017-01-04 厦门幻世网络科技有限公司 The recognition methods of a kind of voice mouth shape cartoon and device
CN106531150A (en) * 2016-12-23 2017-03-22 上海语知义信息技术有限公司 Emotion synthesis method based on deep neural network model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080312930A1 (en) * 1997-08-05 2008-12-18 At&T Corp. Method and system for aligning natural and synthetic video to speech synthesis
US9262857B2 (en) * 2013-01-16 2016-02-16 Disney Enterprises, Inc. Multi-linear dynamic hair or clothing model with efficient collision handling
CN104361620A (en) * 2014-11-27 2015-02-18 韩慧健 Mouth shape animation synthesis method based on comprehensive weighted algorithm
CN106297792A (en) * 2016-09-14 2017-01-04 厦门幻世网络科技有限公司 Speech-driven mouth shape animation recognition method and device
CN106531150A (en) * 2016-12-23 2017-03-22 上海语知义信息技术有限公司 Emotion synthesis method based on deep neural network model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GRALEWSKI, L. ET AL.: "Using a Tensor Framework for the Analysis of Facial Dynamics", 7TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION (FGR06), 24 April 2006 (2006-04-24), pages 217-222, XP010911558, DOI: 10.1109/FGR.2006.108 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827799A (en) * 2019-11-21 2020-02-21 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
CN110827799B (en) * 2019-11-21 2022-06-10 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for processing voice signal
EP3866166A1 (en) * 2020-02-13 2021-08-18 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for predicting mouth-shape feature, electronic device, storage medium and computer program product
US11562732B2 (en) 2020-02-13 2023-01-24 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for predicting mouth-shape feature, and electronic device
CN117173292A (en) * 2023-09-07 2023-12-05 河北日凌智能科技有限公司 Digital human interaction method and device based on vowel slices

Also Published As

Publication number Publication date
CN108763190A (en) 2018-11-06
CN108763190B (en) 2019-04-02

Similar Documents

Publication Publication Date Title
WO2019196306A1 (en) Device and method for speech-based mouth shape animation blending, and readable storage medium
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
CN106575500B (en) Method and apparatus for synthesizing speech based on facial structure
US9361722B2 (en) Synthetic audiovisual storyteller
WO2017067206A1 (en) Training method for multiple personalized acoustic models, and voice synthesis method and device
WO2019056500A1 (en) Electronic apparatus, speech synthesis method, and computer readable storage medium
KR102116309B1 (en) Synchronization animation output system of virtual characters and text
JP6206960B2 (en) Pronunciation operation visualization device and pronunciation learning device
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
JP2018146803A (en) Voice synthesizer and program
JP5913394B2 (en) Audio synchronization processing apparatus, audio synchronization processing program, audio synchronization processing method, and audio synchronization system
CN109949791A (en) Emotional speech synthesizing method, device and storage medium based on HMM
Karpov et al. Automatic technologies for processing spoken sign languages
CN112599113B (en) Dialect voice synthesis method, device, electronic equipment and readable storage medium
WO2024088321A1 (en) Virtual image face driving method and apparatus, electronic device and medium
CN114121006A (en) Image output method, device, equipment and storage medium of virtual character
CN112735371A (en) Method and device for generating speaker video based on text information
JP5807921B2 (en) Quantitative F0 pattern generation device and method, model learning device for F0 pattern generation, and computer program
TWI574254B (en) Speech synthesis method and apparatus for electronic system
Mukherjee et al. A Bengali speech synthesizer on Android OS
WO2023142413A1 (en) Audio data processing method and apparatus, electronic device, medium, and program product
JP2016142936A (en) Preparing method for data for speech synthesis, and preparing device data for speech synthesis
CN112634861A (en) Data processing method and device, electronic equipment and readable storage medium
JP6475572B2 (en) Utterance rhythm conversion device, method and program
JP6137708B2 (en) Quantitative F0 pattern generation device, model learning device for F0 pattern generation, and computer program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 18914626
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: pct application non-entry in european phase
Ref document number: 18914626
Country of ref document: EP
Kind code of ref document: A1