WO2016172871A1 - Speech synthesis method based on recurrent neural networks - Google Patents
Speech synthesis method based on recurrent neural networks
- Publication number
- WO2016172871A1 WO2016172871A1 PCT/CN2015/077785 CN2015077785W WO2016172871A1 WO 2016172871 A1 WO2016172871 A1 WO 2016172871A1 CN 2015077785 W CN2015077785 W CN 2015077785W WO 2016172871 A1 WO2016172871 A1 WO 2016172871A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- parameters
- acoustic
- neural network
- sequence
- Prior art date
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 73
- 238000001308 synthesis method Methods 0.000 title claims abstract description 10
- 230000000306 recurrent effect Effects 0.000 title abstract description 4
- 125000004122 cyclic group Chemical group 0.000 claims description 46
- 238000001228 spectrum Methods 0.000 claims description 5
- 230000003595 spectral effect Effects 0.000 claims description 4
- 238000000034 method Methods 0.000 abstract description 29
- 230000002194 synthesizing effect Effects 0.000 abstract description 2
- 230000015572 biosynthetic process Effects 0.000 description 21
- 238000003786 synthesis reaction Methods 0.000 description 21
- 230000000875 corresponding effect Effects 0.000 description 16
- 238000003066 decision tree Methods 0.000 description 13
- 238000009499 grossing Methods 0.000 description 13
- 230000004913 activation Effects 0.000 description 7
- 230000001427 coherent effect Effects 0.000 description 7
- 238000010586 diagram Methods 0.000 description 6
- 210000002569 neuron Anatomy 0.000 description 6
- 238000007476 Maximum Likelihood Methods 0.000 description 4
- 238000001914 filtration Methods 0.000 description 4
- 230000006870 function Effects 0.000 description 3
- 230000007704 transition Effects 0.000 description 3
- 230000001419 dependent effect Effects 0.000 description 2
- 230000027455 binding Effects 0.000 description 1
- 238000009739 binding Methods 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 230000003111 delayed effect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 210000005036 nerve Anatomy 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 230000000717 retained effect Effects 0.000 description 1
- 230000033764 rhythmic process Effects 0.000 description 1
- 238000012706 support-vector machine Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Definitions
- the invention relates to the field of speech synthesis, in particular to statistical parameter speech synthesis.
- Speech synthesis technology is a technology that allows a machine or program to generate intelligible speech from text information.
- Applications related to speech synthesis technology include text-to-speech (TTS) and singing voice synthesis (SVS).
- the mainstream speech synthesis technology is a statistical parameter speech synthesis technology based on Hidden Markov Model (HMM), which includes two stages of training and operation.
- Training phase: the acoustic parameters of the training speech data are aligned with the state sequence of the hidden Markov model, and the acoustic statistical parameters of each state are calculated by the training algorithm; the states of the model are clustered according to the context information of the text using a decision tree.
- Operation phase: the decision tree is used to convert the context information sequence of the input text into a state sequence of the clustered model, and the acoustic statistical parameter sequence is obtained from the acoustic statistical parameters corresponding to each state. Because of the state-discrete nature of the hidden Markov model, the sequence of acoustic statistical parameters obtained at this point is incoherent between states; to generate coherent speech acoustic parameters, the acoustic statistical parameter sequence must be smoothed.
- the traditional smoothing method is the Maximum Likelihood Parameter Generation Algorithm (MLPG).
- the method generates, from a sequence of acoustic statistical parameters including dynamic parameters (e.g., first- and second-order derivatives), a coherent acoustic parameter sequence that best matches the statistical features.
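As an illustration of the dynamic parameters mentioned here, a minimal numpy sketch (not from the patent; `dynamic_features` is a hypothetical helper) that appends first- and second-order derivative approximations to a static feature track:

```python
import numpy as np

def dynamic_features(static):
    """Append first- and second-order deltas to a (T, D) static feature matrix."""
    static = np.asarray(static, dtype=float)
    delta = np.gradient(static, axis=0)   # first-order derivative approximation
    delta2 = np.gradient(delta, axis=0)   # second-order derivative approximation
    return np.concatenate([static, delta, delta2], axis=1)

feats = dynamic_features([[0.0], [1.0], [4.0], [9.0]])
print(feats.shape)  # (4, 3): static feature plus its delta and delta-delta
```

Parameter generation algorithms such as MLPG then search for the static trajectory whose deltas best match these predicted dynamic statistics.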
- the speech waveform data is synthesized and output using a source-filter model or other speech analysis synthesis technique.
- the main problems of this smoothing method are as follows: 1) it tends to make the generated acoustic parameters excessively smooth, so that the generated speech sounds muffled; 2) it is not real-time, i.e., the acoustic parameters can only be generated in whole segments, which in real-time speech synthesis applications easily leads to playback stalls.
- Another way to avoid the excessive smoothing of acoustic parameters generated by traditional statistical parameter speech synthesis techniques is to use deep neural networks instead of hidden Markov models (see Zen, Heiga, et al. "Statistical parametric speech synthesis using deep neural networks." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.).
- the deep neural network will directly generate corresponding acoustic statistical parameter sequences according to the context information sequence of the text, so that the generated acoustic statistical parameter sequence is closer to the statistical features of the real speech, thereby obtaining a more natural synthesized speech.
- this kind of deep neural network requires a large amount of training data, usually about 5 hours of training speech to obtain a good synthesis effect; preparing sufficient training speech data requires considerable time and labor.
- the present invention uses a recurrent neural network in place of the traditional acoustic parameter smoothing method, generating coherent acoustic parameters from the acoustic statistical parameter sequence; this effectively solves the problem of excessively smooth generated acoustic parameters and makes the synthesized speech sound more natural.
- the present invention has real-time performance, and can synthesize speech sentence by sentence, word by word or frame by frame.
- the demand for training data in the present invention is relatively small, and the training time is relatively short.
- the technical field to which the present invention pertains is speech synthesis based on statistical parameters.
- One of the technical problems solved by the present invention is avoiding the excessive smoothing of synthesized speech, making the synthesized speech more natural and clear.
- the present invention solves the above technical problems, and the method adopted includes two steps of training and running of the model.
- the training steps of the model include:
- a recurrent neural network is trained to map acoustic statistical parameter sequences to the acoustic parameter sequences of the training speech.
- the operational steps of the present invention include:
- the speech is synthesized based on a sequence of smooth acoustic parameters.
- the method for synthesizing speech from a sequence of smooth acoustic parameters may be parametric speech synthesis based on a source-filter model, parametric speech synthesis based on a sinusoidal model or a harmonic-plus-noise model, a vocoder, concatenative speech synthesis based on unit selection, etc.
- the acoustic statistical parameter prediction model may be a machine learning model or method such as a hidden Markov model, a decision tree, or a neural network.
- the present invention is not specifically limited to the specifically employed model or method.
- FIG. 1 is a flow chart of a training phase of a model of the present invention
- Figure 2 is a flow chart of the operation phase of the present invention.
- FIG. 3 is a schematic diagram of the training method of the recurrent neural network according to the present invention.
- FIG. 4 is a schematic diagram of the operating method of the recurrent neural network according to the present invention.
- Figure 5 is a diagram showing an example of an acoustic statistical parameter sequence in an embodiment of the present invention.
- FIG. 6 is a diagram showing an example of an acoustic statistical parameter sequence after smoothing and normalization and the acoustic parameter sequence output by the recurrent neural network in an embodiment of the present invention.
- the invention includes two stages of training and running of the model, wherein the training phase of the model is shown in Figure 1; the operational phase is shown in Figure 2.
- in the training phase of the model, the acoustic statistical parameter prediction model parameters and the recurrent neural network model parameters are calculated from the training data.
- the training data includes voice data and text data that is time aligned with the voice data.
- the voice data may be in different forms.
- in a text-to-speech application, the speech data are the audio data of spoken sentences; in a singing voice synthesis application, the speech data are the audio data of the singing.
- the text data includes text corresponding to the voice and phonetic label information, and may also include information such as syllable annotation, accent annotation, and part-of-speech annotation.
- the function of the acoustic statistical parameter prediction model is to predict, from the input text information, the statistical information of the speech acoustic parameters at different times (including the prosody and timbre parameters of the speech at specific times), thereby generating a preliminary prediction of the acoustic parameters of the speech.
- the output of the model can be discrete, ie the output is an incoherent acoustic statistical parameter between a series of states.
- the output of the model can also be continuous, ie the output is a series of consecutive acoustic statistical parameters.
- the parameters output by the model should reflect the statistical information of the acoustic parameters of the speech over a short period of time (e.g., mean, variance, derivative), not just the acoustic parameters themselves.
- the present invention does not specifically limit the acoustic statistical parameter prediction model to be specifically used.
- the acoustic statistical parameter prediction model used in this embodiment is based on a hidden Markov model and a decision tree; other implementations may use models or methods with similar functions, such as feedforward neural networks or support vector machines.
- the present invention replaces the traditional speech acoustic parameter smoothing method with a recurrent neural network, solving the problem of excessively smooth generated acoustic parameters; the present invention can synthesize speech sentence by sentence, word by word, or frame by frame, solving the poor real-time performance of the traditional technique.
- the training phase of the model of the present invention is shown in FIG. 1 and specifically includes the following steps:
- the speech data and the corresponding text data, and the context information of the text are obtained from the training data, and the acoustic statistical parameter prediction model is trained to enable the model to map the context information of the text to the acoustic statistical parameters.
- the state transition probability distribution and the output probability distribution parameter of the hidden Markov model are initialized according to the aligned training text and the training speech data.
- the hidden Markov model adopts context-dependent states; optionally, the output probability distribution adopts a mixture probability distribution; optionally, the output probability distribution uses a diagonal covariance matrix.
- the state transition probability distribution parameter may be calculated by the order and the number of occurrences of different states in the training text; the output probability distribution parameter may be obtained by statistically calculating the acoustic parameters of the voice data corresponding to each state;
- the decision tree generating algorithm may use a minimum description length (MDL) criterion;
- context-dependent states of the hidden Markov model that are attributed to the same decision tree node are tied together.
- the Baum-Welch algorithm or the Viterbi training algorithm is used to recalculate the state transition probability distribution and output probability distribution parameters of the hidden Markov model.
- the output probability distribution parameter obtained at this time is the acoustic statistical parameter corresponding to each state;
- the average duration and variance of the states of each group of bindings are counted and stored.
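Computing the average duration and variance of each group of tied states reduces to simple statistics over the aligned state durations; a toy sketch with hypothetical state names and frame counts (not data from the patent):

```python
import numpy as np

# Hypothetical per-utterance state durations (in frames) from forced alignment.
durations = {"state_a": [12, 15, 11, 14], "state_b": [30, 27, 33]}

# Mean and variance per tied state, to be stored for the running phase.
duration_stats = {
    state: (float(np.mean(d)), float(np.var(d)))
    for state, d in durations.items()
}
print(duration_stats["state_a"])  # (13.0, 2.5)
```

During the operation phase these stored statistics determine how many frames each selected state occupies.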
- the acoustic statistical parameter prediction model based on the hidden Markov model and the decision tree used in this embodiment is only an example, and the model and the training method different from the embodiment may be adopted in the specific implementation.
- a corresponding acoustic statistical parameter sequence is generated according to the context information sequence of the text data in the training data.
- the decision tree is used to select the corresponding hidden Markov model states to form a state sequence; the duration of each state is determined by the Viterbi training algorithm in the first step of the training phase, to ensure that the state sequence and the training speech data are aligned in time.
- the trained hidden Markov model is then used to generate a corresponding sequence of acoustic statistical parameters based on the sequence of states.
- the recurrent neural network is trained so that it can map the acoustic statistical parameter sequence output by the acoustic statistical parameter prediction model to the acoustic parameter sequence of coherent, natural speech.
- the training data of the recurrent neural network are the acoustic statistical parameter sequence generated in the second step of the training phase and the speech acoustic parameters calculated from the speech data in the training data.
- since the output of the hidden Markov model used in the second step of the training phase of this embodiment is an acoustic statistical parameter sequence that is incoherent between states, it should be preliminarily smoothed before being input to the recurrent neural network, to ensure that the recurrent neural network can output a coherent sequence of speech parameters;
- the preliminary smoothing method can be interpolation, low-pass filtering, average filtering, or the maximum likelihood parameter generation algorithm;
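Average filtering, one of the preliminary smoothing options listed above, can be sketched as a moving-average convolution; this is a minimal illustration rather than the patent's implementation, and the window width is an arbitrary choice:

```python
import numpy as np

def moving_average(x, width=5):
    """Preliminary smoothing of a state-discrete parameter track by average filtering."""
    kernel = np.ones(width) / width
    # mode="same" keeps the sequence length; values near the edges are attenuated
    return np.convolve(x, kernel, mode="same")

# A step-wise track such as an HMM state mean sequence (incoherent between states)
steps = np.array([1.0] * 10 + [3.0] * 10)
smoothed = moving_average(steps)
print(smoothed[8:12])  # the abrupt state transition becomes a ramp
```

The smoothed track is what gets normalized and fed to the recurrent network.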
- the sequence of acoustic statistical parameters input to the recurrent neural network is normalized, for example so that the input parameters conform to a Gaussian distribution with a mean of 0 and a variance of 1;
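The zero-mean, unit-variance normalization described here can be sketched as follows (a minimal numpy illustration; the returned mean and standard deviation would presumably be kept so the network output can be de-normalized later):

```python
import numpy as np

def normalize(seq, eps=1e-8):
    """Scale each feature dimension to zero mean and unit variance."""
    seq = np.asarray(seq, dtype=float)
    mean = seq.mean(axis=0)
    std = seq.std(axis=0)
    return (seq - mean) / (std + eps), mean, std

data = np.array([[1.0], [2.0], [3.0], [4.0]])
normed, mean, std = normalize(data)
# After normalization the column has mean ~0 and standard deviation ~1
```

The small `eps` guards against division by zero for constant feature dimensions.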
- for different applications, recurrent neural networks of different structures can be used, including first-order or second-order recurrent neural networks, long short-term memory networks, multi-layer recurrent neural networks, and combinations of these.
- the output layer of the recurrent neural network uses a linear activation function.
- a single recurrent neural network may be used to output multiple acoustic parameter sequences from the input acoustic statistical parameter sequence; alternatively, multiple recurrent neural networks may be driven by the input acoustic statistical parameter sequence, each outputting one or more acoustic parameter sequences;
- First-order recurrent neural networks can be trained with the backpropagation through time algorithm (BPTT) or the real-time recurrent learning algorithm (RTRL); second-order recurrent neural networks, long short-term memory networks, and multi-layer recurrent neural networks can be trained with the generalized LSTM training algorithm (LSTM-g; see Monner, Derek, et al. "A generalized LSTM-like training algorithm for second-order recurrent neural networks." Neural Networks 25 (2012): 70-83.); the output layer of the recurrent neural network is trained under a least-squares error criterion;
- to prevent the output acoustic parameters from being excessively smoothed, the input of the recurrent neural network includes the acoustic statistical parameter sequence 5 to 40 frames ahead as well as the acoustic statistical parameter sequence of the current frame.
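Feeding the network both the current frame and a frame several steps ahead can be sketched by pairing each frame with a look-ahead frame; this is a hypothetical construction (repeating the last frame at the sequence end is an edge-handling assumption, not something the patent specifies):

```python
import numpy as np

def with_lookahead(stat_params, lookahead=5):
    """Concatenate each frame's statistical parameters with those of a future frame."""
    stat_params = np.asarray(stat_params, dtype=float)
    # Frames near the end have no frame `lookahead` steps ahead; reuse the final frame.
    future_idx = np.minimum(np.arange(len(stat_params)) + lookahead,
                            len(stat_params) - 1)
    return np.concatenate([stat_params, stat_params[future_idx]], axis=1)

frames = np.arange(10, dtype=float).reshape(-1, 1)
net_input = with_lookahead(frames, lookahead=3)
print(net_input[0])  # current frame 0 paired with frame 3
```

The look-ahead lets the network anticipate upcoming transitions instead of smoothing them away.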
- the operation phase of the present invention is shown in FIG. 2; it uses the trained model to generate speech acoustic parameters and synthesize speech from the input text.
- the specific method includes the following steps:
- the context information sequence of the text is obtained from the input text; according to the context information sequence of the input text, the acoustic statistical parameter prediction model is used to generate a corresponding acoustic statistical parameter sequence.
- the decision tree is used to select the corresponding hidden Markov model state to form a sequence of states; the duration of each state sequence is determined by the average duration and variance of each state obtained during training.
- the trained hidden Markov model is then used to generate a corresponding sequence of acoustic statistical parameters based on the sequence of states.
- a sequence of acoustic parameters is generated by the trained recurrent neural network from the sequence of acoustic statistical parameters.
- Specific methods include:
- the acoustic statistical parameter sequence output by the hidden Markov model should be preliminarily smoothed before being input to the recurrent neural network, to ensure that the recurrent neural network can output a coherent sequence of speech parameters; the preliminary smoothing method can be interpolation, low-pass filtering, average filtering, or the maximum likelihood parameter generation algorithm;
- Figure 5 is an example diagram of an unsmoothed acoustic statistical parameter sequence (solid line) and a smoothed acoustic statistical parameter sequence (dashed line); for clarity, the figure contains only the second Mel-frequency cepstral coefficient of the speech.
- the sequence of acoustic statistical parameters input to the recurrent neural network is normalized, for example so that the input parameters conform to a Gaussian distribution with a mean of 0 and a variance of 1;
- the activation values of the neurons in the input layer of the recurrent neural network are set to the acoustic statistical parameters 5 to 40 frames ahead together with the acoustic statistical parameters of the current frame.
- the above steps are performed cyclically for the acoustic statistical parameters corresponding to each frame to generate a coherent sequence of acoustic parameters.
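The frame-by-frame generation loop described above can be sketched as a first-order recurrent network with a linear output layer, as the description requires; the weights below are random placeholders standing in for trained parameters, and the dimensions are arbitrary toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; real weights would come from BPTT or RTRL training.
n_in, n_hidden, n_out = 4, 8, 2
W_in = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_rec = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_out = rng.normal(scale=0.1, size=(n_out, n_hidden))

def generate(stat_seq):
    """Run a first-order recurrent network one frame at a time.

    The hidden state carries context across frames, which is what lets the
    network emit a coherent acoustic parameter track; the output layer is
    linear, per the description above.
    """
    h = np.zeros(n_hidden)
    outputs = []
    for x in stat_seq:                       # one acoustic-statistics frame per step
        h = np.tanh(W_in @ x + W_rec @ h)    # recurrent hidden-state update
        outputs.append(W_out @ h)            # linear output activation
    return np.array(outputs)

acoustic = generate(rng.normal(size=(20, n_in)))
print(acoustic.shape)  # (20, 2): one output vector per input frame
```

Because the loop consumes one frame at a time, the same code structure supports the sentence-by-sentence, word-by-word, or frame-by-frame operation claimed for the method.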
- Figure 6 shows the smoothed and normalized acoustic statistical parameter sequence (dashed line) and the acoustic parameter sequence output by the recurrent neural network (solid line). The acoustic statistical parameter sequence in the figure contains only the mean sequence of the second MFCC parameter of the speech, and the output is normalized. It can be seen that the output of the recurrent neural network has more detail than the smoothed mean sequence of the input.
- the speech waveform is synthesized as an output according to the acoustic parameter sequence generated by the cyclic neural network.
- the specific synthesis method depends on the type of acoustic parameters used, and the present invention does not specifically limit the synthesis method.
- the recurrent neural network used in the present invention, and the acoustic statistical parameter prediction model based on the decision tree and the hidden Markov model used in this embodiment, are applicable to various acoustic parameters and acoustic statistical parameters, such as Mel-frequency cepstral coefficient (MFCC) features, line spectrum pair (LSP) features, harmonic energy features, formant features, spectral envelope features, fundamental frequency features, and logarithmic fundamental frequency features of speech, as well as the mean, variance, and derivatives of the above features.
- the present invention does not specifically define the acoustic parameters and acoustic statistical parameter types used.
- the traditional statistical parameter speech synthesis method based on maximum likelihood parameter generation minimizes the statistical error of the generated parameters, but does not guarantee that the auditory-feature loss of the generated acoustic parameters is minimized at the same time.
- the invention uses a recurrent neural network that, during training, minimizes the error between the acoustic parameters of the synthesized speech and those of the real speech, thereby effectively reducing the loss of auditory features in the synthesized speech and alleviating the over-smoothing phenomenon.
- the recurrent neural network used in the present invention incorporates the speaker characteristics of the training speech during training, so that more detail is retained in the acoustic parameters generated during the running phase, making the synthesized speech more natural.
- speech synthesis methods based on deep neural networks use the context information of the text directly as the network input, whereas the input of the recurrent neural network in the present invention is the acoustic statistical parameters. Since the input and output data of the recurrent neural network in the present invention are highly correlated, the speech synthesis method of the present invention requires only 1 to 2 hours of training speech data, which is less than what deep-neural-network-based methods need and reduces the workload of preparing training speech data.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
The invention concerns a speech synthesis method based on recurrent neural networks, specifically comprising the following steps: acquiring context information of a text to be synthesized; generating a sequence of acoustic statistical parameters according to the context information of the text; using a recurrent neural network to generate, from the sequence of acoustic statistical parameters produced from the context information, a sequence of acoustic parameters of the speech to be synthesized; and synthesizing the speech according to the sequence of acoustic parameters of the speech to be synthesized. Compared with conventional statistical parametric speech synthesis methods, the method makes the synthesized speech more natural and has good real-time performance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2015/077785 WO2016172871A1 (fr) | 2015-04-29 | 2015-04-29 | Procédé de synthèse de parole basé sur des réseaux neuronaux récurrents |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2015/077785 WO2016172871A1 (fr) | 2015-04-29 | 2015-04-29 | Procédé de synthèse de parole basé sur des réseaux neuronaux récurrents |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016172871A1 true WO2016172871A1 (fr) | 2016-11-03 |
Family
ID=57198903
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2015/077785 WO2016172871A1 (fr) | 2015-04-29 | 2015-04-29 | Procédé de synthèse de parole basé sur des réseaux neuronaux récurrents |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2016172871A1 (fr) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107610707A (zh) * | 2016-12-15 | 2018-01-19 | 平安科技(深圳)有限公司 | 一种声纹识别方法及装置 |
CN110737268A (zh) * | 2019-10-14 | 2020-01-31 | 哈尔滨工程大学 | 一种基于Viterbi算法的确定指令的方法 |
CN110879833A (zh) * | 2019-11-20 | 2020-03-13 | 中国科学技术大学 | 一种基于轻量级循环单元lru的文本预测方法 |
CN111627418A (zh) * | 2020-05-27 | 2020-09-04 | 携程计算机技术(上海)有限公司 | 语音合成模型的训练方法、合成方法、系统、设备和介质 |
CN111862931A (zh) * | 2020-05-08 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | 一种语音生成方法及装置 |
CN115276697A (zh) * | 2022-07-22 | 2022-11-01 | 交通运输部规划研究院 | 一种集成智能语音的海岸电台通信系统 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1275746A (zh) * | 1994-04-28 | 2000-12-06 | 摩托罗拉公司 | 使用神经网络变换文本为声频信号的设备 |
EP1291848A2 (fr) * | 2001-08-31 | 2003-03-12 | Nokia Corporation | Prononciations en plusieurs langues pour la reconnaissance de parole |
CN1929655A (zh) * | 2006-09-28 | 2007-03-14 | 中山大学 | 一种可实现文本与语音转换的手机 |
CN101510424A (zh) * | 2009-03-12 | 2009-08-19 | 孟智平 | 基于语音基元的语音编码与合成方法及系统 |
CN102117614A (zh) * | 2010-01-05 | 2011-07-06 | 索尼爱立信移动通讯有限公司 | 个性化文本语音合成和个性化语音特征提取 |
CN104538024A (zh) * | 2014-12-01 | 2015-04-22 | 百度在线网络技术(北京)有限公司 | 语音合成方法、装置及设备 |
- 2015-04-29 WO PCT/CN2015/077785 patent/WO2016172871A1/fr active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1275746A (zh) * | 1994-04-28 | 2000-12-06 | 摩托罗拉公司 | 使用神经网络变换文本为声频信号的设备 |
EP1291848A2 (fr) * | 2001-08-31 | 2003-03-12 | Nokia Corporation | Prononciations en plusieurs langues pour la reconnaissance de parole |
CN1929655A (zh) * | 2006-09-28 | 2007-03-14 | 中山大学 | 一种可实现文本与语音转换的手机 |
CN101510424A (zh) * | 2009-03-12 | 2009-08-19 | 孟智平 | 基于语音基元的语音编码与合成方法及系统 |
CN102117614A (zh) * | 2010-01-05 | 2011-07-06 | 索尼爱立信移动通讯有限公司 | 个性化文本语音合成和个性化语音特征提取 |
CN104538024A (zh) * | 2014-12-01 | 2015-04-22 | 百度在线网络技术(北京)有限公司 | 语音合成方法、装置及设备 |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107610707A (zh) * | 2016-12-15 | 2018-01-19 | 平安科技(深圳)有限公司 | 一种声纹识别方法及装置 |
CN107610707B (zh) * | 2016-12-15 | 2018-08-31 | 平安科技(深圳)有限公司 | 一种声纹识别方法及装置 |
CN110737268A (zh) * | 2019-10-14 | 2020-01-31 | 哈尔滨工程大学 | 一种基于Viterbi算法的确定指令的方法 |
CN110737268B (zh) * | 2019-10-14 | 2022-07-15 | 哈尔滨工程大学 | 一种基于Viterbi算法的确定指令的方法 |
CN110879833A (zh) * | 2019-11-20 | 2020-03-13 | 中国科学技术大学 | 一种基于轻量级循环单元lru的文本预测方法 |
CN110879833B (zh) * | 2019-11-20 | 2022-09-06 | 中国科学技术大学 | 一种基于轻量级循环单元lru的文本预测方法 |
CN111862931A (zh) * | 2020-05-08 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | 一种语音生成方法及装置 |
CN111627418A (zh) * | 2020-05-27 | 2020-09-04 | 携程计算机技术(上海)有限公司 | 语音合成模型的训练方法、合成方法、系统、设备和介质 |
CN111627418B (zh) * | 2020-05-27 | 2023-01-31 | 携程计算机技术(上海)有限公司 | 语音合成模型的训练方法、合成方法、系统、设备和介质 |
CN115276697A (zh) * | 2022-07-22 | 2022-11-01 | 交通运输部规划研究院 | 一种集成智能语音的海岸电台通信系统 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11514888B2 (en) | Two-level speech prosody transfer | |
US11664011B2 (en) | Clockwork hierarchal variational encoder | |
WO2016172871A1 (fr) | Procédé de synthèse de parole basé sur des réseaux neuronaux récurrents | |
WO2021225830A1 (fr) | Prosodie de synthèse vocale en utilisant un modèle de bert | |
Ma et al. | Incremental text-to-speech synthesis with prefix-to-prefix framework | |
JP6305955B2 (ja) | 音響特徴量変換装置、音響モデル適応装置、音響特徴量変換方法、およびプログラム | |
US11830474B2 (en) | Predicting parametric vocoder parameters from prosodic features | |
Pouget et al. | HMM training strategy for incremental speech synthesis | |
WO2015025788A1 (fr) | Dispositif et procédé de génération quantitative motif f0, et dispositif et procédé d'apprentissage de modèles pour la génération d'un motif f0 | |
Kons et al. | Neural TTS voice conversion | |
Chen et al. | Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features | |
US10446133B2 (en) | Multi-stream spectral representation for statistical parametric speech synthesis | |
Van Nguyen et al. | Development of Vietnamese speech synthesis system using deep neural networks | |
Mullah | A comparative study of different text-to-speech synthesis techniques | |
Ronanki et al. | The CSTR entry to the Blizzard Challenge 2017 | |
Suzić et al. | HiFi-GAN based Text-to-Speech Synthesis in Serbian | |
Coto-Jiménez et al. | Speech Synthesis Based on Hidden Markov Models and Deep Learning. | |
US20230018384A1 (en) | Two-Level Text-To-Speech Systems Using Synthetic Training Data | |
JP2014095851A (ja) | 音響モデル生成方法と音声合成方法とそれらの装置とプログラム | |
Louw | Neural speech synthesis for resource-scarce languages | |
Phan et al. | Extracting MFCC, F0 feature in Vietnamese HMM-based speech synthesis | |
Shah et al. | Influence of various asymmetrical contextual factors for TTS in a low resource language | |
Frikha et al. | Hidden Markov models (HMMs) isolated word recognizer with the optimization of acoustical analysis and modeling techniques | |
RU160585U1 (ru) | Система распознавания речи с моделью вариативности произношения | |
Zhao et al. | The UTokyo system for Blizzard Challenge 2016 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15890251 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15890251 Country of ref document: EP Kind code of ref document: A1 |