WO2016172871A1 - Speech synthesis method based on recurrent neural networks - Google Patents

Speech synthesis method based on recurrent neural networks

Info

Publication number
WO2016172871A1
WO2016172871A1 (PCT/CN2015/077785)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
parameters
acoustic
neural network
sequence
Prior art date
Application number
PCT/CN2015/077785
Other languages
English (en)
Chinese (zh)
Inventor
华侃如
Original Assignee
华侃如
Priority date
Filing date
Publication date
Application filed by 华侃如 filed Critical 华侃如
Priority to PCT/CN2015/077785 priority Critical patent/WO2016172871A1/fr
Publication of WO2016172871A1 publication Critical patent/WO2016172871A1/fr

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the invention relates to the field of speech synthesis, in particular to statistical parameter speech synthesis.
  • Speech synthesis is a technology that enables a machine or program to generate intelligible speech from textual information.
  • Applications related to speech synthesis technology include text-to-speech (TTS) and singing voice synthesis (SVS).
  • Currently, the mainstream speech synthesis technology is statistical parametric speech synthesis based on the hidden Markov model (HMM), which comprises two stages: training and running.
  • Training stage: the acoustic parameters of the training speech data are aligned with the state sequence of the hidden Markov model, and the acoustic statistical parameters of each state are computed by the training algorithm; the states of the model are clustered by a decision tree according to the context information of the text.
  • Running stage: the decision tree converts the context information sequence of the input text into a state sequence of the clustered model, and an acoustic statistical parameter sequence is assembled from the acoustic statistical parameters of each state. Because the states of the hidden Markov model are discrete, this acoustic statistical parameter sequence is discontinuous between states; to generate coherent speech acoustic parameters, it must be smoothed.
  • The traditional smoothing method is the maximum likelihood parameter generation (MLPG) algorithm, which generates the coherent acoustic parameter sequence whose statistics best match a sequence of acoustic statistical parameters that includes dynamic features (e.g., first- and second-order derivatives).
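  • For illustration only, the following is a minimal numpy sketch of the MLPG idea described above, for a single one-dimensional parameter track with static and first-order delta statistics; the variable names and the delta window are the editor's assumptions, not details from the patent.

```python
# Minimal MLPG sketch: solve for the static trajectory c that best matches
# per-frame [static, delta] means under diagonal variances.
import numpy as np

def mlpg(means, variances):
    """means, variances: (T, 2) arrays of [static, delta] statistics."""
    T = means.shape[0]
    W = np.zeros((2 * T, T))                    # maps c to stacked [static; delta]
    for t in range(T):
        W[2 * t, t] = 1.0                       # static row: c_t
        W[2 * t + 1, max(t - 1, 0)] -= 0.5      # delta row: 0.5 * (c_{t+1} - c_{t-1})
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)          # diagonal inverse covariance
    A = W.T @ (prec[:, None] * W)               # normal equations: (W'PW) c = W'P mu
    return np.linalg.solve(A, W.T @ (prec * mu))
```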
  • Finally, the speech waveform is synthesized and output using a source-filter model or another speech analysis-synthesis technique.
  • The main problems of this smoothing method are: 1) it tends to over-smooth the generated acoustic parameters, so the resulting speech sounds muffled; 2) it is not real-time, i.e., acoustic parameters can only be generated a whole utterance at a time, which easily causes playback stalls in real-time speech synthesis applications.
  • Another way to avoid the over-smoothing of acoustic parameters generated by traditional statistical parametric speech synthesis is to use deep neural networks instead of hidden Markov models (see Zen, Heiga, et al. "Statistical parametric speech synthesis using deep neural networks." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.).
  • The deep neural network directly generates the corresponding acoustic statistical parameter sequence from the context information sequence of the text, so that the generated sequence is closer to the statistical features of real speech, yielding more natural synthesized speech.
  • However, this kind of deep neural network requires a large amount of training data, usually about 5 hours of training speech to obtain a good synthesis result; preparing that much training speech data costs substantial time and labor.
  • The present invention uses a recurrent neural network in place of the traditional acoustic parameter smoothing method and generates coherent acoustic parameters from the acoustic statistical parameter sequence, effectively alleviating over-smoothing of the generated acoustic parameters and making the synthesized speech sound more natural.
  • The present invention also operates in real time and can synthesize speech sentence by sentence, word by word, or frame by frame.
  • In addition, the present invention requires relatively little training data and relatively little training time.
  • the technical field to which the present invention pertains is speech synthesis based on statistical parameters.
  • One of the technical problems solved by the present invention is the over-smoothing of synthesized speech; avoiding it makes the synthesized speech more natural and clear.
  • To solve the above technical problem, the method adopted by the present invention comprises two stages: training and running of the model.
  • the training steps of the model include:
  • a recurrent neural network is trained to map acoustic statistical parameter sequences to the acoustic parameter sequences of the training speech.
  • the operational steps of the present invention include:
  • the speech is synthesized based on a sequence of smooth acoustic parameters.
  • The method for synthesizing speech from the smooth acoustic parameter sequence may be parametric synthesis based on a source-filter model, parametric synthesis based on a sinusoidal or harmonic-plus-noise model, a vocoder, unit-selection-based concatenative synthesis, and so on.
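  • As a concrete illustration of one of these options, here is a minimal source-filter sketch that renders frame-wise F0 and LPC-envelope parameters to a waveform; the sample rate, hop size, and per-frame (stateless) filtering are simplifying assumptions by the editor, not the patent's prescribed back-end.

```python
# Minimal source-filter synthesis sketch: pulse-train / noise excitation
# shaped by a per-frame all-pole (LPC) envelope filter.
import numpy as np
from scipy.signal import lfilter

def synthesize(f0, lpc, fs=16000, hop=80):
    """f0: (T,) Hz per frame (0 = unvoiced); lpc: (T, P+1) all-pole coeffs, a[0] = 1."""
    out, phase = [], 0.0
    for t in range(len(f0)):
        if f0[t] > 0:                            # voiced: one pulse per pitch cycle
            inc = f0[t] / fs
            ph = phase + inc * np.arange(hop)
            excitation = ((ph % 1.0) < inc).astype(float)
            phase = (phase + inc * hop) % 1.0
        else:                                    # unvoiced: white-noise excitation
            excitation = 0.1 * np.random.randn(hop)
        # filter state is reset each frame; a real vocoder would carry it over
        out.append(lfilter([1.0], lpc[t], excitation))
    return np.concatenate(out)
```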
  • The acoustic statistical parameter prediction model may be any suitable machine learning model or method, such as a hidden Markov model, a decision tree, or a neural network.
  • the present invention is not specifically limited to the specifically employed model or method.
  • FIG. 1 is a flow chart of a training phase of a model of the present invention
  • Figure 2 is a flow chart of the operation phase of the present invention.
  • FIG. 3 is a schematic diagram of the training method of the recurrent neural network according to the present invention.
  • FIG. 4 is a schematic diagram of the run-time operation of the recurrent neural network according to the present invention.
  • Figure 5 is a diagram showing an example of an acoustic statistical parameter sequence in an embodiment of the present invention.
  • FIG. 6 is a diagram showing an example of an acoustic statistical parameter sequence after smoothing and normalization, together with the acoustic parameter sequence output by the recurrent neural network, in an embodiment of the present invention.
  • the invention includes two stages of training and running of the model, wherein the training phase of the model is shown in Figure 1; the operational phase is shown in Figure 2.
  • In the training phase, the parameters of the acoustic statistical parameter prediction model and of the recurrent neural network model are computed from the training data.
  • the training data includes voice data and text data that is time aligned with the voice data.
  • the voice data may be in different forms.
  • In a text-to-speech application, the voice data is the audio of spoken sentences; in a singing voice synthesis application, the voice data is the audio of singing.
  • the text data includes text corresponding to the voice and phonetic label information, and may also include information such as syllable annotation, accent annotation, and part-of-speech annotation.
  • The function of the acoustic statistical parameter prediction model is to predict, from the input text information, statistical information about the speech acoustic parameters at different times (including prosodic and timbre parameters at specific times), thereby producing a preliminary prediction of the speech acoustic parameters.
  • The output of the model can be discrete, i.e., a series of state-wise acoustic statistical parameters that are discontinuous between states.
  • The output of the model can also be continuous, i.e., a series of continuous acoustic statistical parameters.
  • In either case, the parameters output by the model should reflect statistical information about the speech acoustic parameters over a short time span (e.g., mean, variance, derivatives), not just the acoustic parameters themselves.
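  • As a small illustration of what such statistics might look like (the editor's reading, not code from the patent), the per-segment mean, variance, and average local derivative of the acoustic parameters can be computed as follows.

```python
# Short-time statistics of acoustic parameters over one state/segment.
import numpy as np

def segment_statistics(frames):
    """frames: (N, D) acoustic parameters of one segment, N >= 2."""
    mean = frames.mean(axis=0)
    var = frames.var(axis=0)
    delta = np.gradient(frames, axis=0).mean(axis=0)   # average local slope
    return mean, var, delta
```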
  • the present invention does not specifically limit the acoustic statistical parameter prediction model to be specifically used.
  • The acoustic statistical parameter prediction model used in this embodiment is based on a hidden Markov model and a decision tree; other implementations may use other models or methods with a similar function, such as feedforward neural networks or support vector machines.
  • The present invention replaces the traditional speech acoustic parameter smoothing method with a recurrent neural network, solving the over-smoothing of generated acoustic parameters; because it can synthesize speech sentence by sentence, word by word, or frame by frame, it also addresses the poor real-time performance of the traditional technique.
  • the training phase of the model of the present invention is shown in FIG. 1 and specifically includes the following steps:
  • First, the speech data, the corresponding text data, and the context information of the text are obtained from the training data, and the acoustic statistical parameter prediction model is trained so that it can map the context information of the text to acoustic statistical parameters.
  • The state transition probability distribution and the output probability distribution parameters of the hidden Markov model are initialized according to the aligned training text and training speech data.
  • The hidden Markov model adopts context-dependent states; optionally, the output probability distribution is a mixture distribution; optionally, the output probability distribution uses a diagonal covariance matrix.
  • The state transition probability distribution parameters may be calculated from the order and frequency of occurrence of the different states in the training text; the output probability distribution parameters may be obtained by computing statistics of the acoustic parameters of the voice data corresponding to each state.
  • the decision tree generating algorithm may use a minimum description length (MDL) criterion;
  • Context-dependent states of the hidden Markov model that fall on the same decision tree node are tied together.
  • The Baum-Welch algorithm or Viterbi training is then used to re-estimate the state transition probability distributions and the output probability distribution parameters of the hidden Markov model.
  • The output probability distribution parameters obtained at this point are the acoustic statistical parameters corresponding to each state.
  • The average duration and the duration variance of each group of tied states are also computed and stored.
  • the acoustic statistical parameter prediction model based on the hidden Markov model and the decision tree used in this embodiment is only an example, and the model and the training method different from the embodiment may be adopted in the specific implementation.
  • Second, a corresponding acoustic statistical parameter sequence is generated from the context information sequence of the text data in the training data.
  • The decision tree is used to select the corresponding hidden Markov model states to form a state sequence; the duration of each state is the one determined by the Viterbi training in the first step of the training phase, which ensures that the state sequence and the training speech data are aligned in time.
  • the trained hidden Markov model is then used to generate a corresponding sequence of acoustic statistical parameters based on the sequence of states.
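  • A minimal sketch of this expansion step (names and array layout are the editor's illustrative assumptions): each state contributes its mean and variance for every frame of its duration, which is what makes the raw sequence piecewise constant and discontinuous at state boundaries.

```python
# Expand a state sequence into a frame-level acoustic statistical parameter sequence.
import numpy as np

def states_to_stat_sequence(state_ids, durations, state_means, state_vars):
    """durations[i]: number of frames spent in state state_ids[i]."""
    means = np.concatenate([np.tile(state_means[s], (d, 1))
                            for s, d in zip(state_ids, durations)])
    varis = np.concatenate([np.tile(state_vars[s], (d, 1))
                            for s, d in zip(state_ids, durations)])
    return means, varis   # piecewise constant, hence discontinuous between states
```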
  • Third, the recurrent neural network is trained so that it can map the acoustic statistical parameter sequence output by the acoustic statistical parameter prediction model to the acoustic parameter sequence of coherent, natural speech.
  • The training data for the recurrent neural network are the acoustic statistical parameter sequences generated in the second step of the training phase and the speech acoustic parameters computed from the speech data in the training data.
  • Because the output of the hidden Markov model used in the second step of this embodiment is an acoustic statistical parameter sequence that is discontinuous between states, the sequence should first be preliminarily smoothed before being input to the recurrent neural network, to ensure that the network can output a coherent sequence of speech parameters; the preliminary smoothing method can be interpolation, low-pass filtering, moving-average filtering, or the maximum likelihood parameter generation algorithm.
  • Optionally, the acoustic statistical parameter sequence input to the recurrent neural network is normalized, for example so that the input parameters follow a Gaussian distribution with a mean of 0 and a variance of 1.
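  • A minimal sketch of this preliminary smoothing plus normalization, using a moving-average filter; the window length is an illustrative assumption, and in practice the normalization statistics would be estimated on the training set and reused at run time.

```python
# Preliminary smoothing (moving average) and zero-mean/unit-variance normalization.
import numpy as np

def preprocess(stat_seq, win=9):
    """stat_seq: (T, D) state-wise statistics; returns a smoothed, normalized copy."""
    kernel = np.ones(win) / win
    smoothed = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, stat_seq)
    mu, sigma = smoothed.mean(axis=0), smoothed.std(axis=0) + 1e-8
    return (smoothed - mu) / sigma
```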
  • For different applications, recurrent neural networks of different structures can be used, including first-order or second-order recurrent neural networks, long short-term memory (LSTM) networks, multi-layer recurrent networks, and combinations of several kinds of networks.
  • Optionally, the output layer of the recurrent neural network uses a linear activation function.
  • A single recurrent neural network may output several acoustic parameter sequences from the input acoustic statistical parameter sequence; alternatively, several recurrent neural networks may take the same input, each outputting one or more acoustic parameter sequences.
  • First-order recurrent neural networks can be trained with backpropagation through time (BPTT) or real-time recurrent learning (RTRL); second-order recurrent neural networks, LSTM networks, and multi-layer recurrent networks can be trained with the generalized LSTM training algorithm (LSTM-g; see Monner, Derek, and James A. Reggia. "A generalized LSTM-like training algorithm for second-order recurrent neural networks." Neural Networks 25 (2012): 70-83.); the output layer of the recurrent neural network is trained with a least-squares error criterion.
  • Optionally, to prevent the output acoustic parameters from being over-smoothed, the input of the recurrent neural network includes the acoustic statistical parameters of 5 to 40 frames ahead of the current frame in addition to those of the current frame.
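  • A hedged PyTorch sketch of this training step follows: an LSTM with a linear output layer maps the smoothed, normalized statistical sequence to the natural acoustic parameters under a least-squares criterion, with each input frame augmented by a fixed lookahead. The choice of PyTorch, the 10-frame lookahead, and the layer sizes are the editor's assumptions, not requirements of the patent.

```python
# RNN training sketch: statistical parameter sequence -> acoustic parameter sequence.
import torch
import torch.nn as nn

LOOKAHEAD, D = 10, 40                 # lookahead frames (5-40 per the text), feature dim

class ParamRNN(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(input_size=2 * D, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, D)            # linear activation on the output layer

    def forward(self, stats):                      # stats: (B, T, D)
        # append future-frame statistics (end-of-sequence wrap ignored for brevity)
        ahead = torch.roll(stats, shifts=-LOOKAHEAD, dims=1)
        return self.out(self.rnn(torch.cat([stats, ahead], dim=-1))[0])

model = ParamRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()                           # least-squares error criterion

def train_step(stats, target):                     # target: natural acoustic params (B, T, D)
    optimizer.zero_grad()
    loss = criterion(model(stats), target)         # backward() performs BPTT
    loss.backward()
    optimizer.step()
    return loss.item()
```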
  • The running phase of the present invention is shown in FIG. 2; it uses the trained models to generate speech acoustic parameters from the input text and to synthesize speech.
  • the specific method includes the following steps:
  • First, the context information sequence is obtained from the input text; according to this sequence, the acoustic statistical parameter prediction model is used to generate a corresponding acoustic statistical parameter sequence.
  • The decision tree is used to select the corresponding hidden Markov model states to form a state sequence; the duration of each state is determined from the average duration and the duration variance of that state obtained during training, for example as sketched below.
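  • One standard heuristic from HMM-based synthesis for this duration step (offered as the editor's assumption about what "determined by the average duration and variance" can mean in practice) sets each duration to d_i = mu_i + rho * sigma_i^2, where rho trades off the overall speaking rate.

```python
# State durations from stored duration statistics; rho = 0 gives the mean duration.
import numpy as np

def state_durations(dur_mean, dur_var, rho=0.0):
    d = dur_mean + rho * dur_var          # rho > 0 slows speech, rho < 0 speeds it up
    return np.maximum(1, np.rint(d).astype(int))   # at least one frame per state
```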
  • the trained hidden Markov model is then used to generate a corresponding sequence of acoustic statistical parameters based on the sequence of states.
  • Second, a sequence of acoustic parameters is generated from the acoustic statistical parameter sequence using the trained recurrent neural network.
  • Specific methods include:
  • As in training, the acoustic statistical parameter sequence output by the hidden Markov model should first be preliminarily smoothed before being input to the recurrent neural network, ensuring that the network can output a coherent sequence of speech parameters; the preliminary smoothing method can be interpolation, low-pass filtering, moving-average filtering, or the maximum likelihood parameter generation algorithm.
  • Figure 5 shows an example of an unsmoothed acoustic statistical parameter sequence (solid line) and the corresponding smoothed sequence (dashed line); for clarity, the figure contains only the mean sequence of the second Mel-frequency cepstral coefficient of the speech.
  • Optionally, the acoustic statistical parameter sequence input to the recurrent neural network is normalized, for example so that the input parameters follow a Gaussian distribution with a mean of 0 and a variance of 1.
  • The activation values of the neurons in the input layer of the recurrent neural network are set to the acoustic statistical parameters of the current frame; optionally, mirroring the training configuration, they also include the acoustic statistical parameters of 5 to 40 frames ahead.
  • The above steps are performed cyclically on the acoustic statistical parameters of each frame to generate a coherent acoustic parameter sequence.
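  • A minimal streaming sketch of this frame-by-frame loop, reusing the hypothetical ParamRNN (and its LOOKAHEAD constant) from the training sketch above: the LSTM state is carried across frames, so an acoustic frame can be emitted as soon as its lookahead statistics are available (tail flushing at the end of the utterance is omitted for brevity).

```python
# Frame-by-frame (streaming) generation with persistent recurrent state.
import torch

@torch.no_grad()
def stream_frames(model, stat_frames):
    """stat_frames: iterable of (D,) tensors, already smoothed and normalized."""
    state, buf = None, []
    for frame in stat_frames:
        buf.append(frame)
        if len(buf) <= LOOKAHEAD:              # wait until the lookahead is filled
            continue
        x = torch.cat([buf[0], buf[-1]]).view(1, 1, -1)   # current + lookahead stats
        h, state = model.rnn(x, state)         # carry the LSTM state across frames
        yield model.out(h)[0, 0]               # one coherent acoustic parameter frame
        buf.pop(0)
```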
  • Figure 6 shows the smoothed and normalized acoustic statistical parameter sequence (dashed line) together with the acoustic parameter sequence output by the recurrent neural network (solid line); for clarity, the figure contains only the mean sequence of the second MFCC parameter of the speech, and the output is normalized. As can be seen, the output of the recurrent neural network contains more detail than the smoothed mean sequence at the input.
  • Third, the speech waveform is synthesized and output from the acoustic parameter sequence generated by the recurrent neural network.
  • the specific synthesis method depends on the type of acoustic parameters used, and the present invention does not specifically limit the synthesis method.
  • The recurrent neural network used in the present invention, and the acoustic statistical parameter prediction model based on a decision tree and a hidden Markov model used in this embodiment, are applicable to various acoustic parameters and acoustic statistical parameters, such as Mel-frequency cepstral coefficient (MFCC) features, line spectral pair (LSP) features, harmonic energy features, formant features, spectral envelope features, fundamental frequency features, logarithmic fundamental frequency features, and the mean, variance, and derivatives of these features.
  • The present invention does not specifically limit the acoustic parameter and acoustic statistical parameter types used.
  • The traditional statistical parametric speech synthesis method based on maximum likelihood parameter generation minimizes the statistical error of the generated parameters, but it does not guarantee that the loss of auditory features in the generated acoustic parameters is minimized at the same time.
  • The present invention uses a recurrent neural network that, during training, minimizes the error between the acoustic parameters of the synthesized speech and those of real speech, thereby effectively reducing the loss of auditory features in the synthesized speech and reducing the over-smoothing phenomenon.
  • The recurrent neural network used in the present invention learns the speaker characteristics of the training speech during training, so more detail can be retained in the acoustic parameters generated during the running phase, making the synthesized speech more natural.
  • Speech synthesis methods based on deep neural networks use the context information of the text directly as the network input, whereas the input of the recurrent neural network in the present invention is acoustic statistical parameters. Because the input and output data of this recurrent neural network are highly correlated, the speech synthesis method of the present invention requires only 1 to 2 hours of training speech data, which, compared with speech synthesis methods based on deep neural networks, reduces the amount of training data needed and the workload of preparing it.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a speech synthesis method based on recurrent neural networks, specifically comprising the steps of: acquiring context information of a text to be synthesized; generating a sequence of acoustic statistical parameters from the context information of the text; using a recurrent neural network to generate, from the acoustic statistical parameter sequence produced from the context information, a sequence of acoustic parameters of the speech to be synthesized; and synthesizing the speech from the acoustic parameter sequence of the speech to be synthesized. Compared with conventional statistical parametric speech synthesis methods, the method makes the synthesized speech more natural and has good real-time performance.
PCT/CN2015/077785 2015-04-29 2015-04-29 Speech synthesis method based on recurrent neural networks WO2016172871A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/077785 WO2016172871A1 (fr) 2015-04-29 2015-04-29 Speech synthesis method based on recurrent neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/077785 WO2016172871A1 (fr) 2015-04-29 2015-04-29 Speech synthesis method based on recurrent neural networks

Publications (1)

Publication Number Publication Date
WO2016172871A1 true WO2016172871A1 (fr) 2016-11-03

Family

ID=57198903

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/077785 WO2016172871A1 (fr) 2015-04-29 2015-04-29 Speech synthesis method based on recurrent neural networks

Country Status (1)

Country Link
WO (1) WO2016172871A1 (fr)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1275746A (zh) * 1994-04-28 2000-12-06 摩托罗拉公司 Device for converting text into audio signals using a neural network
EP1291848A2 (fr) * 2001-08-31 2003-03-12 Nokia Corporation Multilingual pronunciations for speech recognition
CN1929655A (zh) * 2006-09-28 2007-03-14 中山大学 Mobile phone capable of text-to-speech conversion
CN101510424A (zh) * 2009-03-12 2009-08-19 孟智平 Speech coding and synthesis method and system based on speech units
CN102117614A (zh) * 2010-01-05 2011-07-06 索尼爱立信移动通讯有限公司 Personalized text-to-speech synthesis and personalized speech feature extraction
CN104538024A (zh) * 2014-12-01 2015-04-22 百度在线网络技术(北京)有限公司 Speech synthesis method, apparatus and device

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610707A (zh) * 2016-12-15 2018-01-19 平安科技(深圳)有限公司 Voiceprint recognition method and device
CN107610707B (zh) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 Voiceprint recognition method and device
CN110737268A (zh) * 2019-10-14 2020-01-31 哈尔滨工程大学 Method for determining commands based on the Viterbi algorithm
CN110737268B (zh) * 2019-10-14 2022-07-15 哈尔滨工程大学 Method for determining commands based on the Viterbi algorithm
CN110879833A (zh) * 2019-11-20 2020-03-13 中国科学技术大学 Text prediction method based on lightweight recurrent units (LRU)
CN110879833B (zh) * 2019-11-20 2022-09-06 中国科学技术大学 Text prediction method based on lightweight recurrent units (LRU)
CN111862931A (zh) * 2020-05-08 2020-10-30 北京嘀嘀无限科技发展有限公司 Speech generation method and device
CN111627418A (zh) * 2020-05-27 2020-09-04 携程计算机技术(上海)有限公司 Training method, synthesis method, system, device and medium for a speech synthesis model
CN111627418B (zh) * 2020-05-27 2023-01-31 携程计算机技术(上海)有限公司 Training method, synthesis method, system, device and medium for a speech synthesis model
CN115276697A (zh) * 2022-07-22 2022-11-01 交通运输部规划研究院 Coastal radio station communication system integrating intelligent speech

Similar Documents

Publication Publication Date Title
US11514888B2 (en) Two-level speech prosody transfer
US11664011B2 (en) Clockwork hierarchal variational encoder
WO2016172871A1 (fr) Speech synthesis method based on recurrent neural networks
WO2021225830A1 (fr) Text-to-speech prosody using a BERT model
Ma et al. Incremental text-to-speech synthesis with prefix-to-prefix framework
JP6305955B2 (ja) Acoustic feature conversion device, acoustic model adaptation device, acoustic feature conversion method, and program
US11830474B2 (en) Predicting parametric vocoder parameters from prosodic features
Pouget et al. HMM training strategy for incremental speech synthesis
WO2015025788A1 (fr) Quantitative F0 pattern generation device and method, and model learning device and method for F0 pattern generation
Kons et al. Neural TTS voice conversion
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
US10446133B2 (en) Multi-stream spectral representation for statistical parametric speech synthesis
Van Nguyen et al. Development of Vietnamese speech synthesis system using deep neural networks
Mullah A comparative study of different text-to-speech synthesis techniques
Ronanki et al. The CSTR entry to the Blizzard Challenge 2017
Suzić et al. HiFi-GAN based Text-to-Speech Synthesis in Serbian
Coto-Jiménez et al. Speech Synthesis Based on Hidden Markov Models and Deep Learning.
US20230018384A1 (en) Two-Level Text-To-Speech Systems Using Synthetic Training Data
JP2014095851A (ja) Acoustic model generation method, speech synthesis method, and apparatuses and programs therefor
Louw Neural speech synthesis for resource-scarce languages
Phan et al. Extracting MFCC, F0 feature in Vietnamese HMM-based speech synthesis
Shah et al. Influence of various asymmetrical contextual factors for TTS in a low resource language
Frikha et al. Hidden Markov models (HMMs) isolated word recognizer with the optimization of acoustical analysis and modeling techniques
RU160585U1 (ru) Speech recognition system with a pronunciation variability model
Zhao et al. The UTokyo system for Blizzard Challenge 2016

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15890251

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15890251

Country of ref document: EP

Kind code of ref document: A1