WO2016172871A1 - Speech synthesis method based on recurrent neural networks - Google Patents

Speech synthesis method based on recurrent neural networks

Info

Publication number
WO2016172871A1
Authority: WO (WIPO, PCT)
Prior art keywords: speech, parameters, acoustic, neural network, sequence
Application number: PCT/CN2015/077785
Other languages: French (fr), Chinese (zh)
Inventor: 华侃如
Original Assignee: 华侃如
Priority date: 2015-04-29 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2015-04-29
Application filed by 华侃如
Priority to PCT/CN2015/077785
Publication of WO2016172871A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination


Abstract

A speech synthesis method based on recurrent neural networks, comprising the following steps: acquiring context information of a text to be synthesized; generating an acoustic statistical parameter sequence according to the context information of the text; using a recurrent neural network to generate, from the acoustic statistical parameter sequence, an acoustic parameter sequence of the speech to be synthesized; and synthesizing the speech according to that acoustic parameter sequence. Compared with traditional statistical parametric speech synthesis methods, the method gives the synthesized speech better naturalness and offers good real-time performance.

Description

Speech Synthesis Method Based on Recurrent Neural Networks

Technical Field

The invention relates to the field of speech synthesis, and in particular to statistical parametric speech synthesis.

Background

Speech synthesis is the technology of having a machine or program generate human-intelligible speech from textual information. Applications of speech synthesis include text-to-speech (TTS) and singing voice synthesis (SVS).

The current mainstream approach is statistical parametric speech synthesis based on hidden Markov models (HMMs), which comprises a training stage and a run stage.

Training stage: the acoustic parameters of the training speech data are aligned with the state sequence of the hidden Markov model, and a training algorithm computes the acoustic statistical parameters of each state; a decision tree clusters the model states according to the contextual information of the text.

Run stage: the decision tree converts the context information sequence of the input text into a sequence of clustered model states, and the acoustic statistical parameters associated with each state yield an acoustic statistical parameter sequence. Because the states of a hidden Markov model are discrete, this sequence is discontinuous across state boundaries. To generate coherent speech acoustic parameters, the sequence must be smoothed. The traditional smoothing method is the maximum likelihood parameter generation (MLPG) algorithm, which takes a sequence of acoustic statistics that includes dynamic features (e.g., first- and second-order derivatives) and generates the coherent acoustic parameter sequence that best matches those statistics (sketched below). Finally, the speech waveform is synthesized and output from the acoustic parameter sequence using a source-filter model or another speech analysis-synthesis technique.
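For concreteness, single-stream MLPG reduces to a linear system: with the static and delta means stacked into mu and diagonal precisions S^-1, the smooth trajectory c solves W^T S^-1 W c = W^T S^-1 mu. The following Python sketch is illustrative only; the backward-difference delta window and the diagonal variances are assumptions, not the patent's formulation:

```python
import numpy as np

def mlpg(mu_static, var_static, mu_delta, var_delta):
    """Solve W^T S^-1 W c = W^T S^-1 mu for the static trajectory c."""
    T = len(mu_static)
    I = np.eye(T)                      # static part of the window matrix W
    D = np.zeros((T, T))               # delta part: simple backward difference
    for t in range(1, T):
        D[t, t - 1], D[t, t] = -1.0, 1.0
    W = np.vstack([I, D])
    Sinv = np.diag(np.concatenate([1.0 / var_static, 1.0 / var_delta]))
    mu = np.concatenate([mu_static, mu_delta])
    A = W.T @ Sinv @ W                 # banded, positive definite
    b = W.T @ Sinv @ mu
    return np.linalg.solve(A, b)       # trajectory best matching the statistics
```

Because A couples every frame of the sentence, the solve runs over the whole utterance at once, which is exactly why this classic smoother is not frame-by-frame.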
The main problems with this smoothing method are: 1) it tends to over-smooth the generated acoustic parameters, so that the resulting speech sounds muffled and indistinct; 2) it is not real-time, since parameters can only be generated sentence by sentence, which easily causes playback stutter in real-time synthesis applications.

One way to avoid the over-smoothing of acoustic parameters generated by traditional statistical parametric speech synthesis is to generate parameters under a global variance (GV) criterion (see Toda, Tomoki, et al. "A speech parameter generation algorithm considering global variance for HMM-based speech synthesis." IEICE TRANSACTIONS on Information and Systems 90.5 (2007): 816-824.), but this improvement fails to solve the real-time problem of parameter generation.

Another way is to use a deep neural network in place of the hidden Markov model (see Zen, Heiga, et al. "Statistical parametric speech synthesis using deep neural networks." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.). The deep neural network directly generates the corresponding acoustic statistical parameter sequence from the context information sequence of the text, so the generated sequence is closer to the statistics of real speech and the synthesized speech is more natural. However, such deep neural networks need a large amount of training data, typically about 5 hours of training speech, to achieve good synthesis quality; preparing that much training speech data costs considerable time and labor.

To solve the above technical problems, the present invention uses a recurrent neural network in place of the traditional acoustic parameter smoothing method and generates coherent acoustic parameters from the acoustic statistical parameter sequence, effectively resolving the over-smoothing of the generated parameters and making the synthesized speech sound more natural. At the same time, the invention operates in real time and can synthesize speech sentence by sentence, word by word, or frame by frame. In addition, the invention needs relatively little training data and relatively little training time.
Summary of the Invention

The invention belongs to the technical field of speech synthesis based on statistical parameters. One of the technical problems it solves is avoiding the over-smoothing of synthesized speech, making the synthesized speech more natural and clear.

To solve the above problems, the method of the invention consists of two phases: training and running the models. The training phase comprises:

obtaining the context information of the text in the training data;

training an acoustic statistical parameter prediction model that maps the context information of the text to acoustic statistical parameters;

using the trained acoustic statistical parameter prediction model to generate the acoustic statistical parameter sequence corresponding to the context information sequence of the training text;

training a recurrent neural network that maps the acoustic statistical parameter sequence to the acoustic parameter sequence of the training speech.

The run phase comprises:

obtaining the context information of the input text;

using the trained acoustic statistical parameter prediction model to generate the acoustic statistical parameter sequence corresponding to the context information sequence of the input text;

using the trained recurrent neural network to generate a smooth acoustic parameter sequence from the acoustic statistical parameter sequence;

synthesizing speech from the smooth acoustic parameter sequence.

The method for synthesizing speech from the smooth acoustic parameter sequence may be parametric synthesis based on a source-filter model, parametric synthesis based on a sinusoidal or harmonic-plus-noise model, a vocoder, concatenative synthesis based on unit selection, and so on.

The acoustic statistical parameter prediction model may be a machine learning model or method such as a hidden Markov model, a decision tree, or a neural network. The invention places no specific restriction on the model or method employed.
Brief Description of the Drawings

Figure 1 is a flow chart of the training phase of the models of the invention.

Figure 2 is a flow chart of the run phase of the invention.

Figure 3 is a schematic diagram of the training method of the recurrent neural network of the invention.

Figure 4 is a schematic diagram of the run-time operation of the recurrent neural network of the invention.

Figure 5 is an example of an acoustic statistical parameter sequence in an embodiment of the invention.

Figure 6 is an example of a smoothed and normalized acoustic statistical parameter sequence and of the acoustic parameter sequence output by the recurrent neural network in an embodiment of the invention.

Detailed Description
The invention comprises two phases, training and running the models: the training phase is shown in Figure 1 and the run phase in Figure 2.

The training phase mainly computes, from the training data, the parameters of the acoustic statistical parameter prediction model and of the recurrent neural network.

The training data consist of speech data and text data that is time-aligned with the speech. The speech data can take different forms in different applications: in a text-to-speech application it is audio of read sentences; in a singing voice synthesis application it is audio of singing. The text data include the words corresponding to the speech together with phonetic transcriptions, and may also include syllable, stress, and part-of-speech annotations.

The role of the acoustic statistical parameter prediction model is to predict, from the input text information, the statistics of the speech acoustic parameters (including prosody and timbre parameters at particular times) at different times, producing a preliminary prediction of the acoustic parameters. The model's output may be discrete, i.e., a sequence of acoustic statistics that is incoherent across states, or continuous, i.e., a sequence of acoustic statistics that is coherent over time. The parameters output by the model should reflect short-time statistics of the speech acoustic parameters (e.g., means, variances, derivatives), not just the acoustic parameters themselves.

The invention does not prescribe a particular acoustic statistical parameter prediction model. The model used in this embodiment is based on a hidden Markov model and a decision tree; other models or methods with similar function, such as feedforward neural networks or support vector machines, may be used instead.

Compared with existing statistical parametric speech synthesis, the invention replaces the traditional smoothing of speech acoustic parameters with a recurrent neural network, solving the over-smoothing of the generated parameters; and it can synthesize speech sentence by sentence, word by word, or frame by frame, solving the poor real-time behavior of the existing technique.

The training phase, shown in Figure 1, comprises the following steps.

Step 1: obtain the speech data, the corresponding text data, and the context information of the text from the training data, and train the acoustic statistical parameter prediction model so that it maps the context information of the text to acoustic statistical parameters.

The training method for the prediction model used in this embodiment is as follows.

Initialize the state transition probability distributions and output probability distributions of the hidden Markov model from the aligned training text and training speech. The model uses context-dependent states; optionally, the output distributions are mixture distributions; optionally, the output distributions use diagonal covariance matrices. The transition probabilities can be computed from the order and frequency of the different states in the training text, and the output distribution parameters can be estimated from the acoustic parameters of the speech data aligned to each state; a minimal counting sketch of the transition estimate is given below.
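As a rough illustration of the counting step only, the following Python sketch estimates transition probabilities by relative frequency from aligned state sequences; the function name and data layout are assumptions for illustration, not the patent's implementation:

```python
from collections import Counter, defaultdict

def estimate_transitions(state_sequences):
    """Relative-frequency estimate of P(next state | state) from the
    aligned training state sequences (lists of state identifiers)."""
    counts = defaultdict(Counter)
    for seq in state_sequences:
        for s, s_next in zip(seq, seq[1:]):
            counts[s][s_next] += 1
    return {s: {t: n / sum(c.values()) for t, n in c.items()}
            for s, c in counts.items()}
```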
Generate a decision tree from the context information of the hidden Markov model states in the text and from the output probability distribution parameters of the model; the tree-growing algorithm may use the minimum description length (MDL) criterion.

According to the decision tree, tie the context-dependent states of the hidden Markov model that fall under the same tree node, then re-estimate the state transition distributions and output distributions with the Baum-Welch algorithm or Viterbi training. The output distribution parameters obtained at this point are the acoustic statistical parameters of each state.

So that state durations can be generated in the run phase of the invention, compute and store the mean duration and duration variance of each group of tied states.

The prediction model based on a hidden Markov model and a decision tree used in this embodiment is only an example; a different model and training method may be adopted in a specific implementation.

Step 2: use the trained acoustic statistical parameter prediction model to generate the acoustic statistical parameter sequence corresponding to the context information sequence of the text data in the training data.

Taking the HMM-and-decision-tree model as an example: from the context information sequence of the training text, the decision tree selects the corresponding hidden Markov model states to form a state sequence; the duration of each state is determined by the Viterbi training of Step 1, which keeps the state sequence time-aligned with the training speech. The trained hidden Markov model then generates the corresponding acoustic statistical parameter sequence from the state sequence.

Step 3: as shown in Figure 3, train the recurrent neural network so that it maps the acoustic statistical parameter sequence output by the prediction model to the acoustic parameter sequence of coherent, natural speech. The network's training data are the acoustic statistical parameter sequences generated in Step 2 and the speech acoustic parameters computed from the speech data in the training data.

Because the hidden Markov model used in Step 2 of this embodiment outputs an acoustic statistical parameter sequence that is incoherent across states, that sequence should first be smoothed before being fed to the recurrent neural network, to ensure the network can output a coherent sequence of speech parameters. The preliminary smoothing may be interpolation, low-pass filtering, moving-average filtering, a maximum likelihood parameter generation algorithm, and so on; a moving-average sketch follows.
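As one concrete option among those listed, a centered moving average over each parameter dimension might look like the sketch below; the window length and the (frames, dims) array layout are illustrative assumptions:

```python
import numpy as np

def moving_average(x, win=5):
    """Centered moving-average smoothing of a (frames, dims) parameter track."""
    kernel = np.ones(win) / win
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, x)
```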
Optionally, normalize the acoustic statistical parameter sequence fed to the recurrent neural network, for example so that each input parameter follows a Gaussian distribution with mean 0 and variance 1 (see the sketch below).
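A standard per-dimension z-score normalization, with statistics computed on the training set and reused at run time, is one plausible reading of this step (illustrative sketch):

```python
import numpy as np

def fit_zscore(x, eps=1e-8):
    """Normalize each dimension to zero mean, unit variance; return the
    (mu, sigma) statistics so they can be reused at run time."""
    mu, sigma = x.mean(axis=0), x.std(axis=0)
    return (x - mu) / (sigma + eps), mu, sigma
```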
Recurrent networks of different structures can be used for different applications, including first-order or second-order recurrent networks, long short-term memory (LSTM) networks, multi-layer recurrent networks, and combinations of these. The output layer of the recurrent network uses a linear activation function. When there are several acoustic parameters, a single recurrent network may output several acoustic parameter sequences from the input acoustic statistical parameter sequence, or several recurrent networks may each output one or more acoustic parameter sequences from that input. A minimal first-order network of this kind is sketched below.
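For instance, a first-order (Elman-style) recurrent network with a linear output layer, one of the structures the text permits, could be set up as follows; the layer sizes and initialization are illustrative assumptions:

```python
import numpy as np

class ElmanRNN:
    """First-order recurrent network with a linear output layer."""
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.Wxh = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.Whh = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        self.Why = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.h = np.zeros(n_hidden)

    def step(self, x):
        # new hidden state from the current input and the previous hidden state
        self.h = np.tanh(self.Wxh @ x + self.Whh @ self.h)
        return self.Why @ self.h  # linear activation at the output layer
```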
Different training algorithms suit the different structures. A first-order recurrent network can be trained with backpropagation through time (BPTT) or real-time recurrent learning (RTRL); second-order, LSTM, and multi-layer recurrent networks can be trained with the generalized LSTM algorithm (LSTM-g; see Monner, Derek, et al. "A generalized LSTM-like training algorithm for second-order recurrent neural networks." Neural Networks 25 (2012): 70-83.). During training, the output layer of the recurrent network uses a least-squares error criterion.

Because the output of the recurrent network is delayed, the acoustic statistical parameter sequence at the network's input must be advanced by roughly 5 to 40 frames during training, giving the network enough time to predict the output acoustic parameters.

Optionally, to keep the network from forgetting input or output information over the time offset between input and output, which would over-smooth the output acoustic parameters, the network's input includes both the acoustic statistical parameters advanced by 5 to 40 frames and those of the current frame; the sketch after this paragraph shows one way to build such inputs.
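One way to realize this pairing of look-ahead and current-frame statistics; the 30-frame offset (within the stated 5-to-40 range) and the edge handling are assumptions:

```python
import numpy as np

def build_inputs(stats, lookahead=30):
    """Concatenate the statistics `lookahead` frames ahead with the current
    frame's statistics, holding the final frame at the sequence edge."""
    future = np.roll(stats, -lookahead, axis=0)
    future[-lookahead:] = stats[-1]
    return np.concatenate([future, stats], axis=1)
```

Note that the network's input layer then has twice the dimensionality of the raw statistics vector.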
The run phase of the invention, shown in Figure 2, uses the trained models to produce speech acoustic parameters from the input text and synthesize speech. It comprises the following steps.

Step 1: obtain the context information sequence of the input text, and use the acoustic statistical parameter prediction model to generate the corresponding acoustic statistical parameter sequence.

With the prediction model of this embodiment, this is done as follows:

From the context information sequence of the input text, the decision tree selects the corresponding hidden Markov model states to form a state sequence; the duration of each state is determined by the mean duration and duration variance of that state obtained during training. The trained hidden Markov model then generates the corresponding acoustic statistical parameter sequence from the state sequence; a sketch of one possible duration rule follows.
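A common way to turn a stored mean and variance into a concrete duration is the HTS-style rule d = mu + rho * sigma^2, where a single speaking-rate factor rho is shared across states; whether the patent intends exactly this rule is an assumption:

```python
def state_duration(mean_frames, var_frames, rho=0.0):
    """Duration (in frames) of a tied state from its stored statistics;
    rho = 0 reproduces the mean duration."""
    return max(1, int(round(mean_frames + rho * var_frames)))
```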
Step 2: as shown in Figure 4, use the trained recurrent neural network to generate a coherent acoustic parameter sequence from the acoustic statistical parameter sequence. Specifically:

Because the hidden Markov model used in Step 1 of this embodiment outputs an acoustic statistical parameter sequence that is incoherent across states, that sequence should first be smoothed before being fed to the recurrent neural network, to ensure the network can output a coherent sequence of speech parameters; the preliminary smoothing may be interpolation, low-pass filtering, moving-average filtering, a maximum likelihood parameter generation algorithm, and so on.

Figure 5 shows an example of an unsmoothed acoustic statistical parameter sequence (solid line) and the smoothed sequence (dashed line); for clarity, the figure contains only the mean sequence of the second mel-frequency cepstral coefficient (MFCC) of the speech.

Optionally, normalize the acoustic statistical parameter sequence fed to the recurrent neural network, for example so that each input parameter follows a Gaussian distribution with mean 0 and variance 1.

Set the activation values of the neurons in the input layer of the recurrent network to the acoustic statistical parameters 5 to 40 frames ahead; optionally, set them to both the acoustic statistical parameters 5 to 40 frames ahead and those of the current frame.

From the activation values of each layer at the current frame and at the previous frame, compute the activation values of the neurons in every layer of the recurrent network.

Take the activation values of the neurons in the output layer as the output acoustic parameters.

Repeat the above steps over the acoustic statistical parameters of each frame in time order to generate a coherent acoustic parameter sequence; the sketch below strings the earlier pieces into such a loop.
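Composing the earlier sketches (moving_average, build_inputs, ElmanRNN) gives one possible frame-by-frame run loop; mu and sigma are the training-set normalization statistics, and the whole pipeline remains an illustrative assumption rather than the patent's implementation:

```python
import numpy as np

def synthesize_track(rnn, stats_seq, mu, sigma, lookahead=30):
    """Smooth, normalize, and shift the statistics, then run one recurrent
    step per frame to produce the acoustic parameter track."""
    x = (moving_average(stats_seq) - mu) / (sigma + 1e-8)
    x = build_inputs(x, lookahead)
    return np.array([rnn.step(frame) for frame in x])
```

Because each output frame depends only on already-seen inputs, the same loop can run incrementally for word-by-word or frame-by-frame synthesis.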
Figure 6 shows a smoothed and normalized acoustic statistical parameter sequence (dashed line), the same sequence advanced by 30 frames (dotted line), and the acoustic parameter sequence output by the recurrent neural network (solid line). For clarity, the statistical sequence in the figure contains only the mean sequence of the second MFCC parameter of the speech, and the output has been normalized. The output of the recurrent network shows more detail than the smoothed mean sequence at its input.

Step 3: synthesize the speech waveform from the acoustic parameter sequence generated by the recurrent neural network and output it. The concrete synthesis method depends on the type of acoustic parameters used, and the invention does not restrict it.

The recurrent neural network used in the invention, and the prediction model based on a decision tree and a hidden Markov model used in this embodiment, apply to many kinds of acoustic parameters and acoustic statistics, such as mel-frequency cepstral coefficient (MFCC) features, line spectral pair (LSP) features, harmonic energy features, formant features, spectral envelope features, fundamental frequency features, logarithmic fundamental frequency features, and the means, variances, and derivatives of these features. The invention does not restrict the types of acoustic parameters and acoustic statistics used.

Traditional statistical parametric synthesis based on maximum likelihood parameter generation minimizes the statistical error of the generated parameters, but this does not guarantee that the perceptual feature loss of the generated acoustic parameters is minimized at the same time. The invention uses a recurrent neural network that, during training, minimizes the error between the acoustic parameters of the synthesized speech and those of the real speech, effectively reducing the perceptual feature loss of the synthesized speech and alleviating over-smoothing. Moreover, because the recurrent network absorbs the speaker characteristics of the training speech during training, the acoustic parameters generated in the run phase retain more detail, making the synthesized speech more natural.

Deep-neural-network speech synthesis takes the context information of the text directly as the network input, whereas the input of the recurrent network in the invention is acoustic statistical parameters. Because the network's input and output data are strongly correlated, the method of the invention needs only 1 to 2 hours of training speech data, reducing the data requirement and the effort of preparing training speech compared with deep-neural-network synthesis.

Claims (5)

  1. A speech synthesis method based on a recurrent neural network, characterized by comprising the following steps:
    a. obtaining context information of the text to be synthesized;
    b. generating an acoustic statistical parameter sequence according to the context information of the text;
    c. generating, from the acoustic statistical parameter sequence produced from the context, an acoustic parameter sequence of the speech to be synthesized using a recurrent neural network;
    d. synthesizing speech according to the acoustic parameter sequence of the speech to be synthesized.
  2. The method of claim 1, wherein the context information of the text refers to the linguistic context of one or more phonemes or syllables contained in the text, and includes at least one of the following parameters:
    a. the name of the phoneme or syllable currently being processed;
    b. the category of the current phoneme, comprising at least one of: vowel, consonant, nasal, stop, fricative;
    c. the articulation position of the current vowel, comprising at least one of: front, central, back;
    d. the mouth aperture of the current vowel, comprising at least one of: open, mid, close;
    e. the name of the next or previous phoneme or syllable;
    f. the category of the next or previous phoneme or syllable;
    g. the articulation position of the next or previous phoneme or syllable;
    h. the mouth aperture of the next or previous phoneme or syllable;
    i. the number of phonemes contained in the current syllable;
    j. the number of phonemes or syllables contained in the current sentence, phrase, or word;
    k. the position of the current phoneme within its syllable;
    l. the position of the current syllable within the sentence, phrase, or word;
    m. the part of speech of the current word, comprising at least one of: verb, pronoun, noun, adjective, adverb, preposition.
  3. The method of claim 1, wherein the acoustic statistical parameter sequence generated in step b includes at least one of the following parameters:
    a. MFCC (mel-frequency cepstral coefficient) parameters or mel-frequency filter-bank energy parameters;
    b. LSP (line spectral pair) parameters;
    c. LPC (linear prediction coefficient) parameters;
    d. harmonic energy parameters of the speech;
    e. formant frequency, formant bandwidth, or formant energy parameters of the speech;
    f. fundamental frequency parameters of the speech;
    g. spectrum or spectral envelope parameters of the speech;
    h. short-time energy parameters of the speech;
    i. first- or second-order derivatives of the above parameters;
    j. short-time means, short-time standard deviations, or short-time maxima/minima of the above parameters.
  4. The method according to claim 1, wherein the sequence of acoustic parameters of the speech to be synthesized, generated by the recurrent neural network in step c, includes at least one of the following parameters (a hypothetical frame layout follows the list):
    a. MFCC (Mel-frequency cepstral coefficient) parameters or Mel-frequency filter-bank energy parameters;
    b. LSP (line spectral pair) parameters;
    c. LPC (linear prediction coefficient) parameters;
    d. harmonic energy parameters of the speech;
    e. formant frequency, formant bandwidth, or formant energy parameters of the speech;
    f. fundamental frequency (F0) parameters of the speech;
    g. spectrum or spectral envelope parameters of the speech;
    h. short-time energy parameters of the speech.
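The claim leaves the exact output parameterization open. Purely for illustration, one hypothetical per-frame layout combining items f, g, and h (plus a voicing flag, an addition not listed in the claim) could look like this:

```python
import numpy as np

N_ENV = 60             # hypothetical spectral-envelope order (item g)
FRAME_DIM = N_ENV + 3  # + log-F0 (item f), voicing flag, energy (item h)

def split_frame(frame):
    """Unpack one predicted frame vector into named parameter groups."""
    assert frame.shape == (FRAME_DIM,)
    envelope = frame[:N_ENV]
    log_f0, voiced, energy = frame[N_ENV:]
    f0 = float(np.exp(log_f0)) if voiced > 0.5 else 0.0  # 0 Hz = unvoiced
    return envelope, f0, energy
```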
  5. The method according to claim 1, wherein the recurrent neural network in step c includes at least one of the following types of neural network (a minimal model sketch follows the list):
    a. a first-order or second-order recurrent neural network;
    b. a multi-layer recurrent neural network;
    c. a long short-term memory (LSTM) neural network;
    d. a combination of the above neural networks;
    e. a combination of the above neural networks and a feedforward neural network.
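As one concrete reading of option e (stacked LSTM layers per options b and c, followed by a feedforward output layer), the sketch below uses PyTorch; the framework, layer count, and all dimensions are assumptions, since the claim fixes only the network family.

```python
import torch
import torch.nn as nn

class AcousticRNN(nn.Module):
    """Stacked LSTM (options b/c) + feedforward output layer (option e)."""
    def __init__(self, ctx_dim=120, hidden=256, layers=2, out_dim=63):
        super().__init__()
        self.rnn = nn.LSTM(ctx_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, ctx_seq):
        # ctx_seq: (batch, n_frames, ctx_dim) context-feature sequence
        h, _ = self.rnn(ctx_seq)        # hidden state at every frame
        return self.out(h)              # (batch, n_frames, out_dim)

# Usage: map 100 frames of 120-dim context features to acoustic parameters.
model = AcousticRNN()
params = model(torch.randn(1, 100, 120))  # -> torch.Size([1, 100, 63])
```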
PCT/CN2015/077785 2015-04-29 2015-04-29 Speech synthesis method based on recurrent neural networks WO2016172871A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/077785 WO2016172871A1 (en) 2015-04-29 2015-04-29 Speech synthesis method based on recurrent neural networks

Publications (1)

Publication Number Publication Date
WO2016172871A1 2016-11-03

Family

ID=57198903

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/077785 WO2016172871A1 (en) 2015-04-29 2015-04-29 Speech synthesis method based on recurrent neural networks

Country Status (1)

Country Link
WO (1) WO2016172871A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1275746A * 1994-04-28 2000-12-06 Motorola, Inc. Equipment for converting text into audio signal by using neural network
EP1291848A2 * 2001-08-31 2003-03-12 Nokia Corporation Multilingual pronunciations for speech recognition
CN1929655A * 2006-09-28 2007-03-14 Sun Yat-sen University Mobile phone capable of realizing text and voice conversion
CN101510424A * 2009-03-12 2009-08-19 Meng Zhiping Method and system for encoding and synthesizing speech based on speech primitive
CN102117614A * 2010-01-05 2011-07-06 Sony Ericsson Mobile Communications AB Personalized text-to-speech synthesis and personalized speech feature extraction
CN104538024A * 2014-12-01 2015-04-22 Baidu Online Network Technology (Beijing) Co., Ltd. Speech synthesis method, apparatus and equipment

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610707A * 2016-12-15 2018-01-19 Ping An Technology (Shenzhen) Co., Ltd. Voiceprint recognition method and device
CN107610707B * 2016-12-15 2018-08-31 Ping An Technology (Shenzhen) Co., Ltd. Voiceprint recognition method and device
CN110737268A * 2019-10-14 2020-01-31 Harbin Engineering University Method for determining an instruction based on the Viterbi algorithm
CN110737268B * 2019-10-14 2022-07-15 Harbin Engineering University Viterbi algorithm-based instruction determination method
CN110879833A * 2019-11-20 2020-03-13 University of Science and Technology of China Text prediction method based on the lightweight recurrent unit (LRU)
CN110879833B * 2019-11-20 2022-09-06 University of Science and Technology of China Text prediction method based on the lightweight recurrent unit (LRU)
CN111862931A * 2020-05-08 2020-10-30 Beijing Didi Infinity Technology and Development Co., Ltd. Voice generation method and device
CN111627418A * 2020-05-27 2020-09-04 Ctrip Computer Technology (Shanghai) Co., Ltd. Training method, synthesis method, system, device and medium for a speech synthesis model
CN111627418B * 2020-05-27 2023-01-31 Ctrip Computer Technology (Shanghai) Co., Ltd. Training method, synthesis method, system, device and medium for a speech synthesis model
CN115276697A * 2022-07-22 2022-11-01 Transport Planning and Research Institute, Ministry of Transport Coastal radio station communication system with integrated intelligent voice

Similar Documents

Publication Publication Date Title
US11664011B2 (en) Clockwork hierarchal variational encoder
US11514888B2 (en) Two-level speech prosody transfer
WO2016172871A1 (en) Speech synthesis method based on recurrent neural networks
EP4128211A1 (en) Speech synthesis prosody using a bert model
Ma et al. Incremental text-to-speech synthesis with prefix-to-prefix framework
JP6305955B2 (en) Acoustic feature amount conversion device, acoustic model adaptation device, acoustic feature amount conversion method, and program
US11830474B2 (en) Predicting parametric vocoder parameters from prosodic features
Pouget et al. HMM training strategy for incremental speech synthesis
WO2015025788A1 (en) Quantitative f0 pattern generation device and method, and model learning device and method for generating f0 pattern
Kons et al. Neural TTS voice conversion
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
US10446133B2 (en) Multi-stream spectral representation for statistical parametric speech synthesis
Van Nguyen et al. Development of Vietnamese speech synthesis system using deep neural networks
Mullah A comparative study of different text-to-speech synthesis techniques
Ronanki et al. The CSTR entry to the Blizzard Challenge 2017
Suzić et al. HiFi-GAN based Text-to-Speech Synthesis in Serbian
Coto-Jiménez et al. Speech Synthesis Based on Hidden Markov Models and Deep Learning.
US20230018384A1 (en) Two-Level Text-To-Speech Systems Using Synthetic Training Data
JP2014095851A (en) Methods for acoustic model generation and voice synthesis, devices for the same, and program
Louw Neural speech synthesis for resource-scarce languages
Phan et al. Extracting MFCC, F0 feature in Vietnamese HMM-based speech synthesis
Shah et al. Influence of various asymmetrical contextual factors for TTS in a low resource language
Frikha et al. Hidden Markov models (HMMs) isolated word recognizer with the optimization of acoustical analysis and modeling techniques
RU160585U1 (en) SPEECH RECOGNITION SYSTEM WITH VARIABILITY MODEL
Zhao et al. The UTokyo system for Blizzard Challenge 2016

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15890251

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15890251

Country of ref document: EP

Kind code of ref document: A1