JP6142401B2

JP6142401B2 - Speech synthesis model learning apparatus, method, and program

Info

Publication number: JP6142401B2
Application number: JP2013177166A
Authority: JP
Inventors: 弘和亀岡; 伸克北条; 幸太吉里; 大輔齋藤; 茂樹嵯峨山
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2013-08-28
Filing date: 2013-08-28
Publication date: 2017-06-07
Anticipated expiration: 2033-08-28
Also published as: JP2015045755A

Description

The present invention relates to a speech synthesis model learning apparatus, method, and program, and more particularly to a speech synthesis model learning apparatus, method, and program for learning a speech synthesis model used to synthesize a speech waveform from text data.

The basic strategy of text-to-speech synthesis based on a statistical model is to set up a probabilistic generative model of speech, learn its parameters from training data, and use the learned model to generate speech for arbitrary text data. The quality of the synthesized speech therefore depends on how well the various properties and behaviors of speech can be described in the form of a generative model. With regard to the phonemic content of speech in particular, it is important to model the time series of spectral envelope features appropriately. Conventional speech synthesis based on the hidden Markov model (HMM) or its variants (hereinafter "HMM speech synthesis"; see, for example, Non-Patent Document 1) was devised with the idea of treating the temporal stretching and compression of a speech spectrum sequence as a stochastic phenomenon.

In conventional HMM speech synthesis, the cepstrum or line spectral pairs (LSP) are used as the speech features that represent the spectral envelope. Using the cepstrum as the feature corresponds to a model in which the spectral envelope fluctuates stochastically only in the power direction; using LSP corresponds to a model in which the peaks of the spectral envelope fluctuate stochastically only in the frequency direction. In HMM speech synthesis with cepstral features, the spectral envelope of the synthesized speech tends to be smoothed in the frequency direction, because the generative model cannot capture fluctuations of the spectrum in the frequency direction well. A smoothed spectral envelope generally produces a buzzy sound, which is a well-known tendency of conventional HMM speech synthesis.

For this reason, improvements have been attempted by introducing global variance (GV) into the probabilistic model, for example to emphasize the contrast between the peaks and dips of the spectral envelope. However, once the spectral envelope has been smoothed, it is difficult to restore the peaks and dips that should have been there, so this does not amount to a fundamental solution.

The frequency and power of a spectral envelope peak correspond to the frequency and power of a vocal tract resonance, so the spectral envelope of speech in fact fluctuates in both the power direction and the frequency direction. The resonance frequency and power are considered to change continuously over time as the vocal tract shape physically changes. For example, compare the spectral envelope near the center of a phoneme with the spectral envelope near the junction with the following phoneme. In the latter, the vocal tract is in the process of changing continuously toward the shape for the following phoneme, so the two spectral envelopes differ in resonance frequency and power, and it is important to model this difference as fluctuation.

As a speech spectrum model for speech analysis and synthesis, a model called the composite wavelet model (CWM) has been proposed. It represents the entire spectral envelope by a Gaussian mixture model (GMM), on the assumption that each peak of the spectral envelope can be approximated by a Gaussian distribution (see, for example, Non-Patent Document 2).

Because the CWM has both the frequency and the power of the spectral envelope peaks as parameters, it is well suited to probabilistic modeling of fluctuations of the spectral envelope in both the power direction and the frequency direction. When a speech waveform is synthesized from CWM parameters, a Gaussian function in the frequency domain corresponds to a Gabor function in the time domain, so the waveform is synthesized by placing these Gabor functions at time intervals corresponding to the fundamental frequency. CWM-based speech analysis and synthesis is a synthesis method using FIR filters. Compared with conventional synthesis methods using recursive filters, such as those based on LSP or the cepstrum, it can synthesize speech with good temporal characteristics regardless of the fundamental frequency, even for filters with a high Q factor.

Because of these advantages of the CWM, an HMM speech synthesis method that uses CWM parameters as speech features has been proposed (see, for example, Non-Patent Document 3). In the parameter learning of this method, CWM parameters are first extracted, as a preceding stage, from the speech spectral envelope at each time (short-time frame), and the sequence of vectors formed by arranging the extracted CWM parameter sets is used as the speech feature sequence for HMM speech synthesis.

Non-Patent Document 1: T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis", in Proc. of Eurospeech 1999, 1999, pp. 2347-2350.
Non-Patent Document 2: 槐武也 et al., "A study of speech synthesis using the composite wavelet model", Proceedings of the Acoustical Society of Japan 2006 Spring Meeting, 2-11-7, 2006.
Non-Patent Document 3: Hojo et al., "HMM speech synthesis based on a composite wavelet model analysis and synthesis system", no. 2-2-7, 2012.

The technique of Non-Patent Document 3 does not achieve sufficient performance because of the difficulty inherent in formant frequency estimation. Formant trajectories appear clearly in a speech spectrogram, but extracting them automatically is not easy: formants that actually exist may go undetected, and formants that do not actually exist may be detected by mistake. Estimating CWM parameters in each short-time frame can be regarded as equivalent to the formant extraction problem, so with an approach like that of Non-Patent Document 3, the same kinds of errors occur frequently in the preceding CWM parameter estimation stage.

FIG. 9 shows an example of the result of estimating CWM parameters at each time (short-time frame) for a sample speech signal. In FIG. 9, the center of each Gaussian function of the CWM estimated at each time is plotted with a different marker for each Gaussian function index. As FIG. 9 shows, the way indices are assigned to the Gaussian functions of the CWM is often inconsistent from one time to another (for example, the regions marked by ellipses in FIG. 9). For example, at two different times at which the same phoneme is being uttered, the first and second Gaussian functions may be fitted to the first and second formants at one time, while the second and third Gaussian functions are fitted to them at the other time. Such cases occur frequently. This inconsistency in the indices of the CWM parameters degrades performance in the subsequent parameter learning for HMM speech synthesis, because when the mean of the feature distribution of each state is computed, the centers of Gaussian functions that correspond to different spectral peaks end up being averaged together.

Thus, although the spectral representation by CWM parameters has the advantage of being suited to probabilistic modeling of fluctuations of the spectral envelope peaks in both the power direction and the frequency direction, a method that simply chains CWM parameter estimation and HMM parameter learning as separate stages does not work well.

The present invention has been made in view of the above circumstances, and its object is to provide a speech synthesis model learning apparatus, method, and program capable of learning an HMM using, as speech features, CWM parameters for which the index of each Gaussian function is guaranteed to be consistent within the same state.

To achieve the above object, the speech synthesis model learning apparatus of the present invention includes an estimation unit and a learning unit. The estimation unit estimates the parameters of a composite wavelet model (CWM), which represents the spectral envelope of a speech signal at each time by a Gaussian mixture model, by alternately updating the CWM parameters and the parameters of a hidden Markov model (HMM), which outputs a sequence of the CWM parameters corresponding to the state at each time represented by information obtained from text data, so as to maximize the same criterion. The learning unit learns the HMM using the CWM parameters estimated by the estimation unit and labels indicating the state of the speech signal at each time.

According to the speech synthesis model learning apparatus of the present invention, the estimation unit estimates the CWM parameters by alternately updating the parameters of the CWM, which represents the spectral envelope of the speech signal at each time by a Gaussian mixture model, and the parameters of the HMM, which outputs a sequence of CWM parameters corresponding to the state at each time represented by information obtained from text data, so as to maximize the same criterion. The learning unit then learns the HMM using the CWM parameters estimated by the estimation unit and the labels indicating the state of the speech signal at each time.

In this way, the HMM is learned using CWM parameters that were estimated by alternately updating the CWM parameters and the parameters of the HMM, which outputs a sequence of CWM parameters corresponding to the state at each time represented by information obtained from text data, so as to maximize the same criterion. The HMM can therefore be learned using, as speech features, CWM parameters for which the index of each Gaussian function is guaranteed to be consistent within the same state.

The estimation unit may use, as the same criterion, the product of the probability that the spectral envelope is output given the CWM parameters, the probability of the state sequence of the HMM, and the probability that the CWM parameters are output given the state sequence.

The estimation unit may also use, as the same criterion, a function that is expressed in terms of the HMM parameters, the CWM parameters, and auxiliary variables, that does not exceed the logarithm of the probability that the spectral envelope is output given the CWM parameters, and that touches that logarithm. In this case the estimation unit alternately updates the HMM parameters, the CWM parameters, and the auxiliary variables.

The estimation unit may also use, as the same criterion, a lower-bound function obtained from Jensen's inequality by using the convexity of the negative logarithm function.

The speech synthesis model learning method of the present invention includes two steps. In the first step, an estimation unit estimates the parameters of a composite wavelet model (CWM), which represents the spectral envelope of a speech signal at each time by a Gaussian mixture model, by alternately updating the CWM parameters and the parameters of a hidden Markov model (HMM), which outputs a sequence of the CWM parameters corresponding to the state at each time represented by information obtained from text data, so as to maximize the same criterion. In the second step, a learning unit learns the HMM using the CWM parameters estimated by the estimation unit and labels indicating the state of the speech signal at each time.

The speech synthesis model learning program of the present invention is a program for causing a computer to function as each unit of the speech synthesis model learning apparatus described above.

As described above, according to the speech synthesis model learning apparatus, method, and program of the present invention, the HMM is learned using CWM parameters that were estimated by alternately updating the CWM parameters and the parameters of the HMM, which outputs a sequence of CWM parameters corresponding to the state at each time represented by information obtained from text data, so as to maximize the same criterion. This has the effect that the HMM can be learned using, as speech features, CWM parameters for which the index of each Gaussian function is guaranteed to be consistent within the same state.

FIG. 1 is a conceptual diagram outlining an HMM that outputs CWM parameters.
FIG. 2 is a functional block diagram showing the schematic configuration of the speech synthesizer according to the present embodiment.
FIG. 3 is a functional block diagram showing the schematic configuration of the CWM parameter estimation unit.
FIG. 4 is a flowchart showing the learning process.
FIG. 5 is a flowchart showing the CWM parameter estimation process.
FIG. 6 is a flowchart showing the synthesis process.
FIG. 7 is a spectrogram showing an example of verification results.
FIG. 8 is a graph showing an example of verification results.
FIG. 9 is a diagram for explaining the problem of the prior art.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Outline of the present embodiment>
The key points of the present embodiment are twofold. First, it constructs an integrated model of the hidden Markov model (HMM) and the parameters of the composite wavelet model (CWM), which represents the entire spectral envelope by a Gaussian mixture model (GMM), such that the index of each Gaussian function in the CWM is guaranteed to be consistent within the same state. Second, it realizes an iterative algorithm with guaranteed convergence for learning the parameters of this model from given training data. Specifically, this is achieved as follows.

1. The HMM parameters and the CWM parameters are updated alternately so as to increase the same criterion.
2. In item 1, the same criterion is the product (or its logarithm) of the probability that the spectral envelope is output given the CWM parameters, the probability of the HMM state sequence, and the probability that the CWM parameters are output given the HMM state sequence.
3. In item 2, the same criterion is a function that is expressed in terms of the HMM parameters, the CWM parameters, and an auxiliary variable λ, that does not exceed the logarithm of the probability that the spectral envelope is output given the CWM parameters, and that touches that logarithm. The HMM parameters, the CWM parameters, and the auxiliary variable are updated alternately so as to increase this criterion.
4. In item 3, the same criterion is a lower-bound function constructed with Jensen's inequality by using the convexity of the negative logarithm function.

<Generative model of the spectral envelope sequence based on the CWM>
First, the generative model of the spectral envelope sequence is described.

In conventional HMM speech synthesis, an HMM that outputs a sequence of cepstral features is set up, the parameters of the output distributions are learned from training data, and the average cepstral feature in each state is estimated. With this approach, however, the spectral envelope is smoothed. The cepstrum is obtained by a linear transformation of the spectral envelope, so averaging the cepstrum is equivalent to averaging the spectral envelope in the power direction. If the frequencies of the spectral envelope peaks fluctuate, the peaks and valleys of the spectral envelope are averaged together and smoothed into a gentle shape. The cause of spectral smoothing is therefore considered to lie in assuming stochastic fluctuation of the cepstral features, which models fluctuation of the spectral envelope in the power direction only.
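The smoothing effect described above can be illustrated with a small numerical sketch. The toy envelope shape, the peak frequencies, and the function name `bump` are hypothetical and chosen only for illustration. The sketch also assumes that the average is taken over log envelopes, since the cepstrum is a linear transform of the log spectrum.

```python
import numpy as np

def bump(freq_hz, center_hz, width_hz=50.0, floor=0.01):
    """Toy spectral envelope: one Gaussian peak on a small constant floor."""
    return floor + np.exp(-((freq_hz - center_hz) ** 2) / (2.0 * width_hz ** 2))

freq_hz = np.linspace(0.0, 2000.0, 2001)

# Two frames of the "same" sound whose peak frequency fluctuates (hypothetical values).
env_a = bump(freq_hz, 900.0)
env_b = bump(freq_hz, 1100.0)

# Averaging in the power direction only: the mean of the log envelopes.
power_avg = np.exp(0.5 * (np.log(env_a) + np.log(env_b)))

# Averaging the peak frequency itself, as a model with peak parameters would.
param_avg = bump(freq_hz, 0.5 * (900.0 + 1100.0))

print("peak height, power-direction average:", power_avg.max())
print("peak height, peak-frequency average: ", param_avg.max())
```

In this toy case the power-direction average has a much lower and flatter peak than either frame, while averaging the peak frequency keeps the peak height.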

The fluctuations seen in the spectral envelope of speech are considered to include variations in resonance frequency and power caused by physical changes in the vocal tract shape, so a probabilistic generative model should be set up that can express fluctuations in both the frequency and the power of the spectral envelope peaks. Such probabilistic modeling becomes possible by using the CWM, which has the frequency and power of the spectral envelope peaks as parameters. The CWM is a model that approximates the spectral envelope by a GMM and uses the parameters of that GMM as speech features. In the CWM, the spectral envelope f_{ω,l} is expressed by equation (1). Hereinafter, f_{ω,l} is referred to as the "model spectral envelope".

Here, K is the number of mixture components of the GMM. μ_k, w_k, and σ_k are the mean, weight, and variance parameters of the GMM, and can be regarded as corresponding to the frequency, power, and sharpness of the peaks of the model spectral envelope, respectively.
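As a rough illustration of how such a model spectral envelope can be evaluated, the following sketch sums K weighted Gaussian functions over a frequency grid. Equation (1) is not reproduced in this text, so the unit-area normalization of each Gaussian, the function name `cwm_envelope`, and all numerical values are assumptions of this sketch, not the patent's definition.

```python
import numpy as np

def cwm_envelope(omega, mu, sigma, w):
    """Model spectral envelope as a weighted sum of K Gaussian functions.

    omega: (F,) frequency grid; mu, sigma, w: (K,) peak frequency, spread, power.
    Each Gaussian is normalized to unit area (an assumption of this sketch).
    """
    omega = np.asarray(omega, dtype=float)[:, None]                 # (F, 1)
    mu, sigma, w = (np.asarray(a, dtype=float)[None, :] for a in (mu, sigma, w))
    gauss = np.exp(-((omega - mu) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return (w * gauss).sum(axis=1)                                   # (F,)

# Hypothetical three-peak envelope on a 1 Hz grid.
omega = np.linspace(0.0, 4000.0, 4001)
f = cwm_envelope(omega, mu=[700.0, 1200.0, 2500.0],
                 sigma=[80.0, 100.0, 150.0], w=[3.0, 2.0, 1.0])
print("strongest peak near", omega[f.argmax()], "Hz")
```

With unit-area Gaussians, the area under the envelope equals the sum of the weights, which is one way to read the weights as peak powers.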

Next, the process by which the observed spectral envelope sequence is generated is described. Consider an HMM, as shown in FIG. 1, that outputs at each discrete time l the CWM parameters: the means μ_{k,l}, the reciprocals of the variances ρ_{k,l}, and the weights w_{k,l}. Each state of the HMM represents one state of a linguistic label and can, for example, be associated with one phoneme, as shown in FIG. 1. As in conventional HMM speech synthesis and similar methods, each state may instead be associated with one state of a context label that uses, in addition to the phoneme, information such as the accent positions of the preceding and following phonemes. In the present embodiment, the probability distributions of the CWM parameters output from each state are assumed to be given by equations (2) to (4) for the state s_l at each time l.

Here, N(x; m, η^2) denotes the normal distribution, and Gamma(x; a, b) denotes the gamma distribution given by equation (5).

Given the CWM parameter sequences ^μ = {μ_{k,l}}_{k,l}, ^ρ = {ρ_{k,l}}_{k,l}, and ^w = {w_{k,l}}_{k,l}, the probability distribution that generates the observed spectral envelope {y_{ω,l}} at time l is given by equation (6). Symbols in bold in the equations, and symbols preceded by "^" in the text, denote matrices or vectors.

Here, f_{ω,l} is the spectral envelope expressed by equation (1) using the CWM parameters at time l, given the CWM parameter sequences ^μ, ^ρ, and ^w, and Poisson(x; λ) is the Poisson distribution given by equation (7).

Defining the generative model in this way makes it possible to apply the parameter estimation algorithm described below.

<Parameter estimation algorithm>
Parameter learning (estimation) is formulated as the problem of maximizing the posterior probability P(Θ|Y) of the parameters Θ of the spectral envelope sequence generative model, given the observed spectral envelope sequence Y = {y_{ω,l}}_{ω,l}. The parameters Θ to be estimated are the HMM state sequence ^s = {s_l}_l, the state output distribution parameters ^θ = {m_{k,i}, η_{k,i}, a_{k,i}^(σ), b_{k,i}^(σ), a_{k,i}^(w), b_{k,i}^(w)} of each state i of the HMM, and the CWM parameter sequences ^μ, ^ρ, and ^w.

It is difficult to find the Θ that maximizes the posterior probability P(Θ|Y) directly, but it is possible to repeat a local optimization with respect to each variable. Here, P(Θ|Y) can be written as in equation (8).

Here, α is a regularization parameter that represents the weight of the log prior relative to the log likelihood. The symbol consisting of "=" with "C" above it means equality up to a constant term.

In the parameter estimation algorithm of the present embodiment, the parameters are estimated by repeatedly minimizing -log P(Θ|Y) with respect to each variable. Here, -log P(Y|Θ) corresponds to the sum over all times of the I-divergence, which is a pseudo-distance between the observed spectral envelope y_{ω,l} and the model spectral envelope f_{ω,l} at each time. The I-divergence is given by equation (10).

Therefore, maximizing P(Θ|Y) is equivalent to minimizing I(Θ) - α log P(Θ) with respect to Θ.
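The link between the Poisson observation model and the I-divergence can be checked numerically. The sketch below assumes the standard forms: I(y|f) = Σ (y log(y/f) - y + f), and a Poisson density with x! generalized to Γ(x + 1). Equations (7) and (10) are not reproduced in this text, so these forms, the function names, and the sample values are assumptions. Under them, the negative log likelihood and the I-divergence differ only by a term that does not depend on the model envelope f, so minimizing one minimizes the other.

```python
import math

def i_divergence(y, f):
    """I-divergence (generalized KL divergence) between y and f, summed over bins."""
    return sum(yi * math.log(yi / fi) - yi + fi for yi, fi in zip(y, f))

def neg_log_poisson(y, f):
    """Negative log of prod_i Poisson(y_i; f_i), with x! generalized to Gamma(x + 1)."""
    return sum(-yi * math.log(fi) + fi + math.lgamma(yi + 1.0) for yi, fi in zip(y, f))

y = [2.0, 5.5, 1.25]   # hypothetical observed envelope values (all positive)
f1 = [1.5, 6.0, 1.0]   # two hypothetical model envelopes
f2 = [3.0, 4.0, 2.0]

# The gap between the two quantities depends on y only, not on the model envelope.
gap1 = neg_log_poisson(y, f1) - i_divergence(y, f1)
gap2 = neg_log_poisson(y, f2) - i_divergence(y, f2)
print(gap1, gap2)
```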

The I-divergence term can be minimized iteratively with respect to each parameter by the auxiliary function method. Specifically, applying Jensen's inequality based on the convexity of the logarithm function yields inequality (12).

Here, g_{k,ω,l} is given by equation (13). Equality holds in (12) under the condition given by equation (14).

Let J(Θ, λ) denote the upper-bound function of I(Θ), that is, the right-hand side of (12). For any Θ, when λ is given by equation (14), the auxiliary function J(Θ, λ) - α log P(Θ) equals the objective function I(Θ) - α log P(Θ). Moreover, for any fixed λ, a Θ that decreases J(Θ, λ) - α log P(Θ) necessarily decreases I(Θ) - α log P(Θ), by (12). Consequently, by alternating the update of λ by equation (14) with an update of Θ that decreases J(Θ, λ) - α log P(Θ), the objective function decreases monotonically until a local optimum is reached.
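The monotone decrease can be seen in a deliberately simplified sketch of the auxiliary function method. It is not the patent's update rule: it treats a single frame, ignores the prior term (α = 0), keeps the peak frequencies and spreads fixed, and updates only the weights. The auxiliary variables are set to the ratio of each Gaussian's contribution to the whole model envelope, which is the usual equality condition for this Jensen bound and is assumed here to play the role of equation (14). All names and values are hypothetical.

```python
import numpy as np

def gaussians(omega, mu, sigma):
    """(K, F) array of unit-area Gaussian functions on the grid omega."""
    d = omega[None, :] - mu[:, None]
    return np.exp(-d ** 2 / (2.0 * sigma[:, None] ** 2)) / (np.sqrt(2.0 * np.pi) * sigma[:, None])

def i_divergence(y, f):
    """I-divergence between the observed envelope y and the model envelope f."""
    return float(np.sum(y * np.log(y / f) - y + f))

omega = np.linspace(0.0, 4000.0, 401)
mu = np.array([700.0, 1200.0, 2500.0])
sigma = np.array([80.0, 100.0, 150.0])
G = gaussians(omega, mu, sigma)

# Hypothetical "observed" envelope: known weights plus a small positive floor.
y = np.array([3.0, 2.0, 1.0]) @ G + 1e-6

w = np.ones(3)      # initial guess for the weights
history = []        # objective value before each update
for _ in range(30):
    f = w @ G
    history.append(i_divergence(y, f))
    # Auxiliary-variable update: makes the Jensen bound tight at the current w.
    lam = (w[:, None] * G) / f[None, :]
    # Parameter update: closed-form minimizer of the bound with respect to w.
    w = (lam * y[None, :]).sum(axis=1) / G.sum(axis=1)

print("objective, first and last:", history[0], history[-1])
print("estimated weights:", w)
```

Because each auxiliary-variable update makes the bound tight and each weight update minimizes the bound, the recorded objective never increases from one iteration to the next.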

<Configuration of the speech synthesizer>
The speech synthesizer according to the present embodiment is constituted by a computer including a CPU, a RAM, and a ROM that stores a program for executing a speech synthesis processing routine including the learning process and the synthesis process described later.

As shown in FIG. 2, the computer constituting the speech synthesizer 10 can be represented functionally as including a learning unit 20 and a synthesis unit 40. The learning unit 20 is an example of the speech synthesis model learning apparatus of the present invention.

The learning unit 20 can further be represented as including a fundamental frequency sequence extraction unit 22, an observed spectral envelope sequence extraction unit 24, a CWM parameter estimation unit 26, and an HMM learning unit 28. Time-series data of a speech signal and labels containing information on the state s_l at each time are input to the learning unit 20 from a database. The CWM parameter estimation unit 26 is an example of the estimation unit of the present invention, and the HMM learning unit 28 is an example of the learning unit of the present invention.

The fundamental frequency sequence extraction unit 22 extracts time-series data of the fundamental frequency from the time-series data of the input speech signal and converts it into a representation in discrete time l, thereby obtaining the fundamental frequency sequence, which is the time-series data of the fundamental frequency of the speech signal. This extraction can be realized by well-known techniques. For example, the method described in Non-Patent Document 4 (H. Kameoka, "Statistical speech spectrum model incorporating all-pole vocal tract model and F0 contour generating process model," in Tech. Rep. IEICE, 2010, in Japanese.) can be used to extract the fundamental frequency every 8 ms, for example. The fundamental frequency sequence extraction unit 22 outputs the extracted fundamental frequency sequence to the HMM learning unit 28.

The observed spectrum envelope sequence extraction unit 24 applies a Fourier transform to the input time-series data of the speech signal at each time (short-time frame) to extract an observed spectrum envelope sequence Y, and outputs the extracted sequence Y to the CWM parameter estimation unit 26.
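The frame-wise Fourier analysis described above can be illustrated with a minimal sketch. The function name `frame_spectra`, the Hann window, and the frame and hop sizes are assumptions made for illustration; the text states only that a Fourier transform is applied to each short-time frame, and it does not specify how the envelope is derived from the resulting spectrum.

```python
import cmath
import math

def frame_spectra(signal, frame_len, hop):
    """Return the magnitude spectrum of each short-time frame (naive DFT)."""
    # Hann window; the window choice is an assumption, not from the source.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        # Keep only the non-negative frequency bins of a real signal.
        for k in range(frame_len // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        spectra.append(bins)
    return spectra
```

A practical implementation would use an FFT routine instead of the naive DFT; the loop form is kept here only so that the per-frame structure is visible.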

The CWM parameter estimation unit 26 receives the observed spectrum envelope sequence Y output from the observed spectrum envelope sequence extraction unit 24 and the labels input from the database, and estimates the parameter Θ that maximizes the observed spectrum envelope sequence posterior probability P(Θ|Y). The CWM parameter estimation unit 26 then outputs the CWM parameters ^μ, ^ρ, and ^w contained in the estimated parameter Θ to the HMM learning unit 28. As shown in FIG. 3, the CWM parameter estimation unit 26 can further be represented by a configuration including an initial update unit 260, an auxiliary variable update unit 262, a CWM parameter update unit 264, a first convergence determination unit 266, a state output distribution update unit 268, a state sequence update unit 270, an observed spectrum envelope sequence posterior probability update unit 272, and a second convergence determination unit 274.

The initial update unit 260 performs an initial update of the observed spectrum envelope sequence posterior probability P(Θ|Y) using initial values of the parameter Θ. For the state output distribution ^θ and the CWM parameters ^μ, ^ρ, and ^w, suitably chosen preset values are used as initial values. For the HMM state sequence ^s, the information contained in the input labels is used as the initial value.

The auxiliary variable update unit 262 updates the auxiliary variable λ according to equation (14), using either the CWM parameters ^μ, ^ρ, and ^w from the previous update or those set as initial values.

The CWM parameter update unit 264 fixes the state sequence ^s and the state output distribution ^θ at their previously updated values or at the values set as initial values. Using the auxiliary variable λ updated by the auxiliary variable update unit 262, it then updates the CWM parameters ^μ, ^ρ, and ^w according to the update equations (15) to (17) below so as to decrease the auxiliary function J(Θ, λ) − α log P(Θ).

Here, C_{k,l}, D_{k,l}, and E_{k,l} are given by equations (18) to (20) below.

The first convergence determination unit 266 determines whether a predetermined convergence condition is satisfied. If the condition is not satisfied, the processes of the auxiliary variable update unit 262 and the CWM parameter update unit 264 are repeated. If the condition is satisfied, the first convergence determination unit 266 outputs the CWM parameters ^μ, ^ρ, and ^w at that point to the state output distribution update unit 268.

The convergence condition may be that the iteration count n_1 has reached a predetermined number N_1 (for example, 20). Alternatively, the condition may be that the difference between the value of the auxiliary function computed with the parameters of the (n_1 − 1)-th iteration and its value computed with the parameters of the n_1-th iteration has fallen below a predetermined threshold.
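The two convergence conditions described here can be combined in a small helper, sketched below. The name `has_converged` and its arguments are hypothetical; the text gives only the two criteria (an iteration cap such as 20, or a small change in the auxiliary function value) and does not say whether they are used together.

```python
def has_converged(iteration, max_iterations,
                  prev_value=None, curr_value=None, tol=None):
    """Return True when either stopping rule is met.

    iteration / max_iterations: the count n and its cap N (e.g. 20).
    prev_value / curr_value: auxiliary function values at iterations n-1 and n.
    tol: threshold on their difference; None disables this rule.
    """
    if iteration >= max_iterations:
        return True
    if tol is not None and prev_value is not None and curr_value is not None:
        return abs(prev_value - curr_value) < tol
    return False
```

Leaving `tol` unset reproduces the fixed-iteration rule on its own.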

The state output distribution update unit 268 fixes the CWM parameters ^μ, ^ρ, and ^w at the values output from the first convergence determination unit 266, and fixes the state output distribution ^θ at its previously updated value or at the value set as the initial value. Using the auxiliary variable λ updated by the auxiliary variable update unit 262, it then updates {m_{k,i}, η_{k,i}^2}_{k,i} contained in the state output distribution ^θ according to the update equations (21) and (22) below so as to decrease the auxiliary function J(Θ, λ) − α log P(Θ).

Here, T_i = {l | s_l = i}. The updates for {a_{k,i}^{(ρ)}, b_{k,i}^{(ρ)}, a_{k,i}^{(w)}, b_{k,i}^{(w)}}_{k,l} contained in the state output distribution ^θ are obtained as the roots of equations (23) to (26) below.

Here, ψ(a) denotes the digamma function given by equation (27) below.
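For reference, the digamma function is conventionally defined as the logarithmic derivative of the gamma function. The expression below is that standard definition and is given only as an aid to the reader; it is not a reproduction of equation (27) from the patent.

```latex
\psi(a) = \frac{d}{da}\,\ln \Gamma(a) = \frac{\Gamma'(a)}{\Gamma(a)}
```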

The state sequence update unit 270 fixes the CWM parameters ^μ, ^ρ, and ^w at the values output from the first convergence determination unit 266, and fixes the state output distribution ^θ at its previously updated value or at the value set as the initial value. It then updates the state sequence ^s with the Viterbi algorithm so as to decrease the auxiliary function J(Θ, λ) − α log P(Θ).

The observed spectrum envelope sequence posterior probability update unit 272 updates the observed spectrum envelope sequence posterior probability P(Θ|Y) using the CWM parameters ^μ, ^ρ, and ^w updated by the CWM parameter update unit 264, the state output distribution ^θ updated by the state output distribution update unit 268, and the state sequence ^s updated by the state sequence update unit 270.

The second convergence determination unit 274 determines whether a predetermined convergence condition is satisfied. If the condition is not satisfied, the processes of the auxiliary variable update unit 262, the CWM parameter update unit 264, the first convergence determination unit 266, the state output distribution update unit 268, the state sequence update unit 270, and the observed spectrum envelope sequence posterior probability update unit 272 are repeated. If the condition is satisfied, the second convergence determination unit 274 outputs the CWM parameters ^μ, ^ρ, and ^w at that point to the HMM learning unit 28.

The convergence condition may be that the iteration count n_2 has reached a predetermined number N_2 (for example, 20). Alternatively, the condition may be that the difference between the value of the auxiliary function computed with the parameters of the (n_2 − 1)-th iteration and its value computed with the parameters of the n_2-th iteration has fallen below a predetermined threshold.

The HMM learning unit 28 learns the HMM 30 using the CWM parameters ^μ, ^ρ, and ^w output from the CWM parameter estimation unit 26 and the labels input from the database, employing a conventional technique such as that of Non-Patent Document 1. Note that when a model spectrum envelope sequence is obtained from text data with the learned HMM, a sequence obtained simply under the maximum-likelihood criterion becomes discontinuous near phoneme boundaries, which degrades the quality of the synthesized speech. Therefore, as in the method of Non-Patent Document 1, for example, each phoneme state is divided finely and dynamic features (first- and second-order time differences of the features) are additionally used to learn the HMM 30. This makes it possible to learn an HMM 30 that can output a continuous model spectrum envelope sequence. The HMM learning unit 28 stores the learned HMM 30 in a predetermined storage area.
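The dynamic features mentioned above (first- and second-order time differences) can be sketched as follows. The difference windows used here, 0.5·(x[t+1] − x[t−1]) and x[t−1] − 2·x[t] + x[t+1], are a common choice in HMM-based synthesis but are an assumption; the text does not state which windows are used, and the function name `add_dynamic_features` is hypothetical.

```python
def add_dynamic_features(frames):
    """Append first- and second-order time differences to each static frame.

    frames: list of equal-length lists of static feature values, one per time.
    Edge frames reuse their nearest neighbour (an assumed boundary rule).
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        cur = frames[t]
        delta = [0.5 * (b - a) for a, b in zip(prev, nxt)]
        delta2 = [a - 2.0 * c + b for a, c, b in zip(prev, cur, nxt)]
        out.append(list(cur) + delta + delta2)
    return out
```

Each output frame is three times as long as the input frame: static values, then first differences, then second differences.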

As shown in FIG. 2, the synthesis unit 40 can be represented by a configuration including a text analysis unit 42, a parameter synthesis unit 44, and a speech waveform synthesis unit 46. Text data is input to the synthesis unit 40.

The text analysis unit 42 analyzes the input text data, determines, for example, the states represented by labels corresponding to the individual phonemes, and outputs the resulting label sequence to the parameter synthesis unit 44.

For the label sequence output from the text analysis unit 42, the parameter synthesis unit 44 obtains a CWM parameter sequence under the maximum-likelihood criterion using the HMM 30 learned by the learning unit 20. A model spectrum envelope sequence can be obtained from this CWM parameter sequence. The parameter synthesis unit 44 also obtains a fundamental frequency sequence based on the label sequence output from the text analysis unit 42. Note that outputting the CWM parameter sequence additionally requires a model of the durations of the phoneme states, and obtaining the fundamental frequency sequence from the label sequence additionally requires a model of the fundamental frequency. For these, the models described in Non-Patent Document 1 can be used, for example. The parameter synthesis unit 44 outputs the obtained fundamental frequency sequence and CWM parameter sequence to the speech waveform synthesis unit 46.

The speech waveform synthesis unit 46 synthesizes a speech waveform from the CWM parameter sequence and the fundamental frequency sequence output from the parameter synthesis unit 44, using a technique such as those of Non-Patent Documents 2 and 3. Specifically, as shown in equation (28) below, a GMM in the frequency domain corresponds to Gabor functions in the time domain. A Gabor wavelet, which is a superposition of Gabor functions, is therefore generated from the CWM parameters, and these wavelets are placed on the time axis at intervals corresponding to the fundamental frequency to synthesize the speech waveform.

This is a synthesis method based on an FIR filter, and it enables speech synthesis with good temporal characteristics regardless of the fundamental frequency. The speech waveform synthesis unit 46 outputs the synthesized speech waveform.
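The idea of placing Gabor wavelets at pitch intervals can be sketched as below. This is an illustration under stated assumptions, not the patent's equation (28): each Gaussian is taken to have centre `mu` and standard deviation `sigma` in Hz and weight `w`, so its time-domain counterpart is a cosine at `mu` under a Gaussian envelope of width 1/(2π·sigma). The exact relation between `sigma` and the CWM parameter ρ, and all function names, are assumptions.

```python
import math

def gabor_wavelet(mu, sigma, w, sample_rate, half_len):
    """Sum of Gabor functions: Gaussian envelopes modulating cosines.

    mu, sigma, w: per-Gaussian centre (Hz), spread (Hz), and weight.
    Returns 2 * half_len + 1 samples centred on t = 0.
    """
    wavelet = []
    for n in range(-half_len, half_len + 1):
        t = n / sample_rate
        value = 0.0
        for mu_k, sigma_k, w_k in zip(mu, sigma, w):
            envelope = math.exp(-0.5 * (2 * math.pi * sigma_k * t) ** 2)
            value += w_k * envelope * math.cos(2 * math.pi * mu_k * t)
        wavelet.append(value)
    return wavelet

def overlap_add(wavelets, f0_per_pulse, sample_rate, n_samples):
    """Place one wavelet per pitch pulse, spaced by sample_rate / f0."""
    out = [0.0] * n_samples
    position = 0.0
    for wavelet, f0 in zip(wavelets, f0_per_pulse):
        center = int(round(position))
        half = len(wavelet) // 2
        for i, v in enumerate(wavelet):
            idx = center - half + i
            if 0 <= idx < n_samples:
                out[idx] += v
        position += sample_rate / f0  # advance by one pitch period
    return out
```

Because each wavelet has finite length and is simply summed into the output, the procedure behaves like an FIR filter excited by a pulse train, which matches the description in the text.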

<Operation of the speech synthesizer>
Next, the operation of the speech synthesizer 10 according to the present embodiment will be described. First, time-series data of a speech signal and labels containing information on the state s_l at each time are input from the database to the learning unit 20, and the learning unit 20 executes the learning process shown in FIG. 4, whereby the HMM 30 is learned. Then, text data is input to the synthesis unit 40, and the synthesis unit 40 executes the synthesis process shown in FIG. 6, whereby a speech waveform is output. Each process is described in detail below.

In step S10 of the learning process shown in FIG. 4, the fundamental frequency sequence extraction unit 22 extracts time-series data of the fundamental frequency from the input time-series data of the speech signal, converts it so that it is expressed in discrete time l to obtain a fundamental frequency sequence, that is, the time-series data of the fundamental frequency of the speech signal, and outputs the sequence to the HMM learning unit 28.

Next, in step S12, the observed spectrum envelope sequence extraction unit 24 applies a Fourier transform to the input time-series data of the speech signal at each time (short-time frame) to extract the observed spectrum envelope sequence Y, and outputs it to the CWM parameter estimation unit 26.

Next, in step S14, the CWM parameter estimation unit 26 executes the CWM parameter estimation process shown in FIG. 5.

In step S140 of the CWM parameter estimation process shown in FIG. 5, the initial update unit 260 performs an initial update of the observed spectrum envelope sequence posterior probability P(Θ|Y), using suitably chosen preset values as the initial values of the state output distribution ^θ and the CWM parameters ^μ, ^ρ, and ^w, and using the information contained in the input labels as the initial value of the HMM state sequence ^s.

Next, in step S142, the auxiliary variable update unit 262 updates the auxiliary variable λ according to equation (14), using either the CWM parameters ^μ, ^ρ, and ^w from the previous update or those set as initial values.

Next, in step S144, the CWM parameter update unit 264 fixes the state sequence ^s and the state output distribution ^θ at their previously updated values or at the values set as initial values. Using the auxiliary variable λ updated in step S142, it updates the CWM parameters ^μ, ^ρ, and ^w according to the update equations (15) to (17) so as to decrease the auxiliary function J(Θ, λ) − α log P(Θ).

Next, in step S146, the first convergence determination unit 266 determines whether the predetermined convergence condition is satisfied. If the condition is not satisfied, the process returns to step S142, and steps S142 and S144 are repeated. If the condition is satisfied, the CWM parameters ^μ, ^ρ, and ^w at that point are output to the state output distribution update unit 268, and the process proceeds to step S148.

In step S148, the state output distribution update unit 268 fixes the CWM parameters ^μ, ^ρ, and ^w at the values output from the first convergence determination unit 266, and fixes the state output distribution ^θ at its previously updated value or at the value set as the initial value. Using the auxiliary variable λ updated in step S142, it updates the state output distribution ^θ according to equations (21) to (26) so as to decrease the auxiliary function J(Θ, λ) − α log P(Θ).

Next, in step S150, the state sequence update unit 270 fixes the CWM parameters ^μ, ^ρ, and ^w at the values output from the first convergence determination unit 266, and fixes the state output distribution ^θ at its previously updated value or at the value set as the initial value. It then updates the state sequence ^s with the Viterbi algorithm so as to decrease the auxiliary function J(Θ, λ) − α log P(Θ).

Note that steps S148 and S150 may be executed in either order.

Next, in step S152, the observed spectrum envelope sequence posterior probability update unit 272 updates the observed spectrum envelope sequence posterior probability P(Θ|Y) using the CWM parameters ^μ, ^ρ, and ^w updated in step S144, the state output distribution ^θ updated in step S148, and the state sequence ^s updated in step S150.

Next, in step S154, the second convergence determination unit 274 determines whether the predetermined convergence condition is satisfied. If the condition is not satisfied, the process returns to step S142, and steps S142 to S152 are repeated. If the condition is satisfied, the CWM parameters ^μ, ^ρ, and ^w at that point are output to the HMM learning unit 28, and the process returns to the learning process.
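The control flow of steps S142 to S154 amounts to two nested loops, which the sketch below makes explicit. Only the loop structure is taken from the text. The update steps themselves are passed in as callables under hypothetical names, since their equations (14) to (26) are not reproduced here, and fixed iteration counts stand in for the two convergence checks.

```python
def estimate_cwm_parameters(state, steps, n1_max=20, n2_max=20):
    """Run the nested alternating updates.

    state : any object holding the current parameters (opaque here).
    steps : dict of callables, each taking and returning `state`.
    n1_max, n2_max : iteration caps N_1 and N_2 (e.g. 20 each).
    """
    for _outer in range(n2_max):                  # S154 loop back to S142
        for _inner in range(n1_max):              # S146 loop back to S142
            state = steps["update_auxiliary"](state)         # S142
            state = steps["update_cwm"](state)               # S144
        state = steps["update_output_distribution"](state)   # S148
        state = steps["update_state_sequence"](state)        # S150
        state = steps["update_posterior"](state)             # S152
    return state
```

A threshold-based stopping rule could replace either `range` loop with a `while` loop that compares successive auxiliary function values.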

Next, in step S16 of the learning process shown in FIG. 4, the HMM learning unit 28 learns the HMM 30 using the CWM parameters ^μ, ^ρ, and ^w output in step S14 and the labels input from the database, employing a conventional technique such as that of Non-Patent Document 1. It then stores the learned HMM 30 in a predetermined storage area, and the learning process ends.

Next, in step S20 of the synthesis process shown in FIG. 6, the text analysis unit 42 analyzes the input text data, determines, for example, the states represented by labels corresponding to the individual phonemes, and outputs the resulting label sequence to the parameter synthesis unit 44.

Next, in step S22, the parameter synthesis unit 44 obtains a CWM parameter sequence under the maximum-likelihood criterion for the label sequence output in step S20, using the HMM 30 learned in the learning process shown in FIG. 4, and outputs it to the speech waveform synthesis unit 46. The parameter synthesis unit 44 also obtains a fundamental frequency sequence based on the label sequence output in step S20 and outputs it to the speech waveform synthesis unit 46.

Next, in step S24, the speech waveform synthesis unit 46 synthesizes and outputs a speech waveform from the CWM parameter sequence and the fundamental frequency sequence output in step S22, using a technique such as those of Non-Patent Documents 2 and 3, and the synthesis process ends.

<Experiment>
For the speech synthesis method using the speech synthesizer 10 according to the present embodiment, the results of verifying that CWM parameter estimation and speech synthesis can be performed appropriately are described below.

FIG. 7 shows (A) the spectrogram of a sample utterance (natural speech) of sentence J04 of ATR503, "切符を買うのは自動販売機からである。" ("Tickets are bought from a vending machine."), and (B) the spectrogram of speech synthesized by the method of the present embodiment (hereinafter, "the present method"). FIG. 8 shows the spectral envelope at the center of the phoneme /i/ in the opening word "切符" (ticket) for the present method (solid line), the conventional method (dashed line), and natural speech (dash-dotted line). The conventional method here is the one based on 24th-order mel-cepstra (see Non-Patent Document 1).

As shown in FIG. 7, the spectrogram of the speech synthesized by the present method is similar to that of natural speech, which indicates that the present method can synthesize speech from text data. The spectral envelope reproduced by the present method tended to reproduce the dips of the spectral envelope well, mainly at frequencies from 4 kHz to 7 kHz. This can be attributed to the CWM parameters capturing fluctuations in both the frequency and the power of the spectral envelope peaks, which makes the spectral envelope less prone to smoothing than in the conventional method.

On the other hand, at low frequencies of 1 kHz and below, multiple spectral envelope peaks are reproduced as a single gentle curve, so the resonance frequencies become unclear, which is considered to cause quality degradation. This is thought to be because multiple spectral envelope peaks are approximated by the sum of a small number of Gaussian functions when the CWM parameters are extracted. Speech with clearer resonance frequencies could presumably be synthesized by associating Gaussian functions more precisely with the individual peaks of the spectral envelope, for example by increasing the number of GMM mixture components.

As described above, the speech synthesizer according to the embodiment of the present invention uses, as speech features, the CWM parameters obtained by alternately updating the CWM parameters and the HMM parameters so as to maximize the same criterion. This makes it possible to learn an HMM from CWM parameters for which the indices of the Gaussian functions are guaranteed to be consistent within the same state.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the embodiment above describes the case where the learning unit and the synthesis unit are implemented on the same computer, but they may be implemented on separate computers.

Although the speech synthesizer described above contains a computer system, the term "computer system" also includes a website-providing environment (or display environment) when a WWW system is used.

In addition, although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium.

Description of Symbols
10 Speech synthesizer
20 Learning unit
22 Fundamental frequency sequence extraction unit
24 Observed spectrum envelope sequence extraction unit
26 CWM parameter estimation unit
28 HMM learning unit
30 HMM
40 Synthesis unit
42 Text analysis unit
44 Parameter synthesis unit
46 Speech waveform synthesis unit
260 Initial update unit
262 Auxiliary variable update unit
264 CWM parameter update unit
266 First convergence determination unit
268 State output distribution update unit
270 State sequence update unit
272 Observed spectrum envelope sequence posterior probability update unit
274 Second convergence determination unit

Claims

A speech synthesis model learning device comprising:
an estimation unit that estimates the parameters of a composite wavelet model (CWM), which expresses the spectral envelope of a speech signal at each time by a Gaussian mixture model, by alternately updating the sequence of CWM parameters and the parameters of a hidden Markov model (HMM), which outputs the CWM parameters corresponding to the state at each time represented by information obtained from text data, so as to maximize the same criterion; and
a learning unit that learns the HMM using the CWM parameters estimated by the estimation unit and labels indicating the state of the speech signal at each time,
wherein the same criterion is the product of the probability that the spectral envelope is output given the CWM parameters, the probability of the state sequence of the HMM, and the probability that the CWM parameters are output given the state sequence.

The speech synthesis model learning device according to claim 1, wherein the estimation unit alternately updates the HMM parameters, the CWM parameters, and an auxiliary variable, using as the same criterion a function that is expressed by the HMM parameters, the CWM parameters, and the auxiliary variable, that does not exceed the logarithm of the probability that the spectral envelope is output given the CWM parameters, and that touches that logarithm.

The speech synthesis model learning device according to claim 2, wherein the estimation unit uses, as the same criterion, a lower-bound function obtained from Jensen's inequality by exploiting the convexity of the negative logarithm function.

A speech synthesis model learning method comprising:
a step in which an estimation unit estimates the parameters of a composite wavelet model (CWM), which expresses the spectral envelope of a speech signal at each time by a Gaussian mixture model, by alternately updating the sequence of CWM parameters and the parameters of a hidden Markov model (HMM), which outputs the CWM parameters corresponding to the state at each time represented by information obtained from text data, so as to maximize the same criterion; and
a step in which a learning unit learns the HMM using the CWM parameters estimated by the estimation unit and labels indicating the state of the speech signal at each time,
wherein the same criterion is the product of the probability that the spectral envelope is output given the CWM parameters, the probability of the state sequence of the HMM, and the probability that the CWM parameters are output given the state sequence.

A speech synthesis model learning program for causing a computer to function as each unit constituting the speech synthesis model learning device according to any one of claims 1 to 3.