
JP5717097B2 - Hidden Markov model learning device and speech synthesizer for speech synthesis

Info

Publication number: JP5717097B2
Application number: JP2011194907A
Authority: JP
Inventors: 晋富倪; 恒河井
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2011-09-07
Filing date: 2011-09-07
Publication date: 2015-05-13
Anticipated expiration: 2031-09-07
Also published as: JP2013057735A

Description

The present invention relates to speech synthesis, and more particularly to a technique for generating the parameters of a speech synthesis filter using an HMM (hidden Markov model).

Speech recognition and speech synthesis are essential technologies for man-machine interfaces. Combining the two lets people operate the latest devices, which require complex operating instructions, through speech, an action that is natural for humans.

Of these technologies, speech synthesis must do more than simply utter the target text; it must produce natural-sounding utterances. Various schemes have been proposed for this purpose.

One such scheme uses HMMs. In HMM-based speech synthesis, HMMs for estimating the parameters for rule-based synthesis are trained in advance on a large amount of speech. At synthesis time, the input text is analyzed to obtain a phoneme label sequence, and the filter parameters for synthesizing each phoneme in that sequence are generated from these HMMs.

Such a technique is disclosed, for example, in Patent Document 1. FIG. 1 shows the basic configuration of the speech synthesizer disclosed in Patent Document 1.

Referring to FIG. 1, a conventional speech synthesis system 40 broadly comprises: a learning device 50 for training HMMs for speech synthesis; an HMM storage unit 52 for storing the HMMs trained by the learning device 50; and a speech synthesizer 56 that, given an input text 54, uses the HMMs stored in the HMM storage unit 52 to generate, for each phoneme of the input text 54, synthesis filter parameters for rule-based synthesis and F0 parameters for sound generation, and then synthesizes speech.

The learning device 50 includes a speech database 60 that stores a large amount of speech data labeled phoneme by phoneme. The speech is divided into frames with a predetermined frame length and a predetermined shift length. The learning device 50 further includes: an F0 extraction processing unit 62 for extracting the fundamental frequency (F0) of each frame of speech stored in the speech database 60; an MFCC calculation unit 64 for calculating MFCCs (Mel-Frequency Cepstrum Coefficients) as acoustic parameters for each frame of speech stored in the speech database 60; an HMM learning data storage unit 66 that stores, for each frame of speech data in the speech database 60, the phoneme label, the F0 extracted by the F0 extraction processing unit 62, and the MFCCs calculated by the MFCC calculation unit 64 as one set of HMM learning data; and an HMM learning unit 68 that uses the HTS toolkit (Reference 1) to train HMMs on the HMM learning data stored in the HMM learning data storage unit 66. The HMMs trained by the HMM learning unit 68 are stored in the HMM storage unit 52. Typically, the HMMs stored in the HMM storage unit 52 are context-dependent triphone HMMs.
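
The framing step mentioned above can be illustrated with a minimal Python sketch. The function name `frame_signal` and the handling of the final partial frame are assumptions made for illustration; the source specifies only that a predetermined frame length and shift length are used.

```python
def frame_signal(samples, frame_length, frame_shift):
    """Split a sequence of samples into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption; the source does not say how the tail is handled).
    """
    if frame_length <= 0 or frame_shift <= 0:
        raise ValueError("frame_length and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        # Each frame starts frame_shift samples after the previous one.
        frames.append(list(samples[start:start + frame_length]))
        start += frame_shift
    return frames
```

With a frame length of 4 and a shift of 2, consecutive frames overlap by two samples, so each sample in the interior of the signal contributes to two frames.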

The speech synthesizer 56, in turn, includes: a text analysis unit 80 that performs text analysis on the input text 54 and outputs a phoneme label sequence 82 annotated with information such as the prosody the synthesized speech should have; a parameter generation unit 84 that receives the phoneme label sequence 82, selects from the HMM storage unit 52 the best-matching HMM for each phoneme in the sequence based on that phoneme's context and prosodic information, and connects the selected HMMs to generate an F0 parameter sequence and an MFCC parameter sequence for speech synthesis; a sound source generation unit 86 that generates a sound source signal according to the F0 parameter sequence generated by the parameter generation unit 84; and a synthesis filter 88 that generates a synthesized speech signal by filtering (modulating) the sound source signal generated by the sound source generation unit 86 according to the MFCC parameter sequence generated by the parameter generation unit 84.

HMM-based speech synthesis of this kind is known to be fast, easy to adapt to speakers, and flexible enough to handle a variety of speaking styles. However, because of the generalization involved, the synthesized speech often sounds unnatural. To address this problem, schemes have been proposed that use dynamic speech features and global variance. The dynamic features used are, for example, the MFCC differences (delta) and the differences of those differences (delta-delta).
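
As a rough illustration of the delta and delta-delta features, the following Python sketch appends both to each static feature vector. The central-difference formula and the edge handling are assumptions; the source does not specify the window used, and toolkits commonly use longer regression windows.

```python
def delta(frames):
    """Central-difference delta of a per-frame feature sequence.

    frames: list of equal-length feature vectors, one per frame.
    Edge frames reuse their nearest neighbour (an assumed padding rule).
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_dynamic_features(frames):
    """Append delta and delta-delta to each static feature vector."""
    d1 = delta(frames)   # first-order difference
    d2 = delta(d1)       # difference of the differences
    return [list(s) + a + b for s, a, b in zip(frames, d1, d2)]
```

Even this simplest window makes the feature vector of frame t depend on frames t-2 through t+2, which is the multi-frame response the description refers to.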

JP 2011-028131 A

The problems of HMM-based speech synthesis can be divided into the following three aspects.

(1) The speech parameters are statistically processed and smoothed when the HMMs are generated, which degrades sound quality.

(2) Because speech from various speakers is used, variation in the speech acts as noise and degrades sound quality.

(3) Speech in various speaking styles from various speakers, recorded in non-standardized environments, is used for HMM training, which distorts the synthesized speech.

Regarding the first aspect, it is known that the MFCC parameters need to include phase as well as amplitude. Normally, however, such phase information is not available. From the standpoint of generating utterance features, MFCC parameters without phase information should, strictly speaking, be regarded as nonlinear parameters. Statistically processing and averaging MFCC parameters of various phases during HMM training therefore distorts the synthesized speech, and this distortion produces buzz noise.

Regarding the second aspect, the variability of utterances can be regarded as one source of noise.

The third aspect is a serious problem when non-expert users communicate using speech synthesis.

As for buzz noise, it has been found that using dynamic acoustic features (MFCC delta and delta-delta) as described above improves speech quality considerably. With this approach, calculating the features of a frame requires the features of several frames before and after it. In other words, the response of the MFCC parameters spans multiple frames rather than a single frame.

When windows are used in signal processing with this approach, the spectra must retain the property that they do not interfere with one another. Otherwise the synthesized speech is distorted.

It is therefore an object of the present invention to provide an HMM-based speech synthesizer capable of suppressing distortion in the synthesized speech waveform, and an HMM learning device for that synthesizer.

A hidden Markov model learning device for speech synthesis according to a first aspect of the present invention includes: speech database storage means for storing a speech database containing a plurality of speech units, each with a phoneme label; fundamental frequency extraction means for extracting the fundamental frequency from each of the plurality of speech units and outputting fundamental frequency information; and acoustic feature calculation means for calculating predetermined acoustic features for each of the plurality of speech units. The hidden Markov model learning device further includes: conversion means for converting the predetermined acoustic features of each of the plurality of speech units into angular quantities by performing frequency-domain sampling that is dual to the time-domain sampling used to calculate the predetermined acoustic features; learning means for training hidden Markov models for separate phoneme contexts, and for training a decision tree for selecting one of the hidden Markov models from a phoneme label sequence, using learning data in which, for the plurality of speech units in the speech database, the fundamental frequency information output by the fundamental frequency extraction means and the angular quantities output by the conversion means are labeled with the label of the speech unit; and storage means for storing the hidden Markov models and the decision tree trained by the learning means.

Preferably, the predetermined acoustic features include MFCCs. The acoustic feature calculation means may include means for calculating MFCCs up to a predetermined dimension for each of the plurality of speech units.

A speech synthesizer according to a second aspect of the present invention synthesizes speech for input text using hidden Markov models trained by any of the hidden Markov model learning devices for speech synthesis described above. The speech synthesizer includes: text analysis means for outputting a phoneme label sequence by performing text analysis on the text; parameter generation means for selecting, for each phoneme label in the phoneme label sequence output by the text analysis means, a hidden Markov model using the decision tree, and generating fundamental frequency information and angular quantities based on that hidden Markov model; and sound source generation means for generating a sound source signal based on the fundamental frequency information generated by the parameter generation means. The speech synthesizer may further include: inverse conversion means for calculating the predetermined acoustic features by applying, to the angular quantities generated by the parameter generation means, a conversion corresponding to the inverse of the conversion performed by the conversion means; and a synthesis filter for modulating the sound source signal generated by the sound source generation means with filter characteristics based on the acoustic features obtained by the inverse conversion means.

FIG. 1 is a block diagram showing the schematic configuration of a conventional speech synthesis system 40.
FIG. 2 is a block diagram showing the schematic configuration of a speech synthesis system 100 according to one embodiment of the present invention.
FIG. 3 is a schematic diagram showing the structure of a decision tree for selecting an HMM in the system shown in FIG. 2.
FIG. 4 is a graph of experimental results showing the effect of speech synthesis by the system shown in FIG. 2.

In the following description and drawings, identical parts are given identical reference numerals. Detailed descriptions of them are therefore not repeated.

[Configuration]
To reduce distortion of the synthesized speech, this embodiment uses in-band shaping, which shapes the MFCC parameters without widening the bandwidth of the speech signal. For this purpose, this embodiment uses dual sampling. In this specification, dual sampling means sampling in both the time domain and the frequency domain. Dual quantization of the speech parameters is performed on the basis of this dual sampling. In addition, in-band waveform shaping (which does not increase the band) is applied to the MFCC parameters by anti-aliasing filtering and smoothing.

Referring to FIG. 2, a speech synthesis system 100 according to one embodiment of the present invention includes a learning device 110 corresponding to the learning device 50 shown in FIG. 1, an HMM storage unit 112 for storing the HMMs trained by the learning device 110, and a speech synthesizer 116 corresponding to the speech synthesizer 56 shown in FIG. 1.

The learning device 110 differs from the learning device 50 (see FIG. 1) in three respects. First, after the MFCC calculation unit 64 of FIG. 1, it further includes an MFCC conversion unit 120 that converts the MFCC parameters Λ calculated for each frame by the MFCC calculation unit 64 into frequency-domain parameters Θ; this conversion is one of the features of this embodiment. Second, in place of the HMM learning data storage unit 66 of FIG. 1, it includes an HMM learning data storage unit 122 that stores, as one set of HMM learning data, the F0 extracted for each frame by the F0 extraction processing unit 62 and the parameters Θ calculated for each frame by the MFCC conversion unit 120, together with the label of that frame. Third, in place of the HMM learning unit 68 of FIG. 1, it includes an HMM learning unit 124 which, like the HMM learning unit 68, consists of the HTS toolkit (Reference 1), and which trains HMMs for speech synthesis using the HMM learning data stored in the HMM learning data storage unit 122. The trained HMMs are stored in the HMM storage unit 112 rather than in the HMM storage unit 52 of FIG. 1, but the two storage units differ only in the HMM parameters stored in them; their hardware is the same.

The speech synthesizer 116 shown in FIG. 2 differs from the speech synthesizer 56 shown in FIG. 1 in two respects. First, in place of the parameter generation unit 84, it includes a parameter generation unit 134 that receives the phoneme label sequence 82, selects from the HMM storage unit 112 the HMM that best matches each phoneme label and its prosodic information, and outputs an F0 sequence and a sequence of parameters Θ. Second, it includes an MFCC inverse conversion unit 136 that receives the sequence of parameters Θ output from the parameter generation unit 134, performs the inverse of the processing performed by the MFCC conversion unit 120 of FIG. 2 to output an MFCC sequence, and sets that sequence in the synthesis filter 88.

The following describes how the MFCC conversion unit 120 calculates the parameter Θ, how the MFCC inverse conversion unit 136 calculates the MFCC parameter Λ from the parameter Θ, and the reasoning behind these methods. The processing in the MFCC conversion unit 120 corresponds to dual sampling and dual quantization.

In principle, dual sampling can give an exact reconstruction of a function that varies over time. In dual quantization, speech parameters are encoded in both time and frequency based on the result of dual sampling. Dual quantization gives some leeway in the frequency band limit. In-band shaping reduces the distortion of synthesized speech caused by noise and by the variability of utterances, which improves the quality of HMM-synthesized speech.

Dual sampling means sampling a band-limited signal in both the time domain and the frequency domain. The pair of samples at each sampling point are coherent with each other.

Dual sampling can be expressed as follows.

Here A denotes a symmetric resonance curve, λ the square of the frequency ratio, and ζ the damping coefficient of forced vibration, with ζ² < 0.5. n is an integer, n = 0, …, N, where N = 10⁶ in this embodiment, and ε_n is a small value of roughly 10⁻¹⁰ that varies with n.

ζ_n is further converted into a rotation angle α_n (in radians) around the unit circle by the following equation.

Therefore, the n-th sampling point λ_n (0 < λ_n < 1) is dual, in reverse order, to the angle α_n (0 < α_n < w_c, where w_c is fixed at 0.33325 radians in this embodiment). Furthermore, by computing θ_n, which is α_n folded about the zero point α_z according to the following equation, θ_n becomes a variable with the same ordering as λ_n.

From this folding relationship, dual sampling in the frequency domain can be said to be shift-invariant and linear. The discrete frequency system is therefore linear and shift-invariant, as is the discrete time system.

Dual quantization of the MFCCs can be expressed as follows. Let Λ_k denote the MFCC coefficient of the k-th dimension, and assume that Λ_k lies in the range from a minimum value Λ_kmin to a maximum value Λ_kmax (k = 0, …, K, where K is the index of the highest dimension).

Here, Λ_k is resampled and quantized in the time domain by the following equation.

Here Q[x] denotes rounding x to the nearest λ_n, n ∈ {0, …, N}.
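
The rounding operator Q[x] can be sketched in Python as a nearest-point search over a sorted grid. The function name `quantize_to_grid` and the tie-breaking rule are assumptions. The actual sampling points λ_n are defined by the dual-sampling equation, which is not reproduced in this text, so any grid passed in here is only a placeholder.

```python
import bisect


def quantize_to_grid(x, grid):
    """Return the element of the sorted list `grid` nearest to x."""
    if not grid:
        raise ValueError("grid must not be empty")
    pos = bisect.bisect_left(grid, x)
    if pos == 0:
        return grid[0]
    if pos == len(grid):
        return grid[-1]
    below, above = grid[pos - 1], grid[pos]
    # Ties go to the lower point (an arbitrary, assumed tie-break).
    return below if x - below <= above - x else above
```

Binary search keeps each lookup logarithmic in the grid size, which matters for a grid on the order of N = 10⁶ points.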

Assume that θ_nk is dual to λ_nk and that the relationship between θ_m and λ_n is held in a lookup table. The frequency-domain dual function of Λ_k is given by the following equation.

The information represented by Λ_k, which (where possible) has a phase in the time domain, lies not in a one-dimensional (linear) space but in a 3/2-dimensional (circular) space. Roughly speaking, the mapping from Λ_k to Θ_k is, geometrically, a mapping from the 3/2-dimensional external plane represented by λ_nk to the two-dimensional sphere represented by θ_nk. Through resampling in the frequency domain, the information is distributed randomly over the sphere Θ_k if phase is disregarded. If Λ_k contains no phase information, it can be assumed that phase information need not be considered.

In this embodiment, in-band waveform shaping is closely involved in both HMM training and the generation of speech parameters. Basically, the procedure for incorporating these techniques into HMM-based speech generation includes the following.

<Parameterization>
The MFCCs are converted into angular quantities.

For every utterance in the speech corpus, MFCCs are calculated with, for example, K = 39 and a frame shift of 5 milliseconds. The MFCCs are denoted Λ_ki (k = 0, …, K; i = 0, …, I, where I is the number of frames in the utterance). Λ_kmax and Λ_kmin are found from the set of MFCCs, and every Λ_ki is mapped to Θ_ki.
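
The search for Λ_kmin and Λ_kmax over the corpus can be sketched as follows. The function names are hypothetical, and the linear rescaling in `rescale` is only an assumed stand-in for the first step of the mapping; the equations that take Λ_ki to Θ_ki are not reproduced in this text.

```python
def dimension_ranges(utterances):
    """Per-dimension (min, max) over every frame of every utterance.

    utterances: list of utterances, each a list of frames, each frame a
    list of K + 1 MFCC values.
    """
    frames = [frame for utt in utterances for frame in utt]
    if not frames:
        raise ValueError("corpus contains no frames")
    columns = list(zip(*frames))  # one tuple of values per dimension k
    return [(min(col), max(col)) for col in columns]


def rescale(frame, ranges):
    """Map each coefficient linearly onto [0, 1] using its range.

    This linear rescaling is an assumed stand-in, not the source's mapping.
    """
    scaled = []
    for value, (lo, hi) in zip(frame, ranges):
        scaled.append(0.0 if hi == lo else (value - lo) / (hi - lo))
    return scaled
```

Computing the ranges over the whole corpus, rather than per utterance, keeps the mapping identical for every frame the HMMs are trained on.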

<HMM training>
The MFCCs are extended into the remaining band and decoded by the maximum-likelihood criterion. The HTS toolkit (Reference 1) is used for this, but γ_e × Θ_ki is used in place of Λ_ki, which widens the band by a factor of 1.4 for in-band shaping.

<Speech synthesis>
Anti-aliasing and smoothing are performed. GV (denoted ^Θ_kj, where k = 0, …, K and j = 0, …, J; J is the number of frames in the utterance). First, Θ_kj is converted to α_kj. If α_kj > w_c, α_kj is set to w_c to reduce aliasing. Next, α_kj is quantized to one of the values α_nkj in {α_n, n = 0, …, N}, using a minimum-error criterion. Then α_nkj is multiplied by γ_c, which widens the band by a factor of 1.2 and smooths it, and the result is quantized again. Finally, the MFCCs are recalculated by mapping α_nkj to Λ_nkj. When this mapping is one-to-many, this embodiment selects one of the images at random. The result is the set of MFCC parameters for speech synthesis, Λ_kj, k = 0, …, K and j = 0, …, J.
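
The clip, quantize, scale, and requantize steps described above can be sketched in Python as follows. The function names, the grid of α_n values, and the reading of γ_c as the factor 1.2 are assumptions made for illustration; the conversions between Θ, α, and Λ are omitted because their equations are not reproduced in this text.

```python
import bisect

W_C = 0.33325  # cutoff angle in radians, as given for this embodiment


def nearest(grid, x):
    """Nearest element of the sorted list `grid` (minimum absolute error)."""
    pos = bisect.bisect_left(grid, x)
    candidates = grid[max(pos - 1, 0):pos + 1]
    return min(candidates, key=lambda g: abs(g - x))


def shape_angles(alphas, grid, gamma_c):
    """Clip, quantize, scale by gamma_c, and quantize again."""
    shaped = []
    for alpha in alphas:
        alpha = min(alpha, W_C)                 # alpha > w_c is set to w_c
        alpha = nearest(grid, alpha)            # first quantization
        alpha = nearest(grid, gamma_c * alpha)  # scale, then requantize
        shaped.append(alpha)
    return shaped
```

For example, `shape_angles([0.5, 0.12, 0.26], [0.0, 0.1, 0.2, 0.3], 1.2)` clips the first value to w_c before quantizing it. On such a coarse placeholder grid the scaling step rarely moves a value to a different grid point; a grid of about 10⁶ points would resolve it.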

<HMMs after training>
The trained HMMs stored in the HMM storage unit 112 are described with reference to FIG. 3. In this embodiment, the HMMs are context-dependent three-state HMMs. Consider, for example, HMMs 140, 142, and 144, which contain /a/ as the middle phoneme. Each has /a/ as its second phoneme 160, but they have c_11, c_21, and c_31 respectively as the first phoneme and c_12, c_22, and c_32 respectively as the third phoneme. Many other three-state HMMs with /a/ as the second phoneme may exist, but only these three HMMs 140, 142, and 144 are shown here to keep the figure easy to follow.

A decision tree 162 over the HMMs is trained in order to select one of the HMMs that have /a/ as the second phoneme 160. The decision tree 162 has, for example, a plurality of nodes 180 to 200. Of these, nodes 184, 188, 190, 196, 198, and 200 are leaf nodes, each corresponding to one of the HMMs 140 to 144 and so on. A binary question is associated with each node of the decision tree 162. The tree is traversed from the root node 180, answering the question at each node according to the synthesis conditions (which are determined by the label sequence with prosodic information), and the HMM corresponding to the leaf node reached is selected.
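
The traversal described above can be sketched in Python as follows. The `Node` class, the two questions, the context keys, and the HMM names are hypothetical and serve only to illustrate the walk from the root to a leaf; the questions in a trained tree come from the training procedure.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """Leaf if `hmm` is set; otherwise an internal node with a yes/no question."""
    question: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    hmm: Optional[str] = None


def select_hmm(root, context):
    """Walk from the root, answering each binary question, to a leaf HMM."""
    node = root
    while node.hmm is None:
        node = node.yes if node.question(context) else node.no
    return node.hmm


# Hypothetical questions and HMM names, for illustration only.
tree = Node(
    question=lambda c: c["left"] in {"k", "s", "t"},
    yes=Node(hmm="hmm_140"),
    no=Node(
        question=lambda c: c["accented"],
        yes=Node(hmm="hmm_142"),
        no=Node(hmm="hmm_144"),
    ),
)
```

Calling `select_hmm(tree, {"left": "m", "accented": True})` answers "no" at the root and "yes" at the second node, and returns the leaf labeled `hmm_142`.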

[Operation]
The speech synthesis system 100 shown in FIG. 2 operates as follows. A large amount of utterance data is prepared in the speech database 60. All of the utterance data is divided into frames and given phoneme labels. The F0 extraction processing unit 62 extracts F0 from each frame in the speech database 60 and outputs it. The MFCC calculation unit 64 calculates the MFCC parameters Λ_ki from each frame and passes them to the MFCC conversion unit 120. As described above, the MFCC conversion unit 120 finds Λ_kmax and Λ_kmin from the set of MFCCs and maps every Λ_ki to Θ_ki.

The F0 and Θ_ki calculated for each frame are given that frame's phoneme label and stored in the HMM learning data storage unit 122.

As described above, the HMM learning unit 124 is in substance the same HTS toolkit as the HMM learning unit 68, and it trains the HMMs in the HMM storage unit 112 using Θ_ki. Once HMM training has finished for all of the utterance data, speech can be synthesized using the HMM storage unit 112.

In speech synthesis, when the input text 54 is given, the text analysis unit 80 of the speech synthesizer 116 performs text analysis on the input text 54 and passes the phoneme label sequence 82, annotated with prosodic information, to the parameter generation unit 134. Using this phoneme label sequence, the parameter generation unit 134 traverses the decision tree 162 (see FIG. 3) stored in the HMM storage unit 112 to select the HMM corresponding to each phoneme, and outputs a sequence of HMMs. A sequence of F0 values corresponding to this sequence is also obtained and passed to the sound source generation unit 86. Each Θ_kj obtained from the HMM sequence is converted to α_kj. If α_kj > w_c, α_kj is set to w_c to reduce aliasing. Next, α_kj is quantized to one of the values α_nkj in {α_n, n = 0, …, N}, using a minimum-error criterion. Then α_nkj is multiplied by γ_c for smoothing, and the result is quantized again. Finally, the MFCCs are recalculated by mapping α_nkj to Λ_nkj. When this mapping is one-to-many, one of the images is selected at random. The result is a sequence of MFCC parameters Λ_kj (k = 0, …, K and j = 0, …, J). The synthesis filter 88 is set for each frame by the corresponding MFCC parameters Λ_kj of this sequence, and the sound source signal that the sound source generation unit 86 generates based on the F0 for that frame is filtered by the synthesis filter 88 to obtain the synthesized speech.

[Effects of the embodiment]
As described above, according to this embodiment, the samples at the dual sampling points in the time and frequency domains are coherent. A change in either one produces a corresponding change in the other. This follows from the resonance curve and the equilibrium condition: the value of ζ is chosen so that the input λ and the output λ are equal. As a result, dual sampling provides a basic framework for quantizing speech parameters in both the time and frequency domains, making it possible to process speech parameters in both domains.

Second, because the object being processed in the frequency domain is a circle, the "amplitude" is constant, and the statistical mean is therefore represented by an angular quantity, which is linear.

Third, MFCC quantization basically extracts 0.3535 × 10^6 positions out of the 10^6 positions defined by dual sampling, leaving room for further interpolation if necessary. This room, which arises from the unavailable phase information, is well suited to dealing with the noise produced by statistically averaging Θ_k during HMM learning, provided that this noise is assumed to exhibit the same statistical characteristics as Gaussian noise. It is well known that human hearing is insensitive to a certain amount of phase. A means for efficient statistical classification and averaging of speech parameters is thus obtained.

Fourth, vocoders usually exploit the fact that certain groups of frequencies, particularly high-frequency groups, can be merged to a considerable degree. Dual sampling in the frequency domain meets this requirement. High frequencies are compressed about 2.5 times as much as low frequencies.

Finally, multiplying the parameter Θ_k by a linear coefficient γ provides, by virtue of dual sampling, a simple means of enabling group delay in the time domain.

[Usage Example]
Using the ATR503 data set recorded by a small number of female speakers, an experiment was conducted to compare the method according to the above embodiment with the conventional method. The results are shown in FIG. 4, which shows the result of in-band shaping of the MFCC when the MFCC response is expanded to a frame larger than 1. The results show that the number of leaf nodes in the present invention is generally smaller than in the conventional method, and that the diversity of acoustic features is reduced. This means that the method according to the above embodiment separates speaker-specific features from universal features well, and thereby improves the averaging to which speaker-specific features are subjected during HMM learning.

The inventors listened to and evaluated speech synthesized by the above method, and confirmed that, compared with the conventional method, the present embodiment considerably reduces buzz noise and improves the sound quality of HMM-based synthesized speech.

The embodiment disclosed herein is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each claim of the appended claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of those claims.

[References]
[1] K. Tokuda, H. Zen, J. Yamagishi, T. Masuko, S. Sako, A. B. Black, T. Nose, "The HMM-Based Speech Synthesis System (HTS) Version 2.1." [Online]. URL: http://hts.sp.nitech.ac.jp/.

40, 100 Speech synthesis system
50, 110 Learning device
52, 112 HMM storage unit
54 Input text
56, 116 Speech synthesizer
60 Speech database
62 F0 extraction processing unit
64 MFCC calculation unit
66, 122 HMM learning data storage unit
68, 124 HMM learning unit
80 Text analysis unit
82 Phoneme label sequence
84, 134 Parameter generation unit
86 Sound source generation unit
88 Synthesis filter
136 MFCC inverse conversion unit

Claims

1. A hidden Markov model learning device for speech synthesis, comprising:
speech database storage means for storing a speech database including a plurality of speech units each having a phoneme label;
fundamental frequency extraction means for extracting a fundamental frequency from each of the plurality of speech units and outputting fundamental frequency information;
acoustic feature amount calculating means for calculating a predetermined acoustic feature amount for each of the plurality of speech units;
conversion means for converting, for each of the plurality of speech units, the predetermined acoustic feature amount into an angular quantity by performing frequency-domain sampling that is dual to the time-domain sampling used to calculate the predetermined acoustic feature amount;
learning means for learning hidden Markov models for different phoneme contexts, and learning a decision tree for selecting one of the hidden Markov models from a phoneme label sequence, using as learning data, for the plurality of speech units included in the speech database, the fundamental frequency information output from the fundamental frequency extraction means and the angular quantity output from the conversion means, each labeled with the label of the corresponding speech unit; and
storage means for storing the hidden Markov models learned by the learning means and the decision tree.

2. The hidden Markov model learning device for speech synthesis according to claim 1, wherein the predetermined acoustic feature amount includes mel-frequency cepstrum coefficients, and
the acoustic feature amount calculating means includes means for calculating mel-frequency cepstrum coefficients up to a predetermined dimension for each of the plurality of speech units.

3. A speech synthesizer for synthesizing speech for input text using hidden Markov models learned by the hidden Markov model learning device for speech synthesis according to claim 1 or 2, comprising:
text analysis means for performing text analysis on the text and outputting a phoneme label sequence;
parameter generation means for selecting, for each phoneme label in the phoneme label sequence output by the text analysis means, a hidden Markov model using the decision tree, and generating fundamental frequency information and the angular quantity based on the selected hidden Markov model;
sound source generation means for generating a sound source signal based on the fundamental frequency information generated by the parameter generation means;
inverse conversion means for calculating the predetermined acoustic feature amount by applying, to the angular quantity generated by the parameter generation means, a conversion corresponding to the inverse of the conversion performed by the conversion means; and
a synthesis filter for modulating the sound source signal generated by the sound source generation means with a filter characteristic based on the acoustic feature amount calculated by the inverse conversion means.