JP2004125843A

JP2004125843A - Voice synthesis method

Info

Publication number: JP2004125843A
Application number: JP2002285568A
Authority: JP
Inventors: Hiroyuki Hirai; 平井　啓之
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2002-09-30
Filing date: 2002-09-30
Publication date: 2004-04-22

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesis method which obtains a smooth synthesized voice by allowing phoneme units to be selected in consideration of articulation parameters.

SOLUTION: In the voice synthesis method, a combination of phoneme units which has the least sum of distortion from a target and connection distortion of phoneme units is selected from combinations of phoneme units stored in a waveform dictionary, and a synthesized voice waveform is generated on the basis of the selected combination of phoneme units. Information related to the difference of an articulation parameter between two phoneme units to be connected is used as an element of connection distortion of phoneme units.

COPYRIGHT: (C)2004,JPO

Description

[0001]
[Technical field of the invention]
The present invention relates to a speech synthesis method capable of reading out arbitrary text information with synthesized speech.
[0002]
[Prior art]
FIG. 1 shows a schematic configuration of the speech synthesizer.
[0003]
The input text mixed with Japanese kana and kanji is subjected to morphological analysis and dependency analysis by the language processing unit 1 and converted into phoneme symbols, accent symbols, and the like.
[0004]
The prosody pattern generation unit 2 uses the phoneme symbols, the accent symbol string, and the part-of-speech information of the input text obtained from the morphological analysis to estimate the phoneme duration (length of the voice, DUR ^T ), the fundamental frequency (pitch of the voice, F0 ^T ), the power at the vowel center (loudness of the voice, POW ^T ), and the like.
[0005]
The phoneme unit selection unit 3 uses DP (dynamic programming) to select the combination of phoneme segments that is closest to the estimated phoneme duration DUR ^T , fundamental frequency F0 ^T , and vowel-center power POW ^T , and that minimizes the distortion (C ^all , described later) arising when the phoneme units (phoneme segments) stored in the waveform dictionary 5 are connected.
[0006]
The speech waveform generation unit 4 generates speech by connecting the phoneme segments, while converting their pitch, according to the selected combination of phoneme segments.
[0007]
FIG. 2 shows the contents of the waveform dictionary 5.
The waveform dictionary 5 includes a phoneme segment storage unit 51, in which a plurality of phoneme segments are stored, and an auxiliary information storage unit 52, in which auxiliary information on each phoneme segment in the phoneme segment storage unit 51 is stored. The auxiliary information includes the power (POW ^Dic ), fundamental frequency (F0 ^Dic ), duration (DUR ^Dic ), spectrum (SPC ^Dic ), and the like of each phoneme segment.
[0008]
The phoneme unit selection unit 3 selects, from among the combinations of phoneme segments stored in the waveform dictionary 5, a combination with small distortion. This distortion has the following components.
[0009]
That is, as shown in FIG. 3, let u _i-1 , u _i , and u _i+1 be phoneme segments extracted from the waveform dictionary 5, and let t _i-1 , t _i , and t _i+1 be the environments in which they are actually used (targets). The distortion for u _i then consists of C ^t _i and C ^c _i .
[0010]
Here, C ^t _i is the distortion between the phoneme segment (u _i ) extracted from the dictionary for the i-th phoneme and the environment in which it is actually used (target t _i ). C ^c _i is the distortion that arises when the i-th phoneme segment (u _i ) is connected to the (i-1)-th segment (u _i-1 ). The phoneme unit selection unit 3 connects phoneme segments using dynamic programming (the DP method) and selects the combination of segments that minimizes C ^all , the sum of C ^t _i and C ^c _i over all input phonemes.
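The DP selection described above can be sketched as a Viterbi-style search. This is a minimal illustration, not the patented implementation: the function name `select_units`, the callables `target_cost` and `join_cost` (standing in for C ^t _i and C ^c _i ), and the per-phoneme candidate lists are all hypothetical, and it assumes that no connection distortion is charged for the first segment.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # target description t_i
U = TypeVar("U")  # candidate phoneme segment u_i


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
) -> Tuple[float, List[U]]:
    """Return (C_all, best segment sequence) by dynamic programming.

    candidates[i] lists the dictionary segments available for phoneme i.
    join_cost(prev, cur) plays the role of C^c_i; the first segment is
    charged no join cost (an assumption, the source does not say).
    """
    if not targets or len(targets) != len(candidates):
        raise ValueError("need one candidate list per target")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every phoneme needs at least one candidate")

    # best[j]: minimal accumulated cost of a path ending in candidates[i][j]
    best = [target_cost(targets[0], u) for u in candidates[0]]
    back: List[List[int]] = []  # back[i-1][j]: best predecessor index

    for i in range(1, len(targets)):
        new_best, pointers = [], []
        for u in candidates[i]:
            costs = [best[k] + join_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min] + target_cost(targets[i], u))
            pointers.append(k_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final state.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidates[i][j] for i, j in enumerate(path)]
```

With N candidates per phoneme, this search evaluates on the order of N² connections per phoneme instead of enumerating every combination, which is the usual reason for using DP here.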
[0011]
C ^t _i is expressed by the following equation (1).
[0012]
(Equation 1)

[0013]
In the above equation (1), each variable is defined as follows.
[0014]
D ^t _POW (t _i , u _i ) is, for the i-th phoneme, the square of the distance between the power (POW ^Dic (i)) of the phoneme segment (u _i ) extracted from the dictionary and the power (POW ^T (i)) of the environment in which it is actually used (target t _i ); that is, {(POW ^Dic (i)) − (POW ^T (i))} ² .
[0015]
w ^t _POW is a weighting factor for D ^t _POW (t _i , u _i ).
[0016]
D ^t _F0 (t _i , u _i ) is, for the i-th phoneme, the square of the distance between the fundamental frequency (F0 ^Dic (i)) of the phoneme segment (u _i ) extracted from the dictionary and the fundamental frequency (F0 ^T (i)) of the environment in which it is actually used (target t _i ); that is, {(F0 ^Dic (i)) − (F0 ^T (i))} ² .
[0017]
w ^t _F0 is a weighting factor for D ^t _F0 (t _i , u _i ).
[0018]
D ^t _DUR (t _i , u _i ) is, for the i-th phoneme, the square of the distance between the duration (DUR ^Dic (i)) of the phoneme segment (u _i ) extracted from the dictionary and the duration (DUR ^T (i)) of the environment in which it is actually used (target t _i ); that is, {(DUR ^Dic (i)) − (DUR ^T (i))} ² .
[0019]
w ^t _DUR is a weighting factor for D ^t _DUR (t _i , u _i ).
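The image for equation (1) is not reproduced in this text. Assuming that it is a plain weighted sum of the three terms defined above (an assumption based only on those definitions, not on the original figure), it would read:

```latex
C^{t}_{i} = w^{t}_{POW}\, D^{t}_{POW}(t_i, u_i)
          + w^{t}_{F0}\, D^{t}_{F0}(t_i, u_i)
          + w^{t}_{DUR}\, D^{t}_{DUR}(t_i, u_i),
\qquad
D^{t}_{POW}(t_i, u_i) = \bigl(POW^{Dic}(i) - POW^{T}(i)\bigr)^{2}
```

D ^t _F0 and D ^t _DUR have the same squared-difference form, with F0 and DUR in place of POW.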
[0020]
C ^c _i is represented by the following formula (2).
[0021]
(Equation 2)

[0022]
In the above equation (2), each variable is defined as follows.
[0023]
D ^c _POW (u _i , u _i-1 ) is the square of the distance between the power (POW ^DicS (i)) at the start of the i-th phoneme segment (u _i ) and the power (POW ^DicE (i-1)) at the end of the (i-1)-th phoneme segment (u _i-1 ); that is, {(POW ^DicS (i)) − (POW ^DicE (i-1))} ² .
[0024]
w ^c _POW is a weighting factor for D ^c _POW (u _i , u _i-1 ).
[0025]
D ^c _F0 (u _i , u _i-1 ) is the square of the distance between the fundamental frequency (F0 ^DicS (i)) at the start of the i-th phoneme segment (u _i ) and the fundamental frequency (F0 ^DicE (i-1)) at the end of the (i-1)-th phoneme segment (u _i-1 ); that is, {(F0 ^DicS (i)) − (F0 ^DicE (i-1))} ² .
[0026]
w ^c _F0 is a weighting factor for D ^c _F0 (u _i , u _i-1 ).
[0027]
D ^c _SPC (u _i , u _i-1 ) is the square of the distance between the spectrum (SPC ^DicS (i, j), j = 1 to 16) at the start of the i-th phoneme segment (u _i ) and the spectrum (SPC ^DicE (i-1, j), j = 1 to 16) at the end of the (i-1)-th phoneme segment (u _i-1 ); that is, {(SPC ^DicS (i, j)) − (SPC ^DicE (i-1, j))} ² .
[0028]
w ^c _SPC is a weighting factor for D ^c _SPC (u _i , u _i-1 ).
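The image for equation (2) is likewise not reproduced here. Assuming, as for the target distortion, a plain weighted sum of the three terms defined above (an assumption, not a transcription of the original figure), it would read:

```latex
C^{c}_{i} = w^{c}_{POW}\, D^{c}_{POW}(u_i, u_{i-1})
          + w^{c}_{F0}\, D^{c}_{F0}(u_i, u_{i-1})
          + w^{c}_{SPC}\, D^{c}_{SPC}(u_i, u_{i-1})
```

The text does not say how the 16 spectral components j are combined in D ^c _SPC ; summing the squared differences over j would be one natural reading.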
[0029]
The sum C ^all of C ^t _i and C ^c _i for all input phonemes is represented by the following equation (3).
[0030]
(Equation 3)
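The image for equation (3) is not reproduced in this text. From the description of C ^all as the sum of C ^t _i and C ^c _i over all input phonemes, it would presumably read as follows, where n (a symbol introduced here for illustration) is the number of input phonemes:

```latex
C^{all} = \sum_{i=1}^{n} \left( C^{t}_{i} + C^{c}_{i} \right)
```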

[0031]
[Problems to be solved by the invention]
An object of the present invention is to provide a speech synthesis method that can select phoneme units in consideration of articulation parameters and thereby obtain smooth synthesized speech.
[0032]
[Means for Solving the Problems]
The first speech synthesis method according to the present invention is a speech synthesis method that selects, from among the combinations of phoneme units stored in a waveform dictionary, the combination that minimizes the sum of the distortion with respect to the target and the connection distortion of the phoneme units, and generates a synthesized speech waveform based on the selected combination of phoneme units. It is characterized in that information on the difference in articulation parameters between the two phoneme units to be connected is used as an element of the connection distortion of the phoneme units.
[0033]
The second speech synthesis method according to the present invention is a speech synthesis method that selects, from among the combinations of phoneme units stored in a waveform dictionary, the combination that minimizes the sum of the distortion with respect to the target and the connection distortion of the phoneme units, and generates a synthesized speech waveform based on the selected combination of phoneme units. It is characterized in that information on the difference in the articulatory model between the two phoneme units to be connected is used as an element of the connection distortion of the phoneme units, and in that information on the difference between the articulation parameters of the target and the articulation parameters of the phoneme unit is used as an element of the distortion with respect to the target.
[0034]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described.
[0035]
[1] Description of the first embodiment
The overall configuration of the speech synthesizer is the same as that shown in FIG. 1.
[0036]
As described above, the phoneme unit selection unit 3 selects, from among the combinations of phoneme segments stored in the waveform dictionary 5, a combination with small distortion. As also described above, this distortion has the following two components.
[0037]
C ^t _i : Distortion between the phoneme segment (u _i ) extracted from the dictionary for the i-th phoneme and the environment in which it is actually used (target t _i ).
C ^c _i : Distortion that arises when the i-th phoneme segment (u _i ) is connected to the (i-1)-th segment (u _i-1 ).
[0038]
The phoneme unit selection unit 3 connects phoneme segments using dynamic programming (the DP method) and selects the combination of segments that minimizes C ^all , the sum of C ^t _i and C ^c _i over all input phonemes.
[0039]
In the first embodiment, the elements used to calculate C ^c _i differ from the conventional ones. C ^c _i is expressed by the following equation (4).
[0040]
(Equation 4)

[0041]
That is, D ^c _SPC (u _i , u _i-1 ) in the conventional equation (2) for calculating C ^c _i is replaced by D' ^c _SPC (u _i , u _i-1 ).
[0042]
D' ^c _SPC (u _i , u _i-1 ) is expressed by the following equation (5), using the articulation parameters x (u _i ) = (X (u _i ), Y (u _i ), L (u _i ), W (u _i ), R (u _i ), B (u _i ), N (u _i )) for the segment u _i and the articulation parameters x (u _i-1 ) = (X (u _i-1 ), Y (u _i-1 ), L (u _i-1 ), W (u _i-1 ), R (u _i-1 ), B (u _i-1 ), N (u _i-1 )) for the segment u _i-1 .
[0043]
(Equation 5)

[0044]
In the above equation (5), w ^c _X , w ^c _Y , w ^c _B , w ^c _R , w ^c _L , w ^c _W , and w ^c _N are weighting factors. In addition to the power (POW ^Dic ), fundamental frequency (F0 ^Dic ), and duration (DUR ^Dic ) of each phoneme segment, the auxiliary information storage unit of the waveform dictionary also stores the articulation parameters (x ^Dic ).
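The images for equations (4) and (5) are not reproduced in this text. The following is one reading consistent with the surrounding description. It assumes that equation (4) keeps the weight w ^c _SPC on the replaced term, and that equation (5) uses squared differences like the other distance terms. Neither assumption is confirmed by the text.

```latex
C^{c}_{i} = w^{c}_{POW}\, D^{c}_{POW}(u_i, u_{i-1})
          + w^{c}_{F0}\, D^{c}_{F0}(u_i, u_{i-1})
          + w^{c}_{SPC}\, D'^{\,c}_{SPC}(u_i, u_{i-1})
\tag{4}

D'^{\,c}_{SPC}(u_i, u_{i-1}) =
  \sum_{p \in \{X, Y, L, W, R, B, N\}}
  w^{c}_{p}\, \bigl( p(u_i) - p(u_{i-1}) \bigr)^{2}
\tag{5}
```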
[0045]
Hereinafter, the articulation model will be described.
[0046]
The articulatory model directly models the structure and movement of the articulatory organs themselves, such as the tongue, pharynx, palate, lips, and jaw, which determine the shape of the vocal tract, and expresses the vocal tract shape as a result.
[0047]
FIG. 4 shows a configuration example of the articulatory model. This model has seven articulation parameters: the tongue center position X, Y; the lip protrusion L; the lip opening W; the tongue tip position and constriction R, B; and the degree of soft palate coupling N.
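As an illustration of how the seven parameters could enter a distortion term such as D' ^c _SPC , the sketch below stores them in a small record and computes a weighted sum of squared differences. The class `Articulation`, the function `articulatory_distance`, and the squared-difference form are assumptions made for this example. The text states only that weighted information on the parameter differences is used.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Articulation:
    """The seven articulation parameters x = (X, Y, L, W, R, B, N)."""
    X: float  # tongue center position (together with Y)
    Y: float  # tongue center position (together with X)
    L: float  # lip protrusion
    W: float  # lip opening
    R: float  # tongue tip position and constriction (together with B)
    B: float  # tongue tip position and constriction (together with R)
    N: float  # degree of soft palate coupling


def articulatory_distance(a: Articulation, b: Articulation,
                          weights: Articulation) -> float:
    """Weighted sum of squared parameter differences (assumed form).

    `weights` reuses the same record type, so weights.X is the weight
    applied to the X difference, and so on.
    """
    return sum(
        getattr(weights, f.name) * (getattr(a, f.name) - getattr(b, f.name)) ** 2
        for f in fields(Articulation)
    )
```

Passing the parameters at the end of segment u _i-1 and at the start of segment u _i would give a connection term; passing target and segment parameters would give a target term.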
[0048]
Given the articulation parameters x = (X, Y, L, W, R, B, N), the corresponding vocal tract cross-sectional area is determined, and the speech waveform y = f (x) can be obtained via the vocal tract transfer characteristics.
[0049]
Conversely, the problem of estimating the articulation parameters x from a given speech waveform y reduces to solving the inverse transform x = f ⁻¹ (y). Because the movement of the articulation parameters is constrained by the structure of the articulatory organs, estimating them becomes a nonlinear optimization problem under the given constraints and an appropriate evaluation function.
[0050]
As a specific solution, analysis by synthesis, for example, is used. In this method, the error between the speech waveform synthesized by the model (estimated value) and the actual speech waveform (observed value) is defined, for example, as the squared error of the cepstrum coefficients. The parameters are then changed step by step so as to reduce this error, and the parameter values that minimize it are determined.
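The analysis-by-synthesis loop can be sketched as follows. This is a toy illustration only. The forward model `synthesize` is a hypothetical stand-in for the articulatory synthesizer plus cepstral analysis, box bounds stand in for the articulatory constraints, and a simple coordinate search is used because the text does not specify the optimizer.

```python
from typing import Callable, List, Sequence, Tuple


def analysis_by_synthesis(
    observed: Sequence[float],
    synthesize: Callable[[Sequence[float]], Sequence[float]],
    x0: Sequence[float],
    bounds: Sequence[Tuple[float, float]],
    step: float = 0.5,
    tol: float = 1e-6,
    max_iter: int = 10_000,
) -> Tuple[List[float], float]:
    """Estimate parameters x so that synthesize(x) matches `observed`.

    Returns (estimated parameters, remaining squared error).
    """

    def error(x: Sequence[float]) -> float:
        # Squared error between synthesized and observed features.
        return sum((s - o) ** 2 for s, o in zip(synthesize(x), observed))

    # Start from x0, clipped into the allowed range.
    x = [min(max(v, lo), hi) for v, (lo, hi) in zip(x0, bounds)]
    best = error(x)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for d in range(len(x)):
            for delta in (step, -step):
                trial = list(x)
                lo, hi = bounds[d]
                trial[d] = min(max(trial[d] + delta, lo), hi)  # keep within bounds
                e = error(trial)
                if e < best:
                    x, best, improved = trial, e, True
        if not improved:
            step /= 2  # no better neighbour: refine the search
    return x, best
```

A real system would replace `synthesize` with the articulatory model f (x) followed by cepstral analysis, and would likely use a stronger optimizer than this coordinate search.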
[0051]
Methods for estimating the state of articulation from a speech waveform also include ones that are based not on an articulatory model but on an acoustic tube model of the vocal tract, and that estimate the vocal tract cross-sectional area. A typical example is the method using linear prediction analysis, which can estimate the shape of a lossless acoustic tube under appropriate boundary conditions.
[0052]
A method has also been proposed that, while based on linear prediction analysis, introduces natural constraints derived from the structure of the speech organs at the speech analysis stage and extracts the vocal tract shape directly from the speech waveform. In this method, the vocal fold wave characteristics are first estimated from the speech wave using a vocal fold wave characteristic model that includes the radiation characteristics of the speech wave (from the lips into space). The vocal tract characteristics are then approximately separated by filtering with the inverse of these vocal fold wave characteristics. This is called the adaptive filter method, and it enables more stable extraction of the vocal tract shape.
[0053]
[2] Second embodiment
[0054]
In the second embodiment, as in the first embodiment, C ^c _i is calculated based on equation (4) above, and C ^t _i is calculated based on the following equation (6).
[0055]
(Equation 6)

[0056]
In the above equation (6), D' ^t _SPC (t _i , u _i ) is added to the elements of the conventional equation (1) for calculating C ^t _i .
[0057]
D' ^t _SPC (t _i , u _i ) is expressed by the following equation (7), using the articulation parameters x (t _i ) = (X (t _i ), Y (t _i ), L (t _i ), W (t _i ), R (t _i ), B (t _i ), N (t _i )) for the target t _i and the articulation parameters x (u _i ) = (X (u _i ), Y (u _i ), L (u _i ), W (u _i ), R (u _i ), B (u _i ), N (u _i )) for the segment u _i .
[0058]
(Equation 7)

[0059]
In the above equation (7), w ^t _X , w ^t _Y , w ^t _B , w ^t _R , w ^t _L , w ^t _W , and w ^t _N are weight coefficients.
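The images for equations (6) and (7) are not reproduced in this text. The following is one reading consistent with the description. It assumes that D' ^t _SPC is added to equation (1) without a further outer weight, since the per-parameter weights already appear in equation (7), and that equation (7) uses squared differences. Both points are assumptions.

```latex
C^{t}_{i} = w^{t}_{POW}\, D^{t}_{POW}(t_i, u_i)
          + w^{t}_{F0}\, D^{t}_{F0}(t_i, u_i)
          + w^{t}_{DUR}\, D^{t}_{DUR}(t_i, u_i)
          + D'^{\,t}_{SPC}(t_i, u_i)
\tag{6}

D'^{\,t}_{SPC}(t_i, u_i) =
  \sum_{p \in \{X, Y, L, W, R, B, N\}}
  w^{t}_{p}\, \bigl( p(t_i) - p(u_i) \bigr)^{2}
\tag{7}
```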
[0060]
A method of obtaining the articulation parameters x (t _i ) for the target t _i will now be described for the case of generating clear speech. In this case, D' ^t _SPC (t _i , u _i ) is added as an element of C ^t _i only for vowel portions. For convenience of explanation, the case where the articulation parameters x (t _i ) for the target t _i are obtained from only the parameters X (t _i ) and Y (t _i ) is described here.
[0061]
First, only the vowel portions are extracted from the waveforms registered in the waveform dictionary, and the parameters X and Y are obtained for each vowel portion. The average values of X and Y are then calculated for each vowel (a, i, u, e, o), and these per-vowel averages are taken as the X and Y for that vowel.
[0062]
FIG. 5 shows the X and Y obtained in this way for each vowel, together with their average (the average over all vowels), denoted all. Then, as shown in FIG. 6, the position obtained by multiplying the vector from the all-vowel average all to each vowel by, for example, 1.5 is taken as the target position X (t _i ), Y (t _i ). These values X (t _i ) and Y (t _i ) are used to calculate the articulation parameters x (t _i ) for the target t _i .
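The target positions described above can be computed as in the following sketch. The function name `vowel_targets` and the input layout (a mapping from vowel label to a list of measured (X, Y) pairs) are hypothetical. The sketch also assumes that the all-vowel average is the mean of the per-vowel averages, which is one reading of the text.

```python
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]


def _mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values)


def vowel_targets(samples: Dict[str, List[Point]],
                  scale: float = 1.5) -> Dict[str, Point]:
    """Move each vowel's mean (X, Y) away from the all-vowel mean.

    The vector from the all-vowel mean to each vowel mean is multiplied
    by `scale` (1.5 in the example given in the text).
    """
    # Per-vowel averages of X and Y.
    means = {
        v: (_mean(p[0] for p in pts), _mean(p[1] for p in pts))
        for v, pts in samples.items()
    }
    # "all": average of the per-vowel averages (assumed interpretation).
    cx = _mean(m[0] for m in means.values())
    cy = _mean(m[1] for m in means.values())
    return {
        v: (cx + scale * (mx - cx), cy + scale * (my - cy))
        for v, (mx, my) in means.items()
    }
```

A scale greater than 1 places the targets further from the center than the measured vowel averages, which corresponds to the more clearly articulated vowels this example aims for.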
[0063]
In general, vocal tract information is represented by parameterizing the results of frequency analysis (LSP, LPC, PARCOR, cepstrum, and so on). Vocal tract shape parameters (articulation parameters) carry equivalent information, but because they represent the actual speech organs, they have the advantage that the parameters are highly independent of one another and their physical meaning can be understood intuitively. Smooth speech can therefore be generated by, for example, changing the parameter weights for each type of segment, or using the derivatives of the parameters to match the direction of movement of the speech organs. In addition, compared with parameters that express frequency, articulation parameters make it easier to infer the state of an utterance (such as how lazily it is articulated or whether it is emphasized), so speech in a specified utterance state can be synthesized.
[0064]
[Effects of the invention]
According to the present invention, a phoneme unit can be selected in consideration of articulation parameters, and a smooth synthesized speech can be obtained.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating an overall configuration of a speech synthesizer.
FIG. 2 is a schematic diagram showing the contents of the waveform dictionary 5.
FIG. 3 is a schematic diagram for explaining the two types of distortion, C ^t _i and C ^c _i , used by the phoneme unit selection unit 3 to select a combination of phoneme segments.
FIG. 4 is a schematic diagram showing a configuration example of an articulation model.
FIG. 5 is a schematic diagram for explaining how to obtain an articulation parameter x (t _i ) for a target t _i .
FIG. 6 is a schematic diagram showing an example in which, when X and Y for each vowel and their average (the average over all vowels) all are as shown in FIG. 5, the vector from the all-vowel average all to each vowel is multiplied by 1.5.

Claims

A speech synthesis method that selects, from among the combinations of phoneme units stored in a waveform dictionary, the combination that minimizes the sum of the distortion with respect to the target and the connection distortion of the phoneme units, and generates a synthesized speech waveform based on the selected combination of phoneme units,
the speech synthesis method being characterized in that information on the difference in articulation parameters between the two phoneme units to be connected is used as an element of the connection distortion of the phoneme units.

A speech synthesis method that selects, from among the combinations of phoneme units stored in a waveform dictionary, the combination that minimizes the sum of the distortion with respect to the target and the connection distortion of the phoneme units, and generates a synthesized speech waveform based on the selected combination of phoneme units,
the speech synthesis method being characterized in that information on the difference in the articulatory model between the two phoneme units to be connected is used as an element of the connection distortion of the phoneme units, and information on the difference between the articulation parameters of the target and the articulation parameters of the phoneme unit is used as an element of the distortion with respect to the target.