JP3892691B2

JP3892691B2 - Speech synthesis method and apparatus, and speech synthesis program

Info

Publication number: JP3892691B2
Application number: JP2001282816A
Authority: JP
Inventors: 未来長谷部; 匡伸阿部
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2001-09-18
Filing date: 2001-09-18
Publication date: 2007-03-14
Anticipated expiration: 2021-09-18
Also published as: JP2003091295A

Description

【０００１】
【発明の属する技術分野】
本発明は、高品質な合成音声を安定に得るための音声合成方法及びその装置並びに音声合成プログラムに関するものである。
【０００２】
【従来の技術】
従来、ラジオや、テレビ、有線放送、インターネットなどのコンテンツ作成には膨大な労力が費やされており、現在の有線放送やインターネットなどの多チャンネル時代において、全てのコンテンツを人手で作成すると多大なコストと労力が必要となる。このため、テキストデータから合成音声を生成する音声合成システムが注目されるようになってきた。
【０００３】
音声合成システムを利用すればコンテンツ作成のためのコストを抑えることができ、短時間で大量のコンテンツを作成することができる。
【０００４】
この種の音声合成システムとしては、例えばテキストコーパスを用いたコーパスベースの音声合成システムが知られている。
【０００５】
上記コーパスベース音声合成に属する類の音声合成システムは、低次の音響的情報や低次の物理的情報をもとに、合成文に最適な合成単位をデータベースから検索し、韻律の変形を行わずに接続することで合成音声を作成している。このように韻律を変形しないことで肉声らしさを尊重した音声合成を行っている。
【０００６】
ここで、上記低次の情報とは、空気の振動に近い物理的な情報を指している。例えば、音声波形から直接数値化できる基本周波数やスペクトルなどの情報である。
【０００７】
【発明が解決しようとする課題】
しかしながら、前述した従来の音声合成システムにおいては、韻律を変形しないことで肉声らしさを尊重しているが、ケプストラムや音素など、文章としてのまとまりからは遠い低次のパラメータを元にして合成単位を選択しているため、文章全体を考慮した整合性がなく、フレーズの内部で接続を行ったり、文章本来の韻律パターンとは異なった音声になるなど、合成結果は不安定なものとなる。
【０００８】
即ち、低次の情報から得られるパラメータによって合成単位を選択するシステムにおいては、合成音声の品質はデータベースの検索結果に大きく依存したものになる。このため、目標の文章との適合度の高い合成単位を選択できたときは品質が良く、そうでなければ品質は悪くなり、合成音声の品質は不安定である。
【０００９】
このため、従来の音声合成システムでは、前述したラジオや、テレビ、有線放送、インターネットなどの放送に要求されるだけの十分な品質を実現できているとは言い難い。
【００１０】
本発明の目的は上記の問題点に鑑み、放送のための十分な品質を持った音声合成システムを実現でき、コンテンツの作成や更新の自動化を可能にする音声合成方法及びその装置並びに音声合成プログラムを提供することである。
【００１１】
【課題を解決するための手段】
本発明は上記の目的を達成するために、フレーズの境界を示す情報、フレーズが強調音声か否かを示す単語の役割情報、音声の韻律パターン情報の３種類の情報と、音声波形、音素列データ、音素の境界を示すデータとが対応付けられて蓄積されている第１のデータベースと、前記単語の役割情報および前記韻律パターン情報により定められた劣化パターン毎に、フレーズ境界で接続され且つ接続点前後の基本周波数の差が異なる複数の音声波形による音声を複数の受聴者が評価した第１の値と、フレーズ内で接続され且つ接続点前後の基本周波数の差が異なる複数の音声波形による音声を複数の受聴者が評価した第２の値とが情報として蓄積されている第２のデータベースを用い、前記データベースに記録された前記音素列のうちの少なくとも一部からなる小音素列を合成単位の候補として使用して合成音声を生成するようにした。
【００１２】
また、音声合成の際には、テキストデータで示される音素列の一部と適合する前記合成単位の候補を前記データベースから検索し、前記検索された合成単位の候補毎に、合成単位の候補同士の接続点における音声波形の基本周波数の差に対して、前記フレーズの境界を示す情報、前記単語の役割情報、前記韻律パターン情報の３種類の情報に基づく評価値を前記第１の値と第２の値とにより求め、各接続点における前記評価値の平均値を総合評価値として求めて、該総合評価値が最も高い合成単位の候補を合成単位として選択する。
【００１３】
さらに、前記選択された合成単位に対応した音声波形を前記データベースから抽出し、前記テキストデータで示される音素列に対応させて、前記抽出した音声波形を接続している。
【００１４】
例えば、音声波形の断片同士の接続点がフレーズの境界にあるか否かで品質の劣化状態が異なったものとなる。また、強調されている単語は、文章中で重要な意味を持つ単語であり、パワーやピッチが上がっていることが多く、単語が強調されているか否かで品質の劣化状態が異なったものとなる。
【００１５】
韻律パターン情報は、音声断片同士の接続点における韻律パターンの連続性や整合性を判定するために用いることができる。韻律パターンの連続性では、例えば、末尾で基本周波数が緩やかに下がっている音声断片の後に、先頭において基本周波数が高い音声断片を接続すると、基本周波数が下がるはずのところが上がるため、品質が劣化しやすい。逆に、末尾で基本周波数が緩やかに下がっている音声断片の後に、先頭において基本周波数が低い音声断片を接続すると、下がり具合が大きくなるだけなので、品質は劣化し難くなる。
【００１６】
韻律パターンの整合性としては、音声断片同士の接続点における基本周波数の差の方向の±（プラス・マイナス）を見て、それが文章の持つ連続した韻律パターンの傾斜方向と整合しているか否かによって品質の劣化状態が異なる。
【００１７】
このようにフレーズの境界を示す情報や、単語の役割情報、韻律パターン情報等の高次の言語的情報を用いて求めた評価値に基づいて合成単位を選択することにより、従来のように合成単位の選択に低次の情報を利用することによって生じる合成音声の品質の不安定性という問題を解決することができる。
【００１８】
【発明の実施の形態】
以下、図面に基づいて本発明の一実施形態を説明する。
【００１９】
図１は、本発明の一実施形態における音声合成装置を示す構成図である。本実施形態では、コンピュータに音声合成プログラムをインストールすることによって音声合成装置を構成している。
【００２０】
図１において、１はコンピュータで、ＣＰＵを主体として構成されている中央処理部１１と、中央処理部１１に接続された記憶部１２、表示部１３、入力部１４、半導体素子から構成されるメモリ１５、及び音声発生器としてのディジタル／アナログ（Ｄ／Ａ）変換器やスピーカなどからなる音響部１６等を備えた一般的なコンピュータである。
【００２１】
記憶部１２には、上記の音声合成プログラム１７と音声合成のためのテキスト解析辞書１８、及び評価値情報１９が記憶されていると共に、音声合成に必要な各種情報が音声合成データベース２０として構築されている。
【００２２】
入力部１４は、音声合成対象となるテキストデータを入力する手段であり、例えばキーボード、マウス、磁器フロッピーディスクやコンパクトディスクなどの情報記録媒体へのインタフェース、ネットワーク通信インタフェース等を含んでいる。
【００２３】
音声合成データベース２０には、テキスト音声合成に必要なデータベースとして、音声波形２１と、発声内容に対応する音素列データ２２、音素の境界を示すデータ２３、フレーズの境界を示す情報２４、フレーズが強調音声か否かを示す単語の役割情報２５、音声の韻律パターン情報２６が対応付けられて蓄積されている。音声波形２１としては、例えば実際に人が発声した音声が収録されてこれが蓄積されている。
【００２４】
本実施形態では、上記音声合成プログラム１７によって中央処理部１１を動作させて音声合成処理を行わせる。この音声合成処理では、キーボードや情報記録媒体或いはネットワークを介して入力されたかな漢字混じり文のテキストデータに適合した合成音声を生成する際に、上記音声合成データベース（以下、単にデータベースと称する）２０から合成単位となる音声波形の断片（以下、音声断片と称する）を選択する。この選択の際に、高次の言語的情報からトップダウン的に合成単位の候補を選択し、この合成単位に対応した音声断片を選択する。
【００２５】
高次の言語的情報とは、人間が感知した物理現象を解釈した意味内容を持つ情報を指し、空気の振動等の物理的な情報を人間の言語能力によって解釈し、意味を与えた情報である。例えば、フレーズ境界や、単語の役割、韻律パターンの整合性といった、音声波形の断片から直接数値化できず、何段階かの抽象化が必要な情報である。本実施形態では、高次の言語的情報として、フレーズ（ポーズで挟まれた一気に発声する音声区間、呼気段落）の境界情報、単語の役割（単語の強調発声など）情報、韻律パターン（接続点前の音声波形断片末尾と接続点後の音声波形断片先頭における韻律パターンの連続性や整合性など。韻律パターンとはピッチやアクセントなどの音素毎の変化）情報を用いている。尚、韻律パターン情報には、基本周波数の変動パターンに関する情報も含まれている。また、単語の役割情報としては、強調発声の情報以外に、喜怒哀楽の情報や、驚嘆、熱意、失望などの感情的表現を前後の文脈から判断して得られた情報を用いることができる。
【００２６】
このような高次の言語的情報によって、低次の情報のみを利用して合成単位を選択するよりも、高品質な合成音声を安定して出力することができる。さらに、高次の言語的情報を利用することで、選択した音声断片同士の接続点における違和感や、文章の持つ意味と異なった韻律パターンとなってしまうといった従来技術の問題を解決することができる。
【００２７】
従って、本実施形態によれば、高次の言語的情報を利用して合成単位を選択しているので、高品質な合成音声を安定して得ることができる。その結果、放送の分野で要求される品質を持った合成音声を作成でき、有線放送やインターネットの多チャンネル時代においてコンテンツの不足を補うために自動でコンテンツを作成・更新することも可能になる。
【００２８】
図２は本実施形態における音声合成プログラム１７の処理を説明するフローチャート、図３は上記音声合成プログラム１７によって行う高次の言語的情報用いた音声合成の流れを示す図である。これらを参照して具体的動作について説明する。
【００２９】
音声合成のためにワープロ等で作成されたかな漢字混じり文のテキストデータ101が入力される（Ｓ１）と、このテキストデータ101の解析102を行う（Ｓ２）。テキスト解析102では、テキスト解析辞書１８を用いてテキストデータ101の解析を行い、例えば、入力されたかな漢字混じり文のテキスト101に対して、読み仮名、読み仮名に対応する音素列、各音素の継続時間長、フレーズ毎のアクセント型、各フレーズの結合の仕方、アクセント型と結合の仕方に対応した韻律パターンなどの情報を解析結果として出力する。
【００３０】
この後、上記テキスト解析結果を用いて合成単位の検索103を行う。合成単位の検索103の処理では、テキストデータ101の音素列を複数の小音素列に分割して、データベース２０の中からテキストデータ101に音素レベルで一致する小音素列を検索し（Ｓ３）、検索によってデータベース２０から抽出した小音素列を、音声合成に用いる合成単位の候補104とする（Ｓ４）。ここで、上記小音素列とは複数の音素からなる音素列中の一部であり、１つ以上の音素から構成される音素列である。
【００３１】
さらに、前述した高次の言語的情報を用いて候補の絞り込み105を行い（Ｓ５）、テキストデータ101の各小音素列に対する候補として抽出した複数の合成単位の候補104中から各小音素列毎に最適な合成単位を１つ決定する（Ｓ６）。
【００３２】
候補の絞り込み105の処理では、合成単位の検索103によって抽出された複数の合成単位の候補104の中から、データベース２０に蓄積されている高次の言語的情報を用いて候補の絞り込み105を行い、絞り込んだ合成単位の候補106を出力する。さらに、絞り込まれた合成単位の候補106を再び合成単位の候補104として候補の絞り込み105を行い、候補の合成単位を最後の１つまで絞り込むことで、最終的な合成単位を選択する。
【００３３】
本実施形態では、高次の言語的情報として、フレーズの境界を示す情報２４、フレーズが強調音声か否かを示す単語の役割情報２５、音声の韻律パターン情報２６の３つの情報のうちの１つ以上の情報を用いて合成単位の絞り込みを行っている。
【００３４】
次いで、選択した合成単位に対応付けされている音声断片をデータベース２０から取得し（Ｓ７）、取得した音声断片をテキストデータ101の音素列に対応するように順次接続し（Ｓ８）、この接続によって得られた連続した音声波形を音響部１６に送出し、音声としてスピーカーから出力する（Ｓ９）。
【００３５】
尚、前記Ｓ８の処理において音声断片を接続して得られた連続した音声波形をメモリや情報記憶媒体に音声データとして一旦蓄積しておいても良い。また、他の装置で作成された合成音声を上記のように音声データとして入力してこれを音声として出力することも容易に可能であることは言うまでもない。
【００３６】
次に、本実施形態における音声合成の要部をさらに詳細に説明する。
【００３７】
図４は、本実施形態における音声合成の要部を説明する図であり、上記検索された複数の合成単位の候補104の中から最適な合成単位を、高次の言語的情報に基づいて絞り込む流れを示している。
【００３８】
前記Ｓ３の処理によって検索された合成単位の候補104とは、入力テキストデータ101を構成する小音素列と合致する小音素列をデータベース２０から抽出したものであり、データベース２０に蓄積されている音素列を構成する小音素列と合致するものが通例合成単位の候補104として複数選択される。
【００３９】
例えば、「明日は曇りのち雨でしょう」というテキストデータ101から音声を合成しようとする場合、データベース２０に「明日は曇りです（ASITAWAKUMORIDESU）」というテキスト内容に対応する音声波形が蓄積されていた場合、そのうちの「ASHITAWAKUMORI」という部分が合成単位の候補として抽出される。
【００４０】
また、上記のうち「ASITAWAK」の部分を合成単位の候補として抽出することも可能である。即ち、本実施形態における合成単位は、モーラや、音節、文節などの区切りとは無関係に抽出される。つまり、合成単位は、目的のテキストを合成するために使用することが可能な小音素列が、上述のデータベース２０の中からマッチングして抽出されたものであれば良い。
【００４１】
ここで、合成単位を選択する際に、「ASITAWAKUMORI」と、「NOTIAMEDESYOU」として使うこともでき、また、「ASITAWA」と「KUMORINOTIAMEDESYOU」として使うこともできる。このため、可変長の合成単位をどのように選択するかということを決定する必要がある。従来は、入力テキストとデータベース中のケプストラム距離を求めたり、音韻環境や、基本周波数を比較していたが、本発明が従来技術と異なる点は、合成単位を選択する際に、前述したフレーズ境界などの高次の言語的情報を用いることである。但し、データベース２０の構築の際に、高次の言語的情報を抽出する技術としては、従来からのテキスト解析技術などを用いているため、特に従来技術との違いはない。本発明は、情報を抽出する部分ではなく、抽出した高次の言語的情報を合成単位の選択に利用する部分が従来技術との大きな違いとなっている。
【００４２】
次に、可変長の合成単位をどのように使用するかを決定する方法に関して説明する。
【００４３】
本実施形態では、前述した合成単位の絞り込み105処理において、フレーズの境界を示す情報２４、フレーズが強調音声か否かを示す単語の役割情報２５、音声の韻律パターン情報２６の３つの情報を用いている。尚、フレーズ境界情報２４、単語の役割情報２５、韻律パターン情報２６のうちの１つの情報、或いは任意の２つの情報を用いて合成単位を決定してもよい。
【００４４】
即ち、本実施形態では、これら３種類の情報を、合成単位として選択した音声断片同士の接続点における基本周波数の差（ギャップ）に対する品質劣化度合いのパラメータ及び品質評価値（以下、単に評価値と称する）を決定するために用いている。
【００４５】
一例として、品質劣化パターンのグラフを図５に示す。このグラフは、横軸が音声断片同士の接続点における合成単位の基本周波数の差を示し、縦軸が音声断片同士を接続して合成したときの品質を示している。品質評価を示す値は、合成音声を人が聞いたときの主観評価値を元にして算出している。
【００４６】
図５では、音声断片同士の接続点がフレーズ境界にあるときの接続点における基本周波数の差に対する評価値と、音声断片同士の接続点がフレーズ内にあるときの接続点における基本周波数の差に対する評価値が表されている。
【００４７】
上記接続点がフレーズ境界にあるときは、上記基本周波数の差が０，３０，６０，９０，１２０，１５０（Ｈｚ）のときの評価値がそれぞれ、９８，１００，９３，８５，９３，８７（％）である。また、上記接続点がフレーズ内にあるときは、上記基本周波数の差が０，３０，６０，９０，１２０（Ｈｚ）のときの評価値がそれぞれ、９８，８０，５８，４８，１８（％）である。
【００４８】
このように、接続点がフレーズ境界にあるときは、接続点における基本周波数の差が大きくなっても良好な評価値が得られる。また、接続点がフレーズ内にあるときは、基本周波数の差が大きくなるにつれて評価値が徐々に低下している。
【００４９】
図５のグラフから解るように、音声断片同士の接続点における基本周波数の差（ギャップ）が大きくなるに従って品質が劣化していき、その劣化度合いが接続点の条件（ここではフレーズの境界かどうか）によって違ってくる。本実施形態では、そのことを「接続点の条件から定まる品質劣化パターン」と称している。
【００５０】
上記品質劣化パターン（以下、劣化パターンと称する）から合成単位を絞り込む方法として、本実施形態では、音声断片同士の接続点における基本周波数の差と、接続点の条件から定まる劣化パターンをもとにして、その基本周波数の差における品質評価値を算出することで合成単位を絞り込んでいる。
【００５１】
即ち、本実施形態では、評価値として、予め合成した音声波形を複数用意し、これらの音声波形による音声を受聴者に評価してもらった値に基づいて設定した値を用いている。例えば、フレーズ境界で接続し且つ接続点前後の基本周波数の差が異なる複数の音声波形による音声を受聴者が評価した値と、フレーズ内で接続し且つ接続点前後の基本周波数の差が異なる複数の音声波形による音声を受聴者が評価した値、及び接続点前後の基本周波数の変動パターンが異なり且つ接続点前後の基本周波数の差が異なる複数の音声波形による音声を受聴者が評価した値に基づいて設定した値を用いている。これらの評価値としては、例えば、２つの合成単位を接続して得られた１つの音声に対して複数の受聴者が評価した値を標準化した値（例えば平均値）を用いる。この評価値も上記評価値情報１９に含めて記憶部１２に蓄積されている。ここでは、評価値として百分率を用い、１００％において肉声と変わらず、０％に近づくほど品質が劣化しているという尺度を用いている。
【００５２】
さらに、本実施形態では、目標のテキストに対応した音声を合成可能な全ての合成単位の組合せについて総合評価値を算出し、その評価値が最大となるような合成単位の組合せを求めている。
【００５３】
即ち、合成単位の選択によって目標のテキストを音声合成するための合成単位の数に違いが生じる。合成単位の数が多くなると劣化も大きくなるので、例えば各接続点における評価値の平均値を総合評価値とし、この総合評価値が最も高い合成単位を選択している。
【００５４】
一方、高次の言語的情報として利用するフレーズ境界情報２４としては、音声断片同士の接続点がフレーズの境界なのか或いはフレーズの中（フレーズ内）なのかを示す情報を用いている。音声断片同士の接続点がフレーズの境界にあるか否かで、図５に示す劣化パターンが異なったものとなる。例えば、上記接続点がフレーズの境界に存在する場合はグラフの傾斜が緩やかになり品質は余り劣化しないが、フレーズの中に存在する場合は逆に劣化しやすくなる。
【００５５】
フレーズの境界を用いて合成単位の候補を絞り込む方法としては、例えば、「ASITAWAKUMORI」と、「NOTIAMEDESYOU」という合成単位を抽出した場合、前者の末尾と後者の先頭との間で基本周波数差が３０Ｈｚで、品質評価値が８０％だったとする。この接続点はフレーズの境界ではなく、フレーズの途中で接続しているため、接続したときの品質劣化を評価するために、フレーズ内部で接続したときの劣化パターンを用いる。
【００５６】
その他の例として、「ASITAWA」と「KUMORINOTIAMEDESYOU」を合成単位として接続するときは、接続点がフレーズの境界と判断される。フレーズ境界であるか否（フレーズの内部）かを判断するには、データベース２０に蓄積されいているフレーズ境界情報２４を用いている。
【００５７】
フレーズの境界で接続した場合の劣化パターンはフレーズの内部で接続した場合の劣化パターンより緩やかであるため、例えばこのフレーズ境界における２つの合成単位の間の基本周波数の差が９０Ｈｚ以下であれば品質評価値は８５％より大きくなり、従って前述のフレーズ内部で接続するより品質評価値が高くなる。また、フレーズ内で接続した場合における２つの合成単位の間の基本周波数の差が９０Ｈｚ以上であれば、品質評価値が４８％以下となるためフレーズ境界で接続した方が品質が良くなる（図５参照）。例えば、「ASITAWA」と「KUMORINOTIAMEDESYOU」の２つの合成単位の間の接続点における基本周波数の差が９０Ｈｚであったら、この評価値は１００％（フレーズ境界）であるので、前述した「ASITAWAKUMORI」と「NOTIAMEDESYOU」の評価値が８０％（フレーズ内）であるから、「ASITAWA」と「KUMORINOTIAMEDESYOU」が合成単位として選択されることになる。
【００５８】
単語の役割情報２５としては、単語が強調されているかどうかを表す情報を用いている。ここで、単語の「強調」とは、「プロミネンス」或いは「対比強調」とも称されるもので、文音声における強めや弱めは、文中の他の部分との相対的な強弱によって行われ、このように多の部分に対して相対的に引き立たせることである。また、強調されている単語とは、文章中で重要な意味を持つ単語であり、パワーやピッチが上がっていることが多い。このため、単語が強調されているか否かで、前述したフレーズ境界の場合のように、２つの合成単位の間の基本周波数の差に違いが生じるので、劣化パターンが異なったものとなる。
【００５９】
韻律パターン情報は、音声断片同士の接続点における韻律パターンの連続性や整合性及び基本周波数の変動パターンを判定するために用いる。
【００６０】
韻律パターンの連続性では、例えば、末尾で基本周波数が緩やかに下がっている音声断片の後に、先頭において基本周波数が高い音声断片を接続すると、基本周波数が下がるはずのところが上がるため、基本周波数の変動パターンが大きく変化するので、品質が劣化しやすい。逆に、末尾で基本周波数が緩やかに下がっている音声断片の後に、先頭において基本周波数が低い音声断片を接続すると、下がり具合が大きくなるだけなので、品質は劣化し難くなる。
【００６１】
韻律パターンの整合性としては、音声断片同士の接続点における基本周波数の差の方向の±（プラス・マイナス）を見て、それが文章の持つ連続した韻律パターンの傾斜方向（変化傾向）と整合しているか否かによって、劣化パターンが異なる。
【００６２】
これら３種類の情報のうちの１つ以上を用いて、最適な合成単位を絞り込む。複数の候補の中から最後的に１つまで絞り込むと、それが選択された合成単位となる。
【００６３】
また、上記フレーズ境界情報２４、単語の役割情報２５、韻律パターン情報２６の３種類全ての情報を用いて合成単位の候補を選択する場合、例えば、２つの合成単位「ASITAWA」と「KUMORINOTIAMEDESYOU」の接続では、フレーズ境界での接続で、接続点前の合成単位の末尾及び接続点後の先頭において、強調ではなく、韻律パターンの方向性があっているという条件の劣化パターンを用いて評価値を算出している。
【００６４】
尚、本実施形態では、接続点が１箇所のみの場合を例として説明したが、接続点が複数になっても、上記の接続点が１箇所における場合と同様の選択処理を繰り返して行い、最終的に評価値が最大になるような合成単位を選択する。
【００６５】
また、本実施形態は、例えば音素の一つ一つに対応する音声波形を接続して合成音声を生成する場合にも適用可能であるが、接続点が増えるに従って、合成された音声に対する品質評価値が相対的に低くなるので、品質が低下することは言うまでもない。従って、本実施形態では、もし接続点が少なくてすむ合成単位が存在し、それが抽出されれば、そちらが合成単位として選択されることになる。
【００６６】
しかし、データベース２０の中に音素毎に音声波形が記憶されており、これらを接続せざるを得ない状況においても、本発明の音声合成方法を用いることにより、従来例に比べて、可能な限り良い合成単位を選択できるようになる。
【００６７】
また、上記音声合成プログラム１７を光ディスクや磁気ディスク、光磁気ディスク、半導体メモリなどの情報記録媒体やネットワーク、その他の通信網を介して配布することにより、多くのユーザーに容易に普及させることができることは言うまでもない。
【００６８】
【発明の効果】
以上説明したように本発明の請求項１に記載の音声合成方法によれば、高次の言語的情報を利用して合成単位を選択しているので、高品質な合成音声を安定して得ることができる。その結果、放送の分野で要求される品質を持った合成音声を作成できるので、有線放送やインターネットの多チャンネル時代においてコンテンツの不足を補うために自動でコンテンツを作成・更新することも可能になる。
【００６９】
また、請求項２に記載の音声合成装置によれば、高次の言語的情報を利用して合成単位を選択しているので、高品質な合成音声を安定して得ることができる。その結果、放送の分野で要求される品質を持った合成音声を作成でき、有線放送やインターネットの多チャンネル時代においてコンテンツの不足を補うために自動でコンテンツを作成・更新することも可能になる。
【００７０】
また、請求項３に記載の音声合成プログラムによれば、情報記録媒体やネットワーク、その他の通信網を介してコンピュータにインストールすることにより音声合成装置を容易に構成することができる。さらに、高次の言語的情報を利用して合成単位を選択しているので、高品質な合成音声を安定して得ることができる。その結果、放送の分野で要求される品質を持った合成音声を作成でき、有線放送やインターネットの多チャンネル時代においてコンテンツの不足を補うために自動でコンテンツを作成・更新することも可能になる。
【図面の簡単な説明】
【図１】本発明の一実施形態における音声合成装置を示す構成図
【図２】本発明の一実施形態における音声合成プログラムの処理を説明するフローチャート
【図３】本発明の一実施形態における音声合成プログラムによって行う高次の言語的情報用いた音声合成の流れを示す図
【図４】本発明の一実施形態における音声合成の要部を説明する図
【図５】本発明の一実施形態における品質劣化パターンのグラフの一例を示す図
【符号の説明】
１…コンピュータ、１１…中央処理部、１２…記憶部、１３…表示部、１４…入力部、１５…メモリ、１６…音響部、１７…音声合成プログラム、１８…テキスト解析辞書、１９…評価値情報、２０…音声合成データベース、２１…音声波形、２２…音素列、２３…音素の境界情報、２４…フレーズの境界情報、２５…単語の役割情報、２６…韻律パターン情報、101…テキスト、102…テキスト解析、103…合成単位の検索、104…合成単位の候補、105…候補の絞り込み、106…合成単位の候補、107…接続、108…合成音声。
[0001]
BACKGROUND OF THE INVENTION
  The present invention relates to a speech synthesis method and apparatus for stably obtaining high-quality synthesized speech, and a speech synthesis program.
[0002]
[Prior art]
  Conventionally, a great deal of effort has been expended in creating content for radio, television, cable broadcasting, the Internet, and the like. In the current multi-channel era of cable broadcasting and the Internet, creating all content by hand requires enormous cost and labor. For this reason, speech synthesis systems that generate synthesized speech from text data have attracted attention.
[0003]
  If a speech synthesis system is used, the cost for content creation can be reduced, and a large amount of content can be created in a short time.
[0004]
  As this type of speech synthesis system, for example, a corpus-based speech synthesis system using a text corpus is known.
[0005]
  Speech synthesis systems of the corpus-based type search the database for the synthesis units best suited to the sentence to be synthesized, based on low-order acoustic information and low-order physical information, and create synthesized speech by concatenating those units without modifying their prosody. By not modifying the prosody, these systems preserve the naturalness of the recorded human voice.
[0006]
  Here, low-order information refers to physical information close to the vibration of air. Examples are the fundamental frequency and the spectrum, which can be quantified directly from a speech waveform.
[0007]
[Problems to be solved by the invention]
  However, although the conventional speech synthesis systems described above preserve the naturalness of the human voice by not modifying prosody, they select synthesis units based on low-order parameters such as cepstra and phonemes, which are far removed from the sentence as a coherent whole. As a result, the selection is not consistent with the sentence as a whole: units may be joined in the middle of a phrase, or the speech may have a prosodic pattern different from the one the sentence should have, so the synthesis result is unstable.
[0008]
  That is, in a system in which a synthesis unit is selected based on parameters obtained from low-order information, the quality of synthesized speech greatly depends on the database search result. For this reason, the quality is good when a synthesis unit having a high degree of matching with the target sentence can be selected, otherwise the quality is degraded, and the quality of the synthesized speech is unstable.
[0009]
  For this reason, it is difficult to say that the conventional speech synthesis system can realize sufficient quality required for broadcasting such as the above-described radio, television, cable broadcasting, and Internet.
[0010]
  In view of the above problems, an object of the present invention is to provide a speech synthesis method and apparatus, and a speech synthesis program, that can realize a speech synthesis system with sufficient quality for broadcasting and that make it possible to automate the creation and updating of content.
[0011]
[Means for Solving the Problems]
  In order to achieve the above object, the present invention uses two databases. The first database stores three types of information (information indicating phrase boundaries, word role information indicating whether or not a phrase is emphasized speech, and speech prosodic pattern information) in association with speech waveforms, phoneme string data, and data indicating phoneme boundaries. The second database stores, for each degradation pattern defined by the word role information and the prosodic pattern information, a first value obtained by having a plurality of listeners evaluate speech produced from a plurality of speech waveforms that are connected at a phrase boundary and differ in the fundamental frequency difference across the connection point, and a second value obtained by having a plurality of listeners evaluate speech produced from a plurality of speech waveforms that are connected within a phrase and differ in the fundamental frequency difference across the connection point. Synthesized speech is generated using, as synthesis unit candidates, subphoneme sequences each consisting of at least a part of a phoneme string recorded in the database.
[0012]
  During speech synthesis, synthesis unit candidates that match a part of the phoneme string indicated by the text data are retrieved from the database. For each retrieved candidate, an evaluation value for the difference in fundamental frequency of the speech waveforms at each connection point between candidates is obtained from the first value and the second value, based on the three types of information (the information indicating phrase boundaries, the word role information, and the prosodic pattern information). The average of the evaluation values over the connection points is taken as an overall evaluation value, and the candidate with the highest overall evaluation value is selected as the synthesis unit.
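The selection rule in the paragraph above (average the per-connection-point evaluation values, then keep the candidate with the highest average) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the candidate labels, and the score of 100 for a candidate with no connection points are hypothetical and do not come from the patent.

```python
from statistics import mean

def overall_score(junction_scores):
    """Average the evaluation values (in percent) over all connection points."""
    # Assumption: a candidate made of a single unit has no connection points;
    # it is given 100 here purely for illustration.
    return mean(junction_scores) if junction_scores else 100.0

def select_best(candidates):
    """Return the label of the candidate with the highest overall score.

    `candidates` maps a hypothetical label to its list of per-junction scores.
    """
    return max(candidates, key=lambda label: overall_score(candidates[label]))

# Hypothetical example: one junction inside a phrase versus two junctions
# at phrase boundaries.
candidates = {
    "A": [80.0],
    "B": [85.0, 93.0],
}
best = select_best(candidates)
```

Averaging, rather than summing, keeps candidates with different numbers of connection points comparable on the same percentage scale.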
[0013]
  Furthermore, a speech waveform corresponding to the selected synthesis unit is extracted from the database, and the extracted speech waveform is connected in correspondence with the phoneme string indicated by the text data.
[0014]
  For example, quality degrades differently depending on whether or not the connection point between speech waveform fragments is at a phrase boundary. Likewise, an emphasized word is one that carries important meaning in the sentence and often has raised power and pitch, so quality degrades differently depending on whether or not the word is emphasized.
[0015]
  The prosodic pattern information can be used to judge the continuity and consistency of the prosodic pattern at the connection point between speech fragments. Regarding continuity, for example, if a speech fragment whose fundamental frequency is high at its beginning is connected after a speech fragment whose fundamental frequency falls gently at its end, the fundamental frequency rises where it should fall, so quality tends to deteriorate. Conversely, if a speech fragment whose fundamental frequency is low at its beginning is connected after a speech fragment whose fundamental frequency falls gently at its end, the fall merely becomes steeper, so quality is less likely to deteriorate.
[0016]
  As for the consistency of prosodic patterns, quality degrades differently depending on whether the sign (plus or minus) of the fundamental frequency difference at the connection point between speech fragments matches the slope direction of the continuous prosodic pattern of the sentence.
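As a minimal sketch of the consistency check described above, the sign of the fundamental frequency jump at the connection point can be compared with the sign of the sentence's prosodic slope. The function name, its parameters, and the treatment of a zero gap or zero slope are assumptions made for illustration; the patent does not specify them.

```python
def gap_matches_trend(f0_end_prev_hz, f0_start_next_hz, trend_slope):
    """Return True when the F0 jump at the junction has the same sign as
    the slope of the sentence's prosodic pattern.

    trend_slope: hypothetical signed slope (negative means F0 is falling).
    """
    gap = f0_start_next_hz - f0_end_prev_hz
    # Assumption: a zero gap or a flat trend is treated as consistent.
    if gap == 0 or trend_slope == 0:
        return True
    return (gap > 0) == (trend_slope > 0)

# Falling contour followed by a lower start: consistent.
falls_further = gap_matches_trend(120.0, 100.0, -1.0)
# Falling contour followed by a higher start: inconsistent.
jumps_up = gap_matches_trend(120.0, 150.0, -1.0)
```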
[0017]
  By selecting synthesis units based on evaluation values obtained using higher-order linguistic information such as the information indicating phrase boundaries, the word role information, and the prosodic pattern information, it is possible to solve the problem of unstable synthesized speech quality that arises when, as in conventional systems, low-order information is used to select synthesis units.
[0018]
DETAILED DESCRIPTION OF THE INVENTION
  Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
[0019]
  FIG. 1 is a block diagram showing a speech synthesizer according to an embodiment of the present invention. In this embodiment, a speech synthesizer is configured by installing a speech synthesis program in a computer.
[0020]
  In FIG. 1, reference numeral 1 denotes a general-purpose computer comprising a central processing unit 11 built mainly around a CPU, together with a storage unit 12, a display unit 13, an input unit 14, a memory 15 composed of semiconductor elements, and an acoustic unit 16 that serves as a sound generator and consists of a digital/analog (D/A) converter, a speaker, and the like, all connected to the central processing unit 11.
[0021]
  The storage unit 12 stores the speech synthesis program 17, the text analysis dictionary 18 for speech synthesis, and the evaluation value information 19. The various kinds of information needed for speech synthesis are also held there in the form of a speech synthesis database 20.
[0022]
  The input unit 14 is the means for inputting the text data to be synthesized and includes, for example, a keyboard, a mouse, interfaces to information recording media such as magnetic floppy disks and compact discs, and a network communication interface.
[0023]
  The speech synthesis database 20 stores, as the data needed for text-to-speech synthesis, the following items in association with each other: speech waveforms 21, phoneme string data 22 corresponding to the utterance content, data 23 indicating phoneme boundaries, information 24 indicating phrase boundaries, word role information 25 indicating whether or not a phrase is emphasized speech, and speech prosodic pattern information 26. The speech waveforms 21 are, for example, recordings of speech actually uttered by a person.
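One way to picture a single entry of such a database is a record that keeps the six associated items side by side. The sketch below is an assumption about layout, not the patent's storage format: the class name, the field names, the per-phoneme granularity of the role and prosody fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class UtteranceRecord:
    waveform: List[float]       # speech waveform 21 (samples)
    phonemes: List[str]         # phoneme string data 22
    phoneme_bounds: List[int]   # data 23: sample index where each phoneme starts, plus the end
    phrase_starts: Set[int]     # information 24: phoneme indices where a phrase begins
    emphasized: List[bool]      # word role information 25 (stored per phoneme here)
    f0_hz: List[float]          # prosodic pattern information 26 (per-phoneme F0 here)

    def fragment(self, start: int, end: int) -> List[float]:
        """Return the waveform samples covering phonemes start..end-1."""
        return self.waveform[self.phoneme_bounds[start]:self.phoneme_bounds[end]]

# Tiny made-up record with three phonemes of two samples each.
record = UtteranceRecord(
    waveform=[0.0, 0.1, 0.2, 0.1, 0.0, -0.1],
    phonemes=["A", "S", "I"],
    phoneme_bounds=[0, 2, 4, 6],
    phrase_starts={0},
    emphasized=[False, False, False],
    f0_hz=[120.0, 0.0, 118.0],
)
```

Keeping the phoneme boundaries next to the waveform is what lets a matched subphoneme sequence be cut out as a speech fragment later.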
[0024]
  In this embodiment, the speech synthesis program 17 operates the central processing unit 11 to perform speech synthesis processing. In this speech synthesis process, when generating synthesized speech adapted to kana-kanji mixed text data input via a keyboard, an information recording medium, or a network, the speech synthesis database (hereinafter simply referred to as database) 20 is used. A speech waveform fragment (hereinafter referred to as a speech fragment) as a synthesis unit is selected. In this selection, a synthesis unit candidate is selected top-down from higher-order linguistic information, and a speech fragment corresponding to this synthesis unit is selected.
[0025]
  Higher-order linguistic information refers to information whose semantic content comes from interpreting physical phenomena perceived by humans; that is, physical information such as the vibration of air that has been interpreted and given meaning through human language ability. Examples are phrase boundaries, word roles, and prosodic pattern consistency, which cannot be quantified directly from speech waveform fragments and require several levels of abstraction. In the present embodiment, the higher-order linguistic information used is: boundary information of phrases (a phrase being a speech segment uttered in one breath between pauses, i.e. a breath group); word role information (such as emphasized utterance of a word); and prosodic pattern information (such as the continuity and consistency of the prosodic pattern between the end of the speech waveform fragment before a connection point and the beginning of the speech waveform fragment after it, where a prosodic pattern is the phoneme-by-phoneme change in pitch, accent, and the like). Note that the prosodic pattern information also includes information on the variation pattern of the fundamental frequency. In addition to emphasized-utterance information, the word role information can include information on emotions such as joy, anger, sorrow, and pleasure, and information obtained by judging emotional expressions such as amazement, enthusiasm, and disappointment from the surrounding context.
[0026]
  With such higher-order linguistic information, high-quality synthesized speech can be output more stably than when synthesis units are selected using only low-order information. Furthermore, using higher-order linguistic information solves problems of the prior art such as an unnatural impression at the connection points between selected speech fragments and a prosodic pattern that does not fit the meaning of the sentence.
[0027]
  Therefore, according to the present embodiment, since a synthesis unit is selected using high-order linguistic information, high-quality synthesized speech can be obtained stably. As a result, it is possible to create synthesized speech having the quality required in the field of broadcasting, and it is also possible to automatically create and update content in order to make up for the lack of content in the multi-channel era of cable broadcasting and the Internet.
[0028]
  FIG. 2 is a flowchart for explaining the processing of the speech synthesis program 17 in the present embodiment, and FIG. 3 is a diagram showing the flow of speech synthesis using high-order linguistic information performed by the speech synthesis program 17. Specific operations will be described with reference to these drawings.
[0029]
  When text data 101 of a kana-kanji mixed sentence, created with a word processor or the like for speech synthesis, is input (S1), analysis 102 of the text data 101 is performed (S2). The text analysis 102 uses the text analysis dictionary 18 and outputs, for the input kana-kanji text 101, information such as the kana reading, the phoneme string corresponding to the reading, the duration of each phoneme, the accent type of each phrase, how the phrases are joined, and the prosodic pattern corresponding to the accent types and the way the phrases are joined.
[0030]
  Next, a synthesis unit search 103 is performed using the text analysis result. In the synthesis unit search 103, the phoneme string of the text data 101 is divided into a plurality of subphoneme sequences, the database 20 is searched for subphoneme sequences that match the text data 101 at the phoneme level (S3), and the subphoneme sequences extracted from the database 20 by the search become the synthesis unit candidates 104 used for speech synthesis (S4). Here, a subphoneme sequence is a part of a phoneme string made up of a plurality of phonemes, and itself consists of one or more phonemes.
[0031]
  Further, candidate narrowing 105 is performed using the higher-order linguistic information described above (S5), and, for each subphoneme sequence of the text data 101, one optimal synthesis unit is determined from among the plurality of synthesis unit candidates 104 extracted for it (S6).
[0032]
  In the candidate narrowing 105, the plurality of synthesis unit candidates 104 extracted by the synthesis unit search 103 are narrowed down using the higher-order linguistic information stored in the database 20, and the narrowed-down synthesis unit candidates 106 are output. The narrowed-down candidates 106 are then fed back as synthesis unit candidates 104 and narrowed again; repeating this until a single candidate remains selects the final synthesis unit.
[0033]
  In the present embodiment, the synthesis units are narrowed down using one or more of the following three pieces of higher-order linguistic information: the information 24 indicating phrase boundaries, the word role information 25 indicating whether or not a phrase is emphasized speech, and the speech prosodic pattern information 26.
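The search in steps S3 and S4 can be sketched as a brute-force match of every contiguous piece of the target phoneme string against the phoneme strings in the database. This is an illustrative sketch only: the function name, the tuple layout of a hit, the utterance identifier, and the simplification of treating each romanized letter as one phoneme symbol are assumptions, and a practical system would likely use an index rather than nested loops.

```python
def find_candidates(target, database):
    """Return (target_start, target_end, utt_id, db_start) for every
    contiguous piece of `target` that also occurs in a database utterance."""
    hits = []
    n = len(target)
    for start in range(n):
        for end in range(start + 1, n + 1):
            piece = target[start:end]
            for utt_id, phonemes in database.items():
                # Slide the piece over the stored phoneme string.
                for pos in range(len(phonemes) - len(piece) + 1):
                    if phonemes[pos:pos + len(piece)] == piece:
                        hits.append((start, end, utt_id, pos))
    return hits

# Simplification: one letter of the romanized text stands for one phoneme.
target = list("ASITAWAKUMORINOTIAMEDESYOU")
database = {"utt1": list("ASITAWAKUMORIDESU")}  # "utt1" is a hypothetical id
hits = find_candidates(target, database)
```

Each hit is a synthesis unit candidate of variable length; both the 13-symbol piece ASITAWAKUMORI and the shorter ASITAWAK are found in the same stored utterance.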
[0034]
  Next, the speech fragment associated with the selected synthesis unit is acquired from the database 20 (S7), and the acquired speech fragment is sequentially connected so as to correspond to the phoneme string of the text data 101 (S8). The obtained continuous speech waveform is sent to the acoustic unit 16 and outputted from the speaker as speech (S9).
[0035]
  Note that the continuous speech waveform obtained by connecting speech fragments in the process of S8 may be temporarily stored as speech data in a memory or information storage medium. Needless to say, it is also possible to easily input synthesized speech created by another apparatus as speech data and output it as speech.
[0036]
  Next, the main part of the speech synthesis in this embodiment will be described in more detail.
[0037]
  FIG. 4 is a diagram explaining the main part of the speech synthesis in the present embodiment. It shows the flow by which the optimal synthesis unit is narrowed down from the plurality of retrieved synthesis unit candidates 104 based on higher-order linguistic information.
[0038]
  The synthesis unit candidates 104 retrieved by the process of S3 are subphoneme sequences extracted from the database 20 that match subphoneme sequences of the input text data 101. Usually, several subphoneme sequences within the phoneme strings stored in the database 20 match, so a plurality of synthesis unit candidates 104 are selected.
[0039]
  For example, suppose speech is to be synthesized from the text data 101 “Tomorrow will be cloudy, then rain”. If the database 20 stores a speech waveform corresponding to the text “Tomorrow will be cloudy” (ASITAWAKUMORIDESU), the portion “ASITAWAKUMORI” is extracted as a synthesis unit candidate.
[0040]
  The “ASITAWAK” portion can also be extracted as a synthesis unit candidate. That is, synthesis units in the present embodiment are extracted without regard to boundaries such as morae, syllables, or bunsetsu (phrasal units). In other words, a synthesis unit may be any subphoneme sequence that can be used to synthesize the target text and that has been matched and extracted from the database 20 described above.
[0041]
  Here, when selecting synthesis units, the text can be covered by “ASITAWAKUMORI” and “NOTIAMEDESYOU”, or by “ASITAWA” and “KUMORINOTIAMEDESYOU”. It is therefore necessary to decide how to select among variable-length synthesis units. Conventionally, this was done by computing the cepstrum distance between the input text and the database, or by comparing phonemic environments and fundamental frequencies. The present invention differs from the prior art in that it uses higher-order linguistic information, such as the phrase boundaries described above, when selecting synthesis units. Note that conventional text analysis techniques are used to extract the higher-order linguistic information when the database 20 is constructed, so that part does not differ from the prior art. The significant difference from the prior art lies not in extracting the information but in using the extracted higher-order linguistic information to select synthesis units.
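The two alternative splits in the paragraph above are two ways of tiling the same target string with available candidates. A minimal sketch that enumerates every such tiling is shown below; the function name and the representation of a candidate as a (start, end) span over the target are assumptions for illustration, and each resulting tiling would then be scored at its connection points.

```python
def segmentations(n, spans):
    """Yield every way to cover positions 0..n with back-to-back spans.

    spans: iterable of (start, end) pairs, one per available candidate.
    """
    by_start = {}
    for start, end in spans:
        by_start.setdefault(start, []).append(end)

    def extend(pos, path):
        if pos == n:
            yield list(path)
            return
        for end in sorted(by_start.get(pos, [])):
            path.append((pos, end))
            yield from extend(end, path)
            path.pop()

    yield from extend(0, [])

# ASITAWAKUMORINOTIAMEDESYOU has 26 symbols when one letter is one phoneme.
# Spans: ASITAWAKUMORI, NOTIAMEDESYOU, ASITAWA, KUMORINOTIAMEDESYOU.
spans = {(0, 13), (13, 26), (0, 7), (7, 26)}
options = list(segmentations(26, spans))
```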
[0042]
  Next, a method for determining how to use variable-length synthesis units will be described.
[0043]
  In the present embodiment, in the above-described synthesis unit narrowing-down process 105, three pieces of information are used: information 24 indicating the boundaries of phrases, word role information 25 indicating whether the phrases are emphasized speech, and speech prosodic pattern information 26. ing. Note that the synthesis unit may be determined using one of the phrase boundary information 24, the word role information 25, the prosodic pattern information 26, or any two pieces of information.
[0044]
  In other words, in the present embodiment, these three types of information are used as parameters for quality degradation and quality evaluation values (hereinafter simply referred to as evaluation values) with respect to a difference (gap) between fundamental frequencies at connection points between speech fragments selected as synthesis units. Used to determine).
[0045]
  As an example, a graph of a quality deterioration pattern is shown in FIG. 5. In this graph, the horizontal axis indicates the difference in the fundamental frequency of the synthesis units at the connection point between the speech fragments, and the vertical axis indicates the quality when the speech fragments are connected and synthesized. The value indicating the quality is calculated from subjective evaluation values obtained when people listen to the synthesized speech.
[0046]
  FIG. 5 shows the evaluation value against the difference in fundamental frequency at the connection point, both for the case where the connection point between speech fragments is at a phrase boundary and for the case where it is within a phrase.
[0047]
  When the connection point is at a phrase boundary, the evaluation values for fundamental frequency differences of 0, 30, 60, 90, 120, and 150 Hz are 98, 100, 93, 85, 93, and 87%, respectively. When the connection point is within a phrase, the evaluation values for fundamental frequency differences of 0, 30, 60, 90, and 120 Hz are 98, 80, 58, 48, and 18%, respectively.
[0048]
  Thus, when the connection point is at the phrase boundary, a good evaluation value can be obtained even if the difference in fundamental frequency at the connection point increases. When the connection point is in the phrase, the evaluation value gradually decreases as the fundamental frequency difference increases.
[0049]
  As can be seen from the graph of FIG. 5, the quality deteriorates as the difference (gap) between the fundamental frequencies at the connection point of the speech fragments increases, and the degree of deterioration varies with the condition of the connection point (in this case, whether or not it is a phrase boundary). In the present embodiment, this is referred to as the “quality degradation pattern determined from the condition of the connection point”.
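The two curves described for FIG. 5 can be held as small lookup tables. The sketch below uses the evaluation values listed in the text. The names `BOUNDARY`, `IN_PHRASE`, and `evaluation`, the linear interpolation between measured points, and the clamping beyond the last measured gap are assumptions made for illustration; the patent does not state how intermediate gaps are handled.

```python
from bisect import bisect_right

# Evaluation values (%) from the FIG. 5 description, keyed by F0 gap (Hz).
BOUNDARY = [(0, 98), (30, 100), (60, 93), (90, 85), (120, 93), (150, 87)]
IN_PHRASE = [(0, 98), (30, 80), (60, 58), (90, 48), (120, 18)]

def evaluation(gap_hz: float, at_boundary: bool) -> float:
    """Return the evaluation value (%) for an F0 gap at a connection point.

    Assumption: values between measured gaps are linearly interpolated,
    and gaps beyond the last measured point reuse the last value.
    """
    points = BOUNDARY if at_boundary else IN_PHRASE
    gap = abs(gap_hz)
    xs = [x for x, _ in points]
    if gap >= xs[-1]:
        return float(points[-1][1])
    i = bisect_right(xs, gap)
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (gap - x0) / (x1 - x0)

boundary_90 = evaluation(90, at_boundary=True)
in_phrase_90 = evaluation(90, at_boundary=False)
```

For a 90 Hz gap the tables give 85% at a phrase boundary and 48% within a phrase, which reflects the gentler slope of the phrase-boundary curve.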
[0050]
  As a method of narrowing down the synthesis units from the quality deterioration pattern (hereinafter referred to as the deterioration pattern), in this embodiment the synthesis units are narrowed down by calculating the quality evaluation value for the difference between the fundamental frequencies, based on the difference in fundamental frequency at the connection point between the speech fragments and on the deterioration pattern determined from the condition of the connection point.
[0051]
  That is, in the present embodiment, a plurality of synthesized speech waveforms are prepared, and values set based on listeners' evaluations of the speech produced from these waveforms are used as the evaluation values. For example, the values used are set based on: the values given by listeners to speech produced from a plurality of speech waveforms that are connected at a phrase boundary and differ in the fundamental frequency difference before and after the connection point; the values given by listeners to speech produced from a plurality of speech waveforms that are connected within a phrase and differ in the fundamental frequency difference before and after the connection point; and the values given by listeners to speech produced from a plurality of speech waveforms that differ in the fundamental frequency fluctuation pattern before and after the connection point and in the fundamental frequency difference before and after the connection point. As these evaluation values, values obtained by standardizing the ratings given by a plurality of listeners to one utterance formed by connecting two synthesis units (for example, their average) are used. These evaluation values are also included in the evaluation value information 19 and stored in the storage unit 12. Here, a percentage is used as the evaluation value, on a scale in which 100% indicates no change in quality and the quality deteriorates as the value approaches 0%.
[0052]
  Furthermore, in the present embodiment, a total evaluation value is calculated for all combinations of synthesis units that can synthesize speech corresponding to the target text, and a combination of synthesis units that maximizes the evaluation value is obtained.
[0053]
  That is, the number of synthesis units needed to synthesize the target text differs depending on which synthesis units are selected, and the degradation increases as the number of synthesis units increases. For example, the average of the evaluation values at the connection points is taken as the overall evaluation value, and the combination of synthesis units with the highest overall evaluation value is selected.
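The averaging and selection described here can be sketched as follows. The function names and all numeric values are illustrative and are not taken from the patent's measurements. Scoring a candidate with no connection points at 100% is also an assumption, made because nothing is joined in that case.

```python
from statistics import mean

def overall_evaluation(connection_values: list[float]) -> float:
    """Average the per-connection evaluation values (%).

    Assumption: a candidate with no connection points scores 100%.
    """
    return mean(connection_values) if connection_values else 100.0

def select_best(candidates: dict[tuple[str, ...], list[float]]) -> tuple[str, ...]:
    """Return the unit combination with the highest overall evaluation."""
    return max(candidates, key=lambda units: overall_evaluation(candidates[units]))

# Illustrative evaluation values, one per connection point of each candidate.
candidates = {
    ("ASITAWA", "KUMORINOTIAMEDESYOU"): [93.0],
    ("ASITAWAKUMORI", "NOTIAMEDESYOU"): [80.0],
    ("ASITAWA", "KUMORI", "NOTIAMEDESYOU"): [93.0, 58.0],
}
best = select_best(candidates)
```

With these illustrative numbers the three-unit candidate averages 75.5%, so one poor connection pulls it below both two-unit candidates.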
[0054]
  As the phrase boundary information 24 used as higher-order linguistic information, information indicating whether the connection point between speech fragments is at a phrase boundary or within a phrase is used. The deterioration pattern shown in FIG. 5 differs depending on whether or not the connection point between the speech fragments is at a phrase boundary. For example, when the connection point is at a phrase boundary, the slope of the graph is gentle and the quality does not deteriorate much, whereas when the connection point is within a phrase, the quality deteriorates easily.
[0055]
  As a method of narrowing down synthesis unit candidates using phrase boundaries, suppose, for example, that the synthesis units “ASITAWAKUMORI” and “NOTIAMEDESYOU” are extracted, that the fundamental frequency difference between the end of the former and the beginning of the latter is 30 Hz, and that the quality evaluation value is 80%. Since this connection point lies not at a phrase boundary but in the middle of a phrase, the deterioration pattern for connection inside a phrase is used to evaluate the quality deterioration at the time of connection.
[0056]
  As another example, when “ASITAWA” and “KUMORINOTIAMEDESYOU” are connected as synthesis units, the connection point is determined to be a phrase boundary. The phrase boundary information 24 stored in the database 20 is used to determine whether a connection point is at a phrase boundary or inside a phrase.
[0057]
  The deterioration pattern for connection at a phrase boundary is more gradual than that for connection inside a phrase. For example, if the difference in fundamental frequency between two synthesis units at a phrase boundary is 90 Hz or less, the quality evaluation value is 85% or higher, and is therefore higher than for the connection inside the phrase described above. Also, if the difference in fundamental frequency between two synthesis units connected within a phrase is 90 Hz or more, the quality evaluation value is 48% or less, so the quality is better when connecting at a phrase boundary (FIG. 5). For example, if the fundamental frequency difference at the connection point between the two synthesis units “ASITAWA” and “KUMORINOTIAMEDESYOU” is 90 Hz, the evaluation value is 100% (phrase boundary), whereas the evaluation value for the above-mentioned “ASITAWAKUMORI” and “NOTIAMEDESYOU” is 80% (within a phrase); therefore “ASITAWA” and “KUMORINOTIAMEDESYOU” are selected as the synthesis units.
[0058]
  As the word role information 25, information indicating whether the word is emphasized is used. Here, “emphasis”, also called “prominence” or “contrastive emphasis”, means making a part stand out relative to the other parts of a sentence; strength in sentence-level speech is determined relative to the other parts of the sentence. Emphasized words are words that carry an important meaning in the text, and they often have increased power and pitch. For this reason, the difference in fundamental frequency between two synthesis units varies depending on whether or not the word is emphasized, as in the case of the phrase boundary described above, and the deterioration patterns therefore differ.
[0059]
  The prosodic pattern information is used to determine the continuity and consistency of the prosodic pattern at the connection point between speech fragments and the variation pattern of the fundamental frequency.
[0060]
  Regarding the continuity of the prosodic pattern, if, for example, a speech fragment with a high fundamental frequency at its head is connected after a speech fragment whose fundamental frequency falls gently at its end, the fundamental frequency rises at the connection point and the fluctuation pattern of the fundamental frequency changes greatly, so the quality tends to deteriorate. Conversely, if a speech fragment with a lower fundamental frequency at its head is connected after a speech fragment whose fundamental frequency falls gently at its end, the quality is unlikely to deteriorate because only the degree of the fall increases.
[0061]
  As for the consistency of prosodic patterns, the sign (plus or minus) of the difference in fundamental frequency at the connection point between speech fragments is examined, and the deterioration pattern varies depending on whether or not that sign matches the direction of slope (tendency of change) of the continuous prosodic pattern of the sentence.
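The sign comparison described above can be illustrated with a small helper. This is only a sketch under assumptions: the function name `gap_matches_trend`, the use of the first and last F0 samples of the preceding fragment to estimate the trend, and the sample values in Hz are all hypothetical, and the patent does not specify how the trend is measured.

```python
def gap_matches_trend(f0_tail: list[float], f0_next_head: float) -> bool:
    """Check whether the F0 jump at a connection point follows the contour.

    f0_tail: F0 samples (Hz) at the end of the preceding fragment.
    f0_next_head: F0 (Hz) at the head of the following fragment.
    Assumption: the trend is the difference between the last and first
    samples of the tail; a zero trend or zero gap counts as matching.
    """
    trend = f0_tail[-1] - f0_tail[0]
    gap = f0_next_head - f0_tail[-1]
    return trend * gap >= 0

# A gently falling tail followed by a higher head: the direction reverses.
rising_jump = gap_matches_trend([140.0, 135.0, 130.0], 160.0)
# The same tail followed by a lower head: the fall simply continues.
falling_jump = gap_matches_trend([140.0, 135.0, 130.0], 110.0)
```

The result could serve as one of the conditions that selects which deterioration pattern applies at a connection point.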
[0062]
  One or more of these three types of information are used to narrow down to the optimum synthesis unit. When a plurality of candidates has finally been narrowed down to one, that candidate becomes the selected synthesis unit.
[0063]
  Also, when synthesis unit candidates are selected using all three types of information, namely the phrase boundary information 24, the word role information 25, and the prosodic pattern information 26, the evaluation value for, for example, the two synthesis units “ASITAWA” and “KUMORINOTIAMEDESYOU” is calculated using the deterioration pattern that applies under the conditions of this connection: the connection is at a phrase boundary, the words are not emphasized, and the directionality of the prosodic pattern is that found at the end of the synthesis unit before the connection point and at the head of the synthesis unit after it.
[0064]
  In this embodiment, the case where there is only one connection point has been described as an example. However, even when there are a plurality of connection points, the same selection process as for a single connection point is repeated, and the synthesis units that finally give the maximum evaluation value are selected.
[0065]
  The present embodiment can also be applied to a case where, for example, speech waveforms corresponding to individual phonemes are connected to generate synthesized speech. However, it goes without saying that as the number of connection points increases, the quality evaluation value for the synthesized speech becomes relatively low and the quality decreases. Therefore, in this embodiment, if synthesis units that require fewer connection points exist and are extracted, they are selected as the synthesis units.
[0066]
  However, even in a situation where speech waveforms are stored in the database 20 for each phoneme and must be connected, the speech synthesis method of the present invention can select synthesis units that are as good as possible compared with the conventional example.
[0067]
  Further, it goes without saying that the speech synthesis program 17 can easily be spread to many users by distributing it on an information recording medium such as an optical disk, a magnetic disk, a magneto-optical disk, or a semiconductor memory, or through a network or other communication network.
[0068]
[Effects of the invention]
  As described above, according to the speech synthesis method of claim 1 of the present invention, since synthesis units are selected using higher-order linguistic information, high-quality synthesized speech can be obtained stably. As a result, it is possible to create synthesized speech with the quality required in the field of broadcasting, and it is also possible to automatically create and update content to compensate for the lack of content in the multi-channel era of cable broadcasting and the Internet.
[0069]
  Also, according to the speech synthesizer of claim 2, since synthesis units are selected using higher-order linguistic information, high-quality synthesized speech can be obtained stably. As a result, it is possible to create synthesized speech with the quality required in the field of broadcasting, and it is also possible to automatically create and update content to compensate for the lack of content in the multi-channel era of cable broadcasting and the Internet.
[0070]
  Also, according to the speech synthesis program of claim 3, the speech synthesizer can easily be configured by installing the program on a computer via an information recording medium, a network, or another communication network. Furthermore, since synthesis units are selected using higher-order linguistic information, high-quality synthesized speech can be obtained stably. As a result, it is possible to create synthesized speech with the quality required in the field of broadcasting, and it is also possible to automatically create and update content to compensate for the lack of content in the multi-channel era of cable broadcasting and the Internet.
[Brief description of the drawings]
FIG. 1 is a configuration diagram showing a speech synthesizer according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating processing of a speech synthesis program according to an embodiment of the present invention.
FIG. 3 is a diagram showing a flow of speech synthesis using high-order linguistic information performed by a speech synthesis program according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a main part of speech synthesis in one embodiment of the present invention.
FIG. 5 is a diagram showing an example of a graph of a quality deterioration pattern in one embodiment of the present invention.
[Explanation of symbols]
  1 ... Computer, 11 ... Central processing unit, 12 ... Storage unit, 13 ... Display unit, 14 ... Input unit, 15 ... Memory, 16 ... Sound unit, 17 ... Speech synthesis program, 18 ... Text analysis dictionary, 19 ... Evaluation value information, 20 ... Speech synthesis database, 21 ... Speech waveform, 22 ... Phoneme sequence, 23 ... Phoneme boundary information, 24 ... Phrase boundary information, 25 ... Word role information, 26 ... Prosodic pattern information, 101 ... Text, 102 ... Text analysis, 103 ... Search for synthesis units, 104 ... Synthesis unit candidates, 105 ... Narrowing down of candidates, 106 ... Synthesis unit candidates, 107 ... Connection, 108 ... Synthesized speech.

Claims

A speech synthesis method in which a computer converts text into speech based on text data, the method using:
a first database in which three types of information, namely information indicating phrase boundaries, word role information indicating whether a phrase is emphasized speech, and prosodic pattern information of the speech, are stored in association with speech waveforms, phoneme sequence data, and data indicating phoneme boundaries; and
a second database in which, for each deterioration pattern determined by the word role information and the prosodic pattern information, a first value obtained by a plurality of listeners evaluating speech produced from a plurality of speech waveforms that are connected at a phrase boundary and differ in the fundamental frequency difference before and after the connection point, and a second value obtained by a plurality of listeners evaluating speech produced from a plurality of speech waveforms that are connected within a phrase and differ in the fundamental frequency difference before and after the connection point, are stored as information,
wherein a small phoneme sequence consisting of at least a part of a phoneme sequence recorded in the first database is set as a synthesis unit candidate,
synthesis unit candidates that match a part of the phoneme sequence indicated by the text data are searched for in the first database,
for each of the retrieved synthesis unit candidates, an evaluation value for the difference in the fundamental frequency of the speech waveforms at the connection point between synthesis unit candidates is obtained from the first and second values stored in the second database, based on the information indicating phrase boundaries, the word role information, and the prosodic pattern information,
an average of the evaluation values at the connection points is obtained as a comprehensive evaluation value, and the synthesis unit candidates with the highest comprehensive evaluation value are selected as the synthesis units,
speech waveforms corresponding to the selected synthesis units are extracted from the first database, and
the extracted speech waveforms are connected in correspondence with the phoneme sequence indicated by the text data.

A speech synthesizer that converts text into speech based on text data, comprising:
a first database in which three types of information, namely information indicating phrase boundaries, word role information indicating whether a phrase is emphasized speech, and prosodic pattern information of the speech, are stored in association with speech waveforms, phoneme sequence data, and data indicating phoneme boundaries;
a second database in which, for each deterioration pattern determined by the word role information and the prosodic pattern information, a first value obtained by a plurality of listeners evaluating speech produced from a plurality of speech waveforms that are connected at a phrase boundary and differ in the fundamental frequency difference before and after the connection point, and a second value obtained by a plurality of listeners evaluating speech produced from a plurality of speech waveforms that are connected within a phrase and differ in the fundamental frequency difference before and after the connection point, are stored as information;
extraction means for setting a small phoneme sequence consisting of at least a part of a phoneme sequence recorded in the first database as a synthesis unit candidate, and for searching for and extracting from the first database synthesis unit candidates that match a part of the phoneme sequence indicated by the text data;
selection means for obtaining, for each synthesis unit candidate extracted by the extraction means, an evaluation value for the difference in the fundamental frequency of the speech waveforms at the connection point between synthesis unit candidates from the first and second values stored in the second database, based on the information indicating phrase boundaries, the word role information, and the prosodic pattern information, for obtaining an average of the evaluation values at the connection points as a comprehensive evaluation value, and for selecting the synthesis unit candidates with the highest comprehensive evaluation value as the synthesis units; and
connection means for extracting speech waveforms corresponding to the synthesis units selected by the selection means from the first database and connecting the extracted speech waveforms in correspondence with the phoneme sequence indicated by the text data.

A speech synthesis program for causing a computer to execute speech synthesis processing that converts text into speech based on text data, the computer being connectable to a first database in which three types of information, namely information indicating phrase boundaries, word role information indicating whether a phrase is emphasized speech, and prosodic pattern information of the speech, are stored in association with speech waveforms, phoneme sequence data, and data indicating phoneme boundaries, and to a second database in which, for each deterioration pattern determined by the word role information and the prosodic pattern information, a first value obtained by a plurality of listeners evaluating speech produced from a plurality of speech waveforms that are connected at a phrase boundary and differ in the fundamental frequency difference before and after the connection point, and a second value obtained by a plurality of listeners evaluating speech produced from a plurality of speech waveforms that are connected within a phrase and differ in the fundamental frequency difference before and after the connection point, are stored as information, the program comprising:
a step of setting a small phoneme sequence consisting of at least a part of a phoneme sequence recorded in the first database as a synthesis unit candidate, and searching the first database for synthesis unit candidates that match a part of the phoneme sequence indicated by the text data;
a step of obtaining, for each of the retrieved synthesis unit candidates, an evaluation value for the difference in the fundamental frequency of the speech waveforms at the connection point between synthesis unit candidates from the first and second values stored in the second database, based on the information indicating phrase boundaries, the word role information, and the prosodic pattern information;
a step of obtaining an average of the evaluation values at the connection points as a comprehensive evaluation value, and selecting the synthesis unit candidates with the highest comprehensive evaluation value as the synthesis units;
a step of extracting speech waveforms corresponding to the selected synthesis units from the first database; and
a step of connecting the extracted speech waveforms in correspondence with the phoneme sequence indicated by the text data.