JP3626398B2

JP3626398B2 - Text-to-speech synthesizer, text-to-speech synthesis method, and recording medium recording the method

Info

Publication number: JP3626398B2
Application number: JP2000233297A
Authority: JP
Inventors: 智一森尾; 浩幸勘座
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2000-08-01
Filing date: 2000-08-01
Publication date: 2005-03-09
Anticipated expiration: 2020-08-01
Also published as: JP2002049386A

Description

【０００１】
【発明の属する技術分野】
本発明は、テキストから音声を生成するテキスト音声合成装置、テキスト音声合成方法及びその方法を記録した記録媒体に関する。
【０００２】
【従来の技術】
この種の従来の装置は、例えば図５に示す様に構成されている。図５において、テキスト解析器１０２は、日本語の漢字かな混じりの単語や文章等を示すテキスト情報（例えば「左」）を入力端子１０１から入力し、単語や文章等の読みを示す読み情報（例えば音素列「ｈｉｄａｒｉ」）を韻律生成器１０３及び音声素片選択器１０４に出力する。尚、単語や文章等の読みを示す読み情報を入力端子１０１に入力して、これを韻律生成器１０３及び音声素片選択器１０４に与えても構わない。この場合は、テキスト解析器１０２が省略される。
【０００３】
韻律生成器１０３は、読み情報を入力すると、この読み情報から韻律情報を生成し、この韻律情報を音声合成器１０６に出力する。この韻律情報とは、音声の高さや大きさ、あるいは継続時間を示す。音声の高さは、母音の音素を示す音声波形のピッチ周波数により設定される。例えば音素列「ｈｉｄａｒｉ」における母音の音素は、「ｉ」、「ａ」及び「ｉ」であるから、これらの母音の音素を示す音声波形のピッチ周波数が設定される。音声の大きさ及び継続時間は、各音素毎に、音素を示す音声波形の振幅、及び音素を示す音声波形の継続時間として設定される。例えば各音素「ｈ」、「ｉ」、「ｄ」、「ａ」、「ｒ」及び「ｉ」毎に、音声波形の振幅及び継続時間が設定される。
【０００４】
音声素片選択器１０４は、読み情報を入力すると、この読み情報である音素列より少なくとも１つの音素からなる音声単位を抽出する。
【０００５】
音声単位としては、子音＋母音（ＣＶ：Ｃｏｎｓｏｎａｎｔ，Ｖｏｗｅｌ）、あるいは母音＋子音＋母音（ＶＣＶ）等がある。前者のＣＶの音声単位は、例えば「ｋａ」や「ｇｕ」である。後者のＶＣＶの音声単位は、音声を高品質化するために音素連鎖の過渡部の特徴を保持したものであり、例えば「ａｋｉ」や「ｉｔｏ」である。
【０００６】
ここでは、ＶＣＶの音声単位を用いるものとする。この場合、音声素片選択器１０４は、音素列「ｈｉｄａｒｉ」からＶＣＶの各音声単位「＊ｈｉ」、「ｉｄａ」、「ａｒｉ」及び「ｉ＊＊」（＊は無音記号である）を抽出する。そして、音声素片選択器１０４は、これらの音声単位に対応するそれぞれの音声素片データを音声素片データベース１０５から検索して、これらの音声素片データを音声合成器１０６に出力する。
【０００７】
音声素片データベース１０５には、多数の音声素片データが予め登録されている。これらの音声素片データは、例えばアナウンサーの音声を示す音声データをそれぞれの音声単位で適宜に切り出し、更に音声合成に適切な形式に変換したものである。一般的な日本語のテキスト音声合成に用いるＶＣＶの音声単位は、約８００個程度であり、これらの音声単位に対応する各音声素片データが音声素片データベース１０５に予め登録されている。
【０００８】
音声合成器１０６は、韻律生成器１０３からの韻律情報及び音声素片選択器１０４からの各音声素片データを入力すると、韻律情報に基づいて、各音声素片データによって示されるそれぞれの音声の高さや大きさ、及び継続時間を調節した上で、これらの音声をそれぞれの母音区間で滑らかに接続してなる音声データを形成し、この音声データを出力端子１０７から出力する。
【０００９】
ところで、この様な装置においては、１つの音声単位に対して１つの音声素片データが登録されており、如何なる単語であっても、同一の音声単位が含まれていれば、同一の音声素片データが用いられる。このため、１つの単語に含まれる音声単位に適切な音声素片データであっても、他の単語に含まれる同一の音声単位には不適切なことがある。つまり、音声データを音声に変換したときに、前者の単語を明瞭に聞き取ることができても、後者の他の単語を聞き取ることができないという事態を招く。
【００１０】
例えば、特開平２−２０５８９６号公報に記載の「音声合成装置」おいては、人名等の固有名詞と、通常の文章とを区別して、固有名詞のための音声素片データ群と、通常の文章のための音声素片データ群とを登録しておき、２つの音声素片データ群を使い分けている。これによって、固有名詞が明瞭に発音され、通常の文章が滑らかに発音される。
【００１１】
【発明が解決しようとする課題】
しかしがら、上記従来の装置では、如何なる固有名詞であっても、同一の音声単位が含まれていれば、同一の音声素片データが用いられる。例えば、「ｙａｍａｇｕｃｈｉ」と「ｔａｇｕｃｈｉ」では、同一の音声単位「ａｇｕ」が含まれているので、これらの固有名詞に対して、音声単位「ａｇｕ」に対応する同一の音声素片データが用いられる。この場合、例えば一方の「ｙａｍａｇｕｃｈｉ」の発音が特徴的でなくなり、「ｙａｍａｇｕｃｈｉ」の発音と他の類似する「ｙａｍａｕｃｈｉ」の発音とを聞き分けることが困難になる。
【００１２】
また、固有名詞のために専用の音声素片データ群を登録するので、登録されるデータ量が略２倍となり、膨大になった。
【００１３】
そこで、本発明は、上記従来の課題に鑑みてなされたものであり、予め登録される音声素片データのデータ量の増加を必要最小限に抑えながら、如何なる単語であっても明瞭に発音することが可能なテキスト音声合成装置、テキスト音声合成方法及びその方法を記録した記録媒体を提供することを目的とする。
【００１４】
【課題を解決するための手段】
上記課題を解決するために、本発明は、少なくとも１つの音素からなるそれぞれの音声単位に対応する各音声素片データを登録した音声素片データベースを備え、単語や文章の読みを示す音素列に基づいて、音声素片データベースから少なくとも１つの音声素片データを検索し、検索した音声素片データ及び単語や文章の韻律情報に基づいて、単語や文章の音声を合成するテキスト音声合成装置において、特定の単語の読みを示す音素列に含まれる少なくとも１つの音素からなる特定の音声単位に対応する例外音声素片データを登録した例外音声素片データベースと、特定の単語の音声を合成するときは、特定の音声単位については、この音声単位に対応する音声素片データを音声素片データベースから検索する代わりに、この音声単位に対応する例外音声素片データを例外音声素片データベースから検索する音声データ検索手段とを備えている。
【００１５】
この様な構成の本発明によれば、通常は、音声単位に対応する音声素片データを音声素片データベースから検索して用いている。また、特定の単語の音声を合成するときは、特定の音声単位については、例外音声素片データを例外音声素片データベースから検索して用いている。これによって、この特定の単語の音声がが明瞭になる。
【００１６】
また、本発明の装置においては、複数の単語を登録した語彙リストと、語彙リストから読みが類似する各単語を選択し、選択した各単語のうちの少なくとも一方の読みを示す音素列より、少なくとも１つの音素からなる特定の音声単位を抽出し、抽出した特定の音声単位に対応する例外音声素片データを例外音声素片データベースから検索するための例外音声素片テーブルを作成する類似単語抽出手段とを備えている。
【００１７】
ここでは、読みが類似する各単語のうちの少なくとも一方から、特定の音声単位を抽出し、この特定の音声単位に対応する例外音声素片データを例外音声素片データベースから検索するための例外音声素片テーブルを作成している。これによって、例外音声素片テーブルを自動的に作成することができる。特定の音声単位を抽出した単語の音声を合成するときは、例外音声素片テーブルを検索して、特定の音声単位に対応する例外音声素片データを求め、例外音声素片データベースから例外音声素片データを検索する。この例外素片データを用いて、単語の音声を合成すれば、この単語を他の類似の単語と明瞭に聞き分けることができる。
【００１８】
更に、本発明の装置においては、例外音声素片データベース内の例外音声素片データは、子音が強調された音声を示している。
【００１９】
これにより特定の単語の子音が強調され、この特定の単語を他の類似の単語と聞き違えることがなくなる。
【００２０】
一方、本発明は、少なくとも１つの音素からなるそれぞれの音声単位に対応する各音声素片データを登録した音声素片データベースを用いており、単語や文章の読みを示す音素列に基づいて、音声素片データベースから少なくとも１つの音声素片データを検索し、検索した音声素片データ及び単語や文章の韻律情報に基づいて、単語や文章の音声を合成するテキスト音声合成方法において、特定の単語の読みを示す音素列に含まれる少なくとも１つの音素からなる特定の音声単位に対応する例外音声素片データを登録した例外音声素片データベースを用いており、特定の単語の音声を合成するときは、特定の音声単位については、この音声単位に対応する音声素片データを音声素片データベースから検索する代わりに、この音声単位に対応する例外音声素片データを例外音声素片データベースから検索している。
【００２１】
この様な本発明の方法によれば、本発明の装置と同様に、特定の単語が明瞭に発音される。
【００２２】
また、本発明の方法においては、複数の単語を登録した語彙リストを用いており、語彙リストから読みが類似する各単語を選択し、選択した各単語のうちの少なくとも一方の読みを示す音素列より、少なくとも１つの音素からなる特定の音声単位を抽出し、抽出した特定の音声単位に対応する例外音声素片データを例外音声素片データベースから検索するための例外音声素片テーブルを作成している。
【００２３】
ここでも、本発明の装置と同様に、例外音声素片テーブルを自動的に作成することができる。
【００２４】
更に、本発明の方法においては、例外音声素片データベース内の例外音声素片データは、子音が強調された音声を示している。
【００２５】
この場合も、本発明の装置と同様に、特定の単語を他の類似の単語と聞き違えずに済む。
【００２６】
更に、本発明は、上記テキスト音声合成方法をプログラムとして記録した記録媒体をも含む。
【００２７】
【発明の実施の形態】
以下、本発明の実施形態を添付図面を参照して詳細に説明する。
【００２８】
図１は、本発明のテキスト音声合成装置の第１実施形態を示すブロック図である。図１に示す様に本実施形態のテキスト音声合成装置１０は、テキスト情報を入力端子１１を通じて入力し、このテキスト情報に基づいて読み情報を生成するテキスト解析器１２と、読み情報をテキスト解析器１２から入力し、この読み情報に基づいて韻律情報を生成する韻律生成器１３と、音声素片データ群を登録した音声素片データベース１４と、例外音声素片データ群を登録した例外音声素片データベース１５と、音声素片データベース１４及び例外音声素片データベース１５を切替えて使い分ける音声素片データベース切替器１６と、音声素片データベース切替器１６によって参照される例外音声素片テーブル１７と、読み情報をテキスト解析器１２から入力し、この読み情報に基づいて、少なくとも１つの音声素片データを音声素片データベース１４又は例外音声素片データベース１５より検索する音声素片選択器１８と、韻律情報を韻律生成器１３から入力すると共に、音声素片データを音声素片選択器１８から入力し、韻律情報及び音声素片データに基づいて音声データを生成し、この音声データを出力端子１９から出力する音声合成器２０とを備えている。
【００２９】
音声素片データベース１４には、多数のＶＣＶの音声単位に対応するそれぞれの音声素片データが予め登録されている。ここでは、一般的な日本語のテキスト音声合成を行うので、約８００個程度の音声素片データが登録されている。
【００３０】
また、例外音声素片データベース１５には、特定の単語の音素列に含まれる特定の音声単位に対応する例外音声素片データが予め登録されている。例えば、特定の単語の音素列「ｙａｍａｇｕｃｈｉ」に含まれる音声単位「ａｇｕ」に対応する例外音声素片データが登録されている。この例外音声素片データの数は、格別に制限されるものではない。
【００３１】
更に、例外音声素片テーブル１７は、例外音声素片データベース１５内の各例外音声素片データを検索するためのものである。例えば、図２に示す様に特定の各単語の音素列「ｙａｍａｇｕｃｈｉ」、「ｋｉｓｈｉｎｏ」、「ｔａｇｕｃｈｉ」及び「ｓａｋａｇｕｃｈｉ」と、各単語の音素列に含まれる特定の各音声単位「ａｇｕ」、「＊ｋｉ」（＊は無音記号である）、「ａｇｕ」及び「ａｇｕ」と、特定の各音声単位を識別するための識別番号「１」、「２」、「３」及び「１」とが対応づけて登録されている。同一の音声単位に同一の識別番号が付与されていれば、これらの単語の音声単位は、例外音声素片データベース１５内の同一の例外音声素片データに対応する。また、同一の音声単位に相互に異なるそれぞれの識別番号が付与されていれば、これらの音声単位は、例外音声素片データベース１５内の相互に異なるそれぞれの例外音声素片データに対応する。例えば、各単語の音素列「ｙａｍａｇｕｃｈｉ」及び「ｓａｋａｇｕｃｈｉ」の同一の音声単位「ａｇｕ」に同一の識別番号「１」が付与されているので、この音声単位「ａｇｕ」は、例外音声素片データベース１５内の同一の例外音声素片データに対応する。また、各単語の音素列「ｙａｍａｇｕｃｈｉ」及び「ｔａｇｕｃｈｉ」の同一の音声単位「ａｇｕ」に相互に異なるそれぞれの識別番号「１」及び「３」が付与されているので、これらの単語の音声単位「ａｇｕ」は、例外音声素片データベース１５内の相互に異なるそれぞれの例外音声素片データに対応する。
【００３２】
尚、例外音声素片テーブル１７において、１つの音声単位に対応する例外音声素片データの数が制限されることはない。また、音声素片データベース１４には、全ての音声単位に対応するそれぞれの音声素片データが登録されているので、各音声単位「ａｇｕ」及び「＊ｋｉ」に対応するそれぞれの音声素片データも登録されている。
【００３３】
さて、例えばテキスト情報として「山口さん」が入力端子１１を通じてテキスト解析器１２に入力されると、テキスト解析器１２は、「山口さん」の読み情報、つまり音素列「ｙａｍａｇｕｃｈｉｓａＮ」を韻律生成器１３及び音声素片選択器１８に出力する。
【００３４】
韻律生成器１３は、この音素列「ｙａｍａｇｕｃｈｉｓａＮ」を入力すると、この音素列の各音素毎に、音声の高さや大きさ、及び継続時間を求め、各音素の音声の高さや大きさ、及び継続時間を示す韻律情報を生成して、この韻律情報を音声合成器２０に出力する。図３は、この韻律情報を例示する図表である。この図表においては、各音素「ｙ」、「ａ」、「ｍ」、「ａ」、「ｇ」、「ｕ」、…に対応して音素の継続時間、音素の音声の高さを示すピッチ周波数（Ｈｚ）、音声の大きさ（ｄＢ）が設定されている。音素の音声の高さは、音素が母音であるときにのみ設定され、音素が子音であるときには設定されない。従って、音声の高さは、母音のピッチ周波数により決まる。
【００３５】
次に、音声素片選択器１８は、音素列「ｙａｍａｇｕｃｈｉｓａＮ」を入力すると、この音素列からＶＣＶの各音声単位「＊ｙａ」、「ａｍａ」、「ａｇｕ」、「ｕｃｈｉ」、「ｉｓａ」及び「ａＮ＊」を抽出する。また、音声素片選択器１８は、単語の音素列「ｙａｍａｇｕｃｈｉ」を音声素片データベース切替器１６に与える。
【００３６】
音声素片データベース切替器１６は、図２に示す例外音声素片テーブル１７を参照して、単語の音素列「ｙａｍａｇｕｃｈｉ」を検索し、この単語に対応する特定の音声単位「ａｇｕ」及び識別番号「１」を読み取る。そして、音声素片選択器１８によって特定の音声単位「ａｇｕ」の検索が行われるときに、音声素片データベース切替器１６は、例外音声素片データベース１５を音声素片選択器１８に与える。音声素片選択器１８は、例外音声素片データベース１５内の特定の音声単位「ａｇｕ」に対応する複数の例外音声素片データのうちから、識別番号１が付与されている例外音声素片データを選択する。また、音声素片選択器１８によって他の各音声単位「＊ｙａ」、「ａｍａ」、「ｕｃｈｉ」、「ｉｓａ」及び「ａＮ＊」の検索が行われるときに、音声素片データベース切替器１６は、音声素片データベース１４を音声素片選択器１８に与える。音声素片選択器１８は、音声素片データベース１４内の各音声単位「＊ｙａ」、「ａｍａ」、「ｕｃｈｉ」、「ｉｓａ」及び「ａＮ＊」に対応するそれぞれの音声素片データを検索する。
【００３７】
こうして特定の音声単位「ａｇｕ」と識別番号１に対応する例外音声素片データ、及び各音声単位「＊ｙａ」、「ａｍａ」、「ｕｃｈｉ」、「ｉｓａ」及び「ａＮ＊」に対応するそれぞれの音声素片データが検索され、例外音声素片データ及び各音声素片データが音声合成器２０に与えられる。
【００３８】
音声合成器２０は、韻律生成器１３からの図３に示す韻律情報及び音声素片選択器１８からの例外音声素片データ及び各音声素片データを入力すると、韻律情報に基づいて、例外音声素片データ及び各音声素片データによって示されるそれぞれの音声の高さや大きさ、及び継続時間を調節した上で、これらの音声をそれぞれの母音区間で滑らかに接続してなる音声データを形成し、この音声データを出力端子１９から出力する。
【００３９】
この様に本実施形態では、基本的な各音声単位に対応するそれぞれの音声素片データを音声素片データベース１４に予め登録すると共に、特定の単語に含まれる特定の音声単位に対応する例外音声素片データを例外音声素片データベース１５に予め登録しておき、特定の単語の音声データを合成するときには、この特定の単語に含まれる特定の音声単位に対応する例外音声素片データを例外音声素片データベース１５から検索して用いている。このため、この特定の単語を示す音声データを音声に変換したときに、この特定の単語を明瞭に聞き取ることができる。例えば、図２から明らかな様に「ｙａｍａｇｕｃｈｉ」の音声単位「ａｇｕ」と、「ｔａｇｕｃｈｉ」の音声単位「ａｇｕ」を区別して、それぞの音声素片データを用いるので、これらの単語の発音のいずれも特徴的なものとなり、更には「ｙａｍａｇｕｃｈｉ」の発音と他の類似の「ｙａｍａｕｃｈｉ」の発音とを聞き分けることが可能になる。
【００４０】
また、ここでは、特定の単語として、固有名詞を例示しているが、他の如何なる種類の単語であっても、特定の単語として扱うことができる。このため、例外音声素片テーブル１７の内容及び例外音声素片データベース１５の内容を適宜に設定すれば、類似した各名詞が区別して発音される様な音声データを合成することが可能である。
【００４１】
図４は、本発明のテキスト音声合成装置の第２実施形態を示すブロック図である。尚、図４において、図１の装置１０と同様の作用を果たす部位には同じ符号を付して説明を簡略化する。
【００４２】
本実施形態のテキスト音声合成装置３０においては、図１の装置１０に、発声語彙リスト３１及び類似単語抽出器３２を付加したものである。従って、図１の装置１０と同様に、テキスト解析器１２によってテキスト情報が解析され、韻律生成器１３によって韻律情報が生成され、音声素片選択器１８によって音声素片データベース１４内の音声素片データ及び例外音声素片データベース１５内の例外音声素片データが検索され、音声合成器２０によって音声データが合成される。
【００４３】
さて、発声語彙リスト３２には、例えば多数の人名を示すテキスト情報が予め登録されている。類似単語抽出器３２は、これらの人名を相互に比較して、類似度の高い２つの単語、つまり十分に類似した２つの単語を抽出する。
【００４４】
類似度の高い２つの単語を抽出する方法としては、例えば２つの単語をそれぞれの音素列に変換し、これらの音素列を比較して、１つの音素だけが各音素列の違いであったり、１つの音素だけが一方の音素列に挿入されていたり、１つの音素だけが一方の音素列から脱落しているときに、これらの単語の類似度が高いとみなして、これらの単語を抽出する。
【００４５】
また、別の方法としては、全ての音素に対応するそれぞれのスペクトルパラメータを予め登録しておき、これらのスペクトルパラメータを参照して、２つの単語の音素列からそれぞれのスペクトル時系列を生成し、これらのスペクトル時系列を音声認識の分野で広く利用されているＤＰマッチングの手法でマッチングさせた上で、例えば各単語間の音響的な距離（より具体的にはスペクトル距離等）を計算して求め、この距離が近ければ、これらの単語の類似度が高いとみなして、これらの単語を抽出する。
【００４６】
次に、類似単語抽出器３２は、類似度の高い２つの単語のうちの一方の特徴的な音声単位、つまり各単語の違いを求める。例えば、一方の単語の音素列に１つの音素が挿入されている場合は、この挿入されている音素を含む音声単位が特徴的なものである。また、各単語間の音響的な距離が近い場合は、各単語間の音響的な距離が最も遠くなる一方の単語の音声単位が特徴的なものである。
【００４７】
例えば、音素列「ｙａｍａｇｕｃｈｉ」と音素列「ｙａｍａｕｃｈｉ」の各単語が類似度の高いとみなされて抽出された場合は、一方の音素列「ｙａｍａｇｕｃｈｉ」に音素「ｇ」が挿入されているので、この音素「ｇ」を含む音声単位「ａｇｕ」が特徴的なものとして求められる。
【００４８】
こうして類似度の高い２つの単語のうちの一方の特徴的な音声単位を求めた後、類似単語抽出器３２は、この特徴的な音声単位を含む単語と該音声単位を対応させて例外音声素片テーブル１７に登録する。
【００４９】
この様に本実施形態では、類似度の高い２つの単語が発声語彙リスト３２から抽出され、特徴的な音声単位を含む一方の単語と該音声単位を対応させて例外音声素片テーブル１７に自動的に登録することができる。このため、例外音声素片テーブル１７の内容を登録するための手間を省くことができる。例外音声素片テーブル１７に登録されている単語の音声データを合成するときには、この単語に含まれる音声単位が例外音声素片テーブル１７から検索され、この音声単位に対応する例外音声素片データが例外音声素片データベース１５から読み出され、この例外音声素片データが用いられて、この単語の音声データが合成されることになる。
【００５０】
ここでは、登録が予想される音声単位に対応する音声素片データを例外音声素片データベース１５に予め登録しておく必要がある。また、例外音声素片データとして、子音を強調した音声を示すものが良い。この子音を強調した音声素片データを用いて単語の音声データを合成すると、この音声データを音声に変換したときには、単語の子音が強調される。これにより、この単語の音声と他の類似の単語の音声を明瞭に聞き分けることが可能になる。
【００５１】
尚、本実施形態では、多数の人名を発声語彙リスト３２に登録しているが、多数の地名や名詞、あるいは他の種類の単語を発声語彙リスト３２に登録しておき、これらの単語のうちから類似する各単語を抽出しても構わない。
【００５２】
また、例外音声素片テーブル１７に登録されていない他の単語については、音声素片データベース１４内の基本的な各音声素片データを用いて、その音声データを生成するので、文章の発音が不自然になることはない。
【００５３】
尚、本発明は、上記実施形態に限定されるものでなく、多様に変形することができる。例えば、音声単位として、ＶＣＶの音声単位を例示しているが、ＣＶの音声単位あるいは他の種類の音声単位を適用しても構わない。
【００５４】
また、単語や文章等の読みを示す読み情報を入力端子１１に入力して、これを韻律生成器１３及び音声素片選択器１８に与えても構わない。この場合は、テキスト解析器１２が省略される。
【００５５】
更に、本発明は、テキスト音声合成方法をプログラムとして記録した記録媒体をも含む。
【００５６】
【発明の効果】
以上説明した様に本発明によれば、通常は、音声単位に対応する音声素片データを音声素片データベースから検索して用いている。また、特定の単語の音声を合成するときは、特定の音声単位については、例外音声素片データを例外音声素片データベースから検索して用いている。これによって、この特定の単語の音声が明瞭になる。
【００５７】
また、本発明によれば、読みが類似する各単語のうちの少なくとも一方から、特定の音声単位を抽出し、この特定の音声単位に対応する例外音声素片データを例外音声素片データベースから検索するための例外音声素片テーブルを作成している。これによって、例外音声素片テーブルを自動的に作成することができる。特定の音声単位を抽出した単語の音声を合成するときは、例外音声素片テーブルを検索して、特定の音声単位に対応する例外音声素片データを求め、例外音声素片データベースから例外音声素片データを検索する。この例外素片データを用いて、単語の音声を合成すれば、この単語を他の類似の単語と明瞭に聞き分けることができる。
【００５８】
更に、本発明によれば、特定の単語の子音を強調しているので、この単語を他の類似の単語と聞き違えずに済む。
【図面の簡単な説明】
【図１】本発明のテキスト音声合成装置の第１実施形態を示すブロック図である。
【図２】図１の装置における例外音声素片テーブルの構成を例示する図である。
【図３】韻律情報の内容を示す図表である。
【図４】本発明のテキスト音声合成装置の第２実施形態を示すブロック図である。
【図５】従来の音声合成装置を例示するブロック図である。
【符号の説明】
１０，３０テキスト音声合成装置
１１入力端子
１２テキスト解析器
１３韻律生成器
１４音声素片データベース
１５例外音声素片データベース
１６音声素片データベース切替器
１７例外音声素片テーブル
１８音声素片選択器
１９出力端子
２０音声合成器
３１発声語彙リスト
３２類似単語抽出器
[0001]
[Technical field of the invention]
The present invention relates to a text-to-speech synthesizer that generates speech from text, a text-to-speech synthesis method, and a recording medium that records the method.
[0002]
[Prior art]
A conventional apparatus of this type is configured, for example, as shown in FIG. 5. In FIG. 5, the text analyzer 102 receives, from the input terminal 101, text information (for example, “左” (“left”)) representing a word or sentence written in mixed Japanese kanji and kana, and outputs reading information indicating the reading of the word or sentence (for example, the phoneme string “hidari”) to the prosody generator 103 and the speech segment selector 104. Alternatively, reading information indicating the reading of a word or sentence may be input directly to the input terminal 101 and supplied to the prosody generator 103 and the speech segment selector 104. In that case, the text analyzer 102 is omitted.
[0003]
When the prosody generator 103 receives the reading information, it generates prosodic information from the reading information and outputs it to the speech synthesizer 106. The prosodic information indicates the pitch, loudness, and duration of the speech. The pitch is set by the pitch frequency of the speech waveform representing each vowel phoneme. For example, since the vowel phonemes in the phoneme string “hidari” are “i”, “a”, and “i”, a pitch frequency is set for the speech waveform representing each of these vowels. The loudness and duration are set for each phoneme as the amplitude and the duration of the speech waveform representing that phoneme. For example, the amplitude and duration of the speech waveform are set for each of the phonemes “h”, “i”, “d”, “a”, “r”, and “i”.
[0004]
When the speech segment selector 104 receives the reading information, it extracts speech units, each consisting of at least one phoneme, from the phoneme string that constitutes the reading information.
[0005]
Speech units include consonant + vowel (CV: Consonant, Vowel) units and vowel + consonant + vowel (VCV) units. Examples of CV speech units are “ka” and “gu”. VCV speech units retain the characteristics of the transitions between phonemes in order to improve speech quality; examples are “aki” and “ito”.
[0006]
Here, VCV speech units are used. In this case, the speech segment selector 104 extracts the VCV speech units “*hi”, “ida”, “ari”, and “i**” (where * is the silence symbol) from the phoneme string “hidari”. The speech segment selector 104 then retrieves the speech segment data corresponding to these speech units from the speech segment database 105 and outputs the data to the speech synthesizer 106.
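As a rough illustration of the VCV segmentation described above, the following Python sketch reproduces the example split of “hidari”. The function name `extract_vcv_units`, the vowel set, and the rule for padding the word edges with the silence symbol are assumptions made for illustration; the specification gives only the resulting units, not a procedure.

```python
from typing import List

VOWELS = set("aiueo")   # assumption: romanized Japanese vowels
SILENCE = "*"           # silence symbol used in the specification


def extract_vcv_units(phonemes: str) -> List[str]:
    """Split a romanized phoneme string into VCV-style units.

    Each unit runs from one vowel to the next vowel. Silence stands in
    for the missing vowel before the word, and for the missing
    consonant and/or vowel after the word.
    """
    units = []
    left = SILENCE   # vowel slot that opens the current unit
    middle = ""      # consonant characters seen since the last vowel
    for ch in phonemes:
        if ch in VOWELS:
            units.append(left + middle + ch)
            left, middle = ch, ""
        else:
            middle += ch
    # Closing unit: pad the consonant slot with silence if it is empty.
    units.append(left + (middle or SILENCE) + SILENCE)
    return units
```

Under these assumptions, `extract_vcv_units("hidari")` yields the four units listed in the text. Consonant clusters such as “ch” are simply carried along as part of the consonant slot.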
[0007]
A large number of speech segment data items are registered in advance in the speech segment database 105. Each item is obtained by cutting speech data representing, for example, an announcer's speech into individual speech units and converting the result into a format suitable for speech synthesis. About 800 VCV speech units are used for general Japanese text-to-speech synthesis, and the speech segment data corresponding to these speech units is registered in the speech segment database 105 in advance.
[0008]
When the speech synthesizer 106 receives the prosodic information from the prosody generator 103 and the speech segment data from the speech segment selector 104, it adjusts, based on the prosodic information, the pitch, loudness, and duration of the speech represented by each speech segment data item, forms speech data by smoothly connecting these speech segments in their vowel sections, and outputs the speech data from the output terminal 107.
[0009]
In such an apparatus, one speech segment data item is registered for each speech unit, and the same speech segment data is used for any word that contains the same speech unit. Consequently, speech segment data that suits a speech unit in one word may be unsuitable for the same speech unit in another word. That is, when the speech data is converted into speech, the former word may be heard clearly while the latter word cannot be made out.
[0010]
For example, the “speech synthesizer” described in Japanese Patent Laid-Open No. 2-205896 distinguishes proper nouns such as personal names from ordinary sentences, registers one speech segment data group for proper nouns and another for ordinary sentences, and uses the two groups selectively. As a result, proper nouns are pronounced clearly and ordinary sentences are pronounced smoothly.
[0011]
[Problems to be solved by the invention]
However, in the above conventional apparatus, the same speech segment data is used for any proper noun that contains the same speech unit. For example, “yamaguchi” and “taguchi” both contain the speech unit “agu”, so the same speech segment data corresponding to “agu” is used for both proper nouns. In this case, the pronunciation of “yamaguchi”, for example, loses its distinctiveness, and it becomes difficult to distinguish the pronunciation of “yamaguchi” from that of the similar “yamauchi”.
[0012]
In addition, because a dedicated speech segment data group is registered for proper nouns, the amount of registered data approximately doubles and becomes enormous.
[0013]
The present invention has been made in view of the above conventional problems, and its object is to provide a text-to-speech synthesizer, a text-to-speech synthesis method, and a recording medium on which the method is recorded, each capable of pronouncing any word clearly while keeping the increase in the amount of pre-registered speech segment data to the necessary minimum.
[0014]
[Means for Solving the Problems]
In order to solve the above problems, the present invention provides a text-to-speech synthesizer that includes a speech segment database in which speech segment data corresponding to each speech unit consisting of at least one phoneme is registered, retrieves at least one speech segment data item from the speech segment database based on a phoneme string indicating the reading of a word or sentence, and synthesizes the speech of the word or sentence based on the retrieved speech segment data and prosodic information of the word or sentence, the synthesizer comprising: an exceptional speech segment database in which exceptional speech segment data is registered, the exceptional speech segment data corresponding to a specific speech unit that consists of at least one phoneme and is included in a phoneme string indicating the reading of a specific word; and speech data retrieval means that, when the speech of the specific word is synthesized, retrieves, for the specific speech unit, the exceptional speech segment data corresponding to that speech unit from the exceptional speech segment database instead of retrieving the speech segment data corresponding to that speech unit from the speech segment database.
[0015]
According to the present invention having this configuration, speech segment data corresponding to each speech unit is normally retrieved from the speech segment database and used. When the speech of a specific word is synthesized, however, exceptional speech segment data is retrieved from the exceptional speech segment database and used for the specific speech unit. As a result, the speech of the specific word becomes clear.
[0016]
The apparatus of the present invention further comprises a vocabulary list in which a plurality of words are registered, and similar word extraction means that selects words with similar readings from the vocabulary list, extracts a specific speech unit consisting of at least one phoneme from the phoneme string indicating the reading of at least one of the selected words, and creates an exceptional speech segment table for retrieving the exceptional speech segment data corresponding to the extracted specific speech unit from the exceptional speech segment database.
[0017]
Here, a specific speech unit is extracted from at least one of the words with similar readings, and an exceptional speech segment table is created for retrieving the exceptional speech segment data corresponding to this specific speech unit from the exceptional speech segment database. The exceptional speech segment table can thus be created automatically. When the speech of a word from which a specific speech unit has been extracted is synthesized, the exceptional speech segment table is searched to identify the exceptional speech segment data corresponding to the specific speech unit, and that data is retrieved from the exceptional speech segment database. Synthesizing the speech of the word with this exceptional speech segment data makes it possible to distinguish the word clearly from other similar words.
[0018]
Furthermore, in the apparatus of the present invention, the exceptional speech segment data in the exceptional speech segment database indicates speech in which consonants are emphasized.
[0019]
As a result, the consonant of a specific word is emphasized, and the specific word is not mistaken for another similar word.
[0020]
The present invention also provides a text-to-speech synthesis method that uses a speech segment database in which speech segment data corresponding to each speech unit consisting of at least one phoneme is registered, retrieves at least one speech segment data item from the speech segment database based on a phoneme string indicating the reading of a word or sentence, and synthesizes the speech of the word or sentence based on the retrieved speech segment data and prosodic information of the word or sentence, wherein the method uses an exceptional speech segment database in which exceptional speech segment data is registered, the exceptional speech segment data corresponding to a specific speech unit that consists of at least one phoneme and is included in a phoneme string indicating the reading of a specific word, and wherein, when the speech of the specific word is synthesized, the exceptional speech segment data corresponding to the specific speech unit is retrieved from the exceptional speech segment database instead of the speech segment data corresponding to that speech unit being retrieved from the speech segment database.
[0021]
According to the method of the present invention, a specific word is pronounced clearly as in the apparatus of the present invention.
[0022]
Further, the method of the present invention uses a vocabulary list in which a plurality of words are registered, selects words with similar readings from the vocabulary list, extracts a specific speech unit consisting of at least one phoneme from the phoneme string indicating the reading of at least one of the selected words, and creates an exceptional speech segment table for retrieving the exceptional speech segment data corresponding to the extracted specific speech unit from the exceptional speech segment database.
[0023]
Here too, as in the apparatus of the present invention, the exceptional speech segment table can be created automatically.
[0024]
Furthermore, in the method of the present invention, the exceptional speech segment data in the exceptional speech segment database indicates speech in which consonants are emphasized.
[0025]
In this case as well, as with the apparatus of the present invention, a specific word is not mistaken for other similar words.
[0026]
Furthermore, the present invention also includes a recording medium that records the text-to-speech synthesis method as a program.
[0027]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
[0028]
FIG. 1 is a block diagram showing a first embodiment of the text-to-speech synthesizer of the present invention. As shown in FIG. 1, the text-to-speech synthesizer 10 of this embodiment comprises: a text analyzer 12 that receives text information through an input terminal 11 and generates reading information based on the text information; a prosody generator 13 that receives the reading information from the text analyzer 12 and generates prosodic information based on it; a speech segment database 14 in which a group of speech segment data is registered; an exceptional speech segment database 15 in which a group of exceptional speech segment data is registered; a speech segment database switcher 16 that switches between the speech segment database 14 and the exceptional speech segment database 15; an exceptional speech segment table 17 that is referred to by the speech segment database switcher 16; a speech segment selector 18 that receives the reading information from the text analyzer 12 and, based on it, retrieves at least one speech segment data item from the speech segment database 14 or the exceptional speech segment database 15; and a speech synthesizer 20 that receives the prosodic information from the prosody generator 13 and the speech segment data from the speech segment selector 18, generates speech data based on the prosodic information and the speech segment data, and outputs the speech data from an output terminal 19.
[0029]
In the speech segment database 14, speech segment data corresponding to each of a large number of VCV speech units is registered in advance. Since general Japanese text-to-speech synthesis is performed here, about 800 speech segment data items are registered.
[0030]
In the exceptional speech segment database 15, exceptional speech segment data corresponding to a specific speech unit included in the phoneme string of a specific word is registered in advance. For example, exceptional speech segment data corresponding to the speech unit “agu” included in the phoneme string “yamaguchi” of a specific word is registered. The number of exceptional speech segment data items is not particularly limited.
[0031]
The exceptional speech segment table 17 is used to look up each exceptional speech segment data item in the exceptional speech segment database 15. For example, as shown in FIG. 2, the phoneme strings “yamaguchi”, “kishino”, “taguchi”, and “sakaguchi” of the specific words, the specific speech units “agu”, “*ki” (where * is the silence symbol), “agu”, and “agu” included in those phoneme strings, and the identification numbers “1”, “2”, “3”, and “1” for identifying the specific speech units are registered in association with one another. If the same identification number is assigned to the same speech unit, the speech units of those words correspond to the same exceptional speech segment data in the exceptional speech segment database 15. If different identification numbers are assigned to the same speech unit, the speech units correspond to different exceptional speech segment data in the exceptional speech segment database 15. For example, the same identification number “1” is assigned to the speech unit “agu” of the phoneme strings “yamaguchi” and “sakaguchi”, so this speech unit corresponds to the same exceptional speech segment data in the exceptional speech segment database 15 for both words. In contrast, the different identification numbers “1” and “3” are assigned to the speech unit “agu” of the phoneme strings “yamaguchi” and “taguchi”, so the speech unit “agu” of these words corresponds to different exceptional speech segment data in the exceptional speech segment database 15.
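The table of FIG. 2 can be pictured as a simple mapping from a word's phoneme string to a pair of speech unit and identification number. The sketch below is only one possible representation: the names `EXCEPTION_TABLE` and `share_exception_segment` are hypothetical, and the four rows are the ones quoted in the text.

```python
# Word phoneme string -> (specific speech unit, identification number).
# Rows taken from the FIG. 2 example quoted in the description.
EXCEPTION_TABLE = {
    "yamaguchi": ("agu", 1),
    "kishino": ("*ki", 2),
    "taguchi": ("agu", 3),
    "sakaguchi": ("agu", 1),
}


def share_exception_segment(word_a: str, word_b: str) -> bool:
    """True if both words map to the same unit AND the same identification
    number, i.e. to the same exceptional speech segment data."""
    entry_a = EXCEPTION_TABLE.get(word_a)
    entry_b = EXCEPTION_TABLE.get(word_b)
    return entry_a is not None and entry_a == entry_b
```

With this representation, “yamaguchi” and “sakaguchi” share one data item for “agu”, whereas “yamaguchi” and “taguchi” do not, because their identification numbers differ.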
[0032]
In the exceptional speech segment table 17, the number of exceptional speech segment data items corresponding to one speech unit is not limited. In addition, since speech segment data corresponding to every speech unit is registered in the speech segment database 14, the speech segment data corresponding to the speech units “agu” and “*ki” is also registered there.
[0033]
For example, when “Yamaguchi-san” (山口さん) is input as text information to the text analyzer 12 through the input terminal 11, the text analyzer 12 outputs the reading information of “Yamaguchi-san”, that is, the phoneme string “yamaguchisaN”, to the prosody generator 13 and the speech segment selector 18.
[0034]
When the prosody generator 13 receives the phoneme string “yamaguchisaN”, it determines the pitch, loudness, and duration of the speech for each phoneme of the string, generates prosodic information indicating these values, and outputs the prosodic information to the speech synthesizer 20. FIG. 3 is a chart illustrating this prosodic information. In this chart, the duration of the phoneme, the pitch frequency (Hz) indicating the pitch of the phoneme, and the loudness (dB) are set for each of the phonemes “y”, “a”, “m”, “a”, “g”, “u”, and so on. The pitch is set only when the phoneme is a vowel and is not set when the phoneme is a consonant. The pitch of the speech is therefore determined by the pitch frequencies of the vowels.
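One way to picture a row of the prosodic information chart is a small record with an optional pitch field. The following sketch is illustrative only: the class name `PhonemeProsody`, the field names, and the millisecond unit for duration are assumptions, and no values from FIG. 3 are reproduced.

```python
from dataclasses import dataclass
from typing import Optional

VOWELS = {"a", "i", "u", "e", "o"}  # assumption: romanized Japanese vowels


@dataclass
class PhonemeProsody:
    phoneme: str
    duration_ms: float         # duration unit is an assumption
    pitch_hz: Optional[float]  # None when no pitch is set (consonants)
    loudness_db: float


def pitch_is_consistent(entry: PhonemeProsody) -> bool:
    """Check the rule from the text: pitch is set for vowels only."""
    return (entry.pitch_hz is not None) == (entry.phoneme in VOWELS)
```

A row for a consonant would therefore carry `pitch_hz=None`, while a row for a vowel carries a pitch frequency in Hz.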
[0035]
Next, when the speech segment selector 18 receives the phoneme string “yamaguchisaN”, it extracts the VCV speech units “*ya”, “ama”, “agu”, “uchi”, “isa”, and “aN*” from the phoneme string. The speech segment selector 18 also supplies the phoneme string “yamaguchi” of the word to the speech segment database switcher 16.
[0036]
The speech segment database switcher 16 refers to the exceptional speech segment table 17 shown in FIG. 2, searches for the phoneme string “yamaguchi” of the word, and reads the specific speech unit “agu” and the identification number “1” corresponding to this word. When the speech segment selector 18 searches for the specific speech unit “agu”, the speech segment database switcher 16 presents the exceptional speech segment database 15 to the speech segment selector 18. The speech segment selector 18 then selects, from among the plurality of exceptional speech segment data items corresponding to the specific speech unit “agu” in the exceptional speech segment database 15, the item to which the identification number 1 is assigned. When the speech segment selector 18 searches for the other speech units “*ya”, “ama”, “uchi”, “isa”, and “aN*”, the speech segment database switcher 16 presents the speech segment database 14 to the speech segment selector 18. The speech segment selector 18 retrieves the speech segment data corresponding to each of the speech units “*ya”, “ama”, “uchi”, “isa”, and “aN*” from the speech segment database 14.
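The switching behaviour described above can be sketched as a per-unit choice between two lookup tables. This is a simplified model, not the disclosed circuit: `select_segments` and the string placeholders standing in for waveform data are hypothetical, and the sketch swaps in the exceptional data for every occurrence of the specific unit in the utterance, which is an assumption the text does not spell out.

```python
from typing import Dict, List, Tuple


def select_segments(
    units: List[str],
    word: str,
    exception_table: Dict[str, Tuple[str, int]],
    base_db: Dict[str, str],
    exception_db: Dict[Tuple[str, int], str],
) -> List[str]:
    """Pick segment data for each unit, using the exceptional database
    only for the unit registered for `word` in the exception table."""
    entry = exception_table.get(word)  # (unit, id) or None
    selected = []
    for unit in units:
        if entry is not None and unit == entry[0]:
            selected.append(exception_db[(unit, entry[1])])
        else:
            selected.append(base_db[unit])
    return selected


# Placeholder data: strings stand in for real speech segment data.
units = ["*ya", "ama", "agu", "uchi", "isa", "aN*"]
table = {"yamaguchi": ("agu", 1), "taguchi": ("agu", 3)}
base_db = {u: "base:" + u for u in units}
exception_db = {("agu", 1): "exception:agu#1", ("agu", 3): "exception:agu#3"}
result = select_segments(units, "yamaguchi", table, base_db, exception_db)
```

Here only the third element of `result` comes from the exceptional database; the other five come from the ordinary database, mirroring the “yamaguchisaN” walk-through.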
[0037]
In this way, the exceptional speech segment data corresponding to the specific speech unit “agu” and the identification number 1, and the speech segment data corresponding to each of the speech units “*ya”, “ama”, “uchi”, “isa”, and “aN*”, are retrieved and supplied to the speech synthesizer 20.
[0038]
When the speech synthesizer 20 receives the prosodic information shown in FIG. 3 from the prosody generator 13 and the exceptional speech segment data and the speech segment data from the speech segment selector 18, it adjusts, based on the prosodic information, the pitch, loudness, and duration of the speech represented by the exceptional speech segment data and each speech segment data item, forms speech data by smoothly connecting these speech segments in their vowel sections, and outputs the speech data from the output terminal 19.
[0039]
As described above, in this embodiment, speech segment data corresponding to each basic speech unit is registered in advance in the speech segment database 14, exceptional speech segment data corresponding to a specific speech unit included in a specific word is registered in advance in the exceptional speech segment database 15, and, when the speech data of the specific word is synthesized, the exceptional speech segment data corresponding to the specific speech unit included in that word is retrieved from the exceptional speech segment database 15 and used. Consequently, when the speech data representing the specific word is converted into speech, the specific word can be heard clearly. For example, as is apparent from FIG. 2, the speech unit “agu” of “yamaguchi” and the speech unit “agu” of “taguchi” are distinguished and separate speech segment data is used for each, so the pronunciation of both words becomes distinctive, and it also becomes possible to distinguish the pronunciation of “yamaguchi” from that of the similar “yamauchi”.
[0040]
Although proper nouns are given here as examples of specific words, any other kind of word can be treated as a specific word. Therefore, by setting the contents of the exceptional speech segment table 17 and the exceptional speech segment database 15 appropriately, it is possible to synthesize speech data in which similar nouns are pronounced distinguishably.
[0041]
FIG. 4 is a block diagram showing a second embodiment of the text-to-speech synthesizer of the present invention. In FIG. 4, the same reference numerals are given to portions that perform the same functions as those of the apparatus 10 of FIG. 1, and the description is simplified.
[0042]
In the text-to-speech synthesizer 30 according to the present embodiment, an utterance vocabulary list 31 and a similar word extractor 32 are added to the apparatus 10 of FIG. 1. Accordingly, as in the apparatus 10 of FIG. 1, text information is analyzed by the text analyzer 12, prosodic information is generated by the prosody generator 13, the speech unit selector 18 retrieves speech unit data from the speech unit database 14 and exceptional speech unit data from the exceptional speech unit database 15, and the speech synthesizer 20 synthesizes speech data.
[0043]
In the utterance vocabulary list 31, for example, text information indicating a large number of person names is registered in advance. The similar word extractor 32 compares these person names with each other and extracts two words having a high degree of similarity, that is, two sufficiently similar words.
[0044]
As one method of extracting two highly similar words, for example, the two words are converted into their respective phoneme strings and these phoneme strings are compared. The words are regarded as highly similar, and are extracted, when the phoneme strings differ in only one phoneme, when only one phoneme has been inserted into one of the phoneme strings, or when only one phoneme has been dropped from one of the phoneme strings.
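The comparison described above amounts to checking whether two phoneme strings are exactly one substitution, insertion, or deletion apart. The following is a minimal sketch under assumptions: the function names `differs_by_one_phoneme` and `similar_pairs`, the `VOCABULARY` contents, and the treatment of “ch” as a single phoneme are hypothetical choices for illustration.

```python
from itertools import combinations


def differs_by_one_phoneme(a, b):
    """True if phoneme lists a and b differ by exactly one substitution,
    one insertion, or one deletion."""
    if len(a) == len(b):
        # Same length: exactly one position may differ.
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    # Skip the common prefix, then the remainders must match once the
    # extra phoneme in the longer list is passed over.
    i = 0
    while i < len(shorter) and longer[i] == shorter[i]:
        i += 1
    return longer[i + 1:] == shorter[i:]


def similar_pairs(vocabulary):
    """Return all pairs of words whose phoneme lists are one phoneme apart."""
    return [
        (w1, w2)
        for (w1, p1), (w2, p2) in combinations(vocabulary.items(), 2)
        if differs_by_one_phoneme(p1, p2)
    ]


# Hypothetical vocabulary list: word -> phoneme list ("ch" assumed atomic).
VOCABULARY = {
    "yamaguchi": ["y", "a", "m", "a", "g", "u", "ch", "i"],
    "yamauchi": ["y", "a", "m", "a", "u", "ch", "i"],
    "taguchi": ["t", "a", "g", "u", "ch", "i"],
}
```

With this vocabulary, only “yamaguchi” and “yamauchi” are paired, since “taguchi” differs from each of them by more than one phoneme.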
[0045]
As another method, spectral parameters corresponding to all phonemes are registered in advance, and by referring to these spectral parameters, a spectral time series is generated from the phoneme string of each of the two words. These spectral time series are then matched using, for example, the DP matching method widely used in the field of speech recognition, and the acoustic distance between the words (more specifically, the spectral distance or the like) is calculated. If this distance is small, the words are regarded as highly similar and are extracted.
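As an illustration of this second method, the sketch below implements DP matching as plain dynamic time warping with a Euclidean local distance. The source does not specify the local distance, path constraints, or normalization, so those choices are assumptions, as are the names `SPECTRA`, `spectral_series`, and `dp_matching_distance`, the invented two-dimensional "spectral parameters", and the use of one vector per phoneme.

```python
import math

# Invented two-dimensional "spectral parameters" per phoneme (assumption).
SPECTRA = {
    "y": (0.2, 0.8),
    "a": (1.0, 0.0),
    "m": (0.1, 0.1),
    "g": (0.4, 0.3),
    "u": (0.5, 0.6),
    "ch": (0.0, 0.9),
    "i": (0.0, 1.0),
}


def spectral_series(phonemes):
    """One spectral vector per phoneme (a real system would use many frames)."""
    return [SPECTRA[p] for p in phonemes]


def dp_matching_distance(xs, ys):
    """Accumulated distance of the best monotonic alignment of xs and ys."""
    n, m = len(xs), len(ys)
    inf = float("inf")
    # acc[i][j] = cost of the best alignment of xs[:i] with ys[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(xs[i - 1], ys[j - 1])
            acc[i][j] = local + min(
                acc[i - 1][j],      # xs advances alone
                acc[i][j - 1],      # ys advances alone
                acc[i - 1][j - 1],  # both advance
            )
    return acc[n][m]
```

Two words would then be treated as similar when `dp_matching_distance` applied to their spectral series falls below a threshold; the threshold itself is not given in the source and would have to be chosen.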
[0046]
Next, the similar word extractor 32 obtains the characteristic speech unit of one of the two highly similar words, that is, the part in which the words differ. For example, when one phoneme has been inserted into the phoneme string of one word, the speech unit containing the inserted phoneme is the characteristic one. When the words were instead judged similar because the acoustic distance between them is small, the speech unit of one word at which the acoustic distance between the words is greatest is the characteristic one.
[0047]
For example, if the words with the phoneme strings “yamaguchi” and “yamauchi” are extracted as highly similar, the phoneme “g” is regarded as having been inserted into the phoneme string “yamaguchi”, and the speech unit “agu” containing the phoneme “g” is obtained as the characteristic speech unit.
[0048]
After obtaining the characteristic speech unit of one of the two highly similar words in this way, the similar word extractor 32 associates the word containing the characteristic speech unit with that speech unit and registers them in the exceptional speech unit table 17.
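For the insertion case, the steps of finding the characteristic speech unit and registering it can be sketched as follows. This is only an illustration under assumptions: the functions `characteristic_unit` and `register_exception` are hypothetical, the unit is taken simply as the inserted phoneme plus its immediate neighbours, “ch” is treated as one phoneme, and the link to an identification number in the exceptional speech unit database 15 is omitted.

```python
def characteristic_unit(longer, shorter):
    """Return the speech unit around the phoneme present only in `longer`,
    or None if the lists do not differ by a single inserted phoneme."""
    if len(longer) != len(shorter) + 1:
        return None
    # Locate the first position at which the two phoneme lists diverge.
    i = 0
    while i < len(shorter) and longer[i] == shorter[i]:
        i += 1
    if longer[i + 1:] != shorter[i:]:
        return None
    # Assumed simplification: the unit is the inserted phoneme plus its
    # immediate neighbours (vowel + consonant + vowel for a VCV unit).
    start = max(i - 1, 0)
    end = min(i + 2, len(longer))
    return "".join(longer[start:end])


def register_exception(table, word, unit):
    """Associate `word` with its characteristic unit in a table-17 stand-in."""
    table.setdefault(word, []).append(unit)


exception_table = {}
yamaguchi = ["y", "a", "m", "a", "g", "u", "ch", "i"]
yamauchi = ["y", "a", "m", "a", "u", "ch", "i"]

unit = characteristic_unit(yamaguchi, yamauchi)
if unit is not None:
    register_exception(exception_table, "yamaguchi", unit)
```

Here the inserted phoneme “g” is found at the fifth position of “yamaguchi”, so the unit formed with its neighbours is “agu”, matching the example in the text.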
[0049]
As described above, in this embodiment, two highly similar words are extracted from the utterance vocabulary list 31, and one word containing a characteristic speech unit can be registered automatically, in association with that speech unit, in the exceptional speech unit table 17. This saves the labor of registering the contents of the exceptional speech unit table 17. When speech data of a word registered in the exceptional speech unit table 17 is synthesized, the speech unit included in the word is looked up in the exceptional speech unit table 17, the exceptional speech unit data corresponding to that speech unit is read from the exceptional speech unit database 15, and the speech data of the word is synthesized using this exceptional speech unit data.
[0050]
Here, the exceptional speech unit data corresponding to each speech unit that is expected to be registered must be registered in the exceptional speech unit database 15 in advance. The exceptional speech unit data preferably represents speech in which a consonant is emphasized. When speech data of a word is synthesized using speech unit data in which the consonant is emphasized, the consonant of the word is emphasized when the speech data is converted into speech, which makes it possible to clearly distinguish the speech of this word from that of other similar words.
[0051]
In this embodiment, a large number of person names are registered in the utterance vocabulary list 31. However, a large number of place names, nouns, or other types of words may instead be registered in the utterance vocabulary list 31, and similar words may be extracted from among them.
[0052]
Also, for other words that are not registered in the exceptional speech unit table 17, the speech data is generated using the basic speech unit data in the speech unit database 14, so that no unnaturalness arises in the pronunciation of the sentence.
[0053]
The present invention is not limited to the above embodiments and can be modified in various ways. For example, although the VCV speech unit is illustrated as the speech unit, a CV speech unit or another type of speech unit may be applied.
[0054]
Alternatively, reading information indicating reading of a word or a sentence may be input to the input terminal 11 and supplied to the prosody generator 13 and the speech unit selector 18. In this case, the text analyzer 12 is omitted.
[0055]
Furthermore, the present invention also includes a recording medium that records the text-to-speech synthesis method as a program.
[0056]
[Effect of the invention]
As described above, according to the present invention, normally, speech unit data corresponding to a speech unit is retrieved from a speech unit database and used. Further, when synthesizing the speech of a specific word, the exceptional speech segment data is searched from the exceptional speech segment database and used for a specific speech unit. This makes the voice of this particular word clear.
[0057]
Further, according to the present invention, a specific speech unit is extracted from at least one of a set of words having similar readings, and an exceptional speech unit table is created for retrieving, from the exceptional speech unit database, the exceptional speech unit data corresponding to that specific speech unit. As a result, the exceptional speech unit table can be created automatically. When synthesizing the speech of a word from which a specific speech unit has been extracted, the exceptional speech unit data corresponding to the specific speech unit is identified by searching the exceptional speech unit table and is then retrieved from the exceptional speech unit database. By synthesizing the speech of the word using this exceptional speech unit data, the word can be clearly distinguished from other similar words.
[0058]
Further, according to the present invention, since the consonant of a specific word is emphasized, this word is not mistaken for other similar words.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first embodiment of a text-to-speech synthesizer according to the present invention.
FIG. 2 is a diagram illustrating a configuration of an exceptional speech unit table in the apparatus of FIG. 1;
FIG. 3 is a chart showing the contents of prosodic information.
FIG. 4 is a block diagram showing a second embodiment of the text-to-speech synthesizer of the present invention.
FIG. 5 is a block diagram illustrating a conventional speech synthesizer.
[Explanation of symbols]
10,30 Text-to-speech synthesizer
11 Input terminal
12 Text analyzer
13 Prosody generator
14 Speech unit database
15 Exceptional speech unit database
16 Speech unit database switcher
17 Exceptional speech unit table
18 Speech unit selector
19 Output terminal
20 Speech synthesizer
31 Utterance vocabulary list
32 Similar word extractor

Claims

1. A text-to-speech synthesizer that includes a speech unit database in which speech unit data corresponding to each speech unit composed of at least one phoneme is registered, that retrieves at least one piece of speech unit data from the speech unit database based on a phoneme string indicating the reading of a word or sentence, and that synthesizes speech of the word or sentence based on the retrieved speech unit data and prosodic information of the word or sentence, the text-to-speech synthesizer comprising:
an exceptional speech unit database in which exceptional speech unit data, corresponding to a specific speech unit composed of at least one phoneme included in a phoneme string indicating the reading of a specific word, is registered so as to be identified by an identification number of the exceptional speech unit data associated with the specific word and the specific speech unit; and
speech data search means for, when synthesizing speech of the specific word, retrieving, for the specific speech unit, exceptional speech unit data having the identification number corresponding to the specific word and the specific speech unit from the exceptional speech unit database, instead of retrieving speech unit data from the speech unit database.

2. The text-to-speech synthesizer according to claim 1, further comprising:
a vocabulary list in which a plurality of words are registered; and similar word extraction means for selecting words with similar readings from the vocabulary list, extracting a specific speech unit composed of at least one phoneme from the phoneme string indicating the reading of at least one of the selected words, and creating an exceptional speech unit table for retrieving the exceptional speech unit data corresponding to the extracted specific speech unit from the exceptional speech unit database.

3. The text-to-speech synthesizer according to claim 1, wherein the exceptional speech unit data in the exceptional speech unit database indicates speech in which consonants are emphasized.

4. A text-to-speech synthesis method that uses a speech unit database in which speech unit data corresponding to each speech unit composed of at least one phoneme is registered, retrieves at least one piece of speech unit data from the speech unit database based on a phoneme string indicating the reading of a word or sentence, and synthesizes speech of the word or sentence based on the retrieved speech unit data and prosodic information of the word or sentence, wherein
an exceptional speech unit database is used in which exceptional speech unit data, corresponding to a specific speech unit composed of at least one phoneme included in a phoneme string indicating the reading of a specific word, is registered so as to be identified by an identification number of the exceptional speech unit data associated with the specific word and the specific speech unit, and, when synthesizing speech of the specific word, for the specific speech unit, exceptional speech unit data having the identification number corresponding to the specific word and the specific speech unit is retrieved from the exceptional speech unit database instead of retrieving speech unit data from the speech unit database.

5. The text-to-speech synthesis method according to claim 4, wherein a vocabulary list in which a plurality of words are registered is used, words with similar readings are selected from the vocabulary list, a specific speech unit composed of at least one phoneme is extracted from the phoneme string indicating the reading of at least one of the selected words, and an exceptional speech unit table for retrieving the exceptional speech unit data corresponding to the extracted specific speech unit from the exceptional speech unit database is created.

6. The text-to-speech synthesis method according to claim 4, wherein the exceptional speech unit data in the exceptional speech unit database indicates speech in which consonants are emphasized.

7. A recording medium on which the text-to-speech synthesis method according to claim 4 is recorded as a program.