JP2004354644A

JP2004354644A - Speech synthesizing method, device and computer program therefor, and information storage medium stored with same

Info

Publication number: JP2004354644A
Application number: JP2003151442A
Authority: JP
Inventors: Miki Hasebe; 未来長谷部; Masanobu Abe; 匡伸阿部; Hideyuki Mizuno; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-05-28
Filing date: 2003-05-28
Publication date: 2004-12-16
Anticipated expiration: 2023-05-28
Also published as: JP4170819B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis method, a device and a computer program therefor, and an information storage medium storing the program, capable of synthesizing speech that is optimum for the situation and of high quality.
SOLUTION: Input information 100 is obtained that includes the character data to be synthesized and control information for speech synthesis, the control information specifying at least one of a phoneme sequence, a target fundamental frequency value for the synthesis, a duration, a database to be used, and a signal processing method. Speech units to be used for the synthesis are selected from a database part 202, and the quality of the selected units is evaluated. When the evaluation result does not satisfy a reference value, the speech unit selection processing is repeated according to the control information included in the input information 100. Consequently, speech units whose evaluation result reaches the reference value are selected and used to synthesize speech.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

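[0001]
[Technical field of the invention]
The present invention relates to a speech synthesis method capable of synthesizing high-quality speech, a device therefor, a computer program therefor, and an information storage medium storing the program.
[0002]
[Prior art]
Synthesized speech is used in many fields, including information guidance systems such as telephone stock-price services and the reading aloud of e-mail and Web pages. However, current synthesized speech still falls short of human speech in quality, and there is strong demand for improvement.
[0003]
One conventional text-to-speech approach searches a large speech corpus for speech units usable for synthesis, selects the optimum units from those found, and outputs the selected units without modifying their prosody, thereby synthesizing natural-sounding speech (first conventional example).
[Reference: Japanese Patent No. 2761552, "Speech synthesis device"]
Another approach applies signal processing to the speech units used for synthesis so that they match the target prosody before output (second conventional example). [Reference: "A NEW F0 MODIFICATION ALGORITHM BY MANIPULATING HARMONICS OF MAGNITUDE SPECTRUM", Satoshi TAKANO, Masanobu ABE, Eurospeech '99]
[0004]
[Patent Document 1]
Japanese Patent No. 2761552
[Non-Patent Document 1]
"A NEW F0 MODIFICATION ALGORITHM BY MANIPULATING HARMONICS OF MAGNITUDE SPECTRUM", Satoshi TAKANO, Masanobu ABE, Eurospeech '99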
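[0005]
[Problems to be solved by the invention]
However, in the first conventional example, the rules for selecting the optimum speech units are designed in advance, and units are selected according to those rules. The same rules are therefore applied to every input. The input text varies with the synthesis task, and the characteristics of the candidate units also vary with conditions such as the input and the database, so it is difficult to create rules that always yield the optimum combination of speech units for every input.
[0006]
For example, for the input sentence "今日は蒸し暑いでしょう" ("It will be hot and humid today"), the target prosody generated by the system does not necessarily match the prosody of the sentence as actually spoken. This is clear from the experimental results shown in FIG. 5, where the vertical axis is frequency and the horizontal axis is time. The crosses mark the fundamental frequency (F0) values of the input synthesis target, and the circles mark the fundamental frequency (F0) of the speech as actually uttered.
[0007]
If the database contains a speech unit with exactly the same phoneme sequence as the input "今日は蒸し暑いでしょう。", speech that sounds exactly like the original recording can be obtained by writing the rules so that some difference between the generated prosody and the database prosody is tolerated.
[0008]
However, if only fragmentary units such as "き", "ょ", "う", "わ", and so on can be found in the database, rules that tolerate prosodic differences may degrade the quality of the synthesized speech.
[0009]
When prosody is important, for example when adding emotion to synthesized speech, tolerating prosodic differences as described above is likely to prevent the emotion from being reproduced. Even when the database contains a sentence with exactly the same phoneme sequence as the input, emotional speech is unlikely to be reproduced if no emotional speech was collected when the database was built. When the prosody of the synthesized speech matters in this way, modifying the prosody to match the target as in the second conventional example is effective. Modifying the prosody, however, yields the target prosody at the cost of the natural, human-voice quality of the speech.
[0010]
The larger the prosodic modification applied to a speech unit, the greater the degradation in sound quality. When modifying toward a target prosody, it is therefore desirable to select and use units whose prosody is as close to the target as possible.
[0011]
Thus the characteristics of the candidate units and the criteria for selecting the optimum units differ with the synthesis task and the database searched, which makes it difficult to synthesize high-quality speech consistently across situations.
[0012]
In view of the above problems, an object of the present invention is to provide a speech synthesis method, a device therefor, a computer program therefor, and an information storage medium storing the program, capable of synthesizing speech that is optimum for the situation and of high quality.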
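[0013]
[Means for solving the problems]
To achieve this object, the present invention proposes a speech synthesis method that uses a speech synthesis device, namely a computer device that has a plurality of types of databases storing speech units and converts input character data into speech, to select speech units corresponding to the input character data from the databases and to synthesize the corresponding speech from the selected units. In this method, the device obtains input information that includes, in addition to the ordinary inputs for speech synthesis such as the target input character string, fundamental frequency, and duration, control information containing at least one of: information specifying the database to be used, information specifying the rules for selecting speech units, and information specifying the signal processing method. Based on the input information, the device selects the speech units to be used for synthesis from the databases and evaluates the quality that would result from synthesizing speech with the selected units. When the evaluation result falls short of a reference value, the device processes the speech units according to the control information included in the input information and evaluates the result again. This sequence is repeated until the reference value is satisfied, and the synthesized speech is thereby generated.
[0014]
According to this method, the speech units used for synthesis are selected from the database based on the input information, and the quality of the selected units is evaluated. When the evaluation result falls short of the reference value, processing based on the control information included in the input information is performed multiple times to produce the synthesized speech. The control information used is at least one of the information specifying the database to be used, the rules for selecting speech units, and the signal processing method.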
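[0015]
The present invention also proposes, in the above speech synthesis method, a method in which, when the evaluation of the selected speech units falls short of the reference value, the prosody modification method is changed for at least one of a speech unit and a part of a speech unit, after which the synthesized speech is generated.
[0016]
According to this method, when the evaluation result falls short of the reference value, the prosody modification method is changed for at least one of a speech unit and a part of a speech unit, and the synthesized speech is generated.
[0017]
The present invention also proposes, in the above speech synthesis method, a method in which, when the evaluation of the selected speech units falls short of the reference value, the rules for selecting speech units are changed for at least one of a speech unit and a part of a speech unit, and speech units are selected again from the database to generate the synthesized speech.
[0018]
According to this method, for speech units whose evaluation result fell short of the reference value, the selection rules are changed, speech units are selected again from the database, and the synthesized speech is generated.
[0019]
The present invention also proposes, in the above speech synthesis method, a method in which, when the evaluation of the selected speech units falls short of the reference value, the database searched for candidate speech units is changed for at least one of a speech unit and a part of a speech unit, and speech units are selected again to generate the synthesized speech.
[0020]
According to this method, for speech units whose evaluation result fell short of the reference value, the database searched for candidate units is changed, speech units are selected again, and the synthesized speech is generated.
[0021]
The present invention also proposes, in the above speech synthesis method, a method in which, when the evaluation of the selected speech units falls short of the reference value, the rules for selecting speech units are changed and speech units are selected again from the database, and in addition the prosody modification method is changed, each for at least one of a speech unit and a part of a speech unit, to generate the synthesized speech.
[0022]
According to this method, for speech units whose evaluation result fell short of the reference value, the selection rules are changed and speech units are selected again from the database, and the prosody modification method is changed for at least one of a speech unit and a part of a speech unit, to generate the synthesized speech.
[0023]
The present invention also proposes, in the above speech synthesis method, a method in which, when the evaluation of the selected speech units falls short of the reference value, the database searched for candidate speech units is changed and speech units are selected again, and in addition the prosody modification method is changed, each for at least one of a speech unit and a part of a speech unit, to generate the synthesized speech.
[0024]
According to this method, for speech units whose evaluation result fell short of the reference value, the database searched for candidate units is changed and speech units are selected again, and the prosody modification method is changed for at least one of a speech unit and a part of a speech unit, to generate the synthesized speech.
[0025]
The present invention also proposes, in the above speech synthesis method, a method in which, when the evaluation of the selected speech units falls short of the reference value, the rules for selecting speech units are changed and speech units are selected again from the database, and in addition the database searched for candidate speech units is changed and speech units are selected again, each for at least one of a speech unit and a part of a speech unit, to generate the synthesized speech.
[0026]
According to this method, for speech units whose evaluation result fell short of the reference value, the selection rules are changed and speech units are selected again from the database, and the database searched for candidate units is changed and speech units are selected again, to generate the synthesized speech.
[0027]
The present invention also proposes, in the above speech synthesis method, a method in which, when the evaluation of the selected speech units falls short of the reference value, the rules for selecting speech units are changed and speech units are selected again from the database, the database searched for candidate speech units is changed and speech units are selected again, and furthermore the prosody modification method is changed, each for at least one of a speech unit and a part of a speech unit, to generate the synthesized speech.
[0028]
According to this method, for speech units whose evaluation result fell short of the reference value, the selection rules are changed and speech units are selected again from the database, the database searched for candidate units is changed and speech units are selected again, and furthermore the prosody modification method is changed for at least one of a speech unit and a part of a speech unit that fell short of the reference value, to generate the synthesized speech.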
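[0029]
The present invention further proposes, as a speech synthesis device for carrying out the above method, a device that selects speech units corresponding to input character data from a database and synthesizes the corresponding speech from the selected units, the device comprising: a plurality of types of databases storing speech units; means for obtaining input information that includes, in addition to the ordinary inputs for speech synthesis such as the target input character string, fundamental frequency, and duration, control information containing at least one of the information specifying the database to be used, the rules for selecting speech units, and the signal processing method; means for selecting, based on the input information, the speech units to be used for synthesis from the databases; and means for generating synthesized speech by evaluating the quality that would result from synthesizing speech with the selected units and, when the evaluation result falls short of a reference value, processing the speech units according to the control information included in the input information and evaluating the result again, this sequence being repeated until the reference value is satisfied.
[0030]
The present invention also proposes, in the above speech synthesis device, a device having means for changing the prosody modification method for at least one of a speech unit and a part of that unit when the evaluation of the selected units finds a unit that falls short of the reference value.
[0031]
The present invention also proposes, in the above speech synthesis device, a device having means for changing the rules for selecting speech units and selecting speech units again from the database for a unit that the evaluation finds to fall short of the reference value.
[0032]
The present invention also proposes, in the above speech synthesis device, a device having means for changing the database searched for candidate speech units and selecting speech units again for a unit that the evaluation finds to fall short of the reference value.
[0033]
The present invention also proposes, as a computer program with which a speech synthesis device for carrying out the above method can readily be built on a well-known computer device, a speech synthesis computer program that uses a computer device having a plurality of types of databases storing speech units and converting input character data into speech to select speech units corresponding to the input character data from the databases and to synthesize the corresponding speech from the selected units. The program includes: a step of obtaining input information that includes, in addition to the ordinary inputs for speech synthesis such as the target input character string, fundamental frequency, and duration, control information containing at least one of the information specifying the database to be used, the rules for selecting speech units, and the signal processing method; a step of selecting, based on the input information, the speech units to be used for synthesis from the databases; and a step of generating synthesized speech by evaluating the quality that would result from synthesizing speech with the selected units and, when the evaluation result falls short of a reference value, processing the speech units according to the control information included in the input information and evaluating the result again, this sequence being repeated until the reference value is satisfied.
[0034]
The present invention also proposes, in the above speech synthesis computer program, a program including a step of changing the prosody modification method for at least one of a speech unit and a part of a speech unit that the evaluation of the selected units finds to fall short of the reference value.
[0035]
The present invention also proposes, in the above speech synthesis computer program, a program including a step of, when the evaluation of the selected units falls short of the reference value, changing the rules for selecting speech units for at least one of a speech unit and a part of a speech unit and selecting speech units again from the database.
[0036]
The present invention also proposes, in the above speech synthesis computer program, a program including a step of, when the evaluation of the selected units falls short of the reference value, changing the database searched for candidate speech units for at least one of a speech unit and a part of a speech unit and selecting speech units again.
[0037]
To make the above speech synthesis computer program easy to distribute, the present invention also proposes a computer-readable information storage medium storing the program.
[0038]
As noted above, the input text changes with the synthesis task, and it has been difficult to select the optimum speech units under varying conditions such as which units the database holds for a given input. In the present invention, depending on the evaluation of the selected units, one of the following is performed multiple times: reselecting units after changing the database from which they are selected, reselecting units after changing the rules for selecting them from the database, changing the synthesis method applied to the selected units, or a combination of these three. This makes it possible to perform processing appropriate to the database used and the input information.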
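[0039]
[Embodiments of the invention]
An embodiment of the present invention is described below with reference to the drawings.
[0040]
FIG. 1 is a functional block diagram of a speech synthesis device according to an embodiment of the present invention. In the figure, 100 denotes the input information for speech synthesis. It contains the character information to be synthesized (hereinafter, "text") together with control information for speech synthesis: the phoneme sequence of the text, parameters such as the target fundamental frequency (hereinafter, F0) values and durations, and information specifying the database and signal processing method to be used.
[0041]
200 denotes the speech synthesis processing device, a well-known computer device comprising an input unit 201, a database unit 202 made up of a plurality of types of databases 202-1 to 202-n, a database selection unit 203, a processing method setting unit 204, a database search unit (hereinafter, DB search unit) 205, a search result storage unit 206, a selection rule setting unit 207, a speech unit selection unit 208, a selection result storage unit 209, a prosody modification method setting unit 210, a selection result evaluation unit 211, an evaluation result judgment unit 212, a synthesis unit 213, a synthesized speech storage unit 214, and a synthesized speech output unit 215. These components are implemented by both the hardware and the software of the computer device.
[0042]
The input unit 201 obtains the input information 100 and passes it to the database selection unit 203.
[0043]
The database unit 202 consists of a plurality of types of databases 202-1 to 202-n (n is a natural number) that store information for synthesis, such as speech waveforms, F0 patterns of the speech, phoneme label sequences corresponding to the utterance content, and label data marking phoneme boundaries. It holds several types of database, such as a general-purpose database of phonemes with their preceding and following context (triphones), databases for specific tasks such as read-aloud news or weather forecasts, a database containing place names, and a database containing a basic phoneme set. As an example of task-dependent databases, a weather forecast system needs a weather forecast database containing weather terms and the stock sentences used in forecasts, a place-name database, and so on. These can be used in combination with a database containing the basic phoneme set for synthesizing arbitrary character data and sentences.
[0044]
The database selection unit 203, following instructions from the processing method setting unit 204, determines from the input information 100 which of the databases 202-1 to 202-n in the database unit 202 to use. When the evaluation result judgment unit 212 described later finds that the evaluation result falls short of the reference value and processing is repeated in a loop, the information on which of the databases 202-1 to 202-n to use is supplied as control information fed back by the processing method setting unit 204.
[0045]
When the evaluation result judgment unit 212 finds that the evaluation result falls short of the reference value, the processing method setting unit 204 attaches control information that changes the settings of one or more of the database selection unit 203, the selection rule setting unit 207, and the prosody modification method setting unit 210. Processing then proceeds to 203 if the database to be searched is changed, to 207 if the database is unchanged and the selection rules are changed, and to 210 if neither the database nor the selection rules are changed and only the prosody modification method is changed.
[0046]
The DB search unit 205 searches the databases 202-1 to 202-n chosen by the database selection unit 203 for speech units usable for synthesis, extracts them, and passes them to the search result storage unit 206.
[0047]
The search result storage unit 206 temporarily stores the speech units retrieved by the DB search unit 205.
[0048]
The selection rule setting unit 207 sets, based on the control information, the rules by which the speech unit selection unit 208 selects units.
[0049]
The speech unit selection unit 208 selects, according to the settings made by the selection rule setting unit 207, from the candidate units produced by the search of the DB search unit 205. For the units retrieved from the database unit 202 and held in the search result storage unit 206, it computes costs for factors that affect the quality of the synthesized speech, such as F0, duration, and phonetic environment, selects the optimum combination of speech units, and passes it to the selection result storage unit 209.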
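The cost-based selection described for the speech unit selection unit 208 can be illustrated with a small sketch. It is not the patent's implementation: the `Candidate` fields, the weight names, and the weighted-sum form of the cost are all assumptions made for illustration, and for brevity it picks a single unit rather than an optimum combination of units. The point is only that changing the weights (the "selection rules") can change which candidate wins.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate speech unit with per-factor costs (hypothetical fields)."""
    label: str
    f0_cost: float        # distance from the target F0
    duration_cost: float  # distance from the target duration
    context_cost: float   # mismatch in phonetic environment


def total_cost(c, weights):
    # Assumed form: a plain weighted sum of the individual costs.
    return (weights["f0"] * c.f0_cost
            + weights["duration"] * c.duration_cost
            + weights["context"] * c.context_cost)


def select_best(candidates, weights):
    """Return the candidate with the lowest weighted cost."""
    return min(candidates, key=lambda c: total_cost(c, weights))


candidates = [
    Candidate("unit_a", f0_cost=0.1, duration_cost=0.2, context_cost=0.9),
    Candidate("unit_b", f0_cost=0.8, duration_cost=0.2, context_cost=0.1),
]
# Two hypothetical rule sets: one stresses F0, the other phonetic environment.
f0_first = {"f0": 1.0, "duration": 0.5, "context": 0.1}
context_first = {"f0": 0.1, "duration": 0.5, "context": 1.0}

print(select_best(candidates, f0_first).label)
print(select_best(candidates, context_first).label)
```

With the F0-weighted rules the sketch prefers `unit_a`, and with the environment-weighted rules it prefers `unit_b`, which mirrors how a rule change can lead to a different selection from the same candidates.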
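[0050]
The selection result storage unit 209 temporarily stores the optimum combination of speech units received from the speech unit selection unit 208.
[0051]
The prosody modification method setting unit 210 specifies the prosody modification method based on the control information.
[0052]
The selection result evaluation unit 211 evaluates the quality that would result from synthesizing speech with the units selected by the speech unit selection unit 208 and stored in the selection result storage unit 209.
[0053]
The evaluation result judgment unit 212 decides, based on the evaluation result from the selection result evaluation unit 211 and the control information included in the input information, whether the selection result should be reprocessed. That is, based on the input control information and on the comparison of the evaluation result with the reference value, it passes processing to the processing method setting unit when reprocessing is needed.
[0054]
When the evaluation result judgment unit 212 judges that no reprocessing is needed, the synthesis unit 213 processes the selected speech units as specified by the prosody modification method in the control information, concatenates the units, and passes the result as synthesized speech to the synthesized speech storage unit 214.
[0055]
The synthesized speech storage unit 214 temporarily stores the synthesized speech received from the synthesis unit 213.
[0056]
The synthesized speech output unit 215 outputs the synthesized speech stored in the synthesized speech storage unit 214.
[0057]
Next, an example of the quality evaluation of the selection result in the selection result evaluation unit 211 is described in detail with reference to the flowchart of FIG. 2.
[0058]
The selection result evaluation unit 211 first determines whether speech units could be selected for the input, that is, whether speech units are stored in the selection result storage unit 209 (301). Take the weather forecast system mentioned above as an example. If units are first selected using only the weather forecast database and the place-name database, typical sentences such as "今日の天気は晴れです" ("Today's weather is sunny") exist in the weather forecast database, so high-quality results are obtained quickly. However, narrowing the search range for speed makes it more likely that the units needed for an unusual sentence such as "沖縄県で雪が降りました" ("It snowed in Okinawa Prefecture") are missing from the database.
Therefore, by identifying in process 401 only the portions for which no speech unit existed, it becomes possible, when a small task-dependent weather forecast database was used, to reselect units for only the portions that could not be selected, using a large database containing the full basic phoneme set.
[0059]
If the determination in 301 finds no speech units in the selection result storage unit 209, the selection result evaluation unit 211 cannot evaluate the result of synthesis, so it skips processes 302 and 303 and outputs the evaluation result directly to the evaluation result judgment unit 212.
[0060]
If speech units do exist in the selection result storage unit 209, the selection result evaluation unit 211 performs processes 302 and 303 to evaluate the quality of the selected units.
[0061]
In process 302, the selection result evaluation unit 211 assesses the phonetic environment of each selected speech unit. For example, when the "S" portion of speech in the database with the phoneme sequence "ASITA" is used, "A" is the preceding environment of S and "I" is its following environment. If this S is used as the "S" of "KESU", its environments become "E" and "U", entirely different from those in the database. To express how much the phonetic environments differ, data obtained by analyzing the spectral pattern of each phonetic environment in advance is used, and the patterns are compared.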
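The "ASITA" / "KESU" example can be traced in a few lines. The sketch below only extracts the preceding and following phonemes and counts how many of them differ; that count is a crude stand-in, assumed here for illustration, for the spectral-pattern comparison the text describes, and the function names are hypothetical.

```python
def context(phonemes, index):
    """Return (preceding, following) phonemes around phonemes[index].

    None is used where the phoneme sits at the start or end of the sequence.
    """
    prev_p = phonemes[index - 1] if index > 0 else None
    next_p = phonemes[index + 1] if index + 1 < len(phonemes) else None
    return prev_p, next_p


def context_mismatch(original, o_idx, target, t_idx):
    """Count how many of the two neighbouring phonemes differ (0, 1 or 2)."""
    o_prev, o_next = context(original, o_idx)
    t_prev, t_next = context(target, t_idx)
    return int(o_prev != t_prev) + int(o_next != t_next)


original = list("ASITA")  # sequence stored in the database
target = list("KESU")     # sequence to be synthesized

print(context(original, original.index("S")))  # environment in the database
print(context(target, target.index("S")))      # environment in the target
print(context_mismatch(original, original.index("S"),
                       target, target.index("S")))
```

Here the database environment of S is (A, I) and the target environment is (E, U), so both neighbours differ.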
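[0062]
Suppose N speech units are used, the environment of the n-th unit is written Tri-Phone[n] (hereinafter, TP[n]), and the corresponding spectral patterns are TargetEnvironment[TP[n]] (hereinafter, TE[TP[n]]) and OriginalEnvironment[TP[n]] (hereinafter, OE[TP[n]]). The degree to which the phonetic environment of each unit differs can then be obtained by the following equation (1), which is evaluated for all N units.
[0063]
[Equation 1]

[0064]
Next, in process 303, the selection result evaluation unit 211 determines, for the selected speech units, whether the accent type of each accent phrase is reproduced correctly. When units are output without prosody modification in order to preserve the natural voice quality of the synthesized speech, the optimum combination of units does not necessarily reproduce the correct accent, so the degree to which the accent type of the units matches the target must be evaluated.
[0065]
As the evaluation method, for the vowel portions of the phoneme sequence to be synthesized, where F0 values are obtained stably, comparing the F0 values of the synthesis target with those of the speech units shows how well the accent type is reproduced. For example, if the phoneme sequence to be synthesized contains N vowels, let the F0 value at the center of the n-th vowel of the synthesis target be TargetVoiceF0[n] (hereinafter, TF0[n]) and the F0 value at the center of the n-th vowel of the selected speech units be OriginalVoiceF0[n] (hereinafter, OF0[n]). The difference in the F0 trajectories can then be obtained by the following equation (2).
[0066]
[Equation 2]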
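Equation (2) itself is not reproduced in this text, so the following is only one plausible way to compare F0 trajectories, assumed purely for illustration and not taken from the patent. It compares the vowel-to-vowel F0 movement of the target (TF0) with that of the selected units (OF0), so a constant pitch offset does not count as an accent mismatch. The function name and the sample values are hypothetical.

```python
def f0_transition_difference(tf0, of0):
    """Sum of absolute differences between vowel-to-vowel F0 steps.

    tf0: F0 at the centre of each vowel in the synthesis target (Hz)
    of0: F0 at the centre of each vowel in the selected units (Hz)
    Assumed formulation; the patent's equation (2) may differ.
    """
    if len(tf0) != len(of0):
        raise ValueError("both sequences must cover the same vowels")
    total = 0.0
    for n in range(len(tf0) - 1):
        target_step = tf0[n + 1] - tf0[n]
        unit_step = of0[n + 1] - of0[n]
        total += abs(target_step - unit_step)
    return total


target = [120.0, 150.0, 140.0]      # rise, then slight fall
same_shape = [110.0, 140.0, 130.0]  # same movement, 10 Hz lower
flat = [130.0, 130.0, 130.0]        # no pitch movement at all

print(f0_transition_difference(target, same_shape))
print(f0_transition_difference(target, flat))
```

Under this assumed measure the shifted contour scores 0.0 and the flat contour scores 40.0, so a threshold on the score would flag only the flat contour as an accent mismatch.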

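[0067]
Through the above processing, evaluation result data is attached to the selected combination of speech units, and the result is judged by the evaluation result judgment unit 212.
[0068]
Next, the processing of the evaluation result judgment unit 212 is described in detail with reference to the flowchart of FIG. 3.
[0069]
The evaluation result judgment unit 212 determines, from the evaluation result of each selected speech unit, whether further processing is needed (401). If no processing is needed, processing proceeds to the synthesis unit 213.
[0070]
If the determination in 401 finds that processing is needed, the evaluation result judgment unit 212 decides whether the database to be searched should be changed in the subsequent processing (402). If the database is not changed, the following processes 403 and 404 decide whether to change the selection rules and the prosody modification method, respectively. Only if none of these changes is made does processing proceed to the synthesis unit 213. If even one change is to be made, processing proceeds to the processing method setting unit 204.
[0071]
When the evaluation result judgment unit 212 has judged that reprocessing, such as reselection of speech units, is needed, the processing method setting unit 204 attaches control information indicating which of the database, the selection rules, and the prosody modification method is to be changed. Processing then proceeds to the database selection unit 203 if the database to be searched is changed, to the selection rule setting unit 207 if the database is unchanged and the selection rules are changed, and to the prosody modification method setting unit 210 if neither the database nor the selection rules are changed and only the prosody modification method is changed.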
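The routing performed after the judgment steps 402 to 404 can be written as a short decision function. This is a sketch rather than the patent's code: the enum and function names are invented here, while the numeric values are the reference numerals of the units named in the text.

```python
from enum import Enum


class NextUnit(Enum):
    """Where processing resumes; values are the reference numerals in the text."""
    DATABASE_SELECTION = 203
    SELECTION_RULE_SETTING = 207
    PROSODY_METHOD_SETTING = 210
    SYNTHESIS = 213


def route(change_database, change_rules, change_prosody):
    """Pick the re-entry point in the priority order described above."""
    if change_database:
        # A new database means searching, selecting and evaluating again.
        return NextUnit.DATABASE_SELECTION
    if change_rules:
        # Same candidates, new selection rules.
        return NextUnit.SELECTION_RULE_SETTING
    if change_prosody:
        # Same selected units, new prosody modification method.
        return NextUnit.PROSODY_METHOD_SETTING
    # Nothing left to change: go straight to synthesis.
    return NextUnit.SYNTHESIS


print(route(change_database=False, change_rules=True, change_prosody=True))
```

Because a database change re-enters the flow earliest, any rule or prosody changes requested at the same time are picked up as processing passes through units 207 and 210 afterwards.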
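[0072]
When processing returns to the database selection unit 203 via the processing method setting unit 204, the database to be searched is changed according to the control information attached by the processing method setting unit 204, and candidate speech units are searched for again.
[0073]
Processing then continues as in the first pass. When it reaches the selection rule setting unit 207, if the control information attached by the processing method setting unit 204 calls for a change of selection rules, the selection rule setting unit 207 changes the rules accordingly and selection is performed again.
[0074]
When processing returns directly to the selection rule setting unit 207 via the processing method setting unit 204, no new database search has been made, so the candidate units are the same as those first retrieved. The subsequent flow is the same as when returning through the database selection unit 203.
[0075]
Likewise, when processing proceeds from the database selection unit 203 through to the prosody modification method setting unit 210, the prosody modification method is changed according to the attached control information.
[0076]
The same applies when processing returns directly from the processing method setting unit 204 to the prosody modification method setting unit 210, except that the processing is applied to the speech units already selected.
[0077]
After the processing initiated by the processing method setting unit 204, the selection result evaluation unit 211 evaluates the result again as in the first pass, the evaluation result judgment unit 212 decides whether to repeat the processing, and the same process is repeated until no further processing is needed.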
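The select, evaluate, and retry cycle described above can be sketched as a loop over a list of per-pass settings. The sketch assumes a single numeric quality score where higher is better and a fixed list of settings; both are simplifications introduced here, and `select` and `evaluate` are hypothetical callables standing in for units 203 to 211.

```python
def synthesize_with_retries(select, evaluate, schedule, threshold):
    """Try each pass's settings in turn until the score reaches the threshold.

    select(settings) -> selected units; evaluate(units) -> score.
    Returns (units, score, pass_index) of the last pass that ran, or None
    if the schedule is empty. When every pass falls short, the result of
    the final pass is returned, since no further changes are available.
    """
    result = None
    for pass_index, settings in enumerate(schedule):
        units = select(settings)
        score = evaluate(units)
        result = (units, score, pass_index)
        if score >= threshold:
            break
    return result


# Toy stand-ins: each "database" yields one unit with a fixed score.
scores = {"small_db": 0.2, "medium_db": 0.5, "large_db": 0.9}
schedule = ["small_db", "medium_db", "large_db"]


def select(settings):
    return [settings]


def evaluate(units):
    return scores[units[0]]


units, score, passes = synthesize_with_retries(select, evaluate, schedule,
                                               threshold=0.8)
print(units, score, passes)
```

In this toy run the first two passes fall short and the third meets the threshold, so the loop stops at pass index 2.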
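[0078]
The flow of the system is described below using the synthesis of a weather forecast as an example. In this case, as shown in FIG. 4, the database unit 202 contains a weather forecast DB 202a, a Tri-phone DB 202b, and a morpheme DB 202c. Here "DB" stands for database.
[0079]
Suppose the text "今日は蒸し暑いでしょう" ("It will be hot and humid today") is input as the input information 100. Suppose also that, in addition to linguistic information such as accent phrases and parts of speech and prosodic information such as F0 patterns and durations, the following control information accompanies the text.
・Database (DB) to be used
Initial value: weather forecast DB
First loop: morpheme DB
Second loop: Tri-Phone DB
・Prosody modification
Initial value: no prosody modification
First loop: no prosody modification
Second loop: prosody modification applied
・Selection rules (which parameters to emphasize)
Initial value: accent type, match of the phoneme sequence
First loop: match of morpheme boundaries, match of the phonetic environment of the speech units
Second loop: closeness of F0 values, match of the phonetic environment
Based on this control information, the DB search unit 205 searches the weather forecast DB in the database unit 202 for candidate speech units from which the input text can be synthesized, and outputs the search results to the search result storage unit 206.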
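The control information listed above maps naturally onto a small table of per-pass settings. The sketch below shows one way it could be held in code; the class, field, and string names are hypothetical, and only the values (which database, whether prosody is modified, which factors are emphasized) come from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassSettings:
    """Settings for one pass through the selection loop (hypothetical names)."""
    database: str
    prosody_modification: bool
    emphasized: tuple


SCHEDULE = (
    # Initial value
    PassSettings("weather_forecast_db", False,
                 ("accent_type", "phoneme_sequence_match")),
    # First loop
    PassSettings("morpheme_db", False,
                 ("morpheme_boundary_match", "phonetic_environment_match")),
    # Second loop
    PassSettings("triphone_db", True,
                 ("f0_closeness", "phonetic_environment_match")),
)


def settings_for_pass(pass_index):
    """Return the settings for a pass (0 = initial), or None when exhausted."""
    if 0 <= pass_index < len(SCHEDULE):
        return SCHEDULE[pass_index]
    return None


print(settings_for_pass(1).database)
print(settings_for_pass(3))
```

Returning `None` once the table is exhausted gives the caller a simple signal that no further changes are specified and the loop should end.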
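[0080]
The speech unit selection unit 208 computes the various costs with weight placed on two selection criteria, namely whether the accent type matches and whether the stretch over which the unit's phoneme sequence matches the input text is long, finds the optimum combination of speech units, and outputs it to the selection result storage unit 209.
[0081]
Suppose the speech unit selection unit 208 selects units for "今日は" and "暑いでしょう" but no unit exists for "蒸し". This corresponds to the case in process 301 of the selection result evaluation unit 211 in which no speech unit exists. In process 401 of the evaluation result judgment unit 212, the evaluation result is therefore judged not to meet the criterion. In the following process 402, because the input control information specifies a change of database, reprocessing is judged necessary and processing proceeds to the processing method setting unit 204.
[0082]
Based on the control information in the input information 100, the processing method setting unit 204 specifies that the morpheme database be used and that no prosody modification be applied, and changes the selection rules to emphasize the costs of whether morpheme boundaries match and whether the phonetic environment of the selected unit is close to that of the input. Processing then proceeds to the database selection unit 203.
[0083]
Based on the control information set by the processing method setting unit 204, the database selection unit 203 sets the morpheme DB 202c, which is organized in units of morphemes, as the database to be used. The DB search unit 205 searches the morpheme DB 202c for speech units from which "蒸し" can be synthesized and stores the candidates in the search result storage unit 206.
[0084]
The selection rule setting unit 207 changes the rules for selecting the optimum combination of speech units from the search result storage unit 206 so that weight is placed on whether the morpheme boundaries of a unit match those of the input text and on whether the unit's preceding and following phonetic environments are close to the environment to be synthesized.
[0085]
The speech unit selection unit 208 computes the various costs based on the rules changed by the selection rule setting unit 207, finds the optimum combination of speech units, and stores it in the selection result storage unit 209.
[0086]
The prosody modification method setting unit 210 specifies the prosody modification method, but since no prosody modification is specified here, nothing changes from the first pass.
[0087]
The selection result evaluation unit 211 evaluates again, including the results of the processing initiated by the processing method setting unit 204. Suppose the only "蒸し" unit available has an accent different from that of "蒸し暑い", so that the accent-type evaluation of "蒸し暑いでしょう" takes an abnormal value. In process 401 of the evaluation result judgment unit 212, the evaluation result is then judged not to meet the criterion, the accent phrase "蒸し暑いでしょう" is judged to need correction, and processing again enters the loop starting at the processing method setting unit 204.
[0088]
Following the control information in the original input information, the processing method setting unit 204 now specifies that the Tri-Phone DB 202b be searched, that selection emphasize how well the phonetic environment and F0 values match, and that prosody modification be applied at synthesis. Processing proceeds to the database selection unit 203. Then, to synthesize the accent phrase "蒸し暑いでしょう", candidates are retrieved from the DB as before, the optimum combination of speech units is selected, and it is evaluated.
[0089]
Next, in the evaluation result judgment unit 212, if the evaluation of the selected units is good, processing proceeds directly to the synthesis unit 213. Even if the evaluation is still poor, the input control information specifies no further changes to the database, the selection rules, or the prosody modification method, so the loop from the processing method setting unit 204 ends here and processing proceeds to the synthesis unit 213.
[0090]
The synthesis unit 213 synthesizes the speech "今日は蒸し暑いでしょう" according to the final result. The "今日は" portion is the unit selected in the first pass, for which no prosody modification is specified, so the selected unit is output as is. The following "蒸し暑いでしょう" portion is the result of the last selection, for which prosody modification is specified, so it is modified, concatenated with the "今日は" unit, and output as the final synthesized speech.
[0091]
With conventional speech synthesis technology, ideal speech units do not always exist. When no unit with the right accent type can be selected, one must either tolerate the accent mismatch or select a unit with a different phonetic environment but a close F0 value, matching the accent at the expense of intelligibility. In this embodiment, by contrast, processing can be adapted to the case at hand, for example by reselecting so that at least the phonetic environment of the units matches, on the premise that F0 will be matched by signal processing, and then synthesizing. In exchange for accepting some loss of natural voice quality due to prosody modification, synthesized speech that retains intelligibility and has the correct accent can thus be produced. In other words, taking natural voice quality into account in addition to accent and intelligibility widens the range of choices and raises the likelihood of obtaining better synthesized speech in cases where conventional methods cannot produce good results.
[0092]
In addition, as in the weather forecast example, a staged procedure that first selects from a small database dedicated to weather forecasts and uses a larger, more general database only where selection failed allows stock sentences that can be used as is to be selected quickly, while only items missing from that DB, such as unusual place names and katakana words, are selected from the large DB. Good-quality synthesized speech is thereby obtained quickly.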
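The staged lookup, with a small task-specific database first and a larger one only for what is missing, can be sketched as follows. The databases are modelled as plain dictionaries keyed by romanized segment strings, which is an assumption made for illustration; the function name and the sample contents are hypothetical.

```python
def select_with_fallback(segments, databases):
    """Look up each segment in the databases in order, smallest first.

    segments:  list of segment keys to synthesize
    databases: ordered list of (name, {segment: unit}) pairs
    Returns {segment: (database_name, unit)}, with None for segments
    that no database can supply.
    """
    chosen = {}
    for segment in segments:
        chosen[segment] = None
        for name, units in databases:
            if segment in units:
                chosen[segment] = (name, units[segment])
                break  # stop at the first (smallest) database that has it
    return chosen


# Toy data loosely following the example sentence, romanized for simplicity.
weather_db = {"kyou_wa": "w1", "atsui_deshou": "w2"}
morpheme_db = {"mushi": "m1", "kyou_wa": "m2"}
databases = [("weather_db", weather_db), ("morpheme_db", morpheme_db)]

result = select_with_fallback(["kyou_wa", "mushi", "atsui_deshou"], databases)
print(result)
```

Only the segment missing from the weather database falls through to the morpheme database; the stock phrases keep their first-pass units.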
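[0093]
Needless to say, the above embodiment and examples are specific instances of the present invention, and the invention is not limited to the configurations of these instances.
[0094]
Creating a computer-readable information storage medium on which the above speech synthesis program is recorded makes it possible to distribute the program easily.
[0095]
[Effects of the invention]
As described above, according to the present invention, the speech units used for synthesis are selected from the database based on the input information, and the quality of the selected units is evaluated. When the evaluation result falls short of the reference value, the speech unit selection processing is performed multiple times based on at least one item of control information included in the input information, namely information specifying the phoneme sequence, the target fundamental frequency values for the synthesis, the duration, the database to be used, and the signal processing method. Speech units whose evaluation result reaches the reference value are thereby selected and used to generate the synthesized speech. Consequently, even though the input character data varies with the synthesis task, the optimum speech units can be selected according to conditions such as which units the database holds for the input.
[0096]
Furthermore, depending on the evaluation of the selected units, the present invention performs the following N times: reselecting units after changing the database from which they are selected, reselecting units after changing the rules for selecting them from the database, changing the synthesis method applied to the selected units, or a combination of these three. This yields the highly advantageous effect that speech synthesis processing appropriate to the database used and the input information can be performed.
[Brief description of the drawings]
[FIG. 1] Functional block diagram of a speech synthesis device according to an embodiment of the present invention
[FIG. 2] Flowchart of the quality evaluation processing of the selection result evaluation unit in the embodiment
[FIG. 3] Flowchart of the processing of the evaluation result judgment unit in the embodiment
[FIG. 4] Diagram of the system flow in the embodiment, using weather forecast synthesis as an example
[FIG. 5] Frequency characteristic diagram illustrating the problem with the conventional example
[Description of reference signs]
100: input information, 200: speech synthesis device, 201: input unit, 202: database unit, 202-1 to 202-n: databases, 202a: weather forecast DB, 202b: Tri-phone DB, 202c: morpheme DB, 203: database selection unit, 204: processing method setting unit, 205: database search unit (DB search unit), 206: search result storage unit, 207: selection rule setting unit, 208: speech unit selection unit, 209: selection result storage unit, 210: prosody modification method setting unit, 211: selection result evaluation unit, 212: evaluation result judgment unit, 213: synthesis unit, 214: synthesized speech storage unit, 215: synthesized speech output unit.
[0001]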
[Technical field of the invention]
The present invention relates to a speech synthesis method capable of synthesizing high-quality speech, a device therefor, a computer program therefor, and an information storage medium storing the program.
[0002]
[Prior art]
Synthesized speech is used in many fields, including information guidance systems such as telephone stock-price services and the reading aloud of e-mail and Web pages. However, current synthesized speech still falls short of human speech in quality, and there is strong demand for improvement.
[0003]
One conventional text-to-speech approach searches a large speech corpus for speech units usable for synthesis, selects the optimum units from those found, and outputs the selected units without modifying their prosody, thereby synthesizing natural-sounding speech (first conventional example).
[Reference: Japanese Patent No. 2761552, "Speech synthesis device"]
Another approach applies signal processing to the speech units used for synthesis so that they match the target prosody before output (second conventional example). [Reference: "A NEW F0 MODIFICATION ALGORITHM BY MANIPULATING HARMONICS OF MAGNITUDE SPECTRUM", Satoshi TAKANO, Masanobu ABE, Eurospeech '99]
[0004]
[Patent Document 1]
Japanese Patent No. 2761552
[Non-patent document 1]
"A NEW F0 MODIFICATION ALGORITHM BY MANIPULATING HARMONICS OF MAGNITUDE SPECTRUM", Satoshi TAKANO, Masanobu ABE, Eurospeech '99
[0005]
[Problems to be solved by the invention]
However, in the first conventional example, the rules for selecting the optimum speech units are designed in advance, and units are selected according to those rules. The same rules are therefore applied to every input. The input text varies with the synthesis task, and the characteristics of the candidate units also vary with conditions such as the input and the database, so it is difficult to create rules that always yield the optimum combination of speech units for every input.
[0006]
For example, for the input sentence "It will be hot and humid today", the target prosody generated by the system does not necessarily match the prosody of the sentence as actually spoken. This is clear from the experimental results shown in FIG. 5, where the vertical axis is frequency and the horizontal axis is time. The crosses mark the fundamental frequency (F0) values of the input synthesis target, and the circles mark the fundamental frequency (F0) of the speech as actually uttered.
[0007]
If the database contains a speech unit with exactly the same phoneme sequence as the input "It will be hot and humid today.", speech that sounds exactly like the original recording can be obtained by writing the rules so that some difference between the generated prosody and the database prosody is tolerated.
[0008]
However, if only fragmentary units such as "ki", "yo", "u", "wa", and so on can be found in the database, rules that tolerate prosodic differences may degrade the quality of the synthesized speech.
[0009]
When prosody is important, for example when adding emotion to synthesized speech, tolerating prosodic differences as described above is likely to prevent the emotion from being reproduced. Even when the database contains a sentence with exactly the same phoneme sequence as the input, emotional speech is unlikely to be reproduced if no emotional speech was collected when the database was built. When the prosody of the synthesized speech matters in this way, modifying the prosody to match the target as in the second conventional example is effective. Modifying the prosody, however, yields the target prosody at the cost of the natural, human-voice quality of the speech.
[0010]
The larger the prosodic modification applied to a speech unit, the greater the degradation in sound quality. When modifying toward a target prosody, it is therefore desirable to select and use units whose prosody is as close to the target as possible.
[0011]
Thus the characteristics of the candidate units and the criteria for selecting the optimum units differ with the synthesis task and the database searched, which makes it difficult to synthesize high-quality speech consistently across situations.
[0012]
In view of the above problems, an object of the present invention is to provide a speech synthesis method, a device therefor, a computer program therefor, and an information storage medium storing the program, capable of synthesizing speech that is optimum for the situation and of high quality.
[0013]
[Means for Solving the Problems]
To achieve this object, the present invention proposes a speech synthesis method that uses a speech synthesis device, namely a computer device that has a plurality of types of databases storing speech units and converts input character data into speech, to select speech units corresponding to the input character data from the databases and to synthesize the corresponding speech from the selected units. In this method, the device obtains input information that includes, in addition to the ordinary inputs for speech synthesis such as the target input character string, fundamental frequency, and duration, control information containing at least one of: information specifying the database to be used, information specifying the rules for selecting speech units, and information specifying the signal processing method. Based on the input information, the device selects the speech units to be used for synthesis from the databases and evaluates the quality that would result from synthesizing speech with the selected units. When the evaluation result falls short of a reference value, the device processes the speech units according to the control information included in the input information and evaluates the result again. This sequence is repeated until the reference value is satisfied, and the synthesized speech is thereby generated.
[0014]
According to this method, the speech units used for synthesis are selected from the database based on the input information, and the quality of the selected units is evaluated. When the evaluation result falls short of the reference value, processing based on the control information included in the input information is performed multiple times to produce the synthesized speech. The control information used is at least one of the information specifying the database to be used, the rules for selecting speech units, and the signal processing method.
[0015]
Further, in the above-mentioned speech synthesis method, in the above-described speech synthesis method, as a result of evaluating the selected speech unit, if the value does not satisfy the reference value, at least one of the speech unit and a part of the speech unit is performed. As a process for, a speech synthesis method of generating a synthesized speech after performing a process of changing a prosody transformation method is proposed.
[0016]
According to the speech synthesis method of the present invention, when the evaluation result is less than the reference value, the prosody transformation method is changed for at least one of the speech unit and a part of the speech unit, and the synthesized speech is changed. Is generated.
[0017]
Further, in the above-mentioned speech synthesis method, in the above-described speech synthesis method, as a result of evaluating the selected speech unit, if the value does not satisfy the reference value, at least one of the speech unit and a part of the speech unit is performed. As a process for, a rule for selecting a speech unit is changed, and a speech unit is again selected from the database to generate a speech synthesis method.
[0018]
According to the speech synthesis method of the present invention, for a speech unit whose evaluation result is less than the reference value, a rule for selecting a speech unit is changed, and a speech unit is again selected from the database. This produces a synthesized speech.
[0019]
Further, in the above-mentioned speech synthesis method, in the above-described speech synthesis method, as a result of evaluating the selected speech unit, if the value does not satisfy the reference value, at least one of the speech unit and a part of the speech unit is performed. As a process for, a database for searching for a candidate speech unit is changed, a speech unit is selected again, and a speech synthesis method for generating a synthesized speech is proposed.
[0020]
According to the speech synthesis method of the present invention, for speech units whose evaluation results are less than the reference value, the database for searching candidate speech units is changed, and speech units are selected again. As a result, a synthesized speech is generated.
[0021]
Further, in the above-mentioned speech synthesis method, in the above-described speech synthesis method, as a result of evaluating the selected speech unit, if the value does not satisfy the reference value, at least one of the speech unit and a part of the speech unit is performed. As a process for, a rule for selecting a speech unit is changed, a speech unit is selected again from the database, and a speech unit or a speech unit is processed as a process when the reference value is not satisfied. As a process for at least one of the parts, a speech synthesis method for generating a synthesized speech by performing a process of changing a prosody transformation method is proposed.
[0022]
According to the speech synthesis method of the present invention, for a speech unit whose evaluation result is less than the reference value, a rule for selecting a speech unit is changed, and a speech unit is again selected from the database. At the same time, the method of prosody modification is changed for at least one of the speech unit and a part of the speech unit to generate a synthesized speech.
[0023]
Further, in the above-mentioned speech synthesis method, in the above-described speech synthesis method, as a result of evaluating the selected speech unit, if the value does not satisfy the reference value, at least one of the speech unit and a part of the speech unit is performed. As a process for, a database for searching for a candidate speech unit is changed and a speech unit is selected again, and when the speech unit is less than the reference value, the speech unit or a part of the speech unit is As at least one of the processes, a speech synthesis method of generating a synthesized speech by performing a process of changing a prosody modification method is proposed.
[0024]
According to the speech synthesis method of the present invention, for speech segments whose evaluation result is less than the reference value, the database for searching candidate speech segments is changed, and speech segments are selected again. At the same time, the method of prosody transformation is changed for at least one of the speech unit and a part of the speech unit to generate a synthesized speech.
[0025]
Further, in the above-mentioned speech synthesis method, in the above-described speech synthesis method, as a result of evaluating the selected speech unit, if the value does not satisfy the reference value, at least one of the speech unit and a part of the speech unit is performed. As a process for, a rule for selecting a speech unit is changed, a speech unit is selected again from the database, and a speech unit or a speech unit is processed as a process when the reference value is not satisfied. As a process for at least one of the parts, a speech synthesis method for generating a synthesized speech by changing a database for searching for a candidate speech unit and selecting a speech unit again is proposed.
[0026]
According to the speech synthesis method of the present invention, a rule for selecting a speech unit is changed for a speech unit whose evaluation result is less than the reference value, and a speech unit is selected again from the database. Is performed, and for a speech unit that does not satisfy the reference value, a database for searching for a candidate speech unit is changed, and a speech unit is selected again to generate a synthesized speech.
[0027]
Further, the present invention proposes, in the above-described speech synthesis method, a speech synthesis method in which, when evaluation of the selected speech units finds a speech unit that does not satisfy the reference value, the rule for selecting speech units is changed and speech units are selected again from the database, as processing for at least one of that speech unit and a part of that speech unit; when the reference value is still not satisfied, the database searched for candidate speech units is changed and speech units are selected again as processing for at least one of that speech unit and a part of that speech unit; and, when the reference value is still not satisfied, the prosody modification method is changed as processing for at least one of that speech unit and a part of that speech unit, thereby generating synthesized speech.
[0028]
According to the speech synthesis method of the present invention, for a speech unit whose evaluation result is less than the reference value, the rule for selecting speech units is changed and speech units are selected again from the database. For a speech unit that still does not satisfy the reference value, the database searched for candidate speech units is changed and speech units are selected again. Then, for at least one of a speech unit that still does not satisfy the reference value and a part of that speech unit, the prosody modification method is changed, and synthesized speech is generated.
[0029]
Further, the present invention proposes a speech synthesizer for implementing the above-described speech synthesis method, which selects speech units corresponding to input character data from a database and synthesizes speech corresponding to the input character data from the selected speech units. The speech synthesizer comprises: a plurality of types of databases in which speech units are stored; means for obtaining input information that includes, in addition to the normal inputs for speech synthesis such as the input character string, fundamental frequency, and duration that are the target of speech synthesis, control information for speech synthesis including information specifying at least one of the database to be used, the rule for selecting speech units, and the signal processing method; means for selecting, based on the input information, speech units to be used for synthesis from the database; and means for generating synthesized speech by evaluating the quality of speech synthesized with the selected speech units and, when the evaluation result is less than the reference value, performing processing on the speech units based on the control information included in the input information and evaluating the result of that processing again, this series of steps being repeated a plurality of times until the reference value is satisfied.
[0030]
The present invention also proposes, in the above speech synthesis device, a speech synthesis device having means for changing the prosody modification method for at least one of a speech unit and a part of that speech unit when evaluation of the selected speech units finds a speech unit that does not satisfy the reference value.
[0031]
The present invention also proposes, in the above speech synthesis device, a speech synthesis device having means for changing the rule for selecting speech units and selecting speech units again from the database when evaluation of the selected speech units finds a speech unit that does not satisfy the reference value.
[0032]
The present invention also proposes, in the above speech synthesis device, a speech synthesis device having means for changing the database searched for candidate speech units and selecting speech units again when evaluation of the selected speech units finds a speech unit that does not satisfy the reference value.
[0033]
In addition, as a computer program for easily configuring, with a well-known computer device, a speech synthesis device that carries out the above-described speech synthesis method, the present invention proposes a speech synthesis computer program that uses a computer device provided with a plurality of types of databases in which speech units are stored, selects speech units corresponding to input character data from the database, and synthesizes speech corresponding to the input character data from the selected speech units. The program includes the steps of: obtaining input information that includes, in addition to the normal inputs for speech synthesis such as the input character string, fundamental frequency, and duration that are the target of speech synthesis, control information for speech synthesis including information specifying at least one of the database to be used, the rule for selecting speech units, and the signal processing method; selecting, based on the input information, speech units to be used for synthesis from the database; and generating synthesized speech by evaluating the quality of speech synthesized with the selected speech units and, when the evaluation result is less than the reference value, performing processing on the speech units based on the control information included in the input information and evaluating the result of that processing again, this series of steps being repeated a plurality of times until the reference value is satisfied.
[0034]
Further, the present invention proposes, in the above speech synthesis computer program, a speech synthesis computer program including a step of changing the prosody modification method for at least one of a speech unit that, as a result of evaluating the selected speech units, did not satisfy the reference value and a part of that speech unit.
[0035]
Further, the present invention proposes, in the above speech synthesis computer program, a speech synthesis computer program including a step of changing the rule for selecting speech units and selecting speech units again from the database, as processing for at least one of a speech unit and a part of that speech unit, when evaluation of the selected speech units finds a speech unit that does not satisfy the reference value.
[0036]
Further, the present invention proposes, in the above speech synthesis computer program, a speech synthesis computer program including a step of changing the database searched for candidate speech units and selecting speech units again, as processing for at least one of a speech unit and a part of that speech unit, when evaluation of the selected speech units finds a speech unit that does not satisfy the reference value.
[0037]
The present invention also proposes a computer-readable information storage medium storing the above-mentioned speech synthesis computer program so that the above-mentioned speech synthesis computer program can be easily distributed.
[0038]
As described above, the input text changes according to the speech synthesis task, and it has been difficult to select the optimal speech units under varying conditions such as which speech units the database holds for a given input. In the present invention, depending on the evaluation result of the selected speech units, the database from which speech units are selected is changed and speech units are selected again, the rule for selecting speech units is changed and selection is performed again, the synthesis method for the selected speech units is changed, or processing that combines these three is performed multiple times. This makes it possible to perform processing appropriate to the database used and the input information, which solves the above problem.
[0039]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
[0040]
FIG. 1 is a functional configuration diagram showing a speech synthesis device according to an embodiment of the present invention. In the figure, reference numeral 100 denotes input information, which is the input for speech synthesis. It comprises the character information to be synthesized (hereinafter referred to as text), its phoneme sequence, the target fundamental frequency for synthesis (hereinafter referred to as F₀), and control information for speech synthesis such as information specifying the database to be used and the signal processing method.
[0041]
Reference numeral 200 denotes a speech synthesis processing device, which is a well-known computer device and comprises an input unit 201, a database unit 202 including a plurality of types of databases 202a, a database selection unit 203, a processing method setting unit 204, a database search unit (hereinafter, DB search unit) 205, a search result storage unit 206, a selection rule setting unit 207, a speech unit selection unit 208, a selection result storage unit 209, a prosody transformation method setting unit 210, a selection result evaluation unit 211, an evaluation result determination unit 212, a synthesis unit 213, a synthesized speech storage unit 214, and a synthesized speech output unit 215. These components are implemented by both the hardware and the software of the computer device.
[0042]
The input unit 201 acquires the input information 100 and sends it to the database selection unit 203.
[0043]
The database unit 202 is composed of a plurality of types of databases 202-1 to 202-n (n is a natural number). Each database stores information for synthesis, such as F₀ patterns, phoneme label strings corresponding to the utterance content, and label data indicating phoneme boundaries. The databases include a general-purpose database that collects phonemes together with their preceding and following phoneme environments (Tri-phone), a database used for a specific task such as reading news or weather forecasts, a database containing place names, a database containing a basic phoneme set, and the like. As an example of databases that depend on the speech synthesis task, a system that synthesizes weather forecasts requires a weather forecast database containing weather names and the fixed phrases used in weather forecasts, and a database containing place names. These can be used in combination with a database containing a basic phoneme set for synthesizing arbitrary character data and sentences.
[0044]
The database selection unit 203 determines, from the input information 100 and according to instructions from the processing method setting unit 204, which of the databases 202-1 to 202-n in the database unit 202 to use. When the evaluation result determination unit 212 described later repeats, in a loop, the processing for the case where the evaluation result is less than the reference value, the information on which of the databases 202-1 to 202-n to use is given as control information fed back by the processing method setting unit 204.
[0045]
When the evaluation result determination unit 212 determines that the evaluation result does not satisfy the reference value, the processing method setting unit 204 adds control information for changing the setting conditions of one or more of the database selection unit 203, the selection rule setting unit 207, and the prosody transformation method setting unit 210. Processing then proceeds to the database selection unit 203 if the database to be searched is to be changed, to the selection rule setting unit 207 if the selection rule is to be changed without changing the database, and to the prosody transformation method setting unit 210 if only the prosody modification method is to be changed.
[0046]
The DB search unit 205 searches the databases 202-1 to 202-n determined by the database selection unit 203 for speech units usable for synthesis, extracts them, and sends them to the search result storage unit 206.
[0047]
The search result storage unit 206 temporarily stores the speech units searched and extracted by the DB search unit 205.
[0048]
The selection rule setting unit 207 sets a speech unit selection rule by the speech unit selection unit 208 based on the control information.
[0049]
The speech unit selection unit 208 selects from the speech units listed as candidates by the search of the DB search unit 205, according to the setting made by the selection rule setting unit 207. For the speech units found and stored in the search result storage unit 206, it calculates as costs the factors that affect the quality of the synthesized speech, such as F₀, duration, and phoneme environment, selects the optimal combination of speech units, and sends it to the selection result storage unit 209.
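The cost-based selection described above can be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` fields, the weight names, and the per-position independent choice are hypothetical, and the sketch ignores any cost of joining adjacent units, which a search for the optimal combination would normally also consider.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate speech unit for one position (fields are hypothetical)."""
    unit_id: str
    f0_diff: float        # difference from the target F0
    duration_diff: float  # difference from the target duration
    env_mismatch: float   # mismatch of the phoneme environment


def weighted_cost(c: Candidate, weights: dict) -> float:
    """Weighted sum of the quality-related factors; lower is better."""
    return (weights["f0"] * c.f0_diff
            + weights["duration"] * c.duration_diff
            + weights["env"] * c.env_mismatch)


def select_units(candidates_per_position, weights):
    """Pick the lowest-cost candidate independently for each position."""
    return [min(slot, key=lambda c: weighted_cost(c, weights))
            for slot in candidates_per_position]
```

Changing the weights plays the role of changing the selection rule: the same candidates can yield a different selection when, for example, phoneme environment is emphasized over F₀.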
[0050]
The selection result storage unit 209 temporarily stores the optimum combination of speech segments acquired from the speech unit selection unit 208.
[0051]
Prosody transformation method setting section 210 specifies a prosody transformation method based on the control information.
[0052]
The selection result evaluation unit 211 evaluates the quality when the speech is synthesized using the speech units selected by the speech unit selection unit 208 and stored in the selection result storage unit 209.
[0053]
The evaluation result determination unit 212 determines whether or not to reprocess the selection result, based on the evaluation result from the selection result evaluation unit 211 and the control information included in the input information. That is, if the input control information and the comparison of the evaluation result with the reference value indicate that reprocessing is necessary, processing proceeds to the processing method setting unit 204.
[0054]
When the evaluation result determination unit 212 determines that reprocessing is not necessary, the synthesis unit 213 processes the selected speech units according to the prosody transformation method specified in the control information, connects the speech units, and sends the result to the synthesized speech storage unit 214 as synthesized speech.
[0055]
The synthesized speech storage unit 214 temporarily stores the synthesized speech input from the synthesis unit 213.
[0056]
The synthesized voice output unit 215 outputs the synthesized voice stored in the synthesized voice storage unit 214.
[0057]
Next, an embodiment of the quality evaluation processing of the selection result in the selection result evaluation unit 211 will be described in detail with reference to the flowchart of FIG. 2.
[0058]
First, the selection result evaluation unit 211 determines whether a speech unit has been selected for the input, that is, whether a speech unit is stored in the selection result storage unit 209 (301). For example, in a system that synthesizes the weather forecast task described above, if speech units are first selected using only the weather forecast database and the place name database, a fixed sentence such as "The weather is fine" exists in the weather forecast database, so a high-quality result is obtained quickly. However, narrowing the search range for speed makes it likely that the necessary speech units are absent from the database in unusual cases such as "Snow fell in Okinawa".
Therefore, when the process of 401 identifies the portions for which no speech unit exists, speech units can be selected again, only for the portions that could not be selected from the small task-dependent weather forecast database, using a large database that contains the complete basic phoneme set.
[0059]
If the determination in 301 finds that no speech unit is present in the selection result storage unit 209, the selection result evaluation unit 211 cannot evaluate the quality of the synthesized speech, so it outputs the result as it is to the evaluation result determination unit 212.
[0060]
On the other hand, when the speech unit exists in the selection result storage unit 209, the selection result evaluation unit 211 performs the processing of 302 and 303, and evaluates the quality of the selected speech unit.
[0061]
In the process of 302, the selection result evaluation unit 211 determines the phoneme environment of each selected speech unit. For example, when the "S" part of the voice having the phoneme sequence "ASITA" is used from the database, "A" is the pre-S environment and "I" is the post-S environment. When this S is used as "S" of "KESU", the environment of S is "E" and "U", respectively, which is completely different from the phonemic environment of the database. In order to represent the degree to which the phonemic environments are different, data obtained by analyzing the spectral patterns of the respective phonemic environments in advance is used, and the degree to which the phonemic environments are different is evaluated by comparison.
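The "ASITA" / "KESU" example above can be reproduced with a small helper. This is a minimal sketch; the function name and the representation of a phoneme sequence as a list of one-letter symbols are assumptions made for illustration.

```python
def phoneme_environment(phonemes, index):
    """Return the (preceding, following) phonemes around phonemes[index].

    None marks the start or end of the sequence.
    """
    before = phonemes[index - 1] if index > 0 else None
    after = phonemes[index + 1] if index + 1 < len(phonemes) else None
    return before, after


# "S" taken from the database utterance "ASITA" has environment ("A", "I") ...
source_env = phoneme_environment(list("ASITA"), 1)
# ... but when used as the "S" of "KESU" the target environment is ("E", "U").
target_env = phoneme_environment(list("KESU"), 2)
```

Here `source_env` and `target_env` differ in both positions, which is the mismatch that process 302 is meant to quantify.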
[0062]
Suppose that N speech units are used, that the environment of the n-th speech unit is represented as Tri-Phone[n] (hereinafter, TP[n]), and that the corresponding spectrum patterns are TargetEnvironment[TP[n]] (hereinafter, TE[TP[n]]) and OriginalEnvironment[TP[n]] (hereinafter, OE[TP[n]]). The degree of difference in the phonemic environment of each speech unit, evaluated over all N speech units, can then be obtained as in the following equation (1).
[0063]
(Equation 1)
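Equation (1) itself is not reproduced in this text. As an illustration only, one plausible reading is a sum, over the N speech units, of a distance between TE[TP[n]] and OE[TP[n]]. The sketch below assumes that each spectrum pattern is a fixed-length numeric vector and that the distance is Euclidean; both are assumptions, not the definition given in the original equation.

```python
import math


def environment_distance(te, oe):
    """Euclidean distance between two spectrum-pattern vectors (assumed metric)."""
    return math.sqrt(sum((t - o) ** 2 for t, o in zip(te, oe)))


def phoneme_environment_score(te_list, oe_list):
    """Sum the per-unit distances over all N speech units.

    te_list[n] stands for TE[TP[n]] and oe_list[n] for OE[TP[n]].
    """
    if len(te_list) != len(oe_list):
        raise ValueError("TE and OE must cover the same N speech units")
    return sum(environment_distance(te, oe)
               for te, oe in zip(te_list, oe_list))
```

A score of zero means every unit is used in the same spectral environment it was recorded in; larger values indicate a larger mismatch.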

[0064]
Next, in the process of 303, the selection result evaluation unit 211 determines whether the accent type of each accent phrase is correctly reproduced by the selected speech units. When the optimal combination of speech units is output without prosodic transformation in order to preserve the natural-voice quality of the synthesized speech, the correct accent is not always reproduced, so the degree to which the accent type matches the target must be evaluated.
[0065]
As an accent type evaluation method, for the phoneme sequence to be synthesized, the F₀ values of the synthesis target and of the selected speech units are compared at the vowel portions, where F₀ values are obtained stably, to evaluate how well the accent type is reproduced. For example, if there are N vowels in the phoneme sequence to be synthesized, let the F₀ value at the center of the n-th vowel of the synthesis target be TargetVoiceF0[n] (hereinafter, TF0[n]) and the F₀ value at the center of the n-th vowel of the selected speech unit be OriginalVoiceF0[n] (hereinafter, OF0[n]). The difference in F₀ value transition can then be obtained as in the following equation (2).
[0066]
(Equation 2)
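Equation (2) itself is not reproduced in this text. As an illustration only, the "difference in F₀ value transition" can be read as comparing how F₀ moves from one vowel to the next in the target (TF0) and in the selected units (OF0). The formula below, a sum of absolute differences between successive F₀ steps, is an assumption chosen for the sketch and may differ from the original equation.

```python
def f0_transition_difference(tf0, of0):
    """Compare vowel-to-vowel F0 movement of target (tf0) and selection (of0).

    tf0[n] and of0[n] are the F0 values at the center of the n-th vowel.
    Returns 0.0 when both contours rise and fall by the same amounts.
    """
    if len(tf0) != len(of0):
        raise ValueError("TF0 and OF0 must cover the same N vowels")
    total = 0.0
    for n in range(len(tf0) - 1):
        target_step = tf0[n + 1] - tf0[n]
        original_step = of0[n + 1] - of0[n]
        total += abs(target_step - original_step)
    return total
```

Under this reading a constant pitch offset between speaker and target does not count against the accent type, because only the shape of the contour is compared.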

[0067]
Through the above processing, data of the evaluation result is added to the selected combination of speech units, and the result is determined by the processing of the evaluation result determination unit 212.
[0068]
Next, the processing of the evaluation result determination unit 212 will be described in detail with reference to the flowchart of FIG. 3.
[0069]
The evaluation result determination unit 212 determines whether subsequent processing is necessary based on the evaluation result of each selected speech unit (401). If the result of this determination is that processing is not required, processing proceeds to the processing of the combining unit 213.
[0070]
If the determination at 401 finds that processing is necessary, the evaluation result determination unit 212 determines whether or not to change the database to be searched as the processing to be executed next (402). If the database is not to be changed, it is determined in the subsequent steps 403 and 404 whether or not to change the selection rule and the prosody transformation method, respectively. When any one of these changes is to be made, processing proceeds to the processing method setting unit 204.
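The determination steps 401 to 404 can be sketched as a single function. The step numbers follow FIG. 3 as described in the text; the function name, the boolean parameters, and the convention that an empty result means "proceed to the synthesis unit 213" are assumptions made for this illustration.

```python
def determine_changes(evaluation_ok, can_change_database,
                      can_change_rule, can_change_prosody):
    """Return the list of settings to change; an empty list means synthesize."""
    if evaluation_ok:                 # 401: no further processing needed
        return []
    if can_change_database:           # 402: change the database to search
        return ["database"]
    changes = []
    if can_change_rule:               # 403: change the selection rule
        changes.append("selection_rule")
    if can_change_prosody:            # 404: change the prosody transformation
        changes.append("prosody")
    return changes
```

When the returned list is non-empty, processing would continue at the processing method setting unit 204; when nothing remains to be changed, the sketch falls through to synthesis, which matches the worked weather forecast example later in the description.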
[0071]
In the processing method setting unit 204, when the evaluation result determination unit 212 has determined that reprocessing such as speech unit selection is necessary, control information indicating which of the database, the selection rule, and the prosody modification method is to be changed is added. Processing then proceeds to the database selection unit 203 if the database to be searched is changed, to the selection rule setting unit 207 if the selection rule is changed without changing the database, and to the prosody modification method setting unit 210 if only the prosody modification method is changed.
[0072]
When processing returns to the database selection unit 203 via the processing method setting unit 204, the database to be searched is changed based on the control information added by the processing method setting unit 204, and candidate speech units are searched for again.
[0073]
Thereafter, processing proceeds in the same way as in the first pass. When it reaches the selection rule setting unit 207, and control information for changing the selection rule has been added by the processing method setting unit 204, the selection rule setting unit 207 changes the selection rule according to the control information and selection is performed again.
[0074]
When processing returns directly to the selection rule setting unit 207 from the processing method setting unit 204, the candidate speech units are the same as the first search candidates, because speech units have not been searched for again in the database. The subsequent processing flow is the same as when processing proceeds from the database selection unit 203.
[0075]
Similarly, even when the process proceeds from the process of the database selection unit 203 to the process of the prosody transformation method setting unit 210, the process of changing the prosody transformation method is performed according to the added control information.
[0076]
The same applies to the case where the processing returns to the processing of the prosody transformation method setting section 210 directly from the processing of the processing method setting section 204, except that the processing is performed on the already selected speech unit.
[0077]
The result of the processing from the processing method setting unit 204 onward is evaluated again by the selection result evaluation unit 211 in the same way as in the first pass, and the evaluation result determination unit 212 determines whether or not to repeat the processing. The processing is repeated until the evaluation result determination unit 212 determines that reprocessing is no longer necessary.
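The overall select, evaluate, and reprocess loop can be sketched as below. This is a simplified illustration, not the device's actual control flow: `stages`, `select`, and `evaluate` are hypothetical stand-ins for the control information and for units 203 to 211, the evaluation is assumed to be a single score where higher is better, and the whole input is reprocessed on each pass rather than only the speech units that failed.

```python
def run_selection_loop(stages, select, evaluate, reference):
    """Try each stage's settings in order until the reference value is met.

    stages:    ordered settings (initial value first, then one per loop)
    select:    callable(settings) -> list of selected units (may be empty)
    evaluate:  callable(units) -> quality score, higher is better (assumed)
    reference: score that must be reached to stop reprocessing
    Returns (units, settings, score) of the last pass performed.
    """
    if not stages:
        raise ValueError("at least one stage (the initial settings) is required")
    for settings in stages:
        units = select(settings)
        # With no selected units there is nothing to evaluate (step 301).
        score = evaluate(units) if units else None
        if score is not None and score >= reference:
            break
    return units, settings, score
```

If every stage is exhausted without reaching the reference value, the result of the final pass is returned, mirroring the case in which no further change is specified and processing moves on to synthesis.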
[0078]
Hereinafter, the flow of the system will be described taking as an example the case of synthesizing a weather forecast. In this case, as shown in FIG. 4, the database unit 202 includes a weather forecast DB 202a, a Tri-phone DB 202b, and a morpheme DB 202c. Here, "DB" stands for database.
[0079]
First, it is assumed that the text "It will be hot and humid today" is input as the input information 100. As control information accompanying the text, in addition to linguistic information such as accent phrases and parts of speech and prosody information such as the F₀ pattern and duration, the following control information is input in this example.
・ Specification of database (DB) to be used
Initial value: Weather forecast DB
First loop: Morpheme DB
Second loop: Tri-phone DB
・ Specification of prosody transformation
Initial value: no prosody transformation
First loop: no prosody transformation
Second loop: with prosody transformation
・ Specification of selection rules (which parameters are emphasized)
Initial value: accent type, phoneme sequence match
First loop: morpheme boundary match, phonemic environment match of speech units
Second loop: closeness of F₀ values, phonemic environment match
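The control information listed above can be pictured as an ordered schedule with one entry per pass. The sketch below is illustrative only: the class name, field names, and string identifiers are hypothetical, and only the values themselves (which DB, whether prosody transformation is applied, which parameters are emphasized) come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSettings:
    database: str
    prosody_transformation: bool
    selection_priorities: tuple


# Index 0 is the initial value, 1 the first loop, 2 the second loop.
CONTROL_SCHEDULE = (
    StageSettings("weather_forecast_db", False,
                  ("accent_type", "phoneme_sequence_match")),
    StageSettings("morpheme_db", False,
                  ("morpheme_boundary_match", "phoneme_environment_match")),
    StageSettings("triphone_db", True,
                  ("f0_closeness", "phoneme_environment_match")),
)


def settings_for_pass(pass_index):
    """Return the settings for a pass, or None once the schedule is exhausted."""
    if 0 <= pass_index < len(CONTROL_SCHEDULE):
        return CONTROL_SCHEDULE[pass_index]
    return None
```

Returning `None` after the second loop corresponds to the situation in which no further change is specified in the input control information.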
Based on the control information, the DB search unit 205 searches the weather forecast DB in the database unit 202 for candidate speech units from which the input text can be synthesized, and outputs the search result to the search result storage unit 206.
[0080]
As the selection rule, the speech unit selection unit 208 places weight on elements such as whether the accent type matches and whether the portion where the phoneme sequence of the speech unit matches the input text is long. It calculates the various costs, obtains the optimal combination of speech units, and outputs it to the selection result storage unit 209.
[0081]
Suppose that, as the processing result of the speech unit selection unit 208, speech units for "Today" and "will be hot" are selected, and that no speech unit exists for "humid". In this case, process 301 in the selection result evaluation unit 211 finds that no speech unit is present, so process 401 in the evaluation result determination unit 212 determines that the reference is not satisfied. In the subsequent process 402, since the input control information includes a specification for changing the database, it is determined that reprocessing is necessary, and processing proceeds to the processing method setting unit 204.
[0082]
Based on the control information in the input information 100, the processing method setting unit 204 specifies that the morpheme DB is to be used, that no prosody transformation is performed, and that the costs of the selection rule are changed to emphasize whether morpheme boundaries match and whether the phoneme environment of the speech unit is close to the input phoneme environment. Processing then proceeds to the database selection unit 203.
[0083]
Based on the control information set by the processing method setting unit 204, the database selection unit 203 sets the morpheme DB 202c, which is organized in morpheme units, as the database to be used. The DB search unit 205 searches the morpheme DB 202c for speech units from which "humid" can be synthesized, and stores the candidates in the search result storage unit 206.
[0084]
The selection rule setting unit 207 changes the rule for selecting the optimal combination of speech units from the search result storage unit 206 so that weight is placed on whether the morpheme boundary of the speech unit matches the morpheme boundary of the input text and on whether the phonemic environment before and after the speech unit is close to the phonemic environment to be synthesized.
[0085]
The speech unit selection unit 208 calculates various costs based on the rules changed by the selection rule setting unit 207, obtains an optimal combination of speech units, and stores the combination in the selection result storage unit 209.
[0086]
The prosody transformation method setting section 210 is a part for designating the prosody transformation method. Here, since the prosody transformation is not performed, there is no change from the first case.
[0087]
The selection result evaluation unit 211 evaluates again, including the processing result from the processing method setting unit 204. Suppose that, as a result, only the newly selected "humid" speech unit has an accent that differs from the accent required within "hot and humid", so that the accent type evaluation value for "hot and humid" is poor. In this case, process 401 of the evaluation result determination unit 212 determines that the evaluation result of the speech unit does not satisfy the reference and that the accent phrase "will be hot and humid" needs to be corrected, and processing again enters the loop from the processing method setting unit 204.
[0088]
In accordance with the control information in the original input information, the processing method setting unit 204 this time specifies that the Tri-phone DB 202b is to be searched, that selection emphasizes the degree of matching of the phoneme environment and the F₀ values, and that prosody transformation is applied at synthesis. Processing then proceeds to the database selection unit 203. Thereafter, to synthesize the accent phrase "will be hot and humid", candidates are searched for in the DB in the same manner as above, the optimal combination of speech units is selected, and it is evaluated.
[0089]
Next, in the evaluation result determination unit 212, if the evaluation of the selected speech units is good, processing proceeds directly to the synthesis unit 213. If the evaluation is still poor, all the changes to the database, the selection rule, and the prosody transformation method specified in the input control information have already been applied and nothing further remains to be changed, so the loop from the processing method setting unit 204 ends here and processing proceeds to the synthesis unit 213.
[0090]
The synthesis unit 213 synthesizes the speech "It will be hot and humid today" according to the final result. The "Today" portion uses the speech unit selected first, for which no prosody transformation is specified, so the selected speech unit is output as it is. The following "will be hot and humid" portion is the result of the last selection, for which prosody transformation is specified, so it is connected to the "Today" speech unit after prosody transformation, and the final result is output as synthesized speech.
[0091]
With conventional speech synthesis technology, an ideal speech unit does not always exist. When no speech unit matching the accent type can be selected, either the difference in accent is tolerated, or a speech unit with a different phonemic environment but a close F₀ value is chosen, sacrificing clarity for the sake of the accent. In the present embodiment, by contrast, processing can be adapted to the case, for example by selecting again a speech unit whose phoneme environment matches even though its F₀ differs and then synthesizing so that F₀ matches. In exchange for accepting that prosodic transformation reduces the natural-voice quality of the speech, synthesized speech that maintains clarity and has the correct accent can be created. In other words, by treating natural-voice quality as a factor alongside accent and clarity during synthesis, the range of choices is expanded, and better synthesized speech is more likely to be obtained even in cases where the conventional method cannot produce good synthesized speech.
[0092]
In addition, as in the weather forecast example, selection is first made from a small database dedicated to weather forecasts, and a larger general-purpose database is used only when selection is not possible. With this step-by-step operation, fixed sentences that can be used as they are are selected quickly, and only the place names and katakana words that are not in the dedicated DB are selected from the large-scale DB, so that high-quality synthesized speech is obtained quickly.
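The step-by-step use of databases can be sketched as a per-portion fallback. This is a minimal illustration: representing each database as a dictionary from a text portion to a unit identifier, and the database names used in the usage example, are assumptions made for the sketch.

```python
def select_with_fallback(portions, databases):
    """Look up each portion in the databases in priority order.

    portions:  text portions to synthesize, in order
    databases: ordered list of (name, mapping portion -> unit id); the small
               task-specific database comes first, the large one last
    Returns a dict portion -> (database name, unit id), or None if not found.
    """
    chosen = {}
    for portion in portions:
        for name, db in databases:
            if portion in db:
                chosen[portion] = (name, db[portion])
                break
        else:
            # No database holds this portion.
            chosen[portion] = None
    return chosen


weather_db = {"Today": "w001", "will be hot": "w002"}
large_db = {"Today": "g101", "humid": "g102"}
result = select_with_fallback(
    ["Today", "humid", "will be hot"],
    [("weather", weather_db), ("general", large_db)],
)
```

In this usage example "Today" and "will be hot" are served by the small weather database, and only "humid" falls through to the larger general database.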
[0093]
It should be noted that the above-described embodiment and each example are specific examples of the present invention, and it is needless to say that the present invention is not limited to the configuration of the specific example.
[0094]
In addition, by creating a computer-readable information storage medium on which the above-mentioned speech synthesis program is recorded, the above-mentioned speech synthesis computer program can be easily distributed.
[0095]
[Effect of the Invention]
As described above, according to the present invention, a speech unit to be used for speech synthesis is selected from a database based on input information, and the quality of the selected speech unit is evaluated. When the evaluation result does not satisfy a reference value, the speech unit selection process is performed a plurality of times based on control information included in the input information, the control information specifying at least one of a phoneme sequence, a target fundamental frequency value for the speech synthesis, a duration, a database to be used, and a signal processing method. A speech unit whose evaluation result reaches the reference value is thereby selected, and synthesized speech is generated using that speech unit. Consequently, an optimum speech unit can be selected according to varying conditions, such as how the input character data changes and what kinds of speech units the database holds for that input.
[0096]
Further, according to the present invention, depending on the result of evaluating the selected speech unit, one of the following is performed N times: a process of changing the database from which speech units are selected and selecting a speech unit again, a process of changing the rule for selecting speech units from the database and selecting again, a process of changing the method of synthesizing the selected speech unit, or a combination of these three processes. This provides the excellent effect that speech synthesis appropriate to the database used and to the input information can be performed.
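The select, evaluate, and retry cycle summarized above can be sketched as a small loop. This is a sketch under stated assumptions, not the claimed implementation: the settings keys (`database`, `rule`, `prosody`), the ordered `remedies` list standing in for the control information, the integer scores, and the toy `select` and `evaluate` functions are all hypothetical. If the reference value is never reached, this sketch simply returns the last attempt, which is a design choice of the example.

```python
from typing import Callable, Dict, List, Tuple

Settings = Dict[str, str]


def synthesize_with_retries(
    select: Callable[[Settings], List[str]],
    evaluate: Callable[[List[str], Settings], int],
    initial: Settings,
    remedies: List[Settings],
    reference: int,
    n_max: int,
) -> Tuple[List[str], int, List[Tuple[Settings, int]]]:
    """Select and evaluate; on failure apply the next remedy, at most n_max times."""
    settings = dict(initial)
    history: List[Tuple[Settings, int]] = []
    for attempt in range(n_max + 1):
        units = select(settings)
        score = evaluate(units, settings)
        history.append((dict(settings), score))
        if score >= reference:
            break  # evaluation result satisfies the reference value
        if attempt >= len(remedies) or attempt == n_max:
            break  # no remedy left, or retry budget exhausted
        # Change the database, the selection rule, or the prosody method.
        settings.update(remedies[attempt])
    return units, score, history


def toy_select(settings: Settings) -> List[str]:
    # Stand-in for unit selection: returns a label describing the choice.
    return [f"{settings['database']}:{settings['rule']}"]


def toy_evaluate(units: List[str], settings: Settings) -> int:
    # Stand-in for quality evaluation, using integer scores for clarity.
    score = 40
    if settings["rule"] == "relaxed":
        score += 20
    if settings["database"] == "large":
        score += 20
    if settings["prosody"] == "f0_modification":
        score += 20
    return score


initial = {"database": "small", "rule": "strict", "prosody": "none"}
remedies = [{"rule": "relaxed"}, {"database": "large"},
            {"prosody": "f0_modification"}]
units, score, history = synthesize_with_retries(
    toy_select, toy_evaluate, initial, remedies, reference=80, n_max=3)
print(units, score, len(history))
```

In this toy run the reference value is reached after the selection rule and the database have been changed, so the prosody modification remedy is never applied.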
[Brief description of the drawings]
FIG. 1 is a functional configuration diagram showing a speech synthesizer according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating quality evaluation processing of a selection result evaluation unit according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating processing of an evaluation result determination unit according to an embodiment of the present invention.
FIG. 4 is a diagram explaining the system flow for speech synthesis of a weather forecast according to an embodiment of the present invention.
FIG. 5 is a frequency characteristic diagram for explaining a problem in the conventional example.
[Explanation of symbols]
100: input information, 200: speech synthesizer, 201: input unit, 202: database unit, 202-1 to 202-n: databases, 202a: weather forecast DB, 202b: Tri-phone DB, 202c: morpheme DB, 203: database selection unit, 204: processing method setting unit, 205: database search unit (DB search unit), 206: search result storage unit, 207: selection rule setting unit, 208: speech unit selection unit, 209: selection result storage unit, 210: prosody modification method setting unit, 211: selection result evaluation unit, 212: evaluation result determination unit, 213: synthesis unit, 214: synthesized speech storage unit, 215: synthesized speech output unit

Claims

A speech synthesis method using a speech synthesizer that includes a plurality of databases in which speech units are stored and a computer device that converts input character data into speech, the method selecting speech units corresponding to the input character data from the databases and synthesizing speech corresponding to the input character data from the selected speech units, wherein
the speech synthesizer:
obtains input information including, in addition to the ordinary inputs for speech synthesis such as the input character string to be synthesized, the fundamental frequency, and the duration, control information for speech synthesis that includes at least one of information designating the database to be used, the rule for selecting speech units, and the signal processing method;
selects, based on the input information, speech units to be used for synthesis from the databases; and
evaluates the quality of the synthesized speech using the selected speech units and, when the evaluation result is less than a reference value, generates synthesized speech by repeating, a plurality of times until the reference value is satisfied, a series of steps of applying processing to the speech units based on the control information included in the input information and evaluating the result of that processing again.

The speech synthesis method according to claim 1, wherein, when the result of evaluating the selected speech units is less than the reference value, a process of changing the prosody modification method is performed on at least one of a speech unit and a part of a speech unit, and synthesized speech is then generated.

The speech synthesis method according to claim 1, wherein, when the result of evaluating the selected speech units is less than the reference value, the rule for selecting speech units is changed for at least one of a speech unit and a part of a speech unit, and speech units are selected again from the database to generate synthesized speech.

The speech synthesis method according to claim 1, wherein, when the result of evaluating the selected speech units does not satisfy the reference value, the type of database searched for candidate speech units is changed for at least one of a speech unit and a part of a speech unit, and speech units are selected again to generate synthesized speech.

The speech synthesis method according to claim 1, wherein, when the result of evaluating the selected speech units is less than the reference value, the rule for selecting speech units is changed for at least one of a speech unit and a part of a speech unit, and speech units are selected again from the database, and
when the reference value is not satisfied, a process of changing the prosody modification method is performed on at least one of a speech unit and a part of a speech unit to generate synthesized speech.

The speech synthesis method according to claim 1, wherein, when the result of evaluating the selected speech units does not satisfy the reference value, the database searched for candidate speech units is changed for at least one of a speech unit and a part of a speech unit, and speech units are selected again, and
when the reference value is not satisfied, a process of changing the prosody modification method is performed on at least one of a speech unit and a part of a speech unit to generate synthesized speech.

The speech synthesis method according to claim 1, wherein, when the result of evaluating the selected speech units is less than the reference value, the rule for selecting speech units is changed for at least one of a speech unit and a part of a speech unit, and speech units are selected again from the database, and
when the reference value is not satisfied, the database searched for candidate speech units is changed for at least one of a speech unit and a part of a speech unit, and speech units are selected again to generate synthesized speech.

The speech synthesis method according to claim 1, wherein, when the result of evaluating the selected speech units is less than the reference value, the rule for selecting speech units is changed for at least one of a speech unit and a part of a speech unit, and speech units are selected again from the database,
when the reference value is not satisfied, the database searched for candidate speech units is changed for at least one of a speech unit and a part of a speech unit, and speech units are selected again, and
further, when the reference value is not satisfied, a process of changing the prosody modification method is performed on at least one of a speech unit and a part of a speech unit to generate synthesized speech.

A speech synthesizer that selects speech units corresponding to input character data from a database and synthesizes speech corresponding to the input character data from the selected speech units, comprising:
a plurality of databases storing speech units;
means for acquiring input information including, in addition to the ordinary inputs for speech synthesis such as the input character string to be synthesized, the fundamental frequency, and the duration, control information for speech synthesis that includes at least one of information designating the database to be used, the rule for selecting speech units, and the signal processing method;
means for selecting, based on the input information, speech units to be used for synthesis from the databases; and
means for evaluating the quality of the synthesized speech using the selected speech units and, when the evaluation result is less than a reference value, generating synthesized speech by repeating, a plurality of times until the reference value is satisfied, a series of steps of applying processing to the speech units based on the control information included in the input information and evaluating the result of that processing again.

The speech synthesizer according to claim 9, further comprising means for performing, when the evaluation of the selected speech units finds a speech unit that does not satisfy the reference value, a process of changing the prosody modification method on at least one of that speech unit and a part of that speech unit.

The speech synthesizer according to claim 9, further comprising means for changing, when the evaluation of the selected speech units finds a speech unit that does not satisfy the reference value, the rule for selecting speech units for that speech unit and selecting a speech unit again from the database.

The speech synthesizer according to any one of claims 9 to 11, further comprising means for changing, when the evaluation of the selected speech units finds a speech unit that does not satisfy the reference value, the database searched for candidate speech units for that speech unit and selecting a speech unit again.

A speech synthesis computer program for causing a computer device that has a plurality of databases in which speech units are stored and that converts input character data into speech to select speech units corresponding to the input character data from the databases and to synthesize speech corresponding to the input character data from the selected speech units, the program comprising the steps of:
obtaining input information including, in addition to the ordinary inputs for speech synthesis such as the input character string to be synthesized, the fundamental frequency, and the duration, control information for speech synthesis that includes at least one of information designating the database to be used, the rule for selecting speech units, and the signal processing method;
selecting, based on the input information, speech units to be used for synthesis from the databases; and
evaluating the quality of the synthesized speech using the selected speech units and, when the evaluation result is less than a reference value, generating synthesized speech by repeating, a plurality of times until the reference value is satisfied, a series of steps of applying processing to the speech units based on the control information included in the input information and evaluating the result of that processing again.

The speech synthesis computer program according to claim 13, further comprising a step of changing the prosody modification method for at least one of a speech unit, or a part of a speech unit, that does not satisfy the reference value as a result of evaluating the selected speech units.

The speech synthesis computer program according to claim 13 or 14, further comprising a step of, when the result of evaluating the selected speech units is less than the reference value, changing the rule for selecting speech units for at least one of a speech unit and a part of a speech unit and selecting speech units again from the database.

The speech synthesis computer program according to any one of claims 13 to 15, further comprising a step of, when the result of evaluating the selected speech units does not satisfy the reference value, changing the database searched for candidate speech units for at least one of a speech unit and a part of a speech unit and selecting speech units again.

A computer-readable information storage medium storing the speech synthesis computer program according to any one of claims 13 to 16.