JP2004341226A

JP2004341226A - Waveform dictionary creation support system and program

Info

Publication number: JP2004341226A
Application number: JP2003137624A
Authority: JP
Inventors: Chikako Matsumoto; 智佳子松本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2003-05-15
Filing date: 2003-05-15
Publication date: 2004-12-02
Anticipated expiration: 2023-05-15
Also published as: JP4286583B2

Abstract

PROBLEM TO BE SOLVED: To provide a waveform dictionary creation support system that can efficiently search for voice data to complement voice data having been recorded to create a waveform dictionary.

SOLUTION: The waveform dictionary generation support system is equipped with a voice analysis part 11 which inputs and analyzes voice data to find phoneme information, a voice information storage part 14 which stores voice information including at least phoneme information, a necessary phoneme information storage part 15 which stores conditions of phoneme information essential to the waveform dictionary, a deficient phoneme sequence retrieval part 12 which retrieves and outputs a phoneme sequence deficient in the phoneme information stored in the voice information storage part 14 to meet the conditions, a corpus storage part 16 which stores corpora including at least all essential phoneme information, and a supplementary corpus generation part 13 which retrieves and outputs an additional corpus including deficient phoneme sequences from the corpus storage part 16.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、音声合成用波形辞書の作成を支援するシステムに関し、特に、所望の波形辞書を作成するために補充することが必要なコーパス（追加コーパス）を効率的に探索するためのシステムに関する。
【０００２】
【従来の技術】
従来、定型文章や単語を読み上げた音声データを収録して蓄積し、それらの音声データから、単語単位等で必要な音声データを抽出しつなぎ合わせることによって音声合成を行うシステムが知られている。このような音声データ（通常一人の発声者が発声した全ての音素を含む大量の音声波形、又はその特徴量が蓄積されているデータ）を蓄積したファイルは、波形辞書などと称されている。
【０００３】
従来の音声合成システムにおいて、文章を合成する際には、その文章を合成するために必要な音声データを、音素や音節等の基準単位で波形辞書から検索し抽出する。そして、抽出された音声データを変換したり、複数の音声データを接続したりすることによって、入力されたテキストに対応した最適な合成音声を作成し、出力する。
【０００４】
すなわち、入力された文章（テキスト）に含まれる音の全てが波形辞書に収録されていれば、そのテキストに対応する高品質な合成音声を作成することが可能である。しかし、その逆に、波形辞書に収録されていない音を合成することは不可能である。また、複数の音声データを接続することによって合成音声を作成した場合に、音質が劣化することもある。
【０００５】
従って、任意のテキストを音声合成できるようにするためには、理論的には、波形辞書に多種多様な音声データを格納しておくことが必要となる。しかし、波形辞書のデータ容量が大きくなり過ぎると実装コストや検索効率の点で好ましくない。そこで、大規模コーパス辞書を参照することにより、利用者の用途に応じた波形辞書を適度な大きさで効率よく作成する音声合成用辞書作成装置が、本発明者により既に提案されている（特許文献１参照）。
【０００６】
【特許文献１】
特開２００１−２９６８７８号公報
【０００７】
【発明が解決しようとする課題】
ところで、波形辞書に登録するための大量の音声データについては、一般的に、音声合成エンジンの開発者が、所望する声種・声質のナレータやタレント等と契約し、当該ナレータ等を長時間に渡って拘束して音声収録を行う。このため、時間と費用が嵩むという問題点がある。
【０００８】
従って、例えば新たな音声合成システムを構築する場合や、既存の音声合成システムをバージョンアップする場合などに、波形辞書作成用の音声収録を最初からやり直すのは非効率的である。波形辞書自体を既存の辞書よりもレベルアップさせたい場合には、既存の波形辞書（収録済みの音声データ）に、追加収録によって得られた音声データを追加すれば、音声収録に要するコストおよび時間を削減できる。しかし、追加収録すべき音声データを洗い出す作業は容易ではないという問題があった。
【０００９】
そこで、本発明は、収録済みの音声データがある場合に、所望の波形辞書を作成するために補充すべき音声データ（補充コーパス）を効率的に探索することが可能な波形辞書作成支援システムを提供することを目的とする。
【００１０】
【課題を解決するための手段】
上記の目的を達成するために、本発明にかかる波形辞書作成支援システムは、音声波形データと当該音声の発話内容を表すテキストデータとを含む音声データを入力して分析し、当該音声データの音素情報を求める音声分析部と、前記音声分析部により求められた音素情報を少なくとも含む音声情報を保存する音声情報保存部と、波形辞書に必須な音素情報の条件を保存する必須音素情報保存部と、前記必須音素情報保存部に保存されている条件を満たすために前記音声情報保存部に保存されている音素情報に不足している音素列を検索し、検索結果として得られた音素列を不足音素列として出力する不足音素列検索部と、少なくとも前記必須音素情報の全てを含むコーパスが蓄積されたコーパス蓄積部と、前記コーパス蓄積部から、前記不足音素列検索部より出力された不足音素列を含むコーパスを検索し、検索結果として得られたコーパスを追加コーパスとして出力する補充コーパス作成部とを備えたことを特徴とする。
【００１１】
この構成により、所望の波形辞書に応じた条件を必須音素情報保存部に保存しておけば、その条件を満たすために不足している音素列（すなわち、所望の波形辞書を作成するために音声情報保存部に追加しなければならない音素列）が不足音素列として検索され、その音素列を含むコーパスが追加コーパスとして出力される。これにより、収録済みの音声データがある場合に、所望の波形辞書を作成するために補充すべき音声データ（補充コーパス）を効率的に探索することが可能となる。
【００１２】
なお、出力される追加コーパスは、テキストコーパスおよび音声コーパスのいずれであっても良い。追加コーパスとしてテキストコーパスを出力させたい場合はテキストコーパスが蓄積されたコーパス蓄積部を用い、音声コーパスを出力させたい場合は音声コーパスが蓄積されたコーパス蓄積部を用いれば良い。なお、追加コーパスをテキストコーパスとして出力させた場合は、このテキストコーパスに従って音声の追加収録を行い、追加収録済みの音声情報保存部に基づいて波形辞書を作成すれば良い。また、追加コーパスを音声コーパスとして出力させた場合は、ナレータ等による追加収録を必要とせず、その音声コーパスそのものを音声情報保存部に追加し、波形辞書を作成すれば良い。
【００１３】
また、本発明にかかる波形辞書作成支援システムは、入力された音声波形データから発話内容を認識し、認識した発話内容をテキストデータとして前記音声波形データと共に前記音声分析部へ出力する音声認識部をさらに備えたことが好ましい。発話内容をテキスト入力する手間が省けるからである。
【００１４】
また、本発明にかかる波形辞書作成支援システムにおいて、不足音素列検索部が、更に、合成する音質の品質等級を入力し、入力された品質等級に応じて前記必須音素情報保存部に保存された条件の中から満たすべき条件を決定し、決定した条件に応じて不足音素列を検索することが好ましい。所望の合成音声の品質を実現するために適したサイズ及び内容の波形辞書を作成することが可能となるからである。
【００１５】
また、本発明にかかる波形辞書作成支援システムは、コーパス蓄積部が、分野または用途別に蓄積されたコーパス保存部を有し、前記補充コーパス作成部は、使用する分野または用途情報を入力し、入力された分野または用途情報に応じたコーパス保存部から追加コーパスを検索することが好ましい。分野または用途毎に蓄積されたコーパスから追加コーパスを検索することにより、音声合成システムの使用環境（分野または用途）に応じた波形辞書を作成することが可能となるからである。
【００１６】
また、本発明にかかるコンピュータプログラムは、入力された音声波形データと当該音声の発話内容を表すテキストデータとを含む音声データを分析し、当該音声データの音素情報を求めるステップと、前記音声分析部により求められた音素情報を少なくとも含む音声情報を音声情報保存ファイルへ保存するステップと、波形辞書に必須な音素情報の条件を入力し、当該条件を満たすために前記音声情報保存ファイルに保存されている音素情報に不足している音素列を検索し、検索結果として得られた音素列を不足音素列として出力するステップと、少なくとも前記必須音素情報の全てを含むコーパスが蓄積されたコーパス蓄積ファイルから、前記不足音素列を含むコーパスを検索し、検索結果として得られたコーパスを追加コーパスとして出力するステップとを含む処理をコンピュータに実行させることを特徴とする。
【００１７】
このプログラムをコンピュータにロードして実行することにより、所望の波形辞書を作成するために音声情報ファイルに追加しなければならない音素列が不足音素列として検索され、その音素列を含むコーパスが追加コーパスとして出力される。これにより、収録済みの音声データがある場合に、所望の波形辞書を作成するために補充すべき音声データ（補充コーパス）を効率的に探索することが可能となる。
【００１８】
【発明の実施の形態】
（実施形態１）
以下、本発明の一実施形態について説明する。
【００１９】
図１に示すように、本実施形態にかかる波形辞書作成支援システム１は、音声分析部１１、不足音素列検索部１２、補充コーパス作成部１３、音声情報保存部１４、必須音素情報保存部１５、コーパス蓄積部１６を備えている。
【００２０】
音声分析部１１は、収録済みの音声データを入力し分析する。なお、この実施形態で入力される「音声データ」とは、収録済みの音声波形データ（例えばＰＣＭデータ）とテキストデータとを含む。また、コーパス蓄積部１６は、テキストコーパスを蓄積した大規模コーパスデータベースである。
【００２１】
収録された音声データが入力されると、音声分析部１１は、入力された音声データを分析することにより、音素情報を求める。この音素情報には、少なくとも、発声内容を示すラベル情報（音素ラベル）が含まれる。音声分析部１１は、入力された音声波形データに分析結果の音素情報を付与し、音素単位で検索可能な形態の音声データとして、音声情報保存部１４に保存する。
【００２２】
不足音素列検索部１２は、音声情報保存部１４に格納された音声データに、任意の文章を読み上げるために必要な音素列が全て揃っているか否かを調べ、不足している音素列（以下、「不足音素列」と称する）を求める。具体的には、必須音素情報保存部１５を参照することにより、音声情報保存部１４に保存されている収録済み音声データに不足している音素列を検索する。必須音素情報保存部１５には、少なくとも、任意の文章を合成するために必須である、日本語の全音節データの音素列が保存されている。不足音素列検索部１２は、必須音素情報保存部１５に保存されている音素列であって、かつ、音声情報保存部１４に保存されている音声情報にはない音素列を検索し、検索された音素列を不足音素列として出力する。
【００２３】
補充コーパス作成部１３は、不足音素列検索部１２で求められた不足音素列を含むコーパスを、コーパス蓄積部１６から検索し、追加コーパス（単語や文章）として出力する。追加コーパスの出力は、印刷出力、ディスプレイへの表示、ファイルへの出力など、任意の形式で行えば良い。
【００２４】
このように、本実施形態にかかる波形辞書作成支援システム１は、収録済みの音声データから不足音素列を検索し、その不足音素列を含むコーパス（テキストコーパス）を追加コーパスとして出力する。従って、出力された追加コーパスのテキストをナレータなどに読み上げさせて音声データを追加収録すれば、任意の文章を読み上げるために必要な全ての音素情報を、音声情報保存部１４に収録することができる。
【００２５】
本実施形態の波形辞書作成支援システム１の適用例を、図２に示す。図２に示すように、ユーザ（ナレータを含む、以下同じ）が音声を収録するための収録システム２から、収録済みの音声データ（音声波形データおよびテキストデータ）を波形辞書作成支援システム１へ入力する。そして、波形辞書作成支援システム１から追加コーパスが出力されると、ユーザは、出力された追加コーパスに従って収録システム２へ追加収録を行う。波形辞書作成支援システム１は、さらなる追加コーパスがないかを調べ、追加コーパスがないと判断されたら、波形辞書作成システム３へ、音声情報保存部１４に蓄積された音声データや音声情報を出力する。波形辞書作成システム３は、これらの音声データや音声情報に基づき、波形辞書４を作成する。作成された波形辞書４は、音声合成システム５が任意のテキストから合成音声を作成する際に利用される。
【００２６】
なお、図２に示した矢印は、各システムで生成されるデータが他のシステムでどのように利用されるかを表すものに過ぎず、システム間の定常的な接続状態を示すものではない。例えば、音声合成システム５の動作時に、波形辞書４は参照可能でなければならないが、収録システム２、波形辞書作成支援システム１、および波形辞書作成システム３については、波形辞書４および音声合成システム５に接続されている必要はない。
【００２７】
ここで、本実施形態の波形辞書作成支援システム１の他の適用例を、図３に示す。図３の例では、ネットワーク６を介して、ユーザが収録データを送付する点において、図２に示した例と異なっている。このため、図３の例では、ネットワーク６と各システムとの間に送受信部７がさらに設けられた構成である。
【００２８】
ユーザは、電話やＶｏＩＰ（ＶｏｉｃｅｏｖｅｒＩｎｔｅｒｎｅｔＰｒｏｔｏｃｏｌ）等を利用して音声データを送付する。送受信部７は、ネットワーク６から音声データを受信して、収録システム２に入力して収録する。なお、収録は必ずしもしなくても良い。収録システム２から、収録済みの音声データ（音声波形データおよび発話内容を示すテキストデータ）が波形辞書作成支援システム１へ入力されると、波形辞書作成支援システム１は追加コーパスを求めて送受信部７へ出力する。追加コーパスは送受信部７からネットワーク６を介してユーザへ送られる。ユーザは、追加コーパスを受信し、受信した追加コーパスに従って、電話やＶｏＩＰ等を利用して収録システム２へ追加収録を行う。この処理を、波形辞書作成支援システム１がさらなる追加コーパスはないと判断するまで繰り返す。
【００２９】
追加コーパスがなくなったら、波形辞書作成支援システム１の音声情報保存部１４に蓄積された音声情報を、波形辞書作成システム３へ出力する。波形辞書作成システム３は、これらの音声情報に基づき、波形辞書４を作成する。これにより、ユーザがネットワーク６および送受信部７を介して任意のテキストを入力すると、音声合成システム５は、波形辞書４を利用して音声合成を行い、合成音を送受信部７およびネットワーク６を介してユーザに送付する。
【００３０】
前記「音声データ」は、音声波形データおよび発話内容を表すテキストデータを含む。「音声情報」は、音声分析部１１で分析した結果であり、少なくとも音素ラベルを含む。後述の実施形態で説明するように音声分析部１１においてピッチマーク付与や周波数分析を行う場合は、音声情報にはピッチマークやフォルマントも含まれる。なお、上述の例では、音声情報保存部１４に音声データと音声情報の両方を蓄積するものとしたが、音声データは必ずしも蓄積しなくても良い。
【００３１】
なお、図２と同様に、図３に示す矢印も、各システムで生成されるデータが他のシステムでどのように利用されるかを表すものに過ぎず、必ずしも各システム間の定常的な接続状態を示すものではない。
【００３２】
本実施形態の波形辞書作成支援システム１のさらに他の適用例を、図４に示す。図４に示す例では、ネットワーク６を介した収録システム２への音声収録、波形辞書作成支援システム１による追加コーパスの生成およびユーザへの送信、波形辞書作成システム３による波形辞書４の作成までは、図３に示した例と同様である。ただし、図４に示す例の場合、音声合成システム５はユーザの手元にあり、作成された波形辞書４は、ネットワーク６を介してユーザに送付される。ユーザが音声合成システム５に任意のテキストを入力すると、音声合成システム５は、送付された波形辞書４を利用して、合成音を作成し出力する。
【００３３】
（実施形態２）
以下、本発明にかかる波形辞書作成支援システムの他の実施形態について、具体的な例をあげて説明する。
【００３４】
図５に示すように、本実施形態の波形辞書作成支援システム２１では、音声分析部１１には自動ラベリング部１１ａ、音声情報保存部１４には音素ラベル保存部１４ａ、不足音素列検索部１２には音素列検索部１２ａ、必須音素情報保存部１５には必須音素列保存部１５ａが、それぞれ設けられている。
【００３５】
まず、音声分析部１１に、音声データとしてＰＣＭデータとテキストデータとが入力されると、自動ラベリング部１１ａが音素ラベリングを行う。音素ラベリングの結果は、音声情報保存部１４の音素ラベル保存部１４ａに保存される。
【００３６】
例えば発声内容が「朝早く、バンガローに電報が届いた。」、「青山には、新しいお店がたくさんある。」のような場合には、音素ラベルの例は、

である。
【００３７】
次に、不足音素列検索部１２では必須音素情報保存部１５の必須音素列保存部１５ａの情報を元にして、音声情報保存部１４に保存されている音素情報（音素ラベル）に不足している音素列（不足音素列）を求める。必須音素情報保存部１５には、任意の日本語文章を読み上げるために必要な、全音節データの音素列が保存されている。本実施形態の場合、必須音素列保存部１５ａに、例えば、
母音・・・・・・・・・（１）
母音＋母音・・・・・・・・・（２）
子音＋母音・・・・・・・・・（３）
の全パターンが保持されている。
【００３８】
さらに、上記の（１）〜（３）の３パターンに追加して、
母音＋子音＋母音・・・・・・・・・（４）
の３音素の全てを保持することも好ましい。
【００３９】
あるいは、別の例として、（１）〜（３）の３パターンに追加して、接続すると音が悪くなりやすい、
母音＋半母音＋母音・・・・・・・（５）
母音＋鼻音＋母音・・・・・・・（６）
母音＋弾音＋母音・・・・・・・（７）
等のパターンを保持することも好ましい。
【００４０】
例えば、（１）〜（４）を保持する場合は、

という音素列が、必須音素列保存部１５ａに保存される。
【００４１】
また、接続する際に異音や雑音混入の原因になりやすい音素列については、４、５音素連鎖の形式で必須音素列保存部１５ａに保持しておけば、さらに高品質な音声合成が可能となる点で望ましい。
【００４２】
不足音素列検索部１２の音素列検索部１２ａは、音声情報保存部１４の音素ラベル保存部１４ａと必須音素情報保存部１５の必須音素列保存部１５ａとを対比することにより、音素ラベル保存部１４ａに必須音素列の全てが存在しているかどうかを検索し、不足音素列を求める。
【００４３】
補充コーパス作成部１３は、実施形態１で説明したように、コーパス蓄積部１６に蓄積されているコーパスの中から、不足音素列を含むコーパスを検索し、検索結果を「追加コーパス」として出力する。
【００４４】
以上のように、本実施形態の波形辞書作成支援システム２１によれば、所望の波形辞書を作成するための必須の条件を必須音素情報保存部１５に登録しておけば、その条件を満たすために追加することが必要な音素列（不足音素列）を含むコーパスが「追加コーパス」として出力される。これにより、追加コーパスの探索を効率的に行うことができる。
【００４５】
なお、本実施形態の波形辞書作成支援システム２１の適用例は、実施形態１において図２〜図４を用いて説明したものと同じであるため、その説明は省略する。
【００４６】
（実施形態３）
以下、本発明にかかる波形辞書作成支援システムの他の実施形態について、具体的な例をあげて説明する。
【００４７】
図６に示すように、本実施形態にかかる波形辞書作成支援システム３１では、音声分析部１１には自動ラベリング部１１ａとピッチマーク付与部１１ｂ、音声情報保存部１４には音素ラベル保存部１４ａとピッチマーク保存部１４ｂ、不足音素列検索部１２には音素列検索部１２ａ、音素長情報照合部１２ｂ、およびピッチ情報照合部１２ｃ、必須音素情報保存部１５には必須音素列保存部１５ａ、音素長保存部１５ｂ、およびピッチ情報保存部１５ｃが、それぞれ設けられている。
【００４８】
音声分析部１１に、音声データとしてＰＣＭデータおよびテキストデータが入力されると、自動ラベリング部１１ａが音素ラベリング（境界位置決定を含む）を行い、ピッチマーク付与部１１ｂがピッチマーク付与を行う。なお、本実施形態では音声データとしてＰＣＭデータを用いる例を示したが、音声データであればその形式は任意である。音素ラベルとピッチマークは、それぞれ、音声情報保存部１４の音素ラベル保存部１４ａとピッチマーク保存部１４ｂに保存される。
【００４９】
ここで保存される音素ラベルの例を以下に示す。例えば図７に示すような音声波形の場合には、音素名とその音素境界は、時系列上のサンプリング位置を用いて、下記（表１）のように表すことができる。
【００５０】
【表１】

【００５１】
また、この場合に保存されるピッチマークは、図８に示すように、各ピッチの位置になり、例えば下記のように表される。
【００５２】

不足音素列検索部１２では、音素列検索部１２ａが、実施形態２と同様に、音声情報保存部１４の音素ラベル保存部１４ａと必須音素情報保存部１５の必須音素列保存部１５ａとを対比することにより、音素ラベル保存部１４ａに必須音素列の全てが存在しているかどうかを検索する。これにより、必須音素列保存部１５ａに存在する音素列であって、かつ、音素ラベル保存部１４ａに存在しない音素列を、不足音素列と判断する。
【００５３】
また、音素長情報照合部１２ｂは、音声情報保存部１４の音素ラベル保存部１４ａに保存されている音素ラベルから、収録音声データの音素列の音素長を求め、求めた音素長が必須音素情報保存部１５の音素長保存部１５ｂにある合成時の音素長データより極端に短い場合（例えば長さが１／２以下の場合）は、当該音素長条件を満たす音素列が不足していると判断する。なお、音素長保存部１５ｂに保存されている情報例としては、例えば以下のような、各音素の合成時の音素長のリストがあげられる。
【００５４】

また、ピッチ情報照合部１２ｃは、音声情報保存部１４のピッチマーク保存部１４ｂに保存されているピッチマークから収録音声データのピッチを求め、求めたピッチが、必須音素情報保存部１５のピッチ情報保存部１５ｃにある合成時に必要なピッチとの間に大きな隔たりがある場合（例えば５０％以上のピッチ差がある場合）は、当該ピッチ条件を満たす音素列が不足していると判断する。なお、ピッチ情報保存部１５ｃに保存されている情報例としては、各ＰＣＭデータのピッチ情報から求められるベースピッチがあげられる。
【００５５】
不足音素列検索部１２は、音素列検索部１２ａ、音素長情報照合部１２ｂ、ピッチ情報照合部１２ｃのそれぞれで求められた不足音素列の和集合を、補充コーパス作成部１３に出力する。
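The phoneme length check (paragraph 0053), the pitch check (paragraph 0054), and the union of results (paragraph 0055) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the units (milliseconds and hertz), and the choice to measure the pitch difference relative to the required pitch are assumptions. Only the example thresholds (length at most 1/2, pitch difference of 50% or more) come from the text.

```python
from typing import Set, Tuple

PhonemeString = Tuple[str, ...]


def length_too_short(recorded_ms: float, target_ms: float, ratio: float = 0.5) -> bool:
    """True if the recorded phoneme length is at most `ratio` of the target length."""
    return recorded_ms <= target_ms * ratio


def pitch_too_far(recorded_hz: float, target_hz: float, max_rel_diff: float = 0.5) -> bool:
    """True if the pitch differs from the target by `max_rel_diff` or more.

    The difference is taken relative to the target pitch (an assumption).
    """
    return abs(recorded_hz - target_hz) / target_hz >= max_rel_diff


# Hypothetical results from the three checkers 12a, 12b, and 12c.
by_string: Set[PhonemeString] = {("m", "o")}
by_length: Set[PhonemeString] = {("k", "a")} if length_too_short(40.0, 100.0) else set()
by_pitch: Set[PhonemeString] = {("s", "a")} if pitch_too_far(330.0, 200.0) else set()

# The search unit passes the union of all three sets to the corpus creation unit.
missing = by_string | by_length | by_pitch
```

A phoneme string flagged by any one checker ends up in `missing`, so a string that is present in the recordings but too short or too far off in pitch is treated the same as one that was never recorded.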
【００５６】
補充コーパス作成部１３は、実施形態１で説明したように、コーパス蓄積部１６に蓄積されているコーパスの中から、不足音素列を含むコーパスを検索し、検索結果を「追加コーパス」として出力する。
【００５７】
以上のように、本実施形態の波形辞書作成支援システム３１によれば、音素列、音素長、ピッチに関する条件を全て満たすために補充すべき音素列が、不足音素列として検索され、その不足音素列を含むコーパスが、追加コーパスとして出力される。従って、出力された追加コーパスに基づいて追加収録を行えば、音素列、音素長、ピッチに関する条件を全て満たす波形辞書を作成することができ、この波形辞書を用いれば、より高品質な音声合成が可能となる。
【００５８】
なお、本実施形態の波形辞書作成支援システム３１の適用例は、実施形態１において図２〜図４を用いて説明したものと同じであるため、その説明は省略する。
【００５９】
（実施形態４）
以下、本発明にかかる波形辞書作成支援システムの他の実施形態について、具体的な例をあげて説明する。
【００６０】
図９に示すように、本実施形態にかかる波形辞書作成支援システム４１では、音声分析部１１には自動ラベリング部１１ａ、ピッチマーク付与部１１ｂ、および周波数解析部１１ｃが、音声情報保存部１４には音素ラベル保存部１４ａ、ピッチマーク保存部１４ｂ、およびフォルマント保存部１４ｃが、不足音素列検索部１２には音素列検索部１２ａ、音素長情報照合部１２ｂ、ピッチ情報照合部１２ｃ、およびフォルマント情報照合部１２ｄが、必須音素情報保存部１５には必須音素列保存部１５ａ、音素長保存部１５ｂ、ピッチ情報保存部１５ｃ、およびフォルマント情報保存部１５ｄが、それぞれ設けられている。
【００６１】
音声分析部１１に音声データとしてＰＣＭデータとテキストデータが入力されると、音声分析部１１の自動ラベリング部１１ａおよびピッチマーク付与部１１ｂは、実施形態３で説明したように音素ラベルおよびピッチマークの付与を行う。さらに、周波数解析部１１ｃが、ＰＣＭデータの周波数解析を行い、各音素のフォルマント情報を、音声情報保存部１４のフォルマント保存部１４ｃに保存する。
【００６２】
不足音素列検索部１２では、音素列検索部１２ａ、音素長情報照合部１２ｂ、ピッチ情報照合部１２ｃが、実施形態３で説明したように、不足音素列をそれぞれ求める。さらに、フォルマント情報照合部１２ｄが、フォルマント情報保存部１５ｄのデータと、収録データのフォルマントとの照合を行い、大きな隔たりがないかを照合する。
【００６３】
フォルマント情報保存部１５ｄの保存データ例としては、各母音の第一フォルマントおよび第二フォルマントがあげられる。例えば、「ｉ（い）」という音素の第一フォルマントの平均値Ｆｉ_１と、第二フォルマントの平均値Ｆｉ_２とを、フォルマント情報保存部１５ｄに保存しておく。この場合、フォルマントの照合は、例えば以下のように行う。周波数解析部１１ｃが、前述のようにＰＣＭデータの周波数解析を行い、各音素の第一フォルマントｆｉ_１と第二フォルマントｆｉ_２とをフォルマント保存部１４ｃに保存する。そして、フォルマント情報照合部１２ｄが、フォルマント保存部１４ｃに保存されている各音素列ごとに、第一フォルマントおよび第二フォルマントの平均値を求める。例えば、ある音素列中の「ｉ」という音素の第一フォルマントｆｉ_１および第二フォルマントｆｉ_２と、フォルマント情報保存部１５ｄの平均フォルマントＦｉ_１，Ｆｉ_２との差ｄｉｆｆを、例えば下記式により求める。
【００６４】
ｄｉｆｆ＝（Ｆｉ_１− ｆｉ_１）^２＋（Ｆｉ_２− ｆｉ_２）^２
各音素についてｄｉｆｆの閾値を設定しておき、閾値を超えた場合は、当該音素列を不足音素列と判断する。
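The formant comparison in paragraphs 0063 and 0064 amounts to a squared distance in the (first formant, second formant) plane followed by a threshold test. The sketch below follows the formula given above; the function names, the reference formant values, and the threshold are hypothetical and chosen only for illustration.

```python
from typing import Tuple

Formants = Tuple[float, float]  # (first formant, second formant), assumed to be in Hz


def formant_diff(reference: Formants, observed: Formants) -> float:
    """diff = (F_1 - f_1)^2 + (F_2 - f_2)^2, as in the formula of paragraph 0064."""
    return (reference[0] - observed[0]) ** 2 + (reference[1] - observed[1]) ** 2


def is_missing_by_formant(reference: Formants, observed: Formants, threshold: float) -> bool:
    """A phoneme string is treated as missing when diff exceeds the threshold."""
    return formant_diff(reference, observed) > threshold


# Hypothetical stored averages for the phoneme "i" and a hypothetical threshold.
REFERENCE_I: Formants = (300.0, 2300.0)
THRESHOLD_I = 10_000.0

close = is_missing_by_formant(REFERENCE_I, (320.0, 2250.0), THRESHOLD_I)  # diff = 2900
far = is_missing_by_formant(REFERENCE_I, (450.0, 2000.0), THRESHOLD_I)    # diff = 112500
```

Because the text sets a threshold per phoneme, a real system would presumably keep one reference pair and one threshold for each vowel rather than the single pair shown here.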
【００６５】
不足音素列検索部１２は、音素列検索部１２ａ、音素長情報照合部１２ｂ、ピッチ情報照合部１２ｃ、フォルマント情報照合部１２ｄのそれぞれで求められた不足音素列の和集合を、補充コーパス作成部１３に出力する。
【００６６】
補充コーパス作成部１３は、実施形態１で説明したように、コーパス蓄積部１６に蓄積されているコーパスの中から、不足音素列を含むコーパスを検索し、検索結果を「追加コーパス」として出力する。
【００６７】
以上のように、本実施形態の波形辞書作成支援システム４１によれば、音素列、音素長、ピッチ、フォルマントに関する条件を全て満たすために補充すべき音素列が、不足音素列として検索され、その不足音素列を含むコーパスが、追加コーパスとして出力される。従って、出力された追加コーパスに基づいて追加収録を行えば、音素列、音素長、ピッチ、フォルマントに関する条件を全て満たす波形辞書を作成することができ、この波形辞書を用いれば、さらに高品質な音声合成が可能となる。
【００６８】
なお、本実施形態の波形辞書作成支援システム４１の適用例は、実施形態１において図２〜図４を用いて説明したものと同じであるため、その説明は省略する。
【００６９】
（実施形態５）
以下、本発明にかかる波形辞書作成支援システムの他の実施形態について、具体的な例をあげて説明する。
【００７０】
図１０に示すように、本実施形態にかかる波形辞書作成支援システム５１は、実施形態４にかかる波形辞書作成支援システム４１と同様に、音声分析部１１には自動ラベリング部１１ａ、ピッチマーク付与部１１ｂ、および周波数解析部１１ｃが、音声情報保存部１４には音素ラベル保存部１４ａ、ピッチマーク保存部１４ｂ、およびフォルマント保存部１４ｃが、不足音素列検索部１２には音素列検索部１２ａ、音素長情報照合部１２ｂ、ピッチ情報照合部１２ｃ、およびフォルマント情報照合部１２ｄが、必須音素情報保存部１５には必須音素列保存部１５ａ、音素長保存部１５ｂ、ピッチ情報保存部１５ｃ、およびフォルマント情報保存部１５ｄが、それぞれ設けられている。
【００７１】
ただし、本実施形態にかかる波形辞書作成支援システム５１は、不足音素列検索部１２に品質等級を表すデータが入力され、不足音素列検索部１２が入力された品質等級のレベルに応じて不足音素列の検索を行う点において、実施形態４と異なる。
【００７２】
入力される品質の等級は、数値や記号等によって表され、例えば、
１：音声品質最高レベル
２：音声品質中級レベル
３：音声品質最低レベル
のように、合成音声の品質として求められるレベルと対応付けられる。また、合成音声について求められる品質のレベルが高くなるほど必要となる波形辞書のサイズも大きくなるので、前記の品質等級「３」、「２」、「１」の順に、不足音素列検索部１２が不足音素列の検索をより綿密に行うこととなる。
【００７３】
例えば、不足音素列検索部１２に、品質等級「３」が入力された場合は、不足音素列検索部１２の音素列検索部１２ａは、必須音素情報保存部１５の必須音素列保存部１５ａの「母音」、「母音＋母音」、「子音＋母音」を満足するために必要な不足音素列の検索を行う。
【００７４】
また、品質等級「１」が入力された場合には、不足音素列検索部１２の音素列検索部１２ａは、例えば、必須音素情報保存部１５の必須音素列保存部１５ａの１音素（「母音」）〜４、５音素連鎖の必須音素も満足するために必要な不足音素列の検索を行い、更に、音素長、ピッチ情報、フォルマント情報を満足する最高品質を提供するために補充することが必要な不足音素列の検索を行う。
【００７５】
なお、品質等級「２」が入力された場合は、不足音素列検索部１２の音素列検索部１２ａは、例えば、必須音素情報保存部１５の必須音素列保存部１５ａの１音素（「母音」）〜４、５音素連鎖の必須音素も満足するために必要な不足音素列のみの検索を行うなど、上述の品質等級「１」と品質等級「２」との中間的なレベルで不足音素列を検索する。
【００７６】
不足音素列検索部１２は、上述のように検索された不足音素列を、補充コーパス作成部１３に出力する。補充コーパス作成部１３は、実施形態１で説明したように、コーパス蓄積部１６に蓄積されているコーパスの中から、不足音素列を含むコーパスを検索し、検索結果を「追加コーパス」として出力する。
【００７７】
以上のように、本実施形態の波形辞書作成支援システム５１によれば、入力された品質等級に応じた綿密さで不足音素列を検索し、検索結果として得られた不足音素列を含むコーパスが追加収録用のテキストデータ（追加コーパス）として出力される。これにより、求められる品質等級が高くなるほど、不足音素列をきめ細かく検索することとなるので、より自然な合成音声を得るための波形辞書を作成することが可能となる。一方、例えばコストや記憶容量などとの兼ね合いによって品質等級が低くても良い場合は、求められる品質等級を満足するために必要な最小限の不足音素列を検索することにより、波形辞書のサイズを小さく抑え、コストや記憶容量を節約することができる。
【００７８】
なお、本実施形態の波形辞書作成支援システム５１の適用例は、実施形態１において図２〜図４を用いて説明したものと同じであるため、その説明は省略する。
【００７９】
（実施形態６）
以下、本発明にかかる波形辞書作成支援システムの他の実施形態について、具体的な例をあげて説明する。
【００８０】
なお、本実施形態では、補充コーパス作成部１３およびコーパス蓄積部１６の具体的な構成例についてのみ説明するが、音声分析部１１、不足音素列検索部１２、音声情報保存部１４、必須音素情報保存部１５の構成は、本発明の目的を達成できる範囲であれば任意の構成とすることができる。なお、前述の実施形態２〜５のそれぞれで説明した態様の音声分析部１１、不足音素列検索部１２、音声情報保存部１４、必須音素情報保存部１５と、本実施形態で説明する補充コーパス作成部１３およびコーパス蓄積部１６とを組み合わせることができることは、言うまでもない。
【００８１】
図１１に示すように、本実施形態の波形辞書作成支援システム６１では、コーパス蓄積部１６に、テキストコーパスが分野別に収集され、分野別コーパス１６ａ〜１６ｄとして保存されている。なお、分野別コーパスの例としては、図１１に示した「金融」、「官公庁」、「一般」、「自然会話調」等に限定されず、音声合成システム５の使用環境等に応じた任意の分野のコーパスを用いればよい。コーパス１６ａ〜１６ｄは、各々の分野で使用される定型文章を多数含んでいる。補充コーパス作成部１３は、ユーザにより入力された分野名に基づき、コーパス蓄積部１６における該当する分野のコーパスから、不足音素列を含むコーパスを検索する。
【００８２】
コーパス蓄積部１６に保存されている情報の例としては、
・テキスト文章、もしくは音素列、
・発話時に予想されるピッチ情報、
等があげられる。
【００８３】
具体的には、金融コーパスのテキスト文章としては、
・「預金残高をご確認下さい。」
・「通帳の口座番号をご確認下さい。」
・「振り込み先の住所、氏名を入力して下さい。」
等があげられる。
【００８４】
また、単語としては、
・「一円」、「二円」、…「千円」、「二千円」、…「一万円」、…（金額読み上げ）
・「預金」
・「通帳」
等があげられる。
【００８５】
また、自然会話調のテキスト文章としては、
・「おはよう。」
・「元気？」
・「今日はいい天気だね。」
・「明日もし雨が降ったら、どこに行く？」
等のような、自然な会話文があげられる。
【００８６】
このように、不足音素列を含む追加コーパスを、分野別に蓄積されたコーパスから選択することにより、音声合成システム５の使用環境に適したより自然な合成音声を得るための波形辞書４を作成することが可能となる。
【００８７】
なお、図１１に示した構成の変形例として、図１２に示すように、音声分析部１１が、収録済み音声データの分析結果である音素列（テキストデータ）を、音声情報保存部１４のみならず、コーパス蓄積部１６にも保存するようにしても良い。
【００８８】
なお、本実施形態の波形辞書作成支援システム６１の適用例は、実施形態１において図２〜図４を用いて説明したものと同じであるため、その説明は省略する。
【００８９】
（実施形態７）
本発明にかかる波形辞書作成支援システムの他の実施形態について、図面を参照しながら説明する。
【００９０】
本実施形態の波形辞書作成支援システムは、補充コーパス作成部およびコーパス蓄積部が、前述の各実施形態と異なっている。図１３に示すように、本実施形態の波形辞書作成支援システム７１は、実施形態１等で説明した補充コーパス作成部１３およびコーパス蓄積部１６の代わりに、補充コーパス作成部２３および音声コーパス蓄積部２６を備えている。
【００９１】
なお、本実施形態では、補充コーパス作成部２３およびコーパス蓄積部２６についてのみ説明するが、音声分析部１１、不足音素列検索部１２、音声情報保存部１４、必須音素情報保存部１５の構成は、本発明の目的を達成できる範囲であれば任意の構成とすることができる。なお、前述の実施形態２〜５のそれぞれで説明した態様の音声分析部１１、不足音素列検索部１２、音声情報保存部１４、必須音素情報保存部１５と、本実施形態で説明する補充コーパス作成部２３およびコーパス蓄積部２６とを組み合わせることができることは、言うまでもない。
【００９２】
本実施形態の波形辞書作成支援システム７１では、音声コーパス蓄積部２６に、音素ラベル、ピッチ等の波形データの情報を表す情報と共に、音波形データが蓄積されている。補充コーパス作成部２３は、不足音素列検索部１２で求められた不足音素列を含むコーパスを、音声コーパス蓄積部２６から検索し、検索結果を追加音声データとして出力する。
【００９３】
すなわち、実施形態１〜６にかかる波形辞書作成支援システムでは、追加コーパスとしてテキストが出力されるようになっており、そのテキストに従ってユーザが追加収録を行う必要があった。これに対して、本実施形態にかかる波形辞書作成支援システム７１では、追加すべき音声データが、音声コーパス蓄積部２６に蓄積されている音声コーパスから自動的に作成されるので、ユーザは追加収録を行う必要がないという利点がある。
【００９４】
図１４に、本実施形態の波形辞書作成支援システム７１の適用例を示す。図１４に示すように、波形辞書作成支援システム７１から、音声情報保存部１４に保存された音声データと、補充コーパス作成部２３で作成される追加音声データとを、波形辞書作成システム３へ入力する。波形辞書作成システム３は、入力された音声データに基づき、波形辞書４を作成する。音声合成システム５は、この波形辞書４を用いて音声合成を行う。
【００９５】
なお、図１４に示した矢印は、各システムで生成されるデータが他のシステムでどのように利用されるかを表すものに過ぎず、システム間の定常的な接続状態を示すものではない。例えば、音声合成システム５の動作時に、波形辞書４は参照可能でなければならないが、波形辞書作成支援システム７１や波形辞書作成システム３については、波形辞書４および音声合成システム５に接続されている必要はない。
【００９６】
（実施形態８）
本発明にかかる波形辞書作成支援システムの他の実施形態について、図面を参照しながら説明する。なお、本実施形態は、実施形態７で説明した波形辞書作成支援システムのより具体的な例であるので、同様の機能を有する部分には同じ部材番号を付与し、詳細な説明は省略する。
【００９７】
図１５に示すように、本実施形態にかかる波形辞書作成支援システム８１における音声分析部１１、音声情報保存部１４、不足音素列検索部１２、必須音素情報保存部１５の構成および機能は、図９に示す実施形態４と同様であるため、その説明を省略する。
【００９８】
音声コーパス蓄積部２６には、様々な話者による、波形データ、音素ラベル、ピッチマーク、フォルマントの情報が、波形データ保存部２６ａ、音素ラベル保存部２６ｂ、ピッチマーク保存部２６ｃ、フォルマント保存部２６ｄに、それぞれ保存されている。
【００９９】
補充コーパス作成部２３には、不足音素列検索部１２から、例えば実施形態４で説明したように、（１）音素列そのものが音声情報保存部１４に存在しないもの、（２）音声情報保存部１４に保存されている音素列のうち、音素長、ピッチ、フォルマントのいずれかが所定の条件を満たさないもの、の和集合が不足音素列として入力される。そこで、補充コーパス作成部２３は、音声コーパス蓄積部２６から、不足音素列と音素ラベルが同じで、かつ、音素長、ピッチ、フォルマントの一致度の高いものを選択し、追加音声データとして出力する。
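The selection step in paragraph 0099 (same phoneme label, closest match in phoneme length, pitch, and formants) could look roughly like the sketch below. The text does not define how "degree of match" is computed, so the sum of relative differences used here is an assumption, as are the `Unit` record and all field names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Unit:
    """One candidate in the speech corpus (hypothetical record layout)."""
    label: Tuple[str, ...]
    length_ms: float
    pitch_hz: float
    f1_hz: float
    f2_hz: float


def mismatch(target: Unit, cand: Unit) -> float:
    """Sum of relative differences; smaller means a better match (assumed measure)."""
    pairs = [
        (target.length_ms, cand.length_ms),
        (target.pitch_hz, cand.pitch_hz),
        (target.f1_hz, cand.f1_hz),
        (target.f2_hz, cand.f2_hz),
    ]
    return sum(abs(t - c) / t for t, c in pairs)


def select_unit(target: Unit, candidates: List[Unit]) -> Optional[Unit]:
    """Pick the best-matching candidate with the same label, or None if there is none."""
    same_label = [c for c in candidates if c.label == target.label]
    return min(same_label, key=lambda c: mismatch(target, c), default=None)


target = Unit(("k", "a"), 100.0, 200.0, 700.0, 1200.0)
cand_a = Unit(("k", "a"), 90.0, 210.0, 720.0, 1150.0)
cand_b = Unit(("k", "a"), 60.0, 300.0, 900.0, 1500.0)
cand_c = Unit(("s", "a"), 100.0, 200.0, 700.0, 1200.0)
best = select_unit(target, [cand_a, cand_b, cand_c])
```

Here `cand_c` is excluded because its label differs, and `cand_a` wins over `cand_b` because every one of its features is closer to the target.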
【０１００】
波形辞書作成システム３には、音声情報保存部１４に保存されている音声データ（音声波形データと、その音素ラベルやピッチマーク等）と、補充コーパス作成部２３で作成された追加音声データとが入力される。波形辞書作成システム３は、これらに基づき、波形辞書４を作成する。この波形辞書４を用いることで、どのような文章でも読み上げることのできる音声合成システム５が実現される。
【０１０１】
（実施形態９）
本発明にかかる波形辞書作成支援システムの他の実施形態について、図面を参照しながら説明する。
【０１０２】
図１６に示すように、本実施形態にかかる波形辞書作成支援システム９１は、入力されるデータが音声波形データ（例えばＰＣＭデータ）のみであり、音声分析部１１の全段に音声認識を行う音声認識部１７をさらに備えた点において、図１に示す実施形態１にかかる構成と異なっている。
【０１０３】
ＰＣＭデータが入力されると、音声認識部１７は、ＰＣＭデータの音声認識を行い、発話内容を出力する。なお、音声認識部１７による音声認識の手法については、公知の任意の手法を用いることが可能であるため、詳しい説明は省略する。
【０１０４】
音声分析部１１は、ＰＣＭデータと発話内容から、少なくとも発声内容を示すラベル情報を付与し、音素単位で検索可能な形態の音声データとして、音声情報保存部１４に保存する。
【０１０５】
不足音素列検索部１２、必須音素情報保存部１５、補充コーパス作成部１３、コーパス蓄積部１６の構成および機能については、前述の各実施形態で説明した構成および機能を適用することが可能である。
【０１０６】
なお、図１６に示した補充コーパス作成部１３およびコーパス蓄積部１６の代わりに、実施形態７および８で説明した補充コーパス作成部２３および音声コーパス蓄積部２６を備えた構成とすることも可能である。
【０１０７】
また、本実施形態の波形辞書作成支援システム９１の適用例は、収録システム２から波形辞書作成支援システム９１へ入力されるデータがＰＣＭデータである点を除いては、実施形態１において図２〜図４を用いて説明したものと同様であるため、その詳しい説明は省略する。
【０１０８】
以上のように、本実施形態にかかる波形辞書作成支援システム９１では、収録システム２によって収録された音声のＰＣＭデータだけを用いて波形辞書４を作成することが可能である。
【０１０９】
（実施形態１０）
本発明の一実施形態として、本発明にかかる波形辞書作成支援システムをコンピュータで実現するためのプログラムの一例を、図面を参照しながら説明する。
【０１１０】
本実施形態にかかるプログラムは、図１７に示すように、音声データを入力し（ステップＳ１）、入力された音声データを分析して音素情報を求め（ステップＳ２）、音素単位で検索可能な形態の音声データとして音声情報保存ファイルへ保存する（ステップＳ３）。次に、任意の日本語文章を読み上げるために必須とされる音素または音素列の条件を、当該条件があらかじめ保存された必須音素情報保存ファイルから入力する（ステップＳ４）。そして、ステップＳ４で入力された条件を満たすために音声情報保存ファイルに不足している音素列を検索し、検索結果を不足音素列として出力する（ステップＳ５）。続いて、大量のコーパスを蓄積したコーパス蓄積ファイルから、ステップＳ５で求められた不足音素列を含むコーパスを検索し、検索結果を追加コーパスとして出力する（ステップＳ６）。追加コーパスの出力は、印刷出力、ディスプレイへの表示、ファイルへの出力など、任意の形式で行えば良い。
【０１１１】
なお、ステップＳ２、ステップＳ５、ステップＳ６の処理については、上述の各実施形態で説明した音声分析部１１、不足音素列検索部１２、補充コーパス作成部１３（または２３）の処理内容を適用できるが、その詳細な説明は省略する。
【０１１２】
本実施形態にかかるプログラムは、ＣＤ−ＲＯＭ等の任意の可搬型記録媒体を介して、あるいは、無線または有線の通信回線を介して、コンピュータに読み込まれ、実行されることにより、当該コンピュータを上述の各実施形態で説明した波形辞書作成支援システムとして機能させることとなる。
【０１１３】
（付記１）音声波形データと当該音声の発話内容を表すテキストデータとを含む音声データを入力して分析し、当該音声データの音素情報を求める音声分析部と、
前記音声分析部により求められた音素情報を少なくとも含む音声情報を保存する音声情報保存部と、
波形辞書に必須な音素情報の条件を保存する必須音素情報保存部と、
前記必須音素情報保存部に保存されている条件を満たすために前記音声情報保存部に保存されている音素情報に不足している音素列を検索し、検索結果として得られた音素列を不足音素列として出力する不足音素列検索部と、
少なくとも前記必須音素情報の全てを含むコーパスが蓄積されたコーパス蓄積部と、
前記コーパス蓄積部から、前記不足音素列検索部より出力された不足音素列を含むコーパスを検索し、検索結果として得られたコーパスを追加コーパスとして出力する補充コーパス作成部とを備えたことを特徴とする波形辞書作成支援システム。
【０１１４】
（付記２）入力された音声波形データから発話内容を認識し、認識した発話内容をテキストデータとして前記音声波形データと共に前記音声分析部へ出力する音声認識部をさらに備えた、付記１に記載の波形辞書作成支援システム。
【０１１５】
（付記３）前記音声分析部が、入力された音声波形データに対して、音素ラベル付与、ピッチマーク付与、およびフォルマント検出から選ばれる少なくともいずれか一つを行った結果を、前記音素情報として前記音声情報保存部に保存し、
前記必須音素情報保存部に、波形辞書に必須な音素列に加えて、前記音声情報保存部に保存されている音素情報に関する条件が保存され、
前記不足音素列検索部が、前記音声情報保存部に保存されている音素列であっても、前記必須音素情報保存部における前記条件を満たさない場合は、当該音素列を不足音素列として出力する、付記１または２に記載の波形辞書作成支援システム。
【０１１６】
（付記４）前記不足音素列検索部は、更に、合成する音質の品質等級を入力し、入力された品質等級に応じて前記必須音素情報保存部に保存された条件の中から満たすべき条件を決定し、決定した条件に応じて不足音素列を検索する、付記１または２に記載の波形辞書作成支援システム。
【０１１７】
（付記５）前記コーパス蓄積部が、分野または用途別に蓄積されたコーパス保存部を有し、
前記補充コーパス作成部は、使用する分野または用途情報を入力し、入力された分野または用途情報に応じたコーパス保存部から追加コーパスを検索する、付記１〜３のいずれか一項に記載の波形辞書作成支援システム。
【０１１８】
（付記６）入力された音声波形データと当該音声の発話内容を表すテキストデータとを含む音声データを分析し、当該音声データの音素情報を求めるステップと、
前記音声分析部により求められた音素情報を少なくとも含む音声情報を音声情報保存ファイルへ保存するステップと、
波形辞書に必須な音素情報の条件を入力し、当該条件を満たすために前記音声情報保存ファイルに保存されている音素情報に不足している音素列を検索し、検索結果として得られた音素列を不足音素列として出力するステップと、
少なくとも前記必須音素情報の全てを含むコーパスが蓄積されたコーパス蓄積ファイルから、前記不足音素列を含むコーパスを検索し、検索結果として得られたコーパスを追加コーパスとして出力するステップとを含む処理をコンピュータに実行させることを特徴とするコンピュータプログラム。
【０１１９】
【発明の効果】
以上のように、本発明によれば、収録済みの音声データがある場合に、所望の波形辞書を作成するために補充すべき音声データ（補充コーパス）を効率的に探索することが可能な波形辞書作成支援システムを提供することが可能となる。
【図面の簡単な説明】
【図１】本発明の実施形態１にかかる波形辞書作成支援システムの概略構成を示すブロック図
【図２】本発明の実施形態１にかかる波形辞書作成支援システムの適用例を示すブロック図
【図３】本発明の実施形態１にかかる波形辞書作成支援システムの他の適用例を示すブロック図
【図４】本発明の実施形態１にかかる波形辞書作成支援システムのさらに他の適用例を示すブロック図
【図５】本発明の実施形態２にかかる波形辞書作成支援システムの概略構成を示すブロック図
【図６】本発明の実施形態３にかかる波形辞書作成支援システムの概略構成を示すブロック図
【図７】音素ラベリングの例を示す説明図
【図８】ピッチマーク付与の例を示す説明図
【図９】本発明の実施形態４にかかる波形辞書作成支援システムの概略構成を示すブロック図
【図１０】本発明の実施形態５にかかる波形辞書作成支援システムの概略構成を示すブロック図
【図１１】本発明の実施形態６にかかる波形辞書作成支援システムの概略構成を示すブロック図
【図１２】本発明の実施形態６にかかる波形辞書作成支援システムの変形例を示すブロック図
【図１３】本発明の実施形態７にかかる波形辞書作成支援システムの概略構成を示すブロック図
【図１４】本発明の実施形態７にかかる波形辞書作成支援システムの適用例を示すブロック図
【図１５】本発明の実施形態８にかかる波形辞書作成支援システムの概略構成を示すブロック図
【図１６】本発明の実施形態９にかかる波形辞書作成支援システムの概略構成を示すブロック図
【図１７】本発明の実施形態１０にかかるコンピュータプログラムの概略動作を示すフローチャート
【符号の説明】
１波形辞書作成支援システム
２収録システム
３波形辞書作成システム
４波形辞書
５音声合成システム
６ネットワーク
７送受信部
１１音声分析部
１１ａ自動ラベリング部
１１ｂピッチマーク付与部
１１ｃ周波数解析部
１２不足音素列検索部
１２ａ音素列検索部
１２ｂ音素長情報照合部
１２ｃピッチ情報照合部
１２ｄフォルマント情報照合部
１３補充コーパス作成部
１４音声情報保存部
１４ａ音素ラベル保存部
１４ｂピッチマーク保存部
１４ｃフォルマント保存部
１５必須音素情報保存部
１５ａ必須音素列保存部
１５ｂ音素長保存部
１５ｃピッチ情報保存部
１５ｄフォルマント情報保存部
１６コーパス蓄積部
１７音声認識部
２３補充コーパス作成部
２６音声コーパス蓄積部
２１波形辞書作成支援システム
３１波形辞書作成支援システム
４１波形辞書作成支援システム
５１波形辞書作成支援システム
６１波形辞書作成支援システム
７１波形辞書作成支援システム
８１波形辞書作成支援システム
９１波形辞書作成支援システム

[0001]
[Technical Field of the Invention]
The present invention relates to a system for supporting creation of a waveform dictionary for speech synthesis, and more particularly to a system for efficiently searching for a corpus (additional corpus) that needs to be supplemented to create a desired waveform dictionary.
[0002]
[Prior art]
Conventionally, systems are known that perform speech synthesis by recording and accumulating speech data of fixed sentences and words read aloud, and then extracting and concatenating the necessary speech data in units such as words. A file that accumulates such speech data (usually a large quantity of speech waveforms, or their feature values, containing all the phonemes uttered by a single speaker) is called a waveform dictionary or the like.
[0003]
In a conventional speech synthesis system, when a sentence is synthesized, the speech data needed to synthesize it is searched for and extracted from the waveform dictionary in reference units such as phonemes or syllables. Then, by transforming the extracted speech data or concatenating multiple pieces of speech data, the optimum synthesized speech corresponding to the input text is created and output.
[0004]
That is, if all the sounds contained in the input sentence (text) are recorded in the waveform dictionary, high-quality synthesized speech corresponding to that text can be created. Conversely, sounds that are not recorded in the waveform dictionary cannot be synthesized. In addition, sound quality may degrade when synthesized speech is created by concatenating multiple pieces of speech data.
[0005]
Therefore, to allow arbitrary text to be synthesized, the waveform dictionary must in theory store a wide variety of speech data. However, an excessively large waveform dictionary is undesirable in terms of implementation cost and search efficiency. The present inventor has therefore already proposed a speech synthesis dictionary creation apparatus that refers to a large-scale corpus dictionary to efficiently create a waveform dictionary of moderate size suited to the user's application (see Patent Document 1).
[0006]
[Patent Document 1]
JP 2001-296878 A
[0007]
[Problems to be solved by the invention]
The large amount of speech data to be registered in a waveform dictionary is generally obtained as follows: the developer of a speech synthesis engine contracts with a narrator, voice talent, or the like who has the desired voice type and voice quality, and records speech while keeping that person engaged for a long period. This makes the recording costly in both time and money.
[0008]
Accordingly, when building a new speech synthesis system or upgrading an existing one, for example, it is inefficient to redo the speech recording for waveform dictionary creation from scratch. When the waveform dictionary itself is to be improved over the existing dictionary, adding speech data obtained by additional recording to the existing waveform dictionary (the already recorded speech data) reduces the cost and time required for recording. However, identifying which speech data should be additionally recorded is not an easy task.
[0009]
An object of the present invention is therefore to provide a waveform dictionary creation support system that, when recorded speech data already exists, can efficiently search for the speech data (supplementary corpus) that should be added in order to create a desired waveform dictionary.
[0010]
[Means for Solving the Problems]
To achieve the above object, a waveform dictionary creation support system according to the present invention comprises: a speech analysis unit that receives and analyzes speech data including speech waveform data and text data representing the utterance content of the speech, and obtains phoneme information of the speech data; a speech information storage unit that stores speech information including at least the phoneme information obtained by the speech analysis unit; an essential phoneme information storage unit that stores conditions on the phoneme information essential to a waveform dictionary; a missing phoneme string search unit that searches for phoneme strings missing from the phoneme information stored in the speech information storage unit with respect to the conditions stored in the essential phoneme information storage unit, and outputs the phoneme strings found as missing phoneme strings; a corpus storage unit in which a corpus containing at least all of the essential phoneme information is accumulated; and a supplementary corpus creation unit that searches the corpus storage unit for corpus entries containing the missing phoneme strings output by the missing phoneme string search unit, and outputs the entries found as an additional corpus.
[0011]
With this configuration, if conditions corresponding to the desired waveform dictionary are stored in the essential phoneme information storage unit, the phoneme strings that are missing with respect to those conditions (that is, the phoneme strings that must be added to the speech information storage unit in order to create the desired waveform dictionary) are found as missing phoneme strings, and a corpus containing those phoneme strings is output as an additional corpus. As a result, when recorded speech data already exists, the speech data (supplementary corpus) that should be added to create the desired waveform dictionary can be searched for efficiently.
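The last stage of this flow, retrieving corpus entries that contain a missing phoneme string, can be sketched as below. This is only an illustration under stated assumptions: the corpus is modeled as a dictionary from sentence text to its phoneme sequence, and the function names and toy data are hypothetical rather than taken from the patent.

```python
from typing import Dict, List, Set, Tuple

PhonemeString = Tuple[str, ...]


def contains(seq: PhonemeString, sub: PhonemeString) -> bool:
    """Return True if `sub` occurs as a contiguous run inside `seq`."""
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))


def find_additional_corpus(
    missing: Set[PhonemeString],
    corpus: Dict[str, PhonemeString],
) -> List[str]:
    """Return the corpus sentences that contain at least one missing phoneme string."""
    return [
        text
        for text, phonemes in corpus.items()
        if any(contains(phonemes, m) for m in missing)
    ]


# Toy corpus: sentence text -> phoneme sequence (hypothetical labels).
corpus = {
    "asa": ("a", "s", "a"),
    "kumo": ("k", "u", "m", "o"),
    "ie": ("i", "e"),
}
missing = {("m", "o"), ("i", "e")}
additional = find_additional_corpus(missing, corpus)
```

This version returns every matching sentence. A practical tool might instead pick a small subset that still covers all missing strings, but the text does not specify such a selection strategy.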
[0012]
The additional corpus that is output may be either a text corpus or a speech corpus. To output a text corpus as the additional corpus, a corpus storage unit in which a text corpus is accumulated is used; to output a speech corpus, a corpus storage unit in which a speech corpus is accumulated is used. When the additional corpus is output as a text corpus, speech is additionally recorded according to that text corpus, and the waveform dictionary is created from the speech information storage unit after the additional recording. When the additional corpus is output as a speech corpus, no additional recording by a narrator or the like is needed; the speech corpus itself is added to the speech information storage unit, and the waveform dictionary is created.
[0013]
The waveform dictionary creation support system according to the present invention preferably further comprises a speech recognition unit that recognizes the utterance content from input speech waveform data and outputs the recognized utterance content, as text data, to the speech analysis unit together with the speech waveform data. This saves the effort of entering the utterance content as text.
[0014]
In the waveform dictionary creation support system according to the present invention, it is also preferable that the missing phoneme string search unit further receives a quality grade for the speech to be synthesized, determines which of the conditions stored in the essential phoneme information storage unit must be satisfied according to the input quality grade, and searches for missing phoneme strings according to the determined conditions. This makes it possible to create a waveform dictionary whose size and content suit the desired quality of synthesized speech.
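One simple way to picture this is a lookup from quality grade to the set of conditions to check. The grade definitions below follow Embodiment 5 (paragraphs 0072 to 0075), where grade 1 is the highest quality and grade 3 the lowest; the condition names themselves are illustrative labels and do not appear in the patent.

```python
from typing import Dict, FrozenSet

# Condition names are hypothetical labels chosen for this sketch.
CONDITIONS_BY_GRADE: Dict[int, FrozenSet[str]] = {
    # Grade 3 (lowest): vowel, vowel+vowel, and consonant+vowel strings only.
    3: frozenset({"basic_strings"}),
    # Grade 2 (middle): additionally require chains of up to 4 or 5 phonemes.
    2: frozenset({"basic_strings", "long_chains"}),
    # Grade 1 (highest): also check phoneme length, pitch, and formants.
    1: frozenset({"basic_strings", "long_chains", "phoneme_length", "pitch", "formant"}),
}


def conditions_for(grade: int) -> FrozenSet[str]:
    """Return the conditions that must be satisfied for the given quality grade."""
    if grade not in CONDITIONS_BY_GRADE:
        raise ValueError(f"unknown quality grade: {grade}")
    return CONDITIONS_BY_GRADE[grade]
```

Each higher-quality grade is a strict superset of the one below it, which mirrors the statement that a higher required quality leads to a more thorough search and a larger dictionary.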
[0015]
In the waveform dictionary creation support system according to the present invention, it is also preferable that the corpus storage unit has corpus storage sections in which corpora are accumulated by field or application, and that the supplementary corpus creation unit receives information on the field or application of use and searches for the additional corpus in the corpus storage section corresponding to the input information. Searching for the additional corpus in a corpus accumulated for each field or application makes it possible to create a waveform dictionary suited to the environment (field or application) in which the speech synthesis system is used.
[0016]
A computer program according to the present invention causes a computer to execute processing comprising the steps of: analyzing speech data that includes input speech waveform data and text data representing the utterance content of the speech, and obtaining phoneme information of the speech data; storing speech information including at least the obtained phoneme information in a speech information storage file; receiving conditions on the phoneme information essential to a waveform dictionary, searching for phoneme strings missing from the phoneme information stored in the speech information storage file with respect to those conditions, and outputting the phoneme strings found as missing phoneme strings; and searching a corpus storage file, in which a corpus containing at least all of the essential phoneme information is accumulated, for corpus entries containing the missing phoneme strings, and outputting the entries found as an additional corpus.
[0017]
When this program is loaded into a computer and executed, the phoneme strings that must be added to the speech information file in order to create the desired waveform dictionary are found as missing phoneme strings, and a corpus containing those phoneme strings is output as an additional corpus. As a result, when recorded speech data already exists, the speech data (supplementary corpus) that should be added to create the desired waveform dictionary can be searched for efficiently.
[0018]
BEST MODE FOR CARRYING OUT THE INVENTION
(Embodiment 1)
Hereinafter, an embodiment of the present invention will be described.
[0019]
As shown in FIG. 1, the waveform dictionary creation support system 1 according to the present embodiment includes a speech analysis unit 11, a missing phoneme string search unit 12, a supplemental corpus creation unit 13, a speech information storage unit 14, an essential phoneme information storage unit 15, and a corpus storage unit 16.
[0020]
The voice analysis unit 11 receives recorded voice data as input and analyzes it. The "voice data" input in this embodiment includes recorded voice waveform data (for example, PCM data) and text data. The corpus storage unit 16 is a large-scale corpus database storing text corpora.
[0021]
When the recorded voice data is input, the voice analysis unit 11 obtains phoneme information by analyzing the input voice data. The phoneme information includes at least label information (phoneme label) indicating the utterance content. The voice analysis unit 11 adds the phoneme information of the analysis result to the input voice waveform data, and stores the input voice waveform data in the voice information storage unit 14 as voice data in a form that can be searched in phoneme units.
[0022]
The missing phoneme string search unit 12 checks whether the voice data stored in the voice information storage unit 14 includes all phoneme strings necessary for reading out a given sentence, and searches for any phoneme string that is lacking (hereinafter, "missing phoneme string"). Specifically, by referring to the essential phoneme information storage unit 15, it searches for phoneme strings missing from the recorded voice data stored in the voice information storage unit 14. The essential phoneme information storage unit 15 stores at least the phoneme strings of all Japanese syllables, which are essential for synthesizing an arbitrary sentence. The missing phoneme string search unit 12 searches for phoneme strings that are stored in the essential phoneme information storage unit 15 but are not included in the voice information stored in the voice information storage unit 14, and outputs the phoneme strings found as missing phoneme strings.
[0023]
The supplementary corpus creation unit 13 searches the corpus storage unit 16 for a corpus including the missing phoneme string obtained by the missing phoneme string search unit 12, and outputs the corpus as an additional corpus (word or sentence). The output of the additional corpus may be performed in any format, such as print output, display on a display, and output to a file.
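The two steps described above (finding missing phoneme strings, then retrieving corpus entries that contain them) can be sketched as follows. This is only an illustrative sketch: the names `find_missing` and `select_additional_corpus` are hypothetical, phoneme strings are assumed to be plain strings, the corpus is assumed to be pre-indexed as a mapping from text to the phoneme strings it contains, and the greedy selection order is an assumption, since the text only requires that the output corpus include the missing phoneme strings.

```python
from typing import Iterable


def find_missing(essential: Iterable[str], recorded: Iterable[str]) -> set[str]:
    """Essential phoneme strings that do not appear in the recorded labels."""
    return set(essential) - set(recorded)


def select_additional_corpus(missing: set[str], corpus: dict[str, set[str]]) -> list[str]:
    """Pick corpus entries until every coverable missing string is covered.

    Greedy choice (assumed here, not stated in the text): at each step take
    the entry that covers the most still-missing phoneme strings.
    """
    remaining = set(missing)
    chosen: list[str] = []
    while remaining:
        best = max(corpus, key=lambda text: len(corpus[text] & remaining), default=None)
        if best is None or not corpus[best] & remaining:
            break  # nothing in the corpus covers what is left
        chosen.append(best)
        remaining -= corpus[best]
    return chosen


# Hypothetical toy data for illustration only.
essential = {"a", "ka", "sa", "ta"}
recorded = {"a", "ka"}
corpus = {"sata": {"sa", "ta"}, "sa": {"sa"}, "kata": {"ka", "ta"}}
missing = find_missing(essential, recorded)
additional = select_additional_corpus(missing, corpus)
```

A greedy choice keeps the amount of text to be re-recorded small, but any selection rule that covers the missing phoneme strings would fit the description.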
[0024]
As described above, the waveform dictionary creation support system 1 according to the present embodiment searches for phoneme strings missing from the recorded speech data, and outputs a corpus (text corpus) including the missing phoneme strings as an additional corpus. Therefore, if a narrator or the like reads out the text of the output additional corpus and the resulting voice data is additionally recorded, all phoneme information necessary for reading out an arbitrary sentence can be stored in the voice information storage unit 14.
[0025]
FIG. 2 shows an application example of the waveform dictionary creation support system 1 of the present embodiment. As shown in FIG. 2, a user (including a narrator; the same applies hereinafter) inputs recorded speech data (speech waveform data and text data) from a recording system 2 for recording speech into the waveform dictionary creation support system 1. When an additional corpus is output from the waveform dictionary creation support system 1, the user performs additional recording on the recording system 2 according to the output additional corpus. The waveform dictionary creation support system 1 checks whether there is any additional corpus and, if it determines that there is none, outputs the speech data and speech information accumulated in the speech information storage unit 14 to the waveform dictionary creation system 3. The waveform dictionary creation system 3 creates a waveform dictionary 4 based on these speech data and speech information. The created waveform dictionary 4 is used when the speech synthesis system 5 creates synthesized speech from an arbitrary text.
[0026]
Note that the arrows shown in FIG. 2 merely indicate how data generated in each system is used in another system, and do not indicate a steady connection state between the systems. For example, the waveform dictionary 4 must be available for reference while the speech synthesis system 5 is in operation, but the recording system 2, the waveform dictionary creation support system 1, and the waveform dictionary creation system 3 do not need to be connected to the waveform dictionary 4 or the speech synthesis system 5.
[0027]
Here, FIG. 3 shows another application example of the waveform dictionary creation support system 1 of the present embodiment. The example of FIG. 3 differs from the example shown in FIG. 2 in that the user sends the recorded data via the network 6. For this reason, in the example of FIG. 3, the transmission / reception unit 7 is further provided between the network 6 and each system.
[0028]
A user sends voice data using a telephone, VoIP (Voice over Internet Protocol), or the like. The transmission / reception unit 7 receives the voice data from the network 6 and inputs it to the recording system 2 for recording. Note that recording is not necessarily required. When recorded speech data (speech waveform data and text data indicating the utterance content) is input from the recording system 2 to the waveform dictionary creation support system 1, the waveform dictionary creation support system 1 searches for an additional corpus and outputs it to the transmission / reception unit 7. The additional corpus is sent from the transmission / reception unit 7 to the user via the network 6. The user receives the additional corpus and, in accordance with it, performs additional recording on the recording system 2 using a telephone, VoIP, or the like. This process is repeated until the waveform dictionary creation support system 1 determines that there is no additional corpus.
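The repeat-until-complete cycle described above can be sketched as a simple loop. This is a hedged illustration, not the patented implementation: `collect_until_complete` and the `record_more` callback (a stand-in for the user's additional recording through the recording system 2) are hypothetical names, and the `max_rounds` safety limit is an assumption that does not appear in the text.

```python
from typing import Callable


def collect_until_complete(
    essential: set[str],
    recorded: set[str],
    record_more: Callable[[set[str]], set[str]],
    max_rounds: int = 10,
) -> tuple[set[str], int]:
    """Repeat 'find missing -> request additional recording' until nothing is missing.

    Returns the final set of recorded phoneme strings and the number of
    additional recording rounds that were needed.
    """
    recorded = set(recorded)
    for rounds in range(max_rounds):
        missing = essential - recorded
        if not missing:
            return recorded, rounds  # no additional corpus is needed
        # The callback returns the phoneme strings covered by the new recording.
        recorded |= record_more(missing)
    return recorded, max_rounds


# Hypothetical narrator who covers one missing phoneme string per round.
result, rounds = collect_until_complete(
    essential={"a", "ka", "sa"},
    recorded={"a"},
    record_more=lambda missing: {sorted(missing)[0]},
)
```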
[0029]
When there is no additional corpus, the voice information stored in the voice information storage unit 14 of the waveform dictionary creation support system 1 is output to the waveform dictionary creation system 3. The waveform dictionary creation system 3 creates the waveform dictionary 4 based on the voice information. As a result, when the user inputs an arbitrary text via the network 6 and the transmission / reception unit 7, the speech synthesis system 5 performs speech synthesis using the waveform dictionary 4 and sends the synthesized sound to the user via the transmission / reception unit 7 and the network 6.
[0030]
The "speech data" includes speech waveform data and text data representing utterance contents. “Speech information” is a result of analysis by the speech analysis unit 11, and includes at least a phoneme label. When the voice analysis unit 11 performs pitch mark assignment and frequency analysis as described in an embodiment to be described later, the voice information includes pitch marks and formants. In the above-described example, both the audio data and the audio information are stored in the audio information storage unit 14. However, the audio data need not always be stored.
[0031]
Note that, as in FIG. 2, the arrows shown in FIG. 3 merely indicate how data generated in each system is used in other systems, and do not indicate a steady connection state between the systems.
[0032]
FIG. 4 shows still another application example of the waveform dictionary creation support system 1 of the present embodiment. In the example shown in FIG. 4, voice recording to the recording system 2 via the network 6, generation of an additional corpus by the waveform dictionary creation support system 1 and its transmission to the user, and creation of the waveform dictionary 4 by the waveform dictionary creation system 3 are the same as in the preceding network-based example. However, in the example shown in FIG. 4, the speech synthesis system 5 is located at the user's side, and the created waveform dictionary 4 is sent to the user via the network 6. When the user inputs an arbitrary text into the speech synthesis system 5, the speech synthesis system 5 creates and outputs a synthesized sound using the waveform dictionary 4 that was sent.
[0033]
(Embodiment 2)
Hereinafter, another embodiment of the waveform dictionary creation support system according to the present invention will be described with a specific example.
[0034]
As shown in FIG. 5, in the waveform dictionary creation support system 21 of the present embodiment, the speech analysis unit 11 includes an automatic labeling unit 11a, the speech information storage unit 14 includes a phoneme label storage unit 14a, the missing phoneme string search unit 12 includes a phoneme string search unit 12a, and the essential phoneme information storage unit 15 includes an essential phoneme string storage unit 15a.
[0035]
First, when PCM data and text data are input as speech data to the speech analysis unit 11, the automatic labeling unit 11a performs phoneme labeling. The result of the phoneme labeling is stored in the phoneme label storage unit 14a of the voice information storage unit 14.
[0036]
For example, if the utterance content is "Early morning, a telegram arrived at a bungalow." or "Aoyama has many new shops.", the phoneme labeling result is as follows.

[0037]
Next, the missing phoneme string search unit 12 searches for phoneme strings that are lacking in the phoneme information (phoneme labels) stored in the speech information storage unit 14 (missing phoneme strings), based on the information in the essential phoneme string storage unit 15a of the essential phoneme information storage unit 15. The essential phoneme information storage unit 15 stores the phoneme strings of all syllables necessary for reading out an arbitrary Japanese sentence. In the present embodiment, for example, the essential phoneme string storage unit 15a stores all phoneme strings of the following patterns:
Vowel ... (1)
Vowel + vowel ... (2)
Consonant + vowel ... (3)
[0038]
Furthermore, in addition to the above three patterns (1) to (3), it is also preferable to hold all three-phoneme chains of the following pattern:
Vowel + consonant + vowel ... (4)
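As a rough illustration of how the contents of the essential phoneme string storage unit 15a could be enumerated from patterns (1) to (4), the sketch below builds the pattern set from a phoneme inventory. The inventory shown is a hypothetical, heavily reduced one chosen only for illustration, the function name `essential_patterns` is an assumption, and the sketch ignores the fact that not every consonant-vowel combination is an actual Japanese syllable.

```python
from itertools import product

# Hypothetical, reduced phoneme inventory for illustration only.
VOWELS = ["a", "i", "u", "e", "o"]
CONSONANTS = ["k", "s", "t"]


def essential_patterns(include_vcv: bool = False) -> set[tuple[str, ...]]:
    """Enumerate patterns (1) to (3), optionally adding pattern (4)."""
    patterns: set[tuple[str, ...]] = set()
    patterns.update((v,) for v in VOWELS)                     # (1) vowel
    patterns.update(product(VOWELS, VOWELS))                  # (2) vowel + vowel
    patterns.update(product(CONSONANTS, VOWELS))              # (3) consonant + vowel
    if include_vcv:
        patterns.update(product(VOWELS, CONSONANTS, VOWELS))  # (4) vowel + consonant + vowel
    return patterns


basic = essential_patterns()
extended = essential_patterns(include_vcv=True)
```

Representing each phoneme string as a tuple keeps chains of different lengths distinct, so the same set can later hold 4- or 5-phoneme chains as well.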
[0039]
Alternatively, as another example, in addition to the three patterns (1) to (3), it is also preferable to hold patterns at which concatenation tends to degrade the sound, such as:
Vowel + semi-vowel + vowel ... (5)
Vowel + nasal + vowel ... (6)
Vowel + vowel + vowel ... (7)
[0040]
For example, when patterns (1) to (4) are held, the following are stored in the essential phoneme string storage unit 15a.

[0041]
For phoneme strings that are likely to cause abnormal sounds or noise at the time of connection, it is desirable to store them in the essential phoneme string storage unit 15a in the form of 4- or 5-phoneme chains, since this enables higher-quality speech synthesis.
[0042]
The phoneme string search unit 12a of the missing phoneme string search unit 12 compares the phoneme label storage unit 14a of the voice information storage unit 14 with the essential phoneme string storage unit 15a of the essential phoneme information storage unit 15, searches whether all of the essential phoneme strings exist in the phoneme label storage unit 14a, and thereby obtains the missing phoneme strings.
[0043]
As described in the first embodiment, the supplemental corpus creation unit 13 searches the corpus stored in the corpus storage unit 16 for a corpus including a missing phoneme string, and outputs the search result as an “additional corpus”.
[0044]
As described above, according to the waveform dictionary creation support system 21 of the present embodiment, if the essential conditions for creating a desired waveform dictionary are registered in the essential phoneme information storage unit 15, a corpus including the phoneme strings that need to be added in order to satisfy those conditions (missing phoneme strings) is output as an “additional corpus”. Thereby, the search for the additional corpus can be performed efficiently.
[0045]
The application examples of the waveform dictionary creation support system 21 of the present embodiment are the same as those described in the first embodiment.
[0046]
(Embodiment 3)
Hereinafter, another embodiment of the waveform dictionary creation support system according to the present invention will be described with a specific example.
[0047]
As shown in FIG. 6, in the waveform dictionary creation support system 31 according to the present embodiment, the speech analysis unit 11 includes an automatic labeling unit 11a and a pitch mark adding unit 11b; the speech information storage unit 14 includes a phoneme label storage unit 14a and a pitch mark storage unit 14b; the missing phoneme string search unit 12 includes a phoneme string search unit 12a, a phoneme length information matching unit 12b, and a pitch information matching unit 12c; and the essential phoneme information storage unit 15 includes an essential phoneme string storage unit 15a, a phoneme length storage unit 15b, and a pitch information storage unit 15c.
[0048]
When PCM data and text data are input as voice data to the voice analysis unit 11, the automatic labeling unit 11a performs phoneme labeling (including boundary position determination), and the pitch mark giving unit 11b gives pitch marks. In the present embodiment, an example is described in which PCM data is used as audio data, but the format of the audio data is arbitrary. The phoneme label and the pitch mark are stored in the phoneme label storage unit 14a and the pitch mark storage unit 14b of the voice information storage unit 14, respectively.
[0049]
An example of a phoneme label stored here is shown below. For example, in the case of a speech waveform as shown in FIG. 7, a phoneme name and its phoneme boundary can be represented as shown in the following (Table 1) using sampling positions in time series.
[0050]
[Table 1]

[0051]
The pitch marks stored in this case are located at the respective pitches as shown in FIG. 8, and are represented as follows, for example.
[0052]

In the insufficient phoneme string search unit 12, the phoneme string search unit 12a compares the phoneme label storage unit 14a of the speech information storage unit 14 with the essential phoneme string storage unit 15a of the essential phoneme information storage unit 15, as in the second embodiment. By doing so, a search is made as to whether or not all of the essential phoneme strings exist in the phoneme label storage unit 14a. Thus, a phoneme string that exists in the essential phoneme string storage unit 15a and that does not exist in the phoneme label storage unit 14a is determined to be an insufficient phoneme string.
[0053]
The phoneme length information matching unit 12b obtains the phoneme length of each phoneme string of the recorded voice data from the phoneme labels stored in the phoneme label storage unit 14a of the voice information storage unit 14, and compares the obtained phoneme length with the phoneme length data for synthesis held in the phoneme length storage unit 15b of the essential phoneme information storage unit 15. If the obtained phoneme length is extremely short compared with the phoneme length at the time of synthesis (for example, a length of １／ or less), it determines that a phoneme string satisfying the phoneme length condition is lacking. An example of the information stored in the phoneme length storage unit 15b is a list of the phoneme lengths used when synthesizing each phoneme, as described below.
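The phoneme length check can be sketched as follows, assuming phoneme labels carry start and end boundaries expressed as sample positions. The names `Label` and `too_short` are hypothetical, the ratio `min_ratio` is a caller-supplied assumption, and judging each phoneme by its longest recorded instance is also an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Label:
    phoneme: str
    start: int  # phoneme boundary positions, in samples
    end: int


def too_short(labels: list[Label], synth_len: dict[str, int], min_ratio: float) -> set[str]:
    """Phonemes whose longest recorded instance is at most min_ratio times
    the phoneme length needed at synthesis time."""
    longest: dict[str, int] = {}
    for lab in labels:
        length = lab.end - lab.start
        longest[lab.phoneme] = max(longest.get(lab.phoneme, 0), length)
    return {
        p for p, n in longest.items()
        if p in synth_len and n <= min_ratio * synth_len[p]
    }


# Hypothetical values for illustration only.
labels = [Label("a", 0, 400), Label("k", 400, 450), Label("a", 450, 1000)]
synth_len = {"a": 800, "k": 400}
lacking = too_short(labels, synth_len, min_ratio=0.5)
```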
[0054]

Further, the pitch information matching unit 12c obtains the pitch of the recorded voice data from the pitch marks stored in the pitch mark storage unit 14b of the voice information storage unit 14, and compares the obtained pitch with the pitch required for synthesis held in the pitch information storage unit 15c of the essential phoneme information storage unit 15. If there is a large gap between the two (for example, a pitch difference of 50% or more), it determines that a phoneme string satisfying the pitch condition is lacking. An example of the information stored in the pitch information storage unit 15c is a base pitch obtained from the pitch information of each PCM data.
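A minimal sketch of the pitch check is given below, assuming pitch marks are stored as increasing sample positions, one per pitch period. The function names are hypothetical, and measuring the 50% difference relative to the required pitch is an assumption, since the text does not state which value the percentage is taken against.

```python
def mean_pitch_hz(pitch_marks: list[int], sample_rate: int) -> float:
    """Mean fundamental frequency from consecutive pitch-mark positions (in samples)."""
    periods = [b - a for a, b in zip(pitch_marks, pitch_marks[1:])]
    if not periods:
        raise ValueError("need at least two pitch marks")
    # sample_rate divided by the mean period length
    return sample_rate * len(periods) / sum(periods)


def pitch_gap_too_large(pitch_hz: float, required_hz: float, max_rel_diff: float = 0.5) -> bool:
    """True when the recorded pitch differs from the required pitch by
    max_rel_diff (50% by default) or more, relative to the required pitch."""
    return abs(pitch_hz - required_hz) / required_hz >= max_rel_diff


# Hypothetical values: marks every 100 samples at a 16 kHz sampling rate.
pitch = mean_pitch_hz([0, 100, 200, 300], sample_rate=16000)
lacking = pitch_gap_too_large(pitch, required_hz=100.0)
```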
[0055]
The missing phoneme string search unit 12 outputs the union of the missing phoneme strings obtained by the phoneme string search unit 12a, the phoneme length information matching unit 12b, and the pitch information matching unit 12c to the supplemental corpus creating unit 13.
[0056]
As described in the first embodiment, the supplemental corpus creation unit 13 searches the corpus stored in the corpus storage unit 16 for a corpus including a missing phoneme string, and outputs the search result as an “additional corpus”.
[0057]
As described above, according to the waveform dictionary creation support system 31 of the present embodiment, phoneme strings that must be supplemented in order to satisfy all the conditions relating to phoneme strings, phoneme lengths, and pitch are searched for as missing phoneme strings, and a corpus containing the missing phoneme strings is output as an additional corpus. Therefore, if additional recording is performed based on the output additional corpus, it becomes possible to create a waveform dictionary that satisfies all the conditions relating to phoneme strings, phoneme lengths, and pitch.
[0058]
The application examples of the waveform dictionary creation support system 31 of the present embodiment are the same as those described in the first embodiment.
[0059]
(Embodiment 4)
Hereinafter, another embodiment of the waveform dictionary creation support system according to the present invention will be described with a specific example.
[0060]
As illustrated in FIG. 9, in the waveform dictionary creation support system 41 according to the present embodiment, the speech analysis unit 11 includes an automatic labeling unit 11a, a pitch mark adding unit 11b, and a frequency analysis unit 11c; the speech information storage unit 14 includes a phoneme label storage unit 14a, a pitch mark storage unit 14b, and a formant storage unit 14c; the missing phoneme string search unit 12 includes a phoneme string search unit 12a, a phoneme length information matching unit 12b, a pitch information matching unit 12c, and a formant information matching unit 12d; and the essential phoneme information storage unit 15 includes an essential phoneme string storage unit 15a, a phoneme length storage unit 15b, a pitch information storage unit 15c, and a formant information storage unit 15d.
[0061]
When the PCM data and the text data are input as voice data to the voice analysis unit 11, the automatic labeling unit 11a and the pitch mark adding unit 11b of the voice analysis unit 11 assign phoneme labels and pitch marks as described in the third embodiment. Further, the frequency analysis unit 11c analyzes the frequency of the PCM data, and stores the formant information of each phoneme in the formant storage unit 14c of the voice information storage unit 14.
[0062]
In the missing phoneme string search unit 12, the phoneme string search unit 12a, the phoneme length information matching unit 12b, and the pitch information matching unit 12c find the missing phoneme strings, respectively, as described in the third embodiment. Further, the formant information matching unit 12d checks the data in the formant information storage unit 15d against the formants of the recorded data to check whether there is a large gap.
[0063]
Examples of data stored in the formant information storage unit 15d include the first formant and the second formant of each vowel. For example, the average value $F_{i1}$ of the first formant and the average value $F_{i2}$ of the second formant of the phoneme “i” are stored in the formant information storage unit 15d. In this case, formant matching is performed, for example, as follows. The frequency analysis unit 11c analyzes the frequency of the PCM data as described above, and stores the first formant and the second formant of each phoneme in the formant storage unit 14c. The formant information matching unit 12d then calculates the average values of the first formant and the second formant for each phoneme string stored in the formant storage unit 14c. For example, the difference between the first formant $f_{i1}$ and the second formant $f_{i2}$ of the phoneme “i” in a phoneme string and the average formants $F_{i1}$, $F_{i2}$ held in the formant information storage unit 15d is obtained by, for example, the following equation.
[0064]
$$\mathrm{diff} = (F_{i1} - f_{i1})^2 + (F_{i2} - f_{i2})^2$$
A threshold for diff is set for each phoneme, and if diff exceeds the threshold, the phoneme string is determined to be a missing phoneme string.
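The formant check can be sketched as below, reading the equation as the squared distance between the stored average formants of a phoneme and the formants measured from the recorded data. The function names and all numeric values are hypothetical and serve only to illustrate the comparison against a per-phoneme threshold.

```python
def formant_diff(reference: tuple[float, float], observed: tuple[float, float]) -> float:
    """Squared distance between (first, second) formant pairs, in Hz squared."""
    return (reference[0] - observed[0]) ** 2 + (reference[1] - observed[1]) ** 2


def formant_mismatch(
    reference: tuple[float, float],
    observed: tuple[float, float],
    threshold: float,
) -> bool:
    """True when diff exceeds the per-phoneme threshold, i.e. the recorded
    phoneme string is treated as a missing phoneme string."""
    return formant_diff(reference, observed) > threshold


# Hypothetical formant values (Hz) for the phoneme "i".
stored_average = (300.0, 2300.0)
recorded = (320.0, 2200.0)
diff = formant_diff(stored_average, recorded)
```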
[0065]
The missing phoneme string search unit 12 outputs the union of the missing phoneme strings obtained by the phoneme string search unit 12a, the phoneme length information matching unit 12b, the pitch information matching unit 12c, and the formant information matching unit 12d to the supplemental corpus creation unit 13.
[0066]
As described in the first embodiment, the supplemental corpus creation unit 13 searches the corpus stored in the corpus storage unit 16 for a corpus including a missing phoneme string, and outputs the search result as an “additional corpus”.
[0067]
As described above, according to the waveform dictionary creation support system 41 of the present embodiment, phoneme strings that must be supplemented in order to satisfy all the conditions relating to phoneme strings, phoneme lengths, pitch, and formants are searched for as missing phoneme strings, and a corpus containing the missing phoneme strings is output as an additional corpus. Therefore, by performing additional recording based on the output additional corpus, it is possible to create a waveform dictionary that satisfies all the conditions relating to phoneme strings, phoneme lengths, pitch, and formants, and speech synthesis using that dictionary becomes possible.
[0068]
The application examples of the waveform dictionary creation support system 41 of the present embodiment are the same as those described in the first embodiment.
[0069]
(Embodiment 5)
Hereinafter, another embodiment of the waveform dictionary creation support system according to the present invention will be described with a specific example.
[0070]
As shown in FIG. 10, in the waveform dictionary creation support system 51 according to the present embodiment, as in the waveform dictionary creation support system 41 according to the fourth embodiment, the speech analysis unit 11 includes an automatic labeling unit 11a, a pitch mark adding unit 11b, and a frequency analysis unit 11c; the speech information storage unit 14 includes a phoneme label storage unit 14a, a pitch mark storage unit 14b, and a formant storage unit 14c; the missing phoneme string search unit 12 includes a phoneme string search unit 12a, a phoneme length information matching unit 12b, a pitch information matching unit 12c, and a formant information matching unit 12d; and the essential phoneme information storage unit 15 includes an essential phoneme string storage unit 15a, a phoneme length storage unit 15b, a pitch information storage unit 15c, and a formant information storage unit 15d.
[0071]
However, the waveform dictionary creation support system 51 according to the present embodiment differs from the fourth embodiment in that data representing a quality class is input to the missing phoneme string search unit 12, and the missing phoneme string search unit 12 searches for missing phoneme strings according to the level of the input quality class.
[0072]
The input quality class is represented by a numerical value, a symbol, or the like, and is associated with the level required as the quality of the synthesized speech, for example:
1: The highest level of voice quality
2: Intermediate voice quality level
3: Lowest voice quality level
Since the required size of the waveform dictionary increases as the quality level required of the synthesized speech increases, the missing phoneme string search unit 12 searches for missing phoneme strings progressively more finely in the order of the quality classes “3”, “2”, and “1”.
[0073]
For example, when the quality class “3” is input to the missing phoneme string search unit 12, the phoneme string search unit 12a of the missing phoneme string search unit 12 searches for the missing phoneme strings necessary to satisfy “vowel”, “vowel + vowel”, and “consonant + vowel”.
[0074]
When the quality class “1” is input, the phoneme string search unit 12a of the missing phoneme string search unit 12 searches, for example, for the missing phoneme strings necessary to satisfy the essential phoneme strings of the essential phoneme string storage unit 15a of the essential phoneme information storage unit 15, from one phoneme (“vowel”) up to the 4- and 5-phoneme chains. In addition, the missing phoneme strings necessary to provide the highest quality, satisfying the phoneme length, pitch information, and formant information, are also searched for.
[0075]
When the quality class “2” is input, the missing phoneme string search unit 12 searches for missing phoneme strings at a level intermediate between the above-described quality class “1” and quality class “3”. For example, the phoneme string search unit 12a searches only for the missing phoneme strings necessary to satisfy the essential phoneme chains, starting from one phoneme (“vowel”), held in the essential phoneme string storage unit 15a of the essential phoneme information storage unit 15.
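One way to picture the quality-class behavior is as a lookup table that tells the search which checks to run. The sketch below is only an assumption-laden illustration: `SearchPlan`, `PLANS`, and `plan_for` are hypothetical names, the rows for classes “1” and “3” follow the description, and the row for class “2” (in particular its maximum chain length) is an assumed midpoint that the text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchPlan:
    max_chain: int       # longest phoneme chain checked against the essential strings
    check_length: bool   # apply the phoneme length condition
    check_pitch: bool    # apply the pitch condition
    check_formant: bool  # apply the formant condition


# Classes 1 and 3 follow the description; the class 2 row is an assumed midpoint.
PLANS = {
    1: SearchPlan(max_chain=5, check_length=True, check_pitch=True, check_formant=True),
    2: SearchPlan(max_chain=3, check_length=False, check_pitch=False, check_formant=False),
    3: SearchPlan(max_chain=2, check_length=False, check_pitch=False, check_formant=False),
}


def plan_for(quality_class: int) -> SearchPlan:
    """Return the search plan for a quality class (1 = highest, 3 = lowest)."""
    try:
        return PLANS[quality_class]
    except KeyError:
        raise ValueError(f"unknown quality class: {quality_class}") from None


highest = plan_for(1)
lowest = plan_for(3)
```

Keeping the mapping in one table makes it easy to add classes or adjust what each class checks without touching the search code itself.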
[0076]
The missing phoneme string search unit 12 outputs the missing phoneme strings found as described above to the supplemental corpus creation unit 13. As described in the first embodiment, the supplemental corpus creation unit 13 searches the corpus stored in the corpus storage unit 16 for a corpus including a missing phoneme string, and outputs the search result as an “additional corpus”.
[0077]
As described above, according to the waveform dictionary creation support system 51 of the present embodiment, missing phoneme strings are searched for at a level of detail corresponding to the input quality class, and a corpus including the missing phoneme strings obtained as a search result is output as text data (additional corpus) for additional recording. As a result, the higher the required quality class, the finer the search for missing phoneme strings, so that a waveform dictionary for obtaining more natural synthesized speech can be created. On the other hand, if a low quality class is acceptable due to, for example, cost or storage capacity, searching only for the minimum missing phoneme strings necessary to satisfy the required quality class keeps the waveform dictionary small, saving cost and storage capacity.
[0078]
The application examples of the waveform dictionary creation support system 51 of the present embodiment are the same as those described in the first embodiment.
[0079]
(Embodiment 6)
Hereinafter, another embodiment of the waveform dictionary creation support system according to the present invention will be described with a specific example.
[0080]
In the present embodiment, only a specific configuration example of the supplemental corpus creation unit 13 and the corpus accumulation unit 16 will be described. The speech analysis unit 11, the missing phoneme string search unit 12, the speech information storage unit 14, and the essential phoneme information storage unit 15 may have any configuration as long as the object of the present invention can be achieved. It goes without saying that the speech analysis unit 11, the missing phoneme string search unit 12, the speech information storage unit 14, and the essential phoneme information storage unit 15 described in each of the above-described Embodiments 2 to 5 can be combined with the supplemental corpus creation unit 13 and the corpus accumulation unit 16 described in the present embodiment.
[0081]
As shown in FIG. 11, in the waveform dictionary creation support system 61 of the present embodiment, text corpora are collected for each field and stored in the corpus storage unit 16 as field-specific corpora 16a to 16d. Note that the field-specific corpora are not limited to “finance”, “government”, “general”, “natural conversation style”, and the like shown in the figure; corpora of other fields may also be used. The corpora 16a to 16d include a large number of fixed sentences used in each field. Based on the field name input by the user, the supplemental corpus creation unit 13 searches the corpus of the corresponding field in the corpus storage unit 16 for a corpus including a missing phoneme string.
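The field-specific lookup can be sketched as follows, assuming each field's corpus is indexed as a mapping from text to the phoneme strings it contains. The function name `search_field_corpus`, the romanized texts, and the phoneme sets are hypothetical examples; only the idea of restricting the search to the corpus of the field named by the user comes from the description.

```python
def search_field_corpus(
    field: str,
    missing: set[str],
    corpora: dict[str, dict[str, set[str]]],
) -> list[str]:
    """Return the texts in the named field's corpus that contain at least
    one missing phoneme string."""
    if field not in corpora:
        raise KeyError(f"no corpus stored for field: {field}")
    return [text for text, phonemes in corpora[field].items() if phonemes & missing]


# Hypothetical field-specific corpora for illustration only.
corpora = {
    "finance": {"yokin": {"yo", "ki"}, "tsuuchou": {"tsu", "cho"}},
    "general": {"ohayou": {"o", "ha", "yo"}},
}
finance_hits = search_field_corpus("finance", {"yo"}, corpora)
general_hits = search_field_corpus("general", {"yo"}, corpora)
```

The same missing phoneme string thus yields different additional texts depending on the field, which is what lets the resulting waveform dictionary match the usage environment.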
[0082]
Examples of information stored in the corpus storage unit 16 include:
・ Text sentences or phoneme strings,
・ Pitch information expected when speaking,
And the like.
[0083]
Specifically, the text of the financial corpus is:
・ "Please check your deposit balance."
・ "Please confirm your passbook account number."
・ "Please enter your transfer address and name."
And the like.
[0084]
Also, as words,
・ "1 yen", "2 yen", ... "1,000 yen", "2,000 yen", ... "10,000 yen", ... (readings of monetary amounts)
・ "deposit"
・ "passbook"
And the like.
[0085]
Also, as natural conversation style text sentences,
・ "Good morning."
・ "How are you?"
・ "It's good weather today."
・ "Where shall we go if it rains tomorrow?"
and other natural conversation sentences.
[0086]
As described above, by selecting the additional corpus including the missing phoneme string from the corpora accumulated for each field, it becomes possible to create a waveform dictionary 4 that yields more natural synthesized speech suited to the use environment of the speech synthesis system 5.
[0087]
As a modification of the configuration shown in FIG. 11, as shown in FIG. 12, the voice analysis unit 11 may store the phoneme string (text data) obtained as the analysis result of the recorded voice data not only in the voice information storage unit 14 but also in the corpus storage unit 16.
[0088]
The application examples of the waveform dictionary creation support system 61 of the present embodiment are the same as those described in the first embodiment.
[0089]
(Embodiment 7)
Another embodiment of the waveform dictionary creation support system according to the present invention will be described with reference to the drawings.
[0090]
The waveform dictionary creation support system of the present embodiment differs from the above-described embodiments in its supplemental corpus creation unit and corpus accumulation unit. As shown in FIG. 13, the waveform dictionary creation support system 71 of the present embodiment includes a supplemental corpus creation unit 23 and a speech corpus accumulation unit 26 instead of the supplemental corpus creation unit 13 and the corpus accumulation unit 16 described in the first embodiment and the like.
[0091]
In this embodiment, only the supplemental corpus creation unit 23 and the speech corpus accumulation unit 26 will be described. The speech analysis unit 11, the missing phoneme string search unit 12, the speech information storage unit 14, and the essential phoneme information storage unit 15 may have any configuration as long as the object of the present invention can be achieved. It goes without saying that the speech analysis unit 11, the missing phoneme string search unit 12, the speech information storage unit 14, and the essential phoneme information storage unit 15 described in each of Embodiments 2 to 5 above can be combined with the supplemental corpus creation unit 23 and the speech corpus accumulation unit 26 described in the present embodiment.
[0092]
In the waveform dictionary creation support system 71 of the present embodiment, speech waveform data is stored in the speech corpus storage unit 26 together with information describing the waveform data, such as phoneme labels and pitch. The supplemental corpus creation unit 23 searches the speech corpus storage unit 26 for a corpus including the missing phoneme string obtained by the missing phoneme string search unit 12, and outputs the search result as additional speech data.
[0093]
That is, in the waveform dictionary creation support systems according to the first to sixth embodiments, text is output as the additional corpus, and the user has to perform additional recording according to that text. In the waveform dictionary creation support system 71 according to the present embodiment, on the other hand, the speech data to be added is created automatically from the speech corpus stored in the speech corpus storage unit 26, which has the advantage that no additional recording needs to be performed.
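The retrieval performed by the supplemental corpus creation unit 23 can be illustrated with a minimal sketch. It assumes, purely for illustration, that each entry in the speech corpus storage unit 26 holds a phoneme label sequence together with its waveform samples. The names `SpeechEntry`, `contains_sequence`, and `find_additional_speech` are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class SpeechEntry:
    """One stored utterance: phoneme labels plus waveform samples (assumed layout)."""
    phonemes: Tuple[str, ...]
    waveform: List[int]


def contains_sequence(phonemes: Sequence[str], target: Sequence[str]) -> bool:
    """Return True if `target` occurs as a contiguous run inside `phonemes`."""
    n, m = len(phonemes), len(target)
    if m == 0:
        return False
    return any(tuple(phonemes[i:i + m]) == tuple(target) for i in range(n - m + 1))


def find_additional_speech(
    corpus: List[SpeechEntry], missing: List[Tuple[str, ...]]
) -> Dict[Tuple[str, ...], List[SpeechEntry]]:
    """For each missing phoneme string, collect the stored entries that contain it."""
    return {
        tuple(seq): [e for e in corpus if contains_sequence(e.phonemes, seq)]
        for seq in missing
    }


# Hypothetical corpus contents and missing phoneme strings.
corpus = [
    SpeechEntry(("k", "o", "N", "n", "i", "ch", "i", "w", "a"), [0, 12, -7]),
    SpeechEntry(("a", "r", "i", "g", "a", "t", "o"), [3, 5, 8]),
]
result = find_additional_speech(corpus, [("g", "a"), ("ky", "a")])
```

In this sketch, a missing phoneme string with no match simply maps to an empty list; how the system should react in that case is not specified in the embodiment.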
[0094]
FIG. 14 shows an application example of the waveform dictionary creation support system 71 of the present embodiment. As shown in FIG. 14, the speech data stored in the speech information storage unit 14 and the additional speech data created by the supplemental corpus creation unit 23 are input from the waveform dictionary creation support system 71 to the waveform dictionary creation system 3. The waveform dictionary creation system 3 creates a waveform dictionary 4 based on the input speech data. The speech synthesis system 5 performs speech synthesis using the waveform dictionary 4.
[0095]
Note that the arrows shown in FIG. 14 merely indicate how data generated in each system is used in another system, and do not indicate a permanent connection between the systems. For example, the waveform dictionary 4 must be accessible when the speech synthesis system 5 operates, but the waveform dictionary creation support system 71 and the waveform dictionary creation system 3 need not be connected to the waveform dictionary 4 or the speech synthesis system 5.
[0096]
(Embodiment 8)
Another embodiment of the waveform dictionary creation support system according to the present invention will be described with reference to the drawings. Note that this embodiment is a more specific example of the waveform dictionary creation support system described in the seventh embodiment, and therefore, portions having the same functions are assigned the same member numbers, and detailed descriptions are omitted.
[0097]
As shown in FIG. 15, the configurations and functions of the speech analysis unit 11, the speech information storage unit 14, the missing phoneme string search unit 12, and the essential phoneme information storage unit 15 in the waveform dictionary creation support system 81 according to the present embodiment are the same as those of Embodiment 4, so a detailed description is omitted.
[0098]
The speech corpus storage unit 26 stores waveform data, phoneme labels, pitch marks, and formant information from various speakers in a waveform data storage unit 26a, a phoneme label storage unit 26b, a pitch mark storage unit 26c, and a formant storage unit 26d, respectively.
[0099]
As described in the fourth embodiment, for example, the supplemental corpus creation unit 23 receives from the missing phoneme string search unit 12, as missing phoneme strings, the union of (1) phoneme strings that do not exist in the speech information storage unit 14 and (2) phoneme strings stored in the speech information storage unit 14 whose phoneme length, pitch, or formant does not satisfy a predetermined condition. The supplemental corpus creation unit 23 therefore selects, from the speech corpus storage unit 26, data whose phoneme label matches the missing phoneme string and whose phoneme length, pitch, and formant show a high degree of coincidence, and outputs it as additional speech data.
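The selection step can be sketched as follows. The embodiment does not define how the degree of coincidence is computed, so this example assumes a simple metric: the sum of relative deviations in phoneme length, pitch, and a single formant value from target values. The classes `Candidate` and `Target`, the functions `mismatch` and `select_best`, and the metric itself are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    label: Tuple[str, ...]  # phoneme label sequence
    length_ms: float        # phoneme length
    pitch_hz: float         # pitch
    formant_hz: float       # one formant value standing in for formant information


@dataclass
class Target:
    label: Tuple[str, ...]  # the missing phoneme string
    length_ms: float
    pitch_hz: float
    formant_hz: float


def mismatch(c: Candidate, t: Target) -> float:
    """Sum of relative deviations; 0.0 means perfect agreement (assumed metric)."""
    return (
        abs(c.length_ms - t.length_ms) / t.length_ms
        + abs(c.pitch_hz - t.pitch_hz) / t.pitch_hz
        + abs(c.formant_hz - t.formant_hz) / t.formant_hz
    )


def select_best(corpus: List[Candidate], target: Target) -> Optional[Candidate]:
    """Keep candidates with the same phoneme label, then pick the closest one."""
    same_label = [c for c in corpus if c.label == target.label]
    if not same_label:
        return None
    return min(same_label, key=lambda c: mismatch(c, target))


# Hypothetical stored data and target conditions.
corpus = [
    Candidate(("k", "a"), 120.0, 180.0, 700.0),
    Candidate(("k", "a"), 95.0, 210.0, 760.0),
    Candidate(("s", "a"), 100.0, 200.0, 750.0),
]
target = Target(("k", "a"), 100.0, 200.0, 750.0)
best = select_best(corpus, target)
```

A weighted metric, or a hard threshold per feature, would fit the description equally well; the unweighted sum is chosen here only to keep the sketch short.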
[0100]
The speech data (speech waveform data and its phoneme labels, pitch marks, and the like) stored in the speech information storage unit 14 and the additional speech data created by the supplemental corpus creation unit 23 are input to the waveform dictionary creation system 3. The waveform dictionary creation system 3 creates the waveform dictionary 4 based on these. By using the waveform dictionary 4, a speech synthesis system 5 that can read out any sentence is realized.
[0101]
(Embodiment 9)
Another embodiment of the waveform dictionary creation support system according to the present invention will be described with reference to the drawings.
[0102]
As shown in FIG. 16, the waveform dictionary creation support system 91 according to the present embodiment differs from the configuration of the first embodiment shown in FIG. 1 in that the input data is only speech waveform data (for example, PCM data) and in that a speech recognition unit 17 is further provided in the stage preceding the speech analysis unit 11.
[0103]
When the PCM data is input, the voice recognition unit 17 performs voice recognition of the PCM data and outputs utterance contents. It should be noted that any known method can be used as a method of voice recognition by the voice recognition unit 17, and thus a detailed description is omitted.
[0104]
Based on the PCM data and the utterance content, the speech analysis unit 11 attaches at least label information indicating the utterance content, and stores the result in the speech information storage unit 14 as speech data in a form that can be searched in phoneme units.
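The flow from PCM data through recognition to phoneme-searchable storage can be sketched as below. The recognizer is passed in as a callable because the embodiment allows any known recognition method; `fake_recognizer` is a stand-in written for this example and is not a real speech recognition API. The sketch also assumes, as a simplification, that the recognizer returns a space-separated phoneme string, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class VoiceRecord:
    pcm: bytes                   # speech waveform data
    text: str                    # utterance content from the recognizer
    phonemes: Tuple[str, ...]    # label information in phoneme units


def analyze(pcm: bytes, recognize: Callable[[bytes], str]) -> VoiceRecord:
    """Recognition (unit 17) followed by a simplified labeling step (unit 11)."""
    text = recognize(pcm)
    phonemes = tuple(text.split())
    return VoiceRecord(pcm, text, phonemes)


def index_by_phoneme(records: List[VoiceRecord]) -> Dict[str, List[int]]:
    """Build a phoneme -> record positions index so data is searchable per phoneme."""
    index: Dict[str, List[int]] = {}
    for i, rec in enumerate(records):
        for p in set(rec.phonemes):
            index.setdefault(p, []).append(i)
    return index


def fake_recognizer(pcm: bytes) -> str:
    """Stand-in recognizer for the example only."""
    return "a s a" if pcm == b"\x01" else "k a"


records = [analyze(b"\x01", fake_recognizer), analyze(b"\x02", fake_recognizer)]
index = index_by_phoneme(records)
```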
[0105]
The configurations and functions described in each of the above embodiments can be applied to the missing phoneme string search unit 12, the essential phoneme information storage unit 15, the supplemental corpus creation unit 13, and the corpus storage unit 16.
[0106]
Instead of the supplemental corpus creation unit 13 and the corpus storage unit 16 shown in FIG. 16, a configuration including the supplemental corpus creation unit 23 and the speech corpus storage unit 26 described in the seventh and eighth embodiments can also be adopted.
[0107]
The application example of the waveform dictionary creation support system 91 of the present embodiment is the same as that described in the first embodiment with reference to FIG. 4, except that the data input from the recording system 2 to the waveform dictionary creation support system 91 is PCM data, so a detailed description is omitted.
[0108]
As described above, in the waveform dictionary creation support system 91 according to the present embodiment, it is possible to create the waveform dictionary 4 using only the PCM data of the voice recorded by the recording system 2.
[0109]
(Embodiment 10)
As one embodiment of the present invention, an example of a program for realizing a waveform dictionary creation support system according to the present invention by a computer will be described with reference to the drawings.
[0110]
As shown in FIG. 17, the program according to the present embodiment receives voice data (step S1), analyzes the input voice data to obtain phoneme information (step S2), and saves the result in a voice information storage file in a form that can be searched in phoneme units (step S3). Next, the conditions on the phonemes or phoneme strings required for reading out an arbitrary Japanese sentence are read from an essential phoneme information storage file in which the conditions are stored in advance (step S4). Then, phoneme strings that are missing from the voice information storage file with respect to the conditions read in step S4 are searched for, and the search result is output as missing phoneme strings (step S5). Subsequently, a corpus storage file storing a large number of corpora is searched for a corpus including the missing phoneme strings obtained in step S5, and the search result is output as an additional corpus (step S6). The additional corpus may be output in any format, such as printing, display on a screen, or output to a file.
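Steps S4 to S6 can be illustrated with a minimal sketch, assuming that steps S1 to S3 have already produced a list of recorded phoneme sequences and that the essential conditions are simply a set of required phoneme strings (conditions on phoneme length, pitch, or formant are left out). The function names and the in-memory data layout are assumptions for this example, not the file formats of the embodiment.

```python
from typing import Iterable, List, Set, Tuple

Phonemes = Tuple[str, ...]


def ngrams(seq: Phonemes, n: int) -> Set[Phonemes]:
    """All contiguous phoneme strings of length n contained in seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def find_missing(recorded: Iterable[Phonemes], required: Set[Phonemes]) -> Set[Phonemes]:
    """Step S5: required phoneme strings not covered by the recorded data."""
    lengths = {len(r) for r in required}
    have: Set[Phonemes] = set()
    for seq in recorded:
        for n in lengths:
            have |= ngrams(seq, n)
    return required - have


def select_additional(corpus: List[Tuple[str, Phonemes]], missing: Set[Phonemes]) -> List[str]:
    """Step S6: corpus sentences that contain at least one missing phoneme string."""
    selected = []
    for text, seq in corpus:
        if any(m in ngrams(seq, len(m)) for m in missing):
            selected.append(text)
    return selected


# Hypothetical data: result of S1-S3, conditions read in S4, and a small corpus.
recorded = [("a", "s", "a")]
required = {("a", "s"), ("s", "a"), ("k", "a")}
corpus = [("aka", ("a", "k", "a")), ("asa", ("a", "s", "a"))]

missing = find_missing(recorded, required)        # step S5
additional = select_additional(corpus, missing)   # step S6
```

This sketch returns every sentence that covers some missing string. Choosing a small set of sentences that together cover all missing strings would be a further refinement that the flowchart description does not require.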
[0111]
The processing of the speech analysis unit 11, the missing phoneme string search unit 12, and the supplemental corpus creation unit 13 (or 23) described in the above embodiments can be applied to step S2, step S5, and step S6, respectively, so a detailed description is omitted.
[0112]
When the program according to the present embodiment is read by a computer via an arbitrary portable recording medium such as a CD-ROM, or via a wireless or wired communication line, and is executed, the computer functions as the waveform dictionary creation support system described in each of the embodiments.
[0113]
(Supplementary Note 1) A waveform dictionary creation support system comprising:
a voice analysis unit that receives and analyzes voice data including voice waveform data and text data representing the utterance content of the voice, and obtains phoneme information of the voice data;
a voice information storage unit that stores voice information including at least the phoneme information obtained by the voice analysis unit;
an essential phoneme information storage unit that stores conditions on the phoneme information essential to a waveform dictionary;
a missing phoneme string search unit that searches for phoneme strings that are missing from the phoneme information stored in the voice information storage unit with respect to satisfying the conditions stored in the essential phoneme information storage unit, and outputs the phoneme strings obtained as the search result as missing phoneme strings;
a corpus storage unit that stores corpora including at least all of the essential phoneme information; and
a supplemental corpus creation unit that searches the corpus storage unit for a corpus including the missing phoneme strings output from the missing phoneme string search unit, and outputs the corpus obtained as the search result as an additional corpus.
[0114]
(Supplementary Note 2) The waveform dictionary creation support system according to Supplementary Note 1, further comprising a voice recognition unit that recognizes the utterance content from the input voice waveform data and outputs the recognized utterance content as text data to the voice analysis unit together with the voice waveform data.
[0115]
(Supplementary Note 3) The waveform dictionary creation support system according to Supplementary Note 1 or 2, wherein
the voice analysis unit stores in the voice information storage unit, as the phoneme information, the result of performing at least one process selected from phoneme labeling, pitch marking, and formant detection on the input voice waveform data,
the essential phoneme information storage unit stores, in addition to the phoneme strings essential to the waveform dictionary, conditions on the phoneme information stored in the voice information storage unit, and
the missing phoneme string search unit outputs a phoneme string as a missing phoneme string, even if the phoneme string is stored in the voice information storage unit, when it does not satisfy the conditions in the essential phoneme information storage unit.
[0116]
(Supplementary Note 4) The waveform dictionary creation support system according to Supplementary Note 1 or 2, wherein the missing phoneme string search unit further receives a quality class of the sound quality to be synthesized, determines, according to the input quality class, the conditions to be satisfied from among the conditions stored in the essential phoneme information storage unit, and searches for missing phoneme strings according to the determined conditions.
[0117]
(Supplementary Note 5) The waveform dictionary creation support system according to any one of Supplementary Notes 1 to 3, wherein
the corpus storage unit has corpus storage units provided for each field or application, and
the supplemental corpus creation unit receives information on the field or application to be used, and searches for the additional corpus in the corpus storage unit corresponding to the input field or application information.
[0118]
(Supplementary Note 6) A computer program causing a computer to execute:
a process of analyzing input voice data including voice waveform data and text data representing the utterance content of the voice, and obtaining phoneme information of the voice data;
a process of saving voice information including at least the obtained phoneme information in a voice information storage file;
a process of receiving conditions on the phoneme information essential to a waveform dictionary, searching for phoneme strings that are missing from the phoneme information stored in the voice information storage file with respect to satisfying the conditions, and outputting the phoneme strings obtained as the search result as missing phoneme strings; and
a process of searching a corpus storage file, which stores corpora including at least all of the essential phoneme information, for a corpus including the missing phoneme strings, and outputting the corpus obtained as the search result as an additional corpus.
[0119]
[Effects of the Invention]
As described above, according to the present invention, it is possible to provide a waveform dictionary creation support system that, when recorded voice data already exists, can efficiently search for the voice data (supplemental corpus) needed to supplement it in order to create a desired waveform dictionary.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a waveform dictionary creation support system according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing an application example of the waveform dictionary creation support system according to the first embodiment of the present invention.
FIG. 3 is a block diagram showing another application example of the waveform dictionary creation support system according to the first embodiment of the present invention.
FIG. 4 is a block diagram showing still another application example of the waveform dictionary creation support system according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing a schematic configuration of a waveform dictionary creation support system according to a second embodiment of the present invention.
FIG. 6 is a block diagram showing a schematic configuration of a waveform dictionary creation support system according to a third embodiment of the present invention.
FIG. 7 is an explanatory diagram showing an example of phoneme labeling.
FIG. 8 is an explanatory diagram showing an example of adding a pitch mark.
FIG. 9 is a block diagram showing a schematic configuration of a waveform dictionary creation support system according to a fourth embodiment of the present invention.
FIG. 10 is a block diagram showing a schematic configuration of a waveform dictionary creation support system according to a fifth embodiment of the present invention.
FIG. 11 is a block diagram showing a schematic configuration of a waveform dictionary creation support system according to a sixth embodiment of the present invention.
FIG. 12 is a block diagram showing a modification of the waveform dictionary creation support system according to the sixth embodiment of the present invention.
FIG. 13 is a block diagram showing a schematic configuration of a waveform dictionary creation support system according to a seventh embodiment of the present invention.
FIG. 14 is a block diagram showing an application example of the waveform dictionary creation support system according to the seventh embodiment of the present invention.
FIG. 15 is a block diagram showing a schematic configuration of a waveform dictionary creation support system according to an eighth embodiment of the present invention.
FIG. 16 is a block diagram showing a schematic configuration of a waveform dictionary creation support system according to a ninth embodiment of the present invention.
FIG. 17 is a flowchart showing a schematic operation of a computer program according to a tenth embodiment of the present invention.
[Explanation of symbols]
1 Waveform dictionary creation support system
2 Recording system
3 Waveform dictionary creation system
4 Waveform dictionary
5 Speech synthesis system
6 Network
7 Transceiver
11 Voice analysis unit
11a Automatic labeling unit
11b Pitch mark giving unit
11c Frequency analysis unit
12 Missing phoneme string search unit
12a Phoneme string search unit
12b Phoneme length information collation unit
12c Pitch information collation unit
12d Formant information collation unit
13 Supplemental corpus creation unit
14 Voice information storage unit
14a Phoneme label storage unit
14b Pitch mark storage unit
14c Formant storage unit
15 Essential phoneme information storage unit
15a Essential phoneme string storage unit
15b Phoneme length storage unit
15c Pitch information storage unit
15d Formant information storage unit
16 Corpus storage unit
17 Voice recognition unit
23 Supplemental corpus creation unit
26 Speech corpus storage unit
21 Waveform dictionary creation support system
31 Waveform dictionary creation support system
41 Waveform dictionary creation support system
51 Waveform dictionary creation support system
61 Waveform dictionary creation support system
71 Waveform dictionary creation support system
81 Waveform dictionary creation support system
91 Waveform dictionary creation support system

Claims

A waveform dictionary creation support system comprising:
a voice analysis unit that receives and analyzes voice data including voice waveform data and text data representing the utterance content of the voice, and obtains phoneme information of the voice data;
a voice information storage unit that stores voice information including at least the phoneme information obtained by the voice analysis unit;
an essential phoneme information storage unit that stores conditions on the phoneme information essential to a waveform dictionary;
a missing phoneme string search unit that searches for phoneme strings that are missing from the phoneme information stored in the voice information storage unit with respect to satisfying the conditions stored in the essential phoneme information storage unit, and outputs the phoneme strings obtained as the search result as missing phoneme strings;
a corpus storage unit that stores corpora including at least all of the essential phoneme information; and
a supplemental corpus creation unit that searches the corpus storage unit for a corpus including the missing phoneme strings output from the missing phoneme string search unit, and outputs the corpus obtained as the search result as an additional corpus.

The waveform dictionary creation support system according to claim 1, further comprising a voice recognition unit that recognizes the utterance content from the input voice waveform data and outputs the recognized utterance content as text data to the voice analysis unit together with the voice waveform data.

The waveform dictionary creation support system according to claim 1, wherein the missing phoneme string search unit further receives a quality class of the sound quality to be synthesized, determines, according to the input quality class, the conditions to be satisfied from among the conditions stored in the essential phoneme information storage unit, and searches for missing phoneme strings according to the determined conditions.

The waveform dictionary creation support system according to any one of claims 1 to 3, wherein
the corpus storage unit has corpus storage units provided for each field or application, and
the supplemental corpus creation unit receives information on the field or application to be used, and searches for the additional corpus in the corpus storage unit corresponding to the input field or application information.

A computer program causing a computer to execute:
a process of analyzing input voice data including voice waveform data and text data representing the utterance content of the voice, and obtaining phoneme information of the voice data;
a process of saving voice information including at least the obtained phoneme information in a voice information storage file;
a process of receiving conditions on the phoneme information essential to a waveform dictionary, searching for phoneme strings that are missing from the phoneme information stored in the voice information storage file with respect to satisfying the conditions, and outputting the phoneme strings obtained as the search result as missing phoneme strings; and
a process of searching a corpus storage file, which stores corpora including at least all of the essential phoneme information, for a corpus including the missing phoneme strings, and outputting the corpus obtained as the search result as an additional corpus.