JP3554513B2 - Speech synthesis apparatus and method, and recording medium storing speech synthesis program

Info

Publication number: JP3554513B2
Application number: JP33901099A
Authority: JP
Inventors: 智一森尾
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1999-11-30
Filing date: 1999-11-30
Publication date: 2004-08-18
Anticipated expiration: 2019-11-30
Also published as: JP2001154683A

Description

【０００１】
【発明の属する技術分野】
本発明は、テキストから音声を生成する音声合成装置に関する。
【０００２】
【従来の技術】
図１は一般的な音声合成装置の構成例を示している。図１に示す音声合成装置は、入力端子(１０１)、テキスト解析器(１０２)、韻律生成器(１０３)、音声素片選択器(１０４)、音声素片データベース(１０５)、音声合成器(１０６)、音声出力端子(１０７)で構成される。
【０００３】
従来の音声合成器(１０６)は、線形予測手法に基づくパラメータ合成方式を用いた音声合成器の例である。音声合成器は、図５に示すように、パルス発生器(２０１)、雑音発生器(２０２)、音源切り替え器(２０３)、増幅器(２０４)、合成フィルタ(２０５)、出力端子(２０６)、逆量子化器(５０１)、パラメータ量子化テーブル(５０２)で構成される。
【０００４】
以下、図１を用い、従来の音声合成装置の動作を説明する。テキスト解析器(１０２)は、単語や文章などのテキスト情報(例：「左」)を入力し、読みの情報を(例：hidari)出力する。
【０００５】
韻律生成器(１０３)は、読み情報(例：hidari)を入力し、韻律情報(声の高さ、大きさ、継続時間長)を生成する。声の高さは母音のピッチ周波数で決定され、例の場合で説明すると、時間順にｉ,ａ,ｉの母音におけるピッチ周波数を決定する。声の大きさや発声速度は、各音素毎(例：ｈ,ｉ,ｄ,ａ,ｒ,ｉ)に振幅情報、継続時間長が決定される。
【０００６】
音声素片選択器(１０４)は、音声素片データベース(１０５)を参照し、読み情報(例：hidari)に基づき音声素片を選択する。音声合成単位としては、子音＋母音(ＣＶ：Consonant,Vowel)の音韻単位(例：ka,gu)や、高音質化を目的に音韻過渡部の特徴量を保持した母音＋子音＋母音(ＶＣＶ)単位(例：aki,ito)などが広く使われている。
【０００７】
例えば、特開昭６０‐２０８７９９号公報には、パラメータ合成方式を用いた合成システムで、音韻を単位としてデータを保持する方法が記述されている。このシステムでは、音声の声道情報(線形予測係数)を、音韻の種類別に１セット(例えば、hiでは子音部ｈと母音部ｉの線形予測係数)のデータを音韻メモリーに蓄えている。
【０００８】
以降の説明では、ＶＣＶ単位を用いる手法を説明する。音声素片データベース(１０５)は、自然音声データからＶＣＶの単位で適切に切り出した音声データを分析し、合成に必要な情報を保持する。日本語の場合ＶＣＶ素片数は１０００個程度あり、各素片の時間長は正規化して保持しても異なっていても良い。
【０００９】
音声信号は、予め定められた長さ(フレーム)毎に分析される。ＶＣＶ別に保持する情報としては、ＶＣＶの音素境界位置情報(seg１,seg２,seg３＝ＶＣＶのフレーム数)、及びフレーム毎の有声／無声の時系列情報(uv［１］,uv［２］,…,uv［seg３］)、フレーム毎の振幅時系列情報(pow［１］,pow［２］,…,pow［seg３］)、そしてフレーム毎の線形予測係数ベクトルの時系列情報(ａ［１］,ａ［２］,…,ａ［seg３］)を保持している。ここで線形予測係数ベクトル(ａ［ｋ］,ｋ＝１,…,seg３)は、１０次程度の線形予測係数や線スペクトル対パラメータなどが用いられる。
【００１０】
上記例(ｈ,ｉ,ｄ,ａ,ｒ,ｉ)で音声素片選択器(１０４)の動作を説明すると、選択されるＶＣＶ素片は、＊hi,ida,ari,i＊＊が選択され音声素片情報として出力する。ここで記号＊は無音を表す。ＶＣＶを単位とする合成方式では、このように母音で隣接素片を接続する。
【００１１】
最後に、音声合成器(１０６)は、韻律情報と音声素片情報を入力し、音声素片のデータを韻律情報に従って声の高さや大きさ、発声速度を制御しながら合成音を作成し出力する。
【００１２】
図５において、音源切り替え器(２０３)は、音声素片情報の有声／無声情報に基づき、有声の場合はパルス音源を、無声の場合は雑音音源を選択するよう音源を切り替える。有声音の場合、パルス発生器(２０１)は、韻律情報の声の高さ(ピッチ周波数)を基に、適切な時間間隔でパルス列を生成する。無声音の場合、雑音発生器(２０２)で白色雑音を生成する。増幅器(２０４)は、韻律情報の声の大きさ(振幅情報)を基に、音源信号の振幅を適切に増幅する。パラメータ合成方式を用いる音声合成装置では一般に、線形予測係数を適切に量子化してから、音声素片データベース(１０５)に保存している。音声合成器においては、この音声素片情報中の線形予測係数パラメータを、パラメータ量子化テーブル(５０２)を用いて、逆量子化器(５０１)で適切にデコードした後、合成フィルタ(２０５)に送る。合成フィルタ(２０５)は、デコードした線形予測係数をフィルタ係数に設定し、増幅された音源信号をフィルタリング処理し音声波形として出力する。
【００１３】
上記説明では、パラメータ合成方式の音声合成器を説明したが、音声合成器として波形編集方式(例えば、特開昭６０−２１０９８号公報)を用いても良い。以下に波形編集方式の音声合成器を説明する。
【００１４】
図６を用いて、波形編集方式を用いた音声合成器の動作を説明する。この場合、音声素片データベース(１０５)には、線形予測係数ベクトルの代わりに、音声波形データを保持する。波形の保持方法としては、有声区間のデータに対してピッチ周波数や時間長の変更制御が行いやすいように、波形に零位相化や最小位相化処理を施す手法が広く使われている。
【００１５】
読み情報から決定された音声素片情報に基づき、合成に必要な音声波形データを音声素片データベース(１０５)から取り出す。増幅器(２０４)は、韻律情報の声の大きさ(振幅情報)を基に、音声波形の振幅を適切に増幅する。波形重畳器(２０５)は、音声素片が有声の場合、韻律情報の声の高さ(ピッチ周波数)の情報を基に、増幅された音声波形を適切な時間間隔で配置し窓掛け加算し出力する。音声素片が無声の場合は、増幅された音声波形を出力端子(２０６)からそのまま出力する。
【００１６】
【発明が解決しようとする課題】
特開昭６０‐２０８７９９号公報で例示したようなシステムでは、音韻に応じて１セットの線形予測係数しか保持しておらず、自然音声信号に含まれるスペクトルの時間的変化が表現できず、音質が不十分である問題があった。この問題を解決するために分析フレーム毎にスペクトル情報を保持する手法が一般的である。
【００１７】
そこで従来の音声合成装置は、高音質の合成を得る目的に、音声素片データベースに多くの情報を保持している。例えば合成単位にＶＣＶ素片を用いると約１０００種類の素片数が必要で、一つの素片の時間長は、例えば平均２００ms程度の情報量を保持している。１フレーム５msとすると４０フレームになる。
【００１８】
ここで、波形編集方式の合成方式の場合、波形データを音声素片データベースに保持するゆえ、データ容量が数ＭByteにもなるという問題があった。パラメータ合成方式を用いた場合には、音声素片データベースに線形予測係数ベクトルの時系列を保持する。例えば、１フレームあたりの線形予測係数ベクトルのデータを４０ビットで量子化し表現した場合、線形予測係数の情報量だけでもデータ容量が２００ＫByte程度必要になる。
【００１９】
音声圧縮の分野では、線形予測係数ベクトルを、移動平均予測とベクトル量子化の手法を用いて情報量を削減する手法がある(日本音響学会講演論文集、平成４年10月２−５−１２)。しかしながらこの場合でも、１フレーム当たりの線形予測係数ベクトルの情報量は２０ビット程度必要で、まだ情報量が多いという問題があった。
【００２０】
【課題を解決するための手段】
本発明は、入力された文字列を読み情報に変換するテキスト解析器と、読み情報が入力されて声の高さや大きさや発声速度の韻律情報を生成する韻律生成器と、読み情報が入力されて、音声素片データベースの中から音声素片情報を選択出力する音声素片選択器と、入力された韻律情報と音声素片情報とに基づいて合成音声を作成する音声合成器を備える、音声合成装置であって、上記音声合成器は、複数の音素をある基準で分類した際の前記基準の別に音声特徴ベクトルをベクトル量子化して得られた複数の基準符号帳を格納した基準別ベクトル符号帳と、上記音声素片選択器によって選択された音声素片情報に基づいて上記基準別ベクトル符号帳から当該音声素片情報に応じた音声特徴ベクトルを選択する音声特徴ベクトル選択器を、備えており、上記音声素片データベースには、上記読み情報から決定されると共に上記基準符号帳を選択するための選択基準情報と、上記基準別ベクトル符号帳のインデックスとが、各音声素片別に格納されており、上記音声素片選択器から出力される音声素片情報には、上記選択された音声素片情報に対応する選択基準情報および基準別ベクトル符号帳のインデックスが含まれており、上記音声特徴ベクトル選択器は、入力された音声素片情報に含まれた選択基準情報に基づいて上記基準別ベクトル符号帳から選択された基準符号帳から、当該音声素片情報に含まれたインデックスに基づいて音声特徴ベクトルを選択するようになっていることを特徴とする。
【００２１】
また、前記基準符号帳には、少なくとも１つの音声特徴ベクトルが格納されていることを特徴とする。
【００２２】
さらに、前記選択基準情報は音素情報であり、前記基準符号帳は音素符号帳であることを特徴とする。
【００２３】
あるいは、前記選択基準情報は音素情報であり、前記基準符号帳は、前記基準としてのカテゴリーの別に音声特徴ベクトルをベクトル量子化して得られた複数のカテゴリー符号帳であり、前記音声合成器は、前記カテゴリーとこのカテゴリーに属する音素とが互いに対応付けられて格納された音素カテゴリー表と、入力された音声素片情報に含まれた選択基準情報と前記音素カテゴリー表とに基づいて、前記音素情報に対応するカテゴリーを選択するカテゴリー選択器を有して、前記音声合成器は、前記選択されたカテゴリーに基づいて前記基準別ベクトル符号帳からカテゴリー符号帳を選択するようになっていることを特徴とする。
【００２４】
また、前記音声特徴ベクトルとして、線形予測係数を用いることを特徴とする。
【００２５】
あるいは、前記音声特徴ベクトルとして、音声波形を用いることを特徴とする。
【００２６】
また、本発明は、入力された文字列をテキスト解析手段によって読み情報に変換し、前記テキスト解析手段によって変換された読み情報が入力されて韻律生成手段によって韻律情報を生成し、前記読み情報が入力されて音声素片選択手段によって音声素片データベースの中から音声素片情報を選択出力し、入力された前記韻律情報と前記音声素片情報とに基づいて音声合成手段によって合成音声を作成する、音声合成方法であって、複数の音素をある基準で分類した際の前記基準の別に音声特徴ベクトルをベクトル量子化して得られた複数の基準符号帳を格納した基準別ベクトル符号帳を用いると共に、前記音声素片データベースには、前記読み情報から決定されると共に前記基準符号帳を選択するための選択基準情報と、前記基準別ベクトル符号帳のインデックスとを、各音声素片別に格納しておき、前記音声合成手段によって、入力された音声素片情報に含まれた選択基準情報に基づいて前記基準別ベクトル符号帳から選択した基準符号帳から、当該音声素片情報に含まれたインデックスに基づいて音声特徴ベクトルを選択することを特徴とする。
【００２７】
また、本発明は、コンピュータを、本発明の音声合成装置におけるテキスト解析器 , 韻律生成器 , 音声素片選択器 , 音声合成器および音声特徴ベクトル選択器
として機能させる音声合成プログラムが記録されたことを特徴とする。
【００２８】
【発明の実施の形態】
図１に、本発明の音声合成装置の全体概略構成を示す。概略構成は従来例と同じである。本発明が従来技術とは異なるのは、音声合成器(１０６)の内部構成と、音声素片データベース(１０５)に音声特徴ベクトルを圧縮して保持する点が異なる。
【００２９】
図２は、音声合成器１０６の部分を抽出し示したものである。図９は、図１と図２を合わせた本発明のテキスト音声合成装置全体の構成例を示す。以下、図１及び図２を用いて、本発明の音声合成器の一実施例を示す。従来技術の説明で示した図５と同じ構成要素には同じ番号を付与している。従来技術と異なる構成要素は、音声素片選択器(１０４)の出力情報から決定される選択基準情報を入力し、複数の中から使用するベクトル符号帳を選択する基準別ベクトル符号帳(２０８)と、音声素片選択器(１０４)の出力情報から決定されるインデックスと基準別ベクトル符号帳(２０８)を入力し、音声特徴ベクトルを選択する音声特徴ベクトル選択器(２０７)の二つである。
【００３０】
選択基準としては、例えば音素情報や、音素カテゴリー情報、或いは音素をその音響的特徴(有声、無声、摩擦性など)で分類した基準が用いられる。また音声特徴ベクトルとしては、例えば線形予測係数や、音声波形、或いはホルマント周波数と帯域幅など、音声の特徴を表現するデータが用いられる。
【００３１】
説明の便宜上、選択基準として音素情報を用い、音声特徴ベクトルとして線形予測係数を用いる例を最初に説明する。図３に、選択基準として音素情報を用いた音声合成器の実施例を示す。従来例と本願との違いは、ベクトル符号帳(３０１)の構造にある。音素別ベクトル符号帳(３０１)は、音素情報を選択基準とし、音素別に音声特徴ベクトルの符号帳を保持している。例えば、ａ,ｉ,ｕのような母音や、ｐ,ｔ,ｓのような子音などの音素別にベクトル符号帳を備えている。ここで音声特徴ベクトルは線形予測係数を用い、具体的には線形予測係数から算出される線スペクトル対やケプストラム係数を用いる。
【００３２】
ここで、音素別に線形予測係数のベクトル符号帳を設けると、同一音素の中ではスペクトル形状が似たものが集められる。音素別に６４種類の符号帳を作成した場合の例を図７に示す。上段左側からａ,ｉ,ｕの音素、下段左側からｓ,ｍ,ｇの音素に対して、６４種類の符号帳のうち最初の５つの音声特徴ベクトルをパワースペクトル表現して示したものである。子音ｓのパワースペクトルは、非常にバリエーションが大きいが、他の音素ではパワースペクトルが似た形状のものになっていることが分かる。音素別に特徴ベクトルをベクトル量子化することで、効率的に特徴ベクトルを表現できることが分かる。
【００３３】
音声合成装置全体構成は図１の構成から成っているが、音声素片データベース(１０５)の内容が従来とは異なる。従来技術で説明したように、パラメータ合成方式を用いる従来技術では線形予測係数ベクトルの時系列情報を保持していたのに対して、本発明の手法では、音声素片に対応した音素別ベクトル符号帳のインデックスの時系列を保持する。この様子を模式的に表現したものを図８に示す。この例では音声素片データベース(１０５)の中から、ＶＣＶ素片として例として「ito」が選択された時の状況を示している。ここでidxは、スペクトル情報をベクトル量子化した符号帳中の符号語を指し示すindexを本実施例では示している。フレーム番号１からseg１−１の間は、音素別ベクトル符号帳(３０１)の音素ｉの符号帳が選択され、idx［１］からidx［seg１−１］のデータによって、ｉのベクトル符号帳の該当するインデックスの音声特徴ベクトルが音声特徴ベクトル選択器(２０７)から出力される。同様にフレーム番号seg１からseg２−１の間は、音素別ベクトル符号帳(３０１)の音素ｔの符号帳が、フレーム番号seg２からseg３の間は、音素別ベクトル符号帳(３０１)の音素ｏの符号帳が選択され、同様な処理が実行される。
【００３４】
一般の音声圧縮とは異なりテキスト音声合成では、発声文字列の情報があるので、音声の特徴ベクトルの情報を圧縮記録する際、このように音素情報を得た上で、音素別にベクトル量子化することで、スペクトル情報に偏りがあることが利用でき、効率良く情報圧縮できる。
【００３５】
ここで従来法とデータ量を比較する。高音質の合成音を得るためには、例えば１０次の線形予測係数を量子化表現するには、スカラー量子化では４０ビット程度、ベクトル量子化でも２０ビット以上必要であったが、本実施例のように音素別に線形予測係数をベクトル量子化すると、先に例示したようにスペクトル形状として非常に似た形状のデータになるので、比較的少ない情報で表現できる。音質の簡易主観評価実験によると、音素別に６ビット(６４種類のスペクトルパターン)程度の情報量でも良好な音質が得られ、大幅な情報圧縮が実現できる。ここで音素別のベクトル量子化の符号帳サイズは音素別に異なっていても良い。音声特徴ベクトルが抽出され、合成フィルタ(２０５)に設定された以降の動作は、従来例と同じである。
【００３６】
次に、音素カテゴリー毎ベクトル符号帳を用いた場合の実施例について説明する。上述した音素情報に基づく音声合成装置では、音素別のベクトル符号帳を保持する必要があった。先に例示したように、例えば音素別に６４パターンの音声特徴ベクトルを保持し、音素の数を３２とし、１パターン当たり４０ビットで表現した場合、音素別ベクトル符号帳のデータ量は約１０ＫByte必要であった。
【００３７】
音響的に似た音素は特徴ベクトルも似ており、音素別に持つより音響的に似た音素カテゴリー別に符号帳を持つことで、ベクトル符号帳の容量を削減できる。
【００３８】
これを実現するための構成を図４に示す。図４の構成では、各音素がどのカテゴリーに属するかを示した音素カテゴリー表(４０２)を備えており、入力音素はこのカテゴリー表を参照して、カテゴリー選択器(４０３)でカテゴリー選択される。音声特徴ベクトルのベクトル符号帳は、音素カテゴリー別ベクトル符号帳(４０１)に音素カテゴリー別に保持されており、選択されたカテゴリー情報に基づいて、どの符号帳を使用するか選択される。例えば入力音素がｄであった場合、音素ｄはカテゴリーは２番に属しているので、音素カテゴリー別ベクトル符号帳は第２種類符号帳を使用する。
【００３９】
音声特徴量の類似している音素カテゴリーを適切に構成することで、音声特徴ベクトルを効率良く保持することができ、音質劣化を小さく抑えて、音素別のベクトル符号帳よりも符号帳を保持するためのメモリー量を削減することができる。以降の動作は、図３の実施例と同じであるので省略する。
【００４０】
これまでの説明では、音声特徴ベクトルは線形予測係数を用いた、パラメータ合成方式の音声合成装置を説明したが、波形重畳方式に基づく音声合成器を用いることもできる。この場合の音声特徴ベクトルは音声波形を用いる。特に零位相化処理や最小位相化処理を施し、振幅を正規化することによって、波形歪み尺度に基づいて波形のベクトル符号帳を構成することができる。なお、上記全ての装置はコンピュータなどの処理装置を用いて実現することができる。
【００４１】
【発明の効果】
本発明で示した選択基準(例えば音素)別のベクトル符号帳を用いることで、音声特徴ベクトルを効率的に圧縮表現することができ、音声素片データベースの容量を大幅に削減することができる。
【図面の簡単な説明】
【図１】テキスト音声合成装置全体の構成例を説明する図である。
【図２】本発明の音声合成器の構成例を説明する図である。
【図３】本発明の音声合成器の構成例を説明する図である。
【図４】本発明の音声合成器の構成例を説明する図である。
【図５】従来技術音声合成器の構成例を説明する図である。
【図６】従来技術の波形重畳方式の合成器の動作を説明する図である。
【図７】音声素片データベース内のデータ例を説明する図である。
【図８】音素別の音声特徴ベクトル符号帳の例を説明する図である。
【図９】本発明のテキスト音声合成装置全体の構成例を説明する図である。
【符号の説明】
１０１入力端子
１０２テキスト解析器
１０３韻律生成器
１０４音声素片選択器
１０５音声素片データベース
１０６音声合成器
１０７出力端子
２０１パルス発生器
２０２雑音発生器
２０３音源切り替え器
２０４増幅器
２０５合成フィルタ
２０６出力端子
２０７音声特徴ベクトル選択器
２０８基準別ベクトル符号帳
３０１音素別ベクトル符号帳
４０１音素カテゴリー別ベクトル符号帳
４０２音素カテゴリー表
４０３カテゴリー選択器
５０１逆量子化器
５０２パラメータ量子化テーブル
５０６音声合成器
６０１増幅器
６０２波形重畳器
６０３出力端子
[0001]
[Technical field of the invention]
The present invention relates to a speech synthesizer that generates speech from text.
[0002]
[Prior art]
FIG. 1 shows a configuration example of a general speech synthesis apparatus. The apparatus shown in FIG. 1 includes an input terminal (101), a text analyzer (102), a prosody generator (103), a speech unit selector (104), a speech unit database (105), a speech synthesizer (106), and a speech output terminal (107).
[0003]
The conventional speech synthesizer (106) is an example of a speech synthesizer using a parameter synthesis method based on linear prediction. As shown in FIG. 5, the speech synthesizer includes a pulse generator (201), a noise generator (202), a sound source switcher (203), an amplifier (204), a synthesis filter (205), an output terminal (206), an inverse quantizer (501), and a parameter quantization table (502).
[0004]
The operation of the conventional speech synthesis apparatus is described below with reference to FIG. 1. The text analyzer (102) receives text information such as words or sentences (e.g., "左", "left") and outputs reading information (e.g., hidari).
[0005]
The prosody generator (103) receives the reading information (e.g., hidari) and generates prosody information (voice pitch, loudness, and duration). Voice pitch is determined by the pitch frequency of the vowels; in the example, pitch frequencies are determined for the vowels i, a, i in time order. Loudness and speaking rate are determined as amplitude information and duration for each phoneme (e.g., h, i, d, a, r, i).
[0006]
The speech unit selector (104) refers to the speech unit database (105) and selects speech units based on the reading information (e.g., hidari). Widely used speech synthesis units include the consonant + vowel (CV) unit (e.g., ka, gu) and the vowel + consonant + vowel (VCV) unit (e.g., aki, ito), which retains the features of phoneme transitions in order to improve sound quality.
[0007]
For example, Japanese Patent Application Laid-Open No. 60-208799 describes a method of holding data in units of phonemes in a synthesis system using the parameter synthesis method. In this system, the vocal tract information of speech (linear prediction coefficients) is stored in a phoneme memory as one set of data per phoneme type (for example, for hi, the linear prediction coefficients of the consonant part h and the vowel part i).
[0008]
In the following description, a method using a VCV unit will be described. The speech unit database (105) analyzes speech data appropriately cut out from natural speech data in units of VCV, and holds information necessary for synthesis. In the case of Japanese, the number of VCV segments is about 1000, and the time length of each segment may be normalized and held or may be different.
[0009]
The speech signal is analyzed in frames of a predetermined length. The information held for each VCV unit comprises phoneme boundary position information (seg1, seg2, seg3, where seg3 equals the number of frames in the VCV unit), per-frame voiced/unvoiced time-series information (uv[1], uv[2], ..., uv[seg3]), per-frame amplitude time-series information (pow[1], pow[2], ..., pow[seg3]), and per-frame time-series information of the linear prediction coefficient vector (a[1], a[2], ..., a[seg3]). For the linear prediction coefficient vector (a[k], k = 1, ..., seg3), linear prediction coefficients of about the tenth order, line spectrum pair parameters, or the like are used.
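As a rough illustration, the per-unit record described in this paragraph could be laid out as follows. This is only a sketch: the class name `VcvUnit`, the method `is_consistent`, and the use of 0-based Python lists (the text numbers frames from 1) are assumptions made for illustration, not structures named in the source. The reading of seg1 and seg2 as the first frames of the consonant and of the second vowel follows the later description of the "ito" example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VcvUnit:
    """One VCV entry of the conventional database (hypothetical layout)."""
    name: str                # e.g. "ito"
    seg1: int                # frame number where the consonant starts
    seg2: int                # frame number where the second vowel starts
    seg3: int                # total number of frames in the unit
    uv: List[bool]           # voiced/unvoiced flag per frame
    pow: List[float]         # amplitude per frame
    a: List[List[float]]     # linear prediction coefficient vector per frame

    def is_consistent(self) -> bool:
        # Every time series must have exactly seg3 frames, and the
        # boundaries must be ordered.
        lengths_ok = len(self.uv) == len(self.pow) == len(self.a) == self.seg3
        return lengths_ok and 1 <= self.seg1 <= self.seg2 <= self.seg3
```

The consistency check simply states the constraint implied by the text, namely that all three time series run from frame 1 to frame seg3.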
[0010]
In the above example (h, i, d, a, r, i), the speech unit selector (104) selects the VCV units *hi, ida, ari, i** and outputs them as speech unit information. Here, the symbol * represents silence. In a synthesis method using VCV units, adjacent units are thus joined at vowels.
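The segmentation in this example can be sketched in a few lines of Python. The function name `to_vcv_units` and the vowel set are assumptions for illustration; the sketch also assumes one character per phoneme and drops any consonant that follows the last vowel, which the source does not discuss.

```python
VOWELS = set("aiueo")


def to_vcv_units(phonemes):
    """Split a phoneme sequence into VCV units that overlap at vowels.

    Silence is written "*". The leading silence stands in for the first
    vowel, and the last vowel is followed by two silence symbols, as in
    the example in the text.
    """
    padded = ["*"] + list(phonemes)
    # Unit boundaries: the leading silence and every vowel position.
    anchors = [0] + [k for k, p in enumerate(padded) if p in VOWELS]
    units = ["".join(padded[s:e + 1]) for s, e in zip(anchors, anchors[1:])]
    units.append(padded[anchors[-1]] + "**")
    return units


print(to_vcv_units("hidari"))  # ['*hi', 'ida', 'ari', 'i**']
```

Each vowel appears at the end of one unit and at the start of the next, which is what allows neighbouring units to be joined at vowels.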
[0011]
Finally, the speech synthesizer (106) receives the prosody information and the speech unit information, and creates and outputs synthesized speech from the speech unit data while controlling voice pitch, loudness, and speaking rate according to the prosody information.
[0012]
In FIG. 5, the sound source switcher (203) switches sound sources based on the voiced/unvoiced information in the speech unit information, selecting the pulse source when voiced and the noise source when unvoiced. For voiced sound, the pulse generator (201) generates a pulse train at appropriate time intervals based on the voice pitch (pitch frequency) in the prosody information. For unvoiced sound, the noise generator (202) generates white noise. The amplifier (204) amplifies the amplitude of the source signal appropriately based on the loudness (amplitude information) in the prosody information. A speech synthesis apparatus using the parameter synthesis method generally quantizes the linear prediction coefficients before storing them in the speech unit database (105). In the speech synthesizer, the inverse quantizer (501) decodes the linear prediction coefficient parameters in the speech unit information using the parameter quantization table (502) and sends them to the synthesis filter (205). The synthesis filter (205) sets the decoded linear prediction coefficients as its filter coefficients, filters the amplified source signal, and outputs the result as a speech waveform.
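The signal path of FIG. 5 (source selection, amplification, synthesis filter) might be sketched as below. This is a simplified illustration, not the circuit of the source: the function names, the sign convention of the all-pole filter, the Gaussian noise source, and the restart of the pulse train at each frame are all assumptions, and inverse quantization is omitted.

```python
import random


def excitation(voiced, n_samples, pitch_period, rng):
    """Pulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def synthesize_frame(voiced, gain, lpc, n_samples, pitch_period=80, state=None, seed=0):
    """Amplify the excitation and pass it through an all-pole filter.

    Assumed convention: y[n] = gain * e[n] - sum_k lpc[k-1] * y[n-k].
    `state` holds the last len(lpc) outputs of the previous frame,
    newest sample first.
    """
    rng = random.Random(seed)
    history = list(state) if state is not None else [0.0] * len(lpc)
    out = []
    for e in excitation(voiced, n_samples, pitch_period, rng):
        y = gain * e - sum(a * h for a, h in zip(lpc, history))
        out.append(y)
        if lpc:
            history = [y] + history[:-1]   # shift in the newest output
    return out, history


samples, _ = synthesize_frame(True, 1.0, [-0.5], 4, pitch_period=100)
print(samples)  # [1.0, 0.5, 0.25, 0.125]
```

Passing the returned `history` as `state` to the next call keeps the filter continuous across frame boundaries.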
[0013]
The above description concerns a speech synthesizer of the parameter synthesis method, but a waveform editing method (for example, Japanese Patent Application Laid-Open No. 60-21098) may also be used for the speech synthesizer. A speech synthesizer of the waveform editing method is described below.
[0014]
The operation of the speech synthesizer using the waveform editing method will be described with reference to FIG. In this case, the speech segment database (105) holds speech waveform data instead of the linear prediction coefficient vector. As a method of retaining a waveform, a method of performing zero-phase or minimum-phase processing on a waveform is widely used so that change control of a pitch frequency and a time length can be easily performed on voiced section data.
[0015]
Based on the speech unit information determined from the reading information, the speech waveform data needed for synthesis is retrieved from the speech unit database (105). The amplifier (204) amplifies the amplitude of the speech waveform appropriately based on the loudness (amplitude information) in the prosody information. When the speech unit is voiced, the waveform superposition unit (205) places the amplified speech waveforms at appropriate time intervals based on the voice pitch (pitch frequency) in the prosody information, applies a window, adds them, and outputs the result. When the speech unit is unvoiced, the amplified speech waveform is output from the output terminal (206) as it is.
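The windowed addition for voiced units can be illustrated with a small overlap-add sketch. The Hann window, the constant pitch period, and the function names are assumptions chosen for brevity; the source does not specify the window or how the intervals vary over time.

```python
import math


def hann(length):
    """Symmetric Hann window (returns [1.0] for length 1)."""
    if length == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def overlap_add(waveforms, pitch_period, gain=1.0):
    """Place windowed, amplified waveforms one pitch period apart and sum them."""
    if not waveforms:
        return []
    total = pitch_period * (len(waveforms) - 1) + max(len(w) for w in waveforms)
    out = [0.0] * total
    for k, w in enumerate(waveforms):
        window = hann(len(w))
        start = k * pitch_period          # position of the k-th pitch mark
        for n, sample in enumerate(w):
            out[start + n] += gain * window[n] * sample
    return out
```

Shortening `pitch_period` raises the synthesized pitch and lengthening it lowers the pitch, which is how the prosody information controls voice pitch in this kind of scheme.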
[0016]
[Problems to be solved by the invention]
In a system such as that of Japanese Patent Application Laid-Open No. 60-208799, only one set of linear prediction coefficients is held per phoneme, so the temporal change of the spectrum contained in a natural speech signal cannot be expressed and the sound quality is insufficient. To solve this problem, spectrum information is generally held for each analysis frame.
[0017]
Conventional speech synthesis apparatuses therefore hold a large amount of information in the speech unit database in order to obtain high-quality synthesis. For example, when VCV units are used as the synthesis unit, about 1000 kinds of units are required, and each unit holds information for an average duration of about 200 ms. With a 5 ms frame, that is 40 frames per unit.
[0018]
Here, in the case of the synthesis method of the waveform editing method, since the waveform data is stored in the speech unit database, there is a problem that the data capacity is several MByte. When the parameter synthesis method is used, the time series of the linear prediction coefficient vector is stored in the speech unit database. For example, when the data of the linear prediction coefficient vector per frame is quantized and expressed by 40 bits, a data capacity of about 200 KByte is required only for the information amount of the linear prediction coefficient.
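The 200 KByte figure follows directly from the numbers given in the text (about 1000 units, about 200 ms per unit, 5 ms frames, 40 bits per frame). A short check, with a hypothetical function name:

```python
def lpc_storage_bytes(n_units=1000, unit_ms=200, frame_ms=5, bits_per_frame=40):
    """Bytes needed to store one quantized LPC vector per frame for every unit."""
    frames_per_unit = unit_ms // frame_ms            # 40 frames
    total_bits = n_units * frames_per_unit * bits_per_frame
    return total_bits // 8


print(lpc_storage_bytes())  # 200000
```

That is 1000 x 40 x 40 = 1,600,000 bits, or 200,000 bytes, before counting the voiced/unvoiced and amplitude information.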
[0019]
In the field of speech compression, there is a method that reduces the amount of information in the linear prediction coefficient vector by using moving-average prediction and vector quantization (Proceedings of the Acoustical Society of Japan, October 1992, 2-5-12). Even in this case, however, about 20 bits per frame are needed for the linear prediction coefficient vector, so the amount of information is still large.
[0020]
[Means for Solving the Problems]
The present invention is a speech synthesis apparatus comprising: a text analyzer that converts an input character string into reading information; a prosody generator that receives the reading information and generates prosody information such as voice pitch, loudness, and speaking rate; a speech unit selector that receives the reading information and selects and outputs speech unit information from a speech unit database; and a speech synthesizer that creates synthesized speech based on the input prosody information and speech unit information. The speech synthesizer includes a reference-specific vector codebook storing a plurality of reference codebooks, each obtained by vector-quantizing speech feature vectors separately for each class when a plurality of phonemes are classified according to a certain criterion, and a speech feature vector selector that, based on the speech unit information selected by the speech unit selector, selects the speech feature vector corresponding to that speech unit information from the reference-specific vector codebook. The speech unit database stores, for each speech unit, selection criterion information, which is determined from the reading information and is used to select a reference codebook, together with indexes into the reference-specific vector codebook. The speech unit information output from the speech unit selector includes the selection criterion information and the indexes corresponding to the selected speech unit. The speech feature vector selector selects a reference codebook from the reference-specific vector codebook based on the selection criterion information included in the input speech unit information, and selects a speech feature vector from that reference codebook based on the index included in the speech unit information.
[0021]
Further, at least one speech feature vector is stored in each reference codebook.
[0022]
Further, the selection criterion information is phoneme information, and the reference codebook is a phoneme codebook.
[0023]
Alternatively, the selection criterion information is phoneme information, and the reference codebooks are a plurality of category codebooks obtained by vector-quantizing speech feature vectors separately for each category serving as the criterion. The speech synthesizer includes a phoneme category table in which each category and the phonemes belonging to it are stored in association with each other, and a category selector that selects the category corresponding to the phoneme information based on the selection criterion information included in the input speech unit information and the phoneme category table. The speech synthesizer selects a category codebook from the reference-specific vector codebook based on the selected category.
[0024]
Further, linear prediction coefficients are used as the speech feature vector.
[0025]
Alternatively, a speech waveform is used as the speech feature vector.
[0026]
The present invention is also a speech synthesis method in which an input character string is converted into reading information by text analysis means; prosody information is generated by prosody generation means from the reading information converted by the text analysis means; speech unit information is selected and output from a speech unit database by speech unit selection means that receives the reading information; and synthesized speech is created by speech synthesis means based on the input prosody information and speech unit information. The method uses a reference-specific vector codebook storing a plurality of reference codebooks, each obtained by vector-quantizing speech feature vectors separately for each class when a plurality of phonemes are classified according to a certain criterion. The speech unit database stores, for each speech unit, selection criterion information, which is determined from the reading information and is used to select a reference codebook, together with indexes into the reference-specific vector codebook. The speech synthesis means selects a reference codebook from the reference-specific vector codebook based on the selection criterion information included in the input speech unit information, and selects a speech feature vector from that reference codebook based on the index included in the speech unit information.
[0027]
The present invention is also a recording medium on which is recorded a speech synthesis program that causes a computer to function as the text analyzer, prosody generator, speech unit selector, speech synthesizer, and speech feature vector selector of the speech synthesis apparatus of the present invention.
[0028]
[Embodiments of the invention]
FIG. 1 shows the overall schematic configuration of the speech synthesis apparatus of the present invention. The schematic configuration is the same as in the conventional example. The present invention differs from the prior art in the internal configuration of the speech synthesizer (106) and in that speech feature vectors are stored in compressed form in the speech unit database (105).
[0029]
FIG. 2 shows the speech synthesizer (106) in isolation. FIG. 9 shows a configuration example of the entire text-to-speech synthesis apparatus of the present invention, combining FIG. 1 and FIG. 2. An embodiment of the speech synthesizer of the present invention is described below with reference to FIGS. 1 and 2. Components that are the same as in FIG. 5, described in the prior art section, are given the same numbers. Two components differ from the prior art: a reference-specific vector codebook (208), which receives selection criterion information determined from the output of the speech unit selector (104) and selects the vector codebook to be used from among a plurality of codebooks, and a speech feature vector selector (207), which receives an index determined from the output of the speech unit selector (104) together with the reference-specific vector codebook (208) and selects a speech feature vector.
[0030]
As the selection criterion, for example, phoneme information, phoneme category information, or a classification of phonemes by their acoustic characteristics (voiced, unvoiced, fricative, etc.) is used. As the speech feature vector, data expressing the features of speech is used, for example linear prediction coefficients, a speech waveform, or formant frequencies and bandwidths.
[0031]
For convenience of explanation, an example in which phoneme information is used as a selection criterion and a linear prediction coefficient is used as a speech feature vector will be described first. FIG. 3 shows an embodiment of a speech synthesizer using phoneme information as a selection criterion. The difference between the conventional example and the present application lies in the structure of the vector codebook (301). The phoneme-specific vector codebook (301) holds a codebook of speech feature vectors for each phoneme using phoneme information as a selection criterion. For example, a vector codebook is provided for each phoneme such as vowels such as a, i, u and consonants such as p, t, s. Here, the speech feature vector uses a linear prediction coefficient, and more specifically, a line spectrum pair or a cepstrum coefficient calculated from the linear prediction coefficient.
[0032]
When a vector codebook of linear prediction coefficients is provided for each phoneme, vectors with similar spectral shapes are gathered within the same phoneme. FIG. 7 shows an example in which a codebook of 64 entries is created for each phoneme. For the phonemes a, i, u (upper row, from the left) and s, m, g (lower row, from the left), the first five of the 64 speech feature vectors in each codebook are shown as power spectra. The power spectrum of the consonant s varies widely, but for the other phonemes the power spectra have similar shapes. This shows that vector-quantizing the feature vectors separately for each phoneme represents them efficiently.
[0033]
The overall configuration of the speech synthesis apparatus is that of FIG. 1, but the contents of the speech unit database (105) differ from the conventional one. As described in the prior art section, the conventional parameter synthesis method holds time-series information of linear prediction coefficient vectors, whereas the technique of the present invention holds a time series of indexes into the phoneme-specific vector codebook corresponding to each speech unit. FIG. 8 shows this schematically. The example shows the situation when "ito" is selected as the VCV unit from the speech unit database (105). In this embodiment, idx denotes an index pointing to a codeword in a codebook obtained by vector-quantizing spectrum information. For frame numbers 1 to seg1-1, the codebook for phoneme i in the phoneme-specific vector codebook (301) is selected, and the speech feature vector selector (207) outputs, for each of idx[1] to idx[seg1-1], the speech feature vector at the corresponding index of the i codebook. Likewise, for frame numbers seg1 to seg2-1 the codebook for phoneme t is selected, and for frame numbers seg2 to seg3 the codebook for phoneme o is selected, and the same processing is performed.
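The frame-by-frame lookup for a unit such as "ito" could look like the following sketch. The function name `decode_unit`, the dictionary-of-lists representation of the phoneme-specific codebooks, and the toy two-dimensional vectors are assumptions for illustration only.

```python
def decode_unit(seg1, seg2, seg3, idx, phonemes, codebooks):
    """Return one feature vector per frame for a VCV unit.

    Frames are numbered 1..seg3 as in the text, so idx[0] holds idx[1].
    `phonemes` is the (V, C, V) triple, e.g. ("i", "t", "o").
    """
    first_v, cons, second_v = phonemes
    vectors = []
    for frame in range(1, seg3 + 1):
        if frame < seg1:
            phoneme = first_v       # frames 1 .. seg1-1
        elif frame < seg2:
            phoneme = cons          # frames seg1 .. seg2-1
        else:
            phoneme = second_v      # frames seg2 .. seg3
        vectors.append(codebooks[phoneme][idx[frame - 1]])
    return vectors


# Toy codebooks with two codewords per phoneme (illustrative values only).
codebooks = {
    "i": [[0.1, 0.2], [0.3, 0.4]],
    "t": [[1.0, 1.1], [1.2, 1.3]],
    "o": [[2.0, 2.1], [2.2, 2.3]],
}
frames = decode_unit(3, 5, 6, [0, 1, 1, 0, 0, 1], ("i", "t", "o"), codebooks)
print(frames[0], frames[2], frames[5])  # [0.1, 0.2] [1.2, 1.3] [2.2, 2.3]
```

Only the boundaries and the short indexes are stored per unit; the feature vectors themselves live once in the shared codebooks.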
[0034]
Unlike general speech compression, text speech synthesis has uttered character string information, so when compressing and recording speech feature vector information, phoneme information is obtained and vector quantization is performed for each phoneme. This makes it possible to utilize the fact that the spectrum information has a bias, and to efficiently compress the information.
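On the database-building side, quantizing each analysis frame against the codebook of its known phoneme might be sketched as follows. The squared Euclidean distance and the function names are assumptions; the source does not state the distortion measure or how the codebooks are trained.

```python
def nearest_index(vector, codebook):
    """Index of the codeword closest to `vector` (squared Euclidean distance)."""
    def dist(codeword):
        return sum((x - y) ** 2 for x, y in zip(vector, codeword))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def encode_frames(frames, labels, codebooks):
    """Quantize each frame with the codebook of its phoneme label."""
    return [nearest_index(v, codebooks[p]) for v, p in zip(frames, labels)]


codebooks = {"i": [[0, 0], [1, 1]], "t": [[5, 5], [9, 9]]}
print(encode_frames([[0.9, 1.2], [6, 6]], ["i", "t"], codebooks))  # [1, 0]
```

Because the phoneme label of every frame is known in advance from the text, each search is confined to a small codebook of similar spectra, which is where the saving in index bits comes from.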
[0035]
Here, the data amount is compared with the conventional method. In order to obtain a high-quality synthesized sound, for example, in order to quantize and express a 10th-order linear prediction coefficient, scalar quantization requires about 40 bits and vector quantization requires 20 bits or more. If the linear prediction coefficient is vector-quantized for each phoneme as shown above, the data has a very similar shape as the spectrum shape as exemplified above, so that it can be expressed with relatively little information. According to a simple subjective evaluation test of sound quality, good sound quality can be obtained even with an information amount of about 6 bits (64 types of spectrum patterns) for each phoneme, and significant information compression can be realized. Here, the codebook size of the vector quantization for each phoneme may be different for each phoneme. The operation after the speech feature vector is extracted and set in the synthesis filter (205) is the same as the conventional example.
[0036]
Next, an embodiment using a vector codebook for each phoneme category is described. The speech synthesis apparatus based on phoneme information described above has to hold a vector codebook for each phoneme. For example, as illustrated earlier, if 64 speech feature vector patterns are held for each phoneme, the number of phonemes is 32, and each pattern is represented by 40 bits, the phoneme-specific vector codebook requires about 10 KByte of data.
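The codebook size quoted here, and the index storage implied by the figures given earlier in the description (about 1000 units, 40 frames per unit, 6-bit indexes), can be checked with simple arithmetic. The function names are hypothetical, and combining the two numbers into a total is an illustration under those example figures rather than a result reported in the source.

```python
def codebook_bytes(n_phonemes=32, patterns=64, bits_per_pattern=40):
    """Storage for the phoneme-specific vector codebooks."""
    return n_phonemes * patterns * bits_per_pattern // 8


def index_bytes(n_units=1000, frames_per_unit=40, bits_per_index=6):
    """Storage for the per-frame codebook indexes in the unit database."""
    return n_units * frames_per_unit * bits_per_index // 8


print(codebook_bytes())                  # 10240
print(index_bytes())                     # 30000
print(codebook_bytes() + index_bytes())  # 40240
```

Under these assumptions the indexes plus codebooks come to roughly 40 KByte, compared with about 200 KByte for 40-bit scalar-quantized coefficients on every frame.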
[0037]
Acoustically similar phonemes also have similar feature vectors, and having a codebook for each acoustically similar phoneme category rather than for each phoneme can reduce the capacity of the vector codebook.
[0038]
FIG. 4 shows a configuration for realizing this. The configuration of FIG. 4 includes a phoneme category table (402) indicating which category each phoneme belongs to; the category selector (403) refers to this table to select the category of the input phoneme. The vector codebooks of speech feature vectors are held per phoneme category in the phoneme category-specific vector codebook (401), and the codebook to use is selected based on the selected category information. For example, if the input phoneme is d, which belongs to category 2, the second codebook of the phoneme category-specific vector codebook is used.
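The two-step lookup of FIG. 4 (phoneme to category, category to codebook) is sketched below. Only the assignment of d to category 2 comes from the text; every other table entry, the codebook contents, and the function name `select_codebook` are made-up placeholders.

```python
# Hypothetical phoneme category table: only "d" -> 2 is taken from the text.
PHONEME_CATEGORY = {
    "a": 0, "i": 0, "u": 0,     # placeholder grouping
    "s": 1, "h": 1,             # placeholder grouping
    "b": 2, "d": 2, "g": 2,     # "d" belongs to category 2
}

# One codebook per category (toy one-dimensional codewords).
CATEGORY_CODEBOOKS = {
    0: [[0.0], [0.1]],
    1: [[1.0], [1.1]],
    2: [[2.0], [2.1]],
}


def select_codebook(phoneme, category_table, category_codebooks):
    """Look up the phoneme's category, then return that category's codebook."""
    category = category_table[phoneme]
    return category, category_codebooks[category]


category, codebook = select_codebook("d", PHONEME_CATEGORY, CATEGORY_CODEBOOKS)
print(category, codebook[1])  # 2 [2.1]
```

Phonemes that share a category share one codebook, so the number of stored codebooks drops from the number of phonemes to the number of categories.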
[0039]
By appropriately forming phoneme categories with similar speech features, speech feature vectors can be held efficiently, and the memory needed to hold the codebooks can be reduced below that of phoneme-specific vector codebooks while keeping sound quality degradation small. Subsequent operations are the same as in the embodiment of FIG. 3 and are omitted.
[0040]
In the above description, the speech synthesizing apparatus of the parameter synthesizing method using the linear prediction coefficient as the audio feature vector has been described. However, a speech synthesizer based on the waveform superposition method may be used. In this case, a speech waveform is used as the speech feature vector. In particular, by performing zero-phase processing or minimum-phase processing and normalizing the amplitude, a vector codebook of a waveform can be configured based on a waveform distortion measure. Note that all of the above devices can be realized using a processing device such as a computer.
[0041]
[Effects of the invention]
By using the vector codebook for each selection criterion (for example, phoneme) shown in the present invention, the speech feature vector can be efficiently compressed and expressed, and the capacity of the speech unit database can be greatly reduced.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration example of an entire text-to-speech synthesis apparatus.
FIG. 2 is a diagram illustrating a configuration example of a speech synthesizer of the present invention.
FIG. 3 is a diagram illustrating a configuration example of a speech synthesizer of the present invention.
FIG. 4 is a diagram illustrating a configuration example of a speech synthesizer of the present invention.
FIG. 5 is a diagram illustrating a configuration example of a conventional speech synthesizer.
FIG. 6 is a diagram for explaining the operation of a conventional waveform superposition type synthesizer.
FIG. 7 is a diagram illustrating an example of data in a speech unit database.
FIG. 8 is a diagram illustrating an example of a speech feature vector codebook for each phoneme.
FIG. 9 is a diagram illustrating a configuration example of the entire text-to-speech synthesis apparatus of the present invention.
[Explanation of symbols]
101 Input terminal
102 Text analyzer
103 Prosody generator
104 Speech unit selector
105 Speech unit database
106 Speech synthesizer
107 Output terminal
201 Pulse generator
202 Noise generator
203 Sound source switcher
204 Amplifier
205 Synthesis filter
206 Output terminal
207 Speech feature vector selector
208 Reference-specific vector codebook
301 Phoneme-specific vector codebook
401 Phoneme-category-specific vector codebook
402 Phoneme category table
403 Category selector
501 Inverse quantizer
502 Parameter quantization table
506 Speech synthesizer
601 Amplifier
602 Waveform superimposer
603 Output terminal

Claims

A speech synthesis apparatus comprising:
a text analyzer that converts an input character string into reading information;
a prosody generator that receives the reading information converted by the text analyzer and generates prosody information;
a speech unit selector that receives the reading information and selects and outputs speech unit information from a speech unit database; and
a speech synthesizer that creates synthesized speech based on the input prosody information and the speech unit information,
wherein the speech synthesizer includes: a reference-specific vector codebook storing a plurality of reference codebooks obtained by vector-quantizing speech feature vectors separately for each reference when a plurality of phonemes are classified based on a certain reference; and a speech feature vector selector that selects, from the reference-specific vector codebook, a speech feature vector corresponding to the speech unit information selected by the speech unit selector,
the speech unit database stores, for each speech unit, selection criterion information that is determined from the reading information and is used to select the reference codebook, and an index into the reference-specific vector codebook,
the speech unit information output from the speech unit selector includes the selection criterion information and the index into the reference-specific vector codebook that correspond to the selected speech unit, and
the speech feature vector selector selects a speech feature vector, based on the index included in the speech unit information, from the reference codebook selected from the reference-specific vector codebook based on the selection criterion information included in the input speech unit information.

The speech synthesis apparatus according to claim 1, wherein at least one speech feature vector is stored in the reference codebook.

The speech synthesis apparatus according to claim 1 or 2, wherein
the selection criterion information is phoneme information, and
the reference codebook is a phoneme codebook.

The speech synthesis apparatus according to claim 1 or 2, wherein
the selection criterion information is phoneme information,
the reference codebooks are a plurality of category codebooks obtained by vector-quantizing speech feature vectors separately for each category, the category serving as the reference,
the speech synthesizer has:
a phoneme category table in which each category and the phonemes belonging to that category are stored in association with each other; and
a category selector that selects the category corresponding to the phoneme information, based on the selection criterion information included in the input speech unit information and on the phoneme category table, and
the speech synthesizer selects a category codebook from the reference-specific vector codebook based on the selected category.

The speech synthesis apparatus according to any one of claims 1 to 4, wherein linear prediction coefficients are used as the speech feature vector.

The speech synthesis apparatus according to any one of claims 1 to 4, wherein a speech waveform is used as the speech feature vector.

A speech synthesis method in which an input character string is converted into reading information by text analysis means, prosody information is generated by prosody generation means that receives the reading information converted by the text analysis means, speech unit information is selected and output from a speech unit database by speech unit selection means that receives the reading information, and synthesized speech is created by speech synthesis means based on the input prosody information and the speech unit information, the method comprising:
using a reference-specific vector codebook that stores a plurality of reference codebooks obtained by vector-quantizing speech feature vectors separately for each reference when a plurality of phonemes are classified based on a certain reference;
storing in the speech unit database, for each speech unit, selection criterion information that is determined from the reading information and is used to select the reference codebook, and an index into the reference-specific vector codebook; and
selecting, by the speech synthesis means, a speech feature vector based on the index included in the speech unit information, from the reference codebook selected from the reference-specific vector codebook based on the selection criterion information included in the input speech unit information.

A computer-readable recording medium having recorded thereon a speech synthesis program that causes a computer to function as the text analyzer, the prosody generator, the speech unit selector, the speech synthesizer, and the speech feature vector selector according to claim 1.