JP2005024794A

JP2005024794A - Method, device, and program for speech synthesis

Info

Publication number: JP2005024794A
Application number: JP2003188873A
Authority: JP
Inventors: Katsumi Tsuchiya; 勝美土谷; Takehiko Kagoshima; 岳彦籠嶋
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-06-30
Filing date: 2003-06-30
Publication date: 2005-01-27

Abstract

PROBLEM TO BE SOLVED: To prevent synthesized speech from degrading as the pitch is varied, in a speech synthesis method and device used for text synthesis and in a computer-readable text synthesis program.

SOLUTION: The speech synthesizer is equipped with a target waveform storage part 20 which stores a target waveform, a source speech waveform storage part 21 which stores a source speech waveform, a target waveform selection part 10 which selects the target waveform from the target waveform storage part according to at least rhythm (prosodic) information, and a pitch waveform extraction part 11 which extracts a pitch waveform from the source speech waveform according to an error evaluation function defined between the selected target waveform and the pitch waveform to be extracted from the source speech waveform stored in the source speech waveform storage part. The extracted pitch waveform is superposed at a desired pitch period interval to generate synthesized speech. Consequently, high-quality synthesized speech can be obtained by extracting the pitch waveform according to the error evaluation function defined between the target waveform and the pitch waveform.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、例えばテキスト合成に用いられる音声合成方法と装置およびコンピュータ読み取り可能なテキスト合成プログラムに関する。
【０００２】
【従来の技術】
音声合成方式の一つである規則合成方式は、入力された韻律情報から自動的に合成音声に変換する技術であり、任意の文章（テキスト）から人工的に音声を作り出すテキスト合成などに使用されている。
【０００３】
図４に示すように、規則合成方式では、音韻記号列、音韻継続時間長、ピッチパターンおよびパワーなどの韻律情報１００に従って音声素片記憶部あるいは原音声波形記憶部２１に記憶されている音節あるいは音素などの小さな単位（音声素片）を選択し、選択された音声素片に対してピッチおよび時間長を変更して接続することで合成音声信号１０５を生成する。
【０００４】
ピッチの変更方法としてはＰＳＯＬＡと呼ばれる方法が良く知られており、ＰＳＯＬＡでは音声素片に対してピッチ周期毎に窓掛けを行ってピッチ波形を切り出した後、合成すべきピッチ周期間隔で並べ直し重ね合わせてピッチの変更を行い、合成音声信号を生成する。（図５および図６）
【０００５】
【特許文献１】
特許第３２９６０４号公報
【０００６】
【発明が解決しようとする課題】
ここで、そのため、ＰＳＯＬＡではピッチ波形の抽出方法が合成音声の音質を向上するポイントとなる。特許第３２９６０４号では、合成音声のピッチが原音声のピッチよりも低くなる場合は、原音声のピッチ周期の２倍に等しい窓長をもつハニング窓でピッチ波形を切り出し、合成音声のピッチが原音声のピッチよりも高くなる場合は、合成音声のピッチ周期の２倍に等しい窓長をもつハニング窓でピッチ波形を切り出すことでピッチの変更に伴う合成音声の音質の劣化を抑えている。
【０００７】
しかし、ピッチの変更が小さい場合は合成音声の音質の劣化は少ないが、ピッチを大きく変更した場合はピッチとスペクトルの音響的なミスマッチが原因で合成音声の音質は著しく劣化してしまうという問題がある。
【０００８】
【課題を解決するための手段】
上記問題を解決するため、本発明においては、少なくとも音韻情報を含む韻律情報に基づいて原音声波形から抽出されるピッチ波形を所望のピッチ周期間隔で重畳して音声を生成する音声合成方法において、前記韻律情報に基づいて目標波形記億部から目標波形を選択し、選択された目標波形と抽出すべきピッチ波形との間で定義される誤差評価関数に基づいて、原音声波形記憶部の前記原音声波形から前記ピッチ波形を抽出し、該ピッチ波形を所望のピッチ周期間隔で重畳して合成音声を生成することを特徴とする。
【０００９】
また本発明においては、少なくとも音韻情報を含む韻律情報に基づいて原音声波形から抽出されるピッチ波形を所望のピッチ周期間隔で重畳して音声を生成する音声合成方法において、前記韻律情報に基づいて目標波形記億部から目標波形を選択し、選択された目標波形と抽出すべきピッチ波形との間で定義される第１の誤差評価関数と、前記抽出すべきピッチ波形を重畳して得られる合成音声波形と前記原音声波形との間で定義される第２の誤差評価関数とに基づいて、原音声波形記憶部の原音声波形から前記ピッチ波形を抽出し、該ピッチ波形を所望のピッチ周期間隔で重畳して合成音声を生成することを特徴とする。
【００１０】
また本発明においては、少なくとも音韻情報を含む韻律情報に基づいて原音声波形から抽出されるピッチ波形を所望のピッチ周期間隔で重畳して音声を生成する音声合成方法において、前記韻律情報に基づいて目標波形記億部から目標波形として複数ピッチの時間幅を有する音声波形データを選択し、当該音声波形データと抽出すべきピッチ波形に基づく音声波形との間で定義される第１の誤差評価関数と、前記抽出すべきピッチ波形を重畳して得られる合成音声波形と前記原音声波形との間で定義される第２の誤差評価関数とに基づいて、原音声波形記憶部の原音声波形から前記ピッチ波形を抽出し、該ピッチ波形を所望のピッチ周期間隔で重畳して合成音声を生成することを特徴とする。
【００１１】
また本発明においては、少なくとも音韻情報を含む韻律情報に基づいて原音声波形から抽出されるピッチ波形を所望のピッチ周期間隔で重畳して音声を生成する音声合成方法において、前記韻律情報に基づいて目標スペクトル記憶部から目標スペクトルを選択し、選択された目標スペクトルと抽出すべきピッチ波形のスペクトルとの間で定義される誤差評価関数に基づいて、前記原音波形記憶部の前記原音声波形から前記ピッチ波形を抽出し、該ピッチ波形を所望のピッチ周期間隔で重畳して合成音声を生成することを特徴とする。
【００１２】
また本発明においては、少なくとも音韻情報を含む韻律情報に基づいて原音声波形から抽出されるピッチ波形を所望のピッチ周期間隔で重畳して音声を生成する音声合成方法において、前記韻律情報に基づいて目標スペクトル記憶部から目標スペクトルを選択し、選択された目標スペクトルと抽出すべきピッチ波形のスペクトルとの間で定義される第１の誤差評価関数と、前記抽出すべきピッチ波形を重畳して得られる合成音声波形と前記原音声波形との間で定義される第２の誤差評価関数とに基づいて、原音声波形記憶部の原音声波形から前記ピッチ波形を抽出し、該ピッチ波形を所望のピッチ周期間隔で重畳して合成音声を生成することを特徴とする。
【００１３】
また本発明においては、少なくとも音韻情報を含む韻律情報に基づいて原音声波形から抽出されるピッチ波形を所望のピッチ周期間隔で重畳して音声を生成する音声合成方法において、前記韻律情報に基づいて目標スペクトル記憶部から目標スペクトルとして所定の周波数成分を含むスペクトルデータを選択し、当該スペクトルデータと抽出すべきピッチ波形に基づく音声波形のスペクトルとの間で定義される第１の誤差評価関数と、前記抽出すべきピッチ波形を重畳して得られる合成音声波形と前記原音声波形との間で定義される第２の誤差評価関数とに基づいて、原音声波形記憶部の原音声波形から前記ピッチ波形を抽出し、該ピッチ波形を所望のピッチ周期間隔で重畳して合成音声を生成することを特徴とする。
【００１４】
また本発明においては、少なくとも音韻情報を含む韻律情報に基づいて原音声波形から抽出されるピッチ波形を所望のピッチ周期間隔で重畳して音声を生成する音声合成方法において、前記韻律情報に基づいて目標波形記億部から目標波形を選択し、原音声波形記憶部の原音声波形から窓関数をかけて抽出されたピッチ波形と前記選択された目標波形との間で定義される誤差評価関数に基づいてピッチ波形を選択し、、該ピッチ波形を所望のピッチ周期間隔で重畳して合成音声を生成することを特徴とする音声合成方法を提供する。
【００１５】
また本発明においては、目標波形を記憶した目標波形記憶部と、原音声波形を記憶した原音声波形記憶部と、少なくとも韻律情報に基づいて目標波形記億部から目標波形を選択する目標波形選択部と、選択された目標波形と前記原音声波形記憶部に記憶された原音声波形から抽出すべきピッチ波形との間で定義される誤差評価関数に基づいて、前記原音声波形から前記ピッチ波形を抽出するピッチ波形抽出部と、該抽出されたピッチ波形を所望のピッチ周期間隔で重畳して合成音声を生成することを特徴とする音声合成装置を提供する。
【００１６】
また本発明においては、少なくとも音韻情報を含む韻律情報に基づいて原音声波形から抽出されるピッチ波形を所望のピッチ周期間隔で重畳して音声を生成する処理を行なうコンピュータ読み取り可能な音声合成プログラムにおいて、前記韻律情報に基づいて目標波形記億部から目標波形を選択するステップと、選択された目標波形と抽出すべきピッチ波形との間で定義される誤差評価関数に基づいて、原音声波形記憶部の前記原音声波形から前記ピッチ波形を抽出するステップと、該ピッチ波形を所望のピッチ周期間隔で重畳して合成音声を生成するステップとを備えたことを特徴とするコンピュータ読み取り可能な音声合成プログラムを提供する。
【００１７】
【発明の実施の形態】
（第１の実施形態）
図１に本発明の第１の実施形態に係る音声合成システムの構成を示す。この音声合成システムの主要部分は、目標波形記憶部２０と、目標波形選択部１０と、原音声波形記憶部２１と、ピッチ波形抽出部１１と、ピッチ波形重畳部１２とから構成されている。
【００１８】
本実施形態の音声合成システムの動作をテキスト合成の場合を例にとって、図７に示すフローチャートを用いて説明する。
【００１９】
まず、図１に示す文解析・韻律情報生成（処理部）にテキスト文章が入力されると、テキスト合成に供されるテキスト文章の文解析が行われて音韻記号列、音韻継続時間長、ピッチパターンおよびパワ−などの韻律情報が生成され、目標波形選択部１０およびピッチ波形抽出部１１に入力される（ステップＳ１）。
【００２０】
目標波形選択部１０では、音韻記号列、音韻継続時間長、ピッチパターンおよびパワ−などの韻律情報１００に基づいて目標波形記憶部２０から目標波形１０２が選択される（ステップＳ２）。目標波形記憶部２０には、例えば、オフラインで収集された１ピッチ波形としての音声素片が記憶されており、入力された韻律情報に対応した最適な音声素片が選択される。
【００２１】
ピッチ波形抽出部１１では、音韻記号列、音韻継続時間長、ピッチパターンおよびパワ−などの韻律情報１００に基づいて原音声波形記憶部２１から原音声波形１０３が選択される（ステップＳ３）。原音声波形記憶部２１には、例えば、文章単位に収集された音声波形データが記憶されており、入力された韻律情報に対応した（一部の）音声波形データが選択される。
【００２２】
そして、選択された目標波形とピッチ波形との間で定義される第１の誤差評価関数、および選択された原音声波形と上記ピッチ波形から得られる合成音声波形との間で定義される第２の誤差評価関数に基づいて、原音声波形からピッチ波形が抽出される（ステップＳ４）。
【００２３】
最後に、抽出されたピッチ波形１０４が韻律情報１００により決定されるピッチ周期間隔で重畳されて所定の時間長連続する合成音声１０５が生成される（ステップＳ５）。
【００２４】
次に、本実施形態の特徴的な部分であるピッチ波形抽出部１１について、第１の誤差評価関数と第２の誤差評価関数の重み付き和で決定される総誤差を最小にするピッチ波形を解析的に求める場合を例にとって詳細に述べる。
【００２５】
目標波形とピッチ波形との間で定義される第１の誤差評価関数をＤ_１、原音声波形と合成音声波形との間で定義される第２の誤差評価関数をＤ_２とすると、第１の誤差評価関数Ｄ_１及び第２の誤差評価関数Ｄ_２は次式で与えられる。
【００２６】
【数１】

【００２７】
なお、ｘはピッチ波形、ｕは目標波形、ｒは原音声波形、Ａはピッチ波形を原音声波形のピッチに合わせて重畳する操作を表す行列、ｅ_１は目標波形とピッチ波形との誤差ベクトル（第１の誤差ベクトル）、ｅ_２は原音声波形と合成音声波形との誤差ベクトル（第２の誤差ベクトル）、ｇ_１は第１の誤差評価関数に対する最適ゲイン、ｇ_２は第２の誤差評価関数に対する最適ゲインである。
【００２８】
従って、第１の誤差評価関数Ｄ_１に対する重み係数をｗ_１、第２の誤差評価関数Ｄ_２に対する重み係数をｗ_２とすると、総誤差Ｄは
【００２９】
【数２】

【００３０】
となる。
【００３１】
ここで、総誤差Ｅを最小にするピッチ波形ｘ、第１の誤差評価関数に対する最適ゲインｇ_１、および第２の誤差評価関数に対する最適ゲインｇ_２を求める。総誤差Ｄをピッチ波形ｘ、第１の誤差評価関数に対する最適ゲインｇ_１、および第２の誤差評価関数に対する最適ゲインｇ_２のそれぞれの変数で偏微分すると
【００３２】
【数３】

【００３３】
となる。従って、
【００３４】
【数４】

【００３５】
より、
【００３６】
【数５】

【００３７】
となる。上式を繰り返し計算すればｇ_１、ｇ_２およびｘは求められる。
【００３８】
また、第１の誤差ベクトルおよび第２の誤差ベクトルにそれぞれ聴覚重み付けを行い、ピッチ波形を抽出することも可能である。この場合、聴覚重み付けされた第１の誤差評価関数をＤ_ｗ１、聴覚重み付けされた第２の誤差評価関数をＤ_ｗ２とすると、聴覚重み付け総誤差Ｄ_ｗは次式で与えられる。
【００３９】
【数６】

【００４０】
なお、Ｗ_１は第１誤差ベクトルに対する聴覚重み付けの操作を表す行列、Ｗ_２は第２誤差ベクトルに対する聴覚重み付けの操作を表す行列である。
【００４１】
従って、先に述べた聴覚重み付けを行わない場合と同様にして、ピッチ波形ｘ、第１の誤差評価関数に対する最適ゲインｇ_１、および第２の誤差評価関数に対する最適ゲインｇ_２を求めると、
【００４２】
【数７】

【００４３】
となる。
【００４４】
以上の例では、目標波形記憶部２０には、例えば、オフラインで収集された１ピッチ波形としての音声素片が記憶されているとして実施例を説明したが、記憶されているデータはオフラインで収集された１ピッチ波形に限定されるものではない。例えば、予めの学習により得られた音声素片を記憶しておくよう構成しても良いし、１ピッチ波形ではなく複数ピッチの時間幅を有する音声波形データを目標波形として記憶しておくよう構成しても良い。その場合には、複数ピッチの時間幅を有する音声波形データと抽出すべきピッチ波形に基づく音声波形（例えばピッチ波形を重畳して得られた合成音声波形）との誤差を誤差ベクトルとして、前述した第１の誤差ベクトルに置き換えて、ピッチ波形の評価を行なう。
【００４５】
また原音声波形記憶部２１には、例えば、文章単位に収集された音声波形データが記憶されているとして実施例を説明したが、記憶されているデータは文章単位に収集された音声波形データに限られるものではない。例えば、収録された音声波形データを一定の時間や音韻毎に分けられた音声素片データを記憶しておくよう構成しても良いし、文章よりも小さな文法単位毎に連続した音声波形データを記憶するよう構成しても良い。
【００４６】
以上の例では、目標波形として音声の時間波形を目標波形記憶部２０に記憶するようシステムを構成した例を説明したが、目標波形記憶部２０に記憶された目標波形（時間波形）１０１の代わりに、これと等価な周波数成分としての目標スペクトルを記憶し、第１の誤差評価関数を目標スペクトルとピッチ波形の周波数スペクトルとの間で定義し、誤差を評価することも可能である。このときの音声合成システムの構成は図２のようになる。
【００４７】
図２は図１と比較して、目標波形記憶部２０および目標波形選択部１０が目標スペクトル記憶部４０および目標スペクトル選択部３０に置き換わり、原音声波形記憶部２１とピッチ波形抽出部１１の間に原音声波形から抽出されたピッチ波形を拘束フーリエ変換（ＦＦＴ）処理するＦＦＴ処理部３１が追加された構成となっている。なお、図２では図１と同一の要素に同一の参照番号を付してある。
【００４８】
ここで、目標スペクトル記憶部４０には、例えば、オフラインで収集された１ピッチ波形としての音声素片に相当する周波数成分（スペクトル）が記憶されており、入力された韻律情報に対応した最適な音声素片相当のスペクトルが選択される。さらに、第２の誤差評価関数を原音声波形とピッチ波形との間で定義することも可能である。
【００４９】
このように、目標波形と原音声波形およびピッチ波形との間で定義される誤差評価関数に基づいて最適なピッチ波形を抽出することで、より望ましいスペクトル包絡をもつピッチ波形を得ることができ、合成音声の音質は向上する。さらに、異なるピッチに対応する目標波形を幾つか用意し、合成音声のピッチに応じて目標波形を変えてピッチ波形を抽出することで、ピッチとスペクトルの音響的なミスマッチが解消され、ピッチの変更に伴う合成音声の音質の劣化は無くなり、合成音声の音質はさらに向上する。
（第２の実施形態）
図３に本発明の第２の実施形態に係る音声合成システムの構成を示す。本実施形態は解析的にピッチ波形を抽出する第１の実施形態と異なり、切り出し窓（ウィンドウ）を用いて原音声波形からピッチ波形を切り出すことによりピッチ波形を抽出する構成となっている。なお、本実施形態では、図３に示されるようにピッチ波形抽出部１１は、切り出し窓決定部５０、ピッチ波形切り出し部５１および誤差評価部５２から構成されている。ピッチ波形の抽出は、予め、切り出し窓の窓関数あるいは窓長の少なくとも一方を変えた複数の切り出し窓を用意しておき、それぞれの切り出し窓を用いて切り出されたピッチ波形のうち、目標波形との間で定義される誤差評価関数により決定される誤差が最小となるものを選択することで実現される。なお、誤差評価関数は次式で定義されるようなものが考えられる。
【００５０】
【数８】

【００５１】
ここで、ｘはピッチ波形、ｕは目標波形、ｒは原音声波形、Ｍは原音声波形からピッチ波形を切り出す操作を表す行列、ｅは目標波形とピッチ波形との誤差ベクトル、ｇは誤差評価関数に対する最適ゲインである。
【００５２】
本実施形態によって抽出されるピッチ波形は、解析的方法で抽出されたピッチ波形と比べると一般に精度が高くはないため、準最適なものになっているが、予め用意した複数の切り出し窓を用いて原音声波形からピッチ波形を切り出すことでピッチ波形を抽出しているので計算量が少ないという利点がある。
【００５３】
また、目標波形とピッチ波形との間で定義される誤差だけでなく、原音声波形と合成音声波形との間で定義される誤差を考慮してピッチ波形を切り出すことも可能である。この場合、誤差評価関数としては次式で定義されるようなものが考えられる。
【００５４】
【数９】

【００５５】
ここで、ｘはピッチ波形、ｕは目標波形、ｒは原音声波形、Ｍは原音声波形からピッチ波形を切り出す操作を表す行列、Ａはピッチ波形を原音声波形のピッチに合わせて重畳する操作を表す行列、ｗ_１は目標波形とピッチ波形との間で定義される誤差、ｗ_２は原音声波形と合成音声波形との間で定義される誤差、ｇ_１は目標波形に対する最適ゲイン、ｇ_２は原音声波形に対する最適ゲインである。
【００５６】
なお、本実施形態では誤差評価関数は波形領域で定義してあるが、スペクトル領域で定義することも可能である。
【００５７】
以上、本発明の実施形態を幾つか説明したが、本発明は上述した実施形態に限られるものではなく、種々変更して実施が可能である。例えば、誤差評価関数に関しても、上述した関数に限定される必要はなく、少なくとも目標波形あるいは目標スペクトルの一方を含む形で定義されていれば良い。
【００５８】
原音声波形から抽出したピッチ波形と予め用意した目標波形との間で定義される誤差評価関数に基づいて原音声波形からピッチ波形を抽出することで、より望ましいスペクトル包絡をもつピッチ波形を得ることが可能になり、合成音声の音質は向上する。
【００５９】
上述した本発明に基づく音声合成処理は、ハードウェアより実現することも可能であるが、コンピュータを用いてソフトウェア処理により実現することも可能である。従って、本発明によれば上述した音声合成処理をコンピュータに行わせるためのプログラムを提供することもできる。
【００６０】
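[Effects of the invention]
As described above, according to the present invention, high-quality synthesized speech can be obtained by extracting the pitch waveform based on an error evaluation function defined between the target waveform and the pitch waveform.
[Brief description of the drawings]
[FIG. 1] Block diagram showing the configuration of a speech synthesis system for explaining the first embodiment of the speech synthesis method according to the present invention
[FIG. 2] Block diagram showing the configuration of a speech synthesis system for explaining the second embodiment of the speech synthesis method according to the present invention
[FIG. 3] Block diagram showing the configuration of a speech synthesis system for explaining the third embodiment of the speech synthesis method according to the present invention
[FIG. 4] Block diagram showing the configuration of a speech synthesis system for explaining a conventional speech synthesis method
[FIG. 5] Diagram for explaining the generation of a synthesized speech waveform by the PSOLA method
[FIG. 6] Diagram for explaining the generation of a synthesized speech waveform by the PSOLA method
[FIG. 7] Flowchart of the speech synthesis system for explaining the first embodiment of the speech synthesis method according to the present invention
[Explanation of symbols]
10 … Target waveform selection unit
11 … Pitch waveform extraction unit
12 … Pitch waveform superposition unit
20 … Target waveform storage unit
21 … Original speech waveform storage unit
30 … Target spectrum selection unit
31 … FFT processing unit
40 … Target spectrum storage unit
50 … Cutout window determination unit
51 … Pitch waveform cutout unit
52 … Error evaluation unit
100 … Prosodic information
101 … Candidate target waveform
102 … Target waveform
103 … Original speech waveform
104 … Pitch waveform
105 … Synthesized speech
201 … Candidate target spectrum
202 … Target spectrum
203 … Original speech spectrum
300 … Cutout window
301 … Candidate pitch waveform
302 … Pitch waveform
400 … Original speech waveform
401 … Window function when decreasing the pitch period
402 … Synthesized speech after the pitch period has been decreased
403 … Window function when increasing the pitch period
404 … Window function after the pitch period has been increased
[0001]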
BACKGROUND OF THE INVENTION
The present invention relates to a speech synthesis method and apparatus used, for example, for text synthesis, and to a computer-readable text synthesis program.
[0002]
[Prior art]
The rule synthesis method, one of several speech synthesis methods, is a technique that automatically converts input prosodic information into synthesized speech. It is used in applications such as text synthesis, which artificially creates speech from arbitrary sentences (text).
[0003]
As shown in FIG. 4, the rule synthesis method selects small units (speech units) such as syllables or phonemes stored in the speech unit storage unit or the original speech waveform storage unit 21, according to prosodic information 100 such as the phoneme symbol string, the phoneme durations, the pitch pattern, and the power. It then generates the synthesized speech signal 105 by changing the pitch and duration of the selected speech units and concatenating them.
[0004]
A well-known method for changing the pitch is PSOLA. In PSOLA, a speech unit is windowed at each pitch period to cut out pitch waveforms, which are then rearranged at the pitch period interval to be synthesized and overlap-added, thereby changing the pitch and generating the synthesized speech signal (FIGS. 5 and 6).
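The windowing and overlap-add procedure described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of any particular system: the function names (`hann`, `cut_pitch_waveforms`, `overlap_add`), the constant pitch period in samples, and the symmetric Hann window of length 2·period+1 centred on each pitch mark are all assumptions made for the example.

```python
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def cut_pitch_waveforms(signal, pitch_marks, period):
    """Cut one windowed pitch waveform around each pitch mark.

    Each frame covers period samples on either side of its pitch mark;
    samples outside the signal are treated as zero.
    """
    win = hann(2 * period + 1)
    frames = []
    for mark in pitch_marks:
        frame = []
        for k in range(-period, period + 1):
            idx = mark + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            frame.append(sample * win[k + period])
        frames.append(frame)
    return frames


def overlap_add(frames, new_period):
    """Place the frames new_period samples apart and sum the overlaps."""
    if not frames:
        return []
    out = [0.0] * (new_period * (len(frames) - 1) + len(frames[0]))
    for j, frame in enumerate(frames):
        start = j * new_period
        for k, value in enumerate(frame):
            out[start + k] += value
    return out
```

When `new_period` equals the original period, the overlapping Hann windows sum to one and the interior of the signal is reconstructed unchanged. A smaller `new_period` raises the pitch and a larger one lowers it.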
[0005]
[Patent Document 1]
Japanese Patent No. 329604
[0006]
[Problems to be solved by the invention]
In PSOLA, therefore, the way the pitch waveform is extracted is the key to improving the quality of the synthesized speech. In Japanese Patent No. 329604, when the pitch of the synthesized speech is lower than that of the original speech, the pitch waveform is cut out with a Hanning window whose length equals twice the pitch period of the original speech. When the pitch of the synthesized speech is higher than that of the original speech, the pitch waveform is cut out with a Hanning window whose length equals twice the pitch period of the synthesized speech. This suppresses the degradation in synthesized speech quality that accompanies a pitch change.
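The window-length rule attributed to the prior patent can be written as a small function. This is only a sketch: the name `psola_window_length` and the use of pitch periods in samples are assumptions, and the case of equal pitch is not addressed in the text (both branches give the same length there).

```python
def psola_window_length(source_period, target_period):
    """Hanning-window length (in samples) under the rule described above.

    A lower pitch means a longer pitch period, so the comparison is made
    on periods rather than on frequencies.
    """
    if target_period > source_period:
        # Synthesized pitch is lower than the original pitch:
        # use twice the pitch period of the original speech.
        return 2 * source_period
    # Synthesized pitch is higher (or equal):
    # use twice the pitch period of the synthesized speech.
    return 2 * target_period
```

In both cases the window spans twice the shorter of the two pitch periods.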
[0007]
However, although the degradation in synthesized speech quality is slight when the pitch change is small, the quality degrades significantly when the pitch is changed greatly, because of an acoustic mismatch between the pitch and the spectrum.
[0008]
[Means for Solving the Problems]
To solve the above problem, the present invention provides a speech synthesis method that generates speech by superimposing, at desired pitch period intervals, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information. The method is characterized by selecting a target waveform from a target waveform storage unit based on the prosodic information, extracting the pitch waveform from the original speech waveform in an original speech waveform storage unit based on an error evaluation function defined between the selected target waveform and the pitch waveform to be extracted, and generating synthesized speech by superimposing the pitch waveform at the desired pitch period intervals.
[0009]
The present invention also provides a speech synthesis method that generates speech by superimposing, at desired pitch period intervals, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information. The method is characterized by selecting a target waveform from the target waveform storage unit based on the prosodic information, extracting the pitch waveform from the original speech waveform in the original speech waveform storage unit based on a first error evaluation function, defined between the selected target waveform and the pitch waveform to be extracted, and a second error evaluation function, defined between the original speech waveform and a synthesized speech waveform obtained by superimposing the pitch waveform to be extracted, and generating synthesized speech by superimposing the pitch waveform at the desired pitch period intervals.
[0010]
The present invention also provides a speech synthesis method that generates speech by superimposing, at desired pitch period intervals, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information. The method is characterized by selecting, as a target waveform, speech waveform data spanning a plurality of pitch periods from the target waveform storage unit based on the prosodic information, extracting the pitch waveform from the original speech waveform in the original speech waveform storage unit based on a first error evaluation function, defined between that speech waveform data and a speech waveform based on the pitch waveform to be extracted, and a second error evaluation function, defined between the original speech waveform and a synthesized speech waveform obtained by superimposing the pitch waveform to be extracted, and generating synthesized speech by superimposing the pitch waveform at the desired pitch period intervals.
[0011]
The present invention also provides a speech synthesis method that generates speech by superimposing, at desired pitch period intervals, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information. The method is characterized by selecting a target spectrum from a target spectrum storage unit based on the prosodic information, extracting the pitch waveform from the original speech waveform in the original speech waveform storage unit based on an error evaluation function defined between the selected target spectrum and the spectrum of the pitch waveform to be extracted, and generating synthesized speech by superimposing the pitch waveform at the desired pitch period intervals.
[0012]
The present invention also provides a speech synthesis method that generates speech by superimposing, at desired pitch period intervals, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information. The method is characterized by selecting a target spectrum from the target spectrum storage unit based on the prosodic information, extracting the pitch waveform from the original speech waveform in the original speech waveform storage unit based on a first error evaluation function, defined between the selected target spectrum and the spectrum of the pitch waveform to be extracted, and a second error evaluation function, defined between the original speech waveform and a synthesized speech waveform obtained by superimposing the pitch waveform to be extracted, and generating synthesized speech by superimposing the pitch waveform at the desired pitch period intervals.
[0013]
The present invention also provides a speech synthesis method that generates speech by superimposing, at desired pitch period intervals, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information. The method is characterized by selecting, as a target spectrum, spectrum data containing predetermined frequency components from the target spectrum storage unit based on the prosodic information, extracting the pitch waveform from the original speech waveform in the original speech waveform storage unit based on a first error evaluation function, defined between that spectrum data and the spectrum of a speech waveform based on the pitch waveform to be extracted, and a second error evaluation function, defined between the original speech waveform and a synthesized speech waveform obtained by superimposing the pitch waveform to be extracted, and generating synthesized speech by superimposing the pitch waveform at the desired pitch period intervals.
[0014]
The present invention also provides a speech synthesis method that generates speech by superimposing, at desired pitch period intervals, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information. The method is characterized by selecting a target waveform from the target waveform storage unit based on the prosodic information, selecting a pitch waveform based on an error evaluation function defined between the selected target waveform and pitch waveforms extracted by applying a window function to the original speech waveform in the original speech waveform storage unit, and generating synthesized speech by superimposing the selected pitch waveform at the desired pitch period intervals.
[0015]
The present invention further provides a speech synthesis apparatus comprising a target waveform storage unit that stores target waveforms, an original speech waveform storage unit that stores original speech waveforms, a target waveform selection unit that selects a target waveform from the target waveform storage unit based on at least prosodic information, and a pitch waveform extraction unit that extracts a pitch waveform from the original speech waveform based on an error evaluation function defined between the selected target waveform and the pitch waveform to be extracted from the original speech waveform stored in the original speech waveform storage unit. The apparatus is characterized by generating synthesized speech by superimposing the extracted pitch waveform at desired pitch period intervals.
[0016]
The present invention further provides a computer-readable speech synthesis program that performs processing to generate speech by superimposing, at desired pitch period intervals, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information. The program is characterized by comprising a step of selecting a target waveform from the target waveform storage unit based on the prosodic information, a step of extracting the pitch waveform from the original speech waveform in the original speech waveform storage unit based on an error evaluation function defined between the selected target waveform and the pitch waveform to be extracted, and a step of generating synthesized speech by superimposing the pitch waveform at the desired pitch period intervals.
[0017]
DETAILED DESCRIPTION OF THE INVENTION
(First embodiment)
FIG. 1 shows the configuration of a speech synthesis system according to the first embodiment of the present invention. The main part of the speech synthesis system includes a target waveform storage unit 20, a target waveform selection unit 10, an original speech waveform storage unit 21, a pitch waveform extraction unit 11, and a pitch waveform superposition unit 12.
[0018]
The operation of the speech synthesis system of this embodiment will be described with reference to the flowchart shown in FIG. 7, taking text synthesis as an example.
[0019]
First, when a text sentence is input to the sentence analysis / prosodic information generation (processing unit) shown in FIG. 1, the text sentence to be synthesized is analyzed, and prosodic information such as a phoneme symbol string, phoneme durations, a pitch pattern, and power is generated and input to the target waveform selection unit 10 and the pitch waveform extraction unit 11 (step S1).
[0020]
The target waveform selection unit 10 selects a target waveform 102 from the target waveform storage unit 20 based on the prosodic information 100, such as the phoneme symbol string, phoneme durations, pitch pattern, and power (step S2). The target waveform storage unit 20 stores, for example, speech units in the form of single-pitch waveforms collected offline, and the speech unit that best matches the input prosodic information is selected.
[0021]
The pitch waveform extraction unit 11 selects an original speech waveform 103 from the original speech waveform storage unit 21 based on the prosodic information 100, such as the phoneme symbol string, phoneme durations, pitch pattern, and power (step S3). The original speech waveform storage unit 21 stores, for example, speech waveform data collected sentence by sentence, and (a portion of) the speech waveform data corresponding to the input prosodic information is selected.
[0022]
Then, a pitch waveform is extracted from the original speech waveform based on a first error evaluation function, defined between the selected target waveform and the pitch waveform, and a second error evaluation function, defined between the selected original speech waveform and the synthesized speech waveform obtained from the pitch waveform (step S4).
[0023]
Finally, the extracted pitch waveform 104 is superimposed at the pitch period intervals determined by the prosodic information 100, generating synthesized speech 105 that continues for a predetermined length of time (step S5).
[0024]
Next, the pitch waveform extraction unit 11, which is the characteristic part of this embodiment, will be described in detail, taking as an example the case where the pitch waveform that minimizes the total error, determined by the weighted sum of the first and second error evaluation functions, is obtained analytically.
[0025]
Let D₁ be the first error evaluation function, defined between the target waveform and the pitch waveform, and D₂ the second error evaluation function, defined between the original speech waveform and the synthesized speech waveform. Then D₁ and D₂ are given by the following equations.
[0026]
[Equation 1]

[0027]
Here, x is the pitch waveform, u is the target waveform, r is the original speech waveform, A is a matrix representing the operation of superimposing the pitch waveform in accordance with the pitch of the original speech waveform, e₁ is the error vector between the target waveform and the pitch waveform (first error vector), e₂ is the error vector between the original speech waveform and the synthesized speech waveform (second error vector), g₁ is the optimal gain for the first error evaluation function, and g₂ is the optimal gain for the second error evaluation function.
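The equation in paragraph [0026] is not reproduced in this text. One form consistent with the symbol definitions above is sketched below. It is an assumption, not the patent's own formula: the squared Euclidean norm and, in particular, the placement of the gains g₁ and g₂ on the pitch-waveform side are guesses made for illustration.

```latex
\begin{aligned}
D_1 &= \lVert e_1 \rVert^2 = \lVert u - g_1 x \rVert^2, \\
D_2 &= \lVert e_2 \rVert^2 = \lVert r - g_2 A x \rVert^2 .
\end{aligned}
```

Under this reading, D₁ measures how closely the scaled pitch waveform matches the target waveform, and D₂ measures how closely the waveform resynthesized at the original pitch matches the original speech.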
[0028]
Therefore, letting w₁ be the weighting coefficient for the first error evaluation function D₁ and w₂ the weighting coefficient for the second error evaluation function D₂, the total error D is expressed as
[0029]
[Equation 2]
D = w₁D₁ + w₂D₂
[0030]
which is the weighted sum of the two error evaluation functions.
[0031]
Here, the pitch waveform x that minimizes the total error D, the optimal gain g₁ for the first error evaluation function, and the optimal gain g₂ for the second error evaluation function are obtained. Partially differentiating the total error D with respect to each of the variables x, g₁, and g₂ gives
[0032]
[Equation 3]

[0033]
Therefore,
[0034]
[Equation 4]

[0035]
from which
[0036]
[Equation 5]

[0037]
is obtained. g₁, g₂, and x can then be found by evaluating the above equations iteratively.
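The equations in paragraphs [0032] to [0036] are not reproduced in this text. As an illustration only, the derivation can be sketched under the assumption that D₁ = ‖u − g₁x‖² and D₂ = ‖r − g₂Ax‖², so that the total error is the weighted sum of these two terms. The gain placement and the squared norm are assumptions, and the patent's actual expressions may differ.

```latex
\begin{aligned}
D &= w_1 \lVert u - g_1 x \rVert^2 + w_2 \lVert r - g_2 A x \rVert^2, \\[4pt]
\frac{\partial D}{\partial x} &= -2 w_1 g_1 \,(u - g_1 x) - 2 w_2 g_2 \,A^{\mathsf T}(r - g_2 A x), \\
\frac{\partial D}{\partial g_1} &= -2 w_1 \,x^{\mathsf T}(u - g_1 x), \qquad
\frac{\partial D}{\partial g_2} = -2 w_2 \,(A x)^{\mathsf T}(r - g_2 A x). \\[4pt]
\text{Setting each to zero:}\quad
x &= \bigl(w_1 g_1^2 I + w_2 g_2^2 A^{\mathsf T} A\bigr)^{-1}
     \bigl(w_1 g_1 u + w_2 g_2 A^{\mathsf T} r\bigr), \\
g_1 &= \frac{x^{\mathsf T} u}{x^{\mathsf T} x}, \qquad
g_2 = \frac{(A x)^{\mathsf T} r}{(A x)^{\mathsf T} (A x)} .
\end{aligned}
```

Because x depends on g₁ and g₂ and vice versa, the three expressions would be evaluated in turn, starting from initial gains such as g₁ = g₂ = 1, until the values stop changing. This matches the iterative calculation mentioned in the text.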
[0038]
It is also possible to extract the pitch waveform after applying perceptual weighting to each of the first and second error vectors. In this case, letting D_w1 be the perceptually weighted first error evaluation function and D_w2 the perceptually weighted second error evaluation function, the perceptually weighted total error D_w is given by the following equation.
[0039]
[Equation 6]

[0040]
Here, W₁ is a matrix representing the perceptual weighting operation applied to the first error vector, and W₂ is a matrix representing the perceptual weighting operation applied to the second error vector.
[0041]
Accordingly, when the pitch waveform x, the optimal gain g₁ for the first error evaluation function, and the optimal gain g₂ for the second error evaluation function are derived in the same manner as in the unweighted case described above,
[0042]
[Equation 7]

[0043]
is obtained.
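The equations in paragraphs [0039] and [0042] are not reproduced in this text. A sketch of the perceptually weighted case is given below, again assuming error vectors of the form e₁ = u − g₁x and e₂ = r − g₂Ax. These forms are assumptions for illustration, not the patent's own expressions.

```latex
\begin{aligned}
D_w &= w_1 D_{w1} + w_2 D_{w2}
     = w_1 \lVert W_1 (u - g_1 x) \rVert^2 + w_2 \lVert W_2 (r - g_2 A x) \rVert^2, \\[4pt]
x &= \bigl(w_1 g_1^2\, W_1^{\mathsf T} W_1 + w_2 g_2^2\, A^{\mathsf T} W_2^{\mathsf T} W_2 A\bigr)^{-1}
     \bigl(w_1 g_1\, W_1^{\mathsf T} W_1 u + w_2 g_2\, A^{\mathsf T} W_2^{\mathsf T} W_2 r\bigr), \\
g_1 &= \frac{(W_1 x)^{\mathsf T} (W_1 u)}{\lVert W_1 x \rVert^2}, \qquad
g_2 = \frac{(W_2 A x)^{\mathsf T} (W_2 r)}{\lVert W_2 A x \rVert^2} .
\end{aligned}
```

Setting W₁ and W₂ to the identity matrix recovers the unweighted case.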
[0044]
In the above example, the target waveform storage unit 20 was described as storing, for example, speech units in the form of single-pitch waveforms collected offline, but the stored data is not limited to single-pitch waveforms collected offline. For example, speech units obtained by prior training may be stored, or speech waveform data spanning a plurality of pitch periods may be stored as the target waveform instead of a single-pitch waveform. In that case, the pitch waveform is evaluated using, in place of the first error vector described above, an error vector formed from the error between the speech waveform data spanning a plurality of pitch periods and a speech waveform based on the pitch waveform to be extracted (for example, a synthesized speech waveform obtained by superimposing the pitch waveform).
[0045]
Likewise, the original speech waveform storage unit 21 was described as storing, for example, speech waveform data collected sentence by sentence, but the stored data is not limited to this. For example, it may store speech unit data obtained by dividing the recorded speech waveform data at fixed time intervals or by phoneme, or it may store continuous speech waveform data for each grammatical unit smaller than a sentence.
[0046]
The above example described a system configured to store time-domain speech waveforms as target waveforms in the target waveform storage unit 20. Instead of the target waveform (time waveform) 101 stored in the target waveform storage unit 20, however, it is also possible to store a target spectrum, that is, the equivalent frequency components, define the first error evaluation function between the target spectrum and the frequency spectrum of the pitch waveform, and evaluate the error accordingly. The configuration of the speech synthesis system in this case is shown in FIG. 2.
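A spectral-domain error of the kind described above might be evaluated as in the following sketch. The details are assumptions made for illustration: the text does not say whether magnitude, power, or complex spectra are compared, so magnitude spectra with a squared-error measure and a least-squares gain are used here, and a direct DFT stands in for an FFT to keep the example free of dependencies. The names `magnitude_spectrum` and `spectral_error` are hypothetical.

```python
import cmath


def magnitude_spectrum(x):
    """Magnitude of the DFT of x for bins 0 .. len(x)//2 (direct O(n^2) DFT)."""
    n = len(x)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


def spectral_error(target_spectrum, pitch_waveform):
    """Squared error between a target magnitude spectrum and the spectrum
    of a pitch waveform, after scaling the latter by the least-squares gain.

    Returns (error, gain). The target must have len(pitch_waveform)//2 + 1 bins.
    """
    spec = magnitude_spectrum(pitch_waveform)
    energy = sum(s * s for s in spec)
    gain = sum(t * s for t, s in zip(target_spectrum, spec)) / energy if energy > 0 else 0.0
    error = sum((t - gain * s) ** 2 for t, s in zip(target_spectrum, spec))
    return error, gain
```

For example, a unit impulse `[1.0, 0.0, 0.0, 0.0]` has a flat magnitude spectrum, so against a flat target spectrum it yields zero error with a gain equal to the target level.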
[0047]
Compared with FIG. 1, in FIG. 2 the target waveform storage unit 20 and the target waveform selection unit 10 are replaced with a target spectrum storage unit 40 and a target spectrum selection unit 30, and an FFT processing unit 31, which applies a fast Fourier transform (FFT) to the pitch waveform extracted from the original speech waveform, is added between the original speech waveform storage unit 21 and the pitch waveform extraction unit 11. In FIG. 2, elements identical to those in FIG. 1 are given the same reference numerals.
[0048]
Here, the target spectrum storage unit 40 stores, for example, frequency components (spectra) corresponding to speech units in the form of single-pitch waveforms collected offline, and the spectrum of the speech unit that best matches the input prosodic information is selected. Furthermore, the second error evaluation function may also be defined between the original speech waveform and the pitch waveform.
[0049]
In this way, extracting the optimal pitch waveform based on error evaluation functions defined among the target waveform, the original speech waveform, and the pitch waveform yields a pitch waveform with a more desirable spectral envelope and improves the quality of the synthesized speech. Furthermore, by preparing several target waveforms corresponding to different pitches and switching the target waveform according to the pitch of the synthesized speech when extracting the pitch waveform, the acoustic mismatch between pitch and spectrum is eliminated, the degradation in quality that accompanies a pitch change disappears, and the quality of the synthesized speech improves further.
(Second Embodiment)
FIG. 3 shows the configuration of a speech synthesis system according to the second embodiment of the present invention. Unlike the first embodiment, in which the pitch waveform is extracted analytically, this embodiment extracts the pitch waveform by cutting it out of the original speech waveform with a cutout window. In this embodiment, as shown in FIG. 3, the pitch waveform extraction unit 11 comprises a cutout window determination unit 50, a pitch waveform cutout unit 51, and an error evaluation unit 52. To extract the pitch waveform, a plurality of cutout windows that differ in at least one of the window function and the window length are prepared in advance. Of the pitch waveforms cut out with each of these windows, the one that minimizes the error determined by an error evaluation function defined between it and the target waveform is selected. The error evaluation function may be defined, for example, by the following equation.
[0050]
[Equation 8]

[0051]
Here, x is the pitch waveform, u is the target waveform, r is the original speech waveform, M is a matrix representing the operation of cutting out the pitch waveform from the original speech waveform, e is the error vector between the target waveform and the pitch waveform, and g is the optimal gain for the error evaluation function.
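The selection among pre-prepared cutout windows can be sketched as follows. The equation in paragraph [0050] is not reproduced in this text, so the error used here, ‖u − g·x‖² with x = Mr and g chosen by least squares, is an assumption. The candidate windows are assumed to be Hann windows of different half-lengths, and all function names are hypothetical.

```python
import math


def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def cut(source, center, half_len, frame_half):
    """Cut a pitch waveform (x = M r) with a Hann window of half-width half_len
    centred on `center`, zero-padded to 2 * frame_half + 1 samples."""
    if half_len > frame_half:
        raise ValueError("window is longer than the frame")
    win = hann(2 * half_len + 1)
    frame = [0.0] * (2 * frame_half + 1)
    for k in range(-half_len, half_len + 1):
        idx = center + k
        if 0 <= idx < len(source):
            frame[k + frame_half] = source[idx] * win[k + half_len]
    return frame


def gain_and_error(u, x):
    """Least-squares gain g minimizing ||u - g x||^2, and the resulting error."""
    energy = sum(v * v for v in x)
    g = sum(a * b for a, b in zip(u, x)) / energy if energy > 0 else 0.0
    err = sum((a - g * b) ** 2 for a, b in zip(u, x))
    return g, err


def select_pitch_waveform(source, center, target, half_lens):
    """Try each candidate window and keep the cut with the smallest error.

    `target` must have odd length. Returns (error, half_len, gain, waveform).
    """
    frame_half = (len(target) - 1) // 2
    best = None
    for half_len in half_lens:
        x = cut(source, center, half_len, frame_half)
        g, err = gain_and_error(target, x)
        if best is None or err < best[0]:
            best = (err, half_len, g, x)
    return best
```

Each candidate costs only one windowing pass and two inner products, which reflects the low computational cost attributed to this embodiment.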
[0052]
The pitch waveform extracted by this embodiment is generally less accurate than one extracted by the analytical method and is therefore suboptimal. However, because the pitch waveform is extracted simply by cutting it out of the original speech waveform with a set of cutout windows prepared in advance, this approach has the advantage of requiring little computation.
[0053]
In addition to the error defined between the target waveform and the pitch waveform, it is also possible to cut out the pitch waveform in consideration of the error defined between the original speech waveform and the synthesized speech waveform. In this case, an error evaluation function defined by the following equation can be considered.
[0054]
[Equation 9]

[0055]
Here, x is a pitch waveform, u is a target waveform, r is an original speech waveform, M is a matrix representing an operation for cutting out a pitch waveform from the original speech waveform, A is an operation for superimposing the pitch waveform in accordance with the pitch of the original speech waveform, w₁ is an error defined between the target waveform and the pitch waveform, w₂ is an error defined between the original speech waveform and the synthesized speech waveform, g₁ is the optimum gain for the target waveform, and g₂ is the optimum gain for the original speech waveform.
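A rough sketch of such a two-term evaluation is given below. Because the equation is not reproduced in this text, several points are assumptions made only for illustration: the operation A is modeled as overlap-add of the pitch waveform at the pitch marks of the original speech, each gain is applied to the waveform being compared against its reference, and the two errors are simply added. All function names are hypothetical.

```python
def overlap_add(x, marks, total_len):
    """Superimpose copies of x centred on each pitch mark (assumed role of A)."""
    y = [0.0] * total_len
    half = len(x) // 2
    for m in marks:
        for n, v in enumerate(x):
            i = m - half + n
            if 0 <= i < total_len:
                y[i] += v
    return y

def scaled_error(ref, sig):
    """Minimum over g of ||ref - g*sig||^2 (gain placement is an assumption)."""
    ss = sum(v * v for v in sig)
    g = sum(a * b for a, b in zip(ref, sig)) / ss if ss > 0.0 else 0.0
    return sum((a - g * b) ** 2 for a, b in zip(ref, sig))

def combined_error(x, u, r, marks):
    """Error against the target plus error of the resynthesised original."""
    err_target = scaled_error(u, x)                                # target vs. pitch waveform
    err_resynth = scaled_error(r, overlap_add(x, marks, len(r)))   # original vs. synthesized
    return err_target + err_resynth                                # plain sum: an assumption
```

A candidate pitch waveform that matches the target well but cannot reproduce the original speech when superimposed at the original pitch is penalized by the second term.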
[0056]
In the present embodiment, the error evaluation function is defined in the waveform domain, but it can also be defined in the spectral domain.
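As one possible illustration of a spectral-domain definition, the sketch below compares magnitude spectra instead of samples. The use of a plain DFT magnitude, the gain on the pitch-waveform spectrum, and the function names are assumptions for this example; the text does not specify the spectral representation.

```python
import cmath
import math

def magnitude_spectrum(x):
    """Magnitudes of DFT bins 0..N/2 of a real sequence (direct O(N^2) DFT)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def spectral_error(u, x):
    """Gain g and error ||U - g*X||^2 between magnitude spectra (assumed form)."""
    U = magnitude_spectrum(u)
    X = magnitude_spectrum(x)
    xx = sum(v * v for v in X)
    g = sum(a * b for a, b in zip(U, X)) / xx if xx > 0.0 else 0.0
    err = sum((a - g * b) ** 2 for a, b in zip(U, X))
    return g, err
```

Because a magnitude spectrum discards phase, this form of the error is insensitive to a circular time shift between the target and the cut-out pitch waveform, whereas a sample-by-sample error is not.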
[0057]
Although several embodiments of the present invention have been described above, the present invention is not limited to these embodiments, and various modifications can be made. For example, the error evaluation function need not be limited to the functions described above, and may be any function defined so as to include at least one of the target waveform and the target spectrum.
[0058]
By extracting the pitch waveform from the original speech waveform based on an error evaluation function defined between the pitch waveform extracted from the original speech waveform and a target waveform prepared in advance, a pitch waveform with a more desirable spectral envelope is obtained, and the quality of the synthesized speech is improved.
[0059]
The above-described speech synthesis processing according to the present invention can be realized by hardware, but can also be realized by software processing using a computer. Therefore, according to the present invention, it is possible to provide a program for causing a computer to perform the above-described speech synthesis processing.
[0060]
[Effect of the invention]
As described above, according to the present invention, high-quality synthesized speech can be obtained by extracting a pitch waveform based on an error evaluation function defined between a target waveform and the pitch waveform.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a speech synthesis system for explaining a first embodiment of the speech synthesis method according to the present invention.
FIG. 2 is a diagram for explaining a second embodiment of the speech synthesis method according to the present invention.
FIG. 3 is a block diagram showing the configuration of a speech synthesis system for explaining a third embodiment of the speech synthesis method according to the present invention.
FIG. 4 is a block diagram showing the configuration of a speech synthesis system for explaining a synthesis method.
FIG. 5 is a diagram for explaining generation of a synthesized speech waveform in the PSOLA system.
FIG. 6 is a diagram for explaining generation of a synthesized speech waveform in the PSOLA system.
FIG. 7 is a flowchart of a speech synthesis system for explaining the first embodiment of the speech synthesis method according to the present invention.
[Description of symbols]
10 ... Target waveform selection unit
11 ... Pitch waveform extraction unit
12 ... Pitch waveform superposition unit
20 ... Target waveform storage unit
21 ... Original speech waveform storage unit
30 ... Target spectrum selection unit
31 ... FFT processing unit
40 ... Target spectrum storage unit
50 ... Cutout window determination unit
51 ... Pitch waveform cutout unit
52 ... Error evaluation unit
100 ... Prosodic information
101 ... Candidate target waveform
102 ... Target waveform
103 ... Original speech waveform
104 ... Pitch waveform
105 ... Synthesized speech
201 ... Candidate target spectrum
202 ... Target spectrum
203 ... Original speech spectrum
300 ... Cutout window
301 ... Candidate pitch waveform
302 ... Pitch waveform
400 ... Original speech waveform
401 ... Window function when the pitch period is reduced
402 ... Synthesized speech when the pitch period is reduced
403 ... Window function when the pitch period is increased
404 ... Window function when the pitch period was largely

Claims

A speech synthesis method for generating speech by superimposing, at a desired pitch period interval, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information, the method comprising: selecting a target waveform from a target waveform storage unit based on the prosodic information; extracting the pitch waveform from the original speech waveform in an original speech waveform storage unit based on an error evaluation function defined between the selected target waveform and the pitch waveform to be extracted; and generating synthesized speech by superimposing the pitch waveform at the desired pitch period interval.

A speech synthesis method for generating speech by superimposing, at a desired pitch period interval, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information, the method comprising: selecting a target waveform from a target waveform storage unit based on the prosodic information; extracting the pitch waveform from the original speech waveform in an original speech waveform storage unit based on a first error evaluation function defined between the selected target waveform and the pitch waveform to be extracted, and a second error evaluation function defined between a synthesized speech waveform obtained by superimposing the pitch waveform to be extracted and the original speech waveform; and generating synthesized speech by superimposing the pitch waveform at the desired pitch period interval.

A speech synthesis method for generating speech by superimposing, at a desired pitch period interval, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information, the method comprising: selecting, as a target waveform, speech waveform data spanning a time width of a plurality of pitch periods from a target waveform storage unit based on the prosodic information; extracting the pitch waveform from the original speech waveform in an original speech waveform storage unit based on a first error evaluation function defined between the speech waveform data and a speech waveform based on the pitch waveform to be extracted, and a second error evaluation function defined between a synthesized speech waveform obtained by superimposing the pitch waveform to be extracted and the original speech waveform; and generating synthesized speech by superimposing the pitch waveform at the desired pitch period interval.

The speech synthesis method according to claim 1, wherein the target waveform storage unit stores speech segments as candidates for a target waveform to be selected.

A speech synthesis method for generating speech by superimposing, at a desired pitch period interval, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information, the method comprising: selecting a target spectrum from a target spectrum storage unit based on the prosodic information; extracting the pitch waveform from the original speech waveform in an original speech waveform storage unit based on an error evaluation function defined between the selected target spectrum and the spectrum of the pitch waveform to be extracted; and generating synthesized speech by superimposing the pitch waveform at the desired pitch period interval.

A speech synthesis method for generating speech by superimposing, at a desired pitch period interval, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information, the method comprising: selecting a target spectrum from a target spectrum storage unit based on the prosodic information; extracting the pitch waveform from the original speech waveform in an original speech waveform storage unit based on a first error evaluation function defined between the selected target spectrum and the spectrum of the pitch waveform to be extracted, and a second error evaluation function defined between a synthesized speech waveform obtained by superimposing the pitch waveform to be extracted and the original speech waveform; and generating synthesized speech by superimposing the pitch waveform at the desired pitch period interval.

A speech synthesis method for generating speech by superimposing, at a desired pitch period interval, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information, the method comprising: selecting, as a target spectrum, spectrum data including a predetermined frequency component from a target spectrum storage unit based on the prosodic information; extracting the pitch waveform from the original speech waveform in an original speech waveform storage unit based on a first error evaluation function defined between the spectrum data and a spectrum of a speech waveform based on the pitch waveform to be extracted, and a second error evaluation function defined between a synthesized speech waveform obtained by superimposing the pitch waveform to be extracted and the original speech waveform; and generating synthesized speech by superimposing the pitch waveform at the desired pitch period interval.

The speech synthesis method according to claim 5, wherein the target spectrum storage unit stores a spectrum corresponding to a speech segment as a candidate of a target spectrum to be selected.

A speech synthesis method for generating speech by superimposing, at a desired pitch period interval, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information, the method comprising: selecting a target waveform from a target waveform storage unit based on the prosodic information; selecting a pitch waveform based on an error evaluation function defined between the selected target waveform and a pitch waveform extracted by applying a window function to the original speech waveform in an original speech waveform storage unit; and generating synthesized speech by superimposing the pitch waveform at the desired pitch period interval.

A speech synthesis device comprising: a target waveform storage unit that stores target waveforms; an original speech waveform storage unit that stores an original speech waveform; a target waveform selection unit that selects a target waveform from the target waveform storage unit based on at least prosodic information; and a pitch waveform extraction unit that extracts a pitch waveform from the original speech waveform based on an error evaluation function defined between the selected target waveform and the pitch waveform to be extracted from the original speech waveform stored in the original speech waveform storage unit, wherein synthesized speech is generated by superimposing the extracted pitch waveform at a desired pitch period interval.

A computer-readable speech synthesis program for causing a computer to perform a process of generating speech by superimposing, at a desired pitch period interval, a pitch waveform extracted from an original speech waveform based on prosodic information including at least phonemic information, the program comprising: a step of selecting a target waveform from a target waveform storage unit based on the prosodic information; a step of extracting the pitch waveform from the original speech waveform based on an error evaluation function defined between the selected target waveform and the pitch waveform to be extracted; and a step of superimposing the pitch waveform at the desired pitch period interval to generate synthesized speech.