JP4393794B2 - Speech synthesizer

Info

Publication number: JP4393794B2
Application number: JP2003154989A
Authority: JP
Inventors: 正山浦; 真哉高橋
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2003-05-30
Filing date: 2003-05-30
Publication date: 2010-01-06
Anticipated expiration: 2023-05-30
Also published as: JP2004354893A

Description

【０００１】
【発明の属する技術分野】
この発明は、ＶＣＶ（Ｖｏｗｅｌ−Ｃｏｎｓｏｎａｎｔ−Ｖｏｗｅｌ）や、ＣＶ（Ｃｏｎｓｏｎａｎｔ−Ｖｏｗｅｌ）などの音声素片を変形、接続して音声を合成する音声合成装置に関する。
【０００２】
【従来の技術】
従来の音声合成装置は、複数の音声素片に対しそれぞれの韻律パラメータに応じて継続時間長を線形に伸縮する、或いはピッチ周期を変更する等の変形をした後、その変形して得られた音声素片を接続し、得られた一連の音声パラメータを音声信号に変換して合成音声を得ていた（例えば、非特許文献１参照）。
【０００３】
【非特許文献１】
Ｅ．ＭｏｕｌｉｎｅａｎｄＦ．Ｃｈａｒｐｅｎｔｉｒｅ “Ｐｉｔｃｈ−Ｓｙｎｃｈｒｏｎｏｕｓｗａｖｅｆｏｒｍｐｒｏｃｅｓｓｉｎｇｔｅｃｈｎｉｑｕｅｓｆｏｒｔｅｘｔ−ｔｏ−ｓｐｅｅｃｈｓｙｎｔｈｅｓｉｓｕｓｉｎｇｄｉｐｈｏｎｅｓ”ＳｐｅｅｃｈＣｏｍｍｕｎｉｃａｔｉｏｎ，ｖｏｌ．９，ｐｐ．４５３−４６７，Ｄｅｃ１９９０
【０００４】
【発明が解決しようとする課題】
しかしながら上述した従来の音声合成装置では、継続時間長を伸縮して得られる音声素片から生成される音声信号は、その音声特徴の時間変化が本来の音声特徴の時間変化に比較して緩慢あるいは急峻となるので、合成音声の自然性が劣化するという問題があった。特に有声区間を伸長した場合には、合成音声のピッチ波形の時間変化が小さくなりブザー音的な劣化音となることが顕著であった。
【０００５】
また、音声素片の接続部では素片間の不連続が生じる場合があり、合成音声の自然性が劣化するという問題があった。また、この不連続を解消するために素片間の変化を緩慢にする補間処理をもって素片を接続する場合でも、その音声特徴の時間変化が本来の時間変化に比較して緩慢となり、特に有声音区間で接続した場合には、合成音声のピッチ波形の時間変化が小さく、ブザー音的な劣化音となることが顕著であった。
【０００６】
この発明は上記のような課題を解決するためになされたもので、変形した音声素片から生成される音声信号の雑音的な成分の割合を調整することによって合成音声の自然性の劣化をマスクして変形歪を聴覚上で軽減し、また、音声素片の接続部を含む所定の区間で音声の雑音的な成分の割合を調整することによって合成音声の自然性の劣化をマスクして接続歪を聴覚上で軽減する音声合成装置を得ることを目的とする。
【０００７】
【課題を解決するための手段】
この発明に係る音声合成装置は、変形された音声素片を構成する音声パラメータに対し、位相撹乱帯域の音声の位相を撹乱し、音声の雑音成分の割合を制御する変形歪マスク手段を備え、変形歪マスク手段は、変形された音声素片の継続時間長が短い場合には位相撹乱帯域を小さくし、変形された音声素片の継続時間長が長い場合には位相撹乱帯域を大きくするものである。
【０００８】
この発明に係る音声合成装置は、変形された音声素片を構成する音声パラメータに対し、雑音信号を音声に重畳し、音声の雑音成分の割合を制御する変形歪マスク手段を備え、変形歪マスク手段は、変形された音声素片の継続時間長が短い場合には重畳する雑音信号の平均パワーを小さくし、変形された音声素片の継続時間長が長い場合には重畳する雑音信号の平均パワー大きくするものである。
【０００９】
この発明に係る音声合成装置は、変形された音声素片を構成する音声パラメータに対し、雑音信号置換帯域の音声を雑音信号に置換し、音声の雑音成分の割合を制御する変形歪マスク手段を備え、変形歪マスク手段は、変形された音声素片の継続時間長が短い場合には雑音信号置換帯域を小さくし、変形された音声素片の継続時間長が長い場合には雑音信号置換帯域を大きくするものである。
【００１０】
この発明に係る音声合成装置は、音声素片を接続して得られた一連の音声パラメータに対し、位相撹乱帯域の音声の位相を撹乱し、音声素片の接続部を含む所定の区間で音声の雑音成分の割合を制御する接続歪マスク手段を備え、接続歪マスク手段は、接続部に近い時刻では位相撹乱帯域を大きくし、接続部に遠い時刻では位相撹乱帯域を小さくするものである。
【００１１】
この発明に係る音声合成装置は、音声素片を接続して得られた一連の音声パラメータに対し、雑音信号を音声に重畳し、音声素片の接続部を含む所定の区間で音声の雑音成分の割合を制御する接続歪マスク手段を備え、接続歪マスク手段は、接続部に近い時刻では重畳する雑音信号の平均パワーを大きくし、接続部に遠い時刻では重畳する雑音信号の平均パワーを小さくするものである。
この発明に係る音声合成装置は、音声素片を接続して得られた一連の音声パラメータに対し、雑音信号置換帯域の音声を雑音信号に置換し、音声素片の接続部を含む所定の区間で音声の雑音成分の割合を制御する接続歪マスク手段を備え、接続歪マスク手段は、接続部に近い時刻では雑音信号置換帯域を大きくし、接続部に遠い時刻では雑音信号置換帯域を小さくするものである。
【００１２】
【発明の実施の形態】
以下、この発明の実施の一形態について説明する。
実施の形態１．
この発明の実施の形態１に係る音声合成装置について、図１〜図３を参照して説明する。なお、図１は実施の形態１に係る音声合成装置１の構成を示すブロック図であり、図２および図３はこの音声合成装置１による位相撹乱帯域の時間推移の例を示す図である。
【００１３】
この実施の形態１に係る音声合成装置１は図１に示すように、合成する２つの音声素片を変形する変形手段１１および変形手段１２、音声の所定の帯域の位相を撹乱した音声素片を出力する変形歪マスク手段１３および変形歪マスク手段１４、音声素片を接続する接続手段１５、音声の所定の帯域の位相を撹乱する接続歪マスク手段１６、音声パラメータを音声信号に変換し、合成音声として出力する音声生成手段１７を備えて構成される。
【００１４】
変形手段１１は、韻律パラメータＡに応じて音声素片Ａの継続時間長を線形に伸縮し、ピッチ周期を変更して音声素片Ａａに変形する。
また、変形手段１２は、韻律パラメータＢに応じて音声素片Ｂの継続時間長を線形に伸縮し、ピッチ周期を変更して音声素片Ｂａに変形する。
【００１５】
変形歪マスク手段１３は、変形手段１１により変形された音声素片Ａａを構成する音声パラメータに対し音声の所定の帯域の位相を撹乱して音声素片Ａｂを出力する。
また、変形歪マスク手段１４は、変形手段１２により変形された音声素片Ｂａを構成する音声パラメータに対し音声の所定の帯域の位相を撹乱して音声素片Ｂｂを出力する。
【００１６】
接続手段１５は、変形歪マスク手段１３から出力された音声素片Ａｂを構成する音声パラメータと変形歪マスク手段１４から出力された音声素片Ｂｂを構成する音声パラメータを接続して一連の音声パラメータを生成する。
接続歪マスク手段１６は、音声素片Ａｂ，Ｂｂを接続して得られた一連の音声パラメータに対し、音声の所定の帯域の位相を撹乱する。
音声生成手段１７は、接続歪マスク手段１６から出力された音声パラメータを音声信号に変換し、合成音声として出力する。
【００１７】
つぎに音声合成装置１の動作について説明する。
まず、図１に示すように、合成する音声の一方の音声素片Ａを、変形手段１１により韻律パラメータＡに応じて音声素片の継続時間長を線形に伸縮し、またピッチ周期の変更等を行って音声素片Ａａを得る。また同様に合成する音声の他の一方の音声素片Ｂを変形手段１２により韻律パラメータＢに応じて音声素片の継続時間長を線形に伸縮し、またピッチ周期の変更等を行って音声素片Ｂａを得る。
【００１８】
つぎに、変形歪マスク手段１３により音声素片Ａａから生成される音声信号に対して、時刻ｔにおいて周波数Ｆｍ（ｔ）より高域の信号の位相を撹乱した音声信号が得られる音声素片Ａｂを求め、同様に変形歪マスク手段１４により音声素片Ｂａから生成される音声信号に対して、時刻ｔにおいて周波数Ｆｍ（ｔ）より高域の信号の位相を撹乱した音声信号が得られる音声素片Ｂｂを求め、夫々の音声素片Ａｂ、Ｂｂを接続手段１５に出力する。
【００１９】
変形歪マスク手段１３，１４における具体的な処理は、音声素片を表現するパラメータに応じたものとする。例えば、ピッチ時間波形を音声素片のパラメータとして持つピッチ波形重畳型の場合、音声素片Ａａ，Ｂａの各ピッチ時間波形をフーリエ変換などにより周波数領域に変換、周波数Ｆｍ（ｔ）より高域の位相をランダムに回転した後に逆フーリエ変換して得られるピッチ時間波形を音声素片Ａｂ、Ｂｂとして処理し、また、ピッチ、高調波の振幅・位相を音声素片のパラメータとしてもつ正弦波合成型の場合、音声素片Ａａ，Ｂａの周波数Ｆｍ（ｔ）より高域の高調波の位相をランダムな値に変更したパラメータを音声素片Ａｂ，Ｂｂとして処理する。
【００２０】
ここで、周波数Ｆｍ（ｔ）は、例えば式（１）に従い決定する。
【数１】

式（１）のＦｈ，Ｆｌは位相撹乱帯域下限が取り得る最高周波数と最低周波数である。また、ｔ１，ｔ２はそれぞれ位相撹乱帯域幅の変化の開始時刻、終了時刻である。
【００２１】
図２の式（１）に従い決定した位相撹乱帯域の時間推移の例を示す。これは、継続時間長がｔ１より短くて変形による影響が少ない素片では位相撹乱帯域を小さくして変形歪のマスクを弱くし、変形により継続時間長がｔ１より長くて不自然な合成音が長時間に渡る素片ではｔ１以降の区間で位相撹乱帯域を拡大することにより強力に変形歪をマスクするものである。
【００２２】
周波数Ｆｍ（ｔ）を決定するパラメータである上述したＦｈ，Ｆｌ，ｔ１，ｔ２は、例えば音声素片毎に予め調整して設定しておく。例えば、変形によりブザー音的になる傾向が強い撥音／Ｎ／や摩擦性有声子音／ｚ／などではＦｌを低く設定することにより、これらに対してはより強力に変形歪をマスクすることが可能となる。
【００２３】
上述したようにして変形された音声素片から生成される合成音声の位相を撹乱することにより、変形された音声素片から生成される音声信号がブザー音的になるなどの自然性の劣化を聴覚上マスクすることができ、劣化の少ない合成音声を得ることができる。
【００２４】
変形歪マスク手段１３から出力された音声素片Ａｂ、および変形歪マスク手段１４から出力された音声素片Ｂｂは、接続手段１５に入力される。接続手段１５は、音声素片Ａｂ，Ｂｂを構成する音声パラメータを接続して一連の音声パラメータを生成し、その一連の音声パラメータを接続歪マスク手段１６へ出力する。
【００２５】
その後、接続歪マスク手段１６では、接続手段１５から入力された一連の音声パラメータに対して時刻ｔでは周波数Ｆｃ（ｔ）より高域の信号の位相を撹乱した音声信号が得られる音声パラメータを求め音声生成手段１７に出力する。
【００２６】
接続歪マスク手段１６における具体的な処理は、音声パラメータに応じたものとする。例えば、ピッチ時間波形をパラメータとして持つピッチ波形重畳型の場合、接続手段１５から入力された音声パラメータの各ピッチ時間波形をフーリエ変換などにより周波数領域に変換、周波数Ｆｃ（ｔ）より高域の位相をランダムに回転した後に逆フーリエ変換して得られるピッチ時間波形を、音声パラメータとして音声生成手段１７に出力する。また、例えば、ピッチ、高調波の振幅・位相を音声素片のパラメータとしてもつ正弦波合成型の場合、接続手段１５から入力された音声パラメータの周波数Ｆｃ（ｔ）より高域の高調波の位相をランダムな値に変更し、音声パラメータとして音声生成手段１７に出力する。
【００２７】
ここで、周波数Ｆｃ（ｔ）は、例えば、式（２）に従い決定する。
【数２】

式（２）のＦｈ_Ａ，Ｆｈ_Ｂはそれぞれ音声素片Ａｂ，Ｂｂの位相撹乱帯域下限が取り得る最高周波数であり、Ｆｌは位相撹乱帯域下限が取り得る最低周波数である。また、ｔｃは音声素片Ａｂ，Ｂｂの接続部であり、ｔａ，ｔｂはそれぞれ位相撹乱帯域幅の変化の開始時刻、終了時刻である。
【００２８】
式（２）に従い決定した位相撹乱帯域の時間推移の例を図３に示す。接続部ｔｃを中心に経過時刻に応じて位相撹乱帯域を拡大することにより、接続による不連続や、補間処理による歪が発生する区間においてより強力に接続歪をマスクすることができる。周波数Ｆｃ（ｔ）を決定するパラメータであるＦｈ_Ａ，Ｆｈ_Ｂ，ｔａ，ｔｂは、例えば接続する音声素片の組み合わせ毎に、予め調整して設定しておく。
【００２９】
上述したようにして音声素片を接続し、得られた音声パラメータから生成される合成音声の位相を撹乱することにより、音声素片の接続部で合成音声が不連続になる、あるいは接続の補間処理により合成音声がブザー音的になるなどの自然性の劣化を聴覚上マスクすることができ、劣化の少ない合成音声を得ることができる。
【００３０】
音声生成手段１７は、接続歪マスク手段１６から出力された音声パラメータを音声信号に変換し、合成音声として出力する。
【００３１】
以上のように、この実施の形態１によれば、音声素片の変形に伴う聴覚上の歪を音声信号の位相を調整することによりマスクすることができ、合成音声の自然性の劣化を軽減することができる。また、音声素片の接続に伴う聴覚上の歪を音声信号の位相を調整することによりマスクすることができ、合成音声の自然性の劣化を軽減することができる。
【００３２】
なお、上述した例では変形歪マスク手段１３，１４と、接続歪マスク手段１６とを含む音声合成装置１を示したが、変形歪マスク手段１３，１４のみを含む構成、或いは接続歪マスク手段１６のみを含む構成の音声合成装置も可能である。
【００３３】
また、上述した例では変形手段１１、変形手段１２をそれぞれ独立して持つ構成を示したが、一つの変形手段を共有して用いる構成も可能であり、また、変形歪マスク手段１３，１４についても同様に、一つの変形歪マスク手段を共有して用いる構成も可能である。
【００３４】
実施の形態２．
この発明の実施の形態２に係る音声合成装置２について、図４〜図６を参照して説明する。なお、図４は実施の形態２に係る音声合成装置２の構成を示すブロック図であり、図５および図６はこの音声合成装置２による重畳雑音信号の平均パワーの時間推移の例を示す図である。
【００３５】
実施の形態２に係る音声合成装置２は、変形された音声素片を構成する音声パラメータに対し、音声に雑音信号を重畳したものとすることにより合成音声の変形歪をマスクするものであり、また、音声素片を接続して得られた一連の音声パラメータに対し、音声に雑音信号を重畳することにより合成音声の接続歪をマスクするものである。その他の構成と動作は実施の形態１と同様であり、その説明を省略する。
【００３６】
図４に示すように音声合成装置２を構成する変形歪マスク手段２３は変形された音声素片Ａａを構成する音声パラメータに対し音声に所定の雑音信号を重畳する手段であり、また変形歪マスク手段２４は変形された音声素片Ｂａを構成する音声パラメータに対し音声に所定の雑音信号を重畳する手段である。接続歪マスク手段２６は音声素片を接続して得られた一連の音声パラメータに対し音声に所定の雑音信号を重畳する手段である。
【００３７】
変形歪マスク手段２３、変形歪マスク手段２４では、音声素片Ａａ、音声素片Ｂａから生成される音声信号に対して時刻ｔでは平均パワーＰ（ｔ）である雑音信号を重畳した音声信号が得られる音声素片Ａｂ、音声素片Ｂｂを求め、接続手段１５に出力する。
【００３８】
変形歪マスク手段２３，２４における具体的な処理は、音声素片を表現するパラメータに応じたものとする。例えば、ピッチ時間波形を音声素片のパラメータとして持つピッチ波形重畳型の場合、音声素片Ａａ，Ｂａの各ピッチ時間波形に雑音信号を加算して得られるピッチ時間波形を音声素片Ａｂ，Ｂｂとして処理し、また、ピッチ、高調波の振幅・位相を音声素片のパラメータとしてもつ正弦波合成型の場合、音声素片Ａａ，Ｂａの高調波振幅・位相にそれぞれ雑音信号の振幅スペクトル・位相スペクトルを加算したパラメータを音声素片Ａｂ，Ｂｂとして処理する。
【００３９】
ここで、重畳する雑音信号の平均パワーＰ（ｔ）を、例えば式（３）に従い決定する。
【数３】

式（３）におけるＰｌ，Ｐｈは重畳する雑音信号の最低平均パワーと最高平均パワーである。また、ｔ１，ｔ２はそれぞれ重畳する雑音信号の平均パワーの変化の開始時刻、終了時刻である。
【００４０】
式（３）に従い決定した重畳雑音信号の平均パワーの時間推移の例について図５を参照して説明する。これは、継続時間長がｔ１より短くて変形による影響が少ない素片では重畳雑音信号の平均パワーを小さくして変形歪のマスクを弱くし、変形により継続時間長がｔ１より長くて不自然な合成音が長時間に渡る素片ではｔ１以降の区間で重畳雑音信号の平均パワーを増加することにより強力に変形歪をマスクするものである。
【００４１】
重畳雑音信号の平均パワーＰ（ｔ）を決定するパラメータであるＰｌ，Ｐｈ，ｔ１，ｔ２は音声素片毎に予め調整して設定しておく。例えば、変形によりブザー音的になる傾向が強い撥音／Ｎ／や摩擦性有声子音／ｚ／などではＰｈを大きく設定することにより、これらに対してはより強力に変形歪をマスクすることが可能となる。
【００４２】
上述したようにして変形された音声素片から生成される合成音声に雑音信号を重畳することにより、変形された音声素片から生成される音声信号がブザー音的になるなどの自然性の劣化を聴覚上マスクすることができ、劣化の少ない合成音声を得ることができる。
【００４３】
変形歪マスク手段２３から出力された音声素片Ａｂと変形歪マスク手段２４から出力された音声素片Ｂｂは、接続手段１５に入力される。接続手段１５は、音声素片Ａｂ，Ｂｂを構成する音声パラメータを接続して一連の音声パラメータを生成し、その一連の音声パラメータを接続歪マスク手段２６へ出力する。
【００４４】
つぎに、接続歪マスク手段２６では、接続手段１５から入力された一連の音声パラメータに対して時刻ｔでは平均パワーＰ（ｔ）である雑音信号を重畳した音声信号が得られる音声パラメータを求め、音声生成手段１７に出力する。
【００４５】
接続歪マスク手段２６における具体的な処理は、音声パラメータに応じたものとする。例えば、ピッチ時間波形をパラメータとして持つピッチ波形重畳型の場合、接続手段１５から入力された音声パラメータの各ピッチ時間波形に雑音信号を加算して得られるピッチ時間波形を、音声パラメータとして音声生成手段１７に出力する。また、例えば、ピッチ、高調波の振幅・位相を音声素片のパラメータとしてもつ正弦波合成型の場合、接続手段１５から入力された音声パラメータの高調波振幅・位相にそれぞれ雑音信号の振幅スペクトル・位相スペクトルを加算し、音声パラメータとして音声生成手段１７に出力する。
【００４６】
ここで、重畳する雑音信号の平均パワーＰｃ（ｔ）は、例えば式（４）に従い決定する。
【数４】

式（４）におけるＰｌ_A，Ｐｌ_Bはそれぞれ音声素片Ａｂ，Ｂｂの重畳雑音信号平均パワーが取り得る最小値であり、Ｐｈは重畳雑音平均パワーが取り得る最大値である。また、ｔｃは音声素片Ａｂ，Ｂｂの接続部であり、ｔａ，ｔｂはそれぞれ重畳雑音信号平均パワーの変化の開始時刻、終了時刻である。
【００４７】
図６に式（４）に従い決定した重畳雑音信号の平均パワーの時間推移の例を示す。接続部ｔｃを中心に経過時刻に応じて重畳雑音信号の平均パワーを増加することにより、接続による不連続や、補間処理による歪が発生する区間においてより強力に接続歪をマスクすることができる。平均パワーＰｃ（ｔ）を決定するパラメータであるＰｌ_A，Ｐｌ_B，ｔａ，ｔｂは、例えば接続する音声素片の組み合わせ毎に、予め調整して設定しておく。
【００４８】
上述したように音声素片を接続して得られた音声パラメータから生成される合成音声に雑音信号を重畳することにより、音声素片の接続部で合成音声が不連続になる、あるいは接続の補間処理により合成音声がブザー音的になるなどの自然性の劣化を聴覚上マスクすることができ、劣化の少ない合成音声を得ることができる。
【００４９】
以上のように、この実施の形態２によれば、音声素片の変形に伴う聴覚上の歪を音声信号に雑音信号を重畳することによりマスクすることができ、合成音声の自然性の劣化を軽減することができる。また、音声素片の接続に伴う聴覚上の歪を音声信号に雑音信号を重畳することによりマスクすることができ、合成音声の自然性の劣化を軽減することができる。
【００５０】
なお、上述したように変形歪マスク手段２３、変形歪マスク手段２４と接続歪マスク手段２６とを含む音声合成装置２を示したが、変形歪マスク手段２３，２４のみを含む構成、または接続歪マスク手段２６のみを含む構成も可能である。
【００５１】
また、変形手段１１、変形手段１２をそれぞれ独立して持つ構成を示したが、一つの変形手段を共有して用いる構成も可能であり、変形歪マスク手段２３、変形歪マスク手段２４も同様に、一つの変形歪マスク手段を共有して用いる構成も可能である。
【００５２】
実施の形態３．
この発明の実施の形態３に係る音声合成装置３について、図７〜図９を参照して説明する。なお、図７は実施の形態３に係る音声合成装置３の構成を示すブロック図であり、図８および図９はこの音声合成装置による雑音信号置換帯域の時間推移の例を示す図である。
【００５３】
実施の形態３は、変形された音声素片を構成する音声パラメータに対し、音声の所定の周波数帯域を雑音信号に置換したことにより合成音声の変形歪をマスクするものであり、また、音声素片を接続して得られた一連の音声パラメータに対し、音声の所定の周波数帯域を雑音信号に置換したことにより合成音声の接続歪をマスクするものである。その他の構成と動作は実施の形態１と同様であり、その説明を省略する。
【００５４】
音声合成装置３は図７に示すように、変形された音声素片Ａａを構成する音声パラメータに対し音声の所定の周波数帯域を雑音信号に置換したものとする変形歪マスク手段３３と、変形された音声素片Ｂａを構成する音声パラメータに対し音声の所定の周波数帯域を雑音信号に置換したものとする変形歪マスク手段３４と、音声素片を接続して得られた一連の音声パラメータに対し音声の所定の周波数帯域を雑音信号に置換したものとする接続歪マスク手段３６が備わる。
【００５５】
変形歪マスク手段３３、変形歪マスク手段３４では、音声素片Ａａ，Ｂａから生成される音声信号に対して時刻ｔでは周波数Ｆｍ（ｔ）より高域の信号を雑音信号に置換した音声信号が得られる音声素片Ａｂ，Ｂｂを求め、接続手段１５に出力する。
【００５６】
変形歪マスク手段３３，３４における具体的な処理は、音声素片を表現するパラメータに応じたものとする。例えば、ピッチ時間波形を音声素片のパラメータとして持つピッチ波形重畳型の場合、音声素片Ａａ，Ｂａの各ピッチ時間波形をフーリエ変換などにより周波数領域に変換、周波数Ｆｍ（ｔ）より高域を雑音信号のスペクトルに置換した後に逆フーリエ変換して得られるピッチ時間波形を音声素片Ａｂ，Ｂｂとして処理し、また、ピッチ、高調波の振幅・位相を音声素片のパラメータとしてもつ正弦波合成型の場合、音声素片Ａａ，Ｂａの周波数Ｆｍ（ｔ）より高域の高調波の振幅・位相を雑音信号のスペクトルに置換したパラメータを音声素片Ａｂ，Ｂｂとして処理する。
【００５７】
ここで周波数Ｆｍ（ｔ）は、例えば式（５）に従い決定する。
【数５】

式（５）におけるＦｈ，Ｆｌは雑音信号置換帯域下限が取り得る最高周波数と最低周波数である。また、ｔ１，ｔ２はそれぞれ雑音信号置換帯域幅の変化の開始時刻、終了時刻である。
【００５８】
図８に式（５）に従い決定した雑音信号置換帯域の時間推移の例を示す。これは、継続時間長がｔ１より短くて変形による影響が少ない素片では雑音信号置換帯域を小さくして変形歪のマスクを弱くし、一方、変形により継続時間長がｔ１より長くて不自然な合成音が長時間に渡る素片ではｔ１以降の区間で雑音信号置換帯域を拡大することにより強力に変形歪をマスクするものである。周波数Ｆｍ（ｔ）を決定するパラメータであるＦｈ，Ｆｌ，ｔ１，ｔ２は、音声素片毎に予め調整して設定しておく。例えば、変形によりブザー音的になる傾向が強い撥音／Ｎ／や摩擦性有声子音／ｚ／などではＦｌを低く設定することにより、これらに対してはより強力に変形歪をマスクすることが可能となる。
【００５９】
上述したように変形された音声素片から生成される合成音声の所定の周波数帯域を雑音信号に置換することにより、変形された音声素片から生成される音声信号がブザー音的になるなどの自然性の劣化を聴覚上マスクすることができ、劣化の少ない合成音声を得ることができる。
【００６０】
変形歪マスク手段３３、変形歪マスク手段３４から出力された音声素片Ａｂ、Ｂｂは、接続手段１５に入力される。接続手段１５は、音声素片Ａｂ，Ｂｂを構成する音声パラメータを接続して一連の音声パラメータを生成し、その一連の音声パラメータを接続歪マスク手段３６へ出力する。接続歪マスク手段３６では、接続手段１５から入力された一連の音声パラメータに対して時刻ｔでは周波数Ｆｃ（ｔ）より高域の信号を雑音信号に置換した音声信号が得られる音声パラメータを求め音声生成手段１７に出力する。
【００６１】
接続歪マスク手段３６における具体的な処理は、音声パラメータに応じたものとする。例えば、ピッチ時間波形をパラメータとして持つピッチ波形重畳型の場合、接続手段１５から入力された音声パラメータの各ピッチ時間波形をフーリエ変換などにより周波数領域に変換、周波数Ｆｃ（ｔ）より高域を雑音信号のスペクトルに置換した後に逆フーリエ変換して得られるピッチ時間波形を、音声パラメータとして音声生成手段１７に出力する。また、例えば、ピッチ、高調波の振幅・位相を音声素片のパラメータとしてもつ正弦波合成型の場合、接続手段１５から入力された音声パラメータの周波数Ｆｃ（ｔ）より高域の高調波の振幅・位相を雑音信号のスペクトルに置換し、音声パラメータとして音声生成手段１７に出力する。
【００６２】
ここで周波数Ｆｃ（ｔ）は、例えば式（６）に従い決定する。
【数６】

式（６）におけるＦｈ_A，Ｆｈ_Bはそれぞれ音声素片Ａｂ，Ｂｂの雑音信号置換帯域下限が取り得る最高周波数であり、Ｆｌは雑音信号置換帯域下限が取り得る最低周波数である。また、ｔｃは音声素片Ａｂ，Ｂｂの接続部であり、ｔａ，ｔｂはそれぞれ雑音信号置換帯域幅の変化の開始時刻、終了時刻である。
【００６３】
図９に式（６）に従い決定した雑音信号置換帯域の時間推移の例を示す。接続部ｔｃを中心に経過時刻に応じて雑音信号置換帯域を拡大することにより、接続による不連続や、補間処理による歪が発生する区間においてより強力に接続歪をマスクすることができる。周波数Ｆｃ（ｔ）を決定するパラメータであるＦｈ_A，Ｆｈ_B，ｔａ，ｔｂは、例えば接続する音声素片の組み合わせ毎に、予め調整して設定しておく。
【００６４】
上述したように音声素片を接続して得られた音声パラメータから生成される合成音声の所定の周波数帯域を雑音信号に置換することにより、音声素片の接続部で合成音声が不連続になる、あるいは接続の補間処理により合成音声がブザー音的になるなどの自然性の劣化を聴覚上マスクすることができ、劣化の少ない合成音声を得ることができる。
【００６５】
以上のように、この実施の形態３によれば、音声素片の変形に伴う聴覚上の歪を音声信号の所定の周波数帯域を雑音信号に置換することによりマスクすることができ、合成音声の自然性の劣化を軽減することができる。また、音声素片の接続に伴う聴覚上の歪を音声信号の所定の周波数帯域を雑音信号に置換することによりマスクすることができ、合成音声の自然性の劣化を軽減することができる。
【００６６】
なお、変形歪マスク手段３３，３４と接続歪マスク手段３６とを含む音声合成装置３を示したが、変形歪マスク手段３３，３４のみを含む構成、或いは接続歪マスク手段３６のみを含む構成も可能である。
【００６７】
また、変形手段１１、変形手段１２をそれぞれ独立して持つ構成を示したが、一つの変形手段を共有して用いる構成も可能であり、さらに変形歪マスク手段３３と変形歪マスク手段３４も一つの変形歪マスク手段を共有して用いる構成も可能である。
【００６８】
【発明の効果】
以上のように、この発明によれば、音声素片の変形に伴う聴覚上の歪を音声信号の位相を調整することによりマスクすることができ、合成音声の自然性の劣化を軽減することができる。
この発明によれば、音声素片の接続に伴う聴覚上の歪を音声信号の位相を調整することによりマスクすることができ、合成音声の自然性の劣化を軽減することができる。
【図面の簡単な説明】
【図１】本発明の実施の形態１に係る音声合成装置の構成を示すブロック図である。
【図２】実施の形態１に係る音声合成装置の、位相撹乱帯域の時間推移の例を示す図である。
【図３】実施の形態１に係る音声合成装置の、位相撹乱帯域の時間推移の例を示す図である。
【図４】本発明の実施の形態２に係る音声合成装置の構成を示すブロック図である。
【図５】実施の形態２に係る音声合成装置の、重畳雑音信号の平均パワーの時間推移の例を示す図である。
【図６】実施の形態２に係る音声合成装置の、重畳雑音信号の平均パワーの時間推移の例を示す図である。
【図７】本発明の実施の形態３に係る音声合成装置の構成を示すブロック図である。
【図８】実施の形態３に係る音声合成装置の、雑音信号置換帯域の時間推移の例を示す図である。
【図９】実施の形態３に係る音声合成装置の、雑音信号置換帯域の時間推移の例を示す図である。
【符号の説明】
１，２，３音声合成装置、１１，１２変形手段、１３，１４，２３，２４，３３，３４変形歪マスク手段、１５接続手段、１６，２６，３６接続歪マスク手段、１７音声生成手段。
[0001]
[Technical Field of the Invention]
The present invention relates to a speech synthesizer that synthesizes speech by deforming and connecting speech segments such as VCV (Vowel-Consonant-Vowel) and CV (Consonant-Vowel) units.
[0002]
[Prior art]
A conventional speech synthesizer deforms each of a plurality of speech segments according to its prosodic parameters, for example by linearly stretching or compressing its duration or by changing its pitch period. It then connects the deformed speech segments and converts the resulting series of speech parameters into a speech signal to obtain synthesized speech (see, for example, Non-Patent Document 1).
[0003]
[Non-Patent Document 1]
E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9, pp. 453-467, Dec. 1990
[0004]
[Problems to be solved by the invention]
However, in the conventional speech synthesizer described above, the speech signal generated from a speech segment whose duration has been stretched or compressed has speech features that change over time more slowly or more steeply than in the original speech, so the naturalness of the synthesized speech deteriorates. In particular, when a voiced section is lengthened, the pitch waveform of the synthesized speech changes little over time, and the result is markedly degraded into a buzzer-like sound.
[0005]
In addition, discontinuities between segments can arise at the connection point of speech segments, which also degrades the naturalness of the synthesized speech. Even when segments are connected with interpolation that smooths the change between them in order to remove this discontinuity, the speech features change more slowly over time than they originally would. In particular, when the connection falls in a voiced section, the pitch waveform of the synthesized speech changes little over time, and the result is markedly degraded into a buzzer-like sound.
[0006]
The present invention has been made to solve the problems described above. Its object is to obtain a speech synthesizer that perceptually reduces deformation distortion by adjusting the proportion of noise-like components in the speech signal generated from a deformed speech segment, thereby masking the loss of naturalness in the synthesized speech, and that perceptually reduces connection distortion by adjusting the proportion of noise-like components of the speech in a predetermined section that includes the connection point of speech segments, thereby masking the loss of naturalness in the synthesized speech.
[0007]
[Means for Solving the Problems]
A speech synthesizer according to the present invention comprises deformation distortion mask means that, for the speech parameters constituting a deformed speech segment, disturbs the phase of the speech in a phase disturbance band and thereby controls the proportion of the noise component of the speech. The deformation distortion mask means narrows the phase disturbance band when the duration of the deformed speech segment is short and widens it when the duration is long.
[0008]
A speech synthesizer according to the present invention comprises deformation distortion mask means that, for the speech parameters constituting a deformed speech segment, superimposes a noise signal on the speech and thereby controls the proportion of the noise component of the speech. The deformation distortion mask means lowers the average power of the superimposed noise signal when the duration of the deformed speech segment is short and raises it when the duration is long.
[0009]
A speech synthesizer according to the present invention comprises deformation distortion mask means that, for the speech parameters constituting a deformed speech segment, replaces the speech in a noise signal replacement band with a noise signal and thereby controls the proportion of the noise component of the speech. The deformation distortion mask means narrows the noise signal replacement band when the duration of the deformed speech segment is short and widens it when the duration is long.
[0010]
A speech synthesizer according to the present invention comprises connection distortion mask means that, for a series of speech parameters obtained by connecting speech segments, disturbs the phase of the speech in a phase disturbance band and thereby controls the proportion of the noise component of the speech in a predetermined section that includes the connection point of the speech segments. The connection distortion mask means widens the phase disturbance band at times close to the connection point and narrows it at times far from the connection point.
[0011]
A speech synthesizer according to the present invention comprises connection distortion mask means that, for a series of speech parameters obtained by connecting speech segments, superimposes a noise signal on the speech and thereby controls the proportion of the noise component of the speech in a predetermined section that includes the connection point of the speech segments. The connection distortion mask means raises the average power of the superimposed noise signal at times close to the connection point and lowers it at times far from the connection point.
A speech synthesizer according to the present invention comprises connection distortion mask means that, for a series of speech parameters obtained by connecting speech segments, replaces the speech in a noise signal replacement band with a noise signal and thereby controls the proportion of the noise component of the speech in a predetermined section that includes the connection point of the speech segments. The connection distortion mask means widens the noise signal replacement band at times close to the connection point and narrows it at times far from the connection point.
[0012]
[Embodiments of the Invention]
Hereinafter, an embodiment of the present invention will be described.
Embodiment 1.
A speech synthesizer according to Embodiment 1 of the present invention is described with reference to FIGS. 1 to 3. FIG. 1 is a block diagram showing the configuration of the speech synthesizer 1 according to Embodiment 1, and FIGS. 2 and 3 show examples of how the phase disturbance band changes over time in the speech synthesizer 1.
[0013]
As shown in FIG. 1, the speech synthesizer 1 according to Embodiment 1 comprises deformation means 11 and deformation means 12, which deform the two speech segments to be synthesized; deformation distortion mask means 13 and deformation distortion mask means 14, which output speech segments in which the phase of a predetermined band of the speech has been disturbed; connection means 15, which connects the speech segments; connection distortion mask means 16, which disturbs the phase of a predetermined band of the speech; and speech generation means 17, which converts the speech parameters into a speech signal and outputs it as synthesized speech.
[0014]
The deformation means 11 linearly stretches or compresses the duration of the speech segment A according to the prosodic parameter A, changes its pitch period, and thereby deforms it into the speech segment Aa.
Likewise, the deformation means 12 linearly stretches or compresses the duration of the speech segment B according to the prosodic parameter B, changes its pitch period, and thereby deforms it into the speech segment Ba.
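As a rough illustration of the linear duration change performed by the deformation means, the sketch below resamples a sequence of per-frame speech parameters to a target length with a linear time map. The function name `stretch_linear` and the list-of-frames representation are assumptions made for this example; the text does not specify how the stretching is implemented, and the pitch-period change is not shown.

```python
def stretch_linear(frames, target_len):
    """Resample a list of parameter frames to target_len frames.

    Hypothetical helper: output index i is mapped linearly onto the input
    time axis and the nearest input frame is copied.
    """
    if not frames or target_len <= 0:
        return []
    if len(frames) == 1:
        return [frames[0]] * target_len
    if target_len == 1:
        return [frames[0]]
    scale = (len(frames) - 1) / (target_len - 1)
    # int(x + 0.5) rounds half up, avoiding Python's round-half-to-even.
    return [frames[min(len(frames) - 1, int(i * scale + 0.5))]
            for i in range(target_len)]
```

Lengthening a segment this way repeats frames, so the parameters vary less over time than in the original recording, which is the kind of degradation the mask means described next are intended to hide.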
[0015]
The deformation distortion mask means 13 disturbs the phase of a predetermined band of the speech for the speech parameters constituting the speech segment Aa deformed by the deformation means 11, and outputs the speech segment Ab.
Likewise, the deformation distortion mask means 14 disturbs the phase of a predetermined band of the speech for the speech parameters constituting the speech segment Ba deformed by the deformation means 12, and outputs the speech segment Bb.
[0016]
The connection means 15 connects the speech parameters constituting the speech segment Ab output from the deformation distortion mask means 13 with the speech parameters constituting the speech segment Bb output from the deformation distortion mask means 14, generating a series of speech parameters.
The connection distortion mask means 16 disturbs the phase of a predetermined band of the speech for the series of speech parameters obtained by connecting the speech segments Ab and Bb.
The speech generation means 17 converts the speech parameters output from the connection distortion mask means 16 into a speech signal and outputs it as synthesized speech.
[0017]
Next, the operation of the speech synthesizer 1 will be described.
First, as shown in FIG. 1, the deformation means 11 takes one of the speech segments to be synthesized, the speech segment A, linearly stretches or compresses its duration according to the prosodic parameter A, and changes its pitch period and so on, to obtain the speech segment Aa. Similarly, the deformation means 12 takes the other speech segment B, linearly stretches or compresses its duration according to the prosodic parameter B, and changes its pitch period and so on, to obtain the speech segment Ba.
[0018]
Next, the deformation distortion mask means 13 obtains the speech segment Ab, which yields the speech signal generated from the speech segment Aa with the phase of the components above the frequency Fm(t) disturbed at each time t. Similarly, the deformation distortion mask means 14 obtains the speech segment Bb, which yields the speech signal generated from the speech segment Ba with the phase of the components above the frequency Fm(t) disturbed at each time t. The speech segments Ab and Bb are then output to the connection means 15.
[0019]
The specific processing in the deformation distortion mask means 13 and 14 depends on the parameters used to represent the speech segments. For example, in the pitch waveform overlap-add type, where pitch time waveforms are the speech segment parameters, each pitch time waveform of the speech segments Aa and Ba is transformed into the frequency domain by a Fourier transform or the like, the phase above the frequency Fm(t) is randomly rotated, and the pitch time waveforms obtained by the inverse Fourier transform are used as the speech segments Ab and Bb. In the sinusoidal synthesis type, where the pitch and the amplitudes and phases of the harmonics are the speech segment parameters, the parameters obtained by changing the phases of the harmonics of the speech segments Aa and Ba above the frequency Fm(t) to random values are used as the speech segments Ab and Bb.
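The pitch-waveform case can be sketched as follows, under stated assumptions: a single pitch time waveform is a list of real samples, a naive DFT stands in for "a Fourier transform or the like", and the names `dft`, `idft_real`, `disturb_phase`, `cutoff_hz`, and `sample_rate` are hypothetical. This is a minimal illustration of randomly rotating the phase above a cutoff, not the patented implementation.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), enough for one pitch waveform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]


def idft_real(spectrum):
    """Inverse DFT that returns the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[m] * cmath.exp(2j * math.pi * m * k / n)
                for m in range(n)).real / n
            for k in range(n)]


def disturb_phase(waveform, cutoff_hz, sample_rate, rng=random):
    """Randomly rotate the phase of every component above cutoff_hz."""
    n = len(waveform)
    spectrum = dft(waveform)
    # Positive-frequency bins only; DC and (for even n) Nyquist stay untouched.
    for m in range(1, (n - 1) // 2 + 1):
        if m * sample_rate / n > cutoff_hz:
            spectrum[m] *= cmath.exp(1j * rng.uniform(-math.pi, math.pi))
            # The mirror bin keeps conjugate symmetry so the output stays real.
            spectrum[n - m] = spectrum[m].conjugate()
    return idft_real(spectrum)
```

Because only the phase is rotated, the magnitude spectrum of each pitch waveform is preserved while the waveform shape above the cutoff becomes noise-like.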
[0020]
Here, the frequency Fm (t) is determined according to, for example, the equation (1).
[Equation 1]

Fh and Fl in Equation (1) are the highest and lowest frequencies that the lower limit of the phase disturbance band can take. t1 and t2 are the start time and the end time of the change in the phase disturbance bandwidth, respectively.
[0021]
FIG. 2 shows an example of the time transition of the phase disturbance band determined according to Equation (1). For a segment whose duration is shorter than t1 and which is therefore little affected by deformation, the phase disturbance band is kept narrow and the masking of deformation distortion is weak. For a segment whose duration after deformation is longer than t1, so that unnatural synthesized sound persists for a long time, the phase disturbance band is widened in the section after t1 to mask the deformation distortion strongly.
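Equation (1) itself is not reproduced in this text. One form consistent with the description is a band lower limit that stays at Fh until t1, falls to Fl by t2, and stays at Fl afterwards. The sketch below assumes a linear fall between t1 and t2; that interpolation, and the name `band_lower_limit`, are assumptions for illustration only.

```python
def band_lower_limit(t, t1, t2, f_high, f_low):
    """Lower limit Fm(t) of the phase disturbance band at time t.

    Assumed shape: f_high before t1, linear fall to f_low between t1 and t2,
    f_low after t2. A lower limit means a wider disturbed band.
    """
    if t2 <= t1:
        raise ValueError("t2 must be later than t1")
    if t <= t1:
        return f_high
    if t >= t2:
        return f_low
    return f_high + (f_low - f_high) * (t - t1) / (t2 - t1)
```

With this shape, a segment shorter than t1 is only disturbed above Fh, while a longer segment is disturbed over a progressively wider band.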
[0022]
The parameters Fh, Fl, t1, and t2 that determine the frequency Fm(t) are, for example, adjusted and set in advance for each speech segment. For instance, for sounds that tend strongly to become buzzer-like when deformed, such as the moraic nasal /N/ and the voiced fricative /z/, setting Fl low makes it possible to mask the deformation distortion more strongly.
[0023]
By disturbing the phase of the synthesized speech generated from a deformed speech segment in this way, loss of naturalness, such as the speech signal generated from the deformed speech segment becoming buzzer-like, can be perceptually masked, and synthesized speech with little degradation can be obtained.
[0024]
The speech segment Ab output from the deformation distortion mask means 13 and the speech segment Bb output from the deformation distortion mask means 14 are input to the connection means 15. The connection means 15 connects the speech parameters constituting the speech segments Ab and Bb to generate a series of speech parameters, and outputs the series of speech parameters to the connection distortion mask means 16.
[0025]
Thereafter, for the series of speech parameters input from the connection means 15, the connection distortion mask means 16 obtains speech parameters that yield a speech signal in which the phase of the components above the frequency Fc(t) is disturbed at each time t, and outputs them to the speech generation means 17.
[0026]
The specific processing in the connection distortion mask means 16 depends on the speech parameters. For example, in the pitch waveform overlap-add type, where pitch time waveforms are the parameters, each pitch time waveform of the speech parameters input from the connection means 15 is transformed into the frequency domain by a Fourier transform or the like, the phase above the frequency Fc(t) is randomly rotated, and the pitch time waveforms obtained by the inverse Fourier transform are output to the speech generation means 17 as the speech parameters. In the sinusoidal synthesis type, where the pitch and the amplitudes and phases of the harmonics are the speech segment parameters, the phases of the harmonics above the frequency Fc(t) in the speech parameters input from the connection means 15 are changed to random values, and the result is output to the speech generation means 17 as the speech parameters.
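For the sinusoidal synthesis type, the operation reduces to overwriting harmonic phases. The following is a minimal sketch assuming one frame is described by a fundamental frequency and a list of harmonic phases in radians, with harmonic k (counting from 1) at k times the fundamental; the function name and this frame layout are hypothetical.

```python
import math
import random


def randomize_harmonic_phases(f0_hz, phases, cutoff_hz, rng=random):
    """Return a copy of phases with harmonics above cutoff_hz set to random values.

    phases[k] is the phase of harmonic k + 1, whose frequency is (k + 1) * f0_hz.
    Amplitudes are not passed in because they are left unchanged.
    """
    return [rng.uniform(-math.pi, math.pi) if (k + 1) * f0_hz > cutoff_hz else p
            for k, p in enumerate(phases)]
```

Calling it once per frame with the cutoff for that frame's time gives the time-varying behaviour described above.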
[0027]
Here, the frequency Fc (t) is determined according to, for example, the equation (2).
[Equation 2]

In Equation (2), Fh_A and Fh_B are the highest frequencies that the lower limit of the phase disturbance band can take for the speech segments Ab and Bb, respectively, and Fl is the lowest frequency that the lower limit of the phase disturbance band can take. tc is the connection point of the speech segments Ab and Bb, and ta and tb are the start time and the end time of the change in the phase disturbance bandwidth, respectively.
[0028]
FIG. 3 shows an example of the time transition of the phase disturbance band determined according to Equation (2). By widening the phase disturbance band over time around the connection point tc, connection distortion can be masked more strongly in the section where discontinuity due to connection or distortion due to interpolation occurs. The parameters Fh_A, Fh_B, ta, and tb that determine the frequency Fc(t) are, for example, adjusted and set in advance for each combination of speech segments to be connected.
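Equation (2) is likewise not reproduced here. A shape consistent with the description is a lower limit that equals Fh_A before ta, reaches Fl at the connection point tc, and returns to Fh_B by tb. The sketch below assumes linear segments on both sides of tc; the interpolation and the function name are assumptions, not the patent's formula.

```python
def connection_band_lower_limit(t, ta, tc, tb, f_high_a, f_high_b, f_low):
    """Lower limit Fc(t) of the phase disturbance band around a connection point.

    Assumed shape: f_high_a up to ta, linear fall to f_low at tc,
    linear rise to f_high_b at tb, f_high_b afterwards.
    """
    if not ta < tc < tb:
        raise ValueError("expected ta < tc < tb")
    if t <= ta:
        return f_high_a
    if t >= tb:
        return f_high_b
    if t <= tc:
        return f_high_a + (f_low - f_high_a) * (t - ta) / (tc - ta)
    return f_low + (f_high_b - f_low) * (t - tc) / (tb - tc)
```

The disturbed band is therefore widest exactly at tc and shrinks toward the per-segment values on either side.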
[0029]
By connecting speech segments and disturbing the phase of the synthesized speech generated from the resulting speech parameters in this way, loss of naturalness, such as the synthesized speech becoming discontinuous at the connection point of the speech segments or becoming buzzer-like because of the interpolation used for connection, can be perceptually masked, and synthesized speech with little degradation can be obtained.
[0030]
The speech generation means 17 converts the speech parameters output from the connection distortion mask means 16 into a speech signal and outputs it as synthesized speech.
[0031]
As described above, according to Embodiment 1, the perceptual distortion that accompanies deformation of speech segments can be masked by adjusting the phase of the speech signal, which reduces the loss of naturalness in the synthesized speech. Likewise, the perceptual distortion that accompanies connection of speech segments can be masked by adjusting the phase of the speech signal, which reduces the loss of naturalness in the synthesized speech.
[0032]
The example above shows the speech synthesizer 1 including both the deformation distortion mask means 13 and 14 and the connection distortion mask means 16, but a speech synthesizer that includes only the deformation distortion mask means 13 and 14, or only the connection distortion mask means 16, is also possible.
[0033]
Further, the example above shows a configuration in which the deformation means 11 and the deformation means 12 are provided independently, but a configuration in which a single deformation means is shared is also possible. Similarly, a configuration in which a single deformation distortion mask means is shared in place of the deformation distortion mask means 13 and 14 is also possible.
[0034]
Embodiment 2.
A speech synthesizer 2 according to Embodiment 2 of the present invention is described with reference to FIGS. 4 to 6. FIG. 4 is a block diagram showing the configuration of the speech synthesizer 2 according to Embodiment 2, and FIGS. 5 and 6 show examples of how the average power of the superimposed noise signal changes over time in the speech synthesizer 2.
[0035]
The speech synthesizer 2 according to Embodiment 2 masks the deformation distortion of the synthesized speech by superimposing a noise signal on the speech for the speech parameters constituting a deformed speech segment, and masks the connection distortion of the synthesized speech by superimposing a noise signal on the speech for the series of speech parameters obtained by connecting speech segments. The other configurations and operations are the same as in Embodiment 1, and their description is omitted.
[0036]
As shown in FIG. 4, the deformation distortion mask means 23 of the speech synthesizer 2 superimposes a predetermined noise signal on the speech for the speech parameters constituting the deformed speech segment Aa, and the deformation distortion mask means 24 superimposes a predetermined noise signal on the speech for the speech parameters constituting the deformed speech segment Ba. The connection distortion mask means 26 superimposes a predetermined noise signal on the speech for the series of speech parameters obtained by connecting the speech segments.
[0037]
The deformation distortion mask means 23 and 24 obtain speech segments Ab and Bb, which yield a speech signal in which a noise signal having an average power P(t) at time t is superimposed on the speech signal generated from the speech segments Aa and Ba, and output them to the connection means 15.
[0038]
The specific processing in the deformation distortion mask means 23 and 24 depends on the parameters that represent the speech segments. For example, in the pitch waveform overlap-add type, in which pitch time waveforms are the speech segment parameters, the pitch time waveforms obtained by adding a noise signal to each pitch time waveform of the speech segments Aa and Ba are used as the speech segments Ab and Bb. In the sinusoidal synthesis type, in which the pitch and the amplitudes and phases of the harmonics are the speech segment parameters, the parameters obtained by adding the amplitude spectrum and phase spectrum of the noise signal to the harmonic amplitudes and phases of the speech segments Aa and Ba, respectively, are used as the speech segments Ab and Bb.
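The pitch waveform case described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `superimpose_noise`, the choice of white Gaussian noise, and the rescaling of the noise so that its mean square equals the requested average power are all assumptions made for the example.

```python
import math
import random


def superimpose_noise(pitch_waveform, avg_power, rng=None):
    """Return the pitch waveform with noise of the given average power added.

    Assumption: the noise is white Gaussian noise; the text only requires
    "a noise signal" with a given average power.
    """
    if not pitch_waveform:
        return []
    rng = rng or random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in pitch_waveform]
    # Rescale the raw noise so that its mean square equals avg_power.
    raw_power = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(avg_power / raw_power) if raw_power > 0 else 0.0
    return [s + gain * n for s, n in zip(pitch_waveform, noise)]


# One pitch time waveform of a hypothetical segment Aa, 64 samples long.
segment_aa = [math.sin(2 * math.pi * k / 16) for k in range(64)]
segment_ab = superimpose_noise(segment_aa, avg_power=0.01)
```

In a full system the average power would vary with time t, so each pitch waveform of a segment would be processed with its own value of P(t).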
[0039]
Here, the average power P(t) of the superimposed noise signal is determined according to, for example, Equation (3).
[Equation 3]

Pl and Ph in Equation (3) are the minimum and maximum average power of the superimposed noise signal, respectively. Further, t1 and t2 are the start time and end time, respectively, of the change in the average power of the superimposed noise signal.
[0040]
FIG. 5 shows an example of the time transition of the average power of the superimposed noise signal determined according to Equation (3). For a segment whose duration is shorter than t1 and which is therefore little affected by deformation, the average power of the superimposed noise signal is kept small so that the deformation distortion is masked only weakly. For a segment whose duration has become longer than t1 through deformation and whose synthesized sound is unnatural over a long time, the average power of the superimposed noise signal is increased in the section after t1 so that the deformation distortion is masked strongly.
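Equation (3) itself is not reproduced in this text, so the following is only one plausible reading of the behaviour described above: the average power stays at Pl until t1, rises to Ph by t2, and stays at Ph afterwards. The linear ramp between t1 and t2 and the function name `noise_power` are assumptions.

```python
def noise_power(t, p_low, p_high, t1, t2):
    """Average power of the superimposed noise at time t within a segment.

    Assumption: a linear ramp from p_low (Pl) at t1 to p_high (Ph) at t2.
    The source only names Pl, Ph, t1 and t2; the exact curve is not given.
    """
    if t <= t1:
        return p_low
    if t >= t2:
        return p_high
    return p_low + (p_high - p_low) * (t - t1) / (t2 - t1)


# Illustrative values only: times in seconds, power in arbitrary units.
for t in (0.05, 0.2, 0.5):
    print(t, noise_power(t, p_low=0.001, p_high=0.01, t1=0.1, t2=0.3))
```

A short segment that ends before t1 therefore only ever receives the minimum power, while a strongly lengthened segment reaches the maximum power in its later part.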
[0041]
Pl, Ph, t1, and t2, the parameters that determine the average power P(t) of the superimposed noise signal, are adjusted and set in advance for each speech segment. For example, for the moraic nasal /N/ or the voiced fricative /z/, which tend to sound buzzer-like when deformed, setting Ph large makes it possible to mask the deformation distortion more strongly.
[0042]
By superimposing a noise signal on the synthesized speech generated from the deformed speech segments as described above, degradation of naturalness, such as the speech signal generated from a deformed speech segment sounding buzzer-like, can be masked perceptually, and synthesized speech with little degradation can be obtained.
[0043]
The speech segment Ab output from the deformation distortion mask means 23 and the speech segment Bb output from the deformation distortion mask means 24 are input to the connection means 15. The connection means 15 connects the speech parameters constituting the speech segments Ab and Bb to generate a series of speech parameters, and outputs the series of speech parameters to the connection distortion mask means 26.
[0044]
Next, the connection distortion mask means 26 obtains speech parameters that yield a speech signal in which a noise signal having an average power Pc(t) at time t is superimposed on the series of speech parameters input from the connection means 15, and outputs them to the speech generation means 17.
[0045]
The specific processing in the connection distortion mask means 26 depends on the speech parameters. For example, in the pitch waveform overlap-add type, in which pitch time waveforms are the parameters, the pitch time waveforms obtained by adding a noise signal to each pitch time waveform of the speech parameters input from the connection means 15 are output to the speech generation means 17 as speech parameters. In the sinusoidal synthesis type, in which the pitch and the amplitudes and phases of the harmonics are the parameters, the amplitude spectrum and phase spectrum of the noise signal are added to the harmonic amplitudes and phases of the speech parameters input from the connection means 15, respectively, and the result is output to the speech generation means 17 as speech parameters.
[0046]
Here, the average power Pc(t) of the superimposed noise signal is determined according to, for example, Equation (4).
[Equation 4]

Pl_A and Pl_B in Equation (4) are the minimum values that the average power of the superimposed noise signal can take for the speech segments Ab and Bb, respectively, and Ph is the maximum value that the average power of the superimposed noise signal can take. Further, tc is the time of the connection point between the speech segments Ab and Bb, and ta and tb are the start time and end time, respectively, of the change in the average power of the superimposed noise signal.
[0047]
FIG. 6 shows an example of the time transition of the average power of the superimposed noise signal determined according to Equation (4). By increasing the average power of the superimposed noise signal as time approaches the connection point tc, the connection distortion can be masked more strongly in the section where discontinuity due to connection or distortion due to interpolation processing occurs. Pl_A, Pl_B, ta, and tb, the parameters that determine the average power Pc(t), are adjusted and set in advance, for example, for each combination of speech segments to be connected.
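Equation (4) is likewise not reproduced here. The sketch below is one assumed form that matches the description: the power is Pl_A before ta, peaks at Ph at the connection point tc, and settles at Pl_B after tb. The piecewise-linear shape, the ordering ta < tc < tb, and the name `connection_noise_power` are assumptions for illustration.

```python
def connection_noise_power(t, pl_a, pl_b, p_high, ta, tc, tb):
    """Average power of the superimposed noise around a connection point.

    Assumption: linear rise from pl_a (Pl_A) at ta to p_high (Ph) at tc,
    then linear fall to pl_b (Pl_B) at tb, with ta < tc < tb.
    """
    if t <= ta:
        return pl_a
    if t >= tb:
        return pl_b
    if t <= tc:
        return pl_a + (p_high - pl_a) * (t - ta) / (tc - ta)
    return p_high + (pl_b - p_high) * (t - tc) / (tb - tc)


# Illustrative values only: segments Ab and Bb joined at tc = 1.0 s.
params = dict(pl_a=0.001, pl_b=0.002, p_high=0.01, ta=0.9, tc=1.0, tb=1.2)
for t in (0.5, 1.0, 1.1, 1.5):
    print(t, connection_noise_power(t, **params))
```

Using separate floor values for the two sides lets the contour meet whatever minimum power each segment already uses away from the joint.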
[0048]
By superimposing a noise signal on the synthesized speech generated from the speech parameters obtained by connecting the speech segments as described above, degradation of naturalness, such as the synthesized speech becoming discontinuous at the segment connection point or sounding buzzer-like because of the interpolation processing used for connection, can be masked perceptually, and synthesized speech with little degradation can be obtained.
[0049]
As described above, according to Embodiment 2, the perceptual distortion that accompanies deformation of a speech segment can be masked by superimposing a noise signal on the speech signal, so that degradation of the naturalness of the synthesized speech can be reduced. In addition, the perceptual distortion that accompanies connection of speech segments can be masked by superimposing a noise signal on the speech signal, so that degradation of the naturalness of the synthesized speech can be reduced.
[0050]
Although the speech synthesizer 2 described above includes the deformation distortion mask means 23 and 24 and the connection distortion mask means 26, a configuration including only the deformation distortion mask means 23 and 24, or a configuration including only the connection distortion mask means 26, is also possible.
[0051]
In addition, although a configuration in which the deformation means 11 and the deformation means 12 are provided independently has been shown, a configuration in which a single deformation means is shared is also possible. Similarly, the deformation distortion mask means 23 and 24 may be replaced by a single shared deformation distortion mask means.
[0052]
Embodiment 3.
A speech synthesizer 3 according to Embodiment 3 of the present invention will be described with reference to FIGS. 7 to 9. FIG. 7 is a block diagram showing the configuration of the speech synthesizer 3 according to Embodiment 3, and FIGS. 8 and 9 are diagrams showing examples of the time transition of the noise signal replacement band in this speech synthesizer.
[0053]
Embodiment 3 masks the deformation distortion of the synthesized speech by replacing a predetermined frequency band of the speech with a noise signal for the speech parameters constituting each deformed speech segment, and masks the connection distortion of the synthesized speech by replacing a predetermined frequency band of the speech with a noise signal for the series of speech parameters obtained by connecting the speech segments. The other configurations and operations are the same as those in Embodiment 1, and their description is omitted.
[0054]
As shown in FIG. 7, the speech synthesizer 3 includes deformation distortion mask means 33 that replaces a predetermined frequency band of the speech with a noise signal for the speech parameters constituting the deformed speech segment Aa, deformation distortion mask means 34 that replaces a predetermined frequency band of the speech with a noise signal for the speech parameters constituting the deformed speech segment Ba, and connection distortion mask means 36 that replaces a predetermined frequency band of the speech with a noise signal for the series of speech parameters obtained by connecting the speech segments.
[0055]
The deformation distortion mask means 33 and 34 obtain speech segments Ab and Bb, which yield a speech signal in which, at time t, the components of the speech signal generated from the speech segments Aa and Ba above the frequency Fm(t) are replaced with a noise signal, and output them to the connection means 15.
[0056]
The specific processing in the deformation distortion mask means 33 and 34 depends on the parameters that represent the speech segments. For example, in the pitch waveform overlap-add type, in which pitch time waveforms are the speech segment parameters, each pitch time waveform of the speech segments Aa and Ba is converted into the frequency domain by a Fourier transform or the like, the band above the frequency Fm(t) is replaced with the spectrum of a noise signal, and the pitch time waveforms obtained by the inverse Fourier transform are used as the speech segments Ab and Bb. In the sinusoidal synthesis type, in which the pitch and the amplitudes and phases of the harmonics are the speech segment parameters, the parameters obtained by replacing the amplitudes and phases of the harmonics of the speech segments Aa and Ba above the frequency Fm(t) with the spectrum of the noise signal are used as the speech segments Ab and Bb.
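The Fourier-domain replacement for the pitch waveform case can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, the noise is white Gaussian noise with no level matching to the speech, and a naive discrete Fourier transform is used only to keep the example free of external libraries.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (adequate for short pitch waveforms)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n))
            for m in range(n)]


def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[m] * cmath.exp(2j * math.pi * m * k / n)
                 for m in range(n)) / n).real for k in range(n)]


def replace_high_band(pitch_waveform, cutoff_hz, sample_rate, rng=None):
    """Replace spectral bins above cutoff_hz (Fm(t)) with bins of a noise spectrum."""
    rng = rng or random.Random(0)
    n = len(pitch_waveform)
    spec = dft(pitch_waveform)
    noise_spec = dft([rng.gauss(0.0, 1.0) for _ in range(n)])
    for m in range(1, n // 2 + 1):
        if m * sample_rate / n > cutoff_hz:
            spec[m] = noise_spec[m]
            # Mirror the bin so the inverse transform stays real-valued.
            spec[n - m] = noise_spec[m].conjugate()
    return idft(spec)


# Hypothetical 32-sample pitch waveform; with sample_rate=32, bin m is m Hz.
wave = [math.sin(2 * math.pi * 2 * k / 32) for k in range(32)]
masked = replace_high_band(wave, cutoff_hz=8.0, sample_rate=32.0)
```

A practical implementation would likely use a fast Fourier transform and scale the noise spectrum to a level appropriate for the segment; neither choice is specified in the text.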
[0057]
Here, the frequency Fm(t) is determined according to, for example, Equation (5).
[Equation 5]

Fh and Fl in Equation (5) are the highest and lowest frequencies, respectively, that the lower limit of the noise signal replacement band can take. Further, t1 and t2 are the start time and end time, respectively, of the change in the noise signal replacement bandwidth.
[0058]
FIG. 8 shows an example of the time transition of the noise signal replacement band determined according to Equation (5). For a segment whose duration is shorter than t1 and which is therefore little affected by deformation, the noise signal replacement band is kept narrow so that the deformation distortion is masked only weakly. For a segment whose duration is longer than t1 and whose synthesized sound is unnatural over a long time, the noise signal replacement band is widened in the section after t1 so that the deformation distortion is masked strongly. Fh, Fl, t1, and t2, the parameters that determine the frequency Fm(t), are adjusted and set in advance for each speech segment. For example, setting Fl low for the moraic nasal /N/ or the voiced fricative /z/, which tend to sound buzzer-like when deformed, makes it possible to mask their deformation distortion more strongly.
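Equation (5) is not reproduced in this text. As one assumed reading of the description, the lower limit of the replacement band can be held at Fh until t1 and lowered to Fl by t2, which widens the replaced band over time. The linear descent and the names `replacement_cutoff` and `harmonics_to_replace` are assumptions; the second function shows how the cutoff would select harmonics in the sinusoidal synthesis case.

```python
def replacement_cutoff(t, f_high, f_low, t1, t2):
    """Lower limit Fm(t) of the noise replacement band at time t.

    Assumption: a linear descent from f_high (Fh) at t1 to f_low (Fl) at t2.
    """
    if t <= t1:
        return f_high
    if t >= t2:
        return f_low
    return f_high + (f_low - f_high) * (t - t1) / (t2 - t1)


def harmonics_to_replace(f0, num_harmonics, cutoff_hz):
    """Return 1-based indices of harmonics whose frequency exceeds cutoff_hz."""
    return [k for k in range(1, num_harmonics + 1) if k * f0 > cutoff_hz]


# Illustrative values only: a 100 Hz voice with 10 harmonics.
cutoff = replacement_cutoff(0.2, f_high=8000.0, f_low=2000.0, t1=0.1, t2=0.3)
print(cutoff, harmonics_to_replace(100.0, 10, 750.0))
```

Lowering Fl for a given segment moves the final cutoff down, so more harmonics fall inside the replaced band and the masking becomes stronger.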
[0059]
By replacing a predetermined frequency band of the synthesized speech generated from the deformed speech segments with a noise signal as described above, degradation of naturalness, such as the speech signal generated from a deformed speech segment sounding buzzer-like, can be masked perceptually, and synthesized speech with little degradation can be obtained.
[0060]
The speech segments Ab and Bb output from the deformation distortion mask means 33 and 34 are input to the connection means 15. The connection means 15 connects the speech parameters constituting the speech segments Ab and Bb to generate a series of speech parameters, and outputs the series of speech parameters to the connection distortion mask means 36. The connection distortion mask means 36 obtains speech parameters that yield a speech signal in which, at time t, the components above the frequency Fc(t) are replaced with a noise signal for the series of speech parameters input from the connection means 15, and outputs them to the speech generation means 17.
[0061]
The specific processing in the connection distortion mask means 36 depends on the speech parameters. For example, in the pitch waveform overlap-add type, in which pitch time waveforms are the parameters, each pitch time waveform of the speech parameters input from the connection means 15 is converted into the frequency domain by a Fourier transform or the like, the band above the frequency Fc(t) is replaced with the spectrum of a noise signal, and the pitch time waveforms obtained by the inverse Fourier transform are output to the speech generation means 17 as speech parameters. In the sinusoidal synthesis type, in which the pitch and the amplitudes and phases of the harmonics are the parameters, the amplitudes and phases of the harmonics above the frequency Fc(t) in the speech parameters input from the connection means 15 are replaced with the spectrum of the noise signal, and the result is output to the speech generation means 17 as speech parameters.
[0062]
Here, the frequency Fc(t) is determined according to, for example, Equation (6).
[Equation 6]

Fh_A and Fh_B in Equation (6) are the highest frequencies that the lower limit of the noise signal replacement band can take for the speech segments Ab and Bb, respectively, and Fl is the lowest frequency that the lower limit of the noise signal replacement band can take. Further, tc is the time of the connection point between the speech segments Ab and Bb, and ta and tb are the start time and end time, respectively, of the change in the noise signal replacement bandwidth.
[0063]
FIG. 9 shows an example of the time transition of the noise signal replacement band determined according to Equation (6). By widening the noise signal replacement band as time approaches the connection point tc, the connection distortion can be masked more strongly in the section where discontinuity due to connection or distortion due to interpolation processing occurs. Fh_A, Fh_B, ta, and tb, the parameters that determine the frequency Fc(t), are adjusted and set in advance, for example, for each combination of speech segments to be connected.
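Since Equation (6) is not reproduced here, the following sketch assumes a V-shaped contour for the lower limit of the replacement band: Fh_A before ta, falling to Fl at the connection point tc, and rising to Fh_B by tb. The piecewise-linear form, the ordering ta < tc < tb, and the name `connection_cutoff` are assumptions for illustration.

```python
def connection_cutoff(t, fh_a, fh_b, f_low, ta, tc, tb):
    """Lower limit Fc(t) of the noise replacement band around a connection.

    Assumption: linear fall from fh_a (Fh_A) at ta to f_low (Fl) at tc,
    then linear rise to fh_b (Fh_B) at tb, with ta < tc < tb.
    """
    if t <= ta:
        return fh_a
    if t >= tb:
        return fh_b
    if t <= tc:
        return fh_a + (f_low - fh_a) * (t - ta) / (tc - ta)
    return f_low + (fh_b - f_low) * (t - tc) / (tb - tc)


# Illustrative values only: segments Ab and Bb joined at tc = 1.0 s.
params = dict(fh_a=8000.0, fh_b=6000.0, f_low=2000.0, ta=0.9, tc=1.0, tb=1.2)
for t in (0.5, 1.0, 1.1, 2.0):
    print(t, connection_cutoff(t, **params))
```

The cutoff is lowest exactly at the joint, so the replaced band is widest where the discontinuity or interpolation distortion is expected to be greatest.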
[0064]
By replacing a predetermined frequency band of the synthesized speech generated from the speech parameters obtained by connecting the speech segments with a noise signal as described above, degradation of naturalness, such as the synthesized speech becoming discontinuous at the segment connection point or sounding buzzer-like because of the interpolation processing used for connection, can be masked perceptually, and synthesized speech with little degradation can be obtained.
[0065]
As described above, according to Embodiment 3, the perceptual distortion that accompanies deformation of a speech segment can be masked by replacing a predetermined frequency band of the speech signal with a noise signal, so that degradation of the naturalness of the synthesized speech can be reduced. In addition, the perceptual distortion that accompanies connection of speech segments can be masked by replacing a predetermined frequency band of the speech signal with a noise signal, so that degradation of the naturalness of the synthesized speech can be reduced.
[0066]
Although the speech synthesizer 3 described above includes the deformation distortion mask means 33 and 34 and the connection distortion mask means 36, a configuration including only the deformation distortion mask means 33 and 34, or a configuration including only the connection distortion mask means 36, is also possible.
[0067]
In addition, although a configuration in which the deformation means 11 and the deformation means 12 are provided independently has been shown, a configuration in which a single deformation means is shared is also possible. Similarly, the deformation distortion mask means 33 and 34 may be replaced by a single shared deformation distortion mask means.
[0068]
[Effects of the Invention]
As described above, according to the present invention, the perceptual distortion that accompanies deformation of a speech segment can be masked by adjusting the phase of the speech signal, and degradation of the naturalness of the synthesized speech can be reduced.
Further, according to the present invention, the perceptual distortion that accompanies connection of speech segments can be masked by adjusting the phase of the speech signal, and degradation of the naturalness of the synthesized speech can be reduced.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a speech synthesizer according to Embodiment 1 of the present invention.
FIG. 2 is a diagram illustrating an example of a time transition of a phase disturbance band in the speech synthesizer according to the first embodiment.
FIG. 3 is a diagram showing an example of a time transition of a phase disturbance band in the speech synthesizer according to the first embodiment.
FIG. 4 is a block diagram showing a configuration of a speech synthesizer according to Embodiment 2 of the present invention.
FIG. 5 is a diagram showing an example of time transition of average power of a superimposed noise signal in the speech synthesizer according to the second embodiment.
FIG. 6 is a diagram showing an example of time transition of average power of a superimposed noise signal in the speech synthesizer according to the second embodiment.
FIG. 7 is a block diagram showing a configuration of a speech synthesizer according to Embodiment 3 of the present invention.
FIG. 8 is a diagram illustrating an example of a time transition of a noise signal replacement band in the speech synthesizer according to the third embodiment.
FIG. 9 is a diagram illustrating an example of a time transition of a noise signal replacement band in the speech synthesizer according to the third embodiment.
[Explanation of symbols]
1, 2, 3 Speech synthesizer, 11, 12 Deformation means, 13, 14, 23, 24, 33, 34 Deformation distortion mask means, 15 Connection means, 16, 26, 36 Connection distortion mask means, 17 Speech generation means.

Claims

A speech synthesizer that synthesizes speech by deforming and connecting speech segments, comprising:
deformation distortion mask means for disturbing the phase of the speech in a phase disturbance band, for the speech parameters constituting a deformed speech segment, and controlling the proportion of the noise component of the speech,
wherein the deformation distortion mask means
narrows the phase disturbance band when the duration of the deformed speech segment is short, and widens the phase disturbance band when the duration of the deformed speech segment is long.

A speech synthesizer that synthesizes speech by deforming and connecting speech segments, comprising:
deformation distortion mask means for superimposing a noise signal on the speech, for the speech parameters constituting a deformed speech segment, and controlling the proportion of the noise component of the speech,
wherein the deformation distortion mask means
reduces the average power of the superimposed noise signal when the duration of the deformed speech segment is short, and increases the average power of the superimposed noise signal when the duration of the deformed speech segment is long.

A speech synthesizer that synthesizes speech by deforming and connecting speech segments, comprising:
deformation distortion mask means for replacing the speech in a noise signal replacement band with a noise signal, for the speech parameters constituting a deformed speech segment, and controlling the proportion of the noise component of the speech,
wherein the deformation distortion mask means
narrows the noise signal replacement band when the duration of the deformed speech segment is short, and widens the noise signal replacement band when the duration of the deformed speech segment is long.

A speech synthesizer that synthesizes speech by deforming and connecting speech segments, comprising:
connection distortion mask means for disturbing the phase of the speech in a phase disturbance band, for a series of speech parameters obtained by connecting speech segments, and controlling the proportion of the noise component of the speech in a predetermined section including the connection point of the speech segments,
wherein the connection distortion mask means
widens the phase disturbance band at times close to the connection point, and narrows the phase disturbance band at times far from the connection point.

A speech synthesizer that synthesizes speech by deforming and connecting speech segments, comprising:
connection distortion mask means for superimposing a noise signal on the speech, for a series of speech parameters obtained by connecting speech segments, and controlling the proportion of the noise component of the speech in a predetermined section including the connection point of the speech segments,
wherein the connection distortion mask means
increases the average power of the superimposed noise signal at times close to the connection point, and reduces the average power of the superimposed noise signal at times far from the connection point.

A speech synthesizer that synthesizes speech by deforming and connecting speech segments, comprising:
connection distortion mask means for replacing the speech in a noise signal replacement band with a noise signal, for a series of speech parameters obtained by connecting speech segments, and controlling the proportion of the noise component of the speech in a predetermined section including the connection point of the speech segments,
wherein the connection distortion mask means
widens the noise signal replacement band at times close to the connection point, and narrows the noise signal replacement band at times far from the connection point.