JP3737788B2 - Fundamental frequency pattern generation method, fundamental frequency pattern generation device, speech synthesis device, fundamental frequency pattern generation program, and speech synthesis program

Info

Publication number: JP3737788B2
Application number: JP2002213188A
Authority: JP
Inventors: 剛平林; 岳彦籠嶋; 龍太郎徳田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2002-07-22
Filing date: 2002-07-22
Publication date: 2006-01-25
Anticipated expiration: 2022-07-22
Also published as: JP2004054063A

Description

【０００１】
【発明の属する技術分野】
本発明は、例えば、テキスト音声合成に関し、特に、基本周波数（Ｆ０）パターンを生成する方法および装置に関する。
【０００２】
【従来の技術】
近年、任意の文章から人工的に音声信号を生成するテキスト音声合成システムが開発されている。通常、このテキスト音声合成システムは、言語処理部、韻律生成部、音声信号生成部の３つのモジュールから構成される。
【０００３】
入力されたテキストは、まず言語処理部において、形態素解析・構文解析等の言語処理が行われ、音韻記号列・アクセント型、品詞などの言語情報が出力される。次に韻律生成部において、基本周波数（ピッチ）やリズムのパターンが生成される。
【０００４】
韻律生成部は、音韻継続時間長生成部とピッチパターン生成部より構成される。音韻継続時間長生成部は、言語情報を参照して、各音素の音韻継続時間長を生成して出力する。ピッチパターン生成部は、言語情報と音韻継続時間長を入力として、声の高さの変化パターンであるピッチパターン（Ｆ０パターンとも云う）を出力する。最後に音声信号生成部において、音声信号が合成される。
【０００５】
テキスト音声合成システムの中で、韻律生成部の性能が合成音声の自然性に関係しており、とりわけ声の高さの変化パターンであるピッチパターンの精度が生成される合成音声の自然性を大きく左右する。
【０００６】
従来のテキスト音声合成におけるピッチパターン生成方法は、比較的単純なモデルを用いてピッチパターンの生成を行っていたため、抑揚が不自然で機械的な合成音声となっていた。
【０００７】
こうした問題を解決するために、自然音声から抽出されたピッチパターンを利用するアプローチが提案されている。例えば、特開平１１−０９５７８３号公報では、自然音声のピッチパターンから統計的な手法を用いて抽出されたアクセント句単位の典型的なパターンである代表パターンを複数記憶しておき、アクセント句毎に選択された代表パターンを変形し、接続することによってピッチパターンを生成する方法が開示されている。
【０００８】
図９は、上述した従来のピッチパターン生成方法に係るピッチパターン生成部の構成例を示したものである。以下、図９を用いて従来のピッチパターン生成方法について説明する。
【０００９】
代表パターン記憶部１８は、アクセント句単位の典型的なピッチパターンを表す代表パターンを複数記憶している。代表パターンは音節単位の長さが一定となるように正規化されており、その各点は対数スケール上のピッチで表現されている。
【００１０】
代表パターンの例を図１０に示す。縦軸は対数スケールのピッチを表している。また、横軸は時間に相当するが、この例では、１音節を３点で表すように正規化されているため、１目盛りが１音節に対応する。
【００１１】
代表パターン選択部１０は、言語情報１００を参照して、代表パターンを、代表パターン記憶部１８よりアクセント句毎に選択して出力する。
【００１２】
言語情報１００は、入力されたテキストに言語解析を行って得られる各アクセント句およびその近傍のアクセント句に関する情報であり、音韻記号列、アクセント型、品詞、構文情報などから構成される。「今日はすばらしい青空です。」というテキストに対する言語情報の例を、図１１に示す。言語情報１００から代表パターン２０１を選択するための規則は、統計的手法や機械学習手法など何らかの公知の方法を用いて生成することが可能である。
【００１３】
代表パターン変形部１８は、代表パターンを、言語情報１００および音韻継続時間長１１１に基づいて変形し、アクセント句パターン２０２を出力する。まず、音韻継続時間長１１１に従って音声単位で時間軸方向に線形伸縮を行う。次に、言語情報１００から代表パターンのダイナミックレンジを推定し、その推定値に従ってパターンを周波数軸方向に線形伸縮する。ダイナミックレンジの推定には、数量化Ｉ類などの公知の統計的手法を用いることができる。
【００１４】
オフセット推定部１２は、アクセント句の平均的な高さに相当するオフセット値１０３を、言語情報１００から推定して出力する。オフセット値の推定には、上述したダイナミックレンジの推定と同様に、数量化Ｉ類などの公知の統計的手法を用いることができる。
【００１５】
オフセット制御部１３は、アクセント句パターン２０２を、推定されたオフセット値１０３に従って周波数軸上で平行移動させ、アクセント句パターン２０４を出力する。上述したパターン変形およびオフセット制御の例を図１２に示す。
【００１６】
パターン接続部１５は、アクセント句毎に生成されたアクセント句パターン２０４を接続するとともに、アクセント句境界で不連続が生じないように平滑化を行って、文ピッチパターン２０６を出力する。文ピッチパターンの例を図１３に示す。
【００１７】
上述したようなテキスト音声合成のピッチパターン生成方法においては、代表パターンの変形が必要となる。例えば音韻継続時間長に従って音節単位で時間軸方向にパターンの変形を行う場合、各点の平均ピッチなどの静的特徴量のみを用いた線形伸縮では、何等かの理論的根拠に基づいた適切な変形ではないため、この変形ピッチパターンに従って生成された合成音の自然性が低下するという問題がある。
【００１８】
図１４および図１５にその一例を示す。ここで、図１４（ａ）と図１５（ａ）は、選択された代表パターンであり、図１４（ｂ）と図１５（ｂ）は、それぞれ（ａ）図に示した代表パターンを実際に時間軸方向に音節単位で線形伸縮することによって変形させたパターンを表し、図１４（ｃ）と図１５（ｃ）は理想とする変形後のパターンを示している。
【００１９】
図１４の例では、静的特徴のみを用いて伸縮を行っているために、パターンの傾きを考慮した変形ができず、２音節目付近で不自然なピッチ変化が生じている。また、図１５の例では、代表パターンの各点の情報量、および伸縮による変形の精度が不十分なために、本来（ｃ）図のように変形されるべきパターンであっても、単純で不正確な（ｂ）図のような変形パターンが生成されてしまっている。
【００２０】
一方で、電子情報通信学会技術研究報告２００１年９月ＳＰ２００１−７０（５３頁〜５８頁）に記載されたような、動的特徴と静的特徴をパラメータとしてピッチを音素単位でモデル化し、動的特徴量を考慮して滑らかなピッチ変化パターンを生成するというものが提案されている。
【００２１】
しかし、音素単位でモデル化する場合には、ピッチの存在しない無声音に対するモデル化に問題が生じてくる。また、アクセント型を陽に表現できないため、ピッチの変化が滑らかであっても、不自然、もしくは誤った抑揚のパターンが生成されてしまう可能性があるという問題があった。
【００２２】
【発明が解決しようとする課題】
このように、従来は、代表パターンを変形する際には、当該代表パターンの各点の平均ピッチなどの当該代表パターンの静的特徴量のみを用いていたため、変形した結果得られるパターンは不自然なものとなり、自然発声に近い合成音声を生成することができないという問題点があった。
【００２３】
そこで、本発明は、以上の問題を考慮してなされたものであり、人の発声した音声の基本周波数パターンに近い音声の基本周波数パターンの生成が可能な基本周波数パターン生成方法および基本周波数パターン生成装置と、それを用いて、人の発声した音声に近い音声を合成することができる音声合成装置を提供することを目的とする。
【００２４】
【課題を解決するための手段】
本発明は、テキストを解析することによって得られる言語情報を基に、当該テキストに対応する音声の韻律的な特徴の１つである、基本周波数の時間的変化を表した基本周波数パターンを生成するものであって、前記テキストに対応する音声の韻律的な特徴を制御するための１音節以上の時間長を有する音声の単位としての韻律制御単位毎の典型的な基本周波数パターンであって、当該基本周波数パターンを構成する各時系列点における静的特徴と、前記静的特徴の変化の特徴を表した動的特徴とが、それぞれの統計量で表現されている複数の代表パターンを記憶手段に記憶し、この記憶手段に記憶された複数の代表パターンの中から、前記言語情報に基づき前記テキストに対応する代表パターンを選択し、この選択された代表パターンの前記静的特徴の前記統計量と前記動的特徴の前記統計量とからの尤度に基づき、前記テキストに対応する音声の基本周波数パターンを推定することを特徴とする。
【００２５】
本発明によれば、人の発声した音声の基本周波数パターンに近い音声の基本周波数パターンの生成が可能となる。
【００２６】
本発明は、テキストを解析することによって得られる言語情報を基に、当該テキストに対応する音声の韻律的な特徴の１つである、基本周波数の時間的変化を表した基本周波数パターンを生成するものであって、前記テキストに対応する音声の韻律的な特徴を制御するための１音節以上の時間長を有する音声の単位としての韻律制御単位毎の典型的な基本周波数パターンであって、当該基本周波数パターンを構成する各時系列点における静的特徴と、前記静的特徴の変化の特徴を表した動的特徴とが、それぞれの統計量で表現されている複数の代表パターンを記憶手段に記憶し、この記憶手段に記憶された複数の代表パターンの中から、前記言語情報に基づき前記テキストに対応する代表パターンを選択し、この選択された代表パターンの前記静的特徴の前記統計量と前記動的特徴の前記統計量とからの尤度と、前記言語情報に基づき推定される、前記韻律制御単位毎の前記代表パターンの高さを表すオフセット値とを基に、前記代表パターンを変形することにより、前記テキストに対応する音声の基本周波数パターンを生成することを特徴とする。
【００２７】
本発明によれば、人の発声した音声の基本周波数パターンに近い音声の基本周波数パターンの生成が可能となる。
【００２８】
本発明は、テキストを解析することによって得られる言語情報を基に、予め記憶手段に記憶された、音声の韻律的な特徴を制御するための１音節以上の時間長を有する音声の単位としての韻律制御単位毎の典型的な基本周波数パターンである複数の代表パターンの中から、当該テキストに対応する代表パターンを選択し、この選択された代表パターンを、前記言語情報に基づき推定された、前記韻律制御単位毎の前記代表パターンの高さであるオフセット値に基づき変形を行うことにより、当該テキストに対応する音声の基本周波数パターンを生成するものであって、前記韻律制御単位毎の前記オフセット値を、その静的特徴の統計量と、前記静的特徴の変化の特徴を表した動的特徴の統計量とからの尤度に基づき推定することを特徴とする。
【００２９】
本発明によれば、人の発声した音声の基本周波数パターンに近い音声の基本周波数パターンの生成が可能となる。
【００３０】
なお、前記韻律制御単位は、形態素、単語、アクセント句のうちのいずれかであってもよい。
【００３１】
また、前記静的特徴は、対数あるいは線形スケール上のピッチであってもよい。
【００３２】
また、前記動的特徴は、前記時系列点間の前記静的特徴の差分、回帰係数、多項式展開係数のうちのいずれかであってもよい。
【００３３】
また、前記統計量は、平均値と、分散値若しくは標準偏差であってもよい。
【００３４】
さらに、前記代表パターンの変形は、前記選択された代表パターンを複数個接続したパターンに対して行うようにしてもよい。
【００３５】
【発明の実施の形態】
以下、本発明の実施形態について図面を参照して説明する。
【００３６】
図１は、本実施形態に係る音声合成システムの構成例を示したもので、大きく分けて、言語処理部２０、韻律生成部２１、音声信号生成部２２から構成されている。
【００３７】
テキスト２０８が入力されると、まず言語処理部２０において、当該入力されたテキスト２０８に対し、形態素解析や構文解析などの言語解析処理が行われ、音韻記号列、アクセント型、品詞、係り先、ポーズなどの言語情報１００が出力される。
【００３８】
韻律生成部２１では、言語情報１００を基に、入力されたテキスト２０８に対応する音声の韻律的な特徴を表した情報（韻律情報）、すなわち、例えば、音韻継続時間長や、基本周波数（以下では、ピッチ、Ｆ０と簡単に表記することもある）の時間経過に伴う変化を表したパターン、すなわち、基本周波数パターン（以下、簡単にピッチパターンあるいは、Ｆ０パターンと呼ぶ）などが生成される。韻律生成部２１は、音韻継続時間長生成部２３とピッチパターン生成部１より構成される。
【００３９】
音韻継続時間長生成部２３は、言語情報１００を参照して、各音素の時間的な長さ、すなわち、音韻継続時間長１１１を生成して出力する。なお、言語情報から音韻継続時間長を生成する手法は、従来と同様、公知技術を用いればよく、また、本願の要旨ではないので、説明は省略する。
【００４０】
ピッチパターン生成部１は、言語情報１００と音韻継続時間長１１１を入力として、声の高さの変化パターンであるピッチパターン１０６、より具体的には、例えば、アクセント句毎のピッチパターンをアクセント句境界で不連続が生じないように平滑化を行って接続することにより生成された文単位のピッチパターン（文ピッチパターン）１０６を出力する。
【００４１】
音声信号生成部２２では、言語情報１００を基に生成されたピッチパターン１０６や音韻継続時間長１１１などの韻律情報などを基に、入力されたテキスト２０８に対応する音声を合成し、音声信号２０７として出力する。なお、ここで音声を合成する手法は、従来と同様、公知の技術を用いればよく、また、本願の要旨ではないので、説明は省略する。
【００４２】
図２は、図１のピッチパターン生成部１の構成を示すブロック図で、代表パターン選択部１０と、代表パターン伸縮部１１と、オフセット推定部１２と、オフセット制御部１３と、最尤推定部１４と、パターン接続部１５と、代表パターン記憶部１６とから構成されている。なお、図２において、図９と同一部分には同一符号を付している。
【００４３】
図９に示した従来のピッチパターン生成部との相違点は、代表パターンの各点（時系列点）を、静的特徴である対数ピッチの平均および分散と、動的特徴である当該点における上記静的特徴の左側および右側の１次回帰係数の平均および分散とによって表現し、選択された代表パターンを尤度最大化基準に基づいて変形を行うことである。
【００４４】
自然音声の複数のピッチパターンから統計的な手法を用いて抽出されたアクセント句単位の典型的なパターンである代表パターンの各点のピッチは、自然音声の複数のピッチパターンから求められた対数スケールあるいは線形スケール上の平均値であり、代表パターンの各点（時系列点）毎の静的特徴は、例えば、この平均値と分散値（分散値の代わりに分散値の平方根の標準偏差値でもよい）などの統計量で表現されている。これらを静的特徴量とも云う。
【００４５】
また、代表パターンの各点における動的特徴とは、例えば、上記自然音声の複数のピッチパターンから求めた、当該点とその左側（あるいは右側）にあるいずれかの点（例えば、隣接する点）との間の上記静的特徴（例えば、対数あるいは線形スケール上のピッチの平均値）の変化の特徴（例えば、差分、回帰係数、多項式展開係数など）であり、動的特徴は、その平均値と分散値（分散値の代わりに分散値の平方根の標準偏差値でもよい）などの統計量で表現されている。これらは動的特徴量とも云う。
【００４６】
以下、図１６に示すフローチャートを参照しながら図２に示すピッチパターンの構成と動作について説明する。
【００４７】
図２において、代表パターン記憶部１６は、音声の韻律的な特徴を制御するための音声の単位（韻律制御単位）として、例えば、アクセント句単位の典型的なピッチパターンを表す代表パターンを複数記憶している。代表パターンは、音節単位の長さが一定となるように正規化されており、その各点は静的特徴である対数スケールのピッチの統計量（ここでは、平均および分散）と、動的特徴である当該点の左側および右側の１次回帰係数（いわゆる傾き）それぞれの統計量（ここでは、平均および分散）の情報を保持している。つまり、
【数１】

図３に、４つの代表パターン（ａ）〜（ｄ）のそれぞれについての静的特徴を示し、図４に、図３（ａ）〜（ｄ）に示した４つの代表パターンのそれぞれに対応する動的特徴を示す。
【００４８】
図３は、各代表パターンの各点における、静的特徴の情報である対数ピッチの平均値と標準偏差値（分散値の平方根）を表している。また、図４は、代表パターンの各点における、動的特徴の情報の１つである左側１次回帰係数の平均値と標準偏差値を表している。図３、図４において、縦軸は対数スケールの周波数であり、また、横軸は時間に相当するが、ここでは、１音節を３点で表現するように正規化されているため、１目盛りが１音節に対応する。
【００４９】
図２の説明に戻り、代表パターン選択部１０は、言語情報１００を参照して、代表パターンを、代表パターン記憶部１６よりアクセント句毎に選択して出力する（図１６のステップＳ３）。
【００５０】
言語情報１００は、入力されたテキストに言語解析を行って得られる各アクセント句およびその近傍のアクセント句に関する情報であり、音韻記号列、アクセント型、品詞、構文情報などから構成される。「今日はすばらしい青空です。」というテキストに対する言語情報の例は、図１１に示した通りである。言語情報１００から代表パターン２０１を選択するための規則は、統計的手法や機械学習手法など何らかの公知の方法を用いて生成することが可能である。
【００５１】
代表パターン伸縮部１１は、代表パターンの各点のパラメータを音韻継続時間長１１１に従って音節単位で時間軸方向に線形伸縮を行い、アクセント句パターン１０２を出力する（図１６のステップＳ４）。
【００５２】
オフセット推定部１２は、アクセント句の平均的な高さに相当するオフセット値１０３を、言語情報１００から推定して出力する。オフセット値の推定には、上述したダイナミックレンジの推定と同様に、数量化Ｉ類などの公知の統計的手法を用いることができる。
【００５３】
なお、オフセット値とは、韻律制御単位に対応するピッチパターンの全体的な音の高さを表す情報であって、例えば、上記のように、パターンの平均的な高さやパターンの最大ピッチ、最小ピッチ、高さの変化量などの情報であってもよい。
【００５４】
オフセット制御部１３は、アクセント句パターン１０２の各点のパラメータに対して、静的特徴である対数ピッチの平均値を、オフセット推定部１２で推定されたオフセット値１０３に従って変更する。つまり、従来における処理と同様にして、パターンを周波数軸上で平行移動させ、アクセント句パターン１０４を出力する（図１６のステップＳ５）。
【００５５】
最尤推定部１４は、オフセット制御部１３にてオフセットの制御されたアクセント句パターン１０４について、当該パターンの各点における静的特徴と動的特徴のそれぞれについての統計量に対して尤度最大の意味で最適なパラメータ列を求めることで、パターンの変形を行い、パターン１０５を出力する（図１６のステップＳ６）。
【００５６】
【数２】

【００５７】
つまり、パラメータ列は、分散値とは無関係に平均値の列、すなわち各点のピッチの値としては静的特徴である対数ピッチの平均値となってしまう。
【００５８】
そこで、このパラメータ列に、音声認識等で広く用いられている動的特徴を導入する。
【００５９】
【数３】

【００６０】
図５、図６に、代表パターン１０１を変形する過程を示す。図５（ａ）は、選択された代表パターン１０１の各点におけるパラメータのうち、静的特徴である対数ピッチの平均値および標準偏差値（分散値の平方根）を示したものである。図５（ａ）に示した静的特徴に対し、代表パターン伸縮部１１で時間軸方向の線形伸縮を行い、さらに、オフセット制御部１３でオフセット制御を行った結果得られたパターン１０４の各点における平均値を示したものが、図５（ｂ）である。
【００６１】
図５（ｃ）は、図５（ａ）に示した代表パターンについての動的特徴の１つである左側１次回帰係数の平均値および標準偏差値を示したものである。図５（ｃ）に示した動的特徴に対し、代表パターン伸縮部１１で時間軸方向の線形伸縮を行い、さらに、オフセット制御部１３でオフセット制御を行った結果得られたパターン１０４の各点における平均値を示したものが、図５（ｄ）である。
【００６２】
図６は、図５（ｂ）、（ｄ）に示した、静的特徴と動的特徴の時間軸方向線形伸縮とオフセット制御の結果得られたパターンと、最尤推定部１４において生成されたパラメータ列とから生成された最終的なアクセント句パターン、すなわち、パターン１０５である。
【００６３】
図５〜図６に示した代表パターンの第２音節目は、静的特徴である対数ピッチの分散値が小さく（図５（ａ）参照）、動的特徴である１次回帰係数の分散値が比較的大きいため（図５（ｃ）参照）、最尤推定部１４では、元の代表パターンにおけるピッチの値、すなわち、静的特徴を重視するようなパターンの変形が行われている。一方で、当該代表パターンの第３〜４音節目においては、静的特徴の分散値が比較的大きく（図５（ａ）参照）、動的特徴の分散値が小さいために（図５（ｃ）参照）、パターンの傾き、すなわち、動的特徴を重視した変形が行われていることがわかる。
【００６４】
つまり、最尤推定により静的および動的特徴の統計量を反映したパラメータ生成を行っているため、パターンの各点のピッチ値を重視するべき部分と、パターンの変化（傾き）を重視すべき部分とを同時に考慮したような変形が可能となっている。さらに、静的および動的特徴の組み合わせによって代表パターンの各点を表現しているために、代表パターンの表現力も向上しており、この例の第１音節目のような精度の高い複雑な変形パターンの生成も可能となる。
【００６５】
このように、動的特徴を考慮した尤度最大の意味で最適なパラメータを生成することによって、静的特徴であるピッチ情報のみから線形補間などを行う場合と比較して、より自然音声に近い滑らかで高精度のピッチパターンの変形が可能となり、自然性の高い合成音声を生成することができる。さらに、アクセント型はもとの代表パターンによって陽に表現されているため、アクセント位置の正しい滑らかで自然なパターンの生成が可能である。
【００６６】
図２の説明に戻り、パターン接続部１５は、アクセント句毎に生成されたアクセント句パターン１０５を接続するとともに、アクセント句境界で不連続が生じないように平滑化を行って、文ピッチパターン１０６を出力する（図１６のステップＳ７）。
【００６７】
以上のようにして生成されたピッチパターン１０６や音韻継続時間長１１１などの韻律情報などを基に、音声信号生成部２２では、入力されたテキスト２０８に対応する音声を合成し、音声信号２０７として出力する（図１６のステップＳ８）。
【００６８】
本実施形態では、代表パターンに対して、時間長による線形伸縮を行い、オフセットを制御した後に、最尤推定による変形を行っているが、オフセット制御は、時間長による線形伸縮の前でも、最尤推定による変形の後でもよい。
【００６９】
また、本実施形態では、各パターンの接続を行う前に、アクセント句単位の代表パターンに対し、最尤推定による変形を行っているが、順番を入れ替えて、韻律制御単位の代表パターンを複数接続した後に、最尤推定による変形を行ってもよい。
【００７０】
また、本実施形態では、オフセット推定部１２において推定されたオフセット値をそのまま利用してオフセット制御を行っているが、オフセット値についても静的および動的特徴の統計量によって表現し、これらの統計量からの尤度に基づいて変更を行ってから制御に利用してもよい。
【００７１】
図７は、代表パターンに対して、時間長による線形伸縮を行って最尤推定による変形を行うとともに、オフセット推定部１２において推定されたオフセット値についても静的および動的特徴の統計量によって表現し、これらの統計量からの尤度に基づいて変更を行ってからオフセット制御を行う場合のピッチパターン生成部１の構成例を示したものである。
【００７２】
なお、図７において、図２と同一部分には同一符号を付し、異なる部分についてのみ説明する。すなわち、図７では、オフセット値最尤推定部１９がオフセット推定部１２とオフセット制御部１３の間に設けられ、オフセット推定部１２から出力されるオフセット値１０７が静的特徴と動的特徴とで表現されている点が、図２と異なる。
【００７３】
図８は、複数（例えば、ここでは、４つ）のアクセント句単位のピッチパターンを接続してなる例えば１つの文について、オフセット値最尤推定部１９で、各アクセント句のオフセット値を変更する場合を説明するための図である。
【００７４】
図７のオフセット推定部１２で推定されるオフセット値の静的特徴は、例えば、図８（ａ）に示すように、自然音声の複数のピッチパターンから統計的な手法を用いて抽出された、例えば、アクセント句単位の代表パターンの例えば対数スケール（あるいは線形スケール）上のピッチの値の平均値（平均的な高さ）と分散値（分散値の平方根の標準偏差値でもよい））といった統計量で表現されている。
【００７５】
また、オフセット値の動的特徴とは、例えば、図８（ｂ）に示すように、複数のアクセント句単位のピッチパターンを接続したときに、着目するピッチパターンについて、例えばその右側（あるいは左側）のいずれかにある他のピッチパターンと、当該着目するピッチパターンとの間の上記静的特徴（例えば、アクセント句毎のピッチの平均値）の変化の特徴（例えば、着目するピッチパターンと他のピッチパターンとの間の上記静的特徴の差分、回帰係数、多項式展開係数などのいずれか）を表したもので、この静的特徴の変化の平均値と分散値（分散値の平方根の標準偏差値でもよい）といった統計量で表現されている。
【００７６】
オフセット値推定部１９では、オフセット推定部１２から出力された、上記のようなオフセット値１０７に対し、前述した図２の最尤推定部１４と同様にして、例えば、図８（ｃ）に示したように、第２アクセント句のように、静的特徴である対数ピッチの分散値が小さく、動的特徴である１次回帰係数の分散値が比較的大きい場合には、静的特徴を重視するようなオフセット値の変更を行い、第３〜４アクセント句のように、静的特徴の分散値が比較的大きく、動的特徴の分散値が小さい場合には、動的特徴を重視したオフセット値の変更を行う。
【００７７】
なお、オフセット値推定部１９では、代表パターン伸縮部１１から出力された、アクセント句単位の複数のピッチパターンを接続した例えば１文単位で、当該文を構成する各アクセント句単位のオフセットを推定する。
【００７８】
上記実施形態では、日本語のピッチパターン生成について説明したが、言語には依存しない方法であるため、適当な韻律制御単位を選択することで、英語・ドイツ語・フランス語・イタリア語・スペイン語・オランダ語・スウェーデン語・中国語など、外国語に本発明を適用することも可能である。
【００７９】
また、上記実施形態では、韻律制御単位としてアクセント句単位のピッチパターンを処理対象とした場合について説明したが、本発明は、この場合に限らず、例えば、呼気段落、単語、形態素、音節、モーラなどや、さらにこれらを組み合わせた単位といった、他の韻律制御単位であっても適用可能である。
【００８０】
以上説明したように、上記実施形態によれば、入力テキストに対応する音声の韻律的な特徴を制御するための１音節以上の時間長を有する音声の単位としての韻律制御単位（例えばアクセント句）毎の典型的な基本周波数パターンであって、当該基本周波数パターンを構成する各時系列点における静的特徴と、当該時系列点と他の時系列点との間の上記静的特徴の変化の特徴を表した動的特徴とが、それぞれの統計量で表現されている複数の代表パターンを代表パターン記憶部１６に記憶し、代表パターン選択部１０は、代表パターン記憶部１６に記憶された複数の代表パターンの中から、上記言語情報に基づき入力テキストに対応する代表パターンを選択する。代表パターン伸縮部１１では、当該選択された代表パターンの各点の静的特徴を音韻継続時間長１１１に従って音節単位で時間軸方向に線形伸縮を行い、その結果としてのアクセント句パターンを出力する。オフセット制御部１３は、アクセント句パターンの各点の静的特徴である、例えば対数ピッチの平均値を、オフセット推定部１２で推定されたオフセット値に従って変更する。最尤推定部１４では、オフセット制御部１３にてオフセットの制御されたアクセント句パターン１０４を、その静的特徴の統計量と動的特徴の統計量とからの尤度を基に変形することにより、入力テキストに対応する音声の基本周波数パターンを生成する。
【００８１】
このようにして生成された基本周波数パターンと、さらに音韻継続時間長１１１などの韻律情報などを基に、入力されたテキスト２０８に対応する音声を合成すると、自然性の高い合成音声を生成することができる。さらに、アクセント型はもとの代表パターンによって陽に表現されているため、アクセント位置の正しい滑らかで自然なパターンの生成が可能である。
【００８２】
すなわち、上記実施形態によれば、韻律制御単位の代表パターンの各点を、静的特徴および動的特徴の統計量によって表現し、これらの統計情報を考慮した尤度最大化基準によるパターン変形を行うことで、より自然な合成音声を生成することができる。
【００８３】
ここで、韻律制御単位とは、ピッチパターンを生成する際の基本単位であって、１音節以上にわたるピッチの変化を表現可能な長さを有する、様々な文章の構成単位が用いられる。例えば、アクセント句・単語・形態素・呼気段落・音節・モーラなどや、さらにこれらを組み合わせた単位を用いることもできる。
なお、本発明の実施の形態に記載した本発明の手法は、コンピュータに実行させることのできるプログラムとして、磁気ディスク（フロッピーディスク、ハードディスクなど）、光ディスク（ＣＤ−ＲＯＭ、ＤＶＤなど）、半導体メモリなどの記録媒体に格納して、あるいは、インターネットなどのネットワークを介して頒布することもできる。
【００８４】
また、本発明は、上記実施形態に限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で種々に変形することが可能である。さらに、上記実施形態には種々の段階の発明が含まれており、開示される複数の構成要件における適宜な組み合わせにより、種々の発明が抽出され得る。例えば、実施形態に示される全構成要件から幾つかの構成要件が削除されても、発明が解決しようとする課題の欄で述べた課題（の少なくとも１つ）が解決でき、発明の効果の欄で述べられている効果（の少なくとも１つ）が得られる場合には、この構成要件が削除された構成が発明として抽出され得る。
【００８５】
【発明の効果】
以上詳述したように、本発明のピッチパターン生成方法によれば、韻律制御単位の代表パターンの各時系列点を、静的特徴および動的特徴の統計量によって表現し、これらの情報を利用した最尤推定により高精度にパターン変形を行うことで、自然音声に近い正確で滑らかなピッチパターンの生成が可能であり、自然性の高い合成音声を生成することができる。
【図面の簡単な説明】
【図１】本発明の一実施形態にかかる音声合成システムの構成例を示した図。
【図２】図１のピッチパターン生成部の構成例を示した図。
【図３】代表パターンの静的特徴量について説明するための図。
【図４】代表パターンの動的特徴量について説明するための図。
【図５】代表パターンを変形する過程を示した図。
【図６】代表パターンを変形した結果得られたパターンの一例を示した図。
【図７】ピッチパターン生成部の他の構成例を示した図。
【図８】オフセット値を最尤推定によって求める過程を示した図。
【図９】従来のピッチパターン生成部の構成例を示した図。
【図１０】代表パターンを示した図。
【図１１】言語情報の例を示した図。
【図１２】代表パターンを変形する過程を示した図。
【図１３】生成された文ピッチパターンの一例を示した図。
【図１４】従来の技術で代表パターンを変形する場合の問題点を説明するための図。
【図１５】従来の技術で代表パターンを変形する場合の問題点を説明するための図。
【図１６】図１の音声合成システムの動作を説明するためのフローチャート。
【符号の説明】
１０…代表パターン選択部
１１…代表パターン伸縮部
１２…オフセット推定部
１３…オフセット制御部
１４…最尤推定部
１５…パターン接続部
１６…代表パターン記憶部
１９…オフセット値最尤推定部
２０…言語処理部
２１…韻律生成部
２２…音声信号生成部
２３…音韻継続時間長生成部
２４…ピッチパターン生成部

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to, for example, text-to-speech synthesis, and more particularly to a method and apparatus for generating a fundamental frequency (F0) pattern.
[0002]
[Prior art]
In recent years, text-to-speech synthesis systems that artificially generate speech signals from arbitrary sentences have been developed. Usually, this text-to-speech synthesis system includes three modules: a language processing unit, a prosody generation unit, and a speech signal generation unit.
[0003]
The input text is first subjected to linguistic processing such as morphological analysis and syntactic analysis in the language processing unit, which outputs language information such as a phoneme symbol string, accent type, and part of speech. Next, the prosody generation unit generates patterns of fundamental frequency (pitch) and rhythm.
[0004]
The prosody generation unit includes a phoneme duration generation unit and a pitch pattern generation unit. The phoneme duration generation unit generates and outputs the duration of each phoneme with reference to the language information. The pitch pattern generation unit receives the language information and the phoneme durations, and outputs a pitch pattern (also referred to as an F0 pattern), which is the pattern of change in voice pitch. Finally, the speech signal generation unit synthesizes the speech signal.
[0005]
In a text-to-speech synthesis system, the performance of the prosody generation unit bears on the naturalness of the synthesized speech. In particular, the accuracy of the pitch pattern, which is the pattern of change in voice pitch, largely determines the naturalness of the generated synthesized speech.
[0006]
Conventional pitch pattern generation methods in text-to-speech synthesis generate the pitch pattern with a relatively simple model, so the synthesized speech has unnatural intonation and sounds mechanical.
[0007]
To solve this problem, approaches that use pitch patterns extracted from natural speech have been proposed. For example, Japanese Patent Laid-Open No. 11-095783 discloses a method that stores a plurality of representative patterns, which are typical accent-phrase-unit patterns extracted from the pitch patterns of natural speech by a statistical method, and generates a pitch pattern by deforming and connecting the representative patterns selected for each accent phrase.
[0008]
FIG. 9 shows a configuration example of a pitch pattern generation unit according to the above-described conventional pitch pattern generation method. The conventional pitch pattern generation method is described below with reference to FIG. 9.
[0009]
The representative pattern storage unit 18 stores a plurality of representative patterns representing typical pitch patterns in units of accent phrases. The representative pattern is normalized so that the length of a syllable unit is constant, and each point is represented by a pitch on a logarithmic scale.
[0010]
An example of a representative pattern is shown in FIG. 10. The vertical axis represents pitch on a logarithmic scale. The horizontal axis corresponds to time; in this example the pattern is normalized so that one syllable is represented by three points, so one scale division corresponds to one syllable.
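The normalization described here (a fixed number of points per syllable, pitch on a logarithmic scale) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the natural-log scale, and the use of linear resampling inside each syllable are not taken from the patent, which does not say how the three points are obtained.

```python
import math

POINTS_PER_SYLLABLE = 3  # three points per syllable, as in the example of FIG. 10


def resample(values, n):
    """Linearly resample a list of values to n (>= 2) equally spaced points."""
    if len(values) == 1:
        return [values[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(values) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def normalize_pattern(syllable_f0_hz):
    """Turn per-syllable F0 samples (Hz) into a fixed-length log-pitch pattern.

    syllable_f0_hz is a list with one list of voiced F0 samples per syllable
    (an assumed input layout, chosen for illustration).
    """
    pattern = []
    for samples in syllable_f0_hz:
        log_samples = [math.log(f) for f in samples]
        pattern.extend(resample(log_samples, POINTS_PER_SYLLABLE))
    return pattern
```

With this layout a two-syllable accent phrase always yields six points, regardless of how long each syllable was in the original recording.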
[0011]
The representative pattern selection unit 10 refers to the language information 100 and selects and outputs a representative pattern for each accent phrase from the representative pattern storage unit 18.
[0012]
The language information 100 is information on each accent phrase and its neighboring accent phrases, obtained by linguistic analysis of the input text, and consists of a phoneme symbol string, accent type, part of speech, syntax information, and the like. An example of the language information for the text "今日はすばらしい青空です。" ("Today is a wonderful blue sky.") is shown in FIG. 11. A rule for selecting the representative pattern 201 from the language information 100 can be generated using any known method such as a statistical method or a machine learning method.
[0013]
The representative pattern deforming unit 18 deforms the representative pattern based on the language information 100 and the phoneme duration 111 and outputs an accent phrase pattern 202. First, linear expansion/contraction is performed in the time axis direction in units of speech according to the phoneme duration 111. Next, the dynamic range of the representative pattern is estimated from the language information 100, and the pattern is linearly expanded or contracted in the frequency axis direction according to the estimated value. For estimating the dynamic range, a known statistical method such as quantification type I can be used.
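The two conventional deformation steps can be sketched as below. This is a minimal sketch, assuming that the time-axis stretching is done per syllable (as paragraph [0017] describes), that durations are given as a frame count per syllable, and that the frequency-axis scaling is applied about the pattern mean. None of these details, nor the function names, are specified in the text.

```python
def stretch_syllable(points, n_frames):
    """Linearly interpolate one syllable's points to n_frames values."""
    if n_frames == 1 or len(points) == 1:
        return [points[0]] * n_frames
    out = []
    for i in range(n_frames):
        pos = i * (len(points) - 1) / (n_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, len(points) - 1)
        frac = pos - lo
        out.append(points[lo] * (1 - frac) + points[hi] * frac)
    return out


def stretch_pattern(pattern, frames_per_syllable, points_per_syllable=3):
    """Stretch a normalized pattern along the time axis, syllable by syllable."""
    out = []
    for k, n_frames in enumerate(frames_per_syllable):
        segment = pattern[k * points_per_syllable:(k + 1) * points_per_syllable]
        out.extend(stretch_syllable(segment, n_frames))
    return out


def scale_range(pattern, target_range):
    """Scale the pattern about its mean so that max - min equals target_range."""
    lo, hi = min(pattern), max(pattern)
    if hi == lo:
        return list(pattern)  # a flat pattern has no range to scale
    mean = sum(pattern) / len(pattern)
    factor = target_range / (hi - lo)
    return [mean + (v - mean) * factor for v in pattern]
```

Because each output value depends only on the stored pitch values, a sketch like this cannot preserve the slope of the original contour when a syllable is lengthened or shortened.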
[0014]
The offset estimation unit 12 estimates an offset value 103 corresponding to the average height of the accent phrase from the language information 100 and outputs it. For estimation of the offset value, a known statistical method such as quantification type I can be used as in the above-described estimation of the dynamic range.
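Quantification type I is, in effect, a linear regression on categorical attributes: the prediction is a base value plus one score per attribute category. The sketch below shows only the prediction step. The attribute names and all score values are hypothetical and hand-picked for illustration; in practice the scores would be fitted to data by least squares, and the patent does not list which attributes are used.

```python
def estimate_offset(base, category_scores, attributes):
    """Additive prediction: base value plus the score of each attribute's category."""
    total = base
    for item, category in attributes.items():
        total += category_scores[item][category]
    return total


# Hypothetical scores (log-pitch units), for illustration only.
EXAMPLE_SCORES = {
    "accent_type": {0: -0.05, 1: 0.10},
    "position_in_sentence": {"first": 0.15, "middle": 0.0, "last": -0.20},
}

example_offset = estimate_offset(
    5.0, EXAMPLE_SCORES, {"accent_type": 1, "position_in_sentence": "last"}
)
```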
[0015]
The offset control unit 13 translates the accent phrase pattern 202 along the frequency axis according to the estimated offset value 103, and outputs an accent phrase pattern 204. An example of the pattern deformation and offset control described above is shown in FIG. 12.
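Offset control amounts to a parallel shift of the whole pattern. A minimal sketch, assuming the offset value is the target average height of the accent phrase on the log-pitch axis (the function name is illustrative):

```python
def apply_offset(pattern, offset):
    """Shift the pattern so that its average height equals the offset value."""
    mean = sum(pattern) / len(pattern)
    shift = offset - mean
    return [v + shift for v in pattern]
```

The shape of the pattern is unchanged; only its overall height moves.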
[0016]
The pattern connection unit 15 connects the accent phrase patterns 204 generated for the individual accent phrases, smooths them so that no discontinuity occurs at accent phrase boundaries, and outputs a sentence pitch pattern 206. An example of the sentence pitch pattern is shown in FIG. 13.
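The connection step can be sketched as concatenation followed by local smoothing around each boundary. The text does not state which smoothing is used, so the three-point moving average and the half_window parameter below are assumptions made for illustration.

```python
def connect_patterns(patterns, half_window=2):
    """Concatenate accent phrase patterns and smooth around each boundary."""
    sentence = [v for p in patterns for v in p]

    # Index of the first point of every accent phrase except the first one.
    boundaries = []
    pos = 0
    for p in patterns[:-1]:
        pos += len(p)
        boundaries.append(pos)

    smoothed = list(sentence)
    for b in boundaries:
        start = max(0, b - half_window)
        stop = min(len(sentence), b + half_window)
        for i in range(start, stop):
            lo = max(0, i - 1)
            hi = min(len(sentence), i + 2)
            # Three-point moving average computed from the unsmoothed values.
            smoothed[i] = sum(sentence[lo:hi]) / (hi - lo)
    return smoothed
```

Points far from any boundary are left untouched, so the accent phrase shapes are preserved away from the joins.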
[0017]
In the pitch pattern generation method for text-to-speech synthesis described above, the representative pattern must be deformed. For example, when the pattern is deformed along the time axis in syllable units according to the phoneme durations, linear expansion/contraction that uses only static features, such as the average pitch at each point, is not a deformation grounded in any theoretical basis, so the naturalness of the speech synthesized from the deformed pitch pattern is degraded.
[0018]
FIGS. 14 and 15 show an example. FIGS. 14(a) and 15(a) show the selected representative patterns; FIGS. 14(b) and 15(b) show the patterns obtained by actually deforming the representative patterns of the respective (a) figures by linear expansion/contraction in syllable units along the time axis; and FIGS. 14(c) and 15(c) show the ideal patterns after deformation.
[0019]
In the example of FIG. 14, because the expansion/contraction uses only static features, the deformation cannot take the slope of the pattern into account, and an unnatural pitch change occurs near the second syllable. In the example of FIG. 15, because the amount of information at each point of the representative pattern and the accuracy of deformation by expansion/contraction are insufficient, a pattern that should be deformed as in FIG. 15(c) is instead deformed into the simple, inaccurate pattern of FIG. 15(b).
[0020]
On the other hand, a method has been proposed, as described in IEICE Technical Report SP2001-70, September 2001 (pages 53-58), that models pitch in phoneme units using dynamic and static features as parameters and generates a smooth pitch change pattern that takes the dynamic feature amounts into account.
[0021]
However, when modeling is performed in units of phonemes, a problem arises in modeling for an unvoiced sound having no pitch. In addition, since the accent type cannot be expressed explicitly, there is a problem that even if the pitch change is smooth, an unnatural or incorrect inflection pattern may be generated.
[0022]
[Problems to be solved by the invention]
Thus, conventionally, only static feature amounts of the representative pattern, such as the average pitch at each point, are used when deforming the representative pattern. The pattern obtained by the deformation is therefore unnatural, and synthesized speech close to natural speech cannot be generated.
[0023]
The present invention has been made in consideration of the above problems. Its object is to provide a fundamental frequency pattern generation method and a fundamental frequency pattern generation device capable of generating a fundamental frequency pattern close to that of speech uttered by a person, and a speech synthesizer that uses them to synthesize speech close to speech uttered by a person.
[0024]
[Means for Solving the Problems]
The present invention generates, based on language information obtained by analyzing a text, a fundamental frequency pattern representing the temporal change in fundamental frequency, which is one of the prosodic features of the speech corresponding to the text. A plurality of representative patterns are stored in storage means. Each representative pattern is a typical fundamental frequency pattern for a prosodic control unit, that is, a speech unit having a time length of one or more syllables that is used to control the prosodic features of the speech corresponding to the text, and in each representative pattern the static feature at each time-series point constituting the fundamental frequency pattern and the dynamic feature representing how the static feature changes are each represented by statistics. A representative pattern corresponding to the text is selected, based on the language information, from the plurality of representative patterns stored in the storage means, and the fundamental frequency pattern of the speech corresponding to the text is estimated based on the likelihood obtained from the statistics of the static feature and the statistics of the dynamic feature of the selected representative pattern.
[0025]
According to the present invention, it is possible to generate a fundamental frequency pattern of speech that is close to the fundamental frequency pattern of speech uttered by a person.
[0026]
The present invention generates, based on language information obtained by analyzing a text, a fundamental frequency pattern representing the temporal change in fundamental frequency, which is one of the prosodic features of the speech corresponding to the text. A plurality of representative patterns are stored in storage means. Each representative pattern is a typical fundamental frequency pattern for a prosodic control unit, that is, a speech unit having a time length of one or more syllables that is used to control the prosodic features of the speech corresponding to the text, and in each representative pattern the static feature at each time-series point constituting the fundamental frequency pattern and the dynamic feature representing how the static feature changes are each represented by statistics. A representative pattern corresponding to the text is selected, based on the language information, from the plurality of representative patterns stored in the storage means. The fundamental frequency pattern of the speech corresponding to the text is then generated by deforming the representative pattern based on the likelihood obtained from the statistics of the static feature and the statistics of the dynamic feature of the selected representative pattern, and on an offset value, estimated from the language information, that represents the height of the representative pattern for each prosodic control unit.
[0027]
According to the present invention, it is possible to generate a fundamental frequency pattern of speech that is close to the fundamental frequency pattern of speech uttered by a person.
[0028]
The present invention selects, based on language information obtained by analyzing a text, a representative pattern corresponding to the text from a plurality of representative patterns stored in advance in storage means, each being a typical fundamental frequency pattern for a prosodic control unit, that is, a speech unit having a time length of one or more syllables that is used to control the prosodic features of speech. It generates the fundamental frequency pattern of the speech corresponding to the text by deforming the selected representative pattern based on an offset value, estimated from the language information, that represents the height of the representative pattern for each prosodic control unit. The offset value for each prosodic control unit is estimated based on the likelihood obtained from the statistics of its static feature and the statistics of a dynamic feature representing how the static feature changes.
[0029]
According to the present invention, it is possible to generate a fundamental frequency pattern of speech that is close to the fundamental frequency pattern of speech uttered by a person.
[0030]
The prosodic control unit may be any of a morpheme, a word, and an accent phrase.
[0031]
The static feature may be pitch on a logarithmic or linear scale.
[0032]
The dynamic feature may be any one of a difference of the static feature between the time series points, a regression coefficient, and a polynomial expansion coefficient.
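Two of the dynamic features named here, the difference and the first-order regression coefficient, can be sketched as follows. The window width and the treatment of the first and last points are assumptions made for illustration; the text does not fix them.

```python
def differences(pattern):
    """Difference of the static feature between adjacent time-series points."""
    return [b - a for a, b in zip(pattern, pattern[1:])]


def slope(ys):
    """Least-squares slope of ys against the indices 0, 1, ..., len(ys) - 1."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def left_slopes(pattern, width=2):
    """First-order regression coefficient over each point and up to width points to its left."""
    out = []
    for i in range(len(pattern)):
        segment = pattern[max(0, i - width):i + 1]
        out.append(slope(segment) if len(segment) > 1 else 0.0)
    return out


def right_slopes(pattern, width=2):
    """First-order regression coefficient over each point and up to width points to its right."""
    out = []
    for i in range(len(pattern)):
        segment = pattern[i:i + width + 1]
        out.append(slope(segment) if len(segment) > 1 else 0.0)
    return out
```

With a width of 1 the regression coefficient reduces to the plain difference between neighboring points.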
[0033]
The statistics may be a mean together with a variance or a standard deviation.
[0034]
Further, the deformation of the representative pattern may be performed on a pattern in which a plurality of the selected representative patterns are connected.
[0035]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0036]
FIG. 1 shows an example of the configuration of a speech synthesis system according to the present embodiment, which is roughly composed of a language processing unit 20, a prosody generation unit 21, and a speech signal generation unit 22.
[0037]
When the text 208 is input, the language processing unit 20 first performs language analysis processing such as morphological analysis and syntactic analysis on the input text 208, and outputs language information 100 such as a phoneme symbol string, accent type, part of speech, dependency target, and pauses.
[0038]
Based on the language information 100, the prosody generation unit 21 generates information representing the prosodic features of the speech corresponding to the input text 208 (prosodic information), for example, phoneme durations and a pattern representing the change over time of the fundamental frequency (hereinafter also written simply as pitch or F0), that is, a fundamental frequency pattern (hereinafter simply called a pitch pattern or F0 pattern). The prosody generation unit 21 includes a phoneme duration generation unit 23 and a pitch pattern generation unit 1.
[0039]
The phoneme duration generation unit 23 refers to the language information 100 to generate and output the time length of each phoneme, that is, the phoneme duration 111. A known technique may be used, as in the conventional art, to generate phoneme durations from the language information; since this is not the gist of the present application, its description is omitted.
[0040]
The pitch pattern generation unit 1 receives the language information 100 and the phoneme duration 111 as input and outputs a pitch pattern 106, which is the pattern of change in voice pitch. More specifically, it outputs, for example, a sentence-unit pitch pattern (sentence pitch pattern) 106 generated by connecting the pitch patterns of the individual accent phrases while smoothing them so that no discontinuity occurs at accent phrase boundaries.
[0041]
The speech signal generation unit 22 synthesizes speech corresponding to the input text 208 based on prosodic information such as the pitch pattern 106 and the phoneme duration 111 generated from the language information 100, and outputs it as a speech signal 207. A known technique may be used, as in the conventional art, to synthesize the speech; since this is not the gist of the present application, its description is omitted.
[0042]
FIG. 2 is a block diagram showing the configuration of the pitch pattern generation unit 1 of FIG. 1, which includes a representative pattern selection unit 10, a representative pattern expansion/contraction unit 11, an offset estimation unit 12, an offset control unit 13, a maximum likelihood estimation unit 14, a pattern connection unit 15, and a representative pattern storage unit 16. In FIG. 2, the same parts as those in FIG. 9 are denoted by the same reference numerals.
[0043]
The difference from the conventional pitch pattern generation unit shown in FIG. 9 is that each point (time-series point) of a representative pattern is expressed by the mean and variance of the log pitch, which is the static feature, and by the mean and variance of the left-side and right-side first-order regression coefficients of the static feature at that point, which are the dynamic features, and that the selected representative pattern is deformed based on a likelihood maximization criterion.
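The likelihood maximization criterion can be illustrated with a generic Gaussian formulation. The patent's own expressions are not reproduced in this text, so the following is a sketch under stated assumptions rather than the patented procedure: each static value x[i] is modeled as Gaussian with mean mu[i] and variance var[i], and, as a simplification, the only dynamic feature is the left-side difference x[i] - x[i-1] with mean d_mu[i] and variance d_var[i] (index 0 unused). Maximizing the log-likelihood is then a weighted least-squares problem whose normal equations are solved directly. All names are illustrative.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ml_generate(mu, var, d_mu, d_var):
    """Pitch sequence maximizing the joint Gaussian likelihood of static and dynamic features."""
    n = len(mu)
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i in range(n):
        w = 1.0 / var[i]          # static term: w * (x[i] - mu[i])^2
        a[i][i] += w
        b[i] += w * mu[i]
    for i in range(1, n):
        w = 1.0 / d_var[i]        # dynamic term: w * (x[i] - x[i-1] - d_mu[i])^2
        a[i][i] += w
        a[i - 1][i - 1] += w
        a[i][i - 1] -= w
        a[i - 1][i] -= w
        b[i] += w * d_mu[i]
        b[i - 1] -= w * d_mu[i]
    return solve(a, b)
```

Where the static variance is small the solution stays close to the stored pitch values, and where the dynamic variance is small it follows the stored slope instead. If only the static terms were present, the solution would simply be the sequence of means.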
[0044]
A representative pattern is a typical accent-phrase-unit pattern extracted by a statistical method from a plurality of pitch patterns of natural speech. The pitch at each point of the representative pattern is the mean, on a logarithmic or linear scale, obtained from those pitch patterns, and the static feature at each point (time-series point) is expressed by statistics such as this mean and a variance (the standard deviation, i.e., the square root of the variance, may be used instead of the variance). These are also called static feature amounts.
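The per-point statistics can be sketched as below, assuming the natural-speech pitch patterns assigned to one representative pattern have already been normalized to the same number of points. How patterns are grouped is not covered here, and the function name is illustrative.

```python
from statistics import fmean, pvariance


def point_statistics(contours):
    """Return a (mean, variance) pair for each time-series point.

    contours is a list of equal-length log-pitch patterns (assumed input).
    """
    return [(fmean(column), pvariance(column)) for column in zip(*contours)]
```

The same function can be applied to sequences of dynamic features to obtain their means and variances.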
[0045]
The dynamic feature at each point of the representative pattern is, for example, the amount of change of the static feature (for example, the pitch on a logarithmic or linear scale) between that point and some point on its left (or right) side, such as an adjacent point, as obtained from the plurality of pitch patterns of natural speech. Examples of this amount of change include a difference, a regression coefficient, and a polynomial expansion coefficient. The dynamic feature is expressed by its average value and variance value (the standard deviation, that is, the square root of the variance, may be used instead of the variance). These are also called dynamic feature amounts.
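As a rough illustration of how such per-point statistics could be computed, the following Python sketch takes several time-normalized pitch patterns of equal length and returns the mean and variance of the logarithmic pitch (static feature) and of the difference from the left-adjacent point (dynamic feature). The function name `feature_statistics`, the choice of a simple difference rather than a regression coefficient, and the use of the population variance are assumptions made for this example; they are not taken from the embodiment.

```python
import math
from statistics import fmean, pvariance


def feature_statistics(patterns_hz):
    """Return per-point (mean, variance) of static and dynamic features.

    patterns_hz: list of pitch patterns in Hz, all with the same number of
    points (assumed already normalized in time).
    static[i]  : statistics of log pitch at point i.
    dynamic[j] : statistics of log pitch at point j+1 minus point j
                 (a plain difference, one of the options the text lists).
    """
    log_patterns = [[math.log(f) for f in p] for p in patterns_hz]
    n_points = len(log_patterns[0])
    static, dynamic = [], []
    for i in range(n_points):
        values = [p[i] for p in log_patterns]
        static.append((fmean(values), pvariance(values)))
        if i > 0:
            deltas = [p[i] - p[i - 1] for p in log_patterns]
            dynamic.append((fmean(deltas), pvariance(deltas)))
    return static, dynamic
```

A pattern with N points yields N static entries and N-1 dynamic entries in this sketch, because the first point has no left neighbor.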
[0046]
The configuration and operation of the pitch pattern generation unit shown in FIG. 2 will be described below with reference to the flowchart shown in FIG. 16.
[0047]
In FIG. 2, the representative pattern storage unit 16 stores a plurality of representative patterns, each representing a typical pitch pattern in units of, for example, accent phrases, which serve as speech units (prosodic control units) for controlling the prosodic features of speech. Each representative pattern is normalized so that every syllable has the same length, and each point holds the statistics (here, mean and variance) of the logarithmic-scale pitch, which is the static feature, and the statistics (here, mean and variance) of the first-order regression coefficients (so-called slopes) on the left and right sides of the point, which are the dynamic features. That is,
[Expression 1]

FIG. 3 shows the static features of each of four representative patterns (a) to (d), and FIG. 4 shows the dynamic features corresponding to each of the four representative patterns shown in FIGS. 3(a) to 3(d).
[0048]
FIG. 3 shows the average value and the standard deviation value (square root of the variance value) of the logarithmic pitch, which is the static feature information, at each point of each representative pattern. FIG. 4 shows the average value and the standard deviation value of the left first-order regression coefficient, which is one piece of the dynamic feature information, at each point of each representative pattern. In FIGS. 3 and 4, the vertical axis represents frequency on a logarithmic scale and the horizontal axis corresponds to time; because the patterns are normalized so that one syllable is represented by three points, one scale division corresponds to one syllable.
[0049]
Returning to the description of FIG. 2, the representative pattern selection unit 10 refers to the language information 100 and selects and outputs a representative pattern for each accent phrase from the representative pattern storage unit 16 (step S3 in FIG. 16).
[0050]
The language information 100 is information about each accent phrase, and the accent phrases in its vicinity, obtained by performing language analysis on the input text, and is composed of a phoneme symbol string, accent type, part of speech, syntax information, and the like. An example of language information for the text "Today is a wonderful blue sky." is shown in FIG. 11. A rule for selecting the representative pattern 201 from the language information 100 can be generated using any known method such as a statistical method or a machine learning method.
[0051]
The representative pattern expansion / contraction unit 11 linearly expands / contracts the parameters of each point of the representative pattern in the time axis direction in units of syllables according to the phoneme duration 111, and outputs the accent phrase pattern 102 (step S4 in FIG. 16).
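The syllable-wise linear expansion and contraction can be sketched as follows. This is a minimal Python example that assumes each syllable of the normalized pattern is stored as a fixed number of points (three, as in the figures) and that the phoneme durations have already been converted into a target number of output frames per syllable; the function names and the frame-count representation are hypothetical.

```python
def stretch_syllable(points, n_frames):
    """Linearly resample one syllable's points (2 or more) to n_frames values."""
    if n_frames == 1:
        return [points[0]]
    last = len(points) - 1
    out = []
    for k in range(n_frames):
        pos = k * last / (n_frames - 1)      # position on the original axis
        lo = min(int(pos), last - 1)         # left neighbor index
        frac = pos - lo                      # interpolation weight
        out.append(points[lo] * (1.0 - frac) + points[lo + 1] * frac)
    return out


def stretch_pattern(pattern, frames_per_syllable, points_per_syllable=3):
    """Stretch each syllable of a normalized pattern to its own frame count."""
    stretched = []
    for s, n_frames in enumerate(frames_per_syllable):
        start = s * points_per_syllable
        syllable = pattern[start:start + points_per_syllable]
        stretched.extend(stretch_syllable(syllable, n_frames))
    return stretched
```

In this sketch the same routine would be applied to every parameter track of the representative pattern (means and variances of both feature types), since the text states that the parameters of each point are stretched.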
[0052]
The offset estimation unit 12 estimates an offset value 103 corresponding to the average height of the accent phrase from the language information 100 and outputs it. For estimation of the offset value, a known statistical method such as quantification theory type I can be used, as in the above-described estimation of the dynamic range.
[0053]
The offset value is information indicating the overall pitch height of the pitch pattern corresponding to the prosodic control unit. Besides the average height of the pattern described above, information such as the maximum pitch of the pattern, the minimum pitch of the pattern, or the amount of change in height may be used.
[0054]
The offset control unit 13 changes the average value of the logarithmic pitch, which is a static feature, according to the offset value 103 estimated by the offset estimation unit 12 for the parameters of each point of the accent phrase pattern 102. That is, the pattern is translated on the frequency axis in the same manner as in the conventional process, and the accent phrase pattern 104 is output (step S5 in FIG. 16).
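A minimal Python sketch of this translation is given below. It assumes that the offset value is the target average of the logarithmic-pitch means over the accent phrase; the function name `apply_offset` and this interpretation of the offset are assumptions for illustration only.

```python
def apply_offset(mean_log_f0, offset):
    """Shift the static means so that their average equals `offset`.

    Only the means of the static feature move; the shape of the pattern,
    the variances, and any difference-based dynamic features are unchanged
    by a pure translation along the frequency axis.
    """
    shift = offset - sum(mean_log_f0) / len(mean_log_f0)
    return [m + shift for m in mean_log_f0]
```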
[0055]
For the accent phrase pattern 104 whose offset has been controlled by the offset control unit 13, the maximum likelihood estimation unit 14 obtains the parameter sequence that is optimal in the maximum-likelihood sense with respect to the statistics of the static and dynamic features at each point of the pattern, thereby deforming the pattern, and outputs the pattern 105 (step S6 in FIG. 16).
[0056]
[Expression 2]

[0057]
That is, the parameter sequence becomes the sequence of average values: the average value of the logarithmic pitch, which is the static feature, is used as the pitch value of each point regardless of the variance value.
[0058]
Therefore, dynamic features widely used in speech recognition and the like are introduced into this parameter sequence.
[0059]
[Expression 3]

[0060]
FIGS. 5 and 6 show the process of deforming the representative pattern 101. FIG. 5(a) shows the average value and standard deviation value (square root of the variance value) of the logarithmic pitch, which is the static feature, among the parameters at each point of the selected representative pattern 101. FIG. 5(b) shows the average value at each point of the pattern 104 obtained by applying, to the static feature shown in FIG. 5(a), linear expansion / contraction in the time axis direction by the representative pattern expansion / contraction unit 11 and offset control by the offset control unit 13.
[0061]
FIG. 5(c) shows the average value and standard deviation value of the left first-order regression coefficient, which is one of the dynamic features of the representative pattern shown in FIG. 5(a). FIG. 5(d) shows the average value at each point of the pattern 104 obtained by applying, to the dynamic feature shown in FIG. 5(c), linear expansion / contraction in the time axis direction by the representative pattern expansion / contraction unit 11 and offset control by the offset control unit 13.
[0062]
FIG. 6 shows the final accent phrase pattern, that is, the pattern 105, generated from the parameter sequence that the maximum likelihood estimation unit 14 produces from the static and dynamic features shown in FIGS. 5(b) and 5(d), which are the result of the linear expansion / contraction in the time axis direction and the offset control.
[0063]
In the second syllable of the representative pattern shown in FIGS. 5 and 6, the variance of the logarithmic pitch, which is the static feature, is small (see FIG. 5(a)) and the variance of the first-order regression coefficient, which is the dynamic feature, is relatively large (see FIG. 5(c)), so the maximum likelihood estimation unit 14 deforms the pattern so as to emphasize the pitch value of the original representative pattern, that is, the static feature. In the third and fourth syllables of the representative pattern, on the other hand, the variance of the static feature is relatively large (see FIG. 5(a)) and the variance of the dynamic feature is small (see FIG. 5(c)), so the deformation emphasizes the slope of the pattern, that is, the dynamic feature.
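The trade-off described here can be illustrated with a small Python sketch. It assumes independent Gaussian distributions for the static feature at each point and for a single dynamic feature defined as the difference between adjacent points, so that maximizing the likelihood amounts to minimizing a variance-weighted sum of squared deviations and solving a tridiagonal linear system. The embodiment uses left and right regression coefficients; the single difference, the function name `ml_generate`, and the solver choice are simplifying assumptions of this example.

```python
def ml_generate(mu, var, d_mu, d_var):
    """Maximum-likelihood pitch values under Gaussian static/delta statistics.

    mu, var     : mean and variance of the static feature at each of n points.
    d_mu, d_var : mean and variance of x[j+1] - x[j] for j = 0 .. n-2.
    Minimizes  sum_i (x[i]-mu[i])^2 / var[i]
             + sum_j (x[j+1]-x[j]-d_mu[j])^2 / d_var[j].
    """
    n = len(mu)
    s = [1.0 / v for v in var]        # static precisions
    w = [1.0 / v for v in d_var]      # dynamic precisions
    diag = [0.0] * n
    rhs = [0.0] * n
    for i in range(n):
        diag[i] = s[i]
        rhs[i] = s[i] * mu[i]
        if i > 0:                     # term linking x[i-1] and x[i]
            diag[i] += w[i - 1]
            rhs[i] += w[i - 1] * d_mu[i - 1]
        if i < n - 1:                 # term linking x[i] and x[i+1]
            diag[i] += w[i]
            rhs[i] -= w[i] * d_mu[i]
    off = [-wj for wj in w]           # off-diagonal entries (symmetric)

    # Thomas algorithm for the symmetric tridiagonal system.
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off[0] / diag[0] if n > 1 else 0.0
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = off[i] / denom
        d[i] = (rhs[i] - off[i - 1] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

With a small static variance at a point, the solution stays close to the stored mean there; with a large static variance and a small dynamic variance, the solution instead follows the stored slope from its neighbor, which mirrors the behavior described for the second and the third to fourth syllables.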
[0064]
In other words, because maximum likelihood estimation generates parameters that reflect the statistics of both the static and dynamic features, the deformation can simultaneously take into account the portions where the pitch value at each point of the pattern should be emphasized and the portions where the change (slope) of the pattern should be emphasized. Furthermore, because each point of the representative pattern is expressed by a combination of static and dynamic features, the expressive power of the representative pattern is improved, and a complex deformed pattern such as the first syllable in this example can also be generated with high accuracy.
[0065]
In this way, by generating the parameters that are optimal in the maximum-likelihood sense while taking dynamic features into account, the pitch pattern can be deformed more smoothly and accurately, and closer to natural speech, than when linear interpolation is performed using only pitch information, which is a static feature, so that highly natural synthesized speech can be generated. Furthermore, since the accent type is explicitly expressed by the original representative pattern, it is possible to generate a smooth and natural pattern with the correct accent position.
[0066]
Returning to the description of FIG. 2, the pattern connection unit 15 connects the accent phrase patterns 105 generated for the respective accent phrases, performs smoothing so that no discontinuity occurs at the accent phrase boundaries, and outputs the resulting pitch pattern 106 (step S7 in FIG. 16).
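The text does not specify the smoothing method used at the boundaries. As one possible illustration, the following Python sketch concatenates the accent phrase patterns and applies a three-point moving average to the points within a small window around each boundary; the function name `connect_patterns`, the window parameter `half_width`, and the moving-average filter itself are assumptions of this example.

```python
def connect_patterns(patterns, half_width=2):
    """Concatenate phrase patterns and smooth around each phrase boundary."""
    joined = [v for p in patterns for v in p]

    # Index of the first point of every phrase except the first one.
    boundaries = []
    pos = 0
    for p in patterns[:-1]:
        pos += len(p)
        boundaries.append(pos)

    smoothed = list(joined)
    for b in boundaries:
        lo = max(1, b - half_width)
        hi = min(len(joined) - 1, b + half_width)
        for i in range(lo, hi):
            # Three-point average computed from the unsmoothed values.
            smoothed[i] = (joined[i - 1] + joined[i] + joined[i + 1]) / 3.0
    return smoothed
```

Points far from a boundary are left untouched, so the shape produced for each accent phrase is preserved except near the joins.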
[0067]
On the basis of the prosody information such as the pitch pattern 106 and the phoneme duration 111 generated as described above, the speech signal generation unit 22 synthesizes speech corresponding to the input text 208 and outputs it as the speech signal 207 (step S8 in FIG. 16).
[0068]
In this embodiment, the representative pattern is linearly expanded or contracted by time length, its offset is controlled, and it is then deformed by maximum likelihood estimation. However, the offset control may instead be performed before the linear expansion / contraction by time length, or after the deformation by maximum likelihood estimation.
[0069]
Further, in this embodiment, the representative pattern of each accent phrase is deformed by maximum likelihood estimation before the patterns are connected. However, the order may be changed so that a plurality of representative patterns of prosodic control units are connected first and the deformation by maximum likelihood estimation is performed afterwards.
[0070]
In this embodiment, offset control is performed by using the offset value estimated by the offset estimation unit 12 as it is. However, the offset value may also be expressed by statistics of static and dynamic features, changed based on the likelihood obtained from these statistics, and then used for the control.
[0071]
FIG. 7 shows a configuration example of the pitch pattern generation unit 1 for the case where the representative pattern is linearly expanded or contracted by time length and deformed by maximum likelihood estimation, and where the offset value estimated by the offset estimation unit 12 is also expressed by statistics of static and dynamic features and is changed based on the likelihood obtained from these statistics before the offset control is performed.
[0072]
In FIG. 7, the same parts as those in FIG. 2 are denoted by the same reference numerals, and only the different parts will be described. That is, FIG. 7 differs from FIG. 2 in that an offset value maximum likelihood estimation unit 19 is provided between the offset estimation unit 12 and the offset control unit 13, and in that the offset value 107 output from the offset estimation unit 12 is expressed by static and dynamic features.
[0073]
FIG. 8 is a diagram for explaining a case in which the offset value maximum likelihood estimation unit 19 changes the offset value of each accent phrase in, for example, one sentence formed by connecting a plurality of (for example, four) pitch patterns in units of accent phrases.
[0074]
The static feature of the offset value estimated by the offset estimation unit 12 in FIG. 7 is expressed, for example as shown in FIG. 8(a), by statistics such as the average value (average height) and the variance value (which may be the standard deviation, that is, the square root of the variance) of representative patterns in units of accent phrases on a logarithmic (or linear) scale, extracted from a plurality of pitch patterns of natural speech by a statistical method.
[0075]
The dynamic feature of the offset value is, for example as shown in FIG. 8(b), the amount of change of the static feature (for example, the average pitch value of each accent phrase) between the pitch pattern of interest and another pitch pattern on its right (or left) side when a plurality of pitch patterns in units of accent phrases are connected. Examples include the difference, the regression coefficient, and the polynomial expansion coefficient of the static feature between the pitch pattern of interest and the other pitch pattern. The dynamic feature is expressed by its average value and variance value (which may be the standard deviation, that is, the square root of the variance).
[0076]
The offset value maximum likelihood estimation unit 19 changes the offset value 107 output from the offset estimation unit 12 in the same manner as the maximum likelihood estimation unit 14 of FIG. 2 described above. For example, as shown in FIG. 8(c), when the variance of the logarithmic pitch, which is the static feature, is small and the variance of the first-order regression coefficient, which is the dynamic feature, is relatively large, as in the second accent phrase, the offset value is changed with emphasis on the static feature. When the variance of the static feature is relatively large and the variance of the dynamic feature is small, as in the third and fourth accent phrases, the offset value is changed with emphasis on the dynamic feature.
[0077]
The offset value maximum likelihood estimation unit 19 estimates the offset of each accent phrase constituting a sentence in units of, for example, one sentence in which a plurality of pitch patterns in units of accent phrases output from the representative pattern expansion / contraction unit 11 are connected.
[0078]
In the above embodiment, pitch pattern generation for Japanese has been described. However, since the method is language-independent, the present invention can also be applied to other languages such as English, German, French, Italian, Spanish, Dutch, Swedish, and Chinese.
[0079]
In the above embodiment, the case where the pitch pattern in units of accent phrases is processed as the prosodic control unit has been described. However, the present invention is not limited to this case and is also applicable to other prosodic control units, including units obtained by combining them.
[0080]
As described above, according to the above embodiment, the representative pattern storage unit 16 stores a plurality of representative patterns. Each representative pattern is a typical fundamental frequency pattern for a prosodic control unit (for example, an accent phrase), that is, a speech unit having a time length of one or more syllables for controlling the prosodic features of the speech corresponding to the input text. In each representative pattern, the static feature at each time-series point constituting the fundamental frequency pattern, and the dynamic feature representing the change of the static feature between that time-series point and another time-series point, are each expressed by statistics. The representative pattern selection unit 10 selects, based on the language information, the representative pattern corresponding to the input text from the plurality of representative patterns stored in the representative pattern storage unit 16. The representative pattern expansion / contraction unit 11 linearly expands or contracts the static features of each point of the selected representative pattern in the time axis direction in units of syllables according to the phoneme duration 111, and outputs the resulting accent phrase pattern. The offset control unit 13 changes, for example, the average value of the logarithmic pitch, which is the static feature of each point of the accent phrase pattern, according to the offset value estimated by the offset estimation unit 12. The maximum likelihood estimation unit 14 deforms the accent phrase pattern 104, whose offset has been controlled by the offset control unit 13, based on the likelihood obtained from the static feature statistics and the dynamic feature statistics, thereby generating the fundamental frequency pattern of the speech corresponding to the input text.
[0081]
By synthesizing the speech corresponding to the input text 208 on the basis of the fundamental frequency pattern generated in this way and prosodic information such as the phoneme duration 111, highly natural synthesized speech can be generated. Furthermore, since the accent type is explicitly expressed by the original representative pattern, it is possible to generate a smooth and natural pattern with the correct accent position.
[0082]
That is, according to the above-described embodiment, each point of the representative pattern of the prosodic control unit is expressed by the statistics of the static feature and the dynamic feature, and the pattern is deformed based on a likelihood maximization criterion that takes these statistics into account, so that more natural synthesized speech can be generated.
[0083]
Here, the prosodic control unit is a basic unit for generating a pitch pattern, and any of various sentence constituent units long enough to express a pitch change over one syllable or more may be used. For example, an accent phrase, a word, a morpheme, a breath group, a syllable, a mora, or a unit obtained by combining these may be used.
The method of the present invention described in the embodiment can be stored, as a program executable by a computer, on a recording medium such as a magnetic disk (floppy disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory, or can be distributed via a network such as the Internet.
[0084]
Further, the present invention is not limited to the above-described embodiment, and various modifications can be made at the implementation stage without departing from the scope of the invention. Further, the above embodiment includes inventions at various stages, and various inventions can be extracted by appropriately combining the plurality of disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, as long as at least one of the problems described in the section on the problems to be solved by the invention can be solved and at least one of the effects described in the section on the effects of the invention can be obtained, the configuration from which those constituent elements have been deleted can be extracted as an invention.
[0085]
[Effects of the invention]
As described above in detail, according to the pitch pattern generation method of the present invention, each time-series point of the representative pattern of the prosodic control unit is expressed by the statistics of the static feature and the dynamic feature, and the pattern is deformed with high accuracy by maximum likelihood estimation using this information. As a result, an accurate and smooth pitch pattern close to natural speech can be generated, and highly natural synthesized speech can be produced.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration example of a speech synthesis system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a configuration example of a pitch pattern generation unit in FIG. 1.
FIG. 3 is a diagram for explaining a static feature amount of a representative pattern.
FIG. 4 is a diagram for explaining a dynamic feature amount of a representative pattern.
FIG. 5 is a diagram showing a process of deforming a representative pattern.
FIG. 6 is a diagram showing an example of a pattern obtained as a result of deforming a representative pattern.
FIG. 7 is a diagram showing another configuration example of the pitch pattern generation unit.
FIG. 8 is a diagram showing a process of obtaining an offset value by maximum likelihood estimation.
FIG. 9 is a diagram illustrating a configuration example of a conventional pitch pattern generation unit.
FIG. 10 is a diagram showing a representative pattern.
FIG. 11 is a diagram showing an example of language information.
FIG. 12 is a diagram showing a process of deforming a representative pattern.
FIG. 13 is a diagram showing an example of a generated sentence pitch pattern.
FIG. 14 is a diagram for explaining a problem when a representative pattern is deformed by a conventional technique.
FIG. 15 is a diagram for explaining a problem when a representative pattern is deformed by a conventional technique.
FIG. 16 is a flowchart for explaining the operation of the speech synthesis system of FIG. 1.
[Explanation of symbols]
10 ... Representative pattern selection unit
11 ... Representative pattern expansion / contraction unit
12 ... Offset estimation unit
13 ... Offset control unit
14 ... Maximum likelihood estimation unit
15 ... Pattern connection unit
16 ... Representative pattern storage unit
19 ... Offset value maximum likelihood estimation unit
20 ... Language processing unit
21 ... Prosody generation unit
22 ... Speech signal generation unit
23 ... Phoneme duration generation unit
24 ... Pitch pattern generation unit

Claims

A fundamental frequency pattern generation method for generating, based on language information obtained by analyzing a text, a fundamental frequency pattern that is one of the prosodic features of speech corresponding to the text and that represents a temporal change in the fundamental frequency, the method comprising:
storing in storage means a plurality of representative patterns for each prosodic control unit, each representative pattern being calculated from a plurality of fundamental frequency patterns of natural speech for the prosodic control unit, the prosodic control unit being an accent phrase, or a speech unit having a length longer than a word, for controlling the prosodic features of the speech corresponding to the text, wherein each time-series point constituting the representative pattern is expressed by an average value and a variance value or standard deviation value of a static feature that is the fundamental frequency at the time-series point, and by an average value and a variance value or standard deviation value of a dynamic feature that is the amount of change in fundamental frequency between the time-series point and a time-series point in its vicinity; and
selecting, based on the language information, a representative pattern corresponding to the text from the plurality of representative patterns stored in the storage means, and generating a fundamental frequency pattern of the speech corresponding to the text by changing the value of the fundamental frequency at each time-series point of the selected representative pattern based on a likelihood calculated from the average values and the variance values or standard deviation values of the static feature and the dynamic feature at each time-series point of the selected representative pattern.

The fundamental frequency pattern generation method according to claim 1, wherein the fundamental frequency pattern is generated by obtaining the fundamental frequency value at each time-series point that maximizes the likelihood calculated from the average values and the variance values or standard deviation values of the static feature and the dynamic feature at each time-series point of the representative pattern selected for each prosodic control unit.

A fundamental frequency pattern generation method for generating, based on language information obtained by analyzing a text, a fundamental frequency pattern that is one of the prosodic features of speech corresponding to the text and that represents a temporal change in the fundamental frequency, the method comprising:
storing in storage means a plurality of representative patterns for each prosodic control unit, each representative pattern being calculated from a plurality of fundamental frequency patterns of natural speech for the prosodic control unit, the prosodic control unit being an accent phrase, or a speech unit having a length longer than a word, for controlling the prosodic features of the speech corresponding to the text, wherein each time-series point constituting the representative pattern is expressed by an average value and a variance value or standard deviation value of a static feature that is the fundamental frequency at the time-series point, and by an average value and a variance value or standard deviation value of a dynamic feature that is the amount of change in fundamental frequency between the time-series point and a time-series point in its vicinity; and
selecting a representative pattern corresponding to the text from the plurality of representative patterns stored in the storage means, and generating a fundamental frequency pattern of the speech corresponding to the text by changing the value of the fundamental frequency at each time-series point of the selected representative pattern based on a likelihood calculated from the average values and the variance values or standard deviation values of the static feature and the dynamic feature at each time-series point, and on an offset value that represents the height of the selected representative pattern and is estimated based on the language information.

The fundamental frequency pattern generation method according to claim 3, wherein the estimated offset value is expressed by an average value and a variance value or standard deviation value of a static feature that is the offset value of the fundamental frequency pattern for each prosodic control unit, and by an average value and a variance value or standard deviation value of a dynamic feature that is the amount of change between the offset value of the prosodic control unit and a neighboring offset value, each calculated from the offset values of the plurality of fundamental frequency patterns for each prosodic control unit, and
wherein the estimated offset value is changed based on a likelihood calculated from the average values and the variance values or standard deviation values of the static feature and the dynamic feature.

The fundamental frequency pattern generation method according to claim 3, wherein the fundamental frequency pattern is generated by changing the value at each time-series point of the selected representative pattern based on the offset value that represents the height of the selected representative pattern and is estimated based on the language information, and then obtaining the fundamental frequency value at each time-series point that maximizes the likelihood calculated from the average values and the variance values or standard deviation values of the static feature and the dynamic feature at each time-series point.

The fundamental frequency pattern generation method according to claim 3, wherein the fundamental frequency value at each time-series point that maximizes the likelihood calculated from the average values and the variance values or standard deviation values of the static feature and the dynamic feature at each time-series point of the selected representative pattern is obtained, and the value at each time-series point that maximizes the likelihood is then changed based on the offset value that represents the height of the selected representative pattern and is estimated based on the language information.

The fundamental frequency pattern generation method according to claim 1 or 3, wherein the static feature is the fundamental frequency on a logarithmic or linear scale.

The fundamental frequency pattern generation method according to claim 1 or 3, wherein the dynamic feature is one of a difference in fundamental frequency between time-series points, a regression coefficient, and a polynomial expansion coefficient.

The fundamental frequency pattern generation method according to claim 3, wherein the fundamental frequency value at each time-series point of each selected representative pattern is changed for a pattern obtained by connecting a plurality of selected representative patterns.

A fundamental frequency pattern generation device for generating, based on language information obtained by analyzing a text, a fundamental frequency pattern that is one of the prosodic features of speech corresponding to the text and that represents a temporal change in the fundamental frequency, the device comprising:
storage means for storing a plurality of representative patterns for each prosodic control unit, each representative pattern being calculated from a plurality of fundamental frequency patterns of natural speech for the prosodic control unit, the prosodic control unit being an accent phrase, or a speech unit having a length longer than a word, for controlling the prosodic features of the speech corresponding to the text, wherein each time-series point constituting the representative pattern is expressed by an average value and a variance value or standard deviation value of a static feature that is the fundamental frequency at the time-series point, and by an average value and a variance value or standard deviation value of a dynamic feature that is the amount of change in fundamental frequency between the time-series point and a neighboring time-series point;
means for selecting, based on the language information, a representative pattern corresponding to the text from the plurality of representative patterns stored in the storage means; and
generating means for generating a fundamental frequency pattern of the speech corresponding to the text by changing the value of the fundamental frequency at each time-series point of the selected representative pattern based on a likelihood calculated from the average values and the variance values or standard deviation values of the static feature and the dynamic feature at each time-series point of the selected representative pattern.

The fundamental frequency pattern generation device according to claim 10, wherein the generating means generates the fundamental frequency pattern by obtaining the fundamental frequency value at each time-series point that maximizes the likelihood calculated from the average values and the variance values or standard deviation values of the static feature and the dynamic feature at each time-series point of the representative pattern selected for each prosodic control unit.

A fundamental frequency pattern generation device that generates, based on linguistic information obtained by analyzing text, a fundamental frequency pattern representing the temporal change in the fundamental frequency, which is one of the prosodic features of speech corresponding to the text, the device comprising:
Storage means for storing, for each prosodic control unit, a plurality of representative patterns, wherein the prosodic control unit is an accent phrase or a speech unit having a length longer than a word and is used to control the prosodic features of the speech corresponding to the text, each representative pattern is calculated from a plurality of fundamental frequency patterns of natural speech in the prosodic control unit, and each time series point constituting the representative pattern is expressed by the average value and the variance value or standard deviation value of a static feature, which is the fundamental frequency at that time series point, and by the average value and the variance value or standard deviation value of a dynamic feature, which is the amount of change in the fundamental frequency between that time series point and an adjacent time series point;
Means for selecting, based on the linguistic information, a representative pattern corresponding to the text from the plurality of representative patterns stored in the storage means; and
Generating means for generating a fundamental frequency pattern of the speech corresponding to the text by changing the fundamental frequency value at each time series point of the selected representative pattern, based on a likelihood calculated from the average value and the variance value or standard deviation value of the static feature and the dynamic feature at each time series point of the selected representative pattern, and on an offset value that represents the height of the selected representative pattern and is estimated based on the linguistic information.
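
One possible way to organize the stored data recited in this claim is sketched below in Python. Everything here is an assumption made for illustration: the class names `RepresentativePattern` and `PatternStore`, the lookup key (shown as a hypothetical tuple of mora count and accent type), the `pattern_index` argument standing in for the outcome of a selection rule, and the choice to apply the offset by adding it to the static means on a log-frequency scale.

```python
from dataclasses import dataclass, replace
import numpy as np

@dataclass(frozen=True)
class RepresentativePattern:
    """Statistics of one representative pattern (hypothetical layout)."""
    static_mean: np.ndarray  # mean F0 at each time series point
    static_var: np.ndarray   # variance of F0 at each time series point
    delta_mean: np.ndarray   # mean change in F0 between adjacent points
    delta_var: np.ndarray    # variance of that change

class PatternStore:
    """Holds several representative patterns per prosodic control unit key."""

    def __init__(self):
        self._by_unit = {}

    def add(self, unit_key, pattern):
        self._by_unit.setdefault(unit_key, []).append(pattern)

    def select(self, unit_key, pattern_index):
        # pattern_index stands in for the result of a selection rule
        # driven by the linguistic information; the rule is not modeled here.
        return self._by_unit[unit_key][pattern_index]

def apply_offset(pattern, offset):
    """Shift the height of a pattern (assumed additive on a log scale).

    Only the static means move; the changes between adjacent points
    are unaffected by a constant shift.
    """
    return replace(pattern, static_mean=pattern.static_mean + offset)
```

A shifted pattern produced by `apply_offset` keeps its dynamic statistics, so a likelihood-based generation step can run on it unchanged.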

The fundamental frequency pattern generation device according to claim 12, wherein the estimated offset value is calculated from the offset values of the plurality of fundamental frequency patterns for each of the prosodic control units and is expressed by the average value and the variance value or standard deviation value of a static feature, which is the offset value of the fundamental frequency pattern for each of the prosodic control units, and by the average value and the variance value or standard deviation value of a dynamic feature, which is the amount of change between the offset value of the prosodic control unit and a neighboring offset value, and
The generating means changes the estimated offset value based on a likelihood calculated from the average value and the variance value or standard deviation value of the static feature and the dynamic feature.
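
The offset adjustment in this claim can be sketched in the same spirit, this time over a sequence of prosodic control units rather than time series points. The Python code below is a hedged illustration, not the patented procedure. It assumes Gaussian terms and a dynamic feature equal to the difference between the offsets of consecutive units. The function names `offset_log_likelihood` and `refine_offsets` are hypothetical.

```python
import numpy as np

def offset_log_likelihood(offsets, static_mean, static_var, delta_mean, delta_var):
    """Log-likelihood of an offset sequence, up to an additive constant."""
    offsets = np.asarray(offsets, dtype=float)
    static_term = (offsets - np.asarray(static_mean)) ** 2 / np.asarray(static_var)
    delta_term = (np.diff(offsets) - np.asarray(delta_mean)) ** 2 / np.asarray(delta_var)
    return -0.5 * (static_term.sum() + delta_term.sum())

def refine_offsets(static_mean, static_var, delta_mean, delta_var):
    """Return the offsets, one per prosodic control unit, that maximize the likelihood.

    Assumption: the dynamic feature of unit i is offset[i+1] - offset[i],
    so the delta arrays have one entry fewer than the static arrays.
    """
    static_mean = np.asarray(static_mean, dtype=float)
    static_var = np.asarray(static_var, dtype=float)
    delta_mean = np.asarray(delta_mean, dtype=float)
    delta_var = np.asarray(delta_var, dtype=float)
    n = len(static_mean)

    # Normal equations A x = b obtained by zeroing the gradient.
    A = np.diag(1.0 / static_var)
    b = static_mean / static_var
    for i in range(n - 1):
        w = 1.0 / delta_var[i]
        A[i, i] += w
        A[i + 1, i + 1] += w
        A[i, i + 1] -= w
        A[i + 1, i] -= w
        b[i] -= w * delta_mean[i]
        b[i + 1] += w * delta_mean[i]
    return np.linalg.solve(A, b)
```

Starting from the static means as the initial estimate, the refined offsets never have a lower likelihood, and a small dynamic variance pulls the offsets of neighboring units toward the expected change between them.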

A speech synthesizer that generates, based on linguistic information obtained by analyzing text, prosodic information representing the prosodic features of speech corresponding to the text and including at least a fundamental frequency pattern representing the temporal change in the fundamental frequency of the speech, and that synthesizes speech corresponding to the text based on at least the prosodic information, the speech synthesizer comprising:
Storage means for storing, for each prosodic control unit, a plurality of representative patterns, wherein the prosodic control unit is an accent phrase or a speech unit having a length longer than a word and is used to control the prosodic features of the speech corresponding to the text, each representative pattern is calculated from a plurality of fundamental frequency patterns of natural speech in the prosodic control unit, and each time series point constituting the representative pattern is expressed by the average value and the variance value or standard deviation value of a static feature, which is the fundamental frequency at that time series point, and by the average value and the variance value or standard deviation value of a dynamic feature, which is the amount of change in the fundamental frequency between that time series point and an adjacent time series point;
Means for selecting, based on the linguistic information, a representative pattern corresponding to the text from the plurality of representative patterns stored in the storage means;
Generating means for generating a fundamental frequency pattern of the speech corresponding to the text by changing the fundamental frequency value at each time series point of the selected representative pattern, based on a likelihood calculated from the average value and the variance value or standard deviation value of the static feature and the dynamic feature at each time series point of the selected representative pattern; and
Speech synthesis means for synthesizing speech corresponding to the text based on the fundamental frequency pattern generated by the generating means.

A speech synthesizer that generates, based on linguistic information obtained by analyzing text, prosodic information representing the prosodic features of speech corresponding to the text and including at least a fundamental frequency pattern representing the temporal change in the fundamental frequency of the speech, and that synthesizes speech corresponding to the text based on at least the prosodic information, the speech synthesizer comprising:
Storage means for storing, for each prosodic control unit, a plurality of representative patterns, wherein the prosodic control unit is an accent phrase or a speech unit having a length longer than a word and is used to control the prosodic features of the speech corresponding to the text, each representative pattern is calculated from a plurality of fundamental frequency patterns of natural speech for the prosodic control unit, and each time series point constituting the representative pattern is expressed by the average value and the variance value or standard deviation value of a static feature, which is the fundamental frequency at that time series point, and by the average value and the variance value or standard deviation value of a dynamic feature, which is the amount of change in the fundamental frequency between that time series point and a neighboring time series point;
Means for selecting, based on the linguistic information, a representative pattern corresponding to the text from the plurality of representative patterns stored in the storage means;
Generating means for generating a fundamental frequency pattern of the speech corresponding to the text by changing the fundamental frequency value at each time series point of the selected representative pattern, based on a likelihood calculated from the average value and the variance value or standard deviation value of the static feature and the dynamic feature at each time series point of the selected representative pattern, and on an offset value that represents the height of the selected representative pattern and is estimated based on the linguistic information; and
Speech synthesis means for synthesizing speech corresponding to the text based on the fundamental frequency pattern generated by the generating means.

The speech synthesizer according to claim 15, wherein the estimated offset value is calculated from the offset values of the plurality of fundamental frequency patterns for each of the prosodic control units and is expressed by the average value and the variance value or standard deviation value of a static feature, which is the offset value of the fundamental frequency pattern for each of the prosodic control units, and by the average value and the variance value or standard deviation value of a dynamic feature, which is the amount of change between the offset value of the prosodic control unit and a neighboring offset value, and
The estimated offset value is changed based on a likelihood calculated from the average value and the variance value or standard deviation value of the static feature and the dynamic feature.

A fundamental frequency pattern generation program for generating, based on linguistic information obtained by analyzing text, a fundamental frequency pattern representing the temporal change in the fundamental frequency, which is one of the prosodic features of speech corresponding to the text, the program causing a computer to execute the steps below,
The computer comprising storage means for storing, for each prosodic control unit, a plurality of representative patterns, wherein the prosodic control unit is an accent phrase or a speech unit having a length longer than a word and is used to control the prosodic features of the speech corresponding to the text, each representative pattern is calculated from a plurality of fundamental frequency patterns of natural speech for the prosodic control unit, and each time series point constituting the representative pattern is expressed by the average value and the variance value or standard deviation value of a static feature, which is the fundamental frequency at that time series point, and by the average value and the variance value or standard deviation value of a dynamic feature, which is the amount of change in the fundamental frequency between that time series point and a neighboring time series point:
A step of selecting, based on the linguistic information, a representative pattern corresponding to the text from the plurality of representative patterns stored in the storage means; and
A step of generating a fundamental frequency pattern of the speech corresponding to the text by changing the fundamental frequency value at each time series point of the selected representative pattern, based on a likelihood calculated from the average value and the variance value or standard deviation value of the static feature and the dynamic feature at each time series point of the selected representative pattern, and on an offset value that represents the height of the selected representative pattern and is estimated based on the linguistic information.

A speech synthesis program for generating, based on linguistic information obtained by analyzing text, prosodic information representing the prosodic features of speech corresponding to the text and including at least a fundamental frequency pattern representing the temporal change in the fundamental frequency of the speech, and for synthesizing speech corresponding to the text based on at least the prosodic information, the program causing a computer to execute the steps below,
The computer comprising storage means for storing, for each prosodic control unit, a plurality of representative patterns, wherein the prosodic control unit is an accent phrase or a speech unit having a length longer than a word and is used to control the prosodic features of the speech corresponding to the text, each representative pattern is calculated from a plurality of fundamental frequency patterns of natural speech for the prosodic control unit, and each time series point constituting the representative pattern is expressed by the average value and the variance value or standard deviation value of a static feature, which is the fundamental frequency at that time series point, and by the average value and the variance value or standard deviation value of a dynamic feature, which is the amount of change in the fundamental frequency between that time series point and a neighboring time series point:
A step of selecting, based on the linguistic information, a representative pattern corresponding to the text from the plurality of representative patterns stored in the storage means;
A step of generating a fundamental frequency pattern of the speech corresponding to the text by changing the fundamental frequency value at each time series point of the selected representative pattern, based on a likelihood calculated from the average value and the variance value or standard deviation value of the static feature and the dynamic feature at each time series point of the selected representative pattern, and on an offset value that represents the height of the selected representative pattern and is estimated based on the linguistic information; and
A step of synthesizing speech corresponding to the text based on the generated fundamental frequency pattern.