JP3966074B2 - Pitch conversion device, pitch conversion method and program

Info

Publication number: JP3966074B2
Application number: JP2002152787A
Authority: JP
Inventors: 裕司久湊
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2002-05-27
Filing date: 2002-05-27
Publication date: 2007-08-29
Anticipated expiration: 2022-05-27
Also published as: JP2003345400A

Description

【０００１】
【発明の属する技術分野】
この発明は、歌唱合成に用いるに好適なピッチ変換装置、ピッチ変換方法及びプログラムに関するものである。
【０００２】
【従来の技術】
従来、音声合成装置としては、合成音声のピッチにゆらぎを付与するようにしたものが知られている（例えば、特開平９−２８１９９４号公報参照）。
【０００３】
この従来技術では、ディジタル音声波形データをアナログ音声信号に変換するＤ／Ａ変換器に供給するクロック信号として、格納部から読出したクロック間隔ゆらぎデータに応じてクロック周期にゆらぎをもたせたクロック信号を用いることによりＤ／Ａ変換出力（アナログ音声信号）のピッチにゆらぎを付与している。
【０００４】
【発明が解決しようとする課題】
人間がある音符に対応する音声を発生するとき、物理的に一定の高さ（ピッチ）で発生するのは歌唱を職業とする人でも困難であり、一般的に発声ピッチは音符ピッチから多少ずれ、加えて経時的なピッチ変動も生ずる。特に、歌唱を職業としない一般の人が歌唱した場合には、上記のようなピッチずれやピッチ変動の傾向が強く、歌唱の上手さ（又は下手さ）を評価するための１つの要素となる。また、ピッチのずれ方に歌唱者の特徴が見られる場合もある。その上、人が発声できる上限又は下限に近いピッチの音を発生しようとすると、声の発生機構に物理的な負担がかかるため、発生したいピッチと、実際に発声したピッチとが異なる（上限近くの高音ではピッチが下がり易く、下限近くの低音ではピッチが上がり易い）という現象がある。
【０００５】
上記した従来技術によれば、クロック間隔ゆらぎデータの値をピッチ上昇方向又はピッチ下降方向に変化させることによりピッチ変動の方向及び量を変化させることができるが、平均ピッチで見た場合にピッチ変動を加える前のピッチ（基準ピッチ）を変化させることはできず、ピッチ変動の時間的なパターンを変化させることもできない。換言すれば、上記したような歌唱者の発声ピッチや経時的なピッチ変動を再現することはできない。
【０００６】
この発明の目的は、歌唱合成の際に歌唱者の発声ピッチや経時的なピッチ変動を再現することができる新規なピッチ変換装置、ピッチ変換方法及びプログラムを提供することにある。
【０００７】
【課題を解決するための手段】
この発明に係る第１のピッチ変換装置は、ピッチデータの示すピッチを有する歌唱音声信号を合成する歌唱合成手段を備えた歌唱合成装置において使用されるピッチ変換装置であって、合成すべき順次の歌唱音声にそれぞれ対応して順次にピッチを入力する入力手段と、複数の入力ピッチをそれぞれ複数の音声ピッチに変換するためのピッチ変換関数であって、入力ピッチが所定の下限ピッチよりも低い場合には入力ピッチより高くなるように、入力ピッチが所定の上限ピッチよりも高い場合には入力ピッチより低くなるように、入力ピッチが所定の下限ピッチと所定の上限ピッチとの間である場合には入力ピッチと等しくなるように変換するピッチ変換関数を記憶する記憶手段と、前記入力手段から入力されるピッチ毎に該ピッチを前記ピッチ変換関数に基づいて音声ピッチに変換し、該音声ピッチを示すデータを前記ピッチデータとして前記歌唱合成手段に供給する変換手段とを備えたものである。
【０００８】
第１のピッチ変換装置によれば、複数の入力ピッチをそれぞれ複数の音声ピッチに変換するためのピッチ変換関数が記憶手段に記憶され、このピッチ変換関数に基づいて入力に係る各ピッチが歌唱音声合成用の音声ピッチに変換される。ピッチ変換関数において、複数の音声ピッチとして歌唱者の複数の発声ピッチをそれぞれ用いると、合成歌唱音声において歌唱者の発声ピッチやピッチ特徴を再現することができ、例えば発声可能な上限ピッチの近くではピッチを若干低くすると共に発声可能な下限ピッチの近くではピッチを若干高くすることができる。
【０００９】
第１のピッチ変換装置において、前記入力手段は、歌唱者を示す歌唱者データを入力し、前記記憶手段は、前記ピッチ変換関数を歌唱者毎に記憶し、前記変換手段は、前記歌唱者データの示す歌唱者に対応するピッチ変換関数に基づいてピッチ変換を行なうようにしてもよい。このようにすると、歌唱者毎に発声ピッチやピッチ特徴を再現することができる。
【００１０】
第１のピッチ変換装置においては、ピッチ変換の際に入力ピッチに依存する乱数的な（ランダムな）ピッチ変動を音声ピッチに付与するようにしてもよい。このようにすると、合成歌唱音声に一層自然なピッチ変化を付与することができる。また、ピッチ変換の際に歌唱者の実際の音声に含まれる経時的なピッチ変動を音声ピッチに付与するようにしてもよい。このようにすると、歌唱者の経時的に不安定なピッチ変動を再現することができる。
【００１４】
この発明に係る第１のピッチ変換方法は、複数の入力ピッチをそれぞれ複数の音声ピッチに変換するためのピッチ変換関数であって、入力ピッチが所定の下限ピッチよりも低い場合には入力ピッチより高くなるように、入力ピッチが所定の上限ピッチよりも高い場合には入力ピッチより低くなるように、入力ピッチが所定の下限ピッチと所定の上限ピッチとの間である場合には入力ピッチと等しくなるように変換するピッチ変換関数を記憶する記憶手段と、ピッチデータの示すピッチを有する歌唱音声信号を合成する歌唱合成手段とを備えた歌唱合成装置において使用されるピッチ変換方法であって、合成すべき順次の歌唱音声にそれぞれ対応して順次にピッチを入力するステップと、このステップで入力されるピッチ毎に該ピッチを前記ピッチ変換関数に基づいて音声ピッチに変換し、該音声ピッチを示すデータを前記ピッチデータとして前記歌唱合成手段に供給するステップとを含むものである。
【００１５】
第１のピッチ変換方法によれば、第１のピッチ変換装置に関して前述したと同様にピッチ変換を行なうことができる。
【００１８】
この発明に係る第１のプログラムは、コンピュータと、ピッチデータの示すピッチを有する歌唱音声信号を合成する歌唱合成手段とを備えた歌唱合成装置において使用されるプログラムであって、前記コンピュータを、合成すべき順次の歌唱音声にそれぞれ対応して順次にピッチを入力する入力手段と、複数の入力ピッチをそれぞれ複数の音声ピッチに変換するためのピッチ変換関数であって、入力ピッチが所定の下限ピッチよりも低い場合には入力ピッチより高くなるように、入力ピッチが所定の上限ピッチよりも高い場合には入力ピッチより低くなるように、入力ピッチが所定の下限ピッチと所定の上限ピッチとの間である場合には入力ピッチと等しくなるように変換するピッチ変換関数を記憶する記憶手段と、前記入力手段から入力されるピッチ毎に該ピッチを前記ピッチ変換関数に基づいて音声ピッチに変換し、該音声ピッチを示すデータを前記ピッチデータとして前記歌唱合成手段に供給する変換手段として機能させるものである。
【００１９】
この発明に係る第２のプログラムは、コンピュータを備えた歌唱合成装置において使用されるプログラムであって、前記コンピュータを、合成すべき順次の歌唱音声にそれぞれ対応して順次にピッチを入力する入力手段と、複数の入力ピッチをそれぞれ複数の音声ピッチに変換するためのピッチ変換関数であって、入力ピッチが所定の下限ピッチよりも低い場合には入力ピッチより高くなるように、入力ピッチが所定の上限ピッチよりも高い場合には入力ピッチより低くなるように、入力ピッチが所定の下限ピッチと所定の上限ピッチとの間である場合には入力ピッチと等しくなるように変換するピッチ変換関数を記憶する記憶手段と、前記入力手段から入力されるピッチ毎に該ピッチを前記ピッチ変換関数に基づいて音声ピッチに変換し、該音声ピッチを示すピッチデータを送出する変換手段と、この変換手段から送出されるピッチデータの示す音声ピッチを有する歌唱音声信号を合成する歌唱合成手段として機能させるものである。
【００２０】
第１又は第２のプログラムによれば、第１のピッチ変換装置に関して前述したと同様にピッチ変換を行なうことができる。
【００２１】
この発明に係る第３のプログラムは、コンピュータと、ピッチデータの示すピッチを有する歌唱音声信号を合成する歌唱合成手段とを備えた歌唱合成装置において使用されるプログラムであって、前記コンピュータを、
合成すべき順次の歌唱音声にそれぞれ対応して順次にピッチを入力する入力手段と、
複数の入力ピッチのうちの各入力ピッチ毎に該入力ピッチに対する音声ピッチの経時的変動分を示すピッチ差分データを記憶する記憶手段と、
前記入力手段から入力されるピッチ毎に該ピッチに対応するピッチ差分データを前記記憶手段から読出すと共に入力に係るピッチに対して読出しに係るピッチ差分データの示す音声ピッチの経時的変動分を加算してピッチ変換を行ない、このピッチ変換後のピッチを示すデータを前記ピッチデータとして前記歌唱合成手段に供給する変換手段として機能させるものである。
【００２２】
この発明に係る第４のプログラムは、コンピュータを備えた歌唱合成装置において使用されるプログラムであって、前記コンピュータを、
合成すべき順次の歌唱音声にそれぞれ対応して順次にピッチを入力する入力手段と、
複数の入力ピッチのうちの各入力ピッチ毎に該入力ピッチに対する音声ピッチの経時的変動分を示すピッチ差分データを記憶する記憶手段と、
前記入力手段から入力されるピッチ毎に該ピッチに対応するピッチ差分データを前記記憶手段から読出すと共に入力に係るピッチに対して読出しに係るピッチ差分データの示す音声ピッチの経時的変動分を加算してピッチ変換を行ない、このピッチ変換後のピッチを示すピッチデータを送出する変換手段と、
この変換手段から送出されるピッチデータの示す音声ピッチを有する歌唱音声信号を合成する歌唱合成手段と
して機能させるものである。
【００２３】
第３又は第４のプログラムによれば、第２のピッチ変換装置に関して前述したと同様にピッチ変換を行なうことができる。
【００２４】
【発明の実施の形態】
図１は、この発明の一実施形態に係る歌唱合成装置を示すものである。
【００２５】
図１の歌唱合成装置は、入力部１０、ピッチ変換装置１２及び歌唱合成器１８を含むもので、ピッチ変換装置１２は、ピッチ変換器１４及びデータベース１６を備えている。
【００２６】
入力部１０は、歌唱者を示す歌唱者データ、音声素片（単一の音素［音韻］又は音素連鎖）を示す音声素片データ、音符のピッチ及び長さを示す音符データ、合成音声の音強度を示す音強度データ等を入力するもので、入力に係る音符ピッチＰｉを示す音符ピッチデータ及び入力に係る歌唱者Ｓを示す歌唱者データは、ピッチ変換器１４に供給される。
【００２７】
データベース１６には、複数の入力ピッチ（音符ピッチ）をそれぞれ複数の音声ピッチ（出力ピッチ）に変換するためのピッチ変換データがピッチ変換関数［ＦＴ（Ｓ，ｐ）］又はピッチ変換表の形で歌唱者毎に記憶されている。
【００２８】
図２には、歌唱者Ｓ１，Ｓ２，Ｓ３にそれぞれ対応する３つのピッチ変換関数ＦＴ（Ｓ１，ｐ），ＦＴ（Ｓ２，ｐ），ＦＴ（Ｓ３，ｐ）をデータベース１６に記憶した例を示す。ここで、ｐは、入力ピッチを表わす。
【００２９】
図２に示すピッチ変換装置１２において、ピッチ変換器１４は、入力部１０からの歌唱者データの示す歌唱者Ｓに対応するピッチ変換関数をデータベース１６にて参照すると共に、入力部１０からの音符ピッチデータの示す音符ピッチＰｉに対応する音声ピッチＰｏを参照に係るピッチ変換関数に基づいて算出する。そして、算出された音声ピッチＰｏを示すピッチデータを歌唱合成器１８に出力する。
【００３０】
データベース１６がピッチ変換データをピッチ変換表の形で記憶している場合、ピッチ変換器１４は、入力部１０からの歌唱者Ｓに対応するピッチ変換表を参照すると共に、入力部１０からの音符ピッチデータの示す音符ピッチＰｉに対応する音声ピッチＰｏを参照に係るピッチ変換表から読出す。そして、読出された音声ピッチＰｏを示すピッチデータを歌唱合成器１８に供給する。
【００３１】
歌唱合成器１８は、入力部１０からの歌唱者データ、音声素片データ、音符長データ及び音強度データと、ピッチ変換器１４からのピッチデータとに基づいて歌唱音声信号を合成するものである。歌唱合成方式としては、種々のものが公知であり、そのうちから適切なものを選択して歌唱合成器１８を構成することができる。
【００３２】
歌唱合成器１８では、一例として、歌唱者データの示す歌唱者と、音声素片データの示す音声素片とに対応した音声成分データを用いて歌唱音声信号を合成する。このとき、歌唱音声信号のピッチ、音長及び音強度は、それぞれピッチデータ、音符長データ及び音強度データに応じて決定される。
【００３３】
図３は、ピッチ変換関数の一例を示すものである。図３において、横軸の入力ピッチ［cent］は、ピッチ変換器１４に入力される音符ピッチに相当し、縦軸の出力ピッチ［cent］は、ピッチ変換器１４から出力される音声ピッチに相当する。
【００３４】
図３に示すピッチ変換関数ＦＴ（Ｓ，ｐ）は、所定の下限ピッチＰＬと所定の上限ピッチＰＨとの間では出力ピッチが入力ピッチと等しいが、入力ピッチが上限ピッチＰＨより高いときは人の発声可能な上限ピッチに近づくにつれて徐々に出力ピッチが入力ピッチより低くなると共に、入力ピッチが下限ピッチＰＬより低いときは人の発声可能な下限ピッチに近づくにつれて徐々に出力ピッチが入力ピッチより高くなるような形状になっている。このような形状を数式的に表現すると、次の数１に示す通りである。
【００３５】
【数１】
ＦＴ（Ｓ，ｐ）＞ｐ if ｐ＜ＰＬ
ＦＴ（Ｓ，ｐ）＝ｐ if ＰＬ≦ｐ≦ＰＨ
ＦＴ（Ｓ，ｐ）＜ｐ if ＰＨ＜ｐ
具体例としては、ＰＨ＜ｐの領域では出力ピッチが入力ピッチより最大で数１０セント程度低くなると共にｐ＜ＰＬの領域では出力ピッチが入力ピッチより最大で数１０セント程度高くなるようなピッチ変換関数を用いることができる。
【００３６】
図３に示したようなピッチ変換関数は、歌唱者毎に適切な形状のものが用意され、図２に関して前述したように歌唱者毎にデータベース１６に記憶される。ピッチ変換器１４は、入力に係る歌唱者Ｓに対応するピッチ変換関数を参照して入力ピッチＰｉを出力ピッチＰｏに変換する。このようなピッチ変換を数式的に表現すると、次の数２の通りである。
【００３７】
【数２】
Ｐｏ＝ＦＴ（Ｓ，Ｐｉ）
図４は、図３のピッチ変換関数を用いたピッチ変換の一例を示すもので、（Ａ）は、変換前のピッチ変化（入力ピッチの変化）を示し、（Ｂ）は、変換後のピッチ変化（出力ピッチの変化）を示す。図４（Ａ）において、順次の入力ピッチは、合成すべき順次の歌唱音声にそれぞれ対応するものである。図４によれば、ＰＬより低音域では、出力ピッチが入力ピッチより高くなると共にＰＨより高音域では出力ピッチが入力ピッチより低くなり、ＰＬ以上でＰＨ以下の中音域では出力ピッチが入力ピッチに等しくなっているのがわかる。図４の例では、入力ピッチを離散的に与えたが、そうである必要はなく、連続的に与えても構わない。
【００３８】
図３に示したピッチ変換関数は、直線に近似したものであるが、歌唱者やピッチに依存する乱数的な（ランダムな）ピッチ変動分ｒａｎｄ（Ｓ，ｐ）を加えた次の数３の式に示すようなピッチ変換関数を用いてもよい。
【００３９】
【数３】
ＦＴ（Ｓ，ｐ）＋ｒａｎｄ（Ｓ，ｐ）
このようなピッチ変換関数を用いると、ピッチ変換の際に図４（Ａ）に示すような順次の入力ピッチにそれぞれ応答して順次の出力ピッチにランダムなピッチ変化が加わるようになり、合成音声に一層自然な変化を付与することができる。
【００４０】
上記した実施形態において、データベース１６には、時間に依存しないピッチ変換関数ＦＴ（Ｓ，ｐ）を記憶する例を示したが、データベース１６には、時間に依存するピッチ変換関数を記憶し、このピッチ変換関数を参照してピッチ変換を行なうようにしてもよい。一例としてデータベース１６には、ピッチ差分ΔＦＴ（Ｓ，ｐ，ｔ）を示すピッチ差分データをピッチ変換データとして歌唱者毎に記憶する。ピッチ差分ΔＦＴ（Ｓ，ｐ，ｔ）は、歌唱者Ｓが音符ピッチｐに対応する音声を発生したときに時間ｔの進行に従って音符ピッチｐに対する音声ピッチの差分を表わすものである。
【００４１】
データベース１６にピッチ差分データをピッチ変換関数ΔＦＴ（Ｓ，ｐ，ｔ）の形で歌唱者毎に記憶しておいた場合、ピッチ変換器１４は、入力に係る歌唱者Ｓに対応するピッチ変換関数ΔＦＴ（Ｓ，ｐ，ｔ）を参照して入力ピッチＰｉを出力ピッチＰｏに変換する。このようなピッチ変換を数式的に表現すると、次の数４に示す通りである。
【００４２】
【数４】
Ｐｏ＝Ｐｉ＋ΔＦＴ（Ｓ，Ｐｉ，ｔ）
この場合のピッチ変換は、入力ピッチＰｉに対して入力ピッチＰｉ対応のピッチ差分ΔＦＴ（Ｓ，Ｐｉ，ｔ）を加算することにより行なわれる。
【００４３】
データベース１６には、上記のようにピッチ変換関数ΔＦＴ（Ｓ，ｐ，ｔ）を記憶する代りに、ピッチ差分ΔＦＴ（Ｓ，ｐ，ｔ）の経時的な変化波形を表わすピッチ差分データを記憶するようにしてもよい。図５は、このようなピッチ差分データを歌唱者Ｓ１…Ｓｎ（ｎは２以上の整数）のうちの各歌唱者毎にｐ１〜ｐ２５の２５ピッチ分記憶した例を示す。ピッチｐ１〜ｐ２５は、１００セント（半音）刻みで１２００〜３６００［ｃｅｎｔ］となっている。データベース１６にピッチ差分データを記憶すると、後述のピッチ波形データを記憶する場合に比べてデータ量が少なくて済む。
【００４４】
図５の例において、各ピッチ毎のピッチ差分データとしては、実際の歌唱に基づくものを用いるとよい。一例を示すと、歌唱者Ｓ１にピッチｐ１に対応する音声を実際に発生させると共に、ピッチｐ１に対する発生音声のピッチの差分の経時的変化波形を求め、この変化波形を表わすピッチ差分データを用いる。このようにすると、歌唱者の特性を反映したピッチ変化を再現可能になると共に、より人間的な微細なピッチ変化を表現可能になる。
【００４５】
ピッチ変換器１４は、入力に係る歌唱者Ｓに対応するピッチ差分データのうち入力ピッチＰｉに対応するピッチ差分データを参照して前述の数４の式に従って入力ピッチＰｉを出力ピッチＰｏに変換する。図６は、図５のピッチ差分データを用いたピッチ変換の一例を示すもので、（Ａ）は、図４（Ａ）と同様に変換前のピッチ変化（入力ピッチの変化）を示し、（Ｂ）は、変換後のピッチ変化（出力ピッチの変化）を示す。図６によれば、人の発声可能な上限ピッチ又は下限ピッチの近傍では図４に関して前述したと同様に出力が入力ピッチよりそれぞれ低く又は高くなると共に、人の発声可能な上限ピッチ又は下限ピッチの近傍ではピッチの変動量（ゆらぎ量）が大きくなることがわかる。従って、人間的な発声ピッチやピッチ変動の再現が可能となる。
【００４６】
ピッチは、離散値ではなく連続値であるので、図５の例においてすべてのピッチに対応可能とするには全ピッチ分のピッチ差分データを記憶することになり、記憶するデータ量が膨大なものになってしまう。また、ピッチ差分ΔＦＴ（Ｓ，ｐ，ｔ）の変化が長く継続するピッチ差分データについても、記憶するデータ量が膨大なものになってしまう。このような記憶データ量の増大を回避するためには、次の（イ）又は（ロ）のような対策を適宜採用することができる。
【００４７】
（イ）複数の離散的なピッチについてそれぞれピッチ差分ΔＦＴ（Ｓ，ｐ，ｔ）を表わすピッチ差分データを記憶した場合において、入力ピッチとピッチが丁度一致するピッチ差分データを検出できないときは、入力ピッチとピッチが最も近いピッチ差分データを参照してピッチ変換を行なう。また、入力ピッチとピッチが近い２つのピッチ差分データから補間により新たなピッチ差分データを求めてピッチ変換を行なってもよい。
【００４８】
（ロ）ピッチ差分データとしては、ピッチ差分の変化継続時間が所定値以内のものを記憶しておき、入力ピッチの時間長がピッチ差分ΔＦＴ（Ｓ，ｐ，ｔ）の変化継続時間を越えたときは、ピッチ差分の変化波形において時間０等の適当な位置に戻って再びピッチ差分データを読出す。
【００４９】
図５の例において、データベース１６には、歌唱者毎に複数のピッチにそれぞれ対応してピッチ差分の変化波形を表わすピッチ差分データを記憶したが、歌唱者毎に複数のピッチにそれぞれ対応してピッチ変化波形を表わすピッチ波形データをピッチ変換データとして記憶するようにしてもよい。この場合、入力に係る歌唱者Ｓに対応する複数のピッチ波形データのうち入力ピッチＰｉに対応するピッチ波形データを読出して出力ピッチＰｏとすることによりピッチ変換を行なう。ピッチ波形データを実際の歌唱に基づいて作成すると、歌唱者の発声ピッチや経時的なピッチ変動を再現することができる。
【００５０】
上記したようなピッチ変換処理は、パーソナルコンピュータ等の小型コンピュータにおいてソフトウェア処理として実行するようにしてもよい。すなわち、ＲＯＭ又はＲＡＭ等の記憶手段にストアしたプログラムに従ってＣＰＵ（中央処理装置）にピッチ変換処理を実行させるようにしてもよい。
【００５１】
図７は、この発明の他の実施形態に係る歌唱合成装置を示すもので、この装置は、例えば特許第２９０６９７０号に示されるＳＭＳ（Spectral Modeling Synthesis）技術を用いて歌唱合成を行なうものである。
【００５２】
ステップＳ１では、歌唱音声信号を入力し、ステップＳ２では、入力された歌唱音声信号にＳＭＳ分析処理及び区間切出し処理を施す。
【００５３】
ＳＭＳ分析処理では、入力音声信号を一連の時間フレームに区分し、各フレーム毎にＦＦＴ（Fast Fourier Transform）等により１組の強度（マグニチュード）スペクトルデータを生成し、各フレーム毎に１組の強度スペクトルデータから複数のピークに対応する線スペクトルを抽出する。これらの線スペクトルの振幅値及び周波数を表わすデータを調和成分（Deterministic Component）のデータと称する。次に入力音声波形のスペクトルから調和成分のスペクトルを差引いて残差スペクトルを得る。この残差スペクトルを非調和成分（Stochastic Component）と称する。
【００５４】
区間切出し処理では、ＳＭＳ分析処理で得られた調和成分のデータ及び非調和成分のデータを音声素片に対応して区分する。音声素片とは、歌詞の構成要素であり、例えば［ａ］，［ｉ］のような単一の音素（又は音韻：Phoneme）又は例えば「ａｉ」，「ａｐ」のような音素連鎖（複数音素の連鎖）からなるものである。
【００５５】
データベース２０には、音声素片毎に調和成分のデータ及び非調和成分のデータが記憶される。また、データベース２０には、データベース１６に関して前述したと同様にピッチ変換データ（ピッチ差分データ又はピッチ波形データである場合も含む）が記憶されている。
【００５６】
歌唱合成に際しては、ステップＳ３で歌詞データ及びメロディデータを入力する。そして、ステップＳ４では、歌詞データが表わす音素列に音素列／音声素片変換処理を施して音素列を音声素片に区分し、音声素片毎にそれに対応する調和成分のデータ及び非調和成分のデータを音声素片データとしてデータベース２０から読出す。
【００５７】
ステップＳ５では、データベース２０から読出された音声素片データ（調和成分のデータ及び非調和成分のデータ）に音声素片接続処理を施して音声素片データ同士を発音順に接続する。
【００５８】
ステップＳ６では、ピッチ変換処理を行なう。すなわち，ステップＳ３で入力されたメロディデータの示す音符ピッチを前述したと同様にしてデータベース２０のピッチ変換データ（ピッチ差分データ又はピッチ波形データである場合も含む）に基づいて音声ピッチに変換し、該音声ピッチを示すピッチデータを生成する。
【００５９】
ステップ７では、音声素片毎に調和成分のデータとステップＳ６で生成されたピッチデータの示す音声ピッチとに基づいて該音声ピッチに適合した新たな調和成分のデータを生成する。このとき、新たな調和成分のデータでは、ステップＳ５の処理を受けた調和成分のデータが表わすスペクトル包絡の形状をそのまま引継ぐようにスペクトル強度を調整すると、ステップＳ１で入力した音声信号の音色を再現することができる。
【００６０】
ステップＳ８では、ステップＳ７で生成した調和成分のデータとステップＳ５の処理を受けた非調和成分のデータとを音声素片毎に加算する。そして、ステップＳ９では、ステップＳ８で加算処理を受けたデータを音声素片毎に逆ＦＦＴ等により時間領域の歌唱音声信号に変換する。ステップ９の処理の結果として得られる歌唱音声信号は、ディジタル形式の信号であり、Ｄ／Ａ変換器２２によりアナログ形式の歌唱音声信号に変換される。
【００６１】
一例として、「サイタ」（ｓａｉｔａ）という歌唱音声を合成するには、データベース２０から音声素片「＃ｓ」、「ｓａ」、「ａ」、「ａｉ」、「ｉ」、「ｉｔ」、「ｔａ」、「ａ」、「ａ＃」（＃は無音を表わす）にそれぞれ対応する音声素片データを読出してステップＳ５で接続する。そして、ステップＳ７では、音声素片毎にステップＳ６での変換に係るピッチを有する調和成分のデータを生成し、ステップＳ８の加算処理、ステップＳ９の変換処理及び変換器２２のＤ／Ａ変換処理を経ると、「サイタ」の歌唱音声信号が得られる。
【００６２】
上記したステップＳ１〜Ｓ９の処理は、パーソナルコンピュータ等の小型コンピュータにおいてソフトウエア処理として実行してもよく、あるいは電子回路等のハードウェアを用いて実行してもよい。
【００６３】
【発明の効果】
以上のように、この発明によれば、記憶手段に記憶したピッチ変換データを用いて入力音符ピッチを歌唱合成用の音声ピッチに変換する構成にしたので、歌唱者の発声ピッチや経時的なピッチ変動を再現できる効果が得られる。また、人間の実際の発声における経時的に不安定なピッチ変動を忠実に再現したり、音域による発声ピッチの精度の違いを表現したり、歌唱者によるピッチ変化の違いを表現したりすることも可能となる。
【００６４】
その上、入力ピッチに対してピッチ差分データの示す音声ピッチの変動分を加算してピッチ変換を行なう構成にしたので、記憶するデータ量が少なくて済む利点もある。
【図面の簡単な説明】
【図１】この発明の一実施形態に係る歌唱合成装置を示すブロック図である。
【図２】ピッチ変換装置の動作を説明するためのブロック図である。
【図３】ピッチ変換関数の一例を示すグラフである。
【図４】図３のピッチ変換関数を用いたピッチ変換の一例を示す図であり、（Ａ）は、変換前のピッチ変化を示すグラフ、（Ｂ）は、変換後のピッチ変化を示すグラフである。
【図５】データベースにおけるピッチ差分データの記憶状況を示すグラフである。
【図６】図５のピッチ差分データを用いたピッチ変換の一例を示す図であり、（Ａ）は、変換前のピッチ変化を示すグラフ、（Ｂ）は、変換後のピッチ変化を示すグラフである。
【図７】この発明の他の実施形態に係る歌唱合成装置を示すブロック図である。
【符号の説明】
１０：入力部、１２：ピッチ変換装置、１４：ピッチ変換器、１６，２０：データベース、１８：歌唱合成器、２２：Ｄ／Ａ変換器、Ｓ１：歌唱音声信号入力処理、Ｓ２：ＳＭＳ分析及び区間切出し処理、Ｓ３：歌詞データ及びメロディデータ入力処理、Ｓ４：音素列−音声素片変換処理、Ｓ５：音声素片接続処理、Ｓ６：ピッチ変換処理、Ｓ７：調和成分生成処理、Ｓ８：加算処理、Ｓ９：時間領域の歌唱音声信号に変換する処理。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a pitch conversion device, a pitch conversion method and a program suitable for use in singing synthesis.
[0002]
[Prior art]
Speech synthesizers that impart fluctuation to the pitch of synthesized speech are known in the prior art (see, for example, JP-A-9-281994).
[0003]
In this prior art, the clock signal supplied to a D/A converter, which converts digital speech waveform data into an analog speech signal, is given a fluctuating clock period in accordance with clock-interval fluctuation data read from a storage unit. Using this clock signal imparts fluctuation to the pitch of the D/A conversion output (the analog speech signal).
[0004]
[Problems to be solved by the invention]
When a person produces a voice corresponding to a given note, it is difficult even for a professional singer to hold a physically constant pitch. In general, the vocal pitch deviates somewhat from the note pitch, and the pitch also fluctuates over time. In particular, when an ordinary person who is not a professional singer sings, this tendency toward pitch deviation and pitch fluctuation is strong, and it is one factor in judging how good (or poor) the singing is. The way the pitch deviates may also reveal characteristics of the singer. Moreover, when a person tries to produce a sound at a pitch near the upper or lower limit of what he or she can utter, a physical load is placed on the voice production mechanism, so the intended pitch and the pitch actually uttered differ (the pitch tends to drop for high notes near the upper limit and to rise for low notes near the lower limit).
[0005]
According to the prior art described above, the direction and amount of pitch fluctuation can be changed by changing the value of the clock-interval fluctuation data in the pitch-raising or pitch-lowering direction. Viewed in terms of average pitch, however, the pitch before the fluctuation is added (the reference pitch) cannot be changed, nor can the temporal pattern of the pitch fluctuation. In other words, the singer's vocal pitch and pitch variation over time described above cannot be reproduced.
[0006]
An object of the present invention is to provide a novel pitch conversion device, pitch conversion method, and program that can reproduce a singer's vocal pitch and pitch variation over time during singing synthesis.
[0007]
[Means for Solving the Problems]
A first pitch conversion device according to the present invention is a pitch conversion device used in a singing voice synthesizing apparatus that includes singing voice synthesizing means for synthesizing a singing voice signal having the pitch indicated by pitch data. The device comprises: input means for sequentially inputting pitches corresponding respectively to the sequential singing voices to be synthesized; storage means for storing a pitch conversion function for converting a plurality of input pitches into a plurality of voice pitches, respectively, the function converting an input pitch so that the result is higher than the input pitch when the input pitch is lower than a predetermined lower limit pitch, lower than the input pitch when the input pitch is higher than a predetermined upper limit pitch, and equal to the input pitch when the input pitch lies between the predetermined lower limit pitch and the predetermined upper limit pitch; and conversion means for converting each pitch input from the input means into a voice pitch on the basis of the pitch conversion function and supplying data indicating the voice pitch to the singing voice synthesizing means as the pitch data.
[0008]
According to the first pitch conversion device, a pitch conversion function for converting a plurality of input pitches into a plurality of voice pitches, respectively, is stored in the storage means, and each input pitch is converted into a voice pitch for singing voice synthesis on the basis of this pitch conversion function. If a singer's vocal pitches are used as the plurality of voice pitches in the pitch conversion function, the singer's vocal pitch and pitch characteristics can be reproduced in the synthesized singing voice. For example, the pitch can be made slightly lower near the upper limit pitch that can be uttered and slightly higher near the lower limit pitch that can be uttered.
[0009]
In the first pitch conversion device, the input means may input singer data indicating a singer, the storage means may store the pitch conversion function for each singer, and the conversion means may perform pitch conversion on the basis of the pitch conversion function corresponding to the singer indicated by the singer data. In this way, the vocal pitch and pitch characteristics can be reproduced for each singer.
[0010]
In the first pitch conversion device, a random pitch variation that depends on the input pitch may be added to the voice pitch during pitch conversion. This gives the synthesized singing voice a more natural pitch change. In addition, the time-dependent pitch variation contained in a singer's actual voice may be added to the voice pitch during pitch conversion. This makes it possible to reproduce the singer's pitch variation, which is unstable over time.
[0014]
A first pitch conversion method according to the present invention is a pitch conversion method used in a singing voice synthesizing apparatus that includes storage means and singing voice synthesizing means. The storage means stores a pitch conversion function for converting a plurality of input pitches into a plurality of voice pitches, respectively, the function converting an input pitch so that the result is higher than the input pitch when the input pitch is lower than a predetermined lower limit pitch, lower than the input pitch when the input pitch is higher than a predetermined upper limit pitch, and equal to the input pitch when the input pitch lies between the predetermined lower limit pitch and the predetermined upper limit pitch. The singing voice synthesizing means synthesizes a singing voice signal having the pitch indicated by pitch data. The method includes a step of sequentially inputting pitches corresponding respectively to the sequential singing voices to be synthesized, and a step of converting each pitch input in that step into a voice pitch on the basis of the pitch conversion function and supplying data indicating the voice pitch to the singing voice synthesizing means as the pitch data.
[0015]
According to the first pitch conversion method, pitch conversion can be performed in the same manner as described above with respect to the first pitch conversion device.
[0018]
A first program according to the present invention is a program used in a singing voice synthesizing apparatus including a computer and singing voice synthesizing means for synthesizing a singing voice signal having the pitch indicated by pitch data. The program causes the computer to function as: input means for sequentially inputting pitches corresponding respectively to the sequential singing voices to be synthesized; storage means for storing a pitch conversion function for converting a plurality of input pitches into a plurality of voice pitches, respectively, the function converting an input pitch so that the result is higher than the input pitch when the input pitch is lower than a predetermined lower limit pitch, lower than the input pitch when the input pitch is higher than a predetermined upper limit pitch, and equal to the input pitch when the input pitch lies between the predetermined lower limit pitch and the predetermined upper limit pitch; and conversion means for converting each pitch input from the input means into a voice pitch on the basis of the pitch conversion function and supplying data indicating the voice pitch to the singing voice synthesizing means as the pitch data.
[0019]
A second program according to the present invention is a program used in a singing voice synthesizing apparatus provided with a computer. The program causes the computer to function as: input means for sequentially inputting pitches corresponding respectively to the sequential singing voices to be synthesized; storage means for storing a pitch conversion function for converting a plurality of input pitches into a plurality of voice pitches, respectively, the function converting an input pitch so that the result is higher than the input pitch when the input pitch is lower than a predetermined lower limit pitch, lower than the input pitch when the input pitch is higher than a predetermined upper limit pitch, and equal to the input pitch when the input pitch lies between the predetermined lower limit pitch and the predetermined upper limit pitch; conversion means for converting each pitch input from the input means into a voice pitch on the basis of the pitch conversion function and sending out pitch data indicating the voice pitch; and singing voice synthesizing means for synthesizing a singing voice signal having the voice pitch indicated by the pitch data sent from the conversion means.
[0020]
According to the first or second program, pitch conversion can be performed in the same manner as described above with respect to the first pitch conversion device.
[0021]
A third program according to the present invention is a program used in a singing voice synthesizing apparatus including a computer and singing voice synthesizing means for synthesizing a singing voice signal having the pitch indicated by pitch data, the program causing the computer to function as:
input means for sequentially inputting pitches corresponding respectively to the sequential singing voices to be synthesized;
storage means for storing, for each of a plurality of input pitches, pitch difference data indicating the time-dependent variation of the voice pitch relative to that input pitch; and
conversion means for, for each pitch input from the input means, reading the pitch difference data corresponding to that pitch from the storage means, performing pitch conversion by adding the time-dependent variation of the voice pitch indicated by the read pitch difference data to the input pitch, and supplying data indicating the converted pitch to the singing voice synthesizing means as the pitch data.
[0022]
A fourth program according to the present invention is a program used in a singing voice synthesizing apparatus provided with a computer, the program causing the computer to function as:
input means for sequentially inputting pitches corresponding respectively to the sequential singing voices to be synthesized;
storage means for storing, for each of a plurality of input pitches, pitch difference data indicating the time-dependent variation of the voice pitch relative to that input pitch;
conversion means for, for each pitch input from the input means, reading the pitch difference data corresponding to that pitch from the storage means, performing pitch conversion by adding the time-dependent variation of the voice pitch indicated by the read pitch difference data to the input pitch, and sending out pitch data indicating the converted pitch; and
singing voice synthesizing means for synthesizing a singing voice signal having the voice pitch indicated by the pitch data sent from the conversion means.
[0023]
According to the third or fourth program, pitch conversion can be performed in the same manner as described above with respect to the second pitch conversion device.
[0024]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 1 shows a singing voice synthesizing apparatus according to an embodiment of the present invention.
[0025]
The singing voice synthesizing apparatus of FIG. 1 includes an input unit 10, a pitch conversion device 12, and a song synthesizer 18. The pitch conversion device 12 comprises a pitch converter 14 and a database 16.
[0026]
The input unit 10 inputs singer data indicating a singer, speech segment data indicating a speech segment (a single phoneme or a phoneme chain), note data indicating the pitch and length of a note, sound intensity data indicating the intensity of the synthesized voice, and the like. Note pitch data indicating the input note pitch Pi and singer data indicating the input singer S are supplied to the pitch converter 14.
[0027]
The database 16 stores, for each singer, pitch conversion data for converting a plurality of input pitches (note pitches) into a plurality of voice pitches (output pitches), respectively, in the form of a pitch conversion function FT(S, p) or a pitch conversion table.
[0028]
FIG. 2 shows an example in which three pitch conversion functions FT(S1, p), FT(S2, p), and FT(S3, p), corresponding respectively to singers S1, S2, and S3, are stored in the database 16. Here, p represents the input pitch.
[0029]
In the pitch conversion device 12 shown in FIG. 2, the pitch converter 14 looks up in the database 16 the pitch conversion function corresponding to the singer S indicated by the singer data from the input unit 10, and uses that function to calculate the voice pitch Po corresponding to the note pitch Pi indicated by the note pitch data from the input unit 10. It then outputs pitch data indicating the calculated voice pitch Po to the song synthesizer 18.
[0030]
When the database 16 stores the pitch conversion data in the form of a pitch conversion table, the pitch converter 14 refers to the pitch conversion table corresponding to the singer S from the input unit 10 and reads from that table the voice pitch Po corresponding to the note pitch Pi indicated by the note pitch data from the input unit 10. It then supplies pitch data indicating the read voice pitch Po to the song synthesizer 18.
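The two storage forms described above (a pitch conversion function or a pitch conversion table per singer) can be sketched as follows. This is only an illustration under stated assumptions: the dictionary standing in for the database 16, the name `convert_pitch`, and the sample values are hypothetical and do not come from the patent.

```python
from typing import Callable, Dict, Union

# Hypothetical in-memory stand-in for database 16: one entry per singer.
# An entry is either a callable (pitch conversion function) or a dict
# (pitch conversion table mapping note pitch -> voice pitch, in cents).
PitchEntry = Union[Callable[[float], float], Dict[float, float]]


def convert_pitch(database: Dict[str, PitchEntry], singer: str,
                  note_pitch: float) -> float:
    """Return the voice pitch Po for note pitch Pi and the given singer."""
    entry = database[singer]
    if callable(entry):
        return entry(note_pitch)   # function form: calculate Po
    return entry[note_pitch]       # table form: read Po from the table


# Placeholder contents, invented for illustration only.
database: Dict[str, PitchEntry] = {
    "S1": lambda p: p,                       # identity function
    "S2": {2400.0: 2400.0, 3600.0: 3570.0},  # small lookup table
}
```

In this sketch the table form raises `KeyError` for a pitch that is not stored; how a real implementation handles pitches missing from the table is not specified here.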
[0031]
The song synthesizer 18 synthesizes a singing voice signal on the basis of the singer data, speech segment data, note length data, and sound intensity data from the input unit 10 and the pitch data from the pitch converter 14. Various singing synthesis methods are known, and the song synthesizer 18 can be built using an appropriate one of them.
[0032]
As an example, the song synthesizer 18 synthesizes a singing voice signal using voice component data corresponding to the singer indicated by the singer data and to the speech segment indicated by the speech segment data. The pitch, duration, and intensity of the singing voice signal are determined by the pitch data, the note length data, and the sound intensity data, respectively.
[0033]
FIG. 3 shows an example of the pitch conversion function. In FIG. 3, the input pitch [cent] on the horizontal axis corresponds to the note pitch input to the pitch converter 14, and the output pitch [cent] on the vertical axis corresponds to the voice pitch output from the pitch converter 14.
[0034]
The pitch conversion function FT(S, p) shown in FIG. 3 is shaped as follows. Between a predetermined lower limit pitch PL and a predetermined upper limit pitch PH, the output pitch equals the input pitch. When the input pitch is higher than PH, the output pitch gradually falls below the input pitch as the input approaches the upper limit of the pitch a person can utter. When the input pitch is lower than PL, the output pitch gradually rises above the input pitch as the input approaches the lower limit of the pitch a person can utter. Expressed mathematically, this shape is as shown in Equation 1 below.
[0035]
[Equation 1]
FT(S, p) > p   if p < PL
FT(S, p) = p   if PL ≤ p ≤ PH
FT(S, p) < p   if PH < p
As a specific example, a pitch conversion function can be used in which the output pitch is lower than the input pitch by up to several tens of cents in the region PH < p, and higher than the input pitch by up to several tens of cents in the region p < PL.
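One function that satisfies the three conditions above is a piecewise-linear map that compresses pitch toward the range [PL, PH]. The sketch below is a minimal illustration, not the function of FIG. 3: the name `make_ft`, the `slope` parameter, and the example limits are assumptions chosen so that the deviation stays within a few tens of cents.

```python
from typing import Callable


def make_ft(pl: float, ph: float, slope: float = 0.9) -> Callable[[float], float]:
    """Build a pitch conversion function FT(p) for one singer, in cents.

    With 0 < slope < 1 the output is pulled toward [pl, ph]:
    FT(p) > p below pl, FT(p) = p inside the range, FT(p) < p above ph.
    The linear form and the slope value are illustrative assumptions.
    """
    if not 0.0 < slope < 1.0:
        raise ValueError("slope must be between 0 and 1")

    def ft(p: float) -> float:
        if p < pl:
            return pl - slope * (pl - p)   # raised above the input pitch
        if p > ph:
            return ph + slope * (p - ph)   # lowered below the input pitch
        return p                           # unchanged in the middle range

    return ft


# Hypothetical limits: PL = 1500 cents, PH = 3300 cents.
ft_s1 = make_ft(1500.0, 3300.0)
```

With these example values, an input 200 cents above PH comes out about 20 cents flat, and an input 200 cents below PL comes out about 20 cents sharp.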
[0036]
A pitch conversion function such as that shown in FIG. 3 is prepared with an appropriate shape for each singer and is stored in the database 16 for each singer, as described above with reference to FIG. 2. The pitch converter 14 converts the input pitch Pi into the output pitch Po with reference to the pitch conversion function corresponding to the input singer S. Expressed mathematically, this pitch conversion is as shown in Equation 2 below.
[0037]
[Equation 2]
Po = FT(S, Pi)
FIG. 4 shows an example of pitch conversion using the pitch conversion function of FIG. 3: (A) shows the pitch change before conversion (the change in input pitch), and (B) shows the pitch change after conversion (the change in output pitch). In FIG. 4(A), the sequential input pitches correspond respectively to the sequential singing voices to be synthesized. FIG. 4 shows that the output pitch is higher than the input pitch in the low range below PL, lower than the input pitch in the high range above PH, and equal to the input pitch in the middle range from PL to PH inclusive. In the example of FIG. 4 the input pitches are given discretely, but this is not required; they may also be given continuously.
[0038]
The pitch conversion function shown in FIG. 3 is approximated by straight lines, but a pitch conversion function such as that of Equation 3 below, to which a random pitch variation rand(S, p) depending on the singer and the pitch is added, may be used instead.
[0039]
[Equation 3]
FT(S, p) + rand(S, p)
When such a pitch conversion function is used, a random pitch change is added to the sequential output pitches in response to the sequential input pitches, such as those shown in FIG. 4(A), during pitch conversion, and the synthesized voice can be given a more natural variation.
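The addition of a pitch-dependent random term can be sketched as below. The text does not specify the distribution or magnitude of rand(S, p), so the uniform distribution, the `base` and `extra` widths, and the rule that the jitter is wider outside [PL, PH] are all assumptions made for illustration.

```python
import random
from typing import Callable


def rand_variation(p: float, pl: float, ph: float, rng: random.Random,
                   base: float = 5.0, extra: float = 15.0) -> float:
    """Hypothetical rand(S, p): uniform jitter in cents.

    The jitter width depends on the input pitch: it is `base` inside
    [pl, ph] and `base + extra` outside it (an assumed rule).
    """
    width = base if pl <= p <= ph else base + extra
    return rng.uniform(-width, width)


def convert_with_jitter(ft: Callable[[float], float], p: float,
                        pl: float, ph: float, rng: random.Random) -> float:
    """Return FT(S, p) + rand(S, p) for one input pitch."""
    return ft(p) + rand_variation(p, pl, ph, rng)


# Usage with an identity FT and a seeded generator for repeatability.
rng = random.Random(0)
samples = [convert_with_jitter(lambda x: x, 2400.0, 1500.0, 3300.0, rng)
           for _ in range(100)]
```

Passing a `random.Random` instance rather than using the module-level generator keeps the sketch reproducible; per-singer behaviour could be modelled by giving each singer different `base` and `extra` values.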
[0040]
In the embodiment described above, the database 16 stores a time-independent pitch conversion function FT(S, p). However, the database 16 may instead store a time-dependent pitch conversion function, and pitch conversion may be performed with reference to that function. As an example, the database 16 stores, for each singer, pitch difference data indicating a pitch difference ΔFT(S, p, t) as the pitch conversion data. The pitch difference ΔFT(S, p, t) represents the difference of the voice pitch from the note pitch p, as time t progresses, when singer S produces a voice corresponding to the note pitch p.
[0041]
When the pitch difference data is stored in the database 16 for each singer in the form of a pitch conversion function ΔFT(S, p, t), the pitch converter 14 converts the input pitch Pi into the output pitch Po with reference to the pitch conversion function ΔFT(S, p, t) corresponding to the input singer S. Expressed mathematically, this pitch conversion is as shown in Equation 4 below.
[0042]
[Equation 4]
Po = Pi + ΔFT(S, Pi, t)
The pitch conversion in this case is performed by adding a pitch difference ΔFT (S, Pi, t) corresponding to the input pitch Pi to the input pitch Pi.
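The addition described above can be sketched for a sampled pitch difference. The storage layout (a dictionary keyed by singer and note pitch), the per-frame sampling, and the sample values are assumptions for illustration; the text only states that the difference is added to the input pitch.

```python
from typing import Dict, List, Tuple

# Hypothetical store: (singer, note pitch in cents) -> sampled pitch
# difference Delta-FT in cents, one value per synthesis frame.
# The frame period is an assumption and is not fixed by this sketch.
DeltaStore = Dict[Tuple[str, int], List[float]]


def convert_over_time(store: DeltaStore, singer: str, pi: int) -> List[float]:
    """Return the output pitch Po at each frame: Po(t) = Pi + Delta-FT(S, Pi, t)."""
    delta = store[(singer, pi)]
    return [pi + d for d in delta]


# Invented example data for one singer and one note pitch.
store: DeltaStore = {("S1", 2400): [0.0, -4.0, 3.0]}
trajectory = convert_over_time(store, "S1", 2400)
```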
[0043]
Instead of storing the pitch conversion function ΔFT(S, p, t) as described above, the database 16 may store pitch difference data representing the waveform of the pitch difference ΔFT(S, p, t) over time. FIG. 5 shows an example in which such pitch difference data is stored for 25 pitches p1 to p25 for each of singers S1 to Sn (n is an integer of 2 or more). The pitches p1 to p25 run from 1200 to 3600 [cent] in steps of 100 cents (one semitone). Storing pitch difference data in the database 16 requires less data than storing the pitch waveform data described later.
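The 25-pitch grid described above (1200 to 3600 cents in 100-cent steps) and a per-singer layout can be written out as follows. The nested-dictionary layout and the names `PITCH_GRID`, `grid_pitch`, and `empty_store` are hypothetical; only the grid values come from the text.

```python
from typing import Dict, List

# p1 .. p25: 1200 to 3600 cents in semitone (100-cent) steps.
PITCH_GRID: List[int] = [1200 + 100 * k for k in range(25)]


def grid_pitch(index: int) -> int:
    """Return pitch p_index in cents, using the 1-based numbering p1..p25."""
    if not 1 <= index <= len(PITCH_GRID):
        raise ValueError("index must be between 1 and 25")
    return PITCH_GRID[index - 1]


def empty_store(singers: List[str]) -> Dict[str, Dict[int, List[float]]]:
    """Create one (initially empty) difference waveform slot per singer and pitch."""
    return {s: {p: [] for p in PITCH_GRID} for s in singers}


layout = empty_store(["S1", "S2"])
```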
[0044]
In the example of FIG. 5, it is preferable to use pitch difference data based on actual singing for each pitch. For example, singer S1 is made to actually produce a voice corresponding to pitch p1, the waveform over time of the difference between the pitch of the produced voice and pitch p1 is obtained, and pitch difference data representing this waveform is used. In this way, pitch changes that reflect the characteristics of the singer can be reproduced, and finer, more human-like pitch changes can be expressed.
[0045]
The pitch converter 14 refers to the pitch difference data corresponding to the input pitch Pi among the pitch difference data corresponding to the singer S related to the input, and converts the input pitch Pi into the output pitch Po according to equation (4) above. FIG. 6 shows an example of pitch conversion using the pitch difference data of FIG. 5, in which (A) shows the pitch change before conversion (change in input pitch) and (B) shows the pitch change after conversion (change in output pitch). According to FIG. 6, near the upper limit pitch or lower limit pitch at which a person can vocalize, the output pitch becomes lower or higher than the input pitch, respectively, as described above, and the amount of pitch fluctuation also increases near those limits. Therefore, it is possible to reproduce human vocal pitch and pitch fluctuation.
[0046]
Since pitch is a continuous value rather than a discrete one, handling every pitch in the example of FIG. 5 would require storing pitch difference data for all pitches, and the amount of data to be stored would become enormous. Likewise, the amount of data to be stored becomes enormous for pitch difference data in which the change in the pitch difference ΔFT (S, p, t) continues for a long time. To avoid such an increase in the amount of stored data, the following measure (A) or (B) can be adopted as appropriate.
[0047]
(A) When pitch difference data representing the pitch difference ΔFT (S, p, t) is stored for each of a plurality of discrete pitches and no pitch difference data whose pitch exactly matches the input pitch can be found, pitch conversion is performed with reference to the pitch difference data whose pitch is closest to the input pitch. Alternatively, pitch conversion may be performed by obtaining new pitch difference data by interpolation from the two pitch difference data whose pitches are close to the input pitch.
[0048]
(B) As the pitch difference data, data in which the change duration of the pitch difference is within a predetermined value is stored. When the time length of the input pitch exceeds the change duration of the pitch difference ΔFT (S, p, t), reading returns to an appropriate position in the change waveform of the pitch difference, such as time 0, and the pitch difference data is read again.
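Measures (A) and (B) can be illustrated with the following sketch. It assumes, purely for illustration, that the pitch difference data of one singer is held as a dictionary mapping each stored pitch in cents to a list of pitch difference samples, one per time frame. The names `nearest_pitch`, `interpolated_diff`, `looped_index`, and `convert`, the choice of linear interpolation, and the sample values are assumptions, not details given in the embodiment.

```python
from bisect import bisect_left


def nearest_pitch(stored, pi):
    """Measure (A), first variant: the stored pitch closest to the input pitch."""
    return min(stored, key=lambda p: abs(p - pi))


def interpolated_diff(table, pi, t_index):
    """Measure (A), second variant: linear interpolation between the two
    stored pitches that bracket the input pitch (clamped at the ends)."""
    pitches = sorted(table)
    if pi <= pitches[0]:
        return table[pitches[0]][t_index]
    if pi >= pitches[-1]:
        return table[pitches[-1]][t_index]
    hi = bisect_left(pitches, pi)
    p_lo, p_hi = pitches[hi - 1], pitches[hi]
    w = (pi - p_lo) / (p_hi - p_lo)
    return (1.0 - w) * table[p_lo][t_index] + w * table[p_hi][t_index]


def looped_index(frame, length, loop_start=0):
    """Measure (B): once the stored waveform is used up, return to
    loop_start and read the data again."""
    if frame < length:
        return frame
    return loop_start + (frame - length) % (length - loop_start)


def convert(table, pi, frame):
    """Equation (4) applied with interpolated, looped pitch difference data."""
    length = len(next(iter(table.values())))
    return pi + interpolated_diff(table, pi, looped_index(frame, length))


# Made-up data: two stored pitches, two time frames each (values in cents).
table = {1200: [0.0, 4.0], 1300: [10.0, -6.0]}
print(nearest_pitch(table, 1240))  # 1200
print(convert(table, 1250, 2))     # 1255.0 (frame 2 wraps back to frame 0)
```

Interpolation costs slightly more computation than the nearest-pitch lookup but avoids a step in the output when the input pitch crosses the midpoint between two stored pitches.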
[0049]
In the example of FIG. 5, the database 16 stores, for each singer, pitch difference data representing a change waveform of the pitch difference corresponding to each of a plurality of pitches. Alternatively, pitch waveform data representing a pitch change waveform corresponding to each of a plurality of pitches may be stored for each singer as pitch conversion data. In this case, pitch conversion is performed by reading the pitch waveform data corresponding to the input pitch Pi from among the plurality of pitch waveform data corresponding to the singer S related to the input, and using it as the output pitch Po. When the pitch waveform data is created based on actual singing, it is possible to reproduce the vocal pitch of the singer and its pitch fluctuation over time.
[0050]
The pitch conversion process as described above may be executed as a software process in a small computer such as a personal computer. That is, the CPU (central processing unit) may execute pitch conversion processing according to a program stored in storage means such as ROM or RAM.
[0051]
FIG. 7 shows a singing voice synthesizing apparatus according to another embodiment of the present invention. This apparatus performs singing synthesis using, for example, the SMS (Spectral Modeling Synthesis) technique disclosed in Japanese Patent No. 2906970.
[0052]
In step S1, a singing voice signal is input, and in step S2, an SMS analysis process and a segment extraction process are performed on the input singing voice signal.
[0053]
In the SMS analysis process, the input voice signal is divided into a series of time frames, one set of intensity (magnitude) spectrum data is generated for each frame by FFT (Fast Fourier Transform), and line spectra corresponding to a plurality of peaks are extracted from the set of intensity spectrum data for each frame. Data representing the amplitude values and frequencies of these line spectra is referred to as harmonic component data (the deterministic component). Next, a residual spectrum is obtained by subtracting the spectrum of the harmonic component from the spectrum of the input voice waveform. Data representing this residual spectrum is referred to as anharmonic component data.
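The first two stages of this analysis, computing a magnitude spectrum for one frame and extracting its peaks as line spectra, can be sketched as follows. This is a simplified illustration and not the SMS technique of the cited patent: it uses a naive DFT in place of an FFT (the resulting values are the same), a plain local-maximum test with a fixed threshold, and no windowing or residual computation. The names `magnitude_spectrum` and `pick_peaks`, the sample rate, and the test signal are assumptions.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of the DFT of one frame, for bins 0 .. N/2 (naive O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def pick_peaks(mag, sample_rate, n, threshold):
    """Return (frequency_hz, magnitude) pairs for local maxima above threshold."""
    peaks = []
    for k in range(1, len(mag) - 1):
        if mag[k] > threshold and mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]:
            peaks.append((k * sample_rate / n, mag[k]))
    return peaks


# Made-up frame: a 100 Hz tone plus a weaker 200 Hz harmonic, 64 samples at 800 Hz.
SAMPLE_RATE = 800
N = 64
frame = [
    math.sin(2 * math.pi * 100 * i / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 200 * i / SAMPLE_RATE)
    for i in range(N)
]
peaks = pick_peaks(magnitude_spectrum(frame), SAMPLE_RATE, N, threshold=1.0)
print([freq for freq, _ in peaks])  # [100.0, 200.0]
```

The two tones fall exactly on DFT bins here, so no spectral leakage appears. A practical analysis would apply a window to each frame and refine the peak frequencies between bins.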
[0054]
In the segment extraction process, the harmonic component data and the anharmonic component data obtained by the SMS analysis process are classified by speech segment. A speech segment is a constituent element of lyrics and consists of, for example, a single phoneme such as [a] or [i], or a phoneme chain (a chain of multiple phonemes) such as [a_i] or [a_p].
[0055]
The database 20 stores harmonic component data and anharmonic component data for each speech unit. Further, the database 20 stores pitch conversion data (including the case of pitch difference data or pitch waveform data) as described above with respect to the database 16.
[0056]
When a song is to be synthesized, lyrics data and melody data are input in step S3. In step S4, the phoneme sequence represented by the lyrics data is subjected to a phoneme sequence / speech unit conversion process that divides the phoneme sequence into speech units, and for each speech unit, the corresponding harmonic component data and anharmonic component data are read out from the database 20 as speech segment data.
[0057]
In step S5, speech unit connection processing is performed on the speech unit data (harmonic component data and anharmonic component data) read from the database 20 to connect the speech unit data in the order of pronunciation.
[0058]
In step S6, a pitch conversion process is performed. That is, the note pitch indicated by the melody data input in step S3 is converted into a voice pitch based on the pitch conversion data (including the case of pitch difference data or pitch waveform data) in the database 20 in the same manner as described above. Pitch data indicating the voice pitch is generated.
[0059]
In step S7, new harmonic component data suited to the voice pitch is generated based on the harmonic component data for each speech unit and the voice pitch indicated by the pitch data generated in step S6. At this time, if the spectral intensities in the new harmonic component data are adjusted so that the shape of the spectral envelope represented by the harmonic component data processed in step S5 is inherited, the timbre of the voice signal input in step S1 can be reproduced.
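One possible way to generate harmonic component data at a new pitch while inheriting the spectral envelope is sketched below. The sketch rests on assumptions that are not stated in the embodiment: harmonic component data is taken to be a list of (frequency in Hz, amplitude) pairs, the envelope is approximated by linear interpolation between the original harmonics, and the voice pitch has already been converted from cents to a fundamental frequency in Hz. The names `envelope_at` and `regenerate_harmonics` and the sample values are hypothetical.

```python
def envelope_at(envelope, freq):
    """Linearly interpolate an envelope given as sorted (freq_hz, amplitude) points."""
    if freq <= envelope[0][0]:
        return envelope[0][1]
    if freq >= envelope[-1][0]:
        return envelope[-1][1]
    for (f0, a0), (f1, a1) in zip(envelope, envelope[1:]):
        if f0 <= freq <= f1:
            w = (freq - f0) / (f1 - f0)
            return (1.0 - w) * a0 + w * a1


def regenerate_harmonics(harmonics, new_f0, max_freq):
    """Place harmonics at integer multiples of new_f0 and take each amplitude
    from the envelope traced by the original harmonics."""
    envelope = sorted(harmonics)
    out = []
    k = 1
    while k * new_f0 <= max_freq:
        out.append((k * new_f0, envelope_at(envelope, k * new_f0)))
        k += 1
    return out


# Made-up harmonic component data for a 100 Hz voice, re-pitched to 150 Hz.
original = [(100.0, 1.0), (200.0, 0.5), (300.0, 0.25)]
print(regenerate_harmonics(original, 150.0, 300.0))  # [(150.0, 0.75), (300.0, 0.25)]
```

Because the amplitudes are sampled from the original envelope rather than moved along with the harmonics, the formant positions stay in place when the pitch changes, which is what preserves the timbre.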
[0060]
In step S8, the harmonic component data generated in step S7 and the anharmonic component data processed in step S5 are added for each speech unit. In step S9, the data resulting from the addition process in step S8 is converted into a time domain singing voice signal by inverse FFT or the like for each speech unit. The singing voice signal obtained as a result of the processing in step S9 is a digital signal, and is converted into an analog singing voice signal by the D/A converter 22.
[0061]
As an example, in order to synthesize the singing voice "saita", speech segment data corresponding to the speech units "#s", "s_a", "a", "a_i", "i", "i_t", "t_a", "a", and "a#" (# represents silence) are read out and then connected in step S5. In step S7, harmonic component data having the pitch obtained by the conversion in step S6 is generated for each speech unit. After the addition process in step S8, the conversion process in step S9, and the D/A conversion process in the converter 22, a singing voice signal of "saita" is obtained.
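The division of a phoneme sequence into speech units for this example can be sketched as follows. The rule used here is only inferred from the "saita" example and is an assumption: a transition unit is emitted for every pair of adjacent phonemes (with silence "#" added at both ends), and a stationary unit is emitted after each transition into a vowel. The underscore used to join the two phonemes of a chain, the vowel set, and the function name `to_speech_units` are likewise assumptions.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumed vowel inventory
SILENCE = "#"


def to_speech_units(phonemes):
    """Split a phoneme sequence into transition and stationary speech units."""
    padded = [SILENCE] + list(phonemes) + [SILENCE]
    units = []
    for prev, cur in zip(padded, padded[1:]):
        # Transitions involving silence are written without a joiner ("#s", "a#").
        joiner = "" if SILENCE in (prev, cur) else "_"
        units.append(prev + joiner + cur)
        if cur in VOWELS:
            units.append(cur)  # stationary part of the vowel
    return units


print(to_speech_units(["s", "a", "i", "t", "a"]))
# ['#s', 's_a', 'a', 'a_i', 'i', 'i_t', 't_a', 'a', 'a#']
```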
[0062]
The processing in steps S1 to S9 described above may be executed as software processing in a small computer such as a personal computer, or may be executed using hardware such as an electronic circuit.
[0063]
[Effect of the invention]
As described above, according to the present invention, the input note pitch is converted into a voice pitch for singing synthesis using the pitch conversion data stored in the storage means, so that the effect of reproducing the vocal pitch of a singer and its pitch fluctuation can be obtained. In addition, it becomes possible to faithfully reproduce the unstable pitch fluctuation over time of actual human utterances, to express differences in utterance pitch accuracy by vocal range, and to express differences in pitch change between singers.
[0064]
In addition, since the pitch conversion is performed by adding the variation of the voice pitch indicated by the pitch difference data to the input pitch, there is an advantage that the amount of data to be stored can be reduced.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a singing voice synthesizing apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram for explaining the operation of the pitch converter.
FIG. 3 is a graph showing an example of a pitch conversion function.
FIG. 4 shows an example of pitch conversion using the pitch conversion function of FIG. 3, in which (A) is a graph showing the pitch change before conversion and (B) is a graph showing the pitch change after conversion.
FIG. 5 is a graph showing the storage status of pitch difference data in a database.
FIG. 6 shows an example of pitch conversion using the pitch difference data of FIG. 5, in which (A) is a graph showing the pitch change before conversion and (B) is a graph showing the pitch change after conversion.
FIG. 7 is a block diagram showing a singing voice synthesizing apparatus according to another embodiment of the present invention.
[Explanation of symbols]
10: input unit, 12: pitch converter, 14: pitch converter, 16, 20: database, 18: singing synthesizer, 22: D/A converter, S1: singing voice signal input processing, S2: SMS analysis and segment extraction processing, S3: lyrics data and melody data input processing, S4: phoneme sequence / speech unit conversion processing, S5: speech unit connection processing, S6: pitch conversion processing, S7: harmonic component generation processing, S8: addition processing, S9: processing to convert to a time domain singing voice signal.

Claims

A pitch conversion device used in a singing voice synthesizing device provided with singing voice synthesizing means for synthesizing a singing voice signal having a pitch indicated by pitch data, the pitch conversion device comprising:
input means for sequentially inputting pitches respectively corresponding to sequential singing voices to be synthesized;
storage means for storing a pitch conversion function for converting a plurality of input pitches into a plurality of voice pitches, respectively, the pitch conversion function converting the input pitch so that the voice pitch is higher than the input pitch when the input pitch is lower than a predetermined lower limit pitch, lower than the input pitch when the input pitch is higher than a predetermined upper limit pitch, and equal to the input pitch when the input pitch is between the predetermined lower limit pitch and the predetermined upper limit pitch; and
conversion means for converting each pitch input from the input means into a voice pitch based on the pitch conversion function, and supplying data indicating the voice pitch to the singing voice synthesizing means as the pitch data.

The pitch conversion device according to claim 1, wherein the input means inputs singer data indicating a singer, the storage means stores the pitch conversion function for each singer, and the conversion means performs pitch conversion based on the pitch conversion function corresponding to the singer indicated by the singer data.

The pitch conversion device according to claim 1 or 2, wherein the conversion means adds a random pitch variation depending on an input pitch to the voice pitch during pitch conversion.

The pitch conversion device according to claim 1, wherein the conversion unit adds a time-dependent pitch fluctuation to the voice pitch during pitch conversion.

A pitch conversion method used in a singing voice synthesizing device comprising storage means for storing a pitch conversion function for converting a plurality of input pitches into a plurality of voice pitches, respectively, the pitch conversion function converting the input pitch so that the voice pitch is higher than the input pitch when the input pitch is lower than a predetermined lower limit pitch, lower than the input pitch when the input pitch is higher than a predetermined upper limit pitch, and equal to the input pitch when the input pitch is between the predetermined lower limit pitch and the predetermined upper limit pitch, and singing voice synthesizing means for synthesizing a singing voice signal having a pitch indicated by pitch data, the method comprising:
a step of sequentially inputting pitches respectively corresponding to sequential singing voices to be synthesized; and
a step of converting each pitch input in the inputting step into a voice pitch based on the pitch conversion function, and supplying data indicating the voice pitch to the singing voice synthesizing means as the pitch data.

A program used in a singing voice synthesizing device comprising a computer and singing voice synthesizing means for synthesizing a singing voice signal having a pitch indicated by pitch data, the program causing the computer to function as:
input means for sequentially inputting pitches respectively corresponding to sequential singing voices to be synthesized;
storage means for storing a pitch conversion function for converting a plurality of input pitches into a plurality of voice pitches, respectively, the pitch conversion function converting the input pitch so that the voice pitch is higher than the input pitch when the input pitch is lower than a predetermined lower limit pitch, lower than the input pitch when the input pitch is higher than a predetermined upper limit pitch, and equal to the input pitch when the input pitch is between the predetermined lower limit pitch and the predetermined upper limit pitch; and
conversion means for converting each pitch input from the input means into a voice pitch based on the pitch conversion function, and supplying data indicating the voice pitch to the singing voice synthesizing means as the pitch data.

A program used in a singing voice synthesizing device provided with a computer, the program causing the computer to function as:
input means for sequentially inputting pitches respectively corresponding to sequential singing voices to be synthesized;
storage means for storing a pitch conversion function for converting a plurality of input pitches into a plurality of voice pitches, respectively, the pitch conversion function converting the input pitch so that the voice pitch is higher than the input pitch when the input pitch is lower than a predetermined lower limit pitch, lower than the input pitch when the input pitch is higher than a predetermined upper limit pitch, and equal to the input pitch when the input pitch is between the predetermined lower limit pitch and the predetermined upper limit pitch;
conversion means for converting each pitch input from the input means into a voice pitch based on the pitch conversion function, and sending pitch data indicating the voice pitch; and
singing voice synthesizing means for synthesizing a singing voice signal having the voice pitch indicated by the pitch data sent from the conversion means.