JP2004077608A

JP2004077608A - Apparatus and method for chorus synthesis and program

Info

Publication number: JP2004077608A
Application number: JP2002235039A
Authority: JP
Inventors: Hidenori Kenmochi; 劔持　秀紀; Bonada Jordi; ジョルディ　ボナダ
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2002-08-12
Filing date: 2002-08-12
Publication date: 2004-03-11
Anticipated expiration: 2022-08-12
Also published as: JP4304934B2

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus, a method, and a program for chorus synthesis that can synthesize a chorus sound giving a more natural impression to the listener.

SOLUTION: The chorus synthesizing apparatus 100 has a speech sample database 110 storing three speech sample data groups 110a, 110b, and 110c, each generated from a different voice, and three singing generators 120, 121, and 122. When a chorus sound signal of a musical piece consisting of three parts is synthesized, the singing generators 120, 121, and 122 generate singing sound signals for the respective parts according to lyrics information and melody information under the control of a chorus control part 140, and the generated signals are then added together. For the generation, the singing generators 120, 121, and 122 use phoneme sample data included in the different speech sample data groups 110a, 110b, and 110c.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、合唱音信号を合成する合唱合成装置、合唱合成方法、および合唱音を合成するためのプログラムに関する。
【０００２】
【従来の技術】
従来より、歌詞情報やメロディ情報に基づいて、歌唱音信号を合成して歌声を発音する合唱合成装置が提案されている。このように歌唱音信号を合成する装置としては、規則音声合成技術を応用した装置等の種々の装置が提案されている。規則合成技術を応用した歌唱合成装置では、予め発声者が発した音声から、音素や複数の音素を含む音素連鎖を単位とする音声試料データを作成してデータベースに記憶しておく。そして、歌詞情報にしたがって必要となる音素等の音声試料データを読み出して接続することにより歌唱音信号を合成している。
【０００３】
ところで、上記のような歌唱音を合成する歌唱音合成装置では、文章読み上げ装置等の音声合成装置と異なり、斉唱や重唱といった合唱時の歌唱音を電子的に出力するといった利用形態も考えられる。したがって、合唱時の歌唱音（合唱音）を合成する機能を備えた合唱合成装置の開発も行われている。
【０００４】
このような合唱時の合唱音信号を合成する機能を備えた合唱合成装置は、複数のパートの各々に基づいて、音声試料データを読み出して接続することにより合唱音信号を生成する。そして、各々のパートについて生成した歌唱音信号を重ね合わせて出力することにより、合唱音を電子的に出力することができるようになっている。
【０００５】
【発明が解決しようとする課題】
しかし、従来の合唱音信号を合成する機能を備えた合唱合成装置では、各パート毎に歌詞情報やメロディ情報にしたがって歌唱音信号を生成する際に、同一の音声試料データを用いているため、各パート毎に生成された歌唱音はメロディが異なっているものの、生成された各パート毎の音声波形の微細な特徴（ピッチのゆらぎ等）は基本的に同一となってしまう。したがって、これらを重ね合わせた合唱音は、聴取者にとって不自然な合唱音に聴こえてしまう。これは、各パート間の相関関係（微細な特徴が一致する）を聴取者が聴き取ってしまい、不自然な印象を与えているものと考えられる。
【０００６】
また、斉唱時の合唱音信号を合成する場合には、上記のように各パート毎に単純に歌唱音信号を生成して重ね合わせる手法では、全く同じ歌唱音が重ねられて出力されてしまい、この結果聴取者に不自然な印象を与えてしまうことになる。そこで、従来の合唱音合成装置において、斉唱時の合唱音信号を合成する場合には、各パート（内容は同一）毎に生成した歌唱音の発音タイミングを若干ずらしたり、各パート毎に生成した歌唱音のピッチを若干ずらしたりすることにより、全く同一の歌唱音が重ねられて発音されてしまうことを防止していた。しかしながら、発音タイミングやピッチを若干ずらした場合にも、上記のように各パート毎に生成された音声波形の微細な特徴（ゆらぎ等）は基本的に同一となってしまう。したがって、これらを重ね合わせた合唱音は、上記と同様、聴取者にとって不自然な合唱音に聴こえてしまう。
【０００７】
また、特開平７−１４６６９５号公報には、合唱音信号を生成する装置が開示されており、この装置では、各パート毎に歌唱音信号を生成する際に、各パート毎に異なるピッチのゆらぎ成分を付与した歌唱音信号を生成している。このように各パート毎に異なるピッチのゆらぎ成分を付与した歌唱音信号を重ねて出力することにより、各パート間の相関関係を小さくすることができる。しかしながら、この公報に記載された装置において、各パート毎の歌唱音信号に付与されるピッチ成分は、人の音声を基にしたものではなく、人工的に作られたものであるため、各パート間の相関関係は小さくなるものの、合成された合唱音が不自然に聴こえてしまうことがある。
【０００８】
本発明は、上記の事情を考慮してなされたものであり、より自然な印象を聴取者に与えることが可能な合唱音を合成することができる合唱合成装置、合唱合成方法およびプログラムを提供することを目的とする。
【０００９】
【課題を解決するための手段】
上記課題を解決するため、本発明に係る合唱合成装置は、楽曲データに基づいて合唱音信号を合成する合唱合成装置であって、同一種類の音声試料データについて、複数の異なる音声に基づいて各々作成された前記音声試料データを記憶するデータベースと、前記楽曲データにしたがって歌唱音信号を生成する手段であって、必要となる前記音声試料データを前記データベースから読み出して当該歌唱音信号の生成に用いる複数の歌唱生成手段と、前記複数の歌唱生成手段で生成された歌唱音信号から合唱音信号を合成する歌唱合成手段とを具備し、前記楽曲データが複数のパートからなり、前記複数の歌唱生成手段の各々が各前記パートに対応する歌唱音信号を生成する際に、少なくとも２つの前記歌唱生成手段の各々は、前記データベースから異なる音声に基づいて作成された前記音声試料データを読み出して前記歌唱音信号の生成に用いることを特徴としている。
【００１０】
この構成によれば、各歌唱生成手段が対応するパートの歌唱音信号を生成する際に、少なくとも２つの歌唱生成手段が異なる音声に基づいて作成した音声試料データを用いることになる。ここで、異なる音声に基づいて作成した音声試料データは、微細な特徴等が異なっているため、上記少なくとも２つの歌唱生成手段から出力される歌唱音信号は微細な特徴が異なったものとなる。したがって、各パートに応じた歌唱音として、固有の特徴を有する歌唱音が放音されるので、聴取者に対してより自然な印象を与えることができる。
【００１１】
また、本発明の別の態様の合唱合成装置は、楽曲データにしたがって合唱音信号を合成する合唱合成装置であって、音声に基づいて作成された所定の時間長を有する音声試料データを記憶するデータベースと、前記楽曲データにしたがって歌唱音信号を生成する手段であって、必要となる前記音声試料データを前記データベースから読み出して当該歌唱音信号の生成に用いる複数の歌唱生成手段と、前記複数の歌唱生成手段で生成された歌唱音信号から合唱音信号を合成する歌唱合成手段とを具備し、前記楽曲データが複数のパートからなり、前記複数の歌唱生成手段の各々が各前記パートに対応する歌唱音信号を生成する際に、少なくとも２つの前記歌唱生成手段の各々は、前記データベースから読み出した前記音声試料データの異なる時間に対応する部分から使用を開始して前記歌唱音信号を生成することを特徴としている。
【００１２】
この構成によれば、各歌唱生成手段が対応するパートの歌唱音信号を生成する際に、少なくとも２つの歌唱生成手段が音声試料データの異なる時間に対応する部分から使用を開始して生成を行うことになる。ここで、音声に基づいて作成されたある時間長を有する音声試料データは、その時間長の間微細な特徴（音声波形のゆらぎ）が一定ではなく、時間によって微細な特徴等が異なっている。このため、上記少なくとも２つの歌唱生成手段から出力される歌唱音信号は微細な特徴が異なったものとなる。したがって、各パートに応じた歌唱音として、固有の特徴を有する歌唱音が放音されるので、聴取者に対してより自然な印象を与えることができる。
【００１３】
また、本発明に係る合唱合成方法は、楽曲データに基づいて生成された複数の歌唱音信号から合唱音信号を合成する合唱合成方法であって、複数のパートからなる前記楽曲データにしたがって前記複数のパートに対応する歌唱音信号を生成する際には、複数の異なる音声に基づいて各々作成された前記音声試料データを記憶するデータベースから必要となる前記音声試料データを読み出し、少なくとも２つの前記パートに対応する歌唱音信号の生成には、前記データベースから読み出された異なる音声に基づいて作成された前記音声試料データを用いることを特徴としている。
【００１４】
また、本発明の別の態様の合唱合成方法は、楽曲データに基づいて生成された複数の歌唱音信号から合唱音信号を合成する合唱合成方法であって、複数のパートからなる前記楽曲データにしたがって前記複数のパートに対応する歌唱音信号を生成する際には、音声に基づいて作成された所定の時間長を有する音声試料データを記憶するデータベースから必要となる前記音声試料データを読み出し、少なくとも２つの前記パートに対応する歌唱音信号の生成には、前記データベースから読み出した前記音声試料データの異なる時間に対応する部分から使用を開始して前記歌唱音信号を生成することを特徴としている。
【００１５】
また、本発明に係るプログラムは、コンピュータを、楽曲データにしたがって、複数の異なる音声に基づいて各々作成された音声試料データを記憶するデータベースから必要となる前記音声試料データを読み出して歌唱音信号を生成する手段であって、前記楽曲データが複数のパートからなり、前記複数のパートに対応する歌唱音信号を生成する場合には、少なくとも２つの前記パートに対応する歌唱音信号の生成の際に、前記データベースから読み出された異なる音声に基づいて作成された前記音声試料データを用いて前記歌唱音信号を生成する歌唱音生成手段と、前記生成された歌唱音信号から合唱音信号を合成する歌唱合成手段として機能させることを特徴としている。
【００１６】
また、本発明の別の態様のプログラムは、コンピュータを、楽曲データにしたがって、音声に基づいて作成された所定の時間長を有する音声試料データを記憶するデータベースから必要となる前記音声試料データを読み出して歌唱音信号を生成する手段であって、前記楽曲データが複数のパートからなり、前記複数のパートに対応する歌唱音信号を生成する場合には、少なくとも２つの前記パートに対応する歌唱音信号を生成する際に、前記データベースから読み出した前記音声試料データの異なる時間に対応する部分から使用を開始して前記歌唱音信号を生成する歌唱音生成手段と、前記生成された歌唱音信号から合唱音信号を合成する歌唱合成手段として機能させることを特徴としている。
【００１７】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態について説明する。
Ａ．第１実施形態
Ａ−１．第１実施形態の基本構成
まず、図１は本発明の第１実施形態に係る合唱合成装置の基本構成を示すブロック図である。同図に示すように、この合唱合成装置１００は、音声試料データベース１１０と、複数（図示の例では３つ）の歌唱生成器１２０，１２１，１２２と、合唱制御部１４０と、歌唱生成器１２０，１２１，１２２の各々が出力する歌唱音信号を加算して合成し、出力する加算器１３０とを備えている。
【００１８】
音声試料データベース１１０には、人が発声した自然の音声に基づいて作成された音声試料データが記憶されている。この音声試料データベース１１０には、単一の音素または複数の音素で構成される音素連鎖を１つの単位とする音声試料データ（以下、音声素片試料データという）が記憶されている。
【００１９】
多数の短時間長の音声試料データをデータベースに蓄積しておいて、歌詞等に応じてこれらの音声試料データを接続する音声合成処理技術では、合成単位として音素が用いられるのが基本である。このため、この合唱合成装置１００における音声試料データベース１１０に、音素（３０〜５０種類程度）単位のみの音声素片試料データを蓄積するようにしてもよいが、音素間の結合規則は複雑であるため、音素単位のみの音声試料データを蓄積した場合には、良好な品質を得ることが難しい。したがって、音声試料データベース１１０には、音素単位のみの音声素片試料データに加え、音素よりもやや大きい単位（音素連鎖）の音声素片試料データも蓄積しておくことが好ましい。音素よりも大きい単位としては、ＣＶ（子音→母音）、ＶＣ（母音→子音）、ＶＣＶ（母音→子音→母音）、ＣＶＣ（子音→母音→子音）といった単位がある。これらの単位の音声素片試料データを全て蓄積しておくことも考えられるが、合唱音を合成する合唱合成装置１００においては、歌唱において使用頻度の高い母音など長く発音する伸ばし音を１単位とした音声素片試料データ、子音から母音（ＣＶ）および母音から子音（ＶＣ）を１単位とした音声素片試料データ、子音から子音を１単位とした音声素片試料データ、および母音から母音を１単位とした音声素片試料データを蓄積しておくようにすればよい。
【００２０】
音声試料データベース１１０には、上述したような音素あるいは音素連鎖を１単位とした音声素片試料データが格納されているが、この音声試料データベース１１０では、同一種類の音素（例えば「ａ」）あるいは音素連鎖（例えば、「ａｉ」）について３つの音声素片試料データを記憶している。すなわち、音声試料データベース１１０には、音素あるいは音素連鎖を１単位とした所定数の単位音声素片試料データからなる３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃが記憶されているのである。
【００２１】
音声試料データベース１１０に記憶されている３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃは、各々異なる音声に基づいて作成されたデータである。ここで、異なる音声とは、発声者が異なることのみを意味するわけではなく、同じ発声者であっても別の機会に発した音声や異なる発声部分を用いたものであってもよい。このように音声試料データ群１１０ａ，１１０ｂ，１１０ｃは、別の発声者または同じ発声者であっても別の機会に発した音声や別の発声部分に基づいて作成されているのである。このように各音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれる同一の音素（あるいは音素連鎖）についてのデータは、各々のデータを作成するために使用した基となる音声が異なっているため、微細な特徴（ピッチのゆらぎ等）が異なったものとなっている。
【００２２】
音声試料データベース１１０には、上述したような３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃが記憶されており、各歌唱生成器１２０，１２１，１２２は、歌唱音信号を生成する際にこの音声試料データベース１１０から音声素片試料データを読み出して用いることになる。
【００２３】
歌唱生成器１２０，１２１，１２２の各々は、歌詞情報およびメロディ情報を有する楽曲情報にしたがって、音声試料データベース１１０から必要となる音声素片試料データを読み出し、読み出した音声素片試料データを用いて歌唱音信号を生成する。
【００２４】
より具体的には、歌唱生成器１２０，１２１，１２２の各々は、歌詞情報にしたがって音素列を求め、その音素列を構成するために必要な音声素片試料データを決定し、音声試料データベース１１０から読み出す。そして、読み出した音声素片試料データを時系列に接続し、接続した音声素片試料データをメロディ情報にしたがったピッチに応じて適宜調整し、歌唱音信号を生成するのである。
【００２５】
本実施形態に係る合唱合成装置１００は、歌詞情報およびメロディ情報にしたがって歌唱音信号を生成することができる３つの歌唱生成器１２０，１２１，１２２を備えており、これにより３つのパートからなる合唱曲の楽曲情報（歌詞情報およびメロディ情報）にしたがって、この合唱曲に対応した合唱音信号を合成することができるようになっている。
【００２６】
合唱制御部１４０は、当該合唱合成装置１００において、合唱曲の楽曲情報に基づいて合唱音に対応した合唱音信号を合成する際に、楽曲情報を各パート毎に分割して各歌唱生成器１２０，１２１，１２２に出力する。これにより、３つのパートからなる楽曲情報にしたがって合唱音信号を合成する場合には、各歌唱生成器１２０，１２１，１２２が合唱制御部１４０から供給される各々のパートの歌詞情報およびメロディ情報にしたがって歌唱音信号を生成し、各歌唱生成器１２０，１２１，１２２の各々が生成した歌唱音信号が加算器１３０に出力される。これにより、加算器１３０からは３つのパートからなる合唱曲の楽曲情報にしたがって、この合唱曲に対応した合唱音信号を合成することができるのである。
【００２７】
また、この合唱合成装置１００において、上記のように合唱音信号を合成する際には、合唱制御部１４０は、各歌唱生成器１２０，１２１，１２２の各々が、音声試料データベース１１０に記憶されている音声試料データ群１１０ａ，１１０ｂ，１１０ｃのうち、どの音声試料データ群から音声素片試料データを読み出して用いるかを指定する指定情報を歌唱生成器１２０，１２１，１２２に出力する。ここで、合唱制御部１４０は、各歌唱生成器１２０，１２１，１２２が互いに異なる音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれる音声素片試料データを用いて歌唱音信号を生成するように、各歌唱生成器１２０，１２１，１２２に異なるデータ群を指定する指定情報を出力する。
【００２８】
具体的に例示すると、歌唱生成器１２０に対しては音声試料データ群１１０ａを指定する指定情報を出力し、歌唱生成器１２１に対しては音声試料データ群１１０ｂを指定する指定情報を出力し、歌唱生成器１２２に対しては音声試料データ群１１０ｃを指定する指定情報を出力するといった具合である。このような指定情報が合唱制御部１４０から供給されると、歌唱生成器１２０は音声試料データ群１１０ａに含まれる音声素片試料データを読み出して歌唱音信号の生成に用い、歌唱生成器１２１は音声試料データ群１１０ｂに含まれる音声素片試料データを読み出して歌唱音信号の生成に用い、歌唱生成器１２２は音声試料データ群１１０ｃに含まれる音声素片試料データを用いて歌唱音信号を生成することになる。
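The designation step described above can be sketched in Python. This is a minimal illustration rather than the disclosed implementation: the function name `assign_data_groups`, the dictionary used to represent the designation information, and the round-robin reuse of groups when fewer groups than generators exist are all assumptions made for the example.

```python
def assign_data_groups(generator_ids, group_ids):
    """Return designation info mapping each singing generator to a data group.

    Hypothetical helper: groups stay distinct while enough groups exist;
    otherwise they are reused in round-robin order (an assumption).
    """
    if not group_ids:
        raise ValueError("at least one voice sample data group is required")
    return {
        gen: group_ids[i % len(group_ids)]
        for i, gen in enumerate(generator_ids)
    }


# Generators 120, 121, 122 and data groups 110a, 110b, 110c as in the text.
designation = assign_data_groups([120, 121, 122], ["110a", "110b", "110c"])
print(designation)
```

With three groups and three generators, every generator receives a different group, which is the case the paragraph above describes.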
【００２９】
合唱合成装置１００において、３つのパートからなる合唱曲の歌唱音信号を合成する場合に、上述したように各歌唱生成器１２０，１２１，１２２が互いに異なる音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれる音声素片試料データを用いることにより、より自然な印象を聴取者に与えることが可能な合唱音信号を合成することができる。すなわち、音声試料データベース１１０に記憶されている音声試料データ群１１０ａ，１１０ｂ，１１０ｃは、各々異なる音声に基づいて作成されたものであり、同一種類の音素や音素連鎖についてのデータであっても、各音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれるデータに示される音声の微細な特徴（ピッチのゆらぎ等）は異なっている。このように微細な特徴が異なっている音声素片試料データが含まれる音声試料データ群１１０ａ，１１０ｂ，１１０ｃのうち、各歌唱生成器１２０，１２１，１２２が異なる音声試料データ群に含まれる音声素片試料データを用いて歌唱音信号を生成することにより、各歌唱生成器１２０，１２１，１２２によって生成される歌唱音信号は、互いに微細な特徴が異なるものとなっている。したがって、これらを重ね合わせた合唱音は、各パート間の相関関係がほとんどない固有の特徴を有するものとなり、聴取者に不自然な印象を与えてしまうことを低減することができる。
【００３０】
また、音声試料データベース１１０に記憶されている音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれる音声素片試料データは、各々人が発声した音声に基づいて作成されたデータである。したがって、各音声試料データ群に含まれる音声素片試料データに示される音声の微細な特徴の相違は、予め用意されたピッチのゆらぎを付与するといった人工的に作り出されたものではない。したがって、合成された合唱音が不自然なものとなってしまうことを低減することができる。
【００３１】
Ａ−２．合唱合成装置の具体的な構成
以上説明したのが本実施形態に係る合唱合成装置１００の基本的な構成である。この合唱合成装置１００においては、歌唱生成器１２０，１２１，１２２として、歌詞情報にしたがって音声試料データベース１１０から音声素片試料データを読み出して接続し、メロディ情報にしたがったピッチに応じて接続した音声素片試料データを調整して歌唱音信号を出力するといった歌唱生成器であれば、規則音声合成技術等を応用した歌唱生成器等の公知の種々の歌唱生成器を用いることができ、音声試料データベース１１０には採用する歌唱生成器に対応した音声素片試料データを記憶させておけばよい。以下においては、歌唱生成器１２０，１２１，１２２として、米国特許第５０２９５０９号や特許第２９０６９７０号において提案されているスペクトルモデリング合成（ＳＭＳ：Ｓｐｅｃｔｒａｌ　Ｍｏｄｅｌｉｎｇ　Ｓｙｎｔｈｅｓｉｓ）技術を利用した歌唱生成器を適用した場合を例に挙げて、合唱合成装置１００について具体的に説明する。
【００３２】
まず、ＳＭＳ技術を利用した歌唱生成器１２０，１２１，１２２を備えた合唱合成装置１００における音声試料データベース１１０の作成手法について説明する。
【００３３】
上述したように、この合唱合成装置１００における音声試料データベース１１０には、発声者の発した音声に基づいて作成された音声素片試料データが記憶されている。ＳＭＳ技術は、オリジナルの音を２つの成分、すなわち調和成分（ｄｅｔｅｒｍｉｎｉｓｔｉｃ　ｃｏｍｐｏｎｅｎｔ）と、非調和成分（ｓｔｏｃｈａｓｔｉｃ　ｃｏｍｐｏｎｅｎｔ）で表すモデルを使用して楽音の分析および合成を行う技術であり、ＳＭＳ技術を利用した音声合成においては、音素あるいは音素連鎖といった１単位の音声素片試料データとして、上記調和成分および非調和成分からなるデータが音声合成に用いられる。したがって、ＳＭＳ技術を利用した合唱合成装置１００においては、音声試料データベース１１０に、発声者の発した音声をＳＭＳ分析することにより得られた調和成分および非調和成分を示すデータが１つの音声素片試料データとして記憶される。以下、図２を参照しながら、音声試料データベース１１０の作成手法について説明する。
【００３４】
同図に示すように、音声試料データベース１１０の作成のために発声者が発した音声は、ＳＭＳ分析部２００に入力され、ＳＭＳ分析部２００においてＳＭＳ分析される。ここで、音声試料データベース１１０には、異なる音声に基づいて作成した音声試料データ群１１０ａ，１１０ｂ，１１０ｃを記憶しておく必要があるため、３つの異なる音声がＳＭＳ分析部２００に入力されることになる。なお、図示では、３つの異なる音声が並列にＳＭＳ分析部２００に入力されるように表されているが、各音声についてのＳＭＳ分析は同時に並列して行う必要はなく、個別に行うようにしてもよい。
【００３５】
ＳＭＳ分析部２００は、入力される音声に対してＳＭＳ分析を行い、各フレーム毎のＳＭＳ分析データを出力する。より具体的には、以下の手法により各フレーム毎のＳＭＳ分析データを出力する。
【００３６】
まず、入力される音声を一連のフレームに分ける。ここで、ＳＭＳ分析に用いるフレーム周期としては、一定の固定長であってもよいし、入力音声のピッチ等に応じてその周期を変更する可変長の周期であってもよい。
【００３７】
次に、フレームに分けた音声に対して高速フーリエ変換（ＦＦＴ：Ｆａｓｔ　Ｆｏｕｒｉｅｒ　Ｔｒａｎｓｆｏｒｍ）等の周波数分析を行う。この周波数分析によって得られた周波数スペクトル（複素スペクトル）から振幅スペクトルと位相スペクトルを求め、振幅スペクトルのピークに対応する特定の周波数のスペクトルを線スペクトルとして抽出する。このとき、基本周波数およびその整数倍の周波数近傍の周波数を持つスペクトルを線スペクトルとする。このようにして抽出した線スペクトルが上述した調和成分に対応している。
【００３８】
次に、上記のように入力音声から線スペクトルを抽出するとともに、抽出した線スペクトルをそのフレームの入力音声（ＦＦＴ後の波形）から減算することにより、残差スペクトルを得る。あるいは、抽出した線スペクトルから合成した調和成分の時間波形データをそのフレームの入力音声波形データから減算して残差成分の時間波形データを取得した後、これに対してＦＦＴ等の周波数分析を行うことにより残差スペクトルを得るようにしてもよい。このようにして得られた残差スペクトルが上述した非調和成分に対応している。
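The per-frame split into line spectrum (harmonic component) and residual spectrum (non-harmonic component) can be sketched as follows. This is a simplified illustration under stated assumptions: the fundamental frequency `f0` is taken as already known, a naive DFT stands in for the FFT, and the harmonic peak is chosen as the strongest bin within `tolerance_bins` of each integer multiple of `f0`. All function and parameter names are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of a real frame, returning bins 0..N/2 (stand-in for an FFT)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def split_harmonic_residual(frame, sample_rate, f0, tolerance_bins=1):
    """Split one frame's spectrum into line spectrum and residual spectrum."""
    spectrum = dft(frame)
    bin_hz = sample_rate / len(frame)
    harmonic = [0j] * len(spectrum)
    h = 1
    while h * f0 < sample_rate / 2:
        center = round(h * f0 / bin_hz)
        lo = max(center - tolerance_bins, 0)
        hi = min(center + tolerance_bins, len(spectrum) - 1)
        # Take the strongest bin near the h-th multiple of f0 as the line.
        peak = max(range(lo, hi + 1), key=lambda k: abs(spectrum[k]))
        harmonic[peak] = spectrum[peak]
        h += 1
    # Residual = frame spectrum minus the extracted line spectrum.
    residual = [s - d for s, d in zip(spectrum, harmonic)]
    return harmonic, residual


# Test frame: a 400 Hz fundamental plus a weaker second harmonic.
N, FS, F0 = 64, 6400.0, 400.0
frame = [
    math.sin(2 * math.pi * F0 * t / FS)
    + 0.5 * math.sin(2 * math.pi * 2 * F0 * t / FS)
    for t in range(N)
]
harmonic, residual = split_harmonic_residual(frame, FS, F0)
```

For this purely periodic test frame all energy lands in the line spectrum and the residual is numerically zero; a real voice leaves breath and noise in the residual.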
【００３９】
ＳＭＳ分析部２００は、上記のようにして取得した線スペクトル（調和成分）および残差スペクトル（非調和成分）からなる各フレーム毎のＳＭＳ分析データを区間切り出し部２０１に出力する。
【００４０】
区間切り出し部２０１は、ＳＭＳ分析部２００から供給される各フレーム毎のＳＭＳ分析データを、音声試料データベース１１０に記憶すべき音声素片試料データの１単位（音素あるいは音素連鎖）の長さに対応するように切り出す。区間切り出し部２０１は、各素片の単位長さに対応するようにＳＭＳ分析データを切り出し、音声試料データベース１１０に記憶させる。
【００４１】
ここで、音声試料データベース１１０に記憶される音声素片試料データは、音素あるいは音素連鎖毎に切り出されたＳＭＳデータであり、調和成分については、その音素あるいは音素連鎖に含まれるフレーム全てのスペクトル包絡（線スペクトル（倍音系列）の強度（振幅）および位相のスペクトル）が記憶される。なお、このようなスペクトル包絡そのものを調和成分として記憶させるようにしてもよいが、該スペクトル包絡を何らかの関数で表現したものを記憶させるようにしてもよいし、調和成分を逆ＦＦＴ等して得た時間波形として記憶させるようにしてもよい。本実施形態では、非調和成分についても調和成分と同様に、強度スペクトルと位相スペクトルとして記憶させることとするが、上記調和成分と同様、関数や時間波形として記憶させるようにしてもよい。
【００４２】
このような音声に対するＳＭＳ分析および区間切り出しが３つの異なる入力音声の各々について行われ、この結果、音声試料データ群１１０ａ，１１０ｂ，１１０ｃといった３つの異なる音声に基づいて作成された音声素片試料データ（音素あるいは音素連鎖毎のＳＭＳ分析データ）の群が音声試料データベース１１０に記憶される。
【００４３】
以上が本実施形態に係る合唱合成装置１００の音声試料データベース１１０の作成手法の詳細である。
【００４４】
次に、上述したように異なる音声に基づいて作成された３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃを記憶する音声試料データベース１１０を用いて歌唱音信号を生成する各歌唱生成器１２０，１２１，１２２について説明する。なお、歌唱生成器１２０，１２１，１２２は、各々同様の構成であるため、以下においては歌唱生成器１２０の構成について図３を参照しながら説明し、他の歌唱生成器１２１，１２２についての説明を割愛する。
【００４５】
同図に示すように、この歌唱生成器１２０は、音声素片選択部３０１と、ピッチ決定部３０２と、継続時間長調整部３０３と、音声素片接続部３０４と、調和成分生成部３０５と、加算部３０６と、逆ＦＦＴ（高速フーリエ変換）部３０７と、窓掛け部３０８と、オーバーラップ部３０９とを備えている。
【００４６】
音声素片選択部３０１は、合唱制御部１４０（図１参照）から供給される歌詞情報および指定情報にしたがって、必要となる音声素片試料データを音声試料データベース１１０から読み出す。より具体的には、供給される歌詞情報を音声記号（音素あるいは音素連鎖）列に変換し、変換した音声記号列にしたがって音声試料データベース１１０から音声素片試料データを読み出す。例えば、「サイタ」（ｓａｉｔａ）といった歌詞情報にしたがって歌唱音信号を生成する場合には、該歌詞情報が「＃ｓ」、「ｓ」、「ｓａ」、「ａ」、「ａｉ」、「ｉ」、「ｉｔ」、「ｔ」、「ｔａ」、「ａ」、「ａ＃」といった音声記号列に変換され、これらの各音声記号に対応する音声素片試料データが音声試料データベース１１０から読み出されることになる。
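The conversion of lyrics into a sequence of phonetic symbols can be sketched as below. The function name `to_unit_sequence` is hypothetical, the lyrics are assumed to be already split into phonemes, and `#` is used as the silence marker as in the example for "saita" above.

```python
def to_unit_sequence(phonemes):
    """Expand a phoneme list into phoneme and phoneme-chain units.

    Each phoneme is followed by the transition to its successor;
    '#' marks silence at the start and at the end.
    """
    if not phonemes:
        return []
    units = ["#" + phonemes[0]]
    for i, p in enumerate(phonemes):
        units.append(p)
        if i + 1 < len(phonemes):
            units.append(p + phonemes[i + 1])
    units.append(phonemes[-1] + "#")
    return units


print(to_unit_sequence(["s", "a", "i", "t", "a"]))
# ['#s', 's', 'sa', 'a', 'ai', 'i', 'it', 't', 'ta', 'a', 'a#']
```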
【００４７】
音声素片選択部３０１は、上記のように歌詞情報にしたがって読み出すべき音声素片試料データを決定し、合唱制御部１４０から供給される指定情報に指定される音声試料データ群の中から決定した音声素片試料データを読み出す。例えば、指定情報が音声試料データ群１１０ａを指定している場合には、音声試料データベース１１０の音声試料データ群１１０ａに含まれる「＃ｓ」、「ｓ」、「ｓａ」、「ａ」、「ａｉ」、「ｉ」、「ｉｔ」、「ｔ」、「ｔａ」、「ａ」、「ａ＃」に対応した音声素片試料データを読み出す。
【００４８】
ピッチ決定部３０２は、合唱制御部１４０（図１参照）から供給されるメロディ情報に応じて歌唱音のピッチを決定し、決定したピッチを示すピッチ情報を調和成分生成部３０５に出力する。
【００４９】
継続時間長調整部３０３には、音声素片選択部３０１によって読み出された音声素片試料データ（調和成分および非調和成分）が供給される。ここで、音声素片選択部３０１は、読み出した音声素片試料データをそのまま継続時間長調整部３０３に供給するようにしてもよいが、メロディ情報に示されるピッチ等に応じて適当な補正処理を施してから継続時間長調整部３０３に供給するようにしてもよい。
【００５０】
継続時間長調整部３０３は、メロディ情報等によって決定される音素あるいは音素連鎖毎の発音時間長に応じて音声素片選択部３０１から供給された各音声素片試料データの時間長を変更する処理を行う。より具体的には、ある音声素片試料データを、その時間長より短い時間として使用する場合には、該音声素片試料データからフレームを間引く処理を行う。一方、ある音声素片試料データを、その時間長よりも長い時間継続して使用する場合には、その音声素片試料データを使用する時間長の間繰り返して時間を長くするループ処理を行う。このループ処理において、ある音声素片試料データを繰り返す場合には、当該音声素片試料データの最初から最後（０〜ｔ）までのデータの後に、当該音声素片試料データの最初（０）からデータを接続して繰り返すようにしてもよいし、最初から最後（０〜ｔ）までのデータの後に、当該音声素片試料データの時間的に最後（ｔ）の部分から最初の部分に向かってデータを接続して繰り返すようにしてもよい。
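The two duration adjustments described above (frame thinning and looping) can be sketched like this. The text does not say which frames are dropped or whether the end frame is repeated at the turning point, so evenly spaced thinning and a ping-pong loop that does not repeat the end frames are assumptions; `adjust_duration` and its parameters are hypothetical names.

```python
def adjust_duration(frames, target_len, loop="forward"):
    """Shorten by thinning frames, or lengthen by looping.

    loop="forward":  0..t, then again from 0
    loop="pingpong": 0..t, then back from t toward 0 (ends not repeated)
    """
    n = len(frames)
    if n == 0 or target_len <= 0:
        return []
    if target_len <= n:
        # Thinning: keep evenly spaced frames (assumed strategy).
        return [frames[i * n // target_len] for i in range(target_len)]
    if loop == "forward":
        return [frames[i % n] for i in range(target_len)]
    if loop == "pingpong":
        if n == 1:
            return list(frames) * target_len
        period = 2 * (n - 1)
        out = []
        for i in range(target_len):
            j = i % period
            out.append(frames[j if j < n else period - j])
        return out
    raise ValueError("loop must be 'forward' or 'pingpong'")


print(adjust_duration(list("ABCD"), 2))              # ['A', 'C']
print(adjust_duration(list("ABCD"), 6))              # ['A', 'B', 'C', 'D', 'A', 'B']
print(adjust_duration(list("ABCD"), 7, "pingpong"))  # ['A', 'B', 'C', 'D', 'C', 'B', 'A']
```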
【００５１】
継続時間長調整部３０３は、上記のように各音声素片の発音時間長に応じて音声素片試料データ（調和成分および非調和成分）の継続時間長を調整した後、時間調整後の音声素片試料データを音声素片接続部３０４に出力する。
【００５２】
音声素片接続部３０４は、継続時間長調整部３０３から供給された音声素片試料データの調和成分のデータを時系列に接続するとともに、非調和成分のデータを時系列に接続する。このような接続に際し、接続する２つの調和成分のスペクトル包絡の形状の差が大きい場合には、スムージング処理等を施すようにすればよい。音声素片接続部３０４は、接続した調和成分のデータを調和成分生成部３０５に出力するとともに、接続した非調和成分のデータを加算部３０６に出力する。
【００５３】
調和成分生成部３０５には、音声素片接続部３０４から調和成分のデータ（スペクトル包絡情報）が供給されるとともに、ピッチ決定部３０２からメロディ情報にしたがったピッチ情報が供給される。調和成分生成部３０５は、音声素片接続部３０４からのスペクトル包絡情報に示されるスペクトル包絡形状を維持しつつ、ピッチ決定部３０２からのピッチ情報に対応する倍音成分を生成する。
【００５４】
加算部３０６には、音声素片接続部３０４からの非調和成分のデータと、調和成分生成部３０５からの調和成分のデータが供給され、加算部３０６は両者を合成して逆ＦＦＴ部３０７に出力する。逆ＦＦＴ部３０７は、加算部３０６から供給される加算された周波数領域の信号に対し、逆ＦＦＴを施すことにより時間領域の波形信号に変換し、変換後の波形信号を窓掛け部３０８に出力する。窓掛け部３０８では、時間領域の波形信号に対してフレーム長に対応した窓関数が乗算され、オーバーラップ部３０９が乗算後の波形信号をオーバーラップさせながら歌唱音信号を生成する。このようにして歌唱生成器１２０では、合唱制御部１４０（図１参照）から供給された楽曲情報のあるパートの歌詞情報およびメロディ情報にしたがった歌唱音信号が生成され、生成された歌唱音信号が加算器１３０（図１参照）に出力される。
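The windowing and overlap stages at the end of this chain can be sketched as follows, taking as input the time-domain frames that the inverse FFT stage would produce. The periodic Hann window and the half-frame hop are assumptions (the text only calls for a window function matched to the frame length), and the function names are hypothetical.

```python
import math


def hann(n):
    """Periodic Hann window of length n (assumed window shape)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def overlap_add(frames, hop):
    """Window each time-domain frame and sum the frames at hop-sized offsets."""
    frame_len = len(frames[0])
    window = hann(frame_len)
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for k, frame in enumerate(frames):
        for i, x in enumerate(frame):
            out[k * hop + i] += x * window[i]
    return out


# Four constant frames of length 8 with a hop of half a frame.
signal = overlap_add([[1.0] * 8 for _ in range(4)], hop=4)
```

With this window and hop the overlapping windows sum to one, so the fully overlapped region of a constant input is reconstructed without amplitude ripple.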
【００５５】
以上が歌唱生成器１２０の詳細な構成であり、図１に示す他の歌唱生成器１２１，１２２（歌唱生成器１２０と同様の構成）からも上記のように合唱制御部１４０から供給された楽曲情報のあるパートの歌詞情報およびメロディ情報にしたがって生成された歌唱音信号が出力される。ここで、上述したように各歌唱生成器１２０，１２１，１２２は、合唱制御部１４０から振り分けられたパートに対応する歌唱音信号を生成する際に、各々異なる音声試料データ群１１０ａ，１１０ｂ，１１０ｃから音声素片試料データを読み出して生成に用いているので、各々が生成する歌唱音信号の微細な特徴（ピッチのゆらぎ等）は異なったものとなる。
【００５６】
加算器１３０は、このように合唱曲の楽曲情報の各パートにしたがって歌唱生成器１２０，１２１，１２２が生成した歌唱音信号を合成して出力する。加算器１３０から出力された３つのパートの歌唱音信号が合成された合唱音信号は、図示せぬＤ／Ａ（Ｄｉｇｉｔａｌ　ｔｏ　Ａｎａｌｏｇ）変換器によってアナログの音声波形信号に変換された後、アンプ等を介してスピーカから放音される。これにより、聴取者は、複数パートからなる合唱曲の楽曲情報にしたがった合唱音を聴くことができる。この合唱合成装置１００から放音される合唱音は、各パートの歌唱音の微細な特徴（ピッチのゆらぎ等の相違に起因する声質等）が相違しており、聴取者により自然な印象を与えることが可能な合唱音を発音することができるのである。
【００５７】
Ｂ．第２実施形態
次に、本発明の第２実施形態に係る合唱合成装置について、図４を参照しながら説明する。同図に示すように、上記第１実施形態における合唱合成装置１００の音声試料データベース１１０が３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃを記憶していたのに対し、第２実施形態に係る合唱合成装置４００における音声試料データベース１１０には、同一の音素または音素連鎖については１種類の音声素片試料データしか記憶されていない点で相違している。第２実施形態に係る合唱合成装置４００は、このように１つの音素または音素連鎖について１つの音声素片試料データのみを記憶する音声試料データベース１１０を用いて、上記第１実施形態と同様により自然な印象を与えることが可能な合唱音信号を合成することができるようになっている。以下、合唱合成装置４００の構成について、上記第１実施形態に係る合唱合成装置１００との相違点を中心に説明する。
【００５８】
合唱合成装置４００における歌唱生成器１２０，１２１，１２２の各々は、上記第１実施形態と同様であり、歌詞情報およびメロディ情報を有する楽曲情報にしたがって音声試料データベース１１０から必要となる音声素片試料データを読み出し、読み出した音声素片試料データを用いて歌唱音信号を生成する。第２実施形態においては、音声試料データベース１１０には１つの音素あるいは音素連鎖については１つの音声素片試料データしか記憶されていないため、合唱曲の楽曲情報にしたがって歌唱音信号を生成する際には、各歌唱生成器１２０，１２１，１２２が同一の音声素片試料データを用いることもあり得る。上述したように複数のパートの歌唱音信号を同一の音声素片試料データを用いて生成した場合、微細な特徴（ピッチのゆらぎ等）が基本的に同一になるため、聴取者に不自然な印象を与えてしまう。
【００５９】
そこで、この合唱合成装置４００では、合唱制御部１４０が合唱曲の楽曲情報の歌詞情報およびメロディ情報を各パート毎に分割して各歌唱生成器１２０，１２１，１２２に出力するとともに、音声試料データベース１１０に記憶されている音声素片試料データをどの時間に対応する部分から使用を開始するかを指定する指定情報を各歌唱生成器１２０，１２１，１２２に出力するようになっている。
【００６０】
上述した第１実施形態で説明したように、音声試料データベース１１０に記憶される音声素片試料データは、発声者の発した音声に基づいて作成されたものであり、所定の時間長（１フレーム〜数フレーム等）の音声波形に基づいて作成されたデータである。すなわち、前記所定の時間内における時間と振幅との関係で表される音声波形に基づいて作成されたデータである。したがって、上記第１実施形態のように音声素片試料データが周波数領域のデータとして記憶されている場合にも、そのデータは時間領域の音声波形にＦＦＴ等を施して得られたものである。合唱制御部１４０は、このように時間に伴って変化する情報である音声素片試料データをどの時間に対応する部分から使用するかを指定する指定情報を各歌唱生成器１２０，１２１，１２２に供給するのである。
【００６１】
ここで、合唱制御部１４０は、各歌唱生成器１２０，１２１，１２２が音声素片試料データを、互いに異なる時間に対応する部分から使用を開始して歌唱音信号を生成するように、各歌唱生成器１２０，１２１，１２２に異なる使用開始時間を指定する指定情報を出力する。
【００６２】
各歌唱生成器１２０，１２１，１２２は、合唱制御部１４０から供給される各パートの歌詞情報に基づいて必要となる音声素片試料データを音声試料データベース１１０から読み出すと共に、読み出した音声素片試料データを、合唱制御部１４０から指定情報に指定される時間に対応する部分から使用を開始して歌唱音信号の生成を行う。
【００６３】
以下、３つのパートの歌詞情報にしたがって読み出される音声素片試料データが母音の「ａ」であり、音声素片試料データ「ａ」がＦ０〜Ｆ１３といった１３のフレーム（時間０〜Ｔ）からなり、該音声素片試料データを１３フレーム分の長さを使用して各歌唱生成器１２０，１２１，１２２が歌唱音信号を生成する場合について、図５および図６を参照しながら具体的に例示して説明する。
【００６４】
図５に示す例では、歌唱生成器１２０に対しては最初のフレームＦ０から使用を開始するように指定する指定情報が供給されており、歌唱生成器１２１に対してはフレームＦ３から使用を開始するように指定する指定情報が供給されており、歌唱生成器１２２に対してはフレームＦ６から使用を開始するように指定する指定情報が供給されている。なお、図示では説明の便宜上、音声素片試料データが時間領域の音声波形として示されているが、音声試料データベース１１０に記憶しておくデータは、上記第１実施形態のように周波数領域で表現される調和成分（線スペクトル）および非調和成分（残差スペクトル）といった形態であってもよい。
【００６５】
このような指定情報が供給されている場合には、図６に示すように、歌唱生成器１２０は、フレームＦ０，Ｆ１，Ｆ２……Ｆ１３といった順序、つまり音声素片試料データ「ａ」をそのまま使用して歌唱音信号の生成に用いる。また、歌唱生成器１２１は、フレームＦ３，Ｆ４，Ｆ５……Ｆ１３，Ｆ０，Ｆ１，Ｆ２，Ｆ３といった順次で音声素片試料データ「ａ」を使用して歌唱音信号の生成を行う。さらに、歌唱生成器１２２は、フレームＦ６，Ｆ７……Ｆ１３，Ｆ０，Ｆ１……Ｆ５といった順序の音声素片試料データ「ａ」を使用して歌唱音信号の生成を行う。
【００６６】
このように合唱制御部１４０が互いに異なる時間に対応する部分から使用を開始して歌唱音信号を生成するように指定情報を出力することにより、同じ音素「ａ」を同じ時間長（０〜Ｔまで）だけ用いて歌唱音信号を生成する際にも、各歌唱生成器１２０，１２１，１２２が実際に用いるデータは異なるものとなっている。すなわち、各歌唱生成器１２０，１２１，１２２が実際に用いる音声素片試料データに示される微細な特徴（ピッチのゆらぎ等）は異なったものとなり、１つの音素あるいは音素連鎖について１種類の音声素片試料データを用いて、各歌唱生成器１２０，１２１，１２２が微細な特徴の異なる歌唱音信号を生成することができるのである。
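The use-start offsets for a single-phoneme unit can be sketched as a wrap-around frame order. The sketch assumes the unit consists of frames F0 to F13 (14 frames) and that each generator uses every frame exactly once; `wrapped_order` is a hypothetical name.

```python
def wrapped_order(num_frames, start):
    """Frame indices beginning at `start` and wrapping around to 0."""
    return [(start + i) % num_frames for i in range(num_frames)]


# Start frames F0, F3 and F6 for generators 120, 121 and 122.
orders = {gen: wrapped_order(14, start)
          for gen, start in {120: 0, 121: 3, 122: 6}.items()}
for gen, order in orders.items():
    print(gen, ["F%d" % i for i in order])
```

Each generator reads the same stored data, but at any given moment the three generators are at different frames, so the fine features they reproduce differ.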
【００６７】
ところで、音素「ａ」のように単一の音素についての音声素片試料データを用いる場合には、上記のように単純にデータ中の使用開始時間をずらすといった手法により、各パートについて生成される歌唱音の微細な特徴を変えてより自然な合唱歌唱音を合成することができるが、複数の音素が連なる音素連鎖についての音声素片試料データの場合には、単純にデータ中の使用開始時間をずらすだけでは不都合が生じることもある。例えば、「ａｉ」といった音素連鎖についての音声試料データの場合、時間領域における前半部分は「ａ」の音素をより強く反映したデータであり、後半部分は「ｉ」の音素をより強く反映したデータである。したがって、音素連鎖「ａｉ」の歌唱音信号を生成するために、音素「ｉ」の影響の強い後半部分から使用を開始した場合には、音素連鎖「ｉａ」に類似した傾向を持つデータを用いることになってしまう虞があり、この場合、生成すべき音素連鎖「ａｉ」についての信号が正確に生成できなくなってしまう。
【００６８】
そこで、本実施形態では、複数の音素連鎖に対応する音声素片試料データを用いる場合には、合唱制御部１４０は、図７に示すような指定情報を各歌唱生成器１２０，１２１，１２２に出力するようにしている。同図に示す例では、歌唱生成器１２０に対しては最初のフレームＦ０から使用を開始するように指定する指定情報が供給されており、歌唱生成器１２１に対してはフレームＦ２から使用を開始するように指定する指定情報が供給されており、歌唱生成器１２２に対してはフレームＦ４から使用を開始するように指定する指定情報が供給されている。すなわち、上記単一の音素についての指定情報と比較すると、各歌唱生成器１２０，１２１，１２２に対して指定する使用開始時間が前半部分（「ａ」の影響の強い）に集中している。このように各パートの使用開始時間をデータの前半部分に集中させることで、上記のように実際に使用するデータが音素連鎖「ｉａ」に類似してしまうことを抑制している。
【００６９】
また、上記のように指定情報が供給されている場合に、歌唱生成器１２１がフレームＦ２，Ｆ３，Ｆ４……Ｆ１３，Ｆ０，Ｆ１，Ｆ２といった順序、すなわち一方向に順番に音声素片試料データを使用して歌唱音信号の生成を行うと、音素「ａ」の影響の強いフレームＦ０〜Ｆ２が本来「ｉ」の影響を強くすべき後半部分のデータとして用いられてしまうことになる。そこで、本実施形態では、複数の音素からなる音素連鎖についての音声素片試料データを用いる場合には、最後のフレーム（Ｆ１３）の後にフレームＦ１に戻るのではなく、フレームＦ１２，Ｆ１１……といったように逆方向に戻る順序でフレームを用いるようにしている。したがって、図７に示すように使用開始フレームが指定されている場合には、図８に示すように、歌唱生成器１２０は、フレームＦ０，Ｆ１，Ｆ２……Ｆ１３といった順序、つまり音声試料データベース１１０に記憶されている音声素片試料データをそのまま使用して歌唱音信号の生成に用いる。また、歌唱生成器１２１は、フレームＦ２，Ｆ３，Ｆ４……Ｆ１３，Ｆ１２，Ｆ１１といった順次で音声素片試料データを使用して歌唱音信号の生成を行う。さらに、歌唱生成器１２２は、フレームＦ４，Ｆ５，Ｆ６……Ｆ１３，Ｆ１２，Ｆ１１，Ｆ１０，Ｆ９といった順序の音声素片試料データを使用して歌唱音信号の生成を行う。なお、フレームＦ１３からフレームＦ１２といったように逆方向に戻る順序でフレームを使用する際には両者の接続部分に雑音等が生じる虞があるため、各フレームの接続部分において振幅調整処理やクロスフェード処理等を施すようにすればよい。
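The reversed read order for phoneme-chain units can be sketched as below, assuming 14 frames (F0 to F13) and an output of 14 frames per generator. The name `reflected_order` is hypothetical, and bouncing again at frame 0 for very long outputs is an extension that the text does not describe.

```python
def reflected_order(num_frames, start, length):
    """Read forward from `start`; on reaching the last frame, step backward."""
    if num_frames == 1:
        return [0] * length
    order = []
    idx, step = start, 1
    for _ in range(length):
        order.append(idx)
        # Reverse direction at either end instead of wrapping around.
        if idx + step > num_frames - 1 or idx + step < 0:
            step = -step
        idx += step
    return order


print(reflected_order(14, 2, 14))  # F2 ... F13, F12, F11
print(reflected_order(14, 4, 14))  # F4 ... F13, F12, F11, F10, F9
```

Because the read position never returns to the first frames, the "a"-dominated start of an "ai" unit is not reused where the "i"-dominated tail is expected.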
【００７０】
複数の音素からなる音素連鎖の音声素片試料データを各歌唱生成器１２０，１２１，１２２で用いる場合には、以上のようにすることでより正確に音素連鎖を生成することができ、また各歌唱生成器１２０，１２１，１２２から出力される歌唱音信号の微細な特徴等が異なるものとなる。
【００７１】
以上説明したように、第２実施形態に係る合唱合成装置４００では、１つの音素あるいは音素連鎖について１つの音声素片試料データしか記憶されていなくても、１つの音声素片試料データを用いて、上記第１実施形態と同様により自然な印象を与えることが可能な合唱音信号を合成することができる。すなわち、音声試料データベース１１０に記憶させておくデータ量を抑制しつつ、より自然な印象を与えることが可能な合唱音信号を合成することができるのである。
【００７２】
Ｃ．変形例
なお、本発明は、上述した第１および第２実施形態に限定されるものではなく、以下に例示するような種々の変形が可能である。
【００７３】
（変形例１）
上述した各実施形態においては、音素あるいは音素連鎖といった単位の音声素片試料データを接続して歌唱音信号を生成するようにしているが、ビブラートといわれる歌唱表現法があり、上記各実施形態における合唱合成装置にこのビブラートによる歌唱表現を加える機能を付加するようにしてもよい。
【００７４】
従来より、ビブラートによる歌唱音を電子的に発音するための歌唱音信号を生成する手法としては、上記各実施形態のように音素あるいは音素連鎖単位の音声素片試料データを接続するとともに、該接続した音声素片試料データによって表現される波形に約６Ｈｚ程度の周波数変調を付与する方法が知られている。このような方法を実施するための構成を上記各実施形態における合唱合成装置に加えるようにしてもよいが、聴取者により自然な印象を与えることが可能なビブラート歌唱音信号の生成方法として、発声者がビブラート歌唱法で歌唱した時の音声に基づいて作成したビブラート音声試料データを用いる方法があり、この方法を実施するための構成を上記各実施形態に係る合唱合成装置に付加することが好ましい。
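The conventional approach mentioned first, a frequency modulation of about 6 Hz applied to the pitch, can be sketched as a pitch contour. The modulation depth in cents, the frame rate, and the function name are hypothetical values chosen for illustration; the text specifies only the approximate 6 Hz rate.

```python
import math


def vibrato_pitch_contour(base_hz, duration_s, frame_rate_hz=100.0,
                          rate_hz=6.0, depth_cents=50.0):
    """Per-frame pitch values with sinusoidal modulation around base_hz."""
    n = int(round(duration_s * frame_rate_hz))
    return [
        base_hz * 2 ** (depth_cents / 1200.0
                        * math.sin(2 * math.pi * rate_hz * i / frame_rate_hz))
        for i in range(n)
    ]


contour = vibrato_pitch_contour(440.0, 1.0)
```

A contour generated this way is identical for every part unless its parameters are varied, which is why the text prefers vibrato data derived from recorded singing.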
【００７５】
以下、図９を参照しながら、発声者のビブラート歌唱音声に基づいて作成したビブラート音声試料データを用いて歌唱音信号を生成する機能を上記第１実施形態における合唱合成装置に付加した場合を例に挙げて説明する。
【００７６】
同図に示すように、この合唱合成装置１００’における音声試料データベース１１０には、上記音声試料データ群１１０ａ，１１０ｂ，１１０ｃといった音素あるいは音素連鎖を単位とした音声素片試料データに加え、ビブラート歌唱時の歌唱音声に基づいて作成されたビブラート音声試料データが記憶されている。ここで、音声試料データベース１１０には、各々異なる音声に基づいて作成された３つのビブラート音声試料データＢＤａ，ＢＤｂ，ＢＤｃが記憶されている。
【００７７】
この構成の下、合唱制御部１４０は、上述した第１実施形態と同様、各歌唱生成器１２０，１２１，１２２に各パートの歌詞情報およびメロディ情報と、使用する音声試料データ群を指定する指定情報とに加え、３つのビブラート音声試料データＢＤａ，ＢＤｂ，ＢＤｃのいずれを使用するかを指定する第２の指定情報を供給するようになっている。ここで、各歌唱生成器１２０，１２１，１２２に供給される第２の指定情報は、異なるビブラート音声試料データを使用するように指定する情報である。このような第２の指定情報を各歌唱生成器１２０，１２１，１２２に供給することによって、各歌唱生成器１２０，１２１，１２２はビブラート歌唱音信号を生成する際に各々異なるビブラート音声試料データを読み出し、上記実施形態と同様に接続した音声素片試料データによって表現される音声波形に、読み出したビブラート音声試料データによって表現される波形を重ね、重ね合わせた波形信号を歌唱音信号として出力する。
【００７８】
このようにビブラート歌唱音信号を生成する際に、各歌唱生成器１２０，１２１，１２２が異なる音声に基づいて作成された３つのビブラート音声試料データＢＤａ，ＢＤｂ，ＢＤｃを各パート毎に使い分けることにより、生成されるビブラート歌唱音信号の微細な特徴（ビブラート時の周波数の変動具合等）も各パート毎に異なったものとなる。このようにビブラート歌唱音の各パート毎の相関関係がほとんどなく、各々のパートが固有の特徴を持つことになるため、当該合唱合成装置１００’によって合成された合唱音信号に基づいた歌唱音のビブラート部分を聴いた聴取者に対して、より自然な印象を与えることが可能となる。
【００７９】
ところで、合唱音において各パートのビブラート部分の特徴が基本的に同一であることは、聴取者にとって他の部分の特徴が同一である場合よりも不自然な印象を与えるものである。したがって、ビブラート部分だけでも各パート毎に固有の特徴を付与した装置が要望されることもあり得る。このような場合には、上記各実施形態のように音素あるいは音素連鎖についての音声試料データは、各パートで同一のものをそのまま使用して歌唱音信号を生成し、生成した歌唱音信号に各パート毎に異なるビブラート音声試料データによって表現される波形を加算してビブラート効果を付与するようにしてもよい。
【００８０】
（変形例２）
また、図９に示すように、各歌唱生成器１２０，１２１，１２２の数に対応して３つのビブラート音声試料データを用いるようにしてもよいが、図１０に示す合唱合成装置４００’のように、歌唱生成器１２０，１２１，１２２が同一のビブラート音声試料データを用いてビブラート部分の歌唱音信号を生成するようにしてもよい。
【００８１】
上述した実施形態で説明したように歌唱生成器１２０，１２１，１２２は、音声試料データ群１１０ａ，１１０ｂ，１１０ｃを使い分けることにより各々異なる固有の特徴を有する歌唱音信号を生成することができるので、このように生成した歌唱音信号に同一のビブラート音声試料データによって表現される波形を加算しても、各々の歌唱生成器１２０，１２１，１２２から出力されるビブラート部分の歌唱音信号は固有の特徴を有したものとなる。したがって、単純に１つのビブラート音声試料データを各歌唱生成器１２０，１２１，１２２が用いるようにしてもよいが、ビブラート音声試料データについても、上記第２実施形態において各歌唱生成器１２０，１２１，１２２による音声素片試料データの使用方法として説明したように、各々の歌唱生成器１２０，１２１，１２２が同一のビブラート音声試料データの異なる時間に対応する部分から使用を開始するようにしてもよい。この場合、合唱制御部１４０がどの時間に対応する部分から使用を開始するかを指定する指定情報を各歌唱生成器１２０，１２１，１２２に供給するようにすればよい。このようにすることで、各歌唱生成器１２０，１２１，１２２がビブラート付与のために用いる実際の音声試料データは異なる特徴を有するものとなる。したがって、各々のパートのビブラート部分の歌唱音信号が固有の特徴を有するものとなり、当該合唱合成装置４００’によって合成された合唱音信号に基づいた歌唱音のビブラート部分を聴いた聴取者に対して、より自然な印象を与えることが可能となる。
【００８２】
（変形例３）
また、上記変形例においては、生成する歌唱音信号にビブラート効果を付与するために音声試料データベース１１０にビブラート音声試料データを記憶させておくようにしていたが、ビブラート以外のトレモロ、ポルタメント等の種々の歌唱法による歌唱音を電子的に放音するために、発声者によるトレモロ部分の歌唱音声や、ポルタメント部分の歌唱音声に基づいて作成した音声試料データを音声試料データベース１１０に記憶させておくようにしてもよい。この場合にも、上述した変形例におけるビブラート音声試料データと同様、各パート毎に音声試料データを用意しておいたり、同じ音声試料データであっても異なる時間に対応した部分から使用を開始したりすることにより、各パートのトレモロやポルタメント部分の歌唱音信号に固有の特徴を持たせることができる。
【００８３】
（変形例４）
また、上述した第１実施形態では、３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃを音声試料データベース１１０に記憶させるようにしていたが、高音、中音、低音といったように異なるピッチの音声に基づいて、各音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれる音声素片試料データを作成するようにしてもよい。例えば、高音の音声に基づいて作成した音声素片試料データは音声試料データ群１１０ａに含ませるようにし、中音の音声に基づいて作成した音声素片試料データは音声試料データ群１１０ｂに含ませるようにし、低音の音声に基づいて作成した音声素片試料データを音声試料データ群１１０ｃに含ませるようにしてもよい。
【００８４】
このように各音域毎に作成された音声試料データ群１１０ａ，１１０ｂ，１１０ｃを記憶した音声試料データベース１１０を用いる場合、合唱制御部１４０は、楽曲情報に含まれる複数のパートのうち、高音域のメロディからなるパートの歌唱音信号の生成を担当する歌唱生成器に対し、高音の音声に基づいて作成した音声試料データ群１１０ａを用いるように指定する指定情報を出力する。また、中音域のメロディからなるパートの歌唱音信号の生成を担当する歌唱生成器に対し、中音の音声に基づいて作成した音声試料データ群１１０ｂを用いるように指定する指定情報を出力し、さらに低音域のメロディからなるパートの歌唱音信号の生成を担当する歌唱生成器に対し、低音の音声に基づいて作成した音声試料データ群１１０ｃを用いるように指定する指定情報を出力する。これにより各歌唱生成器１２０，１２１，１２２は、各々が担当するパートの歌唱音信号の生成により好適な音声素片試料データを用いることができ、より高品位の歌唱音信号を生成することができる。
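Selecting a data group by the register of a part can be sketched as follows. The text gives no numeric boundaries between high, middle, and low registers, so the MIDI note thresholds and the function name `group_for_part` are hypothetical.

```python
def group_for_part(mean_midi_note, low_max=55, mid_max=67):
    """Pick the data group recorded at the pitch range closest to the part.

    110a: high voice, 110b: middle voice, 110c: low voice.
    The thresholds are illustrative assumptions.
    """
    if mean_midi_note > mid_max:
        return "110a"
    if mean_midi_note > low_max:
        return "110b"
    return "110c"


for note in (72, 60, 48):
    print(note, group_for_part(note))
```

Calling the same function per note rather than per part would give the mid-song switching of data groups that the following paragraph allows.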
【００８５】
なお、上記のようにある楽曲に対応する歌唱音信号の生成時には各歌唱生成器１２０，１２１，１２２が使用する音声試料データ群１１０ａ，１１０ｂ，１１０ｃを固定するようにしてもよいが、各パート毎にメロディ情報によって決定される各パート毎のピッチの高低が時間毎に変化することも考えられる。この場合には、ある１つの楽曲の歌唱音信号を生成する際に、各パート毎のメロディ情報によって決定される各パート毎のピッチに高低に応じて合唱制御部１４０が各歌唱生成器１２０，１２１，１２２に対して指定する音声試料データ群を楽曲の途中で逐次変更するような指定情報を出力するようにしてもよい。
【００８６】
（変形例５）
また、上述した変形例では、異なるピッチ毎の音声に基づいて作成した音声試料データ群１１０ａ，１１０ｂ，１１０ｃを音声試料データベース１１０に記憶させるようにしていたが、歌唱時には同じ音韻を発声している間にピッチが大きく変動することもある。したがって、音声試料データベース１１０に、同じ音韻、例えば「ａ」を発声している間にピッチ（音高）を変動させて発した音声に基づいて音声素片試料データを作成し、該音声素片試料データを音声試料データベース１１０に記憶させるようにしてもよい。このように音声試料データベース１１０には、上述した各実施形態において説明した同一ピッチの音素あるいは音素連鎖だけではなく、歌唱時に起こりうる様々なピッチ変動等を考慮して音声素片試料データを作成しておくようにしてもよい。
【００８７】
（変形例６）
また、上述した第１実施形態では、音声試料データベース１１０に記憶されている音声試料データ群１１０ａ，１１０ｂ，１１０ｃを各歌唱生成器１２０，１２１，１２２が使い分けて用い、第２実施形態では、同一の音声素片試料データを異なる時間に対応する部分から使用を開始することにより、より自然な印象を与えることが可能な合唱音信号を合成していた。このような第１および第２実施形態に係る合唱合成装置に、音声試料データベース１１０から読み出した音声素片試料データに示される何らかの値（すなわち音を決定付けるパラメータ）を各歌唱生成器１２０，１２１，１２２毎に変更してから供給するパラメータ変更手段を設けるようにしてもよい。このようなパラメータ変更手段を上記第１実施形態に係る合唱合成装置に付加した場合の構成を図１１に示す。
【００８８】
同図に示すように、この合唱合成装置１００”は、上記第１実施形態における合唱合成装置１００の構成に加え、各歌唱生成器１２０，１２１，１２２に対応して設けられるパラメータ変更部２２０，２２１，２２２を備えている。この構成の下、合唱制御部１４０は、上述した第１実施形態と同様、歌唱生成器１２０，１２１，１２２に各パートの歌詞情報およびメロディ情報と、どの音声試料データ群を使用するかを指定する指定情報を出力するとともに、パラメータ変更部２２０，２２１，２２２の各々に対してパラメータの変更内容を示す変更情報を出力する。ここで、合唱制御部１４０は、音声試料データベース１１０から読み出した音声素片試料データに対して各々異なる内容の変更が施されるような変更情報を各パラメータ変更部２２０，２２１，２２２に出力する。
【００８９】
パラメータ変更部２２０，２２１，２２２は、対応する歌唱生成器１２０，１２１，１２２が必要とする音声素片試料データを指定情報に示される音声試料データ群の中から読み出し、合唱制御部１４０から供給される変更情報にしたがって読み出した音声素片試料データを変更する。そして、変更後の音声素片試料データを対応する歌唱生成器１２０，１２１，１２２に供給する。そして、各歌唱生成器１２０，１２１，１２２が変更後の音声素片試料データを用いて歌唱音信号を生成する。
【００９０】
ここで、パラメータ変更部２２０，２２１，２２２が音声試料データベース１１０から読み出した音声素片試料データに対して行う変更処理の内容としては、音韻性を損なわない程度に音色等を変更する処理であれば種々の変更処理を適用することができる。例えば、音声試料データベース１１０から読み出したある音声素片試料データによって表現される音声のフォルマント構造をモデル化し、フォルマントのバンド幅を数％変更したり、バンドの中心周波数を１０Ｈｚ程度シフトする等によって音色を微妙に変更する方法がある。この場合、変更するフォルマントのバンド幅の割合や、バンドの中心周波数のシフトする量を各パラメータ変更部２２０，２２１，２２２毎に異なる値とすることにより、各パラメータ変更部２２０，２２１，２２２によって読み出された音声素片試料データに示される音声の音色が微妙に異なるものとなる。
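The per-part formant changes can be sketched as below. The `Formant` model, the base formant values, and the specific scale and shift values are hypothetical; only the orders of magnitude (a few percent of bandwidth, about 10 Hz of center frequency) come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Formant:
    center_hz: float
    bandwidth_hz: float


def vary_formants(formants, bandwidth_scale, center_shift_hz):
    """Return a copy with bandwidths scaled and center frequencies shifted."""
    return [
        replace(f,
                center_hz=f.center_hz + center_shift_hz,
                bandwidth_hz=f.bandwidth_hz * bandwidth_scale)
        for f in formants
    ]


# Illustrative formants of one vowel and one change per parameter
# changing unit 220, 221, 222 (all values are assumptions).
base = [Formant(800.0, 80.0), Formant(1200.0, 100.0)]
changes = {220: (1.00, 0.0), 221: (1.03, 10.0), 222: (0.97, -10.0)}
per_part = {unit: vary_formants(base, scale, shift)
            for unit, (scale, shift) in changes.items()}
```

Keeping the changes this small is what preserves the phonemic identity of the sound while making the timbre of each part slightly different.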
【００９１】
（変形例７）
また、上述した各実施形態において、各歌唱生成器１２０，１２１，１２２によって生成された合唱音がより自然な印象を聴取者に与えるために、各パート毎に生成された歌唱音信号による歌唱音の発音タイミングをずらすようにしてもよい。この場合、合唱制御部１４０が各パートに対して発音タイミングをどの程度ずらすかを指定するタイミング指定情報を供給する。この際、合唱制御部１４０は、各歌唱生成器１２０，１２１，１２２での発音タイミングが微妙にずれるようなタイミング指定情報を各歌唱生成器１２０，１２１，１２２に供給する。例えば、歌唱生成器１２０に対しては、合唱制御部１４０から供給される歌詞情報およびメロディ情報にしたがって生成した歌唱音信号を遅延させることなく加算器１３０に出力させ、歌唱生成器１２１に対しては、１０ｍｓｅｃ遅延させて歌唱音信号を加算器１３０に出力させ、歌唱生成器１２２に対しては２０ｍｓｅｃ遅延させて歌唱音信号を加算器１３０に出力させるようにすれば、各パートの歌唱音が微妙にずれて発音され、聴取者に対してより自然な印象を与えることができる。
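The delayed output into the adder 130 can be sketched as a mix with per-part delays. The function name and the list-of-samples representation are assumptions; the 0, 10, and 20 msec delays follow the example in the text.

```python
def mix_with_delays(signals, delays_ms, sample_rate):
    """Sum the part signals after delaying each by its own amount."""
    offsets = [int(round(d * sample_rate / 1000.0)) for d in delays_ms]
    length = max(off + len(sig) for off, sig in zip(offsets, signals))
    out = [0.0] * length
    for off, sig in zip(offsets, signals):
        for i, x in enumerate(sig):
            out[off + i] += x
    return out


# Three short parts at a 1 kHz sample rate, delayed by 0, 10 and 20 msec.
parts = [[1.0, 1.0, 1.0] for _ in range(3)]
mixed = mix_with_delays(parts, [0, 10, 20], sample_rate=1000)
```

Changing the delay list over the course of a song would give the varying onset order described in the next paragraph.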
【００９２】
なお、上記のようにある１つの楽曲の歌唱音信号を生成している際に、各歌唱生成器１２０，１２１，１２２の発音タイミングの相関関係を固定するようにしてもよいが、ある楽曲の途中であっても歌唱生成器１２０，１２１，１２２の発音タイミングの相関関係を変動させるようにしてもよい。例えば楽曲の前半部分では、上記例のように歌唱生成器１２０，１２１，１２２といった順序で発音するようにし、楽曲の後半部分では歌唱生成器１２２，１２１，１２０といった順序で発音するようにしてもよい。
【００９３】
（変形例８）
また、上述した第１実施形態では、歌唱生成器１２０，１２１，１２２の数（３つ）に応じた種類の音声試料データ群を音声試料データベース１１０に記憶させるようにしていたが、歌唱生成器の数よりも多い種類の音声試料データ群を記憶させるようにしてもよい。
【００９４】
また、歌唱生成器１２０，１２１，１２２といった３つの歌唱生成器を備えている場合に、音声試料データベース１１０に２つの音声試料データ群１１０ａ，１１０ｂしか記憶されていない場合には、少なくとも２つの歌唱生成器が異なる音声試料データ群１１０ａ，１１０ｂを用いて歌唱音信号を生成するようにすればよい。この場合には、歌唱生成器１２０が音声試料データ群１１０ａを用い、歌唱生成器１２１が音声試料データ群１１０ｂを用い、歌唱生成器１２２が音声試料データ群１１０ａ，１１０ｂのいずれかを歌唱生成器１２０，１２１と異なる時間に対応する部分から使用を開始すれば、３つの歌唱生成器１２０，１２１，１２２が実際には異なる音声素片試料データを用いて歌唱音信号を生成することになり、上記各実施形態と同様、自然な印象を与えることが可能な合唱音信号を合成することができる。
【００９５】
（変形例９）
上述した各実施形態および変形例における合唱合成装置は、専用のハードウェア回路で構成するようにしてもよいが、図１２に示すようなコンピュータシステムによるソフトウェアによって構成するようにしてもよい。同図に示すように、このコンピュータシステムは、装置全体を制御するＣＰＵ（Ｃｅｎｔｒａｌ　ＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）３２０、各種制御データやプログラム群を記憶するＲＯＭ（Ｒｅａｄ　Ｏｎｌｙ　Ｍｅｍｏｒｙ）３２１、ワークエリアとして使用されるＲＡＭ（Ｒａｎｄｏｍ　Ａｃｃｅｓｓ　Ｍｅｍｏｒｙ）３２２、楽曲情報やプログラム群を記憶するハードディスクやＣＤ−ＲＯＭ（Ｃｏｍｐａｃｔ　Ｄｉｓｃ　Ｒｅａｄ　Ｏｎｌｙ　Ｍｅｍｏｒｙ）ドライブ等の外部記憶装置３２３、キーボードやマウス等の操作部３２４、各種情報をユーザに表示する表示部３２５、Ｄ／Ａ変換器３２６、アンプ３２７、スピーカ３２８を備えている。
【００９６】
ＣＰＵ３２０は、ＲＯＭ３２１もしくはハードディスク等の外部記憶装置３２３に記憶されているプログラム群にしたがって、音声試料データベース１１０をＲＡＭ３２２もしくは外部記憶装置３２３に構築し、音声試料データベース１１０を用いて上記各実施形態や変形例と同様に各パート毎の歌唱音信号合成処理を行う。そして、ＣＰＵ３２０は、生成した各パート毎の歌唱音信号を加算した後、加算後の合唱音信号をＤ／Ａ変換器３２６に出力する。Ｄ／Ａ変換器３２６では合唱音信号がアナログ信号に変換され、該合唱音のアナログ信号がアンプ３２７によって増幅された後、スピーカ３２８から放音される。
【００９７】
このように上記各実施形態および変形例における合唱合成装置は、コンピュータシステムによるソフトウェアによって構成することが可能であり、上記各実施形態等と同様の合唱音合成処理をコンピュータシステムに実行させるためのプログラムの形態でユーザに提供するようにしてもよい。このようなプログラムの提供方法としては、ＣＤ−ＲＯＭやフロッピーディスク等の各種記録媒体に記憶して提供する方法や、インターネット等の通信回線を介して提供する方法等がある。
【００９８】
【発明の効果】
以上説明したように、本発明によれば、より自然な印象を聴取者に与えることが可能な合唱音を合成することができる。
【図面の簡単な説明】
【図１】本発明の第１実施形態に係る合唱合成装置の基本構成を示すブロック図である。
【図２】前記合唱合成装置の構成要素である音声試料データベースの作成手法を説明するための図である。
【図３】前記合唱合成装置の構成要素である歌唱生成器の機能構成を示すブロック図である。
【図４】本発明の第２実施形態に係る合唱合成装置の基本構成を示すブロック図である。
【図５】第２実施形態に係る前記合唱合成装置による歌唱音信号生成方法を説明するための図である。
【図６】第２実施形態に係る前記合唱合成装置による歌唱音信号生成方法を説明するための図である。
【図７】第２実施形態に係る前記合唱合成装置による歌唱音信号生成方法を説明するための図である。
【図８】第２実施形態に係る前記合唱合成装置による歌唱音信号生成方法を説明するための図である。
【図９】第１実施形態に係る前記合唱合成装置の変形例の基本構成を示すブロック図である。
【図１０】第２実施形態に係る前記合唱合成装置の変形例の基本構成を示すブロック図である。
【図１１】第１実施形態に係る前記合唱合成装置の他の変形例の基本構成を示すブロック図である。
【図１２】前記合唱合成装置による機能をソフトウェアによって実現するためのコンピュータシステムの構成を示すブロック図である。
【符号の説明】
１００、１００’、１００”……合唱合成装置、１１０……音声試料データベース、１１０ａ，１１０ｂ，１１０ｃ……音声試料データ群、１２０……歌唱生成器、１２１……歌唱生成器、１２２……歌唱生成器、１３０……加算器、１４０……合唱制御部、２００……ＳＭＳ分析部、２０１……区間切り出し部、２２０、２２１，２２２……パラメータ変更部、３０１……音声素片選択部、３０２……ピッチ決定部、３０３……継続時間長調整部、３０４……音声素片接続部、３０５……調和成分生成部、３０６……加算部、３０７……逆ＦＦＴ部、３０８……窓掛け部、３０９……オーバーラップ部、４００、４００’……合唱合成装置。
[0001]
[Technical field of the invention]
The present invention relates to a choral synthesis device for synthesizing a choral sound signal, a choral synthesis method, and a program for synthesizing a choral sound.
[0002]
[Prior art]
Conventionally, choral synthesis devices have been proposed that synthesize a singing sound signal based on lyric information and melody information and produce a singing voice. Various devices have been proposed for synthesizing a singing sound signal in this way, including devices that apply rule-based speech synthesis technology. In a singing synthesis device to which rule-based synthesis is applied, voice sample data in units of a phoneme, or of a phoneme chain containing a plurality of phonemes, is created in advance from a voice uttered by a speaker and stored in a database. A singing sound signal is then synthesized by reading out and connecting the voice sample data, such as phonemes, required according to the lyric information.
[0003]
Unlike a speech synthesis device such as a text-to-speech device, a singing sound synthesizer of the kind described above may also be used to electronically output the singing sound of a chorus, such as unison singing or part singing. Choral synthesizers having a function of synthesizing the singing sound of a chorus (choral sound) have therefore also been developed.
[0004]
A choral synthesizer having such a function of synthesizing a choral sound signal generates a choral sound signal by reading out and connecting voice sample data based on each of a plurality of parts. By superimposing and outputting the singing sound signals generated for the respective parts, the choral sound can be output electronically.
[0005]
[Problems to be solved by the invention]
However, a conventional choral synthesizer having a function of synthesizing a choral sound signal uses the same voice sample data when generating the singing sound signal of each part according to lyric information and melody information. As a result, although the singing sounds generated for the parts differ in melody, the fine features (pitch fluctuations and the like) of the voice waveform generated for each part are basically the same. A choral sound obtained by superimposing these therefore sounds unnatural to the listener. This is presumably because the listener perceives the correlation between the parts (the matching fine features), which gives an unnatural impression.
[0006]
Also, when synthesizing a choral sound signal for unison singing, simply generating a singing sound signal for each part and superimposing them as described above causes exactly the same singing sound to be superimposed and output, which gives the listener an unnatural impression. Conventional choral sound synthesizers therefore, when synthesizing a choral sound signal for unison singing, slightly shifted the sounding timing or the pitch of the singing sound generated for each part (the parts having identical content), so that exactly the same singing sound was not superimposed and sounded. However, even when the sounding timing or pitch is slightly shifted, the fine features (fluctuations and the like) of the voice waveform generated for each part remain basically the same, as described above. A choral sound obtained by superimposing these therefore still sounds unnatural to the listener.
[0007]
Further, Japanese Patent Application Laid-Open No. 7-146995 discloses a device for generating a choral sound signal. When generating a singing sound signal for each part, this device generates a singing sound signal to which a pitch fluctuation component that differs for each part is added. By superimposing and outputting singing sound signals given different pitch fluctuation components for each part in this way, the correlation between the parts can be reduced. However, in the device described in this publication, the pitch fluctuation component added to the singing sound signal of each part is not based on a human voice but is artificially created. Consequently, although the correlation between the parts becomes small, the synthesized chorus sound may still sound unnatural.
[0008]
The present invention has been made in consideration of the above circumstances, and its object is to provide a choral synthesis device, a choral synthesis method, and a program capable of synthesizing a choral sound that gives a more natural impression to a listener.
[0009]
[Means for Solving the Problems]
In order to solve the above problem, a choral synthesizer according to the present invention is a choral synthesizer that synthesizes a choral sound signal based on music data, comprising: a database storing, for voice sample data of the same type, voice sample data created based on each of a plurality of different voices; a plurality of singing generation means, each of which generates a singing sound signal in accordance with the music data by reading the necessary voice sample data from the database and using it to generate the singing sound signal; and singing synthesis means for synthesizing a choral sound signal from the singing sound signals generated by the plurality of singing generation means, wherein, when the music data is composed of a plurality of parts and each of the plurality of singing generation means generates a singing sound signal corresponding to one of the parts, at least two of the singing generation means each read out voice sample data created based on a different voice and use it to generate the singing sound signal.
[0010]
According to this configuration, when the singing generation means generate the singing sound signals of their corresponding parts, at least two of the singing generation means use voice sample data created based on different voices. Here, since voice sample data created based on different voices differ in their fine characteristics and the like, the singing sound signals output from the at least two singing generation means have different fine characteristics. Therefore, singing sounds each having unique characteristics are emitted as the singing sounds of the respective parts, so that a more natural impression can be given to the listener.
[0011]
Further, a choral synthesizer according to another aspect of the present invention is a choral synthesizer that synthesizes a choral sound signal in accordance with music data, comprising: a database storing voice sample data having a predetermined time length created based on a voice; a plurality of singing generation means, each of which generates a singing sound signal in accordance with the music data by reading the necessary voice sample data from the database and using it to generate the singing sound signal; and singing synthesis means for synthesizing a choral sound signal from the singing sound signals generated by the plurality of singing generation means, wherein, when the music data is composed of a plurality of parts and each of the plurality of singing generation means generates a singing sound signal corresponding to one of the parts, at least two of the singing generation means each generate the singing sound signal by starting to use the voice sample data read from the database from portions corresponding to different times.
[0012]
According to this configuration, when the singing generation means generate the singing sound signals of their corresponding parts, at least two of the singing generation means generate the singing sound signals by starting to use the voice sample data from portions corresponding to different times. Here, in voice sample data having a certain time length created based on a voice, the fine characteristics (such as fluctuations of the voice waveform) are not constant over that time length but differ depending on the time. For this reason, the singing sound signals output from the at least two singing generation means have different fine characteristics. Therefore, singing sounds each having unique characteristics are emitted as the singing sounds of the respective parts, so that a more natural impression can be given to the listener.
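As a minimal sketch of this idea, and not the implementation described in this specification, the Python below reads one stored frame sequence starting from a different frame for each part. The function name `read_from_offset`, the wrap-around at the end of the sample, and the toy frame values are all assumptions made for illustration.

```python
from typing import List, Sequence


def read_from_offset(frames: Sequence[float], start: int, count: int) -> List[float]:
    """Return `count` frames of one stored sample, starting at frame `start`.

    Wrapping around at the end of the sample is an assumption of this sketch.
    """
    n = len(frames)
    return [frames[(start + i) % n] for i in range(count)]


# Toy per-frame pitch deviations of a single stored sample (hypothetical values).
sample = [0.0, 0.3, -0.2, 0.5, -0.4, 0.1]

# Two parts use the same sample but begin at different times within it.
part1 = read_from_offset(sample, start=0, count=4)
part2 = read_from_offset(sample, start=3, count=4)
```

Because the two parts traverse different stretches of the same recording, their frame-by-frame fluctuations no longer coincide, which is the effect the configuration relies on.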
[0013]
Further, a choral synthesis method according to the present invention is a choral synthesis method for synthesizing a choral sound signal from a plurality of singing sound signals generated based on music data, wherein, when singing sound signals corresponding to a plurality of parts are generated in accordance with music data composed of the plurality of parts, the necessary voice sample data is read from a database storing voice sample data created based on each of a plurality of different voices, and voice sample data created based on different voices read from the database is used to generate the singing sound signals corresponding to at least two of the parts.
[0014]
Also, a choral synthesis method according to another aspect of the present invention is a choral synthesis method for synthesizing a choral sound signal from a plurality of singing sound signals generated based on music data, wherein, when singing sound signals corresponding to a plurality of parts are generated in accordance with music data composed of the plurality of parts, the necessary voice sample data is read from a database storing voice sample data having a predetermined time length created based on a voice, and the singing sound signals corresponding to at least two of the parts are generated by starting to use the voice sample data read from the database from portions corresponding to different times.
[0015]
Further, a program according to the present invention causes a computer to function as: singing sound generating means for generating a singing sound signal in accordance with music data by reading the necessary voice sample data from a database storing voice sample data created based on each of a plurality of different voices, the singing sound generating means using, when the music data is composed of a plurality of parts and singing sound signals corresponding to the plurality of parts are generated, voice sample data created based on different voices read from the database to generate the singing sound signals corresponding to at least two of the parts; and singing synthesis means for synthesizing a choral sound signal from the generated singing sound signals.
[0016]
A program according to another aspect of the present invention causes a computer to function as: singing sound generating means for generating a singing sound signal in accordance with music data by reading the necessary voice sample data from a database storing voice sample data having a predetermined time length created based on a voice, the singing sound generating means generating, when the music data is composed of a plurality of parts and singing sound signals corresponding to the plurality of parts are generated, the singing sound signals corresponding to at least two of the parts by starting to use the voice sample data read from the database from portions corresponding to different times; and singing synthesis means for synthesizing a choral sound signal from the generated singing sound signals.
[0017]
[Best Mode for Carrying Out the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
A. First embodiment
A-1. Basic configuration of the first embodiment
First, FIG. 1 is a block diagram showing the basic configuration of a choral synthesizer according to the first embodiment of the present invention. As shown in the figure, the choral synthesizer 100 includes a voice sample database 110, a plurality of (three in the illustrated example) singing generators 120, 121, 122, a chorus control unit 140, and an adder 130 that adds and combines the singing sound signals output from the singing generators 120, 121, and 122 and outputs the result.
[0018]
The voice sample database 110 stores voice sample data created based on a natural voice uttered by a person. The voice sample database 110 stores voice sample data (hereinafter referred to as voice unit sample data) in which a single phoneme or a phoneme chain composed of a plurality of phonemes is used as one unit.
[0019]
In speech synthesis technology in which a large number of short voice sample data are stored in a database and connected according to lyrics and the like, the phoneme is basically used as the synthesis unit. For this reason, the voice sample database 110 of the chorus synthesizer 100 may store speech unit sample data of phonemes only (about 30 to 50 types). However, the rules for connecting phonemes are complicated, so it is difficult to obtain good quality when only phoneme-unit voice sample data is stored. It is therefore preferable that the voice sample database 110 store, in addition to speech unit sample data of phoneme units, speech unit sample data of units slightly larger than a phoneme (phoneme chains). Units larger than a phoneme include CV (consonant → vowel), VC (vowel → consonant), VCV (vowel → consonant → vowel), and CVC (consonant → vowel → consonant). Although it is conceivable to store speech unit sample data for all of these units, the chorus synthesizer 100, which synthesizes choral sounds, may instead store as speech unit sample data: sustained sounds such as vowels, which are frequently used in singing, as one unit; consonant-to-vowel (CV) and vowel-to-consonant (VC) transitions as one unit each; and vowel-to-vowel transitions as one unit.
[0020]
The voice sample database 110 stores speech unit sample data in which a phoneme or a phoneme chain as described above is one unit. In the voice sample database 110, three speech unit sample data are stored for each phoneme of the same type (for example, "a") or phoneme chain of the same type (for example, "ai"). That is, the voice sample database 110 stores three voice sample data groups 110a, 110b, and 110c, each consisting of a predetermined number of speech unit sample data in which a phoneme or a phoneme chain is one unit.
[0021]
The three voice sample data groups 110a, 110b, and 110c stored in the voice sample database 110 are data created based on different voices. Here, different voices do not only mean voices of different speakers; they may also be voices of the same speaker uttered on different occasions, or different portions of an utterance. In other words, the voice sample data groups 110a, 110b, and 110c are created based on voices of different speakers, or on voices of the same speaker uttered on different occasions or taken from different portions of an utterance. Consequently, even data for the same phoneme (or phoneme chain) included in the respective voice sample data groups 110a, 110b, and 110c differ in fine characteristics (pitch fluctuations, etc.), because the voices on which the data are based differ.
[0022]
The voice sample database 110 stores the three voice sample data groups 110a, 110b, and 110c as described above, and each of the singing generators 120, 121, and 122 reads speech unit sample data from the voice sample database 110 and uses it when generating a singing sound signal.
[0023]
Each of the singing generators 120, 121, and 122 reads necessary speech unit sample data from the speech sample database 110 according to song information having lyric information and melody information, and uses the read speech unit sample data. Generate a singing sound signal.
[0024]
More specifically, each of the singing generators 120, 121, and 122 determines a phoneme sequence according to the lyric information, determines the speech unit sample data necessary to compose that phoneme sequence, and reads it from the voice sample database 110. The read speech unit sample data is then connected in time series, and the connected speech unit sample data is adjusted as appropriate to the pitch given by the melody information to generate a singing sound signal.
[0025]
The chorus synthesis apparatus 100 according to the present embodiment includes three singing generators 120, 121, and 122, each of which can generate a singing sound signal in accordance with lyric information and melody information. It can therefore synthesize, in accordance with the music information (lyric information and melody information) of a chorus piece composed of three parts, a choral sound signal corresponding to that chorus piece.
[0026]
When the chorus synthesis device 100 synthesizes a choral sound signal based on the music information of a chorus piece, the chorus control unit 140 divides the music information by part and supplies the divided information to the singing generators 120, 121, and 122. Accordingly, when a choral sound signal is synthesized in accordance with music information including three parts, each of the singing generators 120, 121, and 122 generates a singing sound signal in accordance with the lyric information and melody information of the part supplied to it from the chorus control unit 140, and the singing sound signals generated by the singing generators 120, 121, and 122 are output to the adder 130. As a result, the adder 130 can synthesize a choral sound signal corresponding to the chorus piece in accordance with the music information of a chorus piece composed of three parts.
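The data flow just described (one part per generator, then sample-wise addition) can be sketched as follows. This is an illustrative outline only: the `Part` dictionary shape, the function `synthesize_chorus`, and the stub generators that emit constant levels are assumptions and do not perform real singing synthesis.

```python
from typing import Callable, Dict, List

Part = Dict[str, str]  # hypothetical shape: {"lyrics": ..., "melody": ...}


def synthesize_chorus(parts: List[Part],
                      generators: List[Callable[[Part], List[float]]]) -> List[float]:
    """Hand one part to each generator, then sum the resulting signals."""
    signals = [gen(part) for gen, part in zip(generators, parts)]
    length = max(len(s) for s in signals)
    mixed = [0.0] * length
    for s in signals:
        for i, v in enumerate(s):
            mixed[i] += v  # role of adder 130: sample-wise addition
    return mixed


# Stand-in generator: one constant sample per lyric character (not real synthesis).
def make_stub(level: float) -> Callable[[Part], List[float]]:
    return lambda part: [level] * len(part["lyrics"])


parts = [{"lyrics": "la", "melody": "C4"},
         {"lyrics": "lala", "melody": "E4"},
         {"lyrics": "la", "melody": "G4"}]
chorus = synthesize_chorus(parts, [make_stub(1.0), make_stub(2.0), make_stub(4.0)])
```

In a fuller implementation each stub would be replaced by a generator that reads speech unit sample data and produces a waveform; the splitting and adding steps would stay the same.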
[0027]
When the choral synthesizer 100 synthesizes a choral sound signal as described above, the chorus control unit 140 outputs, to each of the singing generators 120, 121, and 122, designation information that designates from which of the voice sample data groups 110a, 110b, and 110c stored in the voice sample database 110 the speech unit sample data is to be read and used. Here, the chorus control unit 140 outputs designation information designating a different data group to each of the singing generators 120, 121, and 122, so that the singing generators generate singing sound signals using speech unit sample data included in mutually different voice sample data groups 110a, 110b, and 110c.
[0028]
More specifically, the chorus control unit 140 outputs, for example, designation information designating the voice sample data group 110a to the singing generator 120, designation information designating the voice sample data group 110b to the singing generator 121, and designation information designating the voice sample data group 110c to the singing generator 122. When such designation information is supplied from the chorus control unit 140, the singing generator 120 reads out the speech unit sample data included in the voice sample data group 110a and uses it to generate a singing sound signal, the singing generator 121 reads out the speech unit sample data included in the voice sample data group 110b and uses it to generate a singing sound signal, and the singing generator 122 generates a singing sound signal using the speech unit sample data included in the voice sample data group 110c.
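The assignment of data groups by designation information can be pictured with a small lookup sketch. The nested dictionary standing in for the voice sample database 110, the placeholder strings used as unit data, and the function `read_units` are hypothetical and chosen only to mirror the reference numerals in the text.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for voice sample database 110:
# group id -> phonetic symbol -> speech unit sample data (placeholder strings).
database: Dict[str, Dict[str, str]] = {
    "110a": {"a": "a@voiceA", "ai": "ai@voiceA"},
    "110b": {"a": "a@voiceB", "ai": "ai@voiceB"},
    "110c": {"a": "a@voiceC", "ai": "ai@voiceC"},
}

# Designation information: which group each singing generator must read from.
designation: Dict[int, str] = {120: "110a", 121: "110b", 122: "110c"}


def read_units(generator_id: int, symbols: List[str]) -> List[str]:
    """Read the units for `symbols` from the group designated for this generator."""
    group = database[designation[generator_id]]
    return [group[s] for s in symbols]


units_120 = read_units(120, ["a", "ai"])
units_122 = read_units(122, ["a", "ai"])
```

The same phonetic symbols resolve to different underlying recordings depending on the generator, which is what keeps the parts from sharing fine characteristics.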
[0029]
When the chorus synthesizer 100 synthesizes the choral sound signal of a chorus piece composed of three parts, the singing generators 120, 121, and 122 use speech unit sample data included in the different voice sample data groups 110a, 110b, and 110c as described above, which makes it possible to synthesize a choral sound signal capable of giving a more natural impression to a listener. That is, the voice sample data groups 110a, 110b, and 110c stored in the voice sample database 110 are created based on different voices, so that even for data on the same type of phoneme or phoneme chain, the fine characteristics (such as pitch fluctuations) of the voice represented by the data differ among the voice sample data groups 110a, 110b, and 110c. Because each of the singing generators 120, 121, and 122 generates its singing sound signal using speech unit sample data included in a different one of these voice sample data groups, the singing sound signals generated by the singing generators 120, 121, and 122 have fine characteristics that differ from one another. Therefore, in the chorus sound obtained by superimposing them, each part has its own characteristics with little correlation between the parts, and the possibility of giving an unnatural impression to the listener can be reduced.
[0030]
The speech unit sample data included in the voice sample data groups 110a, 110b, and 110c stored in the voice sample database 110 are data created based on voices uttered by humans, respectively. Therefore, the difference in the minute feature of the voice shown in the voice unit sample data included in each voice sample data group is not artificially created such as providing a pitch fluctuation prepared in advance. Therefore, it is possible to reduce the possibility that the synthesized chorus sound becomes unnatural.
[0031]
A-2. Specific configuration of chorus synthesizer
The above is the basic configuration of the choral synthesis apparatus 100 according to the present embodiment. In the chorus synthesizer 100, any singing generator that reads out speech unit sample data from the voice sample database 110 in accordance with the lyric information, connects it, adjusts the connected speech unit sample data to the pitch given by the melody information, and outputs a singing sound signal can be used as the singing generators 120, 121, and 122, including various known singing generators such as those to which a rule-based speech synthesis technique is applied. The voice sample database 110 may store speech unit sample data suited to the singing generator used. In the following, the choral synthesis device 100 will be described specifically, taking as an example the case where singing generators using the spectral modeling synthesis (SMS) technique proposed in US Pat. Nos. 5,029,509 and 2,906,970 are applied as the singing generators 120, 121, and 122.
[0032]
First, a method of creating the voice sample database 110 in the chorus synthesizer 100 provided with the singing generators 120, 121, and 122 using the SMS technique will be described.
[0033]
As described above, the voice sample database 110 in the chorus synthesis apparatus 100 stores speech unit sample data created based on the voice uttered by a speaker. The SMS technique is a technique for analyzing and synthesizing a musical tone using a model that represents the original sound with two components, namely a harmonic (deterministic) component and a non-harmonic component. In speech synthesis using the SMS technique, data composed of the harmonic component and the non-harmonic component is used as one unit of speech unit sample data, such as a phoneme or a phoneme chain. Therefore, in the choral synthesizer 100 using the SMS technique, the voice sample database 110 stores, as one speech unit sample data, data indicating the harmonic component and the non-harmonic component obtained by performing SMS analysis on the voice uttered by the speaker. Hereinafter, a method of creating the voice sample database 110 will be described with reference to FIG.
[0034]
As shown in the figure, the voice uttered by a speaker to create the voice sample database 110 is input to the SMS analysis unit 200 and analyzed there. Here, since the voice sample database 110 needs to store the voice sample data groups 110a, 110b, and 110c created based on different voices, three different voices are input to the SMS analysis unit 200. Although the drawing shows the three different voices being input to the SMS analysis unit 200 in parallel, the SMS analysis of the voices need not be performed simultaneously in parallel and may be performed individually.
[0035]
The SMS analysis unit 200 performs an SMS analysis on the input voice and outputs SMS analysis data for each frame. More specifically, SMS analysis data for each frame is output by the following method.
[0036]
First, the input voice is divided into a series of frames. Here, the frame period used for the SMS analysis may be a fixed length, or may be a variable length that is changed according to the pitch of the input voice.
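The fixed-length case can be sketched as below. The function name `split_into_frames`, the use of a hop size smaller than the frame length, and the zero-padding of the final frame are assumptions of this sketch; the variable-length, pitch-dependent case mentioned above is not covered.

```python
from typing import List, Sequence


def split_into_frames(signal: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split `signal` into fixed-length frames advanced by `hop` samples.

    The last frames are zero-padded; that padding is an assumption of this sketch.
    """
    frames = []
    for start in range(0, len(signal), hop):
        frame = list(signal[start:start + frame_len])
        frame += [0.0] * (frame_len - len(frame))
        frames.append(frame)
    return frames


frames = split_into_frames([1.0, 2.0, 3.0, 4.0, 5.0], frame_len=4, hop=2)
```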
[0037]
Next, a frequency analysis such as a fast Fourier transform (FFT) is performed on the voice divided into frames. An amplitude spectrum and a phase spectrum are obtained from the frequency spectrum (complex spectrum) produced by the frequency analysis, and the spectrum at specific frequencies corresponding to peaks of the amplitude spectrum is extracted as a line spectrum. At this time, the spectrum at frequencies near the fundamental frequency and near integer multiples of the fundamental frequency is taken as the line spectrum. The line spectrum extracted in this way corresponds to the above-mentioned harmonic component.
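A rough sketch of this peak-picking step is given below. It assumes the fundamental frequency is already known as a DFT bin index (`f0_bin`), uses a naive DFT in place of an FFT for self-containment, and introduces a search tolerance and an amplitude floor; those parameters and the function names are illustrative assumptions, not details from this specification.

```python
import cmath
import math
from typing import List, Tuple


def dft(frame: List[float]) -> List[complex]:
    """Naive O(N^2) DFT; an FFT would return the same values faster."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]


def extract_line_spectrum(frame: List[float], f0_bin: int, tolerance: int = 1,
                          floor: float = 1e-6) -> List[Tuple[int, float, float]]:
    """Return (bin, amplitude, phase) of the strongest bin near each multiple of f0_bin."""
    spectrum = dft(frame)
    half = len(frame) // 2
    lines = []
    for centre in range(f0_bin, half, f0_bin):
        lo, hi = max(1, centre - tolerance), min(half - 1, centre + tolerance)
        peak = max(range(lo, hi + 1), key=lambda k: abs(spectrum[k]))
        if abs(spectrum[peak]) > floor:  # skip harmonics with negligible energy
            lines.append((peak, abs(spectrum[peak]), cmath.phase(spectrum[peak])))
    return lines


# Toy frame: fundamental at bin 4 plus a weaker second harmonic at bin 8.
N = 64
frame = [math.cos(2 * math.pi * 4 * t / N) + 0.5 * math.cos(2 * math.pi * 8 * t / N)
         for t in range(N)]
lines = extract_line_spectrum(frame, f0_bin=4)
```

For this toy frame the sketch keeps two lines, at bins 4 and 8, with the second half as strong as the first.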
[0038]
Next, the line spectrum extracted as described above is subtracted from the spectrum of the input voice of the frame (after the FFT) to obtain a residual spectrum. Alternatively, time waveform data of the harmonic component synthesized from the extracted line spectrum may be subtracted from the input voice waveform data of the frame to obtain time waveform data of the residual component, and a frequency analysis such as FFT may then be performed on this to obtain the residual spectrum. The residual spectrum thus obtained corresponds to the above-described non-harmonic component.
[0039]
The SMS analysis unit 200 outputs the SMS analysis data for each frame, composed of the line spectrum (harmonic component) and the residual spectrum (non-harmonic component) acquired as described above, to the section cutout unit 201.
[0040]
The section cutout unit 201 cuts out the SMS analysis data for each frame supplied from the SMS analysis unit 200 so that it corresponds to the length of one unit (phoneme or phoneme chain) of the speech unit sample data to be stored in the voice sample database 110, and stores the cut-out data in the voice sample database 110.
[0041]
Here, the speech unit sample data stored in the voice sample database 110 is SMS analysis data cut out for each phoneme or phoneme chain. For the harmonic component, the spectral envelopes (the intensity (amplitude) and phase spectra of the line spectrum (overtone series)) of all the frames included in the phoneme or phoneme chain are stored. The spectral envelope itself may be stored as the harmonic component in this way; alternatively, the spectral envelope may be stored in a form expressed by some function, or the harmonic component may be stored as a time waveform obtained by inverse FFT or the like. In the present embodiment, the non-harmonic component is also stored as an intensity spectrum and a phase spectrum, like the harmonic component, but it may likewise be stored as a function or a time waveform.
[0042]
This SMS analysis and section cutout are performed for each of the three different input voices. As a result, groups of speech unit sample data (SMS analysis data for each phoneme or phoneme chain) created based on three different voices are stored in the voice sample database 110 as the voice sample data groups 110a, 110b, and 110c.
[0043]
The above is the details of the method of creating the audio sample database 110 of the choral synthesizer 100 according to the present embodiment.
[0044]
Next, the singing generators 120, 121, and 122, which generate singing sound signals using the voice sample database 110 storing the three voice sample data groups 110a, 110b, and 110c created based on different voices as described above, will be described. Since the singing generators 120, 121, and 122 have the same configuration, the configuration of the singing generator 120 will be described below with reference to FIG. 3, and the description of the other singing generators 121 and 122 will be omitted.
[0045]
As shown in the figure, the singing generator 120 includes a speech unit selection unit 301, a pitch determination unit 302, a duration adjustment unit 303, a speech unit connection unit 304, a harmonic component generation unit 305, an addition unit 306, an inverse FFT (Fast Fourier Transform) unit 307, a windowing unit 308, and an overlap unit 309.
[0046]
The speech unit selection unit 301 reads out the necessary speech unit sample data from the voice sample database 110 according to the lyric information and the designation information supplied from the chorus control unit 140 (see FIG. 1). More specifically, the supplied lyric information is converted into a sequence of phonetic symbols (phonemes or phoneme chains), and speech unit sample data is read from the voice sample database 110 according to the converted phonetic symbol sequence. For example, when a singing sound signal is generated according to the lyric information "saita", the lyric information is converted into the phonetic symbols "#s", "s", "sa", "a", "ai", "i", "it", "t", "ta", "a", "a#", and the speech unit sample data corresponding to each of these phonetic symbols is read from the voice sample database 110.
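The "saita" example suggests a simple interleaving rule: each transition between adjacent phonemes, followed by the phoneme itself, with "#" marking silence at both ends. The sketch below implements that rule as inferred from this single example; the rule itself, the function name `to_phonetic_symbols`, and the assumption that the lyric has already been split into phonemes are not stated in the text.

```python
from typing import List

SILENCE = "#"


def to_phonetic_symbols(phonemes: List[str]) -> List[str]:
    """Interleave phoneme-to-phoneme transitions with the phonemes themselves."""
    padded = [SILENCE] + phonemes + [SILENCE]
    symbols = []
    for prev, cur in zip(padded, padded[1:]):
        symbols.append(prev + cur)      # transition unit, e.g. "sa"
        if cur != SILENCE:
            symbols.append(cur)         # stationary unit, e.g. "a"
    return symbols


symbols = to_phonetic_symbols(["s", "a", "i", "t", "a"])
# symbols == ["#s", "s", "sa", "a", "ai", "i", "it", "t", "ta", "a", "a#"]
```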
[0047]
The speech unit selection unit 301 determines the speech unit sample data to be read according to the lyric information as described above, and reads that speech unit sample data from the voice sample data group designated by the designation information supplied from the chorus control unit 140. For example, when the designation information designates the voice sample data group 110a, the speech unit sample data corresponding to "#s", "s", "sa", "a", "ai", "i", "it", "t", "ta", "a", and "a#" included in the voice sample data group 110a of the voice sample database 110 is read.
[0048]
The pitch determination unit 302 determines the pitch of the singing sound according to the melody information supplied from the chorus control unit 140 (see FIG. 1), and outputs pitch information indicating the determined pitch to the harmonic component generation unit 305.
[0049]
The speech unit sample data (harmonic component and non-harmonic component) read out by the speech unit selection unit 301 is supplied to the duration adjusting unit 303. Here, the speech unit selection unit 301 may supply the read speech unit sample data to the duration adjusting unit 303 as it is, or may supply it after performing appropriate correction processing according to the pitch or the like indicated in the melody information.
[0050]
The duration adjusting unit 303 changes the time length of each speech unit sample data supplied from the speech unit selection unit 301 according to the sounding time length of each phoneme or phoneme chain determined by the melody information or the like. More specifically, when a speech unit sample data is to be used for a time shorter than its own time length, frames are thinned out from the speech unit sample data. On the other hand, when a speech unit sample data is to be used continuously for a time longer than its own time length, loop processing is performed in which the speech unit sample data is repeated to fill the time for which it is used. In this loop processing, the data from the beginning to the end (0 to t) of the speech unit sample data may be followed again by the data from the beginning to the end (0 to t); alternatively, the data from the beginning to the end (0 to t) may be followed by the data running from the temporally last part (t) back toward the first part (0), with this connection being repeated.
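Both adjustments can be sketched on a list of per-frame values. The even-spacing rule used for thinning and the helper names `shorten` and `lengthen` are assumptions; the specification only states that frames are thinned out or looped, either forward repeatedly or forward then backward.

```python
from typing import List, Sequence


def shorten(frames: Sequence[float], target: int) -> List[float]:
    """Thin out frames so that `target` frames remain, spread evenly over the unit."""
    n = len(frames)
    return [frames[i * n // target] for i in range(target)]


def lengthen(frames: Sequence[float], target: int, ping_pong: bool = False) -> List[float]:
    """Loop the unit until `target` frames are produced.

    ping_pong=False repeats 0..t, 0..t, ...; ping_pong=True alternates 0..t, t..0, ...
    """
    out: List[float] = []
    forward = True
    while len(out) < target:
        out.extend(frames if forward else reversed(frames))
        if ping_pong:
            forward = not forward
    return out[:target]


unit = [1.0, 2.0, 3.0, 4.0]
short = shorten(unit, 2)
looped = lengthen(unit, 10)
bounced = lengthen(unit, 10, ping_pong=True)
```

The forward-then-backward variant avoids the jump from the last frame back to the first at each repetition, at the cost of playing the end frames twice in this simple form.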
[0051]
The duration adjusting unit 303 adjusts the duration of the speech unit sample data (harmonic component and non-harmonic component) according to the sounding time length of each speech unit as described above, and then outputs the time-adjusted speech unit sample data to the speech unit connection unit 304.
[0052]
The speech unit connection unit 304 connects the harmonic component data of the speech unit sample data supplied from the duration adjustment unit 303 in time series, and connects the non-harmonic component data in time series. In such a connection, if the difference between the shapes of the spectral envelopes of the two connected harmonic components is large, a smoothing process or the like may be performed. The speech unit connection unit 304 outputs the connected harmonic component data to the harmonic component generation unit 305, and outputs the connected non-harmonic component data to the addition unit 306.
[0053]
The harmonic component generation unit 305 is supplied with harmonic component data (spectral envelope information) from the speech unit connection unit 304 and with pitch information according to the melody information from the pitch determination unit 302. The harmonic component generation unit 305 generates a harmonic component corresponding to the pitch information from the pitch determination unit 302 while maintaining the spectral envelope shape indicated by the spectral envelope information from the speech unit connection unit 304.
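One way to picture "a new pitch under the same envelope" is to place harmonics at multiples of the target pitch and read their amplitudes off the stored envelope. The sketch below does this with linear interpolation; the envelope representation as (frequency, amplitude) points, the interpolation method, and the function names are assumptions made for illustration, and phase is ignored.

```python
from typing import List, Sequence, Tuple


def envelope_at(envelope: Sequence[Tuple[float, float]], freq: float) -> float:
    """Linearly interpolate an envelope given as sorted (frequency_hz, amplitude) points."""
    if freq <= envelope[0][0]:
        return envelope[0][1]
    for (f0, a0), (f1, a1) in zip(envelope, envelope[1:]):
        if freq <= f1:
            return a0 + (a1 - a0) * (freq - f0) / (f1 - f0)
    return envelope[-1][1]


def generate_harmonics(envelope: Sequence[Tuple[float, float]], pitch_hz: float,
                       max_hz: float) -> List[Tuple[float, float]]:
    """Place harmonics at multiples of `pitch_hz`, with amplitudes read off the envelope."""
    count = int(max_hz // pitch_hz)
    return [(k * pitch_hz, envelope_at(envelope, k * pitch_hz)) for k in range(1, count + 1)]


# Hypothetical envelope of one frame.
envelope = [(0.0, 1.0), (1000.0, 0.5), (2000.0, 0.0)]
low = generate_harmonics(envelope, pitch_hz=250.0, max_hz=1000.0)
high = generate_harmonics(envelope, pitch_hz=500.0, max_hz=1000.0)
```

Raising the pitch from 250 Hz to 500 Hz halves the number of harmonics below 1 kHz, but the harmonics that remain at 500 Hz and 1 kHz keep the amplitudes the envelope prescribes at those frequencies.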
[0054]
The addition unit 306 is supplied with the non-harmonic component data from the speech unit connection unit 304 and the harmonic component data from the harmonic component generation unit 305, combines the two, and outputs the result to the inverse FFT unit 307. The inverse FFT unit 307 performs an inverse FFT on the combined frequency-domain signal supplied from the addition unit 306 to convert it into a time-domain waveform signal, and outputs the converted waveform signal to the windowing unit 308. The windowing unit 308 multiplies the time-domain waveform signal by a window function corresponding to the frame length, and the overlap unit 309 generates a singing sound signal by overlapping the windowed waveform signals. In this way, the singing generator 120 generates a singing sound signal in accordance with the lyric information and melody information of a certain part of the music information supplied from the chorus control unit 140 (see FIG. 1), and outputs the generated singing sound signal to the adder 130 (see FIG. 1).
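The last two stages, windowing and overlapping, can be sketched as a standard overlap-add. The specification does not name a window; the periodic Hann window and the half-frame hop used here are assumptions, chosen because overlapping copies then sum to one. The input frames stand in for the time-domain output of the inverse FFT unit 307.

```python
import math
from typing import List, Sequence


def hann(n: int) -> List[float]:
    """Periodic Hann window; copies shifted by n/2 sum to one."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def overlap_add(frames: Sequence[Sequence[float]], hop: int) -> List[float]:
    """Window each time-domain frame and add it into the output at multiples of `hop`."""
    frame_len = len(frames[0])
    window = hann(frame_len)
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for index, frame in enumerate(frames):
        for i, sample in enumerate(frame):
            out[index * hop + i] += sample * window[i]
    return out


# Three constant frames, standing in for the output of the inverse FFT.
signal = overlap_add([[1.0] * 8] * 3, hop=4)
```

In the fully overlapped middle region the constant input is reconstructed without amplitude ripple, which is why a window and hop with this property are a common choice for frame-based resynthesis.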
[0055]
The detailed configuration of the singing generator 120 has been described above. The other singing generators 121 and 122 shown in FIG. 1 (which have the same configuration as the singing generator 120) likewise output singing sound signals generated according to the lyric information and melody information of the parts of the music information supplied from the chorus control unit 140. Here, as described above, the singing generators 120, 121, and 122 generate singing sound signals corresponding to the different parts assigned by the chorus control unit 140, and read out and use speech unit sample data from the different voice sample data groups 110a, 110b, and 110c for that generation, so that the fine characteristics (pitch fluctuations and the like) of the singing sound signals they generate differ from one another.
[0056]
The adder 130 combines and outputs the singing sound signals that the singing generators 120, 121, and 122 have generated for the respective parts of the music information of the chorus piece. The chorus sound signal output from the adder 130, synthesized from the singing sound signals of the three parts, is converted into an analog waveform signal by a D/A (digital-to-analog) converter (not shown) and is then emitted from a speaker via an amplifier or the like. This allows the listener to hear the chorus sound corresponding to the music information of a chorus piece composed of a plurality of parts. Because the fine characteristics of the singing sound of each part (such as voice quality arising from differences in pitch fluctuation) differ, the chorus synthesizer 100 can emit a chorus sound that gives the listener a more natural impression.
[0057]
B. Second embodiment
Next, a chorus synthesizer according to a second embodiment of the present invention will be described with reference to the drawings. As shown in the figure, whereas the voice sample database 110 of the chorus synthesizer 100 in the first embodiment stores three voice sample data groups 110a, 110b, and 110c, the voice sample database 110 in the chorus synthesizer 400 according to the second embodiment stores only one item of speech unit sample data for any given phoneme or phoneme chain. Using a voice sample database 110 that stores only one item of speech unit sample data per phoneme or phoneme chain, the chorus synthesizer 400 according to the second embodiment can, as in the first embodiment, synthesize a chorus sound signal that gives a more natural impression. The configuration of the chorus synthesizer 400 is described below, focusing on the differences from the chorus synthesizer 100 according to the first embodiment.
[0058]
Each of the singing generators 120, 121, and 122 in the chorus synthesizer 400, as in the first embodiment, reads the necessary speech unit sample data from the voice sample database 110 according to music information having lyric information and melody information, and generates a singing sound signal using the read speech unit sample data. In the second embodiment, the voice sample database 110 stores only one item of speech unit sample data per phoneme or phoneme chain, so the singing generators 120, 121, and 122 may all use the same speech unit sample data. When singing sound signals for a plurality of parts are generated from the same speech unit sample data, their fine characteristics (pitch fluctuations and the like) are essentially identical, which gives the listener an unnatural impression.
[0059]
Therefore, in the chorus synthesizer 400, the chorus control unit 140 divides the lyric information and melody information of the music information of the chorus piece by part and outputs them to the singing generators 120, 121, and 122, and also outputs to the singing generators 120, 121, and 122 designation information that designates from which portion of the speech unit sample data stored in the voice sample database 110 use is to begin.
[0060]
As described in the first embodiment, the speech unit sample data stored in the voice sample database 110 is created from a voice uttered by a speaker and has a predetermined time length (from one frame to several frames). That is, the data is created from a speech waveform expressed as a relationship between time and amplitude within that predetermined time. Therefore, even when the speech unit sample data is stored as frequency-domain data as in the first embodiment, the data is obtained by applying an FFT or the like to a time-domain speech waveform. The chorus control unit 140 supplies each of the singing generators 120, 121, and 122 with designation information that designates from which portion of the speech unit sample data, which is information that varies with time, use is to begin.
[0061]
Here, the chorus control unit 140 outputs to the singing generators 120, 121, and 122 designation information designating different use start times, so that each of the singing generators 120, 121, and 122 begins using the speech unit sample data from a portion corresponding to a different time when generating its singing sound signal.
[0062]
Each of the singing generators 120, 121, and 122 reads the necessary speech unit sample data from the voice sample database 110 based on the lyric information of its part supplied from the chorus control unit 140, and generates a singing sound signal by beginning to use the read speech unit sample data from the portion corresponding to the time designated in the designation information from the chorus control unit 140.
[0063]
In the following, a concrete case is described with reference to the drawings: the speech unit sample data read out according to the lyric information of the three parts is that of the vowel "a", this speech unit sample data "a" consists of 13 frames (time 0 to T) labeled F0 to F13, and each of the singing generators 120, 121, and 122 generates a singing sound signal using the full 13-frame length of the speech unit sample data.
[0064]
In the example illustrated in FIG. 5, the singing generator 120 is supplied with designation information designating that use begin from the first frame F0, the singing generator 121 with designation information designating that use begin from frame F3, and the singing generator 122 with designation information designating that use begin from frame F6. Although the figure shows the speech unit sample data as a time-domain speech waveform for convenience of explanation, the data stored in the voice sample database 110 may, as in the first embodiment, be expressed in the frequency domain as a harmonic component (line spectrum) and a non-harmonic component (residual spectrum).
[0065]
When such designation information is supplied, as shown in FIG. 6, the singing generator 120 generates a singing sound signal using the frames in the order F0, F1, F2, ..., F13, that is, using the speech unit sample data “a” as stored. The singing generator 121 generates a singing sound signal using the speech unit sample data “a” in the order of frames F3, F4, F5, ..., F13, F0, F1, F2, F3. The singing generator 122 generates a singing sound signal using the speech unit sample data “a” in the order of frames F6, F7, ..., F13, F0, F1, ....
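The wrap-around frame order for a single phoneme can be illustrated with a short sketch. The function name, the dictionary keys, and the frame count of 14 (taken from the labels F0 to F13) are assumptions made for illustration. The sketch uses each frame exactly once, so it does not reproduce the repeated start frame that the text lists at the end of the sequence for the singing generator 121.

```python
def wraparound_order(num_frames, start):
    """Frame indices beginning at `start` and wrapping past the last frame to 0."""
    return [(start + i) % num_frames for i in range(num_frames)]


NUM_FRAMES = 14  # assumed from the labels F0..F13

# Use start frames from the FIG. 5 example: F0, F3, F6.
starts = {"generator_120": 0, "generator_121": 3, "generator_122": 6}
orders = {name: wraparound_order(NUM_FRAMES, s) for name, s in starts.items()}

for name, order in orders.items():
    print(name, " ".join(f"F{i}" for i in order))
```

Each generator thus traverses the same data but is at a different position in it at any given moment, which is what decorrelates the fine characteristics between parts.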
[0066]
Because the chorus control unit 140 outputs designation information so that the singing sound signals are generated starting from portions corresponding to different times, the data actually used by the singing generators 120, 121, and 122 differs even though each uses the same phoneme “a” for the same time length (0 to T). That is, the fine characteristics (pitch fluctuations and the like) exhibited by the speech unit sample data actually used by each of the singing generators 120, 121, and 122 differ, so the singing generators 120, 121, and 122 can generate singing sound signals with different fine characteristics from a single item of speech unit sample data per phoneme or phoneme chain.
[0067]
When speech unit sample data for a single phoneme such as “a” is used, simply shifting the use start time within the data as described above varies the fine characteristics of the singing sound generated for each part and makes it possible to synthesize a more natural chorus sound. With speech unit sample data for a phoneme chain consisting of multiple phonemes, however, simply shifting the use start time can cause problems. For example, in speech unit sample data for a phoneme chain such as “ai”, the first half in the time domain more strongly reflects the phoneme “a” and the second half more strongly reflects the phoneme “i”. If, in generating a singing sound signal for the phoneme chain “ai”, use begins from the second half, where the influence of “i” is strong, the data used resembles the phoneme chain “ia”, and the signal for the intended phoneme chain “ai” cannot be generated accurately.
[0068]
Therefore, in the present embodiment, when speech unit sample data corresponding to a phoneme chain of multiple phonemes is used, the chorus control unit 140 outputs designation information as shown in FIG. 7 to each of the singing generators 120, 121, and 122. In the example shown in the figure, the singing generator 120 is supplied with designation information designating that use begin from the first frame F0, the singing generator 121 with designation information designating that use begin from frame F2, and the singing generator 122 with designation information designating that use begin from frame F4. That is, compared with the designation information for a single phoneme, the use start times designated for the singing generators 120, 121, and 122 are concentrated in the first half of the data, where the influence of “a” is strong. Concentrating the use start times of the parts in the first half of the data in this way keeps the data actually used from resembling the phoneme chain “ia” as described above.
[0069]
Even when the designation information is supplied in this way, if the singing generator 121 were to generate a singing sound signal using the speech unit sample data in one direction, that is, in the order of frames F2, F3, F4, ..., F13, F0, F1, F2, then frames F0 to F2, which are strongly influenced by the phoneme “a”, would be used as the data for the latter half, where the influence of “i” should be strong. Therefore, in the present embodiment, when speech unit sample data for a phoneme chain consisting of multiple phonemes is used, the frames are not returned to the start after the last frame (F13); instead they are used in an order that turns back in the reverse direction, such as F12, F11, .... Accordingly, when the use start frames are designated as shown in FIG. 7, then as shown in FIG. 8 the singing generator 120 generates a singing sound signal using the frames in the order F0, F1, F2, ..., that is, using the speech unit sample data as stored. The singing generator 121 generates a singing sound signal using the speech unit sample data in the order of frames F2, F3, F4, ..., F13, F12, F11. The singing generator 122 generates a singing sound signal using the speech unit sample data in the order of frames F4, F5, F6, ..., F13, F12, F11, F10, F9. When frames are used in an order that turns back in the reverse direction, such as from frame F13 to frame F12, noise or the like may be generated at the connection between the two, so processing such as amplitude adjustment or cross-fading is performed at the connections between frames.
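The turn-back frame order for phoneme chains, and a simple cross-fade at a junction, might be sketched as below. Both functions are hypothetical illustrations: the frame count of 14 is assumed from the labels F0 to F13, and the linear time-domain cross-fade is only one possible form of the cross-fade processing mentioned in the text, which does not specify how it is carried out.

```python
def pingpong_order(num_frames, start):
    """Forward from `start` to the last frame, then backwards,
    until `num_frames` frames have been used in total."""
    forward = list(range(start, num_frames))
    backward = list(range(num_frames - 2, num_frames - 2 - start, -1))
    return forward + backward


def crossfade(a, b, overlap):
    """Join sample lists `a` and `b`, fading linearly across `overlap` samples.

    `overlap` must be at least 1 and no longer than either list.
    """
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of `b` rises across the overlap
        mixed.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return a[:-overlap] + mixed + b[overlap:]


# Start frames from the FIG. 7 example: F0, F2, F4.
for start in (0, 2, 4):
    print(" ".join(f"F{i}" for i in pingpong_order(14, start)))
```

Because the sequence never wraps to the beginning, the tail of each sequence stays in the region dominated by the second phoneme.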
[0070]
When each of the singing generators 120, 121, and 122 uses speech unit sample data for a phoneme chain consisting of multiple phonemes, the above procedure allows the phoneme chain to be generated more accurately while the singing sound signals output from the singing generators 120, 121, and 122 differ in their fine characteristics.
[0071]
As described above, in the chorus synthesizer 400 according to the second embodiment, even though only one item of speech unit sample data is stored per phoneme or phoneme chain, that single item can be used to synthesize, as in the first embodiment, a chorus sound signal that gives a more natural impression. That is, a chorus sound signal that gives a more natural impression can be synthesized while the amount of data stored in the voice sample database 110 is kept small.
[0072]
C. Modified example
Note that the present invention is not limited to the above-described first and second embodiments, and various modifications as exemplified below are possible.
[0073]
(Modification 1)
In each of the above embodiments, a singing sound signal is generated by connecting speech unit sample data in units such as phonemes or phoneme chains. Singing also includes an expressive technique called vibrato, and a function for adding vibrato expression may be added to the chorus synthesizer.
[0074]
A known method of generating a singing sound signal for electronically producing a vibrato singing sound is to connect speech unit sample data in units of phonemes or phoneme chains, as in each of the above embodiments, and to apply frequency modulation of about 6 Hz to the waveform represented by the resulting speech unit sample data. A configuration for performing this method may be added to the chorus synthesizer of each of the above embodiments. However, as a method of generating a vibrato singing sound signal that gives the listener a more natural impression, there is a method that uses vibrato voice sample data created from the voice of a speaker singing with vibrato, and it is preferable to add a configuration for performing this method to the chorus synthesizer of each of the above embodiments.
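The conventional approach of applying frequency modulation of about 6 Hz can be illustrated with a minimal sketch. A pure sine tone stands in here for the waveform of the connected speech unit sample data, and the modulation depth, sample rate, and function names are assumptions chosen only for illustration.

```python
import math


def instantaneous_frequency(base_hz, t, rate_hz=6.0, depth_hz=5.0):
    """Pitch at time t (seconds) under sinusoidal frequency modulation."""
    return base_hz + depth_hz * math.sin(2 * math.pi * rate_hz * t)


def vibrato_tone(base_hz, seconds, sample_rate=8000, rate_hz=6.0, depth_hz=5.0):
    """Sine tone whose frequency is modulated at `rate_hz` (about 6 Hz)."""
    samples = []
    phase = 0.0
    for n in range(int(seconds * sample_rate)):
        freq = instantaneous_frequency(base_hz, n / sample_rate, rate_hz, depth_hz)
        phase += 2 * math.pi * freq / sample_rate  # integrate frequency into phase
        samples.append(math.sin(phase))
    return samples


tone = vibrato_tone(440.0, 1.0)
```

Since every part modulated this way follows the same regular sinusoid, the result is highly correlated between parts, which is the motivation for using recorded vibrato voice sample data instead.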
[0075]
Hereinafter, an example in which a function of generating a singing sound signal using vibrato voice sample data created from the vibrato singing voice of a speaker is added to the chorus synthesizer of the first embodiment will be described in detail with reference to the drawings.
[0076]
As shown in the figure, the voice sample database 110 of the chorus synthesizer 100' stores, in addition to speech unit sample data in units of phonemes or phoneme chains such as the voice sample data groups 110a, 110b, and 110c, vibrato voice sample data created from a singing voice during vibrato singing. Here, the voice sample database 110 stores three items of vibrato voice sample data BDa, BDb, and BDc, each created from a different voice.
[0077]
With this configuration, the chorus control unit 140 supplies each of the singing generators 120, 121, and 122 with, in addition to the lyric information and melody information of each part and the designation information designating the voice sample data group to be used as in the first embodiment, second designation information designating which of the three items of vibrato voice sample data BDa, BDb, and BDc is to be used. The second designation information supplied to each of the singing generators 120, 121, and 122 designates the use of different vibrato voice sample data. When supplied with such second designation information, each of the singing generators 120, 121, and 122 reads different vibrato voice sample data when generating a vibrato singing sound signal, superimposes the waveform represented by the read vibrato voice sample data on the speech waveform represented by the speech unit sample data read and connected in the same manner as in the above embodiments, and outputs the superimposed waveform signal as the singing sound signal.
[0078]
When vibrato singing sound signals are generated in this way, the singing generators 120, 121, and 122 use, for their respective parts, the three items of vibrato voice sample data BDa, BDb, and BDc created from different voices, so the fine characteristics of the generated vibrato singing sound signals (such as the manner of frequency fluctuation during vibrato) also differ from part to part. Because the vibrato singing sounds of the parts are then almost uncorrelated and each part has its own characteristics, a listener who hears the vibrato portions of the singing sound based on the chorus sound signal synthesized by the chorus synthesizer 100' receives a more natural impression.
[0079]
Incidentally, when the characteristics of the vibrato portions of the parts of a chorus sound are essentially identical, the listener receives a more unnatural impression than when the characteristics of other portions are identical. There may therefore be demand for a device that gives each part unique characteristics only in the vibrato portions. In that case, the singing sound signals may be generated as in the above embodiments but using the same speech unit sample data for phonemes or phoneme chains, unchanged, in every part, and the vibrato effect may be imparted by adding to the generated singing sound signals waveforms represented by vibrato voice sample data that differs from part to part.
[0080]
(Modification 2)
As shown in FIG. 9, three items of vibrato voice sample data may be used to match the number of singing generators 120, 121, and 122. Alternatively, as in the chorus synthesizer 400' (a modification of the second embodiment), the singing generators 120, 121, and 122 may generate the singing sound signals of the vibrato portions using the same vibrato voice sample data.
[0081]
As described in the above embodiment, the singing generators 120, 121, and 122 can generate singing sound signals with different unique characteristics by using the voice sample data groups 110a, 110b, and 110c selectively. Even if the waveform represented by the same vibrato voice sample data is added to singing sound signals generated in this way, the vibrato-portion singing sound signals output from the singing generators 120, 121, and 122 each retain unique characteristics. The singing generators 120, 121, and 122 may therefore simply share one item of vibrato voice sample data. Alternatively, in the same way that the singing generators 120, 121, and 122 of the second embodiment use the speech unit sample data, each of the singing generators 120, 121, and 122 may begin using the same vibrato voice sample data from a portion corresponding to a different time. In this case, the chorus control unit 140 may supply the singing generators 120, 121, and 122 with designation information designating the portion corresponding to the time at which use is to begin. The voice sample data actually used by each of the singing generators 120, 121, and 122 to impart vibrato then has different characteristics. The vibrato-portion singing sound signal of each part therefore has unique characteristics, and a listener who hears the vibrato portions of the singing sound based on the chorus sound signal synthesized by the chorus synthesizer 400' receives a more natural impression.
[0082]
(Modification 3)
In the above modifications, vibrato voice sample data is stored in the voice sample database 110 in order to impart a vibrato effect to the generated singing sound signal. To electronically produce singing sounds in singing styles other than vibrato, such as tremolo and portamento, voice sample data created from a speaker's singing voice in tremolo portions or portamento portions may also be stored in the voice sample database 110. In this case as well, as with the vibrato voice sample data in the above modifications, the singing sound signals of the tremolo or portamento portions of each part can be given unique characteristics either by preparing voice sample data for each part or by using the same voice sample data starting from portions corresponding to different times.
[0083]
(Modification 4)
In the first embodiment described above, the three voice sample data groups 110a, 110b, and 110c are stored in the voice sample database 110. The speech unit sample data included in each of the voice sample data groups 110a, 110b, and 110c may be created for a different pitch range. For example, speech unit sample data created from a high-pitched voice may be included in the voice sample data group 110a, speech unit sample data created from a mid-range voice in the voice sample data group 110b, and speech unit sample data created from a low-pitched voice in the voice sample data group 110c.
[0084]
When a voice sample database 110 storing voice sample data groups 110a, 110b, and 110c created for the respective pitch ranges is used, the chorus control unit 140 outputs, to the singing generator responsible for generating the singing sound signal of the part whose melody lies in the high range among the plurality of parts included in the music information, designation information designating use of the voice sample data group 110a created from a high-pitched voice. Likewise, it outputs to the singing generator responsible for the part whose melody lies in the middle range designation information designating use of the voice sample data group 110b created from a mid-range voice, and to the singing generator responsible for the part whose melody lies in the low range designation information designating use of the voice sample data group 110c created from a low-pitched voice. Each of the singing generators 120, 121, and 122 can thereby use more suitable speech unit sample data for the part it is responsible for and can generate a higher-quality singing sound signal.
[0085]
When singing sound signals for a given piece are generated in this way, the voice sample data groups 110a, 110b, and 110c used by the singing generators 120, 121, and 122 may be fixed. However, the pitch level of each part determined by the melody information may change from moment to moment. In that case, while the singing sound signals of the piece are being generated, the chorus control unit 140 may output designation information such that the voice sample data group designated for each of the singing generators 120, 121, and 122 is changed successively in the course of the piece according to the pitch level of each part determined by its melody information.
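The selection of a voice sample data group by pitch range, including switching in the course of a piece, could look something like the following sketch. The use of MIDI note numbers, the threshold values, and the function names are all hypothetical; the text only states that the group is chosen according to the pitch level of each part.

```python
def select_group(mean_midi_pitch, low_max=55, mid_max=67):
    """Pick a voice sample data group from a part's pitch level.

    The thresholds are illustrative assumptions, not values from the text.
    """
    if mean_midi_pitch > mid_max:
        return "110a"  # created from a high-pitched voice
    if mean_midi_pitch > low_max:
        return "110b"  # created from a mid-range voice
    return "110c"      # created from a low-pitched voice


def plan_groups(section_pitches):
    """Designation per section, so the group can change during the piece."""
    return [select_group(p) for p in section_pitches]


# One part whose melody drifts from the high range down to the low range.
print(plan_groups([72, 60, 48]))
```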
[0086]
(Modification 5)
In the above modification, the voice sample data groups 110a, 110b, and 110c created from voices of different pitch ranges are stored in the voice sample database 110. In singing, however, the pitch may vary greatly while one and the same phoneme is being uttered. Speech unit sample data may therefore be created from a voice uttered with the pitch varied while the same phoneme, for example “a”, is being voiced, and this speech unit sample data may be stored in the voice sample database 110. In this way, the voice sample database 110 may hold not only speech unit sample data for phonemes or phoneme chains at a constant pitch, as described in each of the above embodiments, but also speech unit sample data created to account for the various pitch variations that can occur during singing.
[0087]
(Modification 6)
In the first embodiment described above, the singing generators 120, 121, and 122 selectively use the voice sample data groups 110a, 110b, and 110c stored in the voice sample database 110, and in the second embodiment they begin using the same speech unit sample data from portions corresponding to different times, thereby synthesizing a chorus sound signal that gives a more natural impression. A parameter changing unit may be added to the chorus synthesizers of the first and second embodiments so that some of the values indicated in the speech unit sample data read from the voice sample database 110 (that is, parameters that determine the sound) are changed before the data is supplied to each of the singing generators 120, 121, and 122. FIG. 11 shows the configuration when such parameter changing units are added to the chorus synthesizer of the first embodiment.
[0088]
As shown in the figure, this chorus synthesizer 100″ has the same configuration as the chorus synthesizer 100 of the first embodiment and additionally includes parameter changing units 220, 221, and 222 provided for the singing generators 120, 121, and 122, respectively. With this configuration, the chorus control unit 140 outputs to the singing generators 120, 121, and 122 the lyric information and melody information of each part and the designation information designating which voice sample data group to use, as in the first embodiment, and also outputs change information indicating the content of the parameter change to each of the parameter changing units 220, 221, and 222. Here, the chorus control unit 140 outputs to the parameter changing units 220, 221, and 222 change information such that a different change is applied by each to the speech unit sample data read from the voice sample database 110.
[0089]
The parameter changing units 220, 221, and 222 read the speech unit sample data required by the corresponding singing generators 120, 121, and 122 from the voice sample data group indicated by the designation information, and change the read speech unit sample data according to the change information supplied from the chorus control unit 140. The changed speech unit sample data is then supplied to the corresponding singing generators 120, 121, and 122, each of which generates a singing sound signal using the changed speech unit sample data.
[0090]
The change processing that the parameter changing units 220, 221, and 222 apply to the speech unit sample data read from the voice sample database 110 may be any processing that alters the timbre or the like to an extent that does not impair phonemic identity, and various kinds of change processing can be applied. For example, the formant structure of the voice represented by speech unit sample data read from the voice sample database 110 may be modeled, and the timbre altered slightly by changing the formant bandwidth by a few percent, shifting the center frequency of the band by about 10 Hz, or the like. In this case, setting the rate of change of the formant bandwidth and the shift amount of the band center frequency to different values in each of the parameter changing units 220, 221, and 222 makes the timbres of the voices represented by the speech unit sample data read by the parameter changing units 220, 221, and 222 slightly different from one another.
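A per-part formant change of this kind can be sketched as follows. The `Formant` record, the `perturb` function, and the concrete scale and shift values in `CHANGE_INFO` are hypothetical; they merely instantiate the "few percent" bandwidth change and "about 10 Hz" center frequency shift mentioned above, with a different setting for each parameter changing unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Formant:
    center_hz: float
    bandwidth_hz: float


def perturb(formants, bandwidth_scale, center_shift_hz):
    """Return slightly altered formants; the input list is left untouched."""
    return [
        Formant(f.center_hz + center_shift_hz, f.bandwidth_hz * bandwidth_scale)
        for f in formants
    ]


# Assumed change information per parameter changing unit:
# (bandwidth scale, center frequency shift in Hz).
CHANGE_INFO = {220: (0.97, -10.0), 221: (1.00, 0.0), 222: (1.03, 10.0)}

# Illustrative formant model of one speech unit (values are made up).
unit_formants = [Formant(700.0, 100.0), Formant(1200.0, 120.0)]
per_unit = {u: perturb(unit_formants, *c) for u, c in CHANGE_INFO.items()}
```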
[0091]
(Modification 7)
In each of the above embodiments, the sounding timing of the singing sounds generated by the singing generators 120, 121, and 122 may be shifted from part to part in order to give the listener a more natural impression. In this case, the chorus control unit 140 supplies timing designation information designating by how much the sounding timing of each part is to be shifted. The chorus control unit 140 supplies the timing designation information to each of the singing generators 120, 121, and 122 so that their sounding timings are slightly offset from one another. For example, if the singing generator 120 outputs the singing sound signal generated according to the lyric information and melody information supplied from the chorus control unit 140 to the adder 130 without delay, the singing generator 121 outputs its singing sound signal to the adder 130 with a delay of 10 msec, and the singing generator 122 outputs its singing sound signal to the adder 130 with a delay of 20 msec, the singing sounds of the parts are sounded at subtly different times, which can give the listener a more natural impression.
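The effect of the 0, 10, and 20 msec delays on the signal reaching the adder 130 can be sketched as a delayed mix. The function name, the default sample rate, and the representation of each part as a list of samples are assumptions made for illustration.

```python
def mix_with_delays(parts, delays_ms, sample_rate=44100):
    """Sum the part signals after delaying each by the given milliseconds."""
    offsets = [round(d * sample_rate / 1000) for d in delays_ms]
    length = max(len(p) + o for p, o in zip(parts, offsets))
    out = [0.0] * length
    for part, offset in zip(parts, offsets):
        for i, sample in enumerate(part):
            out[offset + i] += sample
    return out


# Three single-sample "parts" at an illustrative 1 kHz sample rate,
# delayed by 0, 10, and 20 msec as in the example in the text.
mixed = mix_with_delays([[1.0], [1.0], [1.0]], [0, 10, 20], sample_rate=1000)
```

Reversing the delay list partway through a piece would correspond to changing the order in which the parts sound.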
[0092]
When the singing sound signals of a given piece are generated in this way, the relationship among the sounding timings of the singing generators 120, 121, and 122 may be fixed, or it may be varied in the course of the piece. For example, in the first half of the piece the sounds may be produced in the order of the singing generators 120, 121, 122 as in the above example, and in the second half in the order of the singing generators 122, 121, 120.
[0093]
(Modification 8)
In the first embodiment described above, as many types of voice sample data groups as there are singing generators 120, 121, and 122 (three) are stored in the voice sample database 110, but a different number of voice sample data groups may be stored.
[0094]
When three singing generators such as the singing generators 120, 121, and 122 are provided but only two voice sample data groups 110a and 110b are stored in the voice sample database 110, it suffices for at least two of the singing generators to generate singing sound signals using the different voice sample data groups 110a and 110b. In this case, if the singing generator 120 uses the voice sample data group 110a, the singing generator 121 uses the voice sample data group 110b, and the singing generator 122 uses one of the voice sample data groups 110a and 110b starting from a portion corresponding to a time different from that used by the singing generators 120 and 121, the three singing generators 120, 121, and 122 in effect generate singing sound signals using different speech unit sample data, and a chorus sound signal that gives a natural impression can be synthesized as in the above embodiments.
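One way to picture this combination of the two embodiments is an assignment table in which each generator receives a (group, start frame) pair. The function name and the offset step of three frames are hypothetical; the text only requires that a generator sharing a group begin from a different time.

```python
def assign(generators, groups, offset_step=3):
    """Give each generator a (group, start frame) pair.

    Groups are handed out in turn; once they run out, they are reused
    with a later start frame so that no two generators share a pair.
    """
    plan = {}
    for i, gen in enumerate(generators):
        group = groups[i % len(groups)]
        start = (i // len(groups)) * offset_step
        plan[gen] = (group, start)
    return plan


plan = assign([120, 121, 122], ["110a", "110b"])
print(plan)
```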
[0095]
(Modification 9)
The chorus synthesizer in each of the above embodiments and modifications may be configured as a dedicated hardware circuit, or may be implemented in software on a computer system as shown in the drawings. As illustrated there, the computer system includes a CPU (Central Processing Unit) 320 that controls the entire apparatus, a ROM (Read Only Memory) 321 that stores various control data and programs, a RAM (Random Access Memory) 322 used as a work area, an external storage device 323 such as a hard disk or CD-ROM (Compact Disc Read Only Memory) drive that stores music information and programs, an operation unit 324 such as a keyboard and mouse, a display unit 325 that displays various information to the user, a D/A converter 326, an amplifier 327, and a speaker 328.
[0096]
The CPU 320 builds the voice sample database 110 in the RAM 322 or the external storage device 323 in accordance with a group of programs stored in the ROM 321 or in the external storage device 323 such as a hard disk, and uses this voice sample database 110 to perform singing sound signal synthesis processing for each part as in the above embodiments and modifications. After adding together the singing sound signals generated for the parts, the CPU 320 outputs the summed chorus sound signal to the D/A converter 326. The D/A converter 326 converts the chorus sound signal into an analog signal; the analog chorus sound signal is amplified by the amplifier 327 and then emitted from the speaker 328.
[0097]
As described above, the chorus synthesizer in each of the above embodiments and modifications can be implemented in software using a computer system, and may be provided to users in the form of a program that causes a computer system to execute the same chorus sound synthesis processing as in the above embodiments and the like. Such a program may be provided by storing it on various recording media such as a CD-ROM or floppy disk, or via a communication line such as the Internet.
[0098]
[Effect of the invention]
As described above, according to the present invention, it is possible to synthesize a chorus sound that can give a more natural impression to a listener.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a basic configuration of a choral synthesis device according to a first embodiment of the present invention.
FIG. 2 is a diagram for explaining a method of creating a voice sample database which is a component of the choral synthesizer.
FIG. 3 is a block diagram showing a functional configuration of a singing generator which is a component of the chorus synthesis apparatus.
FIG. 4 is a block diagram showing a basic configuration of a choral synthesizer according to a second embodiment of the present invention.
FIG. 5 is a diagram for explaining a singing sound signal generation method by the chorus synthesis apparatus according to the second embodiment.
FIG. 6 is a diagram for explaining a singing sound signal generation method by the chorus synthesis apparatus according to the second embodiment.
FIG. 7 is a diagram for explaining a singing sound signal generation method by the chorus synthesis apparatus according to the second embodiment.
FIG. 8 is a diagram for explaining a singing sound signal generation method by the chorus synthesis apparatus according to the second embodiment.
FIG. 9 is a block diagram showing a basic configuration of a modified example of the choral synthesizer according to the first embodiment.
FIG. 10 is a block diagram showing a basic configuration of a modified example of the choral synthesizer according to the second embodiment.
FIG. 11 is a block diagram showing a basic configuration of another modified example of the choral synthesizer according to the first embodiment.
FIG. 12 is a block diagram showing a configuration of a computer system for realizing the function of the chorus synthesizer by software.
[Explanation of symbols]
100, 100', 100'' ... chorus synthesizer; 110 ... voice sample database; 110a, 110b, 110c ... voice sample data group; 120, 121, 122 ... singing generator; 130 ... adder; 140 ... chorus control unit; 200 ... SMS analysis unit; 201 ... section cutout unit; 220, 221, 222 ... parameter change unit; 301 ... speech unit selection unit; 302 ... pitch determination unit; 303 ... duration adjustment unit; 304 ... speech unit connection unit; 305 ... harmonic component generation unit; 306 ... addition unit; 307 ... inverse FFT unit; 308 ... windowing unit; 309 ... overlap unit; 400, 400' ... chorus synthesizer.

Claims

A chorus synthesizer that synthesizes a choral sound signal based on music data, comprising:
a database that stores, for the same type of voice sample data, voice sample data created from each of a plurality of different voices;
a plurality of singing generation means, each of which generates a singing sound signal according to the music data by reading out the required voice sample data from the database and using it for the generation; and
a singing voice synthesizing unit that synthesizes a choral sound signal from the singing sound signals generated by the plurality of singing generation means,
wherein, when the music data is composed of a plurality of parts and each of the plurality of singing generation means generates a singing sound signal corresponding to each of the parts, each of at least two of the singing generation means reads out from the database the voice sample data created from a different voice and uses the read-out voice sample data to generate the singing sound signal.

The chorus synthesizer according to claim 1, wherein the database stores speech unit sample data, which is voice sample data for a speech unit that is a phoneme or a phoneme chain connecting two or more phonemes, the speech unit sample data being created from each of a plurality of different voices for the same phoneme or phoneme chain, and
the singing generation means reads out from the database the speech unit sample data corresponding to the lyrics indicated in the music data, connects them, and generates the singing sound signal by adjusting the connected speech unit sample data according to the pitch indicated in the music data.

The chorus synthesizer according to claim 1, wherein the database stores vibrato voice sample data, which is voice sample data created from each of a plurality of different voices and indicates characteristics of a vibrato portion of the voice, and
the singing generation means reads out and uses the vibrato voice sample data stored in the database when generating the singing sound signal of a vibrato portion.

A chorus synthesizer that synthesizes a choral sound signal according to music data, comprising:
a database that stores voice sample data having a predetermined time length created from a voice;
a plurality of singing generation means, each of which generates a singing sound signal according to the music data by reading out the required voice sample data from the database and using it for the generation; and
a singing voice synthesizing unit that synthesizes a choral sound signal from the singing sound signals generated by the plurality of singing generation means,
wherein, when the music data is composed of a plurality of parts and each of the plurality of singing generation means generates a singing sound signal corresponding to each of the parts, at least two of the singing generation means generate the singing sound signal by starting to use the voice sample data read from the database from portions corresponding to different times.

The chorus synthesizer according to claim 4, wherein the database stores speech unit sample data having the predetermined time length for a speech unit that is a phoneme or a phoneme chain connecting two or more phonemes, and
the singing generation means reads out from the database the speech unit sample data corresponding to the lyrics indicated in the music data, connects them, and generates the singing sound signal by adjusting the connected speech unit sample data according to the pitch indicated in the music data.

The chorus synthesizer according to claim 4 or 5, wherein the database stores vibrato voice sample data having the predetermined time length and indicating characteristics of a vibrato portion of a voice, and
the singing generation means reads out and uses the vibrato voice sample data stored in the database when generating the singing sound signal of a vibrato portion.

The chorus synthesizer according to claim 1, further comprising timing adjusting means that, when the music data is composed of a plurality of parts and each of the plurality of singing generation means generates a singing sound signal corresponding to each of the parts, shifts the output timing of the singing sound signal generated by each of at least two of the singing generation means.

The chorus synthesizer according to claim 1, further comprising timbre changing means, provided corresponding to each of the plurality of singing generation means, that changes timbre-related data included in the voice sample data read from the database,
wherein, when the music data is composed of a plurality of parts and each of the plurality of singing generation means generates a singing sound signal corresponding to each of the parts, the timbre changing means corresponding to at least two of the singing generation means change the timbre-related data by mutually different methods.

A chorus synthesis method for synthesizing a choral sound signal from a plurality of singing sound signals generated based on music data, comprising:
when generating singing sound signals corresponding to a plurality of parts in accordance with music data composed of the plurality of parts, reading out the required voice sample data from a database that stores voice sample data created from each of a plurality of different voices; and
generating the singing sound signals corresponding to at least two of the parts using the voice sample data, read from the database, that were created from different voices.

A chorus synthesis method for synthesizing a choral sound signal from a plurality of singing sound signals generated based on music data, comprising:
when generating singing sound signals corresponding to a plurality of parts in accordance with music data composed of the plurality of parts, reading out the required voice sample data from a database that stores voice sample data having a predetermined time length created from a voice; and
generating the singing sound signals corresponding to at least two of the parts by starting to use the voice sample data read from the database from portions corresponding to different times.

A program for causing a computer to function as:
singing sound generation means that reads out, in accordance with music data, the required voice sample data from a database storing voice sample data created from each of a plurality of different voices and generates singing sound signals, wherein, when the music data is composed of a plurality of parts and singing sound signals corresponding to the plurality of parts are generated, the singing sound signals corresponding to at least two of the parts are generated using the voice sample data, read from the database, that were created from different voices; and
a singing voice synthesizing unit that synthesizes a choral sound signal from the generated singing sound signals.

A program for causing a computer to function as:
singing sound generation means that reads out, in accordance with music data, the required voice sample data from a database storing voice sample data having a predetermined time length created from a voice and generates singing sound signals, wherein, when the music data is composed of a plurality of parts and singing sound signals corresponding to the plurality of parts are generated, the singing sound signals corresponding to at least two of the parts are generated by starting to use the voice sample data read from the database from portions corresponding to different times; and
a singing voice synthesizing unit that synthesizes a choral sound signal from the generated singing sound signals.