JP2004061793A

JP2004061793A - Apparatus, method, and program for singing synthesis

Info

Publication number: JP2004061793A
Application number: JP2002219203A
Authority: JP
Inventors: Yuji Hisaminato; 久湊　裕司
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2002-07-29
Filing date: 2002-07-29
Publication date: 2004-02-26
Anticipated expiration: 2022-07-29
Also published as: JP3979213B2

Abstract

PROBLEM TO BE SOLVED: To keep tone quality such as timbre as it is when synthesizing a singing voice having the same pitch as a singing voice stored in a database as speech phoneme data for singing synthesis, and to change the tone quality such as timbre only when a singing voice having a different pitch is synthesized.
SOLUTION: When speech phoneme data corresponding to MIDI data (pitch data, lyrics data, etc.) input to a synthesis parameter generator 12 are selected from the database 11, synthesis parameters PA, which are a set of various parameters PAn, and pitch data (database pitch data) DP included in the speech phoneme data are extracted. Further, synthesis pitch data SP representing the pitch of the speech to be synthesized in a time series are generated on the basis of the input pitch data. The individual parameters PAn are given a key scaling effect on the basis of the synthesis pitch data SP and the database pitch data DP by using a key scaling function fn.
COPYRIGHT: (C)2004,JPO

Description

[0001]
[Technical field of the invention]
The present invention relates to a singing voice synthesizing device, a singing voice synthesizing method, and a singing voice synthesizing program that synthesize voice based on input performance data.
[0002]
[Prior art]
In electronic musical instruments, pitch conversion is performed by a function called key scaling. Specifically, timbre is converted by adding to or subtracting from a timbre parameter a value that depends on the pitch of the pressed key of an electronic piano keyboard (see, for example, Japanese Patent Laid-Open No. 58-211786).
In a karaoke apparatus or the like, for example, a singing voice synthesizing device with a pitch conversion function is provided for purposes such as converting a male voice into a female voice. In this case, simply converting the pitch of the original singing voice to a different pitch yields an unnatural voice, so the other parameters used for singing synthesis, such as timbre, are also converted to match the pitch using the same kind of key scaling function as in electronic musical instruments.
[0003]
[Problems to be solved by the invention]
In a singing voice synthesizing device, the parameters used for singing synthesis are extracted from actually recorded singing data cut out phoneme by phoneme, and the pitch at the time of recording differs from phoneme to phoneme. Consequently, if the parameters are key-scaled simply according to the pitch to be synthesized, the timbre changes even when synthesizing a singing voice at the same pitch as that at which the phoneme was recorded. To solve this, each parameter had to hold, for every phoneme, a key scaling function defining the scaling amount according to pitch, so that the parameter would not change at the pitch at which the phoneme was recorded. However, this makes the number of key scaling functions enormous, which makes the key scaling functions difficult to create and modify.
The present invention introduces this key scaling function into a singing voice synthesizing device, a singing voice synthesizing method, and a singing voice synthesizing program. Its object is to make it possible, without increasing the number of key scaling functions, to keep tone quality such as timbre as it is when synthesizing a singing voice at the same pitch as the original singing voice, and to change tone quality such as timbre only when synthesizing a singing voice at a pitch different from that of the original singing voice.
[0004]
[Means for Solving the Problems]
To achieve the above object, a singing voice synthesizing device according to a first invention of the present application comprises: a voice information input unit for inputting voice information indicating the content of a voice to be synthesized; a phoneme database storing speech unit data for synthesizing speech; a selection unit that selects the speech unit data stored in the phoneme database based on the voice information; a synthesized pitch data output unit that outputs synthesized pitch data indicating the pitch of the voice to be synthesized in a time series; a synthesis parameter output unit that extracts and outputs a synthesis parameter from the speech unit data selected by the selection unit; a database pitch data generation unit that extracts the pitch data constituting the speech unit data used in extracting the synthesis parameter and outputs it as database pitch data; a key scaling function storage unit that stores a key scaling function prepared for each synthesis parameter; a key scaling unit that corrects the synthesis parameter based on the difference between the function value obtained by substituting the synthesized pitch data into the key scaling function and the function value obtained by substituting the database pitch data into the key scaling function, and outputs a correction parameter; and a waveform synthesis unit that synthesizes a waveform based on the correction parameter.
[0005]
According to the singing voice synthesizing device according to the first aspect, the speech unit data stored in the phoneme database is selected by the selection unit based on the input voice information. Then, synthesized pitch data indicating the pitch of the voice to be synthesized in a time series is output. In addition, a synthesis parameter for each time is extracted from the speech unit data selected by the selection unit. Further, pitch data constituting the speech unit data used in extracting the synthesis parameters is extracted and output as database pitch data.
If there is a difference between the synthesized pitch data and the database pitch data, the key scaling function corrects the synthesized parameter in accordance with the change in pitch, thereby enhancing the naturalness of the synthesized output voice. On the other hand, if there is no difference between them, the correction of the synthesis parameter is not performed. Therefore, it is possible to enhance the naturalness of the output singing voice without preparing a key scaling function for each phoneme or time.
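The effect described above can be illustrated with a small numerical sketch. The function `f`, its slope, the parameter value, and the use of MIDI note numbers as the pitch unit are all assumptions made for illustration; none of them appears in the original text. The sketch compares scaling that depends only on the pitch to be synthesized with scaling by the difference of the two function values.

```python
def f(pitch: float) -> float:
    """Hypothetical key scaling function: 0.01 per semitone around note 60."""
    return 0.01 * (pitch - 60.0)


def naive_scaling(pa: float, sp: float) -> float:
    """Scaling that depends only on the pitch to be synthesized."""
    return pa + f(sp)


def difference_scaling(pa: float, sp: float, dp: float) -> float:
    """Scaling by the difference of the function values at SP and DP."""
    return pa + (f(sp) - f(dp))


# Assumed parameter value taken from a unit that was recorded at pitch 67.
recorded_gain = -6.0

same_pitch_naive = naive_scaling(recorded_gain, sp=67.0)
same_pitch_diff = difference_scaling(recorded_gain, sp=67.0, dp=67.0)
other_pitch_diff = difference_scaling(recorded_gain, sp=72.0, dp=67.0)
```

In this sketch the naive form alters the parameter even when the unit is synthesized at its recorded pitch, while the difference form leaves it untouched and changes it only when the two pitches differ. A single shared `f` therefore serves units recorded at any pitch.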
[0006]
To achieve the above object, a singing voice synthesizing method according to a second invention of the present application comprises: a voice information input step of inputting voice information indicating the content of a voice to be synthesized; a selection step of storing speech unit data for synthesizing speech in a phoneme database in advance and selecting the speech unit data stored in the phoneme database based on the voice information; a synthesized pitch data output step of outputting synthesized pitch data indicating the pitch of the voice to be synthesized in a time series; a synthesis parameter output step of extracting and outputting a synthesis parameter from the speech unit data selected in the selection step; a database pitch data generation step of extracting the pitch data constituting the speech unit data used in extracting the synthesis parameter and outputting it as database pitch data; a key scaling step of preparing a key scaling function for each synthesis parameter, correcting the synthesis parameter based on the difference between the function value obtained by substituting the synthesized pitch data into the key scaling function and the function value obtained by substituting the database pitch data into the key scaling function, and outputting a correction parameter; and a waveform synthesis step of synthesizing a waveform based on the correction parameter.
[0007]
To achieve the above object, a singing voice synthesizing program according to a third invention of the present application is configured to cause a computer to execute: a voice information input step of inputting voice information indicating the content of a voice to be synthesized; a selection step of storing speech unit data for synthesizing speech in a phoneme database in advance and selecting the speech unit data stored in the phoneme database based on the voice information; a synthesized pitch data output step of outputting synthesized pitch data indicating the pitch of the voice to be synthesized in a time series; a synthesis parameter output step of extracting and outputting a synthesis parameter from the speech unit data selected in the selection step; a database pitch data generation step of extracting the pitch data constituting the speech unit data used in extracting the synthesis parameter and outputting it as database pitch data; a key scaling step of preparing a key scaling function for each synthesis parameter, correcting the synthesis parameter based on the difference between the function value obtained by substituting the synthesized pitch data into the key scaling function and the function value obtained by substituting the database pitch data into the key scaling function, and outputting a correction parameter; and a waveform synthesis step of synthesizing a waveform based on the correction parameter.
[0008]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a block diagram showing a schematic configuration of a singing voice synthesizing device according to an embodiment of the present invention. As shown in FIG. 1, the singing voice synthesizing device 1 according to the present embodiment includes a speech database 11, a synthesis parameter generation device 12, a key scaling function storage unit 13, a key scaling device 14, and a waveform synthesizing device 15.
[0009]
The speech database 11 holds data obtained by cutting actually recorded or otherwise acquired signals, such as singing data, into phonemes. As shown in FIG. 2, it stores speech unit data classified into phoneme transition portions (the connecting portion between one phoneme and the next, e.g. a-e, a-i) and stationary portions (portions where a phoneme, e.g. a, i, is pronounced steadily). Each piece of speech unit data is represented by a synthesis parameter PA.
The synthesis parameter PA includes a plurality of types (here, N types) of parameters PAn (n = 1, 2,..., N) such as a formant center frequency, a bandwidth, and a gain. Each parameter PAn is stored as a data array or function that can be represented by a graph with the horizontal axis representing time t (t = 0 to T) and the vertical axis representing parameter values.
Different pieces of speech unit data store different parameters PAn. For example, the phoneme transition portions a-e and a-i store parameters PAn with different formant center frequencies and gains.
[0010]
Further, in addition to these parameters PAn, the data of each phoneme transition portion and each stationary portion stores one set of pitch data DP of that speech unit at each time (hereinafter referred to as database pitch DP).
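The layout of one piece of speech unit data described in paragraphs [0009] and [0010] might be sketched as follows. The class name, field names, parameter names, sample values, and the use of MIDI note numbers for pitch are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SpeechUnit:
    """One speech unit: a phoneme transition portion or a stationary portion."""

    label: str                      # e.g. "a-e" (transition) or "a" (stationary)
    params: Dict[str, List[float]]  # each PAn sampled over t = 0..T
    db_pitch: List[float]           # database pitch DP at the same sample times

    def num_frames(self) -> int:
        """Number of time samples held for this unit."""
        return len(self.db_pitch)


# Hypothetical unit with N = 2 parameters and three time samples.
unit = SpeechUnit(
    label="a-e",
    params={
        "formant1_hz": [700.0, 650.0, 600.0],
        "gain_db": [-6.0, -5.5, -5.0],
    },
    db_pitch=[60.0, 60.0, 60.5],
)
```

Storing DP on the same time grid as the parameters is one possible choice; it lets a later stage look up PAn(t) and DP(t) with the same index.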
[0011]
The synthesis parameter generation device 12 functions as an input unit that receives MIDI data (pitch, lyrics, etc.) representing the singing voice to be synthesized, and as a synthesis parameter generation unit that reads the speech unit data (phoneme transition data or stationary portion data) corresponding to this input data from the speech database 11 as the synthesis parameter PA.
[0012]
Further, the synthesis parameter generation device 12 functions as a synthesized pitch data output unit: based on the pitch data and lyrics data input to it as MIDI data, it determines the exact pitch at each time of the output singing voice to be synthesized and outputs this to the key scaling device 14 as synthesized pitch data SP. This pitch determination takes into account data such as the phonemes and pitches of the lyrics for the voices before and after the voice to be synthesized.
Further, the synthesis parameter generation device 12 functions as a database pitch data generation unit that reads the database pitch DP included in the read speech unit data and outputs the read data to the key scaling device 14.
[0013]
The key scaling function storage unit 13 stores as many key scaling functions fn as there are synthesis parameters PAn (here, N).
The key scaling device 14 reads the key scaling function fn (n = 1, 2, ..., N) corresponding to each synthesis parameter PAn from the key scaling function storage unit 13 and applies it to each input synthesis parameter PAn to give a key scaling effect.
With this key scaling effect, the synthesis parameters can be adjusted to the pitch of the singing voice to be synthesized, so that, for example, the voice becomes shrill when a high note is produced and harder to hear when a low note is produced, which enables natural speech synthesis. Further, since there is no need to provide a speech database for each pitch to be synthesized, the speech database can be kept small.
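The text does not specify how a key scaling function fn is represented. One possible representation, shown purely as an assumption, is a piecewise-linear curve over pitch defined by a few breakpoints. The helper `make_key_scaling`, the breakpoint values, and the MIDI note number pitch unit are all hypothetical.

```python
from bisect import bisect_right
from typing import Callable, List, Tuple


def make_key_scaling(points: List[Tuple[float, float]]) -> Callable[[float], float]:
    """Build fn(pitch) from (pitch, value) breakpoints sorted by pitch."""
    xs = [p for p, _ in points]
    ys = [v for _, v in points]

    def fn(pitch: float) -> float:
        # Hold the end values outside the breakpoint range.
        if pitch <= xs[0]:
            return ys[0]
        if pitch >= xs[-1]:
            return ys[-1]
        # Linear interpolation between the two surrounding breakpoints.
        i = bisect_right(xs, pitch)
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (pitch - x0) / (x1 - x0)

    return fn


# Hypothetical curve for one parameter: lower below note 60, higher above it.
f_gain = make_key_scaling([(48.0, -0.2), (60.0, 0.0), (72.0, 0.3)])
```

One such function per parameter type (N in total) would be held in the key scaling function storage unit 13 under this assumption, independent of how many speech units the database contains.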
Then, using the database pitch DP, the synthesis parameter PAn given the key scaling effect is kept as it is when a singing voice having the same pitch as that stored in the speech database 11 is to be synthesized, and is changed when a singing voice of a different pitch is to be synthesized. Specifically, the function values fn(SP) and fn(DP) (see FIG. 3) are obtained by substituting the synthesized pitch SP and the database pitch DP into the function fn, and then, as shown in [Equation 1] below, the difference between the two function values (in the case of FIG. 3, fn(DP) - fn(SP) = 0.1) is used to give each parameter PAn a key scaling effect and obtain a correction parameter PAn′. In other words, the correction parameter PAn′ is the parameter PAn corrected by the synthesized pitch SP and the database pitch DP.
[0014]
[Equation 1]
PAn′ = PAn + fn(SP) - fn(DP)
[0015]
Alternatively, as shown in [Equation 2] below, the key scaling effect may be given by multiplying each parameter PAn by the value obtained by adding a constant (for example, 1) to the difference between the function values fn(SP) and fn(DP).
[Equation 2]
PAn′ = PAn × (fn(SP) - fn(DP) + 1.0)
[0016]
Which of [Equation 1] and [Equation 2] to use is determined by the nature of each parameter PAn. For example, when the value of the parameter PAn is expressed logarithmically, as with gain, [Equation 1] is appropriate, and when it is expressed as a linear value, as with frequency (Hz), [Equation 2] is appropriate.
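The two correction forms can be written as a short sketch. The function names and the sample values are hypothetical; the arithmetic follows [Equation 1] and [Equation 2], with the constant in [Equation 2] taken as 1.0.

```python
def correct_additive(pa: float, f_sp: float, f_dp: float) -> float:
    """[Equation 1]: PAn' = PAn + fn(SP) - fn(DP)."""
    # The difference is formed first so that SP == DP returns pa exactly.
    return pa + (f_sp - f_dp)


def correct_multiplicative(pa: float, f_sp: float, f_dp: float) -> float:
    """[Equation 2]: PAn' = PAn * (fn(SP) - fn(DP) + 1.0)."""
    return pa * ((f_sp - f_dp) + 1.0)


# Assumed function values fn(SP) = 0.3 and fn(DP) = 0.2.
gain_db = correct_additive(-6.0, 0.3, 0.2)            # log-domain parameter
formant_hz = correct_multiplicative(700.0, 0.3, 0.2)  # linear-domain parameter
```

With these assumed inputs the gain moves by about 0.1 and the frequency by about 10 percent. When the two function values are equal, both forms return the input parameter unchanged.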
[0017]
In both [Equation 1] and [Equation 2], PAn′ = PAn when SP = DP. That is, when the pitch SP to be synthesized is equal to the database pitch DP stored in the speech database 11, the singing voice synthesizing device 1 of the present embodiment leaves the synthesis parameter PA unchanged.
The waveform synthesizing device 15 synthesizes the speech unit data represented by the correction parameters PAn′ given the key scaling effect and by the synthesized pitch SP, superimposes them, and outputs the result as the output singing voice waveform.
[0018]
Next, the operation of the singing voice synthesizing device 1 according to the present embodiment will be described based on the flowchart shown in FIG. 4. The flowchart of FIG. 4 shows the processing procedure within one piece of speech unit data. This procedure is executed in turn for every piece of speech unit data selected from the speech database 11 based on the MIDI data input to the synthesis parameter generation device 12, and the results are synthesized by the waveform synthesizing device 15 to obtain the synthesized singing voice.
[0019]
In the flowchart shown in FIG. 4, the parameters PAn (n = 1, 2, ..., N) in one piece of speech unit data are processed in order from 1 to N. Further, since each parameter PAn is expressed as a function of time t, the processing of each parameter PAn acquires the synthesized pitch data SP(t) and the database pitch DP(t) at each time t to obtain the value of the correction parameter PAn′(t). More specifically, first, a variable n indicating the type of parameter PAn is initialized to 1 (S1), and the key scaling function fn corresponding to that value of n is read from the key scaling function storage unit 13 (S2). Subsequently, the time t is initialized to 0 (S3), and the parameter PAn(t) at that time t is acquired (S4).
[0020]
Next, the values of the synthesized pitch data SP(t) and the database pitch data DP(t) at that time t are acquired (S5). Then, based on the values acquired in S4 and S5, the correction parameter PAn′(t) is calculated by [Equation 1] or [Equation 2] (S6). In S6, either [Equation 1] or [Equation 2] may be used in a fixed manner, or which of the two to use may be selected automatically according to the type of the parameter PAn.
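The automatic selection mentioned for S6 could, for example, be driven by a table of parameter kinds. The parameter names and the two sets below are hypothetical; the text only states that logarithmic values such as gain suit [Equation 1] and linear values such as frequency in Hz suit [Equation 2].

```python
# Hypothetical parameter names grouped by how their values are expressed.
LOG_DOMAIN = {"gain_db"}
LINEAR_DOMAIN = {"formant_hz", "bandwidth_hz"}


def correct_parameter(name: str, pa: float, f_sp: float, f_dp: float) -> float:
    """Choose [Equation 1] or [Equation 2] from the parameter kind (step S6)."""
    diff = f_sp - f_dp
    if name in LOG_DOMAIN:
        return pa + diff            # [Equation 1]
    if name in LINEAR_DOMAIN:
        return pa * (diff + 1.0)    # [Equation 2]
    raise ValueError(f"unknown parameter kind: {name}")
```

Rejecting unknown names rather than guessing a default is a design choice of this sketch, so that a newly added parameter type must be classified explicitly.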
[0021]
In this way, the correction parameter PAn′(t) is calculated for all times t (t = 0 to T) by repeating steps S4 to S6 while incrementing t by Δt (S7, S8). Since the flowchart of FIG. 4 calculates discrete correction parameter values PAn′(t) at intervals of Δt, intermediate values between samples, when needed, may be obtained by interpolation.
The procedure described above is repeated until all parameters PAn (n = 1, 2, ..., N) have been calculated (S9, S10). The waveforms derived from the resulting data are synthesized by the waveform synthesizing device 15, and the singing voice is thereby synthesized and output.
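The loop of FIG. 4 over parameters and time samples can be sketched as follows, with step labels in the comments. The function name, the dictionary-based layout, the per-parameter flag selecting the equation, and the sample data are assumptions made for illustration; time is represented by sample index rather than by explicit increments of Δt.

```python
from typing import Callable, Dict, List


def key_scale_unit(
    params: Dict[str, List[float]],
    key_fns: Dict[str, Callable[[float], float]],
    use_multiply: Dict[str, bool],
    sp: List[float],
    dp: List[float],
) -> Dict[str, List[float]]:
    """Return the correction parameters PAn'(t) for one speech unit."""
    corrected: Dict[str, List[float]] = {}
    for name, series in params.items():        # S1, S9, S10: loop over n
        fn = key_fns[name]                      # S2: read fn
        out: List[float] = []
        for t in range(len(series)):            # S3, S7, S8: loop over t
            pa = series[t]                      # S4: PAn(t)
            diff = fn(sp[t]) - fn(dp[t])        # S5: SP(t), DP(t)
            if use_multiply[name]:              # S6: [Equation 2]
                out.append(pa * (diff + 1.0))
            else:                               # S6: [Equation 1]
                out.append(pa + diff)
        corrected[name] = out
    return corrected


def slope(pitch: float) -> float:
    """Hypothetical key scaling function shared by both parameters."""
    return 0.01 * (pitch - 60.0)


result = key_scale_unit(
    params={"gain_db": [-6.0, -6.0], "formant_hz": [700.0, 700.0]},
    key_fns={"gain_db": slope, "formant_hz": slope},
    use_multiply={"gain_db": False, "formant_hz": True},
    sp=[60.0, 72.0],
    dp=[60.0, 60.0],
)
```

In the first sample SP equals DP, so both parameters pass through unchanged; in the second sample they are shifted according to the difference of the function values.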
[0022]
Although an embodiment of the present invention has been described above, the present invention is not limited to it. For example, the speech database 11 may hold data of specific phoneme portions (timbre) in addition to the phoneme transition portions and stationary portions.
The database pitch DP may store pitch data at each time of the speech unit data as described above, or it may be stored as a single pitch value representing all times of the speech unit data.
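Both storage variants of the database pitch can be served by one lookup, sketched below. The helper name `db_pitch_at` and the index-based time axis are hypothetical.

```python
from typing import Sequence, Union


def db_pitch_at(dp: Union[float, Sequence[float]], t: int) -> float:
    """Return DP(t) for either a per-time series or one representative value."""
    if isinstance(dp, (int, float)):
        # A single representative pitch applies to every time sample.
        return float(dp)
    return float(dp[t])


per_time = db_pitch_at([60.0, 60.5, 61.0], 1)
representative = db_pitch_at(60.0, 1)
```

The single-value variant stores less data per speech unit, at the cost of ignoring pitch movement inside the recorded unit.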
The MIDI data input to the synthesis parameter generation device 12 is not limited to the pitch and lyrics, but may be data such as dynamics and vibrato.
Further, instead of determining the synthesized pitch data SP from the pitch data and lyrics data given as MIDI data, the synthesized pitch data SP may be held in advance as MIDI data.
The data input to the synthesis parameter generation device 12 is not limited to MIDI data; any data may be used as long as it can specify performance data in a time series so that the singing voice to be synthesized is generated.
[0023]
[Effects of the invention]
As described above, according to the present invention, the pitch of each speech unit is stored as the database pitch. With a small number of key scaling functions, tone quality such as timbre can therefore be kept as it is when synthesizing a singing voice at the same pitch as the original singing voice, and changed only when synthesizing a singing voice at a pitch different from that of the original singing voice, which makes natural singing synthesis possible.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the overall configuration of a singing voice synthesizing device according to an embodiment of the present invention.
FIG. 2 conceptually shows the contents of the data stored in the speech database shown in FIG. 1.
FIG. 3 shows the contents of a key scaling function fn.
FIG. 4 is a flowchart showing the operation of the singing voice synthesizing device 1 shown in FIG. 1.
[Explanation of symbols]
11: speech database, 12: synthesis parameter generation device, 13: key scaling function storage unit, 14: key scaling device, 15: waveform synthesizing device

Claims

A voice information input unit for receiving voice information indicating the content of voice to be synthesized;
A phoneme database storing speech unit data for synthesizing speech,
A selection unit that selects the speech unit data stored in the phoneme database based on the speech information,
A synthesized pitch data output unit that outputs synthesized pitch data indicating a pitch of a voice to be synthesized in a time series,
A synthesis parameter output unit that extracts and outputs synthesis parameters from the speech unit data selected by the selection unit,
A database pitch data generation unit that extracts pitch data constituting speech unit data used when extracting the synthesis parameter and outputs the extracted data as database pitch data,
A key scaling function storage unit that stores a key scaling function prepared for each of the synthesis parameters;
Correcting the synthesis parameter based on a difference between a function value obtained by substituting the synthesized pitch data into the key scaling function and a function value obtained by substituting the database pitch data into the key scaling function. A key scaling unit that outputs correction parameters,
A singing voice synthesizing device, comprising: a waveform synthesizing unit that synthesizes a waveform based on the correction parameter.

The singing voice synthesizing apparatus according to claim 1, wherein the synthesized pitch data output unit generates synthesized pitch data based on pitch information included in the voice information, and outputs the generated synthesized pitch data.

The singing voice synthesizing device according to claim 1, wherein the key scaling unit adds a difference between the two function values to the various parameters.

The singing voice synthesizer according to claim 1, wherein the key scaling unit multiplies the various parameters by a value obtained by adding 1 to a difference between the two function values.

A voice information input step of inputting voice information indicating the content of the voice to be synthesized;
A selection step of storing speech unit data for synthesizing speech in advance in a phoneme database, and selecting speech unit data stored in the phoneme database based on the speech information,
A synthesized pitch data output step of outputting synthesized pitch data indicating a pitch of a voice to be synthesized in a time series;
A synthesis parameter output step of extracting and outputting a synthesis parameter from the speech unit data selected in the selection step,
A database pitch data generating step of extracting pitch data constituting speech unit data used when extracting the synthesis parameter and outputting the extracted pitch data as database pitch data;
A key scaling step of preparing a key scaling function for each of the synthesis parameters, correcting the synthesis parameter based on a difference between a function value obtained by substituting the synthesized pitch data into the key scaling function and a function value obtained by substituting the database pitch data into the key scaling function, and outputting a correction parameter;
A singing voice synthesizing method, comprising: a waveform synthesizing step of synthesizing a waveform based on the correction parameter.

A voice information input step of inputting voice information indicating the content of the voice to be synthesized;
A selection step of storing speech unit data for synthesizing speech in advance in a phoneme database, and selecting speech unit data stored in the phoneme database based on the speech information,
A synthesized pitch data output step of outputting synthesized pitch data indicating a pitch of a voice to be synthesized in a time series;
A synthesis parameter output step of extracting and outputting a synthesis parameter from the speech unit data selected in the selection step,
A database pitch data generating step of extracting pitch data constituting speech unit data used when extracting the synthesis parameter and outputting the extracted pitch data as database pitch data;
A key scaling step of preparing a key scaling function for each of the synthesis parameters, correcting the synthesis parameter based on a difference between a function value obtained by substituting the synthesized pitch data into the key scaling function and a function value obtained by substituting the database pitch data into the key scaling function, and outputting a correction parameter;
A singing voice synthesizing program configured to cause a computer to execute the above steps and a waveform synthesizing step of synthesizing a waveform based on the correction parameter.