JP3834804B2 - Musical sound synthesizer and method

Info

Publication number: JP3834804B2
Application number: JP05990697A
Authority: JP
Inventors: 慎一大田
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 1997-02-27
Filing date: 1997-02-27
Publication date: 2006-10-18
Anticipated expiration: 2017-02-27
Also published as: JPH10240264A

Description

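[0001]
[Technical Field of the Invention]
The present invention relates to a musical tone synthesizer and method for synthesizing natural musical tones that follow a desired formant.
[0002]
[Prior Art]
It has long been known that the human voice contains characteristic formants and that these formants characterize the voice. Separately, attempts have been made to have a musical tone synthesizer sing by synthesizing a voice and outputting it at a desired pitch.
[0003]
[Problems to Be Solved by the Invention]
When a musical tone synthesizer is made to sing in this way, a more natural synthesized sound is required. In particular, the synthesizer should be able to synthesize a natural singing voice that adapts to the performance information even when various items of performance information, including pitch, change.
[0004]
An object of the present invention is to provide a musical tone synthesizer and method that obtain a more natural synthesized sound when singing with a formant synthesis sound source and that, in particular, can synthesize a natural singing voice adapted to changes in various performance information including pitch.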
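[0005]
[Means for Solving the Problems]
To achieve this object, the invention according to claim 1 is a musical tone synthesizer comprising: input means for inputting lyric information representing the phonemes of the lyrics to be sung and performance information including at least pitch information; a phoneme database that, given as arguments information specifying a phoneme, information specifying a pitch, and the elapsed time from the start of sounding of that phoneme, outputs the corresponding formant parameters; means for referring to the phoneme database to obtain the formant parameters corresponding to the lyric information and pitch information input by the input means and to the elapsed time from the start of sounding of the phoneme indicated by the lyric information; and a formant synthesis sound source that synthesizes and outputs, at a pitch corresponding to the pitch information, a voice having formants corresponding to the obtained formant parameters; the synthesizer being characterized by further comprising: a coarticulation database that stores, for each combination of preceding and following phonemes, a coarticulation time, which is the interpolation time for the transition from the preceding phoneme to the following phoneme; means for detecting, when sounding of a new phoneme is instructed, whether a phoneme is currently sounding and, if so, retrieving from the coarticulation database the coarticulation time to be used for the transition from the sounding phoneme to the new phoneme; and interpolation means for sequentially outputting to the formant synthesis sound source formant parameters obtained by interpolation, so that the formant parameters of the sounding phoneme shift to those of the new phoneme over the coarticulation time.
[0006]
The invention according to claim 2 is characterized in that, in claim 1, the interpolation means compares the retrieved coarticulation time with the duration time of the new phoneme and, if the coarticulation time is longer than the duration time, sets the coarticulation time equal to the duration time.
[0007]
The invention according to claim 3 is a musical tone synthesis method comprising: an input step of inputting lyric information representing the phonemes of the lyrics to be sung and performance information including at least pitch information; a step of obtaining, using a phoneme database that outputs the corresponding formant parameters when given as arguments information specifying a phoneme, information specifying a pitch, and the elapsed time from the start of sounding of that phoneme, the formant parameters corresponding to the lyric information and pitch information input by the above input means and to the elapsed time from the start of sounding of the phoneme indicated by the lyric information; and a step of synthesizing and outputting, at a pitch corresponding to the pitch information, a voice having formants corresponding to the obtained formant parameters; the method being characterized by providing a coarticulation database that stores, for each combination of preceding and following phonemes, a coarticulation time, which is the interpolation time for the transition from the preceding phoneme to the following phoneme, and by further comprising: a step of detecting, when sounding of a new phoneme is instructed, whether a phoneme is currently sounding and, if so, retrieving from the coarticulation database the coarticulation time to be used for the transition from the sounding phoneme to the new phoneme; and an interpolation step of sequentially outputting to the formant synthesis sound source formant parameters obtained by interpolation, so that the formant parameters of the sounding phoneme shift to those of the new phoneme over the coarticulation time.
The invention according to claim 4 is characterized in that, in the invention of claim 3, the interpolation step further comprises a step of comparing the retrieved coarticulation time with the duration time of the new phoneme and, if the coarticulation time is longer than the duration time, setting the coarticulation time equal to the duration time.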
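[0009]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings.
[0010]
FIG. 1 shows the system configuration of a musical tone synthesizer (singing synthesizer) according to the present invention. The synthesizer comprises a central processing unit (CPU) 101, a MIDI (Musical Instrument Digital Interface) interface 102, a data memory device 104, a working memory 106, a program memory 107, a setting operator 109, a display 111, a network interface 112, a formant synthesis sound source 115, a sound system 116, and a system common bus 117. Units 101, 104, 106 to 109, 111, 112, and 115 are connected to the system common bus 117. The sound system 116 may also be connected to the system common bus 117 so that it can be controlled from the CPU 101.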
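[0011]
The CPU 101 controls the overall operation of the musical tone synthesizer. The CPU 101 exchanges MIDI messages with a group of external MIDI devices 103 via the MIDI interface 102. The data memory device 104 is a storage device that stores various data; specifically, it is a local data storage device 105 such as a semiconductor memory, a floppy disk drive (FDD), a hard disk drive (HDD), a magneto-optical (MO) disk drive, or an IC memory card device. In particular, the data memory device 104 stores performance data, lyric data, and the like as MIDI data. Besides the examples above, devices that use various other forms of media can serve as the local data storage device 105.
[0012]
The working memory 106 is a RAM (random access memory) that the CPU 101 uses as a work area during operation, for various registers, flags, buffers, and the like. The program memory 107 is a ROM (read-only memory) that stores the control program executed by the CPU 101 and various constant data. The setting operator 109 comprises controls such as switches operated by the user, and may be, for example, the mouse 110 and keyboard used with an ordinary personal computer. The display 111 is a display device used to show various information.
[0013]
The network interface 112 is an interface for connecting to a public line such as a telephone line or to a local area network (LAN) such as Ethernet. It may also be an interface for connecting to so-called PC communication services or the Internet. By connecting to various networks 113 through the network interface 112, various programs and data can be downloaded from an external server or host computer (specifically, from a remote data storage device 114 or the like connected to it).
[0014]
The formant synthesis sound source 115 generates and outputs a voice with the specified formants at the specified pitch in accordance with instructions (formant parameters and the like) from the CPU 101. The audio signal output from the formant synthesis sound source 115 is emitted by the sound system 116.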
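[0015]
This musical tone synthesizer can sing in accordance with MIDI-format lyric data (data specifying the lyrics to be sung) and performance data (data specifying performance information such as pitch) read from the data memory device 104, or with lyric data and performance data input from the MIDI devices 103 via the MIDI interface 102. The lyric data and performance data may also be MIDI data input from a separately connected performance operator 108 (for example, a keyboard), or MIDI data input from the external network 113 via the network interface 112. In these cases, the input data may be processed in real time for singing, or it may first be stored in the data memory device 104 (local data storage device 105) and then read out and processed. The lyric data and the performance data may be input from different sources. For example, the lyrics of lyric data stored in advance in the data memory device 104 can be sung at the pitches of performance data input in real time from the performance operator 108. In short, the lyric data and performance data may be prepared in any manner.
[0016]
Singing of this kind is performed under the control of the CPU 101. That is, the CPU 101 takes in the lyric data and performance data prepared in any of the ways described above and, through the processing described later with reference to FIGS. 4 to 9, issues sounding instructions to the formant synthesis sound source 115, thereby making it sing. The control program executed by the CPU 101 is stored in the program memory 107, which is a ROM; alternatively, the program memory 107 may be a RAM instead of a ROM, with the control program stored in the local data storage device 105 and loaded into the RAM program memory 107 for execution as needed. This makes it easy to add to or upgrade the control program. In particular, if the control program and data according to the present invention are stored on a removable recording medium such as a CD-ROM and installed on a local data storage device 105 such as an HDD, new installation and upgrading of the control program and data become easy. The control program executed by the CPU 101 may also be downloaded over a network via the network interface 112. In that case, the downloaded control program may first be stored in the local data storage device 105 and loaded into the RAM program memory 107 as needed, or it may be loaded directly into the RAM program memory 107 and executed.
[0017]
A musical tone synthesizer of this kind can be realized in various forms. For example, it may be applied to an electronic musical instrument such as a synthesizer or a tone generator module, or to a so-called multimedia personal computer. It can also be realized by fitting a general-purpose personal computer with a sound board and with a MIDI interface that receives performance information (MIDI input) from an external MIDI device such as a keyboard, and running the necessary software.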
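[0018]
FIG. 2 outlines the processing performed when the musical tone synthesizer of FIG. 1 according to the present invention is made to sing. Performance data 201 and lyric data 202 are input to the CPU 101 as MIDI data in any of the ways described above. The performance data 201 consists of note-on and note-off messages that include pitch information, velocity information, and the like. The lyric data 202 indicates the lyrics (phoneme data) to be sounded on the notes specified by the performance data 201. The lyric data 202 is created in a format such as MIDI system exclusive. For example, to sing the lyric 「さいた」 ("saita" when written as phonemes) at the pitches C3, E3, and G3 in turn, the performance data 201 and the lyric data 202 are input to the CPU 101 in, for example, the following sequence (1).
・s<20>a<00>
・C3 note-on
・C3 note-off
・i<00>
・E3 note-on ………(1)
・E3 note-off
・t<02>a<00>
・G3 note-on
・G3 note-off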
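[0019]
Here, the lyric data to be sounded on a note is sent before that note's note-on message. s, a, i, and t each denote a phoneme, and the number in < > following a phoneme is that phoneme's duration time. However, <00> indicates that the phoneme is sustained until the note-on for the next phoneme arrives. The lyric data s<20>a<00>, i<00>, and t<02>a<00> are each enclosed between the code marking the start and the code marking the end of a predetermined system exclusive message, so that they can be recognized as lyric data. In the following, a sequence of lyrics sounded within one note, such as s<20>a<00>, is called a phone sequence (phoneSEQ), and the lyric data buffer 210 is called the phone sequence buffer (phoneSEQ buffer).
[0020]
On receiving sequence (1), the CPU 101 operates as follows. First, when it receives the phone sequence s<20>a<00>, it stores that phone sequence in the phone sequence buffer 210, which is a buffer prepared in the working memory 106. Next, when it receives "C3 note-on", the CPU 101 refers to the phone sequence buffer 210 to find the lyrics to sound, s<20>a<00>, calculates formant parameters so that those lyrics are produced at the specified pitch C3, and sends them to the formant synthesis sound source 115. The formant parameters are sent at a predetermined interval (5 msec here). The lyrics s<20>a<00> are thereby sounded at pitch C3.
[0021]
Next, "C3 note-off" is received; however, because a<00> was specified immediately before, "a" is to be sustained until the next note-on, so the CPU 101 ignores the received "C3 note-off". When the next phone sequence to be sounded, i<00>, is received, it is stored in the phone sequence buffer 210. When "E3 note-on" is received, the CPU 101 refers to the phone sequence buffer 210 to find the lyrics to sound, i<00>, calculates formant parameters so that they are produced at the specified pitch E3, and sends them to the formant synthesis sound source 115. "ta" is then sounded by the same processing.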
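[0022]
The formant parameters are time-series data and are transferred from the CPU 101 to the formant synthesis sound source 115 at a predetermined time interval. To reproduce the characteristics of the human voice, a low rate with an interval on the order of a few msec is normally sufficient. In this embodiment the interval is 5 msec. By changing the formants successively at this interval, the synthesizer sings with the characteristics of a human voice. The formant parameters include, for example, a voiced/unvoiced flag, formant center frequency, formant level, and formant bandwidth (a parameter that defines the shape of the formant on the frequency axis).
[0023]
The CPU 101 calculates the formant parameters from the input phone sequence 202 and performance data 201, referring to a phoneme database and a coarticulation database as it does so. The phoneme database and the coarticulation database are prepared in advance in the local data storage device 105 and are loaded into the working memory 106 for use. So that singing is possible in several voice qualities (individual differences, male voice, female voice, and so on), a phoneme database and coarticulation database may be prepared for each voice quality and selected for use.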
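[0024]
FIG. 3 is a conceptual diagram of how the phoneme database is referenced. Reference numeral 301 denotes the phoneme database, which is a collection of the formant parameters for each phoneme. 302-1, 302-2, 302-3, ..., 302-N (where N is the number of phonemes) each denote the collected formant parameters of one phoneme. The collection for one phoneme (for example, 302-1) is a database that, given the elapsed time from the start of sounding of that phoneme and the pitch at that moment, uniquely outputs the corresponding formant parameters. The phoneme database 301 is therefore a database that takes as input a phoneme number identifying the phoneme, a pitch, and the elapsed time from the start of sounding of that phoneme, and outputs formant parameters corresponding to that input. It may take any form. For example, it may be a table, or the range of the input data may be divided into segments, with formant parameters held for each segment and the output produced by interpolation according to the input. It may also take the form of continuous data or mathematical expressions.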
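As an illustration of the "segments plus interpolation" form mentioned above, the following is a minimal sketch rather than the patent's implementation. It assumes a hypothetical table `PHONEME_DB` keyed by phoneme name, holding only a first-formant frequency per 5 msec frame at a few pitch breakpoints; every number is invented for the example, and pitch is expressed as a MIDI note number purely for convenience.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical table: phoneme -> rows of (pitch, first-formant Hz per 5 msec frame).
# All values are invented for illustration only.
PHONEME_DB: Dict[str, List[Tuple[int, List[float]]]] = {
    "a": [(48, [700.0, 720.0, 730.0]), (60, [760.0, 780.0, 790.0])],
}


def lookup_f1(phoneme: str, pitch: float, elapsed_frames: int) -> float:
    """Return the first-formant frequency for (phoneme, pitch, elapsed time)."""
    rows = PHONEME_DB[phoneme]
    pitches = [p for p, _ in rows]

    def at(row: int) -> float:
        # Hold the last stored frame once the time series runs out.
        series = rows[row][1]
        return series[min(elapsed_frames, len(series) - 1)]

    if pitch <= pitches[0]:
        return at(0)
    if pitch >= pitches[-1]:
        return at(len(rows) - 1)
    hi = bisect_right(pitches, pitch)   # first breakpoint above the input pitch
    lo = hi - 1
    w = (pitch - pitches[lo]) / (pitches[hi] - pitches[lo])
    return (1 - w) * at(lo) + w * at(hi)  # linear interpolation over pitch
```

A real database would return the full parameter set (level, bandwidth, voiced/unvoiced) rather than a single frequency; the lookup shape would stay the same.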
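[0025]
In FIG. 3, only the graphs 305 and 306 of formant frequency and formant level are shown as formant parameters of the phoneme database 301, but the term formant parameters is not limited to formant frequency and formant level and may include other parameters such as formant bandwidth. FIG. 3 also omits the time dimension and shows how the formant frequency and formant level are determined by the phoneme number and by the pitch input as indicated by arrow 307.
[0026]
The pitch that the CPU 101 uses when referring to the phoneme database is, as shown at 303, the basic pitch data specified by the note-on plus the pitch bend data and other pitch generation data. The formants of a singing voice (the formant frequencies in particular) change with pitch in a way that differs from phoneme to phoneme. In this embodiment the phoneme database is configured to output formant parameters that depend on pitch, so the pitch-dependent change of the formants can be simulated in the synthesized voice and a natural singing voice can be synthesized. As shown at 304, the sum of the velocity data, volume data, and other level generation data may also be reflected in the pitch data. Velocity, volume, and other level data can alter the pitch, and the point is to reflect that pitch change in the formant parameters. In particular, if the CPU 101 is fast enough, the formant parameters are preferably calculated using the pitch at that moment each time they are output to the sound source 115.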
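The pitch summation shown at 303 can be sketched as below. The unit (cents, with 100 cents per note number) and the function name are assumptions made for this example; the patent does not specify how the pitch values are scaled.

```python
def effective_pitch_cents(note_number: int,
                          pitch_bend_cents: float = 0.0,
                          other_cents: float = 0.0) -> float:
    """Sum the note-on pitch, pitch bend, and other pitch generation data.

    Assumption: everything is expressed in cents, 100 cents per note number.
    """
    return note_number * 100.0 + pitch_bend_cents + other_cents
```

Recomputing this value on every 5 msec output, as the paragraph recommends for a fast CPU, lets a pitch bend during a sustained vowel move the formants as well as the pitch.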
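[0027]
FIG. 4 shows the procedure of the main program that the CPU 101 executes when the musical tone synthesizer is powered on. First, in step 401, various initial settings are made. In particular, the sounding flag HFLAG, described later, is initialized to 0. Next, step 402 waits for events. When an event occurs, it is identified in step 403, the corresponding processing 404 to 406 is executed, and control returns to step 402. Step 404 is the MIDI reception processing executed when a MIDI message is received. Step 405 is the MIDI data processing that handles the data in the MIDI reception buffer. This MIDI data processing is executed repeatedly whenever no other event has occurred and the task is idle. For any other event, processing corresponding to that event is performed (step 406).
[0028]
FIG. 5 shows the procedure of the MIDI reception processing of step 404 in FIG. 4. In step 501, the received MIDI message is written to the MIDI reception buffer (allocated in the working memory 106), and the processing returns.
[0029]
FIG. 6 shows the procedure of the MIDI data processing of step 405 in FIG. 4. First, in step 601, one byte is fetched from the MIDI reception buffer. Next, step 602 determines whether the fetched byte is a status byte (most significant bit set to 1). If it is a status byte, step 604 fetches the data bytes that follow it, stores them together with the status byte in a predetermined buffer, and proceeds to step 605. If step 602 finds that the byte is not a status byte, other processing is performed in step 603 and the processing returns. Because the data bytes that follow a status byte have not necessarily been received into the MIDI reception buffer by the time the status byte is fetched, step 603 may instead fetch the data bytes following the status byte one after another and proceed to step 604 once a whole message has been fetched. At step 604, one complete MIDI message has been captured in the predetermined buffer.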
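The status-byte test of step 602 comes down to checking the most significant bit. A minimal sketch (the function name is illustrative, not from the patent):

```python
def is_status_byte(byte: int) -> bool:
    """Step 602: a MIDI status byte has its most significant bit set to 1."""
    return (byte & 0x80) != 0
```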
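[0030]
Step 605 branches according to the status of the received MIDI message. For a note-on, processing proceeds to the note-on processing of FIG. 7. For a note-off, it proceeds to the note-off processing of FIG. 8. For a system exclusive message, it proceeds to step 607, which determines whether the MIDI message is a phone sequence message. If it is, step 608 writes the phone sequence to the phone sequence buffer and the processing returns. If step 607 finds that it is not a phone sequence, step 609 performs the other system exclusive processing and the processing returns. If step 605 finds that the MIDI message is some other message, step 606 performs processing appropriate to its status and the processing returns. In particular, when the received MIDI message is singing end information (singing end information indicating the end of the song is assumed to be input at the end of the performance data), processing proceeds to step 606, where all sounds being produced at that moment are silenced and the song ends.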
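The branch in step 605 can be sketched as a classification of the status byte. The status values used here (0x8n note-off, 0x9n note-on, 0xF0 start of system exclusive) are those of standard MIDI; the returned labels are hypothetical names for this example, and the check of step 607 is omitted because the patent does not give the byte layout of a phone sequence message.

```python
def classify_status(status: int) -> str:
    """Step 605: choose a handler from the MIDI status byte."""
    if status == 0xF0:            # start of system exclusive -> step 607
        return "system_exclusive"
    kind = status & 0xF0          # drop the channel nibble
    if kind == 0x90:              # note-on -> FIG. 7
        return "note_on"
    if kind == 0x80:              # note-off -> FIG. 8
        return "note_off"
    return "other"                # step 606
```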
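[0031]
The processing performed when a note-on MIDI message is received is described with reference to FIG. 7. First, in step 701, the phone sequence time counter time_counter and the phone counter phone_counter are initialized to 0. time_counter is incremented each time the timer interrupt occurs (every 5 msec), so its value is the elapsed time since the start of sounding of each phoneme in units of 5 msec. time_counter is reset to 0 each time the phoneme changes. phone_counter indicates the position of the phoneme within one note (that is, within the one phone sequence sounded on one note). It is reset to 0 each time the note, and hence the phone sequence, changes, and is incremented each time the phoneme changes within a phone sequence. Because it starts from 0, the first phoneme in a phone sequence has phone_counter = 0, the next has phone_counter = 1, and so on.
[0032]
Next, step 702 determines whether the phone sequence contains breath information. Breath information is placed in a phone sequence when one phoneme is to be sounded separately from the next. Breath information is placed at the end of the phone sequence; in that case, after the last phoneme of the phone sequence has been sounded, its sounding stops once before sounding of the next phone sequence begins. If step 702 finds breath information, the breath flag f_koki is set to 1 and processing proceeds to step 705. If there is no breath information, f_koki is reset to 0 and processing proceeds to step 705.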
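[0033]
In step 705, the phoneme number and duration time at the position indicated by phone_counter are extracted from the phone sequence buffer. Because this is immediately after note-on and phone_counter = 0, the first phoneme in the phone sequence buffer is extracted. Next, in step 706, the phoneme database is referenced with the extracted phoneme number and the pitch data, as described for FIG. 3, to retrieve formant parameters such as formant frequency, formant level, and formant bandwidth. The pitch data used here is, as described for 303 in FIG. 3, the pitch obtained by applying the pitch bend data and other pitch generation data to the basic pitch data specified by the note-on. The phoneme database outputs the formant parameters corresponding to an input phoneme number, pitch, and elapsed time from the start of sounding, but step 706 specifies only the phoneme number and pitch, so what is retrieved here is time-series data (in any form, such as a table or an expression, provided that inputting an elapsed time uniquely yields the formant parameters). In steps 710 and 905, described later, the current time_counter value is applied to the formant parameters retrieved in step 706 to obtain the formant parameters at that moment.
[0034]
Next, in step 707, the coarticulation database is referenced according to the relationship between the preceding and following phonemes, and a coarticulation curve and coarticulation time are retrieved. The coarticulation database is referenced in order to move smoothly from one phoneme to the next. Specifying the preceding phoneme and the next phoneme yields a coarticulation curve and coarticulation time from the database. Formant parameters are then interpolated along the coarticulation curve from those of the preceding phoneme to those of the next phoneme and sent to the sound source one after another, which achieves a natural transition from the preceding phoneme to the next. This processing is called coarticulation. The coarticulation time is the time over which the transition from the preceding phoneme to the next is made while interpolating. Step 707 checks whether a phoneme is sounding before the phoneme about to be sounded by this note-on and, if so, retrieves from the coarticulation database the coarticulation curve and coarticulation time to be used for the transition from that phoneme to the phoneme about to be sounded.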
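The interpolation along a coarticulation curve can be sketched as follows. This is an assumption-laden illustration: the patent does not give the shape of the curve, so a linear curve is used as a stand-in, formant parameters are modelled as a plain dictionary of floats, and times are counted in 5 msec ticks.

```python
from typing import Callable, Dict


def linear_curve(x: float) -> float:
    """Placeholder coarticulation curve mapping progress 0..1 to a blend weight."""
    return x


def coarticulate(prev: Dict[str, float],
                 new: Dict[str, float],
                 elapsed: int,
                 coart_time: int,
                 curve: Callable[[float], float] = linear_curve) -> Dict[str, float]:
    """Blend the preceding phoneme's parameters into the new phoneme's.

    elapsed and coart_time are in ticks; once the coarticulation time has
    passed, the new phoneme's parameters are used unchanged.
    """
    if coart_time <= 0 or elapsed >= coart_time:
        return dict(new)
    w = curve(elapsed / coart_time)
    return {key: (1 - w) * prev[key] + w * new[key] for key in new}
```

For example, blending a first formant of 700 Hz into 300 Hz over four ticks yields 500 Hz at the second tick with the linear curve; a database-supplied curve would replace `linear_curve`.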
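[0035]
Next, step 708 compares the coarticulation time retrieved in step 707 with the duration time of the phoneme extracted in step 705. If the coarticulation time is greater than the duration time, performing coarticulation over the full coarticulation time would overrun the duration time of the phoneme about to be sounded, and the sounding timing would drift. Step 709 therefore sets the coarticulation time equal to the duration time, and processing proceeds to step 710. If step 708 finds that the coarticulation time is not greater than the duration time, processing proceeds directly to step 710.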
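Steps 708 and 709 amount to a clamp. In the sketch below, a duration of 0 stands for <00>, which the worked example later in the description treats as longer than any coarticulation time; the function name and the use of tick counts are assumptions.

```python
def clamp_coart_time(coart_time: int, duration: int) -> int:
    """Steps 708-709: never let coarticulation outlast the new phoneme.

    duration == 0 represents <00> (sustain until the next note-on), so the
    coarticulation time is left unchanged in that case.
    """
    if duration == 0:
        return coart_time
    return min(coart_time, duration)
```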
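[0036]
In step 710, the formant parameters retrieved in step 706 are referenced to obtain the formant parameters at the current time_counter value (0 here, because this is immediately after note-on). If a coarticulation curve and coarticulation time were retrieved in step 707, coarticulation processing is then performed according to that curve and time, using the obtained formant parameters and those of the preceding (currently sounding) phoneme, to calculate the formant parameters. If no coarticulation curve and time were retrieved in step 707 (no phoneme was sounding beforehand), no coarticulation processing is needed. Next, in step 711, the calculated formant data and the pitch are written to the sound source 115. The first formant parameters of the phoneme to be sounded are thereby transferred to the sound source 115 and sounding begins. From then on, the timer processing of FIG. 9, described later, sends formant parameters to the sound source 115 every 5 msec to continue sounding. Next, in step 712, the sounding flag HFLAG is set to 1 and the processing returns. HFLAG is 1 while singing is in progress and 0 while it is not.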
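[0037]
Next, the processing performed when a note-off MIDI message is received is described with reference to FIG. 8. First, step 801 refers to the duration time of the currently sounding phoneme of the phone sequence and determines whether that duration time is <00> and a further phoneme to be sounded follows it. If so (for example, when a phoneme sustained with duration <00> is to be ended with a consonant), processing proceeds to step 802 to sound that phoneme. If step 801 finds that the duration time of the currently sounding phoneme is not <00>, or that it is <00> but no phoneme follows, processing proceeds to step 808.
[0038]
In step 802, phone_counter is set to the position of the phoneme that follows the currently sounding phoneme whose duration time is <00>, and time_counter is reset to 0. Next, in step 803, the phoneme number and duration time at the position indicated by phone_counter are extracted from the phone sequence buffer. phone_counter now points to the last phoneme to be sounded, and that phoneme is extracted.
[0039]
The following steps 804, 805, 806, and 807 are the same as steps 706, 707, 710, and 711 described for FIG. 7, respectively. Through steps 804 to 807, the first formant parameters of the phoneme to be sounded are transferred to the sound source 115 and sounding begins. From then on, the timer processing of FIG. 9 sends formant parameters to the sound source 115 every 5 msec to continue sounding. The phoneme started by steps 804 to 807 follows a phoneme sustained with duration <00>, so it is sounded in the time remaining until the next note-on is received. If that phoneme has a long duration time and the next note-on arrives while it is still sounding, it is coarticulated into the phoneme of the next note-on as described for FIG. 7, and sounding is forced to move on to that next phoneme.
[0040]
If step 801 finds that the duration time of the currently sounding phoneme is not <00>, or that it is <00> but no phoneme follows, step 808 determines whether the breath flag f_koki is 1. If f_koki is 1, step 809 performs key-off processing to stop the currently sounding phoneme, resets HFLAG to 0, and the processing returns. If step 808 finds that f_koki is not 1, the processing simply returns. As a result, when f_koki is not 1 the currently sounding phoneme continues to sound.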
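The note-off decision of FIG. 8 can be summarised as a three-way choice. The sketch below returns a label instead of driving a sound source; the label strings and the function signature are invented for the example, and a duration of 0 again stands for <00>.

```python
def on_note_off(duration: int, has_next_phoneme: bool, f_koki: int) -> str:
    """Decide what a note-off does, following steps 801, 808, and 809."""
    if duration == 0 and has_next_phoneme:
        return "start_next_phoneme"   # steps 802 to 807
    if f_koki == 1:
        return "key_off"              # step 809: stop sounding, HFLAG = 0
    return "sustain"                  # return without changing anything
```

With the phone sequence s<20>a<00> and no breath information, a note-off during "a" yields "sustain", which matches the behaviour described for "C3 note-off".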
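[0041]
FIG. 9 shows the timer processing executed every 5 msec by a timer interrupt. First, step 901 determines whether HFLAG is 1. If it is not, nothing is sounding, so the processing returns. If HFLAG is 1, step 902 determines whether the duration time of the currently sounding phoneme is <00>. If it is, processing proceeds to step 904 to continue sounding the current phoneme. If it is not, step 903 determines whether the value of time_counter is less than or equal to the duration time. If so, the current phoneme is to continue sounding, and processing proceeds to step 904. Step 904 increments time_counter, and processing proceeds to step 905 to calculate and output the formant parameters for the current time_counter value.
[0042]
If step 903 finds that time_counter has exceeded the duration time, the current phoneme has been sounded for its full duration time, so processing proceeds to step 907 to move on to the next phoneme in the phone sequence. Step 907 determines whether the phone sequence has a next phoneme. If not, step 912 resets HFLAG to 0 and the processing returns. If so, then in order to sound it, step 908 first increments phone_counter and resets time_counter to 0. Step 909 then extracts the phoneme number and duration time at the position in the phone sequence indicated by phone_counter (the next phoneme to be sounded).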
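The counter handling of the timer processing can be sketched as a small state machine. This is a simplified illustration under stated assumptions: the class name `Voice` is invented, a duration of 0 stands for <00>, the comparison follows the wording of step 903 (less than or equal), and the database lookups and parameter output of steps 905, 906, and 909 to 911 are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Voice:
    seq: List[Tuple[str, int]]   # (phoneme, duration in 5 msec ticks; 0 means <00>)
    phone_counter: int = 0
    time_counter: int = 0
    hflag: int = 1               # 1 while singing is in progress

    def tick(self) -> None:
        """One 5 msec timer interrupt (FIG. 9), counters only."""
        if self.hflag != 1:                                   # step 901
            return
        _, duration = self.seq[self.phone_counter]
        if duration == 0 or self.time_counter <= duration:    # steps 902, 903
            self.time_counter += 1                            # step 904
            return                                            # 905-906 would output here
        if self.phone_counter + 1 >= len(self.seq):           # step 907
            self.hflag = 0                                    # step 912
            return
        self.phone_counter += 1                               # step 908
        self.time_counter = 0
```

Running this on s<20>a<00> keeps "s" current for roughly the first twenty ticks and then stays on "a" indefinitely, because <00> bypasses the duration check.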
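[0043]
The following steps 910, 911, 905, and 906 are the same as steps 706, 707, 710, and 711 described for FIG. 7, respectively. Through these steps, the formant parameters of the phoneme to be sounded are transferred to the sound source 115. In step 905, the formant parameters previously retrieved from the phoneme database are referenced to obtain the formant parameters at the current time_counter value, and coarticulation processing is performed according to the coarticulation curve and coarticulation time previously retrieved from the coarticulation database, to calculate the formant parameters. The "formant parameters previously retrieved from the phoneme database" are those retrieved in step 706 for a phoneme started by a note-on, those retrieved in step 804 for a phoneme that followed a duration <00> phoneme at note-off, and those retrieved in step 910 for a phoneme that follows a phoneme in the phone sequence whose duration is not <00>. Likewise, the coarticulation curve and coarticulation time used in step 905 are those retrieved in step 707 for a phoneme started by a note-on, those retrieved in step 805 for a phoneme that followed a duration <00> phoneme at note-off, and those retrieved in step 911 for a phoneme that follows a phoneme in the phone sequence whose duration is not <00>.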
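[0044]
Next, an outline of how the processing of FIGS. 4 to 9 runs is given using a concrete example. Assume that events occur in the order of sequence (1) below, which sounds "saita" at C3, E3, and G3.
・s<20>a<00>
・C3 note-on
・C3 note-off
・i<00>
・E3 note-on ………(1)
・E3 note-off
・t<02>a<00>
・G3 note-on
・G3 note-off
[0045]
Each of these MIDI messages is processed through steps 402 → 403 → 404 → step 501 of FIG. 5 and written to the MIDI reception buffer. Each written MIDI message is then processed, in idle time, according to its type from step 605 of the MIDI data processing of FIG. 6 onward. The following describes how each of the above MIDI messages is processed from step 605 onward.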
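[0046]
First, the phone sequence s<20>a<00> received first is written to the phone sequence buffer through steps 605 → 607 → 608. The next message, "C3 note-on", goes from step 605 to the note-on processing of FIG. 7. The note-on processing proceeds through steps 701 → 702 → 704 → 705 and extracts from the phone sequence buffer the phoneme number of the first phoneme to be sounded, "s", and its duration time <20>. Step 706 then refers to the phoneme database and retrieves the group of formant parameters corresponding to the phoneme number of "s" and the pitch C3. The group retrieved here is time-series data (in any form) that varies with the elapsed time (in 5 msec steps) from the start of sounding of "s". The next step, 707, refers to the coarticulation database, but no phoneme is sounding before the "s" about to be sounded, so steps 707 and 708 do nothing and processing proceeds to step 710, where the formant parameters at the current time_counter value (time_counter = 0 immediately after note-on) are obtained from the group retrieved in step 706. In step 711 the obtained formant parameters are sent to the sound source 115 together with the pitch data and a key-on signal, and the musical tone with phoneme "s" at pitch C3 starts to sound. Step 712 sets HFLAG = 1 and the note-on processing ends.
[0047]
From then on, the timer processing of FIG. 9 runs every 5 msec and s<20> continues to sound. At first the timer processing proceeds through steps 901 → 902 → 903 → 904, incrementing time_counter, and step 905 obtains the formant parameters at the current time_counter value from the group retrieved in step 706. Step 906 sends the obtained formant parameters to the sound source, and the processing returns. This flow of steps 901 → 902 → 903 → 904 → 905 → 906 → return repeats every 5 msec; when time_counter reaches 20, processing goes from step 903 to step 907 and then to step 908 to move on to the next phoneme, a<00>. Step 909 extracts from the phone sequence buffer the phoneme number of the next phoneme to be sounded, "a", and its duration time <00>. Step 910 refers to the phoneme database and retrieves the group of formant parameters corresponding to the phoneme number of "a" and the pitch C3. This group is time-series data that varies with the elapsed time (in 5 msec steps) from the start of sounding of "a". Step 911 refers to the coarticulation database and retrieves the coarticulation curve and coarticulation time to be used for the transition from the preceding phoneme "s" to the current phoneme "a". Step 905 then obtains the formant parameters at the current time_counter value from the group retrieved in step 910 and the coarticulation curve and coarticulation time retrieved in step 911. In step 906 the obtained formant parameters are sent to the sound source 115 together with the pitch data and a key-on signal, so that the tone at pitch C3 begins sounding while moving naturally from "s" to "a". Because the duration time of "a" is <00>, subsequent timer processing every 5 msec proceeds through steps 902 → 904 → 905 → 906 and "a" continues to sound.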
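[0048]
The next message, "C3 note-off", goes from step 605 to the note-off processing of FIG. 8. The duration time of the currently sounding phoneme "a" is <00>, but no phoneme follows it, so processing goes from step 801 to step 808. No breath information has been set either, so f_koki = 0 and the processing simply returns. The phoneme "a" therefore continues to sound.
[0049]
The next phone sequence, i<00>, is written to the phone sequence buffer through steps 605 → 607 → 608. The next message, "E3 note-on", goes from step 605 to the note-on processing of FIG. 7. The note-on processing proceeds through steps 701 → 702 → 704 → 705 and extracts from the phone sequence buffer the phoneme number of the first phoneme, "i", and its duration time <00>. Step 706 then refers to the phoneme database and retrieves the group of formant parameters corresponding to the phoneme number of "i" and the pitch E3. This group is time-series data (in any form) that varies with the elapsed time (in 5 msec steps) from the start of sounding of "i". Step 707 refers to the coarticulation database and retrieves the coarticulation curve and coarticulation time to be used for the transition from the preceding phoneme "a" (currently sounding) to the current phoneme "i". In the comparison of step 708, the duration time of "i" is <00>, which can be regarded as longer than the coarticulation time for the transition from "a" to "i", so processing proceeds to step 710. Step 710 obtains the formant parameters at the current time_counter value from the group retrieved in step 706 and the coarticulation curve and coarticulation time retrieved in step 707. In step 711 the obtained formant parameters are sent to the sound source 115 together with the pitch data and a key-on signal, so that the tone at pitch E3 starts sounding while moving naturally from "a" to "i". Step 712 sets HFLAG = 1 and the note-on processing ends.
[0050]
The messages from "E3 note-off" through "G3 note-off" are processed in the same way.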
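[0051]
In the embodiment described above, the formant synthesis sound source 115 may be realized, in whole or in part, in hardware, in software, or in a combination of the two. Also, in the embodiment the phoneme database holds information per phoneme with vowels and consonants treated as separate phonemes, but it may instead hold information per phoneme with each sound of the Japanese syllabary ("sa", "si", and so on) treated as a phoneme. The phoneme database and the coarticulation database may also be combined into one.
[0052]
The embodiment above changes the formant parameters every 5 msec as an example, but the rate at which the formant parameters change may be controlled so that it is faster where the formants or spectral characteristics change greatly and slower where they change gently.
[0053]
[Effects of the Invention]
As described above, according to the present invention the formant parameters stored in the phoneme database reflect the change in formant parameters that accompanies a change in pitch, so singing can be sounded with formants that match the input pitch. Therefore, even when the pitch changes dynamically, a natural synthesized sound with formants matching that change can be obtained. If other performance information is taken into account in addition to pitch when obtaining the formant parameters, a natural voice adapted to that performance information can be synthesized.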
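[Brief Description of the Drawings]
[FIG. 1] System configuration diagram of the musical tone synthesizer according to the present invention
[FIG. 2] Diagram outlining the processing when the musical tone synthesizer of FIG. 1 is made to sing
[FIG. 3] Conceptual diagram of how the phoneme database is referenced
[FIG. 4] Flowchart showing the procedure of the main program
[FIG. 5] Flowchart showing the procedure of the MIDI reception processing
[FIG. 6] Flowchart showing the procedure of the MIDI data processing
[FIG. 7] Flowchart showing the procedure of the note-on processing
[FIG. 8] Flowchart showing the procedure of the note-off processing
[FIG. 9] Flowchart showing the procedure of the timer processing
[Explanation of Reference Numerals]
101: central processing unit (CPU), 102: MIDI interface, 103: MIDI device group, 104: data memory device, 105: local data storage device, 106: working memory, 107: program memory, 108: performance operator, 109: setting operator, 111: display, 112: network interface, 115: formant synthesis sound source, 116: sound system, 117: system common bus, 201: performance data, 202: lyric data, 210: lyric data buffer (phone sequence buffer), 301: phoneme database.
[0001]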
BACKGROUND OF THE INVENTION
The present invention relates to a musical sound synthesizer and method for synthesizing natural musical sounds according to a desired formant.
[0002]
[Prior art]
Conventionally, it is known that a predetermined formant exists in a voice uttered by a person, and the voice is characterized by this. On the other hand, an attempt has been made to sing a song by synthesizing a voice with a musical tone synthesizer and outputting it at a desired pitch.
[0003]
[Problems to be solved by the invention]
When a musical tone synthesizer is made to sing in this way, a more natural synthesized sound is required. In particular, it should be possible to synthesize a natural singing voice that adapts to the performance information even when various items of performance information, including pitch, change.
[0004]
An object of the present invention is to provide a musical tone synthesizer and method that obtain a more natural synthesized sound when singing with a formant synthesis sound source and that, in particular, can synthesize a natural singing voice adapted to changes in various performance information including pitch.
[0005]
[Means for Solving the Problems]
To achieve this object, the invention according to claim 1 is a musical tone synthesizer comprising: input means for inputting lyric information representing the phonemes of the lyrics to be sung and performance information including at least pitch information; a phoneme database that, given as arguments information specifying a phoneme, information specifying a pitch, and the elapsed time from the start of sounding of that phoneme, outputs the corresponding formant parameters; means for referring to the phoneme database to obtain the formant parameters corresponding to the lyric information and pitch information input by the input means and to the elapsed time from the start of sounding of the phoneme indicated by the lyric information; and a formant synthesis sound source that synthesizes and outputs, at a pitch corresponding to the pitch information, a voice having formants corresponding to the obtained formant parameters; the synthesizer being characterized by further comprising: a coarticulation database that stores, for each combination of preceding and following phonemes, a coarticulation time, which is the interpolation time for the transition from the preceding phoneme to the following phoneme; means for detecting, when sounding of a new phoneme is instructed, whether a phoneme is currently sounding and, if so, retrieving from the coarticulation database the coarticulation time to be used for the transition from the sounding phoneme to the new phoneme; and interpolation means for sequentially outputting to the formant synthesis sound source formant parameters obtained by interpolation, so that the formant parameters of the sounding phoneme shift to those of the new phoneme over the coarticulation time.
[0006]
The invention according to claim 2 is characterized in that, in claim 1, the interpolation means compares the retrieved coarticulation time with the duration time of the new phoneme and, if the coarticulation time is longer than the duration time, sets the coarticulation time equal to the duration time.
[0007]
The invention according to claim 3 is a musical tone synthesis method comprising: an input step of inputting lyric information representing the phonemes of the lyrics to be sung and performance information including at least pitch information; a step of obtaining, using a phoneme database that outputs the corresponding formant parameters when given as arguments information specifying a phoneme, information specifying a pitch, and the elapsed time from the start of sounding of that phoneme, the formant parameters corresponding to the lyric information and pitch information input by the above input means and to the elapsed time from the start of sounding of the phoneme indicated by the lyric information; and a step of synthesizing and outputting, at a pitch corresponding to the pitch information, a voice having formants corresponding to the obtained formant parameters; the method being characterized by providing a coarticulation database that stores, for each combination of preceding and following phonemes, a coarticulation time, which is the interpolation time for the transition from the preceding phoneme to the following phoneme, and by further comprising: a step of detecting, when sounding of a new phoneme is instructed, whether a phoneme is currently sounding and, if so, retrieving from the coarticulation database the coarticulation time to be used for the transition from the sounding phoneme to the new phoneme; and an interpolation step of sequentially outputting to the formant synthesis sound source formant parameters obtained by interpolation, so that the formant parameters of the sounding phoneme shift to those of the new phoneme over the coarticulation time.
The invention according to claim 4 is characterized in that, in the invention of claim 3, the interpolation step further comprises a step of comparing the retrieved coarticulation time with the duration time of the new phoneme and, if the coarticulation time is longer than the duration time, setting the coarticulation time equal to the duration time.
[0009]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0010]
FIG. 1 shows a system configuration of a musical tone synthesizer (singing synthesizer) according to the present invention. This musical tone synthesizer includes a central processing unit (CPU) 101, a MIDI (Musical Instrument Digital Interface) interface 102, a data memory device 104, a working memory 106, a program memory 107, a setting operator 109, a display 111, a network interface 112, a formant. A synthesized sound source 115, a sound system 116, and a system common bus 117 are provided. The units 101, 104, 106 to 109, 111, 112, and 115 are connected to the system common bus 117. The sound system 116 may be connected to the system common bus 117 and controlled from the CPU 101.
[0011]
The CPU 101 controls the overall operation of the musical tone synthesizer. The CPU 101 exchanges MIDI messages with the external MIDI device group 103 via the MIDI interface 102. The data memory device 104 is a storage device that stores various types of data; specifically, it is a local data storage device 105 such as a semiconductor memory, a floppy disk drive (FDD), a hard disk drive (HDD), a magneto-optical (MO) disk drive, or an IC memory card device. In particular, the data memory device 104 stores performance data and lyric data as MIDI data. Besides the examples above, devices that use various other forms of media can serve as the local data storage device 105.
[0012]
The working memory 106 is a RAM (random access memory) used as a work area when the CPU 101 operates, and is used for various registers, flags, buffers, and the like. The program memory 107 is a ROM (read only memory) that stores a control program executed by the CPU 101 and various constant data. The setting operation element 109 is an operation element such as various switches operated by the user, and may be, for example, a mouse 110 or a keyboard used in a normal personal computer. The display 111 is a display device used for displaying various types of information.
[0013]
The network interface 112 is an interface for connecting to a public line such as a telephone line or to a local area network (LAN) such as Ethernet. It may also be an interface for connecting to so-called PC communication services or the Internet. By connecting to various networks 113 via this network interface 112, various programs and data can be downloaded from an external server or host computer (specifically, from a remote data storage device 114 or the like connected to it).
[0014]
The formant synthesis sound source 115 generates and outputs the sound of the designated formant at the designated pitch in accordance with an instruction (formant parameter or the like) from the CPU 101. The sound signal output from the formant synthesis sound source 115 is emitted by the sound system 116.
[0015]
This musical tone synthesizer can sing in accordance with MIDI-format lyric data (data specifying the lyrics to be sung) and performance data (data specifying performance information such as pitch) read from the data memory device 104, or with lyric data and performance data input from the MIDI devices 103 via the MIDI interface 102. The lyric data and performance data may also be MIDI data input from a separately connected performance operator 108 (for example, a keyboard), or MIDI data input from the external network 113 via the network interface 112. In these cases, the input data may be processed in real time for singing, or it may first be stored in the data memory device 104 (local data storage device 105) and then read out and processed. The lyric data and the performance data may be input from different sources. For example, the lyrics of lyric data stored in advance in the data memory device 104 can be sung at the pitches of performance data input in real time from the performance operator 108. In short, the lyric data and the performance data may be prepared in any manner.
[0016]
Singing of this kind is performed under the control of the CPU 101. That is, the CPU 101 takes in the lyric data and performance data prepared in any of the ways described above and, through the processing described later with reference to FIGS. 4 to 9, issues sounding instructions to the formant synthesis sound source 115, thereby making it sing. The control program executed by the CPU 101 is stored in the program memory 107, which is a ROM; alternatively, the program memory 107 may be a RAM instead of a ROM, with the control program stored in the local data storage device 105 and loaded into the RAM program memory 107 for execution as needed. This makes it easy to add to or upgrade the control program. In particular, if the control program and data according to the present invention are stored on a removable recording medium such as a CD-ROM and installed on a local data storage device 105 such as an HDD, new installation and upgrading of the control program and data become easy. The control program executed by the CPU 101 may also be downloaded over a network via the network interface 112. In that case, the downloaded control program may first be stored in the local data storage device 105 and loaded into the RAM program memory 107 as needed, or it may be loaded directly into the RAM program memory 107 and executed.
[0017]
Such a musical tone synthesizer can be realized in various forms. For example, the present invention may be applied to an electronic musical instrument such as a synthesizer or a sound module, or may be applied to a so-called multimedia personal computer. It can also be realized by mounting a sound source board on a general-purpose personal computer, mounting a MIDI interface for inputting performance information (MIDI input) from a MIDI device such as an external keyboard, and executing necessary software.
[0018]
FIG. 2 outlines the processing performed when the musical tone synthesizer of FIG. 1 according to the present invention is made to sing. The performance data 201 and the lyric data 202 are input to the CPU 101 as MIDI data in any of the ways described above. The performance data 201 consists of note-on and note-off messages that include pitch information, velocity information, and the like. The lyric data 202 indicates the lyrics (phoneme data) to be sounded on the notes specified by the performance data 201. The lyric data 202 is created in a format such as MIDI system exclusive. For example, to sing the lyric "saita" (written here as phonemes) at the pitches C3, E3, and G3 in turn, the performance data 201 and the lyric data 202 are input to the CPU 101 in, for example, the following sequence (1).
・s<20>a<00>
・C3 note-on
・C3 note-off
・i<00>
・E3 note-on ………(1)
・E3 note-off
・t<02>a<00>
・G3 note-on
・G3 note-off
[0019]
Here, the lyric data to be pronounced with a note is sent before that note's note-on message. Each of s, a, i, and t represents a phoneme, and the numerical value in <> following a phoneme represents the duration time of that phoneme. However, <00> indicates that the phoneme continues to sound until the note-on of the next phoneme arrives. Each of the lyric data items s <20> a <00>, i <00>, and t <02> a <00> is enclosed between codes indicating the start and end of a predetermined system exclusive message, which identifies it as lyric data. In the following, a lyric sequence such as s <20> a <00> that is pronounced on one note is referred to as a phone sequence (phoneSEQ), and the lyric data buffer 210 is referred to as the phone sequence buffer (phoneSEQ buffer).
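As an illustration of the phone sequence notation, the following minimal sketch parses a textual phone sequence into phoneme and duration pairs. The textual form, the names `Phone` and `parse_phone_seq`, and the reading of the bracketed digits as a decimal count are assumptions made for illustration; the actual system exclusive byte layout is not given in this description.

```python
import re
from typing import List, NamedTuple


class Phone(NamedTuple):
    phoneme: str
    duration: int  # 0 stands for <00>: hold until the next note-on


# Hypothetical textual notation: a phoneme name followed by digits in <>.
_PHONE_RE = re.compile(r"([A-Za-z]+)\s*<(\d+)>")


def parse_phone_seq(text: str) -> List[Phone]:
    """Split a phone sequence such as 's <20> a <00>' into its phonemes."""
    # Assumption: the digits are read as a decimal number.
    return [Phone(m.group(1), int(m.group(2))) for m in _PHONE_RE.finditer(text)]


if __name__ == "__main__":
    for seq in ("s <20> a <00>", "i <00>", "t <02> a <00>"):
        print(seq, "->", parse_phone_seq(seq))
```

Mapping <00> to a duration of 0 keeps the "hold until the next note-on" case easy to test for in later processing.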
[0020]
The CPU 101 that has received such a sequence (1) operates as follows. First, when the phone sequence s <20> a <00> is received, it is stored in the phone sequence buffer 210. The phone sequence buffer 210 is a buffer prepared in the working memory 106. Next, when “C3 note-on” is received, the CPU 101 refers to the phone sequence buffer 210 to learn the lyrics s <20> a <00> to be pronounced, calculates the formant parameters so that those lyrics are generated at the designated pitch “C3”, and sends them to the formant synthesis sound source 115. The formant parameters are transmitted at a predetermined interval (here, every 5 msec). In this way, the lyrics s <20> a <00> are pronounced at the pitch “C3”.
[0021]
Next, “C3 note-off” is received. Because a <00> was specified immediately before, “a” is sustained until the next note-on, so the CPU 101 ignores the received “C3 note-off”. Next, when the phone sequence i <00> to be pronounced is received, it is stored in the phone sequence buffer 210. When “E3 note-on” is received, the CPU 101 refers to the phone sequence buffer 210 to learn the lyrics i <00> to be pronounced, calculates the formant parameters so that those lyrics are generated at the designated pitch “E3”, and sends them to the formant synthesis sound source 115. Thereafter, “ta” is pronounced by the same process.
[0022]
The formant parameters are time-series data and are transferred from the CPU 101 to the formant synthesis sound source 115 at predetermined time intervals. In general, a low rate of about one transfer every several msec is sufficient to reproduce the characteristics of a human voice; in this embodiment the interval is 5 msec. By gradually changing the formants over time at this interval, the characteristics of the human voice are produced and the song is sung. Examples of formant parameters include a voiced/unvoiced distinction, formant center frequency, formant level, and formant bandwidth (parameters that define the formant shape on the frequency axis).
[0023]
The CPU 101 calculates the formant parameters from the input phone sequence 202 and the performance data 201 by referring to the phoneme database and the articulation combination database. The phoneme database and the articulation combination database are prepared in the local data storage device 105 in advance and are loaded into the working memory 106 for use. Phoneme databases and articulation combination databases may be prepared for each voice quality and made selectable, so that the song can be sung in several voice qualities (individual differences, male voice, female voice, and so on).
[0024]
FIG. 3 is a conceptual diagram of the phoneme database reference method. Reference numeral 301 denotes the phoneme database, which is a collection of formant parameters for each phoneme. 302-1, 302-2, 302-3, ..., 302-N (N is the number of phonemes) each denote the group of formant parameters of one phoneme. The group of formant parameters of one phoneme (for example, 302-1) is a database that uniquely outputs the corresponding formant parameters when the elapsed time from the start of pronunciation of that phoneme and the pitch at that time are input. The phoneme database 301 is therefore a database that takes as input a phoneme number specifying a phoneme, a pitch, and the elapsed time from the start of pronunciation of the phoneme, and outputs the formant parameters corresponding to that input. It may take any form. For example, it may be a table, or the range of the input data may be divided into predetermined segments, with formant parameters held for each segment and the output obtained by interpolation according to the input data. It may also take the form of continuous data or of mathematical formulas.
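One of the permitted forms, a segmented table with interpolation, can be sketched as follows. This is only an illustration under stated assumptions: the table contents, the use of MIDI-style note numbers for pitch, the nearest-pitch selection, and the names `PHONEME_DB` and `lookup_formant` are all hypothetical and do not come from this description.

```python
from bisect import bisect_right

# Hypothetical table: phoneme -> pitch (note number) -> keyframes of
# (elapsed time in 5 msec ticks, first formant frequency in Hz).
PHONEME_DB = {
    "a": {
        48: [(0, 700.0), (10, 750.0)],
        60: [(0, 760.0), (10, 800.0)],
    },
}


def lookup_formant(phoneme: str, pitch: float, ticks: int) -> float:
    """Return one formant parameter for (phoneme, pitch, elapsed time)."""
    by_pitch = PHONEME_DB[phoneme]
    # Assumption: use the stored pitch closest to the requested pitch.
    nearest = min(by_pitch, key=lambda p: abs(p - pitch))
    frames = by_pitch[nearest]
    times = [t for t, _ in frames]
    i = bisect_right(times, ticks)
    if i == 0:                      # before the first keyframe
        return frames[0][1]
    if i == len(frames):            # after the last keyframe: hold its value
        return frames[-1][1]
    (t0, v0), (t1, v1) = frames[i - 1], frames[i]
    # Linear interpolation inside the segment [t0, t1).
    return v0 + (v1 - v0) * (ticks - t0) / (t1 - t0)


if __name__ == "__main__":
    print(lookup_formant("a", 48, 5))
```

A real database would return a full set of parameters (frequency, level, bandwidth, voiced/unvoiced) per formant; a single value is used here to keep the lookup logic visible.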
[0025]
In FIG. 3, only the formant frequency and formant level graphs 305 and 306 are illustrated as formant parameters of the phoneme database 301. However, the formant parameters are not limited to formant frequency and formant level and may include others, such as formant bandwidth. In FIG. 3, the time axis is omitted, and the figure shows the formant frequency and formant level being determined from the phoneme number and the input pitch, as indicated by arrow 307.
[0026]
The pitch used when the CPU 101 refers to the phoneme database is, as indicated by 303, the value obtained by adding pitch bend data and other pitch generation data to the basic pitch data designated by note-on. The formants of a singing voice (particularly the formant frequencies) vary with pitch, and how they vary depends on the phoneme. In this embodiment, because the phoneme database outputs formant parameters according to the pitch, the change in formants with pitch can be simulated in the synthesized voice, and a natural singing voice can be synthesized. As indicated by 304, a value obtained by adding velocity data, volume data, and other level generation data may also be reflected in the pitch data. Because velocity, volume, and other level data may change the pitch, that change in pitch is then reflected in the formant parameters. In particular, if the processing speed of the CPU 101 is high, the formant parameters may be recalculated with the pitch at that moment each time formant parameters are output to the sound source 115.
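The addition indicated by 303 can be sketched as below. The semitone scale, the bend range of plus or minus 2 semitones, and the name `effective_pitch` are assumptions for illustration; only the 14-bit pitch bend value with its center at 8192 is standard MIDI.

```python
def effective_pitch(note: int, bend: int = 8192,
                    bend_range: float = 2.0,
                    other_offset: float = 0.0) -> float:
    """Pitch (in semitones) used to look up the phoneme database.

    note         -- basic pitch designated by note-on (note number)
    bend         -- 14-bit MIDI pitch bend value, 8192 = no bend
    bend_range   -- assumed bend range in semitones (hypothetical setting)
    other_offset -- any other pitch generation data, in semitones
    """
    bend_semitones = (bend - 8192) / 8192 * bend_range
    return note + bend_semitones + other_offset


if __name__ == "__main__":
    print(effective_pitch(60))            # no bend
    print(effective_pitch(60, bend=0))    # full downward bend
```

Recomputing this value on every output, as the paragraph above suggests for a fast CPU, lets the formant lookup follow a pitch bend while a phoneme is sounding.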
[0027]
FIG. 4 shows the procedure of the main program executed by the CPU 101 when the tone synthesizer is powered on. First, in step 401, various initial settings are made. In particular, the sound generation flag HFLAG described later is initialized to 0. Next, in step 402, the program waits for an event. When an event occurs, it is identified in step 403, the corresponding processing in steps 404 to 406 is executed, and the process returns to step 402. Step 404 is the MIDI reception process executed when a MIDI message is received. Step 405 is the MIDI data processing that handles the data in the MIDI reception buffer; it is executed repeatedly whenever no other event has occurred and the task is idle. When any other event occurs, the processing corresponding to that event (step 406) is performed.
[0028]
FIG. 5 shows the procedure of the MIDI reception process in step 404 of FIG. In the MIDI reception process, in step 501, the received MIDI message is written to the MIDI reception buffer (secured in the working memory 106), and the process returns.
[0029]
FIG. 6 shows the procedure of the MIDI data processing in step 405 of FIG. 4. First, in step 601, one byte is fetched from the MIDI reception buffer. In step 602, it is determined whether the fetched byte is a status byte (most significant bit set to 1). If it is a status byte, in step 604 the data bytes that follow it are fetched and stored in a predetermined buffer together with the status byte, and the process proceeds to step 605. If it is determined in step 602 that the byte is not a status byte, other processing is performed in step 603, and the process returns. Because the data bytes that follow may not yet have been received into the MIDI reception buffer when the status byte is fetched, the data bytes following the status byte may instead be fetched one at a time in step 603, with the process proceeding to step 604 once a complete message has been fetched. At step 604, one complete MIDI message has been taken into the predetermined buffer.
[0030]
In step 605, the process branches according to the status of the received MIDI message. For a note-on, the process proceeds to the note-on process; for a note-off, it proceeds to the note-off process. For a system exclusive message, the process proceeds to step 607, where it is determined whether the MIDI message is a phone sequence. If it is, the phone sequence is written into the phone sequence buffer in step 608 and the process returns. If it is not, other system exclusive processing is performed in step 609, and the process returns. If the MIDI message is any other message in step 605, the processing corresponding to its status is performed in step 606, and the process returns. In particular, when the received MIDI message is singing end information (information indicating the end of the song, placed at the end of the performance data), the process proceeds to step 606, all sounds being generated at that time are muted, and the song ends.
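The status byte test of step 602 and the branch of step 605 can be sketched as follows. The status values used (0x8n note-off, 0x9n note-on, 0xF0 system exclusive start) are standard MIDI; the function names and the returned labels are hypothetical, and the test for whether a system exclusive message is a phone sequence (step 607) is left out because its byte format is not given here.

```python
def is_status_byte(b: int) -> bool:
    """Step 602: a MIDI status byte has its most significant bit set."""
    return (b & 0x80) != 0


def classify_status(status: int) -> str:
    """Step 605: decide which handler a complete message goes to."""
    kind = status & 0xF0          # upper nibble = message type, lower = channel
    if kind == 0x90:
        return "note_on"          # handled by the note-on process
    if kind == 0x80:
        return "note_off"         # handled by the note-off process
    if status == 0xF0:
        return "system_exclusive"  # steps 607 to 609
    return "other"                # step 606


if __name__ == "__main__":
    for byte in (0x90, 0x83, 0xF0, 0xB0):
        print(hex(byte), classify_status(byte))
```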
[0031]
With reference to FIG. 7, a process when a note-on MIDI message is received will be described. First, at step 701, the time counter time_counter of the phone sequence and the phone counter phone_counter are initialized to zero. The time counter time_counter is a counter that is incremented every time a timer interrupt (every 5 msec) occurs, and the counter value represents the elapsed time from the start of sound generation of each phoneme in units of 5 msec. The time counter time_counter is reset to 0 every time a phoneme is switched. The phone counter phone_counter is a counter indicating the order of phonemes in one note (that is, in one phone sequence generated by one note). Every time a note is switched, that is, every time a phone sequence is switched, the value is reset to 0, and in one phone sequence, it is incremented every time a phoneme is switched. Since it starts from 0, the first phoneme in one phone sequence is phone_counter = 0, the next phoneme is phone_counter = 1, and so on.
[0032]
Next, in step 702, it is determined whether the phone sequence contains exhalation information. Exhalation information is included in a phone sequence when a phoneme and the following phoneme are to be sounded separately. When present, exhalation information is placed at the end of the phone sequence. In that case, after the last phoneme in the phone sequence has been pronounced, its sounding is stopped before the next phone sequence begins. When exhalation information is found in step 702, the exhalation flag f_koki is set to 1 and the process proceeds to step 705. If there is none, the flag f_koki is reset to 0 and the process proceeds to step 705.
[0033]
In step 705, the phoneme number and duration time at the position indicated by the phone counter phone_counter in the phone sequence buffer are extracted. Because phone_counter = 0 immediately after note-on, the first phoneme in the phone sequence buffer is extracted. Next, in step 706, formant parameters such as formant frequency, formant level, and formant bandwidth are derived from the extracted phoneme number and the pitch data by referring to the phoneme database, as described for FIG. 3. The pitch data used here is the pitch obtained by reflecting pitch bend data and other pitch generation data in the basic pitch data designated by note-on, as described for 303 of FIG. 3. The phoneme database outputs the formant parameters corresponding to the input phoneme number, pitch, and elapsed time from the start of sound generation, but in step 706 only the phoneme number and pitch are specified. What is extracted here is therefore time-series formant data (in any form, such as a table or a mathematical formula, as long as it uniquely yields the formant parameters when the elapsed time is input). In steps 710 and 905, described later, the current value of the time counter time_counter is applied to the data extracted in step 706 to obtain the formant parameters at that time.
[0034]
Next, in step 707, the articulation coupling curve and the articulation coupling time are extracted by referring to the articulation combination database on the basis of the preceding and following phonemes. The articulation combination database is referred to in order to make a smooth transition from one phoneme to the next. When the previous phoneme and the next phoneme are specified, the articulation coupling curve and the articulation coupling time can be extracted from it. Formant parameters obtained by interpolating from the formant parameters of the previous phoneme to those of the next phoneme along the articulation coupling curve are then sent sequentially to the sound source, which realizes a natural transition from the previous phoneme to the next. This processing is called articulation coupling. The articulation coupling time is the time taken to shift from the previous phoneme to the next phoneme while interpolating. In step 707, it is checked whether a phoneme is already sounding before the phoneme to be pronounced at note-on. If there is one, the articulation coupling curve and the articulation coupling time used for the transition from that phoneme to the phoneme to be pronounced at note-on are extracted from the articulation combination database.
[0035]
Next, in step 708, the articulation combination time extracted in step 707 is compared with the duration time of the phoneme extracted in step 705. If the articulation combination time is longer than the duration time, performing the articulation combination over the full articulation combination time would make the phoneme now being started last longer than its specified duration time, which is undesirable. In that case, in step 709, the articulation combination time is made to coincide with the duration time, and the process proceeds to step 710. If it is determined in step 708 that the articulation combination time is not longer than the duration time, the process proceeds directly to step 710.
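Steps 708 and 709 amount to a clamp, sketched below. The function name and the use of 5 msec ticks as the unit are assumptions; treating a duration of <00> as unbounded follows the worked example later in this description, where a <00> duration is regarded as longer than the articulation combination time.

```python
def clamp_articulation_time(articulation_ticks: int, duration_ticks: int) -> int:
    """Steps 708 and 709: never interpolate for longer than the phoneme lasts.

    duration_ticks == 0 stands for <00> (hold until the next note-on), which
    is treated as longer than any articulation combination time.
    """
    if duration_ticks == 0:
        return articulation_ticks
    return min(articulation_ticks, duration_ticks)


if __name__ == "__main__":
    print(clamp_articulation_time(8, 2))   # shortened to the duration
    print(clamp_articulation_time(8, 20))  # unchanged
    print(clamp_articulation_time(8, 0))   # <00>: unchanged
```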
[0036]
In step 710, the formant parameters at the current value of the time counter time_counter (here 0, because it is immediately after note-on) are obtained from the formant data extracted in step 706. If an articulation coupling curve and an articulation coupling time were extracted in step 707, articulation coupling processing according to that curve and time is then performed using the obtained formant parameters and the formant parameters of the previous (currently sounding) phoneme, and the formant parameters to output are calculated. If no articulation coupling curve and time were extracted in step 707 (no phoneme was sounding beforehand), the articulation coupling processing is not needed. Next, in step 711, the calculated formant data and the pitch are written to the sound source 115. As a result, the first formant parameters of the phoneme to be sounded are transferred to the sound source 115, and sounding starts. Thereafter, formant parameters are sent to the sound source 115 every 5 msec by the timer process described later. Next, in step 712, the sound generation flag HFLAG is set to 1 and the process returns. HFLAG = 1 indicates that singing is in progress, and HFLAG = 0 indicates that it is not.
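The interpolation performed in step 710 might look like the following sketch for a single parameter. The shape of the articulation coupling curve is not specified in this description, so a linear curve is assumed by default; the function name `coarticulate` and the choice to treat the previous phoneme's value as fixed during the transition are likewise assumptions.

```python
from typing import Callable


def coarticulate(prev_value: float, new_value: float,
                 ticks: int, articulation_ticks: int,
                 curve: Callable[[float], float] = lambda x: x) -> float:
    """Blend one formant parameter from the previous phoneme to the new one.

    ticks              -- time_counter value (elapsed 5 msec ticks)
    articulation_ticks -- articulation coupling time in ticks
    curve              -- maps progress 0..1 to blend weight 0..1
                          (assumed linear unless another curve is supplied)
    """
    if articulation_ticks <= 0 or ticks >= articulation_ticks:
        return new_value                 # transition finished or not needed
    weight = curve(ticks / articulation_ticks)
    return prev_value + (new_value - prev_value) * weight


if __name__ == "__main__":
    # Formant frequency moving from 700 Hz to 300 Hz over 4 ticks (20 msec).
    print([coarticulate(700.0, 300.0, t, 4) for t in range(6)])
```

Passing a different `curve`, for example one that eases in and out, changes the trajectory without changing the start and end values.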
[0037]
Next, the processing performed when a note-off MIDI message is received will be described. First, in step 801, the duration time of the currently sounding phoneme of the phone sequence is examined, and it is determined whether the duration time is <00> and a phoneme to be sounded follows it. If so (for example, when a phoneme sustained with a duration of <00> is followed by a consonant to be pronounced), the process proceeds to step 802 to pronounce that following phoneme. If it is determined in step 801 that the duration time of the currently sounding phoneme is not <00>, or that it is <00> but no phoneme follows, the process proceeds to step 808.
[0038]
In step 802, the phone counter phone_counter is set to the position of the phoneme that follows the phoneme with a duration time of <00> in the phone sequence, and the time counter time_counter is reset to zero. Next, in step 803, the phoneme number and duration time at the position indicated by phone_counter in the phone sequence buffer are extracted. Because phone_counter now points at the phoneme to be pronounced last, that phoneme is extracted.
[0039]
The next steps 804, 805, 806, and 807 are the same as steps 706, 707, 710, and 711 of the note-on process. Through steps 804 to 807, the first formant parameters of the phoneme to be sounded are transferred to the sound source 115, and sounding starts. Thereafter, formant parameters are sent to the sound source 115 every 5 msec by the timer process. The phoneme that starts sounding in steps 804 to 807 follows a phoneme that was sustained with a duration time of <00>, so it is sounded in the interval before the next note-on is received. If the duration time of the phoneme started in steps 804 to 807 is long and the next note-on is received while it is still sounding, the sound is forcibly carried over from that phoneme to the phoneme of the next note-on, as described for the note-on process.
[0040]
If it is determined in step 801 that the duration time of the currently sounding phoneme is not <00>, or that it is <00> but no phoneme follows, it is determined in step 808 whether the exhalation flag f_koki is 1. If it is, key-off processing is performed in step 809 to stop the currently sounding phoneme, the sound generation flag HFLAG is reset to 0, and the process returns. If the exhalation flag f_koki is not 1 in step 808, the process returns directly. As a result, when f_koki is not 1, the currently sounding phoneme continues to sound.
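The decisions made in the note-off process (steps 801 and 808) can be summarized in a small sketch. The function name and the returned labels are hypothetical; a duration of 0 stands for <00>.

```python
def on_note_off(current_duration: int, has_next_phoneme: bool, f_koki: int) -> str:
    """Return the action taken when a note-off message arrives."""
    # Step 801: a sustained (<00>) phoneme followed by another phoneme.
    if current_duration == 0 and has_next_phoneme:
        return "start_next_phoneme"   # steps 802 to 807
    # Step 808: exhalation information requests a break after this note.
    if f_koki == 1:
        return "key_off"              # step 809, HFLAG reset to 0
    return "keep_sounding"            # return without changing anything


if __name__ == "__main__":
    print(on_note_off(0, False, 0))   # e.g. the "a" of "sa" in the example
    print(on_note_off(0, True, 0))
    print(on_note_off(20, False, 1))
```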
[0041]
FIG. 9 shows a processing procedure of timer processing executed every 5 msec by timer interruption. In the timer process, first, in step 901, it is determined whether or not the sound generation flag HFLAG is 1. When HFLAG is not 1, it returns as it is because it is not currently sounding. If HFLAG is 1, it is determined in step 902 whether the duration time of the phoneme currently being generated is <00>. When the duration time is <00>, the process proceeds to step 904 in order to continue the sound generation of the currently sounding phoneme. If the duration time is not <00>, it is determined in step 903 whether or not the value of the time counter time_counter is equal to or less than the duration time. When the value of the time counter time_counter is equal to or less than the duration time, it means that the pronunciation of the currently sounding phoneme is further continued, and the process proceeds to step 904. In step 904, the time counter time_counter is incremented, and the process advances to step 905 to calculate and output the formant parameter in the current time counter time_counter value.
[0042]
When the value of the time counter time_counter exceeds the duration time in step 903, the currently sounding phoneme has been sounded for its full duration time, and the process proceeds to step 907. In step 907, it is determined whether there is a next phoneme in the phone sequence. If there is none, the sound generation flag HFLAG is reset to 0 in step 912 and the process returns. If there is a next phoneme, then in order to sound it, the phone counter phone_counter is incremented and the time counter time_counter is reset to 0 in step 908. In step 909, the phoneme number and the duration time at the position indicated by phone_counter in the phone sequence (the phoneme to be sounded next) are extracted.
[0043]
The next steps 910, 911, 905, and 906 are the same as steps 706, 707, 710, and 711 of the note-on process. Through these steps, the formant parameters of the phoneme to be sounded are transferred to the sound source 115. In step 905, the formant parameters at the current value of the time counter time_counter are obtained from the formant data previously extracted from the phoneme database, and articulation coupling processing is performed with the articulation combination curve and articulation combination time previously extracted from the articulation combination database to calculate the formant parameters to output. The “formant data extracted from the phoneme database” is specifically the data extracted in step 706 for a phoneme that starts sounding at note-on, the data extracted in step 804 for a phoneme that follows a duration <00> phoneme at note-off, and the data extracted in step 910 for a phoneme that follows a phoneme whose duration is not <00> in the phone sequence. Likewise, the articulation combination curve and articulation combination time used in step 905 are those extracted in step 707, step 805, and step 911 in those three cases, respectively.
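The control flow of the timer process can be sketched as a small state machine. This is a simplified illustration: the `Voice` class and `timer_tick` function are hypothetical, the function returns the name of the phoneme whose parameters would be output instead of computing formant parameters, and the comparison follows the "equal to or less than" test described for step 903.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Voice:
    seq: List[Tuple[str, int]]   # (phoneme, duration in ticks); 0 means <00>
    phone_counter: int = 0
    time_counter: int = 0
    hflag: int = 1               # 1 while singing


def timer_tick(v: Voice) -> Optional[str]:
    """One 5 msec timer interrupt; returns the phoneme being output, if any."""
    if v.hflag != 1:                                   # step 901
        return None
    phoneme, duration = v.seq[v.phone_counter]
    if duration == 0 or v.time_counter <= duration:    # steps 902 and 903
        v.time_counter += 1                            # step 904
        return phoneme                                 # steps 905 and 906
    if v.phone_counter + 1 >= len(v.seq):              # step 907: no next phoneme
        v.hflag = 0                                    # step 912
        return None
    v.phone_counter += 1                               # step 908
    v.time_counter = 0
    return v.seq[v.phone_counter][0]                   # steps 909 to 911, 905, 906


if __name__ == "__main__":
    voice = Voice(seq=[("s", 2), ("a", 0)])
    print([timer_tick(voice) for _ in range(5)])
```

With a duration of 2 ticks for “s”, the sketch outputs “s” on three ticks before moving to “a”, which then continues indefinitely because its duration is <00>.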
[0044]
Next, an outline of how the processes of FIGS. 4 to 9 described above are executed will be described with a specific example. Here, it is assumed that events occur in the order of the following sequence (1) when “saita” is pronounced by C3, E3, and G3.
・ s <20> a <00>
・ C3 note-on
・ C3 note-off
・ i <00>
・ E3 note-on
・ E3 note-off
・ t <02> a <00>
・ G3 note-on
・ G3 note-off
[0045]
Each of these MIDI messages is processed in the order of steps 402 → 403 → 404 → 501 and written to the MIDI reception buffer. Each written MIDI message is then handled according to its type in step 605 and the subsequent steps of the MIDI data processing. The following describes how each MIDI message is processed from step 605 onward.
[0046]
First, the phone sequence s <20> a <00>, which is received first, is written into the phone sequence buffer by steps 605 → 607 → 608. The next message, “C3 note-on”, proceeds from step 605 to the note-on process. There the process runs through steps 701 → 702 → 704 → 705, and the phoneme number of the phoneme “s” to be pronounced first and its duration time <20> are extracted from the phone sequence buffer. In step 706, the phoneme database is referred to, and the formant parameter group corresponding to the phoneme number of “s” and the pitch “C3” is extracted. The formant parameter group extracted here is time-series data that changes with the elapsed time (in 5 msec steps) from the start of pronunciation of the phoneme “s” (its form is arbitrary). In the next step 707, the articulation combination database would be referred to, but because no phoneme is sounding before the phoneme “s”, nothing is done in steps 707 and 708, and the process proceeds to step 710. In step 710, the formant parameters at the current time_counter value (time_counter = 0 because it is immediately after note-on) are obtained from the formant parameter group extracted in step 706. The obtained formant parameters are sent to the sound source 115 together with the pitch data and the key-on signal in step 711, which starts the tone at pitch “C3” with the phoneme “s”. In step 712, the sound generation flag HFLAG is set to 1, and the note-on process ends.
[0047]
Thereafter, the timer process of FIG. 9 is executed every 5 msec, and the sounding of s <20> continues. In the timer process, the process runs through steps 901 → 902 → 903 → 904 and increments the time counter time_counter. In step 905, the formant parameters at the current time_counter value are obtained from the formant parameter group extracted in step 706. In step 906, the obtained formant parameters are sent to the sound source. This flow of steps 901 → 902 → 903 → 904 → 905 → 906 → return is repeated every 5 msec. When time_counter = 20 is reached, the process proceeds from step 903 to step 907 and then to step 908 to shift to the next phoneme a <00>. In step 909, the phoneme number of the phoneme “a” to be sounded next and its duration time <00> are extracted from the phone sequence buffer. In step 910, the phoneme database is referred to, and the formant parameter group corresponding to the phoneme number of “a” and the pitch “C3” is extracted. The formant parameter group extracted here is time-series data that changes with the elapsed time (in 5 msec steps) from the start of pronunciation of the phoneme “a”. In the next step 911, the articulation coupling database is referred to, and the articulation coupling curve and the articulation coupling time used for the shift from the previous phoneme “s” to the current phoneme “a” are extracted. In step 905, the formant parameters at the current time_counter value are obtained from the formant parameter group extracted in step 910 together with the articulation coupling curve and articulation coupling time extracted in step 911. The obtained formant parameters are sent to the sound source 115 together with the pitch data and the key-on signal in step 906, so that the tone at pitch “C3” shifts naturally from the phoneme “s” to “a”.
Since the duration time of the phoneme “a” is <00>, in the timer processing every 5 msec thereafter, the process proceeds from step 902 → 904 → 905 → 906, and the pronunciation of the phoneme “a” continues.
[0048]
The next message, “C3 note-off”, proceeds from step 605 to the note-off process. The duration time of the currently sounding phoneme “a” is <00>, but because no phoneme follows it, the process proceeds from step 801 to step 808. Because no exhalation information is set, f_koki = 0, and the process returns directly. The phoneme “a” therefore continues to sound.
[0049]
The next phone sequence “i <00>” is written to the phone sequence buffer by steps 605 → 607 → 608. The next message, “E3 note-on”, proceeds from step 605 to the note-on process. There the process runs through steps 701 → 702 → 704 → 705, and the phoneme number of the first phoneme “i” and its duration time <00> are extracted from the phone sequence buffer. In step 706, the phoneme database is referred to, and the formant parameter group corresponding to the phoneme number of “i” and the pitch “E3” is extracted. The formant parameter group extracted here is time-series data that changes with the elapsed time (in 5 msec steps) from the start of pronunciation of the phoneme “i” (its form is arbitrary). In the next step 707, the articulation coupling database is referred to, and the articulation coupling curve and the articulation coupling time used for the shift from the previous (currently sounding) phoneme “a” to the current phoneme “i” are extracted. In the determination of step 708, the duration time of “i” is <00>, which can be regarded as longer than the articulation coupling time for the shift from “a” to “i”, so the process proceeds to step 710. In step 710, the formant parameters at the current time_counter value are obtained from the formant parameter group extracted in step 706 together with the articulation coupling curve and articulation coupling time extracted in step 707. The obtained formant parameters are sent to the sound source 115 together with the pitch data and the key-on signal in step 711, which starts the tone at pitch “E3” while shifting naturally from the phoneme “a” to “i”. In step 712, the sound generation flag HFLAG is set to 1, and the note-on process ends.
[0050]
Similarly, each message from “E3 note-off” to “G3 note-off” is processed.
[0051]
In the embodiment of the present invention, the formant synthesis sound source 115 may be realized, in whole or in part, by hardware, by software, or by a combination of the two. In the embodiment, the phoneme database holds information for each phoneme, with vowels and consonants treated as separate phonemes, but each sound of the 50-sound Japanese syllabary (“sa”, “si”, and so on) may instead be stored as the information for one phoneme. The phoneme database and the articulation combination database may also be combined into one.
[0052]
In the above embodiment, the formant parameters are changed every 5 msec as an example. However, the update rate may instead be controlled so that it is fast when the formants or spectral characteristics are changing greatly and slow when they are changing little.
[0053]
[Effects of the invention]
As described above, according to the present invention, the formant parameters stored in the phoneme database reflect the change in formants that corresponds to a change in pitch, so synthesis that follows such changes can be performed. Therefore, even when the pitch changes dynamically, a natural synthesized sound with formants corresponding to the change in pitch can be obtained. If the formant parameters are obtained in consideration of not only the pitch but also other performance information, a natural voice adapted to that performance information can be synthesized.
[Brief description of the drawings]
FIG. 1 is a system configuration diagram of a musical tone synthesizer according to the present invention.
FIG. 2 is a diagram showing an outline of processing when singing with the musical tone synthesizer of FIG. 1;
FIG. 3 is a conceptual diagram of the phoneme database reference method.
FIG. 4 is a flowchart showing the procedure of the main program.
FIG. 5 is a flowchart showing a procedure of a MIDI reception process.
FIG. 6 is a flowchart showing the procedure of MIDI data processing.
FIG. 7 is a flowchart showing a procedure of note-on processing.
FIG. 8 is a flowchart showing a procedure of note-off processing.
FIG. 9 is a flowchart showing a procedure of timer processing.
[Explanation of symbols]
101 ... Central processing unit (CPU), 102 ... MIDI interface, 103 ... MIDI apparatus group, 104 ... Data memory device, 105 ... Local data storage device, 106 ... Working memory, 107 ... Program memory, 108 ... Performance operator, 109 ... Setting operator, 111 ... Display, 112 ... Network interface, 115 ... Formant synthesized sound source, 116 ... Sound system, 117 ... System common bus, 201 ... Performance data, 202 ... Lyrics data, 210 ... Lyrics data buffer (phone sequence buffer), 301 ... Phoneme database.

Claims

A musical sound synthesizer comprising: input means for inputting lyric information representing the phonemes of lyrics to be sung and performance information including at least pitch information;
a phoneme database that outputs corresponding formant parameters when information specifying a phoneme, information specifying a pitch, and the elapsed time from the start of sounding of that phoneme are given as arguments;
means for obtaining, with reference to the phoneme database, formant parameters corresponding to the lyric information and pitch information input by the input means and to the elapsed time from the start of sounding of the phoneme indicated by the lyric information; and
a formant synthesis sound source that synthesizes and outputs a voice having formants corresponding to the obtained formant parameters at a pitch corresponding to the pitch information, the musical sound synthesizer being characterized by further comprising:
an articulation coupling database that stores, for each combination of a preceding phoneme and a following phoneme, an articulation coupling time, which is the interpolation time for the transition from the preceding phoneme to the following phoneme;
means for detecting, when sounding of a new phoneme is instructed, whether or not a phoneme is currently sounding and, if there is one, retrieving from the articulation coupling database the articulation coupling time to be used for the transition from the sounding phoneme to the new phoneme; and
interpolation means for sequentially outputting, to the formant synthesis sound source, formant parameters obtained by interpolation so as to shift from the formant parameters of the sounding phoneme to the formant parameters of the new phoneme over the articulation coupling time.

The musical sound synthesizer according to claim 1,
wherein the interpolation means compares the retrieved articulation coupling time with the duration time of the new phoneme and, when the articulation coupling time is longer than the duration time, sets the articulation coupling time equal to the duration time.

A musical sound synthesis method comprising: an input step of inputting lyric information representing the phonemes of lyrics to be sung and performance information including at least pitch information;
a step of obtaining, using a phoneme database that outputs corresponding formant parameters when information specifying a phoneme, information specifying a pitch, and the elapsed time from the start of sounding of that phoneme are given as arguments, formant parameters corresponding to the lyric information and pitch information input by the input means and to the elapsed time from the start of sounding of the phoneme indicated by the lyric information; and
a step of synthesizing and outputting a voice having formants corresponding to the obtained formant parameters at a pitch corresponding to the pitch information, the musical sound synthesis method being characterized in that:
an articulation coupling database is provided that stores, for each combination of a preceding phoneme and a following phoneme, an articulation coupling time, which is the interpolation time for the transition from the preceding phoneme to the following phoneme; and in that the method further comprises:
a step of detecting, when sounding of a new phoneme is instructed, whether or not a phoneme is currently sounding and, if there is one, retrieving from the articulation coupling database the articulation coupling time to be used for the transition from the sounding phoneme to the new phoneme; and
an interpolation step of sequentially outputting, to the formant synthesis sound source, formant parameters obtained by interpolation so as to shift from the formant parameters of the sounding phoneme to the formant parameters of the new phoneme over the articulation coupling time.

The musical sound synthesis method according to claim 3,
wherein, in the interpolation step, the retrieved articulation coupling time is compared with the duration time of the new phoneme and, when the articulation coupling time is longer than the duration time, the articulation coupling time is set equal to the duration time.
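
As an illustration only, the transition recited in the claims above can be sketched as follows. The sketch assumes linear interpolation, a 5 msec output period, a coupling time of zero for phoneme pairs missing from the table, and a hypothetical table of coupling times. None of these values or names are taken from the claims, which require only that interpolated formant parameters be output over the articulation coupling time.

```python
TICK_MS = 5  # assumed output period for interpolated parameters

# Hypothetical articulation coupling table:
# (sounding phoneme, new phoneme) -> coupling time in msec.
COUPLING_MS = {("a", "i"): 40, ("s", "a"): 20}

def transition_frames(prev_phoneme, new_phoneme,
                      prev_params, new_params, duration_ms):
    """Return the formant parameter sets to output during a transition."""
    # No phoneme is sounding: start directly on the new phoneme.
    if prev_phoneme is None:
        return [list(new_params)]

    # Assumption: pairs missing from the table get no transition time.
    coupling_ms = COUPLING_MS.get((prev_phoneme, new_phoneme), 0)
    # The coupling time may not exceed the new phoneme's duration time.
    coupling_ms = min(coupling_ms, duration_ms)

    steps = coupling_ms // TICK_MS
    if steps == 0:
        return [list(new_params)]

    frames = []
    for k in range(1, steps + 1):
        w = k / steps  # 0 < w <= 1, reaching the new phoneme at the end
        frames.append([(1 - w) * a + w * b
                       for a, b in zip(prev_params, new_params)])
    return frames
```

For example, moving from "a" to "i" with a stored coupling time of 40 msec but a duration time of only 20 msec yields four interpolated parameter sets instead of eight, the last of which equals the parameters of the new phoneme.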