JP3776782B2

JP3776782B2 - Method for encoding an acoustic signal

Info

Publication number: JP3776782B2
Application number: JP2001321968A
Authority: JP
Inventors: 敏雄茂出木
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2001-10-19
Filing date: 2001-10-19
Publication date: 2006-05-17
Anticipated expiration: 2021-10-19
Also published as: JP2003124816A

Description

【０００１】
【産業上の利用分野】
本発明は、音楽制作における採譜と呼ばれる以下のような業務を支援するのに適用することができる。採譜業務としては、例えば、譜面が入手できない場合の素材としての既存楽曲の引用・既存楽曲のカバー曲制作、ヒット曲のメロディ・和声進行・音色の分析研究等の楽曲分析、カラオケにおけるＭＩＤＩデータ形式の演奏データ作成、ゲーム機のＢＧＭデータの作成、携帯電話の着メロデータ作成、自動ピアノ・演奏ガイド機能付き鍵盤楽器向け演奏データの作成、楽譜出版・版下作成などがある。
【０００２】
【従来の技術】
音響信号に代表される時系列信号には、その構成要素として複数の周期信号が含まれている。このため、与えられた時系列信号にどのような周期信号が含まれているかを解析する手法は、古くから知られている。例えば、フーリエ解析は、与えられた時系列信号に含まれる周波数成分を解析するための方法として広く利用されている。
【０００３】
このような時系列信号の解析方法を利用すれば、音響信号を符号化することも可能である。コンピュータの普及により、原音となるアナログ音響信号を所定のサンプリング周波数でサンプリングし、各サンプリング時の信号強度を量子化してデジタルデータとして取り込むことが容易にできるようになってきており、こうして取り込んだデジタルデータに対してフーリエ解析などの手法を適用し、原音信号に含まれていた周波数成分を抽出すれば、各周波数成分を示す符号によって原音信号の符号化が可能になる。
【０００４】
一方、電子楽器による楽器音を符号化しようという発想から生まれたＭＩＤＩ（Musical Instrument Digital Interface）規格も、パーソナルコンピュータの普及とともに盛んに利用されるようになってきている。このＭＩＤＩ規格による符号データ（以下、ＭＩＤＩデータという）は、基本的には、楽器のどの鍵盤キーを、どの程度の強さで弾いたか、という楽器演奏の操作を記述したデータであり、このＭＩＤＩデータ自身には、実際の音の波形は含まれていない。そのため、実際の音を再生する場合には、楽器音の波形を記憶したＭＩＤＩ音源が別途必要になるが、その符号化効率の高さが注目を集めており、ＭＩＤＩ規格による符号化および復号化の技術は、現在、パーソナルコンピュータを用いて楽器演奏、楽器練習、作曲などを行うソフトウェアに広く採り入れられている。
【０００５】
そこで、音響信号に代表される時系列信号に対して、所定の手法で解析を行うことにより、その構成要素となる周期信号を抽出し、抽出した周期信号をＭＩＤＩデータを用いて符号化しようとする提案がなされている。例えば、特開平１０−２４７０９９号公報、特開平１１−７３１９９号公報、特開平１１−７３２００号公報、特開平１１−９５７５３号公報、特開２０００−９９００９号公報、特開２０００−９９０９２号公報、特開２０００−９９０９３号公報、特開２０００−２６１３２２号公報、特開２００１−５４５０号公報、特開２００１−１４８６３３号公報、特願２００１−２０９１６３号明細書には、任意の時系列信号について、構成要素となる周波数を解析し、その解析結果からＭＩＤＩデータを作成することができる種々の方法が提案されている。
【０００６】
【発明が解決しようとする課題】
上記各公報および明細書において提案してきたＭＩＤＩ符号化方式により、演奏録音等から得られる音響信号の効率的な符号化が可能になった。音響信号の符号化においては、特に楽器音から発生する倍音成分の扱いが問題となるが、倍音の処理についても上記各公報において開示されている手法により解決を試みている。倍音とは、本来の音である基本音の周波数の整数倍の周波数を有する音であり、本来の音を正確に再現する上では重要な成分であるが、通常は演奏者が意図して発した音ではないため、演奏データから譜面を再現する採譜に本発明を応用する際には不要な成分となる。倍音成分は、ＭＩＤＩノートナンバーでいえば、基本音の＋１２、＋１９、＋２４、＋２８、＋３１、・・・といった値をとるものとなる。特に、特願２００１−２０９１６３号明細書においては、倍音に相当すると考えられる成分を所定の割合で基本音の強度成分に加算するとともに、倍音の強度成分を所定の割合で減算することにより各々の第２強度を算出し、この第２強度を符号化時の優先度として扱うことにより、倍音除去を行っている。
【０００７】
しかしながら、上記手法では、倍音に相当すると考えられる音の強度成分を基本音の第２強度として加算していくだけであるので、倍音であるかどうかの判断が充分であるとはいえない。また、上記手法では、複数の音源が混在している音響信号に対しても一律に倍音除去を行っているため、倍音除去を行ってはいけないヴォーカル（歌声）についても倍音除去が行われてしまうという問題がある。
【０００８】
上記のような点に鑑み、本発明は、精度の良い倍音除去が可能であると共に、人間の声以外に対してだけ、倍音除去を行うことが可能な音響信号の符号化方法を提供することを課題とする。
【０００９】
【課題を解決するための手段】
上記課題を解決するため、本発明では、音響信号の符号化方法として、音響信号に対して時間軸上に複数の単位区間を設定する区間設定段階、単位区間における音響信号と複数の周期関数との相関を求めることにより、各周期関数に対応した強度を算出し、各周期関数が有する周波数と、各周期関数に対応した強度と、単位区間の始点に対応する区間開始時刻と、単位区間の終点に対応する区間終了時刻で構成される単位音素データを算出する単位音素データ算出段階、単位音素データに対して、各単位区間ごとに周波数比が整数倍の関係となる他の単位音素データの強度値を加算して高周波分布度を得て、当該単位音素データとの周波数比が（１／整数）倍の関係となる他の単位音素データの強度値を加算して低周波分布度を得て、高周波分布度と低周波分布度の正負の符号を互いに異ならせて加算することにより得られた値を当該単位音素データの強度値で除することにより、各単位音素データの倍音分布度を算出する倍音分布度算出段階、単位音素データのうち、区間が連続し、周波数が同一で強度が類似するものを連結して連結音素データとし、連結音素データの属性として、周波数は構成する単位音素データのいずれかの周波数を与え、強度は構成する単位音素データの最大値を与え、開始時刻は先頭の単位音素データの区間開始時刻を与え、終了時刻は最後尾の単位音素データの区間終了時刻を与え、倍音分布度は構成する単位音素データのいずれかの倍音分布度を与えるようにする音素データ連結段階、連結処理後の音素データのうち、少なくとも倍音分布度が所定の条件を満たす音素データのみを抽出し、当該抽出した音素データの区間長に対応するデルタタイム情報、周波数に対応するノートナンバー情報、強度に対応するベロシティー情報をもつＭＩＤＩ形式の符号データを作成する符号化段階を、有するようにしたことを特徴とする。
【００１０】
本発明によれば、各単位区間について音響信号の周波数解析を行なうことにより単位音素データを得て、各単位音素データに対して当該単位音素データの整数倍の周波数成分の強度が大きいか、整数分の１の周波数成分の強度が大きいかに基づいて倍音と判断される成分を削除して符号化を行うようにしたので、精度の良い倍音除去が可能となる。
【００１１】
さらに、倍音と判断される成分を除去する際、連結後の音素データの区間長が長いものについてだけ、倍音と判断される成分の除去を行うようすることにより、楽器音等の、人間の声以外に対してだけ倍音除去を行うことが可能となる。
【００１２】
【発明の実施の形態】
以下、本発明の実施形態について図面を参照して詳細に説明する。
（1.音響信号の符号化の基本原理）
はじめに、本発明に係る音響信号の符号化方法の基本原理を述べておく。この基本原理は、前掲の各公報あるいは明細書に開示されているので、ここではその概要のみを簡単に述べることにする。
【００１３】
図１（ａ）に示すように、時系列信号としてアナログ音響信号が与えられたものとする。図１の例では、横軸に時間ｔ、縦軸に振幅（強度）をとって、この音響信号を示している。ここでは、まずこのアナログ音響信号を、デジタルの音響データとして取り込む処理を行う。これは、従来の一般的なＰＣＭの手法を用い、所定のサンプリング周波数でこのアナログ音響信号をサンプリングし、振幅を所定の量子化ビット数を用いてデジタルデータに変換する処理を行えば良い。ここでは、説明の便宜上、ＰＣＭの手法でデジタル化した音響データの波形も図１（ａ）のアナログ音響信号と同一の波形で示すことにする。
【００１４】
続いて、この解析対象となる音響信号の時間軸上に、複数の単位区間を設定する。図１（ａ）に示す例では、時間軸ｔ上に等間隔に６つの時刻ｔ１〜ｔ６が定義され、これら各時刻を始点および終点とする５つの単位区間ｄ１〜ｄ５が設定されている。図１の例では、全て同一の区間長をもった単位区間が設定されているが、個々の単位区間ごとに区間長を変えるようにしてもかまわない。あるいは、隣接する単位区間が時間軸上で部分的に重なり合うような区間設定を行ってもかまわない。
【００１５】
こうして単位区間が設定されたら、各単位区間ごとの音響信号（以下、区間信号と呼ぶことにする）について、それぞれ代表周波数を選出する。各区間信号には、通常、様々な周波数成分が含まれているが、例えば、その中で成分の強度割合の大きな周波数成分を代表周波数として選出すれば良い。ここで、代表周波数とはいわゆる基本周波数が一般的であるが、音声のフォルマント周波数などの倍音周波数や、ノイズ音源のピーク周波数も代表周波数として扱うことがある。代表周波数は１つだけ選出しても良いが、音響信号によっては複数の代表周波数を選出した方が、より精度の高い符号化が可能になる。図１（ｂ）には、個々の単位区間ごとにそれぞれ３つの代表周波数を選出し、１つの代表周波数を１つの代表符号（図では便宜上、音符として示してある）として符号化した例が示されている。ここでは、代表符号（音符）を収容するために３つのトラックＴ１，Ｔ２，Ｔ３が設けられているが、これは個々の単位区間ごとに選出された３つずつの代表符号を、それぞれ異なるトラックに収容するためである。
【００１６】
例えば、単位区間ｄ１について選出された代表符号ｎ（ｄ１，１），ｎ（ｄ１，２），ｎ（ｄ１，３）は、それぞれトラックＴ１，Ｔ２，Ｔ３に収容されている。ここで、各符号ｎ（ｄ１，１），ｎ（ｄ１，２），ｎ（ｄ１，３）は、ＭＩＤＩ符号におけるノートナンバーを示す符号である。ＭＩＤＩ符号におけるノートナンバーは、０〜１２７までの１２８通りの値をとり、それぞれピアノの鍵盤の１つのキーを示すことになる。具体的には、例えば、代表周波数として４４０Ｈｚが選出された場合、この周波数はノートナンバーｎ＝６９（ピアノの鍵盤中央の「ラ音（Ａ３音）」に対応）に相当するので、代表符号としては、ｎ＝６９が選出されることになる。もっとも、図１（ｂ）は、上述の方法によって得られる代表符号を音符の形式で示した概念図であり、実際には、各音符にはそれぞれ強度に関するデータも付加されている。例えば、トラックＴ１には、ノートナンバーｎ（ｄ１，１），ｎ（ｄ２，１）・・・なる音高を示すデータとともに、ｅ（ｄ１，１），ｅ（ｄ２，１）・・・なる強度を示すデータが収容されることになる。この強度を示すデータは、各代表周波数の成分が、元の区間信号にどの程度の度合いで含まれていたかによって決定される。具体的には、各代表周波数をもった周期関数の区間信号に対する相関値に基づいて強度を示すデータが決定されることになる。また、図１（ｂ）に示す概念図では、音符の横方向の位置によって、個々の単位区間の時間軸上での位置が示されているが、実際には、この時間軸上での位置を正確に数値として示すデータが各音符に付加されていることになる。
【００１７】
音響信号を符号化する形式としては、必ずしもＭＩＤＩ形式を採用する必要はないが、この種の符号化形式としてはＭＩＤＩ形式が最も普及しているため、実用上はＭＩＤＩ形式の符号データを用いるのが好ましい。ＭＩＤＩ形式では、「ノートオン」データもしくは「ノートオフ」データが、「デルタタイム」データを介在させながら存在する。「ノートオン」データは、特定のノートナンバーＮとベロシティーＶを指定して特定の音の演奏開始を指示するデータであり、「ノートオフ」データは、特定のノートナンバーＮとベロシティーＶを指定して特定の音の演奏終了を指示するデータである。また、「デルタタイム」データは、所定の時間間隔を示すデータである。ベロシティーＶは、例えば、ピアノの鍵盤などを押し下げる速度（ノートオン時のベロシティー）および鍵盤から指を離す速度（ノートオフ時のベロシティー）を示すパラメータであり、特定の音の演奏開始操作もしくは演奏終了操作の強さを示すことになる。
【００１８】
前述の方法では、第ｉ番目の単位区間ｄｉについて、代表符号としてＪ個のノートナンバーｎ（ｄｉ，１），ｎ（ｄｉ，２），・・・，ｎ（ｄｉ，Ｊ）が得られ、このそれぞれについて強度ｅ（ｄｉ，１），ｅ（ｄｉ，２），・・・，ｅ（ｄｉ，Ｊ）が得られる。そこで、次のような手法により、ＭＩＤＩ形式の符号データを作成することができる。まず、「ノートオン」データもしくは「ノートオフ」データの中で記述するノートナンバーＮとしては、得られたノートナンバーｎ（ｄｉ，１），ｎ（ｄｉ，２），・・・，ｎ（ｄｉ，Ｊ）をそのまま用いれば良い。一方、「ノートオン」データもしくは「ノートオフ」データの中で記述するベロシティーＶとしては、得られた強度ｅ（ｄｉ，１），ｅ（ｄｉ，２），・・・，ｅ（ｄｉ，Ｊ）を所定の方法で規格化した値を用いれば良い。また、「デルタタイム」データは、各単位区間の長さに応じて設定すれば良い。
【００１９】
（2.周期関数との相関を求める具体的な手法）
上述した基本原理の基づく方法では、区間信号に対して、１つまたは複数の代表周波数が選出され、この代表周波数をもった周期信号によって、当該区間信号が表現されることになる。ここで、選出される代表周波数は、文字どおり、当該単位区間内の信号成分を代表する周波数である。この代表周波数を選出する具体的な方法には、後述するように、短時間フーリエ変換を利用する方法と、一般化調和解析の手法を利用する方法とがある。いずれの方法も、基本的な考え方は同じであり、あらかじめ周波数の異なる複数の周期関数を用意しておき、これら複数の周期関数の中から、当該単位区間内の区間信号に対する相関が高い周期関数を見つけ出し、この相関の高い周期関数の周波数を代表周波数として選出する、という手法を採ることになる。すなわち、代表周波数を選出する際には、あらかじめ用意された複数の周期関数と、単位区間内の区間信号との相関を求める演算を行うことになる。そこで、ここでは、周期関数との相関を求める具体的な方法を述べておく。
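[0020]
Assume that trigonometric functions such as those shown in FIG. 2 are prepared as the plurality of periodic functions. These functions consist of pairs of a sine function and a cosine function of the same frequency, with one pair defined for each of 128 standard frequencies f(0) to f(127). Here, a pair consisting of a sine function and a cosine function of the same frequency is defined as the periodic function for that frequency. A periodic function is defined as a sine/cosine pair because the correlation value between a signal and a periodic function depends on phase. The variables F and k in the trigonometric functions of FIG. 2 correspond to the sampling frequency F and the sample number k of the section signal X. For example, the sine wave for frequency f(0) is expressed as sin(2πf(0)k/F); given an arbitrary sample number k, it yields the amplitude of the periodic function at the same time position as the k-th sample of the section signal.
[0021]
Here, an example is shown in which the 128 standard frequencies f(0) to f(127) are defined by the expression shown in FIG. 3. That is, the n-th (0 ≤ n ≤ 127) standard frequency f(n) is defined by [Formula 1] below.
[0022]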
〔数式１〕 [Formula 1]
f(n) = 440 \times 2^{\gamma(n)}
\gamma(n) = (n - 69) / 12
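[0023]
Defining the standard frequencies in this way is convenient when the signal is ultimately encoded as MIDI data, because the 128 standard frequencies f(0) to f(127) form a geometric progression and coincide with the frequencies of the note numbers used in MIDI data. The 128 standard frequencies f(0) to f(127) shown in FIG. 2 are therefore equally spaced (in MIDI semitone steps) on a logarithmic frequency axis.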
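As a minimal sketch of [Formula 1], the 128 standard frequencies can be tabulated as follows. The function and constant names are illustrative and do not appear in the source.

```python
def standard_frequency(n: int) -> float:
    """Return f(n) in Hz for MIDI note number n, following Formula 1."""
    if not 0 <= n <= 127:
        raise ValueError("note number must be in 0..127")
    gamma = (n - 69) / 12           # gamma(n) = (n - 69) / 12
    return 440.0 * 2.0 ** gamma     # f(n) = 440 * 2^gamma(n)


# Table of the 128 standard frequencies f(0) .. f(127).
STANDARD_FREQUENCIES = [standard_frequency(n) for n in range(128)]
```

Adjacent entries of the table differ by a constant factor of 2^(1/12), one semitone, which is the geometric progression the text describes.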
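[0024]
(2.1. Method using the short-time Fourier transform)
Next, how to obtain the correlation of each periodic function with the section signal of an arbitrary section is described concretely. Suppose, as shown in FIG. 4, that a section signal X is given for a unit section d. The unit section d, of section length L, is sampled at sampling frequency F to give w sample values in total, numbered 0, 1, 2, 3, ..., k, ..., w-2, w-1 as illustrated (the w-th sample, drawn as a white circle, belongs to the head of the adjacent unit section on the right). For any sample number k, an amplitude value X(k) is thus available as digital data. In the short-time Fourier transform, X(k) is normally multiplied sample by sample by a window function W(k) whose weight is close to 1 at the center and close to 0 at both ends. That is, X(k) × W(k) is treated as X(k) in the correlation calculation below; a cosine-shaped Hamming window is generally used as the window function. Although w is written as if it were a constant in the following description, it is in general desirable to vary it with n and set it to the largest integer multiple of F/f(n) that does not exceed the section length L.
[0025]
The principle for obtaining the correlation between such a section signal X and the sine function Rn having the n-th standard frequency f(n) is as follows. The correlation value A(n) is defined by the first expression in FIG. 5. Here X(k) is the amplitude of the section signal X at sample number k, as shown in FIG. 4, and sin(2πf(n)k/F) is the amplitude of the sine function Rn at the same position on the time axis. The first expression can be regarded as the inner product of the amplitude vector of the section signal X and the amplitude vector of the sine function Rn over the dimensions k = 0 to w-1, that is, over all samples in the unit section d.
[0026]
Similarly, the second expression in FIG. 5 gives the correlation value B(n) between the section signal X and the cosine function having the n-th standard frequency f(n). Both expressions are finally multiplied by 2/w to normalize the correlation values; since w is generally varied with n, as noted above, this coefficient also depends on n.
[0027]
The effective correlation value between the section signal X and the standard periodic function of standard frequency f(n) is given, as in the third expression of FIG. 5, by E(n), the square root of the sum of the squares of A(n) and B(n). If the frequency of a standard periodic function with a large effective correlation value is selected as a representative frequency, the section signal X can be encoded using that representative frequency.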
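The correlation calculation described for FIG. 5 might be sketched as follows. The expressions of FIG. 5 are not reproduced in the text, so this code follows the prose only: a sum of products scaled by 2/w for the sine and cosine functions, then the root of the sum of squares. The function name, the choice of w in samples, and the fallback when less than one cycle fits are assumptions, and windowing is omitted for brevity.

```python
import math


def correlate(x, f, F):
    """Return (A, B, E) for section signal x, test frequency f, sampling rate F."""
    period = F / f                                # samples per cycle of f
    cycles = math.floor(len(x) / period)          # whole cycles that fit in the section
    # w: largest whole-cycle length not exceeding the section; fall back to
    # the full section when not even one cycle fits (an assumption).
    w = math.floor(cycles * period + 1e-9) if cycles >= 1 else len(x)
    a = 2.0 / w * sum(x[k] * math.sin(2 * math.pi * f * k / F) for k in range(w))
    b = 2.0 / w * sum(x[k] * math.cos(2 * math.pi * f * k / F) for k in range(w))
    return a, b, math.hypot(a, b)                 # E = sqrt(A^2 + B^2)
```

Because both A and B are computed, E does not change when the phase of the input shifts, which is the reason the text gives for pairing a sine function with a cosine function.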
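[0028]
In other words, one or more standard frequencies for which the correlation value E(n) is at or above a predetermined criterion are selected as representative frequencies. The criterion may be absolute, for example selecting every standard frequency f(n) whose E(n) exceeds a threshold, or relative, for example selecting the Q frequencies with the largest E(n).
[0029]
(2.2. Method using generalized harmonic analysis)
This section describes generalized harmonic analysis, a technique useful for encoding acoustic signals according to the present invention. As already explained, encoding an acoustic signal involves selecting several representative frequencies with high correlation values for the section signal in each unit section. Generalized harmonic analysis allows representative frequencies to be selected with higher accuracy; its basic principle is as follows.
[0030]
Suppose a signal S(j) exists for a unit section d as shown in FIG. 6A. Here j is a parameter for the iteration described below (j = 1 to J). First, correlation values are obtained between S(j) and all 128 periodic functions shown in FIG. 2. The frequency of the one periodic function giving the maximum correlation value is selected as a representative frequency, and the periodic function having that frequency is extracted as an element function. Next, a contained signal G(j) is defined as shown in FIG. 6B. G(j) is obtained by multiplying the extracted element function by, as its amplitude, the correlation value of that element function with S(j). For example, when sine/cosine pairs as in FIG. 2 are used and frequency f(n) is selected as the representative frequency, G(j) is the sum of the sine function A(n)sin(2πf(n)k/F) with amplitude A(n) and the cosine function B(n)cos(2πf(n)k/F) with amplitude B(n) (FIG. 6B shows only one of the two for convenience). Because A(n) and B(n) are the normalized correlation values obtained from the expressions in FIG. 5, G(j) is the signal component of frequency f(n) contained in S(j).
[0031]
Once G(j) has been obtained, it is subtracted from S(j) to give the difference signal S(j+1), shown in FIG. 6C. S(j+1) consists of what remains of S(j) after the component of frequency f(n) has been removed. Incrementing j by 1, the difference signal S(j+1) is then treated as the new S(j), and repeating the same processing J times while stepping j from 1 to J selects J representative frequencies.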
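The iteration of generalized harmonic analysis can be illustrated with the sketch below, which repeatedly picks the best-correlated frequency, forms the contained signal G(j), and subtracts it from the residual. The names are hypothetical. For simplicity the correlation uses the whole section as w rather than a frequency-dependent w, and no window is applied; both are simplifying assumptions.

```python
import math


def correlate(s, f, F):
    """Normalized sine/cosine correlations of s with frequency f over the whole section."""
    w = len(s)
    a = 2.0 / w * sum(s[k] * math.sin(2 * math.pi * f * k / F) for k in range(w))
    b = 2.0 / w * sum(s[k] * math.cos(2 * math.pi * f * k / F) for k in range(w))
    return a, b


def generalized_harmonic_analysis(x, freqs, F, J):
    """Greedily extract J components (frequency, A, B, E) from section signal x."""
    s = list(x)                                   # S(1) is the section signal itself
    components = []
    for _ in range(J):
        # Correlate the current residual with every candidate frequency.
        scored = [(f,) + correlate(s, f, F) for f in freqs]
        f, a, b = max(scored, key=lambda t: math.hypot(t[1], t[2]))
        components.append((f, a, b, math.hypot(a, b)))
        # Contained signal G(j) = A sin(...) + B cos(...); S(j+1) = S(j) - G(j).
        for k in range(len(s)):
            phase = 2 * math.pi * f * k / F
            s[k] -= a * math.sin(phase) + b * math.cos(phase)
    return components, s
```

The function returns the residual alongside the components, so a caller can check how much of the section signal the J extracted components account for.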
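[0032]
The J contained signals G(1) to G(J) output by this correlation calculation are the constituents of the original section signal X; to encode X, information indicating the frequency and the amplitude (intensity) of these J contained signals is used as code data. Although J has been described as the number of representative frequencies, it may equal the number of standard frequencies f(n), that is, J = 128, which is the usual choice when the aim is to obtain a frequency spectrum.
[0033]
Once a predetermined number of frequencies has been selected for each unit section, the section signal X in that unit section can be encoded by creating that number of code data items, each containing four pieces of information: "pitch information" corresponding to each selected frequency, "loudness information" corresponding to the signal intensity at that frequency, "sounding start time" corresponding to the start point of the unit section, and "sounding end time" corresponding to the start point of the following unit section. When MIDI data is created as the code data, the note number is used as the pitch information, the velocity as the loudness information, the note-on time as the sounding start time, and the note-off time as the sounding end time.
[0034]
(3. Acoustic signal encoding method according to the present invention)
The basic principle described so far, which the present invention shares with the prior art, can be summarized as follows: unit sections are set on the original acoustic signal; signal intensities for a plurality of frequencies are calculated for each unit section; one or more representative frequencies are selected, using the prepared periodic functions, on the basis of those intensities; and the acoustic signal is encoded by creating code data consisting of pitch information corresponding to each selected representative frequency, loudness information corresponding to its intensity, a sounding start time corresponding to the start point of the unit section, and a sounding end time corresponding to the end point of the unit section.
[0035]
In the acoustic signal encoding method of the present invention, the correlation calculation in the above basic principle is performed for all of the prepared periodic functions to obtain the intensity for each frequency. Data consisting of each frequency, its intensity, the section start time corresponding to the start point of the unit section, and the section end time corresponding to the end point of the unit section is defined as "phoneme data", and the final encoded data is obtained by further processing this phoneme data.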
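[0036]
The flow of the method is now described using the flowchart in FIG. 7. First, unit sections are set over the entire time axis of the acoustic signal (step S1), as explained in the basic principle with reference to FIG. 1A.
[0037]
Next, frequency analysis is performed on the acoustic signal of each unit section, that is, on each section signal, to calculate the intensity value for each frequency, and unit phoneme data consisting of four items (frequency, intensity value, and the start point and end point of the unit section) is calculated (step S2). Specifically, the correlation intensity of the section signal with each of the 128 periodic functions shown in FIG. 2 is obtained, and the four items consisting of the frequency of the periodic function, the obtained correlation intensity, and the start and end points of the unit section are defined as "unit phoneme data". Unit phoneme data is phoneme data whose length is one unit section. Unlike the basic principle, this embodiment does not select representative frequencies; it obtains unit phoneme data for every prepared periodic function. Performing step S2 on all unit sections yields a group of M × N unit phoneme data items, where N is the total number of periodic functions (N = 128 in the example above) and M is the total number of unit sections set on the acoustic signal.
[0038]
Next, the harmonic distribution degree of each frequency is calculated for each unit section from the intensity values of the unit phoneme data (step S3). The harmonic distribution degree is a value used to judge whether a unit phoneme data item is a fundamental or a harmonic of another unit phoneme data item. Specifically, the harmonic distribution degree H(n) for note number n is calculated using [Formula 2] below.
[0039]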
〔数式２〕 [Formula 2]
H(n) = \{ V(n+12) + V(n+19) + V(n+24) + V(n+28) + V(n+31) + V(n+34) + V(n+36) - V(n-12) - V(n-19) - V(n-24) - V(n-28) - V(n-31) - V(n-34) - V(n-36) \} \times 100 / V(n)
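[0040]
In [Formula 2], V(n) is the intensity value of note number n. V(n+12), V(n+19), V(n+24), V(n+28), V(n+31), V(n+34), and V(n+36) are the intensity values of the 2nd through 8th harmonics, respectively, of the note with note number n. V(n-12), V(n-19), V(n-24), V(n-28), V(n-31), V(n-34), and V(n-36) are the intensity values of the fundamentals that would result if the note with note number n were assumed to be a 2nd through 8th harmonic, respectively. The harmonic distribution degree H(n) calculated by [Formula 2] is therefore positive when sounds at integer multiples of the note's own frequency predominate, and negative when sounds at integer fractions of its frequency predominate.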
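[0041]
A concrete example of calculating the harmonic distribution degree is described with reference to FIG. 8. FIG. 8 shows 10 of the 128 unit phoneme data items in a certain unit section. In practice, all 128 unit phoneme data items are judged as fundamental or harmonic, but only 10 are shown for convenience. In FIG. 8, the scale note (note number) and the intensity value are the constituents of the unit phoneme data. The scale note and the note number are synonymous with the frequency attribute of the unit phoneme data, the relationship being given by [Formula 1]; for convenience, scale notes and note numbers are used here instead of frequencies. The letters C, D, E, F, G, A, and B denote do, re, mi, fa, sol, la, and ti, respectively, and G2 denotes the G (sol) of the second octave. (The octave starting at middle C of the piano keyboard, note number 60, is defined as the third octave in one convention and as the fourth in another; this embodiment uses the former.)
[0042]
The harmonic distribution degree is calculated as follows. For the note G2 in the first row of FIG. 8, for example, the intensity values of G3 at twice the frequency (note number +12), G4 at four times (+24), B4 at five times (+28), and D5 at six times (+31) are summed. This appears as "120+90+35+40" in the intensity-sum column for G2. Dividing this sum by the intensity of G2, which is 20, and multiplying by 100 for normalization gives a harmonic distribution degree of 1425. This is H(n) as calculated by [Formula 2].
[0043]
For the note C4 in the fifth row of FIG. 8, the intensity value of C5 at twice the frequency (note number +12) is added and the intensity value of C3 at half the frequency (note number -12) is subtracted, giving a harmonic distribution degree of -150. The harmonic distribution degree is calculated in the same way for each unit phoneme data item. Although only 10 items are processed in the example of FIG. 8, in practice the calculation is performed for all unit phoneme data in each unit section (128 items when 128 periodic functions are prepared as in FIG. 2).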
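A minimal sketch of the harmonic distribution degree of [Formula 2] follows. The names are illustrative. Two points the source does not specify are handled by assumption and marked in the comments: notes outside the 0 to 127 range contribute zero, and the value is left undefined when V(n) is zero.

```python
HARMONIC_OFFSETS = (12, 19, 24, 28, 31, 34, 36)   # 2nd .. 8th harmonics in semitones


def harmonic_distribution(V, n):
    """Return H(n) per Formula 2 for intensities V indexed by note number."""
    def v(m):
        # Assumption: notes outside the table contribute nothing.
        return V[m] if 0 <= m < len(V) else 0.0

    if v(n) == 0:
        return None                                # assumption: undefined for a silent note
    upper = sum(v(n + d) for d in HARMONIC_OFFSETS)   # candidate harmonics of n
    lower = sum(v(n - d) for d in HARMONIC_OFFSETS)   # candidate fundamentals of n
    return (upper - lower) * 100 / v(n)
```

With the G2 values quoted for FIG. 8 (G2 = 20, G3 = 120, G4 = 90, B4 = 35, D5 = 40) and every other note taken as zero, the function reproduces the stated result of 1425. Under the octave convention used in this description, G2 is note number 55.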
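[0044]
Next, priority marks are assigned to unit phoneme data on the basis of the calculated harmonic distribution degrees (step S4). Specifically, in each unit section, unit phoneme data with a negative harmonic distribution degree is excluded from prioritization, and priority marks are assigned to a predetermined number of the items with a positive harmonic distribution degree according to a predetermined criterion. Consider assigning priority marks to three items in the example of FIG. 8. First, the harmonic distribution degrees of C4, E4, G4, B4, C5, and D5 are all negative, so these are excluded. Among the remaining items, those to be prioritized are chosen by the predetermined criterion; in this embodiment, the items whose harmonic distribution degree is closest to 100 are selected. In FIG. 8, G3, E3, and C3 are closest to 100, in that order, so the corresponding three unit phoneme data items receive priority marks. Items with a negative harmonic distribution degree are excluded because a negative value indicates that components at integer fractions of the item's frequency are strong, that is, the item is likely to be a harmonic of another fundamental. Conversely, a positive value indicates that such components are weak and the item is unlikely to be a harmonic of another fundamental, so it remains a candidate for a priority mark. Values close to 100 are preferred because the closer the value is to 100, the better balanced the note and its harmonics are. An excessively large value means that the note itself is weak relative to its harmonics, in which case the note is likely to be noise and another fundamental is likely to exist at the position of one of its harmonics (usually an octave above, that is, at the 2nd harmonic).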
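The selection rule of step S4 could be sketched as below. The function name and the dictionary representation are hypothetical. A degree of exactly zero is excluded here, which is an assumption, since the text only distinguishes negative from positive values.

```python
def assign_priority(H, count, target=100.0):
    """Return the note numbers to mark as priority in one unit section.

    H maps note number -> harmonic distribution degree (None if undefined).
    """
    # Items with a negative (or undefined) degree are excluded as likely harmonics.
    candidates = [n for n, h in H.items() if h is not None and h > 0]
    # Among the rest, prefer degrees closest to the target value of 100.
    candidates.sort(key=lambda n: abs(H[n] - target))
    return candidates[:count]
```

For example, with invented degrees `{55: 1425.0, 60: 180.0, 64: 150.0, 67: 110.0, 72: -150.0}`, asking for three items selects notes 67, 64, and 60 in that order. Only the values 1425 and -150 come from the text; the others are made up for illustration.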
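[0045]
In the example of FIG. 8, priority marks were assigned to three unit phoneme data items for convenience, but in practice about 16 to 32 items are marked, matching the number of notes that can sound simultaneously under the MIDI standard. Assigning a priority mark means, in practice, storing the harmonic distribution degree in the unit phoneme data or recording a flag indicating priority. Unit phoneme data with a priority mark is hereinafter called priority phoneme data.
[0046]
Once the unit phoneme data group including priority phoneme data has been obtained, unit phoneme data whose intensity value is at or below a predetermined criterion is deleted from the group (step S5). This removes phonemes whose signal level is almost 0 and in which no sound is judged to be actually present, so the criterion is set to a value regarded as the level at which no sound actually exists. At this point the number of unit phoneme data items falls below M × N.
[0047]
After the unit phoneme data with intensity values at or below the criterion has been deleted, unit phoneme data items in the remaining group that have the same frequency and are consecutive in time are connected into a single connected phoneme data item (step S6). FIG. 9 is a conceptual diagram of this connection. FIG. 9A shows the unit phoneme data group before connection. Each rectangle of the grid in FIG. 9A represents a unit phoneme data item; the shaded rectangles are items deleted in step S5 because their intensity values were at or below the criterion, and the other rectangles are items that were not deleted. In step S6, unit phoneme data items with the same frequency (same note number) that are consecutive in the direction of time t are connected. Specifically, performing the connection on the group in FIG. 9A yields a phoneme data group consisting of connected phoneme data items and unit phoneme data items, as shown in FIG. 9B. For example, the unit phoneme data items A1, A2, and A3 in FIG. 9A are connected to give the connected phoneme data item A in FIG. 9B. At least one of the constituent items A1, A2, and A3 must be priority phoneme data; if none is, the items are not connected but deleted at this stage and are not passed on to step S7. When connection takes place, the new connected phoneme data item A is given the frequency common to A1, A2, and A3, the largest of their intensity values, the section start time t1 of the first item A1 as its start time, and the section end time t4 of the last item A3 as its end time. The average of the intensity values of A1, A2, and A3 may be used instead as the intensity value of A. In the final encoding, the priority mark (in practice, the harmonic distribution degree or the like) is not encoded, and each item consists only of frequency (note number), intensity value, start time, and end time; integrating three unit phoneme data items into one connected item therefore cuts the amount of data to one third. In the final MIDI encoding, this means the sound is expressed as one long note rather than three short ones. When a unit phoneme data item has no neighbor of the same frequency that is consecutive in time, as with the priority phoneme data item B in FIG. 9A, and the item is priority phoneme data, it remains unconnected as shown in FIG. 9B. In the subsequent processing, connected phoneme data and unconnected priority phoneme data of unit section length are both treated simply as "phoneme data".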
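The connection of step S6 might look like the following sketch. The dictionary layout and the `drop_unmarked` switch are hypothetical. The source states that a run with no priority member is deleted at this stage, yet step S7 later refers to non-priority phoneme data, so the switch is provided rather than resolving that ambiguity. Treating an isolated non-priority item the same way as a run of length one is also an assumption.

```python
from collections import defaultdict


def connect(units, drop_unmarked=True):
    """Connect unit phoneme data of equal note number in consecutive sections.

    Each unit is a dict with keys: note, intensity, start, end, priority.
    """
    by_note = defaultdict(list)
    for u in units:
        by_note[u["note"]].append(u)

    result = []
    for note, items in by_note.items():
        items.sort(key=lambda u: u["start"])
        run = [items[0]]
        for u in items[1:] + [None]:               # None flushes the last run
            if u is not None and u["start"] == run[-1]["end"]:
                run.append(u)
                continue
            marked = any(r["priority"] for r in run)
            if marked or not drop_unmarked:
                result.append({
                    "note": note,
                    "intensity": max(r["intensity"] for r in run),
                    "start": run[0]["start"],
                    "end": run[-1]["end"],
                    "priority": marked,
                })
            if u is not None:
                run = [u]
    return sorted(result, key=lambda p: (p["start"], p["note"]))
```

The maximum intensity is used for the connected item, as in the main description; replacing `max` with a mean would give the averaging variant the text also permits.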
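[0048]
Next, of the phoneme data remaining over all sections, items other than priority phoneme data whose section length exceeds a predetermined value are deleted, and encoding is performed (step S7). The section length is the length from the start time to the end time of a phoneme data item. The predetermined value is set to 100 to 200 msec (milliseconds) and serves to separate piano from vocals. In general, a single sound component of a keyboard instrument such as a piano lasts a long time, whereas a single sound component of a human voice such as vocals is short. With a piano, the sound continues while the key is held down and its amplitude decays exponentially with time, but its frequency hardly changes, so unit phoneme data is easily connected and the duration of the sound becomes long. With vocals, by contrast, the frequency changes rapidly and is unstable in consonants; in vowels the sound is sustained and the amplitude decays less than with a piano, but the frequency fluctuates slightly, so phonemes are not easily connected and the duration becomes short. Exploiting these characteristics, deleting phoneme data whose section length exceeds the predetermined value removes only the harmonics of instrument sounds. Items shorter than the predetermined value are not deleted even if they are not priority phoneme data, so the harmonic components of vocals are retained. After the phoneme data has been deleted, encoding into MIDI format is performed.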
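The deletion rule of step S7 can be sketched as a simple filter. The function name and record layout are hypothetical, times are assumed to be in seconds, and the default of 0.15 s is merely one value inside the 100 to 200 msec range given in the text.

```python
def remove_long_unmarked(phonemes, max_length=0.15):
    """Keep priority phoneme data and any phoneme data no longer than max_length seconds."""
    kept = []
    for p in phonemes:
        length = p["end"] - p["start"]             # section length of the phoneme data
        if p["priority"] or length <= max_length:
            kept.append(p)
    return kept
```

Only items that are both non-priority and longer than the threshold are dropped, so short non-priority items, which the text associates with vocal harmonics, pass through to encoding.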
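[0049]
A preferred embodiment of the present invention has been described above; the encoding method is naturally executed by a computer or the like. Specifically, a program for executing the steps shown in the flowchart of FIG. 7 in the above order is installed on a computer. The acoustic signal is digitized by PCM or a similar method and loaded into the computer, steps S1 to S7 are performed, and code data in MIDI or a similar format is output from the computer. In the case of MIDI data, for example, the output code data is reproduced as sound using a MIDI sequencer and a MIDI sound source.
[0050]
[Effects of the invention]
As described above, according to the present invention, a plurality of unit sections is set on the time axis of an acoustic signal; the intensity corresponding to each periodic function is calculated by obtaining the correlation between the acoustic signal in each unit section and a plurality of periodic functions; unit phoneme data consisting of the frequency of each periodic function, the corresponding intensity, the section start time corresponding to the start point of the unit section, and the section end time corresponding to the end point of the unit section is calculated; the harmonic distribution degree of each unit phoneme data item is calculated, for each unit section, from the intensities of pairs of unit phoneme data items whose frequencies are integer multiples of one another; unit phoneme data items that are in consecutive sections and similar in frequency and intensity are connected into connected phoneme data, whose attributes are the frequency of one of the constituent items, the maximum intensity of the constituent items, the section start time of the first item as the start time, the section end time of the last item as the end time, and the harmonic distribution degree of one of the constituent items; and only phoneme data satisfying a predetermined condition is extracted after connection to create code data. Harmonics can therefore be removed with high accuracy. In addition, by removing components judged to be harmonics only from connected phoneme data with a long section length, harmonics can be removed only from sounds other than the human voice, such as instrument sounds.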
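[Brief description of the drawings]
[FIG. 1] A diagram showing the basic principle of the acoustic signal encoding method of the present invention.
[FIG. 2] A diagram showing an example of the periodic functions used in the present invention.
[FIG. 3] A diagram showing the relational expression between the frequency of each periodic function shown in FIG. 2 and the MIDI note number n.
[FIG. 4] A diagram showing the method of calculating the correlation between the signal to be analyzed and a periodic signal.
[FIG. 5] A diagram showing the expressions used for the correlation calculation shown in FIG. 4.
[FIG. 6] A diagram showing the basic method of generalized harmonic analysis.
[FIG. 7] A flowchart of the acoustic signal encoding method of the present invention.
[FIG. 8] A diagram showing the relationship among the frequency (shown as scale notes), intensity, and harmonic distribution degree of unit phoneme data.
[FIG. 9] A conceptual diagram explaining the connection of unit phoneme data.
[Explanation of symbols]
A(n), B(n): correlation values
d, d1 to d5: unit sections
E(n): correlation value
G(j): contained signal
n, n1 to n6: note numbers
S(j), S(j+1): difference signals
X, X(k): section signal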
H(n): harmonic distribution degree
[0001]
[Industrial application fields]
The present invention can be applied to support the following kinds of work, known as transcription, in music production. Transcription work includes, for example: quoting existing music as source material or producing cover versions of existing music when no score is available; music analysis, such as the study of melody, harmonic progression, and timbre in hit songs; creating MIDI-format performance data for karaoke; creating BGM data for game machines; creating ringtone data for mobile phones; creating performance data for player pianos and keyboard instruments with a performance guide function; and score publishing and preparation of camera-ready copy.
[0002]
[Prior art]
A time-series signal represented by an acoustic signal includes a plurality of periodic signals as its constituent elements. For this reason, a method for analyzing what kind of periodic signal is included in a given time-series signal has been known for a long time. For example, Fourier analysis is widely used as a method for analyzing frequency components included in a given time series signal.
[0003]
By using such a time-series signal analysis method, an acoustic signal can be encoded. With the spread of computers, it has become easy to sample an analog audio signal as the original sound at a predetermined sampling frequency, quantize the signal intensity at each sampling, and capture it as digital data. If a method such as Fourier analysis is applied to the data and the frequency components included in the original sound signal are extracted, the original sound signal can be encoded by a code indicating each frequency component.
[0004]
Meanwhile, the MIDI (Musical Instrument Digital Interface) standard, which grew out of the idea of encoding the sounds of electronic musical instruments, has come into wide use with the spread of personal computers. Code data according to the MIDI standard (hereinafter, MIDI data) is basically data that describes the operations of an instrumental performance, namely which key of the instrument was played and how hard; the MIDI data itself does not contain the actual sound waveform. Reproducing the actual sound therefore requires a separate MIDI sound source that stores instrument waveforms. Nevertheless, the high encoding efficiency of MIDI has attracted attention, and MIDI encoding and decoding techniques are now widely adopted in software for playing instruments, practicing instruments, and composing music on a personal computer.
[0005]
Proposals have therefore been made to analyze a time-series signal such as an acoustic signal by a predetermined method, extract the periodic signals that constitute it, and encode the extracted periodic signals as MIDI data. For example, JP-A-10-247099, JP-A-11-73199, JP-A-11-73200, JP-A-11-95753, JP-A-2000-99009, JP-A-2000-99092, JP-A-2000-99093, JP-A-2000-261322, JP-A-2001-5450, JP-A-2001-148633, and the specification of Japanese Patent Application No. 2001-209163 propose various methods that can analyze the constituent frequencies of an arbitrary time-series signal and create MIDI data from the analysis result.
[0006]
[Problems to be solved by the invention]
The MIDI encoding schemes proposed in the above publications and specification have made it possible to encode efficiently the acoustic signals obtained from performance recordings and the like. In encoding acoustic signals, the handling of harmonic components generated by instrument sounds is a particular problem, and the methods disclosed in the above publications attempt to deal with it. A harmonic is a sound whose frequency is an integer multiple of the frequency of the fundamental, which is the original sound. Harmonics are important for reproducing the original sound accurately, but they are usually not sounds the performer intended to produce, so they are unnecessary when the present invention is applied to transcription, which reconstructs a score from performance data. In terms of MIDI note numbers, harmonic components lie at +12, +19, +24, +28, +31, and so on relative to the fundamental. In particular, in the specification of Japanese Patent Application No. 2001-209163, harmonics are removed by adding components considered to be harmonics to the intensity of the fundamental at a predetermined ratio and subtracting from the intensity of each harmonic at a predetermined ratio to calculate a second intensity for each, and then treating this second intensity as the priority for encoding.
[0007]
However, because the above method merely adds the intensity of sounds considered to be harmonics to the second intensity of the fundamental, its judgment of whether a sound is a harmonic is not adequate. In addition, the method removes harmonics uniformly even from acoustic signals in which several sound sources are mixed, so harmonics are also removed from vocals (singing voices), for which they should not be removed.
[0008]
In view of the above, an object of the present invention is to provide an acoustic signal encoding method that can remove harmonics accurately and can remove them only from sounds other than the human voice.
[0009]
[Means for Solving the Problems]
To solve the above problems, the acoustic signal encoding method of the present invention comprises: a section setting stage of setting a plurality of unit sections on the time axis of an acoustic signal; a unit phoneme data calculation stage of calculating the intensity corresponding to each of a plurality of periodic functions by obtaining the correlation between the acoustic signal in a unit section and the periodic functions, and calculating unit phoneme data consisting of the frequency of each periodic function, the corresponding intensity, the section start time corresponding to the start point of the unit section, and the section end time corresponding to the end point of the unit section; a harmonic distribution degree calculation stage of calculating the harmonic distribution degree of each unit phoneme data item, for each unit section, by summing the intensity values of other unit phoneme data items whose frequency is an integer multiple of that of the item to obtain a high-frequency distribution degree, summing the intensity values of other unit phoneme data items whose frequency is (1/integer) times that of the item to obtain a low-frequency distribution degree, adding the high-frequency and low-frequency distribution degrees with opposite signs, and dividing the result by the intensity value of the item; a phoneme data connection stage of connecting unit phoneme data items that are in consecutive sections, have the same frequency, and have similar intensities into connected phoneme data, whose attributes are the frequency of one of the constituent items, the maximum intensity of the constituent items, the section start time of the first item as the start time, the section end time of the last item as the end time, and the harmonic distribution degree of one of the constituent items; and an encoding stage of extracting, from the phoneme data after connection, only the phoneme data for which at least the harmonic distribution degree satisfies a predetermined condition, and creating MIDI-format code data having delta time information corresponding to the section length of the extracted phoneme data, note number information corresponding to its frequency, and velocity information corresponding to its intensity.
[0010]
According to the present invention, unit phoneme data is obtained by frequency analysis of the acoustic signal in each unit section, and components judged to be harmonics are deleted before encoding. The judgment is based, for each unit phoneme data item, on whether the components at integer multiples of its frequency or the components at integer fractions of its frequency are stronger. Harmonics can therefore be removed with high accuracy.
[0011]
Furthermore, by removing components judged to be harmonics only from connected phoneme data with a long section length, harmonics can be removed only from sounds other than the human voice, such as instrument sounds.
[0012]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
(1. Basic principle of audio signal coding)
First, the basic principle of the audio signal encoding method according to the present invention will be described. Since this basic principle is disclosed in the above-mentioned publications or specifications, only the outline will be briefly described here.
[0013]
As shown in FIG. 1A, assume that an analog acoustic signal is given as a time-series signal. In FIG. 1, the signal is plotted with time t on the horizontal axis and amplitude (intensity) on the vertical axis. First, this analog acoustic signal is captured as digital acoustic data. This is done with the ordinary PCM method: the analog acoustic signal is sampled at a predetermined sampling frequency and its amplitude is converted into digital data with a predetermined number of quantization bits. For convenience of explanation, the waveform of the acoustic data digitized by PCM is represented here by the same waveform as the analog acoustic signal in FIG. 1A.
[0014]
Subsequently, a plurality of unit sections are set on the time axis of the acoustic signal to be analyzed. In the example shown in FIG. 1A, six times t1 to t6 are defined at equal intervals on the time axis t, and five unit intervals d1 to d5 having these times as the start point and the end point are set. In the example of FIG. 1, unit sections having the same section length are set, but the section length may be changed for each unit section. Alternatively, the section setting may be performed such that adjacent unit sections partially overlap on the time axis.
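Setting the unit sections on a sampled signal could be sketched as follows. The function name and the sample-index representation are assumptions, as is the choice to discard a trailing partial section; the optional `hop` parameter models the overlapping sections the text allows. Variable section lengths, also permitted by the text, are not covered.

```python
def unit_sections(num_samples, length, hop=None):
    """Return (start, end) sample indices of unit sections; end is exclusive."""
    hop = length if hop is None else hop           # hop < length gives overlapping sections
    if length <= 0 or hop <= 0:
        raise ValueError("length and hop must be positive")
    return [(s, s + length) for s in range(0, num_samples - length + 1, hop)]
```

For example, `unit_sections(10, 2)` yields five adjacent sections, analogous to d1 to d5 bounded by the six times t1 to t6.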
[0015]
Once the unit sections have been set, representative frequencies are selected for the acoustic signal in each unit section (hereinafter called the section signal). Each section signal normally contains a variety of frequency components; for example, the components with a large share of the intensity may be selected as representative frequencies. The representative frequency is typically the so-called fundamental frequency, but harmonic frequencies such as the formant frequencies of speech and the peak frequency of a noise source may also be treated as representative frequencies. A single representative frequency may be selected, but for some acoustic signals selecting several gives more accurate encoding. FIG. 1B shows an example in which three representative frequencies are selected for each unit section and each is encoded as one representative code (drawn as a musical note for convenience). Three tracks T1, T2, and T3 are provided so that the three representative codes selected for each unit section can be stored on separate tracks.
[0016]
For example, the representative codes n(d1,1), n(d1,2), and n(d1,3) selected for unit section d1 are stored on tracks T1, T2, and T3, respectively. Each of these codes is a note number in the MIDI code. A MIDI note number takes one of 128 values from 0 to 127, each corresponding to one key of a piano keyboard. For example, when 440 Hz is selected as a representative frequency, it corresponds to note number n = 69 (the A, designated A3, near the center of the piano keyboard), so n = 69 is selected as the representative code. FIG. 1B is a conceptual diagram that shows the representative codes obtained by this method as notes; in practice, intensity data is also attached to each note. For example, track T1 holds the data e(d1,1), e(d2,1), ... indicating intensity together with the note numbers n(d1,1), n(d2,1), ... indicating pitch. The intensity data is determined by how strongly each representative-frequency component is contained in the original section signal; specifically, it is determined from the correlation value between the section signal and the periodic function having that representative frequency. Also, although the conceptual diagram of FIG. 1B indicates the position of each unit section on the time axis by the horizontal position of the note, in practice data giving that position as an exact numerical value is attached to each note.
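The correspondence between 440 Hz and note number 69 can be illustrated with a small conversion sketch. It assumes equal temperament with A = 440 Hz at n = 69, consistent with the example above; the function name is illustrative, and clamping out-of-range results to 0 to 127 is an assumption.

```python
import math


def note_number(freq_hz):
    """Return the MIDI note number (0..127) nearest to freq_hz in equal temperament."""
    n = round(69 + 12 * math.log2(freq_hz / 440.0))
    return max(0, min(127, n))                      # clamp to the MIDI range
```

Doubling the frequency raises the result by 12, one octave, so 880 Hz maps to note number 81.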
[0017]
The format used to encode the acoustic signal need not be MIDI, but since MIDI is the most widespread format of this kind, MIDI-format code data is preferable in practice. In the MIDI format, "note-on" data and "note-off" data appear with "delta time" data interposed between them. Note-on data specifies a note number N and a velocity V and instructs that a particular note start sounding; note-off data specifies a note number N and a velocity V and instructs that a particular note stop sounding. Delta time data indicates a predetermined time interval. The velocity V is a parameter indicating, for example, the speed at which a piano key is pressed down (note-on velocity) or the speed at which the finger is released from the key (note-off velocity); it represents the strength of the operation that starts or ends a particular note.
[0018]
In the method described above, J note numbers n(di,1), n(di,2), ..., n(di,J) are obtained as representative codes for the i-th unit section di, and intensities e(di,1), e(di,2), ..., e(di,J) are obtained for each of them. MIDI-format code data can therefore be created as follows. First, the obtained note numbers n(di,1), n(di,2), ..., n(di,J) can be used as they are for the note number N described in the “note-on” or “note-off” data. For the velocity V described in the “note-on” or “note-off” data, values obtained by normalizing the obtained intensities e(di,1), e(di,2), ..., e(di,J) by a predetermined method may be used. The “delta time” data may be set according to the length of each unit section.
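The mapping just described can be sketched as follows. This is only an illustration under stated assumptions: the function names `to_velocity` and `section_events` are hypothetical, the linear scaling to the range 1 to 127 is one possible "predetermined method" of normalization (the text does not specify one), and events are represented as plain tuples rather than bytes of an actual MIDI file.

```python
def to_velocity(intensity, max_intensity):
    """Scale an intensity to a MIDI velocity in 1..127 (assumed linear rule)."""
    if max_intensity <= 0:
        return 1
    v = round(127 * intensity / max_intensity)
    return max(1, min(127, v))


def section_events(notes, intensities, delta_ticks, max_intensity):
    """Build (delta_time, message, note, velocity) tuples for one unit section.

    All notes start together at the head of the section and end together
    after delta_ticks, the assumed tick length of one unit section.
    """
    events = []
    for n, e in zip(notes, intensities):
        events.append((0, "note_on", n, to_velocity(e, max_intensity)))
    for j, (n, e) in enumerate(zip(notes, intensities)):
        # Only the first note-off carries the section length as delta time.
        delta = delta_ticks if j == 0 else 0
        events.append((delta, "note_off", n, to_velocity(e, max_intensity)))
    return events


events = section_events([69, 72], [2.0, 0.5], 480, 2.0)
```

Here the two representative codes 69 and 72 of one section become two note-on events followed, 480 ticks later, by two note-off events.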
[0019]
(2. Specific method for obtaining correlation with periodic function)
In the method based on the basic principle described above, one or more representative frequencies are selected for the section signal, and the section signal is represented by periodic signals having these representative frequencies. The representative frequency selected here is, literally, a frequency that represents the signal components in the unit section. Specific methods for selecting the representative frequency include a method using the short-time Fourier transform and a method using generalized harmonic analysis, both described later. The two methods share the same basic idea: a plurality of periodic functions with different frequencies are prepared in advance, the periodic functions that correlate highly with the section signal in the unit section are found among them, and the frequencies of these highly correlated periodic functions are selected as representative frequencies. In other words, selecting a representative frequency involves computing the correlation between each of the periodic functions prepared in advance and the section signal in the unit section. A specific method for obtaining the correlation with a periodic function is therefore described here.
[0020]
Assume that trigonometric functions as shown in FIG. 2 are prepared as the plurality of periodic functions. These trigonometric functions consist of pairs of a sine function and a cosine function having the same frequency, with one pair defined for each of 128 standard frequencies f(0) to f(127). Here, the pair of functions consisting of the sine function and the cosine function having the same frequency is defined as the periodic function for that frequency; that is, the periodic function for a specific frequency is made up of a sine function and a cosine function. The periodic function is defined as such a pair because the correlation value of a periodic function with respect to a signal is affected by phase. The variables F and k in each trigonometric function shown in FIG. 2 correspond to the sampling frequency F and the sample number k of the section signal X. For example, the sine wave for frequency f(0) is written sin(2πf(0)k/F); given an arbitrary sample number k, it yields the amplitude of the periodic function at the same time position as the k-th sample of the section signal.
[0021]
Here, an example is described in which the 128 standard frequencies f(0) to f(127) are defined by the expression shown in FIG. 3. That is, the n-th (0 ≦ n ≦ 127) standard frequency f(n) is defined by [Formula 1] below.
[0022]
[Formula 1]
f(n) = 440 × 2^{γ(n)}
γ(n) = (n − 69) / 12
[0023]
Defining the standard frequencies by such an expression is convenient when the final encoding uses MIDI data. This is because the 128 standard frequencies f(0) to f(127) set by this definition form a geometric series and coincide with the frequencies corresponding to the note numbers used in MIDI data. The 128 standard frequencies f(0) to f(127) shown in FIG. 2 are therefore spaced at equal intervals (one semitone each in MIDI) on a logarithmic frequency axis.
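As a brief illustration of [Formula 1], the table of 128 standard frequencies can be generated as below. The names `standard_frequency` and `STANDARD_FREQUENCIES` are hypothetical and are introduced only for this sketch.

```python
def standard_frequency(n):
    """Standard frequency f(n) in Hz for MIDI note number n, per [Formula 1]."""
    return 440.0 * 2.0 ** ((n - 69) / 12)


# One frequency per MIDI note number 0..127.
STANDARD_FREQUENCIES = [standard_frequency(n) for n in range(128)]

# Adjacent entries differ by a constant ratio (one semitone).
SEMITONE_RATIO = STANDARD_FREQUENCIES[70] / STANDARD_FREQUENCIES[69]
```

Because consecutive entries differ by the constant ratio 2^(1/12), the frequencies are equally spaced on a logarithmic axis, and an offset of 12 note numbers doubles the frequency.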
[0024]
(2.1. Method using short-time Fourier transform)
Next, how the correlation of each periodic function with the section signal in an arbitrary section is obtained is described concretely. Suppose, as shown in FIG. 4, that a section signal X is given for a certain unit section d. Assume that the unit section d, of section length L, is sampled at sampling frequency F to obtain w sample values in total, and that the samples are numbered 0, 1, 2, 3, ..., k, ..., w−2, w−1 (the w-th sample, drawn as a white circle, is taken to be the sample at the head of the adjacent unit section on the right). For an arbitrary sample number k, the amplitude value X(k) is then given as digital data. In the short-time Fourier transform, each sample X(k) is usually multiplied by a window function W(k) whose weight is close to 1 at the center and close to 0 at both ends; that is, X(k) × W(k) is treated as X(k) in the correlation calculation below. A Hamming window, which has a cosine-wave shape, is generally used as the window function. Although w is treated as a constant in the following description, in general it is desirable to vary it according to the value of n, setting it to the largest integer multiple of F/f(n) that does not exceed the section length L.
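The choice of w and the window function can be sketched as follows. This is a minimal sketch under assumptions: the section length is taken to be expressed in samples, the result is truncated to a whole number of samples, and the standard Hamming coefficients 0.54 and 0.46 are used; the function names are hypothetical.

```python
import math


def window_length(section_samples, sampling_rate, freq):
    """Largest whole number of cycles of freq that fits in the section, in samples."""
    period = sampling_rate / freq            # F / f(n): samples per cycle
    cycles = math.floor(section_samples / period)
    return int(cycles * period)              # 0 if not even one cycle fits


def hamming(w):
    """Hamming window W(k) for k = 0..w-1 (standard 0.54/0.46 form)."""
    if w == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (w - 1)) for k in range(w)]


w = window_length(1024, 44000, 440.0)   # 100 samples per cycle, 10 cycles fit
window = hamming(5)
```

Note that the Hamming window falls to 0.08 rather than exactly 0 at its ends, which matches the description of weights "close to 0" at both ends.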
[0025]
The principle for obtaining the correlation value between such a section signal X and the sine function Rn having the n-th standard frequency f(n) is as follows. The correlation value A(n) between the two can be defined by the first arithmetic expression of FIG. 5. Here, X(k) is the amplitude value of sample number k in the section signal X, as shown in FIG. 4, and sin(2πf(n)k/F) is the amplitude value of the sine function Rn at the same position on the time axis. This first expression can be regarded as the inner product of the amplitude vector of the section signal X and the amplitude vector of the sine function Rn over the dimensions k = 0 to w−1 of all sample numbers in the unit section d.
[0026]
Similarly, the second arithmetic expression of FIG. 5 gives the correlation value between the section signal X and the cosine function having the n-th standard frequency f(n); this correlation value is denoted B(n). The final multiplication by 2/w in both the first expression for A(n) and the second expression for B(n) normalizes the correlation values. As noted above, w is generally varied according to n, so this coefficient also depends on n.
[0027]
The effective correlation value between the section signal X and the standard periodic function having the standard frequency f(n) can be expressed, as in the third arithmetic expression of FIG. 5, by the square root E(n) of the sum of the squares of the correlation value A(n) with the sine function and the correlation value B(n) with the cosine function. If the frequencies of the standard periodic functions with large effective correlation values are selected as representative frequencies, the section signal X can be encoded using these representative frequencies.
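The three correlation values can be sketched in code as below. FIG. 5 itself is not reproduced in the text, so the expressions here are reconstructed from the description (inner product with the sine or cosine, multiplication by 2/w, and the root of the sum of squares) and should be read as an assumption; the function name `correlate` is hypothetical, and no window function is applied in this simplified version.

```python
import math


def correlate(x, freq, sampling_rate):
    """Return (A, B, E) for section signal x against one standard frequency."""
    w = len(x)
    a = 0.0
    b = 0.0
    for k, xk in enumerate(x):
        phase = 2 * math.pi * freq * k / sampling_rate
        a += xk * math.sin(phase)    # inner product with the sine function
        b += xk * math.cos(phase)    # inner product with the cosine function
    a *= 2 / w                       # normalization by 2/w
    b *= 2 / w
    return a, b, math.sqrt(a * a + b * b)


# A 440 Hz tone of amplitude 0.5 with an arbitrary phase offset of 0.3 rad,
# sampled so that the section holds exactly four cycles.
F = 44000
x = [0.5 * math.sin(2 * math.pi * 440 * k / F + 0.3) for k in range(400)]
A, B, E = correlate(x, 440.0, F)
```

In this example A and B individually depend on the phase offset, while E recovers the amplitude 0.5 regardless of phase, which is why the sine and cosine functions are used as a pair.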
[0028]
That is, one or more standard frequencies whose correlation value E(n) is at or above a predetermined reference may be selected as representative frequencies. The selection condition “correlation value E(n) is at or above a predetermined reference” may be an absolute condition, for example setting some threshold and selecting as representative frequencies all standard frequencies f(n) whose correlation value E(n) exceeds it, or a relative condition, for example selecting the frequencies ranked up to Q-th in descending order of correlation value E(n).
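The two kinds of selection condition might be written as follows. Both function names are hypothetical, the list `e` stands for the correlation values E(n) indexed by n, and the use of "greater than or equal to" for the threshold comparison is an assumption, since the text uses both "at or above" and "exceeds".

```python
def select_by_threshold(e, threshold):
    """Absolute condition: every n whose E(n) reaches the threshold."""
    return [n for n, value in enumerate(e) if value >= threshold]


def select_top_q(e, q):
    """Relative condition: the q indices with the largest E(n), best first."""
    ranked = sorted(range(len(e)), key=lambda n: e[n], reverse=True)
    return ranked[:q]


e = [0.1, 0.9, 0.4, 0.7]   # illustrative values only
absolute = select_by_threshold(e, 0.3)
relative = select_top_q(e, 2)
```

The absolute condition yields a variable number of representative frequencies per section, whereas the relative condition always yields exactly Q.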
[0029]
(2.2. Method by generalized harmonic analysis)
Here, generalized harmonic analysis, a technique useful for encoding an acoustic signal according to the present invention, is described. As already explained, when an acoustic signal is encoded, several representative frequencies with high correlation values are selected for the section signal in each unit section. Generalized harmonic analysis enables these representative frequencies to be selected with higher accuracy; its basic principle is as follows.
[0030]
Assume that a signal S(j) as shown in FIG. 6A exists for the unit section d. Here, j is a parameter for iterative processing (j = 1 to J), as described later. First, the correlation values of this signal S(j) with all 128 periodic functions shown in FIG. 2 are obtained. The frequency of the one periodic function giving the maximum correlation value is selected as a representative frequency, and the periodic function having that representative frequency is extracted as an element function. Next, the inclusion signal G(j) shown in FIG. 6B is defined. The inclusion signal G(j) is the extracted element function multiplied by its correlation value with respect to the signal S(j). For example, when pairs of a sine function and a cosine function as shown in FIG. 2 are used as the periodic functions and the frequency f(n) is selected as the representative frequency, the inclusion signal G(j) is the sum of the sine function A(n) sin(2πf(n)k/F) with amplitude A(n) and the cosine function B(n) cos(2πf(n)k/F) with amplitude B(n) (FIG. 6B shows only one of the functions for convenience of illustration). Since A(n) and B(n) are the normalized correlation values obtained by the expressions of FIG. 5, the inclusion signal G(j) can be regarded as the signal component of frequency f(n) contained in the signal S(j).
[0031]
Once the inclusion signal G(j) has been obtained, it is subtracted from the signal S(j) to give the difference signal S(j+1). FIG. 6C shows the difference signal S(j+1) obtained in this way. The difference signal S(j+1) consists of the signal components remaining after the component of frequency f(n) has been removed from the original signal S(j). The parameter j is then incremented by 1 so that this difference signal S(j+1) is treated as the new signal S(j); repeating the same processing J times, with j running from 1 to J, selects J representative frequencies.
[0032]
The J inclusion signals G(1) to G(J) output by this correlation calculation are the constituent components of the original section signal X, so when the original section signal X is encoded, information indicating the frequencies of these J inclusion signals and information indicating their amplitudes (intensities) may be used as the code data. Although J has been described as the number of representative frequencies, it may equal the number of standard frequencies f(n), that is, J = 128; this is the usual choice when the aim is to obtain a frequency spectrum.
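The iteration described above can be sketched as follows. This is a simplified illustration, not the procedure of any particular implementation: the function name `generalized_harmonic_analysis` is hypothetical, the correlation uses the same assumed expressions as the short-time Fourier method (inner products normalized by 2/w, no window), and the small frequency list in the example stands in for the 128 standard frequencies.

```python
import math


def generalized_harmonic_analysis(section, freqs, sampling_rate, num_components):
    """Repeatedly extract the best-correlated component and subtract it.

    Returns (picks, residual), where picks is a list of (index, intensity)
    and residual is the difference signal left after the last subtraction.
    """
    residual = list(section)               # S(1)
    w = len(residual)
    picks = []
    for _ in range(num_components):        # j = 1 .. J
        best = None
        for n, f in enumerate(freqs):
            a = b = 0.0
            for k, s in enumerate(residual):
                phase = 2 * math.pi * f * k / sampling_rate
                a += s * math.sin(phase)
                b += s * math.cos(phase)
            a *= 2 / w
            b *= 2 / w
            e = math.hypot(a, b)
            if best is None or e > best[3]:
                best = (n, a, b, e)
        n, a, b, e = best
        for k in range(w):                 # S(j+1) = S(j) - G(j)
            phase = 2 * math.pi * freqs[n] * k / sampling_rate
            residual[k] -= a * math.sin(phase) + b * math.cos(phase)
        picks.append((n, e))
    return picks, residual


F = 44000
signal = [math.sin(2 * math.pi * 440 * k / F)
          + 0.3 * math.cos(2 * math.pi * 880 * k / F) for k in range(400)]
picks, residual = generalized_harmonic_analysis(signal, [220.0, 440.0, 880.0], F, 2)
```

In this example the first pass extracts the 440 Hz component and the second pass extracts the 880 Hz component from what remains, leaving a residual that is essentially zero.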
[0033]
Once a predetermined number of frequencies has been selected for each unit section in this way, a predetermined number of code data items can be created, each consisting of four pieces of information: “information indicating the pitch of the sound” corresponding to the selected frequency, “information indicating the intensity of the sound” corresponding to the signal intensity of that frequency, “information indicating the sounding start time” corresponding to the start point of the unit section, and “information indicating the sounding end time” corresponding to the start point of the unit section following it. The section signal X in the unit section is thereby encoded by the predetermined number of code data items. If MIDI data is created as the code data, the note number may be used as the “information indicating the pitch of the sound”, the velocity as the “information indicating the intensity of the sound”, the note-on time as the “information indicating the sounding start time”, and the note-off time as the “information indicating the sounding end time”.
[0034]
(3. Acoustic signal encoding method according to the present invention)
To summarize the basic principle of the present invention, which it shares with the conventional techniques described so far: unit sections are set in the original acoustic signal; signal intensities corresponding to a plurality of frequencies are calculated for each unit section; one or more representative frequencies are selected, by means of the prepared periodic functions, on the basis of the signal intensities obtained; and the acoustic signal is encoded by creating code data composed of pitch information corresponding to each selected representative frequency, intensity information corresponding to the intensity of that representative frequency, a sounding start time corresponding to the start point of the unit section, and a sounding end time corresponding to the end point of the unit section.
[0035]
In the acoustic signal encoding method of the present invention, within the basic principle described above, the correlation calculation is performed for all of the prepared periodic functions to obtain the intensity corresponding to each frequency. Data consisting of each frequency, the intensity of that frequency, the section start time corresponding to the start point of the unit section, and the section end time corresponding to the end point of the unit section is defined as “phoneme data”, and the final code data is obtained by further processing this phoneme data.
[0036]
The flow of the acoustic signal encoding method of the present invention is now described with reference to the flowchart shown in FIG. 7. First, unit sections are set over the entire time axis of the acoustic signal (step S1). The technique used in step S1 is as described for the basic principle with reference to FIG. 1A.
[0037]
Next, frequency analysis is performed on the acoustic signal in each unit section, that is, on each section signal, to calculate the intensity value corresponding to each frequency, and unit phoneme data consisting of four pieces of information (frequency, intensity value, and the start and end points of the unit section) is calculated (step S2). Specifically, the correlation intensity of the section signal is obtained for each of the 128 periodic functions shown in FIG. 2, and the four pieces of information (the frequency of the periodic function, the correlation intensity obtained, and the start and end points of the unit section) are defined as “unit phoneme data”. Unit phoneme data is phoneme data whose length is one unit section. In the present embodiment, rather than selecting representative frequencies as in the basic principle, unit phoneme data is acquired for every one of the prepared periodic functions. Performing step S2 for all unit sections yields a unit phoneme data group of M × N unit phoneme data items, where N is the total number of periodic functions (N = 128 in the above example) and M is the total number of unit sections set in the acoustic signal.
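Steps S1 and S2 can be sketched together as follows. This is a minimal sketch under several assumptions: unit sections are taken to be non-overlapping and of equal length, times are expressed in seconds, no window function is applied, and the names `UnitPhoneme`, `intensity`, and `unit_phonemes` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class UnitPhoneme:
    note: int          # MIDI note number, standing in for the frequency
    intensity: float   # correlation intensity for that frequency
    start: float       # start of the unit section, in seconds
    end: float         # end of the unit section, in seconds


def intensity(section, freq, rate):
    """Root of the sum of squares of the sine and cosine correlations."""
    w = len(section)
    a = sum(s * math.sin(2 * math.pi * freq * k / rate) for k, s in enumerate(section))
    b = sum(s * math.cos(2 * math.pi * freq * k / rate) for k, s in enumerate(section))
    return math.hypot(2 * a / w, 2 * b / w)


def unit_phonemes(signal, rate, section_len, notes=range(128)):
    """Return M x N unit phoneme data items (M sections, N note numbers)."""
    group = []
    for m in range(len(signal) // section_len):
        section = signal[m * section_len:(m + 1) * section_len]
        start = m * section_len / rate
        end = (m + 1) * section_len / rate
        for n in notes:
            freq = 440.0 * 2.0 ** ((n - 69) / 12)
            group.append(UnitPhoneme(n, intensity(section, freq, rate), start, end))
    return group


signal = [0.5 * math.sin(2 * math.pi * 440 * k / 44000) for k in range(800)]
group = unit_phonemes(signal, 44000, 400, notes=[57, 69])
```

With two sections and two note numbers the example produces 2 × 2 = 4 items; with the full set of 128 periodic functions the same loop would produce M × 128 items.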
[0038]
Next, the harmonic distribution of each frequency is calculated for each unit section from the intensity values of the unit phoneme data (step S3). The harmonic distribution is a value used to judge whether a unit phoneme data item is a fundamental tone or a harmonic of another unit phoneme data item. Specifically, the harmonic distribution H(n) corresponding to note number n is calculated using [Formula 2] below.
[0039]
[Formula 2]
H(n) = {V(n+12) + V(n+19) + V(n+24) + V(n+28) + V(n+31) + V(n+34) + V(n+36) − V(n−12) − V(n−19) − V(n−24) − V(n−28) − V(n−31) − V(n−34) − V(n−36)} × 100 / V(n)
[0040]
In [Formula 2] above, V(n) is the intensity value of note number n. V(n+12), V(n+19), V(n+24), V(n+28), V(n+31), V(n+34), and V(n+36) are the intensity values of the 2nd, 3rd, 4th, 5th, 6th, 7th, and 8th harmonics of note number n, respectively. V(n−12), V(n−19), V(n−24), V(n−28), V(n−31), V(n−34), and V(n−36) are the intensity values of the fundamental tones of which note number n would be the 2nd, 3rd, 4th, 5th, 6th, 7th, and 8th harmonic, respectively. Consequently, the harmonic distribution H(n) calculated by [Formula 2] is positive when many sounds exist at integer multiples of the note's own frequency, and negative when many sounds exist at integer fractions of its own frequency.
[0041]
A specific example of the harmonic distribution calculation is described with reference to FIG. 8. FIG. 8 shows 10 of the 128 unit phoneme data items in a certain unit section. In practice all 128 unit phoneme data items are judged to be fundamental tones or harmonics, but only 10 are shown in FIG. 8 for convenience of explanation. In FIG. 8, the scale (note number) and the intensity value make up the unit phoneme data. The scale and the note number are equivalent to the frequency that is an attribute of the unit phoneme data, the relationship being given by [Formula 1] above; for convenience of explanation, the scale and note number are used here rather than the frequency. In FIG. 8, the letters C, D, E, F, G, A, and B of the scale denote do, re, mi, fa, sol, la, and si, respectively, so that scale G2, for example, denotes sol in the second octave. (The scale beginning at the middle of the piano keyboard, that is, at note number 60, is defined as the third octave by some and as the fourth octave by others; this embodiment uses the former definition.)
[0042]
The harmonic distribution is calculated as follows. For scale G2 in the first row of FIG. 8, for example, the intensity values of scale G3 at 2 times the frequency (note number +12), scale G4 at 4 times (note number +24), scale B4 at 5 times (note number +28), and scale D5 at 6 times (note number +31) are summed. This is shown as “120 + 90 + 35 + 40” in the intensity-summation column for scale G2. Dividing this sum by the intensity 20 of scale G2 and multiplying by 100 for normalization gives the harmonic distribution “1425”. In other words, the harmonic distribution H(n) is calculated using [Formula 2] above.
[0043]
For scale C4 in the fifth row of FIG. 8, the intensity value of scale C5 at 2 times the frequency (note number +12) is added and the intensity value of scale C3 at 1/2 the frequency (note number −12) is subtracted, giving a harmonic distribution of “−150”. The harmonic distribution is calculated in the same way for every unit phoneme data item. Although only 10 unit phoneme data items are processed in the example of FIG. 8 for convenience of explanation, in practice all unit phoneme data in each unit section are processed (128 items if 128 periodic functions as shown in FIG. 2 are prepared).
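[Formula 2] and the G2 example can be checked with the short sketch below. The function name `harmonic_distribution` is hypothetical, missing note numbers are assumed to have intensity 0, and returning `None` when V(n) is 0 is an assumption made here to avoid division by zero. Only the five intensity values quoted in the text for the G2 row are used; with note number 60 as C3, scale G2 is note number 55.

```python
# Note-number offsets of the 2nd to 8th harmonics, as in [Formula 2].
HARMONIC_OFFSETS = (12, 19, 24, 28, 31, 34, 36)


def harmonic_distribution(v, n):
    """H(n) per [Formula 2]; v maps note number -> intensity value."""
    own = v.get(n, 0.0)
    if own == 0:
        return None                      # undefined when V(n) is zero
    upper = sum(v.get(n + d, 0.0) for d in HARMONIC_OFFSETS)
    lower = sum(v.get(n - d, 0.0) for d in HARMONIC_OFFSETS)
    return (upper - lower) * 100 / own


# G2 = 55, G3 = 67, G4 = 79, B4 = 83, D5 = 86 (octave convention C3 = 60).
v = {55: 20.0, 67: 120.0, 79: 90.0, 83: 35.0, 86: 40.0}
h_g2 = harmonic_distribution(v, 55)   # (120 + 90 + 35 + 40) * 100 / 20
```

The result for G2 reproduces the value 1425 given in the text.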
[0044]
Next, priority marks are assigned to unit phoneme data on the basis of the calculated harmonic distribution (step S4). Specifically, in each unit section, unit phoneme data with a negative harmonic distribution is excluded from priority, and priority marks are assigned to a predetermined number of the unit phoneme data with a positive harmonic distribution, chosen according to a predetermined criterion. Consider, for example, assigning priority marks to three items in the example of FIG. 8. First, scales C4, E4, G4, B4, C5, and D5 are excluded from priority because their harmonic distributions are all negative. From the remaining unit phoneme data, the items to be prioritized are chosen according to the predetermined criterion; in the present embodiment, items whose harmonic distribution is close to 100 are selected. In the example of FIG. 8, scales G3, E3, and C3 are closest to 100, in that order, so priority marks are assigned to the unit phoneme data corresponding to these three. Items with a negative harmonic distribution are excluded first because a negative value means that sound components at integer fractions of the item's own frequency are strong, so the item is likely to be a harmonic of another fundamental tone. Conversely, a positive harmonic distribution means that sound components at integer fractions of its own frequency are weak, so the item is unlikely to be a harmonic of another fundamental tone and remains a candidate for a priority mark. A harmonic distribution close to 100 is used as the predetermined criterion because the closer the value is to 100, the better balanced the sound itself and its harmonics are. If the harmonic distribution is too large, the sound itself is weak compared with its harmonics, so it is likely to be noise, and another fundamental tone is likely to be present at the position of one of its harmonics (usually one octave up, at the position of the 2nd harmonic).
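The selection rule of step S4 can be sketched as follows. The function name `assign_priority` is hypothetical, and the treatment of a harmonic distribution of exactly 0 (excluded here) is an assumption, since the text only addresses negative and positive values. In the example, the values 1425 for G2 and −150 for C4 come from the text, while the other values are made up for illustration and are not taken from FIG. 8.

```python
def assign_priority(h_by_note, count, target=100.0):
    """Return up to count note numbers with positive H, closest to target first."""
    candidates = [n for n, h in h_by_note.items() if h is not None and h > 0]
    candidates.sort(key=lambda n: abs(h_by_note[n] - target))
    return candidates[:count]


# Note numbers under the C3 = 60 convention: G2=55, C3=60, E3=64, G3=67, C4=72, E4=76.
h_by_note = {
    55: 1425.0,   # G2, value from the text
    60: 250.0,    # C3, illustrative value
    64: 180.0,    # E3, illustrative value
    67: 120.0,    # G3, illustrative value
    72: -150.0,   # C4, value from the text
    76: -80.0,    # E4, illustrative value
}
priority_notes = assign_priority(h_by_note, 3)
```

G2 is passed over despite its positive value because 1425 is far from 100, which corresponds to the case where the sound itself is weak compared with its harmonics.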
[0045]
In the example of FIG. 8, priority marks are assigned to three unit phoneme data items for convenience of explanation, but in practice they are assigned to about 16 to 32 items, in line with the number of notes that can sound simultaneously under the MIDI standard. In practice, assigning a priority mark means holding the harmonic distribution value in the unit phoneme data or recording a flag indicating priority. Unit phoneme data to which a priority mark has been assigned is hereinafter called priority phoneme data.
[0046]
Once a unit phoneme data group including priority phoneme data has been obtained in this way, unit phoneme data whose intensity value is at or below a predetermined reference is deleted from the group (step S5). Such unit phoneme data is deleted in order to remove phonemes whose signal level is almost zero and for which no sound is judged to actually exist. The predetermined reference is therefore set to a level at which no sound can be regarded as actually present. At this point the number of unit phoneme data items falls below M × N.
[0047]
After the unit phoneme data with intensity values at or below the predetermined reference has been deleted, unit phoneme data items in the remaining group that are consecutive in the time-series direction at the same frequency are connected into a single connected phoneme data item (step S6). FIG. 9 is a conceptual diagram illustrating the connection of unit phoneme data. FIG. 9A shows the unit phoneme data group before connection. In FIG. 9A, each rectangle of the grid represents one unit phoneme data item; shaded rectangles are items that were judged in step S5 to have an intensity value at or below the predetermined reference and were deleted, and the other rectangles are items that were not deleted. In step S6, unit phoneme data consecutive in the time t direction at the same frequency (the same note number) are connected. Specifically, executing the connection process on the unit phoneme data group of FIG. 9A yields a phoneme data group consisting of several connected phoneme data items and several unit phoneme data items, as shown in FIG. 9B. For example, unit phoneme data A1, A2, and A3 in FIG. 9A are connected to give connected phoneme data A in FIG. 9B. For this to happen, at least one of the constituent unit phoneme data A1, A2, and A3 must be priority phoneme data; if none is, the items are deleted at this stage without being connected and are not passed to the next step S7. When the connection is made, the new connected phoneme data A is given the frequency common to unit phoneme data A1, A2, and A3 as its frequency, the maximum of their intensity values as its intensity value, the start time t1 of the first unit phoneme data A1 as its start time, and the end time t4 of the last unit phoneme data A3 as its end time. Alternatively, the average of the intensity values of unit phoneme data A1, A2, and A3 may be given as the intensity value of connected phoneme data A. At the final encoding, the priority mark (in practice the harmonic distribution or the like) is not encoded, and each item consists of only four pieces of information: frequency (note number), intensity value, start time, and end time. Merging three unit phoneme data items into one connected phoneme data item therefore cuts the amount of data to one third. It means that in the final MIDI encoding the sound is expressed as one long note rather than three short notes. When, as with priority phoneme data B in FIG. 9A, a unit phoneme data item has no neighbors consecutive in the time-series direction at the same frequency, it is left unconnected as shown in FIG. 9B, provided that it is priority phoneme data. In the subsequent processing, connected phoneme data and unconnected priority phoneme data of unit length are both treated simply as “phoneme data”.
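Steps S5 and S6 can be sketched together as follows. This is an illustration under assumptions: the names `Phoneme` and `connect` are hypothetical, section boundaries are given as integers so that adjacent sections compare exactly equal, the maximum intensity (not the average) is used for the connected item, and connection requires only the same note number and adjacent sections, as in the description above.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    note: int               # MIDI note number (stands in for the frequency)
    intensity: float
    start: int              # section boundary index (assumed integer time unit)
    end: int
    priority: bool = False  # priority mark from step S4


def connect(units, min_intensity):
    """Delete weak items (S5), then merge adjacent items of the same note (S6)."""
    kept = sorted((u for u in units if u.intensity > min_intensity),
                  key=lambda u: (u.note, u.start))
    runs = []
    for u in kept:
        last = runs[-1] if runs else None
        if last is not None and last.note == u.note and last.end == u.start:
            last.intensity = max(last.intensity, u.intensity)
            last.end = u.end
            last.priority = last.priority or u.priority
        else:
            runs.append(Phoneme(u.note, u.intensity, u.start, u.end, u.priority))
    # Runs containing no priority phoneme data are dropped at this stage.
    return [p for p in runs if p.priority]


units = [
    Phoneme(60, 5.0, 1, 2), Phoneme(60, 9.0, 2, 3, True), Phoneme(60, 7.0, 3, 4),
    Phoneme(64, 0.0, 1, 2, True), Phoneme(64, 4.0, 2, 3),
    Phoneme(67, 3.0, 5, 6, True),
]
result = connect(units, 0.5)
```

In the example the three items at note 60 play the role of A1, A2, and A3 and merge into one item spanning boundaries 1 to 4, while the isolated priority item at note 67 plays the role of B and is kept as it is.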
[0048]
Next, among the phoneme data remaining over all sections, items other than priority phoneme data whose section length exceeds a predetermined value are deleted, and encoding is performed (step S7). The section length is the length from the start time to the end time of a phoneme data item. The predetermined value is set to 100 to 200 msec (milliseconds) and serves to separate piano from vocals. In general, keyboard instruments such as the piano produce long sound components, whereas human voices such as vocals produce short ones. With a piano, the sound continues while the key is held down; its amplitude decays exponentially with time, but its frequency hardly changes, so unit phoneme data is readily connected and the duration of the sound becomes long. With vocals, on the other hand, the frequency changes abruptly with time and does not settle, and in the vowel part, although the sound is sustained and the amplitude decays less than with a piano, the frequency wavers slightly, so the duration of the sound becomes short. By exploiting these characteristics and deleting phoneme data whose section length exceeds the predetermined value, only the harmonics of instrument sounds can be deleted. Phoneme data other than priority phoneme data is not deleted if it is shorter than the predetermined value, so the harmonic components of vocals are not deleted. After the phoneme data has been deleted, encoding is performed in the MIDI format.
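The deletion rule of step S7 can be sketched as a simple filter. The names `PhonemeData` and `drop_long_non_priority` are hypothetical, times are assumed to be in milliseconds, and the default of 150 ms is merely one value inside the 100 to 200 msec range given above. How the priority status of each item is determined at this stage is not spelled out in the text, so the flag is simply assumed to be available.

```python
from typing import NamedTuple


class PhonemeData(NamedTuple):
    note: int
    start_ms: int
    end_ms: int
    priority: bool


def drop_long_non_priority(phonemes, max_length_ms=150):
    """Keep priority items, and non-priority items no longer than the limit."""
    return [p for p in phonemes
            if p.priority or (p.end_ms - p.start_ms) <= max_length_ms]


long_priority = PhonemeData(60, 0, 500, True)     # long, but priority: kept
long_other = PhonemeData(72, 0, 300, False)       # long, non-priority: deleted
short_other = PhonemeData(76, 100, 200, False)    # short, non-priority: kept
remaining = drop_long_non_priority([long_priority, long_other, short_other])
```

The long non-priority item corresponds to a sustained instrument harmonic, whereas the short non-priority item corresponds to a vocal component that is left in place.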
[0049]
A preferred embodiment of the present invention has been described above; the encoding method is, of course, executed by a computer or the like. Specifically, a program that executes the steps shown in the flowchart of FIG. 7 is prepared on the computer. The acoustic signal is digitized by the PCM method or the like and loaded into the computer, steps S1 to S7 are performed, and code data in the MIDI format or the like is output from the computer. In the case of MIDI data, for example, the output code data is reproduced as sound using a MIDI sequencer and a MIDI sound source.
[0050]
[Effect of the invention]
As described above, according to the present invention, a plurality of unit sections are set on the time axis of an acoustic signal; the correlation between the acoustic signal in each unit section and a plurality of periodic functions is obtained to calculate the intensity corresponding to each periodic function; unit phoneme data is calculated, consisting of the frequency of each periodic function, the intensity corresponding to it, the section start time corresponding to the start point of the unit section, and the section end time corresponding to the end point of the unit section; the harmonic distribution of each unit phoneme data item is calculated, for each unit section, from the intensities of the unit phoneme data whose frequencies are in an integer-multiple relationship with it; unit phoneme data items whose sections are consecutive, whose frequencies are the same, and whose intensities are similar are connected into connected phoneme data, whose attributes are given as follows: the frequency is that of any of the constituent unit phoneme data, the intensity is the maximum over the constituent unit phoneme data, the start time is the section start time of the first unit phoneme data, the end time is the section end time of the last unit phoneme data, and the harmonic distribution is that of any of the constituent unit phoneme data; and code data is created by extracting, from the phoneme data after the connection process, only the phoneme data satisfying a predetermined condition. This has the effect that harmonics can be removed with good quality. Moreover, since components judged to be harmonics are removed only when the section length of the connected phoneme data is long, harmonic removal can also be limited to those components alone.
[Brief description of the drawings]
FIG. 1 is a diagram showing a basic principle of an audio signal encoding method according to the present invention.
FIG. 2 is a diagram showing an example of a periodic function used in the present invention.
FIG. 3 is a diagram showing the relational expression between the frequency of each periodic function shown in FIG. 2 and the MIDI note number n.
FIG. 4 is a diagram illustrating a method of calculating a correlation between a signal to be analyzed and a periodic signal.
FIG. 5 is a diagram showing a calculation formula for performing the correlation calculation shown in FIG. 4;
FIG. 6 is a diagram showing a basic method of generalized harmonic analysis.
FIG. 7 is a flowchart of an acoustic signal encoding method according to the present invention.
FIG. 8 is a diagram showing the relationship between frequency (indicated by a scale), intensity, and harmonic distribution of unit phoneme data.
FIG. 9 is a conceptual diagram for explaining connection of unit phoneme data.
[Explanation of symbols]
A(n), B(n) ... correlation values
d, d1 to d5 ... unit sections
E(n) ... correlation value
G(j) ... inclusion signal
n, n1 to n6 ... note numbers
S(j), S(j+1) ... difference signals
X, X(k) ... section signal
H(n) ... harmonic distribution

Claims

A method for encoding an acoustic signal, comprising:
a section setting stage of setting a plurality of unit sections on the time axis of a given acoustic signal;
a unit phoneme data calculation stage of calculating the intensity corresponding to each of a plurality of periodic functions by obtaining the correlation between the acoustic signal in each unit section and the periodic functions, and calculating unit phoneme data each composed of the frequency of a periodic function, the intensity corresponding to that periodic function, a section start time corresponding to the start point of the unit section, and a section end time corresponding to the end point of the unit section;
a harmonic distribution calculation stage of, for each unit section and for each unit phoneme data, adding the intensity values of other unit phoneme data whose frequency is an integer multiple of that of the unit phoneme data to obtain a high-frequency distribution degree, adding the intensity values of other unit phoneme data whose frequency is (1/integer) times that of the unit phoneme data to obtain a low-frequency distribution degree, and dividing the value obtained by adding the high-frequency distribution degree and the low-frequency distribution degree with opposite signs by the intensity value of the unit phoneme data, thereby calculating the harmonic distribution degree of each unit phoneme data;
a phoneme data connection stage of connecting those unit phoneme data whose sections are continuous, whose frequencies are the same, and whose intensities are similar into connected phoneme data, and giving, as attributes of the connected phoneme data, the frequency of any one of the constituent unit phoneme data as the frequency, the maximum value among the constituent unit phoneme data as the intensity, the section start time of the first unit phoneme data as the start time, the section end time of the last unit phoneme data as the end time, and the harmonic distribution degree of any one of the constituent unit phoneme data as the harmonic distribution degree; and
an encoding stage of extracting, from the phoneme data after the connection processing, only phoneme data for which at least the harmonic distribution degree satisfies a predetermined condition, and generating MIDI-format code data having delta time information corresponding to the section length of the extracted phoneme data, note number information corresponding to the frequency, and velocity information corresponding to the intensity.
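The harmonic distribution calculation recited in the claim above can be sketched for a single unit section as follows. This is an illustration under stated assumptions: the function name `harmonic_distribution` and the dictionary layout are hypothetical, integer-multiple frequencies are approximated by the note-number offsets +12, +19, +24, +28, +31 mentioned in the description, only the 2nd to 6th harmonics are considered, and the high-frequency term is taken as positive and the low-frequency term as negative (the claim only says the two are added with different signs).

```python
import math

# Semitone offsets of the 2nd to 6th harmonics: +12, +19, +24, +28, +31.
HARMONIC_OFFSETS = [round(12 * math.log2(k)) for k in range(2, 7)]


def harmonic_distribution(intensity):
    """intensity: dict mapping note number -> intensity for one unit section.
    Returns a dict mapping note number -> harmonic distribution degree."""
    result = {}
    for n, e in intensity.items():
        if e <= 0:
            continue  # avoid dividing by a zero intensity
        # Sum of intensities at integer multiples of the frequency.
        high = sum(intensity.get(n + d, 0.0) for d in HARMONIC_OFFSETS)
        # Sum of intensities at (1/integer) multiples of the frequency.
        low = sum(intensity.get(n - d, 0.0) for d in HARMONIC_OFFSETS)
        # Assumed sign convention: high counts positive, low negative.
        result[n] = (high - low) / e
    return result
```

Under this sign convention, a fundamental with strong components above it gets a positive value, while a component sitting above a strong fundamental gets a negative value.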

The method for encoding an acoustic signal according to claim 1, wherein, in the phoneme data connection stage, a priority mark, which is information indicating priority, is assigned for each unit section to a predetermined number of unit phoneme data on the basis of the harmonic distribution degree, unit phoneme data having an intensity value smaller than a predetermined value are deleted, and the connection processing is then executed.

The method for encoding an acoustic signal according to claim 2, wherein, when the priority marks are assigned in the phoneme data connection stage, unit phoneme data determined to be overtones in view of the sign of the harmonic distribution degree are excluded from assignment, and priority marks are assigned to a predetermined number of the remaining unit phoneme data.

The method for encoding an acoustic signal according to claim 2 or 3, wherein, in the encoding stage, phoneme data after the connection processing whose section length is longer than a predetermined value and to which no priority mark is assigned are excluded from the encoding targets, and the remaining phoneme data are encoded.
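The priority marking and the length-dependent exclusion recited in the dependent claims above can be sketched as follows. The function names, the dictionary keys, and the parameters `count` and `max_unmarked_length` are hypothetical. Two further assumptions are made because the claims do not fix them: a negative harmonic distribution degree is treated as "determined to be an overtone", and candidates are ranked by descending harmonic distribution degree.

```python
def assign_priority_marks(section, count=2):
    """section: list of dicts with a 'harmonic' key, all from one unit section.
    Sets a boolean 'priority' key on each entry and returns the list."""
    for u in section:
        u["priority"] = False
    # Assumption: a negative harmonic distribution degree marks an overtone,
    # which is excluded from priority mark assignment.
    candidates = [u for u in section if u["harmonic"] >= 0]
    # Assumption: a larger harmonic distribution degree ranks first.
    ranked = sorted(candidates, key=lambda u: u["harmonic"], reverse=True)
    for u in ranked[:count]:
        u["priority"] = True
    return section


def select_for_encoding(phonemes, max_unmarked_length):
    """phonemes: list of dicts with 'start', 'end', 'priority' keys describing
    phoneme data after connection. Drops entries that are both longer than
    max_unmarked_length and without a priority mark; keeps the rest."""
    return [
        p for p in phonemes
        if p["priority"] or (p["end"] - p["start"]) <= max_unmarked_length
    ]
```

How a connected phoneme inherits the priority marks of its constituent unit phoneme data is not stated in the claims, so the sketch simply expects a `priority` flag to be present on each connected entry.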

A program for causing a computer to execute:
a section setting step of setting a plurality of unit sections on the time axis of an acoustic signal;
a unit phoneme data calculation step of calculating the intensity corresponding to each of a plurality of periodic functions by obtaining the correlation between the acoustic signal in each unit section and the periodic functions, and calculating unit phoneme data each composed of the frequency of a periodic function, the intensity corresponding to that periodic function, a section start time corresponding to the start point of the unit section, and a section end time corresponding to the end point of the unit section;
a harmonic distribution calculation step of, for each unit section and for each unit phoneme data, adding the intensity values of other unit phoneme data whose frequency is an integer multiple of that of the unit phoneme data to obtain a high-frequency distribution degree, adding the intensity values of other unit phoneme data whose frequency is (1/integer) times that of the unit phoneme data to obtain a low-frequency distribution degree, and dividing the value obtained by adding the high-frequency distribution degree and the low-frequency distribution degree with opposite signs by the intensity value of the unit phoneme data, thereby calculating the harmonic distribution degree of each unit phoneme data;
a phoneme data connection step of connecting those unit phoneme data whose sections are continuous, whose frequencies are the same, and whose intensities are similar into connected phoneme data, and giving, as attributes of the connected phoneme data, the frequency of any one of the constituent unit phoneme data as the frequency, the maximum value among the constituent unit phoneme data as the intensity, the section start time of the first unit phoneme data as the start time, the section end time of the last unit phoneme data as the end time, and the harmonic distribution degree of any one of the constituent unit phoneme data as the harmonic distribution degree; and
an encoding step of extracting, from the phoneme data after the connection processing, only phoneme data for which at least the harmonic distribution degree satisfies a predetermined condition, and generating MIDI-format code data having delta time information corresponding to the section length of the extracted phoneme data, note number information corresponding to the frequency, and velocity information corresponding to the intensity.

The program according to claim 5, wherein, in the phoneme data connection step, a priority mark, which is information indicating priority, is assigned for each unit section to a predetermined number of unit phoneme data on the basis of the harmonic distribution degree, unit phoneme data having an intensity value smaller than a predetermined value are deleted, and the connection processing is then executed.