JP4697919B2

JP4697919B2 - Method for encoding an acoustic signal

Info

Publication number: JP4697919B2
Application number: JP2001198402A
Authority: JP
Inventors: 敏雄茂出木
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2001-06-29
Filing date: 2001-06-29
Publication date: 2011-06-08
Anticipated expiration: 2021-06-29
Also published as: JP2003015690A

Description

【０００１】
【産業上の利用分野】
本発明は、音楽制作における採譜と呼ばれる以下のような業務を支援するのに適用することができる。採譜業務としては、例えば、譜面が入手できない場合の素材としての既存楽曲の引用・既存楽曲のカバー曲制作、ヒット曲のメロディ・和声進行・音色の分析研究等の楽曲分析、カラオケにおけるＭＩＤＩデータ形式の演奏データ作成、ゲーム機のＢＧＭデータの作成、携帯電話の着メロデータ作成、自動ピアノ・演奏ガイド機能付き鍵盤楽器向け演奏データの作成、楽譜出版・版下作成などがある。
【０００２】
【従来の技術】
音響信号に代表される時系列信号には、その構成要素として複数の周期信号が含まれている。このため、与えられた時系列信号にどのような周期信号が含まれているかを解析する手法は、古くから知られている。例えば、フーリエ解析は、与えられた時系列信号に含まれる周波数成分を解析するための方法として広く利用されている。
【０００３】
このような時系列信号の解析方法を利用すれば、音響信号を符号化することも可能である。コンピュータの普及により、原音となるアナログ音響信号を所定のサンプリング周波数でサンプリングし、各サンプリング時の信号強度を量子化してデジタルデータとして取り込むことが容易にできるようになってきており、こうして取り込んだデジタルデータに対してフーリエ解析などの手法を適用し、原音信号に含まれていた周波数成分を抽出すれば、各周波数成分を示す符号によって原音信号の符号化が可能になる。
【０００４】
一方、電子楽器による楽器音を符号化しようという発想から生まれたＭＩＤＩ（Musical Instrument Digital Interface）規格も、パーソナルコンピュータの普及とともに盛んに利用されるようになってきている。このＭＩＤＩ規格による符号データ（以下、ＭＩＤＩデータという）は、基本的には、楽器のどの鍵盤キーを、どの程度の強さで弾いたか、という楽器演奏の操作を記述したデータであり、このＭＩＤＩデータ自身には、実際の音の波形は含まれていない。そのため、実際の音を再生する場合には、楽器音の波形を記憶したＭＩＤＩ音源が別途必要になるが、その符号化効率の高さが注目を集めており、ＭＩＤＩ規格による符号化および復号化の技術は、現在、パーソナルコンピュータを用いて楽器演奏、楽器練習、作曲などを行うソフトウェアに広く採り入れられている。
【０００５】
そこで、音響信号に代表される時系列信号に対して、所定の手法で解析を行うことにより、その構成要素となる周期信号を抽出し、抽出した周期信号をＭＩＤＩデータを用いて符号化しようとする提案がなされている。例えば、特開平１０−２４７０９９号公報、特開平１１−７３１９９号公報、特開平１１−７３２００号公報、特開平１１−９５７５３号公報、特開２０００−９９００９号公報、特開２０００−９９０９２号公報、特開２０００−９９０９３号公報、特開２０００−２６１３２２号公報、特開２００１−５４５０号公報、特開２００１−１４８６３３号公報には、任意の時系列信号について、構成要素となる周波数を解析し、その解析結果からＭＩＤＩデータを作成することができる種々の方法が提案されている。
【０００６】
【発明が解決しようとする課題】
上記各公報または明細書において提案してきたＭＩＤＩ符号化方式により、演奏録音等から得られる音響信号の効率的な符号化が可能になった。従来の符号化方式では、特開平１１−９５７５３号公報において開示されているように、単位区間ごとに周波数解析を行って得られる音素（本明細書では、周波数とその周波数に対応する強度の組を音素と呼ぶことにする）を所定数に選別する手法をとっている。これは、通常のＭＩＤＩ音源では同時発音数が１６〜６４という制約があるため、解析により得られる音素をこれに合わせなければならないからである。そのため、各単位区間ごとに、その強度値を基準にして１６程度に選別を行っている。
【０００７】
しかしながら、このように単位区間ごとに選別を行うと、全体における音素の役割が考慮されていないため、音の立ち上がり、あるいは終了部分などのように、ある単位区間においては強度値が小さいが、重要な音の一部であるようなものでも削除されてしまうことになり、精度の良い符号化を行うことができない。
【０００８】
そこで、本出願人は、特願２００１−８７５０号明細書において、単位区間ごとに強度値の高い音素に１６程度の優先マークを付与しておき、その後、連続する区間の音素を連結して連結音素を得て、この連結音素を基に符号データを作成する手法について提案した。
【０００９】
この手法では、連結前の段階で同時発音数をある程度コントロールすることができ、上述のように、重要な音の一部を構成する音素を削除してしまうようなこともないが、連結後には、連結音素の同時刻における重複数が平均２倍程度に増加するため、指定した個数範囲に同時発音数を制限することができないという問題がある。
【００１０】
上記のような点に鑑み、本発明は、重要な音の一部を欠落させてしまうことなく、かつ、同時刻に重複する音素を同時発音可能な数に収めることが可能な音響信号の符号化方法を提供することを課題とする。
【００１１】
【課題を解決するための手段】
上記課題を解決するため、本発明では、与えられた音響信号に対して、時間軸上に複数の単位区間を設定し、設定された単位区間における音響信号と複数の周期関数との相関を求めることにより、各周期関数に対応した強度値を算出し、各周期関数が有する周波数と、前記各周期関数に対応した強度値と、単位区間の始点に対応する区間開始時刻と、単位区間の終点に対応する区間終了時刻で構成される単位音素データを算出し、この単位音素データの算出処理を全単位区間に対して行うことにより得られる全単位音素データから、強度値が所定値に達していないものを削除して、残りの単位音素データを有効な強度値を有する有効音素データとして抽出し、抽出された有効音素データに対して、周波数が同一であって、区間が連続するものを連結して連結音素データとし、連結音素データの属性として、強度値は構成する有効音素データの最大強度値を与え、開始時刻は先頭の有効音素データの区間開始時刻を与え、終了時刻は最後尾の有効音素データの区間終了時刻を与え、連結処理後の全音素データに対して、時間的に重複する音素データを探索し、前記時間的に重複する音素データ間において、その属性である周波数が他の音素データの周波数の整数倍になる音素データを削除し、削除後の音素データの集合により音響信号を表現するようにしたことを特徴とする。本発明によれば、音響信号に対して単位区間ごとに周波数解析を行なって、単位音素データを算出した後、連結処理を行い、連結処理後の音素データについて、時間的に重複する音素データを調べ、重複する音素データの属性である周波数が他の音素データの周波数の整数倍になる音素データを削除するようにしたので、重要な音の一部を欠落させてしまうことなく、かつ、同時刻に重複する音素を同時発音可能な数以下に収めることが可能となる。
【００１２】
【発明の実施の形態】
以下、本発明の実施形態について図面を参照して詳細に説明する。
【００１３】
（音響信号符号化方法の基本原理）
はじめに、本発明に係る音響信号の符号化方法の基本原理を述べておく。この基本原理は、前掲の各公報あるいは明細書に開示されているので、ここではその概要のみを簡単に述べることにする。
【００１４】
図１（ａ）に示すように、時系列信号としてアナログ音響信号が与えられたものとする。図１の例では、横軸に時間ｔ、縦軸に振幅（強度）をとって、この音響信号を示している。ここでは、まずこのアナログ音響信号を、デジタルの音響データとして取り込む処理を行う。これは、従来の一般的なＰＣＭの手法を用い、所定のサンプリング周波数でこのアナログ音響信号をサンプリングし、振幅を所定の量子化ビット数を用いてデジタルデータに変換する処理を行えば良い。ここでは、説明の便宜上、ＰＣＭの手法でデジタル化した音響データの波形も図１（ａ）のアナログ音響信号と同一の波形で示すことにする。
【００１５】
続いて、この解析対象となる音響信号の時間軸上に、複数の単位区間を設定する。図１（ａ）に示す例では、時間軸ｔ上に等間隔に６つの時刻ｔ１〜ｔ６が定義され、これら各時刻を始点および終点とする５つの単位区間ｄ１〜ｄ５が設定されている。図１の例では、全て同一の区間長をもった単位区間が設定されているが、個々の単位区間ごとに区間長を変えるようにしてもかまわない。あるいは、隣接する単位区間が時間軸上で部分的に重なり合うような区間設定を行ってもかまわない。
【００１６】
こうして単位区間が設定されたら、各単位区間ごとの音響信号（以下、区間信号と呼ぶことにする）について、それぞれ代表周波数を選出する。各区間信号には、通常、様々な周波数成分が含まれているが、例えば、その中で成分の強度割合の大きな周波数成分を代表周波数として選出すれば良い。ここで、代表周波数とはいわゆる基本周波数が一般的であるが、音声のフォルマント周波数などの倍音周波数や、ノイズ音源のピーク周波数も代表周波数として扱うことがある。代表周波数は１つだけ選出しても良いが、音響信号によっては複数の代表周波数を選出した方が、より精度の高い符号化が可能になる。図１（ｂ）には、個々の単位区間ごとにそれぞれ３つの代表周波数を選出し、１つの代表周波数を１つの代表符号（図では便宜上、音符として示してある）として符号化した例が示されている。ここでは、代表符号（音符）を収容するために３つのトラックＴ１，Ｔ２，Ｔ３が設けられているが、これは個々の単位区間ごとに選出された３つずつの代表符号を、それぞれ異なるトラックに収容するためである。
【００１７】
例えば、単位区間ｄ１について選出された代表符号ｎ（ｄ１，１），ｎ（ｄ１，２），ｎ（ｄ１，３）は、それぞれトラックＴ１，Ｔ２，Ｔ３に収容されている。ここで、各符号ｎ（ｄ１，１），ｎ（ｄ１，２），ｎ（ｄ１，３）は、ＭＩＤＩ符号におけるノートナンバーを示す符号である。ＭＩＤＩ符号におけるノートナンバーは、０〜１２７までの１２８通りの値をとり、それぞれピアノの鍵盤の１つのキーを示すことになる。具体的には、例えば、代表周波数として４４０Ｈｚが選出された場合、この周波数はノートナンバーｎ＝６９（ピアノの鍵盤中央の「ラ音（Ａ３音）」に対応）に相当するので、代表符号としては、ｎ＝６９が選出されることになる。もっとも、図１（ｂ）は、上述の方法によって得られる代表符号を音符の形式で示した概念図であり、実際には、各音符にはそれぞれ強度に関するデータも付加されている。例えば、トラックＴ１には、ノートナンバーｎ（ｄ１，１），ｎ（ｄ２，１）・・・なる音高を示すデータとともに、ｅ（ｄ１，１），ｅ（ｄ２，１）・・・なる強度を示すデータが収容されることになる。この強度を示すデータは、各代表周波数の成分が、元の区間信号にどの程度の度合いで含まれていたかによって決定される。具体的には、各代表周波数をもった周期関数の区間信号に対する相関値に基づいて強度を示すデータが決定されることになる。また、図１（ｂ）に示す概念図では、音符の横方向の位置によって、個々の単位区間の時間軸上での位置が示されているが、実際には、この時間軸上での位置を正確に数値として示すデータが各音符に付加されていることになる。
【００１８】
音響信号を符号化する形式としては、必ずしもＭＩＤＩ形式を採用する必要はないが、この種の符号化形式としてはＭＩＤＩ形式が最も普及しているため、実用上はＭＩＤＩ形式の符号データを用いるのが好ましい。ＭＩＤＩ形式では、「ノートオン」データもしくは「ノートオフ」データが、「デルタタイム」データを介在させながら存在する。「ノートオン」データは、特定のノートナンバーＮとベロシティーＶを指定して特定の音の演奏開始を指示するデータであり、「ノートオフ」データは、特定のノートナンバーＮとベロシティーＶを指定して特定の音の演奏終了を指示するデータである。また、「デルタタイム」データは、所定の時間間隔を示すデータである。ベロシティーＶは、例えば、ピアノの鍵盤などを押し下げる速度（ノートオン時のベロシティー）および鍵盤から指を離す速度（ノートオフ時のベロシティー）を示すパラメータであり、特定の音の演奏開始操作もしくは演奏終了操作の強さを示すことになる。
【００１９】
前述の方法では、第ｉ番目の単位区間ｄｉについて、代表符号としてＪ個のノートナンバーｎ（ｄｉ，１），ｎ（ｄｉ，２），・・・，ｎ（ｄｉ，Ｊ）が得られ、このそれぞれについて強度ｅ（ｄｉ，１），ｅ（ｄｉ，２），・・・，ｅ（ｄｉ，Ｊ）が得られる。そこで、次のような手法により、ＭＩＤＩ形式の符号データを作成することができる。まず、「ノートオン」データもしくは「ノートオフ」データの中で記述するノートナンバーＮとしては、得られたノートナンバーｎ（ｄｉ，１），ｎ（ｄｉ，２），・・・，ｎ（ｄｉ，Ｊ）をそのまま用いれば良い。一方、「ノートオン」データもしくは「ノートオフ」データの中で記述するベロシティーＶとしては、得られた強度ｅ（ｄｉ，１），ｅ（ｄｉ，２），・・・，ｅ（ｄｉ，Ｊ）を所定の方法で規格化した値を用いれば良い。また、「デルタタイム」データは、各単位区間の長さに応じて設定すれば良い。
【００２０】
（周期関数との相関を求める具体的な方法）
上述した基本原理に基づく方法では、区間信号に対して、１つまたは複数の代表周波数が選出され、この代表周波数をもった周期信号によって、当該区間信号が表現されることになる。ここで、選出される代表周波数は、文字どおり、当該単位区間内の信号成分を代表する周波数である。この代表周波数を選出する具体的な方法には、後述するように、短時間フーリエ変換を利用する方法と、一般化調和解析の手法を利用する方法とがある。いずれの方法も、基本的な考え方は同じであり、あらかじめ周波数の異なる複数の周期関数を用意しておき、これら複数の周期関数の中から、当該単位区間内の区間信号に対する相関が高い周期関数を見つけ出し、この相関の高い周期関数の周波数を代表周波数として選出する、という手法を採ることになる。すなわち、代表周波数を選出する際には、あらかじめ用意された複数の周期関数と、単位区間内の区間信号との相関を求める演算を行うことになる。そこで、ここでは、周期関数との相関を求める具体的な方法を述べておく。
【００２１】
複数の周期関数として、図２に示すような三角関数が用意されているものとする。これらの三角関数は、同一周波数をもった正弦関数と余弦関数との対から構成されており、１２８通りの標準周波数ｆ（０）〜ｆ（１２７）のそれぞれについて、正弦関数および余弦関数の対が定義されていることになる。ここでは、同一の周波数をもった正弦関数および余弦関数からなる一対の関数を、当該周波数についての周期関数として定義することにする。すなわち、ある特定の周波数についての周期関数は、一対の正弦関数および余弦関数によって構成されることになる。このように、一対の正弦関数と余弦関数とにより周期関数を定義するのは、信号に対する周期関数の相関値を求める際に、相関値が位相の影響を受ける事を考慮するためである。なお、図２に示す各三角関数内の変数Ｆおよびｋは、区間信号Ｘについてのサンプリング周波数Ｆおよびサンプル番号ｋに相当する変数である。例えば、周波数ｆ（０）についての正弦波は、ｓｉｎ（２πｆ（０）ｋ／Ｆ）で示され、任意のサンプル番号ｋを与えると、区間信号を構成する第ｋ番目のサンプルと同一時間位置における周期関数の振幅値が得られる。
【００２２】
ここでは、１２８通りの標準周波数ｆ（０）〜ｆ（１２７）を図３に示すような式で定義した例を示すことにする。すなわち、第ｎ番目（０≦ｎ≦１２７）の標準周波数ｆ（ｎ）は、以下に示す〔数式１〕で定義されることになる。
【００２３】
〔数式１〕
ｆ（ｎ）＝４４０×２^γ（ｎ）
γ（ｎ）＝（ｎ−６９）／１２
【００２４】
このような式によって標準周波数を定義しておくと、最終的にＭＩＤＩデータを用いた符号化を行う際に便利である。なぜなら、このような定義によって設定される１２８通りの標準周波数ｆ（０）〜ｆ（１２７）は、等比級数をなす周波数値をとることになり、ＭＩＤＩデータで利用されるノートナンバーに対応した周波数になるからである。したがって、図２に示す１２８通りの標準周波数ｆ（０）〜ｆ（１２７）は、対数尺度で示した周波数軸上に等間隔（ＭＩＤＩにおける半音単位）に設定した周波数ということになる。
【００２５】
続いて、任意の区間の区間信号に対する各周期関数の相関の求め方について、具体的な説明を行う。例えば、図４に示すように、ある単位区間ｄについて区間信号Ｘが与えられていたとする。ここでは、区間長Ｌをもった単位区間ｄについて、サンプリング周波数Ｆでサンプリングが行なわれており、全部でｗ個のサンプル値が得られているものとし、サンプル番号を図示のように、０，１，２，３，・・・，ｋ，・・・，ｗ−２，ｗ−１とする（白丸で示す第ｗ番目のサンプルは、右に隣接する次の単位区間の先頭に含まれるサンプルとする）。この場合、任意のサンプル番号ｋについては、Ｘ（ｋ）なる振幅値がデジタルデータとして与えられていることになる。短時間フーリエ変換においては、Ｘ（ｋ）に対して各サンプルごとに中央の重みが１に近く、両端の重みが０に近くなるような窓関数Ｗ（ｋ）を乗ずることが通常である。すなわち、Ｘ（ｋ）×Ｗ（ｋ）をＸ（ｋ）と扱って以下のような相関計算を行うもので、窓関数の形状としては余弦波形状のハミング窓が一般に用いられている。ここで、ｗは以下の記述においても定数のような記載をしているが、一般にはｎの値に応じて変化させ、区間長Ｌを超えない範囲で最大となるＦ／ｆ（ｎ）の整数倍の値に設定することが望ましい。
【００２６】
このような区間信号Ｘに対して、第ｎ番目の標準周波数ｆ（ｎ）をもった正弦関数Ｒｎとの相関値を求める原理を示す。両者の相関値Ａ（ｎ）は、図５の第１の演算式によって定義することができる。ここで、Ｘ（ｋ）は、図４に示すように、区間信号Ｘにおけるサンプル番号ｋの振幅値であり、ｓｉｎ（２πｆ（ｎ）ｋ／Ｆ）は、時間軸上での同位置における正弦関数Ｒｎの振幅値である。この第１の演算式は、単位区間ｄ内の全サンプル番号ｋ＝０〜ｗ−１の次元について、それぞれ区間信号Ｘの振幅値と正弦関数Ｒｎの振幅ベクトルの内積を求める式ということができる。
【００２７】
同様に、図５の第２の演算式は、区間信号Ｘと、第ｎ番目の標準周波数ｆ（ｎ）をもった余弦関数との相関値を求める式であり、両者の相関値はＢ（ｎ）で与えられる。なお、相関値Ａ（ｎ）を求めるための第１の演算式も、相関値Ｂ（ｎ）を求めるための第２の演算式も、最終的に２／ｗが乗ぜられているが、これは相関値を規格化するためのものであり、前述のとおりｗはｎに依存して変化させるのが一般的であるため、この係数もｎに依存する変数である。
【００２８】
区間信号Ｘと標準周波数ｆ（ｎ）をもった標準周期関数との相関実効値は、図５の第３の演算式に示すように、正弦関数との相関値Ａ（ｎ）と余弦関数との相関値Ｂ（ｎ）との二乗和平方根値Ｅ（ｎ）によって示すことができる。この相関実効値の大きな標準周期関数の周波数を代表周波数として選出すれば、この代表周波数を用いて区間信号Ｘを符号化することができる。
【００２９】
すなわち、この相関値Ｅ（ｎ）が所定の基準以上の大きさとなる１つまたは複数の標準周波数を代表周波数として選出すれば良い。なお、ここで「相関値Ｅ（ｎ）が所定の基準以上の大きさとなる」という選出条件は、例えば、何らかの閾値を設定しておき、相関値Ｅ（ｎ）がこの閾値を超えるような標準周波数ｆ（ｎ）をすべて代表周波数として選出する、という絶対的な選出条件を設定しても良いが、例えば、相関値Ｅ（ｎ）の大きさの順にＱ番目までを選出する、というような相対的な選出条件を設定しても良い。
【００３０】
（一般化調和解析の手法）
ここでは、本発明に係る音響信号の符号化を行う際に有用な一般化調和解析の手法について説明する。既に説明したように、音響信号を符号化する場合、個々の単位区間内の区間信号について、相関値の高いいくつかの代表周波数を選出することになる。一般化調和解析は、より高い精度で代表周波数の選出を可能にする手法であり、その基本原理は次の通りである。
【００３１】
図６（ａ）に示すような単位区間ｄについて、信号Ｓ（ｊ）なるものが存在するとする。ここで、ｊは後述するように、繰り返し処理のためのパラメータである（ｊ＝１〜Ｊ）。まず、この信号Ｓ（ｊ）に対して、図２に示すような１２８通りの周期関数すべてについての相関値を求める。そして、最大の相関値が得られた１つの周期関数の周波数を代表周波数として選出し、当該代表周波数をもった周期関数を要素関数として抽出する。続いて、図６（ｂ）に示すような含有信号Ｇ（ｊ）を定義する。この含有信号Ｇ（ｊ）は、抽出された要素関数に、その振幅として、当該要素関数の信号Ｓ（ｊ）に対する相関値を乗じることにより得られる信号である。例えば、周期関数として図２に示すように、一対の正弦関数と余弦関数とを用い、周波数ｆ（ｎ）が代表周波数として選出された場合、振幅Ａ（ｎ）をもった正弦関数Ａ（ｎ）ｓｉｎ（２πｆ（ｎ）ｋ／Ｆ）と、振幅Ｂ（ｎ）をもった余弦関数Ｂ（ｎ）ｃｏｓ（２πｆ（ｎ）ｋ／Ｆ）との和からなる信号が含有信号Ｇ（ｊ）ということになる（図６（ｂ）では、図示の便宜上、一方の関数しか示していない）。ここで、Ａ（ｎ），Ｂ（ｎ）は、図５の式で得られる規格化された相関値であるから、結局、含有信号Ｇ（ｊ）は、信号Ｓ（ｊ）内に含まれている周波数ｆ（ｎ）をもった信号成分ということができる。
【００３２】
こうして、含有信号Ｇ（ｊ）が求まったら、信号Ｓ（ｊ）から含有信号Ｇ（ｊ）を減じることにより、差分信号Ｓ（ｊ＋１）を求める。図６（ｃ）は、このようにして求まった差分信号Ｓ（ｊ＋１）を示している。この差分信号Ｓ（ｊ＋１）は、もとの信号Ｓ（ｊ）の中から、周波数ｆ（ｎ）をもった信号成分を取り去った残りの信号成分からなる信号ということができる。そこで、パラメータｊを１だけ増加させることにより、この差分信号Ｓ（ｊ＋１）を新たな信号Ｓ（ｊ）として取り扱い、同様の処理を、パラメータｊをｊ＝１〜Ｊまで１ずつ増やしながらＪ回繰り返し実行すれば、Ｊ個の代表周波数を選出することができる。
【００３３】
このような相関計算の結果として出力されるＪ個の含有信号Ｇ（１）〜Ｇ（Ｊ）は、もとの区間信号Ｘの構成要素となる信号であり、もとの区間信号Ｘを符号化する場合には、これらＪ個の含有信号の周波数を示す情報および振幅（強度）を示す情報を符号データとして用いるようにすれば良い。尚、Ｊは代表周波数の個数であると説明してきたが、標準周波数ｆ（ｎ）の個数と同一すなわちＪ＝１２８であってもよく、周波数スペクトルを求める目的においてはそのように行うのが通例である。
【００３４】
こうして、各単位区間について、所定数の周波数群が選出されたら、この周波数群の各周波数に対応する「音の高さを示す情報」、選出された各周波数の信号強度に対応する「音の強さを示す情報」、当該単位区間の始点に対応する「音の発音開始時刻を示す情報」、当該単位区間に後続する単位区間の始点に対応する「音の発音終了時刻を示す情報」、の４つの情報を含む所定数の符号データを作成すれば、当該単位区間内の区間信号Ｘを所定数の符号データにより符号化することができる。符号データとして、ＭＩＤＩデータを作成するのであれば、「音の高さを示す情報」としてノートナンバーを用い、「音の強さを示す情報」としてベロシティーを用い、「音の発音開始時刻を示す情報」としてノートオン時刻を用い、「音の発音終了時刻を示す情報」としてノートオフ時刻を用いるようにすれば良い。
【００３５】
（本発明に係る音響信号の符号化方法）
ここまでに説明した従来技術とも共通する本発明の基本原理を要約すると、原音響信号に単位区間を設定し、単位区間ごとに複数の周波数に対応する信号強度を算出し、得られた信号強度を基に用意された周期関数を利用して１つまたは複数の代表周波数を選出し、選出された代表周波数に対応する音の高さ情報と、選出された代表周波数の強度に対応する音の強さ情報と、単位区間の始点に対応する発音開始時刻と、単位区間の終点に対応する発音終了時刻で構成される符号データを作成することにより、音響信号の符号化が行われていることになる。
【００３６】
本発明の音響信号符号化方法は、上記基本原理において、得られた信号強度を基に、用意された周期関数に対応する周波数を全て利用し、これら各周波数と、各周波数の強度と、単位区間の始点に対応する区間開始時刻と、単位区間の終点に対応する区間終了時刻で構成されるデータを「音素データ」と定義し、この音素データをさらに加工することにより最終的な符号化データを得るようにしたものである。
【００３７】
ここからは、本発明の音響信号符号化方法について、図７に示すフローチャートを用いて説明する。まず、音響信号の時間軸上の全区間に渡って単位区間を設定する（ステップＳ１）。このステップＳ１における手法は、上記基本原理において、図１（ａ）を用いて説明した通りである。
【００３８】
続いて、各単位区間ごとの音響信号、すなわち区間信号について、周波数解析を行って各周波数に対応する強度値を算出し、周波数、強度値、単位区間の始点、終点の４つの情報からなる単位音素データを算出する（ステップＳ２）。具体的には、図２に示したような１２８種の周期関数に対して区間信号の相関強度を求め、その周期関数の周波数、求めた相関強度、単位区間の始点、終点の４つの情報を「単位音素データ」と定義する。この単位音素データとは、音素データのうち、特に最初の単位区間において作成されたものとする。本実施形態では、上記基本原理で説明した場合のように、代表周波数を選出するのではなく、用意した周期関数全てに対応する単位音素データを取得する。このステップＳ２の処理を全単位区間に対して行うことにより、単位音素データ[ｍ，ｎ]（０≦ｍ≦Ｍ−１，０≦ｎ≦Ｎ−１）群が得られる。ここで、Ｎは周期関数の総数（上述の例ではＮ＝１２８）、Ｍは音響信号において設定された単位区間の総数である。つまり、Ｍ×Ｎ個の単位音素データからなる単位音素データ群が得られることになる。
【００３９】
単位音素データ群が得られたら、この単位音素データ群のうち、その強度値が所定値に達していない音素データを削除し、残った音素データを有効な強度値を有する有効音素データとして抽出する（ステップＳ３）。このステップＳ３において、強度値が所定値に達しない音素データを削除するのは、信号レベルがほとんど０であって、実際には音が存在していないと判断される音素を削除するためである。そのため、この所定値としては、音が実際に存在しないレベルとみなされる値が設定される。
【００４０】
このようにして有効音素データの集合である有効音素データ群が得られたら、同一周波数で時系列方向に連続する複数の有効音素データを１つの連結音素データとして連結する（ステップＳ４）。図８は有効音素データの連結を説明するための概念図である。図８（ａ）は連結前の音素データ群の様子を示す図である。図８（ａ）において、格子状に仕切られた各矩形は音素データを示しており、網掛けがされている矩形は、上記ステップＳ３において強度値が所定値に達しないために削除された音素データであり、その他の矩形は有効音素データを示す。ステップＳ４においては、同一周波数（同一ノートナンバー）で時間ｔ方向に連続する有効音素データを連結するため、図８（ａ）に示す有効音素データ群に対して連結処理を実行すると、図８（ｂ）に示すような複数の連結音素データ、複数の有効音素データからなる音素データ群が得られる。例えば、図８（ａ）に示した有効音素データＡ１、Ａ２、Ａ３は連結されて、図８（ｂ）に示すような連結音素データＡが得られることになる。このとき、新たに得られる連結音素データＡの周波数としては、有効音素データＡ１、Ａ２、Ａ３に共通の周波数が与えられ、強度値としては、有効音素データＡ１、Ａ２、Ａ３の強度値のうち最大のものが与えられ、開始時刻としては、先頭の有効音素データＡ１の区間開始時刻ｔ１が与えられ、終了時刻としては、最後尾の有効音素データＡ３の区間終了時刻ｔ４が与えられる。有効音素データ、連結音素データ共に、周波数（ノートナンバー）、強度値、開始時刻、終了時刻の４つの情報で構成されるため、３つの有効音素データが１つの連結音素データに統合されることにより、データ量は３分の１に削減される。このことは、最終的にＭＩＤＩ符号化される場合には、短い音符３つではなく、長い音符１つとして表現されることを意味している。また、図８（ａ）に示した有効音素データＢのように、同一周波数で時系列方向に連続する有効音素データがない場合には、図８（ｂ）に示すように、連結されずにそのまま残ることになるが、以降の処理においては、連結音素データも、連結されなかった単位区間長の有効音素データもまとめて「音素データ」として扱う。
【００４１】
続いて、各時刻ごとに重複している音素データの数を調べ、重複数が同時発音可能数より多い場合に調整処理を行う（ステップＳ５）。具体的な処理としては、まず、重複管理テーブルを用意し、この重複管理テーブルに、発音開始時刻順に音素データを登録していく。ここで、重複管理テーブルを用いた重複している音素データの調整処理を図９を用いて説明する。図９（ａ）は、音素データ群における音素データを開始時刻順に並べたものを示している。
【００４２】
図９（ａ）においては、各音素データについて、その開始時刻および終了時刻のみを示している。例えば、音素Ａは、時刻「０」に発音を開始して時刻「３」まで発音が持続することを示している。このような音素データ群に対して、時刻単位で重複管理テーブルに音素を登録していく。なお、以下の説明において、同時発音数は「４」に設定されているものとする。まず、発音開始時刻順に音素を重複管理テーブルに登録していく。同時発音数が「４」に設定されているため、図９（ｂ）に示すように、音素Ｄまでは、単純に重複管理テーブルに音素が登録されることになる。
【００４３】
音素Ｅが重複管理テーブルに登録されると、図９（ｃ）に示すように重複管理テーブルには５つの音素が並べられることになる。この場合、設定されている同時発音数「４」になるように、音素を１つ減らす処理を行う。具体的には、重複管理テーブルに登録されている５つの音素の中から優先度の低い音素を１つ選定し、選定された音素に対して変更を行うことになる。図９（ｃ）の例では、終了時刻が最も早い音素Ａと音素Ｂが候補となる。音素Ａと音素Ｂは終了時刻が「３」で同時であるため、その音素データの強度値が低い方に対して変更を行う。例えば、音素Ｂの強度値の方が低いとすると、図９（ｃ）に示すように音素Ｂの終了時刻を、新たに重複管理テーブルに登録された音素Ｅの開始時刻と同一の時刻「２」に変更する。重複管理テーブルにおいて、変更する音素データが決まったら、実際の音素データ群における音素データについても変更が行われる。また、変更により重複管理テーブル上の他の４つの音素と時間的に重複しなくなった音素Ｂは、重複管理テーブルより削除される。
【００４４】
次の音素データである音素Ｆが重複管理テーブルに登録されると、図９（ｄ）に示すように重複管理テーブルには、また５つの音素が登録されることになる。終了時刻が最も早い音素Ａが特定できるので、音素Ａの終了時刻を、新たに重複管理テーブルに登録された音素Ｆの開始時刻と同一の時刻「２」に変更する。同時に、音素データ群における音素データも変更される。この時点で、音素データ群における音素データの様子は、図９（ｅ）に示すようになる。図９（ａ）の重複調整処理を行う前と、図９（ｅ）の重複調整処理を行った後を比較するとわかるように、音素Ａと音素Ｂの終了時刻が変更されている。また、重複管理テーブルにおいては、音素Ａが削除され、次の音素が重複管理テーブルに登録されて、上述のような処理が繰り返されることになる。上記のような重複音素の調整処理を、音素データ群内の全ての音素データに対して行うことにより、全ての時刻において、設定された同時発音数を超えない音素データ群が得られることになる。
【００４５】
ここで、上記ステップＳ５における重複音素の調整処理について、図１０に示すフローチャートを用いて整理して説明する。まず、音素データ群の音素データを開始時刻順に重複管理テーブルに１つ登録する（ステップＳ１１）。続いて、新たに登録した音素データの開始時刻までに終了する音素データを重複管理テーブルから削除する（ステップＳ１２）。ここで、重複管理テーブル上に並べられた音素データの数が制限値ｎを超えたかどうかを判断する（ステップＳ１３）。この制限数ｎとしては、通常は、ＭＩＤＩ規格の同時発音可能数である１６程度が設定される。重複管理テーブル上に登録された音素数がｎ以下の場合には、ステップＳ１１に戻って処理を繰り返す。図９を用いて説明した例は、ｎ＝４の場合であり、図９（ｃ）に示すように重複管理テーブル上の音素数が５になるまで、ステップＳ１１からステップＳ１３の処理が繰り返されたことになる。
【００４６】
ステップＳ１３において、重複管理テーブル上に登録された音素数がｎを超えたと判断された場合には、重複管理テーブル上に登録されているｎ＋１個の音素の中から優先度の低い音素を選定する（ステップＳ１４）。図９を用いて説明した例では、図９（ｃ）に示したように音素数が５となった場合に、音素Ｂが最も優先度が低いものとして選定される。優先度は、図９の例では、その強度値を基に決定されている。
【００４７】
優先度が最も低い音素が選定されたら、選定された音素の終了時刻を変更するか、もしくは選定された音素自体を削除する（ステップＳ１５）。基本的には、音素の終了時刻の変更が優先される。具体的には、選定された音素の終了時刻を、新たに重複管理テーブルに登録された音素の開始時刻と同一になるように変更することになる。これにより、選定された音素と新たに登録された音素との時間的な重複がなくなり、設定された制限数を超えないことになる。図９を用いて説明した例では、図９（ｃ）に示すように音素Ｂの終了時刻が、新たに重複管理テーブルに登録された音素Ｅの開始時刻「２」と同一になるように変更され、図９（ｄ）に示すように音素Ａの終了時刻が、新たに重複管理テーブルに登録された音素Ｆの開始時刻「２」と同一になるように変更されている。音素自体を削除する場合とは、音素の終了時刻を変更することにより、音素自体がなくなってしまうような場合である。例えば、開始時刻が「１」で終了時刻が「２」の音素について、終了時刻を変更すると、終了時刻も「１」となり、音の発音時間は「０」となる。このような場合は、データが存在していても意味がないので、その音素自体を削除するのである。上記ステップＳ１１〜ステップＳ１５の処理を全音素、すなわち音素データ群に存在する全ての音素データについて行ったら、処理を終了する（ステップＳ１６）。
【００４８】
図１０のフローチャートに示した手順により、ステップＳ５の処理が終了したら、そのままＭＩＤＩ形式の符号データに符号化しても良いが、本実施形態では、データ量の削減のため、音素データの総数の調整を行う（ステップＳ６）。具体的には、音素データ群における音素データの総数が、設定された数を超えている場合、重要度の低い音素を削除することにより、音素データの総数を所定内に収める。本実施形態では、優先度として、各音素データの（終了時刻−開始時刻）×強度値、により算出される値を採用する。すなわち、この値が低いものを順次削除していくことになる。音素データの総数の調整が行われたら、ＭＩＤＩ形式に符号化を行う（ステップＳ７）。
【００４９】
（倍音成分の除去処理）
上記のような本発明に係る符号化方法により、重要な音の一部を欠落させることなく、また、同時刻に重複する音素を同時発音可能な数に収めることが可能となるが、本発明においては、その手法の特徴から、倍音成分の除去処理を行うことも可能である。倍音とは、本来の音である基本音の周波数の整数倍の周波数を有する音であり、この倍音成分をそのまま符号化してしまうと、本来の音を正確に再現できないことになる。倍音成分は、上記〔数式１〕に示した関係からＭＩＤＩノートナンバーでいえば、基本音の＋１２、＋１９、＋２４、＋２８、＋３１、・・・といった値をとるものとなる。
【００５０】
次に、倍音成分の除去処理を行う場合について説明する。具体的には、上記ステップＳ５において重複音素の処理を行う際に、重複管理テーブルに登録された複数の音素の中で、一方の音素の周波数が、他の音素の周波数の整数倍となっているような関係があるかどうかを調べる。そのような関係が発見されたら、周波数が高い方の音素の強度値の、周波数が低い方の音素（基本音と考えられる）の強度値に対する比率を算出する。この比率が所定の値以下である場合、周波数が高い方の音素を、倍音であると判断して削除する。強度値の比率が所定値以下でない場合は、その音素は倍音成分でなく、基本音である可能性が高いので、削除は行わない。
【００５１】
以上、本発明の好適な実施形態について説明したが、上記符号化方法は、コンピュータ等で実行されることは当然である。具体的には、図７および図１０のフローチャートに示したようなステップを上記手順で実行するためのプログラムをコンピュータに搭載しておく。そして、音響信号をＰＣＭ方式等でデジタル化した後、コンピュータに取り込み、ステップＳ１〜ステップＳ６、およびステップＳ１１〜ステップＳ１６の処理を行った後、ＭＩＤＩ形式等の符号データをコンピュータより出力する。出力された符号データは、例えば、ＭＩＤＩデータの場合、ＭＩＤＩシーケンサ、ＭＩＤＩ音源を用いて音声として再生される。また、上記重複管理テーブルは、コンピュータ内のＲＡＭ等の所定の記憶領域を割り当てることにより実現される。
【００５２】
【発明の効果】
以上、説明したように本発明によれば、与えられた音響信号に対して、時間軸上に複数の単位区間を設定し、設定された単位区間における音響信号と複数の周期関数との相関を求めることにより、各周期関数に対応した強度値を算出し、各周期関数が有する周波数と、前記各周期関数に対応した強度値と、単位区間の始点に対応する区間開始時刻と、単位区間の終点に対応する区間終了時刻で構成される単位音素データを算出し、この単位音素データの算出処理を全単位区間に対して行うことにより得られる全単位音素データから、強度値が所定値に達していないものを削除して、残りの単位音素データを有効な強度値を有する有効音素データとして抽出し、抽出された有効音素データに対して、周波数が同一であって、区間が連続するものを連結して連結音素データとし、連結音素データの属性として、強度値は構成する有効音素データの最大強度値を与え、開始時刻は先頭の有効音素データの区間開始時刻を与え、終了時刻は最後尾の有効音素データの区間終了時刻を与え、連結処理後の全音素データに対して、時間的に重複する音素データを探索し、前記時間的に重複する音素データ間において、その属性である周波数が他の音素データの周波数の整数倍になる音素データを削除し、削除後の音素データの集合により音響信号を表現するようにしたので、重要な音の一部を欠落させてしまうことなく、かつ、同時刻に重複する音素を同時発音可能な数以下に収めることが可能となるという効果を奏する。
【図面の簡単な説明】
【図１】本発明の音響信号の符号化方法の基本原理を示す図である。
【図２】本発明で利用される周期関数の一例を示す図である。
【図３】図２に示す各周期関数の周波数とＭＩＤＩノートナンバーｎとの関係式を示す図である。
【図４】解析対象となる信号と周期信号との相関計算の手法を示す図である。
【図５】図４に示す相関計算を行うための計算式を示す図である。
【図６】一般化調和解析の基本的な手法を示す図である。
【図７】本発明の音響信号符号化方法のフローチャートである。
【図８】有効音素データの連結を説明するための概念図である。
【図９】音素データ群および重複管理テーブルの様子を示す図である。
【図１０】図７のステップＳ５の詳細を示すフローチャートである。
【符号の説明】
Ａ（ｎ），Ｂ（ｎ）・・・相関値
ｄ，ｄ１〜ｄ５・・・単位区間
Ｅ（ｎ）・・・相関値
Ｇ（ｊ）・・・含有信号
ｎ，ｎ１〜ｎ６・・・ノートナンバー
Ｓ（ｊ），Ｓ（ｊ＋１）・・・差分信号
Ｘ，Ｘ（ｋ）・・・区間信号
[0001]
[Industrial application fields]
The present invention can be applied to support the work known as transcription in music production. Transcription work includes, for example: quoting existing music as material when no score is available and producing cover versions of existing music; music analysis, such as studying the melody, harmonic progression, and timbre of hit songs; creating performance data in MIDI format for karaoke; creating BGM data for game machines; creating ringtone data for mobile phones; creating performance data for player pianos and for keyboard instruments with a performance guide function; and score publishing and preparation of camera-ready copy.
[0002]
[Prior art]
A time-series signal represented by an acoustic signal includes a plurality of periodic signals as its constituent elements. For this reason, a method for analyzing what kind of periodic signal is included in a given time-series signal has been known for a long time. For example, Fourier analysis is widely used as a method for analyzing frequency components included in a given time series signal.
[0003]
By using such a time-series signal analysis method, an acoustic signal can be encoded. With the spread of computers, it has become easy to sample an analog audio signal as the original sound at a predetermined sampling frequency, quantize the signal intensity at each sampling, and capture it as digital data. If a method such as Fourier analysis is applied to the data and the frequency components included in the original sound signal are extracted, the original sound signal can be encoded by a code indicating each frequency component.
[0004]
Meanwhile, the MIDI (Musical Instrument Digital Interface) standard, which grew out of the idea of encoding the sounds of electronic musical instruments, has come into wide use with the spread of personal computers. Code data according to the MIDI standard (hereinafter referred to as MIDI data) is basically data describing the operations of an instrumental performance, such as which key of the instrument was played and how hard. MIDI data itself does not contain the actual sound waveform. Reproducing the actual sound therefore requires a separate MIDI sound source that stores instrument waveforms. Nevertheless, the high encoding efficiency of MIDI has attracted attention, and encoding and decoding techniques based on the MIDI standard are now widely adopted in software that uses a personal computer for instrument performance, instrument practice, composition, and the like.
[0005]
Accordingly, proposals have been made to analyze a time-series signal, typified by an acoustic signal, by a predetermined method, extract the periodic signals that constitute it, and encode the extracted periodic signals using MIDI data. For example, JP-A-10-247099, JP-A-11-73199, JP-A-11-73200, JP-A-11-95753, JP-A-2000-99009, JP-A-2000-99092, JP-A-2000-99093, JP-A-2000-261322, JP-A-2001-5450, and JP-A-2001-148633 propose various methods for analyzing the constituent frequencies of an arbitrary time-series signal and creating MIDI data from the analysis results.
[0006]
[Problems to be solved by the invention]
The MIDI encoding methods proposed in the above publications and specifications have made it possible to efficiently encode acoustic signals obtained from performance recordings and the like. In the conventional encoding method, as disclosed in JP-A-11-95753, the phonemes obtained by frequency analysis of each unit section (in this specification, a pair of a frequency and the intensity corresponding to that frequency is called a phoneme) are screened down to a predetermined number. This is because an ordinary MIDI sound source is limited to 16 to 64 simultaneously sounding notes, and the phonemes obtained by analysis must fit within this limit. Therefore, in each unit section, the phonemes are screened down to about 16 on the basis of their intensity values.
[0007]
However, when screening is performed per unit section in this way, the role of each phoneme in the signal as a whole is not taken into account. As a result, a phoneme whose intensity is small in a given unit section but which is part of an important sound, such as the attack or the end of a note, is deleted, and accurate encoding cannot be achieved.
[0008]
Accordingly, in the specification of Japanese Patent Application No. 2001-8750, the present applicant proposed a method in which priority marks are assigned to about 16 high-intensity phonemes in each unit section, phonemes in consecutive sections are then connected to obtain connected phonemes, and code data is created from these connected phonemes.
[0009]
With this method, the number of simultaneously sounding notes can be controlled to some extent before connection, and, as noted above, phonemes that form part of an important sound are not deleted. After connection, however, the number of connected phonemes overlapping at the same time increases by a factor of about two on average, so the number of simultaneously sounding notes cannot be limited to the specified range.
[0010]
In view of the above, an object of the present invention is to provide an acoustic signal encoding method that does not drop any part of an important sound and that keeps the number of phonemes overlapping at the same time within the number that can be sounded simultaneously.
[0011]
[Means for Solving the Problems]
To solve the above problem, in the present invention, a plurality of unit sections are set on the time axis of a given acoustic signal; the correlation between the acoustic signal in each unit section and a plurality of periodic functions is obtained to calculate an intensity value for each periodic function; unit phoneme data is calculated, consisting of the frequency of each periodic function, the intensity value corresponding to that periodic function, a section start time corresponding to the start point of the unit section, and a section end time corresponding to the end point of the unit section; from all the unit phoneme data obtained by performing this calculation for all unit sections, data whose intensity value does not reach a predetermined value is deleted, and the remaining unit phoneme data is extracted as effective phoneme data having a valid intensity value; effective phoneme data having the same frequency and consecutive sections is connected into connected phoneme data, whose attributes are set so that the intensity value is the maximum intensity value of the constituent effective phoneme data, the start time is the section start time of the first effective phoneme data, and the end time is the section end time of the last effective phoneme data; all the phoneme data after the connection process is searched for temporally overlapping phoneme data; among the temporally overlapping phoneme data, phoneme data whose frequency attribute is an integer multiple of the frequency of other phoneme data is deleted; and the acoustic signal is represented by the set of phoneme data remaining after the deletion. According to the present invention, frequency analysis is performed on the acoustic signal for each unit section to calculate unit phoneme data, the connection process is then performed, the phoneme data after connection is examined for temporal overlap, and overlapping phoneme data whose frequency is an integer multiple of the frequency of other phoneme data is deleted. It is therefore possible to keep the number of phonemes overlapping at the same time at or below the number that can be sounded simultaneously, without dropping any part of an important sound.
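The thresholding, connection, and harmonic-deletion steps summarized above can be sketched roughly as follows. This is only an illustrative sketch, not the applicant's implementation: the `Phoneme` class, the function names, and the threshold are hypothetical, frequencies are represented by note numbers, and the integer-multiple test is approximated by the semitone offsets +12, +19, +24, +28, +31 that the description later gives for overtones.

```python
from dataclasses import dataclass

@dataclass
class Phoneme:
    note: int         # index of the periodic function (note number), stands in for frequency
    intensity: float  # correlation-based intensity value
    start: float      # section start time
    end: float        # section end time

# Assumption: semitone offsets approximating 2x..6x the fundamental frequency.
HARMONIC_OFFSETS = (12, 19, 24, 28, 31)

def connect(unit_phonemes, threshold):
    """Drop phonemes below the threshold, then merge same-note phonemes in consecutive sections."""
    effective = [p for p in unit_phonemes if p.intensity >= threshold]
    effective.sort(key=lambda p: (p.note, p.start))
    merged = []
    for p in effective:
        last = merged[-1] if merged else None
        if last is not None and last.note == p.note and last.end == p.start:
            # Connected phoneme: maximum intensity, first start time, last end time.
            last.intensity = max(last.intensity, p.intensity)
            last.end = p.end
        else:
            merged.append(Phoneme(p.note, p.intensity, p.start, p.end))
    return merged

def remove_harmonics(phonemes):
    """Delete phonemes that overlap in time with a lower phoneme at a harmonic offset."""
    kept = []
    for p in phonemes:
        is_harmonic = any(
            q is not p
            and q.start < p.end and p.start < q.end      # temporal overlap
            and (p.note - q.note) in HARMONIC_OFFSETS     # p is an overtone of q
            for q in phonemes
        )
        if not is_harmonic:
            kept.append(p)
    return kept
```

In this sketch, three consecutive phonemes at the same note collapse into one long phoneme, and a simultaneous phoneme one octave above it is dropped as an overtone.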
[0012]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0013]
(Basic principle of acoustic signal encoding method)
First, the basic principle of the acoustic signal encoding method according to the present invention will be described. Since this basic principle is disclosed in the above-mentioned publications and specifications, only an outline is given here.
[0014]
Suppose that an analog acoustic signal is given as a time-series signal, as shown in FIG. 1A. In the example of FIG. 1, the acoustic signal is plotted with time t on the horizontal axis and amplitude (intensity) on the vertical axis. First, this analog acoustic signal is captured as digital acoustic data. This can be done with the ordinary PCM technique: the analog acoustic signal is sampled at a predetermined sampling frequency, and the amplitude is converted to digital data with a predetermined number of quantization bits. For convenience of explanation, the waveform of the acoustic data digitized by PCM is represented here by the same waveform as the analog acoustic signal of FIG. 1A.
[0015]
Subsequently, a plurality of unit sections are set on the time axis of the acoustic signal to be analyzed. In the example shown in FIG. 1A, six times t1 to t6 are defined at equal intervals on the time axis t, and five unit intervals d1 to d5 having these times as the start point and the end point are set. In the example of FIG. 1, unit sections having the same section length are set, but the section length may be changed for each unit section. Alternatively, the section setting may be performed such that adjacent unit sections partially overlap on the time axis.
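As a rough illustration of the section setting described above, the sketch below divides a run of samples into equal-length unit sections, optionally letting adjacent sections overlap. The function name and the `hop` parameter are hypothetical and only cover the equal-length case; variable-length sections would need a different scheme.

```python
def unit_sections(num_samples, section_len, hop=None):
    """Return (start, end) sample indices of unit sections, with end exclusive.

    hop is the distance between the start points of adjacent sections;
    hop == section_len gives back-to-back sections, hop < section_len
    gives sections that partially overlap on the time axis.
    """
    if hop is None:
        hop = section_len
    if section_len <= 0 or hop <= 0:
        raise ValueError("section_len and hop must be positive")
    return [
        (start, start + section_len)
        for start in range(0, num_samples - section_len + 1, hop)
    ]
```

Samples left over at the end that do not fill a whole section are simply ignored in this sketch.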
[0016]
When the unit sections have been set in this way, representative frequencies are selected for the acoustic signal in each unit section (hereinafter called a section signal). Each section signal usually contains various frequency components; for example, a frequency component with a large share of the intensity may be selected as a representative frequency. The representative frequency is generally the so-called fundamental frequency, but overtone frequencies such as the formant frequencies of speech, or the peak frequency of a noise source, may also be treated as representative frequencies. A single representative frequency may be selected, but for some acoustic signals selecting several gives more accurate encoding. FIG. 1B shows an example in which three representative frequencies are selected for each unit section and each representative frequency is encoded as one representative code (shown as a note in the figure for convenience). Three tracks T1, T2, and T3 are provided to hold the representative codes (notes), so that the three representative codes selected for each unit section can be placed on separate tracks.
[0017]
For example, the representative codes n(d1,1), n(d1,2), n(d1,3) selected for the unit section d1 are placed on tracks T1, T2, and T3, respectively. Each of the codes n(d1,1), n(d1,2), n(d1,3) indicates a note number in the MIDI code. A MIDI note number takes one of 128 values from 0 to 127, each corresponding to one key of a piano keyboard. Specifically, when 440 Hz is selected as a representative frequency, this frequency corresponds to note number n = 69 (the note “la” (A3) at the center of the piano keyboard), so n = 69 is selected as the representative code. FIG. 1B is, however, only a conceptual diagram showing the representative codes obtained by the above method in the form of notes; in practice, intensity data is also attached to each note. For example, track T1 holds data indicating pitch, namely the note numbers n(d1,1), n(d2,1), ..., together with data indicating intensity, namely e(d1,1), e(d2,1), .... The intensity data is determined by the degree to which the component at each representative frequency was contained in the original section signal. Specifically, it is determined from the correlation value between the section signal and the periodic function having each representative frequency. Likewise, in the conceptual diagram of FIG. 1B the horizontal position of a note indicates the position of its unit section on the time axis, but in practice data giving this position precisely as a numerical value is attached to each note.
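The correspondence between a representative frequency and a note number can be sketched as below. The sketch assumes the equal-tempered relation f(n) = 440 × 2^((n − 69)/12) that the description gives for the standard frequencies; the function names and the rounding to the nearest note are illustrative choices, not something the specification prescribes.

```python
import math

def note_to_freq(n):
    """Standard frequency f(n) in Hz for note number n (0..127)."""
    return 440.0 * 2.0 ** ((n - 69) / 12)

def freq_to_note(freq_hz):
    """Nearest note number for a frequency, clamped to the range 0..127."""
    n = round(69 + 12 * math.log2(freq_hz / 440.0))
    return max(0, min(127, n))
```

With this relation, 440 Hz maps to note number 69, and doubling the frequency raises the note number by 12.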
[0018]
The format used to encode an acoustic signal need not be MIDI, but because MIDI is the most widespread format of this kind, using MIDI code data is preferable in practice. In the MIDI format, “note-on” and “note-off” data appear with “delta time” data interposed between them. “Note-on” data specifies a note number N and a velocity V and instructs the start of a particular sound; “note-off” data specifies a note number N and a velocity V and instructs the end of a particular sound. “Delta time” data indicates a time interval. The velocity V is a parameter indicating, for example, the speed at which a piano key is pressed (note-on velocity) or the speed at which the finger is released from the key (note-off velocity), and thus indicates the strength of the operation that starts or ends a particular sound.
[0019]
In the method described above, J note numbers n(di,1), n(di,2), ..., n(di,J) are obtained as representative codes for the i-th unit section di, together with an intensity e(di,1), e(di,2), ..., e(di,J) for each. MIDI code data can then be created as follows. The note number N written in the “note-on” or “note-off” data can simply be the obtained note numbers n(di,1), n(di,2), ..., n(di,J). The velocity V written in the “note-on” or “note-off” data can be a value obtained by normalizing the obtained intensities e(di,1), e(di,2), ..., e(di,J) by a predetermined method. The “delta time” data can be set according to the length of each unit section.
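The mapping from per-section note numbers and intensities to note-on, note-off, and delta-time data might look like the following sketch. It does not write a real MIDI file; events are plain tuples. The linear scaling of intensity to a velocity in 1..127 is an assumed stand-in for the unspecified "predetermined method", and `to_events`, `ticks_per_section`, and `e_max` are hypothetical names.

```python
def to_events(sections, ticks_per_section, e_max):
    """Build (delta_time, kind, note, velocity) tuples from per-section data.

    sections: one list per unit section, each holding (note_number, intensity) pairs.
    ticks_per_section: delta time corresponding to one unit section.
    e_max: intensity mapped to velocity 127 (assumed to be positive).
    """
    events = []
    pending = 0  # delta time accumulated since the last emitted event
    for notes in sections:
        # Assumed normalization: linear scaling, clamped to 1..127.
        scaled = [(n, max(1, min(127, round(127 * e / e_max)))) for n, e in notes]
        for n, v in scaled:
            events.append((pending, "note_on", n, v))
            pending = 0
        pending += ticks_per_section  # the notes sound for one unit section
        for n, v in scaled:
            events.append((pending, "note_off", n, v))
            pending = 0
    return events
```

Each note here lasts exactly one unit section; a section with no notes only adds to the delta time carried by the next event.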
[0020]
(Specific method for obtaining correlation with periodic function)
In the method based on the basic principle described above, one or more representative frequencies are selected for the section signal, and the section signal is represented by periodic signals having these representative frequencies. A representative frequency is, literally, a frequency that represents the signal components within the unit section. Specific methods for selecting representative frequencies include a method using the short-time Fourier transform and a method using generalized harmonic analysis, as described later. Both share the same basic idea: a plurality of periodic functions with different frequencies are prepared in advance, the periodic functions having a high correlation with the section signal in the unit section are found among them, and the frequencies of these highly correlated periodic functions are selected as representative frequencies. That is, selecting a representative frequency involves computing the correlation between each of the periodic functions prepared in advance and the section signal in the unit section. A specific method for obtaining the correlation with a periodic function is therefore described here.
[0021]
Assume that the trigonometric functions shown in FIG. 2 are prepared as the plurality of periodic functions. These consist of pairs of a sine function and a cosine function having the same frequency, one pair being defined for each of the 128 standard frequencies f(0) to f(127). Here, the pair of a sine function and a cosine function having the same frequency is defined as the periodic function for that frequency. The periodic function is defined as a sine/cosine pair because the correlation value of a periodic function with a signal is affected by phase. The variables F and k in the trigonometric functions of FIG. 2 correspond to the sampling frequency F and the sample number k of the section signal X. For example, the sine wave for frequency f(0) is expressed as sin(2πf(0)k/F); given an arbitrary sample number k, it yields the amplitude of the periodic function at the same time position as the k-th sample of the section signal.
[0022]
Here, an example is shown in which the 128 standard frequencies f(0) to f(127) are defined by the equation shown in FIG. 3. That is, the n-th (0 ≦ n ≦ 127) standard frequency f(n) is defined by the following [Formula 1].
[0023]
[Formula 1]
$f(n) = 440 \times 2^{\gamma(n)}$
$\gamma(n) = (n - 69) / 12$
[0024]
Defining the standard frequencies by this formula is convenient when the final encoding uses MIDI data. This is because the 128 standard frequencies f(0) to f(127) so defined form a geometric series and coincide with the frequencies of the note numbers used in MIDI data. Accordingly, the 128 standard frequencies f(0) to f(127) shown in FIG. 2 are spaced at equal intervals (semitone steps in MIDI) on a logarithmic frequency axis.
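As a small illustration, Formula 1 can be evaluated directly to tabulate the 128 standard frequencies. The function name `standard_frequency` is chosen here for illustration and does not appear in the original text.

```python
def standard_frequency(n):
    """Standard frequency f(n) in Hz for note number n, following Formula 1."""
    if not 0 <= n <= 127:
        raise ValueError("note number must be in 0..127")
    gamma = (n - 69) / 12
    return 440.0 * 2.0 ** gamma


frequencies = [standard_frequency(n) for n in range(128)]

# Adjacent entries differ by a constant ratio (one semitone), which is
# what makes the table a geometric series.
print(frequencies[69], frequencies[70] / frequencies[69])
```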
[0025]
Next, how to obtain the correlation of each periodic function with the section signal of an arbitrary section is described concretely. For example, suppose that a section signal X is given for a unit section d, as shown in FIG. 4. Assume that the unit section d of section length L is sampled at sampling frequency F to give w sample values in total, with sample numbers 0, 1, 2, 3, ..., k, ..., w−2, w−1 from the head of the section (the w-th sample, shown as a white circle, is taken to belong to the head of the next unit section adjacent on the right). For an arbitrary sample number k, the amplitude value X(k) is then given as digital data. In the short-time Fourier transform, X(k) is usually multiplied, sample by sample, by a window function W(k) whose weight is close to 1 at the center and close to 0 at both ends. That is, X(k) × W(k) is treated as X(k) in the correlation calculation below. A cosine-shaped Hamming window is generally used as the window function. Although w is treated as a constant in the following description, in general it is desirable to vary it according to n and to set it to the largest integer multiple of F/f(n) that does not exceed the section length L.
[0026]
The principle of obtaining the correlation value between such a section signal X and the sine function Rn having the n-th standard frequency f(n) is as follows. The correlation value A(n) between the two can be defined by the first arithmetic expression of FIG. 5. Here, X(k) is the amplitude value of sample number k in the section signal X, as shown in FIG. 4, and sin(2πf(n)k/F) is the amplitude value of the sine function Rn at the same position on the time axis. This first expression can be regarded as the inner product of the amplitude vector of the section signal X and the amplitude vector of the sine function Rn over the dimensions k = 0 to w−1 of all sample numbers in the unit section d.
[0027]
Similarly, the second arithmetic expression of FIG. 5 gives the correlation value B(n) between the section signal X and the cosine function having the n-th standard frequency f(n). Both the first expression for A(n) and the second expression for B(n) are finally multiplied by 2/w, which normalizes the correlation values. As noted above, w generally varies with n, so this coefficient also depends on n.
[0028]
The effective correlation value between the section signal X and the standard periodic function having standard frequency f(n) is given, as shown in the third arithmetic expression of FIG. 5, by the square root E(n) of the sum of the squares of the correlation value A(n) with the sine function and the correlation value B(n) with the cosine function. If the frequencies of the standard periodic functions with large effective correlation values are selected as representative frequencies, the section signal X can be encoded using these representative frequencies.
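The three expressions can be sketched in code as follows. Since the expressions of FIG. 5 are not reproduced in the text, the sums below are reconstructed from the description (inner product with the sine and cosine functions, normalization by 2/w, root of the sum of squares) and should be read as an assumption. No window function is applied, w is kept constant, and the function name `correlation` is illustrative.

```python
import math


def correlation(x, f_n, F):
    """Return (A, B, E) for section samples x and standard frequency f_n.

    x   : list of amplitude values X(0) .. X(w-1)
    f_n : frequency of the periodic function in Hz
    F   : sampling frequency in Hz
    """
    w = len(x)
    a = b = 0.0
    for k, sample in enumerate(x):
        phase = 2.0 * math.pi * f_n * k / F
        a += sample * math.sin(phase)   # inner product with the sine function
        b += sample * math.cos(phase)   # inner product with the cosine function
    a *= 2.0 / w                        # normalization by 2/w
    b *= 2.0 / w
    return a, b, math.hypot(a, b)       # E = sqrt(A^2 + B^2)


# A 440 Hz tone of amplitude 0.5 with an arbitrary phase offset of 0.3 rad,
# sampled over exactly four periods (hypothetical test signal).
F = 44000
x = [0.5 * math.sin(2.0 * math.pi * 440.0 * k / F + 0.3) for k in range(400)]
A, B, E = correlation(x, 440.0, F)
print(A, B, E)
```

In this example E recovers the amplitude of the tone regardless of the phase offset, which is the reason the periodic function is defined as a sine/cosine pair.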
[0029]
That is, one or more standard frequencies whose correlation value E(n) meets a predetermined criterion may be selected as representative frequencies. The selection condition "correlation value E(n) meets a predetermined criterion" may be an absolute condition, for example setting some threshold and selecting as representative frequencies all standard frequencies f(n) whose correlation value E(n) exceeds it, or a relative condition, for example selecting the top Q frequencies in descending order of correlation value E(n).
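The two kinds of selection condition can be sketched as follows, assuming the correlation values are held in a list `E` indexed by note number n. The function names and sample values are hypothetical.

```python
def select_by_threshold(E, threshold):
    """Absolute condition: every note number n whose E[n] exceeds threshold."""
    return [n for n, e in enumerate(E) if e > threshold]


def select_top_q(E, q):
    """Relative condition: the q note numbers with the largest E[n]."""
    ranked = sorted(range(len(E)), key=lambda n: E[n], reverse=True)
    return sorted(ranked[:q])


# Hypothetical correlation values for four standard frequencies.
E = [0.1, 0.7, 0.3, 0.9]
print(select_by_threshold(E, 0.25))
print(select_top_q(E, 2))
```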
[0030]
(Method of generalized harmonic analysis)
Here, a generalized harmonic analysis technique useful when encoding an acoustic signal according to the present invention will be described. As already described, when encoding an acoustic signal, several representative frequencies having high correlation values are selected for the section signal in each unit section. Generalized harmonic analysis is a technique that enables the selection of representative frequencies with higher accuracy, and the basic principle thereof is as follows.
[0031]
Suppose that a signal S(j) exists for unit section d, as shown in FIG. 6. Here j is a parameter for iterative processing (j = 1 to J), as described later. First, the correlation values of this signal S(j) with all 128 periodic functions shown in FIG. 2 are obtained. The frequency of the one periodic function giving the maximum correlation value is selected as a representative frequency, and the periodic function having that frequency is extracted as an element function. Next, the inclusion signal G(j) shown in FIG. 6B is defined. The inclusion signal G(j) is the extracted element function multiplied by its correlation value with the signal S(j). For example, when the sine/cosine pairs of FIG. 2 are used as the periodic functions and frequency f(n) is selected as the representative frequency, the inclusion signal G(j) is the sum of the sine function A(n) sin(2πf(n)k/F) with amplitude A(n) and the cosine function B(n) cos(2πf(n)k/F) with amplitude B(n) (FIG. 6B shows only one function for convenience of illustration). Since A(n) and B(n) are the normalized correlation values obtained by the expressions of FIG. 5, the inclusion signal G(j) can be regarded as the signal component of frequency f(n) contained in the signal S(j).
[0032]
Once the inclusion signal G(j) has been obtained, the difference signal S(j+1) is obtained by subtracting G(j) from S(j). FIG. 6C shows the difference signal S(j+1) obtained in this way. The difference signal S(j+1) consists of the signal components that remain after the component of frequency f(n) has been removed from the original signal S(j). Incrementing the parameter j by 1 so that this difference signal S(j+1) is handled as the new signal S(j), and repeating the same processing J times for j = 1 to J, yields J representative frequencies.
[0033]
The J inclusion signals G(1) to G(J) output as a result of this correlation calculation are the constituent elements of the original section signal X. When encoding the original section signal X, information indicating the frequency and information indicating the amplitude (intensity) of these J inclusion signals may be used as code data. Although J has been described as the number of representative frequencies, it may equal the number of standard frequencies f(n), that is, J = 128; this is the usual choice when the purpose is to obtain a frequency spectrum.
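The iteration described above (find the best-correlated periodic function, subtract its inclusion signal, repeat on the difference signal) can be sketched as follows. This is a minimal illustration rather than the patented implementation: no window is applied, w is constant, and the names `generalized_harmonic_analysis`, `correlate`, and the `note_numbers` parameter (used to restrict the search range) are assumptions.

```python
import math


def standard_frequency(n):
    """Standard frequency f(n) for note number n (Formula 1)."""
    return 440.0 * 2.0 ** ((n - 69) / 12)


def correlate(signal, freq, F):
    """Normalized sine and cosine correlations (A, B) of signal with freq."""
    w = len(signal)
    a = b = 0.0
    for k, s in enumerate(signal):
        phase = 2.0 * math.pi * freq * k / F
        a += s * math.sin(phase)
        b += s * math.cos(phase)
    return 2.0 * a / w, 2.0 * b / w


def generalized_harmonic_analysis(x, F, J, note_numbers=range(128)):
    """Extract J (note number, intensity) pairs by repeated subtraction."""
    residual = list(x)                  # S(1) is the section signal itself
    components = []
    for _ in range(J):
        best = None
        for n in note_numbers:
            a, b = correlate(residual, standard_frequency(n), F)
            e = math.hypot(a, b)
            if best is None or e > best[0]:
                best = (e, n, a, b)
        e, n, a, b = best
        f = standard_frequency(n)
        # Subtract the inclusion signal G(j) to obtain the difference S(j+1).
        for k in range(len(residual)):
            phase = 2.0 * math.pi * f * k / F
            residual[k] -= a * math.sin(phase) + b * math.cos(phase)
        components.append((n, e))
    return components, residual


# Hypothetical section signal: 440 Hz (amplitude 1.0) plus 880 Hz (amplitude 0.4).
F = 8800
x = [math.sin(2.0 * math.pi * 440.0 * k / F)
     + 0.4 * math.cos(2.0 * math.pi * 880.0 * k / F) for k in range(880)]
components, residual = generalized_harmonic_analysis(
    x, F, J=2, note_numbers=range(48, 97))
print(components)
```

The search range is narrowed in the usage example only to keep the run short and to stay below the Nyquist frequency of the low sampling rate chosen for the test signal.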
[0034]
Once a predetermined number of frequencies has been selected for each unit section in this way, the section signal X of the unit section can be encoded by creating that number of code data items, each comprising four pieces of information: "information indicating the pitch of the sound" corresponding to a selected frequency, "information indicating the intensity of the sound" corresponding to the signal intensity of that frequency, "information indicating the start time of sound generation" corresponding to the start point of the unit section, and "information indicating the end time of sound generation" corresponding to the start point of the unit section that follows. When MIDI data is created as the code data, the note number may be used as the pitch information, the velocity as the intensity information, the note-on time as the start-time information, and the note-off time as the end-time information.
[0035]
(Acoustic signal encoding method according to the present invention)
To summarize the basic principle described so far, which the present invention shares with the prior art: unit sections are set in the original acoustic signal; for each unit section, signal intensities corresponding to a plurality of frequencies are calculated; one or more representative frequencies are selected from the frequencies of the prepared periodic functions on the basis of the obtained signal intensities; and the acoustic signal is encoded by creating code data composed of pitch information corresponding to each selected representative frequency, sound intensity information corresponding to the intensity of that representative frequency, a sounding start time corresponding to the start point of the unit section, and a sounding end time corresponding to the end point of the unit section.
[0036]
The acoustic signal encoding method of the present invention follows the above basic principle but, on the basis of the obtained signal intensities, uses all the frequencies corresponding to the prepared periodic functions. Data consisting of each frequency, the intensity of that frequency, the section start time corresponding to the start point of the unit section, and the section end time corresponding to the end point of the unit section is defined as "phoneme data", and the final code data is obtained by further processing this phoneme data.
[0037]
The acoustic signal encoding method of the present invention is now described with reference to the flowchart of FIG. 7. First, unit sections are set over the entire time axis of the acoustic signal (step S1). The technique used in step S1 is as described with reference to FIG. 1A in the basic principle.
[0038]
Next, frequency analysis is performed on the acoustic signal of each unit section, that is, on each section signal, to calculate an intensity value for each frequency, and unit phoneme data comprising four pieces of information (frequency, intensity value, and the start point and end point of the unit section) is calculated (step S2). Specifically, the correlation intensity of the section signal is obtained for the 128 periodic functions shown in FIG. 2, and the four pieces of information consisting of the frequency of the periodic function, the calculated correlation intensity, and the start and end points of the unit section are defined as "unit phoneme data". The term "unit phoneme data" refers in particular to phoneme data as initially created for a unit section. In this embodiment, unlike the case described in the basic principle, representative frequencies are not selected; unit phoneme data is acquired for all the prepared periodic functions. Performing the process of step S2 on all unit sections yields the group of unit phoneme data [m, n] (0 ≦ m ≦ M−1, 0 ≦ n ≦ N−1), where N is the total number of periodic functions (N = 128 in the above example) and M is the total number of unit sections set in the acoustic signal. That is, a unit phoneme data group consisting of M × N unit phoneme data items is obtained.
[0039]
Once the unit phoneme data group has been obtained, phoneme data whose intensity value does not reach a predetermined value is deleted from the group, and the remaining phoneme data is extracted as effective phoneme data having an effective intensity value (step S3). Phoneme data whose intensity value does not reach the predetermined value is deleted in step S3 in order to remove phonemes whose signal level is almost 0 and which can be judged not to be actually present. The predetermined value is therefore set to a level at which no sound can be considered to exist.
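The four-item phoneme record and the extraction of step S3 can be sketched as follows. The `Phoneme` class, its field names, and the use of the note number in place of the frequency are illustrative assumptions; the threshold value in the example is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    note: int         # note number n, standing in for the frequency f(n)
    intensity: float  # correlation intensity E(n)
    start: int        # section start time
    end: int          # section end time


def extract_effective(unit_phonemes, min_intensity):
    """Step S3: keep only phoneme data whose intensity reaches min_intensity."""
    return [p for p in unit_phonemes if p.intensity >= min_intensity]


# Hypothetical unit phoneme data for two unit sections.
units = [
    Phoneme(60, 0.50, 0, 1),
    Phoneme(61, 0.001, 0, 1),   # almost silent: dropped
    Phoneme(60, 0.02, 1, 2),
]
print(extract_effective(units, min_intensity=0.02))
```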
[0040]
Once the effective phoneme data group, which is the set of effective phoneme data, has been obtained, multiple effective phoneme data items that are contiguous in time at the same frequency are connected into a single item of connected phoneme data (step S4). FIG. 8 is a conceptual diagram illustrating this connection. FIG. 8A shows the phoneme data group before connection. In FIG. 8A, each rectangle of the grid represents phoneme data; the shaded rectangles are phoneme data deleted in step S3 because their intensity values did not reach the predetermined value, and the other rectangles are effective phoneme data. Since step S4 connects effective phoneme data that is contiguous in the time t direction at the same frequency (same note number), executing the connection process on the effective phoneme data group of FIG. 8A yields a phoneme data group consisting of connected phoneme data and remaining effective phoneme data, as shown in FIG. 8B. For example, the effective phoneme data A1, A2, and A3 of FIG. 8A are connected to give the connected phoneme data A of FIG. 8B. The newly obtained connected phoneme data A is given the frequency common to A1, A2, and A3; as its intensity value, the largest of the intensity values of A1, A2, and A3; as its start time, the start time t1 of the first effective phoneme data A1; and as its end time, the end time t4 of the last effective phoneme data A3. Because effective phoneme data and connected phoneme data alike consist of the four pieces of information of frequency (note number), intensity value, start time, and end time, integrating three effective phoneme data items into one connected phoneme data item reduces the data volume to one third.
This means that when MIDI encoding is finally performed, the sound is expressed as one long note rather than three short notes. Effective phoneme data that has no contiguous effective phoneme data at the same frequency, such as effective phoneme data B in FIG. 8A, is left as it is, as shown in FIG. 8B. In the subsequent processing, connected phoneme data and unconnected effective phoneme data of unit length are together treated simply as "phoneme data".
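The connection of step S4 can be sketched as follows. The sketch assumes that times are integers (for example unit-section indices or sample positions) so that "contiguous" can be tested by exact equality of one item's end time and the next item's start time; the `Phoneme` class and the function name are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    note: int         # note number (frequency)
    intensity: float
    start: int
    end: int


def connect_phonemes(effective):
    """Step S4: merge same-note phonemes whose sections are contiguous."""
    connected = []
    last_by_note = {}
    for p in sorted(effective, key=lambda p: (p.note, p.start)):
        last = last_by_note.get(p.note)
        if last is not None and last.end == p.start:
            # Contiguous: extend the end time and keep the maximum intensity.
            last.end = p.end
            last.intensity = max(last.intensity, p.intensity)
        else:
            merged = Phoneme(p.note, p.intensity, p.start, p.end)
            connected.append(merged)
            last_by_note[p.note] = merged
    connected.sort(key=lambda p: (p.start, p.note))
    return connected


# Hypothetical data: A1, A2, A3 are contiguous at note 60; B stands alone.
effective = [
    Phoneme(60, 0.2, 0, 1), Phoneme(62, 0.4, 1, 2),
    Phoneme(60, 0.5, 1, 2), Phoneme(60, 0.3, 2, 3),
    Phoneme(60, 0.1, 4, 5),
]
for p in connect_phonemes(effective):
    print(p)
```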
[0041]
Next, the number of phoneme data items overlapping at each time is checked, and an adjustment is made when this number exceeds the number of sounds that can be produced simultaneously (step S5). Concretely, a duplication management table is first prepared, and phoneme data is registered in it in order of sounding start time. The adjustment of overlapping phoneme data using the duplication management table is described here with reference to FIG. 9. FIG. 9A shows the phoneme data of the phoneme data group arranged in order of start time.
[0042]
FIG. 9A shows only the start time and end time of each phoneme data item. For example, phoneme A starts sounding at time "0" and continues until time "3". For such a phoneme data group, phonemes are registered in the duplication management table time by time. In the following description, the number of simultaneous sounds is assumed to be set to "4". Phonemes are registered in the duplication management table in order of sounding start time. Because the number of simultaneous sounds is "4", the phonemes up to phoneme D are simply registered in the table, as shown in FIG. 9B.
[0043]
When phoneme E is registered in the duplication management table, five phonemes are lined up in the table, as shown in FIG. 9C. In this case, one phoneme must be removed so that the set number of simultaneous sounds, "4", is respected. Concretely, one low-priority phoneme is selected from the five phonemes registered in the table, and the selected phoneme is modified. In the example of FIG. 9C, phoneme A and phoneme B, which have the earliest end time, are the candidates. Since phoneme A and phoneme B share the end time "3", the one with the lower intensity value is modified. For example, if phoneme B has the lower intensity value, the end time of phoneme B is changed to time "2", the start time of phoneme E newly registered in the table, as shown in FIG. 9C. Once the phoneme data to be modified has been determined in the duplication management table, the corresponding phoneme data in the actual phoneme data group is modified as well. Phoneme B, which as a result of the change no longer overlaps in time with the other four phonemes in the table, is deleted from the table.
[0044]
When phoneme F, the next phoneme data, is registered in the duplication management table, five phonemes are again registered in the table, as shown in FIG. 9D. This time phoneme A can be identified as the one with the earliest end time, so the end time of phoneme A is changed to time "2", the start time of phoneme F newly registered in the table. The corresponding phoneme data in the phoneme data group is changed at the same time. The state of the phoneme data in the phoneme data group at this point is shown in FIG. 9E. Comparing FIG. 9A (before the overlap adjustment) with FIG. 9E (after it) shows that the end times of phoneme A and phoneme B have been changed. Phoneme A is then deleted from the duplication management table, the next phoneme is registered, and the above processing is repeated. Performing this overlap adjustment on all phoneme data in the phoneme data group yields a phoneme data group that never exceeds the set number of simultaneous sounds at any time.
[0045]
The overlapping phoneme adjustment of step S5 is now summarized using the flowchart of FIG. 10. First, one phoneme data item of the phoneme data group is registered in the duplication management table, in order of start time (step S11). Next, phoneme data that has ended by the start time of the newly registered phoneme data is deleted from the table (step S12). It is then determined whether the number of phoneme data items in the table exceeds the limit n (step S13). The limit n is normally set to about 16, the number of sounds that can be produced simultaneously under the MIDI standard. If the number of phonemes registered in the table is n or less, the process returns to step S11 and is repeated. The example described with reference to FIG. 9 is the case n = 4, in which steps S11 to S13 are repeated until the number of phonemes in the table reaches 5, as shown in FIG. 9C.
[0046]
If it is determined in step S13 that the number of phonemes registered in the duplication management table exceeds n, the phoneme with the lowest priority is selected from the n + 1 phonemes registered in the table (step S14). In the example described with reference to FIG. 9, when the number of phonemes reaches five as shown in FIG. 9C, phoneme B is selected as having the lowest priority. In the example of FIG. 9, the priority is determined on the basis of the intensity value.
[0047]
Once the lowest-priority phoneme has been selected, either its end time is changed or the phoneme itself is deleted (step S15). As a rule, changing the end time is preferred. Concretely, the end time of the selected phoneme is changed to coincide with the start time of the phoneme newly registered in the duplication management table. The selected phoneme and the newly registered phoneme then no longer overlap in time, and the set limit is not exceeded. In the example described with reference to FIG. 9, the end time of phoneme B is changed to the start time "2" of the newly registered phoneme E, as shown in FIG. 9C, and the end time of phoneme A is changed to the start time "2" of the newly registered phoneme F, as shown in FIG. 9D. The phoneme itself is deleted when changing its end time would make it vanish. For example, for a phoneme whose start time is "1" and whose end time is "2", if the end time is changed to "1", the sounding duration becomes "0". Such data is meaningless even if retained, so the phoneme itself is deleted. When steps S11 to S15 have been performed for all phonemes, that is, for all phoneme data in the phoneme data group, the process ends (step S16).
[0048]
When step S5 has been completed according to the flowchart of FIG. 10, the result may be encoded directly into MIDI-format code data. In this embodiment, however, the total number of phoneme data items is further adjusted to reduce the data volume (step S6). Concretely, when the total number of phoneme data items in the phoneme data group exceeds a set number, phonemes of low importance are deleted to bring the total within the predetermined range. In this embodiment, the value (end time − start time) × intensity value of each phoneme data item is used as the priority, and phonemes are deleted in ascending order of this value. Once the total number of phoneme data items has been adjusted, encoding into the MIDI format is performed (step S7).
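The total-count adjustment of step S6 can be sketched as follows, using the importance measure (end time − start time) × intensity value stated in the text. The tuple layout `(note, intensity, start, end)` and the function name `limit_total` are assumptions made for illustration.

```python
def limit_total(phonemes, max_total):
    """Step S6: keep the max_total phonemes with the highest importance.

    Each phoneme is a (note, intensity, start, end) tuple; importance is
    (end - start) * intensity.
    """
    if len(phonemes) <= max_total:
        return list(phonemes)
    ranked = sorted(phonemes, key=lambda p: (p[3] - p[2]) * p[1], reverse=True)
    kept = ranked[:max_total]
    return sorted(kept, key=lambda p: p[2])  # restore start-time order


# Hypothetical phoneme data; importances are 2.0, 0.9, 0.1 and 1.5.
phonemes = [(60, 0.5, 0, 4), (62, 0.9, 1, 2), (64, 0.1, 2, 3), (65, 0.3, 3, 8)]
print(limit_total(phonemes, max_total=2))
```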
[0049]
(Harmonic component removal processing)
The encoding method of the present invention described above makes it possible to reduce the number of phonemes overlapping at the same time without losing important sounds; in addition, the nature of the method allows harmonic components to be removed. A harmonic is a sound whose frequency is an integer multiple of the frequency of the fundamental, which is the original sound. If harmonic components are encoded as they are, the original sound cannot be reproduced accurately. Expressed in MIDI note numbers, the harmonic components lie at +12, +19, +24, +28, +31, ... relative to the fundamental.
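These offsets follow from Formula 1: raising the note number by Δ multiplies the frequency by 2^(Δ/12), so the k-th harmonic lies at Δ = 12·log2(k), rounded to the nearest semitone. A brief check, with an illustrative function name:

```python
import math


def harmonic_offset(k):
    """Note-number offset of the k-th harmonic, rounded to the nearest semitone."""
    return round(12 * math.log2(k))


# Offsets of the 2nd through 6th harmonics relative to the fundamental.
print([harmonic_offset(k) for k in range(2, 7)])
```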
[0050]
The case in which harmonic components are removed is described next. Concretely, when overlapping phonemes are processed in step S5, the phonemes registered in the duplication management table are examined to see whether the frequency of one phoneme is an integer multiple of the frequency of another. If such a relationship is found, the ratio of the intensity value of the higher-frequency phoneme to the intensity value of the lower-frequency phoneme (regarded as the fundamental) is calculated. If this ratio is at or below a predetermined value, the higher-frequency phoneme is judged to be a harmonic and is deleted. If the ratio exceeds the predetermined value, the phoneme is likely to be a fundamental in its own right rather than a harmonic component, and it is not deleted.
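The harmonic check can be sketched as follows for the phonemes held in the duplication management table at one moment. The sketch tests only the five note-number offsets listed in the text; the `(note, intensity)` tuple layout, the function name, and the threshold value 0.5 are assumptions.

```python
HARMONIC_OFFSETS = (12, 19, 24, 28, 31)  # offsets listed in the text


def remove_harmonics(table, max_ratio):
    """Drop phonemes judged to be harmonics of a lower phoneme in the table.

    table     : list of (note, intensity) pairs that overlap in time
    max_ratio : assumed threshold on intensity(higher) / intensity(lower)
    """
    intensity_of = {note: intensity for note, intensity in table}
    kept = []
    for note, intensity in table:
        is_harmonic = any(
            (note - offset) in intensity_of
            and intensity <= max_ratio * intensity_of[note - offset]
            for offset in HARMONIC_OFFSETS
        )
        if not is_harmonic:
            kept.append((note, intensity))
    return kept


# Hypothetical table: note 69 is a weak octave of note 57 and is removed;
# note 76 is 19 semitones above note 57 but too strong to be a harmonic.
table = [(57, 1.0), (69, 0.3), (76, 0.9), (60, 0.5)]
print(remove_harmonics(table, max_ratio=0.5))
```

The comparison is written as a product rather than a division so that a fundamental with zero intensity does not cause a division error.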
[0051]
Although preferred embodiments of the present invention have been described above, the encoding method is, of course, executed by a computer or similar device. Concretely, a program that executes the steps shown in the flowcharts of FIGS. 7 and 10 according to the above procedure is installed on the computer. The acoustic signal is digitized by PCM or a similar method and loaded into the computer; after steps S1 to S6 and steps S11 to S16 have been performed, code data in a format such as MIDI is output from the computer. When the output code data is MIDI data, for example, it is reproduced as sound using a MIDI sequencer and a MIDI sound source. The duplication management table is implemented by allocating a predetermined storage area, such as RAM, in the computer.
[0052]
[Effect of the invention]
As described above, according to the present invention, a plurality of unit sections are set on the time axis for a given acoustic signal; the correlation between the acoustic signal in each unit section and a plurality of periodic functions is calculated to obtain the intensity value corresponding to each periodic function; unit phoneme data composed of the frequency of each periodic function, the intensity value corresponding to it, the section start time corresponding to the start point of the unit section, and the section end time corresponding to the end point of the unit section is calculated; from all the unit phoneme data obtained by performing this calculation for all unit sections, unit phoneme data whose intensity value does not reach a predetermined value is deleted and the remaining unit phoneme data is extracted as effective phoneme data having an effective intensity value; extracted effective phoneme data items that have the same frequency and contiguous sections are connected into connected phoneme data, which is given as attributes the maximum intensity value of the constituent effective phoneme data as its intensity value, the section start time of the first effective phoneme data as its start time, and the section end time of the last effective phoneme data as its end time; all phoneme data after the connection process is searched for temporally overlapping phoneme data, and among temporally overlapping phoneme data, phoneme data whose frequency attribute is an integer multiple of the frequency of other phoneme data is deleted; and the acoustic signal is expressed by the set of phoneme data remaining after the deletion. This makes it possible to keep the number of phonemes overlapping at the same time at or below a predetermined number without losing important sounds.
[Brief description of the drawings]
FIG. 1 is a diagram showing a basic principle of an audio signal encoding method according to the present invention.
FIG. 2 is a diagram showing an example of a periodic function used in the present invention.
FIG. 3 is a diagram showing the relational expression between the frequency of each periodic function shown in FIG. 2 and the MIDI note number n.
FIG. 4 is a diagram illustrating a method of calculating a correlation between a signal to be analyzed and a periodic signal.
FIG. 5 is a diagram showing the calculation formulas for performing the correlation calculation shown in FIG. 4.
FIG. 6 is a diagram showing a basic method of generalized harmonic analysis.
FIG. 7 is a flowchart of an acoustic signal encoding method according to the present invention.
FIG. 8 is a conceptual diagram for explaining connection of effective phoneme data.
FIG. 9 is a diagram showing a state of a phoneme data group and a duplication management table.
FIG. 10 is a flowchart showing details of step S5 in FIG. 7.
[Explanation of symbols]
A (n), B (n) ... correlation value
d, d1 to d5 ... unit interval
E (n) ... correlation value
G (j) ... Inclusion signal
n, n1 to n6 ... note number
S (j), S (j + 1)... Differential signal
X, X (k) ... section signal

Claims

A method for encoding an acoustic signal, comprising:
a section setting stage for setting a plurality of unit sections on the time axis for a given acoustic signal;
a unit phoneme data calculation stage for calculating the intensity value corresponding to each of a plurality of periodic functions by obtaining the correlation between the acoustic signal in the unit section and the periodic functions, and for calculating unit phoneme data composed of the frequency of each periodic function, the intensity value corresponding to each periodic function, a section start time corresponding to the start point of the unit section, and a section end time corresponding to the end point of the unit section;
an effective phoneme data extraction stage for deleting, from all the unit phoneme data obtained by performing the processing of the unit phoneme data calculation stage for all unit sections, unit phoneme data whose intensity value does not reach a predetermined value, and for extracting the remaining unit phoneme data as effective phoneme data having an effective intensity value;
a phoneme data connection stage for connecting, among the effective phoneme data extracted in the effective phoneme data extraction stage, those having the same frequency and continuous sections to form connected phoneme data, and for giving, as attributes of the connected phoneme data, the maximum intensity value of the constituent effective phoneme data as the intensity value, the section start time of the first effective phoneme data as the start time, and the section end time of the last effective phoneme data as the end time;
an overlapping phoneme number adjustment stage for searching all the phoneme data after the connection process for temporally overlapping phoneme data and for deleting, among the temporally overlapping phoneme data, phoneme data whose frequency attribute is an integer multiple of the frequency of other phoneme data; and
an encoding stage for expressing the acoustic signal by the set of phoneme data remaining after the deletion.
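As an illustration only, the deletion of integer-multiple (harmonic) phoneme data among temporally overlapping phoneme data might be sketched in Python as follows. The `Phoneme` class, the function names, and in particular the relative tolerance `rel_tol` used to decide whether one frequency is an integer multiple of another are assumptions; the claim does not specify how exact the multiple must be.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    freq: float       # frequency attribute (Hz)
    intensity: float
    start: float      # start time (s)
    end: float        # end time (s)


def overlaps(a, b):
    """True when the two phonemes share some interval of time."""
    return a.start < b.end and b.start < a.end


def is_integer_multiple(freq, base, rel_tol=0.01):
    """True when freq is about k * base for some integer k >= 2.

    rel_tol is a hypothetical tolerance: note frequencies on an
    equal-tempered scale are only approximately integer multiples.
    """
    ratio = freq / base
    k = round(ratio)
    return k >= 2 and abs(ratio - k) <= rel_tol * k


def remove_harmonics(phonemes):
    """Delete phonemes that are integer multiples of an overlapping phoneme."""
    kept = []
    for p in phonemes:
        is_harmonic = any(
            q is not p and overlaps(p, q) and is_integer_multiple(p.freq, q.freq)
            for q in phonemes
        )
        if not is_harmonic:
            kept.append(p)
    return kept
```

Each phoneme is compared against the full input set, so a chain such as 220 Hz, 440 Hz, and 880 Hz sounding together is reduced to the 220 Hz phoneme alone.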

The method for encoding an acoustic signal according to claim 1, further comprising, after the overlapping phoneme number adjustment stage, a total phoneme number adjustment stage for reducing the total number of phoneme data by deleting phoneme data of low importance, the importance being based on the product of the sounding duration and the intensity value of the phoneme data.
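A minimal Python sketch of such a total phoneme number adjustment is given below for illustration. It assumes that importance is exactly the product (end time minus start time) times intensity and that a target count `max_total` is supplied by the caller; the `Phoneme` class, the function names, and `max_total` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    freq: float       # frequency attribute (Hz)
    intensity: float
    start: float      # start time (s)
    end: float        # end time (s)


def importance(p):
    """Product of sounding duration and intensity value."""
    return (p.end - p.start) * p.intensity


def limit_total(phonemes, max_total):
    """Keep the max_total most important phonemes, in start-time order."""
    if len(phonemes) <= max_total:
        return list(phonemes)
    ranked = sorted(phonemes, key=importance, reverse=True)
    return sorted(ranked[:max_total], key=lambda p: p.start)
```

A long, quiet phoneme can therefore outrank a short, loud one, which is the behavior the product-based criterion implies.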

The method for encoding an acoustic signal according to claim 1 or 2, wherein the overlapping phoneme number adjustment stage moves the end time of the phoneme data having the smallest intensity value among the temporally overlapping phoneme data toward the start time side, thereby adjusting the number of temporally overlapping phoneme data to be equal to or less than a predetermined value.

The method for encoding an acoustic signal according to any one of claims 1 to 3, wherein the overlapping phoneme number adjustment stage sequentially registers the phoneme data in a duplication management table in order of the start time that is an attribute of the phoneme data; deletes from the duplication management table any already registered phoneme data whose section end time is set before the start time of the phoneme data to be newly registered; and, when the number of phoneme data registered in the duplication management table exceeds a predetermined number, selects one of the phoneme data registered in the duplication management table, corrects the selected phoneme data in the phoneme data group, and deletes the selected phoneme data from the duplication management table, thereby adjusting the number of temporally overlapping phoneme data to be equal to or less than a predetermined value.
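The duplication management table procedure can be sketched in Python as follows, purely as an illustration. It assumes that the selected phoneme is the one with the smallest intensity, that the correction moves its end time to the start time of the newly registered phoneme, and that a phoneme reduced to zero length is dropped; the `Phoneme` class, the function name, and `max_overlap` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass
class Phoneme:
    freq: float       # frequency attribute (Hz)
    intensity: float
    start: float      # start time (s)
    end: float        # end time (s)


def limit_overlap(phonemes, max_overlap):
    """Trim end times so that at most max_overlap phonemes sound at once."""
    # Work on copies, ordered by start time (the phoneme data group).
    group = sorted((replace(p) for p in phonemes), key=lambda p: p.start)
    table = []  # duplication management table: phonemes still sounding
    for new in group:
        # Drop entries that have ended by the time the new phoneme starts.
        table = [p for p in table if p.end > new.start]
        table.append(new)
        while len(table) > max_overlap:
            # Assumed selection rule: the entry with the smallest intensity.
            weakest = min(table, key=lambda p: p.intensity)
            # Correct it in the group: move its end time to the start side.
            weakest.end = new.start
            table = [p for p in table if p is not weakest]
    # A phoneme trimmed to zero length no longer sounds; drop it.
    return [p for p in group if p.end > p.start]
```

Because the table only ever holds phonemes that are still sounding at the current start time, its length is the number of simultaneous phonemes at that instant, and a single pass over the start-time-ordered group suffices.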

A program for causing a computer to execute:
a section setting stage for setting a plurality of unit sections on the time axis for a given acoustic signal;
a unit phoneme data calculation stage for calculating the intensity value corresponding to each of a plurality of periodic functions by obtaining the correlation between the acoustic signal in the unit section and the periodic functions, and for calculating unit phoneme data composed of the frequency of each periodic function, the intensity value corresponding to each periodic function, a section start time corresponding to the start point of the unit section, and a section end time corresponding to the end point of the unit section;
an effective phoneme data extraction stage for deleting, from all the unit phoneme data obtained by performing the processing of the unit phoneme data calculation stage for all unit sections, unit phoneme data whose intensity value does not reach a predetermined value, and for extracting the remaining unit phoneme data as effective phoneme data having an effective intensity value;
a phoneme data connection stage for connecting, among the effective phoneme data extracted in the effective phoneme data extraction stage, those having the same frequency and continuous sections to form connected phoneme data, and for giving, as attributes of the connected phoneme data, the maximum intensity value of the constituent effective phoneme data as the intensity value, the section start time of the first effective phoneme data as the start time, and the section end time of the last effective phoneme data as the end time;
an overlapping phoneme number adjustment stage for searching all the phoneme data after the connection process for temporally overlapping phoneme data and for deleting, among the temporally overlapping phoneme data, phoneme data whose frequency attribute is an integer multiple of the frequency of other phoneme data; and
an encoding stage for expressing the acoustic signal by the set of phoneme data remaining after the deletion.