JP3958841B2

JP3958841B2 - Acoustic signal encoding method and computer-readable recording medium

Info

Publication number: JP3958841B2
Application number: JP24963597A
Authority: JP
Inventors: 敏雄茂出木
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 1997-08-29
Filing date: 1997-08-29
Publication date: 2007-08-15
Anticipated expiration: 2017-08-29
Also published as: JPH1173199A

Abstract

PROBLEM TO BE SOLVED: To encode a vocal analog acoustic signal as MIDI data. SOLUTION: The vocal analog acoustic signal is digitized by PCM, and the positions of its inflection points (local peaks) are detected. For each inflection point, a higher intrinsic frequency is defined from the period at which inflection points of the same polarity appear, and a lower intrinsic frequency is defined from the period at which inflection points of similar signal intensity appear. A run of inflection points whose higher intrinsic frequencies are close to one another is set as a higher unit section Uh(i), and a run of inflection points whose lower intrinsic frequencies are close to one another is set as a lower unit section Ul(i). The higher and lower unit sections overlap each other on the time axis. A representative frequency and a representative intensity are defined for each unit section, and MIDI data is generated for each unit section with a note number corresponding to the representative frequency, a velocity corresponding to the representative intensity, and a delta time corresponding to the length of the unit section.

Description

【０００１】
【発明の属する技術分野】
本発明は音響信号の符号化方法に関し、時系列の強度信号として与えられる音響信号を符号化し、これを復号化して再生する技術に関する。特に、本発明はヴォーカル音響信号（人の話声，歌声の信号）を、ＭＩＤＩ形式の符号データに効率良く変換する処理に適しており、音声を記録する種々の産業分野への応用が期待される。
【０００２】
【従来の技術】
音響信号を符号化する技術として、ＰＣＭ（Pulse Code Modulation ）の手法は最も普及している手法であり、現在、オーディオＣＤやＤＡＴなどの記録方式として広く利用されている。このＰＣＭの手法の基本原理は、アナログ音響信号を所定のサンプリング周波数でサンプリングし、各サンプリング時の信号強度を量子化してデジタルデータとして表現する点にあり、サンプリング周波数や量子化ビット数を高くすればするほど、原音を忠実に再生することが可能になる。ただ、サンプリング周波数や量子化ビット数を高くすればするほど、必要な情報量も増えることになる。そこで、できるだけ情報量を低減するための手法として、信号の変化差分のみを符号化するＡＤＰＣＭ（Adaptive Differential Pulse Code Modulation ）の手法も用いられている。
【０００３】
一方、電子楽器による楽器音を符号化しようという発想から生まれたＭＩＤＩ（Musical Instrument Digital Interface）規格も、パーソナルコンピュータの普及とともに盛んに利用されるようになってきている。このＭＩＤＩ規格による符号データ（以下、ＭＩＤＩデータという）は、基本的には、楽器のどの鍵盤キーを、どの程度の強さで弾いたか、という楽器演奏の操作を記述したデータであり、このＭＩＤＩデータ自身には、実際の音の波形は含まれていない。そのため、実際の音を再生する場合には、楽器音の波形を記憶したＭＩＤＩ音源が別途必要になる。しかしながら、上述したＰＣＭの手法で音を記録する場合に比べて、情報量が極めて少なくてすむという特徴を有し、その符号化効率の高さが注目を集めている。このＭＩＤＩ規格による符号化および復号化の技術は、現在、パーソナルコンピュータを用いて楽器演奏、楽器練習、作曲などを行うソフトウエアに広く採り入れられており、カラオケ、ゲームの効果音といった分野でも広く利用されている。
【０００４】
【発明が解決しようとする課題】
上述したように、ＰＣＭの手法により音響信号を符号化する場合、十分な音質を確保しようとすれば情報量が膨大になり、データ処理の負担が重くならざるを得ない。したがって、通常は、ある程度の情報量に抑えるため、ある程度の音質に妥協せざるを得ない。もちろん、ＭＩＤＩ規格による符号化の手法を採れば、非常に少ない情報量で十分な音質をもった音の再生が可能であるが、上述したように、ＭＩＤＩ規格そのものが、もともと楽器演奏の操作を符号化するためのものであるため、広く一般音響への適用を行うことはできない。別言すれば、ＭＩＤＩデータを作成するためには、実際に楽器を演奏するか、あるいは、楽譜の情報を用意する必要がある。
【０００５】
このように、従来用いられているＰＣＭの手法にしても、ＭＩＤＩの手法にしても、それぞれ音響信号の符号化方法としては一長一短があり、一般の音響について、少ない情報量で十分な音質を確保することはできない。ところが、一般の音響についても効率的な符号化を行いたいという要望は、益々強くなってきている。いわゆるヴォーカル音響と呼ばれる人間の話声や歌声を取り扱う分野では、かねてからこのような要望が強く出されている。たとえば、語学教育、声楽教育、犯罪捜査などの分野では、ヴォーカル音響信号を効率的に符号化する技術が切望されている。ところが、ヴォーカル音響には、基本周波数のほか、その倍音以外の高調波成分が混在するというホルマント特性が現われることが知られており、これまでの技術では効率的な符号化を行うことができなかった。
【０００６】
そこで本発明は、人の声音や歌声を含む音響信号に対しても効率的な符号化を行うことができる音響信号の符号化方法を提供することを目的とする。
【０００７】
【課題を解決するための手段】
(1) 本発明の第１の態様は、時系列の強度信号として与えられる音響信号を符号化するための音響信号の符号化方法において、
符号化対象となる音響信号を、デジタルの音響データとして取り込む入力段階と、
取り込んだ音響データの波形について変極点を求める変極点定義段階と、
この音響データの時間軸上に、少なくとも一部分が重複する複数の単位区間を設定する区間設定段階と、
個々の単位区間内の音響データに基づいて、個々の単位区間を代表する所定の代表周波数および代表強度を定義し、時間軸上での個々の単位区間の始端位置および終端位置を示す情報と代表周波数および代表強度を示す情報とを含む符号データを生成し、個々の単位区間の音響データを個々の符号データによって表現する符号化段階と、
を行い、
区間設定段階では、ある着目する変極点について、その近傍において所定の条件を満たす特定の変極点を探索し、探索された変極点との間の時間軸上での距離に基づいて、当該着目する変極点についての固有周波数を定義する固有周波数定義方法を、前記所定の条件を変えることにより複数通り設定し、これら複数通りの固有周波数定義方法を用いて各変極点に複数通りの固有周波数を定義し、同一の固有周波数定義方法で定義された固有周波数が所定の近似範囲内となるような一群の変極点を含む区間を、当該固有周波数定義方法が関与する１つの単位区間として設定し、
符号化段階では、単位区間内に含まれる変極点について定義された複数通りの固有周波数のうち、当該単位区間の設定に関与した固有周波数定義方法で定義された固有周波数に基づいて当該単位区間の代表周波数を定義し、当該単位区間内に含まれる変極点のもつ信号強度に基づいて当該単位区間の代表強度を定義するようにしたものである。
【００１１】
(2) 本発明の第２の態様は、上述の第１の態様に係る音響信号の符号化方法において、
入力段階で、正および負の両極性デジタル値を信号強度としてもった音響データを用意し、
区間設定段階で、第１の固有周波数定義方法として、「着目する変極点と同極性の変極点」という条件を満たす特定の変極点を探索し、探索された変極点との間の時間軸上での距離に基づいて、着目する変極点についての固有周波数を定義する方法を設定し、第２の固有周波数定義方法として、「着目する変極点に近似した信号強度をもつ変極点」という条件を満たす特定の変極点を探索し、探索された変極点との間の時間軸上での距離に基づいて、着目する変極点についての固有周波数を定義する方法を設定するようにしたものである。
【００１２】
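(3) A third aspect of the present invention is the acoustic signal encoding method according to the second aspect described above, wherein,
in the section setting stage, a plurality of intrinsic frequency definition methods are set such that a plurality of intrinsic frequencies can be defined within a range whose upper limit is the intrinsic frequency fh defined by the first intrinsic frequency definition method and whose lower limit is the intrinsic frequency fl defined by the second intrinsic frequency definition method.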
【００１３】
(4) 本発明の第４の態様は、上述の第１〜第３の態様に係る音響信号の符号化方法において、
符号化段階で、代表周波数に基づいてノートナンバーを定め、代表強度に基づいてベロシティーを定め、単位区間の長さに基づいてデルタタイムを定め、１つの単位区間の音響データを、ノートナンバー、ベロシティー、デルタタイムで表現されるＭＩＤＩ形式の符号データに変換し、時間軸上で重複する単位区間に対しては異なるチャンネルを割り当てるようにしたものである。
【００１４】
(5) 本発明の第５の態様は、上述の第１〜第４の態様に係る音響信号の符号化方法を実行する音響信号の符号化のためのプログラムを、コンピュータ読み取り可能な記録媒体に記録するようにしたものである。
【００１６】
【発明の実施の形態】
以下、本発明を図示する実施形態に基づいて説明する。本願発明は、特開平１０−２４７０９９号公報に開示された発明（以下、先願発明という）を基本発明とした改良発明に相当するものである。したがって、以下の説明では、まず、§１〜§３において先願発明に係る符号化方法を説明することにする。
【００１７】
§１．先願発明に係る音響信号の符号化方法の基本原理
はじめに、先願発明に係る音響信号の符号化方法の基本原理を図１を参照しながら説明する。いま、図１の上段に示すように、時系列の強度信号としてアナログ音響信号が与えられたものとしよう。図示の例では、横軸に時間軸ｔ、縦軸に信号強度Ａをとってこの音響信号を示している。先願発明では、まずこのアナログ音響信号を、デジタルの音響データとして取り込む処理を行う。これは、従来の一般的なＰＣＭの手法を用い、所定のサンプリング周波数でこのアナログ音響信号をサンプリングし、信号強度Ａを所定の量子化ビット数を用いてデジタルデータに変換する処理を行えばよい。ここでは、説明の便宜上、ＰＣＭの手法でデジタル化した音響データの波形も、図１の上段のアナログ音響信号と同一の波形で示すことにする。
【００１８】
次に、このデジタル音響データの時間軸ｔ上に複数の単位区間を設定する。図示の例では、６つの単位区間Ｕ１〜Ｕ６が設定されている。第ｉ番目の単位区間Ｕｉは、時間軸ｔ上の始端ｓｉおよび終端ｅｉの座標値によって、その時間軸ｔ上での位置と長さとが示される。たとえば、単位区間Ｕ１は、始端ｓ１〜終端ｅ１までの（ｅ１−ｓ１）なる長さをもつ区間である。
【００１９】
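Once a plurality of unit sections has been set in this way, a representative frequency and a representative intensity that represent each unit section are defined on the basis of the acoustic data within that section. The figure shows the state in which a representative frequency Fi and a representative intensity Ai have been defined for the i-th unit section Ui. For example, a representative frequency F1 and a representative intensity A1 are defined for the first unit section U1. The representative frequency F1 is a representative value of the frequency components of the acoustic data contained in the section from the start s1 to the end e1, and the representative intensity A1 is likewise a representative value of the signal intensity of the acoustic data contained in the section from s1 to e1. The acoustic data within unit section U1 usually contains more than one frequency component, and its signal intensity usually fluctuates. The key point of the invention of the prior application is that a single representative frequency and a single representative intensity are defined for each unit section, and encoding is performed using these representative values.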
【００２０】
すなわち、個々の単位区間について、それぞれ代表周波数および代表強度が定義されたら、時間軸ｔ上での個々の単位区間の始端位置および終端位置を示す情報と、定義された代表周波数および代表強度を示す情報と、により符号データを生成し、個々の単位区間の音響データを個々の符号データによって表現するのである。単一の周波数をもち、単一の信号強度をもった音響信号が、所定の期間だけ持続する、という事象を符号化する手法として、ＭＩＤＩ規格に基づく符号化を利用することができる。ＭＩＤＩ規格による符号データ（ＭＩＤＩデータ）は、いわば音符によって音を表現したデータということができ、図１では、下段に示す音符によって、最終的に得られる符号データの概念を示している。
【００２１】
結局、各単位区間内の音響データは、代表周波数Ｆ１に相当する音程情報（ＭＩＤＩ規格におけるノートナンバー）と、代表強度Ａ１に相当する強度情報（ＭＩＤＩ規格におけるベロシティー）と、単位区間の長さ（ｅ１−ｓ１）に相当する長さ情報（ＭＩＤＩ規格におけるデルタタイム）と、をもった符号データに変換されることになる。このようにして得られる符号データの情報量は、もとの音響信号のもつ情報量に比べて、著しく小さくなり、飛躍的な符号化効率が得られることになる。これまで、ＭＩＤＩデータを生成する手法としては、演奏者が実際に楽器を演奏するときの操作をそのまま取り込んで符号化するか、あるいは、楽譜上の音符をデータとして入力するしかなかったが、上述した手法を用いれば、実際のアナログ音響信号からＭＩＤＩデータを直接生成することが可能になる。
【００２２】
もっとも、上述した手法による符号化方法を実用化するためには、いくつか留意すべき点がある。第１の留意点は、再生時に音源を用意する必要があるという点である。上述の手法によって最終的に得られる符号データには、もとの音響信号の波形データそのものは含まれていないため、何らかの音響波形のデータをもった音源が必要になる。たとえば、ＭＩＤＩデータを再生する場合には、ＭＩＤＩ音源が必要になる。もっとも、ＭＩＤＩ規格が普及した現在では、種々のＭＩＤＩ音源が入手可能であり、実用上は大きな問題は生じない。ただ、もとの音響信号に忠実な再生音を得るためには、もとの音響信号に含まれていた音響波形に近似した波形データをもったＭＩＤＩ音源を用意する必要がある。適当なＭＩＤＩ音源を用いた再生を行うことができれば、むしろもとの音響信号よりも高い音質で、臨場感あふれる再生音を得ることも可能になる。
【００２３】
第２の留意点は、１つの単位区間に含まれる音響データの周波数を、単一の代表周波数に置き換えてしまうという基本原理に基づく符号化手法であるため、非常に幅の広い周波数成分を同時に含んでいるような音響信号の符号化には不向きであるという点である。もちろん、この符号化手法は、どのような音響信号に対しても適用可能であるが、人間の声音のように、ホルマントと呼ばれる複数の特徴周波数成分をもつ音響信号に対して符号化を行っても、再生時に十分な再現性は得られなくなる。したがって、先願発明の符号化手法は、主として、生体の発生するリズム音や、波や風などの自然が発生するリズム音のように、個々の単位区間内には、ある程度限定された周波数成分のみを含む音響信号に対して利用するのが好ましい。本願発明は、先願発明のこの点を改良し、人間の声音のように、ホルマントと呼ばれる複数の特徴周波数成分をもつ音響信号に対して符号化を行っても、十分な再現性を確保できるようにしたものである。その具体的な方法については、§４以降で述べることにする。
【００２４】
第３の留意点は、効率的で再現性の高い符号化を行うためには、単位区間の設定方法に工夫を凝らす必要があるという点である。先願発明の基本原理は、上述したように、もとの音響データを複数の単位区間に分割し、各単位区間ごとに、単一周波数および単一強度を示す符号データに変換するという点にある。したがって、最終的に得られる符号データは、単位区間の設定方法に大きく依存することになる。最も単純な単位区間の設定方法は、時間軸上で、たとえば１０ｍｓごとというように、等間隔に単位区間を一義的に定義する方法である。しかしながら、この方法では、符号化対象となるもとの音響データにかかわらず、常に一定の方法で単位区間の定義が行われることになり、必ずしも効率的で再現性の高い符号化は期待できない。したがって、実用上は、もとの音響データの波形を解析し、個々の音響データに適した単位区間の設定を行うようにするのが好ましい。
【００２５】
効率的な単位区間の設定を行う１つのアプローチは、音響データの中で周波数帯域が近似した区間を１つのまとまった単位区間として抽出するという方法である。単位区間内の周波数成分は１つの代表周波数によって置き換えられてしまうので、この代表周波数とあまりにかけ離れた周波数成分が含まれていると、再生時の再現性が低減する。したがって、ある程度近似した周波数が持続する区間を１つの単位区間として抽出することは、再現性のよい効率的な符号化を行う上で重要である。このアプローチを採る場合、具体的には、もとの音響データの周波数の変化点を認識し、この変化点を境界とする単位区間の設定を行うようにすればよい。
【００２６】
効率的な単位区間の設定を行うもう１つのアプローチは、音響データの中で信号強度が近似した区間を１つのまとまった単位区間として抽出するという方法である。単位区間内の信号強度は１つの代表強度によって置き換えられてしまうので、この代表強度とあまりにかけ離れた信号強度が含まれていると、再生時の再現性が低減する。したがって、ある程度近似した信号強度が持続する区間を１つの単位区間として抽出することは、再現性のよい効率的な符号化を行う上で重要である。このアプローチを採る場合、具体的には、もとの音響データの信号強度の変化点を認識し、この変化点を境界とする単位区間の設定を行うようにすればよい。
【００２７】
§２．先願発明に係る音響信号の符号化方法の実用的な手順
図２は、先願発明のより実用的な手順を示す流れ図である。この手順は、入力段階Ｓ１０、変極点定義段階Ｓ２０、区間設定段階Ｓ３０、符号化段階Ｓ４０の４つの大きな段階から構成されている。入力段階Ｓ１０は、符号化対象となる音響信号を、デジタルの音響データとして取り込む段階である。変極点定義段階Ｓ２０は、後の区間設定段階Ｓ３０の準備段階ともいうべき段階であり、取り込んだ音響データの波形について変極点（ローカルピーク）を求める段階である。また、区間設定段階Ｓ３０は、この変極点に基づいて、音響データの時間軸上に複数の単位区間を設定する段階であり、符号化段階Ｓ４０は、個々の単位区間の音響データを個々の符号データに変換する段階である。符号データへの変換原理は、既に§１で述べたとおりである。すなわち、個々の単位区間内の音響データに基づいて、個々の単位区間を代表する所定の代表周波数および代表強度を定義し、時間軸上での個々の単位区間の始端位置および終端位置を示す情報と、代表周波数および代表強度を示す情報と、によって符号データが生成されることになる。以下、これらの各段階において行われる処理を順に説明する。
【００２８】
＜＜＜２．１入力段階＞＞＞
入力段階Ｓ１０では、サンプリング処理Ｓ１１と直流成分除去処理Ｓ１２とが実行される。サンプリング処理Ｓ１１は、符号化の対象となるアナログ音響信号を、デジタルの音響データとして取り込む処理であり、従来の一般的なＰＣＭの手法を用いてサンプリングを行う処理である。この実施形態では、サンプリング周波数：４４．１ｋＨｚ、量子化ビット数：１６ビットという条件でサンプリングを行い、デジタルの音響データを用意している。
【００２９】
続く、直流成分除去処理Ｓ１２は、入力した音響データに含まれている直流成分を除去するデジタル処理である。たとえば、図３に示す音響データは、振幅の中心レベルが、信号強度を示すデータレンジの中心レベル（具体的なデジタル値としては、たとえば、１６ビットでサンプリングを行い、０〜６５５３５のデータレンジが設定されている場合には３２７６８なる値。以下、説明の便宜上、図３のグラフに示すように、データレンジの中心レベルに０をとり、サンプリングされた個々の信号強度の値を正または負で表現する）よりもＤだけ高い位置にきている。別言すれば、この音響データには、値Ｄに相当する直流成分が含まれていることになる。サンプリング処理の対象になったアナログ音響信号に直流成分が含まれていると、デジタル音響データにもこの直流成分が残ることになる。そこで、直流成分除去処理Ｓ１２によって、この直流成分Ｄを除去する処理を行い、振幅の中心レベルとデータレンジの中心レベルとを一致させる。具体的には、サンプリングされた個々の信号強度の平均が０になるように、直流成分Ｄを差し引く演算を行えばよい。これにより、正および負の両極性デジタル値を信号強度としてもった音響データが用意できる。
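The DC component removal S12 described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `remove_dc` and the sample values are assumptions, and the DC component D is estimated as the mean of the samples, as the paragraph suggests.

```python
def remove_dc(samples):
    """Subtract the mean so that the average signal intensity becomes zero."""
    if not samples:
        return []
    d = sum(samples) / len(samples)  # estimated DC component D
    return [s - d for s in samples]


# Unsigned 16-bit samples are first re-centred on the data-range midpoint (32768).
raw = [32868, 33068, 32868, 32668]
centred = [s - 32768 for s in raw]   # [100, 300, 100, -100]
signed = remove_dc(centred)          # the mean of 100 is removed
```

After this step the samples take both positive and negative values around zero, which is what the later polarity-based processing relies on.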
【００３０】
＜＜＜２．２変極点定義段階＞＞＞
変極点定義段階Ｓ２０では、変極点探索処理Ｓ２１と同極性変極点の間引処理Ｓ２２とが実行される。変極点探索処理Ｓ２１は、取り込んだ音響データの波形について変極点を求める処理である。図４は、図３に示す音響データの一部を時間軸に関して拡大して示したグラフである。このグラフでは、矢印Ｐ１〜Ｐ６の先端位置の点が変極点（極大もしくは極小の点）に相当し、各変極点はいわゆるローカルピークに相当する点となる。このような変極点を探索する方法としては、たとえば、サンプリングされたデジタル値を時間軸に沿って順に注目してゆき、増加から減少に転じた位置、あるいは減少から増加に転じた位置を認識すればよい。ここでは、この変極点を図示のような矢印で示すことにする。
【００３１】
各変極点は、サンプリングされた１つのデジタルデータに対応する点であり、所定の信号強度の情報（矢印の長さに相当）をもつとともに、時間軸ｔ上での位置の情報をもつことになる。図５は、図４に矢印で示す変極点Ｐ１〜Ｐ６のみを抜き出して示した図である。以下の説明では、この図５に示すように、第ｉ番目の変極点Ｐｉのもつ信号強度（絶対値）を矢印の長さａｉとして示し、時間軸ｔ上での変極点Ｐｉの位置をｔｉとして示すことにする。結局、変極点探索処理Ｓ２１は、図３に示すような音響データに基づいて、図５に示すような各変極点に関する情報を求める処理ということになる。
【００３２】
ところで、図５に示す各変極点Ｐ１〜Ｐ６は、交互に極性が反転する性質を有する。すなわち、図５の例では、奇数番目の変極点Ｐ１，Ｐ３，Ｐ５は上向きの矢印で示され、偶数番目の変極点Ｐ２，Ｐ４，Ｐ６は下向きの矢印で示されている。これは、もとの音響データ波形の振幅が正負交互に現れる振動波形としての本来の姿をしているためである。しかしながら、実際には、このような本来の振動波形が必ずしも得られるとは限らず、たとえば、図６に示すように、多少乱れた波形が得られる場合もある。この図６に示すような音響データに対して変極点探索処理Ｓ２１を実行すると、個々の変極点Ｐ１〜Ｐ７のすべてが検出されてしまうため、図７に示すように、変極点を示す矢印の向きは交互に反転するものにはならない。しかしながら、単一の代表周波数を定義する上では、向きが交互に反転した矢印列が得られるのが好ましい。
【００３３】
同極性変極点の間引処理Ｓ２２は、図７に示すように、同極性のデジタル値をもった変極点（同じ向きの矢印）が複数連続した場合に、絶対値が最大のデジタル値をもった変極点（最も長い矢印）のみを残し、残りを間引きしてしまう処理である。図７に示す例の場合、上向きの３本の矢印Ｐ１〜Ｐ３のうち、最も長いＰ２のみが残され、下向きの３本の矢印Ｐ４〜Ｐ６のうち、最も長いＰ４のみが残され、結局、間引処理Ｓ２２により、図８に示すように、３つの変極点Ｐ２，Ｐ４，Ｐ７のみが残されることになる。この図８に示す変極点は、図６に示す音響データの波形の本来の姿に対応したものになる。
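The inflection point search S21 and the same-polarity thinning S22 can be sketched as follows. The function names are hypothetical, flat plateaus (equal neighbouring samples) are deliberately ignored, and a value of exactly zero is treated as positive; none of these details are specified in the text.

```python
def find_inflection_points(x):
    """Return (sample_index, value) wherever the slope changes sign."""
    points = []
    for i in range(1, len(x) - 1):
        is_max = x[i] > x[i - 1] and x[i] > x[i + 1]
        is_min = x[i] < x[i - 1] and x[i] < x[i + 1]
        if is_max or is_min:
            points.append((i, x[i]))
    return points


def thin_same_polarity(points):
    """Within each run of same-sign points keep only the largest |value|."""
    kept = []
    for idx, val in points:
        if kept and (kept[-1][1] >= 0) == (val >= 0):
            # Same polarity as the previous survivor: keep the longer arrow.
            if abs(val) > abs(kept[-1][1]):
                kept[-1] = (idx, val)
        else:
            kept.append((idx, val))
    return kept


wave = [0, 3, 1, 5, 2, -4, -1, -6, 0, 2, 0]
peaks = find_inflection_points(wave)   # seven raw local peaks
thinned = thin_same_polarity(peaks)    # polarity now alternates
```

As in the FIG. 7 to FIG. 8 example, seven raw inflection points reduce to three whose polarities alternate.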
【００３４】
＜＜＜２．３区間設定段階＞＞＞
既に述べたように、先願発明に係る符号化方法において、効率的で再現性の高い符号化を行うためには、単位区間の設定方法に工夫を凝らす必要がある。その意味で、図２に示す各段階のうち、区間設定段階Ｓ３０は、実用上非常に重要な段階である。上述した変極点定義段階Ｓ２０は、この区間設定段階Ｓ３０の準備段階になっており、単位区間の設定は、個々の変極点の情報を利用して行われる。すなわち、この区間設定段階Ｓ３０では、変極点に基づいて音響データの周波数もしくは信号強度の変化点を認識し、この変化点を境界とする単位区間を設定する、という基本的な考え方に沿って処理が進められる。
【００３５】
図５に示すように、矢印で示されている個々の変極点Ｐ１〜Ｐ６には、それぞれ信号強度ａ１〜ａ６が定義されている。しかしながら、個々の変極点Ｐ１〜Ｐ６それ自身には、周波数に関する情報は定義されていない。区間設定段階Ｓ３０において最初に行われる固有周波数定義処理Ｓ３１は、個々の変極点それぞれに、所定の固有周波数を定義する処理である。本来、周波数というものは、時間軸上の所定の区間内の波について定義される物理量であり、時間軸上のある１点について定義されるべきものではない。ただ、ここでは便宜上、個々の変極点について、疑似的に固有周波数なるものを定義することにする（一般に、物理学における「固有周波数」という文言は、物体が音波などに共鳴して振動する物体固有の周波数を意味するが、本願における「固有周波数」とは、このような物体固有の周波数を意味するものではなく、個々の変極点それぞれに定義された疑似的な周波数、別言すれば、信号のある瞬間における基本周波数を意味するものである。）。
【００３６】
いま、図９に示すように、多数の変極点のうち、第ｎ番目〜第（ｎ＋２）番目の変極点Ｐ（ｎ），Ｐ（ｎ＋１），Ｐ（ｎ＋２）に着目する。これら各変極点には、それぞれ信号値ａ（ｎ），ａ（ｎ＋１），ａ（ｎ＋２）が定義されており、また、時間軸上での位置ｔ（ｎ），ｔ（ｎ＋１），ｔ（ｎ＋２）が定義されている。ここで、これら各変極点が、音声データ波形のローカルピーク位置に相当する点であることを考慮すれば、図示のように、変極点Ｐ（ｎ）とＰ（ｎ＋２）との間の時間軸上での距離φは、もとの波形の１周期に対応することがわかる。そこで、たとえば、第ｎ番目の変極点Ｐ（ｎ）の固有周波数ｆ（ｎ）なるものを、ｆ（ｎ）＝１／φと定義すれば、個々の変極点について、それぞれ固有周波数を定義することができる。時間軸上での位置ｔ（ｎ），ｔ（ｎ＋１），ｔ（ｎ＋２）が、「秒」の単位で表現されていれば、
φ＝（ｔ（ｎ＋２）−ｔ（ｎ））
であるから、
ｆ（ｎ）＝１／（ｔ（ｎ＋２）−ｔ（ｎ））
として定義できる。
【００３７】
なお、実際のデジタルデータ処理の手順を考慮すると、個々の変極点の位置は、「秒」の単位ではなく、サンプル番号ｘ（サンプリング処理Ｓ１１における何番目のサンプリング時に得られたデータであるかを示す番号）によって表されることになるが、このサンプル番号ｘと実時間「秒」とは、サンプリング周波数ｆｓによって一義的に対応づけられる。たとえば、第ｍ番目のサンプルｘ（ｍ）と第（ｍ＋１）番目のサンプルｘ（ｍ＋１）との間の実時間軸上での間隔は、１／ｆｓになる。
【００３８】
さて、このようにして個々の変極点に定義された固有周波数は、物理的には、その変極点付近のローカルな周波数を示す量ということになる。隣接する別な変極点との距離が短ければ、その付近のローカルな周波数は高く、隣接する別な変極点との距離が長ければ、その付近のローカルな周波数は低いということになる。もっとも、上述の例では、後続する２つ目の変極点との間の距離に基づいて固有周波数を定義しているが、固有周波数の定義方法としては、この他どのような方法を採ってもかまわない。たとえば、第ｎ番目の変極点の固有周波数ｆ（ｎ）を、先行する第（ｎ−２）番目の変極点との間の距離を用いて、
ｆ（ｎ）＝１／（ｔ（ｎ）−ｔ（ｎ−２））
と定義することもできる。また、前述したように、後続する２つ目の変極点との間の距離に基づいて、固有周波数ｆ（ｎ）を、
ｆ（ｎ）＝１／（ｔ（ｎ＋２）−ｔ（ｎ））
なる式で定義した場合であっても、最後の２つの変極点については、後続する２つ目の変極点が存在しないので、先行する変極点を利用して、
ｆ（ｎ）＝１／（ｔ（ｎ）−ｔ（ｎ−２））
なる式で定義すればよい。
【００３９】
あるいは、後続する次の変極点との間の距離に基づいて、第ｎ番目の変極点の固有周波数ｆ（ｎ）を、
ｆ（ｎ）＝（１／２）・１／（ｔ（ｎ＋１）−ｔ（ｎ））
なる式で定義することもできるし、後続する３つ目の変極点との間の距離に基づいて、
ｆ（ｎ）＝（３／２）・１／（ｔ（ｎ＋３）−ｔ（ｎ））
なる式で定義することもできる。結局、一般式を用いて示せば、第ｎ番目の変極点についての固有周波数ｆ（ｎ）は、ｋ個離れた変極点（ｋが正の場合は後続する変極点、負の場合は先行する変極点）との間の時間軸上での距離に基づいて、
ｆ（ｎ）＝（ｋ／２）・１／（ｔ（ｎ＋ｋ）−ｔ（ｎ））
なる式で定義することができる。ｋの値は、予め適当な値に設定しておけばよい。変極点の時間軸上での間隔が比較的小さい場合には、ｋの値をある程度大きく設定した方が、誤差の少ない固有周波数を定義することができる。ただし、ｋの値をあまり大きく設定しすぎると、ローカルな周波数としての意味が失われてしまうことになり好ましくない。
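The general formula f(n) = (k/2) / (t(n+k) - t(n)) can be evaluated directly on sample numbers, since adjacent samples are 1/fs apart. The sketch below is an assumption-laden illustration: the function name is hypothetical, and falling back to the point k steps in the opposite direction near the ends of the data follows the suggestion made for the last two inflection points.

```python
def intrinsic_frequencies(sample_idx, fs, k=2):
    """Return f(n) = (k/2) / (t(n+k) - t(n)) for every inflection point.

    sample_idx holds the sample number of each inflection point, so the
    time difference is (sample_idx[j] - sample_idx[n]) / fs seconds.
    """
    count = len(sample_idx)
    freqs = []
    for n in range(count):
        j = n + k
        if not 0 <= j < count:
            j = n - k           # fall back to the point k steps the other way
        if not 0 <= j < count:
            freqs.append(None)  # too few points to define a frequency
            continue
        freqs.append((abs(k) / 2) * fs / abs(sample_idx[j] - sample_idx[n]))
    return freqs


# Peaks alternating every 50 samples at 44.1 kHz: one cycle spans 100 samples.
example = intrinsic_frequencies([0, 50, 100, 150, 200], 44100)
```

With k = 2 each value is the reciprocal of one full cycle, which here is 441 Hz for every point.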
【００４０】
こうして、固有周波数定義処理Ｓ３１が完了すると、個々の変極点Ｐ（ｎ）には、信号強度ａ（ｎ）と、固有周波数ｆ（ｎ）と、時間軸上での位置ｔ（ｎ）とが定義されることになる。
【００４１】
さて、§１では、効率的で再現性の高い符号化を行うためには、１つの単位区間に含まれる変極点の周波数が所定の近似範囲内になるように単位区間を設定するという第１のアプローチと、１つの単位区間に含まれる変極点の信号強度が所定の近似範囲内になるように単位区間を設定するという第２のアプローチとがあることを述べた。ここでは、この２つのアプローチを用いた単位区間の設定手法を、具体例に即して説明しよう。
【００４２】
いま、図１０に示すように、９つの変極点Ｐ１〜Ｐ９のそれぞれについて、信号強度ａ１〜ａ９と固有周波数ｆ１〜ｆ９とが定義されている場合を考える。この場合、第１のアプローチに従えば、個々の固有周波数ｆ１〜ｆ９に着目し、互いに近似した固有周波数をもつ空間的に連続した変極点の一群を１つの単位区間とする処理を行えばよい。たとえば、固有周波数ｆ１〜ｆ５がほぼ同じ値（第１の基準値）をとり、固有周波数ｆ６〜ｆ９がほぼ同じ値（第２の基準値）をとっており、第１の基準値と第２の基準値との差が所定の許容範囲を越えていた場合、図１０に示すように、第１の基準値の近似範囲に含まれる固有周波数ｆ１〜ｆ５をもつ変極点Ｐ１〜Ｐ５を含む区間を単位区間Ｕ１とし、第２の基準値の近似範囲に含まれる固有周波数ｆ６〜ｆ９をもつ変極点Ｐ６〜Ｐ９を含む区間を単位区間Ｕ２として設定すればよい。先願発明による手法では、１つの単位区間については、単一の代表周波数が与えられることになるが、このように、固有周波数が互いに近似範囲内にある複数の変極点が存在する区間を１つの単位区間として設定すれば、代表周波数と個々の固有周波数との差が所定の許容範囲内に抑えられることになり、大きな問題は生じない。
【００４３】
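Next, one specific example is given of a technique for grouping inflection points with similar intrinsic frequencies and defining one unit section from them. Suppose that nine inflection points P1 to P9 are given as shown in FIG. 10. First, the intrinsic frequencies of P1 and P2 are compared to check whether their difference lies within a predetermined allowable range ff. If
|f1 - f2| < ff
holds, P1 and P2 are included in the first unit section U1. Next, it is checked whether P3 may also be included in U1. This is done by comparing the mean intrinsic frequency of U1, (f1 + f2) / 2, with f3. If
|(f1 + f2) / 2 - f3| < ff
holds, P3 is included in U1. Likewise, P4 can be included in U1 if
|(f1 + f2 + f3) / 3 - f4| < ff
holds, and P5 can be included in U1 if
|(f1 + f2 + f3 + f4) / 4 - f5| < ff
holds. Now suppose that, for P6, the result
|(f1 + f2 + f3 + f4 + f5) / 5 - f6| > ff
is obtained, that is, the difference between the intrinsic frequency f6 and the mean intrinsic frequency of U1 exceeds the allowable range ff. A discontinuity has then been detected between P5 and P6, and P6 cannot be included in U1. P5 is therefore taken as the end of the first unit section U1, and P6 as the start of a separate second unit section U2. The intrinsic frequencies of P6 and P7 are then compared to check whether their difference lies within the allowable range ff. If
|f6 - f7| < ff
holds, P6 and P7 are included in the second unit section U2. Next, P8 is included in U2 if
|(f6 + f7) / 2 - f8| < ff
holds, and P9 is included in U2 if
|(f6 + f7 + f8) / 3 - f9| < ff
holds.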
【００４４】
このような手法で、不連続位置の検出を順次行ってゆき、各単位区間を順次設定してゆけば、上述した第１のアプローチに沿った区間設定が可能になる。もちろん、上述した具体的な手法は、一例として示したものであり、この他にも種々の手法を採ることができる。たとえば、平均値と比較する代わりに、常に隣接する変極点の固有周波数を比較し、差が許容範囲ｆｆを越えた場合に不連続位置と認識する簡略化した手法を採ってもかまわない。すなわち、ｆ１とｆ２との差、ｆ２とｆ３との差、ｆ３とｆ４との差、…というように、個々の差を検討してゆき、差が許容範囲ｆｆを越えた場合には、そこを不連続位置として認識すればよい。
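The running-average test described above can be sketched as a single pass over the intrinsic frequencies. The function name `split_by_frequency` and the sample values are assumptions made for illustration; only the comparison of each new point against the mean of the current group comes from the text.

```python
def split_by_frequency(freqs, ff):
    """Group consecutive indices while |mean(current group) - f| < ff."""
    groups = []
    current = []
    for i, f in enumerate(freqs):
        if current:
            mean = sum(freqs[j] for j in current) / len(current)
            if abs(mean - f) < ff:
                current.append(i)
                continue
            groups.append(current)  # discontinuity: close the current group
        current = [i]
    if current:
        groups.append(current)
    return groups


# Nine points as in FIG. 10: the first five near 440 Hz, the last four near 880 Hz.
f = [440, 442, 438, 441, 439, 880, 882, 879, 881]
sections = split_by_frequency(f, ff=20)
```

The simplified variant mentioned in the text would compare `f` with `freqs[current[-1]]` instead of the group mean.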
【００４５】
以上、第１のアプローチについて述べたが、第２のアプローチに基づく単位区間の設定も同様に行うことができる。この場合は、個々の変極点の信号強度ａ１〜ａ９に着目し、所定の許容範囲ａａとの比較を行うようにすればよい。もちろん、第１のアプローチと第２のアプローチとの双方を組み合わせて、単位区間の設定を行ってもよい。この場合は、個々の変極点の固有周波数ｆ１〜ｆ９と信号強度ａ１〜ａ９との双方に着目し、両者がともに所定の許容範囲ｆｆおよびａａ内に入っていれば、同一の単位区間に含ませるというような厳しい条件を課してもよいし、いずれか一方が許容範囲内に入っていれば、同一の単位区間に含ませるというような緩い条件を課してもよい。
【００４６】
なお、この区間設定段階Ｓ３０においては、上述した各アプローチに基づいて単位区間の設定を行う前に、絶対値が所定の許容レベル未満となる信号強度をもつ変極点を除外する処理を行っておくのが好ましい。たとえば、図１１に示す例のように所定の許容レベルＬＬを設定すると、変極点Ｐ４の信号強度ａ４と変極点Ｐ９の信号強度ａ９は、その絶対値がこの許容レベルＬＬ未満になる。このような場合、変極点Ｐ４，Ｐ９を除外する処理を行うのである。このような除外処理を行う第１の意義は、もとの音響信号に含まれていたノイズ成分を除去することにある。通常、音響信号を電気的に取り込む過程では、種々のノイズ成分が混入することが多く、このようなノイズ成分までも含めて符号化が行われると好ましくない。
【００４７】
もっとも、許容レベルＬＬをある程度以上に設定すると、ノイズ成分以外のものも除外されることになるが、このようにノイズ成分以外の信号を除外することも、場合によっては、十分に意味のある処理になる。すなわち、この除外処理を行う第２の意義は、もとの音響信号に含まれていた情報のうち、興味の対象外となる情報を除外することにある。たとえば、図１の上段に示す音響信号は、人間の心音を示す信号であるが、この音響信号のうち、疾患の診断などに有効な情報は、振幅の大きな部分（各単位区間Ｕ１〜Ｕ６の部分）に含まれており、それ以外の部分の情報はあまり役にたたない。そこで、所定の許容レベルＬＬを設定し、無用な情報部分を除外する処理を行うと、より効率的な符号化が可能になる。
【００４８】
また、心音や肺音のように、生体が発生する生理的リズム音における比較的振幅の小さな成分は、生体内で発生する反響音であることが多く、このような反響音は、符号化の時点で一旦除外してしまっても、再生時にエコーなどの音響効果を加えることにより容易に付加することが可能である。このような点においても、許容レベル未満の変極点を除外する処理は意味をもつ。
【００４９】
なお、許容レベル未満の変極点を除外する処理を行った場合は、除外された変極点の位置で分割されるように単位区間定義を行うようにするのが好ましい。たとえば、図１１に示す例の場合、除外された変極点Ｐ４，Ｐ９の位置（一点鎖線で示す）で分割された単位区間Ｕ１，Ｕ２が定義されている。このような単位区間定義を行えば、図１の上段に示す音響信号のように、信号強度が許容レベル以上の区間（単位区間Ｕ１〜Ｕ６の各区間）と、許容レベル未満の区間（単位区間Ｕ１〜Ｕ６以外の区間）とが交互に出現するような音響信号の場合、非常に的確な単位区間の定義が可能になる。
【００５０】
これまで、区間設定段階Ｓ３０で行われる効果的な区間設定手法の要点を述べてきたが、ここでは、より具体的な手順を述べることにする。図２の流れ図に示されているように、この区間設定段階Ｓ３０は、４つの処理Ｓ３１〜Ｓ３４によって構成されている。固有周波数定義処理Ｓ３１は、既に述べたように、各変極点について、それぞれ近傍の変極点との間の時間軸上での距離に基づいて所定の固有周波数を定義する処理である。ここでは、図１２に示すように、変極点Ｐ１〜Ｐ１７のそれぞれについて、固有周波数ｆ１〜ｆ１７が定義された例を考える。
【００５１】
続く、レベルによるスライス処理Ｓ３２は、絶対値が所定の許容レベル未満となる信号強度をもつ変極点を除外し、除外された変極点の位置で分割されるような区間を定義する処理である。ここでは、図１２に示すような変極点Ｐ１〜Ｐ１７に対して、図１３に示すような許容レベルＬＬを設定した場合を考える。この場合、変極点Ｐ１，Ｐ２，Ｐ１１，Ｐ１６，Ｐ１７が、許容レベル未満の変極点として除外されることになる。図１４では、このようにして除外された変極点を破線の矢印で示す。この「レベルによるスライス処理Ｓ３２」では、更に、除外された変極点の位置で分割されるような区間Ｋ１，Ｋ２が定義される。ここでは、１つでも除外された変極点が存在する場合には、その位置の左右に異なる区間を設定するようにしており、結果的に、変極点Ｐ３〜Ｐ１０までの区間Ｋ１と、変極点Ｐ１２〜Ｐ１５までの区間Ｋ２とが設定されることになる。なお、ここで定義された区間Ｋ１，Ｋ２は、暫定的な区間であり、必ずしも最終的な単位区間になるとは限らない。
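The level slicing S32 can be sketched as follows. The function name and the intensity values are hypothetical; the rule that every excluded inflection point splits the provisional sections is taken from the paragraph above.

```python
def slice_by_level(intensities, ll):
    """Return runs of indices with |intensity| >= ll; excluded points split runs."""
    sections, current = [], []
    for i, a in enumerate(intensities):
        if abs(a) >= ll:
            current.append(i)
        elif current:
            sections.append(current)  # an excluded point closes the run
            current = []
    if current:
        sections.append(current)
    return sections


# Seventeen points P1..P17 (indices 0..16); P1, P2, P11, P16 and P17 fall below LL.
a = [1, -1, 5, -6, 7, -5, 6, -7, 5, -6, 1, -5, 6, -5, 6, -1, 1]
provisional = slice_by_level(a, ll=3)
```

The two resulting runs correspond to the provisional sections K1 (P3 to P10) and K2 (P12 to P15).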
【００５２】
次の不連続部分割処理Ｓ３３は、時間軸上において、変極点の固有周波数もしくは信号強度の値が不連続となる不連続位置を探し、処理Ｓ３２で定義された個々の区間を、更にこの不連続位置で分割することにより、新たな区間を定義する処理である。たとえば、上述の例の場合、図１５に示すような暫定区間Ｋ１，Ｋ２が定義されているが、ここで、もし暫定区間Ｋ１内の変極点Ｐ６とＰ７との間に不連続が生じていた場合は、この不連続位置で暫定区間Ｋ１を分割し、図１６に示すように、新たに暫定区間Ｋ１−１とＫ１−２とが定義され、結局、３つの暫定区間Ｋ１−１，Ｋ１−２，Ｋ２が形成されることになる。不連続位置の具体的な探索手法は既に述べたとおりである。たとえば、図１５の例の場合、
｜（ｆ３＋ｆ４＋ｆ５＋ｆ６）／４−ｆ７｜＞ｆｆ
の場合に、変極点Ｐ６とＰ７との間に固有周波数の不連続が生じていると認識されることになる。同様に、変極点Ｐ６とＰ７との間の信号強度の不連続は、
｜（ａ３＋ａ４＋ａ５＋ａ６）／４−ａ７｜＞ａａ
の場合に認識される。
【００５３】
不連続部分割処理Ｓ３３で、実際に区間分割を行うための条件としては、
▲１▼固有周波数の不連続が生じた場合にのみ区間の分割を行う、
▲２▼信号強度の不連続が生じた場合にのみ区間の分割を行う、
▲３▼固有周波数の不連続か信号強度の不連続かの少なくとも一方が生じた場合に区間の分割を行う、
▲４▼固有周波数の不連続と信号強度の不連続との両方が生じた場合にのみ区間の分割を行う、
など、種々の条件を設定することが可能である。あるいは、不連続の度合いを考慮して、上述の▲１▼〜▲４▼を組み合わせるような複合条件を設定することもできる。
【００５４】
こうして、不連続部分割処理Ｓ３３によって得られた区間（上述の例の場合、３つの暫定区間Ｋ１−１，Ｋ１−２，Ｋ２）を、最終的な単位区間として設定することもできるが、ここでは更に、区間統合処理Ｓ３４を行っている。この区間統合処理Ｓ３４は、不連続部分割処理Ｓ３３によって得られた区間のうち、一方の区間内の変極点の固有周波数もしくは信号強度の平均と、他方の区間内の変極点の固有周波数もしくは信号強度の平均との差が、所定の許容範囲内であるような２つの隣接区間が存在する場合に、この隣接区間を１つの区間に統合する処理である。たとえば、上述の例の場合、図１７に示すように、区間Ｋ１−２と区間Ｋ２とを平均固有周波数で比較した結果、
｜（ｆ７＋ｆ８＋ｆ９＋ｆ１０）／４
−（ｆ１２＋ｆ１３＋ｆ１４＋ｆ１５）／４｜＜ｆｆ
のように、平均の差が所定の許容範囲ｆｆ以内であった場合には、区間Ｋ１−２と区間Ｋ２とは統合されることになる。もちろん、平均信号強度の差が許容範囲ａａ以内であった場合に統合を行うようにしてもよいし、平均固有周波数の差が許容範囲ｆｆ内という条件と平均信号強度の差が許容範囲ａａ以内という条件とのいずれか一方が満足された場合に統合を行うようにしてもよいし、両条件がともに満足された場合に統合を行うようにしてもよい。また、このような種々の条件が満足されていても、両区間の間の間隔が時間軸上で所定の距離以上離れていた場合（たとえば、多数の変極点が除外されたために、かなりの空白区間が生じているような場合）は、統合処理を行わないような加重条件を課すことも可能である。
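The section merging S34 can be sketched as below, using only the mean intrinsic frequency criterion. The function name and the frequency values are assumptions; the optional conditions on mean intensity and on the time gap between sections are omitted for brevity.

```python
def merge_sections(sections, freqs, ff):
    """Merge neighbouring sections whose mean frequencies differ by less than ff."""
    def mean_freq(sec):
        return sum(freqs[i] for i in sec) / len(sec)

    merged = []
    for sec in sections:
        if merged and abs(mean_freq(merged[-1]) - mean_freq(sec)) < ff:
            merged[-1] = merged[-1] + list(sec)
        else:
            merged.append(list(sec))
    return merged


# K1-1 = P3..P6, K1-2 = P7..P10, K2 = P12..P15 (zero-based indices below).
freqs = [0, 0, 200, 200, 200, 200, 400, 400, 400, 400, 0, 405, 405, 405, 405, 0, 0]
units = merge_sections([[2, 3, 4, 5], [6, 7, 8, 9], [11, 12, 13, 14]], freqs, ff=20)
```

Here K1-2 and K2 are merged into one unit section while K1-1 stays separate, as in FIG. 18.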
【００５５】
かくして、この区間統合処理Ｓ３４を行った後に得られた区間が、最終的な単位区間として設定されることになる。上述の例では、最終的に、図１８に示すように、単位区間Ｕ１（図１７の暫定区間Ｋ１−１）と、単位区間Ｕ２（図１７で統合された暫定区間Ｋ１−２およびＫ２）とが設定される。
【００５６】
なお、ここに示す実施態様では、こうして得られた単位区間の始端と終端を、その区間に含まれる最初の変極点の時間軸上の位置を始端とし、その区間に含まれる最後の変極点の時間軸上の位置を終端とする、という定義で定めることにする。したがって、図１８に示す例では、単位区間Ｕ１は時間軸上の位置ｔ３〜ｔ６までの区間であり、単位区間Ｕ２は時間軸上の位置ｔ７〜ｔ１５までの区間となる。
【００５７】
＜＜＜２．４符号化段階＞＞＞
次に、図２の流れ図に示されている符号化段階Ｓ４０について説明する。ここに示す実施形態では、この符号化段階Ｓ４０は、符号データ生成処理Ｓ４１と、符号データ修正処理Ｓ４２とによって構成されている。符号データ生成処理Ｓ４１は、区間設定段階Ｓ３０において設定された個々の単位区間内の音声データに基づいて、個々の単位区間を代表する所定の代表周波数および代表強度を定義し、時間軸上での個々の単位区間の始端位置および終端位置を示す情報と、代表周波数および代表強度を示す情報とを含む符号データを生成する処理であり、この処理により、個々の単位区間の音声データは個々の符号データによって表現されることになる。一方、符号データ修正処理Ｓ４２は、後述するように、生成された符号データを、復号化に用いる再生音源装置の特性に適合させるために修正する処理である。
【００５８】
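The specific technique for generating code data in the code data generation process S41 is very simple: the representative frequency is defined on the basis of the intrinsic frequencies of the inflection points contained in each unit section, and the representative intensity is defined on the basis of the signal intensities of those inflection points. This is illustrated using the example of FIG. 18, in which unit section U1 containing inflection points P3 to P6 and unit section U2 containing inflection points P7 to P15 (P11 having been excluded) have been set. In the embodiment shown here, for unit section U1 (start t3, end t6), the representative frequency F1 and the representative intensity A1 are calculated, as shown in the upper part of FIG. 19, by
F1 = (f3 + f4 + f5 + f6) / 4
A1 = (a3 + a4 + a5 + a6) / 4
and for unit section U2 (start t7, end t15), the representative frequency F2 and the representative intensity A2 are calculated, as shown in the lower part of FIG. 19, by
F2 = (f7 + f8 + f9 + f10 + f12 + f13 + f14 + f15) / 8
A2 = (a7 + a8 + a9 + a10 + a12 + a13 + a14 + a15) / 8
In other words, the representative frequency and the representative intensity are the simple averages of the intrinsic frequencies and signal intensities of the inflection points contained in the unit section. The representative values need not be simple averages, however; weighted averages may be used instead. For example, each inflection point may be weighted according to its signal intensity, and the weighted average of the intrinsic frequencies taken as the representative frequency. Alternatively, the maximum of the signal intensities of the inflection points in the unit section may be taken as the representative intensity.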
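The simple and intensity-weighted averages mentioned above can be sketched as follows. The function name and the example values are hypothetical, and signal intensities are taken as absolute values, in line with the earlier definition of ai as the arrow length.

```python
def representative_values(freqs, intensities, weighted=False):
    """Return (representative frequency, representative intensity) of one section."""
    mags = [abs(a) for a in intensities]
    if weighted:
        # Intensity-weighted mean of the intrinsic frequencies.
        rep_f = sum(f * w for f, w in zip(freqs, mags)) / sum(mags)
    else:
        rep_f = sum(freqs) / len(freqs)
    rep_a = sum(mags) / len(mags)
    return rep_f, rep_a


simple = representative_values([400, 410, 390, 400], [10, -20, 10, -20])
heavy = representative_values([400, 410, 390, 400], [10, -20, 10, -20], weighted=True)
```

With weighting, the louder points (here those at 410 Hz and 400 Hz) pull the representative frequency slightly above the simple mean.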
【００５９】
こうして個々の単位区間に、それぞれ代表周波数および代表強度が定義されれば、時間軸上での個々の単位区間の始端位置と終端位置は既に得られているので、個々の単位区間に対応する符号データの生成が可能になる。たとえば、図１８に示す例の場合、図２０に示すように、５つの区間Ｅ０，Ｕ１，Ｅ１，Ｕ２，Ｅ２を定義するための符号データを生成することができる。ここで、区間Ｕ１，Ｕ２は、前段階で設定された単位区間であり、区間Ｅ０，Ｅ１，Ｅ２は、各単位区間の間に相当する空白区間である。各単位区間Ｕ１，Ｕ２には、それぞれ代表周波数Ｆ１，Ｆ２と代表強度Ａ１，Ａ２が定義されているが、空白区間Ｅ０，Ｅ１，Ｅ２は、単に始端および終端のみが定義されている区間である。
【００６０】
図２１は、図２０に示す個々の区間に対応する符号データの構成例を示す図表である。この例では、１行に示された符号データは、区間名（実際には、不要）と、区間の始端位置および終端位置と、代表周波数および代表強度と、によって構成されている。一方、図２２は、図２０に示す個々の区間に対応する符号データの別な構成例を示す図表である。図２１に示す例では、各単位区間の始端位置および終端位置を直接符号データとして表現していたが、図２２に示す例では、各単位区間の始端位置および終端位置を示す情報として、区間長Ｌ１〜Ｌ４（図２０参照）を用いている。なお、図２１に示す構成例のように、単位区間の始端位置および終端位置を直接符号データとして用いる場合には、実際には、空白区間Ｅ０，Ｅ１，…についての符号データは不要である（図２１に示す単位区間Ｕ１，Ｕ２の符号データのみから、図２０の構成が再現できる）。
【００６１】
先願発明に係る音響信号の符号化方法によって、最終的に得られる符号データは、この図２１あるいは図２２に示すような符号データである。もっとも、符号データとしては、各単位区間の時間軸上での始端位置および終端位置を示す情報と、代表周波数および代表強度を示す情報とが含まれていれば、どのような構成のデータを用いてもかまわない。最終的に得られる符号データに、上述の情報さえ含まれていれば、所定の音源を用いて音声の再生（復号化）が可能になる。たとえば、図２０に示す例の場合、時刻０〜ｔ３の期間は沈黙を守り、時刻ｔ３〜ｔ６の期間に周波数Ｆ１に相当する音を強度Ａ１で鳴らし、時刻ｔ６〜ｔ７の期間は沈黙を守り、時刻ｔ７〜ｔ１５の期間に周波数Ｆ２に相当する音を強度Ａ２で鳴らせば、もとの音響信号の再生が行われることになる。
【００６２】
§３．ＭＩＤＩ形式の符号データを用いる実施形態
＜＜＜３．１ＭＩＤＩデータへの変換原理＞＞＞
上述したように、先願発明に係る音響信号の符号化方法では、最終的に、個々の単位区間についての始端位置および終端位置を示す情報と、代表周波数および代表強度を示す情報とが含まれた符号データであれば、どのような形式の符号データを用いてもかまわない。しかしながら、実用上は、そのような符号データとして、ＭＩＤＩ形式の符号データを採用するのが最も好ましい。ここでは、ＭＩＤＩ形式の符号データを採用した具体的な実施形態を示す。
【００６３】
図２３は、一般的なＭＩＤＩ形式の符号データの構成を示す図である。図示のとおり、このＭＩＤＩ形式では、「ノートオン」データもしくは「ノートオフ」データが、「デルタタイム」データを介在させながら存在する。「デルタタイム」データは、１〜４バイトのデータで構成され、所定の時間間隔を示すデータである。一方、「ノートオン」データは、全部で３バイトから構成されるデータであり、１バイト目は常にノートオン符号「９０ H」に固定されており（ Hは１６進数を示す）、２バイト目にノートナンバーＮを示すコードが、３バイト目にベロシティーＶを示すコードが、それぞれ配置される。ノートナンバーＮは、音階（一般の音楽でいう全音７音階の音階ではなく、ここでは半音１２音階の音階をさす）の番号を示す数値であり、このノートナンバーＮが定まると、たとえば、ピアノの特定の鍵盤キーが指定されることになる（Ｃ−２の音階がノートナンバーＮ＝０に対応づけられ、以下、Ｎ＝１２７までの１２８通りの音階が対応づけられる。ピアノの鍵盤中央のラの音（Ａ３音）は、ノートナンバーＮ＝６９になる）。ベロシティーＶは、音の強さを示すパラメータであり（もともとは、ピアノの鍵盤などを弾く速度を意味する）、Ｖ＝０〜１２７までの１２８段階の強さが定義される。
【００６４】
同様に、「ノートオフ」データも、全部で３バイトから構成されるデータであり、１バイト目は常にノートオフ符号「８０ H」に固定されており、２バイト目にノートナンバーＮを示すコードが、３バイト目にベロシティーＶを示すコードが、それぞれ配置される。「ノートオン」データと「ノートオフ」データとは対になって用いられる。たとえば、「９０ H，６９，８０」なる３バイトの「ノートオン」データは、ノートナンバーＮ＝６９に対応する鍵盤中央のラのキーを押し下げる操作を意味し、以後、同じノートナンバーＮ＝６９を指定した「ノートオフ」データが与えられるまで、そのキーを押し下げた状態が維持される（実際には、ピアノなどのＭＩＤＩ音源の波形を用いた場合、有限の時間内に、ラの音の波形は減衰してしまう）。ノートナンバーＮ＝６９を指定した「ノートオフ」データは、たとえば、「８０ H，６９，５０」のような３バイトのデータとして与えられる。「ノートオフ」データにおけるベロシティーＶの値は、たとえばピアノの場合、鍵盤キーから指を離す速度を示すパラメータになる。
【００６５】
なお、上述の説明では、ノートオン符号「９０ H」およびノートオフ符号「８０ H」は固定であると述べたが、これらの符号の下位４ビットは必ずしも０に固定されているわけではなく、チャネル番号０〜１５のいずれかを特定するコードとして利用することができ、チャネルごとにそれぞれ別々の楽器の音色についてのオン・オフを指定することができる。
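The three-byte note-on and note-off messages, including the channel number carried in the lower four bits of the status byte, can be sketched as follows. The helper names are hypothetical; the status values 90H and 80H and the 0 to 127 ranges are as described above.

```python
def _message(status, channel, note, velocity):
    """Build a 3-byte channel message: status|channel, note number, velocity."""
    if not 0 <= channel <= 15:
        raise ValueError("channel must be 0..15")
    if not 0 <= note <= 127 or not 0 <= velocity <= 127:
        raise ValueError("note and velocity must be 0..127")
    return bytes([status | channel, note, velocity])


def note_on(note, velocity, channel=0):
    return _message(0x90, channel, note, velocity)


def note_off(note, velocity, channel=0):
    return _message(0x80, channel, note, velocity)


press = note_on(69, 80)                 # "90H, 69, 80" from the text
release = note_off(69, 50)              # "80H, 69, 50" from the text
other = note_on(69, 80, channel=1)      # same note on channel 1
```

Using a different channel for each set of overlapping unit sections lets the two note streams be switched on and off independently.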
【００６６】
このように、ＭＩＤＩデータは、もともと楽器演奏の操作に関する情報（別言すれば、楽譜の情報）を記述する目的で利用されている符号データであるが、先願発明に係る音響信号の符号化方法への利用にも適している。すなわち、各単位区間についての代表周波数Ｆに基づいてノートナンバーＮを定め、代表強度Ａに基づいてベロシティーＶを定め、単位区間の長さＬに基づいてデルタタイムＴを定めるようにすれば、１つの単位区間の音声データを、ノートナンバー、ベロシティー、デルタタイムで表現されるＭＩＤＩ形式の符号データに変換することが可能になる。このようなＭＩＤＩデータへの具体的な変換方法を図２４に示す。
【００６７】
まず、ＭＩＤＩデータのデルタタイムＴは、単位区間の区間長Ｌ（単位：秒）を用いて、
Ｔ＝Ｌ・７６８
なる簡単な式で定義できる。ここで、数値「７６８」は、四分音符を基準にして、その長さ分解能（たとえば、長さ分解能を１／２に設定すれば八分音符まで、１／８に設定すれば三十二分音符まで表現可能：一般の音楽では１／１６程度の設定が使われる）を、ＭＩＤＩ規格での最小値である１／３８４に設定し、メトロノーム指定を四分音符＝１２０（毎分１２０音符）にした場合のＭＩＤＩデータによる表現形式における時間分解能を示す固有の数値である。
【００６８】
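The note number N of the MIDI data, on a logarithmic scale in which the frequency doubles with each octave, can be defined from the representative frequency F (unit: Hz) of the unit section by the formula
N = (12 / log10 2) · log10(F / 440) + 69
Here, the value "69" in the second term on the right-hand side is the note number of the A at the centre of the piano keyboard (A3), which serves as the reference note number; the value "440" in the first term is the frequency of this A (440 Hz); and the value "12" in the first term is the number of scale steps in one octave when each semitone is counted as one step.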
【００６９】
更に、ＭＩＤＩデータのベロシティーＶは、単位区間の代表強度Ａと、その最大値Ａmax とを用いて、
Ｖ＝（Ａ／Ａmax ）・１２７
なる式で、Ｖ＝０〜１２７の範囲の値を定義することができる。なお、通常の楽器の場合、「ノートオン」データにおけるベロシティーＶと、「ノートオフ」データにおけるベロシティーＶとは、上述したように、それぞれ異なる意味をもつが、この実施形態では、「ノートオフ」データにおけるベロシティーＶとして、「ノートオン」データにおけるベロシティーＶと同一の値をそのまま用いるようにしている。
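The three conversion formulas for delta time T, note number N and velocity V can be sketched together. Rounding to the nearest integer and clamping to 0 to 127 are assumptions added here, since the text does not say how fractional values are handled; (12 / log10 2) · log10(F / 440) is written as 12 · log2(F / 440), which is the same quantity.

```python
import math


def delta_time(length_s):
    """T = L * 768 (384 ticks per quarter note at 120 quarter notes per minute)."""
    return round(length_s * 768)


def note_number(freq_hz):
    """N = 12 * log2(F / 440) + 69, rounded and clamped to 0..127 (assumption)."""
    n = round(12 * math.log2(freq_hz / 440.0) + 69)
    return max(0, min(127, n))


def velocity(a, a_max):
    """V = (A / Amax) * 127, rounded and clamped to 0..127 (assumption)."""
    return max(0, min(127, round(a / a_max * 127)))


t = delta_time(0.5)      # half a second
n = note_number(880.0)   # one octave above the 440 Hz reference
v = velocity(100, 100)   # full-scale intensity
```

One octave above the reference A adds 12 to the note number, and half a second corresponds to one quarter note of 384 ticks.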
【００７０】
前章の§２では、図２０に示すような２つの単位区間Ｕ１，Ｕ２内の音声データに対して、図２１あるいは図２２に示すような符号データが生成される例を示したが、ＭＩＤＩデータを用いた場合、単位区間Ｕ１，Ｕ２内の音声データは、図２５の図表に示すような各データ列で表現されることになる。ここで、ノートナンバーＮ１，Ｎ２は、代表周波数Ｆ１，Ｆ２を用いて上述の式により得られた値であり、ベロシティーＶ１，Ｖ２は、代表強度Ａ１，Ａ２を用いて上述の式により得られた値である。
【００７１】
＜＜＜３．２ＭＩＤＩデータの修正処理＞＞＞
図２に示す流れ図における符号化段階Ｓ４０では、符号データ生成処理Ｓ４１の後に、符号データ修正処理Ｓ４２が行われる。符号データ生成処理Ｓ４１は、上述した具体的な手法により、たとえば、図２５に示すようなＭＩＤＩデータ列を生成する処理であり、符号データ修正処理Ｓ４２は、このようなＭＩＤＩデータ列に対して、更に修正を加える処理である。後述するように、図２５に示すようなＭＩＤＩデータ列に基づいて、音声を再生（復号化）するには、実際の音声の波形データをもった再生音源装置（ＭＩＤＩ音源）が必要になるが、このＭＩＤＩ音源の特性は個々の音源ごとに様々であり、必要に応じて、用いるＭＩＤＩ音源の特性に適合させるために、ＭＩＤＩデータに修正処理を加えた方が好ましい場合がある。以下に、このような修正処理が必要な具体的な事例を述べる。
【００７２】
いま、図２６の上段に示すように、区間長Ｌｉをもった単位区間Ｕｉ内の音声データが所定のＭＩＤＩデータ（修正前のＭＩＤＩデータ）によって表現されていた場合を考える。すなわち、この単位区間Ｕｉには、代表周波数Ｆｉおよび代表強度Ａｉが定義されており、代表周波数Ｆｉ，代表強度Ａｉ，区間長Ｌｉに基づいて、ノートナンバーＮｉ，ベロシティーＶｉ，デルタタイムＴｉが設定されていることになる。このとき、このＭＩＤＩデータを再生するために用いる予定のＭＩＤＩ音源のノートナンバーＮｉに対応する再生音の波形が、図２６の中段に示すようなものであったとしよう。この場合、単位区間Ｕｉの単位長Ｌｉよりも、ＭＩＤＩ音源の再生音の持続時間ＬＬｉの方が短いことになる。したがって、修正前のＭＩＤＩデータを、このＭＩＤＩ音源を用いてそのまま再生すると、本来の音が鳴り続けなければならない時間Ｌｉよりも短い持続時間ＬＬｉで、再生音は減衰してしまうことになる。このような事態が生じると、もとの音響信号の再現性が低下してしまう。
【００７３】
そこで、このような場合、単位区間を複数の小区間に分割し、各小区間ごとにそれぞれ別個の符号データを生成する修正処理を行うとよい。この図２６に示す例の場合、図の下段に示すように、もとの単位区間Ｕｉを、２つの小区間Ｕｉ１，Ｕｉ２に分割し、それぞれについて別個のＭＩＤＩデータを生成するようにしている。個々の小区間Ｕｉ１，Ｕｉ２に定義される代表周波数および代表強度は、いずれも分割前の単位区間Ｕｉの代表周波数Ｆｉおよび代表強度Ａｉと同じであり、区間長だけがＬｉ／２になったわけであるから、修正後のＭＩＤＩデータとしては、結局、ノートナンバーＮｉ，ベロシティーＶｉ，デルタタイムＴｉ／２を示すＭＩＤＩデータが２組得られることになる。
【００７４】
一般のＭＩＤＩ音源では、通常、再生音の持続時間はその再生音の周波数に応じて決まる。特に、心音などの音色についての音源では、再生音の周波数をｆ（Ｈｚ）とした場合、その持続時間は５／ｆ（秒）程度である。したがって、このような音源を用いたときには、特定の単位区間Ｕｉについて、代表周波数Ｆｉと区間長Ｌｉとの関係が、Ｌｉ＞５／Ｆｉとなるような場合には、Ｌｉ／ｍ＜５／Ｆｉとなるような適当な分割数ｍを求め、上述した修正処理により、単位区間Ｕｉをｍ個の小区間に分割するような処理を行うのが好ましい。
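The choice of the division count m can be sketched as follows, assuming the sustain time of the sound source is about 5/f seconds as stated for heart-sound-like timbres. The function name and the choice of the smallest valid m are assumptions; the text only asks for a suitable m with L/m < 5/F.

```python
import math


def split_count(length_s, freq_hz, cycles=5):
    """Smallest m such that length_s / m < cycles / freq_hz (1 if no split needed)."""
    ratio = length_s * freq_hz / cycles   # L / (5 / F)
    if ratio <= 1:
        return 1                          # the source sustains long enough
    return math.floor(ratio) + 1


m = split_count(0.5, 40)   # sustain is 5 / 40 = 0.125 s, section is 0.5 s
```

Each of the m sub-sections keeps the note number and velocity of the original unit section and receives a delta time of T/m.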
【００７５】
続いて、修正処理が必要な別な事例を示そう。いま、再生に用いる予定のＭＩＤＩ音源の再生音が、図２７の左側に示すような周波数レンジを有しているのに対し、生成された一連のＭＩＤＩデータに基づく再生音の周波数レンジが、図２７の右側に示すように、低音側にオフセット量ｄだけ偏りを生じていたとしよう。このような場合、再生音はＭＩＤＩ音源の一部の周波数帯域のみを使って提示されるようになるため、一般的には好ましくない。そこで、ＭＩＤＩデータの周波数の平均が、ＭＩＤＩ音源の周波数レンジの中心（この例では、４４０Ｈｚの基準ラ音（ノートナンバーＮ＝６９））に近付くように、ＭＩＤＩデータ側の周波数（ノートナンバー）を全体的に引き上げる修正処理を行い、図２８に示すように、オフセット量ｄが０になるようにするとよい。
【００７６】
もっとも、音響信号の性質によっては、むしろ低音側にシフトした状態のままで再生した方が好ましいものもあり、上述のような修正処理によって必ずしも良好な結果が得られるとは限らない。したがって、個々の音響信号の性質を考慮した上で、このような修正処理を行うか否かを適宜判断するのが好ましい。
【００７７】
この他にも、用いるＭＩＤＩ音源によっては、特性に適合させるために種々の修正処理が必要な場合がある。たとえば、１オクターブの音階差が２倍の周波数に対応していないような特殊な規格のＭＩＤＩ音源を用いた場合には、この規格に適合させるように、ノートナンバーの修正処理などが必要になる。
【００７８】
§４．本発明における改良点
これまで述べてきた先願発明による符号化方法は、生体の発生するリズム音、波や風などの自然が発生するリズム音というように、個々の単位区間内にある程度限定された周波数成分のみを含む音響信号の符号化には、実用上十分な再現性を確保することができる。しかしながら、いわゆるヴォーカル音響と呼ばれている人間の声音のように、非常に幅の広い周波数成分を同時に含んでいるような音響信号を符号化した場合、必ずしも十分な再現性を確保することはできない。特に、人間の声音には、ホルマントと呼ばれる特性（倍音以外の高調波成分が混在する特性）があることが知られており、上述した先願発明による方法では十分な再現性をもった符号化ができないことは、理論的にも裏付けられる。一般的な楽器では、ある特定の音程を演奏すると、演奏した音程に対応する周波数成分とともに、その整数倍の周波数成分（倍音高調波成分）が得られる。したがって、このような楽器の演奏波形をＭＩＤＩ音源として利用すれば、先願発明による符号化方法でも倍音高調波成分を含んだ音を再現することができる。ところが、ホルマントを有する人間の声音には、倍音以外の高調波成分が含まれているため、十分な再現性を確保することができなくなる。
【００７９】
以下に述べる本発明の手法は、ホルマントを有する人間の声音の符号化にも十分に対応できるように、先願発明に対する改良を施したものである。まず、図２９を参照しながら、本発明の基本概念を説明する。ここでは、図２９の上段に示すように、時系列の強度信号としてアナログ音響信号が与えられ、これをデジタル音響データとして取り込んだものとする。続いて、このデジタル音響データの時間軸ｔ上に複数の単位区間Ｕ１〜Ｕ６を設定する。ここまでは、図１に示す先願発明の手法と同様である。こうして、複数の単位区間が設定されたら、個々の単位区間内の音響データに基づいて、個々の単位区間を代表する複数の代表周波数（この例では、高域周波数Ｆｈと低域周波数Ｆｌの２通りの代表周波数）および代表強度を定義する。ここでは、第ｉ番目の単位区間Ｕｉについて、高域周波数Ｆｈ（ｉ）および低域周波数Ｆｌ（ｉ）と、代表強度Ａｉとが定義された状態が示されている。たとえば、第１番目の単位区間Ｕ１については、代表周波数として高域周波数Ｆｈ（１）および低域周波数Ｆｌ（１）と、代表強度Ａ１とが定義されている。
【００８０】
こうして、個々の単位区間について、それぞれ複数の代表周波数および代表強度が定義されたら、時間軸ｔ上での個々の単位区間の始端位置および終端位置を示す情報と、定義された複数の代表周波数および代表強度を示す情報と、により符号データを生成し、個々の単位区間の音響データを個々の符号データによって表現すればよい。たとえば、ＭＩＤＩ規格に基づく符号化を利用すれば、図２９下段に示す音符で示すような符号データが得られる。この図２９下段に示す符号データでは、図１下段に示す符号データと比べればわかるように、個々の音符が和音として提示されている。すなわち、各単位区間ごとに、高域周波数Ｆｈに対応する音符と低域周波数Ｆｌに対応する音符とが作成されていることになり、再生時には、これら２つの音符が同時に和音として演奏されることになる。このような手法を採れば、ホルマントを有する人間の声音の符号化にも十分に対応できるようになる。
【００８１】
もっとも、図２９に示す手法では、同一の単位区間にそれぞれ２通りの代表周波数を定義しているが、実用上は、１つの単位区間には１つの代表周波数のみを定義するようにし、その代わりに、同一時間軸上で重複してそれぞれ異なる単位区間を設定できるようにするのが好ましい。図３０に、より実用的な手法の基本概念を示す。この図３０の中段には、時系列の強度信号としてのデジタル音響データの波形が示されており、この波形より下側には、高域周波数に着目した処理が示され、この波形より上側には、低域周波数に着目した処理が示されている。すなわち、図の下半分に示された高域周波数に着目した処理では、高域単位区間Ｕｈ（１）〜Ｕｈ（６）が設定され、これら各単位区間について、それぞれ代表周波数Ｆｈ（１）〜Ｆｈ（６）と代表強度Ａｈ（１）〜Ａｈ（６）が定義されており、最終的に図の最下段に示されているような高域符号データが生成されることになる。一方、図の上半分に示された低域周波数に着目した処理では、低域単位区間Ｕｌ（１）〜Ｕｌ（４）が設定され、これら各単位区間について、それぞれ代表周波数Ｆｌ（１）〜Ｆｌ（４）と代表強度Ａｌ（１）〜Ａｌ（４）が定義されており、最終的に図の最上段に示されているような低域符号データが生成されることになる。
【００８２】
ここで重要な点は、高域単位区間Ｕｈ（１）〜Ｕｈ（６）と低域単位区間Ｕｌ（１）〜Ｕｌ（４）とが、時間軸ｔ上において、少なくともその一部分が重複しているという点である。もちろん、時間軸ｔを図の左から右へと辿っていった場合、高域単位区間のみしか設定されていない部分や、低域単位区間のみしか設定されていない部分が存在し、また、いずれの単位区間も設定されていない部分も存在し得るが、少なくとも時間軸ｔ上の一部分には、高域単位区間と低域単位区間とが重複して設定された区間が存在することになる。こうして重複設定された単位区間について、それぞれ独立して代表周波数および代表強度を定めて符号化すれば、時間軸上で重複した符号データが得られることになる。たとえば、図３０に示す例の場合、最下段に示された高域符号データと、最上段に示された低域符号データとは、時間軸ｔ上において少なくとも部分的には重なっており、再生時には、和音として演奏されることになる。なお、図示されている音符は概念を示すためのものであり、図の中段に示された波形や各単位区間とは直接関連していない。
【００８３】
このように、時間軸上で少なくとも部分的に重複する単位区間を設定し、各単位区間ごとにそれぞれ別個に符号化を行うようにすれば、再生時には、種々の周波数成分を含んだ和音としての形式で音の再現が可能になる。なお、図２９に示した例は、個々の高域単位区間と個々の低域単位区間とが完全に一致した特別なケースと考えることができる。
【００８４】
§５．本発明に係る音響信号の符号化方法の実用的な手順
本発明に係る符号化手順は、先願発明に係る符号化手順とほぼ同様に行うことができる。すなわち、図２の流れ図に示すように、入力段階Ｓ１０において、符号化対象となる音響信号を、デジタルの音響データとして取り込む処理が行われ、続いて、変極点定義段階Ｓ２０において、取り込んだ音響データの波形について変極点を求める処理が行われる。ここまでの処理は、既に述べた先願発明に係る手順と全く同じである。次に、区間設定段階Ｓ３０において、単位区間の設定が行われるが、本発明では、前述したように、時間軸上で少なくとも部分的に重複するような区間設定が行われることになる。また、符号化段階Ｓ４０では、各単位区間ごとに符号化する処理が行われるが、この処理も重複設定された各単位区間ごとに行われることになる。
【００８５】
区間設定段階Ｓ３０において最初に行われる処理は、既に述べたように、固有周波数定義処理Ｓ３１である。この時点では、既に、変極点探索処理Ｓ２１によって、音響データ波形についての個々の変極点が探索され、同極性変極点の間引処理Ｓ２２によって、同極性のデジタル値をもった変極点が複数連続する場合に、絶対値が最大のデジタル値をもった変極点のみを残す間引きが行われており、正の信号値をもつ変極点と負の信号値をもつ変極点とが交互に現れる状態になっている。固有周波数定義処理Ｓ３１は、このような各変極点のそれぞれに対して、近傍の情報に基いて固有周波数を定義する処理であるが、本発明では、１つの変極点に対して固有周波数を定義する方法を複数通り設定するようにし、これら複数通りの方法を用いて、各変極点に複数通りの固有周波数を定義するようにしている。
【００８６】
ここでは、ヴォーカル音響信号に対して用いるのに適した２通りの具体的な固有周波数定義方法を説明する。いま、変極点定義段階Ｓ２０を経ることにより、図３１にその一部が示されているような変極点群が得られた場合を考える。図３１には、この変極点群のうちの第ｎ番目の変極点Ｐ（ｎ）〜第（ｎ＋１２）番目の変極点Ｐ（ｎ＋１２）が示されている。このような変極点群には、２つの周波数成分が含まれていることがわかる。すなわち、変極点Ｐ（ｎ）とＰ（ｎ＋２）との距離φｈを一周期とする高域周波数成分と、変極点Ｐ（ｎ）とＰ（ｎ＋６）との距離φｌを一周期とする低域周波数成分とである。ヴォーカル音響信号に対して変極点の定義を行うと、図３１に示すような特徴が顕著に現れる。これは、前述したように、人間の音声はホルマントという特徴を有するためである。図３１に示す例において、正の信号強度をもつ変極点Ｐ（ｎ），Ｐ（ｎ＋２），Ｐ（ｎ＋４），Ｐ（ｎ＋６），Ｐ（ｎ＋８）…に注目すれば、信号強度が大中小大中小…と変化していることがわかる。この大中小という変化の周期が周期φｌに相当し、低域周波数成分を示すことになる。これに対し、同極性の変極点の出現周期が周期φｈに相当し、高域周波数成分を示すことになる。
【００８７】
結局、個々の変極点に対して固有周波数を定義する第１の方法として、同極性の変極点が現れる周期φｈを探索し、この周期φｈに基いて固有周波数を定義する方法を採れば、高域固有周波数ｆｈを定義することができる。また、個々の変極点に対して固有周波数を定義する第２の方法として、近似した信号強度をもつ変極点が現れる周期φｌを探索し、この周期φｌに基いて固有周波数を定義する方法を採れば、低域固有周波数ｆｌを定義することができる。より具体的には、各変極点について、それぞれ所定の条件を満たす特定の変極点を探索し、探索された変極点との間の時間軸上での距離に基いて固有周波数を定義すればよい。たとえば、図３１において、変極点Ｐ（ｎ）についての高域固有周波数ｆｈを定義するには、「後続して最初に出現する同極性の変極点」という条件を設定して探索を行えばよい。その結果、この条件を満たす変極点Ｐ（ｎ＋２）が探索されることになるので、両変極点の時間軸上での距離φｈを周期とする周波数が定義される。同様に、変極点Ｐ（ｎ）についての低域固有周波数ｆｌを定義するには、「変極点Ｐ（ｎ）のもつ信号強度にほぼ等しい信号強度をもち、後続して最初に出現する変極点（信号強度に符号をもたせておけば、当然同極性の変極点になる）」という条件を設定して探索を行えばよい。その結果、この条件を満たす変極点Ｐ（ｎ＋６）が探索されることになるので、両変極点の時間軸上での距離φｌを周期とする周波数が定義される。このように、探索条件を変えることにより、同一の変極点に対して複数通りの固有周波数を定義することが可能になる。
【００８８】
上述の手法によれば、第ｎ番目の変極点Ｐ（ｎ）についての高域固有周波数ｆｈ（ｎ）は、§２．３で述べたように、任意の整数ｋを用いて、
ｆｈ（ｎ）＝（ｋ／２）・１／（ｔ（ｎ＋ｋ）−ｔ（ｎ））
なる式で得られることになる。すなわち、第ｎ番目の変極点Ｐ（ｎ）に対してｋ個離れた変極点Ｐ（ｎ＋ｋ）を探索し（ｋが正の場合は後続する変極点、負の場合は先行する変極点）、変極点Ｐ（ｎ）の時間軸上での位置ｔ（ｎ）と探索された変極点Ｐ（ｎ＋ｋ）の時間軸上での位置ｔ（ｎ＋ｋ）との差の逆数に基いて、高域固有周波数ｆｈ（ｎ）が得られることになる。既に述べたように、ｋの値は、ある程度大きく設定した方が、誤差の少ない固有周波数を定義することができるが、あまり大きく設定しすぎると、ローカルな周波数としての意味が失われてしまう。
【００８９】
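In the example shown in FIG. 31, the higher intrinsic frequency fh(n) of inflection point P(n) can be defined as the reciprocal of the period φh shown in the figure, and is given by
fh(n) = 1 / φh = 1 / (t(n+2) - t(n))
which is simply the above formula with the coefficient k set to 2. Of course, if k is set to 4, the value of fh(n) can also be defined, with inflection point P(n+4) as the search target, by
fh(n) = 2 · (1 / (t(n+4) - t(n)))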
【００９０】
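The lower intrinsic frequency fl(n) of the n-th inflection point P(n), on the other hand, is obtained by
fl(n) = 1 / (t(n+k) - t(n))
Here, however, the coefficient k appearing in the denominator on the right-hand side is not an arbitrary integer; it must satisfy a predetermined condition. Namely, the inflection point P(n+k) specified by k must be the succeeding inflection point nearest to P(n) among those whose signal intensity lies within a predetermined error range of the signal intensity of P(n). Alternatively, when k is taken to be negative and preceding inflection points are searched, P(n+k) may be the preceding inflection point nearest to P(n) among those whose signal intensity lies within the predetermined error range of the signal intensity of P(n). In short, this formula means that the nearest inflection point P(n+k) having roughly the same signal intensity as P(n) is searched for, and the lower intrinsic frequency fl(n) is determined from the reciprocal of the difference between the position t(n) of P(n) on the time axis and the position t(n+k) of the inflection point found.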
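[0091]
In the example shown in FIG. 31, the low-band natural frequency fl(n) for inflection point P(n) can be defined as the reciprocal of the illustrated period φl and is given by
fl(n) = 1/(t(n+6) - t(n))
which is simply the above expression with the coefficient set to k = 6. That is, in the example of FIG. 31, inflection point P(n+6) has a signal intensity within the predetermined error range of the signal intensity of P(n) and was found as the subsequent inflection point nearest to P(n). Theoretically, the search target need not be the nearest subsequent (or nearest preceding) inflection point. For example, even when the second-nearest subsequent inflection point P(n+12) is the search target, the low-band natural frequency fl(n) can be defined by
fl(n) = 2 · (1/(t(n+12) - t(n)))
and in general, when the z-th nearest subsequent or preceding inflection point P(n+k) is the search target, the low-band natural frequency fl(n) can be defined by
fl(n) = z · (1/(t(n+k) - t(n)))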
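The search for the low-band natural frequency can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the list-based inputs `t` (times in seconds) and `a` (signed signal intensities), and the relative tolerance `tol` standing in for the "predetermined error range" are all hypothetical choices, and only subsequent inflection points are searched.

```python
def low_band_natural_frequency(t, a, n, tol=0.1, z=1):
    """Return fl(n) = z / (t[n+k] - t[n]), or None if no match is found.

    P(n+k) is the z-th subsequent inflection point whose signed intensity
    lies within tol * |a[n]| of a[n] (tol is an assumed error range).
    """
    found = 0
    for m in range(n + 1, len(t)):
        # Signed comparison: a match necessarily has the same polarity.
        if abs(a[m] - a[n]) <= tol * abs(a[n]):
            found += 1
            if found == z:
                return z / (t[m] - t[n])
    return None
```

For a large-medium-small intensity pattern such as that of FIG. 31, the first match for P(n) is P(n+6), so z = 1 gives the reciprocal of φl, and z = 2 (matching P(n+12)) gives the same frequency averaged over two periods.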
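[0092]
Thus, in the present invention, the natural frequency definition process of step S31 in the flowchart of FIG. 2 defines multiple natural frequencies for each inflection point. The individual processes of steps S32 to S34 are performed separately for each of these natural frequencies, as are the individual processes of steps S41 to S42. As a result, multiple sets of code data that overlap on the time axis are generated, and by reproducing these code data so that they overlap on the time axis, reproducibility at a practical level can be secured even for human voices having formant characteristics.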
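[0093]
For example, if n = 1 in the specific example shown in FIG. 31 and the inflection points are denoted P1 to P13, defining a high-band natural frequency at each inflection point yields a group of inflection points having natural frequencies fhx and signal intensities ax as shown in FIG. 32, and defining a low-band natural frequency at each inflection point yields a group of inflection points having natural frequencies flx and signal intensities ax as shown in FIG. 33 (where x = 1 to 13). If the slice processing by level in step S32, the discontinuity division processing in step S33, and the section integration processing in step S34 are executed separately and independently on these two groups of inflection points, two kinds of unit sections are set. A unit section set based on the group of inflection points having high-band natural frequencies, as in FIG. 32, is a section containing a group of inflection points whose high-band natural frequencies fall within a predetermined approximate range. A unit section set based on the group of inflection points having low-band natural frequencies, as in FIG. 33, is a section containing a group of inflection points whose low-band natural frequencies fall within a predetermined approximate range. In short, in the section setting stage of step S30, a section containing a group of inflection points whose natural frequencies defined by the same method fall within a predetermined approximate range is set as one unit section. Because natural frequencies are defined by multiple methods, multiple unit sections that overlap on the time axis are defined.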
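[0094]
In the encoding stage of step S40, a representative frequency and a representative intensity are defined separately and independently for each unit section. That is, the representative frequency of a unit section is defined based on, among the multiple natural frequencies defined for the inflection points contained in that unit section, the natural frequency involved in setting the unit section, and the representative intensity of the unit section is defined based on the signal intensities of the inflection points contained in it. For example, in the example shown in FIG. 30, for the high-band unit section Uh(1), the representative frequency Fh(1) is defined based on the high-band natural frequency, which is the one involved in setting this unit section among the multiple natural frequencies defined for the inflection points contained in Uh(1), and the representative intensity Ah(1) is defined based on the signal intensities of the inflection points contained in Uh(1).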
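[0095]
In the present invention, however, it is not essential to use exactly the two natural frequencies fh and fl; any natural frequency between them may also be used. In short, multiple natural frequencies may be defined within a range whose upper limit is the high-band natural frequency fh and whose lower limit is the low-band natural frequency fl. For example, in the example shown in FIG. 31, in addition to fh(n) and fl(n), an intermediate natural frequency fm(n) whose period is the distance on the time axis between inflection points P(n) and P(n+4) can also be defined as a natural frequency for inflection point P(n).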
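[0096]
According to the principle of conversion to MIDI data described in §3.1, the velocity V of the MIDI data corresponding to each unit section was obtained by normalizing the representative intensity A of the unit section by the maximum value Amax and multiplying by 127, that is, by the expression
V = (A/Amax) · 127
which gives a velocity V taking values from 0 to 127. When encoding a so-called vocal audio signal, however, it is preferable to take the square root of the normalized value and define the velocity V by
V = (A/Amax)^(1/2) · 127
or to take the logarithm and define the velocity V by
V = log(A/Amax) · 127 + 127
(where V = 0 when V < 0)
because this yields more natural reproduced sound.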
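The three velocity mappings can be compared with a short Python sketch. The function name `velocity`, the `mode` argument, and the rounding and clamping to the integer range 0 to 127 are assumptions made for illustration. The text does not state the base of the logarithm, so base 10 is assumed here.

```python
import math

def velocity(a, a_max, mode="linear"):
    """Map representative intensity a (0 <= a <= a_max) to a velocity 0..127."""
    if a_max <= 0:
        raise ValueError("a_max must be positive")
    r = a / a_max  # normalized intensity
    if mode == "linear":
        v = r * 127
    elif mode == "sqrt":
        v = math.sqrt(r) * 127
    elif mode == "log":
        # Base-10 logarithm is an assumption; log(0) is treated as V = 0.
        v = math.log10(r) * 127 + 127 if r > 0 else 0
    else:
        raise ValueError("unknown mode: " + mode)
    # V < 0 is replaced by 0, as stated in the text; the upper clamp is a safeguard.
    return max(0, min(127, round(v)))
```

Compared with the linear mapping, the square-root and logarithmic mappings raise the velocity of weak sections relative to strong ones, which compresses the dynamic range.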
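[0097]
§6. Application examples of the acoustic signal encoding method according to the present invention
With the acoustic signal encoding method according to the present invention described above, practical application becomes possible even for vocal acoustic signals, for which the encoding method of the prior invention could not achieve sufficient reproducibility. This encoding method makes it possible to reproduce human speech and singing on MIDI-compatible electronic musical instruments, and also to express them in the form of a musical score.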
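[0098]
The various encoding processes described above are in practice performed by computation on a computer, but the computational load is light compared with operations such as FFT, and real-time processing is well within the reach of a commercially available general-purpose personal computer. Therefore, by writing a program that causes a general-purpose personal computer to execute the above processing and distributing it on a medium such as a floppy disk or CD-ROM, a general-purpose personal computer can be used as an apparatus for executing the acoustic signal encoding method according to the present invention. Data encoded by the method can likewise be recorded by the personal computer on a medium such as a floppy disk or CD-ROM for distribution, or transmitted over a communication line.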
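[0099]
If the technique of reproducing voices on electronic musical instruments is applied to karaoke, it can be used for reproducing backing choruses, providing model vocal performances, inserting narration, and so on, offering new added value as a service. In particular, when applied to online karaoke, data such as model vocal performances can be transmitted in the form of notes, enabling efficient data transmission. When applied to computer-based edutainment, songs and voices can be inserted as sound effects. The human voice can also be incorporated as a motif in musical works. Furthermore, because MIDI data encoded by the method according to the present invention can be reproduced with a MIDI sound source consisting of ordinary instrument sounds, it is also possible to perform tricks such as imitating human speech with musical instruments.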
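[0100]
In addition, because the encoding method according to the present invention allows human speech and singing to be recognized in the form of objective code data, it can be applied to technical fields in which voices are objectively analyzed or evaluated. For example, in language education and vocal music education, pronunciation, vocalization, and intonation can be evaluated objectively, and in karaoke, pitch and rhythm can be evaluated objectively to give a rigorous score for singing ability. In the medical field, analysis of vocal auscultation sounds can be used for diagnosis of the respiratory system, and analysis of a patient's speech can provide data for diagnosing, for example, the progression of pharyngeal cancer. Furthermore, in criminal investigation and security, the method can be used for voice authentication of individuals. In particular, because the human voice contains formant features that remain unchanged even when passed through various commercially available voice changers, the present invention makes it possible to authenticate individuals with considerably high accuracy.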
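[0101]
Code data encoded according to the present invention can also be recorded on a paper medium in barcode-like form, which allows duplication by printing or photocopying. If such barcode-like code data is transmitted by facsimile, voice mail with a wider bandwidth and higher confidentiality than an ordinary telephone call can be realized. Alternatively, if barcode-like code data is printed on roll paper, an automatic player using a music-box-style reproducing device that reads and reproduces the code data on the roll paper can be realized. Moreover, if barcode-like code data is printed on the pages of ordinary books and a small hand scanner is provided to read and reproduce the barcodes, books that produce sound can be realized, and collections of musical scores with a sound reproduction function can be published.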
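[0102]
[Effects of the invention]
As described above, according to the present invention, efficient encoding becomes possible even for acoustic signals that include human voices and singing.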
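[Brief description of the drawings]
[FIG. 1] A diagram showing the basic principle of the acoustic signal encoding method according to the prior invention.
[FIG. 2] A flowchart showing the practical procedure of the acoustic signal encoding method according to the prior invention.
[FIG. 3] A graph showing the digital processing that removes the DC component contained in the input acoustic data.
[FIG. 4] A graph showing part of the acoustic data of FIG. 3 enlarged along the time axis.
[FIG. 5] A diagram showing only the inflection points P1 to P6 indicated by arrows in FIG. 4.
[FIG. 6] A graph showing a somewhat disturbed acoustic data waveform.
[FIG. 7] A diagram showing only the inflection points P1 to P7 indicated by arrows in FIG. 6.
[FIG. 8] A diagram showing the state after some of the inflection points P1 to P7 of FIG. 7 have been thinned out.
[FIG. 9] A diagram showing a method of defining a natural frequency for each inflection point.
[FIG. 10] A diagram showing a specific technique for setting unit sections based on information about the individual inflection points.
[FIG. 11] A diagram showing slice processing based on a predetermined allowable level LL.
[FIG. 12] A diagram showing, by arrows, a large number of inflection points for which unit sections are to be set.
[FIG. 13] A diagram showing slice processing based on a predetermined allowable level LL being applied to the inflection points of FIG. 12.
[FIG. 14] A diagram showing the state in which inflection points have been excluded by the slice processing of FIG. 13 and provisional sections K1 and K2 have been set.
[FIG. 15] A diagram showing the process of searching for a discontinuity position in provisional section K1 of FIG. 14.
[FIG. 16] A diagram showing the state in which provisional section K1 has been divided at the discontinuity position found in FIG. 15 and new provisional sections K1-1 and K1-2 have been defined.
[FIG. 17] A diagram showing the integration processing for provisional sections K1-2 and K2 of FIG. 16.
[FIG. 18] A diagram showing the unit sections U1 and U2 finally set by the integration processing of FIG. 17.
[FIG. 19] A diagram showing a technique for obtaining the representative frequency and representative intensity of each unit section.
[FIG. 20] A diagram showing code data for defining the five sections E0, U1, E1, U2, and E2.
[FIG. 21] A table showing an example of code data obtained by encoding the acoustic data in unit sections U1 and U2 of FIG. 20.
[FIG. 22] A table showing another example of code data obtained by encoding the acoustic data in unit sections U1 and U2 of FIG. 20.
[FIG. 23] A diagram showing the structure of general MIDI-format code data.
[FIG. 24] A diagram showing a specific method of converting the acoustic data in each unit section into MIDI data.
[FIG. 25] A table showing the acoustic data in unit sections U1 and U2 of FIG. 20 encoded using MIDI data.
[FIG. 26] A diagram showing a first case in which the generated MIDI data requires correction.
[FIG. 27] A diagram showing a second case in which the generated MIDI data requires correction.
[FIG. 28] A diagram showing the state after correction in the case of FIG. 27.
[FIG. 29] A diagram showing the basic principle of an encoding method that defines multiple different frequencies for the same unit section.
[FIG. 30] A diagram showing the basic principle of an encoding method that defines high-band unit sections and low-band unit sections so that they at least partially overlap on the time axis and defines a different frequency for each unit section.
[FIG. 31] A diagram showing a method of defining, for each inflection point, two natural frequencies: a high-band natural frequency and a low-band natural frequency.
[FIG. 32] A diagram showing the high-band natural frequency and signal intensity defined for each inflection point of FIG. 31.
[FIG. 33] A diagram showing the low-band natural frequency and signal intensity defined for each inflection point of FIG. 31.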
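[Explanation of symbols]
A, A1 to A6, Ai ... representative intensity
Ah(1) to Ah(6) ... high-band representative intensity
Al(1) to Al(4) ... low-band representative intensity
Amax ... maximum value of the representative intensity
a1 to a13 ... signal intensity of the inflection points
aa ... allowable range
D ... DC component
d ... offset amount
E0, E1, E2 ... blank sections
e1 to e6 ... end positions
F, F1 to F6, Fi ... representative frequency
Fh(1) to Fh(6) ... high-band representative frequency
Fl(1) to Fl(4) ... low-band representative frequency
f1 to f17 ... natural frequency of the inflection points
fh1 to fh13 ... high-band natural frequency of the inflection points
fl1 to fl13 ... low-band natural frequency of the inflection points
fa, fb, fc ... frequency characteristics
ff ... allowable range
fs ... sampling frequency
K1, K1-1, K1-2, K2 ... provisional sections
L, L1 to L4, Li ... section length
LL ... allowable level
LLi ... duration of the reproduced sound
N, Ni ... note number
P1 to P17 ... inflection points
s1 to s6 ... start positions
T, Ti ... delta time
t1 to t17 ... positions on the time axis
U1 to U6, Ui, Ui1, Ui2 ... unit sections
Uh(1) to Uh(6) ... high-band unit sections
Ul(1) to Ul(4) ... low-band unit sections
V, Vi ... velocity
x ... sample number
φ, φh, φl ... period
[0001]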
BACKGROUND OF THE INVENTION
The present invention relates to a method for encoding an acoustic signal, and more particularly to a technique for encoding an acoustic signal given as a time-series intensity signal and then decoding and reproducing it. The present invention is especially suited to efficiently converting vocal acoustic signals (signals of human speech and singing) into MIDI-format code data, and is expected to find application in various industrial fields in which voice is recorded.
[0002]
[Prior art]
As a technique for encoding an acoustic signal, PCM (Pulse Code Modulation) is the most widespread and is currently widely used as the recording system for audio CDs, DAT, and the like. The basic principle of PCM is that an analog acoustic signal is sampled at a predetermined sampling frequency and the signal intensity at each sampling instant is quantized and expressed as digital data. The higher the sampling frequency and the number of quantization bits, the more faithfully the original sound can be reproduced. However, the higher the sampling frequency and the number of quantization bits, the greater the amount of information required. Therefore, as a technique for reducing the amount of information as far as possible, ADPCM (Adaptive Differential Pulse Code Modulation), which encodes only the difference in signal change, is also used.
[0003]
On the other hand, the MIDI (Musical Instrument Digital Interface) standard, which grew out of the idea of encoding musical instrument sounds for electronic musical instruments, has come into active use with the spread of personal computers. Code data according to the MIDI standard (hereinafter referred to as MIDI data) is basically data that describes the operation of a musical performance, such as which key of an instrument's keyboard is played and how strongly. The data itself does not include the actual sound waveform. Therefore, reproducing actual sound requires a separate MIDI sound source that stores the waveforms of the instrument sounds. However, compared with recording sound by the PCM method described above, the amount of information is extremely small, and this high coding efficiency is attracting attention. Encoding and decoding technology based on the MIDI standard is widely used in personal-computer software for performing, practicing, and composing music, and in fields such as karaoke and game sound effects.
[0004]
[Problems to be solved by the invention]
As described above, when an acoustic signal is encoded by the PCM method, securing sufficient sound quality makes the amount of information enormous and the data processing burden heavy. Normally, therefore, some sound quality must be sacrificed in order to keep the amount of information within a certain limit. An encoding method based on the MIDI standard can, of course, reproduce sound of sufficient quality with a very small amount of information. However, as described above, the MIDI standard was originally designed to encode the operation of playing a musical instrument, so it cannot be applied widely to general sound. In other words, creating MIDI data requires actually playing a musical instrument or preparing score information.
[0005]
Thus, the conventional PCM and MIDI methods each have advantages and disadvantages as methods of encoding acoustic signals, and neither can secure sufficient sound quality for general sound with a small amount of information. Nevertheless, demand for efficient encoding of general sound keeps growing. In the field of so-called vocal sound, that is, human speech and singing, such demand has been strong for some time. For example, in language education, vocal music education, and criminal investigation, a technique for efficiently encoding vocal acoustic signals is eagerly awaited. However, vocal sound is known to have the characteristic called formants, in which frequency components other than the overtones of the fundamental are mixed in with the fundamental frequency, and conventional techniques have been unable to encode it efficiently.
[0006]
SUMMARY OF THE INVENTION An object of the present invention is to provide an audio signal encoding method capable of efficiently encoding an audio signal including a human voice or singing voice.
[0007]
[Means for Solving the Problems]
  (1) A first aspect of the present invention is an acoustic signal encoding method for encoding an acoustic signal given as a time-series intensity signal, comprising:
  an input stage for capturing an acoustic signal to be encoded as digital acoustic data;
  an inflection point definition stage for obtaining inflection points for the waveform of the captured acoustic data;
  a section setting stage for setting, on the time axis of the acoustic data, a plurality of unit sections that at least partially overlap; and
  an encoding stage for defining, based on the acoustic data in each unit section, a predetermined representative frequency and representative intensity representing that unit section, generating code data including information indicating the start position and end position of each unit section on the time axis and information indicating the representative frequency and representative intensity, and expressing the acoustic data of each unit section by the corresponding code data;
  wherein, in the section setting stage, a plurality of natural frequency definition methods are set by changing a predetermined condition, each method searching the vicinity of an inflection point of interest for a specific inflection point satisfying the predetermined condition and defining the natural frequency of the inflection point of interest based on the distance on the time axis to the inflection point found; a plurality of natural frequencies are defined at each inflection point using these methods; and a section including a group of inflection points whose natural frequencies defined by the same natural frequency definition method fall within a predetermined approximate range is set as one unit section associated with that natural frequency definition method; and
  wherein, in the encoding stage, the representative frequency of each unit section is defined based on, among the plurality of natural frequencies defined for the inflection points included in the unit section, the natural frequency defined by the natural frequency definition method involved in setting that unit section, and the representative intensity of the unit section is defined based on the signal intensities of the inflection points included in the unit section.
[0011]
  (2) A second aspect of the present invention is the acoustic signal encoding method according to the first aspect described above, wherein:
  in the input stage, acoustic data having positive and negative digital values as signal intensity is prepared; and
  in the section setting stage, a first natural frequency definition method is set that searches for a specific inflection point satisfying the condition "an inflection point having the same polarity as the inflection point of interest" and defines the natural frequency of the inflection point of interest based on the distance on the time axis to the inflection point found, and a second natural frequency definition method is set that searches for a specific inflection point satisfying the condition "an inflection point having a signal intensity approximating that of the inflection point of interest" and defines the natural frequency of the inflection point of interest based on the distance on the time axis to the inflection point found.
[0012]
  (3) A third aspect of the present invention is the acoustic signal encoding method according to the second aspect described above, wherein,
  in the section setting stage, a plurality of natural frequency definition methods are set that can define a plurality of natural frequencies within a range whose upper limit is the natural frequency fh defined by the first natural frequency definition method and whose lower limit is the natural frequency fl defined by the second natural frequency definition method.
[0013]
  (4) A fourth aspect of the present invention is the acoustic signal encoding method according to the first to third aspects described above, wherein,
  in the encoding stage, a note number is determined based on the representative frequency, a velocity is determined based on the representative intensity, and a delta time is determined based on the length of the unit section; the acoustic data of each unit section is converted into MIDI-format code data expressed by the note number, velocity, and delta time; and different channels are assigned to unit sections that overlap on the time axis.
[0014]
  (5) A fifth aspect of the present invention is a computer-readable recording medium on which is recorded an acoustic signal encoding program for executing the acoustic signal encoding method according to the first to fourth aspects described above.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
  Hereinafter, the present invention will be described based on the illustrated embodiments. The present invention is an improvement on the invention disclosed in Japanese Patent Laid-Open No. 10-247099 (hereinafter referred to as the prior invention). Therefore, in the following description, the encoding method according to the prior invention is first described in §1 to §3.
[0017]
§1. Basic principle of encoding method of acoustic signal according to prior invention
First, the basic principle of the acoustic signal encoding method according to the prior invention will be described with reference to FIG. 1. Assume that an analog acoustic signal is given as a time-series intensity signal, as shown in the upper part of FIG. 1. In the illustrated example, the acoustic signal is plotted with the time axis t on the horizontal axis and the signal intensity A on the vertical axis. In the prior invention, this analog acoustic signal is first captured as digital acoustic data. This can be done with the conventional PCM method, by sampling the analog acoustic signal at a predetermined sampling frequency and converting the signal intensity A into digital data with a predetermined number of quantization bits. Here, for convenience of explanation, the waveform of the acoustic data digitized by the PCM method is also represented by the same waveform as the analog acoustic signal in the upper part of FIG. 1.
[0018]
Next, a plurality of unit sections are set on the time axis t of the digital acoustic data. In the illustrated example, six unit sections U1 to U6 are set. The position and length of the i-th unit section Ui on the time axis t are indicated by the coordinate values of the start end si and end ei on the time axis t. For example, the unit section U1 is a section having a length of (e1-s1) from the start end s1 to the end e1.
[0019]
Once a plurality of unit sections have been set in this way, a predetermined representative frequency and representative intensity representing each unit section are defined based on the acoustic data in that section. Here, a state is shown in which the representative frequency Fi and the representative intensity Ai are defined for the i-th unit section Ui. For example, the representative frequency F1 and the representative intensity A1 are defined for the first unit section U1. The representative frequency F1 is a representative value of the frequency components of the acoustic data included in the section from the start s1 to the end e1, and the representative intensity A1 is a representative value of the signal intensity of the acoustic data included in that same section. In general, the acoustic data in unit section U1 contains more than one frequency component, and its signal intensity also varies. The point of the prior invention is that a single representative frequency and a single representative intensity are defined for each unit section, and encoding is performed using these representative values.
[0020]
That is, once the representative frequency and representative intensity have been defined for each unit section, code data is generated from information indicating the start and end positions of each unit section on the time axis t and information indicating the defined representative frequency and representative intensity, and the acoustic data of each unit section is expressed by the corresponding code data. Encoding based on the MIDI standard can be used to encode the event that an acoustic signal having a single frequency and a single signal intensity lasts for a predetermined period. Code data according to the MIDI standard (MIDI data) can be regarded as data that expresses sound by musical notes, and the lower part of FIG. 1 illustrates, using notes, the concept of the code data finally obtained.
[0021]
In the end, the acoustic data in each unit section is converted into code data having pitch information (the note number in the MIDI standard) corresponding to the representative frequency F1, intensity information (the velocity in the MIDI standard) corresponding to the representative intensity A1, and length information (the delta time in the MIDI standard) corresponding to the length of the unit section (e1-s1). The amount of information in the code data obtained in this way is far smaller than that of the original acoustic signal, so a dramatic improvement in coding efficiency is obtained. Until now, the only ways to generate MIDI data were to capture and encode the operations of a performer actually playing an instrument, or to enter the notes of a score as data. With the method described above, MIDI data can be generated directly from an actual analog acoustic signal.
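As a rough illustration of this conversion, the following Python sketch maps one unit section to a note number, a velocity, and a duration in ticks. It is a sketch under stated assumptions, not the conversion defined in the patent: the function name, the dictionary output, the `ticks_per_second` resolution, and the use of the standard equal-temperament mapping (A4 = 440 Hz = note number 69) are all assumed for the example.

```python
import math

def unit_section_to_note(f_rep, a_rep, start, end, a_max, ticks_per_second=1000):
    """Convert one unit section into MIDI-like note parameters.

    f_rep, a_rep : representative frequency (Hz) and intensity of the section
    start, end   : start and end positions on the time axis (seconds)
    a_max        : maximum representative intensity, used for normalization
    """
    # Equal-temperament mapping (assumed): 440 Hz corresponds to note number 69.
    note = round(69 + 12 * math.log2(f_rep / 440.0))
    note = max(0, min(127, note))
    # Linear velocity mapping, clamped to the MIDI range 0..127.
    vel = max(0, min(127, round(a_rep / a_max * 127)))
    # Section length expressed in ticks (time resolution is an assumption).
    duration_ticks = round((end - start) * ticks_per_second)
    return {"note": note, "velocity": vel, "duration_ticks": duration_ticks}
```

For instance, a section with representative frequency 440 Hz lasting 0.25 s maps to note number 69 with a duration of 250 ticks at the assumed resolution.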
[0022]
However, there are some points to be noted in order to put the encoding method according to the above-described method into practical use. The first point to note is that it is necessary to prepare a sound source during playback. Since the code data finally obtained by the above method does not include the waveform data of the original acoustic signal, a sound source having some acoustic waveform data is required. For example, when reproducing MIDI data, a MIDI sound source is required. However, at the present time when the MIDI standard has become widespread, various MIDI sound sources are available, and no practical problem arises. However, in order to obtain reproduced sound that is faithful to the original acoustic signal, it is necessary to prepare a MIDI sound source having waveform data that approximates the acoustic waveform included in the original acoustic signal. If reproduction using an appropriate MIDI sound source can be performed, it is possible to obtain reproduced sound full of realism with higher sound quality than the original acoustic signal.
[0023]
The second point to note is that, because this encoding method rests on the basic principle of replacing the frequencies of the acoustic data in one unit section with a single representative frequency, it is not well suited to acoustic signals that contain many frequency components at the same time. The method can of course be applied to any acoustic signal, but when it is applied to an acoustic signal having a plurality of characteristic frequency components called formants, such as a human voice, sufficient reproducibility cannot be obtained during reproduction. Accordingly, the encoding method of the prior invention is best used for acoustic signals whose individual unit sections contain only a reasonably limited set of frequency components, such as rhythmic sounds produced by living bodies and rhythmic sounds produced by nature, such as waves and wind. The present invention improves on this point of the prior invention so that sufficient reproducibility can be secured even when encoding an acoustic signal having a plurality of characteristic frequency components called formants, such as a human voice. The specific method is described in §4 and later.
[0024]
A third point to note is that efficient and highly reproducible encoding requires some ingenuity in how the unit sections are set. As described above, the basic principle of the prior invention is that the original acoustic data is divided into a plurality of unit sections and converted, for each unit section, into code data indicating a single frequency and a single intensity. The code data finally obtained therefore depends heavily on how the unit sections are set. The simplest method is to define unit sections uniformly at equal intervals on the time axis, for example every 10 ms. With this method, however, the unit sections are always defined in the same way regardless of the original acoustic data to be encoded, and efficient, highly reproducible encoding cannot be expected. In practice, therefore, it is preferable to analyze the waveform of the original acoustic data and set unit sections suited to the individual acoustic data.
[0025]
One approach for setting an efficient unit section is to extract a section having an approximate frequency band in the acoustic data as one unit section. Since the frequency component in the unit section is replaced by one representative frequency, if a frequency component that is too far from this representative frequency is included, the reproducibility during reproduction is reduced. Therefore, it is important to extract a section where a frequency approximated to some extent is maintained as one unit section in order to perform efficient encoding with high reproducibility. When this approach is taken, specifically, a change point of the frequency of the original acoustic data may be recognized, and a unit section having the change point as a boundary may be set.
[0026]
Another approach for setting an efficient unit section is a method of extracting a section whose signal intensity is approximated from acoustic data as a single unit section. Since the signal strength in the unit section is replaced by one representative strength, if the signal strength is too far from the representative strength, the reproducibility during reproduction is reduced. Therefore, it is important to extract a section where the signal strength approximated to some extent is maintained as one unit section in order to perform efficient encoding with high reproducibility. When this approach is taken, specifically, a change point of the signal intensity of the original acoustic data may be recognized, and a unit section having the change point as a boundary may be set.
[0027]
§2. Practical procedure of acoustic signal encoding method according to prior invention
FIG. 2 is a flowchart showing a more practical procedure of the prior invention. This procedure consists of four major stages: an input stage S10, an inflection point definition stage S20, a section setting stage S30, and an encoding stage S40. The input stage S10 captures the acoustic signal to be encoded as digital acoustic data. The inflection point definition stage S20, which can be regarded as preparation for the subsequent section setting stage S30, obtains inflection points (local peaks) for the waveform of the captured acoustic data. The section setting stage S30 sets a plurality of unit sections on the time axis of the acoustic data based on these inflection points, and the encoding stage S40 converts the acoustic data of each unit section into individual code data. The principle of conversion to code data has already been described in §1: a predetermined representative frequency and representative intensity representing each unit section are defined based on the acoustic data in that section, and code data is generated from information indicating the start and end positions of each unit section on the time axis together with information indicating the representative frequency and representative intensity. The processing performed in each of these stages is described in order below.
[0028]
<<< 2.1 Input stage >>>
In the input stage S10, a sampling process S11 and a DC component removal process S12 are executed. The sampling process S11 is a process for capturing an analog audio signal to be encoded as digital audio data, and is a process for performing sampling using a conventional general PCM technique. In this embodiment, sampling is performed under the conditions of sampling frequency: 44.1 kHz and the number of quantization bits: 16 bits, and digital acoustic data is prepared.
[0029]
The subsequent DC component removal process S12 is a digital process for removing the DC component contained in the input acoustic data. For example, in the acoustic data shown in FIG. 3, the center level of the amplitude lies D above the center level of the data range indicating the signal intensity. (As a specific digital value, when sampling is performed with 16 bits and the data range is set to 0 to 65535, this center level is 32768. Hereinafter, for convenience of explanation, the center level of the data range is taken as 0, as in the graph of FIG. 3, and each sampled signal intensity is expressed as a positive or negative value.) In other words, the acoustic data includes a DC component corresponding to the value D. If the analog acoustic signal subjected to the sampling process contains a DC component, that DC component remains in the digital acoustic data. The DC component removal process S12 therefore removes the DC component D so that the center level of the amplitude coincides with the center level of the data range. Specifically, the DC component D is subtracted so that the average of the individual sampled signal intensities becomes zero. As a result, acoustic data having positive and negative bipolar digital values as signal intensity is obtained.
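The subtraction described here can be sketched in a few lines of Python. The function name `remove_dc` and the use of a plain list of sample values are assumptions for illustration; the DC component D is estimated as the mean of all samples, which is what "so that the average becomes zero" implies.

```python
def remove_dc(samples):
    """Subtract the mean so that the samples are centered on zero."""
    if not samples:
        return []
    d = sum(samples) / len(samples)  # estimate of the DC component D
    return [s - d for s in samples]
```

After this step the samples take positive and negative values around zero, which is the form the later inflection point search expects.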
[0030]
<<< 2.2 Inflection point definition stage >>>
In the inflection point definition stage S20, an inflection point search process S21 and a same-polarity inflection point thinning process S22 are executed. The inflection point search process S21 obtains inflection points for the waveform of the captured acoustic data. FIG. 4 is a graph in which part of the acoustic data shown in FIG. 3 is enlarged along the time axis. In this graph, the points at the tips of the arrows P1 to P6 are inflection points (local maxima or minima), and each inflection point corresponds to a so-called local peak. To search for such inflection points, it suffices, for example, to examine the sampled digital values in order along the time axis and recognize each position where an increase turns into a decrease or a decrease turns into an increase. Here, each inflection point is indicated by an arrow as shown.
[0031]
Each inflection point corresponds to one sampled digital value and carries information on a predetermined signal intensity (corresponding to the length of the arrow) and information on its position on the time axis t. FIG. 5 shows only the inflection points P1 to P6 indicated by arrows in FIG. 4. In the following description, as shown in FIG. 5, the signal intensity (absolute value) of the i-th inflection point Pi is denoted by the arrow length ai, and the position of inflection point Pi on the time axis t is denoted by ti. In short, the inflection point search process S21 obtains information on each inflection point as shown in FIG. 5 from acoustic data as shown in FIG. 3.
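The search for positions where an increase turns into a decrease, or vice versa, can be sketched as follows. This is a minimal sketch: the function name and the return format of `(sample index, signed value)` pairs are assumptions, and flat runs of equal samples are handled by simply reporting the last sample before the direction reverses. The position ti on the time axis would be the sample index divided by the sampling frequency.

```python
def find_inflection_points(x):
    """Return (index, value) pairs for the local maxima and minima of x."""
    points = []
    prev_dir = 0  # +1 while rising, -1 while falling, 0 before any change
    for i in range(1, len(x)):
        d = x[i] - x[i - 1]
        direction = (d > 0) - (d < 0)
        if direction == 0:
            continue  # flat step: keep the previous direction
        if prev_dir != 0 and direction != prev_dir:
            # The direction reversed, so the previous sample is a local peak.
            points.append((i - 1, x[i - 1]))
        prev_dir = direction
    return points
```

The first and last samples are never reported, because a reversal needs samples on both sides.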
[0032]
Incidentally, the inflection points P1 to P6 shown in FIG. 5 have the property that their polarity alternates. That is, in the example of FIG. 5, the odd-numbered inflection points P1, P3, and P5 are indicated by upward arrows, and the even-numbered inflection points P2, P4, and P6 by downward arrows. This is because the waveform of the original acoustic data has the natural form of a vibration waveform in which positive and negative amplitudes appear alternately. In practice, however, such a well-formed vibration waveform is not always obtained. For example, a somewhat distorted waveform such as that shown in FIG. 6 may be obtained. When the inflection point search process S21 is performed on acoustic data such as that shown in FIG. 6, every one of the inflection points P1 to P7 is detected, so, as shown in FIG. 7, the directions of the arrows at successive inflection points do not alternate. In defining a single representative frequency, however, it is preferable to obtain a row of arrows whose directions alternate.
[0033]
When a plurality of inflection points having digital values of the same polarity (arrows in the same direction) occur consecutively, as shown in FIG. 7, the same-polarity inflection point thinning process S22 leaves only the inflection point whose digital value has the largest absolute value (the longest arrow) and thins out the rest. In the example shown in FIG. 7, only the longest, P2, is left among the three upward arrows P1 to P3, and only the longest, P4, is left among the three downward arrows P4 to P6. As a result of the thinning process S22, only the three inflection points P2, P4, and P7 are left, as shown in FIG. 8. The inflection points shown in FIG. 8 correspond to the essential form of the waveform of the acoustic data shown in FIG. 6.
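The thinning rule can be sketched in Python as follows. The function name and the representation of each inflection point as a `(position, signed intensity)` pair are assumptions for illustration; the numeric values in the usage note are invented to mimic the arrangement of FIG. 7 and are not taken from the figure.

```python
def thin_same_polarity(points):
    """Keep, in each run of same-polarity points, the one with the largest |a|.

    points : list of (t, a) pairs with signed intensity a (hypothetical format).
    """
    kept = []
    for t, a in points:
        if kept and (kept[-1][1] >= 0) == (a >= 0):
            # Same polarity as the last kept point: keep the larger magnitude.
            if abs(a) > abs(kept[-1][1]):
                kept[-1] = (t, a)
        else:
            kept.append((t, a))
    return kept
```

For example, three upward points followed by three downward points and one upward point, `[(1, 3), (2, 5), (3, 2), (4, -6), (5, -1), (6, -4), (7, 3)]`, reduce to the second, fourth, and seventh points, matching the P2, P4, P7 outcome described for FIG. 8.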
[0034]
<<< 2.3 Section setting stage >>>
As described above, efficient and highly reproducible encoding with the encoding method according to the prior invention depends on how the unit sections are set. In that sense, among the stages shown in FIG. 2, the section setting stage S30 is the most important in practice. The inflection point definition stage S20 described above is a preparation for the section setting stage S30, and the unit sections are set using the information on the individual inflection points. That is, the section setting stage S30 proceeds on the basic idea of recognizing, from the inflection points, the positions where the frequency or signal intensity of the acoustic data changes, and of setting unit sections bounded by these change points.
[0035]
As shown in FIG. 5, signal intensities a1 to a6 are defined for the individual inflection points P1 to P6 indicated by arrows. However, no frequency information is defined for the inflection points P1 to P6 themselves. The natural frequency definition process S31, performed first in the section setting stage S30, defines a natural frequency for each inflection point. Strictly speaking, frequency is a physical quantity defined for a wave over some section of the time axis and cannot be defined for a single point on that axis. For convenience, however, a pseudo natural frequency is defined here for each inflection point. (In physics, the term “natural frequency” generally means the frequency at which an object resonates with sound waves or the like, that is, a frequency specific to the object. In the present application, “natural frequency” does not mean such an object-specific frequency. It means a pseudo frequency defined at each inflection point, in other words, the fundamental frequency of the signal at a given moment.)
[0036]
Now, as shown in FIG. 9, consider the nth to (n+2)th inflection points P(n), P(n+1), and P(n+2) among the many inflection points. Signal values a(n), a(n+1), and a(n+2) and positions on the time axis t(n), t(n+1), and t(n+2) are defined for these points. Since each inflection point corresponds to a local peak of the acoustic data waveform, the distance φ on the time axis between P(n) and P(n+2) corresponds, as shown in the figure, to one period of the original waveform. Therefore, if the natural frequency f(n) of the nth inflection point P(n) is defined as f(n) = 1/φ, a natural frequency can be defined for every inflection point. If the positions t(n), t(n+1), and t(n+2) are expressed in seconds, then
φ = t(n+2) − t(n)
and therefore the natural frequency can be defined as
f(n) = 1 / (t(n+2) − t(n))
[0037]
In actual digital processing, the position of each inflection point is expressed not in seconds but as a sample number x (the sequence number of the data item obtained in the sampling process S11). The sample number x and the real time in seconds are uniquely related through the sampling frequency fs. For example, the interval on the real time axis between the mth sample x(m) and the (m+1)th sample x(m+1) is 1/fs.
[0038]
The natural frequency defined at each inflection point in this way is, physically, a quantity that indicates the local frequency near that inflection point. If the distance to the neighboring inflection points is short, the local frequency is high, and if the distance is long, the local frequency is low. In the example above the natural frequency was defined from the distance to the second following inflection point, but any other definition may be used. For example, the natural frequency f(n) of the nth inflection point can also be defined from the distance to the preceding (n−2)th inflection point:
f(n) = 1 / (t(n) − t(n−2))
Moreover, even when the natural frequency f(n) is defined, as above, from the distance to the second following inflection point by
f(n) = 1 / (t(n+2) − t(n))
the last two inflection points have no second following inflection point. For these, the preceding inflection point can be used instead:
f(n) = 1 / (t(n) − t(n−2))
[0039]
Alternatively, the natural frequency f(n) of the nth inflection point can be defined from the distance to the immediately following inflection point:
f(n) = (1/2) · 1 / (t(n+1) − t(n))
or from the distance to the third following inflection point:
f(n) = (3/2) · 1 / (t(n+3) − t(n))
In general, the natural frequency f(n) of the nth inflection point can be defined from the distance on the time axis to the inflection point k positions away (a following inflection point when k is positive, a preceding one when k is negative):
f(n) = (k/2) · 1 / (t(n+k) − t(n))
The value of k is set to an appropriate value in advance. When the intervals between inflection points on the time axis are relatively small, a somewhat larger k yields a natural frequency with less error. If k is set too large, however, the value loses its meaning as a local frequency, which is not desirable.
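As a worked illustration of the general formula, the following Python sketch computes f(n) from sample numbers and a sampling frequency fs. The function name, the fallback to −k near the ends of the sequence (generalizing the treatment of the last two points described above), and the use of `None` when no partner point exists are assumptions made for this example.

```python
def natural_frequencies(sample_numbers, fs, k=2):
    """Natural frequency f(n) = (k/2) / (t(n+k) - t(n)) for each inflection point.

    sample_numbers: sample number x of each inflection point, in time order.
    fs: sampling frequency in Hz, so t = x / fs.
    k: offset to the partner inflection point (non-zero).
    """
    if k == 0:
        raise ValueError("k must be non-zero")
    count = len(sample_numbers)
    freqs = []
    for n in range(count):
        offset = k
        if not 0 <= n + offset < count:
            offset = -k  # assumed fallback: use the point on the other side
        if not 0 <= n + offset < count:
            freqs.append(None)  # too few points to define a frequency
            continue
        delta_samples = sample_numbers[n + offset] - sample_numbers[n]
        # (offset/2) / (delta_samples/fs); the signs of offset and delta cancel.
        freqs.append((offset / 2) * fs / delta_samples)
    return freqs


# Hypothetical data: peaks every 50 samples at fs = 44100 Hz,
# i.e. one full period every 100 samples.
freqs = natural_frequencies([0, 50, 100, 150, 200], fs=44100, k=2)
print(freqs)
```

For this evenly spaced input every point gets a natural frequency of about 441 Hz, including the last two points, which use the preceding inflection point.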
[0040]
Thus, when the natural frequency definition process S31 is completed, a signal intensity a(n), a natural frequency f(n), and a position t(n) on the time axis are defined for each inflection point P(n).
[0041]
As described in §1, there are two approaches to efficient and highly reproducible encoding: a first approach, in which unit sections are set so that the frequencies of the inflection points in one unit section fall within a predetermined approximate range, and a second approach, in which unit sections are set so that the signal intensities of the inflection points in one unit section fall within a predetermined approximate range. A method of setting unit sections with these two approaches is described here using a specific example.
[0042]
Now, as shown in FIG. 10, consider the case where signal intensities a1 to a9 and natural frequencies f1 to f9 are defined for nine inflection points P1 to P9. Under the first approach, attention is paid to the individual natural frequencies f1 to f9, and a group of consecutive inflection points whose natural frequencies approximate one another is set as one unit section. For example, suppose that f1 to f5 take substantially the same value (a first reference value), that f6 to f9 take substantially the same value (a second reference value), and that the difference between the first and second reference values exceeds a predetermined allowable range. Then, as shown in FIG. 10, the section containing the inflection points P1 to P5, whose natural frequencies f1 to f5 lie within the approximate range of the first reference value, is set as unit section U1, and the section containing the inflection points P6 to P9, whose natural frequencies f6 to f9 lie within the approximate range of the second reference value, is set as unit section U2. In the method according to the invention of the prior application, a single representative frequency is given to each unit section. If a section containing inflection points whose natural frequencies lie within an approximate range is set as one unit section in this way, the difference between the representative frequency and each natural frequency stays within a predetermined allowable range, and no major problem arises.
[0043]
Next, a specific method of grouping inflection points with similar natural frequencies into one unit section is described. For example, when nine inflection points P1 to P9 are given as shown in FIG. 10, the natural frequencies of P1 and P2 are compared first to see whether their difference is within a predetermined allowable range ff. If
|f1 − f2| < ff
then P1 and P2 are included in the first unit section U1. Next, whether P3 can be included in U1 is examined by comparing f3 with the average natural frequency (f1 + f2)/2 of U1 so far. If
|(f1 + f2)/2 − f3| < ff
then P3 is included in U1. Likewise, if
|(f1 + f2 + f3)/3 − f4| < ff
then P4 is included in U1, and if
|(f1 + f2 + f3 + f4)/4 − f5| < ff
then P5 is included in U1. Now suppose that for P6
|(f1 + f2 + f3 + f4 + f5)/5 − f6| > ff
that is, the difference between the natural frequency f6 and the average natural frequency of U1 exceeds the allowable range ff. A discontinuous position has then been detected between P5 and P6, and P6 cannot be included in U1. P5 therefore becomes the end of the first unit section U1, and P6 the start of a second unit section U2. The natural frequencies of P6 and P7 are then compared in the same way. If
|f6 − f7| < ff
then P6 and P7 are included in U2. Next, if
|(f6 + f7)/2 − f8| < ff
then P8 is included in U2, and if
|(f6 + f7 + f8)/3 − f9| < ff
then P9 is also included in U2.
[0044]
By detecting discontinuous positions one after another in this way and setting the unit sections in turn, sections can be set according to the first approach. The specific method described above is of course only an example, and various other methods can be adopted. For example, instead of comparing against the average value, a simplified method may always compare the natural frequencies of adjacent inflection points and recognize a discontinuous position wherever the difference exceeds the allowable range ff. That is, the individual differences between f1 and f2, between f2 and f3, between f3 and f4, and so on are examined, and any position where the difference exceeds the allowable range ff is recognized as a discontinuous position.
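The two grouping rules described above (comparison with the running average of the current section, and comparison of adjacent inflection points only) can be contrasted in a small Python sketch. The function names and the sample frequencies are hypothetical, and the handling of a difference exactly equal to ff (treated here as a discontinuity) is an assumption, since the text only states the `<` and `>` cases.

```python
def split_by_running_average(freqs, ff):
    """Start a new section when f differs from the current section's mean by ff or more."""
    if not freqs:
        return []
    sections, current = [], [0]
    for i in range(1, len(freqs)):
        mean = sum(freqs[j] for j in current) / len(current)
        if abs(mean - freqs[i]) < ff:
            current.append(i)
        else:
            sections.append(current)
            current = [i]
    sections.append(current)
    return sections


def split_by_adjacent(freqs, ff):
    """Simplified rule: compare each natural frequency only with its predecessor."""
    if not freqs:
        return []
    sections, current = [], [0]
    for i in range(1, len(freqs)):
        if abs(freqs[i - 1] - freqs[i]) < ff:
            current.append(i)
        else:
            sections.append(current)
            current = [i]
    sections.append(current)
    return sections


# Hypothetical f1..f9 (indices 0..8): five values near 440 Hz, four near 520 Hz.
step = [440, 442, 438, 441, 439, 520, 522, 519, 521]
print(split_by_running_average(step, ff=20))
print(split_by_adjacent(step, ff=20))

# A slow drift shows the difference between the two rules.
drift = [100, 108, 116, 124, 132]
print(split_by_running_average(drift, ff=10))
print(split_by_adjacent(drift, ff=10))
```

For the step-like input both rules place the boundary between the fifth and sixth points, as in the P5/P6 example. For the slowly drifting input the adjacent-only rule never sees a jump, while the running-average rule eventually splits, which is one reason to choose between them deliberately.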
[0045]
The first approach has been described above, and unit sections based on the second approach can be set in the same way. In that case, attention is paid to the signal intensities a1 to a9 of the individual inflection points, which are compared against a predetermined allowable range aa. The two approaches may of course be combined. In that case, attention is paid to both the natural frequencies f1 to f9 and the signal intensities a1 to a9 of the individual inflection points. A strict condition can be imposed, under which a point is included in the same unit section only if both quantities are within the allowable ranges ff and aa, or a loose condition, under which it is included if either one is within its allowable range.
[0046]
In the section setting stage S30, before unit sections are set by the approaches described above, it is preferable to exclude inflection points whose signal intensity has an absolute value below a predetermined allowable level. For example, when an allowable level LL is set as in the example of FIG. 11, the absolute values of the signal intensity a4 at inflection point P4 and the signal intensity a9 at inflection point P9 are below the allowable level LL. In such a case, the inflection points P4 and P9 are excluded. The first purpose of this exclusion is to remove noise components contained in the original acoustic signal. Various noise components are usually mixed in while an acoustic signal is captured electrically, and encoding the signal together with such noise is not desirable.
[0047]
Of course, if the allowable level LL is set above a certain level, signals other than noise components are also excluded. In some cases, however, excluding signals other than noise is itself a meaningful operation. That is, the second purpose of this exclusion is to discard, from the information contained in the original acoustic signal, the information that is not of interest. For example, the acoustic signal shown in the upper part of FIG. 1 represents a human heart sound. In this signal, the information useful for diagnosing disease lies in the large-amplitude portions (the unit sections U1 to U6), and the information in the other portions is not very useful. Setting an allowable level LL and excluding the unneeded portions therefore makes the encoding more efficient.
[0048]
In addition, the relatively small-amplitude components of physiological rhythm sounds produced by a living body, such as heart sounds and lung sounds, are often reverberation arising inside the body. Even if such reverberation is excluded at encoding time, it can easily be restored by applying an acoustic effect such as echo during reproduction. In this respect as well, excluding inflection points below the allowable level is meaningful.
[0049]
When inflection points below the allowable level are excluded, it is preferable to define the unit sections so that they are divided at the positions of the excluded inflection points. In the example of FIG. 11, unit sections U1 and U2, divided at the positions of the inflection points P4 and P9 (shown by dash-dot lines), are defined. Defining unit sections in this way yields very accurate unit sections for an acoustic signal, like the one in the upper part of FIG. 1, in which sections at or above the allowable level (the unit sections U1 to U6) alternate with sections below it (the sections other than U1 to U6).
[0050]
Up to now, the main points of the effective section setting method performed in the section setting step S30 have been described, but a more specific procedure will be described here. As shown in the flowchart of FIG. 2, the section setting stage S30 includes four processes S31 to S34. As already described, the natural frequency definition process S31 is a process of defining a predetermined natural frequency for each inflection point based on the distance on the time axis between each inflection point. Here, as shown in FIG. 12, consider an example in which natural frequencies f1 to f17 are defined for each of inflection points P1 to P17.
[0051]
The level-based slice process S32 excludes inflection points whose signal intensity has an absolute value below a predetermined allowable level and defines sections divided at the positions of the excluded inflection points. Consider the case where an allowable level LL as shown in FIG. 13 is set for the inflection points P1 to P17 of FIG. 12. In this case, the inflection points P1, P2, P11, P16, and P17 are excluded as being below the allowable level. In FIG. 14, the excluded inflection points are indicated by broken-line arrows. The level-based slice process S32 further defines sections K1 and K2 divided at the positions of the excluded inflection points. Whenever even one inflection point is excluded, separate sections are set on its left and right. As a result, a section K1 from inflection point P3 to P10 and a section K2 from inflection point P12 to P15 are set. The sections K1 and K2 defined here are provisional and are not necessarily the final unit sections.
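The level-based slicing can be sketched as follows. The function name and the intensity values are hypothetical; only the pattern of which points fall below LL follows the P1 to P17 example above. Points are numbered from 1 so that the output reads like P1, P2, and so on.

```python
def slice_by_level(intensities, level):
    """Drop points with |a| < level and cut a new provisional section at every drop.

    Returns lists of 1-based point numbers, one list per provisional section.
    """
    sections, current = [], []
    for number, a in enumerate(intensities, start=1):
        if abs(a) < level:
            # Excluded point: close the section to its left, if any.
            if current:
                sections.append(current)
                current = []
        else:
            current.append(number)
    if current:
        sections.append(current)
    return sections


# Hypothetical a1..a17: P1, P2, P11, P16, and P17 lie below LL = 5.
a = [1, 2, 9, 8, 9, 9, 7, 8, 9, 8, 1, 7, 8, 9, 8, 2, 1]
print(slice_by_level(a, level=5))
```

This produces two provisional sections, points 3 to 10 and points 12 to 15, matching K1 and K2 in the description.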
[0052]
The next step, the discontinuous part dividing process S33, searches the time axis for discontinuous positions, where the natural frequency or signal intensity of the inflection points is discontinuous, and defines new sections by further dividing the individual sections defined in process S32 at those positions. In the example above, provisional sections K1 and K2 are defined as shown in FIG. 15. If there is a discontinuity between the inflection points P6 and P7 in provisional section K1, K1 is divided at that position, and provisional sections K1-1 and K1-2 are newly defined as shown in FIG. 16. Three provisional sections K1-1, K1-2, and K2 result. The specific method of searching for discontinuous positions is as described above. In this example, a discontinuity in natural frequency between P6 and P7 is recognized when
|(f3 + f4 + f5 + f6)/4 − f7| > ff
Similarly, a discontinuity in signal intensity between P6 and P7 is recognized when
|(a3 + a4 + a5 + a6)/4 − a7| > aa
[0053]
In the discontinuous part dividing process S33, various conditions can be set for actually dividing a section, such as:
(1) Divide the section only when a natural frequency discontinuity occurs.
(2) Divide the section only when a signal intensity discontinuity occurs.
(3) Divide the section when at least one of a natural frequency discontinuity and a signal intensity discontinuity occurs.
(4) Divide the section only when both a natural frequency discontinuity and a signal intensity discontinuity occur.
Alternatively, a composite condition that combines (1) to (4) according to the degree of discontinuity can be set.
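Conditions (1) to (4) differ only in how the two discontinuity tests are combined, which the following Python sketch makes explicit. The function name, the `mode` parameter, and the sample values are assumptions for illustration; each test compares the next point with the running average of the current section, as in the formulas above.

```python
def split_at_discontinuities(freqs, levels, ff, aa, mode=3):
    """Split one provisional section at frequency and/or intensity discontinuities.

    mode selects the condition: 1 = frequency only, 2 = intensity only,
    3 = either one, 4 = both. Returns lists of 0-based indices.
    """
    if mode not in (1, 2, 3, 4):
        raise ValueError("mode must be 1, 2, 3, or 4")
    if not freqs:
        return []
    sections, current = [], [0]
    for i in range(1, len(freqs)):
        mean_f = sum(freqs[j] for j in current) / len(current)
        mean_a = sum(levels[j] for j in current) / len(current)
        f_jump = abs(mean_f - freqs[i]) > ff
        a_jump = abs(mean_a - levels[i]) > aa
        split = {1: f_jump, 2: a_jump, 3: f_jump or a_jump, 4: f_jump and a_jump}[mode]
        if split:
            sections.append(current)
            current = [i]
        else:
            current.append(i)
    sections.append(current)
    return sections


# Hypothetical values for P3..P10: the frequency jumps between P6 and P7,
# while the intensity stays roughly constant.
f = [200, 202, 198, 201, 300, 302, 299, 301]
a = [50, 52, 49, 51, 50, 51, 49, 50]
for mode in (1, 2, 3, 4):
    print(mode, split_at_discontinuities(f, a, ff=30, aa=10, mode=mode))
```

With these values, modes 1 and 3 split the section into the first four and last four points (K1-1 and K1-2), while modes 2 and 4 leave it whole because no intensity discontinuity occurs.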
[0054]
The sections obtained by the discontinuous part dividing process S33 (in the example above, the three provisional sections K1-1, K1-2, and K2) could be set as the final unit sections as they are. Here, however, the section integration process S34 is performed next. In the section integration process S34, when two adjacent sections obtained by the discontinuous part dividing process S33 are such that the difference between the average natural frequency or signal intensity of the inflection points in one section and that of the inflection points in the other section is within a predetermined allowable range, the two adjacent sections are integrated into one. In the example above, as shown in FIG. 17, suppose that comparing the average natural frequencies of section K1-2 and section K2 gives
|(f7 + f8 + f9 + f10)/4 − (f12 + f13 + f14 + f15)/4| < ff
When the difference between the averages is within the allowable range ff in this way, section K1-2 and section K2 are integrated. The integration may of course instead be performed when the difference in average signal intensity is within the allowable range aa. Given the two conditions that the difference in average natural frequency is within ff and that the difference in average signal intensity is within aa, the integration may be performed when either one is satisfied or only when both are satisfied. Even when these conditions are satisfied, an additional condition can be imposed so that the integration is not performed if the two sections are separated by more than a predetermined distance on the time axis (for example, when many inflection points have been excluded and a sizeable blank section lies between them).
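The integration step can be sketched in Python as follows. This is a minimal illustration that uses only the average natural frequency, plus an optional gap limit for the additional time-axis condition; the function name, the `max_gap` parameter, the dictionaries keyed by point number, and all values are hypothetical.

```python
def integrate_sections(sections, f, t, ff, max_gap=None):
    """Merge adjacent sections whose average natural frequencies differ by less than ff.

    sections: lists of point numbers, in time order.
    f, t: mappings from point number to natural frequency and position.
    max_gap: if given, sections further apart than this are never merged.
    """
    if not sections:
        return []
    merged = [list(sections[0])]
    for section in sections[1:]:
        previous = merged[-1]
        mean_prev = sum(f[n] for n in previous) / len(previous)
        mean_next = sum(f[n] for n in section) / len(section)
        close_in_frequency = abs(mean_prev - mean_next) < ff
        gap = t[section[0]] - t[previous[-1]]
        close_in_time = max_gap is None or gap <= max_gap
        if close_in_frequency and close_in_time:
            previous.extend(section)
        else:
            merged.append(list(section))
    return merged


# Hypothetical values for K1-1 (P3-P6), K1-2 (P7-P10), and K2 (P12-P15).
f = {3: 200, 4: 202, 5: 198, 6: 201, 7: 300, 8: 302, 9: 299, 10: 301,
     12: 305, 13: 304, 14: 306, 15: 303}
t = {n: n * 100 for n in f}  # positions as sample numbers
provisional = [[3, 4, 5, 6], [7, 8, 9, 10], [12, 13, 14, 15]]
print(integrate_sections(provisional, f, t, ff=30))
print(integrate_sections(provisional, f, t, ff=30, max_gap=100))
```

Without a gap limit, K1-2 and K2 are merged, as in the FIG. 17 example. With `max_gap=100` the 200-sample gap left by the excluded point P11 blocks the merge.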
[0055]
The sections obtained after the section integration process S34 are set as the final unit sections. In the example above, as shown in FIG. 18, unit section U1 (provisional section K1-1 in FIG. 17) and unit section U2 (provisional sections K1-2 and K2 of FIG. 17, integrated) are finally set.
[0056]
In the embodiment shown here, the start of a unit section obtained in this way is defined as the position on the time axis of the first inflection point in the section, and its end as the position on the time axis of the last inflection point in the section. Accordingly, in the example of FIG. 18, unit section U1 runs from position t3 to t6 on the time axis, and unit section U2 from position t7 to t15.
[0057]
<<< 2.4 Encoding stage >>>
Next, the encoding stage S40 shown in the flowchart of FIG. 2 is described. In the embodiment shown here, the encoding stage S40 consists of a code data generation process S41 and a code data correction process S42. The code data generation process S41 defines, from the acoustic data in each unit section set in the section setting stage S30, a representative frequency and a representative intensity for that unit section, and generates code data containing information on the start and end positions of each unit section on the time axis together with information on the representative frequency and representative intensity. Through this process, the acoustic data of each unit section is expressed as individual code data. The code data correction process S42, as described later, corrects the generated code data to suit the characteristics of the reproduction sound source device used for decoding.
[0058]
The specific method of generating code data in the code data generation process S41 is very simple. The representative frequency is defined from the natural frequencies of the inflection points in each unit section, and the representative intensity from the signal intensities of those inflection points. This is illustrated with the example shown in FIG. 18, in which a unit section U1 containing inflection points P3 to P6 and a unit section U2 containing inflection points P7 to P15 (excluding P11) are set. In the embodiment shown here, for unit section U1 (start t3, end t6), the representative frequency F1 and the representative intensity A1 are calculated by
F1 = (f3 + f4 + f5 + f6) / 4
A1 = (a3 + a4 + a5 + a6) / 4
Likewise, as shown in the lower part of FIG. 19, for unit section U2 (start t7, end t15), the representative frequency F2 and the representative intensity A2 are calculated by
F2 = (f7 + f8 + f9 + f10 + f12 + f13 + f14 + f15) / 8
A2 = (a7 + a8 + a9 + a10 + a12 + a13 + a14 + a15) / 8
In other words, the representative frequency and the representative intensity are the simple averages of the natural frequencies and signal intensities of the inflection points in the unit section. The representative value need not be a simple average, however, and a weighted average may be used instead. For example, each inflection point may be weighted by its signal intensity, and the weighted average of the natural frequencies may be used as the representative frequency. Alternatively, the maximum signal intensity among the inflection points in the unit section may be used as the representative intensity.
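The alternatives for the representative values (simple average, intensity-weighted average of the natural frequencies, and maximum intensity) can be compared in a short Python sketch. The function names and the numeric values are hypothetical.

```python
def simple_average(values):
    """Plain arithmetic mean, as used for F1 and A1."""
    return sum(values) / len(values)


def intensity_weighted_frequency(freqs, intensities):
    """Average of the natural frequencies, weighting each point by its intensity."""
    total_weight = sum(intensities)
    return sum(f * a for f, a in zip(freqs, intensities)) / total_weight


def peak_intensity(intensities):
    """Maximum signal intensity in the unit section."""
    return max(intensities)


# Hypothetical unit section with four inflection points.
freqs = [200.0, 210.0, 205.0, 195.0]
intensities = [10.0, 30.0, 20.0, 20.0]

print(simple_average(freqs))                              # representative frequency, simple
print(intensity_weighted_frequency(freqs, intensities))   # representative frequency, weighted
print(simple_average(intensities))                        # representative intensity, simple
print(peak_intensity(intensities))                        # representative intensity, maximum
```

The weighted variant pulls the representative frequency toward the natural frequencies of the strongest inflection points, which may be preferable when weak points are less reliable.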
[0059]
Once the representative frequency and representative intensity have been defined for each unit section in this way, code data corresponding to the individual unit sections can be generated, because the start and end positions of each unit section on the time axis are already known. In the example of FIG. 18, code data defining five sections E0, U1, E1, U2, and E2 can be generated, as shown in FIG. 20. Here, the sections U1 and U2 are the unit sections set in the previous stage, and the sections E0, E1, and E2 are the blank sections lying before, between, and after the unit sections. Representative frequencies F1 and F2 and representative intensities A1 and A2 are defined for the unit sections U1 and U2, whereas only the start and end are defined for the blank sections E0, E1, and E2.
[0060]
FIG. 21 is a chart showing a configuration example of the code data corresponding to the sections shown in FIG. 20. In this example, the code data on each line consists of a section name (not actually required), the start and end positions of the section, a representative frequency, and a representative intensity. FIG. 22 is a chart showing another configuration example of the code data corresponding to the same sections. In the example of FIG. 21, the start and end positions of each unit section are expressed directly as code data, whereas in the example of FIG. 22 the section lengths L1 to L4 (see FIG. 20) are used as the information indicating the start and end positions of each unit section. Note that when the start and end positions of the unit sections are used directly as code data, as in the configuration example of FIG. 21, code data for the blank sections E0, E1, and so on is not needed: the section structure shown in FIG. 20 can be reproduced from the code data of the unit sections U1 and U2 alone.
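The relation between the two layouts (explicit start and end positions versus a chain of section lengths with blank sections) can be sketched in Python. The `Section` class, the function names, the use of `None` to mark a blank section, and the numeric values are assumptions for illustration and are not the actual formats of FIG. 21 or FIG. 22.

```python
from dataclasses import dataclass


@dataclass
class Section:
    start: int        # start position (sample number)
    end: int          # end position (sample number)
    frequency: float  # representative frequency
    intensity: float  # representative intensity


def to_length_form(unit_sections):
    """Convert start/end records into (length, frequency, intensity) rows.

    Blank sections are inserted as rows with frequency and intensity None.
    A trailing blank section is omitted because the total length is not known here.
    """
    rows, cursor = [], 0
    for s in unit_sections:
        if s.start > cursor:
            rows.append((s.start - cursor, None, None))
        rows.append((s.end - s.start, s.frequency, s.intensity))
        cursor = s.end
    return rows


def from_length_form(rows):
    """Rebuild the start/end records by accumulating the section lengths."""
    sections, cursor = [], 0
    for length, frequency, intensity in rows:
        if frequency is not None:
            sections.append(Section(cursor, cursor + length, frequency, intensity))
        cursor += length
    return sections


# Hypothetical U1 and U2 with positions given as sample numbers.
units = [Section(300, 600, 200.0, 50.0), Section(700, 1500, 300.0, 60.0)]
rows = to_length_form(units)
print(rows)
print(from_length_form(rows) == units)
```

The round trip shows why the length-based layout needs rows for the blank sections, while the start/end layout can omit them.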
[0061]
The code data finally obtained by the acoustic signal encoding method according to the invention of the prior application is code data such as that shown in FIG. 21 or FIG. 22. However, data of any configuration may be used as the code data, as long as it contains information indicating the start and end positions of each unit section on the time axis and information indicating the representative frequency and representative intensity. As long as the finally obtained code data contains this information, the sound can be reproduced (decoded) using a suitable sound source. In the example of FIG. 20, the original acoustic signal is reproduced by keeping silent from time 0 to t3, sounding a tone of frequency F1 at intensity A1 from time t3 to t6, keeping silent from time t6 to t7, and sounding a tone of frequency F2 at intensity A2 from time t7 to t15.
[0062]
§3. Embodiment using MIDI format code data
<<< 3.1 Principle of Conversion to MIDI Data >>>
As described above, the acoustic signal encoding method according to the invention of the prior application may ultimately use any kind of code data, as long as the code data contains information indicating the start and end positions of each unit section and information indicating the representative frequency and representative intensity. In practice, however, MIDI format code data is the most preferable choice. A specific embodiment using MIDI format code data is described here.
[0063]
FIG. 23 shows the configuration of code data in the general MIDI format. As shown in the figure, in the MIDI format “note-on” data and “note-off” data are placed with “delta time” data between them. “Delta time” data consists of 1 to 4 bytes and indicates a time interval. “Note-on” data consists of 3 bytes in total: the first byte is always fixed to the note-on code “90 H” (H indicates hexadecimal), the second byte holds a code indicating the note number N, and the third byte holds a code indicating the velocity V. The note number N is a numerical value indicating a step of the scale (not the seven-note scale of general music, but a scale of 12 semitones), and it designates a specific key of the keyboard. (The pitch C-2 is assigned note number N = 0, and 128 pitches up to N = 127 are assigned in order. The key at the center of the piano keyboard, the A3 note, has note number N = 69.) The velocity V is a parameter indicating the intensity of the sound (originally, the speed at which a piano key or the like is struck), and 128 levels from V = 0 to 127 are defined.
[0064]
Similarly, “note-off” data also consists of 3 bytes in total: the first byte is always fixed to the note-off code “80 H”, the second byte holds a code indicating the note number N, and the third byte holds a code indicating the velocity V. “Note-on” data and “note-off” data are used in pairs. For example, the 3-byte “note-on” data “90 H, 69, 80” means pressing the key at the center of the keyboard corresponding to note number N = 69, and the key remains pressed until “note-off” data specifying the same note number N = 69 is given (in practice, when the waveform of a MIDI sound source such as a piano is used, the waveform of the sound decays over time). The “note-off” data specifying note number N = 69 is given as 3-byte data such as “80 H, 69, 50”. In the case of a piano, for example, the velocity V in “note-off” data is a parameter indicating the speed at which the finger is released from the key.
[0065]
In the description above, the note-on code “90 H” and the note-off code “80 H” were described as fixed. The lower 4 bits of these codes are not necessarily fixed to 0, however. They can be used as a code specifying one of the channel numbers 0 to 15, and note-on and note-off can be specified separately for each channel, each with the tone color of a different instrument.
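The byte layout described above can be sketched in Python. The 3-byte note-on and note-off messages follow the description directly. The delta-time encoder uses the variable-length quantity scheme of the Standard MIDI File format (7 data bits per byte, with the top bit set on every byte except the last), which the text above does not spell out and is added here as background. The function names are hypothetical.

```python
def note_on(note, velocity, channel=0):
    """3-byte note-on message: status 0x90 plus channel in the low 4 bits."""
    _check(note, velocity, channel)
    return bytes([0x90 | channel, note, velocity])


def note_off(note, velocity, channel=0):
    """3-byte note-off message: status 0x80 plus channel in the low 4 bits."""
    _check(note, velocity, channel)
    return bytes([0x80 | channel, note, velocity])


def _check(note, velocity, channel):
    if not (0 <= note <= 127 and 0 <= velocity <= 127 and 0 <= channel <= 15):
        raise ValueError("note/velocity must be 0-127 and channel 0-15")


def encode_delta_time(ticks):
    """Encode a delta time as a 1- to 4-byte variable-length quantity."""
    if not 0 <= ticks <= 0x0FFFFFFF:
        raise ValueError("delta time must fit in 28 bits")
    out = [ticks & 0x7F]             # last byte: top bit clear
    ticks >>= 7
    while ticks:
        out.append((ticks & 0x7F) | 0x80)  # earlier bytes: top bit set
        ticks >>= 7
    return bytes(reversed(out))


# The "90 H, 69, 80" / "80 H, 69, 50" pair from the text, 384 ticks apart.
message = note_on(69, 80) + encode_delta_time(384) + note_off(69, 50)
print(message.hex(" "))
```

Here the numbers 69, 80, and 50 are read as decimal values, which is an assumption about the notation in the text.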
[0066]
As described above, MIDI data is code data originally intended to describe the operations of a musical instrument performance (in other words, score information), but it is also well suited for use in this acoustic signal encoding method. That is, if the note number N is determined from the representative frequency F of each unit section, the velocity V from the representative intensity A, and the delta time T from the length L of the unit section, the acoustic data of one unit section can be converted into MIDI format code data expressed by a note number, a velocity, and a delta time. A specific method for this conversion into MIDI data is described next.
[0067]
First, the delta time T of the MIDI data can be defined from the section length L of the unit section (in seconds) by the simple expression
T = L · 768
Here, the value “768” is the time resolution of the MIDI representation when the length resolution is set to 1/384 of a quarter note, the minimum value in the MIDI standard, and the metronome marking is set to quarter note = 120 (120 quarter notes per minute). A quarter note then lasts 0.5 seconds, so one second corresponds to 2 × 384 = 768 units. (The length resolution is expressed relative to the quarter note: a setting of 1/2 can express notes down to an eighth note, and a setting of 1/8 down to a thirty-second note. General music uses a setting of about 1/16.)
[0068]
In addition, the note number N of the MIDI data can be defined from the representative frequency F (unit: Hz) of the unit section, on a logarithmic scale in which the frequency doubles per octave, by the expression
N = (12 / log₁₀2) · log₁₀(F / 440) + 69
Here, the value “69” in the second term on the right side is the note number (reference note number) of the sound at the center of the piano keyboard (the A3 sound), the value “440” in the first term on the right side is the frequency of this sound (440 Hz), and the value “12” in the first term on the right side is the number of steps in one octave when a semitone is counted as one step.
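The note number expression can be checked numerically with a short sketch. The function name is hypothetical, and rounding to the nearest integer and clamping to 0 to 127 are assumptions added here, since the source gives only the real-valued formula.

```python
import math


def note_number(freq_hz):
    """Map a representative frequency F (Hz) to a MIDI note number N."""
    n = (12 / math.log10(2)) * math.log10(freq_hz / 440.0) + 69
    # Rounding and clamping are assumptions of this sketch.
    return max(0, min(127, round(n)))
```

A frequency of 440 Hz gives the reference note number 69, and doubling or halving the frequency moves the result by 12, one octave.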
[0069]
Further, the velocity V of the MIDI data can be defined from the representative intensity A of the unit section and its maximum value Amax by the expression
V = (A / Amax) · 127
which yields a value in the range V = 0 to 127. In the case of an ordinary musical instrument, the velocity V in the “note-on” data and the velocity V in the “note-off” data have different meanings as described above. Here, however, the same value as the velocity V in the “note-on” data is used as it is for the velocity V in the “note-off” data.
[0070]
In §2 of the previous chapter, an example was shown in which code data as in FIG. 21 or FIG. 22 is generated for the audio data in two unit sections U1 and U2 as in FIG. 20. When MIDI data is used, the audio data in the unit sections U1 and U2 is represented by a data string as shown in the chart of FIG. Here, the note numbers N1 and N2 are values obtained by the above formula from the representative frequencies F1 and F2, and the velocities V1 and V2 are values obtained by the above formula from the representative intensities A1 and A2.
[0071]
<<< 3.2 Correction processing of MIDI data >>>
In the encoding step S40 in the flowchart shown in FIG. 2, a code data correction process S42 is performed after the code data generation process S41. The code data generation process S41 generates, for example, a MIDI data string as shown in FIG. 25 by the specific method described above, and the code data correction process S42 applies further correction to such a MIDI data string. As will be described later, reproducing (decoding) audio from a MIDI data string as shown in FIG. 25 requires a reproduction sound source device (MIDI sound source) that holds actual audio waveform data. The characteristics of MIDI sound sources vary from one sound source to another, and it may be preferable to correct the MIDI data as necessary so that it suits the characteristics of the MIDI sound source to be used. Specific cases where such correction is necessary are described below.
[0072]
Now, as shown in the upper part of FIG. 26, consider a case where the audio data in the unit section Ui having the section length Li is represented by predetermined MIDI data (MIDI data before correction). That is, the representative frequency Fi and the representative intensity Ai are defined for the unit section Ui, and the note number Ni, velocity Vi, and delta time Ti are set on the basis of the representative frequency Fi, the representative intensity Ai, and the section length Li. Assume that the waveform of the reproduced sound corresponding to the note number Ni of the MIDI sound source to be used for reproducing the MIDI data is as shown in the middle part of FIG. 26. In this case, the duration LLi of the reproduced sound of the MIDI sound source is shorter than the section length Li of the unit section Ui. Therefore, if the MIDI data before correction is reproduced as it is with this MIDI sound source, the reproduced sound dies away after the duration LLi, which is shorter than the time Li for which the original sound must continue to sound. When this happens, the reproducibility of the original acoustic signal is degraded.
[0073]
Therefore, in such a case, it is preferable to perform a correction process in which the unit section is divided into a plurality of subsections and separate code data is generated for each subsection. In the example shown in FIG. 26, as shown in the lower part of the figure, the original unit section Ui is divided into two small sections Ui1 and Ui2, and separate MIDI data is generated for each of them. The representative frequency and representative intensity defined in each of the small sections Ui1 and Ui2 are both the same as the representative frequency Fi and representative intensity Ai of the unit section Ui before division, and only the section length is Li / 2. Therefore, as MIDI data after correction, two sets of MIDI data indicating the note number Ni, velocity Vi, and delta time Ti / 2 are obtained.
[0074]
In a general MIDI sound source, the duration of the reproduced sound is usually determined by the frequency of the reproduced sound. In particular, in a sound source for timbres such as heart sounds, when the frequency of the reproduced sound is f (Hz), the duration is about 5 / f (seconds). Therefore, when such a sound source is used and the relationship between the representative frequency Fi and the section length Li of a specific unit section Ui is Li > 5 / Fi, it is preferable to find an appropriate division number m such that Li / m < 5 / Fi and to divide the unit section Ui into m small sections by the correction process described above.
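A minimal sketch of choosing the division number m follows. It assumes the smallest m satisfying Li / m < 5 / Fi is wanted (the source asks only for "an appropriate" m), and the function names and the tuple layout are hypothetical.

```python
import math


def division_count(length_s, freq_hz, cycles=5.0):
    """Smallest m with length_s / m < cycles / freq_hz (m = 1 if no split is needed)."""
    sustain = cycles / freq_hz          # approximate duration of the source's tone
    if length_s <= sustain:
        return 1                        # the source sounds long enough already
    return math.floor(length_s / sustain) + 1


def split_section(note, velocity, ticks, m):
    """Replace one (note, velocity, delta time) set by m sets of delta time ticks / m."""
    return [(note, velocity, ticks / m) for _ in range(m)]
```

For a unit section of 0.12 s at 100 Hz the tone lasts about 0.05 s, so the sketch returns m = 3, and each small section keeps the original note number and velocity with one third of the delta time.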
[0075]
Next, another case that requires correction is described. Suppose that the reproduced sound of the MIDI sound source to be used for reproduction has a frequency range as shown on the left side of FIG. 27, whereas the frequency range of the sound reproduced from the generated series of MIDI data is as shown on the right side of the figure. In such a case, the reproduced sound uses only part of the frequency band of the MIDI sound source, which is generally not preferable. Therefore, it is preferable to perform a correction process that raises the frequencies (note numbers) on the MIDI data side as a whole so that the average frequency of the MIDI data approaches the center of the frequency range of the MIDI sound source (in this example, the reference A note of 440 Hz (note number N = 69)) and the offset amount d becomes 0 as shown in FIG.
[0076]
Of course, depending on the nature of the acoustic signal, it may be preferable to reproduce it while being shifted to the low-pitched sound side, and good results may not always be obtained by the correction processing as described above. Therefore, it is preferable to appropriately determine whether or not to perform such correction processing in consideration of the properties of individual acoustic signals.
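The offset correction can be sketched as a uniform shift of all note numbers. This sketch assumes that the "average" is the arithmetic mean of the note numbers and that the shift is rounded to a whole number of semitones; both are assumptions, and the function name is hypothetical.

```python
def center_notes(notes, center=69):
    """Shift every note number by the same offset so that their mean approaches center."""
    if not notes:
        return []
    offset = round(center - sum(notes) / len(notes))  # offset amount d, in semitones
    # Clamp to the valid note number range 0..127.
    return [max(0, min(127, n + offset)) for n in notes]
```

Because every note moves by the same amount, the intervals between notes are preserved. As the text notes, whether to apply such a shift at all depends on the signal.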
[0077]
In addition, depending on the MIDI sound source to be used, various other correction processes may be required to suit its characteristics. For example, when a MIDI sound source of a special standard is used in which a pitch difference of one octave does not correspond to a doubling of frequency, a note number correction process or the like is required to conform to that standard.
[0078]
§4. Improvements in the present invention
The encoding methods according to the inventions of the prior applications described so far can ensure practically sufficient reproducibility when encoding acoustic signals whose individual unit sections contain only a reasonably limited set of frequency components, such as rhythmic sounds produced by living bodies or rhythmic sounds produced by nature such as waves and wind. However, when an acoustic signal that simultaneously contains a very wide range of frequency components, such as the human voice (so-called vocal sound), is encoded, sufficient reproducibility cannot always be ensured. In particular, human voice is known to have a characteristic called a formant (a characteristic in which harmonic components other than overtones are mixed in), and it is also supported theoretically that the above-described method according to the invention of the prior application cannot encode it with sufficient reproducibility. In a general musical instrument, when a specific pitch is played, frequency components that are integral multiples of the frequency component corresponding to the played pitch (overtone harmonic components) are obtained. Therefore, if such a musical instrument performance waveform is used as a MIDI sound source, a sound including harmonic components can be reproduced even by the encoding method according to the invention of the prior application. However, since human voices having formants contain harmonic components other than overtones, sufficient reproducibility cannot be ensured.
[0079]
The method of the present invention described below improves on the invention of the prior application so that it can adequately handle the encoding of human voices having formants. First, the basic concept of the present invention will be described with reference to FIG. 29. As shown in the upper part of FIG. 29, assume that an analog acoustic signal is given as a time-series intensity signal and is taken in as digital acoustic data. A plurality of unit sections U1 to U6 are then set on the time axis t of the digital acoustic data. Up to this point, the method is the same as that of the prior invention shown in FIG. Once a plurality of unit sections have been set, a plurality of representative frequencies representing each unit section (in this example, two representative frequencies, a high frequency Fh and a low frequency Fl) and a representative intensity are defined on the basis of the acoustic data in each unit section. The figure shows a state in which the high frequency Fh (i), the low frequency Fl (i), and the representative intensity Ai are defined for the i-th unit section Ui. For example, for the first unit section U1, the high frequency Fh (1) and the low frequency Fl (1) are defined as representative frequencies, together with the representative intensity A1.
[0080]
Once a plurality of representative frequencies and a representative intensity have been defined for each unit section in this way, code data may be generated on the basis of information indicating the start and end positions of the individual unit sections on the time axis t and information indicating the defined representative frequencies and representative intensity, so that the acoustic data of each unit section is expressed by individual code data. For example, if encoding based on the MIDI standard is used, code data as indicated by the musical notes in the lower part of FIG. 29 can be obtained. In the code data shown in the lower part of FIG. 29, the individual notes are presented as chords. That is, for each unit section, a note corresponding to the high frequency Fh and a note corresponding to the low frequency Fl are created, and at the time of playback these two notes are played simultaneously as a chord. By adopting such a method, it becomes possible to adequately handle the encoding of a human voice having formants.
[0081]
In the method shown in FIG. 29, two representative frequencies are defined in the same unit section. In practice, however, it is preferable to define only one representative frequency per unit section and instead to allow different unit sections to be set so that they overlap one another on the same time axis. FIG. 30 shows the basic concept of this more practical method. The middle part of FIG. 30 shows a waveform of digital acoustic data as a time-series intensity signal. Processing that focuses on high frequencies is shown below this waveform, and processing that focuses on low frequencies is shown above it. That is, in the processing focusing on high frequencies shown in the lower half of the figure, the high-frequency unit sections Uh (1) to Uh (6) are set, the representative frequencies Fh (1) to Fh (6) and representative intensities Ah (1) to Ah (6) are defined, and finally high-frequency code data as shown in the lowermost part of the figure is generated. In the processing focusing on low frequencies shown in the upper half of the figure, on the other hand, the low-frequency unit sections Ul (1) to Ul (4) are set, the representative frequencies Fl (1) to Fl (4) and representative intensities Al (1) to Al (4) are defined, and finally low-frequency code data as shown in the uppermost part of the figure is generated.
[0082]
The important point here is that the high-frequency unit sections Uh (1) to Uh (6) and the low-frequency unit sections Ul (1) to Ul (4) overlap at least partially on the time axis t. Of course, when the time axis t is traced from left to right in the figure, there may be parts in which only a high-frequency unit section is set, parts in which only a low-frequency unit section is set, and parts in which no unit section is set, but at least part of the time axis t contains a span in which a high-frequency unit section and a low-frequency unit section are set so as to overlap. If the unit sections set in this way are each encoded independently by determining a representative frequency and a representative intensity for each, code data that overlaps on the time axis is obtained. In the example shown in FIG. 30, for instance, the high-frequency code data shown at the bottom and the low-frequency code data shown at the top overlap at least partially on the time axis t, and at playback they are sounded together as a chord. Note that the musical notes shown are conceptual and are not directly related to the waveform and the unit sections shown in the middle part of the figure.
[0083]
In this way, if unit sections that at least partially overlap on the time axis are set and encoding is performed separately for each unit section, the sound can be reproduced at playback in the form of chords containing various frequency components. Note that the example shown in FIG. 29 can be regarded as a special case in which the individual high-frequency unit sections and the individual low-frequency unit sections coincide completely.
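How separately encoded, overlapping unit sections end up sounding as chords can be sketched by merging them into a single time-ordered event list. The tuple layout `(start, end, note, velocity)` and the function name are hypothetical and chosen only for illustration; the source does not prescribe a data structure.

```python
def to_events(sections):
    """Turn (start, end, note, velocity) sections into time-ordered on/off events."""
    events = []
    for start, end, note, velocity in sections:
        events.append((start, "on", note, velocity))
        events.append((end, "off", note, velocity))
    # Sort by time; at equal times, emit note-off before note-on.
    return sorted(events, key=lambda e: (e[0], e[1] != "off"))


# Hypothetical example: two high-frequency sections overlapping one low-frequency section.
high = [(0.0, 0.1, 76, 100), (0.1, 0.2, 77, 90)]
low = [(0.0, 0.2, 57, 100)]
events = to_events(high + low)
```

In the merged list the low note 57 stays on from 0.0 to 0.2 while the high notes 76 and 77 follow one another, so two notes sound together throughout.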
[0084]
§5. Practical procedure of audio signal encoding method according to the present invention
The encoding procedure according to the present invention can be carried out in substantially the same manner as the encoding procedure according to the invention of the prior application. That is, as shown in the flowchart of FIG. 2, in the input stage S10 the acoustic signal to be encoded is captured as digital acoustic data, and then in the inflection point definition stage S20 inflection points are obtained for the waveform of the captured acoustic data. The processing so far is exactly the same as the procedure according to the invention of the prior application already described. Next, in the section setting step S30, the unit sections are set. In the present invention, as described above, sections are set so that they at least partially overlap on the time axis. In the encoding step S40, encoding is performed for each unit section, and this too is carried out for each of the unit sections that have been set to overlap.
[0085]
As already described, the first process performed in the section setting stage S30 is the natural frequency definition process S31. At this point, the individual inflection points of the acoustic data waveform have already been found by the inflection point search process S21, and the same-polarity inflection point thinning process S22 has thinned out any run of consecutive inflection points whose digital values have the same polarity so that only the inflection point with the maximum digital value remains. As a result, inflection points with positive signal values and inflection points with negative signal values appear alternately. The natural frequency definition process S31 defines a natural frequency for each of these inflection points on the basis of neighboring information. In the present invention, a plurality of methods for defining a natural frequency for one inflection point are established, and a plurality of natural frequencies are defined at each inflection point by using these methods.
[0086]
Here, two specific natural frequency definition methods suitable for vocal acoustic signals will be described. Consider a case where an inflection point group as shown in FIG. 31 is obtained through the inflection point definition step S20. FIG. 31 shows the nth inflection point P (n) to the (n + 12)th inflection point P (n + 12) in the inflection point group. It can be seen that such an inflection point group contains two frequency components: a high-frequency component whose period φh is the interval between the inflection points P (n) and P (n + 2), and a low-frequency component whose period φl is the interval between the inflection points P (n) and P (n + 6). When inflection points are defined for a vocal acoustic signal, characteristics such as those shown in this figure are obtained. This is because, as described above, human speech has the characteristics of formants. In the example shown in FIG. 31, if attention is paid to the inflection points P (n), P (n + 2), P (n + 4), P (n + 6), P (n + 8), and so on, which have positive signal intensity, the signal intensity can be seen to change in the repeating pattern large, medium, small, large, medium, small. The period of this large, medium, small change corresponds to the period φl and indicates the low-frequency component. On the other hand, the period at which inflection points of the same polarity appear corresponds to the period φh and indicates the high-frequency component.
[0087]
Accordingly, as a first method of defining the natural frequency for each inflection point, a method can be adopted that searches for the period φh at which inflection points of the same polarity appear and defines the natural frequency from this period φh, which yields the high-frequency natural frequency fh. As a second method, a method can be adopted that searches for the period φl at which inflection points of approximately the same signal intensity appear and defines the natural frequency from this period φl, which yields the low-frequency natural frequency fl. More specifically, for each inflection point, a specific inflection point satisfying a predetermined condition is searched for, and the natural frequency is defined from the distance on the time axis to the inflection point found. For example, in FIG. 31, to define the high-frequency natural frequency fh for the inflection point P (n), the search may be performed under the condition “the inflection point of the same polarity that appears first thereafter”. As a result, the inflection point P (n + 2) satisfying this condition is found, and a frequency whose period is the distance φh between the two inflection points on the time axis is defined. Similarly, to define the low-frequency natural frequency fl for the inflection point P (n), the search may be performed under the condition “the inflection point that has a signal intensity approximately equal to that of the inflection point P (n) and appears first thereafter (if the signal intensity is treated as a signed value, this is naturally an inflection point of the same polarity)”. As a result, the inflection point P (n + 6) satisfying this condition is found, and a frequency whose period is the distance φl between the two inflection points on the time axis is defined.
Thus, by changing the search condition, it is possible to define a plurality of natural frequencies for the same inflection point.
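The two search conditions can be sketched as follows. Inflection points are represented as hypothetical `(time, signed intensity)` tuples, and the relative tolerance `tol` stands in for the "predetermined error range", whose actual value the source does not give. Both functions return the offset k of the inflection point found, or `None` if there is none.

```python
def find_same_polarity(points, n):
    """Offset k of the first later inflection point with the same polarity as P(n)."""
    a0 = points[n][1]
    for k in range(1, len(points) - n):
        if (points[n + k][1] > 0) == (a0 > 0):
            return k
    return None


def find_similar_intensity(points, n, tol=0.1):
    """Offset k of the first later inflection point whose signed intensity is within tol of P(n)."""
    a0 = points[n][1]
    for k in range(1, len(points) - n):
        if abs(points[n + k][1] - a0) <= tol * abs(a0):
            return k
    return None


# Hypothetical data imitating FIG. 31: alternating polarity, magnitudes large, medium, small.
pattern = [1.0, 0.6, 0.3]
points = [(i * 0.001, pattern[(i // 2) % 3] * (1 if i % 2 == 0 else -1))
          for i in range(13)]
```

On this hypothetical data the first search started at index 0 finds offset 2 and the second finds offset 6, matching P (n + 2) and P (n + 6) in the text.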
[0088]
According to the above-described method, the high-frequency natural frequency fh (n) for the nth inflection point P (n) is obtained, as described in §2.3, with an arbitrary integer k by the expression
fh (n) = (k / 2) · 1 / (t (n + k) − t (n))
That is, the inflection point P (n + k) that is k points away from the nth inflection point P (n) is searched for (a succeeding inflection point when k is positive, a preceding inflection point when k is negative), and the high-frequency natural frequency fh (n) is obtained from the reciprocal of the difference between the position t (n) of the inflection point P (n) on the time axis and the position t (n + k) of the found inflection point P (n + k) on the time axis. As already described, setting k to a relatively large value yields a natural frequency with less error, but if k is set too large, the meaning as a local frequency is lost.
[0089]
In the case of the example shown in FIG. 31, the high-frequency natural frequency fh (n) for the inflection point P (n) can be defined as the reciprocal of the illustrated period φh, that is,
fh (n) = 1 / (t (n + 2) − t (n))
This corresponds to setting the coefficient k = 2 in the above expression. Of course, if the coefficient k = 4 is set, the inflection point P (n + 4) becomes the search target, and the value of the high-frequency natural frequency fh (n) can also be defined by the expression
fh (n) = 2 · (1 / (t (n + 4) − t (n)))
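A short numerical sketch of the expression fh (n) = (k / 2) · 1 / (t (n + k) − t (n)) follows. The list `times` of inflection point positions and the function name are hypothetical; the bounds check is added here only so that the sketch fails clearly instead of indexing outside the list.

```python
def high_natural_frequency(times, n, k=2):
    """fh(n) = (k / 2) / (t(n + k) - t(n)) for a non-zero integer offset k."""
    if k == 0:
        raise ValueError("k must be non-zero")
    if not 0 <= n + k < len(times):
        raise IndexError("inflection point P(n + k) does not exist")
    return (k / 2) / (times[n + k] - times[n])


# Hypothetical evenly spaced inflection points, one every 1 ms (period of 2 ms).
times = [i * 0.001 for i in range(13)]
```

With evenly spaced inflection points, k = 2, k = 4, and a negative k all give about 500 Hz, which illustrates that a larger k averages over more periods without changing the result for a steady tone.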
[0090]
On the other hand, the low-frequency natural frequency fl (n) for the nth inflection point P (n) is obtained by the expression
fl (n) = 1 / (t (n + k) − t (n))
Here, however, the coefficient k appearing in the denominator on the right side is not an arbitrary integer but an integer satisfying a predetermined condition. That is, the inflection point P (n + k) specified by the integer k must be, among the inflection points whose signal intensity lies within a predetermined error range of the signal intensity of the inflection point P (n), the succeeding inflection point closest to the inflection point P (n). Alternatively, when the integer k is negative and a preceding inflection point is to be searched for, the inflection point P (n + k) specified by the integer k may be, among the inflection points whose signal intensity lies within the predetermined error range of the signal intensity of the inflection point P (n), the preceding inflection point closest to the inflection point P (n). In short, this expression means that the nearest inflection point P (n + k) having approximately the same signal intensity as the inflection point P (n) is searched for, and the low-frequency natural frequency fl (n) is obtained from the reciprocal of the difference between the position t (n) of the inflection point P (n) on the time axis and the position t (n + k) of the found inflection point P (n + k) on the time axis.
[0091]
In the case of the example shown in FIG. 31, the low-frequency natural frequency fl (n) for the inflection point P (n) can be defined as the reciprocal of the illustrated period φl, that is,
fl (n) = 1 / (t (n + 6) − t (n))
This corresponds to setting the coefficient k = 6 in the above expression. That is, in the example of FIG. 31, the inflection point P (n + 6) has a signal intensity within the predetermined error range of the signal intensity of the inflection point P (n) and is found as the nearest such succeeding inflection point to the inflection point P (n). In theory, it is not always necessary to search for the nearest succeeding inflection point (or the nearest preceding inflection point). For example, even when the second-nearest succeeding inflection point P (n + 12) is the search target, the low-frequency natural frequency fl (n) can be defined by the expression
fl (n) = 2 · (1 / (t (n + 12) − t (n)))
In general, when the z-th nearest succeeding or preceding inflection point P (n + k) is the search target, the low-frequency natural frequency fl (n) can be defined by the expression
fl (n) = z · (1 / (t (n + k) − t (n)))
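The general expression fl (n) = z · (1 / (t (n + k) − t (n))) can be sketched for the forward search as follows. Inflection points are hypothetical `(time, signed intensity)` tuples, and the relative tolerance `tol` is an assumed stand-in for the "predetermined error range".

```python
def low_natural_frequency(points, n, z=1, tol=0.1):
    """fl(n) from the z-th later inflection point whose intensity approximates that of P(n)."""
    t0, a0 = points[n]
    matches = 0
    for t, a in points[n + 1:]:
        # Signed comparison: an opposite-polarity point can never fall within tol.
        if abs(a - a0) <= tol * abs(a0):
            matches += 1
            if matches == z:
                return z / (t - t0)
    return None  # fewer than z matching inflection points follow P(n)


# Hypothetical data imitating FIG. 31: alternating polarity, magnitudes large, medium, small.
pattern = [1.0, 0.6, 0.3]
points = [(i * 0.001, pattern[(i // 2) % 3] * (1 if i % 2 == 0 else -1))
          for i in range(13)]
```

On this data z = 1 uses the point six positions ahead and z = 2 uses the point twelve positions ahead, and both give the same frequency of roughly 167 Hz.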
[0092]
Thus, in the present invention, the natural frequency definition process in step S31 of the flowchart of FIG. 2 defines a plurality of natural frequencies for each inflection point. The individual processes in steps S32 to S34 are performed separately for each of the natural frequencies, and the individual processes in steps S41 to S42 are likewise performed separately for each of them. As a result, a plurality of sets of code data that overlap on the time axis are generated, and by reproducing these code data together on the time axis, reproducibility at a practical level can be secured even for human voices with formant characteristics.
[0093]
For example, in the specific example shown in FIG. 31, let n = 1 and denote the inflection points by P1 to P13. If a high-frequency natural frequency is defined at each inflection point, a group of inflection points having natural frequencies fhx and signal intensities ax is defined, and if a low-frequency natural frequency is defined at each inflection point, a group of inflection points having natural frequencies flx and signal intensities ax is defined (where x = 1 to 13). If the slice processing by level in step S32, the discontinuous-part division processing in step S33, and the section integration processing in step S34 are executed separately and independently on these two groups of inflection points, unit sections are set for each group. Here, a unit section set on the basis of the inflection point group having high-frequency natural frequencies as shown in FIG. 32 is a section containing a run of inflection points whose high-frequency natural frequencies fall within a predetermined approximate range, and a unit section set on the basis of the inflection point group having low-frequency natural frequencies as shown in FIG. 33 is a section containing a run of inflection points whose low-frequency natural frequencies fall within a predetermined approximate range. In short, the section setting stage of step S30 includes a process of setting, as one unit section, a section containing a run of inflection points whose natural frequencies defined by the same method fall within a predetermined approximate range. Since the natural frequencies are defined by a plurality of methods, a plurality of unit sections overlapping on the time axis are defined.
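The idea of collecting a run of inflection points with approximately equal natural frequencies into one unit section can be sketched in a simplified form. This is not the S32 to S34 procedure itself (slicing, discontinuity division, and integration are not reproduced here); it is a greedy grouping under an assumed relative tolerance `rel_tol`, with a hypothetical function name.

```python
def group_sections(freqs, rel_tol=0.1):
    """Group consecutive indices whose frequency stays within rel_tol of the section's first."""
    sections = []
    start = 0
    for i in range(1, len(freqs) + 1):
        at_end = i == len(freqs)
        if at_end or abs(freqs[i] - freqs[start]) > rel_tol * freqs[start]:
            sections.append((start, i - 1))   # inclusive index range of one unit section
            start = i
    return sections


# Hypothetical natural frequencies for the same six inflection points.
fh_values = [500, 505, 498, 700, 710, 705]
fl_values = [166, 167, 165, 166, 168, 167]
```

Run separately on the two lists, the sketch yields two high-frequency sections and a single low-frequency section covering the same inflection points, which is the overlapping arrangement described in the text.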
[0094]
In the encoding stage of step S40, a representative frequency and a representative intensity are defined independently for each unit section. That is, among the plurality of natural frequencies defined for the inflection points included in a unit section, the representative frequency of that unit section is defined on the basis of the natural frequency that was involved in setting the unit section, and the representative intensity of the unit section is defined on the basis of the signal intensities of the inflection points included in it. In the example shown in FIG. 30, for instance, the representative frequency Fh (1) of the high-frequency unit section Uh (1) is defined on the basis of the high-frequency natural frequencies, which were involved in setting that unit section, among the plural natural frequencies defined for the inflection points included in the section Uh (1), and the representative intensity Ah (1) is defined on the basis of the signal intensities of the inflection points included in the section Uh (1).
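A minimal sketch of deriving the representative values for one unit section follows. The source says only that they are defined "on the basis of" the natural frequencies and signal intensities; taking the arithmetic mean of the relevant natural frequency and of the absolute intensities is an assumption of this sketch, as are the dictionary keys and the function name.

```python
from statistics import mean


def represent(points, key, section):
    """Return (representative frequency, representative intensity) for one unit section.

    key selects which natural frequency ("fh" or "fl") was used to set the section;
    section is an inclusive (first, last) index range into points.
    """
    first, last = section
    members = points[first:last + 1]
    # Assumption: use simple averages; the source does not fix the statistic.
    return mean(p[key] for p in members), mean(abs(p["a"]) for p in members)


# Hypothetical inflection points carrying both natural frequencies.
points = [
    {"a": 1.0, "fh": 500.0, "fl": 160.0},
    {"a": -0.5, "fh": 520.0, "fl": 170.0},
]
```

Calling `represent(points, "fh", (0, 1))` uses only the high-frequency natural frequencies of the section, while the same points can contribute their low-frequency values to a different, overlapping section.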
[0095]
However, in the present invention, it is not always necessary to use two natural frequencies of the high frequency natural frequency fh and the low frequency natural frequency fl, and any natural frequency between these may be used. In short, a plurality of natural frequencies may be defined within a range in which the high frequency natural frequency fh is the upper limit and the low frequency natural frequency fl is the lower limit. For example, in the example shown in FIG. 31, as the natural frequency for the inflection point P (n), in addition to fh (n) and fl (n), the time between the inflection points P (n) and P (n + 4) It is also possible to define an intermediate natural frequency fm (n) with the distance on the axis as a period.
[0096]
According to the principle of conversion to MIDI data described in §3.1, the velocity V of the MIDI data corresponding to each unit section is obtained by normalizing the representative intensity A of the unit section by the maximum value Amax and multiplying the result by 127, that is, by the expression
V = (A / Amax) · 127
which gives a velocity V with a value of V = 0 to 127. When a so-called vocal acoustic signal is encoded, however, it is preferable either to take the square root of the normalized value and define the velocity V by the expression
V = (A / Amax)^(1/2) · 127
or to take the logarithm and define the velocity V by the expression
V = log (A / Amax) · 127 + 127
(where V = 0 if V < 0)
because a more natural reproduced sound is then obtained.
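The three velocity definitions can be compared with a short sketch. The function name and the `mode` parameter are hypothetical; the base of the logarithm is not stated in the source, so base 10 is assumed here, and rounding to an integer is likewise an assumption.

```python
import math


def velocity(a, a_max, mode="linear"):
    """Map a representative intensity A to a MIDI velocity in 0..127."""
    r = a / a_max                      # normalized intensity
    if mode == "linear":
        v = r * 127
    elif mode == "sqrt":
        v = math.sqrt(r) * 127
    elif mode == "log":
        # Assumption: common logarithm; r <= 0 is treated as silence.
        v = math.log10(r) * 127 + 127 if r > 0 else 0
    else:
        raise ValueError("mode must be 'linear', 'sqrt', or 'log'")
    return max(0, min(127, round(v)))  # negative values become 0, as in the text
```

Both alternative mappings raise the velocity of quieter sections relative to the linear one: for a normalized intensity of 0.25 the linear form gives 32 and the square-root form gives 64.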
[0097]
§6. Application example of acoustic signal encoding method according to the present invention
If the above-described acoustic signal encoding method according to the present invention is used, even a vocal acoustic signal, for which sufficient reproducibility could not be obtained with the encoding method according to the prior invention, can be encoded at a practical level. With this encoding method, a human speaking or singing voice can be reproduced on a MIDI-compatible electronic musical instrument, and it can also be expressed in the form of a score.
[0098]
The various encoding processes described above are in practice carried out by computation on a computer, but the computational burden is lighter than that of calculations such as FFT, and the processing can be performed in real time on a commercially available general-purpose personal computer. Therefore, if a program that causes a general-purpose personal computer to execute the above-described processing is written and distributed on a recording medium such as a floppy disk or a CD-ROM, the general-purpose personal computer can be used as an apparatus for executing the acoustic signal encoding method according to the present invention. The data encoded by the encoding method according to the present invention can likewise be recorded on a medium such as a floppy disk or a CD-ROM by the general-purpose personal computer and distributed, or transmitted via a communication line.
[0099]
If voice playback technology using electronic musical instruments is applied to the field of karaoke, it can be used to reproduce backing chorus, provide model singing, insert narration, and so on, offering new added value as a service. In particular, when applied to online karaoke, model singing data and the like can be transmitted in the form of musical notes, so efficient data transmission becomes possible. If applied to computer-based edutainment, songs or voices can be inserted as sound effects. It can also be taken up as a form of musical work with the human voice as its motif. In addition, since MIDI data encoded by the method according to the present invention can be reproduced using a MIDI sound source consisting of ordinary musical instrument sounds, it is also possible to perform feats such as imitating a person's speaking voice with a musical instrument.
[0100]
In addition, with the encoding method of the present invention, human speech and singing can be grasped in the form of objective code data, so the method can be applied to technical fields in which voice is objectively analyzed and evaluated. For example, in language education and vocal music education, pronunciation, vocalization, and intonation can be evaluated objectively, and in karaoke, singing ability can be scored rigorously by objectively evaluating pitch and rhythm. In the medical field, analysis of vocal auscultation sounds can be used for diagnosis of the respiratory system, and analysis of a patient's speech can provide data for diagnosing, for example, the progress of pharyngeal cancer. Furthermore, in criminal investigation and security, the method can be used for technology that authenticates a person's voice. In particular, human voices contain formant features that remain unchanged even when passed through various commercially available voice changers, so the present invention makes it possible to perform personal authentication with considerably high accuracy.
[0101]
Code data encoded according to the present invention can be recorded on a paper medium in a barcode-like form, and can be printed or duplicated with a copying machine. If such barcode-like code data is transmitted by facsimile, voice mail with a wider band and higher confidentiality than an ordinary telephone can be realized. Alternatively, if barcode-like code data is printed on roll paper, an automatic performance device can be realized using a music-box-type player that reproduces sound while reading the code data on the roll paper. In addition, if barcode-like code data is printed on the pages of an ordinary book and a small hand scanner that reads and reproduces the barcode is provided, a book that produces sound can be realized, and it also becomes possible to publish music books with a voice playback function.
[0102]
[Effect of the invention]
As described above, according to the present invention, it is possible to efficiently encode an acoustic signal including a human voice and a singing voice.
[Brief description of the drawings]
FIG. 1 is a diagram showing a basic principle of an acoustic signal encoding method according to the invention of the prior application.
FIG. 2 is a flowchart showing a practical procedure of an acoustic signal encoding method according to the invention of the prior application.
FIG. 3 is a graph showing digital processing for removing a DC component contained in input acoustic data.
FIG. 4 is a graph showing a part of the acoustic data shown in FIG. 3 in an enlarged manner with respect to the time axis.
FIG. 5 is a diagram showing only inflection points P1 to P6 indicated by arrows in FIG.
FIG. 6 is a graph showing a waveform of acoustic data that is somewhat disturbed.
FIG. 7 is a diagram showing only inflection points P1 to P7 indicated by arrows in FIG.
FIG. 8 is a diagram showing a state in which some of the inflection points P1 to P7 shown in FIG. 7 are thinned out.
FIG. 9 is a diagram illustrating a method of defining a natural frequency for each inflection point.
FIG. 10 is a diagram showing a specific method for setting a unit section based on information on individual inflection points.
FIG. 11 is a diagram illustrating slice processing based on a predetermined allowable level LL.
FIG. 12 is a diagram showing, by arrows, a number of inflection points for which unit sections are to be set.
FIG. 13 is a diagram showing a state in which slice processing based on a predetermined allowable level LL is performed on the inflection point shown in FIG.
FIG. 14 is a diagram showing a state where provisional sections K1 and K2 are set by excluding inflection points by the slice processing shown in FIG.
FIG. 15 is a diagram showing processing for searching for a discontinuous position for the provisional section K1 shown in FIG.
FIG. 16 is a diagram illustrating a state in which a provisional section K1 is divided based on the discontinuous positions searched in FIG. 15 and new provisional sections K1-1 and K1-2 are defined.
FIG. 17 is a diagram showing an integration process for provisional sections K1-2 and K2 shown in FIG.
FIG. 18 is a diagram illustrating unit sections U1 and U2 that are finally set by the integration processing illustrated in FIG. 17.
FIG. 19 is a diagram illustrating a method for obtaining a representative frequency and a representative intensity for each unit section.
FIG. 20 is a diagram illustrating code data for defining five sections E0, U1, E1, U2, and E2.
FIG. 21 is a chart showing an example of code data obtained by encoding the acoustic data in the unit sections U1 and U2 shown in FIG.
FIG. 22 is a chart showing another example of code data obtained by encoding the acoustic data in the unit sections U1 and U2 shown in FIG.
FIG. 23 is a diagram illustrating a configuration of code data in a general MIDI format.
FIG. 24 is a diagram showing a specific method for converting the acoustic data in each unit section into MIDI data.
FIG. 25 is a chart showing a state in which the acoustic data in the unit sections U1 and U2 shown in FIG. 20 is encoded using MIDI data.
FIG. 26 is a diagram illustrating a first case in which correction processing is necessary for generated MIDI data.
FIG. 27 is a diagram illustrating a second case in which correction processing is necessary for generated MIDI data.
FIG. 28 is a diagram showing a state after correction in the case shown in FIG. 27;
FIG. 29 is a diagram illustrating a basic principle of an encoding method for defining a plurality of different frequencies in the same unit section.
FIG. 30 is a diagram illustrating a basic principle of an encoding method in which a high frequency unit section and a low frequency unit section are defined so that they at least partially overlap on the time axis, and a different frequency is defined in each unit section.
FIG. 31 is a diagram showing a method of defining two natural frequencies, a high frequency natural frequency and a low frequency natural frequency, for each inflection point.
FIG. 32 is a diagram showing a state in which a high frequency natural frequency and a signal intensity are defined for each inflection point shown in FIG. 31.
FIG. 33 is a diagram showing a state in which a low frequency natural frequency and a signal intensity are defined for each inflection point shown in FIG. 31.
[Explanation of symbols]
A, A1 to A6, Ai ... Representative intensity
Ah (1) to Ah (6) ... High frequency representative intensity
Al (1) to Al (4) ... Low frequency representative intensity
Amax ... Maximum representative intensity
a1 to a13 ... Signal intensity at the inflection point
aa ... Tolerance
D ... DC component
d ... Offset amount
E0, E1, E2 ... Blank section
e1 to e6 ... End position
F, F1 to F6, Fi ... Representative frequency
Fh (1) to Fh (6) ... High frequency representative frequency
Fl (1) to Fl (4) ... Low frequency representative frequency
f1 to f17 ... Natural frequency of the inflection point
fh1 to fh13 ... High frequency natural frequency of the inflection point
fl1 to fl13 ... Low frequency natural frequency of the inflection point
fa, fb, fc ... Frequency characteristics
ff ... Allowable range
fs ... Sampling frequency
K1, K1-1, K1-2, K2 ... Provisional section
L, L1 to L4, Li ... Section length
LL ... Allowable level
LLi ... Duration of playback sound
N, Ni ... Note number
P1 to P17 ... Inflection point
s1 to s6 ... Start position
T, Ti ... Delta time
t1 to t17 ... Position on the time axis
U1 to U6, Ui, Ui1, Ui2 ... Unit section
Uh (1) to Uh (6) ... High frequency unit section
Ul (1) to Ul (4) ... Low frequency unit section
V, Vi ... Velocity
x ... Sample number
φ, φh, φl ... Period

Claims

An encoding method for encoding an acoustic signal given as a time-series intensity signal, comprising:
An input stage of capturing the acoustic signal to be encoded as digital acoustic data;
An inflection point definition stage of obtaining inflection points for the waveform of the captured acoustic data;
A section setting stage of setting, on the time axis of the acoustic data, a plurality of unit sections that at least partially overlap; and
An encoding stage of defining, based on the acoustic data in each unit section, a predetermined representative frequency and representative intensity representing that unit section, generating code data that includes information indicating the start position and end position of each unit section on the time axis and information indicating the representative frequency and the representative intensity, and expressing the acoustic data of each unit section by the individual code data,
Wherein
In the section setting stage, a natural frequency defining method, which searches the vicinity of a given inflection point for a specific inflection point satisfying a predetermined condition and defines a natural frequency for the given inflection point based on the distance on the time axis to the found inflection point, is set in a plurality of variants by changing the predetermined condition; a plurality of natural frequencies are defined for each inflection point by using the plurality of natural frequency defining methods; and a section including a group of inflection points whose natural frequencies defined by the same natural frequency defining method fall within a predetermined approximate range is set as one unit section in which that natural frequency defining method is involved, and
In the encoding stage, the representative frequency of each unit section is defined based on, among the plurality of natural frequencies defined for the inflection points included in the unit section, the natural frequency defined by the natural frequency defining method involved in setting that unit section, and the representative intensity of the unit section is defined based on the signal intensity of the inflection points included in the unit section.
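As a rough illustration of the section setting and encoding stages recited above, the following Python sketch groups consecutive inflection points whose natural frequencies stay within an allowable range and then derives one representative frequency and intensity per group. It is only a minimal sketch under stated assumptions: the names `UnitSection`, `summarize`, and `set_unit_sections` are hypothetical, the grouping rule (compare against the first point of the current group, using an absolute allowable range `ff`) is an assumption, and the use of simple averages for the representative values is likewise an assumption rather than the procedure fixed by the claim.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One inflection point: (position t on the time axis, signed intensity a,
# natural frequency f obtained by ONE natural frequency defining method).
Point = Tuple[int, float, float]


@dataclass
class UnitSection:
    start: int        # start position on the time axis
    end: int          # end position on the time axis
    frequency: float  # representative frequency
    intensity: float  # representative intensity


def summarize(group: List[Point]) -> UnitSection:
    """Assumption: representative values are plain averages over the group."""
    n = len(group)
    return UnitSection(
        start=group[0][0],
        end=group[-1][0],
        frequency=sum(f for _, _, f in group) / n,
        intensity=sum(abs(a) for _, a, _ in group) / n,
    )


def set_unit_sections(points: List[Point], ff: float) -> List[UnitSection]:
    """Group consecutive points whose frequency stays within +/- ff of the
    first point in the current group (hypothetical grouping rule)."""
    sections: List[UnitSection] = []
    group: List[Point] = []
    for p in points:
        if group and abs(p[2] - group[0][2]) > ff:
            sections.append(summarize(group))
            group = []
        group.append(p)
    if group:
        sections.append(summarize(group))
    return sections


# Illustrative data only: four inflection points with their natural frequencies.
high = [(0, 8.0, 400.0), (5, -8.0, 410.0), (10, 6.0, 800.0), (15, -6.0, 790.0)]
sections = set_unit_sections(high, ff=50.0)
```

Running the same function once on the frequencies from each natural frequency defining method yields two independent lists of unit sections, which may overlap on the time axis.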

The encoding method according to claim 1, wherein
In the input stage, acoustic data having positive and negative digital values as signal intensity is prepared, and
In the section setting stage, a first natural frequency defining method is set that searches for a specific inflection point satisfying the condition of "an inflection point having the same polarity as the inflection point of interest" and defines a natural frequency for the inflection point of interest based on the distance on the time axis to the found inflection point, and a second natural frequency defining method is set that searches for a specific inflection point satisfying the condition of "an inflection point having a signal intensity approximating that of the inflection point of interest" and defines a natural frequency for the inflection point of interest based on the distance on the time axis to the found inflection point.
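The two natural frequency definitions described in this claim can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper names (`natural_frequency`, `same_polarity`, `similar_intensity`) are hypothetical, the search is assumed to run forward in time from the inflection point of interest, positions are assumed to be sample numbers so that the frequency is the sampling frequency `fs` divided by the distance in samples, and "approximating" is assumed to mean an absolute difference no greater than a tolerance.

```python
from typing import Callable, List, Optional, Tuple

# One inflection point: (sample position t, signed signal intensity a).
Point = Tuple[int, float]
Condition = Callable[[float, float], bool]


def natural_frequency(points: List[Point], i: int, fs: float,
                      condition: Condition) -> Optional[float]:
    """Search forward from point i for the first point satisfying `condition`
    and convert the distance on the time axis into a frequency in Hz."""
    t_i, a_i = points[i]
    for t_j, a_j in points[i + 1:]:
        if condition(a_i, a_j):
            return fs / (t_j - t_i)
    return None  # no matching inflection point was found


def same_polarity(a_i: float, a_j: float) -> bool:
    """First condition: same polarity as the inflection point of interest."""
    return (a_i > 0) == (a_j > 0)


def similar_intensity(tolerance: float) -> Condition:
    """Second condition: signal intensity within `tolerance` (assumed rule)."""
    def condition(a_i: float, a_j: float) -> bool:
        return abs(a_j - a_i) <= tolerance
    return condition


# Illustrative data only, sampled at an assumed 1000 Hz.
points = [(0, 10.0), (5, -9.0), (10, 6.0), (15, -5.0), (20, 10.5), (25, -9.2)]
fh = natural_frequency(points, 0, 1000.0, same_polarity)           # nearest same-polarity point
fl = natural_frequency(points, 0, 1000.0, similar_intensity(1.0))  # nearest similar-intensity point
```

In this example the nearest same-polarity point is 10 samples away while the nearest point of similar intensity is 20 samples away, so the first definition gives the higher frequency and the second the lower one.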

The encoding method according to claim 2, wherein
In the section setting stage, a plurality of natural frequency defining methods are set that can define a plurality of natural frequencies within a range whose upper limit is the natural frequency fh defined by the first natural frequency defining method and whose lower limit is the natural frequency fl defined by the second natural frequency defining method.

The encoding method according to any one of claims 1 to 3, wherein
In the encoding stage, a note number is determined based on the representative frequency, a velocity is determined based on the representative intensity, and a delta time is determined based on the length of the unit section, the acoustic data of each unit section is converted into MIDI format code data expressed by the note number, the velocity, and the delta time, and different channels are assigned to unit sections that overlap on the time axis.
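The conversion recited in this claim can be illustrated with a short Python sketch. The note number uses the standard equal-temperament relation in which note 69 corresponds to 440 Hz. Everything else is an assumption made for illustration: the function names are hypothetical, the velocity is assumed to scale linearly with the representative intensity relative to a maximum `a_max`, the section length is assumed to be given in samples and converted to ticks with an assumed `ticks_per_second`, and the channel assignment is a simple greedy rule. The sketch produces plain dictionaries, not a Standard MIDI File.

```python
import math
from typing import Dict, List, Tuple

# One unit section: (start sample, end sample, representative frequency in Hz,
# representative intensity).
Section = Tuple[int, int, float, float]


def note_number(freq_hz: float) -> int:
    """Nearest MIDI note number, with note 69 = 440 Hz, clamped to 0..127."""
    n = round(69 + 12 * math.log2(freq_hz / 440.0))
    return max(0, min(127, n))


def velocity(intensity: float, a_max: float) -> int:
    """Assumed linear mapping of intensity onto the velocity range 1..127."""
    return max(1, min(127, round(127 * intensity / a_max)))


def assign_channels(sections: List[Section]) -> List[int]:
    """Greedy rule: reuse the first channel that is free at the section start,
    so sections overlapping on the time axis get different channels."""
    order = sorted(range(len(sections)), key=lambda k: sections[k][0])
    channel_end: List[int] = []          # end position of last note per channel
    result = [0] * len(sections)
    for k in order:
        start, end = sections[k][0], sections[k][1]
        for ch, busy_until in enumerate(channel_end):
            if busy_until <= start:
                channel_end[ch] = end
                result[k] = ch
                break
        else:
            channel_end.append(end)
            result[k] = len(channel_end) - 1
    return result


def to_midi_notes(sections: List[Section], fs: float, a_max: float,
                  ticks_per_second: float = 960.0) -> List[Dict[str, int]]:
    """Express each unit section as note number, velocity, length in ticks."""
    channels = assign_channels(sections)
    return [
        {
            "channel": ch,
            "note": note_number(freq),
            "velocity": velocity(inten, a_max),
            "ticks": round((end - start) * ticks_per_second / fs),
        }
        for (start, end, freq, inten), ch in zip(sections, channels)
    ]


# Illustrative data only: two overlapping sections followed by a later one.
example = [(0, 4410, 440.0, 100.0), (2205, 6615, 220.0, 100.0),
           (7000, 8000, 880.0, 100.0)]
notes = to_midi_notes(example, fs=44100.0, a_max=100.0)
```

Here the first two sections overlap and therefore land on different channels, while the third starts after both have ended and reuses the first channel.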

A computer-readable recording medium in which a program for encoding an acoustic signal for executing the encoding method according to claim 1 is recorded.