JP2008209579A

JP2008209579A - Sound analysis apparatus and program

Info

Publication number: JP2008209579A
Application number: JP2007045236A
Authority: JP
Inventors: Masataka Goto; 真孝後藤; Takuya Fujishima; 琢哉藤島; Keita Arimoto; 慶太有元
Original assignee: Yamaha Corp; National Institute of Advanced Industrial Science and Technology AIST
Current assignee: Yamaha Corp; National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2007-02-26
Filing date: 2007-02-26
Publication date: 2008-09-11
Anticipated expiration: 2027-02-26
Also published as: JP4625935B2

Abstract

PROBLEM TO BE SOLVED: To improve the overall estimation accuracy of a fundamental frequency even when the waveform of an input sound signal is unstable in an attack period. SOLUTION: In attack detection 1a, the input sound signal is divided into frames and each frame is checked for an attack. In estimation 41 of the probability density function of the fundamental frequency, the probability density function is estimated for each frame by the Expectation-Maximization (EM) algorithm. In sequential tracking 42 of the fundamental frequency by a multi-agent model, the fundamental frequency is estimated from the probability density function. The calculation modes of processes 41 and 42 are switched depending on whether or not the frame being processed is in the attack period. COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a sound analysis apparatus and a program that estimate the pitch (used in this specification to mean the fundamental frequency) of melody and bass sounds in musical acoustic signals that simultaneously contain a singing voice and multiple kinds of instrument sounds, such as those recorded on commercially available CDs (compact discs).

Estimating the pitch of one particular sound source from a monaural acoustic signal in which the sounds of many sources are mixed is very difficult. One essential reason is that, in the time-frequency domain, the frequency components of one sound overlap those of other sounds playing at the same time. For example, in typical popular music performed with a singing voice, keyboard instruments (piano, etc.), guitar, bass guitar, drums and so on, part of the harmonic structure of the singing voice carrying the melody (especially the fundamental frequency component) frequently overlaps with harmonic components of the keyboard instruments and guitar, higher-order harmonic components of the bass guitar, and noise components contained in sounds such as the snare drum. Methods that track each frequency component locally therefore do not work reliably on complex mixtures. Some methods estimate the harmonic structure on the assumption that the fundamental frequency component is present, but they have the major drawback that they cannot handle the missing fundamental phenomenon. They also fail when frequency components of other simultaneous sounds overlap the fundamental frequency component.

For these reasons, conventional pitch estimation techniques targeted only acoustic signals containing a single sound, or a single sound accompanied by aperiodic noise; there was no technique for estimating pitch in signals in which multiple sounds are mixed, such as those recorded on commercially available CDs.

In recent years, however, a technique has been proposed that uses a statistical method to appropriately estimate the pitch of each sound contained in a mixture. This is the technique of Patent Document 1.

In the technique of Patent Document 1, frequency components belonging to the band attributed to the melody sound and frequency components belonging to the band attributed to the bass sound are extracted separately from the input acoustic signal by band-pass filters (BPFs), and the fundamental frequencies of the melody sound and the bass sound are each estimated from the frequency components of the respective band.

More specifically, the technique of Patent Document 1 prepares sound models, each a probability distribution corresponding to the harmonic structure of a sound, and regards the frequency components of the melody band and of the bass band as a mixture distribution formed by a weighted sum of sound models corresponding to various fundamental frequencies. The weight of each sound model is then estimated with the EM (Expectation-Maximization) algorithm.

The EM algorithm is an iterative algorithm for maximum likelihood estimation of probability models that contain hidden variables, and it finds a locally optimal solution. Because the probability distribution with the largest weight can be regarded as the most dominant harmonic structure at that moment, the fundamental frequency of that dominant harmonic structure can then be taken as the pitch. Since this method does not depend on the presence of the fundamental frequency component, it handles the missing fundamental phenomenon properly and finds the most dominant harmonic structure whether or not the fundamental frequency component is present.
Japanese Patent No. 3413634; Japanese Patent No. 3660599

In the conventional sound analysis apparatus described above, the input acoustic signal is divided into frames of a fixed time length, and the EM algorithm is executed frame by frame to estimate the fundamental frequency of the sound of the sound source. In each frame, when the weight values for the sound models of the various fundamental frequencies are updated and optimized by iterating the EM algorithm, the final weight values estimated in the previous frame are carried over and used as the initial state for the EM algorithm in the current frame. In general, however, the waveform of a musical tone tends to be unstable in its attack section. As a result, when the conventional apparatus estimates the fundamental frequency of the input acoustic signal in an attack section with an unstable waveform, the estimation itself becomes unstable, and erroneous estimates of the fundamental frequency tend to occur in succession.

The present invention has been made in view of the circumstances described above, and its object is to provide a sound analysis apparatus and a sound analysis program that can improve the overall estimation accuracy of the fundamental frequency even when the waveform of the input acoustic signal is unstable in the attack section.

The present invention provides a sound analysis apparatus, and a computer program that causes a computer to function as the sound analysis apparatus, comprising: attack detection means for dividing an input acoustic signal into frames of a predetermined time length and determining, for each frame, whether or not the input acoustic signal is a signal of an attack section; probability density function estimation means for, in each frame, using sound models, each of which is a probability density function having a structure corresponding to the harmonic structure of the sound of a sound source, to form a mixture distribution by weighted addition of a plurality of sound models corresponding to various fundamental frequencies, sequentially updating and optimizing the weight value for each sound model so that the mixture distribution matches the distribution of the frequency components of the input acoustic signal, and estimating the optimized weight values of the sound models as the probability density function of the fundamental frequency of the sound of the sound source; fundamental frequency estimation means for estimating and outputting, on the basis of the probability density function of the fundamental frequency, the fundamental frequencies of the sounds of one or more sound sources contained in the input acoustic signal; and calculation control means for switching the mode of the calculation for estimating the probability density function of the fundamental frequency in the probability density function estimation means, or of the calculation for estimating the fundamental frequency in the fundamental frequency estimation means, depending on whether or not the frame to be processed by the probability density function estimation means belongs to an attack section.

According to this invention, the calculation control means switches the mode of the calculation for estimating the probability density function of the fundamental frequency in the probability density function estimation means, or of the calculation for estimating the fundamental frequency in the fundamental frequency estimation means, depending on whether or not the frame to be processed by the probability density function estimation means belongs to an attack section. A calculation mode suited to improving the overall estimation accuracy of the fundamental frequency can therefore be selected and executed by the probability density function estimation means or the fundamental frequency estimation means, so that, for example, the accuracy of the fundamental frequency over an entire piece of music can be improved.

Embodiments of the present invention will be described below with reference to the drawings.

<Overall configuration>
FIG. 1 is a diagram showing the processing of a sound analysis program according to an embodiment of the present invention. The sound analysis program is installed and executed on a computer, such as a personal computer, that has an acoustic signal acquisition function, for example a sound collection function for acquiring acoustic signals from the natural world, a playback function for reproducing music acoustic signals from a recording medium such as a CD, or a communication function for acquiring music acoustic signals via a network. A computer that executes the sound analysis program according to the present embodiment functions as the sound analysis apparatus according to the present embodiment.

The sound analysis program according to the present embodiment estimates the pitch of a particular sound source in a monaural music acoustic signal acquired through the acoustic signal acquisition function. As the most important example, the melody line and the bass line are estimated here. The melody is the sequence of single notes heard more prominently than the others, and the bass is the sequence of the lowest single notes in the ensemble; the trajectories of their changes over time are called the melody line Dm(t) and the bass line Db(t), respectively. Letting the fundamental frequency F0 at time t be Fi(t) (i = m, b) and the amplitude be Ai(t), these are expressed as follows.
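The equations referred to here are not reproduced in this text. A plausible form, assuming each line is simply the pair of fundamental frequency and amplitude defined in the preceding sentence, would be:

```latex
D_m(t) = \{\, F_m(t),\ A_m(t) \,\}, \qquad D_b(t) = \{\, F_b(t),\ A_b(t) \,\}
```

This is a reconstruction from the surrounding definitions and should be checked against the published equations.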

To obtain the melody line Dm(t) and the bass line Db(t) from the input acoustic signal, the sound analysis program includes the following processes: instantaneous frequency calculation 1, attack detection 1a, frequency component candidate extraction 2, frequency band restriction 3, melody line estimation 4a, and bass line estimation 4b. Melody line estimation 4a and bass line estimation 4b each include estimation 41 of the probability density function of the fundamental frequency and sequential tracking 42 of the fundamental frequency by a multi-agent model. In this embodiment, the contents of instantaneous frequency calculation 1, frequency component candidate extraction 2, and frequency band restriction 3 are basically the same as those disclosed in Patent Document 1 cited above. The features of this embodiment are the addition of attack detection 1a and the contents of melody line estimation 4a and bass line estimation 4b, which are controlled on the basis of the result of attack detection 1a. The contents of each process of the sound analysis program according to this embodiment are described below.

<Instantaneous frequency calculation 1>
Instantaneous frequency calculation 1, attack detection 1a, frequency component candidate extraction 2, frequency band restriction 3, and estimation 41 of the probability density function of the fundamental frequency in melody line estimation 4a and bass line estimation 4b are each executed in units of frames of a fixed time length obtained by dividing the acoustic signal on the time axis. In the following, time t is specifically a frame number. In instantaneous frequency calculation 1, the input acoustic signal is fed to a filter bank consisting of a plurality of BPFs, and the instantaneous frequency, which is the time derivative of the phase, is calculated for the output signal of each BPF (see Flanagan, J.L. and Golden, R.M.: Phase Vocoder, The Bell System Technical J., Vol.45, pp.1493-1509 (1966)). Here, the Flanagan method is used: the output of the short-time Fourier transform (STFT) is interpreted as the filter bank output, and the instantaneous frequency is calculated efficiently. When the STFT of the input acoustic signal x(t) using the window function h(t) is given by equations (3) and (4), the instantaneous frequency λ(ω, t) is obtained by equation (5).

Here, h(t) is a window function that provides time-frequency localization (for example, a time window created by convolving a Gaussian function, which provides optimal time-frequency localization, with a second-order cardinal B-spline function).
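The idea of reading the instantaneous frequency off the STFT phase can be sketched in a few lines. The following is a minimal illustration and not equations (3) to (5) of this specification: it approximates the time derivative of the phase by the phase advance between two frames one sample apart, and it uses a Hann window rather than the Gaussian and B-spline window mentioned above. The function names `stft_bin` and `instantaneous_frequency`, the sampling rate, the frame length and the test tone are all assumptions made for the example.

```python
import cmath
import math


def stft_bin(x, start, n_fft, k):
    """Hann-windowed DFT of x[start:start + n_fft], evaluated at bin k only."""
    total = 0j
    for n in range(n_fft):
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft)
        total += w * x[start + n] * cmath.exp(-2j * math.pi * k * n / n_fft)
    return total


def instantaneous_frequency(x, sr, n_fft, k, hop=1):
    """Instantaneous frequency (Hz) of bin k from the phase advance over `hop` samples."""
    omega = 2.0 * math.pi * k / n_fft  # bin centre in radians per sample
    advance = (cmath.phase(stft_bin(x, hop, n_fft, k))
               - cmath.phase(stft_bin(x, 0, n_fft, k)))
    # Deviation from the advance expected at the bin centre, wrapped to [-pi, pi).
    deviation = (advance - omega * hop + math.pi) % (2.0 * math.pi) - math.pi
    return (omega + deviation / hop) * sr / (2.0 * math.pi)


# Hypothetical test signal: a 440 Hz sine sampled at 16 kHz.
sr = 16000
signal = [math.sin(2.0 * math.pi * 440.0 * n / sr) for n in range(1100)]
# Bin 28 has a centre frequency of 437.5 Hz, the closest bin to the tone.
estimate = instantaneous_frequency(signal, sr, n_fft=1024, k=28)
```

Although bin 28 is centred at 437.5 Hz, the estimate lands close to the actual 440 Hz of the tone, which is the property the candidate extraction step relies on.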

A wavelet transform may also be used to calculate the instantaneous frequency. The STFT is used here to reduce the amount of computation, but a single STFT alone gives poor time resolution or frequency resolution in some frequency bands. A multirate filter bank is therefore constructed (see Vetterli, M.: A Theory of Multirate Filter Banks, IEEE Trans. on ASSP, Vol.ASSP-35, No.3, pp.356-372 (1987)) to obtain reasonably adequate time-frequency resolution under the constraint that the processing can run in real time.

<Attack detection 1a>
This process determines whether each frame obtained by dividing the input acoustic signal on the time axis lies within an attack section of the input acoustic signal, and passes, for each frame, information indicating whether the frame belongs to an attack section to melody line estimation 4a and bass line estimation 4b. Various well-known methods can be used for this determination. For example, as disclosed in Patent Document 2, each frame can be divided into a plurality of shorter analysis sections, and the variation in the energy of the acoustic signal across these analysis sections can be analyzed to determine whether the frame belongs to an attack section.
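The energy-based test described above can be sketched as follows. This is only an illustration of the general idea (split a frame into shorter analysis sections and look for a sharp rise in energy), not the specific method of Patent Document 2; the function name `is_attack_frame`, the number of sections and the rise ratio are assumptions chosen for the example.

```python
def is_attack_frame(frame, n_sections=4, rise_ratio=2.0, floor=1e-12):
    """Return True if the energy rises sharply between consecutive analysis sections."""
    size = len(frame) // n_sections
    # Energy (sum of squared samples) of each analysis section.
    energies = [
        sum(s * s for s in frame[i * size:(i + 1) * size])
        for i in range(n_sections)
    ]
    # An attack is flagged when a section is more than `rise_ratio` times
    # as energetic as the one before it; `floor` avoids flagging pure silence.
    return any(
        later > rise_ratio * (earlier + floor)
        for earlier, later in zip(energies, energies[1:])
    )


# Hypothetical frames: a steady tone level and a sudden onset halfway through.
steady = [0.5] * 64
onset = [0.01] * 32 + [0.5] * 32
```

Here `is_attack_frame(onset)` is true and `is_attack_frame(steady)` is false; the resulting per-frame flag is the kind of information that would be handed to the melody line and bass line estimation.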

<Frequency component candidate extraction 2>
In this process, frequency component candidates are extracted on the basis of the mapping from the center frequency of a filter to its instantaneous frequency (see Charpentier, F.J.: Pitch detection using the short-term phase spectrum, Proc. of ICASSP 86, pp.113-116 (1986)). Consider the mapping from the center frequency ω of an STFT filter to the instantaneous frequency λ(ω, t) of its output. If a frequency component of frequency ψ is present, ψ lies at a fixed point of this mapping, and the instantaneous frequency values in its neighborhood are almost constant. That is, the instantaneous frequencies Ψ_f^(t) of all frequency components can be extracted by the following equation.

Since the power of these frequency components is obtained as the value of the STFT power spectrum at each frequency in Ψ_f^(t), the power distribution function Ψ_p^(t)(ω) of the frequency components can be defined as follows.
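The fixed-point condition can be illustrated with a short sketch. The equations themselves are not reproduced in this text, so the code assumes that a component sits where the difference between instantaneous frequency and center frequency crosses zero from positive to negative (instantaneous frequency nearly constant while the center frequency increases). The function name `extract_frequency_components` and the linear interpolation between bins are illustrative choices.

```python
def extract_frequency_components(centers, inst_freqs):
    """Return frequencies where (instantaneous - centre) crosses zero going downward."""
    diffs = [lam - c for c, lam in zip(centers, inst_freqs)]
    found = []
    for i in range(len(diffs) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 >= 0.0 and d1 < 0.0:
            # Linear interpolation of the zero crossing between two neighbouring bins.
            frac = d0 / (d0 - d1)
            found.append(centers[i] + frac * (centers[i + 1] - centers[i]))
    return found


# Hypothetical data: every filter near a 445 Hz component reports about 445 Hz.
centers = [400.0, 420.0, 440.0, 460.0, 480.0]
inst_freqs = [445.0, 445.0, 445.0, 445.0, 445.0]
components = extract_frequency_components(centers, inst_freqs)
```

In this example the only crossing lies between the 440 Hz and 460 Hz filters, and the interpolation recovers the component at 445 Hz. The power of each component would then be read from the STFT power spectrum at the extracted frequency.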

<Frequency band restriction 3>
In this process, the frequency band is restricted by weighting the extracted frequency components. Two kinds of BPF are prepared, one for the melody line and one for the bass line. The BPF for the melody line passes most of the main fundamental and harmonic components of a typical melody line while blocking, to some extent, the frequency band in which overlaps near the fundamental frequency occur frequently. The BPF for the bass line passes most of the main fundamental and harmonic components of a typical bass line while blocking, to some extent, the frequency band in which other performance parts dominate the bass line.

In this embodiment, frequencies on a logarithmic scale are hereafter expressed in units of cents (originally a measure of pitch difference, that is, of an interval), and a frequency fHz expressed in Hz is converted to a frequency fcent expressed in cents as follows.

A semitone in equal temperament corresponds to 100 cents, and one octave corresponds to 1200 cents.
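The conversion formula itself is not reproduced in this text. Cents are 1200 times the base-2 logarithm of a frequency ratio, which agrees with the two figures just stated. The sketch below makes that concrete; the reference frequency used as the zero point (440 × 2^(3/12 − 5) Hz, about 16.35 Hz) is an assumption and should be replaced by whatever the original equation specifies. The function names are hypothetical.

```python
import math

# Assumed zero point of the cent scale; not stated in this text.
REFERENCE_HZ = 440.0 * 2.0 ** (3.0 / 12.0 - 5.0)


def hz_to_cent(f_hz, reference_hz=REFERENCE_HZ):
    """Convert a frequency in Hz to cents above the reference frequency."""
    return 1200.0 * math.log2(f_hz / reference_hz)


def cent_to_hz(f_cent, reference_hz=REFERENCE_HZ):
    """Inverse of hz_to_cent."""
    return reference_hz * 2.0 ** (f_cent / 1200.0)


octave = hz_to_cent(880.0) - hz_to_cent(440.0)
semitone = hz_to_cent(440.0 * 2.0 ** (1.0 / 12.0)) - hz_to_cent(440.0)
```

Whatever reference is chosen, `octave` comes out as 1200 cents and `semitone` as 100 cents, because differences on the cent scale depend only on frequency ratios.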

Let BPFi(x) (i = m, b) be the frequency response of the BPF at frequency x cents, and let Ψ'_p^(t)(x) be the power distribution function of the frequency components. The frequency components that pass the BPF can then be expressed as BPFi(x) Ψ'_p^(t)(x). Here Ψ'_p^(t)(x) is the same function as Ψ_p^(t)(ω) except that its frequency axis is expressed in cents. In preparation for the next stage, the probability density function p_Ψ^(t)(x) of the frequency components that have passed the BPF is defined.

Here, Pow^(t) is the total power of the frequency components that have passed the BPF, as given by the following equation.
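Read together, the two definitions above amount to weighting the power distribution by the BPF response and dividing by the total power so that the result integrates to 1. A minimal sketch on a discrete frequency grid follows; the discretisation, the function name `band_limited_pdf` and the sample values are assumptions, since the original equations are not reproduced in this text.

```python
def band_limited_pdf(power, bpf_response):
    """Weight the power distribution by the BPF response and normalise it to sum to 1."""
    weighted = [b * p for b, p in zip(bpf_response, power)]
    total = sum(weighted)  # plays the role of Pow^(t)
    if total == 0.0:
        raise ValueError("no power passed the band-pass filter")
    return [v / total for v in weighted], total


# Hypothetical 4-point grid: component powers and BPF gains at each point.
pdf, pow_t = band_limited_pdf([0.0, 4.0, 1.0, 3.0], [1.0, 0.5, 1.0, 0.0])
```

With these values the filtered powers are 0, 2, 1 and 0, so `pow_t` is 3 and the normalised distribution is 0, 2/3, 1/3 and 0.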

<Estimation 41 of the probability density function of the fundamental frequency>
For the frequency component candidates that have passed the BPF, this process obtains the probability density function of the fundamental frequency, which represents how dominant each harmonic structure is relative to the others. To this end, this embodiment regards the probability density function p_Ψ^(t)(x) of the frequency components as having been generated from a mixture distribution model (a weighted-sum model) of probability distributions (sound models), each of which models a sound with a harmonic structure. Letting p(x|F) be the probability density function of the sound model whose fundamental frequency is F, the mixture distribution model p(x; θ^(t)) can be defined by the following equation.

Here, Fhi and Fli are the upper and lower limits of the permissible fundamental frequency and are determined by the passband of the BPF. w^(t)(F) is the weight of the sound model p(x|F) and satisfies the following equation.
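The equations for the mixture model and the weight constraint are not reproduced in this text. A form consistent with the surrounding definitions (a weighted sum of sound models over the permitted range of fundamental frequencies, with weights that integrate to 1) would be the following; it is a reconstruction under that assumption rather than a verbatim copy.

```latex
p(x;\theta^{(t)}) = \int_{Fl_i}^{Fh_i} w^{(t)}(F)\, p(x \mid F)\, dF,
\qquad
\theta^{(t)} = \{\, w^{(t)}(F) \mid Fl_i \le F \le Fh_i \,\},
\qquad
\int_{Fl_i}^{Fh_i} w^{(t)}(F)\, dF = 1
```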

Because the number of sound sources in a real-world acoustic signal, such as one from a CD, cannot be assumed in advance, it is important to build the model so that every possible fundamental frequency is considered simultaneously. If the model parameter θ^(t) can be estimated such that the observed frequency components p_Ψ^(t)(x) appear to have been generated from the model p(x; θ^(t)), then p_Ψ^(t)(x) can be regarded as decomposed into the individual sound models, and the weight w^(t)(F) for the sound model of each fundamental frequency F can be interpreted as the probability density function p_F0^(t)(F) of the fundamental frequency F, as shown in the following equation.

In other words, the more dominant a sound model p(x|F) is in the mixture distribution (that is, the larger w^(t)(F) is), the higher the probability of that model's fundamental frequency F in p_F0^(t)(F).

It follows that, when the probability density function p_Ψ^(t)(x) is observed, the problem to be solved is the estimation of the parameter θ^(t) of the model p(x; θ^(t)). The maximum likelihood estimate of θ^(t) is obtained by maximizing the mean log-likelihood defined by the following equation.
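The mean log-likelihood is not reproduced in this text. Assuming the usual definition, in which the log-likelihood of the model is averaged over the observed density, it would read as follows; this is a reconstruction under that assumption.

```latex
\int_{-\infty}^{\infty} p_{\Psi}^{(t)}(x)\, \log p(x;\theta^{(t)})\, dx
```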

Because this maximization problem is difficult to solve analytically, θ^(t) is estimated with the EM (Expectation-Maximization) algorithm mentioned above. The EM algorithm is an iterative algorithm that performs maximum likelihood estimation from incomplete observed data (in this case, p_Ψ^(t)(x)) by applying the E step (expectation step) and the M step (maximization step) alternately and repeatedly. In this embodiment, where the probability density function p_Ψ^(t)(x) of the frequency components that have passed the BPF is regarded as a mixture distribution formed by weighted addition of a plurality of sound models p(x|F) corresponding to various fundamental frequencies F, the EM algorithm is iterated to find the most likely weight parameter θ^(t) (= {w^(t)(F) | Fli ≤ F ≤ Fhi}). In each iteration of the EM algorithm, the old parameter estimate θ_old^(t) (= {w_old^(t)(F) | Fli ≤ F ≤ Fhi}) is updated to obtain a new (more likely) parameter estimate θ_new^(t) (= {w_new^(t)(F) | Fli ≤ F ≤ Fhi}). The recurrence formula that yields θ_new^(t) from θ_old^(t) is as follows. The derivation of this recurrence formula is explained in detail in Patent Document 1, to which the reader is referred.

FIG. 2 illustrates how the weight parameter θ^(t) (= {w^(t)(F) | Fli ≤ F ≤ Fhi}) for the sound models p(x|F) is updated by the EM algorithm in this embodiment. To simplify the illustration, FIG. 2 shows an example that uses sound models with four frequency components.

In the EM algorithm of this embodiment, the spectral distribution ratio corresponding to each sound model is obtained for each frequency x according to the following equation, on the basis of the sound model p(x|F) corresponding to each fundamental frequency F and the current weight value w_old^(t)(F) for each sound model.

As shown in equation (18) above, the spectral distribution ratio (x|F) corresponding to each sound model p(x|F) at a given frequency x is obtained as follows: the amplitude values w_old^(t)(F) p(x|F) at frequency x of the sound models p(x|F) multiplied by their weight values w_old^(t)(F) are summed (this sum corresponds to the integral in the denominator of equation (18)), and each amplitude value w_old^(t)(F) p(x|F) is divided by that sum. As is clear from equation (18), at each frequency x the spectral distribution ratios (x|F) corresponding to the sound models p(x|F) are normalized so that they sum to 1.

Then, in this embodiment, at each frequency x, the value of the probability density function p_Ψ^(t)(x) at that frequency is distributed among the sound models p(x|F) according to their spectral distribution ratios at that frequency, and for each sound model p(x|F) the values of p_Ψ^(t)(x) distributed to it in this way are totaled to give the share of that sound model. The shares of all the sound models are then summed, the share of each sound model is divided by that sum, and the shares normalized to sum to 1 are taken as the new weight parameters w_new^(t)(F). As this processing is repeated, the weight parameters w^(t)(F) of those sound models p(x|F), among the sound models with different fundamental frequencies F, that are most strongly supported by the probability density function p_Ψ^(t)(x) of the frequency components of the mixed sound are progressively emphasized. As a result, the weight parameter w^(t)(F) comes to represent the probability density function of the fundamental frequency in the mixed sound that has passed the BPF.
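The update described in the last three paragraphs can be sketched on a discrete grid: compute the spectral distribution ratios at each frequency, hand out the observed density accordingly, total each sound model's share, and renormalise. The discretisation, the function name `em_update`, the toy sound models and the iteration count are assumptions made for illustration; the recurrence formula itself is not reproduced in this text.

```python
def em_update(p_psi, sound_models, weights):
    """One EM iteration: split p_psi among the sound models, then renormalise the shares."""
    shares = [0.0] * len(weights)
    for x, observed in enumerate(p_psi):
        # Denominator of the spectral distribution ratio at frequency x.
        mix = sum(w * model[x] for w, model in zip(weights, sound_models))
        if mix == 0.0:
            continue  # no sound model explains this frequency
        for j, (w, model) in enumerate(zip(weights, sound_models)):
            # Share of the observed density at x that goes to sound model j.
            shares[j] += observed * w * model[x] / mix
    total = sum(shares)
    return [s / total for s in shares]


# Hypothetical sound models p(x|F) for three candidate fundamentals on a 6-point grid.
sound_models = [
    [0.6, 0.0, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.0, 0.0, 0.4],
]
# Observed density built as 0.2 * model 0 + 0.8 * model 1.
p_psi = [0.12, 0.48, 0.08, 0.32, 0.0, 0.0]

weights = [1.0 / 3.0] * 3  # uniform initial weights
for _ in range(50):
    weights = em_update(p_psi, sound_models, weights)
```

Starting from uniform weights, the iteration moves toward weights of about 0.2, 0.8 and 0 in this toy case: the third model shares one frequency with the first but its other component is absent from the observation, so its weight decays. Initialising `weights` from the previous frame instead of uniformly corresponds to the carry-over behaviour described for the conventional apparatus.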

<Sequential tracking 42 of the fundamental frequency by a multi-agent model (processing as fundamental frequency estimation means)>
The sound analysis program according to the present embodiment includes processing, as fundamental frequency estimation means, that estimates and outputs the fundamental frequency of the sound of one or more sound sources contained in the input acoustic signal, based on the probability density function of the fundamental frequency obtained as described above. To determine the most dominant fundamental frequency Fi(t), this processing takes as the estimate of the fundamental frequency the frequency F that maximizes the probability density function p_F0^(t)(F) of the fundamental frequency (which, from equation (15), is obtained as the final estimate produced by iterating equation (17)).
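As a small illustration of this simple maximum search, the sketch below picks the candidate frequency with the largest probability density. The function name `dominant_f0` and the discretized inputs are assumptions for the example.

```python
def dominant_f0(pdf, freqs):
    """Return the candidate frequency at which the F0 probability density is largest.

    pdf[i] is the density for candidate frequency freqs[i]
    (both are hypothetical, discretized inputs).
    """
    best = max(range(len(pdf)), key=lambda i: pdf[i])
    return freqs[best]
```

When two peaks are nearly equal, small frame-to-frame changes in `pdf` can flip the result between them, which is why a tracking stage is useful.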

When several peaks corresponding to the fundamental frequencies of simultaneously sounding tones compete in the probability density function of the fundamental frequency, however, those peaks may be selected one after another as the maximum of the probability density function, so a result obtained this simply is not always stable. Therefore, to estimate the fundamental frequency from a global viewpoint, the processing as fundamental frequency estimation means in the present embodiment tracks the trajectories of multiple peaks over time as the probability density function of the fundamental frequency changes, and selects from among them the most dominant and stable fundamental frequency trajectory. A multi-agent model is introduced to control this tracking dynamically and flexibly.

The multi-agent model consists of one feature detector and multiple agents (see FIG. 3). The feature detector picks up salient peaks in the probability density function of the fundamental frequency. Each agent is basically driven by those peaks and tracks a trajectory. In other words, the multi-agent model is a general-purpose framework for tracking salient features of an input over time. Specifically, the following processing is performed at each time.

(1) Once the probability density function of the fundamental frequency has been obtained, the feature detector detects multiple salient peaks (peaks exceeding a threshold that changes dynamically according to the maximum peak). For each salient peak, it evaluates how promising the peak is, also taking into account the total power Pow(t) of the frequency components. This is done by treating the current time as a time several frames ahead and tracking the peak's trajectory forward up to that time.

(2) If agents have already been generated, they interact with one another so that each salient peak is assigned exclusively to an agent whose trajectory is close to it. If more than one agent is a candidate for the assignment, the peak is assigned to the most reliable agent.

(3) If the most promising and salient peak has not yet been assigned, a new agent is generated to track that peak.

(4) Each agent has a cumulative penalty, and the agent is terminated when the penalty exceeds a certain threshold.

(5) An agent to which no salient peak was assigned receives a fixed penalty and tries to find the next peak to track directly in the probability density function of the fundamental frequency. If it cannot find that peak either, it receives a further penalty. Otherwise, the penalty is reset.

(6) Each agent evaluates its own reliability as a weighted sum of a measure of how promising and salient its currently assigned peak is and its reliability at the previous time.

(7) The fundamental frequency Fi(t) at time t is determined from the agent that has high reliability and a large total power along the trajectory of the peak it is tracking. The amplitude Ai(t) is determined by extracting the harmonic components and the like of the fundamental frequency Fi(t) from Ψ_p^(t)(ω).
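Steps (1) to (7) can be sketched roughly as follows. This is a heavily simplified illustration under stated assumptions, not the method of the embodiment: the names `Agent`, `detect_salient_peaks`, and `track_step`, all numeric parameters, and the distance criterion are hypothetical. The look-ahead and the power term Pow(t) of step (1), the direct peak search of step (5), and the power criterion of step (7) are omitted, and resetting the penalty whenever a peak is assigned is an assumption.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Agent:
    freq: float               # frequency of the peak currently tracked
    reliability: float = 0.0
    penalty: float = 0.0      # cumulative penalty


def detect_salient_peaks(pdf, freqs, rel_threshold=0.5):
    """Step (1), simplified: local maxima above a threshold tied to the top peak."""
    limit = rel_threshold * max(pdf)
    return [(freqs[i], pdf[i])
            for i in range(1, len(pdf) - 1)
            if pdf[i - 1] < pdf[i] >= pdf[i + 1] and pdf[i] >= limit]


def track_step(agents, peaks, max_dist=50.0, mix=0.5,
               penalty_step=1.0, penalty_limit=3.0):
    """Advance all agents by one frame and return the leading agent (or None)."""
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)
    free = list(agents)   # agents not yet given a peak in this frame
    leftover = []         # peaks that no agent claimed
    for freq, salience in ranked:
        near = [a for a in free if abs(a.freq - freq) <= max_dist]
        if near:
            # Step (2): exclusive assignment; the most reliable candidate wins.
            agent = max(near, key=lambda a: a.reliability)
            free.remove(agent)
            agent.freq = freq
            agent.penalty = 0.0  # assumption: an assigned peak resets the penalty
            # Step (6): weighted sum of peak salience and previous reliability.
            agent.reliability = mix * salience + (1 - mix) * agent.reliability
        else:
            leftover.append((freq, salience))
    for agent in free:
        # Step (5), simplified: penalty only, no direct search in the pdf.
        agent.penalty += penalty_step
        agent.reliability *= (1 - mix)
    if ranked and ranked[0] in leftover:
        # Step (3): the top peak is unassigned, so a new agent is generated for it.
        agents.append(Agent(freq=ranked[0][0], reliability=mix * ranked[0][1]))
    # Step (4): agents whose cumulative penalty exceeds the limit disappear.
    agents[:] = [a for a in agents if a.penalty <= penalty_limit]
    # Step (7), simplified: choose by reliability alone.
    return max(agents, key=lambda a: a.reliability, default=None)
```

In use, `detect_salient_peaks` would be applied to each frame's probability density function and its result passed to `track_step` together with the persistent `agents` list.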

<<Improvements of this embodiment over the technique of Patent Document 1>>
FIG. 4 shows the processing of the estimation 41 of the probability density function of the fundamental frequency in the present embodiment. As shown in FIG. 4, the estimation 41 repeats the E step and M step 411 of the EM algorithm and the convergence determination 412.

First, in the E step and M step 411, the probability density function of the fundamental frequency, that is, the weight values θ = θ_new^(t) (= {w_new^(t)(F) | Fli ≤ F ≤ Fhi}) of the sound models corresponding to the various fundamental frequencies F, is obtained according to the recurrence formula of equation (17) above.

Next, in the convergence determination 412, the weight values θ = θ_new^(t) of the sound models corresponding to the various fundamental frequencies F, obtained in the current E step and M step 411, are compared with the preceding weight values θ = θ_old^(t) to determine whether the change in the weight values θ has fallen within an allowable range. If it has, the estimation 41 of the probability density function of the fundamental frequency ends, and the final value of the probability density function of the fundamental frequency is handed over to the sequential tracking 42 of the fundamental frequency by the multi-agent model. Otherwise, the E step and M step 411 are executed again.
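The loop formed by the E step and M step 411 and the convergence determination 412 can be sketched as below. The sketch assumes the update is supplied as a callable and that "change within an allowable range" means the largest absolute change of any weight is at most `tol`; the text does not specify the measure, and the iteration cap `max_iter` is also an addition for safety.

```python
def iterate_until_converged(w_init, update, tol=1e-6, max_iter=100):
    """Repeat `update` until the weights change by at most `tol`.

    w_init : initial weight values for the frame
    update : callable mapping old weights to new weights (one E/M pass)
    Returns (final_weights, number_of_iterations).
    tol and max_iter are hypothetical parameters for this sketch.
    """
    w_old = list(w_init)
    for n in range(1, max_iter + 1):
        w_new = update(w_old)
        # Convergence determination: compare the new weights with the previous ones.
        change = max(abs(a - b) for a, b in zip(w_new, w_old))
        if change <= tol:
            return w_new, n
        w_old = w_new
    return w_old, max_iter
```

The returned weights are what would be handed to the tracking stage as the frame's probability density function of the fundamental frequency.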

The sound analysis program according to the present embodiment is provided with calculation control means that, based on the information output from the attack detection 1a, controls the calculation mode of either the estimation 41 of the probability density function of the fundamental frequency or the sequential tracking 42 of the fundamental frequency by the multi-agent model, which serves as the fundamental frequency estimation means. This is the improvement of the present embodiment over the technique of Patent Document 1. The control of the calculation mode takes the following four aspects. By operating an operation unit (not shown), the user can specify which aspect the calculation control means of the sound analysis program uses to control the calculation mode.

<<<First aspect>>>
In the first aspect, when the frame being processed belongs to an attack period, the estimation 41 of the probability density function of the fundamental frequency is controlled so that the sequential update of the weight values w^(t)(F) in that frame starts from a predetermined initial value w_flat(F). When the frame being processed does not belong to an attack period, the estimation 41 is controlled so that the sequential update of the weight values w^(t)(F) in that frame starts with the final weight values w^(t-1)(F) of the previous frame as the initial value.

In the conventional technique, when the probability density function of the fundamental frequency is estimated for each frame by iterating the recurrence formula (17) described above, the final weight values w^(t-1)(F) at the previous time t-1 (the previous frame) are used as the initial value of w^(t)(F). However, when the final state of the previous frame's probability density function of the fundamental frequency is used as the initial value in this way, the estimation tends to become unstable and to fall into erroneous estimates for frames in an attack period, where the waveform is unstable. Therefore, in the first aspect, when the sequential update of the weight values w^(t)(F) is started for a frame, the final weight values w^(t-1)(F) of the previous frame are used as the initial value if the frame lies outside an attack period, and a predetermined initial value w_flat(F), for example one with flat weight values across the entire frequency band, is used if the frame lies in an attack period. More specifically, this is done as follows.

First, in the present embodiment, when the sequential update of the weight values w^(t)(F) is started in each frame, the final weight values w^(t-1)(F) of the previous frame multiplied by a coefficient r and the predetermined initial value w_flat(F) multiplied by a coefficient 1-r are added together, as shown in FIG. 4, and the sum is used as the initial value of the weight values w^(t)(F) in that frame.

Then, as shown in FIG. 5, when a frame that does not belong to an attack period is processed, r is set to 1 so that the final weight values w^(t-1)(F) of the previous frame become the initial value of the weight values w^(t)(F). When a frame that belongs to an attack period is processed, r is set to 0 so that the predetermined initial value w_flat(F) becomes the initial value of the weight values w^(t)(F).

As described above, in this aspect, when a frame in an attack period is processed, the final weight values w^(t-1)(F) of the previous frame are not adopted as the initial value for the sequential update of the weight values w^(t)(F) in that frame. Therefore, even when the waveform of the input acoustic signal becomes unstable in the attack period and the estimation of the fundamental frequency becomes unstable, erroneous estimates of the fundamental frequency can be kept from occurring in succession, and the overall estimation accuracy of the fundamental frequency can be improved.
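The initial-value rule of the first aspect amounts to a linear mix controlled by r. A minimal sketch, with hypothetical function names and list-based weights:

```python
def initial_weights(prev_final, w_flat, r):
    """Seed for frame t: r * w^(t-1)(F) + (1 - r) * w_flat(F), per candidate F."""
    return [r * p + (1.0 - r) * f for p, f in zip(prev_final, w_flat)]


def r_first_aspect(is_attack):
    """First aspect: r = 0 in attack-period frames, r = 1 otherwise."""
    return 0.0 if is_attack else 1.0
```

With r = 1 the previous frame's final weights are carried over unchanged, and with r = 0 the update restarts from the flat initial value.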

<<<Second aspect>>>
When, for example, an instrument is played with a strong touch, the waveform may remain unstable for a while even after the attack period of the acoustic signal has ended. In such a case, if the sequential update of the weight values w^(t)(F) uses the final weight values w^(t-1)(F) of the previous frame as the initial value, even for a frame after the end of the attack period, the weight values may peak at a wrong fundamental frequency and the fundamental frequency may be estimated erroneously.

Therefore, in the second aspect, the initial value of the weight values w^(t)(F) is controlled as follows. First, as in the first aspect, when the sequential update of the weight values w^(t)(F) is started in each frame, the final weight values w^(t-1)(F) of the previous frame multiplied by a coefficient r and the predetermined initial value w_flat(F) multiplied by a coefficient 1-r are added together, as shown in FIG. 4, and the sum is used as the initial value of the weight values w^(t)(F) in that frame.

Then, as shown in FIG. 6, when a frame belonging to an attack period is processed, r is set to 0 so that the predetermined initial value w_flat(F) becomes the initial value of the weight values w^(t)(F).

After the attack period ends, r is gradually raised from 0 toward 1 each time the frame changes. That is, in the second aspect, when a frame that does not belong to an attack period is processed, the estimation 41 of the probability density function of the fundamental frequency is controlled so that the sequential update of the weight values w^(t)(F) in that frame starts from an initial value obtained by mixing the final weight values w^(t-1)(F) of the previous frame with the predetermined initial value w_flat(F). The mixing ratio between the two is controlled so that the final weight values w^(t-1)(F) of the previous frame are emphasized more as the time elapsed since the end of the immediately preceding attack period grows longer.

According to this aspect, even in a situation where the waveform of the acoustic signal remains unstable for a while after the attack period ends, erroneous estimates of the fundamental frequency can be kept from occurring in succession, and the estimation accuracy of the fundamental frequency can be improved.
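One way to realize the gradual increase of r in the second aspect is a per-frame ramp. The sketch below assumes a linear ramp over a hypothetical `ramp_frames` frames and r = 1 before the first attack; the text only states that r rises gradually from 0 toward 1, so the ramp shape and length are assumptions.

```python
def r_schedule(attack_flags, ramp_frames=3):
    """Return the coefficient r for each frame, given per-frame attack flags.

    r = 0 in attack-period frames, then rises linearly to 1 over `ramp_frames`
    frames after the attack period ends (linear ramp is an assumption).
    """
    r_values = []
    since_attack = None  # frames since the last attack frame; None before any attack
    for is_attack in attack_flags:
        if is_attack:
            since_attack = 0
            r_values.append(0.0)
        elif since_attack is None:
            r_values.append(1.0)
        else:
            since_attack += 1
            r_values.append(min(1.0, since_attack / ramp_frames))
    return r_values
```

Each returned r would then be used as the mixing coefficient between the previous frame's final weights and w_flat(F) when seeding that frame's update.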

<<<Third aspect>>>
In the first and second aspects, the initial value of the weight values w^(t)(F) for the sound models is controlled according to the information handed over from the attack detection 1a. In the third aspect, by contrast, the sound model used in the E step and M step 411 is switched, as shown in FIG. 7, so that a normal sound model is used outside attack periods and a sound model for attack periods is used within them.

Here, the sound model for attack periods is one whose harmonic structure has fewer prominent peaks than the harmonic structure of actual musical tones and in which the amplitude values of the harmonic components vary along a gentle curve on the frequency axis. Using such a sound model in the attack period makes it possible to estimate the fundamental frequency stably in the face of changes in the waveform of the input acoustic signal.
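The model switching of the third aspect can be illustrated with a toy harmonic model. Everything here is an assumption for illustration: the text does not give the shape of either sound model, so the sketch uses a sum of Gaussians at the harmonics with 1/h amplitudes, and simply widens the Gaussians to obtain a flatter, gentler model for attack periods. The names `harmonic_model` and `select_model` and the width values are hypothetical.

```python
import math


def harmonic_model(f0, freqs, n_harmonics=4, width=10.0):
    """Toy sound model: Gaussians at h * f0 with amplitude 1/h, normalized to sum to 1."""
    raw = [sum(math.exp(-0.5 * ((x - h * f0) / width) ** 2) / h
               for h in range(1, n_harmonics + 1))
           for x in freqs]
    total = sum(raw)
    return [v / total for v in raw]


def select_model(f0, freqs, is_attack):
    """Use a broader, gentler model in attack-period frames (assumed widths)."""
    width = 80.0 if is_attack else 10.0
    return harmonic_model(f0, freqs, width=width)
```

The broader model spreads each harmonic over neighboring frequencies, so its peaks are lower and less sharply defined than those of the normal model.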

<<<Fourth aspect>>>
In the first to third aspects, the estimation 41 of the probability density function of the fundamental frequency is the target of the control based on the result of the attack detection 1a. In the fourth aspect, by contrast, the target of that control is the sequential tracking 42 of the fundamental frequency by the multi-agent model, which serves as the fundamental frequency estimation means. That is, in the fourth aspect, the calculation control means of the sound analysis program controls the sequential tracking 42 so that, in an attack period, the fundamental frequency is neither estimated nor output from the probability density function, even though the estimation 41 yields a probability density function of the fundamental frequency. The intent is to refrain from estimating and outputting the fundamental frequency in attack periods, where erroneous estimates occur, and thereby to raise the accuracy of the fundamental frequencies that are output.
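The effect of the fourth aspect on the output stream can be sketched as a simple gate. This illustration only masks already computed estimates, whereas the aspect described above also skips the estimation itself; the function name `gate_f0_output` and the use of `None` for "no output" are assumptions. The optional `hold_frames` parameter is a hypothetical way to keep the output stopped for some frames after the attack period ends.

```python
def gate_f0_output(f0_estimates, attack_flags, hold_frames=0):
    """Suppress F0 output in attack-period frames (None means no output).

    hold_frames > 0 also suppresses output for that many frames after
    the attack period ends (a hypothetical extension parameter).
    """
    out = []
    hold = 0
    for f0, is_attack in zip(f0_estimates, attack_flags):
        if is_attack:
            hold = hold_frames
            out.append(None)
        elif hold > 0:
            hold -= 1
            out.append(None)
        else:
            out.append(f0)
    return out
```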

<Other embodiments>
One embodiment of the present invention has been described above, but other embodiments of the invention are possible, for example the following.
(1) The calculation control means of the sound analysis program may be configured so that either the first aspect or the second aspect can be used together with the third aspect.
(2) In the third aspect, the sound model for attack periods may be used to execute the estimation 41 of the probability density function of the fundamental frequency not only during the attack period but also during the period from the end of the attack period until a predetermined time has elapsed.
(3) In the fourth aspect, the estimation and output of the fundamental frequency may be stopped not only during the attack period but also during the period from the end of the attack period until a predetermined time has elapsed.

FIG. 1 is a diagram showing the processing of a sound analysis program that is one embodiment of the present invention.
FIG. 2 is a diagram illustrating the process in which the weight parameters for the sound models are updated by the EM algorithm in the embodiment.
FIG. 3 is a diagram showing the tracking of the fundamental frequency over time by a multi-agent model composed of one feature detector and multiple agents.
FIG. 4 is a diagram showing the processing of the estimation 41 of the probability density function of the fundamental frequency in the embodiment.
FIG. 5 is a time chart showing the first aspect of the calculation control executed by the calculation control means of the sound analysis program according to the embodiment.
FIG. 6 is a time chart showing the second aspect of the calculation control executed by the calculation control means.
FIG. 7 is a time chart showing the third aspect of the calculation control executed by the calculation control means.

Explanation of symbols

1 ... calculation of instantaneous frequency, 1a ... attack detection, 2 ... extraction of frequency component candidates, 3 ... frequency band limitation, 4a ... melody line estimation, 4b ... bass line estimation, 41 ... estimation of the probability density function of the fundamental frequency, 42 ... sequential tracking of the fundamental frequency by a multi-agent model.

Claims

A sound analysis apparatus comprising:
attack detection means for dividing an input acoustic signal into frames of a predetermined time length and determining, for each frame, whether the input acoustic signal in that frame is a signal of an attack period;
probability density function estimation means for, in each frame, using sound models that are probability density functions each having a structure corresponding to the harmonic structure of the sound of a sound source, forming a mixture distribution by weighted addition of a plurality of sound models corresponding to various fundamental frequencies, sequentially updating and optimizing the weight value for each sound model so that the mixture distribution becomes the distribution of the frequency components of the input acoustic signal, and estimating the optimized weight values of the sound models as the probability density function of the fundamental frequency of the sound of the sound source;
fundamental frequency estimation means for estimating and outputting, based on the probability density function of the fundamental frequency, the fundamental frequency of the sound of one or a plurality of sound sources contained in the input acoustic signal; and
calculation control means for switching the calculation mode of the calculation for estimating the probability density function of the fundamental frequency in the probability density function estimation means, or of the calculation for estimating the fundamental frequency in the fundamental frequency estimation means, depending on whether or not the frame being processed by the probability density function estimation means belongs to an attack period.

The sound analysis apparatus according to claim 1, wherein, when the frame being processed by the probability density function estimation means belongs to an attack period, the calculation control means controls the calculation for estimating the probability density function of the fundamental frequency in the probability density function estimation means so that the sequential update of the weight values in that frame starts from a predetermined initial value, and, when the frame being processed by the probability density function estimation means does not belong to an attack period, the calculation control means controls that calculation so that the sequential update of the weight values in that frame starts with the final weight values of the previous frame as the initial value.

The sound analysis apparatus according to claim 1, wherein, when the frame being processed by the probability density function estimation means belongs to an attack period, the calculation control means controls the calculation for estimating the probability density function of the fundamental frequency in the probability density function estimation means so that the sequential update of the weight values in that frame starts from a predetermined initial value, and, when the frame being processed by the probability density function estimation means does not belong to an attack period, the calculation control means controls that calculation so that the sequential update of the weight values in that frame starts from an initial value obtained by mixing the final weight values of the previous frame with the predetermined initial value, and controls the mixing ratio between the final weight values of the previous frame and the predetermined initial value so that the final weight values of the previous frame are emphasized more as the time elapsed since the end of the immediately preceding attack period becomes longer.

The sound analysis apparatus according to claim 1, wherein the calculation control means switches the sound model used for estimating the probability density function of the fundamental frequency depending on whether or not the frame being processed by the probability density function estimation means belongs to an attack period.

The sound analysis apparatus according to claim 1, wherein, when the frame being processed by the probability density function estimation means does not belong to an attack period, the calculation control means causes the fundamental frequency estimation means to output an estimation result of the fundamental frequency for that frame, and, when the frame being processed by the probability density function estimation means belongs to an attack period, the calculation control means does not cause the fundamental frequency estimation means to output an estimation result of the fundamental frequency for that frame.

A computer program causing a computer to function as:
attack detection means for dividing an input acoustic signal into frames of a predetermined time length and determining, for each frame, whether the input acoustic signal in that frame is a signal of an attack period;
probability density function estimation means for, in each frame, using sound models that are probability density functions each having a structure corresponding to the harmonic structure of the sound of a sound source, forming a mixture distribution by weighted addition of a plurality of sound models corresponding to various fundamental frequencies, sequentially updating and optimizing the weight value for each sound model so that the mixture distribution becomes the distribution of the frequency components of the input acoustic signal, and estimating the optimized weight values of the sound models as the probability density function of the fundamental frequency of the sound of the sound source;
fundamental frequency estimation means for estimating and outputting, based on the probability density function of the fundamental frequency, the fundamental frequency of the sound of one or a plurality of sound sources contained in the input acoustic signal; and
calculation control means for switching the calculation mode of the calculation for estimating the probability density function of the fundamental frequency in the probability density function estimation means, or of the calculation for estimating the fundamental frequency in the fundamental frequency estimation means, depending on whether or not the frame being processed by the probability density function estimation means belongs to an attack period.