JP2006251712A

JP2006251712A - Analyzing method for observation data, especially, sound signal having mixed sounds from a plurality of sound sources

Info

Publication number: JP2006251712A
Application number: JP2005071710A
Authority: JP
Inventors: Shigeki Sagayama; 茂樹嵯峨山; Takuya Nishimoto; 卓也西本; Hirokazu Kameoka; 弘和亀岡
Original assignee: University of Tokyo NUC
Current assignee: University of Tokyo NUC
Priority date: 2005-03-14
Filing date: 2005-03-14
Publication date: 2006-09-21

Abstract

PROBLEM TO BE SOLVED: To provide a framework that enables a global time structure and a global frequency structure to be estimated simultaneously.

SOLUTION: An observation spectrum of a sound signal in which sounds from a plurality of sound sources are mixed is modeled with a superposition object model obtained by superposing a plurality of sound object models. Each sound object model is represented by a model function of two variables, frequency (x) and time (t), and the model parameters of the model function are optimized to estimate characteristics of the observation spectrum. Each sound object model corresponds to one harmonic structure. The model function includes a harmonic structure function having the frequency (x) as a variable and an envelope function having the time (t) as a variable.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a method for analyzing observation data, and more particularly to a method for analyzing an acoustic signal in which sounds from a plurality of sound sources are mixed.

Many studies have addressed the analysis of multi-source acoustic signals in which sounds from a plurality of sound sources are mixed, but it remains an open and difficult problem. Recently proposed methods based on Kalman filters (Non-Patent Document 1) and on model-approximation estimation in the signal and spectral domains (Non-Patent Documents 2 and 3) have brought great progress to this field. However, multiple-sound analysis should by nature process information in the frequency direction and in the time direction simultaneously, whereas these methods decompose the problem: they first extract information along the frequency dimension and then link that information along the time direction.

Non-Patent Document 1: K. Nishi, S. Ando and S. Aida, "Optimum Harmonics Tracking Filter for Auditory Scene Analysis," Proc. IEEE ICASSP 96, pp. 573-576, 1996.
Non-Patent Document 2: S. Godsill and M. Davy, "Bayesian Harmonic Models for Musical Pitch Estimation and Analysis," Proc. ICASSP2002, Vol. 2, pp. 1769-1772, 2002.
Non-Patent Document 3: M. Goto, "A Predominant-F0 Estimation Method for CD Recordings: MAP Estimation Using EM Algorithm for Adaptive Tone Models," Proc. ICASSP2001, Vol. 5, pp. 3365-3368, 2001.
Non-Patent Document 4: H. Kameoka, T. Nishimoto and S. Sagayama, "Separation of Harmonic Structures Based on Tied Gaussian Mixture Model and Information Criterion for Concurrent Sounds," Proc. ICASSP2004, AE-P5.9, May 2004.

An object of the present invention is to provide a framework that can simultaneously estimate the global time structure and the global frequency structure, rather than an approach that integrates local partial information.

The technical means adopted by the present invention to solve this problem is characterized in that observation data is modeled by a superimposed object model, each object model is represented by a model function of two variables, and the model parameters of the model function are optimized to estimate the features of the observed values.

In one preferred aspect, the observation data is an observation spectrum of an acoustic signal in which sounds from a plurality of sound sources are mixed, and the variables of the model function are frequency x and time t. The embodiment described later uses logarithmic frequency, but a linear frequency axis may also be used. The features of the observation spectrum include frequency information for each sound (fundamental frequency, overtone frequencies) and time information (onset time, duration). They further include the power ratio of each frequency component constituting the harmonic structure and the power spectrum envelope in the time direction.

The superimposed acoustic object model is represented by

[Equation not reproduced in this text.]

where p_k(x,t) is the general expression of the k-th acoustic object model. The parameters of the superimposed object model include the parameters of the model function representing each acoustic object model and the weight of each acoustic object model.

When the observation data is an acoustic signal, in a preferred aspect one acoustic object model corresponds to one harmonic structure. The embodiment described later assumes harmonicity, but the structure may be inharmonic as long as some analytical parametric model can be assumed for it.

In the model function of the present invention having the two variables x and t, the general expression of the k-th acoustic object model p_k(x,t), taking as an example the case where each frequency component is represented by a normal distribution (Gaussian function), is

[Equation not reproduced in this text.]

When the observation data is an acoustic signal, in one aspect the model function includes a harmonic structure function having the frequency x as a variable and an envelope function having the time t as a variable. In the embodiment described later, as one preferred aspect, the invention is explained on the basis of expressing the k-th acoustic object model p(x,t|Θ_k) as the product of two functions Φ_k(x) and Ψ_k(t), but the functions used are not limited to these. The embodiment uses a common envelope function (Gaussian basis functions) for the entire harmonic structure.

The harmonic structure function may further include the time t as a variable. An example is the k-th acoustic object model p(x,t|Θ_k) expressed as the product of two functions Φ_k(x,t) and Ψ_k(t). In this case the harmonic structure function depends on time, so the frequency values can change with time t. One example is a pitch trajectory, projected onto the x-t plane, that is expressed by a polynomial or the like.

In one aspect, a common envelope function is used for one harmonic structure. In another aspect, an independent envelope function is used for each harmonic component: the k-th acoustic object model p(x,t|Θ_k) is expressed as the product of two functions Φ_k(x,t) and Ψ_{n,k}(t). Here a power envelope function is prepared separately for each harmonic component. More specifically, this is, for example, a model with a separate decay curve (envelope function) for each harmonic, so that the second, third, fourth, ... harmonics each decay along their own curve.

For the model function of two variables x and t that represents an acoustic object, the embodiment described later presents an analytical solution for the special case in which the model function can be decomposed into the product of a function of x and a function of t. Namely, under the assumption that the envelope functions of the harmonic components are similar (that is, Ψ_k^n(t) is common regardless of n) and the assumption that the pitch trajectory is parallel to the time axis (that is, μ_k(t) = μ_k), the general expression takes the form

[Equation not reproduced in this text.]

which can be decomposed into a function of x and a function of t.

As described above, the embodiment assumes that the pitch trajectory of a musical-sound acoustic object is parallel to the time axis, although in practice the situations in which it is strictly parallel are limited. In speech, and even in instrument sounds played with techniques such as vibrato or glissando, the trajectory is not parallel. Nevertheless, particularly when the target multi-sound signal is a music signal, the assumption that the pitch trajectory is approximately parallel to the time axis is not a serious problem. Alternatively, the pitch trajectory may be modeled by a polynomial or the like without assuming that it is parallel to the time axis.

In one preferred aspect, the harmonic structure function has a fundamental frequency estimate, which is the representative value of the one unimodal distribution corresponding to the fundamental frequency component, and the representative values of the other unimodal distributions, which are determined by the fundamental frequency estimate. The model parameters include the representative value, weight, and variance of each unimodal distribution. Among the representative values, only the fundamental frequency estimate is a free parameter; the other representative values are constrained by it. Many unimodal distributions are known, but in one preferred aspect the unimodal distribution is a normal distribution (including a log-normal distribution). Examples of the representative value of a distribution include the mean, the median, and the mode; in one preferred aspect it is the mean.

In the embodiment described later, an acoustic object is represented by a harmonic structure model built as a constrained Gaussian mixture. This model has a fundamental frequency estimate, which is the mean μ_k of the one normal distribution corresponding to the fundamental frequency component, and the means μ_k + log n of the other normal distributions, which are determined by that estimate. The weight parameter r_k^n represents the power ratio of each frequency component constituting the harmonic structure of acoustic object k. The variance parameter σ_k represents the width of each frequency component constituting the harmonic structure of acoustic object k; in one aspect it may be given to the model as a known parameter.

In one preferred aspect, the envelope function consists of a plurality of Gaussian functions arranged consecutively along the time axis,

[Equation not reproduced in this text.]

and the model parameters include the representative value, weight, and variance of each Gaussian distribution. The representative value is mainly used to estimate the onset time of the acoustic object; in the embodiment described later it is the mean o_k of the (first) Gaussian distribution, but the representative value is not limited to this. The weights c_k^y of the Gaussian distributions determine the power envelope curve in the time direction. The variance φ_k of each Gaussian distribution determines the duration of the acoustic object. In one preferred aspect, the Gaussian functions are placed at a fixed equal spacing αφ_k based on the variance parameter (in one preferred example, the standard deviation parameter) of the first Gaussian function.

In another aspect, the envelope function is a function combining two sigmoid functions,

[Equation not reproduced in this text.]

Specifically, the envelope function is a so-called double sigmoid function, that is, the difference of two sigmoid functions (which may or may not be identical) shifted along the horizontal axis, and its parameters are o_k^(0), o_k^(1), a_{k,n}, b_{k,n}, and A_{k,n}.
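The double sigmoid described above can be sketched as follows. Because the source formula is not reproduced, the roles assumed here for the parameters are illustrative only: `rise` and `fall` stand in for the two shift parameters, `a` and `b` for the slopes of the two sigmoids, and `amp` for an overall amplitude. All numeric values are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def double_sigmoid(t: float, rise: float, fall: float,
                   a: float, b: float, amp: float) -> float:
    """Difference of two time-shifted sigmoids: rises near `rise`, decays near `fall`."""
    return amp * (sigmoid(a * (t - rise)) - sigmoid(b * (t - fall)))


# Hypothetical note: attack around 1.0 s, release around 3.0 s,
# steep attack (a = 20) and gentler release (b = 10).
envelope_samples = [double_sigmoid(t, 1.0, 3.0, 20.0, 10.0, 2.0) for t in (0.0, 2.0, 5.0)]
```

With these values the envelope is close to zero before the attack, close to `amp` between the two shift points, and close to zero again after the release.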

In yet another aspect, the envelope function is an extreme value distribution function,

[Equation not reproduced in this text.]

and its parameters are o_k, a_{k,n}, b_{k,n}, and A_{k,n}.

In still another aspect, the envelope function is a Generalized Gaussian Distribution (GGD),

[Equation not reproduced in this text.]

and its parameters are o_k and λ_{k,n} (where p is a constant and Γ is the gamma function).
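As an illustration of a generalized Gaussian envelope, the sketch below uses the common textbook density p / (2 λ Γ(1/p)) · exp(-(|t - o| / λ)^p). Whether the source uses exactly this parameterization is an assumption; the function and argument names are hypothetical.

```python
import math


def generalized_gaussian(t: float, center: float, scale: float, p: float) -> float:
    """Textbook generalized Gaussian density with shape p and scale `scale`."""
    norm = p / (2.0 * scale * math.gamma(1.0 / p))
    return norm * math.exp(-((abs(t - center) / scale) ** p))


def gaussian(t: float, mean: float, std: float) -> float:
    """Ordinary normal density, used only for comparison."""
    return math.exp(-0.5 * ((t - mean) / std) ** 2) / (math.sqrt(2.0 * math.pi) * std)


# For p = 2 the generalized form reduces to a normal density with std = scale / sqrt(2).
ggd_value = generalized_gaussian(0.3, 0.0, 1.5, 2.0)
normal_value = gaussian(0.3, 0.0, 1.5 / math.sqrt(2.0))
```

The constant p controls the shape: p = 2 gives the Gaussian case, smaller p gives a sharper peak with heavier tails, and larger p gives a flatter top.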

In one preferred aspect, the parameters of the model function are optimized by MAP estimation, but the optimization method applied to the present invention is not limited to MAP estimation and may be another optimization method. In one preferred aspect, the estimation algorithm used for model parameter optimization is the EM algorithm.
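To make the EM idea concrete, here is a deliberately reduced sketch: a one-dimensional toy in which the component shapes are fixed and only the weights are re-estimated. It is a maximum-likelihood update without priors, not the MAP procedure of the source, and every name and number in it is an assumption for illustration.

```python
import math


def gaussian(z, mean, std):
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (math.sqrt(2.0 * math.pi) * std)


def em_weight_step(weights, shapes, f, grid, step):
    """One EM iteration for the weights only (component shapes stay fixed)."""
    new_weights = [0.0] * len(weights)
    for x in grid:
        joint = [w * g(x) for w, g in zip(weights, shapes)]
        total = sum(joint)
        for k, j in enumerate(joint):
            # E-step: membership j / total. M-step: accumulate the energy it claims.
            new_weights[k] += (j / total) * f(x) * step
    return new_weights


shapes = [lambda x: gaussian(x, 0.0, 1.0), lambda x: gaussian(x, 4.0, 1.0)]
observed = lambda x: 0.7 * shapes[0](x) + 0.3 * shapes[1](x)  # toy "spectrum"
grid = [-6.0 + 0.05 * i for i in range(321)]                  # covers -6 .. 10

weights = [0.5, 0.5]
for _ in range(50):
    weights = em_weight_step(weights, shapes, observed, grid, 0.05)
```

Starting from equal weights, the iteration settles near the mixing proportions used to build the toy observation. The method described in the text applies the same alternation over the time-frequency plane and over all model parameters.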

The present invention can also be provided as an acoustic analysis system, a computer program for acoustic analysis, or a recording medium on which the program is recorded.

The observation data analysis method of the present invention is preferably applied to acoustic signals, but the superimposed object model according to the present invention can be extended to recovering, from data projected onto a two-dimensional plane, the original information behind the projected data. In another aspect, the observation data is image data containing a plurality of objects. As a simple example, the object is a rectangular parallelepiped, which is modeled with an object model corresponding to the parallelepiped and its shadow, and the features of the object are recovered. When the observation data is image data, the method can be applied to scene analysis such as robot vision.

According to the present invention, an acoustic signal in which sounds from a plurality of sound sources are mixed is modeled with a superimposed object model, and the global geometric structure in time and frequency of each acoustic object and of the superimposed acoustic objects can be estimated simultaneously, so that the acoustic signal can be analyzed accurately.

The present invention will be described on the basis of one preferred aspect: separation of multiple spectra using a Gaussian-basis acoustic object model.

[A] Acoustic object model

[A-1] Formulation of the problem

As shown in FIG. 1, the observed spectrum of an acoustic signal in which sounds from a plurality of sound sources are mixed is a complex distribution in which many fundamental and harmonic components, each following the time trajectory of one of several pitches (fundamental frequencies), are superimposed. When such a mixed distribution is to be separated into individual spectra, overlap between spectra becomes a problem in short-time analysis. In the present invention, the observed spectral distribution is regarded as a kind of histogram of micro-energy patterns and is allocated on the time-frequency plane to a number of strip-shaped regions, such that each region covers the predicted spectral components of one acoustic object. In this specification, the observed pattern is decomposed arbitrarily and each decomposed pattern is called a cluster. That is, a cluster means a distribution obtained by decomposing the observed pattern, and clustering means decomposing the observed pattern into clusters. If appropriate degrees of cluster membership are determined, the observed composite distribution can be separated by a probabilistic method.

The power spectrum of a musical sound distributed on the time-frequency plane forms a kind of object (hereinafter called an acoustic object) in which a comb-shaped structure in the frequency direction continues along the time direction. FIG. 2 shows one acoustic object. One acoustic object is composed of a plurality of object elements allocated on the frequency-time plane: one element corresponding to the fundamental frequency component and a plurality of elements corresponding to overtone components (including overtones that are not integer multiples). In the present invention, the spectro-temporal pattern of a music signal consisting of many musical sounds is regarded as a superposition of the individual musical-sound objects, and the decomposition into acoustic objects is formulated analytically as a fuzzy clustering problem for acoustic energy distributed over the two dimensions of time and frequency.

Suppose that, in each cluster, a model p(x,t|Θ_k) that geometrically shapes one acoustic object can be specified by parameters Θ_k (Θ = {Θ_k | k = 1, ..., K}), and set the model-based objective function as

[Equation not reproduced in this text.]

Here x, t, and f(x,t) are the log-frequency, the time (frame), and the observed spectrum (power spectral density) obtained by a wavelet transform, respectively; T_0 and T_1 are the lower and upper limits of time, and Ω_0 and Ω_1 are the lower and upper limits of log-frequency; K is the number of clusters and k is the cluster index.

Further, p(k|x,t,Θ) is the probability expressing what proportion of the spectral component at coordinates (x,t) belongs to the k-th cluster, and is given by

[Equation not reproduced in this text.]

In other words, p(k|x,t,Θ)f(x,t) can be read as a probabilistically separated acoustic object. D(x,t|Θ_k) is a (pseudo-)distance function reflecting how dominant the k-th model is at coordinates (x,t). More intuitively, when the integrals of the model and of the observed spectrum are equal, that is, when p(x,t|Θ_k) satisfies

[Equation not reproduced in this text.]

the quantity p(k|x,t,Θ)f(x,t)D(x,t|Θ_k) takes a larger value the closer the two distributions p(x,t|Θ_k) and p(k|x,t,Θ)f(x,t) are to each other.
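The idea of a membership probability and of a "probabilistically separated acoustic object" can be sketched numerically. Since the source formula for p(k|x,t,Θ) is not reproduced, the snippet assumes the usual normalized form, each model's value divided by the sum over all models. The `bump` models are toy stand-ins, not the harmonic models of the text.

```python
import math


def bump(cx, ct):
    """Toy stand-in for one acoustic object model p(x, t | Theta_k)."""
    return lambda x, t: math.exp(-0.5 * ((x - cx) ** 2 + (t - ct) ** 2)) / (2.0 * math.pi)


def memberships(models, x, t):
    """Share of the point (x, t) attributed to each cluster (assumed normalized form)."""
    values = [p(x, t) for p in models]
    total = sum(values)
    return [v / total for v in values]


def separate(models, f, x, t):
    """Split the observed value f(x, t) into per-object parts p(k | x, t) * f(x, t)."""
    return [m * f(x, t) for m in memberships(models, x, t)]


models = [bump(0.0, 0.0), bump(3.0, 1.0)]
observed = lambda x, t: 2.0 * models[0](x, t) + 1.0 * models[1](x, t)
parts = separate(models, observed, 0.5, 0.2)
```

Because the memberships at each point sum to one, the separated parts always add back up to the observed value, and a point close to the first model is attributed almost entirely to it.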

From the above, the task reduces to optimally approximating the time-series distribution of the observed spectrum with some geometric model. Note that, under the specific condition D(x,t|Θ_k) = log p(x,t|Θ_k), this objective function has the same form as the Q function of the EM algorithm. In the following, a two-dimensional distribution model is formulated that simultaneously reflects both the harmonic structure and the temporal continuity of an acoustic object.

[A-2] Gaussian-basis acoustic object model

Assuming that the pitch trajectory of a musical-sound acoustic object is parallel to the time axis, the cross-section at a specific time t of the k-th acoustic object model shown in FIG. 2 is a function reflecting the harmonic structure Φ_k(x) shown in FIG. 3. Accordingly, if the acoustic object model is assumed to have the form of the harmonic structure model function Φ_k(x) multiplied along the time axis by the envelope function Ψ_k(t) shown in FIG. 4, the k-th acoustic object model p(x,t|Θ_k) can be expressed as the product of the two functions and a power (energy) w_k:

$$p(x, t \mid \Theta_k) = w_k \, \Phi_k(x) \, \Psi_k(t)$$

Here, the following is assumed:

[Equation not reproduced in this text.]
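The separable form described above, a weight times a function of x times a function of t, can be checked with a small sketch. The single-Gaussian `phi` and `psi` below are simplified hypothetical shapes chosen to keep the example short, and they are assumed to integrate to one so that `w` carries the total energy.

```python
import math


def gaussian(z, mean, std):
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (math.sqrt(2.0 * math.pi) * std)


def object_model(w, phi, psi):
    """Return p(x, t) = w * phi(x) * psi(t) for given one-variable functions."""
    return lambda x, t: w * phi(x) * psi(t)


phi = lambda x: gaussian(x, 5.4, 0.02)   # shape along log-frequency (hypothetical)
psi = lambda t: gaussian(t, 0.5, 0.1)    # shape along time (hypothetical)
p = object_model(3.0, phi, psi)

# Cross-sections at two different times differ only by a scale factor,
# so this ratio is the same at every frequency.
ratio_a = p(5.40, 0.5) / p(5.40, 0.7)
ratio_b = p(5.43, 0.5) / p(5.43, 0.7)
```

This is the property that makes the pitch trajectory "parallel to the time axis": the frequency profile keeps its shape over time and only its overall level follows the envelope.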

[A-3] Harmonic structure function Φ_k(x)

As one preferred form of the harmonic structure function constituting the model function, the harmonic structure model function already proposed by the inventors of the present application can be used. The harmonic structure model is described first. In short-time spectrum analysis, the spread of the fundamental and harmonic components causes frequency components of different signals to overlap, which makes it difficult to separate neighboring frequency components and to detect fundamental or harmonic frequencies accurately. The frequency components observed with this spread are regarded as occurrence-frequency distributions or probability distributions of the respective frequencies, and each is approximated by a Gaussian distribution, so that a spectrum with a single harmonic structure is modeled as a mixture of Gaussian distributions. As shown in FIG. 3, by approximating the spread of each spectral peak with a Gaussian distribution, estimating a frequency value corresponds to estimating the mean of a Gaussian distribution, and estimating the energy of a frequency component corresponds to estimating a weight of the Gaussian mixture. To preserve harmonicity, only the mean of the one Gaussian distribution corresponding to the fundamental frequency component (the fundamental frequency estimate) is free, and the means of all remaining normal distributions are determined by its position. A single harmonic structure modeled by such a constrained Gaussian mixture is called a "harmonic structure model" in this specification. The Gaussian distribution is one suitable example of a distribution function applicable to the harmonic structure model; other unimodal distribution functions may be used instead. Likewise, the mean is one suitable example of a representative value of the distribution; the median or the mode may be used instead.

Assuming harmonicity, so that the n-th log-frequency component lies log n away from the fundamental log-frequency, the fundamental log-frequency is estimated as μ_k and the n-th partial log-frequency as μ_k + log n. That is, if the fundamental frequency estimate is μ_k, the means of harmonic structure model k in the log-frequency domain are μ_k, μ_k + log 2, ..., μ_k + log n, ..., μ_k + log N. By approximating the distribution of each frequency component with a Gaussian distribution, one harmonic structure is modeled as a weighted sum of Gaussian bases. Formulating this, the harmonic structure model is expressed as

$$\Phi_k(x) = \sum_{n=1}^{N} \frac{r_k^n}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{(x - \mu_k - \log n)^2}{2\sigma_k^2}\right)$$

where μ_k is the log fundamental frequency estimate and r_k^n (n = 1, ..., N; Σ_n r_k^n = 1; n is the index of the Gaussian basis in the harmonic structure model, not an exponent) corresponds to the power ratio of the n-th harmonic component.
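The constrained Gaussian mixture described above can be sketched as follows. The snippet assumes natural logarithms and a Gaussian width `sigma` shared by all partials. The 220 Hz fundamental and the power ratios are hypothetical values.

```python
import math


def harmonic_structure(x, mu, sigma, ratios):
    """Phi_k(x): weighted Gaussians at mu + log(n), n = 1..N, on the log-frequency axis."""
    norm = math.sqrt(2.0 * math.pi) * sigma
    return sum(
        r * math.exp(-((x - mu - math.log(n)) ** 2) / (2.0 * sigma ** 2)) / norm
        for n, r in enumerate(ratios, start=1)
    )


mu = math.log(220.0)        # hypothetical fundamental of 220 Hz (natural log)
ratios = [0.5, 0.3, 0.2]    # hypothetical partial power ratios, summing to 1
# Only mu is free; every partial center follows from it.
partials_hz = [math.exp(mu + math.log(n)) for n in range(1, len(ratios) + 1)]
```

Adding log n on the log-frequency axis is the same as multiplying the fundamental by n, so the partial centers fall at integer multiples of 220 Hz. Because the ratios sum to one, the function integrates to one over log-frequency.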

[A-4] Envelope function Ψ_k(t)

The envelope function Ψ_k(t) should be a function that can adapt flexibly to the many variations of the power spectrum envelope. For music signals, for example, attack, sustain, and release differ greatly depending on the instrument and the musical expression. The envelope function Ψ_k(t) is therefore built on a Gaussian-basis envelope model: it is composed of a plurality of Gaussian bases, each with a weight c_k^y related to the envelope shape (y = 0, ..., Y-1; Σ_y c_k^y = 1; y is the index of the Gaussian basis in the envelope model). A feature of this model is that the spacing between adjacent Gaussian functions is expressed in terms of the standard deviation parameter φ_k of the Gaussian functions, and the envelope function Ψ_k(t) is written as

$$\Psi_k(t) = \sum_{y=0}^{Y-1} \frac{c_k^y}{\sqrt{2\pi}\,\phi_k} \exp\!\left(-\frac{(t - o_k - y\,\alpha\,\phi_k)^2}{2\phi_k^2}\right)$$

where Y is the number of Gaussian bases, o_k is the center of the first Gaussian basis and is closely related to estimating the onset time of the acoustic object, and c_k^y (y = 0, ..., Y-1) is the weight of each Gaussian basis defining the envelope curve. Gaussian basis functions with this special constraint, in which the centers are placed at a spacing equal to the standard deviation parameter φ_k (in the case α = 1), keep the bases from becoming isolated and so keep the curve smooth. At the same time the curve stretches or shrinks linearly in the time direction according to the value of φ_k and/or α, so the model can accommodate acoustic objects of widely varying duration.
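The time-stretching property of the Gaussian-basis envelope can be illustrated with a short sketch. It assumes Y Gaussians of common spread `phi` centered at `onset + y * alpha * phi`, and the weights and times are hypothetical.

```python
import math


def envelope(t, onset, phi, weights, alpha=1.0):
    """Psi_k(t): Gaussians of spread phi centered at onset + y * alpha * phi, y = 0..Y-1."""
    norm = math.sqrt(2.0 * math.pi) * phi
    return sum(
        c * math.exp(-((t - onset - y * alpha * phi) ** 2) / (2.0 * phi ** 2)) / norm
        for y, c in enumerate(weights)
    )


weights = [0.4, 0.3, 0.2, 0.1]   # hypothetical decaying envelope, sums to 1
# Value 0.07 s after the onset with phi = 0.05 ...
short = envelope(0.10 + 0.07, 0.10, 0.05, weights)
# ... and at the same relative position after doubling phi.
stretched = envelope(0.10 + 2 * 0.07, 0.10, 0.10, weights)
```

Doubling `phi` doubles both the width and the spacing of the bases, so the curve keeps its shape while lasting twice as long, and its height halves so that the area stays one. This is why a single set of weights can describe notes of different durations.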

[A-5] Superimposed object model

The observed spectrum of an acoustic signal in which sounds from a plurality of sound sources are mixed is modeled with a superimposed object model, obtained by superimposing object models each of which corresponds to one harmonic structure as described above. Table 1 shows the model parameters of the superimposed object model. Table 1 gives examples of suitable model parameters; the model parameters according to the present invention are not limited to those shown in Table 1.

Table 1: Model parameters of the superimposed object model

k: Index of each acoustic object (acoustic stream) model; in practice, the index of the acoustic object. The observed spectrum of the mixed sound is modeled using K acoustic object models.

n: Index of the Gaussian basis in the harmonic structure model; in practice, the index of each frequency component of the harmonic structure. One harmonic structure is modeled using N Gaussian functions.

y: Index of the Gaussian basis in the power envelope model. One envelope curve is modeled using Y Gaussian functions.

μ_k: Mean of the first Gaussian basis in the harmonic structure model; in practice, the fundamental log-frequency.

μ_k + log n: Mean of the n-th Gaussian basis in the harmonic structure model; in practice, the log-frequency of the n-th frequency component.

w_k: Weight of the k-th acoustic object model; in practice, the relative dominance of the k-th acoustic object.

r_k^n: Weight of the Gaussian basis in the harmonic structure model of the k-th acoustic object model; in practice, the power ratio of the frequency components.

c_k^y: Weight of the Gaussian basis in the power envelope model of the k-th acoustic object model; in practice, the shape of the power envelope curve in the time direction.

o_k: Mean of the first Gaussian basis in the power envelope model of the k-th acoustic object model; in one example, the onset time of the k-th acoustic object.

σ_k: Standard deviation of the Gaussian bases in the harmonic structure model of the k-th acoustic object model; in practice, the width of each frequency component.

φ_k: Interval and standard deviation of the Gaussian bases in the power envelope model of the k-th acoustic object model; in practice, related to the duration of the k-th acoustic object.
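To show how the parameters of Table 1 fit together, the sketch below evaluates a superimposed model at a point (x, t) as the sum over K objects of w_k times a harmonic mixture in log-frequency times an envelope in time. The product form, the use of normalized Gaussians, the harmonic index n starting at 1, the equal spacing φ_k of the envelope bases, and the names `AcousticObject` and `superimposed_model` are all assumptions made for illustration, not the expressions of the specification.

```python
import math
from dataclasses import dataclass
from typing import List

def gauss(z, mean, std):
    """Normalized Gaussian density."""
    u = (z - mean) / std
    return math.exp(-0.5 * u * u) / (math.sqrt(2.0 * math.pi) * std)

@dataclass
class AcousticObject:  # hypothetical container for the Table 1 parameters
    w: float          # w_k: relative dominance of the object
    mu: float         # mu_k: fundamental log-frequency
    sigma: float      # sigma_k: width of each frequency component
    r: List[float]    # r_k^n: power ratios of the N components (sum to 1)
    o: float          # o_k: center of the first envelope basis (onset)
    phi: float        # phi_k: spacing and std of the envelope bases
    c: List[float]    # c_k^y: envelope weights (sum to 1)

    def value(self, x, t):
        # Harmonic structure: Gaussian n is centered at mu + log(n).
        harmonic = sum(r_n * gauss(x, self.mu + math.log(n), self.sigma)
                       for n, r_n in enumerate(self.r, start=1))
        # Power envelope: Gaussian y is centered at o + y * phi (assumed).
        env = sum(c_y * gauss(t, self.o + y * self.phi, self.phi)
                  for y, c_y in enumerate(self.c))
        return self.w * harmonic * env

def superimposed_model(objects, x, t):
    """Sum of the K acoustic object models at log-frequency x and time t."""
    return sum(obj.value(x, t) for obj in objects)

# Two hypothetical objects with fundamentals at 220 Hz and 330 Hz.
objs = [
    AcousticObject(0.6, math.log(220.0), 0.02, [0.5, 0.3, 0.2], 0.0, 0.1, [0.5, 0.3, 0.2]),
    AcousticObject(0.4, math.log(330.0), 0.02, [0.6, 0.4], 0.5, 0.1, [0.7, 0.3]),
]
peak = superimposed_model(objs, math.log(220.0), 0.0)
off_peak = superimposed_model(objs, math.log(220.0) + 0.3, 0.0)
```

Because the n-th component sits at mu + log(n), shifting mu moves the whole harmonic series together on the log-frequency axis, which is why a single parameter μ_k can represent the pitch of an object.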

[B] Optimal parameter estimation
[B-1] Assumption of prior distribution
Assuming a prior distribution is effective when flexible constraints are to be placed on specific parameters. For example, r_k^n and c_k^y can be constrained so that they do not deviate too far from reasonable expected values r_k^n(bar) and c_k^y(bar) for the component ratios of the harmonic structure and for the power envelope (see FIG. 5). Here, a prior distribution that greatly simplifies the calculation of the Lagrange undetermined multipliers in MAP estimation (see Non-Patent Document 3),

is used, where d_r and d_c are the magnitudes of the contribution of the prior distribution, and β(d_r) and β(d_c) are the respective normalization coefficients. Besides this distribution, a Dirichlet distribution can be applied for the same purpose.
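The following sketch illustrates why a suitably chosen prior keeps the Lagrange-multiplier calculation simple. The prior expression is not reproduced in this text, so the sketch assumes a prior proportional to exp(−d · KL(r(bar) ‖ r)); this form, the function name `map_update_weights`, and the numbers are assumptions. Under that assumption, maximizing Σ_n S_n log r_n + d Σ_n r(bar)_n log r_n subject to Σ_n r_n = 1 gives a closed-form update that interpolates between the data-driven ratios and the expected ratios.

```python
def map_update_weights(counts, prior_mean, d):
    """MAP estimate of mixture weights under an assumed KL-type prior.

    counts[n]     : power assigned to component n in the E-step
    prior_mean[n] : expected ratio r-bar for component n (sums to 1)
    d             : contribution of the prior (d = 0 gives the ML estimate)
    """
    total = sum(counts) + d
    return [(s + d * p) / total for s, p in zip(counts, prior_mean)]

counts = [8.0, 1.0, 1.0]   # hypothetical E-step statistics
prior = [0.5, 0.3, 0.2]    # hypothetical expected ratios
ml = map_update_weights(counts, prior, d=0.0)
mapped = map_update_weights(counts, prior, d=10.0)
```

With d = 0 the result is the plain normalized count vector; increasing d pulls the estimate toward the expected ratios while the weights continue to sum to 1.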

[B-2] MAP estimation using the EM algorithm
Estimating the optimal approximating parameters of the mixed acoustic object model under the above constraints is a problem of the same form as MAP estimation by the EM algorithm (monotonic increase of an auxiliary function through iterative calculation). The objective function in Equation (1) corresponds to the auxiliary function, and Equation (1) can be rewritten as the auxiliary function

where λ_r^(k), λ_c^(k), and λ_w are Lagrange undetermined multipliers. In Equation (9), f(x, t) may be normalized so that the sum of the weights is 1 (in this case, F = 1).

The locally optimal parameters are obtained by the following iterative calculation.
(1) E-step
Θ(hat), as updated in the preceding M-step, is substituted for Θ(bar), and the auxiliary function R(Θ, Θ(bar)) is computed. This step corresponds to updating the attribution probability density p(k, n, y | x, t, Θ).
(2) M-step
With the attribution probability density p(k, n, y | x, t, Θ) held fixed, the parameters of Θ(bar) are updated so as to maximize the auxiliary function R(Θ, Θ(bar)).
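The alternation of the two steps can be sketched on a toy problem. This is a deliberately simplified illustration, not the procedure of the specification: a flat component index stands in for the triple (k, n, y), a one-dimensional grid stands in for the (x, t) plane, the component shapes are held fixed, and only the weights are re-estimated. The names `e_step` and `m_step_weights` are hypothetical.

```python
import math

def gauss(z, mean, std):
    """Normalized Gaussian density."""
    u = (z - mean) / std
    return math.exp(-0.5 * u * u) / (math.sqrt(2.0 * math.pi) * std)

def e_step(weights, densities, grid):
    """Attribution of every grid point to every component (rows sum to 1)."""
    table = []
    for point in grid:
        vals = [w * q(point) for w, q in zip(weights, densities)]
        norm = sum(vals) or 1.0  # guard against an all-zero row
        table.append([v / norm for v in vals])
    return table

def m_step_weights(observed, memberships):
    """Re-estimate the weights with the attributions held fixed."""
    total = sum(observed)
    n_comp = len(memberships[0])
    return [sum(f * m[j] for f, m in zip(observed, memberships)) / total
            for j in range(n_comp)]

# Toy data: two fixed components mixed with a 3:1 power ratio.
grid = [0.05 * i for i in range(200)]
densities = [lambda x: gauss(x, 3.0, 0.5), lambda x: gauss(x, 7.0, 0.5)]
observed = [0.75 * densities[0](x) + 0.25 * densities[1](x) for x in grid]

weights = [0.5, 0.5]
for _ in range(20):
    memberships = e_step(weights, densities, grid)    # E-step
    weights = m_step_weights(observed, memberships)   # M-step
```

In this toy setting the weights move from the uniform initial guess toward the 3:1 ratio present in the observed data; in the full method the same alternation would also update the frequency and time parameters of each object.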

[B-3] Parameter update formulas in the M-step
The update formulas obtained for each model parameter in the M-step are shown below. For brevity, the integration ranges in the time direction (T_1, T_2) and the frequency direction (Ω_1, Ω_2) are omitted in the following expressions.

The update formula for the fundamental log-frequency μ_k is as follows. It yields an estimate of the fundamental frequency of the k-th acoustic object.
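The update expression itself is not reproduced in this text. As a hedged sketch, the code below uses the weighted-mean form that follows from the Gaussian harmonic model of Table 1 when no prior is placed on μ_k: with the attributions fixed, setting the derivative of the Gaussian log-density terms with respect to μ_k to zero gives the power-weighted mean of x − log n. The function name `update_mu`, the sample layout, and the replacement of the integrals by a finite sum are assumptions.

```python
import math

def update_mu(samples):
    """Power-weighted mean of (x - log n) for one acoustic object.

    samples: iterable of (x, n, power), where `power` is the observed
    spectral density at log-frequency x multiplied by the attribution
    to harmonic n (the integrals are approximated by a finite sum).
    """
    num = sum(p * (x - math.log(n)) for x, n, p in samples)
    den = sum(p for _, _, p in samples)
    return num / den

# Hypothetical partials of a 220 Hz tone, each attributed to its harmonic.
f0 = 220.0
samples = [(math.log(f0 * n), n, 1.0 / n) for n in (1, 2, 3, 4)]
mu_k = update_mu(samples)
```

Every partial contributes evidence about the same fundamental once log n is subtracted, so all harmonics, not only the lowest one, inform the pitch estimate.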

The update formula for the relative power r_k^n of the spectral components is as follows. It yields an estimate of the power ratio of each frequency component in the harmonic structure of the k-th acoustic object.

The update formula for the width σ_k of each frequency component in the harmonic structure is as follows. It yields an estimate of the width of each frequency component of the harmonic structure of the k-th acoustic object. Here, the width is assumed to be the same for all frequency components.

The update formula for the onset time o_k is as follows. It yields an estimate of the onset time of the k-th acoustic object.

The update formula for the elements c_k^y of the power envelope curve is as follows. The power envelope curve of the k-th acoustic object is determined by the weighted sum of the Gaussian bases that constitute the envelope function.

The update formula for the duration parameter φ_k is as follows. It yields an estimate of the duration of the k-th acoustic object.

The update formula for the power (energy) of the k-th acoustic object in the superimposed acoustic object model is as follows.

[C] Experimental Examples
[C-1] Experimental Example 1
Two real music signals (16 kHz sampling frequency) from the RWC Music Database were used as test data for the method according to the present invention. The power spectrum time series was computed by a Gabor wavelet transform (frame shift 20 ms, frequency resolution 16.7 cents, lowest frequency 50 Hz). Each analysis section (a region of the time-frequency plane) was 3 s (150 frames) long. The initial values of the parameters (μ_k, o_k | k = 1, ..., K) for the EM algorithm were determined by extracting the 70 largest peaks (local maxima of the power spectral density) from the given spectral distribution. During the EM iterations, the total number of acoustic objects was estimated by thresholding: a model whose weight parameter w_k fell to or below a certain threshold was judged to be silent and removed.
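The initialization and pruning described above can be sketched as follows. The peak definition (a point strictly greater than its four neighbours on the time-frequency grid), the function names `top_peaks` and `prune`, the toy spectrum, and the threshold value are assumptions for illustration; the text specifies only that the 70 largest local maxima are used and that models at or below a threshold weight are removed.

```python
def top_peaks(spec, count):
    """Return (power, time_idx, freq_idx) of the largest local maxima.

    spec[t][x] is the power at frame t and frequency bin x; a peak is a
    point strictly greater than its four neighbours (assumed definition).
    """
    peaks = []
    for t in range(1, len(spec) - 1):
        for x in range(1, len(spec[t]) - 1):
            v = spec[t][x]
            if (v > spec[t - 1][x] and v > spec[t + 1][x]
                    and v > spec[t][x - 1] and v > spec[t][x + 1]):
                peaks.append((v, t, x))
    peaks.sort(reverse=True)  # largest power first
    return peaks[:count]

def prune(weights, threshold):
    """Indices of object models whose weight exceeds the threshold."""
    return [k for k, w in enumerate(weights) if w > threshold]

# Toy spectrum with two isolated peaks.
spec = [[0.0] * 6 for _ in range(5)]
spec[1][2] = 5.0   # strong peak
spec[3][4] = 2.0   # weaker peak
initial = top_peaks(spec, count=70)               # 70 as in the experiment
kept = prune([0.5, 0.001, 0.3], threshold=0.01)   # hypothetical threshold
```

Each returned peak would seed one object: its frequency index gives an initial μ_k and its time index an initial o_k. Starting with more objects than are expected and pruning by weight lets the number of objects be estimated during the iterations rather than fixed in advance.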

FIG. 6 shows a specific example of the optimized model estimated from an actual spectrum, together with the corresponding time-frequency spectrum, in three-dimensional and grayscale views. FIG. 6(a) shows the observed spectral distribution in three dimensions (log-frequency axis, time axis, and energy intensity axis), and FIG. 6(c) is a grayscale spectrogram of the same observed spectrum (horizontal axis: time, vertical axis: log-frequency). FIG. 6(b) shows the superimposed acoustic object model with the optimal parameters in three dimensions (log-frequency axis, time axis, and energy intensity axis) and corresponds to FIG. 6(a); it is the superposition of the individual acoustic objects shown in FIG. 2. FIG. 6(d) is a grayscale view of the optimized superimposed acoustic object model corresponding to FIG. 6(b) (horizontal axis: time, vertical axis: frequency). As FIGS. 6(b) and 6(d) show, not only the pitch of each superimposed acoustic object but also its onset time, duration, offset time, and power envelope are estimated appropriately. Individual acoustic objects can also be extracted and reconstructed by sinusoidal synthesis using the optimal attribution probabilities.

[C-2] Experimental Example 2
As a performance measure for the method according to the present invention, the note-name accuracy was computed against the reference MIDI data supplied with the recordings. For comparison, a method that estimates pitch trajectories with an HMM from frame-by-frame model estimates (Non-Patent Document 4) was chosen. On the test data used, the method according to the present invention outperformed the conventional method (Table 2), confirming the benefit of modeling the time and frequency directions simultaneously.

INDUSTRIAL APPLICABILITY
The present invention can be used for speech recognition in real environments, high-quality speech recording in multi-speaker environments, and music signal analysis for automatic scoring and accompaniment data creation in karaoke systems.

FIG. 1: An observed spectrum obtained by wavelet-transforming an actual music performance signal over times T₀ to T₁ and frequencies Ω₀ to Ω₁.
FIG. 2: A diagram explaining the parametric model of the k-th acoustic object spectrum (acoustic object model), showing one acoustic object (the k-th) on the frequency-time plane.
FIG. 3: The Gaussian-basis harmonic structure model.
FIG. 4: The Gaussian-basis power envelope model.
FIG. 5: A diagram illustrating the prior distribution of the weight parameter r_k^n.
FIG. 6: (a) Three-dimensional view of the observed spectral distribution (frequency axis, time axis, and energy intensity axis); (b) three-dimensional view of the superimposed acoustic object model with the optimal parameters (frequency axis, time axis, and energy intensity axis); (c) grayscale view of the given spectrogram (horizontal axis: time, vertical axis: frequency); (d) grayscale view of the optimized model (horizontal axis: time, vertical axis: frequency).

Claims

1. A method for analyzing observation data, wherein the observation data is modeled by a superimposed object model formed by superimposing a plurality of object models, each object model is represented by a two-variable model function, and the model parameters of the model function are optimized to estimate the characteristics of the observation data.

2. A method for analyzing an acoustic signal, wherein an observed spectrum of an acoustic signal in which sounds from a plurality of sound sources are mixed is modeled by a superimposed object model formed by superimposing a plurality of acoustic object models, each acoustic object model is represented by a model function having two variables, frequency x and time t, and the characteristics of the observed spectrum are estimated by optimizing the model parameters of the model function.

3. The acoustic signal analysis method according to claim 2, wherein the characteristics of the observed spectrum include frequency information and time information of each sound.

4. The acoustic signal analysis method according to claim 3, wherein the characteristics of the observed spectrum further include a frequency component power ratio of each frequency component constituting the harmonic structure and a power spectrum envelope in the time direction.

5. The acoustic signal analysis method according to claim 2, wherein each acoustic object model corresponds to one harmonic structure.

6. The acoustic signal analysis method according to claim 5, wherein the model function includes a harmonic structure function including a frequency x as a variable and an envelope function including a time t as a variable.

7. The acoustic signal analysis method according to claim 6, wherein the harmonic structure function comprises one unimodal distribution corresponding to the fundamental frequency component, whose representative value is the fundamental frequency estimate, and other unimodal distributions whose representative values are determined by the fundamental frequency estimate, and wherein the model parameters include the representative value, weight, and variance of each unimodal distribution.

8. The method for analyzing an acoustic signal according to claim 7, wherein the unimodal distribution is a Gaussian distribution.

9. The method for analyzing an acoustic signal according to claim 7, wherein the representative value of the distribution is an average.

10. The acoustic signal analysis method according to claim 6, wherein the harmonic structure function further includes a time t as a variable.

11. The acoustic signal analysis method according to claim 10, wherein the feature of the observation data includes a pitch locus on the xt plane.

12. The acoustic signal analysis method according to claim 6, wherein a common envelope function is used for one harmonic structure.

13. The method for analyzing an acoustic signal according to claim 6, wherein an independent envelope function is used for each harmonic component.

14. The acoustic signal analysis method according to claim 12, wherein the envelope function is composed of a plurality of Gaussian functions arranged consecutively in the time-axis direction, and the model parameters include the representative value, weight, and variance of each Gaussian distribution.

15. The method of analyzing an acoustic signal according to claim 14, wherein the Gaussian functions are arranged at predetermined equal intervals based on the variance parameter of the first Gaussian function.

16. The method for analyzing an acoustic signal according to claim 6, wherein the envelope function is composed of a function obtained by combining two sigmoid functions.

17. The method of analyzing an acoustic signal according to claim 6, wherein the envelope function is an extreme value distribution function or GDD.

18. The method of analyzing an acoustic signal according to claim 2, wherein the parameter of the superimposed object model includes a parameter of a model function representing each acoustic object model and a weight of each acoustic object model.

19. The acoustic signal analysis method according to claim 2, wherein the parameter optimization is performed by MAP estimation.

20. The acoustic signal analysis method according to claim 2, wherein the estimation algorithm for model parameter optimization is an EM algorithm.

21. A computer program for causing a computer to execute the method according to any one of claims 2 to 20.

22. A recording medium on which a computer program for causing a computer to execute the method according to claim 2 is recorded.

23. An acoustic signal analysis system, wherein an observed spectrum of an acoustic signal in which sounds from a plurality of sound sources are mixed is modeled by a superimposed object model formed by superimposing a plurality of acoustic object models, each acoustic object model is represented by a model function having two variables, frequency x and time t, and the characteristics of the observed spectrum are estimated by optimizing the model parameters of the model function.