JP5807914B2

JP5807914B2 - Acoustic signal analyzing apparatus, method, and program

Info

Publication number: JP5807914B2
Application number: JP2012190188A
Authority: JP
Inventors: 弘和亀岡; 拓磨小野; 順貴小野; 茂樹嵯峨山
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC; Inter University Research Institute Corp Research Organization of Information and Systems
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC; Inter University Research Institute Corp Research Organization of Information and Systems
Priority date: 2012-08-30
Filing date: 2012-08-30
Publication date: 2015-11-10
Anticipated expiration: 2032-08-30
Also published as: JP2014048398A

Description

The present invention relates to an acoustic signal analysis apparatus, method, and program, and more particularly to an acoustic signal analysis apparatus, method, and program that separate time-series data of acoustic signals output from a plurality of microphones into the signals of the individual sound sources.

Blind source separation (BSS) is a technique for separating and extracting individual sound source components from microphone input signals when both the source components and the transfer characteristics from the sources to the microphones are unknown. Because BSS must estimate the source signals and their mixing process from the observed signals alone, it usually makes some assumption about the sources and is formulated as an optimization problem that estimates both unknowns under a criterion derived from that assumption. For example, when the number of observed signals is at least the number of sources, a well-known method is independent component analysis (ICA), which obtains a maximum-likelihood estimate of the separation filter under the assumption that the source signal components follow a super-Gaussian distribution (Non-Patent Document 1). Frequency-domain ICA is one effective BSS approach for mixing systems with time delays, such as acoustic signals. However, because separation is performed separately in each frequency band, the per-band separation results must then be grouped by source, which is the so-called permutation problem.

Independent component analysis using vector-valued variables, proposed in recent years and hereinafter called independent vector analysis (IVA), is known as a separation technique that does not give rise to the permutation problem (Non-Patent Document 2). IVA treats the complex spectrum of each source in each frame as a vector-valued variable Y_k,τ＾ = (Y_k,τ,1, ..., Y_k,τ,ω, ..., Y_k,τ,N)^T (where k is the source index, τ the time-frame index, and ω the frequency index), and estimates the separation filter of each band that is most likely under the assumption that its norm

‖Y_k,τ＾‖ = (Σ_ω |Y_k,τ,ω|²)^(1/2)   (1)

follows a super-Gaussian distribution. A method has also been proposed that realizes IVA by assuming, as the specific super-Gaussian distribution, a time-varying Gaussian distribution (a Gaussian distribution whose variance is allowed to change from frame to frame) (Non-Patent Document 3).

Blind source separation by IVA is formulated as the problem of estimating the separation matrix W_ω＾ so that, for the time-frequency representation X_τω＾ = (X_1τω, ..., X_Mτω)^T of the observed signals, the elements of the separated signal

Y_τω＾ = W_ω＾ X_τω＾   (2)

are statistically independent. Here, Y_τω＾ = (Y_1τω, ..., Y_kτω, ..., Y_Kτω)^T represents the time-frequency components of the individual sources.

Non-Patent Document 1: A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, 2001.
Non-Patent Document 2: T. Kim, T. Eltoft, and T. Lee, "Independent Vector Analysis: An Extension of ICA to Multivariate Components," Proc. ICA, pp. 165-172, 2006.
Non-Patent Document 3: Takuma Ono, Nobutaka Ono, and Shigeki Sagayama, "Sound Source Separation by Independent Vector Analysis Using Sound Source Activation as Prior Information," Proc. Acoustical Society of Japan Autumn Meeting, pp. 613-614, Sep. 2011.

As described above, IVA assumes that the norm of the vectorized complex spectrum follows a super-Gaussian distribution. When a time-varying Gaussian distribution (a Gaussian distribution whose variance is allowed to change from frame to frame) is adopted as the super-Gaussian distribution, this amounts to assuming that, for each source, the temporal envelope of the power is the same at every frequency. In general, however, target sources such as music and speech may have the special property of harmonicity. In that case, the amplitude envelope in the valleys between harmonic components does not necessarily match that at other frequencies, whereas the power dependency between the fundamental frequency and its overtones can be an important cue for source separation. Moreover, separation accuracy is limited for signals that do not fit the assumption (source signals with harmonicity, such as pitch or periodicity).

Furthermore, because IVA determines W_ω＾ so that the assumption that the amplitude envelope of Y_τω＾ is similar across all frequencies is satisfied as far as possible, the assumption breaks down badly when a source has a harmonic structure. Components of other sources can then leak into the valleys between harmonic components, or permutation misalignment can occur. If each source has a harmonic structure, the source model should therefore assume a power spectral density with a harmonic structure. Doing so requires the fundamental frequency of each source, which normally cannot be observed.

The present invention has been made in view of the above circumstances, and its object is to provide an acoustic signal analysis apparatus, method, and program that can accurately separate time-series data of acoustic signals output from a plurality of microphones into the source signal of each sound source.

To achieve the above object, an acoustic signal analysis apparatus according to the present invention includes: time-frequency analysis means that receives time-series data of acoustic signals output from M microphones m (M is an integer of 2 or more) and outputs a three-dimensional array X＾ whose elements are the observed time-frequency components X_mτω (m is the microphone index, τ the time-frame index, and ω the frequency index); parameter initial value setting means that sets initial values for each of a three-dimensional array Y＾ whose elements are the time-frequency components Y_kτω of K sound sources k (k is the source index, τ the time-frame index, and ω the frequency index), a three-dimensional array Π＾ whose elements are the probabilities π^(l)_kτ that, for each source k in each time frame τ, a power spectrum template λ^(l)_ω having a harmonic structure is selected (l is the index of the power spectrum template), a two-dimensional array Σ＾ whose elements are the powers σ²_kτ of each source k in each time frame τ, and, for each frequency ω, a separation matrix W_ω＾ that is applied to X_τω＾ (= (X_1τω, ..., X_Mτω)) to obtain the time-frequency components Y_τω＾ (= (Y_1τω, ..., Y_Kτω)) of the source signals; parameter update means that updates the three-dimensional array Π＾, the two-dimensional array Σ＾, and the separation matrix W_ω＾ of each frequency ω so as to maximize an objective function representing the posterior probability of Π＾, Σ＾, and W_ω＾ of each frequency ω given the three-dimensional array X＾, the objective function being expressed using (i) for all combinations of (k, τ, ω, z_kτ), the probability density function of Y_kτω given z_kτ and σ²_kτ, represented as a Gaussian distribution with variance λ^(z_kτ)_ω·σ²_kτ, and the probability of z_kτ given π_kτ, (ii) the prior probability of the probabilities π^(l)_kτ for all combinations of (k, τ, l), and (iii) the determinant of the separation matrix W_ω＾ for each frequency ω; source signal estimate update means that updates the three-dimensional array Y＾ on the basis of the separation matrix W_ω＾ of each frequency ω and the three-dimensional array X＾; and termination determination means that repeats the update by the parameter update means and the update by the source signal estimate update means until a predetermined termination condition is satisfied.

In an acoustic signal analysis method according to the present invention, time-frequency analysis means receives time-series data of acoustic signals output from M microphones m (M is an integer of 2 or more) and outputs a three-dimensional array X＾ whose elements are the observed time-frequency components X_mτω (m is the microphone index, τ the time-frame index, and ω the frequency index). Parameter initial value setting means sets initial values for each of a three-dimensional array Y＾ whose elements are the time-frequency components Y_kτω of K sound sources k (k is the source index, τ the time-frame index, and ω the frequency index), a three-dimensional array Π＾ whose elements are the probabilities π^(l)_kτ that, for each source k in each time frame τ, a power spectrum template λ^(l)_ω having a harmonic structure is selected (l is the index of the power spectrum template), a two-dimensional array Σ＾ whose elements are the powers σ²_kτ of each source k in each time frame τ, and, for each frequency ω, a separation matrix W_ω＾ that is applied to X_τω＾ (= (X_1τω, ..., X_Mτω)) to obtain the time-frequency components Y_τω＾ (= (Y_1τω, ..., Y_Kτω)) of the source signals. Parameter update means updates the three-dimensional array Π＾, the two-dimensional array Σ＾, and the separation matrix W_ω＾ of each frequency ω so as to maximize an objective function representing the posterior probability of Π＾, Σ＾, and W_ω＾ of each frequency ω given the three-dimensional array X＾, the objective function being expressed using (i) for all combinations of (k, τ, ω, z_kτ), the probability density function of Y_kτω given z_kτ and σ²_kτ, represented as a Gaussian distribution with variance λ^(z_kτ)_ω·σ²_kτ, and the probability of z_kτ given π_kτ, (ii) the prior probability of the probabilities π^(l)_kτ for all combinations of (k, τ, l), and (iii) the determinant of the separation matrix W_ω＾ for each frequency ω. Source signal estimate update means updates the three-dimensional array Y＾ on the basis of the separation matrix W_ω＾ of each frequency ω and the three-dimensional array X＾. Termination determination means repeats the update by the parameter update means and the update by the source signal estimate update means until a predetermined termination condition is satisfied.

A program according to the present invention causes a computer to function as each means of the above acoustic signal analysis apparatus.

As described above, the acoustic signal analysis apparatus, method, and program of the present invention use a three-dimensional array Π＾ whose elements are the probabilities that each power spectrum template having a harmonic structure is selected for each source in each time frame. They repeatedly update Π＾, a two-dimensional array Σ＾ whose elements are the powers of each source, and the separation matrix W_ω＾ of each frequency ω so as to maximize an objective function representing the posterior probability of Π＾, Σ＾, and W_ω＾ given the three-dimensional array X＾ of observed time-frequency components. This yields the effect that time-series data of acoustic signals output from a plurality of microphones can be accurately separated into the source signal of each sound source.

A schematic diagram showing the configuration of an acoustic signal analysis apparatus according to an embodiment of the present invention.
A diagram showing power spectrum templates.
A flowchart showing the contents of the acoustic signal analysis processing routine in the acoustic signal analysis apparatus according to the embodiment of the present invention.
(A) A graph showing the SDR evaluation results of separation performance in room E2A, and (B) a graph showing the SDR evaluation results of separation performance in room JR2.

Hereinafter, an embodiment of the present invention is described in detail with reference to the drawings. The method proposed in the present invention builds a generative model of the sources in which spectrum templates are incorporated into the variance parameter of a time-varying Gaussian distribution. On that basis, it simultaneously estimates which spectrum template should be selected in each frame for each source and the separation filter of each band.

<Principle of the invention>
First, the principle of the present invention is described, beginning with the generative model of the sound sources.

The fundamental frequency in each frame is regarded as a latent variable. The assumed generative process is that one harmonic-structure power spectrum template is selected according to the latent variable (the fundamental frequency index z_kτ), and the frequency components of the source are determined on the basis of that power spectrum. This is expected to model harmonic sources more appropriately than the source model assumed in conventional IVA.

Let l = 1, ..., L be the index of the harmonic structure templates, and let λ^(l)_ω be the harmonic structure template for a particular fundamental frequency (see FIG. 2). Assuming that the sources are Gaussian, the generative model of a source given z_kτ is expressed by the following equation (3).

Here, N_C(·; μ, σ²) denotes a complex Gaussian distribution with mean μ and variance σ², and σ²_kτ is the power of source k in frame τ. Letting π^(l)_kτ be the probability that z_kτ = l is selected, the latent variable z_kτ follows the discrete distribution shown in the following equation (4).

Furthermore, to encourage π_kτ to be sparse, the hyperprior of π_kτ is assumed to be the Dirichlet distribution shown in the following equation (5).
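The generative process described for equations (3) to (5) can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: it assumes a zero-mean circular complex Gaussian for equation (3), a symmetric Dirichlet distribution with a single concentration parameter β for equation (5), and the function name `sample_source` and the array layouts are hypothetical.

```python
import numpy as np

def sample_source(templates, beta, sigma2, rng):
    """Draw one source's spectrogram from the assumed generative model.

    templates: (L, N) positive power spectrum templates lambda^(l)_omega
    beta:      concentration of a symmetric Dirichlet prior (assumed form)
    sigma2:    (T,) per-frame power sigma^2_{k,tau}
    rng:       numpy.random.Generator
    """
    L, N = templates.shape
    T = sigma2.shape[0]
    # pi_{k,tau} ~ Dirichlet(beta, ..., beta); beta < 1 favours sparse pi
    pi = rng.dirichlet(np.full(L, beta), size=T)               # (T, L)
    # z_{k,tau} ~ Categorical(pi_{k,tau}): template chosen in each frame
    z = np.array([rng.choice(L, p=pi[t]) for t in range(T)])   # (T,)
    # Y | z ~ N_C(0, lambda^(z)_omega * sigma^2), zero mean assumed
    var = templates[z] * sigma2[:, None]                       # (T, N)
    noise = rng.standard_normal((T, N)) + 1j * rng.standard_normal((T, N))
    Y = np.sqrt(var / 2.0) * noise
    return pi, z, Y
```

With β < 1, most of the probability mass of each π_kτ tends to fall on a few templates, which is the sparsity the hyperprior is intended to induce.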

Incorporating the above source generative model into the IVA problem of equation (2) reduces it to the problem of maximizing the objective function shown in the following equation (6).

Here, X＾ = {X_mτω}, Π＾ = {π^(l)_kτ}, Σ＾ = {σ²_kτ}, W＾ = {W_ω}, and Θ＾ = {Π, Σ, W}. T is the number of time frames. The mark "＾" attached to a symbol indicates that the symbol is a matrix, a multidimensional array, or a vector.

The global optimum of the objective function in equation (6) cannot be obtained analytically, but a local optimum can be searched for efficiently by the auxiliary function method. Using auxiliary variables γ^(l)_kτω that satisfy Σ_l γ^(l)_kτω = 1, an auxiliary function can be designed as in the following equation (7).

<Update of auxiliary variables>
The update rule for the auxiliary variables is obtained, as in the following equation (8), by solving ∂Q/∂γ^(l)_kτω = 0 under the constraint Σ_l γ^(l)_kτω = 1.

Here, λ is a template and l is the index of the template.
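One reading of the auxiliary-variable update is that γ^(l)_kτω is the posterior responsibility of template l for the component Y_kτω. The sketch below assumes that form, γ^(l)_kτω ∝ π^(l)_kτ · N_C(Y_kτω; 0, λ^(l)_ω σ²_kτ), normalized over l. The function name `update_gamma`, the array layouts, and the small constant `eps` are illustrative assumptions, and the exact form of equation (8) may differ.

```python
import numpy as np

def update_gamma(Y, pi, sigma2, templates, eps=1e-12):
    """Assumed responsibility-style update for gamma^(l)_{k,tau,omega}.

    Y:         (K, T, N) complex source estimates
    pi:        (K, T, L) template selection probabilities
    sigma2:    (K, T)    per-frame source powers
    templates: (L, N)    power spectrum templates
    Returns gamma with shape (L, K, T, N), summing to 1 over l.
    """
    # Variance lambda^(l)_omega * sigma^2_{k,tau} for every (l, k, tau, omega)
    var = templates[:, None, None, :] * sigma2[None, :, :, None]
    power = np.abs(Y)[None] ** 2
    # Log-density of a zero-mean complex Gaussian: -log(pi * v) - |y|^2 / v
    log_lik = -np.log(np.pi * var) - power / var
    log_post = np.log(pi.transpose(2, 0, 1)[..., None] + eps) + log_lik
    # Normalize over templates in the log domain for numerical stability
    log_post -= log_post.max(axis=0, keepdims=True)
    gamma = np.exp(log_post)
    return gamma / gamma.sum(axis=0, keepdims=True)
```

Working in the log domain avoids underflow when a template fits a component very poorly.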

<Update of parameters>
The update rules for π^(l)_kτ and σ²_kτ are obtained, as in the following equations (9) and (10), by solving ∂Q/∂π^(l)_kτ = 0 and ∂Q/∂σ²_kτ = 0.
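As a hedged illustration of what such stationary-point updates can look like, the sketch below assumes an auxiliary function whose terms in π and σ² are Σ_ω Σ_l γ^(l)_kτω [log π^(l)_kτ − log(λ^(l)_ω σ²_kτ) − |Y_kτω|²/(λ^(l)_ω σ²_kτ)] plus a symmetric Dirichlet log-prior (β − 1) Σ_l log π^(l)_kτ. Under that assumption, π^(l)_kτ ∝ Σ_ω γ^(l)_kτω + β − 1 and σ²_kτ = (1/N) Σ_ω Σ_l γ^(l)_kτω |Y_kτω|²/λ^(l)_ω. The function name and the lower bound `floor` (a safeguard for β < 1) are not taken from the source.

```python
import numpy as np

def update_pi_sigma2(gamma, Y, templates, beta, floor=1e-12):
    """Assumed closed-form updates for pi^(l)_{k,tau} and sigma^2_{k,tau}.

    gamma:     (L, K, T, N) auxiliary variables, summing to 1 over l
    Y:         (K, T, N)    complex source estimates
    templates: (L, N)       power spectrum templates
    beta:      Dirichlet concentration (symmetric prior assumed)
    Returns pi with shape (K, T, L) and sigma2 with shape (K, T).
    """
    N = Y.shape[-1]
    # MAP estimate under the Dirichlet prior: soft counts plus (beta - 1)
    counts = gamma.sum(axis=-1) + beta - 1.0             # (L, K, T)
    counts = np.maximum(counts, floor)                   # illustrative safeguard
    pi = (counts / counts.sum(axis=0, keepdims=True)).transpose(1, 2, 0)
    # Template-whitened power, averaged over the N frequency bins
    weighted = gamma * (np.abs(Y)[None] ** 2) / templates[:, None, None, :]
    sigma2 = weighted.sum(axis=(0, 3)) / N               # (K, T)
    return pi, sigma2
```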

W_ω＾ is updated one row at a time, as in the following equations (11) to (14).

Here, e_k＾ is the unit vector whose k-th element is 1 and whose other elements are 0 (= [0, ..., 1, ..., 0]^T).
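The row-by-row update can be illustrated with the iterative-projection scheme commonly used in auxiliary-function-based IVA. The sketch below is an assumption about the shape of equations (11) to (14), not a transcription of them: it takes 1/~σ²_kτω = Σ_l γ^(l)_kτω/(λ^(l)_ω σ²_kτ), V_kω = (1/T) Σ_τ X_τω＾ X_τω＾^H / ~σ²_kτω, w_kω＾ = (W_ω＾ V_kω)^(-1) e_k＾, followed by normalization by (w_kω＾^H V_kω w_kω＾)^(1/2). It also assumes K = M so that W_ω＾ V_kω is square, and all names are hypothetical.

```python
import numpy as np

def update_separation_matrix(W, X, gamma, templates, sigma2):
    """One sweep over all sources at a single frequency bin (assumed form).

    W:         (K, K) complex separation matrix, row k holds w_k^H
    X:         (T, K) complex observations at this bin (K = M assumed)
    gamma:     (L, K, T) auxiliary variables at this bin
    templates: (L,)   template values lambda^(l) at this bin
    sigma2:    (K, T) per-frame source powers
    """
    K = W.shape[0]
    T = X.shape[0]
    W = W.copy()
    for k in range(K):
        # Assumed eq. (11): reciprocal of the effective variance per frame
        inv_var = (gamma[:, k, :] / (templates[:, None] * sigma2[k][None, :])).sum(axis=0)
        # Assumed eq. (12): weighted covariance (1/T) sum_t x_t x_t^H / var_t
        V = (X.T * inv_var) @ X.conj() / T
        # Assumed eq. (13): w_k = (W V_k)^{-1} e_k
        w = np.linalg.solve(W @ V, np.eye(K)[:, k])
        # Assumed eq. (14): scale so that w_k^H V_k w_k = 1
        w = w / np.sqrt(np.real(w.conj() @ V @ w))
        W[k, :] = w.conj()
    return W
```

Each row update uses the rows already updated in the same sweep, which is why the matrix is modified in place inside the loop.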

<Update of source signal estimates>
Using the separation matrices updated as above, the estimates of the source signals are updated as in the following equation (15).

Here, Y_τ,ω＾ = (Y_1,τ,ω, ..., Y_K,τ,ω)^T.
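Applying the separation matrix of each frequency to the observed components, Y_τω＾ = W_ω＾ X_τω＾, can be written compactly for all frames and frequencies at once. The following is a minimal sketch; the function name `separate` and the array layouts (frequency-major for W, microphone-major for X) are assumptions made for illustration.

```python
import numpy as np

def separate(W, X):
    """Apply per-frequency separation matrices to the observations.

    W: (N, K, M) complex, one K x M separation matrix per frequency bin
    X: (M, T, N) complex observed time-frequency components X_{m,tau,omega}
    Returns Y: (K, T, N) with Y[:, t, n] = W[n] @ X[:, t, n].
    """
    return np.einsum('nkm,mtn->ktn', W, X)
```

For example, when every `W[n]` is the identity matrix, `separate(W, X)` returns an array equal to `X`.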

<System configuration>
Next, an embodiment of the present invention is described, taking as an example the case where the invention is applied to an acoustic signal analysis apparatus that analyzes acoustic signals obtained from M microphones (M ≥ 2) and separates them into a known number K (K < M) of source signals.

As shown in FIG. 1, the acoustic signal analysis apparatus according to the embodiment of the present invention is configured as a computer including a CPU, a RAM, and a ROM that stores a program for executing the acoustic signal analysis processing routine described later. Functionally, it is configured as follows.

The acoustic signal analysis apparatus 100 includes an input unit 10, a calculation unit 20, a storage unit 30, and an output unit 40.

The input unit 10 receives time-series data of the acoustic signals (multichannel signals) output from the M microphones. The storage unit 30 stores the time-series data of the acoustic signals received by the input unit 10. The storage unit 30 also stores the results of the processes described later, together with the harmonic structure templates λ^(l)_ω of the individual fundamental frequencies (l = 1, ..., L; ω = 1, ..., N) and the parameter β.

The calculation unit 20 includes a time-frequency analysis unit 21, an initial setting unit 22, an auxiliary variable update unit 23, a parameter update unit 24, a time-frequency component estimation unit 25, a termination determination unit 26, and a signal conversion unit 27. The parameter update unit 24 includes a probability update unit 241, a time-varying gain update unit 242, and a separation matrix update unit 243.

The time-frequency analysis unit 21 receives the observed acoustic signals as the time-series signal of each microphone and calculates a three-dimensional array X＾ whose (m, ω, τ) elements are the time-frequency components (observed time-frequency components) X_mτω, where m = 1, ..., M, ω = 1, ..., N, and τ = 1, ..., T are the indices of the microphone, the frequency, and the time frame, respectively. The calculated time-frequency components X_τω＾ are stored in the storage unit 30. More specifically, for each microphone m, the time-frequency analysis unit 21 takes the time-series data of that microphone's acoustic signal as input, calculates the time-frequency components X_mτω by time-frequency analysis using the short-time Fourier transform (STFT), and outputs the array (amplitude spectrogram) X＾ = (X_mτω)_M×N×T that stores them. The time-frequency components X_mτω may instead be calculated using a wavelet transform.
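The STFT-based analysis performed by the time-frequency analysis unit 21 might look like the following sketch. The frame length, hop size, Hann window, and the (M, T, N) output layout are illustrative assumptions; the source does not specify them, and the function name `stft_multichannel` is hypothetical.

```python
import numpy as np

def stft_multichannel(x, frame_len=1024, hop=256):
    """Short-time Fourier transform of a multichannel signal.

    x: (M, n_samples) real-valued microphone signals, n_samples >= frame_len
    Returns X: (M, T, N) complex array with N = frame_len // 2 + 1 bins.
    """
    M, n = x.shape
    window = np.hanning(frame_len)
    T = 1 + (n - frame_len) // hop            # number of complete frames
    X = np.empty((M, T, frame_len // 2 + 1), dtype=complex)
    for t in range(T):
        # Window one frame of every channel, then take the one-sided FFT
        seg = x[:, t * hop : t * hop + frame_len] * window
        X[:, t, :] = np.fft.rfft(seg, axis=-1)
    return X
```

Samples after the last complete frame are ignored in this sketch; padding the signal would be one way to retain them.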

The initial setting unit 22 sets initial values of the parameters π^(l)_kτ, σ²_kτ, and W_ω＾ used in the processes described later. Suitable initial values may be set using random numbers. The initial setting unit 22 also calculates the initial value of the separated signal Y_τω＾ according to equation (2), on the basis of the initial value of W_ω＾ and the time-frequency components X_mτω.

For each of all combinations of (l, k, τ, ω), the auxiliary variable update unit 23 updates the auxiliary variable γ^(l)_kτω according to equation (8), on the basis of π^(l)_kτ, σ²_kτ, Y_kτω, and λ^(l)_ω stored in the storage unit 30, and stores it in the storage unit 30. The auxiliary variable γ^(l)_kτω indicates the probability that the observed time-frequency component X_mτω belongs to the power spectrum template λ^(l)_ω of each index l.

For each of all combinations of (l, k, τ), the probability update unit 241 updates the probability π^(l)_kτ that the power spectrum template λ^(l)_ω is selected, according to equation (9) and on the basis of γ^(l)_kτω and β stored in the storage unit 30, and stores it in the storage unit 30.

For each of all combinations of (k, τ), the time-varying gain update unit 242 updates the power σ²_kτ of source k in frame τ, according to equation (10) and on the basis of γ^(l)_kτω, Y_kτω, and λ^(l)_ω stored in the storage unit 30, and stores it in the storage unit 30.

For each of all combinations of (k, τ, ω), the separation matrix update unit 243 updates the parameter ~σ²_kτω according to equation (11), on the basis of γ^(l)_kτω, λ^(l)_ω, and σ²_kτ stored in the storage unit 30, and stores it in the storage unit 30. For each of all combinations of (k, ω), the separation matrix update unit 243 also updates the intermediate parameter V_kω according to equation (12), on the basis of X_τω＾ and ~σ²_kτω stored in the storage unit 30, and stores it in the storage unit 30.

Then, for each of all combinations of (k, ω), the separation matrix update unit 243 updates w_kω＾ (the weight vector whose elements are the weights of each microphone m for the frequency component ω of the acoustic signal of source k) according to equations (13) and (14), on the basis of W_ω＾ and V_kω stored in the storage unit 30, and stores it in the storage unit 30.

For each of all combinations of (τ, ω), the time-frequency component estimation unit 25 updates the vector Y_τω＾ = (Y_1τω, ..., Y_kτω, ..., Y_Kτω)^T representing the time-frequency components of the sources, according to equation (15) and on the basis of W_ω＾ and X_τω＾ stored in the storage unit 30, and stores it in the storage unit 30.

The termination determination unit 26 determines whether a predetermined termination condition is satisfied. If it is not satisfied, the processes of the auxiliary variable update unit 23, the parameter update unit 24, and the time-frequency component estimation unit 25 are repeated. If the termination determination unit 26 determines that the condition is satisfied, processing moves to the signal conversion unit 27. The signal conversion unit 27 converts Y_τω＾ stored in the storage unit 30 into the source signal of each source, and the output unit 40 outputs the source signals.

The termination condition may be that the iteration count s has reached a predetermined number S. Alternatively, the termination condition may be that the difference between the value of the objective function obtained with the parameters of the (s-1)-th iteration and the value obtained with the parameters of the s-th iteration has become smaller than a predetermined threshold.
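The two termination criteria can be expressed as a small predicate. The source presents them as alternatives; combining them in one function, as below, is an illustrative choice, and the names `should_stop`, `prev_obj`, `cur_obj`, and `tol` are hypothetical.

```python
def should_stop(s, S, prev_obj=None, cur_obj=None, tol=None):
    """Return True when the iteration should end.

    s, S:     current iteration count and the predetermined maximum
    prev_obj: objective value with the parameters of iteration s - 1
    cur_obj:  objective value with the parameters of iteration s
    tol:      threshold on the change of the objective (None disables it)
    """
    # Criterion 1: the iteration count has reached the predetermined number
    if s >= S:
        return True
    # Criterion 2: the objective changed by less than the threshold
    if tol is not None and prev_obj is not None and cur_obj is not None:
        return abs(cur_obj - prev_obj) < tol
    return False
```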

<Operation of the acoustic signal analysis apparatus>
Next, the operation of the acoustic signal analysis apparatus 100 according to the present embodiment is described. First, time-series data of the acoustic signal from each microphone is input to the acoustic signal analysis apparatus 100 as the signal to be analyzed and is stored in the storage unit 30. The acoustic signal analysis apparatus 100 then executes the acoustic signal analysis processing routine shown in FIG. 3.

First, in step S101, the acoustic signal within each time frame τ is read from the storage unit 30 for each microphone, and time-frequency analysis using the short-time Fourier transform is applied to it. From the results, a three-dimensional array X＾ whose (m, τ, ω) elements are the observed time-frequency components X_mτω is generated and stored in the storage unit 30.

Then, in step S102, initial values of the parameters Θ^ = {Π^, Σ^, W^} are set using random numbers and stored in the storage unit 30. In addition, a three-dimensional array Y^ whose (k, τ, ω) elements are the time-frequency components Y_kτω of each source signal is generated and stored in the storage unit 30.

Next, in step S103, the auxiliary coefficient γ^(l)_kτω is calculated for each combination of (l, k, τ, ω) according to equation (8) above and stored in the storage unit 30. The calculation is based on the parameters Π^, Σ^, and Y^ set in step S102, or on the parameters Π^, Σ^, and Y^ updated in steps S104, S105, and S107 described below.

Then, in step S104, based on the auxiliary coefficients γ^(l)_kτω updated in step S103, the probability π^(l)_kτ that the power spectrum template λ^(l)_ω is selected is calculated for each combination of (l, k, τ) according to equation (9) above and stored in the storage unit 30.

In step S105, the power σ²_kτ of sound source k in time frame τ is calculated for each combination of (k, τ) according to equation (10) above and stored in the storage unit 30. The calculation is based on the parameter Y^ set in step S102 or updated in step S107, and on the auxiliary coefficients γ^(l)_kτω updated in step S103.

In the next step S106, w_kω is calculated for each combination of (k, ω) according to equations (11) to (14) above and stored in the storage unit 30. The calculation is based on the parameter W^ set in step S102 or updated in the previous execution of step S106, the three-dimensional array X^ generated in step S101, the parameter Σ^ updated in step S105, and the auxiliary coefficients γ^(l)_kτω updated in step S103.

Then, in step S107, based on the parameter W^ updated in step S106 and the three-dimensional array X^ generated in step S101, the three-dimensional array Y^ whose (k, τ, ω) elements are the time-frequency components Y_kτω of each source signal is calculated according to equation (15) above and stored in the storage unit 30.
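Step S107 can be illustrated with the sketch below. It assumes that equation (15) has the form Y_τω = W_ω X_τω, which is how the claims describe the separation matrix acting on X_τω; if equation (15) involves a conjugate transpose, the entries of `W` would need to be arranged accordingly. The function name and the nested-list layout `W[ω][k][m]`, `X[m][τ][ω]` are choices made for this example.

```python
def apply_separation(W, X):
    """Compute Y[k][tau][omega] = sum over m of W[omega][k][m] * X[m][tau][omega].

    W : one K-by-M separation matrix per frequency bin, W[omega][k][m].
    X : observed time-frequency components, X[m][tau][omega].
    """
    M = len(X)        # number of microphones
    T = len(X[0])     # number of time frames
    F = len(X[0][0])  # number of frequency bins
    K = len(W[0])     # number of sound sources
    Y = [[[0j] * F for _ in range(T)] for _ in range(K)]
    for w in range(F):
        for t in range(T):
            for k in range(K):
                # Row k of the separation matrix at this frequency applied
                # to the microphone vector of this time-frequency point.
                Y[k][t][w] = sum(W[w][k][m] * X[m][t][w] for m in range(M))
    return Y
```

Because each frequency bin has its own matrix, the separation is applied bin by bin, and only the shared parameters Π^ and Σ^ tie the bins of one source together.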

In the next step S108, it is determined, as the end condition, whether the iteration count s has reached S. If s has not reached S, the end condition is judged not to be satisfied, the routine returns to step S103, and the processes of steps S103 to S107 are repeated. If s has reached S, the end condition is judged to be satisfied. In step S109, the source signal of each sound source is then calculated from the three-dimensional array Y^ as last updated in step S107 and is output by the output unit 40, and the acoustic signal analysis processing routine ends.

<Experimental results>
Next, to demonstrate the usefulness of the method according to the present embodiment, the results of a simulation experiment using monophonic (single-melody) instruments as sound sources will be described. Table 1 below shows the experimental conditions.

Two kinds of pre-recorded room impulse responses with different reverberation times (see reference: S. Nakamura, K. Hiyane, F. Asano, T. Nishiura and T. Yamada, "Acoustical Sound Database in Real Environments for Sound Scene Understanding and Hands-Free Speech Recognition," Proc. LREC, pp. 965-968, 2000) were each convolved with monophonic instrument sounds (trumpet and saxophone) and the results were summed to form the observed signals. In room E2A (reverberation time T60 = 300 ms) the sources were assumed to arrive from the directions {−20°, 0°}, and in room JR2 (T60 = 470 ms) from the directions {−30°, 30°}.
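The convolve-and-sum construction of the observed signals can be sketched as follows. The function names and the toy data layout (`impulse_responses[m][k]` for the response from source k to microphone m) are assumptions for this example; the actual experiment used the recorded impulse responses cited above.

```python
def convolve(x, h):
    """Full linear convolution of two real sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def mix(sources, impulse_responses):
    """Simulate the microphone signals.

    sources[k]              : samples of sound source k.
    impulse_responses[m][k] : room impulse response from source k to mic m.
    Returns observations[m] : sum over k of (impulse response * source).
    """
    observations = []
    for irs_for_mic in impulse_responses:
        # Convolve every source with its path to this microphone.
        parts = [convolve(s, h) for s, h in zip(sources, irs_for_mic)]
        length = max(len(p) for p in parts)
        obs = [0.0] * length
        # Sum the convolved sources sample by sample.
        for p in parts:
            for n, v in enumerate(p):
                obs[n] += v
        observations.append(obs)
    return observations
```

With two sources and two microphones, `mix` takes a 2-by-2 grid of impulse responses and returns two observed signals, matching the two-source setting of the experiment.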

The weight templates used in the method according to the present embodiment (hereinafter, the proposed method) numbered 25 in total: 24 harmonic-structure templates, one per semitone starting from 220 Hz (A3), in which the harmonic components decay as 1/√n, and one template with equal weight at all frequencies. As the initial value of the separation matrix for the proposed method, the value obtained after 40 updates of the separation matrix by independent vector analysis based on a time-varying Gaussian distribution was used; the proposed method was then iterated 40 times to obtain the restored signals. Two variants of the proposed method were run: one that uses a Dirichlet distribution as the prior on π^(l)_kτ (Prop(dir)) and one that uses a uniform prior (Prop(flat)). In the experiment assuming the Dirichlet distribution, the influence of the prior diminishes as the frame length increases, so p(π_kτ) was raised to the power of the number of frequency bins to adjust its effect. In addition, an annealing procedure was applied in which the parameter β of p(π_kτ) was gradually decreased from 1 to 0.95 over the 40 iterations of the proposed method. During these updates π^(l)_kτ sometimes became negative; in that case it was replaced with a sufficiently small positive number.
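One way to construct a template set of the kind described above is sketched below. Only the counts (24 semitone-spaced templates from 220 Hz plus one flat template) and the 1/√n decay come from the text. The Gaussian shape and width of each harmonic peak, the small floor that keeps every bin positive, the sampling rate, and the number of bins are all assumptions made for this example.

```python
import math


def harmonic_template(f0, n_bins, sample_rate, width_hz=20.0, floor=1e-3):
    """Template with peaks at the harmonics n * f0, weighted by 1 / sqrt(n).

    width_hz and floor are assumed values; the text does not give a peak
    shape or a minimum value.
    """
    nyquist = sample_rate / 2.0
    bin_hz = nyquist / (n_bins - 1)
    template = [floor] * n_bins
    n = 1
    while n * f0 <= nyquist:
        weight = 1.0 / math.sqrt(n)  # the n-th harmonic decays as 1/sqrt(n)
        for w in range(n_bins):
            d = (w * bin_hz - n * f0) / width_hz
            template[w] += weight * math.exp(-0.5 * d * d)
        n += 1
    return template


def build_templates(n_bins=513, sample_rate=16000, base_hz=220.0, n_pitches=24):
    """Return 24 semitone-spaced harmonic templates plus one flat template."""
    templates = [
        harmonic_template(base_hz * 2.0 ** (p / 12.0), n_bins, sample_rate)
        for p in range(n_pitches)
    ]
    templates.append([1.0] * n_bins)  # equal weight at all frequencies
    return templates
```

The flat template gives each source a fallback for frames that no harmonic template explains well, such as silent or noisy frames.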

For comparison, three methods were each run with 80 updates of the separation matrix: independent vector analysis assuming a time-varying Gaussian distribution as the source generative model (IVA(TVG)); independent vector analysis assuming a Laplace distribution as the source generative model (IVA(lap)) (see Non-Patent Document 2 above); and independent component analysis (ICA) (see reference: H. Sawada, R. Mukai, S. Araki and S. Makino, "A Robust Approach to the Permutation Problem of Frequency-domain Blind Source Separation," Proc. ICASSP, pp. 381-384, 2003).

FIG. 4 shows the results of evaluation by SDR (see reference: E. Vincent, R. Gribonval and C. Févotte, "Performance Measurement in Blind Audio Source Separation," IEEE Trans. ASLP, pp. 1462-1469, 2006). FIG. 4(A) shows the SDR evaluation of separation performance in room E2A (reverberation time T60 = 300 ms), and FIG. 4(B) shows the SDR evaluation of separation performance in room JR2 (T60 = 470 ms). In both environments the proposed method outperformed the conventional methods. In particular, in room E2A, taking the Dirichlet prior into account yielded an improvement of about 6 dB.

As described above, the acoustic signal analysis apparatus according to the embodiment of the present invention uses a three-dimensional array Π^ whose elements are the probabilities that each power spectrum template having a harmonic structure is selected for each sound source at each frequency and in each time frame. The apparatus repeatedly updates the three-dimensional array Π^, the two-dimensional array Σ^ whose elements are the powers of the sound sources, and the separation matrix W_ω^ of each frequency ω so as to maximize an objective function representing the posterior probability of Π^, Σ^, and W_ω^ given the three-dimensional array X^ of observed time-frequency components. In this way, the time-series data of the acoustic signals output from a plurality of microphones can be accurately separated into the source signal of each sound source.

In addition, a plurality of power spectrum templates having a harmonic structure are prepared, and a source generative model is built in which the spectral templates are incorporated into the variance parameter of a time-varying Gaussian distribution. On the basis of this model, which spectral template should be selected for each sound source in each time frame is estimated simultaneously with the separation filter of each band. This improves the accuracy of source separation when the source signals have a harmonic structure.

The algorithm differs from that of Non-Patent Document 3 above in the following respects: (1) the parameters γ^(l)_kτω and π^(l)_kτ and their update equations (equations (8) and (9) above) are newly added; (2) the power σ²_kτ of each sound source is calculated using γ^(l)_kτω as in equation (10) above; and (3) the intermediate variable V_kω is updated as in equation (12) above using the power spectrum estimate σ~²_kτω of each sound source at each time, calculated according to equation (11) above. That this estimate can take a different value for each frequency ω is an important difference from the prior art.

Furthermore, because the individual source signals mixed in the acoustic signals acquired by a plurality of microphones can be separated, applications such as hands-free video conferencing systems and systems for automatically creating conference-record content are expected.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the acoustic signal analysis apparatus described above contains a computer system internally; where a WWW system is used, the "computer system" is taken to include a homepage providing environment (or display environment) as well.

In addition, although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

Description of Symbols
10 Input unit
20 Calculation unit
21 Time-frequency analysis unit
22 Initial setting unit
23 Auxiliary variable update unit
24 Parameter update unit
25 Time-frequency component estimation unit
26 End determination unit
27 Signal conversion unit
30 Storage unit
40 Output unit
100 Acoustic signal analysis apparatus
241 Probability update unit
242 Time-varying gain update unit
243 Separation matrix update unit

Claims

An acoustic signal analysis apparatus comprising:
time-frequency analysis means for taking as input time-series data of acoustic signals output from M microphones m (M is an integer of 2 or more) and outputting a three-dimensional array X^ whose elements are observed time-frequency components X_mτω (where m is the microphone index, τ is the time-frame index, and ω is the frequency index);
parameter initial value setting means for setting respective initial values of: a three-dimensional array Y^ whose elements are time-frequency components Y_kτω of K sound sources k (where k is the sound source index, τ is the time-frame index, and ω is the frequency index); a three-dimensional array Π^ whose elements are probabilities π^(l)_kτ that each sound source k selects, in each time frame τ, a power spectrum template λ^(l)_ω having a harmonic structure (where l is the index of the power spectrum template); a two-dimensional array Σ^ whose elements are powers σ²_kτ of each sound source k in each time frame τ; and, for each frequency ω, a separation matrix W_ω^ that acts on X_τω (= (X_1τω, ..., X_Mτω)) to obtain the time-frequency components Y_τω (= (Y_1τω, ..., Y_Kτω)) of the source signals;
parameter updating means for updating the three-dimensional array Π^, the two-dimensional array Σ^, and the separation matrix W_ω^ of each frequency ω so as to maximize an objective function representing the posterior probability of the three-dimensional array Π^, the two-dimensional array Σ^, and the separation matrix W_ω^ of each frequency ω given the three-dimensional array X^, the objective function being expressed using: for all combinations of (k, τ, ω, z_kτ), a probability density function of Y_kτω given z_kτ and σ²_kτ, represented by a Gaussian distribution with variance λ^(z_kτ)_ω · σ²_kτ; a probability of z_kτ given π_kτ; a prior distribution of the probabilities π^(l)_kτ for all combinations of (k, τ, l); and a determinant of the separation matrix W_ω^ for each frequency ω;
sound source signal estimate updating means for updating the three-dimensional array Y^ based on the separation matrix W_ω^ of each frequency ω and the three-dimensional array X^; and
end determination means for causing the update by the parameter updating means and the update by the sound source signal estimate updating means to be repeated until a predetermined end condition is satisfied.

The acoustic signal analysis apparatus according to claim 1, wherein
the objective function is an auxiliary function defined using auxiliary variables γ^(l)_kτω, each indicating, for all combinations of (k, τ, ω), the probability that the observed time-frequency component X_mτω belongs to the power spectrum template λ^(l)_ω of each index l, and
the parameter updating means comprises:
auxiliary variable updating means for updating the auxiliary variable γ^(l)_kτω for each of all combinations of (k, τ, ω, l), based on the three-dimensional array Π^, the two-dimensional array Σ^, and the power spectrum templates λ^(l)_ω;
probability updating means for updating the three-dimensional array Π^ based on the auxiliary variables γ^(l)_kτω;
power updating means for updating the two-dimensional array Σ^ based on the auxiliary variables γ^(l)_kτω and the power spectrum templates λ^(l)_ω; and
separation matrix updating means for updating the separation matrix W_ω^ of each frequency ω based on the auxiliary variables γ^(l)_kτω, the two-dimensional array Σ^, the three-dimensional array X^, and the power spectrum templates λ^(l)_ω.

The acoustic signal analysis apparatus according to claim 1 or 2, wherein the prior distribution of the probability π^(l)_kτ is a Dirichlet distribution.

An acoustic signal analysis method in which:
time-frequency analysis means takes as input time-series data of acoustic signals output from M microphones m (M is an integer of 2 or more) and outputs a three-dimensional array X^ whose elements are observed time-frequency components X_mτω (where m is the microphone index, τ is the time-frame index, and ω is the frequency index);
parameter initial value setting means sets respective initial values of: a three-dimensional array Y^ whose elements are time-frequency components Y_kτω of K sound sources k (where k is the sound source index, τ is the time-frame index, and ω is the frequency index); a three-dimensional array Π^ whose elements are probabilities π^(l)_kτ that each sound source k selects, in each time frame τ, a power spectrum template λ^(l)_ω having a harmonic structure (where l is the index of the power spectrum template); a two-dimensional array Σ^ whose elements are powers σ²_kτ of each sound source k in each time frame τ; and, for each frequency ω, a separation matrix W_ω^ that acts on X_τω (= (X_1τω, ..., X_Mτω)) to obtain the time-frequency components Y_τω (= (Y_1τω, ..., Y_Kτω)) of the source signals;
parameter updating means updates the three-dimensional array Π^, the two-dimensional array Σ^, and the separation matrix W_ω^ of each frequency ω so as to maximize an objective function representing the posterior probability of the three-dimensional array Π^, the two-dimensional array Σ^, and the separation matrix W_ω^ of each frequency ω given the three-dimensional array X^, the objective function being expressed using: for all combinations of (k, τ, ω, z_kτ), a probability density function of Y_kτω given z_kτ and σ²_kτ, represented by a Gaussian distribution with variance λ^(z_kτ)_ω · σ²_kτ; a probability of z_kτ given π_kτ; a prior distribution of the probabilities π^(l)_kτ for all combinations of (k, τ, l); and a determinant of the separation matrix W_ω^ for each frequency ω;
sound source signal estimate updating means updates the three-dimensional array Y^ based on the separation matrix W_ω^ of each frequency ω and the three-dimensional array X^; and
end determination means causes the update by the parameter updating means and the update by the sound source signal estimate updating means to be repeated until a predetermined end condition is satisfied.

A program for causing a computer to function as each means of the acoustic signal analysis apparatus according to any one of claims 1 to 3.