JP2018032001A

JP2018032001A - Signal processing device, signal processing method and signal processing program

Info

Publication number: JP2018032001A
Application number: JP2016166232A
Authority: JP
Inventors: Nobutaka Ito (伊藤 信貴); Tomohiro Nakatani (中谷 智広); Akiko Araki (荒木 章子)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-08-26
Filing date: 2016-08-26
Publication date: 2018-03-01
Anticipated expiration: 2036-08-26
Also published as: JP6538624B2

Abstract

PROBLEM TO BE SOLVED: To perform sound source localization accurately even when the observation signal is short. SOLUTION: A time-frequency analysis part 10 applies a time-frequency transform to recorded sound acquired at different positions and calculates an observation signal vector. A feature vector calculation part 20 calculates, at each time-frequency point, a feature vector that includes information on the direction of the observation signal vector. A parameter storage part 30 stores model parameters of the conditional probability distribution of the feature vector under the condition that a state representing the sound source position takes the state corresponding to each of a plurality of sound source position candidates. A prior probability distribution calculation part 40 fits a mixture model to the feature vectors and thereby calculates the prior probability distribution of the state; the mixture model is a weighted sum, with the prior probability distribution as the weights, of the conditional probability distributions of the feature vector given the state, based on the model parameters in the parameter storage part 30. A sound source position calculation part 50 calculates the sound source position on the basis of the prior probability distribution. SELECTED DRAWING: Figure 2

Description

The present invention relates to a signal processing device, a signal processing method, and a signal processing program.

Sound source localization techniques are known that estimate the position of a sound source from recorded sound observed with a plurality of microphones or the like. One known technique assumes that the number of sound sources is known and estimates the sound source positions from a covariance matrix estimated by applying time-frequency analysis to the observation signals.

R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. AP-34, no. 3, pp. 276-280, March 1986.
N. Ito, E. Vincent, N. Ono, R. Gribonval, and S. Sagayama, "Crystal-MUSIC: Accurate localization of multiple sources in diffuse noise environments using crystal-shaped microphone arrays," Proceedings of 9th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), pp. 81-88, September 2010.

However, conventional sound source localization techniques may fail to localize sources accurately when the observation signal is short. For example, a short observation signal may not provide enough samples to estimate the covariance matrix, in which case the sound source positions cannot be estimated accurately.

A signal processing device according to the present invention comprises: a time-frequency analysis unit that applies time-frequency analysis to recorded sound acquired at a plurality of different positions and calculates an observation signal vector, which is an M-dimensional vector; a feature vector calculation unit that calculates, for each time-frequency point, a feature vector that includes information on the direction of the observation signal vector y(t, f) calculated by the time-frequency analysis unit; a parameter storage unit that stores model parameters of the conditional probability distribution of the feature vector under the condition that a state representing the sound source position takes the state corresponding to each of a plurality of sound source position candidates; a prior probability distribution calculation unit that fits a mixture model to the feature vectors calculated by the feature vector calculation unit and thereby calculates the prior probability distribution of the state representing the sound source position, the mixture model being a weighted sum, with that prior probability distribution as the weights, of the conditional probability distributions of the feature vector given the state, based on the model parameters stored in the parameter storage unit; and a sound source position calculation unit that calculates the sound source position corresponding to the feature vectors on the basis of the prior probability distribution calculated by the prior probability distribution calculation unit.

A signal processing method according to the present invention is executed by a signal processing device and comprises: a time-frequency analysis step of applying time-frequency analysis to recorded sound acquired at a plurality of different positions and calculating an observation signal vector, which is an M-dimensional vector; a feature vector calculation step of calculating, for each time-frequency point, a feature vector that includes information on the direction of the observation signal vector y(t, f) calculated in the time-frequency analysis step; a prior probability distribution calculation step of acquiring model parameters from a parameter storage unit, which stores model parameters of the conditional probability distribution of the feature vector under the condition that a state representing the sound source position takes the state corresponding to each of a plurality of sound source position candidates, fitting a mixture model to the feature vectors calculated in the feature vector calculation step, and thereby calculating the prior probability distribution of the state representing the sound source position, the mixture model being a weighted sum, with that prior probability distribution as the weights, of the conditional probability distributions of the feature vector given the state, based on the model parameters; and a sound source position calculation step of calculating the sound source position corresponding to the feature vectors on the basis of the prior probability distribution calculated in the prior probability distribution calculation step.

According to the present invention, sound source localization can be performed accurately even when the observation signal is short.

FIG. 1 is a diagram for explaining sound source localization in the present invention.
FIG. 2 is a diagram illustrating an example of the configuration of the signal processing device according to the first embodiment.
FIG. 3 is a flowchart showing the processing flow of the signal processing device according to the first embodiment.
FIG. 4 is a diagram illustrating an example of the configuration of the signal processing device according to the eighth embodiment.
FIG. 5 is a diagram illustrating an example of the configuration of the signal processing device according to the ninth embodiment.
FIG. 6 is a diagram illustrating an example of the configuration of the signal processing device according to the tenth embodiment.
FIG. 7 is a diagram illustrating an example of the configuration of the signal processing device according to the eleventh embodiment.
FIG. 8 is a diagram illustrating an example of a computer that realizes the signal processing device by executing a program.

Embodiments of the signal processing device, signal processing method, and signal processing program according to the present application are described in detail below with reference to the drawings. The present invention is not limited by these embodiments.

[Sound source localization in the present invention]
Source signals are usually sparse: they have large power only at sparse points on the time-frequency plane. Consequently, even when several sources sound simultaneously, the observation signal at each time-frequency point can be regarded as containing at most one source signal. Let the observation signal vector y(t, f) be the M-dimensional column vector formed from the time-frequency transforms of the observation signals acquired at M (M > 1) different positions, where t is the frame number (t = 1 to T) and f is the frequency bin number (f = 1 to F). Then y(t, f) can be regarded as pointing in a characteristic direction determined by the position of the source whose signal is contained in the observation at the time-frequency point (t, f). More precisely, because of noise and reverberation, the direction of y(t, f) is distributed with some spread around that characteristic direction. This property makes it possible to estimate the sound source position from the direction of the observation signal vector y(t, f).

In the embodiments of the present invention, sound source localization is formulated as the problem of identifying which of a plurality (L) of sound source position candidates are actually emitting sound, that is, of estimating the set of (indices of) candidates that are actually emitting sound. The candidates can be, for example, a plurality of locations in the room where localization is performed (for example, the grid points obtained when the room is finely divided into a grid). When the region in which sources can exist is known, a plurality of locations within that region can be used as the candidates. For example, when localizing sources in a recording of a conversation among several people seated around a table, the speakers can be regarded as being only near the periphery of the table, so a plurality of locations near the periphery can be used as candidates (see FIG. 1).

Based on the property of the observation signal described above, the embodiments store model parameters of the conditional probability distribution of the feature vector z(t, f), a vector that includes information on the direction of y(t, f), under the condition that the state representing the source position takes the state corresponding to each of the L candidates, and use those model parameters as prior information for estimating the source positions. As noted above, the direction of y(t, f) can be regarded as determined by the source position, so z(t, f) has a characteristic probability distribution determined by the source position. The conditional probability distribution expresses the distribution of z(t, f) when the state representing the source position takes the state corresponding to each of the L candidates.

Mathematically, the direction of the observation signal vector y(t, f) means the ratio of its elements over all microphones, y(1, t, f) : y(2, t, f) : ... : y(M, t, f). In other words, it is an element of the (M-1)-dimensional projective space over the complex field, the space obtained by identifying vectors in the M-dimensional complex vector space that are scalar multiples of one another. Here y(m, t, f) denotes the m-th element of y(t, f). Accordingly, saying that the feature vector z(t, f) includes information on the direction of y(t, f) means that, given z(t, f), the element ratio y(1, t, f) : y(2, t, f) : ... : y(M, t, f) is uniquely determined. One such feature vector is the unit vector parallel to the observation signal vector. The observation signal vector y(t, f) itself naturally includes its own direction information, so it can also be used as the feature vector.

A feature vector that includes the direction of y(t, f) carries both phase-difference and amplitude-ratio information about the source position. This differs substantially from conventional features such as Time Difference of Arrival (TDOA) and Direction of Arrival (DOA), which use only the phase difference and not the amplitude ratio. Such a feature vector therefore uses more information about the source position than phase-only features and enables more accurate localization. Because it extracts as much source position information as possible from a limited amount of data, it also contributes to the ability of the embodiments to localize accurately even when the observation signal is short. Using such feature vectors has been shown to enable more effective signal processing (for example, source separation and noise removal) than using the phase difference alone (reference: H. Sawada, S. Araki, and S. Makino, "Underdetermined Convolutive Blind Source Separation via Frequency Bin-Wise Clustering and Permutation Alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516-527, Mar. 2011).

That the feature vector carries both phase-difference and amplitude-ratio information can be explained as follows. As stated above, the direction of y(t, f) is the element ratio y(1, t, f) : y(2, t, f) : ... : y(M, t, f) over all microphones, which is informationally equivalent to the set of pairwise element ratios y(m, t, f) : y(n, t, f) over all microphone pairs (m, n). Since the ratio of two complex numbers is informationally equivalent to their phase difference together with the ratio of their absolute values (the amplitude ratio), the pairwise element ratios are equivalent to the phase differences and amplitude ratios over all microphone pairs. Therefore the direction of y(t, f) is informationally equivalent to the phase differences and amplitude ratios over all microphone pairs (m, n); that is, a feature vector that includes the direction of y(t, f) carries both phase-difference and amplitude-ratio information about the source position.
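The equivalence between the direction of y(t, f) and the pairwise phase differences and amplitude ratios can be checked numerically. The following is a minimal sketch in Python (standard library only); the helper names `direction_feature` and `pairwise_ratio` and the sample values are illustrative assumptions, not part of the patent text.

```python
import cmath
import math

def direction_feature(y):
    """Return the unit vector parallel to y (hypothetical helper name)."""
    norm = math.sqrt(sum(abs(v) ** 2 for v in y))
    return [v / norm for v in y]

def pairwise_ratio(v, m, n):
    """Phase difference and amplitude ratio between elements m and n of v."""
    r = v[m] / v[n]
    return cmath.phase(r), abs(r)

# Observation at one time-frequency point for M = 3 microphones (made-up values).
y = [1.0 + 0.5j, 0.8 - 0.2j, -0.3 + 0.9j]
z = direction_feature(y)

# Multiplying y by any nonzero complex scalar does not change its direction,
# so every pairwise ratio of the resulting feature vector is unchanged.
scale = 2.0 * cmath.exp(0.7j)
z_scaled = direction_feature([scale * v for v in y])

phase_y, amp_y = pairwise_ratio(y, 0, 1)
phase_z, amp_z = pairwise_ratio(z, 0, 1)
phase_s, amp_s = pairwise_ratio(z_scaled, 0, 1)
```

Normalization discards only the overall scale of y(t, f); the phase difference and amplitude ratio of each microphone pair survive in z(t, f).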

An example of localizing sources in a recording of a conversation among several people seated around a table is described with reference to FIG. 1. FIG. 1 is a diagram for explaining sound source localization in the present invention. As shown in FIG. 1, the signal processing device can use as the sound source position candidates 110 the L points obtained by finely dividing the region around the table 100 at equal intervals. In the example of FIG. 1, L = 8. Three microphones 120 are placed on the table 100. In this example, sources can be regarded as existing only at the periphery of the table, and sitting height can be regarded as roughly constant across individuals, so a source position can be specified by its direction (azimuth) as seen from the microphones 120.

Based on the observation signals observed by the microphones 120, the signal processing device calculates the observation signal vector y(t, f) and the feature vector z(t, f), a vector that includes information on the direction of y(t, f). Using the model parameters of the conditional probability distributions, the device then fits to the feature vectors z(t, f) a mixture model that is a weighted sum of the conditional probability distributions of z(t, f), one for each state corresponding to a sound source position candidate, and thereby calculates the prior probability distribution, which serves as the weights of that sum. The calculated prior probability distribution takes large values at the actual source positions, so the source positions can be estimated from it. For example, if the prior probability distribution takes its largest value at the sound source position candidate 110 with l = 2, the source can be regarded as lying in the direction indicated by the arrow 130.
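The passage above does not say how the mixture model is fitted. One common choice for estimating mixture weights while the component distributions are held fixed is the expectation-maximization (EM) algorithm. The sketch below assumes that choice; the function name `fit_prior`, the iteration count, and the toy likelihood values are all hypothetical.

```python
def fit_prior(lik, n_iter=50):
    """Estimate mixture weights alpha[l] with the component densities held fixed.

    lik[k][l] is the conditional density of feature vector k under state l,
    i.e. a value of p(z | g = l) evaluated with the stored model parameters.
    """
    num_states = len(lik[0])
    alpha = [1.0 / num_states] * num_states
    for _ in range(n_iter):
        totals = [0.0] * num_states
        for row in lik:
            # E-step: posterior probability of each state for this feature vector.
            joint = [a * p for a, p in zip(alpha, row)]
            norm = sum(joint)
            for l, j in enumerate(joint):
                totals[l] += j / norm
        # M-step: the new weight of a state is its average posterior probability.
        alpha = [t / len(lik) for t in totals]
    return alpha

# Toy data: 3 candidate states; most points are best explained by state 1
# (0-based index), a few by state 2, none by state 0.
lik = [[0.1, 2.0, 0.1]] * 8 + [[0.1, 0.1, 2.0]] * 2
alpha = fit_prior(lik)
best = max(range(len(alpha)), key=lambda l: alpha[l])
```

In this toy case the estimated weights concentrate on the states that explain the data, which mirrors the statement that the prior probability distribution takes large values at the actual source positions.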

[First Embodiment]
The signal processing device according to the first embodiment estimates the set of sound source positions when the number of sources N is unknown. N may be 0, which corresponds to an empty set of source positions. In this embodiment, the signal processing device uses the direction vector of y(t, f) as the feature vector z(t, f); uses states corresponding to the sound source position candidates as the states representing the source position; uses the complex Watson distribution as the conditional probability distribution of z(t, f) given each such state; calculates and stores the model parameters of the complex Watson distribution under the assumption that the target signals propagate as spherical waves; and uses a time-invariant prior probability distribution as the prior probability distribution.

The configuration of the signal processing device according to the first embodiment is described with reference to FIG. 2. FIG. 2 is a diagram illustrating an example of the configuration of the signal processing device according to the first embodiment. As shown in FIG. 2, the signal processing device 1 includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50.

The time-frequency analysis unit 10 applies time-frequency analysis to the observation signals y(m, τ) from M microphones, which are the recorded sound acquired at a plurality of different positions (m is the microphone number, m = 1 to M, and τ is the time index), to calculate the time-frequency transforms y(m, t, f) of the observation signals (t is the frame number, t = 1 to T, and f is the frequency bin number, f = 1 to F). It then forms the observation signal vector y(t, f), the M-dimensional column vector consisting of y(m, t, f) for m = 1 to M. The recorded sound may have undergone some preprocessing (for example, dereverberation or spatial whitening) after being acquired at the different positions (references: T. Yoshioka, T. Nakatani, M. Miyoshi, and H. G. Okuno, "Blind separation and dereverberation of speech mixtures by joint optimization," IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 1, pp. 69-84, 2011; H. Sawada, S. Araki, and S. Makino, "Underdetermined Convolutive Blind Source Separation via Frequency Bin-Wise Clustering and Permutation Alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516-527, Mar. 2011).
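The text does not fix a particular time-frequency analysis. A short-time Fourier transform (STFT) is a typical choice, and the sketch below assumes it. The Hann window, frame length, hop size, and function names are illustrative assumptions, and indices are 0-based here whereas the text counts from 1.

```python
import cmath
import math

def stft(x, frame_len, hop):
    """Hann-windowed short-time Fourier transform of one channel.

    Returns frames[t][f] for f = 0 .. frame_len // 2.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + i] * window[i] for i in range(frame_len)]
        # Naive DFT, kept for clarity; an FFT routine would be used in practice.
        frames.append([
            sum(seg[i] * cmath.exp(-2j * math.pi * f * i / frame_len)
                for i in range(frame_len))
            for f in range(frame_len // 2 + 1)
        ])
    return frames

def observation_vectors(channels, frame_len=16, hop=8):
    """Stack per-microphone STFTs so that Y[t][f] is the M-dimensional y(t, f)."""
    per_mic = [stft(x, frame_len, hop) for x in channels]
    num_frames, num_bins = len(per_mic[0]), len(per_mic[0][0])
    return [[[mic[t][f] for mic in per_mic] for f in range(num_bins)]
            for t in range(num_frames)]

# Two microphones; the second receives a half-amplitude copy of the first (toy data).
x1 = [math.sin(2 * math.pi * 2 * i / 16) for i in range(64)]
x2 = [0.5 * v for v in x1]
Y = observation_vectors([x1, x2])
```

Each `Y[t][f]` is a list of M complex values, one per microphone, which corresponds to the observation signal vector at one time-frequency point.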

The feature vector calculation unit 20 receives the observation signal vector y(t, f) from the time-frequency analysis unit 10 and calculates, by equation (1), the feature vector z(t, f), a vector that includes information on the direction of y(t, f).
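Equation (1) did not survive extraction. Because the surrounding text describes the feature vector as the direction vector (unit vector) of y(t, f) and refers to a Euclidean norm and an assignment arrow, it can plausibly be reconstructed as follows; the exact typesetting of the original is an assumption.

```latex
\mathbf{z}(t,f) \leftarrow \frac{\mathbf{y}(t,f)}{\lVert \mathbf{y}(t,f) \rVert} \tag{1}
```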

Here, ||·|| denotes the Euclidean norm, and the arrow ← denotes assignment of the right-hand side to the left-hand side. The modeling in this embodiment assumes that the observation signal vector y(t, f) consists of N target signals (N may be unknown and may be 0) and contains no background noise. It also assumes that each target signal is sparse, having large power only at sparse points on the time-frequency plane. Under these assumptions, this embodiment assumes that y(t, f) contains only one target signal at each time-frequency point. That is, y(t, f) is modeled by equation (2).
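Equation (2) is also missing from the extracted text. Given the adjacent definitions (a single target signal s(n, t, f) per time-frequency point, multiplied by the steering vector h(n, f)), a reconstruction consistent with them is shown below; the original notation may differ, for example in how the dependence of n on (t, f) is written.

```latex
\mathbf{y}(t,f) = s(n,t,f)\,\mathbf{h}(n,f) \tag{2}
```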

Here, s(n, t, f) is the time-frequency transform of the n-th target signal, and n is the target signal number (n = 1 to N). The vector h(n, f) is the steering vector representing the spatial transfer characteristics of the n-th target signal and takes a value specific to the position of that signal's source. Equation (2) states that y(t, f) consists only of the n-th target signal, where n varies with the time-frequency point (t, f).

The direction of the observation signal vector y(t, f) in the M-dimensional complex vector space (that is, the one-dimensional subspace spanned by y(t, f) in that space) is a characteristic direction determined by the position of the source whose signal is contained in the observation at the time-frequency point (t, f); specifically, it is the direction of the steering vector h(n, f). More precisely, because of noise and reverberation, the direction of y(t, f) is distributed with some spread around that characteristic direction. In this embodiment, the feature vector of equation (1), which is the direction vector of y(t, f), is used as the feature vector that includes information on the direction of y(t, f).

In this embodiment, the conditional probability distribution of the feature vector z(t, f), under the condition that the state representing the source position takes the state corresponding to each of the L sound source position candidates, is modeled by a complex Watson distribution. (Other probability distributions can also be used, such as the complex Bingham distribution, the complex angular central Gaussian distribution, the complex Gaussian distribution, and mixtures of complex Watson, complex Bingham, complex angular central Gaussian, or complex Gaussian distributions.) That is, z(t, f) is modeled by equation (3).
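Equations (3) and (4) are absent from the extracted text. From the definitions given for g(t, f), a(l, f), κ(l, f), and W, they can be reconstructed as below. Writing the constraint of equation (4) as a unit-norm condition on the mean direction vector is an assumption; the original may state it equivalently as a^H a = 1.

```latex
p\bigl(\mathbf{z}(t,f) \mid g(t,f) = l\bigr)
  = \mathcal{W}\bigl(\mathbf{z}(t,f);\, \mathbf{a}(l,f),\, \kappa(l,f)\bigr) \tag{3}

\lVert \mathbf{a}(l,f) \rVert = 1 \tag{4}
```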

Here, g(t, f) is the state representing the sound source position at the time-frequency point (t, f). In this embodiment, this state takes one of the values 1 to L, each corresponding to one of the L sound source position candidates. State l is defined as the state in which the position of the sound source whose signal is contained in the observation signal vector y(t, f) at (t, f) is the l-th sound source position candidate. p(z(t, f) | g(t, f) = l) is the conditional probability distribution of the feature vector z(t, f) given g(t, f) = l. The vector a(l, f) is a model parameter that determines the mean direction of z(t, f) for the l-th candidate; it is called the mean direction vector and satisfies Equation (4). κ(l, f) is a model parameter that determines how strongly the distribution of z(t, f) for the l-th candidate concentrates around the mean direction vector a(l, f); it is called the concentration parameter.

W(z; a, κ) is the complex Watson distribution of a vector z with mean direction vector a and concentration parameter κ, and is given by Equation (5).

Here, K is the Kummer function (the confluent hypergeometric function of the first kind) defined by the infinite series of Equation (6), and the superscript H denotes the Hermitian transpose. For i = 0, the ratio ξ(ξ+1)···(ξ+i−1) / [η(η+1)···(η+i−1)] is defined to be 1.
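The following is a minimal sketch of how the complex Watson log-density could be evaluated. Equations (5) and (6) are not reproduced in this text, so the code assumes the commonly used form W(z; a, κ) = (M−1)! / (2π^M K(1, M; κ)) · exp(κ |a^H z|²) with K(ξ, η; κ) = Σ_i [ξ(ξ+1)···(ξ+i−1)] / [η(η+1)···(η+i−1)] · κ^i / i!. The function names `log_kummer` and `log_watson`, the tolerance, and the log-domain summation are illustrative choices.

```python
import math
import numpy as np

def log_kummer(xi, eta, kappa, tol=1e-15, max_terms=1_000_000):
    """log K(xi, eta; kappa) for xi, eta > 0 and kappa >= 0, from the power series.

    Terms are accumulated in the log domain so that large kappa does not overflow.
    """
    log_term = 0.0   # the i = 0 term equals 1
    log_sum = 0.0
    if kappa == 0.0:
        return log_sum
    for i in range(1, max_terms):
        ratio = (xi + i - 1) / (eta + i - 1) * kappa / i   # term_i / term_{i-1}
        log_term += math.log(ratio)
        log_sum = float(np.logaddexp(log_sum, log_term))
        # Stop once the terms are shrinking and negligible relative to the sum.
        if ratio < 1.0 and log_term < log_sum + math.log(tol):
            break
    return log_sum

def log_watson(z, a, kappa):
    """Log-density of the complex Watson distribution (assumed form, see lead-in).

    z, a : complex unit-norm vectors of length M; kappa : concentration >= 0.
    """
    M = z.shape[-1]
    log_norm = (math.lgamma(M) - math.log(2.0) - M * math.log(math.pi)
                - log_kummer(1.0, float(M), kappa))
    # np.vdot conjugates its first argument, so this is |a^H z|^2.
    return log_norm + kappa * abs(np.vdot(a, z)) ** 2

a = np.array([1.0, 0.0, 0.0], dtype=complex)
z_aligned = a.copy()
z_orthogonal = np.array([0.0, 1.0, 0.0], dtype=complex)
```

Because the density depends on z only through |a^H z|², it is unchanged when z is multiplied by any unit-modulus complex number, which suits feature vectors whose overall phase is arbitrary.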

The parameter storage unit 30 stores the model parameters of the conditional probability distribution of the feature vector, given that the state representing the sound source position takes the state corresponding to each of the sound source position candidates. Specifically, the parameter storage unit 30 stores the model parameters that model the sound source position in the conditional probability distribution of Equation (3): the mean direction vectors a(l, f) (l = 1 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 1 to L, f = 1 to F). In this embodiment, these model parameters are calculated as follows. Based on the assumption that the target signal propagates as a spherical wave, the m-th element of the mean direction vector a(l, f) is calculated by Equation (7).

Here, the vector q(m) is the three-dimensional real vector of Cartesian coordinates of the m-th microphone (assumed known in this embodiment), the vector r(l) is the three-dimensional real vector of Cartesian coordinates of the l-th sound source position candidate (known), j is the imaginary unit, ω(f) is the angular frequency of the f-th frequency bin, and c is the speed of sound. The subscript m on the left-hand side denotes the m-th element, and the square-root term in the denominator on the right-hand side is a normalization factor that makes the mean direction vector a(l, f) satisfy the constraint of Equation (4).
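A minimal sketch of a spherical-wave mean direction vector is given below. Equation (7) is not reproduced in this text, so the code assumes the usual free-field form: each element has amplitude 1/d and phase −ω d / c, where d = ‖q(m) − r(l)‖, and the vector is divided by the square root of Σ_m 1/d² so that it has unit norm (taken here to be the constraint of Equation (4)). The sign of the phase, the default speed of sound, the geometry, and the function name `mean_direction_vector` are assumptions.

```python
import numpy as np

def mean_direction_vector(mic_pos, src_pos, omega, c=340.0):
    """Unit-norm spherical-wave steering vector for one candidate and one bin.

    mic_pos : (M, 3) real array of microphone coordinates q(m), in metres.
    src_pos : (3,) real array with the candidate coordinates r(l), in metres.
    omega   : angular frequency of the bin, in rad/s.
    c       : speed of sound in m/s (assumed value).
    """
    d = np.linalg.norm(mic_pos - src_pos, axis=1)     # source-to-microphone distances
    h = np.exp(-1j * omega * d / c) / d               # attenuation and propagation delay
    return h / np.sqrt(np.sum(1.0 / d ** 2))          # normalise so that a^H a = 1

# Hypothetical 4-microphone square array (5 cm side) and one candidate position.
mics = np.array([[0.00, 0.00, 0.0],
                 [0.05, 0.00, 0.0],
                 [0.05, 0.05, 0.0],
                 [0.00, 0.05, 0.0]])
candidate = np.array([1.0, 0.5, 0.0])
a_lf = mean_direction_vector(mics, candidate, omega=2.0 * np.pi * 1000.0)
```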

The concentration parameter κ(l, f), on the other hand, is calculated by Equation (8), assuming for example that it is proportional to the inverse square of the frequency ω(f)/2π. Equation (8) is based on the property that the direction of the observation signal vector y(t, f) has a smaller variance (a higher concentration) at lower frequencies. Taking this property into account properly makes the estimation of the prior probability distribution, and the sound source localization based on it, more accurate. The proportionality constant β may be chosen freely; for example, β = 6.4 × 10^7 Hz^2.
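As a small illustration of this rule, the sketch below evaluates κ = β / (ω/2π)² for a few frequencies with the example value β = 6.4 × 10^7 Hz^2 from the text. Equation (8) itself is not reproduced here, so this form is inferred from the wording; the function name `concentration` is illustrative, and the 0 Hz bin must be excluded because the expression diverges there.

```python
import numpy as np

def concentration(freq_hz, beta=6.4e7):
    """Concentration parameter proportional to the inverse square of frequency.

    freq_hz : frequency or array of frequencies in Hz (must be positive).
    beta    : proportionality constant in Hz^2 (example value from the text).
    """
    freq_hz = np.asarray(freq_hz, dtype=float)
    return beta / freq_hz ** 2

kappa = concentration([500.0, 1000.0, 2000.0])
```

With these values the concentration is 256 at 500 Hz, 64 at 1 kHz, and 16 at 2 kHz, so low-frequency bins are modeled as far more tightly clustered around the mean direction than high-frequency bins.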

Next, the modeling of the marginal probability distribution of the feature vector z(t, f) in this embodiment is described. In this embodiment, the marginal distribution of z(t, f) is modeled by the mixture model of Equation (9), which is the weighted sum of the conditional probability distributions p(z(t, f) | g(t, f) = l), with the prior probability distribution P(g(t, f) = l) of the state g(t, f) representing the sound source position as the weights.

The conditional probability distribution p(z(t, f) | g(t, f) = l) is the distribution of the feature vector z(t, f) when the state representing the sound source position is known, whereas the marginal probability distribution p(z(t, f)) of Equation (9) is the distribution of z(t, f) when that state is unknown. The prior probability distribution P(g(t, f) = l) may be either time-varying or time-invariant. In the former case, P(g(t, f) = l) can take a different value in each time interval (for example, each frame). In the latter case, P(g(t, f) = l) takes the same value regardless of the time interval.
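Since the equation images are not reproduced in this text, the mixture model described in words above can be written out as follows. This is a reconstruction from the verbal definitions of Equations (3) and (9), not a copy of the original formulas.

```latex
% Conditional distribution for candidate l (cf. Equation (3)):
p\bigl(z(t,f) \mid g(t,f) = l\bigr) = W\bigl(z(t,f);\, a(l,f),\, \kappa(l,f)\bigr)

% Marginal distribution as a weighted sum over the L candidates (cf. Equation (9)):
p\bigl(z(t,f)\bigr) = \sum_{l=1}^{L} P\bigl(g(t,f) = l\bigr)\,
                      p\bigl(z(t,f) \mid g(t,f) = l\bigr)
```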

When the prior probability distribution P(g(t, f) = l) is time-invariant, the prior, which takes large values at the sound source positions, is estimated from all time intervals (for example, all frames). More data is therefore available for the estimate than in the time-varying case, so the sound source positions can be estimated more accurately when the sources do not move and the speakers do not change. On the other hand, the sound source positions cannot be estimated per time interval, so the time-varying case is better suited to tracking, diarization, and similar tasks in dynamic situations with moving sources or speaker changes.

When the prior probability distribution P(g(t, f) = l) is time-varying, the prior, which is a function taking large values at the sound source positions, is estimated for each time interval (for example, each frame). Sound source positions can therefore be estimated per time interval, and tracking and diarization can be performed on the basis of these per-interval estimates. For example, in speech recognition of multi-party conversation, the segments to which speech recognition should be applied need to be cut out by diarization, which estimates who spoke when, so that noise is not mistaken for speech and misrecognized. The time-varying case is applicable to such uses.

In this embodiment, the prior probability distribution P(g(t, f) = l) is assumed to be time-invariant. It is further assumed to be independent of frequency. That is, in this embodiment P(g(t, f) = l) is assumed not to depend on the frame or the frequency bin, and is denoted α(l), where α(l) satisfies the constraint α(1) + … + α(L) = 1. With a frequency-independent prior, the prior can be estimated from the feature vectors z(t, f) observed at all frequencies. Compared with a frequency-dependent prior, more information is available for the estimate, so the prior and the sound source localization based on it become more accurate, including when the observation signal is short.

The prior probability distribution calculation unit 40 fits a mixture model to the feature vectors calculated by the feature vector calculation unit 20 and thereby calculates the prior probability distribution α(l) (l = 1 to L). The mixture model is the weighted sum of the conditional probability distributions of the feature vector given a known state representing the sound source position; these conditional distributions are based on the model parameters stored in the parameter storage unit 30, namely the mean direction vectors a(l, f) and the concentration parameters κ(l, f), and the weights are the prior probability distribution α(l) of the state representing the sound source position.

There are various ways to fit the mixture model of Equation (9) to the feature vectors z(t, f). For example, the likelihood of Equation (9) is taken as the objective function (the posterior probability, among others, can be used instead) and is maximized with respect to the prior probability distribution α(l) (l = 1 to L) by a gradient method (it can also be maximized by, for example, the Expectation-Maximization (EM) algorithm).

The gradient-based method has an advantage in computational cost over the EM-based method. The EM-based method must compute, at every iteration, the contribution of each sound source position candidate at every time-frequency point in addition to the prior probability distribution α(l) (l = 1 to L). The gradient method only needs to compute α(l) (l = 1 to L) at each iteration, so its computational cost is much lower than that of the EM algorithm. The processing in the prior probability distribution calculation unit 40 is, for example, as follows.

First, α(l) is initialized by α(l) ← 1/L (l = 1 to L). Next, the updates of α(l) (l = 1 to L) given by Equations (10) and (11) below are applied alternately and repeated a predetermined number of times (for example, 10 times).

Finally, α(l) (l = 1 to L) is output. Here, the vector α is the L-dimensional column vector consisting of α(l) (l = 1 to L), the vector w(t, f) is the L-dimensional column vector consisting of W(z(t, f); a(l, f), κ(l, f)) (l = 1 to L), the superscript T denotes the transpose, and λ is a predetermined positive constant (for example, λ = 1).
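The procedure above can be sketched as follows. Equations (10) and (11) are not reproduced in this text, so the sketch assumes that Equation (10) is a gradient-ascent step α ← α + λ Σ_t Σ_f w(t, f) / (αᵀ w(t, f)) and that Equation (11) rescales α to sum to one; both forms are inferred from the surrounding description and may differ from the original. The function name `estimate_prior` and the use of log-likelihoods as input are illustrative choices.

```python
import numpy as np

def estimate_prior(log_w, n_iter=10, step=1.0):
    """Estimate a time- and frequency-independent prior alpha(l).

    log_w  : (T, F, L) array of log W(z(t,f); a(l,f), kappa(l,f)).
    n_iter : number of update iterations (10 in the text's example).
    step   : the positive constant lambda (1 in the text's example).
    """
    T, F, L = log_w.shape
    # Rescale each time-frequency point by its largest likelihood; the ratio
    # w / (alpha^T w) used in the gradient is unchanged by this scaling.
    w = np.exp(log_w - log_w.max(axis=-1, keepdims=True)).reshape(T * F, L)
    alpha = np.full(L, 1.0 / L)                 # initialisation alpha(l) <- 1/L
    for _ in range(n_iter):
        mix = w @ alpha                         # alpha^T w(t, f) at every point
        grad = (w / mix[:, None]).sum(axis=0)   # gradient of the log-likelihood
        alpha = alpha + step * grad             # assumed form of Equation (10)
        alpha = alpha / alpha.sum()             # assumed form of Equation (11)
    return alpha

# Synthetic likelihoods in which candidate 2 explains every time-frequency point best.
log_w = np.full((6, 8, 5), -5.0)
log_w[:, :, 2] = 0.0
alpha = estimate_prior(log_w)
```

Because the gradient is non-negative, the additive step keeps every α(l) non-negative, and only one L-dimensional vector is carried between iterations, in line with the cost argument made for the gradient method.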

The derivation of Equations (10) and (11) is as follows. The likelihood, which is the objective function, is the probability of observing z(t, f) (t = 1 to T, f = 1 to F) and is given by Equation (12).

Maximizing Equation (12) is equivalent to maximizing its natural logarithm, Equation (13).

Here, ln denotes the natural logarithm, and the triangle above the equals sign indicates a definition. Taking the gradient of Equation (13) gives Equation (14), from which Equation (10) follows. Equation (11), in turn, is the step that makes α(l) satisfy the constraint α(1) + … + α(L) = 1. In Equation (13), instead of an unweighted sum, a modified objective function that takes a weighted sum with reliability-based weights may be used. This gives greater weight to the feature vectors at highly reliable time-frequency points and can improve the accuracy of the prior estimation and of the sound source localization based on it. For example, on the assumption that time-frequency points where the norm of the observation signal vector y(t, f) is small correspond to noise and points where the norm is large correspond to the target signal, the norm can be used as the reliability-based weight.
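Because the equation images are missing from this text, the chain from likelihood to gradient can be reconstructed as below. The product form assumes that the feature vectors at different time-frequency points are treated as independent; the symbol 𝓛 for the log-likelihood and the weight symbol c(t, f) are introduced here for illustration only.

```latex
% Likelihood of all feature vectors (cf. Equation (12)):
\prod_{t=1}^{T} \prod_{f=1}^{F} \sum_{l=1}^{L}
    \alpha(l)\, W\bigl(z(t,f);\, a(l,f),\, \kappa(l,f)\bigr)
  = \prod_{t=1}^{T} \prod_{f=1}^{F} \alpha^{\mathsf T} w(t,f)

% Log-likelihood (cf. Equation (13)):
\mathcal{L}(\alpha) \triangleq \sum_{t=1}^{T} \sum_{f=1}^{F}
    \ln\bigl(\alpha^{\mathsf T} w(t,f)\bigr)

% Gradient with respect to alpha (cf. Equation (14)):
\nabla_{\alpha}\, \mathcal{L}(\alpha) = \sum_{t=1}^{T} \sum_{f=1}^{F}
    \frac{w(t,f)}{\alpha^{\mathsf T} w(t,f)}

% Reliability-weighted variant, e.g. with c(t,f) = \lVert y(t,f) \rVert:
\mathcal{L}_{c}(\alpha) = \sum_{t=1}^{T} \sum_{f=1}^{F}
    c(t,f)\, \ln\bigl(\alpha^{\mathsf T} w(t,f)\bigr)
```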

The sound source position calculation unit 50 receives the prior probability distribution α(l) (l = 1 to L) from the prior probability distribution calculation unit 40, calculates the set J of peak positions of α(l) (l = 1 to L), and calculates and outputs the set G of sound source positions based on J.

The set J of peak positions can be calculated, for example, as follows. Assume that, for each index l = 1 to L, the set of indices of the sound source position candidates adjacent to the l-th candidate is known. Define index l to be a peak position if α(l) > α(l′) holds for the indices l′ of all candidates adjacent to the l-th candidate. The set J is then obtained by testing, for each l = 1 to L, whether l is a peak position. Based on J, the set G of detected sound source positions, which is either a set of indices l specifying sound source positions or a set of coordinates (Cartesian, polar, spherical, and so on), can be calculated as follows.

For example, the set J of peak positions may be used directly as the set G of detected sound source positions, or G may be the set {l ∈ J | α(l) > S} of peak positions l whose peak value α(l) exceeds a predetermined threshold S. The threshold S may be chosen freely; for example, S = 1/L. Alternatively, G may be the set {r(l) | l ∈ J} of the coordinate vectors r(l) of the candidates corresponding to the peak positions, or the set {r(l) | l ∈ J, α(l) > S} of the coordinate vectors r(l) of the candidates corresponding to the peak positions whose peak value α(l) exceeds S.
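The peak-picking rule and the optional threshold can be sketched as follows. The candidate layout (six candidates on a ring, each adjacent to its two neighbours), the function name `detect_sources`, and the 0-based indexing are assumptions made for the example; the text numbers candidates from 1 and leaves the adjacency structure open.

```python
def detect_sources(alpha, neighbors, threshold=None):
    """Return indices l that are strict local maxima of alpha.

    alpha     : sequence of prior probabilities, one per candidate.
    neighbors : neighbors[l] is the collection of indices adjacent to candidate l.
    threshold : if given, keep only peaks with alpha[l] > threshold.
    """
    peaks = [l for l in range(len(alpha))
             if all(alpha[l] > alpha[k] for k in neighbors[l])]
    if threshold is not None:
        peaks = [l for l in peaks if alpha[l] > threshold]
    return peaks

# Hypothetical prior over 6 candidates arranged on a ring.
alpha = [0.05, 0.40, 0.05, 0.10, 0.30, 0.10]
L = len(alpha)
ring = {l: [(l - 1) % L, (l + 1) % L] for l in range(L)}

J = detect_sources(alpha, ring)                      # all peak positions
G = detect_sources(alpha, ring, threshold=1.0 / L)   # peaks above S = 1/L
```

To output coordinates rather than indices, each index in the result would be mapped to the corresponding candidate coordinate vector r(l).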

[Processing of the First Embodiment]
The processing flow of the signal processing device 1 is described with reference to FIG. 3. FIG. 3 is a flowchart showing the processing flow of the signal processing device according to the first embodiment. As shown in FIG. 3, the time-frequency analysis unit 10 first performs time-frequency analysis on the observation signals and calculates the observation signal vectors (step S11).

Next, the feature vector calculation unit 20 calculates the feature vectors, which carry the directional information of the observation signal vectors y(t, f) (step S12). The prior probability distribution calculation unit 40 then acquires from the parameter storage unit 30 the parameters of the conditional probability distribution model of the feature vector, given that the state representing the sound source position takes the state corresponding to each of the sound source position candidates (step S13).

Next, the prior probability distribution calculation unit 40 initializes the prior probability distribution of the state representing each sound source position (step S14). The prior probability distribution calculation unit 40 then updates the prior probability distribution (step S15).

In this step, the prior probability distribution calculation unit 40 models the marginal probability distribution of the feature vector with, for example, a mixture model in which the conditional probability distributions of the feature vector, expressed by the model parameters acquired from the parameter storage unit, are weighted by the prior probability distribution. Using a gradient method, the prior probability distribution calculation unit 40 then updates the prior probability distribution so as to maximize the objective function, which is the likelihood of this marginal distribution. If the prior probability distribution has not yet been updated the predetermined number of times (step S16, No), the prior probability distribution calculation unit 40 updates it again (step S15).

When the prior probability distribution has been updated the predetermined number of times (step S16, Yes), the sound source position calculation unit 50 calculates the sound source positions based on the prior probabilities calculated by the prior probability distribution calculation unit 40 (step S17). Here, the sound source position calculation unit 50 can, for example, output the sound source positions at which the prior probability peaks as the result.

[Effects of the First Embodiment]
The time-frequency analysis unit 10 applies a time-frequency transform to recorded sound acquired at M different positions and calculates the observation signal vector, which is an M-dimensional vector. The feature vector calculation unit 20 calculates, at each time-frequency point, the feature vector carrying the directional information of the observation signal vector y(t, f) calculated by the time-frequency analysis unit 10. The parameter storage unit 30 stores the model parameters of the conditional probability distribution of the feature vector, given that the state representing the sound source position takes the state corresponding to each of a plurality of sound source position candidates.

The prior probability distribution calculation unit 40 fits a mixture model to the feature vectors calculated by the feature vector calculation unit 20 and calculates the prior probability distribution. The mixture model is the weighted sum of the conditional probability distributions of the feature vector given a known state representing the sound source position; these conditional distributions are based on the model parameters stored in the parameter storage unit 30, and the weights are the prior probability distribution of the state representing the sound source position. The sound source position calculation unit 50 then calculates the sound source positions corresponding to the feature vectors based on the prior probability distribution calculated by the prior probability distribution calculation unit 40.

Thus, according to the first embodiment, a prior probability distribution that can be regarded as a spatial spectrum, that is, a function taking large values at the sound source positions, can be calculated without using the covariance matrix of the observation signal vectors, so sound source localization can be performed accurately even when the observation signal is short. The embodiment is therefore advantageous in dynamic situations, such as when sound source positions change over time or when speakers alternate in a conversation, compared with conventional localization methods such as the Capon method and the MUSIC method, for which accurate localization is difficult with short observation signals. In addition, according to the first embodiment, the position of each sound source can be estimated even when signals from multiple sources are mixed. The embodiment is therefore advantageous when multiple sources are present, such as conversations with overlapping speech, compared with conventional methods such as the delay-and-sum array and the generalized cross-correlation method, for which estimating multiple source positions is difficult. Localization is also possible when the number of sources is unknown. In practical applications the number of sources is often not known in advance, and this embodiment still allows localization in such cases, which is an advantage over conventional methods such as the MUSIC method that require the number of sources beforehand. Furthermore, the prior probability distribution obtained by the method of the first embodiment can be used in various applications such as tracking, diarization, mask estimation, speech enhancement, and speech recognition. Moreover, because the prior does not depend on frequency, it can be estimated from the feature vectors z(t, f) observed at all frequencies (this can also be seen from the fact that Equation (10) updates the vector α using the vectors w(t, f) at all frequencies). Compared with a frequency-dependent prior, more information is available for estimating the prior, so the prior estimate and the localization based on it become more accurate, including when the observation signal is short. The above describes batch processing, in which the observation signals of all frames are processed at once; block-batch processing (or online processing), in which the observation signals are processed and the sound source positions estimated frame by frame (or every few frames), is also possible.

[Second Embodiment]
Next, the configuration of the second embodiment is described. The second embodiment is an example of estimating sound source positions according to the present invention; it modifies the first embodiment by using a time-varying prior probability distribution. That is, in the second embodiment the prior probability distribution is estimated for each time interval (for example, each frame). As a result, sound source positions can be estimated per time interval, and tracking and diarization can be performed on the basis of these per-interval estimates.

An example of the configuration of the signal processing device according to the second embodiment is shown in FIG. 2, as for the signal processing device 1 according to the first embodiment. The signal processing device 1 according to the second embodiment includes the time-frequency analysis unit 10, the feature vector calculation unit 20, the parameter storage unit 30, the prior probability distribution calculation unit 40, and the sound source position calculation unit 50. The time-frequency analysis unit 10, the feature vector calculation unit 20, and the parameter storage unit 30 are the same as in the first embodiment, so the following describes in detail the units that differ: the prior probability distribution calculation unit 40 and the sound source position calculation unit 50. The main difference between the first embodiment and this embodiment is as follows. In the first embodiment, the prior probability distribution calculation unit 40 calculates a prior probability distribution that does not depend on the time interval, and the sound source position calculation unit 50 calculates, from this prior, sound source positions that do not depend on the time interval. In this embodiment, the prior probability distribution calculation unit 40 calculates a prior probability distribution for each time interval, and the sound source position calculation unit 50 calculates, from this prior, the sound source positions for each time interval.

The prior probability distribution calculation unit 40 fits the mixture model of Equation (15) to the feature vectors z(t, f) calculated by the feature vector calculation unit 20 and calculates the prior probability distribution α(l, t) (l = 1 to L, t = 1 to T). The mixture model of Equation (15) is the weighted sum of the conditional probability distributions of the feature vector z(t, f) given a known state representing the sound source position; these conditional distributions are based on the model parameters stored in the parameter storage unit 30, namely the mean direction vectors a(l, f) and the concentration parameters κ(l, f), and the weights are the prior probability distribution α(l, t) of the state representing the sound source position. Here, α(l, t) satisfies the constraint α(1, t) + … + α(L, t) = 1.

Note that, unlike in the first embodiment, the weights in Equation (15) are the time-varying α(l, t) rather than the time-invariant α(l). There are various ways to fit the mixture model of Equation (15) to the feature vectors z(t, f); for example, the likelihood of Equation (15) is maximized by a gradient method.

The processing in the prior probability distribution calculation unit 40 is, for example, as follows.
1. Initialize the prior probability distribution α(l, t) by α(l, t) ← 1/L (l = 1 to L, t = 1 to T).
2. Alternately repeat the updates of the prior probability distribution α(l, t) (l = 1 to L, t = 1 to T) by the following equations (16) and (17) a predetermined number of times (for example, 10 times).

3. Output the prior probability distribution α(l, t) (l = 1 to L, t = 1 to T).

Here, the vector α(t) is the L-dimensional column vector consisting of α(l, t) (l = 1 to L). The derivation of equations (16) and (17) is the same as that of equations (10) and (11) and is therefore omitted.
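The initialize, iterate, and output structure of this procedure can be sketched in Python. The gradient updates of equations (16) and (17) are not reproduced in this text, so the sketch substitutes the EM update for mixture weights, which the description names elsewhere as an alternative fitting method. The function name `em_frame_weights` and the input layout `lik[f][l]` (the value of the l-th conditional density at frequency bin f of one frame, assumed strictly positive) are illustrative assumptions.

```python
def em_frame_weights(lik, iters=10):
    """Estimate mixture weights alpha[l] for one frame by EM.

    lik[f][l] is the (assumed positive) likelihood of the feature vector at
    frequency bin f under candidate l. Indices are 0-based here.
    """
    num_states = len(lik[0])
    alpha = [1.0 / num_states] * num_states      # step 1: uniform initialization
    for _ in range(iters):                       # step 2: fixed number of iterations
        acc = [0.0] * num_states
        for row in lik:                          # one row per frequency bin
            joint = [a * w for a, w in zip(alpha, row)]
            total = sum(joint)
            for l in range(num_states):
                acc[l] += joint[l] / total       # posterior of state l at this bin
        alpha = [v / len(lik) for v in acc]      # new weight: mean posterior over bins
    return alpha                                 # step 3: output
```

Calling the function once per frame t yields time-varying weights α(l, t); each returned list sums to 1 by construction.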

The sound source position calculation unit 50 receives the prior probability distribution α(l, t) (l = 1 to L, t = 1 to T) from the prior probability distribution calculation unit 40, calculates for each frame the set J(t) of peak positions of α(l, t), and calculates and outputs for each frame the set G(t) of detected sound source positions based on J(t).

The set J(t) of peak positions can be calculated, for example, as follows. Let A(l) denote the set (assumed known) of indices of the sound source position candidates adjacent to the l-th (l = 1 to L) candidate. The set J(t) is then the set of indices l such that α(l, t) > α(l′, t) for every index l′ belonging to A(l), that is, J(t) = {l | ∀l′ ∈ A(l), α(l, t) > α(l′, t)}. Based on this set J(t), the set G(t) of detected sound source positions, which is a set of indices l or a set of coordinates (Cartesian, polar, spherical, etc.) specifying the sound source positions, can be calculated as follows.

For example, the set J(t) of peak positions can be used as is as the set G(t) of detected sound source positions. Alternatively, G(t) can be the set {l ∈ J(t) | α(l, t) > S} of peak positions l whose peak value α(l, t) exceeds a predetermined threshold S. The threshold S may be chosen in any way; for example, S = 1/L. G(t) can also be the set {r(l) | l ∈ J(t)} of the vectors r(l) that are the coordinates of the sound source position candidates corresponding to the peak positions l. G(t) may also be the set {r(l) | l ∈ J(t), α(l, t) > S} of the coordinates r(l) of the candidates corresponding to those peak positions l whose peak value α(l, t) exceeds the threshold S.
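The peak picking and thresholding for a single frame can be sketched as follows. The function name `detect_sources` and the list-of-lists layout of the adjacency sets are illustrative assumptions, and indices are 0-based rather than 1-based as in the text.

```python
def detect_sources(alpha, neighbors, threshold=None):
    """Return detected candidate indices for one frame.

    alpha[l]     : prior probability of candidate l in this frame
    neighbors[l] : indices of the candidates adjacent to l (the set A(l))
    threshold    : optional threshold S; if None, all peaks are returned
    """
    # J(t): candidates whose weight strictly exceeds that of every neighbor
    peaks = [l for l in range(len(alpha))
             if all(alpha[l] > alpha[n] for n in neighbors[l])]
    if threshold is None:
        return peaks
    # G(t): peaks whose value also exceeds the threshold S
    return [l for l in peaks if alpha[l] > threshold]


# Six candidates on a line; each is adjacent to its immediate neighbors.
alpha = [0.1, 0.05, 0.5, 0.05, 0.3, 0.0]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
all_peaks = detect_sources(alpha, neighbors)
strong_peaks = detect_sources(alpha, neighbors, threshold=1.0 / len(alpha))
```

In this example the weak local maximum at index 0 is a peak but falls below S = 1/L, so thresholding removes it. Mapping the returned indices through a table of candidate coordinates r(l) gives the coordinate form of G(t).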

[Third Embodiment]
Next, the configuration of the third embodiment will be described. The third embodiment is an example of estimating sound source positions according to the present invention. It modifies the first embodiment in two ways. First, in addition to the states corresponding to the plural (L) sound source position candidates (states 1 to L), a state corresponding to background noise (state 0) is considered as a state representing the sound source position. Second, the conditional probability distribution of the feature vector given that the state representing the sound source position is state 0 is modeled by a uniform distribution on the hypersphere. This has the advantage that observed signals containing background noise are modeled appropriately, so that sound source localization can be performed with high accuracy even under background noise.

An example of the configuration of the signal processing device according to the third embodiment is shown in FIG. 2, as for the signal processing device 1 according to the first embodiment. The signal processing device 1 according to the third embodiment includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50. The time-frequency analysis unit 10 and the feature vector calculation unit 20 are the same as in the first embodiment, so the following describes in detail the units that differ: the parameter storage unit 30, the prior probability distribution calculation unit 40, and the sound source position calculation unit 50. The main difference between the first embodiment and this embodiment is as follows. In the first embodiment, the parameter storage unit 30 stores the model parameters of the conditional probability distributions given that the state representing the sound source position takes the state corresponding to each of the sound source position candidates, the prior probability distribution calculation unit 40 calculates the prior probability distribution over the states corresponding to the sound source position candidates, and the sound source position calculation unit 50 calculates the sound source positions based on that prior probability distribution. In this embodiment, by contrast, the parameter storage unit 30 additionally stores the model parameters of the conditional probability distribution given that the state representing the sound source position takes the state corresponding to background noise, the prior probability distribution calculation unit 40 calculates the prior probability distribution over the states corresponding to the sound source position candidates and to background noise, and the sound source position calculation unit 50 calculates the sound source positions based on that prior probability distribution.

First, the modeling of the observed signal vector y(t, f) in this embodiment is described. The observed signal vector y(t, f) is assumed to contain background noise in addition to N target signals (N may be unknown, and N = 0 is allowed). It is further assumed that y(t, f) contains at most one of the target signals at each time-frequency point, and that the background noise is contained in y(t, f) at all time-frequency points. Under these assumptions, y(t, f) is modeled by either equation (18) or equation (19).

Here, equation (18) represents the case where, at the time-frequency point (t, f), only the n-th target signal (n may vary with the time-frequency point (t, f)) is contained in the observed signal vector y(t, f), and equation (19) represents the case where no target signal is contained in y(t, f) at that point. The vector s(n, t, f) is the n-th target signal, and the vector v(t, f) is the background noise.
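The images of equations (18) and (19) are not reproduced in this text. From the description of the two cases, they can be read as the following additive model (a reconstruction from the surrounding definitions, not the original typesetting):

```latex
\begin{align}
y(t,f) &= s(n,t,f) + v(t,f) && \text{(one target signal plus background noise)} \tag{18}\\
y(t,f) &= v(t,f)            && \text{(background noise only)} \tag{19}
\end{align}
```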

Unlike the first embodiment, this embodiment also models the case, as in equation (19), where the observed signal vector y(t, f) contains no target signal and only background noise, so that observed signals under background noise can be modeled more accurately.

As described above, this embodiment also considers the case, as in equation (19), where the observed signal vector y(t, f) contains no target signal. To model this case appropriately, the states representing the sound source position that the observed signal vector can take at each time-frequency point include, in addition to the states corresponding to the sound source position candidates, a state corresponding to background noise. The former correspond to equation (18) and the latter to equation (19).

Hereinafter, the state representing the sound source position at the time-frequency point (t, f) is denoted by g(t, f). As in the first embodiment, the conditional probability distribution of the feature vector z(t, f) given g(t, f) = l (l = 1 to L) is modeled by the complex Watson distribution of equation (3) (it can also be modeled by other probability distributions such as the complex Bingham distribution, the complex angular central Gaussian distribution, the complex Gaussian distribution, or a mixture of complex Watson, complex Bingham, complex angular central Gaussian, or complex Gaussian distributions).

On the other hand, the conditional probability distribution of the feature vector z(t, f) given g(t, f) = 0 is modeled, as shown in equation (20), by the uniform distribution on the unit sphere in the M-dimensional complex vector space.

Equation (20) is based on the assumption that background noise arrives uniformly from all directions. By introducing equation (20), this embodiment can appropriately model the state corresponding to background noise, as in equation (19), and can estimate sound source positions accurately even under background noise.
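The image of equation (20) is not reproduced in this text. Assuming the density is taken with respect to the standard surface measure, a uniform density on the unit sphere of the M-dimensional complex space is the reciprocal of the sphere's surface area, which is 2π^M/(M−1)!. A reconstruction under that assumption is:

```latex
p\bigl(z(t,f) \,\big|\, g(t,f) = 0\bigr) \;=\; \frac{(M-1)!}{2\pi^{M}},
\qquad \lVert z(t,f) \rVert = 1 .
```

For example, with M = 3 microphones the constant is 2/(2π³) = 1/π³, independent of z and of frequency.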

Next, the modeling of the marginal probability distribution of the feature vector z(t, f) in this embodiment is described. In this embodiment, the marginal probability distribution of z(t, f) is modeled by the mixture model of equation (21), which is the weighted sum of the conditional probability distributions p(z(t, f) | g(t, f) = l) with the prior probabilities P(g(t, f) = l) of the state representing the sound source position as weights.

In this embodiment, the prior probability distribution P(g(t, f) = l) is assumed to be independent of the frame and the frequency bin and is denoted by α(l) (l = 0 to L), where α(l) satisfies the constraint α(0) + … + α(L) = 1. Noting that the complex Watson distribution W(z; a, κ) coincides with the uniform distribution of equation (20) when κ = 0 and a is an arbitrary unit vector, equation (21) can also be rewritten as equation (22), with κ(0, f) = 0 and the vector a(0, f) an arbitrary unit vector. Because the prior probability distribution does not depend on frequency, it can be estimated using the feature vectors z(t, f) observed at all frequencies. Compared with a frequency-dependent prior probability distribution, more information is therefore available for the estimation, so that the prior probability distribution, and the sound source localization based on it, can be estimated more accurately, including when the observed signal is short. Furthermore, because the feature vectors z(t, f) observed at all frequencies are used, localization can be performed more accurately even when, owing to noise or reverberation, the sound source position cannot be determined reliably from the feature vectors z(t, f) observed at a single frequency alone. Robustness to noise and reverberation is thus improved compared with a frequency-dependent prior probability distribution.
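The images of equations (21) and (22) are not reproduced in this text. From the description, the general mixture and its complex Watson form can be read as follows (a reconstruction from the surrounding definitions, not the original typesetting):

```latex
\begin{align}
p\bigl(z(t,f)\bigr) &= \sum_{l=0}^{L} P\bigl(g(t,f)=l\bigr)\,
    p\bigl(z(t,f) \,\big|\, g(t,f)=l\bigr) \tag{21}\\
p\bigl(z(t,f)\bigr) &= \sum_{l=0}^{L} \alpha(l)\,
    W\bigl(z(t,f);\, a(l,f),\, \kappa(l,f)\bigr),
    \qquad \kappa(0,f) = 0 \tag{22}
\end{align}
```

The l = 0 term of (22) is the background noise component; setting its concentration to zero makes it constant in z, which is how the uniform distribution is absorbed into the same parametric family.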

The parameter storage unit 30 stores the model parameters of the conditional probability distributions given that the state representing the sound source position takes the state corresponding to each of the sound source position candidates, and the model parameters of the conditional probability distribution given that it takes the state corresponding to background noise. The former can be calculated, for example, by the method described in the first embodiment. The latter can be set, for example, to κ(0, f) ← 0 with the vector a(0, f) an arbitrary unit vector.
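The effect of storing κ(0, f) = 0 for the noise state can be checked numerically. The sketch below assumes the standard form of the complex Watson density, W(z; a, κ) = (M−1)! / (2π^M · 1F1(1; M; κ)) · exp(κ |aᴴz|²), which equation (3) is presumed to match; the function names are illustrative.

```python
import math


def kummer_1f1(M, kappa, terms=200):
    """Power series of 1F1(1; M; kappa) = sum_k kappa^k / (M (M+1) ... (M+k-1)).

    200 terms are ample for moderate kappa (roughly kappa below 50).
    """
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= kappa / (M + k)
    return total


def watson_pdf(z, a, kappa):
    """Complex Watson density at unit vector z (assumed standard form)."""
    M = len(z)
    inner = sum(ai.conjugate() * zi for ai, zi in zip(a, z))   # a^H z
    norm = math.factorial(M - 1) / (2.0 * math.pi ** M * kummer_1f1(M, kappa))
    return norm * math.exp(kappa * abs(inner) ** 2)


z = [1.0 + 0j, 0j, 0j]
a = [0j, 1.0 + 0j, 0j]
noise_density = watson_pdf(z, a, 0.0)      # state 0: kappa = 0, a arbitrary
uniform_density = 1.0 / math.pi ** 3       # (M-1)! / (2 pi^M) for M = 3
```

With κ = 0 the exponential factor is 1 and the series reduces to 1, so the density no longer depends on a; this is why a(0, f) can be any unit vector.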

The prior probability distribution calculation unit 40 fits the mixture model of equation (21) to the feature vectors z(t, f) calculated by the feature vector calculation unit 20 and thereby calculates the prior probability distribution α(l) (l = 0 to L). The mixture model of equation (21) is a weighted sum of the conditional probability distributions of the feature vector given the state representing the sound source position. The weights are the prior probabilities α(l) (l = 0 to L) of that state, and the conditional distributions are based on the model parameters stored in the parameter storage unit 30, namely the mean direction vectors a(l, f) (l = 0 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 0 to L, f = 1 to F).

There are various ways to fit the mixture model of equation (21) to the feature vectors z(t, f). For example, the likelihood for equation (21) is taken as the objective function (the posterior probability or the like can also be used as the objective function) and is maximized with respect to the prior probability distribution α(l) (l = 0 to L) by a gradient method (it can also be maximized by the EM algorithm or the like).

The processing in the prior probability distribution calculation unit 40 is, for example, as follows.
1. Initialize the prior probability distribution α(l) (l = 0 to L) by α(l) ← 1/(L + 1).
2. Alternately repeat the updates of the prior probability distribution α(l) (l = 0 to L) by the following equations (23) and (24) a predetermined number of times (for example, 10 times).

3. Output the prior probability distribution α(l) (l = 0 to L).

Here, the vector ~α (the symbol "~" preceding α denotes a tilde placed over α) is the (L + 1)-dimensional column vector consisting of α(l) (l = 0 to L), and the vector ~w(t, f) is the (L + 1)-dimensional column vector consisting of W(z(t, f); a(l, f), κ(l, f)) (l = 0 to L). The derivation of equations (23) and (24) is the same as in the first embodiment and is therefore omitted.

The sound source position calculation unit 50 calculates and outputs the set G of detected sound source positions based on the prior probability distribution α(l) (l = 0 to L) received from the prior probability distribution calculation unit 40. Specifically, it applies the processing described in the first embodiment to α(l) (l = 1 to L), that is, to α(l) with the domain of l restricted to l = 1 to L, which correspond to the target sound sources.

[Fourth Embodiment]
Next, the configuration of the fourth embodiment will be described. The fourth embodiment is an example of estimating sound source positions according to the present invention. It modifies the first embodiment so that the model parameters of the conditional probability distributions are not calculated from the assumption that the target signals propagate as spherical waves but are learned in advance using measured data as learning data. The assumption that the target signals propagate as spherical waves presupposes an ideal environment without reflection, reverberation, diffraction, and the like, such as an anechoic chamber. In the first embodiment, therefore, in an environment with reflection, reverberation, diffraction, or the like, there is a mismatch between the assumed environment and the environment in which sound source localization is performed, and localization performance degrades. In this embodiment, by contrast, learning the model parameters of the conditional probability distributions in advance from data measured in the environment in which localization is performed removes this mismatch, so that sound source positions can be estimated accurately even with reflection, reverberation, diffraction, or the like. Conversely, the first embodiment has the advantage over this embodiment that the effort of acquiring the measured data is saved.

An example of the configuration of the signal processing device according to the fourth embodiment is shown in FIG. 2, as for the signal processing device 1 according to the first embodiment. The signal processing device 1 according to the fourth embodiment includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50. The time-frequency analysis unit 10, the feature vector calculation unit 20, the prior probability distribution calculation unit 40, and the sound source position calculation unit 50 are the same as in the first embodiment, so the following describes in detail the unit that differs, the parameter storage unit 30. The main difference between the first embodiment and this embodiment is as follows. The parameter storage unit 30 in the first embodiment stores model parameters of the conditional probability distributions that are calculated based on the assumption that the target signals propagate as spherical waves. The parameter storage unit 30 in this embodiment, by contrast, stores model parameters of the conditional probability distributions that are learned using learning data acquired under reverberation.

The parameter storage unit 30 stores model parameters learned using learning data acquired under reverberation: the mean direction vectors a(l, f) (l = 1 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 1 to L, f = 1 to F) of the complex Watson distributions that are the conditional probability distributions of the feature vector z(t, f) given that the state representing the sound source position takes the state corresponding to each of the plural (L) sound source position candidates. As the learning data acquired under reverberation, for example, the observed signals x(l, m, τ) obtained when sound is emitted from only one of the sound source position candidates at a time, with no background noise present, can be used.

The pre-learning can be performed, for example, by the following procedure.
1. Generate the observed signals x(l, m, τ) for the case where sound is emitted from only one sound source position candidate. For example, x(l, m, τ) can be generated by recording, for each of the L sound source position candidates, in a situation where sound is emitted only from that candidate. Alternatively, x(l, m, τ) can be generated by measuring, for each of the L sound source position candidates, the impulse responses from that candidate to each microphone position and convolving these impulse responses with a target signal.
2. Calculate the M-dimensional vector x(l, t, f) consisting of the time-frequency transforms x(l, m, t, f) (m = 1 to M) of x(l, m, τ).
3. Calculate the feature vector ζ(l, t, f) by the following equation (25).

4. Calculate the feature covariance matrix R(l, f) by the following equation (26).

5. Perform the eigenvalue decomposition of the feature covariance matrix R(l, f) to obtain the largest eigenvalue μ(l, f) and the unit-norm eigenvector e(l, f) corresponding to the largest eigenvalue.
6. Set the mean direction vector a(l, f) by a(l, f) ← e(l, f).
7. Calculate the concentration parameter κ(l, f) by the following equation (27).
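Steps 3 to 6 for a single candidate l and frequency bin f can be sketched in Python. Equations (25) and (26) are not reproduced in this text, so the sketch assumes that the feature vector is the observation scaled to unit norm and that R is the time average of ζζᴴ. Power iteration stands in for a full eigenvalue decomposition, and all function names are illustrative.

```python
import math


def normalize(x):
    """Scale a complex vector to unit Euclidean norm (assumed form of eq. (25))."""
    n = math.sqrt(sum(abs(v) ** 2 for v in x))
    return [v / n for v in x]


def feature_covariance(features):
    """R = (1/T) sum_t zeta_t zeta_t^H (assumed form of eq. (26))."""
    M, T = len(features[0]), len(features)
    return [[sum(z[i] * z[j].conjugate() for z in features) / T
             for j in range(M)] for i in range(M)]


def principal_eigenpair(R, iters=200):
    """Largest eigenvalue and unit-norm eigenvector of Hermitian PSD R by power iteration."""
    M = len(R)
    e = normalize([complex(1.0, 0.1 * i) for i in range(M)])   # generic start vector
    for _ in range(iters):
        e = normalize([sum(R[i][j] * e[j] for j in range(M)) for i in range(M)])
    Re = [sum(R[i][j] * e[j] for j in range(M)) for i in range(M)]
    mu = sum(e[i].conjugate() * Re[i] for i in range(M)).real  # Rayleigh quotient
    return mu, e


# Toy data: every frame points in the same direction d, so mu = 1 and a = d up to phase.
d = normalize([1.0 + 0j, 1j, -1.0 + 0j])
R = feature_covariance([d] * 8)
mu, a = principal_eigenpair(R)
```

Because R has unit trace under these assumptions, the largest eigenvalue μ lies between 1/M and 1; values near 1 indicate tightly concentrated directions. The eigenvector is defined only up to a phase factor, which does not matter because the Watson density depends on a only through |aᴴz|.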

The derivation of the above procedure is as follows. Under the assumption that the feature vector ζ(l, t, f) is generated according to equation (28), the procedure is obtained by maximizing equation (29), the log-likelihood for equation (28), with respect to the mean direction vector a(l, f) and the concentration parameter κ(l, f).

Ignoring the constant terms in equation (29) that depend on neither the mean direction vectors a(l, f) (l = 1 to L, f = 1 to F) nor the concentration parameters κ(l, f) (l = 1 to L, f = 1 to F), equation (29) can be rewritten as equation (30).

Here, the matrix R(l, f) is defined by equation (26). The term in equation (30) that depends on the vector a(l, f) is equation (31).

By the Courant-Fischer theorem, the vector a(l, f) that maximizes equation (31) under the constraint of equation (4) is the unit-norm eigenvector e(l, f) corresponding to the largest eigenvalue μ(l, f) of the feature covariance matrix R(l, f). The term in equation (30) that depends on the concentration parameter κ(l, f) is equation (32).

Setting the partial derivative with respect to the concentration parameter κ(l, f) to zero yields equation (33).
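The images of equations (28) to (33) are not reproduced in this text. The chain of reasoning can be sketched as follows, assuming the standard complex Watson density and a feature covariance matrix normalized by the number of frames T; the exact scaling and numbering of the original equations may differ.

```latex
\begin{align}
W(\zeta;\, a, \kappa) &= \frac{(M-1)!}{2\pi^{M}\, {}_1F_1(1; M; \kappa)}
   \exp\!\bigl(\kappa\, \lvert a^{\mathsf H}\zeta \rvert^{2}\bigr) \\
\frac{1}{T}\sum_{t=1}^{T} \ln W\bigl(\zeta(l,t,f);\, a(l,f), \kappa(l,f)\bigr)
  &= \kappa(l,f)\, a(l,f)^{\mathsf H} R(l,f)\, a(l,f)
     - \ln {}_1F_1\bigl(1; M; \kappa(l,f)\bigr) + \text{const.} \\
\max_{\lVert a \rVert = 1} a^{\mathsf H} R(l,f)\, a &= \mu(l,f)
   \quad \text{attained at } a = e(l,f) \\
\frac{\partial}{\partial \kappa}\Bigl[\kappa\, \mu(l,f) - \ln {}_1F_1(1; M; \kappa)\Bigr] = 0
  &\;\Longrightarrow\;
  \frac{{}_1F_1'(1; M; \kappa)}{{}_1F_1(1; M; \kappa)} = \mu(l,f)
\end{align}
```

The last condition has no closed-form solution in κ, which is why an approximate solution is used.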

Solving equation (33) approximately for the concentration parameter κ(l, f), based on equation (3.8) in Reference 1 (S. Sra and D. Karp, "The multivariate Watson distribution: Maximum-likelihood estimation and other aspects," Journal of Multivariate Analysis, vol. 114, pp. 256-269, February 2013), yields equation (27). Because this embodiment learns the concentration parameters from learning data, it can, like the first embodiment, properly take into account the aforementioned property that the direction of the observed signal vector y(t, f) has smaller variance (higher concentration) at lower frequencies, so that the prior probability distribution can be estimated, and sound sources localized on that basis, accurately.
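The closed-form approximation of equation (27) is not reproduced in this text. As an alternative illustration, the stationarity condition for κ can be solved numerically. The sketch below assumes that the condition is d/dκ ln 1F1(1; M; κ) = μ, which follows from the standard complex Watson density with a unit-trace covariance matrix, and solves it by bisection; this is not the approximation of Reference 1, and the function names are illustrative.

```python
def dlog_kummer(M, kappa, max_terms=100000):
    """d/dkappa of ln 1F1(1; M; kappa), using the series sum_k kappa^k / (M)_k."""
    total, dtotal, term = 1.0, 0.0, 1.0
    for k in range(1, max_terms):
        dterm = k * term / (M + k - 1)       # derivative of the k-th series term
        term = term * kappa / (M + k - 1)    # k-th series term kappa^k / (M)_k
        total += term
        dtotal += dterm
        if term <= 1e-17 * total and dterm <= 1e-17 * dtotal:
            break
    return dtotal / total


def solve_kappa(mu, M, kappa_max=500.0, iters=100):
    """Solve dlog_kummer(M, kappa) = mu for kappa >= 0 by bisection.

    The left-hand side increases from 1/M at kappa = 0 towards 1, so a
    solution exists for 1/M < mu < 1; kappa_max caps the search range.
    """
    if mu <= 1.0 / M:
        return 0.0
    lo, hi = 0.0, 1.0
    while hi < kappa_max and dlog_kummer(M, hi) < mu:
        hi *= 2.0
    hi = min(hi, kappa_max)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dlog_kummer(M, mid) < mu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A largest eigenvalue μ equal to 1/M (directions spread uniformly) maps to κ = 0, and μ approaching 1 maps to a large κ, which matches the interpretation of κ as a degree of concentration.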

[Fifth Embodiment]
Next, the configuration of the fifth embodiment will be described. The fifth embodiment is an example of estimating sound source positions according to the present invention. It modifies the third embodiment so that the conditional probability distribution for background noise is not a uniform distribution but a conditional probability distribution learned in advance using measured data.

The uniform distribution assumed in the third embodiment presupposes an ideal environment in which noise arrives uniformly from all directions. In the third embodiment, therefore, in an environment where the directions of arrival of the noise are biased, there is a mismatch between the assumed environment and the environment in which sound source localization is performed, and localization performance may degrade. In this embodiment, by contrast, learning the model parameters of the conditional probability distribution in advance from data measured in the environment in which localization is performed removes this mismatch, so that sound source positions can be estimated accurately even when the directions of arrival of the noise are biased.

An example of the configuration of the signal processing device according to the fifth embodiment is shown in FIG. 2, as for the signal processing device 1 according to the third embodiment. The signal processing device 1 according to the fifth embodiment includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50. The time-frequency analysis unit 10, the feature vector calculation unit 20, the prior probability distribution calculation unit 40, and the sound source position calculation unit 50 are the same as in the third embodiment, so the following describes in detail the unit that differs, the parameter storage unit 30. The main difference between the third embodiment and this embodiment is as follows. The parameter storage unit 30 in the third embodiment stores model parameters corresponding to a uniform distribution as the model parameters of the conditional probability distribution given that the state representing the sound source position takes the state corresponding to background noise. The parameter storage unit 30 in this embodiment, by contrast, stores model parameters learned using learning data as the model parameters of that conditional probability distribution.

In the present embodiment, the conditional probability distribution of the feature vector z(t, f) under the condition that the state representing the sound source position of the observation signal vector y(t, f) at each time-frequency point is g(t, f) = l (l = 0 to L) is modeled by the complex Watson distribution of Equation (3). Other probability distributions can also be used, such as a complex Bingham distribution, a complex angular central Gaussian distribution, a complex Gaussian distribution, or a mixture of complex Watson, complex Bingham, complex angular central Gaussian, or complex Gaussian distributions.

The parameter storage unit 30 stores the model parameters of the conditional probability distribution under the condition that the state representing the sound source position takes the state corresponding to each of the plurality of sound source position candidates (states 1 to L), and the model parameters of the conditional probability distribution under the condition that the state takes the state corresponding to background noise (state 0). The former can be calculated, for example, by the method described in the first or fourth embodiment.

On the other hand, the model parameters of the conditional probability distribution under the condition that the state representing the sound source position takes the state corresponding to background noise are pre-learned, for example, as follows.
1. Create the M-dimensional column vector x(0, t, f) whose elements are the time-frequency transforms x(0, m, t, f) (m = 1 to M) of the measured background noise x(0, m, τ).
2. Compute the feature vector ζ(0, t, f) by the following equation (34).

3. Compute the feature covariance matrix R(0, f) by the following equation (35).

4. Perform an eigenvalue decomposition of the feature covariance matrix R(0, f) to obtain the largest eigenvalue μ(0, f) and the corresponding unit-norm eigenvector e(0, f).
5. Set the mean direction vector a(0, f) as a(0, f) ← e(0, f).
6. Compute the concentration parameter κ(0, f) by the following equation (36).

なお、上記の処理の導出は、第４の実施形態の場合と同様であるから省略する。 Note that the derivation of the above processing is the same as in the case of the fourth embodiment, and will be omitted.
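As a non-limiting illustration, steps 1 to 5 of the above pre-learning may be sketched in Python with NumPy as follows. Equations (34) and (35) are not reproduced in this text, so the sketch assumes that the feature vector is the unit-norm direction of the noise vector and that the feature covariance matrix is the average outer product over frames. The function name `learn_noise_direction` and the array layout are hypothetical. Step 6 is not shown, because the form of Equation (36) cannot be recovered from the text.

```python
import numpy as np

def learn_noise_direction(x0):
    """x0: complex array of shape (T, F, M), time-frequency transform of measured noise.

    Returns (a, mu): a has shape (F, M) and holds a unit-norm mean direction per
    frequency; mu has shape (F,) and holds the largest eigenvalue of R(0, f).
    """
    # Step 2 (ASSUMED form of equation (34)): normalise each vector to unit norm.
    norms = np.linalg.norm(x0, axis=-1, keepdims=True)
    zeta = x0 / np.maximum(norms, 1e-12)
    # Step 3 (ASSUMED form of equation (35)): average outer product over the T frames.
    R = np.einsum("tfm,tfn->fmn", zeta, zeta.conj()) / x0.shape[0]
    # Step 4: eigendecomposition of each Hermitian R(0, f); eigh sorts eigenvalues ascending.
    eigvals, eigvecs = np.linalg.eigh(R)
    mu = eigvals[:, -1]
    # Step 5: the mean direction is the unit-norm eigenvector of the largest eigenvalue.
    a = eigvecs[:, :, -1]
    return a, mu
```

Under these assumptions the trace of each R(0, f) is 1, so μ(0, f) lies between 1/M and 1; a value near 1/M corresponds to noise without a dominant direction.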

[Sixth Embodiment]
Next, the configuration of the sixth embodiment will be described. The sixth embodiment is an example of estimating the sound source position based on the present invention. It modifies the fourth embodiment by using a complex angular central Gaussian distribution, instead of a complex Watson distribution, as the conditional probability distribution of the feature vector z(t, f) under the condition that the state representing the sound source position takes the state corresponding to each of the plurality of sound source position candidates. The complex Watson distribution can represent the conditional probability distribution of the feature vector of Equation (1), which is the direction of the observation signal vector, only when that distribution is rotationally symmetric. The complex angular central Gaussian distribution can represent it both when it is rotationally symmetric and when it is elliptical. Because the distribution of the feature vector of Equation (1) is not necessarily rotationally symmetric, the present embodiment can model it more accurately than the fourth embodiment and, as a result, estimate the sound source position more accurately.

An example of the configuration of the signal processing device according to the sixth embodiment is shown in FIG. 2, as for the signal processing device 1 according to the fourth embodiment. The signal processing device 1 according to the sixth embodiment includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50. Since the time-frequency analysis unit 10, the feature vector calculation unit 20, and the sound source position calculation unit 50 are the same as in the fourth embodiment, the following describes in detail the parameter storage unit 30 and the prior probability distribution calculation unit 40, which are the points of difference. The main differences between the fourth embodiment and the present embodiment are as follows. In the fourth embodiment, the parameter storage unit 30 stores the model parameters of the complex Watson distribution that models the conditional probability distribution, and the prior probability distribution calculation unit 40 calculates the prior probability distribution based on those model parameters. In contrast, in the present embodiment, the parameter storage unit 30 stores the model parameters of the complex angular central Gaussian distribution that models the conditional probability distribution, and the prior probability distribution calculation unit 40 calculates the prior probability distribution based on the model parameters of that complex angular central Gaussian distribution.

In the present embodiment, the L conditional probability distributions for the L sound source position candidates are modeled by complex angular central Gaussian distributions. That is, the conditional probability distribution p(z(t, f) | g(t, f) = l) is modeled by Equation (37).

Here, the matrix Σ(l, f) is a positive definite Hermitian matrix called the parameter matrix. It is the model parameter that determines the position, spread, orientation, and shape of the distribution of the feature vector z(t, f) for the l-th sound source position candidate. A(z; Σ) denotes the complex angular central Gaussian distribution of a vector z with parameter matrix Σ and is given by Equation (38).

The parameter storage unit 30 stores the parameter matrices Σ(l, f) (l = 1 to L, f = 1 to F), which are the model parameters of the complex angular central Gaussian distribution serving as the conditional probability distribution of the feature vector z(t, f) under the condition that the state representing the sound source position takes the state corresponding to each of the plurality of sound source position candidates. The parameter matrix Σ(l, f) is pre-learned for each of the L sound source position candidates using the observation signal x(l, m, τ) recorded when sound is emitted only from that candidate. Because the present embodiment learns from training data the parameter matrix Σ(l, f) that determines the position, spread, orientation, and shape of the conditional probability distribution of the feature vector z(t, f), it can, as in the first embodiment, properly account for the aforementioned property that the direction of the observation signal vector y(t, f) has a smaller variance (corresponding to the spread) at lower frequencies. The prior probability distribution, and the sound source localization based on it, can therefore be estimated accurately.

This pre-learning can be performed, for example, by the following procedure.
1. Compute the feature vectors ζ(l, t, f) (l = 1 to L, t = 1 to T, f = 1 to F) in the same manner as in the fourth embodiment.
2. Initialize the parameter matrices Σ(l, f) (l = 1 to L, f = 1 to F) with the M × M identity matrix.
3. Repeat the update of the parameter matrices Σ(l, f) (l = 1 to L, f = 1 to F) by the following equation (39) a predetermined number of times (for example, 10 times).

4. Store the parameter matrices Σ(l, f) (l = 1 to L, f = 1 to F) in the parameter storage unit 30.
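As a non-limiting illustration, steps 2 and 3 of the above procedure may be sketched in Python with NumPy for a single pair (l, f). Equation (39) is not reproduced in this text. The sketch assumes it is the usual maximum-likelihood fixed-point update of a complex angular central Gaussian distribution, in which each outer product ζζ^H is weighted by the reciprocal of ζ^H Σ^{-1} ζ and the sum is scaled by M/T. The function name `fit_cacg` and its arguments are hypothetical.

```python
import numpy as np

def fit_cacg(zeta, n_iter=10):
    """zeta: complex array of shape (T, M), unit-norm feature vectors for one (l, f).

    Returns the M x M parameter matrix after n_iter fixed-point updates.
    Assumes T >= M and that the vectors span the full M-dimensional space.
    """
    T, M = zeta.shape
    sigma = np.eye(M, dtype=complex)            # step 2: identity initialisation
    for _ in range(n_iter):                     # step 3: repeat a fixed number of times
        # q[t] = zeta_t^H Sigma^{-1} zeta_t, real and positive for positive definite Sigma
        solved = np.linalg.solve(sigma, zeta.T)                 # shape (M, T)
        q = np.real(np.sum(zeta.conj().T * solved, axis=0))     # shape (T,)
        # ASSUMED form of equation (39): (M/T) * sum_t zeta_t zeta_t^H / q[t]
        sigma = (M / T) * (zeta.T / q) @ zeta.conj()
    return sigma
```

Running the function once per sound source position candidate l and frequency f would give the matrices stored in step 4.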

The derivation of Equation (39) is as follows. Under the assumption that the feature vectors ζ(l, t, f) were generated according to the conditional probability distribution of Equation (37), Equation (39) is derived by maximizing Equation (40), the log-likelihood for Equation (37), with respect to the parameter matrix Σ(l, f).

式（４０）におけるパラメータ行列Σ（ｌ，ｆ）によらない定数項を無視すると、式（４０）は、式（４１）のように書き換えられる。 If a constant term that does not depend on the parameter matrix Σ (l, f) in Expression (40) is ignored, Expression (40) can be rewritten as Expression (41).

式（４１）のパラメータ行列Σ（ｌ，ｆ）に関する偏微分を０と置いて整理すると、式（３９）を得る。 When the partial differentiation with respect to the parameter matrix Σ (l, f) in Expression (41) is set to 0 and rearranged, Expression (39) is obtained.
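For reference, the derivation can be written out under the standard form of the complex angular central Gaussian density. The equations themselves are not reproduced in this text, so the expressions below, including the normalization constant and the correspondence to Equations (38) to (41), are assumptions rather than the equations of the specification. Here ζ_t abbreviates ζ(l, t, f) and Σ abbreviates Σ(l, f).

```latex
\begin{align*}
\mathcal{A}(z;\Sigma) &= \frac{(M-1)!}{2\pi^{M}\det\Sigma}\,
  \bigl(z^{\mathsf H}\Sigma^{-1}z\bigr)^{-M}
  && \text{(assumed form of (38))}\\
\sum_{t=1}^{T}\ln\mathcal{A}(\zeta_t;\Sigma)
  &= -T\ln\det\Sigma
     - M\sum_{t=1}^{T}\ln\bigl(\zeta_t^{\mathsf H}\Sigma^{-1}\zeta_t\bigr)
     + \text{const.}
  && \text{(cf. (40), (41))}\\
0 &= -T\,\Sigma^{-1}
     + M\sum_{t=1}^{T}
       \frac{\Sigma^{-1}\zeta_t\zeta_t^{\mathsf H}\Sigma^{-1}}
            {\zeta_t^{\mathsf H}\Sigma^{-1}\zeta_t}
  && \text{(derivative with respect to } \Sigma \text{ set to zero)}\\
\Sigma &= \frac{M}{T}\sum_{t=1}^{T}
       \frac{\zeta_t\zeta_t^{\mathsf H}}
            {\zeta_t^{\mathsf H}\Sigma^{-1}\zeta_t}
  && \text{(cf. (39))}
\end{align*}
```

Because Σ appears on both sides of the last line, it is applied as a fixed-point iteration, which is consistent with the repeated update in step 3 of the pre-learning procedure.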

The prior probability distribution calculation unit 40 fits a mixture model to the feature vectors z(t, f) calculated by the feature vector calculation unit 20 and thereby calculates the prior probability distribution. The mixture model is the weighted sum, with the prior probability distribution of the state representing the sound source position as the weights, of the complex angular central Gaussian distributions that are the conditional probability distributions of z(t, f) given that state, based on the parameter matrices Σ(l, f) (l = 1 to L, f = 1 to F) stored in the parameter storage unit 30. In the present embodiment, a time-invariant prior probability distribution α(l) (l = 1 to L) is considered as the prior probability distribution.

The prior probability distribution calculation unit 40 may calculate the prior probability distribution, for example, as follows. Let w(t, f) be the L-dimensional column vector whose elements are the conditional probabilities given by the complex angular central Gaussian distributions A(z(t, f); Σ(l, f)) (l = 1 to L), and apply the processing of the prior probability distribution calculation unit 40 of the first embodiment to w(t, f). Note that the definition of w(t, f) differs from that in the first embodiment. The derivation of this processing is the same as in the first embodiment and is omitted.

[Seventh Embodiment]
Next, the configuration of the seventh embodiment will be described. The seventh embodiment is an example of estimating the sound source position based on the present invention. It modifies the fourth embodiment in three ways. First, the observation signal vector y(t, f) itself, rather than the direction vector of Equation (1), is used as the feature vector z(t, f), which is a vector containing information on the direction of y(t, f). Second, a complex time-varying Gaussian distribution, rather than a complex Watson distribution, is used as the conditional probability distribution of z(t, f) under the condition that the state representing the sound source position takes the state corresponding to each of the plurality of sound source position candidates. Third, the spatial covariance matrix, which is a model parameter of the complex time-varying Gaussian distribution, is pre-learned and stored.

複素ワトソン分布では観測信号ベクトルの方向の分布が回転対称である場合しか表せないのに対し、複素時変ガウス分布では観測信号ベクトルの方向の分布が回転対称である場合だけでなく楕円状の分布である場合も表せる。観測信号ベクトルの方向の分布は必ずしも回転対称とは限らないため、本実施形態により、音源位置を特徴づける観測信号ベクトルの方向の分布を第４の実施形態よりも正確にモデル化することができ、このモデル化に基づき音源位置をより正確に推定できる。 The complex Watson distribution can be represented only when the observed signal vector direction distribution is rotationally symmetric, whereas the complex time-varying Gaussian distribution is not only when the observed signal vector direction distribution is rotationally symmetric, but also an elliptical distribution. It can also be expressed. Since the observation signal vector direction distribution is not necessarily rotationally symmetric, this embodiment can model the observation signal vector direction distribution characterizing the sound source position more accurately than the fourth embodiment. Based on this modeling, the sound source position can be estimated more accurately.

An example of the configuration of the signal processing device according to the seventh embodiment is shown in FIG. 2, as for the signal processing device 1 according to the fourth embodiment. The signal processing device 1 according to the seventh embodiment includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50. Since the time-frequency analysis unit 10 and the sound source position calculation unit 50 are the same as in the fourth embodiment, the following describes in detail the feature vector calculation unit 20, the parameter storage unit 30, and the prior probability distribution calculation unit 40, which are the points of difference. The main differences between the fourth embodiment and the present embodiment are as follows. In the fourth embodiment, the feature vector calculation unit 20 calculates the feature vector of Equation (1); the parameter storage unit 30 stores the model parameters of the complex Watson distribution that models the conditional probability distribution of that feature vector; and the prior probability distribution calculation unit 40 calculates the prior probability distribution by fitting to the feature vector a mixture model that is the weighted sum of the complex Watson distributions modeling the conditional probability distributions, with the prior probability distribution of the state representing the sound source position as the weights. In contrast, in the present embodiment, the feature vector calculation unit 20 outputs the observation signal vector from the time-frequency analysis unit 10 as the feature vector; the parameter storage unit 30 stores the spatial covariance matrix, which is a model parameter of the complex time-varying Gaussian distribution that models the conditional probability distribution of the observation signal vector serving as the feature vector; and the prior probability distribution calculation unit 40 calculates the prior probability distribution by fitting to that feature vector a mixture model that is the weighted sum of the complex time-varying Gaussian distributions modeling the conditional probability distributions, with the prior probability distribution of the state representing the sound source position as the weights.

特徴ベクトル計算部２０は、時間周波数分析部１０から観測信号ベクトルｙ（ｔ，ｆ）を受け取って、観測信号ベクトルｙ（ｔ，ｆ）を特徴ベクトルｚ（ｔ，ｆ）として出力する。 The feature vector calculation unit 20 receives the observation signal vector y (t, f) from the time-frequency analysis unit 10 and outputs the observation signal vector y (t, f) as the feature vector z (t, f).

In the present embodiment, complex time-varying Gaussian distributions are used as the L conditional probability distributions for the L sound source position candidates. That is, the conditional probability distribution p(z(t, f) | g(t, f) = l) is modeled by Equation (42).

In Equation (42), φ(l, t, f) is a positive parameter that controls the distribution of the magnitude (norm) of the feature vector z(t, f). The matrix B(l, f) in Equation (42), on the other hand, is a parameter that controls the distribution of the direction of z(t, f); specifically, it controls the position, spread, orientation, and shape of that distribution. The matrix B(l, f) is a positive definite Hermitian matrix and is called the spatial covariance matrix. N(z; 0, Φ) denotes the complex Gaussian distribution of a vector z with mean vector 0 and covariance matrix Φ and is given by Equation (43).

式（４２）は時変の共分散行列φ（ｌ，ｔ，ｆ）Ｂ（ｌ，ｆ）を持つことから、ここでは複素時変ガウス分布と呼ぶ。 Since equation (42) has a time-varying covariance matrix φ (l, t, f) B (l, f), it is referred to herein as a complex time-varying Gaussian distribution.

The parameter storage unit 30 stores the spatial covariance matrices B(l, f) (l = 1 to L, f = 1 to F), which are model parameters of the conditional probability distribution of the observation signal vector y(t, f) serving as the feature vector z(t, f), under the condition that the state representing the sound source position takes the state corresponding to each of the plurality of sound source position candidates. Of the model parameters B(l, f) and φ(l, t, f) of the conditional probability distribution, the parameter storage unit 30 stores only the spatial covariance matrix B(l, f), which relates to the sound source position. Because φ(l, t, f) depends on the signal power, it is not stored in the parameter storage unit 30; instead, as described later, the prior probability distribution calculation unit 40 estimates it from the feature vectors supplied by the feature vector calculation unit 20. Because the present embodiment learns from training data the matrix B(l, f) that determines the position, spread, orientation, and shape of the distribution of the direction of the observation signal vector y(t, f), it can, as in the first embodiment, properly account for the aforementioned property that the direction of y(t, f) has a smaller variance (corresponding to the spread) at lower frequencies. The prior probability distribution, and the sound source localization based on it, can therefore be estimated accurately.

The spatial covariance matrix B(l, f) is pre-learned, for example by the following procedure, using the observation signal x(l, m, τ) recorded when sound is emitted from only one of the L sound source position candidates.
1. Create the M-dimensional column vector x(l, t, f) (l = 1 to L, t = 1 to T, f = 1 to F) whose elements are the time-frequency transforms x(l, m, t, f) (m = 1 to M) of x(l, m, τ). Set the feature vector ζ(l, t, f) as ζ(l, t, f) ← x(l, t, f). Note that the feature vector ζ(l, t, f) is computed differently from the fourth embodiment.
2. Initialize the spatial covariance matrices B(l, f) (l = 1 to L, f = 1 to F) with the M × M identity matrix.
3. Repeat the update of the spatial covariance matrices B(l, f) (l = 1 to L, f = 1 to F) by the following equation (44) a predetermined number of times (for example, 10 times).

4. Store the spatial covariance matrices B(l, f) (l = 1 to L, f = 1 to F) in the parameter storage unit 30.

The derivation of Equation (44) is as follows. Under the assumption that the vectors ζ(l, t, f) were generated according to the conditional probability distribution of Equation (42), Equation (44) is derived by maximizing Equation (45), the log-likelihood for Equation (42), with respect to the spatial covariance matrix B(l, f) and φ(l, t, f).

Ignoring the constant terms in Equation (45) that do not depend on the spatial covariance matrix B(l, f) or on φ(l, t, f), Equation (45) can be rewritten as Equation (46).

式（４６）のφ（ｌ，ｔ，ｆ）に関する偏微分を０と置いて整理すると、式（４７）を得る。 When the partial differentiation with respect to φ (l, t, f) in the equation (46) is set to 0, the equation (47) is obtained.

また、式（４６）のＢ（ｌ，ｆ）に関する偏微分を０と置くと、式（４８）を得、式（４８）に式（４７）を代入すると式（４４）を得る。 Further, when the partial differentiation with respect to B (l, f) in equation (46) is set to 0, equation (48) is obtained, and equation (44) is obtained by substituting equation (47) into equation (48).
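For reference, the steps of this derivation can be written out explicitly. The equations themselves are not reproduced in this text, so the expressions below, and their correspondence to the equation numbers, are assumptions based on the standard zero-mean complex Gaussian density. Here ζ_t abbreviates ζ(l, t, f), φ_t abbreviates φ(l, t, f), and B abbreviates B(l, f).

```latex
\begin{align*}
\mathcal{N}(z;0,\Phi) &= \frac{1}{\pi^{M}\det\Phi}
  \exp\bigl(-z^{\mathsf H}\Phi^{-1}z\bigr)
  && \text{(assumed form of (43))}\\
\sum_{t=1}^{T}\ln\mathcal{N}(\zeta_t;0,\phi_t B)
  &= -\sum_{t=1}^{T}\Bigl[M\ln\phi_t + \ln\det B
     + \frac{\zeta_t^{\mathsf H}B^{-1}\zeta_t}{\phi_t}\Bigr] + \text{const.}
  && \text{(cf. (45), (46))}\\
\phi_t &= \frac{1}{M}\,\zeta_t^{\mathsf H}B^{-1}\zeta_t
  && \text{(cf. (47))}\\
B &= \frac{1}{T}\sum_{t=1}^{T}\frac{\zeta_t\zeta_t^{\mathsf H}}{\phi_t}
  && \text{(cf. (48))}\\
B &= \frac{M}{T}\sum_{t=1}^{T}
     \frac{\zeta_t\zeta_t^{\mathsf H}}{\zeta_t^{\mathsf H}B^{-1}\zeta_t}
  && \text{(cf. (44))}
\end{align*}
```

The second line uses det(φ_t B) = φ_t^M det B. Because B appears on both sides of the last line, it is applied iteratively, which is consistent with the repeated update in step 3 of the pre-learning procedure.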

Next, the modeling of the marginal probability distribution of the feature vector z(t, f) in the present embodiment will be described. In the present embodiment, the marginal probability distribution of z(t, f) is modeled by the mixture model of Equation (49), which is the weighted sum of the conditional probability distributions p(z(t, f) | g(t, f) = l), with the prior probability distribution P(g(t, f) = l) of the state g(t, f) representing the sound source position as the weights.

The prior probability distribution calculation unit 40 fits the mixture model of Equation (49) to the feature vectors z(t, f) calculated by the feature vector calculation unit 20 and thereby calculates the prior probability distribution α(l) (l = 1 to L). This mixture model is the weighted sum, with the prior probability distribution α(l) (l = 1 to L) of the state representing the sound source position as the weights, of the conditional probability distributions of z(t, f) given that state, based on the spatial covariance matrices B(l, f) (l = 1 to L, f = 1 to F) stored as model parameters in the parameter storage unit 30.

There are various methods for fitting the mixture model of Equation (49) to the feature vectors z(t, f). For example, the likelihood for Equation (49) is taken as the objective function (the posterior probability or the like can also be used as the objective function) and is maximized by a gradient method (it can also be maximized by the EM algorithm or the like).
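The text above names the EM algorithm as one alternative to the gradient method. As a minimal sketch of that alternative only (it is not the update of Equations (10) and (11) of the first embodiment, which is not reproduced here), the following Python function estimates time-invariant mixture weights while the component likelihoods are held fixed. The function name `em_mixture_weights`, the input layout, and the uniform initialization are assumptions made for illustration.

```python
def em_mixture_weights(w, n_iter=50):
    """Estimate mixture weights alpha(l) by EM with fixed component likelihoods.

    w: list with one entry per time-frequency point; w[i][l] is the conditional
       likelihood of feature vector i under state l. Each entry must contain at
       least one positive value, otherwise the normalisation divides by zero.
    """
    num_states = len(w[0])
    alpha = [1.0 / num_states] * num_states      # ASSUMED uniform initialisation
    for _ in range(n_iter):
        counts = [0.0] * num_states
        for wi in w:
            # E-step: posterior probability of each state at this point
            joint = [alpha[l] * wi[l] for l in range(num_states)]
            total = sum(joint)
            for l in range(num_states):
                counts[l] += joint[l] / total
        # M-step: new weight = average posterior over all points
        alpha = [c / len(w) for c in counts]
    return alpha
```

For the seventh embodiment, `w[i]` would hold the elements of the vector w(t, f) for one time-frequency point, flattened over t and f.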

The prior probability distribution α(l) (l = 1 to L) can be estimated in the prior probability distribution calculation unit 40 in the same manner as in the first embodiment. Unlike the first embodiment, however, the vector w(t, f) is the L-dimensional column vector whose elements are N(z(t, f); 0, φ(l, t, f)B(l, f)) (l = 1 to L). Here, φ(l, t, f) can be calculated by the following equation (50).

上記の処理の導出について説明する。目的関数である尤度は、特徴ベクトルｚ（ｔ，ｆ）（ｔ＝１〜Ｔ，ｆ＝１〜Ｆ）が観測される確率であり、式（５１）で表される。 Derivation of the above processing will be described. The likelihood, which is an objective function, is the probability that a feature vector z (t, f) (t = 1 to T, f = 1 to F) will be observed, and is represented by equation (51).

Equation (50) is derived by maximizing Equation (51) with respect to φ(l, t, f). Maximizing Equation (51) with respect to φ(l, t, f) is equivalent to maximizing ln[N(z(t, f); 0, φ(l, t, f)B(l, f))] with respect to φ(l, t, f). Setting the partial derivative of ln[N(z(t, f); 0, φ(l, t, f)B(l, f))] with respect to φ(l, t, f) to zero therefore yields Equation (50). The update equations (10) and (11) for the prior probability distribution α(l) (l = 1 to L) can then be derived in the same manner as in the first embodiment.
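As a worked illustration of this step, and assuming the standard zero-mean complex Gaussian density for N (Equations (43) and (50) are not reproduced in this text, so the forms below are assumptions), the maximizing value of φ and the resulting element of w(t, f) can be written in closed form. Here z, φ, and B abbreviate z(t, f), φ(l, t, f), and B(l, f).

```latex
\begin{align*}
\frac{\partial}{\partial\phi}\ln\mathcal{N}(z;0,\phi B)
  &= -\frac{M}{\phi} + \frac{z^{\mathsf H}B^{-1}z}{\phi^{2}} = 0
  \;\Longrightarrow\;
  \phi = \frac{z^{\mathsf H}B^{-1}z}{M}
  && \text{(cf. (50))}\\
\mathcal{N}(z;0,\phi B)\Big|_{\phi = z^{\mathsf H}B^{-1}z/M}
  &= \frac{e^{-M}}{\pi^{M}\phi^{M}\det B}
   = \Bigl(\frac{M}{\pi e}\Bigr)^{M}
     \frac{1}{\det B\,\bigl(z^{\mathsf H}B^{-1}z\bigr)^{M}}
\end{align*}
```

Under these assumptions, each element of w(t, f) depends on l only through det B(l, f) and the quadratic form z^H B(l, f)^{-1} z, so it can be evaluated without storing φ(l, t, f).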

[Eighth Embodiment]
Next, the configuration of the eighth embodiment will be described. The eighth embodiment is an example in which sound source positions are tracked using the sets G(t) of sound source positions detected by the signal processing device 1 according to the second embodiment, and the sound source position ρ(n, t) (n = 1 to N, t = 1 to T, where N is the number of sound sources) is calculated for each sound source and each frame. In the present embodiment, the sound source position is assumed to be specified by the azimuth alone, so that G(t) is a set of azimuths and ρ(n, t) is an azimuth. An example of such a situation is several people conversing around a table on which microphones are placed.

図４を用いて、第８の実施形態に係る信号処理装置の構成について説明する。図４は、第８の実施形態に係る信号処理装置の構成の一例を示す図である。図４に示すように、信号処理装置２は、時間周波数分析部１０、特徴ベクトル計算部２０、パラメータ記憶部３０、事前確率分布計算部４０、音源位置計算部５０、トラッキング部５１を有する。時間周波数分析部１０、特徴ベクトル計算部２０、パラメータ記憶部３０、事前確率分布計算部４０、音源位置計算部５０については、第２の実施形態と同様であるから、以下では相違点であるトラッキング部５１について詳しく説明する。 The configuration of the signal processing device according to the eighth embodiment will be described with reference to FIG. FIG. 4 is a diagram illustrating an example of a configuration of a signal processing device according to the eighth embodiment. As illustrated in FIG. 4, the signal processing device 2 includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, a sound source position calculation unit 50, and a tracking unit 51. Since the time-frequency analysis unit 10, the feature vector calculation unit 20, the parameter storage unit 30, the prior probability distribution calculation unit 40, and the sound source position calculation unit 50 are the same as those in the second embodiment, tracking is the difference in the following. The unit 51 will be described in detail.

The tracking unit 51 receives the sets G(t) (t = 1 to T) of detected sound source positions (azimuths) from the sound source position calculation unit 50, tracks the sound source positions, and calculates and outputs the sound source position (azimuth) ρ(n, t) (n = 1 to N, t = 1 to T) for each sound source and each frame. This tracking can be performed by various methods. As one example, the following assumes that the rough sound source position (azimuth) of each sound source is known and uses it for tracking. An example of a situation in which the rough sound source position (azimuth) of each sound source is known is a meeting in which several people sit on chairs around a desk on which microphones are placed. In this case, if the chairs are essentially fixed at known positions and the speakers do not change seats during the conversation, the known chair positions can be used as the rough sound source positions of the sound sources (speakers).

First, the rough sound source position of each sound source described above is used as the initial value ρ(n, 0) of the sound source position (azimuth angle) ρ(n, t).

Assuming that the sound source position (azimuth angle) ρ(n, t−1) at frame t−1 has been obtained, the sound source position (azimuth angle) ρ(n, t) at frame t can be obtained by the following processing.
1. Initialize ρ(n, t) by ρ(n, t) ← ρ(n, t−1).
2. For each detected sound source position (azimuth angle) r ∈ G(t) (0 ≤ r < 2π), perform the following steps 2-1 and 2-2.
2-1. Using the following equation (52), calculate the number ν of the sound source closest to the detected sound source position (azimuth angle) r.

2-2. Update the sound source position (azimuth angle) ρ(ν, t) of the ν-th sound source by equation (53).

In equation (53), d(ξ, η) is the distance on the circumference defined by equation (54).

Also in equation (53), the symbol ∠ with the subscript [0, 2π) is an operator that computes the argument of a nonzero complex number in the range [0, 2π), the symbol ∠ with the subscript [−π, π) is an operator that computes the argument of a nonzero complex number in the range [−π, π), and δ is a constant satisfying 0 < δ < 1 (for example, δ = 0.005).
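The per-frame tracking procedure above can be sketched as follows. Equations (52) to (54) are not reproduced in this text, so the forms used here are assumptions: the circular distance is taken as the absolute value of the angle difference wrapped to [−π, π), the nearest source is chosen by that distance against the current ρ(n, t), and the update moves ρ(ν, t) toward r by a fraction δ of the wrapped difference and wraps the result to [0, 2π). The names `wrap_0_2pi`, `wrap_pm_pi`, `circ_dist`, and `track_frame` are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def wrap_0_2pi(angle):
    """Wrap an angle to [0, 2*pi), i.e. the argument of exp(j*angle) in that range."""
    wrapped = angle % TWO_PI
    # Guard against rounding that can return exactly 2*pi for tiny negative inputs.
    return 0.0 if wrapped >= TWO_PI else wrapped


def wrap_pm_pi(angle):
    """Wrap an angle to [-pi, pi)."""
    return wrap_0_2pi(angle + math.pi) - math.pi


def circ_dist(xi, eta):
    """Assumed form of d(xi, eta): distance between two angles on the circle."""
    return abs(wrap_pm_pi(xi - eta))


def track_frame(rho_prev, detections, delta=0.005):
    """Return rho(n, t) for one frame from rho(n, t-1) and the detections G(t)."""
    rho = list(rho_prev)  # step 1: rho(n, t) <- rho(n, t-1)
    for r in detections:  # step 2: loop over r in G(t)
        # Step 2-1 (assumed form): index of the source nearest to r.
        nu = min(range(len(rho)), key=lambda n: circ_dist(r, rho[n]))
        # Step 2-2 (assumed form): small step toward r along the shorter arc.
        rho[nu] = wrap_0_2pi(rho[nu] + delta * wrap_pm_pi(r - rho[nu]))
    return rho
```

With a small δ such as 0.005, each detection nudges the nearest track only slightly, so the tracks stay close to the known seat positions while following slow drift.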

[Ninth Embodiment]
Next, the configuration of the ninth embodiment will be described. The ninth embodiment is an example in which diarization is performed based on the processing result of the signal processing device 2 according to the eighth embodiment. Diarization may be performed either by determining, for each frame, whether each sound source is present or absent (hard decision), or by calculating, for each frame, the probability that each sound source is present (soft decision). An example of the former is shown here.

The configuration of the signal processing device according to the ninth embodiment will be described with reference to FIG. 5. FIG. 5 is a diagram illustrating an example of the configuration of the signal processing device according to the ninth embodiment. As shown in FIG. 5, the signal processing device 3 includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, a sound source position calculation unit 50, a tracking unit 51, and a diarization unit 60. The time-frequency analysis unit 10, the feature vector calculation unit 20, the parameter storage unit 30, the prior probability distribution calculation unit 40, the sound source position calculation unit 50, and the tracking unit 51 are the same as those of the signal processing device 2, so only the diarization unit 60, which is the point of difference, is described in detail below.

The diarization unit 60 receives the sets G(t) (t = 1 to T) of detected sound source positions from the sound source position calculation unit 50 and the sound source positions (azimuth angles) ρ(n, t) (n = 1 to N, t = 1 to T) for each sound source and each frame from the tracking unit 51, and calculates and outputs the diarization result d(n, t) for each sound source and each frame. Here, d(n, t) = 1 when sound source n is present in frame t, and d(n, t) = 0 when sound source n is absent in frame t.

Various methods are conceivable for calculating the diarization result d(n, t). For example, it may be calculated as follows.
1. Initialize d(n, t) (n = 1 to N, t = 1 to T) by d(n, t) ← 0.
2. For t = 1 to T, perform the following: for each detected sound source position (azimuth angle) r ∈ G(t), find the sound source number ν, that is, the n that minimizes the distance d(r, ρ(n, t)), and set d(ν, t) ← 1.
3. Take d(n, t) (n = 1 to N, t = 1 to T) as the diarization result.
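A minimal sketch of the hard-decision diarization procedure above follows. The circular distance is an assumed form (absolute angle difference wrapped to [−π, π)), because equation (54) is not reproduced in this text. The function names and the list-of-lists data layout are hypothetical.

```python
import math


def circ_dist(xi, eta):
    """Assumed form of d(xi, eta): distance between two angles on the circle."""
    return abs((xi - eta + math.pi) % (2.0 * math.pi) - math.pi)


def diarize(detections, rho):
    """Hard-decision diarization.

    detections[t] is the list of detected azimuths G(t) for frame t.
    rho[n][t] is the tracked azimuth of source n at frame t.
    Returns d with d[n][t] = 1 if source n is judged present in frame t, else 0.
    """
    n_sources = len(rho)
    n_frames = len(detections)
    # Step 1: initialize every entry to 0 (absent).
    d = [[0] * n_frames for _ in range(n_sources)]
    # Step 2: mark the source nearest to each detection as present.
    for t in range(n_frames):
        for r in detections[t]:
            nu = min(range(n_sources), key=lambda n: circ_dist(r, rho[n][t]))
            d[nu][t] = 1
    # Step 3: d is the diarization result.
    return d
```

For example, with two sources tracked at 0.0 and 3.0 rad over three frames and detections `[[0.1], [], [2.9, 6.2]]`, the first source is marked present in frames 0 and 2, and the second only in frame 2.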

In the ninth embodiment, when the accurate sound source position (azimuth angle) of each sound source is known, the known sound source position (azimuth angle) may be used as the sound source position (azimuth angle) ρ(n, t) for each sound source and each frame, instead of the sound source position (azimuth angle) calculated by the tracking unit 51. Examples of such situations include a conversation in which the speakers sit on fixed chairs and a case in which the sound source positions (azimuth angles) are known from video camera images.

[Tenth Embodiment]
Next, the configuration of the tenth embodiment will be described. The tenth embodiment is an example in which, in a situation where N (N > 0) target signals are mixed under background noise, the waveform of each target signal is estimated based on the sound source positions estimated by the present invention. This embodiment makes it possible to separate the mixed target signals into individual target signals and to remove the background noise.

The configuration of the signal processing device according to the tenth embodiment will be described with reference to FIG. 6. FIG. 6 is a diagram illustrating an example of the configuration of the signal processing device according to the tenth embodiment. As shown in FIG. 6, the signal processing device 4 includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, a sound source position calculation unit 50, a tracking unit 51, a diarization unit 60, a mask estimation unit 70, and a signal enhancement unit 80. The time-frequency analysis unit 10, the feature vector calculation unit 20, the tracking unit 51, and the diarization unit 60 are the same as those of the signal processing device 3, so the parameter storage unit 30, the prior probability distribution calculation unit 40, the sound source position calculation unit 50, the mask estimation unit 70, and the signal enhancement unit 80, which are the points of difference, are described in detail below.
The main differences between the signal processing device 3 and the signal processing device 4 are as follows. In the signal processing device 3, the parameter storage unit 30 stores the model parameters of the conditional probability distributions under the condition that the state representing the sound source position takes the state corresponding to each of a plurality of sound source position candidates, the prior probability distribution calculation unit 40 calculates the prior probability distribution of the states corresponding to the sound source position candidates based on the model parameters, and the sound source position calculation unit 50 calculates the sound source positions based on the prior probability distribution. In the signal processing device 4, by contrast, the parameter storage unit 30 additionally stores the model parameters of the conditional probability distribution under the condition that the state representing the sound source position takes the state corresponding to background noise, the prior probability distribution calculation unit 40 calculates the prior probability distribution of the states corresponding to the sound source position candidates and to background noise based on the model parameters, and the sound source position calculation unit 50 calculates the sound source positions based on the prior probability distribution. Furthermore, in the signal processing device 4, the mask estimation unit 70 estimates masks, which are the contributions (posterior probabilities) of each target signal and of the background noise at each time-frequency point, and the signal enhancement unit 80 calculates the waveform of each target signal based on the masks.

The parameter storage unit 30 stores the mean direction vectors a(l, f) (l = 1 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 1 to L, f = 1 to F), which are the model parameters of the complex Watson distribution serving as the conditional probability distribution of the feature vector z(t, f) under the condition that the state representing the sound source position takes the state corresponding to each of the sound source position candidates. It also stores the mean direction vector a(0, f) (f = 1 to F) and the concentration parameter κ(0, f) (f = 1 to F), which are the model parameters of the complex Watson distribution serving as the conditional probability distribution of the feature vector z(t, f) under the condition that the state representing the sound source position takes the state corresponding to background noise. These model parameters can be calculated, for example, by the method described in the fourth embodiment for the states corresponding to the sound source position candidates, and by the method described in the third embodiment for the state corresponding to background noise.

The prior probability distribution calculation unit 40 fits the mixture model of equation (21) to the feature vectors z(t, f) calculated by the feature vector calculation unit 20 and thereby calculates the prior probability distribution. This mixture model is a weighted sum of the conditional probability distributions of the feature vector z(t, f) given the state representing the sound source position, with the prior probability distribution of that state as the weights. The conditional probability distributions are based on the model parameters stored in the parameter storage unit 30, namely the mean direction vectors a(l, f) (l = 0 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 0 to L, f = 1 to F). In this embodiment, the prior probability distribution P(g(t, f) = l) is assumed to depend on the frame and is denoted α(l, t) (l = 0 to L, t = 1 to T). α(l, t) satisfies the constraint α(0, t) + … + α(L, t) = 1. There are various ways to fit the mixture model of equation (21) to the feature vectors z(t, f); in this embodiment, the fitting is performed by maximizing the likelihood of equation (21) with respect to the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) using a gradient method.

The processing in the prior probability distribution calculation unit 40 is, for example, as follows.
1. Initialize the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) by α(l, t) ← 1/(L+1).
2. Alternately repeat the updates of the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) by the following equations (55) and (56) a predetermined number of times (for example, 10 times).

3. Output the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T).

Here, the vector ~α(t) (the symbol "~" before α indicates a tilde placed over α) is the (L+1)-dimensional column vector consisting of α(l, t) (l = 0 to L), and the vector ~w(t, f) is the (L+1)-dimensional column vector consisting of W(z(t, f); a(l, f), κ(l, f)) (l = 0 to L). The derivation of equations (55) and (56) is the same as in the first embodiment and is therefore omitted.
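Equations (55) and (56) are not reproduced in this text, so the following sketch does not implement them. As a stand-in, it uses a multiplicative gradient-based update of the mixture weights for a single frame: each weight is scaled by its partial derivative of the log-likelihood and divided by the number of frequency bins F. This update coincides with the well-known EM update for mixture weights and keeps the weights on the probability simplex. The function name `update_prior` and the input layout `w[f][l]` (strictly positive likelihood values W(z(t, f); a(l, f), κ(l, f)) for one frame) are assumptions made for illustration.

```python
def update_prior(w, n_iter=10):
    """Estimate the frame-wise prior alpha(l, t) for one frame t.

    w[f][l] holds the (strictly positive) component likelihood of state l
    at frequency bin f. Returns a list of L+1 weights that sum to 1.
    """
    n_freq = len(w)
    n_states = len(w[0])
    # Step 1: uniform initialization, alpha(l, t) <- 1 / (L + 1).
    alpha = [1.0 / n_states] * n_states
    # Step 2: repeat the update a fixed number of times.
    for _ in range(n_iter):
        # Gradient of sum_f log(alpha^T w(t, f)) with respect to alpha.
        grad = [0.0] * n_states
        for w_f in w:
            mix = sum(a * wl for a, wl in zip(alpha, w_f))
            for l, wl in enumerate(w_f):
                grad[l] += wl / mix
        # Multiplicative step; the new weights sum to 1 by construction.
        alpha = [a * g / n_freq for a, g in zip(alpha, grad)]
    # Step 3: output the estimated prior.
    return alpha
```

Because the rescaled weights always sum to one, no separate projection step is needed in this stand-in; the patent's own update pair may well be structured differently.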

The sound source position calculation unit 50 calculates and outputs the sets G(t) (t = 1 to T) of detected sound source positions based on the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) received from the prior probability distribution calculation unit 40. Specifically, the domain of the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) is restricted to l = 1 to L, which correspond to the target sound sources, and the processing of the sound source position calculation unit 50 of the second embodiment is applied to the resulting α(l, t) (l = 1 to L, t = 1 to T) to calculate the sets G(t) (t = 1 to T) of detected sound source positions.

The mask estimation unit 70 receives the mean direction vectors a(l, f) (l = 0 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 0 to L, f = 1 to F) from the parameter storage unit 30, the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) from the prior probability distribution calculation unit 40, and the sound source positions (azimuth angles) ρ(n, t) (n = 1 to N, t = 1 to T) for each sound source and each frame from the tracking unit 51. It then calculates and outputs the masks γ(n, t, f) (n = 0 to N, t = 1 to T, f = 1 to F), which are the contributions (posterior probabilities) of the background noise and of each target signal to the feature vector z(t, f) at each time-frequency point. Here, γ(0, t, f) is the mask corresponding to the background noise, and γ(n, t, f) (n = 1 to N) is the mask corresponding to target signal n.

The mask γ(n, t, f) can be calculated by various methods. For example, it is calculated as follows.
1. Calculate the posterior probability P(g(t, f) = l | z(t, f)) (l = 0 to L, t = 1 to T, f = 1 to F) that g(t, f) = l given the feature vector z(t, f), using the following equations (57) and (58).

2. Calculate the mask γ(0, t, f) (t = 1 to T, f = 1 to F) corresponding to the background noise by the following equation (59).

3. Calculate the set J(n, t) (n = 1 to N, t = 1 to T) of the numbers l of the sound source position candidates that correspond to each target signal n in frame t, using the following equation (60).

4. Calculate the masks γ(n, t, f) (n = 1 to N, t = 1 to T, f = 1 to F) corresponding to the target signals by the following equation (61).

5. Output the masks γ(n, t, f) (n = 0 to N, t = 1 to T, f = 1 to F).
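The mask calculation above can be sketched for a single time-frequency point as follows. Equations (57) to (61) are not reproduced in this text, so several forms are assumed: the posterior is obtained by Bayes' rule from α(l, t) and the component likelihoods, the noise mask is the posterior of state l = 0, the set J(n, t) collects the candidates whose nearest tracked source is n, and the target mask sums the posteriors over J(n, t). The function names, the argument layout, and the precomputed likelihood input `w_tf` are hypothetical.

```python
import math


def circ_dist(xi, eta):
    """Assumed circular distance between two azimuth angles."""
    return abs((xi - eta + math.pi) % (2.0 * math.pi) - math.pi)


def estimate_masks(alpha_t, w_tf, candidate_angles, rho_t):
    """Masks gamma(n, t, f), n = 0..N, at one time-frequency point.

    alpha_t[l], w_tf[l]: prior and likelihood of state l = 0..L (0 = noise).
    candidate_angles[l - 1]: azimuth of source position candidate l = 1..L.
    rho_t[n - 1]: tracked azimuth of target source n = 1..N in this frame.
    """
    # Step 1 (assumed form): posterior of each state by Bayes' rule.
    joint = [a * w for a, w in zip(alpha_t, w_tf)]
    total = sum(joint)
    posterior = [j / total for j in joint]

    gamma = [0.0] * (len(rho_t) + 1)
    # Step 2 (assumed form): the noise mask is the posterior of state 0.
    gamma[0] = posterior[0]
    # Steps 3 and 4 (assumed forms): assign each candidate to its nearest
    # tracked source and accumulate the posterior into that source's mask.
    for l, theta in enumerate(candidate_angles, start=1):
        n = min(range(len(rho_t)), key=lambda i: circ_dist(theta, rho_t[i]))
        gamma[n + 1] += posterior[l]
    # Step 5: gamma[0] is the noise mask, gamma[1..N] the target masks.
    return gamma
```

Under these assumptions every candidate belongs to exactly one target, so the masks sum to one at each time-frequency point. A neighborhood-based definition of J(n, t) would not have that property.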

The signal enhancement unit 80 receives the observation signal vector y(t, f) from the time-frequency analysis unit 10, the diarization result d(n, t) (n = 1 to N, t = 1 to T), which takes the value 0 or 1, from the diarization unit 60, and the masks γ(n, t, f) (n = 0 to N, t = 1 to T, f = 1 to F) of the background noise and of each target signal from the mask estimation unit 70, and estimates each target signal s(n, τ).

A specific example of the processing in the signal enhancement unit 80 is as follows.
1. Calculate the covariance matrix Φ(f) of the observation signal by the following equation (62).

2. Calculate the masks ~γ(n, t, f) (n = 0 to N, t = 1 to T, f = 1 to F), corrected using the diarization result d(n, t) (n = 1 to N, t = 1 to T), by the following equations (63) and (64). Equation (63) means that when d(n, t) = 0, the mask of sound source n in frame t is replaced with 0. Equation (64) normalizes the masks ~γ(n, t, f) so that their sum over n is 1.

3. Calculate the covariance matrices Ψ(n, f) (n = 0 to N, f = 1 to F) by the following equation (65). Here, the matrix Ψ(0, f) is the covariance matrix corresponding to the background noise, and the matrix Ψ(n, f) (n = 1 to N) is the covariance matrix corresponding to the sum of the n-th target signal and the background noise.

4. Obtain the covariance matrix ~Ψ(n, f) (n = 1 to N, f = 1 to F) corresponding to the n-th target signal by subtracting the covariance matrix Ψ(0, f) corresponding to the background noise from the covariance matrix Ψ(n, f) corresponding to the sum of the n-th target signal and the background noise. Next, obtain the steering vector h(n, f) (n = 1 to N, f = 1 to F) of each target signal as the eigenvector corresponding to the largest eigenvalue of the matrix ~Ψ(n, f). Then normalize the vector h(n, f) by h(n, f) ← h(n, f)/h(1, n, f) so that its first element equals 1. Here, h(1, n, f) denotes the first element of the vector h(n, f).

5. Based on the minimum variance beamformer, calculate the time-frequency representation s(n, t, f) (n = 1 to N, t = 1 to T, f = 1 to F) of each target signal by the following equation (67).

6. Calculate each target signal s(n, τ) by applying the inverse of the time-frequency transform to the time-frequency representation s(n, t, f) (n = 1 to N, t = 1 to T, f = 1 to F) of each target signal.
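Equations (62) to (67) are not reproduced in this text. For orientation only, the block below writes out forms commonly used in mask-based minimum variance beamforming that are consistent with the verbal description of steps 1 to 5. They are assumptions, not the patent's own equations. In particular, the time averaging in the first line, the mask-weighted normalization of Ψ(n, f), and the symbol γ' for the intermediate mask are introduced here for illustration, and the superscript H denotes the conjugate transpose.

```latex
\begin{align*}
% Step 1 (assumed form): covariance matrix of the observation
\Phi(f) &= \frac{1}{T}\sum_{t=1}^{T} y(t,f)\,y(t,f)^{\mathsf H} \\
% Step 2 (assumed form): zero the masks of absent sources, then renormalize over n
\gamma'(n,t,f) &=
  \begin{cases}
    \gamma(0,t,f) & n = 0 \\
    d(n,t)\,\gamma(n,t,f) & n = 1,\dots,N
  \end{cases}
\qquad
\tilde\gamma(n,t,f) = \frac{\gamma'(n,t,f)}{\sum_{n'=0}^{N}\gamma'(n',t,f)} \\
% Step 3 (assumed form): mask-weighted covariance matrices
\Psi(n,f) &= \frac{\sum_{t=1}^{T}\tilde\gamma(n,t,f)\,y(t,f)\,y(t,f)^{\mathsf H}}
                  {\sum_{t=1}^{T}\tilde\gamma(n,t,f)} \\
% Step 4: subtraction as described in the text
\tilde\Psi(n,f) &= \Psi(n,f) - \Psi(0,f) \\
% Step 5 (standard minimum variance beamformer with steering vector h)
s(n,t,f) &= \frac{h(n,f)^{\mathsf H}\,\Phi(f)^{-1}\,y(t,f)}
                 {h(n,f)^{\mathsf H}\,\Phi(f)^{-1}\,h(n,f)}
\end{align*}
```

Normalizing h(n, f) so that its first element equals 1, as in step 4, makes the beamformer output an estimate of the target signal as observed at the first microphone.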

[Eleventh Embodiment]
Next, the configuration of the eleventh embodiment will be described. The eleventh embodiment is an example in which, in a situation where N (N > 0) target speech signals are present under background noise, the waveform of each target speech signal is estimated based on the sound source positions estimated by the present invention, and each target speech signal is recognized by applying an existing speech recognition technique to it. According to the present invention, even when background noise and speech from multiple speakers are mixed, the mixed target signals can be separated into individual target signals and the background noise can be removed, so that highly accurate speech recognition can be achieved. One example application is automatic transcription of a meeting held in a corner of an office where various sounds are present.

The configuration of the signal processing device according to the eleventh embodiment will be described with reference to FIG. 7. FIG. 7 is a diagram illustrating an example of the configuration of the signal processing device according to the eleventh embodiment. As shown in FIG. 7, the signal processing device 5 includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, a sound source position calculation unit 50, a tracking unit 51, a diarization unit 60, a mask estimation unit 70, a signal enhancement unit 80, and a speech recognition unit 90. The time-frequency analysis unit 10, the feature vector calculation unit 20, the parameter storage unit 30, the prior probability distribution calculation unit 40, the sound source position calculation unit 50, the tracking unit 51, the diarization unit 60, the mask estimation unit 70, and the signal enhancement unit 80 are the same as those in the tenth embodiment. The speech recognition unit 90 receives the waveform of each target signal from the signal enhancement unit 80, applies an existing speech recognition technique to it, and outputs a recognition result for each target signal.

[System Configuration, etc.]
The components of each illustrated device are functional and conceptual, and need not be physically configured as illustrated. In other words, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed in each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware using wired logic.

Among the processes described in this embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and the drawings may be changed arbitrarily unless otherwise specified.

[Program]
The signal processing devices 1 to 5 of the embodiments can be implemented by installing, on a desired computer, a signal processing program that executes the above sound source localization, tracking, diarization, speech enhancement, and speech recognition as packaged software or online software. For example, by causing an information processing apparatus to execute the signal processing program, the information processing apparatus can be made to function as the signal processing devices 1 to 5. The information processing apparatus referred to here includes desktop and notebook personal computers. It also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).

The signal processing devices 1 to 5 can also be implemented as a signal processing server device that treats a terminal device used by a user as a client and provides the client with services related to the above signal processing. For example, the signal processing server device is implemented as a server device that provides a sound source localization service that takes an observation signal as input and outputs the positions of the sound sources. In this case, the signal processing server device may be implemented as a Web server, or as a cloud that provides the above signal processing services by outsourcing.

FIG. 8 is a diagram illustrating an example of a computer that realizes the signal processing device by executing a program. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the signal processing devices 1 to 5 is implemented as the program module 1093, in which computer-executable code is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing the same processing as the functional configuration of the signal processing devices 1 to 5 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD.

The setting data used in the processing of the above-described embodiments is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary and executes them.

The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090. They may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (such as a LAN (Local Area Network) or a WAN (Wide Area Network)) and read by the CPU 1020 from that computer via the network interface 1070.

1, 2, 3, 4, 5 Signal processing device
10 Time-frequency analysis unit
20 Feature vector calculation unit
30 Parameter storage unit
40 Prior probability distribution calculation unit
50 Sound source position calculation unit
51 Tracking unit
60 Diarization unit
70 Mask estimation unit
80 Signal enhancement unit
90 Speech recognition unit

Claims

A signal processing device comprising:
a time-frequency analysis unit that applies time-frequency analysis to recorded sounds acquired at a plurality of different positions and calculates an observation signal vector that is an M-dimensional vector;
a feature vector calculation unit that calculates, for each time-frequency point, a feature vector that is a vector including information on the direction of the observation signal vector calculated by the time-frequency analysis unit;
a parameter storage unit that stores model parameters of a conditional probability distribution of the feature vector under the condition that a state representing a sound source position takes a state corresponding to each of a plurality of sound source position candidates;
a prior probability distribution calculation unit that calculates a prior probability distribution of the state representing the sound source position by fitting, to the feature vector calculated by the feature vector calculation unit, a mixture model that is a weighted sum, with the prior probability distribution as the weights, of the conditional probability distributions of the feature vector under the condition that the state representing the sound source position is known, the conditional probability distributions being based on the model parameters stored in the parameter storage unit; and
a sound source position calculation unit that calculates a sound source position corresponding to the feature vector based on the prior probability distribution calculated by the prior probability distribution calculation unit.
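
By way of non-limiting illustration, the fitting of the mixture model recited above can be sketched as follows. This is a minimal sketch, not the claimed implementation. It assumes that the conditional densities have already been evaluated from the stored model parameters into a table `lik[n][k]` (the density of the n-th feature vector under the k-th sound source position candidate), that all entries are strictly positive, and that the weights are fitted with an EM-style update. The names `estimate_prior`, `locate_source`, `lik`, and `candidates`, and the example values, are hypothetical.

```python
def estimate_prior(lik, n_iter=50):
    """Fit only the mixture weights (the prior over states).

    lik[n][k] is the fixed conditional density of feature vector n under
    state k; it is assumed strictly positive and is never updated here.
    """
    n_points, n_states = len(lik), len(lik[0])
    w = [1.0 / n_states] * n_states  # start from a uniform prior
    for _ in range(n_iter):
        acc = [0.0] * n_states
        for row in lik:
            joint = [w[k] * row[k] for k in range(n_states)]
            total = sum(joint)
            # posterior probability of each state for this point
            for k in range(n_states):
                acc[k] += joint[k] / total
        # new weights are the average posteriors
        w = [a / n_points for a in acc]
    return w


def locate_source(w, candidates):
    """Pick the candidate position whose prior weight is largest."""
    best = max(range(len(w)), key=lambda k: w[k])
    return candidates[best]


# Hypothetical example: three candidate directions (degrees), four points.
candidates = [-60.0, 0.0, 60.0]
lik = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.4, 0.3],
]
weights = estimate_prior(lik)
position = locate_source(weights, candidates)
```

Because only the weights are estimated and the per-candidate densities stay fixed, the number of free parameters is small, which is consistent with the stated aim of localizing from short observation signals.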

The signal processing device according to claim 1, wherein
the prior probability distribution calculation unit calculates a prior probability distribution for each time interval by fitting, to the feature vector calculated by the feature vector calculation unit, a mixture model that is a weighted sum, with the prior probability distribution for each time interval of the state representing the sound source position as the weights, of the conditional probability distributions of the feature vector under the condition that the state representing the sound source position is known, the conditional probability distributions being based on the model parameters stored in the parameter storage unit, and
the sound source position calculation unit calculates a sound source position for each time interval corresponding to the feature vector based on the prior probability distribution for each time interval calculated by the prior probability distribution calculation unit.
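
As a hedged illustration of the per-time-interval variant, the sketch below pools the time-frequency points of each interval and fits a separate set of mixture weights for that interval. It assumes fixed-length intervals, strictly positive precomputed densities, and an EM-style weight update. The names `em_weights`, `track_by_interval`, `frames`, and `interval_len` are hypothetical.

```python
def em_weights(lik, n_iter=30):
    """EM-style update of mixture weights with fixed component densities.

    lik[n][k] is assumed strictly positive.
    """
    n_states = len(lik[0])
    w = [1.0 / n_states] * n_states
    for _ in range(n_iter):
        acc = [0.0] * n_states
        for row in lik:
            joint = [wi * li for wi, li in zip(w, row)]
            total = sum(joint)
            acc = [a + j / total for a, j in zip(acc, joint)]
        w = [a / len(lik) for a in acc]
    return w


def track_by_interval(frames, interval_len, candidates):
    """Return one estimated source position per time interval.

    frames[t] is the list of per-point density rows for time frame t.
    """
    positions = []
    for start in range(0, len(frames), interval_len):
        # pool every time-frequency point that falls in this interval
        pooled = [row
                  for frame in frames[start:start + interval_len]
                  for row in frame]
        w = em_weights(pooled)
        best = max(range(len(w)), key=lambda k: w[k])
        positions.append(candidates[best])
    return positions


# Hypothetical example: the source moves from the left to the right candidate.
candidates = [-45.0, 0.0, 45.0]
left = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
right = [[0.1, 0.2, 0.7], [0.2, 0.2, 0.6]]
frames = [left, left, right, right]
positions = track_by_interval(frames, interval_len=2, candidates=candidates)
```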

The signal processing device according to claim 1, wherein the parameter storage unit stores model parameters that have been learned using learning data acquired under reverberation, the model parameters being those of the conditional probability distribution of the feature vector under the condition that the state representing the sound source position takes a state corresponding to each of the plurality of sound source position candidates.

The signal processing device according to claim 1, wherein the parameter storage unit further stores model parameters of a conditional probability distribution under the condition that the state representing the sound source position takes a state corresponding to background noise.

The signal processing device according to claim 1, wherein the prior probability distribution calculation unit calculates the prior probability distribution based on a gradient method.
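
The claim does not specify which gradient method is used. One possible reading, offered only as a sketch, is gradient ascent on the log-likelihood of the mixture model with the weights parameterized through a softmax so that they remain non-negative and sum to one. The softmax parameterization, the step size, the iteration count, and the names `softmax`, `log_likelihood`, and `fit_prior_gradient` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    e = [math.exp(a - m) for a in logits]
    s = sum(e)
    return [x / s for x in e]


def log_likelihood(w, lik):
    """Log-likelihood of the mixture with weights w and fixed densities lik."""
    return sum(math.log(sum(wk * lk for wk, lk in zip(w, row))) for row in lik)


def fit_prior_gradient(lik, step=0.5, n_iter=200):
    """Gradient ascent on the mean log-likelihood over softmax logits.

    For logit j the gradient of the log-likelihood is
    sum_n (posterior[n][j] - w[j]); lik is assumed strictly positive.
    """
    n_states = len(lik[0])
    logits = [0.0] * n_states  # corresponds to a uniform prior
    for _ in range(n_iter):
        w = softmax(logits)
        grad = [0.0] * n_states
        for row in lik:
            joint = [wk * lk for wk, lk in zip(w, row)]
            total = sum(joint)
            for k in range(n_states):
                grad[k] += joint[k] / total - w[k]
        logits = [a + step * g / len(lik) for a, g in zip(logits, grad)]
    return softmax(logits)


# Hypothetical example with three candidate states.
lik = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.4, 0.3],
]
w_grad = fit_prior_gradient(lik)
uniform = [1.0 / 3.0] * 3
```

Comparing `log_likelihood(w_grad, lik)` with `log_likelihood(uniform, lik)` gives a quick check that the fitted weights explain the feature vectors better than the uniform starting point.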

A signal processing method executed by a signal processing device, the method comprising:
a time-frequency analysis step of applying time-frequency analysis to recorded sounds acquired at a plurality of different positions and calculating an observation signal vector that is an M-dimensional vector;
a feature vector calculation step of calculating, for each time-frequency point, a feature vector that is a vector including information on the direction of the observation signal vector calculated in the time-frequency analysis step;
a prior probability distribution calculation step of acquiring model parameters from a parameter storage unit that stores model parameters of a conditional probability distribution of the feature vector under the condition that a state representing a sound source position takes a state corresponding to each of a plurality of sound source position candidates, and calculating a prior probability distribution of the state representing the sound source position by fitting, to the feature vector calculated in the feature vector calculation step, a mixture model that is a weighted sum, with the prior probability distribution as the weights, of the conditional probability distributions of the feature vector under the condition that the state representing the sound source position is known, the conditional probability distributions being based on the model parameters; and
a sound source position calculation step of calculating a sound source position corresponding to the feature vector based on the prior probability distribution calculated in the prior probability distribution calculation step.

A signal processing program for causing a computer to function as the signal processing device according to any one of claims 1 to 5.