JP6538624B2 - Signal processing apparatus, signal processing method and signal processing program

Info

Publication number: JP6538624B2
Application number: JP2016166232A
Authority: JP
Inventors: Nobutaka Ito (伊藤 信貴); Tomohiro Nakatani (中谷 智広); Shoko Araki (荒木 章子)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-08-26
Filing date: 2016-08-26
Publication date: 2019-07-03
Anticipated expiration: 2036-08-26
Also published as: JP2018032001A

Description

The present invention relates to a signal processing device, a signal processing method, and a signal processing program.

Sound source localization techniques are known that estimate the position of a sound source from recordings observed with a plurality of microphones or the like. One known method, for example, assumes that the number of sound sources is known and estimates the sound source positions using a covariance matrix estimated by applying time-frequency analysis to the observed signals.

R. O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, vol. AP-34, no. 3, pp. 276-280, March 1986.
N. Ito, E. Vincent, N. Ono, R. Gribonval, and S. Sagayama, "Crystal-MUSIC: Accurate localization of multiple sources in diffuse noise environments using crystal-shaped microphone arrays," Proceedings of the 9th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), pp. 81-88, September 2010.

However, conventional sound source localization techniques may fail to localize sources accurately when the observed signal is short. For example, a short observation may not provide enough samples to estimate the covariance matrix, in which case localization becomes inaccurate.

A signal processing device according to the present invention includes: a time-frequency analysis unit that applies time-frequency analysis to recordings acquired at a plurality of different positions and calculates an observed signal vector, which is an M-dimensional vector; a feature vector calculation unit that calculates, for each time-frequency point, a feature vector containing information on the direction of the observed signal vector y(t, f) calculated by the time-frequency analysis unit; a parameter storage unit that stores model parameters of the conditional probability distribution of the feature vector under the condition that the state representing the sound source position takes the state corresponding to each of a plurality of sound source position candidates; a prior probability distribution calculation unit that fits a mixture model to the feature vectors calculated by the feature vector calculation unit and thereby calculates the prior probability distribution of the state representing the sound source position, the mixture model being a weighted sum of the conditional probability distributions of the feature vector given the state representing the sound source position, based on the model parameters stored in the parameter storage unit, with the prior probability distribution as the weights; and a sound source position calculation unit that calculates the sound source position corresponding to the feature vectors based on the prior probability distribution calculated by the prior probability distribution calculation unit.

A signal processing method according to the present invention is executed by a signal processing device and includes: a time-frequency analysis step of applying time-frequency analysis to recordings acquired at a plurality of different positions and calculating an observed signal vector, which is an M-dimensional vector; a feature vector calculation step of calculating, for each time-frequency point, a feature vector containing information on the direction of the observed signal vector y(t, f) calculated in the time-frequency analysis step; a prior probability distribution calculation step of acquiring model parameters from a parameter storage unit, which stores model parameters of the conditional probability distribution of the feature vector under the condition that the state representing the sound source position takes the state corresponding to each of a plurality of sound source position candidates, fitting a mixture model to the feature vectors calculated in the feature vector calculation step, and thereby calculating the prior probability distribution of the state representing the sound source position, the mixture model being a weighted sum of the conditional probability distributions of the feature vector given the state representing the sound source position, based on the model parameters, with the prior probability distribution as the weights; and a sound source position calculation step of calculating the sound source position corresponding to the feature vectors based on the prior probability distribution calculated in the prior probability distribution calculation step.

According to the present invention, sound source localization can be performed accurately even when the observed signal is short.

FIG. 1 is a diagram for explaining sound source localization in the present invention.
FIG. 2 is a diagram showing an example of the configuration of the signal processing device according to the first embodiment.
FIG. 3 is a flowchart showing the flow of processing of the signal processing device according to the first embodiment.
FIG. 4 is a diagram showing an example of the configuration of the signal processing device according to the eighth embodiment.
FIG. 5 is a diagram showing an example of the configuration of the signal processing device according to the ninth embodiment.
FIG. 6 is a diagram showing an example of the configuration of the signal processing device according to the tenth embodiment.
FIG. 7 is a diagram showing an example of the configuration of the signal processing device according to the eleventh embodiment.
FIG. 8 is a diagram showing an example of a computer that realizes the signal processing device by executing a program.

Embodiments of the signal processing device, signal processing method, and signal processing program according to the present application are described in detail below with reference to the drawings. The present invention is not limited by these embodiments.

[Sound source localization in the present invention]
A source signal is usually sparse: it has large power only at sparse points on the time-frequency plane. Therefore, even when several source signals are active simultaneously, the observed signal at each time-frequency point can be regarded as containing at most one of the source signals. Consequently, the observed signal vector y(t, f) (t is the frame index, t = 1 to T; f is the frequency bin index, f = 1 to F), which is an M-dimensional column vector consisting of the time-frequency transforms of the observed signals acquired at M (M > 1) different positions, can be regarded as pointing in a specific direction determined by the position of the source whose signal is contained in the observed signal at the time-frequency point (t, f). More precisely, because of noise and reverberation, the direction of y(t, f) is distributed with some spread around that source-specific direction. Using this property of the observed signal, the sound source position can be estimated from the direction of the observed signal vector y(t, f).

In the embodiments of the present invention, sound source localization is formulated as the problem of identifying which of a plurality of (L) sound source position candidates are actually emitting sound, that is, the problem of estimating the set of (indices of) sound source position candidates that are actually emitting sound. The candidates can be, for example, a plurality of locations in the room where localization is performed (for example, the locations corresponding to the grid points obtained when the room is finely divided into a grid). When the region in which sources can exist is known, a plurality of locations within that region can be used as the candidates. For example, when localizing sources in a recording of a conversation among several people seated around a table, the speakers, who are the sources, can be regarded as existing only near the perimeter of the table, so a plurality of locations near the perimeter of the table can be used as the candidates (see FIG. 1).

Based on the property of the observed signal described above, the embodiments of the present invention store model parameters of the conditional probability distribution of the feature vector z(t, f), which is a vector containing information on the direction of the observed signal vector y(t, f), under the condition that the state representing the sound source position takes the state corresponding to each of the L sound source position candidates, and use these model parameters as prior information for estimating the sound source position. As described above, the direction of y(t, f) can be regarded as determined by the sound source position, so the feature vector z(t, f), which contains the direction information of y(t, f), has a specific probability distribution determined by the sound source position. The conditional probability distribution represents the probability distribution of z(t, f) under the condition that the state representing the sound source position takes the state corresponding to each of the L sound source position candidates.

Mathematically, the direction of the observed signal vector y(t, f) means the ratio of its elements across all microphones, y(1, t, f) : y(2, t, f) : ... : y(M, t, f). In other words, it is an element of the (M-1)-dimensional projective space over the complex field, which is the space obtained from the M-dimensional vector space over the complex field by identifying vectors that are scalar multiples of one another. Here, y(m, t, f) denotes the m-th element of the vector y(t, f). Accordingly, saying that the feature vector z(t, f) contains information on the direction of y(t, f) means that, given z(t, f), the element ratio y(1, t, f) : y(2, t, f) : ... : y(M, t, f) is uniquely determined. As such a feature vector, the unit vector parallel to the observed signal vector can be used, for example. The observed signal vector y(t, f) itself naturally contains information on its own direction, so it can also be used as the feature vector.

A feature vector that contains the direction information of y(t, f) carries both phase differences and amplitude ratios as information on the sound source position. This differs substantially from conventional features that use only phase differences and not amplitude ratios, such as Time Difference of Arrival (TDOA) and Direction of Arrival (DOA). Such a feature vector therefore uses more information about the sound source position than those conventional features and allows more accurate localization. Because it extracts as much source-position information as possible from a limited amount of data, it also contributes to the advantage that the embodiments of the present invention can localize sources accurately even when the observed signal is short. It has been shown that using a feature vector containing the direction information of y(t, f) enables more effective signal processing (for example, source separation and noise reduction) than using phase differences alone (reference: H. Sawada, S. Araki, and S. Makino, "Underdetermined Convolutive Blind Source Separation via Frequency Bin-Wise Clustering and Permutation Alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516-527, Mar. 2011).

That such a feature vector carries both phase differences and amplitude ratios can be explained as follows. As stated above, the direction of y(t, f) is the element ratio y(1, t, f) : y(2, t, f) : ... : y(M, t, f) across all microphones. As information, this is equivalent to the set of pairwise element ratios y(m, t, f) : y(n, t, f) over all microphone pairs (m, n). The ratio of two complex numbers is in turn equivalent, as information, to their phase difference together with the ratio of their absolute values (the amplitude ratio). The pairwise element ratios over all microphone pairs are therefore equivalent to the phase differences and amplitude ratios over all microphone pairs. Hence the direction of y(t, f) is equivalent, as information, to the phase differences and amplitude ratios for all microphone pairs (m, n); that is, a feature vector containing the direction information of y(t, f) carries both phase differences and amplitude ratios as information on the sound source position.
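The equivalence described above can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up three-microphone observation; the function names `direction_feature` and `pairwise_ratio` are illustrative and do not appear in the patent. It shows that the unit vector parallel to y preserves every pairwise phase difference and amplitude ratio, and that these are unchanged when y is multiplied by an arbitrary nonzero complex scalar.

```python
import numpy as np

def direction_feature(y):
    """Return the unit vector parallel to the complex observation vector y."""
    norm = np.linalg.norm(y)
    if norm == 0.0:
        raise ValueError("direction is undefined for a zero vector")
    return y / norm

def pairwise_ratio(y, m, n):
    """Phase difference (radians) and amplitude ratio between elements m and n."""
    ratio = y[m] / y[n]
    return np.angle(ratio), np.abs(ratio)

# Hypothetical observation at one time-frequency point with M = 3 microphones.
y = np.array([1.0 + 0.5j, 0.3 - 0.8j, -0.6 + 0.2j])
z = direction_feature(y)

# Multiplying y by any nonzero complex scalar does not change its direction.
scaled = (2.0 - 3.0j) * y

for m, n in [(0, 1), (0, 2), (1, 2)]:
    print(m, n, pairwise_ratio(y, m, n), pairwise_ratio(z, m, n))
```

Note that the unit vector itself still changes under a pure phase rotation of y; only the element ratios, and hence the direction in the projective sense, are invariant.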

With reference to FIG. 1, an example is described in which sound source localization is performed on a recording of a conversation among several people seated around a table. FIG. 1 is a diagram for explaining sound source localization in the present invention. First, as shown in FIG. 1, the signal processing device can use as the sound source position candidates 110 the L points obtained by finely dividing the region around the table 100 at equal intervals. In the example of FIG. 1, L = 8. Three microphones 120 are placed on the table 100. In this example, the sources can be regarded as existing only at the perimeter of the table, and seated height can be regarded as roughly constant across individuals, so a sound source position can be specified by its direction (azimuth) as seen from the microphones 120.

Based on the signals observed by the microphones 120, the signal processing device calculates the observed signal vector y(t, f) and the feature vector z(t, f), which is a vector containing information on the direction of y(t, f). Then, based on the model parameters of the conditional probability distributions, the signal processing device fits a mixture model to the feature vectors z(t, f), where the mixture model is a weighted sum of the conditional probability distributions of z(t, f) under the condition that the state representing the sound source position takes the state corresponding to each of the sound source position candidates. This fitting yields the prior probability distribution, which is the set of weights in the weighted sum. The prior probability distribution calculated in this way takes large values at the actual sound source positions, so the sound source positions can be estimated from it. For example, if the prior probability distribution takes its largest value at the sound source position candidate 110 with l = 2, the sound source can be regarded as lying in the direction indicated by the arrow 130.
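One common way to fit a mixture model whose component distributions are fixed is the EM algorithm with only the weights updated. The sketch below makes that assumption (this passage does not specify the fitting procedure) and uses NumPy. The function name `fit_mixture_weights`, the iteration count, and the synthetic log-likelihood matrix are all illustrative and are not taken from the patent.

```python
import numpy as np

def fit_mixture_weights(log_lik, n_iter=50):
    """Estimate mixture weights by EM while keeping the components fixed.

    log_lik[i, l] is log p(z_i | state l) for feature vector i and
    sound source position candidate l (0-based index here).
    """
    n_points, n_states = log_lik.shape
    weights = np.full(n_states, 1.0 / n_states)
    for _ in range(n_iter):
        # E-step: posterior probability of each state at each point,
        # computed in the log domain for numerical stability.
        log_post = log_lik + np.log(np.maximum(weights, 1e-300))
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: each weight is the average posterior of its state.
        weights = post.mean(axis=0)
    return weights

# Synthetic example: 200 points, L = 8 candidates, and every point is
# far more likely under candidate index 2 than under the others.
log_lik = np.full((200, 8), -5.0)
log_lik[:, 2] = 0.0
weights = fit_mixture_weights(log_lik)
estimated_candidate = int(np.argmax(weights))
```

In this sketch the estimated weights play the role of the prior probability distribution, and the candidate with the largest weight is taken as the source position. With several active sources, more than one weight would be large, so a threshold or peak picking would replace the single `argmax`.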

[First Embodiment]
The signal processing device according to the first embodiment estimates the set of sound source positions under the condition that the number of sources N is unknown. N may be 0, which corresponds to the set of sound source positions being empty. In this embodiment, the signal processing device uses the direction vector of the observed signal vector y(t, f) as the feature vector z(t, f) containing the direction information of y(t, f); uses the states corresponding to the respective sound source position candidates as the states representing the sound source position; uses the complex Watson distribution as the conditional probability distribution of z(t, f) under the condition that the state representing the sound source position takes the state corresponding to each of the candidates; calculates and stores the model parameters of the complex Watson distribution based on the assumption that the target signals propagate as spherical waves; and uses a time-invariant prior probability distribution as the prior probability distribution.

The configuration of the signal processing device according to the first embodiment is described with reference to FIG. 2. FIG. 2 is a diagram showing an example of the configuration of the signal processing device according to the first embodiment. As shown in FIG. 2, the signal processing device 1 includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50.

The time-frequency analysis unit 10 applies time-frequency analysis to the observed signals y(m, τ) from M microphones (m is the microphone index, m = 1 to M; τ is the time index), which are recordings acquired at a plurality of different positions, and calculates the time-frequency transforms y(m, t, f) of the observed signals (t is the frame index, t = 1 to T; f is the frequency bin index, f = 1 to F). It then forms the observed signal vector y(t, f), an M-dimensional column vector consisting of y(m, t, f) (m = 1 to M). The recordings acquired at the plurality of different positions may have undergone some preprocessing after acquisition, such as dereverberation or spatial whitening (references: T. Yoshioka, T. Nakatani, M. Miyoshi, and H. G. Okuno, "Blind separation and dereverberation of speech mixtures by joint optimization," IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 1, pp. 69-84, 2011; H. Sawada, S. Araki, and S. Makino, "Underdetermined Convolutive Blind Source Separation via Frequency Bin-Wise Clustering and Permutation Alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516-527, Mar. 2011).
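As an illustration of this step, the sketch below builds observed signal vectors with a short-time Fourier transform, which is one common form of time-frequency analysis. The text does not name a specific transform, so the use of the STFT, the Hann window, the frame length, and the hop size are all assumptions. The function name `observation_vectors` is hypothetical.

```python
import numpy as np

def observation_vectors(x, frame_len=512, hop=256):
    """Compute observed signal vectors from multichannel time signals.

    x has shape (M, n_samples): one row per microphone.
    Returns y with shape (T, F, M), so that y[t, f] is the M-dimensional
    observed signal vector at frame t and frequency bin f.
    """
    n_mics, n_samples = x.shape
    if n_samples < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (n_samples - frame_len) // hop
    n_bins = frame_len // 2 + 1
    y = np.empty((n_frames, n_bins, n_mics), dtype=complex)
    for t in range(n_frames):
        # Window the same time span on every channel, then transform.
        frame = x[:, t * hop : t * hop + frame_len] * window
        y[t] = np.fft.rfft(frame, axis=1).T  # (F, M)
    return y

# Hypothetical input: 3 microphones, 4096 samples of white noise.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4096))
y = observation_vectors(x)
print(y.shape)
```

Indices in the code are 0-based, whereas the text counts frames and bins from 1.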

The feature vector calculation unit 20 receives the observed signal vector y(t, f) from the time-frequency analysis unit 10 and calculates the feature vector z(t, f), which is a vector containing information on the direction of y(t, f), by Equation (1).
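The image of Equation (1) is not reproduced in this text. Judging from the surrounding description (a Euclidean norm, an assignment arrow, and a feature vector that is the direction vector of the observed signal vector), it presumably takes the following form:

```latex
\boldsymbol{z}(t,f) \leftarrow \frac{\boldsymbol{y}(t,f)}{\lVert \boldsymbol{y}(t,f) \rVert} \tag{1}
```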

Here, ||·|| is the Euclidean norm, and the arrow ← denotes assigning the right-hand side to the left-hand side. In the modeling of this embodiment, the observed signal vector y(t, f) is assumed to consist of N target signals (N may be unknown and may be 0) and to contain no background noise. The modeling in the embodiments of the present invention also assumes that each target signal is sparse, having large power only at sparse points on the time-frequency plane. Based on these assumptions, this embodiment assumes that the observed signal vector y(t, f) contains only one target signal at each time-frequency point. That is, y(t, f) is modeled by Equation (2).
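The image of Equation (2) is likewise missing. Given that the observed signal vector is said to consist of a single target signal with time-frequency transform s(n, t, f) and steering vector h(n, f), a form consistent with the description would be:

```latex
\boldsymbol{y}(t,f) = s(n,t,f)\,\boldsymbol{h}(n,f) \tag{2}
```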

Here, s(n, t, f) is the time-frequency transform of the n-th target signal, and n is the target signal index (n = 1 to N). The vector h(n, f) is the steering vector representing the spatial transfer characteristics of the n-th target signal, and takes a value specific to the sound source position of the n-th target signal. Equation (2) expresses that the observed signal vector y(t, f) consists only of the n-th target signal, where n varies with the time-frequency point (t, f).

The direction of the observed signal vector y(t, f) in the M-dimensional complex vector space (that is, the one-dimensional subspace spanned by y(t, f) in that space) is a specific direction determined by the position of the source whose signal is contained in the observed signal at the time-frequency point (t, f); specifically, it is the direction of the steering vector h(n, f). More precisely, because of noise and reverberation, the direction of y(t, f) is distributed with some spread around that source-specific direction. In this embodiment, the feature vector of Equation (1), which is the direction vector of y(t, f), is used as the feature vector containing the direction information of y(t, f).

In this embodiment, the conditional probability distribution of the feature vector z(t, f), under the condition that the state representing the sound source position takes the state corresponding to each of the L sound source position candidates, is modeled by the complex Watson distribution. Other probability distributions can also be used, such as the complex Bingham distribution, the complex angular central Gaussian distribution, the complex Gaussian distribution, and mixtures of complex Watson, complex Bingham, complex angular central Gaussian, or complex Gaussian distributions. That is, the feature vector z(t, f) is modeled by Equation (3).
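The image of Equation (3) is not reproduced here. Using the symbols defined in the following paragraphs (the state g(t, f), the mean direction vector a(l, f), the concentration parameter κ(l, f), and the complex Watson density W), it presumably reads:

```latex
p\bigl(\boldsymbol{z}(t,f) \mid g(t,f) = l\bigr)
  = \mathcal{W}\bigl(\boldsymbol{z}(t,f);\, \boldsymbol{a}(l,f),\, \kappa(l,f)\bigr) \tag{3}
```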

Here, g(t, f) is the state representing the sound source position at the time-frequency point (t, f). In this embodiment, the state representing the sound source position takes one of the values 1 to L, corresponding to the L sound source position candidates. State l is defined as the state in which the source whose signal is contained in the observed signal vector y(t, f) at the time-frequency point (t, f) is located at the l-th sound source position candidate. p(z(t, f) | g(t, f) = l) is the conditional probability distribution of the feature vector z(t, f) under the condition g(t, f) = l. The vector a(l, f) is a model parameter that determines the mean direction of z(t, f) for the l-th sound source position candidate; it is called the mean direction vector and satisfies Equation (4). κ(l, f) is a model parameter that determines how strongly the probability distribution of z(t, f) for the l-th sound source position candidate is concentrated around the mean direction vector a(l, f); it is called the concentration parameter.
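Equation (4) is also missing from this text. Because the mean direction vector of a complex Watson distribution is a unit vector, and Equation (7) is later described as normalized so that this constraint holds, Equation (4) is presumably the unit-norm constraint:

```latex
\lVert \boldsymbol{a}(l,f) \rVert = 1 \tag{4}
```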

W(z; a, κ) is the complex Watson distribution of a vector z with mean direction vector a and concentration parameter κ, and is given by equation (5).

Here, K is the Kummer function (the confluent hypergeometric function of the first kind) defined by the infinite series of equation (6), and the superscript H denotes the Hermitian transpose. For i = 0, the factor ξ(ξ+1)...(ξ+i-1) / [η(η+1)...(η+i-1)] is defined to be 1.
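As an illustration, the density can be evaluated numerically as sketched below. The sketch assumes that equation (5) has the standard form of the complex Watson density for unit-norm M-dimensional vectors, W(z; a, κ) = (M-1)! / (2 π^M K(1, M; κ)) · exp(κ |a^H z|^2), and that equation (6) is the usual series K(ξ, η; κ) = Σ_i [ξ(ξ+1)...(ξ+i-1)] / [η(η+1)...(η+i-1)] · κ^i / i!. The function names are hypothetical. The series is summed in the log domain because κ may be large, in which case a direct sum would overflow.

```python
import math


def log_kummer(xi, eta, kappa, tol=1e-15, max_terms=1_000_000):
    """Return ln K(xi, eta; kappa) by summing the series term by term.

    Intended for 0 < xi <= eta and kappa >= 0, which covers K(1, M; kappa).
    """
    if kappa == 0.0:
        return 0.0  # only the i = 0 term, defined as 1, survives
    log_terms = [0.0]  # logarithm of the i = 0 term
    log_t = 0.0
    peak = 0.0
    for i in range(1, max_terms):
        # term_i = term_{i-1} * (xi + i - 1) / (eta + i - 1) * kappa / i
        log_t += math.log((xi + i - 1) / (eta + i - 1)) + math.log(kappa / i)
        log_terms.append(log_t)
        peak = max(peak, log_t)
        # Past i > kappa the terms decrease; stop once they are negligible.
        if i > kappa and log_t < peak + math.log(tol):
            break
    return peak + math.log(sum(math.exp(v - peak) for v in log_terms))


def log_complex_watson(z, a, kappa):
    """Return ln W(z; a, kappa) for unit-norm complex sequences z and a."""
    m = len(z)
    inner = sum(ai.conjugate() * zi for ai, zi in zip(a, z))  # a^H z
    log_norm = (math.lgamma(m) - math.log(2.0) - m * math.log(math.pi)
                - log_kummer(1.0, float(m), kappa))
    return log_norm + kappa * abs(inner) ** 2
```

With κ = 0 the density reduces to the uniform density on the complex unit sphere, (M-1)! / (2 π^M), which gives a simple sanity check.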

The parameter storage unit 30 stores the model parameters of the conditional probability distribution of the feature vector under the condition that the state representing the sound source position takes the state corresponding to each of the plurality of sound source position candidates. Specifically, the parameter storage unit 30 stores the mean direction vectors a(l, f) (l = 1 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 1 to L, f = 1 to F), which are the model parameters of the conditional probability distribution of equation (3) and model the sound source position. In the present embodiment, these model parameters are calculated as follows. Based on the assumption that the target signal propagates as a spherical wave, the m-th element of the mean direction vector a(l, f) is calculated by equation (7).

Here, the vector q(m) is a three-dimensional real vector giving the Cartesian coordinates of the m-th microphone (assumed known in the present embodiment), the vector r(l) is a three-dimensional real vector giving the Cartesian coordinates of the l-th sound source position candidate (known), j is the imaginary unit, ω(f) is the angular frequency of the f-th frequency bin, and c is the speed of sound. The subscript m on the left-hand side denotes the m-th element, and the square-root term in the denominator on the right-hand side is a normalization factor that makes the mean direction vector a(l, f) satisfy the constraint of equation (4).
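A minimal sketch of how the mean direction vectors might be computed follows. Because the image of equation (7) is not reproduced in this text, the sketch assumes the usual spherical-wave model: each element has amplitude 1/d and phase exp(-j ω d / c), where d = ||q(m) - r(l)|| is the distance from the microphone to the candidate, and the vector is divided by its Euclidean norm so that it has unit norm. The function name and the value c = 340 m/s are illustrative assumptions.

```python
import cmath
import math


def mean_direction_vector(mic_positions, candidate, omega, c=340.0):
    """Return a unit-norm steering vector for one candidate and one frequency.

    mic_positions: list of (x, y, z) microphone coordinates in metres.
    candidate:     (x, y, z) coordinates of the sound source position candidate.
    omega:         angular frequency of the frequency bin in rad/s.
    c:             speed of sound in m/s (assumed value).
    """
    dists = [math.dist(q, candidate) for q in mic_positions]
    # Assumed spherical-wave model: 1/d attenuation and a delay of d / c.
    raw = [cmath.exp(-1j * omega * d / c) / d for d in dists]
    norm = math.sqrt(sum(abs(v) ** 2 for v in raw))
    return [v / norm for v in raw]


# Example: three microphones 5 cm apart, one candidate, a 1 kHz bin.
mics = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (0.0, 0.05, 0.0)]
a = mean_direction_vector(mics, (1.0, 0.5, 0.0), 2.0 * math.pi * 1000.0)
```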

The concentration parameter κ(l, f), on the other hand, is calculated by equation (8) on the assumption that it is proportional to, for example, the inverse square of the frequency ω(f)/2π. Equation (8) is based on the property that the direction of the observed signal vector y(t, f) has a smaller variance (a higher concentration) at lower frequencies. Taking this property into account appropriately allows the prior probability distribution to be estimated accurately, and hence sound source localization based on it to be performed accurately. The proportionality constant β may be chosen in any manner; for example, β = 6.4 × 10^7 Hz^2 may be used.
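The frequency dependence can be sketched as below, assuming that equation (8) reads κ(l, f) = β / (ω(f)/2π)^2, which is what proportionality to the inverse square of the frequency with constant β suggests. Under that assumption κ does not depend on the candidate index l. The function name is hypothetical.

```python
import math

BETA_HZ2 = 6.4e7  # example proportionality constant from the text, in Hz^2


def concentration(omega, beta=BETA_HZ2):
    """Return kappa for angular frequency omega (rad/s), assuming kappa = beta / f^2.

    The zero-frequency bin (omega = 0) must be excluded by the caller.
    """
    freq_hz = omega / (2.0 * math.pi)
    return beta / freq_hz ** 2
```

With the example value of β, this gives κ = 64 at 1 kHz and κ = 1 at 8 kHz, so lower frequencies receive the higher concentration described above.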

Next, the modeling of the marginal probability distribution of the feature vector z(t, f) in the present embodiment is described. In the present embodiment, the marginal probability distribution of z(t, f) is modeled by the mixture model of equation (9), which is the weighted sum of the conditional probability distributions p(z(t, f) | g(t, f) = l), with the prior probability distribution P(g(t, f) = l) of the state g(t, f) representing the sound source position as the weights.
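Written out from this description (the equation image itself is not reproduced in this text), the mixture model of equation (9) presumably reads:

```latex
\begin{equation}
p\bigl(z(t,f)\bigr) = \sum_{l=1}^{L} P\bigl(g(t,f)=l\bigr)\, p\bigl(z(t,f) \mid g(t,f)=l\bigr) \tag{9}
\end{equation}
```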

Whereas the conditional probability distribution p(z(t, f) | g(t, f) = l) is the probability distribution of the feature vector z(t, f) when the state representing the sound source position is known, the marginal probability distribution p(z(t, f)) of equation (9) is the probability distribution of z(t, f) when that state is unknown. The prior probability distribution P(g(t, f) = l) may be either time-varying or time-invariant. In the former case, P(g(t, f) = l) can take a different value in each time interval (for example, each frame). In the latter case, P(g(t, f) = l) takes the same value regardless of the time interval (for example, the frame).

When the prior probability distribution P(g(t, f) = l) is time-invariant, the prior probability distribution, which takes large values at the sound source positions, is estimated from all time intervals (for example, all frames). More data is therefore available for the estimation than in the time-varying case, so the sound source positions can be estimated more accurately in situations with no source movement or speaker change. On the other hand, the sound source position cannot be estimated for each time interval (for example, each frame). The time-varying case is therefore better suited to tracking, diarization, and the like in dynamic situations involving source movement or speaker changes.

When the prior probability distribution P(g(t, f) = l) is time-varying, on the other hand, the prior probability distribution, a function that takes large values at the sound source positions, is estimated for each time interval (for example, each frame). In addition to allowing the sound source position to be estimated for each time interval, this allows tracking and diarization to be performed on the basis of the per-interval estimates. For example, in speech recognition of multi-party conversation, the segments to which speech recognition should be applied must be extracted by diarization, which estimates who spoke when, in order to avoid misrecognizing noise as speech. The time-varying case is applicable to such uses as well.

In the present embodiment, the prior probability distribution P(g(t, f) = l) is assumed to be time-invariant. It is further assumed to be independent of frequency. That is, in the present embodiment, P(g(t, f) = l) is assumed not to depend on the frame or the frequency bin, and is denoted by α(l), where α(l) satisfies the constraint α(1) + ... + α(L) = 1. With a frequency-independent prior probability distribution, the prior can be estimated using the information in the feature vectors z(t, f) observed at all frequencies. Compared with a frequency-dependent prior, more information is therefore available for the estimation, so the prior probability distribution can be estimated, and sound source localization based on it performed, more accurately, including when the observed signal is short.

The prior probability distribution calculation unit 40 fits a mixture model to the feature vectors calculated by the feature vector calculation unit 20 and thereby calculates the prior probability distribution α(l) (l = 1 to L). The mixture model is the weighted sum of the conditional probability distributions of the feature vector given that the state representing the sound source position is known. These conditional distributions are based on the model parameters stored in the parameter storage unit 30, namely the mean direction vectors a(l, f) and the concentration parameters κ(l, f), and the weights are the prior probability distribution α(l) (l = 1 to L) of the state representing the sound source position.

There are various methods of fitting the mixture model of equation (9) to the feature vectors z(t, f). For example, the likelihood of equation (9) is taken as the objective function (other objective functions, such as the posterior probability, may also be used) and is maximized with respect to the prior probability distribution α(l) (l = 1 to L) by a gradient method (it may also be maximized by other methods, such as the expectation-maximization (EM) algorithm).

The method based on the gradient method has an advantage in computational cost over the method based on the EM algorithm. The EM-based method must calculate, at every iteration, the contribution ratio of each sound source position candidate at each time-frequency point in addition to the prior probability distribution α(l) (l = 1 to L). In contrast, the gradient method needs to calculate only the prior probability distribution α(l) (l = 1 to L) at each iteration, so the amount of computation can be reduced substantially compared with the EM algorithm. The processing in the prior probability distribution calculation unit 40 is, for example, as follows.
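For contrast, the textbook EM update for the weights of a mixture with fixed components is sketched below. It is not taken from this description; it only illustrates the per-point contribution ratios that an EM-based method evaluates at every iteration. Here w_points is a hypothetical list holding, for each time-frequency point, the L component likelihoods W(z(t, f); a(l, f), κ(l, f)).

```python
def em_weight_update(alpha, w_points):
    """One textbook EM iteration for mixture weights with fixed components."""
    num_candidates = len(alpha)
    totals = [0.0] * num_candidates
    for w in w_points:
        denom = sum(a * wl for a, wl in zip(alpha, w))  # mixture likelihood
        for l in range(num_candidates):
            # Contribution ratio of candidate l at this time-frequency point.
            totals[l] += alpha[l] * w[l] / denom
    # New weights are the contribution ratios averaged over all points.
    return [t / len(w_points) for t in totals]
```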

First, α(l) is initialized by α(l) ← 1/L (l = 1 to L). Next, the updates of α(l) (l = 1 to L) by equations (10) and (11) below are applied alternately, and this is repeated a predetermined number of times (for example, 10 times).

Then α(l) (l = 1 to L) is output. Here, the vector α is the L-dimensional column vector consisting of α(l) (l = 1 to L), the vector w(t, f) is the L-dimensional column vector consisting of W(z(t, f); a(l, f), κ(l, f)) (l = 1 to L), the superscript T denotes the transpose, and λ is a predetermined positive constant (for example, λ = 1).
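The iteration can be sketched as follows. The images of equations (10) and (11) are not reproduced in this text, so the sketch assumes that (10) is a plain gradient-ascent step on the log-likelihood, α ← α + λ Σ_{t,f} w(t, f) / (α^T w(t, f)), and that (11) rescales α to unit sum. The actual equations may differ, for example in how the step is scaled. The names fit_prior and w_points are hypothetical; w_points holds one vector w(t, f) per time-frequency point.

```python
def fit_prior(w_points, num_iters=10, lam=1.0):
    """Estimate the prior alpha(l) from the component likelihood vectors w(t, f)."""
    num_candidates = len(w_points[0])
    alpha = [1.0 / num_candidates] * num_candidates  # uniform initialization
    for _ in range(num_iters):
        grad = [0.0] * num_candidates
        for w in w_points:
            denom = sum(a * wl for a, wl in zip(alpha, w))  # alpha^T w(t, f)
            for l in range(num_candidates):
                grad[l] += w[l] / denom
        # Assumed form of equation (10): gradient-ascent step of size lam.
        alpha = [a + lam * g for a, g in zip(alpha, grad)]
        # Assumed form of equation (11): rescale so that the entries sum to 1.
        total = sum(alpha)
        alpha = [a / total for a in alpha]
    return alpha
```

Because every gradient entry is non-negative, the entries of α stay positive under this assumed update, and rescaling is enough to restore the sum-to-one constraint.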

The derivation of equations (10) and (11) is now described. The likelihood, which is the objective function, is the probability of observing z(t, f) (t = 1 to T, f = 1 to F), and is given by equation (12).

Maximizing equation (12) is equivalent to maximizing its natural logarithm, equation (13).

Here, ln denotes the natural logarithm, and the triangle (△) above the equals sign indicates a definition. Taking the gradient of equation (13) gives equation (14), from which equation (10) follows. Equation (11), on the other hand, is an operation that makes α(l) satisfy the constraint α(1) + ... + α(L) = 1. In equation (13), the unweighted sum may be replaced by a weighted sum with reliability-based weights, and the resulting objective function used instead. This gives larger weights to the feature vectors at highly reliable time-frequency points and can improve the accuracy of the prior probability distribution estimation and of the sound source localization based on it. For example, on the assumption that time-frequency points where the norm of the observed signal vector y(t, f) is small correspond to noise and those where the norm is large correspond to the target signal, the norm can be used as the reliability-based weight.
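The images of equations (12) to (14) are not reproduced in this text. Assuming that the time-frequency points are treated as independent, the derivation described above can be written out as follows. The symbol ℓ(α) for the log-likelihood is an assumption introduced here, and α and w(t, f) are the column vectors defined earlier.

```latex
\begin{align}
p\bigl(\{z(t,f)\}_{t,f}\bigr)
  &= \prod_{t=1}^{T}\prod_{f=1}^{F}\sum_{l=1}^{L}
     \alpha(l)\, W\bigl(z(t,f);\, a(l,f),\, \kappa(l,f)\bigr) \tag{12}\\
\ell(\alpha)
  &\triangleq \sum_{t=1}^{T}\sum_{f=1}^{F}
     \ln\bigl(\alpha^{\mathsf{T}} w(t,f)\bigr) \tag{13}\\
\frac{\partial \ell(\alpha)}{\partial \alpha}
  &= \sum_{t=1}^{T}\sum_{f=1}^{F}
     \frac{w(t,f)}{\alpha^{\mathsf{T}} w(t,f)} \tag{14}
\end{align}
```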

The sound source position calculation unit 50 receives the prior probability distribution α(l) (l = 1 to L) from the prior probability distribution calculation unit 40, calculates the set J of peak positions of α(l) (l = 1 to L), and calculates and outputs the set G of sound source positions based on J.

The set J of peak positions can be calculated, for example, as follows. For each index l = 1 to L, the set of indices of the sound source position candidates adjacent to the l-th candidate is assumed to be known. An index l is defined to be a peak position if α(l) > α(l') holds for the indices l' of all sound source position candidates adjacent to the l-th candidate. The set J is obtained by testing, for each index l = 1 to L, whether l is a peak position. Based on this set J, the set G of detected sound source positions, which is either a set of indices l designating sound source positions or a set of coordinates (Cartesian, polar, spherical, or the like), can be calculated as follows.

For example, the set J of peak positions may be used as it is as the set G of detected sound source positions, or the set {l ∈ J | α(l) > S} of the peak positions l whose peak value α(l) exceeds a predetermined threshold S may be used as G. The threshold S may be chosen in any manner; for example, S = 1/L may be used. Alternatively, the set {r(l) | l ∈ J} of the vectors r(l), the coordinates of the sound source position candidates corresponding to the peak positions l, may be used as G, or the set {r(l) | l ∈ J, α(l) > S} of the coordinates of the candidates corresponding to the peak positions l whose peak value α(l) exceeds the threshold S may be used as G.
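The peak test and the optional thresholding can be sketched as follows. The function name, the 0-based indexing, and the ring-shaped adjacency in the example are illustrative assumptions; the description only requires that the set of adjacent candidates be known for each candidate.

```python
def detect_sources(alpha, neighbors, threshold=None):
    """Return the indices of detected sound source position candidates.

    alpha:     list of prior probabilities, one per candidate (0-based).
    neighbors: mapping from a candidate index to the indices adjacent to it.
    threshold: if given, keep only peaks whose value exceeds it.
    """
    # A candidate is a peak if it strictly exceeds all of its neighbors.
    peaks = [l for l in range(len(alpha))
             if all(alpha[l] > alpha[n] for n in neighbors[l])]
    if threshold is None:
        return peaks
    return [l for l in peaks if alpha[l] > threshold]


# Example: six candidates arranged on a ring, each adjacent to two others.
num_candidates = 6
ring = {l: [(l - 1) % num_candidates, (l + 1) % num_candidates]
        for l in range(num_candidates)}
alpha = [0.02, 0.60, 0.05, 0.10, 0.03, 0.20]
all_peaks = detect_sources(alpha, ring)                        # [1, 3, 5]
strong_peaks = detect_sources(alpha, ring, 1.0 / num_candidates)  # [1, 5]
```

In the example, index 3 is a local peak but its value 0.10 does not exceed S = 1/L, so the thresholded variant discards it.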

[Processing of the First Embodiment]
The flow of processing of the signal processing device 1 is described with reference to FIG. 3. FIG. 3 is a flowchart showing the flow of processing of the signal processing device according to the first embodiment. As shown in FIG. 3, the time-frequency analysis unit 10 first performs time-frequency analysis on the observed signals and calculates the observed signal vectors (step S11).

Next, the feature vector calculation unit 20 calculates the feature vectors, which are vectors containing information on the direction of the observed signal vectors y(t, f) (step S12). The prior probability distribution calculation unit 40 then acquires, from the parameter storage unit 30, the parameters of the model of the conditional probability distribution of the feature vector under the condition that the state representing the sound source position takes the state corresponding to each of the plurality of sound source position candidates (step S13).

Next, the prior probability distribution calculation unit 40 initializes the prior probability distribution of the states representing the sound source positions (step S14). The prior probability distribution calculation unit 40 then updates the prior probability distribution (step S15).

At this time, the prior probability distribution calculation unit 40 models the marginal probability distribution of the feature vector using, for example, a mixture model in which the conditional probability distributions of the feature vector, represented by the model parameters acquired from the parameter storage unit 30, are weighted by the prior probability distribution. Using the gradient method, the prior probability distribution calculation unit 40 updates the prior probability distribution so as to maximize the likelihood of this marginal probability distribution, which is taken as the objective function. If the update of the prior probability distribution has not yet been repeated the predetermined number of times (step S16, No), the prior probability distribution calculation unit 40 updates the prior probability distribution again (step S15).

If, on the other hand, the update of the prior probability distribution has been repeated the predetermined number of times (step S16, Yes), the sound source position calculation unit 50 calculates the sound source positions based on the prior probabilities calculated by the prior probability distribution calculation unit 40 (step S17). At this time, the sound source position calculation unit 50 can, for example, take the sound source positions at which the prior probability peaks as the calculation result.

[Effects of the First Embodiment]
The time-frequency analysis unit 10 applies a time-frequency transform to the recorded sounds acquired at M different positions and calculates the observed signal vectors, which are M-dimensional vectors. The feature vector calculation unit 20 calculates, for each time-frequency point, the feature vector, which is a vector containing information on the direction of the observed signal vector y(t, f) calculated by the time-frequency analysis unit 10. The parameter storage unit 30 stores the model parameters of the conditional probability distribution of the feature vector under the condition that the state representing the sound source position takes the state corresponding to each of the plurality of sound source position candidates.

The prior probability distribution calculation unit 40 fits a mixture model to the feature vectors calculated by the feature vector calculation unit 20 and thereby calculates the prior probability distribution. The mixture model is the weighted sum of the conditional probability distributions of the feature vector given that the state representing the sound source position is known, based on the model parameters stored in the parameter storage unit 30, with the prior probability distribution of the state representing the sound source position as the weights. The sound source position calculation unit 50 then calculates the sound source positions corresponding to the feature vectors based on the prior probability distribution calculated by the prior probability distribution calculation unit 40.

As described above, according to the first embodiment, the prior probability distribution, which can be regarded as a spatial spectrum (a function that takes large values at the sound source positions), can be calculated without using the covariance matrix of the observed signal vectors, so sound source localization can be performed accurately even when the observed signal is short. The first embodiment is therefore advantageous in dynamic situations, such as those where the sound source position changes over time or conversations with speaker changes, over conventional sound source localization methods such as the Capon method and the MUSIC method, for which accurate localization is difficult when the observed signal is short.

Further, according to the first embodiment, the position of each sound source can be estimated even when source signals from a plurality of sound sources are mixed. The first embodiment is therefore advantageous in situations with multiple sound sources, such as conversations with overlapping speech, over conventional methods such as the delay-and-sum array and the generalized cross-correlation method, for which estimating multiple sound source positions is difficult. Sound source localization can also be performed when the number of sound sources is unknown. In practical applications the number of sound sources is often not known in advance, and the present embodiment enables localization even then. This is an advantage over conventional methods such as the MUSIC method, which require prior information on the number of sound sources.

Furthermore, the prior probability distribution obtained by the method of the first embodiment can be used in various applications such as tracking, diarization, mask estimation, speech enhancement, and speech recognition. In addition, by using a frequency-independent prior probability distribution, the first embodiment can estimate the prior using the information in the feature vectors z(t, f) observed at all frequencies (as can also be seen from the fact that equation (10) updates the vector α using the vectors w(t, f) at all frequencies). Compared with a frequency-dependent prior, more information is therefore available for the estimation, so the prior probability distribution can be estimated, and sound source localization based on it performed, more accurately, including when the observed signal is short.

Although batch processing, in which the observed signals of all frames are processed at once, has been described above, block-batch processing (or online processing), in which the observed signals are processed and the sound source positions estimated frame by frame (or every several frames), may also be used.

[Second Embodiment]
Next, the configuration of the second embodiment is described. The second embodiment is an example of estimating sound source positions based on the present invention, and modifies the first embodiment by using a time-varying prior probability distribution as the prior probability distribution. That is, in the second embodiment, the prior probability distribution is estimated for each time interval (for example, each frame). In addition to allowing the sound source position to be estimated for each time interval, this allows tracking and diarization to be performed on the basis of the per-interval estimates.

An example of the configuration of the signal processing device according to the second embodiment is shown in FIG. 2, as for the signal processing device 1 according to the first embodiment. The signal processing device 1 according to the second embodiment includes the time-frequency analysis unit 10, the feature vector calculation unit 20, the parameter storage unit 30, the prior probability distribution calculation unit 40, and the sound source position calculation unit 50. The time-frequency analysis unit 10, the feature vector calculation unit 20, and the parameter storage unit 30 are the same as in the first embodiment, so the following describes in detail the units that differ, namely the prior probability distribution calculation unit 40 and the sound source position calculation unit 50. The main difference between the first embodiment and the present embodiment is as follows. In the first embodiment, the prior probability distribution calculation unit 40 calculates a prior probability distribution that does not depend on the time interval, and the sound source position calculation unit 50 calculates, based on it, sound source positions that do not depend on the time interval. In the present embodiment, in contrast, the prior probability distribution calculation unit 40 calculates a prior probability distribution for each time interval, and the sound source position calculation unit 50 calculates, based on it, the sound source positions for each time interval.

The prior probability distribution calculation unit 40 fits the mixture model of equation (15) to the feature vectors z(t, f) calculated by the feature vector calculation unit 20 and thereby calculates the prior probability distribution α(l, t) (l = 1 to L, t = 1 to T). The mixture model of equation (15) is the weighted sum of the conditional probability distributions of the feature vector z(t, f) given that the state representing the sound source position is known. These conditional distributions are based on the model parameters stored in the parameter storage unit 30, namely the mean direction vectors a(l, f) and the concentration parameters κ(l, f), and the weights are the prior probability distribution α(l, t) (l = 1 to L, t = 1 to T) of the state representing the sound source position. Here, α(l, t) satisfies the constraint α(1, t) + ... + α(L, t) = 1.
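The image of equation (15) is not reproduced in this text. Following the description above, it presumably reads:

```latex
\begin{equation}
p\bigl(z(t,f)\bigr) = \sum_{l=1}^{L} \alpha(l,t)\,
  W\bigl(z(t,f);\, a(l,f),\, \kappa(l,f)\bigr) \tag{15}
\end{equation}
```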

Note that, unlike in the first embodiment, the weights in equation (15) are the time-varying α(l, t) rather than the time-invariant α(l). There are various methods of fitting the mixture model of equation (15) to the feature vectors z(t, f); for example, the likelihood of equation (15) is maximized by the gradient method.

The processing in the prior probability distribution calculation unit 40 is, for example, as follows.
1. Initialize the prior probability distribution α(l, t) by α(l, t) ← 1/L (l = 1 to L, t = 1 to T).
2. Alternately apply the updates of the prior probability distribution α(l, t) (l = 1 to L, t = 1 to T) by equations (16) and (17) below, repeating this a predetermined number of times (for example, 10 times).

3. Output the prior probability distribution α(l, t) (l = 1 to L, t = 1 to T).

Here, the vector α(t) is the L-dimensional column vector consisting of α(l, t) (l = 1 to L). The derivation of equations (16) and (17) is the same as that of equations (10) and (11) and is therefore omitted.
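Steps 1 to 3 can be sketched per frame as follows. As the images of equations (16) and (17) are not reproduced in this text, the sketch assumes that they are the per-frame counterparts of a gradient-ascent step and a rescaling to unit sum, with the gradient for frame t summed over the frequency bins of that frame only. The function name and the layout of w_frames are hypothetical.

```python
def fit_prior_per_frame(w_frames, num_iters=10, lam=1.0):
    """Estimate alpha(l, t) separately for each frame t.

    w_frames[t][f] is the list of the L component likelihoods
    W(z(t, f); a(l, f), kappa(l, f)) at time-frequency point (t, f).
    """
    result = []
    for w_bins in w_frames:  # each frame is fitted from its own bins only
        num_candidates = len(w_bins[0])
        alpha_t = [1.0 / num_candidates] * num_candidates  # step 1
        for _ in range(num_iters):                          # step 2
            grad = [0.0] * num_candidates
            for w in w_bins:
                denom = sum(a * wl for a, wl in zip(alpha_t, w))
                for l in range(num_candidates):
                    grad[l] += w[l] / denom
            # Assumed form of equation (16): gradient-ascent step.
            stepped = [a + lam * g for a, g in zip(alpha_t, grad)]
            # Assumed form of equation (17): rescale to unit sum.
            total = sum(stepped)
            alpha_t = [s / total for s in stepped]
        result.append(alpha_t)                              # step 3
    return result
```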

The sound source position calculation unit 50 receives the prior probability distribution α(l, t) (l = 1 to L, t = 1 to T) from the prior probability distribution calculation unit 40, calculates, for each frame, the set J(t) of peak positions of α(l, t), and calculates and outputs, for each frame, the set G(t) of detected sound source positions based on J(t).

ピーク位置の集合Ｊ（ｔ）は例えば次のように計算できる。ｌ番目（ｌ＝１〜Ｌ）の音源位置候補に隣接する音源位置候補の番号の集合（既知と仮定）を集合Ａ（ｌ）で表す。このとき、ピーク位置の集合Ｊ（ｔ）は、「集合Ａ（ｌ）に属する全ての番号ｌ´に対しα（ｌ，ｔ）＞α（ｌ´，ｔ）」となる番号ｌの集合Ｊ（ｔ）＝｛ｌ｜∀ｌ´∈Ａ（ｌ），α（ｌ，ｔ）＞α（ｌ´，ｔ）｝として計算できる。このピーク位置の集合Ｊ（ｔ）に基づいて、音源位置を指定する番号ｌの集合または座標（直交座標、極座標、球座標等）の集合である検出された音源位置の集合Ｇ（ｔ）を次のように計算することができる。 The set J(t) of peak positions can be calculated, for example, as follows. Let A(l) denote the set (assumed known) of indices of the sound source position candidates adjacent to the l-th (l = 1 to L) candidate. The set J(t) of peak positions is then the set of indices l such that α(l, t) > α(l', t) for every index l' in A(l), that is, J(t) = {l | ∀l' ∈ A(l), α(l, t) > α(l', t)}. Based on J(t), the set G(t) of detected sound source positions, which is either a set of indices l designating sound source positions or a set of coordinates (Cartesian, polar, spherical, etc.), can be calculated as follows.

例えば、ピーク位置の集合Ｊ（ｔ）をそのまま検出された音源位置の集合Ｇ（ｔ）とすることができる。また、ピーク位置ｌのうち対応するピーク値α（ｌ，ｔ）が所定の閾値Ｓを超えるものの集合｛ｌ∈Ｊ（ｔ）｜α（ｌ，ｔ）＞Ｓ｝を検出された音源位置の集合Ｇ（ｔ）とすることもできる。ここで閾値Ｓはどのように定めてもよいが、例えばＳ＝１／Ｌとすればよい。また、ピーク位置ｌに対応する音源位置候補の座標であるベクトルｒ（ｌ）の集合｛ｒ（ｌ）｜ｌ∈Ｊ（ｔ）｝を検出された音源位置の集合Ｇ（ｔ）とすることもできる。また、ピーク位置ｌのうちピーク値α（ｌ，ｔ）が所定の閾値Ｓを超えるものに対応する音源位置候補の座標であるベクトルｒ（ｌ）の集合｛ｒ（ｌ）｜ｌ∈Ｊ（ｔ），α（ｌ，ｔ）＞Ｓ｝を検出された音源位置の集合Ｇ（ｔ）としてもよい。 For example, the set J(t) of peak positions can be used directly as the set G(t) of detected sound source positions. Alternatively, G(t) can be the set {l ∈ J(t) | α(l, t) > S} of peak positions l whose peak value α(l, t) exceeds a predetermined threshold S. The threshold S may be chosen freely; for example, S = 1/L. G(t) can also be the set {r(l) | l ∈ J(t)} of vectors r(l), the coordinates of the sound source position candidates corresponding to the peak positions l. Finally, G(t) may be the set {r(l) | l ∈ J(t), α(l, t) > S} of coordinate vectors r(l) of the candidates whose peak value α(l, t) exceeds the threshold S.
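
The peak picking and thresholding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `detect_peaks`, the circular neighbor layout, and the numerical values of α are hypothetical and do not come from the patent, and the code uses 0-based indices where the text uses l = 1 to L.

```python
import numpy as np

def detect_peaks(alpha_t, neighbors, threshold=None):
    """Return peak indices J(t); with a threshold S, return the thresholded set G(t).

    alpha_t   : 1-D array of length L, the prior alpha(l, t) for one frame.
    neighbors : list of length L; neighbors[l] lists the candidates adjacent
                to candidate l (the set A(l), assumed known).
    threshold : optional S; when given, only peaks with alpha > S are kept.
    """
    peaks = []
    for l, adjacent in enumerate(neighbors):
        # l is a peak if it is strictly larger than every adjacent candidate.
        if all(alpha_t[l] > alpha_t[k] for k in adjacent):
            if threshold is None or alpha_t[l] > threshold:
                peaks.append(l)
    return peaks

# Hypothetical example: L = 6 candidates on a circle, two neighbors each.
L = 6
neighbors = [[(l - 1) % L, (l + 1) % L] for l in range(L)]
alpha_t = np.array([0.04, 0.55, 0.10, 0.05, 0.16, 0.10])

J = detect_peaks(alpha_t, neighbors)                   # all local maxima
G = detect_peaks(alpha_t, neighbors, threshold=1 / L)  # peaks above S = 1/L
```

With these values the weak local maximum at index 4 belongs to J but is rejected by the threshold S = 1/L, which is the purpose of the thresholded variant. Mapping the surviving indices to coordinates r(l) gives the coordinate form of G(t).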

［第３の実施形態］
次に、第３の実施形態の構成について説明する。第３の実施形態は、本発明に基づいて音源位置を推定する例であり、第１の実施形態を基にして、音源位置を表す状態として、複数（Ｌ個）の音源位置候補のそれぞれに対応する状態（状態１〜Ｌとする）に加え、背景雑音に対応する状態（状態０とする）も考慮するとともに、音源位置を表す状態が状態０を取る条件下での、特徴ベクトルの条件付き確率分布を、超球面上の一様分布によりモデル化する、という変更を加えたものである。これにより、背景雑音を含む観測信号を適切にモデル化し、背景雑音下でも高精度に音源定位を行うことが可能になるという利点がある。 Third Embodiment
Next, the configuration of the third embodiment will be described. The third embodiment is an example of estimating sound source positions based on the present invention. It modifies the first embodiment in two ways: in addition to the states corresponding to the plurality of (L) sound source position candidates (states 1 to L), a state corresponding to background noise (state 0) is considered as a state representing the sound source position; and the conditional probability distribution of the feature vector, given that the state representing the sound source position is state 0, is modeled by a uniform distribution on the hypersphere. This makes it possible to model observed signals containing background noise appropriately and to perform sound source localization with high accuracy even under background noise.

第３の実施形態に係る信号処理装置の構成の一例は、第１の実施形態に係る信号処理装置１と同様、図２で示される。第３の実施形態に係る信号処理装置１は、時間周波数分析部１０、特徴ベクトル計算部２０、パラメータ記憶部３０、事前確率分布計算部４０、音源位置計算部５０を有する。時間周波数分析部１０、特徴ベクトル計算部２０については、第１の実施形態と同様であるから、以下では相違点であるパラメータ記憶部３０、事前確率分布計算部４０、および音源位置計算部５０について詳しく説明する。第１の実施形態と本実施形態との主な相違点は次の通りである。第１の実施形態では、パラメータ記憶部３０において、音源位置を表す状態が複数の音源位置候補のそれぞれに対応する状態を取る条件下での条件付き確率分布のモデルパラメータを記憶し、事前確率分布計算部４０において、複数の音源位置候補に対応する状態に対する事前確率分布を計算し、音源位置計算部５０において、前記事前確率分布に基づいて音源位置を計算する。これに対し、本実施形態では、パラメータ記憶部３０において、音源位置を表す状態が背景雑音に対応する状態を取る条件下での条件付き確率分布のモデルパラメータをさらに記憶し、事前確率分布計算部４０において、複数の音源位置候補および背景雑音に対応する状態の事前確率分布を計算し、音源位置計算部５０において、前記事前確率分布に基づいて音源位置を計算する。 An example of the configuration of the signal processing device according to the third embodiment is shown in FIG. 2, as for the signal processing device 1 according to the first embodiment. The signal processing device 1 according to the third embodiment includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50. The time-frequency analysis unit 10 and the feature vector calculation unit 20 are the same as in the first embodiment, so the following describes in detail the units that differ: the parameter storage unit 30, the prior probability distribution calculation unit 40, and the sound source position calculation unit 50. The main differences between the first embodiment and the present embodiment are as follows. 
In the first embodiment, the parameter storage unit 30 stores the model parameters of the conditional probability distributions given that the state representing the sound source position is the state corresponding to each of the plurality of sound source position candidates; the prior probability distribution calculation unit 40 calculates the prior probability distribution over the states corresponding to the sound source position candidates; and the sound source position calculation unit 50 calculates the sound source positions based on that prior probability distribution. In the present embodiment, by contrast, the parameter storage unit 30 additionally stores the model parameters of the conditional probability distribution given that the state representing the sound source position is the state corresponding to background noise; the prior probability distribution calculation unit 40 calculates the prior probability distribution over the states corresponding to the sound source position candidates and to background noise; and the sound source position calculation unit 50 calculates the sound source positions based on that prior probability distribution.

まず、本実施形態における観測信号ベクトルｙ（ｔ，ｆ）のモデル化について説明する。本実施形態におけるモデル化では、観測信号ベクトルｙ（ｔ，ｆ）はＮ個（Ｎは未知でもよい。Ｎ＝０でもよい。）の目的信号に加えて背景雑音も含むと仮定する。本実施形態では更に、観測信号ベクトルｙ（ｔ，ｆ）は、各時間周波数点において目的信号のうち高々１つの目的信号を含むと仮定するとともに、背景雑音は全ての時間周波数点において観測信号ベクトルｙ（ｔ，ｆ）に含まれると仮定する。このとき、観測信号ベクトルｙ（ｔ，ｆ）は式（１８）または（１９）のいずれかの式によりモデル化される。 First, modeling of the observed signal vector y (t, f) in the present embodiment will be described. In the modeling in this embodiment, it is assumed that the observation signal vector y (t, f) includes background noise as well as N target signals (N may be unknown; N may be 0). Further, in the present embodiment, it is assumed that the observed signal vector y (t, f) includes at most one target signal of the target signals at each time frequency point, and the background noise is observed signal vectors at all time frequency points. Suppose that it is included in y (t, f). At this time, the observed signal vector y (t, f) is modeled by either equation (18) or (19).

ここで、式（１８）は時間周波数点（ｔ，ｆ）において目的信号のうちｎ番目（ｎは時間周波数点（ｔ，ｆ）によって変化しうる）の目的信号のみが観測信号ベクトルｙ（ｔ，ｆ）に含まれる場合、式（１９）は時間周波数点（ｔ，ｆ）において観測信号ベクトルｙ（ｔ，ｆ）に目的信号が１つも含まれない場合を表しており、ベクトルｓ（ｎ，ｔ，ｆ）はｎ番目の目的信号、ベクトルｖ（ｔ，ｆ）は背景雑音である。 Here, equation (18) represents the case where, at the time-frequency point (t, f), only the n-th target signal (n may vary with the time-frequency point (t, f)) is contained in the observed signal vector y(t, f), and equation (19) represents the case where no target signal is contained in y(t, f) at the time-frequency point (t, f). The vector s(n, t, f) is the n-th target signal, and the vector v(t, f) is the background noise.
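
The equation images for (18) and (19) are not reproduced in this text. From the verbal description above (one target signal plus background noise, or background noise alone), they presumably take the following form. This is a reconstruction for the reader's convenience, not the patent's own typesetting.

```latex
\begin{align}
\mathbf{y}(t,f) &= \mathbf{s}(n,t,f) + \mathbf{v}(t,f), \tag{18}\\
\mathbf{y}(t,f) &= \mathbf{v}(t,f). \tag{19}
\end{align}
```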

第１の実施形態の場合と異なり本実施形態では、式（１９）のように観測信号ベクトルｙ（ｔ，ｆ）に目的信号が１つも含まれず背景雑音のみが含まれる場合も考慮に入れたモデル化がなされており、背景雑音下での観測信号をより正確にモデル化することができる。 Unlike the case of the first embodiment, in the present embodiment, the case where only the background noise is included in the observation signal vector y (t, f) without any target signal as in equation (19) is also taken into consideration. The modeling is performed, and the observation signal under background noise can be modeled more accurately.

上述のように本実施形態では、式（１９）のように観測信号ベクトルｙ（ｔ，ｆ）に目的信号が１つも含まれない場合も考慮する。本実施形態では、このような場合も適切にモデル化できるように、各時間周波数点における観測信号ベクトルが取り得る音源位置を表す状態として、複数の音源位置候補に対応する状態に加えて、背景雑音に対応する状態をさらに考慮する。前者は式（１８）、後者は式（１９）に対応する。 As described above, in the present embodiment, the case where no target signal is included in the observed signal vector y (t, f) as in Expression (19) is also considered. In this embodiment, in order to be able to model appropriately also in such a case, in addition to the state corresponding to a plurality of sound source position candidates, the state representing the sound source position obtainable by the observation signal vector at each time frequency point Further consider the condition corresponding to the noise. The former corresponds to equation (18) and the latter corresponds to equation (19).

以下、時間周波数点（ｔ，ｆ）における前記音源位置を表す状態をｇ（ｔ，ｆ）により表す。ｇ（ｔ，ｆ）＝ｌ（ｌ＝１〜Ｌ）の条件下での特徴ベクトルｚ（ｔ，ｆ）の条件付き確率分布は、第１の実施形態の場合と同様、式（３）の複素ワトソン分布によりモデル化される（他にも複素ビンガム分布、複素角度中心ガウス分布、複素ガウス分布、混合複素ワトソン分布、混合複素ビンガム分布、混合複素角度中心ガウス分布、混合複素ガウス分布等の確率分布によりモデル化することができる）。 Hereinafter, the state representing the sound source position at the time-frequency point (t, f) is denoted by g(t, f). As in the first embodiment, the conditional probability distribution of the feature vector z(t, f) given g(t, f) = l (l = 1 to L) is modeled by the complex Watson distribution of equation (3) (it can also be modeled by other probability distributions, such as the complex Bingham distribution, the complex angular central Gaussian distribution, the complex Gaussian distribution, or mixtures of complex Watson, complex Bingham, complex angular central Gaussian, or complex Gaussian distributions).

一方、ｇ（ｔ，ｆ）＝０の条件下での特徴ベクトルｚ（ｔ，ｆ）の条件付き確率分布は、式（２０）に示すように、Ｍ次元複素ベクトル空間における単位球面上の一様分布によりモデル化される。 On the other hand, the conditional probability distribution of the feature vector z(t, f) given g(t, f) = 0 is modeled, as shown in equation (20), by the uniform distribution on the unit sphere in the M-dimensional complex vector space.

式（２０）は、背景雑音はあらゆる方向から一様に到来するという仮定に基づいている。本実施形態では、式（２０）を導入することにより、式（１９）のように背景雑音に対応する状態も適切にモデル化することが可能になり、背景雑音下でも音源位置を正確に推定できる。 Equation (20) is based on the assumption that background noise arrives uniformly from all directions. In the present embodiment, introducing equation (20) makes it possible to appropriately model the state corresponding to background noise, as in equation (19), and thus to estimate the sound source positions accurately even under background noise.
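
To give some intuition for the uniform distribution on the complex unit sphere, the following sketch draws samples from it by normalizing circular complex Gaussian vectors, a standard construction that the patent itself does not describe. The function name and the choice M = 4 are illustrative assumptions. Because no direction is preferred, the sample covariance of the unit vectors should be close to I/M.

```python
import numpy as np

def sample_uniform_complex_sphere(num_samples, M, rng):
    """Draw unit-norm vectors uniformly distributed on the unit sphere in C^M."""
    # An isotropic complex Gaussian has no preferred direction, so its
    # normalized version is uniform on the sphere.
    g = rng.standard_normal((num_samples, M)) + 1j * rng.standard_normal((num_samples, M))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
M = 4                                   # hypothetical number of microphones
z = sample_uniform_complex_sphere(20000, M, rng)

# Sample covariance (1/N) * sum_n z_n z_n^H; approximately I / M for uniform directions.
R = z.T @ z.conj() / z.shape[0]
```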

次に、本実施形態における特徴ベクトルｚ（ｔ，ｆ）の周辺確率分布のモデル化について説明する。本実施形態では、特徴ベクトルｚ（ｔ，ｆ）の周辺確率分布を、音源位置を表す状態の事前確率分布Ｐ（ｇ（ｔ，ｆ）＝ｌ）を荷重とする、条件付き確率分布ｐ（ｚ（ｔ，ｆ）｜ｇ（ｔ，ｆ）＝ｌ）の荷重和である式（２１）の混合モデルによりモデル化する。 Next, modeling of the marginal probability distribution of the feature vector z(t, f) in the present embodiment will be described. In the present embodiment, the marginal probability distribution of the feature vector z(t, f) is modeled by the mixture model of equation (21), which is a weighted sum of the conditional probability distributions p(z(t, f) | g(t, f) = l) with the prior probabilities P(g(t, f) = l) of the states representing the sound source position as weights.

本実施形態では、事前確率分布Ｐ（ｇ（ｔ，ｆ）＝ｌ）がフレームおよび周波数ビンに依存しないと仮定し、α（ｌ）（ｌ＝０〜Ｌ）で表す。ただし、α（ｌ）は制約条件α（０）＋…＋α（Ｌ）＝１を満たす。κ＝０であり、ａが任意の単位ベクトルであるとき、複素ワトソン分布Ｗ（ｚ；ａ，κ）は式（２０）の一様分布に一致することに注意すると、式（２１）を式（２２）のように書き直すこともできる。ただし、κ（０，ｆ）＝０とし、ベクトルａ（０，ｆ）は任意の単位ベクトルとする。周波数に依らない事前確率分布を用いることで、全ての周波数において観測された特徴ベクトルｚ（ｔ，ｆ）の情報を用いて事前確率分布を推定することができるため、周波数に依存する事前確率分布を用いる場合と比べて、事前確率分布の推定により多くの情報を利用することができ、より正確な事前確率分布の推定およびそれに基づく音源定位が実現できるとともに、観測信号長が短い場合でもより正確な事前確率分布の推定およびそれに基づく音源定位が実現できる。さらに、全ての周波数において観測された特徴ベクトルｚ（ｔ，ｆ）の情報を用いて事前確率分布を推定することができるため、雑音や残響の影響により一つの周波数において観測された特徴ベクトルｚ（ｔ，ｆ）だけでは音源位置が確実には分からないような場合にも、より正確に音源定位を行うことができ、周波数に依存する事前確率分布を用いる場合と比べて、雑音や残響に対する頑健性を向上させることができる。 In the present embodiment, the prior probability distribution P(g(t, f) = l) is assumed to be independent of the frame and the frequency bin and is denoted by α(l) (l = 0 to L), where α(l) satisfies the constraint α(0) + … + α(L) = 1. Noting that the complex Watson distribution W(z; a, κ) coincides with the uniform distribution of equation (20) when κ = 0 and a is an arbitrary unit vector, equation (21) can also be rewritten as equation (22), where κ(0, f) = 0 and the vector a(0, f) is an arbitrary unit vector. Using a frequency-independent prior probability distribution allows the prior to be estimated from the feature vectors z(t, f) observed at all frequencies. Compared with a frequency-dependent prior, more information is therefore available for the estimation, which yields more accurate estimation of the prior probability distribution and more accurate sound source localization based on it, even when the observed signal is short. 
Furthermore, because the prior probability distribution is estimated from the feature vectors z(t, f) observed at all frequencies, sound source localization can be performed more accurately even when, owing to noise or reverberation, the feature vectors z(t, f) observed at a single frequency do not determine the sound source position with certainty. Robustness against noise and reverberation is thus improved compared with using a frequency-dependent prior probability distribution.
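
The equation images for (21) and (22) are missing from this text. As a hedged reconstruction from the surrounding description, the mixture model and its Watson-only form can be written as below. The density shown for W is the commonly used form of the complex Watson distribution and is assumed here to match equation (3), which lies outside this section; the symbol ₁F₁ denotes Kummer's confluent hypergeometric function.

```latex
\begin{align}
p\bigl(\mathbf{z}(t,f)\bigr)
  &= \sum_{l=0}^{L} \alpha(l)\, p\bigl(\mathbf{z}(t,f) \mid g(t,f)=l\bigr) \tag{21}\\
  &= \sum_{l=0}^{L} \alpha(l)\, \mathcal{W}\bigl(\mathbf{z}(t,f);\, \mathbf{a}(l,f),\, \kappa(l,f)\bigr), \tag{22}\\[4pt]
\mathcal{W}(\mathbf{z};\mathbf{a},\kappa)
  &= \frac{(M-1)!}{2\pi^{M}\, {}_1F_1(1;M;\kappa)}\,
     \exp\!\bigl(\kappa\,\lvert \mathbf{a}^{\mathsf{H}}\mathbf{z}\rvert^{2}\bigr). \notag
\end{align}
```

For κ = 0 both ₁F₁(1; M; 0) and the exponential equal 1, so W reduces to the constant (M−1)!/(2π^M), the reciprocal of the surface area of the unit sphere in the M-dimensional complex space, independently of a. This is why state 0 can be folded into the same Watson mixture with κ(0, f) = 0.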

パラメータ記憶部３０は、音源位置を表す状態が複数の音源位置候補のそれぞれに対応する状態を取る条件下での条件付き確率分布のモデルパラメータ、および音源位置を表す状態が背景雑音に対応する状態を取る条件下での条件付き確率分布のモデルパラメータを記憶する。前者は例えば第１の実施形態に記載の方法により計算することができ、後者は例えばκ（０，ｆ）←０、ベクトルａ（０，ｆ）は任意の単位ベクトルとすることができる。 The parameter storage unit 30 stores the model parameters of the conditional probability distributions given that the state representing the sound source position is the state corresponding to each of the plurality of sound source position candidates, and the model parameters of the conditional probability distribution given that the state representing the sound source position is the state corresponding to background noise. The former can be calculated, for example, by the method described in the first embodiment; for the latter, for example, κ(0, f) ← 0 and the vector a(0, f) can be an arbitrary unit vector.

事前確率分布計算部４０は、音源位置を表す状態の事前確率分布α（ｌ）（ｌ＝０〜Ｌ）を荷重とする、パラメータ記憶部３０に記憶されたモデルパラメータである平均方向ベクトルａ（ｌ，ｆ）（ｌ＝０〜Ｌ、ｆ＝１〜Ｆ）と集中パラメータκ（ｌ，ｆ）（ｌ＝０〜Ｌ、ｆ＝１〜Ｆ）に基づく、音源位置を表す状態が既知の条件下での、特徴ベクトルの条件付き確率分布の荷重和である式（２１）の混合モデルを、特徴ベクトル計算部２０によって計算された特徴ベクトルｚ（ｔ，ｆ）に当てはめ、事前確率分布α（ｌ）（ｌ＝０〜Ｌ）を計算する。 The prior probability distribution calculation unit 40 fits the mixture model of equation (21) to the feature vectors z(t, f) calculated by the feature vector calculation unit 20 and thereby calculates the prior probability distribution α(l) (l = 0 to L). The mixture model is the weighted sum, with the prior probabilities α(l) (l = 0 to L) of the states representing the sound source position as weights, of the conditional probability distributions of the feature vector given the state, which are based on the model parameters stored in the parameter storage unit 30: the mean direction vectors a(l, f) (l = 0 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 0 to L, f = 1 to F).

式（２１）の混合モデルを特徴ベクトルｚ（ｔ，ｆ）に当てはめる方法には様々な方法があり、例えば式（２１）に関する尤度を目的関数とし（他にも事後確率等を目的関数とすることができる。）、これを勾配法により事前確率分布α（ｌ）（ｌ＝０〜Ｌ）に関して最大化する（他にもＥＭアルゴリズム等により最大化できる）。 There are various methods for fitting the mixture model of equation (21) to the feature vectors z(t, f). For example, the likelihood of equation (21) is taken as the objective function (the posterior probability or the like can also be used) and is maximized with respect to the prior probability distribution α(l) (l = 0 to L) by a gradient method (the EM algorithm or the like can also be used).

事前確率分布計算部４０における処理は、例えば下記の通りである。
１．事前確率分布α（ｌ）（ｌ＝０〜Ｌ）をα（ｌ）←１／（Ｌ＋１）により初期化する。
２．下記の式（２３）および（２４）による事前確率分布α（ｌ）（ｌ＝０〜Ｌ）の更新を交互に所定回数（例えば１０回）反復する。 The processing in the prior probability distribution calculation unit 40 is, for example, as follows.
1. The prior probability distribution α (l) (l = 0 to L) is initialized by α (l) ← 1 / (L + 1).
2. The updating of the prior probability distribution α (l) (l = 0 to L) according to the following equations (23) and (24) is alternately repeated a predetermined number of times (for example, 10 times).

３．事前確率分布α（ｌ）（ｌ＝０〜Ｌ）を出力する。 3. The prior probability distribution α (l) (l = 0 to L) is output.

ここで、ベクトル〜α（αの前の記号「〜」はαの上に記号「〜」を付すことを表す。）はα（ｌ）（ｌ＝０〜Ｌ）からなる（Ｌ＋１）次元縦ベクトルであり、ベクトル〜ｗ（ｔ，ｆ）はＷ（ｚ（ｔ，ｆ）；ａ（ｌ，ｆ），κ（ｌ，ｆ））（ｌ＝０〜Ｌ）からなる（Ｌ＋１）次元縦ベクトルである。なお、式（２３）および式（２４）の導出については、第１の実施形態と同様であるから省略する。 Here, the vector ~α (the symbol "~" before α indicates a tilde placed above α) is the (L + 1)-dimensional column vector consisting of α(l) (l = 0 to L), and the vector ~w(t, f) is the (L + 1)-dimensional column vector consisting of W(z(t, f); a(l, f), κ(l, f)) (l = 0 to L). The derivation of equations (23) and (24) is the same as in the first embodiment and is therefore omitted.
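
Equations (23) and (24) are not reproduced in this text, so the gradient-based update itself cannot be shown here. As a substitute illustration, the sketch below fits the mixture weights with the EM algorithm, which the description names as an alternative optimizer. It is not the patent's update rule. The function name, the array layout, and the synthetic log-likelihood values are assumptions; the component log-densities log W(z(t, f); a(l, f), κ(l, f)) are taken as precomputed inputs.

```python
import numpy as np

def estimate_prior_em(log_w, num_iters=10):
    """EM estimate of the mixture weights alpha(l), l = 0..L, with fixed components.

    log_w : array of shape (L + 1, T, F) holding the component log-densities
            log W(z(t, f); a(l, f), kappa(l, f)) for every state and
            time-frequency point.
    """
    num_states = log_w.shape[0]
    alpha = np.full(num_states, 1.0 / num_states)      # alpha(l) <- 1 / (L + 1)
    flat = log_w.reshape(num_states, -1)               # shape (L + 1, T * F)
    tiny = np.finfo(float).tiny
    for _ in range(num_iters):
        # E-step: posterior probability of each state at each point (log domain).
        log_post = np.log(np.maximum(alpha, tiny))[:, None] + flat
        log_post -= log_post.max(axis=0, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=0, keepdims=True)
        # M-step: the new weight is the posterior averaged over all (t, f).
        alpha = post.mean(axis=1)
    return alpha

# Synthetic example: 3 states (state 0 = noise), T = 20 frames, F = 8 bins.
rng = np.random.default_rng(0)
log_w = rng.normal(size=(3, 20, 8))
log_w[1] += 3.0          # make state 1 the most likely explanation almost everywhere
alpha = estimate_prior_em(log_w)
```

Averaging the posteriors over every frequency bin as well as every frame is what makes the estimated α(l) frequency-independent, in line with the motivation given for this embodiment. Only α(1) to α(L) are then passed to peak detection.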

音源位置計算部５０は、事前確率分布計算部４０から受け取った事前確率分布α（ｌ）（ｌ＝０〜Ｌ）に基づいて、検出された音源位置の集合Ｇを計算し出力する。具体的には、ｌの定義域を目的音源に対応するｌ＝１〜Ｌに制限したα（ｌ）（ｌ＝１〜Ｌ）に対して、第１の実施形態に記載の処理を適用することにより、検出された音源位置の集合Ｇを計算する。 The sound source position calculation unit 50 calculates and outputs a set G of detected sound source positions based on the prior probability distribution α (l) (l = 0 to L) received from the prior probability distribution calculation unit 40. Specifically, the process described in the first embodiment is applied to α (l) (l = 1 to L) in which the domain of l is limited to l = 1 to L corresponding to the target sound source. Thus, a set G of detected sound source positions is calculated.

［第４の実施形態］
次に、第４の実施形態の構成について説明する。第４の実施形態は、本発明に基づいて音源位置を推定する例であり、第１の実施形態を基にして、条件付き確率分布のモデルパラメータを目的信号が球面波として伝播するという仮定に基づいて計算するのではなく、実測データを学習データとして用いて事前学習するようにするという変更を加えたものである。目的信号が球面波として伝播するという上記の仮定は、無響室のような反射・残響・回折等の存在しない理想的な環境を想定している。したがって、第１の実施形態では、反射・残響・回折等がある環境では、想定している環境と音源定位を行う環境との間にミスマッチが存在するため、音源定位の性能が低下する問題がある。これに対し本実施形態では、音源定位を行う環境における実測データを用いて条件付き確率分布のモデルパラメータを事前学習することで、そのようなミスマッチを解消し、反射・残響・回折等がある場合でも音源位置を正確に推定することが可能になる、という利点がある。反対に、第１の実施形態には、本実施形態と異なり上記実測データを取得する手間が省けるという利点がある。 Fourth Embodiment
Next, the configuration of the fourth embodiment will be described. The fourth embodiment is an example of estimating sound source positions based on the present invention. It modifies the first embodiment so that the model parameters of the conditional probability distributions are not calculated under the assumption that the target signals propagate as spherical waves but are learned in advance using measured data as training data. The assumption that the target signals propagate as spherical waves presupposes an ideal environment free of reflection, reverberation, diffraction, and the like, such as an anechoic chamber. In the first embodiment, therefore, in an environment with reflection, reverberation, diffraction, or the like, there is a mismatch between the assumed environment and the environment in which sound source localization is performed, and localization performance degrades. In the present embodiment, by contrast, pre-learning the model parameters of the conditional probability distributions from data measured in the environment where sound source localization is performed eliminates this mismatch, with the advantage that sound source positions can be estimated accurately even in the presence of reflection, reverberation, diffraction, or the like. Conversely, the first embodiment has the advantage that, unlike the present embodiment, the effort of acquiring the measured data is not required.

第４の実施形態に係る信号処理装置の構成の一例は、第１の実施形態に係る信号処理装置１と同様、図２で示される。第４の実施形態に係る信号処理装置１は、時間周波数分析部１０、特徴ベクトル計算部２０、パラメータ記憶部３０、事前確率分布計算部４０、音源位置計算部５０を有する。時間周波数分析部１０、特徴ベクトル計算部２０、事前確率分布計算部４０、および音源位置計算部５０については、第１の実施形態と同様であるから、以下では相違点であるパラメータ記憶部３０について詳しく説明する。第１の実施形態と本実施形態との主な相違点は次の通りである。第１の実施形態におけるパラメータ記憶部３０は、目的信号が球面波として伝播するという仮定に基づいて計算された、条件付き確率分布のモデルパラメータを記憶する。これに対し、本実施形態におけるパラメータ記憶部３０は、残響下で取得された学習データを用いて学習された、条件付き確率分布のモデルパラメータを記憶する。 An example of the configuration of the signal processing device according to the fourth embodiment is shown in FIG. 2, as for the signal processing device 1 according to the first embodiment. The signal processing device 1 according to the fourth embodiment includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50. The time-frequency analysis unit 10, the feature vector calculation unit 20, the prior probability distribution calculation unit 40, and the sound source position calculation unit 50 are the same as in the first embodiment, so the following describes in detail the unit that differs, the parameter storage unit 30. The main difference between the first embodiment and the present embodiment is as follows. The parameter storage unit 30 in the first embodiment stores model parameters of the conditional probability distributions calculated under the assumption that the target signals propagate as spherical waves. In contrast, the parameter storage unit 30 in the present embodiment stores model parameters of the conditional probability distributions learned using training data acquired under reverberation.

パラメータ記憶部３０は、残響下で取得された学習データを用いて学習されたモデルパラメータであって、音源位置を表す状態が複数（Ｌ個）の音源位置候補のそれぞれに対応する状態を取る条件下での、特徴ベクトルｚ（ｔ，ｆ）の条件付き確率分布である複素ワトソン分布のモデルパラメータである平均方向ベクトルａ（ｌ，ｆ）（ｌ＝１〜Ｌ、ｆ＝１〜Ｆ）と集中パラメータκ（ｌ，ｆ）（ｌ＝１〜Ｌ、ｆ＝１〜Ｆ）を記憶する。前記残響下で取得された学習データとしては、例えば、背景雑音が存在しない状況で複数の音源位置候補のそれぞれからのみ音が発せられた場合の観測信号ｘ（ｌ，ｍ，τ）を用いることができる。 The parameter storage unit 30 stores model parameters learned using training data acquired under reverberation: the mean direction vectors a(l, f) (l = 1 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 1 to L, f = 1 to F) of the complex Watson distributions that are the conditional probability distributions of the feature vector z(t, f) given that the state representing the sound source position is the state corresponding to each of the plurality of (L) sound source position candidates. As the training data acquired under reverberation, for example, the observed signals x(l, m, τ) obtained when sound is emitted from only one of the sound source position candidates at a time, with no background noise present, can be used.

上記事前学習は、例えば次の手順で行うことができる。
１．１つの音源位置候補のみから音が発せられた場合の観測信号ｘ（ｌ，ｍ，τ）を生成する。例えば、Ｌ個の音源位置候補のそれぞれに対し、当該音源位置候補のみから音が発せられている状況で収録を行うことにより、ｘ（ｌ，ｍ，τ）を生成できる。もしくは、Ｌ個の音源位置候補のそれぞれに対し、当該音源位置候補から各マイクロホン位置までのインパルス応答を計測し、このインパルス応答を目的信号に畳み込むことにより、ｘ（ｌ，ｍ，τ）を生成できる。
２．ｘ（ｌ，ｍ，τ）の時間周波数変換ｘ（ｌ，ｍ，ｔ，ｆ）（ｍ＝１〜Ｍ）からなるＭ次元ベクトルｘ（ｌ，ｔ，ｆ）を計算する。
３．特徴ベクトルζ（ｌ，ｔ，ｆ）を下記の式（２５）により計算する。 The above-mentioned prior learning can be performed, for example, in the following procedure.
1. Generate the observed signal x(l, m, τ) for the case where sound is emitted from only one sound source position candidate. For example, x(l, m, τ) can be generated by recording, for each of the L sound source position candidates, in a situation where sound is emitted only from that candidate. Alternatively, for each of the L sound source position candidates, the impulse response from that candidate to each microphone position can be measured and convolved with a target signal to generate x(l, m, τ).
2. An M-dimensional vector x (l, t, f) consisting of time-frequency transforms x (l, m, t, f) (m = 1 to M) of x (l, m, τ) is calculated.
3. The feature vector ζ (l, t, f) is calculated by the following equation (25).

４．特徴共分散行列Ｒ（ｌ，ｆ）を下記の式（２６）により計算する。 4. The feature covariance matrix R (l, f) is calculated by the following equation (26).

５．特徴共分散行列Ｒ（ｌ，ｆ）の固有値分解を行い、最大固有値μ（ｌ，ｆ）および最大固有値に対応するノルム１の固有ベクトルｅ（ｌ，ｆ）を求める。
６．平均方向ベクトルａ（ｌ，ｆ）をａ（ｌ，ｆ）←ｅ（ｌ，ｆ）とする。
７．集中パラメータκ（ｌ，ｆ）を下記の式（２７）により計算する。 5. Eigenvalue decomposition of the feature covariance matrix R (l, f) is performed to obtain the largest eigenvalue μ (l, f) and the eigenvector e (l, f) of the norm 1 corresponding to the largest eigenvalue.
6. The mean direction vector a(l, f) is set as a(l, f) ← e(l, f).
7. The concentration parameter κ (l, f) is calculated by the following equation (27).

上記の処理の導出について説明する。上記の処理は、特徴ベクトルζ（ｌ，ｔ，ｆ）が式（２８）に従って生成されるという仮定の下、式（２８）に関する対数尤度である式（２９）を平均方向ベクトルａ（ｌ，ｆ）および集中パラメータκ（ｌ，ｆ）に関して最大化することにより導かれる。 The derivation of the above procedure is as follows. Under the assumption that the feature vectors ζ(l, t, f) are generated according to equation (28), the procedure is derived by maximizing equation (29), the log-likelihood for equation (28), with respect to the mean direction vector a(l, f) and the concentration parameter κ(l, f).

式（２９）において、平均方向ベクトルａ（ｌ，ｆ）（ｌ＝１〜Ｌ、ｆ＝１〜Ｆ）および集中パラメータκ（ｌ，ｆ）（ｌ＝１〜Ｌ、ｆ＝１〜Ｆ）のいずれにも依存しない定数項を無視すると、式（３０）のように書き直せる。 Ignoring constant terms that depend on neither the mean direction vectors a(l, f) (l = 1 to L, f = 1 to F) nor the concentration parameters κ(l, f) (l = 1 to L, f = 1 to F), equation (29) can be rewritten as equation (30).

ここで、行列Ｒ（ｌ，ｆ）は式（２６）により定義される。式（２７）におけるベクトルａ（ｌ，ｆ）に依存する項は式（３１）である。 Here, the matrix R (l, f) is defined by equation (26). The term depending on the vector a (l, f) in the equation (27) is the equation (31).

Courant-Fischerの定理より、式（３１）を式（４）の制約条件下で最大化するベクトルａ（ｌ，ｆ）は、特徴共分散行列Ｒ（ｌ，ｆ）の最大固有値μ（ｌ，ｆ）に対応するノルム１の固有ベクトルｅ（ｌ，ｆ）である。また、式（３０）における集中パラメータκ（ｌ，ｆ）に依存する項は、式（３２）である。 According to the Courant-Fischer theorem, the vector a(l, f) that maximizes equation (31) under the constraint of equation (4) is the unit-norm eigenvector e(l, f) corresponding to the largest eigenvalue μ(l, f) of the feature covariance matrix R(l, f). The term in equation (30) that depends on the concentration parameter κ(l, f) is equation (32).

ここで、集中パラメータκ（ｌ，ｆ）に関する偏微分を０と置くと、式（３３）を得る。 Setting the partial derivative with respect to the concentration parameter κ(l, f) to 0 yields equation (33).

参考文献１「S.Sra and D.Karp,"The multivariate Watson distribution: Maximum-likelihood estimation and other aspects," Journal of Multivariate Analysis,2013年2月,vol.114,p.256-269.」中の式（３．８）に基づいて、式（３３）を集中パラメータκ（ｌ，ｆ）について近似的に解くと式（２７）を得る。本実施形態では、学習データから集中パラメータを学習するため、第１の実施形態と同様、前述の、観測信号ベクトルｙ（ｔ，ｆ）の方向が、低い周波数ほど小さい分散（大きい集中度）を持つという性質を適切に考慮することができ、事前確率分布の推定、及びそれに基づく音源定位を正確に行うことができる。 Approximately solving equation (33) for the concentration parameter κ(l, f), based on equation (3.8) in Reference 1 (S. Sra and D. Karp, "The multivariate Watson distribution: Maximum-likelihood estimation and other aspects," Journal of Multivariate Analysis, February 2013, vol. 114, p. 256-269), yields equation (27). Because the concentration parameters are learned from training data in the present embodiment, the aforementioned property that the direction of the observed signal vector y(t, f) has a smaller variance (a higher concentration) at lower frequencies can be properly taken into account, as in the first embodiment, so that the prior probability distribution can be estimated, and sound source localization based on it performed, accurately.
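
Steps 3 to 6 of the pre-learning procedure (feature vector, feature covariance matrix, eigenvalue decomposition, mean direction vector) can be sketched as follows for a single candidate l and frequency bin f. Equations (25) and (26) are not reproduced in this text, so the code assumes their natural forms: ζ is the observed vector normalized to unit norm, and R is the average of ζζ^H over frames. The function name, the transfer vector `h`, and the synthetic data are hypothetical. Step 7 (equation (27)) is deliberately omitted because the formula is not available here.

```python
import numpy as np

def learn_mean_direction(X):
    """Steps 3 to 6 for one candidate l and one frequency bin f.

    X : complex array of shape (T, M); row t is the observed vector x(l, t, f).
    Returns (a, mu): the unit-norm principal eigenvector and the largest
    eigenvalue of the feature covariance matrix R(l, f).
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    zeta = X / np.maximum(norms, 1e-12)         # assumed eq. (25): unit-norm direction
    R = zeta.T @ zeta.conj() / zeta.shape[0]    # assumed eq. (26): mean of zeta zeta^H
    eigvals, eigvecs = np.linalg.eigh(R)        # Hermitian input, ascending eigenvalues
    return eigvecs[:, -1], eigvals[-1]

# Synthetic training data: one source with a fixed transfer vector plus weak noise.
rng = np.random.default_rng(0)
h = np.array([1.0, 1j, -1.0, -1j])              # hypothetical transfer vector, M = 4
s = rng.standard_normal(500) + 1j * rng.standard_normal(500)
noise = 0.01 * (rng.standard_normal((500, 4)) + 1j * rng.standard_normal((500, 4)))
X = s[:, None] * h[None, :] + noise

a, mu = learn_mean_direction(X)
```

The learned a matches h/‖h‖ only up to a phase factor, which does not matter because the Watson density depends on a solely through |a^H z|. The eigenvalue μ lies between 1/M and 1 (the trace of R is 1), and values near 1 indicate strongly concentrated directions, which is the quantity equation (27) converts into κ(l, f).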

［第５の実施形態］
次に、第５の実施形態の構成について説明する。第５の実施形態は、本発明に基づいて音源位置を推定する例であり、第３の実施形態を基にして、背景雑音に対する条件付き確率分布として一様分布を用いるのではなく、実測データを用いて事前学習した条件付き確率分布を用いるようにするという変更を加えたものである。 Fifth Embodiment
Next, the configuration of the fifth embodiment will be described. The fifth embodiment is an example of estimating sound source positions based on the present invention. It modifies the third embodiment so that the conditional probability distribution for background noise is not a uniform distribution but a conditional probability distribution learned in advance from measured data.

第３の実施形態における上記の一様分布の仮定は、雑音があらゆる方向から一様に到来する理想的な環境を想定している。したがって、第３の実施形態では、雑音の到来方向に偏りがある環境では、想定している環境と音源定位を行う環境との間にミスマッチが存在し、音源定位の性能が低下する恐れがある。これに対し本実施形態では、音源定位を行う環境における実測データを用いて、条件付き確率分布のモデルパラメータを事前学習することで、上記のミスマッチを解消し、雑音の到来方向に偏りがある場合でも音源位置を正確に推定することを可能にする、という利点がある。 The uniform distribution assumption in the third embodiment presupposes an ideal environment in which noise arrives uniformly from all directions. In the third embodiment, therefore, in an environment where the noise arrival directions are biased, there is a mismatch between the assumed environment and the environment in which sound source localization is performed, and localization performance may degrade. In the present embodiment, by contrast, pre-learning the model parameters of the conditional probability distribution from data measured in the environment where sound source localization is performed eliminates this mismatch, with the advantage that sound source positions can be estimated accurately even when the noise arrival directions are biased.

第５の実施形態に係る信号処理装置の構成の一例は、第３の実施形態に係る信号処理装置１と同様、図２で示される。第５の実施形態に係る信号処理装置１は、時間周波数分析部１０、特徴ベクトル計算部２０、パラメータ記憶部３０、事前確率分布計算部４０、音源位置計算部５０を有する。時間周波数分析部１０、特徴ベクトル計算部２０、事前確率分布計算部４０、および音源位置計算部５０については、第３の実施形態と同様であるから、以下では相違点であるパラメータ記憶部３０について詳しく説明する。第３の実施形態と本実施形態との主な相違点は次の通りである。第３の実施形態におけるパラメータ記憶部３０では、音源位置を表す状態が背景雑音に対応する状態を取る条件下での条件付き確率分布のモデルパラメータとして、一様分布に対応するモデルパラメータを記憶する。これに対し、本実施形態におけるパラメータ記憶部３０では、音源位置を表す状態が背景雑音に対応する状態を取る条件下での条件付き確率分布のモデルパラメータとして、学習データを用いて学習したモデルパラメータを記憶する。 An example of the configuration of the signal processing device according to the fifth embodiment is shown in FIG. 2, as for the signal processing device 1 according to the third embodiment. The signal processing device 1 according to the fifth embodiment includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50. The time-frequency analysis unit 10, the feature vector calculation unit 20, the prior probability distribution calculation unit 40, and the sound source position calculation unit 50 are the same as in the third embodiment, so the following describes in detail the unit that differs, the parameter storage unit 30. The main difference between the third embodiment and the present embodiment is as follows. The parameter storage unit 30 in the third embodiment stores model parameters corresponding to a uniform distribution as the model parameters of the conditional probability distribution given that the state representing the sound source position is the state corresponding to background noise. In contrast, the parameter storage unit 30 in the present embodiment stores, as those model parameters, model parameters learned using training data.

本実施形態では、各時間周波数点における観測信号ベクトルｙ（ｔ，ｆ）の音源位置を表す状態がｇ（ｔ，ｆ）＝ｌ（ｌ＝０〜Ｌ）である条件下での特徴ベクトルｚ（ｔ，ｆ）の条件付き確率分布を、式（３）の複素ワトソン分布によりモデル化する（他にも複素ビンガム分布、複素角度中心ガウス分布、複素ガウス分布、混合複素ワトソン分布、混合複素ビンガム分布、混合複素角度中心ガウス分布、混合複素ガウス分布等の確率分布によりモデル化することができる）。 In the present embodiment, the conditional probability distribution of the feature vector z(t, f), given that the state representing the sound source position of the observed signal vector y(t, f) at each time-frequency point is g(t, f) = l (l = 0 to L), is modeled by the complex Watson distribution of equation (3) (it can also be modeled by other probability distributions, such as the complex Bingham distribution, the complex angular central Gaussian distribution, the complex Gaussian distribution, or mixtures of complex Watson, complex Bingham, complex angular central Gaussian, or complex Gaussian distributions).

パラメータ記憶部３０は、音源位置を表す状態が複数の音源位置候補のそれぞれに対応する状態（状態１〜Ｌ）を取る条件下での条件付き確率分布のモデルパラメータ、および音源位置を表す状態が背景雑音に対応する状態（状態０）を取る条件下での条件付き確率分布のモデルパラメータを記憶する。音源位置を表す状態が複数の音源位置候補のそれぞれに対応する状態を取る条件下での条件付き確率分布のモデルパラメータは、例えば第１または第４の実施形態に記載の方法により計算することができる。 The parameter storage unit 30 stores the model parameters of the conditional probability distributions given that the state representing the sound source position is the state corresponding to each of the plurality of sound source position candidates (states 1 to L), and the model parameters of the conditional probability distribution given that the state representing the sound source position is the state corresponding to background noise (state 0). The former can be calculated, for example, by the method described in the first or fourth embodiment.

On the other hand, the model parameters of the conditional probability distribution under the condition that the state representing the sound source position is the state corresponding to background noise are learned in advance, for example, as follows.
1. Form the M-dimensional column vector x(0,t,f) from the time-frequency transforms x(0,m,t,f) (m = 1 to M) of the measured background noise x(0,m,τ).
2. Compute the feature vector ζ(0,t,f) by the following equation (34).

3. Compute the feature covariance matrix R(0,f) by the following equation (35).

4. Perform an eigenvalue decomposition of the feature covariance matrix R(0,f) to obtain the largest eigenvalue μ(0,f) and the unit-norm eigenvector e(0,f) corresponding to it.
5. Set the mean direction vector a(0,f) as a(0,f) ← e(0,f).
6. Compute the concentration parameter κ(0,f) by the following equation (36).

The derivation of the above procedure is the same as in the fourth embodiment and is therefore omitted.
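Steps 1 to 5 of the background-noise procedure can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: equations (34) and (35) did not survive extraction, so the sketch assumes that (34) normalizes each observation to unit norm and that (35) is the time average of the outer products ζζ^H. Equation (36), which maps the largest eigenvalue to the concentration parameter, is not reproduced here. The function name `noise_state_statistics` and the array layout are hypothetical.

```python
import numpy as np

def noise_state_statistics(x0):
    """x0: complex array of shape (T, F, M), the time-frequency transform of
    measured background noise for M microphones (hypothetical layout).
    Returns the mean direction a (F, M) and the largest eigenvalue mu (F,)."""
    # Step 2 (assumed form of eq. 34): scale each observation to unit norm.
    norms = np.linalg.norm(x0, axis=-1, keepdims=True)
    zeta = x0 / np.maximum(norms, 1e-12)
    # Step 3 (assumed form of eq. 35): time-averaged outer product per frequency.
    T = zeta.shape[0]
    R = np.einsum('tfm,tfn->fmn', zeta, zeta.conj()) / T
    # Step 4: eigh handles stacked Hermitian matrices and returns
    # eigenvalues in ascending order, so the last one is the largest.
    eigvals, eigvecs = np.linalg.eigh(R)
    mu = eigvals[:, -1]
    # Step 5: the mean direction is the unit-norm principal eigenvector.
    a = eigvecs[:, :, -1]
    return a, mu
```

The returned `mu` would then be passed to whatever formula equation (36) specifies for κ(0,f).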

[Sixth Embodiment]
Next, the configuration of the sixth embodiment will be described. The sixth embodiment is an example of estimating sound source positions according to the present invention. It modifies the fourth embodiment so that the conditional probability distribution of the feature vector z(t,f), under the condition that the state representing the sound source position is one of the states corresponding to the respective sound source position candidates, is a complex angular central Gaussian distribution instead of a complex Watson distribution. The complex Watson distribution can represent the conditional probability distribution of the feature vector of equation (1), which is the direction of the observed signal vector, only when that distribution is rotationally symmetric. The complex angular central Gaussian distribution can represent it both when it is rotationally symmetric and when it is elliptical. Because the distribution of the feature vector of equation (1) is not necessarily rotationally symmetric, the present embodiment can model it more accurately than the fourth embodiment and, as a result, can estimate sound source positions more accurately.

As with the signal processing device 1 of the fourth embodiment, an example configuration of the signal processing device according to the sixth embodiment is shown in FIG. 2. The signal processing device 1 according to the sixth embodiment includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50. The time-frequency analysis unit 10, the feature vector calculation unit 20, and the sound source position calculation unit 50 are the same as in the fourth embodiment, so only the parameter storage unit 30 and the prior probability distribution calculation unit 40, which differ, are described in detail below. The main difference between the fourth embodiment and the present embodiment is as follows. In the fourth embodiment, the parameter storage unit 30 stores the model parameters of the complex Watson distributions that model the conditional probability distributions, and the prior probability distribution calculation unit 40 computes the prior probability distribution based on those model parameters.
In the present embodiment, by contrast, the parameter storage unit 30 stores the model parameters of the complex angular central Gaussian distributions that model the conditional probability distributions, and the prior probability distribution calculation unit 40 computes the prior probability distribution based on those model parameters.

In the present embodiment, the L conditional probability distributions for the L sound source position candidates are modeled by complex angular central Gaussian distributions. That is, the conditional probability distribution p(z(t,f) | g(t,f) = l) is modeled by equation (37).

Here, the matrix Σ(l,f) is a positive definite Hermitian matrix, called the parameter matrix, and is the model parameter that determines the position, spread, direction, and shape of the distribution of the feature vector z(t,f) for the l-th sound source position candidate. A(z; Σ) is the complex angular central Gaussian distribution of a vector z with parameter matrix Σ, and is given by equation (38).
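Equation (38) did not survive extraction. The form commonly given in the literature for the complex angular central Gaussian density on unit-norm M-dimensional vectors is A(z; Σ) = (M−1)! / (2 π^M det Σ) · (z^H Σ^{-1} z)^{-M}, and the sketch below assumes that (38) matches it. It evaluates the logarithm of that density; the function name `cacg_logpdf` is hypothetical.

```python
import numpy as np
from math import lgamma, log, pi

def cacg_logpdf(z, Sigma):
    """Log of the (assumed) complex angular central Gaussian density.
    z: unit-norm complex vector of shape (M,).
    Sigma: positive definite Hermitian parameter matrix of shape (M, M)."""
    M = z.shape[0]
    # Quadratic form z^H Sigma^{-1} z; np.vdot conjugates its first argument.
    q = np.real(np.vdot(z, np.linalg.solve(Sigma, z)))
    # slogdet returns (sign, log|det|); det is real and positive for PD Sigma.
    _, logdet = np.linalg.slogdet(Sigma)
    # lgamma(M) equals log((M-1)!).
    return lgamma(M) - log(2.0) - M * log(pi) - logdet - M * np.log(q)
```

Under this form the density is unchanged when Σ is multiplied by a positive constant, so Σ is identifiable only up to scale.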

The parameter storage unit 30 stores the parameter matrices Σ(l,f) (l = 1 to L, f = 1 to F), which are the model parameters of the complex angular central Gaussian distributions serving as the conditional probability distributions of the feature vector z(t,f) under the condition that the state representing the sound source position is one of the states corresponding to the respective sound source position candidates. For each of the L sound source position candidates, the parameter matrix Σ(l,f) is learned in advance from the observed signal x(l,m,τ) recorded when sound is emitted only from that candidate. Because the present embodiment learns from training data the parameter matrix Σ(l,f) that determines the position, spread, direction, and shape of the conditional probability distribution of the feature vector z(t,f), it can, as in the first embodiment, properly take into account the aforementioned property that the direction of the observed signal vector y(t,f) has a smaller variance (corresponding to the spread above) at lower frequencies. The prior probability distribution, and the sound source localization based on it, can therefore be estimated accurately.

This prior learning can be performed, for example, by the following procedure.
1. Compute the feature vectors ζ(l,t,f) (l = 1 to L, t = 1 to T, f = 1 to F) in the same way as in the fourth embodiment.
2. Initialize the parameter matrices Σ(l,f) (l = 1 to L, f = 1 to F) to the M×M identity matrix.
3. Repeat the update of the parameter matrices Σ(l,f) (l = 1 to L, f = 1 to F) by the following equation (39) a predetermined number of times (for example, 10 times).

4. Store the parameter matrices Σ(l,f) (l = 1 to L, f = 1 to F) in the parameter storage unit 30.
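The procedure above can be sketched for a single candidate l and a single frequency f as follows. Equation (39) is missing from the extracted text; the sketch assumes it is the standard fixed-point maximum-likelihood update for the complex angular central Gaussian distribution, Σ ← (M/T) Σ_t ζ_t ζ_t^H / (ζ_t^H Σ^{-1} ζ_t). The function name `fit_cacg` and the array layout are hypothetical.

```python
import numpy as np

def fit_cacg(zeta, n_iter=10):
    """zeta: complex array of shape (T, M), the feature vectors for one
    candidate position and one frequency bin (hypothetical layout).
    Returns the (M, M) parameter matrix after n_iter fixed-point updates."""
    T, M = zeta.shape
    Sigma = np.eye(M, dtype=complex)           # step 2: identity initialisation
    for _ in range(n_iter):                    # step 3: repeat the update
        # q[t] = zeta_t^H Sigma^{-1} zeta_t for every frame t.
        solved = np.linalg.solve(Sigma, zeta.T)            # shape (M, T)
        q = np.real(np.einsum('tm,mt->t', zeta.conj(), solved))
        # Assumed eq. 39: weighted average of the outer products zeta zeta^H.
        weighted = zeta / q[:, None]
        Sigma = (M / T) * (weighted.T @ zeta.conj())
    return Sigma                               # step 4: value to be stored
```

In use, the function would be called once per (l, f) pair and the results collected into an array of shape (L, F, M, M). The update is invariant to the scale of each ζ, so the result is meaningful only up to a positive scalar factor.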

The derivation of equation (39) is as follows. Under the assumption that the feature vectors ζ(l,t,f) are generated according to the conditional probability distribution of equation (37), equation (39) is obtained by maximizing equation (40), the log-likelihood associated with equation (37), with respect to the parameter matrix Σ(l,f).

Ignoring the constant terms in equation (40) that do not depend on the parameter matrix Σ(l,f), equation (40) can be rewritten as equation (41).

Setting the partial derivative of equation (41) with respect to the parameter matrix Σ(l,f) to zero and rearranging yields equation (39).
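Equations (39) to (41) are missing from the extracted text. The following is a plausible reconstruction of the derivation, assuming the standard complex angular central Gaussian density A(z; Σ) ∝ (det Σ)^{-1} (z^H Σ^{-1} z)^{-M}. The indices l and f are dropped for brevity, and the correspondence with the patent's equation numbers is an assumption.

```latex
% Log-likelihood of the training features (assumed form of eq. 40)
\ln \mathcal{L}(\Sigma) = \sum_{t=1}^{T} \ln A(\zeta_t; \Sigma)

% After dropping terms that do not depend on \Sigma (assumed form of eq. 41)
\ln \mathcal{L}(\Sigma) = -T \ln\det\Sigma
  - M \sum_{t=1}^{T} \ln\!\left(\zeta_t^{\mathsf H} \Sigma^{-1} \zeta_t\right) + \text{const.}

% Stationarity condition
-T\,\Sigma^{-1}
  + M \sum_{t=1}^{T}
    \frac{\Sigma^{-1} \zeta_t \zeta_t^{\mathsf H} \Sigma^{-1}}
         {\zeta_t^{\mathsf H} \Sigma^{-1} \zeta_t} = 0

% Multiplying by \Sigma on both sides gives the fixed-point update (assumed form of eq. 39)
\Sigma \leftarrow \frac{M}{T} \sum_{t=1}^{T}
  \frac{\zeta_t \zeta_t^{\mathsf H}}{\zeta_t^{\mathsf H} \Sigma^{-1} \zeta_t}
```

Because Σ appears on both sides, the condition has no closed-form solution, which is consistent with the text's instruction to iterate the update a fixed number of times.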

The prior probability distribution calculation unit 40 fits a mixture model to the feature vectors z(t,f) computed by the feature vector calculation unit 20 and thereby computes the prior probability distribution. The mixture model is a weighted sum of complex angular central Gaussian distributions. These are the conditional probability distributions of the feature vector z(t,f) under the condition that the state representing the sound source position is known, and they are based on the parameter matrices Σ(l,f) (l = 1 to L, f = 1 to F) stored as model parameters in the parameter storage unit 30. The weights are the prior probability distribution of the state representing the sound source position. In the present embodiment, a time-invariant prior probability distribution α(l) (l = 1 to L) is considered.

The prior probability distribution calculation unit 40 may compute the prior probability distribution, for example, as follows. Let w(t,f) be the L-dimensional column vector whose entries are the conditional probabilities given by the complex angular central Gaussian distributions A(z(t,f); Σ(l,f)) (l = 1 to L), and apply to w(t,f) the processing of the prior probability distribution calculation unit 40 of the first embodiment. Note, however, that the definition of w(t,f) differs from that in the first embodiment. The derivation of this processing is the same as in the first embodiment and is therefore omitted.
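The first embodiment's processing is not reproduced in this section, so the following sketch substitutes a plain EM iteration for the mixture weights with fixed component densities. It is an assumption-labeled stand-in and not necessarily the update of equations (10) and (11). It takes the logarithms of the entries of w(t,f) and returns a time-invariant prior α(l); the function name `estimate_prior_em` and the array layout are hypothetical.

```python
import numpy as np

def estimate_prior_em(log_w, n_iter=50):
    """log_w: real array of shape (T, F, L) where log_w[t, f, l] is the log of
    the l-th entry of w(t, f) (hypothetical layout).
    Returns alpha of shape (L,), a prior over the L candidate positions."""
    L = log_w.shape[-1]
    alpha = np.full(L, 1.0 / L)                # uniform initial prior
    for _ in range(n_iter):
        # E-step: posterior of each state at each time-frequency point,
        # computed in the log domain for numerical stability.
        log_post = np.log(np.maximum(alpha, 1e-300)) + log_w
        log_post -= log_post.max(axis=-1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=-1, keepdims=True)
        # M-step: the new prior is the posterior averaged over all points.
        alpha = post.reshape(-1, L).mean(axis=0)
    return alpha
```

Only the weights are updated here; the component parameters Σ(l,f) stay at their pre-learned values, which matches the text's description of fitting the prior to fixed conditional distributions.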

[Seventh Embodiment]
Next, the configuration of the seventh embodiment will be described. The seventh embodiment is an example of estimating sound source positions according to the present invention, and modifies the fourth embodiment in three respects. First, as the feature vector z(t,f), which is a vector containing information on the direction of the observed signal vector y(t,f), it uses the observed signal vector y(t,f) itself instead of the direction vector of equation (1). Second, as the conditional probability distribution of the feature vector z(t,f) under the condition that the state representing the sound source position is one of the states corresponding to the respective sound source position candidates, it uses a complex time-varying Gaussian distribution instead of a complex Watson distribution. Third, the spatial covariance matrices, which are model parameters of the complex time-varying Gaussian distribution, are learned in advance and stored.

The complex Watson distribution can represent the distribution of the direction of the observed signal vector only when that distribution is rotationally symmetric, whereas the complex time-varying Gaussian distribution can represent it both when it is rotationally symmetric and when it is elliptical. Because the distribution of the direction of the observed signal vector is not necessarily rotationally symmetric, the present embodiment can model this distribution, which characterizes the sound source position, more accurately than the fourth embodiment, and can estimate sound source positions more accurately on the basis of this model.

As with the signal processing device 1 of the fourth embodiment, an example configuration of the signal processing device according to the seventh embodiment is shown in FIG. 2. The signal processing device 1 according to the seventh embodiment includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, and a sound source position calculation unit 50. The time-frequency analysis unit 10 and the sound source position calculation unit 50 are the same as in the fourth embodiment, so only the feature vector calculation unit 20, the parameter storage unit 30, and the prior probability distribution calculation unit 40, which differ, are described in detail below. The main differences between the fourth embodiment and the present embodiment are as follows. In the fourth embodiment, the feature vector calculation unit 20 computes the feature vector of equation (1), and the parameter storage unit 30 stores the model parameters of the complex Watson distributions that model the conditional probability distributions of that feature vector.
The prior probability distribution calculation unit 40 of the fourth embodiment then computes the prior probability distribution by fitting to the feature vectors a mixture model that is a weighted sum of the complex Watson distributions modeling the conditional probability distributions, with the prior probability distribution of the state representing the sound source position as the weights. In the present embodiment, by contrast, the feature vector calculation unit 20 outputs the observed signal vector from the time-frequency analysis unit 10 as the feature vector. The parameter storage unit 30 stores the spatial covariance matrices, which are model parameters of the complex time-varying Gaussian distributions that model the conditional probability distributions of the observed signal vector serving as the feature vector. The prior probability distribution calculation unit 40 computes the prior probability distribution by fitting to the observed signal vectors a mixture model that is a weighted sum of these complex time-varying Gaussian distributions, with the prior probability distribution of the state representing the sound source position as the weights.

The feature vector calculation unit 20 receives the observed signal vector y(t,f) from the time-frequency analysis unit 10 and outputs it unchanged as the feature vector z(t,f).

In the present embodiment, complex time-varying Gaussian distributions are used as the L conditional probability distributions for the L sound source position candidates. That is, the conditional probability distribution p(z(t,f) | g(t,f) = l) is modeled by equation (42).

In equation (42), φ(l,t,f) is a positive parameter that controls the distribution of the magnitude (norm) of the feature vector z(t,f). The matrix B(l,f) in equation (42), on the other hand, is a parameter that controls the distribution of the direction of the feature vector z(t,f); specifically, it controls the position, spread, direction, and shape of that distribution. The matrix B(l,f) is a positive definite Hermitian matrix and is called the spatial covariance matrix. N(z; 0, Φ) is the complex Gaussian distribution of a vector z with mean vector 0 and covariance matrix Φ, and is given by equation (43).

Because equation (42) has the time-varying covariance matrix φ(l,t,f)B(l,f), it is referred to here as a complex time-varying Gaussian distribution.
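As an illustration of this model, the sketch below evaluates the log-density of a zero-mean circular complex Gaussian with covariance φB. It assumes that the missing equation (43) is the usual density N(z; 0, Φ) = exp(−z^H Φ^{-1} z) / (π^M det Φ); the function name `tv_gaussian_logpdf` is hypothetical.

```python
import numpy as np

def tv_gaussian_logpdf(z, phi, B):
    """Log of N(z; 0, phi * B) under the assumed form of eq. 43.
    z: complex vector of shape (M,); phi: positive scalar;
    B: positive definite Hermitian matrix of shape (M, M)."""
    M = z.shape[0]
    # Quadratic form z^H B^{-1} z; np.vdot conjugates its first argument.
    q = np.real(np.vdot(z, np.linalg.solve(B, z)))
    _, logdet_B = np.linalg.slogdet(B)
    # det(phi * B) = phi^M * det(B) and (phi * B)^{-1} = B^{-1} / phi.
    return -M * np.log(np.pi) - M * np.log(phi) - logdet_B - q / phi
```

Separating the scalar φ from the matrix B in this way means that a louder or quieter frame changes only φ, while the directional information stays in B.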

The parameter storage unit 30 stores the spatial covariance matrices B(l,f) (l = 1 to L, f = 1 to F), which are model parameters of the conditional probability distributions of the observed signal vector y(t,f) serving as the feature vector z(t,f), under the condition that the state representing the sound source position is one of the states corresponding to the respective sound source position candidates. Of the model parameters B(l,f) and φ(l,t,f) of the conditional probability distribution, the parameter storage unit 30 in the present embodiment stores only the spatial covariance matrix B(l,f), which relates to the sound source position. Because φ(l,t,f) depends on the signal power, it is not stored in the parameter storage unit 30; instead, as described later, the prior probability distribution calculation unit 40 estimates it from the feature vectors supplied by the feature vector calculation unit 20.
Because the present embodiment learns from training data the parameter matrix B(l,f) that determines the position, spread, direction, and shape of the distribution of the direction of the observed signal vector y(t,f), it can, as in the first embodiment, properly take into account the aforementioned property that the direction of the observed signal vector y(t,f) has a smaller variance (corresponding to the spread above) at lower frequencies. The prior probability distribution, and the sound source localization based on it, can therefore be estimated accurately.

The spatial covariance matrix B(l,f) is learned in advance from the observed signal x(l,m,τ) recorded when sound is emitted from only one of the L sound source position candidates, for example by the following procedure.
1. Form the M-dimensional column vectors x(l,t,f) (l = 1 to L, t = 1 to T, f = 1 to F) from the time-frequency transforms x(l,m,t,f) (m = 1 to M) of x(l,m,τ). Set the feature vector ζ(l,t,f) as ζ(l,t,f) ← x(l,t,f). Note that this way of computing the feature vector ζ(l,t,f) differs from that of the fourth embodiment.
2. Initialize the spatial covariance matrices B(l,f) (l = 1 to L, f = 1 to F) to the M×M identity matrix.
3. Repeat the update of the spatial covariance matrices B(l,f) (l = 1 to L, f = 1 to F) by the following equation (44) a predetermined number of times (for example, 10 times).

4. Store the spatial covariance matrices B(l,f) (l = 1 to L, f = 1 to F) in the parameter storage unit 30.

The derivation of equation (44) is as follows. Under the assumption that the vectors ζ(l,t,f) are generated according to the conditional probability distribution of equation (42), equation (44) is obtained by maximizing equation (45), the log-likelihood associated with equation (42), with respect to the spatial covariance matrix B(l,f) and φ(l,t,f).

Ignoring the constant terms in equation (45) that do not depend on the spatial covariance matrix B(l,f) or on φ(l,t,f), equation (45) can be rewritten as equation (46).

Setting the partial derivative of equation (46) with respect to φ(l,t,f) to zero and rearranging yields equation (47).

Likewise, setting the partial derivative of equation (46) with respect to B(l,f) to zero yields equation (48), and substituting equation (47) into equation (48) yields equation (44).
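Equations (44) to (48) are missing from the extracted text. The following is a plausible reconstruction that follows the steps described, assuming the usual zero-mean complex Gaussian density for N(z; 0, Φ). The indices l and f are dropped, φ_t stands for φ(l,t,f), and the correspondence with the patent's equation numbers is an assumption.

```latex
% Log-likelihood without constants (assumed form of eq. 46)
\ln \mathcal{L}(B, \{\phi_t\}) =
  - M \sum_{t=1}^{T} \ln \phi_t
  - T \ln\det B
  - \sum_{t=1}^{T} \frac{\zeta_t^{\mathsf H} B^{-1} \zeta_t}{\phi_t}
  + \text{const.}

% Zero derivative with respect to \phi_t (assumed form of eq. 47)
\phi_t = \frac{1}{M}\, \zeta_t^{\mathsf H} B^{-1} \zeta_t

% Zero derivative with respect to B (assumed form of eq. 48)
B = \frac{1}{T} \sum_{t=1}^{T} \frac{\zeta_t \zeta_t^{\mathsf H}}{\phi_t}

% Substituting the expression for \phi_t (assumed form of eq. 44)
B \leftarrow \frac{M}{T} \sum_{t=1}^{T}
  \frac{\zeta_t \zeta_t^{\mathsf H}}{\zeta_t^{\mathsf H} B^{-1} \zeta_t}
```

Under this reconstruction the update depends on each ζ_t only through its direction, since rescaling ζ_t multiplies the numerator and the denominator by the same factor.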

Next, the modeling of the marginal probability distribution of the feature vector z(t,f) in the present embodiment will be described. In the present embodiment, the marginal probability distribution of the feature vector z(t,f) is modeled by the mixture model of equation (49), which is a weighted sum of the conditional probability distributions p(z(t,f) | g(t,f) = l) with the prior probability distribution P(g(t,f) = l) of the state g(t,f) representing the sound source position as the weights.

The prior probability distribution calculation unit 40 fits the mixture model of equation (49) to the feature vectors z(t,f) computed by the feature vector calculation unit 20 and thereby computes the prior probability distribution α(l) (l = 1 to L). The mixture model is a weighted sum of the conditional probability distributions of the feature vector z(t,f) under the condition that the state representing the sound source position is known. These are based on the spatial covariance matrices B(l,f) (l = 1 to L, f = 1 to F) stored as model parameters in the parameter storage unit 30, and the weights are the prior probability distribution α(l) (l = 1 to L) of the state representing the sound source position.

There are various ways to fit the mixture model of equation (49) to the feature vectors z(t,f). For example, the likelihood associated with equation (49) can be taken as the objective function (the posterior probability, among others, can also be used) and maximized by a gradient method (the EM algorithm, among others, can also be used).

The prior probability distribution calculation unit 40 can estimate the prior probability distribution α(l) (l = 1 to L) in the same way as in the first embodiment. Unlike in the first embodiment, however, the vector w(t,f) is the L-dimensional column vector whose entries are N(z(t,f); 0, φ(l,t,f)B(l,f)) (l = 1 to L). Here, φ(l,t,f) can be computed by the following equation (50).

The derivation of the above processing is as follows. The likelihood used as the objective function is the probability that the feature vectors z(t,f) (t = 1 to T, f = 1 to F) are observed, and is given by equation (51).

Equation (50) is derived by maximizing equation (51) with respect to φ(l,t,f). Maximizing equation (51) with respect to φ(l,t,f) is equivalent to maximizing ln[N(z(t,f); 0, φ(l,t,f)B(l,f))] with respect to φ(l,t,f). Setting the partial derivative of ln[N(z(t,f); 0, φ(l,t,f)B(l,f))] with respect to φ(l,t,f) to zero therefore yields equation (50). The rest follows as in the first embodiment, giving equations (10) and (11), the update equations for the prior probability distribution α(l) (l = 1 to L).
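The maximization described above has a closed form when N(z; 0, Φ) is the usual zero-mean complex Gaussian density, namely φ = z^H B^{-1} z / M. The sketch below assumes that this is what the missing equation (50) states, and uses it to compute the logarithm of one entry of w(t,f). The function names `ml_scale` and `log_w_entry` are hypothetical.

```python
import numpy as np

def ml_scale(z, B):
    """phi that maximises ln N(z; 0, phi * B) for fixed z and B
    (assumed form of eq. 50). z must be nonzero."""
    M = z.shape[0]
    q = np.real(np.vdot(z, np.linalg.solve(B, z)))   # z^H B^{-1} z
    return q / M

def log_w_entry(z, B):
    """ln N(z; 0, phi_hat * B) with phi_hat = ml_scale(z, B), i.e. the log of
    one entry of w(t, f) for one candidate position."""
    M = z.shape[0]
    phi_hat = ml_scale(z, B)
    _, logdet_B = np.linalg.slogdet(B)
    # At phi_hat the quadratic term z^H (phi_hat * B)^{-1} z equals exactly M.
    return -M * np.log(np.pi) - M * np.log(phi_hat) - logdet_B - M
```

Evaluating `log_w_entry` for every candidate l at a given time-frequency point gives the log of the vector w(t,f), to which the first embodiment's weight estimation would then be applied.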

[Eighth Embodiment]
Next, the configuration of the eighth embodiment will be described. The eighth embodiment is an example in which the sets G(t) of sound source positions detected by the signal processing device 1 according to the second embodiment are used to track sound source positions and to compute the sound source position ρ(n,t) for each source and each frame (n = 1 to N, t = 1 to T, where N is the number of sources). In the present embodiment, a sound source position is assumed to be specified by its azimuth alone, so that G(t) is a set of azimuths and ρ(n,t) is an azimuth. One such situation is several people conversing around a table on which the microphones are placed.

The configuration of the signal processing device according to the eighth embodiment will be described with reference to FIG. 4. FIG. 4 is a diagram showing an example configuration of the signal processing device according to the eighth embodiment. As shown in FIG. 4, the signal processing device 2 includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, a sound source position calculation unit 50, and a tracking unit 51. The time-frequency analysis unit 10, the feature vector calculation unit 20, the parameter storage unit 30, the prior probability distribution calculation unit 40, and the sound source position calculation unit 50 are the same as in the second embodiment, so only the tracking unit 51, which differs, is described in detail below.

The tracking unit 51 receives the sets G(t) (t = 1 to T) of detected sound source positions (azimuths) from the sound source position calculation unit 50, tracks the sound source positions, and computes and outputs the sound source position (azimuth) ρ(n,t) for each source and each frame (n = 1 to N, t = 1 to T). This tracking can be performed in various ways. As one example, the following assumes that the approximate position (azimuth) of each source is known and uses it for tracking. An example of a situation in which the approximate position (azimuth) of each source is known is a meeting in which several people sit on chairs around a desk on which the microphones are placed. If the chairs are essentially fixed at known positions and the speakers do not change seats during the conversation, the known chair positions can be used as the approximate positions of the sources (speakers).

First, the approximate position of each source described above is taken as the initial value ρ(n,0) of the sound source position (azimuth) ρ(n,t).

Assuming that the sound source positions (azimuths) ρ(n,t−1) at frame t−1 have been obtained, the sound source positions (azimuths) ρ(n,t) at frame t can be obtained by the following processing.
1. Initialize ρ(n,t) as ρ(n,t) ← ρ(n,t−1).
2. For each detected sound source position (azimuth) r ∈ G(t) (0 ≤ r < 2π), perform the following steps 2-1 and 2-2.
2-1. Compute, by the following equation (52), the index ν of the source closest to the detected sound source position (azimuth) r.

2-2. Update the sound source position (azimuth) ρ(ν,t) of the ν-th source by equation (53).

Here, d(ξ,η) in equation (53) is the distance on the circle defined by equation (54).

In equation (53), the symbol ∠ with subscript [0, 2π) is the operator that computes the argument of a nonzero complex number in the range [0, 2π), the symbol ∠ with subscript [−π, π) is the operator that computes the argument of a nonzero complex number in the range [−π, π), and δ is a constant satisfying 0 < δ < 1 (for example, δ = 0.005).
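The per-frame tracking step can be sketched as follows. Equations (52) to (54) are missing from the extracted text, so the concrete forms used here are assumptions chosen to be consistent with the surrounding description: the circular distance is taken as d(ξ,η) = |∠[−π,π) exp(j(ξ−η))|, the nearest source is the one minimizing d(r, ρ(n,t)), and the update moves ρ(ν,t) a fraction δ of the way toward r along the shorter arc, wrapping the result to [0, 2π). All function names are hypothetical.

```python
import numpy as np

def wrap_0_2pi(theta):
    """Map an angle to [0, 2*pi), like the argument operator with subscript [0, 2*pi)."""
    return np.mod(theta, 2.0 * np.pi)

def wrap_pm_pi(theta):
    """Map an angle to [-pi, pi), like the argument operator with subscript [-pi, pi)."""
    return np.mod(theta + np.pi, 2.0 * np.pi) - np.pi

def circ_dist(xi, eta):
    """Assumed form of eq. 54: shortest angular distance on the circle."""
    return np.abs(wrap_pm_pi(xi - eta))

def track_frame(rho_prev, detections, delta=0.005):
    """rho_prev: azimuths rho(n, t-1) of the N sources, shape (N,).
    detections: iterable of detected azimuths r in G(t).
    Returns the azimuths rho(n, t)."""
    rho = np.array(rho_prev, dtype=float)        # step 1: carry over
    for r in detections:                         # step 2: each detection
        # Step 2-1 (assumed eq. 52): index of the nearest source.
        nu = int(np.argmin(circ_dist(r, rho)))
        # Step 2-2 (assumed eq. 53): small step toward r, then wrap.
        rho[nu] = wrap_0_2pi(rho[nu] + delta * wrap_pm_pi(r - rho[nu]))
    return rho
```

Starting from the known seat azimuths as ρ(n,0), calling `track_frame` once per frame with that frame's detections yields the trajectory ρ(n,t). A small δ makes each track respond slowly, which suppresses the effect of isolated false detections at the cost of slower adaptation.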

[Ninth Embodiment]
Next, the configuration of the ninth embodiment will be described. The ninth embodiment is an example in which diarization is performed based on the processing result of the signal processing device 2 according to the eighth embodiment. The diarization may be performed by deciding, for each frame, whether each sound source is present or absent (hard decision), or by calculating, for each frame, the probability that each sound source is present (soft decision). An example of the former is shown here.

The configuration of the signal processing device according to the ninth embodiment will be described with reference to FIG. 5. FIG. 5 is a diagram showing an example of the configuration of the signal processing device according to the ninth embodiment. As shown in FIG. 5, the signal processing device 3 includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, a sound source position calculation unit 50, a tracking unit 51, and a diarization unit 60. The time-frequency analysis unit 10, the feature vector calculation unit 20, the parameter storage unit 30, the prior probability distribution calculation unit 40, the sound source position calculation unit 50, and the tracking unit 51 are the same as those of the signal processing device 2, so the following describes in detail only the point of difference, the diarization unit 60.

The diarization unit 60 receives the sets G(t) (t = 1 to T) of detected sound source positions from the sound source position calculation unit 50 and the sound source position (azimuth angle) ρ(n, t) (n = 1 to N, t = 1 to T) for each sound source and each frame from the tracking unit 51, and calculates and outputs the diarization result d(n, t) for each sound source and each frame. Here, d(n, t) = 1 when sound source n is present in frame t, and d(n, t) = 0 when sound source n is absent in frame t.

Various methods can be used to calculate the diarization result d(n, t); for example, it may be calculated as follows.
1. Initialize d(n, t) (n = 1 to N, t = 1 to T) by d(n, t) ← 0.
2. For t = 1 to T, perform the following: for each detected sound source position (azimuth angle) r ∈ G(t), find ν, the sound source number n that minimizes the distance d(r, ρ(n, t)), and set d(ν, t) ← 1.
3. Take d(n, t) (n = 1 to N, t = 1 to T) as the diarization result.
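The three steps above can be sketched as follows. This is a minimal illustration, assuming the distance d(r, ρ(n, t)) is the shorter arc between two azimuths (equation (54) is not reproduced in this text); the function names and the list-of-lists data layout are hypothetical.

```python
import math


def circular_distance(xi, eta):
    """Shorter arc between two azimuths (assumed form of the distance d)."""
    diff = abs(xi - eta) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)


def diarize(detections_per_frame, positions_per_frame):
    """Hard-decision diarization.

    detections_per_frame[t] is the set G(t) of detected azimuths.
    positions_per_frame[t][n] is the tracked azimuth rho(n, t).
    Returns d with d[n][t] = 1 if source n is judged present in frame t.
    """
    num_frames = len(detections_per_frame)
    num_sources = len(positions_per_frame[0]) if num_frames else 0
    # step 1: initialize every d(n, t) to 0
    d = [[0] * num_frames for _ in range(num_sources)]
    # step 2: mark the source nearest to each detection as present
    for t in range(num_frames):
        for r in detections_per_frame[t]:
            nu = min(range(num_sources),
                     key=lambda n: circular_distance(r, positions_per_frame[t][n]))
            d[nu][t] = 1
    # step 3: d is the diarization result
    return d
```

A frame with no detections leaves every d(n, t) at 0, which corresponds to a frame in which no target source is active.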

In the ninth embodiment, when the exact sound source position (azimuth angle) of each sound source is known, the known sound source position (azimuth angle) may be used as the sound source position (azimuth angle) ρ(n, t) for each sound source and each frame instead of the sound source position (azimuth angle) calculated by the tracking unit 51. Examples of such situations include a conversation in which the speakers sit on fixed chairs, and a situation in which the sound source positions (azimuth angles) are known from video camera images.

[Tenth Embodiment]
Next, the configuration of the tenth embodiment will be described. The tenth embodiment is an example in which, in a situation where N target signals (N > 0) are mixed under background noise, the waveform of each target signal is estimated based on the sound source positions estimated by the present invention. This embodiment makes it possible to separate the mixed target signals into individual target signals and to remove the background noise.

The configuration of the signal processing device according to the tenth embodiment will be described with reference to FIG. 6. FIG. 6 is a diagram showing an example of the configuration of the signal processing device according to the tenth embodiment. As shown in FIG. 6, the signal processing device 4 includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, a sound source position calculation unit 50, a tracking unit 51, a diarization unit 60, a mask estimation unit 70, and a signal enhancement unit 80. The time-frequency analysis unit 10, the feature vector calculation unit 20, the tracking unit 51, and the diarization unit 60 are the same as those of the signal processing device 3, so the following describes in detail the points of difference: the parameter storage unit 30, the prior probability distribution calculation unit 40, the sound source position calculation unit 50, the mask estimation unit 70, and the signal enhancement unit 80.
The main differences between the signal processing device 3 and the signal processing device 4 are as follows. In the signal processing device 3, the parameter storage unit 30 stores model parameters of the conditional probability distribution under the condition that the state representing the sound source position takes a state corresponding to each of a plurality of sound source position candidates; the prior probability distribution calculation unit 40 calculates, based on the model parameters, the prior probability distribution of the states corresponding to the plurality of sound source position candidates; and the sound source position calculation unit 50 calculates the sound source positions based on the prior probability distribution. In contrast, in the signal processing device 4, the parameter storage unit 30 further stores model parameters of the conditional probability distribution under the condition that the state representing the sound source position takes a state corresponding to background noise; the prior probability distribution calculation unit 40 calculates, based on the model parameters, the prior probability distribution of the states corresponding to the plurality of sound source position candidates and to the background noise; and the sound source position calculation unit 50 calculates the sound source positions based on the prior probability distribution. In the signal processing device 4, furthermore, the mask estimation unit 70 estimates masks, which are the degrees of contribution (posterior probabilities) of each target signal and of the background noise at each time-frequency point, and the signal enhancement unit 80 calculates the waveform of each target signal based on the masks.

The parameter storage unit 30 stores the mean direction vectors a(l, f) (l = 1 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 1 to L, f = 1 to F), which are the model parameters of the complex Watson distribution serving as the conditional probability distribution of the feature vector z(t, f) under the condition that the state representing the sound source position takes a state corresponding to each of the plurality of sound source position candidates. It also stores the mean direction vector a(0, f) (f = 1 to F) and the concentration parameter κ(0, f) (f = 1 to F), which are the model parameters of the complex Watson distribution serving as the conditional probability distribution of the feature vector z(t, f) under the condition that the state representing the sound source position takes the state corresponding to background noise. For the states corresponding to the sound source position candidates, these model parameters can be calculated, for example, by the method described in the fourth embodiment; for the state corresponding to background noise, they can be calculated, for example, by the method described in the third embodiment.

The prior probability distribution calculation unit 40 fits the mixture model of equation (21) to the feature vectors z(t, f) calculated by the feature vector calculation unit 20 and thereby calculates the prior probability distribution. The mixture model of equation (21) is a weighted sum, with the prior probability distribution of the state representing the sound source position as the weights, of the conditional probability distributions of the feature vector z(t, f) under the condition that the state representing the sound source position is known; these conditional distributions are based on the model parameters stored in the parameter storage unit 30, namely the mean direction vectors a(l, f) (l = 0 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 0 to L, f = 1 to F). In this embodiment, the prior probability distribution P(g(t, f) = l) is assumed to depend on the frame and is denoted by α(l, t) (l = 0 to L, t = 1 to T). α(l, t) satisfies the constraint α(0, t) + ... + α(L, t) = 1. There are various methods for fitting the mixture model of equation (21) to the feature vectors z(t, f); in this embodiment, the fitting is performed by maximizing the likelihood of equation (21) with respect to the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) by a gradient method.

The processing in the prior probability distribution calculation unit 40 is, for example, as follows.
1. Initialize the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) by α(l, t) ← 1/(L + 1).
2. Alternately repeat the updates of the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) by the following equations (55) and (56) a predetermined number of times (for example, 10 times).

3. Output the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T).

Here, the vector ~α(t) (the symbol "~" before α denotes a tilde placed over α) is the (L + 1)-dimensional column vector consisting of α(l, t) (l = 0 to L), and the vector ~w(t, f) is the (L + 1)-dimensional column vector consisting of W(z(t, f); a(l, f), κ(l, f)) (l = 0 to L). The derivation of equations (55) and (56) is the same as in the first embodiment and is therefore omitted.
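As a rough illustration of what this fitting optimizes, the sketch below evaluates the per-frame log-likelihood of the mixture model for given weights α(l, t) and given component likelihoods W(z(t, f); a(l, f), κ(l, f)). Equations (55) and (56) are not reproduced in this text, so the gradient updates themselves are not shown; in their place the sketch uses the standard EM fixed-point update for mixture weights, which is a different technique that maximizes the same likelihood. The function names are hypothetical, and the component likelihoods are assumed to be precomputed positive numbers.

```python
import math


def uniform_prior(num_candidates):
    """Step 1: alpha(l, t) <- 1 / (L + 1) for l = 0..L (l = 0 is background noise)."""
    return [1.0 / (num_candidates + 1)] * (num_candidates + 1)


def frame_log_likelihood(alpha, w):
    """Log-likelihood of one frame t under the mixture model.

    alpha[l] is alpha(l, t); w[f][l] is W(z(t, f); a(l, f), kappa(l, f)).
    """
    return sum(math.log(sum(a * wl for a, wl in zip(alpha, w_f))) for w_f in w)


def em_weight_step(alpha, w):
    """One EM update of the weights (a substitute for the source's gradient update)."""
    new_alpha = [0.0] * len(alpha)
    for w_f in w:
        denom = sum(a * wl for a, wl in zip(alpha, w_f))
        for l, (a, wl) in enumerate(zip(alpha, w_f)):
            new_alpha[l] += a * wl / denom  # posterior of state l at this frequency
    # averaging over frequencies keeps the weights non-negative and summing to 1
    return [v / len(w) for v in new_alpha]
```

Either kind of update has to respect the constraint α(0, t) + ... + α(L, t) = 1; the EM step does so by construction, whereas a gradient method needs an explicit mechanism for it.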

The sound source position calculation unit 50 calculates and outputs the sets G(t) (t = 1 to T) of detected sound source positions based on the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) received from the prior probability distribution calculation unit 40. Specifically, it restricts the domain of the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) to l = 1 to L, which corresponds to the target sound sources, and applies the processing of the sound source position calculation unit 50 of the second embodiment to the resulting α(l, t) (l = 1 to L, t = 1 to T) to calculate the sets G(t) (t = 1 to T) of detected sound source positions.

The mask estimation unit 70 receives the mean direction vectors a(l, f) (l = 0 to L, f = 1 to F) and the concentration parameters κ(l, f) (l = 0 to L, f = 1 to F) from the parameter storage unit 30, the prior probability distribution α(l, t) (l = 0 to L, t = 1 to T) from the prior probability distribution calculation unit 40, and the sound source position (azimuth angle) ρ(n, t) (n = 1 to N, t = 1 to T) for each sound source and each frame from the tracking unit 51. It then calculates and outputs the masks γ(n, t, f) (n = 0 to N, t = 1 to T, f = 1 to F), which are the degrees of contribution (posterior probabilities) of the background noise and of each target signal to the feature vector z(t, f) at each time-frequency point. Here, γ(0, t, f) is the mask corresponding to the background noise, and γ(n, t, f) (n = 1 to N) is the mask corresponding to target signal n.

The masks γ(n, t, f) can be calculated in various ways; for example, they are calculated as follows.
1. Calculate the posterior probability P(g(t, f) = l | z(t, f)) (l = 0 to L, t = 1 to T, f = 1 to F) that g(t, f) = l given the feature vector z(t, f), by the following equations (57) and (58).

2. Calculate the mask γ(0, t, f) (t = 1 to T, f = 1 to F) corresponding to the background noise by the following equation (59).

3. Calculate the set J(n, t) (n = 1 to N, t = 1 to T) of the numbers l of the sound source position candidates that correspond to each target signal n in frame t, by the following equation (60).

4. Calculate the masks γ(n, t, f) (n = 1 to N, t = 1 to T, f = 1 to F) corresponding to the target signals by the following equation (61).

5. Output the masks γ(n, t, f) (n = 0 to N, t = 1 to T, f = 1 to F).
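The mask calculation above can be sketched for a single time-frequency point as follows. Equations (57) to (61) are not reproduced in this text, so the forms below are assumptions: the posterior is obtained by Bayes' rule from α(l, t) and the component likelihoods, the noise mask is the posterior of state l = 0, the set J(n, t) is taken to contain the candidates whose azimuth is nearest to the tracked position ρ(n, t), and the mask of target n is the sum of the posteriors over J(n, t). The function names and the argument `candidate_azimuths` are hypothetical.

```python
import math


def circular_distance(xi, eta):
    """Shorter arc between two azimuths (assumed distance)."""
    diff = abs(xi - eta) % (2.0 * math.pi)
    return min(diff, 2.0 * math.pi - diff)


def posterior(alpha_t, w_tf):
    """P(g(t, f) = l | z(t, f)) for l = 0..L by Bayes' rule.

    alpha_t[l] is alpha(l, t); w_tf[l] is W(z(t, f); a(l, f), kappa(l, f)).
    """
    joint = [a * w for a, w in zip(alpha_t, w_tf)]
    total = sum(joint)
    return [j / total for j in joint]


def masks_for_point(alpha_t, w_tf, candidate_azimuths, source_positions_t):
    """Masks gamma(n, t, f), n = 0..N, at one time-frequency point.

    candidate_azimuths[l - 1] is the azimuth of candidate l (l = 1..L).
    source_positions_t[n - 1] is rho(n, t) (n = 1..N).
    """
    post = posterior(alpha_t, w_tf)
    gamma = [0.0] * (len(source_positions_t) + 1)
    gamma[0] = post[0]  # background-noise mask
    for l, azimuth in enumerate(candidate_azimuths, start=1):
        # assumed J(n, t): candidate l is assigned to the nearest tracked source
        n = min(range(len(source_positions_t)),
                key=lambda k: circular_distance(azimuth, source_positions_t[k]))
        gamma[n + 1] += post[l]
    return gamma
```

Under this assumed definition of J(n, t), every candidate is assigned to exactly one target, so the masks over n = 0 to N sum to 1 at each time-frequency point.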

The signal enhancement unit 80 receives the observation signal vectors y(t, f) from the time-frequency analysis unit 10, the diarization results d(n, t) (n = 1 to N, t = 1 to T), each taking the value 0 or 1, from the diarization unit 60, and the masks γ(n, t, f) (n = 0 to N, t = 1 to T, f = 1 to F) of the background noise and of each target signal from the mask estimation unit 70, and estimates each target signal s(n, τ).

An example of the specific processing in the signal enhancement unit 80 is as follows.
1. Calculate the covariance matrix Φ(f) of the observed signal by the following equation (62).

2. Calculate the masks ~γ(n, t, f) (n = 0 to N, t = 1 to T, f = 1 to F), corrected using the diarization results d(n, t) (n = 1 to N, t = 1 to T), by the following equations (63) and (64). Equation (63) means that when d(n, t) = 0, the mask of sound source n in frame t is replaced with 0. Equation (64) normalizes the masks ~γ(n, t, f) so that their sum over n equals 1.

3. Calculate the covariance matrices Ψ(n, f) (n = 0 to N, f = 1 to F) by the following equation (65). Here, the matrix Ψ(0, f) is the covariance matrix corresponding to the background noise, and the matrix Ψ(n, f) (n = 1 to N) is the covariance matrix corresponding to the sum of the n-th target signal and the background noise.

4. Obtain the covariance matrix ~Ψ(n, f) (n = 1 to N, f = 1 to F) corresponding to the n-th target signal by subtracting the covariance matrix Ψ(0, f) corresponding to the background noise from the covariance matrix Ψ(n, f) corresponding to the sum of the n-th target signal and the background noise. Next, obtain the steering vector h(n, f) (n = 1 to N, f = 1 to F) of each target signal as the eigenvector corresponding to the largest eigenvalue of the matrix ~Ψ(n, f). Then normalize the vector h(n, f) by h(n, f) ← h(n, f)/h(1, n, f) so that its first element equals 1. Here, h(1, n, f) denotes the first element of the vector h(n, f).

5. Based on the minimum variance beamformer, calculate the time-frequency transform s(n, t, f) (n = 1 to N, t = 1 to T, f = 1 to F) of each target signal by the following equation (67).

6. Calculate each target signal s(n, τ) by applying the inverse of the time-frequency transform to the time-frequency transform s(n, t, f) (n = 1 to N, t = 1 to T, f = 1 to F) of each target signal.
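Steps 2, 4, and 5 of the procedure above can be sketched as follows. Equations (63) to (67) are not reproduced in this text, so the sketch rests on assumptions: the mask correction follows the verbal description of equations (63) and (64); the principal eigenvector is found by power iteration, which assumes that the largest eigenvalue of ~Ψ(n, f) also dominates in magnitude; and the beamformer uses the textbook minimum variance distortionless response form w = Φ⁻¹h / (hᴴΦ⁻¹h) with output wᴴy, which may differ in detail from equation (67). All function names are hypothetical.

```python
def correct_masks(gamma_tf, d_t):
    """Step 2: zero the masks of absent sources, then renormalize over n.

    gamma_tf[n] is gamma(n, t, f) for n = 0..N; d_t[n - 1] is d(n, t) for n = 1..N.
    """
    corrected = [gamma_tf[0]] + [g if active else 0.0
                                 for g, active in zip(gamma_tf[1:], d_t)]
    total = sum(corrected)
    if total == 0.0:  # guard added for this sketch; not described in the source
        return corrected
    return [g / total for g in corrected]


def principal_eigenvector(matrix, iterations=200):
    """Dominant eigenvector of a Hermitian matrix by power iteration."""
    size = len(matrix)
    vec = [1 + 0j] * size
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(size)) for i in range(size)]
        norm = sum(abs(v) ** 2 for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    return vec


def steering_vector(psi_n, psi_0):
    """Step 4: h(n, f) from Psi(n, f) - Psi(0, f), scaled so that h[0] == 1."""
    size = len(psi_n)
    target_cov = [[psi_n[i][j] - psi_0[i][j] for j in range(size)]
                  for i in range(size)]
    h = principal_eigenvector(target_cov)
    return [v / h[0] for v in h]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, size):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, size + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0j] * size
    for row in reversed(range(size)):
        acc = aug[row][size] - sum(aug[row][k] * x[k] for k in range(row + 1, size))
        x[row] = acc / aug[row][row]
    return x


def mvdr_output(phi, h, y):
    """Step 5 (assumed form): s = w^H y with w = Phi^-1 h / (h^H Phi^-1 h)."""
    phi_inv_h = solve(phi, h)
    denom = sum(hi.conjugate() * v for hi, v in zip(h, phi_inv_h))
    weights = [v / denom for v in phi_inv_h]
    return sum(w.conjugate() * yi for w, yi in zip(weights, y))
```

Normalizing the steering vector so that its first element is 1 means the beamformer output is an estimate of the target signal as observed at the first microphone; with the assumed MVDR form, an observation y that is an exact multiple of h passes through undistorted.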

[Eleventh Embodiment]
Next, the configuration of the eleventh embodiment will be described. The eleventh embodiment is an example in which, in a situation where N target speech signals (N > 0) are present under background noise, the waveform of each target speech signal is estimated based on the sound source positions estimated by the present invention, and each target speech signal is recognized by applying an existing speech recognition technique to it. According to the present invention, even when background noise and the speech of a plurality of speakers are mixed, the mixed target signals can be separated into individual target signals and the background noise can be removed, so that highly accurate speech recognition can be realized. One example application is the automatic transcription of a meeting held in a corner of an office where various sounds are present.

The configuration of the signal processing device according to the eleventh embodiment will be described with reference to FIG. 7. FIG. 7 is a diagram showing an example of the configuration of the signal processing device according to the eleventh embodiment. As shown in FIG. 7, the signal processing device 5 includes a time-frequency analysis unit 10, a feature vector calculation unit 20, a parameter storage unit 30, a prior probability distribution calculation unit 40, a sound source position calculation unit 50, a tracking unit 51, a diarization unit 60, a mask estimation unit 70, a signal enhancement unit 80, and a speech recognition unit 90. The time-frequency analysis unit 10, the feature vector calculation unit 20, the parameter storage unit 30, the prior probability distribution calculation unit 40, the sound source position calculation unit 50, the tracking unit 51, the diarization unit 60, the mask estimation unit 70, and the signal enhancement unit 80 are the same as in the tenth embodiment. The speech recognition unit 90 receives the waveform of each target signal from the signal enhancement unit 80, applies an existing speech recognition technique to it, and outputs the recognition result for each target signal.

[System Configuration, etc.]
The components of each illustrated device are functional and conceptual, and do not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware using wired logic.

Among the processes described in this embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings may be changed arbitrarily unless otherwise specified.

[Program]
The signal processing devices 1 to 5 of the embodiments can be implemented by installing, on a desired computer, a signal processing program that executes the above sound source localization, tracking, diarization, speech enhancement, and speech recognition as packaged software or online software. For example, by causing an information processing device to execute the signal processing program, the information processing device can be made to function as the signal processing devices 1 to 5. The information processing device referred to here includes desktop and notebook personal computers. The category of information processing devices also includes smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) terminals, and slate terminals such as PDAs (Personal Digital Assistants).

The signal processing devices 1 to 5 can also be implemented as a signal processing server device that treats a terminal device used by a user as a client and provides the client with services related to the above signal processing. For example, the signal processing server device is implemented as a server device that provides a sound source localization service that takes an observed signal as input and outputs the positions of the sound sources. In this case, the signal processing server device may be implemented as a Web server, or as a cloud that provides the above signal processing services by outsourcing.

FIG. 8 is a diagram showing an example of a computer that realizes the signal processing device by executing a program. The computer 1000 includes, for example, a memory 1010 and a CPU 1020. The computer 1000 also includes a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the signal processing devices 1 to 5 is implemented as the program module 1093, in which computer-executable code is written. The program module 1093 is stored, for example, in the hard disk drive 1090. For example, the program module 1093 for executing the same processing as the functional configuration of the signal processing devices 1 to 5 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD.

The setting data used in the processing of the embodiments described above is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed and executes them.

The program module 1093 and the program data 1094 need not be stored in the hard disk drive 1090; they may, for example, be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like), and may be read from that computer by the CPU 1020 via the network interface 1070.

1, 2, 3, 4, 5 Signal processing device
10 Time-frequency analysis unit
20 Feature vector calculation unit
30 Parameter storage unit
40 Prior probability distribution calculation unit
50 Sound source position calculation unit
51 Tracking unit
60 Diarization unit
70 Mask estimation unit
80 Signal enhancement unit
90 Speech recognition unit

Claims

A signal processing device comprising:
a time-frequency analysis unit that applies time-frequency analysis to recorded sound acquired at a plurality of different positions and calculates an observation signal vector that is an M-dimensional vector;
a feature vector calculation unit that calculates, for each time-frequency point, a feature vector that is a vector including information on the direction of the observation signal vector calculated by the time-frequency analysis unit;
a parameter storage unit that stores model parameters of a conditional probability distribution of the feature vector under the condition that a state representing a sound source position takes a state corresponding to each of a plurality of sound source position candidates;
a prior probability distribution calculation unit that fits a mixture model to the feature vector calculated by the feature vector calculation unit and thereby calculates a prior probability distribution of the state representing the sound source position, the mixture model being a weighted sum, with the prior probability distribution as weights, of the conditional probability distributions of the feature vector under the condition that the state representing the sound source position is known, the conditional probability distributions being based on the model parameters stored in the parameter storage unit; and
a sound source position calculation unit that calculates a sound source position corresponding to the feature vector based on the prior probability distribution calculated by the prior probability distribution calculation unit.

The signal processing apparatus according to claim 1, wherein
the prior probability distribution calculation unit fits, to the feature vectors calculated by the feature vector calculation unit, a mixture model that is a weighted sum of the conditional probability distributions of the feature vector under the condition that the state representing the sound source position is known, the conditional probability distributions being based on the model parameters stored in the parameter storage unit and the weights being the prior probability distribution, for each time interval, of the state representing the sound source position, and thereby calculates the prior probability distribution for each time interval, and
the sound source position calculation unit calculates a sound source position corresponding to the feature vectors for each time interval, based on the prior probability distribution for each time interval calculated by the prior probability distribution calculation unit.

The signal processing apparatus according to claim 1, wherein the parameter storage unit stores model parameters of the conditional probability distribution of the feature vector under the condition that the state representing the sound source position takes a state corresponding to each of the plurality of sound source position candidates, the model parameters having been learned using training data acquired under reverberation.

The signal processing apparatus according to any one of claims 1 to 3, wherein the parameter storage unit further stores model parameters of the conditional probability distribution of the feature vector under the condition that the state representing the sound source position takes a state corresponding to background noise.

The signal processing apparatus according to any one of claims 1 to 4, wherein the prior probability distribution calculating unit calculates the prior probability distribution based on a gradient method.

A signal processing method executed by a signal processing apparatus, the method comprising:
a time-frequency analysis step of applying time-frequency analysis to recorded sound acquired at a plurality of different positions and calculating an observation signal vector that is an M-dimensional vector;
a feature vector calculation step of calculating, for each time-frequency point, a feature vector that is a vector including information on the direction of the observation signal vector calculated in the time-frequency analysis step;
a prior probability distribution calculation step of acquiring model parameters from a parameter storage unit that stores model parameters of the conditional probability distribution of the feature vector under the condition that the state representing the sound source position takes a state corresponding to each of a plurality of sound source position candidates, fitting, to the feature vector calculated in the feature vector calculation step, a mixture model that is a weighted sum of the conditional probability distributions of the feature vector under the condition that the state representing the sound source position is known, the conditional probability distributions being based on the acquired model parameters and the weights being the prior probability distribution of the state representing the sound source position, and thereby calculating the prior probability distribution; and
a sound source position calculation step of calculating a sound source position corresponding to the feature vector based on the prior probability distribution calculated in the prior probability distribution calculation step.

A signal processing program for causing a computer to function as the signal processing apparatus according to any one of claims 1 to 5.
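
The fitting recited in the claims can be illustrated with a minimal sketch. It is not the patented implementation and makes several assumptions: the feature is reduced to a scalar direction angle, the conditional distribution for each sound source position candidate is a toy 1-D Gaussian, the mixture weights (the prior probability distribution over states) are estimated with an EM-style update rather than the gradient method named in one dependent claim, and source positions are chosen by thresholding the weights. The names `gaussian_likelihood`, `fit_prior`, and `select_positions`, and all numeric values, are hypothetical.

```python
import math


def gaussian_likelihood(z, mean, sigma):
    """Toy stand-in for p(feature | state): a 1-D Gaussian density (assumption)."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-0.5 * ((z - mean) / sigma) ** 2) / norm


def fit_prior(likelihoods, n_iter=50):
    """Estimate mixture weights while the component distributions stay fixed.

    likelihoods[n][k] holds p(z_n | state k), computed from stored model
    parameters. Only the weights (the prior over states) are updated.
    """
    n_points = len(likelihoods)
    n_states = len(likelihoods[0])
    prior = [1.0 / n_states] * n_states  # start from a uniform prior
    for _ in range(n_iter):
        counts = [0.0] * n_states
        for row in likelihoods:
            joint = [a * p for a, p in zip(prior, row)]
            total = sum(joint)
            for k in range(n_states):
                counts[k] += joint[k] / total  # posterior of state k for this point
        prior = [c / n_points for c in counts]  # new weight = mean posterior
    return prior


def select_positions(prior, candidates, threshold):
    """Keep the candidates whose weight reaches the threshold (assumed rule)."""
    return [c for c, a in zip(candidates, prior) if a >= threshold]


# Hypothetical data: three candidate directions in degrees, and features
# generated by two sources near -60 and +60 degrees.
candidates = [-60.0, 0.0, 60.0]
features = [-62.0, -58.0, -61.0, -59.0, 61.0, 59.0, 63.0, 57.0]
likelihoods = [[gaussian_likelihood(z, c, 10.0) for c in candidates]
               for z in features]

prior = fit_prior(likelihoods)
active = select_positions(prior, candidates, threshold=0.2)
print(prior, active)
```

Because the component distributions are fixed in advance, each iteration only re-estimates one weight per candidate position, and candidates with no supporting observations end up with weights near zero.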