JP5791081B2

JP5791081B2 - Sound source separation localization apparatus, method, and program

Info

Publication number: JP5791081B2
Application number: JP2012160450A
Authority: JP
Inventors: 勝彦石黒; 宏澤田; 琢馬大塚; 博奥乃
Original assignee: Kyoto University; Nippon Telegraph and Telephone Corp
Current assignee: Kyoto University; Nippon Telegraph and Telephone Corp
Priority date: 2012-07-19
Filing date: 2012-07-19
Publication date: 2015-10-07
Anticipated expiration: 2032-07-19
Also published as: JP2014021315A

Description

The present invention relates to a sound source separation and localization apparatus, method, and program, and in particular to a sound source separation and localization apparatus, method, and program that separate the sound of each individual sound source from a mixture of sounds emitted from a plurality of sound sources and localize the direction of each sound source.

Sound source separation, which separates an environmental sound formed by the superposition of sounds emitted from a plurality of sound sources (hereinafter referred to as a mixed sound) into the sound of each individual source, is a technique with a very long history. It can be used, for example, for speaker separation when creating meeting minutes from a mixed-sound recording of a meeting. Sound source localization, which calculates the relative position and direction of each sound source from the positional relationship of the microphones that observed the mixed sound and from the sound observed at each microphone, is a basic technique for, for example, self-localization and obstacle avoidance by robots and machines that move autonomously in an environment, and a large number of methods have been proposed (for example, Non-Patent Documents 1 to 3).

Non-Patent Document 1 proposes a sound source separation method that exploits the sparseness of sound sources, namely that at each time and each frequency a signal from at most one sound source is usually observed. Non-Patent Document 2 proposes a system that separates and localizes sound sources on the premise of use in a robot. Non-Patent Document 3 proposes a sound source separation method that uses more microphones than there are sound sources.

The two problems of sound source separation and sound source localization are known to be closely related, interdependent problems. For example, when the positions of a plurality of sound sources are known, a technique called a beamformer is known to restore the separated sound of each individual source accurately. Conversely, when the sound of each source has been separated, determining the position of each source is also relatively easy.

Non-Patent Document 1: Sawada, H., Araki, S. and Makino, S., “Underdetermined Convolutive Blind Source Separation via Frequency Bin-Wise Clustering and Permutation Alignment”, IEEE Transactions on Audio, Speech and Language Processing, Vol. 19, No. 3, pp. 516-527, 2011.
Non-Patent Document 2: Nakadai, K., Lourens, T., Okuno, H. G. and Kitano, H., “Active Audition for Humanoid”, in Proc. AAAI, 2000.
Non-Patent Document 3: Lee, I., Kim, T. and Lee, T.-W., “Fast Fixed-point Independent Vector Analysis Algorithms for Convolutive Blind Source Separation”, Signal Processing, Vol. 87, No. 8, pp. 1859-1871, 2007.

If the two problems of sound source separation and sound source localization described above could be solved simultaneously, very advanced intelligent systems could be realized. For example, an autonomous robot could avoid obstacles while receiving spoken commands from a specific user in a noisy environment and acting on them.

However, existing methods, typified by Non-Patent Document 1, solve the interdependent problems of sound source separation and sound source localization separately. For example, the method of Non-Patent Document 1 has sound source separation as its main purpose and assumes that each source is localized after separation is complete. Conversely, the method of Non-Patent Document 2 separates the audio signal emitted by each source after localization of each source is complete. When, as in these conventional methods, one of the two problems is first solved by a method that relies on some prior information or strong assumptions and the other problem is solved afterwards, poor accuracy on the problem solved first also greatly degrades the accuracy on the other.

The present invention has been made to solve the above problem, and its object is to provide a sound source separation and localization apparatus, method, and program that can stably obtain high performance on both sound source separation and sound source localization.

To achieve the above object, the sound source separation and localization apparatus of the present invention includes: reception means that receives mixed sound signals obtained by observing a mixture of the sounds emitted from a plurality of sound sources with a plurality of observation means arranged at mutually different positions; analysis means that analyzes sound source separation, which separates the mixed sound signals received by the reception means so as to correspond to each of the plurality of sound sources, and sound source localization, which estimates the direction in which each of the plurality of sound sources exists relative to the observation means, by simultaneous optimization that iterates using variables made mutually dependent between the sound source separation and the sound source localization; and output means that outputs the results of the sound source separation and sound source localization analyzed by the analysis means.

According to the sound source separation and localization apparatus of the present invention, the reception means receives mixed sound signals obtained by observing a mixture of the sounds emitted from a plurality of sound sources with a plurality of observation means arranged at mutually different positions. The analysis means then analyzes sound source separation, which separates the mixed sound signals received by the reception means so as to correspond to each of the plurality of sound sources, and sound source localization, which estimates the direction in which each of the plurality of sound sources exists relative to the observation means, by simultaneous optimization that iterates using variables made mutually dependent between the sound source separation and the sound source localization. Using mutually dependent variables means that a variable obtained in one of sound source separation and sound source localization is used when obtaining a variable in the other. Finally, the output means outputs the results of the sound source separation and sound source localization analyzed by the analysis means.

By making sound source separation and sound source localization depend on each other and analyzing them by simultaneous optimization in this way, high performance can be obtained stably on both problems.

The reception means can convert the mixed sound signal into an observation signal x_tf in the time-frequency domain, composed of one element for each time frame t and frequency bin f, and pass it to the analysis means. The analysis means can be configured to include: sound source time-frequency mask variable calculation means that calculates, for each of a plurality of masks that assign each element of the observation signal x_tf to one of a plurality of virtually set sound sources, a mask variable ξ_tfk representing the probability that the element is a signal corresponding to the k-th mask; sound source localization variable calculation means that calculates, for each of a plurality of directions into which the surroundings are divided with the observation means as reference, a sound source localization variable η_kd representing the probability that the sound source corresponding to the k-th mask exists in the d-th direction; statistic calculation means that calculates the statistics used to calculate the mask variable ξ_tfk and the sound source localization variable η_kd; and convergence condition determination means that causes the calculations of the sound source time-frequency mask variable calculation means, the sound source localization variable calculation means, and the statistic calculation means to be repeated until a predetermined convergence condition is satisfied. The sound source localization variable η_kd can be used in calculating the mask variable ξ_tfk, and the mask variable ξ_tfk can be used in calculating the sound source localization variable η_kd. This allows sound source separation and sound source localization to depend on each other so that the simultaneous optimization is performed efficiently.
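The mutual dependence between ξ_tfk and η_kd can be illustrated with a toy alternating update. The sketch below is not the update rule of the invention: it assumes that a log-likelihood `loglik[t, f, d]` of each observation under each direction is already available, omits the statistic calculation entirely, and uses a simple mean-field-style softmax update. The function names `softmax` and `alternate`, and the parameters `tol`, `max_iter`, and `seed`, are hypothetical.

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along one axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def alternate(loglik, K, tol=1e-6, max_iter=100, seed=0):
    """Toy alternating update of xi (T, F, K) and eta (K, D).

    loglik: (T, F, D) array, assumed log-likelihood of x_tf under direction d.
    """
    rng = np.random.default_rng(seed)
    T, F, D = loglik.shape
    eta = rng.dirichlet(np.ones(D), size=K)   # random initial directions
    xi = np.full((T, F, K), 1.0 / K)          # uniform initial masks
    for _ in range(max_iter):
        # The mask variable is computed from the current localization variable.
        xi_new = softmax(np.einsum("tfd,kd->tfk", loglik, eta), axis=-1)
        # The localization variable is computed from the new mask variable.
        eta_new = softmax(np.einsum("tfk,tfd->kd", xi_new, loglik), axis=-1)
        change = max(np.abs(xi_new - xi).max(), np.abs(eta_new - eta).max())
        xi, eta = xi_new, eta_new
        if change < tol:                      # predetermined convergence condition
            break
    return xi, eta
```

Each row of `xi` along the last axis and each row of `eta` sums to one, so both can be read as the probabilities described above. The loop stops when neither variable changes by more than `tol`, which stands in for the predetermined convergence condition.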

The analysis means can also analyze the sound source separation and the sound source localization using steering vectors of the plurality of observation means measured in an anechoic environment. This makes the invention applicable to a variety of reverberant environments.

The sound source separation and localization method of the present invention is a method in a sound source separation and localization apparatus including reception means, analysis means, and output means, in which: the reception means receives mixed sound signals obtained by observing a mixture of the sounds emitted from a plurality of sound sources with a plurality of observation means arranged at mutually different positions; the analysis means analyzes sound source separation, which separates the mixed sound signals received by the reception means so as to correspond to each of the plurality of sound sources, and sound source localization, which estimates the direction in which each of the plurality of sound sources exists relative to the observation means, by simultaneous optimization that iterates using variables made mutually dependent between the sound source separation and the sound source localization; and the output means outputs the results of the sound source separation and sound source localization analyzed by the analysis means.

The method can also be a sound source separation and localization method in an apparatus whose analysis means includes sound source time-frequency mask variable calculation means, sound source localization variable calculation means, statistic calculation means, and convergence condition determination means, in which: the reception means converts the mixed sound signal into an observation signal x_tf in the time-frequency domain, composed of one element for each time frame t and frequency bin f, and passes it to the analysis means; the sound source time-frequency mask variable calculation means calculates, for each of a plurality of masks that assign each element of the observation signal x_tf to one of a plurality of virtually set sound sources, a mask variable ξ_tfk representing the probability that the element is a signal corresponding to the k-th mask; the sound source localization variable calculation means calculates, for each of a plurality of directions into which the surroundings are divided with the observation means as reference, a sound source localization variable η_kd representing the probability that the sound source corresponding to the k-th mask exists in the d-th direction; the statistic calculation means calculates the statistics used to calculate the mask variable ξ_tfk and the sound source localization variable η_kd; and the convergence condition determination means causes the calculations of the sound source time-frequency mask variable calculation means, the sound source localization variable calculation means, and the statistic calculation means to be repeated until a predetermined convergence condition is satisfied. The sound source localization variable η_kd can be used in calculating the mask variable ξ_tfk, and the mask variable ξ_tfk can be used in calculating the sound source localization variable η_kd.

In the sound source separation and localization method of the present invention, the analysis means can also analyze the sound source separation and the sound source localization using steering vectors of the plurality of observation means measured in an anechoic environment.

The sound source separation and localization program of the present invention is a program for causing a computer to function as each of the means constituting the sound source separation and localization apparatus described above.

As described above, the sound source separation and localization apparatus, method, and program of the present invention make sound source separation and sound source localization depend on each other and analyze them by simultaneous optimization, and thereby provide the effect that high performance can be obtained stably on both sound source separation and sound source localization.

FIG. 1 is a conceptual diagram showing an outline (sound source separation) of the present embodiment.
FIG. 2 is a conceptual diagram showing an outline (sound source localization) of the present embodiment.
FIG. 3 is a block diagram showing the functional configuration of the sound source separation and localization apparatus according to the present embodiment.
FIG. 4 is a diagram showing the configuration of the storage unit.
FIG. 5 is a flowchart showing the contents of the sound source separation and localization processing routine in the present embodiment.
FIG. 6 is a flowchart showing the contents of the initial value generation processing routine.
FIG. 7 is a flowchart showing the contents of the sound source time-frequency mask variable calculation processing routine.
FIG. 8 is a flowchart showing the contents of the sound source localization variable calculation processing routine.
FIG. 9 is a flowchart showing the contents of the statistic calculation processing routine.
FIG. 10 is a schematic diagram showing the setup of an experimental example.
FIG. 11 is a graph showing sound source localization performance in the experimental example.
FIG. 12 is a graph showing sound source separation performance in the experimental example.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Overview>
First, an outline of the present embodiment will be described. FIGS. 1 and 2 are conceptual diagrams showing an outline of the present embodiment.

As shown in FIG. 1, sound source separation is realized by assigning each (t, f) element of the observation signal, obtained by converting the observed mixed sound into a time-frequency domain signal by Fourier transform, to one of K sound sources. Here, t is an index representing a time frame and f is an index representing a frequency bin. This approach is used in Non-Patent Document 1 and elsewhere as a sound source separation method that exploits the sparseness of observation signals in the time-frequency domain, and it is known to give good separation performance. The assignment of the time-frequency domain observation signal to the sound sources is called a "sound source time-frequency mask". With this mask, the mixed sound can be separated source by source and the sound emitted from each source can be restored.
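A minimal sketch of applying such a mask is given below. It assumes that the mask is stored as an array `xi` of per-source probabilities and that each (t, f) element is assigned to the single most probable source (a hard mask). A soft mask that multiplies by the probabilities is an equally plausible reading. The function name `apply_masks` and the array layouts are hypothetical.

```python
import numpy as np

def apply_masks(x, xi):
    """Split the observation among K sources with a hard time-frequency mask.

    x:  (T, F, M) complex observation (frame t, bin f, microphone m).
    xi: (T, F, K) probability that element (t, f) belongs to source k.
    Returns y of shape (K, T, F, M).
    """
    K = xi.shape[-1]
    winner = np.argmax(xi, axis=-1)                 # (T, F) most probable source
    onehot = winner[..., None] == np.arange(K)      # (T, F, K) boolean mask
    # Move the source axis to the front and broadcast over microphones.
    return np.moveaxis(onehot, -1, 0)[..., None] * x[None]
```

Because every element is given to exactly one source, summing the result over the source axis returns the original observation.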

As shown in FIG. 2, sound source localization is realized by determining the direction of each of the K sound sources as one of the directions in the 360-degree range centered on the microphone array. Mathematically, the directions are discretized (divided) into D directions according to constraints such as the directional resolution of the microphones. Each sound source is then localized by assigning it to one of the D directions, that is, by clustering the sources into the D directions.
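The discretization can be sketched as follows, assuming D equal angular bins starting at 0 degrees and a per-source probability table `eta` over those bins. The bin layout, the choice of the most probable bin, and the names `direction_centers` and `localize` are assumptions made for illustration.

```python
import numpy as np

def direction_centers(D):
    """Angles in degrees of D equally spaced directions covering 360 degrees."""
    return np.arange(D) * (360.0 / D)

def localize(eta):
    """Pick the most probable direction for each source.

    eta: (K, D) array, row k is a probability distribution over D directions.
    Returns an array of K angles in degrees.
    """
    D = eta.shape[1]
    return direction_centers(D)[np.argmax(eta, axis=1)]
```

With D = 72, for instance, each bin spans 5 degrees, so a source whose probability peaks in bin 12 is localized at 60 degrees.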

These individual methods are not new in themselves. The present embodiment is characterized by a framework that solves sound source separation and sound source localization simultaneously and in a mutually dependent manner. That is, in the present embodiment, sound source separation is made possible by calculating a sound source time-frequency mask, and sound source localization is made possible by calculating the direction of each sound source cluster. Furthermore, separation and localization are optimized simultaneously by alternating between them and converging with an iterative calculation method, which makes simultaneous separation and localization of a plurality of sound sources possible.

Another feature of the present embodiment is the use of the anechoic steering vectors of the microphone array. An anechoic steering vector is the impulse response in an anechoic chamber, which is an inherent acoustic property of each microphone array. Although this information differs from the impulse response under actual observation conditions in a reverberant environment, it is often very effective for predicting that impulse response. In the present embodiment, the anechoic steering vectors are measured and input in advance, and the acoustic characteristics suited to the actual mixed sound are estimated simultaneously with sound source separation and localization. As a result, good separation and localization performance can be obtained regardless of the reverberant environment.

<System configuration>
A sound source separation and localization apparatus 10 according to the present embodiment is constituted by a computer including a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) storing a program for executing a sound source separation and localization processing routine described later. The apparatus is realized by the CPU reading the program from the ROM, which is an internal storage device, and executing it.

Functionally, as shown in FIG. 3, this computer can be represented by a configuration including a reception unit 1 that receives the mixed sound to be analyzed and data indicating the acoustic characteristics of the microphone array, an analysis unit 2 that calculates and updates the variables needed to analyze sound source separation and sound source localization, a storage unit 3 that stores the received data and the calculated information, and an output unit 4 that outputs the analysis results.

The reception unit 1 can further be represented by a configuration including a mixed sound observation unit 11, a time-frequency domain observation conversion unit 12, and a preset value reception unit 13.

The mixed sound observation unit 11 receives, from an input device such as a storage device or from a microphone array attached to the apparatus, a mixed sound signal in which the observed mixed sound has been converted into electronic data. For example, a mixed sound signal that has already been observed by a microphone array, converted into electronic data, and stored in a storage device can be received as input data by reading it from the storage device. Alternatively, a mixed sound observed by the microphone array attached to the apparatus can be converted directly into electronic data and received.

The time-frequency domain observation conversion unit 12 converts the mixed sound signal received by the mixed sound observation unit 11 into a time-frequency domain signal using the Fourier transform. Hereinafter, the signal obtained by converting the mixed sound signal into the time-frequency domain is referred to as the observation signal.

The preset value reception unit 13 receives, from an input device such as a keyboard or a storage device, the constants required by the model implemented in the apparatus (described later), part of the initial values of the statistics, including the anechoic steering vector information of the microphone array that observed the input mixed sound, and the value of the convergence determination threshold.

The analysis unit 2 can further be represented by a configuration including an initial value generation unit 21, a sound source time-frequency mask variable calculation unit 22, a sound source localization variable calculation unit 23, a statistic calculation unit 24, and a convergence condition determination unit 25.

The initial value generation unit 21 stores the information received by the reception unit 1 in the respective parts of the storage unit 3 and initializes the values stored in those parts.

The sound source time-frequency mask variable calculation unit 22 uses the information stored in the storage unit 3 to calculate the sound source time-frequency mask variable, and stores and updates it.

The sound source localization variable calculation unit 23 uses the information stored in the storage unit 3 to calculate the sound source localization variable, and stores and updates it.

The statistic calculation unit 24 uses the information stored in the storage unit 3 to calculate the statistics, and stores and updates them.

The convergence condition determination unit 25 uses the information stored in the storage unit 3 to determine whether the calculation processing of the analysis unit 2 should continue or end. When it ends, the analysis results are passed to the output unit 4.

As shown in FIG. 4, the storage unit 3 is provided with a constant storage unit 31, an observation signal storage unit 32, a sound source time-frequency mask variable storage unit 33, a sound source localization variable storage unit 34, a statistic storage unit 35, a statistic initial value storage unit 36, and a convergence determination threshold storage unit 37.

The constant storage unit 31 stores the constants required by the model implemented in the apparatus.

The observation signal storage unit 32 stores the observation signal converted by the time-frequency domain observation conversion unit 12.

The sound source time-frequency mask variable storage unit 33 stores the sound source time-frequency mask variable, which mainly represents the analysis result of sound source separation.

The sound source localization variable storage unit 34 stores the sound source localization variable, which mainly represents the analysis result of sound source localization.

The statistic storage unit 35 stores the various statistics required for sound source separation and sound source localization.

The statistic initial value storage unit 36 stores the statistic initial values, which are the initial values required to calculate the statistics.

The convergence determination threshold storage unit 37 stores the threshold used to determine convergence of the analysis results.
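One way to picture the storage unit 3 is as a single record with one field per sub-unit. The sketch below is only an illustration: the class name `Storage`, the field names, the array shapes, and the default threshold of 1e-4 are all assumptions and do not come from the description.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Storage:
    """Illustrative container mirroring storage units 31 to 37."""
    constants: dict                      # 31: model constants, e.g. K and D
    x: np.ndarray                        # 32: observation signal, (T, F, M) complex
    xi: np.ndarray                       # 33: mask variable, (T, F, K)
    eta: np.ndarray                      # 34: localization variable, (K, D)
    statistics: dict = field(default_factory=dict)          # 35
    initial_statistics: dict = field(default_factory=dict)  # 36: includes steering vectors
    convergence_threshold: float = 1e-4                     # 37: assumed default value
```

Keeping the mask variable and the localization variable side by side in one record reflects the fact that each calculation unit reads what the others have most recently written.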

The output unit 4 can further be represented by a configuration including a separated sound extraction unit 41, a speech waveform restoration unit 42, a sound source direction extraction unit 43, and a final output unit 44.

The separated sound extraction unit 41 uses the information stored in the storage unit 3 to calculate the separated sound signals in the time-frequency domain and passes them to the speech waveform restoration unit 42.

The speech waveform restoration unit 42 uses the information stored in the storage unit 3 and the time-frequency domain signal of each separated sound passed from the separated sound extraction unit 41, and restores each time-frequency domain signal to the audio signal of the separated sound by inverse Fourier transform.
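The restoration step can be sketched as an inverse short-time Fourier transform with weighted overlap-add. The sketch assumes a single channel of one separated sound, a Hann analysis window, and a frame length of 512 samples with a hop of 256. None of these settings are stated in the description, and the function name `istft` is hypothetical.

```python
import numpy as np

def istft(y, frame_len=512, hop=256):
    """Rebuild a waveform from one-sided spectra by weighted overlap-add.

    y: (T, F) complex array with F = frame_len // 2 + 1, assumed to have been
       produced from frames multiplied by a Hann window.
    """
    window = np.hanning(frame_len)
    n_frames = y.shape[0]
    out = np.zeros(frame_len + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for t in range(n_frames):
        frame = np.fft.irfft(y[t], n=frame_len)      # back to the time domain
        start = t * hop
        out[start:start + frame_len] += frame * window
        norm[start:start + frame_len] += window ** 2
    # Divide by the accumulated squared window; guard the zero-weight edges.
    return out / np.maximum(norm, 1e-12)
```

Dividing by the accumulated squared window undoes the analysis and synthesis windowing wherever the window is non-zero. Only the very first and last samples, where the Hann window is exactly zero, are not recovered.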

The sound source direction extraction unit 43 uses the information stored in the storage unit 3 to calculate the direction of each sound source and thereby localize it.

The final output unit 44 outputs the analysis results of sound source separation and sound source localization, in the format desired by the user, to an output device implemented as a display, a printer, a speaker, a magnetic disk, or the like.

<Operation of the present embodiment>
Next, the operation of the sound source separation and localization apparatus 10 according to the present embodiment will be described. The sound source separation and localization processing routine shown in FIG. 5 is executed in the apparatus 10 either when a mixed sound observed with a microphone array, in which a plurality of microphones are placed in an arbitrary arrangement, has been stored in a storage device as a mixed sound signal, or while a mixed sound is being observed by a microphone array attached to the apparatus.

In step 100, the mixed sound observation unit 11 receives the mixed sound signal either by reading it from the storage device or directly, by converting the mixed sound observed by the microphone array into a mixed sound signal in the form of electronic data.

Next, in step 102, the time frequency domain observation conversion unit 12 converts the mixed sound signal received in step 100 into an observation signal, which is a signal in the time-frequency domain. A short-time Fourier transform (STFT) or a fast Fourier transform (FFT) can be used for the conversion. The converted observation signal is denoted by x_tf. Each x_tf is the transformed representation of the audio signal in time frame t (t = 1, ..., T) and in the f-th (f = 1, ..., F) frequency bin (frequency band) of the Fourier transform. Each x_tf is a vector with as many elements as there are microphones, and each element is a complex number. Hereinafter, x_tf is used as the observed quantity of the apparatus. In the present embodiment, this observed quantity is assumed to be generated from a complex normal distribution. The entire input mixed sound signal is converted into the time-frequency domain, and each converted x_tf is passed to the initial value generation unit 21.
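The conversion of step 102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the apparatus: the function name `stft_multichannel`, the Hann window, the frame length, and the hop size are all hypothetical choices that the text does not specify, and a naive DFT is used so that only the standard library is needed. Indices are zero-based here, whereas the text counts t and f from 1.

```python
import cmath
import math


def stft_multichannel(signals, frame_len, hop):
    """Return x[t][f] as a list of M complex values (one per microphone).

    signals: list of M equal-length lists of real samples.
    """
    n_samples = len(signals[0])
    n_frames = 1 + (n_samples - frame_len) // hop
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    # Hann window (an assumption; the source does not name a window).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    x = []
    for t in range(n_frames):
        start = t * hop
        frame_bins = []
        for f in range(n_bins):
            vec = []
            for channel in signals:
                acc = 0j
                # Naive DFT of the windowed frame at bin f.
                for n in range(frame_len):
                    acc += (channel[start + n] * window[n]
                            * cmath.exp(-2j * math.pi * f * n / frame_len))
                vec.append(acc)
            frame_bins.append(vec)
        x.append(frame_bins)
    return x
```

Each `x[t][f]` is then an M-element complex vector, matching the description of x_tf. A practical system would use an FFT routine instead of the quadratic-time loop shown here.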

次に、ステップ１０４で、事前設定値受付部１３が、音源分離定位装置１０における解析処理に必要な定数を受け付ける。定数には、観測信号の総時間フレーム数Ｔ、観測信号の総周波数ビン数Ｆ、仮想的に設定する音源の最大数であるマスク数Ｋ、音源方向をクラスタリングするための方向クラス数Ｄ、混合音を観測したマイクロフォンアレイのマイク数Ｍ、及び統計量の初期値を計算する際に利用される正実数である正則化定数εが含まれる。マスク数Ｋは、例えば１２とすることができるが、さらに多数の音源が予想される場合には、音源数を十分上回る数を設定する。また、正則化定数εは、例えば０．０００１と設定することができる。 Next, in step 104, the preset value receiving unit 13 receives constants necessary for analysis processing in the sound source separation localization apparatus 10. The constants include the total number of time frames T of the observed signal, the total number of frequency bins F of the observed signal, the number of masks K that is the maximum number of sound sources virtually set, the number of direction classes D for clustering the sound source directions, and the mixture The number of microphones M of the microphone array in which the sound is observed and the regularization constant ε that is a positive real number used when calculating the initial value of the statistic are included. The number K of masks can be set to 12, for example, but when a larger number of sound sources are expected, a number sufficiently exceeding the number of sound sources is set. Further, the regularization constant ε can be set to 0.0001, for example.

The preset value receiving unit 13 also receives some of the initial values of the statistics required for the analysis (β_k^0, κ_d^0, a_tf^0, v_fd^0). β_k^0 is the initial parameter of the Dirichlet distribution used in the mathematical model of the sound source time frequency mask variable, and is set to a value greater than 0 for k = 1, ..., K. For example, β_1^0 = β_2^0 = ... = β_K^0 can be used. κ_d^0 is the initial parameter of the Dirichlet distribution used in the mathematical model of the sound source localization variable, and is set to a value greater than 0 for d = 1, ..., D. For example, κ_1^0 = κ_2^0 = ... = κ_D^0 can be used. a_tf^0 is the initial parameter of the gamma distribution used in the mathematical model of the time-frequency domain observation signal. v_fd^0 is the initial parameter of the complex Wishart distribution used in the mathematical model of the time-frequency domain observation signal. For example, a_tf^0 = 1 and v_fd^0 = M can be set for t = 1, ..., T, f = 1, ..., F, and d = 1, ..., D.
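The constants and initial values received in step 104 could be gathered in a single record, as in the sketch below. The class name `Presets` and its field names are hypothetical. The defaults K = 12, ε = 0.0001, a_tf^0 = 1, and v_fd^0 = M follow the examples in the text, while the common value 1.0 for β_k^0 and κ_d^0 is only an illustrative assumption (the text requires only that they be greater than 0).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Presets:
    T: int               # total number of time frames
    F: int               # total number of frequency bins
    M: int               # number of microphones
    D: int               # number of direction classes
    K: int = 12          # number of masks (upper bound on sources)
    eps: float = 1e-4    # regularization constant
    beta0: float = 1.0   # assumed common value of every beta_k^0
    kappa0: float = 1.0  # assumed common value of every kappa_d^0
    a0: float = 1.0      # a_tf^0 = 1, as in the text

    def __post_init__(self):
        if min(self.T, self.F, self.M, self.D, self.K) < 1:
            raise ValueError("T, F, M, D and K must be positive integers")
        if not (self.eps > 0 and self.beta0 > 0 and self.kappa0 > 0):
            raise ValueError("eps, beta0 and kappa0 must be greater than 0")

    @property
    def v0(self):
        # v_fd^0 = M, as in the text
        return float(self.M)
```

For example, `Presets(T=100, F=257, M=8, D=72)` keeps the default of 12 masks; when more sound sources are expected, K should be raised well above the expected count, as the text advises.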

さらに、事前設定値受付部１３は、マイクロフォンアレイの音響的特性を表す、無響ステアリングベクトルｑも受け付ける。無響ステアリングベクトルは周波数ビンｆ及び方向ｄ毎に式（１）に示すように、Ｍ本のマイク毎に事前に無響室で測定したものである。 Furthermore, the preset value receiving unit 13 also receives an anechoic steering vector q representing the acoustic characteristics of the microphone array. The anechoic steering vector is measured in advance in the anechoic chamber for each of the M microphones as shown in Equation (1) for each frequency bin f and direction d.

例えば非特許文献１や非特許文献３等の従来手法では、利用するマイクの配置や無響室でのインパルス応答といった、システム固有の音響的特性を利用していない。これは利用するシステムの設定に寄らない一般性を持つものの、音響特性の事前情報が利用できないことで様々な残響環境において高精度な音源分離及び定位性能を得る可能性が低くなってしまう。 For example, conventional methods such as Non-Patent Document 1 and Non-Patent Document 3 do not use acoustic characteristics unique to the system, such as arrangement of microphones to be used and impulse response in an anechoic room. Although this has a generality that does not depend on the setting of the system to be used, the possibility of obtaining high-precision sound source separation and localization performance in various reverberant environments is reduced because the prior information on acoustic characteristics cannot be used.

そこで、本実施の形態では、システム固有の音響特性である無響ステアリングベクトルを事前に測定及び入力しておくことで、様々な残響環境にも適応できる高精度な解析を実現することができる。 Therefore, in the present embodiment, high-accuracy analysis that can be applied to various reverberant environments can be realized by measuring and inputting an anechoic steering vector, which is an acoustic characteristic unique to the system, in advance.

さらに、事前設定値受付部１３は、解析処理の収束判定に利用する収束判定閾値θも受け付ける。収束判定閾値θの値は、ユーザの設定した収束判定基準によって変わるが、本実施の形態では、音源分離解析の変化幅を利用するため、正の実数となる。 Further, the preset value receiving unit 13 also receives a convergence determination threshold value θ used for determination of convergence of the analysis process. Although the convergence determination threshold value θ varies depending on the convergence determination criterion set by the user, in the present embodiment, since it uses the change width of the sound source separation analysis, it becomes a positive real number.

Next, in steps 200 to 600, the analysis unit 2 performs calculations for optimizing the variables defined in the storage unit 3. Various optimization methods can be used for this purpose (for example, Non-Patent Document 4: C. M. Bishop, "Pattern Recognition and Machine Learning, Vols. 1 and 2", Springer Japan, 2007), but in the present embodiment, sound source separation and sound source localization are optimized simultaneously based on the variational Bayes method.

First, in step 200, the initial value generation unit 21 executes the initial value generation processing routine shown in FIG. 6 to set the initial values. Then, in step 300, the sound source time frequency mask variable calculation unit 22 executes the sound source time frequency mask variable calculation processing routine shown in FIG. 7; in step 400, the sound source localization variable calculation unit 23 executes the sound source localization variable calculation processing routine shown in FIG. 8; and in step 500, the statistic calculation unit 24 executes the statistic calculation processing routine shown in FIG. 9. The values are calculated in this order, and the calculation is iterated until the convergence condition is satisfied, thereby optimizing the variables. In calculations based on the variational Bayes method, the calculation results are guaranteed to converge.

Each process is described in detail below. The sound source time frequency mask variable calculation processing routine, the sound source localization variable calculation processing routine, and the statistic calculation processing routine may be executed in any order.

まず、初期値生成処理ルーチン（図６）では、ステップ２０２で、上記ステップ１０４において事前設定値受付部１３が受け付けた総時間フレーム数Ｔ、総周波数ビン数Ｆ、マスク数Ｋ、方向クラス数Ｄ、マイク数Ｍ、及び正則化定数εを、定数記憶部３１に保存する。 First, in the initial value generation processing routine (FIG. 6), in step 202, the total number of time frames T, the total number of frequency bins F, the number of masks K, and the number of direction classes D received by the preset value receiving unit 13 in step 104 above. , The number of microphones M, and the regularization constant ε are stored in the constant storage unit 31.

次に、ステップ２０４で、上記ステップ１０２において時間周波数領域観測変換部１２の計算の結果得られた観測信号ｘ_ｔｆを、観測信号記憶部３２に保存する。 Next, in step 204, the observation signal x _tf obtained as a result of the calculation of the time frequency domain observation conversion unit 12 in step 102 is stored in the observation signal storage unit 32.

次に、ステップ２０６で、上記ステップ１０４において事前設定値受付部１３が受け付けた収束判定閾値θを、収束判定閾値記憶部３７に保存する。 Next, in step 206, the convergence determination threshold θ received by the preset value reception unit 13 in step 104 is stored in the convergence determination threshold storage unit 37.

Next, in step 208, the initial values of the statistics are set. First, the part of the statistic initial values received by the preset value receiving unit 13 in step 104 (β_k^0, κ_d^0, a_tf^0, v_fd^0) is stored in the statistic initial value storage unit 36. Further, using the anechoic steering vector q_fd received by the preset value receiving unit 13 in step 104 and the observation signal x_tf calculated in step 102, G_fd^0 and b_tf^0, which are also part of the statistic initial values, are calculated as shown in equations (2) and (3). Here, H denotes the Hermitian transpose, and I_M denotes the M-dimensional identity matrix.

式（２）及び（３）により計算された統計量初期値の一部（Ｇ_ｆｄ ^０、ｂ_ｔｆ ^０）を統計量初期値記憶部３６に保存する。 A part (G _fd ⁰ , b _tf ⁰ ) of the statistic initial value calculated by the equations (2) and (3) is stored in the statistic initial value storage unit 36.

次に、ステップ２１０で、式（４）により、音源時間周波数マスク変数の初期値ξ_ｔｆｋ ^０を計算し、計算したξ_ｔｆｋ ^０をξ_ｔｆｋとして音源時間周波数マスク変数記憶部３３に保存する。なお、Ｚはｋに関する和を１にするための正規化項である。 Next, in step 210, the initial value ξ _tfk ⁰ of the sound source time frequency mask variable is calculated by the equation (4), and the calculated ξ _tfk ⁰ is stored in the sound source time frequency mask variable storage unit 33 as ξ _tfk . Z is a normalization term for making the sum related to k 1.

次に、ステップ２１２で、式（５）及び（６）により、音源定位変数の初期値η_ｋｄ ^０を計算し、計算したη_ｋｄ ^０をη_ｋｄとして音源定位変数記憶部３４に保存する。 Next, in step 212, the initial value η _kd ⁰ of the sound source localization variable is calculated by the equations (5) and (6), and the calculated η _kd ⁰ is stored in the sound source localization variable storage unit 34 as η _kd .

以上、初期値の設定が終了すると、初期値生成処理ルーチンを終了して、音源分離定位処理ルーチンへリターンする。 As described above, when the setting of the initial value is completed, the initial value generation processing routine is ended, and the process returns to the sound source separation localization processing routine.

次に、音源時間周波数マスク変数計算処理ルーチン（図７）では、ステップ３０２で、記憶部３から必要な情報をロードし、次に、ステップ３０４で、時間フレームに対応する変数ｔを１にセットし、次に、ステップ３０６で、周波数ビンに対応する変数ｆを１にセットする。 Next, in the sound source time frequency mask variable calculation processing routine (FIG. 7), necessary information is loaded from the storage unit 3 in step 302, and then a variable t corresponding to the time frame is set to 1 in step 304. Then, in step 306, the variable f corresponding to the frequency bin is set to 1.

Next, in step 308, the sound source time frequency mask variable ξ_tfk is calculated for k = 1, ..., K by equation (7). Here, Ψ is the digamma function.

The sound source time frequency mask variable ξ_tfk is the variable calculated in order to separate the observation signal, where t = 1, ..., T, f = 1, ..., F, and k = 1, ..., K. ξ_tfk represents the probability that the observation signal in time frame t and frequency bin f is a signal from the k-th sound source (mask). By separating the observation signal into K sound sources according to ξ_tfk, the separated sound of each sound source can be restored. In the present embodiment, ξ_tfk is assumed to be generated from a multinomial distribution whose parameters are determined by a Dirichlet distribution parameterized by the statistic β.

式（７）のポイントは、右辺第４項にあるように、音源定位変数η_ｋｄが必要であるという点である。これは、音源定位の情報を使って音源分離が改善されることを表している。 The point of Expression (7) is that the sound source localization variable η _kd is necessary as in the fourth term on the right side. This indicates that sound source separation is improved by using information on sound source localization.

Next, in step 310, the sound source time frequency mask variable ξ_tfk calculated for k = 1, ..., K in step 308 is normalized by equation (8). Since ξ_tfk is a probability, it is normalized so that, for each t and f, the sum over all k is always 1.
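The normalization over k can be sketched as follows. Because equation (7) involves the digamma function, it is plausible, though not stated in this text, that step 308 produces unnormalized logarithmic weights; the sketch assumes that and exponentiates with the maximum subtracted first for numerical stability. The function name `normalize_log_weights` is hypothetical.

```python
import math


def normalize_log_weights(log_w):
    """Turn K unnormalized log-weights into probabilities summing to 1."""
    peak = max(log_w)
    # Subtracting the maximum keeps exp() from overflowing.
    shifted = [math.exp(v - peak) for v in log_w]
    total = sum(shifted)
    return [s / total for s in shifted]
```

The same helper would apply unchanged to the per-source normalization over directions d, since both steps only require that a set of non-negative weights sum to 1.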

Next, in step 312, f is incremented by 1, and in the next step 314 it is determined whether f exceeds the total number of frequency bins F. If f does not exceed F, the process returns to step 308, and the processing of steps 308 to 312 is repeated.

一方、ｆがＦを超えた場合には、ステップ３１６へ移行し、ｔを１インクリメントして、次のステップ３１８で、ｔが総時間フレーム数Ｔを超えたか否かを判定し、ｔが未だＴに到達していない場合には、ステップ３０６へ戻って、ステップ３０６〜３１６の処理を繰り返す。 On the other hand, if f exceeds F, the process proceeds to step 316, t is incremented by 1, and in the next step 318, it is determined whether or not t exceeds the total number of time frames T. If T has not been reached, the process returns to step 306 and the processes of steps 306 to 316 are repeated.

On the other hand, if t exceeds T, the process proceeds to step 320, where the calculated sound source time frequency mask variable ξ_tfk is stored in the sound source time frequency mask variable storage unit 33 to update it. The sound source time frequency mask variable calculation processing routine then ends, and the process returns to the sound source separation localization processing routine.

次に、音源時定位変数計算処理ルーチン（図８）では、ステップ４０２で、記憶部３から必要な情報をロードし、次に、ステップ４０４で、各マスクに対応する変数ｋを１にセットする。 Next, in the sound source localization variable calculation processing routine (FIG. 8), necessary information is loaded from the storage unit 3 in step 402, and then a variable k corresponding to each mask is set to 1 in step 404. .

Next, in step 406, the sound source localization variable η_kd is calculated for d = 1, ..., D by equation (9).

The sound source localization variable η_kd is the variable calculated in order to localize the plurality of sound sources, that is, to estimate the direction of each sound source relative to the microphone array, where k = 1, ..., K and d = 1, ..., D. η_kd represents the probability that sound source k lies in the d-th discretized direction. The direction of each sound source can be estimated according to this variable. In the present embodiment, this variable is assumed to be generated from a multinomial distribution whose parameters are determined by a Dirichlet distribution parameterized by the statistic κ.

式（９）のポイントは、右辺第３項にあるように、音源時間周波数マスク変数ξ_ｔｆｋが必要であるという点である。これは、音源分離の情報を使って音源定位が改善されることを表している。 The point of equation (9) is that the sound source time frequency mask variable ξ _tfk is required as in the third term on the right side. This indicates that sound source localization is improved using information on sound source separation.

次に、ステップ４０８で、上記ステップ４０６でｄ＝１，...，Ｄについて計算された音源定位変数η_ｋｄを、式（１０）により正規化する。η_ｋｄは確率であるので、ｋ毎に、全てのｄに対する和が常に１となるように正規化される。 Next, in step 408, the sound source localization variable η _kd calculated for d = 1,..., D in step 406 is normalized by equation (10). Since η _kd is a probability, it is normalized so that the sum for all d is always 1 for each k.

Next, in step 410, k is incremented by 1, and in the next step 412 it is determined whether k exceeds the set maximum number of masks K. If k does not exceed K, the process returns to step 406, and the processing of steps 406 to 410 is repeated.

一方、ｋがＫを超えた場合には、ステップ４１４へ移行し、計算された音源定位変数η_ｋｄを音源定位変数記憶部３４に保存して更新し、音源定位変数計算処理ルーチンを終了して、音源分離定位処理ルーチンへリターンする。 On the other hand, if k exceeds K, the process proceeds to step 414, where the calculated sound source localization variable η _kd is stored and updated in the sound source localization variable storage unit 34, and the sound source localization variable calculation processing routine is terminated. Return to the sound source separation localization processing routine.

次に、統計量計算処理ルーチン（図９）では、音源時間周波数マスク変数計算処理ルーチン及び音源定位変数計算処理ルーチンで用いる各統計量を計算する。まず、ステップ５０２で、記憶部３から必要な情報をロードする。 Next, in the statistic calculation processing routine (FIG. 9), each statistic used in the sound source time frequency mask variable calculation processing routine and the sound source localization variable calculation processing routine is calculated. First, in step 502, necessary information is loaded from the storage unit 3.

Next, in step 504, β_tk, a parameter of the Dirichlet distribution used in the mathematical model of the sound source time frequency mask variable, is calculated by equation (11) for t = 1, ..., T and k = 1, ..., K. Intuitively, β_tk is a parameter representing how strongly the observation signal in each frequency bin f of time frame t is likely to be explained by the signal from sound source k, and it takes a value greater than 0.

Next, in step 506, κ_d, a parameter of the Dirichlet distribution used in the mathematical model of the sound source localization variable, is calculated by equation (12) for d = 1, ..., D. Intuitively, κ_d is a parameter representing how strongly sound source k is likely to exist in direction d, and it takes a value greater than 0.
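Equations (11) and (12) are not reproduced in this text, so the sketch below should not be read as the patented update. It shows only the textbook conjugate update for a Dirichlet posterior, in which the expected assignment counts are added to the prior parameter, as one way to picture how β_tk and κ_d could be accumulated from ξ_tfk and η_kd. The function names and the exact summation ranges are assumptions.

```python
def update_beta(xi, beta0):
    """beta[t][k] = beta0[k] + sum over f of xi[t][f][k] (assumed form)."""
    return [[beta0[k] + sum(frame[f][k] for f in range(len(frame)))
             for k in range(len(beta0))]
            for frame in xi]


def update_kappa(eta, kappa0):
    """kappa[d] = kappa0[d] + sum over k of eta[k][d] (assumed form)."""
    return [kappa0[d] + sum(row[d] for row in eta)
            for d in range(len(kappa0))]
```

Under this reading, both statistics stay greater than 0 whenever the initial parameters are greater than 0, which agrees with the positivity stated in the text.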

Next, in step 508, a_tfk, a parameter of the gamma distribution used in the mathematical model of the time-frequency domain observation signal, is calculated by equation (13) for t = 1, ..., T, f = 1, ..., F, and k = 1, ..., K.

Next, in step 510, b_tfk, a parameter of the gamma distribution used in the mathematical model of the time-frequency domain observation signal, is calculated by equation (14) for t = 1, ..., T, f = 1, ..., F, and k = 1, ..., K.

Next, in step 512, v_fd, a parameter of the complex Wishart distribution used in the mathematical model of the time-frequency domain observation signal, is calculated by equation (15) for f = 1, ..., F and d = 1, ..., D.

Next, in step 514, G_fd, a parameter of the complex Wishart distribution used in the mathematical model of the time-frequency domain observation signal, is calculated by equation (16) for f = 1, ..., F and d = 1, ..., D. G_fd is a matrix that incorporates the information of the anechoic steering vector and contains acoustic information of the actual reverberant environment.

次に、ステップ５１６で、上記ステップ５０４〜５１４で計算した各統計量を、統計量記憶部３５に保存して更新し、統計量計算処理ルーチンを終了して、音源分離定位処理ルーチンにリターンする。 Next, in step 516, each statistic calculated in steps 504 to 514 is stored and updated in the statistic storage unit 35, the statistic calculation processing routine is terminated, and the process returns to the sound source separation localization processing routine. .

音源分離定位処理ルーチンでは、次に、ステップ６００へ移行し、収束条件判定部２５が、記憶部３に保存された各値を監視して、計算の収束条件が満たされたか否かを判定する。収束条件は反復計算の繰り返し回数など任意に設定してよいが、変分ベイズ法に基づく解析計算を行う本実施の形態では、例えば、式（１７）に示すような収束条件を用いることができる。ただし、ξ’は更新前のξの値を表す。この収束条件を用いた場合には、必ず各値の計算が収束することが知られている。 In the sound source separation localization processing routine, next, the process proceeds to step 600, where the convergence condition determination unit 25 monitors each value stored in the storage unit 3 to determine whether or not the calculation convergence condition is satisfied. . The convergence condition may be arbitrarily set such as the number of iterations of the iterative calculation. However, in this embodiment in which the analysis calculation based on the variational Bayes method is performed, for example, the convergence condition as shown in Expression (17) can be used. . Here, ξ ′ represents the value of ξ before update. It is known that the calculation of each value always converges when this convergence condition is used.

Although equation (17) uses the amount of change of the sound source time frequency mask variable ξ_tfk as the convergence condition, the amount of change of the sound source localization variable η_kd may be used instead.

収束条件を満たしていない場合には、ステップ３００へ戻り、各値の計算を繰り返す。一方、収束条件を満たした場合には、ステップ７００へ移行する。 If the convergence condition is not satisfied, the process returns to step 300 and the calculation of each value is repeated. On the other hand, if the convergence condition is satisfied, the process proceeds to step 700.
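The iteration of steps 300 to 600 can be sketched as a loop. Equation (17) is not reproduced in this text, so the convergence test here, the largest absolute change of any ξ_tfk compared against the threshold θ, is only one plausible reading of "amount of change". The three update callables, the `state` dictionary, and the `max_iters` guard are hypothetical placeholders for the routines of FIGS. 7 to 9.

```python
def max_change(xi_new, xi_old):
    """Largest absolute change of any mask value xi[t][f][k]."""
    return max(abs(a - b)
               for frame_new, frame_old in zip(xi_new, xi_old)
               for bins_new, bins_old in zip(frame_new, frame_old)
               for a, b in zip(bins_new, bins_old))


def run_until_converged(state, update_mask, update_localization,
                        update_statistics, theta, max_iters=1000):
    """Repeat steps 300-500 until xi changes by less than theta."""
    for iteration in range(1, max_iters + 1):
        xi_old = state["xi"]
        state["xi"] = update_mask(state)             # step 300
        state["eta"] = update_localization(state)    # step 400
        state["stats"] = update_statistics(state)    # step 500
        if max_change(state["xi"], xi_old) < theta:  # step 600
            return iteration
    return max_iters
```

The function returns the number of iterations performed; the iteration cap reflects the remark in the text that a repetition count may also serve as the stopping rule.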

In step 700, the separated sound extraction unit 41 uses the observation signal x_tf and the sound source time frequency mask variable ξ_tfk to calculate the separated sound signal in the time-frequency domain for each sound source. First, for the K sound source masks, that is, the virtual maximum number of sound sources, the number N of sound sources to extract is determined such that N <= K. N may be specified in advance, or it may be determined after the analysis, automatically or manually, based on some decision rule that uses the information in the storage unit 3.

At the same time, the order of the indexes of the sound source time frequency mask variable ξ_tfk is rearranged. The index that is rearranged is k, the last index of ξ_tfk. To rearrange the indexes, the sum of all mask variables is calculated for each sound source index k, and the indexes are reordered in descending order of this sum.

Intuitively, this rearrangement corresponds to ordering the sound sources k from the largest expected volume to the smallest. The index of a reordered sound source is denoted by n.
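The reordering described above can be sketched as follows: the mask values of each index k are summed over all time frames and frequency bins, and the indexes are sorted from the largest sum to the smallest. The function name `rank_sources` is hypothetical, and indexes are zero-based.

```python
def rank_sources(xi):
    """Return mask indices k sorted by total mask mass, largest first."""
    n_masks = len(xi[0][0])
    totals = [0.0] * n_masks
    for frame in xi:
        for bins in frame:
            for k, value in enumerate(bins):
                totals[k] += value
    return sorted(range(n_masks), key=lambda k: totals[k], reverse=True)
```

Position n in the returned list holds the original index k of the source with the n-th largest total, so the first N entries identify the sources to extract.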

そして、入れ替えたインデックスのもとで、式（１８）を利用して、ｎ番目に音量の大きい音源の時間フレームｔ、周波数ビンｆでの時間周波数領域の分離音信号ｙ_ｔｆ ^ｎを計算する。 Then, the separated sound signal y _tf ⁿ in the time frequency domain at the time frame t and the frequency bin f of the sound source with the nth loudest volume is calculated using the equation (18) under the replaced index.

式（１８）の意味は、右辺第１項の時間周波数マスク変数ξ_ｔｆｋの分数によって、Ｎ個の音源の中で各（ｔ，ｆ）における音源ｎの音が占める割合を計算し、この割合で混合音の時間周波数領域表現であるｘ_ｔｆを分配するというものである。全てのｔ＝１，...，Ｔ、ｆ＝１，...，Ｆ、及びｎ＝１，...，Ｎに対して式（１８）の計算が終了したら、音源毎の時間周波数領域の分離音信号の計算結果を音声波形復元部４２へ渡すと共に、入れ替えた音源インデックス情報ｎを音源方向抽出部４３へ渡す。 The meaning of equation (18) is that the ratio of the sound of the sound source n at each (t, f) among the N sound sources is calculated by the fraction of the time frequency mask variable ξ _tfk of the first term on the right side, and this ratio Then, x _tf which is the time frequency domain representation of the mixed sound is distributed. When the calculation of equation (18) is completed for all t = 1,..., T, f = 1,. The calculation result of the separated sound signal of the region is passed to the speech waveform restoration unit 42, and the replaced sound source index information n is passed to the sound source direction extraction unit 43.
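The distribution of x_tf described for equation (18) can be sketched as below. The equation itself is not reproduced in this text; the sketch follows the prose description only, dividing the mask value of source n by the sum of the mask values of the N extracted sources at each (t, f) and scaling the M-element observation vector by that share. The function name `separate` and the zero-total guard are assumptions.

```python
def separate(x, xi, order, n_sources):
    """Return y[n][t][f], the M-vector of source n at frame t, bin f."""
    kept = order[:n_sources]  # original mask indices of the N sources
    y = [[[None] * len(x[0]) for _ in x] for _ in kept]
    for t, frame in enumerate(x):
        for f, vec in enumerate(frame):
            total = sum(xi[t][f][k] for k in kept)
            for n, k in enumerate(kept):
                # Share of source n among the N kept sources at (t, f).
                share = xi[t][f][k] / total if total > 0 else 0.0
                y[n][t][f] = [share * value for value in vec]
    return y
```

Because the shares of the N kept sources sum to 1 wherever the total is positive, the separated signals add back up to the observed x_tf at those points.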

次に、ステップ７０２で、音声波形復元部４２が、上記ステップ７００において分離音抽出部４１より受け取った分離音信号ｙ_ｔｆ ^ｎを変換して、通常の音声波形を復元する。具体的には、時間周波数領域観測変換部１２の逆変換である逆フーリエ変換を音源ｎ毎に実施する。 Next, in step 702, the speech waveform restoration unit 42 transforms the separated sound signal y _tf ⁿ received from the separated sound extraction unit 41 in step 700 to restore a normal speech waveform. Specifically, the inverse Fourier transform, which is the inverse transform of the time frequency domain observation transform unit 12, is performed for each sound source n.

Next, in step 704, the sound source direction extraction unit 43 calculates the directions of the N sound sources using the reordered sound source index n received from the separated sound extraction unit 41 in step 700 and the sound source localization variable η_kd. Specifically, equation (19) is evaluated for each n. This yields the index d_n of the direction in which sound source n exists.
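Equation (19) is not reproduced in this text. Since η_kd is the probability that source k lies in the d-th direction, a natural reading is that d_n is the most probable direction, and the sketch below assumes exactly that. The conversion to an angle additionally assumes that the D directions are evenly spaced over 360 degrees starting at 0, which gives 5-degree steps for D = 72. The function name `source_directions` is hypothetical.

```python
def source_directions(eta, order, n_sources, n_directions=72):
    """Return (d_n, angle in degrees) for each of the N kept sources."""
    step = 360.0 / n_directions
    result = []
    for k in order[:n_sources]:
        # Most probable discretized direction for this source (assumed rule).
        d_n = max(range(n_directions), key=lambda d: eta[k][d])
        result.append((d_n, d_n * step))
    return result
```

Here `order` is the reordered list of source indexes, so the n-th entry of the result describes the n-th loudest source.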

Next, in step 706, the final output unit 44 outputs the analysis results in the form desired by the user, using the information from the storage unit 3, the separated sound extraction unit 41, the speech waveform restoration unit 42, and the sound source direction extraction unit 43, and the sound source separation localization processing routine ends.

なお、上記ステップ７０２及び７０４の処理はいずれを先に行ってもよい。 Note that either of the processes in steps 702 and 704 may be performed first.

As described above, when a mixed sound of sounds emitted from a plurality of sound sources is observed, the sound source separation localization apparatus according to the present embodiment solves the separation into individual sound sources and the localization of the sound source directions simultaneously, within a single statistical framework. This avoids the situation, seen in existing methods, in which a failure in one problem causes the other problem to fail as well, and stable, high performance can be obtained for both problems.

また、音源分離と音源定位とは相互依存の問題であるため、両問題を同時に解決することにより、各問題に対して個別に解決するよりも高い精度を得ることができる。 Further, since sound source separation and sound source localization are problems of mutual dependence, it is possible to obtain higher accuracy than solving each problem individually by solving both problems at the same time.

Furthermore, by using, for the simultaneous separation and localization of a plurality of sound sources, steering vectors measured in advance in an anechoic environment for the microphone array that observes the mixed sound, sound source separation and localization can be adapted to various environments without re-measuring the steering vectors in the actual, unknown reverberant environment.

<Experimental example>
Next, the results of experiments with the sound source separation localization apparatus according to the present embodiment will be described.

図１０に実験のセットアップを示す。本実験では、音源数Ｎ＝２または３を既知として、マイク数Ｍ＝２，４，８のパターンで各音源の分離及び定位の性能評価を行う。実験では、離散化した方向の数はＤ＝７２、すなわち５度おきに方向を区分けする。 FIG. 10 shows the experimental setup. In this experiment, it is assumed that the number of sound sources N = 2 or 3 is known, and the performance evaluation of the separation and localization of each sound source is performed with the pattern of the number of microphones M = 2, 4, and 8. In the experiment, the number of discretized directions is D = 72, that is, the directions are divided every 5 degrees.

FIG. 11 is a graph showing the sound source localization performance. In the figure, (a) shows the case of two sound sources and (b) the case of three. RT_60 in the figure denotes the reverberation time of the environment in which the mixed sound was observed. The results show that localization performance degrades as the reverberation becomes longer. However, when the number of microphones is large, for example M = 8, the localization error is almost zero. That is, particularly when the number of microphones is large, the sound source separation localization apparatus according to the present embodiment can localize sound sources with high accuracy.

FIG. 12 is a graph showing the sound source separation performance. Here, a comparison with conventional methods was made. The method used for comparison depends on the combination of the number of microphones and the number of sound sources. When M >= N, that is, when there are at least as many microphones as sound sources, the IVA method of Non-Patent Document 3 is used. When M < N, that is, when there are fewer microphones than sound sources, the method of Non-Patent Document 1 (TF-perm) is used.

In FIG. 12, panels (a), (c), and (e) show the case of two sound sources, and panels (b), (d), and (f) the case of three. Panels (a) and (b) correspond to RT₆₀ = 20 msec, (c) and (d) to RT₆₀ = 400 msec, and (e) and (f) to RT₆₀ = 600 msec. That is, each row from top to bottom shows results for an environment with a longer reverberation time. In panels (b), (d), and (f), the results of the method of Non-Patent Document 1 are shown for the case of two microphones.

As is clear from the figure, the sound source separation localization apparatus according to the present embodiment achieves better separation accuracy than the conventional methods in almost all experimental environments and microphone counts. This means that better sound source separation can be achieved by solving sound source separation simultaneously with sound source localization, which demonstrates the effectiveness of the present invention.
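The idea of solving separation and localization together can be illustrated with a toy sketch in which a mask variable and a localization variable are updated alternately, each using the other, until the change falls below a threshold. This is a simplified mean-field style update on a stand-in mixture model, not the update equations of the embodiment. The input `loglik` (a log-likelihood of each time-frequency point under each direction), the uniform priors, the symmetry-breaking initialization, and all function names are assumptions made for illustration.

```python
import math


def normalize_log(log_weights):
    """Convert unnormalized log-weights into probabilities that sum to 1."""
    peak = max(log_weights)
    weights = [math.exp(v - peak) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def joint_separation_localization(loglik, num_masks, max_iters=100, tol=1e-6):
    """Alternate mask and direction updates on a toy model.

    loglik[i][d]: assumed log-likelihood of time-frequency point i (a
    flattened (t, f) index) under direction d.
    Returns (xi, eta): xi[i][k] is the probability that point i belongs to
    mask k, and eta[k][d] the probability that mask k lies in direction d.
    """
    num_points, num_dirs = len(loglik), len(loglik[0])
    # Nearly uniform start, with a small bias per mask to break symmetry.
    eta = [
        normalize_log([0.1 if d == (k * num_dirs) // num_masks else 0.0
                       for d in range(num_dirs)])
        for k in range(num_masks)
    ]
    xi = []
    for _ in range(max_iters):
        # Mask update: uses the current localization variable eta.
        xi = [
            normalize_log([sum(eta[k][d] * loglik[i][d] for d in range(num_dirs))
                           for k in range(num_masks)])
            for i in range(num_points)
        ]
        # Localization update: uses statistics weighted by the mask variable xi.
        new_eta = [
            normalize_log([sum(xi[i][k] * loglik[i][d] for i in range(num_points))
                           for d in range(num_dirs)])
            for k in range(num_masks)
        ]
        # Convergence check on the largest change in eta.
        change = max(abs(new_eta[k][d] - eta[k][d])
                     for k in range(num_masks) for d in range(num_dirs))
        eta = new_eta
        if change < tol:
            break
    return xi, eta
```

In this sketch, the points that are best explained by one direction reinforce the mask assigned to that direction, and the sharpened direction estimate in turn sharpens the mask; this mutual dependence is the point being illustrated.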

Note that the present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the sound source separation localization apparatus described above contains a computer system; when a WWW system is used, the term "computer system" also includes a homepage providing environment (or display environment).

In this specification, the embodiment has been described with the program installed in advance; however, the program can also be provided stored on a computer-readable recording medium.

Description of Symbols
1 Reception unit
2 Analysis unit
3 Storage unit
4 Output unit
10 Sound source separation localization apparatus
11 Mixed sound observation unit
12 Time-frequency domain observation conversion unit
13 Preset value reception unit
21 Initial value generation unit
22 Sound source time-frequency mask variable calculation unit
23 Sound source localization variable calculation unit
23 Sound source separation variable calculation unit
24 Statistic calculation unit
25 Convergence condition determination unit
41 Separated sound extraction unit
42 Speech waveform restoration unit
43 Sound source direction extraction unit
44 Final output unit

Claims

1. A sound source separation localization apparatus comprising:
receiving means for receiving a mixed sound signal obtained by observing, with a plurality of observation means arranged at different positions, a mixed sound of the sounds emitted from each of a plurality of sound sources;
analyzing means for analyzing sound source separation, which separates the mixed sound signal received by the receiving means so as to correspond to each of the plurality of sound sources, and sound source localization, which estimates the direction in which each of the plurality of sound sources exists with reference to the observation means, by simultaneous optimization through iterative processing using variables on which the sound source separation and the sound source localization mutually depend; and
output means for outputting the results of the sound source separation and the sound source localization analyzed by the analyzing means,
wherein the receiving means converts the mixed sound signal into an observation signal x_tf in a time-frequency domain composed of elements for each time frame t and frequency bin f, and delivers it to the analyzing means,
the analyzing means includes:
sound source time-frequency mask variable calculating means for calculating, for each of a plurality of masks that assign each element of the observation signal x_tf to one of a plurality of virtually set sound sources, a mask variable ξ_tfk representing the probability that the element is a signal corresponding to the k-th mask;
sound source localization variable calculating means for calculating, for each of a plurality of directions divided with reference to the observation means, a sound source localization variable η_kd representing the probability that the sound source corresponding to the k-th mask exists in the d-th direction;
statistic calculating means for calculating statistics used in calculating the mask variable ξ_tfk and the sound source localization variable η_kd; and
convergence condition determining means for repeating the calculations of the sound source time-frequency mask variable calculating means, the sound source localization variable calculating means, and the statistic calculating means until a predetermined convergence condition is satisfied, and
the sound source localization variable η_kd is used in calculating the mask variable ξ_tfk, and the mask variable ξ_tfk is used in calculating the sound source localization variable η_kd.

2. The sound source separation localization apparatus according to claim 1, wherein the analyzing means analyzes the sound source separation and the sound source localization using steering vectors of the plurality of observation means measured in an anechoic environment.

3. A sound source separation localization method in a sound source separation localization apparatus including receiving means, analyzing means that includes sound source time-frequency mask variable calculating means, sound source localization variable calculating means, statistic calculating means, and convergence condition determining means, and output means, the method comprising:
receiving, by the receiving means, a mixed sound signal obtained by observing, with a plurality of observation means arranged at different positions, a mixed sound of the sounds emitted from each of a plurality of sound sources;
analyzing, by the analyzing means, sound source separation, which separates the mixed sound signal received by the receiving means so as to correspond to each of the plurality of sound sources, and sound source localization, which estimates the direction in which each of the plurality of sound sources exists with reference to the observation means, by simultaneous optimization through iterative processing using variables on which the sound source separation and the sound source localization mutually depend; and
outputting, by the output means, the results of the sound source separation and the sound source localization analyzed by the analyzing means,
wherein the receiving means converts the mixed sound signal into an observation signal x_tf in a time-frequency domain composed of elements for each time frame t and frequency bin f, and delivers it to the analyzing means,
the sound source time-frequency mask variable calculating means calculates, for each of a plurality of masks that assign each element of the observation signal x_tf to one of a plurality of virtually set sound sources, a mask variable ξ_tfk representing the probability that the element is a signal corresponding to the k-th mask,
the sound source localization variable calculating means calculates, for each of a plurality of directions divided with reference to the observation means, a sound source localization variable η_kd representing the probability that the sound source corresponding to the k-th mask exists in the d-th direction,
the statistic calculating means calculates statistics used in calculating the mask variable ξ_tfk and the sound source localization variable η_kd,
the convergence condition determining means repeats the calculations of the sound source time-frequency mask variable calculating means, the sound source localization variable calculating means, and the statistic calculating means until a predetermined convergence condition is satisfied, and
the sound source localization variable η_kd is used in calculating the mask variable ξ_tfk, and the mask variable ξ_tfk is used in calculating the sound source localization variable η_kd.

4. The sound source separation localization method according to claim 3, wherein the analyzing means analyzes the sound source separation and the sound source localization using steering vectors of the plurality of observation means measured in an anechoic environment.

5. A sound source separation localization program for causing a computer to function as each of the means constituting the sound source separation localization apparatus according to claim 1 or claim 2.