JP2010121975A

JP2010121975A - Sound-source localizing device

Info

Publication number: JP2010121975A
Application number: JP2008293831A
Authority: JP
Inventors: Carlos Toshinori Ishii; 石井カルロス寿憲; Chatot Olivier; シャトッ・オリビエ; Hiroshi Ishiguro; 浩石黒; Norihiro Hagita; 紀博萩田
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2008-11-17
Filing date: 2008-11-17
Publication date: 2010-06-03

Abstract

PROBLEM TO BE SOLVED: To provide a sound source localization device capable of stably localizing sound sources using the MUSIC method.

SOLUTION: The sound source localization device includes: an eigenvector calculating section that, every 200 ms, applies an FFT to each of the sound source signals of a plurality of channels, calculates a spatial correlation matrix for each frequency band, performs eigenvalue decomposition on it, and calculates eigenvectors and eigenvalues for each of a plurality of frequency bands; first and second average value calculating sections 120, 122 that calculate eigenvalue profiles for first and second frequency ranges based on the eigenvalues; a KNN classifier 124 that estimates the number of sound sources by the k-nearest-neighbor method, with the set of eigenvalue profiles as parameters; and a sound source estimating section that estimates, by the MUSIC method, as many sound source directions as there are sound sources, based on the number of sound sources estimated by the KNN classifier 124, information on the arrangement of the microphone elements, and the eigenvectors.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to sound source localization in real environments, and more particularly to sound source localization in real environments using the MUSIC (MUltiple SIgnal Classification) method.

In voice communication between a person and a robot, the microphones attached to the robot are usually some distance away from the speaker (1 m or more). The signal-to-noise ratio (SNR) is therefore lower than in cases such as telephone speech, where the microphone is only a few centimeters from the mouth. As a result, the voices of other people nearby and environmental noise act as interference, making it difficult for the robot to recognize the target speech. Sound source localization and sound source separation are therefore important for robot applications.

Sound source localization has been studied extensively. Most studies, however, use only simulation data or laboratory data, and few evaluate data from the real environments in which robots operate. Few studies evaluate three-dimensional sound source localization either. Speaking and listening while looking at the face of the conversation partner is an important behavior for improving dialogue between humans and robots, and three-dimensional sound source localization is important for that purpose.

Patent Document 1 describes a prior-art technique that assumes a real environment. The technique described in Patent Document 1 uses a well-known, high-resolution sound source localization method called the MUSIC method.

In the invention described in Patent Document 1, a microphone array is used, the signals from the microphone array are Fourier transformed, and the current correlation matrix is calculated from the resulting received signal vector and the past correlation matrix. The correlation matrix obtained in this way is subjected to eigenvalue decomposition to obtain the maximum eigenvalue and a noise space consisting of the eigenvectors corresponding to the eigenvalues other than the maximum eigenvalue. Then, taking one microphone of the array as a reference, the direction of the sound source is estimated by the MUSIC method based on the phase differences between the microphone outputs, the noise space, and the maximum eigenvalue.

Patent Document 1: JP 2008-175733 A

The MUSIC method offers high resolution, but it requires the number of sound sources to be given in advance. This problem does not arise in the technique described in Patent Document 1, because that technique assumes a single sound source. The environments in which robots actually operate, however, are rarely like that: multiple sound sources are always present, and their number is not constant. With the MUSIC method, a wrong estimate of the number of sound sources leads to wrong localization, which makes it difficult for the robot to interact correctly with humans.

Furthermore, the technique described in Patent Document 1 performs sound source localization in two dimensions, whereas the actual operating environment of a robot is three-dimensional. In a shopping street, for example, loudspeakers are often placed at relatively high positions and play sound continuously. The position of a loudspeaker is fixed, but its volume may change. In such an environment it is preferable to localize sound sources in three dimensions, but the technique described in Patent Document 1 can only do so in two.

In particular, for a robot that interacts with people, the height of the speaker varies: adults often speak from a position higher than the robot, and children from a position lower than the robot. From this point of view as well, three-dimensional sound source localization is desirable.

Furthermore, because people move frequently, sound sources must also be tracked stably in real time.

An object of the present invention is therefore to provide a sound source localization device that can perform sound source localization stably using the MUSIC method.

Another object of the present invention is to provide a sound source localization device that can estimate the number of sound sources with high accuracy so that sound source localization using the MUSIC method can be performed stably.

Still another object of the present invention is to provide a sound source localization device that can estimate the number of sound sources with high accuracy and track them stably, so that sound source localization using the MUSIC method can be performed stably.

A sound source localization device according to the present invention includes: conversion means for converting each of the sound source signals of a plurality of channels, obtained from the output of a microphone array, into frequency components of a plurality of frequency bands at predetermined time intervals; correlation matrix calculating means for obtaining, at each predetermined time interval and for each of the plurality of frequency bands, a spatial correlation matrix between the frequency components of the sound source signals of the plurality of channels obtained by the conversion means; eigenvector calculating means for performing eigenvalue decomposition on each of the spatial correlation matrices calculated by the correlation matrix calculating means at each predetermined time interval and for each of the plurality of frequency bands, and calculating eigenvectors and eigenvalues for each of the plurality of frequency bands; and eigenvalue profile calculating means for calculating eigenvalue profiles for first and second frequency ranges based on the eigenvalues calculated by the eigenvector calculating means at each predetermined time interval and for each of the plurality of frequency bands. Each of the first and second frequency ranges includes one or more of the plurality of frequency bands.

The sound source localization device further includes: sound source number estimating means for estimating the number of sound sources at each predetermined time interval, using as parameters the set of eigenvalue profiles calculated for the first and second frequency ranges by the eigenvalue profile calculating means; and sound source estimating means for estimating, by the MUSIC method, as many sound source directions as the number of sound sources, based on the number of sound sources estimated by the sound source number estimating means, information on the arrangement of the microphone elements belonging to the microphone array, and the eigenvectors calculated by the eigenvector calculating means at each predetermined time interval.

Sound source localization by the MUSIC method requires an estimate of the number of sound sources. It is known that the eigenvalues of the correlation matrix calculated as above are related to the number of sound sources. Experiments have shown, in particular, that high accuracy is obtained when the number of sound sources is estimated from eigenvalue profiles built from eigenvalues calculated separately for the frequency bands belonging to two frequency ranges, as described above. Because the number of sound sources can be estimated with high accuracy, the MUSIC method can localize sound sources stably and accurately.

Preferably, the first frequency range and the second frequency range are contiguous with each other.

More preferably, the first frequency range and the second frequency range do not overlap each other.

Still more preferably, the first and second frequency ranges together have a lower limit of 1 kHz and an upper limit of 6 kHz.

The eigenvalue profile calculating means may include: first eigenvalue averaging means for averaging, for each eigenvalue number, the eigenvalues calculated by the eigenvector calculating means at each predetermined time interval for the frequency bands belonging to the first frequency range; second eigenvalue averaging means for averaging, for each eigenvalue number, the eigenvalues calculated by the eigenvector calculating means at each predetermined time interval for the frequency bands belonging to the second frequency range; and means for creating and outputting the eigenvalue profiles from the averages calculated for each eigenvalue number by the first and second eigenvalue averaging means.

Experiments have shown that the number of sound sources is predicted more accurately when, in creating the eigenvalue profiles, the eigenvalues calculated for the frequency bands belonging to each of the first and second frequency ranges are averaged for each eigenvalue number and the averages are used as the parameters for predicting the number of sound sources. Calculating the eigenvalue profiles in this way therefore allows the MUSIC method to localize sound sources stably and accurately.

Preferably, the boundary between the first frequency range and the second frequency range lies between 2.5 kHz and 4 kHz inclusive, and is, for example, 3 kHz or 4 kHz.

More preferably, the sound source number estimating means includes nonlinear estimating means trained in advance to estimate the correct number of sound sources using the set of eigenvalue profiles as parameters.

Still more preferably, the nonlinear estimating means includes: learning data storage means for storing a plurality of learning data items, each consisting of a set of eigenvalue profiles paired with the corresponding number of sound sources; and means for estimating the number of sound sources by the k-nearest-neighbor method, using the set of eigenvalue profiles calculated by the eigenvalue profile calculating means as parameters and the learning data stored in the learning data storage means.

The k-nearest-neighbor method requires little computation, which makes it convenient to implement on a robot with limited resources. Experiments also showed that it gives good results. Estimating the number of sound sources with the k-nearest-neighbor method therefore allows the MUSIC method to localize sound sources stably and accurately.

The number of neighboring learning data items used by the means for estimating the number of sound sources by the k-nearest-neighbor method may be six.

In the experiments, the k-nearest-neighbor method was most accurate when k = 6 neighbors were used.

More preferably, the sound source localization device further includes sound source tracking means for tracking, along the time axis, the sound source directions estimated by the sound source estimating means at each predetermined time interval.

In the following description of the embodiment of the present invention, identical components are given identical reference numerals. Their functions are also identical, so their detailed description is not repeated.

[Overview]
In the present embodiment, a microphone array is arranged near the head of a robot, a plurality of sound sources are localized in real time from the signals obtained from the microphone array, and the sources are tracked. To this end, the sound source localization device of the embodiment described below estimates the number of sound sources from information obtained from the sound source signals, based on learning data acquired in advance. Performing sound source localization by the MUSIC method (see the appendix) with the estimated number of sound sources yields stable and accurate localization.

[Configuration]
FIG. 1 shows the microphone array fitted to the chest of a robot 30. Specifically, a microphone base 32 for fitting microphones around the neck of the robot 30 is made, a plurality of microphones MC1 and so on are fixed to the microphone base 32, and the microphone base 32 is then fixed around the neck of the robot 30.

FIG. 2 shows a front view, a plan view, and a right side view of the microphone base 32. Referring to FIG. 2, a total of 14 microphones (MC1 and so on) are used. Nine of them are attached to the front of the microphone base 32, and the remaining five are attached to the upper surface of the microphone base 32 so as to surround the neck of the robot 30. Of the 14 microphones, the output of the central microphone MC1 is handled separately from the others in later processing. In the present embodiment, all microphones are omnidirectional.

FIG. 3 is a block diagram showing only the sound source localization processing unit 50, the part of the robot shown in FIG. 1 that is concerned with sound source localization. Referring to FIG. 3, the sound source localization processing unit 50 includes the following components. An A/D converter 54 receives 14 analog sound source signals from a microphone array 52, which includes the microphone MC1 and the others, performs analog-to-digital conversion, and outputs 14 digital sound source signals. An eigenvector calculation unit 60 receives the 14 digital sound source signals output from the A/D converter 54 and outputs, every 200 milliseconds, the correlation matrix required by the MUSIC method together with its eigenvalues and eigenvectors. A sound source estimation unit 62 uses the eigenvectors and eigenvalues output from the eigenvector calculation unit 60 every 200 milliseconds to estimate a plurality of sound source positions by the MUSIC method, and successively outputs values representing those positions (directions); in the present embodiment these values are the two angles φ and θ of three-dimensional polar coordinates (see "MUSIC response" in the appendix). A grouping unit 64 accumulates the time series of the outputs of the sound source estimation unit 62, groups sound sources that persist continuously in time, removes (filters out) isolated sound sources, and estimates the time-varying directions of the sound sources. A buffer 66 is used by the grouping unit 64 to accumulate the information representing the sound source positions.

In the present embodiment, the A/D converter 54 digitizes the output of each microphone at the commonly used 16 kHz sampling rate with 16-bit resolution.

The eigenvector calculation unit 60 includes the following components. A framing processing unit 80 divides the 14 digital sound source signals output from the A/D converter 54 into frames with a frame length of 25 milliseconds. An FFT processing unit 82 applies an FFT (Fast Fourier Transform) to each of the 14 channels of framed sound source signals output from the framing processing unit 80, converting them into a predetermined number of frequency regions (hereinafter each frequency region is called a "bin" and the number of frequency regions the "number of bins"). A blocking processing unit 84 groups the values of each bin of each channel, which the FFT processing unit 82 outputs every 25 milliseconds, into blocks of 200 milliseconds. A correlation matrix calculation unit 86 calculates and outputs, at each predetermined time interval (every 200 milliseconds), a correlation matrix whose elements are the correlations between the values of each bin output from the blocking processing unit 84. An eigenvalue decomposition unit 88 performs eigenvalue decomposition on the correlation matrix output from the correlation matrix calculation unit 86 and outputs eigenvalues 90 and eigenvectors 92 to the sound source estimation unit 62. In the present embodiment, two parts of the spectrum of the sound source signals are excluded: the band at or below 1 kHz, where spatial resolution is low, and the band at or above 6 kHz, where spatial aliasing can occur.
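The processing chain above (framing, FFT, blocking, per-bin correlation matrix, eigenvalue decomposition) can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment. At 16 kHz a 25 ms frame is 400 samples. The source gives the frame length but not the frame shift or the window, so non-overlapping frames (8 per 200 ms block) and a Hann window are assumptions, and the function name `block_eigendecomposition` is hypothetical.

```python
import numpy as np

FS = 16_000            # sampling rate from the text (16 kHz)
FRAME_LEN = 400        # 25 ms at 16 kHz
FRAMES_PER_BLOCK = 8   # 200 ms / 25 ms, assuming non-overlapping frames

def block_eigendecomposition(block, f_lo=1000.0, f_hi=6000.0):
    """Eigen-decompose the per-bin spatial correlation matrices of one block.

    block: real array of shape (channels, FRAME_LEN * FRAMES_PER_BLOCK).
    Returns (freqs, eigvals, eigvecs) for the bins strictly between f_lo and f_hi.
    eigvals[b] is sorted in descending order; eigvecs[b][:, i] belongs to eigvals[b][i].
    """
    n_ch = block.shape[0]
    frames = block.reshape(n_ch, FRAMES_PER_BLOCK, FRAME_LEN)
    window = np.hanning(FRAME_LEN)                 # window choice is an assumption
    spec = np.fft.rfft(frames * window, axis=-1)   # (channels, frames, bins)
    freqs = np.arange(FRAME_LEN // 2 + 1) * (FS / FRAME_LEN)
    keep = (freqs > f_lo) & (freqs < f_hi)         # drop <= 1 kHz and >= 6 kHz
    spec = spec[:, :, keep]
    # Spatial correlation matrix per bin: average of x x^H over the frames of the block.
    corr = np.einsum("mtb,ntb->bmn", spec, spec.conj()) / FRAMES_PER_BLOCK
    eigvals, eigvecs = np.linalg.eigh(corr)        # Hermitian input, ascending order
    return freqs[keep], eigvals[:, ::-1], eigvecs[:, :, ::-1]
```

Under the non-overlapping assumption, each correlation matrix is an average of only 8 outer products, so for 14 channels its rank is at most 8. A shorter frame shift would give more frames per block and a better-conditioned estimate.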

The sound source estimation unit 62 includes: a position vector storage unit 100 for storing position vectors that represent the position of each microphone in the microphone array 52 in a predetermined coordinate system; a sound source number estimation unit 102 for estimating the number of sound sources, using the eigenvalues supplied from the eigenvalue decomposition unit 88 as parameters, and outputting the estimated number of sound sources (referred to as "NOS"); and a MUSIC spatial spectrum calculation unit 104 that calculates and outputs the quantity called the MUSIC spatial spectrum in the MUSIC method, using the NOS supplied from the sound source number estimation unit 102, the microphone position vectors stored in the position vector storage unit 100, and the eigenvectors output from the eigenvalue decomposition unit 88. That the eigenvalues of the correlation matrix obtained for each block are related to the number of sound sources is already known; it is described, for example, in F. Asano, M. Goto, K. Itou, and H. Asoh, “Real-time sound source localization and separation system and its application on automatic speech recognition,” in Eurospeech 2001, Aalborg, Denmark, 2001, pp. 1013-1016.

In the present embodiment, not only the two-dimensional azimuth of each sound source but also its elevation is estimated. For this purpose, a three-dimensional version of the MUSIC algorithm (see the appendix) was implemented. The pair of azimuth and elevation is hereinafter called the sound source direction (DOA, direction of arrival). This algorithm does not estimate the distance to the sound source. Estimating only the direction greatly reduces the processing time.

The sound source estimation unit 62 further includes: a MUSIC response calculation unit 106 for calculating and outputting, for each direction, the quantity called the MUSIC response in the MUSIC method, based on the MUSIC spatial spectrum calculated by the MUSIC spatial spectrum calculation unit 104; and a peak detection unit 108 for detecting, in each block, the peaks of the MUSIC response calculated by the MUSIC response calculation unit 106, in descending order of value and up to the number estimated by the sound source number estimation unit 102, and outputting their directions as the information indicating the sound source directions.
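The appendix that defines the three-dimensional MUSIC spatial spectrum and the MUSIC response is not reproduced in this section. As a rough illustration only, the following sketch uses the standard textbook narrowband MUSIC pseudo-spectrum with far-field (plane-wave) steering vectors. The speed of sound `C`, the steering-vector sign convention, and the function names are assumptions and may differ from the formulation in the appendix.

```python
import numpy as np

C = 343.0  # speed of sound in m/s (assumed)

def unit_vector(azimuth, elevation):
    """Unit vector pointing from the array toward a source (angles in radians)."""
    return np.array([np.cos(elevation) * np.cos(azimuth),
                     np.cos(elevation) * np.sin(azimuth),
                     np.sin(elevation)])

def music_spectrum(eigvecs, nos, mic_pos, freq, directions):
    """Textbook narrowband MUSIC pseudo-spectrum for one frequency bin.

    eigvecs: (M, M) eigenvectors, columns sorted by descending eigenvalue.
    nos: number of sources; the remaining M - nos columns span the noise subspace.
    mic_pos: (M, 3) microphone positions in metres.
    directions: (D, 3) candidate unit direction vectors.
    """
    noise = eigvecs[:, nos:]                          # (M, M - nos)
    delays = mic_pos @ directions.T / C               # (M, D) relative arrival advances
    steering = np.exp(2j * np.pi * freq * delays)     # far-field steering vectors
    proj = noise.conj().T @ steering                  # projection onto the noise subspace
    denom = np.sum(np.abs(proj) ** 2, axis=0)
    num = np.sum(np.abs(steering) ** 2, axis=0)       # equals M for unit-modulus entries
    return num / np.maximum(denom, 1e-12)             # large where steering is orthogonal to noise
```

Evaluating this on a grid of azimuth and elevation pairs for each retained bin gives one spectrum per bin. The MUSIC response would then combine these spectra across bins, by a rule given in the appendix rather than here.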

FIG. 4 is a more detailed block diagram of the sound source number estimation unit 102 shown in FIG. 3. Referring to FIG. 4, the sound source number estimation unit 102 includes a first average value calculation unit 120 and a second average value calculation unit 122. Of the eigenvalues 90 supplied for each bin from the eigenvalue decomposition unit 88, the first average value calculation unit 120 averages, for each eigenvalue number, the eigenvalues obtained from the bins in the 1-3 kHz band (from 1 kHz to 3 kHz inclusive), and the second average value calculation unit 122 does the same for the bins in the 3-6 kHz band. The sound source number estimation unit 102 further includes: a KNN classifier 124 that estimates the number of sound sources by KNN (k-nearest-neighbor) classification, using as parameters the set of eigenvalue averages supplied from the first average value calculation unit 120 and the set supplied from the second average value calculation unit 122, and outputs the result as NOS 128; and a learning data storage unit 126 that stores the learning data used by the KNN classifier 124 for NOS estimation.
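The two averaging units can be illustrated with a short sketch. It assumes the eigenvalues of each bin are already sorted in descending order, so that column i corresponds to eigenvalue number i, and that the input has already been restricted to the 1-6 kHz bins. The function name `eigenvalue_profiles` is hypothetical.

```python
import numpy as np

def eigenvalue_profiles(freqs, eigvals, boundary=3000.0):
    """Build the KNN feature vector from per-bin eigenvalues.

    freqs: (B,) bin frequencies in Hz, already limited to the 1-6 kHz band.
    eigvals: (B, M) eigenvalues per bin, sorted in descending order.
    Returns a vector of length 2 * M: the per-index mean over the first range
    followed by the per-index mean over the second range.
    """
    low = freqs <= boundary      # the text defines the first range as up to and including 3 kHz
    high = ~low                  # non-overlapping second range
    return np.concatenate([eigvals[low].mean(axis=0), eigvals[high].mean(axis=0)])
```

For 14 microphones this yields a 28-dimensional parameter vector per 200 ms block.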

As described later, the learning data storage unit 126 stores a plurality of samples (learning data), each pairing the output values of the first average value calculation unit 120 and the second average value calculation unit 122, obtained when the learning data were created in advance, with the number of sound sources at that time. At estimation time, the parameters supplied from the first average value calculation unit 120 and the second average value calculation unit 122 (two sets of eigenvalues) form the input vector. The k learning data items closest to the input vector in the vector space are selected and classified by their number of sound sources, and the number of sound sources with the most learning data items is output as the estimated number of sound sources. In the present embodiment, k = 6.
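The estimation step just described amounts to a majority vote among the k nearest stored samples. The following is a minimal sketch using only the standard library. The Euclidean distance and the tie-breaking rule are assumptions, since the source specifies neither, and the function name is hypothetical.

```python
import math
from collections import Counter

def knn_number_of_sources(profile, training_data, k=6):
    """Estimate the number of sources by a k-nearest-neighbor majority vote.

    profile: sequence of floats (the eigenvalue-profile feature vector).
    training_data: list of (profile_vector, number_of_sources) pairs.
    """
    if not training_data:
        raise ValueError("training_data must not be empty")
    # Euclidean distance is an assumption; the text only says "closest in the vector space".
    nearest = sorted(training_data, key=lambda item: math.dist(profile, item[0]))[:k]
    votes = Counter(nos for _, nos in nearest)
    best = max(votes.values())
    # Tie-breaking is not specified in the source; here the closest tied neighbor wins.
    for _, nos in nearest:
        if votes[nos] == best:
            return nos
```

Because the search is a plain scan over the stored samples, its cost grows linearly with the size of the learning data.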

As shown in FIG. 4, the sound source number estimation unit 102 divides the 1-6 kHz band into two ranges, calculates a set of eigenvalue averages for each, and uses these sets as the parameters for prediction. The reason is described later; in short, experiments showed that the estimation accuracy of the sound source direction was highest when the parameters were set in this way.

The peak detection unit 108 obtains the DOAs by detecting local peaks. When the directions of two sound sources are close, however, only one local peak may be found. An example is the peak 200 shown in FIG. 11(A), where two peaks almost coincide. So that as many peaks as there are sound sources can be detected even in such cases, the peak detection unit 108 detects peaks as follows.

First, the largest local peak 200 is detected. This local peak becomes one peak (DOA). Next, a two-dimensional Gaussian 202 (see FIG. 11(B)) is subtracted from the MUSIC response around this local peak. The two-dimensional Gaussian has the standard deviation observed when a single sound source is present and the amplitude of the detected peak. If another peak overlaps the detected one, this operation leaves only the overlapping peak 204 that could not be detected before (see FIG. 11(C)). The procedure is repeated as many times as there are sound sources (NOS).
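The iterative procedure (take the largest peak, subtract a Gaussian, repeat NOS times) can be sketched as follows. The grid layout, the Gaussian widths expressed in grid cells, and the wrap-around of the azimuth axis are assumptions for illustration; the source only states that the Gaussian has the standard deviation of a single-source peak and the amplitude of the detected peak.

```python
import numpy as np

def detect_peaks(response, nos, sigma_az, sigma_el):
    """Pick nos peaks from a MUSIC response by repeated Gaussian subtraction.

    response: 2-D array indexed as [elevation, azimuth] on a regular grid.
    sigma_az, sigma_el: single-source peak widths in grid cells (assumed known).
    Returns a list of (elevation_index, azimuth_index) tuples.
    """
    residual = np.array(response, dtype=float)      # work on a copy
    n_el, n_az = residual.shape
    el = np.arange(n_el)[:, None]
    az = np.arange(n_az)[None, :]
    peaks = []
    for _ in range(nos):
        i, j = np.unravel_index(np.argmax(residual), residual.shape)
        peaks.append((int(i), int(j)))
        d_az = np.abs(az - j)
        d_az = np.minimum(d_az, n_az - d_az)        # azimuth grid assumed to cover 360 degrees
        gauss = residual[i, j] * np.exp(-0.5 * ((el - i) / sigma_el) ** 2
                                        - 0.5 * (d_az / sigma_az) ** 2)
        residual -= gauss                           # expose any peak hidden under this one
    return peaks
```

When two sources merge into a single maximum, the recovered positions are only approximate, because the subtracted Gaussian is centred on the merged maximum rather than on either true source.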

The grouping unit 64 filters and tracks the sound source directions as follows. When the number of sound sources is overestimated, spurious DOAs are inserted. The method of the present embodiment removes isolated DOAs with a filtering algorithm based on grouping. The algorithm decides whether to assign the current DOA candidate to a group based on the DOAs detected in the past 10 blocks (corresponding to 2 seconds). Specifically, a candidate is grouped when the following conditions are satisfied.

（１）前のＤＯＡは、現在のＤＯＡを先端とした「円錐」の内部にある。ここでの「円錐」は、「付録」に述べる３次元ＭＵＳＩＣ法で求める音源方位（θ、φ）及び時刻ｔを座標とする３次元座標空間内での円錐である。円錐の高さ方向を時間軸に平行にとる。本実施の形態では、円錐の底面は方位角が±３０度、仰角が±７度となるような形状に設定した。これらの値は、人は縦方向（仰角の変化）よりも、横方向（方位角の変化）に移動する確率が高いことに基づき、ヒューリステックに設定した。 (1) The previous DOA lies inside a "cone" whose apex is the current DOA. The "cone" here is defined in a three-dimensional coordinate space whose coordinates are the sound source direction (θ, φ), obtained by the three-dimensional MUSIC method described in the Appendix, and the time t. The axis of the cone is parallel to the time axis. In the present embodiment, the base of the cone is shaped so that it spans ±30 degrees in azimuth and ±7 degrees in elevation. These values were set heuristically, on the grounds that a person is more likely to move horizontally (changing the azimuth) than vertically (changing the elevation).

（２）現在のＤＯＡと前のＤＯＡが属するグループの傾向線との距離がある閾値よりも小さい。ここで、「傾向線」とは、そのグループに属するＤＯＡ列に対する回帰線を現在の時点まで外挿して考えられる直線のことをいう。 (2) The distance between the current DOA and the trend line of the group to which the previous DOA belongs is smaller than a given threshold. Here, the "trend line" is the straight line obtained by extrapolating, up to the current time, the regression line fitted to the sequence of DOAs belonging to that group.

１番目の条件により、ＤＯＡは音源方位（θ、φ）が近いグループにのみグループ化可能となり、２番目の条件により、方向性が異なるグループにはグループ化されないこととなる。実験によれば、こうした基準でＤＯＡをグループ化することにより、いずれの音源もうまくトラッキング出来ることが分かった。特に、２つの互いに異なる音源が、音源方位が重なるように近づき、互いに交差して遠ざかるような移動を行なった場合にも、音源方位を正しく検出することができた。 Under the first condition, a DOA can be added only to a group whose sound source direction (θ, φ) is close to it; under the second condition, it is not added to a group that is moving in a different direction. Experiments showed that grouping DOAs by these criteria allows every sound source to be tracked well. In particular, the sound source directions were detected correctly even when two different sound sources approached each other until their directions overlapped, crossed, and then moved apart.
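The two grouping conditions can be sketched in Python as follows. This is an illustrative sketch only. The function names, the cone height of 10 blocks, the elliptical cross-section, and the Euclidean distance to the trend line are assumptions made for the example; the text fixes only the ±30 degree and ±7 degree extent of the base. The regression also ignores azimuth wrap-around for simplicity.

```python
import math

def in_cone(prev, cur, height_blocks=10, az_half=30.0, el_half=7.0):
    """True if `prev` lies inside the cone whose apex is `cur`.

    Each DOA is (block_index, azimuth_deg, elevation_deg). The cone axis runs
    back along the time axis; its elliptical cross-section grows linearly and
    reaches +/-az_half by +/-el_half at `height_blocks` blocks (assumption).
    """
    dt = cur[0] - prev[0]
    if dt <= 0 or dt > height_blocks:
        return False
    scale = dt / height_blocks
    d_az = (cur[1] - prev[1] + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    d_el = cur[2] - prev[2]
    return (d_az / (az_half * scale)) ** 2 + (d_el / (el_half * scale)) ** 2 <= 1.0

def trend_at(group, t):
    """Extrapolate least-squares lines through the group's DOAs to block t."""
    n = len(group)
    mean_t = sum(p[0] for p in group) / n
    var_t = sum((p[0] - mean_t) ** 2 for p in group)
    predicted = []
    for axis in (1, 2):  # 1: azimuth, 2: elevation
        mean_v = sum(p[axis] for p in group) / n
        slope = 0.0
        if var_t > 0.0:
            slope = sum((p[0] - mean_t) * (p[axis] - mean_v) for p in group) / var_t
        predicted.append(mean_v + slope * (t - mean_t))
    return tuple(predicted)

def should_group(group, cur, max_trend_distance):
    """Apply both grouping conditions to a candidate DOA."""
    if not any(in_cone(prev, cur) for prev in group):
        return False  # condition (1) fails
    az, el = trend_at(group, cur[0])
    return math.hypot(cur[1] - az, cur[2] - el) < max_trend_distance  # condition (2)
```

A candidate that passes `should_group` would be appended to the group; candidates that never join any group within the buffer length are the isolated DOAs that the filter discards.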

図５は、図４に示す学習データ記憶部１２６に予め記憶される学習データを事前に作成するための学習データ作成処理部１４０のブロック図である。本実施の形態では、この学習データ作成処理部１４０もロボット３０中に設けられているが、学習データ作成処理部１４０をロボット３０の外に設けて学習データを作成し、作成された学習データのみをロボット３０中に組み込むようにしてもよいことはいうまでもない。 FIG. 5 is a block diagram of the learning data creation processing unit 140, which creates in advance the learning data stored in the learning data storage unit 126 shown in FIG. 4. In the present embodiment, the learning data creation processing unit 140 is also provided in the robot 30. Needless to say, it may instead be provided outside the robot 30 to create the learning data, with only the resulting learning data installed in the robot 30.

図５を参照して、学習データ作成処理部１４０は、図３に示すものと同じマイクロホンアレイ５２及びＡ／Ｄ変換器５４と、Ａ／Ｄ変換器５４の出力するデジタル音源信号を受け、図３に示す固有ベクトル算出部６０と同様の処理で２００ミリ秒ごとに音源信号の各周波数帯域ごとに固有値を算出し出力するための固有値算出部１６０を含む。固有値算出部１６０は、図３に示すフレーム化処理部８０、ＦＦＴ処理部８２、ブロック化処理部８４、相関行列算出部８６及び固有値分解部８８とそれぞれ同様の処理を行なうフレーム化処理部１８０、ＦＦＴ処理部１８２、ブロック化処理部１８４、相関行列算出部１８６及び固有値分解部１８８を含む。ただし、固有値分解部１８８は固有値のみを出力する。 Referring to FIG. 5, the learning data creation processing unit 140 includes the same microphone array 52 and A/D converter 54 as those shown in FIG. 3, and an eigenvalue calculation unit 160 that receives the digital sound source signals output by the A/D converter 54 and, by the same processing as the eigenvector calculation unit 60 shown in FIG. 3, calculates and outputs eigenvalues for each frequency band of the sound source signals every 200 milliseconds. The eigenvalue calculation unit 160 includes a framing processing unit 180, an FFT processing unit 182, a blocking processing unit 184, a correlation matrix calculation unit 186, and an eigenvalue decomposition unit 188, which perform the same processing as the framing processing unit 80, the FFT processing unit 82, the blocking processing unit 84, the correlation matrix calculation unit 86, and the eigenvalue decomposition unit 88 shown in FIG. 3, respectively. However, the eigenvalue decomposition unit 188 outputs only eigenvalues.

固有値算出部１６０はさらに、ＦＦＴ処理部１８２が２５ミリ秒ごとに出力するフレーム化された周波数成分の出力１９４を受け、音源信号の各チャンネル間でクロス・チャンネル・スペクトル・バイナリ・マスキング処理（この詳細については後述する。）を行なうための第１のバイナリ・マスキング処理部１６２と、第１のバイナリ・マスキング処理部１６２から出力されるクロス・チャンネル・スペクトル・バイナリ・マスキング処理がされた音源信号の各々について、さらに、中央のマイクロホンＭＣ１から得られた周波数成分（ＦＦＴ処理部１８２の出力１９２）との間でバイナリ・マスキング処理を行なうための第２のバイナリ・マスキング処理部１６４とを含む。 The eigenvalue calculation unit 160 further includes a first binary masking processing unit 162 and a second binary masking processing unit 164. The first binary masking processing unit 162 receives the framed frequency-component output 194, which the FFT processing unit 182 produces every 25 milliseconds, and performs cross-channel spectral binary masking (described in detail later) between the channels of the sound source signals. The second binary masking processing unit 164 then performs binary masking on each of the cross-channel-masked sound source signals output by the first binary masking processing unit 162, against the frequency components obtained from the central microphone MC1 (the output 192 of the FFT processing unit 182).

第１のバイナリ・マスキング処理部１６２が行なうクロス・チャンネル・バイナリ・マスキング処理とは以下のような処理をいう。２つのチャンネルの信号を周波数領域に変換し、変換後の２つの信号についてフレームごとに個々の周波数成分の値を比較する。値の大きな（強い）方の値は残し、小さな（弱い）方の値に０を割当てる。この後、双方の信号を時間領域に戻す。この処理は、マイクロホン間で相関のある音声を拾っている場合に、音源により近い方のマイクロホンからの信号のみを残すために行なわれる。この処理により、チャンネル間の音漏れを抑え、より信頼性の高いレファレンス信号が得られる。 The cross channel binary masking process performed by the first binary masking processing unit 162 refers to the following process. The signals of the two channels are converted into the frequency domain, and the values of individual frequency components are compared for each frame for the two signals after conversion. The larger (stronger) value is retained, and 0 is assigned to the smaller (weaker) value. Thereafter, both signals are returned to the time domain. This processing is performed in order to leave only the signal from the microphone closer to the sound source when picking up sound having a correlation between the microphones. By this processing, sound leakage between channels can be suppressed, and a more reliable reference signal can be obtained.

第２のバイナリ・マスキング処理部１６４が行なうバイナリ・マスキング処理は以下のような処理をいう。任意の信号（これを処理対象の信号と呼ぶ。）をとり、その信号とマイクロホンＭＣ１から得られた信号とをともに周波数領域に変換する。この２つの信号についてフレームごとに個々の周波数成分の値を比較する。処理対象の信号の方の値が大きければ（強ければ）その値を残し、小さければその周波数成分の値に０を割当てる。この後、処理対象の信号を時間領域に戻す。この処理は、各音源信号から環境音を除外するための処理である。したがってここでは中央のマイクロホンＭＣ１からの周波数成分を基準として用いている。 The binary masking process performed by the second binary masking processing unit 164 refers to the following process. An arbitrary signal (referred to as a signal to be processed) is taken, and the signal and the signal obtained from the microphone MC1 are both converted into the frequency domain. For these two signals, the values of individual frequency components are compared for each frame. If the value of the signal to be processed is larger (stronger), the value is left, and if smaller, 0 is assigned to the value of the frequency component. Thereafter, the signal to be processed is returned to the time domain. This process is a process for excluding environmental sounds from each sound source signal. Therefore, here, the frequency component from the center microphone MC1 is used as a reference.
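The two masking steps described above can be sketched in Python for a single frame. This is a minimal illustration, assuming each spectrum is a list of complex frequency components of equal length; the function names and the handling of equal magnitudes are assumptions, and the transforms to and from the time domain are omitted.

```python
def cross_channel_mask(spec_a, spec_b):
    """Per frequency bin, keep the stronger of two channels and zero the weaker."""
    out_a, out_b = [], []
    for a, b in zip(spec_a, spec_b):
        if abs(a) >= abs(b):  # tie goes to channel a (assumption)
            out_a.append(a)
            out_b.append(0j)
        else:
            out_a.append(0j)
            out_b.append(b)
    return out_a, out_b

def reference_mask(target, reference):
    """Keep a target bin only where it is stronger than the reference channel.

    In the text the reference is the spectrum from the central microphone MC1.
    """
    return [t if abs(t) > abs(r) else 0j for t, r in zip(target, reference)]
```

The first function treats both channels symmetrically, whereas the second modifies only the signal being processed and leaves the reference untouched.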

学習データ作成処理部１４０はさらに、第２のバイナリ・マスキング処理部１６４から２５ミリ秒ごとのフレームについて出力される各チャンネルの音源信号について、２５ミリ秒ごとにパワーを算出するためのパワー算出部１６８と、パワー算出部１６８の算出する各チャンネルのパワーの値の時系列（パワー軌道）としきい値とを比較し、しきい値を上回った音源信号をアクティブであると判定し、アクティブである音源信号の数をそのフレームの音源数として出力するための音源数判定部１７２と、音源数判定部１７２が判定に用いるしきい値を予め記憶するためのしきい値記憶部１７０とを含む。音源数判定部１７２の出力する音源数を、図３に示す音源数推定部１０２が出力する音源数（ＮＯＳ）と区別するためにＰＮＯＳと呼ぶ。 The learning data creation processing unit 140 further includes a power calculation unit for calculating the power every 25 milliseconds for the sound source signal of each channel output from the second binary masking processing unit 164 for each 25 millisecond frame. 168 and a time series (power trajectory) of the power value of each channel calculated by the power calculation unit 168 are compared with a threshold value, and a sound source signal exceeding the threshold value is determined to be active, and is active. A sound source number determination unit 172 for outputting the number of sound source signals as the number of sound sources of the frame, and a threshold value storage unit 170 for preliminarily storing a threshold value used for determination by the sound source number determination unit 172 are included. The number of sound sources output from the sound source number determination unit 172 is referred to as PNOS in order to distinguish it from the number of sound sources (NOS) output from the sound source number estimation unit 102 shown in FIG.

学習データ作成処理部１４０はさらに、固有値分解部１８８から２００ミリ秒ごとに出力される固有値の組と、音源数判定部１７２から与えられるＰＮＯＳの値とを組として学習データ記憶部１２６に学習データとして格納させるための学習データ蓄積部１７４とを含む。なお、図５には図示していないが、学習データ蓄積部１７４は、固有値分解部１８８から与えられる固有値を、図４に示す音源数推定部１０２と同様に１−３ｋＨｚの帯域と、３ｋＨｚ−６ｋＨｚの帯域とで別々に平均して２組の固有値に変換してから学習データ記憶部１２６にそのときのＰＮＯＳの値と共に記憶させる。固有値分解部１８８からの固有値は２００ミリ秒ごとに得られ、ＰＮＯＳは２５ミリ秒ごとに得られるため、平均と四捨五入とによって２００ミリ秒ごとのブロックに変換する。 The learning data creation processing unit 140 further includes a learning data storage unit 174 that pairs each set of eigenvalues output by the eigenvalue decomposition unit 188 every 200 milliseconds with the PNOS value supplied by the sound source number determination unit 172, and stores the pair in the learning data storage unit 126 as learning data. Although not shown in FIG. 5, the learning data storage unit 174, like the sound source number estimation unit 102 shown in FIG. 4, averages the eigenvalues supplied by the eigenvalue decomposition unit 188 separately over the 1-3 kHz band and the 3-6 kHz band to obtain two sets of eigenvalues, and then stores them in the learning data storage unit 126 together with the PNOS value at that time. Because the eigenvalues from the eigenvalue decomposition unit 188 are obtained every 200 milliseconds whereas PNOS is obtained every 25 milliseconds, the PNOS values are converted into 200-millisecond blocks by averaging and rounding.
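The derivation of the PNOS reference labels can be sketched in Python as follows. This is an illustration under assumptions rather than the embodiment itself: the function names are hypothetical, power is taken as the mean squared sample value, a block is assumed to contain 8 non-overlapping 25-millisecond frames, and the rounding is round-half-up.

```python
def frame_power(samples):
    """Mean squared amplitude of one frame of time-domain samples."""
    return sum(x * x for x in samples) / len(samples)

def active_sources_per_frame(powers_by_channel, threshold):
    """Count, for every frame, the channels whose power exceeds the threshold."""
    n_frames = len(powers_by_channel[0])
    return [
        sum(1 for channel in powers_by_channel if channel[f] > threshold)
        for f in range(n_frames)
    ]

def pnos_per_block(frame_counts, frames_per_block=8):
    """Average the per-frame counts over each block and round half up."""
    blocks = []
    for start in range(0, len(frame_counts) - frames_per_block + 1, frames_per_block):
        chunk = frame_counts[start:start + frames_per_block]
        mean = sum(chunk) / frames_per_block
        blocks.append(int(mean + 0.5))  # Python's round() would round half to even
    return blocks
```

The explicit `int(mean + 0.5)` is used because Python's built-in `round` rounds ties to the nearest even integer, which differs from conventional rounding when a block average is exactly 0.5 or 2.5.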

以下、図４の音源数推定部１０２において固有値を１−３ｋＨｚの帯域と３−６ｋＨｚの帯域とに分けてそれぞれ別々に平均する理由について説明する。 The reason why the eigenvalues are averaged separately in the 1-3 kHz band and the 3-6 kHz band in the sound source number estimation unit 102 in FIG. 4 will be described below.

発明者は、固有値が、環境の変化によりどのように影響されるかを調べるため以下のような実験を行ない、ＰＮＯＳごとに固有値を整理した。ビンごとに固有値のセットが得られるが、ここでは、ブロックごとに１つの代表的な固有値のセットを求めるため、特定の周波数帯域で平均化した固有値をそのブロックの固有値プロファイルとして扱う。 The inventor conducted the following experiment in order to examine how the eigenvalue is affected by the environmental change, and arranged the eigenvalue for each PNOS. A set of eigenvalues is obtained for each bin. Here, in order to obtain one typical eigenvalue set for each block, eigenvalues averaged in a specific frequency band are treated as eigenvalue profiles of the blocks.

図６−図８に、３つの異なった環境（ＯＦＣ、ＵＣＷ１、ＵＣＷ２）において、ＰＮＯＳ毎に整理したブロック毎の固有値のプロファイルを表示する。これら環境は実験のための音声を収録した環境であり、具体的には以下のとおりである。 6 to 8 show profiles of eigenvalues for each block arranged for each PNOS in three different environments (OFC, UCW1, and UCW2). These environments are environments that record audio for experiments, and are specifically as follows.

すなわち、マイクロホンアレイによるデータ収録を２つの異なった環境で行なった。一つ目はオフィス環境（ＯＦＣ）で、室内のエアコンとロボットの内部雑音が主な雑音源となる。二つ目の環境は、現在ロボビーの実証実験が行なわれている野外のショッピングモールの通路（ＵＣＷ）である。ＵＣＷでの主な雑音源は、天井に設置されているスピーカから流れてくるポップ・ロックミュージックとなる。通路内のさまざまな位置およびさまざまな向きで収録を行なった。 Specifically, data were recorded with the microphone array in two different environments. The first is an office environment (OFC), where the room's air conditioner and the robot's internal noise are the main noise sources. The second is a corridor of an outdoor shopping mall (UCW), where field trials of Robovie are currently being conducted. The main noise source in UCW is pop and rock music played from loudspeakers installed in the ceiling. Recordings were made at various positions and in various orientations within the corridor.

ここでは、その中の４つ収録の結果を示す。１つ目はオフィス環境（ＯＦＣ）で、残りの３つはショッピングモールの環境（ＵＣＷ１−３）である。テーブル１にそれぞれの収録での音源に関する詳細を示す。 Here, the results of four of them are shown. The first is an office environment (OFC), and the remaining three are shopping mall environments (UCW1-3). Table 1 shows the details about the sound source in each recording.

図６−図８を参照すると、音源数（ＮＯＳ）が、固有値プロファイルの全般的なオフセット及び形状に関連していることが観察される。理想的な固有値プロファイルの形状としては、指向的な音源の数Ｎに対応する最初のＮ個の固有値が強いパワーを示し、無指向的な音源に対応する残りのＭ−Ｎ個の固有値が小さいパワーを示す。しかしながら、図６−図８に表示している実際の形状では、指向的音源の成分と無指向的音源の成分の境目が不明確であり、無指向成分もフラットではなく、緩やかな傾きを示している。

Referring to FIGS. 6 to 8, the number of sound sources (NOS) is seen to be related to the overall offset and shape of the eigenvalue profile. In an ideal eigenvalue profile, the first N eigenvalues, corresponding to the N directional sound sources, show strong power, and the remaining M−N eigenvalues, corresponding to omnidirectional sound sources, show weak power. In the actual profiles shown in FIGS. 6 to 8, however, the boundary between the directional and omnidirectional components is unclear, and the omnidirectional components are not flat but show a gentle slope.

また、異なったＰＮＯＳ間でも固有値プロファイルが一部重複することも観察される。例えば、ＯＦＣ（図６）では、ＰＮＯＳ＝１とＰＮＯＳ＝２のプロファイルが大幅に重複していることが観察される。従って、分類器による厳密な音源数の推定は困難であることが予想される。 It is also observed that eigenvalue profiles partially overlap even between different PNOS. For example, in the OFC (FIG. 6), it is observed that the profiles of PNOS = 1 and PNOS = 2 are significantly overlapped. Therefore, it is expected that it is difficult to accurately estimate the number of sound sources by the classifier.

更には、環境の変化による、固有値プロファイルの形状への影響も強いことが観られる。幅にも傾きにも違いが観られる。例えば、ＯＦＣとＵＣＷ１のＰＮＯＳ＝０の固有値プロファイル（図６，図７のＮＯＳ＝０）を比較すると、その違いは明らかである。ＵＣＷ１では、背景の音楽があるため、ＯＦＣよりも値が大きく、ばらつきも大きくなっている。これは環境の変化を分類器に考慮する必要があることを示している。 Furthermore, it can be seen that the influence of the change in the environment on the shape of the eigenvalue profile is strong. Differences can be seen in width and inclination. For example, when the OFC and UCW1 PNOS = 0 eigenvalue profiles (NOS = 0 in FIGS. 6 and 7) are compared, the difference is clear. In UCW1, since there is background music, the value is larger than that of OFC, and the variation is larger. This indicates that environmental changes need to be considered in the classifier.

また、環境音楽の音源に近づくことによる固有値プロファイルへの影響も観られる。ＵＣＷ２のＰＮＯＳ＝０のプロファイルの形状は、ＵＣＷ１のＰＮＯＳ＝１のものと類似している。これは、ロボットが環境音楽の音源に近づく場合は、環境音楽が新たな指向的音源となり、離れている場合は、無指向的音源となることが反映されている。環境音楽が指向的音源の場合、その方位を求めることが出来、後続処理となるターゲット音声の音源分離にも役立つ。 Approaching the source of the background music also affects the eigenvalue profile. The shape of the PNOS = 0 profile of UCW2 is similar to that of the PNOS = 1 profile of UCW1. This reflects the fact that the background music becomes an additional directional sound source when the robot is close to its source and an omnidirectional sound source when the robot is far from it. When the background music is a directional sound source, its direction can be obtained, which also helps the subsequent separation of the target speech.

最後に、異なった周波数帯域で平均化した固有値のプロファイルを分析した。図６−図８の３列に、それぞれ１‐６ｋＨｚ（ＡＶＧ１＿６）、１−３ｋＨｚ（ＡＶＧ１＿３）および３−６ｋＨｚ（ＡＶＧ３＿６）の３つの異なった周波数帯域の周波数ビンで平均化した固有値のプロファイルを示す。ＮＯＳ＞０のプロファイルで図６−図８の３列を比較すると、ＡＶＧ３＿６（右の列）では、第１と第６の固有値の差がより大きいことが分かる。この結果より、ＡＶＧ３＿６の方が、帯域幅が広域であるＡＶＧ１＿６よりも高い識別性を持つと考えられる。しかし、／ｕ／及び／Ｏ／のように３ｋＨｚ以上の成分が弱い音声区間では、ＡＶＧ３＿６では検出されない恐れがある。これらの結果より、分割した周波数帯域から求められる固有値の２セットを分類器に用いるようにした。こうすることで、３ｋＨｚ以上の成分が弱い区間ではＡＶＧ１＿３によって比較的識別性を高くすることができ、３ｋＨｚ以上の成分が十分強い区間ではＡＶＧ３＿６によって十分な識別性を期待できる。 Finally, eigenvalue profiles averaged in different frequency bands were analyzed. 6 to 8 show profiles of eigenvalues averaged in frequency bins in three different frequency bands of 1-6 kHz (AVG1_6), 1-3 kHz (AVG1_3), and 3-6 kHz (AVG3_6), respectively. . Comparing the three columns in FIGS. 6 to 8 with a profile of NOS> 0, it can be seen that the difference between the first and sixth eigenvalues is larger in AVG3_6 (right column). From this result, it is considered that AVG3_6 has higher discrimination than AVG1_6 having a wide bandwidth. However, there is a possibility that the AVG3_6 may not detect in a voice section where a component of 3 kHz or more is weak such as / u / and / O /. From these results, two sets of eigenvalues obtained from the divided frequency bands were used for the classifier. By doing so, the AVG1_3 can relatively increase the discriminability in the interval where the component of 3 kHz or higher is weak, and the AVG3_6 can expect sufficient discriminability in the interval where the component of 3 kHz or higher is sufficiently strong.

分類アルゴリズムとして、ｋＮＮ（ｋ−ＮｅａｒｅｓｔＮｅｉｇｈｂｏｒｓ）アルゴリズムを選択した。ｋＮＮは計算量も少なく、非線形にも対応できるためである。ｋＮＮに限らず、非線形に対応できる機械学習方式、例えばＳＶＭ（ＳｕｐｐｏｒｔＶｅｃｔｏｒＭａｃｈｉｎｅ）又はＮＮ（ＮｅｕｒａｌＮｅｔｗｏｒｋ）などを用いることも可能である。 The kNN (k-Nearest Neighbors) algorithm was selected as the classification algorithm. This is because kNN has a small amount of calculation and can cope with non-linearity. Not only kNN but also a machine learning method that can deal with non-linearity, for example, SVM (Support Vector Machine) or NN (Neural Network) can be used.

観測信号の相関行列から得られた固有値を分類アルゴリズムの入力パラメータとして用いる。さまざまな周波数帯域の周波数ビンを通して求めた固有値の平均値セット（ＡＶＧ）又は最大値セット（ＭＡＸ）を入力として、さまざまな分類器を学習・評価した。周波数帯域を２つに分割して得られた固有値の２セットも評価した。 The eigenvalue obtained from the correlation matrix of the observed signal is used as an input parameter for the classification algorithm. Various classifiers were learned and evaluated using an average value set (AVG) or a maximum value set (MAX) of eigenvalues obtained through frequency bins in various frequency bands as inputs. Two sets of eigenvalues obtained by dividing the frequency band into two were also evaluated.

ｋＮＮ分類器の性能を、さまざまなｋ（ｎｅａｒｅｓｔｎｅｉｇｈｂｏｒｓの数）に対し、１０重クロス検証により評価した。図９に、さまざまな分類器において、推定したＮＯＳとレファレンスのＰＮＯＳがマッチした度合い（ＮＯＳ精度）を示す。図９は、最もＮＯＳ精度が高かったｋ＝６の場合の結果である。 The performance of the kNN classifier was evaluated by 10-fold cross validation for various k (number of nearest neighbors). FIG. 9 shows the degree of matching (NOS accuracy) between the estimated NOS and the reference PNOS in various classifiers. FIG. 9 shows the result when k = 6 where the NOS accuracy was highest.

図９より、まず固有値の最大値（ＭＡＸ）よりも、平均値（ＡＶＧ）の方が高い性能を示すことが分かる。最も性能が高かったのは、［１−３］（周波数大域１ｋＨｚ−３ｋＨｚ）と「３−６」（３ｋＨｚ−６ｋＨｚ）との平均を使用した場合である。また、［１−４］と［４−６］を使用した場合も、［１−３］及び「３−６」の組合せより低いものの、全体をまとめた［１−６］の場合と比較してよい値を示している。これは、周波数帯域を分割した固有値の２セットを使用することが効果的であることを示している。 FIG. 9 shows that the average value (AVG) exhibits higher performance than the maximum eigenvalue (MAX). The highest performance was obtained when the average of [1-3] (frequency range 1 kHz-3 kHz) and “3-6” (3 kHz-6 kHz) was used. Also, when [1-4] and [4-6] are used, although lower than the combination of [1-3] and [3-6], it is compared with the case of [1-6] that summarizes the whole. This is a good value. This indicates that it is effective to use two sets of eigenvalues obtained by dividing the frequency band.

図９からはさらに、［１−２］と［２−６］を用いた場合には［１−６］とほぼ同等の精度を示すこと、したがって周波数帯域を２つに分割したときに、［１−６］より精度が高くなる分割周波数位置が、２．５ｋＨｚ以上で４ｋＨｚ以下のあたりにあることが分かる。 FIG. 9 further shows that using [1-2] and [2-6] gives almost the same accuracy as [1-6]. It follows that, when the frequency band is divided into two, the split frequencies that give higher accuracy than [1-6] lie roughly between 2.5 kHz and 4 kHz.

音源方位（ＤＯＡ）の検出におけるＦＦＴのビン数（ＮＦＦＴ）の影響を調べた。図１０に、ＮＦＦＴのさまざまな値（３２，６４，１２８，２５６、５１２）に対する、ＤＯＡの性能を表示する。ＮＦＦＴ＝１２８が音源方位の推定において最も高い性能を示した。しかし、図１０からわかるように、より小さいＮＦＦＴでも著しい性能の劣化は観られない。より小さいＮＦＦＴを使うことで計算量は大幅に減少できる。したがって上記実施の形態では、ＮＦＦＴ＝６４を採用した。 The influence of FFT bin number (NFFT) on the detection of sound source direction (DOA) was examined. FIG. 10 displays the DOA performance for various values of NFFT (32, 64, 128, 256, 512). NFFT = 128 showed the highest performance in estimating the sound source direction. However, as can be seen from FIG. 10, no significant performance degradation is observed even with a smaller NFFT. Using a smaller NFFT can greatly reduce the amount of computation. Therefore, NFFT = 64 is adopted in the above embodiment.

［動作］ [Operation]
上記実施の形態に係る音源定位処理部５０は以下のように動作する。マイクロホンアレイが図１及び図２に示すようにマイクロホン台３２を用いてロボット３０に装着されるものとする。 The sound source localization processing unit 50 according to the above embodiment operates as follows. The microphone array is assumed to be mounted on the robot 30 using the microphone base 32, as shown in FIGS. 1 and 2.

最初に、学習データの作成時の学習データ作成処理部１４０は以下のように動作する。マイクロホンアレイ５２は音源からの音声を１４個のアナログ電気信号に変換し、Ａ／Ｄ変換器５４に与える。Ａ／Ｄ変換器５４は１６ｋＨｚでこれら信号を１６ビットのデジタル信号化し、１４個のデジタル信号をフレーム化処理部１８０に与える。 First, the learning data creation processing unit 140 when creating learning data operates as follows. The microphone array 52 converts the sound from the sound source into 14 analog electric signals and supplies the analog electric signals to the A / D converter 54. The A / D converter 54 converts these signals into 16-bit digital signals at 16 kHz, and provides 14 digital signals to the framing processor 180.

フレーム化処理部１８０は、２５ミリ秒のフレーム長でこれら各チャンネルのデジタル音源信号をフレーム化し、ＦＦＴ処理部１８２に与える。ＦＦＴ処理部１８２は、各チャンネルの各フレームのデジタル音源信号に対してＦＦＴを施し、各周波数成分の出力１９４に変換してブロック化処理部１８４及び第１のバイナリ・マスキング処理部１６２に与える。ＦＦＴ処理部１８２の出力１９４のうちマイクロホンＭＣ１からの音源信号から得られた出力１９２は第２のバイナリ・マスキング処理部１６４にも与えられる。 The framing processing unit 180 divides the digital sound source signal of each channel into frames with a frame length of 25 milliseconds and supplies them to the FFT processing unit 182. The FFT processing unit 182 applies an FFT to each frame of each channel, converts it into the frequency-component output 194, and supplies this output to the blocking processing unit 184 and the first binary masking processing unit 162. Of the output 194 of the FFT processing unit 182, the output 192 obtained from the sound source signal of the microphone MC1 is also supplied to the second binary masking processing unit 164.

ブロック化処理部１８４は、ＦＦＴ処理部１８２から２５ミリ秒ごとに出力される信号を２００ミリ秒ごとにブロック化し、相関行列算出部１８６に与える。相関行列算出部１８６はこれら各ブロックについて、チャンネル毎の相関行列を算出し、固有値分解部１８８に与える。固有値分解部１８８は、相関行列算出部１８６により算出された相関行列に固有値分解を施し、学習データ蓄積部１７４に与える。 The blocking processing unit 184 blocks the signal output every 25 milliseconds from the FFT processing unit 182 every 200 milliseconds, and gives it to the correlation matrix calculation unit 186. Correlation matrix calculation section 186 calculates a correlation matrix for each channel for each of these blocks, and provides it to eigenvalue decomposition section 188. The eigenvalue decomposition unit 188 performs eigenvalue decomposition on the correlation matrix calculated by the correlation matrix calculation unit 186 and supplies the result to the learning data storage unit 174.
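The correlation matrix computed for each block can be sketched in Python as follows. This is a minimal illustration, assuming the usual estimate for one frequency bin: the average, over the frames of the block, of the outer product of the channel spectrum vector with its conjugate. The function name is hypothetical, and the eigenvalue decomposition that follows would normally be delegated to a Hermitian eigensolver from a numerical library, which is not shown here.

```python
def correlation_matrix(frames):
    """Spatial correlation matrix of one frequency bin over one block.

    `frames` is a list of frames; each frame is a list of M complex values,
    one per microphone channel, for the same frequency bin.
    """
    num_frames = len(frames)
    m = len(frames[0])
    r = [[0j] * m for _ in range(m)]
    for x in frames:
        for i in range(m):
            for j in range(m):
                r[i][j] += x[i] * x[j].conjugate()  # outer product x x^H
    return [[value / num_frames for value in row] for row in r]
```

With 25-millisecond frames and 200-millisecond blocks, each matrix would be averaged over the frames of one block, and the result is Hermitian by construction, so its eigenvalues are real.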

一方、第１のバイナリ・マスキング処理部１６２は、ＦＦＴ処理部１８２の出力１９４に対してクロス・チャンネル・スペクトル・バイナリ・マスキング処理を施して結果を第２のバイナリ・マスキング処理部１６４に与える。第２のバイナリ・マスキング処理部１６４は、第１のバイナリ・マスキング処理部１６２の出力する各チャンネルの周波数領域の値に対し、マイクロホンＭＣ１からの音源信号から得られたＦＦＴ後の周波数成分（ＦＦＴ処理部１８２の出力１９２）を基準にバイナリ・マスキング処理を施し、各チャンネルの信号を時間領域に戻して出力する。 Meanwhile, the first binary masking processing unit 162 applies cross-channel spectral binary masking to the output 194 of the FFT processing unit 182 and supplies the result to the second binary masking processing unit 164. The second binary masking processing unit 164 applies binary masking to the frequency-domain values of each channel output by the first binary masking processing unit 162, using as the reference the post-FFT frequency components obtained from the sound source signal of the microphone MC1 (the output 192 of the FFT processing unit 182), then returns the signal of each channel to the time domain and outputs it.

パワー算出部１６８は、第２のバイナリ・マスキング処理部１６４が出力する各チャンネルの音源信号について、２５ミリ秒ごとにパワーを算出して音源数判定部１７２に与える。音源数判定部１７２は、パワー算出部１６８から与えられる各チャンネルの音源信号のパワーのパワー軌跡を追跡し、しきい値記憶部１７０に記憶されたしきい値よりも大きなパワーをもったチャンネルがあればそのチャンネルの音源信号がアクティブであると判定し、ブロックごとに、そのブロックでアクティブな音源信号の数をＰＮＯＳとして学習データ蓄積部１７４に出力する。 The power calculation unit 168 calculates the power for each sound source signal of each channel output from the second binary masking processing unit 164 every 25 milliseconds and supplies the power to the sound source number determination unit 172. The sound source number determination unit 172 tracks the power trajectory of the power of the sound source signal of each channel given from the power calculation unit 168, and a channel having a power larger than the threshold value stored in the threshold value storage unit 170 is found. If there is, the sound source signal of the channel is determined to be active, and the number of sound source signals active in the block is output to the learning data storage unit 174 as PNOS for each block.

学習データ蓄積部１７４は、固有値分解部１８８から与えられる固有値のうち、１−３ｋＨｚの固有値を固有値番号ごとにそれぞれ平均化して第１の固有値の組を算出する。学習データ蓄積部１７４はさらに、同様にして３−６ｋＨｚの固有値を固有値番号ごとにそれぞれ平均化して第２の固有値の組を算出する。学習データ蓄積部１７４は、これら２組の固有値をパラメータとして、ＰＮＯＳとともに１つの学習データ項目として学習データ記憶部１２６に記憶させる。 The learning data storage unit 174 calculates a first set of eigenvalues by averaging eigenvalues of 1-3 kHz among eigenvalues given from the eigenvalue decomposition unit 188 for each eigenvalue number. Similarly, the learning data storage unit 174 calculates the second set of eigenvalues by averaging the eigenvalues of 3-6 kHz for each eigenvalue number. The learning data storage unit 174 stores these two sets of eigenvalues as parameters in the learning data storage unit 126 together with PNOS as one learning data item.
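The band-wise averaging by eigenvalue number can be sketched in Python as follows. This is an illustrative sketch with hypothetical function names; it assumes that the eigenvalues of each frequency bin are already sorted in the same order, and that a bin lying exactly on 3 kHz is assigned to the upper band, a boundary choice the text does not specify.

```python
def band_profile(eigvals_by_bin, bin_freqs_hz, lo_hz, hi_hz):
    """Average the i-th eigenvalue over all bins with lo_hz <= f < hi_hz."""
    selected = [
        ev for ev, f in zip(eigvals_by_bin, bin_freqs_hz) if lo_hz <= f < hi_hz
    ]
    if not selected:
        raise ValueError("no frequency bin falls inside the requested band")
    count = len(selected[0])
    return [sum(ev[i] for ev in selected) / len(selected) for i in range(count)]

def feature_vector(eigvals_by_bin, bin_freqs_hz):
    """Concatenate the 1-3 kHz and 3-6 kHz eigenvalue profiles."""
    low = band_profile(eigvals_by_bin, bin_freqs_hz, 1000.0, 3000.0)
    high = band_profile(eigvals_by_bin, bin_freqs_hz, 3000.0, 6000.0)
    return low + high
```

With 14 channels, each profile has 14 values, so the concatenated feature vector stored with each PNOS label would have 28 entries.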

こうして学習データ記憶部１２６に学習データが記憶されると、この学習データ記憶部１２６を図４の音源数推定部１０２において使用することで音源定位処理部５０が動作可能となる。 When the learning data is stored in the learning data storage unit 126 in this way, the sound source localization processing unit 50 can be operated by using the learning data storage unit 126 in the sound source number estimation unit 102 of FIG.

ランタイムには、音源定位処理部５０は以下のように動作する。マイクロホンアレイ５２、Ａ／Ｄ変換器５４、フレーム化処理部８０、ＦＦＴ処理部８２、ブロック化処理部８４、相関行列算出部８６、及び固有値分解部８８は、学習データ作成時の学習データ作成処理部１４０のフレーム化処理部１８０、ＦＦＴ処理部１８２、ブロック化処理部１８４、相関行列算出部１８６、及び固有値分解部１８８と同様に動作する。ただし固有値分解部８８は、固有値だけでなく各固有値に対応する固有ベクトル９２も算出しＭＵＳＩＣ空間スペクトル算出部１０４に２００ミリ秒ごとに与える。固有値分解部８８が算出した固有値９０は音源数推定部１０２に与えられる。なお、位置ベクトル記憶部１００には予めマイクロホンアレイ５２のマイクロホンの配置に応じた位置ベクトルが記憶されているものとする。 At runtime, the sound source localization processing unit 50 operates as follows. The microphone array 52, the A / D converter 54, the framing processing unit 80, the FFT processing unit 82, the blocking processing unit 84, the correlation matrix calculation unit 86, and the eigenvalue decomposition unit 88 are used to generate learning data when learning data is generated. The same operations as the framing processing unit 180, the FFT processing unit 182, the blocking processing unit 184, the correlation matrix calculation unit 186, and the eigenvalue decomposition unit 188 of the unit 140 are performed. However, the eigenvalue decomposition unit 88 calculates not only eigenvalues but also eigenvectors 92 corresponding to the respective eigenvalues and gives them to the MUSIC space spectrum calculation unit 104 every 200 milliseconds. The eigenvalue 90 calculated by the eigenvalue decomposition unit 88 is given to the sound source number estimation unit 102. It is assumed that the position vector storage unit 100 stores a position vector corresponding to the arrangement of microphones in the microphone array 52 in advance.

音源数推定部１０２は以下のように動作する。第１の平均値算出部１２０は、２００ミリ秒ごとに与えられる固有値９０のうち、１−３ｋＨｚの領域の固有値を固有値番号ごとに平均し、第１の組の固有値としてＫＮＮ分類器１２４に与える。第２の平均値算出部１２２は、２００ミリ秒ごとに与えられる固有値９０のうち、３−６ｋＨｚの領域の固有値を固有値番号ごとに平均し、第２の組の固有値としてＫＮＮ分類器１２４に与える。 The sound source number estimation unit 102 operates as follows. The first average value calculation unit 120 averages eigenvalues in the region of 1-3 kHz among eigenvalues 90 given every 200 milliseconds for each eigenvalue number, and gives them to the KNN classifier 124 as a first set of eigenvalues. . The second average value calculation unit 122 averages eigenvalues in the region of 3-6 kHz out of the eigenvalues 90 given every 200 milliseconds for each eigenvalue number, and gives the result to the KNN classifier 124 as a second set of eigenvalues. .

ＫＮＮ分類器１２４は、第１の平均値算出部１２０及び第２の平均値算出部１２２から与えられる第１及び第２の組の固有値をパラメータとして、学習データ記憶部１２６に記憶された学習データに基づき、ＫＮＮ分類によってＮＯＳ１２８を推定し、ＭＵＳＩＣ空間スペクトル算出部１０４に与える。 The KNN classifier 124 takes as parameters the first and second sets of eigenvalues supplied by the first average value calculation unit 120 and the second average value calculation unit 122, estimates the NOS 128 by KNN classification based on the learning data stored in the learning data storage unit 126, and supplies it to the MUSIC spatial spectrum calculation unit 104.
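The classification step can be sketched in Python as a plain k-nearest-neighbours vote. This is a minimal sketch, not the classifier of the embodiment: the Euclidean distance, the tie-breaking rule in favour of the smaller source count, and the function name are all assumptions; only the default k = 6 comes from the evaluation reported above.

```python
import math
from collections import Counter

def knn_predict(train, query, k=6):
    """Estimate the number of sources for `query` by majority vote.

    `train` is a list of (feature_vector, pnos_label) pairs, and `query` is a
    feature vector of the same length (the two band-averaged eigenvalue sets).
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    top = max(votes.values())
    # On a tie, prefer the smaller count (assumption), since overestimating
    # the number of sources inserts false DOAs.
    return min(label for label, count in votes.items() if count == top)
```

Because k-nearest-neighbours needs no training phase beyond storing the labelled examples, new recordings from a different environment can be added to `train` directly.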

ＭＵＳＩＣ空間スペクトル算出部１０４以下の処理は通常のＭＵＳＩＣ法の処理を３次元化したものである。まずＭＵＳＩＣ空間スペクトル算出部１０４は、位置ベクトル記憶部１００に記憶された位置ベクトルと、音源数推定部１０２から与えられたＮＯＳとに基づいてＭＵＳＩＣ空間スペクトルを２００ミリ秒ごとに算出しＭＵＳＩＣ応答算出部１０６に与える。ＭＵＳＩＣ応答算出部１０６はＭＵＳＩＣ空間スペクトルに基づき、２００ミリ秒ごとにＭＵＳＩＣ応答を算出しピーク検出部１０８に与える。ピーク検出部１０８はこのＭＵＳＩＣ応答の中を探索し、音源数推定部１０２から与えられるＮＯＳと同じ個数だけのピークを大きい方から検出し、グルーピング部６４に与える。グルーピング部６４には、２００ミリ秒ごとにＮＯＳと同じ個数の音源方位が与えられる。 The processing after the MUSIC spatial spectrum calculation unit 104 is a three-dimensional processing of the normal MUSIC method. First, the MUSIC spatial spectrum calculation unit 104 calculates the MUSIC spatial spectrum every 200 milliseconds based on the position vector stored in the position vector storage unit 100 and the NOS given from the sound source number estimation unit 102, and calculates the MUSIC response. Part 106. The MUSIC response calculation unit 106 calculates the MUSIC response every 200 milliseconds based on the MUSIC spatial spectrum, and provides the MUSIC response calculation unit 106 to the peak detection unit 108. The peak detecting unit 108 searches the MUSIC response, detects the same number of peaks as the NOS given from the sound source number estimating unit 102 from the larger one, and gives it to the grouping unit 64. The grouping unit 64 is given the same number of sound source directions as the NOS every 200 milliseconds.
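For orientation, the way the estimated NOS enters the spatial spectrum can be sketched in Python using the textbook narrowband MUSIC form, in which the steering vector of a candidate direction is projected onto the noise subspace. This is an assumption-laden sketch rather than the formula of the embodiment: the function name, the normalization by the steering vector energy, and the small constant `eps` are illustrative choices.

```python
def music_spectrum(steering, eigvecs, num_sources, eps=1e-12):
    """MUSIC pseudo-spectrum value for one candidate direction and one bin.

    `steering` is the position (steering) vector for the direction, a list of
    M complex values. `eigvecs` is a list of M eigenvectors sorted by
    descending eigenvalue; those after the first `num_sources` are treated
    as spanning the noise subspace.
    """
    noise_subspace = eigvecs[num_sources:]
    energy = sum(abs(a) ** 2 for a in steering)
    leakage = 0.0
    for vec in noise_subspace:
        projection = sum(a.conjugate() * e for a, e in zip(steering, vec))
        leakage += abs(projection) ** 2
    return energy / (leakage + eps)  # eps guards against division by zero
```

A direction whose steering vector is nearly orthogonal to the noise subspace yields a large value, which is why an overestimated or underestimated `num_sources` directly distorts the peaks that the peak detection unit 108 searches for.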

グルーピング部６４は、最初の１０ブロック分については、各ブロックで算出された音源方位をバッファ６６に記憶する。グルーピング部６４は同時に、ブロック間で同じ音源と思われるものをグループ化する処理を実行する。グループ化の手法については既に説明したとおりである。１１ブロック以降は、グルーピング部６４は、先入先出方式で各ブロックでの音源方位をバッファ６６に記憶するとともに、１０ブロック分の時間が経過しても他の音源方位とグループ化されなかった音源方位を削除する。こうして、２ブロック以上継続したグループが存在する場合には、グルーピング部６４はそれらグループが１つの音源を表すものとして、ブロックごとにその音源方位を出力する。 For the first 10 blocks, the grouping unit 64 stores the sound source direction calculated for each block in the buffer 66. At the same time, the grouping unit 64 executes a process of grouping what seems to be the same sound source between blocks. The grouping method has already been described. After 11 blocks, the grouping unit 64 stores the sound source direction of each block in the buffer 66 by the first-in first-out method, and the sound source that has not been grouped with other sound source directions even after 10 blocks have elapsed. Remove the bearing. Thus, when there are groups that continue for two or more blocks, the grouping unit 64 outputs the sound source direction for each block, assuming that these groups represent one sound source.

以上のような動作によって、音源定位処理部５０は継続的に複数個の音源の定位とトラッキングとを行なうことができる。 With the above operation, the sound source localization processing unit 50 can continuously perform localization and tracking of a plurality of sound sources.

上記実施の形態に係る音源定位処理部５０によれば、音源の数と関連があるとして知られている、チャンネル間の相関行列の固有値について、各ブロックごとに、１−３ｋＨｚの周波数領域と３−６ｋＨｚの周波数領域とに分けて固有値番号ごとに平均を算出して音源数推定のパラメータとしている。実験結果からも明らかなように、このようなパラメータを用いることにより、高い精度で音源数を予測することができ、ＭＵＳＩＣ法による音源定位を安定して精度高く行なうことができる。 According to the sound source localization processing unit 50 according to the above-described embodiment, the eigenvalue of the correlation matrix between channels, which is known to be related to the number of sound sources, has a frequency region of 1-3 kHz and 3 for each block. An average is calculated for each eigenvalue number divided into a frequency region of −6 kHz and used as a parameter for estimating the number of sound sources. As is clear from the experimental results, the number of sound sources can be predicted with high accuracy by using such parameters, and sound source localization by the MUSIC method can be performed stably and with high accuracy.

Furthermore, because the present embodiment uses the three-dimensional MUSIC method, the sound source direction can be estimated in terms of not only the azimuth angle but also, within a certain range, the elevation angle. A robot or the like can therefore correctly localize sound sources and act appropriately even in a real environment where sound arrives from various directions. When the robot interacts with a human, it can be expected to behave appropriately while looking at the other party's face, which makes the interaction between robot and human smoother.

The embodiment disclosed herein is merely an example, and the present invention is not limited to the above embodiment. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

[Appendix: MUSIC method]
The Fourier transforms X_m(k, t) of the M microphone inputs are modeled as shown in Equation (1).
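The image of Equation (1) is not reproduced in this text. A form consistent with the symbol definitions given in this appendix, assuming the usual narrowband MUSIC signal model, would be the following. The stacked vector x(k, t) is notation assumed here for illustration.

```latex
\mathbf{x}(k,t) = [X_1(k,t), \ldots, X_M(k,t)]^{T}
               = \mathbf{A}_k\,\mathbf{s}(k,t) + \mathbf{n}(k,t) \tag{1}
```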

Here, the vector s(k, t) consists of the spectra S_n(k, t) of the N sound sources: s(k, t) = [S_1(k, t), ..., S_N(k, t)]^T. k and t are the frequency index and the time-frame index, respectively. The vector n(k, t) represents background noise. The matrix A_k is the transfer function matrix, whose (m, n) element is the transfer function of the direct path from the n-th sound source to the m-th microphone. The n-th column vector of A_k is called the position vector (steering vector) of the n-th sound source.

First, the spatial correlation matrix R_k defined by Equation (2) is obtained. The eigenvalue decomposition of R_k shown in Equation (3) then yields the diagonal matrix of eigenvalues Λ_k and the matrix E_k composed of the eigenvectors.
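The two steps above can be sketched as follows, assuming that Equation (2) is the usual expectation of x(k, t) x(k, t)^H, estimated here by averaging over the frames of one block, and that Equation (3) is the standard eigendecomposition of a Hermitian matrix. The function names are hypothetical.

```python
import numpy as np


def spatial_correlation(X):
    """Sample estimate of R_k for one frequency bin.

    X: complex array of shape (M, T), M channels by T time frames.
    Returns the (M, M) average of x x^H over the T frames.
    """
    return X @ X.conj().T / X.shape[1]


def eig_descending(R):
    """Eigenvalues and eigenvectors of a Hermitian matrix, largest first."""
    w, V = np.linalg.eigh(R)      # eigh returns eigenvalues in ascending order
    order = np.argsort(w)[::-1]   # reorder so dominant eigenvalues come first
    return w[order], V[:, order]
```

Sorting in descending order puts the dominant eigenvalues first, which makes it straightforward to separate the first N eigenvectors from the rest.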

The eigenvectors can be partitioned as E_k = [E_ks | E_kn], where E_ks denotes the eigenvectors corresponding to the N dominant eigenvalues and E_kn denotes the remaining eigenvectors.

The MUSIC spatial spectrum is obtained by Equations (4) and (5), where r is the distance and θ and φ are the azimuth angle and the elevation angle, respectively. Equation (5) gives the normalized position vector at the scanned point (r, θ, φ).
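As a rough illustration, the sketch below evaluates the MUSIC spatial spectrum at one scanned point for one frequency bin. It assumes the standard form in which the spectrum is the reciprocal of the squared norm of the projection of the normalized position vector onto the remaining (noise-subspace) eigenvectors E_kn. The equation images are not reproduced in this text, so this form and the name `music_spectrum` are assumptions.

```python
import numpy as np


def music_spectrum(a, E_n):
    """MUSIC spatial spectrum for one scanned point and one frequency bin.

    a:   complex position (steering) vector of shape (M,) for the point.
    E_n: complex array of shape (M, M - N) holding the eigenvectors that
         do not correspond to the N dominant eigenvalues.
    """
    a_norm = a / np.linalg.norm(a)      # normalized position vector
    projection = E_n.conj().T @ a_norm  # component in the noise subspace
    return 1.0 / np.linalg.norm(projection) ** 2
```

The value grows large when the scanned position vector is nearly orthogonal to E_kn, which is what produces peaks at the source directions. A practical implementation would likely guard against division by zero.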

The spatial spectrum (referred to herein as the "MUSIC response") is the MUSIC spatial spectrum averaged as shown in Equation (6).

In Equation (6), k_L and k_H are the indices of the lower and upper boundaries of the frequency band, respectively, and K = k_H - k_L + 1. The sound source directions are obtained from the N peaks of the MUSIC response.
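The averaging and peak selection can be sketched as follows. For brevity the sketch scans a one-dimensional circular azimuth grid rather than the three-dimensional (r, θ, φ) space of the embodiment, and it uses a simple local-maximum rule that is not claimed to be the method of the peak detection unit 108. The function names are hypothetical.

```python
import numpy as np


def music_response(P, k_lo, k_hi):
    """Average narrowband MUSIC spectra over bins k_lo..k_hi inclusive.

    P: array of shape (n_bins, n_directions).
    The number of averaged bins is K = k_hi - k_lo + 1.
    """
    return P[k_lo:k_hi + 1].mean(axis=0)


def top_peaks(response, n_sources):
    """Grid indices of the n_sources highest local maxima (circular grid)."""
    left = np.roll(response, 1)    # neighbor on one side, wrapping around
    right = np.roll(response, -1)  # neighbor on the other side
    is_peak = (response > left) & (response >= right)
    idx = np.flatnonzero(is_peak)
    strongest = idx[np.argsort(response[idx])[::-1][:n_sources]]
    return np.sort(strongest)
```

Here `n_sources` plays the role of N, the number of sound sources estimated for the block.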

A diagram showing a microphone base 32 attached to a robot 30 having a sound source localization processing unit 50 according to one embodiment of the present invention.
A three-view drawing of the microphone base 32.
A block diagram of the sound source localization processing unit 50.
A block diagram of the sound source number estimation unit 102.
A block diagram of the learning data creation processing unit 140.
A diagram showing the per-block eigenvalue profiles, organized by PNOS, in recording environment OFC.
A diagram showing the per-block eigenvalue profiles, organized by PNOS, in recording environment UCW1.
A diagram showing the per-block eigenvalue profiles, organized by PNOS, in recording environment UCW2.
A graph showing the experimental accuracy of sound source number estimation by the kNN classification method for each eigenvalue profile calculation method.
A graph showing the per-block accuracy of sound source number estimation as a function of the number of neighbors k in the kNN classification method.
A diagram for explaining the peak detection method used by the peak detection unit 108.

Explanation of symbols

30 Robot
32 Microphone base
50 Sound source localization processing unit
52 Microphone array
60 Eigenvector calculation unit
62 Sound source estimation unit
64 Grouping unit
86, 186 Correlation matrix calculation unit
88, 188 Eigenvalue decomposition unit
102 Sound source number estimation unit
104 MUSIC spatial spectrum calculation unit
106 MUSIC response calculation unit
108 Peak detection unit
120 First average value calculation unit
122 Second average value calculation unit
124 KNN classifier
126 Learning data storage unit
162 First binary masking processing unit
164 Second binary masking processing unit
168 Power calculation unit
172 Sound source number determination unit
174 Learning data accumulation unit

Claims

A sound source localization apparatus comprising:
conversion means for converting each of the sound source signals of a plurality of channels obtained from the output of a microphone array into frequency components of a plurality of frequency bands at predetermined time intervals;
correlation matrix calculation means for obtaining, at each predetermined time interval and for each of the plurality of frequency bands, a spatial correlation matrix between the frequency components of the sound source signals of the plurality of channels obtained by the conversion means;
eigenvector calculation means for performing eigenvalue decomposition on each of the spatial correlation matrices calculated by the correlation matrix calculation means for each of the plurality of frequency bands at each predetermined time interval, and calculating eigenvectors and eigenvalues for each of the plurality of frequency bands;
eigenvalue profile calculation means for calculating eigenvalue profiles for first and second frequency ranges, based on the eigenvalues calculated for each of the plurality of frequency bands by the eigenvector calculation means at each predetermined time interval, each of the first and second frequency ranges including one or more of the plurality of frequency bands;
sound source number estimation means for estimating the number of sound sources at each predetermined time interval, using as a parameter the set of eigenvalue profiles calculated by the eigenvalue profile calculation means for the first and second frequency ranges; and
sound source estimation means for estimating, by the MUSIC method, as many sound source directions as the number of sound sources, based on the number of sound sources estimated by the sound source number estimation means, information on the arrangement of the microphone elements belonging to the microphone array, and the eigenvectors calculated by the eigenvector calculation means at each predetermined time interval.

The sound source localization apparatus according to claim 1, wherein the first frequency range and the second frequency range are continuous with each other.

The sound source localization apparatus according to claim 1, wherein the first frequency range and the second frequency range do not overlap each other.

The sound source localization apparatus according to claim 3, wherein a lower limit of the first and second frequency ranges is 1 kHz, and an upper limit is 6 kHz.

The sound source localization apparatus, wherein the eigenvalue profile calculation means includes:
first eigenvalue averaging means for averaging, for each eigenvalue number, the eigenvalues calculated by the eigenvector calculation means for the frequency bands belonging to the first frequency range;
second eigenvalue averaging means for averaging, for each eigenvalue number, the eigenvalues calculated by the eigenvector calculation means for the frequency bands belonging to the second frequency range; and
means for creating and outputting the eigenvalue profile from the averages of the eigenvalues calculated for each eigenvalue number by the first and second eigenvalue averaging means.

The sound source localization apparatus according to claim 5, wherein a boundary between the first frequency range and the second frequency range is in a range of 2.5 kHz to 4 kHz.

The sound source localization apparatus according to any one of claims 1 to 6, wherein the sound source number estimation means includes non-linear estimation means that has been learned in advance so as to estimate a correct number of sound sources using the set of eigenvalue profiles as a parameter.

The sound source localization apparatus according to claim 7, wherein the nonlinear estimation means includes:
learning data storage means for storing a plurality of items of learning data, each comprising a combination of a set of eigenvalue profiles used as a parameter and the number of sound sources corresponding to that set; and
means for estimating the number of sound sources by the k-nearest neighbor method from the learning data stored in the learning data storage means, using as a parameter the set of eigenvalue profiles calculated by the eigenvalue profile calculation means.

The sound source localization apparatus according to claim 8, wherein the number of neighboring items of learning data used by the means for estimating the number of sound sources by the k-nearest neighbor method is six.

The sound source localization apparatus according to claim 1, further comprising a sound source tracking unit for tracking, on a time axis, the sound source directions estimated at the predetermined time intervals by the sound source estimation means.