JP5660362B2

JP5660362B2 - Sound source localization apparatus and computer program

Info

Publication number: JP5660362B2
Application number: JP2010086705A
Authority: JP
Inventors: イシイ・カルロス・トシノリ; 棟梁; 石黒　浩; 萩田　紀博
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2010-04-05
Filing date: 2010-04-05
Publication date: 2015-01-28
Anticipated expiration: 2030-04-05
Also published as: JP2011220701A

Description

The present invention relates to sound source localization in real environments, and more particularly to sound source localization using the MUSIC (MUltiple SIgnal Classification) method in a real environment, and to a technique that uses the directionality of sound to detect the intervals in which a moving sound source emits sound.

In voice communication between a person and a robot, the microphones mounted on the robot are usually some distance away from the speaker (1 m or more). The signal-to-noise ratio (SNR) is therefore lower than in cases such as telephone speech, where the mouth is only a few centimeters from the microphone. As a result, the voices of other people nearby and environmental noise act as interference, making it difficult for the robot to recognize the target speech. Sound source localization and sound source separation are therefore important in robot applications.

Sound source localization has been studied extensively. Most studies, however, use only simulated or laboratory data, and few evaluate data from the real environments in which robots operate. Few studies evaluate three-dimensional sound source localization either. Speaking and listening while looking at the other party's face is an important behavior for improving dialogue interaction between humans and robots, and this makes three-dimensional sound source localization important as well.

Patent Document 1 describes a prior-art technique intended for real environments. It uses a well-known, high-resolution sound source localization technique called the MUSIC method.

The invention described in Patent Document 1 uses a microphone array and calculates the current correlation matrix from a received signal vector, obtained by Fourier-transforming the signals from the microphone array, and a past correlation matrix. The resulting correlation matrix is eigendecomposed to obtain the maximum eigenvalue and the noise space, which consists of the eigenvectors corresponding to the eigenvalues other than the maximum one. The direction of the sound source is then estimated by the MUSIC method from the phase differences of the microphone outputs relative to one reference microphone in the array, the noise space, and the maximum eigenvalue.

Patent Document 1: JP 2008-175733 A

The MUSIC method offers high resolution, but it requires the number of sound sources to be given in advance. The technique described in Patent Document 1 assumes a single sound source, so this problem does not arise there. Environments in which robots actually operate are rarely like that, however: multiple sound sources are always present, and their number is not constant. With the MUSIC method, a wrong estimate of the number of sound sources leads to wrong localization, which makes it difficult for the robot to interact properly with humans. In particular, overestimating the number of sound sources not only hinders good interaction but also raises the computational cost.

Furthermore, the technique described in Patent Document 1 localizes sound sources in two dimensions only, whereas the environments in which robots actually operate are three-dimensional. In a shopping arcade, for example, loudspeakers are often mounted relatively high up and play audio continuously. The position of such a loudspeaker is fixed, but its volume may change. In such environments it is preferable to localize sound sources in three dimensions, which the technique of Patent Document 1 cannot do.

For robots that deal with people in particular, the height of the speaker varies: adults often speak from a position higher than the robot, and children from a position lower than the robot. This is another reason why three-dimensional sound source localization is desirable. When a robot talks with a person, it needs to turn its face toward the person's face, and such dialogue is difficult without three-dimensional sound source localization.

Furthermore, because people move frequently, sound sources must also be tracked stably in real time.

An object of the present invention is therefore to provide a sound source localization apparatus that can perform sound source localization stably using the MUSIC method. Here, sound source localization means continuously identifying the direction of a sound source.

Another object of the present invention is to provide a sound source localization apparatus that can track sound sources stably using the MUSIC method.

Still another object of the present invention is to provide a sound source localization apparatus that tracks sound sources stably using the MUSIC method and can predict the appearance and disappearance of sound sources accurately and in real time.

According to a first aspect of the present invention, a sound source localization apparatus includes: MUSIC response calculating means for calculating, by the MUSIC algorithm and at predetermined time intervals, the MUSIC power for each of a plurality of directions defined in a three-dimensional space centered on a point determined in relation to the position of a microphone array, based on each of the multi-channel sound source signals obtained from the output of the microphone array and on the positional relationship among the microphones included in the array; and sound source estimating means for detecting, for each of the plurality of directions, the change in the direction of a sound source from its appearance to its disappearance, based on the magnitude of the MUSIC power values obtained as a time series by the MUSIC response calculating means and on the amount of change in those values.

Preferably, the sound source estimating means includes: MUSIC Δ power calculating means for calculating, for each of the plurality of directions, the differences between the latest MUSIC power value obtained for the direction being processed and the MUSIC power values calculated immediately before by the MUSIC response calculating means for each of the directions adjacent to it, and for outputting, as the MUSIC Δ power of that direction, the one of those differences that satisfies a predetermined condition; onset detecting means for detecting, for each of the plurality of directions, whether a sound source has appeared, based on the MUSIC power value calculated for that direction by the MUSIC response calculating means and the MUSIC Δ power output for that direction by the MUSIC Δ power calculating means, and for outputting, in response to detecting the appearance of a sound source, information indicating the appearance and information indicating its direction; and tracking means for tracking, in response to the onset detecting means detecting the appearance of a sound source, the direction of that sound source until it disappears, based on the MUSIC power values calculated by the MUSIC response calculating means for each of the directions adjacent to the direction of the sound source and on the MUSIC Δ power output by the MUSIC Δ power calculating means for each of those adjacent directions.

More preferably, the sound source localization apparatus further includes smoothing means for smoothing the MUSIC power output by the MUSIC response calculating means by calculating its moving average for each of the plurality of directions. The MUSIC Δ power calculating means and the tracking means both receive as input the MUSIC power smoothed by the smoothing means.
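The smoothing described above can be sketched as a per-direction moving average over the most recent blocks. This is a minimal illustration only: the class name `MovingAverageSmoother` and the window length are assumptions, since the text does not state how many blocks are averaged.

```python
from collections import deque


class MovingAverageSmoother:
    """Per-direction moving average over the most recent `window` blocks."""

    def __init__(self, window):
        # `window` is a hypothetical parameter; the text gives no value.
        if window < 1:
            raise ValueError("window must be >= 1")
        self.history = deque(maxlen=window)  # FIFO of recent blocks

    def update(self, powers):
        """powers: list of MUSIC power values, one per direction."""
        self.history.append(list(powers))
        n = len(self.history)
        return [sum(block[i] for block in self.history) / n
                for i in range(len(powers))]
```

Until the window has filled, the average is taken over the blocks received so far, which is one of several reasonable start-up choices.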

Still more preferably, the onset detecting means includes: first determining means for determining, for each of the plurality of directions, whether the MUSIC Δ power value calculated by the MUSIC Δ power calculating means is greater than a first onset threshold and the MUSIC power value calculated by the MUSIC response calculating means is greater than a second onset threshold; and second determining means for detecting the appearance of a sound source based on the result of the determination by the first determining means.

The second determining means may include sound source limiting means for identifying as detected sound sources only a limited, fixed number of the directions for which the determination by the first determining means was affirmative, taken in descending order of the MUSIC power calculated for those directions.
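The two-threshold test and the limit on the number of reported sources can be sketched as follows. The function name `detect_onsets`, the dictionary layout, and all threshold values are assumptions made for illustration; the text specifies only the comparisons and the selection by MUSIC power.

```python
def detect_onsets(power, delta_power, th_delta, th_power, max_sources):
    """Return direction ids judged to be new sources, strongest first.

    power, delta_power: dicts mapping direction id -> value for the
        current block.
    th_delta, th_power: first and second onset thresholds (hypothetical
        values chosen by the caller).
    max_sources: upper limit on the number of onsets reported per block.
    """
    # First determination: both values must exceed their thresholds.
    candidates = [d for d in power
                  if delta_power[d] > th_delta and power[d] > th_power]
    # Second determination: keep only the strongest few directions.
    candidates.sort(key=lambda d: power[d], reverse=True)
    return candidates[:max_sources]
```

Requiring a large MUSIC Δ power in addition to a large MUSIC power means that a direction that is merely loud, without having become louder than its neighbourhood was in the preceding block, is not reported as a new source.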

Preferably, the tracking means includes: means for tracking the movement of a sound source whose appearance has been detected, in response to the onset detecting means detecting that appearance, by following, in the MUSIC responses output thereafter by the MUSIC response calculating means at the predetermined time intervals, the maximum of the MUSIC power values calculated for the directions adjacent to the direction of the sound source; and sound source disappearance detecting means for detecting the disappearance of the sound source and stopping the tracking by the means for tracking when the MUSIC power and the MUSIC Δ power calculated for the direction of the tracked sound source satisfy a predetermined condition.
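One tracking step, following the maximum MUSIC power among adjacent directions, might look like the sketch below. The name `track_step` and the data layout are hypothetical, and keeping the currently tracked direction among the candidates is an assumption; the text mentions only the adjacent directions.

```python
def track_step(current, power, neighbors):
    """Move the tracked direction to the strongest point nearby.

    current: id of the direction tracked in the previous block.
    power: dict mapping direction id -> MUSIC power for the new block.
    neighbors: dict mapping direction id -> list of adjacent direction ids.
    The tracked point itself stays a candidate (an assumption), so a
    stationary source is not forced to jump.
    """
    candidates = [current] + list(neighbors[current])
    return max(candidates, key=lambda d: power[d])
```

Calling this once per block yields the time series of directions for one source until its disappearance is detected.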

More preferably, the sound source disappearance detecting means includes: first offset determining means for determining whether the MUSIC Δ power calculated for the direction of the sound source tracked by the means for tracking is smaller than a first offset threshold; second offset determining means for determining whether the MUSIC power calculated for that direction is greater than a second offset constant, which is larger by a certain positive constant than the MUSIC power at the time the sound source appeared; and means for determining that the tracked sound source has disappeared and stopping the tracking when the result that either of the determinations by the first and second offset determining means is affirmative has been obtained a predetermined number of consecutive times (preferably more than once) for the MUSIC responses calculated at the predetermined time intervals by the MUSIC response calculating means.
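The requirement that the offset condition hold for a number of consecutive blocks can be sketched with a small counter. Only the consecutive-count logic is shown; the two per-block determinations are passed in as booleans, and the class name `OffsetCounter` and the parameter `required` are hypothetical.

```python
class OffsetCounter:
    """Declare an offset after `required` consecutive affirmative blocks."""

    def __init__(self, required):
        self.required = required  # hypothetical count; the text gives none
        self.count = 0

    def update(self, test1, test2):
        """test1, test2: results of the two per-block offset determinations.

        Returns True once either test has been affirmative for `required`
        blocks in a row.
        """
        if test1 or test2:
            self.count += 1
        else:
            self.count = 0  # the run is broken, so start again
        return self.count >= self.required
```

Requiring several consecutive blocks keeps a brief dip in a single 100 ms block, such as a short pause within an utterance, from ending the track.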

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to function as each of the means of any of the sound source localization apparatuses described above.

FIG. 1 shows a robot 30 having a sound source localization processing unit 50 according to one embodiment of the present invention, with a microphone base 32 attached.
FIG. 2 is a three-view drawing of the microphone base 32.
FIG. 3 is a block diagram of the sound source localization processing unit 50.
FIG. 4 is a flowchart showing the control structure of a computer program that implements the functions of the sound source estimation unit 64.
FIG. 5 is a flowchart showing the control structure of a computer program that implements the onset detection processing.
FIG. 6 is a flowchart showing the control structure of a computer program that implements the tracking processing.
FIG. 7 is a flowchart showing the control structure of a computer program that implements the offset detection processing.
FIG. 8 is a graph showing results of an experiment conducted to measure the performance of the sound source localization processing unit 50 according to the embodiment.
FIG. 9 is a graph showing results of an experiment conducted to measure the performance of the sound source localization processing unit 50 according to the embodiment.
FIG. 10 is a graph showing results of an experiment conducted to measure the performance of the sound source localization processing unit 50 according to the embodiment.
FIG. 11 is a block diagram of a computer that implements the sound source localization processing unit 50 according to the embodiment of the present invention.

In the following description of the embodiment of the present invention, identical components are given identical reference numerals. Their functions are also identical, so detailed descriptions of them are not repeated.

[Overview]
In this embodiment, a microphone array is placed near the head of the robot, and multiple sound sources are localized and tracked in real time from the signals obtained from the array. To this end, the sound source localization apparatus of the embodiment described below calculates the MUSIC spatial spectrum with the MUSIC algorithm using a fixed number of sound sources, and uses the resulting MUSIC spatial spectrum directly to estimate the number of sound sources and their positions dynamically.

[Configuration]
FIG. 1 shows the microphone array fitted to the chest of the robot 30. Specifically, a microphone base 32 for fitting the microphones around the neck of the robot 30 was made, a plurality of microphones (MC1 and others) were fixed to the microphone base 32, and the microphone base 32 was then fixed around the neck of the robot 30.

FIG. 2 shows a front view, a plan view, and a right side view of the microphone base 32. Referring to FIG. 2, a total of 14 microphones (MC1 and others) are used. Nine of them are mounted on the front of the microphone base 32, and the remaining five are mounted on its upper surface so as to surround the neck of the robot 30. Of the 14 microphones, the output of the central microphone MC1 is handled separately from the others in later processing. In this embodiment, all the microphones are omnidirectional.

FIG. 3 is a block diagram showing only the sound source localization processing unit 50, the part of the robot shown in FIG. 1 that is concerned with sound source localization. Referring to FIG. 3, the sound source localization processing unit 50 includes: an A/D converter 54 that receives 14 analog sound source signals from a microphone array 52 including the microphone MC1 and the others, performs analog-to-digital conversion, and outputs 14 digital sound source signals; an eigenvector calculation unit 60 that receives the 14 digital sound source signals output from the A/D converter 54 and outputs, block by block with 100 milliseconds as one block, the correlation matrix required by the MUSIC method together with its eigenvalues and eigenvectors; a MUSIC processing unit 62 that uses the eigenvectors output block by block from the eigenvector calculation unit 60 to output the MUSIC spatial spectrum by the MUSIC method; a sound source estimation unit 64 that dynamically estimates the number of sound sources and their positions based on the MUSIC spatial spectrum output by the MUSIC processing unit 62 and outputs, as a time series, values representing each position (direction) (in this embodiment, the two angles φ and θ of three-dimensional polar coordinates; see "MUSIC Response" in the Appendix); and a buffer 66 for accumulating the time series output by the sound source estimation unit 64. In this specification, the "MUSIC response" is the MUSIC spatial spectrum obtained by the MUSIC algorithm, averaged according to a predetermined formula. See "MUSIC Response" in the Appendix for details.

In this embodiment, the A/D converter 54 converts the output of each microphone at the commonly used 16 kHz / 16 bits.

The eigenvector calculation unit 60 includes: a framing processing unit 80 that divides the 14 digital sound source signals output from the A/D converter 54 into frames with a frame length of 4 milliseconds; an FFT processing unit 82 that applies an FFT (Fast Fourier Transform) to each of the 14 channels of framed sound source signals output from the framing processing unit 80 and converts them into a predetermined number of frequency regions (hereinafter each frequency region is called a "bin" and the number of frequency regions the "number of bins"); a blocking processing unit 84 that groups the bin values of each channel, output from the FFT processing unit 82 every 4 milliseconds, into blocks of 100 milliseconds; a correlation matrix calculation unit 86 that calculates and outputs, at predetermined intervals (every 100 milliseconds), a correlation matrix whose elements are the correlations between the bin values output from the blocking processing unit 84; and an eigenvalue decomposition unit 88 that eigendecomposes the correlation matrix output from the correlation matrix calculation unit 86 and outputs eigenvectors 92 to the MUSIC processing unit 62. In this embodiment, the band at or below 1 kHz, where spatial resolution is low, and the band at or above 6 kHz, where spatial aliasing can occur, are excluded from the frequency components of the sound source signals.
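As a rough illustration of the correlation matrix calculation and the band restriction described above, the sketch below averages the outer product of the channel vector over the frames of one block for a single bin, and lists the bins that fall between 1 kHz and 6 kHz. The function names are hypothetical, and treating the correlation matrix as the usual per-bin spatial correlation matrix across channels is an assumption; the text does not give the formula.

```python
def correlation_matrix(frames):
    """Average outer product x x^H over the frames of one block, one bin.

    frames: list of frames; each frame is a list of complex values, one
        per channel, for the same frequency bin.
    Returns an M x M matrix as nested lists, M being the channel count.
    """
    m = len(frames[0])
    r = [[0j] * m for _ in range(m)]
    for x in frames:
        for i in range(m):
            for j in range(m):
                r[i][j] += x[i] * x[j].conjugate()
    return [[v / len(frames) for v in row] for row in r]


def bins_in_band(n_fft, fs, f_low, f_high):
    """FFT bin indices whose frequency lies strictly inside (f_low, f_high)."""
    return [k for k in range(n_fft // 2 + 1)
            if f_low < k * fs / n_fft < f_high]
```

With a 64-point FFT at 16 kHz (an assumed size, matching a 4 ms frame), `bins_in_band(64, 16000, 1000, 6000)` keeps bins 5 through 23, so a block would need 19 small 14 x 14 eigendecompositions.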

An FFT normally uses 512 to 1024 points (corresponding to 32 to 64 milliseconds at a sampling rate of 16 kHz), but here one frame is 4 milliseconds (corresponding to 64 to 128 FFT points). Shortening the frame in this way reduces not only the computation required for the FFT but also the computation required later for the correlation matrix, the eigenvalue decomposition, and the MUSIC response. As a result, sound source localization can be performed comfortably in real time, without loss of performance, even on a relatively low-powered computer.
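The frame and block sizes quoted above follow from simple arithmetic on the 16 kHz sampling rate, which the short sketch below reproduces. The helper names are hypothetical, and the frames-per-block figure assumes non-overlapping frames, which the text does not state.

```python
FS_HZ = 16_000  # sampling rate stated in the text


def samples_per_frame(frame_ms, fs_hz=FS_HZ):
    """Number of samples in one frame of the given length."""
    return fs_hz * frame_ms // 1000


def frames_per_block(block_ms, frame_ms):
    """Frames in one block, assuming frames do not overlap."""
    return block_ms // frame_ms


def duration_ms(n_samples, fs_hz=FS_HZ):
    """Duration in milliseconds of n_samples at the given rate."""
    return 1000 * n_samples / fs_hz
```

A 4 ms frame holds 64 samples, which matches the lower end of the 64 to 128 points mentioned, and a 100 ms block holds 25 such frames; 512 and 1024 samples last 32 ms and 64 ms respectively.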

The MUSIC processing unit 62 includes: a position vector storage unit 100 for storing position vectors that represent, in a predetermined coordinate system, the position of each microphone included in the microphone array 52; and a MUSIC spatial spectrum calculation unit 104 that uses the microphone position vectors stored in the position vector storage unit 100 and the eigenvectors output from the eigenvalue decomposition unit 88 to calculate and output the MUSIC spatial spectrum by the MUSIC method, on the assumption that the number of sound sources is fixed. That the eigenvalues of the correlation matrix obtained for each block are related to the number of sound sources is already known; it is described, for example, in F. Asano, M. Goto, K. Itou, and H. Asoh, "Real-time sound source localization and separation system and its application on automatic speech recognition," in Eurospeech 2001, Aalborg, Denmark, 2001, pp. 1013-1016.
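For orientation, the textbook narrowband MUSIC pseudo-spectrum for one direction and one frequency bin can be written as below. This is the standard form, not necessarily the exact formula of the Appendix, which is not reproduced in this section; the function name and argument layout are hypothetical.

```python
def music_pseudospectrum(steering, noise_vectors):
    """Narrowband MUSIC value for one direction and one frequency bin.

    steering: steering vector a (list of complex, one per microphone).
    noise_vectors: eigenvectors spanning the noise subspace, i.e. all
        eigenvectors except those of the N largest eigenvalues, where N
        is the fixed source count.
    Returns (a^H a) / sum_n |a^H e_n|^2.
    """
    num = sum(abs(a) ** 2 for a in steering)
    den = 0.0
    for e in noise_vectors:
        proj = sum(a.conjugate() * v for a, v in zip(steering, e))
        den += abs(proj) ** 2
    # A steering vector orthogonal to the noise subspace gives a pole.
    return num / den if den > 0 else float("inf")
```

With 14 microphones and a fixed source count N, the noise subspace has 14 - N eigenvectors; directions whose steering vector is nearly orthogonal to all of them produce the sharp peaks that give MUSIC its resolution.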

In this embodiment, not only the two-dimensional azimuth of each sound source but also its elevation is estimated. For this purpose, a three-dimensional version of the MUSIC algorithm (see the Appendix) was implemented. The pair of azimuth and elevation is hereinafter called the direction of arrival (DOA). The algorithm does not estimate the distance to the sound source. Estimating only the direction greatly reduces processing time.
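Because only direction is estimated, a far-field (plane-wave) model is a natural fit: the steering vector depends on azimuth and elevation but not on distance. The sketch below is one conventional way to build it. The coordinate convention (azimuth in the x-y plane, elevation measured from that plane), the sign of the phase, and the speed of sound are all assumptions, since the Appendix that defines the actual three-dimensional formulation is not part of this section.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def unit_vector(azimuth_deg, elevation_deg):
    """Unit vector pointing from the array toward the given direction."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))


def steering_vector(mic_positions, azimuth_deg, elevation_deg, freq_hz):
    """Far-field steering vector for one frequency.

    mic_positions: list of (x, y, z) microphone coordinates in metres.
    """
    u = unit_vector(azimuth_deg, elevation_deg)
    vec = []
    for p in mic_positions:
        # A microphone displaced toward the source hears the wave earlier.
        advance = sum(pi * ui for pi, ui in zip(p, u)) / SPEED_OF_SOUND
        vec.append(cmath.exp(2j * math.pi * freq_hz * advance))
    return vec
```

Every entry has unit magnitude, so only the inter-microphone phase differences carry the directional information.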

The MUSIC processing unit 62 further includes a MUSIC response calculation unit 106 that, based on the MUSIC spatial spectrum calculated by the MUSIC spatial spectrum calculation unit 104, calculates and outputs a value called the MUSIC response for each direction (described below) in accordance with the MUSIC method.

A "direction" here means one cell of a mesh defined in three-dimensional space for searching for sound source positions. In the embodiment below, the mesh divides space into rings, each spanning 5 degrees of elevation, with a number of search points that depends on the elevation. A "search point" is the center point of a mesh cell.

The number of search points is chosen so that, in the ring at 0 degrees elevation, adjacent search points are 5 degrees apart. The number of search points is largest in the ring at 0 degrees elevation and decreases as the elevation increases. Within any one ring the search points are equally spaced, and that spacing (which may be thought of as an angle) is chosen to be as close as possible to the spacing between adjacent search points in the ring at 0 degrees elevation.
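One way to generate a grid with these properties is to scale the number of points in each ring by the cosine of its elevation, since the circumference of a ring shrinks by that factor. The sketch below does this. The cosine rule, the elevation range of 0 to 90 degrees, and the function name are assumptions; the text states only the 5-degree spacing at 0 degrees and the goal of near-uniform spacing.

```python
import math


def build_search_grid(step_deg=5, max_elevation_deg=90):
    """Return a list of (azimuth_deg, elevation_deg) search points.

    Each ring sits at a multiple of step_deg in elevation. The ring at
    0 degrees has 360 / step_deg points; higher rings get fewer points
    so that the spacing along the ring stays close to the spacing at
    0 degrees.
    """
    points = []
    base = 360 // step_deg
    for elevation in range(0, max_elevation_deg + 1, step_deg):
        # Ring circumference scales with cos(elevation).
        count = max(1, round(base * math.cos(math.radians(elevation))))
        for i in range(count):
            points.append((i * 360.0 / count, float(elevation)))
    return points
```

Under these assumptions the ring at 0 degrees has 72 points, the ring at 60 degrees has 36, and the pole has a single point.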

The sound source estimation unit 64 includes: a buffer 108 for temporarily accumulating, in FIFO fashion and in time-series order, a predetermined number of the MUSIC response peaks calculated by the MUSIC response calculation unit 106; a smoothing filter unit 110 that removes noise from the MUSIC response of each search point of each block accumulated in the buffer 108 by calculating a moving average; and a MUSIC Δ spectrogram calculation unit 112 that, based on the MUSIC response values of each search point of each block output by the smoothing filter unit 110, calculates the difference (MUSIC Δ power) between the MUSIC response at each search point and the MUSIC response of the preceding block. Specifically, the MUSIC Δ spectrogram calculation unit 112 calculates the MUSIC Δ power as follows. For each search point, it takes the differences between the MUSIC power of the current block at that point and the MUSIC power of the preceding block at every point adjacent to that search point. The minimum of these differences is the MUSIC Δ power at that search point. Here, the time series of the MUSIC power calculated for each direction is called the MUSIC spectrogram, and the time series of the values calculated as above for each direction is called the MUSIC Δ spectrogram.
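The MUSIC Δ power rule described above (the minimum difference against the preceding block's values at adjacent points) can be sketched as follows. The function name and dictionary layout are hypothetical, and whether a point counts as adjacent to itself is left to the caller's `neighbors` table, since the text does not say.

```python
def music_delta_power(current, previous, neighbors):
    """MUSIC delta power for every search point.

    current, previous: dicts mapping search point id -> smoothed MUSIC
        power for the current and the preceding block.
    neighbors: dict mapping search point id -> ids of the points it is
        compared against in the preceding block.
    """
    delta = {}
    for point, value in current.items():
        diffs = [value - previous[n] for n in neighbors[point]]
        delta[point] = min(diffs)
    return delta
```

One reading of why the minimum is used: if a source simply moves from a neighbouring point to this one, the neighbour's power in the preceding block was already high, so the minimum difference stays small and the movement is not mistaken for a new source.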

The sound source estimation unit 64 further includes: an onset/offset detection unit 114 that uses the smoothed MUSIC response of each block output from the smoothing filter unit 110 to detect the direction in which a sound source has started emitting sound (the start of sound emission by a source is called an onset) and the direction and time at which sound from a source has stopped (the stop of sound emission by a source is called an offset), and that outputs the directions of the onsets and offsets; a tracking unit 118 that tracks each sound source from its onset to its offset in response to the MUSIC Δ power output by the MUSIC Δ spectrogram calculation unit 112 and the onset and offset detection outputs from the onset/offset detection unit 114; and a search point storage unit 116 for storing information used by the MUSIC Δ spectrogram calculation unit 112, the onset/offset detection unit 114, and the tracking unit 118 in their respective processing, namely the position of each search point and the search points adjacent to each search point that are to be searched.

なお、各音源からの音声信号のパワーを算出するに先立って、音声信号に対してチャンネル間スペクトルバイナリマスキング処理を行なう。これは、２つのチャンネル間において、パワーの大きな方の信号を残し、他方のチャンネルの信号はゼロにする、という処理である。こうすることにより、チャンネル間の干渉リークを削減することができる。また、マイクロホンアレイ５２のうち、中央に位置するマイクロホンＭＣ１からの音声信号を用い、全ての音声信号から周囲の音楽による雑音を除去する処理を行なう。 Prior to calculating the power of the audio signal from each sound source, inter-channel spectral binary masking processing is performed on the audio signal. This is a process of leaving the signal with the larger power between the two channels and setting the signal of the other channel to zero. By doing so, interference leakage between channels can be reduced. Further, using the audio signal from the microphone MC1 located in the center of the microphone array 52, a process of removing noise caused by surrounding music from all the audio signals is performed.
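
The inter-channel spectral binary masking described above can be sketched as follows for two channels. This is only an illustrative sketch: the function name, the representation of each channel as a list of complex spectral values (one per time-frequency bin), and the rule that a tie keeps the first channel are assumptions not stated in the description.

```python
def binary_mask(spec_a, spec_b):
    """Keep, bin by bin, only the channel with the larger power.

    spec_a, spec_b: lists of complex spectral values for the same
    time-frequency bins of two channels.
    Returns the two masked spectra. Ties keep channel A (an assumption).
    """
    out_a, out_b = [], []
    for a, b in zip(spec_a, spec_b):
        if abs(a) ** 2 >= abs(b) ** 2:
            out_a.append(a)
            out_b.append(0j)   # weaker channel is set to zero
        else:
            out_a.append(0j)
            out_b.append(b)
    return out_a, out_b


masked_a, masked_b = binary_mask([3 + 0j, 1j, 0.5 + 0j],
                                 [1 + 0j, 2 + 0j, 0.5 + 0j])
```

Because each bin survives in at most one channel, energy that leaks from one channel into the other is suppressed before the power of each source is calculated.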

［コンピュータによる実現］ [Realization by computer]

上記した音源定位処理部５０は、実際にはコンピュータハードウェアと、当該コンピュータハードウェアにより実行されるコンピュータプログラムとにより、ハードウェアとソフトウェアとの協働により実現される。以下、音源定位処理部５０の中でも、本実施の形態の特徴となる音源推定部６４の機能を実現するためのコンピュータプログラムの制御構造について説明する。 The above-described sound source localization processing unit 50 is actually realized by the cooperation of hardware and software by computer hardware and a computer program executed by the computer hardware. Hereinafter, a control structure of a computer program for realizing the function of the sound source estimation unit 64 that is a feature of the present embodiment in the sound source localization processing unit 50 will be described.

図４を参照して、音源推定部６４の機能を実現するためのプログラムは、ロボットの電源投入後、初期化を行なうステップ１３０と、ＭＵＳＩＣ応答算出部１０６からバッファ１０８を介して処理対象となるブロックのＭＵＳＩＣ応答を受信するステップ１３２と、処理対象のブロックのＭＵＳＩＣ応答に対する移動平均をとることにより、ＭＵＳＩＣ応答からノイズを除去するための平滑化処理を行なうステップ１３４と、全ての探索点について、ＭＵＳＩＣΔパワー（ＭＵＳＩＣΔスペクトログラム）を算出するステップ１３６と、ステップ１３６で各方位について算出されたＭＵＳＩＣΔパワー、及びＭＵＳＩＣパワーに基づいて、音源の発生（オンセット）を検出するステップ１３８と、ステップ１３８で検出された音源をトラッキングし、処理をステップ１３２に戻すステップ１４０とを含む。なお、全ての探索点の識別子とその方位、並びにその探索点に隣接する探索点の識別子は、図３に示す探索点記憶部１１６に記憶されている。 Referring to FIG. 4, the program for realizing the functions of the sound source estimation unit 64 includes: step 130 of performing initialization after the robot is powered on; step 132 of receiving, from the MUSIC response calculation unit 106 via the buffer 108, the MUSIC response of the block to be processed; step 134 of performing smoothing to remove noise from the MUSIC response by taking a moving average of the MUSIC response of the block to be processed; step 136 of calculating the MUSIC Δ power (MUSIC Δ spectrogram) for all search points; step 138 of detecting the start of sound generation by a sound source (onset) based on the MUSIC Δ power calculated for each direction in step 136 and on the MUSIC power; and step 140 of tracking the sound sources detected in step 138 and returning the processing to step 132. The identifiers of all search points, their directions, and the identifiers of the search points adjacent to each search point are stored in the search point storage unit 116 shown in FIG. 3.

既に述べたとおり、ステップ１３６におけるＭＵＳＩＣΔスペクトルの算出にあたっては、各探索点について、現ブロックのＭＵＳＩＣパワーと、その探索点に隣接する探索点の各々の前ブロックのＭＵＳＩＣパワーとの差を算出し、その中の最小値を採用することで、各探索点について、ブロックごとにＭＵＳＩＣΔパワーが算出される。 As already described, in calculating the MUSIC Δ spectrum in step 136, for each search point, the difference between the MUSIC power of the current block and the MUSIC power of each previous block of the search points adjacent to the search point is calculated, By adopting the minimum value among them, the MUSIC Δ power is calculated for each search point for each block.

図５を参照して、図４のステップ１３８で実行されるオンセット検出処理は、トラッキングの対象となっている探索点以外の各探索点について、以下に述べるステップ１６２，１６４，１６６及び１６８を実行するステップ１６０を含む。 Referring to FIG. 5, the onset detection process executed in step 138 of FIG. 4 includes step 160, which executes steps 162, 164, 166 and 168 described below for each search point other than the search points currently being tracked.

ステップ１６０の処理は、その探索点について図４のステップ１３６で算出されたＭＵＳＩＣΔパワーの値が１．０ｄＢより大きいか否かを判定し、ＭＵＳＩＣΔパワーの値が１．０ｄＢ以下の場合にはその探索点に関する処理を終了するステップ１６２と、ステップ１６２でその探索点のＭＵＳＩＣΔパワーの値が１．０ｄＢより大きいと判定されたときに実行され、その探索点のＭＵＳＩＣパワーの値が１．８ｄＢより大きいか否かを判定し、ＭＵＳＩＣパワーの値が１．８ｄＢ以下の場合にはこの探索点に関する処理を終了するステップ１６４と、ステップ１６４でＭＵＳＩＣパワーの値が１．８ｄＢより大きいと判定されたときに実行され、この探索点をオンセット候補として一旦その探索点の識別子を記憶するステップ１６６と、ステップ１６６に続き、この探索点のＭＵＳＩＣパワーをこの音源のトラッキングのオンセットＭＵＳＩＣパワーとして記憶してこの探索点に対する処理を終了するステップ１６８とを含む。なお、ステップ１６６では、予め準備したオンセット候補数を示す変数に１が加算される。 The processing of step 160 includes: step 162 of determining whether the MUSIC Δ power calculated for the search point in step 136 of FIG. 4 is greater than 1.0 dB and, if it is 1.0 dB or less, ending the processing for that search point; step 164, executed when step 162 determines that the MUSIC Δ power of the search point is greater than 1.0 dB, of determining whether the MUSIC power of the search point is greater than 1.8 dB and, if it is 1.8 dB or less, ending the processing for this search point; step 166, executed when step 164 determines that the MUSIC power is greater than 1.8 dB, of temporarily storing the identifier of this search point as an onset candidate; and step 168, following step 166, of storing the MUSIC power of this search point as the onset MUSIC power for tracking this sound source and ending the processing for this search point. In step 166, 1 is added to a variable, prepared in advance, that indicates the number of onset candidates.

オンセット検出処理はさらに、ステップ１６０において全ての探索点についてオンセット候補か否かが判定された後に実行され、オンセット候補数が０か否かを判定し、オンセット候補数が０のときには現ブロックに対するオンセット検出処理を終了するステップ１７０と、ステップ１７０でオンセット候補数が０ではないと判定されたことに応答して実行され、オンセット候補のうち、ＭＵＳＩＣパワーの値が最大のものからＭＵＳＩＣパワーの値の順番で２個までをこのブロックにおけるオンセットとして選択するステップ１７２と、ステップ１７２において選択された最大２個のオンセットの各々に対し、新たにトラッキング用のリストを準備し、各々の先頭要素にオンセットＭＵＳＩＣパワーなど、オンセットとなった探索点に関する情報を格納してオンセット検出処理を終了するステップ１７４とを含む。ここで新たに作成されたトラッキングリストの全てが、以後のトラッキングの対象となる。各トラッキングリストの先頭要素には、トラッキングの終了フラグを格納する領域が設けられ、その値が０に設定される。トラッキングの終了フラグとは、そのトラッキングリストに対応する音源のトラッキングが終了したか否かを示すフラグである。トラッキングの終了フラグは、その値が０であればそのリストに対するトラッキングが実行中であることを示し、その値が９であればそのリストに対するトラッキングが終了した（音源からの音の発生が終了した）ことを示す。 The onset detection process further includes: step 170, executed after step 160 has determined for all search points whether each is an onset candidate, of determining whether the number of onset candidates is 0 and, if it is 0, ending the onset detection process for the current block; step 172, executed in response to step 170 determining that the number of onset candidates is not 0, of selecting as the onsets of this block up to two onset candidates in descending order of MUSIC power, starting from the largest; and step 174 of preparing a new tracking list for each of the at most two onsets selected in step 172, storing in the head element of each list information on the search point that became an onset, such as the onset MUSIC power, and ending the onset detection process. All of the tracking lists newly created here become targets of subsequent tracking. The head element of each tracking list has an area for storing a tracking end flag, whose value is set to 0. The tracking end flag indicates whether tracking of the sound source corresponding to that tracking list has ended. A value of 0 indicates that tracking for the list is in progress, and a value of 9 indicates that tracking for the list has ended (sound generation from the sound source has ended).
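
The onset detection of steps 160 to 174 can be sketched in Python as follows. The thresholds (1.0 dB, 1.8 dB) and the limit of two onsets per block come from the description; the function name, the dictionary layout of a tracking list, and the field names are hypothetical and chosen only for illustration.

```python
DELTA_THRESHOLD_DB = 1.0      # step 162: MUSIC delta power must exceed this
POWER_THRESHOLD_DB = 1.8      # step 164: MUSIC power must exceed this
MAX_ONSETS_PER_BLOCK = 2      # step 172: at most two onsets per block


def detect_onsets(power, delta, tracked):
    """Return new tracking lists for the onsets found in one block.

    power, delta: dict search-point id -> MUSIC power / MUSIC delta power (dB).
    tracked: set of search-point ids that are already being tracked.
    """
    # Step 160: examine every search point that is not being tracked.
    candidates = [
        p for p in power
        if p not in tracked
        and delta[p] > DELTA_THRESHOLD_DB
        and power[p] > POWER_THRESHOLD_DB
    ]
    # Step 170: no candidates means nothing to do for this block.
    if not candidates:
        return []
    # Step 172: keep the candidates with the largest MUSIC power.
    candidates.sort(key=lambda p: power[p], reverse=True)
    # Step 174: one new tracking list per onset, end flag initialised to 0.
    return [
        {"points": [p], "onset_power": power[p], "end_flag": 0}
        for p in candidates[:MAX_ONSETS_PER_BLOCK]
    ]


power = {0: 3.0, 1: 2.5, 2: 2.0, 3: 1.0, 4: 4.0}
delta = {0: 1.5, 1: 1.2, 2: 1.1, 3: 2.0, 4: 0.5}
new_tracks = detect_onsets(power, delta, tracked=set())
```

Here point 3 fails the power test and point 4 fails the Δ power test, so points 0, 1 and 2 are candidates and the two with the highest power (0 and 1) become onsets.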

図６を参照して、図４のステップ１４０で実行されるトラッキング処理は、トラッキングリストの各々について、以下に説明するステップ２０２─２１４を実行するステップ２００を含む。 Referring to FIG. 6, the tracking process performed in step 140 of FIG. 4 includes a step 200 of executing steps 202-214 described below for each of the tracking lists.

ステップ２００において、各トラッキングリストについて実行される処理は、対象となるトラッキングリストについてのトラッキングの終了フラグが０か否かを判定し、終了フラグが０でないときにはこのトラッキングリストに対する処理を終了するステップ２０２と、ステップ２０２でこのトラッキングの終了フラグが０と判定されたときに実行され、トラッキングリストの末尾の探索点（前ブロックで検出された音源の方位を示す。）の周囲の探索点の全てについて、算出済のＭＵＳＩＣパワーを記憶装置から読出すステップ２０４と、読出されたＭＵＳＩＣパワーのうち最大のものに対応する探索点を、このトラッキングリストの末尾に追加するステップ２０６とを含む。 The processing executed for each tracking list in step 200 includes: step 202 of determining whether the tracking end flag of the tracking list in question is 0 and, if it is not 0, ending the processing for this tracking list; step 204, executed when step 202 determines that the tracking end flag is 0, of reading from the storage device the already calculated MUSIC power of all search points surrounding the search point at the end of the tracking list (which indicates the direction of the sound source detected in the previous block); and step 206 of appending to the end of this tracking list the search point corresponding to the largest of the MUSIC powers read.

ステップ２００の処理はさらに、ステップ２０６に続き、対象となっているトラッキングリストについてオフセット検出処理を実行するステップ２０８と、ステップ２０８に続き、オフセット検出処理によりオフセットフラグに設定された値が０か否かを判定して判定結果により制御の流れを分岐させるステップ２１０とを含む。オフセットフラグは、このトラッキングリストに対応する音源について、音の発生が停止したと判定されたときには９となり、それ以外のとき、すなわち引続き音源があると判定されたときには０に設定される。 The processing of step 200 further includes step 208, following step 206, of executing the offset detection process for the tracking list in question, and step 210, following step 208, of determining whether the value set in the offset flag by the offset detection process is 0 and branching the control flow according to the result. The offset flag is set to 9 when it is determined that the sound source corresponding to this tracking list has stopped generating sound, and to 0 otherwise, that is, when it is determined that the sound source is still present.

ステップ２００で実行される処理はさらに、ステップ２１０においてオフセットフラグの値が０でないと判定されたときに実行され、ステップ２０８のオフセット検出処理の結果、５ブロック続いてオフセットフラグが０以外の値に設定されたか否かを判定し、５ブロックに達していないときには何もせず処理対象のトラッキングリストに対する処理を終了するステップ２１２と、ステップ２１２において、５ブロック連続してこのトラッキングリストについてオフセットフラグが０でなかったと判定されたことに応答して実行され、このトラッキングリストの終了フラグをセット（終了フラグの値を９に設定）してこのトラッキングリストに対する処理を終了するステップ２１４とを含む。 The processing executed in step 200 is further executed when it is determined in step 210 that the value of the offset flag is not 0. As a result of the offset detection processing in step 208, the offset flag is set to a value other than 0 after 5 blocks. It is determined whether or not it has been set, and when it has not reached 5 blocks, nothing is done and step 212 ends the processing for the tracking list to be processed. In step 212, the offset flag is set to 0 for this tracking list for 5 consecutive blocks. And step 214 that is executed in response to the determination that the tracking list is not, sets the end flag of this tracking list (sets the value of the end flag to 9), and ends the processing for this tracking list.
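
One block of the tracking process (steps 202 to 214) for a single tracking list might be sketched as below. The flag values 0 and 9 and the limit of five consecutive blocks are taken from the description. The dictionary layout, the `offset_run` counter, the `offset_flag_fn` callback, and the choice to include the last tracked point itself among its surrounding points are assumptions made for this sketch.

```python
END_FLAG_DONE = 9           # tracking of this list has ended
OFFSET_BLOCKS_TO_END = 5    # consecutive offset blocks needed to end tracking


def track_step(track, power, neighbors, offset_flag_fn):
    """Advance one tracking list by one block.

    track: dict with "points" (list of search-point ids) and "end_flag".
    power: dict search-point id -> MUSIC power (dB) of the current block.
    neighbors: dict search-point id -> list of adjacent ids.
    offset_flag_fn: hypothetical callback returning 0 (source still active)
                    or 9 (offset detected) for this list and block.
    """
    if track["end_flag"] != 0:                        # step 202
        return track
    last = track["points"][-1]
    around = [last] + list(neighbors[last])           # step 204 (self included: assumption)
    best = max(around, key=lambda p: power[p])        # step 206
    track["points"].append(best)
    if offset_flag_fn(track, power) == 0:             # steps 208 and 210
        track["offset_run"] = 0
    else:
        track["offset_run"] = track.get("offset_run", 0) + 1
        if track["offset_run"] >= OFFSET_BLOCKS_TO_END:   # step 212
            track["end_flag"] = END_FLAG_DONE             # step 214
    return track


neighbors = {0: [1], 1: [0, 2], 2: [1]}
power = {0: 1.0, 1: 2.0, 2: 5.0}
track = {"points": [1], "onset_power": 3.0, "end_flag": 0}
for _ in range(4):
    track_step(track, power, neighbors, lambda t, p: 9)
flag_after_four = track["end_flag"]
track_step(track, power, neighbors, lambda t, p: 9)
flag_after_five = track["end_flag"]
```

With an offset reported in every block, the list stays active for four blocks and is closed on the fifth, which matches the behaviour of continuing tracking for a few blocks after an offset before giving up.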

図７を参照して、図６のステップ２０８で実行されるオフセット検出処理は、対象となるトラッキングリストに最後に追加された探索点（図６のステップ２０６）について、そのＭＵＳＩＣΔパワーが−１．２ｄＢより小か否かを判定し、判定結果により制御の流れを分岐させるステップ２３０と、ステップ２３０でＭＵＳＩＣΔパワーが−１．２ｄＢより小であると判定されたときに実行され、このトラッキングリストのオフセットフラグを９に設定してオフセット検出処理を終了させるステップ２３６とを含む。 Referring to FIG. 7, in the offset detection process executed in step 208 in FIG. 6, the MUSIC Δpower of the search point last added to the target tracking list (step 206 in FIG. 6) is −1. This step is executed when it is determined whether the MUSICΔ power is smaller than -1.2 dB in step 230 for determining whether or not it is smaller than 2 dB and branching the control flow according to the determination result. And step 236 for setting the offset flag to 9 and ending the offset detection process.

オフセット検出処理はさらに、ステップ２３０においてＭＵＳＩＣΔパワーの値が−１．２ｄＢ以上であると判定されたときに、オフセット検出のためのしきい値θ_Ｈ＝オンセットＭＵＳＩＣパワー＋α（α＞０）という式にしたがってオフセット検出のためのしきい値θ_Ｈを算出するステップ２３２と、トラッキングリストに最後に追加された探索点（図６のステップ２０６）の現ブロックのＭＵＳＩＣパワーが上記したしきい値θ_Ｈより小さいか否かを判定し、もしも判定結果が肯定であれば制御をステップ２３６に進めるステップ２３４と、ステップ２３４での判定が否定であるときに、このトラッキングリストのオフセットフラグを０に設定してオフセット検出処理を終了するステップ２３８とを含む。 The offset detection process further includes: step 232, executed when step 230 determines that the MUSIC Δ power is −1.2 dB or more, of calculating the threshold θ_H for offset detection according to the equation θ_H = onset MUSIC power + α (α > 0); step 234 of determining whether the MUSIC power in the current block of the search point last added to the tracking list (step 206 in FIG. 6) is smaller than the threshold θ_H and, if so, advancing control to step 236; and step 238, executed when the determination in step 234 is negative, of setting the offset flag of this tracking list to 0 and ending the offset detection process.

後述するように、オフセット検出において、トラッキング中の音源のＭＵＳＩＣパワーがオンセット時のパワーよりも小さくなったときではなく、オンセット時のパワー＋αよりも小さくなったときに強制的にオフセットとすることにより、音源のトラッキングの精度が高くなるという効果が得られる。 As will be described later, in offset detection an offset is forced not when the MUSIC power of the sound source being tracked falls below its power at onset, but when it falls below the power at onset + α. This has the effect of improving the accuracy of sound source tracking.
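
The offset detection of steps 230 to 238 reduces to two comparisons, sketched below. The −1.2 dB threshold and the rule θ_H = onset MUSIC power + α are from the description; the function name and the value α = 0.5 dB used in the example are hypothetical, since the description only requires α > 0.

```python
OFFSET_DELTA_DB = -1.2   # step 230: a sharp drop in MUSIC power means offset


def offset_flag(delta_db, power_db, onset_power_db, alpha_db):
    """Return 9 if an offset is detected for the tracked point, else 0.

    delta_db: MUSIC delta power of the point last added to the tracking list.
    power_db: MUSIC power of that point in the current block.
    onset_power_db: MUSIC power stored when the onset was detected.
    alpha_db: positive margin added to the onset power (value is an assumption).
    """
    if delta_db < OFFSET_DELTA_DB:           # step 230 -> step 236
        return 9
    theta_h = onset_power_db + alpha_db      # step 232
    if power_db < theta_h:                   # step 234 -> step 236
        return 9
    return 0                                 # step 238


ALPHA_DB = 0.5  # hypothetical value for illustration only
sharp_drop = offset_flag(-1.5, 5.0, 2.0, ALPHA_DB)
slightly_above_onset = offset_flag(0.0, 2.3, 2.0, ALPHA_DB)
still_active = offset_flag(0.0, 3.0, 2.0, ALPHA_DB)
```

The second call shows the effect of α: a MUSIC power of 2.3 dB is still above the onset power of 2.0 dB, but below θ_H = 2.5 dB, so an offset is forced.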

［動作］ [Operation]

上記実施の形態に係る音源定位処理部５０は以下のように動作する。マイクロホンアレイが図１及び図２に示すようにマイクロホン台３２を用いてロボット３０に装着されるものとする。 The sound source localization processing unit 50 according to the above embodiment operates as follows. It is assumed that the microphone array is attached to the robot 30 using a microphone base 32 as shown in FIGS. 1 and 2.

マイクロホンアレイ５２は音源からの音声を１４個のアナログ電気信号に変換し、Ａ／Ｄ変換器５４に与える。Ａ／Ｄ変換器５４は１６ｋＨｚでこれら信号を１６ビットのデジタル信号化し、１４個のデジタル信号をフレーム化処理部８０に与える。 The microphone array 52 converts the sound from the sound source into 14 analog electric signals and supplies the analog electric signals to the A / D converter 54. The A / D converter 54 converts these signals into 16-bit digital signals at 16 kHz, and provides 14 digital signals to the framing processor 80.

フレーム化処理部８０は、４ミリ秒のフレーム長でこれら各チャンネルのデジタル音源信号をフレーム化し、ＦＦＴ処理部８２に与える。ＦＦＴ処理部８２は、各チャンネルの各フレームのデジタル音源信号に対してＦＦＴを施し、各周波数成分の出力に変換してブロック化処理部８４に与える。この間、音声信号に対して前述のチャンネル間スペクトルバイナリマスキング処理、及び中央に位置するマイクロホンＭＣ１からの音声信号を用い、全ての音声信号から周囲の音楽による雑音を除去する処理を行なう。 The framing processing unit 80 frames the digital sound source signal of each of these channels with a frame length of 4 milliseconds and supplies the frames to the FFT processing unit 82. The FFT processing unit 82 applies an FFT to the digital sound source signal of each frame of each channel, converts it into an output for each frequency component, and supplies it to the blocking processing unit 84. During this time, the above-described inter-channel spectral binary masking process is applied to the audio signals, together with a process that uses the audio signal from the centrally located microphone MC1 to remove noise caused by surrounding music from all the audio signals.

ブロック化処理部８４は、ＦＦＴ処理部８２から４ミリ秒ごとに出力される信号を１００ミリ秒ごとにブロック化し、相関行列算出部８６に与える。相関行列算出部８６はこれら各ブロックについて、チャンネル毎の相関行列を算出し、固有値分解部８８に与える。固有値分解部８８は、相関行列算出部８６により算出された相関行列に固有値分解を施し、ＭＵＳＩＣ空間スペクトル算出部１０４に与える。 The blocking processing unit 84 blocks the signal output every 4 milliseconds from the FFT processing unit 82 every 100 milliseconds, and gives it to the correlation matrix calculation unit 86. The correlation matrix calculation unit 86 calculates a correlation matrix for each channel for each of these blocks, and provides it to the eigenvalue decomposition unit 88. The eigenvalue decomposition unit 88 performs eigenvalue decomposition on the correlation matrix calculated by the correlation matrix calculation unit 86 and gives the result to the MUSIC spatial spectrum calculation unit 104.
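
The frame and block sizes implied by the figures above (16 kHz sampling, 4-millisecond frames, 100-millisecond blocks) can be checked with a few lines of arithmetic. The sketch assumes non-overlapping frames, which the description does not state explicitly; the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000   # A/D converter 54
FRAME_MS = 4              # framing processing unit 80
BLOCK_MS = 100            # blocking processing unit 84

# Samples in one 4 ms frame at 16 kHz.
samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
# Frames grouped into one 100 ms block (assuming non-overlapping frames).
frames_per_block = BLOCK_MS // FRAME_MS
# Samples per channel covered by one block.
samples_per_block = samples_per_frame * frames_per_block
# Blocks, and hence MUSIC responses, produced per second.
blocks_per_second = 1000 // BLOCK_MS

print(samples_per_frame, frames_per_block, samples_per_block, blocks_per_second)
```

Under these assumptions each frame holds 64 samples, each block collects 25 frames (1600 samples per channel), and ten blocks are processed per second.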

ＭＵＳＩＣ空間スペクトル算出部１０４以下の処理は通常のＭＵＳＩＣ法の処理を３次元化したものである。まずＭＵＳＩＣ空間スペクトル算出部１０４は、位置ベクトル記憶部１００に記憶された位置ベクトルと、固有値分解部８８から出力される固有ベクトル９２とに基づき、音源数が固定したものとしてＭＵＳＩＣ空間スペクトルを１００ミリ秒ごとに算出しＭＵＳＩＣ応答算出部１０６に与える。ＭＵＳＩＣ応答算出部１０６はＭＵＳＩＣ空間スペクトルに基づき、１００ミリ秒ごとにＭＵＳＩＣ応答を算出しバッファ１０８に記憶させる。 The processing after the MUSIC spatial spectrum calculation unit 104 is a three-dimensional processing of the normal MUSIC method. First, the MUSIC spatial spectrum calculation unit 104 determines that the number of sound sources is fixed based on the position vector stored in the position vector storage unit 100 and the eigenvector 92 output from the eigenvalue decomposition unit 88, and sets the MUSIC spatial spectrum to 100 milliseconds. Is calculated for each and provided to the MUSIC response calculation unit 106. The MUSIC response calculation unit 106 calculates a MUSIC response every 100 milliseconds based on the MUSIC spatial spectrum and stores it in the buffer 108.

バッファ１０８は、ＭＵＳＩＣ応答算出部１０６から出力されるＭＵＳＩＣ応答を時系列で、ＦＩＦＯ形式で所定数だけ蓄積する。 The buffer 108 accumulates a predetermined number of MUSIC responses output from the MUSIC response calculation unit 106 in time series in the FIFO format.

平滑化フィルタ部１１０は、バッファ１０８に記憶された各ブロックのＭＵＳＩＣ応答を読出し（図４のステップ１３２）、そのブロックのＭＵＳＩＣ応答について、所定ブロックにわたる移動平均をとり、平滑化されたＭＵＳＩＣパワーをＭＵＳＩＣΔスペクトログラム算出部１１２及びオンセット・オフセット検出部１１４に与える。 The smoothing filter unit 110 reads the MUSIC response of each block stored in the buffer 108 (step 132 in FIG. 4), takes a moving average of that MUSIC response over a predetermined number of blocks, and supplies the smoothed MUSIC power to the MUSIC Δ spectrogram calculation unit 112 and the onset/offset detection unit 114.
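
The combination of a FIFO buffer and a moving average over a predetermined number of blocks can be sketched as follows. The class name, the window length of three blocks, and the use of a causal average over the most recent blocks are assumptions; the description does not give the window length or the exact form of the average.

```python
from collections import deque


class MovingAverage:
    """Moving average of MUSIC responses over the most recent blocks."""

    def __init__(self, window):
        # FIFO buffer: the oldest block is dropped once `window` blocks are held.
        self.buf = deque(maxlen=window)

    def push(self, block):
        """Add one block (list of MUSIC responses, one per search point)
        and return the per-search-point average over the buffered blocks."""
        self.buf.append(list(block))
        n = len(self.buf)
        return [sum(values) / n for values in zip(*self.buf)]


smoother = MovingAverage(window=3)   # window length is a hypothetical choice
out1 = smoother.push([3.0, 0.0])
out2 = smoother.push([6.0, 3.0])
out3 = smoother.push([0.0, 3.0])
out4 = smoother.push([3.0, 3.0])     # the first block has now left the buffer
```

Averaging over a few blocks suppresses block-to-block fluctuations in the MUSIC response before the Δ power and the onset and offset tests are evaluated.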

ＭＵＳＩＣΔスペクトログラム算出部１１２は、ブロックデータを受信すると（図４のステップ１３２）、各探索点について、現ブロックのＭＵＳＩＣパワーと、その探索点に隣接する探索点の各々の前ブロックのＭＵＳＩＣパワーとの差を算出し、その中の最小値を採用することで、各探索点について、ブロックごとにＭＵＳＩＣΔパワーを算出し（図４のステップ１３６）トラッキング部１１８に与える。オンセット・オフセット検出部１１４は、既に述べた構造のプログラム（図５及び図７）により実現され、平滑化フィルタ部１１０から出力される平滑化後のＭＵＳＩＣパワー及びＭＵＳＩＣΔパワーに基づいて、各ブロックにおいて音源のオンセット又はオフセットがあればそれらを検出し、トラッキング部１１８に与える。オンセット検出時にはその方位（探索点）も検出され、トラッキング部１１８に与えられる。 When the MUSIC Δ spectrogram calculation unit 112 receives the block data (step 132 in FIG. 4), it calculates, for each search point, the differences between the MUSIC power of the current block and the MUSIC power of the previous block at each of the search points adjacent to that search point, adopts the minimum of these differences, and thereby calculates the MUSIC Δ power of each search point for each block (step 136 in FIG. 4), which it supplies to the tracking unit 118. The onset/offset detection unit 114 is realized by the program having the structure already described (FIGS. 5 and 7). Based on the smoothed MUSIC power output from the smoothing filter unit 110 and on the MUSIC Δ power, it detects any onset or offset of a sound source in each block and supplies the result to the tracking unit 118. When an onset is detected, its direction (search point) is also detected and supplied to the tracking unit 118.

具体的には、オンセットの検出においては、図５のステップ１６２及び１６４により示されるように、各ブロックについて、ＭＵＳＩＣΔパワーが１．０ｄＢより大きく、かつＭＵＳＩＣパワーが１．８ｄＢより大きい探索点がオンセット候補となる。図５のステップ１７０−１７４により示されるように、各ブロックについてオンセット候補があるときには、上位から２個までがオンセットとして検出される。 Specifically, in onset detection, as shown by steps 162 and 164 in FIG. 5, a search point whose MUSIC Δ power is greater than 1.0 dB and whose MUSIC power is greater than 1.8 dB becomes an onset candidate in each block. As shown by steps 170 to 174 in FIG. 5, when a block has onset candidates, up to the top two are detected as onsets.

オフセットの検出では、図７のステップ２３０─２３４により示されるように、トラッキングされた最後の探索点について、そのＭＵＳＩＣΔパワーが−１．２ｄＢより小さいときにはオフセットと判定されるが、それ以外にもＭＵＳＩＣパワーがその音源のオンセットＭＵＳＩＣパワー＋αよりも小さくなったときにも強制的にオフセットと判定する。 In the detection of the offset, as indicated by steps 230-234 in FIG. 7, the last tracked search point is determined to be an offset when its MUSICΔ power is smaller than −1.2 dB. Even when the power becomes smaller than the onset MUSIC power + α of the sound source, the offset is forcibly determined.

トラッキング部１１８は、オンセット・オフセット検出部１１４からオンセット検出信号が与えられると、音源のトラッキングを開始する。具体的には、トラッキング部１１８は、オンセット検出後、その音源位置の探索点に隣接する探索点のうち、ＭＵＳＩＣパワーが最大の探索点をトラッキングし（図６のステップ２０４─ステップ２０６）、オフセットが検出された時点でトラッキングを終了する。ただし、本実施の形態では、オフセットが発生した後、４ブロックまではトラッキングを継続し、５ブロック経過後もオンセット条件を満たす探索点がトラッキング方向に生じないときに初めてトラッキングを終了する（図６のステップ２１０，２１２及び２１４）。 When the onset detection signal is given from the onset / offset detection unit 114, the tracking unit 118 starts tracking the sound source. Specifically, after detecting the onset, the tracking unit 118 tracks a search point having the maximum MUSIC power among search points adjacent to the search point for the sound source position (step 204 to step 206 in FIG. 6). Tracking is terminated when an offset is detected. However, in the present embodiment, after the offset occurs, tracking is continued up to 4 blocks, and tracking is finished only when a search point that satisfies the onset condition does not occur in the tracking direction even after 5 blocks have elapsed (see FIG. 6 steps 210, 212 and 214).

トラッキング部１１８によってトラッキングされた音源方位はバッファ６６にブロックごとに蓄積される。 The sound source direction tracked by the tracking unit 118 is stored in the buffer 66 for each block.

以上のような動作によって、音源定位処理部５０は継続的に複数個の音源の定位とトラッキングとを行なうことができる。 With the above operation, the sound source localization processing unit 50 can continuously perform localization and tracking of a plurality of sound sources.

図８−図１０に、上記実施の形態に係る音源定位処理部５０について、その性能を測るために行なった実験の結果を示す。この実験では、音源（ＤｉｒｅｃｔｉｏｎｓＯｆＡｒｒｉｖａｌ）定位の性能を測るために、以下の値を用いる。第１はＤＯＡ精度、第２はＤＯＡ挿入率である。ＤＯＡ精度とは、上記した装置により、正しいＤＯＡが検出された率のことである。ＤＯＡ精度は高い方が好ましい。ＤＯＡ挿入率とは、正しいＤＯＡの数と比較して、ブロック当たりで余分に検出された音源数の平均値のことをいう。ＤＯＡ挿入率は低い方が好ましい。 FIG. 8 to FIG. 10 show the results of experiments performed to measure the performance of the sound source localization processing unit 50 according to the above embodiment. In this experiment, the following values are used to measure the performance of the sound source (Directions Of Arrival) localization. The first is the DOA accuracy, and the second is the DOA insertion rate. The DOA accuracy is a rate at which correct DOA is detected by the above-described apparatus. Higher DOA accuracy is preferable. The DOA insertion rate is an average value of the number of sound sources detected extra per block as compared with the correct number of DOAs. A lower DOA insertion rate is preferable.
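
The two evaluation measures defined above can be sketched as follows. The function name and the representation of each block as a set of search-point identifiers are assumptions, as is the rule that a detected DOA counts as correct only when it matches a reference DOA exactly; the description does not state the matching tolerance used in the experiments.

```python
def doa_metrics(reference, detected):
    """Return (DOA accuracy, DOA insertion rate) over a sequence of blocks.

    reference: list of sets, the correct DOAs (search-point ids) per block.
    detected:  list of sets, the DOAs output by the localizer per block.
    Exact matching of identifiers is an assumption of this sketch.
    """
    hits = total = extra = 0
    for ref, det in zip(reference, detected):
        total += len(ref)
        hits += len(ref & det)     # correct DOAs that were detected
        extra += len(det - ref)    # detected DOAs with no matching reference
    accuracy = hits / total if total else 0.0
    insertion_rate = extra / len(reference) if reference else 0.0
    return accuracy, insertion_rate


reference = [{1}, {1, 2}, {2}, set()]
detected = [{1}, {1, 3}, {2, 4}, {5}]
accuracy, insertion_rate = doa_metrics(reference, detected)
```

In this toy example three of the four reference DOAs are detected (accuracy 0.75), and three extra DOAs are reported over four blocks (insertion rate 0.75 per block).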

正しいＤＯＡとしては、音源信号から得られた音源の活動を示す情報から、正しい音源数を使用して得られたＤＯＡを用いた。各音源が活動している間の、ＤＯＡの予測位置から得られる軌跡を区分線形近似で近似した。音源が移動しているか否かをチェックするため、ビデオ映像も利用した。 As the correct DOA, DOA obtained by using the correct number of sound sources from information indicating the activity of the sound source obtained from the sound source signal was used. While each sound source was active, the trajectory obtained from the predicted DOA position was approximated by piecewise linear approximation. Video was also used to check whether the sound source was moving.

実験では、２つの異なった環境（ＯＦＣ、ＵＣＷ。これらについては後述する。）において、ヒューマノイド型ロボットに実装された音源定位処理部５０によるＤＯＡ予測を行ない、そのＤＯＡ精度とＤＯＡ挿入率とを求めた。図８−図１０はその結果を示している。これら環境は実験のための音声を収録した環境であり、具体的には以下のとおりである。 In the experiment, DOA prediction is performed by the sound source localization processing unit 50 mounted on the humanoid robot in two different environments (OFC, UCW, which will be described later), and the DOA accuracy and DOA insertion rate are obtained. It was. 8 to 10 show the results. These environments are environments that record audio for experiments, and are specifically as follows.

すなわち、マイクロホンアレイによるデータ収録を２つの異なった環境で行なった。１つ目はオフィス環境（ＯｆｆｉｃｅＥｎｖｉｒｏｎｍｅｎｔ：ＯＦＣ）で、室内のエアコンとロボットの内部雑音が主な雑音源となる。２つ目の環境は、実験を行なった野外のショッピングモールの通路（ＵｎｉｖｅｒｓａｌＣｉｔｙＷａｌｋＯｓａｋａ：ＵＣＷ）である。ＵＣＷでの主な雑音源は、天井に設置されているスピーカから流れてくるポップ・ロックミュージックである。通路内のさまざまな位置およびさまざまな向きにロボットを配置して実験用データ及び画像の収録を行なった。なお、図８−図１０において「ＯＦＣ３」など、「ＯＦＣ」又は「ＵＣＷ」の後に付加されている数字は、録音の順番を示す。 That is, data recording by a microphone array was performed in two different environments. The first one is an office environment (OFC), and the internal noise of indoor air conditioners and robots are the main noise sources. The second environment is an outdoor shopping mall passage (UCW) where the experiment was conducted. The main noise source in UCW is pop-rock music flowing from speakers installed on the ceiling. The experimental data and images were recorded by placing the robot in various positions and various directions in the passage. 8 to 10, numbers added after “OFC” or “UCW” such as “OFC3” indicate the order of recording.

ＯＦＣでは、４つの参加者（４人の男性）を音源として用いた。最初に各参加者が１人ずつロボットに対して約１０秒間話しかけた。この間、他の参加者は静かにしていた。収録の最後の１５秒間では、４人の参加者が同時に発話した。録音時、各参加者はそれぞれ別々の音声キャプチャ装置に接続されたマイクロホンを着用していた。録音開始時の一時的な音を用いて、これら参加者の発話を録音したものとマイクロホンアレイ５２からの音声信号の録音とを人手により同期させた。 In the OFC, four participants (four men) were used as sound sources. First, each participant spoke to the robot for about 10 seconds. During this time, the other participants were quiet. During the last 15 seconds of recording, four participants spoke at the same time. During recording, each participant wore a microphone connected to a separate audio capture device. Using the temporary sound at the start of recording, the recording of the speech of these participants and the recording of the audio signal from the microphone array 52 were manually synchronized.

ＵＣＷでは、全ての録音には２人の参加者（いずれも男性）を音源として用いた。実験では、いずれの場合も、最初に各参加者が１０秒程度順番に別々に話し、最後に同時に発話を行なった。ＵＣＷ７及びＵＣＷ８では１人の参加者が移動、別の参加者が静止しながら、ほぼ全ての時間にわたり２人が共に発話していた。ＵＣＷ１−４及びＵＣＷ９では、ロボットは天井のスピーカから遠く離れた位置に配置された。ＵＣＷ５−８では、ロボットをスピーカの近く（数メートル）に配置した。ＵＣＷ１０−１３では、ロボットをスピーカの直下に配置した。 In UCW, two participants (both men) were used as sound sources for all recordings. In each experiment, in each case, each participant first spoke separately for about 10 seconds, and finally spoke at the same time. In UCW7 and UCW8, one participant moved, while another participant was stationary, and the two were speaking together for almost all the time. In UCW1-4 and UCW9, the robot was placed far away from the speaker on the ceiling. In UCW5-8, the robot was placed near the speaker (several meters). In UCW10-13, the robot was placed directly under the speaker.

全ての試行において、ロボットを様々な方向に向け、音源を様々な位置に配置してデータの取得を行なった。 In all trials, the robot was pointed in various directions and the sound sources were placed in various positions to acquire data.

図８は、ＯＦＣ及びＵＣＷ環境において、音源定位に関するパラメータのいくつかを変えて行なった試行でのＤＯＡ予測の性能（精度及び挿入率）を示す。変更されたパラメータは、平滑化フィルタ部１１０の有無、オフセット検出時のしきい値θ_Ｈに加算される値αの有無、ブロック当たりのオンセット数の制限の有無、及びトラッキングの前後、である。 FIG. 8 shows the DOA prediction performance (accuracy and insertion rate) in trials performed in the OFC and UCW environments with some of the parameters related to sound source localization changed. The parameters changed are the presence or absence of the smoothing filter unit 110, the presence or absence of the value α added to the threshold θ_H at the time of offset detection, the presence or absence of the limit on the number of onsets per block, and before versus after tracking.

ＭＵＳＩＣスペクトログラムの算出で使用したパラメータの値は以下のとおりである。すなわち、ＮＦＦＴ（ＦＦＴ長）＝６４、周波数範囲＝１−６ｋＨｚ、ＭＵＳＩＣ空間スペクトル算出部１０４における固定音源数＝２、である。ＮＦＦＴの値を大きくすると性能は多少高くなるが、ＮＦＦＴ＝６４とすると動作クロック周波数２ＧＨｚ程度の市販のＣＰＵを使用してもリアルタイムで動作できるため、この値を使用した。 The parameter values used in the calculation of the MUSIC spectrogram are as follows. That is, NFFT (FFT length) = 64, frequency range = 1-6 kHz, and the number of fixed sound sources in the MUSIC spatial spectrum calculation unit 104 = 2. When the value of NFFT is increased, the performance is somewhat improved. However, when NFFT = 64, the value can be used in real time even when a commercially available CPU having an operation clock frequency of about 2 GHz is used.
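
As a worked example of these parameter values, the following sketch computes the frequency resolution for NFFT = 64 at the 16 kHz sampling rate and lists the FFT bins that fall within 1 to 6 kHz. Treating both ends of the frequency range as inclusive is an assumption, and the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000
NFFT = 64
F_LOW_HZ, F_HIGH_HZ = 1_000, 6_000

# Spacing between adjacent FFT bins.
bin_spacing_hz = SAMPLE_RATE_HZ / NFFT

# Non-negative-frequency bins whose centre frequency lies in 1-6 kHz
# (inclusive bounds are an assumption).
bins = [
    k for k in range(NFFT // 2 + 1)
    if F_LOW_HZ <= k * bin_spacing_hz <= F_HIGH_HZ
]

print(bin_spacing_hz, bins[0], bins[-1], len(bins))
```

With these values the bins are 250 Hz apart and bins 4 through 24 are used, that is, 21 frequency bins per frame. This small count is consistent with the stated aim of real-time operation on a commodity CPU.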

図８の左側に、各実験条件とＯＦＣ及びＵＣＷの各々とにおける、音源に関する平均のＤＯＡ精度を示し、図８の中央にＤＯＡ挿入率を示した。 The left side of FIG. 8 shows the average DOA accuracy for the sound sources under each experimental condition in each of OFC and UCW, and the center of FIG. 8 shows the DOA insertion rate.

これらのグラフから、「αなし」とした場合のＤＯＡ精度が最も高いことが分かる。しかしこの場合にはまた、ＤＯＡ挿入率が最も悪い。「平滑化なし」、「オンセット数制限なし」、及び「トラッキング前」について得られたＤＯＡ精度は互いによく似ている。しかし、「トラッキング前」ではＤＯＡ挿入率が低下していることが明確に見て取れる。これは、平滑化及び１ブロック当たりのオンセット数の制限が有効であることを示している。「トラッキング前」及び「トラッキング後」を比較すると、ＵＣＷ３−６及びＵＣＷ１０−１１においてＤＯＡ精度にやや改善が見られるのに対し、ＵＣＷ８−９ではＤＯＡ挿入率に非常に小さな改善しか見られないことが分かる。 From these graphs, it can be seen that the DOA accuracy is highest for "no α". However, the DOA insertion rate is also the worst in this case. The DOA accuracies obtained for "no smoothing", "no limit on number of onsets", and "before tracking" are very similar to one another. However, it can be clearly seen that the DOA insertion rate is lower for "before tracking". This indicates that smoothing and limiting the number of onsets per block are effective. Comparing "before tracking" and "after tracking", a slight improvement in DOA accuracy is seen in UCW3-6 and UCW10-11, whereas only a very small improvement in DOA insertion rate is seen in UCW8-9.

環境音楽についてのＤＯＡ精度の結果を、図８の右側に示す。この結果によれば、ＵＣＷ１−４及びＵＣＷ７−９においてＤＯＡ精度が低いことが分かる。これはロボットがスピーカから比較的遠くに配置されていたためであろう。ＵＣＷ１０−１３ではＤＯＡ精度は１００％に近い。これらの実験条件では、ロボットがスピーカの直下に配置されていて、背景音楽が明確に指向性を持った音源として検知されたためである。これに対し、ロボットがスピーカにほど近い位置に置かれたＵＣＷ５−６では、ＤＯＡ精度は中間の値となっている。 The results of DOA accuracy for environmental music are shown on the right side of FIG. According to this result, it is understood that the DOA accuracy is low in UCW1-4 and UCW7-9. This is probably because the robot was placed relatively far from the speaker. In UCW10-13, the DOA accuracy is close to 100%. This is because, under these experimental conditions, the robot was placed directly under the speaker, and the background music was detected as a sound source with a clear directivity. On the other hand, in the UCW5-6 where the robot is placed close to the speaker, the DOA accuracy is an intermediate value.

各音源についての性能を検討すると、図９から、ＯＦＣ２の第２及び第４の音源、ＵＣＷ９の１番目の音源、及びＵＣＷ１２の２番目の音源のＤＯＡ精度が低くなっている。これらの環境では、音源からの音声がロボットの背後にあったため、ロボットの前方からの音声では、パワー及び指向性の双方ともが低かったためと思われる。 Examining the performance of each sound source, as shown in FIG. 9, the DOA accuracy of the second and fourth sound sources of OFC2, the first sound source of UCW9, and the second sound source of UCW12 is low. In these environments, since the sound from the sound source was behind the robot, it seems that both the power and directivity were low in the sound from the front of the robot.

上記実施の形態に係る音源定位処理部５０によれば、ＭＵＳＩＣ応答の生データを用いて音源定位を行なっている。実験結果からも明らかなように、このような処理により、高い精度で音源数を予測し、トラッキングすることができる。また、オフセット検出のためのしきい値として、オンセット時のＭＵＳＩＣパワーの値にαを加算することで、オンセット時のＭＵＳＩＣパワーよりＭＵＳＩＣパワーが多少高くても、オフセットを検出したものと見なすことができ、その結果、ＤＯＡの後挿入の頻度を下げることができる。その結果、ＭＵＳＩＣ法による音源定位を安定して精度高く行なうことができる。背景雑音の発生源が音源定位処理部５０のすぐ近くにあるような場合には、精度は低下するが、実際のロボットの実装では、音声だけではなく画像を使用して音源が対話相手か否かを予測することもでき、精度の低下を防止することが期待できる。 According to the sound source localization processing unit 50 according to the above embodiment, sound source localization is performed using raw data of a MUSIC response. As is clear from the experimental results, the number of sound sources can be predicted and tracked with high accuracy by such processing. Further, by adding α to the value of the MUSIC power at the time of onset as a threshold for detecting the offset, it is considered that the offset has been detected even if the MUSIC power is slightly higher than the MUSIC power at the time of onset. As a result, the frequency of post-insertion of DOA can be reduced. As a result, sound source localization by the MUSIC method can be performed stably and accurately. When the source of background noise is in the immediate vicinity of the sound source localization processing unit 50, the accuracy is lowered. However, in actual robot implementation, whether the sound source is a conversation partner using not only speech but also images. Can be predicted, and it can be expected to prevent a decrease in accuracy.

さらに、本実施の形態では３次元ＭＵＳＩＣ法を用いているため、方位角だけではなく、ある範囲で仰角を含めて音源方位を推定することができる。そのため、実環境でさまざまな方向から音声を受ける環境でもロボットなどが正しく音源を定位して適切な動作を行なうことが可能になる。ロボットが人間とのインタラクションを行なう場合でも、相手の顔を見つめながら適切な動作を行なうことが期待でき、ロボットと人間とのインタラクションをよりスムーズなものとすることができる。 Furthermore, since the three-dimensional MUSIC method is used in the present embodiment, it is possible to estimate the sound source azimuth including not only the azimuth angle but also the elevation angle within a certain range. Therefore, even in an environment where voice is received from various directions in a real environment, a robot or the like can correctly locate a sound source and perform an appropriate operation. Even when the robot interacts with a human, it can be expected to perform an appropriate operation while looking at the face of the other party, and the interaction between the robot and the human can be made smoother.

［コンピュータによる実現］ [Realization by computer]

この実施の形態に係る音源定位処理部５０は、コンピュータハードウェアと、そのコンピュータハードウェアにより実行されるプログラムと、コンピュータハードウェアに格納されるデータとにより実現することができる。図１１はこのコンピュータシステム３３０の内部構成を示す。 The sound source localization processing unit 50 according to this embodiment can be realized by computer hardware, a program executed by the computer hardware, and data stored in the computer hardware. FIG. 11 shows the internal configuration of the computer system 330.

Referring to FIG. 11, the computer system 330 includes a computer 340 having a DVD (Digital Versatile Disc) drive 364 and a memory port 370 for attaching and detaching removable memory, a keyboard 346, a mouse 348, and a monitor 342.

In addition to the memory port 370 and the DVD drive 364, the computer 340 includes: a CPU (central processing unit) 356; a bus 366 to which the CPU 356, the memory port 370, and the DVD drive 364 are connected; a read-only memory (ROM) 358 connected to the bus 366 and storing a boot-up program and the like; a random access memory (RAM) 360 connected to the bus 366 and temporarily storing program instructions, a system program, work data, and the like; a hard disk 362, which is a nonvolatile storage device connected to the bus 366 and storing a large amount of data; an input/output interface (I/F) 368 connected to the bus 366 and receiving the output from the A/D converter 54; and a wireless network I/F 372 providing a wireless connection to a local area network (LAN).

A computer program for causing the computer system 330 to function as the sound source localization processing unit 50 is stored on a DVD 390 or a removable memory 392 inserted into the DVD drive 364 or the memory port 370, respectively, and is then transferred to the hard disk 362. Alternatively, the program may be transmitted to the computer 340 through a network (not shown) and stored on the hard disk 362. The program is loaded into the RAM 360 at the time of execution. The program may also be loaded into the RAM 360 directly from the DVD 390, from the removable memory 392, or via the network.

This program includes a plurality of instructions that cause the computer 340 to operate as the sound source localization processing unit 50 of this embodiment. Some of the basic functions needed for this operation may be provided by the operating system (OS) when an OS is running on the computer 340. They may also be provided by third-party programs or by modules of various toolkits installed on the computer 340. Therefore, the program need not include all of the functions necessary to realize the system and method of this embodiment. It need only include those instructions that carry out the operation of the sound source localization processing unit 50 described above by calling the appropriate functions or "tools" in a controlled manner so as to obtain the desired result. The operation of the computer system 330 is well known and is not repeated here.

The embodiment disclosed herein is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

[Appendix: MUSIC method]
The Fourier transforms X_m(k, t) of the M microphone inputs are modeled as shown in Equation (1).
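The image of Equation (1) is not reproduced in this text. Assuming the standard narrowband array model and the symbols defined in this appendix (s, n, and A_k), it presumably takes the following form, where the stacked vector x(k, t) is notation introduced here for illustration:

```latex
\mathbf{x}(k,t) = \bigl[X_1(k,t), \ldots, X_M(k,t)\bigr]^{T}
               = \mathbf{A}_k\,\mathbf{s}(k,t) + \mathbf{n}(k,t) \tag{1}
```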

Here, the vector s(k, t) consists of the spectra S_n(k, t) of the N sound sources: s(k, t) = [S_1(k, t), ..., S_N(k, t)]^T. The indices k and t denote frequency and time frame, respectively. The vector n(k, t) denotes background noise. The matrix A_k is the transfer function matrix, whose (m, n) element is the transfer function of the direct path from the n-th sound source to the m-th microphone. The n-th column vector of A_k is called the position vector (steering vector) of the n-th sound source.

First, the spatial correlation matrix R_k defined by Equation (2) is obtained. The eigenvalue decomposition of R_k shown in Equation (3) then yields the diagonal matrix of eigenvalues Λ_k and the matrix E_k composed of the eigenvectors.
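The images of Equations (2) and (3) are not reproduced in this text. Under the standard MUSIC formulation they presumably read as follows, where x(k, t) = [X_1(k, t), ..., X_M(k, t)]^T is the stacked microphone spectrum (notation assumed here), E[·] denotes expectation (in practice, an average over time frames), and ^H denotes the conjugate transpose:

```latex
\mathbf{R}_k = E\!\left[\mathbf{x}(k,t)\,\mathbf{x}^{H}(k,t)\right] \tag{2}
```

```latex
\mathbf{R}_k = \mathbf{E}_k\,\boldsymbol{\Lambda}_k\,\mathbf{E}_k^{-1} \tag{3}
```

Because R_k is Hermitian, its eigenvectors can be chosen orthonormal, in which case the inverse of E_k equals its conjugate transpose.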

The eigenvectors can be partitioned as E_k = [E_ks | E_kn], where E_ks denotes the eigenvectors corresponding to the N dominant eigenvalues and E_kn denotes the remaining eigenvectors.
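The partition into E_ks and E_kn can be sketched with NumPy as follows. This is an illustrative sketch only: the function name `split_subspaces` is hypothetical, and the number of sources N is assumed to be known or already estimated.

```python
import numpy as np

def split_subspaces(R_k, n_sources):
    """Partition the eigenvectors of a Hermitian matrix R_k into [E_ks | E_kn]."""
    # eigh is meant for Hermitian matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(R_k)
    order = np.argsort(eigenvalues)[::-1]   # reorder so dominant eigenvalues come first
    eigenvectors = eigenvectors[:, order]
    E_ks = eigenvectors[:, :n_sources]      # eigenvectors of the N dominant eigenvalues
    E_kn = eigenvectors[:, n_sources:]      # remaining eigenvectors
    return E_ks, E_kn
```

For a correlation matrix built from a single steering vector plus a small multiple of the identity, the columns of E_kn returned with `n_sources=1` are orthogonal to that steering vector, which is the property the MUSIC spatial spectrum exploits.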

The MUSIC spatial spectrum is obtained by Equations (4) and (5), where r is the distance and θ and φ are the azimuth and elevation angles, respectively. Equation (5) gives the normalized position vector at the scanned point (r, θ, φ).
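The images of Equations (4) and (5) are not reproduced in this text. Assuming the usual MUSIC definition, with a_k(r, θ, φ) the position vector at the scanned point and E_kn the eigenvectors that do not correspond to the N dominant eigenvalues, they presumably have the form:

```latex
P(r,\theta,\phi,k) =
  \frac{1}{\left\lVert \tilde{\mathbf{a}}_k^{H}(r,\theta,\phi)\,\mathbf{E}_{kn} \right\rVert^{2}} \tag{4}
```

```latex
\tilde{\mathbf{a}}_k(r,\theta,\phi) =
  \frac{\mathbf{a}_k(r,\theta,\phi)}{\left\lVert \mathbf{a}_k(r,\theta,\phi) \right\rVert} \tag{5}
```

The denominator of (4) approaches zero when the scanned position vector is orthogonal to E_kn, that is, when it coincides with the position vector of an actual source, so P shows a peak there.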

The spatial spectrum (referred to in this specification as the "MUSIC response") is the MUSIC spatial spectrum averaged as shown in Equation (6).
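The image of Equation (6) is not reproduced in this text. Given that the average runs over K frequency bins between a lower index k_L and an upper index k_H, it presumably reads as follows, where P(r, θ, φ, k) is the MUSIC spatial spectrum at frequency index k and the bar notation is an assumption made here:

```latex
\bar{P}(r,\theta,\phi) = \frac{1}{K} \sum_{k=k_L}^{k_H} P(r,\theta,\phi,k) \tag{6}
```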

In Equation (6), k_L and k_H are the indices of the lower and upper boundaries of the frequency band, respectively, and K = k_H − k_L + 1. The direction of the sound source is obtained from the peaks of the MUSIC response.
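The computation of the MUSIC response from per-frequency spatial spectra can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function name `music_response`, the array layouts, and the small floor `eps` that guards against division by zero are all hypothetical.

```python
import numpy as np

def music_response(noise_subspaces, steering, k_low, k_high, eps=1e-12):
    """Average the MUSIC spatial spectrum over frequency bins k_low..k_high.

    noise_subspaces : sequence indexed by k; element k has shape (M, M - N)
                      and holds the non-dominant eigenvectors E_kn
    steering        : complex array of shape (n_bins, D, M); steering[k, d]
                      is the position vector for scanned direction d at bin k
    Returns an array of shape (D,), one value per scanned direction.
    """
    K = k_high - k_low + 1
    total = np.zeros(steering.shape[1])
    for k in range(k_low, k_high + 1):
        a = steering[k]
        a = a / np.linalg.norm(a, axis=1, keepdims=True)  # normalized position vectors
        proj = a.conj() @ noise_subspaces[k]              # row d holds a_d^H E_kn
        denom = np.sum(np.abs(proj) ** 2, axis=1)         # squared norm per direction
        total += 1.0 / np.maximum(denom, eps)             # spatial spectrum at bin k
    return total / K                                      # average over the K bins
```

The sound source directions would then be read from the peaks of the returned array, for example with `np.argmax` in the single-source case.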

30 robot
32 microphone base
50 sound source localization processing unit
52 microphone array
60 eigenvector calculation unit
62 MUSIC processing unit
64 sound source estimation unit
86 correlation matrix calculation unit
88 eigenvalue decomposition unit
104 MUSIC spatial spectrum calculation unit
106 MUSIC response calculation unit
110 smoothing filter unit
112 MUSICΔ spectrogram calculation unit
114 onset/offset detection unit
118 tracking unit

Claims

A sound source localization apparatus comprising:
MUSIC response calculation means for calculating, based on sound source signals of a plurality of channels obtained from the outputs of a microphone array and on the positional relationship between the microphones included in the microphone array, MUSIC power at predetermined time intervals by the MUSIC algorithm for each of a plurality of directions respectively corresponding to a plurality of mesh frames defined in a three-dimensional space centered on a point determined in relation to the position of the microphone array; and
sound source estimation means for detecting, for each of the plurality of directions, changes in the direction of a sound source from its occurrence to its disappearance, based on the magnitude and the amount of change of the MUSIC power values obtained as a time series by the MUSIC response calculation means,
wherein the sound source estimation means includes:
MUSICΔ power calculation means for calculating, for each of the plurality of directions, the differences between the latest MUSIC power value obtained for the direction being processed and the MUSIC power values obtained immediately before by the MUSIC response calculation means for each of the directions corresponding to the frames adjacent to the frame corresponding to the direction being processed, and for outputting the smallest of the differences as the MUSICΔ power for that direction;
first determination means for determining, for each of the plurality of directions that is not the direction of a sound source, whether the MUSICΔ power value calculated by the MUSICΔ power calculation means is larger than a first onset threshold and whether the MUSIC power value calculated by the MUSIC response calculation means is larger than a second onset threshold;
second determination means for determining, in response to all of the determination results by the first determination means being affirmative, that a sound source has occurred in the direction being processed, and for outputting information indicating the occurrence of the sound source and information indicating its direction; and
tracking means for tracking, in response to detection of the occurrence of a sound source by the second determination means, the direction of that sound source until its disappearance, based on the MUSIC power values calculated by the MUSIC response calculation means for the directions corresponding to the frames adjacent to the frame corresponding to the direction of the sound source and on the MUSICΔ power output by the MUSICΔ power calculation means for those directions,
and wherein the tracking means includes:
means for tracking the movement of the sound source whose occurrence has been detected by the second determination means by following, in the MUSIC responses output by the MUSIC response calculation means at the predetermined time intervals after the detection, the maximum of the MUSIC power values calculated for the directions corresponding to the frames adjacent to the frame corresponding to the direction of the sound source;
first offset determination means for determining whether the MUSICΔ power calculated for the direction of the sound source being tracked by the means for tracking is smaller than a first offset threshold;
second offset determination means for determining whether the MUSIC power calculated for that direction is smaller than a second offset threshold that is larger, by a certain positive constant, than the MUSIC power at the time of occurrence of the sound source; and
means for determining, when an affirmative result from either of the first and second offset determination means is obtained consecutively a predetermined number of times for the MUSIC responses calculated at the predetermined time intervals by the MUSIC response calculation means, that the sound source being tracked by the means for tracking has disappeared, and for stopping the tracking.

The sound source localization apparatus according to claim 1, further comprising smoothing means for smoothing the MUSIC power output by the MUSIC response calculation means by calculating its moving average for each of the plurality of directions,
wherein both the MUSICΔ power calculation means and the tracking means receive the MUSIC power smoothed by the smoothing means as input.

The sound source localization apparatus according to claim 1, wherein the second determination means includes detected sound source limiting means for identifying as sound sources, among the directions for which the determination result by the first determination means is affirmative, only a limited fixed number of directions taken in descending order of the MUSIC power calculated for those directions.

The sound source localization apparatus according to claim 1, wherein the predetermined number of times is a plurality of times.

A computer program that, when executed by a computer, causes the computer to function as each of the means recited in any one of claims 1 to 4.