JP2014098568A

JP2014098568A - Sound source position estimation device, sound source position estimation method, and sound source position estimation program

Info

Publication number: JP2014098568A
Application number: JP2012249050A
Authority: JP
Inventors: Carlos Toshinori Ishii; イシイ・カルロス・トシノリ; Yani Evan; ヤニ・エヴァン; Norihiro Hagita; 紀博萩田
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2012-11-13
Filing date: 2012-11-13
Publication date: 2014-05-29

Abstract

PROBLEM TO BE SOLVED: To provide a sound source position estimation device capable of estimating a three-dimensional sound source position by using both direct sound and reflected sound. SOLUTION: In a sound source position estimation device 1000, a MUSIC processing unit 62 identifies the directions from which sound arrives at a plurality of sound sensor arrays MC1, MC2, based on the multi-channel sound source signals from the sound sensor arrays MC1, MC2 and on the positional relationship among the sound sensors included in each array. A sound source direction/reflected sound direction estimation unit 110 estimates the direction of the sound source for direct sound and the direction of the sound source for reflected sound, based on the identified directions of arrival and spatial three-dimensional map information. A position estimation processing unit 114 estimates the three-dimensional position of the sound source from the overlap of the regions extended from each sound sensor array in the direct-sound and reflected-sound source directions.

Description

The present invention relates to sound source localization in real environments, and more particularly to a technique for estimating the position of a sound source in a real environment using the directionality of sound observed by a plurality of sensor arrays.

Environments such as homes, offices, and shopping streets have noise characteristics that vary with location and time. Applications that target a specific sound such as speech may therefore fail to achieve the expected performance, depending on the type and level of noise in the environment where they are used.

For example, in voice communication between a person and a robot, the microphone mounted on the robot is usually some distance (1 m or more) from the speaker. The signal-to-noise ratio (SNR) is therefore lower than in a case such as telephone speech, where the microphone is only a few centimeters from the mouth. As a result, the voices of other people nearby and environmental noise act as interference, making it difficult for the robot to recognize the target speech. Sound source localization and sound source separation are therefore important for robot applications.

Regarding sound source localization, Patent Document 1 and Patent Document 2 describe conventional techniques that assume a real environment. These techniques use a known high-resolution sound source localization method called the MUSIC method.

In the inventions described in Patent Document 1 and Patent Document 2, a microphone array is used, and the current correlation matrix is calculated from the received signal vector, obtained by Fourier-transforming the signals from the microphone array, and the past correlation matrix. The correlation matrix obtained in this way is eigenvalue-decomposed to obtain the maximum eigenvalue and the noise space, which consists of the eigenvectors corresponding to the eigenvalues other than the maximum eigenvalue. The direction of the sound source is then estimated by the MUSIC method, taking one microphone of the array as a reference, from the phase differences of the microphone outputs, the noise space, and the maximum eigenvalue.

In most previous studies on sound source localization and sound source separation, reflected sound has been treated as a harmful influence (see, for example, Non-Patent Document 1 and Non-Patent Document 2).

Patent Document 1: Japanese Patent Application Laid-Open No. 2008-175733
Patent Document 2: Japanese Patent Application Laid-Open No. 2011-220701

Non-Patent Document 1: F. Asano, M. Goto, K. Itou, and H. Asoh, “Real time sound source localization and separation system and its application on automatic speech recognition,” in Eurospeech 2001, Aalborg, Denmark, 2001, pp. 1013-1016.
Non-Patent Document 2: C. T. Ishi, O. Chatot, H. Ishiguro, N. Hagita, “Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments,” in Proc. of the 2009 IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, St. Louis, USA, 2009, pp. 2027-2032.

However, in a real environment, when sound-reflecting surfaces such as walls, ceilings, glass windows, or displays are present around the microphone array, reflected sound from the source may be measured together with its direct sound. In particular, when the distance between the microphone array and the sound source is large, the influence of reflected sound is expected to be non-negligible.

In a real environment, it is therefore desirable to perform sound source localization on the premise that reflected sound is present. However, a method that actively exploits reflected sound to estimate the position of the sound source has not necessarily been established.

The present invention has been made to solve the above problems, and its object is to provide a sound source position estimation device, a sound source position estimation method, and a sound source position estimation program capable of estimating a three-dimensional sound source position using both direct sound and reflected sound.

In the present invention, sound source directions are estimated using a plurality of microphone arrays, the directions of reflected sound are estimated using spatial information, and these pieces of information are integrated to perform sound source localization (position estimation in three-dimensional space).
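As a rough illustration of how direction estimates from two arrays can be combined into a position, the sketch below finds the point nearest to two lines of sight. It is a generic closest-approach triangulation and not the procedure of the embodiment; the function name, coordinates, and the choice of the segment midpoint are assumptions made for illustration.

```python
def closest_point_between_rays(p1, d1, p2, d2):
    """Midpoint of the shortest segment joining the lines p1 + s*d1 and p2 + t*d2.

    p1, p2: array positions; d1, d2: estimated source directions (any length).
    The lines are treated as infinite; a negative s or t would mean the
    closest point lies behind the corresponding array.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    w = [a - b for a, b in zip(p1, p2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("directions are parallel; no unique closest point")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    q1 = [p + s * v for p, v in zip(p1, d1)]
    q2 = [p + t * v for p, v in zip(p2, d2)]
    return [(x + y) / 2 for x, y in zip(q1, q2)]
```

With noisy direction estimates the two lines rarely intersect exactly, which is one reason to work with regions around each direction rather than with the lines alone.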

According to one aspect of the present invention, a sound source position estimation device includes: a plurality of sound sensor arrays; a storage device for storing information on the arrangement of the sound sensors in each sound sensor array and spatial three-dimensional map information of the measurement environment; sound source localization means for executing processing to identify the directions from which sound arrives at the plurality of sound sensor arrays, based on each of the multi-channel sound source signals from the sound sensor arrays and the positional relationship among the sound sensors included in each array; sound source direction estimation means for estimating the direction of the sound source for direct sound and the direction of the sound source for reflected sound, based on the identified directions of arrival and the spatial three-dimensional map information; and sound source position estimation means for estimating the three-dimensional position of the sound source according to the overlap of the regions extended from each of the sound sensor arrays in the source directions given by the direct sound and the reflected sound.

Preferably, the sound source direction estimation means estimates the direction of the sound source for reflected sound using only reflections of a predetermined number of times or fewer.

Preferably, the sound source direction estimation means estimates the direction of the sound source for reflected sound using only reflections that occur within a predetermined distance.

Preferably, in the estimation processing of the sound source position estimation means, the region extended in the direction of the sound source is a region obtained by adding an angle estimation error around a straight line along the sound source direction.
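One way to picture such a region is as a cone whose axis is the estimated direction and whose half-angle is the angle estimation error. The sketch below tests whether a candidate point falls inside such a cone. The cone model, the function name, and the 5-degree tolerance used in the usage note are assumptions for illustration, not values taken from the specification.

```python
import math

def in_extension_region(origin, direction, point, max_angle_deg):
    """True if `point` lies within max_angle_deg of the ray from `origin` along `direction`."""
    v = [p - o for p, o in zip(point, origin)]
    norm_v = math.sqrt(sum(x * x for x in v))
    norm_d = math.sqrt(sum(x * x for x in direction))
    if norm_v == 0.0:
        return True  # the apex of the cone itself
    cos_angle = sum(a * b for a, b in zip(v, direction)) / (norm_v * norm_d)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg
```

A candidate position would then count as lying in the overlap when `in_extension_region(...)` holds for the cones of two or more arrays, for example with `max_angle_deg=5`.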

Preferably, the sound source localization means identifies the direction from which sound arrives based on the MUSIC response intensity obtained by the MUSIC method.

According to another aspect of the present invention, a sound source position estimation method for estimating the three-dimensional position of a sound source based on signals from a plurality of sound sensor arrays includes the steps of: identifying the directions from which sound arrives at the plurality of sound sensor arrays, based on each of the multi-channel sound source signals from the sound sensor arrays and the positional relationship among the sound sensors included in each array; estimating the direction of the sound source for direct sound and the direction of the sound source for reflected sound, based on the identified directions of arrival and spatial three-dimensional map information; and estimating the three-dimensional position of the sound source according to the overlap of the regions extended from each of the sound sensor arrays in the source directions given by the direct sound and the reflected sound.

According to still another aspect of the present invention, a sound source position estimation program causes a computer having an arithmetic device and a storage device to estimate the three-dimensional position of a sound source based on signals from a plurality of sound sensor arrays. The program causes the computer to execute the steps of: the arithmetic device identifying the directions from which sound arrives at the plurality of sound sensor arrays, based on each of the multi-channel sound source signals from the sound sensor arrays and the positional relationship among the sound sensors included in each array; the arithmetic device estimating the direction of the sound source for direct sound and the direction of the sound source for reflected sound, based on the identified directions of arrival and the spatial three-dimensional map information stored in the storage device; and the arithmetic device estimating the three-dimensional position of the sound source according to the overlap of the regions extended from each of the sound sensor arrays in the source directions given by the direct sound and the reflected sound.

According to the present invention, the three-dimensional position of a sound source can be estimated accurately using both direct sound and reflected sound.

FIG. 1 is a block diagram showing the configuration of the sound source position estimation device 1000.
FIG. 2 is a conceptual diagram outlining the processing of the sound source position estimation device 1000 in the present embodiment.
FIG. 3 is a conceptual diagram for explaining in more detail the sound source position estimation processing performed by the sound source position estimation device 1000.
FIG. 4 is a conceptual diagram showing a spatial three-dimensional map of the space in which sound source position estimation is performed.
FIG. 5 is a diagram showing the shape of the 16-element microphone array used in the experiment.
FIG. 6 is a diagram for explaining how the microphone arrays were installed in the room used for the experiment.
FIG. 7 is a diagram showing numerical values of the measured sound source positions and the microphone array positions.
FIG. 8 is a diagram showing the measured sound source positions and the microphone array positions.
FIG. 9 is a diagram showing the detection rate for each condition of sound source position ("1" to "6") and orientation ("F", "L", "B", "R").
FIG. 10 is a diagram showing the average position estimation error under the same conditions as FIG. 9.
FIG. 11 is a block diagram showing the hardware configuration of a computer system 2000.
FIG. 12 is a flowchart for explaining processing by a computer program.

Hereinafter, the configuration of a sound source position estimation device according to an embodiment of the present invention will be described with reference to the drawings. In the following embodiment, components and processing steps denoted by the same reference numerals are identical or equivalent, and their description is not repeated unless necessary.

In the following description, a so-called microphone, more specifically an electret condenser microphone, is used as an example of the sound sensor, but any other sound sensor may be used as long as it can detect sound as an electrical signal.

In the present embodiment, the acquisition and use of prior knowledge about the sound environment are collectively called "sound environment intelligence." A spatio-temporal representation of when, where, and what kind of sound occurred, that is, a map that is built up as sounds occur, is called a "sound environment map."

In a real environment, multiple sounds generated at different locations are observed as a mixture, so a simple method such as scanning the space with a sound level meter is insufficient for generating a sound environment map. Generating a sound environment map that characterizes the positions and types of sound sources, which is considered useful as prior knowledge of the sound environment, requires at least localization and separation of the sound sources in addition to spatial information (an ordinary map). It is further desirable that the sound sources be classified.

Therefore, in order to localize and separate multiple sound sources, the sound source position estimation device of the present embodiment, as described below, links a plurality of microphone arrays, generates a sound environment map for specific sound sources in the space, and structures the sound environment.

Here, sound source localization means continuously identifying the direction of a sound source, and sound source position estimation means estimating the three-dimensional position of the sound source within a predetermined space, based on the directions identified by sound source localization.

In the following, the MUSIC (Multiple Signal Classification) method is described as an example of a technique for estimating the direction of a sound source, which is needed to estimate its position. However, any other technique may be used as long as it can estimate the direction of the sound source.

(MUSIC spatial spectrum)

MUSIC is a sound source localization technique characterized by high resolution.

In outline, the MUSIC method proceeds as follows. First, the multi-channel spectrum X(k, t) is obtained for each frame by fast Fourier transform. The spatial correlation matrix R_k between channels is then obtained in the spectral domain for each block. Eigenvalue decomposition of the correlation matrix separates the subspace of directional components from the subspace of non-directional components. Using the eigenvectors E_k^n corresponding to the non-directional subspace and the direction vectors a_k prepared in advance for the target search space, the (narrowband) MUSIC spatial spectrum P(k) is obtained for each frequency bin. Finally, the MUSIC spatial spectra of the frequency bins within a specific frequency band are combined to obtain the wideband MUSIC spatial spectrum.

In the following, the wideband MUSIC spatial spectrum is simply called the "MUSIC spatial spectrum," and the time series of MUSIC spatial spectra is called the "MUSIC spectrogram."

In sound source localization, the direction of a sound source is obtained by searching for peaks in the MUSIC spatial spectrum.
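The peak search can be sketched as follows for a spectrum sampled on a one-dimensional azimuth grid that wraps around at 360 degrees. The one-dimensional grid, the function name, and the `threshold` parameter are assumptions for illustration; the embodiment searches over both azimuth and elevation.

```python
def find_peaks(spectrum, threshold):
    """Indices of local maxima above `threshold` on a circular 1-D grid.

    Returned indices are ordered from the strongest peak to the weakest.
    On a flat plateau only the first sample of the plateau is reported.
    """
    n = len(spectrum)
    peaks = []
    for i, value in enumerate(spectrum):
        left = spectrum[(i - 1) % n]
        right = spectrum[(i + 1) % n]
        if value > threshold and value > left and value >= right:
            peaks.append(i)
    return sorted(peaks, key=lambda i: spectrum[i], reverse=True)
```

For a grid with 30-degree steps, an index `i` returned by `find_peaks` corresponds to an azimuth of `30 * i` degrees.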

FIG. 1 is a block diagram showing the configuration of the sound source position estimation device 1000.

In the following, a case with two microphone arrays is described as an example, but a larger number of microphone arrays may be used.

Referring to FIG. 1, the sound source localization processing unit 50 of the sound source position estimation device 1000 includes the following components. A/D converters 54.1 and 54.2 receive p analog sound source signals each from a microphone array MC1, which includes microphones 52.11 to 52.1p (p: a natural number), and a microphone array MC2, which includes microphones 52.21 to 52.2p, perform analog-to-digital conversion, and each output p digital sound source signals. An eigenvector calculation unit 60 receives the p digital sound source signals output from each of the A/D converters 54.1 and 54.2 and outputs the correlation matrix required by the MUSIC method, together with its eigenvalues and eigenvectors, for each block of a predetermined duration, for example 100 milliseconds. A MUSIC processing unit 62 uses the eigenvectors output for each block from the eigenvector calculation unit 60 to output a MUSIC spatial spectrum by the MUSIC method. A sound source position estimation unit 64 estimates the direction of the sound source (in the present embodiment, the two angular coordinates φ and θ of three-dimensional polar coordinates) based on the MUSIC spatial spectrum output by the MUSIC processing unit 62, also estimates the direction of reflected sound using the information of the spatial three-dimensional map, and, having thus dynamically estimated the direct-sound and reflected-sound directions, outputs values representing the three-dimensional position of the sound source as a time series. In this specification, the "MUSIC response" is the MUSIC spatial spectrum obtained by the MUSIC algorithm, averaged according to a predetermined formula.

Although not particularly limited, in the present embodiment the A/D converters 54.1 and 54.2 convert the output of each microphone at a typical 16 kHz / 16 bits.

The eigenvector calculation unit 60 includes: a framing processing unit 80.1 that divides the p digital sound source signals, output by the A/D converter 54.1 from the signals of the microphone array MC1, into frames with a frame length of, for example, 4 milliseconds; an FFT processing unit 82.1 that applies an FFT (Fast Fourier Transform) to each of the p channels of framed sound source signals output by the framing processing unit 80.1 and outputs them converted into a predetermined number of frequency regions (hereinafter each frequency region is called a "bin" and the number of frequency regions the "number of bins"); a blocking processing unit 84.1 that groups the values of each bin of each channel, output every 4 milliseconds from the FFT processing unit 82.1, into blocks of 100 milliseconds; a correlation matrix calculation unit 86.1 that calculates and outputs, at predetermined intervals (every 100 milliseconds), a correlation matrix whose elements are the correlations between the values of each bin output from the blocking processing unit 84.1; and an eigenvalue decomposition unit 88.1 that performs eigenvalue decomposition of the correlation matrix output from the correlation matrix calculation unit 86.1 and outputs eigenvectors 92.1 to the MUSIC processing unit 62. For the p digital sound source signals output by the A/D converter 54.2 from the signals of the microphone array MC2, the eigenvector calculation unit 60 likewise includes a framing processing unit 80.2, an FFT processing unit 82.2, a blocking processing unit 84.2, a correlation matrix calculation unit 86.2, and an eigenvalue decomposition unit 88.2, which perform the same processing as for the signals from the microphone array MC1. Although this is also not particularly limited, in the present embodiment the band at or below 1 kHz, where spatial resolution is low, and the band at or above 6 kHz, where spatial aliasing can occur, are excluded from the frequency components of the sound source signals.
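The framing, transform, blocking, and correlation steps for one array can be sketched for a single frequency bin as below. This is a minimal illustration under stated assumptions: frames do not overlap, no window function is applied, a direct DFT stands in for the FFT, and the function names are hypothetical. None of these details are specified in the text.

```python
import cmath

def dft_bin(frame, k):
    """Value of DFT bin k for one frame (a real system would use an FFT)."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(frame))

def block_correlation(channels, frame_len, frames_per_block, k):
    """Spatial correlation matrix for bin k over the first block of samples.

    channels: list of M equal-length sample lists, one per microphone.
    Returns an M x M nested list holding the average over frames of x x^H,
    where x is the vector of bin-k values across the M channels.
    """
    m = len(channels)
    r = [[0j] * m for _ in range(m)]
    for f in range(frames_per_block):
        start = f * frame_len
        x = [dft_bin(ch[start:start + frame_len], k) for ch in channels]
        for a in range(m):
            for b in range(m):
                r[a][b] += x[a] * x[b].conjugate() / frames_per_block
    return r
```

With the figures given for the embodiment, `frame_len` would be 64 samples (4 milliseconds at 16 kHz) and `frames_per_block` would be 25 (100 milliseconds). The resulting matrix is Hermitian, which is what allows the eigenvalue decomposition in the next stage to yield real eigenvalues and orthogonal eigenvectors.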

Normally, 512 to 1024 points are used for the FFT (corresponding to 32 to 64 milliseconds at a 16 kHz sampling rate), but here one frame is set to 4 milliseconds (corresponding to 64 to 128 FFT points). Shortening the frame length in this way reduces not only the FFT computation but also the computation in the subsequent correlation matrix calculation, eigenvalue decomposition, and MUSIC response calculation. As a result, sound source localization can be performed in real time, without degrading performance, even on a relatively low-powered computer.
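The figures above can be checked with a few lines of arithmetic. The sketch below converts frame durations to sample counts at 16 kHz and lists which frequency bins fall strictly between the 1 kHz and 6 kHz band limits stated for the embodiment. The helper names are hypothetical, and because the text does not say whether a 64-point or a 128-point transform is applied to the 4-millisecond frame, the usage note evaluates both.

```python
SAMPLE_RATE_HZ = 16_000

def samples_in(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples in a frame of the given duration."""
    return sample_rate_hz * duration_ms // 1000

def bins_in_band(n_fft, low_hz=1000, high_hz=6000, sample_rate_hz=SAMPLE_RATE_HZ):
    """Bin indices whose centre frequency lies strictly between low_hz and high_hz."""
    return [k for k in range(n_fft // 2 + 1)
            if low_hz < k * sample_rate_hz / n_fft < high_hz]
```

Here `samples_in(4)` gives the frame length in samples, `samples_in(100) // samples_in(4)` gives the number of frames per block, and `len(bins_in_band(64))` and `len(bins_in_band(128))` give the number of bins that remain after the band restriction for each transform size. Fewer bins per frame means fewer correlation matrices and eigenvalue decompositions per block, which is where most of the saving comes from.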

The MUSIC processing unit 62 includes a microphone arrangement storage unit 100 for storing position vectors that represent, in a predetermined coordinate system, the position of each microphone included in the microphone arrays MC1 and MC2, and a MUSIC spatial spectrum calculation unit 104.1 that calculates and outputs a MUSIC spatial spectrum by the MUSIC method, assuming a fixed number of sound sources, using the microphone position vectors stored in the microphone arrangement storage unit 100 and the eigenvectors output from the eigenvalue decomposition unit 88.1. That the eigenvalues of the correlation matrix obtained for each block are related to the number of sound sources is already known and is described, for example, in F. Asano, M. Goto, K. Itou, and H. Asoh, “Real-time sound source localization and separation system and its application on automatic speech recognition,” in Eurospeech 2001, Aalborg, Denmark, 2001, pp. 1013-1016.

In the present embodiment, not only the two-dimensional azimuth of each sound source but also its elevation is estimated. For this purpose, a MUSIC algorithm capable of three-dimensional calculation is implemented. The pair of azimuth and elevation is hereinafter called the sound source direction (DOA). The algorithm executed by the MUSIC processing unit 62 does not estimate the distance to the sound source. Estimating only the sound source direction greatly reduces the processing time.
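A DOA given as azimuth and elevation can be converted to a unit direction vector and back as sketched below. The axis convention (azimuth measured in the x-y plane from the x axis, elevation measured up from that plane) and the function names are assumptions for illustration; the specification does not fix a convention in this passage.

```python
import math

def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Unit vector pointing from the array toward the given azimuth and elevation."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def unit_vector_to_doa(v):
    """Azimuth and elevation in degrees of a non-zero direction vector."""
    x, y, z = v
    norm = math.sqrt(x * x + y * y + z * z)
    sin_el = max(-1.0, min(1.0, z / norm))  # guard against rounding
    return (math.degrees(math.atan2(y, x)), math.degrees(math.asin(sin_el)))
```

Because a DOA carries no range information, the source can lie anywhere along the half-line from the array in this direction, which is why directions from more than one array, or from reflections, are needed to fix a position.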

The MUSIC processing unit 62 further includes a MUSIC response calculation unit 106.1 that, based on the MUSIC spatial spectrum calculated by the MUSIC spatial spectrum calculation unit 104.1, calculates and outputs a value called the MUSIC response for each direction in accordance with the MUSIC method.

The MUSIC processing unit 62 further includes a MUSIC spatial spectrum calculation unit 104.2 and a MUSIC response calculation unit 106.2, which apply to the eigenvectors output from the eigenvalue decomposition unit 88.2 the same processing as is applied to the eigenvectors output from the eigenvalue decomposition unit 88.1.

The sound source position estimation unit 64 includes buffers 108.1 and 108.2 for temporarily storing, in FIFO fashion, a predetermined number of the MUSIC response peaks calculated by the MUSIC response calculation units 106.1 and 106.2, respectively, in time-series order. In the sound source position estimation unit 64, a sound source direction/reflected sound direction estimation unit 110 estimates the direction of the sound source (the two angular coordinates φ and θ described above) from the MUSIC responses at the search points of each block stored in the buffers 108.1 and 108.2, and also dynamically estimates the direction of reflected sound using the information of the spatial three-dimensional map stored in a spatial three-dimensional map storage unit 112. A position estimation processing unit 114 in the sound source position estimation unit 64 then uses the dynamically estimated direct-sound and reflected-sound directions to estimate the three-dimensional position of the sound source by a procedure described later, and outputs it as a time series.
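To give a feel for how a three-dimensional map can turn an arrival direction into a reflected-sound path, the sketch below traces a ray from an array to a flat surface and mirrors it there. It assumes an infinite plane and a single specular reflection, and the function name and coordinates are hypothetical. The actual procedure of the embodiment is the one described later in the specification, not this sketch.

```python
def reflect_ray_off_plane(origin, direction, plane_point, plane_normal):
    """Trace a ray to a plane and mirror it there.

    Returns (hit_point, reflected_direction), or None if the ray is parallel
    to the plane or the plane lies behind the ray origin.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-12:
        return None  # ray parallel to the plane
    t = dot([p - o for p, o in zip(plane_point, origin)], plane_normal) / denom
    if t <= 0:
        return None  # plane is behind the ray origin
    hit = [o + t * d for o, d in zip(origin, direction)]
    scale = 2 * denom / dot(plane_normal, plane_normal)
    reflected = [d - scale * n for d, n in zip(direction, plane_normal)]
    return hit, reflected
```

If an arrival direction estimated at the array points at a ceiling, the source of that reflected sound would lie somewhere along the mirrored ray starting at `hit_point`, so the mirrored ray can be treated like an additional line of sight when looking for overlaps.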

The estimated values may be stored in an external storage device as time-series data or may, for example, be displayed on a display device in real time.

ここで、ＭＵＳＩＣ法では、狭帯域ＭＵＳＩＣ空間スペクトルの推定において、その時刻に発している指向性を持つ音源数（ＮＯＳ）を与える必要があるが、以下の説明では、固定数を与え、ＭＵＳＩＣ空間スペクトル上で、特定の閾値を超えたピークのみを指向性のある音源とみなすものとして説明する。 Here, in the MUSIC method, estimating the narrowband MUSIC spatial spectrum requires the number of directional sound sources (NOS) active at that time. In the following description, a fixed number is given, and only peaks of the MUSIC spatial spectrum that exceed a specific threshold are regarded as directional sound sources.

（ＭＵＳＩＣ法） (MUSIC method)

以下、上述した３次元での方位を算出するＭＵＳＩＣ法について、簡単にまとめる。 Hereinafter, the MUSIC method for calculating the three-dimensional direction described above is briefly summarized.

たとえば、Ｍ個のマイク入力のフーリエ変換Ｘｍ（ｋ、ｔ）は、式（１）のようにモデル化される。 For example, the Fourier transform Xm (k, t) of M microphone inputs is modeled as shown in Equation (1).

ただし、ベクトルｓ（ｋ、ｔ）はＮ個の音源のスペクトルＳ_n（ｋ、ｔ）から成る（ｎ＝１，…，Ｎ）。 However, the vector s (k, t) consists of N sound source spectra S _n (k, t) (n = 1,..., N).

すなわち、ｓ（ｋ、ｔ）＝［Ｓ₁（ｋ、ｔ）、…、Ｓ_N（ｋ、ｔ）］^Tである。ここで、ｋとｔはそれぞれ周波数と時間フレームのインデックスを示す。ベクトルｎ（ｋ、ｔ）は背景雑音を示す。行列Ａ_ｋは変換関数行列であり、その（ｍ、ｎ）要素はｎ番目の音源から、ｍ番目のマイクロホンへの直接パスの変換関数である。Ａ_ｋのｎ列目のベクトルをｎ番目の音源の位置ベクトル（ＳｔｅｅｒｉｎｇＶｅｃｔｏｒ）と呼ぶ。 That is, s(k, t) = [S_1(k, t), ..., S_N(k, t)]^T. Here, k and t are the frequency index and the time frame index, respectively. The vector n(k, t) represents background noise. The matrix A_k is the transfer function matrix; its (m, n) element is the transfer function of the direct path from the n-th sound source to the m-th microphone. The n-th column vector of A_k is called the steering vector of the n-th sound source.

まず、式（２）で定義される空間相関行列Ｒ_ｋを求め、式（３）に示すＲ_ｋの固有値分解により、固有値の対角行列Λ_ｋおよび固有ベクトルから成るＥ_ｋが求められる。 First, the spatial correlation matrix R_k defined by Equation (2) is computed. The eigenvalue decomposition of R_k shown in Equation (3) then yields the diagonal matrix Λ_k of eigenvalues and the matrix E_k whose columns are the eigenvectors.

固有ベクトルはＥ_ｋ＝［Ｅ_ｋｓ｜Ｅ_ｋｎ］のように分割出来る。Ｅ_ｋｓとＥ_ｋｎとはそれぞれ支配的なＮ個の固有値に対応する固有ベクトルと、それ以外の固有ベクトルとを示す。 The eigenvector matrix can be partitioned as E_k = [E_ks | E_kn], where E_ks contains the eigenvectors corresponding to the N dominant eigenvalues and E_kn contains the remaining eigenvectors.

ＭＵＳＩＣ空間スペクトルは式（４）と（５）とで求める。ｒは距離、θとφとはそれぞれ方位角と仰角とを示す。式（５）は、スキャンされる点（ｒ、θ、φ）における正規化した位置ベクトルである。 The MUSIC spatial spectrum is obtained from Equations (4) and (5). Here, r is the distance, and θ and φ are the azimuth and elevation angles, respectively. Equation (5) gives the normalized steering vector at the scanned point (r, θ, φ).

ＭＵＳＩＣ応答は、ＭＵＳＩＣ空間スペクトルを式（６）のように平均化したものである。 The MUSIC response is obtained by averaging the MUSIC spatial spectrum as shown in Equation (6).
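
Equations (1) to (6) are rendered as images in the original publication and do not survive in this text. As a reading aid, the following is a sketch of the standard narrowband MUSIC formulation written with the symbols defined above. It is an assumption based on the surrounding definitions, not a transcription of the original equations. The expectation operator E[·], the Hermitian transpose (·)^H, and the symbol a_k for the unnormalized steering vector are notation introduced here.

```latex
\begin{align}
\mathbf{x}(k,t) &= \mathbf{A}_k\,\mathbf{s}(k,t) + \mathbf{n}(k,t) \tag{1}\\
\mathbf{R}_k &= E\!\left[\mathbf{x}(k,t)\,\mathbf{x}^{H}(k,t)\right] \tag{2}\\
\mathbf{R}_k &= \mathbf{E}_k\,\boldsymbol{\Lambda}_k\,\mathbf{E}_k^{-1} \tag{3}\\
P(r,\theta,\varphi,k) &= \frac{1}{\left|\tilde{\mathbf{a}}_k^{H}(r,\theta,\varphi)\,\mathbf{E}_{kn}\right|^{2}} \tag{4}\\
\tilde{\mathbf{a}}_k(r,\theta,\varphi) &= \frac{\mathbf{a}_k(r,\theta,\varphi)}{\left|\mathbf{a}_k(r,\theta,\varphi)\right|} \tag{5}\\
\bar{P}(r,\theta,\varphi) &= \frac{1}{K}\sum_{k=k_L}^{k_H} P(r,\theta,\varphi,k) \tag{6}
\end{align}
```

In this form, x(k, t) stacks the M microphone spectra X_m(k, t), and the spectrum in (4) becomes large where the scanned steering vector is nearly orthogonal to the noise subspace E_kn.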

式（６）においてｋ_Lおよびｋ_Hは、それぞれ周波数帯域の下位と上位の境界のインデックスであり、Ｋ＝ｋ_H−ｋ_L＋１である。マイクロホンアレイに到来する音の方位は、ＭＵＳＩＣ応答のピークを探索することにより求められる。 In Equation (6), k_L and k_H are the indices of the lower and upper boundaries of the frequency band, respectively, and K = k_H − k_L + 1. The direction of sound arriving at the microphone array is obtained by searching for peaks of the MUSIC response.

（空間情報と反射音を利用した複数アレイによる音源定位） (Sound source localization with multiple arrays using spatial information and reflected sound)

図２は、本実施の形態における音源位置推定装置１０００の処理の概要を示す概念図である。 FIG. 2 is a conceptual diagram showing an outline of the processing of the sound source position estimation apparatus 1000 in the present embodiment.

図２を参照して、音源位置推定装置１０００は、複数のマイクロホンアレイにおいて、ＭＵＳＩＣスペクトログラムによる音源方向推定（マイクロホンアレイに到来する音の方位の推定）を行い、空間３次元地図情報とマイクロホンアレイの位置情報を用いて反射音の方向も推定し、これらの情報を統合して複数の音源位置の推定を行う。 Referring to FIG. 2, the sound source position estimation apparatus 1000 performs sound source direction estimation (estimation of the direction of sound arriving at each microphone array) based on MUSIC spectrograms at a plurality of microphone arrays, also estimates the directions of reflected sound using the spatial three-dimensional map information and the position information of the microphone arrays, and integrates this information to estimate the positions of a plurality of sound sources.

音源位置推定装置１０００が、音源の位置を推定する原理は、複数のマイクロホンアレイを用いて音源方向を推定し、空間内のマイクロホンアレイの位置と向きが既知である場合、それぞれのマイクロホンアレイで推定された音源方向が重なった位置に音源が存在する確率が高い、ということである。 The principle by which the sound source position estimation apparatus 1000 estimates the position of a sound source is as follows: the sound source direction is estimated with each of a plurality of microphone arrays, and, provided the positions and orientations of the arrays in the space are known, a sound source is likely to exist where the directions estimated by the individual arrays overlap.

また、空間内のマイクロホンアレイの位置、音が反射しやすい天井や壁やディスプレイなどとの位置関係によって、アレイで反射音が測定される場合があり、音源位置推定装置１０００は、１つのマイクロホンアレイでも反射音と直接音の方向が検出された場合、反射音を壁や天井で反転させた方向と直接音が重なった位置に音源が存在する確率が高いと判断する。すなわち、従来のマイクロホンアレイ処理では、反射音は音源定位や音源分離に悪影響を与えるものとして扱うのに対して、音源位置推定装置１０００では、逆に反射音の情報を利用することで、音源の位置を推定する。 In addition, depending on the position of a microphone array in the space and its positional relationship to surfaces that readily reflect sound, such as the ceiling, walls, or a display, the array may observe reflected sound. When even a single microphone array detects the directions of both a reflected sound and a direct sound, the sound source position estimation apparatus 1000 judges that a sound source is likely to exist where the direction obtained by mirroring the reflected sound at the wall or ceiling overlaps the direct-sound direction. In other words, whereas conventional microphone array processing treats reflected sound as something that degrades sound source localization and sound source separation, the sound source position estimation apparatus 1000 instead exploits the reflected-sound information to estimate the position of the sound source.

図３は、以上説明した音源位置推定装置１０００による音源位置の推定処理をさらに詳しく説明するための概念図である。 FIG. 3 is a conceptual diagram for explaining the sound source position estimation processing by the sound source position estimating apparatus 1000 described above in more detail.

図３においては、対象となる空間である室内の床の２次元の各方向をｘ方向およびｙ方向とし、天井から床に向かう方向をｚ方向としている。 In FIG. 3, the two-dimensional directions of the indoor floor, which is the target space, are the x direction and the y direction, and the direction from the ceiling to the floor is the z direction.

天井に２つのマイクロホンアレイＭＣ１およびＭＣ２が設置されており、室内には、ｙ＝０の壁面の前に、反射面を構成するディスプレイＤＰが配置されているものとする。 It is assumed that two microphone arrays MC1 and MC2 are installed on the ceiling, and a display DP that constitutes a reflective surface is arranged in the room in front of the wall surface at y = 0.

なお、図３において、楕円形の印は、室内にいる人間の頭部の位置を示しているものとする。 In FIG. 3, it is assumed that an oval mark indicates the position of a human head in the room.

また、図４は、音源の位置推定を行う空間についての空間３次元地図を示す概念図である。 FIG. 4 is a conceptual diagram showing a spatial three-dimensional map for a space where the position of a sound source is estimated.

図３を参照して、音源位置推定装置１０００の音源方向・反射音方向推定部１１０は、定位された音源が反射音であるか否かは予め分からないため、まず推定されたすべての音源方向を、図４の空間３次元地図で位置が特定される、壁や天井で反転させる。反射は空間内で複数生じ得るが、所定の回数、たとえば、２度目以降の反射は強度も指向性も衰えることを考慮し、反転は、所定の回数未満（ここでは１度）のみ行うこととする。 Referring to FIG. 3, the sound source direction / reflected sound direction estimation unit 110 of the sound source position estimation apparatus 1000 cannot know in advance whether a localized source is a reflected sound, so it first mirrors every estimated sound source direction at the walls and ceiling whose positions are specified in the spatial three-dimensional map of FIG. 4. Multiple reflections can occur in the space, but because reflections beyond a certain order (for example, the second and later reflections) lose both intensity and directivity, mirroring is performed fewer than a predetermined number of times (here, only once).
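
The single mirroring step can be illustrated with a minimal sketch, assuming the reflecting surface is an axis-aligned plane such as the wall at x = 0 or y = 0. The function name `reflect_ray` and its argument layout are hypothetical and are not taken from the patent; a general map of arbitrarily oriented surfaces would need a plane normal instead of an axis index.

```python
def reflect_ray(origin, direction, axis, wall_coord):
    """Mirror a detected arrival ray at an axis-aligned wall (illustrative only).

    origin     -- array position (x, y, z)
    direction  -- vector pointing from the array toward the detected sound
    axis       -- 0 for a wall at x = wall_coord, 1 for a wall at y = wall_coord
    Returns (hit_point, mirrored_direction), or None if the ray never
    reaches the wall.
    """
    d_axis = direction[axis]
    if d_axis == 0:
        return None  # the ray runs parallel to the wall
    t = (wall_coord - origin[axis]) / d_axis
    if t <= 0:
        return None  # the wall lies behind the array along this direction
    hit = tuple(o + t * d for o, d in zip(origin, direction))
    # Flip only the component normal to the wall; the others are unchanged.
    mirrored = tuple(-d if i == axis else d for i, d in enumerate(direction))
    return hit, mirrored
```

If the detected sound was in fact a reflection, the source lies somewhere on the ray that starts at `hit_point` and continues along `mirrored_direction`; if it was a direct sound, the mirrored ray simply fails to overlap any other detected direction.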

そして、音源方向・反射音方向推定部１１０は、３次元空間を考慮し、方位角および仰角で音源方向を表現する。 Then, the sound source direction / reflected sound direction estimation unit 110 represents the sound source direction with an azimuth angle and an elevation angle in consideration of a three-dimensional space.

推定された方向には、角度の誤差（Angle uncertainty: ＡＵ）があり、マイクロホンアレイからの距離に応じて推定位置の誤差（Position uncertainty: ＰＵ）が大きくなる。この性質を考慮し、推定位置誤差を以下の式で求める。 There is an angle error (Angle uncertainty: AU) in the estimated direction, and an error in the estimated position (Position uncertainty: PU) increases according to the distance from the microphone array. Considering this property, the estimated position error is obtained by the following equation.

ＰＵ（ｄ）＝±ＡＵ／３６０×２π×ｄ PU(d) = ±AU / 360 × 2π × d

ここで、ｄはアレイの中心からの距離である。例えば、球面上で５度の分解能で音源方向が検知された場合（ＡＵ＝５）、マイクロホンアレイから１メートル離れた位置に音源がある場合（ｄ＝１ｍ）、その方向に直線を１メートル伸ばした際の推定位置誤差は、±８．７ｃｍとなる。２メートルの場合、誤差はその倍の±１７．４ｃｍとなる。 Here, d is the distance from the center of the array. For example, when the sound source direction is detected with a resolution of 5 degrees on the sphere (AU = 5) and the sound source is 1 m from the microphone array (d = 1 m), the estimated position error after extending a straight line 1 m in that direction is ±8.7 cm. At 2 m, the error doubles to ±17.4 cm.
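
The formula is the arc length subtended by AU degrees at radius d. A minimal sketch that reproduces the worked example follows; the function name `position_uncertainty` is hypothetical.

```python
import math


def position_uncertainty(au_deg, d):
    """PU(d) = AU / 360 * 2 * pi * d, in the same length unit as d."""
    return au_deg / 360.0 * 2.0 * math.pi * d


# AU = 5 degrees, as in the worked example above.
pu_1m = position_uncertainty(5.0, 1.0)  # about 0.087 m
pu_2m = position_uncertainty(5.0, 2.0)  # about 0.175 m
```

Because PU is linear in d, doubling the distance doubles the error, which matches the 1 m and 2 m figures in the text.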

位置推定処理部１１４は、たとえば、検知された２つの方向が上述の誤差を考慮して空間上で重なっているか否かを判定する方法として、それぞれの方向に直線を引き、２直線の最短距離を幾何学の公式を用いて推定する。この最短距離がそれぞれの直線における誤差（ＰＵ）を足した値よりも小さい場合、これらの直線は重なっているとみなす。また、検出された方向の重なりが生じた位置に音源が存在するとみなす。 For example, the position estimation processing unit 114 draws a straight line in each direction as a method of determining whether or not two detected directions overlap in the space in consideration of the above-described error, and the shortest distance between the two straight lines. Is estimated using a geometric formula. If this shortest distance is smaller than the sum of errors (PU) in the respective straight lines, these straight lines are regarded as overlapping. Further, it is assumed that the sound source exists at the position where the overlap in the detected direction occurs.

検出されたすべての直接音と反射音の方向に引いた直線の距離をペア毎に求め、方向の重なりを複数探索する。重なりがあった場合は、平均位置を音源の推定位置とする。重なりがない場合は方向情報を保留とし、重なりが生じた時点で、位置を割り当てる。 The distances of straight lines drawn in the directions of all detected direct sounds and reflected sounds are obtained for each pair, and a plurality of overlapping directions are searched. If there is an overlap, the average position is set as the estimated position of the sound source. When there is no overlap, the direction information is reserved, and a position is assigned when the overlap occurs.
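
The pairwise overlap search can be sketched as follows. This is an illustrative reading of the procedure, not the patented implementation: the names `closest_points`, `pu`, and `locate` are hypothetical, each detected direction is represented as a tuple (origin, direction, travelled), and "average position" is interpreted as the midpoint of the two closest points. The `travelled` field (0 for a direct sound, the array-to-wall distance for a mirrored ray) is an assumption made so that d in PU(d) measures the full path length from the array.

```python
import math
from itertools import combinations


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _along(p, u, t):
    return tuple(x + t * y for x, y in zip(p, u))


def closest_points(p1, u1, p2, u2):
    """Closest points on the two lines p1 + t1*u1 and p2 + t2*u2."""
    w = tuple(x - y for x, y in zip(p1, p2))
    a, b, c = _dot(u1, u1), _dot(u1, u2), _dot(u2, u2)
    d, e = _dot(u1, w), _dot(u2, w)
    denom = a * c - b * b
    if denom < 1e-12:            # lines are (nearly) parallel
        t1, t2 = 0.0, e / c
    else:
        t1 = (b * e - c * d) / denom
        t2 = (a * e - b * d) / denom
    return _along(p1, u1, t1), _along(p2, u2, t2)


def pu(au_deg, d):
    """Position uncertainty PU(d) = AU / 360 * 2 * pi * d."""
    return au_deg / 360.0 * 2.0 * math.pi * d


def locate(rays, au_deg=5.0):
    """Return one position estimate per overlapping pair of directions."""
    estimates = []
    for (p1, u1, s1), (p2, u2, s2) in combinations(rays, 2):
        q1, q2 = closest_points(p1, u1, p2, u2)
        gap = math.dist(q1, q2)                  # shortest line-to-line distance
        d1 = s1 + math.dist(p1, q1)              # path length from array 1
        d2 = s2 + math.dist(p2, q2)              # path length from array 2
        if gap < pu(au_deg, d1) + pu(au_deg, d2):
            estimates.append(tuple((x + y) / 2 for x, y in zip(q1, q2)))
    return estimates
```

The sketch treats each direction as an infinite line, as the text does. A practical version would probably also discard closest points that fall behind an array, and would hold non-overlapping directions until a later block produces an overlap.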

上記のような直線を用いる方法は、リアルタイムでの位置推定に適している。なお、位置推定処理部１１４による音源位置の推定処理としては、リアルタイム性には劣るものの、たとえば、上述した推定された音源方向および反射音方向を軸とし、上記所定の誤差を許容する範囲（円錐形状となる）について、投票処理を行い、複数の方向からの投票が重なることで、所定値以上の投票結果となる領域を音源の位置と推定するなど、他の方法を用いることも可能である。 The method using straight lines described above is suitable for real-time position estimation. The position estimation processing unit 114 may also use other methods that are less suited to real-time operation. For example, it may perform voting over the regions (cones) that take the estimated sound source directions and reflected sound directions as axes and allow for the predetermined error described above, and estimate as the sound source position any region where votes from multiple directions overlap and the vote count reaches a predetermined value.

［データ収集および分析結果］ [Data collection and analysis results]

以下、上述したような音源位置の推定処理についての実験結果を説明する。 Hereinafter, experimental results for the sound source position estimation processing described above are presented.

（マイクロホンアレイ） (Microphone array)

図５は、実験に使用した１６素子のマイクロホンアレイの形状を示す図である。 FIG. 5 is a diagram showing the shape of the 16-element microphone array used in the experiment.

図５に示すように、３次元空間における方位角および仰角を求めるため、マイクは直径３０ｃｍの半球面上に配置した。 As shown in FIG. 5, in order to obtain the azimuth angle and elevation angle in a three-dimensional space, the microphone was placed on a hemispherical surface having a diameter of 30 cm.

多チャンネルオーディオキャプチャデバイスとして、１６−ｃｈａｎｎｅｌＡ／Ｄ変換機を使用した。マイクは、エレクトレットコンデンサマイクロホンを用い、１６ｋＨｚ／１６ｂｉｔｓでサンプリングを行った。 A 16-channel A / D converter was used as a multi-channel audio capture device. The microphone used an electret condenser microphone and sampled at 16 kHz / 16 bits.

ＭＵＳＩＣ空間スペクトルによる音源方向推定のパラメータとして、音源の固定数を３とし、ＭＵＳＩＣパワーの閾値を２．５ｄＢとし、同時に発する音源の最大数を６に設定した。 As parameters for sound source direction estimation using the MUSIC spatial spectrum, the fixed number of sound sources was set to 3, the threshold of MUSIC power was set to 2.5 dB, and the maximum number of sound sources to be emitted simultaneously was set to 6.

また、ＭＵＳＩＣ空間スペクトルを求める際に用いる周波数帯域に関しては、空間的歪み（spatial aliasing）と低周波数帯域における低い分解能を避けるため、１０００〜５０００Ｈｚの帯域を用いた。 As for the frequency band used when obtaining the MUSIC spatial spectrum, a band of 1000 to 5000 Hz was used in order to avoid spatial aliasing and low resolution in the low frequency band.
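
As a worked illustration of how this band maps onto the bin indices k_L and k_H used in Equation (6), the following sketch assumes the 16 kHz sampling rate stated above and a 64-point FFT. The FFT size is an assumption (64 samples correspond to a 4 ms frame at 16 kHz); the text does not state the size used in the experiment. The function name `band_bins` is hypothetical.

```python
def band_bins(f_low_hz, f_high_hz, fs_hz, n_fft):
    """Return (k_L, k_H, K) for the FFT bins lying inside [f_low_hz, f_high_hz].

    Bin k is centred on k * fs_hz / n_fft. Integer arithmetic avoids
    floating-point rounding at the band edges.
    """
    k_low = -((-f_low_hz * n_fft) // fs_hz)   # ceiling division
    k_high = (f_high_hz * n_fft) // fs_hz     # floor division
    return k_low, k_high, k_high - k_low + 1


# Assumed values: 1000-5000 Hz band, 16 kHz sampling, 64-point FFT.
k_L, k_H, K = band_bins(1000, 5000, 16000, 64)
```

With these assumed values the bins are 250 Hz apart, giving k_L = 4, k_H = 20, and K = 17.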

音源方向推定の探索空間は、３次元空間で球面上５度間隔の分解能に設定する。マイクロホンアレイを天井に取り付けるため、方位角は０〜３６０度、仰角は−５度〜−８０度に制限した。−８５〜−９０度（アレイの真上の方向）には、アレイの形状より音源が存在しない場合にもＭＵＳＩＣ空間スペクトルにピークが生じるため、その領域を探索空間から除外している。 The search space for sound source direction estimation was set to a resolution of 5 degrees on a sphere in three-dimensional space. Because the microphone arrays are mounted on the ceiling, the azimuth was limited to 0 to 360 degrees and the elevation to −5 to −80 degrees. In the range of −85 to −90 degrees (the direction directly above the array), the shape of the array produces a peak in the MUSIC spatial spectrum even when no sound source is present, so that region is excluded from the search space.

（評価データの収集） (Collection of evaluation data)

図６は、実験に用いた室内において、マイクロホンアレイの設置の態様を説明するための図である。 FIG. 6 is a diagram for explaining how the microphone arrays were installed in the room used for the experiment.

図６（ａ）は、実験に用いたマイクロホンアレイの配置を示す図である。 FIG. 6A is a diagram showing the arrangement of the microphone array used in the experiment.

また、図６（ｂ）に示すように、２つのマイクロホンアレイＭＣ１およびＭＣ２を天井に取り付けた。マイクロホンアレイと天井の間には吸音素材を入れ込み、天井での反射は扱わないこととした。また、床は反射しにくいタイルカーペットであり、反射が生じたとしても天井にあるアレイへの距離が大きいため、床での反射は扱わないものとする。 Further, as shown in FIG. 6B, two microphone arrays MC1 and MC2 were attached to the ceiling. Sound absorbing material is inserted between the microphone array and the ceiling, and reflection from the ceiling is not handled. Also, the floor is a tile carpet that is difficult to reflect, and even if reflection occurs, the distance to the array on the ceiling is large, so reflection on the floor is not handled.

従って、以下の実験では、推定された音源方向を壁で一回のみ反転させることとした。ただし、マイクロホンアレイと反射面との距離について、予め所定の距離以上となった場合に、反射の影響を無視するという条件を、音源方向・反射音方向推定部１１０の行う処理に事前に設定しておいてもよい。 Therefore, in the following experiment, each estimated sound source direction was mirrored only once, at a wall. A condition that ignores the effect of reflection when the distance between a microphone array and a reflecting surface is equal to or greater than a predetermined distance may also be set in advance in the processing performed by the sound source direction / reflected sound direction estimation unit 110.

音源の向きによって、その音源の指向性が変化し、同じ位置でもアレイに対する向きによってアレイで観測される指向性の強度が変化する。そこで、実験においては、人が発した音声を対象音源とし、人が異なる方向を向いて発声した場合の音源位置の推定処理について説明する。また、環境に固定されたエアコン（図６の左上）もアレイに対して指向性を持つ雑音源となる。 The directivity of a sound source as seen from an array changes with the orientation of the source: even at the same position, the intensity of the directivity observed by the array varies with the source's orientation relative to the array. The experiment therefore uses human speech as the target sound source and examines sound source position estimation when the speaker faces different directions. The air conditioner fixed in the environment (upper left in FIG. 6) also acts as a directional noise source for the arrays.

対象音源の位置として、図６（ａ）の机の周り６か所を固定し、各位置において、前、左、後ろ、右の４つの向きでデータを収録した。エアコンはスイッチオンの状態にした。特に、人が音源の場合、正確に音源の位置を固定することは難しいが、向きを変えた際に、口の位置ができるだけ変わらないようにした。 Six locations around the desk in FIG. 6(a) were fixed as the positions of the target sound source, and data were recorded at each position in four orientations: front, left, back, and right. The air conditioner was switched on. When the sound source is a person, it is difficult to fix the source position precisely, but care was taken to keep the position of the mouth as unchanged as possible when the speaker changed orientation.

図７は、測定した音源の位置とマイクロホンアレイの位置についての数値を示す図である。 FIG. 7 is a diagram showing numerical values for the measured position of the sound source and the position of the microphone array.

また、図８は、測定した音源の位置とマイクロホンアレイの位置を示す図である。 FIG. 8 is a diagram showing the measured position of the sound source and the position of the microphone array.

図８においては、音源の位置および向きとアレイの位置を部屋の上面図に重ねて示す。 In FIG. 8, the positions and orientations of the sound sources and the positions of the arrays are shown superimposed on a top view of the room.

ｘ＝０およびｙ＝０の平面には壁が存在する。ｘ＝７４００ｍｍ、およびｙ＝−５６００ｍｍにも壁が存在するが、アレイから離れているため、実験においては、床の場合と同様に、反射音推定には用いていない。 There are walls in the planes x = 0 and y = 0. There are also walls at x = 7400 mm and y = −5600 mm, but because they are far from the arrays, they were not used for reflected sound estimation in the experiment, as with the floor.

図８では、音源の位置として６か所（“１”〜“６”）が設定され、音源の向きは、各音源において４方向（“Ｆ”、“Ｌ”、“Ｂ”、“Ｒ”）が設定されている。つまり、ここでは、設定される方向は、壁に平行な４つの方向である。 In FIG. 8, six sound source positions (“1” to “6”) are set, and four orientations (“F”, “L”, “B”, “R”) are set for each position. That is, the orientations used here are the four directions parallel to the walls.

（音源の種類およびアレイに対する音源の向きの影響） (Influence of sound source type and of source orientation relative to the arrays)

実験では、それぞれのアレイで測定された音源方向と反射音が実際発した音源位置をどの程度精度よく推定可能であるかを評価する。 The experiment evaluates how accurately the sound source directions and reflected sounds measured by each array can estimate the position of the source that actually emitted the sound.

評価尺度として、各アレイで検出された各音源方向に対する直線と、図７に示した対象音源の座標位置との距離を測定した。ここでは、音源方向推定誤差による位置推定誤差の他に、対象音源の位置が必ずしも正確ではないことも考慮し、位置推定誤差が４０ｃｍ以内であれば、その方向は対象音源が発しているものとみなす。 As an evaluation measure, the distance between the straight line along each sound source direction detected by each array and the coordinates of the target sound source shown in FIG. 7 was measured. In addition to the position error caused by direction estimation error, the position of the target sound source itself is not necessarily exact. Taking both into account, a detected direction is regarded as originating from the target sound source if its position estimation error is within 40 cm.

各マイクロホンアレイで観測された各方向に対し、上述の条件を満たしたブロックの数を発話区間の全ブロック数で割ったものを検出レートとする。 For each direction observed in each microphone array, the detection rate is obtained by dividing the number of blocks satisfying the above condition by the total number of blocks in the speech section.
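
The detection rate can be sketched as below for one array and one direction category (direct sound, or a reflection at one wall). The function names and the data layout (one list of (origin, direction) detections per block, in metres) are hypothetical choices made for illustration.

```python
import math


def point_to_line_distance(point, origin, direction):
    """Distance from `point` to the line origin + t * direction."""
    w = [p - o for p, o in zip(point, origin)]
    t = sum(a * b for a, b in zip(w, direction)) / sum(d * d for d in direction)
    foot = [o + t * d for o, d in zip(origin, direction)]
    return math.dist(point, foot)


def detection_rate(blocks, target, threshold_m=0.40):
    """Fraction of blocks containing a detection within threshold_m of target.

    blocks -- one list of (origin, direction) detections per block of the
              utterance; an empty list means nothing was detected.
    """
    if not blocks:
        return 0.0
    hits = sum(
        any(point_to_line_distance(target, o, u) <= threshold_m for o, u in block)
        for block in blocks
    )
    return hits / len(blocks)
```

A block counts at most once even if several detected directions pass within 40 cm of the target, so the rate stays between 0 and 1.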

図９は、音源の位置（“１”〜“６”）と向き（“Ｆ”、“Ｌ”、“Ｂ”、“Ｒ”）の条件ごとに検出レートを示す図である。 FIG. 9 is a diagram showing the detection rate for each condition of the position (“1” to “6”) and the direction (“F”, “L”, “B”, “R”) of the sound source.

なお、図９においては、マイクロホンアレイＭＣ１を“Ａｒｒａｙ１”と表示し、マイクロホンアレイＭＣ２を“Ａｒｒａｙ２”と表示している。 In FIG. 9, the microphone array MC1 is displayed as “Array1”, and the microphone array MC2 is displayed as “Array2”.

それぞれのアレイ（“Ａｒｒａｙ１”、 “Ａｒｒａｙ２”）で検出された音源方向は、直接音（“ｄ”）、平面ｙ＝０での反射音（“ｒｙ”）および平面ｘ＝０での反射音（“ｒｘ”）に分けて結果を表示している。 The sound source directions detected by the respective arrays (“Array 1”, “Array 2”) are the direct sound (“d”), the reflected sound at the plane y = 0 (“ry”), and the reflected sound at the plane x = 0. The results are displayed separately for (“rx”).

図９の結果より、まず音源の位置と方向によってそれぞれのアレイで直接音（ｄ）および反射音（ｒｘ，ｒｙ）が観測される率が変化することが分かる。これは、音源の位置と向きによって、アレイが「見えている」のか、壁が「見えている」のかに依存する。 From the result of FIG. 9, it can be seen that the rate at which the direct sound (d) and the reflected sound (rx, ry) are observed in each array varies depending on the position and direction of the sound source. This depends on whether the array is “visible” or the wall is “visible” depending on the position and orientation of the sound source.

例えば図９の “１Ｌ”の条件では、Ａｒｒａｙ１の直接音ｄと反射音ｒｘがおよそ０．８の率で検出されている。また、Ａｒｒａｙ２では、反射音ｒｘがおよそ０．６の率で検出され、直接音はほとんど検出されていない。 For example, under the condition “1L” in FIG. 9, the direct sound d and the reflected sound rx of Array 1 are detected at a rate of approximately 0.8. In Array 2, the reflected sound rx is detected at a rate of approximately 0.6, and almost no direct sound is detected.

音源位置推定においては、同じ音源に対し、複数（少なくとも２つ）の方向が検出されれば、その重なった位置に音源が存在すると判定することができる。例えば、図９の“６Ｒ”の条件で、０．９以上の率で両アレイの直接音が重なって観測されている。 In sound source position estimation, if a plurality of (at least two) directions are detected for the same sound source, it can be determined that a sound source exists at the overlapping position. For example, under the condition of “6R” in FIG. 9, the direct sounds of both arrays are observed at a rate of 0.9 or higher.

図９において、直接音が高い率で上位を占めている条件は、｛２Ｆ，３Ｆ，４Ｆ，５Ｆ，６Ｆ，３Ｌ，４Ｌ，４Ｂ，５Ｂ，１Ｒ，２Ｒ，５Ｒ，６Ｒ｝で、全条件のおよそ半分を占めている。平面ｘ＝０での反射音（ｒｘ）が上位に入っている条件は、Ａｒｒａｙ１の場合｛１Ｌ，２Ｌ，５Ｌ，６Ｌ，６Ｂ｝となっている。これらの条件は、平面ｘ＝０の壁に近く、その方向を向いている条件である。また、平面ｙ＝０での反射音（ｒｙ）が上位に入っている条件は、Ａｒｒａｙ１の場合は｛１Ｆ，６Ｆ｝で、Ａｒｒａｙ２の場合は｛３Ｆ，４Ｆ，４Ｒ｝となっている。 In FIG. 9, the conditions in which the direct sound ranks highest with a high rate are {2F, 3F, 4F, 5F, 6F, 3L, 4L, 4B, 5B, 1R, 2R, 5R, 6R}, which is about half of all conditions. The conditions in which the reflected sound at the plane x = 0 (rx) ranks high are {1L, 2L, 5L, 6L, 6B} for Array1. In these conditions the source is close to the wall at x = 0 and faces it. The conditions in which the reflected sound at the plane y = 0 (ry) ranks high are {1F, 6F} for Array1 and {3F, 4F, 4R} for Array2.

したがって、本実施の形態の音源位置推定装置１０００では、少なくとも、人の発話のような指向性での音源については、高い確率で音源位置の推定を実行することが可能であるといえる。 Therefore, the sound source position estimation apparatus 1000 of the present embodiment can be said to estimate the sound source position with high probability, at least for sound sources with a directivity like that of human speech.

図１０は、図９と同様の条件における平均位置推定誤差を示す図である。 FIG. 10 is a diagram showing an average position estimation error under the same conditions as in FIG.

ただし、図１０のグラフで誤差が０の点は、その条件で方向が検出されなかった場合を示している。 However, the point where the error is 0 in the graph of FIG. 10 indicates a case where the direction is not detected under the condition.

図１０に示すように、直接音でも反射音でも平均誤差は１００〜３００ｍｍの範囲で検出されている。 As shown in FIG. 10, the average error is in the range of 100 to 300 mm for both direct sound and reflected sound.

［コンピュータによる実現］ [Implementation by computer]

上記した音源定位処理部５０は、実際にはコンピュータハードウェアと、当該コンピュータハードウェアにより実行されるコンピュータプログラムとにより、ハードウェアとソフトウェアとの協働により実現される。以下、音源定位処理部５０の機能を実現するためのコンピュータプログラムの動作について簡単に説明する。 The sound source localization processing unit 50 described above is in practice realized through the cooperation of hardware and software, namely computer hardware and a computer program executed by that hardware. The operation of the computer program that realizes the functions of the sound source localization processing unit 50 is briefly described below.

図１１は、このようなコンピュータプログラムを実行するためのコンピュータシステム２０００のハードウェア構成をブロック図形式で示す図である。 FIG. 11 is a block diagram showing the hardware configuration of a computer system 2000 for executing such a computer program.

図１１に示されるように、このコンピュータシステム２０００を構成するコンピュータ本体２０１０は、ディスクドライブ２０３０およびメモリドライブ２０２０に加えて、それぞれバス２０５０に接続されたＣＰＵ（Central Processing Unit ）２０４０と、ＲＯＭ（Read Only Memory) ２０６０およびＲＡＭ（Random Access Memory）２０７０を含むメモリと、不揮発性の書換え可能な記憶装置、たとえば、ハードディスク２０８０と、ネットワークを介しての通信を行うための通信インタフェース２０９０と、マイクロホンアレイＭＣ１およびＭＣ２と信号の授受を行うための音声入力インタフェース２０９２とを含んでいる。ディスクドライブ２０３０には、ＣＤ−ＲＯＭ２２００などの光ディスクが装着される。メモリドライブ２０２０にはメモリカード２２１０が装着される。 As shown in FIG. 11, the computer main body 2010 constituting the computer system 2000 includes, in addition to a disk drive 2030 and a memory drive 2020, the following components, each connected to a bus 2050: a CPU (Central Processing Unit) 2040; memory including a ROM (Read Only Memory) 2060 and a RAM (Random Access Memory) 2070; a nonvolatile rewritable storage device such as a hard disk 2080; a communication interface 2090 for communicating via a network; and an audio input interface 2092 for exchanging signals with the microphone arrays MC1 and MC2. An optical disc such as a CD-ROM 2200 is loaded into the disk drive 2030. A memory card 2210 is loaded into the memory drive 2020.

後に説明するように、音源位置推定装置のプログラムが動作するにあたっては、その動作の基礎となる情報を格納するデータベースは、ハードディスク２０８０に格納されるものとして説明を行う。 As will be described later, when the program of the sound source position estimation apparatus operates, a database that stores information serving as the basis of the operation will be described as being stored in the hard disk 2080.

なお、図１１では、コンピュータ本体に対してインストールされるプログラム等の情報を記録可能な媒体として、ＣＤ−ＲＯＭ２２００を想定しているが、他の媒体、たとえば、ＤＶＤ−ＲＯＭ（Digital Versatile Disc）などでもよく、あるいは、メモリカードやＵＳＢメモリなどでもよい。その場合は、コンピュータ本体２０１０には、これらの媒体を読取ることが可能なドライブ装置が設けられる。 In FIG. 11, the CD-ROM 2200 is assumed as the medium on which information such as a program to be installed in the computer main body can be recorded, but another medium such as a DVD-ROM (Digital Versatile Disc), a memory card, or a USB memory may be used instead. In that case, the computer main body 2010 is provided with a drive device capable of reading that medium.

音源位置推定装置１０００の主要部は、コンピュータハードウェアと、ＣＰＵ２０４０により実行されるソフトウェアとにより構成される。一般的にこうしたソフトウェアはＣＤ−ＲＯＭ２２００等の記憶媒体に格納されて流通し、ディスクドライブ２０３０等により記憶媒体から読取られてハードディスク２０８０に一旦格納される。または、当該装置がネットワーク３１０に接続されている場合には、ネットワーク上のサーバから一旦ハードディスク２０８０にコピーされる。そうしてさらにハードディスク２０８０からメモリ中のＲＡＭ２０７０に読出されてＣＰＵ２０４０により実行される。なお、ネットワーク接続されている場合には、ハードディスク２０８０に格納することなくＲＡＭに直接ロードして実行するようにしてもよい。 The main part of the sound source position estimation apparatus 1000 is composed of computer hardware and software executed by the CPU 2040. Generally, such software is stored and distributed in a storage medium such as a CD-ROM 2200, read from the storage medium by a disk drive 2030 or the like, and temporarily stored in the hard disk 2080. Alternatively, when the device is connected to the network 310, it is temporarily copied from the server on the network to the hard disk 2080. Then, the data is further read from the hard disk 2080 to the RAM 2070 in the memory and executed by the CPU 2040. In the case of network connection, the program may be directly loaded into the RAM and executed without being stored in the hard disk 2080.

音源位置推定装置１０００として機能するためのプログラムは、コンピュータ本体２０１０に、情報処理装置等の機能を実行させるオペレーティングシステム（ＯＳ）は、必ずしも含まなくても良い。プログラムは、制御された態様で適切な機能（モジュール）を呼び出し、所望の結果が得られるようにする命令の部分のみを含んでいれば良い。コンピュータシステム２０００がどのように動作するかは周知であり、詳細な説明は省略する。 The program for functioning as the sound source position estimation apparatus 1000 need not include an operating system (OS) that causes the computer main body 2010 to execute the functions of an information processing apparatus or the like. The program only needs to include the instructions that call the appropriate functions (modules) in a controlled manner so that the desired result is obtained. How the computer system 2000 operates is well known, and a detailed description is omitted.

また、上記プログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 Further, the computer that executes the program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.

さらに、ＣＰＵ２０４０も、１つのプロセッサであっても、あるいは複数のプロセッサであってもよい。すなわち、シングルコアのプロセッサであっても、マルチコアのプロセッサであってもよい。 Further, the CPU 2040 may be a single processor or a plurality of processors. That is, it may be a single core processor or a multi-core processor.

なお、音源位置推定装置のプログラムの動作の基礎となる情報を格納するデータベースは、インタフェース２０９０を介して接続される外部の記憶装置内に格納されていてもよい。たとえば、ネットワークを介して外部サーバに接続している場合は、動作の基礎となる情報を格納するデータベースは、外部サーバ内のハードディスク（図示せず）等の記憶装置に格納されていてもよい。この場合は、コンピュータ２０００はクライエント機として動作し、このようなデータベースのデータをネットワークを介して外部サーバとやり取りする。 Note that a database that stores information serving as a basis for the operation of the program of the sound source position estimation device may be stored in an external storage device connected via the interface 2090. For example, when connected to an external server via a network, a database that stores information serving as a basis of operation may be stored in a storage device such as a hard disk (not shown) in the external server. In this case, the computer 2000 operates as a client machine, and exchanges such database data with an external server via a network.

図１２は、このようなコンピュータプログラムによる処理を説明するためのフローチャートである。 FIG. 12 is a flowchart for explaining the processing performed by such a computer program.

音源定位処理部５０として機能するコンピュータ２０００においては、音源位置推定処理が開始されると、ＣＰＵ２０４０は、処理の初期化処理を実行する（Ｓ１０２）。 In the computer 2000 functioning as the sound source localization processing unit 50, when the sound source position estimation process is started, the CPU 2040 executes a process initialization process (S102).

なお、以下のステップＳ１０４〜Ｓ１０８までの処理は、図１で説明したのと同様に、マイクロホンアレイＭＣ１およびマイクロホンアレイＭＣ２からの信号のそれぞれに対して並列的に実行される。 Note that the processing from the following steps S104 to S108 is executed in parallel with respect to each of the signals from the microphone array MC1 and the microphone array MC2, as described in FIG.

続いて、音声入力インタフェース２０９２は、マイクロホンアレイＭＣ１およびマイクロホンアレイＭＣ２からの信号を受けてＡ／Ｄ変換し、それぞれｐ個のデジタル音源信号として出力し、このようなデジタル音源信号は、たとえば、ＲＡＭ２０７０に一時格納される。ＣＰＵ２０４０は、ｐ個のデジタル音源信号を、たとえば、４ミリ秒のフレーム長でフレーム化するためのフレーム化処理し、ｐチャンネルのフレーム化された音源信号に対してそれぞれＦＦＴ（ＦａｓｔＦｏｕｒｉｅｒＴｒａｎｓｆｏｒｍａｔｉｏｎ）を施し、ＦＦＴ処理により所定個数の周波数領域に変換し、４ミリ秒ごとの各チャネルの各ビンの値を、１００ミリ秒ごとにブロック化する（Ｓ１０４）。 Subsequently, the audio input interface 2092 receives the signals from the microphone arrays MC1 and MC2, performs A/D conversion, and outputs each as p digital sound source signals, which are temporarily stored in, for example, the RAM 2070. The CPU 2040 divides the p digital sound source signals into frames with a frame length of, for example, 4 milliseconds, applies an FFT (Fast Fourier Transform) to each of the p framed channels to convert it into a predetermined number of frequency bins, and groups the bin values of each channel, obtained every 4 milliseconds, into blocks of 100 milliseconds (S104).
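
The framing and blocking arithmetic of step S104 can be sketched as follows for one channel. The 16 kHz sampling rate is taken from the experiment described earlier; non-overlapping frames, the absence of a window function, and the function name `frame_and_block` are assumptions made for illustration, and the FFT itself is omitted.

```python
def frame_and_block(samples, fs=16000, frame_ms=4, block_ms=100):
    """Split one channel into frames, then group the frames into blocks.

    Returns a list of blocks; each block is a list of frames, and each
    frame is a list of samples. Trailing samples that do not fill a
    whole frame or block are dropped.
    """
    frame_len = fs * frame_ms // 1000          # 64 samples at 16 kHz
    frames_per_block = block_ms // frame_ms    # 25 frames per 100 ms block
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    blocks = [frames[i:i + frames_per_block]
              for i in range(0, len(frames) - frames_per_block + 1,
                             frames_per_block)]
    return blocks
```

Under these assumptions each 100 ms block supplies 25 spectral snapshots per frequency bin, which is the averaging depth available to the correlation matrix computed in the next step.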

さらに、ＣＰＵ２０４０は、ブロック化された各ビンの値の間の相関を要素とする相関行列を所定時間ごと（１００ミリ秒ごと）に算出し、相関行列を固有値分解し、固有ベクトルを算出する（Ｓ１０６）。 Further, at predetermined intervals (every 100 milliseconds), the CPU 2040 calculates a correlation matrix whose elements are the correlations between the blocked bin values of the channels, performs eigenvalue decomposition of the correlation matrix, and obtains the eigenvectors (S106).
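
The correlation matrix of step S106 can be sketched for a single frequency bin as the average of x·x^H over the frames of one block. This is a minimal standard-library illustration with the hypothetical name `spatial_correlation`; the eigenvalue decomposition is left out because it would normally be delegated to a numerical library (for example `numpy.linalg.eigh` for Hermitian matrices).

```python
def spatial_correlation(snapshots):
    """Average x * x^H over the frames of one block for one frequency bin.

    snapshots -- list of frames; each frame is a list of M complex bin
                 values, one per microphone channel.
    Returns an M x M Hermitian matrix as nested lists.
    """
    m = len(snapshots[0])
    r = [[0j] * m for _ in range(m)]
    for x in snapshots:
        for i in range(m):
            for j in range(m):
                r[i][j] += x[i] * x[j].conjugate()
    n = len(snapshots)
    return [[v / n for v in row] for row in r]
```

Element (i, j) is the averaged product of channel i with the complex conjugate of channel j, so the result is Hermitian and its eigenvalues are real.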

なお、図１でも説明したとおり、これも特に限定されないが、本実施の形態では、音源信号の周波数成分のうち、空間的分解能が低い１ｋＨｚ以下の帯域と、空間的エイリアシングが起こり得る６ｋＨｚ以上の帯域を除外するものとする。 As also described with reference to FIG. 1, although this is not a particular limitation, the present embodiment excludes from the frequency components of the sound source signal the band at or below 1 kHz, where spatial resolution is low, and the band at or above 6 kHz, where spatial aliasing can occur.

また、１フレーム長を４ミリ秒（ＦＦＴでは６４〜１２８点に相当）と短くすることにより、ＦＦＴの計算量が少なくてすむだけでなく、後の相関行列の算出、固有値分解、およびＭＵＳＩＣ応答の算出における計算量も抑制する。 Also, by shortening one frame length to 4 milliseconds (corresponding to 64 to 128 points in FFT), not only the amount of calculation of FFT can be reduced, but also calculation of correlation matrix, eigenvalue decomposition, and MUSIC later. The amount of calculation in calculating the response is also suppressed.

なお、ＨＤＤ２０８０は、マイクロホンアレイＭＣ１およびＭＣ２に含まれる各マイクロホンの位置を所定の座標系を用いて表す位置ベクトルを記憶するとともに、空間３次元地図の情報も記憶しているものとする。 It is assumed that HDD 2080 stores a position vector that represents the position of each microphone included in microphone arrays MC1 and MC2 using a predetermined coordinate system, and also stores information on a spatial three-dimensional map.

続けて、ＣＰＵ２０４０は、ＨＤＤ２０８０に記憶されているマイクロホンの位置ベクトル、および算出された固有ベクトルを用いて、音源数が固定されているものとしてＭＵＳＩＣ法によりＭＵＳＩＣ空間スペクトルを算出する。ＭＵＳＩＣアルゴリズムとしては、３次元での計算が可能なものが実装されているものとし、３次元の音源方位（ＤＯＡ）のみを推定が行われる。ＣＰＵ２０４０は、さらに、算出されたＭＵＳＩＣ空間スペクトルに基づいて、ＭＵＳＩＣ法にしたがいＭＵＳＩＣ応答を各方位について算出する（Ｓ１０８）。 Subsequently, the CPU 2040 calculates the MUSIC spatial spectrum by the MUSIC method, treating the number of sound sources as fixed, using the microphone position vectors stored in the HDD 2080 and the calculated eigenvectors. A MUSIC algorithm capable of three-dimensional computation is assumed to be implemented, and only the three-dimensional direction of arrival (DOA) of the sound source is estimated. The CPU 2040 further calculates the MUSIC response for each direction from the calculated MUSIC spatial spectrum in accordance with the MUSIC method (S108).

The RAM 2070 temporarily accumulates a predetermined number of the calculated MUSIC response peaks in time series in FIFO form. For the MUSIC response at each search point of each block accumulated in the RAM 2070, the CPU 2040 estimates the direction of the sound source (the two angles φ and θ described above) and also dynamically estimates the direction of the reflected sound using the spatial three-dimensional map information (S110).
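One way to relate an arrival direction to a reflected path is to trace a ray from the array along that direction until it meets a surface in the map and then mirror the ray about that surface. The sketch below does this for a single planar surface. Representing the map by one plane, the ceiling at a height of 3 m, and the function names are assumptions for illustration; the embodiment's map format is not reproduced here.

```python
def dot(a, b):
    """Dot product of two 3-element sequences."""
    return sum(x * y for x, y in zip(a, b))

def reflect_ray(origin, direction, plane_point, plane_normal):
    """Trace a ray from the array along an arrival direction to a planar
    surface. Returns (hit point, mirrored direction), or None when the ray
    is parallel to the plane or points away from it."""
    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-12:
        return None
    offset = [p - o for p, o in zip(plane_point, origin)]
    t = dot(offset, plane_normal) / denom
    if t <= 0:
        return None
    hit = [o + t * d for o, d in zip(origin, direction)]
    # Mirror the direction about the plane: d' = d - 2 (d.n / n.n) n
    k = 2 * denom / dot(plane_normal, plane_normal)
    mirrored = [d - k * n for d, n in zip(direction, plane_normal)]
    return hit, mirrored

# Hypothetical setup: array at height 1 m, ceiling at z = 3 m.
hit, mirrored = reflect_ray(origin=(0, 0, 1), direction=(1, 0, 1),
                            plane_point=(0, 0, 3), plane_normal=(0, 0, 1))
```

In this example the ray meets the ceiling at (2, 0, 3) and continues in the direction (1, 0, -1). A source heard through this reflection would lie along the mirrored ray starting at the hit point.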

Then, using the dynamically estimated direction of the direct sound and direction of the reflected sound, the CPU 2040 estimates the three-dimensional position of the sound source by the procedure described above and outputs it in time series (S112).
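As a simplified stand-in for the overlap-based procedure of the embodiment, the position at which several direction lines come closest to one another can be computed by least squares. This sketch substitutes a point-to-line least-squares criterion for the overlap of extension regions with angular error; the array positions, the source position, and the function name are assumptions for illustration.

```python
import numpy as np

def closest_point_to_lines(origins, directions):
    """Point minimizing the sum of squared distances to a set of 3D lines,
    each given by an origin and a direction. At least two lines must be
    non-parallel, otherwise the linear system is singular."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, d in zip(origins, directions):
        d = np.asarray(d, dtype=float)
        d = d / np.linalg.norm(d)
        M = np.eye(3) - np.outer(d, d)   # projector onto the plane normal to d
        A += M
        b += M @ np.asarray(p, dtype=float)
    return np.linalg.solve(A, b)

# Hypothetical setup: two arrays and one source, with exact directions.
source = np.array([1.0, 2.0, 1.5])
origins = [np.array([0.0, 0.0, 1.0]), np.array([3.0, 0.0, 1.0])]
directions = [source - o for o in origins]
estimate = closest_point_to_lines(origins, directions)
```

A mirrored ray obtained from a reflection can be added to `origins` and `directions` in the same way as a direct-sound ray, which is how a reflected sound would contribute an additional constraint in this simplified formulation.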

If the CPU 2040 determines that an instruction to end the processing has been given (S114), the processing ends. If it determines that no such instruction has been given, the processing returns to S104.

As described above, the sound source position estimation device 1000 of the present invention estimates the sound source direction at each of a plurality of microphone arrays and uses information on the space together with information on the directions of the reflected sound for sound source localization (position estimation in three-dimensional space), which makes it possible to estimate the three-dimensional position of the sound source.

The embodiment disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

50: sound source localization processing unit; 60: eigenvector calculation unit; 62: MUSIC processing unit; 64: sound source position estimation unit; 86.1, 86.2: correlation matrix calculation unit; 88.1, 88.2: eigenvalue decomposition unit; 104.1, 104.2: MUSIC spatial spectrum calculation unit; 106.1, 106.2: MUSIC response calculation unit; 110: sound source direction / reflected sound direction estimation unit; 114: position estimation processing unit; MC1, MC2: microphone array.

Claims

A sound source position estimation device comprising:
a plurality of sound sensor arrays;
a storage device for storing information on the arrangement of each sound sensor in the sound sensor arrays and spatial three-dimensional map information of the measurement environment;
sound source localization means for executing processing that specifies the directions from which sound arrives at the plurality of sound sensor arrays, based on the sound source signals of a plurality of channels from the plurality of sound sensor arrays and the positional relationship between the sound sensors included in the sound sensor arrays;
sound source direction estimation means for estimating the direction of the sound source with respect to the direct sound and the direction of the sound source with respect to the reflected sound, based on the specified directions of arrival of the sound and the spatial three-dimensional map information; and
sound source position estimating means for estimating the position of the sound source in three dimensions according to the overlap of extension regions in the direction of the sound source due to the direct sound and the reflected sound from each of the plurality of sound sensor arrays.

The sound source position estimating apparatus according to claim 1, wherein the sound source direction estimating unit estimates the direction of the sound source with respect to the reflected sound only for reflected sound having a predetermined number of reflections or fewer.

The sound source position estimating apparatus according to claim 1, wherein the sound source direction estimating unit estimates the direction of the sound source with respect to the reflected sound only for reflections occurring within a predetermined distance.

The sound source position estimation apparatus according to claim 2, wherein, in the estimation processing of the sound source position estimation means, the extension region in the direction of the sound source is a region to which an angle estimation error is added around a straight line along the sound source direction.

The sound source localization apparatus according to claim 2, wherein the sound source localization unit specifies a direction in which the sound arrives based on a MUSIC response intensity by a MUSIC method.

A sound source position estimation method for estimating a three-dimensional position of a sound source based on signals from a plurality of sound sensor arrays, the method comprising:
specifying the directions from which sound arrives at the plurality of sound sensor arrays, based on the sound source signals of a plurality of channels from the plurality of sound sensor arrays and the positional relationship between the sound sensors included in the sound sensor arrays;
estimating the direction of the sound source with respect to the direct sound and the direction of the sound source with respect to the reflected sound, based on the specified directions of arrival of the sound and the spatial three-dimensional map information; and
estimating the position of the sound source in three dimensions according to the overlap of extension regions in the direction of the sound source due to the direct sound and the reflected sound from each of the plurality of sound sensor arrays.

A sound source position estimation program for causing a computer having an arithmetic device and a storage device to estimate a three-dimensional position of a sound source based on signals from a plurality of sound sensor arrays, the program causing the computer to execute:
a step in which the arithmetic device specifies the directions from which sound arrives at the plurality of sound sensor arrays, based on each of a plurality of sound source signals from the plurality of sound sensor arrays and the positional relationship between the sound sensors included in the sound sensor arrays;
a step of estimating the direction of the sound source with respect to the direct sound and the direction of the sound source with respect to the reflected sound, based on the specified directions of arrival of the sound and the spatial three-dimensional map information stored in the storage device; and
a step of estimating the position of the sound source in three dimensions according to the overlap of extension regions in the direction of the sound source due to the direct sound and the reflected sound from each of the plurality of sound sensor arrays.