JP6467736B2

JP6467736B2 - Sound source position estimating apparatus, sound source position estimating method, and sound source position estimating program

Info

Publication number: JP6467736B2
Application number: JP2014176949A
Authority: JP
Inventors: Carlos Toshinori Ishi; Jani Even; Norihiro Hagita
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2014-09-01
Filing date: 2014-09-01
Publication date: 2019-02-13
Anticipated expiration: 2034-09-01
Also published as: JP2016050872A

Description

The present invention relates to sound source localization in real environments, and more particularly to a technique for estimating sound source positions in a real environment using the directionality of speech observed by a plurality of sensor arrays.

Conventionally, imaging devices have been proposed that detect the direction of a sound source so that clear images can be acquired efficiently with a small number of imaging means (for example, Patent Document 1).

Patent Document 1 discloses the following technique. The system is provided with two sound source direction detection units, each of which includes a plurality of microphones and detects the sound source direction from the sound pressure level of the audio signal of each microphone. A sound source position estimation unit geometrically estimates the sound source position in the room to be imaged, based on the sound source directions detected by the sound source direction detection units. An imaging unit is controlled so that it points at the estimated sound source position when shooting. The captured video data is subjected to image recognition processing by an image recognition unit. The image recognition unit controls the zoom function of the imaging unit so that the subject (sound source) is displayed large.

With such a configuration, the position of an object (subject) serving as a sound source, including a human, can be grasped. The subject can therefore be photographed clearly with a relatively small number of imaging means, and the system cost of the imaging apparatus as a whole can be kept down.

However, such a system focuses on capturing images in step with utterances; it is not intended to estimate and record who is speaking, when, and where.

For that purpose, an example has been reported in which multiple microphone arrays and multiple Kinect sensors were installed in an elementary school science room and data were collected from actual science classes (for example, Non-Patent Document 1).

Separately, a method has been proposed that uses a plurality of arrays and estimates sound source positions from sound and spatial information alone, also exploiting reflected sound (see Patent Document 2).

Patent Document 1: Japanese Patent Application Laid-Open No. 2005-151042
Patent Document 2: Japanese Patent Application Laid-Open No. 2014-98568

Non-Patent Document 1: Ishi, C., Even, J., Hagita, N. (2013). "Using multiple microphone arrays and reflections for 3D localization of sound sources," IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2013), pp. 3937-3942, Nov. 2013.

However, the techniques disclosed in Non-Patent Document 1 and Patent Document 2 do not go as far as associating speech sections with persons.

If a dialogue behavior recognition platform that estimates who is speaking, when, and where were realized, it would be expected to make it easier to observe data on conversation and collaborative work among multiple people who sometimes change seats, as in a classroom or a meeting.

The present invention has been made to solve the problems described above, and its object is to provide a sound source position estimation apparatus capable of estimating and recording who is speaking, when, and where in a predetermined space.

Another object of the present invention is to provide a sound source position estimation apparatus that can also estimate the direction of the face of a person who is speaking in a predetermined space.

In the present invention, sound source directions are estimated using a plurality of microphone arrays, the positions of persons are estimated from person position estimation information, and these pieces of information are integrated to perform sound source localization (position estimation in three-dimensional space).

According to one aspect of the present invention, a sound source position estimation apparatus includes: a plurality of sound sensor arrays; person position estimation means for estimating the position of a person in a predetermined space; a storage device for storing information on the arrangement of the sound sensors in each sound sensor array and the position information of the person; sound source localization means for executing processing to identify the directions from which sound arrives at the plurality of sound sensor arrays, based on each of the multi-channel signals from the plurality of sound sensor arrays and on the positional relationship among the sound sensors included in each sound sensor array; and speech section estimation means for estimating the person who is speaking, based on the position information of the person and on sets of sound arrival directions identified by different sound sensor arrays among the plurality of sound sensor arrays.

Preferably, the speech section estimation means estimates, as the sound source position, the sound source candidate position for which the sum of the distances to the sound sensor arrays used to identify the sound arrival directions is smallest.

Preferably, the speech section estimation means estimates the person who is speaking in response to the distance between the estimated sound source position and the position of the person being equal to or less than a second threshold.

Preferably, the speech section estimation means estimates the direction of the face of the person who is speaking according to the position of the person estimated to be speaking and the corresponding sound source position.

Preferably, the apparatus further includes sound source separation means for separating the speech of the person estimated by the speech section estimation means to be speaking, and for recording the utterance content in association with the speaker.

According to another aspect of the present invention, a sound source position estimation method estimates a speaker in a predetermined space based on signals from a plurality of sound sensor arrays and on estimated person positions. The method includes the steps of: estimating the position of a person in the predetermined space from measurement data from a position sensor; identifying the directions from which sound arrives at the plurality of sound sensor arrays, based on each of the multi-channel sound source signals from the plurality of sound sensor arrays and on the positional relationship among the sound sensors included in each sound sensor array; and estimating the person who is speaking, based on the position information of the person and on sets of sound arrival directions identified by different sound sensor arrays among the plurality of sound sensor arrays. The step of estimating the person who is speaking includes, for each set of sound arrival directions, estimating that a sound source candidate position exists on the straight line corresponding to the shortest distance between the extension lines of the arrival directions, in response to that shortest distance being equal to or less than a first threshold.

According to still another aspect of the present invention, a sound source position estimation program causes a computer having an arithmetic device and a storage device to estimate a speaker in a predetermined space based on signals from a plurality of sound sensor arrays and on estimated person positions. The program causes the computer to execute the steps of: the arithmetic device estimating the position of a person in the predetermined space from measurement data from a position sensor; the arithmetic device identifying the directions from which sound arrives at the plurality of sound sensor arrays, based on each of the multi-channel sound source signals from the plurality of sound sensor arrays and on the positional relationship among the sound sensors included in each sound sensor array; and the arithmetic device estimating the person who is speaking, based on the position information of the person and on sets of sound arrival directions identified by different sound sensor arrays among the plurality of sound sensor arrays. The last step includes, for each set of sound arrival directions, estimating that a sound source candidate position exists on the straight line corresponding to the shortest distance between the extension lines of the arrival directions, in response to that shortest distance being equal to or less than a first threshold.

According to the present invention, combining sound source position estimation by a plurality of arrays with person position information makes it possible to improve the accuracy of a system that detects voice activity.

In addition, according to the present invention, the direction of the speaker's face during an utterance can also be estimated. This gives a clue as to the context within the space in which the utterance was made, and enables more advanced dialogue behavior recognition.

A conceptual diagram for explaining the configuration of the dialogue behavior recognition system 1000 including the sound source position estimation apparatus of the present embodiment.
A diagram showing the environment in which the experiment was conducted.
A block diagram showing an outline of the configuration of the sound source position estimation apparatus 2000 shown in FIG. 1.
A flowchart for explaining the processing flow when the sound source position estimation apparatus 2000 is realized by a computer.
A block diagram showing the configuration of the sound source direction estimation unit 1040.
A conceptual diagram showing the procedure for obtaining the shortest distance.
A block diagram showing the hardware configuration of the computer system 2000 for executing a computer program.
A diagram showing the arrangement of microphones in a microphone array.
A diagram showing the positions of the microphone arrays and the position information of the evaluated persons.
A diagram showing the speech section detection rates when two speakers spoke one at a time.
A diagram showing analysis results for face direction estimation.
A diagram showing analysis results for face direction estimation.
A diagram showing statistics of the face direction estimation results.
A diagram showing the face direction estimation results when two persons spoke simultaneously.

Hereinafter, the configuration of a sound source position estimation apparatus according to an embodiment of the present invention will be described with reference to the drawings. In the following embodiment, components and processing steps given the same reference numerals are the same or equivalent, and their description is not repeated unless necessary.

In the following description, a microphone, more specifically an electret condenser microphone, is used as an example of a sound sensor, but any other sound sensor may be used as long as it can detect sound as an electrical signal.

In a real environment, multiple sounds generated at different locations are observed as a mixture. The sound source position estimation apparatus of the present embodiment therefore coordinates a plurality of microphone arrays, as described below, in order to localize and separate multiple sound sources.

Here, "sound source localization" means continuously identifying the direction of a sound source, and "sound source position estimation" means estimating the three-dimensional position of a sound source in a predetermined space based on the direction identified by sound source localization.
[System configuration]
FIG. 1 is a conceptual diagram for explaining the configuration of the dialogue behavior recognition system 1000 including the sound source position estimation apparatus of the present embodiment.

Referring to FIG. 1, in the dialogue behavior recognition system 1000, microphone arrays 1052.1 and 1052.2 are installed on the ceiling of a predetermined space, for example a conference room, and microphone arrays 1052.3 and 1052.4 are installed at positions closer to the floor of the conference room, for example on a table. Although not limiting, the direction connecting microphone arrays 1052.1 and 1052.2 and the direction connecting microphone arrays 1052.3 and 1052.4 are, for example, arranged to be orthogonal to each other.

The number of microphone arrays is not limited to four; in general, any number of two or more may be used.

Assume that the conference room contains, for example, a speaker 10.1 who is speaking while standing and speakers 10.2 and 10.3 who are speaking while seated.

In addition, laser range finders (LRFs) 1010.1, 1010.2, and 1010.3 (hereinafter collectively referred to as LRF 1010) for detecting the positions of persons are placed at three corners of the conference room. The laser range finder is one example of a detection device for estimating the positions of persons in the conference room; any other sensor that can detect those positions may be used, and the number of sensors is not limited to three.

The sound source position estimation apparatus 2000 estimates person positions based on the data from the LRFs 1010, collects over time the sound source positions acquired by the microphone arrays 1052 (the microphone arrays 1052.1 to 1052.4 are collectively referred to as microphone array 1052), identifies each speech section, and identifies the speaker for each speech section.
[System installation environment]
FIG. 2 is a diagram showing the environment in which the experiment described later was conducted.

As shown in FIG. 2, the experiment used a meeting space in a laboratory in which a plurality of microphone arrays were installed. Two 16-channel microphone arrays were placed on the desk and two 8-channel microphone arrays were installed on the ceiling.
[Configuration for sound source position estimation]
FIG. 3 is a block diagram showing an outline of the configuration of the sound source position estimation apparatus 2000 shown in FIG. 1.

FIG. 4 is a flowchart for explaining the processing flow when the sound source position estimation apparatus 2000 is realized by a computer, as described later.

Referring to FIGS. 3 and 4, first, based on the signals from the microphone arrays 1052.1 to 1052.4, the three-dimensional sound source direction estimation units 1040.1 to 1040.4 (collectively referred to as sound source direction estimation unit 1040) each estimate the sound source direction in three-dimensional space, that is, the azimuth and the elevation angle (S102). Many sound source localization studies estimate only the azimuth, but when many people are present, as in a conference room or a classroom, the probability that multiple sound sources lie in the same azimuth increases, so estimating the elevation angle also becomes important.

The sound source direction estimation unit 1040 is a system based on the MUSIC method that estimates, in real time, sound source directions in three-dimensional space with a spatial resolution of 5 degrees and a time resolution of 100 ms.

A higher resolution is desirable for sound source direction detection, but searching a three-dimensional space at high resolution takes more processing time and makes real-time processing difficult on an ordinary CPU. In the present embodiment, therefore, a direction is first detected at the 5-degree resolution described above and is then refined hierarchically, raising the resolution step by step: i) 3 degrees (search range: -6 to 6 degrees), ii) 2 degrees (search range: -4 to 4 degrees), and iii) 1 degree (search range: -3 to 3 degrees), to obtain the final estimate of the sound source direction.
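The hierarchical refinement described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment: `response` is a hypothetical callable that returns the MUSIC response for a direction (azimuth, elevation) in degrees, and the coarse grid ranges and function names are assumptions. Only the step sizes and search ranges (3/±6, 2/±4, 1/±3 degrees) are taken from the text.

```python
# (step, half-width) in degrees for the three refinement stages given in the text.
REFINE_STAGES = [(3, 6), (2, 4), (1, 3)]


def coarse_peak(response, step=5):
    """Scan all directions on a coarse grid and return the best (azimuth, elevation)."""
    candidates = [(az, el)
                  for az in range(-180, 180, step)
                  for el in range(-90, 91, step)]
    return max(candidates, key=lambda d: response(*d))


def refine_peak(response, start, stages=REFINE_STAGES):
    """Re-search a shrinking window around the current estimate at finer steps."""
    az0, el0 = start
    for step, half in stages:
        offsets = range(-half, half + 1, step)
        candidates = [(az0 + da, el0 + de) for da in offsets for de in offsets]
        az0, el0 = max(candidates, key=lambda d: response(*d))
    return az0, el0
```

Each refinement stage evaluates at most 7 x 7 directions per detected peak, so the extra cost grows with the number of peaks rather than with the size of the full 1-degree grid. The sketch does not handle azimuth wrap-around or elevation limits.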

The number of directions searched every 100 ms is proportional to the number of directions detected simultaneously, but even a CPU (Central Processing Unit) with a clock frequency of 2.6 GHz can run the processing in real time.

As described above, the person position estimation unit 1070 performs two-dimensional person position estimation using three two-dimensional LRFs 1010 (S104).

A method of estimating person positions using LRFs is disclosed, for example, in the following document.

Known Document 1: D. F. Glas et al., "Laser tracking of human body motion using adaptive shape modeling," Proceedings of 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 602-608, 2007.

The speech section estimation unit 1080 determines, based on the sound source directions and the person position information, whether each person is speaking, as described later (S106). Using the spatial information of the room and the position information of the arrays stored in the nonvolatile storage device 2080, it superimposes the sound source directions obtained from each microphone array on the person position information obtained from the person position estimation unit. The "spatial information of the room" includes, for example, information on where the microphone arrays are installed in a predetermined space such as a conference room. When reflected sound is also used, as described in Patent Document 2, the "spatial information of the room" may also include information on the positions of the walls and ceiling of the predetermined space.

It is not rare for the direction of a person and the direction of a noise source such as air-conditioning equipment to coincide as seen from the microphone array 1050, so false detections need to be reduced. For this reason, only locations where a plurality of directions overlap are treated as sound source candidates, and if the location where the sound source directions overlap coincides with the position of a person, that person is regarded as having a high probability of speaking.
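The overlap test can be illustrated with a small geometric sketch. It assumes that each array reports a direction as azimuth and elevation, that two directions "overlap" when the lines extended from two different arrays pass within a first threshold of each other, and that a person is matched when the candidate lies within a second threshold of the person's 2D position on the floor plane. All function names, the tuple layout of `rays`, and the comparison on the floor plane are assumptions made for illustration; the threshold values are left to the caller.

```python
import math


def direction_vector(azimuth_deg, elevation_deg):
    """Unit vector for a direction given as azimuth and elevation in degrees."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def closest_approach(p1, d1, p2, d2):
    """Return (gap, midpoint) of the shortest segment joining two direction lines.

    Each line starts at an array position p and extends along unit vector d.
    Returns None if the lines are nearly parallel or meet behind an array.
    """
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    w = [a - b for a, b in zip(p1, p2)]
    b = dot(d1, d2)
    denom = 1.0 - b * b
    if denom < 1e-9:
        return None
    d, e = dot(d1, w), dot(d2, w)
    t1 = (b * e - d) / denom  # distance along line 1 to its closest point
    t2 = (e - b * d) / denom  # distance along line 2 to its closest point
    if t1 < 0 or t2 < 0:
        return None
    q1 = [p + t1 * k for p, k in zip(p1, d1)]
    q2 = [p + t2 * k for p, k in zip(p2, d2)]
    mid = tuple((x + y) / 2 for x, y in zip(q1, q2))
    return math.dist(q1, q2), mid


def source_candidates(rays, max_gap):
    """rays: list of (array_id, position, unit_direction). Pair lines from different arrays."""
    found = []
    for i in range(len(rays)):
        for j in range(i + 1, len(rays)):
            (a1, p1, d1), (a2, p2, d2) = rays[i], rays[j]
            if a1 == a2:
                continue
            hit = closest_approach(p1, d1, p2, d2)
            if hit and hit[0] <= max_gap:
                found.append(hit[1])
    return found


def speaking_person(candidates, people, max_dist):
    """Id of the person closest to any candidate on the floor plane, or None."""
    best = None
    for cx, cy, _ in candidates:
        for pid, (px, py) in people.items():
            dist = math.hypot(cx - px, cy - py)
            if dist <= max_dist and (best is None or dist < best[0]):
                best = (dist, pid)
    return best[1] if best else None
```

A noise source seen by only one array produces no pair and therefore no candidate, and a candidate far from every tracked person is not attributed to anyone.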

Finally, for each detected sound source section, the sound source separation unit 1090 uses the microphone array closest to the sound source, steers a beam in the detected direction to perform sound source separation, and records the sound from that source in the nonvolatile storage device 2080 as speech from the corresponding speaker (S108).
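The text does not say which beamformer is used. As one possible reading, the sketch below picks the array nearest to the estimated source and computes the per-microphone delays that a simple delay-and-sum beamformer would apply to point at the detected direction. Delay-and-sum, the far-field assumption, the speed of sound, and all names here are assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def nearest_array(array_positions, source_position):
    """Key of the array whose reference point is closest to the estimated source."""
    return min(array_positions,
               key=lambda k: math.dist(array_positions[k], source_position))


def steering_delays(mic_offsets, direction):
    """Delays in seconds that time-align a far-field wave arriving from `direction`.

    mic_offsets: microphone positions relative to the array centre, in metres.
    direction: unit vector pointing from the array towards the source.
    """
    # A microphone displaced towards the source hears the wave earlier by (m . u) / c.
    advance = [sum(m * u for m, u in zip(mic, direction)) / SPEED_OF_SOUND
               for mic in mic_offsets]
    latest = min(advance)  # the microphone that hears the wave last gets zero delay
    return [a - latest for a in advance]
```

Delaying each channel by the returned amount and averaging the channels reinforces sound from the chosen direction relative to sound from other directions.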

Subsequently, if it is determined that the end of processing has been instructed, the processing is terminated; otherwise, the processing returns to step S102 and is carried out for the next time block.
[Sound source direction estimation using the MUSIC method]
FIG. 5 is a block diagram showing the configuration of the sound source direction estimation unit 1040. The sound source direction estimation units 1040.1 to 1040.4 have basically the same configuration.

The MUSIC (Multiple Signal Classification) method is described below as a specific example of a technique for estimating the direction of a sound source, which is needed to estimate its position. Any other technique may be used as long as it can estimate the direction of the sound source.

The MUSIC method can be outlined as follows. First, the multi-channel spectrum X(k, t) is obtained for each frame by fast Fourier transform, and the spatial correlation matrix R_k between channels is obtained in the spectral domain for each block. Eigenvalue decomposition of the correlation matrix separates the subspace of directional components from the subspace of non-directional components. Using the eigenvectors E_k^n corresponding to the non-directional subspace and the direction vectors a_k prepared in advance for the search space of interest, the (narrowband) MUSIC spatial spectrum P(k) is obtained for each frequency bin. The MUSIC spatial spectra of the frequency bins within a specific frequency band are then integrated to obtain the broadband MUSIC spatial spectrum.
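For reference, the outline above corresponds to the standard MUSIC formulation, written here in the symbols of the text. The block length T in frames, the eigenvalue matrix Λ_k, the signal-subspace eigenvectors E_k^s, and the band limits k_L and k_H are symbols introduced here for illustration, and the plain arithmetic mean over bins in the last line is an assumption, since the text only says that the per-bin spectra are integrated.

```latex
\begin{align}
R_k &= \frac{1}{T}\sum_{t=1}^{T} X(k,t)\,X(k,t)^{H} \\
R_k &= E_k \Lambda_k E_k^{H}, \qquad E_k = \bigl[\,E_k^{s} \;\; E_k^{n}\,\bigr] \\
P(k) &= \frac{\bigl\lvert a_k^{H} a_k \bigr\rvert}
             {\bigl\lvert a_k^{H} E_k^{n} \,(E_k^{n})^{H} a_k \bigr\rvert} \\
\bar{P} &= \frac{1}{k_H - k_L + 1}\sum_{k=k_L}^{k_H} P(k)
\end{align}
```

Because a_k for a true source direction is nearly orthogonal to the non-directional subspace spanned by E_k^n, the denominator of P(k) becomes small and the spectrum peaks in that direction.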

In the following, the broadband MUSIC spatial spectrum is simply called the "MUSIC spatial spectrum", and the time series of MUSIC spatial spectra is called the "MUSIC spectrogram".

In sound source localization, the direction of a sound source is found by searching for peaks of the MUSIC spatial spectrum.

The following description takes the case of a single microphone array as an example, but the number of microphone arrays may be larger.

Referring to FIG. 5, the sound source power spectrum acquisition unit 1050 includes: an A/D converter 1054 that receives p analog sound source signals from a microphone array MC1 including microphones 1052.1 to 1052.p (p: a natural number), performs analog-to-digital conversion, and outputs p digital sound source signals; an eigenvector calculation unit 61 that receives the p digital sound source signals output from the A/D converter 1054 and outputs, for each block of a predetermined duration (for example, 100 milliseconds), the correlation matrix required by the MUSIC method together with its eigenvalues and eigenvectors; and a MUSIC processing unit 62 that uses the eigenvectors output for each block from the eigenvector calculation unit 61 to output a MUSIC spatial spectrum by the MUSIC method. The sound source direction estimation unit 1060 estimates the direction of the sound source (in the present embodiment, the two angles φ and θ of three-dimensional polar coordinates) based on the MUSIC spatial spectrum output by the MUSIC processing unit 62. In this specification, the "MUSIC response" is the MUSIC spatial spectrum obtained by the MUSIC algorithm, averaged according to a predetermined formula.

Although not limiting, in the present embodiment the A/D converter 1054 converts the output of each microphone at the common setting of 16 kHz and 16 bits.

The eigenvector calculation unit 61 includes: a framing unit 80 that divides the p digital sound source signals, output by the A/D converter 1054 based on the signals from the microphone array MC1, into frames with a frame length of, for example, 4 milliseconds; an FFT processing unit 82 that applies an FFT (Fast Fourier Transform) to each of the p channels of framed sound source signals output by the framing unit 80, converting them into a predetermined number of frequency bands (hereinafter, each frequency band is called a "bin" and the number of frequency bands is called the "number of bins"); a blocking unit 84 that groups the values of each bin of each channel, output from the FFT processing unit 82 every 4 milliseconds, into blocks of 100 milliseconds; a correlation matrix calculation unit 86 that calculates and outputs, at predetermined intervals (every 100 milliseconds), a correlation matrix whose elements are the correlations between the bin values output from the blocking unit 84; and an eigenvalue decomposition unit 88 that performs eigenvalue decomposition on the correlation matrix output from the correlation matrix calculation unit 86 and outputs the eigenvectors 92 to the MUSIC processing unit 62.

通常、ＦＦＴでは５１２〜１０２４点を使用する（１６ｋＨｚのサンプリングレートで３２〜６４ミリ秒に相当）が、ここでは１フレームを４ミリ秒（ＦＦＴでは６４〜１２８点に相当）とした。このようにフレーム長を短くすることにより、ＦＦＴの計算量が少なくてすむだけでなく、後の相関行列の算出、固有値分解、およびＭＵＳＩＣ応答の算出における計算量も少なくて済む。その結果、性能を落とすことなく、比較的非力なコンピュータを用いても十分にリアルタイムで音源定位を行なうことができる。 Normally, 512 to 1024 points are used in FFT (corresponding to 32 to 64 milliseconds at a sampling rate of 16 kHz), but here one frame is set to 4 milliseconds (corresponding to 64 to 128 points in FFT). By reducing the frame length in this way, not only the amount of calculation of FFT is reduced, but also the amount of calculation in later calculation of correlation matrix, eigenvalue decomposition, and calculation of MUSIC response is reduced. As a result, sound source localization can be performed sufficiently in real time even if a relatively weak computer is used without degrading performance.
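The frame and block sizes quoted above follow from simple arithmetic. The following is a minimal sketch, assuming non-overlapping frames (the text states that the FFT unit emits values every 4 ms); the function and constant names are illustrative and do not come from the patent.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: float) -> int:
    """Number of samples contained in one frame of the given duration."""
    return round(sample_rate_hz * frame_ms / 1000)


def frame_duration_ms(sample_rate_hz: int, n_points: int) -> float:
    """Duration in milliseconds covered by n_points consecutive samples."""
    return 1000 * n_points / sample_rate_hz


SAMPLE_RATE_HZ = 16_000  # sampling rate given in the text
FRAME_MS = 4             # frame length given in the text
BLOCK_MS = 100           # block length given in the text

frame_len = samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS)
# Assumption: frames do not overlap, so one block holds BLOCK_MS / FRAME_MS frames.
frames_per_block = BLOCK_MS // FRAME_MS

print(frame_len, frames_per_block)
print(frame_duration_ms(SAMPLE_RATE_HZ, 512), frame_duration_ms(SAMPLE_RATE_HZ, 1024))
```

Under these assumptions each block averages a small number of short frames, which is why the later correlation and eigenvalue steps stay cheap.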

The MUSIC processing unit 62 includes a microphone arrangement storage unit 100 that stores position vectors representing the position of each microphone in the microphone array MC1 in a predetermined coordinate system, and a MUSIC spatial spectrum calculation unit 104 that uses the microphone position vectors stored in the microphone arrangement storage unit 100 and the eigenvectors output from the eigenvalue decomposition unit 88 to compute and output a MUSIC spatial spectrum by the MUSIC method, on the assumption that the number of sound sources is fixed.

ブロックごとに得られる相関行列の固有値が音源数に関連することは、例えば、以下の文献にも記載されており、既に知られている事項である。 The fact that the eigenvalue of the correlation matrix obtained for each block is related to the number of sound sources is also described in the following documents, for example, and is already known.

Known Document 2: F. Asano, M. Goto, K. Itou, and H. Asoh, "Real-time sound source localization and separation system and its application on automatic speech recognition," in Eurospeech 2001, Aalborg, Denmark, 2001, pp. 1013-1016.
In the present embodiment, not only the two-dimensional azimuth of each sound source but also its elevation is estimated. For this purpose, a MUSIC algorithm capable of three-dimensional computation is implemented. The pair of azimuth and elevation is hereinafter called the sound source direction (DOA). The algorithm executed by the MUSIC processing unit 62 does not estimate the distance to the sound source. Estimating only the sound source direction greatly reduces the processing time.

ＭＵＳＩＣ処理部６２はさらに、ＭＵＳＩＣ空間スペクトル算出部１０４により算出されたＭＵＳＩＣ空間スペクトルに基づいて、ＭＵＳＩＣ法にしたがいＭＵＳＩＣ応答と呼ばれる値を各方位について算出し出力するためのＭＵＳＩＣ応答算出部１０６を含む。 The MUSIC processing unit 62 further includes a MUSIC response calculation unit 106 for calculating and outputting a value called a MUSIC response for each direction according to the MUSIC method based on the MUSIC spatial spectrum calculated by the MUSIC spatial spectrum calculation unit 104. .

音源方向推定部１０６０は、ＭＵＳＩＣ応答算出部１０６により算出されたＭＵＳＩＣ応答のピークを、一時的に時系列に所定数だけＦＩＦＯ形式でそれぞれ蓄積するためのバッファ１０８を含む。さらに、音源方向推定処理部１１０は、バッファ１０８に蓄積された各ブロックの各探索点のＭＵＳＩＣ応答について、音源の方向（上述した２つの偏角φおよびθ）を推定する。 The sound source direction estimation unit 1060 includes a buffer 108 for temporarily accumulating a predetermined number of MUSIC response peaks calculated by the MUSIC response calculation unit 106 in time series in the FIFO format. Further, the sound source direction estimation processing unit 110 estimates the direction of the sound source (the two declination angles φ and θ described above) for the MUSIC response of each search point of each block accumulated in the buffer 108.

In the MUSIC method, estimating the narrowband MUSIC spatial spectrum requires the number of directional sound sources (NOS) active at that time. In the following description, a fixed number is given, and only peaks of the MUSIC spatial spectrum that exceed a specific threshold are regarded as directional sound sources.
(MUSIC method)
The MUSIC method for computing the three-dimensional direction described above is briefly summarized below.

For example, the Fourier transforms $X_m(k,t)$ of the M microphone inputs are modeled as in equation (M1).

Here, the vector $s(k,t)$ consists of the spectra $S_n(k,t)$ of the N sound sources ($n = 1, \dots, N$).

That is, $s(k,t) = [S_1(k,t), \dots, S_N(k,t)]^T$, where k and t are the frequency and time-frame indices, respectively. The vector $n(k,t)$ represents background noise. The matrix $A_k$ is the transfer function matrix; its (m, n) element is the transfer function of the direct path from the n-th sound source to the m-th microphone. The n-th column of $A_k$ is called the steering vector of the n-th sound source.

First, the spatial correlation matrix $R_k$ defined by equation (M2) is computed. The eigenvalue decomposition of $R_k$ shown in equation (M3) then yields the diagonal matrix of eigenvalues $\Lambda_k$ and the matrix of eigenvectors $E_k$.

The eigenvectors can be partitioned as $E_k = [E_{ks} \mid E_{kn}]$, where $E_{ks}$ contains the eigenvectors corresponding to the N dominant eigenvalues and $E_{kn}$ contains the remaining eigenvectors.

The MUSIC spatial spectrum is obtained from equations (M4) and (M5), where r is the distance and θ and φ are the azimuth and elevation, respectively. Equation (M5) gives the normalized steering vector at the scanned point (r, θ, φ).

The MUSIC response (corresponding to power) is the MUSIC spatial spectrum averaged over frequency as in equation (M6).

In equation (M6), $k_L$ and $k_H$ are the indices of the lower and upper boundaries of the frequency band, and $K = k_H - k_L + 1$. The direction of sound arriving at the microphone array is found by searching for peaks of the MUSIC response.
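The equation images for (M1) to (M6) are not reproduced in this text. For orientation, the block below writes out the standard narrowband MUSIC formulation that matches the symbol definitions given above. It is a reconstruction under that assumption, not the patent's literal formulas: $\mathbf{x}(k,t)$ (the stacked microphone spectra $X_m(k,t)$), $\mathbf{a}_k$ (the unnormalized steering vector) and $\tilde{\mathbf{a}}_k$ (its normalized form) are notation introduced here, and the expectation in (M2) would in practice be an average over the frames of one block.

```latex
\begin{align}
\mathbf{x}(k,t) &= \mathbf{A}_k\,\mathbf{s}(k,t) + \mathbf{n}(k,t) \tag{M1}\\
\mathbf{R}_k &= E\!\left[\mathbf{x}(k,t)\,\mathbf{x}^{H}(k,t)\right] \tag{M2}\\
\mathbf{R}_k &= \mathbf{E}_k\,\boldsymbol{\Lambda}_k\,\mathbf{E}_k^{-1} \tag{M3}\\
P(k,r,\theta,\varphi) &= \frac{1}{\left\lVert \tilde{\mathbf{a}}_k^{H}(r,\theta,\varphi)\,\mathbf{E}_{kn} \right\rVert^{2}} \tag{M4}\\
\tilde{\mathbf{a}}_k(r,\theta,\varphi) &= \frac{\mathbf{a}_k(r,\theta,\varphi)}{\left\lVert \mathbf{a}_k(r,\theta,\varphi) \right\rVert} \tag{M5}\\
\bar{P}(r,\theta,\varphi) &= \frac{1}{K}\sum_{k=k_L}^{k_H} P(k,r,\theta,\varphi) \tag{M6}
\end{align}
```

In this form, (M4) becomes large where the normalized steering vector is nearly orthogonal to the noise subspace $\mathbf{E}_{kn}$, which is why peaks of the averaged response (M6) indicate arrival directions.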

なお、上述したとおり、音の到来方向の推定アルゴリズムとしては、ＭＵＳＩＣ法を用いることも、一方で、他の方法、たとえば、ステアード応答パワー法を用いることも可能である。 As described above, the MUSIC method can be used as an algorithm for estimating the direction of arrival of sound, while other methods such as the steered response power method can be used.

たとえば、ステアード応答パワー法については、以下の文献に開示がある。 For example, the steered response power method is disclosed in the following document.

Known Document 3: M. Brandstein and H. Silverman, "A robust method for speech signal time-delay estimation in reverberant rooms," in IEEE Conference on Acoustics, Speech, and Signal Processing, ICASSP 1997, 1997, pp. 375-378.
Known Document 4: A. Badali, J.-M. Valin, F. Michaud, and P. Aarabi, "Evaluating realtime audio localization algorithms for artificial audition on mobile robots," in Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2009, 2009, pp. 2033-2038.
[Processing of the speech section estimation unit 1080]
The following describes how the speech section estimation unit 1080 estimates the positions of sound source candidates from the sound source directions obtained from the microphone arrays.

The directions detected by the microphone arrays are evaluated pair by pair. To determine whether two directions ($dir_1$, $dir_2$) intersect in three-dimensional space, the shortest distance ($dist_{dir}$) between them is first computed by equation (1):

$$dist_{dir}(dir_1, dir_2) = \frac{\left| (v_1 \times v_2) \cdot (p_2 - p_1) \right|}{\left| v_1 \times v_2 \right|} \qquad (1)$$

Here, $v_1$ and $v_2$ are vectors parallel to the respective directions, and $p_1$ and $p_2$ are the positions of the respective arrays.

FIG. 6 is a conceptual diagram showing the procedure for obtaining this shortest distance.

As shown in FIG. 6, the line $l_1$ through the point $p_1$ parallel to the vector $v_1$ is written in parametric form with parameter t as $x = p_1 + t v_1$, and the line $l_2$ through the point $p_2$ parallel to the vector $v_2$ is written with parameter u as $x = p_2 + u v_2$.

Consider the plane α: $(n \cdot x) + d = 0$ that is parallel to the line $l_2$ and contains the line $l_1$.

The normal of the plane α is perpendicular to both lines $l_1$ and $l_2$, so the unit normal vector n is given by the cross product of the two direction vectors: $n = (v_1 \times v_2) / |v_1 \times v_2|$.

Since the plane α contains the point $p_1$ on the line $l_1$, $d = -(n \cdot p_1)$.

The plane α is therefore expressed as

$$n \cdot (x - p_1) = 0.$$

Let h be the distance between the line $l_2$ and the plane α. The distance PQ between a point P on the line $l_1$ and a point Q on the line $l_2$ is then always at least h. In other words, the minimum of PQ, namely $dist_{dir}$, is the distance between the point $p_2$ and the plane α, $|n \cdot (p_2 - p_1)|$, which is equation (1) above.

That is, $h = dist_{dir}(dir_1, dir_2)$.

The speech section estimation unit 1080 regards the two directions as intersecting when this shortest distance $dist_{dir}$ is smaller than a predetermined threshold ($dist_{dir\text{-}th}$):

$$dist_{dir}(dir_1, dir_2) < dist_{dir\text{-}th} \qquad (2)$$

Although not particularly limited, $dist_{dir\text{-}th}$ is set to 20 cm in the experiments described later.
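The intersection test described above can be sketched in a few lines. This is a minimal illustration, assuming positions are given in metres and directions as 3-tuples; the function names and the handling of parallel directions are assumptions added here, not part of the patent.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a: Vec3) -> float:
    return math.sqrt(dot(a, a))


def dist_dir(p1: Vec3, v1: Vec3, p2: Vec3, v2: Vec3) -> float:
    """Shortest distance between the lines p1 + t*v1 and p2 + u*v2."""
    c = cross(v1, v2)
    w = sub(p2, p1)
    if norm(c) < 1e-12:
        # Parallel directions (case not covered in the text): fall back to
        # the distance from p2 to the first line.
        return norm(cross(w, v1)) / norm(v1)
    # Distance from p2 to the plane through p1 with normal v1 x v2.
    return abs(dot(c, w)) / norm(c)


DIST_DIR_TH_M = 0.20  # 20 cm, the value used in the experiments


def directions_intersect(p1: Vec3, v1: Vec3, p2: Vec3, v2: Vec3,
                         threshold_m: float = DIST_DIR_TH_M) -> bool:
    """True when the two directions pass within threshold_m of each other."""
    return dist_dir(p1, v1, p2, v2) < threshold_m
```

Note that this treats each direction as an infinite line, as the derivation does; a practical system might additionally check that the closest points lie in front of each array.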

For each pair of directions judged to intersect, the speech section estimation unit 1080 estimates the sound source position ($pos_{source}$) by the following equation.

Here, $pos_n$ denotes the point at which the shortest-distance line segment meets the line drawn along the sound source direction from each array.
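The position equation itself is not reproduced in this text. One plausible reading, given the definition of $pos_n$, is that the source position is the midpoint of the two closest points $pos_1$ and $pos_2$. The sketch below implements that reading; the midpoint rule, the function names, and the `None` return for parallel directions are all assumptions made for illustration.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _along(p: Vec3, v: Vec3, s: float) -> Vec3:
    """Point p + s*v on a line."""
    return (p[0] + s * v[0], p[1] + s * v[1], p[2] + s * v[2])


def closest_points(p1: Vec3, v1: Vec3,
                   p2: Vec3, v2: Vec3) -> Optional[Tuple[Vec3, Vec3]]:
    """Closest points pos_1 (on line 1) and pos_2 (on line 2), or None if parallel."""
    w0 = _sub(p1, p2)
    a, b, c = _dot(v1, v1), _dot(v1, v2), _dot(v2, v2)
    d, e = _dot(v1, w0), _dot(v2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # parallel lines have no unique closest pair
    t = (b * e - c * d) / denom
    u = (a * e - b * d) / denom
    return _along(p1, v1, t), _along(p2, v2, u)


def estimate_source_position(p1: Vec3, v1: Vec3,
                             p2: Vec3, v2: Vec3) -> Optional[Vec3]:
    """Assumed rule: midpoint of pos_1 and pos_2."""
    points = closest_points(p1, v1, p2, v2)
    if points is None:
        return None
    pos1, pos2 = points
    return tuple((x + y) / 2 for x, y in zip(pos1, pos2))
```

The parameters t and u come from setting the derivative of the squared distance between the two parametric lines to zero, so the segment joining the two returned points is perpendicular to both directions.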

次に、音声区間推定部１０８０は、上述の処理により、すべての方向ペアを評価して得られた音源位置の候補に対し、人位置との重なりを評価する。 Next, the speech section estimation unit 1080 evaluates the overlap with the human position for the sound source position candidates obtained by evaluating all the directional pairs by the above-described processing.

音源方向の重なりによる音源位置と人位置の重なりにおいては、人位置検出が２次元であるため、２次元での距離（図１でのｘｙ平面内での距離）を評価する。すなわち、音声区間推定部１０８０は、検出された各人位置と各音源位置候補の２次元距離を計算し、以下の数式（４）のように、２次元距離が閾値よりも小さい場合、その人が発話しているとみなす。 In the overlap of the sound source position and the human position due to the overlapping of the sound source directions, since the human position detection is two-dimensional, the two-dimensional distance (distance in the xy plane in FIG. 1) is evaluated. That is, the speech section estimation unit 1080 calculates the two-dimensional distance between each detected human position and each sound source position candidate, and if the two-dimensional distance is smaller than the threshold as shown in the following formula (4), the person Is considered speaking.

ここでも、特に限定されないが、後に説明する実験では、位置誤差の評価には、閾値（dist_pos-th）を３０cmと設定するものとする。 Here, although not particularly limited, in an experiment described later, a threshold (dist _pos-th ) is set to 30 cm for evaluation of the position error.

人位置推定は、２次元で身長の情報は得られないが、音源位置推定は３次元で求められるため、人の口元の位置を考慮した制限が可能となる。口元の位置は、人が座っている場合と立っている場合を想定し、音源位置の高さが、所定の範囲、たとえば、ｚ＝８０〜１７０cmの範囲内である場合のみ、音声区間推定部１０８０は、その人が発話している確率が高いとみなす。 In human position estimation, height information cannot be obtained in two dimensions, but since sound source position estimation is obtained in three dimensions, it is possible to limit in consideration of the position of the person's mouth. Assuming that the position of the mouth is sitting and standing, only when the height of the sound source position is within a predetermined range, for example, a range of z = 80 to 170 cm, the speech section estimation unit 1080 considers that the person has a high probability of speaking.
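The two checks described above, the in-plane distance to the person and the plausible mouth height, can be combined as in the following sketch. Combining them into a single boolean is a simplification made here (the text speaks of a high probability of speaking), units are assumed to be metres, and all names are illustrative.

```python
import math
from typing import Tuple

DIST_POS_TH_M = 0.30          # 30 cm threshold used in the experiments
Z_RANGE_M = (0.80, 1.70)      # example mouth-height range from the text


def is_speaking(person_xy: Tuple[float, float],
                source_xyz: Tuple[float, float, float],
                dist_th_m: float = DIST_POS_TH_M,
                z_range_m: Tuple[float, float] = Z_RANGE_M) -> bool:
    """True when a source candidate overlaps a person and sits at mouth height."""
    # Human positions are two-dimensional, so only the xy distance is compared.
    dx = source_xyz[0] - person_xy[0]
    dy = source_xyz[1] - person_xy[1]
    close_in_plane = math.hypot(dx, dy) < dist_th_m
    # The source estimate is three-dimensional, so its height can be checked.
    plausible_height = z_range_m[0] <= source_xyz[2] <= z_range_m[1]
    return close_in_plane and plausible_height
```

The height check lets the system reject candidates such as floor-level or ceiling-level noise that happen to fall near a person in the xy-plane.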

人位置は３３〜６６msごとに推定され、音源方向は１００msごとに推定されるため、音声区間推定部１０８０は、１００msの時間分解能で音声区間を検出する。 Since the human position is estimated every 33 to 66 ms and the sound source direction is estimated every 100 ms, the speech segment estimation unit 1080 detects the speech segment with a time resolution of 100 ms.

Furthermore, when a gap of 300 ms (3 blocks) or less lies between blocks judged to contain voice activity, the speech section estimation unit 1080 merges that gap into the surrounding speech section. The speech section estimation unit 1080 then adds a pre-roll period and an after-roll period of 200 ms before and after each merged speech section, and takes the result as the detected utterance section.
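The merging and padding rule can be sketched as follows, assuming the per-block voice activity decisions are available as a list of booleans (one per 100 ms block). The function name, the clamping at the recording boundaries, and the output format in milliseconds are choices made for this illustration.

```python
from typing import List, Sequence, Tuple

BLOCK_MS = 100        # time resolution of the detector
MAX_GAP_BLOCKS = 3    # gaps of 300 ms or less are merged
ROLL_MS = 200         # pre-roll and after-roll padding


def detect_utterances(active: Sequence[bool],
                      block_ms: int = BLOCK_MS,
                      max_gap_blocks: int = MAX_GAP_BLOCKS,
                      roll_ms: int = ROLL_MS) -> List[Tuple[int, int]]:
    """Return utterance sections as (start_ms, end_ms) pairs."""
    # 1. Collect runs of consecutive active blocks as [start, end) indices.
    runs: List[List[int]] = []
    start = None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(active)])

    # 2. Merge runs separated by a gap of at most max_gap_blocks.
    merged: List[List[int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= max_gap_blocks:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Convert to milliseconds and pad, clamping to the recording length.
    total_ms = len(active) * block_ms
    return [(max(0, s * block_ms - roll_ms), min(total_ms, e * block_ms + roll_ms))
            for s, e in merged]
```

For example, two active runs separated by two silent blocks become one section, while a five-block silence keeps them apart.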

さらに、音声区間推定部１０８０は、人の口元は人体の正中矢状面と正中冠状面の交点よりも前寄りに位置していることを考慮し、本実施の形態では、後に説明するように、顔の向きの推定も行う。 Furthermore, in consideration of the fact that the human mouth is located in front of the intersection of the median sagittal plane and the midline coronal plane of the human body, the speech section estimation unit 1080 will be described later in this embodiment. Also, the direction of the face is estimated.

Among the sound source position candidates whose distance from the human position is smaller than the threshold ($dist_{pos\text{-}th}$), the one with the smallest total distance to the microphone arrays that estimated its sound source directions is taken as the best candidate for the sound source position. That is, when a sound source position candidate has been identified from the direction estimates of two microphone arrays, the "sum of distances" between the candidate and each of those arrays is computed. When there are several candidates for the same human position, the candidate with the smallest sum of distances (that is, the candidate estimated by the microphone arrays closer to that human position) is selected as the best candidate, and the direction of the vector from the human position to the best sound source position is taken as that person's "face orientation" for the utterance section.
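The candidate selection and the face orientation estimate can be illustrated as below. This sketch assumes each candidate carries the positions of the arrays that produced it and that the orientation is expressed as an angle in the xy-plane (the human position is two-dimensional); the data layout and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Candidate = Tuple[Vec3, Sequence[Vec3]]  # (source position, contributing array positions)


def best_candidate(candidates: List[Candidate]) -> Vec3:
    """Pick the candidate whose summed distance to its arrays is smallest."""
    def total_distance(item: Candidate) -> float:
        source, arrays = item
        return sum(math.dist(source, array) for array in arrays)
    return min(candidates, key=total_distance)[0]


def face_orientation_deg(person_xy: Tuple[float, float], source_xyz: Vec3) -> float:
    """Angle (degrees, xy-plane) of the vector from the person to the source."""
    return math.degrees(math.atan2(source_xyz[1] - person_xy[1],
                                   source_xyz[0] - person_xy[0]))
```

A usage example: with a person at the origin, one candidate formed by two nearby arrays and another formed by two distant arrays, `best_candidate` returns the former and `face_orientation_deg` converts its offset from the person into a heading.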

Finally, for each detected utterance section, the sound source separation unit 1090 uses the microphone array closest to the sound source, steers a beam in the detected direction to perform sound source separation, and records the sound from that source in the nonvolatile storage device 2080 as the speech of that speaker, for example in association with the utterance section.
[Realization by computer]
The processing of the sound source direction estimation unit 1040, the speech section estimation unit 1080, and the sound source separation unit 1090 of the sound source position estimation device 2000 is in practice realized by computer hardware and a computer program executed on that hardware, working together. The operation of the computer program that realizes these functions is briefly described below.

図７は、このようなコンピュータプログラムを実行するためのコンピュータシステム２０００のハードウェア構成をブロック図形式で示す図である。 FIG. 7 is a block diagram showing the hardware configuration of a computer system 2000 for executing such a computer program.

As shown in FIG. 7, the computer main body 2010 of the computer system 2000 includes, in addition to a disk drive 2030 and a memory drive 2020, the following components, each connected to a bus 2050: a CPU (Central Processing Unit) 2040; memory including a ROM (Read Only Memory) 2060 and a RAM (Random Access Memory) 2070; a nonvolatile rewritable storage device such as a hard disk 2080; a communication interface 2090 for communicating over a network; and an audio input interface 2092 for exchanging signals with the microphone arrays MC1 and MC2. An optical disc such as a CD-ROM 2200 is loaded into the disk drive 2030, and a memory card 2210 is loaded into the memory drive 2020.

In the following description, the database holding the information on which the programs for the sound source direction estimation unit 1040, the speech section estimation unit 1080, and the sound source separation unit 1090 of the sound source position estimation device 2000 operate is assumed to be stored on the hard disk 2080.

In FIG. 7, a CD-ROM 2200 is assumed as the medium on which information such as programs to be installed on the computer main body is recorded, but other media may be used, for example a DVD-ROM (Digital Versatile Disc), a memory card, or a USB memory. In that case, the computer main body 2010 is provided with a drive device capable of reading such media.

音源位置推定装置２０００の主要部は、コンピュータハードウェアと、ＣＰＵ２０４０により実行されるソフトウェアとにより構成される。一般的にこうしたソフトウェアはＣＤ−ＲＯＭ２２００等の記憶媒体に格納されて流通し、ディスクドライブ２０３０等により記憶媒体から読取られてハードディスク２０８０に一旦格納される。または、当該装置がネットワーク３１０に接続されている場合には、ネットワーク上のサーバから一旦ハードディスク２０８０にコピーされる。そうしてさらにハードディスク２０８０からメモリ中のＲＡＭ２０７０に読出されてＣＰＵ２０４０により実行される。なお、ネットワーク接続されている場合には、ハードディスク２０８０に格納することなくＲＡＭに直接ロードして実行するようにしてもよい。 The main part of the sound source position estimation apparatus 2000 is configured by computer hardware and software executed by the CPU 2040. Generally, such software is stored and distributed in a storage medium such as a CD-ROM 2200, read from the storage medium by a disk drive 2030 or the like, and temporarily stored in the hard disk 2080. Alternatively, when the device is connected to the network 310, it is temporarily copied from the server on the network to the hard disk 2080. Then, the data is further read from the hard disk 2080 to the RAM 2070 in the memory and executed by the CPU 2040. In the case of network connection, the program may be directly loaded into the RAM and executed without being stored in the hard disk 2080.

The program that makes the computer function as the sound source position estimation device 2000 need not include an operating system (OS) that causes the computer main body 2010 to perform the functions of an information processing device or the like. The program need only contain the instructions that call the appropriate functions (modules) in a controlled manner to obtain the desired result. How the computer system 2000 operates is well known, and a detailed description is omitted.

また、上記プログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 Further, the computer that executes the program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.

さらに、ＣＰＵ２０４０も、１つのプロセッサであっても、あるいは複数のプロセッサであってもよい。すなわち、シングルコアのプロセッサであっても、マルチコアのプロセッサであってもよい。 Further, the CPU 2040 may be a single processor or a plurality of processors. That is, it may be a single core processor or a multi-core processor.

The database holding the information on which the program of the sound source position estimation device 2000 operates may instead be stored in an external storage device connected via the interface 2090. For example, when the device is connected to an external server via a network, the database may be stored in a storage device such as a hard disk (not shown) in the external server. In this case, the computer 2000 operates as a client machine and exchanges the database data with the external server over the network.
[Experimental results]
(1) Data collection
As described above, the experiments were carried out in a meeting space in a laboratory equipped with multiple microphone arrays, as shown in FIG. 2.

マイクロホンアレイは机の上に１６チャンネルのものを２個と、天井に８チャンネルのものを２個設置した。 Two 16-channel microphone arrays were installed on the desk and two 8-channel microphone arrays were installed on the ceiling.

図８は、マイクロホンアレイにおけるマイクの配置を示す図である。 FIG. 8 is a diagram showing the arrangement of microphones in the microphone array.

図８（ａ）の平面図および図８（ｂ）の側面図に示すように、１６チャンネルのアレイの形状は直径３０cmの半球面上に配置するようにアレイフレームを作成した。８チャンネルのアレイは１５cmの円形上に均等にマイクを配置した形状である。 As shown in the plan view of FIG. 8 (a) and the side view of FIG. 8 (b), an array frame was prepared so that the 16-channel array was arranged on a hemisphere with a diameter of 30 cm. The 8-channel array has a shape in which microphones are evenly arranged on a 15 cm circle.

図９は、マイクロホンアレイの位置と、評価した人の位置情報を示す図である。 FIG. 9 is a diagram showing the position of the microphone array and the position information of the evaluated person.

図９において、中央の長方形はテーブルを示す。 In FIG. 9, the central rectangle represents a table.

テーブル上のアレイの高さはz=７３０mm、天井のアレイはz=２６９０mmである。テーブルの周り１０か所（P１〜P１０）において、座った条件と立った条件で発声したデータを収集した。 The height of the array on the table is z = 730 mm, and the height of the array on the ceiling is z = 2690 mm. At 10 locations (P1 to P10) around the table, data uttered under sitting and standing conditions were collected.

Two speakers (one male, one female) faced each of four directions at each position (front: F, left: L, back: B, right: R) and uttered the sentence "We are conducting an experiment to detect face orientation" (顔の向きを検出する実験を行っています).
(Evaluation results: single speaker)
FIG. 10 shows the speech section detection rates (precision and recall) when the two speakers (female F1 and male M1) each spoke alone.

Here, "precision" indicates how many correct utterance sections are contained among the sections judged to be utterance sections, whether rightly or wrongly (sections that are actually utterance sections and were judged as such, together with sections that are actually non-utterance sections but were judged to be utterance sections). "Recall" indicates how many sections are correct utterance sections among those that were judged correctly (sections that are actually utterance sections and were judged as such, together with sections that are actually non-utterance sections and were judged as such).

In speech section detection, sound source direction estimation is expected to be less accurate when the speaker at a given position faces away from all the arrays (for example, facing backward, B, at position P1, or facing right, R, at position P2). Results with these conditions excluded are therefore also reported.

In FIG. 10, the results using all the data are labeled "all data", and the results excluding the orientations facing away from the table are labeled "excluding outside direction".

まず、すべてのデータに対する結果（”all data”）は、話者F１（女性）の場合９６%の検出率で話者M１（男性）の場合は８３%であった。この結果に対し、テーブルに背く場合を除外した結果（”excluding outside direction”）では、いずれの話者も９７%以上という高い検出率が得られた。 First, the result ("all data") for all data was 96% for speaker F1 (female) and 83% for speaker M1 (male). In contrast to this result, in the result of excluding the case of disobeying the table (“excluding outside direction”), a high detection rate of 97% or more was obtained for all speakers.

Precision and recall also differed very little, indicating that there were few insertion errors. This is attributed to treating only positions where multiple sound source directions overlap as candidate sound source positions.

次に、図１１および図１２は、顔の向きの推定に関する分析結果を記す図である。 Next, FIG. 11 and FIG. 12 are diagrams illustrating analysis results relating to face orientation estimation.

In FIGS. 11 and 12, the white circles indicate the positions of the microphone arrays. The figures show the sound source direction estimates and the detected face orientation (black arrow) for utterances made at position P2 of FIG. 9 in the four directions (F, L, B, R, in the order (a) to (d)).

Lines pointing in directions where no person is present correspond to noise sources such as the ceiling air conditioning.

The examples in FIGS. 11 and 12 show that, depending on the face orientation, the point where the sound source directions detected by multiple arrays intersect is shifted from the center of the human position toward the direction the face is pointing.

It can also be seen that the face orientation, estimated as the direction from the center of the human position to the intersection of the sound source directions, distinguishes at least the four directions.

図１３は、顔の向きの推定結果の統計値を示す図である。 FIG. 13 is a diagram illustrating statistical values of face direction estimation results.

The results in FIG. 13 show that the mean face orientation estimation error is close to 0 degrees under every condition, so the estimates are scattered around the correct orientation. As for the spread, the standard deviation is around 30 degrees for all data ("all data") and around 20 degrees when the conditions facing away from the arrays are excluded ("excluding outside direction"). These results confirm that at least front, back, left, and right can be distinguished during an utterance.
(Evaluation results: multiple simultaneous speakers)
Next, the results when two people spoke at the same time are described.

When two people uttered the same sentence at the same time, six position pairs were evaluated: (P10; P1), (P10; P2), (P9; P2), (P1; P3), (P2; P4), and (P4; P5). Face orientation was not prescribed; the speakers were instructed to speak in a natural orientation, as if talking to each other in a meeting.

発話区間検出においては、条件の数は少ないが、９８%の検出率が得られ、２名同時発話でも精度よく発話区間検出が可能であることが示された。 In the detection of the utterance interval, although the number of conditions is small, a detection rate of 98% was obtained, and it was shown that the utterance interval can be detected with high accuracy even when two people speak simultaneously.

図１４は、２名が同時に発声した場合の顔の向きの推定結果を示す図である。 FIG. 14 is a diagram showing the estimation result of the face orientation when two people speak simultaneously.

As shown in FIG. 14, in the example of FIG. 14(a) the speakers are seated side by side about 1 meter apart, and the estimated face orientations indicate that even when addressing each other they do not turn to face each other squarely. In FIG. 14(b), by contrast, the speakers are on two adjacent sides of the table, and it can be seen that they speak facing each other.

It was also confirmed that utterance intervals can be detected without problems when four people speak simultaneously around the meeting table. In addition, data were collected of utterances made while walking around the table, and it was confirmed that both utterance-interval detection and face-direction estimation work correctly for a moving speaker.

As described above, the dialogue action recognition system 1000 of the present embodiment can improve the accuracy of voice activity detection by combining sound source position estimation using a plurality of arrays with human position information.
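The combination described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment. It takes one arrival direction from each of two arrays, accepts the pair only when the shortest distance between the two direction lines is at most a first threshold (`max_gap`), and then attributes the resulting position to a person only when that person lies within a second threshold (`max_dist`). Taking the midpoint of the shortest connecting segment as the candidate, rejecting crossings behind an array, and all function and parameter names are assumptions made for illustration.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def point_on(p, d, s):
    return tuple(x + s * y for x, y in zip(p, d))

def candidate_from_rays(p1, d1, p2, d2, max_gap):
    """Return a candidate source position for two arrival directions, or None.

    p1, p2: array positions; d1, d2: arrival directions seen from each array.
    """
    w = sub(p1, p2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:           # parallel directions: no unique closest points
        return None
    s = (b * e - c * d) / denom      # parameter of the closest point on line 1
    t = (a * e - b * d) / denom      # parameter of the closest point on line 2
    if s <= 0 or t <= 0:             # assumption: ignore crossings behind an array
        return None
    q1, q2 = point_on(p1, d1, s), point_on(p2, d2, t)
    if math.dist(q1, q2) > max_gap:  # first threshold: the lines must nearly cross
        return None
    # Assumption: use the midpoint of the shortest connecting segment.
    return tuple((x + y) / 2 for x, y in zip(q1, q2))

def match_speaker(source, people, max_dist):
    """Return the id of the person nearest to source within max_dist, else None."""
    best = min(people, key=lambda pid: math.dist(people[pid], source), default=None)
    if best is None or math.dist(people[best], source) > max_dist:
        return None
    return best

# Hypothetical geometry: two arrays 4 m apart and two tracked people.
source = candidate_from_rays((0, 0, 0), (1, 1, 0), (4, 0, 0), (-1, 1, 0.1), max_gap=0.3)
people = {"A": (2.0, 2.1, 0.0), "B": (0.5, 3.0, 0.0)}
print(source, match_speaker(source, people, max_dist=0.5))
```

Requiring agreement between independently estimated directions and a tracked person position is what lets such a scheme discard directions caused by reflections or noise that do not correspond to any person.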

Furthermore, the dialogue action recognition system 1000 can also estimate the direction of the speaker's face during an utterance. This provides a clue to the spatial context in which the utterance was made and enables more advanced dialogue action recognition.
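One way to picture a face-direction estimate derived from a person position and the corresponding sound source position is shown below. This is only a sketch: it assumes that the face direction is taken as the horizontal direction of the vector from the person position to the estimated sound source position, which is an interpretation for illustration and not a statement of the embodiment's exact procedure. The function name and coordinate convention are hypothetical.

```python
import math

def face_direction_deg(person_xy, source_xy):
    """Angle in degrees, counterclockwise from the +x axis, from person to source.

    Returns None when the two positions coincide and no direction is defined.
    """
    dx = source_xy[0] - person_xy[0]
    dy = source_xy[1] - person_xy[1]
    if dx == 0 and dy == 0:
        return None
    return math.degrees(math.atan2(dy, dx))

# Hypothetical positions in meters: the source lies slightly in the +y direction
# from the person, so the estimated direction is about 90 degrees.
print(face_direction_deg((2.0, 2.0), (2.0, 2.2)))
```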

In the above description, the dialogue action recognition system 1000 was described as a system for observing data when a plurality of people converse or collaborate while occasionally changing seats, as in a classroom or a meeting. In a meeting scenario, however, the system can also identify who spoke during the meeting and what that person said. If the utterance content is then converted to text by speech recognition technology, the system can be applied to automatic creation of meeting minutes.

The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

61 eigenvector calculation unit, 62 MUSIC processing unit, 86 correlation matrix calculation unit, 88 eigenvalue decomposition unit, 106 MUSIC response calculation unit, 110 sound source direction estimation processing unit, 1040 sound source direction estimation unit, 1050 sound source power spectrum acquisition unit, 1060 sound source direction estimation unit, 1080 speech section estimation unit, 1070 person position estimation unit, 1090 sound source separation unit, MC1, MC2 microphone arrays.

Claims

A sound source position estimation apparatus comprising:
a plurality of sound sensor arrays;
human position estimating means for estimating the position of a person in a predetermined space;
a storage device for storing information on the arrangement of each sound sensor in the sound sensor arrays and position information of the person;
sound source localization means for executing processing to specify the directions from which sound arrives at the plurality of sound sensor arrays, based on the signals of a plurality of channels from each of the plurality of sound sensor arrays and the positional relationship between the sound sensors included in each sound sensor array; and
speech section estimation means for estimating a person who is speaking, based on sets of sound arrival directions specified by different sound sensor arrays among the plurality of sound sensor arrays and on the position information of the person,
wherein, for each set of sound arrival directions, the speech section estimation means estimates that a candidate position of a sound source exists on the straight line corresponding to the shortest distance between the extension lines of the arrival directions, when that shortest distance is equal to or less than a first threshold value.

The sound source position estimation apparatus according to claim 1, wherein the speech section estimation means estimates, as the sound source position, the candidate position that, among the candidate positions of the sound source, has the minimum sum of distances to the sound sensor arrays used to specify the sound arrival directions.

The sound source position estimation apparatus according to claim 2, wherein the speech section estimation means estimates a person who is speaking when the distance between the estimated sound source position and the position of the person is equal to or less than a second threshold value.

The sound source position estimation apparatus according to claim 3, wherein the speech section estimation means estimates the face direction of the person who is speaking from the position of the person estimated to be speaking and the corresponding sound source position.

The sound source position estimation apparatus according to any one of claims 1 to 4, further comprising sound source separation means for separating the speech of the person estimated to be speaking by the speech section estimation means and recording the utterance content in association with the speaker.

A sound source position estimation method for estimating a speaker in a predetermined space based on signals from a plurality of sound sensor arrays and an estimated position of a person, the method comprising:
a step of estimating the position of the person in the predetermined space from measurement data from a position sensor;
a step of specifying the directions from which sound arrives at the plurality of sound sensor arrays, based on the sound source signals of a plurality of channels from each of the plurality of sound sensor arrays and the positional relationship between the sound sensors included in each sound sensor array; and
a step of estimating a person who is speaking, based on sets of sound arrival directions specified by different sound sensor arrays among the plurality of sound sensor arrays and on the position information of the person,
wherein the step of estimating the person who is speaking includes a step of estimating, for each set of sound arrival directions, that a candidate position of the sound source exists on the straight line corresponding to the shortest distance between the extension lines of the arrival directions, when that shortest distance is equal to or less than a first threshold value.

A sound source position estimation program for causing a computer having an arithmetic device and a storage device to estimate a speaker in a predetermined space based on signals from a plurality of sound sensor arrays and an estimated position of a person, the program causing the computer to execute:
a step in which the arithmetic device estimates the position of the person in the predetermined space from measurement data from a position sensor;
a step in which the arithmetic device specifies the directions from which sound arrives at the plurality of sound sensor arrays, based on the sound source signals of a plurality of channels from each of the plurality of sound sensor arrays and the positional relationship between the sound sensors included in each sound sensor array; and
a step in which the arithmetic device estimates a person who is speaking, based on sets of sound arrival directions specified by different sound sensor arrays among the plurality of sound sensor arrays and on the position information of the person, the step including a step of estimating, for each set of sound arrival directions, that a candidate position of the sound source exists on the straight line corresponding to the shortest distance between the extension lines of the arrival directions, when that shortest distance is equal to or less than a first threshold value.