JP5724125B2

JP5724125B2 - Sound source localization device

Info

Publication number: JP5724125B2
Application number: JP2011076230A
Authority: JP
Inventors: Carlos Toshinori Ishii; Masahiro Shiomi; Panikos Heracleous; Jani Even; Takahiro Miyashita; Satoshi Koizumi; Norihiro Hagita
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2011-03-30
Filing date: 2011-03-30
Publication date: 2015-05-27
Anticipated expiration: 2031-03-30
Also published as: JP2012211768A

Description

The present invention relates to sound source localization, and more particularly to a sound source localization technique for accurately tracking human speech in an environment where people and noise sources coexist.

In spoken communication between a person and a robot, the microphones mounted on the robot are usually some distance (1 m or more) from the speaker. The signal-to-noise ratio (SNR) is therefore lower than in, for example, telephone speech, where the mouth is only a few centimeters from the microphone. As a result, the voices of other people nearby and environmental noise act as interference and make it difficult for the robot to recognize the target speech. Sound source localization and sound source separation are therefore important for robot applications.

Sound source localization has been studied extensively. Most studies, however, use only simulated or laboratory data, and few evaluate data from the real environments in which robots operate. Few studies evaluate three-dimensional sound source localization either. Speaking and listening while keeping track of where the conversation partner is located is also an important behavior for improving human-robot interaction, so localization of moving sound sources matters as well.

One prior-art technique that assumes a real environment is described in Patent Document 1. It uses a well-known, high-resolution sound source localization method called the MUSIC method.

In the invention described in Patent Document 1, a microphone array is used, and the current correlation matrix is calculated from the received signal vector, obtained by Fourier-transforming the signals from the microphone array, and the past correlation matrix. The correlation matrix obtained in this way is eigendecomposed to obtain the maximum eigenvalue and the noise space, which consists of the eigenvectors corresponding to the eigenvalues other than the maximum eigenvalue. Then, taking one microphone of the array as a reference, the direction of the sound source is estimated by the MUSIC method from the phase differences between the microphone outputs, the noise space, and the maximum eigenvalue.

Patent Document 1: JP 2008-175733 A

However, the method described in Patent Document 1 appears to leave room for improvement. For example, when people and other noise sources are present together, human speech must be separated from noise with high accuracy. Unless the accuracy of such sound source separation improves, the accuracy of downstream processing such as speech recognition or speaker identification cannot improve either. This is a particular problem when a sound source moves, as a person does, or when the sound source localization device is mounted on something mobile, such as a robot. Furthermore, if the type of sound can be determined before speech recognition or speaker identification, the load on subsequent processing is reduced, which is even more desirable.

An object of the present invention is therefore to provide a sound source localization device that can both localize sound sources and determine the attributes of those sources.

A sound source localization device according to a first aspect of the present invention includes: human position detection means for detecting the positions of people with a laser range finder; sound source localization means for calculating, at predetermined time intervals, the MUSIC power for each of a plurality of directions defined in a space centered on a point determined in relation to the position of a microphone array, based on each of the multi-channel sound source signals obtained from the output of the microphone array, the positional relationship between the microphones in the array, and the output of the human position detection means, and for detecting, at each interval, the peaks of the MUSIC power as sound source positions; sound source separation means for separating, from the output signals of the microphone array, the audio signal arriving from each sound source position detected by the sound source localization means; and sound source attribute determination means for determining the attributes of the audio signal separated by the sound source separation means.

The human positions detected by the laser range finder are used as information for sound source localization, so localization accuracy can be higher than when audio signals alone are used. With more accurate localization, the attributes of the separated sources can be estimated stably and accurately. As a result, a sound source localization and sound attribute estimation device that can both localize sound sources and determine their attributes can be provided.

Preferably, the sound source attribute determination means includes: a plurality of individual acoustic models, which are statistical models of the acoustic features of the speech of a plurality of individuals; a plurality of noise acoustic models, which are statistical models of the acoustic features of sound from non-human noise sources with known attributes; acoustic model selection means that receives the output of the human position detection means and the output of the sound source localization means, selects both the individual acoustic models and the noise acoustic models when a person is present in the direction of the sound source, and selects the noise acoustic models when no person is present in that direction; and statistical estimation means for estimating, by a probabilistic method and using the acoustic models selected by the acoustic model selection means, the attributes of the audio signal separated by the sound source separation means.

When the attributes of an audio signal are estimated, both the individual acoustic models and the noise acoustic models are used if, given the human positions detected by the laser range finder, the source of that signal may be a person. When the laser range finder detects no person at that position, only the noise acoustic models are used. This reduces the amount of computation needed to estimate the source attributes, speeds up processing, and improves the accuracy of attribute determination.

More preferably, the sound source localization means includes: coarse position estimation means for calculating, at predetermined time intervals, the MUSIC power for each of the plurality of directions based on each of the multi-channel sound source signals obtained from the output of the microphone array and the positional relationship between the microphones in the array, and for estimating the positions and directions at which the MUSIC power has a peak exceeding a threshold as the approximate positions of sound sources; and detailed detection means for detecting the sound source positions by searching for the MUSIC power peak in finer detail around those positions and directions, among the ones estimated by the coarse position estimation means, at which the human position detection means has detected a person.

After coarse sound source localization using the source information obtained from the audio signals, localization can be carried out at a finer resolution around the positions where people were detected. This improves localization accuracy around human positions while limiting the increase in computation.
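The coarse-then-fine search described above can be sketched as follows. This is only an illustration under simplifying assumptions: it searches azimuth alone rather than the three-dimensional directions of the embodiment, and the function names, the step sizes, and the 10-degree window are hypothetical values chosen for the example, not values taken from the patent.

```python
def angular_diff(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def coarse_to_fine(power, person_azimuths, threshold,
                   coarse_step=5.0, fine_step=1.0, window=10.0):
    """Return refined source azimuths near detected people.

    power: callable mapping an azimuth in degrees to a MUSIC-like power value.
    person_azimuths: azimuths at which a person was detected (e.g. by an LRF).
    All step sizes and the window are illustrative assumptions.
    """
    # Coarse pass: evaluate the power on a sparse grid and keep the
    # local maxima that exceed the threshold.
    n = int(round(360.0 / coarse_step))
    grid = [i * coarse_step for i in range(n)]
    vals = [power(a) for a in grid]
    candidates = [
        grid[i] for i in range(n)
        if vals[i] > threshold
        and vals[i] >= vals[i - 1]
        and vals[i] >= vals[(i + 1) % n]
    ]

    # Fine pass: only around people that lie close to a coarse candidate.
    refined = []
    steps = int(round(2.0 * window / fine_step))
    for person in person_azimuths:
        if any(angular_diff(c, person) <= window for c in candidates):
            fine = [(person - window + k * fine_step) % 360.0
                    for k in range(steps + 1)]
            refined.append(max(fine, key=power))
    return refined
```

The fine pass costs a fixed number of extra evaluations per detected person, which reflects the stated aim of improving accuracy around people without a dense search over every direction.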

FIG. 1 is a schematic diagram showing the basic processing structure of a sound source separation and sound type determination device according to one embodiment of the present invention.
FIG. 2 is a block diagram showing the schematic functional configuration of the sound source separation and sound type determination device shown in FIG. 1.
FIG. 3 is a block diagram showing the functional configuration of the sound source localization processing unit shown in FIG. 2.
FIG. 4 is a block diagram showing the functional configuration of the sound source localization unit shown in FIG. 3.
FIG. 5 is a block diagram showing the functional configuration of the sound source separation processing unit shown in FIG. 2.
FIG. 6 is a block diagram showing the functional configuration of the sound source type identification processing unit shown in FIG. 2.
FIG. 7 is a block diagram showing the functional configuration of the sound source attribute determination unit shown in FIG. 6.
FIG. 8 is a schematic diagram showing the outline of the attribute candidate list output by the sound source attribute determination unit shown in FIG. 7 and of the attribute candidate list of the previous generation.
FIG. 9 is a flowchart outlining the control structure of a computer program that performs sound source attribute determination.
FIG. 10 is a flowchart showing the control structure of the ID exchange check routine that is called recursively in the program shown in FIG. 9.
FIG. 11 shows the external appearance of a computer system for realizing the sound source separation and sound type determination device according to the embodiment of the present invention.
FIG. 12 is a block diagram of the hardware configuration of the computer system shown in FIG. 11.

In the following description and drawings, identical parts are given identical reference numerals. Detailed descriptions of them are therefore not repeated.

The following embodiment uses a technology called a laser range finder (LRF), which measures the distance to objects, determines whether a person is within the measurement range, and tracks the people it has measured. This technology is widespread in the field of mobile robots, which must move while measuring their surroundings. Techniques that identify objects by matching objects detected from the LRF output against pre-registered objects have also been developed; one such technique is described in Reference 1 below. Techniques that estimate not only a person's position but also the person's orientation have been developed as well (Reference 2). The device used to detect human positions is not limited to an LRF; image processing may instead be applied to images captured by a camera or the like.

[Reference 1]
Shunsuke Sakaba, Tetsuo Tomizawa, Kotaro Ohba, Kazuyoshi Wada, "Study on path planning using knowledge of distributed object shapes together with an LRF," Proceedings of the 8th SICE System Integration Division Annual Conference (SI2007), December 7, 2007, Society of Instrument and Control Engineers.

[Reference 2]
Takahiro Miyashita, Dylan Glas, Hiroshi Ishiguro, Norihiro Hagita, "Human tracking method using an adaptive human shape model with laser range finders," The 25th Annual Conference of the Robotics Society of Japan, 1I13, 2007.

Development of LRF-based techniques for determining the position and orientation of objects and for matching them against known objects is thus progressing. Combining an LRF with sound source localization, however, has not previously been considered at all. In this embodiment, LRF-based human tracking and human identification are applied to sound source tracking and sound source type identification, improving the accuracy of sound source separation, sound source tracking, and sound source type determination.

[Configuration]
FIG. 1 conceptually shows the principle of the configuration of one embodiment of the present invention. The sound source separation and sound type determination device, which is one example of the sound source localization device according to the present invention, determines sound source types and localizes sound sources by combining an LRF (not shown in FIG. 1), a human position measurement device that determines from the LRF output the positions of nearby people and their identities (human identifiers), and the sound source localization technique disclosed in Patent Document 1. As the flowchart form of FIG. 1 suggests, this embodiment is realized by computer hardware including a CPU (central processing unit) and a computer program that, when executed by that hardware, determines sound source types and localizes sound sources. The present invention is of course not limited to this combination. It goes without saying that the same effect can be obtained with, for example, a device that implements the same algorithm as the program in hardware, or a device in which the program is hard-wired.

Referring to FIG. 1, the program that controls the operation of the sound source localization device according to this embodiment includes: step 30 of sensing whether a person is present in the environment using the LRF, an infrared sensor (not shown), a thermal sensor (not shown), or the like, and terminating operation of the device when no person is determined to be present; step 32 of determining, when the determination in step 30 is affirmative (a person is present), whether the user has performed the process for terminating operation of the device (termination process), and stopping the device if so; step 34 of performing sound source localization, when the determination in step 32 is negative, based on the outputs of a plurality of microphone arrays (not shown) and a human position determination device (not shown); step 36 of identifying the sound type of each source using the result of step 34 and separating human speech from non-human sound; and step 38 of tracking the human speech separated in step 36 and storing it with appropriate labels, thereby incrementally building a dialogue speech database. When step 38 finishes, control returns to step 32, and steps 32 to 38 are then repeated until the user issues a termination instruction.
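The control flow of steps 30 to 38 can be sketched as a simple loop. The function name `run_pipeline` and its callable parameters are hypothetical; they stand in for the sensing, localization, identification, and storage stages, whose real implementations are not given at this point in the text.

```python
def run_pipeline(person_present, stop_requested, localize, identify, store):
    """Run the FIG. 1 control loop and return the number of completed cycles.

    Every argument is a caller-supplied callable (an assumption made for
    this sketch):
      person_present() -> bool   step 30: is anyone in the environment?
      stop_requested() -> bool   step 32: has the user asked to stop?
      localize() -> sources      step 34: sound source localization
      identify(sources) -> x     step 36: sound type identification/separation
      store(x)                   step 38: track, label, and store speech
    """
    # Step 30: terminate immediately when nobody is present.
    if not person_present():
        return 0

    cycles = 0
    # Step 32 is checked before every cycle; control returns here after step 38.
    while not stop_requested():
        sources = localize()          # step 34
        labeled = identify(sources)   # step 36
        store(labeled)                # step 38
        cycles += 1
    return cycles
```

Passing the stages in as callables keeps the sketch independent of any particular sensor or model; a real system would bind them to the microphone arrays, the human position measurement device, and the speech database.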

Referring to FIG. 2, the sound source localization system that includes this device comprises: a microphone array group 52 including a plurality of microphone arrays; an LRF group 56 including a plurality of LRFs; a human position measurement device 58 that stores in advance the features and identifiers of people who may be in the vicinity and, using the output of LRF group 56, outputs information indicating which person is at which position (position information and human identifiers; hereinafter collectively called human position information); a synchronization time server 54 for controlling synchronization among the parts of the system; and a sound source localization device 50 that is connected to receive the output of microphone array group 52, the synchronization control signal output from synchronization time server 54, and the human position information output from human position measurement device 58, and that localizes sound sources based on the audio signals output from microphone array group 52, separates the sources, and identifies and outputs the type of each source.

Sound source localization device 50 includes: a sound source localization processing unit 60 that receives the audio signals output by each microphone array of microphone array group 52 and the human position information from human position measurement device 58, performs sound source localization, and outputs information indicating the directions of the sounds considered to have come from sound sources (in most cases there are several directions, one per source); a sound source separation processing unit 70 that receives the sound direction information from sound source localization processing unit 60 and the audio signals from microphone array group 52, and separates the audio signal 74 of the source in each direction obtained from sound source localization processing unit 60 from the other audio signals and outputs it; and a sound source type identification processing unit 72 that identifies and outputs the sound type of each source using audio signals 74 output by sound source separation processing unit 70, the human position information output by human position measurement device 58, and the information on the directions and positions of the sound signals output by sound source localization processing unit 60.

Referring to FIG. 3, sound source localization processing unit 60 shown in FIG. 2 includes: a position vector DB (database) 80 that stores position vectors specifying the plurality of directions to be searched for sound source localization; an array position DB 82 that stores position vectors indicating the position of each microphone array in microphone array group 52; a plurality of sound source localization units 84, ..., 86, ..., 88, one for each microphone array in microphone array group 52, each of which receives the search-direction position vectors from position vector DB 80, the position vector of its microphone array from array position DB 82, and the human position information from human position measurement device 58 shown in FIG. 2, determines the sound source directions with high accuracy by sound source localization that uses the human position information in addition to the known MUSIC-based localization method, and outputs them; a relative position vector generation unit 108 that uses the position information of each microphone array obtained from array position DB 82 and the position vector of each direction stored in position vector DB 80 to generate and output the relative position vector of each direction with respect to each microphone array; and a detailed search unit 110 that uses the sound source direction information output by sound source localization units 84, ..., 86, ..., 88, the relative position vectors output by relative position vector generation unit 108, and the human position information provided by human position measurement device 58 to search in detail the positions where a MUSIC spectrum peak is likely to exist, and outputs a signal indicating the peak positions.

Sound source localization units 84, ..., 86, ..., 88 all have the same structure. The structure of sound source localization unit 84 is therefore described below as representative.
Referring to FIG. 4, sound source localization unit 84 includes: an A/D converter 100 that receives from the corresponding microphone array as many analog sound source signals as there are microphones in the array (for example, 14), performs analog-to-digital (A/D) conversion, and outputs the same number of digital sound source signals; an eigenvector calculation unit 102 that receives the digital audio signals output from A/D converter 100, divides them into frames at predetermined time intervals, and, for each frame, calculates and outputs the correlation matrix of the microphone outputs needed to calculate the MUSIC response, its maximum eigenvalue, and the noise space consisting of the eigenvectors corresponding to the eigenvalues other than the maximum eigenvalue; a MUSIC processing unit 104 that uses the information output from eigenvector calculation unit 102 at each interval to output the MUSIC response, calculated by the MUSIC method, for each direction determined by the position vectors obtained from position vector DB 80; and a peak detection unit 106 that compares the MUSIC response output by MUSIC processing unit 104 with a threshold to estimate the directions in which a sound source is likely to exist, that is, the peak directions, and outputs information indicating the directions of the sound sources.

In this embodiment, A/D converter 100 converts the analog signal output by each microphone into a digital signal at the commonly used 16 kHz sampling rate and 16-bit resolution.

Eigenvector calculation unit 102 includes: a framing unit 120 for dividing the digital sound source signals output by A/D converter 100 into frames with a predetermined frame length and a predetermined shift length; an FFT processing unit 122 that applies an FFT (fast Fourier transform) to each of the framed multi-channel sound source signals output by framing unit 120 and converts them into a predetermined number of frequency regions (hereinafter each frequency region is called a "bin" and the number of frequency regions the "number of bins"); a correlation matrix calculation unit 124 that calculates and outputs, at predetermined time intervals, a correlation matrix whose elements are the correlations between the bin values of the channels, which FFT processing unit 122 outputs at time intervals corresponding to the shift length of framing unit 120; and an eigenvalue decomposition unit 126 that eigendecomposes the correlation matrix output from correlation matrix calculation unit 124 and outputs an output 112, consisting of the maximum eigenvalue and the noise space, to MUSIC processing unit 104. In this embodiment, the band at or below 1 kHz, where spatial resolution is low, and the band at or above 6 kHz, where spatial aliasing can occur, are excluded from the frequency components of the sound source signals.
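The chain of framing, FFT, correlation matrix, and eigendecomposition can be sketched with NumPy as below. The frame length (512 samples), shift (160 samples), Hann window, and the choice to average the correlation matrix over all frames of the input block are assumptions made for this sketch; the text only says that these values are "predetermined". The function name `noise_subspaces` is likewise hypothetical.

```python
import numpy as np


def noise_subspaces(x, fs=16000, frame_len=512, shift=160,
                    n_sources=1, f_lo=1000.0, f_hi=6000.0):
    """Per-bin noise subspaces for a block of multi-channel audio.

    x: real array of shape (n_mics, n_samples).
    Returns (freqs, subspaces), where subspaces[k] has shape
    (n_mics, n_mics - n_sources) for frequency freqs[k].
    """
    n_mics, n_samples = x.shape
    window = np.hanning(frame_len)

    # Framing and FFT: spectra has shape (n_frames, n_mics, n_bins).
    starts = range(0, n_samples - frame_len + 1, shift)
    spectra = np.stack([
        np.fft.rfft(x[:, s:s + frame_len] * window, axis=1) for s in starts
    ])
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)

    # Keep only the band between 1 kHz and 6 kHz, as in the embodiment.
    keep = np.where((freqs > f_lo) & (freqs < f_hi))[0]

    subspaces = []
    for b in keep:
        X = spectra[:, :, b]                 # (n_frames, n_mics)
        R = X.T @ X.conj() / X.shape[0]      # correlation matrix E[x x^H]
        _, vecs = np.linalg.eigh(R)          # eigenvalues in ascending order
        # Noise space: eigenvectors of the smallest eigenvalues.
        subspaces.append(vecs[:, :n_mics - n_sources])
    return freqs[keep], subspaces
```

With `n_sources=1` the noise space consists of every eigenvector except the one for the maximum eigenvalue, which matches the description of output 112.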

MUSIC processing unit 104 includes: a MUSIC spatial spectrum calculation unit 140 that receives from position vector DB 80 the position vectors representing the position of each microphone in the corresponding microphone array and, using the eigenvectors and noise space output from eigenvalue decomposition unit 126 and assuming a fixed number of sound sources, calculates and outputs the MUSIC spatial spectrum by the MUSIC method; and a MUSIC response calculation unit 142 for calculating and outputting, for each direction given by the position vectors, a value called the MUSIC response in accordance with the MUSIC method, based on the MUSIC spatial spectrum calculated by MUSIC spatial spectrum calculation unit 140.
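For orientation, a narrowband MUSIC spatial spectrum for a single frequency bin might be computed as in the sketch below. It uses the textbook form P = (a^H a) / (a^H E_n E_n^H a) with a far-field steering vector a; the steering-vector sign convention, the speed of sound, and the function names are assumptions of this sketch. How the embodiment combines bins into the MUSIC response is not specified in this passage and is not reproduced here.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_vector(mic_pos, az_deg, el_deg, freq):
    """Far-field steering vector for microphones at mic_pos (n_mics, 3)."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    # Unit vector pointing from the array toward the source.
    u = np.array([np.cos(el) * np.cos(az),
                  np.cos(el) * np.sin(az),
                  np.sin(el)])
    return np.exp(2j * np.pi * freq * (mic_pos @ u) / SPEED_OF_SOUND)


def music_power(R, mic_pos, directions, freq, n_sources=1):
    """MUSIC spatial spectrum of correlation matrix R over the directions.

    directions: iterable of (azimuth_deg, elevation_deg) search points.
    The number of sources is fixed in advance, as in the description.
    """
    n_mics = R.shape[0]
    _, vecs = np.linalg.eigh(R)              # ascending eigenvalues
    noise = vecs[:, :n_mics - n_sources]     # noise-space eigenvectors
    out = []
    for az, el in directions:
        a = steering_vector(mic_pos, az, el, freq)
        proj = noise.conj().T @ a            # component of a in the noise space
        denom = max(np.real(proj.conj() @ proj), 1e-12)
        out.append(np.real(a.conj() @ a) / denom)
    return np.array(out)
```

The spectrum peaks where the steering vector is nearly orthogonal to the noise space, so a peak detector applied to `music_power` output yields candidate source directions.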

Here, a "direction" means one cell of a mesh defined in three-dimensional space for searching for sound source positions. In this embodiment, the mesh divides space into rings spanning 5 degrees of elevation each, and the number of search points in a ring varies with its elevation. A "search point" here means the center point of a mesh cell.

The number of search points is chosen so that, in the ring at 0 degrees elevation, the angle between adjacent search points is 5 degrees. The number of search points is largest in the ring at 0 degrees elevation and decreases as the elevation increases. Within any one ring, the search points are equally spaced, and that spacing (which may be regarded as an angle) is chosen to be as close as possible to the spacing between adjacent search points in the ring at 0 degrees elevation.
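One way to generate such a grid is sketched below. The rule used here, scaling the number of points in a ring by the cosine of its elevation, is an assumption that satisfies the stated properties (72 points at 0 degrees, fewer at higher elevations, roughly constant spacing); the patent does not give the exact formula, and `build_search_grid` is a hypothetical name.

```python
import math


def build_search_grid(step_deg=5.0, min_elev_deg=0.0, max_elev_deg=90.0):
    """Return a list of (azimuth_deg, elevation_deg) search points.

    Rings are spaced step_deg apart in elevation. Each ring holds equally
    spaced points, and the count shrinks with cos(elevation) so that the
    spacing stays close to the spacing in the 0-degree ring.
    """
    points = []
    n_equator = round(360.0 / step_deg)      # 72 points for a 5-degree step
    n_rings = int(round((max_elev_deg - min_elev_deg) / step_deg)) + 1
    for i in range(n_rings):
        elev = min_elev_deg + i * step_deg
        # The ring circumference is proportional to cos(elevation).
        n = max(1, round(n_equator * math.cos(math.radians(elev))))
        for k in range(n):
            points.append((360.0 * k / n, elev))
    return points
```

Compared with a uniform azimuth-elevation grid, this layout avoids crowding search points near the zenith, which reduces the number of directions for which the MUSIC response must be evaluated.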

Referring to FIG. 5, sound source separation processing unit 70 includes a plurality of adaptive beamformers 160, ..., 162, ..., 164, one for each microphone array. Based on the information indicating the direction and position of one sound source output from sound source localization processing unit 60, each adaptive beamformer takes the output of the microphone array close to that source direction, enhances the signal from the target direction, and suppresses interfering sound from other directions, thereby separating and outputting the audio signal of the source. The outputs of adaptive beamformers 160, ..., 162, ..., 164 are supplied to sound source type identification processing unit 72 as audio signals 74 from the separated sources.

図６を参照して、音源種類同定処理部７２は、各々が個人別の音響モデルである、複数の個人別ＧＭＭ１８０と、各々が特定の種類の雑音に対応する音響モデルである、複数の雑音ＧＭＭ１８２と、各々が、音源定位処理部６０から受ける音源の方向及び位置に基づいて、その音源が人間か否かを判定し、判定結果に応じて複数の個人別ＧＭＭ１８０又は複数の雑音ＧＭＭ１８２のいずれかを選択して音源の属性を判定する、複数の音源属性判定部１９０，…，１９２，…，１９４とを含む。音源属性判定部１９０，…，１９２，…，１９４はいずれも同様の構成を有する。したがって、以下ではこれらを代表して音源属性判定部１９０の構成について説明する。 Referring to FIG. 6, sound source type identification processing unit 72 includes a plurality of individual GMMs 180, each of which is an acoustic model for one individual; a plurality of noise GMMs 182, each of which is an acoustic model corresponding to a specific type of noise; and a plurality of sound source attribute determination units 190, ..., 192, ..., 194. Each of these units determines, based on the direction and position of a sound source received from sound source localization processing unit 60, whether that sound source is a human, selects either the individual GMMs 180 or the noise GMMs 182 according to the result, and determines the attribute of the sound source. Sound source attribute determination units 190, ..., 192, ..., 194 all have the same configuration, so the configuration of sound source attribute determination unit 190 is described below as their representative.

図７を参照して、音源属性判定部１９０は、人位置計測装置５８から与えられる人の位置に関する情報と、音源定位処理部６０から与えられる音源の方向及び位置に関する情報とを比較し、両者が一致するか否かに基づいて音源が人によるものか否かを示す信号を出力する比較部２１０と、複数の個人別ＧＭＭ１８０に接続された入力と、複数の雑音ＧＭＭ１８２に接続された入力とを持ち、比較部２１０の出力に基づき、音源の方向に人がいるときには両者を選択し、人がいないときには雑音ＧＭＭ１８２のみを選択して出力する選択部２１２とを含む。 Referring to FIG. 7, sound source attribute determination unit 190 compares information on the position of the person given from person position measurement device 58 with information on the direction and position of the sound source given from sound source localization processing unit 60, A comparison unit 210 that outputs a signal indicating whether the sound source is from a person based on whether the sound sources match, an input connected to a plurality of individual GMMs 180, and an input connected to a plurality of noise GMMs 182 And a selection unit 212 that selects both when there is a person in the direction of the sound source based on the output of the comparison unit 210, and selects and outputs only the noise GMM 182 when there is no person.

音源属性判定部１９０はさらに、分離された音源であって比較部２１０への入力に対応する音源の音声信号から、フレームごとにＭＦＣＣなどの音響特徴量を抽出し、特徴ベクトルの系列として出力する特徴抽出部２１４と、特徴抽出部２１４により抽出された特徴ベクトルの系列に対し、選択部２１２により選択されたＧＭＭ群（複数の雑音ＧＭＭ１８２、又は、複数の個人別ＧＭＭ１８０及び複数の雑音ＧＭＭ１８２）を用い、音声の属性を推定し、推定結果を出力する音源属性推定部２１６とを含む。音源属性推定部２１６の出力は、音源が人間であれば、候補の人間の識別子とその尤度とからなる候補リストであり、音源が人間以外であれば候補の雑音の特定情報とその尤度とからなる候補リストである。 Sound source attribute determination unit 190 further includes a feature extraction unit 214 that extracts acoustic features such as MFCCs, frame by frame, from the audio signal of the separated sound source corresponding to the input to comparison unit 210 and outputs them as a sequence of feature vectors; and a sound source attribute estimation unit 216 that applies the GMM group selected by selection unit 212 (the noise GMMs 182 alone, or the individual GMMs 180 together with the noise GMMs 182) to the feature vector sequence extracted by feature extraction unit 214, estimates the attribute of the sound, and outputs the estimation result. If the sound source is a human, the output of sound source attribute estimation unit 216 is a candidate list of candidate person identifiers and their likelihoods; otherwise it is a candidate list of information specifying candidate noises and their likelihoods.

このように、音源方向に人がいる場合には複数の個人別ＧＭＭ１８０と雑音ＧＭＭ１８２とを用いて音源の属性を推定し、人がいないと考えられる場合には雑音ＧＭＭ１８２のみを属性推定に用いる。人がいない場合には雑音ＧＭＭ１８２のみにモデルが絞られるため、処理量が削減され、処理時間が短縮化される上、推定の精度も高くなる。 Thus, when there is a person in the direction of the sound source, the attribute of the sound source is estimated using a plurality of individual GMM 180 and noise GMM 182, and when it is considered that there is no person, only the noise GMM 182 is used for attribute estimation. When there is no person, the model is narrowed down to the noise GMM 182 only, so that the processing amount is reduced, the processing time is shortened, and the estimation accuracy is also increased.
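The model selection described above can be sketched roughly as follows. Everything here is an illustrative assumption rather than the specification's implementation: the names `position_matches`, `gmm_loglik` and `estimate_attribute`, the distance tolerance `tol`, and the use of one-dimensional features with `(weight, mean, variance)` mixture components in place of real MFCC vectors.

```python
import math

def position_matches(person_positions, source_pos, tol=0.5):
    """Assumed test: a person counts as 'at the source' within tol metres."""
    return any(math.dist(p, source_pos) <= tol for p in person_positions)

def gmm_loglik(frames, gmm):
    """Average log-likelihood of 1-D frames under a GMM of (weight, mean, var)."""
    total = 0.0
    for x in frames:
        p = sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in gmm)
        total += math.log(p)
    return total / len(frames)

def estimate_attribute(frames, person_present, person_gmms, noise_gmms, top_n=3):
    """Score only the selected model set and return a likelihood-sorted list."""
    models = dict(noise_gmms)           # noise models are always used
    if person_present:
        models.update(person_gmms)      # add individual models only near a person
    scored = [(model_id, gmm_loglik(frames, g)) for model_id, g in models.items()]
    scored.sort(key=lambda entry: entry[1], reverse=True)
    return scored[:top_n]

persons = {"A": [(1.0, 1.0, 0.1)], "B": [(1.0, -1.0, 0.1)]}
noises = {"fan": [(1.0, 5.0, 1.0)]}
frames = [0.9, 1.0, 1.1]
near_person = position_matches([(1.0, 2.0)], (1.2, 2.1))
with_person = estimate_attribute(frames, near_person, persons, noises)
without_person = estimate_attribute(frames, False, persons, noises)
```

When no person is near the source, only the noise models are scored, which is where the reduction in processing comes from.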

音源属性推定部２１６から出力される検出ＩＤ・尤度リスト２３０は尤度を伴う。したがって途中で音源の属性が入れ替わる場合もあり得る。そのため、図１に示すステップ３８により行なわれるトラッキングでは、検出ＩＤ・尤度リスト２３０上で候補の順序に変化が生じたか否かを常に監視する必要がある。 The detection ID / likelihood list 230 output from the sound source attribute estimation unit 216 includes likelihood. Therefore, the attribute of the sound source may be changed during the process. Therefore, in the tracking performed in step 38 shown in FIG. 1, it is necessary to always monitor whether or not the order of candidates has changed on the detection ID / likelihood list 230.

図８（Ａ）を参照して、図７に示す検出ＩＤ・尤度リスト２３０は、複数の属性の推定結果の候補であって、それぞれの尤度にしたがって配列された複数の候補のエントリを含む。検出ＩＤ・尤度リスト２３０の各候補のエントリは、候補の識別子ＣＩＤｎ（ｎは検出ＩＤ・尤度リスト２３０上における順番を示す。）と、音声がその候補により発生されたものである確率を示す尤度ＣＰｒｏｂ_ｎ（ｎはＣＩＤｎのｎと同様。）とを含む。検出ＩＤ・尤度リスト２３０には、こうした候補が複数個配列されている。 Referring to FIG. 8A, detection ID/likelihood list 230 shown in FIG. 7 contains entries for a plurality of candidate attribute estimation results, arranged in order of their likelihoods. Each candidate entry in detection ID/likelihood list 230 includes a candidate identifier CIDn (where n is the position in detection ID/likelihood list 230) and a likelihood CProb_n (with the same n as CIDn) indicating the probability that the sound was produced by that candidate. Detection ID/likelihood list 230 holds a plurality of such candidates.

なお、図１に示すステップ３８の処理のため、この装置は、検出ＩＤ・尤度リスト２３０の時系列を音源属性のトラッキングの履歴として保存する。フレームの各音声とこれら履歴とを互いに関連付けてあるため、結果として各発話に対し、その発話者のラベル及び発話位置を付した対話データベースが維持できる。 For the processing of step 38 shown in FIG. 1, this apparatus stores the time series of detection ID/likelihood lists 230 as the tracking history of the sound source attributes. Because the audio of each frame is associated with this history, the apparatus can maintain a dialogue database in which each utterance is labeled with its speaker and its utterance position.

図８（Ｂ）を参照して、図１に示すステップ３８の処理のため、この装置は、上記した検出ＩＤ・尤度リスト２３０をコピーした作業用の候補リスト２４０と、上記した履歴の先頭の候補リストをコピーした、作業用の履歴リスト２４２とを用いる。履歴リスト２４２の各候補は候補リスト２４０内の各候補と同じ構成を持っている。ここでは、直前候補は識別子ＨＩＤｎ（ｎは履歴リスト２４２における順位）により表し、その尤度をＨＰｒｏｂ_ｎ（ｎはＨＩＤｎのｎと同様）により表す。候補リスト２４０及び履歴リスト２４２を使用して、図１のステップ３８で属性交換のチェックが行なわれる。その方法について図９及び図１０を参照して説明する。 Referring to FIG. 8B, for the processing of step 38 shown in FIG. 1, this apparatus uses a working candidate list 240, which is a copy of detection ID/likelihood list 230 described above, and a working history list 242, which is a copy of the candidate list at the head of the history described above. Each candidate in history list 242 has the same structure as each candidate in candidate list 240. Here, an immediately preceding candidate is represented by an identifier HIDn (where n is the rank in history list 242), and its likelihood by HProb_n (with the same n as HIDn). Using candidate list 240 and history list 242, the attribute exchange check is performed in step 38 of FIG. 1. The method is described with reference to FIG. 9 and FIG. 10.

図９に、図１のステップ３８を実現するためのプログラムの制御構造の概略をフローチャート形式で示す。なお図９では、図を分かりやすくするために各ステップを単一の音源に対して行なった場合を示してあるが、実際にはこれら処理は音源の全てに対して行なわれる。 FIG. 9 shows an outline of a control structure of a program for realizing step 38 in FIG. 1 in a flowchart format. FIG. 9 shows a case where each step is performed on a single sound source for the sake of clarity, but in actuality, these processes are performed on all of the sound sources.

図９を参照して、このプログラムは、履歴の末尾の候補リスト（直前サイクルで図７の音源属性推定部２１６から出力された検出ＩＤ・尤度リスト２３０と同じ）を履歴リスト２４２にコピーするステップ２５０と、現在のサイクルで音源属性推定部２１６から出力された検出ＩＤ・尤度リスト２３０を候補リスト２４０にコピーするステップ２５２と、候補リスト２４０と履歴リスト２４２とを引数にして図１０に制御構造を示すＩＤ交換チェックルーチンを呼出すステップ２５４とを含む。後述するように、ＩＤ交換チェックルーチンは再帰的に自己を呼出すプログラムであり、最終的にステップ２５４に制御が戻ってきた段階ではＩＤの交換がもしあれば交換がされた後のリストが履歴リスト２４２に、もしなければ引数で渡した履歴リスト２４２がそのまま、戻り値として返される。このルーチンの内容については図１０を参照して後述する。 Referring to FIG. 9, this program copies the candidate list at the end of the history (same as detection ID / likelihood list 230 output from sound source attribute estimation unit 216 in FIG. 7 in the previous cycle) to history list 242. The step 250, the step 252 for copying the detection ID / likelihood list 230 output from the sound source attribute estimation unit 216 in the current cycle to the candidate list 240, the candidate list 240 and the history list 242 as arguments are shown in FIG. And a step 254 of calling an ID exchange check routine indicating the control structure. As will be described later, the ID exchange check routine is a program that recursively calls itself, and when the control finally returns to step 254, the list after the exchange if there is an ID exchange is a history list. If not, the history list 242 passed as an argument is returned as it is as a return value. The contents of this routine will be described later with reference to FIG.

このプログラムはさらに、ステップ２５４においてＩＤ交換チェックルーチンからの戻り値である履歴リスト２４２にもし空白部があれば、検出ＩＤ・尤度リスト２３０の、対応する要素（候補の識別子及び尤度）をコピーするステップ２５６と、こうして最終的に得られた履歴リスト２４２を履歴の末尾に追加するステップ２５８と、対話データベースに、各音声データをその発話者のＩＤ及び位置情報とともに追加し、処理を終了するステップ２６０とを含む。 If there is a blank part in the history list 242 which is a return value from the ID exchange check routine in step 254, this program further displays the corresponding element (candidate identifier and likelihood) of the detected ID / likelihood list 230. Step 256 for copying, step 258 for adding the history list 242 thus finally obtained to the end of the history, and adding each voice data to the dialogue database together with the ID and position information of the speaker, and the process is terminated. Step 260.
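The flow of steps 250 to 260 can be sketched as below. This is a rough illustration under stated assumptions: the function name `tracking_step`, the representation of each list as `(identifier, likelihood)` tuples, the use of `None` for a blank entry, and the dictionary layout of a dialogue database record are all hypothetical. The ID exchange check routine of FIG. 10 is passed in as the callable `check`; the stand-in used in the demonstration only copies the candidate list, which is the behaviour of step 306 alone.

```python
def tracking_step(history, detection_list, dialogue_db, audio, position, check):
    """One tracking cycle; history is a list of past lists, newest last."""
    hist = list(history[-1])            # step 250: copy the last list of the history
    cand = list(detection_list)         # step 252: copy the current detection list
    hist = check(cand, hist)            # step 254: ID exchange check routine
    # Step 256: fill blank (None) entries from the detection list.
    hist = [d if h is None else h for h, d in zip(hist, detection_list)]
    history.append(hist)                # step 258: append to the history
    dialogue_db.append({                # step 260: record audio, speaker ID, position
        "audio": audio,
        "speaker": hist[0][0],
        "position": position,
    })
    return hist

history = [[("A", 0.7), ("B", 0.2), ("fan", 0.1)]]
dialogue_db = []
copy_candidates = lambda cand, hist: list(cand)   # stand-in for the FIG. 10 routine
latest = tracking_step(history, [("A", 0.6), ("B", 0.3), ("fan", 0.1)],
                       dialogue_db, "frame-0001", (1.0, 2.0), copy_candidates)
```

Passing the check routine as an argument keeps the per-cycle bookkeeping separate from the exchange logic, so either part can be replaced independently.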

図１０を参照して、図９のステップ２５４で呼出されるＩＤ交換チェックルーチンは、以下の制御構造を持つ。すなわち、このプログラムは、候補リスト２４０の要素数が１か否かを判定し、判定結果により制御の流れを分岐させるステップ２８０と、ステップ２８０の判定が否定のときに、さらに候補リスト２４０の先頭の候補の識別子ＣＩＤ１と、履歴リスト２４２の先頭の候補の識別子ＨＩＤ１とが一致するか否かを判定し、判定結果により制御の流れを分岐させるステップ２８２と、ステップ２８０又はステップ２８２の判定が肯定のときに実行され、候補リスト２４０を履歴リスト２４２にコピーしてこのルーチンの実行を終了して呼出元ルーチンに制御を戻すステップ３０６とを含む。なお、このルーチンが呼出元ルーチンに制御を戻すときには、戻り値として履歴リスト２４２が戻されるものとする。 Referring to FIG. 10, the ID exchange check routine called in step 254 of FIG. 9 has the following control structure. That is, this program determines whether or not the number of elements in the candidate list 240 is 1, and when the determination result is step 280 for branching the control flow, and when the determination in step 280 is negative, the program further starts the candidate list 240. It is determined whether or not the candidate identifier CID1 and the identifier HID1 of the first candidate in the history list 242 match, and the determination in step 282 and step 280 or step 282 is affirmed according to the determination result. And a step 306 which copies the candidate list 240 to the history list 242 and terminates execution of this routine and returns control to the calling routine. When this routine returns control to the caller routine, the history list 242 is returned as a return value.

このプログラムはさらに、ステップ２８２の判定が否定のときに、候補リスト２４０の１番目及び２番目の候補の識別子ＣＩＤ１及びＣＩＤ２の尤度ＣＰｒｏｂ１及びＣＰｒｏｂ２に基づき以下の式によりそれぞれ新たな尤度ＮＰｒｏｂ１及びＮＰｒｏｂ２を再計算するステップ２８４を含む。 The program further includes step 284, executed when the determination in step 282 is negative, of recalculating new likelihoods NProb1 and NProb2 by the following equations from the likelihoods CProb1 and CProb2 of the first and second candidates (identifiers CID1 and CID2) in candidate list 240.

ただしｗは０＜ｗ＜１を満たす任意の値である。 Here, w is an arbitrary value satisfying 0 < w < 1.

このプログラムはさらに、ステップ２８４に続き、新たに計算された尤度ＮＰｒｏｂ１が尤度ＮＰｒｏｂ２より大きいか否かを判定し、判定結果に従って制御の流れを分岐させるステップ２８６と、ステップ２８６の判定が肯定のときに実行され、候補リスト２４０の１番目の候補の識別子ＣＩＤ１を履歴リスト２４２の１番目の候補の識別子ＨＩＤ１に代入し、候補リスト２４０の１番目の候補について再計算された尤度ＮＰｒｏｂ１を履歴リスト２４２の１番目の候補の尤度ＨＰｒｏｂ１に代入するステップ２８８と、ステップ２８８に続き、候補リスト２４０の２番目の候補の識別子ＣＩＤ２を履歴リスト２４２の２番目の候補の識別子ＨＩＤ２に代入し、候補リスト２４０の２番目の候補について再計算された尤度ＮＰｒｏｂ２を履歴リスト２４２の２番目の候補の尤度ＨＰｒｏｂ２に代入して呼出元ルーチンに制御を戻すステップ２９０とを含む。 Following step 284, the program further includes step 286 of determining whether the newly calculated likelihood NProb1 is greater than the likelihood NProb2 and branching the control flow according to the result; step 288, executed when the determination in step 286 is affirmative, of assigning identifier CID1 of the first candidate in candidate list 240 to identifier HID1 of the first candidate in history list 242 and assigning the likelihood NProb1 recalculated for the first candidate in candidate list 240 to likelihood HProb1 of the first candidate in history list 242; and step 290, following step 288, of assigning identifier CID2 of the second candidate in candidate list 240 to identifier HID2 of the second candidate in history list 242, assigning the likelihood NProb2 recalculated for the second candidate in candidate list 240 to likelihood HProb2 of the second candidate in history list 242, and returning control to the calling routine.

このプログラムはさらに、ステップ２８６の判定が否定のときに、新たに計算された尤度ＮＰｒｏｂ２が尤度ＨＰｒｏｂ２より大きいか否かを判定し、判定が否定のときには制御の流れをステップ２８８に分岐させるステップ２９２と、ステップ２９２の判定が肯定のときに実行され、候補リスト２４０の２番目の候補の識別子ＣＩＤ２を履歴リスト２４２の１番目の候補の識別子ＨＩＤ１に代入し、候補リスト２４０の２番目の候補について再計算された尤度ＮＰｒｏｂ２を履歴リスト２４２の１番目の候補の尤度ＨＰｒｏｂ１に代入するステップ２９４と、ステップ２９４に続き、候補リスト２４０の１番目の候補の識別子ＣＩＤ１を履歴リスト２４２の２番目の候補の識別子ＨＩＤ２に代入し、候補リスト２４０の１番目の候補について再計算された尤度ＮＰｒｏｂ１を履歴リスト２４２の２番目の候補の尤度ＨＰｒｏｂ２に代入するステップ２９６とを含む。 The program further includes step 292, executed when the determination in step 286 is negative, of determining whether the newly calculated likelihood NProb2 is greater than the likelihood HProb2 and branching the control flow to step 288 when that determination is negative; step 294, executed when the determination in step 292 is affirmative, of assigning identifier CID2 of the second candidate in candidate list 240 to identifier HID1 of the first candidate in history list 242 and assigning the likelihood NProb2 recalculated for the second candidate in candidate list 240 to likelihood HProb1 of the first candidate in history list 242; and step 296, following step 294, of assigning identifier CID1 of the first candidate in candidate list 240 to identifier HID2 of the second candidate in history list 242 and assigning the likelihood NProb1 recalculated for the first candidate in candidate list 240 to likelihood HProb2 of the second candidate in history list 242.

このプログラムはさらに、ステップ２９６に続き、履歴リスト２４２のうち、先頭の候補ＨＩＤ１のエントリを除いたリストを新たな引数（履歴リスト２４２）として自分自身を再帰的に呼出すステップ２９８と、ステップ２９８の処理による戻り値の履歴リスト２４２の先頭に、先頭の候補ＨＩＤ１のエントリを追加して、履歴リスト２４２を戻り値として呼出元ルーチンに制御を戻すステップ３００とを含む。 In step 296, the program further recursively calls itself as a new argument (history list 242) from the history list 242 excluding the entry of the first candidate HID1. And a step 300 of adding an entry of the first candidate HID1 to the head of the return value history list 242 by processing and returning the control to the calling source routine using the history list 242 as a return value.

［動作］ [Operation]

上に説明した音源分離及び音種類判定装置は以下のように動作する。この動作に先立ち、図２に示す人位置計測装置５８には、測定対象となる人物をＬＲＦ群５６の出力に基づいて識別するために必要な情報と、各人物の識別子とが記憶されているものとする。また図３に示す位置ベクトルＤＢ８０には音源分離及び音種類判定装置がＭＵＳＩＣ応答を算出するための空間グリッドの各点（方位）を特定する位置ベクトルが予め記憶されている。アレイ位置ＤＢ８２には、マイクロホンアレイ群５２を構成する各マイクロホンアレイの位置が記憶される。複数の個人別ＧＭＭ１８０としては、測定対象となる人物についてそれぞれ予め作成された音響モデルが準備される。雑音ＧＭＭ１８２としては、予め収集された、属性が予め分かっている雑音に関する音響モデルが準備される。 The sound source separation and sound type determination apparatus described above operates as follows. Prior to this operation, it is assumed that person position measuring device 58 shown in FIG. 2 stores the information needed to identify each person to be measured from the output of LRF group 56, together with the identifier of each person. Position vector DB 80 shown in FIG. 3 stores in advance the position vectors that specify each point (direction) of the spatial grid for which the apparatus calculates the MUSIC response. Array position DB 82 stores the position of each microphone array in microphone array group 52. As the individual GMMs 180, acoustic models created in advance for each person to be measured are prepared. As the noise GMMs 182, acoustic models of noises collected in advance, whose attributes are already known, are prepared.

音源分離及び音種類判定装置が動作を開始すると、図１及び図２を参照して、ＬＲＦ群５６が周囲に存在する人物に関する情報を出力し、人位置計測装置５８に与える。人位置計測装置５８は、ＬＲＦ群５６からの出力に基づき、周囲に存在している人物の位置と、それら各人物の識別子とを音源定位処理部６０及び音源種類同定処理部７２に出力する。人物が何ら検知されないときには（図１のステップ３０にてＮＯ）音源分離及び音種類判定装置は動作を終了する。人物が検知され、かつこの装置に対して動作の終了を指示する操作がされなければ（ステップ３２においてＮＯ）、音源定位処理がステップ３４で実行される。 When the sound source separation and sound type determination device starts operating, the LRF group 56 outputs information about a person existing in the vicinity with reference to FIG. 1 and FIG. Based on the output from the LRF group 56, the human position measuring device 58 outputs the positions of the persons existing around and the identifiers of these persons to the sound source localization processing unit 60 and the sound source type identification processing unit 72. When no person is detected (NO in step 30 in FIG. 1), the sound source separation and sound type determination device ends the operation. If a person is detected and no operation for instructing the apparatus to end the operation is performed (NO in step 32), a sound source localization process is executed in step 34.

音源定位処理は以下のように行なわれる。図２を参照して、マイクロホンアレイ群５２の各マイクロホンアレイは、各位置で、複数のマイクロホンにより音声をアナログ電気信号である電気信号に変換し、音源定位処理部６０に与える。 The sound source localization process is performed as follows. Referring to FIG. 2, each microphone array of microphone array group 52 converts sound into analog electrical signals with its plurality of microphones at its own position and gives them to sound source localization processing unit 60.

図３及び図４を参照して、音源定位処理部６０の音源定位部８４の各々において、以下の処理が実行される。特に図４を参照して、Ａ／Ｄ変換器１００が、対応のマイクロホンアレイから与えられる複数の音声信号を複数チャネルのデジタル音声信号に変換し、固有ベクトル算出部１０２のフレーム化処理部１２０に与える。フレーム化処理部１２０は、所定フレーム長及び所定シフト長でこれら複数チャネルのデジタル音声をフレーム化し、ＦＦＴ処理部１２２に与える。ＦＦＴ処理部１２２は、与えられる複数チャネルのデジタル音声信号の各々について、フレームごとにＦＦＴを施し、周波数領域に変換して相関行列算出部１２４に与える。相関行列算出部１２４は、ＦＦＴ処理部１２２の出力する各ビンの値の間の相関を要素とする相関行列を所定時間ごとに算出し固有値分解部１２６に与える。固有値分解部１２６は、この相関行列の最大固有値と、最大固有値以外の固有値に対応する固有ベクトルである雑音空間とを求め、出力１１２としてＭＵＳＩＣ空間スペクトル算出部１４０に与える。 With reference to FIGS. 3 and 4, the following processing is executed in each of sound source localization units 84 of sound source localization processing unit 60. Referring to FIG. 4 in particular, A / D converter 100 converts a plurality of audio signals provided from a corresponding microphone array into a digital audio signal of a plurality of channels, and provides it to framing processing unit 120 of eigenvector calculation unit 102. . The framing processor 120 framing the digital audio of the plurality of channels with a predetermined frame length and a predetermined shift length, and provides the frame to the FFT processor 122. The FFT processing unit 122 performs FFT on each of the given digital audio signals of a plurality of channels for each frame, converts it to the frequency domain, and provides it to the correlation matrix calculation unit 124. Correlation matrix calculation section 124 calculates a correlation matrix having the correlation between the bin values output from FFT processing section 122 as elements, and provides the correlation matrix to eigenvalue decomposition section 126. The eigenvalue decomposition unit 126 obtains the maximum eigenvalue of the correlation matrix and a noise space that is an eigenvector corresponding to an eigenvalue other than the maximum eigenvalue, and gives the output 112 to the MUSIC space spectrum calculation unit 140.
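The framing and frequency-domain conversion described above can be sketched for one channel as follows. This is a simplified illustration, not the specification's implementation: the names `frame_signal` and `dft` are hypothetical, the frame length and shift are arbitrary small values, and a naive discrete Fourier transform is used in place of an FFT to keep the example free of dependencies (the result is the same, only slower).

```python
import cmath
import math

def frame_signal(samples, frame_len, shift):
    """Split one channel into overlapping frames of frame_len, shift apart."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, shift)]

def dft(frame):
    """Naive DFT of one frame (stand-in for the FFT)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]

# A tone that completes 2 cycles per 8-sample frame.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
frames = frame_signal(signal, frame_len=8, shift=4)
spectra = [dft(f) for f in frames]
# Strongest bin among the non-negative frequencies of the first frame.
peak_bin = max(range(5), key=lambda k: abs(spectra[0][k]))
```

In the multi-channel case the same framing and transform are applied to every channel, and the correlation matrix is then formed per frequency bin from the channel values.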

ＭＵＳＩＣ空間スペクトル算出部１４０は、この音源定位部８４に対応するマイクロホンアレイ内のマイクロホンの位置を表す位置ベクトルを位置ベクトルＤＢ８０から受け、固有値分解部１２６から受けた固有ベクトル及び雑音空間を用い、ＭＵＳＩＣ法によりＭＵＳＩＣ空間スペクトルを算出し出力する。このとき、ＭＵＳＩＣ空間スペクトル算出部１４０は、音源数を固定したものとしてＭＵＳＩＣ空間スペクトルの算出を行なう。算出されたＭＵＳＩＣ空間スペクトルはＭＵＳＩＣ応答算出部１４２に与えられる。 The MUSIC space spectrum calculation unit 140 receives a position vector representing the position of the microphone in the microphone array corresponding to the sound source localization unit 84 from the position vector DB 80, uses the eigenvector and noise space received from the eigenvalue decomposition unit 126, and uses the MUSIC method To calculate and output the MUSIC spatial spectrum. At this time, the MUSIC spatial spectrum calculation unit 140 calculates the MUSIC spatial spectrum assuming that the number of sound sources is fixed. The calculated MUSIC spatial spectrum is given to the MUSIC response calculation unit 142.

ＭＵＳＩＣ応答算出部１４２は、与えられたＭＵＳＩＣ空間スペクトルに基づいて、ＭＵＳＩＣ法にしたがいＭＵＳＩＣ応答を位置ベクトルに応じた各方位について算出しピーク検出部１０６に出力する。 The MUSIC response calculation unit 142 calculates the MUSIC response for each direction according to the position vector according to the MUSIC method based on the given MUSIC spatial spectrum, and outputs the MUSIC response to the peak detection unit 106.

ピーク検出部１０６は、ＭＵＳＩＣ応答算出部１４２から出力される各方位についてのＭＵＳＩＣ応答の値としきい値とを比較し、ＭＵＳＩＣ応答のピーク位置の候補を音源位置として定め、その方向を示す情報を詳細探索部１１０（図３）に与える。 Peak detection unit 106 compares the MUSIC response value for each direction output from MUSIC response calculation unit 142 with a threshold, determines the candidate peak positions of the MUSIC response as sound source positions, and gives information indicating their directions to detailed search unit 110 (FIG. 3).
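The chain from correlation matrix to MUSIC response peak can be illustrated with a much-reduced sketch. The assumptions are substantial and are not taken from the specification: a single narrowband frequency bin, a single source, a uniform linear array with half-wavelength spacing, a one-dimensional 5-degree direction grid, power iteration to obtain the eigenvector of the largest eigenvalue, and a plain maximum instead of the threshold comparison. All function names are hypothetical.

```python
import cmath
import math
import random

def steering(theta_deg, n_mics):
    """Steering vector of a half-wavelength-spaced linear array (assumed geometry)."""
    phase = math.pi * math.sin(math.radians(theta_deg))
    return [cmath.exp(-1j * phase * m) for m in range(n_mics)]

def correlation_matrix(snapshots):
    """Average of x x^H over the snapshots of one frequency bin."""
    n = len(snapshots[0])
    r = [[0j] * n for _ in range(n)]
    for x in snapshots:
        for i in range(n):
            for j in range(n):
                r[i][j] += x[i] * x[j].conjugate()
    return [[v / len(snapshots) for v in row] for row in r]

def principal_eigenvector(r, iters=100):
    """Power iteration: unit eigenvector of the largest eigenvalue."""
    n = len(r)
    u = [1 + 0j] * n
    for _ in range(iters):
        v = [sum(r[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(abs(c) ** 2 for c in v))
        u = [c / norm for c in v]
    return u

def music_spectrum(u, thetas, n_mics):
    """a^H a / (a^H E_n E_n^H a), with noise space E_n E_n^H = I - u u^H."""
    spectrum = []
    for theta in thetas:
        a = steering(theta, n_mics)
        aa = sum(abs(c) ** 2 for c in a)
        proj = abs(sum(uc.conjugate() * ac for uc, ac in zip(u, a))) ** 2
        spectrum.append(aa / max(aa - proj, 1e-9))  # clamp to avoid division by ~0
    return spectrum

rng = random.Random(0)
n_mics, true_deg = 8, 20
a0 = steering(true_deg, n_mics)
snapshots = []
for _ in range(200):
    s = cmath.exp(2j * math.pi * rng.random())      # unit-power source sample
    snapshots.append([s * a + complex(rng.gauss(0, 0.05), rng.gauss(0, 0.05))
                      for a in a0])
grid = list(range(-90, 91, 5))
spectrum = music_spectrum(principal_eigenvector(correlation_matrix(snapshots)),
                          grid, n_mics)
peak_deg = grid[spectrum.index(max(spectrum))]
```

With one source, every eigenvector other than the principal one belongs to the noise space, so projecting onto the noise space reduces to removing the component along `u`. A broadband response would combine such spectra over the frequency bins.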

図３を参照して、相対位置ベクトル生成部１０８は、位置ベクトルＤＢ８０に記憶された各位置ベクトルと、アレイ位置ＤＢ８２に記憶されたマイクロホンアレイ群５２内のマイクロホンアレイの位置とに基づき、各マイクロホンアレイに対する相対位置ベクトルを算出し、詳細探索部１１０に与える。詳細探索部１１０は、人位置計測装置５８（図２）から与えられる人位置及びＩＤと、相対位置ベクトル生成部１０８から与えられる各相対位置ベクトルとを用い、音源定位部８４からそれぞれ出力される音源方位情報に基づき、マイクロホンアレイの位置を起点とし、音源位置を通る半直線の交点位置を中心とした所定の範囲内においてさらに詳細にＭＵＳＩＣ応答の値が高い位置を探索し、その位置を示す信号を音源分離処理部７０に対して出力する。以上が、図１のステップ３４の処理に相当する。 Referring to FIG. 3, relative position vector generation unit 108 sets each microphone based on each position vector stored in position vector DB 80 and the position of the microphone array in microphone array group 52 stored in array position DB 82. A relative position vector with respect to the array is calculated and provided to the detailed search unit 110. The detailed search unit 110 uses the person position and ID given from the person position measuring device 58 (FIG. 2) and each relative position vector given from the relative position vector generation unit 108, and outputs them from the sound source localization unit 84. Based on the sound source azimuth information, the position where the value of the MUSIC response is high in a predetermined range centering on the intersection of the half line passing through the sound source position and starting from the position of the microphone array is searched for and the position is indicated. The signal is output to the sound source separation processing unit 70. The above corresponds to the processing of step 34 in FIG.
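The intersection of half-lines that serves as the centre of the detailed search can be sketched in two dimensions. The reduction to a plane, the name `ray_intersection`, and the use of azimuth angles measured counter-clockwise from the x axis are assumptions made for illustration; the apparatus itself works with three-dimensional directions.

```python
import math

def ray_intersection(p1, az1_deg, p2, az2_deg):
    """Intersection of two half-lines starting at p1 and p2, or None.

    Each half-line starts at a microphone array position and points along
    the estimated source direction.
    """
    d1 = (math.cos(math.radians(az1_deg)), math.sin(math.radians(az1_deg)))
    d2 = (math.cos(math.radians(az2_deg)), math.sin(math.radians(az2_deg)))
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        return None                      # parallel directions never cross
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (dx * d2[1] - dy * d2[0]) / denom
    t2 = (dx * d1[1] - dy * d1[0]) / denom
    if t1 < 0 or t2 < 0:
        return None                      # crossing lies behind one of the arrays
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])

# Two arrays 4 m apart that both point at a source at (2, 2).
center = ray_intersection((0.0, 0.0), 45.0, (4.0, 0.0), 135.0)
```

In three dimensions two half-lines rarely meet exactly, so a practical variant would take the midpoint of their closest approach; the detailed search around that point then compensates for the residual error.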

図５を参照して、音源分離処理部７０の複数の適応ビームフォーマ１６０，…，１６２，…，１６４はそれぞれ、対応する詳細探索部１１０から出力される音源の方向及び位置の情報を用い、マイクロホンアレイの出力する音声信号からその音源の音声信号を分離し、音源種類同定処理部７２に与える。 Referring to FIG. 5, the plurality of adaptive beamformers 160,..., 162,... 164 of the sound source separation processing unit 70 each use information on the direction and position of the sound source output from the corresponding detailed search unit 110. The sound signal of the sound source is separated from the sound signal output from the microphone array and provided to the sound source type identification processing unit 72.

図６を参照して、音源種類同定処理部７２の音源属性判定部１９０，…，１９２，…，１９４はそれぞれ、人位置計測装置５８から与えられる人の位置及びＩＤを示す情報と、音源定位処理部６０から与えられる音源の方向及び位置を示す情報とに基づき、音源の属性を以下のように判定してその結果を出力する。 Referring to FIG. 6, each of sound source attribute determination units 190, ..., 192, ..., 194 of sound source type identification processing unit 72 determines the attribute of a sound source as follows, based on the information indicating the positions and IDs of persons given from person position measuring device 58 and the information indicating the direction and position of the sound source given from sound source localization processing unit 60, and outputs the result.

図７を参照して、例えば音源属性判定部１９０の比較部２１０は、人の位置と音源の方向及び位置とを比較し、両者が一致していれば複数の個人別ＧＭＭ１８０と複数の雑音ＧＭＭ１８２を、さもなければ複数の雑音ＧＭＭ１８２のみを、それぞれ選択して音源属性推定部２１６に与える。一方、特徴抽出部２１４は、処理対象となる音源からの音声信号から所定の特徴量を抽出し、フレームごとの特徴量ベクトルの系列を音源属性推定部２１６に与える。 Referring to FIG. 7, for example, the comparison unit 210 of the sound source attribute determination unit 190 compares the position of the person with the direction and position of the sound source, and if they match, a plurality of individual GMMs 180 and a plurality of noise GMMs 182 are compared. Otherwise, only the plurality of noise GMMs 182 are selected and supplied to the sound source attribute estimation unit 216, respectively. On the other hand, the feature extraction unit 214 extracts a predetermined feature amount from an audio signal from a sound source to be processed, and gives a sequence of feature amount vectors for each frame to the sound source attribute estimation unit 216.

音源属性推定部２１６は、選択部２１２により選択された、複数の雑音ＧＭＭ１８２のみ、又は複数の個人別ＧＭＭ１８０及び雑音ＧＭＭ１８２を用い、特徴抽出部２１４からの特徴量ベクトルの系列が各個人又は各雑音源によるものから生じた尤度を算出し、上位の所定個数からなる候補リストである検出ＩＤ・尤度リスト２３０を作成して出力する。検出ＩＤ・尤度リスト２３０は図８（Ａ）に示す候補リスト２４０と同様の構成を持ち、その音源を発生した個人又は雑音源の候補を尤度順に並べたもので、各エントリは個人又は雑音源の識別子（ＣＩＤ）とその尤度とを含む。音源属性推定部２１６は、フレームシフト時間に対応した間隔でこの検出ＩＤ・尤度リスト２３０を出力する。以上が、図１のステップ３６の処理に相当する。 Using the models selected by selection unit 212 (the noise GMMs 182 alone, or the individual GMMs 180 together with the noise GMMs 182), sound source attribute estimation unit 216 calculates the likelihood that the feature vector sequence from feature extraction unit 214 was produced by each individual or each noise source, and creates and outputs detection ID/likelihood list 230, a candidate list consisting of a predetermined number of top-ranked candidates. Detection ID/likelihood list 230 has the same structure as candidate list 240 shown in FIG. 8A: it lists the candidate individuals or noise sources that may have produced the sound in order of likelihood, and each entry includes the identifier (CID) of an individual or noise source and its likelihood. Sound source attribute estimation unit 216 outputs this detection ID/likelihood list 230 at intervals corresponding to the frame shift time. The above corresponds to the processing of step 36 in FIG. 1.

図１を参照して、ステップ３８のトラッキング処理は以下のように行なわれる。ここでは、既に音源属性推定部２１６により出力された各音源の尤度リストの履歴が保存されており、対話データベースにもそれに対応した音声データが蓄積されているものとする。 Referring to FIG. 1, the tracking process in step 38 is performed as follows. Here, it is assumed that the history of the likelihood list of each sound source output by the sound source attribute estimation unit 216 is already stored, and the corresponding speech data is also stored in the dialogue database.

図９を参照して、この音源分離及び音種類判定装置は、音源属性推定部２１６から検出ＩＤ・尤度リスト２３０が出力されると、ステップ２５０において、既に記憶されていた履歴の末尾のリストを履歴リスト２４２にコピーする。続くステップ２５２において、音源属性推定部２１６が出力した検出ＩＤ・尤度リスト２３０を候補リスト２４０にコピーする。 Referring to FIG. 9, when the detection ID / likelihood list 230 is output from the sound source attribute estimation unit 216, the sound source separation and sound type determination device, at step 250, the list at the end of the history already stored. Is copied to the history list 242. In subsequent step 252, the detection ID / likelihood list 230 output by the sound source attribute estimation unit 216 is copied to the candidate list 240.

続くステップ２５４では、図１０に制御構造を示すＩＤ交換チェックルーチンを呼出す。 In the following step 254, an ID exchange check routine whose control structure is shown in FIG. 10 is called.

図１０を参照して、ＩＤ交換チェックルーチンは以下のように実行される。ここでは、２つの場合について順に説明する。説明を分かりやすくするため、音源属性推定部２１６が出力する検出ＩＤ・尤度リスト２３０の要素数は３であるものとする。最初に、検出ＩＤ・尤度リスト２３０の先頭の候補の識別子が、履歴の末尾のリストの先頭の候補の識別子と同一である場合を説明する。次に、検出ＩＤ・尤度リスト２３０の１番目と２番目の候補の識別子が、履歴の末尾のリストの１番目と２番目の候補の識別子を入替えたものである場合を説明する。 Referring to FIG. 10, the ID exchange check routine is executed as follows. Here, two cases will be described in order. In order to make the explanation easy to understand, it is assumed that the number of elements of the detection ID / likelihood list 230 output by the sound source attribute estimation unit 216 is three. First, the case where the identifier of the top candidate in the detection ID / likelihood list 230 is the same as the identifier of the top candidate in the list at the end of the history will be described. Next, a case will be described in which the identifiers of the first and second candidates in the detection ID / likelihood list 230 are obtained by replacing the identifiers of the first and second candidates in the list at the end of the history.

〈新旧の第１及び第２の候補が同一の場合〉 <When the new and old first and second candidates are the same>

最初にステップ２８０で候補リスト２４０の要素数（候補のエントリ数）が１か否かが判定される。判定結果が肯定であれば制御はステップ３０６に進み、候補リスト２４０が履歴リスト２４２にコピーされ、呼出元ルーチンに復帰する。ここでは、検出ＩＤ・尤度リスト２３０の要素数が３である場合を想定しているのでステップ２８０の判定は否定となり、ステップ２８２に制御が進む。 First, in step 280, it is determined whether the number of elements in candidate list 240 (the number of candidate entries) is one. If the determination is affirmative, control proceeds to step 306, where candidate list 240 is copied to history list 242, and control returns to the calling routine. Here, detection ID/likelihood list 230 is assumed to have three elements, so the determination in step 280 is negative and control proceeds to step 282.

ステップ２８２では、候補リスト２４０の１番目の候補の識別子ＣＩＤ１が履歴リスト２４２の１番目の候補の識別子ＨＩＤ１と等しいか否かが判定される。判定結果が否定であれば制御はステップ２８４に進む。ここでは、仮定から判定結果が肯定となるので、制御はステップ３０６に進み、候補リスト２４０が履歴リスト２４２にコピーされ、制御は呼出元ルーチンに復帰する。 In step 282, it is determined whether the identifier CID1 of the first candidate in the candidate list 240 is equal to the identifier HID1 of the first candidate in the history list 242. If the determination result is negative, control proceeds to step 284. Here, since the determination result is affirmative from the assumption, control proceeds to step 306, the candidate list 240 is copied to the history list 242, and control returns to the caller routine.

図９を参照して、ステップ２５６で、履歴リスト２４２の空白部に、候補リスト２４０の対応要素がコピーされる。ここでは、既に候補リスト２４０が履歴リスト２４２にコピーされているのでステップ２５６では何も処理されない。 Referring to FIG. 9, in step 256, the corresponding element of candidate list 240 is copied to the blank portion of history list 242. Here, since candidate list 240 has already been copied to history list 242, nothing is processed in step 256.

続くステップ２５８では、ステップ２５４及びステップ２５６の結果得られた履歴リスト２４２が、履歴の末尾に追加される。ステップ２６０では、対話データベースに、このときの音声データを履歴リスト２４２の先頭の識別子とともに記録し、次の処理（図１のステップ３２）に制御が戻る。 In the subsequent step 258, the history list 242 obtained as a result of the steps 254 and 256 is added to the end of the history. In step 260, the voice data at this time is recorded in the dialogue database together with the identifier at the head of the history list 242, and the control returns to the next process (step 32 in FIG. 1).

〈新旧の第１及び第２の候補が入れ替わった場合〉 <When the new and old first and second candidates are interchanged>

図１０を参照して、ステップ２８０の判定結果はＮＯとなる。続くステップ２８２の判定結果もＮＯとなる。制御はステップ２８４に進み、ＣＩＤ１とＣＩＤ２との尤度を前述の式にしたがって再計算し、その結果、ＮＰｒｏｂ１とＮＰｒｏｂ２とが得られる。 Referring to FIG. 10, the determination result in step 280 is NO. The determination result in the subsequent step 282 is also NO. Control proceeds to step 284, where the likelihoods of CID1 and CID2 are recalculated according to the equations given above, yielding NProb1 and NProb2.

ステップ２８６ではＮＰｒｏｂ１がＮＰｒｏｂ２より大きいか否かが判定される。判定が肯定の場合には制御はステップ２８８に進み、さもなければ制御はステップ２９２に進む。ステップ２９２ではさらにＮＰｒｏｂ２が履歴リスト２４２の２番目の尤度ＨＰｒｏｂ２より大きいか否かが判定され、その結果にしたがって制御の流れが分岐する。 In step 286, it is determined whether NProb1 is greater than NProb2. If the determination is affirmative, control proceeds to step 288; otherwise control proceeds to step 292. In step 292, it is further determined whether NProb2 is greater than the second likelihood HProb2 in history list 242, and the control flow branches according to the result.

以下、３つの場合に分けて動作を説明する。 The operation will be described below in three cases.

（１）ステップ２８６の判定が肯定 (1) The determination in step 286 is affirmative

この場合、ステップ２８８の処理により、新たな候補リスト２４０の１番目の候補の識別子ＣＩＤ１が履歴リスト２４２の１番目の候補の識別子ＨＩＤ１に代入され、候補リスト２４０の１番目の候補の尤度ＮＰｒｏｂ１が履歴リスト２４２の１番目の候補の尤度ＨＰｒｏｂ１に代入される。さらに、ステップ２９０の処理により、新たな候補リスト２４０の２番目の候補の識別子ＣＩＤ２が履歴リスト２４２の２番目の候補の識別子ＨＩＤ２に代入され、候補リスト２４０の２番目の候補の尤度ＮＰｒｏｂ２が履歴リスト２４２の２番目の候補の尤度ＨＰｒｏｂ２に代入される。すなわち、履歴リスト２４２の１、２番目の候補に代えて、候補リスト２４０の１番目及び２番目の候補が履歴リスト２４２の１番目及び２番目にそれぞれ代入される。この後、図９のステップ２５６に制御が戻る。 In this case, step 288 assigns identifier CID1 of the first candidate in the new candidate list 240 to identifier HID1 of the first candidate in history list 242, and assigns likelihood NProb1 of the first candidate in candidate list 240 to likelihood HProb1 of the first candidate in history list 242. Step 290 then assigns identifier CID2 of the second candidate in the new candidate list 240 to identifier HID2 of the second candidate in history list 242, and assigns likelihood NProb2 of the second candidate in candidate list 240 to likelihood HProb2 of the second candidate in history list 242. That is, the first and second candidates of history list 242 are replaced by the first and second candidates of candidate list 240, respectively. Control then returns to step 256 in FIG. 9.

ここでは履歴リスト２４２の３番目には前回の３番目の候補の情報が入っている。したがってステップ２５６では何も処理されない。続くステップ２５８で、履歴リスト２４２が履歴の末尾に追加され、ステップ２６０で対話データベースにデータが追加される。要するにこの（１）の場合、１番目と２番目の候補は前回と同様であり、入れ替わらない。 Here, the third entry of the history list 242 still holds the information on the previous third candidate. Therefore, nothing is processed in step 256. In the subsequent step 258, the history list 242 is appended to the end of the history, and in step 260 the data is added to the interaction database. In short, in case (1), the first and second candidates are the same as in the previous cycle and are not swapped.

（２）ステップ２８６の判定が否定、ステップ２９２の判定が肯定
この場合には、ステップ２９４で、新たな候補リスト２４０の２番目の候補の識別子ＣＩＤ２が履歴リスト２４２の１番目の候補の識別子ＨＩＤ１に代入され、候補リスト２４０の２番目の候補の尤度ＮＰｒｏｂ２が履歴リスト２４２の１番目の候補の尤度ＨＰｒｏｂ１に代入される。さらに、ステップ２９６で、新たな候補リスト２４０の１番目の候補の識別子ＣＩＤ１が履歴リスト２４２の２番目の候補の識別子ＨＩＤ２に代入され、候補リスト２４０の１番目の候補の尤度ＮＰｒｏｂ１が履歴リスト２４２の２番目の候補の尤度ＨＰｒｏｂ２に代入される。要するに、直前の１番目及び２番目の候補が入れ替わることになる。 (2) The determination in step 286 is negative and the determination in step 292 is affirmative. (Step 292 is reached only when the determination in step 286 is negative.) In this case, in step 294, the identifier CID2 of the second candidate in the new candidate list 240 is assigned to the identifier HID1 of the first candidate in the history list 242, and the likelihood NProb2 of the second candidate in the candidate list 240 is assigned to the likelihood HProb1 of the first candidate in the history list 242. Further, in step 296, the identifier CID1 of the first candidate in the new candidate list 240 is assigned to the identifier HID2 of the second candidate in the history list 242, and the likelihood NProb1 of the first candidate in the candidate list 240 is assigned to the likelihood HProb2 of the second candidate in the history list 242. In short, the immediately preceding first and second candidates are swapped.

ステップ２９８では、さらに、候補リスト２４０及び履歴リスト２４２からそれぞれ先頭の要素を除いたものを引数にして自分自身を呼出す。この処理については後述する。ここでは、ステップ２９８の処理の結果、新たな引数となった候補リスト２４０及び履歴リスト２４２を用い、尤度の再計算の結果、１番目の候補と２番目の候補を入替える必要があった場合にはそのように変更された履歴リスト２４２が戻り値として戻され、そのような入れ替えが必要でないときには、候補の入替がない形の履歴リスト２４２が戻り値として戻されることを指摘しておく。ただし、尤度についてはステップ２８４の結果により修正されている可能性がある。 In step 298, the routine further calls itself, passing as arguments the candidate list 240 and the history list 242, each with its first element removed. This processing is described later. For now, note the following: using the candidate list 240 and the history list 242 passed as the new arguments, if the likelihood recalculation shows that the first and second candidates need to be swapped, the history list 242 modified accordingly is returned as the return value; if no such swap is needed, the history list 242 with its candidates unswapped is returned. In either case, the likelihoods may have been updated by the result of step 284.

ステップ２９８の処理後、ステップ３００において、ステップ２９８で戻り値として得られた履歴リスト２４２の先頭に、ステップ２９８において取り除いておいた先頭の候補のエントリを付加して新たな履歴リスト２４２を生成する。この履歴リスト２４２を戻り値としてこのルーチンの実行を終了し、呼出元ルーチン（図９のステップ２５６）に戻る。 After the processing in step 298, in step 300, a new history list 242 is generated by prepending the head candidate entry removed in step 298 to the history list 242 obtained as the return value of step 298. The routine then ends with this history list 242 as its return value, and control returns to the calling routine (step 256 in FIG. 9).

以下の処理は上記（１）の場合と同様である。 The following processing is the same as in the case of (1) above.

（３）ステップ２８６の判定が否定、ステップ２９２の判定が否定
この場合には上記（１）と同じ処理が実行される。 (3) The determination in step 286 is negative and the determination in step 292 is negative. In this case, the same processing as in (1) above is executed.
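The three cases above can be sketched as a single function that updates only the first two entries of the history list. This is a minimal illustration, not the patented implementation: the function name `update_top_two`, the `(identifier, likelihood)` tuple layout, and the `recalc` callback (standing in for the likelihood recalculation of step 284, whose equation is given elsewhere in the description) are all assumptions made for this example. Step 282 and the recursive call of step 298 are left out.

```python
def update_top_two(candidates, history, recalc):
    """Reorder the first two history entries following steps 284 to 296.

    candidates, history: lists of (identifier, likelihood) tuples (assumed layout).
    recalc: hypothetical callback returning the recalculated likelihood of an
            identifier (stand-in for the equation used in step 284).
    Returns (new_history, swapped).
    """
    cid1, cid2 = candidates[0][0], candidates[1][0]
    nprob1, nprob2 = recalc(cid1), recalc(cid2)      # step 284
    hprob2 = history[1][1]                           # second likelihood in the history list

    new_history = list(history)
    if not (nprob1 > nprob2) and nprob2 > hprob2:    # step 286 negative, step 292 affirmative
        new_history[0] = (cid2, nprob2)              # step 294
        new_history[1] = (cid1, nprob1)              # step 296
        return new_history, True                     # case (2): candidates swapped
    # Cases (1) and (3): candidates keep their order, likelihoods are refreshed.
    new_history[0] = (cid1, nprob1)
    new_history[1] = (cid2, nprob2)
    return new_history, False


# Illustrative values only.
candidates = [("A", 0.6), ("B", 0.5)]
history = [("A", 0.7), ("B", 0.4), ("C", 0.1)]
recalculated = {"A": 0.3, "B": 0.8}
new_history, swapped = update_top_two(candidates, history, recalculated.get)
```

With these illustrative values, NProb1 is not greater than NProb2 and NProb2 exceeds HProb2, so the sketch takes the case (2) branch and the third entry is left untouched.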

〈再帰的処理〉
図１０のステップ２９８で、再帰的な呼出がおこなわれた場合のこのプログラムによる処理について説明する。説明を分かりやすくするため、図９のルーチンを「主ルーチン」、主ルーチンから呼出された図１０のルーチンを「子ルーチン」、子ルーチンから呼出された図１０のルーチンを「孫ルーチン」、孫ルーチンから呼出された図１０のルーチンを「ひ孫ルーチン」と呼ぶことにする。上記説明にしたがえば、孫ルーチンでは、新たな候補リスト２４０及び履歴リスト２４２の要素数は、いずれも２となっている。説明を分かりやすくするため、引数として渡される候補リスト２４０及び履歴リスト２４２の各エントリの識別子及び尤度については、親ルーチンのときと同じ呼び方で示すものとする。 <Recursive processing>
The processing by this program when a recursive call is made at step 298 in FIG. 10 will be described. For easy understanding, the routine of FIG. 9 is “main routine”, the routine of FIG. 10 called from the main routine is “child routine”, the routine of FIG. 10 called from the child routine is “grandchild routine”, and the grandchild The routine of FIG. 10 called from the routine will be referred to as a “great-grandchild routine”. According to the above description, in the grandchild routine, the number of elements in the new candidate list 240 and history list 242 are both 2. For ease of explanation, the identifier and likelihood of each entry in the candidate list 240 and history list 242 passed as arguments are indicated in the same way as in the parent routine.

この例では、ステップ２８０の判定結果は否定となる。以後は子ルーチンの実行時と同様の処理が、引数として渡された、要素数２の候補リスト２４０及び履歴リスト２４２に対して実行され、処理により得られた新たな履歴リスト２４２（要素数は２）を戻り値としてこのルーチンの実行を終了し制御は子ルーチンに戻る。ただし、図１０のステップ２９８での処理が問題となる。すなわちこのステップの処理が行なわれる場合、孫ルーチンで再度、このルーチンの呼出しが行なわれる。ただし、孫ルーチンにおいて引数としてこのルーチンに渡される候補リスト２４０及び履歴リスト２４２は、いずれも先頭の要素を除いたリストとなるので、それらの要素数はいずれも１となる。 In this example, the determination result in step 280 is negative. Thereafter, the same processing as in the child routine is performed on the two-element candidate list 240 and history list 242 passed as arguments. The routine ends with the new history list 242 obtained by this processing (which has two elements) as its return value, and control returns to the child routine. The processing in step 298 of FIG. 10 needs attention, however. When this step is executed, the grandchild routine calls this routine once more. The candidate list 240 and the history list 242 that the grandchild routine passes as arguments are both lists with the first element removed, so each has exactly one element.

したがって、ひ孫ルーチンでは、ステップ２８０の判定結果が肯定となり、その要素のみを持つ履歴リスト２４２が戻り値として孫ルーチンに戻される。 Therefore, in the great-grandchild routine, the determination result in step 280 is affirmative, and the history list 242 having only that element is returned to the grandchild routine as a return value.

したがって、孫ルーチンのステップ２９８の戻り値は、引数として孫ルーチンがひ孫ルーチンに渡した履歴リスト２４２そのものとなる。ステップ３００ではこのリストの先頭に、ステップ２９８で取り除いた候補のエントリを付加して子ルーチンに戻り値として戻す。その結果、子ルーチンのステップ２９８では、要素数が２の履歴リスト２４２が戻り値として得られ、子ルーチンのステップ２９８で取り除いておいた要素が、履歴リスト２４２の先頭に付加され、３個のエントリを持つ履歴リスト２４２が主ルーチンに戻り値として返される。 Therefore, the return value of step 298 in the grandchild routine is the very history list 242 that the grandchild routine passed to the great-grandchild routine as an argument. In step 300, the candidate entry removed in step 298 is prepended to this list, and the result is returned to the child routine. As a result, step 298 of the child routine obtains a history list 242 with two elements as the return value; the element that the child routine removed in step 298 is then prepended to it, and a history list 242 with three entries is returned to the main routine.

以上の処理により、図９のステップ２５４の結果、候補の交代の可能性を、各候補の尤度を再計算した結果に基づいて調整した履歴リスト２４２が得られることになる。 Through the above processing, as a result of step 254 in FIG. 9, a history list 242 in which the possibility of alternation of candidates is adjusted based on the result of recalculating the likelihood of each candidate is obtained.

なお、ここでは説明を分かりやすくするためにもとの履歴リスト２４２のエントリ数が３であることを前提に説明した。しかし、エントリ数が４以上である場合にも、再帰的な処理により同様の結果が得られる。 Note that the description here is based on the assumption that the number of entries in the original history list 242 is 3 for easy understanding. However, when the number of entries is 4 or more, a similar result can be obtained by recursive processing.
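The call structure described in this section (strip the head, recurse on the tail, re-attach the head) can be illustrated with the following skeleton. It is only a sketch under stated assumptions: the function name and the `trace` argument are hypothetical, the base-case test is inferred from the statement that step 280 is affirmative when one element remains, and the likelihood recalculation and reordering of steps 282 to 296 are deliberately omitted. It also assumes the branch containing step 298 is taken at every level, which the description says happens only in case (2).

```python
def reattach_after_recursion(candidates, history, trace):
    """Skeleton of the step 280 / 298 / 300 structure of the FIG. 10 routine.

    trace: list that records the history length seen at each call depth
           (added here purely for illustration).
    """
    trace.append(len(history))
    if len(history) <= 1:                 # step 280 (assumed test: one element left)
        return list(history)              # returned unchanged to the caller
    # Steps 282 to 296 (recalculation and reordering of the top two entries)
    # would run here; they are omitted from this skeleton.
    head = history[0]
    returned = reattach_after_recursion(candidates[1:], history[1:], trace)  # step 298
    return [head] + returned              # step 300: prepend the removed head


trace = []
result = reattach_after_recursion(["c1", "c2", "c3"], ["h1", "h2", "h3"], trace)
# trace holds the list sizes seen by the child, grandchild and great-grandchild routines.
```

For a three-entry history list the recorded sizes are 3, 2 and 1, matching the child, grandchild and great-grandchild routines in the text, and the returned list again has three entries. A list with four or more entries simply adds further levels of recursion.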

［コンピュータによる実現］
この実施の形態に係る音源分離及び音種類判定装置は、コンピュータハードウェアと、そのコンピュータハードウェアにより実行されるプログラムと、コンピュータハードウェアに格納されるデータとにより実現される。図１１はこのコンピュータシステム５３０の外観を示し、図１２はコンピュータシステム５３０の内部構成を示す。 [Realization by computer]
The sound source separation and sound type determination device according to this embodiment is realized by computer hardware, a program executed by the computer hardware, and data stored in the computer hardware. FIG. 11 shows the external appearance of the computer system 530, and FIG. 12 shows the internal configuration of the computer system 530.

図１１を参照して、このコンピュータシステム５３０は、メモリポート５５２及びＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ）ドライブ５５０を有するコンピュータ５４０と、キーボード５４６と、マウス５４８と、モニタ５４２とを含む。 Referring to FIG. 11, the computer system 530 includes a computer 540 having a memory port 552 and a DVD (Digital Versatile Disc) drive 550, a keyboard 546, a mouse 548, and a monitor 542.

図１２を参照して、コンピュータ５４０は、メモリポート５５２及びＤＶＤドライブ５５０に加えて、ＣＰＵ（中央処理装置）５５６と、ＣＰＵ５５６、メモリポート５５２及びＤＶＤドライブ５５０に接続されたバス５６６と、ブートアッププログラム等を記憶する読出専用メモリ（ＲＯＭ）５５８と、バス５６６に接続され、プログラム命令、システムプログラム、及び作業データ等を記憶するランダムアクセスメモリ（ＲＡＭ）５６０とを含む。コンピュータ５４０はさらに、ローカルエリアネットワーク（ＬＡＮ）への接続をコンピュータ５４０に提供するネットワークインタフェイスカード（ＮＩＣ）５７４と、マイクロホンアレイからの入力を受けてデジタル音声信号に変換する、Ａ／Ｄ変換機能を持つサウンドボード５６８とを含む。図２に示す同期用タイムサーバ５４及び人位置計測装置５８との通信については、ＣＰＵ５５６は、バス５６６及びＮＩＣ５７４を用いたネットワーク通信により行なう。 Referring to FIG. 12, in addition to the memory port 552 and the DVD drive 550, the computer 540 includes a CPU (Central Processing Unit) 556; a bus 566 connected to the CPU 556, the memory port 552, and the DVD drive 550; a read-only memory (ROM) 558 that stores a boot-up program and the like; and a random access memory (RAM) 560 that is connected to the bus 566 and stores program instructions, system programs, work data, and the like. The computer 540 further includes a network interface card (NIC) 574 that provides the computer 540 with a connection to a local area network (LAN), and a sound board 568 with an A/D conversion function that receives input from the microphone array and converts it into a digital audio signal. The CPU 556 communicates with the synchronization time server 54 and the human position measuring device 58 shown in FIG. 2 by network communication using the bus 566 and the NIC 574.

コンピュータシステム５３０に音源分離及び音種類判定装置としての動作を行なわせるためのコンピュータプログラムは、ＤＶＤドライブ５５０又はメモリポート５５２に挿入されるＤＶＤ５６２又は半導体メモリ５６４に記憶され、さらにハードディスク５５４に転送される。又は、プログラムは図示しないネットワークを通じてコンピュータ５４０に送信されハードディスク５５４に記憶されてもよい。プログラムは実行の際にＲＡＭ５６０にロードされる。ＤＶＤ５６２から、半導体メモリ５６４から、又はネットワークを介して、直接にＲＡＭ５６０にプログラムをロードしてもよい。 A computer program for causing the computer system 530 to operate as a sound source separation and sound type determination device is stored in the DVD 562 or the semiconductor memory 564 inserted into the DVD drive 550 or the memory port 552 and further transferred to the hard disk 554. . Alternatively, the program may be transmitted to the computer 540 through a network (not shown) and stored in the hard disk 554. The program is loaded into the RAM 560 when executed. The program may be loaded into the RAM 560 directly from the DVD 562, from the semiconductor memory 564, or via a network.

このプログラムは、コンピュータ５４０にこの実施の形態の音源分離及び音種類判定装置として動作を行なわせる複数の命令を含む。この動作を行なわせるのに必要な基本的機能のいくつかはコンピュータ５４０上で動作するオペレーティングシステム（ＯＳ）もしくはサードパーティのプログラム、又はコンピュータ５４０にインストールされる各種ツールキットのモジュールにより提供される。従って、このプログラムはこの実施の形態のシステム及び方法を実現するのに必要な機能全てを必ずしも含まなくてよい。このプログラムは、命令のうち、所望の結果が得られるように制御されたやり方で適切な機能又は「ツール」を呼出すことにより、上記した音源分離及び音種類判定装置としての動作を実行する命令のみを含んでいればよい。コンピュータシステム５３０の動作は周知であるので、ここでは繰返さない。 This program includes a plurality of instructions that cause the computer 540 to operate as the sound source separation and sound type determination device of this embodiment. Some of the basic functions required for this operation are provided by the operating system (OS) running on the computer 540, by third-party programs, or by modules of various toolkits installed on the computer 540. Therefore, this program does not necessarily have to include all the functions needed to realize the system and method of this embodiment. It need only include those instructions that carry out the operation of the sound source separation and sound type determination device described above by calling the appropriate functions or "tools" in a controlled manner so as to obtain the desired result. The operation of the computer system 530 is well known and is not repeated here.

以上のように本実施の形態によれば、マイクロホンアレイからの音声だけではなく、ＬＲＦにより検出された人位置に関する情報も、音源定位及び音源の属性推定に使用する。音声だけの場合と比較して、音源定位の精度を高くすることができ、そのときの処理量の増加を抑えることもできる。音源の属性推定の場合にも、人がいる可能性のある場合のみ、個人別ＧＭＭを用いるため、計算量の増加を抑制しながら音源の属性を精度よく行なうことができる。 As described above, according to the present embodiment, not only the sound from the microphone arrays but also the information on human positions detected by the LRFs is used for sound source localization and sound source attribute estimation. Compared with using sound alone, the accuracy of sound source localization can be increased while the accompanying increase in processing is kept small. In sound source attribute estimation as well, the individual GMMs are used only when a person may be present, so the sound source attributes can be estimated accurately while the increase in computation is suppressed.
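The person-dependent use of the individual GMMs summarized above can be sketched as follows. This is a simplified illustration, not the embodiment's code: the function names, the model names, and the one-dimensional single-Gaussian scorer (used here as a stand-in for a real GMM over MFCC features) are all assumptions made for the example. Only the selection logic reflects the description: individual models are scored only when a person may be present in the sound source direction, and the model with the highest likelihood determines the estimated sound source type.

```python
import math


def gaussian_loglik(mean, var):
    """Hypothetical stand-in for a GMM: log-likelihood of scalar frames under one Gaussian."""
    def score(frames):
        return sum(
            -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
            for x in frames
        )
    return score


def select_models(person_in_direction, individual_models, noise_models):
    """Noise models are always scored; individual models only if a person may be there."""
    selected = dict(noise_models)
    if person_in_direction:
        selected.update(individual_models)
    return selected


def estimate_source_type(frames, models):
    """Return the name of the selected model that gives the highest log-likelihood."""
    return max(models, key=lambda name: models[name](frames))


# Illustrative models and features only.
individual_models = {"speaker_A": gaussian_loglik(1.0, 0.5)}
noise_models = {
    "air_conditioner": gaussian_loglik(-1.0, 0.5),
    "music": gaussian_loglik(3.0, 0.5),
}
frames = [0.9, 1.1, 1.2]

with_person = estimate_source_type(
    frames, select_models(True, individual_models, noise_models))
without_person = estimate_source_type(
    frames, select_models(False, individual_models, noise_models))
```

When no person is detected in the direction, the individual models are never evaluated, which is where the saving in computation comes from; the number of likelihood evaluations then depends only on the number of noise models.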

今回開示された実施の形態は単に例示であって、本発明が上記した実施の形態のみに制限されるわけではない。本発明の範囲は、発明の詳細な説明の記載を参酌した上で、特許請求の範囲の各請求項によって示され、そこに記載された文言と均等の意味及び範囲内での全ての変更を含む。 The embodiment disclosed herein is merely an example, and the present invention is not limited to the above-described embodiment. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

５０音源定位装置
５２マイクロホンアレイ群
５４同期用タイムサーバ
５６ＬＲＦ群
５８人位置計測装置
６０音源定位処理部
７０音源分離処理部
７２音源種類同定処理部
８０位置ベクトルＤＢ
８２アレイ位置ＤＢ
８４，８６，８８音源定位部
１０２固有ベクトル算出部
１０４ＭＵＳＩＣ処理部
１０６ピーク検出部
１０８相対位置ベクトル生成部
１１０詳細探索部
１６０，１６２，１６４適応ビームフォーマ
１８０個人別ＧＭＭ
１８２雑音ＧＭＭ
１９０，１９２，１９４音源属性判定部
２１０比較部
２１２選択部
２１４特徴抽出部
２１６音源属性推定部
２３０検出ＩＤ・尤度リスト
２４０候補リスト
２４２履歴リスト

50 Sound source localization device
52 Microphone array group
54 Synchronization time server
56 LRF group
58 Human position measurement device
60 Sound source localization processing unit
70 Sound source separation processing unit
72 Sound source type identification processing unit
80 Position vector DB
82 Array position DB
84, 86, 88 Sound source localization unit
102 Eigenvector calculation unit
104 MUSIC processing unit
106 Peak detection unit
108 Relative position vector generation unit
110 Detailed search unit
160, 162, 164 Adaptive beamformer
180 Individual GMM
182 Noise GMM
190, 192, 194 Sound source attribute determination unit
210 Comparison unit
212 Selection unit
214 Feature extraction unit
216 Sound source attribute estimation unit
230 Detection ID / likelihood list
240 Candidate list
242 History list

Claims

A sound source localization apparatus including:
human position detection means for detecting the position of a person;
sound source localization means for receiving the sound source signals of a plurality of channels obtained from the output signals of a microphone array and the positional relationship between the microphones included in the microphone array, calculating, at predetermined time intervals and from the sound source signals of the plurality of channels, the MUSIC power for each of a plurality of directions defined in a space centered on a point determined in relation to the position of the microphone array, and detecting, at the predetermined time intervals, a predetermined number of directions that give peaks of the MUSIC power equal to or greater than a threshold value as sound source positions;
sound source separation means for separating, from the output signals of the microphone array, an audio signal from a sound source position detected by the sound source localization means; and
sound source type determination means for determining the sound source type of the audio signal separated by the sound source separation means,
wherein the sound source type determination means includes:
a plurality of individual acoustic models that are statistical models of acoustic features, including at least MFCC, of the voices of a plurality of individuals;
a plurality of noise acoustic models that are statistical models of the acoustic features of sounds from non-human noise sources of known sound source types;
acoustic model selection means for, in response to the output of the human position detection means and the output of the sound source localization means, selecting the plurality of individual acoustic models and the plurality of noise acoustic models when a person is present in the sound source direction, and selecting the plurality of noise acoustic models when no person is present in the sound source direction; and
estimation means for calculating, using the acoustic models selected by the acoustic model selection means, the likelihood that each selected acoustic model gives for the sequence of acoustic features of the audio signal separated by the sound source separation means, and estimating the sound source type corresponding to the acoustic model with the highest likelihood as the sound source type of the audio signal.

The sound source localization apparatus according to claim 1, further including detailed detection means for detecting, for each of the sound source directions estimated by the sound source localization means in which a person is detected by the human position detection means, a more detailed sound source position by calculating the MUSIC power while varying the direction around that direction in finer steps than those used in the sound source position search by the sound source localization means, and detecting a peak of the MUSIC power.