JP3632099B2 - Robot audio-visual system

Info

Publication number: JP3632099B2
Application number: JP2002365764A
Authority: JP
Inventors: 一博中臺; 博奥乃; 宏明北野
Original assignee: Japan Science and Technology Agency; National Institute of Japan Science and Technology Agency
Current assignee: Japan Science and Technology Agency; National Institute of Japan Science and Technology Agency
Priority date: 2002-12-17
Filing date: 2002-12-17
Publication date: 2005-03-23
Anticipated expiration: 2022-12-17
Also published as: TW200411627A; JP2004198656A; TWI222622B

Abstract

PROBLEM TO BE SOLVED: To provide a robot audio-visual system that recognizes the separated sound from each sound source.
SOLUTION: The robot audio-visual system comprises an audio module 20, a face module 30, a stereo module 37, a motor control module 40, and an association module 50 that controls these modules. The audio module performs voice recognition using a plurality of acoustic models, integrates the voice recognition results of the respective acoustic models with a selector, and determines the voice recognition result with the highest reliability among them.
COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明はロボット、特に人型または動物型ロボットにおける視聴覚システムに関するものである。
【０００２】
【従来の技術】
近年、このような人型または動物型ロボットにおいては、ＡＩの研究目的の対象にとどまらず、所謂「人間のパートナー」としての将来的な利用が考えられている。そして、ロボットが人間との知的なソーシャルインタラクションを行なうために、視聴覚等の知覚がロボットには必要である。そして、ロボットが人間とのソーシャルインタラクションを実現するためには、知覚のうち、視聴覚、特に聴覚が重要な機能であることは明らかである。従って、視覚，聴覚に関して、所謂能動知覚が注目されてきている。
【０００３】
ここで、能動知覚とは、ロボット視覚やロボット聴覚等の知覚を担当する知覚装置を知覚すべき目標に追従する働きを言い、例えば、これらの知覚装置を支持する頭部を駆動機構により目標に追従するように姿勢制御するものである。
【０００４】
ロボットにおける能動視覚においては、少なくとも知覚装置であるカメラが、駆動機構による姿勢制御によって、その光軸方向を目標に向かって保持され、更に目標に対して自動的にフォーカシングやズームイン，ズームアウト等を行う。これにより、目標が移動してもカメラによって撮像される。このような能動視覚の研究が従来、様々に行なわれている。
【０００５】
これに対して、ロボットにおける能動聴覚においては、少なくとも知覚装置であるマイクが、駆動機構による姿勢制御によってその指向性を目標に向かって保持され、目標からの音がマイクによって集音される。このとき、能動聴覚の不利な点として、駆動機構が作用している間はマイクが駆動機構の作動音を拾ってしまうために目標からの音に比較的大きなノイズが混入してしまい、目標からの音を認識できなくなってしまうことがある。このような能動聴覚の不利な点を排除するために、例えば視覚情報を参照して音源の方向付けを行なうことにより、目標からの音を正確に認識する方法が採用されている。
【０００６】
ところで、このような能動聴覚においては、マイクで集音した音に基づいて、（Ａ）音源の定位，（Ｂ）各音源から発せられた音毎の分離，（Ｃ）そして各音源からの音の認識を行なう必要がある。このうち、（Ａ）音源定位及び（Ｂ）音源分離については、能動聴覚における実時間・実環境での音源定位・追跡・分離に関する種々の研究が行なわれている（特許文献１参照）。
【０００７】
【特許文献１】
国際公開第０１／９５３１４号パンフレット
【０００８】
ここで、例えば、特許文献１に示すように、ＨＲＴＦ（頭部伝達関数）から求められる両耳間位相差（ＩＰＤ），両耳間強度差（ＩＩＤ）を利用して音源定位を行なうことが知られている。また、特許文献１では、例えば所謂方向通過型フィルタ、即ちディレクションパスフィルタを用いて、特定の方向のＩＰＤと同じＩＰＤを有するサブバンドを選択することにより、各音源からの音を分離する方法が知られている。
【０００９】
これに対して、音源分離により分離された各音源からの音の認識については、例えばマルチコンディショニングやミッシングデータ等のノイズに対してロバストな音声認識へのアプローチは種々の研究が行なわれている（例えば非特許文献１，２参照）。
【００１０】
[Non-Patent Document 1]
J. Baker, M. Cooke, and P. Green, "Robust ASR based on clean speech models: An evaluation of missing data techniques for connected digit recognition in noise," Proc. 7th European Conference on Speech Communication Technology (Eurospeech 2001), 2001, Volume 1, pp. 213-216.
[Non-Patent Document 2]
Philippe Renevey, Rolf Vetter, and Jens Kraus, "Robust speech recognition using missing feature theory and vector quantization," Proc. 7th European Conference on Speech Communication Technology (Eurospeech 2001), 2001, Volume 12, pp. 1107-1110.
【００１１】
【発明が解決しようとする課題】
しかしながら、これらの研究（例えば非特許文献１，２）においては、Ｓ／Ｎ比が小さい場合には、有効な音声認識を行なうことができない。また、実時間・実環境での音声認識についての研究は行なわれていない。
【００１２】
この発明は、以上の点に鑑みて、各音源からの分離された音についての認識を行なうようにしたロボット視聴覚システムを提供することを目的としている。
【００１３】
【課題を解決するための手段】
上記目的を達成するために、本発明のロボット視聴覚システムの第１の構成は、各話者毎に、且つ、各方向毎に作られている複数の音響モデルと、これらの音響モデルを使用して、音源分離された音響信号に対して音声認識プロセスを実行する音声認識エンジンと、この音声認識プロセスによって音響モデル別に得られた複数の音声認識プロセス結果を統合し、何れかの音声認識プロセス結果を選択するセレクタと、を備えて、各話者が同時に発話した音声を各々認識することを特徴としている。
【００１４】
前記セレクタは、音声認識プロセス結果を統合する際に、音声認識プロセスによる認識結果と話者の方向とに基づいてコスト関数値を算出し、最も大きいコスト関数値を持つ前記音声認識プロセス結果を最も信頼性の高い音声認識結果として判断するように構成され、前記セレクタにて選択された音声認識プロセス結果を外部に出力する対話部を備えていてもよい。
【００１５】
このような第１の構成によれば、音源定位・音源分離された音響信号に基づいて、複数の音響モデルを使用することによって、それぞれ音声認識プロセスを行なう。そして、各音響モデルによる音声認識プロセス結果をセレクタにより統合して、最も信頼性の高い音声認識結果を判断する。
【００１６】
また、上記目的を達成するために、本発明のロボット視聴覚システムの第２の構成は、外部の音を集音する少なくとも一対のマイクを備えており、このマイクからの音響信号に基づいて、ピッチ抽出，調波構造に基づいたグルーピングによる音源の分離及び定位によって少なくとも一人の話者の方向を決定し、その聴覚イベントを抽出する聴覚モジュールと、ロボットの前方を撮像するカメラを備えており、このカメラにより撮像された画像に基づいて各話者の顔識別と定位とから各話者を同定してその顔イベントを抽出する顔モジュールと、ロボットを水平方向に回動させる駆動モータを備えこの駆動モータの回転位置に基づいてモータイベントを抽出するモータ制御モジュールと、上記聴覚イベント，顔イベント及びモータイベントから、聴覚イベントの音源定位及び顔イベントの顔定位の方向情報に基づいて各話者の方向を決定し、この決定に対してカルマンフィルタを用いて上記イベントを時間方向に接続することにより聴覚ストリーム及び顔ストリームを生成し、さらにこれらを関連付けてアソシエーションストリームを生成するアソシエーションモジュールと、これらのストリームに基づいてアテンション制御と、それに伴う行動のプランニング結果に基づいてモータの駆動制御を行うアテンション制御モジュールと、を備え、上記聴覚モジュールが、各話者毎に、且つ、各方向毎に作られている複数の音響モデルを備えており、上記アソシエーションモジュールからの正確な音源方向情報に基づいて、正面方向で最小となり且つ左右に角度が大きくなるにつれて大きくなるパスレンジを有するアクティブ方向通過型フィルタにより、所定幅の範囲内の両耳間位相差（ＩＰＤ）または両耳間強度差（ＩＩＤ）をもったサブバンドを集めて、音源の波形を再構築することにより音源分離を行なうと共に、複数の音響モデルを使用して音源分離された音響信号の音声認識を行ない、各音響モデルによる音声認識結果をセレクタにより統合して、これらの音声認識結果のうち最も信頼性の高い音声認識結果を判断するように構成されている。
【００１７】
このような第２の構成によれば、聴覚モジュールがマイクが集音した外部の対象からの音から、調波構造を利用してピッチ抽出を行なうことにより音源毎の方向を得て個々の話者を同定し、その聴覚イベントを抽出する。
【００１８】
また、顔モジュールが、カメラにより撮像された画像から、パターン認識による各話者の顔識別と定位から、個々の話者の顔イベントを抽出する。
【００１９】
さらに、モータ制御モジュールが、ロボットを水平方向に回動させる駆動モータの回転位置に基づいて、ロボットの方向を検出することによって、モータイベントを抽出する。
【００２０】
なお、上記イベントとは、各時点において検出される音または顔が在ること、あるいは駆動モータが回転される状態を示しており、ストリームとは、エラー訂正処理を行ないながら例えばカルマンフィルタ等により時間的に連続するように接続したイベントを示している。
【００２１】
ここで、アソシエーションモジュールは、このようにしてそれぞれ抽出された聴覚イベント，顔イベント及びモータイベントに基づいて、各話者の聴覚ストリーム及び顔ストリームを生成し、さらにこれらのストリームを関連付けてアソシエーションストリームを生成して、アテンション制御モジュールがこれらのストリームに基づいてアテンション制御を行なうことにより、モータ制御モジュールの駆動モータ制御のプランニングを行なう。なお、アソシエーションストリームとは、聴覚ストリーム及び顔ストリームを包含する概念である。
【００２２】
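Here, attention means that the robot "pays attention", aurally and/or visually, to a target speaker, and attention control means that the motor control module changes the robot's orientation so that the robot attends to that speaker.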
【００２３】
そして、アテンション制御モジュールは、このプランニングに基づいて、モータ制御モジュールの駆動モータを制御することにより、ロボットの方向を対象である話者に向ける。これにより、ロボットが対象である話者に対して正対することにより、聴覚モジュールが当該話者の声を感度の高い正面方向にてマイクで正確に集音，定位することができると共に、顔モジュールが当該話者の画像をカメラにより良好に撮像することができるようになる。
【００２４】
従って、このような聴覚モジュール，顔モジュール及びモータ制御モジュールと、アソシエーションモジュール及びアテンション制御モジュールとの連携によって、ロボットの聴覚及び視覚がそれぞれ有する曖昧性が互いに補完されることになり、所謂ロバスト性が向上し、複数の話者であっても、各話者をそれぞれ知覚することができる。
【００２５】
また、例えば聴覚イベントまたは顔イベントの何れか一方が欠落したときであっても、顔イベントまたは聴覚イベントのみに基づいて、対象である話者をアソシエーションモジュールが知覚することができるので、リアルタイムにモータ制御モジュールの制御を行なうことができる。
【００２６】
さらに、上記聴覚モジュールが、上述したように音源定位・音源分離された音響信号に基づいて、複数の音響モデルを使用することによってそれぞれ音声認識を行なう。そして、各音響モデルによる音声認識結果をセレクタにより統合して、最も信頼性の高い音声認識結果を判断する。
【００２７】
これにより、従来の音声認識と比較して複数の音響モデルを使用することによって、実時間・実環境での正確な音声認識を行なうことが可能になると共に、各音響モデルによる音声認識結果をセレクタにより統合して、最も信頼性の高い音声認識結果を判断して、より一層正確な音声認識を行なうことができる。
【００２８】
また、上記目的を達成するために、本発明のロボット視聴覚システムの第３の構成は、外部の音を集音する少なくとも一対のマイクを備えており、このマイクからの音響信号に基づいてピッチ抽出，調波構造に基づいたグルーピングによる音源の分離及び定位によって少なくとも一人の話者の方向を決定しその聴覚イベントを抽出する聴覚モジュールと、ロボットの前方を撮像するカメラを備えこのカメラで撮像された画像に基づいて各話者の顔識別と定位とから各話者を同定してその顔イベントを抽出する顔モジュールと、ステレオカメラにより撮像された画像から抽出された視差に基づいて縦に長い物体を抽出定位してステレオイベントを抽出するステレオモジュールと、ロボットを水平方向に回動させる駆動モータを備えこの駆動モータの回転位置に基づいてモータイベントを抽出するモータ制御モジュールと、前記聴覚イベント，顔イベント，ステレオイベント及びモータイベントから聴覚イベントの音源定位及び顔イベントの顔定位の方向情報に基づいて各話者の方向を決定しこの決定に対してカルマンフィルタを用いて前記イベントを時間方向に接続することにより聴覚ストリーム，顔ストリーム及びステレオ視覚ストリームを生成しさらにこれらを関連付けてアソシエーションストリームを生成するアソシエーションモジュールと、これらのストリームに基づいてアテンション制御と、それに伴う行動のプランニング結果に基づいてモータの駆動制御を行うアテンション制御モジュールと、を備え、上記聴覚モジュールが、各話者毎に、且つ、各方向毎に作られている複数の音響モデルを備えており、上記アソシエーションモジュールからの正確な音源方向情報に基づいて、正面方向で最小となり且つ左右に角度が大きくなるにつれて大きくなるパスレンジを有するアクティブ方向通過型フィルタにより、所定幅の範囲内の両耳間位相差（ＩＰＤ）または両耳間強度差（ＩＩＤ）をもったサブバンドを集めて、音源の波形を再構築することにより音源分離を行なうと共に、音声認識の際に、複数の音響モデルを使用して音源分離された音響信号の音声認識を行ない、各音響モデルによる音声認識結果をセレクタにより統合して、これらの音声認識結果のうち最も信頼性の高い音声認識結果を判断するように構成されている。
【００２９】
このような第３の構成によれば、聴覚モジュールは、マイクが集音した外部の目標からの音から調波構造を利用してピッチ抽出を行なうことにより音源毎の方向を得て、個々の話者の方向を決定してその聴覚イベントを抽出する。
【００３０】
また、顔モジュールは、カメラにより撮像された画像からパターン認識による各話者の顔識別と定位から各話者を同定して、個々の話者の顔イベントを抽出する。さらに、ステレオモジュールは、ステレオカメラにより撮像された画像から抽出された視差に基づいて縦に長い物体を抽出定位してステレオイベントを抽出する。
【００３１】
さらに、モータ制御モジュールは、ロボットを水平方向に回動させる駆動モータの回転位置に基づいて、ロボットの方向を検出することによってモータイベントを抽出する。
【００３２】
なお、上記イベントとは、各時点において検出される音，顔及び縦に長い物体が在ること、あるいは駆動モータが回転される状態を示しており、ストリームとは、エラー訂正処理を行ないながら例えばカルマンフィルタ等により時間的に連続するように接続したイベントを示している。
【００３３】
ここで、アソシエーションモジュールは、このようにしてそれぞれ抽出された聴覚イベント，顔イベント，ステレオイベント及びモータイベントに基づいて、聴覚イベントの音源定位及び顔イベントの顔定位の方向情報によって各話者の方向を決定することにより、各話者の聴覚ストリーム，顔ストリーム及びステレオ視覚ストリームを生成し、さらにこれらのストリームを関連付けてアソシエーションストリームを生成する。なお、アソシエーションストリームとは、聴覚ストリーム，顔ストリーム及びステレオ視覚ストリームを包含する概念である。この際、アソシエーションモジュールは、聴覚イベントの音源定位及び顔イベントの顔定位、即ち聴覚及び視覚の方向情報に基づいて各話者の方向を決定し、決定された各話者の方向を参考にして、アソシエーションストリームを生成する。
【００３４】
そして、アテンション制御モジュールが、これらのストリームに基づいてアテンション制御と、それに伴う行動のプランニング結果に基づいて、モータの駆動制御を行なう。そして、アテンション制御モジュールは、このプランニングに基づいてモータ制御モジュールの駆動モータを制御してロボットの方向を目標である話者に向ける。これにより、ロボットが目標である話者に対して正対することによって聴覚モジュールが当該話者の声を感度の高い正面方向にてマイクにより正確に集音，定位することができる共に、顔モジュールが当該話者の画像をカメラにより良好に撮像することができるようになる。
【００３５】
従って、このような聴覚モジュール，顔モジュール，ステレオモジュール及びモータ制御モジュールと、アソシエーションモジュール及びアテンション制御モジュールとの連携によって、聴覚ストリームの音源定位及び顔ストリームの話者定位という方向情報に基づいて各話者の方向を決定することにより、ロボットの聴覚及び視覚がそれぞれ有する曖昧性が互いに補完されることになり、所謂ロバスト性が向上し、複数の話者であっても各話者をそれぞれ確実に知覚することができる。
【００３６】
また、例えば聴覚ストリーム，顔ストリーム及びステレオ視覚ストリームの何れかが欠落したときであっても、残りのストリームに基づいて目標である話者をアテンション制御モジュールが追跡することができるので、正確に目標の方向を把握して、モータ制御モジュールの制御を行なうことができる。
【００３７】
ここで、聴覚モジュールが、アソシエーションモジュールからのアソシエーションストリームを参照することにより、顔モジュールからの顔ストリームやステレオモジュールからのステレオ視覚ストリームをも考慮して音源定位を行なうことによって、より一層正確な音源定位を行なうことができる。
【００３８】
そして、上記聴覚モジュールは、アソシエーションモジュールからの正確な音源方向情報に基づいて、聴覚特性に従って正面方向で最小となり且つ左右に角度が大きくなるにつれて大きくなるパスレンジを有するアクティブ方向通過型フィルタにより、所定幅の範囲内の両耳間位相差（ＩＰＤ）または両耳間強度差（ＩＩＤ）をもったサブバンドを集めて、音源の波形を再構築して音源分離を行なうので、上述した聴覚特性に応じてパスレンジ即ち感度を調整することにより、方向による感度の違いを考慮して、より正確に音源分離を行なうことができる。さらに、上記聴覚モジュールは、上述したように聴覚モジュールによって音源定位・音源分離された音響信号に基づいて、複数の音響モデルを使用することによってそれぞれ音声認識を行なう。そして、各音響モデルによる音声認識結果をセレクタにより統合して、最も信頼性の高い音声認識結果を判断して、この音声認識結果を対応する話者と関連付けて出力する。
【００３９】
これにより、従来の音声認識と比較して、複数の音響モデルを使用することによって、実時間・実環境での正確な音声認識を行なうことが可能になると共に、各音響モデルによる音声認識結果をセレクタにより統合して、最も信頼性の高い音声認識結果を判断することにより、より一層正確な音声認識を行なうことができる。
【００４０】
なお、第２の構成と第３の構成においては、聴覚モジュールによる音声認識ができなかったときに、前記アテンション制御モジュールが、当該音響信号の音源の方向に前記マイク及び前記カメラを向けて、前記マイクから再び音声を集音させ、この音に対して聴覚モジュールにより音源定位・分離された音響信号に基づいて、再度聴覚モジュールによる音声認識を行なうように構成されている。
【００４１】
さらに、前記聴覚モジュールは、音声認識を行なう際に顔モジュールによる顔イベントを参照するのが望ましい。また、前記聴覚モジュールにて判断された音声認識結果を外部に出力する対話部が備えられていてもよい。さらに、前記アクティブ方向通過型フィルタのパスレンジが周波数毎に制御可能であることが望ましい。
【００４２】
上記聴覚モジュールによる音声認識ができなかったとき、アテンション制御モジュールが、当該音響信号の音源の方向（当該話者）にマイク及びカメラを向けて、再度マイクから音声を集音させ、聴覚モジュールにより音源定位・分離された音響信号に基づいて、再度聴覚モジュールによる音声認識を行なう場合には、ロボットの聴覚モジュールのマイク及び顔モジュールのカメラが当該話者と正対することによって、確実な音声認識を行なうことが可能になる。
【００４３】
上記聴覚モジュールは、音声認識を行なう際に、アソシエーションモジュールからのアソシエーションストリームを参照することにより、顔モジュールからの顔ストリームをも考慮する。即ち、聴覚モジュールは、顔モジュールにより定位された顔イベントに関して、聴覚モジュールにより定位・分離された音源（話者）からの音響信号に基づいて音声認識を行なうことにより、より一層正確な音声認識を行なうことができる。
【００４４】
上記アクティブ方向通過型フィルタのパスレンジが周波数毎に制御可能であると、さらに集音した音からの分離の精度が上がり、これにより音声認識もさらに向上する。
【００４５】
【発明の実施の形態】
以下、図面に示した実施形態に基づいて、この発明を詳細に説明する。
図１及び図２は、それぞれこの発明によるロボット視聴覚システムの一実施形態を備えた実験用の上半身のみの人型ロボットの全体構成例を示している。図１において、人型ロボット１０は、４ＤＯＦ（自由度）のロボットとして構成されており、ベース１１と、ベース１１上にて一軸（垂直軸）周りに回動可能に支持された胴体部１２と、胴体部１２上にて三軸方向（垂直軸，左右方向の水平軸及び前後方向の水平軸）の周りに揺動可能に支持された頭部１３とを含んでいる。
【００４６】
上記ベース１１は固定配置されていてもよく、脚部として動作可能としてもよい。また、ベース１１は、移動可能な台車等の上に載置されていてもよい。胴体部１２は、ベース１１に対して垂直軸の周りに、図１にて矢印Ａで示すように回動可能に支持されており、図示しない駆動手段によって回転駆動されると共に、図示の場合、防音性の外装によって覆われている。
【００４７】
頭部１３は胴体部１２に対して連結部材１３ａを介して支持されており、この連結部材１３ａに対して前後方向の水平軸の周りに、図１にて矢印Ｂで示すように揺動可能に、また左右方向の水平軸の周りに、図２にて矢印Ｃで示すように揺動可能に支持されていると共に、上記連結部材１３ａが、胴体部１２に対してさらに前後方向の水平軸の周りに、図１にて矢印Ｄで示すように揺動可能に支持されており、それぞれ図示しない駆動手段によって、各矢印Ａ，Ｂ，Ｃ，Ｄ方向に回転駆動される。ここで、頭部１３は、図３に示すように全体が防音性の外装１４により覆われ、前側にロボット視覚を担当する視覚装置としてのカメラ１５、両側にロボット聴覚を担当する聴覚装置としての一対のマイク１６（１６ａ，１６ｂ）を備えている。なお、マイク１６は、頭部１３の両側に限定されることなく、頭部１３の他の位置あるいは胴体部１２等に設けられていてもよい。
【００４８】
上記外装１４は、例えばウレタン樹脂等の吸音性の合成樹脂から構成されており、頭部１３の内部がほぼ完全に密閉されて、頭部１３の内部の遮音が行われるように構成されている。なお、胴体部１２の外装も、頭部１３の外装１４と同様に、吸音性の合成樹脂から構成されている。
【００４９】
上記カメラ１５は公知の構成であって、例えば所謂パン，チルト，ズームの３ＤＯＦ（自由度）を有する市販のカメラにより構成されている。尚、カメラ１５は、同期をとってステレオ画像を送ることができるように設計されている。
【００５０】
上記マイク１６は、それぞれ頭部１３の側面において前方に向かって指向性を有するように取り付けられている。マイク１６の左右の各マイク１６ａ，１６ｂは、それぞれ図１及び図２に示すように、頭部１３の外装１４の両側に配置された段部１４ａ，１４ｂの内側に取り付けられている。そして、各マイク１６ａ，１６ｂは、段部１４ａ，１４ｂに設けられた貫通穴を通して、前方の音を集音すると共に、外装１４の内部の音を拾わないように、適宜の手段により遮音されている。なお、段部１４ａ，１４ｂに設けられた貫通穴は、段部１４ａ，１４ｂの内側から頭部前方に向けて貫通するように、各段部１４ａ，１４ｂに形成されている。これにより、各マイク１６ａ，１６ｂは、所謂バイノーラルマイクとして構成されている。なお、マイク１６ａ，１６ｂの取付位置に近接する外装１４は人間の外耳形状に形成されていてもよい。ここで、マイク１６は、外装１４の内側に配置された一対の内部マイクを含んでいてもよく、この内部マイクにより集音された内部音に基づいて、ロボット１０の内部に発生するノイズをキャンセルすることができる。
【００５１】
図４は、上記カメラ１５及びマイク１６を含むロボット視聴覚の電気的構成例を示している。図４において、ロボット視聴覚システム１７は、聴覚モジュール２０，顔モジュール３０，ステレオモジュール３７，モータ制御モジュール４０及びアソシエーションモジュール５０から構成されている。
【００５２】
ここで、アソシエーションモジュール５０はクライアントからの依頼に応じて処理を実行するサーバとして構成されており、このサーバに対するクライアントが、他のモジュール、即ち聴覚モジュール２０，顔モジュール３０，ステレオモジュール３７，モータ制御モジュール４０であり、これらのサーバとクライアントとは、互いに非同期で動作する。なお、上記サーバと各クライアントとは、各々、パーソナルコンピュータにより構成されており、更にこれらの各パーソナルコンピュータは、例えばＴＣＰ／ＩＰプロトコルの通信環境の下で、相互にＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）として構成されている。この場合、好ましくは、データ量の大きいイベントやストリームの通信のためには、ギガビット（Ｇｉｇａｂｉｔ）のデータ交換が可能な高速ネットワークをロボット視聴覚システム１７に適用するのが好ましく、また時刻の同期等の制御用通信のためには中速ネットワークをロボット視聴覚システム１７に適用するのが好ましい。このように大きなデータが高速に各パーソナルコンピュータ間を伝送することで、ロボット全体のリアルタイム性及びスケーラビリティを向上させることができる。
【００５３】
また、各モジュール２０，３０，３７，４０，５０は、それぞれ階層的に分散して構成されており、具体的には下位から順次にデバイス層，プロセス層，特徴層，イベント層から構成されている。
【００５４】
上記聴覚モジュール２０は、デバイス層としてのマイク１６と、プロセス層としてのピーク抽出部２１，音源定位部２２，音源分離部２３及びアクティブ方向通過型フィルタ２３ａと、特徴層（データ）としてのピッチ２４，音源水平方向２５と、イベント層としての聴覚イベント生成部２６と、さらにプロセス層としての音声認識部２７及び会話部２８と、から構成されている。
【００５５】
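The auditory module 20 operates as shown in FIG. 5. In FIG. 5, the auditory module 20 analyzes the acoustic signal from the microphones 16, sampled at, for example, 48 kHz and 16 bits, by FFT (fast Fourier transform) as indicated by symbol X1, and generates a spectrum for each of the left and right channels as indicated by symbol X2. The peak extraction unit 21 then extracts a series of peaks for each of the left and right channels, and identical or similar peaks in the two channels are paired.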
【００５６】
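Peak extraction uses a band-pass filter that passes only data satisfying three conditions (α to γ): (α) the power is at or above a threshold, (β) the point is a local peak, and (γ) the frequency lies between, for example, 90 Hz and 3 kHz, in order to cut low-frequency noise and the low-power high-frequency band. The threshold is defined as the measured ambient background noise level plus a sensitivity parameter, for example 10 dB.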
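The three peak-extraction conditions (α) to (γ) can be illustrated with a minimal Python sketch. It assumes the spectrum is already available as parallel lists of bin frequencies and power values in dB. The function name `extract_peaks` and its parameter names are hypothetical; only the 90 Hz to 3 kHz band and the 10 dB sensitivity parameter are taken from the text.

```python
from typing import List, Tuple


def extract_peaks(
    freqs_hz: List[float],
    power_db: List[float],
    noise_floor_db: float,
    sensitivity_db: float = 10.0,   # example value given in the text
    f_min_hz: float = 90.0,         # example band given in the text
    f_max_hz: float = 3000.0,
) -> List[Tuple[float, float]]:
    """Return (frequency, power) pairs satisfying conditions (alpha) to (gamma)."""
    # Threshold = measured background noise + sensitivity parameter.
    threshold_db = noise_floor_db + sensitivity_db
    peaks = []
    for i in range(1, len(power_db) - 1):
        f, p = freqs_hz[i], power_db[i]
        above_threshold = p >= threshold_db                        # (alpha)
        local_peak = power_db[i - 1] < p and p >= power_db[i + 1]  # (beta)
        in_band = f_min_hz <= f <= f_max_hz                        # (gamma)
        if above_threshold and local_peak and in_band:
            peaks.append((f, p))
    return peaks
```

In this sketch the first and last bins are never reported as peaks because they lack a neighbor on one side; a real implementation might treat the spectrum edges differently.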
【００５７】
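The auditory module 20 then performs sound source separation by exploiting the fact that the peaks have a harmonic structure. Specifically, the sound source separation unit 23 extracts local peaks having a harmonic structure in order from the lowest frequency and regards each extracted set of peaks as one sound. In this way the acoustic signal of each sound source is separated from the mixture. During separation, the sound source localization unit 22 of the auditory module 20 selects, as indicated by symbol X3, acoustic signals of the same frequency from the left and right channels for each source separated by the sound source separation unit 23, and calculates the IPD (interaural phase difference) and IID (interaural intensity difference). This calculation is performed, for example, every 5 degrees. The sound source localization unit 22 then outputs the result to the active direction-pass filter 23a.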
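The harmonic grouping step (taking peaks from the lowest frequency upward and collecting those that form a harmonic structure) might be sketched as follows. This is a simplified greedy illustration, not the patented procedure: the function name `group_harmonics` and the relative `tolerance` are assumptions, and a peak that fits two fundamentals is simply assigned to the lower one.

```python
from typing import List


def group_harmonics(peak_freqs_hz: List[float], tolerance: float = 0.03) -> List[List[float]]:
    """Greedily group spectral peaks into harmonic series, lowest fundamental first."""
    remaining = sorted(peak_freqs_hz)
    groups = []
    while remaining:
        f0 = remaining[0]              # lowest unassigned peak acts as fundamental
        group, rest = [], []
        for f in remaining:
            n = round(f / f0)          # nearest harmonic number
            # Accept the peak if it lies close to an integer multiple of f0.
            if n >= 1 and abs(f - n * f0) <= tolerance * f0:
                group.append(f)
            else:
                rest.append(f)
        groups.append(group)           # one group is regarded as one sound
        remaining = rest
    return groups
```

For example, peaks at 100, 150, 201, 300, and 450 Hz are split into one series on a 100 Hz fundamental and another on a 150 Hz fundamental.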
【００５８】
これに対して、アクティブ方向通過型フィルタ２３ａは、アソシエーションモジュール５０にて算出されたアソシエーションストリーム５９の方向θに基づいて、符号Ｘ４で示すように、ＩＰＤの理論値ＩＰＤ（＝Δφ′（θ））を生成すると共に、ＩＩＤの理論値ＩＩＤ（＝Δρ′（θ））を計算する。なお、方向θは、顔定位（顔イベント３９）とステレオ視覚（ステレオ視覚イベント３９ａ）と音源定位（聴覚イベント２９）とに基づいて、アソシエーションモジュール５０におけるリアルタイムトラッキング（符号Ｘ３′）による算出結果である。
【００５９】
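The theoretical IPD and theoretical IID are each calculated using the auditory epipolar geometry described below. Specifically, the front of the robot 10 is set to 0 degrees, and the theoretical IPD and IID are calculated over the range of ±90 degrees. Auditory epipolar geometry is needed to obtain the direction information of a sound source without using an HRTF. In stereo vision research, epipolar geometry is one of the most common localization methods, and auditory epipolar geometry is its application to audition. Because auditory epipolar geometry obtains direction information from geometric relationships, the HRTF can be dispensed with.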
【００６０】
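In auditory epipolar geometry, the sound source is assumed to be at infinity. Let Δφ, θ, f, and v denote the IPD, the sound source direction, the frequency, and the speed of sound, respectively, and let r be the radius of the robot head regarded as a sphere. These quantities are related by equation (1).
[Equation 1]
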
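Equation (1) itself is not reproduced in this text. As an illustration only, the sketch below assumes the form commonly given for auditory epipolar geometry with a spherical head and a source at infinity, Δφ = (2πf / v) · r · (θ + sin θ), with θ in radians. Both this form and the default values for the head radius and the speed of sound are assumptions, not values taken from this document.

```python
import math


def theoretical_ipd(
    theta_deg: float,
    freq_hz: float,
    head_radius_m: float = 0.09,      # hypothetical head radius
    sound_speed_m_s: float = 340.0,   # hypothetical speed of sound
) -> float:
    """Theoretical IPD in radians under the assumed spherical-head model."""
    theta = math.radians(theta_deg)
    # The path difference between the ears is modelled as r * (theta + sin(theta)).
    path_difference_m = head_radius_m * (theta + math.sin(theta))
    return 2.0 * math.pi * freq_hz / sound_speed_m_s * path_difference_m
```

Under this model the IPD is zero at the front (θ = 0), odd-symmetric in θ, and proportional to frequency.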
【００６１】
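Meanwhile, from the pair of spectra obtained by FFT (fast Fourier transform), the IPD Δφ′ and IID Δρ′ of each subband are calculated by equations (2) and (3).
[Equation 2]

[Equation 3]

Here, Sp_l and Sp_r are the spectra obtained at a given time from the left and right microphones 16a and 16b, respectively.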
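Equations (2) and (3) are not reproduced in this text. The sketch below shows one conventional way to obtain a per-subband IPD and IID from a pair of complex spectral values: the IPD as the difference of the two phase angles and the IID as the level ratio in dB. The left-minus-right sign convention, the wrapping of the phase difference, and the function name `measured_ipd_iid` are assumptions.

```python
import cmath
import math
from typing import Tuple


def measured_ipd_iid(sp_left: complex, sp_right: complex) -> Tuple[float, float]:
    """Return (IPD in radians, IID in dB) for one subband of the left/right spectra."""
    # Phase of each channel is atan2(Im, Re); the IPD is their difference.
    ipd = cmath.phase(sp_left) - cmath.phase(sp_right)
    # Wrap the difference into [-pi, pi) so equivalent phases compare equal.
    ipd = (ipd + math.pi) % (2.0 * math.pi) - math.pi
    # Level difference in dB; both magnitudes must be non-zero.
    iid = 20.0 * math.log10(abs(sp_left) / abs(sp_right))
    return ipd, iid
```

A subband in which either channel has zero magnitude has no defined IID and would need to be skipped before calling this function.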
【００６２】
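Further, the active direction-pass filter 23a selects, according to the pass-range function indicated by symbol X7, the pass range δ(θ_S) corresponding to the stream direction θ_S. As shown at X7 in FIG. 5, sensitivity is highest in the frontal direction of the robot (θ = 0 degrees) and decreases toward the sides, so the pass-range function takes its minimum at θ = 0 degrees and becomes larger toward the sides. This reproduces the auditory characteristic that localization sensitivity is greatest in the frontal direction and decreases as the angle to the left or right increases. The fact that localization sensitivity is greatest in the frontal direction is called the auditory fovea, by analogy with the fovea found in the structure of the mammalian eye. For humans, the localization sensitivity is said to be about ±2 degrees in front and about ±8 degrees near 90 degrees to the left and right.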
【００６３】
The active direction-pass filter 23a then uses the selected pass range δ(θ_S) to extract acoustic signals in the range from θ_L to θ_H, where θ_L = θ_S − δ(θ_S) and θ_H = θ_S + δ(θ_S).
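A pass-range function that is smallest at the front and grows toward the sides, together with the resulting bounds θ_L and θ_H, can be sketched as below. The actual curve X7 of FIG. 5 is not given in the text, so the linear interpolation between ±2 degrees at 0 degrees and ±8 degrees at 90 degrees (the human figures quoted above) is purely an illustrative assumption, as are the function names.

```python
from typing import Tuple


def pass_range_deg(
    theta_s_deg: float,
    delta_front_deg: float = 2.0,   # human value quoted for the front
    delta_side_deg: float = 8.0,    # human value quoted near 90 degrees
) -> float:
    """Pass range delta(theta_S): minimal at 0 degrees, wider toward the sides."""
    fraction = min(abs(theta_s_deg), 90.0) / 90.0
    return delta_front_deg + (delta_side_deg - delta_front_deg) * fraction


def pass_band_deg(theta_s_deg: float) -> Tuple[float, float]:
    """Return (theta_L, theta_H) = (theta_S - delta, theta_S + delta)."""
    delta = pass_range_deg(theta_s_deg)
    return theta_s_deg - delta, theta_s_deg + delta
```

With these assumed values, a stream at 0 degrees is filtered with a band of -2 to +2 degrees, while a stream at -45 degrees gets the wider band of -50 to -40 degrees.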
【００６４】
また、アクティブ方向通過型フィルタ２３ａは、符号Ｘ５で示すように、ストリーム方向θ_Ｓを頭部伝達関数（ＨＲＴＦ）に利用して、θ_Ｌ及びθ_ＨにおけるＩＰＤ及びＩＩＤの理論値ＩＰＤ（＝Δφ_Ｈ（θ_Ｓ））とＩＩＤ（＝Δρ_Ｈ（θ_Ｓ））とを、即ち抽出すべき音源の方向を推定する。そして、アクティブ方向通過型フィルタ２３ａは、音源方向θに対して聴覚エピポーラ幾何に基づいて各サブバンド毎に計算されたＩＰＤ（＝Δφ_Ｅ（θ））及びＩＩＤ（＝Δρ_Ｅ（θ））と、ＨＲＴＦに基づいて得られたＩＰＤ（＝Δφ_Ｈ（θ））及びＩＩＤ（＝Δρ_Ｈ（θ））とに基づいて、符号Ｘ６で示すように、前述した通過帯域δ（θ）により決定される角度θ_Ｌからθ_Ｈの角度範囲で、抽出されたＩＰＤ（＝Δφ_Ｅ）及びＩＩＤ（＝Δρ_Ｅ）が以下の条件を満たすようなサブバンドを集める。
【００６５】
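Here, the frequency f_th is the threshold that determines whether the IPD or the IID is used as the filtering criterion, and it indicates the upper limit of the frequency at which localization by IPD is effective. The frequency f_th depends on the distance between the microphones of the robot 10 and is, for example, about 1500 Hz in this embodiment.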
【００６６】
即ち、
【数４】

【００６７】
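This means that a subband is collected when, at frequencies below the predetermined frequency f_th, its IPD (= Δφ′) lies within the range given by the pass range δ(θ) for the IPD based on the HRTF, and when, at frequencies at or above f_th, its IID (= Δρ′) lies within the range given by the pass range δ(θ) for the IID based on the HRTF. In general, the IPD has the stronger influence in the low-frequency band and the IID in the high-frequency band, and the threshold frequency f_th between them depends on the distance between the microphones.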
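The subband selection rule (IPD below f_th, IID at or above f_th) can be sketched as follows. Because equation (4) is not reproduced in this text, the sketch only assumes that a subband passes when its measured value lies between the theoretical values at θ_L and θ_H. The theoretical models are supplied by the caller as the hypothetical callables `ipd_model` and `iid_model`; the 1500 Hz default is the example value from the description.

```python
from typing import Callable, Iterable, List, Tuple

Model = Callable[[float, float], float]   # (theta_deg, freq_hz) -> theoretical value


def select_subbands(
    subbands: Iterable[Tuple[float, float, float]],  # (freq_hz, measured IPD, measured IID)
    theta_l_deg: float,
    theta_h_deg: float,
    ipd_model: Model,
    iid_model: Model,
    f_th_hz: float = 1500.0,
) -> List[float]:
    """Return the frequencies of the subbands that pass the direction-pass filter."""
    selected = []
    for freq_hz, ipd, iid in subbands:
        if freq_hz < f_th_hz:
            # Low band: compare the measured IPD with the theoretical IPD bounds.
            model, measured = ipd_model, ipd
        else:
            # High band: compare the measured IID with the theoretical IID bounds.
            model, measured = iid_model, iid
        low, high = sorted((model(theta_l_deg, freq_hz), model(theta_h_deg, freq_hz)))
        if low <= measured <= high:
            selected.append(freq_hz)
    return selected
```

The selected subbands would then be resynthesized, for example by an inverse FFT, to obtain the separated sound for that direction.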
【００６８】
そして、アクティブ方向通過型フィルタ２３ａは、このようにして集めたサブバンドから音響信号を再合成して、波形を構築することにより、符号Ｘ８で示すように、パス−サブバンド方向を生成し、符号Ｘ９で示すように、各サブバンド毎にフィルタリングを行なって、符号Ｘ１０で示す逆周波数変換ＩＦＦＴ（逆フーリエ変換）により、符号Ｘ１１で示すように、該当範囲にある各音源からの分離音（音響信号）を抽出する。
【００６９】
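The speech recognition unit 27 comprises, as shown in FIG. 5, a self-voice suppression unit 27a and an automatic recognition unit 27b. The self-voice suppression unit 27a removes the speech emitted from the loudspeaker 28c of the dialogue unit 28, described later, from each acoustic signal localized and separated by the auditory module 20, leaving only the acoustic signals from outside. The automatic recognition unit 27b comprises, as shown in FIG. 6, a speech recognition engine 27c, acoustic models 27d, and a selector 27e. As the speech recognition engine 27c, for example the "Julian" speech recognition engine developed at Kyoto University can be used, which makes it possible to recognize the words uttered by each speaker.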
【００７０】
図６において、自動認識部２７ｂは、例として男性２人（話者Ａ，Ｃ）と女性１人（話者Ｂ）の三人の話者の認識を行なうように構成されている。このために自動認識部２７ｂには、各話者の各方向毎にそれぞれ音響モデル２７ｄが備えられている。図６の場合には、音響モデル２７ｄは、３人の各話者Ａ，Ｂ，Ｃに関してそれぞれ各話者が発した音声とその方向とを組み合わせて成り、複数種類、この場合９種類の音響モデル２７ｄが備えられている。
【００７１】
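The speech recognition engine 27c executes nine speech recognition processes in parallel, using the nine acoustic models 27d. Specifically, the speech recognition engine 27c runs a recognition process with each of the nine acoustic models 27d on the acoustic signals input to it in parallel. The resulting speech recognition results are output to the selector 27e.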
【００７２】
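The selector 27e integrates all the speech recognition process results from the acoustic models 27d, determines the most reliable result, for example by majority vote, and outputs that speech recognition result.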
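Integration by majority vote over the results of the acoustic models can be sketched in a few lines. The data layout (a dictionary keyed by a hypothetical `(speaker, direction)` pair naming each acoustic model) and the choice to return `None` on a tie are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def majority_vote(results: Dict[Tuple[str, int], str]) -> Optional[str]:
    """Return the word recognized by the most acoustic models, or None on a tie."""
    if not results:
        return None
    counts = Counter(results.values())
    ranked = counts.most_common(2)        # up to two (word, votes) pairs
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None                       # no single most frequent result
    return ranked[0][0]
```

With nine models (three speakers times three directions), five votes for one word are enough to guarantee a unique winner.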
【００７３】
ここで、特定話者の音響モデル２７ｄに対する単語認識率を具体的な実験により説明する。
先ず、３ｍ×３ｍの部屋内において、３つのスピーカをロボット１０から１ｍの位置に且つロボットから０度及び±６０度の方向に置く。次に、音響モデル用の音声データとして、男性２名，女性１名が各々発話した、色，数字，食べ物のような１５０語の単語の音声をスピーカから出力して、ロボット１０のマイク１６ａ，１６ｂで集音する。なお、各単語の集音に当たり、一つのスピーカのみからの音声、二つのスピーカから同時に出力される音声、そして三つのスピーカから同時出力される音声、として、各単語に対して３つのパターンを録音する。そして、録音した音声信号に対して前述したアクティブ方向通過型フィルタ２３ａによって音声分離して各音声データを抽出し、話者及び方向毎に整理して、音響モデルのトレーニングセットを作成する。
【００７４】
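For the acoustic models 27d, triphones were used, and for each training set the HTK (Hidden Markov Model Toolkit) 27f was used to create a total of nine sets of speech data for speech recognition, one for each direction of each speaker.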
【００７５】
このようにして得られた音響モデル用音声データを使用して、特定話者の音響モデル２７ｄに対する単語認識率を実験により調べたところ、図７に示す結果が得られた。図７は、横軸に方向を，縦軸に単語認識率を示すグラフであり、符号Ｐは本人（話者Ａ）の音声，符号Ｑは他者（話者Ｂ，Ｃ）の音声の場合を示す。話者Ａの音響モデルでは、話者Ａがロボット１０の正面に位置している場合（図７（Ａ））には、正面（０度）にて８０％以上の単語認識率となり、また話者Ａが右方６０度または左方−６０度に位置する場合、それぞれ図７（Ｂ）又は図７（Ｃ）に示すように、話者よりも方向の違いによる認識率の低下が少なく、特に話者も方向もあっている場合には、８０％以上の単語認識率となることが分かった。
【００７６】
この結果を考慮して、音声認識の際に、音源方向が既知であることを利用して、セレクタ２７ｅは、以下の式（５）により与えられるコスト関数Ｖ（ｐｅ）を統合のために使用する。
【数５】

【００７７】
ここで、ｒ（ｐ，ｄ），Ｒｅｓ（ｐ，ｄ）をそれぞれ話者ｐと方向ｄの音響モデルを使用した場合の単語認識率と入力音声に対する認識結果と定義し、ｄ_ｅをリアルタイムトラッキングによる音源方向とし、さらにｐ_ｅを評価対象の話者とする。
【００７８】
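P_v(p_e, d_e) is a probability generated by the face recognition module and is always set to 1.0 when face recognition is not possible. The selector 27e outputs the speaker p_e having the largest cost function value V(p_e) together with the recognition result Res(p, d). In doing so, the selector 27e can identify the speaker by referring to the face event 39 obtained by face recognition in the face module 30, which improves the robustness of speech recognition.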
【００７９】
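When the maximum value of the cost function V(p_e) is 1.0 or less, or is close to the second-largest value, the selector judges that speech recognition is not possible, either because recognition failed or because the candidates could not be narrowed down to one, and reports this to the dialogue unit 28 described below. The dialogue unit 28 comprises a dialogue control unit 28a, a speech synthesis unit 28b, and a loudspeaker 28c. The dialogue control unit 28a, under the control of the association module 50 described later, generates speech data for the target speaker based on the speech recognition result from the speech recognition unit 27, that is, the speaker p_e and the recognition result Res(p, d), and outputs it to the speech synthesis unit 28b. The speech synthesis unit 28b drives the loudspeaker 28c based on the speech data from the dialogue control unit 28a to emit the corresponding speech.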
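The acceptance rule applied to the cost function values (reject when the maximum is 1.0 or less, or when it is close to the runner-up) can be sketched as below. The cost values V(p_e) are taken as given, since equation (5) is not reproduced in this text. The text does not quantify "close", so the `min_margin` parameter and its default are hypothetical, as is the function name.

```python
from typing import Dict, Optional


def select_speaker(
    costs: Dict[str, float],   # speaker name -> V(p_e)
    min_cost: float = 1.0,     # threshold stated in the description
    min_margin: float = 0.1,   # hypothetical definition of "close"
) -> Optional[str]:
    """Return the speaker with the largest cost value, or None if recognition fails."""
    ranked = sorted(costs.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best_speaker, best_value = ranked[0]
    if best_value <= min_cost:
        return None            # recognition failed
    if len(ranked) > 1 and best_value - ranked[1][1] < min_margin:
        return None            # could not narrow down to a single candidate
    return best_speaker
```

A `None` result corresponds to the case in which the dialogue unit is told that recognition was not possible, so that the robot can turn to the speaker and ask again.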
【００８０】
これにより、対話部２８は音声認識部２７からの音声認識結果に基づいて、例えば話者Ａが好きな数字として「１」と言った場合に、ロボット１０が当該話者Ａに正対した状態で、当該話者Ａに対して「Ａさんは「１」と言いました」というように音声を発することになる。
【００８１】
なお、対話部２８は、音声認識部２７から音声認識ができなかった旨が出力された場合には、ロボット１０が当該話者Ａに正対した状態で、当該話者Ａに対して、「あなたは「２ですか？４ですか？」と質問して、再度話者Ａの回答について音声認識を行なうようになっている。この場合、話者Ａに対してロボット１０が正対していることから、音声認識の精度がより一層向上することになる。
【００８２】
このようにして、聴覚モジュール２０は、マイク１６からの音響信号に基づいて、ピッチ抽出，音源の分離及び定位から少なくとも一人の話者を特定（話者同定）してその聴覚イベントを抽出し、ネットワークを介してアソシエーションモジュール５０に対して送信すると共に、各話者の音声認識を行なって対話部２８により音声認識結果を話者に対して音声により確認するようになっている。
【００８３】
ここで、実際には、音源方向θ_ｓが時間ｔの関数であることから、特定音源を抽出し続けるためには時間方向の連続性を考慮する必要があるが、上述したように、リアルタイムトラッキングからのストリーム方向θ_ｓにより、音源方向を得るようにしている。
【００８４】
これによって、リアルタイムトラッキングにて、すべてのイベントをストリームという時間的流れを考慮した表現で表わしているので、同時に複数の音源が存在したり、音源やロボット自身が移動する場合でも、一つのストリームに注目することによって特定音源からの方向情報を連続的に得ることができる。さらに、ストリームは視聴覚のイベントを統合するためにも使用しているので、顔イベントを参照して聴覚イベントにより音源定位を行なうことにより、音源定位の精度が向上することになる。
【００８５】
上記顔モジュール３０は、デバイス層としてのカメラ１５と、プロセス層としての顔発見部３１，顔識別部３２，顔定位部３３と、特徴層（データ）としての顔ＩＤ３４，顔方向３５と、イベント層としての顔イベント生成部３６と、から構成されている。
【００８６】
これにより、顔モジュール３０は、カメラ１５からの画像信号に基づいて、顔発見部３１により例えば肌色抽出により各話者の顔を検出し、顔識別部３２にて前もって登録されている顔データベース３８により検索して、一致した顔があった場合、その顔ＩＤ３４を決定して当該顔を識別すると共に、顔定位部３３により当該顔方向３５を決定（定位）する。
【００８７】
ここで、顔モジュール３０は、顔発見部３１が画像信号から複数の顔を見つけた場合、各顔について上記処理、即ち識別及び定位そして追跡を行なう。その際、顔発見部３１により検出された顔の大きさ，方向及び明るさがしばしば変化するので、顔発見部３１は顔領域検出を行なって、肌色抽出と相関演算に基づくパターンマッチングの組合せによって２００ｍ秒以内に複数の顔を正確に検出できるようになっている。
【００８８】
顔定位部３３は、二次元の画像平面における顔位置を三次元空間に変換し、三次元空間における顔位置を、方位角θ，高さφ及び距離ｒのセットとして得る。そして、顔モジュール３０は、各顔毎に、顔ＩＤ（名前）３４及び顔方向３５から、顔イベント生成部３６により顔イベント３９を生成して、ネットワークを介してアソシエーションモジュール５０に対して送信するようになっている。
【００８９】
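The stereo module 37 comprises the cameras 15 as the device layer, a disparity image generation unit 37a and a target extraction unit 37b as the process layer, a target direction 37c as the feature layer (data), and a stereo event generation unit 37d as the event layer. The stereo module 37 generates a disparity image from the image signals of both cameras 15 by means of the disparity image generation unit 37a. The target extraction unit 37b then segments the disparity image into regions; if a vertically long object is found, the target extraction unit 37b extracts it as a person candidate and determines (localizes) its target direction 37c. The stereo event generation unit 37d generates a stereo event 39a based on the target direction 37c and transmits it to the association module 50 via the network.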
【００９０】
上記モータ制御モジュール４０は、デバイス層としてのモータ４１及びポテンショメータ４２と、プロセス層としてのＰＷＭ制御回路４３，ＡＤ変換回路４４及びモータ制御部４５と、データである特徴層としてのロボット方向４６と、イベント層としてのモータイベント生成部４７とから構成されている。これにより、モータ制御モジュール４０においては、モータ制御部４５がアテンション制御モジュール５７（後述）からの指令に基づいてＰＷＭ制御回路４３を介してモータ４１を駆動制御する。また、モータ４１の回転位置をポテンショメータ４２により検出する。この検出結果は、ＡＤ変換回路４４を介してモータ制御部４５に送られる。そして、モータ制御部４５は、ＡＤ変換回路４４から受け取った信号からロボット方向４６を抽出する。モータイベント生成部４７は、ロボット方向４６に基づいて、モータ方向情報から成るモータイベント４８を生成して、ネットワークを介してアソシエーションモジュール５０に対して送信するようになっている。
【００９１】
上記アソシエーションモジュール５０は、上述した聴覚モジュール２０，顔モジュール３０，ステレオモジュール３７，モータ制御モジュール４０に対して、階層的に上位に位置付けられており、各モジュール２０，３０，３７，４０のイベント層の上位であるストリーム層を構成している。具体的には、アソシエーションモジュール５０は、聴覚モジュール２０，顔モジュール３０，ステレオモジュール３７及びモータ制御モジュール４０からの非同期イベント５１、即ち聴覚イベント２９，顔イベント３９，ステレオイベント３９ａ及びモータイベント４８を同期させて聴覚ストリーム５３，顔ストリーム５４，ステレオ視覚ストリーム５５を生成する絶対座標変換部５２と、各ストリーム５３，５４，５５を関連付けてアソシエーションストリーム５９を生成し、あるいはこれらストリーム５３，５４，５５の関連付けを解除する関連付け部５６と、さらにアテンション制御モジュール５７と、ビューア５８とを備えている。
【００９２】
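The absolute coordinate conversion unit 52 synchronizes the motor event 48 from the motor control module 40 with the auditory event 29 from the auditory module 20, the face event 39 from the face module 30, and the stereo event 39a from the stereo module 37, and converts the coordinate system of the auditory event 29, face event 39, and stereo event 39a into an absolute coordinate system using the synchronized motor event, thereby generating the auditory stream 53, the face stream 54, and the stereo vision stream 55. In doing so, the absolute coordinate conversion unit 52 connects each event to the auditory stream, face stream, or stereo vision stream of the same speaker.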
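The synchronization and coordinate conversion step can be sketched as follows: each perceptual event is matched with the motor event closest in time, and the robot direction from that motor event is added to the robot-relative direction of the event. The event layout, the nearest-timestamp matching, the shared sign convention for angles, and the function name are all assumptions; the text does not specify how synchronization is carried out.

```python
from bisect import bisect_left
from typing import List, Tuple


def to_absolute_direction(
    event_time_s: float,
    event_dir_deg: float,                    # direction relative to the robot
    motor_events: List[Tuple[float, float]], # (time_s, robot_dir_deg), sorted by time, non-empty
) -> float:
    """Convert a robot-relative direction to absolute coordinates in [-180, 180)."""
    times = [t for t, _ in motor_events]
    i = bisect_left(times, event_time_s)
    # Candidates are the motor events just before and just after the event time.
    candidates = motor_events[max(i - 1, 0): i + 1]
    _, robot_dir_deg = min(candidates, key=lambda m: abs(m[0] - event_time_s))
    absolute = event_dir_deg + robot_dir_deg
    return (absolute + 180.0) % 360.0 - 180.0
```

Once directions are expressed in absolute coordinates, events observed before and after the robot turns can be connected into the same stream.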
【００９３】
また、関連付け部５６は、聴覚ストリーム５３，顔ストリーム５４，ステレオ視覚ストリーム５５に基づいて、これらのストリーム５３，５４，５５の時間的つながりを考慮してストリームを関連付け、あるいは関連付けを解除して、アソシエーションストリーム５９を生成すると共に、逆にアソシエーションストリーム５９を構成する聴覚ストリーム５３，顔ストリーム５４及びステレオ視覚ストリーム５５の結び付きが弱くなれば、関係付けを解除するようになっている。これにより、目標となる話者が移動している場合であっても、当該話者の移動を予測して、その移動範囲となる角度範囲内であれば、上述したストリーム５３，５４，５５の生成を行なうことによって、当該話者の移動を予測して追跡できることになる。
【００９４】
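The attention control module 57 performs attention control for planning the drive motor control of the motor control module 40, referring preferentially to the association stream 59, the auditory stream 53, the face stream 54, and the stereo vision stream 55, in that order. Based on the state of the auditory stream 53, face stream 54, and stereo vision stream 55 and on the presence or absence of an association stream 59, the attention control module 57 plans the motion of the robot 10 and, if the drive motor 41 needs to operate, transmits a motor event as a motion command to the motor control module 40 via the network. Attention control in the attention control module 57 is based on continuity and triggers: continuity tends to maintain the same state, while triggers tend to track the most interesting target. The module selects the stream to which attention should be directed and performs tracking.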
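The priority order used when choosing a stream to attend to (association, then auditory, then face, then stereo vision) can be sketched as a simple selection function. The dictionary layout of a stream and the names used here are hypothetical, and the continuity and trigger behavior mentioned in the text is deliberately not modelled.

```python
from typing import Dict, List, Optional

# Priority order stated in the description, highest first.
PRIORITY = ("association", "auditory", "face", "stereo")


def select_attention_stream(streams: List[Dict]) -> Optional[Dict]:
    """Return the stream with the highest-priority kind, or None if none qualifies."""
    rank = {kind: i for i, kind in enumerate(PRIORITY)}
    candidates = [s for s in streams if s.get("kind") in rank]
    if not candidates:
        return None
    # min() keeps the earliest stream in the list when several share the same kind.
    return min(candidates, key=lambda s: rank[s["kind"]])
```

The direction of the selected stream would then be used to plan the motor command that turns the robot toward the speaker.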
【００９５】
このようにして、アテンション制御モジュール５７は、アテンション制御を行なって、モータ制御モジュール４０の駆動モータ４１の制御のプランニングを行ない、このプランニングに基づいて、モータコマンド６４ａを生成し、ネットワーク７０を介してモータ制御モジュール４０に伝送する。これにより、モータ制御モジュール４０では、このモータコマンド６４ａに基づいて、モータ制御部４５がＰＷＭ制御を行なって、駆動モータ４１を回転駆動させて、ロボット１０を所定方向に向けるようになっている。
【００９６】
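The viewer 58 displays the streams 53, 54, 55, and 59 generated in this way on the server's screen, specifically as a radar chart 58a and a stream chart 58b. The radar chart 58a shows the state of the streams at that instant, more specifically the camera's viewing angle and the sound source directions, and the stream chart 58b shows the association streams (thick lines) and the auditory, face, and stereo vision streams (thin lines).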
【００９７】
本発明実施形態による人型ロボット１０は以上のように構成されており、以下のように動作する。
まず、ロボット１０の前方１ｍの距離で、斜め左（θ＝＋６０度），正面（θ＝０度）そして斜め右（θ＝−６０度）の方向に、それぞれ話者が並んでおり、ロボット１０が対話部２８により、三人の話者に質問して、各話者が同時に質問に対する回答を行なう。
【００９８】
これにより、ロボット１０はマイク１６が当該話者の音声を拾って、聴覚モジュール２０が音源方向を伴う聴覚イベント２９を生成して、ネットワークを介してアソシエーションモジュール５０に伝送する。これにより、アソシエーションモジュール５０は、この聴覚イベント２９に基づいて、聴覚ストリーム５３を生成する。
【００９９】
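The face module 30 captures an image of the speaker's face with the camera 15, generates a face event 39, searches the face database 38 for the speaker's face to perform face identification, and transmits the resulting face ID 34 and the image to the association module 50 via the network 70. If the speaker's face is not registered in the face database 38, the face module 30 reports this to the association module 50 via the network.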
【０１００】
従って、アソシエーションモジュール５０は、これらの聴覚イベント２９，顔イベント３９，ステレオイベント３９ａに基づいて、アソシエーションストリーム５９を生成する。
【０１０１】
ここで、聴覚モジュール２０は、アクティブ方向通過型フィルタ２３ａにより、聴覚エピポーラ幾何によるＩＰＤを利用して、各音源（話者Ｘ，Ｙ，Ｚ）の定位及び分離を行なって、分離音（音響信号）を取り出す。そして、聴覚モジュール２０は、その音声認識部２７により音声認識エンジン２７ｃを使用して、各話者Ｘ，Ｙ，Ｚの音声を認識してその結果を対話部２８に出力する。これにより、対話部２８は、音声認識部２７により音声認識された前記回答を、それぞれの話者に対してロボット１０が正対した状態で発話する。なお、音声認識部２７が正しく音声認識できなかった場合には、ロボット１０が当該話者に正対した状態で再度質問を繰り返し、その回答に基づいて再度音声認識を行なう。
【０１０２】
このようにして、本発明実施形態による人型ロボット１０によれば、聴覚モジュール２０により音源定位・音源分離された分離音（音響信号）に基づいて、音声認識部２７が、各話者及び方向に対応する音響モデルを使用して音声認識を行なうことにより同時に発話する複数の話者の音声を音声認識することができる。
【０１０３】
以下に、音声認識部２７の動作を実験により評価する。
これらの実験においては、図８に示すように、ロボット１０の前方１ｍの距離で、斜め左（θ＝＋６０度），正面（θ＝０度）そして斜め右（θ＝−６０度）の方向に、それぞれ話者Ｘ，Ｙ，Ｚが並んでいる。なお、実験では、話者として人間の代わりにそれぞれスピーカを置くと共に、その前面に話者の写真を配置している。ここで、スピーカは、音響モデルを作成したときと同じスピーカを使用しており、スピーカから発せられた音声を写真の話者の音声とみなしている。
【０１０４】
そして、以下のシナリオに基づいて音声認識の実験を行なう。
１．ロボット１０が三人の話者Ｘ，Ｙ，Ｚに質問する。
２．三人の話者Ｘ，Ｙ，Ｚが同時に質問に対する回答を行なう。
３．ロボット１０が三人の話者Ｘ，Ｙ，Ｚの混合音声に基づいて、音源定位・音源分離を行ない、さらに各分離音について音声認識を行なう。
４．ロボット１０が、順次に各話者Ｘ，Ｙ，Ｚに正対した状態で当該話者の回答を答える。
５．ロボット１０は、音声認識が正しくできなかったと判断したとき、当該話者に正対して再度質問を繰り返し、その回答に基づいて再度音声認識を行なう。
【０１０５】
上記シナリオによる実験結果の第一の例を図９に示す。
１．ロボット１０が「好きな数字は何ですか？」と質問する。（図９（ａ）参照）
２．各話者Ｘ，Ｙ，Ｚとしてのスピーカから、同時に１から１０までの数字のうちから、任意の数字を読み上げた音声を流す。例えば図９（ｂ）に示すように、話者Ｘは「２」，話者Ｙは「１」そして話者Ｚは「３」と言う。
３．ロボット１０は、聴覚モジュール２０にて、そのマイク１６で集音した音響信号に基づいて、アクティブ方向通過型フィルタ２３ａにより音源定位・音源分離を行なって、分離音を抽出する。そして、各話者Ｘ，Ｙ，Ｚに対応する分離音に基づいて、各話者別に音声認識部２７が９つの音響モデルを使用して、同時に音声認識プロセスを実行し、その音声認識を行なう。
４．その際、音声認識部２７のセレクタ２７ｅが、正面が話者Ｙであると仮定して音声認識の評価を行ない（図９（ｃ））、続いて正面が話者Ｘであると仮定して音声認識の評価を行ない（図９（ｄ））、最後に正面が話者Ｚであると仮定して音声認識の評価を行なう（図９（ｅ））。
５．そして、セレクタ２７ｅが、音声認識結果を統合して、図９（ｆ）に示すように、ロボット正面（θ＝０度）に関して、最も適合の良い話者名（Ｙ）と音声認識結果（「１」）を決定し対話部２８に出力する。これにより、図９（ｇ）に示すように、ロボット１０が話者Ｙに正対した状態にて、「Ｙさんは「１」です。」と答える。
６．続いて、斜め左（θ＝＋６０度）の方向に関して、上記と同様の処理を行って、図９（ｈ）に示すように、ロボット１０が話者Ｘに正対した状態にて、「Ｘさんは「２」です。」と答える。更に、斜め右（θ＝−６０度）の方向に対しても同様の処理を行って、図９（ｉ）に示すように、ロボット１０が話者Ｚに正対した状態にて、「Ｚさんは「３」です。」と答える。
【０１０６】
この場合、ロボット１０は、各話者Ｘ，Ｙ，Ｚの回答をすべて正しく音声認識することができた。従って、同時発話の場合であっても、ロボット１０のマイク１６を使用したロボット視聴覚システム１７における音源定位・音源分離・音声認識の有効性が示された。
【０１０７】
なお、図９（ｊ）に示すように、ロボット１０が各話者に正対せずに、「Ｙさんは「１」です。Ｘさんは「２」です。Ｚさんは「３」です。合計「６」です。」というように、各話者Ｘ，Ｙ，Ｚの答えた数字の合計も答えるようにしてもよい。
【０１０８】
図１０は、上述したシナリオによる実験結果の第二の例を示している。
１．図９に示した第一の例と同様にして、ロボット１０が「好きな数字は何ですか？」と質問し（図１０（ａ）参照）、各話者Ｘ，Ｙ，Ｚとしてのスピーカから、図１０（ｂ）に示すように、話者Ｘは「２」，話者Ｙは「１」そして話者Ｚは「３」という音声が流れる。
２．ロボット１０は、同様にして、聴覚モジュール２０にて、そのマイク１６で集音した音響信号に基づいて、アクティブ方向通過型フィルタ２３ａにより音源定位・音源分離を行なって分離音を抽出し、各話者Ｘ，Ｙ，Ｚに対応する分離音に基づいて、各話者別に音声認識部２７が９つの音響モデルを使用して、同時に音声認識プロセスを実行し、その音声認識を行なう。その際、音声認識部２７のセレクタ２７ｅは、図１０（ｃ）に示すように、正面の話者Ｙについては、正しく音声認識の評価を行なうことができる。
３．これに対して、＋６０度に位置する話者Ｘについて、セレクタ２７ｅは、図１０（ｄ）に示すように、「２」であるか「４」であるか決定することができない。
４．従って、ロボット１０は、図１０（ｅ）に示すように、＋６０度に位置する話者Ｘに正対して、「２ですか？４ですか？」と質問する。
５．これに対して、図１０（ｆ）に示すように、話者Ｘであるスピーカから「２」という回答が流れる。この場合、話者Ｘは、ロボット１０の正面に位置していることから、聴覚モジュール２０が話者Ｘの回答について正しく音源定位・音源分離し、音声認識部２７が正しく音声認識して、話者名Ｘと音声認識結果「２」を対話部２８に出力する。これにより、ロボット１０は、図１０（ｇ）に示すように、話者Ｘに対して「Ｘさんは「２」です。」と答える。
６．続いて、話者Ｚについても同様の処理を行なって、その音声認識結果を話者Ｚに対して答える。即ち、図１０（ｈ）に示すように、ロボット１０が話者Ｚに正対した状態にて、「Ｚさんは「３」です。」と答える。
【０１０９】
このようにして、ロボット１０は、再質問により、各話者Ｘ，Ｙ，Ｚの回答をすべて正しく音声認識することができた。従って、側方での聴覚中心窩の影響による分離精度の低下による音声認識の曖昧さを、ロボット１０が側方の話者に対して正対して再質問することにより解消して、音源分離精度を向上させ、音声認識精度を向上させることができることが示された。
【０１１０】
なお、図１０（ｉ）に示すように、ロボット１０が各話者の音声認識を正しく行なった後、「Ｙさんは「１」です。Ｘさんは「２」です。Ｚさんは「３」です。合計「６」です。」というように、各話者Ｘ，Ｙ，Ｚの答えた数字の合計も答えるようにしてもよい。
【０１１１】
図１１は、上述したシナリオによる実験結果の第三の例を示している。
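1. As in the first example shown in FIG. 9, the robot 10 asks "What is your favorite number?" (see FIG. 11(a)), and the loudspeakers serving as speakers X, Y, and Z play the utterances "8" for speaker X, "7" for speaker Y, and "9" for speaker Z, as shown in FIG. 11(b).
2. In the same way, the auditory module 20 of the robot 10 performs sound source localization and separation with the active direction-pass filter 23a on the acoustic signals collected by the microphones 16, referring to the stream direction θ from real-time tracking (see X3′) and to the face event of each speaker, and extracts the separated sounds. Based on the separated sound for each of speakers X, Y, and Z, the speech recognition unit 27 runs the recognition processes for each speaker simultaneously using the nine acoustic models.
At this time, because the face event indicates a high probability that the frontal speaker is speaker Y, the selector 27e of the speech recognition unit 27 takes this into account when integrating the recognition results of the acoustic models, as shown in FIG. 11(c). This enables more accurate speech recognition. The robot 10 therefore answers speaker Y, "Y's answer is '7'," as shown in FIG. 11(d).
3. When the robot 10 then turns to face speaker X, located at +60 degrees, the face event indicates a high probability that the speaker now in front is speaker X, so the selector 27e likewise takes this into account, as shown in FIG. 11(e). The robot 10 therefore answers speaker X, "X's answer is '8'," as shown in FIG. 11(f).
4. Next, the selector 27e performs the same processing for speaker Z, as shown in FIG. 11(g), and the recognition result is reported to speaker Z. That is, as shown in FIG. 11(h), the robot 10 answers, while facing speaker Z, "Z's answer is '9'."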
【０１１２】
このようにして、ロボット１０は、各話者毎に正対して、その顔イベントを参照しながら話者の顔認識に基づいて、各話者Ｘ，Ｙ，Ｚの回答をすべて正しく音声認識することができた。これにより、顔認識により話者が誰であるかを特定することができるので、より精度の高い音声認識を行なうことができることが示された。特に、特定の環境での利用を前提とするような場合、顔認識によってほぼ１００％に近い顔認識精度が得られると、顔認識情報を信頼性の高い情報として利用することができることになり、音声認識部２７の音声認識エンジン２７ｃで使用される音響モデル２７ｄの数を削減することができるので、より高速で且つ高精度の音声認識が可能になる。
【０１１３】
図１２は、上述したシナリオによる実験結果の第四の例を示している。
１．ロボット１０が「好きなフルーツは何ですか？」と質問し（図１２（ａ）参照）、各話者Ｘ，Ｙ，Ｚとしてのスピーカから、例えば図１２（ｂ）に示すように、話者Ｘは「梨」，話者Ｙは「スイカ」そして話者Ｚは「メロン」と言う。
２．ロボット１０は、聴覚モジュール２０にて、そのマイク１６で集音した音響信号に基づいて、アクティブ方向通過型フィルタ２３ａにより音源定位・音源分離を行なって分離音を抽出する。そして、各話者Ｘ，Ｙ，Ｚに対応する分離音に基づいて、各話者毎に音声認識部２７が９つの音響モデルを使用して、同時に音声認識プロセスを実行し、その音声認識を行なう。
３．その際、音声認識部２７のセレクタ２７ｅが、正面が話者Ｙであると仮定して音声認識の評価を行ない（図１２（ｃ））、続いて正面が話者Ｘであると仮定して音声認識の評価を行ない（図１２（ｄ））、最後に正面が話者Ｚであると仮定して音声認識の評価を行なう（図１２（ｅ））。
４．そして、セレクタ２７ｅが、音声認識結果を統合して、図１２（ｆ）に示すように、ロボット正面（θ＝０度）方向に関して最も適合の良い話者名（Ｙ）と音声認識結果（「スイカ」）を決定し対話部２８に出力する。これにより、図１２（ｇ）に示すように、ロボット１０が話者Ｙに正対した状態にて、「Ｙさんは「スイカ」です。」と答える。
５．続いて、各話者Ｘ，Ｚについても同様の処理を行なって、その音声認識結果を各話者Ｘ，Ｚに対して答える。即ち、図１２（ｈ）に示すように、ロボット１０が話者Ｘに正対した状態にて、「Ｘさんは「梨」です。」と答え、さらに図１２（ｉ）に示すように、ロボット１０が話者Ｚに正対した状態にて「Ｚさんは「メロン」です。」と答える。
【０１１４】
この場合、ロボット１０は、各話者Ｘ，Ｙ，Ｚの回答をすべて正しく音声認識することができた。従って、音声認識エンジン２７ｃに登録された単語は数字に限ることなく、前もって登録された単語であれば、音声認識可能であることが分かる。ここで、実験に使用した音声認識エンジン２７ｃでは、約１５０語の単語が登録されている。なお、単語の音節数が多くなると、音声認識率はやや低くなる。
【０１１５】
上述した実施形態においては、ロボット１０は、その上半身が４ＤＯＦ（自由度）を有するように構成されているが、これに限らず、任意の動作を行なうように構成されたロボットに本発明によるロボット視聴覚システムを組み込むことも可能である。
【０１１６】
また、上述した実施形態においては、本発明によるロボット視聴覚システムを人型ロボット１０に組み込んだ場合について説明したが、これに限らず、犬型等の各種動物型ロボットや、その他の形式のロボットに組み込むことも可能であることは明らかである。
【０１１７】
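The above description covered a configuration in which the robot audio-visual system 17 includes the stereo module 37, as shown in FIG. 4, but the robot audio-visual system according to the embodiment of the present invention can also be configured without the stereo module 37. In that case, the association module 50 generates the auditory stream 53 and face stream 54 of each speaker based on the auditory events 29, face events 39, and motor events 48, and associates these auditory and face streams to generate the association stream 59, and the attention control module 57 performs attention control based on these streams.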
【０１１８】
さらに、上記説明においては、アクティブ方向通過型フィルタ２３ａは、方向毎に通過帯域幅（パスレンジ）を制御しており、処理する音の周波数によらず通過帯域幅を一定としていた。
ここで、通過帯域δを導出するために、１００Ｈｚ，２００Ｈｚ，５００Ｈｚ，１０００Ｈｚ，２０００Ｈｚ，１００Ｈｚの調波構造音（ハーモニクス）の５つの純音と１つのハーモニクスとを用いて、１音源に対する音源抽出率を調べる実験を行った。なお、音源をロボット正面である０度からロボットの左位置或いは右位置である９０度の範囲で１０度毎に位置を移動させた。図１３〜図１５は音源を０度から９０度の範囲の各位置に設置した場合の音源抽出率を示すグラフであり、この実験結果が示すように、周波数に応じて通過帯域幅を制御することにより、特定の周波数の音の抽出率を向上させることができ、分離精度を向上できる。よって、音声認識率も向上する。従って、上記説明したロボット視聴覚システム１７においては、アクティブ方向通過型フィルタ２３ａのパスレンジが、周波数毎に制御可能に構成されるのが望ましい。
【０１１９】
【発明の効果】
以上述べたように、この発明によれば、従来の音声認識と比較して、複数の音響モデルを使用することによって、実時間・実環境での正確な音声認識を行なうことが可能である。また、各音響モデルによる音声認識結果をセレクタにより統合して、最も信頼性の高い音声認識結果を判断することにより、従来の音声認識に比べて、より一層正確な音声認識を行なうことができる。
【図面の簡単な説明】
【図１】この発明によるロボット聴覚装置の第一の実施形態を組み込んだ人型ロボットの外観を示す正面図である。
【図２】図１の人型ロボットの側面図である。
【図３】図１の人型ロボットにおける頭部の構成を示す概略拡大図である。
【図４】図１の人型ロボットにおけるロボット視聴覚システムの電気的構成例を示すブロック図である。
【図５】図４に示すロボット視聴覚システムにおける聴覚モジュールの作用を示す図である。
【図６】図４のロボット視聴覚システムにおける聴覚モジュールの音声認識部で使用される音声認識エンジンの構成例を示す概略斜視図である。
【図７】図６の音声認識エンジンによる正面及び左右±６０度の方向の話者による音声の認識率を示すグラフであり、（Ａ）は正面の話者、（Ｂ）は斜め左＋６０度の話者そして（Ｃ）は斜め右−６０度の話者の場合を示している。
【図８】図４に示すロボット視聴覚システムにおける音声認識実験を示す概略斜視図である。
【図９】図４のロボット視聴覚システムの音声認識実験の第一の例の結果を順次に示す図である。
【図１０】図４のロボット視聴覚システムの音声認識実験の第二の例の結果を順次に示す図である。
【図１１】図４のロボット視聴覚システムの音声認識実験の第三の例の結果を順次に示す図である。
【図１２】図４のロボット視聴覚システムの音声認識実験の第四の例の結果を順次に示す図である。
【図１３】本発明の実施形態に係るアクティブ方向通過型フィルタの通過帯域幅を制御した場合の抽出率を示す図であり、（ａ）は０度、（ｂ）は１０度、（ｃ）は２０度、（ｄ）は３０度の方向に音源がある場合である。
【図１４】本発明の実施形態に係るアクティブ方向通過型フィルタの通過帯域幅を制御した場合の抽出率を示す図であり、（ａ）は４０度、（ｂ）は５０度、（ｃ）は６０度の方向に音源がある場合である。
【図１５】本発明の実施形態に係るアクティブ方向通過型フィルタの通過帯域幅を制御した場合の抽出率を示す図であり、（ａ）は７０度、（ｂ）は８０度、（ｃ）は９０度の方向に音源がある場合である。
【符号の説明】
１０人型ロボット
１１ベース
１２胴体部
１３頭部
１４外装
１５カメラ（ロボット視覚）
１６，１６ａ，１６ｂマイク（ロボット聴覚）
１７ロボット視聴覚システム
２０聴覚モジュール
２１ピーク抽出部
２２音源定位部
２３音源分離部
２３ａアクティブ方向通過型フィルタ
２６聴覚イベント生成部
２７音声認識部
２７ａ自声抑制部
２７ｂ自動認識部
２７ｃ音声認識エンジン
２７ｄ音響モデル
２７ｅセレクタ
２８対話部
３０顔モジュール
３７ステレオモジュール
４０モータ制御モジュール
５０アソシエーションモジュール
５７アテンション制御モジュール
[0001]
[Technical Field of the Invention]
The present invention relates to an audio-visual system for a robot, in particular a humanoid or animal-type robot.
[0002]
[Prior art]
In recent years, humanoid and animal-type robots have come to be considered not only as subjects of AI research but also for future use as so-called "human partners". For a robot to interact intelligently and socially with humans, it needs perception such as vision and hearing. Among these senses, audio-visual perception, and hearing in particular, is clearly an important function for realizing social interaction with humans. Accordingly, so-called active perception has been attracting attention in both vision and audition.
[0003]
Here, active perception refers to making a perceptual device responsible for perception, such as robot vision or robot audition, follow the target to be perceived. For example, the posture of the head supporting these perceptual devices is controlled by a drive mechanism so that it follows the target.
[0004]
In active vision in robots, at least the camera, which is a perceptual device, has its optical axis held toward the target by posture control by the drive mechanism, and focusing, zooming in, zooming out, and the like are performed automatically on the target. Thereby, the target is imaged by the camera even if it moves. Various studies of such active vision have been conducted.
[0005]
In active hearing in a robot, on the other hand, at least a microphone, which is a perceptual device, has its directivity held toward the target by posture control by the drive mechanism, and the sound from the target is collected by the microphone. A disadvantage of active hearing is that, while the drive mechanism is operating, the microphone picks up its operating sound, so relatively large noise is mixed into the sound from the target and the sound from the target may become unrecognizable. To eliminate this disadvantage, a method is employed that recognizes the sound from the target accurately, for example by orienting toward the sound source with reference to visual information.
[0006]
Incidentally, such active hearing requires, based on the sound collected by the microphones, (A) localization of the sound sources, (B) separation of the sounds emitted from the respective sound sources, and (C) recognition of the sound from each sound source. Of these, for (A) sound source localization and (B) sound source separation, various studies on real-time, real-environment sound source localization, tracking, and separation in active hearing have been carried out (see Patent Document 1).
[0007]
[Patent Document 1]
International Publication No. 01/95314 Pamphlet
[0008]
Here, as disclosed for example in Patent Document 1, it is known to perform sound source localization using the interaural phase difference (IPD) and the interaural intensity difference (IID) obtained from the HRTF (head-related transfer function). Patent Document 1 also discloses separating the sound from each sound source by using a so-called direction pass filter to select the subbands having the same IPD as the IPD of a specific direction.
[0009]
On the other hand, for the recognition of the sound from each sound source separated by sound source separation, various approaches to noise-robust speech recognition, such as multiconditioning and missing data, have been studied (see, for example, Non-Patent Documents 1 and 2).
[0010]
[Non-Patent Document 1]
J. Baker, M. Cooke, and P. Green, "Robust ASR based on clean speech models: An evaluation of missing data techniques for connected digit recognition in noise," 7th European Conference on Speech Communication Technology (Eurospeech 2001), 2001, Volume 1, pp. 213-216
[Non-Patent Document 2]
Philippe Renevey, Rolf Vetter, and Jens Kraus, "Robust speech recognition using missing feature theory and vector quantization," 7th European Conference on Speech Communication Technology (Eurospeech 2001), 2001, Volume 12, pp. 1107-1110
[0011]
[Problems to be solved by the invention]
However, in these studies (for example, Non-Patent Documents 1 and 2), effective speech recognition cannot be performed when the S/N ratio is small. Moreover, no studies have addressed speech recognition in real time and in a real environment.
[0012]
In view of the above, an object of the present invention is to provide a robot audio-visual system that recognizes separated sounds from sound sources.
[0013]
[Means for Solving the Problems]
In order to achieve the above object, the first configuration of the robot audio-visual system of the present invention comprises a plurality of acoustic models prepared for each speaker and each direction, a speech recognition engine that uses these acoustic models to execute speech recognition processes on the acoustic signals separated by sound source separation, and a selector that integrates the plurality of speech recognition process results obtained with the respective acoustic models and selects one of them, and is characterized in that it recognizes each of the voices uttered simultaneously by the respective speakers.
[0014]
When integrating the speech recognition process results, the selector calculates a cost function value based on the recognition result of each speech recognition process and the direction of the speaker, and judges the speech recognition process result with the largest cost function value to be the most reliable speech recognition result. A dialogue unit that outputs the speech recognition process result selected by the selector to the outside is also provided.
[0015]
According to such a first configuration, each of the speech recognition processes is performed by using a plurality of acoustic models based on the acoustic signals subjected to sound source localization and sound source separation. Then, the speech recognition process results based on the respective acoustic models are integrated by the selector to determine the most reliable speech recognition result.
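The selection step can be illustrated with a minimal sketch. The passage above does not give the cost function itself, so the `cost` function below (recognizer score minus a penalty for mismatch between the acoustic model's direction and the speaker's direction), the `RecognitionResult` fields, and the weight are all hypothetical; only the rule "pick the result with the largest cost function value" comes from the text.

```python
from dataclasses import dataclass


@dataclass
class RecognitionResult:
    """One speech recognition process result (all fields are illustrative)."""
    word: str               # recognized word
    model_speaker: str      # speaker the acoustic model was built for
    model_direction: float  # direction (degrees) the acoustic model was built for
    score: float            # hypothetical recognizer confidence in [0, 1]


def cost(result, speaker_direction, direction_weight=0.01):
    """Hypothetical cost: recognizer score minus a direction-mismatch penalty."""
    mismatch = abs(result.model_direction - speaker_direction)
    return result.score - direction_weight * mismatch


def select(results, speaker_direction):
    """Return the result with the largest cost function value."""
    return max(results, key=lambda r: cost(r, speaker_direction))


results = [
    RecognitionResult("one", "A", 0.0, 0.70),
    RecognitionResult("nine", "A", 60.0, 0.75),
    RecognitionResult("one", "B", 60.0, 0.72),
]
best = select(results, speaker_direction=60.0)
```

With a speaker localized at 60 degrees, the result produced by the 60-degree model with the higher score is chosen; any real cost function would need to follow the definition used in the embodiment.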
[0016]
In order to achieve the above object, the second configuration of the robot audio-visual system of the present invention comprises: an auditory module that includes at least a pair of microphones for collecting external sounds and that, based on the acoustic signals from the microphones, determines the direction of at least one speaker through pitch extraction and through sound source separation and localization by grouping based on harmonic structure, and extracts an auditory event; a face module that includes a camera for imaging the area in front of the robot and that identifies each speaker through face identification and localization based on the images captured by the camera, and extracts a face event; a motor control module that includes a drive motor for rotating the robot horizontally and that extracts a motor event based on the rotational position of the drive motor; an association module that, from the auditory event, the face event, and the motor event, determines the direction of each speaker based on the direction information of the sound source localization of the auditory event and the face localization of the face event, generates an auditory stream and a face stream by connecting the events in the time direction using a Kalman filter in accordance with this determination, and further associates these streams to generate an association stream; and an attention control module that performs attention control based on these streams and controls the driving of the motor based on the resulting action planning. The auditory module has a plurality of acoustic models prepared for each speaker and each direction. Based on accurate sound source direction information from the association module, it collects, with an active direction pass filter whose pass range is minimum in the front direction and widens as the angle increases to the left and right, the subbands whose interaural phase difference (IPD) or interaural intensity difference (IID) lies within a predetermined range, and reconstructs the waveform of the sound source to perform sound source separation. It then performs speech recognition on the separated acoustic signals using the plurality of acoustic models, integrates the speech recognition results of the respective acoustic models with a selector, and determines the most reliable of these speech recognition results.
[0017]
According to such a second configuration, the auditory module obtains the direction of each sound source by performing pitch extraction using the harmonic structure on the sound collected by the microphones from external targets, determines the direction of each speaker, and extracts the auditory event.
[0018]
In addition, the face module extracts face events of individual speakers from the face identification and localization of each speaker by pattern recognition from the image captured by the camera.
[0019]
Further, the motor control module extracts the motor event by detecting the direction of the robot based on the rotational position of the drive motor that rotates the robot in the horizontal direction.
[0020]
An event indicates that a sound or a face has been detected at a given point in time, or the state in which the drive motor is rotated. A stream indicates events connected so as to be continuous in time, for example by a Kalman filter, while error correction processing is performed.
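How events might be connected into streams can be sketched as follows. The text names a Kalman filter but gives no model, so this sketch assumes a scalar random-walk model of each speaker's direction; the class and function names, the variances, and the 10-degree association gate are all hypothetical.

```python
class DirectionKalman:
    """Scalar Kalman filter with a random-walk model of a speaker's direction."""

    def __init__(self, initial_deg, initial_var=25.0, process_var=4.0, obs_var=9.0):
        self.x = initial_deg  # estimated direction (degrees)
        self.p = initial_var  # variance of the estimate
        self.q = process_var  # assumed drift of the direction per step
        self.r = obs_var      # assumed measurement noise of one event

    def predict(self):
        # Random walk: the direction estimate stays, the uncertainty grows.
        self.p += self.q
        return self.x

    def update(self, observed_deg):
        # Blend prediction and observation according to their variances.
        gain = self.p / (self.p + self.r)
        self.x += gain * (observed_deg - self.x)
        self.p *= 1.0 - gain
        return self.x


def link_events(event_directions, gate_deg=10.0):
    """Connect per-frame event directions (degrees) into streams."""
    streams = []  # each stream: {"filter": DirectionKalman, "events": [(frame, deg)]}
    for frame, directions in enumerate(event_directions):
        for d in directions:
            best = None
            for s in streams:
                dist = abs(d - s["filter"].x)
                if dist <= gate_deg and (best is None or dist < best[0]):
                    best = (dist, s)
            if best is None:
                # No stream is close enough: start a new one at this event.
                streams.append({"filter": DirectionKalman(d), "events": [(frame, d)]})
            else:
                best[1]["filter"].predict()
                best[1]["filter"].update(d)
                best[1]["events"].append((frame, d))
    return streams


streams = link_events([[0.0, 60.0], [2.0, 58.0], [3.0, 61.0]])
```

Here two speakers near 0 and 60 degrees yield two streams of three events each. A fuller implementation would also terminate streams that receive no events for some time, which this sketch omits.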
[0021]
Here, the association module generates an auditory stream and a face stream of each speaker based on the auditory event, the face event, and the motor event extracted in this manner, and further associates these streams to generate the association stream. Then, the attention control module performs attention control based on these streams, thereby planning driving motor control of the motor control module. Note that the association stream is a concept including an auditory stream and a face stream.
[0022]
Attention means that the robot "pays attention", auditorily and/or visually, to the target speaker. Attention control means that the robot attends to that speaker by changing its direction with the motor control module.
[0023]
Then, the attention control module controls the drive motor of the motor control module based on this planning to turn the robot toward the target speaker. As a result, with the robot squarely facing the target speaker, the auditory module can accurately collect and localize the speaker's voice with the microphones in the front direction, where sensitivity is high, and the face module can capture a good image of the speaker with the camera.
[0024]
Therefore, the ambiguity of the hearing and vision of the robot is complemented by the cooperation of the auditory module, the face module, and the motor control module, and the association module and the attention control module, and so-called robustness is achieved. Even if there are a plurality of speakers, each speaker can be perceived individually.
[0025]
In addition, even when either the auditory event or the face event is missing, the association module can perceive the target speaker based on the face event alone or the auditory event alone, so that the motor control module can be controlled in real time.
[0026]
Further, the auditory module performs speech recognition by using a plurality of acoustic models based on the acoustic signals that have been sound source localized and separated as described above. Then, the speech recognition results by the respective acoustic models are integrated by the selector to determine the most reliable speech recognition result.
[0027]
This makes it possible, by using a plurality of acoustic models, to perform speech recognition in real time and in a real environment more accurately than conventional speech recognition. Moreover, because the selector integrates the speech recognition results of the respective acoustic models and determines the most reliable one, even more accurate speech recognition can be performed.
[0028]
In order to achieve the above object, the third configuration of the robot audio-visual system of the present invention comprises: an auditory module that includes at least a pair of microphones for collecting external sounds and that, based on the acoustic signals from the microphones, determines the direction of at least one speaker through pitch extraction and through sound source separation and localization by grouping based on harmonic structure, and extracts an auditory event; a face module that includes a camera for imaging the area in front of the robot and that identifies each speaker through face identification and localization based on the images captured by the camera, and extracts a face event; a stereo module that extracts and localizes vertically long objects based on the parallax extracted from images captured by a stereo camera, and extracts a stereo event; a motor control module that includes a drive motor for rotating the robot horizontally and that extracts a motor event based on the rotational position of the drive motor; an association module that, from the auditory event, the face event, the stereo event, and the motor event, determines the direction of each speaker based on the direction information of the sound source localization of the auditory event and the face localization of the face event, generates an auditory stream, a face stream, and a stereo visual stream by connecting the events in the time direction using a Kalman filter in accordance with this determination, and further associates these streams to generate an association stream; and an attention control module that performs attention control based on these streams and controls the driving of the motor based on the resulting action planning. The auditory module has a plurality of acoustic models prepared for each speaker and each direction. Based on accurate sound source direction information from the association module, it collects, with an active direction pass filter whose pass range is minimum in the front direction and widens as the angle increases to the left and right, the subbands whose interaural phase difference (IPD) or interaural intensity difference (IID) lies within a predetermined range, and reconstructs the waveform of the sound source to perform sound source separation. It then performs speech recognition on the separated acoustic signals using the plurality of acoustic models, integrates the speech recognition results of the respective acoustic models with a selector, and determines the most reliable of these speech recognition results.
[0029]
According to such a third configuration, the auditory module obtains the direction of each sound source by performing pitch extraction using the harmonic structure from the sound from the external target collected by the microphone, The direction of the speaker is determined and the auditory event is extracted.
[0030]
In addition, the face module identifies each speaker from the face identification and localization of each speaker by pattern recognition from the image captured by the camera, and extracts the face event of each speaker. Furthermore, the stereo module extracts a stereo event by extracting and localizing a vertically long object based on the parallax extracted from the image captured by the stereo camera.
[0031]
Further, the motor control module extracts the motor event by detecting the direction of the robot based on the rotational position of the drive motor that rotates the robot in the horizontal direction.
[0032]
An event indicates that a sound, a face, or a vertically long object has been detected at a given point in time, or the state in which the drive motor is rotated. A stream indicates events connected so as to be continuous in time, for example by a Kalman filter, while error correction processing is performed.
[0033]
Here, the association module determines the direction of each speaker based on the direction information of the sound source localization of the auditory event and the face localization of the face event based on the auditory event, the face event, the stereo event, and the motor event extracted in this way. , The auditory stream, the face stream, and the stereo visual stream of each speaker are generated, and these streams are associated with each other to generate an association stream. Note that the association stream is a concept including an auditory stream, a face stream, and a stereo visual stream. At this time, the association module determines the direction of each speaker based on the sound source localization of the auditory event and the face localization of the face event, that is, the auditory and visual direction information, and refers to the determined speaker direction. Generate an association stream.
[0034]
Then, the attention control module performs the drive control of the motor based on the attention control based on these streams and the planning result of the action accompanying therewith. Then, the attention control module controls the drive motor of the motor control module based on the planning to direct the direction of the robot to the target speaker. As a result, when the robot faces the target speaker, the auditory module can accurately collect and localize the voice of the speaker in the front direction with high sensitivity. The image of the speaker can be captured well by the camera.
[0035]
Therefore, through the cooperation of the auditory module, the face module, the stereo module, and the motor control module with the association module and the attention control module, the direction of each speaker is determined based on the direction information of the sound source localization of the auditory stream and the speaker localization of the face stream. The ambiguities of the robot's hearing and vision thereby complement each other, so-called robustness is improved, and each speaker can be reliably perceived even when there are a plurality of speakers.
[0036]
Also, even when any of the auditory stream, the face stream, and the stereo visual stream is missing, the attention control module can track the target speaker based on the remaining streams, so the direction of the target can be grasped accurately and the motor control module can be controlled.
[0037]
Here, the auditory module refers to the association stream from the association module, and by performing sound source localization in consideration of the face stream from the face module and the stereo visual stream from the stereo module, a more accurate sound source Localization can be performed.
[0038]
Then, based on accurate sound source direction information from the association module, the auditory module collects, with an active direction pass filter whose pass range follows auditory characteristics in being minimum in the front direction and widening as the angle increases to the left and right, the subbands whose interaural phase difference (IPD) or interaural intensity difference (IID) lies within a predetermined range, and reconstructs the waveform of the sound source to perform sound source separation. Adjusting the pass range, that is, the sensitivity, in accordance with these auditory characteristics allows sound source separation to be performed more accurately, taking the direction-dependent difference in sensitivity into account. Furthermore, the auditory module performs speech recognition using a plurality of acoustic models on the acoustic signals thus localized and separated. The speech recognition results of the respective acoustic models are then integrated by the selector, the most reliable speech recognition result is determined, and that result is output in association with the corresponding speaker.
[0039]
This makes it possible to perform accurate speech recognition in real time and real environment by using a plurality of acoustic models compared to conventional speech recognition, and to obtain speech recognition results by each acoustic model. By integrating by the selector and determining the most reliable speech recognition result, more accurate speech recognition can be performed.
[0040]
In the second configuration and the third configuration, when speech recognition by the auditory module fails, the attention control module directs the microphones and the camera toward the sound source of the acoustic signal, the sound is collected again by the microphones, and the auditory module performs speech recognition again based on the acoustic signal obtained by localizing and separating that sound.
[0041]
Further, the auditory module preferably refers to a face event by the face module when performing speech recognition. In addition, a dialogue unit that outputs the voice recognition result determined by the auditory module to the outside may be provided. Furthermore, it is desirable that the pass range of the active direction pass filter can be controlled for each frequency.
[0042]
When speech recognition by the auditory module fails, the attention control module directs the microphones and the camera toward the sound source of the acoustic signal (the speaker), the sound is collected again by the microphones, and the auditory module performs speech recognition again based on the acoustic signal obtained by localizing and separating that sound. In that case, because the microphones of the robot's auditory module and the camera of the face module squarely face the speaker, reliable speech recognition becomes possible.
[0043]
The auditory module also considers the face stream from the face module by referring to the association stream from the association module when performing speech recognition. That is, the auditory module performs voice recognition on the face event localized by the face module based on the acoustic signal from the sound source (speaker) localized / separated by the auditory module, thereby achieving more accurate voice recognition. Can be done.
[0044]
If the pass range of the active direction pass filter is controllable for each frequency, the accuracy of separation from the collected sound further increases, thereby further improving speech recognition.
[0045]
DETAILED DESCRIPTION OF THE INVENTION
The present invention will be described in detail below based on the embodiments shown in the drawings.
FIGS. 1 and 2 each show an example of the overall configuration of an experimental upper-body-only humanoid robot equipped with an embodiment of the robot audiovisual system according to the present invention. In FIG. 1, a humanoid robot 10 is configured as a 4 DOF (degree of freedom) robot, and includes a base 11 and a body portion 12 supported on the base 11 so as to be rotatable about one axis (vertical axis). The head portion 13 is supported on the body portion 12 so as to be swingable around three axis directions (vertical axis, horizontal axis in the left-right direction and horizontal axis in the front-rear direction).
[0046]
The base 11 may be fixedly arranged, or may be operable as a leg portion. The base 11 may also be placed on a movable carriage or the like. The body portion 12 is supported so as to be rotatable around a vertical axis with respect to the base 11 as shown by arrow A in FIG. 1, is rotationally driven by a driving means (not shown), and is covered by a soundproof exterior.
[0047]
The head 13 is supported on the body 12 via a connecting member 13a. It is supported so as to be swingable relative to the connecting member 13a around the horizontal axis in the front-rear direction as shown by arrow B in FIG. 1, and around the horizontal axis in the left-right direction as shown by arrow C in FIG. 2. The connecting member 13a is in turn supported so as to be swingable relative to the body 12 around a horizontal axis in the front-rear direction as shown by arrow D in FIG. 1. Each is rotationally driven in the directions of arrows A, B, C, and D by driving means (not shown). Here, as shown in FIG. 3, the head 13 is entirely covered with a soundproof exterior 14, and is provided with a camera 15 on the front side as a visual device in charge of robot vision and a pair of microphones 16 (16a, 16b) on both sides as auditory devices in charge of robot hearing. The microphones 16 are not limited to both sides of the head 13 and may be provided at other positions on the head 13 or on the body 12.
[0048]
The exterior 14 is made of, for example, a sound-absorbing synthetic resin such as urethane resin, and is configured so that the inside of the head 13 is almost completely sealed and sound-insulated. The exterior of the body 12 is also made of a sound-absorbing synthetic resin, like the exterior 14 of the head 13.
[0049]
The camera 15 has a known configuration, for example, a commercially available camera having a so-called 3 DOF (degree of freedom) of pan, tilt, and zoom. The camera 15 is designed to send a stereo image in synchronization.
[0050]
The microphones 16 are attached to the respective side surfaces of the head 13 so as to have directivity toward the front. As shown in FIGS. 1 and 2, the left and right microphones 16a and 16b are attached inside step portions 14a and 14b disposed on both sides of the exterior 14 of the head 13. The microphones 16a and 16b collect sound from the front through through holes provided in the step portions 14a and 14b, and are sound-insulated by appropriate means so as not to pick up sound inside the exterior 14. The through holes are formed in the step portions 14a and 14b so as to penetrate from the inner side of each step portion toward the front of the head. The microphones 16a and 16b are thereby configured as so-called binaural microphones. The exterior 14 near the attachment positions of the microphones 16a and 16b may be formed in the shape of a human outer ear. The microphones 16 may also include a pair of internal microphones disposed inside the exterior 14, so that noise generated inside the robot 10 can be canceled based on the internal sounds collected by the internal microphones.
[0051]
FIG. 4 shows an example of the electrical configuration of the robot audio-visual system including the camera 15 and the microphones 16. In FIG. 4, the robot audio-visual system 17 includes an auditory module 20, a face module 30, a stereo module 37, a motor control module 40, and an association module 50.
[0052]
Here, the association module 50 is configured as a server that executes processing in response to requests from clients, and the clients of this server are the other modules, that is, the auditory module 20, the face module 30, the stereo module 37, and the motor control module 40. The server and the clients operate asynchronously with one another. The server and each client are each implemented on a personal computer, and the personal computers are connected to one another as a LAN (Local Area Network), for example in a TCP/IP communication environment. In this case, for the communication of events and streams, which involve large amounts of data, it is preferable to apply a high-speed network capable of gigabit data exchange to the robot audio-visual system 17, and for control communication it is preferable to apply a medium-speed network. Transmitting such large amounts of data between the personal computers at high speed improves the real-time performance and scalability of the robot as a whole.
[0053]
Each of the modules 20, 30, 37, 40, and 50 is hierarchically distributed and is composed, in order from the bottom, of a device layer, a process layer, a feature layer, and an event layer.
[0054]
The auditory module 20 includes the microphones 16 as the device layer; a peak extraction unit 21, a sound source localization unit 22, a sound source separation unit 23, and an active direction pass filter 23a as the process layer; a pitch 24 and a sound source horizontal direction 25 as the feature layer (data); an auditory event generation unit 26 as the event layer; and further a speech recognition unit 27 and a dialogue unit 28 as the process layer.
[0055]
Here, the auditory module 20 operates as shown in FIG. That is, in FIG. 5, the auditory module 20 performs frequency analysis on the acoustic signal from the microphone 16 sampled at, for example, 48 kHz and 16 bits by FFT (Fast Fourier Transform) as indicated by reference numeral X1, and indicated by reference numeral X2. Thus, a spectrum is generated for each of the left and right channels. Then, the auditory module 20 extracts a series of peaks for each of the left and right channels by the peak extraction unit 21, and pairs the same or similar peaks in the left and right channels.
[0056]
Here, peak extraction is performed using a band-pass filter that passes only data satisfying three conditions: (α) the power is at or above a threshold, (β) the point is a local peak, and (γ) the frequency lies between, for example, 90 Hz and 3 kHz, so that low-frequency noise and the high-frequency band with little power are cut. The threshold is defined as the measured ambient background noise plus a sensitivity parameter, for example 10 dB.
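The three conditions can be sketched as a simple filter over one channel's power spectrum. This is a minimal illustration, not the embodiment's implementation: the function name, the representation of the spectrum as a list of dB values with a fixed bin spacing, and the argument names are assumptions; the 90 Hz to 3 kHz band and the 10 dB sensitivity parameter are the example values given above.

```python
def extract_peaks(power_db, bin_hz, noise_floor_db, sensitivity_db=10.0,
                  low_hz=90.0, high_hz=3000.0):
    """Return (frequency, power) pairs satisfying conditions (α), (β), (γ)."""
    # Threshold = measured background noise + sensitivity parameter.
    threshold = noise_floor_db + sensitivity_db
    peaks = []
    for k in range(1, len(power_db) - 1):
        freq = k * bin_hz
        above_threshold = power_db[k] >= threshold                     # (α)
        local_peak = power_db[k - 1] < power_db[k] >= power_db[k + 1]  # (β)
        in_band = low_hz <= freq <= high_hz                            # (γ)
        if above_threshold and local_peak and in_band:
            peaks.append((freq, power_db[k]))
    return peaks


# Synthetic spectrum with 50 Hz bins and a -60 dB noise floor.
spectrum = [-60.0] * 80
spectrum[1] = -10.0   # 50 Hz: strong but below the band
spectrum[4] = -20.0   # 200 Hz: passes all three conditions
spectrum[8] = -55.0   # 400 Hz: local peak but below the threshold
spectrum[70] = -10.0  # 3500 Hz: strong but above the band
peaks = extract_peaks(spectrum, bin_hz=50.0, noise_floor_db=-60.0)
```

Only the 200 Hz component survives in this example; the same function would be applied separately to the left and right channels before the peaks are paired.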
[0057]
The auditory module 20 then performs sound source separation using the fact that each peak has a harmonic structure. Specifically, the sound source separation unit 23 extracts local peaks having a harmonic structure in order from the lowest frequency, and regards each extracted set of peaks as one sound. In this way, the acoustic signal of each sound source is separated from the mixed sound. During sound source separation, the sound source localization unit 22 of the auditory module 20 selects, as indicated by reference numeral X3, acoustic signals of the same frequency from the left and right channels for the acoustic signal of each sound source separated by the sound source separation unit 23, and calculates the IPD (interaural phase difference) and the IID (interaural intensity difference). This calculation is performed, for example, every 5 degrees. The sound source localization unit 22 then outputs the calculation result to the active direction pass filter 23a.
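Grouping peaks by harmonic structure, lowest frequency first, might look like the sketch below. The text does not say how closely a peak must match an integer multiple of the fundamental, so the 3 % relative tolerance and the function name are assumptions.

```python
def group_harmonics(peak_freqs, tolerance=0.03):
    """Group peak frequencies (Hz) into sounds, lowest fundamental first."""
    remaining = sorted(peak_freqs)
    sounds = []
    while remaining:
        f0 = remaining[0]  # lowest unassigned peak is taken as a fundamental
        members = []
        for f in remaining:
            n = max(1, round(f / f0))  # nearest harmonic number
            if abs(f - n * f0) <= tolerance * n * f0:
                members.append(f)
        sounds.append(members)
        remaining = [f for f in remaining if f not in members]
    return sounds


sounds = group_harmonics([100.0, 150.0, 200.0, 300.0, 451.0])
```

In this example the peaks split into one sound with a 100 Hz fundamental and another with a 150 Hz fundamental. A peak that fits two fundamentals (here 300 Hz) is simply assigned to the lower one, a simplification that a real system would need to handle more carefully.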
[0058]
On the other hand, the active direction pass filter 23a calculates, as indicated by reference numeral X4, the theoretical value of the IPD (= Δφ′(θ)) and the theoretical value of the IID (= Δρ′(θ)) based on the direction θ of the association stream 59 calculated by the association module 50. The direction θ is the result of real-time tracking (reference numeral X3′) in the association module 50 based on face localization (face event 39), stereo vision (stereo vision event 39a), and sound source localization (auditory event 29).
[0059]
Here, the theoretical values of the IPD and the IID are each calculated using the auditory epipolar geometry described below. Specifically, with the front of the robot 10 taken as 0 degrees, the theoretical values of the IPD and the IID are calculated within a range of ±90 degrees. The auditory epipolar geometry is needed in order to obtain the direction information of the sound source without using the HRTF. In stereo vision research, epipolar geometry is one of the most common localization methods, and auditory epipolar geometry is the application of epipolar geometry in vision to hearing. Since the auditory epipolar geometry obtains direction information from geometric relationships, the HRTF can be dispensed with.
[0060]
In the auditory epipolar geometry, assuming that the sound source is at infinity, and letting Δφ, θ, f, and v be the IPD, the sound source direction, the frequency, and the speed of sound, respectively, and r be the radius of the robot head regarded as a sphere, the IPD is expressed by the following formula (1):
[Expression 1]
$$\Delta\varphi = \frac{2\pi f}{v}\, r\,(\theta + \sin\theta) \tag{1}$$
[0061]
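On the other hand, based on the pair of spectra obtained by FFT (Fast Fourier Transform), the IPD Δφ′ and the IID Δρ′ of each subband are calculated by the following formulas (2) and (3):
[Expression 2]
$$\Delta\varphi' = \arctan\frac{\Im[Sp_l]}{\Re[Sp_l]} - \arctan\frac{\Im[Sp_r]}{\Re[Sp_r]} \tag{2}$$
[Expression 3]
$$\Delta\rho' = 20\log_{10}\frac{|Sp_l|}{|Sp_r|} \tag{3}$$
Here, Sp_l and Sp_r are the spectra obtained from the left and right microphones 16a and 16b at a given time.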
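The per-subband IPD and IID can be sketched as below, under the assumption that they take their usual definitions: the difference between the phases of the left and right complex spectrum values, and 20·log10 of the ratio of their magnitudes. The function names are illustrative, and wrapping the phase difference into (−π, π] is an added convenience rather than something stated in the text.

```python
import cmath
import math


def ipd(sp_l, sp_r):
    """Interaural phase difference (radians) of one subband, wrapped to (-pi, pi]."""
    diff = cmath.phase(sp_l) - cmath.phase(sp_r)
    return math.atan2(math.sin(diff), math.cos(diff))


def iid(sp_l, sp_r):
    """Interaural intensity difference (dB) of one subband."""
    return 20.0 * math.log10(abs(sp_l) / abs(sp_r))


# One subband: left is twice as strong and leads the right by 0.3 rad.
sp_left = cmath.rect(2.0, 0.5)
sp_right = cmath.rect(1.0, 0.2)
phase_difference = ipd(sp_left, sp_right)
intensity_difference = iid(sp_left, sp_right)
```

`cmath.phase` uses the two-argument arctangent, so it resolves the quadrant that a plain arctan of the imaginary-to-real ratio would leave ambiguous.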
[0062]
Further, the active direction pass filter 23a uses the stream direction θ_S to select, according to the passband function indicated by reference numeral X7, the passband δ(θ_S) corresponding to θ_S. As shown by X7 in FIG. 5, sensitivity is highest in the front direction of the robot (θ = 0 degrees) and lower toward the sides, so the passband function takes its minimum value at θ = 0 degrees and becomes larger toward the sides. This reproduces the auditory characteristic that localization sensitivity is highest in the front direction and decreases as the angle increases to the left and right. The maximum of localization sensitivity in the front direction is called the auditory fovea, after the fovea found in the structure of mammalian eyes. In humans, the localization sensitivity of this auditory fovea is about ±2 degrees at the front and about ±8 degrees around 90 degrees to the left and right.
[0063]
The active direction pass filter 23a then uses the selected passband δ(θ_s) to extract the acoustic signal in the range from θ_L to θ_H, where θ_L = θ_s − δ(θ_s) and θ_H = θ_s + δ(θ_s).
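The text does not give the functional form of the passband function X7, only that it is smallest at the front and larger toward the sides. As a minimal sketch under that constraint, the following assumes a linear growth from 2 degrees at the front to 8 degrees at ±90 degrees, borrowing the human figures quoted above purely for illustration. The names `passband_width` and `pass_range` and both default widths are hypothetical.

```python
def passband_width(theta_deg, front=2.0, side=8.0):
    """Hypothetical passband delta(theta): narrowest at 0 degrees, widest at +/-90."""
    angle = min(abs(theta_deg), 90.0)
    return front + (side - front) * angle / 90.0

def pass_range(theta_s):
    """Return (theta_L, theta_H) = (theta_s - delta, theta_s + delta)."""
    delta = passband_width(theta_s)
    return theta_s - delta, theta_s + delta
```

For example, a stream at the front gives the range from −2 to +2 degrees, while a stream at 90 degrees gives the wider range from 82 to 98 degrees.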
[0064]
Further, as indicated by reference numeral X5, the active direction pass filter 23a applies the stream direction θ_s to the head related transfer function (HRTF) and calculates the theoretical values of the IPD (= Δφ_H(θ_s)) and the IID (= Δρ_H(θ_s)) at θ_L and θ_H, that is, it estimates the direction of the sound source to be extracted. Then, based on the IPD (= Δφ_E(θ)) and IID (= Δρ_E(θ)) calculated for each subband from the auditory epipolar geometry for the sound source direction θ, and on the IPD (= Δφ_H(θ)) and IID (= Δρ_H(θ)) obtained from the HRTF, the active direction pass filter 23a collects, as indicated by reference numeral X6, the subbands whose extracted IPD (= Δφ_E) and IID (= Δρ_E) satisfy the following condition within the angle range from θ_L to θ_H determined by the passband δ(θ) described above.
[0065]
Here, the frequency f_th is the threshold for choosing between the IPD and the IID as the filtering criterion, and indicates the upper limit of the frequency at which localization by IPD is effective. The frequency f_th depends on the distance between the microphones of the robot 10 and is, for example, about 1500 Hz in this embodiment.
[0066]
That is,
[Expression 4]

[0067]
This means that a subband is collected when, at a frequency below the predetermined frequency f_th, its IPD (= Δφ′) lies within the range of the IPD passband δ(θ) given by the HRTF, and when, at a frequency equal to or above the predetermined frequency f_th, its IID (= Δρ′) lies within the range of the IID passband δ(θ) given by the HRTF. In general, the IPD has a large influence in the low frequency band and the IID has a large influence in the high frequency band, and the threshold frequency f_th depends on the distance between the microphones.
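The selection rule described here can be sketched as follows. This is an illustration under stated assumptions rather than the filter's actual implementation: `ipd_ref` and `iid_ref` are hypothetical callables returning the theoretical IPD and IID for a direction and frequency, and the 1500 Hz default is the example value given for this embodiment.

```python
def select_subbands(subbands, ipd_ref, iid_ref, theta_l, theta_h, f_th=1500.0):
    """Return the indices of subbands that pass the direction filter.

    subbands: list of (frequency_hz, measured_ipd, measured_iid).
    ipd_ref, iid_ref: hypothetical callables (theta, frequency) -> theoretical value.
    """
    selected = []
    for index, (freq, ipd, iid) in enumerate(subbands):
        if freq < f_th:
            # Below the threshold frequency, the IPD is the criterion.
            low, high = sorted((ipd_ref(theta_l, freq), ipd_ref(theta_h, freq)))
            keep = low <= ipd <= high
        else:
            # At or above the threshold frequency, the IID is the criterion.
            low, high = sorted((iid_ref(theta_l, freq), iid_ref(theta_h, freq)))
            keep = low <= iid <= high
        if keep:
            selected.append(index)
    return selected
```

Sorting the two bounds makes the check independent of whether the theoretical value increases or decreases with direction.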
[0068]
Then, the active direction pass filter 23a constructs a waveform by re-synthesizing the acoustic signal from the subbands collected in this way. That is, it generates the pass-subband direction as indicated by reference numeral X8, performs filtering for each subband as indicated by reference numeral X9, and, by the inverse frequency transform IFFT (inverse fast Fourier transform) indicated by reference numeral X10, extracts the separated sound (acoustic signal) from each sound source in the corresponding range, as indicated by reference numeral X11.
[0069]
As shown in FIG. 5, the voice recognition unit 27 is composed of a self-voice suppression unit 27a and an automatic recognition unit 27b. The self-voice suppression unit 27a removes the sound emitted from the loudspeaker 28c of the dialogue unit 28, which will be described later, from each acoustic signal subjected to sound source localization and sound source separation by the auditory module 20, and extracts only the external acoustic signals. As shown in FIG. 6, the automatic recognition unit 27b includes a speech recognition engine 27c, acoustic models 27d, and a selector 27e. As the speech recognition engine 27c, for example, the speech recognition engine called "Julian" developed at Kyoto University can be used, which makes it possible to recognize the words uttered by each speaker.
[0070]
In FIG. 6, the automatic recognition unit 27b is configured to recognize three speakers, for example, two men (speakers A and C) and one woman (speaker B). For this purpose, the automatic recognition unit 27b is provided with an acoustic model 27d for each direction of each speaker. In the case of FIG. 6, the acoustic models 27d are built from combinations of the voices of the three speakers A, B, and C and their directions, so that a plurality of acoustic models 27d, nine in this case, are provided.
[0071]
The speech recognition engine 27c executes nine speech recognition processes in parallel, using the nine acoustic models 27d. Specifically, the speech recognition engine 27c applies each of the nine acoustic models 27d in parallel to the input acoustic signal. The resulting speech recognition results are output to the selector 27e.
[0072]
The selector 27e integrates all the speech recognition process results from the respective acoustic models 27d, determines the most reliable speech recognition process result by, for example, majority decision, and outputs the speech recognition result.
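As a minimal sketch of the majority decision mentioned here, the following enumerates nine (speaker, direction) models and picks the result returned by the most models. The names `MODELS` and `select_by_majority` are hypothetical, and the direction values of 0 and ±60 degrees are taken from the experiment described in this document as an illustrative choice.

```python
from collections import Counter
from itertools import product

SPEAKERS = ("A", "B", "C")
DIRECTIONS = (-60, 0, 60)
# One acoustic model per (speaker, direction) pair: nine in total.
MODELS = list(product(SPEAKERS, DIRECTIONS))

def select_by_majority(results):
    """Return (word, votes) for the word recognized by the most models.

    results: dict mapping (speaker, direction) -> recognized word.
    Ties are resolved in favor of the word that was seen first.
    """
    counts = Counter(results.values())
    word, votes = counts.most_common(1)[0]
    return word, votes
```

A plain vote treats all nine models as equally reliable, which is the simplest form of integration.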
[0073]
Here, the word recognition rate for the acoustic model 27d of the specific speaker will be described by a specific experiment.
First, in a 3 m × 3 m room, three loudspeakers are placed 1 m from the robot 10, in the directions of 0 degrees and ±60 degrees from the robot. Next, as speech data for the acoustic models, 150 words such as colors, numbers, and foods spoken by two men and one woman are played from the loudspeakers and picked up by the microphones 16a and 16b of the robot 10. Each word is recorded in three patterns: played from only one loudspeaker, played simultaneously from two loudspeakers, and played simultaneously from three loudspeakers. The recorded speech signals are then separated by the active direction pass filter 23a described above, and the extracted speech data are sorted by speaker and direction to create the training sets for the acoustic models.
[0074]
For each acoustic model 27d, triphones were used, and for each training set the HTK (Hidden Markov Model Toolkit) 27f was used to create a total of nine sets of acoustic-model speech data for speech recognition, one for each direction of each speaker.
[0075]
Using the acoustic-model speech data obtained in this way, the word recognition rate of the acoustic models 27d for a specific speaker was examined by experiment, and the results shown in FIG. 7 were obtained. FIG. 7 is a graph with the direction on the horizontal axis and the word recognition rate on the vertical axis, where P indicates the voice of the speaker in question (speaker A) and Q indicates the voices of the other speakers (speakers B and C). With the acoustic models of speaker A, when speaker A was located in front of the robot 10 (FIG. 7(A)), the word recognition rate was 80% or more at the front (0 degrees). When speaker A was located at 60 degrees to the right or at −60 degrees to the left, as shown in FIG. 7(B) and FIG. 7(C), respectively, the decrease in recognition rate caused by a difference in direction was smaller than that caused by a difference in speaker, and the word recognition rate was 80% or more particularly when both the speaker and the direction matched.
[0076]
In view of this result, and making use of the fact that the sound source direction is known at the time of speech recognition, the selector 27e uses the cost function V(p_e) given by the following equation (5) for the integration.
[Equation 5]

[0077]
Here, r(p, d) and Res(p, d) are defined as the word recognition rate and the recognition result, respectively, for the input speech when the acoustic model of speaker p and direction d is used; d_e is the sound source direction given by real-time tracking; and p_e is the speaker to be evaluated.
[0078]
The above P_v(p_e, d_e) is a probability generated by the face recognition module, and is always set to 1.0 when face recognition is impossible. The selector 27e outputs the speaker p_e having the largest cost function V(p_e) together with the recognition result Res(p, d). In doing so, the selector 27e can identify the speaker by referring to the face event 39 obtained by face recognition in the face module 30, so that the robustness of speech recognition is improved.
[0079]
When the maximum value of the cost function V(p_e) is 1.0 or less, or is close to the second largest value, the selector 27e judges that speech recognition could not be performed, either because recognition failed or because the candidates could not be narrowed down to one, and outputs a notice to that effect to the dialogue unit 28 described later. The dialogue unit 28 comprises a dialogue control unit 28a, a voice synthesis unit 28b, and a loudspeaker 28c. The dialogue control unit 28a is controlled by the association module 50 described later and, based on the speech recognition result from the voice recognition unit 27, that is, the speaker p_e and the recognition result Res(p, d), generates voice data for the target speaker and outputs it to the voice synthesis unit 28b. The voice synthesis unit 28b drives the loudspeaker 28c based on the voice data from the dialogue control unit 28a and emits a voice corresponding to the voice data.
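The rejection rule described here can be sketched as below. The text does not say how close counts as "close to the second largest value", so the `margin` parameter is a hypothetical threshold, and the function name `decide` is likewise an assumption. The cost values are taken as already computed.

```python
def decide(costs, margin=0.1):
    """Return the winning speaker, or None when recognition is judged to have failed.

    costs: dict mapping candidate speaker -> cost function value V(p_e).
    margin: hypothetical minimum gap between the best and second-best values.
    """
    ranked = sorted(costs.items(), key=lambda item: item[1], reverse=True)
    best_speaker, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else float("-inf")
    # Reject when the best value is 1.0 or less, or too close to the runner-up.
    if best <= 1.0 or best - second < margin:
        return None
    return best_speaker
```

A `None` result corresponds to the case in which the dialogue unit is told that recognition could not be performed, so that the robot can ask again.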
[0080]
Thus, for example, when the speaker A says "1" as a favorite number, the dialogue unit 28, based on the speech recognition result from the voice recognition unit 27, causes the robot 10, facing the speaker A directly, to utter a voice such as "Mr. A said '1'." to the speaker A.
[0081]
When the voice recognition unit 27 outputs a notice that voice recognition could not be performed, the dialogue unit 28, with the robot 10 facing the speaker A, asks the speaker A a question such as "Is it 2 or 4?", and the voice of the speaker A is recognized again. In this case, since the robot 10 faces the speaker A, the accuracy of voice recognition is further improved.
[0082]
In this way, the auditory module 20 identifies at least one speaker (speaker identification) through pitch extraction, sound source separation, and sound source localization based on the acoustic signal from the microphone 16, extracts the corresponding auditory event, and transmits it to the association module 50 via the network. It also performs speech recognition for each speaker, and the dialogue unit 28 confirms the speech recognition result with the speaker by voice.
[0083]
Here, in practice, the sound source direction θ_s is a function of time t. Therefore, to keep extracting a specific sound source, continuity in the time direction must be taken into account; as described above, the sound source direction is obtained from the stream direction θ_s given by real-time tracking.
[0084]
As a result, because real-time tracking expresses every event as a stream that takes the flow of time into account, direction information from a specific sound source can be obtained continuously by attending to a single stream, even when multiple sound sources exist at the same time or when the sound source or the robot itself moves. Furthermore, because streams are also used to integrate audiovisual events, the accuracy of sound source localization is improved by localizing sound sources from auditory events with reference to face events.
[0085]
The face module 30 comprises a camera 15 as the device layer; a face finding unit 31, a face identification unit 32, and a face localization unit 33 as the process layer; a face ID 34 and a face direction 35 as the feature layer (data); and a face event generation unit 36 as the event layer.
[0086]
Thereby, the face module 30 detects the face of each speaker with the face finding unit 31, for example by skin color extraction, based on the image signal from the camera 15; searches the pre-registered face database 38 with the face identification unit 32 and, if a matching face is found, determines the face ID 34 to identify the face; and determines (localizes) the face direction 35 with the face localization unit 33.
[0087]
Here, when the face finding unit 31 finds a plurality of faces in the image signal, the face module 30 performs the above processing, that is, identification, localization, and tracking, for each face. Because the size, direction, and brightness of the faces detected by the face finding unit 31 often change, the face finding unit 31 performs face region detection and, by combining skin color extraction with pattern matching based on correlation calculation, can accurately detect a plurality of faces within 200 milliseconds.
[0088]
The face localization unit 33 converts the face position in the two-dimensional image plane into three-dimensional space and obtains the face position in three-dimensional space as a set of azimuth angle θ, height φ, and distance r. Then, for each face, the face module 30 generates a face event 39 from the face ID (name) 34 and the face direction 35 with the face event generation unit 36 and transmits it to the association module 50 via the network.
[0089]
The stereo module 37 comprises the camera 15 as the device layer; a parallax image generation unit 37a and a target extraction unit 37b as the process layer; a target direction 37c as the feature layer (data); and a stereo event generation unit 37d as the event layer. In the stereo module 37, the parallax image generation unit 37a generates a parallax image from the image signals of the two cameras 15. Next, the target extraction unit 37b divides the parallax image into regions and, if a vertically long object is found as a result, extracts it as a human candidate and determines (localizes) its target direction 37c. The stereo event generation unit 37d generates a stereo event 39a based on the target direction 37c and transmits it to the association module 50 via the network.
[0090]
The motor control module 40 comprises a motor 41 and a potentiometer 42 as the device layer; a PWM control circuit 43, an AD conversion circuit 44, and a motor control unit 45 as the process layer; a robot direction 46 as the feature layer (data); and a motor event generation unit 47 as the event layer. In the motor control module 40, the motor control unit 45 drives and controls the motor 41 via the PWM control circuit 43 based on a command from the attention control module 57 (described later). The rotational position of the motor 41 is detected by the potentiometer 42, and the detection result is sent to the motor control unit 45 via the AD conversion circuit 44. The motor control unit 45 then extracts the robot direction 46 from the signal received from the AD conversion circuit 44. The motor event generation unit 47 generates a motor event 48 containing motor direction information based on the robot direction 46 and transmits it to the association module 50 via the network.
[0091]
The association module 50 is positioned hierarchically above the auditory module 20, the face module 30, the stereo module 37, and the motor control module 40, and constitutes the stream layer, which is the layer above the event layers of the modules 20, 30, 37, and 40. Specifically, the association module 50 comprises an absolute coordinate conversion unit 52 that synchronizes the asynchronous events 51 from the auditory module 20, the face module 30, the stereo module 37, and the motor control module 40, that is, the auditory event 29, the face event 39, the stereo event 39a, and the motor event 48, and generates an auditory stream 53, a face stream 54, and a stereo visual stream 55; an association unit 56 that associates the streams 53, 54, and 55 with one another to generate an association stream 59, or releases such an association; an attention control module 57; and a viewer 58.
[0092]
The absolute coordinate conversion unit 52 synchronizes the motor event 48 from the motor control module 40 with the auditory event 29 from the auditory module 20, the face event 39 from the face module 30, and the stereo event 39a from the stereo module 37, and converts the coordinate systems of the auditory event 29, the face event 39, and the stereo event 39a into an absolute coordinate system using the synchronized motor event, thereby generating the auditory stream 53, the face stream 54, and the stereo visual stream 55. In doing so, the absolute coordinate conversion unit 52 generates the auditory stream 53, the face stream 54, and the stereo visual stream 55 by connecting each event to the auditory stream, face stream, or stereo visual stream of the same speaker.
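A minimal sketch of the conversion to absolute coordinates follows. It assumes that synchronization means picking the motor event closest in time to the perceptual event, and that the robot-relative direction and the robot direction share the same sign convention, so that the absolute direction is their sum. Both assumptions, and the name `to_absolute`, are illustrative and not stated in the text.

```python
def to_absolute(event_time, relative_deg, motor_events):
    """Convert a robot-relative direction to an absolute direction in degrees.

    motor_events: non-empty list of (timestamp, robot_direction_deg).
    The motor event nearest in time to the perceptual event is used
    (assumed synchronization rule).
    """
    _, robot_deg = min(motor_events, key=lambda event: abs(event[0] - event_time))
    absolute = relative_deg + robot_deg
    # Normalize to the interval [-180, 180).
    return (absolute + 180.0) % 360.0 - 180.0
```

For instance, a sound heard 10 degrees to one side while the head is turned 30 degrees the other way maps to an absolute direction of 20 degrees, which stays fixed when the head turns further.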
[0093]
Further, based on the auditory stream 53, the face stream 54, and the stereo visual stream 55, the association unit 56 associates these streams or releases their association in consideration of their temporal connection. That is, it generates the association stream 59 when the streams are strongly connected, and conversely releases the association when the connection among the auditory stream 53, the face stream 54, and the stereo visual stream 55 constituting the association stream 59 becomes weak. As a result, even if the target speaker is moving, the movement of the speaker can be predicted and tracked by generating the streams 53, 54, and 55 as long as the predicted position lies within the angular range of that movement.
[0094]
The attention control module 57 performs attention control for planning the drive motor control of the motor control module 40. In doing so, it refers to the association stream 59, the auditory stream 53, the face stream 54, and the stereo visual stream 55 preferentially in this order. The attention control module 57 then plans the operation of the robot 10 based on the states of the auditory stream 53, the face stream 54, and the stereo visual stream 55 and on the presence or absence of the association stream 59, and, if the drive motor 41 needs to be operated, transmits a motor event as an operation command to the motor control module 40 via the network. Here, the attention control in the attention control module 57 is based on continuity and triggers: continuity tries to maintain the same state, triggers try to track the most interesting target, and the stream to which attention should be directed is selected and tracked accordingly.
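The priority order among stream types can be sketched as follows. Only the ordering (association, auditory, face, stereo) comes from the text; the data layout, the name `choose_stream`, and the choice of the first stream of a given type are hypothetical, and the continuity and trigger rules are not modeled.

```python
# Stream types in decreasing order of priority, as stated in the text.
PRIORITY = ("association", "auditory", "face", "stereo")

def choose_stream(active):
    """Return (kind, stream) for the highest-priority active stream, or None.

    active: dict mapping stream kind -> list of currently active streams.
    """
    for kind in PRIORITY:
        streams = active.get(kind)
        if streams:
            # Hypothetical tie-break: take the first stream of that kind.
            return kind, streams[0]
    return None
```

With both an auditory and a face stream active but no association stream, this sketch directs attention to the auditory stream.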
[0095]
In this way, the attention control module 57 performs attention control, plans the control of the drive motor 41 of the motor control module 40, generates a motor command 64a based on this plan, and transmits it to the motor control module 40 via the network 70. As a result, in the motor control module 40, the motor control unit 45 performs PWM control based on the motor command 64a and rotates the drive motor 41 so that the robot 10 faces a predetermined direction.
[0096]
The viewer 58 displays the streams 53, 54, 55, and 59 generated in this way on the server screen, specifically as a radar chart 58a and a stream chart 58b. The radar chart 58a shows the state of the streams at that moment, more specifically the viewing angle of the camera and the sound source directions, and the stream chart 58b shows the association stream (drawn with a thick line) and the auditory stream, face stream, and stereo visual stream (drawn with thin lines).
[0097]
The humanoid robot 10 according to the embodiment of the present invention is configured as described above, and operates as follows.
First, at a distance of 1 m in front of the robot 10, speakers are lined up diagonally to the left (θ = +60 degrees), at the front (θ = 0 degrees), and diagonally to the right (θ = −60 degrees). The robot 10 asks the three speakers a question through the dialogue unit 28, and the speakers answer the question at the same time.
[0098]
As a result, the robot 10 picks up the voice of the speaker by the microphone 16 and the auditory module 20 generates an auditory event 29 with the sound source direction and transmits it to the association module 50 via the network. Thereby, the association module 50 generates an auditory stream 53 based on the auditory event 29.
[0099]
In addition, the face module 30 captures an image of the speaker's face with the camera 15, generates a face event 39, searches the face database 38 for the speaker's face to perform face identification, and transmits the resulting face ID 34 and the image to the association module 50 via the network 70. When the face of the speaker is not registered in the face database 38, the face module 30 transmits that fact to the association module 50 via the network.
[0100]
Therefore, the association module 50 generates an association stream 59 based on these auditory events 29, face events 39, and stereo events 39a.
[0101]
Here, in the auditory module 20, the active direction pass filter 23a localizes and separates each sound source (speakers X, Y, Z) using the IPD based on the auditory epipolar geometry, and extracts the separated sounds (acoustic signals). The auditory module 20 then recognizes the voices of the speakers X, Y, and Z with the speech recognition engine 27c of the voice recognition unit 27 and outputs the results to the dialogue unit 28. The dialogue unit 28 then utters the answer recognized by the voice recognition unit 27 while the robot 10 faces each speaker. If the voice recognition unit 27 cannot recognize the voice correctly, the robot 10 repeats the question while facing the speaker and performs voice recognition again based on the answer.
[0102]
In this way, according to the humanoid robot 10 of the embodiment of the present invention, the voice recognition unit 27 performs speech recognition on the separated sounds (acoustic signals) obtained through sound source localization and sound source separation by the auditory module 20, using acoustic models corresponding to each speaker and direction, so that the voices of a plurality of speakers speaking simultaneously can be recognized.
[0103]
Hereinafter, the operation of the speech recognition unit 27 is evaluated by experiments.
In these experiments, as shown in FIG. 8, speakers X, Y, and Z are lined up at a distance of 1 m in front of the robot 10, diagonally to the left (θ = +60 degrees), at the front (θ = 0 degrees), and diagonally to the right (θ = −60 degrees), respectively. In the experiments, a loudspeaker is placed in place of each human speaker, and a photograph of the speaker is placed in front of the loudspeaker. The loudspeakers are the same as those used when the acoustic models were created, and the sound emitted from each loudspeaker is regarded as the voice of the speaker in the photograph.
[0104]
Based on the following scenario, a speech recognition experiment is performed.
1. The robot 10 asks the three speakers X, Y, and Z a question.
2. The three speakers X, Y, and Z answer the question simultaneously.
3. The robot 10 performs sound source localization and sound source separation based on the mixed speech of the three speakers X, Y, and Z, and then performs speech recognition on each separated sound.
4. The robot 10 repeats each speaker's answer back while facing the speakers X, Y, and Z in turn.
5. When the robot 10 judges that speech recognition was not performed correctly, it faces the speaker, repeats the question, and performs speech recognition again based on the answer.
[0105]
FIG. 9 shows a first example of the experimental results based on the above scenario.
1. The robot 10 asks "What is your favorite number?" (see FIG. 9(a)).
2. From the loudspeakers serving as the speakers X, Y, and Z, voices each reading out an arbitrary number from 1 to 10 are played at the same time. For example, as shown in FIG. 9(b), the speaker X says "2", the speaker Y says "1", and the speaker Z says "3".
3. The robot 10 uses the active direction pass filter 23a in the auditory module 20 to perform sound source localization and sound source separation based on the acoustic signal collected by the microphone 16, and extracts the separated sounds. Then, based on the separated sounds corresponding to the speakers X, Y, and Z, the speech recognition unit 27 performs speech recognition by executing the speech recognition processes simultaneously, using the nine acoustic models for each speaker.
4. At that time, the selector 27e of the speech recognition unit 27 evaluates the speech recognition on the assumption that the speaker at the front is the speaker Y (FIG. 9(c)), then on the assumption that the speaker at the front is the speaker X (FIG. 9(d)), and finally on the assumption that the speaker at the front is the speaker Z (FIG. 9(e)).
5. Then, the selector 27e integrates the speech recognition results and, as shown in FIG. 9(f), determines the best-matching speaker name (Y) and speech recognition result ("1") and outputs them to the dialogue unit 28. As a result, as shown in FIG. 9(g), the robot 10 answers "Mr. Y is '1'." while facing the speaker Y.
6. Subsequently, the same processing is performed for the diagonally left direction (θ = +60 degrees), and, as shown in FIG. 9, the robot 10 answers "Mr. X is '2'." while facing the speaker X. Further, the same processing is performed for the diagonally right direction (θ = −60 degrees), and, as shown in FIG. 9, the robot 10 answers "Mr. Z is '3'." while facing the speaker Z.
[0106]
In this case, the robot 10 was able to correctly recognize all the answers from the speakers X, Y, and Z. Therefore, even in the case of simultaneous speech, the effectiveness of sound source localization / sound source separation / speech recognition in the robot audio-visual system 17 using the microphone 16 of the robot 10 was shown.
[0107]
As shown in FIG. 9(j), the robot 10 may also, without turning to face each speaker, answer with the total of the numbers given by the speakers X, Y, and Z, for example "Mr. Y is '1'. Mr. X is '2'. Mr. Z is '3'. The total is '6'."
[0108]
FIG. 10 shows a second example of the experimental result based on the scenario described above.
1. As in the first example shown in FIG. 9, the robot 10 asks "What is your favorite number?" (see FIG. 10(a)), and, as shown in FIG. 10(b), the voices "2" for the speaker X, "1" for the speaker Y, and "3" for the speaker Z are played from the loudspeakers serving as the speakers X, Y, and Z.
2. Similarly, the robot 10 performs sound source localization and sound source separation with the active direction pass filter 23a based on the acoustic signal collected by the microphone 16 in the auditory module 20, and extracts the separated sounds. Based on the separated sounds corresponding to the speakers X, Y, and Z, the speech recognition unit 27 performs speech recognition by executing the speech recognition processes simultaneously, using the nine acoustic models for each speaker. At that time, the selector 27e of the speech recognition unit 27 can correctly evaluate the speech recognition for the speaker Y at the front, as shown in FIG. 10(c).
3. For the speaker X located at +60 degrees, on the other hand, the selector 27e cannot determine whether the answer is "2" or "4", as shown in FIG. 10.
4. Accordingly, as shown in FIG. 10(e), the robot 10 turns to face the speaker X located at +60 degrees and asks "Is it 2? Or 4?".
5. In response, as shown in FIG. 10(f), the answer "2" is played from the loudspeaker serving as the speaker X. In this case, because the speaker X is located in front of the robot 10, the auditory module 20 correctly performs sound source localization and sound source separation for the answer of the speaker X, the speech recognition unit 27 correctly recognizes the speech, and the speaker name X and the speech recognition result "2" are output to the dialogue unit 28. As a result, as shown in FIG. 10, the robot 10 answers "Mr. X is '2'." to the speaker X.
6. Subsequently, the same processing is performed for the speaker Z, and the speech recognition result is answered to the speaker Z. That is, as shown in FIG. 10(h), the robot 10 answers "Mr. Z is '3'." while facing the speaker Z.
[0109]
Thus, by asking again, the robot 10 was able to correctly recognize all the answers of the speakers X, Y, and Z. This shows that the ambiguity in speech recognition caused by the lower separation accuracy at the sides, an effect of the auditory fovea, can be resolved when the robot 10 turns to face the speaker at the side and asks again, which improves the sound source separation accuracy and hence the speech recognition accuracy.
[0110]
As shown in FIG. 10(i), after correctly recognizing each speaker's voice, the robot 10 may also answer with the total of the numbers given by the speakers X, Y, and Z, for example "Mr. Y is '1'. Mr. X is '2'. Mr. Z is '3'. The total is '6'."
[0111]
FIG. 11 shows a third example of the experimental result based on the scenario described above.
1. In this case as well, as in the first example shown in FIG. 9, the robot 10 asks "What is your favorite number?" (see FIG. 11(a)), and, as shown in FIG. 11(b), the voices "8" for the speaker X, "7" for the speaker Y, and "9" for the speaker Z are played from the loudspeakers serving as the speakers X, Y, and Z.
2. Similarly, in the auditory module 20, the robot 10 performs sound source localization and sound source separation with the active direction pass filter 23a, based on the acoustic signal collected by the microphone 16 and with reference to the stream direction θ given by real-time tracking (see X3′) and to the face event of each speaker, and extracts the separated sounds. Based on the separated sounds corresponding to the speakers X, Y, and Z, the speech recognition unit 27 performs speech recognition by executing the speech recognition processes simultaneously, using the nine acoustic models for each speaker.
At that time, because the face event gives a high probability that the speaker at the front is the speaker Y, the selector 27e of the speech recognition unit 27 takes this into account, as shown in FIG. 11(c). Thereby, more accurate speech recognition can be performed. Accordingly, as shown in FIG. 11(d), the robot 10 answers "Mr. Y is '7'." to the speaker Y.
3. Next, when the robot 10 changes its orientation and faces the speaker X located at +60 degrees, the face event gives a high probability that the speaker now at the front is the speaker X, so the selector 27e likewise takes this into account, as shown in FIG. 11. Accordingly, as shown in FIG. 11(f), the robot 10 answers "Mr. X is '8'." to the speaker X.
4. Subsequently, as shown in FIG. 11(g), the selector 27e performs the same processing for the speaker Z, and the speech recognition result is answered to the speaker Z. That is, as shown in FIG. 11(h), the robot 10 answers "Mr. Z is '9'." while facing the speaker Z.
[0112]
In this manner, by referring to the face event for each speaker, the robot 10 was able to correctly recognize all the answers of the speakers X, Y, and Z based on face recognition of the speakers. This shows that, because the speaker can be identified by face recognition, speech recognition with higher accuracy is possible. In particular, when use in a specific environment is assumed and face recognition achieves an accuracy close to 100%, the face recognition information can be used as highly reliable information, and the number of acoustic models 27d used by the speech recognition engine 27c of the speech recognition unit 27 can be reduced, so that faster and more accurate speech recognition becomes possible.
[0113]
FIG. 12 shows a fourth example of the experimental result based on the scenario described above.
1. The robot 10 asks “What is your favorite fruit?” (see FIG. 12(a)), and the speakers X, Y, and Z answer; for example, as shown in FIG. 12(b), speaker X says “Pear”, speaker Y says “Watermelon”, and speaker Z says “Melon”.
2. Based on the acoustic signals collected by the microphones 16, the robot 10 performs sound source localization and sound source separation with the active direction pass filter 23a and extracts the separated sounds. Then, on the separated sounds corresponding to the speakers X, Y, and Z, the speech recognition unit 27 executes the speech recognition processes for each speaker simultaneously, using the nine acoustic models.
3. At that time, the selector 27e of the speech recognition unit 27 first evaluates the speech recognition on the assumption that the speaker in front is speaker Y (FIG. 12(c)), then on the assumption that the speaker in front is speaker X (FIG. 12(d)), and finally on the assumption that the speaker in front is speaker Z (FIG. 12(e)).
4. Then, the selector 27e integrates the speech recognition results and, as shown in FIG. 12(f), determines the speaker name (Y) and the speech recognition result (“Watermelon”) and outputs them to the dialogue unit 28. As a result, as shown in FIG. 12(g), with the robot 10 facing speaker Y, it answers “Mr. Y is ‘Watermelon’.”
5. Subsequently, the same processing is performed for speakers X and Z, and the speech recognition results are answered to speakers X and Z. That is, as shown in FIG. 12(h), with the robot 10 facing speaker X, it answers “Mr. X is ‘Pear’,” and then, as shown in FIG. 12(i), with the robot 10 facing speaker Z, it answers “Mr. Z is ‘Melon’.”
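The integration performed by the selector 27e in steps 3 and 4 can be sketched as follows. This is only an illustration: the cost function below (a recognition score plus a bonus when the model direction matches the localized sound source direction) and all names and weights are assumptions, not the cost function of the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    speaker: str      # speaker the acoustic model was made for
    direction: int    # direction the acoustic model was made for (degrees)
    word: str         # word recognized with this acoustic model
    score: float      # hypothetical recognition score in [0, 1]


def cost(result, source_direction, direction_weight=0.5):
    """Hypothetical cost: higher means a more reliable result."""
    direction_match = 1.0 if result.direction == source_direction else 0.0
    return result.score + direction_weight * direction_match


def select(results, source_direction):
    """Return the result with the largest cost function value."""
    return max(results, key=lambda r: cost(r, source_direction))
```

For a separated sound localized in front of the robot, the result of a model made for the front direction is preferred over a result with a higher raw score from a model made for another direction.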
[0114]
In this case, the robot 10 was able to correctly recognize all the answers from the speakers X, Y, and Z. This shows that the words registered in the speech recognition engine 27c are not limited to numbers, and that any word registered in advance can be recognized. About 150 words are registered in the speech recognition engine 27c used in the experiment. Note that as the number of syllables of a word increases, the speech recognition rate decreases slightly.
[0115]
In the embodiment described above, the robot 10 is configured such that its upper body has 4 DOF (degrees of freedom). However, the present invention is not limited to this, and the robot audio-visual system according to the present invention may also be incorporated in a robot configured to perform arbitrary operations.
[0116]
In the above-described embodiment, the case where the robot audio-visual system according to the present invention is incorporated in the humanoid robot 10 has been described. However, the present invention is not limited to this, and the system can obviously also be incorporated in various animal-type robots, such as dog-type robots, and in other types of robots.
[0117]
In the above description, the robot audio-visual system 17 includes the stereo module 37 as illustrated in FIG. 4. However, the robot audio-visual system according to the embodiment of the present invention can also be configured without the stereo module 37. In this case, the association module 50 generates an auditory stream 53 and a face stream 54 for each speaker based on the auditory event 29, the face event 39, and the motor event 48, further associates the auditory stream 53 and the face stream 54 with each other to generate an association stream 59, and the attention control module 57 performs attention control based on these streams.
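How an auditory stream and a face stream might be associated in the configuration without the stereo module can be sketched as below. The pairing rule (nearest face stream within a fixed angular gap) and the 10-degree threshold are assumptions made for illustration; the actual association conditions of the association module 50 are not restated here.

```python
def associate(auditory_streams, face_streams, max_gap_deg=10.0):
    """Pair each auditory stream with the closest unused face stream.

    Both arguments map a stream id to its current direction in degrees.
    max_gap_deg is a hypothetical threshold. Returns a list of
    (auditory_id, face_id) pairs, each standing for one association stream.
    """
    pairs = []
    used = set()
    for a_id, a_dir in auditory_streams.items():
        best = None
        for f_id, f_dir in face_streams.items():
            if f_id in used:
                continue
            gap = abs(a_dir - f_dir)
            if gap <= max_gap_deg and (best is None or gap < best[1]):
                best = (f_id, gap)
        if best is not None:
            used.add(best[0])
            pairs.append((a_id, best[0]))
    return pairs
```

An auditory stream with no face stream nearby stays unassociated, so attention control could still treat it as a sound-only stream.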
[0118]
Further, in the above description, the active direction pass filter 23a controls the pass bandwidth (pass range) for each direction, and the pass bandwidth is constant regardless of the frequency of the sound to be processed.
Here, in order to derive the pass band δ, an experiment was conducted to examine the sound source extraction rate for a single sound source, using five pure tones of 100 Hz, 200 Hz, 500 Hz, 1000 Hz, and 2000 Hz and one harmonic-structure sound (harmonics) of 100 Hz. The sound source was moved in 10-degree steps over the range from 0 degrees, which is directly in front of the robot, to 90 degrees, which is directly to the left or right of the robot. FIGS. 13 to 15 are graphs showing the sound source extraction rate when the sound source is placed at each position in the range of 0 to 90 degrees. As these experimental results show, controlling the pass bandwidth according to the frequency improves the extraction rate of sound at a specific frequency and thus the separation accuracy, which in turn improves the speech recognition rate. Therefore, in the robot audio-visual system 17 described above, it is desirable that the pass range of the active direction pass filter 23a be controllable for each frequency.
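A pass range that depends on both direction and frequency can be sketched as follows. The shape of the function (narrowest at the front, widening linearly toward 90 degrees, and scaled for low frequencies) and every numeric value are hypothetical placeholders, not values derived from the experiment of FIGS. 13 to 15. Each subband is assumed to already carry a direction estimated from its IPD or IID.

```python
def pass_range_deg(target_deg, freq_hz, front_deg=10.0, side_deg=30.0,
                   low_freq_scale=1.5, split_hz=1500.0):
    """Hypothetical pass range (half-width in degrees) for one subband.

    The range is smallest in the front direction and grows toward the
    sides; subbands below split_hz get a wider range via low_freq_scale.
    """
    ratio = min(abs(target_deg), 90.0) / 90.0
    base = front_deg + (side_deg - front_deg) * ratio
    return base * (low_freq_scale if freq_hz < split_hz else 1.0)


def select_subbands(subbands, target_deg):
    """Return indices of subbands whose direction lies in the pass range.

    subbands is a list of (freq_hz, estimated_direction_deg) tuples.
    """
    return [
        i for i, (freq_hz, direction_deg) in enumerate(subbands)
        if abs(direction_deg - target_deg) <= pass_range_deg(target_deg, freq_hz)
    ]
```

The selected subbands would then be used to reconstruct the waveform of the separated sound; that reconstruction step is omitted from this sketch.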
[0119]
[Effects of the invention]
As described above, according to the present invention, using a plurality of acoustic models makes it possible to perform more accurate speech recognition in real time and in a real environment than conventional speech recognition. Furthermore, because the selector integrates the speech recognition results of the respective acoustic models and determines the most reliable speech recognition result, speech recognition can be performed more accurately than with conventional speech recognition.
[Brief description of the drawings]
FIG. 1 is a front view showing the appearance of a humanoid robot incorporating a first embodiment of a robot hearing apparatus according to the present invention.
FIG. 2 is a side view of the humanoid robot.
FIG. 3 is a schematic enlarged view showing a configuration of a head in the humanoid robot of FIG. 1.
FIG. 4 is a block diagram showing an example of an electrical configuration of a robot audio-visual system in the humanoid robot of FIG. 1.
FIG. 5 is a diagram showing an operation of an auditory module in the robot audio-visual system shown in FIG. 4.
FIG. 6 is a schematic perspective view showing a configuration example of a speech recognition engine used in a speech recognition unit of an auditory module in the robot audio-visual system.
FIG. 7 is a graph showing the speech recognition rates obtained by the speech recognition engine of FIG. 6 for speakers in front and at ±60 degrees to the left and right, where (A) shows the case of a speaker in front, (B) a speaker at +60 degrees diagonally to the left, and (C) a speaker at -60 degrees diagonally to the right.
FIG. 8 is a schematic perspective view showing a speech recognition experiment in the robot audio-visual system shown in FIG. 4.
FIG. 9 is a diagram sequentially showing the results of a first example of the speech recognition experiment of the robot audio-visual system of FIG. 4.
FIG. 10 is a diagram sequentially showing the results of a second example of the speech recognition experiment of the robot audio-visual system of FIG. 4.
FIG. 11 is a diagram sequentially showing the results of a third example of the speech recognition experiment of the robot audio-visual system.
FIG. 12 is a diagram sequentially showing the results of a fourth example of the speech recognition experiment of the robot audio-visual system of FIG. 4.
FIG. 13 is a set of diagrams showing the extraction rate when the pass bandwidth of the active direction pass filter according to the embodiment of the present invention is controlled, where (a) shows the case where the sound source is in the direction of 0 degrees, (b) 10 degrees, (c) 20 degrees, and (d) 30 degrees.
FIG. 14 is a set of diagrams showing the extraction rate when the pass bandwidth of the active direction pass filter according to the embodiment of the present invention is controlled, where (a) shows the case where the sound source is in the direction of 40 degrees, (b) 50 degrees, and (c) 60 degrees.
FIG. 15 is a set of diagrams showing the extraction rate when the pass bandwidth of the active direction pass filter according to the embodiment of the present invention is controlled, where (a) shows the case where the sound source is in the direction of 70 degrees, (b) 80 degrees, and (c) 90 degrees.
[Explanation of symbols]
10 Humanoid robot
11 Base
12 Torso
13 Head
14 Exterior
15 Camera (robot vision)
16, 16a, 16b Microphone (robot hearing)
17 Robot audio-visual system
20 Auditory module
21 Peak extractor
22 Sound source localization section
23 Sound source separation unit
23a Active direction pass filter
26 Auditory event generator
27 Voice recognition unit
27a Voice suppression part
27b Automatic recognition unit
27c Speech recognition engine
27d Acoustic model
27e Selector
28 Dialogue unit
30 Face module
37 Stereo module
40 Motor control module
50 Association module
57 Attention control module

Claims

A robot audio-visual system comprising:
a plurality of acoustic models created for each speaker and for each direction;
a speech recognition engine that executes speech recognition processes on sound-source-separated acoustic signals using these acoustic models; and
a selector that integrates the plurality of speech recognition process results obtained for the respective acoustic models by the speech recognition processes and selects one of the speech recognition process results,
wherein the system recognizes speech uttered simultaneously by the respective speakers.

The robot audio-visual system according to claim 1, wherein, when integrating the speech recognition process results, the selector calculates a cost function value based on the recognition result of the speech recognition process and the direction of the speaker, and determines the speech recognition process result with the largest cost function value to be the most reliable speech recognition result.

The robot audio-visual system according to claim 1, further comprising a dialogue unit that outputs the speech recognition process result selected by the selector to the outside.

A robot audio-visual system comprising:
an auditory module that includes at least a pair of microphones for collecting external sound and that, based on the acoustic signals from these microphones, determines the direction of at least one speaker by pitch extraction and by sound source separation and localization based on harmonic structure, and extracts an auditory event;
a face module that includes a camera for imaging the area in front of the robot, identifies each speaker by face identification and localization from the image captured by the camera, and extracts a face event;
a motor control module that includes a drive motor for rotating the robot in the horizontal direction and extracts a motor event based on the rotational position of the drive motor;
an association module that, from the auditory event, the face event, and the motor event, determines the direction of each speaker based on the direction information of the sound source localization of the auditory event and the face localization of the face event, generates an auditory stream and a face stream by connecting the events in the time direction, and further associates these streams to generate an association stream; and
an attention control module that performs attention control based on these streams and performs motor drive control based on the result of the accompanying action planning,
wherein the auditory module
has a plurality of acoustic models created for each speaker and for each direction,
performs sound source separation, based on accurate sound source direction information from the association module, by collecting subbands whose interaural phase difference (IPD) or interaural intensity difference (IID) falls within a range of predetermined width, using an active direction pass filter whose pass range is minimum in the front direction and increases as the angle to the left or right increases, and by reconstructing the waveform of the sound source, and
performs speech recognition on the sound-source-separated acoustic signals using the plurality of acoustic models, integrates the speech recognition results of the respective acoustic models with a selector, and determines the most reliable speech recognition result among these speech recognition results.

A robot audio-visual system comprising:
an auditory module that includes at least a pair of microphones for collecting external sound and that, based on the acoustic signals from these microphones, determines the direction of at least one speaker by pitch extraction and by sound source separation and localization based on harmonic structure, and extracts an auditory event;
a face module that includes a camera for imaging the area in front of the robot, identifies each speaker by face identification and localization from the image captured by the camera, and extracts a face event;
a stereo module that extracts and localizes a vertically long object based on parallax extracted from images captured by a stereo camera, and extracts a stereo event;
a motor control module that includes a drive motor for rotating the robot in the horizontal direction and extracts a motor event based on the rotational position of the drive motor;
an association module that, from the auditory event, the face event, the stereo event, and the motor event, determines the direction of each speaker based on the direction information of the sound source localization of the auditory event and the face localization of the face event, connects the events in the time direction using a Kalman filter with respect to this determination to generate an auditory stream, a face stream, and a stereo visual stream, and further associates these streams to generate an association stream; and
an attention control module that performs attention control based on these streams and performs motor drive control based on the result of the accompanying action planning,
wherein the auditory module
has a plurality of acoustic models created for each speaker and for each direction,
performs sound source separation, based on accurate sound source direction information from the association module, by collecting subbands whose interaural phase difference (IPD) or interaural intensity difference (IID) falls within a range of predetermined width, using an active direction pass filter whose pass range is minimum in the front direction and increases as the angle to the left or right increases, and by reconstructing the waveform of the sound source, and
performs speech recognition on the sound-source-separated acoustic signals using the plurality of acoustic models, integrates the speech recognition results of the respective acoustic models with a selector, and determines the most reliable speech recognition result among these speech recognition results.

The robot audio-visual system according to claim 4 or 5, wherein, when speech recognition by the auditory module fails, the attention control module directs the microphones and the camera toward the sound source of the acoustic signal so that the speech is collected again from the microphones, and speech recognition by the auditory module is performed again based on the acoustic signal localized and separated by the auditory module.

The robot audio-visual system according to claim 4 or 5, wherein, when integrating the speech recognition results, the selector calculates a cost function value based on the recognition result of the speech recognition and the direction of the speaker, and determines the speech recognition result with the largest cost function value to be the most reliable speech recognition result.

The robot audio-visual system according to any one of claims 4 to 7, further comprising a dialogue unit that outputs the speech recognition result determined by the auditory module to the outside.

The robot audio-visual system according to any one of claims 4 to 8, wherein the pass range of the active direction pass filter is controllable for each direction.

The robot audio-visual system according to claim 2 or 7, wherein the selector also considers a probability in face recognition when calculating the cost function value.