JP2002264051A - Robot audio-visual system

Info

Publication number: JP2002264051A
Application number: JP2001067847A
Authority: JP
Inventors: Kazuhiro Nakadai; 一博中臺; Kenichi Hidai; 健一日台; Hiroshi Okuno; 博奥乃; Hiroaki Kitano; 宏明北野
Original assignee: Japan Science and Technology Corp
Current assignee: Japan Science and Technology Agency
Priority date: 2001-03-09
Filing date: 2001-03-09
Publication date: 2002-09-18
Anticipated expiration: 2021-03-09
Also published as: JP3843741B2

Abstract

PROBLEM TO BE SOLVED: To provide a robot audio-visual system that reliably tracks an object by integrating visual and auditory information about the object. SOLUTION: An auditory module 20 extracts an auditory event 28 by identifying the sound source of each speaker through pitch extraction, sound source separation, and sound source localization applied to the acoustic signals from microphones. A visual module 30 extracts a visual event 39 by identifying each speaker through face identification and face localization in the camera image. A motor control module 40 extracts a motor event 49 from the rotational position of a drive motor. An association module 60 generates an auditory stream 65 and a visual stream 66 from the auditory, visual, and motor events, and generates an association stream 67 by associating these streams. The direction of each speaker is thereby determined from the sound source localization of the auditory events and the face localization of the visual events, and attention control for planning the drive motor control is performed on the basis of these streams.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an audio-visual system for a robot, in particular a humanoid or animal-type robot.

[0002]

2. Description of the Related Art

In recent years, active perception in vision and hearing has attracted attention for such humanoid and animal-type robots. Active perception means controlling, by a drive mechanism, the posture of the part that supports the perceptual devices responsible for robot vision, robot hearing, and the like (for example, the head), so that these devices follow the object to be perceived.

[0003] In active vision, at least a camera serving as the perceptual device is kept with its optical axis directed at the target by posture control through the drive mechanism, and focusing, zooming in, zooming out, and the like are performed automatically on the target so that the target is imaged by the camera. Various studies have been made on this.

[0004] In active hearing, by contrast, at least a microphone serving as the perceptual device is kept with its directivity pointed at the target by posture control through the drive mechanism, and sound from the target is collected by the microphone. Such active hearing is disclosed, for example, in Japanese Patent Application No. 2000-22677 (Robot Auditory System) by the present applicant, in which the direction of the sound source is determined with reference to visual information.

[0005]

PROBLEMS TO BE SOLVED BY THE INVENTION

Active vision and active hearing are closely related to the motor control module that changes the orientation (horizontal direction) of the robot. To apply active vision and active hearing to a specific target, the robot must be turned toward that target, that is, attention control must be performed. For the robot to identify each target speaker accurately from the surrounding situation, visual and auditory information must be integrated. However, in a situation where, for example, several people are talking with one another, identifying each person by real-time processing and performing active hearing has not been achieved.

[0006] SUMMARY OF THE INVENTION In view of the above, an object of the present invention is to provide a robot audio-visual system that integrates visual and auditory information about a target so that the target is tracked reliably.

[0007]

MEANS FOR SOLVING THE PROBLEMS

According to the present invention, the above object is achieved by a robot audio-visual system comprising: an auditory module including at least a pair of microphones that collect external sound; a visual module including a camera that images the area in front of the robot; a motor control module including a drive motor that rotates the robot horizontally; an association module that integrates events from the auditory module, the visual module, and the motor control module to generate streams; and an attention control module that performs attention control on the basis of the streams generated by the association module. The auditory module determines the direction of at least one speaker by pitch extraction, sound source separation, and sound source localization based on the acoustic signals from the microphones, and extracts the corresponding auditory event. The visual module identifies each speaker by face identification and face localization based on the image captured by the camera, and extracts the corresponding visual event. The motor control module extracts a motor event based on the rotational position of the drive motor. From the auditory, visual, and motor events, the association module determines the direction of each speaker on the basis of the direction information given by the sound source localization of the auditory events and the face localization of the visual events, thereby generating an auditory stream and a visual stream, and further associates these streams to generate an association stream. The attention control module performs, on the basis of these streams, attention control for planning the drive motor control of the motor control module.

[0008] According to the present invention, the above object is also achieved by an audio-visual system of a humanoid or animal-type robot comprising: an auditory module including at least a pair of microphones that collect external sound; a visual module including a camera that images the area in front of the robot; a motor control module including a drive motor that rotates the robot horizontally; an association module that integrates events from the auditory module, the visual module, and the motor control module to generate streams; and an attention control module that performs attention control on the basis of the streams generated by the association module. The auditory module determines the direction of at least one speaker by pitch extraction, sound source separation, and sound source localization based on the acoustic signals from the microphones, and extracts the corresponding auditory event. The visual module identifies each speaker by face identification and face localization based on the image captured by the camera, and extracts the corresponding visual event. The motor control module extracts a motor event based on the rotational position of the drive motor. From the auditory, visual, and motor events, the association module determines the direction of each speaker on the basis of the direction information given by the sound source localization of the auditory events and the face localization of the visual events, thereby generating an auditory stream and a visual stream, and further associates these streams to generate an association stream. The attention control module performs, on the basis of these streams, attention control for planning the drive motor control of the motor control module.

[0009] In the robot audio-visual system according to the present invention, preferably, the association module synchronizes the asynchronously generated auditory, visual, and motor events with one another when generating the auditory stream and the visual stream.

[0010] In the robot audio-visual system according to the present invention, preferably, the auditory module identifies each speaker by detecting the MFCCs (mel-frequency cepstral coefficients) of the speech in the acoustic signal, and the association module determines the speaker on the basis of the speaker identification of the auditory event and the speaker identification of the visual event, thereby selecting the auditory stream and the visual stream to which the auditory event and the visual event are to be connected.

[0011] In the robot audio-visual system according to the present invention, preferably, when a plurality of streams come close to one another, the association module refers to the temporal flow of the auditory events and visual events and selects the auditory stream and the visual stream to which those events are to be connected.

[0012] In the robot audio-visual system according to the present invention, preferably, the association module associates an auditory stream and a visual stream that are strongly related to each other to generate an association stream, and, when the relation between the auditory stream and the visual stream constituting the association stream weakens, releases the association and extinguishes the association stream.

[0013] With the above configuration, the auditory module performs pitch extraction using harmonic structure on the sound from external targets collected by the microphones, obtains the direction of each sound source, determines the direction of each individual speaker, and extracts the corresponding auditory event. The visual module identifies each speaker from the image captured by the camera through face identification by pattern recognition and face localization, and extracts a visual event for each speaker. The motor control module detects the direction of the robot from the rotational position of the drive motor that rotates the robot horizontally, and thereby extracts a motor event. Here, an event denotes that, at a given point in time, a sound or a face is detected, features such as pitch and direction are extracted, and speaker identification, face identification, or the like is performed, or denotes the state in which the drive motor is rotated. A stream denotes a sequence of events that is continuous in time.
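The distinction between an event and a stream can be pictured with a small data model. The following is only a minimal sketch: the class names `Event` and `Stream`, their fields, and the ordering checks are illustrative assumptions and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    """One observation at a single point in time (field names are hypothetical)."""
    kind: str                      # "auditory", "visual", or "motor"
    time: float                    # observation time in seconds
    direction: float               # horizontal direction in degrees, 0 = robot front
    speaker: Optional[str] = None  # result of speaker or face identification, if any


@dataclass
class Stream:
    """A temporally continuous run of events of one kind."""
    kind: str
    events: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        # A stream only accepts events of its own kind, in time order.
        if event.kind != self.kind:
            raise ValueError("event kind does not match stream kind")
        if self.events and event.time < self.events[-1].time:
            raise ValueError("events must be appended in time order")
        self.events.append(event)

    @property
    def direction(self) -> float:
        # The current direction of the stream is that of its latest event.
        return self.events[-1].direction
```

In this picture, an association stream would simply hold a reference to one auditory `Stream` and one visual `Stream` that are judged to belong to the same speaker.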

[0014] Based on the auditory, visual, and motor events extracted in this way, the association module determines the direction of each speaker from the direction information given by the sound source localization of the auditory events and the face localization of the visual events, thereby generating an auditory stream and a visual stream for each speaker, and further associates these streams to generate an association stream. In doing so, the association module determines the direction of each speaker from the auditory and visual direction information, and generates the association stream with reference to the determined directions. The attention control module then performs attention control on the basis of these streams, thereby planning the drive motor control of the motor control module. Attention means that the robot "pays attention" to a target speaker auditorily and/or visually, and attention control means changing the robot's orientation through the motor control module so that the robot attends to that speaker.

[0015] The attention control module then controls the drive motor of the motor control module according to this plan, turning the robot toward the target speaker. Because the robot directly faces the target speaker, the auditory module can accurately collect and localize the speaker's voice with the microphones in the frontal direction, where sensitivity is high, and the visual module can capture a good image of the speaker with the camera.

[0016] Accordingly, through the cooperation of the auditory module, the visual module, and the motor control module with the association module and the attention control module, the direction of each speaker is determined from the direction information of the sound source localization of the auditory events and the speaker localization of the visual events. The ambiguities inherent in the robot's hearing and vision thus complement each other, robustness improves, and each speaker can be perceived reliably even when there are several speakers. Moreover, even when either the auditory event or the visual event is missing, the attention control module can track the target speaker from the remaining event alone, so that the direction of the target is grasped accurately and the motor control module can be controlled.

[0017] When the association module synchronizes the asynchronously generated auditory, visual, and motor events with one another in generating the auditory stream and the visual stream, the different generation periods and delay times of these events are absorbed in the association module, and the direction of each speaker is determined more accurately from the direction information of the sound source localization of the auditory events and the speaker localization of the visual events. Consequently, when an auditory stream composed of auditory events and a visual stream composed of visual events lie close to each other, they can be associated with each other to generate a higher-order association stream.
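One way to picture how differing generation periods and delays could be absorbed is to evaluate every event series at a common, slightly delayed time. The sketch below is an assumption made for illustration only: the function names, the `(timestamp, direction)` sample format, and the fixed `delay` buffer are hypothetical, and the specification does not prescribe this particular scheme.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, direction in degrees)


def latest_before(samples: List[Sample], t: float) -> Optional[Sample]:
    """Return the most recent sample with timestamp <= t, or None if there is none.

    `samples` must be sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1] if i > 0 else None


def synchronize(series: Dict[str, List[Sample]], now: float,
                delay: float) -> Dict[str, Optional[Sample]]:
    """Align all event series at the common time `now - delay`.

    `delay` is a hypothetical buffer chosen long enough to cover the slowest module.
    """
    t_sync = now - delay
    return {name: latest_before(samples, t_sync)
            for name, samples in series.items()}
```

With this approach a fast module (many samples) and a slow module (few samples) are both read out at the same instant, so their directions can be compared directly when streams are formed.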

[0018] When the auditory module identifies each speaker by detecting the MFCCs of the speech in the acoustic signal, and the association module determines the speaker from the speaker identification of the auditory event and of the visual event and thereby selects the auditory stream and the visual stream to which the events are to be connected, speaker identification can be performed from the auditory events by means of the MFCCs of the speech, and individual speakers are identified by both auditory and visual events. Accordingly, by connecting each auditory event and visual event to the auditory stream and visual stream of the same speaker, each speaker can be determined more accurately and the auditory and visual streams generated even when, for example, several speakers are present. Even if either the auditory events or the visual events are interrupted partway, identification of the speaker can continue with the other. As a result, even when the voices of several speakers are detected from the same direction, each speaker is identified and higher-order integration of hearing and vision is performed, so that each speaker can be tracked more accurately.

[0019] When the association module, in the case where a plurality of streams come close to one another, refers to the temporal flow of the auditory and visual events and selects the auditory stream and visual stream to which those events are to be connected, the auditory and visual streams can be generated more accurately even when several speakers are close together and their auditory and visual streams approach and cross one another: the range of movement of each speaker is predicted, and the auditory or visual stream is maintained as long as the events fall within that range. The ambiguities of the auditory and visual streams thus complement each other, robustness improves, and several speakers can be tracked reliably.

[0020] When the association module associates a strongly related auditory stream and visual stream to generate an association stream, and releases the association and extinguishes the association stream when the relation between its constituent auditory and visual streams weakens, an association stream can be generated accurately for each speaker, so that the ambiguity of the auditory and visual streams is eliminated as far as possible and the speaker is determined accurately. Furthermore, by appropriately selecting the predetermined angle in this case, the movement of a speaker can be captured reliably even while the speaker is moving; in effect, the speaker's movement is predicted and the speaker is determined.
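The creation and extinction of an association stream can be sketched as a small state machine driven by the angular difference between an auditory stream and a visual stream. This is a hedged illustration only: the class name `AssociationTracker`, the use of consecutive-observation counters as the measure of how "strong" the relation is, and all default values (including the 10-degree threshold standing in for the predetermined angle) are assumptions, not values stated in this section.

```python
class AssociationTracker:
    """Tracks whether one auditory stream and one visual stream are associated.

    All thresholds are hypothetical illustration values.
    """

    def __init__(self, threshold_deg: float = 10.0,
                 bind_after: int = 3, release_after: int = 5) -> None:
        self.threshold_deg = threshold_deg  # stands in for the "predetermined angle"
        self.bind_after = bind_after        # close observations needed to associate
        self.release_after = release_after  # far observations needed to dissociate
        self.associated = False
        self._close = 0
        self._far = 0

    def update(self, auditory_deg: float, visual_deg: float) -> bool:
        """Feed one synchronized pair of directions; return the association state."""
        if abs(auditory_deg - visual_deg) <= self.threshold_deg:
            self._close += 1
            self._far = 0
        else:
            self._far += 1
            self._close = 0
        if not self.associated and self._close >= self.bind_after:
            self.associated = True   # relation is strong: create the association
        elif self.associated and self._far >= self.release_after:
            self.associated = False  # relation has weakened: extinguish it
        return self.associated
```

Requiring several consecutive observations before changing state keeps a single noisy localization from creating or destroying an association, which is one plausible reading of "strongly related" and "weakened".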

[0021]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Hereinafter, the present invention will be described in detail on the basis of the embodiment shown in the drawings. FIGS. 1 to 4 show the overall configuration of an experimental humanoid robot equipped with an embodiment of the robot audio-visual system according to the present invention. In FIG. 1, the humanoid robot 10 is configured as a robot with 4 DOF (degrees of freedom) and includes a base 11, a body 12 supported on the base 11 so as to be rotatable about one axis (the vertical axis), and a head 13 supported on the body 12 so as to be swingable about three axes (the vertical axis, the left-right horizontal axis, and the front-rear horizontal axis).

[0022] The base 11 may be fixed in place or may be operable as legs. The base 11 may also be mounted on a movable carriage or the like. The body 12 is supported so as to be rotatable about the vertical axis relative to the base 11, as indicated by arrow A in FIG. 1, is rotationally driven by drive means (not shown), and, in the illustrated case, is covered with a soundproof exterior.

[0023] The head 13 is supported on the body 12 via a connecting member 13a. The head 13 is supported on the connecting member 13a so as to be swingable about the front-rear horizontal axis, as indicated by arrow B in FIG. 1, and about the left-right horizontal axis, as indicated by arrow C in FIG. 2. The connecting member 13a is in turn supported on the body 12 so as to be swingable about the front-rear horizontal axis, as indicated by arrow D in FIG. 1. These parts are rotationally driven in the directions of arrows A, B, C, and D by respective drive means (not shown).

[0024] As shown in FIG. 3, the head 13 is entirely covered with a soundproof exterior 14, and is provided on its front side with a camera 15 as the visual device responsible for robot vision and on both sides with a pair of microphones 16 (16a, 16b) as the auditory devices responsible for robot hearing.

[0025] The exterior 14 is made of a sound-absorbing synthetic resin such as urethane resin, and is configured to insulate the interior of the head 13 from sound by sealing it almost completely. The exterior of the body 12 is likewise made of a sound-absorbing synthetic resin. The camera 15 has a known configuration; for example, a commercially available camera with 3 DOF (degrees of freedom) of pan, tilt, and zoom can be used.

[0026] The microphones 16 are attached to the side faces of the head 13 so as to have directivity toward the front. As shown in FIGS. 1 and 2, the left and right microphones 16a and 16b are mounted inside forward-facing steps 14a and 14b on both sides of the exterior 14. They collect sound from the front through holes provided in the steps 14a and 14b, and are sound-insulated by suitable means so that they do not pick up sound from inside the exterior 14. The microphones 16a and 16b are thus configured as so-called binaural microphones. Near the mounting positions of the microphones 16a and 16b, the exterior 14 may be formed in the shape of a human outer ear.

[0027] FIG. 4 shows the electrical configuration of the robot audio-visual system including the microphones 16 and the camera 15. In FIG. 4, the audio-visual system 17 is configured for a party reception and companion robot, and is composed of an auditory module 20, a visual module 30, a motor control module 40, a dialogue module 50, and an association module 60. The description below also refers to FIGS. 5 to 9, which show parts of FIG. 4 enlarged. For convenience, the auditory module 20 is shown enlarged in FIG. 5 as block 1, the visual module 30 in FIG. 6 as block 2, the motor control module 40 in FIG. 7 as block 3, the dialogue module 50 in FIG. 8 as block 4, and the association module 60 in FIG. 9 as block 5. The association module 60 (block 5, FIG. 9) is implemented as a server, while the other modules, namely the auditory module 20 (block 1, FIG. 5), the visual module 30 (block 2, FIG. 6), the motor control module 40 (block 3, FIG. 7), and the dialogue module 50 (block 4, FIG. 8), are each implemented as clients and operate asynchronously with one another.

[0028] The server and the clients are each implemented, for example, on a personal computer, and are connected to one another over a LAN via a network 70 such as 100Base-T using, for example, the TCP/IP protocol. Each of the modules 20, 30, 40, 50, and 60 is organized hierarchically and consists, from the bottom up, of a device layer, a process layer, a feature layer, and an event layer.

[0029] As shown in FIG. 5, the auditory module 20 is composed of the microphones 16 as the device layer; a peak extraction unit 21, a sound source localization unit 22, a sound source separation unit 23, and a speaker identification unit 23a as the process layer; a pitch 24 and a horizontal direction 25 as the feature layer (data); and an auditory event generation unit 26 and a viewer 27 as the event layer.

[0030] The auditory module 20 operates as shown in FIG. 10. As indicated by reference sign X1, the auditory module 20 takes the acoustic signals from the microphones 16, sampled at, for example, 48 kHz and 16 bits, performs frequency analysis on them by FFT (fast Fourier transform) as indicated by reference sign X2, and generates a spectrum for each of the left and right channels as indicated by reference sign X3. The peak extraction unit 21 then extracts a series of peaks for each of the left and right channels and pairs peaks that are identical or similar in the two channels. Peak extraction uses a band-pass filter that passes only data satisfying three conditions: the power is at or above a threshold, the point is a local peak, and the frequency lies between, for example, 90 Hz and 3 kHz, so as to cut low-frequency noise and the high-frequency band where power is small. The threshold is defined as the measured ambient background noise plus a sensitivity parameter, for example 10 dB.
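The three peak conditions described above (power at or above a threshold, local maximum, and frequency within the pass band) can be written down directly. The following is a minimal sketch that assumes the spectrum of one channel has already been computed and is given as parallel lists of bin frequencies and power in dB; the function name `extract_peaks` and its parameters are illustrative, and the default values are the examples mentioned in the text.

```python
from typing import List, Tuple


def extract_peaks(freqs_hz: List[float], power_db: List[float],
                  noise_db: float, sensitivity_db: float = 10.0,
                  band_hz: Tuple[float, float] = (90.0, 3000.0)) -> List[int]:
    """Return the indices of spectral bins that qualify as peaks."""
    # Threshold = measured background noise + sensitivity parameter.
    threshold = noise_db + sensitivity_db
    lo, hi = band_hz
    peaks = []
    for i in range(1, len(power_db) - 1):
        in_band = lo <= freqs_hz[i] <= hi
        # Local peak: strictly larger than both neighbouring bins.
        is_local_peak = power_db[i - 1] < power_db[i] > power_db[i + 1]
        if in_band and is_local_peak and power_db[i] >= threshold:
            peaks.append(i)
    return peaks
```

Running this once per channel yields two peak lists, which can then be paired by matching identical or similar frequencies between the left and right channels.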

[0031] Exploiting the fact that each peak has a harmonic structure, the auditory module 20 then extracts local peaks having a harmonic structure in order from the lowest frequency and treats each set of extracted peaks as one sound. The sound source separation unit 23 applies an inverse FFT (inverse fast Fourier transform), as indicated by reference sign X4, and thereby separates the acoustic signal of each sound source from the mixture of sounds, as indicated by reference sign X5.
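Grouping peaks by harmonic structure, starting from the lowest frequency, can be sketched as a greedy procedure: take the lowest remaining peak as a fundamental, collect every peak lying near an integer multiple of it, and repeat on what is left. The function name `group_harmonics` and the relative `tolerance` are assumptions made for illustration; the text does not specify how closely a peak must match a multiple.

```python
from typing import List


def group_harmonics(peak_freqs_hz: List[float],
                    tolerance: float = 0.03) -> List[List[float]]:
    """Group peak frequencies into harmonic sets, lowest fundamental first.

    A peak joins a group when it lies within `tolerance * f0` of an integer
    multiple of the group's fundamental f0 (hypothetical matching rule).
    """
    remaining = sorted(peak_freqs_hz)
    groups = []
    while remaining:
        f0 = remaining[0]  # lowest remaining peak is taken as the fundamental
        group, rest = [], []
        for f in remaining:
            n = round(f / f0)  # nearest harmonic number
            if n >= 1 and abs(f - n * f0) <= tolerance * f0:
                group.append(f)
            else:
                rest.append(f)
        groups.append(group)
        remaining = rest
    return groups
```

Each resulting group corresponds to one candidate sound. Note that this greedy sketch assigns a peak shared by two fundamentals (for example 300 Hz for fundamentals of 100 Hz and 150 Hz) to the lower one only.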

[0032] For the acoustic signal of each sound source, the sound source localization unit 22 of the auditory module 20 then selects the acoustic signals of the same frequency from the left and right channels, as indicated by reference sign X6, and obtains the IPD (interaural phase difference) and IID (interaural intensity difference), for example, every 5 degrees. Using so-called auditory epipolar geometry, the sound source localization unit 22 generates IPD hypotheses P_h by the hypothetical reasoning indicated by reference sign X7 over a range of ±90 degrees, with the front of the robot 10 taken as 0 degrees, and calculates the distance d(θ) between the separated sound and each hypothesis according to

(Equation 1)

Here, n_{f<1.5 kHz} denotes the harmonics whose frequency is 1.5 kHz or less. This limit of 1.5 kHz was adopted in the present experiment because, given the baseline between the left and right microphones 16, the IPD is effective for frequencies below 1.2 to 1.5 kHz.

[0033] Like the IPD, the IID is obtained from the power difference between the left and right channels for each overtone of the separated sound. For the IID, however, hypothetical reasoning is not used; instead, the discriminant function

(Equation 2)

is used to determine whether the sound source lies to the left or to the right. That is, with I_s(f) denoting the IID of each overtone of frequency f, the sound source lies to the left of the robot if I is positive, to the right if I is negative, and to the front if I is approximately 0. Generating hypotheses for the IID would require an enormous amount of computation that accounts for the head shape of the robot 10, so with real-time processing in mind, hypothetical reasoning of the kind used for the IPD is not performed. In this way, the IPD and the IID are matched, as indicated by symbol X8.

[0034] The sound source localization unit 22 of the auditory module 20 then calculates the IPD certainty factor BF_IPD(θ) from the distance d(θ), as indicated by symbol X9, using the probability density function

(Equation 3)

Here, m and s are the mean and variance of d(θ), respectively, and n is the number of values of d. The IID certainty factor BF_IID(θ) is 0.35 when I is positive and 0.65 when I is negative for 30 degrees < θ ≤ 90 degrees; 0.5 when I is positive and 0.5 when I is negative for −30 degrees < θ ≤ 30 degrees; and 0.65 when I is positive and 0.35 when I is negative for −90 degrees < θ ≤ −30 degrees.
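The piecewise values of the IID certainty factor given above amount to a small lookup, which the following Python sketch makes explicit. The function name and the encoding of the sign of I as +1 or −1 are assumptions made for illustration; the angle ranges and the values 0.35, 0.5, and 0.65 are those stated in the text.

```python
def iid_certainty(theta_deg, i_sign):
    """Return BF_IID(theta) for a direction in degrees.

    i_sign is +1 when I is positive and -1 when I is negative
    (hypothetical encoding). The stated ranges exclude exactly -90
    degrees, so that value is rejected here.
    """
    if i_sign not in (1, -1):
        raise ValueError("i_sign must be +1 or -1")
    if not -90 < theta_deg <= 90:
        raise ValueError("theta_deg must satisfy -90 < theta <= 90")
    if 30 < theta_deg <= 90:
        return 0.35 if i_sign > 0 else 0.65
    if -30 < theta_deg <= 30:
        return 0.5  # the IID does not discriminate near the front
    # Remaining case: -90 < theta <= -30
    return 0.65 if i_sign > 0 else 0.35


for theta in (-60, 0, 60):
    print(theta, iid_certainty(theta, +1), iid_certainty(theta, -1))
```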

[0035] The IPD certainty factor BF_IPD(θ) and the IID certainty factor BF_IID(θ) obtained in this way are integrated, as indicated by symbol X10, by the Dempster-Shafer theory expressed by

(Equation 4)

to generate the certainty factor BF_IPD+IID(θ). The speaker identification unit 23a obtains, for example, MFCCs (mel-frequency cepstral coefficients) from the acoustic signal from the microphone 16 and matches them against the MFCCs of speakers registered in advance to identify the speaker. The auditory event generation unit 26 of the auditory module 20 then generates an auditory event 28 from a list of the 20 most likely sound source directions (θ) with their certainty factors BF_IPD+IID(θ), the pitch, and the speaker identification.
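As a rough illustration of the contents of an auditory event (the 20 most likely directions with their certainty factors, the pitch, and the speaker identification), the following Python sketch ranks candidate directions and keeps the top 20. The class name, field names, and the dictionary input are hypothetical; the text does not specify a data layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class AuditoryEvent:
    # (direction in degrees, integrated certainty factor), most likely first
    directions: List[Tuple[float, float]]
    pitch_hz: float
    speaker_id: Optional[str]  # None if the speaker could not be identified


def make_auditory_event(certainty_by_theta: Dict[float, float],
                        pitch_hz: float,
                        speaker_id: Optional[str],
                        top_n: int = 20) -> AuditoryEvent:
    """Keep the top_n directions in descending order of certainty."""
    ranked = sorted(certainty_by_theta.items(), key=lambda kv: kv[1], reverse=True)
    return AuditoryEvent(ranked[:top_n], pitch_hz, speaker_id)


# Made-up certainty values on a 5-degree grid, peaking at +20 degrees.
beliefs = {theta: 1 - abs(theta - 20) / 200 for theta in range(-90, 91, 5)}
event = make_auditory_event(beliefs, pitch_hz=120.0, speaker_id=None)
print(len(event.directions), event.directions[0])
```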

[0036] In this way, based on the acoustic signal from the microphone 16, the auditory module 20 identifies at least one speaker (speaker identification) from pitch extraction, sound source separation and localization, and the MFCCs, extracts the corresponding auditory event, and transmits it to the association module 60 via the network 70. The above processing in the auditory module 20 is performed every 40 ms.

[0037] The viewer 27 displays the auditory events 28 generated in this way on the client's screen. Specifically, the left window shows the power spectrum of the auditory event 28 together with the extracted peaks, and the right window shows the auditory events plotted with relative azimuth on the vertical axis and pitch (frequency) on the horizontal axis. Each auditory event is drawn as a circle whose diameter represents the certainty factor of the sound source localization.

[0038] As shown in FIG. 6, the visual module 30 comprises the camera 15 as the device layer; a face finding unit 31, a face identification unit 32, and a face localization unit 33 as the process layer; a face ID 34 and a face direction 35 as the feature layer (data); and a visual event generation unit 36 and a viewer 37 as the event layer.

[0039] Based on the image signal from the camera, the visual module 30 detects each speaker's face with the face finding unit 31, for example by skin-color extraction; searches the pre-registered face database 38 with the face identification unit 32 and, if a matching face is found, determines its face ID 34 to identify the face; and determines (localizes) the face direction 35 with the face localization unit 33. When the face finding unit 31 finds a plurality of faces in the image signal, the visual module 30 performs this processing, namely identification, localization, and tracking, for each face. Because the size, direction, and brightness of the detected faces often change, the face finding unit 31 performs face region detection and, by combining skin-color extraction with pattern matching based on correlation operations, can accurately detect a plurality of faces within 200 ms.

[0040] The face identification unit 32 projects each face region image detected by the face finding unit 31 into a discriminant space and calculates its distance d from the face data registered in advance in the face database 38. Because this distance d depends on the number of registered faces (L), it is converted by

(Equation 5)

into a certainty factor Pv that does not depend on that parameter. The discriminant matrix that forms the basis of the discriminant space can be updated by the known online LDA with less computation than ordinary LDA, so face data can be registered in real time.

[0041] The face localization unit 33 converts the face position in the two-dimensional image plane into three-dimensional space. Assuming that the face occupies w × w pixels at position (x, y) in an image plane whose width and height are X and Y, respectively, the face position in three-dimensional space is obtained as a set of the azimuth θ, the elevation φ, and the distance r given by the following equations.

(Equation 6)

(Equation 7)

(Equation 8)

Here, C₁ and C₂ are constants defined by the search image size (X, Y), the angle of view of the camera, and the actual size of the face.

[0042] For each face, the visual module 30 then generates a visual event 39 from the face ID (name) 34 and the face direction 35 with the visual event generation unit 36. More specifically, the visual event 39 consists, for each face, of the five most likely face IDs (names) 34 with their certainty factors and the position (distance r, horizontal angle θ, and vertical angle φ).

[0043] The viewer 37 displays the visual events on the client's screen. Specifically, it displays the image from the camera 15, the face IDs of the extracted faces together with their face identification certainty factors, and a list of positions resulting from localization. In the image from the camera 15, each face that has been found and identified is shown enclosed in a rectangular frame. When a plurality of faces are found, a rectangular frame indicating the identification and a list giving the localization result are displayed for each face.

[0044] As shown in FIG. 7, the motor control module 40 comprises a motor 41 and a potentiometer 42 as the device layer; a PWM control circuit 43, an AD conversion circuit 44, and a motor control unit 45 as the process layer; a robot direction 46 as the feature layer; a motor event generation unit 47 as the event layer; and a viewer 48.

[0045] The motor control module 40 drives and controls the motor 41 through the PWM control circuit 43 by means of the motor control unit 45, based on commands from the attention control module 64 (described later). It also detects the rotational position of the motor 41 with the potentiometer 42, extracts the robot direction 46 with the motor control unit 45 via the AD conversion circuit 44, and generates a motor event 49 consisting of motor direction information with the motor event generation unit 47.

[0046] The viewer 48 displays the motor events three-dimensionally on the client's screen. Specifically, it displays the orientation and movement speed of the robot given by the motor event 49 in three dimensions in real time, using a three-dimensional viewer implemented with, for example, OpenGL.

[0047] As shown in FIG. 8, the dialogue module 50 comprises a loudspeaker 51 and the microphone 16 as the device layer, and a speech synthesis circuit 52, a dialogue control circuit 53, a self-voice suppression circuit 54, and a speech recognition circuit 55 as the process layer.

[0048] The dialogue module 50 has its dialogue control circuit 53 controlled by the association module 60 described later, and drives the loudspeaker 51 through the speech synthesis circuit 52 to utter predetermined speech to the target speaker. It also removes the sound of the loudspeaker 51 from the acoustic signal from the microphone 16 with the self-voice suppression circuit 54, and then recognizes the target speaker's speech with the speech recognition circuit 55. The dialogue module 50 does not have a feature layer or an event layer in its hierarchy.

[0049] Here, in the case of a party reception robot, for example, the dialogue control circuit 53 gives top priority to continuing the current attention, whereas in the case of a party robot, attention is directed to the most recently associated stream.

[0050] As shown in FIG. 9, the association module 60 is positioned hierarchically above the auditory module 20, the visual module 30, the motor control module 40, and the dialogue module 50 described above, and constitutes a stream layer above the event layers of the modules 20, 30, 40, and 50. Specifically, the association module 60 comprises a synchronization circuit 62 that synchronizes the asynchronous events 61a from the auditory module 20, the visual module 30, and the motor control module 40, namely the auditory events 28, visual events 39, and motor events 49, into synchronous events 61b; a stream generation unit 63 that associates these synchronous events 61b with one another to generate auditory streams 65, visual streams 66, and association streams 67; an attention control module 64; and a viewer 68.

[0051] The synchronization circuit 62 synchronizes the auditory events 28 from the auditory module 20, the visual events 39 from the visual module 30, and the motor events 49 from the motor control module 40 to generate synchronous auditory events, synchronous visual events, and synchronous motor events. In doing so, the coordinate systems of the auditory events 28 and the visual events 39 are converted into an absolute coordinate system by means of the synchronous motor events.

[0052] The delay from the actual observation of each event until its arrival at the association module 60 via the network 70 is, for example, 40 ms for the auditory event 28, 200 ms for the visual event 39, and 100 ms for the motor event 49. These differences arise because the delay in the network 70 is 10 to 200 ms and because the arrival periods also differ. Therefore, to synchronize the events, the auditory events 28, visual events 39, and motor events 49 from the auditory module 20, the visual module 30, and the motor control module 40 each carry time stamp information indicating the actual observation time, and are temporarily stored in a short-term memory circuit (not shown), for example for 2 seconds only.

[0053] Taking the above delays into account, the synchronization circuit 62 retrieves the events stored in the short-term memory circuit by a synchronization process that runs with a delay of 500 ms relative to the actual observation time. The response time of the synchronization circuit 62 is therefore 500 ms. The synchronization process operates with a period of, for example, 100 ms. Because the events arrive at the association module 60 asynchronously with respect to one another, there is not necessarily an event whose observation time coincides with the time at which synchronization is performed. The synchronization process therefore performs linear interpolation between the events that occurred before and after the synchronization time.
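The interpolation step of the synchronization process can be sketched in Python as follows. This is only an illustration under stated assumptions: events are represented as hypothetical (timestamp, direction) pairs sorted by observation time, only a single direction value is interpolated, and angle wrap-around is ignored. The 500 ms delay is the value given in the text.

```python
def sync_time(now_s, delay_s=0.5):
    """Observation time to synchronize at: 500 ms behind the current time."""
    return now_s - delay_s


def interpolate_direction(events, t_sync):
    """Linearly interpolate the direction at t_sync.

    events is a list of (timestamp_s, direction_deg) pairs sorted by
    timestamp (hypothetical layout). Returns None when t_sync is not
    bracketed by two stored events.
    """
    before = after = None
    for t, direction in events:
        if t <= t_sync:
            before = (t, direction)
        else:
            after = (t, direction)
            break
    if before is not None and before[0] == t_sync:
        return before[1]  # an event was observed exactly at t_sync
    if before is None or after is None:
        return None
    (t0, d0), (t1, d1) = before, after
    weight = (t_sync - t0) / (t1 - t0)
    return d0 + weight * (d1 - d0)


visual = [(0.00, 10.0), (0.20, 20.0), (0.40, 30.0)]  # events every 200 ms
print(interpolate_direction(visual, 0.10))
```

Because the stored events are kept for about 2 seconds and the process looks 500 ms into the past, a bracketing pair of events is normally available for each module even when their arrival periods differ.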

[0054] As shown in FIG. 11, the stream generation unit 63 reads the auditory events S and the visual events V from the short-term memory circuit M and generates the streams 65, 66, and 67 according to the following rules.
1. An auditory event 28 is connected, as indicated by symbol Y1, to the nearest auditory stream 65 that has an equal or harmonically related pitch and whose direction is within ±10 degrees. The value of ±10 degrees was chosen in view of the accuracy of auditory epipolar geometry.
2. A visual event 39 is connected, as indicated by symbol Y2, to the nearest visual stream 66 that has the same face ID 34 and lies within 40 cm. The value of 40 cm was chosen on the assumption that a person does not move faster than 4 m per second.
3. If a search of all streams finds an event for which no connectable stream 65, 66 exists, that event 28, 39 forms a new stream 65, 66, as indicated by symbol Y3.
4. An existing stream 65, 66 to which no event 28, 39 is connected persists for up to 500 ms, as indicated by symbol Y4a, but disappears, as indicated by symbol Y4b, if no event is connected after that.
5. If an auditory stream 65 and a visual stream 66 remain within ±10 degrees of each other for 500 ms or more out of 1 second, they are regarded as originating from the same speaker and are associated with each other to generate an association stream 67, as indicated by symbol Y5.
6. If no auditory event 28 or visual event 39 is connected to an association stream 67 for 3 seconds or more, the association is released and only the existing auditory stream 65 or visual stream 66 persists.
7. If the direction difference between the auditory stream 65 and the visual stream 66 of an association stream 67 is ±30 degrees or more for 3 seconds, the association is released and the streams revert to an individual auditory stream 65 and visual stream 66.
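Two of the stream generation rules listed above, connecting an auditory event to the nearest compatible auditory stream (rule 1) and deciding when an auditory stream and a visual stream should be associated (rule 5), can be sketched in Python as follows. The dictionary layout, the function names, the pitch tolerance, the maximum harmonic number, and the reading of "500 ms or more out of 1 second" as cumulative time sampled every 100 ms are all assumptions made for illustration; the ±10 degree, 500 ms, and 1 second values come from the text.

```python
def pitches_related(p1_hz, p2_hz, tol=0.03, max_harmonic=4):
    """True if the pitches are equal or in an integer (overtone) ratio.

    tol and max_harmonic are hypothetical tuning parameters.
    """
    lo, hi = sorted((p1_hz, p2_hz))
    ratio = hi / lo
    return any(abs(ratio - k) <= tol * k for k in range(1, max_harmonic + 1))


def nearest_auditory_stream(event, streams, max_diff_deg=10.0):
    """Rule 1: nearest stream within +/-10 degrees with a related pitch.

    event and each stream are dicts with 'direction' (degrees) and
    'pitch' (Hz). Returns None if no stream qualifies, in which case
    the event would start a new stream (rule 3).
    """
    candidates = [s for s in streams
                  if abs(s["direction"] - event["direction"]) <= max_diff_deg
                  and pitches_related(s["pitch"], event["pitch"])]
    if not candidates:
        return None
    return min(candidates, key=lambda s: abs(s["direction"] - event["direction"]))


def should_associate(direction_diffs_deg, step_s=0.1, window_s=1.0,
                     min_close_s=0.5, max_diff_deg=10.0):
    """Rule 5: streams stayed within +/-10 degrees for >= 500 ms of the last 1 s.

    direction_diffs_deg holds the auditory-minus-visual direction
    difference at each synchronization step, oldest first.
    """
    n = int(round(window_s / step_s))
    recent = direction_diffs_deg[-n:]
    close_time = sum(step_s for d in recent if abs(d) <= max_diff_deg)
    return close_time >= min_close_s - 1e-9


streams = [{"id": "a", "direction": 0.0, "pitch": 200.0},
           {"id": "b", "direction": 8.0, "pitch": 150.0}]
event = {"direction": 5.0, "pitch": 100.0}
print(nearest_auditory_stream(event, streams)["id"])
print(should_associate([2.0] * 5 + [40.0] * 5))
```

In the example, stream "b" is closer in direction but its pitch is not in an overtone relation with the event, so the event is attached to stream "a".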

[0055] In this way, based on the synchronous auditory events and synchronous visual events from the synchronization circuit 62, the stream generation unit 63 connects events in consideration of their temporal continuity, attaching each synchronous auditory event and synchronous visual event to the auditory stream 65 or visual stream 66 of the same speaker and thereby generating the auditory streams 65 and visual streams 66. It also associates an auditory stream 65 and a visual stream 66 that are strongly linked to generate an association stream 67, and conversely releases the association when the link between the auditory stream 65 and the visual stream 66 that make up an association stream 67 weakens. Consequently, even when the target speaker is moving, the speaker's movement can be predicted and tracked by generating the streams 65, 66, and 67 as described above, as long as the speaker stays within the angular range predicted for that movement.

[0056] The attention control module 64 performs attention control for planning the drive motor control of the motor control module 40, referring preferentially to the association streams 67, the auditory streams 65, and the visual streams 66, in that order. Based on the states of the auditory streams 65 and visual streams 66 and on whether an association stream 67 exists, the attention control module 64 plans the motion of the robot 10 and, if the drive motor 41 needs to be operated, transmits a motor event as a motion command to the motor control module 40 via the network 70.

[0057] Attention control in the attention control module 64 is based on continuity and triggers: continuity tries to maintain the same state, and a trigger tries to track the most interesting target. Attention control therefore takes the following two points into account.
1. The existence of an association stream indicates that a person speaking face to face with the robot 10 is present now or was present in the recent past, so attention must be directed to, and tracking performed on, such a person with high priority.
2. Because the microphone 16 is omnidirectional, it has no limited detection range like the viewing angle of a camera and can obtain auditory information over a wide range, so auditory streams should be given higher priority than visual streams.
On this basis, the stream to which attention is directed is selected, and tracking is performed, according to the following principles.
1. Tracking of an association stream has the highest priority.
2. If no association stream exists, tracking of an auditory stream has priority.
3. If neither an association stream nor an auditory stream exists, tracking of a visual stream has priority.
In this way, the attention control module 64 performs attention control, plans the control of the drive motor 41 of the motor control module 40, generates a motor command 66 based on this plan, and transmits it to the motor control module 40 via the network 70. In the motor control module 40, the motor control unit 45 then performs PWM control based on this motor command 66 and rotates the drive motor 41 to turn the robot 10 in the specified direction.
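The three tracking priorities above (association stream first, then auditory stream, then visual stream) reduce to a simple selection function, sketched here in Python. The function name and list-based interface are hypothetical, and the way continuity is modeled, by keeping the current target whenever it is still in the highest-priority non-empty group, is an assumption rather than something the text specifies.

```python
def select_attention_target(association_streams, auditory_streams,
                            visual_streams, current=None):
    """Pick the stream to track according to the stated priority order.

    Each argument is a list of stream identifiers. current is the
    stream being tracked now, if any; it is kept when it still belongs
    to the highest-priority non-empty group (assumed continuity rule).
    """
    for group in (association_streams, auditory_streams, visual_streams):
        if group:
            if current is not None and current in group:
                return current
            return group[0]
    return None  # nothing to attend to


print(select_attention_target([], ["aud1"], ["vis1"]))
print(select_attention_target(["assoc1", "assoc2"], ["aud1"], [], current="assoc2"))
```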

[0058] The viewer 68 displays the streams generated in this way on the server's screen, specifically as a radar chart and a stream chart. The radar chart shows the state of the streams at that instant, more specifically the viewing angle of the camera and the sound source directions, and the stream chart shows the association streams (thick lines) and the auditory and visual streams (thin lines).

[0059] The humanoid robot 10 according to the embodiment of the present invention is configured as described above and, as a party reception robot, operates toward a target speaker as follows, with reference to FIG. 12. First, as shown in FIG. 12(A), the robot 10 is placed in front of the entrance to the party venue. Then, as shown in FIG. 12(B), a party participant P approaches the robot 10, but the robot 10 has not yet recognized the participant P. When the participant P speaks to the robot 10, saying for example "Hello," the microphone 16 of the robot 10 picks up the participant P's voice, and the auditory module 20 generates an auditory event 28 with a sound source direction and transmits it to the association module 60 via the network 70.

[0060] The association module 60 then generates an auditory stream 65 based on this auditory event 28. At this point the visual module 30 does not generate a visual event 39, because the participant P is not within the field of view of the camera 15. The association module 60 therefore generates the auditory stream 65 from the auditory event 28 alone, and the attention control module 64, triggered by this auditory stream 65, performs attention control that turns the robot 10 toward the participant P.

[0061] In this way, as shown in FIG. 12(C), the robot 10 turns toward the participant P, performing so-called tracking by voice. The visual module 30 then captures an image of the participant P's face with the camera 15, generates a visual event 39, searches the face database 38 for the participant P's face to perform face identification, and transmits the resulting face ID 34 and the image to the association module 60 via the network 70. If the participant P's face is not registered in the face database 38, the visual module 30 notifies the association module 60 of this via the network 70.

[0062] At this point the robot 10 has generated an association stream 67 from the auditory event 28 and the visual event 39. Because of this association stream 67, the attention control module 64 does not change its attention control, so the robot 10 keeps facing the participant P. Even if the participant P moves, the robot 10 tracks the participant P by controlling the motor control module 40 according to the association stream 67, so that the camera 15 of the visual module 30 can continue to capture the participant P.

[0063] The association module 60 then provides input to the speech recognition circuit 55 of the auditory module 20, and the speech recognition circuit 55 passes its recognition result to the dialogue control circuit 53. The dialogue control circuit 53 then performs speech synthesis and speaks through the loudspeaker 51. At this time, the self-voice suppression circuit 54 reduces the sound of the loudspeaker 51 in the acoustic signal from the microphone 16 that is supplied to the speech recognition circuit 55, so the robot 10 can ignore its own utterances and recognize the other party's voice more accurately.

[0064] The utterance produced by speech synthesis differs depending on whether the participant P's face is registered in the face database 38. If it is registered, the association module 60 controls the dialogue module 50 based on the face ID 34 from the visual module 30, and the robot asks the participant P by speech synthesis, "Hello. Are you XXX?" When the participant P answers "Yes," the dialogue module 50 recognizes "Yes" with the speech recognition circuit 55 based on the acoustic signal from the microphone 16, performs speech synthesis with the dialogue control circuit 53, and says through the loudspeaker 51, "Welcome, XXX. Please enter the room."

[0065] If the participant P's face is not registered in the face database 38, the association module 60 controls the dialogue module 50, and the robot asks the participant P by speech synthesis, "Hello. May I have your name?" When the participant P answers with his or her name, "I am XXX," the dialogue module 50 recognizes "XXX" with the speech recognition circuit 55 based on the acoustic signal from the microphone 16, performs speech synthesis with the dialogue control circuit 53, and says through the loudspeaker 51, "Welcome, XXX. Please enter the room." In this way, the robot 10 recognizes the participant P and guides the participant into the party venue, and also has the visual module 30 register the participant P's face image and the name "XXX" in the face database 38.

[0066] The humanoid robot 10 also operates as a companion robot, as follows. In this case, based on the auditory events 28 from the auditory module 20, the visual events 39 from the visual module 30, and the association streams 67 from the association module 60, the humanoid robot 10 recognizes a plurality of speakers by hearing and vision, and can track one of those speakers or switch to tracking another speaker partway through. As a companion robot, the robot 10 plays a passive role: it only "listens to" the party participants or "looks at" the speaker, and does not speak through the dialogue module 50.

[0067] The humanoid robot 10 serving as a companion robot may share the face database 38 with the party reception robot, or the face database 38 of the party reception robot may be transferred or copied to it. In this case, the humanoid robot 10 serving as a companion robot can always recognize all party participants by face identification.

[0068] Thus, in the humanoid robot 10 according to this embodiment of the present invention, the association module 60 generates auditory streams, visual streams, and association streams from the auditory and visual events supplied by the auditory module 20 and the visual module 30, using their direction information and the individual speaker identification and taking their temporal flow into account, and thereby recognizes the plurality of speakers that are its targets. Consequently, even if one kind of event is missing or can no longer be recognized clearly, the robot can track the plurality of speakers aurally and/or visually in real time: by hearing when, for example, a speaker moves and becomes "invisible," and by vision when a speaker stops talking and becomes "inaudible."
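The fallback behavior described above, in which tracking continues on whichever modality is still available, can be illustrated with a minimal sketch. The function name `track`, the per-frame tuple layout, and the choice to prefer the visual direction when both estimates exist are assumptions of this sketch and are not taken from the source.

```python
def track(frames):
    """Label each time step with the modality that carries the track.

    frames: list of (auditory_deg, visual_deg) tuples for one speaker,
    where either value is None when that event is missing.
    Returns a list of (modality, direction_deg) tuples.
    """
    result = []
    for auditory_deg, visual_deg in frames:
        if auditory_deg is not None and visual_deg is not None:
            # Both events present: assume the visual direction is used.
            result.append(("audio+visual", visual_deg))
        elif visual_deg is not None:
            # Speaker is silent ("inaudible"): track by vision alone.
            result.append(("visual", visual_deg))
        elif auditory_deg is not None:
            # Speaker is out of view ("invisible"): track by hearing alone.
            result.append(("audio", auditory_deg))
        else:
            # Neither event is available at this time step.
            result.append(("lost", None))
    return result
```

The track is lost only when both events are missing in the same time step.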

[0069] In the embodiment described above, the humanoid robot 10 is configured to have 4 DOF (degrees of freedom), but the invention is not limited to this; the robot auditory system according to the present invention can also be incorporated into a robot configured to perform any desired motion. Furthermore, although the embodiment described above concerns the case where the robot audiovisual system according to the present invention is incorporated into the humanoid robot 10, the system is clearly not limited to this and can also be incorporated into various animal-type robots, such as dog-type robots, and into robots of other forms.

[0070]

[Effects of the Invention] As described above, according to this invention, the auditory module obtains the direction of each sound source by performing pitch extraction, using harmonic structure, on the sound from external targets collected by the microphones, identifies the sound source of each speaker, and extracts the corresponding auditory event. The visual module identifies each speaker from the face identification and localization of each speaker, obtained by pattern recognition on the image captured by the camera, and extracts a visual event for each speaker. Further, the motor control module extracts a motor event by detecting the direction of the robot based on the rotational position of the drive motor that rotates the robot horizontally.

[0071] Here, the association module generates an auditory stream and a visual stream for each speaker from the auditory, visual, and motor events thus extracted, referring to their direction information and the speaker identification, and further associates these streams to generate an association stream. The attention control module performs attention control based on these streams and thereby plans the drive motor control of the motor control module. In doing so, the association module determines the direction of each speaker based on the sound source localization of the auditory event and the face localization of the visual event, that is, on the auditory and visual direction information, and generates the auditory stream, the visual stream, and the association stream.
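As a rough illustration of associating streams by direction, the sketch below pairs each auditory stream with the nearest visual stream when their directions are close. The greedy nearest-match rule, the function name `associate`, and the 10-degree threshold are assumptions made for this example; the source states only that the association relies on direction information and speaker identification.

```python
def associate(auditory, visual, max_gap_deg=10.0):
    """Pair auditory and visual streams whose directions are close.

    auditory, visual: dicts mapping a stream ID to a direction in degrees.
    max_gap_deg: assumed threshold for treating two streams as one speaker.
    Returns a list of (auditory_id, visual_id) pairs.
    """
    pairs = []
    free_visual = dict(visual)  # visual streams not yet paired
    for a_id, a_deg in sorted(auditory.items()):
        if not free_visual:
            break
        # Nearest remaining visual stream by angular difference.
        v_id = min(free_visual, key=lambda k: abs(free_visual[k] - a_deg))
        if abs(free_visual[v_id] - a_deg) <= max_gap_deg:
            pairs.append((a_id, v_id))
            del free_visual[v_id]
    return pairs
```

Streams that find no partner within the threshold stay unpaired, which corresponds to a speaker who is tracked by one modality only.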

[0072] The attention control module then controls the drive motor of the motor control module based on this planning, turning the robot toward the target speaker. Because the robot thereby directly faces the target speaker, the auditory module can accurately collect and localize the speaker's voice with the microphones in the frontal direction, where sensitivity is high, and the visual module can capture a good image of the speaker with the camera.
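Turning the robot to face a speaker amounts to computing a signed rotation from the robot's current heading to the speaker's direction. The following sketch assumes that both angles are expressed in degrees in the same fixed frame and that the result is wrapped to the range [-180, 180); the function name `pan_command` is hypothetical.

```python
def pan_command(robot_heading_deg, speaker_deg):
    """Signed rotation in degrees that makes the robot face the speaker.

    Both inputs are assumed to be angles in the same fixed frame.
    The result lies in [-180, 180), so the robot takes the shorter turn.
    """
    return (speaker_deg - robot_heading_deg + 180.0) % 360.0 - 180.0
```

For example, a robot heading of 170 degrees and a speaker at -170 degrees give a turn of 20 degrees rather than -340 degrees.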

[0073] Accordingly, through the cooperation of the auditory module, the visual module, and the motor control module with the association module and the attention control module, speakers are tracked with reference to the direction information, speaker identification, and temporal flow of the auditory and visual events. The ambiguities inherent in the robot's hearing and vision thus complement each other, robustness improves, and each speaker can be reliably perceived even when there are several speakers. Moreover, even when either the auditory event or the visual event is missing, the association module can perceive the target speaker based only on the remaining visual or auditory event, so the direction of the target can be grasped accurately and the motor control module can be controlled. The present invention thus provides an extremely good robot audiovisual system that integrates visual and auditory information on a target to track the target reliably.

[Brief description of the drawings]

FIG. 1 is a front view showing the external appearance of a humanoid robot incorporating a first embodiment of the robot auditory device according to this invention.

FIG. 2 is a side view of the humanoid robot of FIG. 1.

FIG. 3 is a schematic enlarged view showing the configuration of the head of the humanoid robot of FIG. 1.

FIG. 4 is a block diagram showing the electrical configuration of the robot audiovisual system in the humanoid robot of FIG. 1.

FIG. 5 is an enlarged block diagram showing the electrical configuration of the auditory module in block 1 of FIG. 4.

FIG. 6 is an enlarged block diagram showing the electrical configuration of the visual module in block 2 of FIG. 4.

FIG. 7 is an enlarged block diagram showing the electrical configuration of the motor control module in block 3 of FIG. 4.

FIG. 8 is an enlarged block diagram showing the electrical configuration of the dialogue module in block 4 of FIG. 4.

FIG. 9 is an enlarged block diagram showing the electrical configuration of the association module in block 5 of FIG. 4.

FIG. 10 is a diagram showing peak extraction, sound source localization, and sound source separation by the auditory module in the robot audiovisual system of FIG. 4.

FIG. 11 is a diagram showing stream generation by the association module in the robot audiovisual system of FIG. 4.

FIG. 12 is a diagram showing an example of operation as a party reception robot in the robot audiovisual system of FIG. 4.

[Explanation of symbols]

10 humanoid robot
11 base
12 torso
13 head
13a connecting member
14 exterior
15 camera (robot vision)
16, 16a, 16b microphone (robot hearing)
17 robot audiovisual system
20 auditory module
30 visual module
40 motor control module
50 dialogue module
60 association module
64 attention control module
70 network

Continued from the front page

(51) Int.Cl.⁷, identification code, FI, theme code (reference): G06T 7/00 300 G06T 7/00 300F 5L096 7/60 150 7/60 150B G10L 15/28 H04N 7/18 Z 17/00 G10L 3/00 511 15/00 545F 15/22 551H 15/20 571T 21/02 3/02 301C 15/02 9/00 301A H04N 7/18

F-terms (reference):
2C150 BA11 CA01 CA02 CA04 DA04 DA05 DA24 DA25 DA26 DA27 DA28 DF03 DF04 DF06 DF33 ED42 ED47 ED52 EF07 EF16 EF17 EF23 EF29 EF33 EF36
3C007 AS34 AS36 CS08 JS03 KS04 KS08 KS18 KS39 KT01 LT08 NS01 WA02 WA03 WB19 WC07
5B057 AA05 CA12 CB13 CC03 DA07 DB03
5C054 CC05 CD03 CG02 FC12 FF07 HA04
5D015 AA03 CC13 DD02 EE04 KK01 KK04 LL06
5L096 BA05 BA16 BA18 CA02 DA05 FA67 HA05 JA11 LA12

Claims

[Claims]

1. A robot audiovisual system comprising: an auditory module including at least a pair of microphones for collecting external sounds; a visual module including a camera for capturing images in front of the robot; a motor control module including a drive motor for rotating the robot horizontally; an association module that integrates events from the auditory module, the visual module, and the motor control module to generate streams; and an attention control module that performs attention control based on the streams generated by the association module, wherein the auditory module determines the direction of at least one speaker by pitch extraction, sound source separation, and localization based on the acoustic signals from the microphones, and extracts an auditory event; the visual module identifies each speaker from the face identification and localization of each speaker based on the image captured by the camera, and extracts a visual event; the motor control module extracts a motor event based on the rotational position of the drive motor; the association module thereby determines the direction of each speaker from the auditory event, the visual event, and the motor event, based on the direction information of the sound source localization of the auditory event and the face localization of the visual event, generates an auditory stream and a visual stream, and further associates them to generate an association stream; and the attention control module performs attention control for planning the drive motor control of the motor control module based on these streams.

2. An audiovisual system for a humanoid or animal-type robot, comprising: an auditory module including at least a pair of microphones for collecting external sounds; a visual module including a camera for capturing images in front of the robot; a motor control module including a drive motor for rotating the robot horizontally; an association module that integrates events from the auditory module, the visual module, and the motor control module to generate streams; and an attention control module that performs attention control based on the streams generated by the association module, wherein the auditory module determines the direction of at least one speaker by pitch extraction, sound source separation, and localization based on the acoustic signals from the microphones, and extracts an auditory event; the visual module identifies each speaker from the face identification and localization of each speaker based on the image captured by the camera, and extracts a visual event; the motor control module extracts a motor event based on the rotational position of the drive motor; the association module thereby determines the direction of each speaker from the auditory event, the visual event, and the motor event, based on the direction information of the sound source localization of the auditory event and the face localization of the visual event, generates an auditory stream and a visual stream, and further associates them to generate an association stream; and the attention control module performs attention control for planning the drive motor control of the motor control module based on these streams.

3. The robot audiovisual system according to claim 1 or 2, wherein the association module synchronizes the asynchronously generated auditory, visual, and motor events with one another when generating the auditory stream and the visual stream.

4. The robot audiovisual system according to any one of claims 1 to 3, wherein the auditory module detects MFCCs of speech from the acoustic signal to identify each speaker, and the association module identifies the speaker based on the speaker identification of the auditory event and of the visual event, and thereby selects the auditory stream and the visual stream to which the auditory event and the visual event should be connected.

5. The robot audiovisual system according to any one of claims 1 to 4, wherein, when a plurality of streams approach one another, the association module refers to the temporal flow of the auditory events and the visual events and selects the auditory stream and the visual stream to which the auditory event and the visual event should be connected.

6. The robot audiovisual system according to any one of claims 1 to 5, wherein the association module generates an association stream by associating an auditory stream and a visual stream that are strongly related to each other, and, when the relation between the auditory stream and the visual stream constituting the association stream weakens, cancels the association so that the association stream disappears.