JP2006251266A

JP2006251266A - Audio-visual coordinated recognition method and device

Info

Publication number: JP2006251266A
Application number: JP2005066512A
Authority: JP
Inventors: Hiroshi Shinjo; 広新庄; Masato Togami; 真人戸上; Akio Amano; 明雄天野
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2005-03-10
Filing date: 2005-03-10
Publication date: 2006-09-21

Abstract

PROBLEM TO BE SOLVED: To enable a robot to accurately detect the person it should converse with, using voice information and image information, and to carry out the conversation.

SOLUTION: The robot recognizes a specific utterance by which a speaker signals the start of a conversation, detects the direction of the speaker by estimating the sound source direction, and moves toward the detected direction. After moving, it detects the speaker's face in the image input from a camera and performs dialogue processing when a face is detected. During dialogue processing, priority is given to the direction in which the face was detected, and voice recognition is performed with its directivity limited to the face direction.

COPYRIGHT: (C)2006, JPO & NCIPI

Description

The present invention relates to human interface technology that uses both image recognition and voice recognition, and in particular to a robot having an image recognition function, a voice recognition function, and a mechanism control function.

In recent years, robots have been advancing from machines that only perform predetermined tasks, such as industrial robots, to machines that assess their situation and communicate (converse) with humans. Conversing with humans requires technology for detecting the conversation partner in the surrounding environment and technology for recognizing speech.

In conventional robot-human dialogue, the person conversing in front of the robot was a single, predetermined person, and the speaker wore a close-talking microphone so that the robot recognized only that person's voice. With a close-talking microphone, speaker detection is unnecessary, and the speaker's voice alone can be recognized without picking up ambient noise.

As a means of detecting the direction of the conversation partner without a close-talking microphone, there is a technique that arranges a plurality of microphones and estimates the direction of the sound source from the phase differences between the microphones. There is also a technique that detects the position of the speaker by detecting a human face with a camera.
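The phase-difference idea can be illustrated for the simplest case of two microphones. The following is only a sketch under stated assumptions, not the method of any cited document: it assumes a far-field source, a single known frequency, a speed of sound of 343 m/s, and a microphone spacing below half a wavelength so that the phase difference is unambiguous. The function name and parameters are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def direction_from_phase(phase_diff_rad, freq_hz, mic_spacing_m):
    """Estimate the arrival angle (degrees, 0 = broadside) of a far-field source.

    phase_diff_rad is the phase difference between the two microphone
    signals at freq_hz. The spacing must be below half a wavelength,
    otherwise the phase wraps and the angle becomes ambiguous.
    """
    # Convert the phase difference into a time delay between microphones.
    delay_s = phase_diff_rad / (2.0 * math.pi * freq_hz)
    # Far-field geometry: delay = spacing * sin(angle) / speed of sound.
    sin_angle = SPEED_OF_SOUND * delay_s / mic_spacing_m
    # Clamp against measurement noise pushing the value outside [-1, 1].
    sin_angle = max(-1.0, min(1.0, sin_angle))
    return math.degrees(math.asin(sin_angle))
```

For example, with microphones 0.1 m apart and a 1 kHz tone arriving from 30 degrees, the delay is 0.1 * sin(30°) / 343 s; feeding the corresponding phase difference back into the function recovers the 30 degree angle.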

However, when only voice is used, the system reacts erroneously to noise other than human voices. When only images are used, a complex background may be misrecognized as a face, and a face may go undetected depending on conditions such as illumination and face angle. For this reason, techniques have been devised that use both image recognition and voice recognition and, if there is a face in the direction of the voice, treat the person with that face as the speaker. Known examples of this approach include Patent Documents 1 to 3. In Patent Document 1, the robot is controlled to move toward one of a moving object detection result, a face detection result, or a sound source direction detection result. When a face is detected during movement, the robot moves toward the face and stops when it comes within a predetermined range. In Patent Document 2, a speaker is identified using face recognition, sound source direction estimation, and the like. In Patent Document 3, a speaker is tracked using one or more of the outputs of recognition functions such as face recognition and voice recognition.

Patent Document 1: JP 2004-130427 A
Patent Document 2: JP 2002-264051 A
Patent Document 3: JP 2004-283959 A
Patent Document 4: JP-A-8-272973
Non-Patent Document 1: Toshiro Oga, Yoshio Yamazaki, Yutaka Kaneda, "Acoustic Systems and Digital Processing," IEICE, pp. 203-209, March 25, 1995

The first problem to be solved by the present invention is to accurately detect the conversation partner and carry out a dialogue when there is noise around the robot and a plurality of persons are present.
In Patent Document 1, the robot moves in the direction given by either face recognition or sound source direction estimation and, when a face is detected, moves toward the face. However, in the method of Patent Document 1, the face found when the robot turns toward the sound source is taken to be the speaker. Because sound source direction estimation is not very accurate, the speaker cannot be identified accurately when a plurality of persons are in the sound source direction.

The second problem to be solved by the present invention is to recognize the speaker correctly even when the speaker is outside the camera's field of view and a person who is not the speaker is inside it. In Patent Document 3, a speaker is tracked using one or more of the results of face recognition and voice recognition. That method gives no detailed description of the priority among recognition processes. If face recognition were given priority in the situation above, the wrong person would be detected as the speaker.

The third problem to be solved by the present invention is to detect the speaker through appropriate control even when voice recognition or face recognition produces an erroneous result or no result. In face recognition, depending on conditions such as illumination, distance to the speaker, and face angle, a face that is in the field of view may go undetected or be falsely detected. In voice recognition, only the speaker's voice is detected in a quiet room, but outdoors and in similar settings there is noise other than the speaker's voice, so noise may be mistaken for the speaker's voice. Conventional methods, which assume that recognition works correctly, cannot cope with such situations.

Patent Document 1 assumes that the sound source direction and the face direction coincide, so the speaker cannot be identified if either recognition function produces an erroneous result or no result.

In Patent Document 2, a speaker is identified among a plurality of persons using face recognition, sound source direction estimation, and the like. However, this method assumes that each recognition function, such as face recognition and sound source direction estimation, always recognizes correctly. In the environments described above, errors in voice recognition and image processing are unavoidable, so the method of Patent Document 2 cannot identify the speaker correctly.

In Patent Document 3, a speaker is tracked using one or more of the outputs of recognition functions such as face recognition and voice recognition. However, the detailed control of face recognition and voice recognition is not described, and this example alone does not provide appropriate control when recognition is erroneous or fails.

The audio-visual cooperative recognition method invented to solve these problems executes processing based on voice input and image input. It recognizes the voice and direction of a specific word or sentence that a speaker uses at the beginning of a conversation. If the recognition fails, it returns to the initial state; if the recognition succeeds, it points the camera in the direction of the detected voice and detects a human face in the image input from the camera during or after the movement. If no face is detected, it returns to the initial state; if a face is detected, it performs dialogue processing.

One effect of the present invention is that the conversation partner can be identified from a distance in an environment with ambient noise. More specifically, ambient noise and the partner's voice can be told apart correctly without a close-talking microphone. A second effect is that the conversation partner can be identified even when a plurality of persons are nearby. A third effect is that recognition is robust even under poor environmental conditions. When there is ambient noise or a plurality of persons are present, voice recognition, face recognition, or both may produce an erroneous result or no result. Even in such situations, the robot can autonomously control itself so that a correct dialogue succeeds.

The following provides a number of specific details for an overall description of the present invention. It will be apparent to those skilled in the art, however, that the present invention can be practiced without these specific details. The descriptions and drawings in this specification are the usual means by which those skilled in the art convey the substance of an invention to others skilled in the art. Where this specification refers to "one embodiment," the reference does not necessarily apply to the same embodiment only, and the individual embodiments are not mutually exclusive. Furthermore, the order of operations in the processes of the embodiments is illustrative and not limiting.

First, FIG. 1 shows the configuration of the audio-visual cooperation device according to the present invention. Reference numeral 100 denotes a microphone array composed of a plurality of microphones installed at different positions. Unit 110 detects the direction of the sound source from the audio signals of the microphone array. Unit 120 recognizes the speech arriving from the sound source direction. Reference numeral 130 denotes a camera for inputting images. Unit 140 detects a face area in the input image. In addition to face detection, unit 140 may have a function of identifying which registered person the detected face belongs to. Reference numeral 150 denotes a mechanism control unit that changes the orientation and position of the microphone array 100 and the camera 130. Movement includes horizontal and vertical rotation as well as translation such as forward, backward, up, down, left, and right. Unit 160 controls the dialogue based on the position of the speaker and the content of the speaker's speech. The utterance chosen by unit 160 is output as the robot's voice through the speaker 170. Reference numeral 180 denotes an overall control unit. It integrates the results of voice recognition and face recognition to detect the direction of the speaker, moves the microphone array 100 and the camera 130 via the mechanism control unit 150 based on that result, and converses with the speaker via the dialogue control unit 160.

First, the features of this embodiment are described. The first feature is that, to estimate the speaker's direction accurately, a dialogue is conducted only when the sound source direction obtained from the voice and the face direction obtained from face detection match. Because a match is required, there are fewer false recognitions than when only one of the two is used.

The second feature is that, when detecting the speaker direction, the robot first moves giving priority to the sound source direction estimate and, after turning toward the speaker, gives priority to the face direction to determine the speaker's direction precisely. This makes it possible to identify the speaker even when a plurality of persons are in the field of view.

The third feature is that, for the speaker's initial call, sound source direction estimation is performed with wide directivity and, in exchange, the recognizable words are restricted. Widening the range of sound source direction estimation makes it harder to distinguish ambient noise from the speaker's voice; restricting the vocabulary compensates for this and improves discrimination.

The fourth feature is that, once the dialogue has started, directivity is limited to the speaker's direction only. Limiting the directivity allows ambient noise and the speaker's voice to be separated clearly.
With these features, highly accurate speaker detection and dialogue can be achieved.

As a first embodiment of the present invention, the flow of processing for detecting a speaker and conducting a dialogue is described with reference to FIGS. 2 and 3. FIG. 2 shows the overall flow, and FIG. 3 shows the flow of the dialogue processing of step 250 only.

First, FIG. 2 is described. Steps 200 to 240 detect the direction of the speaker. In step 200, the robot detects a voice calling to it from the speaker. In this step, the estimation range for the sound source direction is set wide. The range of voice input can be controlled by how the inputs from the microphone array are combined. To distinguish noise from a call accurately, the words and sentences accepted as a call may also be restricted. The method of estimating the sound source direction is described later. In step 210, it is determined whether the voice and direction of a call were recognized. If no call was detected, the process returns to step 200. In other words, the robot does not react, so the speaker has to call again.

If a call is detected in step 210, the robot moves toward the direction of the call in step 220. Movement includes changes of angle as well as forward, backward, up, down, left, and right motion. The movement may use information on the distance to the target and to surrounding objects. Examples of ways to obtain distance information include deriving it from the voice information in step 200, using a laser radar or similar sensor, and stereo vision with a plurality of cameras. In step 230, face detection is performed facing the direction of the call. One example of face detection processing is the technique disclosed in JP-A-8-272973 (Patent Document 4). It detects a face area by identifying skin-colored regions in the image, using the H (hue), S (saturation), and V (value) information of the image. To use input from an ordinary camera, the RGB signal is converted to an HSV signal. In step 240, it is determined whether a face was detected. If no face was detected, the process returns to step 200. In other words, the robot does not react, so the speaker has to call again. If a face is detected in step 240, the robot judges that a conversation partner has been found and performs the dialogue processing of step 250.
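The skin-color approach to face detection in step 230 can be sketched as follows. This is a minimal illustration, not the algorithm of Patent Document 4: the hue, saturation, and value thresholds are assumed values chosen only for the example, and the function names are hypothetical. The image is represented as a list of rows of (R, G, B) tuples in the range 0 to 255, and the RGB to HSV conversion uses the standard `colorsys` module.

```python
import colorsys


def is_skin(r, g, b, h_max=50 / 360, s_range=(0.15, 0.70), v_min=0.35):
    """Return True if an RGB pixel (0-255 per channel) falls in the assumed skin range."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # Thresholds are illustrative assumptions, not values from the source.
    return h <= h_max and s_range[0] <= s <= s_range[1] and v >= v_min


def face_bounding_box(image):
    """Return (left, top, right, bottom) of all skin-colored pixels, or None.

    A real detector would also group pixels into connected regions and
    check region size and shape; this sketch only takes the overall extent.
    """
    xs, ys = [], []
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            if is_skin(r, g, b):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # corresponds to "no face detected" in step 240
    return (min(xs), min(ys), max(xs), max(ys))
```

A `None` result corresponds to the "no face detected" branch of step 240; any other result gives the image region whose center can be used as the face direction.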

FIG. 3 shows the detailed flow of the dialogue processing of step 250 in FIG. 2. First, in step 300, the robot speaks to the caller. When the caller replies to the utterance of step 300, the reply is recognized in step 310. If the range of sound source directions accepted for voice recognition is limited at this point to the camera's viewing angle or to the direction of the call, ambient noise and the conversation can be separated more accurately. In step 310, the sound source direction is also estimated again. In step 320, it is determined whether voice recognition succeeded. If it did not, the process returns to step 310. In other words, the robot does not react, so the speaker has to speak again.

If voice recognition succeeds, face detection is performed in step 330. In step 340, it is determined whether a face was detected in the image. If no face was detected, the robot speaks in step 350. Examples of what it may say are "I cannot detect your face," to report that no face was detected; "Please look this way," to have the speaker face the robot; and "Please move a little to the right (or left)," to have the speaker change position and thereby change the lighting and background. If step 350 has been repeated a specified number of times or more, the robot may judge that there was no speaker and end the dialogue. Step 350 may also be omitted, in which case the process returns directly to step 330 when detection fails. If a face is detected in step 340, step 360 determines whether the sound source direction obtained in step 310 matches the face direction obtained in step 330.

If the directions do not match, the robot moves toward the speaker in step 370. This movement repositions the camera so that the face is at the center of the camera's field of view. Step 370 is performed because, when both a sound source direction and a face direction have been detected, the face direction is the more accurate of the two. This makes it possible to identify the speaker accurately even when a plurality of persons are in the field of view. In subsequent processing, the voice recognition of step 310 can be made even more accurate by strengthening the directivity toward the speaker. In step 370, the robot may also simply determine which person in the field of view is the conversation partner, without actually moving.
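The camera movement of step 370 amounts to converting the face position in the image into a pan angle. The sketch below assumes an ideal pinhole camera with a known horizontal field of view and no lens distortion; the function name and the example field of view are hypothetical and do not come from the source.

```python
import math


def pan_to_center_face(face_center_x, image_width, horizontal_fov_deg):
    """Return the pan angle (degrees, positive = right) that centers the face.

    Assumes a pinhole camera: a pixel at the right image edge corresponds
    to an angle of half the horizontal field of view.
    """
    half_width = image_width / 2.0
    # Normalized horizontal offset of the face from the image center, in [-1, 1].
    offset = (face_center_x - half_width) / half_width
    half_fov = math.radians(horizontal_fov_deg / 2.0)
    return math.degrees(math.atan(offset * math.tan(half_fov)))
```

The same angle can serve as the face direction that step 360 compares with the sound source direction, provided both are expressed relative to the camera's optical axis.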

If the directions match in step 360, step 380 determines, according to the content of the dialogue, whether to continue it. If the dialogue is to continue, the process returns to step 300. What the robot says in step 300 in that case depends on the content of the dialogue. If the dialogue is not to continue, it ends. After the dialogue ends, the process may return to step 200 in FIG. 2 and wait for the next dialogue to begin.
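The branching of steps 320 to 380 in FIG. 3 can be summarized as a small decision function. This is a sketch only: the source does not say how a "match" between directions is decided, so the angular tolerance below is an assumption, and the action labels and function name are hypothetical.

```python
def next_dialogue_action(speech_ok, sound_dir_deg, face_dir_deg,
                         continue_dialogue, tolerance_deg=10.0):
    """Return the next action for one pass of the FIG. 3 dialogue flow.

    face_dir_deg is None when no face was detected. tolerance_deg is an
    assumed threshold for treating the two directions as matching.
    """
    if not speech_ok:
        return "listen_again"       # step 320 -> back to step 310
    if face_dir_deg is None:
        return "prompt_speaker"     # step 340 -> step 350
    if abs(face_dir_deg - sound_dir_deg) > tolerance_deg:
        return "turn_to_face"       # step 360 -> step 370
    if continue_dialogue:
        return "speak"              # step 380 -> back to step 300
    return "end_dialogue"           # step 380 -> end
```

Keeping the decision separate from the recognizers makes the priority order explicit: speech must succeed first, then a face must be found, and only then are the two directions compared.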

As a second embodiment of the present invention, the flow of processing for detecting a speaker and conducting a dialogue is described with reference to FIG. 4. In FIG. 4, steps with the same numbers as in FIG. 2 perform the same processing, so their description is omitted.
If no face is detected in step 240, the face may have been missed because of the lighting conditions or the face angle. To handle this, step 260 limits the number of retries. If the number of retries is within the specified limit, the robot speaks in step 270 to ask the user to change posture so that face detection improves. Specific examples are "Please look this way" and "Please move a little to the right (or left)" (to compensate for the lighting). If the specified number is exceeded, the robot judges that the detection of the calling voice in step 200 was wrong and returns to step 200.
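The retry-limited flow of FIG. 4 can be sketched as a single function with the recognizers and actuators passed in as callables. All names and the default retry limit are hypothetical; the sketch also assumes that the flow returns to face detection (step 230) after the prompt of step 270, which is consistent with the claims but is not spelled out in the description above.

```python
def detect_speaker(listen_for_call, turn_to, detect_face, ask_to_adjust,
                   max_retries=3):
    """Run one pass of the FIG. 4 flow.

    Returns the detected face direction, or None when the caller should
    go back to waiting for a call (step 200).
    """
    call_direction = listen_for_call()   # step 200: wide directivity, restricted vocabulary
    if call_direction is None:           # step 210: no call recognized
        return None
    turn_to(call_direction)              # step 220: move toward the call
    retries = 0
    while True:
        face_direction = detect_face()   # step 230
        if face_direction is not None:   # step 240: face found, go to step 250
            return face_direction
        if retries >= max_retries:       # step 260: retry limit exceeded
            return None                  # treat the call as a false detection
        ask_to_adjust()                  # step 270: e.g. "Please look this way"
        retries += 1
```

Passing the components in as callables keeps the control flow testable without a microphone array or camera, for example by substituting stubs that return canned results.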

An example of the sound source direction estimation technique used in Embodiments 1 and 2 is described below. One example of a technique for estimating the direction of a sound source with a plurality of microphones is the null-forming sound source localization technique described in Non-Patent Document 1. In this technique, nulls (directions of zero sensitivity) are formed toward sound sources lying outside the direction under test, so that only the sound from the direction under test is extracted, and the sound power is calculated for each direction. The sound source direction is then estimated from the per-direction sound power. Null-forming sound source localization is known to estimate the sound source direction with high accuracy when the number of sound sources is smaller than the number of microphones.
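The idea of computing a sound power for each candidate direction and picking the strongest can be illustrated with a much simpler stand-in. The sketch below is a plain delay-and-sum scan, not the null-forming method of Non-Patent Document 1: it assumes two microphones, a single far-field source, one known frequency, and a spacing below half a wavelength. The simulated input, function names, and constants are hypothetical. Restricting `angles_deg` to a narrow range corresponds to limiting the directivity once the dialogue has started.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value


def mic_phasors(source_deg, freq_hz, spacing_m):
    """Simulate narrowband phasors at two microphones for a far-field source."""
    delay = spacing_m * math.sin(math.radians(source_deg)) / SPEED_OF_SOUND
    return 1.0 + 0j, cmath.exp(-2j * math.pi * freq_hz * delay)


def power_by_direction(x1, x2, freq_hz, spacing_m, angles_deg):
    """Return {angle: output power} of a delay-and-sum beam steered to each angle."""
    powers = {}
    for angle in angles_deg:
        delay = spacing_m * math.sin(math.radians(angle)) / SPEED_OF_SOUND
        # Undo the delay expected for this direction, then sum the channels.
        y = x1 + x2 * cmath.exp(2j * math.pi * freq_hz * delay)
        powers[angle] = abs(y) ** 2
    return powers


def estimate_direction(x1, x2, freq_hz, spacing_m, angles_deg=range(-90, 91)):
    """Return the candidate angle (degrees) with the largest steered power."""
    powers = power_by_direction(x1, x2, freq_hz, spacing_m, angles_deg)
    return max(powers, key=powers.get)
```

With only two microphones the beam is broad, so this stand-in separates sources poorly; the null-forming approach described above is aimed at sharper discrimination when several sources are present.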

FIG. 1 is a block diagram of the present invention.
FIG. 2 is a diagram of the audio-visual cooperation flow of the first embodiment.
FIG. 3 is a diagram of the dialogue processing flow in the first embodiment.
FIG. 4 is a diagram of the audio-visual cooperation flow of the second embodiment.

Explanation of symbols

100: Microphone array, 110: Sound source direction estimation unit, 120: Speech recognition unit, 130: Camera, 140: Face recognition unit, 150: Mechanism control unit, 160: Dialogue control unit, 170: Speaker, 180: Overall control unit.

Claims

An audio-visual cooperative recognition method for executing processing based on voice input and image input, characterized by:
recognizing, based on voice input from a microphone array, the voice and direction of a specific word or sentence that a speaker uses at the beginning of a conversation;
returning to the initial state if the voice recognition fails;
pointing a camera in the direction of the detected voice if the voice recognition succeeds;
detecting a human face from an image input from the camera;
returning to the initial state if no face is detected; and
performing dialogue processing if a face is detected.

An audio-visual cooperative recognition method for executing processing based on voice input and image input, characterized by:
recognizing, based on voice input from a microphone array, the voice and direction of a specific word or sentence that a speaker uses at the beginning of a conversation;
returning to the initial state if the recognition fails;
pointing a camera in the direction of the detected voice if the recognition succeeds;
detecting a human face from an image input from the camera;
informing the user that the face cannot be detected and returning to the face detection process if no face is detected; and
performing dialogue processing if a face is detected.

The audio-visual cooperative recognition method according to claim 1 or 2, characterized in that, in the dialogue processing:
an utterance is made in response to the speaker's speech;
the voice and direction of the speaker are recognized with directivity limited to the speaker direction detected in claim 1 or claim 2;
if the voice recognition fails, the process returns to the voice recognition standby state;
if the voice recognition succeeds, face detection processing is performed;
if no face is detected, the speaker is informed by an utterance that no face was detected and the process returns to the face detection process;
if a face is detected, it is determined whether the direction of the face and the direction of the speaker detected in the voice recognition match;
if they do not match, the speaker direction recognized in claim 1 or claim 2 is corrected and the device moves toward the detected face direction;
if they match, it is decided whether to continue the dialogue;
if the dialogue is continued, utterance content is selected according to the state transition of the dialogue and the process returns to the utterance; and
if the dialogue is not continued, the processing is terminated.

An audio-visual cooperative recognition device that executes processing based on voice input and image input, comprising:
a sound source direction estimation unit that collects voice information from a plurality of microphones and estimates the direction of the voice from the collected voice information, and a speech recognition unit that recognizes the voice from the estimated sound source direction;
a face recognition unit that collects image information from a camera and detects a human face area from the collected image information;
a mechanism control unit that moves the position and direction of the camera and the microphones;
an overall control unit that integrates the results of the sound source direction estimation unit, the speech recognition unit, and the face recognition unit to detect the speaker direction, controls the mechanism of the device, and understands the conversation content; and
a dialogue control unit that determines the state transition of the dialogue according to the speech recognition result and selects the utterance content, and a speaker that outputs the utterance;
wherein the sound source direction estimation unit changes the directivity of direction estimation according to the state of the dialogue;
the mechanism control unit executes a process of moving the device toward the sound source direction estimated in the control unit; and
the overall control unit executes a process of conducting a dialogue only when the speaker direction estimates obtained from the voice information and from the image information match.