JP2008087140A

JP2008087140A - Speech recognition robot and control method of speech recognition robot

Info

Publication number: JP2008087140A
Application number: JP2006273620A
Authority: JP
Inventors: Makoto Kawarada; 誠河原田
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2006-10-05
Filing date: 2006-10-05
Publication date: 2008-04-17

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition robot that always faces the speaker who has spoken while responding to that speaker, and to provide a control method for the robot.

SOLUTION: The speech recognition robot comprises: a sound source specifying part that specifies the direction from which speech originates; a speech recognizing part that receives the speech and recognizes its content; an image pick-up part that captures images in at least two directions and acquires them as image data; a face detecting part that detects the faces of persons in a captured image; a discriminating part that discriminates the position and direction of each detected face; and an extracting part that extracts a detected face. The image pick-up part captures the direction of the speech specified by the sound source specifying part; the face detecting part detects the faces of persons in the captured image; the discriminating part discriminates, among the detected faces, those turned toward the robot; the extracting part extracts the face nearest to the robot; and the imaging direction of the image pick-up part is changed to follow the extracted face so that the extracted face continues to be captured.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a conversational speech recognition robot that recognizes the content of speech addressed to it by a person and responds to that person, and to a control method for such a speech recognition robot.

In recent years, conversational speech recognition systems have been under development. Such systems converse with a person by acquiring what the person (the speaker) says as speech data, recognizing its content, and uttering a corresponding response sentence as speech. A system of this kind stores a large number of speech data items to be uttered as response sentences in an internal storage area, selects the item most closely associated with the content of the recognized speech, and utters the selected item as speech. It acquires the spoken content as speech data, divides that data into phrases of a predetermined language, and selects the response sentence best suited to the content based on the order and proximity of the phrases (see, for example, Patent Document 1).

Patent Document 1: JP 2004-109323 A

A speech recognition system such as the one described above can receive speech through a directional microphone or the like and thereby identify the direction from which the speech was uttered. When such a system is built into a human-interactive robot (hereinafter, a speech recognition robot), the robot can be moved to face the identified direction, so that when it responds to the speaker or receives speech from the speaker, it appears to recognize the speaker as its conversation partner.

However, when there are multiple speakers and they are moving relative to the robot, the direction of the speaker as seen from the robot may not be identifiable, and the robot may be unable to orient itself to look at the speaker.

The present invention has been made to solve this problem. Its object is to provide a speech recognition robot, and a control method for a speech recognition robot, that can respond to a speaker while always facing the speaker who has spoken.

A speech recognition robot according to the present invention comprises: a sound source identification unit that identifies the direction from which speech originated; a speech recognition unit that receives the speech and recognizes its content; an imaging unit that captures images in at least two directions and acquires the captured images as image data; a face detection unit that detects the faces of persons in a captured image; a discrimination unit that determines the position and orientation of each detected face; and an extraction unit that extracts a detected face. The imaging unit captures the direction identified by the sound source identification unit; the face detection unit detects the faces of persons in the captured image; the discrimination unit determines which of the detected faces are turned toward the robot; the extraction unit extracts, from those faces, the face closest to the robot; and the imaging direction of the imaging unit is then changed to follow the position of the extracted face, so that the extracted face is tracked and continues to be imaged.

When multiple persons are present in the field of view of the imaging unit after it has turned toward the direction of the speech, such a speech recognition robot regards the person who is facing the robot and is closest to it as the speaker with whom it is conversing, and is controlled to keep facing that speaker. The robot can therefore face the speaker continuously while responding.

Such a speech recognition robot may also be configured to stop tracking when the extracted face leaves a region defined around the robot. That is, when the person moves a predetermined distance or more away from the robot, the robot judges that the person is no longer its conversation partner and stops tracking. If speech from another speaker is being received when tracking stops, the robot may be controlled to extract a tracking target again so that the person now speaking is detected as the next conversation partner.

The sound source identification unit, which identifies the direction of the speaker's utterance, may consist of one or more directional microphones. A robot configured in this way can detect the direction of the utterance simply and accurately.

Furthermore, for each face of a person in the image captured by the imaging unit, the face detection unit may locate the eyes and mouth, estimate the face contour from those positions, and thereby detect the whole face. In this way, even when another person or an object is near the face of the person (the speaker) and the contour cannot be determined by measuring distance from the robot alone, the contour can be determined simply by detecting the positions of the eyes and mouth.

The positions of both eyes of a face in the captured image may also be obtained relative to the robot, and the direction the face is facing determined from their positional relationship. This makes it easy to determine the orientation of the speaker's face without considering the position of the whole face or the speaker's whole body.

Furthermore, if the direction of the imaging unit is changed so that the detected face stays approximately at the center of the captured image, and the imaging unit is located at the position corresponding to the robot's face, the robot can always speak while facing the conversation partner (the speaker). It can thus appear to keep looking at the partner's face throughout the conversation.

Such a speech recognition robot may also be provided with moving means so that it can move within a predetermined area. Because the robot can then change its own position while keeping the speaker identified, it can be used, for example, as a robot that guides people.

The present invention also provides a method for controlling a speech recognition robot. Specifically, the method comprises the steps of: identifying the direction from which speech originated; receiving the uttered speech and recognizing its content; capturing an image in the direction from which the speech originated and acquiring the captured image as image data; detecting the faces of persons in the captured image; determining the orientation of each detected face and identifying the faces turned toward the robot; extracting, from the identified faces, the face closest to the robot; controlling the imaging direction so as to track the extracted face; and tracking the extracted face so that it continues to be imaged.

With this control method, even when multiple persons are present in the field of view of the imaging unit turned toward the speech, the robot can be controlled to regard the person who is facing it and is closest to it as the speaker with whom it is conversing, and to keep facing that speaker. The robot can therefore face the speaker continuously while responding.

In this control method as well, tracking may be stopped when the extracted face moves outside a predetermined distance from the robot. Controlling the robot in this way lets it detect that the person who was talking to it has ended the conversation, look for the next conversation partner, and continue conversing.

As described above, the present invention makes it possible to control a speech recognition robot so that it responds to a speaker while always facing the speaker who has spoken.

Embodiment 1 of the Invention
A speech recognition robot according to a first embodiment of the present invention is described below with reference to FIGS. 1 to 6.

FIG. 1 shows a room R in which several people are moving about and in which a speech recognition robot 1 is installed. As shown in FIG. 2, the speech recognition robot 1 is a humanoid robot configured like a human upper body, with a torso 10 fixed to the floor and a head 11, a right arm 12, and a left arm 13 connected to the torso 10. Each component is described in detail below.

The torso 10 is installed on the floor of the room R and houses a control unit 100 that controls the motion and other functions of the speech recognition robot 1. The control unit 100 is a control computer with an arithmetic processing unit, memory, and the like; it recognizes the content of the input signal from the microphones described later, selects appropriate response data, and outputs that response data as speech. The detailed configuration of the control unit 100 is described later.

The head 11 includes imaging units 111 and 112 for imaging a predetermined range in front of the speech recognition robot 1, sound source identification units 113 and 114 for picking up speech produced nearby, and a speaker 115 for uttering words to the outside. The imaging units 111 and 112 are optical cameras mounted on the left and right of the front of the head 11; each acquires optical information over a predetermined range, captures it as image data, and outputs it to the arithmetic unit 101.

The sound source identification units 113 and 114 each consist of several directional microphones, each able to pick up sound from a particular direction, arranged horizontally. They can roughly identify the direction, relative to the speech recognition robot 1, from which nearby speech arrived. The sound source identification units 113 and 114 are mounted on the left and right sides of the head 11; they pick up speech uttered around the speech recognition robot 1, capture it as speech data, and output it to the arithmetic processing unit 101.
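The patent does not state how the directional microphones are combined into a direction estimate. One simple possibility, offered only as an illustrative sketch, is to report the azimuth of the microphone that received the most signal energy. The function name, the azimuth list, and the energy criterion below are assumptions, not taken from the source.

```python
def loudest_direction(mic_azimuths_deg, mic_samples):
    """Return the azimuth (degrees) of the microphone with the highest energy.

    mic_azimuths_deg: pointing direction of each directional microphone
                      (hypothetical layout, relative to the robot's front).
    mic_samples:      one list of audio samples per microphone.
    """
    if not mic_samples or len(mic_azimuths_deg) != len(mic_samples):
        raise ValueError("need one sample list per microphone")
    # Signal energy per microphone: sum of squared samples.
    energies = [sum(s * s for s in samples) for samples in mic_samples]
    loudest = max(range(len(energies)), key=energies.__getitem__)
    return mic_azimuths_deg[loudest]
```

This gives only a coarse direction, limited by the number of microphones, which is in line with the text's statement that the direction is identified roughly.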

The speaker 115 outputs the response data created by the control unit 100 to the outside in a predetermined direction and at a predetermined volume, and is mounted on the lower front of the head 11.

The head 11 is connected to the torso 10 so that it can rotate left and right in a plane parallel to the floor. Rotating the head 11 changes the imaged range as the situation requires, so that the surroundings can be imaged.

For the right arm 12 and the left arm 13, the arithmetic processing unit 101 included in the control unit 100 controls, according to a predetermined control program, the amount by which each joint in each arm is driven. Determining the drive angle of each joint places the arm in the desired position and posture.

As shown in FIG. 3, the control unit 100 includes: a speech recognition unit 101 that recognizes the content of input speech data; a face detection unit 102 that detects the faces of persons in the images captured by the imaging units 111 and 112 in the head 11; a discrimination unit 103 that determines the position and orientation of each face detected by the face detection unit 102; an extraction unit 104 that extracts a detected face; a speech synthesis unit 105 that creates the response data to be output as speech; and a storage area (not shown) that stores predetermined programs and data.

The speech recognition unit 101 converts the speech acquired from the sound source identification units 113 and 114 into speech data such as a WAVE file, divides the speech data into syllables, and replaces the syllables with words using a word database held in the storage area. It then analyzes the words in the speech data and their order, and selects, from the many sentences held in the storage area, the sentence closest to the analyzed speech data. If the degree of similarity between the selected sentence and the speech data is at or above a predetermined value, the unit treats the speech as having the same content as the selected sentence and outputs a signal indicating that the acquired speech matches that sentence. If even the closest sentence falls short of the predetermined similarity, the unit concludes that no corresponding sentence is stored and outputs a signal indicating that the content of the acquired speech could not be recognized.
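The closest-sentence selection with a similarity threshold can be sketched as follows. The patent does not specify the similarity measure, so this sketch substitutes Python's `difflib.SequenceMatcher` ratio over word lists, which is sensitive to both the words and their order. The function name `recognize` and the threshold value 0.6 are illustrative assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def recognize(heard: List[str], stored: List[List[str]],
              threshold: float = 0.6) -> Optional[int]:
    """Return the index of the stored sentence closest to `heard`,
    or None when even the closest one is below `threshold`.

    Similarity is the SequenceMatcher ratio over word lists; this is
    an assumed stand-in for the patent's unspecified measure.
    """
    best_index, best_score = None, 0.0
    for i, sentence in enumerate(stored):
        score = SequenceMatcher(None, heard, sentence).ratio()
        if score > best_score:
            best_index, best_score = i, score
    if best_index is not None and best_score >= threshold:
        return best_index   # "matches stored sentence i"
    return None             # "content could not be recognized"
```

Returning `None` corresponds to the signal indicating that the acquired speech could not be recognized.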

The face detection unit 102 detects only the face regions of persons in the image data obtained by the imaging units 111 and 112. It first extracts the eyes and mouth (lips) of a person's face and, from their positions, estimates the edge corresponding to the face contour. It then detects the region enclosed by the estimated contour as the person's face.

Based on the positions of the eyes in each face detected by the face detection unit 102, that is, their distance and direction relative to the robot, it is possible to estimate which way each detected face is facing as seen from the robot. Specifically, as shown in FIG. 4, the center positions E10 and E11 of the eyes (right eye E1 and left eye E2) are identified, together with the midpoint M of the line segment connecting them. Then, in the plane that contains this segment and is parallel to the floor, the direction P that starts at the midpoint M and is perpendicular to the segment is obtained. This direction P is taken as the line of sight of the eyes (right eye E1 and left eye E2), that is, the direction in which the face containing them is facing. The facing direction of each detected face is obtained in this way, and a signal giving these directions is output together with the position of each face relative to the robot.
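The construction of the midpoint M and the perpendicular direction P can be sketched in two dimensions. The sketch assumes that the eye centers are already available as (x, y) coordinates in a floor-parallel plane viewed from above, in a right-handed frame. A segment has two perpendiculars, and the patent does not say how the ambiguity is resolved; here it is resolved by assuming that the left eye lies to the left of the facing direction. The function name is hypothetical.

```python
import math


def face_direction(right_eye, left_eye):
    """Return (M, P) for eye centers given as (x, y) floor-plane points.

    M is the midpoint of the segment joining the eye centers.
    P is the unit vector perpendicular to that segment, chosen so that
    the left eye lies on the left of P (an assumed convention).
    """
    mx = (right_eye[0] + left_eye[0]) / 2
    my = (right_eye[1] + left_eye[1]) / 2
    vx = left_eye[0] - right_eye[0]   # vector from right eye to left eye
    vy = left_eye[1] - right_eye[1]
    norm = math.hypot(vx, vy)
    if norm == 0:
        raise ValueError("eye centers coincide")
    # Rotate the right-to-left vector clockwise by 90 degrees.
    return (mx, my), (vy / norm, -vx / norm)
```

For example, with the robot at the origin, a person standing at (0, 2) with the right eye at (-1, 2) and the left eye at (1, 2) yields M = (0, 2) and P = (0, -1), which points back toward the robot.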

The discrimination unit 103 determines which faces are directed toward the robot itself, using the position relative to the robot and the facing direction of each face that the face detection unit 102 detected in the captured image. Specifically, as shown in FIG. 5, taking the robot's own position R (for example, the center point of the head 11) as the reference, it combines the center position of each face, determined from the positions of the eyes (right eye and left eye), with the direction the face is facing, and judges whether that direction includes the robot's own position. Each facing direction is given a predetermined width: a small angle (for example, 5 degrees) to the left and to the right of the direction, measured horizontally with respect to the floor. In this way the unit judges whether each face is turned toward the robot, and faces that are not are excluded from the extraction described below.
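The test of whether a facing direction, widened by a small angle on each side, includes the robot's position R can be sketched as an angle comparison. The sketch assumes floor-plane (x, y) coordinates; the function name and argument layout are hypothetical, and the 5-degree default follows the example value in the text.

```python
import math


def faces_robot(face_center, facing, robot_pos=(0.0, 0.0),
                tolerance_deg=5.0):
    """Return True if robot_pos lies within +/- tolerance_deg of the
    facing direction, as seen from face_center (all in the floor plane)."""
    to_robot = (robot_pos[0] - face_center[0],
                robot_pos[1] - face_center[1])
    if to_robot == (0.0, 0.0) or tuple(facing) == (0.0, 0.0):
        return False  # degenerate input: no meaningful angle
    dot = to_robot[0] * facing[0] + to_robot[1] * facing[1]
    cross = to_robot[0] * facing[1] - to_robot[1] * facing[0]
    # Unsigned angle between the facing direction and the line to the robot.
    angle_deg = abs(math.degrees(math.atan2(cross, dot)))
    return angle_deg <= tolerance_deg
```

Faces for which this returns `False` would be dropped before the nearest-face selection.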

From the faces in the image that the discrimination unit 103 has judged to be turned toward the robot, the extraction unit 104 selects the one closest to the robot and extracts its contour and interior from the image data. Specifically, it identifies the center position of each face, calculates the distance between each center position and the center of the front of the robot's head 11, selects the face with the smallest distance, and extracts the image data of the region that face occupies. Even when the imaged location changes as the head 11 rotates, the unit keeps identifying the region that resembles the extracted image data. Tracking can therefore continue as long as the selected face remains in the image captured by the imaging units. The distance between the center of a face and the robot is calculated by, for example, triangulation.

The speech synthesis unit 105 reads, from the many response sentence data items stored in advance in the storage area, the item most appropriate to the content of the acquired speech as recognized by the speech recognition unit 101, converts it into an audio file, and outputs it through the speaker 115. Where appropriate, the arms (right arm 12 and left arm 13) may also perform gestures and other movements along with the speech output.

The speech recognition robot 1 configured in this way images the persons located near its front, identifies which of the persons in the captured image it should respond to, recognizes the content of the speech uttered by that person, and outputs speech appropriate to that content. The procedure by which the speech recognition robot 1 identifies the person it should respond to, based on the acquired speech, is described in detail below with reference to the flowchart in FIG. 6.

First, when power is supplied, the speech recognition robot 1 enters a state of readiness to acquire speech from its surroundings (step 101). When a person nearby speaks to the robot in this state, the robot acquires the uttered speech through the sound source identification units 113 and 114 (step 102). At this point the sound source identification units 113 and 114 identify the direction of the utterance relative to the robot. The robot then rotates the head 11 so that its front faces the identified direction and starts imaging (step 103).

The image data obtained by the imaging units 111 and 112 is input to the control unit 100, which judges whether the face detection unit 102 can detect a person's face in it (step 104). If at least one face is detected, the discrimination unit 103 determines the orientation of each detected face and judges whether any face is turned toward the speech recognition robot 1 (step 105). If there are such faces, the robot obtains the distance of each from itself and selects the closest one (step 106). If no face is turned toward the robot, it judges that it is not being spoken to and returns to step 101 to prepare to acquire speech again.

Next, once the closest face turned toward the speech recognition robot 1 has been determined, the robot extracts the region that face occupies in the image data and tracks the selected face (step 107). It rotates the head 11 so that the extracted region sits approximately at the center of the image data, which keeps the front of the head 11 (the robot's face) pointed at the face of the person speaking to it. Because the sound source identification units 113 and 114 rotate together with the head 11, the robot also keeps acquiring speech from the direction identified as the sound source.

Tracking of the face of the person speaking to the speech recognition robot 1 continues until that person moves a predetermined distance or more away from the robot. That is, the robot judges whether the distance between the center of the selected face and itself is at or above the predetermined distance (step 108), and continues the conversation as long as the face remains within that distance. If it judges that the face has moved the predetermined distance or more away, it releases tracking (step 109) and judges whether to acquire further speech (step 110). If speech acquisition is to continue, the robot returns to step 101 and the state of readiness to acquire speech. If not, it stops the power supply and ends operation.
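The flow of steps 101 to 110 can be summarized as a transition function, shown below as a minimal sketch rather than the actual controller. The step numbers come from the text; the function name and flag names are hypothetical. Two points are assumptions: the text does not say what happens when no face is detected at step 104 (the sketch returns to step 101), and waiting for someone to speak at step 102 is omitted.

```python
from typing import Optional


def next_step(step: int, *, face_found: bool = False,
              facing_found: bool = False, too_far: bool = False,
              keep_listening: bool = True) -> Optional[int]:
    """Return the step that follows `step`, or None when operation ends."""
    if step in (101, 102, 103):   # ready -> acquire speech -> rotate, image
        return step + 1
    if step == 104:               # any face detected in the image?
        return 105 if face_found else 101   # "no" branch is an assumption
    if step == 105:               # any face turned toward the robot?
        return 106 if facing_found else 101
    if step in (106, 107):        # select nearest face -> track it
        return step + 1
    if step == 108:               # face at or beyond the distance limit?
        return 109 if too_far else 107
    if step == 109:               # tracking released
        return 110
    if step == 110:               # acquire further speech?
        return 101 if keep_listening else None
    raise ValueError(f"unknown step {step}")
```

Steps 107 and 108 form the tracking loop: the robot keeps returning to step 107 until the distance check at step 108 fails.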

Thus, while it is tracking the person speaking to it, the speech recognition robot 1 recognizes the words that person utters and continues the conversation by uttering response sentences appropriate to their content; when tracking is released, it ends the conversation.

The speech recognition robot 1 also stops tracking when the selected face is no longer captured by the imaging units 111 and 112, for example when an obstacle comes between the robot and the person, or when the person who spoke to the robot moves to a position the imaging units cannot capture, such as behind the robot. In these cases the robot treats the distance to the selected face as being at or above the predetermined distance. In other words, the robot stops tracking once it can no longer identify the face it is tracking. The judgment that the tracked face can no longer be identified is preferably made only after confirming that the region representing the selected face has been absent from the image data captured by the imaging units 111 and 112 for a short time (for example, several seconds). The conversation between the robot and the person then continues uninterrupted when an obstacle passes between them only momentarily.
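The short confirmation period before declaring the face lost behaves like a debounce timer. The following sketch assumes that the caller supplies a timestamp with each frame; the class name, method name, and the 3-second default are illustrative choices, the text giving only "several seconds" as an example.

```python
class LossTimer:
    """Declare the tracked face lost only after it has been missing
    continuously for grace_s seconds."""

    def __init__(self, grace_s: float = 3.0):
        self.grace_s = grace_s
        self._missing_since = None  # time at which the face first vanished

    def update(self, face_visible: bool, now_s: float) -> bool:
        """Feed one frame; return True when tracking should stop."""
        if face_visible:
            self._missing_since = None   # any sighting resets the timer
            return False
        if self._missing_since is None:
            self._missing_since = now_s
        return now_s - self._missing_since >= self.grace_s
```

With this arrangement, a brief occlusion shorter than the grace period leaves tracking, and therefore the conversation, undisturbed.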

In this embodiment, the distance between the selected face and the robot at which tracking stops is preferably set to the longest distance at which a person's face can be detected in the image captured by the imaging unit. This detectable distance may be determined based on, for example, the resolution of the imaging unit and the ambient brightness.
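As a rough illustration of how resolution could bound the detectable distance, the following Python sketch applies a simple pinhole camera model. The patent gives no formula, so everything here is an assumption: the function name, the 0.16 m face width, and the 24-pixel minimum face size are hypothetical, and ambient brightness is not modeled.

```python
import math


def max_detection_distance_m(image_width_px: int, hfov_deg: float,
                             face_width_m: float = 0.16,
                             min_face_px: float = 24.0) -> float:
    """Distance at which a face shrinks to `min_face_px` pixels wide (pinhole model)."""
    # Focal length in pixels, derived from the horizontal field of view.
    focal_px = (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    # Projected width = focal_px * face_width_m / distance; solve for distance.
    return focal_px * face_width_m / min_face_px
```

Under this model, doubling the horizontal resolution at a fixed field of view doubles the distance at which a face still covers the minimum number of pixels.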

In this embodiment, the voice recognition unit converts the acquired voice into voice data, divides that data into syllables, and replaces the divided syllables with words. The present invention is not limited to this approach, however, and many of the speech recognition methods in current use may be substituted. Likewise, the method of selecting a response sentence for the recognized speech content is not limited to that of the embodiment, and other methods may be applied.

In the embodiment described above, the face detection unit, which detects only the face portion of a person from the image data, extracts the eyes and mouth (lips) contained in the face and estimates the edge corresponding to the facial contour from their positions. The present invention is not limited to this. For example, a face may instead be detected from the facial contour itself or from other parts within the face.

The extraction unit in the embodiment described above calculates the distance between the robot and the center of each face in the image and selects the face for which that distance is smallest, but the present invention is not limited to this. For example, the extraction unit may extract the face whose determined orientation is closest to the direction from that face toward the robot. One conceivable method is to draw, for each face, a virtual straight line connecting the center of the face and the robot, obtain the angle between that line and the orientation of the face, and extract the face for which the angle is smallest.
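The angle-based alternative can be sketched as follows. This Python example is only an illustration under stated assumptions: faces are reduced to a 2D position in a robot-centered frame plus a heading angle, and the names `Face`, `angle_to_robot`, and `pick_most_facing` are hypothetical.

```python
import math
from typing import NamedTuple, Sequence


class Face(NamedTuple):
    x: float        # face center in the robot frame (robot at the origin), meters
    y: float
    heading: float  # direction the face points, radians, in the same frame


def angle_to_robot(face: Face) -> float:
    """Angle between the face's heading and the line from the face to the robot."""
    to_robot = math.atan2(-face.y, -face.x)
    diff = face.heading - to_robot
    # Wrap to [-pi, pi] so headings on either side of the line compare equally.
    return abs(math.atan2(math.sin(diff), math.cos(diff)))


def pick_most_facing(faces: Sequence[Face]) -> Face:
    """Return the face whose heading deviates least from the line to the robot."""
    return min(faces, key=angle_to_robot)
```

A face looking straight at the robot yields an angle near zero, so it is preferred over a nearer face that is turned away.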

Instead of extracting and tracking only the face portion in the image, the extraction unit may treat the person's entire body, including the face, as the tracking target. Then, even if identification of the region occupied by the face fails, the location of that region can be estimated from the extracted body, making it easier to resume tracking.

Embodiment 2 of the Invention
Next, with reference to FIGS. 7 and 8, an example in which a voice recognition robot according to a second embodiment of the present invention is applied as a guide robot that guides people within a room R will be described. In this embodiment, components that are the same as or similar to those of the preceding embodiment are denoted by the same reference numerals, and their description is omitted.

As shown in FIG. 7, the voice recognition robot 1' according to this embodiment has a body 10, which carries a head 11 and arms, fitted with a pair of wheels 21 and 22 and a caster 23 as moving means. The robot can move in accordance with external commands and also moves autonomously based on its external environment. The wheels 21 and 22 rotate independently of each other, which allows the robot 1' to move to the left and right and to turn. Supported by the wheels 21 and 22 and the caster 23, the robot 1' can move about the room R with its whole structure held level.

Although not illustrated, the body 10 houses independent motors serving as drive units for the wheels 21 and 22, together with a CPU serving as a control unit for those motors and a storage unit that stores map information for the room R. The distance traveled, speed, and direction of the robot 1' are calculated from the rotation counts of the wheels 21 and 22 to obtain position information for the robot 1', and the robot's own position is determined by combining this with the map information.
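Position estimation from wheel rotation counts is commonly done with differential-drive dead reckoning, and the Python sketch below shows one such update. The patent does not specify the calculation, so the midpoint-heading approximation, the function name `update_pose`, and the parameters `wheel_radius` and `track_width` are assumptions for illustration.

```python
import math
from typing import Tuple

Pose = Tuple[float, float, float]  # x [m], y [m], heading [rad]


def update_pose(pose: Pose, rev_left: float, rev_right: float,
                wheel_radius: float, track_width: float) -> Pose:
    """Advance the pose by the wheel revolutions counted since the last update."""
    x, y, theta = pose
    d_left = 2.0 * math.pi * wheel_radius * rev_left    # distance rolled, left wheel
    d_right = 2.0 * math.pi * wheel_radius * rev_right  # distance rolled, right wheel
    d_center = (d_left + d_right) / 2.0                 # travel of the axle midpoint
    d_theta = (d_right - d_left) / track_width          # change in heading
    mid = theta + d_theta / 2.0                         # midpoint-heading approximation
    return (x + d_center * math.cos(mid),
            y + d_center * math.sin(mid),
            theta + d_theta)
```

Equal revolutions on both wheels move the robot straight ahead, while equal and opposite revolutions turn it in place. Dividing the distance by the elapsed time would give the speed mentioned in the text.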

The procedure by which the voice recognition robot 1' configured in this way images people located near its front and identifies, among the multiple people in the captured image, the person to whom it should respond will now be described in detail with reference to the flowchart shown in FIG. 8. In this embodiment, the procedure for detecting faces in the captured image, the determination of the position and orientation of each detected face, and the procedure for extracting a detected face use the same methods as those described for the preceding embodiment.

As in the preceding embodiment, the voice recognition robot 1' is placed at a predetermined initial position in the room R and, when power is supplied, prepares a state for acquiring voice from its surroundings (step 201). It is assumed that map information for the room R has been stored in the voice recognition robot 1' in advance and that the robot has been made to recognize its initial position.

In this state, when a person near the voice recognition robot 1' speaks to it, the robot acquires the person's voice with the sound source identification units 113 and 114 (step 202). The sound source identification units 113 and 114 identify the direction from which the acquired voice was uttered (the direction relative to the robot). The robot then drives the wheels 21 and 22 so that the front of the body 10 faces the identified direction and starts imaging with the imaging units 111 and 112 provided in the head 11 (step 203).

The image data obtained by the imaging units 111 and 112 is input to the control unit 100, which determines whether the face detection unit 102 can detect a person's face in the image data (step 204). If at least one face is detected, the discrimination unit 103 determines the orientation of each detected face, and it is determined whether any face is turned toward the voice recognition robot 1' (step 205). If there are faces turned toward the robot, the distance from the voice recognition robot 1' to each of them is obtained and the nearest face is selected (step 206). If no face is turned toward the voice recognition robot 1', the robot judges that it has not been spoken to and returns to preparing to acquire voice.
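The selection logic of steps 204 to 206 can be summarized in a short Python sketch. It assumes that face detection and orientation discrimination have already produced a distance and a facing flag for each face; the names `DetectedFace` and `select_speaker` are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class DetectedFace(NamedTuple):
    distance_m: float   # distance from the robot to the face center
    facing_robot: bool  # result of the orientation check (step 205)


def select_speaker(faces: Sequence[DetectedFace]) -> Optional[DetectedFace]:
    """Return the nearest face turned toward the robot, or None if there is none."""
    candidates = [f for f in faces if f.facing_robot]
    if not candidates:
        return None  # not being spoken to: go back to waiting for voice
    return min(candidates, key=lambda f: f.distance_m)  # step 206
```

A `None` result corresponds to the branch in which the robot returns to preparing to acquire voice.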

Once the nearest face turned toward the voice recognition robot 1' has been determined, the region that face occupies in the image data is extracted and the selected face is tracked (step 207). Rotation of the head 11 is combined with driving of the wheels 21 and 22 so that the region extracted as the selected face stays at approximately the center of the image data. In this way the front of the head 11 (the front of the robot's face) always faces the face of the person speaking to the voice recognition robot 1'. Because the sound source identification units 113 and 114 rotate together with the head 11, the robot also keeps acquiring voice from the direction identified as the sound source.
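One way to keep the face region centered is to convert its horizontal pixel offset into a rotation and share that rotation between the head and the wheels. The Python sketch below is an assumption-laden illustration: the linear pixel-to-angle mapping, the rule that the head turns first up to a limit, and the names `pan_correction_deg` and `split_rotation` are not taken from the patent.

```python
from typing import Tuple


def pan_correction_deg(face_cx_px: float, image_width_px: int,
                       hfov_deg: float) -> float:
    """Approximate rotation that moves the face center to the image center.

    Positive means the face lies to the right of center. The mapping is a
    linear approximation across the horizontal field of view.
    """
    offset_px = face_cx_px - image_width_px / 2.0
    return offset_px / image_width_px * hfov_deg


def split_rotation(total_deg: float, head_deg: float,
                   head_limit_deg: float) -> Tuple[float, float]:
    """Return (new head angle, base turn): the head turns first, the wheels do the rest."""
    desired = head_deg + total_deg
    new_head = max(-head_limit_deg, min(head_limit_deg, desired))
    return new_head, desired - new_head
```

With a 640-pixel-wide image and a 60-degree field of view, a face centered at pixel 480 calls for a rotation of 15 degrees. If the head is already near its limit, the remainder is passed to the wheels.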

The voice acquired from the tracked person is then analyzed to identify the destination to which the person is to be guided (step 208). Once the destination is identified, the voice recognition robot 1' continues tracking the face of the person speaking to it and, keeping the front of the head 11 turned toward that person, drives the wheels 21 and 22 to adjust its own position so that it is at a predetermined distance from the tracked person. To guide the person to the destination in the room R, the robot then drives the wheels 21 and 22 and starts moving toward the destination while keeping its distance from the tracked person constant (step 209).
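Holding a constant distance while leading the person can be approximated with a proportional speed rule, sketched below in Python. The patent does not describe the control law, so the function name `guide_speed`, the gain, and the speed limits are hypothetical values chosen for the example.

```python
def guide_speed(distance_m: float, target_m: float, cruise_mps: float = 0.5,
                gain: float = 0.8, max_mps: float = 0.8) -> float:
    """Forward speed toward the destination that keeps the person at `target_m`.

    The robot slows down when the person falls behind and speeds up,
    within limits, when the person closes in.
    """
    error = distance_m - target_m      # > 0: the person has fallen behind
    speed = cruise_mps - gain * error  # proportional correction around cruise speed
    return max(0.0, min(max_mps, speed))
```

When the person stops and the gap grows, the commanded speed drops to zero, so the robot waits instead of leaving the person behind.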

During the guidance operation, the robot checks whether the tracked person has given a voice instruction to stop the guidance (step 210). If such an instruction is given, the robot stops the guidance operation and returns to the initial position in the room R (step 310). If not, it continues moving to the destination.

On reaching the destination, the robot asks the tracked person to confirm that the guidance operation may end (step 211). If it may, the robot cancels tracking and returns to the initial position (step 212). If guidance is to continue, the process returns to step 208: the robot recognizes the voice of the tracked person, obtains a new destination, and resumes the guidance operation. The voice recognition robot 1' thus continues its guidance operation in the room R until the decision on whether to acquire the next voice (step 213) yields a command to stop.

According to this embodiment, the robot can keep moving to the desired location while holding the tracked person at the center of the image and continually confirming the position of the person being guided. Guidance can therefore be carried out with the robot's movement matched to the movement of the person being guided. This reduces the cases in which the person loses sight of the robot during guidance or, conversely, the robot loses sight of the person it is tracking, so the guidance operation can be carried out more reliably.

In the embodiment described above, the robot identifies its own position by obtaining the distance traveled and the like from the wheel rotation counts. Alternatively, an externally installed position recognition system may determine the robot's position and transmit a position signal to the robot so that the robot recognizes its own position. The robot can then recognize a more accurate self-position in real time.

The moving means of the robot is not limited to a combination of wheels and a caster. It may be an inverted-pendulum type moving means consisting of wheels only, or a walking type moving means that moves by driving legs.

The means provided on the robot for recognizing the surrounding external environment is not limited to the imaging unit. A laser range sensor or an optical camera such as a CCD camera may be provided separately, or a base station outside the robot may transmit such external environment information to the robot.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is an overall schematic view showing a voice recognition robot according to a first embodiment of the present invention installed in a room.
FIG. 2 is a schematic view of the voice recognition robot shown in FIG. 1.
FIG. 3 is a block diagram conceptually showing the inside of the control unit provided in the voice recognition robot shown in FIG. 1.
FIG. 4 is a diagram showing how the face detection unit provided in the voice recognition robot shown in FIG. 1 obtains the orientation of each person's face.
FIG. 5 is a diagram showing how the discrimination unit provided in the voice recognition robot shown in FIG. 1 identifies faces turned toward the robot.
FIG. 6 is a flowchart showing the procedure by which the voice recognition robot shown in FIG. 1 identifies, based on the acquired voice, the person to whom it should respond.
FIG. 7 is a schematic view of a voice recognition robot according to a second embodiment of the present invention.
FIG. 8 is a flowchart showing the procedure by which the voice recognition robot shown in FIG. 7 performs a guidance operation while identifying, based on the acquired voice, the person to whom it should respond.

Explanation of symbols

1, 1' ... Voice recognition robot
100 ... Control unit
101 ... Voice recognition unit
102 ... Face detection unit
103 ... Discrimination unit
104 ... Extraction unit
105 ... Speech synthesis unit
111, 112 ... Imaging unit
113, 114 ... Sound source identification unit
115 ... Speaker

Claims

A voice recognition robot comprising:
a sound source identification unit that identifies the direction from which a voice is generated;
a voice recognition unit that receives the generated voice and recognizes its content;
an imaging unit that captures images in at least two directions and acquires the captured images as image data;
a face detection unit that detects the face of a person present in a captured image;
a discrimination unit that determines the position and orientation of each detected face; and
an extraction unit that extracts a detected face,
wherein the imaging unit images the direction, identified by the sound source identification unit, from which the voice was generated,
the face detection unit detects the faces of persons present in the image captured by the imaging unit,
the discrimination unit identifies, among the detected faces, the faces turned toward the robot,
the extraction unit extracts, among the identified faces, the face nearest to the robot, and
the imaging direction of the imaging unit is changed in accordance with the position of the extracted face so that the extracted face is tracked and continuously imaged.

The voice recognition robot according to claim 1, wherein the tracking is stopped when the extracted face goes out of a region defined around the robot.

The voice recognition robot according to claim 1, wherein the sound source identification unit is composed of one or more microphones having directivity.

The voice recognition robot according to any one of claims 1 to 3, wherein the face detection unit detects an entire face by identifying the positions of the eyes and mouth included in a person's face and estimating the facial contour from those positions.

The voice recognition robot according to claim 1, wherein the discrimination unit determines the orientation of the detected face based on the positions, as seen from the robot, of both eyes included in the detected face.

The voice recognition robot according to claim 1, wherein the direction in which the detected face is captured is changed so that the detected face continues to be positioned at substantially the center of the captured image.

The voice recognition robot according to any one of claims 1 to 6, wherein the voice recognition robot further includes a moving unit, and is configured to be movable within a predetermined area.

A method for controlling a voice recognition robot, comprising the steps of:
identifying the direction from which a voice was generated;
receiving the generated voice and recognizing its content;
imaging the direction from which the voice was generated and acquiring the captured image as image data;
detecting the faces of persons present in the captured image;
determining the orientation of each detected face and identifying the faces turned toward the robot;
extracting, among the identified faces, the face nearest to the robot;
controlling the imaging direction so as to track the extracted face; and
tracking the extracted face so that it continues to be imaged.

The method of controlling a voice recognition robot according to claim 8, further comprising a step of stopping tracking when the extracted face moves a predetermined distance or more away from the robot.