JP2009222969A - Speech recognition robot and control method for speech recognition robot

Info

Publication number: JP2009222969A
Application number: JP2008067103A
Authority: JP
Inventors: Ryo Murakami; 涼村上
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2008-03-17
Filing date: 2008-03-17
Publication date: 2009-10-01

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition robot, and a control method for the speech recognition robot, capable of making a speaker understand the cause of a speech recognition failure when speech recognition fails.

SOLUTION: The speech recognition robot includes: a receiving section for receiving speech uttered by the speaker; a speech recognition section for recognizing the content of the received speech; an imaging section for capturing an image in the direction from which the speech is received and acquiring the captured image as image data; a face detecting section for detecting the face of the speaker present in the captured image; an extracting section for extracting the motion of a specified part from the detected face; a determining section for determining a speech receiving status based on the extracted motion of the specified part; and an output section for outputting a warning signal based on the determined speech receiving status when speech recognition fails. By confirming that the received speech was actually uttered by the speaker, the robot outputs a warning signal that makes the speaker understand the cause of the speech recognition failure.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a conversational speech recognition robot that recognizes the content of speech uttered by a speaker and responds to the speaker, and to a method for controlling such a speech recognition robot.

In recent years, conversational speech recognition systems have been under development that converse with a human by receiving what the human (speaker) says as speech data, recognizing its content, and outputting a corresponding response sentence as speech. Such a speech recognition system stores a large number of speech data items to be uttered as response sentences in an internal storage area, selects the speech data most closely associated with the content of the recognized speech, and utters the selected data as speech. The system acquires the spoken content as speech data, divides this speech data into phrases of a predetermined language, and selects the response sentence best suited to the content based on the order and proximity of the phrases (see, for example, Patent Document 1).

Meanwhile, robots equipped with a speech recognition function based on such systems are coming into use for customer service and similar tasks. Because such a robot cannot always understand what a human says to it, when speech recognition fails the robot is made to perform a characteristic motion so that the person who spoke understands that recognition failed (see, for example, Patent Document 2).

Patent Document 1: JP 2004-109323 A
Patent Document 2: JP 2002-116792 A

However, a robot with such a speech recognition function may fail to recognize speech not only because of the loudness (volume) or content of the human's speech, but also because of a failure of the audio input hardware (microphone) or a software problem such as a faulty driver. For failures whose cause is something other than the loudness or content of the utterance, correcting the input speech does not make recognition succeed. Consequently, when the robot merely signals by a motion that recognition failed, as described above, recognition fails again even if the speaker adjusts his or her voice.

The present invention has been made to solve this problem, and its object is to provide a speech recognition robot, and a method for controlling a speech recognition robot, that can make the speaker understand the cause of a speech recognition failure when speech recognition does not succeed.

A speech recognition robot according to the present invention includes a receiving unit that receives speech uttered by a speaker and a speech recognition unit that recognizes the content of the received speech, and is characterized by further including: an imaging unit that captures an image in the direction from which the speech was received and acquires the captured image as image data; a face detection unit that detects the face of the speaker present in the captured image; an extraction unit that extracts the motion of a specific part from the detected face; a determination unit that determines the reception state of the speech based on the extracted motion of the specific part; and an output unit that, when speech recognition does not succeed, outputs a warning signal based on the determined reception state of the speech.

When recognition of the received speech does not succeed, such a speech recognition robot can confirm that the received speech was actually uttered by the speaker. Therefore, when speech recognition fails even though the speaker did utter speech, the robot can conclude that the cause lies not in the loudness or content of the speaker's utterance but in the hardware or software inside the robot. By including in the output warning signal a statement to that effect, the robot can make the speaker understand the cause of the speech recognition failure.

The specific part extracted by the extraction unit may be any part from which it can be determined that the speaker uttered speech, but it is preferably the lips included in the detected face. By extracting the motion of the speaker's lips and determining from that motion whether the received speech was uttered by the speaker, the robot can confirm that the received speech was actually uttered by the speaker.

When the extracted specific part is the lips, the determination unit may determine the reception state of the speech based on the degree to which the extracted lips open and close. By determining whether the speaker's lips were opening and closing when the speech was received, the robot can easily confirm that the received speech was actually uttered by the speaker.

The face detection unit may detect the speaker's face by any means; for example, it may identify the positions of the eyes and lips included in the speaker's face and estimate the contour of the face from these positions, thereby detecting the face as a whole. In this way, even when another person or object is near the face of the person (speaker) and the contour of the face cannot be determined merely by measuring the distance from the robot, the contour can be determined just by detecting the positions of the eyes and mouth within the face.

Such a speech recognition robot preferably further includes a direction recognition unit that recognizes the direction of the detected face based on the positions of both eyes included in the face detected by the face detection unit. This makes it possible to determine the orientation of the speaker's face easily, without considering the position of the whole face or the speaker's whole body, and, when several people are present in the direction from which the speech was received, to identify the speaker by recognizing the direction of each detected face. One or more directional microphones may be used as the means for identifying the direction from which the speech was received. A speech recognition robot configured in this way is preferable because it can detect the direction of the received speech easily and accurately.

It is further preferable that the imaging unit be configured to change its imaging direction so that the face detected by the face detection unit remains substantially at the center of the captured image. In this way, even if the speaker moves while talking to the robot and changes position relative to it, the robot can follow the movement and keep detecting the speaker's face. Moreover, when such a speech recognition robot is of a humanoid type and the imaging unit is provided at a position corresponding to the robot's face, the robot can always face the person it is conversing with (the speaker) while that person is speaking, which also gives the impression that the robot is conversing while looking at the speaker's face.

Such a speech recognition robot may be fixed to a floor surface or the like, or it may include moving means and be configured to move within a predetermined area. Because such a robot can change its own position and move while keeping the speaker identified, it can be used, for example, as a guide robot that gives guidance while moving around a venue.

The present invention also provides a robot control method for controlling a speech recognition robot that receives speech uttered by a speaker and recognizes its content. Specifically, the robot control method is characterized by including: an imaging step of capturing an image in the direction from which the speech was received and acquiring the captured image as image data; a face detection step of detecting the face of the speaker present in the captured image; an extraction step of extracting the motion of a specific part from the detected face; a determination step of determining the state of the speech input based on the extracted motion of the specific part; and an output step of outputting a warning signal based on the determination result when speech recognition does not succeed.

With such a robot control method, when recognition of the received speech does not succeed, it can be confirmed that the received speech was actually uttered by the speaker. Therefore, when speech recognition fails even though the speaker did utter speech, it follows that the cause lies not in the loudness or content of the speaker's utterance but in the hardware or software inside the robot. By including in the output warning signal a statement to that effect, the speaker can be made to understand the cause of the speech recognition failure.

The specific part extracted in the extraction step may be any part from which it can be determined that the speaker uttered speech, but it is preferably the lips included in the detected face. By extracting the motion of the speaker's lips and determining from that motion whether the received speech was uttered by the speaker, it can be confirmed that the received speech was actually uttered by the speaker.

In the determination step, the reception state of the speech may be determined based on the degree to which the lips open and close. By determining whether the speaker's lips were opening and closing when the speech was received, it can easily be confirmed that the received speech was actually uttered by the speaker.

As described above, according to the present invention, a speaker who talks to the speech recognition robot can understand why the robot's speech recognition failed.

Embodiment 1 of the Invention
A speech recognition robot and a method for controlling the speech recognition robot according to Embodiment 1 of the present invention are described below with reference to FIGS. 1 to 5. In this embodiment, the speech recognition robot (hereinafter simply "the robot") is described using an example in which it moves by wheel drive and has a humanoid upper body.

The robot 10 shown in FIG. 1 has a humanoid upper body and includes a head 11, a body 12, a right arm 13, a left arm 14, a waist 15, and a wheel drive unit 20 serving as moving means.

The head 11 is provided with cameras 111 and 112 as the imaging unit on the left and right of its front surface, an antenna 113 on its side surface, a loudspeaker 114 as the output unit at the lower front, and microphones 115 and 116 as the receiving unit for inputting external audio signals. The head 11 is connected to the body 12 so that it can rotate left and right within a plane horizontal to the floor surface. By rotating the head 11, the robot can change the imaging range of the cameras 111 and 112 according to the situation and grasp the surrounding environment.

The cameras 111 and 112 provided on the head 11 observe the surrounding environment and output the resulting image data to a control computer described later. As such a camera, for example, a digital camera that images the surrounding environment and acquires the captured image as a digital signal can be used.

The antenna 113 is used to receive a position signal for recognizing the absolute position of the robot 10 and to transmit signals indicating the robot's current position and state. This information is exchanged with a robot monitoring system (not shown). The control computer described later recognizes the robot's absolute position in real time based on the position signal received by the antenna 113. The position information obtained in this way is used when determining the route and direction of travel.

The loudspeaker 114 is provided at the lower front of the head 11 and outputs speech data selected as appropriate from the speech data files held in the storage area of the control computer. The speech files output from the loudspeaker 114 include, in addition to information such as guidance, a plurality of files containing responses to be given when the robot is spoken to, as described later. Of these speech files, an appropriate one is selected for the speaker and output through the loudspeaker 114.

The microphones 115 and 116 each consist of a plurality of directional microphones, each able to pick up sound from a particular direction, arranged horizontally, so that the robot can roughly identify the direction, relative to the robot 10, from which a sound uttered in the surroundings arrived. The microphones 115 and 116 are provided on the left and right side surfaces of the head 11; they pick up sound produced around the robot 10, capture it as speech data, and output it to the control computer 120.

The body 12 houses the control computer 120 mentioned above and includes a battery (not shown) for supplying power to each component of the robot.

As shown in FIG. 2, the body 12 contains the control computer 120, which serves as a control unit that recognizes the content of the digital image signals input from the cameras 111 and 112 and of the audio signals input from the microphones 115 and 116 and causes the robot to act appropriately, together with a battery (not shown) that supplies power to operate each component, including the control computer 120. An arithmetic processing unit (not shown) in the control computer 120 uses image processing to detect, from the digital image signals input from the cameras 111 and 112, the face of the speaker who uttered the audio signal, and then extracts the eyes and lips from the detected face. The detailed image processing procedure is described later.

The right arm 13 and the left arm 14 are attached to the left and right side surfaces of the body 12 and have a plurality of joints, at the elbow, wrist, fingers, and elsewhere, driven by motors (not shown). By changing the drive amounts of these joints in response to signals from the control computer 120, the arms change posture and can perform desired actions such as gripping an object or pointing in a direction. The shape of each arm driven by the joints is stored in the control computer in advance, and when an action is performed by driving the joints, the arithmetic processing unit calculates the space the arm occupies during that action.

The waist 15 is fixed above the wheel drive unit 20 and is attached to the bottom surface of the body 12 so that it can be rotated by the driving force of a motor or the like, allowing the relative posture of the wheel drive unit 20 and the body 12 to be changed.

As shown in FIG. 2, the wheel drive unit 20 is configured as an opposed two-wheel vehicle with a pair of opposed wheels 21, 21 and a caster 22 in front of them. The robot 10 can move with its posture supported horizontally by the wheels 21, 21 and the caster 22. Inside the wheel drive unit 20 are motors 23, 23 that drive the wheels 21, 21 respectively and counters 24, 24 that detect the rotation count of each wheel. In the wheel drive unit configured in this way, the control computer 120 controls the drive amounts of the wheels 21, 21 independently, so the robot can travel straight, move along a curve (turn), reverse, and rotate in place (turn about the midpoint between the two wheels), and the speed and direction of travel are determined autonomously.

As shown in FIG. 3, the control computer 120 includes: a speech recognition unit 121 that recognizes the content of input speech data; a face detection unit 122 that detects the face of a person present in an image captured by the cameras 111 and 112 provided on the head 11; a direction recognition unit 123 that recognizes the orientation of the face detected by the face detection unit 122; an extraction unit 124 that extracts the lips, as the specific part, from the detected face; a determination unit 125 that determines the state of speech reception based on the extracted motion of the specific part; a speech synthesis unit 126 that creates response sentence data to be output as the warning signal; and a control section 127 having a storage area 127a that stores predetermined programs, a response sentence database made up of a plurality of response sentence data items, and the like.

The speech recognition unit 121 converts the speech acquired from the microphones 115 and 116 into speech data such as a WAVE file, divides the speech data into utterance sections, and replaces each syllable with a word using the word database stored in the storage area 127a. It then analyzes the words in each utterance section and their order, and selects, from the many sentences stored in the storage area, the sentence closest to the analyzed speech data. If the degree of similarity between the selected sentence and the speech data is at or above a predetermined value, the unit recognizes the analyzed speech data as having the same content as the selected sentence and outputs a signal indicating that the acquired speech equals the selected sentence. If even the closest sentence falls short of the predetermined degree of similarity, the unit treats the corresponding sentence as not stored in the storage area and outputs a signal indicating that the content of the acquired speech could not be recognized.
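The closest-sentence selection with a similarity threshold can be sketched as follows. This is a minimal illustration, not the method of the specification: the function name `recognize`, the use of `difflib.SequenceMatcher` as the similarity measure, and the threshold value 0.8 are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def recognize(words, known_sentences, threshold=0.8):
    """Return the stored sentence closest to `words`, or None on failure.

    `words` is the word list obtained from one utterance section.
    The similarity measure and threshold are illustrative assumptions.
    """
    best_sentence, best_score = None, 0.0
    for sentence in known_sentences:
        # ratio() reflects both which words match and their order.
        score = SequenceMatcher(None, words, sentence.split()).ratio()
        if score > best_score:
            best_sentence, best_score = sentence, score
    # Below the threshold, report that the content was not recognized.
    return best_sentence if best_score >= threshold else None


known = ["where is the exit", "what time is it"]
print(recognize("where is the exit".split(), known))
print(recognize("please sing a song".split(), known))
```

Returning `None` plays the role of the signal indicating that the content of the acquired speech could not be recognized.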

The process of dividing the speech data into utterance sections can use, for example, speech features expressed as MFCC (Mel-Frequency Cepstrum Coefficient) parameters. As one example of the MFCC parameters, 16 [bit], 16 [kHz] speech data may be represented, in frames of a predetermined short interval (for example, 20 [ms]), by a 25-dimensional feature vector consisting of 12 static MFCC dimensions, 12 dynamic MFCC dimensions, and 1 power dimension. These MFCC parameters are calculated from the input speech data, and speech-section discrimination data consisting of 25 normal distributions is calculated from the MFCC parameters. A similarity (for example, using the Mahalanobis distance) between the discrimination data and the MFCC parameters is then calculated for each dimension, and the average obtained by repeating this calculation for a fixed time is compared with a predetermined threshold to determine the utterance section.
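The per-dimension similarity and thresholding step might look like the sketch below. It assumes one normal distribution (mean and variance) per feature dimension, uses the one-dimensional Mahalanobis distance |x - mu| / sigma, and assumes that a small average distance indicates speech; the text does not state the direction of the comparison, and the function names are hypothetical. MFCC extraction itself is not shown.

```python
import math


def mean_distance(frames, means, variances):
    """Average per-dimension Mahalanobis distance over all frames."""
    total, count = 0.0, 0
    for frame in frames:
        for x, mu, var in zip(frame, means, variances):
            # One-dimensional Mahalanobis distance: |x - mu| / sigma.
            total += abs(x - mu) / math.sqrt(var)
            count += 1
    return total / count


def is_speech_section(frames, means, variances, threshold):
    """Assumption: frames close to the speech model count as speech."""
    return mean_distance(frames, means, variances) < threshold
```

In the setting described above, each frame would hold 25 feature values and `means` and `variances` would each hold 25 entries.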

The speech recognition unit 121 also has a function for determining whether the input audio data is a voice input. Any information processing may be used for this determination. For example, the unit may count the number of times the input audio signal crosses the zero line within a fixed time (the zero-crossing count) and judge the input to be voice when this count exceeds a predetermined threshold, or it may calculate the mean square of the amplitude of the input audio data over a fixed time and judge the input to be voice when that value exceeds a threshold.
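Both voice-input checks are simple to express in code. The sketch below assumes the audio for the fixed time window is already available as a list of signed sample values; the function names and thresholds are illustrative only.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive samples in the window."""
    return sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )


def mean_square(samples):
    """Mean of the squared amplitudes in the window."""
    return sum(s * s for s in samples) / len(samples)


def is_voice_by_zero_crossings(samples, min_crossings):
    # Voice input is assumed when the count exceeds the threshold.
    return zero_crossings(samples) > min_crossings


def is_voice_by_power(samples, min_power):
    # Voice input is assumed when the mean square exceeds the threshold.
    return mean_square(samples) > min_power
```

Either check can be used on its own, as the text presents them as alternatives.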

The face detection unit 122 detects only the face portion of a person from the image data obtained by the cameras 111 and 112. The face detection unit 122 first extracts the eyes and mouth (lips) in the person's face and, based on their positions, estimates the edge corresponding to the contour of the face. The region enclosed by the estimated contour is then detected as the person's face.

The direction recognition unit 123 estimates which direction a detected face is pointing, as seen from the robot, based on the positions of the eyes in the face detected by the face detection unit 122, that is, their distance and direction relative to the robot. Specifically, it identifies the center positions of the right eye and the left eye in the person's face and the midpoint of the line segment connecting the two centers. Then, within a plane that contains this line segment and is parallel to the floor surface, it finds the direction from the midpoint that is perpendicular to the line segment, and takes this direction as the line-of-sight direction, that is, the direction in which the speaker's face is pointing.
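The midpoint-and-perpendicular construction can be sketched as follows. The coordinate convention is an assumption made for the example: positions are (x, y) points in a top-down view of the floor plane, and "right eye" and "left eye" mean the person's own right and left, which is what selects one of the two perpendicular directions. The function name is hypothetical.

```python
import math


def face_direction(right_eye, left_eye):
    """Return (midpoint, unit facing vector) in the floor plane.

    The facing vector is perpendicular to the segment joining the
    eye centers and points out of the front of the face.
    """
    (rx, ry), (lx, ly) = right_eye, left_eye
    mid = ((rx + lx) / 2.0, (ry + ly) / 2.0)
    dx, dy = lx - rx, ly - ry  # vector from right eye to left eye
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        raise ValueError("eye centers coincide")
    # Rotate the right-to-left vector by 90 degrees clockwise.
    return mid, (dy / norm, -dx / norm)


# A person standing at y = 2 and looking toward the origin.
print(face_direction((-0.03, 2.0), (0.03, 2.0)))
```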

Furthermore, from the position relative to the robot and the facing direction of each face in the captured image detected by the face detection unit 122, the direction recognition unit 123 recognizes whether each detected face is turned toward the robot itself. Specifically, the position of the robot itself (for example, the center point of the head 11) is taken as the reference. Here, the direction in which each face is pointing is given a predetermined width: centered on each direction, a small angle (for example, 5 degrees) is allowed to the left and to the right in the direction horizontal to the floor surface. In this way the unit judges whether each face is turned toward the robot, and a face that is not turned toward the robot is regarded as not belonging to the speaker.
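One way to read the angular tolerance is as a test on the angle between the facing direction and the line from the face to the robot's reference point. The sketch below follows that reading, which is an interpretation rather than something the text spells out; the function name, the default robot position at the origin, and the 2D floor-plane coordinates are assumptions.

```python
import math


def is_facing_robot(face_pos, facing, robot_pos=(0.0, 0.0), tolerance_deg=5.0):
    """True if `facing` points at the robot within the tolerance.

    `face_pos` is the face position and `facing` its direction vector,
    both in floor-plane coordinates.
    """
    to_robot = (robot_pos[0] - face_pos[0], robot_pos[1] - face_pos[1])
    dot = facing[0] * to_robot[0] + facing[1] * to_robot[1]
    cross = facing[0] * to_robot[1] - facing[1] * to_robot[0]
    # Unsigned angle between the two vectors, in degrees.
    angle = math.degrees(math.atan2(abs(cross), dot))
    return angle <= tolerance_deg


print(is_facing_robot((0.0, 2.0), (0.0, -1.0)))  # looking straight at the robot
print(is_facing_robot((0.0, 2.0), (1.0, 0.0)))   # looking sideways
```

Faces for which the function returns `False` would be excluded as candidates for the speaker.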

The extraction unit 124 extracts the lips, as the specific part, from each face in the image that the direction recognition unit 123 has judged to be turned toward the robot. Any process may be used to extract the lips; for example, a large number of lip shape data items may be stored in advance, and the portion of the recognized face that approximates a shape represented by that data may be extracted as the lips. The extraction unit 124 acquires the extracted lip image data as a time series of consecutive frames and stores it in the storage area 127a.

The determination unit 125 determines whether the lips are opening and closing from the consecutive frames of image data showing the lip shape extracted by the extraction unit 124. As shown in FIG. 4, the determination uses, for N consecutively acquired frames, the average of the correlation values between the current frame and each of the preceding frames back to the N-th previous one. Denoting the current frame by p(t), the frame i frames before the current one by R(t−i), and the function that computes the correlation value by S, this average V(t) is expressed by the following (Equation 1):
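(Equation 1) itself is not reproduced in this text. A reconstruction consistent with the definitions above, assuming the preceding frames are indexed i = 1, ..., N and weighted equally, would be:

```latex
V(t) = \frac{1}{N} \sum_{i=1}^{N} S\bigl(p(t),\, R(t-i)\bigr) \qquad \text{(Equation 1)}
```

Under this reading, V(t) stays close to its maximum while the lip image is static and drops when the lip shape changes from frame to frame.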

The calculation of the average V(t) is repeated for each of a predetermined number of frames (M frames), and the correlation average over these M frames is compared with a predetermined threshold; if it is below the threshold, the lips are judged to be opening and closing. The result of this judgment is sent to the speech synthesis unit 126 and used when selecting the speech data to output.
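The correlation-based judgment can be sketched as below. The description does not specify the correlation function S, so Pearson correlation over flattened pixel values is used here purely as an assumption; the function names, the equal weighting of the N preceding frames, and the handling of constant frames are likewise illustrative choices.

```python
from statistics import mean

def correlation(a, b):
    """Pearson correlation of two equal-length pixel sequences (assumed S)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Constant frames: treat identical frames as fully correlated.
        return 1.0 if list(a) == list(b) else 0.0
    return cov / (var_a * var_b) ** 0.5

def average_correlation(frames, t, n):
    """V(t): mean correlation of frame t with each of its n predecessors."""
    return mean(correlation(frames[t], frames[t - i]) for i in range(1, n + 1))

def lips_moving(frames, n, m, threshold):
    """Average V(t) over the last m frames and compare with the threshold."""
    last = len(frames) - 1
    if last - (m - 1) - n < 0:
        raise ValueError("need at least n + m frames")
    values = [average_correlation(frames, last - k, n) for k in range(m)]
    # Low correlation means the lip image is changing, i.e. the lips move.
    return mean(values) < threshold
```

With a sequence of identical lip frames the average stays at 1.0 and no movement is reported; with frames that alternate between two different shapes the average drops and movement is reported.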

The speech synthesis unit 126 reads out, from the many response sentence data stored in advance in the storage area, the response sentence data most appropriate to the content of the acquired speech as recognized by the voice recognition unit 121, converts it into an audio file, and outputs it through the speaker 114. If necessary, gestures or similar movements using the arms (right arm 12, left arm 13) may accompany the audio output. When recognition of the received speech fails, the speech synthesis unit 126 selects appropriate response sentence data based on the determination result sent from the determination unit 125 and outputs it as audio information.

Based on stored programs, the control unit 127 stores, processes, and outputs the image data and audio data described above, and also controls the drive amounts of the motors that drive the arms (right and left arms) and the movement of the wheel drive unit 20. In particular, it determines the movement route as appropriate for the surrounding environment, based on a route-determination program stored in the storage area 127a mentioned above. Although a detailed description is omitted, when moving, the control unit 127 can autonomously select the direction of movement, the route plan, or the actions it can take, based on information about recognized external objects.

Next, the procedure by which the voice recognition robot 10 configured in this way recognizes speech uttered by a speaker, determines the state of the voice input, and then outputs a warning signal based on that determination is described with reference to the flowchart shown in FIG. 5.

As shown in FIG. 5, the voice recognition robot first waits for sound input. When, in this state, it receives a sound directed at it from outside (STEP 101), the microphones 115 and 116 identify the direction from which the received sound was emitted (the direction in which the robot received the sound) (STEP 102). The head 11 is then rotated so that its front faces the identified direction, and the cameras 111 and 112 image the direction from which the sound came to acquire image data (STEP 103). The image data obtained by the cameras 111 and 112 is sent to the control unit 127 and stored in the storage area 127a.

Next, it is determined whether the face detection unit 122 can detect a person's face in the stored image data (STEP 104). If at least one face is detected in the image data, the direction recognition unit 123 recognizes the orientation of each detected face and determines whether any face is facing the voice recognition robot 10 (STEP 105). If no face can be detected in the image data, the robot returns to the initial state of waiting for sound input.

If there are faces facing the voice recognition robot 10, the distance of each such face from the robot is obtained and the nearest face is selected (STEP 106). In this way, the robot can identify which of several people in the captured image it should respond to. On the other hand, if no face facing the robot can be detected in STEP 105, or if it is determined in STEP 106 that no face facing the robot exists, the robot judges that no speaker has addressed it and returns to the initial state of waiting for sound input.
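The selection in STEP 106 amounts to filtering the detected faces by orientation and taking the nearest one. A minimal sketch follows, assuming each face is given as floor-plane coordinates relative to the robot plus a flag from the direction recognition step; the `Face` type and `select_speaker` function are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Face:
    x: float             # position relative to the robot (assumed units: meters)
    y: float
    facing_robot: bool   # result of the direction recognition step

def select_speaker(faces: Sequence[Face]) -> Optional[Face]:
    """Return the nearest face that is facing the robot, or None."""
    candidates = [f for f in faces if f.facing_robot]
    if not candidates:
        # No one is addressing the robot: the caller returns to waiting.
        return None
    return min(candidates, key=lambda f: math.hypot(f.x, f.y))
```

Returning `None` corresponds to the branch in which the robot goes back to waiting for sound input.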

Next, only the lip region is extracted from the selected face (STEP 107), and it is determined whether the extracted lips are opening and closing (STEP 108). If they are not, the robot judges that no speech is being directed at the voice recognition robot 10, returns to STEP 101, and waits until it receives sound. If the lips are opening and closing, the voice recognition unit 121 determines whether the received sound is speech (STEP 109). If the received sound is judged to be speech, the voice recognition unit 121 goes on to perform voice recognition on it (STEP 110). If the received sound is judged not to be speech, the robot returns to the initial state of waiting for sound input.

Next, it is determined whether voice recognition by the voice recognition unit 121 succeeded (STEP 111). If it succeeded, a response sentence appropriate to the recognized content is selected from the stored response database and output (STEP 112). If it failed, the robot presumes that the reason the received speech could not be recognized is a problem with the microphone that receives the speech (hardware) or with software, and selects and outputs a response sentence to that effect (for example, "Please check the microphone") (STEP 212).

After output of the response sentence is complete, it is determined whether to continue receiving speech (STEP 113); if so, the robot returns to the initial state of waiting for sound input. If not, it performs a predetermined termination process and then ends speech reception.
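The branching in STEP 104 through STEP 212 can be summarized as a pure decision function. This is only a sketch of the control flow described above: the `Outcome` enum, the `decide` function, and its boolean inputs are hypothetical names, and sensing, recognition, and output are assumed to happen elsewhere.

```python
from enum import Enum

class Outcome(Enum):
    WAIT_FOR_SOUND = "return to the initial waiting state"
    RESPOND = "output a response sentence (STEP 112)"
    WARN = "output a warning such as 'Please check the microphone' (STEP 212)"

def decide(face_detected, face_toward_robot, lips_moving,
           sound_is_speech, recognition_succeeded):
    """Map the checks of the flowchart to the action the robot takes."""
    if not face_detected:          # STEP 104: no face in the image
        return Outcome.WAIT_FOR_SOUND
    if not face_toward_robot:      # STEP 105: nobody is facing the robot
        return Outcome.WAIT_FOR_SOUND
    if not lips_moving:            # STEP 108: lips are not opening and closing
        return Outcome.WAIT_FOR_SOUND
    if not sound_is_speech:        # STEP 109: received sound is not speech
        return Outcome.WAIT_FOR_SOUND
    if recognition_succeeded:      # STEP 111: recognition result
        return Outcome.RESPOND
    # The speaker was seen talking to the robot, so the failure is
    # attributed to the receiving hardware or software.
    return Outcome.WARN
```

The warning branch is reached only when every visual check confirms that a speaker was addressing the robot, which is what lets the robot attribute the failure to its own receiving side.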

Thus, according to the embodiment described above, the voice recognition robot can determine that the received sound is speech from a speaker, and can therefore conclude that the reason this speech cannot be recognized lies in a hardware or software problem on the receiving side. By outputting a warning signal that points out this cause, the robot enables the speaker to understand why its voice recognition failed.

The warning signal may be output as audio data whose content includes the specific reason voice recognition failed; in addition, the reason may be conveyed to the speaker by gestures made by driving joints such as those of the arms. A light-emitting element such as an LED may also be provided on the robot and blinked to tell the speaker that the robot's microphone or the like is malfunctioning. Further, if the speaker carries a controller or similar device for operating the robot remotely and sends signals to the robot through it, a light-emitting element, vibration element, or the like provided on the controller may be activated to convey that the robot's microphone or the like is malfunctioning.

In the embodiment described above, the voice recognition robot is controlled to image the direction from which the received sound was emitted and to detect the speaker's face in the captured image data. In addition, it may be controlled so that, even after the face is detected, the camera direction is corrected to keep the detected face approximately at the center of the captured image. In this way, the part corresponding to the robot's "face" always points toward the speaker while the robot is responding, which has the added effect of making the robot appear to keep looking at the speaker's face during the conversation. Moreover, even if the speaker moves while the robot is responding, the robot can follow the movement and continue to detect the speaker's face. This following operation (tracking) is preferably continued until the speaker (the detected face) moves to a position at least a predetermined distance away from the voice recognition robot. The distance up to which tracking continues may be determined based on the resolution of the imaging unit, the ambient brightness, and so on.

In the embodiment described above, the orientation of a detected face is recognized by a method based on the center position of each face, determined from the positions of the speaker's eyes. Instead, a method may be used that recognizes orientation by extracting shape features of the speaker's eyes and mouth (such as the outer corners of the eyes and the corners of the mouth) with a neural network. Alternatively, the orientation of a face model may be obtained by comparing the three-dimensional positions of feature points in the image captured by the cameras with a face model obtained from those feature points.

The moving means of the voice recognition robot is not limited to the combination of wheels and casters described above; it may be an inverted-pendulum type moving means consisting of wheels only, or a walking type moving means that moves by driving legs.

The means provided on the robot for recognizing the surrounding external environment is not limited to the imaging unit; a laser range sensor or an optical camera such as a CCD camera may be provided separately, or a base station outside the robot may transmit such external environment information to the robot.

In the embodiment described above, a signal from a robot monitoring system (not shown) is received by the antenna 113 to recognize the robot's absolute position. Instead, the robot may obtain its own position by calculating the distance and direction traveled by odometry, from the wheel rotation counts and the like. A more accurate self-position may also be calculated by combining the information from the robot monitoring system with the self-position calculated by odometry.
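As an illustration of the odometry alternative, the sketch below updates a pose from wheel revolutions. It assumes a two-wheel differential drive with a known wheel radius and track width, which the description does not state; `update_pose` and all of its parameters are hypothetical.

```python
import math

def update_pose(x, y, theta, rev_left, rev_right, wheel_radius, track_width):
    """Dead-reckoning pose update for an assumed differential-drive base.

    rev_left, rev_right: wheel revolutions since the last update.
    Returns the new (x, y, theta), with theta in radians.
    """
    # Distance rolled by each wheel.
    d_left = 2.0 * math.pi * wheel_radius * rev_left
    d_right = 2.0 * math.pi * wheel_radius * rev_right
    # Distance traveled by the midpoint of the axle and change in heading.
    d_center = (d_left + d_right) / 2.0
    d_theta = (d_right - d_left) / track_width
    # Advance along the average heading over the interval.
    x += d_center * math.cos(theta + d_theta / 2.0)
    y += d_center * math.sin(theta + d_theta / 2.0)
    return x, y, theta + d_theta
```

Because such estimates drift as wheel slip accumulates, combining them with the absolute position from the monitoring system, as the text suggests, is a natural complement.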

In this embodiment, the voice recognition unit converts the acquired speech into speech data, divides that data into syllables, and replaces the divided syllables with words. The present invention is not limited to this, and many of the voice recognition techniques in current use may be employed. Likewise, the method of selecting a response sentence for the recognized speech content is not limited to that of the embodiment, and other methods may be applied.

FIG. 1 is an overall schematic diagram showing the appearance of a voice recognition robot according to a first embodiment of the present invention.
FIG. 2 is a schematic diagram showing the internal structure of the wheel drive unit provided in the voice recognition robot shown in FIG. 1.
FIG. 3 is a block diagram conceptually showing the internal functions of the control section provided in the voice recognition robot shown in FIG. 1.
FIG. 4 is a diagram conceptually showing how, for N consecutively acquired frames, the determination is made based on the average of the correlation values between the current frame and the frames back to the N-th previous one.
FIG. 5 is a flowchart showing the procedure by which the voice recognition robot shown in FIG. 1 recognizes speech uttered by a speaker, determines the state of the voice input, and then outputs a warning signal based on that determination.

Explanation of symbols

10 ... Voice recognition robot
100 ... Control section
121 ... Voice recognition unit
122 ... Face detection unit
123 ... Direction recognition unit
124 ... Extraction unit
125 ... Determination unit
126 ... Speech synthesis unit
127 ... Control unit
127a ... Storage area
111, 112 ... Imaging units (cameras)
114 ... Output unit (speaker)
115, 116 ... Receiving units (microphones)
20 ... Moving means (wheel drive unit)

Claims

A voice recognition robot comprising a receiving unit that receives voice uttered by a speaker and a voice recognition unit that recognizes the content of the received voice, the voice recognition robot further comprising:
an imaging unit that captures an image in the direction from which the voice was received and acquires the captured image as image data;
a face detection unit that detects the face of the speaker present in the captured image;
an extraction unit that extracts the movement of a specific part from the detected face;
a determination unit that determines the voice reception state based on the extracted movement of the specific part; and
an output unit that outputs a warning signal based on the determined voice reception state when voice recognition fails.

The voice recognition robot according to claim 1, wherein the specific part extracted by the extraction unit is a lip included in the detected face.

The voice recognition robot according to claim 2, wherein the determination unit determines the voice reception state based on the degree of opening and closing of the lips.

The voice recognition robot according to any one of claims 1 to 3, wherein the face detection unit detects the entire face by specifying the positions of the eyes and lips included in the speaker's face and estimating the outline of the face based on those positions.

The voice recognition robot according to claim 1, further comprising a direction recognition unit that recognizes the direction of the detected face based on the positions of both eyes included in the detected face.

The voice recognition robot according to any one of claims 1 to 5, wherein the imaging unit is capable of changing its imaging direction so that the detected face remains positioned substantially at the center of the captured image.

The voice recognition robot according to any one of claims 1 to 6, wherein the voice recognition robot further includes a moving unit, and is configured to be movable within a predetermined area.

A control method for a voice recognition robot that receives voice uttered by a speaker and recognizes its content, the method comprising:
an imaging step of capturing an image in the direction from which the voice was received and acquiring the captured image as image data;
a face detection step of detecting the face of the speaker present in the captured image;
an extraction step of extracting the movement of a specific part from the detected face;
a determination step of determining the voice reception state based on the extracted movement of the specific part; and
an output step of outputting a warning signal based on the determined result when voice recognition fails.

The method for controlling a voice recognition robot according to claim 8, wherein the specific part extracted in the extraction step is a lip included in the detected face.

The method for controlling a voice recognition robot according to claim 9, wherein in the determination step, the voice reception state is determined based on the degree of opening and closing of the lips.