JP2007155985A - Robot and voice recognition device, and method for the same

Info

Publication number: JP2007155985A
Application number: JP2005349115A
Authority: JP
Inventors: Ryota Hiura; 亮太日浦; Ken Onishi; 献大西; Keiichiro Osada; 啓一郎長田; Kyoko Oshima; 京子大嶋
Original assignee: Mitsubishi Heavy Industries Ltd
Current assignee: Mitsubishi Heavy Industries Ltd
Priority date: 2005-12-02
Filing date: 2005-12-02
Publication date: 2007-06-21

Abstract

PROBLEM TO BE SOLVED: To provide a robot, a voice recognition device, and a method for the same that reduce malfunctions by improving the accuracy of voice recognition.

SOLUTION: The robot comprises: a microphone 14; a voice recognition section 50 for recognizing voice input from the microphone 14; a person detecting section 51a for detecting a person around the robot main body; a voice recognition effectiveness determination section 52 for determining the effectiveness of the voice recognition result from the voice recognition section 50 when a person is detected by the person detecting section 51a; and a response action execution section 53 for executing a response action corresponding to the voice recognition result when that result is determined to be effective by the voice recognition effectiveness determination section 52.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a robot having a voice recognition function.

Conventionally, interactive robots for ordinary households are required to respond to a user's call at any time in an environment where noise is present. However, it is very difficult to determine whether a sound is a user's command or household noise such as a television or radio, and such household noise is often mistakenly recognized as a command from the user.
To address this problem, Patent Document 1, for example, proposes always performing voice recognition on an appropriate speech segment. Specifically, the "voice recognition environment", such as the threshold used in speech segment detection (the process of extracting from the microphone input the segment of speech to be recognized), is switched according to the type of voice recognition process being executed, the state of the microphone, and so on, so that an appropriate speech segment is always detected.
Patent Document 1: JP 2005-195834 A

However, in the invention disclosed in Patent Document 1, the voice recognition environment is switched solely according to the internal state of the robot (such as the voice recognition task being performed and which microphone is selected). Consequently, when the robot is in the same internal state, it cannot distinguish speech emitted from a radio or the like from speech uttered by a person as a command, and the problem remains that noise is mistakenly recognized as a user command, causing a malfunction.

The present invention has been made to solve the above problem, and its object is to provide a robot, a voice recognition device, and a method for the same that can reduce the occurrence of malfunctions by improving the accuracy of voice recognition.

To solve the above problem, the present invention adopts the following means.
The present invention provides a robot comprising: voice input means; voice recognition means for recognizing voice input from the voice input means; person detection means for detecting a person around the robot body; determination means for treating the voice recognition result from the voice recognition means as valid when a person is detected by the person detection means; and response operation execution means for executing a response operation corresponding to the voice recognition result when the determination means has treated that result as valid.

With this configuration, the person detection means detects whether a person is present around the robot body, and the detection result is supplied to the determination means. When a person is detected, the determination means treats the voice recognition result from the voice recognition means as valid and outputs it to the response operation execution means, which then executes the response operation corresponding to that result.
Because the validity of the voice recognition result is decided by whether a person is nearby, the probability of mistaking household noise such as a television or radio for voice input from the user is reduced. Furthermore, since operations such as conversation are executed based on this more accurate voice recognition result, the probability of malfunction can be greatly reduced.

In the above robot, the response operation execution means may execute an alternative response operation indicating that a sound has been detected when the determination means has treated the voice recognition result as invalid.

Because an alternative response operation indicating that a sound has been detected is executed when the determination means invalidates the voice recognition result, the robot blends into the user's living space more naturally than if it simply remained still.
The alternative operation is preferably one that completes in a short time (for example, several seconds or several tens of seconds), such as uttering a sound made up of a few characters or performing an action that completes in a single movement. With such a simple operation, the robot can respond promptly even if the user speaks immediately afterward.

In the above robot, the determination means may treat the voice recognition result as valid, regardless of the detection result of the person detection means, when an important term relating to a command is received as the voice recognition result.

Thus, when an important term relating to a command is input, for example a keyword for triggering one of the various response operations, the determination means judges the voice recognition result to be valid regardless of the detection result of the person detection means, so the user's request can be accepted with high probability.
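The keyword override described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `is_result_valid` and the contents of `IMPORTANT_TERMS` are hypothetical, since the source does not list the important terms.

```python
# Hypothetical keyword set; the source does not enumerate the important terms.
IMPORTANT_TERMS = {"stop", "help"}


def is_result_valid(recognized_text: str, person_detected: bool) -> bool:
    """Return True if the voice recognition result should be treated as valid."""
    # An important command term is accepted regardless of person detection.
    if recognized_text in IMPORTANT_TERMS:
        return True
    # Otherwise the result is valid only when a person is detected nearby.
    return person_detected
```

With this ordering, a command keyword is never rejected merely because the sensors failed to notice the user.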

In the above robot, the response operation execution means may end the response operation being executed if, during execution of the response operation, no valid voice recognition result is output from the determination means for a predetermined period.

With this configuration, when no voice recognition result is output to the response operation execution means for a predetermined period, it is judged that the user has made no voice input during that period, and the response operation execution means ends the response operation in progress. As a result, even if the user stops talking partway through the response operation, for example in the middle of a dialogue, and moves away from the robot body, the response operation can be ended once the predetermined period has elapsed since the most recent voice input.
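A minimal sketch of this timeout behavior is given below. The class name `ResponseSession`, its methods, and the use of plain timestamps in seconds are assumptions made for illustration; the source specifies only that the operation ends after a predetermined period without a valid result.

```python
class ResponseSession:
    """Tracks whether a response operation (e.g. a dialogue) is still running."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s  # the "predetermined period" (hypothetical value)
        self.active = False
        self.last_valid_at = 0.0

    def on_valid_result(self, now: float) -> None:
        # A valid voice recognition result starts or continues the operation.
        self.active = True
        self.last_valid_at = now

    def tick(self, now: float) -> bool:
        # End the operation once no valid result has arrived for timeout_s.
        if self.active and now - self.last_valid_at >= self.timeout_s:
            self.active = False
        return self.active
```

Calling `tick` periodically with the current time lets the robot drop a dialogue that the user has walked away from.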

The above robot may further comprise face detection means for detecting the face of a person around the robot body, and the determination means may treat the voice recognition result from the voice recognition means as valid when a person's face is detected.

With this configuration, the determination means decides whether the voice recognition result is valid according to whether a person's face has been detected, so the robot responds to voice input only when the user's face is reliably turned toward the robot body. This prevents household noise from being misrecognized as voice input from the user and reduces malfunctions.

In the above robot, the determination means may treat the voice recognition result as valid, regardless of the detection result of the face detection means, when the response operation is being executed by the response operation execution means, or when the time elapsed since the response operation finished is within an estimation period during which a person is presumed to be nearby.

Thus, if a voice recognition result is input from the voice recognition means to the determination means again before the estimation period has elapsed since the most recent judgment that input came from the user, that result is judged to be from the user, so the validity of the voice recognition result can be determined by a simple process.
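The estimation-period rule can be sketched as below. The function name `is_valid_with_grace`, its parameters, and the use of timestamps in seconds are hypothetical; the source does not give a concrete length for the estimation period.

```python
from typing import Optional


def is_valid_with_grace(
    face_detected: bool,
    response_running: bool,
    last_response_end: Optional[float],
    now: float,
    estimation_period_s: float,
) -> bool:
    """Decide validity using face detection plus the estimation period."""
    # A detected face validates the result directly.
    if face_detected:
        return True
    # While a response operation is running, the user is presumed present.
    if response_running:
        return True
    # Shortly after a response operation ends, the user is still presumed nearby.
    if last_response_end is not None:
        return now - last_response_end <= estimation_period_s
    return False
```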

In the above robot, the person detection means may include imaging means for capturing the surroundings of the robot body and may detect a person from the image information acquired by the imaging means. In particular, an omnidirectional camera provided at the top of the robot body's head is preferably used as the imaging device.

By using an omnidirectional camera provided at the top of the robot body's head as the imaging means that acquires surrounding information as images, information covering the full 360° around the robot can always be acquired, independent of the direction the robot's face is pointing. This makes stable person detection possible.

In the above robot, the person detection means may include contact detection means for detecting contact, and may also include a distance sensor for measuring the distance to an object.
Using the information obtained from such contact detection means, distance sensors, and the like makes it easy to detect a person.

In the above robot, the person detection means may include storage means for accumulating information from the imaging device, the contact detection means, and the distance sensor, and may detect a person based on the information accumulated in the storage means over a predetermined past period.

Detecting a person based on information from a predetermined past period (for example, about ten to a dozen or so seconds) allows the surrounding situation to be grasped in more detail, which improves the accuracy of person detection.

The present invention also provides a robot comprising: voice input means; voice recognition means for recognizing voice input from the voice input means; person detection means for detecting a person around the robot body; parameter changing means for lowering the threshold of the correct-answer confidence of the voice recognition means when a person is detected by the person detection means; and response operation execution means for executing a response operation corresponding to the voice recognition result when the determination means has treated that result as valid.

The present invention also provides a voice recognition method comprising: a voice acquisition step of acquiring input voice; a voice recognition step of recognizing the input voice; a person detection step of detecting a person around the robot body; and a determination step of treating the voice recognition result of the voice recognition step as valid when a person is detected around the robot body.

The present invention also provides a voice recognition method comprising: a voice acquisition step of acquiring input voice; a voice recognition step of recognizing the input voice; a person detection step of detecting a person around the robot body; and a parameter changing step of lowering the threshold of the correct-answer confidence in the voice recognition step when a person is detected around the robot body.
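The parameter changing step can be illustrated with a short sketch. The numeric thresholds and the names `confidence_threshold` and `accept` are assumptions chosen for the example; the source states only that the threshold is lowered when a person is detected.

```python
def confidence_threshold(
    person_detected: bool, base: float = 0.8, lowered: float = 0.6
) -> float:
    """Return the correct-answer confidence threshold to apply.

    The values 0.8 and 0.6 are hypothetical; the source gives no numbers.
    """
    return lowered if person_detected else base


def accept(confidence: float, person_detected: bool) -> bool:
    # A match is accepted only if its confidence exceeds the current threshold.
    return confidence > confidence_threshold(person_detected)
```

Under these example values, a match with confidence 0.7 is accepted when a person is nearby and rejected otherwise.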

According to the present invention, the accuracy of voice recognition can be improved, which has the effect of reducing the occurrence of malfunctions.

An embodiment of a robot according to the present invention is described below with reference to the drawings.
FIG. 1 is a front view of a robot according to one embodiment of the present invention, and FIG. 2 is a left side view of the robot shown in FIG. 1.
As shown in FIGS. 1 and 2, the robot body 1 comprises a head 2, a chest 3 that supports the head 2 from below, a right arm 4a provided on the right side of the chest 3, a left arm 4b provided on the left side of the chest 3, a waist 5 connected below the chest 3, a skirt 6 connected below the waist 5, and a leg portion 7 connected below the skirt 6.

One omnidirectional camera 11 is provided on the head 2 near the top of the head. A plurality of infrared LEDs 12 are arranged in a ring at predetermined intervals along the outer periphery of the omnidirectional camera 11.
Near the center of the front surface of the head 2, as shown in FIG. 1, one front camera 13 for imaging the area ahead is provided on the right side as viewed from the front, and one microphone 14 is provided on the left side as viewed from the front.

One monitor 15 is provided near the center of the front surface of the chest 3. One ultrasonic distance sensor 16 for detecting a person is provided above the monitor 15, and one power switch 17 is provided below the monitor 15. Above the ultrasonic distance sensor 16, two speakers 18 are provided, one on the left and one on the right. As shown in FIG. 2, a backpack portion 33 that can store luggage is provided on the back of the chest 3. The backpack portion 33 has a door 33a that pivots open and closed about a hinge provided at its top. As shown in FIG. 1, a shoulder switch 19 that functions as a man-machine interface is provided on each of the left and right shoulders of the chest 3. A touch sensor, for example, is used for the shoulder switch 19.

The right arm 4a and the left arm 4b have a multi-joint structure. Near where each of the right arm 4a and the left arm 4b connects to the chest 3, a side switch 20 is provided to detect pinching of a body part or an object and stop the movement of the arm. As shown in FIG. 1, a handshake switch 21 that functions as a man-machine interface is built into the palm of the right arm 4a. A pressure sensor, for example, is used for the side switches 20 and the handshake switch 21.

Near the center of the front surface of the waist 5, ultrasonic distance sensors 22 for detecting a person are provided, one on the left and one on the right. Below these ultrasonic distance sensors 22 is a sensor region 24 in which a plurality of infrared sensors 23 are arranged. These infrared sensors 23 detect obstacles and the like located below and in front of the robot body 1. As shown in FIGS. 1 and 2, microphones 25 for detecting the sound source direction are provided at the lower part of the waist 5, one on the left and one on the right of both the front and the back, four in total. As shown in FIG. 2, a handle portion 26 used when lifting the main body is provided on each of the left and right sides of the waist 5. Each handle portion 26 is a recess into which the operator's hand can be inserted.

At the lower front of the skirt 6, a total of three infrared sensors 27 for detecting steps are provided, at the center and on the left and right. As shown in FIG. 2, a charging connector 28 is provided on the back of the skirt 6.

As shown in FIG. 1, infrared sensors 29 for detecting lateral distance are provided on the front surface of the leg portion 7, one on the left and one on the right. These infrared sensors 29 are used mainly for step detection.
As shown in FIG. 2, a hook 30 for fixing the robot body 1 in position at the charging station is provided on the back of the leg portion 7. The leg portion 7 is a carriage equipped with traveling wheels 31 and four ball casters 32.

Such a robot is configured to move autonomously through its work space on power supplied from a battery built into the robot body 1. It coexists with people, using an indoor space such as an ordinary home as its work space, and is used, for example, to provide various services that assist, support, and care for users such as the robot's owner or operator in their daily life at home.
For this purpose, the robot 1 has, in addition to a conversation function for talking with the user, functions for watching over the user's actions, assisting the user's actions, and acting together with the user. These functions are realized, for example, by a control device, described later, that is built into the robot body 1 and consists of a microcomputer or the like. The cameras and sensors shown in FIGS. 1 and 2 are connected to the control device, which acquires image information from the cameras and detection information from the sensors and executes various programs based on this information to realize the functions described above. The shape of the robot body 1 is not limited to that shown in FIGS. 1 and 2; various shapes can be adopted, such as one modeled on an animal for use as a pet.

[Voice Recognition Device According to the First Embodiment]
Next, the voice recognition function, which is the characteristic part of the present invention, is described. The voice recognition function is needed to realize the conversation function described above, and is realized by a voice recognition device provided in the control device described above.
As shown in FIG. 3, the voice recognition device according to this embodiment comprises: a voice recognition unit (voice recognition means) 50 that recognizes voice input from the microphone 14; a person detection unit (person detection means) 51a that detects a person around the robot body; a voice recognition acceptance/rejection determination unit (determination means) 52 that treats the voice recognition result from the voice recognition unit 50 as valid when a person is detected by the person detection unit 51a; and a response operation execution unit (response operation execution means) 53 that executes a response operation according to the determination result of the voice recognition acceptance/rejection determination unit 52.

The microphone 14 picks up the user's voice as well as household noise such as a telephone ringing, a doorbell, or a television or radio, converts these sounds into electrical signals, and outputs them to the voice recognition unit 50.
The voice recognition unit 50 compares the electrical signal output from the microphone 14 against a plurality of dictionary data items held in advance and, when a match exceeding a predetermined correct-answer confidence is obtained, outputs that match to the voice recognition acceptance/rejection determination unit 52 as the voice recognition result. Here, the predetermined correct-answer confidence is the reference value for judging that the voice from the microphone 14 is identical to the dictionary data it was compared against.
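The dictionary comparison with a confidence threshold can be sketched as follows. The source does not say how the comparison score is computed, so cosine similarity over feature vectors is used here purely as a stand-in; the names `cosine` and `recognize` and the vector representation are assumptions.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, used here as a stand-in confidence score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recognize(
    features: List[float], dictionary: Dict[str, List[float]], threshold: float
) -> Optional[str]:
    """Return the best-matching dictionary entry, or None if below threshold."""
    best_word: Optional[str] = None
    best_score = 0.0
    for word, template in dictionary.items():
        score = cosine(features, template)
        if score > best_score:
            best_word, best_score = word, score
    # Only a match exceeding the confidence threshold becomes a recognition result.
    return best_word if best_score > threshold else None
```

An input that resembles no entry closely enough yields `None`, so nothing is passed on to the acceptance/rejection determination.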

The person detection unit 51a comprises a camera 61, a contact sensor 62, a human detection sensor 63, a processing unit 64, and a data holding unit (storage means) 65.
The camera 61 is, for example, the omnidirectional camera 11 or the front camera 13 shown in FIGS. 1 and 2. It acquires, as images, information covering the 360° around the robot body 1 or information about the area in front of the head 2, and outputs this image information to the processing unit 64. In particular, using the omnidirectional camera 11 as the camera that acquires surrounding image information makes it possible to capture stable surrounding information at all times, regardless of which way the robot's head 2 is facing. This improves the accuracy of person detection.

The contact sensor 62 consists of, for example, the shoulder switches 19 provided on the chest 3 of the robot body 1 shown in FIGS. 1 and 2, the handshake switch 21 and side switch 20 provided on the right arm 4a, and the side switch 20 provided on the left arm 4b. Each of these sensors outputs a contact detection signal to the processing unit 64 when it is operated by the user.

The human detection sensor 63 comprises, for example, the ultrasonic distance sensors 16 and 22 shown in FIGS. 1 and 2 and the microphones 25 for detecting the sound source direction, and outputs the detection signals from these sensors to the processing unit 64.
The processing unit 64 processes the captured images acquired from the camera 61 to create inter-frame difference data, detects from this difference data whether anything is moving, and, if something is moving, judges that a person is present around the robot body. Alternatively, the processing unit 64 may detect a person by extracting a person's face, body silhouette, or the like from the acquired image information.
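Inter-frame differencing of the kind described above can be sketched in a few lines. Frames are modeled as nested lists of grayscale values, and the function name `motion_detected` and both thresholds are hypothetical; the source does not specify how the difference data is evaluated.

```python
from typing import List

Frame = List[List[int]]  # grayscale image as rows of 0-255 pixel values


def motion_detected(
    prev: Frame, curr: Frame, pixel_delta: int = 30, min_changed: int = 5
) -> bool:
    """Return True if enough pixels changed between two consecutive frames."""
    changed = sum(
        1
        for row_prev, row_curr in zip(prev, curr)
        for p, c in zip(row_prev, row_curr)
        # A pixel counts as changed when its brightness moved by more than pixel_delta.
        if abs(p - c) > pixel_delta
    )
    return changed >= min_changed
```

Requiring a minimum number of changed pixels keeps sensor noise in a single pixel from being treated as a moving object.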

In addition, when the processing unit 64 receives a contact detection signal from the contact sensor 62, it judges that a person is present around the robot body. The processing unit 64 also detects, from the sensor signals of the ultrasonic distance sensors 16 and 22 that make up the human detection sensor 63, the distance to an obstacle in front of the robot body 1, and judges that a person is present around the robot body 1 when this distance is within a predetermined range. The voice information from the microphones 25 is used to detect the direction in which the person is located.

The processing unit 64 makes the above judgments based on the information acquired from the camera 61, the contact sensor 62, and the human detection sensor 63. If any one of the judgments indicates that a person is present around the robot body, it outputs person detection information indicating that a person is nearby to the voice recognition acceptance/rejection determination unit 52. If none of the judgments indicates that a person is present, it outputs person detection information indicating that no person is nearby to the voice recognition acceptance/rejection determination unit 52.
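The way the individual judgments are combined amounts to a logical OR, which the following sketch illustrates. The function name `person_present` and the distance limits are assumptions; the source says only that the distance must fall within a predetermined range.

```python
from typing import Optional


def person_present(
    motion: bool,
    contact: bool,
    distance_m: Optional[float],
    min_m: float = 0.2,
    max_m: float = 2.0,
) -> bool:
    """Combine the camera, contact, and distance judgments with a logical OR.

    The range limits are hypothetical; None means no distance reading.
    """
    in_range = distance_m is not None and min_m <= distance_m <= max_m
    # A person is considered present if any single judgment says so.
    return motion or contact or in_range
```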

The data holding unit 65 holds, for a predetermined period and in FIFO (First In First Out) fashion, the data that the processing unit 64 used for person detection, in other words the information that the processing unit 64 acquired from the camera 61, the contact sensor 62, and the human detection sensor 63.
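A time-bounded FIFO buffer of this kind can be sketched with `collections.deque` from the Python standard library. The class name `SensorHistory` and the retention length are hypothetical, and readings are assumed to arrive with non-decreasing timestamps in seconds.

```python
from collections import deque
from typing import Any, Deque, List, Tuple


class SensorHistory:
    """Keeps readings from the last retention_s seconds in arrival order."""

    def __init__(self, retention_s: float) -> None:
        self.retention_s = retention_s
        self._items: Deque[Tuple[float, Any]] = deque()

    def add(self, timestamp: float, reading: Any) -> None:
        self._items.append((timestamp, reading))
        # Discard the oldest entries first once they fall outside the window.
        while self._items and timestamp - self._items[0][0] > self.retention_s:
            self._items.popleft()

    def readings(self) -> List[Any]:
        return [reading for _, reading in self._items]
```

For example, with a 10-second window, a reading added at time 0 is dropped as soon as a reading at time 12 arrives.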

When the voice recognition acceptance/rejection determination unit 52 receives a voice recognition result from the voice recognition unit 50, it checks whether person detection information indicating that a person is nearby has been input from the person detection unit 51a. If so, it judges the voice recognition result from the voice recognition unit 50 to be valid, that is, it judges that the voice underlying this result was input by the user, and outputs the voice recognition result to the response operation execution unit 53. If, on the other hand, person detection information indicating that no person is nearby has been input from the person detection unit 51a, the voice recognition acceptance/rejection determination unit 52 judges the voice recognition result to be invalid, that is, to be household noise such as a television, and outputs this judgment to the response operation execution unit 53.
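The acceptance/rejection decision and what is passed on to the response operation execution unit can be sketched as below. The function names `judge` and `respond`, the string labels, and the returned values are all hypothetical placeholders for the units described in the text.

```python
from typing import Optional, Tuple

Judgement = Tuple[str, Optional[str]]


def judge(recognized_text: str, person_detected: bool) -> Judgement:
    """Mimic the acceptance/rejection determination for one recognition result."""
    if person_detected:
        # Valid: forward the recognition result itself.
        return ("valid", recognized_text)
    # Invalid: treated as household noise, so only the judgment is forwarded.
    return ("invalid", None)


def respond(judgement: Judgement) -> str:
    """Choose between a conversation and a short alternative action."""
    kind, text = judgement
    if kind == "valid":
        return f"conversation:{text}"
    return "alternative_action"
```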

The response operation execution unit 53 comprises a conversation scenario execution unit 66 and an alternative operation execution unit 67. The response operation execution unit 53 has, for example, a small microcomputer and application programs describing the procedures for realizing the various response operations, such as conversation operations and alternative response operations. By reading out and executing the application program corresponding to the information received from the voice recognition acceptance/rejection determination unit 52, it realizes the functions of the conversation scenario execution unit 66 and the alternative operation execution unit 67 described below.

The conversation scenario execution unit 66 creates a conversation scenario corresponding to the voice recognition result received from the voice recognition acceptance/rejection determination unit 52 and outputs a voice signal based on the corresponding synthesized voice data to the speaker 18 of the robot body 1, thereby producing an utterance corresponding to the synthesized voice data.
When the alternative operation execution unit 67 receives an invalidity determination from the voice recognition acceptance/rejection determination unit 52, it carries out an alternative operation indicating that an ambient sound has been detected.
The alternative operation is preferably a motion or utterance that completes in a short time (several seconds to several tens of seconds). Examples of motions are those completed in a single movement, such as tilting the head, shaking the head, looking around, or opening and closing the eyes. Examples of utterances are those consisting of a few characters, such as "fuu", "kupi", or "nn".

In such a voice recognition device, when a sound is picked up by the microphone 14 and an electrical signal based on the sound is output to the voice recognition unit 50, the voice recognition unit 50 performs voice recognition based on the electrical signal and outputs the voice recognition result to the voice recognition acceptance/rejection determination unit 52.
Meanwhile, in the person detection unit 51a, image information from the camera 61 and sensor information from the contact sensor 62 and the human detection sensor 63 are periodically output to the processing unit 64, and the processing unit 64 performs person detection based on this information periodically (at predetermined time intervals). The result of the person detection by the processing unit 64 is output to the voice recognition acceptance/rejection determination unit 52 as person detection information.

On receiving a voice recognition result from the voice recognition unit 50, the voice recognition acceptance/rejection determination unit 52 refers to the person detection information from the person detection unit 51a. If this information indicates that a person is nearby, the voice recognition acceptance/rejection determination unit 52 judges that the voice recognition result originates from user input, treats it as valid, and outputs it to the conversation scenario execution unit 66 of the response operation execution unit 53. On receiving the voice recognition result, the conversation scenario execution unit 66 outputs the scenario voice data corresponding to it to the speaker 18. An appropriate utterance in response to the user's voice input is thereby produced.

If, on the other hand, the person detection information at the time the voice recognition result is received from the voice recognition unit 50 indicates that no person is nearby, the voice recognition acceptance/rejection determination unit 52 judges that the voice recognition result is due to household noise and outputs this determination result to the alternative operation execution unit 67. On receiving this determination result, the alternative operation execution unit 67 performs a predetermined operation, for example tilting the head.
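The accept/reject flow of the first embodiment can be sketched as follows. This is a minimal illustration only: the names `Decision` and `judge` and the handler strings are hypothetical and do not appear in the specification, and the person detection result is assumed to arrive as a simple boolean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    valid: bool             # True if the result is attributed to a user
    handler: str            # unit that receives the output (hypothetical labels)
    payload: Optional[str]  # recognized text, or None when rejected


def judge(recognition_result: str, person_nearby: bool) -> Decision:
    """Accept a recognition result only while a person is detected nearby."""
    if person_nearby:
        # Valid: forward the result to the conversation scenario execution unit 66.
        return Decision(True, "conversation_scenario_66", recognition_result)
    # Invalid: treat as household noise and notify the alternative operation unit 67.
    return Decision(False, "alternative_operation_67", None)
```

With this sketch, `judge("message", person_nearby=False)` routes to the alternative operation (for example, tilting the head) instead of starting a conversation.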

As described above, the voice recognition device according to this embodiment checks the voice recognition result against the surrounding situation to judge whether it originates from user input, which lowers the probability of mistaking household noise such as television or radio for voice input by the user. Because operations such as conversation are executed on the basis of voice recognition results screened in this way, the probability of malfunction can be reduced.

In this embodiment, past image information, sensor information, and the like may also be taken into account when detecting whether a person is nearby.
Specifically, the processing unit 64 of the person detection unit 51a may perform person detection comprehensively, using information from a past predetermined period stored in the data storage unit 65.
Alternatively, the voice recognition acceptance/rejection determination unit 52 may retain the person detection information from the person detection unit 51a over a past predetermined period and judge the validity of the voice recognition result on the basis of that information.
Judging whether a person is present from information over a past predetermined period, rather than only at the moment the voice is input, makes it possible to improve the accuracy of person detection.
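One way to use information from a past predetermined period is a sliding window over recent detection samples. The sketch below is an assumption, not the method of the specification, which does not say how past samples are combined: here a person is considered present if any sample within the window reported a detection. The class name `PresenceHistory` and the parameter `window_s` are hypothetical.

```python
from collections import deque


class PresenceHistory:
    """Keeps recent person detection samples and answers presence queries."""

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._samples = deque()  # (timestamp, detected) pairs, oldest first

    def add(self, timestamp: float, detected: bool) -> None:
        self._samples.append((timestamp, detected))
        # Drop samples that have fallen out of the retention window.
        while self._samples and timestamp - self._samples[0][0] > self.window_s:
            self._samples.popleft()

    def person_present(self, now: float) -> bool:
        # Assumed rule: any detection within the window counts as presence.
        return any(d for t, d in self._samples if now - t <= self.window_s)
```

A stricter rule (for example, a majority of samples in the window) would trade fewer false acceptances for more missed users; the specification leaves this choice open.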

[Second Embodiment]
Next, a voice recognition device according to a second embodiment of the present invention will be described. The voice recognition device according to this embodiment differs from the first embodiment in that, as shown in FIG. 4, the processing unit 64 of the person detection unit 51b has a face detection function (not shown) for detecting a person's face and outputs the face detection result to the voice recognition acceptance/rejection determination unit 52 as face detection information. Descriptions of points shared with the first embodiment are omitted below, and only the differences are described.

In FIG. 4, the processing unit 64 of the person detection unit 51b has a face extraction function (not shown) that extracts face information from the image information received from the camera 61. Known techniques can be used to extract or recognize a person's face from image information. For example, face detection can be performed by extracting two elliptical regions from the image information acquired from the camera 61 and recognizing them as eyes by examining their spacing, size, color, and the like. A technique that detects a face by extracting its contour may also be used.
The person detection unit 51b periodically performs the person detection process and the face detection process described above and outputs the results to the voice recognition acceptance/rejection determination unit 52 as person detection information and face detection information.

The voice recognition acceptance/rejection determination unit 52 refers to the person detection information and the face detection information at the time the voice recognition result is received from the voice recognition unit 50. If a person is nearby and a face is detected, it judges the voice recognition result to be valid and outputs it to the conversation scenario execution unit 66. If a person is nearby but no face is detected, or if neither a person nor a face is detected, it judges the voice recognition result to be invalid and outputs this determination result to the alternative operation execution unit 67.

As described above, the voice recognition device according to this embodiment judges a voice recognition result to be voice input from a user only when a person is nearby and that person's face has also been detected. This further lowers the probability of mistaking household noise for voice input from the user and, as a result, further reduces malfunctions of the robot.

As in the first embodiment, the face detection described above may also be performed with reference to information from a past predetermined period.
Furthermore, the voice recognition acceptance/rejection determination unit 52 may automatically judge as valid any voice recognition result input from the voice recognition unit 50 within an estimation period, measured from the most recent determination that a voice recognition result was valid, during which a person can be presumed to be nearby. In other words, if the estimation period has not yet elapsed since the user's previous voice input, the user can be presumed to still be nearby, so the voice recognition result from the voice recognition unit 50 is automatically treated as valid. This allows the determination to be made easily, without complicated processing.
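The second embodiment's rule (person and face both detected) together with the estimation period shortcut can be sketched as below. The class name `FaceGatedJudge` and the parameter `estimation_period_s` are hypothetical, and the choice to restart the period on every accepted result is an assumption drawn from the phrase "since the user's previous voice input".

```python
from typing import Optional


class FaceGatedJudge:
    """Accepts results when a person and a face are detected, or shortly after
    the previous accepted result."""

    def __init__(self, estimation_period_s: float):
        self.estimation_period_s = estimation_period_s
        self._last_valid_at: Optional[float] = None

    def judge(self, now: float, person: bool, face: bool) -> bool:
        # Shortcut: within the estimation period the user is presumed nearby.
        recently_valid = (
            self._last_valid_at is not None
            and now - self._last_valid_at <= self.estimation_period_s
        )
        valid = recently_valid or (person and face)
        if valid:
            # Assumption: every accepted result restarts the estimation period.
            self._last_valid_at = now
        return valid
```

Because the period restarts on each accepted result, a continuous noise source that begins just after a real utterance could keep the shortcut alive; an implementation might cap the total extension.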

[Third Embodiment]
Next, a voice recognition device according to a third embodiment of the present invention will be described. The voice recognition device according to this embodiment is suited to cases where, for example, the utterances of the conversation scenario execution unit 66 proceed step by step in a dialogue with the user.
Here, proceeding step by step in a dialogue means that a conversation advances between the robot and the user. For example, when the user inputs "message", the conversation scenario execution unit 66 utters "A message. Who is it for?". When the user answers this question with further voice input such as "Mr. A", the conversation scenario execution unit 66 responds with something like "I will tell Mr. A. Is that correct?". This is only an example, and the content of the conversation is not limited to it.

When a conversation with the user proceeds in this way, a waiting period arises between an utterance by the conversation scenario execution unit 66 and receipt of the user's response to that utterance (the state of the conversation scenario execution unit 66 during this period is hereinafter called the "standby state").
The voice recognition device according to this embodiment is characterized by the processing performed by the conversation scenario execution unit 66 in this standby state.

Specifically, as shown in FIG. 5, in the voice recognition device of this embodiment the person detection unit 51c outputs person detection information to both the voice recognition acceptance/rejection determination unit 52 and the response operation execution unit 53. With this configuration, if, over a predetermined period after entering the standby state, the conversation scenario execution unit 66 receives no voice recognition result from the voice recognition acceptance/rejection determination unit 52 and person detection information indicating that no person is nearby is input continuously, it ends the response operation currently being executed, thereby cancelling the standby state and transitioning to the initial state. In other words, it returns to the original state in which no voice input has been made by the user.

As described above, in the voice recognition device according to this embodiment, if the conversation scenario execution unit 66 in the standby state receives no voice recognition result from the voice recognition acceptance/rejection determination unit 52 over a predetermined period after entering the standby state, and person detection information indicating that no person is nearby is input continuously, it ends the response operation currently being executed, thereby cancelling the standby state and transitioning to the initial state.
As a result, if, for example, the user abandons the conversation partway through and leaves the vicinity of the robot body 1, this is detected promptly and the conversation scenario execution unit 66 can cancel the standby state. This avoids situations such as waiting a long time for voice input from the user.
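The standby cancellation of the third embodiment can be sketched as a small state machine. The class and method names and the parameter `timeout_s` are hypothetical. As an assumption beyond the specification, the sketch restarts the absence timer whenever a person is seen again, rather than timing strictly from entry into the standby state.

```python
from typing import Optional


class ConversationScenario:
    """Tracks the standby state of a dialogue and cancels it when the user leaves."""

    INITIAL, STANDBY = "initial", "standby"

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.state = self.INITIAL
        self._absent_since: Optional[float] = None

    def enter_standby(self, now: float) -> None:
        # Called after the robot speaks and starts waiting for the user's reply.
        self.state = self.STANDBY
        self._absent_since = now

    def on_valid_result(self, now: float) -> None:
        # A reply arrived: the scenario speaks its next line and waits again.
        if self.state == self.STANDBY:
            self.enter_standby(now)

    def on_person_info(self, now: float, person_nearby: bool) -> None:
        if self.state != self.STANDBY:
            return
        if person_nearby:
            self._absent_since = None  # absence is no longer continuous
            return
        if self._absent_since is None:
            self._absent_since = now
        if now - self._absent_since >= self.timeout_s:
            self.state = self.INITIAL  # end the response operation
```

The periodic person detection information drives `on_person_info`, so no separate timer thread is needed in this sketch.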

[Fourth Embodiment]
Next, a voice recognition device according to a fourth embodiment of the present invention will be described. The voice recognition device according to this embodiment differs from the first embodiment in that, as shown in FIG. 6, it does not include the voice recognition acceptance/rejection determination unit 52 shown in FIG. 1, person detection information is output directly from the person detection unit 51d to the voice recognition unit 50, and the voice recognition result is output directly from the voice recognition unit 50 to the conversation scenario execution unit 66.
Descriptions of points shared with the first embodiment are omitted below, and only the differences are described.

In FIG. 6, the processing unit 64 of the person detection unit 51d performs person detection based on information input from the camera 61, the contact sensor 62, the human detection sensor 63, and the like, and outputs person detection information indicating whether a person is nearby to the voice recognition unit 50′.

The voice recognition unit 50′ includes a parameter changing unit (not shown) that varies the threshold of correctness confidence used in voice recognition according to the person detection information from the person detection unit 51d. When the person detection unit 51d judges that a person is nearby, the threshold of correctness confidence is set low; when it judges that no person is nearby, the threshold is set high. The criterion for accepting a recognition is thus stricter when no person is judged to be nearby than when a person is judged to be nearby.
The voice recognition unit 50′ performs voice recognition using the correctness confidence threshold set by the built-in parameter changing unit. That is, when the result of matching the voice information input from the microphone 14 against the dictionary data held in advance is at or above the threshold set by the parameter changing unit, the voice recognition unit 50′ judges that the voice information is identical to the voice of the dictionary data and outputs the information of that dictionary data to the conversation scenario execution unit 66 as the voice recognition result.

The conversation scenario execution unit 66 outputs the scenario voice data corresponding to the voice recognition result received from the voice recognition unit 50′ to the speaker 18. An appropriate utterance in response to the user's voice input is thereby produced.
If, on the other hand, the result of matching the voice information input from the microphone 14 against the dictionary data held in advance is below the correctness confidence threshold set by the parameter changing unit, the voice recognition unit 50′ treats the voice recognition as having failed and outputs no information, so the conversation scenario execution unit 66 does not operate.
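The threshold switching of the fourth embodiment can be sketched as follows. The two threshold values are hypothetical placeholders (the specification gives no numbers), the matching score is assumed to lie in the range 0 to 1, and the function names are illustrative only.

```python
from typing import Optional

THRESHOLD_PERSON_PRESENT = 0.6  # hypothetical value: lenient when someone is nearby
THRESHOLD_PERSON_ABSENT = 0.9   # hypothetical value: strict when nobody is nearby


def confidence_threshold(person_nearby: bool) -> float:
    """Role of the parameter changing unit: pick the threshold from presence."""
    return THRESHOLD_PERSON_PRESENT if person_nearby else THRESHOLD_PERSON_ABSENT


def recognize(best_match: str, score: float, person_nearby: bool) -> Optional[str]:
    """Return the dictionary entry if its score clears the threshold, else None."""
    if score >= confidence_threshold(person_nearby):
        return best_match
    return None  # recognition failed: nothing is passed to the conversation scenario
```

With these placeholder values, a match scored 0.7 is accepted when a person is nearby and dropped when nobody is, which is the behavior that lets the separate determination unit 52 be omitted.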

As described above, the voice recognition device according to this embodiment changes the correctness confidence threshold used in voice recognition according to whether a person is near the robot body. When a person is present, recognition is performed with the lower threshold, which raises the recognition success rate; when no person is present, recognition is performed with the higher threshold, which lowers the recognition success rate. This lowers the probability of mistaking household noise such as television for voice input from the user and helps prevent malfunctions of the robot. Furthermore, because this embodiment does not require the voice recognition acceptance/rejection determination unit 52 shown in FIG. 1, the configuration can be simplified and the device can be made smaller and lighter.

The configurations of the embodiments described above may also be combined. For example, by combining the second and third embodiments, face detection information may be added to the information output from the person detection unit 51b to the response operation execution unit 53.
In the fourth embodiment, face detection information may also be added to the information output from the processing unit 64 to the voice recognition unit 50′.
In the voice recognition acceptance/rejection determination unit 52, if the voice recognition result input from the voice recognition unit 50 is an important term relating to a command to the robot, the voice recognition result may be judged valid regardless of the information from the person detection unit.
The determination criterion of the voice recognition acceptance/rejection determination unit 52 may also be changed according to the importance of the request from the user. For example, if the voice content based on the voice recognition result of the voice recognition unit 50 is of high importance, it may be judged valid even when no face or person is detected, and if it is not of particularly high importance, it may be judged invalid unless a face is visible.
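The importance-dependent criterion can be sketched as below. The set of important terms is a hypothetical example (the specification names none), and requiring both a person and a face for ordinary terms follows the second embodiment's rule as an assumption.

```python
IMPORTANT_TERMS = {"stop", "help"}  # hypothetical examples of command terms


def judge_with_importance(text: str, person: bool, face: bool) -> bool:
    """Accept important terms unconditionally; otherwise require person and face."""
    if text in IMPORTANT_TERMS:
        # High importance: valid even when no person or face is detected.
        return True
    # Lower importance: invalid unless a face is visible.
    return person and face
```

This accepts `"stop"` even with no one detected, while an ordinary term such as `"message"` still needs a visible face.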

The embodiments of the present invention have been described in detail above with reference to the drawings. However, the specific configuration is not limited to these embodiments, and design changes and the like within a scope that does not depart from the gist of the present invention are also included.

FIG. 1 is a front view of a robot according to an embodiment of the present invention.
FIG. 2 is a left side view of the robot shown in FIG. 1.
FIG. 3 is a diagram showing the configuration of the voice recognition device according to the first embodiment of the present invention.
FIG. 4 is a diagram showing the configuration of the voice recognition device according to the second embodiment of the present invention.
FIG. 5 is a diagram showing the configuration of the voice recognition device according to the third embodiment of the present invention.
FIG. 6 is a diagram showing the configuration of the voice recognition device according to the fourth embodiment of the present invention.

Explanation of symbols

14 microphone
61 camera
62 contact sensor
63 human detection sensor
64 processing unit
65 data storage unit
50, 50′ voice recognition unit
51a to 51d person detection unit
52 voice recognition acceptance/rejection determination unit
53 response operation execution unit
66 conversation scenario execution unit
67 alternative operation execution unit

Claims

A robot comprising:
voice input means;
voice recognition means for recognizing voice input from the voice input means;
person detection means for detecting a person around the robot body;
determination means for validating the voice recognition result of the voice recognition means when a person is detected by the person detection means; and
response action executing means for executing a response action corresponding to the voice recognition result when the voice recognition result is validated by the determination means.

The robot according to claim 1, wherein the response operation executing unit executes an alternative response operation indicating that a sound is detected when the voice recognition result is invalidated by the determination unit.

The robot according to claim 1, wherein the determination unit validates the voice recognition result regardless of the detection result of the person detection unit when an important term related to a command is received as the voice recognition result.

The robot according to any one of claims 1 to 3, wherein the response operation executing means terminates the response operation being executed when no voice recognition result validated by the determination means is output for a predetermined period during execution of the response operation.

The robot according to any one of claims 1 to 4, further comprising face detection means for detecting the face of a person around the robot body,
wherein the determination unit validates the voice recognition result of the voice recognition unit when the face of the person is detected.

The robot according to any one of claims 1 to 5, wherein, while the response operation is being executed by the response operation execution unit, or while the time elapsed since the end of its execution is within an estimation period during which a person is presumed to be nearby, the voice recognition result is treated as valid regardless of the detection result of the face detection unit.

The robot according to any one of claims 1 to 6, wherein the person detecting means includes an imaging means for photographing the periphery of the robot body, and detects a person from image information acquired by the imaging means.

The robot according to claim 7, wherein the imaging device is an omnidirectional camera provided at the top of the robot body.

The robot according to any one of claims 1 to 8, wherein the person detection means includes contact detection means for detecting contact.

The robot according to claim 1, wherein the person detection unit includes a distance sensor that measures a distance to an object.

The robot according to claim 10, wherein the person detection means includes storage means for storing information from the imaging device, the contact detection means, and the distance sensor, and detects a person based on information from a past predetermined period stored in the storage means.

Voice input means;
Voice recognition means for recognizing voice input from the voice input means;
A person detection means for detecting a person around the robot body;
A parameter changing means for lowering a threshold of correctness confidence of the voice recognition means when a person is detected by the person detecting means;
And a response action executing means for executing a response action corresponding to the voice recognition result when the voice recognition result is validated by the determination means.

An audio acquisition process for acquiring input audio;
A speech recognition process for recognizing the input speech;
A person detection process for detecting a person around the robot body,
A speech recognition method comprising: a determination process for validating a speech recognition result in the speech recognition process when a person is detected around the robot body.

An audio acquisition process for acquiring input audio;
A speech recognition process for recognizing the input speech;
A person detection process for detecting a person around the robot body,
A parameter recognition process comprising: a parameter changing process for lowering a threshold of correct confidence in the voice recognition process when a person is detected around the robot body.