JP2018185362A - Robot and control method of the same

Info

Publication number: JP2018185362A
Application number: JP2017085336A
Authority: JP
Inventors: Takuya Ishida; Takeshi Koyama; Masaki Shibuya; Hitoshi Akita; Tadamasa Enomoto; Tomonari Eda
Original assignee: Fuji Soft Inc
Current assignee: Fuji Soft Inc
Priority date: 2017-04-24
Filing date: 2017-04-24
Publication date: 2018-11-22

Abstract

PROBLEM TO BE SOLVED: To respond to a user's speech even while the robot itself is speaking.
SOLUTION: A robot 1 that interacts with a user comprises: an image analysis section 106 that analyzes a face image of the user; voice analysis sections 105 and 111 that estimate a sound source direction from ambient sound and perform voice recognition processing; a start determination section 110 that determines when the voice analysis sections start and stop the voice recognition processing; and a speech generation section 115 that generates a message in accordance with a recognition result of the voice recognition processing and speaks it. When a prescribed condition indicating that the user is speaking to the robot is satisfied, judged from the operation state of the speech generation section, the analysis result of the user's face image from the image analysis section, and the sound source direction estimated by the voice analysis sections, the start determination section temporarily stops the speech of the speech generation section and starts the voice recognition processing by the voice analysis sections.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a robot and a control method thereof.

Robots that converse with users are becoming widespread. However, if the user speaks while the robot is speaking, the accuracy of speech recognition may drop and the conversation may break down. This happens because the robot's own speech enters the speech recognition processing, which is meant to recognize the user's voice, as noise.

For this reason, a user who wanted to speak conventionally had to stop the robot's speech first, for example by touching the robot, after which speech recognition of the user's voice was executed.

In an interactive guidance device, which is not a robot, the guidance (speech) of the device is stopped when the user's utterance is detected through the press of a talk switch, in order to improve the recognition rate of the speech recognition processing (Patent Document 1).

On the other hand, a robot is also known that incorporates processing such as adaptive filtering and echo cancellation so that it can speak and perform speech recognition at the same time (Patent Document 2).

A navigation device, again not a robot, is also known that determines the start of a user's utterance by detecting movement of the user's mouth (Patent Document 3).

Patent Document 1: Japanese Patent No. 3654045
Patent Document 2: Japanese Patent No. 5178370
Patent Document 3: JP 2009-98217 A

In the prior art described in Patent Document 1, when an instruction to start voice input is given through some operation, the volume of the guidance device is lowered or muted, and the user is notified that voice input is permitted once the noise estimation interval has ended. However, every time the user wishes to speak, the user must perform some operation on the guidance device to stop the device's own speech (self-speech). This is not only troublesome but also slows the tempo of the dialogue, which tends to become unnatural.

In the prior art described in Patent Document 2, the part of the signal input to the speech recognition processing that is due to self-speech can be removed with an adaptive filter or echo cancellation, so the user's utterance can be recognized even during self-speech. However, even with such processing it is difficult to recognize accurately the voice of a user who speaks during self-speech, and implementing an adaptive filter or echo cancellation takes effort and increases cost. In particular, it is not easy to implement processing such as an adaptive filter, on top of basic control processing such as image processing (face recognition), speech recognition, and motion, in an inexpensive, small robot whose resources such as memory size and CPU (Central Processing Unit) are limited.

In the prior art described in Patent Document 3, the speech recognition processing is started by detecting the user's speaking motion (movement of the mouth), so no special operation is needed to start it. However, it is difficult to detect the user's utterance from the movement of the mouth alone.

The present invention has been made in view of the above problems, and its object is to provide a robot that can converse in response to a user's utterance even while the robot itself is speaking, and a control method thereof.

A robot according to one aspect of the present invention is a robot that interacts with a user, comprising: an image analysis unit that analyzes a face image of the user; a voice analysis unit that estimates a sound source direction from ambient sound and performs speech recognition processing; an activation determination unit that determines the start and stop of the speech recognition processing by the voice analysis unit; and an utterance generation unit that generates and utters a message according to the recognition result of the speech recognition processing. When a predetermined condition indicating that the user is speaking to the robot is satisfied, judged from the operation state of the utterance generation unit, the analysis result of the user's face image from the image analysis unit, and the sound source direction estimated by the voice analysis unit, the activation determination unit pauses the utterance of the utterance generation unit and starts the speech recognition processing by the voice analysis unit.

The predetermined condition may be that the utterance generation unit is speaking, the user's face is located in front of the robot, the analysis result of the user's face image from the image analysis unit shows movement of the user's mouth, and the sound source direction estimated by the voice analysis unit indicates the front.

The robot may include a movable part that moves relative to a main body part, and the activation determination unit may cause the movable part to perform a predetermined speech recognition start operation that indicates the start of the speech recognition processing.

While the utterance generation unit is operating, a speech recognition stop operation, which indicates that the speech recognition processing is stopped, may be performed by moving the movable part.

The speech recognition start operation may consist of stopping the speech recognition stop operation.

If the utterance generation unit can generate a message corresponding to the result of the speech recognition processing by the voice analysis unit, the message may be uttered to start a new conversation with the user. If it cannot generate such a message, the utterance generation unit may resume the paused utterance.

A robot control method according to another aspect of the present invention is a method in which a robot control unit controls a robot that interacts with a user. The robot control unit monitors whether the user has spoken while the robot is speaking. When it detects an utterance by the user, it pauses the robot's utterance and also pauses the speech recognition stop operation, which has been performed during the robot's utterance and indicates, by moving at least a movable part of the robot, that the speech recognition processing is stopped. It then acquires the voice uttered by the user, performs speech recognition processing on the acquired voice, and determines whether a message corresponding to the result of the speech recognition processing can be generated. If it determines that the message can be generated, it utters the message and starts a new conversation with the user. If it determines that the message cannot be generated, it resumes the paused utterance and resumes the speech recognition stop operation.
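The branching in this control method can be illustrated with a minimal Python sketch. It is not the implementation described in the publication: the function name, the action strings, and the use of a dictionary as a stand-in for message generation are all assumptions made for illustration.

```python
def handle_user_interruption(recognized_text, response_table):
    """Return the ordered actions taken once a user utterance is detected.

    response_table is a hypothetical stand-in for message generation: it maps
    recognized text to a reply, and a missing key means no message can be made.
    """
    # Pause the robot's own utterance and the motion that signals
    # "recognition stopped", then recognize what the user said.
    actions = [
        "pause_utterance",
        "pause_recognition_stop_motion",
        "run_speech_recognition",
    ]
    reply = response_table.get(recognized_text)
    if reply is not None:
        # A message can be generated: start a new conversation.
        actions.append("utter:" + reply)
    else:
        # No message can be generated: go back to what the robot was saying.
        actions.append("resume_utterance")
        actions.append("resume_recognition_stop_motion")
    return actions
```

For example, calling `handle_user_interruption("quiz", {"quiz": "Let's start a quiz."})` ends with the reply being uttered, while an unrecognized input ends with the paused utterance and motion being resumed.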

A diagram showing the overall configuration of the robot.
A flowchart showing the processing performed when the user speaks to the robot while the robot is speaking.
An explanatory diagram schematically showing the relationship between the robot's speech and the user's speech.
A flowchart of the processing according to the second embodiment.
A flowchart of the processing according to the third embodiment.
An explanatory diagram, relating to the fourth embodiment, showing an example in which the user's face is located in front of the robot.
An explanatory diagram showing another example in which the user's face is located in front of the robot and an example in which it is not.

In the present embodiment, as described later, when the robot 1, while speaking and while looking at the front of the user's (speaker's) face, detects at the same time that the user's mouth is moving and that voice is arriving from the user's direction, it pauses its own utterance and starts speech recognition. As a result, even when the user interrupts while the robot 1 is speaking (during self-speech), the robot can recognize the user's utterance and continue a natural dialogue. In the following, temporarily stopping the utterance of the robot 1 is referred to as "pausing" or "interrupting" it.

FIG. 1 shows the overall configuration of the robot 1. The robot 1 is configured as a so-called communication robot that can communicate with a user. The robot 1 can converse with users, or exercise and play with them, in places such as ordinary homes, offices, commercial facilities, hospitals, nursing care facilities, nursery schools, kindergartens, and schools. In the following, a person who communicates with the robot 1 through dialogue is called a user.

The robot 1 has one or more applications (services). Examples of applications include news, recreation, quizzes, games, gymnastics, and dance. When the user speaks a command associated with an application, the robot 1 recognizes the voice and executes the application corresponding to the command.

The robot 1 can be broadly divided into a robot control unit 10 and a robot body 11. The robot control unit 10 controls the robot body 11; its details are described later. The robot body 11 includes, for example, a torso 12, legs 13R and 13L, hands 14R and 14L, and a head 15. When left and right need not be distinguished, they are referred to as the legs 13 and the hands 14. The torso 12 corresponds to the "main body part". The legs 13, the hands 14, and the head 15 are movable relative to the torso 12 and correspond to the "movable part".

The robot control unit 10 is provided in the robot body 11. The robot 1 is also provided with a camera 121, microphones 120, and other components described later.

The robot control unit 10 includes, for example, a microprocessor (hereinafter, CPU) 101, a memory 102, an SSD (Solid State Drive) 103, an integrated control unit 104, a voice signal processing unit 105, an image recognition unit 106, a speech synthesis unit 107, a voice output unit 108, an LED (Light Emitting Diode) drive unit 109, a speech recognition activation determination unit 110, a speech recognition unit 111, a dictionary database 112, an acoustic model 113, a language model 114, an utterance generation unit 115, an utterance database 116, an actuator control unit 117, an actuator drive unit 118, a motion database 119, and a communication unit, a power supply unit, and the like that are not shown.

The CPU 101 reads and executes computer programs and operation control data stored in the memory 102 or the SSD 103, and thereby runs applications such as news, recreation, quizzes, dance, and gymnastics. In this embodiment the SSD 103 is used as the auxiliary storage device, but a storage device other than an SSD may be used.

When the speech recognition result of the speech recognition unit 111 is a command, the integrated control unit 104 instructs the corresponding processing units, such as the LED drive unit 109, the utterance generation unit 115, and the actuator control unit 117, to start the operation that corresponds to the command. When the speech recognition result is something other than a command, the integrated control unit 104 passes the user's words to the utterance generation unit 115.
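The routing performed by the integrated control unit 104 can be sketched roughly as below. The command words, the action names, and the tuple-based return value are hypothetical; the publication only states that commands start the matching operation and that other words go to the utterance generation unit 115.

```python
# Hypothetical command table: spoken command -> operation to start.
COMMANDS = {
    "news": "start_news",
    "quiz": "start_quiz",
    "dance": "start_dance",
}

def route_recognition_result(text):
    """Decide where a recognized utterance is sent.

    Returns ("command", operation) when the text is a registered command,
    otherwise ("utterance_generation", text) so a reply can be generated.
    """
    key = text.strip().lower()
    if key in COMMANDS:
        return ("command", COMMANDS[key])
    return ("utterance_generation", text)
```

A plain dictionary lookup is used here only to keep the sketch short; a real system would more likely match against the recognizer's command vocabulary.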

The voice signal processing unit 105 and the speech recognition unit 111 correspond to the "voice analysis unit". The voice signal processing unit 105 acquires sound signals from a plurality of microphones 120 and analyzes them. For example, one microphone 120 is provided at each of the portions of the robot body 11 that correspond to the left and right ears, and one each at the front and the rear of the portion that corresponds to the neck. In other words, the robot 1 has a plurality of microphones 120 at different locations so that the direction of a sound source can be estimated.

The voice signal processing unit 105 includes, for example, a feature vector extraction unit 1051 and a sound source direction estimation unit 1052. In the drawings, the word "unit" is omitted where appropriate.

The feature vector extraction unit 1051 extracts, as a feature vector, the features needed for speech recognition from the sound acquired by the microphones 120. The extracted feature vector is sent to the speech recognition unit 111. The feature vector extraction unit 1051 can also perform beamforming using the sound source direction estimated by the sound source direction estimation unit 1052, and generate the feature vector from a signal in which the sound arriving from that direction is emphasized. This makes it possible to extract more clearly the features of the voice of the user who is speaking to the robot 1.

The sound source direction estimation unit 1052 estimates the direction of a sound source by analyzing the differences in the times at which the sound wave from the source reaches each microphone 120. The estimated direction is sent to the speech recognition activation determination unit 110. For example, the sound source direction can be estimated by calculating the difference in arrival time between the sound signals received by the two microphones 120 mounted at the left and right ears of the robot body 11 (delay time estimation method). In addition, the relative strength of the sound signals received by the two microphones 120 mounted at the front and rear of the neck of the robot body 11 makes it possible to tell whether the sound arrived from the front or from the rear. The sound source direction may also be estimated by calculating the arrival time differences among the sound signals received by all four microphones 120.
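As a rough illustration of the delay time estimation described above, the following Python sketch converts an arrival time difference between the left and right ear microphones into an azimuth angle, and compares front and rear signal levels. The far-field model (path difference = spacing × sin(azimuth)), the speed of sound value, and the function names are assumptions, not details given in the publication.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C

def estimate_azimuth_deg(arrival_left_s, arrival_right_s, mic_spacing_m):
    """Estimate azimuth in degrees: 0 is straight ahead, positive is to the right."""
    # A source on the right reaches the right microphone first, so the
    # left arrival time is later and the delay is positive.
    delay_s = arrival_left_s - arrival_right_s
    # Far-field assumption: path difference = spacing * sin(azimuth).
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))

def is_from_front(front_level, rear_level):
    """Front/rear decision from the levels at the two neck microphones."""
    return front_level >= rear_level
```

Two microphones alone cannot separate front from rear, because mirror-image positions produce the same delay. That ambiguity is why the level comparison at the neck microphones is kept as a separate step in this sketch.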

The image recognition unit 106 corresponds to the "image analysis unit" and analyzes the image data (hereinafter also called images) captured by the camera 121. The analysis result of the image recognition unit 106 is sent to the speech recognition activation determination unit 110. At least one camera 121 is provided, for example, on the front of the robot head 15 (the area corresponding to the robot's face). The image recognition unit 106 includes, for example, a face detection unit 1061 and a mouth movement detection unit 1062.

The face detection unit 1061 extracts the user's face (a human face) from the image acquired from the camera 121. Specifically, the face detection unit 1061 outputs the face region, detected as a rectangle, the coordinates of the nose and mouth, and the orientation of the face.

The mouth movement detection unit 1062 detects whether the user's mouth is moving, based on images captured in succession by the camera 121. Specifically, it uses the coordinates of the nose, mouth, and so on detected by the face detection unit 1061 in the successive images to determine whether there is movement around the user's mouth.
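One simple way to turn the nose and mouth coordinates from successive frames into a movement decision is sketched below. The publication does not specify the criterion; the use of the nose-to-mouth distance normalized by the face height, the threshold value, and all names here are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class FaceObservation:
    """Per-frame output assumed from the face detector (pixel units)."""
    face_height_px: float  # height of the detected face rectangle
    nose_y_px: float       # vertical position of the nose
    mouth_y_px: float      # vertical position of the mouth

def mouth_is_moving(frames, threshold=0.03):
    """Return True if the nose-to-mouth distance varies across the frames.

    The distance is divided by the face height so that the result does not
    depend on how far the user is from the camera.
    """
    if len(frames) < 2:
        return False  # movement cannot be judged from a single frame
    ratios = [
        (f.mouth_y_px - f.nose_y_px) / f.face_height_px for f in frames
    ]
    return max(ratios) - min(ratios) > threshold
```

Measuring the mouth relative to the nose, rather than in absolute image coordinates, keeps a head that merely shifts in the frame from being mistaken for a moving mouth.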

The speech synthesis unit 107 synthesizes a voice signal corresponding to the message (a response sentence or the like) input from the utterance generation unit 115. The voice signal synthesized by the speech synthesis unit 107 is output from the speaker 122 via the voice output unit 108.

The LED drive unit 109 drives the LED 123 provided on the robot body 11. At least one LED 123 can be provided, for example, on the front of the robot head 15 (the area corresponding to the face). Alternatively, LEDs may be provided on the back of the robot head 15, around the neck, on the chest of the torso 12, or elsewhere. Furthermore, a display unit or light-emitting unit such as a liquid crystal display or an OLED (Organic Light Emitting Diode) may be provided instead of, or together with, the LED 123.

The speech recognition activation determination unit 110 corresponds to the "activation determination unit" and is hereinafter also called the activation determination unit 110. Based on the sound received by each microphone 120 and the image captured by the camera 121, the activation determination unit 110 estimates the situation of the user (speaker) and determines whether to start or stop the speech recognition processing. The feature vector extraction unit 1051 and the speech recognition unit 111 start or stop operating according to the determination result of the activation determination unit 110.

The speech recognition unit 111 recognizes speech based on the feature vectors and other data acquired from the voice signal processing unit 105. The speech recognition unit 111 can use, for example, the dictionary database 112, the acoustic model 113, and the language model 114.

The acoustic model 113 is a database that stores the reading of a text in association with the waveform produced when that text is pronounced; it defines which waveform is recognized as which word. The language model 114 is a database that stores, for each language, how words are arranged (grammar) and similar information. The dictionary database 112 holds general dictionary data. Multiple languages can be supported by, for example, preparing a dictionary database 112, an acoustic model 113, and a language model 114 for each language.

The utterance generation unit 115 generates a response sentence (message) corresponding to the words recognized by the speech recognition unit 111 and outputs it from the speaker 122 via the speech synthesis unit 107 and related units. Specifically, the utterance generation unit 115 refers to the utterance database 116, selects an example sentence that responds to the recognized words, and generates the response sentence from the selected example.

The actuator control unit 117 is connected to a plurality of actuators 124 via the actuator drive unit 118 and controls each actuator 124. Examples of the actuator 124 include a DC servo motor for moving the joint of each part. The actuator is not limited to this; an ultrasonic motor, a piezoelectric actuator, a solenoid, or the like may also be used. The following description takes a joint motor as an example of the actuator, and the joint motor is therefore sometimes given the reference numeral 124 and called the joint motor 124.

To carry out an operation instructed by the integrated control unit 104, the actuator control unit 117 refers to the motion database 119 and controls the joint motor 124 of each part. The motion database 119 stores the control information for the joint motors 124 (rotation angle, rotation time, rotation speed, and sequence) for each of the various operations.
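The kind of record the motion database 119 might hold can be sketched as follows. The field names, the joint name, the motion name, and the numeric values are hypothetical; the publication only lists rotation angle, rotation time, rotation speed, and sequence as the registered control information.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MotorStep:
    """One step of a motion for a single joint motor (illustrative fields)."""
    joint: str          # which joint motor to drive
    angle_deg: float    # target rotation angle
    duration_s: float   # rotation time
    speed_deg_s: float  # rotation speed

# Each motion is an ordered sequence of steps; the list order is the sequence.
MOTION_DB = {
    "arm_swing": [
        MotorStep("right_shoulder", 30.0, 0.5, 60.0),
        MotorStep("right_shoulder", -30.0, 0.5, 60.0),
    ],
}

def total_duration_s(motion_name):
    """Sum the rotation times of all steps in the named motion."""
    return sum(step.duration_s for step in MOTION_DB[motion_name])
```

Looking up `MOTION_DB["arm_swing"]` returns the steps in playback order, which is how this sketch represents the "sequence" item.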

When the activation determination unit 110 determines that the speech recognition processing should be started, the integrated control unit 104 instructs the actuator control unit 117 to execute the speech recognition start motion 1191. Conversely, when the activation determination unit 110 determines that the speech recognition processing should be stopped, the integrated control unit 104 instructs the actuator control unit 117 to execute the speech recognition stop motion 1192.

The speech recognition start motion 1191 is a motion executed at the time the speech recognition processing starts, and corresponds to the "speech recognition start operation". The speech recognition stop motion 1192 is a motion executed at the time the speech recognition processing stops (that is, when the robot starts speaking), and corresponds to the "speech recognition stop operation". Because the speech recognition stop motion 1192 is performed while the robot 1 is speaking, it can also be called an utterance motion.

The speech recognition start motion 1191 and the speech recognition stop motion 1192 can draw the user's attention by, for example, moving a movable part of the robot body 11 or lighting the LED 123.

Through the movements of the robot 1, the user can learn by experience when the speech recognition processing starts and stops. By seeing the speech recognition start motion 1191, the user knows that the speech recognition processing has started and can speak at the right moment. By seeing the speech recognition stop motion 1192, the user knows that the speech recognition processing is stopped. Even a user with little interest in the capabilities of the robot 1 experiences how these motions 1191 and 1192 relate to whether his or her utterance was recognized, and can therefore be expected to learn naturally a way of speaking that suits the capabilities of the robot 1.

An example of the speech recognition start motion 1191 is as follows. The robot moves the arm 14 during self-speech and stops the movement of the arm 14 when the speech recognition processing starts, thereby signaling the start of the speech recognition processing. Instead of, or together with, stopping the arm 14, the robot may signal the start of the speech recognition processing with a gesture of lending an ear to the user who is the sound source.

An example of the speech recognition stop motion 1192 is as follows. The robot keeps the arm 14 still during the speech recognition processing and starts moving the arm 14 when the speech recognition processing is stopped and self-speech begins, thereby signaling that the speech recognition processing has stopped. Instead of, or together with, starting to move the arm 14, the robot may stop the listening gesture it made toward the user during the speech recognition processing and return to its normal posture. A movable part other than the arm 14, such as the leg 13 or the head 15, can also be moved to inform the user that speech recognition has started or stopped.

As described above, the speech recognition start motion 1191 stops the movement of the arm 14 serving as the movable part (for example, swinging the arm up and down, also called the arm swing motion). To do so, the joint motor 124 may be controlled to stop at a predetermined position, or the joint motor 124 may be stopped in brake mode.

In the speech recognition start motion 1191 of this embodiment, the H-bridge inside the joint motor 124, which is configured as a DC servo motor, is short-circuited so that the motor acts as a generator rather than as a motor. This reduces the noise produced when the DC servo motor is stopped and suppresses any loss of accuracy in the speech recognition processing. Moreover, stopping a DC servo motor under stop control requires a drive instruction that specifies the stop angle, which makes the control processing complicated and slow, whereas brake mode only requires short-circuiting the H-bridge, so the motor can be stopped simply and quickly. In addition, in brake mode a load is needed to rotate the DC servo motor, which limits how far the arm 14 rotates on its own under gravity.

図２を参照して、本実施例による対話制御処理を説明する。全体の流れは別の実施例で後述する。 With reference to FIG. 2, the dialogue control process according to the present embodiment will be described. The overall flow will be described later in another embodiment.

まず最初の状態で、ロボット１は発話中であるとする（Ｓ１０）。ロボット１は、例えば、ニュースを読み上げたり、ユーザと会話しているものとする。ロボット１の発話中、つまり自己発話中では、発話モーション（音声認識停止モーション１１９２）が実行されている。 First, it is assumed that the robot 1 is speaking in the initial state (S10). It is assumed that the robot 1 reads out news or is talking to the user, for example. During the speech of the robot 1, that is, during the self-speech, a speech motion (voice recognition stop motion 1192) is executed.

ロボット制御部１０は、ユーザによるロボット１への話しかけを判定するための所定条件が成立したか監視している（Ｓ１１）。所定の条件とは、例えば、ユーザの顔の向きが所定の方向を向いており、ユーザの口元に動きが検出されており、マイク１２０で検出した音声の音源の方向がロボット１の正面前方にあること、である。詳しくは、ユーザの顔がロボット１の正面にあり、ユーザの口元が動いており、ロボット１の正面前方から音声が検出された場合である。ここで、本明細書において、ロボット１の正面前方とは、ロボット１の胴体を基準にしたものではなく、ロボット１に搭載されたカメラ１２１で撮影可能な方向（画角撮影範囲）を言う。「ユーザの顔が正面に位置する場合」の例については、図６，図７で後述する。 The robot control unit 10 monitors whether a predetermined condition for determining that the user is talking to the robot 1 is satisfied (S11). The predetermined condition is, for example, that the user's face is oriented in a predetermined direction, movement of the user's mouth is detected, and the sound source of the voice detected by the microphones 120 lies in front of the robot 1. More specifically, the condition is satisfied when the user's face is in front of the robot 1, the user's mouth is moving, and voice is detected from in front of the robot 1. In this specification, "in front of the robot 1" is not defined relative to the body of the robot 1 but refers to the direction in which the camera 121 mounted on the robot 1 can capture images (the field-of-view range). Examples of "when the user's face is located in front" will be described later with reference to FIGS. 6 and 7.

所定の条件が成立した場合は、ロボット１の自己発話中に、ユーザが話しかけ始めた状態であると推定することができる。これら顔の向き、口元の動き、音源の方向の検出タイミングが一致する場合に、所定の条件が成立したものとして判定する（Ｓ１１）。誤検知を抑制し、対話が中断するのを防止するためである。タイミングが一致しているか否かは、例えば、音声データのタイムスタンプと画像データのタイムスタンプとの差が所定時間内に収まるか否かで判定できる。 When the predetermined condition is satisfied, it can be estimated that the user has started speaking while the robot 1 is speaking. When the detection timings of the face direction, the mouth movement, and the sound source direction coincide with each other, it is determined that a predetermined condition is satisfied (S11). This is to suppress false detections and prevent the conversation from being interrupted. Whether the timings match can be determined, for example, by determining whether the difference between the time stamp of the audio data and the time stamp of the image data falls within a predetermined time.
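
A minimal sketch of the check in step S11, assuming the image and audio analyses have already produced timestamped observations. The data classes, the function name `user_started_speaking`, and the 0.5-second tolerance `max_skew_s` are hypothetical; the specification only states that the timestamp difference must fall within a predetermined time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VisionObservation:
    timestamp: float       # capture time of the image frame, in seconds
    face_is_frontal: bool  # user's frontal face is within the camera's field of view
    mouth_is_moving: bool  # movement of the user's mouth was detected


@dataclass
class AudioObservation:
    timestamp: float       # capture time of the audio frame, in seconds
    source_in_front: bool  # estimated sound source direction is in front of the robot


def user_started_speaking(robot_is_speaking: bool,
                          vision: Optional[VisionObservation],
                          audio: Optional[AudioObservation],
                          max_skew_s: float = 0.5) -> bool:
    """Return True when all cues of S11 hold and their timestamps coincide."""
    if not robot_is_speaking or vision is None or audio is None:
        return False
    cues_present = (vision.face_is_frontal
                    and vision.mouth_is_moving
                    and audio.source_in_front)
    # The cues must be detected at (nearly) the same time to suppress false detections.
    timestamps_coincide = abs(vision.timestamp - audio.timestamp) <= max_skew_s
    return cues_present and timestamps_coincide
```

Requiring all three cues together with coinciding timestamps trades some sensitivity for fewer spurious interruptions, which is the motivation the text gives.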

ロボット制御部１０は、所定の条件が成立したと判定すると（Ｓ１１：ＹＥＳ）、音声認識処理を開始したことを、ロボット１の持つ表現能力を駆使してユーザへ知らせる（Ｓ１２）。ロボット制御部１０は、例えば、自己発話を一時停止し、発話モーションを停止し、ユーザへの短い問いかけ動作を行い、かつ、音声入力待ちを示すＬＥＤ表示を行い、そして、各マイク１２０を通じて音声を収集する（Ｓ１２）。 When the robot control unit 10 determines that the predetermined condition is satisfied (S11: YES), it notifies the user that the voice recognition process has started, making full use of the expressive capabilities of the robot 1 (S12). For example, the robot control unit 10 pauses the self-speech, stops the speech motion, briefly questions the user, shows an LED indication that it is waiting for voice input, and collects voice through the microphones 120 (S12).

ここで、発話モーションの停止は、音声認識停止モーション１１９２を停止させることを意味する。ユーザへの短い問いかけ動作では、例えば、「ん？」のような短い言葉であって、対話の最中に突然発せられたとしても対話の流れをあまり妨げないと思われる言葉を発する。音声入力待ちを示すＬＥＤ表示では、例えば、ＬＥＤ１２３を青く点灯させることで、音声入力待ちであることをユーザに知らせる。 Here, stopping the speech motion means stopping the voice recognition stop motion 1192. In the brief questioning of the user, the robot utters a short word such as "n?" (roughly "hm?"), chosen because it is unlikely to disturb the flow of the dialogue much even if uttered suddenly in the middle of it. In the LED indication of waiting for voice input, for example, the LED 123 is lit blue to inform the user that the robot is waiting for voice input.

ロボット制御部１０は、ユーザの顔がロボット１の正面に見えている状態で、各マイク１２０から音声を収集できるか判定する（Ｓ１３）。ユーザの顔がロボット１の正面に見えている状態で音声を収集できない場合（Ｓ１３：ＮＯ）、ロボット制御部１０は、音声の収集開始から所定時間ｔ１が経過したか判定する（Ｓ１４）。 The robot control unit 10 determines whether voice can be collected from each microphone 120 in a state where the user's face is visible in front of the robot 1 (S13). When the voice cannot be collected while the user's face is visible in front of the robot 1 (S13: NO), the robot control unit 10 determines whether a predetermined time t1 has elapsed from the start of voice collection (S14).

この所定時間ｔ１は、例えば２，３秒等の数秒程度に設定することができる。無音状態が続いた場合は、ステップＳ１２で一時停止させたロボット１の発話（自己発話）を直ちに再開させるためである。所定の時間ｔ１を短く設定することで、所定の条件が偶然成立したような場合に、対話が長時間途切れるのを防止することができ、一時停止させた発話に自然に復帰させることができる。 The predetermined time t1 can be set to a few seconds, for example two or three seconds. This is so that, if silence continues, the speech of the robot 1 (self-speech) paused in step S12 is resumed promptly. Setting the predetermined time t1 short prevents the dialogue from being interrupted for a long time when the predetermined condition happens to be satisfied by chance, and allows the robot to return naturally to the paused speech.

ロボット制御部１０は、音声の収集を開始してから所定時間ｔ１が経過するまでステップＳ１３を繰り返す（Ｓ１４：ＮＯ）。所定時間ｔ１が経過しても音声を収集できない場合（Ｓ１４：ＹＥＳ）、ロボット制御部１０は、ステップＳ１２で一時停止させた自己発話を再開し（Ｓ１５）、ステップＳ１０へ戻る。 The robot controller 10 repeats step S13 until a predetermined time t1 has elapsed since the start of voice collection (S14: NO). If the voice cannot be collected even after the predetermined time t1 has elapsed (S14: YES), the robot control unit 10 resumes the self-speech suspended in step S12 (S15), and returns to step S10.

自己発話を再開するステップＳ１５では、発話モーション（音声認識停止モーション１１９２）を再開させると共に、ＬＥＤ１２３を例えば赤く点灯させることで、音声認識処理を停止したことをユーザへ知らせる。 In step S15 for resuming self-speech, the speech motion (speech recognition stop motion 1192) is resumed, and the LED 123 is lit red, for example, to inform the user that the speech recognition process has been stopped.

一方、ロボット制御部１０は、所定時間ｔ１が経過する前に、ユーザの顔がロボット１の正面に見えている状態で、各マイク１２０から音声を収集できる場合（Ｓ１３：ＹＥＳ）、音声認識処理を開始する（Ｓ１６）。 On the other hand, when voice can be collected from the microphones 120 while the user's face is visible in front of the robot 1 before the predetermined time t1 elapses (S13: YES), the robot control unit 10 starts the voice recognition process (S16).

なお、図示は省略しているが、ロボット１の内部で音声認識処理の全てを行う必要はなく、少なくとも一部の処理をロボット１の外部に設けられた音声認識処理サーバで実行してもよい。外部のサーバで音声認識処理の全部または一部を実行する場合、音声認識処理の結果が出るまで多少の時間を要する。そこで、ステップＳ１６では、ロボット頭部１５を前に傾ける等して頷く動作（頷きモーション）を実行してもよい。ロボット１が頷きモーションを実行することで、音声認識処理に時間がかかった場合でも、間を持たせることができ、自然な対話を継続できる。なお、音声認識処理の全てをロボット１内で実行する場合であっても、音声認識処理中に頷きモーションを実行することで、ユーザに対し、ユーザの言葉に耳を傾けているように演出することができる。 Although not shown, it is not necessary to perform all of the voice recognition process inside the robot 1, and at least part of the process may be executed by a voice recognition server provided outside the robot 1. When all or part of the voice recognition process is executed by an external server, it takes some time until the recognition result is obtained. Therefore, in step S16, the robot may perform a nodding action (nodding motion), for example by tilting the robot head 15 forward. By performing the nodding motion, the robot 1 can fill the pause even when the voice recognition process takes time, and a natural dialogue can continue. Even when the entire voice recognition process is executed inside the robot 1, performing the nodding motion during the voice recognition process can give the user the impression that the robot is listening to the user's words.

ロボット制御部１０は、音声認識の結果が出たか判定する（Ｓ１７）。音声認識結果が出ない場合（Ｓ１７：ＮＯ）、ロボット制御部１０は、ステップＳ１６での音声認識処理の開始から所定時間ｔ２が経過したか判定する（Ｓ１８）。ロボット制御部１０は、音声認識の結果が出るまで、所定時間ｔ２だけ待機する。 The robot control unit 10 determines whether a voice recognition result has been obtained (S17). When the voice recognition result is not output (S17: NO), the robot control unit 10 determines whether the predetermined time t2 has elapsed from the start of the voice recognition process in step S16 (S18). The robot control unit 10 waits for a predetermined time t2 until a voice recognition result is obtained.

所定時間ｔ２は、ステップＳ１４で述べた所定時間ｔ１よりも長い値（例えば１０秒程度）に設定することができる。ユーザによるロボット１への話しかけが行われている可能性の高い状況下では、比較的長い時間ｔ２だけ待機することで、ユーザからの話しかけ（割込み）を受け入れて、新たな対話へ導くことができる。もしも、ステップＳ１８の待機時間ｔ２を短く設定すると、ユーザが話しかけようとした動作がキャンセルされる可能性が高くなり、かえって自然な対話を阻害するおそれがある。 The predetermined time t2 can be set to a value longer than the predetermined time t1 described for step S14 (for example, about 10 seconds). In a situation where the user is likely to be talking to the robot 1, waiting for the relatively long time t2 allows the robot to accept the user's utterance (interruption) and lead into a new dialogue. If the waiting time t2 in step S18 were set short, the user's attempt to speak would be more likely to be cancelled, which could hinder natural dialogue instead.

ロボット制御部１０は、所定時間ｔ２が経過する前に、音声認識結果を得た場合（Ｓ１７：ＹＥＳ）、音声認識された言葉に対応するメッセージを発話生成部１１５が生成できるか否か判定する（Ｓ１９）。 When the speech recognition result is obtained before the predetermined time t2 has elapsed (S17: YES), the robot control unit 10 determines whether the utterance generation unit 115 can generate a message corresponding to the speech-recognized word. (S19).

詳しくは、ロボット制御部１０は、音声認識された言葉を形態素解析し、形態素解析の結果と一致するキーワードが発話データベース１１６に記憶されているか判定することで、有効な会話の可能な音声を認識したか判定する（Ｓ１９）。 Specifically, the robot control unit 10 performs morphological analysis on the recognized words and determines whether a keyword matching the result of the morphological analysis is stored in the utterance database 116, thereby determining whether speech that allows a valid conversation has been recognized (S19).

音声認識された言葉が発話データベース１１６に記憶されているキーワードを含む場合（Ｓ１９：ＹＥＳ）、ロボット制御部１０は、認識した音声に基づいて、ユーザと新しい会話を開始する（Ｓ２０）。 When the speech-recognized word includes a keyword stored in the utterance database 116 (S19: YES), the robot control unit 10 starts a new conversation with the user based on the recognized speech (S20).

これに対し、ロボット制御部１０は、ステップＳ１６で音声認識された言葉が発話データベース１１６に記憶されているキーワードを一つも含んでいない場合、即ち、有効な会話が可能な音声を認識できなかった場合（Ｓ１９：ＮＯ）、ステップＳ１５に移り、一時停止させていたロボット１の発話を再開させる。 On the other hand, when the words recognized in step S16 do not include any keyword stored in the utterance database 116, that is, when speech that allows a valid conversation could not be recognized (S19: NO), the robot control unit 10 proceeds to step S15 and resumes the paused speech of the robot 1.
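
The keyword check of step S19 can be sketched as a lookup against a small in-memory table. The table contents, the name `find_response`, and the use of pre-split lowercase tokens in place of a real morphological analyzer are all assumptions made for illustration; the specification does not describe the structure of the utterance database 116.

```python
from typing import Dict, Iterable, Optional

# Hypothetical stand-in for the utterance database 116: keyword -> reply.
UTTERANCE_DB: Dict[str, str] = {
    "weather": "Let me check tomorrow's weather.",
    "news": "Here is today's news.",
}


def find_response(morphemes: Iterable[str],
                  db: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Return the reply for the first morpheme that matches a stored keyword.

    `morphemes` is assumed to be the output of a morphological analysis
    performed elsewhere. None corresponds to S19: NO (resume self-speech).
    """
    table = UTTERANCE_DB if db is None else db
    for morpheme in morphemes:
        reply = table.get(morpheme.lower())
        if reply is not None:
            return reply  # S19: YES, a new conversation can start (S20)
    return None
```

For example, `find_response(["tell", "me", "the", "weather"])` finds the `weather` entry, while an unintelligible token list yields `None` and the robot would return to its paused speech.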

図３は、図２で述べた処理の一つの具体例を示すタイムチャートである。図示は省略するが、最初にユーザが「明日の天気を教えて」といった内容の発言をし、ロボット１がこの発言に応答して天気予報データをネットワーク上の天気予報サーバ等から収集して応答する場合を例に挙げて説明する。 FIG. 3 is a time chart showing one specific example of the process described with reference to FIG. 2. Although not shown, the description takes as an example a case in which the user first says something like "Tell me tomorrow's weather" and the robot 1 responds by collecting weather forecast data from a weather forecast server or the like on the network.

ケース（ａ）では、ロボット１は、例えば、「今朝の関東地方は晴れのようです」といったニュースを自己発話Ｗ１として読み上げているものとする。なお、ロボット１は、図示せぬ通信機能を用いて、外部のニュース配信サイトなどの情報源から情報を適宜取得できる。 In case (a), it is assumed that the robot 1 reads, for example, a news such as “This morning in the Kanto region seems to be sunny” as a self-utterance W1. The robot 1 can appropriately acquire information from an information source such as an external news distribution site using a communication function (not shown).

ケース（ｂ）は、ロボット１の自己発話Ｗ１中に、ユーザがロボット１に話しかけ始めたような状況が偶然出現した場合、つまり、図２のステップＳ１１で述べた所定の条件が一時的に成立した場合を示す。 Case (b) shows a case in which a situation resembling the user starting to talk to the robot 1 arises by chance during the self-speech W1 of the robot 1, that is, a case in which the predetermined condition described for step S11 in FIG. 2 is temporarily satisfied.

ケース（ｂ）では、所定の条件が成立すると、ロボット１は、自己発話Ｗ１を一時的に中断し、短い応答「ん？」を発する（図２のＳ１２）。これと同時に、ロボット１は、腕１４を上下動させるなどの音声認識停止モーション１１９２を停止し、音声認識処理を開始する（Ｓ１２）。しかし、音声認識処理を開始後に、音声を取得できない無音時間が所定時間ｔ１継続すると（図２のＳ１４でＹＥＳ）、音声認識処理を停止し、中断された発話Ｗ１ａをその続きＷ１ｂから再開する（図２のＳ１５）。 In case (b), when the predetermined condition is satisfied, the robot 1 temporarily interrupts the self-speech W1 and utters the short response "n?" (S12 in FIG. 2). At the same time, the robot 1 stops the voice recognition stop motion 1192, such as moving the arm 14 up and down, and starts the voice recognition process (S12). However, if silence during which no voice can be acquired continues for the predetermined time t1 after the voice recognition process starts (YES in S14 in FIG. 2), the robot 1 stops the voice recognition process and resumes the interrupted speech W1a from its continuation W1b (S15 in FIG. 2).

つまり、ケース（ｂ）のように、ユーザから話しかけられたと仮に誤って判定した場合でも、ロボット１は「ん？」というごく短く自然な応答を返し、一瞬だけ耳をそばだてるかのような反応を示してから速やかに元の発話に復帰する。したがって、ロボット１の発話が中断前部分Ｗ１ａと中断後部分Ｗ１ｂとに分かれた場合でも、不自然さをユーザに与える可能性を低減できる。 In other words, even when the robot 1 wrongly determines that the user has spoken to it, as in case (b), it returns the very short and natural response "n?", reacts as if pricking up its ears for a moment, and then promptly returns to its original speech. Therefore, even when the speech of the robot 1 is split into the pre-interruption part W1a and the post-interruption part W1b, the likelihood that the user perceives it as unnatural can be reduced.
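
Resuming from the interruption point, as in W1a followed by W1b, only requires remembering how far the utterance has been spoken. The following is a minimal sketch under the assumption that an utterance is a list of words; the class `PausableUtterance` and its methods are hypothetical and do not come from the specification.

```python
from typing import List, Optional, Sequence


class PausableUtterance:
    """Tracks the position in an utterance so speech can resume after a pause."""

    def __init__(self, words: Sequence[str]) -> None:
        self._words = list(words)
        self._next = 0  # index of the next word to be spoken

    def speak(self, max_words: Optional[int] = None) -> List[str]:
        """Emit up to max_words words from the current position (all if None)."""
        if max_words is None:
            end = len(self._words)
        else:
            end = min(len(self._words), self._next + max_words)
        spoken = self._words[self._next:end]
        self._next = end
        return spoken

    def remaining(self) -> List[str]:
        """Words not yet spoken (the part W1b after an interruption)."""
        return self._words[self._next:]

    def discard_rest(self) -> None:
        """Drop the unspoken remainder, as when a new conversation W2 starts."""
        self._next = len(self._words)
```

Calling `speak(3)` and later `speak()` yields the two halves of the original sentence in order, while `discard_rest()` models the case in which the remainder is never uttered.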

ケース（ｃ）は、所定の条件が成立したので「ん？」という応答を発し、音声認識処理を開始したが、所定時間ｔ２内に認識可能な音声を収集できなかった場合を示す（図２のＳ１７でＮＯ、Ｓ１８でＹＥＳ）。 Case (c) shows a case in which the predetermined condition is satisfied, so the robot utters the response "n?" and starts the voice recognition process, but no recognizable voice can be collected within the predetermined time t2 (NO in S17 and YES in S18 in FIG. 2).

ケース（ｃ）では、音声認識処理の開始から所定時間ｔ２が経過した後、ロボット１は、中断された発話を中断位置から再開する（Ｓ１５）。なお、音声認識停止モーション１１９２と音声認識開始モーション１１９１との切替については、図２で述べたので、ここでは割愛する。 In case (c), after the predetermined time t2 has elapsed from the start of the voice recognition process, the robot 1 resumes the interrupted speech from the point of interruption (S15). Switching between the voice recognition stop motion 1192 and the voice recognition start motion 1191 has already been described with reference to FIG. 2 and is therefore omitted here.

ケース（ｄ）は、所定の条件が成立して「ん？」という応答を発し、音声認識処理を開始したが、有効な会話の可能な音声を収集できなかった場合、即ち、発話データベース１１６に登録されているキーワードに対応する音声を認識できなかった場合を示す（図２のＳ１７でＹＥＳ、Ｓ１９でＮＯ）。ユーザの音声からキーワードを抽出できなかった場合、ロボット１は、中断された発話を中断位置から再開する（Ｓ１５）。 In the case (d), when a predetermined condition is satisfied and a response “n?” Is issued and the speech recognition processing is started, but speech capable of valid conversation cannot be collected, that is, in the speech database 116. The case where the voice corresponding to the registered keyword could not be recognized is shown (YES in S17 of FIG. 2, NO in S19). If the keyword cannot be extracted from the user's voice, the robot 1 resumes the suspended utterance from the suspended position (S15).

ケース（ｅ）は、ロボット１の発話中に検出されたユーザからの話しかけに対応して、ロボット１が新たな話題に転じる場合を示す。 Case (e) shows a case where the robot 1 turns to a new topic in response to a conversation from the user detected during the speech of the robot 1.

ロボット１が「ん？」という短い応答を発した後、ユーザから収集した音声に有効な会話の可能なキーワードが含まれている場合（図２のＳ１９でＹＥＳ）、ロボット１は、ユーザから話しかけられた音声に応じて新しい会話Ｗ２を開始する（図２のＳ２０）。最初の発話Ｗ１の残り部分Ｗ１ｂは、発話されない。 After the robot 1 utters the short response "n?", if the voice collected from the user contains a keyword that allows a valid conversation (YES in S19 in FIG. 2), the robot 1 starts a new conversation W2 in response to what the user said (S20 in FIG. 2). The remaining part W1b of the first speech W1 is not uttered.

ケース（ｅ）では、ユーザからの話しかけに対応して、新たな話題に移ることができ、円滑なコミュニケーションを継続することができる。 In the case (e), it is possible to move to a new topic in response to the conversation from the user, and smooth communication can be continued.

このように本実施例のロボット１は、音声認識中に、（１）ユーザの顔または音声を検出不能の場合（Ｓ１３：ＮＯ）、（２）検出した音声を認識できない場合（Ｓ１７：ＮＯ）、（３）音声認識した言葉に対応した会話ができない場合（Ｓ１９：ＮＯ）、のいずれかの状態になると、音声認識処理を停止し、音声認識開始モーション１１９１から音声認識停止モーション１１９２へ切り替わる。 As described above, during voice recognition, the robot 1 of the present embodiment stops the voice recognition process and switches from the voice recognition start motion 1191 to the voice recognition stop motion 1192 when any of the following occurs: (1) the user's face or voice cannot be detected (S13: NO), (2) the detected voice cannot be recognized (S17: NO), or (3) no conversation corresponding to the recognized words is possible (S19: NO).
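
The branches of FIG. 2 after the short response (S13 to S20) can be summarized as a pure decision function. This is a simplified sketch: the function name `handle_interruption`, its parameters, and the default values of 3 and 10 seconds are illustrative assumptions chosen to agree with the "two or three seconds" and "about 10 seconds" mentioned in the text.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    RESUME_SELF_SPEECH = "S15"  # return to the paused utterance
    NEW_CONVERSATION = "S20"    # switch to the user's topic


def handle_interruption(voice_onset_s: Optional[float],
                        recognition_delay_s: Optional[float],
                        keyword_found: bool,
                        t1: float = 3.0,
                        t2: float = 10.0) -> Outcome:
    """Decide what the robot does after pausing its own speech.

    voice_onset_s: seconds from the start of voice collection until voice is
        collected with the user's face in front, or None if it never is.
    recognition_delay_s: seconds from the start of recognition until a result
        is obtained, or None if no result is obtained.
    keyword_found: whether the result contains a keyword of the utterance database.
    """
    # S13/S14: silence for t1 means the trigger was probably accidental.
    if voice_onset_s is None or voice_onset_s > t1:
        return Outcome.RESUME_SELF_SPEECH
    # S17/S18: wait up to t2 for a recognition result.
    if recognition_delay_s is None or recognition_delay_s > t2:
        return Outcome.RESUME_SELF_SPEECH
    # S19: only start a new conversation if a known keyword was recognized.
    if not keyword_found:
        return Outcome.RESUME_SELF_SPEECH
    return Outcome.NEW_CONVERSATION
```

Under these assumptions, cases (b), (c), and (d) of FIG. 3 all map to `RESUME_SELF_SPEECH`, and only case (e) maps to `NEW_CONVERSATION`.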

なお、新しい会話が開始された場合（Ｓ２０）、ロボット１の発話中では音声認識処理は停止される。ロボット１の発話が終わった後、音声認識停止モーション１１９２から音声認識開始モーション１１９１に切り替わり、音声認識が開始される。 When a new conversation is started (S20), the speech recognition process is stopped while the robot 1 is speaking. After the utterance of the robot 1 is finished, the voice recognition stop motion 1192 is switched to the voice recognition start motion 1191, and voice recognition is started.

このように構成される本実施例によれば、所定の条件が成立した場合（話者であるユーザの顔の正面を見ている自己発話中に、ユーザの口元が動いており、かつロボット１の正面前方から音声が入力される場合）、ロボット１の発話を一時停止し、音声認識処理を開始する。したがって、本実施例によれば、自己発話中に音声認識可能な機能（バージイン機能）を用いずに、自己発話中のユーザからの話しかけに自然に応対することができ、円滑なコミュニケーションを行うことができる。 According to the present embodiment configured as described above, when the predetermined condition is satisfied (that is, when, during self-speech in which the robot 1 sees the front of the face of the user who is the speaker, the user's mouth is moving and voice is input from in front of the robot 1), the speech of the robot 1 is paused and the voice recognition process is started. Therefore, according to the present embodiment, the robot can respond naturally when the user speaks to it during its self-speech, without using a function that recognizes speech during self-speech (barge-in function), and smooth communication can be achieved.

本実施例によれば、リソースの制約が大きいためにバージイン機能を備えるのが難しい安価なロボットであっても、自己発話中のユーザからの話しかけに対応できる。したがって、コストをあまり増大させずにロボット１の対話性能を向上することができる。 According to the present embodiment, even an inexpensive robot that is difficult to provide a barge-in function due to large resource constraints can cope with a conversation from a user who is speaking. Therefore, it is possible to improve the interactive performance of the robot 1 without increasing the cost so much.

本実施例によれば、ユーザの顔の向きと口元の動きと音声到来方向との条件が全て揃ったときに、ユーザがロボット１に向かって話していると判断するため、ユーザの発話を精度よく検出することができ、話題の変化に追従することができる。 According to the present embodiment, the user is judged to be speaking to the robot 1 only when the conditions on the user's face orientation, mouth movement, and voice arrival direction are all met, so the user's utterance can be detected accurately and changes of topic can be followed.

本実施例によれば、ロボット１の発話を一時停止し、音声認識処理を開始する際に音声認識が可能状態になったことを示す音声認識開始モーション１１９１を実施する。このため、ユーザは、ロボットが発話モード（音声認識停止モード）から音声認識モードへ切り替わるタイミングを自然に学習できる。ユーザは、音声認識モードに移行するまでは音声認識されないという体験を通して、ロボット１の発話中にいきなり重要な言葉を発しても無駄になるといったロボット１の性能に応じた発話方法（割込み方法）を学習することができる。これにより、ユーザが自然にロボットとの対話をスムーズに行えるようになるという効果を期待できる。 According to the present embodiment, when the speech of the robot 1 is paused and the voice recognition process is started, the voice recognition start motion 1191 is performed to indicate that voice recognition has become possible. The user can therefore naturally learn the timing at which the robot switches from the speech mode (voice recognition stop mode) to the voice recognition mode. Through the experience that speech is not recognized until the robot has entered the voice recognition mode, the user can learn a way of speaking (a way of interrupting) suited to the capabilities of the robot 1, for example that suddenly saying important words while the robot 1 is speaking is wasted effort. As a result, the user can be expected to come to interact with the robot smoothly and naturally.

本実施例では、音声認識処理の開始を知らせるために音声認識開始モーション１１９１を実行し、発話中に動かしていた部位（例えば腕１４）の動きを停止させる。これにより、本実施例によれば、自然な対話動作の中で、ユーザにモードが切り替わったことを知らせることができる。 In this embodiment, the voice recognition start motion 1191 is executed to notify the start of the voice recognition process, and the movement of the part (for example, the arm 14) moved during the speech is stopped. Thereby, according to the present embodiment, it is possible to notify the user that the mode has been switched in a natural dialogue operation.

本実施例によれば、ロボット１の自己発話中にユーザから発せられた言葉の音声認識結果が、ロボット１の応答可能な話題の範囲（発話データベース１１６に記憶されているキーワードの範囲）である場合、その新しい話題に応答し、一方、応答可能な話題の範囲ではない場合、または音声認識できなかった場合は一時停止した自己発話をその続きから再開する。これにより、移り気なユーザの興味に合わせて対話を継続することができ、ユーザが意味不明な音を発した場合等には、会話のテンポをあまり崩すことなく、元の話題に復帰してコミュニケーションを継続することができる。 According to the present embodiment, when the recognition result for words uttered by the user during the self-speech of the robot 1 falls within the range of topics to which the robot 1 can respond (the range of keywords stored in the utterance database 116), the robot responds to the new topic. When it does not fall within that range, or when the speech could not be recognized, the robot resumes the paused self-speech from where it left off. This allows the dialogue to continue in line with the shifting interests of the user, and when, for example, the user makes an unintelligible sound, the robot can return to the original topic and continue communicating without greatly disrupting the tempo of the conversation.

本実施例によれば、音声認識処理を停止してロボット１の発話を開始するときに、ユーザの音声を認識しない状態になったことを示す音声認識停止モーション１１９２を実行するため、ユーザは、ロボット１が音声認識モードから発話モード（音声認識停止モード）へ切り替わるタイミングを自然に学習できる。 According to the present embodiment, when the voice recognition process is stopped and the robot 1 starts speaking, the voice recognition stop motion 1192 is executed to indicate that the user's voice is no longer being recognized, so the user can naturally learn the timing at which the robot 1 switches from the voice recognition mode to the speech mode (voice recognition stop mode).

本実施例によれば、音声認識開始モーション１１９１では、腕１４などの可動部を停止させるため、関節モータ１２４の作動音をマイク１２０が収集するのを防止し、音声認識の精度を高めることができる。 According to the present embodiment, movable parts such as the arm 14 are stopped in the voice recognition start motion 1191, which prevents the microphones 120 from picking up the operating sound of the joint motor 124 and improves the accuracy of voice recognition.

本実施例によれば、腕１４の停止時に、関節モータ１２４をブレーキモードで停止させるため、電気エネルギを消費せずに重力に抗して停止状態を保持することができる。 According to the present embodiment, when the arm 14 is stopped, the joint motor 124 is stopped in the brake mode. Therefore, the stopped state can be maintained against gravity without consuming electric energy.

図４を用いて第２実施例を説明する。本実施例を含む以下の各実施例は第１実施例の変形例に該当するため、第１実施例との相違を中心に述べる。本実施例では、ロボット１の周囲のユーザ数に応じて、第１実施例で述べた対話制御方法、即ち、ロボット１の自己発話中にユーザから話しかけられた場合に対応する対話処理（以下、割込み対応対話制御処理）の起動を制御する。 A second embodiment will be described with reference to FIG. 4. Each of the following embodiments, including this one, is a modification of the first embodiment, so the description focuses on the differences from the first embodiment. In this embodiment, activation of the dialogue control method described in the first embodiment, that is, the dialogue process that handles the case where the user speaks to the robot 1 during its self-speech (hereinafter, the interrupt handling dialogue control process), is controlled according to the number of users around the robot 1.

図４は、本実施例に係るロボット１の実行する処理の一部を示す。ロボット制御部１０は、カメラ１２１で撮影した画像やマイク１２０で収集した音声から、ロボット１の周囲の環境を取得する（Ｓ３０）。例えば、ロボット制御部１０は、カメラ１２１を搭載した頭部１５を所定角度水平方向に回動させることで、ロボット１の周囲に存在するユーザの画像を見渡すようにして取得する。 FIG. 4 shows a part of processing executed by the robot 1 according to the present embodiment. The robot control unit 10 acquires the environment around the robot 1 from the image captured by the camera 121 and the sound collected by the microphone 120 (S30). For example, the robot control unit 10 acquires the image of the user existing around the robot 1 by looking around the robot 1 by rotating the head 15 equipped with the camera 121 in a horizontal direction by a predetermined angle.

ロボット制御部１０は、ステップＳ３０で取得した画像等から会話対象となり得るユーザの候補を検出する（Ｓ３１）。ロボット制御部１０は、例えば、ユーザの顔画像の大きさなどからロボット１との距離を推定し、ロボット１から所定距離内に位置するユーザであって、ロボット１の正面付近に存在するユーザを、ロボット１と会話可能なユーザの候補として検出する（Ｓ３１）。 The robot control unit 10 detects candidate users who may be conversation partners from the images and other data acquired in step S30 (S31). For example, the robot control unit 10 estimates each user's distance from the robot 1 based on the size of the user's face image or the like, and detects users who are within a predetermined distance of the robot 1 and near its front as candidates for users who can converse with the robot 1 (S31).

ロボット制御部１０は、ステップＳ３１で検出したユーザ数が、あらかじめ設定された閾値ＴｈＵ以下であるか判定する（Ｓ３２）。会話可能なユーザ数が閾値ＴｈＵ以下である場合（Ｓ３２：ＹＥＳ）、ロボット制御部１０は、第１実施例で述べた割込み対応対話制御処理（図２のＳ１０〜Ｓ２０）を実行可能にセットする（Ｓ３３）。 The robot control unit 10 determines whether the number of users detected in step S31 is equal to or less than a preset threshold value ThU (S32). When the number of users who can talk is equal to or less than the threshold value ThU (S32: YES), the robot control unit 10 sets the interrupt handling dialogue control process (S10 to S20 in FIG. 2) described in the first embodiment to be executable. (S33).

一方、会話可能なユーザ数が閾値ＴｈＵよりも多い場合（Ｓ３２：ＮＯ）、ロボット制御部１０は、ステップＳ３３をスキップする。したがって、ロボット１は、割込み対応対話制御処理を実行することができない。 On the other hand, when the number of users who can talk is larger than the threshold value ThU (S32: NO), the robot control unit 10 skips step S33. Therefore, the robot 1 cannot execute the interrupt handling dialogue control process.
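
Steps S31 to S33 amount to counting nearby, roughly frontal users and comparing the count with the threshold ThU. The sketch below assumes each detected face already carries an estimated distance and bearing; the class `DetectedFace`, the function names, and the numeric defaults (2 m, 30 degrees, ThU = 2) are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DetectedFace:
    distance_m: float   # distance estimated, e.g., from the size of the face image
    bearing_deg: float  # horizontal angle from the robot's front (0 = straight ahead)


def count_conversation_candidates(faces: Iterable[DetectedFace],
                                  max_distance_m: float = 2.0,
                                  max_bearing_deg: float = 30.0) -> int:
    """S31: count users within the given distance and near the robot's front."""
    return sum(1 for f in faces
               if f.distance_m <= max_distance_m
               and abs(f.bearing_deg) <= max_bearing_deg)


def interrupt_handling_enabled(faces: Iterable[DetectedFace],
                               th_u: int = 2) -> bool:
    """S32/S33: enable interrupt handling only when candidates <= ThU."""
    return count_conversation_candidates(faces) <= th_u
```

With these defaults, two nearby users leave interrupt handling enabled, while a third nearby user disables it; distant users or those far off to the side are not counted.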

割込み対応対話制御処理の実施可否を決定した後、ロボット制御部１０は、ユーザと対話する（Ｓ３４）。ロボット制御部１０は、ユーザが「クイズ」や「ダンス」などのアプリケーションの実行を要求するコマンドを発話した場合、そのコマンドに応じたアプリケーションを実行する。 After determining whether or not to execute the interrupt-corresponding dialog control process, the robot control unit 10 interacts with the user (S34). When the user utters a command requesting execution of an application such as “quiz” or “dance”, the robot control unit 10 executes the application corresponding to the command.

このように構成される本実施例も第１実施例と同様の作用効果を奏する。さらに本実施例では、ロボット１と対話可能なユーザ数が閾値ＴｈＵ以下の場合に、第１実施例で述べた割込み対応対話制御処理を実行可能とする。したがって、ロボット１が、閾値ＴｈＵよりも多い数のユーザを相手にして「レクリエーション」などを実行する場合に、ロボット１の司会進行などが周囲のユーザの発言に妨げられるのを抑制することができる。 This embodiment, configured as described above, provides the same operational effects as the first embodiment. Furthermore, in this embodiment, the interrupt handling dialogue control process described in the first embodiment is made executable when the number of users who can interact with the robot 1 is equal to or less than the threshold ThU. Therefore, when the robot 1 conducts "recreation" or the like for more users than the threshold ThU, its moderation and similar activities can be kept from being disrupted by remarks from the surrounding users.

図５を用いて第３実施例を説明する。本実施例では、アプリケーションの種類に応じて、割込み対応対話制御処理の実行可否を決定する。 A third embodiment will be described with reference to FIG. In this embodiment, whether or not to execute the interrupt-compatible dialog control process is determined according to the type of application.

図５は、本実施例に係るロボット１が実行する処理の一部を示す。ロボット制御部１０は、モードを判定する（Ｓ４０）。ここでは、モードとして、例えば、自由会話モード（Ｓ４１，Ｓ４２）、ニュースモード（Ｓ４３，Ｓ４４）、レクリエーションモード（Ｓ４５，Ｓ４６）、充電要求モード（Ｓ４７，Ｓ４８）などがあるとする。 FIG. 5 shows a part of processing executed by the robot 1 according to the present embodiment. The robot control unit 10 determines the mode (S40). Here, for example, it is assumed that there are a free conversation mode (S41, S42), a news mode (S43, S44), a recreation mode (S45, S46), a charge request mode (S47, S48), and the like.

自由会話モードでは、ロボット１は、ユーザと自由に会話する。ニュースモードは、ユーザが「ニュースを読んで」などのコマンドを発話した場合に実施される。ニュースモードでは、ロボット制御部１０は図外のニュース配信サーバからニュースを取得し、取得したニュースを読み上げる。 In the free conversation mode, the robot 1 has a free conversation with the user. The news mode is performed when the user utters a command such as “read news”. In the news mode, the robot control unit 10 acquires news from a news distribution server (not shown) and reads out the acquired news.

レクリエーションモードは、ロボット１の管理者（例えば、ロボット１の設置された施設のロボット担当者など）が事前に日時を決めて設定することができる。レクリエーションモードは、例えば、介護施設、学校、病院、スーパーマーケット、百貨店、遊園地などで実行される。レクリエーションモードでは、ロボット１が司会を務め、「体操」、「合唱」、「クイズ大会」などのプログラムを実行する。 The recreation mode can be set by a manager of the robot 1 (for example, a robot person in charge at a facility where the robot 1 is installed) by determining the date and time in advance. The recreation mode is executed in, for example, a nursing facility, a school, a hospital, a supermarket, a department store, an amusement park, and the like. In the recreation mode, the robot 1 serves as a moderator and executes programs such as “gymnastics”, “choral”, and “quiz competition”.

充電要求モードは、ロボット１の蓄電池（図示せず）のＳＯＣ（State Of Charge）が所定の閾値まで低下し、充電が必要になった場合に実行される。充電要求モードになると、ロボット制御部１０は、充電が必要であることをＬＥＤ１２３の点灯などで管理者に通知する。 The charge request mode is executed when the SOC (State Of Charge) of the storage battery (not shown) of the robot 1 is lowered to a predetermined threshold value and charging is necessary. When the charging request mode is set, the robot control unit 10 notifies the administrator that the charging is necessary by turning on the LED 123 or the like.

ロボット制御部１０は、自由会話モードの場合（Ｓ４１：ＹＥＳ）、割込み対応対話制御処理を実行可能な状態で、ユーザと自由に対話する（Ｓ４２）。 In the free conversation mode (S41: YES), the robot control unit 10 freely interacts with the user in a state where the interrupt-corresponding dialog control process can be executed (S42).

ロボット制御部１０は、ニュースモードの場合（Ｓ４３：ＹＥＳ）、割込み対応対話制御処理を実行可能な状態で、ニュースを読み上げる（Ｓ４４）。 In the news mode (S43: YES), the robot control unit 10 reads the news in a state where the interrupt-corresponding dialogue control process can be executed (S44).

ロボット制御部１０は、レクリエーションモードの場合（Ｓ４５：ＹＥＳ）、割込み対応対話制御処理を実行しない状態で、レクリエーションの司会を務める（Ｓ４６）。 In the recreation mode (S45: YES), the robot control unit 10 serves as a recreation moderator without executing the interrupt-corresponding dialogue control process (S46).

ロボット制御部１０は、充電要求モードの場合（Ｓ４７：ＹＥＳ）、割込み対応対話制御処理を実行しない状態で、管理者に対して充電を要求する（Ｓ４８）。 In the charge request mode (S47: YES), the robot control unit 10 requests the administrator to charge without executing the interrupt handling dialogue control process (S48).
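
The mode-dependent behaviour of steps S41 to S48 can be captured in a small lookup table. The enum and function names are hypothetical, and treating any unlisted mode as "not executable" is an assumption of this sketch, not something the specification states.

```python
from enum import Enum, auto


class Mode(Enum):
    FREE_CONVERSATION = auto()  # S41, S42
    NEWS = auto()               # S43, S44
    RECREATION = auto()         # S45, S46
    CHARGE_REQUEST = auto()     # S47, S48


# Whether the interrupt handling dialogue control process is executable per mode.
INTERRUPT_HANDLING_BY_MODE = {
    Mode.FREE_CONVERSATION: True,
    Mode.NEWS: True,
    Mode.RECREATION: False,
    Mode.CHARGE_REQUEST: False,
}


def interrupt_handling_allowed(mode: Mode) -> bool:
    """Look up the setting for a mode; unlisted modes default to False (assumption)."""
    return INTERRUPT_HANDLING_BY_MODE.get(mode, False)
```

Keeping the policy in one table makes it easy to add further modes without touching the dialogue logic itself.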

このように構成される本実施例も第１実施例と同様の作用効果を奏する。さらに、本実施例では、モードに応じて（アプリケーションの種類に応じて）、割込み対応対話制御処理の実行可否を決定するため、ロボット１は、状況に応じてユーザとコミュニケーションを取ることができる。 This embodiment, configured as described above, provides the same operational effects as the first embodiment. Furthermore, in this embodiment, whether the interrupt handling dialogue control process may be executed is determined according to the mode (that is, according to the type of application), so the robot 1 can communicate with the user in a manner suited to the situation.

例えば、レクリエーションモードでは、多くのユーザを相手にすることが多いため、もしも割込み対応対話制御処理を実行可能に設定すると、ユーザの発話によりロボット１の司会進行が頻繁に中断されてしまい、円滑なレクリエーション活動を行うことができないおそれがある。 For example, in the recreation mode the robot often deals with many users, so if the interrupt handling dialogue control process were made executable, the moderation by the robot 1 would be frequently interrupted by users' utterances, and the recreation activity might not proceed smoothly.

充電要求モードでは、蓄電池の残量が少なくなっているため、ユーザからの話しかけにいちいち対応していたのでは蓄電池の残量がより早くなくなってしまい、ロボット１の機能が停止するおそれがある。 In the charge request mode, since the remaining amount of the storage battery is low, the remaining amount of the storage battery is lost earlier if the user responds to the talks one by one, and the function of the robot 1 may stop.

しかし、本実施例では、割込み対応対話制御処理を実行すべきモードと、実行しないモードとにわけてロボット１を制御するため、ロボット１の置かれた状況に応じて円滑なコミュニケーションを実現することができる。 In this embodiment, however, the robot 1 is controlled by distinguishing between modes in which the interrupt handling dialogue control process should be executed and modes in which it is not, so smooth communication can be achieved according to the situation in which the robot 1 is placed.

図６，図７を参照して第４実施例を説明する。本実施例では、「所定の条件」について詳細に説明する。 A fourth embodiment will be described with reference to FIGS. 6 and 7. In this embodiment, the "predetermined condition" is described in detail.

上述の通り、ユーザが話しかけていることを示す所定の条件を満たす場合に、発話生成部１１５による発話を一時停止させ、音声解析部（１０５，１１１）による音声認識処理を起動させる。 As described above, when a predetermined condition indicating that the user is speaking is satisfied, the utterance by the utterance generation unit 115 is temporarily stopped, and the voice recognition process by the voice analysis unit (105, 111) is started.

所定の条件とは、上述の通り、発話生成部１１５が発話中であり、ユーザの顔が正面に位置する状態で、画像解析部としての画像認識部１０６によるユーザの顔画像の解析結果がユーザの口元の動作を示す画像であり、かつ音声解析部による音源方向の推定結果が前方を示す場合である。 As described above, the predetermined condition is satisfied when the utterance generation unit 115 is speaking, the user's face is positioned in front, the analysis result of the user's face image by the image recognition unit 106, which serves as the image analysis unit, is an image showing movement of the user's mouth, and the estimation result of the sound source direction by the voice analysis unit indicates the front.
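The predetermined condition is a conjunction of four observations, which can be sketched as follows. This is only an illustration under stated assumptions: the `Observation` fields, the function name `user_is_speaking`, the representation of the sound source direction as a bearing in degrees (0 meaning straight ahead), and the 30 degree tolerance for "front" are all hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Inputs to the activation determination (field names are illustrative)."""
    robot_is_speaking: bool          # operation state of the utterance generation unit
    face_in_front: bool              # user's face is positioned in front
    mouth_is_moving: bool            # face image analysis shows mouth movement
    sound_source_bearing_deg: float  # estimated sound source direction, 0 = straight ahead


# Assumed tolerance for treating a sound source as coming from the front.
FRONT_TOLERANCE_DEG = 30.0


def user_is_speaking(obs: Observation, tolerance_deg: float = FRONT_TOLERANCE_DEG) -> bool:
    """Return True only if all four parts of the predetermined condition hold."""
    sound_from_front = abs(obs.sound_source_bearing_deg) <= tolerance_deg
    return (
        obs.robot_is_speaking
        and obs.face_in_front
        and obs.mouth_is_moving
        and sound_from_front
    )
```

When the function returns True, the robot would pause its utterance and start the voice recognition process; if any one observation fails, the utterance continues.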

図６に示すように、「ユーザの顔が正面に位置する状態」とは、ユーザがロボット１を見ている状態である。詳しくは、ユーザがロボット１を見ている状態とは、ロボット１のカメラ１２１で撮影可能な範囲（画角撮影範囲）において、ユーザの顔の正面がロボット１の方向を向いている状態である。ロボット１のカメラ１２１がユーザの正面の顔を撮影できればよいため、ユーザがロボット１の真正面に位置する場合（図６のＡ）に限らず、真正面以外の位置であっても画角撮影範囲内にユーザの正面の顔が位置する場合（図６のＢ，Ｃ）も含む。 As shown in FIG. 6, "the state in which the user's face is positioned in front" is a state in which the user is looking at the robot 1. More specifically, the user is looking at the robot 1 when, within the range that the camera 121 of the robot 1 can capture (the angle-of-view imaging range), the front of the user's face is turned toward the robot 1. It is sufficient that the camera 121 of the robot 1 can capture the user's face from the front. The state therefore covers not only the case where the user is directly in front of the robot 1 (A in FIG. 6) but also cases where the user is elsewhere and the front of the user's face is within the angle-of-view imaging range (B and C in FIG. 6).

図７の説明図に示すように、ユーザの顔が画角撮影範囲内に位置していても、ユーザがロボット１の方を向いていない場合（図７のＤ）、「ユーザの顔が正面に位置する状態」には含まない。また、ユーザはロボット１の方を向いているが、画角撮影範囲から外れている場合（図７のＥ）、「ユーザの顔が正面に位置する場合」に含まない。 As shown in the explanatory diagram of FIG. 7, a case where the user's face is within the angle-of-view imaging range but the user is not facing the robot 1 (D in FIG. 7) is not included in "the state in which the user's face is positioned in front". Likewise, a case where the user is facing the robot 1 but is outside the angle-of-view imaging range (E in FIG. 7) is not included in "the case where the user's face is positioned in front".

したがって、「ユーザの顔が正面に位置する場合」の具体例は、図６のＡ〜Ｃと図７のＡ’およびＢ’が該当し、図７のＤ，Ｅは該当しない。図示はしないが、画角撮影範囲内のユーザがロボット１に背を向けている場合や、ユーザが天井を見上げていたり床を見ている場合も、「ユーザの顔が正面に位置する場合」には該当しない。 Accordingly, A to C in FIG. 6 and A' and B' in FIG. 7 are specific examples of "the case where the user's face is positioned in front", whereas D and E in FIG. 7 are not. Although not illustrated, cases where a user within the angle-of-view imaging range has their back to the robot 1, or is looking up at the ceiling or down at the floor, also do not qualify as "the case where the user's face is positioned in front".
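The distinction drawn in FIGS. 6 and 7 combines two tests: whether the user lies within the camera's angle of view, and whether the user's face is turned toward the robot. A minimal sketch follows. It assumes the user's position is given as a bearing from the camera axis and the face orientation as yaw and pitch angles relative to the line toward the robot; these parameters, the function name `face_is_in_front`, and both threshold values are hypothetical and are not taken from the specification.

```python
def face_is_in_front(
    bearing_deg: float,
    face_yaw_deg: float,
    face_pitch_deg: float,
    half_fov_deg: float = 30.0,   # assumed half angle of view of the camera
    max_turn_deg: float = 20.0,   # assumed limit for "facing the robot"
) -> bool:
    """Return True if the user's face is positioned in front.

    bearing_deg:    direction of the user from the camera axis (0 = dead ahead).
    face_yaw_deg:   left/right turn of the face away from the robot (0 = facing it).
    face_pitch_deg: up/down tilt of the face away from the robot (0 = facing it).
    """
    in_view = abs(bearing_deg) <= half_fov_deg
    facing_robot = (
        abs(face_yaw_deg) <= max_turn_deg
        and abs(face_pitch_deg) <= max_turn_deg
    )
    return in_view and facing_robot
```

Under these assumptions, a user standing off-axis but facing the robot passes the check, while a user who is in view but looking away, has their back turned, is looking at the ceiling, or is outside the angle of view does not.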

なお、本発明は、上述した実施の形態に限定されない。当業者であれば、本発明の範囲内で、種々の追加や変更等を行うことができる。ロボットは、人型である必要はなく、例えば犬、猫、鳥、魚などの動物、ひまわり、トウモロコシなどの植物、円柱、ポリゴン、立方体などの幾何学状物体でもよい。 The present invention is not limited to the embodiments described above. A person skilled in the art can make various additions and changes within the scope of the present invention. The robot need not be humanoid. It may, for example, take the form of an animal such as a dog, cat, bird, or fish, a plant such as a sunflower or corn, or a geometric object such as a cylinder, polygon, or cube.

１：ロボット、１０：ロボット制御部、１１：ロボット本体、１２：胴体、１３Ｌ，１３Ｒ：脚、１４Ｌ，１４Ｒ：手、１５：頭部、１０４：統合制御部、１０５：音声信号処理部、１０６：画像認識部、１１０：音声認識起動判定部、１１１：音声認識部、１１５：発話生成部、１１６：発話データベース、１１９：モーションデータベース 1: Robot, 10: Robot controller, 11: Robot body, 12: Body, 13L, 13R: Leg, 14L, 14R: Hand, 15: Head, 104: Integrated controller, 105: Audio signal processor, 106 : Image recognition unit, 110: voice recognition activation determination unit, 111: voice recognition unit, 115: utterance generation unit, 116: utterance database, 119: motion database

Claims

A robot that interacts with a user, comprising:
an image analysis unit that analyzes a face image of the user;
a voice analysis unit that estimates a sound source direction from surrounding sound and performs voice recognition processing;
an activation determination unit that determines start and stop of the voice recognition processing by the voice analysis unit; and
an utterance generation unit that generates a message according to a recognition result of the voice recognition processing and utters the message,
wherein the activation determination unit temporarily stops the utterance by the utterance generation unit and starts the voice recognition processing by the voice analysis unit when a predetermined condition indicating that the user is speaking is satisfied, based on the operation state of the utterance generation unit, the analysis result of the user's face image by the image analysis unit, and the estimation result of the sound source direction by the voice analysis unit.

The robot according to claim 1,
wherein the predetermined condition is a case where the utterance generation unit is uttering, the user's face is located in front, the analysis result of the user's face image by the image analysis unit is an image showing movement of the user's mouth, and the estimation result of the sound source direction by the voice analysis unit indicates the front.

The robot according to claim 2, further comprising a movable part that can move relative to a main body,
wherein the activation determination unit causes the movable part to perform a predetermined voice recognition start operation indicating the start of the voice recognition processing.

The robot according to claim 3,
wherein, while the utterance generation unit is operating, the movable part is operated to perform a voice recognition stop operation indicating that the voice recognition processing is stopped.

The robot according to claim 4,
wherein the voice recognition start operation consists of stopping the voice recognition stop operation.

The robot according to claim 1 or 2,
wherein the utterance generation unit utters a message corresponding to the result of the voice recognition processing by the voice analysis unit and starts a new conversation with the user when it can generate such a message,
and resumes the paused utterance when it cannot generate such a message.

The robot according to claim 6, further comprising a movable part that can move relative to a main body,
wherein, when the utterance generation unit resumes the paused utterance, the activation determination unit operates the movable part to perform a voice recognition stop operation indicating that the voice recognition processing is stopped.

A method of controlling, by a robot control unit, a robot that interacts with a user,
wherein the robot control unit:
monitors whether the user speaks while the robot is uttering;
when an utterance by the user is detected, pauses the robot's utterance and also pauses a voice recognition stop operation, which is performed during the robot's utterance by moving at least a movable part of the robot and which indicates that voice recognition processing is stopped;
acquires the voice uttered by the user;
performs voice recognition processing on the acquired voice of the user;
determines whether a message corresponding to the result of the voice recognition processing can be generated;
when it determines that the message can be generated, utters the message and starts a new conversation with the user; and
when it determines that the message cannot be generated, resumes the paused utterance and resumes the voice recognition stop operation.