JP2016076007A

JP2016076007A - Interactive apparatus and interactive method

Info

Publication number: JP2016076007A
Application number: JP2014204520A
Authority: JP
Inventors: 公亮角野; Kosuke Kadono; 渉内田; Wataru Uchida; 孝輔辻野; Kosuke Tsujino
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2014-10-03
Filing date: 2014-10-03
Publication date: 2016-05-12
Anticipated expiration: 2034-10-03
Also published as: JP6514475B2

Abstract

PROBLEM TO BE SOLVED: To provide an interactive apparatus and an interactive method allowing switching to the state of interaction with a user at proper timing.

SOLUTION: An interactive apparatus 100 comprises detection means (detection unit 131) for detecting distance between a user and the interactive apparatus 100 and the presence of a user in a camera image, user intention determination means (user intention determination unit 132) for, on the basis of the detection result of the detection means, determining whether or not a user in a hands-free state has intention of speaking to the interactive apparatus 100, and interactive state control means (state control unit 137) for, on the basis of the determination result of the determination means, controlling the state of the interactive apparatus 100 so as to be switched to one of the state of interaction and the state of non-interaction.

SELECTED DRAWING: Figure 2

Description

本発明は、ユーザとの対話を行うための対話装置、およびユーザと対話装置との間で対話を行うための対話方法に関する。 The present invention relates to a dialog device for performing a dialog with a user and a dialog method for performing a dialog between the user and the dialog device.

近年、音声認識並びにユーザの自然な発話内容を理解して受け答えを行う対話技術が進化している（たとえば特許文献１参照）。スマートフォン、ロボットデバイスなどに様々なセンサを備えることによって、ユーザの存在を知覚し、あたかも人間と対話するように対話を行うことができる対話エージェント型の対話装置が実現されつつある。 In recent years, dialogue technology has been evolving to recognize voice and to understand and answer the user's natural utterance content (see, for example, Patent Document 1). By providing various sensors in a smartphone, a robot device, and the like, a dialog agent type dialog device capable of perceiving the presence of a user and performing a dialog as if interacting with a human being is being realized.

対話は、ユーザが、対話装置への発話のタイミングを伝えるためのボタン操作などを行わない状態（以下「ハンズフリー状態」という。）で開始される場合がある。この場合、対話装置は、たとえばマイク入力を監視し、ユーザの発話を検出することによって、非対話状態から対話状態に切り替わることができる。 The dialogue may be started in a state where the user does not perform a button operation or the like for transmitting the timing of utterance to the dialogue device (hereinafter referred to as “hands-free state”). In this case, the interactive device can switch from the non-interactive state to the interactive state, for example, by monitoring the microphone input and detecting the user's utterance.

特開２００２−１８２８９６号公報 Patent Document 1: JP 2002-182896 A

しかしながら、ユーザの発話が検出されたからといって、ユーザが対話装置へ語りかけようとする意思（語りかけ意思）を有しているとは限らない。そのため、ユーザが語りかけ意思を有していないにもかかわらず、ユーザの発話を検出した対話装置が、誤ったタイミングで対話を開始して対話状態に切り替わってしまうおそれがある。 However, just because a user's utterance is detected does not necessarily mean that the user has an intention to talk to the dialogue device (speaking intention). Therefore, even though the user does not have the intention to talk, there is a possibility that the dialogue apparatus that detects the user's utterance starts the dialogue at an incorrect timing and switches to the dialogue state.

本発明は、上記問題点に鑑みてなされたものであり、適切なタイミングでユーザとの対話状態に切り替わることが可能な対話装置および対話方法を提供することを目的とする。 The present invention has been made in view of the above problems, and an object of the present invention is to provide an interactive apparatus and an interactive method capable of switching to an interactive state with a user at an appropriate timing.

本発明の一態様に係る対話装置は、ユーザとの対話を行うための対話装置であって、ユーザと対話装置との距離と、カメラ画像におけるユーザの存在とを検出する検出手段と、検出手段の検出結果に基づいて、ハンズフリー状態にあるユーザが対話装置への語りかけ意思を有しているか否かを判定するユーザ意思判定手段と、判定手段の判定結果に基づいて、対話装置が対話状態および非対話状態のいずれかの状態に切り替わるように対話装置の状態を制御する対話状態制御手段と、を備える。 An interactive apparatus according to one aspect of the present invention is an interactive apparatus for conducting a dialogue with a user, and comprises: detection means for detecting the distance between the user and the interactive apparatus and the presence of the user in a camera image; user intention determination means for determining, based on the detection result of the detection means, whether or not a user in a hands-free state has an intention to talk to the interactive apparatus; and interactive state control means for controlling the state of the interactive apparatus, based on the determination result of the user intention determination means, so that the interactive apparatus switches to either the interactive state or the non-interactive state.

本発明の一態様に係る対話方法は、ユーザと対話装置との間で対話を行うための対話方法であって、対話装置が、ユーザと対話装置との距離と、カメラ画像におけるユーザの存在とを検出するステップと、対話装置が、検出するステップの検出結果に基づいて、ハンズフリー状態にあるユーザが対話装置への語りかけ意思を有しているか否かを判定するステップと、対話装置が、判定するステップの判定結果に基づいて、対話装置が対話状態および非対話状態のいずれかの状態に切り替わるように対話装置の状態を制御するステップと、を含む。 An interactive method according to one aspect of the present invention is a method for conducting a dialogue between a user and an interactive apparatus, and includes: a step in which the interactive apparatus detects the distance between the user and the interactive apparatus and the presence of the user in a camera image; a step in which the interactive apparatus determines, based on the detection result of the detecting step, whether or not a user in a hands-free state has an intention to talk to the interactive apparatus; and a step in which the interactive apparatus controls its own state, based on the determination result of the determining step, so that it switches to either the interactive state or the non-interactive state.

上記の対話装置または対話方法では、ユーザが対話装置への語りかけ意思を有しているか否かに基づいて、対話装置が対話状態および非対話状態のいずれかの状態に切り替わるように制御される。これにより、対話装置は、ユーザの意思に応じた適切なタイミングで、対話状態に切り替わることができる。 In the above interactive device or interactive method, the interactive device is controlled to switch to either the interactive state or the non-interactive state based on whether or not the user has an intention to talk to the interactive device. Thereby, the dialogue apparatus can be switched to the dialogue state at an appropriate timing according to the user's intention.

また、対話装置は、検出手段の検出結果に基づいて、ユーザが対話装置からの情報を視認できる状態にあるか否かを判定するユーザ状態判定手段と、ユーザ状態判定手段の判定結果に基づいて、ユーザへの出力を制御する出力制御手段と、をさらに備えてもよい。これにより、ユーザが対話装置からの視覚的な出力（情報）を視認（閲覧など）できないときは、たとえば音声のみでユーザへ情報を伝達することができる。また、ユーザが対話装置からの視覚的な出力を視認できるときは、視覚的な出力と音声出力とを併用することによって、たとえば音声出力を短縮することができる。 The interactive apparatus may further include user state determination means for determining, based on the detection result of the detection means, whether or not the user is in a state of being able to view information from the interactive apparatus, and output control means for controlling output to the user based on the determination result of the user state determination means. As a result, when the user cannot view the visual output (information) from the interactive apparatus, the information can be conveyed to the user by voice alone, for example. When the user can view the visual output, combining the visual output with audio output makes it possible, for example, to shorten the audio output.

また、対話装置は、対話状態においてはユーザの音声に含まれる語彙を連続して認識する第１の認識モードを実行し、非対話状態においてはユーザの音声に含まれる所定の語彙のみを認識する第２の認識モードとを実行する音声認識手段、をさらに備えてもよく、ユーザ意思判定手段は、非対話状態において、第２の認識モードを実行する音声認識手段によってユーザの音声に含まれる所定の語彙が認識された場合に、ユーザが対話装置への語りかけ意思を有していると判定してもよい。これにより、対話装置は、ユーザが所定の語彙（キーワード）を発話したことを契機として、ユーザの意思に応じた適切なタイミングで、非対話状態から対話状態に切り替わることができる。 The interactive apparatus may further include speech recognition means that executes, in the interactive state, a first recognition mode in which the vocabulary contained in the user's speech is recognized continuously and, in the non-interactive state, a second recognition mode in which only predetermined vocabulary contained in the user's speech is recognized. The user intention determination means may determine that the user has an intention to talk to the interactive apparatus when, in the non-interactive state, the speech recognition means executing the second recognition mode recognizes the predetermined vocabulary in the user's speech. The interactive apparatus can thereby switch from the non-interactive state to the interactive state at an appropriate timing that matches the user's intention, triggered by the user uttering the predetermined vocabulary (keyword).

また、第１の認識モードでは、音声認識手段が、対話装置の外部との通信を行いサーバのデータ処理を利用することによって、ユーザの音声に含まれる語彙を連続して認識し、第２の認識モードでは、音声認識手段が、対話装置の外部との通信を行わずに、ユーザの音声に含まれる所定の語彙のみを認識してもよい。これにより、第１の認識モードでは、サーバのデータ処理を利用した大語彙が認識可能な音声認識（サーバ型音声認識）を行うことができる。また、第２の認識モードでは、たとえば通信を行わない分だけ第１の認識モードより消費電力を低減させつつ音声認識を行うことができる。 In the first recognition mode, the voice recognition means continuously recognizes the vocabulary included in the user's voice by communicating with the outside of the dialogue apparatus and using the data processing of the server, and the second recognition mode. In the recognition mode, the voice recognition unit may recognize only a predetermined vocabulary included in the user's voice without performing communication with the outside of the dialogue apparatus. Thereby, in the first recognition mode, it is possible to perform voice recognition (server type voice recognition) capable of recognizing a large vocabulary using server data processing. Also, in the second recognition mode, for example, voice recognition can be performed while reducing power consumption compared to the first recognition mode by the amount that communication is not performed.
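The split between the two recognition modes can be sketched as follows. This is only an illustrative sketch, not the implementation described in the specification: utterances are represented as already-transcribed strings instead of audio, the keyword set `KEYWORDS` is hypothetical, and `server` is a stand-in callable for whatever server-side recognizer is used.

```python
from typing import Callable, Optional

# Hypothetical keyword set; the text only speaks of a "predetermined vocabulary".
KEYWORDS = {"hello", "good morning"}


def recognize_keyword(utterance: str) -> Optional[str]:
    """Second recognition mode: local, keyword-only, no network access."""
    text = utterance.strip().lower()
    return text if text in KEYWORDS else None


def recognize_continuous(utterance: str, server: Callable[[str], str]) -> str:
    """First recognition mode: delegate large-vocabulary recognition to a server."""
    return server(utterance)


def recognize(utterance: str, interactive: bool,
              server: Callable[[str], str]) -> Optional[str]:
    """Pick the recognition mode from the current state of the apparatus."""
    if interactive:
        return recognize_continuous(utterance, server)
    return recognize_keyword(utterance)
```

In this sketch the server is contacted only in the interactive state, which mirrors the power-saving argument made for the second recognition mode.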

また、検出手段は、カメラ画像におけるユーザの顔を検出することによって、ユーザの存在を検出してもよい。これにより、たとえば、対話装置とユーザの顔との位置関係や、対話装置に対するユーザの顔の角度などに基づいて、ユーザが対話装置への語りかけ意思を有しているか否か判定することができる。 The detection means may detect the presence of the user by detecting the user's face in the camera image. This makes it possible to determine whether or not the user has an intention to talk to the interactive apparatus based on, for example, the positional relationship between the interactive apparatus and the user's face or the angle of the user's face relative to the interactive apparatus.

本発明によれば、適切なタイミングでユーザとの対話状態に切り替わることが可能になる。 According to the present invention, it is possible to switch to a dialog state with the user at an appropriate timing.

対話システムの概略構成を示す図である。 A diagram showing the schematic configuration of the dialogue system.
対話装置の詳細構成を示す図である。 A diagram showing the detailed configuration of the interactive apparatus.
対話装置のハードウェア構成を示す図である。 A diagram showing the hardware configuration of the interactive apparatus.
応答データテーブルの一例を示す図である。 A diagram showing an example of a response data table.
応答データテーブルの別の例を示す図である。 A diagram showing another example of a response data table.
対話装置の状態遷移図である。 A state transition diagram of the interactive apparatus.
対話装置の状態遷移を説明するためのフローチャートの一例である。 An example of a flowchart for explaining the state transitions of the interactive apparatus.

以下、本発明の実施形態について、図面を参照しながら説明する。なお、図面の説明において同一要素には同一符号を付し、重複する説明は省略する。 Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the description of the drawings, the same elements are denoted by the same reference numerals, and redundant descriptions are omitted.

図１は、実施形態に係る対話装置および対話方法が適用される対話システムの概要を示す図である。図１に示すように、対話システム１においては、ユーザ１０と、対話装置１００との対話が行われる。 FIG. 1 is a diagram illustrating an outline of a dialogue system to which a dialogue apparatus and a dialogue method according to an embodiment are applied. As shown in FIG. 1, in the dialogue system 1, a dialogue between the user 10 and the dialogue device 100 is performed.

対話装置１００は、インタフェース部１１０を含む。図１に示す例では、インタフェース部１１０の一部（後述の図２のディスプレイ１１３）に、ヒト型のキャラクタ１１が表示されている。このようなヒト型のキャラクタ１１の表示によって、ユーザ１０は、あたかも人間と対話するように、対話装置１００と対話することができる。 The interactive apparatus 100 includes an interface unit 110. In the example shown in FIG. 1, the human character 11 is displayed on a part of the interface unit 110 (a display 113 in FIG. 2 described later). The display of the humanoid character 11 allows the user 10 to interact with the interaction device 100 as if interacting with a human.

対話装置１００は、通信ネットワーク５０を介して、サーバ２００と接続可能とされている。これにより、対話装置１００は、ユーザ１０との対話に用いるための多くの情報を、サーバ２００から取得することができる。 The interactive device 100 can be connected to the server 200 via the communication network 50. Thereby, the dialogue apparatus 100 can acquire a lot of information to be used for the dialogue with the user 10 from the server 200.

対話装置１００は、ユーザ１０との対話を行うことが可能であればよく、その外観や大きさなどは図１に示す例に限定されるものではない。たとえば、対話装置１００は、スマートフォンのような端末装置を用いて好適に実現される。そのような端末装置は、ユーザ１０との対話に必要なスピーカ、マイク、各種センサなどの様々なデバイス、およびユーザ１０との接点となるディスプレイを備えており、また、通信ネットワーク５０を介してサーバ２００と通信することができるからである。また、対話装置１００として、人間の形状を模した物理的なロボットなどを用いてもよい。 The dialog device 100 only needs to be able to perform a dialog with the user 10, and its appearance and size are not limited to the example shown in FIG. 1. For example, the interactive device 100 is suitably realized using a terminal device such as a smartphone. Such a terminal device includes various devices such as a speaker, a microphone, and various sensors necessary for interaction with the user 10, and a display serving as a contact point with the user 10, and a server via the communication network 50. This is because it is possible to communicate with 200. Further, as the interactive device 100, a physical robot imitating a human shape may be used.

ユーザ１０は、ハンズフリー状態で、対話装置１００と対話することができる。本明細書におけるハンズフリー状態とは、ユーザ１０が対話装置１００に接触して行う操作（たとえば対話装置１００のボタン操作など）を行わない状態を意味する。なお、ユーザ１０が対話装置１００以外のものに触れている場合でも、対話装置１００に接触していなければ、ハンズフリー状態とされる。 The user 10 can interact with the interaction apparatus 100 in a hands-free state. The hands-free state in this specification means a state in which an operation performed by the user 10 in contact with the interactive device 100 (for example, a button operation of the interactive device 100) is not performed. Even when the user 10 is touching something other than the interactive device 100, if the user 10 is not touching the interactive device 100, a hands-free state is set.

ユーザ１０と対話装置１００との対話は、ユーザ１０が対話装置１００の近くにいる状態で行われることが好ましい。図１において、対話に好ましいユーザ１０と対話装置１００との位置関係が、領域Ｒとして破線で例示される。領域Ｒの範囲は、ユーザ１０が対話装置１００に表示されているキャラクタ１１を良好に視認することができ、また、対話装置１００からの音声を良好に認識できるような範囲とすることができる。そのような領域Ｒの範囲は、たとえば対話装置１００から数十センチ〜数メートル程度の範囲である。図１に示す例では、領域Ｒは、対話装置１００の正面側（インタフェース部１１０が設けられている側）に広く設定され、対話装置１００の側面および背面には狭く設定されている。すなわち、領域Ｒは、対話装置１００の正面側に長く設定され、対話装置１００の側面および背面に短く設定される。このような領域Ｒ内にユーザ１０が位置するときには、ユーザ１０は対話装置１００の正面と向かいあって対話できる可能性が高まるので、対話をスムーズに（ユーザ１０にとって快適に）行うことができる。 The dialogue between the user 10 and the dialogue device 100 is preferably performed in a state where the user 10 is near the dialogue device 100. In FIG. 1, the positional relationship between the user 10 and the interaction device 100 that are preferable for the interaction is illustrated as a region R by a broken line. The range of the region R can be set such that the user 10 can visually recognize the character 11 displayed on the interactive device 100 and can recognize the voice from the interactive device 100. The range of such a region R is, for example, a range from several tens of centimeters to several meters from the interactive device 100. In the example illustrated in FIG. 1, the region R is set to be wide on the front side (side on which the interface unit 110 is provided) of the interactive device 100, and is set to be narrow on the side surface and back surface of the interactive device 100. That is, the region R is set to be long on the front side of the interactive device 100 and is set to be short on the side surface and back surface of the interactive device 100. When the user 10 is located in such a region R, the possibility that the user 10 can interact with the front of the interactive device 100 is increased, so that the conversation can be performed smoothly (comfortable for the user 10).

図２は、対話装置１００の詳細構成を示す図である。図２に示すように、対話装置１００は、インタフェース部１１０と、データ処理部１２０と、制御部１３０と、記憶部１４０と、通信部１５０とを含む。 FIG. 2 is a diagram illustrating a detailed configuration of the interactive apparatus 100. As illustrated in FIG. 2, the dialogue apparatus 100 includes an interface unit 110, a data processing unit 120, a control unit 130, a storage unit 140, and a communication unit 150.

インタフェース部１１０は、対話装置１００の外部（主に図１のユーザ１０）と情報をやり取りするための部分である。インタフェース部１１０は、カメラ１１１と、近接センサ１１２と、ディスプレイ１１３と、マイク１１４と、スピーカ１１５と、操作パネル１１６とを含む。 The interface unit 110 is a part for exchanging information with the outside of the interactive apparatus 100 (mainly the user 10 in FIG. 1). The interface unit 110 includes a camera 111, a proximity sensor 112, a display 113, a microphone 114, a speaker 115, and an operation panel 116.

データ処理部１２０は、インタフェース部１１０に入力された情報の解析などに必要なデータ処理を行い、また、インタフェース部１１０が出力する種々の情報の生成などに必要なデータ処理を行う部分である。データ処理部１２０は、画像処理部１２１と、センサデータ処理部１２２と、出力処理部１２３と、音声認識部１２４と、音声合成部１２５と、入力処理部１２６とを含む。 The data processing unit 120 is a part that performs data processing necessary for analysis of information input to the interface unit 110 and performs data processing necessary for generation of various information output by the interface unit 110. The data processing unit 120 includes an image processing unit 121, a sensor data processing unit 122, an output processing unit 123, a speech recognition unit 124, a speech synthesis unit 125, and an input processing unit 126.

以下、インタフェース部１１０およびデータ処理部１２０に含まれる各部について説明する。 Hereinafter, each unit included in the interface unit 110 and the data processing unit 120 will be described.

カメラ１１１は、たとえばユーザ１０を撮像する。たとえば、画像処理部１２１は、カメラ画像におけるユーザ１０の顔の位置（または領域）を検出する。そのためのデータ処理には、種々の公知の技術を用いることができる。たとえば、google（登録商標）社によって提供されるスマートフォン用ＯＳとして知られているアンドロイド（登録商標）に提供される種々のＡＰＩ（Application Program Interface）に関する情報（たとえば、入手のための情報、使い方の情報など）が、下記のサイトに記載されている。
The camera 111 captures an image of, for example, the user 10. The image processing unit 121 detects, for example, the position (or region) of the face of the user 10 in the camera image. Various known techniques can be used for this data processing. For example, information (such as how to obtain and use them) on the various APIs (Application Programming Interfaces) provided for Android (registered trademark), the smartphone OS provided by Google (registered trademark), is available at the following site.
http://developer.android.com/reference/android/media/FaceDetector.html

カメラ１１１は、対話システム１において、ユーザ１０が語りかける対象となるマイク１１４、ユーザ１０への応答を出力するディスプレイ１１３およびスピーカ１１５のいずれかに対して、ユーザ１０が向けられている事を検出できる位置に設置される。 In the dialogue system 1, the camera 111 is installed at a position from which it can detect that the user 10 is facing any of the microphone 114, to which the user 10 speaks, and the display 113 and the speaker 115, which output responses to the user 10.

カメラ１１１を用いて、対話装置１００とユーザ１０との距離を検出することもできる。この場合には、カメラ１１１は、対話装置１００において、マイク１１４、ディスプレイ１１３およびスピーカ１１５のいずれかとユーザ１０との距離を検出（測定）できる位置に設置される。カメラ１１１によって対話装置１００とユーザ１０との距離を検出する場合は、上述のデータ処理によって、ユーザ１０の顔領域を検出し、検出した顔領域の大きさから、ユーザ１０との距離を測定することができる。また、対話装置１００が２つ以上のカメラを搭載することによって、上記顔領域の検出と、２つ以上のカメラによって撮像された画像の視差とによって得られる１つ以上の情報から、ユーザ１０との距離を推定することも可能である。 The camera 111 can also be used to detect the distance between the interactive apparatus 100 and the user 10. In that case, the camera 111 is installed in the interactive apparatus 100 at a position from which the distance between the user 10 and any of the microphone 114, the display 113, and the speaker 115 can be detected (measured). When the camera 111 is used to detect the distance between the interactive apparatus 100 and the user 10, the face region of the user 10 is detected by the data processing described above, and the distance to the user 10 can be measured from the size of the detected face region. If the interactive apparatus 100 is equipped with two or more cameras, the distance to the user 10 can also be estimated from one or more pieces of information obtained from the face region detection and from the parallax between the images captured by the two or more cameras.
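The specification does not give a formula for either estimate. One common way to realize them, shown here only as a sketch under a pinhole-camera assumption, is to scale an assumed real-world face width by the focal length, or to use the standard stereo relation between disparity and depth. The default face width of 0.16 m and all parameter names are assumptions for illustration.

```python
def distance_from_face_width(face_width_px: float,
                             focal_length_px: float,
                             real_face_width_m: float = 0.16) -> float:
    """Estimate distance from the width of the detected face region.

    Pinhole model: distance = focal_length * real_width / width_in_image.
    The 0.16 m default is an assumed typical face width.
    """
    if face_width_px <= 0:
        raise ValueError("face_width_px must be positive")
    return focal_length_px * real_face_width_m / face_width_px


def distance_from_disparity(disparity_px: float,
                            focal_length_px: float,
                            baseline_m: float) -> float:
    """Estimate distance from the parallax (disparity) between two cameras.

    Stereo relation: distance = focal_length * baseline / disparity.
    """
    if disparity_px <= 0:
        raise ValueError("disparity_px must be positive")
    return focal_length_px * baseline_m / disparity_px
```

Under these assumptions, a face 160 pixels wide seen through a lens with a 1000-pixel focal length would be placed at roughly one metre.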

近接センサ１１２は、一定距離内への物体の近接を検出する。近接センサ１１２として、たとえば、赤外光（あるいは音波）を発して、物体からの反射光（あるいは反射波）を検出するタイプのセンサを用いることができる。センサデータ処理部１２２は、近接センサ１１２の検出結果に基づいて、対話装置１００とユーザ１０との距離を測定することができ、またユーザ１０が対話装置１００の近くにいるか否かを判断することもできる。そのためのデータ処理には、種々の公知の技術を用いることができる。たとえば、google社によって提供されるスマートフォン用ＯＳとして知られているアンドロイドに提供される種々のＡＰＩに関する情報（たとえば、入手のための情報、使い方の情報など）が、下記のサイトに記載されている。
The proximity sensor 112 detects the approach of an object to within a certain distance. As the proximity sensor 112, for example, a sensor that emits infrared light (or sound waves) and detects the light (or waves) reflected from an object can be used. Based on the detection result of the proximity sensor 112, the sensor data processing unit 122 can measure the distance between the interactive apparatus 100 and the user 10, and can also determine whether or not the user 10 is near the interactive apparatus 100. Various known techniques can be used for this data processing. For example, information (such as how to obtain and use them) on the various APIs provided for Android, the smartphone OS provided by Google, is available at the following site.
http://developer.android.com/reference/android/hardware/SensorManager.html

ディスプレイ１１３は、ユーザ１０が視認可能な情報を表示する。出力処理部１２３は、ディスプレイ１１３の表示に必要なデータ処理を行う。データ処理には、ディスプレイ１１３におけるキャラクタ１１の動作を表すのに必要なデータ処理も含まれる。 The display 113 displays information that the user 10 can visually recognize. The output processing unit 123 performs data processing necessary for display on the display 113. The data processing includes data processing necessary to represent the motion of the character 11 on the display 113.

マイク１１４は、ユーザ１０の音声を検出する。音声認識部１２４は、マイク１１４の検出結果に基づいて、ユーザ１０の音声を認識する音声認識手段として機能する。また、音声認識部１２４は、認識した音声を所定のフォーマット（たとえばテキストの形式）に変換する。音声を認識するためのデータ処理には、種々の公知の技術を用いることができる。たとえば、google社によって提供されるスマートフォン用ＯＳとして知られているアンドロイドに提供される種々のＡＰＩに関する情報（たとえば、入手のための情報、使い方の情報など）が、下記のサイトに記載されている。
The microphone 114 detects the voice of the user 10. The speech recognition unit 124 functions as speech recognition means that recognizes the voice of the user 10 based on the detection result of the microphone 114. The speech recognition unit 124 also converts the recognized speech into a predetermined format (for example, text). Various known techniques can be used for the data processing for recognizing speech. For example, information (such as how to obtain and use them) on the various APIs provided for Android, the smartphone OS provided by Google, is available at the following site.
http://developer.android.com/reference/android/speech/RecognizerIntent.html

スピーカ１１５は、ユーザ１０が聴認可能な音声を発する。音声合成部１２５は、スピーカ１１５が発する音声を生成するための種々のデータ処理を行う。たとえば、音声合成部１２５は、各種の音データを合成することによって、テキストの形式で指定された内容（情報）を音声に変換する。そのためのデータ処理には、種々の公知の技術を用いることができる。たとえば、google社によって提供されるスマートフォン用ＯＳとして知られているアンドロイドに提供される種々のＡＰＩに関する情報（たとえば、入手のための情報、使い方の情報など）が、下記のサイトに記載されている。
The speaker 115 emits sound that the user 10 can hear. The speech synthesis unit 125 performs the various kinds of data processing needed to generate the sound emitted by the speaker 115. For example, the speech synthesis unit 125 converts content (information) specified as text into speech by synthesizing various pieces of sound data. Various known techniques can be used for this data processing. For example, information (such as how to obtain and use them) on the various APIs provided for Android, the smartphone OS provided by Google, is available at the following site.
http://developer.android.com/reference/android/speech/tts/TextToSpeech.html

本実施形態において、音声入力の方式は、連続的な対話を想定したものだけでなく、キーワード型の音声入力に特化した方式も採用される。連続的な対話において大規模な語彙が必要な場合は、対話装置１００の記憶領域や計算能力に限りがあるので、音声対話時はサーバとの通信によるサーバ接続型の音声認識を利用する。一方で、キーワード型音声入力では、対話装置１００の内部で完結可能な音声認識エンジンを利用することができ、この場合、サーバとの通信処理を行わない分、バッテリ消費の効率等の面で有効である。たとえば、市販の音声認識エンジンでも、同事業者がサーバ型、ローカル型の複数の方式によるエンジンを販売していることが一般的であり、それらに関する情報（たとえば、入手のための情報、使い方の情報など）が、下記のサイトに記載されている。
In the present embodiment, the voice input methods used include not only one intended for continuous dialogue but also one specialized for keyword-type voice input. When continuous dialogue requires a large vocabulary, server-connected speech recognition via communication with the server is used during voice dialogue, because the storage capacity and computing power of the interactive apparatus 100 are limited. For keyword-type voice input, on the other hand, a speech recognition engine that runs entirely within the interactive apparatus 100 can be used; because no communication with the server takes place, this is advantageous in terms of, for example, battery consumption. Among commercially available speech recognition engines, for example, it is common for a single vendor to sell engines of several types, such as server-based and local; information about them (such as how to obtain and use them) is available at the following site.
http://www.fuetrek.co.jp/product/vgate/asr.html

操作パネル１１６は、ユーザ１０の操作（ユーザ操作）を検出する。入力処理部１２６は、操作パネル１１６の検出結果に基づいて、ユーザ操作に応じた必要なデータ処理を行う。 The operation panel 116 detects an operation (user operation) of the user 10. The input processing unit 126 performs necessary data processing according to the user operation based on the detection result of the operation panel 116.

以上の構成によって、対話装置１００は、たとえば、インタフェース部１１０を介して、外部（ユーザ１０を含む）から種々の情報を取得し、また、外部（ユーザ１０を含む）に種々の情報を伝達することができる。そして、本実施形態においては、さらに、後述する制御部１３０、記憶部１４０、通信部１５０などの各要素が協働することによって、ユーザ１０との対話が実現される。 With the above configuration, the interactive apparatus 100 acquires various information from the outside (including the user 10) via the interface unit 110, and transmits various information to the outside (including the user 10), for example. be able to. Further, in the present embodiment, an interaction with the user 10 is realized by further cooperation of each element such as a control unit 130, a storage unit 140, and a communication unit 150 described later.

制御部１３０は、対話装置１００の各要素を制御する部分であり、後述の検出部１３１、ユーザ意思判定部１３２、情報閲覧可否判定部１３３、対話制御部１３４、応答内容決定部１３５、出力制御部１３６、状態制御部１３７、音声認識制御部１３８を含んで構成される。ただし、制御部１３０の機能は、それらの機能に限定されるものではない。 The control unit 130 is a part that controls each element of the interactive apparatus 100, and includes a detection unit 131, a user intention determination unit 132, an information browsing availability determination unit 133, a dialog control unit 134, a response content determination unit 135, and output control, which will be described later. Unit 136, state control unit 137, and voice recognition control unit 138. However, the functions of the control unit 130 are not limited to those functions.

記憶部１４０は、対話装置１００とユーザ１０との対話などに必要な種々の情報を記憶する部分である。記憶部１４０は、たとえば、後述する種々のデータテーブルを記憶する。 The storage unit 140 is a part that stores various types of information necessary for a dialogue between the dialogue device 100 and the user 10. The storage unit 140 stores, for example, various data tables described later.

通信部１５０は、対話装置１００の外部（たとえば図１のサーバ２００）と通信を行う部分である。通信の手法は特に限定されないが、たとえば通信部１５０と基地局（図示しない）との無線通信、および、基地局とサーバ２００との有線通信などを用いることができる。 The communication unit 150 is a part that communicates with the outside of the interactive device 100 (for example, the server 200 in FIG. 1). Although the communication method is not particularly limited, for example, wireless communication between the communication unit 150 and a base station (not shown), wired communication between the base station and the server 200, or the like can be used.

以下、制御部１３０に含まれる各部について説明する。 Hereinafter, each unit included in the control unit 130 will be described.

検出部１３１は、ユーザ１０と対話装置１００との距離と、カメラ画像におけるユーザ１０の存在とを検出する部分（検出手段）である。ユーザ１０と対話装置１００との距離の検出は、カメラ１１１および画像処理部１２１、あるいは近接センサ１１２およびセンサデータ処理部１２２などを用いて行われる。カメラ画像におけるユーザ１０の存在の検出は、カメラ１１１および画像処理部１２１などを用いて行われる。検出部１３１は、カメラ画像におけるユーザ１０の顔を検出することによって、ユーザ１０の存在を検出することが好ましい。 The detection unit 131 is a part (detection means) that detects the distance between the user 10 and the interactive apparatus 100 and the presence of the user 10 in the camera image. The distance between the user 10 and the interactive apparatus 100 is detected using the camera 111 and the image processing unit 121, or the proximity sensor 112 and the sensor data processing unit 122, for example. The presence of the user 10 in the camera image is detected using the camera 111, the image processing unit 121, and the like. The detection unit 131 preferably detects the presence of the user 10 by detecting the face of the user 10 in the camera image.

ユーザ意思判定部１３２は、検出部１３１の検出結果に基づいて、ハンズフリー状態にあるユーザ１０が対話装置１００への語りかけ意思を有しているか否かを判定する部分（ユーザ意思判定手段）である。たとえば、ユーザ１０と対話装置との距離が所定距離以下であって（たとえばユーザ１０が図１の領域Ｒの内側に位置する）且つカメラ画像におけるユーザ１０の存在が検出された場合には、ユーザ意思判定部１３２は、ユーザ１０は語りかけ意思を有していると判定することができる。 The user intention determination unit 132 is a part (user intention determination unit) that determines whether the user 10 in the hands-free state has an intention to talk to the interactive device 100 based on the detection result of the detection unit 131. is there. For example, when the distance between the user 10 and the interactive device is equal to or less than a predetermined distance (for example, the user 10 is located inside the region R in FIG. 1) and the presence of the user 10 in the camera image is detected, the user The intention determination unit 132 can determine that the user 10 has an intention to talk.
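The example rule in the preceding paragraph (distance at or below a threshold, and the user present in the camera image) can be written down as a small predicate. The sketch below is a simplification: the 2.0 m threshold, the `Detection` type, and the direct mapping from the determination to a state name are assumptions for illustration, not values or structures given in the specification.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold; the text only says "a predetermined distance".
MAX_TALK_DISTANCE_M = 2.0


@dataclass
class Detection:
    """Hypothetical container for one result from the detection unit 131."""
    distance_m: Optional[float]  # None when no distance could be measured
    face_detected: bool          # presence of the user in the camera image


def has_speaking_intention(d: Detection,
                           max_distance_m: float = MAX_TALK_DISTANCE_M) -> bool:
    """True when the user is close enough AND appears in the camera image."""
    return (d.distance_m is not None
            and d.distance_m <= max_distance_m
            and d.face_detected)


def next_state(d: Detection) -> str:
    """Simplified state control: follow the intention determination directly."""
    return "interactive" if has_speaking_intention(d) else "non-interactive"
```

Requiring both conditions means that a user who merely walks past within range but never faces the camera does not trigger the interactive state in this sketch.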

情報閲覧可否判定部１３３は、検出部１３１の検出結果に基づいて、ユーザ１０が対話装置１００からの情報を視認できる状態にあるか否かを判定する部分（ユーザ状態判定手段）である。たとえば、ユーザ１０の顔がディスプレイ１１３の方向に向いており且つユーザ１０とディスプレイ１１３の距離が、ユーザ１０がディスプレイ１１３に表示された情報を閲覧可能な範囲内である（たとえばユーザ１０が図１の領域Ｒの内側に位置する）ときには、情報閲覧可否判定部１３３は、ユーザ１０が対話装置１００からの情報を視認できる状態にあると判定することができる。 The information browsing availability determination unit 133 is a part (user state determination means) that determines, based on the detection result of the detection unit 131, whether or not the user 10 is in a state of being able to view information from the interactive apparatus 100. For example, when the face of the user 10 is turned toward the display 113 and the distance between the user 10 and the display 113 is within the range in which the user 10 can read the information shown on the display 113 (for example, the user 10 is located inside the region R in FIG. 1), the information browsing availability determination unit 133 can determine that the user 10 is in a state of being able to view information from the interactive apparatus 100.
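A minimal sketch of this viewability check is shown below. It assumes that the face direction is available as a yaw angle relative to the display normal and that fixed limits on angle and distance are good enough; the parameter names and the 30 degree and 2.0 m limits are illustrative assumptions, since the specification does not state how "facing the display" or the readable range is quantified.

```python
from typing import Optional


def can_view_display(face_yaw_deg: Optional[float],
                     distance_m: Optional[float],
                     max_yaw_deg: float = 30.0,
                     max_distance_m: float = 2.0) -> bool:
    """Return True when the user is judged able to read the display.

    face_yaw_deg: assumed angle between the face direction and the display
                  normal (0 means looking straight at the display); None when
                  no face was detected.
    distance_m:   measured distance between user and display; None if unknown.
    """
    if face_yaw_deg is None or distance_m is None:
        return False
    return abs(face_yaw_deg) <= max_yaw_deg and distance_m <= max_distance_m
```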

対話制御部１３４は、ユーザ１０との対話を制御する。たとえば、対話制御部１３４は、音声認識部１２４の認識結果を後述の応答内容決定部１３５へ送信する。また、対話制御部１３４は、応答内容決定部１３５によって決定された応答内容を出力処理部１２３に送信することによって、スピーカ１１５やディスプレイ１１３などを介して当該応答内容をユーザ１０に伝達する。 The dialogue control unit 134 controls the dialogue with the user 10. For example, the dialogue control unit 134 transmits the recognition result of the voice recognition unit 124 to the response content determination unit 135 described later. In addition, the dialogue control unit 134 transmits the response content determined by the response content determination unit 135 to the output processing unit 123, thereby transmitting the response content to the user 10 via the speaker 115, the display 113, or the like.

応答内容決定部１３５は、ユーザ１０の発話に対する対話装置１００の応答内容を決定する部分である。応答内容決定部１３５による応答内容の決定には種々の方法が考えられるが、たとえば特定の語彙（キーワード）に対する対話装置１００の応答を図４の応答データテーブル１４１に記憶しておき、その応答データテーブル１４１にしたがって応答内容を決定することができる。応答データテーブル１４１は、たとえば記憶部１４０に記憶される。 The response content determination unit 135 is a part that determines the response content of the dialogue apparatus 100 for the utterance of the user 10. Various methods are conceivable for determining the response content by the response content determination unit 135. For example, the response of the dialogue apparatus 100 to a specific vocabulary (keyword) is stored in the response data table 141 of FIG. The response content can be determined according to the table 141. The response data table 141 is stored in the storage unit 140, for example.

図４は、応答データテーブル１４１の一例を示す図である。図４に示すように、応答データテーブル１４１は、ユーザ発話と応答情報とを対応づけて記述している。図４に示す例では、ユーザ発話「こんにちは」、「おはよう」、「行ってきます」、「ただいま」に対して、システム発話「こんにちは。アナタの名前は？」、「お早うございます！」、「行ってらっしゃい！」、「お帰りなさーい」がそれぞれ対応する。 FIG. 4 is a diagram showing an example of the response data table 141. As shown in FIG. 4, the response data table 141 describes user utterances and response information in association with each other. In the example shown in FIG. 4, the user utterances "Hello", "Good morning", "I'm off", and "I'm home" correspond respectively to the system utterances "Hello. What is your name?", "Good morning!", "Have a good day!", and "Welcome home".
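The keyword-to-response lookup described for FIG. 4 maps naturally onto a dictionary. The sketch below uses the four utterance pairs quoted in the text; the exact-match lookup and the `None` fallback for unknown utterances are assumptions, since the specification does not say how unmatched input is handled.

```python
from typing import Optional

# Utterance pairs taken from the FIG. 4 example in the text.
RESPONSE_TABLE = {
    "こんにちは": "こんにちは。アナタの名前は？",
    "おはよう": "お早うございます！",
    "行ってきます": "行ってらっしゃい！",
    "ただいま": "お帰りなさーい",
}


def decide_response(user_utterance: str) -> Optional[str]:
    """Look up the system utterance for a recognized user utterance.

    Returns None when the utterance is not in the table (assumed fallback).
    """
    return RESPONSE_TABLE.get(user_utterance.strip())
```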

By responding in this way to utterances and the like of the user 10, the interactive apparatus 100 can carry on a dialogue with the user 10.
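The keyword lookup against the response data table 141 could be sketched as follows. This is a minimal illustration, not the implementation described in the publication: the dictionary `RESPONSE_TABLE_141`, the function `determine_response`, the substring matching rule and the English renderings of the FIG. 4 utterances are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical in-memory stand-in for response data table 141 (FIG. 4).
# Keys are user utterances (keywords); values are system utterances.
# The strings are English renderings of the Japanese examples.
RESPONSE_TABLE_141 = {
    "hello": "Hello. What is your name?",
    "good morning": "Good morning!",
    "i'm off": "Have a good day!",
    "i'm home": "Welcome back",
}


def determine_response(recognized_text: str) -> Optional[str]:
    """Return the response for the first table keyword found in the text.

    Substring matching on lower-cased text is an assumption; the source only
    says that responses to specific keywords are stored in the table.
    """
    text = recognized_text.lower()
    for keyword, response in RESPONSE_TABLE_141.items():
        if keyword in text:
            return response
    return None  # no keyword matched; the source does not specify a fallback
```

A caller playing the role of the dialogue control unit 134 would pass the recognition result to `determine_response` and forward any non-`None` result to the output side.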

Returning to FIG. 2, the output control unit 136 (output control means) controls output to the user 10 based on the determination result of the information browsing availability determination unit 133. Specifically, the output control unit 136 conveys the response content passed from the dialogue control unit 134 to the user 10 in a mode suited to that determination result. The response content is conveyed by combining audio output information and visual output information as appropriate, and the proportion of each is adjusted. Various methods are conceivable for the dialogue control unit 134 to determine the mode of the response content. For example, the response data table 142 of FIG. 5 can store audio output information and visual output information as values, keyed by the response of the interactive apparatus 100 to a keyword and the determination result of the information browsing availability determination unit 133, and the response content can be determined according to that table. The response data table 142 is stored in, for example, the storage unit 140.

FIG. 5 is a diagram showing an example of the response data table 142. As shown in FIG. 5, the response data table 142 uses the response content and the information browsing availability determination result as keys, and describes the audio output information and the visual output information as values.

The response content indicates the information to be conveyed from the interactive apparatus 100 to the user 10. In the example of FIG. 5, the response content is "Tomorrow Tokyo sunny 30, 18". This is weather information, meaning that tomorrow's weather in Tokyo is forecast to be sunny with a high of 30 degrees and a low of 18 degrees.

The information browsing availability determination result is a flag indicating whether the user 10 is able to view information output by the interactive apparatus 100. Whether the user is able to view it is determined by the information browsing availability determination unit 133 described above. In the example of FIG. 5, the result takes one of two values, "TRUE" or "FALSE". When the result is "TRUE", the user 10 is able to see the information displayed on the display 113 of the interactive apparatus 100. When the result is "FALSE", the user 10 is not able to see the information displayed on the display 113.

The audio output information indicates the part of the response content to be conveyed to the user by voice. Even for the same response content, the audio output information differs depending on the information browsing availability determination result: it contains less information when the result is "TRUE" than when the result is "FALSE". In the example of FIG. 5, the audio output information is "Tomorrow looks sunny" when the result is "TRUE", and "Tomorrow Tokyo looks sunny, with a high of 30 degrees and a low of 18 degrees" when the result is "FALSE".

The visual output information indicates the part of the response content to be conveyed to the user visually. Visual output information does not exist when the information browsing availability determination result is "FALSE"; it exists only when the result is "TRUE". In the example of FIG. 5, when the result is "TRUE", the visual output information is "Tokyo, sunny, high 30 degrees, low 18 degrees".
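The lookup in the response data table 142 can be illustrated with a dictionary keyed by the pair (response content, browsing availability result). The sketch below is only an assumption about how such a table might be represented: the names `Output`, `RESPONSE_TABLE_142` and `select_output` are hypothetical, and the strings are English renderings of the FIG. 5 example.

```python
from typing import NamedTuple, Optional


class Output(NamedTuple):
    audio: str             # audio output information
    visual: Optional[str]  # visual output information; None when not viewable


# Hypothetical stand-in for response data table 142 (FIG. 5).
# Key: (response content, information browsing availability result).
RESPONSE_TABLE_142 = {
    ("Tomorrow Tokyo sunny 30, 18", True): Output(
        audio="Tomorrow looks sunny",
        visual="Tokyo, sunny, high 30 degrees, low 18 degrees",
    ),
    ("Tomorrow Tokyo sunny 30, 18", False): Output(
        audio="Tomorrow Tokyo looks sunny, with a high of 30 degrees "
              "and a low of 18 degrees",
        visual=None,
    ),
}


def select_output(response_content: str, can_view: bool) -> Output:
    """Pick the audio/visual split for a response given the viewability flag."""
    return RESPONSE_TABLE_142[(response_content, can_view)]
```

In this example the "TRUE" entry carries a shorter audio string and a visual string, while the "FALSE" entry carries everything in audio and no visual output, matching the relationship described for FIG. 5.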

Returning to FIG. 2, the state control unit 137 (dialogue state control means) switches the state of the interactive apparatus 100 between the dialogue state and the non-dialogue state. For example, when the user intention determination unit 132 determines that the user 10 intends to engage in dialogue, the state control unit 137 switches the interactive apparatus 100 from the non-dialogue state to the dialogue state. The switching of the state of the interactive apparatus 100 is described in detail later with reference to FIGS. 6 and 7.

The voice recognition control unit 138 (voice recognition means) switches between and executes a first recognition mode, in which the vocabulary contained in the voice of the user 10 is recognized continuously, and a second recognition mode, in which only predetermined vocabulary (keywords) contained in the voice of the user 10 is recognized. Details of the first recognition mode and the second recognition mode are described later.

The hardware configuration of the interactive apparatus 100 is now described with reference to FIG. 3, which is a hardware configuration diagram of the interactive apparatus 100. As shown in FIG. 3, the interactive apparatus 100 can be physically configured as a computer that includes one or more CPUs (Central Processing Units) 21; a RAM (Random Access Memory) 22 and a ROM (Read Only Memory) 23, which are main storage devices; a communication module 26, which is a data transmission and reception device; an auxiliary storage device 27 such as a semiconductor memory; an input device 28 that accepts user input, such as an operation panel (including operation buttons) or a touch panel; an output device 29 such as a display; an imaging device 24 such as a camera; and a sensor 25 such as an infrared sensor. Each function of the interactive apparatus 100 in FIG. 2 can be realized, for example, by loading one or more predetermined pieces of computer software onto hardware such as the CPU 21 and the RAM 22, thereby operating the communication module 26, the input device 28, the output device 29, the imaging device 24 and the sensor 25 under the control of the CPU 21, and reading and writing data in the RAM 22 and the auxiliary storage device 27.

Referring again to FIG. 2, the first recognition mode and the second recognition mode executed by the voice recognition control unit 138 are described in detail.

The first recognition mode is executed in the dialogue state. In the first recognition mode, the voice recognition control unit 138 continuously recognizes the vocabulary contained in the voice of the user 10. Continuous recognition here means analyzing and recognizing as much as possible of the vocabulary contained in a series of user utterances in a dialogue. Ideally, all of the vocabulary, that is, the entire speech of the user 10, is recognized.

In the embodiment, in the first recognition mode, the voice recognition control unit 138 continuously recognizes the vocabulary contained in the voice of the user 10 by communicating with the outside of the interactive apparatus 100 and using the data processing of the server 200.

The second recognition mode is executed in the non-dialogue state. In the second recognition mode, the voice recognition control unit 138 recognizes only keywords contained in the user's voice. When the voice recognition control unit 138 recognizes a keyword contained in the voice of the user 10, the user intention determination unit 132 determines that the user 10 intends to speak to the interactive apparatus 100. The keywords are, for example, the user utterances "こんにちは" (Hello) and "おはよう" (Good morning) of FIG. 4 described above.

In the embodiment, in the second recognition mode, the voice recognition control unit 138 recognizes only keywords contained in the voice of the user 10, without communicating with the outside of the interactive apparatus 100.

FIG. 6 is a state transition diagram of the interactive apparatus 100. As shown in FIG. 6, the interactive apparatus 100 is in either the dialogue state or the non-dialogue state.

The non-dialogue state is a state in which the interactive apparatus 100 is not engaged in dialogue with the user 10. In the non-dialogue state, the second recognition mode is executed, and there is no communication between the interactive apparatus 100 (its voice recognition unit 124) and the server 200. When a keyword is detected in the user's voice, a dialogue starts and the interactive apparatus 100 transitions to the dialogue state (AR1).

The dialogue state is a state in which the interactive apparatus 100 is engaged in dialogue with the user 10. In the dialogue state, the first recognition mode is executed, and the interactive apparatus 100 communicates with the server 200. This allows a smooth dialogue based on large-vocabulary recognition using the voice recognition engine of the server 200. When the dialogue ends, the interactive apparatus 100 transitions to the non-dialogue state (AR2).
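The two states of FIG. 6 and the recognition mode tied to each can be summarized in a small sketch. It follows the definitions of the recognition modes given earlier (first mode in the dialogue state, second mode in the non-dialogue state). The enum and function names are hypothetical, and "dialogue ended" is reduced to a boolean purely for illustration.

```python
from enum import Enum


class State(Enum):
    DIALOGUE = "dialogue"
    NON_DIALOGUE = "non-dialogue"


class RecognitionMode(Enum):
    FIRST = "continuous recognition using the server"
    SECOND = "keyword-only recognition without external communication"


def recognition_mode_for(state: State) -> RecognitionMode:
    """First mode in the dialogue state, second mode otherwise."""
    if state is State.DIALOGUE:
        return RecognitionMode.FIRST
    return RecognitionMode.SECOND


def next_state(state: State, keyword_detected: bool, dialogue_ended: bool) -> State:
    """Apply the two transitions of FIG. 6 (AR1 and AR2)."""
    if state is State.NON_DIALOGUE and keyword_detected:
        return State.DIALOGUE      # AR1: keyword heard, dialogue starts
    if state is State.DIALOGUE and dialogue_ended:
        return State.NON_DIALOGUE  # AR2: dialogue ends
    return state
```

FIG. 6 covers only the keyword trigger and the end of dialogue; the camera and distance conditions that also drive transitions appear in the flowchart of FIG. 7.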

FIG. 7 is a flowchart for explaining the state transitions of the interactive apparatus 100 (FIG. 2). Unless otherwise noted, the processing of this flowchart can be executed by the control unit 130 of the interactive apparatus 100.

First, the interactive apparatus 100 determines whether it is in the dialogue state (step S1). If it is in the dialogue state (step S1: YES), the interactive apparatus 100 proceeds to step S2. Otherwise (step S1: NO), the interactive apparatus 100 proceeds to step S8.

If the apparatus is in the dialogue state in step S1 (step S1: YES), the voice recognition control unit 138 of the interactive apparatus 100 executes the first recognition mode (step S2).

Next, the interactive apparatus 100 determines whether the user 10 has not spoken to it for a fixed period of time (step S3). This determination is made by, for example, the dialogue control unit 134. If the user has not spoken for the fixed period (step S3: YES), the interactive apparatus 100 proceeds to step S4. Otherwise (step S3: NO), the interactive apparatus 100 proceeds to step S6.

If the user 10 has not spoken for the fixed period in step S3 (step S3: YES), the interactive apparatus 100 determines whether the user 10 can be detected by the camera 111 (step S4). This determination is made by, for example, the detection unit 131. If the user 10 can be detected by the camera 111 (step S4: YES), the interactive apparatus 100 proceeds to step S5. Otherwise (step S4: NO), the interactive apparatus 100 proceeds to step S7.

If the user 10 can be detected by the camera 111 in step S4 (step S4: YES), the interactive apparatus 100 determines whether the distance to the user 10 is within a predetermined range (for example, whether the user 10 is located inside the region R of FIG. 1) (step S5). If the distance to the user 10 is within the predetermined range (step S5: YES), the interactive apparatus 100 proceeds to step S6. Otherwise (step S5: NO), the interactive apparatus 100 proceeds to step S7.

If the user 10 has spoken within the fixed period in step S3 (step S3: NO), or if the distance to the user is within the predetermined range in step S5 (step S5: YES), the interactive apparatus 100 determines that the user 10 intends to speak to it and maintains the dialogue state (step S6). The determination that the user 10 intends to speak is made by, for example, the user intention determination unit 132. The processing of maintaining the dialogue state is performed by, for example, the dialogue control unit 134.

If the user 10 cannot be detected by the camera 111 in step S4 (step S4: NO), or if the distance to the user is not within the predetermined range in step S5 (step S5: NO), the interactive apparatus 100 determines that the user 10 does not intend to speak to it and transitions to the non-dialogue state (step S7).

On the other hand, if the apparatus is in the non-dialogue state in step S1 (step S1: NO), the voice recognition control unit 138 of the interactive apparatus 100 executes the second recognition mode (step S8).

Next, the voice recognition control unit 138 of the interactive apparatus 100 determines whether a keyword has been detected (step S9). If a keyword has been detected (step S9: YES), the interactive apparatus 100 proceeds to step S12. Otherwise (step S9: NO), the interactive apparatus 100 proceeds to step S10.

If no keyword is detected in step S9 (step S9: NO), the interactive apparatus 100 determines whether the user 10 can be detected by the camera 111 (step S10). If the user 10 can be detected by the camera 111 (step S10: YES), the interactive apparatus 100 proceeds to step S11. Otherwise (step S10: NO), the interactive apparatus 100 proceeds to step S13.

If the user 10 can be detected by the camera 111 in step S10 (step S10: YES), the interactive apparatus 100 determines whether the distance to the user 10 is within the predetermined range (step S11). If the distance to the user 10 is within the predetermined range (step S11: YES), the interactive apparatus 100 proceeds to step S12. Otherwise (step S11: NO), the interactive apparatus 100 proceeds to step S13.

If a keyword is detected in step S9 (step S9: YES), or if the distance to the user 10 is within the predetermined range in step S11 (step S11: YES), the interactive apparatus 100 determines that the user 10 intends to speak to it and transitions to the dialogue state (step S12).

If the user 10 cannot be detected by the camera 111 in step S10 (step S10: NO), or if the distance to the user 10 is not within the predetermined range in step S11 (step S11: NO), the interactive apparatus 100 determines that the user 10 does not intend to speak to it and maintains the non-dialogue state (step S13).

After the processing of step S6, S7, S12 or S13 is completed, the interactive apparatus 100 returns to step S1.
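One pass through the flowchart of FIG. 7 (steps S1 to S13) can be written as a single function. The sketch below is an illustration under stated assumptions: the `Observation` fields and the function `step` are hypothetical names, and the sensor, timer and recognizer results are assumed to be supplied by the caller as booleans.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    silent_for_fixed_period: bool  # S3: user has not spoken for the fixed period
    keyword_detected: bool         # S9: keyword found in the second recognition mode
    user_in_camera: bool           # S4 / S10: user detected by the camera
    within_range: bool             # S5 / S11: distance to the user within range


def step(in_dialogue: bool, obs: Observation) -> bool:
    """Run one pass of S1..S13 and return True if the resulting state is the dialogue state."""
    if in_dialogue:                          # S1: YES, S2 runs the first recognition mode
        if not obs.silent_for_fixed_period:  # S3: NO, the user spoke
            return True                      # S6: maintain the dialogue state
        if obs.user_in_camera and obs.within_range:  # S4: YES and S5: YES
            return True                      # S6: maintain the dialogue state
        return False                         # S7: transition to the non-dialogue state
    # S1: NO, S8 runs the second recognition mode
    if obs.keyword_detected:                 # S9: YES
        return True                          # S12: transition to the dialogue state
    if obs.user_in_camera and obs.within_range:  # S10: YES and S11: YES
        return True                          # S12: transition to the dialogue state
    return False                             # S13: maintain the non-dialogue state
```

Calling `step` repeatedly with the previous return value as `in_dialogue` mirrors the return to step S1 after S6, S7, S12 or S13.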

According to the flowchart of FIG. 7, the following steps are executed: a step in which the interactive apparatus 100 detects the distance between the user 10 and the interactive apparatus 100 and the presence of the user 10 in the camera image (steps S4, S5, S10, S11); a step in which the interactive apparatus 100 determines, based on the detection result of the detecting step, whether the user 10 in a hands-free state intends to speak to the interactive apparatus 100 (steps S6, S7, S12, S13); and a step in which the interactive apparatus 100 controls its own state, based on the determination result of the determining step, so as to switch to either the dialogue state or the non-dialogue state (steps S7, S12).

Next, the operation and effects of the interactive apparatus 100 are described. The interactive apparatus 100 includes: the detection unit 131, which detects the distance between the user and the interactive apparatus and the presence of the user in the camera image; the user intention determination unit 132, which determines, based on the detection result of the detection unit 131, whether the user in a hands-free state intends to speak to the interactive apparatus; and the state control unit 137, which controls the state of the interactive apparatus 100, based on the determination result of the user intention determination unit 132, so that the interactive apparatus 100 switches to either the dialogue state or the non-dialogue state. With the interactive apparatus 100, switching between the dialogue state and the non-dialogue state is controlled according to whether the user 10 intends to speak to the interactive apparatus 100. As a result, the interactive apparatus 100 can switch to the dialogue state at an appropriate timing that reflects the intention of the user 10.

The interactive apparatus 100 further includes the information browsing availability determination unit 133, which determines, based on the detection result of the detection unit 131, whether the user 10 is able to see information from the interactive apparatus 100, and the output control unit 136, which controls output to the user 10 based on the determination result of the information browsing availability determination unit 133. As a result, when the user 10 cannot see (view) the visual output (information) from the interactive apparatus 100, information can be conveyed to the user 10 by, for example, voice alone. When the user 10 can see the visual output from the interactive apparatus 100, the visual output and the audio output can be used together, which makes it possible, for example, to shorten the audio output.

The interactive apparatus 100 further includes the voice recognition control unit 138, which in the dialogue state executes the first recognition mode, continuously recognizing the vocabulary contained in the voice of the user 10, and in the non-dialogue state executes the second recognition mode, recognizing only predetermined vocabulary contained in the voice of the user 10. In that case, in the non-dialogue state, the user intention determination unit 132 determines that the user 10 intends to speak to the interactive apparatus 100 when the voice recognition control unit 138 executing the second recognition mode recognizes predetermined vocabulary (a keyword) contained in the voice of the user 10. As a result, when the user 10 utters a keyword, the interactive apparatus 100 can switch from the non-dialogue state to the dialogue state at an appropriate timing that reflects the intention of the user 10.

In the first recognition mode, the voice recognition control unit 138 continuously recognizes the vocabulary contained in the voice of the user 10 by communicating with the outside of the interactive apparatus 100 and using the data processing of the server 200. In the second recognition mode, the voice recognition control unit 138 recognizes only keywords contained in the voice of the user 10, without communicating with the outside of the interactive apparatus 100. Accordingly, in the first recognition mode, large-vocabulary voice recognition using the data processing of the server 200 (server-type voice recognition) can be performed. In the second recognition mode, voice recognition can be performed with lower power consumption than in the first recognition mode, for example by the amount saved by not communicating.

The detection unit 131 may detect the presence of the user 10 by detecting the face of the user 10 in the camera image. This makes it possible to determine whether the user 10 intends to speak to the interactive apparatus 100 based on, for example, the positional relationship between the interactive apparatus 100 and the face of the user 10, or the angle of the face of the user 10 relative to the interactive apparatus 100.

DESCRIPTION OF SYMBOLS: 1 ... dialogue system, 10 ... user, 11 ... character, 50 ... communication network, 100 ... interactive apparatus, 110 ... interface unit, 111 ... camera, 112 ... proximity sensor, 113 ... display, 114 ... microphone, 115 ... speaker, 116 ... operation panel, 120 ... data processing unit, 121 ... image processing unit, 122 ... sensor data processing unit, 123 ... output processing unit, 124 ... voice recognition unit, 125 ... voice synthesis unit, 126 ... input processing unit, 130 ... control unit, 131 ... detection unit, 132 ... user intention determination unit, 133 ... information browsing availability determination unit, 134 ... dialogue control unit, 135 ... response content determination unit, 136 ... output control unit, 137 ... state control unit, 138 ... voice recognition control unit, 140 ... storage unit, 150 ... communication unit, 200 ... server, R ... region.

Claims

1. An interactive apparatus for carrying out a dialogue with a user, comprising:
detection means for detecting a distance between the user and the interactive apparatus and the presence of the user in a camera image;
user intention determination means for determining, based on a detection result of the detection means, whether the user in a hands-free state has an intention to speak to the interactive apparatus; and
dialogue state control means for controlling the state of the interactive apparatus, based on a determination result of the determination means, so that the interactive apparatus switches to either a dialogue state or a non-dialogue state.

2. The interactive apparatus according to claim 1, further comprising:
user state determination means for determining, based on a detection result of the detection means, whether the user is able to see information from the interactive apparatus; and
output control means for controlling output to the user based on a determination result of the user state determination means.

3. The interactive apparatus according to claim 1, further comprising
voice recognition means for executing, in the dialogue state, a first recognition mode for continuously recognizing vocabulary contained in the user's voice, and for executing, in the non-dialogue state, a second recognition mode for recognizing only predetermined vocabulary contained in the user's voice,
wherein, in the non-dialogue state, the user intention determination means determines that the user has an intention to speak to the interactive apparatus when the predetermined vocabulary contained in the user's voice is recognized by the voice recognition means executing the second recognition mode.

4. The interactive apparatus according to claim 3, wherein
in the first recognition mode, the voice recognition means continuously recognizes the vocabulary contained in the user's voice by communicating with the outside of the interactive apparatus and using data processing of a server, and
in the second recognition mode, the voice recognition means recognizes only the predetermined vocabulary contained in the user's voice without communicating with the outside of the interactive apparatus.

5. The interactive apparatus according to claim 1, wherein the detection means detects the presence of the user by detecting the face of the user in the camera image.

6. A dialogue method for carrying out a dialogue between a user and an interactive apparatus, the method comprising:
a step in which the interactive apparatus detects a distance between the user and the interactive apparatus and the presence of the user in a camera image;
a step in which the interactive apparatus determines, based on a detection result of the detecting step, whether the user in a hands-free state has an intention to speak to the interactive apparatus; and
a step in which the interactive apparatus controls the state of the interactive apparatus, based on a determination result of the determining step, so that the interactive apparatus switches to either a dialogue state or a non-dialogue state.