JP2017054064A

JP2017054064A - Interactive device and interactive program

Info

Publication number: JP2017054064A
Application number: JP2015179490A
Authority: JP
Inventors: 択磨松村; Takuma Matsumura; 哲溝口; Satoru Mizoguchi
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2015-09-11
Filing date: 2015-09-11
Publication date: 2017-03-16
Anticipated expiration: 2035-09-11
Also published as: JP6645779B2

Abstract

PROBLEM TO BE SOLVED: To provide an interactive device and an interactive program capable of appropriately grasping a user's intention. SOLUTION: An interactive device 100 conducts a dialogue with a user. The interactive device 100 includes a voice acquisition unit 131 that acquires the user's voice, an image acquisition unit 132 that acquires an image, and an identification unit 135 that identifies, based on the user's voice acquired by the voice acquisition unit 131, a target object included in the image acquired by the image acquisition unit 132. SELECTED DRAWING: Figure 1

Description

The present invention relates to an interactive device and an interactive program.

A conventional interactive device recognizes a user's voice to grasp the user's intention and conduct a dialogue with the user (see, for example, Patent Document 1 below).

Patent Document 1: JP 2002-182896 A

A conventional interactive device that conducts a dialogue through speech recognition alone may fail to grasp the user's intention accurately. Suppose, for example, that the user wants to ask the interactive device the height of a certain mountain. If the user has knowledge about the mountain, such as its name "A", the user can say the name. When the user asks "What is the height of A?", the content of the question is clear from the voice alone, so the interactive device can grasp the user's intention and conduct the dialogue. If the user does not know the name of the mountain, however, the user cannot say it and therefore cannot phrase the question about the mountain's height appropriately by voice alone. In that case, the interactive device cannot grasp the user's intention and cannot conduct the dialogue.

The present invention has been made in view of the above problem, and its object is to provide an interactive device and an interactive program capable of grasping a user's intention more appropriately.

An interactive device according to one aspect of the present invention is an interactive device for conducting a dialogue with a user, and includes: voice acquisition means for acquiring the user's voice; image acquisition means for acquiring an image; and identification means for identifying, based on the user's voice acquired by the voice acquisition means, a target object included in the image acquired by the image acquisition means.

An interactive program according to one aspect of the present invention causes a computer provided in an interactive device for conducting a dialogue with a user to function as: voice acquisition means for acquiring the user's voice; image acquisition means for acquiring an image; and identification means for identifying, based on the user's voice acquired by the voice acquisition means, a target object included in the image acquired by the image acquisition means.

With the above interactive device or interactive program, a target object included in the image acquired by the image acquisition means is identified based on the user's voice acquired by the voice acquisition means. Thus, even when the user has no knowledge about the target object, for example, the object the user intends is identified from the user's voice, provided the user utters speech that indicates (designates) that the object is one included in the image. The user's intention can therefore be grasped more appropriately in the dialogue.

The interactive device may further include recognition means for executing a recognition mode for recognizing the user's voice acquired by the voice acquisition means, the identification means may identify the name of the target object, and the recognition mode may include a first recognition mode, in which the user's voice acquired by the voice acquisition means is recognized, and a second recognition mode, in which the user's voice is recognized based on both the user's voice acquired by the voice acquisition means and the name of the target object identified by the identification means. When the first recognition mode is executed, the user's voice is recognized and a dialogue with the user is conducted in the same way as in a conventional interactive device. When the second recognition mode is executed, by contrast, the user's voice is recognized based on the user's voice and the identified name of the target object, and a dialogue with the user is conducted. Executing the second recognition mode makes it possible to conduct a dialogue with the user while grasping the user's intention more appropriately.

The identification means may identify a name of the target object that can replace part of the user's voice acquired by the voice acquisition means, and in the second recognition mode, part of the data corresponding to the user's voice acquired by the voice acquisition means may be replaced with data corresponding to the name of the target object identified by the identification means, after which the user's voice is recognized based on the resulting data. Thus, even when there are multiple candidate names for the target object, for example, the name that can replace part of the user's voice, that is, the name suited to the context (the flow of the conversation), is identified from among them. Performing speech recognition after part of the data corresponding to the user's voice has been replaced with data corresponding to the name identified in this way allows the user's intention to be grasped more appropriately in line with the flow of the conversation.

When the recognition means, while executing the first recognition mode, recognizes that the user's voice acquired by the voice acquisition means includes a predetermined voice, it may switch the recognition mode being executed from the first recognition mode to the second recognition mode. For example, by setting the predetermined voice to speech that means the target object is included in the image, the second recognition mode is executed at an appropriate time and the user's intention can be grasped more appropriately.
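The mode switch described above can be pictured as a small state transition. The following is a minimal sketch, not the patented implementation: the names `Mode`, `TRIGGERS`, and `next_mode` are hypothetical, and the trigger list is an assumption, since the text only speaks of "a predetermined voice". Matching is done on already-recognized text by naive substring search.

```python
from enum import Enum


class Mode(Enum):
    FIRST = 1   # plain recognition of the user's voice
    SECOND = 2  # recognition that also uses the identified object name


# Hypothetical trigger words (assumption): Japanese demonstratives that
# suggest the user is referring to something visible in the image.
TRIGGERS = ("この", "その", "あの", "これ", "それ", "あれ")


def next_mode(current: Mode, recognized_text: str) -> Mode:
    """Switch from the first to the second mode when a trigger word is heard."""
    if current is Mode.FIRST and any(t in recognized_text for t in TRIGGERS):
        return Mode.SECOND
    return current
```

The sketch never switches back to the first mode; when and whether that happens is not stated in this passage, so it is left out.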

The identification means may identify the target object based on position information of the target object in the image and the user's voice acquired by the voice acquisition means. Thus, even when the image includes a plurality of objects, the object the user intends can be identified based on the user's voice.

The interactive device may further include output means for outputting an image, and the image acquisition means may acquire the image being output by the output means. Thus, even when the interactive device and the user conduct a dialogue about an image or video output by the output means, for example, the object the user intends is identified as described above and the user's intention can be grasped more appropriately.

According to the present invention, a user's intention can be grasped more appropriately.

FIG. 1 is a diagram showing the functional blocks of the interactive device.
FIG. 2 is a diagram showing an example of a data table stored in the storage unit 140.
FIG. 3 is a diagram showing the hardware configuration of the interactive device.
FIG. 4 is a diagram showing the configuration of the interactive program.
FIG. 5 is a first flowchart showing an example of processing executed in the interactive device.
FIG. 6 is a second flowchart showing an example of processing executed in the interactive device.
FIG. 7 is a diagram showing the functional blocks of an interactive device according to a modification.

Embodiments of the present invention are described below with reference to the drawings. In the description of the drawings, identical elements are given identical reference numerals, and redundant description is omitted.

The interactive device according to the embodiment is a device that conducts a dialogue with a user. The interactive device may be realized, for example, as a mobile terminal such as a smartphone or as a stationary terminal, or as a robot shaped like a human.

FIG. 1 is a diagram showing the functional blocks of the interactive device according to the embodiment. As shown in FIG. 1, the interactive device 100 includes an input unit 110, an output unit 120, a control unit 130, a storage unit 140, and a communication unit 150.

The input unit 110 and the output unit 120 are the parts (input/output interfaces) that exchange information with the outside of the interactive device 100, mainly with the user. The input unit 110 accepts input of sound around the interactive device 100, including the user's voice (hereinafter sometimes simply called "ambient sound"), and input of images and video of the surroundings of the interactive device 100, including the user (hereinafter sometimes simply called "surrounding images"). The output unit 120 outputs various images and video as well as various sounds.

Specifically, the input unit 110 includes a sound collection unit 111 and an imaging unit 112. The sound collection unit 111 is the part that accepts input of ambient sound and is, for example, a microphone. The sound collection unit 111 may be a microphone array in which a plurality of microphones are arranged so as to provide directivity, for example. The imaging unit 112 is the part that accepts input of surrounding images and is, for example, a camera. The imaging unit 112 includes, for example, a plurality of cameras so that the distance to a target object can be determined. The input unit 110 may further include other elements, such as operation buttons with which the user operates the interactive device 100.

The output unit 120 includes a sound generation unit 121 and a display unit 122. The sound generation unit 121 is the part that outputs sound and is, for example, a speaker. The sound generation unit 121 may be an array speaker in which a plurality of speakers are arranged so as to provide directivity, for example. The display unit 122 is the part (output means) that outputs images and video and is, for example, a display. The display may be a touch panel, in which case the touch panel also serves as an element with which the user operates the interactive device 100.

The display unit 122 also displays content such as images and video that is distributed over a communication network such as the Internet or received as a television broadcast. The interactive device 100 and the user may conduct a dialogue about the images and video displayed by the display unit 122.

The control unit 130 is the part that performs overall control of the interactive device 100 by controlling each of its elements. The control unit 130 includes a voice acquisition unit 131, an image acquisition unit 132, a voice recognition unit 133, an image recognition unit 134, and an identification unit 135.

The voice acquisition unit 131 is the part that acquires the ambient sound input to the sound collection unit 111 of the input unit 110. That is, the voice acquisition unit 131 and the sound collection unit 111 together function as voice acquisition means for acquiring ambient sound. Unless otherwise noted, the voice acquisition means is referred to below simply as the voice acquisition unit 131.

The image acquisition unit 132 is the part that acquires the surrounding image input to the imaging unit 112 of the input unit 110. That is, the image acquisition unit 132 and the imaging unit 112 together function as image acquisition means for acquiring a surrounding image. Unless otherwise noted, the image acquisition means is referred to below simply as the image acquisition unit 132.

The image acquisition unit 132 also acquires, as surrounding images, the various images and video displayed by the display unit 122 described above. That is, the image acquisition unit 132 is also the part that acquires the image being output by the display unit 122 (the output means).

The voice recognition unit 133 is the part (recognition means) that executes speech recognition processing for recognizing ambient sound, in particular the user's voice. The voice recognition unit 133 executes the speech recognition processing on the ambient sound acquired by the voice acquisition unit 131. The speech recognition processing can be realized by various techniques, including, for example, techniques that use an acoustic model and a language model prepared in advance. The language model may include the vocabulary of specialized dictionaries so that it can handle various specialized fields.

The speech recognition processing executed by the voice recognition unit 133 includes analyzing the waveform of the user's voice and outputting the content of the utterance as result data. The result data is data corresponding to the user's voice, for example character string data (text data).

In the speech recognition processing, a first recognition mode and a second recognition mode are executed. In the first recognition mode, the user's voice acquired by the voice acquisition unit 131 is recognized as it is. In the second recognition mode, by contrast, the user's voice is recognized based on both the user's voice acquired by the voice acquisition unit 131 and the name of the target object identified by the identification unit 135 described later. More specifically, in the second recognition mode, part of the data (such as text data) corresponding to the user's voice acquired by the voice acquisition unit 131 is replaced with data (such as text data) corresponding to the name of the target object identified by the identification unit 135, and the user's voice is then recognized (interpreted) based on the resulting data.
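The replacement step of the second recognition mode can be illustrated on text data. The sketch below is only one possible reading: the function name `substitute_object_name` and the idea that the referring phrase (for example この山, "this mountain") has already been located are assumptions, not details given in the description.

```python
def substitute_object_name(text: str, phrase: str, object_name: str) -> str:
    """Replace the first occurrence of the referring phrase with the object name.

    `text` is the recognized utterance, `phrase` the part to be replaced
    (e.g. a demonstrative plus noun), and `object_name` the name supplied
    by the identification step.
    """
    if phrase not in text:
        # Nothing to replace: fall back to the utterance as recognized.
        return text
    return text.replace(phrase, object_name, 1)


# "What is the height of this mountain?" becomes "What is the height of A?"
rewritten = substitute_object_name("この山の高さは？", "この山", "A")
```

After the substitution, the rewritten text can be interpreted in the same way as an utterance in which the user had said the name directly.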

The image recognition unit 134 is the part that recognizes the surrounding image acquired by the image acquisition unit 132. The image recognition unit 134 executes image recognition processing on the surrounding image acquired by the image acquisition unit 132. Any commonly used image recognition technique may be used and the technique is not particularly limited; for example, various techniques such as those provided by OpenCV (Open Source Computer Vision Library) are used.

The image recognition unit 134 also estimates, for a target object included in the surrounding image acquired by the image acquisition unit 132, the distance from the imaging unit 112 to the target object. For example, when the imaging unit 112 includes a plurality of cameras as described above, the image recognition unit 134 can measure the distance to the target object by triangulation.
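As an illustration of triangulation with two cameras, the sketch below uses the standard relation for a rectified, parallel stereo pair, depth = focal length x baseline / disparity. The description does not specify the camera geometry, so the rectified-pair assumption, the function name `depth_from_disparity`, and the parameter names are all hypothetical.

```python
def depth_from_disparity(focal_length_px: float,
                         baseline_m: float,
                         disparity_px: float) -> float:
    """Estimate the distance to an object seen by two parallel cameras.

    focal_length_px: focal length expressed in pixels
    baseline_m:      distance between the two camera centers, in meters
    disparity_px:    horizontal shift of the object between the two images
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to an object at infinity.
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


# Example values chosen only for illustration.
distance_m = depth_from_disparity(focal_length_px=800.0,
                                  baseline_m=0.5,
                                  disparity_px=40.0)
```

Nearby objects produce a large disparity and distant ones a small disparity, which is why a multi-camera imaging unit can separate near, intermediate, and far objects.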

The identification unit 135 is the part (identification means) that identifies a target object included in the surrounding image acquired by the image acquisition unit 132, based on the ambient sound acquired by the voice acquisition unit 131, in particular the user's voice. The target object is, among the objects included in the surrounding image, the one the user is trying to identify (designate) by voice.

For example, suppose the user wants to ask the interactive device 100 the height of a certain mountain and does not know the mountain's name "A", but has a photograph of the mountain (or a photograph of the mountain is near the user). The user then presents the photograph to the interactive device 100 and asks "What is the height of this mountain?" (この山の高さは？). Based on this utterance acquired by the voice acquisition unit 131, the identification unit 135 identifies the mountain shown in the photograph acquired by the image acquisition unit 132 as the target object. The identification unit 135 identifies, for example, the name of the target object, in this example the mountain name "A".

Specifically, the identification unit 135 first identifies a demonstrative or a demonstrative pronoun included in the user's voice. In the above example, この ("this") in the utterance "What is the height of this mountain?" (この山の高さは？) is identified as a demonstrative.

Next, the identification unit 135 identifies the image of the target object included in the surrounding image based on the identified demonstrative or demonstrative pronoun. In the above example, the image of the mountain shown in the photograph is identified as the image of the target object corresponding to the demonstrative この ("this") together with the noun 山 ("mountain") that follows it, that is, "this mountain".

The identification unit 135 then identifies the name of the target object based on the identified image of the target object. In the above example, the name of the mountain shown in the photograph is identified as "A". The name of the mountain may be identified, for example, by referring to various information stored in the storage unit 140 described later, or by acquiring information from outside the interactive device 100 via the communication unit 150.

Here, the identification unit 135 may identify the target object based on position information of the target object in the surrounding image and the user's voice acquired by the voice acquisition unit 131. In the above example, the meaning of the demonstrative この ("this") suggests that the object the user intends is a mountain in the foreground of the surrounding image, that is, one located at a short distance from the interactive device 100. Using position information estimated in this way (here, "short distance") allows the object the user intends to be identified appropriately in the surrounding image.

On the other hand, when a photograph of a mountain is placed well behind the user as seen from the interactive device 100, the user may ask "What is the height of that mountain over there?" using the demonstrative あの ("that ... over there"). In that case, the meaning of あの suggests that the object the user intends is a mountain in the background of the surrounding image, that is, one located at a long distance from the interactive device 100.

When the user asks "What is the height of that mountain?" using the demonstrative その ("that"), the meaning of その suggests that the object the user intends is a mountain located at a medium distance from the interactive device 100.

The above describes an example in which the identification unit 135 identifies the image of the target object in the surrounding image based on a demonstrative such as この ("this"), その ("that"), or あの ("that ... over there"), but the identification unit 135 may instead use a demonstrative pronoun. In the above example, the user might show the photograph of the mountain and ask "What is the height of this?" (これの高さは？). In this case, from the meaning of the demonstrative pronoun これ ("this one"), the identification unit 135 can estimate that the object the user intends is an object in the foreground of the surrounding image, that is, one located at a short distance from the interactive device 100. Position information estimated in this way likewise allows the object the user intends to be identified appropriately in the surrounding image.
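The selection step described in the preceding paragraphs can be sketched as a filter over the objects found by image recognition. Everything named here is hypothetical: the `DetectedObject` record, the string labels "short", "medium", and "long" for the distance classes, and the rule of taking the first match are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DetectedObject:
    label: str           # category from image recognition, e.g. "mountain"
    distance_class: str  # "short", "medium" or "long"


def select_target(objects: List[DetectedObject],
                  distance_class: str,
                  noun: Optional[str] = None) -> Optional[DetectedObject]:
    """Return the first object matching the distance class implied by the
    demonstrative and, if a noun followed it, matching that noun as well."""
    for obj in objects:
        if obj.distance_class != distance_class:
            continue
        if noun is not None and obj.label != noun:
            continue
        return obj
    return None


objects = [
    DetectedObject("mountain", "long"),
    DetectedObject("mountain", "short"),
    DetectedObject("cup", "short"),
]
# "this mountain": short distance, followed by the noun "mountain"
target = select_target(objects, "short", "mountain")
```

With a demonstrative pronoun such as これ there is no following noun, so only the distance class narrows the candidates.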

The storage unit 140 is the part that stores various information needed for the processing executed by the control unit 130. The storage unit 140 stores, for example, the acoustic model and language model mentioned above. The storage unit 140 also stores a program (the interactive program) that causes the interactive device 100 to execute the processing needed to conduct a dialogue with the user.

The storage unit 140 also stores various information that the identification unit 135 needs in order to identify the name of a target object included in the surrounding image. For example, a database holding images and names of various objects is stored in the storage unit 140, and the name of the target object can be identified by matching the image of the target object in the surrounding image against the object images held in the database.

The storage unit 140 also stores a data table that associates the demonstratives and demonstrative pronouns used by the identification unit 135, as described above, with position information of the target object in the surrounding image and related information.

FIG. 2 is a diagram showing an example of the data table stored in the storage unit 140. As shown in FIG. 2, the data table associates "keyword", "distance from device", and "noun following keyword" with one another.

"Keyword" is, for example, a demonstrative pronoun or a demonstrative. FIG. 2 lists これ ("this one"), それ ("that one"), and あれ ("that one over there") as demonstrative pronouns, and この ("this"), その ("that"), and あの ("that ... over there") as demonstratives.

"Distance from device" is the distance from the interactive device 100 (more specifically, from the imaging unit 112) to the target object that is assumed for each keyword, that is, for each demonstrative pronoun or demonstrative. In FIG. 2, the distance from the device is classified into three categories: "short distance", "medium distance", and "long distance".

For example, when the distance from the interactive device 100 to the target object is less than a first predetermined distance, the distance is classified as "short distance". When the distance is equal to or greater than the first predetermined distance and less than a second predetermined distance, it is classified as "medium distance". When the distance is equal to or greater than the second predetermined distance, it is classified as "long distance". The second predetermined distance is greater than the first predetermined distance.
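The three-way classification above maps directly onto two threshold comparisons. In the sketch below, the function name `classify_distance` and the default threshold values of 1 m and 5 m are assumptions; the description only requires that the second predetermined distance exceed the first.

```python
def classify_distance(distance_m: float,
                      first_threshold_m: float = 1.0,
                      second_threshold_m: float = 5.0) -> str:
    """Classify a measured distance as "short", "medium" or "long".

    The default thresholds are placeholder values, not taken from the source.
    """
    if not first_threshold_m < second_threshold_m:
        raise ValueError("second threshold must be greater than the first")
    if distance_m < first_threshold_m:
        return "short"    # less than the first predetermined distance
    if distance_m < second_threshold_m:
        return "medium"   # at least the first, less than the second
    return "long"         # at least the second predetermined distance
```

A distance exactly equal to a threshold falls into the farther class, matching the "equal to or greater than" wording.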

"Noun following keyword" indicates whether each keyword can be followed by a noun. Specifically, when the keyword is a demonstrative pronoun such as これ, それ, or あれ, the value is "no". When the keyword is a demonstrative such as この, その, or あの, the value is "yes".

By referring to the data table shown in FIG. 2, when the user's voice includes a predetermined voice (keyword), position information about the object the user intends, that is, its distance from the interactive device 100, can be estimated from that keyword. In addition, determining whether a noun follows the keyword helps: when a noun does follow, taking that noun into account makes it more likely that the object the user intends is grasped appropriately.
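One way to picture the FIG. 2 table is as a small lookup structure. The sketch below is an assumption-laden illustration: the figure itself is not reproduced in the text, so the rows for それ and あれ are inferred by analogy with その and あの, and the names `KEYWORD_TABLE` and `lookup_keyword` are hypothetical. Keywords are found by naive substring search rather than by morphological analysis.

```python
from typing import Optional, Tuple

# keyword -> (distance class, whether a noun may follow the keyword)
KEYWORD_TABLE = {
    "これ": ("short", False),
    "それ": ("medium", False),  # inferred by analogy (assumption)
    "あれ": ("long", False),    # inferred by analogy (assumption)
    "この": ("short", True),
    "その": ("medium", True),
    "あの": ("long", True),
}


def lookup_keyword(utterance: str) -> Optional[Tuple[str, str, bool]]:
    """Return (keyword, distance class, noun_follows) for the earliest
    keyword in the utterance, or None when no keyword is present."""
    hits = [(utterance.find(k), k) for k in KEYWORD_TABLE if k in utterance]
    if not hits:
        return None
    _, keyword = min(hits)  # earliest position in the utterance wins
    distance_class, noun_follows = KEYWORD_TABLE[keyword]
    return keyword, distance_class, noun_follows
```

A real system would likely tokenize the utterance first, since substring search can match a keyword inside an unrelated word.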

Returning to FIG. 1, the communication unit 150 is the part that communicates with the outside of the interactive device 100. Through the communication unit 150, for example, the various information stored in the storage unit 140 described above can be additionally acquired or updated. In addition, by accessing the Internet or the like through the communication unit 150, a search may be executed to identify the name of the target object from the image of the target object included in the surrounding image.

Next, the hardware configuration of the interactive device 100 will be described with reference to FIG. 3. As shown in FIG. 3, the interactive device 100 can be physically configured as a computer including one or more CPUs (Central Processing Units) 21, a RAM (Random Access Memory) 22, a ROM (Read Only Memory) 23, an imaging device 24 such as a camera, a communication module 26 that is a data transmission and reception device, an auxiliary storage device 27 such as a semiconductor memory, an input device 28 such as an operation panel (including operation buttons) or a touch panel that accepts user operations, an output device 29 such as a display, and a reading device 2A such as a CD-ROM drive. The functions of the interactive device 100 in FIG. 1 are realized, for example, by having the reading device 2A read one or more programs stored in a storage medium M such as a CD-ROM and load them onto hardware such as the RAM 22, so that, under the control of the CPU 21, the imaging device 24, the communication module 26, the input device 28, and the output device 29 operate and data is read from and written to the RAM 22 and the auxiliary storage device 27.

FIG. 4 shows the modules of an interactive program for causing a computer to function as the interactive device 100. As shown in FIG. 4, the interactive program P100 includes a voice acquisition module P101, an image acquisition module P102, a voice recognition module P103, an image recognition module P104, and a specifying module P105. These modules realize the functions of the voice acquisition unit 131, the image acquisition unit 132, the voice recognition unit 133, the image recognition unit 134, and the specifying unit 135, respectively, described above with reference to FIG. 1.

The interactive program is provided, for example, stored in a storage medium. The storage medium may be a flexible disk, a CD-ROM, a USB memory, a DVD, a semiconductor memory, or the like.

Next, the operation of the interactive device 100 (the interactive method executed by the interactive device 100) will be described with reference to FIGS. 5 and 6.

FIG. 5 is a flowchart showing an example of the processing executed in the interactive device 100. The processing of this flowchart is executed in response to the start of a dialogue between the interactive device 100 and the user. Unless otherwise noted, each process can be executed by the control unit 130 (that is, by any of the elements included in the control unit 130).

First, the interactive device 100 executes voice recognition with the recognition mode set to the first recognition mode (step S1). Specifically, the voice recognition unit 133 executes the first recognition mode, and a dialogue with the user takes place. The processes executed in the following steps S2 and S3 are executed in the first recognition mode.

Next, the interactive device 100 executes an ambiguous input determination process (step S2). That is, the interactive device 100 determines whether there has been an ambiguous input (step S3). This process is executed by the voice recognition unit 133. For example, an ambiguous input is determined to have occurred when the user's voice contains a keyword such as a demonstrative or a demonstrative pronoun, so that the intention of the utterance is not clear (is ambiguous) from the voice alone. If there has been an ambiguous input (step S3: YES), the interactive device 100 advances the process to step S4. Otherwise (step S3: NO), the interactive device 100 returns the process to step S1.

Note that even when a demonstrative is included, it may be determined that there has been no ambiguous input (step S3: NO) if the noun following the demonstrative 「あの」 or 「その」, as in 「あの夜」 ("that night") or 「その日」 ("that day"), cannot be represented by an image or video.
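The determination in steps S2 and S3, including the exception for nouns that cannot be depicted, might look like the minimal sketch below. The function name `has_ambiguous_input`, the token-list input, and the small `NON_DEPICTABLE_NOUNS` set are assumptions made for illustration; the source does not say how such nouns are recognized.

```python
from typing import Sequence

DEMONSTRATIVE_PRONOUNS = {"これ", "それ", "あれ"}
DEMONSTRATIVES = {"この", "その", "あの"}
# Hypothetical list of nouns that cannot be shown in an image or video.
NON_DEPICTABLE_NOUNS = {"夜", "日"}


def has_ambiguous_input(tokens: Sequence[str]) -> bool:
    """Return True when the utterance needs the second recognition mode."""
    for i, token in enumerate(tokens):
        if token in DEMONSTRATIVE_PRONOUNS:
            return True
        if token in DEMONSTRATIVES:
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            if following in NON_DEPICTABLE_NOUNS:
                # e.g. "that night": nothing in the image can resolve it
                continue
            return True
    return False
```

With this sketch, 「あの山の高さは」 is treated as ambiguous, while 「あの夜は楽しかった」 is not.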

In step S4, the interactive device 100 switches the recognition mode to the second recognition mode. Specifically, the voice recognition unit 133 switches the recognition mode executed in the voice recognition process from the first recognition mode to the second recognition mode. Accordingly, the processes executed in the subsequent steps S5 to S7 are executed in the second recognition mode.

Specifically, the interactive device 100 executes a designated distance estimation process based on the voice recognition result (step S5). This process is executed by, for example, the specifying unit 135. More specifically, the data table stored in the storage unit 140, such as the one shown in FIG. 2 described above, is referenced, and the distance from the interactive device 100 to the object the user intends (designates) is estimated.

The interactive device 100 also executes a designated noun identification process based on the voice recognition result (step S6). This process is executed by, for example, the specifying unit 135. Specifically, when the user's voice contains a demonstrative, the noun following that demonstrative is identified and retained as the designated noun.
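Step S6 can be illustrated with a short sketch. It assumes the recognized text has already been split into (surface, part-of-speech) pairs by some morphological analyzer; that representation, the `"noun"` tag, and the name `find_designated_noun` are hypothetical and not taken from the source.

```python
from typing import Optional, Sequence, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag); tags are assumed

DEMONSTRATIVES = {"この", "その", "あの"}


def find_designated_noun(tokens: Sequence[Token]) -> Optional[str]:
    """Return the noun that directly follows a demonstrative, if any."""
    for (surface, _pos), (next_surface, next_pos) in zip(tokens, tokens[1:]):
        if surface in DEMONSTRATIVES and next_pos == "noun":
            return next_surface
    return None
```

The returned value is what later steps treat as the retained designated noun; `None` corresponds to the case where no noun is retained.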

After the processes of steps S5 and S6 are completed, the interactive device 100 executes an image recognition process (step S7).

FIG. 6 is a flowchart showing the details of the processing executed in the image recognition process (step S7 in FIG. 5).

First, the interactive device 100 executes an image recognition and distance measurement process (step S11). Specifically, the image recognition unit 134 identifies the number of objects (n) included in the surrounding image acquired by the image acquisition unit 132 and recognizes the image of each object. The image recognition unit 134 also measures (estimates) the distance from the interactive device 100 to each object.

In the following steps S12 to S17, the interactive device 100 evaluates which of the n objects is most likely to be the object the user intends. Specifically, the interactive device 100 sets the initial value of a variable i to 0 (step S12) and, while incrementing i by 1 (step S16), repeats the following steps S13 to S15 for the i-th object until i becomes equal to or greater than n (that is, while step S17 is NO).

First, the interactive device 100 calculates a distance score (step S13). Specifically, the specifying unit 135 calculates the distance score based on the distance from the interactive device 100 to the object, estimated in step S11, and the distance from the interactive device 100 to the object the user intends, estimated in step S5 (FIG. 5). For example, the distance score is calculated so that it becomes larger the closer these two distances are to each other.
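The source only requires that the distance score grow as the two distances get closer; it does not give a formula. One possible choice, shown purely as an assumption, is the reciprocal of one plus the absolute difference:

```python
def distance_score(object_distance_m: float, intended_distance_m: float) -> float:
    """Score in (0, 1]; equals 1.0 when the two distances coincide.

    The reciprocal form is an illustrative assumption, not the patented formula.
    """
    return 1.0 / (1.0 + abs(object_distance_m - intended_distance_m))
```

Any monotonically decreasing function of the distance difference would satisfy the description equally well; this one is chosen only because it is bounded and needs no tuning parameter.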

Next, the interactive device 100 determines whether it is holding a designated noun (step S14). Specifically, if a noun following a demonstrative was identified and retained in step S6 (FIG. 5), the device is determined to be holding a designated noun. If it is holding a designated noun (step S14: YES), the interactive device 100 advances the process to step S15. Otherwise (step S14: NO), the interactive device 100 skips step S15 and advances the process to step S16.

In step S15, the interactive device 100 calculates a score from the name obtained by image recognition and the retained noun. This process is executed by the specifying unit 135. For example, when the type of object identified by the image recognition unit 134 (such as "mountain") matches the noun following the demonstrative (such as "mountain"), the distance score calculated in step S13 is increased. Conversely, when the name of the object identified by the image recognition unit 134 does not match the noun following the demonstrative, the distance score is maintained or decreased.

After the processes of steps S13 to S15 described above have been executed for all n objects (step S17: YES), the interactive device 100 advances the process to step S18.

In step S18, the interactive device 100 decides on a single target based on the overall score. Specifically, the specifying unit 135 identifies, among the n objects, the object with the highest score calculated in step S13 and/or step S15 as the object the user intends (designates). In other words, in step S18, the name of the object most likely to be substitutable for part of the user's voice is identified from the distance between the interactive device 100 and the object and from the conversational context, such as whether a noun follows the demonstrative or demonstrative pronoun.
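The loop of steps S12 to S17 and the final selection in step S18 can be sketched as below. The `DetectedObject` type, the bonus and penalty constants, and the reciprocal distance score are assumptions for illustration; the source states only that a match increases the score and a mismatch keeps or decreases it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DetectedObject:
    name: str          # name found by image recognition, e.g. "Mt. A"
    category: str      # type of object, e.g. "山" (mountain)
    distance_m: float  # distance estimated in step S11


NOUN_MATCH_BONUS = 0.5       # hypothetical increase on a match
NOUN_MISMATCH_PENALTY = 0.2  # hypothetical decrease on a mismatch


def select_target(objects: Sequence[DetectedObject],
                  intended_distance_m: float,
                  designated_noun: Optional[str]) -> Optional[DetectedObject]:
    """Return the object with the highest overall score, or None if there are none."""
    best, best_score = None, float("-inf")
    i = 0                                          # step S12
    while i < len(objects):                        # step S17
        obj = objects[i]
        # step S13: closer distances give a larger score
        score = 1.0 / (1.0 + abs(obj.distance_m - intended_distance_m))
        if designated_noun is not None:            # step S14
            if obj.category == designated_noun:    # step S15
                score += NOUN_MATCH_BONUS
            else:
                score -= NOUN_MISMATCH_PENALTY
        if score > best_score:
            best, best_score = obj, score
        i += 1                                     # step S16
    return best                                    # step S18
```

In this sketch a nearby tree wins when no noun is retained, whereas a more distant mountain wins once the retained noun is 「山」, which mirrors how the noun is meant to refine the distance-only estimate.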

More specifically, the name of the object most likely to be substitutable for part of the user's voice can be identified from the conversational context, such as whether a noun follows the demonstrative or demonstrative pronoun, as follows. As an example, suppose that the surrounding image contains a figure that depicts a certain fruit and is also the logo of a certain company. If the user says 「これ食べたい」 ("I want to eat this"), no noun follows the demonstrative pronoun 「これ」 ("this"). The figure is therefore recognized simply as a figure (image) of a fruit, and the name of that fruit is identified as the name of the object the user intends. In contrast, if the user says 「ここの社長は誰？」 ("Who is the president of this place?"), the noun 「社長」 ("president") follows the demonstrative 「ここ」 ("here"). The figure is therefore recognized not simply as a figure of a fruit but as a figure (image) representing something that can relate to the noun "president", such as a "company" or an "enterprise", and the name of that company or the like is identified as the name of the object the user intends. In this way, for example, the name of the object most likely to be substitutable for part of the user's voice is identified from the conversational context.
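The fruit-or-logo example can be expressed as a small sketch. Everything here is hypothetical: the `RELATED_CATEGORIES` mapping, the placeholder names 「りんご」 and 「Ｘ社」, and the rule that the first listed interpretation is the plain-object default are assumptions introduced only to make the example concrete.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical mapping from a following noun to categories it can relate to.
RELATED_CATEGORIES = {"社長": {"会社", "企業"}}


def choose_interpretation(interpretations: Sequence[Tuple[str, str]],
                          following_noun: Optional[str]) -> Optional[str]:
    """Pick a name from (name, category) pairs using the noun that follows the keyword.

    The first pair is assumed to be the default, plain-object reading.
    """
    if following_noun is not None:
        related = RELATED_CATEGORIES.get(following_noun, {following_noun})
        for name, category in interpretations:
            if category in related:
                return name
    return interpretations[0][0] if interpretations else None
```

Given the two readings `[("りんご", "果物"), ("Ｘ社", "会社")]`, no following noun selects the fruit, while the following noun 「社長」 selects the company.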

After that, the interactive device 100 executes a correction process on the voice recognition result (step S19). This process is executed by the voice recognition unit 133. Specifically, the demonstrative and the noun following it, or the demonstrative pronoun, contained in the data corresponding to the user's voice (text data or the like) are replaced with data corresponding to the name of the object identified in step S18. The user's voice is then recognized (its content is interpreted) based on the data after replacement.
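The replacement performed in step S19 might be sketched as follows, again on a token list. The function name `correct_recognition_result` and the choice to replace only the first matching span are assumptions; the source does not say how multiple demonstratives in one utterance are handled.

```python
from typing import List, Optional, Sequence

DEMONSTRATIVE_PRONOUNS = {"これ", "それ", "あれ"}
DEMONSTRATIVES = {"この", "その", "あの"}


def correct_recognition_result(tokens: Sequence[str],
                               target_name: str,
                               designated_noun: Optional[str] = None) -> List[str]:
    """Replace the first demonstrative pronoun, or demonstrative plus noun, with the name."""
    result: List[str] = []
    replaced = False
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if not replaced and token in DEMONSTRATIVE_PRONOUNS:
            result.append(target_name)
            replaced = True
            i += 1
        elif (not replaced and token in DEMONSTRATIVES
              and designated_noun is not None
              and i + 1 < len(tokens) and tokens[i + 1] == designated_noun):
            result.append(target_name)  # demonstrative and its noun become one name
            replaced = True
            i += 2
        else:
            result.append(token)
            i += 1
    return result
```

For instance, 「あの 山 の 高さ は」 with the identified name 「Ａ」 becomes 「Ａ の 高さ は」, which can then be interpreted in the same way as an utterance that named the mountain directly.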

After the process of step S19 is completed, the interactive device 100 returns the process to step S1 (FIG. 5).

Next, the operation and effects of the interactive device 100 will be described. In the interactive device 100, the specifying unit 135 identifies an object included in the surrounding image acquired by the image acquisition unit 132, based on the user's voice acquired by the voice acquisition unit 131 (step S7). As a result, even when the user has no knowledge of the object, for example, the object the user intends is identified from the user's voice as long as the user utters a voice indicating (designating) that the object is something included in the surrounding image, such as a voice containing a demonstrative or a demonstrative pronoun. The user's intention can therefore be grasped more appropriately in the dialogue.

Specifically, as the voice recognition process, the voice recognition unit 133 first executes the first recognition mode and recognizes the user's voice as it is. A dialogue with the user thus takes place in the same way as with a conventional interactive device (steps S1, S2, S3: NO). In addition, the voice recognition unit 133 executes the second recognition mode and recognizes the user's voice based on the user's voice and the name of the identified object (steps S4 to S7). Executing the second recognition mode makes it possible to conduct a dialogue with the user while grasping the user's intention more appropriately.

The specifying unit 135 also identifies the name of an object that can be substituted for part of the user's voice acquired by the voice acquisition unit 131 (step S18). The voice recognition unit 133 then replaces part of the data corresponding to the user's voice with data corresponding to the name of the object identified by the specifying unit 135, and recognizes the user's voice based on the data after replacement (step S19). As a result, even when the surrounding image contains a plurality of objects and there are a plurality of candidates for the name of the object the user intends (designates), the name of an object that can be substituted for part of the user's voice, that is, the name of an object that fits the context (the flow of the conversation), is identified from among them. By performing voice recognition after part of the data corresponding to the user's voice has been replaced with data corresponding to the name of the object identified in this way, the user's intention can be grasped more appropriately in line with the flow of the conversation.

In addition, when the voice recognition unit 133, while executing the first recognition mode, recognizes that the user's voice contains a keyword such as a demonstrative or a demonstrative pronoun, it switches the recognition mode being executed from the first recognition mode to the second recognition mode (step S3: YES, step S4). The second recognition mode is thus executed at the appropriate time, namely when it becomes necessary to identify the name of an object based on a demonstrative or demonstrative pronoun, so that the user's intention can be grasped more appropriately.
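The mode switching described above amounts to a two-state transition, sketched here under stated assumptions. The `RecognitionMode` enum and the `next_mode` function are hypothetical names, and the sketch assumes that the device returns to the first mode once the second-mode processing has finished, consistent with the flow returning to step S1.

```python
from enum import Enum
from typing import Sequence


class RecognitionMode(Enum):
    FIRST = 1   # recognize the user's voice as it is
    SECOND = 2  # recognize using the name of the identified object


KEYWORDS = {"これ", "それ", "あれ", "この", "その", "あの"}


def next_mode(current: RecognitionMode, tokens: Sequence[str]) -> RecognitionMode:
    """Switch to the second mode only when a keyword is heard in the first mode."""
    if current is RecognitionMode.FIRST and any(t in KEYWORDS for t in tokens):
        return RecognitionMode.SECOND
    # After second-mode processing, or with no keyword, continue in the first mode.
    return RecognitionMode.FIRST
```

A fuller version would also apply the exception for nouns that cannot be depicted before switching; it is omitted here to keep the transition itself visible.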

The specifying unit 135 also identifies the object based on the user's voice and on position information of the object in the surrounding image, more specifically the distance from the interactive device 100 to the object (steps S13 to S15, S18). As a result, even when the surrounding image contains a plurality of objects, the object the user intends (designates) can be identified based on the user's voice.

The surrounding image may also be an image or video displayed by the display unit 122. This makes it possible, for example, to conduct a dialogue with the user about the image or video displayed by the display unit 122. As an example, suppose that, during a dialogue with the user, video of a mountain landscape received as a television broadcast is being displayed by the display unit 122. If the user says something like 「ここはどこ？」 ("Where is this?"), the name of the mountain displayed by the display unit 122 is identified in response to the demonstrative pronoun 「ここ」 ("here"), so that the dialogue can proceed while the user's intention is appropriately grasped.

Each function of the interactive device 100 described above can also be realized, for example, by executing the interactive program on a computer.

Although one embodiment of the present invention has been described above, the present invention is not limited to that embodiment.

FIG. 7 is a diagram showing the functional blocks of an interactive device according to a modification. The interactive device 100A, in cooperation with a server 200, constitutes an interactive system 1 that conducts a dialogue with the user. In this modification, the interactive system 1 corresponds to the interactive device according to the present invention.

As shown in FIG. 7, the server 200 includes a control unit 230, a storage unit 240, and a communication unit 250.

The control unit 230 includes a voice acquisition unit 231, an image acquisition unit 232, a voice recognition unit 233, an image recognition unit 234, and a specifying unit 235. These elements have the same functions as the voice acquisition unit 131, the image acquisition unit 132, the voice recognition unit 133, the image recognition unit 134, and the specifying unit 135 described above with reference to FIG. 1.

The storage unit 240 has the same function as the storage unit 140 described above with reference to FIG. 1. That is, the storage unit 240 is the part that stores various kinds of information required for the processing executed by the control unit 230, and stores, for example, an acoustic model, a language model, the interactive program, a database of images and names of various objects, and a data table such as the one shown in FIG. 2.

The communication unit 250 is the part that communicates with the communication unit 150 of the interactive device 100A. The communication unit 250 enables the interactive device 100A and the server 200 to communicate with each other. A search process for identifying the name of an object from the image of that object included in the surrounding image may be executed by accessing the Internet or the like via the communication unit 250.

With the above configuration, the interactive system 1 enables a dialogue with the user through cooperation between the interactive device 100A and the server 200. That is, in the interactive system 1, part of the processing executed in the interactive device 100 (FIG. 1) to conduct a dialogue with the user (in particular, the processing executed by the control unit 130) is executed by the server 200. The interactive device 100A can therefore reduce the processing load on the device compared with the interactive device 100.

Specifically, compared with the interactive device 100, the interactive device 100A can be configured to include a control unit 130A and a storage unit 140A in place of the control unit 130 and the storage unit 140.

The control unit 130A is the part that performs overall control of the interactive device 100A. Unlike the control unit 130, however, the control unit 130A is not required to include the voice acquisition unit 131, the image acquisition unit 132, the voice recognition unit 133, the image recognition unit 134, and the specifying unit 135, so the configuration of the control unit 130A can be made simpler than that of the control unit 130.

The storage unit 140A is the part that stores various kinds of information required for the processing executed by the control unit 130A, but it is not required to store data that duplicates what is stored in the storage unit 240 of the server 200. The configuration can be simplified accordingly, for example by making the storage capacity of the storage unit 140A smaller than that of the storage unit 140.

DESCRIPTION OF SYMBOLS: 100, 100A ... interactive device; 110 ... input unit; 111 ... sound collection unit (voice acquisition means); 112 ... imaging unit (image acquisition means); 120 ... output unit; 121 ... sound generation unit; 122 ... display unit (output means); 130, 130A, 230 ... control unit; 131, 231 ... voice acquisition unit (voice acquisition means); 132, 232 ... image acquisition unit (image acquisition means); 133, 233 ... voice recognition unit (recognition means); 134, 234 ... image recognition unit; 135, 235 ... specifying unit (specifying means); 140, 140A, 240 ... storage unit; 150, 250 ... communication unit.

Claims

An interactive device for interacting with a user, comprising:
voice acquisition means for acquiring the user's voice;
image acquisition means for acquiring an image; and
specifying means for specifying an object included in the image acquired by the image acquisition means, based on the user's voice acquired by the voice acquisition means.

The interactive device according to claim 1, further comprising recognition means for executing a recognition mode for recognizing the user's voice acquired by the voice acquisition means, wherein
the specifying means specifies the name of the object, and
the recognition mode includes:
a first recognition mode for recognizing the user's voice acquired by the voice acquisition means; and
a second recognition mode for recognizing the user's voice based on the user's voice acquired by the voice acquisition means and the name of the object specified by the specifying means.

The interactive device according to claim 2, wherein
the specifying means specifies the name of the object that can be substituted for part of the user's voice acquired by the voice acquisition means, and
in the second recognition mode, part of the data corresponding to the user's voice acquired by the voice acquisition means is replaced with data corresponding to the name of the object specified by the specifying means, and the user's voice is then recognized based on the data after replacement.

The interactive device according to claim 2 or 3, wherein, when the recognition means, while executing the first recognition mode, recognizes that the user's voice acquired by the voice acquisition means contains a predetermined voice, the recognition means switches the recognition mode being executed from the first recognition mode to the second recognition mode.

The interactive device according to any one of claims 1 to 4, wherein the specifying means specifies the object based on position information of the object in the image and the user's voice acquired by the voice acquisition means.

The interactive device according to any one of claims 1 to 5, further comprising output means for outputting an image, wherein
the image acquisition means acquires the image output by the output means.

An interactive program for causing a computer provided in an interactive device that interacts with a user to function as:
voice acquisition means for acquiring the user's voice;
image acquisition means for acquiring an image; and
specifying means for specifying an object included in the image acquired by the image acquisition means, based on the user's voice acquired by the voice acquisition means.