JP6562790B2 - Dialogue device and dialogue program

Info

Publication number: JP6562790B2
Application number: JP2015179495A
Authority: JP
Inventors: 択磨松村; 哲溝口
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2015-09-11
Filing date: 2015-09-11
Publication date: 2019-08-21
Anticipated expiration: 2035-09-11
Also published as: JP2017054065A

Description

The present invention relates to a dialogue device and a dialogue program for conducting a dialogue with a user.

Dialogue devices that conduct a dialogue with a user have been proposed, as described in Patent Document 1, for example. A dialogue between the user and the dialogue device may be started while the user is not performing a touch operation or the like (hereinafter, the "hands-free state"). In this case, the dialogue device starts the dialogue in response to, for example, detection of the user's voice (utterance).

Patent Document 1: JP 2002-182896 A

A dialogue device may be placed in a noisy environment that contains sounds other than the user's voice, such as conversations between people other than the user, ambient noise, and television audio. In a noisy environment, the dialogue device may mistake noise for the user's voice and start a dialogue. Such misrecognition may also cause a dialogue to be ended by mistake.

The present invention has been made in view of the above problem, and its object is to provide a dialogue device and a dialogue program with improved noise tolerance.

A dialogue device according to one aspect of the present invention is a dialogue device for conducting a dialogue with a user, and includes: acquisition means for acquiring an image of the user; and determination means for determining, when a dialogue with the user is to be started or ended, the timing for starting the dialogue with the user in the hands-free state or the timing for ending the dialogue with the user, based on the image acquired by the acquisition means.

A program according to one aspect of the present invention causes a computer provided in a dialogue device for conducting a dialogue with a user to function as: acquisition means for acquiring an image of the user; and determination means for determining, when a dialogue with the user is to be started or ended, the timing for starting the dialogue with the user in the hands-free state or the timing for ending the dialogue with the user, based on the image acquired by the acquisition means.

With the above dialogue device or program, the timing for starting or ending the dialogue with the user is determined based on an image of the user. Because the decision rests on the user's image, the timing can be determined appropriately without being affected by noise such as conversations between people other than the user, ambient noise, and television audio. The noise tolerance of the dialogue device is therefore better than in the conventional approach, where the dialogue is started in response to detection of the user's voice.

The determination means may detect that the user's line of sight has been directed at the dialogue device continuously for a predetermined time or longer, and determine the timing of that detection as the timing for starting the dialogue with the user. For example, when the dialogue device is mounted in the face of a robot shaped like a human, a user whose line of sight stays on the dialogue device for the predetermined time or longer is likely to intend to start a dialogue. This configuration therefore allows the timing for starting the dialogue with the user to be determined appropriately.
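The dwell-time condition above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `dialogue_start_time`, the `(timestamp, gaze_on_device)` sample format, and the parameter `dwell_s` are all assumptions introduced here, and the per-frame gaze flag is assumed to come from some upstream image recognition step.

```python
from typing import Iterable, Optional, Tuple


def dialogue_start_time(
    samples: Iterable[Tuple[float, bool]], dwell_s: float
) -> Optional[float]:
    """Return the first timestamp at which the gaze has stayed on the
    device continuously for at least dwell_s seconds, or None.

    samples: (timestamp_seconds, gaze_on_device) pairs in time order
             (hypothetical format, assumed for this sketch).
    """
    gaze_since = None  # start of the current uninterrupted gaze run
    for t, on_device in samples:
        if not on_device:
            gaze_since = None  # gaze left the device: reset the run
            continue
        if gaze_since is None:
            gaze_since = t
        if t - gaze_since >= dwell_s:
            return t  # detection timing = dialogue start timing
    return None


frames = [(0.0, True), (0.5, False), (1.0, True), (1.5, True), (2.0, True)]
start = dialogue_start_time(frames, dwell_s=1.0)
```

Resetting the run whenever the gaze leaves the device is what makes the condition "continuous"; a cumulative timer would instead trigger on repeated short glances.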

The determination means may determine the timing at which the user's mouth opens as the timing for starting the dialogue with the user. When the user's mouth opens, the user is likely to be about to start a dialogue. This configuration therefore allows the timing for starting the dialogue with the user to be determined appropriately.

The determination means may detect that the user's mouth has remained closed continuously for a predetermined time or longer, and determine the timing of that detection as the timing for ending the dialogue with the user. When the user's mouth stays closed for the predetermined time or longer, the user is likely to intend to end the dialogue. This configuration therefore allows the timing for ending the dialogue with the user to be determined appropriately.
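The mouth-based start and end conditions can be combined in one pass over per-frame observations, as in the sketch below. The function name `dialogue_span`, the `(timestamp, mouth_open)` sample format, and the parameter `closed_hold_s` are hypothetical and introduced only for illustration; how the mouth state is extracted from the image is outside the sketch.

```python
from typing import Iterable, Optional, Tuple


def dialogue_span(
    samples: Iterable[Tuple[float, bool]], closed_hold_s: float
) -> Tuple[Optional[float], Optional[float]]:
    """Return (start, end) timestamps of a dialogue.

    start: first timestamp at which the mouth is open.
    end:   first later timestamp at which the mouth has been closed
           continuously for at least closed_hold_s seconds.
    Either value is None if the condition is never met.
    """
    start = None
    closed_since = None
    for t, mouth_open in samples:
        if start is None:
            if mouth_open:
                start = t  # mouth opened: dialogue start timing
            continue
        if mouth_open:
            closed_since = None  # speaking again: reset the closed run
        else:
            if closed_since is None:
                closed_since = t
            if t - closed_since >= closed_hold_s:
                return start, t  # closed long enough: dialogue end timing
    return start, None


frames = [(0.0, False), (1.0, True), (2.0, False), (3.0, True),
          (4.0, False), (5.0, False), (6.0, False)]
span = dialogue_span(frames, closed_hold_s=2.0)
```

The short closure at 2.0 s does not end the dialogue because the mouth reopens before the hold time elapses, which is the point of requiring a continuous closed period.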

The acquisition means may further acquire the user's voice, and the dialogue device may further include selection means for selecting, from a plurality of predetermined speech recognition processes and based on the image of the user acquired by the acquisition means, the speech recognition process to be executed to recognize the user's voice acquired by the acquisition means during the dialogue. A speech recognition process suited to the user is thereby selected and executed during the dialogue, which improves the recognition accuracy of the user's voice in the dialogue.

The acquisition means may include a sound collection unit and an adjustment unit that adjusts the directivity of the sound collection unit toward the user's face based on the image acquired by the acquisition means. This improves the recognition accuracy of the user's voice even in a noisy environment containing conversations between people other than the user, ambient noise, television audio, and the like.

The acquisition means may further include an imaging unit that captures an image of the surroundings of the dialogue device, and an identification unit that, when the surrounding image captured by the imaging unit contains a plurality of persons, identifies as the user the person among them whose line of sight is directed at the dialogue device; the adjustment unit may then adjust the directivity of the sound collection unit toward the face of the user identified by the identification unit. When a plurality of persons are present, the person looking at the dialogue device is likely to be the user who intends to start a dialogue. With this configuration, that user is identified and the directivity of the sound collection unit is adjusted toward that user's face. The recognition accuracy of the voice of the user taking part in the dialogue can therefore be improved even when a plurality of persons are present.
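A minimal sketch of picking the user from several detected persons by gaze is shown below. Everything in it is an assumption made for illustration: the `DetectedFace` record, the `gaze_offset_deg` measure (angle between a person's gaze and the direction to the device), the 10-degree threshold, and the tie-breaking rule of preferring the smallest offset. The source only states that the person looking at the device is identified as the user.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DetectedFace:
    person_id: str          # hypothetical label for a detected person
    bearing_deg: float      # direction of the face as seen from the device
    gaze_offset_deg: float  # angle between the gaze and the device direction


def identify_user(
    faces: List[DetectedFace], max_offset_deg: float = 10.0
) -> Optional[DetectedFace]:
    """Return the person whose gaze is directed at the device, or None.

    A gaze counts as "directed at the device" when its offset is within
    max_offset_deg (assumed threshold). If several persons qualify, the
    one with the smallest offset is chosen (assumed tie-breaking rule).
    """
    candidates = [f for f in faces if abs(f.gaze_offset_deg) <= max_offset_deg]
    if not candidates:
        return None
    return min(candidates, key=lambda f: abs(f.gaze_offset_deg))


people = [
    DetectedFace("A", bearing_deg=-30.0, gaze_offset_deg=45.0),
    DetectedFace("B", bearing_deg=15.0, gaze_offset_deg=3.0),
    DetectedFace("C", bearing_deg=40.0, gaze_offset_deg=-8.0),
]
user = identify_user(people)
```

The `bearing_deg` of the returned record is the value an adjustment step would use to steer the sound collection unit toward that user's face.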

Alternatively, the acquisition means may further include an identification unit that, when the acquired surrounding image contains a plurality of persons, identifies the user from among them based on face information contained in the surrounding image. Image authentication information such as the features of an individual's face can thus be used to identify the user who is about to start a dialogue, and also to restrict who may use the device.

Alternatively, the dialogue device may further include selection means for selecting, from a plurality of predetermined speech recognition processes and based on the user's usage history of speech recognition processes, the speech recognition process to be executed to recognize the user's voice acquired by the acquisition means during the dialogue. Selecting and executing a speech recognition process suited to the user improves the recognition accuracy of the user's voice. It also spares the user the effort of, for example, selecting a language manually.

The present invention provides a dialogue device and a dialogue program with improved noise tolerance.

FIG. 1 is a diagram showing the functional blocks of the dialogue device.
FIG. 2 is a diagram showing the hardware configuration of the dialogue device.
FIG. 3 is a diagram showing the configuration of the dialogue program.
FIG. 4 is a first flowchart showing an example of processing executed in the dialogue device.
FIG. 5 is a second flowchart showing an example of processing executed in the dialogue device.
FIG. 6 is a third flowchart showing an example of processing executed in the dialogue device.
FIG. 7 is a diagram showing the functional blocks of a dialogue device according to a modification.

Embodiments of the present invention are described below with reference to the drawings. In the description of the drawings, identical elements are given identical reference numerals, and redundant description is omitted.

The dialogue device according to the embodiment is a device that conducts a dialogue with a user. The dialogue device may be realized as, for example, a mobile terminal such as a smartphone or a stationary terminal, or as a robot shaped like a human. The user can converse with the dialogue device in the hands-free state. The hands-free state means a state in which the user is not touching any physical element for operating the dialogue device (an operation button, a touch panel, or the like). Even if the user is touching something else, the user is in the hands-free state as long as none of those elements is being touched. The dialogue device according to the embodiment appropriately determines the timing for starting and ending a dialogue with the user when the user is in the hands-free state. The user therefore does not need to remain in the hands-free state in the middle of the dialogue.

FIG. 1 is a diagram showing the functional blocks of the dialogue device according to the embodiment. As shown in FIG. 1, the dialogue device 100 includes an input unit 110, an output unit 120, a control unit 130, a storage unit 140, and a communication unit 150.

The input unit 110 and the output unit 120 are the parts (input/output interfaces) that exchange information with the outside of the dialogue device 100, mainly with the user. The input unit 110 accepts input of the sound around the dialogue device 100, including the user's voice (hereinafter also simply called the "surrounding sound"), and input of images and video of the surroundings of the dialogue device 100, including the user (hereinafter also simply called the "surrounding image"). The output unit 120 outputs various images and video as well as various sounds.

Specifically, the input unit 110 includes a sound collection unit 111 and an imaging unit 112. The sound collection unit 111 accepts input of the surrounding sound and consists of, for example, a microphone. To provide directivity, the sound collection unit 111 may consist of a microphone array in which a plurality of microphones are arranged. The imaging unit 112 accepts input of the surrounding image and consists of, for example, a camera. The imaging unit 112 may consist of a plurality of cameras so that, for example, the distance to the imaged subject can be determined. The input unit 110 may further include elements such as operation buttons with which the user operates the dialogue device 100.

The output unit 120 includes a sound generation unit 121 and a display unit 122. The sound generation unit 121 outputs sound and consists of, for example, a speaker. To provide directivity, the sound generation unit 121 may consist of an array speaker in which a plurality of speakers are arranged. The display unit 122 outputs images and video and consists of, for example, a display. The display may be a touch panel, in which case the touch panel also serves as an element with which the user operates the dialogue device 100.

The control unit 130 performs overall control of the dialogue device 100 by controlling each of its elements. The control unit 130 includes an acquisition unit 131, a determination unit 132, a speech recognition unit 133, a selection unit 134, an image recognition unit 135, an identification unit 136, and an adjustment unit 137.

The acquisition unit 131 acquires the surrounding sound and the surrounding image input to the input unit 110. That is, the acquisition unit 131 and the input unit 110 (including the sound collection unit 111 and the imaging unit 112) function as acquisition means for acquiring the surrounding sound and the surrounding image. The identification unit 136, described later, may also form part of the acquisition means. Unless otherwise noted, the acquisition means is referred to below simply as the acquisition unit 131.

The determination unit 132 is the part (determination means) that determines the timing for starting or ending a dialogue with the user. In particular, the determination unit 132 determines the timing for starting or ending a dialogue with a user in the hands-free state based on the surrounding sound and the surrounding image acquired by the acquisition unit 131. The details of how the determination unit 132 determines the timing are described later.

The speech recognition unit 133 executes speech recognition processing for recognizing the surrounding sound, in particular the user's voice. The speech recognition unit 133 executes the speech recognition processing on the surrounding sound acquired by the acquisition unit 131. The speech recognition processing is realized by any of various known techniques, including techniques that use acoustic models and language models prepared in advance. A language model may include the vocabulary of specialized dictionaries so that various specialized fields can be handled. A plurality of acoustic models and language models may be prepared. In that case, a plurality of speech recognition processes corresponding to combinations of acoustic models and language models are prepared in advance, and the speech recognition unit 133 may execute the speech recognition process selected by the selection unit 134, described later.

The selection unit 134 is the part (selection means) that selects the speech recognition process to be executed by the speech recognition unit 133 when a plurality of speech recognition processes are prepared. The selection unit 134 selects that speech recognition process based on the surrounding sound and the surrounding image acquired by the acquisition unit 131.

For example, the selection unit 134 may infer the race of a person, in particular the user, from hair color and facial features based on the surrounding image or the like, and select a speech recognition process suited to a user of the inferred race. For example, the speech recognition process corresponding to the combination of an acoustic model and a language model for the language used by users of the inferred race is selected.

The selection unit 134 may also estimate the gender of a person, in particular the user, based on the surrounding image or the like, and select a speech recognition process suited to a user of the estimated gender. Because gender is related to the voice frequency band, the selection unit 134 can also be said to estimate the user's voice frequency band. For example, the speech recognition process corresponding to the combination of an acoustic model for users of the estimated gender (voice frequency band) and a language model for the spoken style (tone, expressions) associated with that gender is selected.

The selection unit 134 may also select a speech recognition process suited to the user based on the user's usage history of speech recognition processes. For example, the speech recognition process corresponding to the combination of an acoustic model and a language model that the user has used in the past is selected.
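History-based selection can be sketched as below. The source only says that a process with a record of past use is selected; picking the most frequently used (acoustic model, language model) pair, and falling back to a default when the history is empty, are assumptions made for this sketch, as are the function name and the model identifiers.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# A speech recognition process is keyed here by its
# (acoustic model, language model) pair; the names are hypothetical.
Process = Tuple[str, str]


def select_by_history(
    history: Iterable[Process], available: Set[Process], default: Process
) -> Process:
    """Pick the available process the user has used most often.

    history:   past selections for this user, oldest first.
    available: processes currently prepared on the device.
    default:   returned when no usable history exists (assumed fallback).
    """
    counts = Counter(p for p in history if p in available)
    if not counts:
        return default
    return counts.most_common(1)[0][0]


available = {("am_ja", "lm_ja_general"), ("am_en", "lm_en_general"),
             ("am_ja", "lm_ja_medical")}
history = [("am_ja", "lm_ja_general"), ("am_ja", "lm_ja_medical"),
           ("am_ja", "lm_ja_medical")]
chosen = select_by_history(history, available, ("am_en", "lm_en_general"))
```

Filtering the history against `available` keeps the sketch safe when a model that was used in the past has since been removed or replaced.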

The image recognition unit 135 recognizes the surrounding image, in particular images or video of the user. The image recognition unit 135 executes image recognition processing on the surrounding image acquired by the acquisition unit 131. The image recognition processing is realized by any of various known techniques, for example using OpenCV (Open Source Computer Vision Library).

The identification unit 136 identifies the user who is the target of the dialogue when the surrounding image contains a plurality of users. For example, based on the surrounding image acquired by the imaging unit 112, and more specifically on the result of the image recognition unit 135 recognizing that image, the identification unit 136 identifies the user among them whose line of sight is directed at the dialogue device 100.

The adjustment unit 137 adjusts the directivity of the sound collection unit 111. When the sound collection unit 111 is a microphone array, the directivity is adjusted by, for example, phase control. Even when the sound collection unit 111 is a single microphone, the directivity can be adjusted by, for example, physically changing the orientation of the microphone. Based on the surrounding image acquired by the acquisition unit 131, and more specifically on the recognition result of the image recognition unit 135, the adjustment unit 137 adjusts the directivity of the sound collection unit 111 toward, for example, the user's face. Adjusting the directivity toward the user's face means adjusting it so that sound produced at or near the user's face is picked up more readily than sound produced elsewhere.
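One common form of phase control for a microphone array is delay-and-sum beamforming; the source does not name a specific method, so the sketch below is an assumption. It computes the per-microphone delays that align a far-field sound arriving from a given direction on a linear array. The function name, the array geometry, the angle convention (measured from broadside, positive toward +x), and the speed-of-sound constant are all introduced here for illustration.

```python
import math
from typing import List

SPEED_OF_SOUND_M_S = 343.0  # approximate value in air at room temperature


def steering_delays(mic_x_m: List[float], angle_deg: float) -> List[float]:
    """Delays (seconds) to apply to each microphone of a linear array so
    that a plane wave from angle_deg adds up in phase.

    mic_x_m:   microphone positions along the array axis, in metres.
    angle_deg: direction of the user's face, measured from broadside
               (0 = straight ahead, positive toward +x).
    """
    theta = math.radians(angle_deg)
    # A wave from the +x side reaches microphones with larger x earlier,
    # so those channels need proportionally more delay to line up.
    raw = [x * math.sin(theta) / SPEED_OF_SOUND_M_S for x in mic_x_m]
    base = min(raw)
    return [d - base for d in raw]  # shift so every delay is non-negative


mics = [0.0, 0.05, 0.10, 0.15]           # hypothetical 4-microphone array
delays = steering_delays(mics, angle_deg=30.0)
```

Summing the delayed channels then emphasizes sound from the steered direction relative to sound from other directions, which matches the description of making sound near the user's face easier to pick up.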

The control unit 130 may also execute processing for generating the various information that the output unit 120 outputs to the user during a dialogue, for example data for sound from the sound generation unit 121 and for images and video from the display unit 122.

With the configuration described above, the control unit 130 executes the various processes that the dialogue device 100 needs in order to conduct a dialogue with the user. The details of the processing executed by the control unit 130 are described later with reference to FIGS. 4 to 6.

The storage unit 140 stores the various information needed for the processing executed by the control unit 130. The storage unit 140 stores, for example, the acoustic models, language models, and usage history mentioned above. The storage unit 140 also stores a program (dialogue program) that causes the dialogue device 100 to execute the processing needed to conduct a dialogue with the user.

The storage unit 140 also stores data (user data) on the users who are permitted to use the dialogue device 100 (authorized users). The user data may include feature data for the authorized users. The feature data may be, for example, data representing features of an authorized user's face or data representing features of an authorized user's voice. The user data may also include data needed to infer the user's race or gender. Such user data may be, for example, data that associates race with hair color, facial features, and the like, or data that associates gender with hair color, facial features, and the like.

The communication unit 150 communicates with the outside of the dialogue device 100. Through the communication unit 150, the acoustic models, language models, dialogue program, user data, and the like described above can, for example, be additionally acquired or updated.

The hardware configuration of the dialogue device 100 is now described with reference to FIG. 2. As shown in FIG. 2, the dialogue device 100 can be physically configured as a computer including one or more CPUs (Central Processing Units) 21, a RAM (Random Access Memory) 22 and a ROM (Read Only Memory) 23, an imaging device 24 such as a camera, a communication module 26 that is a data transmission/reception device, an auxiliary storage device 27 such as a semiconductor memory, an input device 28 that accepts user operations, such as an operation panel (including operation buttons) or a touch panel, an output device 29 such as a display, and a reading device 2A such as a CD-ROM drive. The functions of the dialogue device 100 in FIG. 1 are realized, for example, by having the reading device 2A read one or more programs stored on a storage medium M such as a CD-ROM and load them onto hardware such as the RAM 22, so that the imaging device 24, the communication module 26, the input device 28, and the output device 29 operate under the control of the CPU 21 while data is read from and written to the RAM 22 and the auxiliary storage device 27.

FIG. 3 shows the modules of the dialogue program that causes a computer to function as the dialogue device 100. As shown in FIG. 3, the dialogue program P100 includes an acquisition module P101, a determination module P102, a speech recognition module P103, a selection module P104, an image recognition module P105, an identification module P106, and an adjustment module P107. These modules realize the functions of the acquisition unit 131, the determination unit 132, the speech recognition unit 133, the selection unit 134, the image recognition unit 135, the identification unit 136, and the adjustment unit 137, respectively, described above with reference to FIG. 1.

The dialogue program is provided, for example, stored on a storage medium. The storage medium may be a flexible disk, a CD-ROM, a USB memory, a DVD, a semiconductor memory, or the like.

Next, the operation of the dialogue device 100 (the dialogue method executed by the dialogue device 100) is described with reference to FIGS. 4 to 6.

FIGS. 4 and 5 are flowcharts showing an example of the processing executed in the dialogue device 100. The processing of these flowcharts is executed when the dialogue device 100 starts or ends a dialogue with the user. It is assumed that the user is in the hands-free state at least at the start or the end of the dialogue. Unless otherwise noted, each process may be executed by the control unit 130 (that is, by any of the elements included in the control unit 130).

First, the dialogue device 100 creates a speaking-user list (step S1). The speaking-user list is a list of the users who, in the processing of this flowchart, are speaking in order to converse with the dialogue device 100. The speaking-user list may be stored in, for example, the storage unit 140. When the speaking-user list is created in step S1 it contains no users; users are added to it in step S37, described later. Because the processing of the flowchart loops, the speaking-user list may contain users from the second pass onward.

Next, the dialogue device 100 determines the number of persons n by face detection (step S2). For example, the image recognition unit 135 recognizes the surrounding image acquired by the acquisition unit 131. Then, for example, the identification unit 136 detects the faces of the persons contained in the surrounding image and takes the number of detected faces as the number of persons n.

In the following steps S3 to S7, the dialogue device 100 identifies the line of sight of each of the n persons. Specifically, the dialogue device 100 sets the initial value of a variable i to 0 (step S3) and, incrementing i by 1 each time (step S6), repeats steps S4 and S5 below for the i-th user until i reaches n (while step S7 is NO).

That is, the dialogue device 100 identifies the individual by face recognition (step S4) and identifies the direction in which the individual is looking by gaze recognition (step S5). Specifically, the specifying unit 136 identifies the person in the surrounding image based on the recognition result of the image recognition unit 135, and identifies the gaze direction of the identified person (individual).

After steps S4 and S5 have been completed for each of the n persons (step S7: YES), the dialogue device 100 proceeds to step S8.

In step S8, the dialogue device 100 determines whether the speaking-user list contains a user whose face has not been authenticated. For example, if the speaking-user list contains a person other than the persons identified in step S4, it may be determined that the list contains such a user. If the list contains such a user (step S8: YES), the dialogue device 100 proceeds to step S9. Otherwise (step S8: NO), the dialogue device 100 proceeds to step S10.

In step S9, the dialogue device 100 removes the user from the speaking-user list and ends speech recognition. Specifically, the user determined in step S8 to be a user in the speaking-user list whose face has not been authenticated is removed from the list, and the speech recognition unit 133 ends the speech recognition processing. Step S9 is executed when step S8 finds such a user in the speaking-user list. It can therefore be executed when a user has been added to the list and speech recognition has been started in step S37, described later, and the flowchart has then looped back to step S8.
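Steps S8 and S9 can be read as a set difference between the speaking-user list and the persons identified in the current pass. The following is a minimal sketch under that reading; the function name `prune_speaking_users`, the use of string identifiers for persons, and the returned flag are illustrative assumptions and do not appear in the source.

```python
from typing import Set, Tuple


def prune_speaking_users(
    speaking_users: Set[str], identified_now: Set[str]
) -> Tuple[Set[str], bool]:
    """Hypothetical sketch of steps S8 and S9.

    speaking_users: identifiers currently in the speaking-user list.
    identified_now: identifiers of persons identified by face in step S4.
    Returns the pruned list and a flag telling the caller whether
    speech recognition should be ended (step S9).
    """
    # S8: listed users whose face was not identified in this pass
    unauthenticated = speaking_users - identified_now
    # S9: remove them from the list
    remaining = speaking_users - unauthenticated
    return remaining, bool(unauthenticated)
```

The source does not state whether recognition ends for every listed user or only for the removed one, so the sketch only reports that a removal happened and leaves that decision to the caller.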

In the following steps S10 to S24, the dialogue device 100 determines the number of speaking users among the n persons. A speaking user is a user considered to have spoken in order to start a dialogue with the dialogue device 100. Specifically, the dialogue device 100 sets the initial values of variables j and m to 0 (steps S10 and S11) and, incrementing j by 1 each time (step S23), repeats steps S12 to S22 below for the j-th user until j reaches n (while step S24 is NO).

That is, the dialogue device 100 first determines whether the person is a user in the speaking-user list (step S12). For example, if the j-th person is included in the speaking-user list, the person may be determined to be such a user. If the person is in the list (step S12: YES), the dialogue device 100 proceeds to step S13. Otherwise (step S12: NO), the dialogue device 100 proceeds to step S15.

In step S13, the dialogue device 100 determines whether the person's mouth has remained closed. This determination is made, for example, by the determination unit 132 based on the recognition result of the image recognition unit 135. For example, the mouth may be determined to have remained closed if it has stayed closed for a predetermined time or longer. If the mouth has remained closed (step S13: YES), the dialogue device 100 proceeds to step S14. Otherwise (step S13: NO), the dialogue device 100 proceeds to step S20.

In step S14, the dialogue device 100 removes the user from the speaking-user list and ends speech recognition. Specifically, the j-th person is removed from the speaking-user list, and the speech recognition unit 133 ends speech recognition. This ends the dialogue with that user. After step S14, the dialogue device 100 proceeds to step S23.

In step S15, the dialogue device 100 determines whether the person's line of sight is directed in a predetermined direction. This determination is made, for example, by the determination unit 132 or the specifying unit 136 based on the gaze direction identified in step S5 described above. The predetermined direction may be the direction toward the dialogue device 100. For example, the line of sight may be determined to be in the predetermined direction if it has been directed at the dialogue device 100 continuously for a predetermined time or longer. The predetermined time may be on the order of several seconds. If the line of sight is in the predetermined direction (step S15: YES), the dialogue device 100 proceeds to step S16. Otherwise (step S15: NO), the dialogue device 100 determines that the person is a user with no intention of speaking (step S17) and proceeds to step S23.

In step S16, the dialogue device 100 determines whether the person is a permitted user. For example, the determination unit 132 or the specifying unit 136 makes this determination by collating the recognition result of the image recognition unit 135 with the user data stored in the storage unit 140. If the person is a permitted user (step S16: YES), the dialogue device 100 proceeds to step S18. Otherwise (step S16: NO), the dialogue device 100 determines that the person is a non-permitted user (step S19) and proceeds to step S23.

In step S18, the dialogue device 100 determines whether the person's mouth has started to move. This determination is made, for example, by the determination unit 132 based on the recognition result of the image recognition unit 135. For example, the mouth may be determined to have started moving when it changes from a closed state to an open state. If the mouth has started to move (step S18: YES), that is, if the person's line of sight is in the predetermined direction (step S15: YES), the person is a permitted user (step S16: YES), and the person's mouth has started to move (step S18: YES), the dialogue device 100 determines that the person is a speaking user (step S20), increments the variable m by 1 (step S21), and proceeds to step S23. The variable m thus indicates the number of speaking users. If the person's mouth has not started to move (step S18: NO), the dialogue device 100 proceeds to step S22.

In step S22, the dialogue device 100 determines whether a spoken utterance has been detected. This determination is made in the same way as in a conventional dialogue device, using, for example, the functions of the sound collection unit 111, the acquisition unit 131, and the speech recognition unit 133. If a spoken utterance has been detected (step S22: YES), the dialogue device 100 proceeds to step S20 described above. Otherwise (step S22: NO), the dialogue device 100 proceeds to step S23.

After steps S12 to S22 have been completed for each of the n persons (step S24: YES), the dialogue device 100 executes the dialogue start processing (step S25).
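The branching of steps S12 to S22 for a single person, and the counting of speaking users in the variable m, can be sketched as follows. The sketch assumes that the image and audio front ends have already reduced each person to six Boolean observations; the names `Observation`, `Outcome`, `classify`, and `count_speaking` are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    DIALOGUE_ENDED = "removed from list (S14)"
    NO_INTENT = "no intention of speaking (S17)"
    NOT_PERMITTED = "non-permitted user (S19)"
    SPEAKING = "speaking user (S20)"
    UNDECIDED = "no decision, go to S23"


@dataclass
class Observation:
    in_speaking_list: bool      # checked in S12
    mouth_closed_long: bool     # checked in S13
    gaze_on_device: bool        # checked in S15
    permitted: bool             # checked in S16
    mouth_started_moving: bool  # checked in S18
    speech_detected: bool       # checked in S22


def classify(o: Observation) -> Outcome:
    """Follow the S12 to S22 branches for one person."""
    if o.in_speaking_list:                      # S12: YES
        if o.mouth_closed_long:                 # S13: YES -> S14
            return Outcome.DIALOGUE_ENDED
        return Outcome.SPEAKING                 # S13: NO -> S20
    if not o.gaze_on_device:                    # S15: NO -> S17
        return Outcome.NO_INTENT
    if not o.permitted:                         # S16: NO -> S19
        return Outcome.NOT_PERMITTED
    if o.mouth_started_moving or o.speech_detected:  # S18 or S22: YES -> S20
        return Outcome.SPEAKING
    return Outcome.UNDECIDED                    # S22: NO -> S23


def count_speaking(observations: Iterable[Observation]) -> int:
    """Value of m after the loop: the number of speaking users (S21)."""
    return sum(1 for o in observations if classify(o) is Outcome.SPEAKING)
```

Note that, as in the flowchart, audio detection (S22) is consulted only after the gaze and permission checks have passed, which is what keeps background noise from starting a dialogue on its own.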

FIG. 6 is a flowchart illustrating an example of the processing executed in the dialogue start processing (step S25 in FIG. 5).

In the following steps S31 to S39, the dialogue device 100 adjusts the microphone and optimizes the acoustic model and language model for each of the m speaking users, and then performs speech recognition and related processing. Specifically, the dialogue device 100 sets the initial value of a variable k to 0 (step S31) and, incrementing k by 1 each time (step S38), executes steps S32 to S37 below for the k-th speaking user until k reaches m (while step S39 is NO).

That is, the dialogue device 100 first derives the microphone direction from the position of the face in the video (or image) (step S32) and controls the microphone direction (step S33). Specifically, the adjustment unit 137 adjusts the directivity of the sound collection unit 111 toward the speaking user's face based on the recognition result of the image recognition unit 135. If the sound collection unit 111 is a microphone array, the directivity may be adjusted toward each of the speaking users so that the voices of all speaking users can be recognized simultaneously. If the sound collection unit 111 is a single microphone, the directivity may be adjusted toward, for example, the speaking user handled in the first iteration (k = 0).
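The source does not say how the microphone direction is derived from the face position in step S32. One possible approach, shown here purely as an illustration, assumes a pinhole camera whose optical axis coincides with the microphone's forward direction and converts the horizontal pixel position of the face into an azimuth angle. The function name `face_azimuth_deg` and its parameters are assumptions, not part of the source.

```python
import math


def face_azimuth_deg(face_x: float, image_width: int, horizontal_fov_deg: float) -> float:
    """Horizontal angle of a face relative to the camera axis, in degrees.

    Assumes an ideal pinhole camera (no lens distortion) aligned with
    the microphone. Negative values are to the left of the image centre,
    positive values to the right.
    """
    half_width = image_width / 2.0
    # Focal length in pixels, derived from the horizontal field of view
    focal_px = half_width / math.tan(math.radians(horizontal_fov_deg) / 2.0)
    return math.degrees(math.atan((face_x - half_width) / focal_px))
```

For a 640-pixel-wide image with a 60-degree field of view, a face at the image centre maps to 0 degrees and a face at the right edge maps to about 30 degrees. The resulting angle could then be passed to whatever beam-steering mechanism the sound collection unit provides.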

Next, the dialogue device 100 determines whether the user is already speaking (step S34). For example, if speech recognition was started in step S37 in a previous pass and a spoken utterance has been detected, the user may be determined to be already speaking. If the user is already speaking (step S34: YES), the dialogue device 100 continues speech recognition (step S35) and proceeds to step S38. Otherwise (step S34: NO), the dialogue device 100 proceeds to step S36.

In step S36, the dialogue device 100 reads the user data and determines the acoustic model and language model. Specifically, the selection unit 134 determines an acoustic model and language model suited to the speaking user by collating the image of the user (face image) recognized in step S4 (FIG. 4) with the user data stored in the storage unit 140. The selection unit 134 may also determine an acoustic model and language model suited to the speaking user based on the usage history stored in the storage unit 140.

The dialogue device 100 then adds the user to the speaking-user list and starts speech recognition (step S37). Specifically, the k-th speaking user is added to the speaking-user list, and the speech recognition unit 133 starts the speech recognition processing. Thereafter, when k reaches m after step S38 (step S39: YES), the dialogue device 100 returns to step S2 (FIG. 4).
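The loop of steps S31 to S39 can be sketched as follows. Microphone steering (S32, S33) is omitted, and the model choice of step S36 is delegated to a caller-supplied function. The names `start_dialogues`, `already_speaking`, and `pick_models` are hypothetical.

```python
from typing import Callable, Iterable, List, Set, Tuple

Models = Tuple[str, str]  # (acoustic model name, language model name)


def start_dialogues(
    speaking_users: Iterable[str],
    speaking_list: Set[str],
    already_speaking: Set[str],
    pick_models: Callable[[str], Models],
) -> List[Tuple[str, Models]]:
    """Hypothetical sketch of the dialogue start processing.

    Returns the users for whom recognition is newly started, together
    with the models chosen for them. speaking_list is updated in place.
    """
    started: List[Tuple[str, Models]] = []
    for user in speaking_users:          # S31, S38, S39: loop over k
        if user in already_speaking:     # S34: YES
            continue                     # S35: recognition simply continues
        models = pick_models(user)       # S36: choose acoustic/language models
        speaking_list.add(user)          # S37: add to the speaking-user list
        started.append((user, models))   # S37: recognition would start here
    return started
```

Passing the model chooser in as a function keeps the loop independent of whether models are chosen from the face image, from the usage history, or from both.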

Next, the operation and effects of the dialogue device 100 are described. With the dialogue device 100, the timing for starting or ending a dialogue with the user is determined based on an image (or video) of the user. Because the determination relies on the user's image, the timing for starting the dialogue and the like can be determined appropriately without being affected by noise such as conversations of other people, ambient noise, and television audio. The noise tolerance of the dialogue device can therefore be improved compared with the conventional approach of starting a dialogue in response to detection of the user's voice.

Specifically, the determination unit 132 detects that the user's line of sight has been directed at the dialogue device 100 continuously for a predetermined time or longer (step S15: YES) and determines the timing of that detection as the timing for starting a dialogue with the user (step S25). When the user's line of sight has been directed at the dialogue device 100 continuously for a predetermined time or longer, the user is likely to intend to start a dialogue. This processing of the determination unit 132 therefore makes it possible to determine the timing for starting a dialogue with the user appropriately.

In addition, the determination unit 132 determines the timing at which the user's mouth opens as the timing for starting a dialogue with the user (step S18: YES, step S25). When the user's mouth opens, the user is likely to start a dialogue. This processing of the determination unit 132 therefore makes it possible to determine the timing for starting a dialogue with the user appropriately.

In addition, the determination unit 132 detects that the user's mouth has been closed continuously for a predetermined time or longer and determines the timing of that detection as the timing for ending the dialogue with the user (step S13: YES, step S14). When the user's mouth has been closed continuously for a predetermined time or longer, the user is likely to intend to end the dialogue. This processing of the determination unit 132 therefore makes it possible to determine the timing for ending the dialogue with the user appropriately.
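Both the gaze condition for starting a dialogue and the closed-mouth condition for ending one rest on the same test: a condition must hold continuously for a predetermined time or longer. A minimal sketch of such a test follows. The class name `HoldDetector`, the per-frame `update` call, and the explicit timestamp argument are illustrative assumptions.

```python
from typing import Optional


class HoldDetector:
    """Reports True once a condition has held continuously for hold_seconds."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self._since: Optional[float] = None  # time at which the condition became true

    def update(self, condition: bool, now: float) -> bool:
        """Feed one observation (for example, one video frame) taken at time now."""
        if not condition:
            # Any interruption resets the timer
            self._since = None
            return False
        if self._since is None:
            self._since = now
        return now - self._since >= self.hold_seconds
```

One instance per person could be fed "gaze is on the device" for the start decision, and another could be fed "mouth is closed" for the end decision, with the threshold in each case set to the predetermined time.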

In addition, the selection unit 134 selects the speech recognition process to be executed to recognize the user's voice in the dialogue from a plurality of predetermined speech recognition processes, based on the image or video of the user acquired by the acquisition unit 131 (step S36). A speech recognition process appropriate to the user is thereby selected and executed in the dialogue. As a result, the accuracy of recognizing the user's voice in the dialogue can be improved.

For example, the selection unit 134 estimates the user's race and selects a speech recognition process suited to users of the estimated race. A speech recognition process appropriate to the user's race is thereby selected. For example, speech recognition accuracy can be improved by selecting a speech recognition process corresponding to a combination of an acoustic model and a language model for the language used by users of the estimated race. This also spares the user the effort of manually selecting a language.

For example, the selection unit 134 estimates the user's gender and selects a speech recognition process suited to users of the estimated gender. Because gender is related to, for example, the voice frequency band, speech recognition accuracy can be improved by selecting a speech recognition process that uses an acoustic model suited to the voice frequency band of users of the estimated gender. Accuracy can also be improved by selecting a speech recognition process that uses a language model corresponding to the manner of speaking (tone, expressions) associated with the estimated gender. Accuracy can of course be improved further by selecting a speech recognition process corresponding to a combination of such an acoustic model and language model.

In addition, the adjustment unit 137 adjusts the directivity of the sound collection unit 111 toward the user's face (steps S32 and S33). This improves the accuracy of recognizing the user's voice even in a noisy environment containing conversations of other people, ambient noise, television audio, and the like.

In addition, when the surrounding image captured by the imaging unit 112 includes a plurality of persons, the specifying unit 136 identifies as the user a person among them whose line of sight is directed at the dialogue device 100 (step S15: YES, step S20). The adjustment unit 137 then adjusts the directivity of the sound collection unit 111 toward the face of the user identified by the specifying unit 136 (steps S32 and S33). When a plurality of persons are present, a person whose line of sight is directed at the dialogue device 100 is likely to be a user who intends to start a dialogue. With this processing of the specifying unit 136 and the adjustment unit 137, a user who is likely to intend to start a dialogue is identified, and the directivity of the sound collection unit is adjusted toward that user's face. Even when a plurality of persons are present, therefore, the accuracy of recognizing the voice of the user who is the subject of the dialogue can be improved.

In addition, the specifying unit 136 identifies a permitted user by collating the user's face included in the surrounding image, or the user's voice included in the surrounding sound, with the user data stored in the storage unit 140 (step S16: YES). The speaking user can thereby be identified using image authentication information such as the features of an individual's face (step S16: YES, steps S20 and S21).

Iris recognition may be used instead of authentication based on facial features and the like. In that case, the processing for identifying the line of sight (step S15) may be omitted.

In addition, the selection unit 134 refers to the history information stored in the storage unit 140 and selects an acoustic model and language model that have a usage history, thereby selecting a speech recognition process corresponding to a combination of an acoustic model and language model suited to the user (step S36). This also improves the accuracy of recognizing the user's voice, and spares the user the effort of manually selecting a language.
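The source states only that models having a usage history are selected; it does not give a selection rule. The sketch below assumes one plausible rule, namely choosing the (acoustic model, language model) pair the user has used most often and falling back to a default when there is no history. The function name `select_models`, the default model names, and the most-frequent rule are all assumptions.

```python
from collections import Counter
from typing import Sequence, Tuple

Models = Tuple[str, str]  # (acoustic model name, language model name)


def select_models(
    history: Sequence[Models],
    default: Models = ("generic-acoustic", "generic-language"),
) -> Models:
    """Pick the model pair that appears most often in the user's history.

    history: model pairs used in the user's past dialogues, oldest first.
    default: hypothetical fallback used when the history is empty.
    """
    if not history:
        return default
    # most_common(1) returns a list holding the single most frequent pair
    return Counter(history).most_common(1)[0][0]
```

A recency-weighted rule would be an equally reasonable reading of the source; frequency is used here only because it is the simplest to state.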

Each function of the dialogue device 100 described above can also be realized, for example, by executing a dialogue program on a computer.

Although one embodiment of the present invention has been described above, the present invention is not limited to that embodiment.

FIG. 7 is a diagram illustrating the functional blocks of a dialogue device according to a modification. The dialogue device 100A, in cooperation with a server 200, constitutes a dialogue system 1 that conducts a dialogue with the user. In this modification, the dialogue system 1 corresponds to the dialogue device according to the present invention.

As illustrated in FIG. 7, the server 200 includes a control unit 230, a storage unit 240, and a communication unit 250.

The control unit 230 includes an acquisition unit 231, a determination unit 232, a speech recognition unit 233, a selection unit 234, an image recognition unit 235, a specifying unit 236, and an adjustment unit 237. These elements have the same functions as the acquisition unit 131, the determination unit 132, the speech recognition unit 133, the selection unit 134, the image recognition unit 135, the specifying unit 136, and the adjustment unit 137 described above with reference to FIG. 1.

The storage unit 240 has the same function as the storage unit 140 described above with reference to FIG. 1. That is, the storage unit 240 stores the various information needed for the processing executed by the control unit 230, such as the acoustic models, the language models, the dialogue program, the user data, and the usage history.

The communication unit 250 communicates with the communication unit 150 of the dialogue device 100A. The communication unit 250 enables the dialogue device 100A and the server 200 to communicate with each other.

With the above configuration, the dialogue system 1 enables a dialogue with the user through cooperation between the dialogue device 100A and the server 200. That is, in the dialogue system 1, part of the processing that the dialogue device 100 (FIG. 1) executes to conduct a dialogue with the user (in particular, the processing executed by the control unit 130) is executed by the server 200. The dialogue device 100A can therefore reduce the processing load on the dialogue device compared with the dialogue device 100.

Specifically, compared with the dialogue device 100, the dialogue device 100A may be configured to include a control unit 130A and a storage unit 140A in place of the control unit 130 and the storage unit 140.

The control unit 130A performs overall control of the dialogue device 100A. Unlike the control unit 130, however, the control unit 130A need not include the acquisition unit 131, the determination unit 132, the speech recognition unit 133, the selection unit 134, the image recognition unit 135, the specifying unit 136, and the adjustment unit 137. The configuration of the control unit 130A can therefore be simpler than that of the control unit 130.

The storage unit 140A stores the various information needed for the processing executed by the control unit 130A, but it need not store data that duplicates the data in the storage unit 240 of the server 200. The configuration can be simplified accordingly, for example by making the storage capacity of the storage unit 140A smaller than that of the storage unit 140.

100, 100A: dialogue device; 110: input unit (acquisition means); 111: sound collection unit (acquisition means); 112: imaging unit (acquisition means); 120: output unit; 121: sound output unit; 122: display unit; 130, 130A, 230: control unit; 131, 231: acquisition unit (acquisition means); 132, 232: determination unit (determination means); 133, 233: speech recognition unit; 134, 234: selection unit (selection means); 135, 235: image recognition unit; 136, 236: specifying unit (acquisition means); 137, 237: adjustment unit; 140, 140A, 240: storage unit; 150: communication unit; 200: server.

Claims

A dialogue device for conducting a dialogue with a user, comprising:
acquisition means for acquiring an image of the user; and
determination means for determining, when a dialogue with the user is to be started or ended, a timing for starting the dialogue with the user in a hands-free state or a timing for ending the dialogue with the user, based on the image acquired by the acquisition means,
wherein starting the dialogue includes starting a speech recognition process for recognizing the user's voice, and ending the dialogue includes ending the speech recognition process, and
wherein the dialogue device detects that the user's mouth has been closed continuously for a predetermined time or longer, and determines the timing of the detection as the timing for ending the dialogue with the user.

The dialogue device according to claim 1,
wherein the determination means detects that the user's line of sight has been directed at the dialogue device continuously for a predetermined time or longer, and determines the timing of the detection as the timing for starting a dialogue with the user.

The dialogue device according to claim 1 or 2,
wherein the determination means determines the timing at which the user's mouth opens as the timing for starting a dialogue with the user.

The dialogue device according to any one of claims 1 to 3,
wherein the acquisition means further acquires the user's voice, and
the dialogue device further comprises selection means for selecting, from a plurality of predetermined speech recognition processes and based on the image of the user acquired by the acquisition means, the speech recognition process to be executed to recognize, in the dialogue, the user's voice acquired by the acquisition means.

The dialogue device according to any one of claims 1 to 4,
wherein the acquisition means includes:
a sound collection unit; and
an adjustment unit that adjusts the directivity of the sound collection unit toward the user's face based on the image acquired by the acquisition means.

The dialogue device according to claim 5,
wherein the acquisition means further includes:
an imaging unit that captures an image of the surroundings of the dialogue device; and
a specifying unit that, when the surrounding image captured by the imaging unit includes a plurality of persons, identifies as the user a person among the plurality of persons whose line of sight is directed at the dialogue device, and
wherein the adjustment unit adjusts the directivity of the sound collection unit toward the face of the user identified by the specifying unit.

The dialogue device according to any one of claims 1 to 5,
wherein the acquisition means further includes a specifying unit that, when an acquired surrounding image includes a plurality of persons, identifies the user from among the plurality of persons based on face information included in the surrounding image.

The dialogue device according to any one of claims 1 to 3,
further comprising selection means for selecting, from a plurality of predetermined speech recognition processes and based on the user's usage history of the speech recognition processes, the speech recognition process to be executed to recognize, in the dialogue, the user's voice acquired by the acquisition means.

A dialogue program for causing a computer provided in a dialogue device for conducting a dialogue with a user to function as:
acquisition means for acquiring an image of the user; and
determination means for determining, when a dialogue with the user is to be started or ended, a timing for starting the dialogue with the user in a hands-free state or a timing for ending the dialogue with the user, based on the image acquired by the acquisition means,
wherein starting the dialogue includes starting a speech recognition process for recognizing the user's voice, and ending the dialogue includes ending the speech recognition process, and
wherein the determination means detects that the user's mouth has been closed continuously for a predetermined time or longer, and determines the timing of the detection as the timing for ending the dialogue with the user.