JP6772881B2

JP6772881B2 - Voice dialogue device

Info

Publication number: JP6772881B2
Application number: JP2017025581A
Authority: JP
Inventors: 佐和樋口; 生聖渡部
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2017-02-15
Filing date: 2017-02-15
Publication date: 2020-10-21
Anticipated expiration: 2037-02-15
Also published as: JP2018132623A

Description

The present invention relates to a voice dialogue device, and more particularly to a voice dialogue device that estimates emotions.

Voice dialogue devices that converse with a user are known. For example, Patent Document 1 discloses a dialogue processing device that responds in accordance with the user's emotion. Based on the content of the user's utterance, this device estimates according to a fixed criterion whether the user's emotion is positive, negative, or neutral, and responds according to the estimation result.

Thus, there are techniques that index a user's emotion based on features such as linguistic information, phonological information, or image information, and estimate the emotion with a fixed criterion regardless of the user.

Japanese Unexamined Patent Application Publication No. 2006-178063

However, even when the utterance content is the same, the emotion carried by the utterance may differ from user to user. In other words, the degree to which emotion is reflected in the features used for emotion estimation may differ depending on the user. For example, some users are rich in verbal or facial expression, while others are not. With the technique described in Patent Document 1, the estimation result tends to be neutral for users whose emotions are not readily reflected in the features, and tends to be positive or negative for users whose emotions are readily reflected in the features. The results therefore vary widely from user to user, and appropriate emotion estimation cannot be performed.

If, on the other hand, the criterion for emotion estimation were set in advance for each user, the criterion could be changed according to the user, and emotion estimation appropriate to each user would become possible. In that case, however, the advance setup becomes cumbersome. For this reason, it has not been easy to estimate emotions appropriately for each user.

The present invention has been made in view of the above circumstances, and an object of the present invention is to provide a voice dialogue device that can easily perform emotion estimation suited to the user.

One aspect of the present invention for achieving the above object is a voice dialogue device comprising: a user identification unit that identifies the user who is the dialogue partner; a feature acquisition unit that acquires, from the user, a feature used for emotion estimation; an emotion estimation unit that calculates an index value of the user's emotion based on the feature; a response generation unit that generates an initial response in accordance with a predetermined response guideline when the amount of dialogue conducted with the user identified by the user identification unit is less than a predetermined reference amount; and an emotion threshold setting unit that sets a threshold specific to the user for estimating emotion, according to the result of comparing a predetermined reference index value with the index value that the emotion estimation unit calculated based on the feature of the user's reaction to the initial response. The emotion estimation unit estimates the user's emotion according to the result of comparing the calculated index value with the threshold set by the emotion threshold setting unit.
According to this voice dialogue device, when the amount of dialogue conducted is less than the predetermined reference amount, communication proceeds according to predetermined initial responses, and a threshold specific to the user is set according to the result of comparing the index value of the user's emotion at that time with the predetermined reference index value. A threshold that matches the degree to which emotion is reflected in the feature can therefore be set easily. Accordingly, this voice dialogue device can easily perform emotion estimation suited to the user.

According to the present invention, it is possible to provide a voice dialogue device that can easily perform emotion estimation suited to the user.

FIG. 1 is a diagram showing the hardware configuration of the voice dialogue device according to Embodiment 1.
FIG. 2 is a block diagram showing the configuration of the control device of the voice dialogue device according to Embodiment 1.
FIG. 3 is a flowchart showing an example of the operation of the voice dialogue device according to Embodiment 1.

Embodiments of the present invention are described below with reference to the drawings. In the drawings, identical elements are given identical reference numerals, and duplicate explanations are omitted where appropriate.

FIG. 1 is a diagram showing the hardware configuration of the voice dialogue device 1 according to Embodiment 1. The voice dialogue device 1 converses with a user by voice. Specifically, the voice dialogue device 1 converses with the user by outputting speech to the user in response to the user's utterances. That is, the voice dialogue device 1 recognizes the speech uttered by the user who is the dialogue partner and outputs speech to that user. The voice dialogue device 1 can be installed in, for example, robots such as life support robots and small robots, cloud systems, and smartphones.

The voice dialogue device 1 includes a microphone 2 that collects surrounding sound, a camera 3 that captures images of the surroundings, a speaker 4 that emits speech, and a control device 10. The control device 10 functions as, for example, a computer. The control device 10 is connected to the microphone 2, the camera 3, and the speaker 4 by wire or wirelessly.

As its main hardware components, the control device 10 has a CPU (Central Processing Unit) 12, a ROM (Read Only Memory) 14, and a RAM (Random Access Memory) 16. The CPU 12 functions as an arithmetic unit that performs control processing, arithmetic processing, and the like. The ROM 14 stores the control programs, arithmetic programs, and the like executed by the CPU 12. The RAM 16 temporarily stores processing data and the like.

The control device 10 analyzes the user's utterance picked up by the microphone 2 and generates a response to the user according to that utterance. The control device 10 then outputs speech corresponding to the generated response (response speech) through the speaker 4.

FIG. 2 is a block diagram showing the configuration of the control device 10 of the voice dialogue device 1 according to Embodiment 1. The control device 10 includes a user identification unit 101, a feature acquisition unit 102, a user database 103, an initial dialogue determination unit 104, an emotion estimation unit 105, a response generation unit 106, an initial dialogue scenario database 107, an emotional expressiveness determination unit 108, an emotion threshold setting unit 109, and a speech synthesis unit 110. The user identification unit 101, the feature acquisition unit 102, the initial dialogue determination unit 104, the emotion estimation unit 105, the response generation unit 106, the emotional expressiveness determination unit 108, the emotion threshold setting unit 109, and the speech synthesis unit 110 shown in FIG. 2 can be realized, for example, by the CPU 12 executing programs stored in the ROM 14. The necessary programs may also be recorded on any non-volatile recording medium and installed as needed. The user database 103 and the initial dialogue scenario database 107 are realized by a storage device such as the ROM 14. The components are not limited to being realized in software as described above, and may be realized in hardware such as circuit elements.

The user identification unit 101 identifies the user who is the dialogue partner. In the present embodiment, the user identification unit 101 identifies the user using image information acquired from the camera 3. More specifically, it identifies the dialogue partner by performing face recognition using the per-user face image information stored in advance in the user database 103 and the image information acquired from the camera 3. In the present embodiment, the user identification unit 101 identifies the user ID associated with the face image information in the user database 103.

The feature acquisition unit 102 acquires, from the user who is the dialogue partner, the feature used for emotion estimation. In the present embodiment, the feature acquisition unit 102 generates text data as the feature by performing speech recognition on the user's speech data acquired by the microphone 2. The feature acquisition unit 102 may therefore also be called a speech recognition unit. In the present embodiment, the feature acquired by the feature acquisition unit 102, namely the text data, is used not only for emotion estimation but also for response generation.

The user database 103 stores face image information and a dialogue history for each user. The user database 103 stores, for example, these data in association with a user ID. The dialogue history includes information indicating the amount of dialogue conducted, such as the number of dialogues or the dialogue time. That is, the amount of dialogue conducted is a per-user value indicating how much dialogue has taken place so far between the voice dialogue device 1 and the user. The user database 103 also stores the reference index value described later.

The initial dialogue determination unit 104 determines whether the amount of dialogue conducted with the user identified by the user identification unit 101 is less than a predetermined reference amount. That is, the initial dialogue determination unit 104 determines whether the dialogue with the identified user is still in its initial stage. Specifically, in the present embodiment, the initial dialogue determination unit 104 refers to the user database 103 and determines whether the number of dialogues associated with the user ID identified by the user identification unit 101 is less than a predetermined number.
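The initial-stage check described above reduces to a single comparison. The following is a minimal sketch under stated assumptions: the names `UserRecord`, `is_initial_dialogue`, and the value `REFERENCE_DIALOGUE_COUNT = 5` are hypothetical, since the source says only that the number of dialogues is compared with "a predetermined number".

```python
from dataclasses import dataclass

# Hypothetical value: the source specifies only "a predetermined number".
REFERENCE_DIALOGUE_COUNT = 5


@dataclass
class UserRecord:
    """Hypothetical stand-in for one entry of the user database 103."""
    user_id: str
    dialogue_count: int = 0


def is_initial_dialogue(record: UserRecord,
                        reference_count: int = REFERENCE_DIALOGUE_COUNT) -> bool:
    """Return True while the dialogue count is below the reference count."""
    return record.dialogue_count < reference_count
```

The comparison is strict, so a user whose count equals the reference count is no longer treated as being in the initial stage.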

The emotion estimation unit 105 calculates an index value of the emotion of the user identified by the user identification unit 101, based on the feature acquired by the feature acquisition unit 102. Specifically, the emotion estimation unit 105 analyzes the text data generated by the feature acquisition unit 102 and calculates an index value indicating the user's emotion according to a predetermined calculation rule. The emotion estimation unit 105 also estimates the emotion of the dialogue partner, who is the user identified by the user identification unit 101, according to the result of comparing the calculated index value with a threshold. Here, the threshold is a value set by the emotion threshold setting unit 109 described later, and is a value specific to each user.

The emotion estimation unit 105 need only estimate emotion by calculating an emotion index value and comparing the index value with a threshold, and any known technique that does so can be applied. For example, the technique described in "Emotion estimation based on an emotion-provoking-factor corpus acquired from the Web" (Ryoko Tokuhisa et al., Proceedings of the 14th Annual Meeting of the Association for Natural Language Processing, March 2008) may be used as one such method.

In the present embodiment, the emotion estimation unit 105 calculates, as the index value, a number in the range of -1.0 to +1.0 from the result of analyzing the text data representing the content of the user's utterance. The index value is negative when the analysis result indicates a negative emotion and positive when the analysis result indicates a positive emotion.

After the communication based on initial responses, described later, has ended, the emotion estimation unit 105 estimates the emotion of the user who is the dialogue partner using the threshold set by the emotion threshold setting unit 109. Specifically, in the present embodiment, the emotion estimation unit 105 uses the calculated index value and the thresholds to decide among positive, negative, and neutral, where neutral is an emotion that is neither positive nor negative. For example, suppose the threshold for estimating a positive emotion is +0.5 and the threshold for estimating a negative emotion is -0.5. In this case, the emotion estimation unit 105 determines that the user's emotion is negative when the index value calculated from the feature acquired by the feature acquisition unit 102 is -0.5 or less, positive when the index value is +0.5 or more, and neutral when the index value is greater than -0.5 and less than +0.5.
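The three-way decision above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function name `classify_emotion` and its parameter names are assumptions, while the default thresholds of +0.5 and -0.5 and the inclusive comparisons follow the example in the text.

```python
def classify_emotion(index_value: float,
                     positive_threshold: float = 0.5,
                     negative_threshold: float = -0.5) -> str:
    """Map an index value in [-1.0, +1.0] to positive, negative, or neutral.

    Both comparisons are inclusive, matching "+0.5 or more" and "-0.5 or less".
    """
    if index_value >= positive_threshold:
        return "positive"
    if index_value <= negative_threshold:
        return "negative"
    return "neutral"
```

Passing per-user thresholds, for example `classify_emotion(0.35, 0.3, -0.3)`, changes the outcome for the same index value, which is the point of setting the threshold per user.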

The response generation unit 106 generates a response to the utterance of the user who is the dialogue partner. The response is typically text data. In particular, when the amount of dialogue conducted with the user identified by the user identification unit 101 is less than the predetermined reference amount, the response generation unit 106 generates an initial response, which is a response that follows a predetermined response guideline. That is, when the initial dialogue determination unit 104 determines that the dialogue is in its initial stage, the response generation unit 106 generates an initial response in accordance with the predetermined response guideline. Specifically, it determines the response to the user's utterance according to a scenario stored in advance in the initial dialogue scenario database 107.

The initial dialogue scenario database 107 stores predetermined scenario data that is played out in order to determine the above threshold for estimating emotion. The scenario data includes, for example, small talk about the weather or greetings, but is not limited to these. When the dialogue is determined to be in its initial stage, the response generation unit 106 refers to the initial dialogue scenario database 107 and generates a response by selecting a response sentence from this database.

When the amount of dialogue conducted with the user identified by the user identification unit 101 is equal to or greater than the predetermined reference amount, the response generation unit 106 generates a normal response. A normal response is, for example, a response with more freedom than an initial response, and can include responses other than the scenarios stored in the initial dialogue scenario database 107. In the present embodiment, the response generation unit 106 generates a response appropriate to the emotion estimated by the emotion estimation unit 105. To do so, the response generation unit 106 may, for example, refer to a response sentence table containing response sentences associated with emotion types and generate the response by selecting an appropriate sentence from that table.
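A response sentence table of the kind mentioned above could be sketched as a simple lookup. Everything here is hypothetical: the table contents, the name `generate_normal_response`, and the fallback to the neutral entry are assumptions for illustration; the source does not specify the table's structure or how a sentence is chosen among candidates.

```python
# Hypothetical response sentence table keyed by emotion type.
RESPONSE_TABLE = {
    "positive": ["That's nice."],
    "negative": ["That sounds tough."],
    "neutral": ["I see."],
}


def generate_normal_response(emotion: str) -> str:
    """Pick a response sentence for the estimated emotion.

    Unknown emotion labels fall back to the neutral entry (an assumption).
    The first candidate is always chosen to keep the sketch deterministic.
    """
    candidates = RESPONSE_TABLE.get(emotion, RESPONSE_TABLE["neutral"])
    return candidates[0]
```

A real table would likely hold several candidates per emotion and select among them using the utterance content as well.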

In this way, the response generation unit 106 generates initial responses during the initial stage of the dialogue with the user identified by the user identification unit 101, and generates normal responses after the initial stage has ended. Hereinafter, the dialogue that takes place while the response generation unit 106 is producing initial responses is called the initial dialogue. During the initial dialogue, the dialogue history in the user database 103 is updated, so the amount of dialogue conducted with the user increases. When the amount of dialogue conducted reaches or exceeds the predetermined reference amount, the initial dialogue ends and dialogue using normal responses begins.

The emotional expressiveness determination unit 108 determines, based on the index values obtained during the initial dialogue, how readily the user identified by the user identification unit 101 expresses emotion. That is, the emotional expressiveness determination unit 108 determines, for each user, the degree to which emotion is reflected in the feature used for emotion estimation. Specifically, it determines the user's emotional expressiveness by comparing a predetermined reference index value with the index value that the emotion estimation unit 105 calculated from the feature of the user's reaction to the initial responses. The reference index value is, for example, the average of the index values obtained when a number of people took part in the initial dialogue. In this case, the emotional expressiveness determination unit 108 determines the expressiveness of the user under evaluation by comparing that user's index value with the average index value across users. For example, suppose that the average of the index values of the user under evaluation during the initial dialogue (hereinafter, the target average) is 0.75, and that the previously calculated average, over multiple users, of each user's average index value during the initial dialogue (hereinafter, the reference average) is 0.52. The emotional expressiveness determination unit 108, for example, subtracts the absolute value of the reference average from the absolute value of the target average and classifies emotional expressiveness into three levels according to the resulting judgment value. For example, it determines that the user expresses emotion readily when the judgment value is greater than +0.2, that the user does not express emotion readily when the judgment value is less than -0.2, and that the user is neither when the judgment value is -0.2 or more and +0.2 or less. In the above example, the judgment value is 0.75 - 0.52 = 0.23 > 0.2, so the user under evaluation is determined to express emotion readily.
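The three-level judgment can be sketched in a few lines. This is a sketch under assumptions: the function name `judge_expressiveness` and the labels "expressive", "reserved", and "average" are hypothetical; the subtraction of absolute values, the margin of 0.2, and the sample values 0.75 and 0.52 come from the example in the text.

```python
from statistics import mean


def judge_expressiveness(initial_index_values, reference_average, margin=0.2):
    """Classify how readily a user expresses emotion.

    judgment value = |target average| - |reference average|
    > +margin  -> "expressive" (emotion readily expressed)
    < -margin  -> "reserved"   (emotion not readily expressed)
    otherwise  -> "average"    (neither)
    """
    target_average = mean(initial_index_values)
    judgment_value = abs(target_average) - abs(reference_average)
    if judgment_value > margin:
        return "expressive"
    if judgment_value < -margin:
        return "reserved"
    return "average"


# Worked example from the text: 0.75 - 0.52 = 0.23 > 0.2.
print(judge_expressiveness([0.75], 0.52))
```

Because absolute values are compared, a user whose index values are strongly negative on average is also judged expressive.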

The emotion threshold setting unit 109 sets a user-specific threshold for estimating emotion according to the determination result of the emotional expressiveness determination unit 108. When the initial dialogue ends, the emotion threshold setting unit 109 sets the threshold that the emotion estimation unit 105 will use to estimate the user's emotion, according to that determination result. For example, when the emotional expressiveness determination unit 108 determines that the user does not express emotion readily, the emotion threshold setting unit 109 sets a threshold looser than a predetermined base threshold so that positive or negative emotions are more likely to be estimated. When the emotional expressiveness determination unit 108 determines that the user expresses emotion readily, the emotion threshold setting unit 109 sets a threshold stricter than the base threshold so that positive or negative emotions are less likely to be estimated. Thus, for example, when the user is determined not to express emotion readily, the emotion threshold setting unit 109 lowers the threshold for estimating a positive emotion by 0.2 from its base value of +0.5 to +0.3, and raises the threshold for estimating a negative emotion by 0.2 from its base value of -0.5 to -0.3.
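The threshold adjustment can be sketched as below. The base thresholds of +0.5 and -0.5 and the loosening by 0.2 follow the text. The function name `set_thresholds`, the expressiveness labels, and in particular the symmetric tightening by the same 0.2 step are assumptions, because the text gives numbers only for the loosening case.

```python
BASE_POSITIVE = 0.5   # base threshold for estimating a positive emotion
BASE_NEGATIVE = -0.5  # base threshold for estimating a negative emotion


def set_thresholds(expressiveness: str, step: float = 0.2):
    """Return (positive_threshold, negative_threshold) for one user.

    "reserved"   -> looser thresholds, closer to zero (+0.3 / -0.3 in the text)
    "expressive" -> stricter thresholds, farther from zero (step size assumed)
    anything else -> base thresholds unchanged
    """
    if expressiveness == "reserved":
        return BASE_POSITIVE - step, BASE_NEGATIVE + step
    if expressiveness == "expressive":
        return BASE_POSITIVE + step, BASE_NEGATIVE - step
    return BASE_POSITIVE, BASE_NEGATIVE
```

With looser thresholds, a reserved user's index value of 0.35 is estimated as positive where the base thresholds would have given neutral.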

The speech synthesis unit 110 converts the response generated by the response generation unit 106 into speech data. That is, the speech synthesis unit 110 converts the text data of the response sentence generated by the response generation unit 106 into speech data. Generating speech data from text data can be realized with any of various known speech synthesis techniques. After that, typically a D/A converter (not shown) converts the speech data into an analog audio signal, and the speaker 4 outputs the analog audio signal as sound.

Next, the operation of the voice dialogue device 1 is described. FIG. 3 is a flowchart showing an example of the operation of the voice dialogue device 1. An example of the operation is described below with reference to FIG. 3.

In step 100 (S100), the user identification unit 101 identifies the user who is the dialogue partner.
Next, in step 101 (S101), the feature acquisition unit 102 performs speech recognition on the user's speech data acquired by the microphone 2 and generates text data. For example, in step 101, the text data "It's sunny today, isn't it?" is generated from the user's utterance.

Next, in step 102 (S102), the initial dialogue determination unit 104 determines whether the dialogue with the user is an initial dialogue. That is, the initial dialogue determination unit 104 determines whether the amount of dialogue conducted with the user identified in step 100 is less than the predetermined reference amount. If it is less than the reference amount, the initial dialogue determination unit 104 decides to conduct the initial dialogue, and the process proceeds to step 103. If it is equal to or greater than the reference amount, the initial dialogue determination unit 104 decides to conduct dialogue using normal responses, and the process proceeds to step 106.

In step 103 (S103), the emotion estimation unit 105 calculates an index value of the emotion of the user identified in step 100.
Next, in step 104 (S104), the response generation unit 106 generates an initial response. That is, the response generation unit 106 selects a response sentence from the initial dialogue scenario database 107. For example, in step 104, "Hello. How do you feel when it's sunny?" is generated as the initial response.

After step 104, in step 105 (S105), the speech synthesis unit 110 converts the text data of the initial response into speech data, and the initial response is output as speech from the speaker 4. After step 105, the process returns to step 101. That is, speech recognition is performed on the utterance of the user who has received the initial response from the voice dialogue device 1. The transition from step 102 to step 103 is repeated until the amount of dialogue conducted with this user reaches the reference amount. In other words, initial responses are repeated.

When the amount of dialogue with the user reaches the reference amount, the process proceeds from step 102 to step 106, as described above.
In step 106 (S106), it is determined whether a threshold has been set for the user identified in step 100. If no threshold has been set, the process proceeds to step 107. If a threshold has been set, the process skips steps 107 and 108, which exist for threshold setting, and proceeds to step 109. In other words, the threshold-setting steps are executed immediately after the initial dialogue ends and are skipped thereafter.

In step 107 (S107), the emotional susceptibility determination unit 108 determines the emotional susceptibility of the user identified in step 100, that is, how readily the user's emotions are expressed, based on the index values calculated in step 103 during the initial dialogue.
Next, in step 108 (S108), the emotion threshold setting unit 109 sets a threshold specific to the user identified in step 100 according to the determination result of step 107. After step 108, the process proceeds to step 109.
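Steps 107 and 108 can be sketched as below. The sketch assumes a signed index value whose magnitude grows with the strength of the expressed emotion, a three-stage determination, and a lower threshold for users whose emotions show weakly. The reference values, the threshold values, the stage names, and the function names are hypothetical and chosen only for illustration.

```python
from statistics import mean
from typing import Sequence

# Hypothetical reference index values separating the three stages.
LOW_REFERENCE = 0.3
HIGH_REFERENCE = 0.6

# Hypothetical user-specific thresholds: users whose emotions show weakly
# get a lower threshold so that small index values still count.
THRESHOLD_BY_LEVEL = {"low": 0.2, "medium": 0.4, "high": 0.6}


def judge_susceptibility(initial_index_values: Sequence[float]) -> str:
    """S107: classify how readily emotion appears, from initial-dialogue index values."""
    if not initial_index_values:
        raise ValueError("at least one index value from the initial dialogue is required")
    strength = mean(abs(v) for v in initial_index_values)
    if strength < LOW_REFERENCE:
        return "low"
    if strength < HIGH_REFERENCE:
        return "medium"
    return "high"


def set_user_threshold(initial_index_values: Sequence[float]) -> float:
    """S108: map the determination result to a threshold specific to the user."""
    return THRESHOLD_BY_LEVEL[judge_susceptibility(initial_index_values)]
```

Averaging the absolute index values is one possible way to compare them with a reference index value; the description does not prescribe this particular rule.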

In step 109 (S109), the emotion estimation unit 105 calculates an index value of the user's emotion based on the text data obtained in step 101, and estimates the user's emotion according to the result of comparing the calculated index value with the threshold set in step 108. For example, for the text data "When it's sunny, the laundry dries well and it feels good" obtained in step 101, step 109 applies a threshold that takes the user's emotional susceptibility into account, and the emotion is estimated to be positive.
Next, in step 110 (S110), the response generation unit 106 generates a normal response corresponding to the emotion estimated in step 109. For example, if the emotion estimated in step 109 is positive, step 110 generates "That's nice." as the response to a positive utterance.
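The comparison in step 109 and the response selection in step 110 can be sketched as follows. The sketch assumes a signed index value compared symmetrically against the user-specific threshold. The function names and the negative and neutral response sentences are hypothetical; only "That's nice." (「それはいいね。」) comes from the example in the description.

```python
def estimate_emotion(index_value: float, threshold: float) -> str:
    """S109: compare the index value with the user-specific threshold."""
    if index_value >= threshold:
        return "positive"
    if index_value <= -threshold:
        return "negative"
    return "neutral"


# Only the positive response is taken from the description; the others are placeholders.
NORMAL_RESPONSES = {
    "positive": "That's nice.",
    "negative": "That sounds hard.",  # hypothetical
    "neutral": "I see.",              # hypothetical
}


def generate_normal_response(emotion: str) -> str:
    """S110: choose a normal response matching the estimated emotion."""
    return NORMAL_RESPONSES[emotion]
```

Under these assumptions, the same index value of 0.3 is estimated as positive for a user whose threshold is 0.2 and as neutral for a user whose threshold is 0.4, which is the per-user difference the threshold is meant to absorb.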

After step 110, in step 105, the voice synthesis unit 110 converts the text data of the normal response into voice data. As a result, the normal response is output as speech from the speaker 4. Thereafter, the dialogue continues with normal responses.
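The overall flow from step 102 to step 110 can be summarised in one routine. This is a simplified sketch, not the implementation of the embodiment: it takes an already computed index value as input (standing in for steps 101, 103, and 109), omits voice recognition and voice synthesis, and uses hypothetical constants, response sentences, and a two-stage threshold rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional

REFERENCE_DIALOGUE_AMOUNT = 2  # hypothetical reference amount
INITIAL_RESPONSE = "Hello, how do you feel when the weather is sunny?"
NORMAL_RESPONSES = {           # only the positive sentence is from the description
    "positive": "That's nice.",
    "negative": "That sounds hard.",
    "neutral": "I see.",
}


@dataclass
class UserState:
    dialogue_amount: int = 0
    initial_index_values: List[float] = field(default_factory=list)
    threshold: Optional[float] = None  # None means "not yet set" (checked in S106)


def set_threshold(values: List[float]) -> float:
    """S107-S108 stand-in: hypothetical two-stage rule on the mean magnitude."""
    if not values:
        return 0.4  # assumed default when no initial dialogue took place
    strength = sum(abs(v) for v in values) / len(values)
    return 0.2 if strength < 0.3 else 0.4


def handle_utterance(state: UserState, index_value: float) -> str:
    """Run S102 to S110 once for an utterance with the given emotion index value."""
    if state.dialogue_amount < REFERENCE_DIALOGUE_AMOUNT:    # S102: initial dialogue?
        state.initial_index_values.append(index_value)       # S103: keep index value
        state.dialogue_amount += 1
        return INITIAL_RESPONSE                               # S104
    if state.threshold is None:                               # S106: threshold set?
        state.threshold = set_threshold(state.initial_index_values)  # S107-S108
    if index_value >= state.threshold:                        # S109
        emotion = "positive"
    elif index_value <= -state.threshold:
        emotion = "negative"
    else:
        emotion = "neutral"
    state.dialogue_amount += 1
    return NORMAL_RESPONSES[emotion]                          # S110
```

In this sketch the threshold is computed exactly once, on the first utterance after the initial dialogue ends, and every later call skips straight to the comparison.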

The voice dialogue device 1 according to the embodiment has been described above. As described, the voice dialogue device 1 determines, for each user, the degree to which emotion is reflected in the feature amount used for emotion estimation, and estimates emotion using a threshold set according to that determination. Therefore, even when the degree to which emotion is reflected in the feature amount differs from user to user, the user's emotion can be captured accurately. Because a response can be generated after the emotion has been captured accurately, the voice dialogue device 1 can communicate with the user more smoothly. In particular, the voice dialogue device 1 first gives initial responses, sets the threshold based on the index values obtained at that time, and then shifts to normal responses. This makes it easy to set a threshold for each user.

Note that the present invention is not limited to the above embodiment and can be modified as appropriate without departing from its spirit. For example, although the above embodiment uses the content of the user's utterance as the feature amount, the calculation of the index value and the estimation of emotion may instead be based on other feature amounts, such as the user's facial expression or prosody. As one such example, the feature amount acquisition unit 102 may perform image processing on an image acquired by the camera 3 to obtain the user's facial expression as the feature amount.
Further, although the above embodiment determines the ease of expressing emotions in three stages, this is only an example, and the determination may use two stages or four or more stages. Likewise, the values given for the index value and the threshold are examples, and the invention is not limited to them.

1 Voice dialogue device
101 User identification unit
102 Feature amount acquisition unit
103 User database
104 Initial dialogue determination unit
105 Emotion estimation unit
106 Response generation unit
107 Initial dialogue scenario database
108 Emotional susceptibility determination unit
109 Emotion threshold setting unit
110 Voice synthesis unit

Claims

A voice dialogue device comprising:
a user identification unit that identifies a user who is a dialogue partner;
a feature amount acquisition unit that acquires, from the user, a feature amount used for emotion estimation;
an emotion estimation unit that calculates an index value of the user's emotion based on the feature amount;
a response generation unit that generates an initial response in accordance with a predetermined response guideline when the amount of dialogue with the user identified by the user identification unit is less than a predetermined reference amount; and
an emotion threshold setting unit that sets a threshold specific to the user for estimating emotion, according to a result of comparing the index value, calculated by the emotion estimation unit based on the feature amount of the user's reaction to the initial response, with a predetermined reference index value,
wherein the emotion estimation unit estimates the user's emotion according to a result of comparing the calculated index value with the threshold set by the emotion threshold setting unit.