JP6732703B2

JP6732703B2 - Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program

Info

Publication number: JP6732703B2
Application number: JP2017141791A
Authority: JP
Inventors: 厚志安藤; 歩相名神山; 哲小橋川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-07-21
Filing date: 2017-07-21
Publication date: 2020-07-29
Anticipated expiration: 2037-07-21
Also published as: JP2019020684A

Description

The present invention relates to a technique for recognizing a speaker's emotion using context information contained in a dialogue.

Recognizing a speaker's emotion is important in dialogue. For example, performing emotion recognition during counseling makes it possible to visualize a patient's anxiety and sadness, which can be expected to deepen the counselor's understanding and improve the quality of guidance. Likewise, recognizing human emotion in human-machine dialogue makes it possible to build a friendlier dialogue system, one that rejoices with the user when the user is happy and offers encouragement when the user is sad. Hereinafter, a conversation between two speakers is called a "dialogue". Among the speakers in a dialogue, the speaker who made the utterance subject to emotion recognition is called the "target speaker", and the speaker other than the target speaker is called the "partner speaker". For example, in emotion recognition for counseling, the patient is the target speaker and the counselor is the partner speaker.

Non-Patent Document 1 proposes an emotion recognition technique for dialogue. Emotion recognition techniques generally recognize the emotion of each utterance independently (for example, Non-Patent Document 2). The technique described in Non-Patent Document 1, by contrast, focuses on the context information contained in a dialogue: it recognizes the target speaker's current emotion from the target speaker's own past and future emotions in addition to the features of the current utterance, and thereby improves the accuracy of emotion recognition in dialogue. This improvement is thought to arise because emotions have continuity and are related to one another.

Non-Patent Document 1: Martin Wollmer, Angeliki Metallinou, Florian Eyben, Bjorn Schuller, Shrikanth Narayanan, “Context-Sensitive Multimodal Emotion Recognition from Speech and Facial Expression using Bidirectional LSTM Modeling,” in Interspeech 2010, 2010.
Non-Patent Document 2: Che-Wei Huang, Shrikanth Narayanan, “Attention Assisted Discovery of Sub-Utterance Structure in Speech Emotion Recognition,” in Interspeech 2016, 2016.

The context information contained in a dialogue includes much more than the target speaker's own emotion information used by the technique of Non-Patent Document 1, for example the partner speaker's emotion information. Such information is also considered useful for recognizing the target speaker's emotion, yet the technique of Non-Patent Document 1 uses only the target speaker's own emotion information out of the context information. There may therefore be room to improve the accuracy of emotion recognition in dialogue.

In view of the above, an object of the present invention is to improve the accuracy of recognizing the target speaker's emotion by using not only the target speaker's own emotion information but also the context information contained in the dialogue.

To solve the above problem, an emotion interaction model learning device according to a first aspect of the present invention includes: a learning data storage unit that stores learning data consisting of dialogue speech, which records a dialogue made up of a plurality of utterances by a target speaker and a plurality of utterances by a partner speaker, and correct emotion values for the utterances contained in that dialogue; a per-utterance emotion recognition unit that recognizes a per-utterance emotion for each utterance extracted from the dialogue speech and generates a per-utterance emotion sequence of the target speaker and a per-utterance emotion sequence of the partner speaker; and a model learning unit that uses the correct emotion values, the per-utterance emotion sequence of the target speaker, and the per-utterance emotion sequence of the partner speaker to learn an emotion interaction model that takes as input the per-utterance emotion of a target utterance, which is an utterance by the target speaker, and the per-utterance emotion of the immediately preceding utterance, which the partner speaker made immediately before the target utterance, and re-estimates the emotion of the target utterance.

To solve the above problem, an emotion recognition device according to a second aspect of the present invention includes: a model storage unit that stores an emotion interaction model learned by the emotion interaction model learning device of the first aspect; a per-utterance emotion recognition unit that recognizes a per-utterance emotion for each utterance contained in a dialogue made up of a plurality of utterances by a target speaker and a plurality of utterances by a partner speaker, and generates a per-utterance emotion sequence of the target speaker and a per-utterance emotion sequence of the partner speaker; and an emotion re-estimation unit that inputs to the emotion interaction model the per-utterance emotion of a target utterance, which is an utterance by the target speaker, and the per-utterance emotion of the immediately preceding utterance, which the partner speaker made immediately before the target utterance, and re-estimates the emotion of the target utterance.

According to the present invention, using not only the target speaker's own emotion information but also the context information contained in the dialogue improves the accuracy of recognizing the target speaker's emotion.

FIG. 1 is a diagram for explaining an example in which the preceding and following emotions of the target speaker or the partner speaker influence the target speaker's emotion.
FIG. 2 is a diagram for explaining the emotion interaction model.
FIG. 3 is a diagram illustrating the functional configuration of the emotion interaction model learning device.
FIG. 4 is a diagram illustrating the processing procedure of the emotion interaction model learning method.
FIG. 5 is a diagram for explaining emotion recognition using the emotion interaction model.
FIG. 6 is a diagram illustrating the functional configuration of the emotion recognition device.
FIG. 7 is a diagram illustrating the processing procedure of the emotion recognition method.

The key point of the present invention is that it recognizes the target speaker's emotion using the partner speaker's emotion information, which is one kind of context information contained in a dialogue. Of the context information contained in a dialogue, the partner speaker's emotion information is effective for recognizing the target speaker's emotion. Emotion recognition is the process of classifying an utterance into one of a plurality of emotion classes. The following description uses five emotion classes: anger, joy, sadness, neutral, and other. The emotion classes are not limited to these, however, and can be set arbitrarily.

A concrete example of emotion recognition using context information contained in a dialogue is described with reference to FIG. 1. Suppose that, for some utterance of the target speaker, the target speaker's immediately preceding emotion was "neutral"; in that case it is difficult to estimate the emotion of the utterance. If, however, the partner speaker's emotion immediately before the utterance was "joy", one can imagine that the target speaker's emotion is also more likely to be "joy". This is because, through the human capacity for empathy, the target speaker is influenced by the partner speaker's emotion.

Table 1 shows the result of investigating the relationship between the emotions of the target speaker and the partner speaker using a spoken dialogue database. Each value in the table is a proportion. For example, when the emotion of the target speaker's current utterance is "anger", the emotion of the partner speaker's immediately preceding utterance was "anger" in a proportion of 0.38 (38%), "joy" in 0.00 (0%), and "sadness" in 0.02 (2%).

The values on the diagonal running from the upper left to the lower right of Table 1 are the proportions in which the emotion of the target speaker's current utterance matched the emotion of the partner speaker's immediately preceding utterance, that is, the rate at which empathy occurred. According to Table 1, in 45% (*1) of the cases in which the target speaker's current utterance expressed "joy" and in 42% (*2) of the cases in which it expressed "sadness", the partner speaker's immediately preceding utterance expressed the same emotion. This indicates that the target speaker's emotion is influenced by the partner speaker's emotion through empathy, and hence that the partner speaker's emotion information is effective for recognizing the target speaker's emotion in dialogue.
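As an illustration only, proportions of the kind reported in Table 1 could be tabulated along the following lines. This Python sketch assumes that each target utterance has already been paired with the emotion label of the partner speaker's immediately preceding utterance; the function name `cooccurrence_ratios` and the toy label pairs are hypothetical and are not taken from the database behind Table 1.

```python
from collections import Counter, defaultdict

EMOTIONS = ["anger", "joy", "sadness", "neutral", "other"]

def cooccurrence_ratios(pairs):
    """pairs: iterable of (target_emotion, preceding_partner_emotion).

    Returns ratios[target][partner]: among target utterances labelled
    `target`, the share whose preceding partner utterance was `partner`.
    """
    counts = defaultdict(Counter)
    for target, partner in pairs:
        counts[target][partner] += 1
    ratios = {}
    for target, row in counts.items():
        total = sum(row.values())
        ratios[target] = {e: row[e] / total for e in EMOTIONS}
    return ratios

# Hypothetical toy labels, for illustration only.
toy_pairs = [
    ("joy", "joy"), ("joy", "neutral"), ("joy", "joy"), ("joy", "neutral"),
    ("sadness", "sadness"), ("sadness", "neutral"),
]
ratios = cooccurrence_ratios(toy_pairs)
# Diagonal entries correspond to the "empathy" rate discussed in the text.
empathy_rate = {e: ratios[e][e] for e in ratios}
```

Each row of `ratios` sums to 1, so the diagonal entry of a row can be read directly as the rate at which both speakers expressed the same emotion.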

In the example of FIG. 1, the target speaker and the partner speaker speak alternately, but either speaker may also make several utterances in succession. For example, if the partner speaker makes several consecutive utterances before the target speaker's utterance that is subject to emotion recognition, "the partner speaker's immediately preceding utterance" is the last of those consecutive utterances. Conversely, if the target speaker makes several consecutive utterances after an utterance by the partner speaker, the same partner utterance is used as "the partner speaker's immediately preceding utterance" for each of those consecutive target utterances. In the following description, the target speaker's utterance that is subject to emotion recognition is called the "target utterance", and the utterance that the partner speaker made immediately before the target utterance is called the "immediately preceding utterance".
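The pairing rule for consecutive utterances can be sketched as follows. This is a minimal Python illustration, not the patented implementation; the function name `pair_with_preceding`, the utterance identifiers, and the use of `None` when no partner utterance has occurred yet are assumptions made for the example.

```python
def pair_with_preceding(dialogue, target):
    """dialogue: chronological list of (speaker, utterance_id).

    Returns a list of (target_utterance_id, preceding_partner_utterance_id),
    where the second element is None if the partner has not spoken yet
    (this handling of the dialogue start is an assumption).
    """
    pairs = []
    last_partner = None
    for speaker, utt in dialogue:
        if speaker == target:
            # Consecutive target utterances share the same partner utterance.
            pairs.append((utt, last_partner))
        else:
            # Only the last of consecutive partner utterances is kept.
            last_partner = utt
    return pairs

dialogue = [("B", "b1"), ("B", "b2"), ("A", "a1"),
            ("A", "a2"), ("B", "b3"), ("A", "a3")]
pairs = pair_with_preceding(dialogue, target="A")
# pairs == [("a1", "b2"), ("a2", "b2"), ("a3", "b3")]
```

Here `b1` is never used because `b2` follows it before speaker A speaks, and `a1` and `a2` are both paired with `b2`.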

In the present invention, the partner speaker's emotion is used to re-estimate the target speaker's emotion. Specifically, the emotion recognized from each utterance (hereinafter called the "per-utterance emotion") is obtained for every utterance in the dialogue, and the target speaker's emotion is re-estimated from the per-utterance emotions of the target speaker and of the partner speaker. Hereinafter, the re-estimation model used in the present invention is called the "emotion interaction model".

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

[Emotion interaction model learning device]
The emotion interaction model learning device of the embodiment learns the emotion interaction model, which is used to estimate the target speaker's emotion, as follows.

1. Prepare learning data consisting of dialogue speech, which records a dialogue containing a plurality of utterances by the target speaker and a plurality of utterances by the partner speaker, and emotion labels, which represent the correct value of the target speaker's emotion assigned to each utterance of the target speaker. The emotion labels are assigned manually in advance.

2. Recognize the per-utterance emotions of the target speaker and the partner speaker from the dialogue speech of the learning data. Per-utterance emotions are recognized using, for example, the technique described in Non-Patent Document 2.

3. Learn the emotion interaction model using a sequence of triples, each consisting of an emotion label from the learning data, an estimated per-utterance emotion of the target speaker, and an estimated per-utterance emotion of the partner speaker.

FIG. 2 shows an example of the structure of the emotion interaction model. As shown in FIG. 2, the emotion interaction model provides one utterance emotion estimator for each target utterance. The utterance emotion estimator takes as input the estimated per-utterance emotion of the target utterance and the estimated per-utterance emotion of the immediately preceding utterance, re-estimates the emotion of the target utterance using information on the target speaker's past and/or future emotions, and outputs the resulting estimate. Concretely, the utterance emotion estimator is, for example, a recurrent neural network (RNN). Using a recurrent neural network makes it possible to use information on the target speaker's past and/or future emotions, as in the technique of Non-Patent Document 1, in addition to the estimated per-utterance emotions of the target speaker and the partner speaker. In other words, emotion recognition based on context information from both the target speaker and the partner speaker becomes possible.

As shown in FIG. 3, the emotion interaction model learning device 1 of the embodiment includes a learning data storage unit 10, an utterance detection unit 11, a per-utterance emotion recognition unit 12, a model learning unit 13, a per-utterance emotion recognition model storage unit 19, and an emotion interaction model storage unit 20. The emotion interaction model learning device 1 learns an emotion interaction model using the learning data stored in the learning data storage unit 10 and stores the learned emotion interaction model in the emotion interaction model storage unit 20. The emotion interaction model learning method of the embodiment is realized by the emotion interaction model learning device 1 performing the processing of each step shown in FIG. 4.

The emotion interaction model learning device 1 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: random access memory), and the like. The emotion interaction model learning device 1 executes each process under the control of, for example, the central processing unit. Data input to the emotion interaction model learning device 1 and data obtained in each process are stored in, for example, the main storage device, and the data stored there are read out to the central processing unit as needed and used for other processing. At least part of each processing unit of the emotion interaction model learning device 1 may be implemented in hardware such as an integrated circuit. Each storage unit of the emotion interaction model learning device 1 can be implemented by, for example, a main storage device such as RAM, an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or middleware such as a relational database or a key-value store. The storage units of the emotion interaction model learning device 1 need only be logically separate and may reside in a single physical storage device.

The learning data storage unit 10 stores the learning data used to learn the emotion interaction model. The learning data consist of dialogue speech, which records a dialogue containing a plurality of utterances by the target speaker and a plurality of utterances by the partner speaker, and emotion labels, which represent the correct emotion value assigned to each utterance contained in that dialogue speech. The emotion labels may be assigned manually in advance.

The per-utterance emotion recognition model storage unit 19 stores the per-utterance emotion recognition model used by the per-utterance emotion recognition unit 12. The per-utterance emotion recognition model is, for example, the model used in the per-utterance emotion recognition method described in Non-Patent Document 2, and is learned in advance by, for example, the method described there. For this pre-learning, the dialogue speech stored in the learning data storage unit 10 may be used as learning data, or separate learning data (a set of pairs, each consisting of an utterance and the emotion label corresponding to it) may be used.

The emotion interaction model learning method executed by the emotion interaction model learning device 1 of the embodiment is described below with reference to FIG. 4.

In step S11, the utterance detection unit 11 detects utterance sections in the dialogue speech stored in the learning data storage unit 10 and obtains a sequence of the target speaker's utterances and a sequence of the partner speaker's utterances. Utterance sections can be detected by, for example, a method based on thresholding the power. Other utterance section detection methods, such as one based on the likelihood ratio of speech/non-speech models, may also be used. Hereinafter, each speaker's utterances arranged in the chronological order of the dialogue are called an "utterance sequence". The utterance detection unit 11 outputs the obtained utterance sequences of the target speaker and the partner speaker to the per-utterance emotion recognition unit 12.
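Utterance section detection by power thresholding could look roughly like the following Python sketch. The frame length, the threshold value, and the function name `detect_utterances` are hypothetical, and the sketch assumes a single-speaker sample sequence (for example, one recording channel per speaker), which the description above does not specify.

```python
def detect_utterances(samples, frame_len, threshold, min_frames=1):
    """Return (start_sample, end_sample) pairs for runs of consecutive
    frames whose mean power is at or above `threshold`."""
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        power = sum(x * x for x in frame) / frame_len
        active = power >= threshold
        if active and start is None:
            start = i  # an utterance section begins
        elif not active and start is not None:
            if i - start >= min_frames:
                segments.append((start * frame_len, i * frame_len))
            start = None  # the section has ended
    if start is not None and n_frames - start >= min_frames:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments

# Toy signal: silence, loud speech, silence, quieter speech.
signal = [0.0] * 4 + [1.0] * 8 + [0.0] * 4 + [0.5] * 4
segments = detect_utterances(signal, frame_len=4, threshold=0.1)
# segments == [(4, 12), (16, 20)]
```

Applying the same function to each speaker's samples and merging the detected sections by start time would give the chronological utterance sequences described above.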

In step S12, the per-utterance emotion recognition unit 12 receives the utterance sequences of the target speaker and the partner speaker from the utterance detection unit 11 and, using the per-utterance emotion recognition model stored in the per-utterance emotion recognition model storage unit 19, recognizes the per-utterance emotion of each utterance contained in each utterance sequence. Here, the method described in Non-Patent Document 2 is used for per-utterance emotion recognition. Alternatively, another per-utterance emotion recognition method may be used, for example classification based on thresholds on the utterance-level averages of fundamental frequency and power. Recognizing the per-utterance emotion of each utterance yields an estimated per-utterance emotion for that utterance, which is a posterior probability vector listing the posterior probability of each emotion class. Hereinafter, the estimated per-utterance emotions arranged in the chronological order of the dialogue are called a "per-utterance emotion sequence". The per-utterance emotion recognition unit 12 outputs the per-utterance emotion sequence of the target speaker and the per-utterance emotion sequence of the partner speaker to the model learning unit 13.
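To make the data format concrete, the following Python sketch shows one way a per-utterance emotion sequence of posterior probability vectors might be represented. It assumes that some per-utterance classifier outputs one raw score per emotion class and that a softmax turns those scores into posteriors; the softmax step, the score values, and the function names are assumptions for illustration, not the method of Non-Patent Document 2.

```python
import math

EMOTIONS = ["anger", "joy", "sadness", "neutral", "other"]

def softmax(scores):
    """Convert raw class scores into a posterior probability vector."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def per_utterance_emotion_sequence(score_sequence):
    """Map chronologically ordered score vectors to posterior vectors."""
    return [softmax(scores) for scores in score_sequence]

def top_class(posterior):
    """Emotion class with the highest posterior probability."""
    return EMOTIONS[max(range(len(posterior)), key=posterior.__getitem__)]

# Hypothetical classifier scores for two utterances of one speaker.
scores = [[0.1, 2.0, 0.0, 1.0, -1.0],
          [0.0, 0.0, 0.0, 3.0, 0.0]]
sequence = per_utterance_emotion_sequence(scores)
labels = [top_class(p) for p in sequence]
# labels == ["joy", "neutral"]
```

Passing the full posterior vectors downstream, rather than only the top class, keeps the uncertainty of the first-stage recognizer available to the re-estimation stage.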

In step S13, the model learning unit 13 receives the per-utterance emotion sequences of the target speaker and the partner speaker from the per-utterance emotion recognition unit 12 and reads the emotion labels corresponding to the utterances of the dialogue speech stored in the learning data storage unit 10. It then learns an emotion interaction model that takes as input the estimated per-utterance emotion of the target utterance and the estimated per-utterance emotion of the immediately preceding utterance, re-estimates the emotion of the target utterance using information on the target speaker's past and/or future emotions, and outputs the estimated emotion of the target utterance. The model learning unit 13 stores the learned emotion interaction model in the emotion interaction model storage unit 20.

As shown in FIG. 2, the emotion interaction model uses a recurrent neural network (RNN). Here, a long short-term memory recurrent neural network (LSTM-RNN) is used as the RNN, for example. A recurrent neural network other than an LSTM-RNN may also be used, for example a gated recurrent unit (GRU). An LSTM-RNN is characterized by being built from an input gate and an output gate, or from an input gate, an output gate, and a forget gate, whereas a GRU is characterized by being built from a reset gate and an update gate. The LSTM-RNN may be bidirectional or unidirectional. A unidirectional LSTM-RNN uses only past emotion information, so emotion recognition can be performed even while the dialogue is still in progress. A bidirectional LSTM-RNN can use future emotion information in addition to past emotion information, which improves recognition accuracy; however, the sequence of emotion estimates obtained from all utterances from the start to the end of the dialogue must be input at once, so it is suited to recognizing the emotions of an entire dialogue after the dialogue has ended. The emotion interaction model is learned using, for example, backpropagation through time (BPTT), an existing learning method for LSTM-RNNs.
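The input/output structure of a unidirectional emotion interaction model can be sketched in plain Python as follows. To stay short, the sketch substitutes a simple tanh recurrent cell for the LSTM-RNN named above, uses randomly initialised (untrained) weights, and omits BPTT training entirely; the names `init_params` and `reestimate` and the hidden size are hypothetical.

```python
import math
import random

N_CLASSES = 5  # anger, joy, sadness, neutral, other

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init_params(hidden, seed=0):
    """Random, untrained weights (a stand-in for a learned model)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    return {"W_in": mat(hidden, 2 * N_CLASSES),
            "W_rec": mat(hidden, hidden),
            "W_out": mat(N_CLASSES, hidden)}

def reestimate(params, target_seq, partner_seq):
    """target_seq[t]: posterior vector of target utterance t.
    partner_seq[t]: posterior vector of its immediately preceding utterance.
    Returns one re-estimated posterior vector per target utterance."""
    h = [0.0] * len(params["W_rec"])
    outputs = []
    for tgt, prt in zip(target_seq, partner_seq):
        x = list(tgt) + list(prt)  # concatenate the two posterior vectors
        pre = [a + b for a, b in zip(matvec(params["W_in"], x),
                                     matvec(params["W_rec"], h))]
        h = [math.tanh(v) for v in pre]  # state carries past context only
        outputs.append(softmax(matvec(params["W_out"], h)))
    return outputs

params = init_params(hidden=8)
target = [[0.1, 0.2, 0.1, 0.5, 0.1], [0.1, 0.3, 0.1, 0.4, 0.1]]
partner = [[0.0, 0.8, 0.0, 0.2, 0.0], [0.0, 0.7, 0.0, 0.3, 0.0]]
out = reestimate(params, target, partner)
```

Because the hidden state depends only on earlier inputs, the output for an utterance is available as soon as that utterance has been processed, which corresponds to the mid-dialogue use of the unidirectional variant; a bidirectional variant would additionally run a second pass from the end of the sequence.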

[Emotion recognition device]
The emotion recognition device of the embodiment recognizes the emotion of the target speaker's utterances using the emotion interaction model as follows.

1. Recognize the per-utterance emotions of the target speaker and the partner speaker from the dialogue speech to be recognized. As when the emotion interaction model was learned, per-utterance emotions are recognized using, for example, the technique described in Non-Patent Document 2.

2. Input the estimated per-utterance emotions of the target speaker and the partner speaker to the emotion interaction model and re-estimate the target speaker's emotion.

FIG. 5 shows an example of the operation of re-estimating the target speaker's emotion. In FIG. 5, both speaker A and speaker B, who participate in the dialogue, are treated as target speakers. In this case, speaker B is regarded as the partner speaker when speaker A is the target speaker, and speaker A is regarded as the partner speaker when speaker B is the target speaker, so that the emotions of both speakers can be recognized. In the example of FIG. 5, the per-utterance emotions recognized from the utterances of speakers A and B in the dialogue speech were, in chronological order, "neutral", "joy", "neutral", and "neutral"; re-estimation with the emotion interaction model updates them to "neutral", "joy", "joy", and "joy" under the influence of the per-utterance emotion of each immediately preceding utterance.
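Treating each speaker in turn as the target speaker amounts to preparing one input sequence per speaker. The Python sketch below illustrates this for a two-speaker dialogue; the function name `inputs_per_target`, the use of class labels instead of posterior vectors, the alternating turn order A, B, A, B, and the `None` placeholder at the start of the dialogue are all assumptions made for the example.

```python
def inputs_per_target(dialogue):
    """dialogue: chronological list of (speaker, per_utterance_emotion)
    for a two-speaker dialogue.

    Returns {speaker: [(own_emotion, preceding_partner_emotion), ...]},
    with each speaker treated as the target speaker in turn.
    """
    last_seen = {}  # most recent emotion of each speaker so far
    inputs = {}
    for speaker, emotion in dialogue:
        others = [e for s, e in last_seen.items() if s != speaker]
        preceding = others[0] if others else None  # None: partner silent so far
        inputs.setdefault(speaker, []).append((emotion, preceding))
        last_seen[speaker] = emotion
    return inputs

# Per-utterance emotions as in the FIG. 5 example; turn order is assumed.
dialogue = [("A", "neutral"), ("B", "joy"), ("A", "neutral"), ("B", "neutral")]
inputs = inputs_per_target(dialogue)
# inputs["A"] == [("neutral", None), ("neutral", "joy")]
# inputs["B"] == [("joy", "neutral"), ("neutral", "neutral")]
```

Each per-speaker list would then be passed to the emotion interaction model separately to obtain that speaker's re-estimated emotions.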

As shown in FIG. 6, the emotion recognition device 2 of the embodiment includes a per-utterance emotion recognition model storage unit 19, an emotion interaction model storage unit 20, an utterance detection unit 21, a per-utterance emotion recognition unit 22, and an emotion re-estimation unit 23. The emotion recognition device 2 takes as input dialogue speech recording the dialogue whose emotions are to be recognized, estimates the emotion of each utterance of the target speaker contained in the dialogue speech using the emotion interaction model stored in the emotion interaction model storage unit 20, and outputs the sequence of estimated emotions. The emotion recognition method of the embodiment is realized by the emotion recognition device 2 performing the processing of each step shown in FIG. 7.

The emotion recognition device 2 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: random access memory), and the like. The emotion recognition device 2 executes each process under the control of, for example, the central processing unit. Data input to the emotion recognition device 2 and data obtained in each process are stored in, for example, the main storage device, and the data stored there are read out to the central processing unit as needed and used for other processing. At least part of each processing unit of the emotion recognition device 2 may be implemented in hardware such as an integrated circuit. Each storage unit of the emotion recognition device 2 can be implemented by, for example, a main storage device such as RAM, an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or middleware such as a relational database or a key-value store. The storage units of the emotion recognition device 2 need only be logically separate and may reside in a single physical storage device.

The utterance-by-utterance emotion recognition model storage unit 19 stores the utterance-by-utterance emotion recognition model used by the utterance-by-utterance emotion recognition unit 22. This model is the same as the one used by the emotion interaction model learning device 1.

The emotion interaction model storage unit 20 stores the trained emotion interaction model generated by the emotion interaction model learning device 1.

The emotion recognition method executed by the emotion recognition device 2 of the embodiment is described below with reference to FIG. 7.

In step S21, the utterance detection unit 21 detects utterance sections in the dialogue speech input to the emotion recognition device 2 and obtains an utterance sequence of the target speaker and an utterance sequence of the partner speaker. Like the dialogue speech in the learning data, this dialogue speech contains a plurality of utterances of the target speaker and a plurality of utterances of the partner speaker. Utterance sections may be detected by the same method as in the utterance detection unit 11 of the emotion interaction model learning device 1. The utterance detection unit 21 outputs the obtained utterance sequence of the target speaker and utterance sequence of the partner speaker to the utterance-by-utterance emotion recognition unit 22.
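The output of step S21 can be pictured with a minimal Python sketch. It assumes that detection has already produced time-stamped segments carrying a speaker label; the `Utterance` type, the `"target"` and `"partner"` labels, and the function `split_by_speaker` are hypothetical names used only for illustration, and the detection method itself is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Utterance:
    """One detected utterance section (hypothetical representation)."""
    speaker: str   # "target" or "partner" (assumed labels)
    start: float   # start time in seconds
    end: float     # end time in seconds


def split_by_speaker(
    segments: List[Utterance],
) -> Tuple[List[Utterance], List[Utterance]]:
    """Return (target sequence, partner sequence), each in time order."""
    ordered = sorted(segments, key=lambda u: u.start)
    target = [u for u in ordered if u.speaker == "target"]
    partner = [u for u in ordered if u.speaker == "partner"]
    return target, partner
```

Keeping the start and end times on each utterance preserves the ordering information that a later step needs in order to find the partner utterance immediately preceding a given target utterance.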

In step S22, the utterance-by-utterance emotion recognition unit 22 receives the utterance sequence of the target speaker and the utterance sequence of the partner speaker from the utterance detection unit 21 and, using the utterance-by-utterance emotion recognition model stored in the utterance-by-utterance emotion recognition model storage unit 19, recognizes the utterance-by-utterance emotion of each utterance in each sequence. The utterance-by-utterance emotion may be recognized by the same method as in the utterance-by-utterance emotion recognition unit 12 of the emotion interaction model learning device 1. The utterance-by-utterance emotion recognition unit 22 outputs the utterance-by-utterance emotion sequence of the target speaker and the utterance-by-utterance emotion sequence of the partner speaker to the emotion re-estimation unit 23.
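As a rough illustration of step S22, the following Python sketch turns a sequence of utterances into a sequence of per-utterance emotion estimates. It is only a sketch under assumptions: `score_fn` is a hypothetical stand-in for the stored utterance-by-utterance emotion recognition model, each utterance is assumed to be already reduced to a feature vector, and representing each estimate as a softmax posterior over emotion classes is an illustrative choice rather than something stated in this section.

```python
import math
from typing import Callable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw class scores into a probability vector."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def recognize_sequence(
    utterance_features: Sequence[Sequence[float]],
    score_fn: Callable[[Sequence[float]], Sequence[float]],
) -> List[List[float]]:
    """Apply the per-utterance model to every utterance independently.

    `score_fn` (hypothetical) maps one utterance's features to one raw
    score per emotion class; no context from other utterances is used.
    """
    return [softmax(score_fn(features)) for features in utterance_features]
```

The function would be called once for the target speaker's sequence and once for the partner speaker's sequence, giving the two emotion sequences passed on to re-estimation.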

In step S23, the emotion re-estimation unit 23 receives the utterance-by-utterance emotion sequence of the target speaker and the utterance-by-utterance emotion sequence of the partner speaker from the utterance-by-utterance emotion recognition unit 22. It inputs the utterance-by-utterance emotion estimate of the target utterance and that of the immediately preceding utterance into the emotion interaction model stored in the emotion interaction model storage unit 20 and re-estimates the emotion of the target speaker. This corresponds to recognizing the target speaker's emotion a second time, based on information about the partner speaker's emotion and about the target speaker's own past and/or future emotions. For example, an utterance that utterance-by-utterance emotion recognition found hard to classify as either "neutral" or "joy" can be re-estimated as "joy" on the basis that the partner speaker's emotion immediately before that utterance was "joy". This is expected to improve emotion recognition accuracy. In re-estimation based on the emotion interaction model, the utterance-by-utterance emotion estimate of the target utterance and that of the immediately preceding utterance are input to the model and propagated forward to obtain the re-estimated emotion. The emotion re-estimation unit 23 re-estimates the emotion with each utterance of the target speaker in the dialogue speech taken in turn as the target utterance, and the resulting sequence of emotion estimates for the target speaker is output from the emotion recognition device 2.
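The data flow of step S23 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the trained model: `pair_with_preceding`, `re_estimate`, and `toy_step` are hypothetical names, the zero vector used when no partner utterance precedes is an assumption, and `toy_step` is a hand-written stand-in with arbitrary weights that only mimics the interface of a recurrent estimator. The sketch also runs forward in time only, so it uses past context but not future context.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
# A step function maps (state, input) to (new state, re-estimated emotion).
StepFn = Callable[[Vector, Vector], Tuple[Vector, Vector]]


def pair_with_preceding(
    events: Sequence[Tuple[str, Vector]],
) -> List[Tuple[Vector, Vector]]:
    """Pair each target utterance with the latest preceding partner utterance.

    `events` is the time-ordered dialogue as (speaker, per-utterance emotion).
    Assumption: a zero vector is used when no partner utterance precedes.
    """
    pairs: List[Tuple[Vector, Vector]] = []
    last_partner: Optional[Vector] = None
    for speaker, emotion in events:
        if speaker == "partner":
            last_partner = emotion
        else:
            preceding = last_partner if last_partner is not None else [0.0] * len(emotion)
            pairs.append((emotion, preceding))
    return pairs


def re_estimate(
    pairs: Sequence[Tuple[Vector, Vector]],
    step: StepFn,
    initial_state: Vector,
) -> List[Vector]:
    """Propagate forward over the target utterances, one estimate per utterance."""
    state = initial_state
    outputs: List[Vector] = []
    for target_emotion, partner_emotion in pairs:
        # The model input is the two per-utterance estimates, concatenated.
        state, estimate = step(state, target_emotion + partner_emotion)
        outputs.append(estimate)
    return outputs


def toy_step(state: Vector, x: Vector) -> Tuple[Vector, Vector]:
    """Stand-in for a learned estimator: a fixed, normalized weighted mix."""
    half = len(x) // 2
    target, partner = x[:half], x[half:]
    mixed = [0.5 * t + 0.3 * p + 0.2 * s for t, p, s in zip(target, partner, state)]
    total = sum(mixed) or 1.0
    estimate = [m / total for m in mixed]
    return estimate, estimate
```

With a two-class vector ordered as (neutral, joy), a target utterance estimated as `[0.5, 0.5]` that follows a partner utterance estimated as `[0.0, 1.0]` is pushed toward joy by `toy_step`, which mirrors the example in the text. In the actual device the step function would be the trained emotion interaction model.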

[Modification]
The embodiment above describes an example in which the emotion interaction model learning device 1 and the emotion recognition device 2 are separate devices. It is also possible to configure a single emotion recognition device that combines the function of learning the emotion interaction model with the function of recognizing emotions using the trained emotion interaction model. That is, the emotion recognition device of the modification includes the learning data storage unit 10, the utterance detection unit 11, the utterance-by-utterance emotion recognition unit 12, the model learning unit 13, the utterance-by-utterance emotion recognition model storage unit 19, the emotion interaction model storage unit 20, and the emotion re-estimation unit 23.

As described above, the emotion interaction model learning device and the emotion recognition device of the present invention are configured to learn the emotion interaction model using the partner speaker's utterance-by-utterance emotion sequence in addition to the target speaker's, and to re-estimate the target speaker's emotion using that model. Because this draws on the context information contained in the dialogue as well as on the target speaker's own emotion information, the accuracy of estimating the target speaker's emotion can be improved.

Although embodiments of the present invention have been described above, the specific configuration is not limited to these embodiments, and it goes without saying that design changes made as appropriate without departing from the spirit of the invention are included in the invention. The various processes described in the embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as needed.

[Program, recording medium]
When the various processing functions of each device described in the above embodiment are realized by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes the various processing functions of each device on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own storage device and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of this processing content may instead be implemented in hardware.

1 Emotion interaction model learning device
10 Learning data storage unit
11 Utterance detection unit
12 Utterance-by-utterance emotion recognition unit
13 Model learning unit
19 Utterance-by-utterance emotion recognition model storage unit
2 Emotion recognition device
20 Emotion interaction model storage unit
21 Utterance detection unit
22 Utterance-by-utterance emotion recognition unit
23 Emotion re-estimation unit

Claims

An emotion interaction model learning device comprising:
a learning data storage unit that stores learning data consisting of dialogue speech, in which a dialogue consisting of a plurality of utterances of a target speaker and a plurality of utterances of a partner speaker is recorded, and a correct emotion value for each utterance included in the dialogue;
an utterance-by-utterance emotion recognition unit that recognizes an utterance-by-utterance emotion for each utterance extracted from the dialogue speech and generates an utterance-by-utterance emotion sequence of the target speaker and an utterance-by-utterance emotion sequence of the partner speaker; and
a model learning unit that, using the correct emotion values, the utterance-by-utterance emotion sequence of the target speaker, and the utterance-by-utterance emotion sequence of the partner speaker, learns an emotion interaction model that takes as input the utterance-by-utterance emotion of a target utterance, which is an utterance of the target speaker, and the utterance-by-utterance emotion of an immediately preceding utterance made by the partner speaker immediately before the target utterance, and re-estimates the emotion of the target utterance.

The emotion interaction model learning device according to claim 1, wherein
the emotion interaction model constitutes one utterance emotion estimator for each target utterance, and
the utterance emotion estimator takes as input the utterance-by-utterance emotion of the target utterance and the utterance-by-utterance emotion of the immediately preceding utterance, re-estimates the emotion of the target utterance using emotion information about utterances made by the target speaker before the target utterance, or emotion information about utterances made by the target speaker before and after the target utterance, and outputs an estimated value of the emotion of the target utterance.

The emotion interaction model learning device according to claim 2, wherein
the utterance emotion estimator comprises any one of: an input gate and an output gate; an input gate, an output gate, and a forget gate; or a reset gate and an update gate.

An emotion recognition device comprising:
a model storage unit that stores an emotion interaction model learned by the emotion interaction model learning device according to claim 1;
an utterance-by-utterance emotion recognition unit that recognizes an utterance-by-utterance emotion for each utterance included in a dialogue consisting of a plurality of utterances of a target speaker and a plurality of utterances of a partner speaker, and generates an utterance-by-utterance emotion sequence of the target speaker and an utterance-by-utterance emotion sequence of the partner speaker; and
an emotion re-estimation unit that inputs, into the emotion interaction model, the utterance-by-utterance emotion of a target utterance, which is an utterance of the target speaker, and the utterance-by-utterance emotion of an immediately preceding utterance made by the partner speaker immediately before the target utterance, and re-estimates the emotion of the target utterance.

An emotion interaction model learning method, wherein
a learning data storage unit stores learning data consisting of dialogue speech, in which a dialogue consisting of a plurality of utterances of a target speaker and a plurality of utterances of a partner speaker is recorded, and a correct emotion value for each utterance included in the dialogue,
an utterance-by-utterance emotion recognition unit recognizes an utterance-by-utterance emotion for each utterance extracted from the dialogue speech and generates an utterance-by-utterance emotion sequence of the target speaker and an utterance-by-utterance emotion sequence of the partner speaker, and
a model learning unit, using the correct emotion values, the utterance-by-utterance emotion sequence of the target speaker, and the utterance-by-utterance emotion sequence of the partner speaker, learns an emotion interaction model that takes as input the utterance-by-utterance emotion of a target utterance, which is an utterance of the target speaker, and the utterance-by-utterance emotion of an immediately preceding utterance made by the partner speaker immediately before the target utterance, and re-estimates the emotion of the target utterance.

An emotion recognition method, wherein
a model storage unit stores an emotion interaction model learned by the emotion interaction model learning method according to claim 5,
an utterance-by-utterance emotion recognition unit recognizes an utterance-by-utterance emotion for each utterance included in a dialogue consisting of a plurality of utterances of a target speaker and a plurality of utterances of a partner speaker, and generates an utterance-by-utterance emotion sequence of the target speaker and an utterance-by-utterance emotion sequence of the partner speaker, and
an emotion re-estimation unit inputs, into the emotion interaction model, the utterance-by-utterance emotion of a target utterance, which is an utterance of the target speaker, and the utterance-by-utterance emotion of an immediately preceding utterance made by the partner speaker immediately before the target utterance, and re-estimates the emotion of the target utterance.

A program for causing a computer to function as the emotion interaction model learning device according to any one of claims 1 to 3.

A program for causing a computer to function as the emotion recognition device according to claim 4.