JP2023171101A

JP2023171101A - Learning device, estimation device, learning method, estimation method and program

Info

Publication number: JP2023171101A
Application number: JP2022083342A
Authority: JP
Inventors: 修平立石; Shuhei Tateishi; 真中辻; Makoto Nakatsuji; 颯平奥井; Sohei Okui; 悠佳小瀬木; Yuka Koseki; 浩文八島; Hirofumi Yajima; 繁雄松野; Shigeo Matsuno
Original assignee: NTT Resonant Inc
Current assignee: NTT Resonant Inc
Priority date: 2022-05-20
Filing date: 2022-05-20
Publication date: 2023-12-01
Anticipated expiration: 2042-05-20
Also published as: JP7419615B2

Abstract

To provide a learning device, an estimation device, a learning method, an estimation method and a program that improve the accuracy of estimating a speaker's emotion. SOLUTION: In a learning device, a control unit includes a learning unit that trains a mathematical model for estimating the emotion of an estimation target on the basis of an emotion time series, which depends on the emotion of the estimation target during utterance. The mathematical model performs common information acquisition processing, which applies the same mapping to two or more types of integrated vectors. Each integrated vector is an emotion time series vector, which indicates an emotion time series, to which a semantic vector has been appended. The semantic vector indicates semantic intervals obtained by dividing the emotion time series in the time direction under a division condition, which is a condition on the division of the emotion time series predetermined according to the type of the emotion time series. The mapping is updated by learning. SELECTED DRAWING: Figure 3

Description

Application filed for the application of Article 30, Paragraph 2 of the Patent Act. https://doi.org/10.11517/pjsai.JSAI2021.0_1N4IS1a04 Posting date: June 8, 2021

The present invention relates to a learning device, an estimation device, a learning method, an estimation method, and a program.

There is growing interest in techniques that use machine learning to estimate a speaker's emotion.

Kaicheng Yang, et al., “CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis”, 2020 Association for Computing Machinery, ACM ISBN 978-1-4503-7988-5/20/10

However, conventional techniques have mainly estimated emotion from a single type of information. Estimating emotion requires analyzing information whose content reflects that emotion, but because emotions are complex, estimation based on a single type of information has sometimes been inaccurate.

In view of the above circumstances, an object of the present invention is to provide a technique that improves the accuracy of estimating a speaker's emotion.

One aspect of the present invention is a learning device including a learning unit that trains a mathematical model for estimating the emotion of an estimation target based on an emotion time series, which is a time series that depends on the emotion of the estimation target during utterance. The mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors. Each integrated vector is an emotion time series vector, which indicates the emotion time series, to which a semantic vector has been appended. The semantic vector indicates semantic intervals, which are intervals obtained by dividing the emotion time series in the time direction under a division condition, that is, a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series. The mapping is updated by the learning.

One aspect of the present invention is an estimation device including a target acquisition unit and an estimation unit. The target acquisition unit acquires an emotion time series, which is a time series that depends on the emotion of an estimation target during utterance. The estimation unit uses a trained mathematical model obtained by a learning device to estimate the emotion of an estimation target that satisfies the following condition: the time series that relates to the utterance made by the estimation target and depends on its own emotion is the emotion time series acquired by the target acquisition unit. The learning device includes a learning unit that trains the mathematical model for estimating the emotion of the estimation target based on the emotion time series. The mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors. Each integrated vector is an emotion time series vector, which indicates the emotion time series, to which a semantic vector has been appended. The semantic vector indicates semantic intervals, which are intervals obtained by dividing the emotion time series in the time direction under a division condition, that is, a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series. The mapping is updated by the learning.

One aspect of the present invention is a learning method including a learning step of training a mathematical model for estimating the emotion of an estimation target based on an emotion time series, which is a time series that depends on the emotion of the estimation target during utterance. The mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors. Each integrated vector is an emotion time series vector, which indicates the emotion time series, to which a semantic vector has been appended. The semantic vector indicates semantic intervals, which are intervals obtained by dividing the emotion time series in the time direction under a division condition, that is, a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series. The mapping is updated by the learning.

One aspect of the present invention is an estimation method including a target acquisition step and an estimation step. The target acquisition step acquires an emotion time series, which is a time series that depends on the emotion of an estimation target during utterance. The estimation step uses a trained mathematical model obtained by a learning device to estimate the emotion of an estimation target that satisfies the following condition: the time series that relates to the utterance made by the estimation target and depends on its own emotion is the emotion time series acquired in the target acquisition step. The learning device includes a learning unit that trains the mathematical model for estimating the emotion of the estimation target based on the emotion time series. The mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors. Each integrated vector is an emotion time series vector, which indicates the emotion time series, to which a semantic vector has been appended. The semantic vector indicates semantic intervals, which are intervals obtained by dividing the emotion time series in the time direction under a division condition, that is, a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series. The mapping is updated by the learning.

One aspect of the present invention is a program for causing a computer to function as the above learning device.

One aspect of the present invention is a program for causing a computer to function as the above estimation device.

According to the present invention, it is possible to improve the accuracy of estimating a speaker's emotion.

FIG. 1 is an explanatory diagram illustrating an overview of the mathematical model in the embodiment.
FIG. 2 is a diagram showing an example of the hardware configuration of the learning device of the embodiment.
FIG. 3 is a diagram showing an example of the configuration of the control unit included in the learning device of the embodiment.
FIG. 4 is a flowchart showing an example of the flow of processing executed by the learning device of the embodiment.
FIG. 5 is a diagram showing an example of the hardware configuration of the estimation device of the embodiment.
FIG. 6 is a diagram showing an example of the configuration of the control unit included in the estimation device of the embodiment.
FIG. 7 is a flowchart showing an example of the flow of processing executed by the estimation device of the embodiment.

(Embodiment)
FIG. 1 is an explanatory diagram illustrating an overview of the mathematical model in the embodiment. More specifically, FIG. 1 illustrates an overview of a mathematical model that estimates the emotion of a speaker who is the estimation target (hereinafter referred to as the "emotion estimation model"). The emotion estimation model is updated by learning. Being updated by learning means being updated by a machine learning method.

More specifically, the emotion estimation model is a mathematical model that estimates the emotion of the estimation target based on two or more types of emotion time series. An emotion time series is a time series that relates to an utterance made by a speaker and depends on the speaker's emotion during the utterance.

The emotion time series is, for example, a time series of the speaker's utterance (hereinafter referred to as an "utterance time series"). The emotion time series may also be, for example, a time series of the sound of the utterance indicated by the utterance time series (hereinafter referred to as a "sound time series"). The emotion time series may also be, for example, a time series of the audio of a video of the speaker while uttering the utterance indicated by the utterance time series (this is likewise referred to below as a "sound time series").

The emotion time series may also be, for example, a video showing the speaker while uttering the utterance indicated by the utterance time series (hereinafter referred to as an "utterance video"). Thus, the emotion time series may be any time series that falls into one of three categories: the time series of the utterance indicated by the utterance time series, a time series related to that utterance, or a time series related to the speaker while uttering that utterance.

For example, an utterance time series and a sound time series are time series of different types. Likewise, an utterance time series and an utterance video are of different types, as are a sound time series and an utterance video. Any combination of two or more types of emotion time series may be input to the emotion estimation model, but the input preferably includes at least an utterance time series.

FIG. 1 shows three emotion time series: an utterance time series, a sound time series, and an utterance video. The emotion estimation model executes a vectorization process. The vectorization process acquires, for each emotion time series, a vector indicating that emotion time series (hereinafter referred to as an "emotion time series vector"). Specifically, an emotion time series vector is a vector in which each element represents one sample of the time series. In the example of FIG. 1, the vectorization process therefore yields an emotion time series vector indicating the utterance time series, one indicating the sound time series, and one indicating the utterance video.

Next, the emotion estimation model executes a meaning assignment process. The meaning assignment process appends a semantic vector to an emotion time series vector. A semantic vector is a vector indicating the meaning of each semantic interval. A semantic interval is an interval obtained by dividing the emotion time series in the time direction under a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series. Hereinafter, this condition is referred to as the "division condition".

Note that appending a vector to a vector means concatenating the two vectors. Concatenation is the process of generating an (N+M)-dimensional vector H3 from an N-dimensional vector H1 and an M-dimensional vector H2, where N and M are integers of 1 or more. The 1st through Nth elements of H3 are, in order, the 1st through Nth elements of H1, and the (N+1)th through (N+M)th elements of H3 are, in order, the 1st through Mth elements of H2.
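The concatenation defined above can be illustrated with a minimal Python sketch. The function name `concatenate` and the sample values are assumptions made for illustration and do not appear in the specification.

```python
def concatenate(h1, h2):
    """Return H3, the (N + M)-dimensional concatenation of H1 (N-dim) and H2 (M-dim)."""
    if len(h1) < 1 or len(h2) < 1:
        raise ValueError("N and M must both be integers of 1 or more")
    # Elements 1..N of H3 come from H1; elements N+1..N+M come from H2.
    return list(h1) + list(h2)


h1 = [0.2, 0.5, 0.1]      # hypothetical emotion time series vector, N = 3
h2 = [1.0, 0.0]           # hypothetical semantic vector, M = 2
h3 = concatenate(h1, h2)  # integrated vector, N + M = 5
```

Here `h3` has five elements, the first three taken from `h1` and the last two from `h2`, in order.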

Hereinafter, the vector obtained by the meaning assignment process, that is, the concatenation of an emotion time series vector and a semantic vector, is referred to as an "integrated vector".

In the example of FIG. 1, the meaning assignment process yields three types of integrated vectors: one corresponding to the emotion time series vector indicating the utterance time series, one corresponding to the emotion time series vector indicating the sound time series, and one corresponding to the emotion time series vector indicating the utterance video.

For an utterance time series, the division condition is, for example, that each interval contains exactly one word. For a sound time series, the division condition is, for example, that each interval contains exactly one phrase, from the start of a sound to its end. For an utterance video, the division condition is, for example, that each interval contains exactly one scene.

Consequently, the length of a semantic interval in the time direction varies from interval to interval and is not necessarily uniform. For an utterance time series, for example, the length of a semantic interval is the length of a word, so if the utterance time series contains words of different lengths, it contains semantic intervals of different lengths in the time direction.
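As a rough illustration of why semantic intervals have unequal lengths, the following Python sketch divides an utterance time series under the "exactly one word per interval" condition. The input representation (a list of words paired with durations in frames) and the function name `word_intervals` are assumptions for illustration only. The specification does not prescribe a data format.

```python
def word_intervals(words_with_durations):
    """Return one (word, start, end) semantic interval per word.

    Each interval contains exactly one word, so interval lengths
    follow the word durations and are generally not uniform.
    """
    intervals = []
    start = 0
    for word, duration in words_with_durations:
        intervals.append((word, start, start + duration))
        start += duration
    return intervals


# Hypothetical utterance: each word with its duration in frames.
utterance = [("I", 2), ("absolutely", 9), ("love", 4), ("it", 2)]
intervals = word_intervals(utterance)
lengths = [end - start for _, start, end in intervals]
```

The resulting `lengths` differ from word to word, which matches the observation above that semantic intervals need not share a common length.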

In the meaning assignment process, the semantic vector is assigned based on information that indicates candidate meanings and is stored in advance in a predetermined storage device (hereinafter referred to as a "semantic dictionary"). The meaning assignment process may, for example, use LMMS (Language Modeling Makes Sense), described in Reference 1 below.

Reference 1: Daniel Loureiro, et al. Language Modelling Makes Sense: Propagating Representations through WordNet for Full-Coverage Word Sense Disambiguation, In Proc. ACL’19, 5682-5691

In the meaning assignment process for the utterance time series, for example, information indicating the meaning of each word in the utterance indicated by the utterance time series is appended to the emotion time series vector indicating the sound time series. In the meaning assignment process for the sound time series, for example, information on the pitch and loudness of each sound indicated by the sound time series is appended to the emotion time series vector indicating the sound time series. In the meaning assignment process for the utterance video, for example, information indicating the content of each scene of the video is appended to the emotion time series vector indicating the utterance video.

Next, the emotion estimation model executes a common information acquisition process. The common information acquisition process applies the same mapping to every integrated vector, regardless of the type of the integrated vector. The mapping may be, for example, a mapping that represents a binary operation with a predetermined vector. In this case, the predetermined vector and the definition of the binary operation are the same regardless of the type of the integrated vector. The binary operation is, for example, a tensor product. The mapping may also be, for example, a matrix. When a tensor product is used as the binary operation, each operand is a tensor such as a vector or a matrix, and every element of one operand is multiplied by every element of the other. Using a tensor product therefore extracts the relationship between the two operands with higher precision than other binary operations would.
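The following Python sketch shows one way to read the tensor-product variant described above: a single shared vector is combined with every integrated vector by an outer (tensor) product, so the same mapping is applied regardless of type. The names `tensor_product`, `common_information`, and `shared_vector`, as well as all numeric values, are assumptions for illustration. In the embodiment the shared parameters would be updated by learning, which this sketch does not do.

```python
def tensor_product(u, v):
    """Outer (tensor) product: every element of u is multiplied by every element of v."""
    return [[ui * vj for vj in v] for ui in u]


def common_information(integrated_vectors, shared_vector):
    """Apply the same mapping (tensor product with shared_vector) to each integrated vector."""
    return {
        name: tensor_product(vec, shared_vector)
        for name, vec in integrated_vectors.items()
    }


# Hypothetical integrated vectors, one per type of emotion time series.
integrated = {
    "utterance": [1.0, 2.0],
    "sound": [0.5, -1.0, 3.0],
    "video": [2.0],
}
shared = [1.0, 0.5]  # hypothetical shared parameters; identical for every type
images = common_information(integrated, shared)
```

Because the tensor product accepts operands of any length, the integrated vectors do not need equal dimensions in this sketch. A matrix-based mapping would instead require a common input dimension.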

The result of the common information acquisition process is input to a subsequent process. This result is the result of the binary operation, that is, the image under the mapping. Specifically, the subsequent process estimates emotion based on the result of the common information acquisition process (hereinafter referred to as "emotion estimation post-processing"). The emotion is estimated from that result by, for example, a sequence classification task.

After the common information acquisition process, the emotion estimation model executes the emotion estimation post-processing. In this way, the emotion estimation model estimates the emotion of the estimation target.
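The order of the four processes (vectorization, meaning assignment, common information acquisition, emotion estimation post-processing) can be sketched end to end in Python as follows. Every function body here is a deliberately simplified stand-in chosen for illustration. In particular, the sign-based `post_process` is a hypothetical placeholder for the sequence classification task, and the semantic vectors are given directly rather than looked up in a semantic dictionary.

```python
def vectorize(series):
    """Vectorization: one vector element per time-series sample."""
    return [float(sample) for sample in series]


def assign_meaning(ts_vector, semantic_vector):
    """Meaning assignment: concatenate the semantic vector onto the time series vector."""
    return ts_vector + semantic_vector


def acquire_common_information(integrated_vector, shared_vector):
    """Common information acquisition: tensor product with a vector shared by all types."""
    return [[a * b for b in shared_vector] for a in integrated_vector]


def post_process(images):
    """Placeholder for emotion estimation post-processing (not a real classifier)."""
    score = sum(value for image in images for row in image for value in row)
    return "positive" if score >= 0 else "negative"


def estimate_emotion(inputs, shared_vector):
    """Run the four processes in order over two or more types of emotion time series."""
    images = []
    for series, semantic_vector in inputs.values():
        integrated = assign_meaning(vectorize(series), semantic_vector)
        images.append(acquire_common_information(integrated, shared_vector))
    return post_process(images)


# Hypothetical inputs: (time-series samples, semantic vector) per type.
inputs = {
    "utterance": ([1, 2, 3], [0.5]),
    "sound": ([0, -1], [1.0, 1.0]),
}
label = estimate_emotion(inputs, shared_vector=[1.0, -0.5])
```

The point of the sketch is the data flow: each type of emotion time series passes through the same shared mapping before a single post-processing step produces the estimate.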

<Effects of the common information acquisition process>
The effects of the common information acquisition process are as follows. As described above, the common information acquisition process applies the same mapping to each input integrated vector. It is therefore a process that acquires the information common to the integrated vectors. Mathematically, within a single vector space containing all the integrated vectors, the common information acquisition process projects each integrated vector onto the same hyperplane. Each integrated vector is obtained from a time series that falls into one of three categories: the time series of the utterance indicated by the utterance time series, a time series related to that utterance, or a time series related to the speaker while uttering that utterance.

Each integrated vector therefore carries the topic of the utterance indicated by the utterance time series as common information, so the information common to the integrated vectors that the common information acquisition process obtains includes the topic. As training of the emotion estimation model updates the mapping, the share of topic information in the total amount of common information obtained by the process increases. Even when topic information is present, it has little influence on the estimation result of the emotion estimation model if it is buried among a large amount of other information. In other words, buried topic information has no significant effect on the estimation result.

As described above, the common information acquisition process increases the share of topic information in the total information used for estimation by the emotion estimation model, so executing it increases the influence of topic information on the estimation result. As a result, the emotion estimation model can obtain an estimation result on which the information indicating the topic has a significant effect. Given that the topic is important information summarizing the content of the utterance, an emotion estimation model that can make significant use of topic information achieves higher estimation accuracy than other emotion-estimating mathematical models that do not execute the common information acquisition process.

The machine learning method may, for example, use BERT (Bidirectional Encoder Representations from Transformers), LSTM (Long short-term memory), or CNN (Convolutional Neural Networks). Training of the emotion estimation model uses pairs of correct-answer data and two or more types of emotion time series as training data. The correct-answer data is information indicating an emotion. The loss function used in training indicates the difference from the correct-answer data, and the emotion estimation model is updated so as to reduce that difference. Training continues until a predetermined termination condition (hereinafter referred to as the "learning end condition") is satisfied. The learning end condition is, for example, that training has been performed a predetermined number of times. It may instead be, for example, that the estimation accuracy of the emotion estimation model is equal to or higher than a predetermined accuracy.
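The interplay of loss function, update, and learning end condition can be sketched with a toy Python training loop. This is not BERT, LSTM, or CNN training: the one-parameter model, the squared-error loss, the gradient-descent update, and the names `train`, `lr`, `max_steps`, and `tol` are all assumptions chosen to keep the example self-contained. Both kinds of end condition mentioned above appear: a fixed maximum number of updates and a quality threshold (expressed here as a loss tolerance rather than an accuracy).

```python
def train(pairs, lr=0.1, max_steps=1000, tol=1e-6):
    """Fit a single weight w so that w * x approximates y for each (x, y) pair.

    Stops when the mean squared loss falls to tol or below,
    or after max_steps updates, whichever comes first.
    """
    w = 0.0
    loss = float("inf")
    step = 0
    for step in range(1, max_steps + 1):
        # Loss: mean squared difference between the model output and the correct answer.
        loss = sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)
        if loss <= tol:
            break  # learning end condition based on model quality
        # Update the parameter in the direction that reduces the loss.
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w, loss, step


# Hypothetical training pairs (input, correct answer); the ideal weight is 2.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, loss, steps = train(pairs)
```

In an actual emotion estimation model the parameters being updated would include the shared mapping of the common information acquisition process, and the correct-answer data would be emotion labels rather than real numbers.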

FIG. 2 is a diagram showing an example of the hardware configuration of the learning device 1 in the embodiment. The learning device 1 includes a control unit 11, which includes a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected by a bus, and executes a program. By executing the program, the learning device 1 functions as a device including the control unit 11, an input unit 12, a communication unit 13, a storage unit 14, and an output unit 15.

More specifically, the processor 91 reads a program stored in the storage unit 14 and stores the read program in the memory 92. When the processor 91 executes the program stored in the memory 92, the learning device 1 functions as a device including the control unit 11, the input unit 12, the communication unit 13, the storage unit 14, and the output unit 15.

The control unit 11 controls the operation of the various functional units included in the learning device 1. The control unit 11, for example, trains the emotion estimation model. The control unit 11, for example, controls the operation of the output unit 15. The control unit 11, for example, records in the storage unit 14 various information generated by training the emotion estimation model.

The input unit 12 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 12 may instead be configured as an interface that connects these input devices to the learning device 1. The input unit 12 accepts input of various information to the learning device 1. For example, an emotion time series is input to the input unit 12.

The communication unit 13 includes a communication interface for connecting the learning device 1 to an external device. The communication unit 13 communicates with the external device over a wired or wireless connection. The external device is, for example, the device from which training data is sent. The communication unit 13 acquires the training data by communicating with that device. The external device may also be, for example, the estimation device 2 described later.

The storage unit 14 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 14 stores various information about the learning device 1. The storage unit 14 stores, for example, information input via the input unit 12 or the communication unit 13. The storage unit 14 stores, for example, various information generated by training the emotion estimation model. The storage unit 14 stores the emotion estimation model in advance. Note that storing a mathematical model means storing a computer program that describes the mathematical model. The storage unit 14 may also store the trained emotion estimation model once it is obtained.

The output unit 15 outputs various information. The output unit 15 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro-Luminescence) display. The output unit 15 may instead be configured as an interface that connects these display devices to the learning device 1. The output unit 15 outputs, for example, information input to the input unit 12. The output unit 15 may also display, for example, the result of executing the emotion estimation model.

FIG. 3 is a diagram showing an example of the configuration of the control unit 11 in the embodiment. The control unit 11 includes a data acquisition unit 110, a learning unit 120, a storage control unit 130, a communication control unit 140, and an output control unit 150.

The data acquisition unit 110 acquires training data; that is, it acquires correct answer data and emotion time series. The learning unit 120 uses the training data obtained by the data acquisition unit 110 to update the emotion estimation model until the learning end condition is satisfied. In other words, the learning unit 120 obtains a learned emotion estimation model using the correct answer data and two or more types of emotion time series. The learned emotion estimation model is the emotion estimation model at the time when the learning end condition is satisfied.

The storage control unit 130 records various information in the storage unit 14. The communication control unit 140 controls the operation of the communication unit 13. The output control unit 150 controls the operation of the output unit 15.

FIG. 4 is a flowchart showing an example of the flow of processing executed by the learning device 1 in the embodiment. The data acquisition unit 110 acquires training data including correct answer data and two or more types of emotion time series (step S101). Next, the learning unit 120 estimates the emotion of the estimation target by executing the emotion estimation model on the two or more types of emotion time series obtained in step S101 (step S102). Because the emotion estimation model is executed in step S102, the vectorization process, the meaning assignment process, the common information acquisition process, and the emotion estimation post-processing are all executed in that step.

After step S102, the learning unit 120 updates the emotion estimation model based on the difference between the estimation result of step S102 and the correct answer data obtained in step S101 (step S103). Next, the learning unit 120 determines whether the learning end condition is satisfied (step S104). If the learning end condition is satisfied (step S104: YES), the processing ends. If the learning end condition is not satisfied (step S104: NO), the processing returns to step S101.
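The loop of steps S101 to S104 can be sketched as follows. This is only an illustration under stated assumptions: the specification does not fix the model, the update rule, or the learning end condition, so the linear model, the squared-error gradient update, the loss threshold, and every name here (`train`, `get_batch`, `loss_threshold`, `max_iterations`) are hypothetical stand-ins.

```python
from typing import Callable, List, Sequence, Tuple

# (features derived from the emotion time series, correct answer value)
Sample = Tuple[List[float], float]


def train(
    get_batch: Callable[[], Sequence[Sample]],
    weights: List[float],
    learning_rate: float = 0.1,
    loss_threshold: float = 1e-4,
    max_iterations: int = 1000,
) -> Tuple[List[float], int]:
    """Repeat S101-S104 until the (assumed) learning end condition holds."""
    for iteration in range(1, max_iterations + 1):
        batch = get_batch()  # S101: acquire training data
        grads = [0.0] * len(weights)
        loss = 0.0
        for features, target in batch:
            # S102: execute the (stand-in, linear) model to get an estimate
            pred = sum(w * x for w, x in zip(weights, features))
            error = pred - target  # difference from the correct answer data
            loss += error * error
            for i, x in enumerate(features):
                grads[i] += 2.0 * error * x
        loss /= len(batch)
        # S103: update the model based on that difference
        weights = [
            w - learning_rate * g / len(batch) for w, g in zip(weights, grads)
        ]
        # S104: assumed end condition (mean squared error below a threshold)
        if loss < loss_threshold:
            return weights, iteration
    return weights, max_iterations
```

The function returns the updated parameters together with the number of iterations used, which corresponds to "the model at the time when the learning end condition is satisfied" in the text above.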

The emotion estimation model at the time when the learning end condition is satisfied is the learned emotion estimation model. The learned emotion estimation model is used by the estimation device 2, shown in FIG. 5 below, to estimate the emotion of the estimation target. The learned emotion estimation model may be placed under the control of the estimation device 2 by any method, as long as the estimation device 2 can execute it. For example, the learned emotion estimation model becomes executable by the estimation device 2 when it is transmitted from the learning device 1 to the estimation device 2 by communication after the learning end condition is satisfied.

FIG. 5 is a diagram showing an example of the hardware configuration of the estimation device 2 in the embodiment. The estimation device 2 includes a control unit 21 having a processor 93, such as a CPU, and a memory 94 connected by a bus, and executes a program. By executing the program, the estimation device 2 functions as a device including the control unit 21, an input unit 22, a communication unit 23, a storage unit 24, and an output unit 25.

More specifically, the processor 93 reads a program stored in the storage unit 24 and stores the read program in the memory 94. When the processor 93 executes the program stored in the memory 94, the estimation device 2 functions as a device including the control unit 21, the input unit 22, the communication unit 23, the storage unit 24, and the output unit 25.

The control unit 21 controls the operations of the various functional units included in the estimation device 2. For example, the control unit 21 executes the learned emotion estimation model, controls the operation of the output unit 25, and records in the storage unit 24 various information generated by executing the learned emotion estimation model.

The input unit 22 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 22 may instead be configured as an interface that connects such input devices to the estimation device 2. The input unit 22 receives input of various information to the estimation device 2.

The communication unit 23 includes a communication interface for connecting the estimation device 2 to an external device, and communicates with the external device by wire or wirelessly. The external device is, for example, the device that sends the emotion time series, or the learning device 1. The communication unit 23 acquires the learned emotion estimation model by communicating with the learning device 1. Note that the emotion time series does not necessarily have to be input to the communication unit 23; it may be input to the input unit 22.

The storage unit 24 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 24 stores various information regarding the estimation device 2, for example information input via the input unit 22 or the communication unit 23, and information generated by executing the learned emotion estimation model. The storage unit 24 stores the learned emotion estimation model.

The output unit 25 outputs various information. The output unit 25 includes a display device such as a CRT display, a liquid crystal display, or an organic EL display. The output unit 25 may instead be configured as an interface that connects such a display device to the estimation device 2. The output unit 25 outputs, for example, information input to the input unit 22, and may display, for example, the result of executing the learned emotion estimation model.

FIG. 6 is a diagram showing an example of the configuration of the control unit 21 in the embodiment. The control unit 21 includes a target acquisition unit 210, an estimation unit 220, a storage control unit 230, a communication control unit 240, and an output control unit 250. The target acquisition unit 210 acquires the emotion time series input to the input unit 22 or the communication unit 23.

The estimation unit 220 executes the learned emotion estimation model on the emotion time series acquired by the target acquisition unit 210. By executing the learned emotion estimation model, the estimation unit 220 estimates the emotion of the speaker corresponding to the emotion time series acquired by the target acquisition unit 210.

As described above, an emotion time series is a time series relating to the utterances made by a speaker, and it depends on the speaker's emotion during the utterance. Accordingly, the speaker corresponding to the emotion time series acquired by the target acquisition unit 210 is the speaker who satisfies the following condition (hereinafter referred to as the "corresponding speaker condition"): the time series that relates to the utterances the speaker has made and that depends on the speaker's own emotion is the emotion time series acquired by the target acquisition unit 210. The speaker who satisfies the corresponding speaker condition is therefore the target of emotion estimation by the estimation unit 220.

The storage control unit 230 records various information in the storage unit 24. The communication control unit 240 controls the operation of the communication unit 23. The output control unit 250 controls the operation of the output unit 25.

FIG. 7 is a flowchart showing an example of the flow of processing executed by the estimation device 2 in the embodiment. The target acquisition unit 210 acquires two or more types of emotion time series input to the input unit 22 or the communication unit 23 (step S201). Next, the estimation unit 220 executes the learned emotion estimation model to estimate the emotion of the speaker corresponding to the emotion time series acquired by the target acquisition unit 210 (step S202). Next, the output control unit 250 controls the operation of the output unit 25 so that the output unit 25 outputs the emotion estimated in step S202 (step S203).
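Steps S201 to S203 amount to an acquire, estimate, output pipeline, which might be sketched as below. The names `run_estimation` and `toy_model` are hypothetical, and `toy_model` is a placeholder (the sign of the mean value) standing in for the learned emotion estimation model, whose internals this sketch does not reproduce.

```python
from typing import Callable, Dict, List

# modality name -> time series values (a simplified, assumed representation)
EmotionSeries = Dict[str, List[float]]


def toy_model(series: EmotionSeries) -> str:
    """Placeholder for the learned model: labels by the sign of the mean."""
    values = [v for seq in series.values() for v in seq]
    return "positive" if sum(values) / len(values) > 0 else "negative"


def run_estimation(
    acquire: Callable[[], EmotionSeries],
    learned_model: Callable[[EmotionSeries], str],
    output: Callable[[str], None],
) -> str:
    series = acquire()  # S201: acquire two or more types of emotion time series
    if len(series) < 2:
        raise ValueError("two or more types of emotion time series are expected")
    emotion = learned_model(series)  # S202: execute the learned model
    output(emotion)  # S203: have the output unit output the estimate
    return emotion
```

Passing the three stages as callables mirrors the separation between the target acquisition unit 210, the estimation unit 220, and the output control unit 250, but that mapping is an interpretation rather than something the specification prescribes.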

The learning device 1 in the embodiment configured as described above updates, by learning, the emotion estimation model that executes the common information acquisition process. Consequently, as stated in the earlier explanation of the effect of the common information acquisition process, the emotion estimation model obtains its estimation result under a significant influence of the information indicating the theme. The learning device 1 can therefore improve the accuracy of estimating the speaker's emotion.

The estimation device 2 in the embodiment configured as described above estimates the emotion of the estimation target using the learned emotion estimation model obtained by the learning device 1. The estimation device 2 can therefore improve the accuracy of estimating the emotion of the estimation target.
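The defining property of the common information acquisition process is that one and the same mapping is applied to every type of integrated vector. A minimal sketch, assuming purely for illustration that the mapping is a linear map given by a matrix (the specification only says it is a mapping that is updated by learning); the function names are hypothetical:

```python
from typing import Dict, List

Matrix = List[List[float]]


def apply_mapping(mapping: Matrix, vector: List[float]) -> List[float]:
    """Apply a linear map (one row per output component) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in mapping]


def common_information(
    mapping: Matrix, integrated_vectors: Dict[str, List[float]]
) -> Dict[str, List[float]]:
    """Apply the SAME mapping to the integrated vector of every modality."""
    return {
        name: apply_mapping(mapping, vec)
        for name, vec in integrated_vectors.items()
    }
```

Because a single `mapping` object is shared, any update made to it during learning affects how all modalities are transformed, which is the point of sharing it.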

(Modification)
The emotion estimation post-processing may be executed not on the result of the common information acquisition process itself but on the result of applying a dimension reduction process to that result. That is, a dimension reduction process may be executed between the common information acquisition process and the emotion estimation post-processing. The dimension reduction process reduces the dimension of a tensor when the result of the common information acquisition process is a tensor such as a vector. The dimension of the tensor obtained by the dimension reduction process is a dimension that the emotion estimation post-processing can handle.
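The specification does not say which dimension reduction method is used. As one possible illustration only, the sketch below reduces a vector to a target dimension by averaging contiguous chunks; the function name and the choice of average pooling are assumptions, and a learned projection would be an equally valid reading.

```python
from typing import List


def reduce_dimension(vector: List[float], target_dim: int) -> List[float]:
    """Average-pool a vector down to target_dim components."""
    n = len(vector)
    if not 0 < target_dim <= n:
        raise ValueError("target_dim must be between 1 and len(vector)")
    reduced = []
    for k in range(target_dim):
        # Integer boundaries split the vector into target_dim non-empty chunks.
        start = k * n // target_dim
        end = (k + 1) * n // target_dim
        chunk = vector[start:end]
        reduced.append(sum(chunk) / len(chunk))
    return reduced
```

Here `target_dim` plays the role of "a dimension that the emotion estimation post-processing can handle".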

The emotion estimation post-processing may also be executed not on the result of the common information acquisition process itself but on the result of applying a time-resolved embedding process to that result. The time-resolved embedding process is applicable when the result of the common information acquisition process is a vector. In the meaning assignment process, as described above, a meaning is assigned to each semantic interval. The vector obtained by the meaning assignment process (that is, the integrated vector) therefore carries information on the time length of the semantic intervals. Because the common information acquisition process is executed on the integrated vectors, its result also carries the information indicating the time length of each semantic interval (hereinafter referred to as "semantic interval length information").

However, the emotion estimation post-processing may be a process that performs its computation on the premise that all semantic intervals have the same length. The time-resolved embedding process therefore converts the vector obtained as the result of the common information acquisition process into a vector in which the length of each semantic interval indicated by the semantic interval length information is the same regardless of the semantic interval.

Specifically, the following processing is executed.

The left side of equation (1) denotes the word-segmented embedding tensor of the utterance for each modality. The right side of equation (1) indicates that this tensor consists of the embedding vectors of the modality. Here, m stands for {l, a, v}; that is, m is a generic subscript over the modalities, where l denotes the language modality, a the audio modality, and v the video modality. Each element on the right side of equation (1) is the embedding vector of one word, and n_u is the length of the utterance in words.

The left side of equation (2) denotes the feature tensor, over the whole utterance, of a time-series modality generated by the time-resolved embedding process. "conv1D" on the right side of equation (2) denotes one-dimensional convolution, which convolves all data points of the time series down to as many data points as there are words in the utterance. M′ on the right side of equation (2) denotes the time-series feature tensor of each modality in the utterance. M is the pair of A and V, where A denotes the audio modality and V the video modality.

The left side of equation (3) denotes the feature tensor of the whole utterance. "concat" on the right side of equation (3) denotes matrix concatenation. A^u in equation (3) is the audio modality component of the tensor M^u in equation (2), and W^a is the weight applied to A^u. Likewise, V^u is the video modality component of the tensor M^u in equation (2), and W^v is the weight applied to V^u. The m in equation (3) also ranges over the language, audio, and video modalities; that is, the equation indicates that the operation on the right side is applied to the word-segmented embedding tensor of each modality. E′_m on the right side is the same as the left side of equation (1), namely the word-segmented embedding tensor described above.
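The two operations named in equations (2) and (3), "conv1D" and "concat", can be illustrated with a rough sketch. The equations themselves are not reproduced in this text, so the sketch rests on several assumptions: the time series is scalar-valued, the convolution uses non-overlapping windows (stride equal to kernel size) so that exactly n_u outputs result, the weights W^a and W^v are scalars, and only the weighted audio and video components are concatenated. Whether and how the language embedding enters the concatenation cannot be recovered from the description, so it is left out. All function names are hypothetical.

```python
from typing import List


def conv1d_to_word_length(
    series: List[float], kernel: List[float], n_u: int
) -> List[float]:
    """Strided 1D convolution that maps a time series to n_u data points.

    Assumes len(series) == n_u * len(kernel), i.e. one window per word.
    """
    k = len(kernel)
    if n_u <= 0 or len(series) != n_u * k:
        raise ValueError("this sketch assumes len(series) == n_u * len(kernel)")
    return [
        sum(kernel[j] * series[i * k + j] for j in range(k))
        for i in range(n_u)
    ]


def weighted_concat(
    audio: List[float], video: List[float], w_a: float, w_v: float
) -> List[List[float]]:
    """Per word, place the weighted audio and video features side by side."""
    if len(audio) != len(video):
        raise ValueError("audio and video must have one entry per word")
    return [[w_a * a, w_v * v] for a, v in zip(audio, video)]
```

In a trained model the kernel and the weights would be learned parameters; the fixed values used when calling these functions are for illustration only.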

In this way, the time-resolved embedding process changes the content of the semantic interval length information of an input vector that indicates a time series and has semantic interval length information: it converts the length of each semantic interval indicated by that information to the same length regardless of the semantic interval. The length of each semantic interval after conversion is the time-direction length of the time series indicated by the input vector divided by the average length of the utterances included in the utterance time series.
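The rule for the converted interval length is a single division, shown here as a small worked example. The units are an assumption: the series length is taken in data points and the utterance lengths in words, so the result reads as data points per semantic interval. The function name is hypothetical.

```python
from typing import Sequence


def uniform_interval_length(
    series_length: float, utterance_lengths: Sequence[float]
) -> float:
    """Time-direction length of the series divided by the mean utterance length."""
    mean_utterance_length = sum(utterance_lengths) / len(utterance_lengths)
    return series_length / mean_utterance_length


# Example: a series of 1200 data points and utterances of 10, 12 and 8 words
# (mean 10) give semantic intervals of 120 data points each.
example_length = uniform_interval_length(1200, [10, 12, 8])
```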

Note that the time-resolved embedding process only needs to be executed before the emotion estimation post-processing; for example, it may be executed on the result of the dimension reduction process.

Note that the learning device 1 may be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the learning device 1 may be distributed across the plurality of information processing devices.

Note that the estimation device 2 may be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the estimation device 2 may be distributed across the plurality of information processing devices.

Note that all or some of the functions of the learning device 1 and the estimation device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may be transmitted via a telecommunication line.

Note that the emotion time series acquired by the target acquisition unit 210 is an example of target information. The sentence indicated by the emotion time series acquired by the target acquisition unit 210 is an example of a processing target.

Although the embodiment of the present invention has been described above in detail with reference to the drawings, the specific configuration is not limited to this embodiment, and designs and the like within a scope that does not depart from the gist of the present invention are also included.

(Appendix 1)
A learning device comprising:
a learning unit that performs learning of a mathematical model that estimates the emotion of an estimation target based on an emotion time series, which is a time series that depends on the emotion of the estimation target during utterance,
wherein the mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors, each integrated vector being an emotion time series vector, which indicates an emotion time series, to which a semantic vector has been assigned, the semantic vector indicating semantic intervals, which are the intervals obtained by dividing the emotion time series in the time direction under a division condition, the division condition being a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series, and
the mapping is updated by the learning.

(Appendix 2)
The learning device according to Appendix 1, wherein one of the emotion time series is a time series of the utterances of the estimation target.

(Appendix 3)
The learning device according to Appendix 2, wherein another one of the emotion time series is a time series of the sound of the utterances.

(Appendix 4)
The learning device according to Appendix 2 or 3, wherein another one of the emotion time series is a video showing the estimation target while uttering the utterances.

(Appendix 5)
The learning device according to any one of Appendices 1 to 4, wherein
after the common information acquisition process, the mathematical model executes a time-resolved embedding process, which is a process that changes the content of semantic interval length information of an input vector that indicates a time series and has the semantic interval length information, the semantic interval length information being information indicating the time length of the semantic intervals, and which converts the length of each semantic interval indicated by the semantic interval length information to the same length regardless of the semantic interval, and
the length of each semantic interval after conversion by the time-resolved embedding process is the length obtained by dividing the time-direction length of the time series indicated by the input vector by the average of the lengths of the utterances included in an utterance time series, which is the time series of the utterances of the estimation target.

(Appendix 6)
An estimation device comprising:
a target acquisition unit that acquires an emotion time series, which is a time series that depends on the emotion of an estimation target during utterance; and
an estimation unit that estimates, using a learned mathematical model obtained by a learning device, the emotion of the estimation target that satisfies the condition that the time series relating to the utterances the estimation target has uttered and depending on the estimation target's own emotion is the emotion time series acquired by the target acquisition unit,
wherein the learning device includes a learning unit that performs learning of the mathematical model, which estimates the emotion of the estimation target based on the emotion time series; the mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors, each integrated vector being an emotion time series vector, which indicates an emotion time series, to which a semantic vector has been assigned, the semantic vector indicating semantic intervals, which are the intervals obtained by dividing the emotion time series in the time direction under a division condition, the division condition being a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series; and the mapping is updated by the learning.

(Appendix 7)
A learning method comprising:
a learning step of performing learning of a mathematical model that estimates the emotion of an estimation target based on an emotion time series, which is a time series that depends on the emotion of the estimation target during utterance,
wherein the mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors, each integrated vector being an emotion time series vector, which indicates an emotion time series, to which a semantic vector has been assigned, the semantic vector indicating semantic intervals, which are the intervals obtained by dividing the emotion time series in the time direction under a division condition, the division condition being a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series, and
the mapping is updated by the learning.

(Appendix 8)
An estimation method comprising:
a target acquisition step of acquiring an emotion time series, which is a time series that depends on the emotion of an estimation target during utterance; and
an estimation step of estimating, using a learned mathematical model obtained by a learning device, the emotion of the estimation target that satisfies the condition that the time series relating to the utterances the estimation target has uttered and depending on the estimation target's own emotion is the emotion time series acquired in the target acquisition step,
wherein the learning device includes a learning unit that performs learning of the mathematical model, which estimates the emotion of the estimation target based on the emotion time series; the mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors, each integrated vector being an emotion time series vector, which indicates an emotion time series, to which a semantic vector has been assigned, the semantic vector indicating semantic intervals, which are the intervals obtained by dividing the emotion time series in the time direction under a division condition, the division condition being a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series; and the mapping is updated by the learning.

(Appendix 9)
A program for causing a computer to function as the learning device according to any one of Appendices 1 to 5.

(Appendix 10)
A program for causing a computer to function as the estimation device according to Appendix 6.

DESCRIPTION OF SYMBOLS 1...learning device, 2...estimation device, 11...control unit, 12...input unit, 13...communication unit, 14...storage unit, 15...output unit, 110...data acquisition unit, 120...learning unit, 130...storage control unit, 140...communication control unit, 150...output control unit, 21...control unit, 22...input unit, 23...communication unit, 24...storage unit, 25...output unit, 210...target acquisition unit, 220...estimation unit, 230...storage control unit, 240...communication control unit, 250...output control unit, 91...processor, 92...memory, 93...processor, 94...memory

Claims

A learning device comprising:
a learning unit that performs learning of a mathematical model that estimates the emotion of an estimation target based on an emotion time series, which is a time series that depends on the emotion of the estimation target during utterance,
wherein the mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors, each integrated vector being an emotion time series vector, which indicates an emotion time series, to which a semantic vector has been assigned, the semantic vector indicating semantic intervals, which are the intervals obtained by dividing the emotion time series in the time direction under a division condition, the division condition being a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series, and
the mapping is updated by the learning.

The learning device according to claim 1, wherein one of the emotion time series is a time series of the utterances of the estimation target.

The learning device according to claim 2, wherein another one of the emotion time series is a time series of the sound of the utterances.

The learning device according to claim 2 or 3, wherein another one of the emotion time series is a video showing the estimation target while uttering the utterances.

The learning device according to any one of claims 1 to 4, wherein
after the common information acquisition process, the mathematical model executes a time-resolved embedding process, which is a process that changes the content of semantic interval length information of an input vector that indicates a time series and has the semantic interval length information, the semantic interval length information being information indicating the time length of the semantic intervals, and which converts the length of each semantic interval indicated by the semantic interval length information to the same length regardless of the semantic interval, and
the length of each semantic interval after conversion by the time-resolved embedding process is the length obtained by dividing the time-direction length of the time series indicated by the input vector by the average of the lengths of the utterances included in an utterance time series, which is the time series of the utterances of the estimation target.

An estimation device comprising:
a target acquisition unit that acquires an emotion time series, which is a time series that depends on the emotion of an estimation target during utterance; and
an estimation unit that estimates, using a mathematical model learned by a learning unit, the emotion of an estimation target that satisfies the condition that the time series related to emotion and dependent on its own emotion is the emotion time series acquired by the target acquisition unit,
wherein the learning unit trains the mathematical model, which estimates the emotion of an estimation target based on an emotion time series,
wherein the mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors, each integrated vector being an emotion time series vector, which indicates an emotion time series, to which a semantic vector has been applied, the semantic vector indicating a semantic interval, which is an interval obtained by dividing the emotion time series in the time direction under a division condition, the division condition being a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series, and
wherein the mapping is updated by the learning.

A learning method comprising:
a learning step of training a mathematical model that estimates the emotion of an estimation target based on an emotion time series, which is a time series that depends on the emotion of the estimation target during utterance,
wherein the mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors, each integrated vector being an emotion time series vector, which indicates an emotion time series, to which a semantic vector has been applied, the semantic vector indicating a semantic interval, which is an interval obtained by dividing the emotion time series in the time direction under a division condition, the division condition being a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series, and
wherein the mapping is updated by the learning.

An estimation method comprising:
a target acquisition step of acquiring an emotion time series, which is a time series that depends on the emotion of an estimation target during utterance; and
an estimation step of estimating, using a mathematical model learned by a learning unit, the emotion of an estimation target that satisfies the condition that the time series related to emotion and dependent on its own emotion is the emotion time series acquired in the target acquisition step,
wherein the learning unit trains the mathematical model, which estimates the emotion of an estimation target based on an emotion time series,
wherein the mathematical model executes a common information acquisition process that applies the same mapping to two or more types of integrated vectors, each integrated vector being an emotion time series vector, which indicates an emotion time series, to which a semantic vector has been applied, the semantic vector indicating a semantic interval, which is an interval obtained by dividing the emotion time series in the time direction under a division condition, the division condition being a condition on the division of the emotion time series that is predetermined according to the type of the emotion time series, and
wherein the mapping is updated by the learning.

A program for causing a computer to function as the learning device according to claim 1.

A program for causing a computer to function as the estimation device according to claim 6.
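
As an informal illustration of the common information acquisition process recited in claim 1, the following Python sketch applies one shared mapping to the integrated vectors of two emotion time series and updates that mapping with a single learning step. It is not taken from the specification: the function names (integrate, apply_mapping, predict, learning_step, loss), the use of element-wise addition to apply a semantic vector, the linear form of the mapping, the interval boundaries, the toy data and the squared-error loss are all assumptions made for the example.

```python
# Illustrative sketch only: names, shapes, data and the toy loss are assumptions.

def integrate(series, boundaries, semantic_vectors):
    """Split `series` at the end indices in `boundaries` (one semantic interval
    per boundary) and add that interval's semantic vector to every step in it."""
    integrated, start = [], 0
    for end, sem in zip(boundaries, semantic_vectors):
        for step in series[start:end]:
            integrated.append([x + s for x, s in zip(step, sem)])
        start = end
    return integrated

def apply_mapping(W, vectors):
    """Apply one linear mapping W (a list of rows) to every vector."""
    return [[sum(w * x for w, x in zip(row, v)) for row in W] for v in vectors]

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def predict(W, modalities):
    """Toy scalar score: sum of the mean-pooled outputs of the shared mapping."""
    return sum(sum(mean_vector(apply_mapping(W, m))) for m in modalities)

def loss(W, modalities, target):
    """Squared error between the toy score and a target value."""
    return 0.5 * (predict(W, modalities) - target) ** 2

def learning_step(W, modalities, target, lr=0.1):
    """One gradient-descent step on the squared error with respect to W."""
    error = predict(W, modalities) - target
    # d(score)/dW[i][j] equals the j-th component of the summed modality means.
    m = [sum(c) for c in zip(*(mean_vector(mod) for mod in modalities))]
    return [[w - lr * error * m[j] for j, w in enumerate(row)] for row in W]

# Two emotion time series of different types (hypothetical feature values).
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
audio = [[0.2, 0.4], [0.4, 0.2], [0.6, 0.6], [0.0, 0.2], [0.2, 0.0], [0.4, 0.4]]
semantic = [[0.1, 0.1], [0.2, 0.2]]  # one semantic vector per semantic interval

# Each type is divided under its own (assumed) division condition.
text_int = integrate(text, [2, 4], semantic)
audio_int = integrate(audio, [3, 6], semantic)
modalities = [text_int, audio_int]

W = [[1.0, 0.0], [0.0, 1.0]]  # the single mapping shared by both types
before = loss(W, modalities, target=1.0)
W = learning_step(W, modalities, target=1.0)
after = loss(W, modalities, target=1.0)
print(before, after)
```

With these toy values the loss after the step is smaller than before it, and the same matrix W is applied to both integrated vectors throughout, which is the point being illustrated.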