JP7279287B2

JP7279287B2 - Emotion estimation device and emotion estimation system

Info

Publication number: JP7279287B2
Application number: JP2019106848A
Authority: JP
Inventors: 秀行窪田; 博子進藤
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2019-06-07
Filing date: 2019-06-07
Publication date: 2023-05-23
Anticipated expiration: 2039-06-07
Also published as: JP2020201334A

Description

The present invention relates to an emotion estimation device and an emotion estimation system.

In recent years, services that estimate emotions such as joy, anger, and sadness have become widespread. For example, Patent Literature 1 discloses an emotion estimation device that estimates the emotion of a user based on voice information representing the user's voice. From voice information input multiple times by a single user, this device calculates in advance the mean and standard deviation of each of a plurality of feature quantities, such as frequency, volume, and speed, which are data specific to the individual user whose speech is recognized. When estimating the user's emotion, the device normalizes the feature quantities of the input voice information using the precalculated means and standard deviations, and estimates the user's emotion based on the normalized feature quantities.

Patent Literature 1: JP-A-2006-259641

However, when the conventional technique described above is applied to a device that estimates a user's emotion using a learning model that has learned the relationship between emotions and a plurality of feature quantities based on voice information, a learning model has to be prepared for each user. If a general-purpose learning model trained on the voice information of many users is used instead, the difference between the average voice characteristics of those many users and the voice characteristics of the individual user is not absorbed, so the user's emotion cannot be estimated accurately.

An emotion estimation device according to a preferred aspect of the present invention includes: a voice evaluation unit that inputs a plurality of feature quantities based on voice information representing a user's voice into a learning model that has learned, for a plurality of people, the relationship between a plurality of feature quantities corresponding to human speech and the intensity of each of a plurality of emotions felt by the person who uttered the speech, and that acquires from the learning model voice emotion information including voice evaluation values indicating the intensity of each of the plurality of emotions felt by the user; a correction unit that generates corrected emotion information by correcting the voice emotion information using correction information based on the characteristics of the user's voice; and an estimation unit that estimates, based on the corrected emotion information, one or more emotions felt by the user from among the plurality of emotions.

An emotion estimation system according to a preferred aspect of the present invention includes a server device and a terminal device capable of communicating with the server device. The server device includes: a first communication device that receives sound information representing sound that includes a user's voice; a noise removal unit that removes noise from the sound represented by the sound information to generate voice information representing the user's voice; a voice evaluation unit that inputs a plurality of feature quantities based on the voice information into a learning model that has learned, for a plurality of people, the relationship between a plurality of feature quantities corresponding to human speech and the intensity of each of a plurality of emotions felt by the person who uttered the speech, and that acquires from the learning model voice emotion information including voice evaluation values indicating the intensity of each of the plurality of emotions felt by the user; and a character evaluation unit that performs, on the sound information, speech recognition processing that recognizes the content of speech uttered by a person, and that generates, based on a recognized character string representing the recognition result, character emotion information including character evaluation values indicating the intensity of each of the plurality of emotions felt by the user. The first communication device transmits the character emotion information and the voice emotion information to the terminal device. The terminal device includes: a sound collection device that collects sound including the user's voice; a second communication device that transmits the sound information output by the sound collection device to the server device and receives the character emotion information and the voice emotion information from the server device; a correction unit that generates corrected emotion information by correcting the voice emotion information using correction information based on the characteristics of the user's voice; and an estimation unit that estimates one or more emotions felt by the user based on the corrected emotion information and the character emotion information.

According to the present invention, the emotion felt by a user can be estimated with high accuracy even when a learning model trained on the voice information of a plurality of people is used.

The drawings are described below in the order in which they are listed in the source.
A diagram showing an overview of the functions of the user device 1.
A block diagram showing the configuration of the user device 1 according to the first embodiment.
A diagram showing an example of the contents stored in the analysis dictionary information 31.
A diagram showing an example of the contents stored in the emotion classification information 33.
A diagram showing an overview of the functions of the user device 1.
A flowchart showing the operation of the user device 1.
A block diagram showing the user device 1a according to the second embodiment.
A diagram showing an overview of the functions of the user device 1a according to the second embodiment.
A flowchart showing the operation of the user device 1a in calibration mode.
A diagram showing the overall configuration of the emotion estimation system SYS.
A block diagram showing the configuration of the user device 1b.
A block diagram showing the configuration of the server device 10.
A diagram showing an overview of the functions of the emotion estimation system SYS.
A diagram showing an overview of the function for adjusting the correction information CI of a non-calibration user.
A flowchart showing the operation of the emotion estimation system SYS in emotion estimation mode (part 1).
A flowchart showing the operation of the emotion estimation system SYS in emotion estimation mode (part 2).
A diagram showing the overall configuration of the emotion estimation system SYSc.
A block diagram showing the configuration of the server device 10C.
A diagram showing an overview of the functions of the emotion estimation system SYSc.
A diagram showing an overview of the function for adjusting the parameter information TI of a non-calibration user.
A flowchart showing the operation of the emotion estimation system SYSc in emotion estimation mode.
A diagram showing the overall configuration of the emotion estimation system SYSd.
A block diagram showing the configuration of the user device 1d.
A block diagram showing the configuration of the server device 10D.
A diagram showing an overview of the functions of the emotion estimation system SYSd.
A diagram showing the overall configuration of the emotion estimation system SYSe.
A block diagram showing the configuration of the user device 1e.
A diagram showing an overview of the functions of the emotion estimation system SYSe.
A diagram showing the overall configuration of the emotion estimation system SYSf.
A block diagram showing the configuration of the user device 1f.
A block diagram showing the configuration of the server device 10F.
A diagram showing an overview of the functions of the first emotion estimation unit 25f and the second emotion estimation unit 25F.
A diagram showing an overview of the functions of the user device 1g in the first modification.

1. First Embodiment
FIG. 1 is a diagram showing an overview of the functions of the user device 1. The user device 1 is assumed to be a smartphone. The user device 1 is an example of the "emotion estimation device". However, any information processing device can be used as the user device 1. For example, it may be a terminal-type information device such as a personal computer, or a portable information terminal such as a notebook computer, a wearable terminal, or a tablet terminal.

The user device 1 performs speech recognition processing on sound information representing sound that includes the voice of the user U who carries the user device 1. It has a function of transmitting the resulting recognized character string to a device used by another person, or a function of emitting sound representing the recognized character string so that other people near the user U can hear it. In addition, the user device 1 estimates the emotion felt by the user U based on the voice of the user U. By adding a graphic corresponding to the estimated emotion to the recognized character string, or by emitting the sound representing the recognized character string with an intonation corresponding to the estimated emotion, the user device 1 can add the emotional expression needed for communication.
In the example of FIG. 1, the user U says "こんにちは" ("Hello"), and the user device 1 adds a graphic PI corresponding to the estimated emotion.

FIG. 2 is a block diagram showing the configuration of the user device 1 according to the first embodiment. The user device 1 is implemented as a computer system that includes a processing device 2, a storage device 3, a display device 4, an operation device 5, a communication device 6, a sound emitting device 7, and a sound collection device 8. The elements of the user device 1 are interconnected by one or more buses 9 for communicating information. The term "device" in this specification may be read as another term such as circuit, device, or unit. Each element of the user device 1 may consist of one or more pieces of equipment, and some elements of the user device 1 may be omitted.

The processing device 2 is a processor that controls the user device 1 as a whole, and consists of, for example, one or more chips. The processing device 2 is, for example, a central processing unit (CPU) that includes interfaces with peripheral devices, an arithmetic unit, registers, and the like. Some or all of the functions of the processing device 2 may be implemented in hardware such as a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The processing device 2 executes various processes in parallel or sequentially.

The storage device 3 is a recording medium readable by the processing device 2. It stores a plurality of programs, including the control program PR executed by the processing device 2, as well as the analysis dictionary information 31, the emotion classification information 33, and the learning model LM. The storage device 3 consists of, for example, one or more types of storage circuits such as ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), and RAM (Random Access Memory).

FIG. 3 is a diagram showing an example of the contents stored in the analysis dictionary information 31. The analysis dictionary information 31 associates, for each morpheme, a part of speech, a part-of-speech subcategory, and base form information. A morpheme is a character string that is the smallest unit of expression carrying meaning. A part of speech is a word class defined by grammatical properties, such as noun, verb, or adjective. The part-of-speech subcategory is a finer classification within the part of speech. The base form information is the character string for the base form of the word if the morpheme is a word that inflects, and is the same character string as the morpheme itself if the morpheme is a word that does not inflect.

FIG. 4 is a diagram showing an example of the contents stored in the emotion classification information 33. The emotion classification information 33 classifies character strings into one of joy, anger, sadness, and normal. Each character string registered in the emotion classification information 33 expresses one of these four emotions. In the example of FIG. 4, the character string group 331 classified as joy includes "嬉しい" (happy), "合格" (passing), "勝つ" (win), and "勝っ" (an inflected form of "win"). Similarly, the character string group 332 classified as anger includes "イライラ" (irritation) and "むかっ腹" (a fit of anger). The character string group 333 classified as sadness includes "悲しい" (sad) and "敗ける" (lose). The character string group 334 classified as normal includes "安心" (relief).

The description now returns to FIG. 2. The learning model LM has learned the relationship between a plurality of feature quantities corresponding to human speech and the intensity of each of a plurality of emotions.

The display device 4 displays various images under the control of the processing device 2. Various display panels, such as a liquid crystal display panel or an organic EL (Electro Luminescence) display panel, are suitable for use as the display device 4.

The operation device 5 is equipment for inputting information used by the user device 1. The operation device 5 accepts operations by the user U. Specifically, the operation device 5 accepts operations for inputting symbols such as numbers and letters, and operations for selecting an icon displayed by the display device 4. For example, a touch panel that detects contact with the display surface of the display device 4 is suitable as the operation device 5. The operation device 5 may also include a control that the user can operate, for example a stylus.

The communication device 6 is hardware (a transmitting and receiving device) for communicating with other devices via a network. The communication device 6 is also called, for example, a network device, a network controller, a network card, or a communication module.

The sound emitting device 7 consists of, for example, a speaker, and emits sound under the control of the processing device 2. The sound collection device 8 consists of, for example, a microphone and an A/D converter, and collects sound including the voice of the user U under the control of the processing device 2. The microphone converts the collected sound into an electrical signal. The A/D converter performs analog-to-digital conversion on that electrical signal to produce the sound information SI shown in FIG. 5. The sound represented by the sound information SI may include noise from the speaker's surroundings in addition to the speaker's voice.

1.1. Functions of the First Embodiment
By reading the control program PR from the storage device 3 and executing it, the processing device 2 functions as an acquisition unit 21, an emotion estimation unit 25, and an output unit 26.
The functions implemented by the processing device 2 are described below with reference to FIG. 5.

FIG. 5 is a diagram showing an overview of the functions of the user device 1. The acquisition unit 21 acquires the sound information SI output by the sound collection device 8, which collects sound including the voice of the user U. The emotion estimation unit 25 estimates, from among a plurality of emotions, one or more emotions felt by the user U. In the first embodiment, the plurality of emotions are described as the four emotions of joy, anger, sadness, and normal. These four are one example of the plurality of emotions.

The emotion estimation unit 25 includes a noise removal unit 251, a voice evaluation unit 252, a correction unit 253, a character evaluation unit 256, and an estimation unit 258.

The noise removal unit 251 removes noise from the sound represented by the sound information SI to generate voice information VI. The noise removal unit 251 is given, for example, a first parameter P1 and a second parameter P2. The first parameter P1 specifies the frequency bands to be regarded as noise. The second parameter P2 specifies the magnitude of amplitude components to be regarded as noise. The noise removal unit 251 executes a first through a fourth process. In the first process, a fast Fourier transform is applied to the sound information SI to calculate an amplitude component for each of a plurality of frequency bands. In the second process, the amplitude components of the frequency bands specified by the first parameter P1 are reduced. The frequency of human speech is roughly between 100 Hz and 2000 Hz, and the first parameter P1 specifies a lower limit frequency and an upper limit frequency. By using the first parameter P1, the noise removal unit 251 therefore reduces the amplitude components in the frequency band at or below the lower limit frequency and in the frequency band at or above the upper limit frequency. In the third process, amplitude components whose magnitude is at or below the value specified by the second parameter P2 are reduced. In the fourth process, an inverse Fourier transform is applied to the result of the third process to generate the voice information VI. The voice information VI represents the voice of the user U, obtained by removing environmental noise and the like from the sound information SI.
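The four processes can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it uses a plain discrete Fourier transform from the Python standard library in place of a fast Fourier transform (the transform is the same, only slower), it "reduces" components by setting them to zero, and the function names and the default values of `p1` and `p2` are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Discrete Fourier transform (stand-in for the FFT of the first process)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def remove_noise(si, sample_rate, p1=(100.0, 2000.0), p2=1e-3):
    """Turn sound information SI into voice information VI.

    p1 = (lower limit, upper limit) in Hz and p2 = amplitude floor are
    illustrative assumptions; the source does not give concrete values for P2.
    """
    n = len(si)
    low_hz, high_hz = p1
    spectrum = dft(si)                          # first process
    for k in range(n):
        # Bin k and its mirror bin n - k describe the same physical frequency.
        freq = min(k, n - k) * sample_rate / n
        if freq <= low_hz or freq >= high_hz:   # second process (P1)
            spectrum[k] = 0j
        elif abs(spectrum[k]) / n <= p2:        # third process (P2)
            spectrum[k] = 0j
    return idft(spectrum)                       # fourth process
```

Zeroing a bin together with its mirror bin keeps the spectrum symmetric, so the reconstructed signal stays real-valued. A production system would more likely attenuate than zero the components and would process the signal in short overlapping frames.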

The voice evaluation unit 252 inputs a plurality of feature quantities based on the voice information VI into the learning model LM, and acquires from the learning model LM voice emotion information VE that includes voice evaluation values x indicating the intensity of each of the plurality of emotions.
The learning model LM has learned, for a plurality of people, the relationship between a plurality of feature quantities corresponding to human speech and the intensity of each of a plurality of emotions felt by the person who uttered the speech. During training, the learning model LM learns from a large amount of training data. Each item of training data is a pair consisting of a plurality of feature quantities (the input data) and the intensity of each of the plurality of emotions (the label data). The training data is generated from the voice information VI of many users. In other words, the learning model LM is a general-purpose model that is not tuned to any specific individual.
The plurality of feature quantities are acoustic features: for example, 12-dimensional MFCCs (Mel-Frequency Cepstrum Coefficients), loudness, fundamental frequency (F0), voicing probability, zero-crossing rate, and HNR (Harmonics-to-Noise Ratio), together with the first derivatives of all of these and the second derivatives of the MFCCs and the loudness, for a total of 47 features. Loudness is the perceived magnitude of a sound, that is, the intensity of the sound as sensed by human hearing. The voicing probability is the probability that the sound represented by the voice information VI contains speech. The zero-crossing rate is the number of times the sound pressure becomes zero.
The voice evaluation unit 252 generates the plurality of feature quantities by applying acoustic feature extraction to the voice information VI.
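The total of 47 follows from 17 base values (12 MFCCs plus five scalar features), 17 first derivatives, and 13 second derivatives (12 MFCCs plus loudness). The short sketch below checks that arithmetic and shows one common way to count zero crossings. The names are hypothetical, and counting sign changes between adjacent samples is an assumption, since the source only defines the feature as the number of times the sound pressure becomes zero.

```python
# Dimensions of the base features named in the text.
BASE_FEATURES = {
    "mfcc": 12,
    "loudness": 1,
    "f0": 1,
    "voicing_probability": 1,
    "zero_crossing_rate": 1,
    "hnr": 1,
}
# Only these features also contribute a second derivative.
SECOND_DERIVATIVE_OF = ("mfcc", "loudness")


def feature_dimension():
    """Total number of feature quantities fed to the learning model LM."""
    base = sum(BASE_FEATURES.values())                               # 17
    first = base                                                     # 17 first derivatives
    second = sum(BASE_FEATURES[name] for name in SECOND_DERIVATIVE_OF)  # 13
    return base + first + second


def zero_crossings(samples):
    """Count sign changes between adjacent samples (assumed definition)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
```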

The voice emotion information VE includes a voice evaluation value x1 for joy, a voice evaluation value x2 for anger, a voice evaluation value x3 for sadness, and a voice evaluation value x4 for normal. Each voice evaluation value x is a real number greater than or equal to 0. In the following description, when elements of the same kind need to be distinguished, the full reference signs are used, as in "voice evaluation value x1 for joy" and "voice evaluation value x2 for anger". When elements of the same kind do not need to be distinguished, only the common part of the reference sign is used, as in "voice evaluation value x".

The correction unit 253 generates corrected emotion information CVE by correcting the voice emotion information VE using correction information CI based on the characteristics of the voice of the user U. The correction information CI includes, for example, a coefficient k1 that corrects the voice evaluation value x1 for joy, a coefficient k2 that corrects the voice evaluation value x2 for anger, a coefficient k3 that corrects the voice evaluation value x3 for sadness, and a coefficient k4 that corrects the voice evaluation value x4 for normal. The coefficients k1 to k4 are real numbers greater than or equal to 0. The corrected emotion information CVE includes a voice evaluation value X1 for joy, a voice evaluation value X2 for anger, a voice evaluation value X3 for sadness, and a voice evaluation value X4 for normal. The correction unit 253 generates the corrected emotion information CVE according to, for example, the following formulas.

X1 = x1 × k1
X2 = x2 × k2
X3 = x3 × k3
X4 = x4 × k4
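As a worked illustration of these formulas, the sketch below applies per-emotion coefficients to a set of voice evaluation values. The dictionary layout, the English emotion keys, and the sample numbers are assumptions made for the example; only the element-wise multiplication comes from the text.

```python
EMOTIONS = ("joy", "anger", "sadness", "normal")


def correct(ve, ci):
    """Return corrected emotion information CVE, with X_i = x_i * k_i per emotion."""
    return {emotion: ve[emotion] * ci[emotion] for emotion in EMOTIONS}


# Hypothetical values: a speaker whose voice reads as "joy" to the generic model.
ve = {"joy": 0.6, "anger": 0.3, "sadness": 0.05, "normal": 0.4}
ci = {"joy": 0.5, "anger": 0.5, "sadness": 1.0, "normal": 1.0}
cve = correct(ve, ci)
```

With these sample numbers the strongest emotion before correction is joy (0.6), while after correction it is normal (0.4 against 0.3 for joy), which is the kind of shift the correction information is meant to produce for a speaker with a naturally high-pitched voice.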

The correction information CI can be generated in, for example, the following two modes. In the first mode, the user U speaks toward the sound collection device 8 while in a normal state. The processing device 2 extracts a plurality of feature quantities from the voice information VI for that utterance and generates the coefficients k1 to k4 by comparing the extracted feature quantities with predetermined thresholds. For example, if the extracted fundamental frequency is higher than a predetermined threshold, the user U has a relatively high fundamental frequency even in a normal state, and the emotion of the user U is likely to be misjudged as joy or anger. The processing device 2 therefore sets the coefficient k1 for joy and the coefficient k2 for anger to values greater than 0 and less than 1, in order to lower the voice evaluation value X1 for joy and the voice evaluation value X2 for anger.

In the second mode, the processing device 2 has the user U input information about the characteristics of his or her own voice. For example, the processing device 2 has the user U input gender and age as this information. Women generally have a higher fundamental frequency than men, so if the gender is female, the processing device 2 sets the coefficient k1 for joy and the coefficient k2 for anger to values greater than 0 and less than 1, in order to lower the voice evaluation value X1 for joy and the voice evaluation value X2 for anger. Similarly, the voice is generally higher the younger the speaker is, so if the input age is at or below a predetermined threshold, the processing device 2 sets the coefficient k1 for joy and the coefficient k2 for anger to values greater than 0 and less than 1 for the same purpose.
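The two modes can be sketched as below. The source only states that k1 and k2 are set to values greater than 0 and less than 1; the concrete damping value of 0.8, the thresholds of 200 Hz and 12 years, the default coefficient of 1.0 for unaffected emotions, and all function names are assumptions chosen for illustration.

```python
def _neutral_ci():
    # Assumption: a coefficient of 1.0 leaves an evaluation value unchanged.
    return {"joy": 1.0, "anger": 1.0, "sadness": 1.0, "normal": 1.0}


def ci_from_calibration(f0_hz, f0_threshold_hz=200.0, damping=0.8):
    """First mode: derive CI from an utterance recorded in a normal state."""
    ci = _neutral_ci()
    if f0_hz > f0_threshold_hz:
        # A naturally high-pitched voice tends to be misread as joy or anger.
        ci["joy"] = ci["anger"] = damping
    return ci


def ci_from_profile(gender, age, age_threshold=12, damping=0.8):
    """Second mode: derive CI from self-reported gender and age."""
    ci = _neutral_ci()
    if gender == "female" or age <= age_threshold:
        ci["joy"] = ci["anger"] = damping
    return ci
```

In practice the first mode could compare several features against their own thresholds rather than the fundamental frequency alone, as the text allows for a plurality of feature quantities.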

The character evaluation unit 256 performs, on the sound information SI, speech recognition processing that recognizes the content of speech uttered by a person. Based on the recognized character string RT representing the recognition result, it generates character emotion information TE that includes character evaluation values Y indicating the intensity of each of the plurality of emotions. The character emotion information TE includes a character evaluation value Y1 for joy, a character evaluation value Y2 for anger, a character evaluation value Y3 for sadness, and a character evaluation value Y4 for normal. Each character evaluation value Y is a real number greater than or equal to 0.

More specifically, the character evaluation unit 256 includes a speech recognition processing unit 2561, a morphological analysis processing unit 2563, and an evaluation value calculation unit 2565.
The speech recognition processing unit 2561 performs speech recognition processing on the sound information SI and outputs the recognized character string RT. The speech recognition processing unit 2561 may use any of various methods to output the recognized character string RT, including, for example, a method that recognizes a character string from speech using an acoustic model and a language model prepared in advance.

The morphological analysis processing unit 2563 refers to the analysis dictionary information 31, executes morphological analysis processing on the recognized character string RT, and outputs a corrected recognized character string CRT. The morphological analysis processing decomposes the recognized character string RT into morphemes, using the parts of speech and part-of-speech subcategories in the analysis dictionary information 31. The corrected recognized character string CRT is the character string that remains after removing character strings, such as fillers, that are unnecessary for estimating the emotion of the user U. Fillers are words inserted between utterances, such as "ええと" ("um"), "あの" ("uh"), and "まあ" ("well").
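
As a rough illustration of how the corrected recognized character string CRT could be derived, the sketch below filters out morphemes tagged as fillers. The input format (pairs of surface form and part of speech), the tag name "filler", and the romanized tokens are assumptions made for the example; the description does not specify the analyzer or its output format.

```python
# Hypothetical analyzer output: (surface form, part of speech) pairs.
FILLER_POS = "filler"


def remove_fillers(morphemes):
    """Keep the surface forms of all morphemes that are not fillers."""
    return [surface for surface, pos in morphemes if pos != FILLER_POS]


# Recognized string RT: "eeto kyou shiai ni katta" ("um, I won the game today").
rt_morphemes = [
    ("eeto", "filler"),
    ("kyou", "noun"),
    ("shiai", "noun"),
    ("ni", "particle"),
    ("kat", "verb"),
    ("ta", "auxiliary"),
]
crt_morphemes = remove_fillers(rt_morphemes)
```

Keeping the result as a list of morphemes rather than a joined string makes the later dictionary comparison straightforward.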

The evaluation value calculation unit 2565 calculates the character evaluation value Y of each emotion by comparing the character strings included in the emotion classification information 33 with the corrected recognized character string CRT, and generates text emotion information TE including the character evaluation value Y of each emotion. Specifically, when the corrected recognized character string CRT contains a character string included in the emotion classification information 33, the evaluation value calculation unit 2565 increases the character evaluation value Y of the emotion corresponding to that character string.
For example, if the corrected recognized character string CRT is "今日試合に勝った" ("I won the game today"), the evaluation value calculation unit 2565 outputs the following character evaluation values Y for each emotion.

Character evaluation value of joy Y1: 1
Character evaluation value of anger Y2: 0
Character evaluation value of sadness Y3: 0
Character evaluation value of normal Y4: 0

In the above example, the corrected recognized character string CRT contains "勝っ" ("won"), which is included in the emotion classification information 33, so the evaluation value calculation unit 2565 increases the character evaluation value Y1 of joy, which corresponds to "勝っ", by 1. The increment of the character evaluation value Y is not limited to 1 and may differ for each character string included in the emotion classification information 33. For example, the increment for a character string that indicates joy more strongly may be 2. Furthermore, when the corrected recognized character string CRT contains both a character string included in the emotion classification information 33 and a character string that emphasizes the content, the evaluation value calculation unit 2565 may use a larger increment. For example, if the corrected recognized character string CRT is "今日試合に勝ててとても嬉しい" ("I am very happy that I won the game today"), it contains "嬉しい" ("happy"), which is included in the emotion classification information 33, together with the emphasizing character string "とても" ("very"), so the evaluation value calculation unit 2565 increases the character evaluation value Y1 of joy by 2, for example. Whether a character string in the corrected recognized character string CRT is an emphasizing character string can be determined from the morphemes obtained by the morphological analysis processing. In the following examples, the increment of the character evaluation value Y is assumed to be 1 for ease of explanation.
Furthermore, when the corrected recognized character string CRT contains both a character string included in the emotion classification information 33 and a character string that negates the content, the evaluation value calculation unit 2565 may execute processing other than increasing the character evaluation value Y corresponding to the contained character string. For example, if the corrected recognized character string CRT is "今日試合に勝つことができなかった" ("I could not win the game today"), it contains "勝つ" ("win"), which is included in the emotion classification information 33, but it also contains the negating character string "なかっ" ("not"), so the evaluation value calculation unit 2565 increases the character evaluation value Y3 of sadness by 1, for example. Whether a character string in the corrected recognized character string CRT is a negating character string can be determined from the morphemes obtained by the morphological analysis processing. In this way, the morphological analysis processing makes it possible to estimate whether the content of the corrected recognized character string CRT is affirmative or negative. In the following examples, for ease of explanation, the character evaluation value Y corresponding to a character string is simply increased whenever the corrected recognized character string CRT contains a character string included in the emotion classification information 33.
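
The scoring rules described above (dictionary match, larger increment under emphasis, different handling under negation) can be sketched as follows. The English word lists stand in for the emotion classification information 33 and are hypothetical, as are the function name and the choice to treat emphasis and negation as properties of the whole utterance. Counting a negated joy word toward sadness follows the single example given in the description; ignoring other negated matches is an assumption of this sketch.

```python
EMOTIONS = ("joy", "anger", "sadness", "normal")

# Hypothetical stand-ins for the emotion classification information 33.
EMOTION_WORDS = {"win": "joy", "happy": "joy", "lose": "sadness"}
INTENSIFIERS = {"very"}
NEGATIONS = {"not"}


def character_evaluation(morphemes):
    """Return character evaluation values Y for a list of morphemes."""
    y = dict.fromkeys(EMOTIONS, 0)
    emphasized = any(m in INTENSIFIERS for m in morphemes)
    negated = any(m in NEGATIONS for m in morphemes)
    for m in morphemes:
        emotion = EMOTION_WORDS.get(m)
        if emotion is None:
            continue
        if negated:
            # Assumption: a negated joy word counts toward sadness,
            # and any other negated match is ignored.
            if emotion == "joy":
                y["sadness"] += 1
            continue
        # Emphasis raises the increment from 1 to 2.
        y[emotion] += 2 if emphasized else 1
    return y


plain = character_evaluation(["today", "game", "win"])
negated = character_evaluation(["today", "game", "win", "not"])
emphasized = character_evaluation(["very", "happy"])
```

A production implementation would more likely scope emphasis and negation to the morphemes they actually modify instead of the whole utterance.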

The estimation unit 258 estimates one or more emotions of the user U based on the corrected emotion information CVE and the text emotion information TE. For example, the estimation unit 258 adds, emotion by emotion, the voice evaluation values X1 to X4 of the corrected emotion information CVE and the character evaluation values Y1 to Y4 of the text emotion information TE, and thereby calculates a sum for each emotion. The estimation unit 258 compares the sum for each emotion with a threshold and identifies the sums that exceed the threshold. The estimation unit 258 then estimates the one or more emotions corresponding to the identified sums as the one or more emotions of the user U. In the following description, adding the voice evaluation value X and the character evaluation value Y means adding them emotion by emotion to produce four sums.
The estimation unit 258 is not limited to simply adding the voice evaluation value X and the character evaluation value Y. It may instead multiply one of the two evaluation values by a predetermined value α and add the product to the other evaluation value. The predetermined value α is, for example, a value set by the developer of the user device 1 or by the user U.
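
A minimal sketch of this estimation step is shown below. Applying α to the character evaluation value Y (rather than to X), the threshold of 1.0, and all names are choices made for the example; the description leaves them open.

```python
def estimate_emotions(x, y, threshold, alpha=1.0):
    """Combine voice values X and character values Y per emotion.

    Returns the emotions whose combined value exceeds the threshold,
    together with the combined values themselves.
    Assumption: alpha weights Y; the description allows weighting either one.
    """
    totals = {emotion: x[emotion] + alpha * y[emotion] for emotion in x}
    estimated = [emotion for emotion, total in totals.items() if total > threshold]
    return estimated, totals


# Illustrative values only.
X = {"joy": 0.6, "anger": 0.1, "sadness": 0.1, "normal": 0.2}
Y = {"joy": 1, "anger": 0, "sadness": 0, "normal": 0}

emotions, totals = estimate_emotions(X, Y, threshold=1.0)
weak_emotions, _ = estimate_emotions(X, Y, threshold=1.0, alpha=0.2)
```

Because every emotion above the threshold is returned, the result may contain several emotions or none, matching the "one or more emotions" wording only when at least one sum exceeds the threshold.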

The estimation unit 258 outputs estimated emotion information EI indicating the one or more emotions that the user U is estimated to have. The estimated emotion information EI takes, for example, one of the following two forms. In the first form, the estimated emotion information EI consists of identifiers indicating the one or more estimated emotions. The identifiers are an identifier indicating joy, an identifier indicating anger, an identifier indicating sadness, and an identifier indicating normal. In the second form, the estimated emotion information EI consists of the identifiers indicating the one or more estimated emotions together with an evaluation value for each estimated emotion. The evaluation value of an estimated emotion is, for example, the sum of the voice evaluation value X of the corrected emotion information CVE and the character evaluation value Y for that emotion.

The output unit 26 outputs data obtained by executing processing corresponding to the one or more emotions indicated by the estimated emotion information EI. The output unit 26 has, for example, the following two aspects. In the first aspect, the output unit 26 outputs data obtained by executing processing corresponding to the one or more emotions indicated by the estimated emotion information EI on the recognized character string RT obtained by the speech recognition processing unit 2561. The processing corresponding to an emotion has, for example, the following two aspects.
The first aspect of the processing corresponding to an emotion is processing that adds a figure embodying the emotion to the recognized character string RT. Figures embodying an emotion are, for example, pictograms (emoji) and emoticons. A pictogram is an image associated with a character code; the character code is, for example, Unicode. An emoticon is a character string that expresses a face by combining symbols and characters. In the following description, the figure embodying an emotion is assumed to be a pictogram. A pictogram embodying joy is, for example, a pictogram of a smiling face. A pictogram embodying anger is, for example, a pictogram of an angry face. A pictogram embodying sadness is, for example, a pictogram of a crying face. Furthermore, when the estimated emotion information EI is in the second form, the output unit 26 may select, as the pictogram to be added to the recognized character string RT, a pictogram that embodies the emotion indicated by the estimated emotion information EI with an intensity corresponding to the evaluation value included in the estimated emotion information EI. For example, when the emotion indicated by the estimated emotion information EI is sadness and the evaluation value included in the estimated emotion information EI is equal to or less than a predetermined threshold, the output unit 26 selects a pictogram of a face shedding a tear. When the emotion is sadness and the evaluation value is greater than the predetermined threshold, the output unit 26 selects a pictogram of a face crying loudly. The pictogram of a face crying loudly embodies sadness of higher intensity than the pictogram of a face shedding a tear.
The output unit 26 outputs a character string with a pictogram, obtained by adding the pictogram to the recognized character string RT. The pictogram is added at, for example, one of the following two positions. The first position is the end of the recognized character string RT. The second position is immediately after the character string, within the recognized character string RT, that is included in the emotion classification information 33. The display device 4 displays an image based on the character string with a pictogram output by the output unit 26.
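
The pictogram selection and the first insertion position (end of the string) can be sketched as below. The specific Unicode code points, the threshold of 2.0, the decision to add nothing for the normal emotion, and the function names are assumptions for illustration; the description only says that a stronger pictogram is chosen when the evaluation value exceeds a threshold.

```python
# Assumed pictograms, given as Unicode code points.
EMOJI = {
    "joy": "\U0001F604",    # smiling face
    "anger": "\U0001F620",  # angry face
}
SAD_MILD = "\U0001F622"    # crying face (single tear)
SAD_STRONG = "\U0001F62D"  # loudly crying face


def pick_emoji(emotion, value, threshold=2.0):
    """Choose a pictogram; sadness is graded by the evaluation value."""
    if emotion == "sadness":
        return SAD_STRONG if value > threshold else SAD_MILD
    # Assumption: emotions without an entry (e.g. normal) get no pictogram.
    return EMOJI.get(emotion, "")


def append_emoji(recognized_text, emotion, value, threshold=2.0):
    """Append the pictogram at the end of the recognized string RT."""
    return recognized_text + pick_emoji(emotion, value, threshold)


mild = append_emoji("I lost the game", "sadness", 1.0)
strong = append_emoji("I lost the game", "sadness", 3.0)
```

Supporting the second position would additionally require the index of the matched dictionary string inside RT, which this sketch does not track.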

The second aspect of the processing corresponding to an emotion is processing that generates synthesized speech read aloud with emotion-based intonation added. Intonation here means, for example, raising or lowering the reading speed, or raising or lowering the volume. Intonation based on joy is, for example, a higher reading speed. Intonation based on anger is, for example, a higher volume. Intonation based on sadness is, for example, a lower volume. The output unit 26 adds the emotion-based intonation to the synthesized speech indicated by the generated data and outputs information indicating the synthesized speech read aloud with that intonation. The sound emitting device 7 emits the synthesized speech indicated by the data output by the output unit 26.

In the second aspect, the output unit 26 outputs a pictogram embodying the one or more emotions indicated by the estimated emotion information EI. In this aspect, the output unit 26 does not need to use the recognized character string RT. In the following description, the output unit 26 is assumed to be in the first aspect.

1.2. Operation of the First Embodiment
Next, the operation of the user device 1 will be described with reference to FIG. 6.

FIG. 6 is a flowchart showing the operation of the user device 1. The processing device 2 generates the correction information CI according to one of the two methods of generating the correction information CI described above (step S1). Next, the acquisition unit 21 acquires the sound information SI (step S2). The speech recognition processing unit 2561 then performs speech recognition processing on the sound information SI and outputs the recognized character string RT (step S3). Next, the morphological analysis processing unit 2563 refers to the analysis dictionary information 31, executes morphological analysis processing on the recognized character string RT, and outputs the corrected recognized character string CRT (step S4). The evaluation value calculation unit 2565 then calculates the character evaluation value Y of each emotion by comparing the character strings included in the emotion classification information 33 with the corrected recognized character string CRT, and generates the text emotion information TE including the character evaluation value Y of each emotion (step S5).

In addition, the noise removal unit 251 removes noise from the sound indicated by the sound information SI according to the first parameter P1 and the second parameter P2 to generate the voice information VI (step S6). The voice evaluation unit 252 then extracts sound feature amounts from the noise-removed voice information VI (step S7). Next, the voice evaluation unit 252 inputs the sound feature amounts to the learning model LM and acquires, from the learning model LM, the voice emotion information VE including the voice evaluation value x of each emotion (step S8). The correction unit 253 uses the correction information CI to correct the voice evaluation value x of each emotion included in the voice emotion information VE and thereby generates the corrected emotion information CVE (step S9).

The estimation unit 258 estimates the emotion of the user U based on the corrected emotion information CVE and the text emotion information TE, and outputs the estimated emotion information EI (step S10). The output unit 26 outputs information obtained by executing, on the recognized character string RT, processing corresponding to the emotion indicated by the estimated emotion information EI (step S11). After completing step S11, the user device 1 ends the series of processes shown in FIG. 6.

1.3. Effects of the First Embodiment
As described above, the user device 1 estimates the emotion of the user U using the general-purpose learning model LM, so the time required to generate the learning model LM is shorter than when a learning model adjusted for each individual is generated.
If a plurality of feature amounts of an average human voice are input to the general-purpose learning model LM, the emotion of an average person can be estimated. However, the voice of the user U is affected by the gender and age of the user U, the characteristics of the way the user U speaks, and so on, and therefore differs from an average human voice. Consequently, simply using the general-purpose learning model LM lowers the accuracy of determining the emotion of the user U.
In the user device 1 described above, the voice emotion information VE output from the learning model LM is corrected using the correction information CI based on the characteristics of the voice of the user U, so the emotion of the user U can be estimated with high accuracy while still using the general-purpose learning model LM.

In addition, the user device 1 removes noise from the sound indicated by the sound information SI to generate the voice information VI, and inputs sound feature amounts based on the voice information VI to the learning model LM. Inputting sound feature amounts based on the voice information VI yields more accurate voice emotion information VE than inputting sound feature amounts based on the sound information SI.

In addition, the user device 1 estimates the one or more emotions of the user U based on both the corrected emotion information CVE and the text emotion information TE, so it can estimate the emotion of the user U more accurately than when the estimation is based on the corrected emotion information CVE alone.

2. Second Embodiment
The user device 1a according to the second embodiment differs from the user device 1 according to the first embodiment in the following respect: it prompts the user U to utter speech in which an emotion is explicitly expressed, acquires explicit voice emotion information VEa of the user U from the learning model LM, and adjusts the correction information CI so as to increase the likelihood that the estimation unit 258 estimates the emotion of the user U to be the explicitly expressed emotion. In the second embodiment, the user device 1a can operate in an emotion estimation mode, in which it estimates the emotion of the user U, and a calibration mode, in which it adjusts the correction information CI. The emotion estimation mode corresponds to the first embodiment, so its description is omitted. The user device 1a according to the second embodiment is described below. For elements of the second embodiment whose actions or functions are equivalent to those of the first embodiment, the reference signs used in the above description are reused and detailed descriptions are omitted as appropriate.

2.1. Functions of the Second Embodiment
FIG. 7 is a block diagram showing the user device 1a according to the second embodiment. The user device 1a is implemented by a computer system including a processing device 2a, a storage device 3a, a display device 4, an operation device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The storage device 3a is a recording medium readable by the processing device 2a, and stores a plurality of programs including the control program PRa executed by the processing device 2a, the analysis dictionary information 31, and the emotion classification information 33.

The processing device 2a functions as an acquisition unit 21a, an emotion estimation unit 25a, and the output unit 26 by reading the control program PRa from the storage device 3a and executing it.

FIG. 8 is a diagram showing an overview of the functions of the user device 1a according to the second embodiment. The emotion estimation unit 25a includes the noise removal unit 251, the voice evaluation unit 252, the correction unit 253, an adjustment unit 254, the character evaluation unit 256, and the estimation unit 258.

The acquisition unit 21a acquires sound information SIa indicating sound that includes speech in which the user U explicitly expresses one of the plurality of emotions. Specifically, when the user U sets the user device 1a to the calibration mode by operating the operation device 5, the processing device 2 displays on the display device 4 a screen prompting the user U to speak while explicitly expressing one of the plurality of emotions. This "one emotion" is hereinafter referred to as the "explicit emotion". The acquisition unit 21a acquires the sound information SI obtained after the screen is displayed as the sound information SIa indicating sound that includes speech in which the user U expresses the explicit emotion. Which of the plurality of emotions is set as the explicit emotion may, for example, be set in advance by the developer of the user device 1a, or the user U may select the explicit emotion from the plurality of emotions.

The noise removal unit 251 removes noise from the sound indicated by the sound information SIa to generate voice information VIa.

The voice evaluation unit 252 inputs sound feature amounts based on the voice information VIa to the learning model LM and acquires the explicit voice emotion information VEa of the user U from the learning model LM.

The adjustment unit 254 adjusts the correction information CI based on the explicit voice emotion information VEa in order to increase the likelihood that the estimation unit 258 estimates the emotion of the user U to be the explicit emotion. For example, the adjustment unit 254 executes one or both of processing that increases the coefficient k corresponding to the explicit emotion and processing that decreases the coefficients k corresponding to emotions other than the explicit emotion. For example, the adjustment unit 254 generates the coefficients k1 to k4 according to the following equations. Here, for the ideal voice evaluation values X obtained when the user U utters predetermined speech while expressing an emotion, Xa1 denotes the voice evaluation value of joy, Xa2 the voice evaluation value of anger, Xa3 the voice evaluation value of sadness, and Xa4 the voice evaluation value of normal.
k1 = Xa1 / x1
k2 = Xa2 / x2
k3 = Xa3 / x3
k4 = Xa4 / x4
However, the coefficient k does not necessarily have to equal Xa/x.
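
The ratio k = Xa/x can be illustrated with a short sketch. It assumes that the correction multiplies each observed value x by its coefficient k, as the ratio suggests; the numeric values, the function names, and the fallback of 1.0 when an observed value is 0 are assumptions for the example and are not taken from the description.

```python
EMOTIONS = ("joy", "anger", "sadness", "normal")


def adjust_coefficients(ideal, observed):
    """Compute k = Xa / x per emotion.

    Assumption: if an observed value is 0 the ratio is undefined,
    so the coefficient is left at 1.0.
    """
    return {
        e: (ideal[e] / observed[e]) if observed[e] > 0 else 1.0
        for e in EMOTIONS
    }


def apply_correction(observed, k):
    """Assumed correction: corrected X = k * x for each emotion."""
    return {e: k[e] * observed[e] for e in EMOTIONS}


# Illustrative values: the user expresses joy only weakly during calibration.
ideal = {"joy": 0.8, "anger": 0.1, "sadness": 0.05, "normal": 0.05}
observed = {"joy": 0.4, "anger": 0.2, "sadness": 0.2, "normal": 0.2}

k = adjust_coefficients(ideal, observed)
corrected = apply_correction(observed, k)
```

In this example the joy coefficient becomes larger than 1 and the others smaller than 1, which is the direction of adjustment the paragraph above describes for an explicit emotion of joy.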

2.2. Operation of the Second Embodiment
Next, the operation of the user device 1a in the calibration mode will be described with reference to FIG. 9.

FIG. 9 is a flowchart showing the operation of the user device 1a in the calibration mode. The acquisition unit 21a acquires the sound information SIa indicating sound that includes speech in which the user U expresses the explicit emotion (step S21). Next, the noise removal unit 251 removes noise according to the first parameter P1 and the second parameter P2 to generate the voice information VIa (step S22). The voice evaluation unit 252 then extracts sound feature amounts from the noise-removed voice information VIa (step S23). Next, the voice evaluation unit 252 inputs the sound feature amounts to the learning model LM and acquires, from the learning model LM, the voice emotion information VEa including the voice evaluation value x of each emotion (step S24).

The adjustment unit 254 corrects the plurality of voice evaluation values x included in the explicit voice emotion information VEa by the same method as the correction unit 253 (step S25). Next, the adjustment unit 254 adjusts the correction information CI in order to increase the likelihood that the estimation unit 258 estimates the emotion of the user U to be the explicit emotion (step S26). After completing step S26, the user device 1a ends the series of processes shown in FIG. 9.

２．３．第２実施形態の効果
以上の説明によれば、ユーザＵが明示感情を発露させた音声を発話した場合に、ユーザ装置１ａは、ユーザＵが抱く感情が明示感情であると推定部２５８が推定する可能性を高くする目的で、明示的な音声感情情報ＶＥａに基づいて補正情報ＣＩを調整する態様を有する。この態様では、音声感情情報ＶＥａによって推定される感情の正解が判明しており、補正情報ＣＩを調整したユーザＵ用の補正感情情報ＣＶＥは、補正情報ＣＩを調整していないユーザＵ用の補正感情情報ＣＶＥと比較して、ユーザＵが抱く感情の推定精度を向上できる。
また、ユーザＵが明示感情を発露させた音声を発話したとしても、感情を音声に発露させる強度はユーザＵ間で互いに異なる。例えば、あるユーザＵは、感情を音声に発露させる強度が高い一方で、別のユーザＵは、感情を音声に発露させる強度が低い場合がある。第２実施形態における補正情報ＣＩは、感情を音声に発露させる強度の違いも反映される。例えば、感情を音声に発露させる強度が高いユーザＵは、上述の理想的な音声評価値Ｘに対して、音声評価値ｘが近い値となり、係数ｋが１に近い値となる。一方、感情を音声に発露させる強度が低いユーザＵは、上述の理想的な音声評価値Ｘに対して、音声評価値ｘが小さい値となり、係数ｋが１から離れた値となる。以上により、発露させる強度が低いユーザＵほど、係数ｋが１から離れた値になり、感情を音声に発露させる強度の違いが補正情報ＣＩに反映されるため、ユーザＵが抱く感情の推定精度を向上できる。 2.3. Effect of the Second Embodiment According to the above description, when the user U utters a voice that expresses an explicit emotion, the estimation unit 258 estimates that the emotion that the user U has is an explicit emotion. For the purpose of increasing the possibility of correcting the correction information CI, the correction information CI is adjusted based on the explicit voice emotion information VEa. In this aspect, the correct answer for the emotion estimated by the voice emotion information VEa is known, and the corrected emotion information CVE for the user U whose correction information CI is adjusted is the corrected emotion information CVE for the user U whose correction information CI is not adjusted. The accuracy of estimating the emotion that the user U has can be improved compared to the emotion information CVE.
Further, even when users U utter a voice that expresses an explicit emotion, the intensity with which the emotion is expressed in the voice differs from one user U to another. For example, one user U may express emotion in the voice with high intensity, while another user U expresses emotion in the voice with low intensity. The correction information CI in the second embodiment also reflects this difference in intensity. For example, for a user U who expresses emotion in the voice with high intensity, the voice evaluation value x is close to the ideal voice evaluation value X described above, and the coefficient k is close to 1. On the other hand, for a user U who expresses emotion in the voice with low intensity, the voice evaluation value x is smaller than the ideal voice evaluation value X, and the coefficient k is farther from 1. As described above, the lower the intensity with which a user U expresses emotion, the farther the coefficient k is from 1, so the difference in the intensity of expressing emotion in the voice is reflected in the correction information CI, and the accuracy of estimating the emotion held by the user U can be improved.

３．第３実施形態
第３実施形態にかかる感情推定システムＳＹＳは、第２実施形態で示した機能によってユーザ装置１ｂをキャリブレーションモードに設定して、明示感情を発露させたユーザＵの感情推定結果を利用して、ユーザ装置１ｂをキャリブレーションモードに設定していなく、明示感情を発露させていないユーザＵの補正情報ＣＩを調整する構成を有する点で、第２実施形態にかかるユーザ装置１ａと相違する。以下の説明において、ユーザ装置１ｂをキャリブレーションモードに設定し、明示感情を発露させたユーザＵを、「キャリブレーション済みユーザ」と称し、キャリブレーションモードに設定していなく、明示感情を発露させていないユーザＵを、「非キャリブレーションユーザ」と称する。
以下、第３実施形態にかかる感情推定システムＳＹＳを説明する。なお、以下に例示する第３実施形態において作用又は機能が第２実施形態と同等である要素については、以上の説明で参照の符号を流用して各々の詳細な説明を適宜に省略する。
3. Third Embodiment
The emotion estimation system SYS according to the third embodiment differs from the user device 1a according to the second embodiment in that it uses the emotion estimation result of a user U who has set the user device 1b to the calibration mode by the function described in the second embodiment and has expressed an explicit emotion, in order to adjust the correction information CI of a user U who has not set the user device 1b to the calibration mode and has not expressed an explicit emotion. In the following description, a user U who has set the user device 1b to the calibration mode and has expressed an explicit emotion is referred to as a "calibrated user", and a user U who has not set the calibration mode and has not expressed an explicit emotion is referred to as a "non-calibrated user".
The emotion estimation system SYS according to the third embodiment will be described below. In addition, in the third embodiment illustrated below, the elements whose actions or functions are the same as those of the second embodiment are referred to by the same reference numerals in the above description, and the detailed description thereof will be omitted as appropriate.

３．１．第３実施形態の概要
図１０は、感情推定システムＳＹＳの全体構成を示す図である。感情推定システムＳＹＳは、ユーザＵが所持するユーザ装置１ｂと、ネットワークＮＷと、サーバ装置１０とを備える。感情推定システムＳＹＳに含まれるユーザ装置１は、ユーザ装置１ｂ１からユーザ装置１ｂｍまでである。ｍは２以上の整数である。ユーザ装置１ｂ１を所持するユーザＵが、ユーザＵ１であり、ユーザ装置１ｂｍを所持するユーザＵは、ユーザＵｍである。
3.1. Overview of the Third Embodiment
FIG. 10 is a diagram showing the overall configuration of the emotion estimation system SYS. The emotion estimation system SYS includes a user device 1b possessed by a user U, a network NW, and a server device 10. The user devices 1 included in the emotion estimation system SYS are the user devices 1b1 to 1bm, where m is an integer of 2 or more. The user U possessing the user device 1b1 is the user U1, and the user U possessing the user device 1bm is the user Um.

以下では、説明の簡略化のため、ユーザＵ１が、キャリブレーション済みユーザであり、ユーザＵ２が、非キャリブレーションユーザであるとして、説明を行う。 In the following, for the sake of simplification of explanation, it is assumed that user U1 is a calibrated user and user U2 is a non-calibrated user.

図１１は、ユーザ装置１ｂの構成を示すブロック図である。ユーザ装置１ｂは、処理装置２ｂ、記憶装置３ｂ、表示装置４、操作装置５、通信装置６、放音装置７、及び、集音装置８を具備するコンピュータシステムにより実現される。記憶装置３ｂは、処理装置２ｂが読取可能な記録媒体であり、処理装置２ｂが実行する制御プログラムＰＲｂを含む複数のプログラムを記憶する。 FIG. 11 is a block diagram showing the configuration of the user device 1b. The user device 1b is implemented by a computer system including a processing device 2b, a storage device 3b, a display device 4, an operation device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The storage device 3b is a recording medium readable by the processing device 2b, and stores a plurality of programs including a control program PRb executed by the processing device 2b.

処理装置２ｂは、記憶装置３ｂから制御プログラムＰＲｂを読み取り実行することによって、取得部２１、及び、出力部２６として機能する。 The processing device 2b functions as an acquisition unit 21 and an output unit 26 by reading and executing the control program PRb from the storage device 3b.

図１２は、サーバ装置１０の構成を示すブロック図である。サーバ装置１０は、処理装置２Ｂ、記憶装置３Ｂ、及び、通信装置６Ｂを具備するコンピュータシステムにより実現される。サーバ装置１０の各要素は、情報を通信するための単体又は複数のバス９Ｂで相互に接続される。記憶装置３Ｂは、処理装置２Ｂが読取可能な記録媒体であり、処理装置２Ｂが実行する制御プログラムＰＲＢを含む複数のプログラム、解析用辞書情報３１、感情分類情報３３、及び、学習モデルＬＭを記憶する。 FIG. 12 is a block diagram showing the configuration of the server device 10. The server device 10 is implemented by a computer system including a processing device 2B, a storage device 3B, and a communication device 6B. The elements of the server device 10 are interconnected by one or more buses 9B for communicating information. The storage device 3B is a recording medium readable by the processing device 2B, and stores a plurality of programs including the control program PRB executed by the processing device 2B, analysis dictionary information 31, emotion classification information 33, and a learning model LM.

処理装置２Ｂは、記憶装置３Ｂから制御プログラムＰＲＢを読み取り実行することによって、感情推定部２５Ｂとして機能する。図１３を用いて、感情推定システムＳＹＳの機能について説明する。 The processing device 2B functions as an emotion estimation unit 25B by reading and executing the control program PRB from the storage device 3B. The functions of the emotion estimation system SYS will be described with reference to FIG. 13.

図１３は、感情推定システムＳＹＳの機能の概要を示す図である。感情推定部２５Ｂは、ノイズ除去部２５１、音声評価部２５２Ｂ、補正部２５３、調整部２５４、文字評価部２５６、推定部２５８、及び、特定部２５９を含む。 FIG. 13 is a diagram showing an overview of the functions of emotion estimation system SYS. The emotion estimation unit 25B includes a noise removal unit 251, a speech evaluation unit 252B, a correction unit 253, an adjustment unit 254, a character evaluation unit 256, an estimation unit 258, and a specification unit 259.

ユーザ装置１ｂ１の取得部２１は、ユーザＵ１の音声を含む音を集音する集音装置８が出力する音情報ＳＩ１を取得する。図１４を用いて、処理装置２Ｂによって実現される機能である、非キャリブレーションユーザの補正情報ＣＩの調整機能の概要を示す。 The acquiring unit 21 of the user device 1b1 acquires sound information SI1 output by the sound collecting device 8 that collects sound including the voice of the user U1. FIG. 14 shows an overview of the adjustment function of the non-calibration user's correction information CI, which is a function realized by the processing device 2B.

図１４は、非キャリブレーションユーザの補正情報ＣＩの調整機能の概要を示す図である。図１４では、キャリブレーション済みであるユーザＵ１が、「ありがとう」と発声し、ユーザ装置１ｂ１の取得部２１が、音情報ＳＩ１を取得した状態を示している。 FIG. 14 is a diagram showing an overview of the function of adjusting the correction information CI for the non-calibration user. FIG. 14 shows a state in which the user U1 who has been calibrated utters "thank you" and the acquisition unit 21 of the user device 1b1 acquires the sound information SI1.

説明を図１３に戻す。ユーザＵ１に関して、ノイズ除去部２５１は、音情報ＳＩ１が示す音からノイズを除去して音声情報ＶＩ１を生成する。音声評価部２５２Ｂは、学習モデルＬＭに対して、音声情報ＶＩ１から抽出した音の特徴量を入力し、音声感情情報ＶＥ１を学習モデルＬＭから取得する。補正部２５３は、ユーザＵ１の音声の特徴に基づく補正情報ＣＩ１を用いて音声感情情報ＶＥ１を補正した補正感情情報ＣＶＥ１を生成する。また、音声認識処理部２５６１は、音声認識処理を音情報ＳＩ１に対して実行し、音声認識処理の認識結果を示す認識文字列ＲＴ１を取得する。続けて、形態素解析処理部２５６３は、解析用辞書情報３１を参照して、認識文字列ＲＴ１に対して形態素解析処理を実行して、補正後認識文字列ＣＲＴ１を出力する。評価値算出部２５６５は、感情分類情報３３に含まれる文字列と、補正後認識文字列ＣＲＴ１とを比較することにより、文字感情情報ＴＥ１を生成する。
図１４では、サーバ装置１０が、音情報ＳＩ１に基づいて、補正感情情報ＣＶＥ１と文字感情情報ＴＥ１とを生成した状態を示している。 The description now returns to FIG. 13. Regarding the user U1, the noise removal unit 251 removes noise from the sound indicated by the sound information SI1 to generate voice information VI1. The voice evaluation unit 252B inputs the sound feature amount extracted from the voice information VI1 to the learning model LM, and acquires the voice emotion information VE1 from the learning model LM. The correction unit 253 generates corrected emotion information CVE1 by correcting the voice emotion information VE1 using the correction information CI1, which is based on the voice features of the user U1. In addition, the speech recognition processing unit 2561 executes speech recognition processing on the sound information SI1 and acquires a recognized character string RT1 indicating the recognition result. Subsequently, the morphological analysis processing unit 2563 refers to the analysis dictionary information 31, executes morphological analysis processing on the recognized character string RT1, and outputs the corrected recognized character string CRT1. The evaluation value calculation unit 2565 compares the character strings included in the emotion classification information 33 with the corrected recognized character string CRT1 to generate text emotion information TE1.
FIG. 14 shows a state in which the server device 10 has generated the corrected emotion information CVE1 and the character emotion information TE1 based on the sound information SI1.

特定部２５９は、補正感情情報ＣＶＥ１に含まれる複数の音声評価値Ｘと、文字感情情報ＴＥ１に含まれる文字評価値Ｙとの相違の程度を示す値が所定値以下である場合、認識文字列ＲＴ１を特定文字列ＳＴとして特定する。特定文字列ＳＴとして特定されやすい文字列は、この文字列が有する本来の意味で発話されることが多い文字列であり、例えば、「ありがとう」、及び「ふざけるな」等である。
ただし、「ありがとう」といった言葉も、時に社交辞令又は皮肉として発話されることもあり、「ありがとう」が有する本来の意味である「感謝」の意味で発話されない場合がある。この場合、音声情報ＶＩ１には喜びが発露していないため、音声評価値Ｘと文字評価値Ｙとが大きく相違する。そこで、サーバ装置１０は、キャリブレーション済みユーザの認識文字列ＲＴと、音声評価値Ｘと、文字評価値Ｙと、音声情報ＶＩ１を生成した日時とを対応付けてログとして記憶し、特定部２５９は、このログを参照して、認識文字列ＲＴに対する音声評価値Ｘ及び文字評価値Ｙの相違の程度の傾向に基づいて、特定文字列ＳＴを特定してもよい。例えば、特定部２５９は、現在時刻から過去のある時刻までにおいて、音声評価値Ｘ及び文字評価値Ｙの相違の程度を示す値が所定値以下となった割合が所定の割合以上となった認識文字列ＲＴを、特定文字列ＳＴとして特定する。
相違の程度を示す値は、例えば、以下に示す２つの態様がある。第１の態様における相違の程度を示す値は、複数の感情の各々について、音声評価値Ｘと文字評価値Ｙとの差分の２乗の和Ｓｕｍ_ＸＹである。和Ｓｕｍ_ＸＹは、例えば、下記（１）式により求められる。
Ｓｕｍ_ＸＹ＝（Ｘ１－Ｙ１）^２＋（Ｘ２－Ｙ２）^２＋（Ｘ３－Ｙ３）^２＋（Ｘ４－Ｙ４）^２（１）
第２の態様における相違の程度を示す値は、補正感情情報ＣＶＥ１及び文字感情情報ＴＥ１を４次元のベクトルとみなした場合の補正感情情報ＣＶＥ１及び文字感情情報ＴＥ１の角度θである。角度θが大きい程、補正感情情報ＣＶＥ１と文字感情情報ＴＥ１とが相違すると言える。例えば、角度θは、下記（２）式により求められる。
θ＝ｃｏｓ^－１（（ＣＶＥ１・ＴＥ１）／（｜ＣＶＥ１｜×｜ＴＥ１｜））（２）
ただし、ＣＶＥ１・ＴＥ１は、補正感情情報ＣＶＥ１と文字感情情報ＴＥ１の内積を示す。｜ＣＶＥ１｜は、補正感情情報ＣＶＥ１の大きさを示す。｜ＴＥ１｜は、文字感情情報ＴＥ１の大きさを示す。
以下の説明では、相違の程度を示す値は、和Ｓｕｍ_ＸＹであるとする。
図１４では、和Ｓｕｍ_ＸＹが所定値以下である例を示す。従って、特定部２５９は、認識文字列ＲＴ１である「ありがとう」を特定文字列ＳＴとして特定する。 If the value indicating the degree of difference between the plurality of voice evaluation values X included in the corrected emotion information CVE1 and the character evaluation values Y included in the text emotion information TE1 is equal to or less than a predetermined value, the specifying unit 259 specifies the recognized character string RT1 as a specific character string ST. A character string that is likely to be specified as the specific character string ST is one that is usually uttered in its original meaning, such as "Thank you" and "Don't be silly".
However, even "thank you" is sometimes uttered as a social courtesy or as sarcasm, and may not be uttered with its original meaning of gratitude. In this case, since joy is not expressed in the voice information VI1, the voice evaluation value X and the character evaluation value Y differ greatly. Therefore, the server device 10 may store, as a log, the recognized character string RT of the calibrated user, the voice evaluation value X, the character evaluation value Y, and the date and time when the voice information VI1 was generated, in association with one another, and the specifying unit 259 may refer to this log and specify the specific character string ST based on the tendency of the degree of difference between the voice evaluation value X and the character evaluation value Y for the recognized character string RT. For example, the specifying unit 259 specifies, as the specific character string ST, a recognized character string RT for which, between a certain past time and the current time, the proportion of utterances whose value indicating the degree of difference between the voice evaluation value X and the character evaluation value Y was equal to or less than the predetermined value is equal to or greater than a predetermined proportion.
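The log-based variant described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the log layout (one tuple of recognized string, degree-of-difference value, and timestamp per utterance), the function name, and the parameters `window`, `max_diff`, and `min_ratio` are all assumptions introduced here.

```python
from datetime import datetime, timedelta


def specific_strings(log, now, window, max_diff, min_ratio):
    """Return the recognized strings that qualify as specific strings ST.

    log: iterable of (recognized_string, diff_value, timestamp) tuples
         (hypothetical layout; diff_value is the degree of difference
         between the voice and character evaluation values).
    A string qualifies when, among its utterances logged between
    `now - window` and `now`, the share whose diff_value is at most
    `max_diff` is at least `min_ratio`.
    """
    counts = {}  # recognized string -> (total utterances, close utterances)
    for text, diff, timestamp in log:
        if now - window <= timestamp <= now:
            total, close = counts.get(text, (0, 0))
            counts[text] = (total + 1, close + (1 if diff <= max_diff else 0))
    return {text for text, (total, close) in counts.items()
            if close / total >= min_ratio}
```

Counting per recognized string keeps an occasional sarcastic "thank you" from disqualifying a phrase that is usually sincere, which is the tendency the paragraph above relies on.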
The value indicating the degree of difference can take, for example, the following two forms. In the first form, the value is the sum Sum_XY, taken over the plurality of emotions, of the squared difference between the voice evaluation value X and the character evaluation value Y. The sum Sum_XY is obtained, for example, by the following equation (1).
$$\mathrm{Sum}_{XY} = (X_1 - Y_1)^2 + (X_2 - Y_2)^2 + (X_3 - Y_3)^2 + (X_4 - Y_4)^2 \tag{1}$$
In the second form, the value indicating the degree of difference is the angle θ between the corrected emotion information CVE1 and the text emotion information TE1 when each is regarded as a four-dimensional vector. The greater the angle θ, the more the corrected emotion information CVE1 and the text emotion information TE1 differ. For example, the angle θ is obtained by the following equation (2).
$$\theta = \cos^{-1}\!\left( \frac{\mathrm{CVE1} \cdot \mathrm{TE1}}{|\mathrm{CVE1}| \times |\mathrm{TE1}|} \right) \tag{2}$$
Here, CVE1·TE1 denotes the inner product of the corrected emotion information CVE1 and the text emotion information TE1, |CVE1| denotes the magnitude of the corrected emotion information CVE1, and |TE1| denotes the magnitude of the text emotion information TE1.
In the following description, it is assumed that the value indicating the degree of difference is the sum Sum_XY.
FIG. 14 shows an example in which the sum Sum_XY is less than or equal to the predetermined value. Therefore, the specifying unit 259 specifies the recognized character string RT1, "Thank you", as the specific character string ST.
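As a worked illustration of the two measures of difference, the following sketch computes the sum of squared differences and the angle between two four-element evaluation vectors. The function names, the sample values, and the threshold are hypothetical; only the two formulas come from the description.

```python
import math


def sum_sq_diff(x, y):
    """Sum over emotions of the squared difference (X_i - Y_i)^2."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def angle_between(x, y):
    """Angle in radians between x and y, treated as vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.hypot(*x) * math.hypot(*y)
    if norm == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


# Hypothetical evaluation values for four emotions.
cve1 = [0.8, 0.1, 0.05, 0.05]   # corrected voice evaluation values X
te1 = [0.9, 0.0, 0.1, 0.0]      # character evaluation values Y
THRESHOLD = 0.1                 # assumed predetermined value

is_specific = sum_sq_diff(cve1, te1) <= THRESHOLD
```

The sum of squares is sensitive to the overall scale of the two vectors, whereas the angle compares only their direction, so the two measures can disagree when one vector is a scaled copy of the other.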

ユーザＵ２に関して、図１４に示すように、ユーザＵ２が、特定文字列ＳＴである「ありがとう」を発話したとする。ユーザ装置１ｂ２の取得部２１が、音情報ＳＩ２を取得する。ノイズ除去部２５１は、音情報ＳＩ２が示す音からノイズを除去して音声情報ＶＩ２を生成する。音声評価部２５２Ｂは、学習モデルＬＭに対して、音声情報ＶＩ２から抽出した音の特徴量を入力し、音声感情情報ＶＥ２を学習モデルＬＭから取得する。補正部２５３は、ユーザＵ２の音声の特徴に基づく補正情報ＣＩ２を用いて音声感情情報ＶＥ２を補正した補正感情情報ＣＶＥ２を生成する。また、音声認識処理部２５６１は、音声認識処理を音情報ＳＩ２に対して実行し、音声認識処理の認識結果を示す認識文字列ＲＴ２を取得する。続けて、形態素解析処理部２５６３は、解析用辞書情報３１を参照して、認識文字列ＲＴ２に対して形態素解析処理を実行して、補正後認識文字列ＣＲＴ２を出力する。評価値算出部２５６５は、感情分類情報３３に含まれる文字列と、補正後認識文字列ＣＲＴ２とを比較することにより、文字感情情報ＴＥ２を生成する。
図１４では、サーバ装置１０が、音情報ＳＩ２に基づいて、補正感情情報ＣＶＥ２と文字感情情報ＴＥ２とを生成した状態を示している。 Regarding the user U2, as shown in FIG. 14, it is assumed that the user U2 utters the specific character string ST "thank you". Acquisition unit 21 of user device 1b2 acquires sound information SI2. The noise removing unit 251 removes noise from the sound indicated by the sound information SI2 to generate the voice information VI2. The speech evaluation unit 252B inputs the sound feature amount extracted from the speech information VI2 to the learning model LM, and acquires the speech emotion information VE2 from the learning model LM. The correction unit 253 generates corrected emotion information CVE2 by correcting the voice emotion information VE2 using the correction information CI2 based on the features of the voice of the user U2. Also, the speech recognition processing unit 2561 executes speech recognition processing on the sound information SI2 and acquires a recognized character string RT2 indicating the recognition result of the speech recognition processing. Subsequently, the morphological analysis processing unit 2563 refers to the analysis dictionary information 31, performs morphological analysis processing on the recognized character string RT2, and outputs the corrected recognized character string CRT2. The evaluation value calculator 2565 compares the character string included in the emotion classification information 33 with the post-correction recognition character string CRT2 to generate text emotion information TE2.
FIG. 14 shows a state in which the server device 10 has generated the corrected emotion information CVE2 and the character emotion information TE2 based on the sound information SI2.

非キャリブレーションユーザであるユーザＵ２が、特定文字列ＳＴを発話した場合には、調整部２５４は、ユーザＵ２の補正感情情報ＣＶＥ２に含まれる複数の音声評価値Ｘを、複数の感情の各々について、ユーザＵ２の文字感情情報ＴＥ２に含まれる複数の文字評価値Ｙに近づける目的で、ユーザＵ２用の補正情報ＣＩ２を調整する。例えば、調整部２５４は、下記式に従って、係数ｋ１～ｋ４を生成する。
ｋ１＝Ｙ１／Ｘ１
ｋ２＝Ｙ２／Ｘ２
ｋ３＝Ｙ３／Ｘ３
ｋ４＝Ｙ４／Ｘ４
但し、係数ｋは、必ずしもＹ／Ｘと一致する必要はない。 When the user U2, who is a non-calibrated user, utters the specific character string ST, the adjustment unit 254 adjusts the correction information CI2 for the user U2 for the purpose of bringing the plurality of voice evaluation values X included in the corrected emotion information CVE2 of the user U2 closer, for each of the plurality of emotions, to the plurality of character evaluation values Y included in the text emotion information TE2 of the user U2. For example, the adjustment unit 254 generates coefficients k1 to k4 according to the following equations.
k1=Y1/X1
k2=Y2/X2
k3=Y3/X3
k4=Y4/X4
However, the coefficient k does not necessarily have to match Y/X.
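The coefficient equations above amount to an element-wise ratio. The sketch below is one possible reading, with two assumptions that are not in the description: a coefficient is left at 1 when the corresponding X is (near) zero, to avoid division by zero, and the function name and `eps` parameter are invented for illustration.

```python
def adjust_coefficients(x, y, eps=1e-9):
    """Return k_i = Y_i / X_i for each emotion.

    x: voice evaluation values X, y: character evaluation values Y.
    Where |X_i| <= eps the ratio is undefined, so k_i = 1.0 is used
    (an assumption of this sketch, not something the source specifies).
    """
    return [yi / xi if abs(xi) > eps else 1.0 for xi, yi in zip(x, y)]


k = adjust_coefficients([0.5, 0.25, 0.2, 0.0], [1.0, 0.5, 0.1, 0.0])
```

Multiplying each X by its coefficient reproduces the corresponding Y wherever X is nonzero. Since the description notes that k need not equal Y/X exactly, a practical variant might move k only part of the way toward the ratio.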

３．２．第３実施形態の動作
第２実施形態と同様に、第３実施形態でも、ユーザ装置１ｂは、ユーザＵの感情を推定する感情推定モードと、補正情報ＣＩを調整するキャリブレーションモードとを取り得る。ユーザ装置１ｂがキャリブレーションモードに設定された場合、サーバ装置１０が、ステップＳ２１に示す音情報ＳＩをユーザ装置１ｂから取得して、ステップＳ２１以降の各ステップを実行すればよい。図９に示す一連の処理終了後、サーバ装置１０は、キャリブレーションモードに設定されたユーザ装置１ｂの識別情報を、キャリブレーション済みユーザが所持するユーザ装置１ｂとして記憶装置３Ｂに記憶する。ユーザ装置１ｂの識別情報は、例えば、ＵＩＤ（User IDentifier）、ＭＡＣ（Media Access Control）アドレス、加入者認証モジュール（ＳＩＭ：Subscriber Identity Module）に記録されたＩＭＳＩ（International Mobile Subscriber Identity）、又はユーザＩＤ等である。ＵＩＤは、サービスを提供する事業者が、ユーザごとに割り当てたＩＤである。感情推定モードにおける感情推定システムＳＹＳの動作について、図１５及び図１６を用いて説明する。
3.2. Operation of the Third Embodiment
As in the second embodiment, in the third embodiment the user device 1b can take an emotion estimation mode for estimating the emotion of the user U and a calibration mode for adjusting the correction information CI. When the user device 1b is set to the calibration mode, the server device 10 may acquire the sound information SI shown in step S21 from the user device 1b and execute the steps from step S21 onward. After the series of processes shown in FIG. 9 is completed, the server device 10 stores, in the storage device 3B, the identification information of the user device 1b that was set to the calibration mode, as a user device 1b possessed by a calibrated user. The identification information of the user device 1b is, for example, a UID (User IDentifier), a MAC (Media Access Control) address, an IMSI (International Mobile Subscriber Identity) recorded in a subscriber identity module (SIM), or a user ID. The UID is an ID assigned to each user by the service provider. The operation of the emotion estimation system SYS in the emotion estimation mode will be described with reference to FIGS. 15 and 16.

図１５及び図１６は、感情推定モードにおける感情推定システムＳＹＳの動作を示すフローチャートである。サーバ装置１０は、ユーザ装置１ｂから、補正情報ＣＩを取得する（ステップＳ３１）。具体的には、ユーザ装置１ｂが、上述した補正情報ＣＩの２つの生成方法のいずれか一方に従って、補正情報ＣＩを生成し、サーバ装置１０に補正情報ＣＩを送信する。次に、サーバ装置１０は、ユーザ装置１ｂから、音情報ＳＩを取得する（ステップＳ３２）。そして、感情推定部２５Ｂの音声認識処理部２５６１は、音情報ＳＩに対して音声認識処理を実行し、認識文字列ＲＴを出力する（ステップＳ３３）。次に、感情推定部２５Ｂの形態素解析処理部２５６３は、解析用辞書情報３１を参照して、認識文字列ＲＴに対して形態素解析処理を実行して、補正後認識文字列ＣＲＴを出力する（ステップＳ３４）。そして、感情推定部２５Ｂの評価値算出部２５６５は、感情分類情報３３に含まれる文字列と、補正後認識文字列ＣＲＴとを比較することにより各感情の文字評価値Ｙを算出し、各感情の文字評価値Ｙを含む文字感情情報ＴＥを生成する（ステップＳ３５）。 FIGS. 15 and 16 are flowcharts showing the operation of the emotion estimation system SYS in the emotion estimation mode. The server device 10 acquires the correction information CI from the user device 1b (step S31). Specifically, the user device 1b generates the correction information CI according to one of the two generation methods of the correction information CI described above, and transmits the correction information CI to the server device 10. Next, the server device 10 acquires the sound information SI from the user device 1b (step S32). Then, the speech recognition processing unit 2561 of the emotion estimation unit 25B performs speech recognition processing on the sound information SI and outputs a recognized character string RT (step S33). Next, the morphological analysis processing unit 2563 of the emotion estimation unit 25B refers to the analysis dictionary information 31, executes morphological analysis processing on the recognized character string RT, and outputs the corrected recognized character string CRT (step S34). Then, the evaluation value calculation unit 2565 of the emotion estimation unit 25B calculates the character evaluation value Y of each emotion by comparing the character strings included in the emotion classification information 33 with the corrected recognized character string CRT, and generates text emotion information TE including the character evaluation value Y of each emotion (step S35).

また、感情推定部２５Ｂのノイズ除去部２５１は、音情報ＳＩが示す音から、第１パラメータＰ１及び第２パラメータＰ２に従ってノイズを除去して音声情報ＶＩを生成する（ステップＳ４１）。そして、感情推定部２５Ｂの音声評価部２５２Ｂは、ノイズを除去した音声情報ＶＩから、音の特徴量を抽出する（ステップＳ４２）。次に、感情推定部２５Ｂの音声評価部２５２Ｂは、音の特徴量を学習モデルＬＭに入力し、各感情の音声評価値ｘを含む音声感情情報ＶＥを学習モデルＬＭから取得する（ステップＳ４３）。 Further, the noise removal unit 251 of the emotion estimation unit 25B removes noise from the sound indicated by the sound information SI according to the first parameter P1 and the second parameter P2 to generate voice information VI (step S41). Then, the voice evaluation unit 252B of the emotion estimation unit 25B extracts the sound feature amount from the noise-removed voice information VI (step S42). Next, the voice evaluation unit 252B of the emotion estimation unit 25B inputs the sound feature amount to the learning model LM, and obtains voice emotion information VE including the voice evaluation value x of each emotion from the learning model LM (step S43).

次に、サーバ装置１０は、補正情報ＣＩ及び音情報ＳＩの送信元のユーザ装置１ｂを所持するユーザが、キャリブレーション済みユーザか否かを判定する（ステップＳ４４）。キャリブレーション済みユーザか非キャリブレーションユーザかを判定する方法として、ユーザ装置１ｂは、補正情報ＣＩの送信時及び音情報ＳＩの送信時のいずれか一方の時又は両方の時に、ユーザ装置１ｂの識別情報を送信する。サーバ装置１０は、受信したユーザ装置１ｂの識別情報が、キャリブレーション済みユーザが所持するユーザ装置１ｂとして記憶した識別情報と一致した場合、肯定である判定結果を出力し、記憶装置３Ｂに記憶した識別情報と一致しない場合、否定である判定結果を出力する。 Next, the server device 10 determines whether or not the user who possesses the user device 1b that transmitted the correction information CI and the sound information SI is a calibrated user (step S44). To allow this determination, the user device 1b transmits its identification information when transmitting the correction information CI, when transmitting the sound information SI, or at both times. The server device 10 outputs an affirmative determination result when the received identification information of the user device 1b matches identification information stored as that of a user device 1b possessed by a calibrated user, and outputs a negative determination result when it does not match any identification information stored in the storage device 3B.

ステップＳ４４の判定結果が肯定の場合、感情推定部２５Ｂの補正部２５３は、補正情報ＣＩを用いて、音声感情情報ＶＥに含まれる各感情の音声評価値ｘを補正した補正感情情報ＣＶＥを生成する（ステップＳ４５）。そして、感情推定部２５Ｂの特定部２５９は、補正感情情報ＣＶＥに含まれる音声評価値Ｘと文字感情情報ＴＥに含まれる文字評価値Ｙとの差分の２乗の和Ｓｕｍ_ＸＹが所定値以下か否かを判定する（ステップＳ４６）。
ステップＳ４４の判定結果が肯定であり、かつ、ステップＳ４６の判定結果が肯定の場合、感情推定部２５Ｂの特定部２５９は、認識文字列ＲＴを特定文字列ＳＴとして特定する（ステップＳ４７）。そして、感情推定部２５Ｂの推定部２５８は、補正感情情報ＣＶＥと文字感情情報ＴＥとに基づいて、ユーザＵが抱く感情を推定する（ステップＳ６１）。一方、ステップＳ４４の判定結果が肯定であり、ステップＳ４６の判定結果が否定の場合も、感情推定部２５Ｂの推定部２５８は、ステップＳ６１の処理を実行する。 If the determination result in step S44 is affirmative, the correction unit 253 of the emotion estimation unit 25B uses the correction information CI to generate corrected emotion information CVE by correcting the voice evaluation value x of each emotion included in the voice emotion information VE (step S45). Then, the specifying unit 259 of the emotion estimation unit 25B determines whether or not the sum Sum_XY of the squares of the differences between the voice evaluation values X included in the corrected emotion information CVE and the character evaluation values Y included in the text emotion information TE is equal to or less than a predetermined value (step S46).
When the determination result of step S44 is affirmative and the determination result of step S46 is affirmative, the identifying unit 259 of the emotion estimating unit 25B identifies the recognized character string RT as the specific character string ST (step S47). Then, the estimation unit 258 of the emotion estimation unit 25B estimates the emotion that the user U has based on the corrected emotion information CVE and the text emotion information TE (step S61). On the other hand, even when the determination result of step S44 is affirmative and the determination result of step S46 is negative, the estimation unit 258 of the emotion estimation unit 25B executes the process of step S61.

ステップＳ４４の判定結果が否定の場合、すなわち、補正情報ＣＩ及び音情報ＳＩの送信元のユーザ装置１ｂを所持するユーザＵが非キャリブレーションユーザである場合、サーバ装置１０は、特定文字列ＳＴと認識文字列ＲＴとが一致するか否かを判定する（ステップＳ５０）。
ステップＳ４４の判定結果が否定であり、かつ、ステップＳ５０の判定結果が肯定の場合、感情推定部２５Ｂの特定部２５９は、補正感情情報ＣＶＥに含まれる音声評価値Ｘを文字感情情報ＴＥに含まれる複数の文字評価値Ｙに近づける目的で、非キャリブレーションユーザ用の補正情報ＣＩを調整する（ステップＳ５１）。そして、感情推定部２５Ｂの補正部２５３は、補正情報ＣＩを用いて、音声感情情報ＶＥに含まれる各感情の音声評価値ｘを補正した補正感情情報ＣＶＥを生成する（ステップＳ５２）。
ステップＳ４４の判定結果が否定であり、かつ、ステップＳ５０の判定結果が否定の場合も、感情推定部２５Ｂの補正部２５３は、ステップＳ５２の処理を実行する。
ステップＳ５２の処理終了後、感情推定部２５Ｂの推定部２５８は、ステップＳ６１の処理を実行する。 If the determination result in step S44 is negative, that is, if the user U who possesses the user device 1b that transmitted the correction information CI and the sound information SI is a non-calibrated user, the server device 10 determines whether or not the recognized character string RT matches the specific character string ST (step S50).
If the determination result in step S44 is negative and the determination result in step S50 is affirmative, the specifying unit 259 of the emotion estimation unit 25B adjusts the correction information CI for the non-calibrated user for the purpose of bringing the voice evaluation values X included in the corrected emotion information CVE closer to the plurality of character evaluation values Y included in the text emotion information TE (step S51). Then, the correction unit 253 of the emotion estimation unit 25B uses the correction information CI to generate corrected emotion information CVE by correcting the voice evaluation value x of each emotion included in the voice emotion information VE (step S52).
Even when the determination result of step S44 is negative and the determination result of step S50 is negative, the correction unit 253 of the emotion estimation unit 25B executes the process of step S52.
After completing the process of step S52, the estimation unit 258 of the emotion estimation unit 25B executes the process of step S61.
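The branching in steps S44 to S52 can be summarized in the sketch below. It is an illustration under stated assumptions, not the flowchart's actual implementation: the correction is assumed to multiply each voice evaluation value x by its coefficient, step S51 is assumed to rescale each coefficient by Y/X, and all function and parameter names are hypothetical.

```python
def correct(x_raw, ci):
    """Assumed form of the correction: element-wise k_i * x_i."""
    return [k * x for k, x in zip(ci, x_raw)]


def sum_sq_diff(x, y):
    """Sum of squared differences between two evaluation vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def process_utterance(device_id, recognized, x_raw, y, ci,
                      calibrated_ids, specific_strings, threshold):
    """Return (corrected values, correction coefficients) for step S61.

    May add `recognized` to the `specific_strings` set (step S47).
    """
    if device_id in calibrated_ids:                    # S44: affirmative
        cve = correct(x_raw, ci)                       # S45
        if sum_sq_diff(cve, y) <= threshold:           # S46
            specific_strings.add(recognized)           # S47
    else:                                              # S44: negative
        if recognized in specific_strings:             # S50
            before = correct(x_raw, ci)
            # S51 (assumed form): rescale each coefficient by Y_i / X_i,
            # leaving it unchanged where X_i is zero.
            ci = [k * (yi / xi) if xi != 0 else k
                  for k, xi, yi in zip(ci, before, y)]
        cve = correct(x_raw, ci)                       # S52
    return cve, ci
```

Under these assumptions, a non-calibrated user's corrected values coincide with the character evaluation values immediately after an adjustment; the adjusted coefficients are then reused for later utterances.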

ステップＳ６１の処理を実行後、サーバ装置１０は、認識文字列ＲＴと、ステップＳ６１の処理結果である推定感情情報ＥＩとを、ユーザ装置１ｂに送信する。出力部２６は、認識文字列ＲＴに対して、推定感情情報ＥＩが示す感情に応じた処理を実行して得られる情報を出力する（ステップＳ６２）。ステップＳ６２の処理終了後、感情推定システムＳＹＳは、図１５及び図１６に示す一連の処理を終了する。 After executing the process of step S61, the server device 10 transmits the recognized character string RT and the estimated emotion information EI, which is the result of the process of step S61, to the user device 1b. The output unit 26 outputs information obtained by performing, on the recognized character string RT, processing according to the emotion indicated by the estimated emotion information EI (step S62). After completing the process of step S62, the emotion estimation system SYS ends the series of processes shown in FIGS. 15 and 16.

３．３．第３実施形態の効果
以上の説明によれば、サーバ装置１０は、非キャリブレーションユーザであるユーザＵ２が特定文字列ＳＴを発話した場合、補正感情情報ＣＶＥ２に含まれる複数の音声評価値Ｘを、複数の感情の各々について、文字感情情報ＴＥ２に含まれる複数の文字評価値Ｙに近づける目的で、ユーザＵ２用の補正情報ＣＩを調整する。特定文字列ＳＴは、キャリブレーションユーザであるユーザＵ１において、音声評価値Ｘと文字評価値Ｙとの相違の程度を示す値が所定値以下となった時の認識文字列ＲＴである。
ユーザＵ２が特定文字列ＳＴを発話した場合に限り、ユーザＵ２用の補正情報ＣＩを調整する理由について説明する。キャリブレーション済みユーザであっても、補正感情情報ＣＶＥと文字感情情報ＴＥとが近い値にならないことがある。例えば、キャリブレーション済みユーザが、文字列が有する本来の意味とは異なる意味でこの文字列を発話した場合、補正感情情報ＣＶＥと文字感情情報ＴＥとが近い値にならないことがある。文字列が有する本来の意味とは異なる意味でユーザＵが発話する例としては、ユーザＵが皮肉の内容を発話した場合、及び、ユーザＵが冗談を発話した場合である。ユーザＵが皮肉の内容及び冗談を発話すると、文字感情情報ＴＥの精度が低下するので、文字感情情報ＴＥのみに基づいてユーザＵが抱く感情を推定すると精度が低下する。また、ユーザＵが「今、着きました」といった事務連絡を発話すると、文字感情情報ＴＥの精度が低下するので、文字感情情報ＴＥのみに基づいてユーザＵが抱く感情を推定すると精度が低下する。発話内容が事務連絡である場合に文字感情情報ＴＥの精度が低下する理由は、事務連絡を示す発話内容には、感情分類情報３３に登録されている、感情を表す文字列が含まれる割合が一般的な発話内容と比較して低い傾向にあり、文字評価値Ｙ１～Ｙ４が小さい値となるためである。ユーザＵが皮肉の内容を発話した場合、ユーザＵが冗談を発話した場合、及び、ユーザＵが事務連絡を発話した場合とは、文字感情情報ＴＥのみに基づいてユーザＵが抱く感情を精度良く推定できない場合の一例である。文字列が有する本来の意味で発話されている場合には、補正感情情報ＣＶＥと文字感情情報ＴＥとが近い値になりやすい傾向にある。
従って、特定文字列ＳＴは、音声評価値Ｘと文字評価値Ｙとの相違の程度を示す値が所定値以下となっているため、本来の意味で発話された可能性が高い文字列であると言える。そして、ユーザＵ２が特定文字列ＳＴを発話した場合には特定文字列ＳＴが有する本来の意味で、ユーザＵ２が発話している可能性が高いため、本来であれば、補正感情情報ＣＶＥと文字感情情報ＴＥとが近い値になるはずである。
ここで、非キャリブレーションユーザにおいて、一般的には、補正感情情報ＣＶＥの精度は、文字感情情報ＴＥの精度より低い可能性が高い。理由としては、文字感情情報ＴＥは、ユーザＵの音声の特徴からの影響が小さい一方で、音声感情情報ＶＥは、ユーザＵの音声の特徴からの影響が大きく、非キャリブレーションユーザの補正情報ＣＩが正しく調整されていないためである。
そこで、第３実施形態では、ユーザＵ２が特定文字列ＳＴを発話した場合には、文字感情情報ＴＥが正解の感情を示している可能性が高いので、サーバ装置１０は、音声評価値Ｘを文字評価値Ｙに近づける目的で、ユーザＵ２用の補正情報ＣＩを調整する。以上により、非キャリブレーションユーザについて、キャリブレーションモードを用いなくても、ユーザＵが抱く感情の推定精度を向上できる。非キャリブレーションユーザは、ユーザ装置１ｂをキャリブレーションモードに設定しなくとも感情の推定精度を向上できるので、ユーザ装置１ｂは、非キャリブレーションユーザの手間を削減できる。
3.3. Effects of the Third Embodiment
According to the above description, when the user U2, who is a non-calibrated user, utters the specific character string ST, the server device 10 adjusts the correction information CI for the user U2 for the purpose of bringing the plurality of voice evaluation values X included in the corrected emotion information CVE2 closer, for each of the plurality of emotions, to the plurality of character evaluation values Y included in the text emotion information TE2. The specific character string ST is the recognized character string RT obtained when, for the user U1, who is a calibrated user, the value indicating the degree of difference between the voice evaluation value X and the character evaluation value Y was equal to or less than the predetermined value.
The reason why the correction information CI for the user U2 is adjusted only when the user U2 utters the specific character string ST is as follows. Even for a calibrated user, the corrected emotion information CVE and the text emotion information TE may not be close to each other. For example, when a calibrated user utters a character string with a meaning different from its original meaning, the corrected emotion information CVE and the text emotion information TE may not be close to each other. Examples of the user U uttering a character string with a meaning different from its original meaning include the case where the user U speaks sarcastically and the case where the user U tells a joke. When the user U speaks sarcastically or jokingly, the accuracy of the text emotion information TE drops, so estimating the emotion held by the user U based only on the text emotion information TE becomes less accurate. Likewise, when the user U utters a routine business message such as "I just arrived", the accuracy of the text emotion information TE drops, so estimating the emotion held by the user U based only on the text emotion information TE becomes less accurate. The accuracy of the text emotion information TE drops for routine business messages because such utterances tend to contain a lower proportion of the emotion-expressing character strings registered in the emotion classification information 33 than general utterances do, so the character evaluation values Y1 to Y4 become small. The cases where the user U speaks sarcastically, tells a joke, or utters a routine business message are examples of cases where the emotion held by the user U cannot be accurately estimated based only on the text emotion information TE.
When the original meaning of the character string is uttered, the values of the corrected emotional information CVE and the text emotional information TE tend to be close to each other.
Therefore, since the value indicating the degree of difference between the voice evaluation value X and the character evaluation value Y for the specific character string ST was equal to or less than the predetermined value, the specific character string ST can be said to be a character string that is highly likely to have been uttered in its original meaning. When the user U2 utters the specific character string ST, the user U2 is also highly likely to be uttering it in its original meaning, so the corrected emotion information CVE and the text emotion information TE should, in principle, be close to each other.
Here, for non-calibrated users, the accuracy of the corrected emotion information CVE is generally lower than the accuracy of the character emotion information TE. The reason for this is that the text emotion information TE is less influenced by the voice features of the user U, while the voice emotion information VE is greatly influenced by the voice features of the user U, and the non-calibrated user's correction information CI is not properly adjusted.
Therefore, in the third embodiment, because the character emotion information TE is highly likely to indicate the correct emotion when the user U2 utters the specific character string ST, the correction information CI for the user U2 is adjusted for the purpose of bringing the voice evaluation values X closer to the character evaluation values Y. As described above, the accuracy of estimating the emotion of a non-calibrated user can be improved without using the calibration mode. Because the non-calibrated user can improve the estimation accuracy without setting the user device 1b to the calibration mode, the user device 1b reduces the effort required of the non-calibrated user.

４．第４実施形態
第４実施形態にかかる感情推定システムＳＹＳｃは、キャリブレーション済みユーザの感情推定結果を利用して、非キャリブレーションユーザ用の第１パラメータＰ１及び第２パラメータＰ２を調整する点で、第３実施形態にかかる感情推定システムＳＹＳと相違する。
以下、第４実施形態にかかる感情推定システムＳＹＳｃを説明する。なお、以下に例示する第４実施形態において作用又は機能が第３実施形態と同等である要素については、以上の説明で参照の符号を流用して各々の詳細な説明を適宜に省略する。 4. Fourth Embodiment The emotion estimation system SYSc according to the fourth embodiment uses the emotion estimation result of the calibrated user to adjust the first parameter P1 and the second parameter P2 for the non-calibrated user. It differs from the emotion estimation system SYS according to the third embodiment.
The emotion estimation system SYSc according to the fourth embodiment will be described below. In addition, in the fourth embodiment illustrated below, the elements whose actions or functions are the same as those of the third embodiment are referred to by reference numerals in the above description, and their detailed descriptions are appropriately omitted.

図１７は、感情推定システムＳＹＳｃの全体構成を示す図である。感情推定システムＳＹＳｃは、ユーザＵが所持するユーザ装置１ｂと、ネットワークＮＷと、サーバ装置１０Ｃとを備える。 FIG. 17 is a diagram showing the overall configuration of emotion estimation system SYSc. The emotion estimation system SYSc includes a user device 1b possessed by a user U, a network NW, and a server device 10C.

図１８は、サーバ装置１０Ｃの構成を示すブロック図である。サーバ装置１０Ｃは、処理装置２Ｃ、記憶装置３Ｃ、及び、通信装置６Ｂを具備するコンピュータシステムにより実現される。記憶装置３Ｃは、処理装置２Ｃが読取可能な記録媒体であり、処理装置２Ｃが実行する制御プログラムＰＲＣを含む複数のプログラム、解析用辞書情報３１、感情分類情報３３、及び、学習モデルＬＭを記憶する。 FIG. 18 is a block diagram showing the configuration of the server device 10C. The server device 10C is implemented by a computer system including a processing device 2C, a storage device 3C, and a communication device 6B. The storage device 3C is a recording medium readable by the processing device 2C, and stores a plurality of programs including the control program PRC executed by the processing device 2C, analysis dictionary information 31, emotion classification information 33, and a learning model LM.

処理装置２Ｃは、記憶装置３Ｃから制御プログラムＰＲを読み取り実行することによって、感情推定部２５Ｃとして機能する。図１９を用いて、感情推定システムＳＹＳｃの機能について説明する。 The processing device 2C functions as an emotion estimation unit 25C by reading and executing the control program PRC from the storage device 3C. Functions of the emotion estimation system SYSc will be described with reference to FIG. 19.

４．１．第４実施形態の機能
図１９は、感情推定システムＳＹＳｃの機能の概要を示す図である。感情推定部２５Ｃは、ノイズ除去部２５１Ｃ、音声評価部２５２Ｂ、補正部２５３、調整部２５４Ｃ、文字評価部２５６、推定部２５８、及び、特定部２５９を含む。 4.1. Functions of Fourth Embodiment FIG. 19 is a diagram showing an overview of the functions of the emotion estimation system SYSc. The emotion estimation unit 25C includes a noise removal unit 251C, a speech evaluation unit 252B, a correction unit 253, an adjustment unit 254C, a character evaluation unit 256, an estimation unit 258, and a specification unit 259.

第４実施形態では、ノイズ除去部２５１Ｃで用いられる第１パラメータＰ１及び第２パラメータＰ２が、ユーザＵごとに用意される。以下の説明では、ユーザＵ１用の第１パラメータＰ１及び第２パラメータＰ２を含む情報をパラメータ情報ＴＩ１とし、ユーザＵ２用の第１パラメータＰ１及び第２パラメータＰ２を含む情報をパラメータ情報ＴＩ２として説明する。 In the fourth embodiment, the first parameter P1 and the second parameter P2 used in the noise removal unit 251C are prepared for each user U. In the following description, information including the first parameter P1 and the second parameter P2 for the user U1 is referred to as parameter information TI1, and information including the first parameter P1 and the second parameter P2 for the user U2 is referred to as parameter information TI2.

図２０は、非キャリブレーションユーザのパラメータ情報ＴＩの調整機能の概要を示す図である。図２０では、キャリブレーション済みであるユーザＵ１が、「ありがとう」と発声し、ユーザ装置１ｂ１の取得部２１が、音情報ＳＩ１を取得した状態を示している。 FIG. 20 is a diagram showing an overview of the adjustment function of parameter information TI for non-calibration users. FIG. 20 shows a state in which user U1, who has been calibrated, utters "thank you" and acquisition unit 21 of user device 1b1 acquires sound information SI1.

説明を図１９に戻す。ユーザＵ１に関して、ノイズ除去部２５１Ｃは、音情報ＳＩ１が示す音から、パラメータ情報ＴＩ１に含まれる第１パラメータＰ１及び第２パラメータＰ２に基づいて、ノイズを除去して音声情報ＶＩ１を生成する。以降の処理について、感情推定部２５Ｃは、第３実施形態と同様に処理して、補正感情情報ＣＶＥ１と文字感情情報ＴＥ１とを生成し、認識文字列ＲＴ１である「ありがとう」を特定文字列ＳＴとして特定する。 Returning to FIG. 19, for the user U1, the noise removal unit 251C removes noise from the sound indicated by the sound information SI1 based on the first parameter P1 and the second parameter P2 included in the parameter information TI1, and generates voice information VI1. For the subsequent processing, the emotion estimation unit 25C operates as in the third embodiment: it generates corrected emotion information CVE1 and character emotion information TE1, and identifies the recognized character string RT1, "thank you", as the specific character string ST.

ユーザＵ２に関して、図２０に示すように、ユーザＵ２が、特定文字列ＳＴである「ありがとう」を発話したとする。調整部２５４Ｃは、ユーザＵ２の補正感情情報ＣＶＥ２に含まれる複数の音声評価値Ｘを、複数の感情の各々について、ユーザＵ２の文字感情情報ＴＥ２に含まれる複数の文字評価値Ｙに近づける目的で、ユーザＵ２用のパラメータ情報ＴＩ２を調整する。具体的には、調整部２５４Ｃは、ノイズ除去部２５１Ｃに、現在のパラメータ情報ＴＩ２の第１パラメータＰ１及び第２パラメータＰ２に基づいて、音声情報ＶＩ２を生成させる。そして、調整部２５４Ｃは、音声評価部２５２Ｂ及び補正部２５３に、補正感情情報ＣＶＥ２を生成させ、文字評価部２５６に、文字感情情報ＴＥ２を生成させる。そして、調整部２５４Ｃは、補正感情情報ＣＶＥ２に含まれる音声評価値Ｘと、文字感情情報ＴＥ２に含まれる文字評価値Ｙとを比較する。例えば、調整部２５４Ｃは、パラメータ情報ＴＩ２の第１パラメータＰ１及び第２パラメータＰ２を微小量変化させる。調整部２５４Ｃは、微小量変化させた第１パラメータＰ１及び第２パラメータＰ２に基づいて、補正感情情報ＣＶＥを再度生成し、再度生成した補正感情情報ＣＶＥと文字感情情報ＴＥ２との相違の程度を示す値が、再作成する前の補正感情情報ＣＶＥと文字感情情報ＴＥ２との相違の程度を示す値より小さい場合、ユーザＵ２の複数の音声評価値Ｘを、ユーザＵ２の複数の文字評価値Ｙに近づける目的が達せられたと判定する。 Regarding the user U2, as shown in FIG. 20, it is assumed that the user U2 utters the specific character string ST "thank you". The adjustment unit 254C adjusts the plurality of voice evaluation values X included in the corrected emotion information CVE2 of the user U2 closer to the plurality of character evaluation values Y included in the text emotion information TE2 of the user U2 for each of the plurality of emotions. , adjust the parameter information TI2 for user U2. Specifically, the adjuster 254C causes the noise remover 251C to generate the voice information VI2 based on the first parameter P1 and the second parameter P2 of the current parameter information TI2. Then, the adjustment unit 254C causes the voice evaluation unit 252B and the correction unit 253 to generate corrected emotion information CVE2, and the character evaluation unit 256 to generate text emotion information TE2. Then, the adjustment unit 254C compares the voice evaluation value X included in the corrected emotion information CVE2 with the text evaluation value Y included in the text emotion information TE2. For example, the adjuster 254C slightly changes the first parameter P1 and the second parameter P2 of the parameter information TI2. 
The adjustment unit 254C regenerates the corrected emotion information CVE based on the slightly changed first parameter P1 and second parameter P2. If the value indicating the degree of difference between the regenerated corrected emotion information CVE and the character emotion information TE2 is smaller than the value indicating the degree of difference between the corrected emotion information CVE before regeneration and the character emotion information TE2, the adjustment unit 254C determines that the purpose of bringing the plurality of voice evaluation values X of the user U2 closer to the plurality of character evaluation values Y of the user U2 has been achieved.
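The perturb-and-compare adjustment described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the function names, the step size `delta`, the choice of a sum of squared differences as the value indicating the degree of difference, and the `corrected_values` callable (standing in for noise removal, voice evaluation, and correction) are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

Params = Tuple[float, float]  # (first parameter P1, second parameter P2)


def degree_of_difference(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed difference measure: sum of squared differences between X and Y."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def adjust_parameters(
    params: Params,
    corrected_values: Callable[[Params], Sequence[float]],
    character_values: Sequence[float],
    delta: float = 0.01,
) -> Params:
    """Try small changes to (P1, P2) and keep the best improving candidate.

    corrected_values is a hypothetical stand-in for the chain
    noise removal -> voice evaluation -> correction that yields the
    voice evaluation values X of the corrected emotion information CVE.
    If no candidate reduces the difference, the current parameters are kept.
    """
    best = params
    best_diff = degree_of_difference(corrected_values(params), character_values)
    p1, p2 = params
    for d1 in (-delta, 0.0, delta):
        for d2 in (-delta, 0.0, delta):
            candidate = (p1 + d1, p2 + d2)
            diff = degree_of_difference(
                corrected_values(candidate), character_values
            )
            if diff < best_diff:  # the regenerated CVE is closer to TE
                best, best_diff = candidate, diff
    return best
```

Calling `adjust_parameters` once per utterance of the specific character string ST would move the parameter information TI2 gradually; whether the device iterates until convergence or takes a single step per utterance is not stated in the text.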

図２０では、調整部２５４Ｃが、ユーザＵ２の複数の音声評価値Ｘを、ユーザＵ２の複数の文字評価値Ｙに近づける目的で、パラメータ情報ＴＩ２に含まれる第１パラメータＰ１及び第２パラメータＰ２を調整することを示している。 FIG. 20 shows that the adjustment unit 254C adjusts the first parameter P1 and the second parameter P2 included in the parameter information TI2 for the purpose of bringing the plurality of voice evaluation values X of the user U2 closer to the plurality of character evaluation values Y of the user U2.

４．２．第４実施形態の動作
次に、感情推定モードにおける感情推定システムＳＹＳｃの動作について、図２１を用いて説明する。 4.2. Operation of Fourth Embodiment Next, the operation of the emotion estimation system SYSc in the emotion estimation mode will be described with reference to FIG. 21.

図２１は、感情推定モードにおける感情推定システムＳＹＳｃの動作を示すフローチャートである。なお、第３実施形態で示した感情推定モードにおける感情推定システムＳＹＳｃの動作と、第４実施形態の感情推定モードにおける感情推定システムＳＹＳｃの動作において、図１５に示すステップＳ３１からステップＳ３５までの処理は共通である。従って、ステップＳ３１からステップＳ３５までの処理については図示及び説明を省略する。 FIG. 21 is a flowchart showing the operation of the emotion estimation system SYSc in the emotion estimation mode. The processing from step S31 to step S35 shown in FIG. 15 is common to the operation in the emotion estimation mode shown in the third embodiment and the operation of the emotion estimation system SYSc in the emotion estimation mode of the fourth embodiment. Therefore, illustration and description of the processing from step S31 to step S35 are omitted.

ステップＳ３５の処理終了後、サーバ装置１０Ｃは、補正情報ＣＩ及び音情報ＳＩの送信元のユーザ装置１ｂを所持するユーザＵが、キャリブレーション済みユーザか否かを判定する（ステップＳ７１）。ステップＳ７１の判定結果が肯定の場合、感情推定部２５Ｃのノイズ除去部２５１Ｃは、音情報ＳＩが示す音から、パラメータ情報Ｔ１の第１パラメータＰ１及び第２パラメータＰ２に従ってノイズを除去して音声情報ＶＩを生成する（ステップＳ７２）。感情推定部２５Ｃの音声評価部２５２Ｂは、ノイズを除去した音声情報ＶＩから、音の特徴量を抽出する（ステップＳ７３）。次に、感情推定部２５Ｃの音声評価部２５２Ｂは、音の特徴量を学習モデルＬＭに入力し、各感情の音声評価値ｘを含む音声感情情報ＶＥを学習モデルＬＭから取得する（ステップＳ７４）。そして、感情推定部２５Ｃの補正部２５３は、補正情報ＣＩを用いて、音声感情情報ＶＥに含まれる各感情の音声評価値ｘを補正した補正感情情報ＣＶＥを生成する（ステップＳ７５）。 After the processing of step S35 ends, the server device 10C determines whether or not the user U who owns the user device 1b, which is the transmission source of the correction information CI and the sound information SI, is a calibrated user (step S71). If the determination result in step S71 is affirmative, the noise removal unit 251C of the emotion estimation unit 25C removes noise from the sound indicated by the sound information SI according to the first parameter P1 and the second parameter P2 of the parameter information TI, and generates voice information VI (step S72). The voice evaluation unit 252B of the emotion estimation unit 25C extracts the sound feature amount from the noise-removed voice information VI (step S73). Next, the voice evaluation unit 252B of the emotion estimation unit 25C inputs the sound feature amount to the learning model LM, and acquires voice emotion information VE including the voice evaluation value x of each emotion from the learning model LM (step S74). Then, the correction unit 253 of the emotion estimation unit 25C uses the correction information CI to generate corrected emotion information CVE by correcting the voice evaluation value x of each emotion included in the voice emotion information VE (step S75).

そして、感情推定部２５Ｃの特定部２５９は、補正感情情報ＣＶＥに含まれる音声評価値Ｘと文字感情情報ＴＥに含まれる文字評価値Ｙとの差分の２乗の和Ｓｕｍ_ＸＹが所定値以下か否かを判定する（ステップＳ７６）。
ステップＳ７１の判定結果が肯定であり、かつ、ステップＳ７６の判定結果が肯定の場合、感情推定部２５Ｃの特定部２５９は、認識文字列ＲＴを特定文字列ＳＴとして特定する（ステップＳ７７）。そして、感情推定部２５Ｃの推定部２５８は、補正感情情報ＣＶＥと文字感情情報ＴＥとに基づいて、ユーザＵが抱く感情を推定する（ステップＳ９１）。一方、ステップＳ７１の判定結果が肯定であり、ステップＳ７６の判定結果が否定の場合も、感情推定部２５Ｃの推定部２５８は、ステップＳ９１の処理を実行する。 Then, the specifying unit 259 of the emotion estimation unit 25C determines whether or not the sum Sum_XY of the squares of the differences between the voice evaluation values X included in the corrected emotion information CVE and the character evaluation values Y included in the character emotion information TE is equal to or less than a predetermined value (step S76).
When the determination result of step S71 is affirmative and the determination result of step S76 is affirmative, specifying unit 259 of emotion estimating unit 25C specifies recognized character string RT as specific character string ST (step S77). Then, the estimation unit 258 of the emotion estimation unit 25C estimates the emotion that the user U has based on the corrected emotion information CVE and the text emotion information TE (step S91). On the other hand, even when the determination result of step S71 is affirmative and the determination result of step S76 is negative, the estimation unit 258 of the emotion estimation unit 25C executes the process of step S91.
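The check in step S76 and the identification in step S77 can be illustrated with a short sketch. The function names and the threshold value are hypothetical, and four emotions are assumed only because the description refers to character evaluation values Y1 to Y4; the sum of squared differences itself is what the text calls Sum_XY.

```python
from typing import Optional, Sequence


def sum_xy(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum_XY: sum over emotions of the squared difference (X_i - Y_i)^2."""
    if len(x) != len(y):
        raise ValueError("X and Y must cover the same emotions")
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def identify_specific_string(
    recognized_text: str,
    x: Sequence[float],
    y: Sequence[float],
    threshold: float,
) -> Optional[str]:
    """Return the recognized string RT as a specific string ST when
    Sum_XY is at or below the threshold (steps S76 and S77), else None."""
    if sum_xy(x, y) <= threshold:
        return recognized_text
    return None
```

For instance, with X = (0.7, 0.1, 0.1, 0.1) and Y = (0.6, 0.2, 0.1, 0.1), Sum_XY is about 0.02, so an assumed threshold of 0.05 would accept the string while a threshold of 0.01 would not.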

ステップＳ７１の判定結果が否定の場合、すなわち、補正情報ＣＩ及び音情報ＳＩの送信元のユーザ装置１ｂを所持するユーザＵが非キャリブレーションユーザである場合、サーバ装置１０Ｃは、特定文字列ＳＴと認識文字列ＲＴとが一致するか否かを判定する（ステップＳ８１）。ステップＳ７１の判定結果が否定であり、かつ、ステップＳ８１の判定結果が肯定の場合、感情推定部２５Ｂの調整部２５４Ｃは、補正感情情報ＣＶＥに含まれる音声評価値Ｘを文字感情情報ＴＥに含まれる複数の文字評価値Ｙに近づける目的で、非キャリブレーションユーザ用のパラメータ情報ＴＩを調整する（ステップＳ８２）。そして、感情推定部２５Ｃのノイズ除去部２５１Ｃは、音情報ＳＩが示す音から、パラメータ情報ＴＩの第１パラメータＰ１及び第２パラメータＰ２に従ってノイズを除去して音声情報ＶＩを生成する（ステップＳ８３）。
ステップＳ７１の判定結果が否定であり、かつ、ステップＳ８１の判定結果が否定の場合も、感情推定部２５Ｃのノイズ除去部２５１Ｃは、ステップＳ８３の処理を実行する。 If the determination result in step S71 is negative, that is, if the user U who owns the user device 1b that is the transmission source of the correction information CI and the sound information SI is a non-calibrated user, the server device 10C determines whether or not the specific character string ST and the recognized character string RT match (step S81). If the determination result in step S71 is negative and the determination result in step S81 is affirmative, the adjustment unit 254C of the emotion estimation unit 25C adjusts the parameter information TI for the non-calibrated user for the purpose of bringing the voice evaluation values X included in the corrected emotion information CVE closer to the plurality of character evaluation values Y included in the character emotion information TE (step S82). Then, the noise removal unit 251C of the emotion estimation unit 25C removes noise from the sound indicated by the sound information SI according to the first parameter P1 and the second parameter P2 of the parameter information TI, and generates voice information VI (step S83).
Even when the determination result of step S71 is negative and the determination result of step S81 is negative, the noise removing section 251C of the emotion estimating section 25C executes the process of step S83.

ステップＳ８３の処理終了後、感情推定部２５Ｃの音声評価部２５２Ｂは、ノイズを除去した音声情報ＶＩから、音の特徴量を抽出する（ステップＳ８４）。次に、感情推定部２５Ｃの音声評価部２５２Ｂは、音の特徴量を学習モデルＬＭに入力し、各感情の音声評価値ｘを含む音声感情情報ＶＥを学習モデルＬＭから取得する（ステップＳ８５）。そして、感情推定部２５Ｃの補正部２５３は、補正情報ＣＩを用いて、音声感情情報ＶＥに含まれる音声評価値ｘを補正した補正感情情報ＣＶＥを生成する（ステップＳ８６）。ステップＳ８６の処理終了後、感情推定部２５Ｃの推定部２５８は、ステップＳ９１の処理を実行する。 After the processing of step S83 is completed, the voice evaluation unit 252B of the emotion estimation unit 25C extracts the sound feature amount from the noise-removed voice information VI (step S84). Next, the voice evaluation unit 252B of the emotion estimation unit 25C inputs the sound feature amount to the learning model LM, and acquires voice emotion information VE including the voice evaluation value x of each emotion from the learning model LM (step S85). Then, the correction unit 253 of the emotion estimation unit 25C uses the correction information CI to generate corrected emotion information CVE by correcting the voice evaluation value x included in the voice emotion information VE (step S86). After completing the processing of step S86, the estimation unit 258 of the emotion estimation unit 25C executes the processing of step S91.

ステップＳ９１の処理終了後、サーバ装置１０Ｃは、認識文字列ＲＴと、ステップＳ６１の処理結果である推定感情情報ＥＩとを、ユーザ装置１ｂに送信する。出力部２６は、認識文字列ＲＴに対して、推定感情情報ＥＩが示す感情に応じた処理を実行して得られる情報を出力する（ステップＳ９２）。ステップＳ９２の処理終了後、感情推定システムＳＹＳｃは、図２１に示す一連の処理を終了する。 After completing the processing of step S91, the server device 10C transmits the recognized character string RT and the estimated emotion information EI, which is the result of the processing of step S91, to the user device 1b. The output unit 26 outputs information obtained by performing processing according to the emotion indicated by the estimated emotion information EI on the recognized character string RT (step S92). After completing the processing of step S92, the emotion estimation system SYSc ends the series of processing shown in FIG. 21.
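The branching of FIG. 21 as described above can be summarized by listing which steps run for each combination of decisions. This is a minimal sketch for reading the flow only; the function name and its boolean arguments are hypothetical, and the actual processing inside each step is not modeled.

```python
from typing import List


def steps_executed(
    calibrated: bool,
    sum_xy_within_threshold: bool = False,
    matches_specific_string: bool = False,
) -> List[str]:
    """Return the step labels of FIG. 21 in the order they are executed.

    calibrated              -> result of step S71
    sum_xy_within_threshold -> result of step S76 (calibrated users only)
    matches_specific_string -> result of step S81 (non-calibrated users only)
    """
    steps = ["S71"]
    if calibrated:
        # Noise removal, feature extraction, model evaluation, correction
        steps += ["S72", "S73", "S74", "S75", "S76"]
        if sum_xy_within_threshold:
            steps.append("S77")  # register RT as a specific string ST
    else:
        steps.append("S81")
        if matches_specific_string:
            steps.append("S82")  # adjust parameter information TI
        steps += ["S83", "S84", "S85", "S86"]
    # Every path ends with estimation and output
    steps += ["S91", "S92"]
    return steps
```

Every path reaches steps S91 and S92; the calibrated path can only add a specific character string, and the non-calibrated path can only adjust the parameter information TI.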

４．３．第４実施形態の効果
第４実施形態も、第３実施形態と同様に、ユーザＵ２が特定文字列ＳＴを発話した場合には、文字感情情報ＴＥが正解の感情を示している可能性が高いので、サーバ装置１０Ｃは、音声評価値Ｘを文字評価値Ｙに近づける目的で、ユーザＵ２用のパラメータ情報ＴＩを調整する。以上により、非キャリブレーションユーザについて、キャリブレーションモードを用いなくても、ユーザＵが抱く感情の推定精度を向上できる。非キャリブレーションユーザは、ユーザ装置１ｂをキャリブレーションモードに設定しなくとも感情の推定精度を向上できるで、ユーザ装置１ｂは、非キャリブレーションユーザの手間を削減できる。
集音装置８の性能は、ユーザ装置１ｂ間で互いに異なる。例えば、集音装置８の製造元が異なると、集音装置８の性能も一般的に互いに異なる。また、集音装置８は経年劣化により性能が低下する傾向にあるため、同一の製造元の集音装置８であっても、製造時点からの日数が異なる場合、集音装置８の性能も互いに異なる傾向にある。ユーザ装置１ｂ間で集音装置８の性能が互いに異なる結果、音情報ＳＩに含まれるノイズの量も異なるため、パラメータ情報ＴＩを調整することにより、ユーザＵが抱く感情を精度良く推定できる。
例えば、学習済みのパラメータ情報ＴＩを適用したノイズ処理を実行すると、集音装置８の性能の違いによって、音声情報ＶＩから感情推定に必要な情報が欠落する場合がある。従って、集音装置８の性能に応じてパラメータ情報ＴＩを調整することにより、ユーザＵが抱く感情を精度良く推定できる。 4.3. Effects of the Fourth Embodiment In the fourth embodiment, as in the third embodiment, when the user U2 utters the specific character string ST, the character emotion information TE is highly likely to indicate the correct emotion. Therefore, the server device 10C adjusts the parameter information TI for the user U2 for the purpose of bringing the voice evaluation values X closer to the character evaluation values Y. As described above, the accuracy of estimating the emotion of a non-calibrated user can be improved without using the calibration mode. Because the non-calibrated user can improve the accuracy of emotion estimation without setting the user device 1b to the calibration mode, the user device 1b reduces the effort required of the non-calibrated user.
The performance of the sound collector 8 differs between the user devices 1b. For example, sound collectors 8 from different manufacturers generally differ in performance. In addition, since the performance of the sound collector 8 tends to deteriorate with age, even sound collectors 8 from the same manufacturer tend to differ in performance when the number of days since manufacture differs. Because the performance of the sound collector 8 differs among the user devices 1b, the amount of noise included in the sound information SI also differs, so adjusting the parameter information TI allows the emotion of the user U to be estimated accurately.
For example, when noise processing is performed using learned parameter information TI, information necessary for emotion estimation may be missing from the voice information VI due to differences in the performance of the sound collector 8. Therefore, by adjusting the parameter information TI according to the performance of the sound collector 8, the emotion of the user U can be accurately estimated.

５．第５実施形態
第５実施形態にかかる感情推定システムＳＹＳｄは、第１実施形態で示した感情推定部２５の処理を、サーバ装置１０Ｄとユーザ装置１ｄとで分散する点で、第１実施形態にかかるユーザ装置１と相違する。以下、第５実施形態にかかる感情推定システムＳＹＳｄを説明する。なお、以下に例示する第５実施形態において作用又は機能が第１実施形態と同等である要素については、以上の説明で参照の符号を流用して各々の詳細な説明を適宜に省略する。 5. Fifth Embodiment The emotion estimation system SYSd according to the fifth embodiment differs from the user device 1 according to the first embodiment in that the processing of the emotion estimation unit 25 shown in the first embodiment is distributed between the server device 10D and the user device 1d. The emotion estimation system SYSd according to the fifth embodiment will be described below. In the fifth embodiment illustrated below, elements whose actions or functions are equivalent to those of the first embodiment are denoted by the reference numerals used in the above description, and detailed description of each is omitted as appropriate.

５．１．第５実施形態の概要
図２２は、感情推定システムＳＹＳｄの全体構成を示す図である。感情推定システムＳＹＳｄは、ユーザＵが所持するユーザ装置１ｄと、ネットワークＮＷと、サーバ装置１０Ｄとを備える。 5.1. Overview of Fifth Embodiment FIG. 22 is a diagram showing the overall configuration of an emotion estimation system SYSd. The emotion estimation system SYSd includes a user device 1d owned by a user U, a network NW, and a server device 10D.

図２３は、ユーザ装置１ｄの構成を示すブロック図である。ユーザ装置１ｄは、処理装置２ｄ、記憶装置３ｄ、表示装置４、操作装置５、通信装置６、放音装置７、及び、集音装置８を具備するコンピュータシステムにより実現される。記憶装置３ｄは、処理装置２ｄが読取可能な記録媒体であり、処理装置２ｄが実行する制御プログラムＰＲｄを含む複数のプログラムを記憶する。通信装置６が、「第２通信装置」の一例である。 FIG. 23 is a block diagram showing the configuration of the user device 1d. The user device 1d is implemented by a computer system including a processing device 2d, a storage device 3d, a display device 4, an operation device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The storage device 3d is a recording medium readable by the processing device 2d, and stores a plurality of programs including the control program PRd executed by the processing device 2d. The communication device 6 is an example of a "second communication device".

処理装置２ｄは、記憶装置３ｄから制御プログラムＰＲｄを読み取り実行することによって、取得部２１、第１感情推定部２５ｄ、及び、出力部２６として機能する。 The processing device 2d functions as an acquisition unit 21, a first emotion estimation unit 25d, and an output unit 26 by reading and executing the control program PRd from the storage device 3d.

図２４は、サーバ装置１０Ｄの構成を示すブロック図である。サーバ装置１０Ｄは、処理装置２Ｄ、記憶装置３Ｄ、及び、通信装置６Ｂを具備するコンピュータシステムにより実現される。記憶装置３Ｄは、処理装置２Ｄが読取可能な記録媒体であり、処理装置２Ｄが実行する制御プログラムＰＲＤを含む複数のプログラム、解析用辞書情報３１、感情分類情報３３、及び、学習モデルＬＭを記憶する。通信装置６Ｂが、「第１通信装置」の一例である。 FIG. 24 is a block diagram showing the configuration of the server device 10D. The server device 10D is implemented by a computer system including a processing device 2D, a storage device 3D, and a communication device 6B. The storage device 3D is a recording medium readable by the processing device 2D, and stores a plurality of programs including a control program PRD executed by the processing device 2D, analysis dictionary information 31, emotion classification information 33, and a learning model LM. do. The communication device 6B is an example of the "first communication device".

処理装置２Ｄは、記憶装置３Ｄから制御プログラムＰＲＤを読み取り実行することによって、第２感情推定部２５Ｄとして機能する。図２５を用いて、感情推定システムＳＹＳｄの機能について説明する。 The processing device 2D functions as a second emotion estimation unit 25D by reading and executing the control program PRD from the storage device 3D. Functions of the emotion estimation system SYSd will be described with reference to FIG. 25.

図２５は、感情推定システムＳＹＳｄの機能の概要を示す図である。第１感情推定部２５ｄは、補正部２５３、及び、推定部２５８を含む。第２感情推定部２５Ｄは、ノイズ除去部２５１、音声評価部２５２、文字評価部２５６を含む。 FIG. 25 is a diagram showing an outline of functions of the emotion estimation system SYSd. The first emotion estimation unit 25d includes a correction unit 253 and an estimation unit 258. The second emotion estimation unit 25D includes a noise removal unit 251, a voice evaluation unit 252, and a character evaluation unit 256.

取得部２１は、ユーザＵ１の音声を含む音を集音する集音装置８が出力する音情報ＳＩ１を取得する。通信装置６は、音情報ＳＩを、サーバ装置１０Ｄに送信する。第２感情推定部２５Ｄは、音情報ＳＩに基づいて、音声感情情報ＶＥと文字感情情報ＴＥと認識文字列ＲＴとを生成する。通信装置６Ｂは、音声感情情報ＶＥと文字感情情報ＴＥと認識文字列ＲＴとをユーザ装置１ｄに送信する。 The acquisition unit 21 acquires sound information SI1 output by the sound collector 8 that collects sound including the voice of the user U1. The communication device 6 transmits the sound information SI to the server device 10D. Second emotion estimation section 25D generates voice emotion information VE, text emotion information TE, and recognized character string RT based on sound information SI. The communication device 6B transmits the voice emotion information VE, the text emotion information TE, and the recognized character string RT to the user device 1d.

補正部２５３は、補正情報ＣＩを用いて、補正感情情報ＣＶＥを生成する。推定部２５８は、補正感情情報ＣＶＥと文字感情情報ＴＥとに基づいて、ユーザＵが抱く感情を推定する。出力部２６は、認識文字列ＲＴに対して、推定感情情報ＥＩが示す感情に応じた処理を実行して得られたデータを出力する。 The correction unit 253 uses the correction information CI to generate the corrected emotion information CVE. The estimation unit 258 estimates the emotion that the user U has based on the corrected emotion information CVE and the text emotion information TE. The output unit 26 outputs data obtained by performing processing corresponding to the emotion indicated by the estimated emotion information EI on the recognized character string RT.
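The division of work between the server device 10D and the user device 1d can be pictured with the sketch below. It is illustrative only: the emotion labels, the fixed values returned by the stand-in server function, the multiplicative form of the correction information CI, and the rule of picking the emotion with the largest sum of corrected voice value and character value are all assumptions made for this example and are not taken from this section.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ServerResult:
    """What the server device 10D sends back to the user device 1d."""
    voice_emotion: Dict[str, float]      # voice emotion information VE
    character_emotion: Dict[str, float]  # character emotion information TE
    recognized_text: str                 # recognized character string RT


def server_side(sound_info: bytes) -> ServerResult:
    """Stand-in for the second emotion estimation unit 25D.

    A real server would run noise removal, the learning model LM, and
    speech recognition; fixed values are returned here for illustration.
    """
    return ServerResult(
        voice_emotion={"joy": 0.4, "anger": 0.5, "sadness": 0.05, "calm": 0.05},
        character_emotion={"joy": 0.6, "anger": 0.1, "sadness": 0.1, "calm": 0.2},
        recognized_text="thank you",
    )


def device_side(
    result: ServerResult, correction: Dict[str, float]
) -> Tuple[str, str]:
    """Stand-in for the first emotion estimation unit 25d on the device.

    Assumed correction: multiply each voice value by a per-emotion
    coefficient (default 1.0). Assumed estimation: pick the emotion whose
    corrected voice value plus character value is largest.
    """
    corrected = {
        emotion: value * correction.get(emotion, 1.0)
        for emotion, value in result.voice_emotion.items()
    }
    combined = {
        emotion: corrected[emotion] + result.character_emotion[emotion]
        for emotion in corrected
    }
    estimated = max(combined, key=combined.get)
    return result.recognized_text, estimated
```

Keeping the correction information CI on the device means the server can share one learning model LM across users while the per-user adjustment stays local, which matches the split described for the first and second emotion estimation units.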

５．２．第５実施形態の効果
以上の説明によれば、感情推定システムＳＹＳｄにおいて、ユーザ装置１ｄは、第１実施形態におけるユーザ装置１と比較すると、負荷を軽減できる。 5.2. Effect of Fifth Embodiment According to the above description, in the emotion estimation system SYSd, the user device 1d can reduce the load compared to the user device 1 in the first embodiment.

６．第６実施形態
第６実施形態にかかる感情推定システムＳＹＳｅは、第２実施形態で示した感情推定部２５の処理を、サーバ装置１０Ｄとユーザ装置１ｅとで分散する点で、第２実施形態にかかるユーザ装置１ａと相違する。以下、第６実施形態にかかる感情推定システムＳＹＳｅを説明する。なお、以下に例示する第６実施形態において作用又は機能が第２実施形態又は第５実施形態と同等である要素については、以上の説明で参照の符号を流用して各々の詳細な説明を適宜に省略する。 6. Sixth Embodiment The emotion estimation system SYSe according to the sixth embodiment differs from the user device 1a according to the second embodiment in that the processing of the emotion estimation unit 25 shown in the second embodiment is distributed between the server device 10D and the user device 1e. The emotion estimation system SYSe according to the sixth embodiment will be described below. In the sixth embodiment illustrated below, elements whose actions or functions are equivalent to those of the second embodiment or the fifth embodiment are denoted by the reference numerals used in the above description, and detailed description of each is omitted as appropriate.

６．１．第６実施形態の概要
図２６は、感情推定システムＳＹＳｅの全体構成を示す図である。感情推定システムＳＹＳｅは、ユーザＵが所持するユーザ装置１ｅと、ネットワークＮＷと、サーバ装置１０Ｄとを備える。 6.1. Overview of Sixth Embodiment FIG. 26 is a diagram showing the overall configuration of an emotion estimation system SYSe. The emotion estimation system SYSe includes a user device 1e possessed by a user U, a network NW, and a server device 10D.

図２７は、ユーザ装置１ｅの構成を示すブロック図である。ユーザ装置１ｅは、処理装置２ｅ、記憶装置３ｅ、表示装置４、操作装置５、通信装置６、放音装置７、及び、集音装置８を具備するコンピュータシステムにより実現される。記憶装置３ｅは、処理装置２ｅが読取可能な記録媒体であり、処理装置２ｅが実行する制御プログラムＰＲｅを含む複数のプログラムを記憶する。 FIG. 27 is a block diagram showing the configuration of the user device 1e. The user device 1e is implemented by a computer system including a processing device 2e, a storage device 3e, a display device 4, an operation device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The storage device 3e is a recording medium readable by the processing device 2e, and stores a plurality of programs including the control program PRe executed by the processing device 2e.

処理装置２ｅは、記憶装置３ｅから制御プログラムＰＲｅを読み取り実行することによって、取得部２１ａ、第１感情推定部２５ｅ、及び、出力部２６として機能する。図２８を用いて、感情推定システムＳＹＳｅの機能について説明する。 The processing device 2e functions as an acquisition unit 21a, a first emotion estimation unit 25e, and an output unit 26 by reading and executing the control program PRe from the storage device 3e. Functions of the emotion estimation system SYSe will be described with reference to FIG. 28.

図２８は、感情推定システムＳＹＳｅの機能の概要を示す図である。第１感情推定部２５ｅは、補正部２５３と、調整部２５４と、推定部２５８とを含む。 FIG. 28 is a diagram showing an outline of functions of the emotion estimation system SYSe. The first emotion estimation unit 25e includes a correction unit 253, an adjustment unit 254, and an estimation unit 258.

取得部２１ａは、ユーザＵが明示感情を発露させた音声を含む音を示す音情報ＳＩａを取得する。サーバ装置１０Ｄは、音情報ＳＩａに基づいて音声感情情報ＶＥａを生成する。そして、通信装置６Ｂが、音声感情情報ＶＥａをユーザ装置１に送信する。 The acquisition unit 21a acquires sound information SIa indicating a sound including a voice in which the user U expresses an explicit emotion. The server device 10D generates voice emotion information VEa based on the sound information SIa. The communication device 6B then transmits the voice emotion information VEa to the user device 1e.

調整部２５４は、ユーザＵが抱く感情が明示感情であると推定部２５８が推定する可能性を高くする目的で、明示的な音声感情情報ＶＥａに基づいて補正情報ＣＩを調整する。 The adjustment unit 254 adjusts the correction information CI based on the explicit voice emotion information VEa for the purpose of increasing the possibility of the estimation unit 258 estimating that the emotion that the user U has is the explicit emotion.

６．２．第６実施形態の効果
以上の説明によれば、感情推定システムＳＹＳにおいて、ユーザ装置１ｄは、第２実施形態におけるユーザ装置１と比較すると、負荷を軽減できる。 6.2. Effect of Sixth Embodiment According to the above description, in the emotion estimation system SYSe, the load on the user device 1e can be reduced compared with the user device 1a in the second embodiment.

７．第７実施形態
第７実施形態にかかる感情推定システムＳＹＳｆは、第３実施形態で示した感情推定部２５の処理を、サーバ装置１０Ｆとユーザ装置１ｆとで分散する点で、第３実施形態にかかる感情推定システムＳＹＳと相違する。以下、第７実施形態にかかる感情推定システムＳＹＳｆを説明する。なお、以下に例示する第７実施形態において作用又は機能が第３実施形態又は第５実施形態と同等である要素については、以上の説明で参照の符号を流用して各々の詳細な説明を適宜に省略する。 7. Seventh Embodiment The emotion estimation system SYSf according to the seventh embodiment differs from the emotion estimation system SYS according to the third embodiment in that the processing of the emotion estimation unit 25 shown in the third embodiment is distributed between the server device 10F and the user device 1f. The emotion estimation system SYSf according to the seventh embodiment will be described below. In the seventh embodiment illustrated below, elements whose actions or functions are equivalent to those of the third embodiment or the fifth embodiment are denoted by the reference numerals used in the above description, and detailed description of each is omitted as appropriate.

７．１．第７実施形態の概要
図２９は、感情推定システムＳＹＳｆの全体構成を示す図である。感情推定システムＳＹＳｆは、ユーザＵが所持するユーザ装置１ｆと、ネットワークＮＷと、サーバ装置１０Ｆとを備える。ユーザＵ１が、「第１ユーザ」の例である。ユーザＵ２が、「第２ユーザ」の例である。 7.1. Overview of Seventh Embodiment FIG. 29 is a diagram showing the overall configuration of an emotion estimation system SYSf. The emotion estimation system SYSf includes a user device 1f owned by a user U, a network NW, and a server device 10F. User U1 is an example of a "first user". User U2 is an example of a "second user".

図３０は、ユーザ装置１ｆを示すブロック図である。ユーザ装置１ｆは、処理装置２ｆ、記憶装置３ｆ、表示装置４、操作装置５、通信装置６、放音装置７、及び、集音装置８を具備するコンピュータシステムにより実現される。記憶装置３ｆは、処理装置２ｆが読取可能な記録媒体であり、処理装置２ｆが実行する制御プログラムＰＲｆを含む複数のプログラムを記憶する。 FIG. 30 is a block diagram showing the user device 1f. The user device 1f is implemented by a computer system including a processing device 2f, a storage device 3f, a display device 4, an operation device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The storage device 3f is a recording medium readable by the processing device 2f, and stores a plurality of programs including a control program PRf executed by the processing device 2f.

処理装置２ｆは、記憶装置３ｆから制御プログラムＰＲｆを読み取り実行することによって、取得部２１、第１感情推定部２５ｆ、及び、出力部２６として機能する。 The processing device 2f functions as an acquisition unit 21, a first emotion estimation unit 25f, and an output unit 26 by reading and executing the control program PRf from the storage device 3f.

図３１は、サーバ装置１０Ｆの構成を示すブロック図である。サーバ装置１０Ｆは、処理装置２Ｆ、記憶装置３Ｆ、及び、通信装置６Ｂを具備するコンピュータシステムにより実現される。記憶装置３Ｆは、処理装置２Ｆが読取可能な記録媒体であり、処理装置２Ｆが実行する制御プログラムＰＲＦを含む複数のプログラム、解析用辞書情報３１、感情分類情報３３、及び、学習モデルＬＭを記憶する。 FIG. 31 is a block diagram showing the configuration of the server device 10F. The server device 10F is implemented by a computer system including a processing device 2F, a storage device 3F, and a communication device 6B. The storage device 3F is a recording medium readable by the processing device 2F, and stores a plurality of programs including the control program PRF executed by the processing device 2F, analysis dictionary information 31, emotion classification information 33, and a learning model LM. do.

処理装置２Ｆは、記憶装置３Ｆから制御プログラムＰＲＤを読み取り実行することによって、第２感情推定部２５Ｆとして機能する。図３２を用いて、感情推定システムＳＹＳｆの機能について説明する。 The processing device 2F functions as a second emotion estimation unit 25F by reading and executing the control program PRF from the storage device 3F. Functions of the emotion estimation system SYSf will be described with reference to FIG. 32.

図３２は、感情推定システムＳＹＳｆとの機能の概要を示す図である。第１感情推定部２５ｆは、補正部２５３、調整部２５４、及び、推定部２５８を含む。第２感情推定部２５Ｆは、ノイズ除去部２５１、音声評価部２５２、文字評価部２５６、及び、特定部２５９を含む。ユーザ装置１ｆ１が、「第１端末装置」の一例である。ユーザ装置１ｆ２が、「第２端末装置」の一例である。図面の煩雑化を防ぐため、ユーザ装置１ｆ１の処理装置２ｆが実現する機能については、図示を省略している。 FIG. 32 is a diagram showing an overview of the functions of the emotion estimation system SYSf. The first emotion estimation unit 25f includes a correction unit 253, an adjustment unit 254, and an estimation unit 258. The second emotion estimation unit 25F includes a noise removal unit 251, a voice evaluation unit 252, a character evaluation unit 256, and a specifying unit 259. The user device 1f1 is an example of a "first terminal device". The user device 1f2 is an example of a "second terminal device". To keep the drawing simple, the functions realized by the processing device 2f of the user device 1f1 are not illustrated.

The acquisition unit 21 of the user device 1f1 acquires the sound information SI1 output by the sound collector 8 of the user device 1f1. The sound collector 8 of the user device 1f1 is an example of a "first sound collector". The communication device 6 of the user device 1f1 transmits the sound information SI1 to the server device 10F. The noise removal unit 251 removes noise from the sound indicated by the sound information SI1 to generate the voice information VI1. For the user U1, the subsequent processing is the same as that of the voice evaluation unit 252B, the correction unit 253, the speech recognition processing unit 2561, the morphological analysis processing unit 2563, the evaluation value calculation unit 2565, and the identification unit 259 shown in FIG. 13, so its description is omitted. Although not illustrated, the second emotion estimation unit 25F transmits the recognized character string RT1, the voice emotion information VE1 of the user U1, and the text emotion information TE1 of the user U1 to the user device 1f1. So that the identification unit 259 can identify the specific character string ST, the communication device 6 of the user device 1f1 transmits the corrected emotion information CVE1 of the user U1 to the server device 10F.

For the user U2, the acquisition unit 21 of the user device 1f2 acquires the sound information SI2 output by the sound collector 8 of the user device 1f2. The sound collector 8 of the user device 1f2 is an example of a "second sound collector". The communication device 6 of the user device 1f2 transmits the sound information SI2 to the server device 10F. The communication device 6 of the user device 1f2 is an example of a "third communication device". For the user U2, the subsequent processing is the same as that of the voice evaluation unit 252B, the correction unit 253, the speech recognition processing unit 2561, the morphological analysis processing unit 2563, and the evaluation value calculation unit 2565 shown in FIG. 13, so its description is omitted. The communication device 6B transmits the specific character string ST, the voice emotion information VE2 of the user U2, and the text emotion information TE2 of the user U2 to the user device 1f2.

When the user U2, who is a non-calibration user, utters the specific character string ST, the adjustment unit 254 adjusts the correction information CI2 for the user U2 so that, for each of the plurality of emotions, the plurality of voice evaluation values X included in the corrected emotion information CVE2 of the user U2 approach the plurality of character evaluation values Y included in the text emotion information TE2 of the user U2.
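The adjustment described above can be sketched as a small update rule. This passage does not specify the form of the correction information CI2, so the sketch below assumes, purely for illustration, that it is one additive offset per emotion; the names `apply_correction`, `adjust_offsets`, and the step size `rate` are hypothetical.

```python
EMOTIONS = ("joy", "anger", "sadness", "normal")


def apply_correction(voice_values, offsets):
    """Assumed form of the correction: add a per-emotion offset to each voice evaluation value X."""
    return {e: voice_values[e] + offsets[e] for e in EMOTIONS}


def adjust_offsets(voice_values, text_values, offsets, rate=0.5):
    """Move each offset so that the corrected value X approaches the character evaluation value Y.

    rate is a hypothetical step size: 1.0 closes the gap in one step, 0.5 halves it.
    """
    corrected = apply_correction(voice_values, offsets)
    return {
        e: offsets[e] + rate * (text_values[e] - corrected[e])
        for e in EMOTIONS
    }
```

Calling `adjust_offsets` each time the non-calibration user utters a specific character string would gradually pull the corrected voice evaluation values toward the character evaluation values, which is the stated purpose of the adjustment.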

7.2. Effect of the Seventh Embodiment
As described above, in the emotion estimation system SYSf, the load on the server device 10F can be reduced compared with the server device 10 in the third embodiment, since the correction unit 253, the adjustment unit 254, and the estimation unit 258 run on the user device 1f.

8. Modifications
The present invention is not limited to the embodiments illustrated above. Specific modifications are illustrated below. Two or more aspects freely selected from the following examples may be combined.

(1) The estimation unit 258 was described as estimating the emotion of the user U based on the corrected emotion information CVE and the text emotion information TE, but this is not limiting. As a first modification, an example in which the estimation unit 258 estimates one or more emotions of the user U based on the corrected emotion information CVE is described with reference to FIG. 33.

FIG. 33 is a diagram showing an overview of the functions of the user device 1g in the first modification. The processing device 2 of the user device 1g reads a control program from the storage device 3 of the user device 1g and executes it, thereby functioning as an acquisition unit 21, an emotion estimation unit 25g, and an output unit 26g. The emotion estimation unit 25g includes a noise removal unit 251, a voice evaluation unit 252, a correction unit 253, and an estimation unit 258g. The estimation unit 258g estimates one or more emotions of the user U based on the corrected emotion information CVE. For example, the estimation unit 258g compares the voice evaluation values X1 to X4 of the corrected emotion information CVE with a threshold and identifies each voice evaluation value X that exceeds the threshold. The estimation unit 258g estimates the one or more emotions corresponding to the identified voice evaluation values X as the one or more emotions of the user U. The emotion estimation unit 25g outputs estimated emotion information EI indicating the estimated one or more emotions of the user U.

The output unit 26g outputs the estimated emotion information EI. For example, the output unit 26g outputs a character string indicating the emotion indicated by the estimated emotion information EI to the display device 4.
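The threshold comparison performed by the estimation unit 258g can be sketched as follows. The order in which X1 to X4 map to joy, anger, sadness, and normal, the function name `estimate_emotions`, and the threshold value used in the usage note are assumptions made for illustration.

```python
# Assumed mapping of X1..X4 to emotions; the order is an illustrative assumption.
EMOTIONS = ("joy", "anger", "sadness", "normal")


def estimate_emotions(corrected_values, threshold):
    """Return every emotion whose corrected voice evaluation value X exceeds the threshold."""
    return [
        emotion
        for emotion, x in zip(EMOTIONS, corrected_values)
        if x > threshold
    ]
```

With corrected values `[0.7, 0.1, 0.6, 0.2]` and a hypothetical threshold of `0.5`, the function returns joy and sadness, which matches the "one or more emotions" wording. It returns an empty list when no value exceeds the threshold, a case this passage does not address.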

(2) The processing of the emotion estimation unit 25C described in the fourth embodiment may be distributed between the server device 10 and the user device 1. For example, the second emotion estimation unit 25 in the server device 10 has a noise removal unit 251C, a voice evaluation unit 252B, a character evaluation unit 256, and an identification unit 259. The first emotion estimation unit 25 in the user device 1 has a correction unit 253, an adjustment unit 254, and an estimation unit 258.

(3) The emotion estimation unit 25 was described as estimating one or more emotions from among joy, anger, sadness, and normal, but it may instead estimate a single emotion. For example, the estimation unit 258 adds the voice evaluation values X1 to X4 of the corrected emotion information CVE and the character evaluation values Y1 to Y4 of the text emotion information TE emotion by emotion to obtain a sum for each emotion. The estimation unit 258 may then estimate the emotion with the largest sum as the emotion of the user U.
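The single-emotion variant in (3) amounts to a per-emotion sum followed by taking the maximum. A minimal sketch, assuming the evaluation values are held in dictionaries keyed by emotion name (the function name and data layout are illustrative, not taken from the specification):

```python
def estimate_single_emotion(voice_values, text_values):
    """Add X and Y for each emotion and return the emotion with the largest sum."""
    totals = {
        emotion: voice_values[emotion] + text_values[emotion]
        for emotion in voice_values
    }
    # max() returns the first emotion encountered when sums tie;
    # the specification does not say how ties should be resolved.
    return max(totals, key=totals.get)
```

For example, with X = {joy: 0.4, anger: 0.3, sadness: 0.2, normal: 0.1} and Y = {joy: 0.1, anger: 0.5, sadness: 0.2, normal: 0.2}, the sums are 0.5, 0.8, 0.4, and 0.3, so the sketch selects anger.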

(4) In the third embodiment, the emotion estimation unit 25B is implemented by the server device 10, but the third embodiment may also be applied to a single user device 1, for example when the user device 1 is carried by a plurality of users U. The third embodiment may be applied when the user U1 carries the user device 1 during a certain period and sets it to the calibration mode, and the user U2 carries the user device 1 during a later period.

(5) In the fifth, sixth, and seventh embodiments, the communication device 6B transmits the recognized character string RT, the voice emotion information VE2 of the user U2, and the text emotion information TE2 of the user U2 to the user device 1f2, but the recognized character string RT need not be transmitted. For example, the user device 1f outputs a character string indicating the emotion indicated by the estimated emotion information EI to the display device 4.

(6) The user device 1 need not have the sound collector 8. In that case, the user device 1 may acquire the sound information SI via the communication device 6, or may acquire the sound information SI stored in the storage device 3.

(7) The user device 1 need not have the sound emitting device 7.

(8) The user device 1 may be a smart speaker. In that case, the user device 1 need not have the display device 4 or the operation device 5.

(9) As shown in FIG. 4, the emotion classification information 33 classifies each of the inflected morphemes of a word, such as "勝つ" ("win") and "勝っ" (an inflected form of the same verb), as one of joy, anger, sadness, and normal, but this is not limiting. For example, the emotion classification information 33 may instead classify the character strings registered in the base-form data of the analysis dictionary information 31 as one of joy, anger, sadness, and normal. For example, the emotion classification information 33 classifies the character strings "嬉しい" ("happy"), "合格" ("pass"), and "勝つ" ("win") registered in the base-form data of the analysis dictionary information 31 as joy. The evaluation value calculation unit 2565 decomposes the corrected recognized character string CRT into morphemes and converts each morpheme into the character string registered in the base-form data of the analysis dictionary information 31. Then, when a character string obtained by the conversion matches a character string included in the emotion classification information 33, the evaluation value calculation unit 2565 increases the character evaluation value Y of the emotion corresponding to that character string.
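The base-form lookup in (9) can be sketched as two table lookups per morpheme. The sketch assumes the corrected recognized character string CRT has already been split into morphemes. The tables `BASE_FORMS` and `EMOTION_BY_BASE_FORM` are tiny hypothetical stand-ins for the analysis dictionary information 31 and the emotion classification information 33, and the increment `step` is an assumed value.

```python
EMOTIONS = ("joy", "anger", "sadness", "normal")

# Hypothetical stand-in for the base-form data of the analysis dictionary information 31.
BASE_FORMS = {"勝つ": "勝つ", "勝っ": "勝つ", "嬉しい": "嬉しい", "合格": "合格"}

# Hypothetical stand-in for the emotion classification information 33, keyed by base form.
EMOTION_BY_BASE_FORM = {"嬉しい": "joy", "合格": "joy", "勝つ": "joy"}


def text_evaluation_values(morphemes, step=1.0):
    """Convert each morpheme to its base form and increase Y for the matching emotion."""
    values = {emotion: 0.0 for emotion in EMOTIONS}
    for morpheme in morphemes:
        base = BASE_FORMS.get(morpheme, morpheme)  # unknown morphemes are kept as-is
        emotion = EMOTION_BY_BASE_FORM.get(base)
        if emotion is not None:
            values[emotion] += step
    return values
```

Keying the classification on base forms means the inflected morpheme "勝っ" needs no entry of its own: it is normalized to "勝つ" before the lookup, so the classification table needs one entry per word rather than one per inflection.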

(10) The evaluation value calculation unit 2565 calculates the character evaluation value Y for each emotion from the corrected recognized character string CRT, but it may instead calculate it from the recognized character string RT. However, the recognized character string RT contains character strings that are unnecessary for estimating emotions. Calculating the character evaluation value Y for each emotion from the corrected recognized character string CRT therefore improves the accuracy of emotion estimation compared with calculating it from the recognized character string RT.

(11) In the first aspect, the value indicating the degree of difference is the sum of the squares of the differences between the voice evaluation values X and the character evaluation values Y, but it may be a value obtained by any evaluation function that defines a distance between evaluation values, such as the sum of the absolute values of the differences between the voice evaluation values X and the character evaluation values Y.
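The two evaluation functions named in (11) can be written side by side. The function names are illustrative, and the inputs are assumed to be equal-length sequences holding X1 to X4 and Y1 to Y4 in the same emotion order.

```python
def squared_difference_sum(voice_values, text_values):
    """Sum of squared differences between X and Y (the measure named for the first aspect)."""
    return sum((x - y) ** 2 for x, y in zip(voice_values, text_values))


def absolute_difference_sum(voice_values, text_values):
    """Sum of absolute differences between X and Y (the alternative named in (11))."""
    return sum(abs(x - y) for x, y in zip(voice_values, text_values))
```

The squared form penalizes a single large disagreement more heavily than several small ones, while the absolute form weights all disagreements linearly, so the two can rank the same pair of utterances differently against a given predetermined value.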

(12) The examples assume that the user U speaks Japanese, but the aspects described above can be applied regardless of the language the user speaks, for example English, French, or Chinese. When the user U speaks English, for example, the analysis dictionary information 31 may be information on English morphemes, and the emotion classification information 33 may be data that classifies English words as one of joy, anger, sadness, and normal.

(13) The block diagrams used to describe the above aspects show blocks in functional units. These functional blocks (components) are implemented by any combination of hardware and/or software. The means for implementing each functional block is not particularly limited. That is, each functional block may be implemented by one physically and/or logically coupled device, or by two or more physically and/or logically separate devices connected directly and/or indirectly (for example, by wire and/or wirelessly).

(14) The order of the processing procedures, sequences, flowcharts, and the like in each of the above aspects may be changed as long as no contradiction arises. For example, the methods described in this specification present the elements of the various steps in an exemplary order and are not limited to the specific order presented.

(15) In each of the above aspects, input and output information may be stored in a specific location (for example, a memory) or managed in a management table. Input and output information may be overwritten, updated, or appended. Output information may be deleted. Input information may be transmitted to another device.

(16) In each of the above aspects, a determination may be made using a value represented by one bit (0 or 1), a Boolean value (true or false), or a numerical comparison (for example, comparison with a predetermined value).

(17) In each of the above aspects, a portable information processing device such as a smartphone was given as an example of the user device 1, but the specific form of the user device 1 is arbitrary and is not limited to the forms illustrated above. For example, a portable or stationary personal computer may be used as the user device 1.

(18) In each of the above aspects, the storage device 3 is a recording medium readable by the processing device 2, and ROM and RAM were given as examples, but it may be a flexible disk, a magneto-optical disk (for example, a compact disc, a digital versatile disc, or a Blu-ray (registered trademark) disc), a smart card, a flash memory device (for example, a card, stick, or key drive), a CD-ROM (Compact Disc-ROM), a register, a removable disk, a hard disk, a floppy (registered trademark) disk, a magnetic strip, a database, a server, or another suitable storage medium. The program may be transmitted from a network. The program may also be transmitted from a communication network via a telecommunication line.

(19) Each of the above aspects may be applied to systems that use LTE (Long Term Evolution), LTE-A (LTE-Advanced), SUPER 3G, IMT-Advanced, 4G, 5G, FRA (Future Radio Access), W-CDMA (registered trademark), GSM (registered trademark), CDMA2000, UMB (Ultra Mobile Broadband), IEEE 802.11 (Wi-Fi), IEEE 802.16 (WiMAX), IEEE 802.20, UWB (Ultra-WideBand), Bluetooth (registered trademark), or other suitable systems, and/or to next-generation systems extended on the basis of these.

(20) In each of the above aspects, the information, signals, and the like described may be represented using any of a variety of different technologies. For example, the data, instructions, commands, information, signals, bits, symbols, chips, and the like that may be referred to throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination of these.

The terms explained in this specification and/or terms necessary for understanding this specification may be replaced with terms having the same or similar meanings.

(21) The functions illustrated in FIGS. 2, 5, 7, 8, 11, 12, 13, 14, 18, 19, 20, 23, 24, 25, 27, 28, 30, 31, 32, and 33 are implemented by any combination of hardware and software. Each function may be implemented by a single device, or by two or more devices configured separately from each other.

(22) The programs illustrated in the above embodiments, whether referred to as software, firmware, middleware, microcode, hardware description language, or by any other name, should be interpreted broadly to mean instructions, instruction sets, code, code segments, program code, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, and the like.

Software, instructions, and the like may be transmitted and received via a transmission medium. For example, when software is transmitted from a website, server, or other remote source using wired technologies such as coaxial cable, fiber optic cable, twisted pair, and digital subscriber line (DSL) and/or wireless technologies such as infrared, radio, and microwave, these wired and/or wireless technologies are included within the definition of a transmission medium.

(23) In each of the above embodiments, information, parameters, and the like may be represented by absolute values, by values relative to a predetermined value, or by other corresponding information.

(24) The names used for the parameters described above are not limiting in any respect. Furthermore, the formulas and the like that use these parameters may differ from those explicitly disclosed in this specification.

(25) In each of the above embodiments, the user device 1 may be a mobile station. A mobile station may also be referred to by those skilled in the art as a subscriber station, mobile unit, subscriber unit, wireless unit, remote unit, mobile device, wireless device, wireless communication device, remote device, mobile subscriber station, access terminal, mobile terminal, wireless terminal, remote terminal, handset, user agent, mobile client, client, or some other suitable term.

(26) In each of the above embodiments, the phrase "based on" does not mean "based only on" unless otherwise specified. In other words, "based on" means both "based only on" and "based at least on".

(27) Any reference in this specification to elements using designations such as "first" and "second" does not generally limit the quantity or order of those elements. These designations may be used in this specification as a convenient way of distinguishing between two or more elements. Accordingly, a reference to first and second elements does not mean that only two elements may be employed or that the first element must precede the second element in some way.

(28) To the extent that "including", "comprising", and variations of these terms are used in the above embodiments, in this specification, or in the claims, they are intended to be inclusive in the same way as the term "provided with". Furthermore, the term "or" as used in this specification or the claims is not intended to be an exclusive OR.

(29) Throughout this application, where articles such as a, an, and the in English have been added by translation, these articles include the plural unless the context clearly indicates otherwise.

(30) It is apparent to those skilled in the art that the present invention is not limited to the embodiments described in this specification. The present invention can be implemented with modifications and changes without departing from the spirit and scope of the invention as defined by the claims. Accordingly, the description in this specification is for illustrative purposes and has no restrictive meaning with respect to the present invention. A plurality of aspects selected from those illustrated in this specification may also be combined.

1, 1a, 1b, 1d, 1e, 1f, 1g: user device; 10, 10C, 10D, 10F: server device; 21, 21a: acquisition unit; 26: output unit; 31: analysis dictionary information; 33: emotion classification information; 251, 251C: noise removal unit; 252, 252B: voice evaluation unit; 253: correction unit; 254, 254C: adjustment unit; 256: character evaluation unit; 258: estimation unit; 259: identification unit; CI: correction information; CVE: corrected emotion information; LM: learning model; P1: first parameter; P2: second parameter; TE: text emotion information; U: user; VE: voice emotion information; VI: voice information; X: voice evaluation value; Y: character evaluation value.

Claims

a voice evaluation unit that inputs a plurality of feature amounts based on voice information indicating a user's voice into a learning model that has learned the relationship between a plurality of feature amounts corresponding to human speech and the intensity of each of a plurality of emotions held by the person who uttered the speech, and that obtains, from the learning model, voice emotion information including voice evaluation values indicating the intensity of each of the plurality of emotions held by the user;
a correction unit that generates corrected emotion information by correcting the voice emotion information using correction information based on the features of the voice of the user;
an estimation unit that estimates one or more emotions of the user from among the plurality of emotions based on the corrected emotion information;
Emotion estimation device.

an acquisition unit that acquires sound information output by a sound collecting device that collects sound including the user's voice;
a noise removal unit that removes noise from the sound indicated by the sound information to generate the audio information;
The emotion estimation device according to claim 1, comprising:

The voice evaluation unit inputs, to the learning model, a plurality of feature amounts based on voice information indicating a voice in which the user explicitly expresses one of the plurality of emotions, and obtains explicit voice emotion information from the learning model;
an adjustment unit that adjusts the correction information based on the explicit voice emotion information for the purpose of increasing the possibility that the estimation unit estimates that the emotion held by the user is the one emotion;
The emotion estimation device according to claim 1 or 2, comprising:

a character evaluation unit that performs, on sound information output by a sound collecting device that collects sound including the user's voice, speech recognition processing for recognizing the content of an utterance made by a human, and that generates, based on a recognized character string indicating the recognition result of the speech recognition processing, character emotion information including character evaluation values indicating the intensity of each of the plurality of emotions held by the user;
The estimation unit estimates one or more emotions of the user based on the corrected emotion information and the character emotion information.
The emotion estimation device according to claim 3.

an identification unit that identifies the recognized character string as a specific character string when a value indicating the degree of difference between the plurality of voice evaluation values included in the corrected emotion information of the user and the plurality of character evaluation values included in the text emotion information is equal to or less than a predetermined value,
wherein, when another user who has not uttered a voice explicitly expressing an emotion utters the specific character string,
the voice evaluation unit inputs, to the learning model, a plurality of feature amounts according to the voice in which the other user uttered the specific character string, and obtains the voice emotion information of the other user from the learning model,
the character evaluation unit
generates character emotion information of the other user based on the voice in which the other user uttered the specific character string, and
the adjustment unit adjusts the correction information for the other user for the purpose of bringing, for each of the plurality of emotions, the plurality of voice evaluation values included in the corrected emotion information of the other user closer to the plurality of character evaluation values included in the text emotion information of the other user;
The emotion estimation device according to claim 4.

a noise removal unit that removes noise based on a predetermined threshold value from the sound indicated by the sound information to generate the audio information;
an identification unit that identifies the recognized character string as a specific character string when a value indicating the degree of difference between the plurality of voice evaluation values included in the corrected emotion information of the user and the plurality of character evaluation values included in the text emotion information is equal to or less than a predetermined value,
wherein, when another user who has not uttered a voice explicitly expressing an emotion utters the specific character string,
the voice evaluation unit inputs, to the learning model, a plurality of feature amounts according to the voice in which the other user uttered the specific character string, and obtains the voice emotion information of the other user from the learning model,
the character evaluation unit generates character emotion information of the other user based on the voice in which the other user uttered the specific character string, and
the adjustment unit adjusts the predetermined threshold value for the other user for the purpose of bringing, for each of the plurality of emotions, the plurality of voice evaluation values included in the corrected emotion information of the other user closer to the plurality of character evaluation values included in the text emotion information of the other user;
The emotion estimation device according to claim 4.

An emotion estimation system comprising a server device and a terminal device capable of communicating with the server device,
The server device
a first communication device that receives sound information indicating sound including the user's voice;
a noise removing unit that removes noise from the sound indicated by the sound information to generate voice information indicating the voice of the user;
a voice evaluation unit that inputs a plurality of feature amounts based on the voice information into a learning model that has learned, for a plurality of humans, the relationship between a plurality of feature amounts corresponding to human speech and the intensity of each of a plurality of emotions held by the person who uttered the speech, and that obtains, from the learning model, voice emotion information including voice evaluation values indicating the intensity of each of the plurality of emotions held by the user;
a character evaluation unit that performs, on the sound information, speech recognition processing for recognizing the content of an utterance made by a human, and that generates, based on a recognized character string indicating the recognition result of the speech recognition processing, character emotion information including character evaluation values indicating the intensity of each of the plurality of emotions held by the user;
The first communication device transmits the text emotion information and the voice emotion information to the terminal device,
The terminal device comprises:
a sound collecting device that collects sound including the user's voice;
a second communication device that transmits the sound information output by the sound collecting device to the server device and receives the text emotion information and the voice emotion information from the server device;
a correcting unit that generates corrected emotional information obtained by correcting the voice emotional information using correction information based on the features of the voice of the user;
an estimation unit for estimating one or more emotions of the user based on the corrected emotion information and the character emotion information;
emotion estimation system.
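
As a minimal sketch of the terminal-side processing recited above, the following assumes that the correction information is a set of per-emotion additive offsets and that the estimation unit averages the corrected voice evaluation values with the character evaluation values and compares the result with a threshold. The claim fixes neither choice; the function names, the offsets, the averaging rule, and the threshold value are all hypothetical.

```python
from typing import Dict, List


def correct(voice: Dict[str, float], offsets: Dict[str, float]) -> Dict[str, float]:
    """Correcting unit (assumed form): add a per-user offset to each
    voice evaluation value and clamp the result to the range [0, 1]."""
    return {e: min(1.0, max(0.0, v + offsets.get(e, 0.0))) for e, v in voice.items()}


def estimate(corrected: Dict[str, float], text: Dict[str, float],
             threshold: float = 0.5) -> List[str]:
    """Estimation unit (assumed form): average the corrected voice values and
    the character evaluation values, then return every emotion whose combined
    value reaches the threshold. If none does, return the strongest emotion."""
    combined = {e: (corrected[e] + text[e]) / 2 for e in corrected}
    chosen = [e for e, v in combined.items() if v >= threshold]
    return chosen or [max(combined, key=combined.get)]


# Values as they might arrive from the server device (illustrative only).
voice_info = {"joy": 0.4, "anger": 0.3, "sadness": 0.2}
text_info = {"joy": 0.8, "anger": 0.1, "sadness": 0.1}
user_offsets = {"joy": 0.2, "anger": -0.1}  # hypothetical correction information

result = estimate(correct(voice_info, user_offsets), text_info)
```

In this sketch the learning model on the server device stays generic, and only the small set of offsets held by the terminal device is specific to the user.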

The voice evaluation unit inputs, to the learning model, a plurality of feature amounts based on voice information indicating a voice in which the user explicitly expresses one of the plurality of emotions, and obtains explicit voice emotion information from the learning model,
The terminal device comprises
an adjustment unit that adjusts the correction information based on the explicit voice emotion information for the purpose of increasing the possibility that the estimation unit estimates that the emotion held by the user is the one emotion,
The emotion estimation system according to claim 7.
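
The adjustment recited above can be illustrated with a short sketch. It assumes, hypothetically, that the correction information is a set of per-emotion additive offsets and that raising the offset of the explicitly expressed emotion until its corrected value exceeds every other emotion by a small margin is enough to make that emotion the more likely estimate. The claim does not prescribe this rule; `adjust_offsets` and `margin` are illustrative names.

```python
from typing import Dict


def adjust_offsets(explicit_voice: Dict[str, float], offsets: Dict[str, float],
                   target: str, margin: float = 0.05) -> Dict[str, float]:
    """Adjustment unit (assumed form): return new offsets such that the
    corrected value of `target` exceeds the best other emotion by `margin`.
    Requires at least two emotions in `explicit_voice`."""
    corrected = {e: v + offsets.get(e, 0.0) for e, v in explicit_voice.items()}
    best_other = max(v for e, v in corrected.items() if e != target)
    shortfall = best_other + margin - corrected[target]
    new_offsets = dict(offsets)
    if shortfall > 0:
        # Raise only the target emotion's offset; other offsets are unchanged.
        new_offsets[target] = new_offsets.get(target, 0.0) + shortfall
    return new_offsets


# The user explicitly expressed joy, but the generic model rated anger higher.
explicit = {"joy": 0.3, "anger": 0.5, "sadness": 0.2}
adjusted = adjust_offsets(explicit, {}, target="joy")
```

When the learning model already rates the expressed emotion highest, the offsets are returned unchanged.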

The terminal device is a first terminal device,
The user is a first user,
The sound collector is a first sound collector,
The server device is capable of communicating with a second terminal device different from the first terminal device,
The second communication device transmits the corrected emotion information of the first user to the server device,
The server device treats the recognized character string as a specific character string when a value indicating the degree of difference between the plurality of voice evaluation values included in the corrected emotion information of the first user and the plurality of character evaluation values included in the text emotion information is equal to or less than a predetermined value,
When the second user who possesses the second terminal device does not utter a voice that expresses an explicit emotion and utters the specific character string,
The voice evaluation unit acquires the voice emotion information of the second user from the learning model by inputting, to the learning model, a plurality of feature quantities corresponding to the voice in which the second user utters the specific character string,
The character evaluation unit generates character emotion information of the second user based on the voice of the specific character string uttered by the second user,
The first communication device transmits the specific character string, the second user's voice emotion information, and the second user's text emotion information to the second terminal device,
The second terminal device comprises:
a second sound collecting device that collects sound including the voice of the second user;
a third communication device that transmits sound information output by the second sound collecting device to the server device, and receives text emotional information of the second user and voice emotional information of the second user from the server device;
a correction unit that generates corrected emotional information of the second user by correcting the voice emotional information of the second user using correction information based on the characteristics of the voice of the user;
an adjustment unit that adjusts the correction information for the second user for the purpose of bringing the plurality of voice evaluation values included in the corrected emotion information of the second user closer to the plurality of text evaluation values included in the text emotion information of the second user for each of the plurality of emotions,
The emotion estimation system according to claim 8.
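
The two steps recited above, selecting a specific character string from the first user's utterance and then adjusting the second user's correction information toward the text evaluation values, can be sketched as follows. The sketch assumes a mean absolute difference as the "degree of difference", additive per-emotion offsets as the correction information, and a proportional update. None of these is specified in the claim, and all names and numeric values are hypothetical.

```python
from typing import Dict


def difference(corrected: Dict[str, float], text: Dict[str, float]) -> float:
    """Degree of difference (assumed metric): mean absolute difference."""
    return sum(abs(corrected[e] - text[e]) for e in text) / len(text)


def is_specific_string(corrected: Dict[str, float], text: Dict[str, float],
                       limit: float = 0.1) -> bool:
    """Server side: the recognized string qualifies when voice-based and
    text-based evaluations of the first user agree within `limit`."""
    return difference(corrected, text) <= limit


def adjust_toward_text(voice: Dict[str, float], text: Dict[str, float],
                       offsets: Dict[str, float], rate: float = 0.5) -> Dict[str, float]:
    """Second terminal side: move each offset a fraction `rate` of the
    remaining gap between the corrected voice value and the text value."""
    new_offsets = {}
    for e in text:
        gap = text[e] - (voice[e] + offsets.get(e, 0.0))
        new_offsets[e] = offsets.get(e, 0.0) + rate * gap
    return new_offsets


# First user: voice and text evaluations agree, so the string is kept.
keep = is_specific_string({"joy": 0.7, "anger": 0.1}, {"joy": 0.75, "anger": 0.05})

# Second user utters the same string; the offsets are pulled toward the text values.
voice2 = {"joy": 0.4, "anger": 0.3}
text2 = {"joy": 0.8, "anger": 0.1}
offsets2 = adjust_toward_text(voice2, text2, {})
corrected2 = {e: voice2[e] + offsets2[e] for e in voice2}
```

A string for which the two evaluations already agreed for the first user serves here as a reference utterance, so the second user's correction information can be adjusted without the second user having to express an emotion explicitly.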