JP2020201334A

JP2020201334A - Emotion estimation device and emotion estimation system

Info

Publication number: JP2020201334A
Application number: JP2019106848A
Authority: JP
Inventors: 秀行窪田; Hideyuki Kubota; 博子進藤; Hiroko Shindo
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2019-06-07
Filing date: 2019-06-07
Publication date: 2020-12-17
Anticipated expiration: 2039-06-07
Also published as: JP7279287B2

Abstract

PROBLEM TO BE SOLVED: To estimate the emotion held by a user with high accuracy even when using a general-purpose learning model that has been trained with voice information of a plurality of persons as teacher data. SOLUTION: An emotion estimation device comprises: a voice evaluation unit that inputs a plurality of feature quantities based on voice information indicating a user's voice into a learning model that has learned, for a plurality of persons, the relationship between a plurality of feature quantities corresponding to a human voice and the intensity of each of a plurality of emotions held by the person who uttered the voice, and that acquires from the learning model voice emotion information including voice evaluation values indicating the intensity of each of the plurality of emotions held by the user; a correction unit that generates corrected emotion information by correcting the voice emotion information using correction information based on the characteristics of the user's voice; and an estimation unit that estimates, from among the plurality of emotions, one or more emotions held by the user on the basis of the corrected emotion information. SELECTED DRAWING: Figure 5

Description

The present invention relates to an emotion estimation device and an emotion estimation system.

In recent years, services that estimate emotions such as joy, anger, and sadness have become widespread. For example, Patent Document 1 discloses an emotion estimation device that estimates the emotion held by a user based on voice information indicating the user's voice. From voice information input multiple times by a single user, this emotion estimation device calculates in advance the mean and standard deviation of each of a plurality of feature quantities, such as frequency, volume, and speed, as data unique to the individual user for whom voice recognition is performed. When estimating the emotion held by this user, the device normalizes the feature quantities of the input voice information using the previously calculated means and standard deviations, and estimates the user's emotion based on the normalized feature quantities.

Patent Document 1: Japanese Unexamined Patent Publication No. 2006-259641

However, when the conventional technique described above is applied to a device that estimates the emotion held by a user by means of a learning model that has learned the relationship between emotions and a plurality of feature quantities based on voice information, a learning model has to be prepared for each user. If a general-purpose learning model trained with the voice information of many users as teacher data is used instead, the difference between the average voice characteristics of the many users and the voice characteristics of the individual user is not absorbed, so the emotion held by the user cannot be estimated accurately.

An emotion estimation device according to a preferred aspect of the present invention comprises: a voice evaluation unit that inputs a plurality of feature quantities based on voice information indicating a user's voice into a learning model that has learned, for a plurality of persons, the relationship between a plurality of feature quantities corresponding to a human voice and the intensity of each of a plurality of emotions held by the person who uttered the voice, and that acquires from the learning model voice emotion information including voice evaluation values indicating the intensity of each of the plurality of emotions held by the user; a correction unit that generates corrected emotion information by correcting the voice emotion information using correction information based on the characteristics of the user's voice; and an estimation unit that estimates, from among the plurality of emotions, one or more emotions held by the user based on the corrected emotion information.

An emotion estimation system according to a preferred aspect of the present invention comprises a server device and a terminal device capable of communicating with the server device. The server device comprises: a first communication device that receives sound information indicating sound that includes a user's voice; a noise removal unit that removes noise from the sound indicated by the sound information to generate voice information indicating the user's voice; a voice evaluation unit that inputs a plurality of feature quantities based on the voice information into a learning model that has learned, for a plurality of persons, the relationship between a plurality of feature quantities corresponding to a human voice and the intensity of each of a plurality of emotions held by the person who uttered the voice, and that acquires from the learning model voice emotion information including voice evaluation values indicating the intensity of each of the plurality of emotions held by the user; and a character evaluation unit that executes, on the sound information, voice recognition processing for recognizing the content of speech uttered by a human, and that generates, based on a recognized character string indicating the result of the voice recognition processing, character emotion information including character evaluation values indicating the intensity of each of the plurality of emotions held by the user. The first communication device transmits the character emotion information and the voice emotion information to the terminal device. The terminal device comprises: a sound collecting device that collects sound including the user's voice; a second communication device that transmits the sound information output by the sound collecting device to the server device and receives the character emotion information and the voice emotion information from the server device; a correction unit that generates corrected emotion information by correcting the voice emotion information using correction information based on the characteristics of the user's voice; and an estimation unit that estimates one or more emotions held by the user based on the corrected emotion information and the character emotion information.

According to the present invention, the emotion held by a user can be estimated with high accuracy even when using a learning model that has been trained with the voice information of a plurality of persons as teacher data.

A diagram showing an outline of the functions of the user device 1.
A block diagram showing the configuration of the user device 1 according to the first embodiment.
A diagram showing an example of the stored contents of the analysis dictionary information 31.
A diagram showing an example of the stored contents of the emotion classification information 33.
A diagram showing an outline of the functions of the user device 1.
A flowchart showing the operation of the user device 1.
A block diagram showing the user device 1a according to the second embodiment.
A diagram showing an outline of the functions of the user device 1a according to the second embodiment.
A flowchart showing the operation of the user device 1a in the calibration mode.
A diagram showing the overall configuration of the emotion estimation system SYS.
A block diagram showing the configuration of the user device 1b.
A block diagram showing the configuration of the server device 10.
A diagram showing an outline of the functions of the emotion estimation system SYS.
A diagram showing an outline of the function for adjusting the correction information CI of a non-calibration user.
A flowchart showing the operation of the emotion estimation system SYS in the emotion estimation mode (part 1).
A flowchart showing the operation of the emotion estimation system SYS in the emotion estimation mode (part 2).
A diagram showing the overall configuration of the emotion estimation system SYSc.
A block diagram showing the configuration of the server device 10C.
A diagram showing an outline of the functions of the emotion estimation system SYSc.
A diagram showing an outline of the function for adjusting the parameter information TI of a non-calibration user.
A flowchart showing the operation of the emotion estimation system SYSc in the emotion estimation mode.
A diagram showing the overall configuration of the emotion estimation system SYSd.
A block diagram showing the configuration of the user device 1d.
A block diagram showing the configuration of the server device 10D.
A diagram showing an outline of the functions of the emotion estimation system SYSd.
A diagram showing the overall configuration of the emotion estimation system SYSe.
A block diagram showing the configuration of the user device 1e.
A diagram showing an outline of the functions of the emotion estimation system SYSe.
A diagram showing the overall configuration of the emotion estimation system SYSf.
A block diagram showing the configuration of the user device 1f.
A block diagram showing the configuration of the server device 10F.
A diagram showing an outline of the functions of the first emotion estimation unit 25f and the second emotion estimation unit 25F.
A diagram showing an outline of the functions of the user device 1g in the first modification.

1. First Embodiment
FIG. 1 is a diagram showing an outline of the functions of the user device 1. The user device 1 is assumed to be a smartphone. The user device 1 is an example of an "emotion estimation device". However, any information processing device may be adopted as the user device 1; for example, it may be a terminal-type information device such as a personal computer, or a portable information terminal such as a notebook computer, a wearable terminal, or a tablet terminal.

The user device 1 has a function of transmitting, to a device used by another person, a recognized character string obtained by executing voice recognition processing on sound information indicating sound that includes the voice of the user U who carries the user device 1, or a function of emitting sound representing the recognized character string so that another person near the user U can hear it. Furthermore, the user device 1 estimates the emotion held by the user U based on the voice of the user U, and either adds a figure corresponding to the estimated emotion to the recognized character string or emits the sound representing the recognized character string with an intonation corresponding to the estimated emotion. In this way, the emotional expression needed for communication can be added.
In the example of FIG. 1, the user U utters "Hello", and the user device 1 adds a figure PI corresponding to the estimated emotion.

FIG. 2 is a block diagram showing the configuration of the user device 1 according to the first embodiment. The user device 1 is realized by a computer system including a processing device 2, a storage device 3, a display device 4, an operation device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The elements of the user device 1 are connected to one another by one or more buses 9 for communicating information. The term "device" in this specification may be read as another term such as circuit, device, or unit. Each element of the user device 1 is composed of one or more pieces of equipment, and some elements of the user device 1 may be omitted.

The processing device 2 is a processor that controls the entire user device 1 and is composed of, for example, one or more chips. The processing device 2 is composed of, for example, a central processing unit (CPU) including an interface with peripheral devices, an arithmetic unit, registers, and the like. Some or all of the functions of the processing device 2 may be realized by hardware such as a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The processing device 2 executes various processes in parallel or sequentially.

The storage device 3 is a recording medium readable by the processing device 2. It stores a plurality of programs, including the control program PR executed by the processing device 2, as well as the analysis dictionary information 31, the emotion classification information 33, and the learning model LM. The storage device 3 is composed of, for example, one or more types of storage circuits such as a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), and a RAM (Random Access Memory).

FIG. 3 is a diagram showing an example of the stored contents of the analysis dictionary information 31. The analysis dictionary information 31 associates, for each morpheme, a part of speech, a part-of-speech subclassification, and base form information. A morpheme is a character string that is the smallest unit of expression carrying meaning. A part of speech is a category of words classified by grammatical properties, such as noun, verb, or adjective. The part-of-speech subclassification is an item that further subdivides the part of speech. The base form information is a character string indicating the base form of the word when the morpheme is a word that inflects, and is the same character string as the morpheme when the morpheme is a word that does not inflect.

FIG. 4 is a diagram showing an example of the stored contents of the emotion classification information 33. The emotion classification information 33 classifies character strings as joy, anger, sadness, or normal. Each character string registered in the emotion classification information 33 expresses one of these four emotions. In the example of FIG. 4, the character string group 331 classified as joy includes 「嬉しい」 (happy), 「合格」 (passing), 「勝つ」 (win), and 「勝っ」 (an inflected form of "win"). Similarly, the character string group 332 classified as anger includes 「イライラ」 (irritated) and 「むかっ腹」 (a fit of anger). The character string group 333 classified as sadness includes 「悲しい」 (sad) and 「敗ける」 (lose). The character string group 334 classified as normal includes 「安心」 (relief).
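The emotion classification information 33 can be pictured as a mapping from each emotion to its registered character strings. The following is a minimal sketch and not the data format of the specification: the dictionary layout, the name `EMOTION_CLASSIFICATION`, and the helper `classify` are assumptions made for illustration. Only the registered strings themselves come from the FIG. 4 example.

```python
from typing import Optional

# Registered strings follow the FIG. 4 example; the layout is an assumption.
EMOTION_CLASSIFICATION = {
    "joy": ["嬉しい", "合格", "勝つ", "勝っ"],
    "anger": ["イライラ", "むかっ腹"],
    "sadness": ["悲しい", "敗ける"],
    "normal": ["安心"],
}


def classify(string: str) -> Optional[str]:
    """Return the emotion a string is registered under, or None if unregistered."""
    for emotion, strings in EMOTION_CLASSIFICATION.items():
        if string in strings:
            return emotion
    return None
```

A string that is not registered under any of the four emotions simply yields `None` in this sketch.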

The description now returns to FIG. 2. The learning model LM has already learned the relationship between a plurality of feature quantities corresponding to a human voice and the intensity of each of a plurality of emotions.

The display device 4 displays various images under the control of the processing device 2. Various display panels, such as a liquid crystal display panel or an organic EL (Electro Luminescence) display panel, are suitable for use as the display device 4.

The operation device 5 is equipment for inputting information used by the user device 1. The operation device 5 accepts operations by the user U. Specifically, the operation device 5 accepts operations for inputting symbols such as numbers and characters, and operations for selecting an icon displayed by the display device 4. For example, a touch panel that detects contact with the display surface of the display device 4 is suitable as the operation device 5. The operation device 5 may also include an operating element that the user can manipulate, such as a stylus.

The communication device 6 is hardware (a transmission/reception device) for communicating with other devices via a network. The communication device 6 is also called, for example, a network device, a network controller, a network card, or a communication module.

The sound emitting device 7 is composed of, for example, a speaker, and emits sound under the control of the processing device 2. The sound collecting device 8 is composed of, for example, a microphone and an AD converter, and collects sound including the voice of the user U under the control of the processing device 2. The microphone converts the collected sound into an electric signal. The AD converter performs analog-to-digital conversion on the electric signal from the microphone to produce the sound information SI shown in FIG. 5. The sound indicated by the sound information SI may include, in addition to the speaker's voice, noise originating from the speaker's surroundings.

1.1. Functions of the First Embodiment
The processing device 2 functions as an acquisition unit 21, an emotion estimation unit 25, and an output unit 26 by reading the control program PR from the storage device 3 and executing it.
The functions realized by the processing device 2 are described with reference to FIG. 5.

FIG. 5 is a diagram showing an outline of the functions of the user device 1. The acquisition unit 21 acquires the sound information SI output by the sound collecting device 8, which collects sound including the voice of the user U. The emotion estimation unit 25 estimates, from among a plurality of emotions, one or more emotions held by the user U. In the first embodiment, the plurality of emotions are described as the four emotions of joy, anger, sadness, and normal. In the following, joy, anger, sadness, and normal are an example of the plurality of emotions.

The emotion estimation unit 25 includes a noise removal unit 251, a voice evaluation unit 252, a correction unit 253, a character evaluation unit 256, and an estimation unit 258.

The noise removal unit 251 removes noise from the sound indicated by the sound information SI to generate voice information VI. The noise removal unit 251 is given, for example, a first parameter P1 and a second parameter P2. The first parameter P1 specifies the frequency bands to be regarded as noise. The second parameter P2 specifies the magnitude of amplitude components to be regarded as noise. The noise removal unit 251 executes first to fourth processes. In the first process, a fast Fourier transform is applied to the sound information SI to calculate an amplitude component for each of a plurality of frequency bands. In the second process, the amplitude components of the frequency bands specified by the first parameter P1 are reduced. The frequency of the human voice is generally between 100 Hz and 2000 Hz. The first parameter P1 specifies a lower limit frequency and an upper limit frequency. Using the first parameter P1, the noise removal unit 251 therefore reduces the amplitude components in the frequency band at or below the lower limit frequency and in the frequency band at or above the upper limit frequency. In the third process, amplitude components whose magnitude is at or below the value specified by the second parameter P2 are reduced. In the fourth process, an inverse Fourier transform is applied to the result of the third process to generate the voice information VI. The voice information VI indicates the voice of the user U with environmental noise and the like removed from the sound information SI.
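The four processes of the noise removal unit 251 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in the specification: the function names `fft`, `ifft`, and `remove_noise` and all parameter names are hypothetical, a bin is zeroed outright although the text only says its amplitude component is "reduced", and the 1/n amplitude scaling used for the P2 comparison is a choice made for this sketch. The radix-2 FFT shown here requires the number of samples to be a power of two.

```python
import cmath
from typing import List


def fft(x: List[complex]) -> List[complex]:
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum: List[complex]) -> List[complex]:
    """Inverse FFT computed with the conjugate trick."""
    n = len(spectrum)
    conj = fft([c.conjugate() for c in spectrum])
    return [c.conjugate() / n for c in conj]


def remove_noise(samples: List[float], sample_rate: float,
                 p1_low_hz: float, p1_high_hz: float,
                 p2_min_amplitude: float) -> List[float]:
    n = len(samples)
    # First process: the FFT yields one amplitude component per frequency bin.
    spectrum = fft([complex(s) for s in samples])
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies of a real signal.
        freq = min(k, n - k) * sample_rate / n
        # Second process: suppress bins at or outside the limits given by P1.
        if freq <= p1_low_hz or freq >= p1_high_hz:
            spectrum[k] = 0j
        # Third process: suppress bins whose scaled amplitude is at or below P2.
        elif abs(spectrum[k]) / n <= p2_min_amplitude:
            spectrum[k] = 0j
    # Fourth process: the inverse FFT returns the time-domain voice signal.
    return [c.real for c in ifft(spectrum)]
```

With limits of 100 Hz and 2000 Hz for P1, a 1000 Hz component passes through while components below 100 Hz or above 2000 Hz are removed.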

The voice evaluation unit 252 inputs a plurality of feature quantities based on the voice information VI into the learning model LM, and acquires from the learning model LM voice emotion information VE including voice evaluation values x indicating the intensity of each of the plurality of emotions.
The learning model LM has learned, for a plurality of persons, the relationship between a plurality of feature quantities corresponding to a human voice and the intensity of each of a plurality of emotions held by the person who uttered the voice. During training, the learning model LM learns from a large amount of teacher data. Each item of teacher data is a pair consisting of a plurality of feature quantities as input data and the intensities of the plurality of emotions as label data. The teacher data is generated from the voice information VI of many users. In other words, the learning model LM is a general-purpose model that has not been tuned for any particular individual.
The plurality of feature quantities are acoustic features: for example, 12-dimensional MFCC (Mel-Frequency Cepstrum Coefficients), loudness, fundamental frequency (F0), voice probability, zero-crossing rate, and HNR (Harmonics-to-Noise-Ratio), together with the first derivatives of all of these and the second derivatives of the MFCC and loudness, for a total of 47 values. Loudness is the magnitude of a sound and indicates the strength of the sound as perceived by human hearing. The voice probability indicates the probability that the sound indicated by the voice information VI contains a voice. The zero-crossing rate is the number of times the sound pressure reaches zero.
The voice evaluation unit 252 generates the plurality of feature quantities by applying sound feature extraction processing to the voice information VI.
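The total of 47 feature quantities follows from the list above, as the short tally below shows. The dictionary `BASE_FEATURES` and the helper `zero_crossings` are hypothetical names for this sketch; the helper counts sign changes between consecutive samples, which is one common way to realize the count the text describes, and is not taken from the specification.

```python
from typing import Sequence

# Static descriptors named in the text, with the number of values each contributes.
BASE_FEATURES = {
    "mfcc": 12,
    "loudness": 1,
    "f0": 1,
    "voice_probability": 1,
    "zero_crossing_rate": 1,
    "hnr": 1,
}

base = sum(BASE_FEATURES.values())          # 17 static values
first_derivatives = base                    # one first derivative per static value
second_derivatives = BASE_FEATURES["mfcc"] + BASE_FEATURES["loudness"]  # 13
total = base + first_derivatives + second_derivatives


def zero_crossings(samples: Sequence[float]) -> int:
    """Count sign changes between consecutive samples (zero counts as non-negative)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
```

Here `total` evaluates to 17 + 17 + 13 = 47, matching the count given in the text.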

The voice emotion information VE includes a voice evaluation value x1 for joy, a voice evaluation value x2 for anger, a voice evaluation value x3 for sadness, and a voice evaluation value x4 for normal. Each voice evaluation value x is a real number greater than or equal to 0. In the following description, when elements of the same kind are distinguished, full reference signs are used, as in the voice evaluation value x1 for joy and the voice evaluation value x2 for anger. When elements of the same kind are not distinguished, only the common part of the reference sign is used, as in the voice evaluation value x.

The correction unit 253 generates corrected emotion information CVE by correcting the voice emotion information VE using correction information CI based on the characteristics of the voice of the user U. The correction information CI includes, for example, a coefficient k1 that corrects the voice evaluation value x1 for joy, a coefficient k2 that corrects the voice evaluation value x2 for anger, a coefficient k3 that corrects the voice evaluation value x3 for sadness, and a coefficient k4 that corrects the voice evaluation value x4 for normal. The coefficients k1 to k4 are real numbers greater than or equal to 0. The corrected emotion information CVE includes a voice evaluation value X1 for joy, a voice evaluation value X2 for anger, a voice evaluation value X3 for sadness, and a voice evaluation value X4 for normal. The correction unit 253 generates the corrected emotion information CVE according to, for example, the following equations.

X1 = x1 × k1
X2 = x2 × k2
X3 = x3 × k3
X4 = x4 × k4
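The correction is an element-wise multiplication of the four voice evaluation values by their coefficients. A minimal sketch follows; the names `EMOTIONS` and `correct` and all numeric values are hypothetical and chosen only to illustrate the equations.

```python
from typing import Dict

EMOTIONS = ("joy", "anger", "sadness", "normal")


def correct(voice_emotion: Dict[str, float],
            correction: Dict[str, float]) -> Dict[str, float]:
    """Scale each voice evaluation value x_i by its coefficient k_i to obtain X_i."""
    return {e: voice_emotion[e] * correction[e] for e in EMOTIONS}


ve = {"joy": 0.6, "anger": 0.5, "sadness": 0.1, "normal": 0.4}  # hypothetical x1..x4
ci = {"joy": 0.5, "anger": 0.5, "sadness": 1.0, "normal": 1.0}  # hypothetical k1..k4
cve = correct(ve, ci)
```

With these made-up numbers the largest value moves from joy before correction to normal after correction, which is the kind of shift the coefficients are meant to produce for a user whose calm voice resembles joy or anger.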

The correction information CI can be generated in, for example, the following two ways. In the first aspect, the user U speaks toward the sound collecting device 8 while in a calm, everyday state. The processing device 2 extracts a plurality of feature quantities from the voice information VI corresponding to the utterance and generates the coefficients k1 to k4 by comparing the extracted feature quantities with predetermined thresholds. For example, when the extracted fundamental frequency is higher than a predetermined threshold, this user U has a relatively high fundamental frequency even when calm, so the emotion held by the user U is likely to be misjudged as joy or anger. The processing device 2 therefore sets the coefficient k1 corresponding to joy and the coefficient k2 corresponding to anger to values greater than 0 and less than 1, in order to lower the voice evaluation value X1 for joy and the voice evaluation value X2 for anger.

In the second aspect, the processing device 2 has the user U input information about the characteristics of his or her own voice. For example, the processing device 2 has the user U input gender and age as this information. When the gender is female, the fundamental frequency is generally higher than for a male, so the processing device 2 sets the coefficient k1 corresponding to joy and the coefficient k2 corresponding to anger to values greater than 0 and less than 1, in order to lower the voice evaluation value X1 for joy and the voice evaluation value X2 for anger. Similarly, the voice is generally higher in pitch the younger the speaker is, so when the input age is at or below a predetermined threshold, the processing device 2 sets the coefficient k1 corresponding to joy and the coefficient k2 corresponding to anger to values greater than 0 and less than 1 for the same purpose.
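The two ways of generating the correction information CI can be sketched as follows. The text gives no concrete numbers, so the threshold values, the reduced coefficient of 0.8, and the function names below are all hypothetical; leaving k3 and k4 at 1.0 is likewise an assumption, since the text only discusses k1 and k2.

```python
from typing import Dict

F0_THRESHOLD_HZ = 200.0  # hypothetical; the text only says "predetermined threshold"
AGE_THRESHOLD = 12       # hypothetical
REDUCED = 0.8            # hypothetical; the text only requires 0 < k < 1


def _coefficients(reduce_joy_anger: bool) -> Dict[str, float]:
    """Build k1..k4, lowering k1 (joy) and k2 (anger) when requested."""
    k = REDUCED if reduce_joy_anger else 1.0
    return {"k1": k, "k2": k, "k3": 1.0, "k4": 1.0}


def from_calm_speech(f0_hz: float) -> Dict[str, float]:
    """First aspect: compare the F0 measured from calm speech with a threshold."""
    return _coefficients(f0_hz > F0_THRESHOLD_HZ)


def from_profile(gender: str, age: int) -> Dict[str, float]:
    """Second aspect: derive the coefficients from self-reported gender and age."""
    return _coefficients(gender == "female" or age <= AGE_THRESHOLD)
```

Both functions return the same structure, so the correction unit can apply the coefficients without knowing which aspect produced them.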

The character evaluation unit 256 executes, on the sound information SI, a voice recognition process that recognizes the content of speech uttered by a human. Based on the recognition character string RT indicating the result of the voice recognition process, it generates character emotion information TE including character evaluation values Y that indicate the intensity of each of the plurality of emotions. The character emotion information TE includes the character evaluation value Y1 of joy, the character evaluation value Y2 of anger, the character evaluation value Y3 of sadness, and the character evaluation value Y4 of the normal state. Each character evaluation value Y is a real number greater than or equal to 0.

More specifically, the character evaluation unit 256 includes a voice recognition processing unit 2561, a morphological analysis processing unit 2563, and an evaluation value calculation unit 2565.
The voice recognition processing unit 2561 performs voice recognition processing on the sound information SI and outputs the recognition character string RT. It may use any of various methods, including, for example, a method of recognizing a character string from speech by using an acoustic model and a language model prepared in advance.

The morphological analysis processing unit 2563 refers to the analysis dictionary information 31, executes a morphological analysis process on the recognition character string RT, and outputs the corrected recognition character string CRT. The morphological analysis process decomposes the recognition character string RT into morphemes, using the parts of speech and part-of-speech subclassifications in the analysis dictionary information 31. The corrected recognition character string CRT is the character string that remains after removing character strings, such as fillers, that are unnecessary for estimating the emotion held by the user U. Fillers are words inserted between utterances, such as 「ええと」 ("um"), 「あの」 ("uh"), and 「まあ」 ("well").
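The filler removal can be illustrated with a minimal sketch. It assumes that a morphological analyzer has already split the recognition character string RT into (surface form, part of speech) pairs; the `Morpheme` type, the tag `"filler"`, and the function name `remove_fillers` are hypothetical stand-ins for the analysis dictionary information 31 and do not reflect any particular analyzer's output format.

```python
from typing import List, Tuple

# (surface form, part-of-speech tag); the tag names are hypothetical.
Morpheme = Tuple[str, str]


def remove_fillers(morphemes: List[Morpheme]) -> str:
    """Build the corrected recognition character string CRT.

    Morphemes tagged as fillers are dropped; the rest are rejoined
    in their original order.
    """
    return "".join(surface for surface, pos in morphemes if pos != "filler")


# Recognition character string RT, already split into morphemes.
rt = [
    ("ええと", "filler"),
    ("今日", "noun"),
    ("試合", "noun"),
    ("に", "particle"),
    ("勝っ", "verb"),
    ("た", "auxiliary"),
]
crt = remove_fillers(rt)
```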

The evaluation value calculation unit 2565 calculates the character evaluation value Y of each emotion by comparing the character strings included in the emotion classification information 33 with the corrected recognition character string CRT, and generates the character emotion information TE including the character evaluation value Y of each emotion. Specifically, when the corrected recognition character string CRT contains a character string included in the emotion classification information 33, the evaluation value calculation unit 2565 increases the character evaluation value Y of the emotion corresponding to that character string.
For example, if the corrected recognition character string CRT is 「今日試合に勝った」 ("I won the game today"), the evaluation value calculation unit 2565 outputs the following character evaluation values Y.

Character evaluation value of joy Y1: 1
Character evaluation value of anger Y2: 0
Character evaluation value of sadness Y3: 0
Character evaluation value of the normal state Y4: 0

In the above example, the corrected recognition character string CRT contains 「勝っ」 (an inflected form of "win"), which is included in the emotion classification information 33, so the evaluation value calculation unit 2565 increases the character evaluation value Y1 of joy, which corresponds to 「勝っ」, by 1. The increment is not limited to 1 and may differ for each character string included in the emotion classification information 33. For example, the increment for a character string that expresses joy more strongly may be 2. Furthermore, when the corrected recognition character string CRT contains both a character string included in the emotion classification information 33 and a character string that emphasizes the content, the evaluation value calculation unit 2565 may use a larger increment. For example, if the corrected recognition character string CRT is 「今日試合に勝ててとても嬉しい」 ("I am very happy that I won the game today"), it contains 「嬉しい」 ("happy"), which is included in the emotion classification information 33, together with the emphasizing character string 「とても」 ("very"), so the evaluation value calculation unit 2565 increases the character evaluation value Y1 of joy by, for example, 2. Whether a character string in the corrected recognition character string CRT emphasizes the content can be determined from the morphemes obtained by the morphological analysis process. In the following examples, for ease of explanation, the increment is assumed to be 1.
Further, when the corrected recognition character string CRT contains both a character string included in the emotion classification information 33 and a character string that negates the content, the evaluation value calculation unit 2565 may execute a process other than increasing the character evaluation value Y corresponding to the contained character string. For example, if the corrected recognition character string CRT is 「今日試合に勝つことができなかった」 ("I could not win the game today"), it contains 「勝つ」 ("win"), which is included in the emotion classification information 33, but it also contains the negating character string 「なかっ」 ("not"), so the evaluation value calculation unit 2565 increases, for example, the character evaluation value Y3 of sadness by 1. Whether a character string in the corrected recognition character string CRT negates the content can be determined from the morphemes obtained by the morphological analysis process. In this way, the morphological analysis process makes it possible to estimate whether the content of the corrected recognition character string CRT is affirmative or negative. In the following examples, for ease of explanation, it is assumed that whenever the corrected recognition character string CRT contains a character string included in the emotion classification information 33, the character evaluation value Y corresponding to that character string is increased.
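The scoring described above, including the emphasis and negation variations, can be sketched as follows. The dictionaries `EMOTION_LEXICON`, `EMPHASIZERS`, and `NEGATORS` are hypothetical stand-ins for the emotion classification information 33 and for the morpheme classes obtained by morphological analysis. Treating emphasis and negation at the level of the whole utterance, and mapping every negated match to sadness, are simplifications assumed for this sketch.

```python
from typing import Dict, List

EMOTIONS = ("joy", "anger", "sadness", "normal")

# Hypothetical stand-in for the emotion classification information 33.
EMOTION_LEXICON: Dict[str, str] = {"勝っ": "joy", "勝つ": "joy", "嬉しい": "joy"}
# Hypothetical morpheme sets for emphasis and negation.
EMPHASIZERS = {"とても"}
NEGATORS = {"なかっ"}


def character_evaluation(morphemes: List[str]) -> Dict[str, int]:
    """Compute character evaluation values Y from the morphemes of CRT."""
    y = {emotion: 0 for emotion in EMOTIONS}
    emphasized = any(m in EMPHASIZERS for m in morphemes)
    negated = any(m in NEGATORS for m in morphemes)
    for m in morphemes:
        emotion = EMOTION_LEXICON.get(m)
        if emotion is None:
            continue
        if negated:
            # Following the description's example: a negated "win"
            # raises sadness instead of joy.
            y["sadness"] += 1
        else:
            # An emphasizing morpheme enlarges the increment to 2.
            y[emotion] += 2 if emphasized else 1
    return y
```

With these assumed dictionaries, the three example utterances in the description yield Y1 = 1, Y1 = 2, and Y3 = 1, respectively.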

The estimation unit 258 estimates one or more emotions held by the user U based on the corrected emotion information CVE and the character emotion information TE. For example, for each of the plurality of emotions, the estimation unit 258 adds the voice evaluation value X1 to X4 of the corrected emotion information CVE and the corresponding character evaluation value Y1 to Y4 of the character emotion information TE to obtain a sum for each emotion. The estimation unit 258 compares the sum for each emotion with a threshold value and identifies the sums that exceed the threshold value. The estimation unit 258 then estimates the one or more emotions corresponding to the identified sums as the one or more emotions held by the user U. In the following description, adding the voice evaluation value X and the character evaluation value Y means adding them emotion by emotion to generate four sums.
The estimation unit 258 is not limited to simply adding the voice evaluation value X and the character evaluation value Y. It may instead multiply one of the two evaluation values by a predetermined value α and add the result to the other evaluation value. The predetermined value α is, for example, a value set by the developer of the user device 1 or by the user U.
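A minimal sketch of this estimation step follows. The function name `estimate_emotions`, the dictionary keys, and the numeric values in the usage note are hypothetical. Applying α to the character evaluation value Y rather than to the voice evaluation value X is a choice made for this sketch; the description permits either.

```python
from typing import Dict, List


def estimate_emotions(
    x: Dict[str, float],
    y: Dict[str, float],
    threshold: float,
    alpha: float = 1.0,
) -> List[str]:
    """Return the emotions whose combined score exceeds the threshold.

    x: corrected voice evaluation values X, keyed by emotion.
    y: character evaluation values Y, keyed by emotion.
    alpha: weight applied to Y (an assumption; it could equally scale X).
    """
    totals = {emotion: x[emotion] + alpha * y[emotion] for emotion in x}
    return [emotion for emotion, total in totals.items() if total > threshold]
```

For example, with X = {joy: 0.6, anger: 0.1, sadness: 0.2, normal: 0.1}, Y = {joy: 1, others: 0}, and a threshold of 1.0, only joy exceeds the threshold when α = 1.0. With α = 0.3, no emotion exceeds it.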

The estimation unit 258 outputs estimated emotion information EI indicating the one or more emotions that the user U is estimated to hold. The estimated emotion information EI has, for example, the following two aspects. In the first aspect, the estimated emotion information EI consists of identifiers indicating the one or more estimated emotions. The identifiers include an identifier indicating joy, an identifier indicating anger, an identifier indicating sadness, and an identifier indicating the normal state. In the second aspect, the estimated emotion information EI consists of the identifiers indicating the one or more estimated emotions together with an evaluation value for each estimated emotion. The evaluation value of an estimated emotion is, for example, the sum, for that emotion, of the voice evaluation value X of the corrected emotion information CVE and the character evaluation value Y.

The output unit 26 outputs data obtained by executing processing according to the one or more emotions indicated by the estimated emotion information EI. The output unit 26 has, for example, the following two aspects. The output unit 26 in the first aspect outputs data obtained by executing, on the recognition character string RT obtained by the voice recognition processing unit 2561, processing according to the one or more emotions indicated by the estimated emotion information EI. The processing according to the emotion has, for example, the following two aspects.
The first aspect of the processing according to the emotion is a process of adding a figure that embodies the emotion to the recognition character string RT. Figures that embody emotions are, for example, pictograms (emoji) and emoticons. A pictogram is an image associated with a character code, for example a Unicode code point. An emoticon is a character string that expresses a face by combining symbols and characters. In the following description, the figure that embodies the emotion is assumed to be a pictogram. The pictogram that embodies joy is, for example, a pictogram showing a smiling face. The pictogram that embodies anger is, for example, a pictogram showing an angry face. The pictogram that embodies sadness is, for example, a pictogram showing a crying face. Further, when the estimated emotion information EI is in the second aspect, the output unit 26 may select, as the pictogram to be added to the recognition character string RT, a pictogram that embodies the emotion indicated by the estimated emotion information EI at an intensity corresponding to the evaluation value included in the estimated emotion information EI. For example, when the emotion indicated by the estimated emotion information EI is sadness and the evaluation value included in the estimated emotion information EI is equal to or less than a predetermined threshold value, the output unit 26 selects a pictogram showing a face shedding a tear. When the emotion is sadness and the evaluation value is greater than the predetermined threshold value, the output unit 26 selects a pictogram showing a face crying loudly. The pictogram showing a face crying loudly embodies sadness of a higher intensity than the pictogram showing a face shedding a tear.
The output unit 26 outputs a character string with a pictogram, obtained by adding the pictogram to the recognition character string RT. The pictogram is added at, for example, one of the following two positions. The first position is the end of the recognition character string RT. The second position is immediately after the character string, within the recognition character string RT, that is included in the emotion classification information 33. The display device 4 displays an image based on the character string with the pictogram output by the output unit 26.
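The pictogram selection and the two insertion positions can be sketched as follows. The specific code points, the threshold of 2.0, the absence of a pictogram for the normal state, and the function names `choose_pictogram` and `attach_pictogram` are assumptions made for this illustration; the description does not name particular pictograms.

```python
from typing import Optional

# Hypothetical threshold separating mild from strong sadness.
STRONG_SADNESS_THRESHOLD = 2.0


def choose_pictogram(emotion: str, score: Optional[float] = None) -> str:
    """Pick a pictogram for the emotion, graded by score when available."""
    if emotion == "joy":
        return "\U0001F604"  # smiling face
    if emotion == "anger":
        return "\U0001F620"  # angry face
    if emotion == "sadness":
        if score is not None and score > STRONG_SADNESS_THRESHOLD:
            return "\U0001F62D"  # loudly crying face
        return "\U0001F622"  # crying face (a single tear)
    return ""  # assumption: no pictogram for the normal state


def attach_pictogram(rt: str, pictogram: str, keyword: Optional[str] = None) -> str:
    """Insert the pictogram right after `keyword` if present, else at the end."""
    if keyword and keyword in rt:
        idx = rt.index(keyword) + len(keyword)
        return rt[:idx] + pictogram + rt[idx:]
    return rt + pictogram
```

Calling `attach_pictogram(rt, choose_pictogram("joy"))` corresponds to the first position (end of the string), and passing the matched emotion word as `keyword` corresponds to the second position.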

The second aspect of the processing according to the emotion is a process of generating a synthetic voice that is read aloud with emotion-based intonation. Intonation here means, for example, raising or lowering the reading speed, or raising or lowering the volume. Intonation based on joy is, for example, a higher reading speed. Intonation based on anger is, for example, a higher volume. Intonation based on sadness is, for example, a lower volume. The output unit 26 adds the emotion-based intonation to the synthetic voice indicated by the generated data and outputs information indicating the synthetic voice read aloud with that intonation. The sound emitting device 7 emits the synthetic voice indicated by the data output by the output unit 26.

The output unit 26 in the second aspect outputs a pictogram that embodies the one or more emotions indicated by the estimated emotion information EI. In the second aspect, the output unit 26 does not need to use the recognition character string RT. In the following description, the output unit 26 is assumed to be in the first aspect.

1.2. Operation of the First Embodiment
Next, the operation of the user device 1 will be described with reference to FIG. 6.

FIG. 6 is a flowchart showing the operation of the user device 1. The processing device 2 generates the correction information CI according to one of the two generation methods described above (step S1). Next, the acquisition unit 21 acquires the sound information SI (step S2). The voice recognition processing unit 2561 then executes the voice recognition process on the sound information SI and outputs the recognition character string RT (step S3). Next, the morphological analysis processing unit 2563 refers to the analysis dictionary information 31, executes the morphological analysis process on the recognition character string RT, and outputs the corrected recognition character string CRT (step S4). The evaluation value calculation unit 2565 then calculates the character evaluation value Y of each emotion by comparing the character strings included in the emotion classification information 33 with the corrected recognition character string CRT, and generates the character emotion information TE including the character evaluation value Y of each emotion (step S5).

In addition, the noise removing unit 251 removes noise from the sound indicated by the sound information SI according to the first parameter P1 and the second parameter P2 to generate the voice information VI (step S6). The voice evaluation unit 252 then extracts sound feature amounts from the noise-removed voice information VI (step S7). Next, the voice evaluation unit 252 inputs the sound feature amounts into the learning model LM and acquires, from the learning model LM, the voice emotion information VE including the voice evaluation value x of each emotion (step S8). The correction unit 253 uses the correction information CI to correct the voice evaluation value x of each emotion included in the voice emotion information VE and thereby generates the corrected emotion information CVE (step S9).

The estimation unit 258 estimates the emotion held by the user U based on the corrected emotion information CVE and the character emotion information TE, and outputs the estimated emotion information EI (step S10). The output unit 26 outputs information obtained by executing, on the recognition character string RT, processing according to the emotion indicated by the estimated emotion information EI (step S11). After step S11 is completed, the user device 1 ends the series of processes shown in FIG. 6.

1.3. Effects of the First Embodiment
As described above, the user device 1 estimates the emotion of the user U by using the general-purpose learning model LM, so the time required to generate the learning model LM can be shortened compared with generating a learning model adjusted for each individual.
If a plurality of feature amounts of an average human voice are input into the general-purpose learning model LM, the emotion held by an average person can be estimated. However, the voice of the user U differs from the average human voice because it is affected by the gender and age of the user U, the characteristics of the user U's way of speaking, and the like. Accordingly, simply using the general-purpose learning model LM lowers the accuracy with which the emotion held by the user U is determined.
In the user device 1 described above, the voice emotion information VE output from the learning model LM is corrected by using the correction information CI based on the characteristics of the voice of the user U. The emotion held by the user U can therefore be estimated with high accuracy while still using the general-purpose learning model LM.

In addition, the user device 1 removes noise from the sound indicated by the sound information SI to generate the voice information VI, and inputs sound feature amounts based on the voice information VI into the learning model LM. Inputting feature amounts based on the voice information VI yields more accurate voice emotion information VE than inputting feature amounts based on the sound information SI.

In addition, the user device 1 estimates the one or more emotions held by the user U based on both the corrected emotion information CVE and the character emotion information TE, so it can estimate the emotion held by the user U more accurately than when the estimation is based on the corrected emotion information CVE alone.

2. Second Embodiment
The user device 1a according to the second embodiment differs from the user device 1 according to the first embodiment in that it prompts the user U to utter a voice that explicitly expresses an emotion, acquires explicit voice emotion information VEa of the user U from the learning model LM, and adjusts the correction information CI in order to increase the likelihood that the estimation unit 258 estimates the emotion held by the user U to be the explicitly expressed emotion. In the second embodiment, the user device 1a can operate in an emotion estimation mode, in which it estimates the emotion held by the user U, and in a calibration mode, in which it adjusts the correction information CI. The emotion estimation mode corresponds to the first embodiment, so its description is omitted. The user device 1a according to the second embodiment is described below. For elements of the second embodiment whose actions or functions are equivalent to those of the first embodiment, the reference numerals used in the above description are reused and detailed descriptions are omitted as appropriate.

2.1. Functions of the Second Embodiment
FIG. 7 is a block diagram showing the user device 1a according to the second embodiment. The user device 1a is realized by a computer system including a processing device 2a, a storage device 3a, a display device 4, an operating device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The storage device 3a is a recording medium readable by the processing device 2a and stores a plurality of programs including the control program PRa executed by the processing device 2a, the analysis dictionary information 31, and the emotion classification information 33.

The processing device 2a functions as an acquisition unit 21a, an emotion estimation unit 25a, and an output unit 26 by reading the control program PRa from the storage device 3a and executing it.

FIG. 8 is a diagram showing an outline of the functions of the user device 1a according to the second embodiment. The emotion estimation unit 25a includes a noise removing unit 251, a voice evaluation unit 252, a correction unit 253, an adjustment unit 254, a character evaluation unit 256, and an estimation unit 258.

The acquisition unit 21a acquires sound information SIa indicating a sound that includes a voice in which the user U explicitly expresses one of the plurality of emotions. Specifically, when the user U sets the user device 1a to the calibration mode by operating the operating device 5, the processing device 2 displays on the display device 4 a screen prompting the user U to speak while explicitly expressing one of the plurality of emotions. This "one emotion" is hereinafter referred to as the "explicit emotion". The acquisition unit 21a acquires the sound information SI obtained after the screen is displayed as the sound information SIa indicating a sound that includes a voice in which the user U expresses the explicit emotion. Which of the plurality of emotions is used as the explicit emotion may be set in advance by, for example, the developer of the user device 1a, or may be selected by the user U from the plurality of emotions.

The noise removing unit 251 removes noise from the sound indicated by the sound information SIa to generate the voice information VIa.

The voice evaluation unit 252 inputs sound feature amounts based on the voice information VIa into the learning model LM and acquires the explicit voice emotion information VEa of the user U from the learning model LM.

The adjustment unit 254 adjusts the correction information CI based on the explicit voice emotion information VEa in order to increase the likelihood that the estimation unit 258 estimates the emotion held by the user U to be the explicit emotion. For example, the adjustment unit 254 executes one or both of a process of increasing the coefficient k corresponding to the explicit emotion and a process of decreasing the coefficients k corresponding to the emotions other than the explicit emotion. For example, the adjustment unit 254 generates the coefficients k1 to k4 according to the following equations. Here, with respect to the ideal voice evaluation values X obtained when the user U utters a predetermined voice while expressing an emotion, the voice evaluation value of joy is denoted Xa1, the voice evaluation value of anger Xa2, the voice evaluation value of sadness Xa3, and the voice evaluation value of the normal state Xa4.
k1 = Xa1 / x1
k2 = Xa2 / x2
k3 = Xa3 / x3
k4 = Xa4 / x4
However, the coefficient k does not necessarily have to be equal to Xa / x.
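The ratio-based adjustment can be illustrated with a short sketch. The function names `calibrate` and `apply_correction`, the small constant `eps` that guards against division by zero, and the numeric values in the usage note are assumptions. The sketch also assumes that the correction multiplies each raw value x by its coefficient k, which is consistent with the ratios above.

```python
from typing import Dict


def calibrate(
    ideal: Dict[str, float],
    observed: Dict[str, float],
    eps: float = 1e-6,
) -> Dict[str, float]:
    """Compute k = Xa / x for each emotion.

    ideal: ideal voice evaluation values Xa for the explicit utterance.
    observed: raw voice evaluation values x returned by the learning model.
    eps: hypothetical floor that avoids dividing by zero.
    """
    return {e: ideal[e] / max(observed[e], eps) for e in ideal}


def apply_correction(k: Dict[str, float], x: Dict[str, float]) -> Dict[str, float]:
    """Corrected values X = k * x (assumed multiplicative correction)."""
    return {e: k[e] * x[e] for e in x}
```

For instance, if the user U is asked to express joy, the ideal value of joy is 0.8, and the learning model returns only 0.4, then k1 = 2.0, and applying the correction to the same utterance restores a corrected value close to 0.8.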

2.2. Operation of the Second Embodiment
Next, the operation of the user device 1a in the calibration mode will be described with reference to FIG. 9.

FIG. 9 is a flowchart showing the operation of the user device 1a in the calibration mode. The acquisition unit 21a acquires the sound information SIa indicating a sound that includes a voice in which the user U expresses the explicit emotion (step S21). Next, the noise removing unit 251 removes noise according to the first parameter P1 and the second parameter P2 to generate the voice information VIa (step S22). The voice evaluation unit 252 then extracts sound feature amounts from the noise-removed voice information VIa (step S23). Next, the voice evaluation unit 252 inputs the sound feature amounts into the learning model LM and acquires, from the learning model LM, the voice emotion information VEa including the voice evaluation value x of each emotion (step S24).

The adjustment unit 254 corrects the plurality of voice evaluation values x included in the explicit voice emotion information VEa by the same method as the correction unit 253 (step S25). Next, the adjustment unit 254 adjusts the correction information CI in order to increase the likelihood that the estimation unit 258 estimates the emotion held by the user U to be the explicit emotion (step S26). After step S26 is completed, the user device 1a ends the series of processes shown in FIG. 9.

２．３．第２実施形態の効果
以上の説明によれば、ユーザＵが明示感情を発露させた音声を発話した場合に、ユーザ装置１ａは、ユーザＵが抱く感情が明示感情であると推定部２５８が推定する可能性を高くする目的で、明示的な音声感情情報ＶＥａに基づいて補正情報ＣＩを調整する態様を有する。この態様では、音声感情情報ＶＥａによって推定される感情の正解が判明しており、補正情報ＣＩを調整したユーザＵ用の補正感情情報ＣＶＥは、補正情報ＣＩを調整していないユーザＵ用の補正感情情報ＣＶＥと比較して、ユーザＵが抱く感情の推定精度を向上できる。
また、ユーザＵが明示感情を発露させた音声を発話したとしても、感情を音声に発露させる強度はユーザＵ間で互いに異なる。例えば、あるユーザＵは、感情を音声に発露させる強度が高い一方で、別のユーザＵは、感情を音声に発露させる強度が低い場合がある。第２実施形態における補正情報ＣＩは、感情を音声に発露させる強度の違いも反映される。例えば、感情を音声に発露させる強度が高いユーザＵは、上述の理想的な音声評価値Ｘに対して、音声評価値ｘが近い値となり、係数ｋが１に近い値となる。一方、感情を音声に発露させる強度が低いユーザＵは、上述の理想的な音声評価値Ｘに対して、音声評価値ｘが小さい値となり、係数ｋが１から離れた値となる。以上により、発露させる強度が低いユーザＵほど、係数ｋが１から離れた値になり、感情を音声に発露させる強度の違いが補正情報ＣＩに反映されるため、ユーザＵが抱く感情の推定精度を向上できる。
2.3. Effects of the Second Embodiment
As described above, when the user U utters a voice expressing an explicit emotion, the user device 1a adjusts the correction information CI based on the explicit voice emotion information VEa in order to increase the likelihood that the estimation unit 258 estimates the emotion held by the user U to be the explicit emotion. In this aspect, the correct emotion for the voice emotion information VEa is known, so the corrected emotion information CVE for a user U whose correction information CI has been adjusted yields a more accurate estimate of the emotion held by the user U than the corrected emotion information CVE for a user U whose correction information CI has not been adjusted.
Further, even when users U utter a voice expressing an explicit emotion, the intensity with which the emotion is expressed in the voice differs among users U. For example, one user U may express emotions strongly in the voice, while another user U may express them only weakly. The correction information CI in the second embodiment also reflects this difference in intensity. For example, for a user U who expresses emotions strongly, the voice evaluation value x is close to the above-mentioned ideal voice evaluation value X, and the coefficient k is close to 1. On the other hand, for a user U who expresses emotions weakly, the voice evaluation value x is smaller than the ideal voice evaluation value X, and the coefficient k is far from 1. Thus, the weaker the expression of a user U, the farther the coefficient k is from 1; because this difference in intensity is reflected in the correction information CI, the accuracy of estimating the emotion held by the user U can be improved.

３．第３実施形態
第３実施形態にかかる感情推定システムＳＹＳは、第２実施形態で示した機能によってユーザ装置１ｂをキャリブレーションモードに設定して、明示感情を発露させたユーザＵの感情推定結果を利用して、ユーザ装置１ｂをキャリブレーションモードに設定していなく、明示感情を発露させていないユーザＵの補正情報ＣＩを調整する構成を有する点で、第２実施形態にかかるユーザ装置１ａと相違する。以下の説明において、ユーザ装置１ｂをキャリブレーションモードに設定し、明示感情を発露させたユーザＵを、「キャリブレーション済みユーザ」と称し、キャリブレーションモードに設定していなく、明示感情を発露させていないユーザＵを、「非キャリブレーションユーザ」と称する。
以下、第３実施形態にかかる感情推定システムＳＹＳを説明する。なお、以下に例示する第３実施形態において作用又は機能が第２実施形態と同等である要素については、以上の説明で参照の符号を流用して各々の詳細な説明を適宜に省略する。
3. Third Embodiment
The emotion estimation system SYS according to the third embodiment differs from the user device 1a according to the second embodiment in that it uses the emotion estimation result of a user U who has set the user device 1b to the calibration mode by the function described in the second embodiment and has expressed an explicit emotion, in order to adjust the correction information CI of a user U who has not set the user device 1b to the calibration mode and has not expressed an explicit emotion. In the following description, a user U who has set the user device 1b to the calibration mode and expressed an explicit emotion is referred to as a "calibrated user", and a user U who has not set the user device 1b to the calibration mode and has not expressed an explicit emotion is referred to as a "non-calibrated user".
Hereinafter, the emotion estimation system SYS according to the third embodiment will be described. For elements of the third embodiment illustrated below whose actions or functions are equivalent to those of the second embodiment, the reference signs used in the above description are reused, and detailed description of each is omitted as appropriate.

３．１．第３実施形態の概要
図１０は、感情推定システムＳＹＳの全体構成を示す図である。感情推定システムＳＹＳは、ユーザＵが所持するユーザ装置１ｂと、ネットワークＮＷと、サーバ装置１０とを備える。感情推定システムＳＹＳに含まれるユーザ装置１は、ユーザ装置１ｂ１からユーザ装置１ｂｍまでである。ｍは２以上の整数である。ユーザ装置１ｂ１を所持するユーザＵが、ユーザＵ１であり、ユーザ装置１ｂｍを所持するユーザＵは、ユーザＵｍである。
3.1. Outline of the Third Embodiment
FIG. 10 is a diagram showing the overall configuration of the emotion estimation system SYS. The emotion estimation system SYS includes user devices 1b owned by users U, a network NW, and a server device 10. The user devices included in the emotion estimation system SYS are the user devices 1b1 to 1bm, where m is an integer of 2 or more. The user U who possesses the user device 1b1 is the user U1, and the user U who possesses the user device 1bm is the user Um.

以下では、説明の簡略化のため、ユーザＵ１が、キャリブレーション済みユーザであり、ユーザＵ２が、非キャリブレーションユーザであるとして、説明を行う。 In the following, for the sake of brevity, it is assumed that the user U1 is a calibrated user and the user U2 is a non-calibrated user.

図１１は、ユーザ装置１ｂの構成を示すブロック図である。ユーザ装置１ｂは、処理装置２ｂ、記憶装置３ｂ、表示装置４、操作装置５、通信装置６、放音装置７、及び、集音装置８を具備するコンピュータシステムにより実現される。記憶装置３ｂは、処理装置２ｂが読取可能な記録媒体であり、処理装置２ｂが実行する制御プログラムＰＲｂを含む複数のプログラムを記憶する。 FIG. 11 is a block diagram showing the configuration of the user device 1b. The user device 1b is realized by a computer system including a processing device 2b, a storage device 3b, a display device 4, an operating device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The storage device 3b is a recording medium that can be read by the processing device 2b, and stores a plurality of programs including the control program PRb executed by the processing device 2b.

処理装置２ｂは、記憶装置３ｂから制御プログラムＰＲｂを読み取り実行することによって、取得部２１、及び、出力部２６として機能する。 The processing device 2b functions as an acquisition unit 21 and an output unit 26 by reading and executing the control program PRb from the storage device 3b.

図１２は、サーバ装置１０の構成を示すブロック図である。サーバ装置１０は、処理装置２Ｂ、記憶装置３Ｂ、及び、通信装置６Ｂを具備するコンピュータシステムにより実現される。サーバ装置１０の各要素は、情報を通信するための単体又は複数のバス９Ｂで相互に接続される。記憶装置３Ｂは、処理装置２Ｂが読取可能な記録媒体であり、処理装置２Ｂが実行する制御プログラムＰＲＢを含む複数のプログラム、解析用辞書情報３１、感情分類情報３３、及び、学習モデルＬＭを記憶する。 FIG. 12 is a block diagram showing the configuration of the server device 10. The server device 10 is realized by a computer system including a processing device 2B, a storage device 3B, and a communication device 6B. The elements of the server device 10 are connected to each other by one or more buses 9B for communicating information. The storage device 3B is a recording medium readable by the processing device 2B, and stores a plurality of programs including the control program PRB executed by the processing device 2B, the analysis dictionary information 31, the emotion classification information 33, and the learning model LM.

処理装置２Ｂは、記憶装置３Ｂから制御プログラムＰＲＢを読み取り実行することによって、感情推定部２５Ｂとして機能する。図１３を用いて、感情推定システムＳＹＳの機能について説明する。 The processing device 2B functions as the emotion estimation unit 25B by reading the control program PRB from the storage device 3B and executing it. The functions of the emotion estimation system SYS will be described with reference to FIG. 13.

図１３は、感情推定システムＳＹＳの機能の概要を示す図である。感情推定部２５Ｂは、ノイズ除去部２５１、音声評価部２５２Ｂ、補正部２５３、調整部２５４、文字評価部２５６、推定部２５８、及び、特定部２５９を含む。 FIG. 13 is a diagram showing an outline of the functions of the emotion estimation system SYS. The emotion estimation unit 25B includes a noise removal unit 251, a voice evaluation unit 252B, a correction unit 253, an adjustment unit 254, a character evaluation unit 256, an estimation unit 258, and an identification unit 259.

ユーザ装置１ｂ１の取得部２１は、ユーザＵ１の音声を含む音を集音する集音装置８が出力する音情報ＳＩ１を取得する。図１４を用いて、処理装置２Ｂによって実現される機能である、非キャリブレーションユーザの補正情報ＣＩの調整機能の概要を示す。 The acquisition unit 21 of the user device 1b1 acquires the sound information SI1 output by the sound collecting device 8 that collects the sound including the voice of the user U1. FIG. 14 shows an outline of the adjustment function of the correction information CI of the non-calibration user, which is a function realized by the processing device 2B.

図１４は、非キャリブレーションユーザの補正情報ＣＩの調整機能の概要を示す図である。図１４では、キャリブレーション済みであるユーザＵ１が、「ありがとう」と発声し、ユーザ装置１ｂ１の取得部２１が、音情報ＳＩ１を取得した状態を示している。 FIG. 14 is a diagram showing an outline of the adjustment function of the correction information CI of the non-calibrated user. FIG. 14 shows a state in which the calibrated user U1 utters “Thank you” and the acquisition unit 21 of the user device 1b1 has acquired the sound information SI1.

説明を図１３に戻す。ユーザＵ１に関して、ノイズ除去部２５１は、音情報ＳＩ１が示す音からノイズを除去して音声情報ＶＩ１を生成する。音声評価部２５２Ｂは、学習モデルＬＭに対して、音声情報ＶＩ１から抽出した音の特徴量を入力し、音声感情情報ＶＥ１を学習モデルＬＭから取得する。補正部２５３は、ユーザＵ１の音声の特徴に基づく補正情報ＣＩ１を用いて音声感情情報ＶＥ１を補正した補正感情情報ＣＶＥ１を生成する。また、音声認識処理部２５６１は、音声認識処理を音情報ＳＩ１に対して実行し、音声認識処理の認識結果を示す認識文字列ＲＴ１を取得する。続けて、形態素解析処理部２５６３は、解析用辞書情報３１を参照して、認識文字列ＲＴ１に対して形態素解析処理を実行して、補正後認識文字列ＣＲＴ１を出力する。評価値算出部２５６５は、感情分類情報３３に含まれる文字列と、補正後認識文字列ＣＲＴ１とを比較することにより、文字感情情報ＴＥ１を生成する。
図１４では、サーバ装置１０が、音情報ＳＩ１に基づいて、補正感情情報ＣＶＥ１と文字感情情報ＴＥ１とを生成した状態を示している。 The description now returns to FIG. 13. For the user U1, the noise removal unit 251 removes noise from the sound indicated by the sound information SI1 to generate the voice information VI1. The voice evaluation unit 252B inputs the sound feature amounts extracted from the voice information VI1 into the learning model LM, and acquires the voice emotion information VE1 from the learning model LM. The correction unit 253 generates the corrected emotion information CVE1 by correcting the voice emotion information VE1 using the correction information CI1, which is based on the voice characteristics of the user U1. In addition, the voice recognition processing unit 2561 executes voice recognition processing on the sound information SI1 and acquires the recognition character string RT1 indicating the recognition result. Subsequently, the morphological analysis processing unit 2563 refers to the analysis dictionary information 31, executes morphological analysis processing on the recognition character string RT1, and outputs the corrected recognition character string CRT1. The evaluation value calculation unit 2565 generates the character emotion information TE1 by comparing the character strings included in the emotion classification information 33 with the corrected recognition character string CRT1.
FIG. 14 shows a state in which the server device 10 has generated the corrected emotion information CVE1 and the character emotion information TE1 based on the sound information SI1.

特定部２５９は、補正感情情報ＣＶＥ１に含まれる複数の音声評価値Ｘと、文字感情情報ＴＥ１に含まれる文字評価値Ｙとの相違の程度を示す値が所定値以下である場合、認識文字列ＲＴ１を特定文字列ＳＴとして特定する。特定文字列ＳＴとして特定されやすい文字列は、この文字列が有する本来の意味で発話されることが多い文字列であり、例えば、「ありがとう」、及び「ふざけるな」等である。
ただし、「ありがとう」といった言葉も、時に社交辞令又は皮肉として発話されることもあり、「ありがとう」が有する本来の意味である「感謝」の意味で発話されない場合がある。この場合、音声情報ＶＩ１には喜びが発露していないため、音声評価値Ｘと文字評価値Ｙとが大きく相違する。そこで、サーバ装置１０は、キャリブレーション済みユーザの認識文字列ＲＴと、音声評価値Ｘと、文字評価値Ｙと、音声情報ＶＩ１を生成した日時とを対応付けてログとして記憶し、特定部２５９は、このログを参照して、認識文字列ＲＴに対する音声評価値Ｘ及び文字評価値Ｙの相違の程度の傾向に基づいて、特定文字列ＳＴを特定してもよい。例えば、特定部２５９は、現在時刻から過去のある時刻までにおいて、音声評価値Ｘ及び文字評価値Ｙの相違の程度を示す値が所定値以下となった割合が所定の割合以上となった認識文字列ＲＴを、特定文字列ＳＴとして特定する。
相違の程度を示す値は、例えば、以下に示す２つの態様がある。第１の態様における相違の程度を示す値は、複数の感情の各々について、音声評価値Ｘと文字評価値Ｙとの差分の２乗の和Ｓｕｍ_ＸＹである。和Ｓｕｍ_ＸＹは、例えば、下記（１）式により求められる。
Ｓｕｍ_ＸＹ＝（Ｘ１−Ｙ１）^２＋（Ｘ２−Ｙ２）^２＋（Ｘ３−Ｙ３）^２＋（Ｘ４−Ｙ４）^２（１）
第２の態様における相違の程度を示す値は、補正感情情報ＣＶＥ１及び文字感情情報ＴＥ１を４次元のベクトルとみなした場合の補正感情情報ＣＶＥ１及び文字感情情報ＴＥ１の角度θである。角度θが大きい程、補正感情情報ＣＶＥ１と文字感情情報ＴＥ１とが相違すると言える。例えば、角度θは、下記（２）式により求められる。
θ＝ｃｏｓ^−１（（ＣＶＥ１・ＴＥ１）／（｜ＣＶＥ１｜×｜ＴＥ１｜））（２）
ただし、ＣＶＥ１・ＴＥ１は、補正感情情報ＣＶＥ１と文字感情情報ＴＥ１の内積を示す。｜ＣＶＥ１｜は、補正感情情報ＣＶＥ１の大きさを示す。｜ＴＥ１｜は、文字感情情報ＴＥ１の大きさを示す。
以下の説明では、相違の程度を示す値は、和Ｓｕｍ_ＸＹであるとする。
図１４では、和Ｓｕｍ_ＸＹが所定値以下である例を示す。従って、特定部２５９は、認識文字列ＲＴ１である「ありがとう」を特定文字列ＳＴとして特定する。 The identification unit 259 identifies the recognition character string RT1 as a specific character string ST when the value indicating the degree of difference between the plurality of voice evaluation values X included in the corrected emotion information CVE1 and the plurality of character evaluation values Y included in the character emotion information TE1 is equal to or less than a predetermined value. Character strings that are readily identified as a specific character string ST are those that are usually uttered in their original meaning, for example, "Thank you" and "Stop fooling around".
However, even a phrase such as "Thank you" is sometimes uttered as a polite formality or sarcastically, and in such cases it is not uttered in its original sense of gratitude. Joy is then not expressed in the voice information VI1, so the voice evaluation values X and the character evaluation values Y differ greatly. Therefore, the server device 10 may store, as a log, the recognition character string RT of a calibrated user in association with the voice evaluation values X, the character evaluation values Y, and the date and time at which the voice information VI1 was generated, and the identification unit 259 may refer to this log and identify the specific character string ST based on the tendency of the degree of difference between the voice evaluation values X and the character evaluation values Y for each recognition character string RT. For example, the identification unit 259 identifies, as a specific character string ST, a recognition character string RT for which, in the period from a certain past time to the current time, the proportion of utterances whose value indicating the degree of difference was equal to or less than the predetermined value is equal to or greater than a predetermined proportion.
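The log-based identification described above can be sketched as follows. This is only an illustrative sketch: the names `LogEntry`, `is_specific_string`, `window`, `max_difference`, and `min_ratio` are hypothetical and do not appear in the specification, and treating a string with no log entries as not yet identified is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEntry:
    recognized_text: str  # recognition character string RT
    difference: float     # value indicating the degree of difference between X and Y
    timestamp: datetime   # date and time the voice information was generated


def is_specific_string(log, text, now, window, max_difference, min_ratio):
    """Return True if `text` qualifies as a specific character string ST."""
    # Keep only entries for this string inside the look-back window.
    recent = [
        e for e in log
        if e.recognized_text == text and now - window <= e.timestamp <= now
    ]
    if not recent:
        return False  # assumption: no evidence yet, so do not promote the string
    # Count utterances whose difference value was at or below the threshold.
    close = sum(1 for e in recent if e.difference <= max_difference)
    return close / len(recent) >= min_ratio
```

With this shape, a phrase that is occasionally uttered sarcastically can still be identified as long as most of its recent utterances show a small difference.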
The value indicating the degree of difference can take, for example, either of the following two forms. In the first form, the value is Sum_XY, the sum over the plurality of emotions of the squared difference between the voice evaluation value X and the character evaluation value Y. The sum Sum_XY is obtained by, for example, the following equation (1).
$$\mathrm{Sum}_{XY} = (X_1 - Y_1)^2 + (X_2 - Y_2)^2 + (X_3 - Y_3)^2 + (X_4 - Y_4)^2 \qquad (1)$$
In the second form, the value is the angle θ between the corrected emotion information CVE1 and the character emotion information TE1 when each is regarded as a four-dimensional vector. The larger the angle θ, the more the corrected emotion information CVE1 and the character emotion information TE1 differ. For example, the angle θ is obtained by the following equation (2).
$$\theta = \cos^{-1}\left(\frac{\mathrm{CVE1} \cdot \mathrm{TE1}}{|\mathrm{CVE1}| \times |\mathrm{TE1}|}\right) \qquad (2)$$
Here, CVE1 · TE1 denotes the inner product of the corrected emotion information CVE1 and the character emotion information TE1, |CVE1| denotes the magnitude of the corrected emotion information CVE1, and |TE1| denotes the magnitude of the character emotion information TE1.
In the following description, the value indicating the degree of difference is assumed to be the sum Sum_XY.
FIG. 14 shows an example in which the sum Sum_XY is equal to or less than the predetermined value. Therefore, the identification unit 259 identifies "Thank you", which is the recognition character string RT1, as the specific character string ST.
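As an illustration of the two difference measures, the following minimal sketch computes the sum of squared differences of equation (1) and the angle of equation (2) for four-element lists. The function names are hypothetical, and raising an error for a zero vector and clamping the cosine against rounding error are assumptions that the specification does not address.

```python
import math


def sum_sq_diff(x, y):
    """Equation (1): sum of squared differences between X and Y."""
    if len(x) != len(y):
        raise ValueError("X and Y must have the same number of emotions")
    return sum((a - b) ** 2 for a, b in zip(x, y))


def angle(x, y):
    """Equation (2): angle in radians between X and Y viewed as vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: the angle is undefined when either vector is all zeros.
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cos_theta)
```

The angle ignores the overall magnitude of the two vectors, whereas Sum_XY does not, so a user whose voice evaluation values are uniformly small would look different under equation (1) but not under equation (2).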

ユーザＵ２に関して、図１４に示すように、ユーザＵ２が、特定文字列ＳＴである「ありがとう」を発話したとする。ユーザ装置１ｂ２の取得部２１が、音情報ＳＩ２を取得する。ノイズ除去部２５１は、音情報ＳＩ２が示す音からノイズを除去して音声情報ＶＩ２を生成する。音声評価部２５２Ｂは、学習モデルＬＭに対して、音声情報ＶＩ２から抽出した音の特徴量を入力し、音声感情情報ＶＥ２を学習モデルＬＭから取得する。補正部２５３は、ユーザＵ２の音声の特徴に基づく補正情報ＣＩ２を用いて音声感情情報ＶＥ２を補正した補正感情情報ＣＶＥ２を生成する。また、音声認識処理部２５６１は、音声認識処理を音情報ＳＩ２に対して実行し、音声認識処理の認識結果を示す認識文字列ＲＴ２を取得する。続けて、形態素解析処理部２５６３は、解析用辞書情報３１を参照して、認識文字列ＲＴ２に対して形態素解析処理を実行して、補正後認識文字列ＣＲＴ２を出力する。評価値算出部２５６５は、感情分類情報３３に含まれる文字列と、補正後認識文字列ＣＲＴ２とを比較することにより、文字感情情報ＴＥ２を生成する。
図１４では、サーバ装置１０が、音情報ＳＩ２に基づいて、補正感情情報ＣＶＥ２と文字感情情報ＴＥ２とを生成した状態を示している。 For the user U2, suppose that, as shown in FIG. 14, the user U2 utters "Thank you", which is a specific character string ST. The acquisition unit 21 of the user device 1b2 acquires the sound information SI2. The noise removal unit 251 removes noise from the sound indicated by the sound information SI2 to generate the voice information VI2. The voice evaluation unit 252B inputs the sound feature amounts extracted from the voice information VI2 into the learning model LM, and acquires the voice emotion information VE2 from the learning model LM. The correction unit 253 generates the corrected emotion information CVE2 by correcting the voice emotion information VE2 using the correction information CI2, which is based on the voice characteristics of the user U2. In addition, the voice recognition processing unit 2561 executes voice recognition processing on the sound information SI2 and acquires the recognition character string RT2 indicating the recognition result. Subsequently, the morphological analysis processing unit 2563 refers to the analysis dictionary information 31, executes morphological analysis processing on the recognition character string RT2, and outputs the corrected recognition character string CRT2. The evaluation value calculation unit 2565 generates the character emotion information TE2 by comparing the character strings included in the emotion classification information 33 with the corrected recognition character string CRT2.
FIG. 14 shows a state in which the server device 10 has generated the corrected emotion information CVE2 and the character emotion information TE2 based on the sound information SI2.

非キャリブレーションユーザであるユーザＵ２が、特定文字列ＳＴを発話した場合には、調整部２５４は、ユーザＵ２の補正感情情報ＣＶＥ２に含まれる複数の音声評価値Ｘを、複数の感情の各々について、ユーザＵ２の文字感情情報ＴＥ２に含まれる複数の文字評価値Ｙに近づける目的で、ユーザＵ２用の補正情報ＣＩ２を調整する。例えば、調整部２５４は、下記式に従って、係数ｋ１〜ｋ４を生成する。
ｋ１＝Ｙ１／Ｘ１
ｋ２＝Ｙ２／Ｘ２
ｋ３＝Ｙ３／Ｘ３
ｋ４＝Ｙ４／Ｘ４
但し、係数ｋは、必ずしもＹ／Ｘと一致する必要はない。 When the user U2, who is a non-calibrated user, utters a specific character string ST, the adjustment unit 254 adjusts the correction information CI2 for the user U2 so that, for each of the plurality of emotions, the voice evaluation values X included in the corrected emotion information CVE2 of the user U2 approach the character evaluation values Y included in the character emotion information TE2 of the user U2. For example, the adjustment unit 254 generates the coefficients k1 to k4 according to the following equations.
k1 = Y1 / X1
k2 = Y2 / X2
k3 = Y3 / X3
k4 = Y4 / X4
However, the coefficient k does not necessarily have to match Y / X.
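The coefficient generation above can be sketched as follows. This is a hedged illustration rather than the method of the specification: the function name `adjust_coefficients`, the damping parameter `rate`, and the tolerance `eps` are hypothetical. With `rate=1.0` the result is exactly k = Y / X; a smaller `rate` is one possible way for k to deviate from Y / X, which the text permits. Falling back to k = 1 when X is zero is an assumption, since the ratio is undefined in that case.

```python
def adjust_coefficients(x, y, rate=1.0, eps=1e-9):
    """Return coefficients k that move each X toward the corresponding Y."""
    if len(x) != len(y):
        raise ValueError("X and Y must have the same number of emotions")
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    coefficients = []
    for xi, yi in zip(x, y):
        if abs(xi) < eps:
            # Assumption: Y / X is undefined here, so leave this emotion as is.
            coefficients.append(1.0)
        else:
            # rate == 1.0 gives exactly Y / X; smaller rates move only partway.
            coefficients.append(1.0 + rate * (yi / xi - 1.0))
    return coefficients
```

For example, `adjust_coefficients([0.5, 0.2, 0.4, 0.0], [1.0, 0.1, 0.4, 0.0])` yields a coefficient above 1 for the first emotion, below 1 for the second, and 1 for the last two.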

３．２．第３実施形態の動作
第２実施形態と同様に、第３実施形態でも、ユーザ装置１ｂは、ユーザＵの感情を推定する感情推定モードと、補正情報ＣＩを調整するキャリブレーションモードとを取り得る。ユーザ装置１ｂがキャリブレーションモードに設定された場合、サーバ装置１０が、ステップＳ２１に示す音情報ＳＩをユーザ装置１ｂから取得して、ステップＳ２１以降の各ステップを実行すればよい。図９に示す一連の処理終了後、サーバ装置１０は、キャリブレーションモードに設定されたユーザ装置１ｂの識別情報を、キャリブレーション済みユーザが所持するユーザ装置１ｂとして記憶装置３Ｂに記憶する。ユーザ装置１ｂの識別情報は、例えば、ＵＩＤ（User IDentifier）、ＭＡＣ（Media Access Control）アドレス、加入者認証モジュール（ＳＩＭ：Subscriber Identity Module）に記録されたＩＭＳＩ（International Mobile Subscriber Identity）、又はユーザＩＤ等である。ＵＩＤは、サービスを提供する事業者が、ユーザごとに割り当てたＩＤである。感情推定モードにおける感情推定システムＳＹＳの動作について、図１５及び図１６を用いて説明する。
3.2. Operation of the Third Embodiment
As in the second embodiment, the user device 1b in the third embodiment can take an emotion estimation mode for estimating the emotion of the user U and a calibration mode for adjusting the correction information CI. When the user device 1b is set to the calibration mode, the server device 10 may acquire the sound information SI of step S21 from the user device 1b and execute the steps from step S21 onward. After the series of processes shown in FIG. 9 is completed, the server device 10 stores, in the storage device 3B, the identification information of the user device 1b that was set to the calibration mode as that of a user device 1b possessed by a calibrated user. The identification information of the user device 1b is, for example, a UID (User IDentifier), a MAC (Media Access Control) address, an IMSI (International Mobile Subscriber Identity) recorded in a subscriber identity module (SIM), or a user ID. The UID is an ID assigned to each user by the service provider. The operation of the emotion estimation system SYS in the emotion estimation mode will be described with reference to FIGS. 15 and 16.

図１５及び図１６は、感情推定モードにおける感情推定システムＳＹＳの動作を示すフローチャートである。サーバ装置１０は、ユーザ装置１ｂから、補正情報ＣＩを取得する（ステップＳ３１）。具体的には、ユーザ装置１ｂが、上述した補正情報ＣＩの２つの生成方法のいずれか一方に従って、補正情報ＣＩを生成し、サーバ装置１０に補正情報ＣＩを送信する。次に、サーバ装置１０は、ユーザ装置１ｂから、音情報ＳＩを取得する（ステップＳ３２）。そして、感情推定部２５Ｂの音声認識処理部２５６１は、音情報ＳＩに対して音声認識処理を実行し、認識文字列ＲＴを出力する（ステップＳ３３）。次に、感情推定部２５Ｂの形態素解析処理部２５６３は、解析用辞書情報３１を参照して、認識文字列ＲＴに対して形態素解析処理を実行して、補正後認識文字列ＣＲＴを出力する（ステップＳ３４）。そして、感情推定部２５Ｂの評価値算出部２５６５は、感情分類情報３３に含まれる文字列と、補正後認識文字列ＣＲＴとを比較することにより各感情の文字評価値Ｙを算出し、各感情の文字評価値Ｙを含む文字感情情報ＴＥを生成する（ステップＳ３５）。 FIGS. 15 and 16 are flowcharts showing the operation of the emotion estimation system SYS in the emotion estimation mode. The server device 10 acquires the correction information CI from the user device 1b (step S31). Specifically, the user device 1b generates the correction information CI according to one of the two generation methods described above and transmits it to the server device 10. Next, the server device 10 acquires the sound information SI from the user device 1b (step S32). Then, the voice recognition processing unit 2561 of the emotion estimation unit 25B executes voice recognition processing on the sound information SI and outputs the recognition character string RT (step S33). Next, the morphological analysis processing unit 2563 of the emotion estimation unit 25B refers to the analysis dictionary information 31, executes morphological analysis processing on the recognition character string RT, and outputs the corrected recognition character string CRT (step S34). Then, the evaluation value calculation unit 2565 of the emotion estimation unit 25B calculates the character evaluation value Y of each emotion by comparing the character strings included in the emotion classification information 33 with the corrected recognition character string CRT, and generates the character emotion information TE including the character evaluation value Y of each emotion (step S35).

また、感情推定部２５Ｂのノイズ除去部２５１は、音情報ＳＩが示す音から、第１パラメータＰ１及び第２パラメータＰ２に従ってノイズを除去して音声情報ＶＩを生成する（ステップＳ４１）。そして、感情推定部２５Ｂの音声評価部２５２Ｂは、ノイズを除去した音声情報ＶＩから、音の特徴量を抽出する（ステップＳ４２）。次に、感情推定部２５Ｂの音声評価部２５２Ｂは、音の特徴量を学習モデルＬＭに入力し、各感情の音声評価値ｘを含む音声感情情報ＶＥを学習モデルＬＭから取得する（ステップＳ４３）。 Further, the noise removing unit 251 of the emotion estimation unit 25B removes noise from the sound indicated by the sound information SI according to the first parameter P1 and the second parameter P2 to generate the voice information VI (step S41). Then, the voice evaluation unit 252B of the emotion estimation unit 25B extracts the feature amount of the sound from the voice information VI from which the noise has been removed (step S42). Next, the voice evaluation unit 252B of the emotion estimation unit 25B inputs the feature amount of the sound into the learning model LM, and acquires the voice emotion information VE including the voice evaluation value x of each emotion from the learning model LM (step S43). ..

次に、サーバ装置１０は、補正情報ＣＩ及び音情報ＳＩの送信元のユーザ装置１ｂを所持するユーザが、キャリブレーション済みユーザか否かを判定する（ステップＳ４４）。キャリブレーション済みユーザか非キャリブレーションユーザかを判定する方法として、ユーザ装置１ｂは、補正情報ＣＩの送信時及び音情報ＳＩの送信時のいずれか一方の時又は両方の時に、ユーザ装置１ｂの識別情報を送信する。サーバ装置１０は、受信したユーザ装置１ｂの識別情報が、キャリブレーション済みユーザが所持するユーザ装置１ｂとして記憶した識別情報と一致した場合、肯定である判定結果を出力し、記憶装置３Ｂに記憶した識別情報と一致しない場合、否定である判定結果を出力する。 Next, the server device 10 determines whether or not the user who possesses the user device 1b that transmitted the correction information CI and the sound information SI is a calibrated user (step S44). To allow the server device 10 to distinguish a calibrated user from a non-calibrated user, the user device 1b transmits its identification information when transmitting the correction information CI, when transmitting the sound information SI, or at both times. The server device 10 outputs an affirmative determination result when the received identification information of the user device 1b matches identification information stored as that of a user device 1b possessed by a calibrated user, and outputs a negative determination result when it does not match any identification information stored in the storage device 3B.

ステップＳ４４の判定結果が肯定の場合、感情推定部２５Ｂの補正部２５３は、補正情報ＣＩを用いて、音声感情情報ＶＥに含まれる各感情の音声評価値ｘを補正した補正感情情報ＣＶＥを生成する（ステップＳ４５）。そして、感情推定部２５Ｂの特定部２５９は、補正感情情報ＣＶＥに含まれる音声評価値Ｘと文字感情情報ＴＥに含まれる文字評価値Ｙとの差分の２乗の和Ｓｕｍ_ＸＹが所定値以下か否かを判定する（ステップＳ４６）。
ステップＳ４４の判定結果が肯定であり、かつ、ステップＳ４６の判定結果が肯定の場合、感情推定部２５Ｂの特定部２５９は、認識文字列ＲＴを特定文字列ＳＴとして特定する（ステップＳ４７）。そして、感情推定部２５Ｂの推定部２５８は、補正感情情報ＣＶＥと文字感情情報ＴＥとに基づいて、ユーザＵが抱く感情を推定する（ステップＳ６１）。一方、ステップＳ４４の判定結果が肯定であり、ステップＳ４６の判定結果が否定の場合も、感情推定部２５Ｂの推定部２５８は、ステップＳ６１の処理を実行する。 When the determination result in step S44 is affirmative, the correction unit 253 of the emotion estimation unit 25B uses the correction information CI to generate the corrected emotion information CVE, in which the voice evaluation value x of each emotion included in the voice emotion information VE is corrected (step S45). Then, the identification unit 259 of the emotion estimation unit 25B determines whether or not Sum_XY, the sum of the squares of the differences between the voice evaluation values X included in the corrected emotion information CVE and the character evaluation values Y included in the character emotion information TE, is equal to or less than a predetermined value (step S46).
When the determination result in step S44 is affirmative and the determination result in step S46 is affirmative, the identification unit 259 of the emotion estimation unit 25B identifies the recognition character string RT as a specific character string ST (step S47). Then, the estimation unit 258 of the emotion estimation unit 25B estimates the emotion held by the user U based on the corrected emotion information CVE and the character emotion information TE (step S61). When the determination result in step S44 is affirmative and the determination result in step S46 is negative, the estimation unit 258 of the emotion estimation unit 25B likewise executes the process of step S61.

ステップＳ４４の判定結果が否定の場合、すなわち、補正情報ＣＩ及び音情報ＳＩの送信元のユーザ装置１ｂを所持するユーザＵが非キャリブレーションユーザである場合、サーバ装置１０は、特定文字列ＳＴと認識文字列ＲＴとが一致するか否かを判定する（ステップＳ５０）。
ステップＳ４４の判定結果が否定であり、かつ、ステップＳ５０の判定結果が肯定の場合、感情推定部２５Ｂの特定部２５９は、補正感情情報ＣＶＥに含まれる音声評価値Ｘを文字感情情報ＴＥに含まれる複数の文字評価値Ｙに近づける目的で、非キャリブレーションユーザ用の補正情報ＣＩを調整する（ステップＳ５１）。そして、感情推定部２５Ｂの補正部２５３は、補正情報ＣＩを用いて、音声感情情報ＶＥに含まれる各感情の音声評価値ｘを補正した補正感情情報ＣＶＥを生成する（ステップＳ５２）。
ステップＳ４４の判定結果が否定であり、かつ、ステップＳ４５の判定結果が否定の場合も、感情推定部２５Ｂの補正部２５３は、ステップＳ５２の処理を実行する。
ステップＳ５２の処理終了後、感情推定部２５Ｂの推定部２５８は、ステップＳ６１の処理を実行する。 When the determination result in step S44 is negative, that is, when the user U who possesses the user device 1b that transmitted the correction information CI and the sound information SI is a non-calibrated user, the server device 10 determines whether or not the recognition character string RT matches a specific character string ST (step S50).
When the determination result in step S44 is negative and the determination result in step S50 is affirmative, the identification unit 259 of the emotion estimation unit 25B adjusts the correction information CI for the non-calibrated user so that the voice evaluation values X included in the corrected emotion information CVE approach the plurality of character evaluation values Y included in the character emotion information TE (step S51). Then, the correction unit 253 of the emotion estimation unit 25B uses the correction information CI to generate the corrected emotion information CVE, in which the voice evaluation value x of each emotion included in the voice emotion information VE is corrected (step S52).
When the determination result in step S44 is negative and the determination result in step S50 is negative, the correction unit 253 of the emotion estimation unit 25B likewise executes the process of step S52.
After the process of step S52 is completed, the estimation unit 258 of the emotion estimation unit 25B executes the process of step S61.
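The branching in steps S44 to S52 can be summarized with the sketch below. It rests on several assumptions that are not stated in this passage: the correction is modeled as element-wise scaling of the voice evaluation values x by the coefficients in the correction information CI, the step S51 adjustment multiplies each coefficient by Y / X, and the names `apply_correction`, `process_utterance`, `calibrated_ids`, and `specific_strings` are hypothetical.

```python
def apply_correction(ci, x_raw):
    # Assumption: CI is a list of coefficients applied element-wise to x.
    return [k * v for k, v in zip(ci, x_raw)]


def sum_sq_diff(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def process_utterance(device_id, recognized_text, x_raw, y, ci,
                      calibrated_ids, specific_strings, threshold):
    """Return (corrected X, possibly updated CI) as the inputs to step S61."""
    if device_id in calibrated_ids:                       # S44: affirmative
        corrected = apply_correction(ci, x_raw)           # S45
        if sum_sq_diff(corrected, y) <= threshold:        # S46
            specific_strings.add(recognized_text)         # S47
    else:                                                 # S44: negative
        if recognized_text in specific_strings:           # S50
            provisional = apply_correction(ci, x_raw)
            # S51 (assumed form): scale each coefficient by Y / X so that the
            # corrected value approaches Y; skip emotions where X is zero.
            ci = [k * (yi / xi) if xi else k
                  for k, xi, yi in zip(ci, provisional, y)]
        corrected = apply_correction(ci, x_raw)           # S52
    return corrected, ci
```

In this sketch a calibrated user's utterance can only add to the set of specific character strings, while a non-calibrated user's utterance can only change that user's own correction information.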

ステップＳ６１の処理を実行後、サーバ装置１０は、認識文字列ＲＴと、ステップＳ６１の処理結果である推定感情情報ＥＩとを、ユーザ装置１ｂに送信する。出力部２６は、認識文字列ＲＴに対して、推定感情情報ＥＩが示す感情に応じた処理を実行して得られる情報を出力する（ステップＳ６２）。ステップＳ６２の処理終了後、感情推定システムＳＹＳは、図１５及び図１６に示す一連の処理を終了する。 After executing the process of step S61, the server device 10 transmits the recognition character string RT and the estimated emotion information EI which is the processing result of step S61 to the user device 1b. The output unit 26 outputs information obtained by executing processing according to the emotion indicated by the estimated emotion information EI with respect to the recognition character string RT (step S62). After the processing of step S62 is completed, the emotion estimation system SYS ends the series of processing shown in FIGS. 15 and 16.

３．３．第３実施形態の効果
以上の説明によれば、サーバ装置１０は、非キャリブレーションユーザであるユーザＵ２が特定文字列ＳＴを発話した場合、補正感情情報ＣＶＥ２に含まれる複数の音声評価値Ｘを、複数の感情の各々について、文字感情情報ＴＥ２に含まれる複数の文字評価値Ｙに近づける目的で、ユーザＵ２用の補正情報ＣＩを調整する。特定文字列ＳＴは、キャリブレーションユーザであるユーザＵ１において、音声評価値Ｘと文字評価値Ｙとの相違の程度を示す値が所定値以下となった時の認識文字列ＲＴである。
ユーザＵ２が特定文字列ＳＴを発話した場合に限り、ユーザＵ２用の補正情報ＣＩを調整する理由について説明する。キャリブレーション済みユーザであっても、補正感情情報ＣＶＥと文字感情情報ＴＥとが近い値にならないことがある。例えば、キャリブレーション済みユーザが、文字列が有する本来の意味とは異なる意味でこの文字列を発話した場合、補正感情情報ＣＶＥと文字感情情報ＴＥとが近い値にならないことがある。文字列が有する本来の意味とは異なる意味でユーザＵが発話する例としては、ユーザＵが皮肉の内容を発話した場合、及び、ユーザＵが冗談を発話した場合である。ユーザＵが皮肉の内容及び冗談を発話すると、文字感情情報ＴＥの精度が低下するので、文字感情情報ＴＥのみに基づいてユーザＵが抱く感情を推定すると精度が低下する。また、ユーザＵが「今、着きました」といった事務連絡を発話すると、文字感情情報ＴＥの精度が低下するので、文字感情情報ＴＥのみに基づいてユーザＵが抱く感情を推定すると精度が低下する。発話内容が事務連絡である場合に文字感情情報ＴＥの精度が低下する理由は、事務連絡を示す発話内容には、感情分類情報３３に登録されている、感情を表す文字列が含まれる割合が一般的な発話内容と比較して低い傾向にあり、文字評価値Ｙ１〜Ｙ４が小さい値となるためである。ユーザＵが皮肉の内容を発話した場合、ユーザＵが冗談を発話した場合、及び、ユーザＵが事務連絡を発話した場合とは、文字感情情報ＴＥのみに基づいてユーザＵが抱く感情を精度良く推定できない場合の一例である。文字列が有する本来の意味で発話されている場合には、補正感情情報ＣＶＥと文字感情情報ＴＥとが近い値になりやすい傾向にある。
従って、特定文字列ＳＴは、音声評価値Ｘと文字評価値Ｙとの相違の程度を示す値が所定値以下となっているため、本来の意味で発話された可能性が高い文字列であると言える。そして、ユーザＵ２が特定文字列ＳＴを発話した場合には特定文字列ＳＴが有する本来の意味で、ユーザＵ２が発話している可能性が高いため、本来であれば、補正感情情報ＣＶＥと文字感情情報ＴＥとが近い値になるはずである。
ここで、非キャリブレーションユーザにおいて、一般的には、補正感情情報ＣＶＥの精度は、文字感情情報ＴＥの精度より低い可能性が高い。理由としては、文字感情情報ＴＥは、ユーザＵの音声の特徴からの影響が小さい一方で、音声感情情報ＶＥは、ユーザＵの音声の特徴からの影響が大きく、非キャリブレーションユーザの補正情報ＣＩが正しく調整されていないためである。
そこで、第３実施形態では、ユーザＵ２が特定文字列ＳＴを発話した場合には、文字感情情報ＴＥが正解の感情を示している可能性が高いので、サーバ装置１０は、音声評価値Ｘを文字評価値Ｙに近づける目的で、ユーザＵ２用の補正情報ＣＩを調整する。以上により、非キャリブレーションユーザについて、キャリブレーションモードを用いなくても、ユーザＵが抱く感情の推定精度を向上できる。非キャリブレーションユーザは、ユーザ装置１ｂをキャリブレーションモードに設定しなくとも感情の推定精度を向上できるので、ユーザ装置１ｂは、非キャリブレーションユーザの手間を削減できる。
3.3. Effects of the Third Embodiment
As described above, when the user U2, who is a non-calibrated user, utters a specific character string ST, the server device 10 adjusts the correction information CI for the user U2 so that, for each of the plurality of emotions, the voice evaluation values X included in the corrected emotion information CVE2 approach the character evaluation values Y included in the character emotion information TE2. The specific character string ST is a recognition character string RT for which, for the user U1 who is a calibrated user, the value indicating the degree of difference between the voice evaluation values X and the character evaluation values Y was equal to or less than the predetermined value.
The reason why the correction information CI for the user U2 is adjusted only when the user U2 utters a specific character string ST is as follows. Even for a calibrated user, the corrected emotion information CVE and the character emotion information TE may not take close values. For example, when a calibrated user utters a character string in a sense different from its original meaning, the corrected emotion information CVE and the character emotion information TE may not take close values. Examples of a user U uttering a character string in a sense different from its original meaning are sarcasm and jokes. When the user U utters sarcasm or a joke, the accuracy of the character emotion information TE decreases, so estimating the emotion held by the user U from the character emotion information TE alone is less accurate. Likewise, when the user U utters a routine business message such as "I have just arrived", the accuracy of the character emotion information TE decreases, so estimating the emotion held by the user U from the character emotion information TE alone is less accurate. The accuracy decreases for routine business messages because such utterances tend to contain a lower proportion of the emotion-expressing character strings registered in the emotion classification information 33 than general utterances do, so the character evaluation values Y1 to Y4 become small. Sarcasm, jokes, and routine business messages are examples of cases in which the emotion held by the user U cannot be estimated accurately from the character emotion information TE alone.
When a character string is uttered in its original meaning, the corrected emotion information CVE and the character emotion information TE tend to take close values.
Accordingly, because the value indicating the degree of difference between the voice evaluation values X and the character evaluation values Y was equal to or less than the predetermined value, the specific character string ST can be regarded as a character string that was very likely uttered in its original meaning. When the user U2 utters the specific character string ST, the user U2 is also likely to be using it in its original meaning, so the corrected emotion information CVE and the character emotion information TE should, in principle, take close values.
For a non-calibrated user, however, the accuracy of the corrected emotion information CVE is generally likely to be lower than that of the character emotion information TE. This is because the character emotion information TE is only slightly affected by the voice characteristics of the user U, whereas the voice emotion information VE is strongly affected by them, and the correction information CI of the non-calibrated user has not been properly adjusted.
Therefore, in the third embodiment, when the user U2 utters the specific character string ST, the character emotion information TE is likely to indicate the correct emotion, so the server device 10 adjusts the correction information CI for the user U2 so that the voice evaluation values X approach the character evaluation values Y. As a result, the accuracy of estimating the emotion held by a non-calibrated user can be improved without using the calibration mode. Because the estimation accuracy improves even if the non-calibrated user does not set the user device 1b to the calibration mode, the effort required of the non-calibrated user is reduced.

4. Fourth Embodiment
The emotion estimation system SYSc according to the fourth embodiment differs from the emotion estimation system SYS according to the third embodiment in that it uses the emotion estimation result of a calibrated user to adjust the first parameter P1 and the second parameter P2 for a non-calibrated user.
The emotion estimation system SYSc according to the fourth embodiment is described below. For elements of the fourth embodiment illustrated below whose actions or functions are equivalent to those of the third embodiment, the reference signs used in the above description are reused, and detailed description of each is omitted as appropriate.

FIG. 17 is a diagram showing the overall configuration of the emotion estimation system SYSc. The emotion estimation system SYSc includes a user device 1b possessed by the user U, a network NW, and a server device 10C.

FIG. 18 is a block diagram showing the configuration of the server device 10C. The server device 10C is realized by a computer system including a processing device 2C, a storage device 3C, and a communication device 6B. The storage device 3C is a recording medium readable by the processing device 2C and stores a plurality of programs including the control program PRC executed by the processing device 2C, the analysis dictionary information 31, the emotion classification information 33, and the learning model LM.

The processing device 2C functions as the emotion estimation unit 25C by reading the control program PRC from the storage device 3C and executing it. The functions of the emotion estimation system SYSc are described with reference to FIG. 19.

4.1. Functions of the Fourth Embodiment
FIG. 19 is a diagram showing an outline of the functions of the emotion estimation system SYSc. The emotion estimation unit 25C includes a noise removal unit 251C, a voice evaluation unit 252B, a correction unit 253, an adjustment unit 254C, a character evaluation unit 256, an estimation unit 258, and an identification unit 259.

In the fourth embodiment, the first parameter P1 and the second parameter P2 used in the noise removal unit 251C are prepared for each user U. In the following description, the information including the first parameter P1 and the second parameter P2 for the user U1 is referred to as parameter information TI1, and the information including the first parameter P1 and the second parameter P2 for the user U2 is referred to as parameter information TI2.

FIG. 20 is a diagram showing an outline of the function for adjusting the parameter information TI of a non-calibrated user. FIG. 20 shows a state in which the calibrated user U1 has uttered "Thank you" and the acquisition unit 21 of the user device 1b1 has acquired the sound information SI1.

The description returns to FIG. 19. For the user U1, the noise removal unit 251C generates the voice information VI1 by removing noise from the sound indicated by the sound information SI1 based on the first parameter P1 and the second parameter P2 included in the parameter information TI1. The emotion estimation unit 25C then performs the subsequent processing in the same manner as in the third embodiment, generates the corrected emotion information CVE1 and the character emotion information TE1, and identifies the recognition character string RT1, "Thank you", as the specific character string ST.

For the user U2, suppose that, as shown in FIG. 20, the user U2 utters "Thank you", which is the specific character string ST. The adjustment unit 254C adjusts the parameter information TI2 for the user U2 so that the plurality of voice evaluation values X included in the corrected emotion information CVE2 of the user U2 approach, for each of the plurality of emotions, the plurality of character evaluation values Y included in the character emotion information TE2 of the user U2. Specifically, the adjustment unit 254C causes the noise removal unit 251C to generate the voice information VI2 based on the first parameter P1 and the second parameter P2 of the current parameter information TI2. The adjustment unit 254C then causes the voice evaluation unit 252B and the correction unit 253 to generate the corrected emotion information CVE2, and causes the character evaluation unit 256 to generate the character emotion information TE2. Next, the adjustment unit 254C compares the voice evaluation values X included in the corrected emotion information CVE2 with the character evaluation values Y included in the character emotion information TE2. For example, the adjustment unit 254C changes the first parameter P1 and the second parameter P2 of the parameter information TI2 by a small amount and regenerates the corrected emotion information CVE based on the changed parameters. If the value indicating the degree of difference between the regenerated corrected emotion information CVE and the character emotion information TE2 is smaller than the value indicating the degree of difference between the corrected emotion information CVE before regeneration and the character emotion information TE2, the adjustment unit 254C determines that the aim of bringing the plurality of voice evaluation values X of the user U2 closer to the plurality of character evaluation values Y of the user U2 has been achieved.
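The perturb-and-compare adjustment described above can be sketched as a small search over the two parameters. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the names `squared_difference`, `adjust_parameters`, and `fake_pipeline`, the step size, the order in which perturbations are tried, and the iteration limit are all hypothetical. The degree of difference is assumed to be a sum of squared differences, and `fake_pipeline` is a toy stand-in for the chain of noise removal unit 251C, voice evaluation unit 252B, and correction unit 253.

```python
from typing import Callable, List, Sequence, Tuple

Params = Tuple[float, float]  # (first parameter P1, second parameter P2)


def squared_difference(x: Sequence[float], y: Sequence[float]) -> float:
    """Degree of difference between voice values X and character values Y (assumed metric)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def adjust_parameters(
    params: Params,
    pipeline: Callable[[Params], List[float]],
    y: Sequence[float],
    step: float = 0.01,
    max_iters: int = 100,
) -> Tuple[Params, float]:
    """Change P1 and P2 by a small amount and keep a change only if X moves closer to Y."""
    best = params
    best_diff = squared_difference(pipeline(best), y)
    for _ in range(max_iters):
        improved = False
        # Try a small change of each parameter in each direction (hypothetical order).
        for d1, d2 in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            trial = (best[0] + d1, best[1] + d2)
            diff = squared_difference(pipeline(trial), y)  # regenerate CVE and compare with TE
            if diff < best_diff:  # the regenerated CVE is closer to TE than before
                best, best_diff = trial, diff
                improved = True
                break
        if not improved:  # no small change brings X closer to Y, so stop
            break
    return best, best_diff


def fake_pipeline(params: Params) -> List[float]:
    """Toy stand-in mapping (P1, P2) to four voice evaluation values X."""
    p1, p2 = params
    return [p1, p2, p1 + p2, p1 - p2]
```

A caller would pass the current parameter information TI2 as `params` and the character evaluation values of TE2 as `y`. Accepting only changes that reduce the difference mirrors the comparison before and after regeneration; a real system might cap the number of regenerations, since each one reruns noise removal and the learning model LM.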

FIG. 20 shows the adjustment unit 254C adjusting the first parameter P1 and the second parameter P2 included in the parameter information TI2 so that the plurality of voice evaluation values X of the user U2 approach the plurality of character evaluation values Y of the user U2.

4.2. Operation of the Fourth Embodiment
Next, the operation of the emotion estimation system SYSc in the emotion estimation mode is described with reference to FIG. 21.

FIG. 21 is a flowchart showing the operation of the emotion estimation system SYSc in the emotion estimation mode. The processing from step S31 to step S35 shown in FIG. 15 is common to the emotion-estimation-mode operation described in the third embodiment and that of the emotion estimation system SYSc in the fourth embodiment. Illustration and description of steps S31 to S35 are therefore omitted.

After the processing of step S35 is completed, the server device 10C determines whether the user U who possesses the user device 1b that transmitted the correction information CI and the sound information SI is a calibrated user (step S71). If the determination result in step S71 is affirmative, the noise removal unit 251C of the emotion estimation unit 25C generates the voice information VI by removing noise from the sound indicated by the sound information SI in accordance with the first parameter P1 and the second parameter P2 of the parameter information TI (step S72). The voice evaluation unit 252B of the emotion estimation unit 25C extracts sound feature amounts from the noise-removed voice information VI (step S73). Next, the voice evaluation unit 252B inputs the sound feature amounts into the learning model LM and acquires from the learning model LM the voice emotion information VE including the voice evaluation value x of each emotion (step S74). Then the correction unit 253 of the emotion estimation unit 25C uses the correction information CI to generate the corrected emotion information CVE, in which the voice evaluation value x of each emotion included in the voice emotion information VE is corrected (step S75).

Then the identification unit 259 of the emotion estimation unit 25C determines whether Sum_XY, the sum of the squares of the differences between the voice evaluation values X included in the corrected emotion information CVE and the character evaluation values Y included in the character emotion information TE, is equal to or less than a predetermined value (step S76).
If the determination results in both step S71 and step S76 are affirmative, the identification unit 259 of the emotion estimation unit 25C identifies the recognition character string RT as the specific character string ST (step S77). Then the estimation unit 258 of the emotion estimation unit 25C estimates the emotion held by the user U based on the corrected emotion information CVE and the character emotion information TE (step S91). If the determination result in step S71 is affirmative and the determination result in step S76 is negative, the estimation unit 258 of the emotion estimation unit 25C likewise executes the processing of step S91.
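The test in step S76 can be written out as a formula. The index i over emotions, the number of emotions N, and the symbol T for the predetermined value are notation introduced here for illustration; the source names only Sum_XY, X, and Y.

```latex
\[
\mathrm{Sum}_{XY} \;=\; \sum_{i=1}^{N} \left( X_i - Y_i \right)^2 ,
\qquad
\text{RT is identified as ST when } \mathrm{Sum}_{XY} \le T .
\]
```

As a worked example with hypothetical values for four emotions, X = (0.7, 0.1, 0.1, 0.1) and Y = (0.6, 0.2, 0.1, 0.1) give Sum_XY = 0.1^2 + (-0.1)^2 + 0 + 0 = 0.02. With a hypothetical T = 0.05, the recognition character string RT would be identified as the specific character string ST.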

If the determination result in step S71 is negative, that is, if the user U who possesses the user device 1b that transmitted the correction information CI and the sound information SI is a non-calibrated user, the server device 10C determines whether the recognition character string RT matches the specific character string ST (step S81). If the determination result in step S71 is negative and the determination result in step S81 is affirmative, the adjustment unit 254C of the emotion estimation unit 25C adjusts the parameter information TI for the non-calibrated user so that the voice evaluation values X included in the corrected emotion information CVE approach the plurality of character evaluation values Y included in the character emotion information TE (step S82). Then the noise removal unit 251C of the emotion estimation unit 25C generates the voice information VI by removing noise from the sound indicated by the sound information SI in accordance with the first parameter P1 and the second parameter P2 of the parameter information TI (step S83).
If the determination results in both step S71 and step S81 are negative, the noise removal unit 251C of the emotion estimation unit 25C likewise executes the processing of step S83.

After the processing of step S83 is completed, the voice evaluation unit 252B of the emotion estimation unit 25C extracts sound feature amounts from the noise-removed voice information VI (step S84). Next, the voice evaluation unit 252B inputs the sound feature amounts into the learning model LM and acquires from the learning model LM the voice emotion information VE including the voice evaluation value x of each emotion (step S85). Then the correction unit 253 of the emotion estimation unit 25C uses the correction information CI to generate the corrected emotion information CVE, in which the voice evaluation values x included in the voice emotion information VE are corrected (step S86). After the processing of step S86 is completed, the estimation unit 258 of the emotion estimation unit 25C executes the processing of step S91.

After the processing of step S91 is completed, the server device 10C transmits the recognition character string RT and the estimated emotion information EI, which is the result of the processing of step S91, to the user device 1b. The output unit 26 outputs information obtained by executing, on the recognition character string RT, processing according to the emotion indicated by the estimated emotion information EI (step S92). After the processing of step S92 is completed, the emotion estimation system SYSc ends the series of processes shown in FIG. 21.
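The branching of FIG. 21 from step S71 to step S91 can be summarized in code. The following is a sketch under stated assumptions rather than the embodiment's implementation: `User`, `Units`, `estimate_emotion`, and the `toy` stand-ins are hypothetical names, the correction information CI is reduced to a single offset, and the adjustment of step S82 is passed in as an opaque callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Set, Tuple

Params = Tuple[float, float]  # (first parameter P1, second parameter P2)


@dataclass
class User:
    calibrated: bool  # True for a calibrated user
    params: Params    # parameter information TI for this user
    ci: float         # correction information CI (a single offset here, an assumption)


@dataclass
class Units:
    """Hypothetical stand-ins for the units of the emotion estimation unit 25C."""
    remove_noise: Callable[[Sequence[float], Params], List[float]]        # 251C
    evaluate_voice: Callable[[List[float]], List[float]]                  # 252B
    correct: Callable[[List[float], float], List[float]]                  # 253
    adjust: Callable[[Params, Sequence[float], Sequence[float]], Params]  # 254C
    estimate: Callable[[List[float], Sequence[float]], int]               # 258


def sum_xy(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of squared differences between X and Y, as tested in step S76."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def estimate_emotion(user: User, sound: Sequence[float], rt: str,
                     te: Sequence[float], units: Units,
                     specific_strings: Set[str], threshold: float) -> int:
    if user.calibrated:                                         # S71: affirmative
        vi = units.remove_noise(sound, user.params)             # S72
        cve = units.correct(units.evaluate_voice(vi), user.ci)  # S73 to S75
        if sum_xy(cve, te) <= threshold:                        # S76
            specific_strings.add(rt)                            # S77
    else:                                                       # S71: negative
        if rt in specific_strings:                              # S81
            user.params = units.adjust(user.params, sound, te)  # S82
        vi = units.remove_noise(sound, user.params)             # S83
        cve = units.correct(units.evaluate_voice(vi), user.ci)  # S84 to S86
    return units.estimate(cve, te)                              # S91


# Toy stand-ins so that the sketch runs end to end; none reflects the real units.
toy = Units(
    remove_noise=lambda sound, p: [s * p[0] for s in sound],
    evaluate_voice=lambda vi: list(vi),
    correct=lambda ve, ci: [v + ci for v in ve],
    adjust=lambda p, sound, te: (1.0, p[1]),
    estimate=lambda cve, te: max(range(len(te)), key=lambda i: cve[i] + te[i]),
)
```

In this sketch an utterance by a calibrated user can add a string to `specific_strings`, and a later utterance of the same string by a non-calibrated user triggers the adjustment of that user's parameter information before noise removal. The return value is an index into the list of emotions, standing in for the estimated emotion information EI.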

4.3. Effect of the Fourth Embodiment
In the fourth embodiment, as in the third embodiment, when the user U2 utters the specific character string ST, the character emotion information TE is highly likely to indicate the correct emotion, so the server device 10C adjusts the parameter information TI for the user U2 so that the voice evaluation values X approach the character evaluation values Y. As a result, the accuracy of estimating the emotion held by a non-calibrated user can be improved without using the calibration mode. Because the estimation accuracy improves without the non-calibrated user setting the user device 1b to the calibration mode, the user device 1b reduces the effort required of the non-calibrated user.
The performance of the sound collecting device 8 differs between user devices 1b. For example, sound collecting devices 8 from different manufacturers generally differ in performance. In addition, because the performance of a sound collecting device 8 tends to degrade with age, sound collecting devices 8 from the same manufacturer also tend to differ in performance when the time elapsed since manufacture differs. Because the performance of the sound collecting device 8 differs between user devices 1b, the amount of noise included in the sound information SI also differs, so adjusting the parameter information TI allows the emotion held by the user U to be estimated accurately.
For example, when noise processing that applies the learned parameter information TI is executed, information necessary for emotion estimation may be lost from the voice information VI, depending on differences in the performance of the sound collecting device 8. Adjusting the parameter information TI according to the performance of the sound collecting device 8 therefore allows the emotion held by the user U to be estimated accurately.

5. Fifth Embodiment
The emotion estimation system SYSd according to the fifth embodiment differs from the user device 1 according to the first embodiment in that the processing of the emotion estimation unit 25 shown in the first embodiment is distributed between the server device 10D and the user device 1d. The emotion estimation system SYSd according to the fifth embodiment is described below. For elements of the fifth embodiment illustrated below whose actions or functions are equivalent to those of the first embodiment, the reference signs used in the above description are reused, and detailed description of each is omitted as appropriate.

5.1. Outline of the Fifth Embodiment
FIG. 22 is a diagram showing the overall configuration of the emotion estimation system SYSd. The emotion estimation system SYSd includes a user device 1d possessed by the user U, a network NW, and a server device 10D.

FIG. 23 is a block diagram showing the configuration of the user device 1d. The user device 1d is realized by a computer system including a processing device 2d, a storage device 3d, a display device 4, an operating device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The storage device 3d is a recording medium readable by the processing device 2d and stores a plurality of programs including the control program PRd executed by the processing device 2d. The communication device 6 is an example of the "second communication device".

The processing device 2d functions as an acquisition unit 21, a first emotion estimation unit 25d, and an output unit 26 by reading the control program PRd from the storage device 3d and executing it.

FIG. 24 is a block diagram showing the configuration of the server device 10D. The server device 10D is realized by a computer system including a processing device 2D, a storage device 3D, and a communication device 6B. The storage device 3D is a recording medium readable by the processing device 2D and stores a plurality of programs including the control program PRD executed by the processing device 2D, the analysis dictionary information 31, the emotion classification information 33, and the learning model LM. The communication device 6B is an example of the "first communication device".

The processing device 2D functions as a second emotion estimation unit 25D by reading the control program PRD from the storage device 3D and executing it. The functions of the emotion estimation system SYSd are described with reference to FIG. 25.

FIG. 25 is a diagram showing an outline of the functions of the emotion estimation system SYSd. The first emotion estimation unit 25d includes a correction unit 253 and an estimation unit 258. The second emotion estimation unit 25D includes a noise removal unit 251, a voice evaluation unit 252, and a character evaluation unit 256.

The acquisition unit 21 acquires the sound information SI1 output by the sound collecting device 8, which collects sound including the voice of the user U1. The communication device 6 transmits the sound information SI to the server device 10D. The second emotion estimation unit 25D generates the voice emotion information VE, the character emotion information TE, and the recognition character string RT based on the sound information SI. The communication device 6B transmits the voice emotion information VE, the character emotion information TE, and the recognition character string RT to the user device 1d.

The correction unit 253 uses the correction information CI to generate the corrected emotion information CVE. The estimation unit 258 estimates the emotion held by the user U based on the corrected emotion information CVE and the character emotion information TE. The output unit 26 outputs data obtained by executing, on the recognition character string RT, processing according to the emotion indicated by the estimated emotion information EI.
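The division of work between the server device 10D and the user device 1d can be illustrated with a short sketch of what each side computes and what crosses the network. Everything specific here is an assumption made for illustration: the function names, the JSON message format, the fixed values returned by the server stand-in, the multiplicative form of the correction, and the averaging used for estimation. The source states only which units run on which device and which pieces of information are exchanged.

```python
import json
from typing import Dict, List


def server_second_estimation(sound_info: bytes) -> str:
    """Stand-in for the second emotion estimation unit 25D on the server device 10D.

    A real server would run noise removal, the learning model LM, and speech
    recognition on sound_info; fixed values keep this sketch runnable.
    """
    payload = {
        "VE": [0.2, 0.5, 0.2, 0.1],  # voice emotion information (one value per emotion)
        "TE": [0.7, 0.1, 0.1, 0.1],  # character emotion information
        "RT": "thank you",           # recognition character string
    }
    return json.dumps(payload)       # sent by the communication device 6B


def client_first_estimation(message: str, ci: List[float]) -> Dict[str, object]:
    """Stand-in for the first emotion estimation unit 25d on the user device 1d."""
    data = json.loads(message)
    # Correction unit 253: apply the per-user correction information CI
    # (assumed here to be one multiplicative coefficient per emotion).
    cve = [x * c for x, c in zip(data["VE"], ci)]
    # Estimation unit 258: combine CVE and TE (assumed here to be a plain average)
    # and take the index of the strongest emotion as the estimated emotion information EI.
    combined = [(a + b) / 2 for a, b in zip(cve, data["TE"])]
    return {"RT": data["RT"], "CVE": cve, "EI": combined.index(max(combined))}
```

In this arrangement the correction information CI stays on the user device, and only the uncorrected voice emotion information VE, the character emotion information TE, and the recognition character string RT come back from the server.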

5.2. Effect of the Fifth Embodiment
As described above, in the emotion estimation system SYSd, the processing load on the user device 1d can be reduced compared with the user device 1 of the first embodiment.

6. Sixth Embodiment
The emotion estimation system SYSe according to the sixth embodiment differs from the user device 1a according to the second embodiment in that the processing of the emotion estimation unit 25 shown in the second embodiment is distributed between the server device 10D and the user device 1e. The emotion estimation system SYSe according to the sixth embodiment is described below. For elements of the sixth embodiment illustrated below whose actions or functions are equivalent to those of the second embodiment or the fifth embodiment, the reference signs used in the above description are reused, and detailed description of each is omitted as appropriate.

6.1. Outline of the Sixth Embodiment
FIG. 26 is a diagram showing the overall configuration of the emotion estimation system SYSe. The emotion estimation system SYSe includes a user device 1e possessed by the user U, a network NW, and a server device 10D.

FIG. 27 is a block diagram showing the configuration of the user device 1e. The user device 1e is realized by a computer system including a processing device 2e, a storage device 3e, a display device 4, an operating device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The storage device 3e is a recording medium readable by the processing device 2e and stores a plurality of programs including the control program PRe executed by the processing device 2e.

The processing device 2e functions as an acquisition unit 21a, a first emotion estimation unit 25e, and an output unit 26 by reading the control program PRe from the storage device 3e and executing it. The functions of the emotion estimation system SYSe are described with reference to FIG. 28.

FIG. 28 is a diagram showing an outline of the functions of the emotion estimation system SYSe. The first emotion estimation unit 25e includes a correction unit 253, an adjustment unit 254, and an estimation unit 258.

The acquisition unit 21a acquires sound information SIa indicating sound that includes a voice in which the user U expresses an explicit emotion. The server device 10D generates voice emotion information VEa based on the sound information SIa. The communication device 6B then transmits the voice emotion information VEa to the user device 1e.

The adjustment unit 254 adjusts the correction information CI based on the voice emotion information VEa so as to increase the likelihood that the estimation unit 258 estimates the emotion held by the user U to be the explicit emotion.

6.2. Effect of the Sixth Embodiment
As described above, in the emotion estimation system SYSe, the processing load on the user device 1e can be reduced compared with the user device 1a of the second embodiment.

7. Seventh Embodiment
The emotion estimation system SYSf according to the seventh embodiment differs from the emotion estimation system SYS according to the third embodiment in that the processing of the emotion estimation unit 25 shown in the third embodiment is distributed between the server device 10F and the user device 1f. The emotion estimation system SYSf according to the seventh embodiment is described below. For elements of the seventh embodiment illustrated below whose actions or functions are equivalent to those of the third embodiment or the fifth embodiment, the reference signs used in the above description are reused, and detailed description of each is omitted as appropriate.

7.1. Outline of the Seventh Embodiment
FIG. 29 is a diagram showing the overall configuration of the emotion estimation system SYSf. The emotion estimation system SYSf includes a user device 1f possessed by the user U, a network NW, and a server device 10F. The user U1 is an example of a "first user". The user U2 is an example of a "second user".

図３０は、ユーザ装置１ｆを示すブロック図である。ユーザ装置１ｆは、処理装置２ｆ、記憶装置３ｆ、表示装置４、操作装置５、通信装置６、放音装置７、及び、集音装置８を具備するコンピュータシステムにより実現される。記憶装置３ｆは、処理装置２ｆが読取可能な記録媒体であり、処理装置２ｆが実行する制御プログラムＰＲｆを含む複数のプログラムを記憶する。 FIG. 30 is a block diagram showing a user device 1f. The user device 1f is realized by a computer system including a processing device 2f, a storage device 3f, a display device 4, an operating device 5, a communication device 6, a sound emitting device 7, and a sound collecting device 8. The storage device 3f is a recording medium that can be read by the processing device 2f, and stores a plurality of programs including the control program PRf executed by the processing device 2f.

処理装置２ｆは、記憶装置３ｆから制御プログラムＰＲｆを読み取り実行することによって、取得部２１、第１感情推定部２５ｆ、及び、出力部２６として機能する。 The processing device 2f functions as an acquisition unit 21, a first emotion estimation unit 25f, and an output unit 26 by reading and executing the control program PRf from the storage device 3f.

FIG. 31 is a block diagram showing the configuration of the server device 10F. The server device 10F is realized by a computer system including a processing device 2F, a storage device 3F, and a communication device 6B. The storage device 3F is a recording medium readable by the processing device 2F and stores a plurality of programs including the control program PRF executed by the processing device 2F, the analysis dictionary information 31, the emotion classification information 33, and the learning model LM.

The processing device 2F functions as a second emotion estimation unit 25F by reading the control program PRF from the storage device 3F and executing it. The functions of the emotion estimation system SYSf are described with reference to FIG. 32.

FIG. 32 is a diagram showing an outline of the functions of the emotion estimation system SYSf. The first emotion estimation unit 25f includes a correction unit 253, an adjustment unit 254, and an estimation unit 258. The second emotion estimation unit 25F includes a noise removal unit 251, a voice evaluation unit 252, a character evaluation unit 256, and a specific unit 259. The user device 1f1 is an example of a "first terminal device". The user device 1f2 is an example of a "second terminal device". To keep the drawing from becoming cluttered, the functions realized by the processing device 2f of the user device 1f1 are not shown.

The acquisition unit 21 of the user device 1f1 acquires the sound information SI1 output by the sound collecting device 8 of the user device 1f1. The sound collecting device 8 of the user device 1f1 is an example of a "first sound collecting device". The communication device 6 of the user device 1f1 transmits the sound information SI1 to the server device 10F. The noise removal unit 251 removes noise from the sound indicated by the sound information SI1 to generate the voice information VI1. For the user U1, the subsequent processing is the same as that of the voice evaluation unit 252B, the correction unit 253, the voice recognition processing unit 2561, the morphological analysis processing unit 2563, the evaluation value calculation unit 2565, and the specific unit 259 shown in FIG. 13, so its description is omitted. Further, although not illustrated, the second emotion estimation unit 25F transmits the recognition character string RT1, the voice emotion information VE1 of the user U1, and the character emotion information TE1 of the user U1 to the user device 1f1. Then, so that the specific unit 259 can specify the specific character string ST, the communication device 6 of the user device 1f1 transmits the corrected emotion information CVE1 of the user U1 to the server device 10F.
For the user U2, the acquisition unit 21 of the user device 1f2 acquires the sound information SI2 output by the sound collecting device 8 of the user device 1f2. The sound collecting device 8 of the user device 1f2 is an example of a "second sound collecting device". The communication device 6 of the user device 1f2 transmits the sound information SI2 to the server device 10F. The communication device 6 of the user device 1f2 is an example of a "third communication device". For the user U2, the subsequent processing is the same as that of the voice evaluation unit 252B, the correction unit 253, the voice recognition processing unit 2561, the morphological analysis processing unit 2563, and the evaluation value calculation unit 2565 shown in FIG. 13, so its description is omitted. The communication device 6B transmits the specific character string ST, the voice emotion information VE2 of the user U2, and the character emotion information TE2 of the user U2 to the user device 1f2.

When the user U2, who is a non-calibration user, utters the specific character string ST, the adjustment unit 254 adjusts the correction information CI2 for the user U2 so that, for each of the plurality of emotions, the plurality of voice evaluation values X included in the corrected emotion information CVE2 of the user U2 approach the plurality of character evaluation values Y included in the character emotion information TE2 of the user U2.
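The adjustment performed by the adjustment unit 254 can be illustrated with a minimal sketch. It is not the method of the embodiment: it assumes that the correction information CI2 is a per-emotion additive offset applied to the voice evaluation values X, and that one adjustment is a proportional step toward the character evaluation values Y. The function names, the offset representation, the `rate` parameter, and the sample values are all hypothetical.

```python
from typing import Dict

EMOTIONS = ("joy", "anger", "sadness", "normal")


def apply_correction(x: Dict[str, float], ci: Dict[str, float]) -> Dict[str, float]:
    """Return corrected voice evaluation values (assumed form: X + offset)."""
    return {e: x[e] + ci[e] for e in EMOTIONS}


def adjust_correction(x: Dict[str, float], y: Dict[str, float],
                      ci: Dict[str, float], rate: float = 0.5) -> Dict[str, float]:
    """Shift each offset so that the corrected X moves toward Y for that emotion."""
    corrected = apply_correction(x, ci)
    return {e: ci[e] + rate * (y[e] - corrected[e]) for e in EMOTIONS}


# Hypothetical values for user U2 uttering the specific character string ST.
x = {"joy": 0.2, "anger": 0.5, "sadness": 0.2, "normal": 0.1}  # voice evaluation values X
y = {"joy": 0.7, "anger": 0.1, "sadness": 0.1, "normal": 0.1}  # character evaluation values Y
ci = {e: 0.0 for e in EMOTIONS}                                # initial correction information

new_ci = adjust_correction(x, y, ci)
```

Repeating the step, or choosing `rate=1.0`, brings the corrected values arbitrarily close to Y under this assumed offset model.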

7.2. Effects of the Seventh Embodiment
According to the above description, in the emotion estimation system SYSf, the load on the server device 10F can be reduced compared with the server device 10 of the third embodiment.

8. Modifications
The present invention is not limited to the embodiments exemplified above. Specific modifications are illustrated below. Two or more modifications arbitrarily selected from the following examples may be combined.

(1) In the above description, the estimation unit 258 estimates the emotion held by the user U based on the corrected emotion information CVE and the character emotion information TE, but the present invention is not limited to this. As a first modification, an example in which the estimation unit 258 estimates one or more emotions held by the user U based on the corrected emotion information CVE is described with reference to FIG. 33.

FIG. 33 is a diagram showing an outline of the functions of the user device 1g in the first modification. The processing device 2 of the user device 1g functions as an acquisition unit 21, an emotion estimation unit 25g, and an output unit 26g by reading a control program from the storage device 3 of the user device 1g and executing it. The emotion estimation unit 25g includes a noise removal unit 251, a voice evaluation unit 252, a correction unit 253, and an estimation unit 258g. The estimation unit 258g estimates one or more emotions held by the user U based on the corrected emotion information CVE. For example, the estimation unit 258g compares each of the voice evaluation values X1 to X4 of the corrected emotion information CVE with a threshold value and identifies the voice evaluation values X that exceed the threshold value. The estimation unit 258g estimates the one or more emotions corresponding to the identified voice evaluation values X as the one or more emotions held by the user U. The emotion estimation unit 25g outputs estimated emotion information EI indicating the one or more emotions estimated to be held by the user U.
The output unit 26g outputs the estimated emotion information EI. For example, the output unit 26g outputs a character string indicating the emotion indicated by the estimated emotion information EI to the display device 4.
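The threshold comparison performed by the estimation unit 258g can be sketched as follows. This is only an illustration: the assignment of the voice evaluation values X1 to X4 to joy, anger, sadness, and normal, the sample values, and the threshold of 0.4 are assumptions, and `estimate_emotions` is a hypothetical name.

```python
from typing import Dict, List


def estimate_emotions(cve: Dict[str, float], threshold: float) -> List[str]:
    """Return every emotion whose corrected voice evaluation value exceeds the threshold."""
    return [emotion for emotion, x in cve.items() if x > threshold]


# Hypothetical corrected emotion information CVE (X1 to X4).
cve = {"joy": 0.6, "anger": 0.1, "sadness": 0.45, "normal": 0.2}
estimated = estimate_emotions(cve, threshold=0.4)
```

Because every value above the threshold is kept, the result may contain zero, one, or several emotions, which matches the "one or more emotions" wording only when at least one value exceeds the threshold.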

(2) The processing of the emotion estimation unit 25C shown in the fourth embodiment may be distributed between the server device 10 and the user device 1. For example, the second emotion estimation unit 25 in the server device 10 has a noise removal unit 251C, a voice evaluation unit 252B, a character evaluation unit 256, and a specific unit 259. The first emotion estimation unit 25 in the user device 1 has a correction unit 253, an adjustment unit 254, and an estimation unit 258.

(3) The emotion estimation unit 25 has been described as estimating one or more emotions among joy, anger, sadness, and normal, but it may estimate a single emotion. For example, the estimation unit 258 adds the voice evaluation values X1 to X4 of the corrected emotion information CVE and the character evaluation values Y1 to Y4 of the character emotion information TE emotion by emotion to calculate a sum for each emotion. The estimation unit 258 may then estimate the emotion with the largest sum as the emotion held by the user U.
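The single-emotion variant in (3) amounts to a per-emotion sum followed by a maximum. A minimal sketch, assuming the evaluation values are held in dictionaries keyed by emotion name; the function name and the sample values are hypothetical, and ties are resolved in favor of the first emotion in dictionary order, which the text does not specify.

```python
from typing import Dict


def estimate_single_emotion(cve: Dict[str, float], te: Dict[str, float]) -> str:
    """Add X and Y per emotion and return the emotion with the largest sum."""
    totals = {emotion: cve[emotion] + te[emotion] for emotion in cve}
    return max(totals, key=totals.get)


# Hypothetical voice evaluation values X1 to X4 and character evaluation values Y1 to Y4.
voice_values = {"joy": 0.6, "anger": 0.1, "sadness": 0.45, "normal": 0.2}
character_values = {"joy": 0.1, "anger": 0.0, "sadness": 0.5, "normal": 0.4}
single = estimate_single_emotion(voice_values, character_values)
```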

(4) In the third embodiment, the emotion estimation unit 25B is realized by the server device 10, but the third embodiment may also be applied to a single user device 1, for example when the user device 1 is carried by a plurality of users U. The third embodiment may be applied when the user U1 carries the user device 1 during a certain period and sets the user device 1 to the calibration mode, and the user U2 carries the user device 1 during a later period.

(5) In the fifth, sixth, and seventh embodiments, the communication device 6B transmits the recognition character string RT, the voice emotion information VE2 of the user U2, and the character emotion information TE2 of the user U2 to the user device 1f2, but the recognition character string RT need not be transmitted. For example, the user device 1f outputs a character string indicating the emotion indicated by the estimated emotion information EI to the display device 4.

(6) The user device 1 does not have to include the sound collecting device 8. In that case, the user device 1 may acquire the sound information SI via the communication device 6, or may acquire the sound information SI stored in the storage device 3.

(7) The user device 1 does not have to include the sound emitting device 7.

(8) The user device 1 may be a smart speaker. When the user device 1 is a smart speaker, the user device 1 does not have to include the display device 4 and the operating device 5.

(9) As shown in FIG. 4, the emotion classification information 33 classifies each of the inflected morphemes of a word, such as 「勝つ」 and 「勝っ」 (inflected forms of "win"), into one of joy, anger, sadness, and normal, but the present invention is not limited to this. For example, the emotion classification information 33 may classify the character strings registered in the base-form data of the analysis dictionary information 31 into one of joy, anger, sadness, and normal. For example, the emotion classification information 33 classifies the character strings 「嬉しい」 ("happy"), 「合格」 ("pass"), and 「勝つ」 ("win") registered in the base-form data of the analysis dictionary information 31 as joy. The evaluation value calculation unit 2565 decomposes the corrected recognition character string CRT into morphemes and converts each morpheme into the character string registered in the base-form data of the analysis dictionary information 31. Then, when a character string obtained by the conversion matches a character string included in the emotion classification information 33, the evaluation value calculation unit 2565 increases the character evaluation value Y of the emotion corresponding to that character string.
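The base-form lookup in (9) can be sketched as follows. The sketch assumes that morphological decomposition has already produced a list of morphemes, that the base-form data is a simple mapping from inflected morpheme to base form, and that each match increases the character evaluation value Y by 1. The table contents beyond the words named in the text, the increment, and all identifiers are hypothetical.

```python
from typing import Dict, List

# Hypothetical excerpt of the base-form data: inflected morpheme -> base form.
BASE_FORM: Dict[str, str] = {"勝っ": "勝つ", "勝つ": "勝つ", "嬉しい": "嬉しい", "合格": "合格"}

# Emotion classification keyed by base form only (entries taken from the text).
EMOTION_OF: Dict[str, str] = {"嬉しい": "joy", "合格": "joy", "勝つ": "joy"}


def character_evaluation(morphemes: List[str]) -> Dict[str, int]:
    """Count, per emotion, the morphemes whose base form is classified under that emotion."""
    y = {"joy": 0, "anger": 0, "sadness": 0, "normal": 0}
    for morpheme in morphemes:
        base = BASE_FORM.get(morpheme, morpheme)  # fall back to the surface form
        emotion = EMOTION_OF.get(base)
        if emotion is not None:
            y[emotion] += 1
    return y


# 「試合に勝った」 decomposed into morphemes (decomposition assumed given).
values = character_evaluation(["試合", "に", "勝っ", "た"])
```

Keying the classification on base forms means a single entry such as 「勝つ」 covers every inflected form, so the classification table does not need one row per inflection.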

(10) The evaluation value calculation unit 2565 calculates the character evaluation value Y for each emotion from the corrected recognition character string CRT, but it may instead calculate the character evaluation value Y for each emotion from the recognition character string RT. However, the recognition character string RT contains character strings that are unnecessary for estimating emotions. Therefore, calculating the character evaluation value Y for each emotion from the corrected recognition character string CRT improves the accuracy of emotion estimation compared with calculating it from the recognition character string RT.

(11) In the first aspect, the value indicating the degree of difference is the sum of the squares of the differences between the voice evaluation values X and the character evaluation values Y, but it may be a value obtained by any evaluation function that defines a distance between evaluation values, such as the sum of the absolute values of the differences between the voice evaluation values X and the character evaluation values Y.
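The two evaluation functions named in (11) can be written side by side. A minimal sketch, assuming the evaluation values are held in dictionaries keyed by emotion name; the function names and the sample values are hypothetical.

```python
from typing import Dict


def sum_of_squared_differences(x: Dict[str, float], y: Dict[str, float]) -> float:
    """Sum over emotions of (X - Y) squared."""
    return sum((x[e] - y[e]) ** 2 for e in x)


def sum_of_absolute_differences(x: Dict[str, float], y: Dict[str, float]) -> float:
    """Sum over emotions of |X - Y|."""
    return sum(abs(x[e] - y[e]) for e in x)


# Hypothetical integer-valued evaluation values for the four emotions.
x_values = {"joy": 3, "anger": 1, "sadness": 0, "normal": 1}
y_values = {"joy": 1, "anger": 1, "sadness": 2, "normal": 1}

squared = sum_of_squared_differences(x_values, y_values)    # 4 + 0 + 4 + 0
absolute = sum_of_absolute_differences(x_values, y_values)  # 2 + 0 + 2 + 0
```

The squared form penalizes a large gap in a single emotion more heavily than the absolute form, so the predetermined value it is compared against would need to be chosen per function.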

(12) Although the examples assume that the user U speaks Japanese, each of the above aspects can be applied regardless of the language the user speaks. For example, each of the above aspects can be applied even when the user U speaks a language other than Japanese, such as English, French, or Chinese. For example, when the user U speaks English, the analysis dictionary information 31 may be information on English morphemes, and the emotion classification information 33 may be data in which English words are classified into one of joy, anger, sadness, and normal.

(13) The block diagrams used in the description of each of the above aspects show blocks in functional units. These functional blocks (components) are realized by any combination of hardware and/or software. The means for realizing each functional block is not particularly limited. That is, each functional block may be realized by one physically and/or logically coupled device, or by two or more physically and/or logically separate devices that are connected directly and/or indirectly (for example, by wire and/or wirelessly).

(14) The order of the processing procedures, sequences, flowcharts, and the like in each of the above aspects may be changed as long as no contradiction arises. For example, the methods described herein present the elements of the various steps in an exemplary order and are not limited to the specific order presented.

(15) In each of the above aspects, input and output information and the like may be stored in a specific location (for example, a memory) or may be managed in a management table. Input and output information and the like may be overwritten, updated, or appended. Output information and the like may be deleted. Input information and the like may be transmitted to another device.

(16) In each of the above aspects, a determination may be made by a value represented by one bit (0 or 1), by a Boolean value (true or false), or by a numerical comparison (for example, a comparison with a predetermined value).

(17) In each of the above aspects, a portable information processing device such as a smartphone is given as an example of the user device 1, but the specific form of the user device 1 is arbitrary and is not limited to the examples in the above aspects. For example, a portable or stationary personal computer may be used as the user device 1.

(18) In each of the above aspects, the storage device 3 is a recording medium readable by the processing device 2, and a ROM and a RAM are given as examples, but it may be a flexible disk, a magneto-optical disk (for example, a compact disc, a digital versatile disc, or a Blu-ray (registered trademark) disc), a smart card, a flash memory device (for example, a card, a stick, or a key drive), a CD-ROM (Compact Disc-ROM), a register, a removable disk, a hard disk, a floppy (registered trademark) disk, a magnetic strip, a database, a server, or any other suitable storage medium. The program may be transmitted from a network. The program may also be transmitted from a communication network via a telecommunication line.

(19) Each of the above aspects may be applied to systems that use LTE (Long Term Evolution), LTE-A (LTE-Advanced), SUPER 3G, IMT-Advanced, 4G, 5G, FRA (Future Radio Access), W-CDMA (registered trademark), GSM (registered trademark), CDMA2000, UMB (Ultra Mobile Broadband), IEEE 802.11 (Wi-Fi), IEEE 802.16 (WiMAX), IEEE 802.20, UWB (Ultra-WideBand), Bluetooth (registered trademark), or other appropriate systems, and/or to next-generation systems extended based on them.

(20) In each of the above aspects, the information, signals, and the like described may be represented using any of a variety of different technologies. For example, data, instructions, commands, information, signals, bits, symbols, chips, and the like that may be referred to throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination of these.
The terms described in this specification and/or the terms necessary for understanding this specification may be replaced with terms having the same or similar meanings.

(21) Each function illustrated in FIG. 2, FIG. 5, FIG. 7, FIG. 8, FIG. 11, FIG. 12, FIG. 13, FIG. 14, FIG. 18, FIG. 19, FIG. 20, FIG. 23, FIG. 24, FIG. 25, FIG. 27, FIG. 28, FIG. 30, FIG. 31, FIG. 32, and FIG. 33 is realized by any combination of hardware and software. Each function may be realized by a single device, or by two or more devices configured separately from each other.

(22) The programs exemplified in each of the above embodiments, whether called software, firmware, middleware, microcode, hardware description language, or any other name, should be interpreted broadly to mean instructions, instruction sets, code, code segments, program code, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, and the like.
Software, instructions, and the like may be transmitted and received via a transmission medium. For example, when software is transmitted from a website, a server, or another remote source using wired technologies such as coaxial cable, optical fiber cable, twisted pair, and digital subscriber line (DSL) and/or wireless technologies such as infrared, radio, and microwave, these wired and/or wireless technologies are included within the definition of a transmission medium.

(23) In each of the above embodiments, information, parameters, and the like may be represented by absolute values, by values relative to a predetermined value, or by other corresponding information.

(24) The names used for the parameters described above are not limiting in any respect. Furthermore, mathematical formulas and the like that use these parameters may differ from those explicitly disclosed in this specification.

(25) In each of the above embodiments, the user device 1 may be a mobile station. A mobile station may also be referred to by those skilled in the art as a subscriber station, mobile unit, subscriber unit, wireless unit, remote unit, mobile device, wireless device, wireless communication device, remote device, mobile subscriber station, access terminal, mobile terminal, wireless terminal, remote terminal, handset, user agent, mobile client, client, or some other suitable term.

(26) In each of the above embodiments, the phrase "based on" does not mean "based only on" unless otherwise specified. In other words, the phrase "based on" means both "based only on" and "based at least on".

(27) Any reference to elements using designations such as "first" and "second" in this specification does not generally limit the quantity or order of those elements. These designations may be used in this specification as a convenient way of distinguishing between two or more elements. Thus, a reference to first and second elements does not mean that only two elements may be employed or that the first element must precede the second element in some way.

(28) To the extent that "including", "comprising", and variations thereof are used in the above embodiments, in this specification, or in the claims, these terms are intended to be inclusive in the same manner as the term "provided with". Furthermore, the term "or" as used in this specification or in the claims is not intended to be an exclusive OR.

(29) Throughout the present application, where articles such as a, an, and the in English are added by translation, these articles include the plural unless the context clearly indicates otherwise.

(30) It will be apparent to those skilled in the art that the present invention is not limited to the embodiments described in this specification. The present invention can be implemented with modifications and changes without departing from the spirit and scope of the present invention as defined by the claims. Accordingly, the description in this specification is for illustrative purposes and has no limiting meaning with respect to the present invention. A plurality of aspects selected from the aspects illustrated in this specification may also be combined.

1, 1a, 1b, 1d, 1e, 1f, 1g ... user device, 10, 10C, 10D, 10F ... server device, 21, 21a ... acquisition unit, 26 ... output unit, 31 ... analysis dictionary information, 33 ... emotion classification information, 251, 251C ... noise removal unit, 252, 252B ... voice evaluation unit, 253 ... correction unit, 254, 254C ... adjustment unit, 256 ... character evaluation unit, 258 ... estimation unit, 259 ... specific unit, CI ... correction information, CVE ... corrected emotion information, LM ... learning model, P1 ... first parameter, P2 ... second parameter, TE ... character emotion information, U ... user, VE ... voice emotion information, VI ... voice information, X ... voice evaluation value, Y ... character evaluation value.

Claims

An emotion estimation device comprising:
a voice evaluation unit that inputs a plurality of feature quantities based on voice information indicating a user's voice into a learning model that has learned, for a plurality of humans, the relationship between a plurality of feature quantities corresponding to a human voice and the intensity of each of a plurality of emotions held by the human who uttered the voice, and that acquires from the learning model voice emotion information including voice evaluation values indicating the intensity of each of the plurality of emotions held by the user;
a correction unit that generates corrected emotion information by correcting the voice emotion information using correction information based on the characteristics of the user's voice; and
an estimation unit that estimates, from the plurality of emotions, one or more emotions held by the user based on the corrected emotion information.

The emotion estimation device according to claim 1, further comprising:
an acquisition unit that acquires sound information output by a sound collecting device that collects sound including the user's voice; and
a noise removing unit that removes noise from the sound indicated by the sound information to generate the voice information.

The emotion estimation device according to claim 1 or 2, wherein
the voice evaluation unit inputs into the learning model a plurality of feature quantities based on voice information indicating a voice in which the user explicitly expresses one emotion among the plurality of emotions, and acquires explicit voice emotion information of the user from the learning model, and
the emotion estimation device further comprises an adjustment unit that adjusts the correction information based on the explicit voice emotion information so as to increase the likelihood that the estimation unit estimates that the emotion held by the user is the one emotion.

The emotion estimation device according to claim 3, further comprising
a character evaluation unit that executes, on the sound information, voice recognition processing for recognizing the utterance content of the voice uttered by the human, and that generates, based on a recognition character string indicating the recognition result of the voice recognition processing, character emotion information including character evaluation values indicating the intensity of each of the plurality of emotions held by the user,
wherein the estimation unit estimates one or more emotions held by the user based on the corrected emotion information and the character emotion information.

When the value indicating the degree of difference between the plurality of voice evaluation values included in the corrected emotion information of the user and the plurality of character evaluation values included in the character emotion information is equal to or less than a predetermined value, the recognition character string is specified. It has a specific part to be specified as a character string,
When another user who has not spoken a voice that expresses an explicit emotion speaks the specific character string,
The voice evaluation unit learns the voice emotion information of the other user by inputting a plurality of feature quantities corresponding to the voice of the other user uttering the specific character string into the learning model. Get from the model
The character evaluation unit
Generates the character emotion information of the other user based on the voice that the other user utters the specific character string.
The purpose of the adjusting unit is to bring a plurality of voice evaluation values included in the corrected emotion information of the other user closer to a plurality of character evaluation values included in the character emotion information of the other user for each of the plurality of emotions. To adjust the correction information for the other user,
The emotion estimation device according to claim 4.
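
The claim above names neither the difference measure nor the adjustment rule. The following is a sketch under stated assumptions: the degree of difference is taken to be the sum of absolute per-emotion differences, the correction information is a per-emotion multiplicative coefficient, and the adjustment moves each coefficient part of the way toward the ratio that would make the corrected voice value equal the character value. All function names and the `rate` parameter are hypothetical.

```python
# Hypothetical emotion set used only for illustration.
EMOTIONS = ("joy", "anger", "sadness")


def degree_of_difference(voice_values, character_values):
    """Assumed measure: sum of absolute differences over all emotions."""
    return sum(abs(voice_values[e] - character_values[e]) for e in EMOTIONS)


def is_specific_string(corrected_values, character_values, predetermined_value):
    """The recognized string qualifies when voice and text evaluations agree closely."""
    return degree_of_difference(corrected_values, character_values) <= predetermined_value


def adjust_coefficients(coefficients, voice_values, character_values, rate=0.5):
    """Move each coefficient toward character/voice so corrected values approach text values."""
    adjusted = {}
    for e in EMOTIONS:
        if voice_values[e] == 0:
            # No ratio can be formed; leave this coefficient unchanged.
            adjusted[e] = coefficients[e]
            continue
        target = character_values[e] / voice_values[e]
        adjusted[e] = coefficients[e] + rate * (target - coefficients[e])
    return adjusted
```

With `rate=1.0` the corrected values would match the character values exactly after one utterance; a smaller rate is used here on the assumption that a single utterance should not fully determine the other user's correction information.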

The device further comprises a noise removing unit that removes noise from the sound indicated by the sound information based on a predetermined threshold value to generate the voice information, and
A specifying unit that specifies the recognition character string as a specific character string when a value indicating the degree of difference between the plurality of voice evaluation values included in the corrected emotion information of the user and the plurality of character evaluation values included in the character emotion information is equal to or less than a predetermined value,
When another user who has not uttered a voice explicitly expressing an emotion utters the specific character string,
The voice evaluation unit acquires voice emotion information of the other user from the learning model by inputting, into the learning model, a plurality of feature quantities corresponding to the voice of the other user uttering the specific character string,
The character evaluation unit generates character emotion information of the other user based on the voice of the other user uttering the specific character string, and
The adjustment unit adjusts the predetermined threshold value for the other user so as to bring, for each of the plurality of emotions, the plurality of voice evaluation values included in the corrected emotion information of the other user closer to the plurality of character evaluation values included in the character emotion information of the other user,
The emotion estimation device according to claim 4.
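
The claim above leaves open how noise is removed and how the threshold is searched. As an illustrative sketch only, assume noise removal is a simple amplitude gate and the adjustment tries a list of candidate thresholds, keeping the one whose resulting voice evaluation values are closest to the character evaluation values. `evaluate_voice` is a hypothetical callable standing in for feature extraction, the learning model, and the correction step together; it is not defined by the source.

```python
def remove_noise(samples, threshold):
    """Assumed noise gate: zero out samples whose amplitude is below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def adjust_threshold(samples, candidates, evaluate_voice, character_values):
    """Pick the candidate threshold that brings voice values closest to character values.

    evaluate_voice: hypothetical callable mapping denoised samples to a dict of
    corrected voice evaluation values keyed by emotion.
    """
    def difference(threshold):
        voice_values = evaluate_voice(remove_noise(samples, threshold))
        return sum(abs(voice_values[e] - character_values[e]) for e in character_values)

    return min(candidates, key=difference)
```

A grid search over candidates is used here because the relationship between the threshold and the model output is not known in advance; any other search strategy would serve the same purpose.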

An emotion estimation system comprising a server device and a terminal device capable of communicating with the server device, wherein
The server device comprises:
A first communication device that receives sound information indicating sound including the user's voice,
A noise removing unit that removes noise from the sound indicated by the sound information and generates voice information indicating the user's voice,
A voice evaluation unit that inputs a plurality of feature quantities based on the voice information into a learning model that has learned, for a plurality of humans, the relationship between a plurality of feature quantities corresponding to a human voice and the intensity of each of a plurality of emotions held by the human who uttered the voice, and that acquires, from the learning model, voice emotion information including a voice evaluation value indicating the intensity of each of the plurality of emotions held by the user, and
A character evaluation unit that executes, on the sound information, a voice recognition process for recognizing the utterance content of a voice uttered by a human, and that generates, based on a recognition character string indicating the recognition result of the voice recognition process, character emotion information including a character evaluation value indicating the intensity of each of the plurality of emotions held by the user,
The first communication device transmits the character emotion information and the voice emotion information to the terminal device,
The terminal device comprises:
A sound collecting device that collects sound including the user's voice,
A second communication device that transmits the sound information output by the sound collecting device to the server device and receives the character emotion information and the voice emotion information from the server device,
A correction unit that generates corrected emotion information by correcting the voice emotion information using correction information based on the characteristics of the user's voice, and
An estimation unit that estimates one or more emotions held by the user based on the corrected emotion information and the character emotion information.
Emotion estimation system.

The voice evaluation unit inputs, into the learning model, a plurality of feature quantities based on voice information indicating a voice in which the user explicitly expresses one of the plurality of emotions, and acquires explicit voice emotion information of the user from the learning model,
The terminal device comprises an adjustment unit that adjusts the correction information based on the explicit voice emotion information so as to increase the likelihood that the estimation unit estimates the emotion held by the user to be the one emotion,
The emotion estimation system according to claim 7.
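
The claim above states the goal of the adjustment but not a procedure. One possible sketch, assuming the correction information is a per-emotion multiplicative coefficient: raise the coefficient of the explicitly expressed emotion in fixed steps until its corrected voice evaluation value strictly exceeds all others. The function name, `step`, and `max_iterations` are hypothetical.

```python
def adjust_for_explicit_emotion(coefficients, explicit_voice_values, target,
                                step=0.5, max_iterations=1000):
    """Raise the target emotion's coefficient until its corrected value is the largest.

    coefficients and explicit_voice_values are dicts keyed by emotion name;
    target is the emotion the user explicitly expressed.
    """
    adjusted = dict(coefficients)  # leave the caller's correction information untouched
    for _ in range(max_iterations):
        corrected = {e: explicit_voice_values[e] * adjusted[e] for e in adjusted}
        if all(corrected[target] > corrected[e] for e in corrected if e != target):
            break
        adjusted[target] += step
    return adjusted
```

The iteration cap guards against the case where the target emotion's voice evaluation value is zero, in which no coefficient can make it the largest.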

The terminal device is a first terminal device,
The user is a first user,
The sound collecting device is a first sound collecting device,
The server device is capable of communicating with a second terminal device different from the first terminal device,
The second communication device transmits the corrected emotion information of the first user to the server device,
The server device comprises a specifying unit that specifies the recognition character string as a specific character string when a value indicating the degree of difference between the plurality of voice evaluation values included in the corrected emotion information of the first user and the plurality of character evaluation values included in the character emotion information is equal to or less than a predetermined value,
When a second user who possesses the second terminal device and has not uttered a voice explicitly expressing an emotion utters the specific character string,
The voice evaluation unit acquires voice emotion information of the second user from the learning model by inputting, into the learning model, a plurality of feature quantities corresponding to the voice uttered by the second user,
The character evaluation unit generates character emotion information of the second user based on the voice of the second user uttering the specific character string,
The first communication device transmits the specific character string, the voice emotion information of the second user, and the character emotion information of the second user to the second terminal device,
The second terminal device comprises:
A second sound collecting device that collects sound including the voice of the second user,
A third communication device that transmits sound information output by the second sound collecting device to the server device and receives the character emotion information of the second user and the voice emotion information of the second user from the server device,
A correction unit that generates corrected emotion information of the second user by correcting the voice emotion information of the second user using correction information based on the characteristics of the voice of the user, and
An adjustment unit that adjusts the correction information for the second user so as to bring, for each of the plurality of emotions, the plurality of voice evaluation values included in the corrected emotion information of the second user closer to the plurality of character evaluation values included in the character emotion information of the second user,
The emotion estimation system according to claim 8.