JP6638435B2

JP6638435B2 - Personal adaptation method of emotion estimator, emotion estimation device and program

Info

Publication number: JP6638435B2
Application number: JP2016020071A
Authority: JP
Inventors: 勝彦佐藤; 浩一中込; 崇史山谷
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2016-02-04
Filing date: 2016-02-04
Publication date: 2020-01-29
Anticipated expiration: 2036-02-04
Also published as: JP2017138509A

Description

The present invention relates to a personal adaptation method for an emotion estimator, an emotion estimation device, and a program.

Technology is being developed for estimating a speaker's emotional state using an emotion estimation device generated by machine learning, with a set of emotion-labeled voice data as teacher data. For example, Patent Literature 1 discloses a technique that obtains the amount of change in each of voice intensity, voice tempo, and voice intonation, and estimates the speaker's emotional state from the obtained amounts of change. Patent Literature 1 also discloses a technique that estimates the emotional state of an arbitrary speaker at the time of utterance using an emotion estimation device generated from teacher data consisting of voice data uttered by an unspecified large number of speakers in emotional states such as joy, anger, sorrow, and pleasure, attaches the estimated label (joy, anger, sorrow, pleasure, and so on) to that voice data, adds the labeled data to the teacher data, and adaptively trains the emotion estimation device.

Patent Literature 1: JP 2005-352154 A

An emotion estimation device generated with the voice data of an unspecified large number of speakers as teacher data estimates a speaker's emotional state based on the characteristics shared between the feature amounts of that voice data and the feature amounts of the voice data of the speaker being analyzed. Therefore, when the feature amounts of the voice data being analyzed are close to the average of the feature amounts of the voice data of the unspecified large number of speakers, the estimation accuracy of the emotion estimation device is high. However, the larger the difference between the feature amounts of the voice data being analyzed and that average, the lower the estimation accuracy becomes.

In the technique disclosed in Patent Literature 1, during the adaptation process that adapts the emotion estimation device to a specific individual, voice data that was labeled (joy, anger, sorrow, pleasure, and so on) while the estimation accuracy was poor, because its feature amounts differ greatly from the average of the feature amounts of the voice data of the unspecified large number of speakers, is still added to the teacher data as is. Consequently, when the difference between the feature amounts of the specific individual's voice data and that average is large, it is difficult to improve the estimation accuracy of the emotion estimation device for the specific individual.

The present invention has been made in view of these circumstances, and an object thereof is to provide a personal adaptation method for an emotion estimator, an emotion estimation device, and a program that can improve the estimation accuracy of an emotion estimation device for a specific individual.

In order to achieve the above object, a personal adaptation method for an emotion estimator according to a first aspect of the present invention is
a personal adaptation method in which an emotion estimator that estimates a speaker's emotional state at the time of utterance, generated with voice data uttered by an unspecified large number of speakers as teacher data, is personally adapted as an emotion estimator that estimates the emotional state of a specific individual at the time of utterance, the method including:
an acquisition step of acquiring voice data uttered by the specific individual;
a feature extraction step of extracting features of the voice data;
a frequency analysis step of classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
a first labeling step of attaching, to the voice data of each specific extraction section for which the appearance frequency of the pattern is determined to be equal to or greater than a threshold, a neutral label indicating that the emotional state at the time of utterance is calm; and
a personal adaptation step of generating teacher data for the specific individual by adding the voice data given the neutral label in the first labeling step to the teacher data composed of the voice data uttered by the unspecified large number of speakers, and constructing an emotion estimator for the specific individual based on the generated teacher data, thereby personally adapting the emotion estimator generated with the voice data uttered by the unspecified large number of speakers as teacher data into an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

An emotion estimation device according to a second aspect of the present invention includes:
acquisition means for acquiring voice data uttered by a specific individual;
feature extraction means for extracting features of the voice data;
frequency analysis means for classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
first labeling means for attaching, to the voice data of each specific extraction section for which the appearance frequency of the pattern is determined to be equal to or greater than a threshold, a neutral label indicating that the emotional state at the time of utterance is calm; and
personal adaptation means for generating teacher data for the specific individual by adding the voice data given the neutral label by the first labeling means to the teacher data composed of the voice data uttered by the unspecified large number of speakers, and constructing an emotion estimator for the specific individual based on the generated teacher data, thereby personally adapting the emotion estimator generated with the voice data uttered by the unspecified large number of speakers as teacher data into an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

An emotion estimation device according to a third aspect of the present invention includes:
acquisition means for acquiring voice data uttered by a speaker;
feature extraction means for extracting features of the voice data;
frequency analysis means for classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
first labeling means for attaching, to the voice data of each specific extraction section for which the appearance frequency of the pattern is determined to be equal to or greater than a threshold, a neutral label indicating that the emotional state at the time of utterance is calm; and
emotion estimation means for comparing an evaluation value obtained by multiplying the number of specific extraction sections given the neutral label by a weight coefficient with an evaluation value obtained by multiplying the number of specific extraction sections not given the neutral label by a weight coefficient, and determining that the speaker's emotional state at the time of utterance is neutral when the evaluation value for the sections given the neutral label is higher than the evaluation value for the sections not given the neutral label.
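The comparison performed by the emotion estimation means of the third aspect can be sketched as follows. This is only an illustration under stated assumptions: the function name, the label string "neutral", and the default weight values are hypothetical, and this passage does not say how the weight coefficients are chosen.

```python
def is_neutral_utterance(section_labels, w_neutral=1.0, w_other=1.0):
    """Return True when the weighted count of neutral sections is the larger one.

    section_labels: one entry per extraction section; "neutral" marks a section
    that received the neutral label, anything else marks a section that did not.
    w_neutral, w_other: weight coefficients (hypothetical default values).
    """
    n_neutral = sum(1 for label in section_labels if label == "neutral")
    n_other = len(section_labels) - n_neutral
    # Evaluation value = number of sections x weight coefficient.
    return n_neutral * w_neutral > n_other * w_other
```

With equal weights this reduces to a majority vote over sections; raising `w_other` makes the neutral decision harder to reach.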

A program according to a fourth aspect of the present invention causes a computer to function as:
acquisition means for acquiring voice data uttered by a specific individual;
feature extraction means for extracting features of the voice data;
frequency analysis means for classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
first labeling means for attaching, to the voice data of each specific extraction section for which the appearance frequency of the pattern is determined to be equal to or greater than a threshold, a neutral label indicating that the emotional state at the time of utterance is calm; and
personal adaptation means for generating teacher data for the specific individual by adding the voice data given the neutral label by the first labeling means to the teacher data composed of the voice data uttered by the unspecified large number of speakers, and constructing an emotion estimator for the specific individual based on the generated teacher data, thereby personally adapting the emotion estimator generated with the voice data uttered by the unspecified large number of speakers as teacher data into an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

According to the present invention, the estimation accuracy of an emotion estimation device for a specific individual can be improved.

FIG. 1 is a block diagram illustrating the physical configuration of an emotion estimation device according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram illustrating the functional configuration of the emotion estimation device according to Embodiment 1 of the present invention.
FIG. 3 is a diagram for describing a method of creating power time-series data and pitch time-series data of voice data.
FIG. 4 is a diagram illustrating a spectrum distribution of voice data.
FIG. 5 is a diagram for describing extraction sections.
FIG. 6 is a diagram for describing a method of extracting a power time-series change pattern of voice data.
FIG. 7 is a diagram for describing a method of extracting a pitch time-series change pattern of voice data.
FIG. 8(a) is a diagram for describing a frequency analysis method for power time-series change patterns, and FIG. 8(b) is a diagram for describing a frequency analysis method for pitch time-series change patterns.
FIG. 9 is a diagram for describing the concept of the discrimination threshold of the emotion estimator.
FIG. 10 is a diagram for describing the emotion estimation method used by the emotion estimation unit.
FIG. 11 is a diagram for describing weight coefficients.
FIG. 12 is a flowchart for describing the personal adaptation process that adapts the emotion estimator to a specific individual.
FIG. 13 is a flowchart for describing details of the personal adaptation process of the emotion estimator.
FIG. 14 is a flowchart for describing the process of extracting features of voice data.
FIG. 15 is a flowchart for describing the emotion estimation process.
FIG. 16 is a diagram for describing the frequency analysis method of the emotion estimation device 100 according to Modification 1, in which (a) shows a case where the neutral section is set using the mean of the appearance frequencies, (b) shows a case where it is set using the variance of the appearance frequencies, and (c) shows a case where it is set using the median of the appearance frequencies.

Hereinafter, a personal adaptation method for an emotion estimator, an emotion estimation device, and a program according to embodiments of the present invention will be described with reference to the drawings. In the drawings, the same or corresponding parts are denoted by the same reference numerals.

(Embodiment 1)
In the present embodiment, a case in which the emotion estimation device 100 is mounted on a pet robot is described. The emotion estimation device 100 mounted on the pet robot estimates the user's emotional state by analyzing the features of the user's voice, and the pet robot takes an action corresponding to the estimated emotion. By adding the user's voice data, which accumulates day by day, to the teacher data, the emotion estimation device 100 adapts the emotion estimator into an emotion estimator for a specific individual.

In the present embodiment, a case is described in which the emotion estimation device 100 estimates the speaker's emotional state to be one of positive, negative, or neutral. A positive emotional state is, for example, a state of being pleased, feeling at ease, or being interested. A negative emotional state is, for example, a state of being angry, feeling anxious, or being bored. A neutral emotional state is any emotional state other than a positive or negative one. In general, most utterances can be assumed to be spoken in a neutral emotional state.

The following describes in detail the configuration of the emotion estimation device 100, the personal adaptation process that adapts the emotion estimation device 100 to a specific individual user, and the process by which the emotion estimation device 100 estimates the emotion of the specific individual.

As illustrated in FIG. 1, the emotion estimation device 100 according to Embodiment 1 physically includes a control unit 1, a storage unit 2, an input/output unit 3, and a bus 4.

The control unit 1 includes a ROM (Read Only Memory), a RAM (Random Access Memory), and a CPU (Central Processing Unit). The ROM stores the emotion estimator personal adaptation processing program and the emotion estimation program according to the present embodiment, various initial setting programs, an initial program for loading hardware inspection programs and the like, and so on. The RAM functions as a work area that temporarily stores the software programs executed by the CPU and the data needed to execute them. The CPU is a central processing unit that performs various processes and computations by executing software programs.

The storage unit 2 includes nonvolatile memory such as a hard disk drive or flash memory. The storage unit 2 stores, as teacher data, voice data uttered by an unspecified large number of speakers. The storage unit 2 also stores the voice data uttered by the user day by day.

The input/output unit 3 includes a voice input device for acquiring voice data uttered by the user. The input/output unit 3 also includes a CD (Compact Disc) drive and a USB (Universal Serial Bus) interface for acquiring voice data to be analyzed via a storage medium. In addition, the input/output unit 3 includes a speaker, a display, LEDs (Light Emitting Diodes), and the like for outputting the user's emotional state as estimated by the emotion estimator. The input/output unit 3 can output the estimated emotional state (positive, negative, or neutral) directly as speech or text, and can also output it indirectly, based on a pre-programmed scenario, as utterance content, speech rate, or LED color associated with the estimated emotional state. The weight coefficients described later can also be changed via the input/output unit 3.

The bus 4 connects the control unit 1, the storage unit 2, and the input/output unit 3.

As illustrated in FIG. 2, the emotion estimation device 100 functionally includes a voice data acquisition unit 110, a voice data analysis unit 120, a feature extraction unit 130, a frequency analysis unit 150, a first labeling unit 160, a second labeling unit 170, an emotion estimator adaptation processing unit 180, and an emotion estimation unit 190. The feature extraction unit 130 includes a specific individual classification unit 131, a time length measurement unit 132, a power time-series change pattern calculation unit 133, and a pitch time-series change pattern calculation unit 134.

The voice data acquisition unit 110 acquires the voice data of the user to be analyzed via the input/output unit 3.

The voice data analysis unit 120 analyzes the acquired voice data. Specifically, it creates the power and the pitch of the voice data as time-series data. The waveform drawn with a dotted line in FIG. 3 is an example voice data waveform. The voice data analysis unit 120 sets an analysis window starting at the start point t0 of the voice data and, while shifting the window by a shift width dt at a time, computes the time average of the squared waveform amplitude in each window, thereby creating the power time-series data of the voice data.
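The sliding-window power computation described above can be sketched as follows. This is a minimal illustration, assuming the waveform is a plain list of amplitude samples and that the window width and shift are given in samples; the function and parameter names are hypothetical.

```python
def power_time_series(samples, window, shift):
    """Mean squared amplitude of each analysis window, sliding by `shift` samples."""
    if window <= 0 or shift <= 0:
        raise ValueError("window and shift must be positive")
    powers = []
    start = 0  # corresponds to the start point t0 of the voice data
    while start + window <= len(samples):
        frame = samples[start:start + window]
        # Time average of the squared waveform amplitude within the window.
        powers.append(sum(x * x for x in frame) / window)
        start += shift
    return powers
```

For example, `power_time_series([1, -1, 2, -2, 0, 0], window=2, shift=2)` yields one power value for each of the three non-overlapping windows.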

The voice data analysis unit 120 also sets an analysis window starting at the start point t0 of the voice data and, while shifting the window by the shift width dt at a time, applies an FFT (Fast Fourier Transform) to the voice data in each window. FIG. 4 shows an example of the spectrum distribution of the voice data in an analysis window obtained by this FFT. The horizontal axis is frequency and the vertical axis is spectral intensity. The frequency of the peak located in the lowest frequency region of this spectrum is denoted f0. This f0 represents the speaker-specific fundamental frequency obtained from the voice data in that analysis window. The voice data analysis unit 120 creates the pitch time-series data of the voice data by extracting f0 at time tn as f0_n. When creating the power time-series data and the pitch time-series data, the analysis window width and the shift width dt are set based on the sampling frequency of the voice data.
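The pitch extraction described above can be sketched as follows. Two assumptions are made that the text does not specify: a naive discrete Fourier transform stands in for the FFT so that the example needs only the standard library, and a peak counts only if its magnitude reaches a relative threshold (`rel_threshold`, a hypothetical parameter) of the strongest non-DC bin.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) transform, for clarity only)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def lowest_peak_frequency(frame, sample_rate, rel_threshold=0.5):
    """Frequency (Hz) of the lowest-frequency spectral peak, or None if none is found."""
    spectrum = magnitude_spectrum(frame)
    strongest = max(spectrum[1:], default=0.0)
    if strongest == 0.0:
        return None  # silent frame: no peak to report
    floor = rel_threshold * strongest
    for k in range(1, len(spectrum) - 1):
        is_peak = spectrum[k] > spectrum[k - 1] and spectrum[k] >= spectrum[k + 1]
        if is_peak and spectrum[k] >= floor:
            return k * sample_rate / len(frame)
    return None


def pitch_time_series(samples, sample_rate, window, shift):
    """f0_n for each analysis window, sliding by `shift` samples."""
    series = []
    start = 0
    while start + window <= len(samples):
        series.append(lowest_peak_frequency(samples[start:start + window], sample_rate))
        start += shift
    return series


# Demo signal: a 100 Hz tone plus a weaker 200 Hz harmonic, sampled at 800 Hz.
tone = [
    math.sin(2 * math.pi * 100 * t / 800) + 0.6 * math.sin(2 * math.pi * 200 * t / 800)
    for t in range(128)
]
f0_series = pitch_time_series(tone, sample_rate=800, window=64, shift=32)
```

In the demo both partials exceed the threshold, and the lower one is reported as f0 in every window. A production implementation would more likely use a library FFT and a more robust pitch tracker.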

The feature extraction unit 130 includes the power time-series change pattern calculation unit 133 and the pitch time-series change pattern calculation unit 134, which use the power time-series data and the pitch time-series data of the voice data to calculate, as features of the voice data, a power time-series change pattern and a pitch time-series change pattern for each specific extraction section. Here, the voice data is divided into exhalation paragraphs, and each exhalation paragraph section is set as an extraction section. An exhalation paragraph is a unit of speech uttered in a single breath. Features are extracted per exhalation paragraph because a speaker's emotional state often changes from one exhalation paragraph to the next. A specific example of setting exhalation paragraphs is described with reference to FIG. 5. The upper row of FIG. 5 is an example waveform of voice data whose emotional state at the time of utterance is to be analyzed. The horizontal axis is time and the vertical axis is the voice amplitude. The second row shows an example in which the voice data is divided into exhalation paragraph sections, from exhalation paragraph section 1 to exhalation paragraph section n.

The feature extraction unit 130 also includes the specific individual classification unit 131 so that the frequency analysis unit 150, described later, can analyze the appearance frequency of power time-series change patterns and pitch time-series change patterns for each specific individual. In addition, the feature extraction unit 130 includes the time length measurement unit 132 so that, for example, exhalation paragraph sections can be classified as short, normal, or long for the frequency analysis described later.

The specific individual classification unit 131 classifies the voice data of multiple users into voice data for each specific individual by comparing it with the feature amounts of voice data registered in advance for each specific individual. For example, when a father, a mother, and a child are registered as users, the specific individual classification unit 131 classifies the voice data by father, mother, and child. Specifically, based on the correlation between the feature amounts of the voice data registered in advance for the father, the mother, and the child and the feature amounts of the input voice data, the specific individual classification unit 131 classifies the input voice data as the father's voice data, the mother's voice data, the child's voice data, or other. The specific individual classification unit 131 then attaches a label identifying the father, mother, or child to the classified voice data and stores it in the storage unit 2.
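The correlation-based classification described above might be sketched as follows. The text does not say which feature amounts or which correlation measure are used, so this example assumes fixed-length numeric feature vectors, the Pearson correlation coefficient, and a hypothetical cut-off `min_corr` below which the input is classified as "other".

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def classify_speaker(features, registered, min_corr=0.8):
    """Return the registered name whose features correlate best, or "other"."""
    best_name, best_corr = "other", min_corr
    for name, reference in registered.items():
        corr = pearson(features, reference)
        if corr >= best_corr:
            best_name, best_corr = name, corr
    return best_name


# Hypothetical registered feature vectors for three users.
registered = {
    "father": [1.0, 2.0, 3.0, 4.0],
    "mother": [4.0, 3.0, 2.0, 1.0],
    "child": [1.0, 3.0, 1.0, 3.0],
}
```

Here `classify_speaker([1.1, 2.0, 3.2, 3.9], registered)` selects the father's profile, while an input that correlates weakly with every registered profile falls into "other".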

The time length measurement unit 132 measures the time length of each exhalation paragraph section, which is the extraction section from which features of the voice data are extracted. Specifically, the time length measurement unit 132 measures the time length of an exhalation paragraph section by detecting silent sections.
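One way to measure section lengths by detecting silence is sketched below. It assumes the input is a power time series with one value per shift width dt, and it uses two hypothetical parameters that the text does not specify: a power level at or below which a frame counts as silent, and a minimum number of consecutive silent frames that ends a section.

```python
def exhalation_sections(powers, dt, silence_power, min_silence_frames=3):
    """Return (start_time, end_time) pairs of voiced runs separated by silence."""
    sections = []
    start = None     # index of the first voiced frame of the current section
    silent_run = 0   # consecutive silent frames seen inside the current section
    for i, p in enumerate(powers):
        if p > silence_power:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                end = i - silent_run + 1  # one past the last voiced frame
                sections.append((start * dt, end * dt))
                start, silent_run = None, 0
    if start is not None:
        # The data ended while a section was still open.
        sections.append((start * dt, (len(powers) - silent_run) * dt))
    return sections


def section_lengths(sections):
    """Time length of each section."""
    return [end - begin for begin, end in sections]
```

Short pauses below `min_silence_frames` stay inside a section, so a brief dip in power does not split one breath into two.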

The power time-series change pattern calculation unit 133 extracts, for each extraction section, a power time-series change pattern as the change pattern of the power time-series data. This is described concretely with reference to FIG. 6, which shows an example of the power time-series data of an arbitrary exhalation paragraph section created by the voice data analysis unit 120. Let θs be the angle between the time axis and the line AB connecting the position A of maximum power and the start point B of the waveform. θs of exhalation paragraph section 1 is denoted θ1s and, likewise, θs of exhalation paragraph section n is denoted θns. Let θe be the angle between the time axis and the line AC connecting the position A of maximum power and the end point C of the waveform. θe of exhalation paragraph section 1 is denoted θ1e and, likewise, θe of exhalation paragraph section n is denoted θne. The power time-series change pattern calculation unit 133 obtains θ1s through θns and θ1e through θne.

The power time-series change pattern calculation unit 133 obtains θ1s through θns and θ1e through θne in the same way for all of the accumulated voice data, and stores them in the storage unit 2 in association with the voice data divided into exhalation paragraph sections. These θs and θe are referred to as the power time-series change pattern.
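The two angles can be computed from a section's time series as sketched below. This is an illustration under assumptions: the series has one value per shift width dt, the start point B and end point C are taken as the first and last samples of the section, and the angle is computed in the raw units of the two axes (the text does not say how the axes are scaled, and the resulting angle depends on that scaling). The function name is hypothetical.

```python
import math


def change_pattern_angles(series, dt):
    """Return (rise_deg, fall_deg): angles of lines AB and AC against the time axis.

    A is the maximum of the series, B its first sample, C its last sample.
    """
    if len(series) < 2:
        raise ValueError("need at least two samples")
    peak = max(range(len(series)), key=series.__getitem__)  # index of A
    last = len(series) - 1
    # atan2 returns 0.0 when A coincides with B (or C), where the angle is undefined.
    rise = math.degrees(math.atan2(series[peak] - series[0], peak * dt))
    fall = math.degrees(math.atan2(series[peak] - series[last], (last - peak) * dt))
    return rise, fall
```

Applied to a power time series, `rise` and `fall` correspond to θs and θe. The same geometric construction works for any per-section time series.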

Returning to FIG. 2, the pitch time-series change pattern calculation unit 134 extracts, for each extraction section, a pitch time-series change pattern as the change pattern of the pitch time-series data.

FIG. 7 plots f0 obtained as in FIG. 4 as a time series for an arbitrary exhalation paragraph section. The horizontal axis is time and the vertical axis is the frequency f0 shown in FIG. 4. The time from point B to f0_n is dt × (n−1). Let θr be the angle between the time axis and the line AB connecting the position A of the highest frequency and the start point B of the waveform. θr of exhalation paragraph section 1 is denoted θ1r and, likewise, θr of exhalation paragraph section n is denoted θnr. Let θf be the angle between the time axis and the line AC connecting the position A of the highest frequency and the end point C of the waveform. θf of exhalation paragraph section 1 is denoted θ1f and, likewise, θf of exhalation paragraph section n is denoted θnf. The pitch time-series change pattern calculation unit 134 obtains θ1r through θnr and θ1f through θnf.

The pitch time-series change pattern calculation unit 134 obtains θ1r through θnr and θ1f through θnf in the same way for all of the accumulated voice data, and stores them in the storage unit 2 in association with the voice data divided into exhalation paragraph sections. These θr and θf are referred to as the pitch time-series change pattern.

Returning to FIG. 2, the frequency analysis unit 150 classifies the extracted features into a plurality of patterns and analyzes the appearance frequency of each pattern for each specific individual. Specifically, for the power time-series change pattern calculated by the power time-series change pattern calculation unit 133 and the pitch time-series change pattern calculated by the pitch time-series change pattern calculation unit 134, the frequency analysis unit 150 performs a frequency analysis of the time-series change patterns of the voice data for each specific individual. The frequency analysis unit 150 then treats a time-series change pattern whose appearance frequency is equal to or greater than a threshold as the time-series change pattern of speech uttered in a normal state, and sets the section to which that pattern belongs as a neutral section.

This is described concretely with reference to FIG. 8. The frequency analysis unit 150 performs a frequency analysis on θs calculated by the power time-series change pattern calculation unit 133 and creates a frequency analysis graph of the power time-series change pattern as shown in FIG. 8(a). The horizontal axis is the angle θs, divided into 10° bins from 0° to 90°. The vertical axis shows, in percent, the appearance frequency of the power time-series change patterns belonging to each θs bin. The frequency analysis unit 150 sorts the power time-series change patterns θs extracted by the feature extraction unit 130 into the 10° bins, counts the number of data items in each bin, and creates the graph shown in FIG. 8(a).

The frequency analysis unit 150 then sets the θs section whose appearance frequency is equal to or greater than the threshold as the neutral section, that is, the section to which the power time-series change patterns of voice data uttered in a calm (neutral) emotional state belong. In the example shown in FIG. 8(a), the frequency analysis unit 150 uses a threshold of 15% and sets the section from 20° to 50°, where the appearance frequency of θs is 15% or more, as the neutral section.
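The binning and thresholding described for FIG. 8(a) can be illustrated with a short sketch. The function name `neutral_bins` and the handling of an angle of exactly 90° (placed in the last bin) are assumptions made for this example; the 10° bin width and the 15% threshold are the example values from the description.

```python
def neutral_bins(angles_deg, bin_width=10, threshold_pct=15.0):
    """Histogram angles in [0, 90] and return the bins at or above the threshold.

    Returns (neutral, freqs): neutral is a list of (low, high) bin edges in
    degrees, freqs is the appearance frequency of every bin in percent.
    """
    n_bins = 90 // bin_width
    counts = [0] * n_bins
    for a in angles_deg:
        if not 0 <= a <= 90:
            raise ValueError(f"angle out of range: {a}")
        # Assumption: exactly 90 degrees is counted in the last bin.
        idx = min(int(a // bin_width), n_bins - 1)
        counts[idx] += 1

    total = len(angles_deg)
    freqs = [100.0 * c / total if total else 0.0 for c in counts]
    neutral = [
        (i * bin_width, (i + 1) * bin_width)
        for i, f in enumerate(freqs)
        if f >= threshold_pct
    ]
    return neutral, freqs
```

Adjacent bins returned by this sketch, such as (20, 30), (30, 40) and (40, 50), correspond to one contiguous neutral section from 20° to 50°.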

The frequency analysis unit 150 also performs a frequency analysis on θr calculated by the pitch time-series change pattern calculation unit 134 and creates a frequency analysis graph of the pitch time-series change pattern as shown in FIG. 8(b). The horizontal axis is the angle θr, divided into 10° bins from 0° to 90°. The vertical axis shows, in percent, the appearance frequency of the pitch time-series change patterns belonging to each θr bin. The frequency analysis unit 150 sets the θr section whose appearance frequency is equal to or greater than the threshold as the neutral section to which the pitch time-series change patterns of voice data uttered in a calm (neutral) emotional state belong. In the example shown in FIG. 8(b), the frequency analysis unit 150 uses a threshold of 15% and sets the section from 30° to 60°, where the appearance frequency of θr is 15% or more, as the neutral section.

The frequency analysis unit 150 performs the same frequency analysis on θe and θf. It carries out this frequency analysis for each specific individual classified by the specific individual classification unit 131. In addition, the frequency analysis unit 150 classifies the voice data by the time length measured by the time length measurement unit 132, for example into 2 seconds or less (short), 2 to 4 seconds (normal), and 4 seconds or more (long), and performs the frequency analysis separately for each class. Voice data is classified by time length because, when the time lengths of exhalation paragraph sections differ greatly, the way the features change with the emotional state at the time of utterance may also differ, so classifying by time length before analysis improves the analysis accuracy.
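Grouping sections by speaker and by time-length class before the frequency analysis might look like the sketch below. The dictionary keys `speaker` and `duration` and both function names are hypothetical, and the treatment of the boundaries (exactly 2 seconds counted as short, exactly 4 seconds counted as long) is an assumption, since the description leaves the boundaries overlapping.

```python
from collections import defaultdict


def duration_class(seconds):
    """Classify a section length using the example classes from the description."""
    if seconds <= 2.0:
        return "short"
    if seconds < 4.0:
        return "normal"
    return "long"  # assumption: exactly 4 seconds counts as long


def group_sections(sections):
    """Group section records by (speaker, duration class).

    sections: iterable of dicts with hypothetical keys "speaker" and "duration".
    """
    groups = defaultdict(list)
    for section in sections:
        key = (section["speaker"], duration_class(section["duration"]))
        groups[key].append(section)
    return dict(groups)
```

Each resulting group would then be analyzed on its own, so that every specific individual gets one neutral section per feature and per time-length class.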

Returning to FIG. 2, the first label assigning unit 160 assigns a neutral label, indicating that the emotional state at the time of utterance was calm, to the voice data of each extraction section for which the appearance frequency of the pattern analyzed for each specific individual was determined to be equal to or greater than the threshold. Specifically, in the personal adaptation processing of the emotion estimator, the first label assigning unit 160 assigns a neutral label to the voice data belonging to the neutral section set by the frequency analysis unit 150, and stores the label in the storage unit 2 in association with that voice data. Even for teacher data that carried a positive or negative label before the frequency analysis, the first label assigning unit 160 changes the label to neutral if the new frequency analysis result of the frequency analysis unit 150 places that data in the neutral section.

For voice data to which the first label assigning unit 160 did not assign a neutral label, the second label assigning unit 170 uses the emotion estimator mounted on the emotion estimation device 100 to assign either a positive or a negative label, and stores the label in the storage unit 2 in association with that voice data. When the emotion estimation device 100 is first put into use, positive or negative is determined using the initial-state emotion estimator, which was generated using voice data uttered by an unspecified number of speakers as teacher data.
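The two-stage labeling performed by the first and second label assigning units can be sketched for a single feature value as follows. This is a simplified illustration under stated assumptions: `label_section` is a hypothetical name, the neutral section is treated as the half-open interval [low, high), and `fallback_estimator` stands in for the mounted emotion estimator, whose internals the description does not specify.

```python
def label_section(theta, neutral_range, fallback_estimator):
    """Label one section as "neutral", "positive" or "negative".

    theta: feature angle in degrees for the section.
    neutral_range: (low, high) neutral section from the frequency analysis.
    fallback_estimator: hypothetical callable returning "positive" or
        "negative"; it is consulted only for non-neutral sections.
    """
    low, high = neutral_range
    if low <= theta < high:
        # First stage: frequency-based neutral label; the estimator is not used.
        return "neutral"

    # Second stage: the existing estimator decides between positive and negative.
    label = fallback_estimator(theta)
    if label not in ("positive", "negative"):
        raise ValueError(f"unexpected label from estimator: {label!r}")
    return label
```

Because the neutral decision never calls the estimator, the neutral labels do not inherit the bias of an estimator trained on an unspecified number of speakers.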

The emotion estimator adaptation processing unit 180 generates teacher data for a specific individual by adding the voice data of that individual, labeled positive, negative, or neutral by the first label assigning unit 160 and the second label assigning unit 170, to the teacher data composed of voice data uttered by an unspecified number of speakers, and builds an emotion estimator for the specific individual based on the generated teacher data. In this way, the emotion estimator generated using voice data uttered by an unspecified number of speakers as teacher data is personally adapted into an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

Specifically, the emotion estimation device 100 uses the voice data of the specific individual user, labeled positive, negative, or neutral, as teacher data, and optimizes the parameters of the formula that determines the characteristics of the emotion estimator so that the teacher data is classified into positive, negative, and neutral. FIG. 9 is a conceptual diagram that represents, in two dimensions, the discrimination thresholds for distinguishing the three emotional states of positive, negative, and neutral. The emotion estimation device 100 adds the voice data of the specific individual user to the teacher data composed of voice data uttered by an unspecified number of speakers and accumulates it day by day. The emotion estimator adaptation processing unit 180 rebuilds the emotion estimator each time the increase in the accumulated teacher data exceeds a preset amount.

Because the initial teacher data consists of the voice data of an unspecified number of speakers, the features of that voice data do not necessarily match the features of the voice data of the specific individual user. The accuracy with which the initial-state emotion estimator estimates the emotions of the specific individual user is therefore not necessarily high. However, as the voice data of the specific individual user accumulates day by day, the proportion of the teacher data made up of that user's voice data increases. The frequency analysis unit 150 sets the neutral section without using the initial emotion estimator, which was built with the voice data of an unspecified number of speakers as teacher data, so the influence of that voice data on neutral labeling gradually decreases. The emotion estimator thus changes into one adapted to the specific individual user, and the accuracy of emotion estimation improves as the emotion estimation device 100 becomes a device for that specific individual.

The emotion estimation unit 190 estimates the emotional state of the user under analysis at the time of utterance, using the emotion estimator mounted on the emotion estimation device 100. This is described concretely with reference to FIGS. 10 and 11. As shown in FIG. 10, the emotion estimation unit 190 arranges the neutral, positive, and negative labels assigned by the first label assigning unit 160 and the second label assigning unit 170, for the power analysis result and the pitch analysis result respectively, in order of the exhalation paragraph sections that serve as extraction sections. As shown in FIG. 11, let Npa, Npb, and Npc be the numbers of exhalation paragraph sections labeled neutral, positive, and negative, respectively, by the power analysis result. Likewise, let Nfa, Nfb, and Nfc be the numbers of exhalation paragraph sections labeled neutral, positive, and negative, respectively, by the pitch analysis result. The emotion estimation unit 190 applies the weighting factors Wpa, Wpb, Wpc, Wfa, Wfb, and Wfc to these counts.

Setting these weighting factors makes it possible to adjust the balance between emphasizing the power analysis result and emphasizing the pitch analysis result. Adjusting the positive, negative, and neutral weighting factors also makes it possible to adjust the pseudo-personality of the robot. For example, increasing the neutral weighting factor produces a businesslike pseudo-personality that does not pick up on slight changes in the user's emotional state. Increasing the negative weighting factor produces a considerate pseudo-personality that is sensitive to negative characteristics in the user's voice.

Based on Npa, Npb, Npc, Nfa, Nfb, and Nfc, the emotion estimation unit 190 uses Equations 1 to 3 to obtain the neutral evaluation score Eneu, the positive evaluation score Epos, and the negative evaluation score Eneg.

Eneu = Npa * Wpa + Nfa * Wfa (Equation 1)
Epos = Npb * Wpb + Nfb * Wfb (Equation 2)
Eneg = Npc * Wpc + Nfc * Wfc (Equation 3)

The emotion estimation unit 190 estimates that the emotional state with the highest evaluation score is the emotional state of the user at the time of uttering the voice data under analysis (for example, a sentence).
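Equations 1 to 3 and the selection of the highest evaluation score can be written as a small sketch. The function name `estimate_emotion` and the dictionary layout are assumptions for illustration; the description also does not say how ties are resolved, so this sketch simply returns the first label in the order neutral, positive, negative.

```python
LABELS = ("neutral", "positive", "negative")


def estimate_emotion(power_counts, pitch_counts, power_weights, pitch_weights):
    """Compute Eneu, Epos and Eneg and return (best_label, scores).

    power_counts: Npa, Npb, Npc keyed by label; pitch_counts: Nfa, Nfb, Nfc.
    power_weights: Wpa, Wpb, Wpc keyed by label; pitch_weights: Wfa, Wfb, Wfc.
    """
    scores = {
        label: power_counts[label] * power_weights[label]
        + pitch_counts[label] * pitch_weights[label]
        for label in LABELS
    }
    # Assumption: on a tie, the earliest label in LABELS wins.
    best = max(LABELS, key=lambda label: scores[label])
    return best, scores
```

Raising the negative weights in `power_weights` and `pitch_weights` makes a negative result more likely for the same section counts, which is the mechanism the description uses to tune the pseudo-personality.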

The estimated emotional state of the user is output via the input/output unit 3. For example, the input/output unit 3 outputs the estimated emotional state of the user (positive, negative, or neutral) from the speaker and displays it on the display unit. Based on a pre-programmed scenario, the input/output unit 3 can also output the estimated emotional state indirectly, as the utterance content, utterance speed, or LED color associated with that emotional state.

Next, the personal adaptation processing that rebuilds the emotion estimator mounted on the emotion estimation device 100 configured as above for a specific individual is described with reference to the flowcharts shown in FIGS. 12 to 14. It is assumed that teacher data, in which positive, negative, and neutral labels are assigned to the voice data of an unspecified number of speakers, is stored in the storage unit 2 in advance. In the present embodiment, the features of the voice data are extracted in units of exhalation paragraphs, so voice data divided into exhalation paragraph sections is used as the teacher data. It is also assumed that the emotion estimator mounted on the emotion estimation device 100 has been built using the voice data of an unspecified number of speakers as teacher data, and that a father, a mother, and a child are registered in advance as specific individuals.

The flowchart shown in FIG. 12 starts when the user activates the robot and supplies voice data to the emotion estimation device 100 mounted on the robot.

When the voice data acquisition unit 110 acquires the voice data supplied by the user (step S11), the voice data analysis unit 120 analyzes the acquired voice data (step S12). Specifically, the voice data analysis unit 120 creates the power and the pitch of the voice data as time-series data from the acquired voice data.

The specific individual classification unit 131 classifies the voice data by specific individual based on the extracted features of the voice data, and stores it in the storage unit 2 (step S13). For example, the voice data is classified into the father's voice data, the mother's voice data, and the child's voice data, and stored in the storage unit 2.

Next, the emotion estimation device 100 determines whether the increase in the stored voice data has exceeded a preset predetermined amount (step S14). If this predetermined amount is set large, the pseudo-personality of the robot equipped with the emotion estimation device 100 changes greatly each time the personal adaptation processing of the emotion estimator is performed. If it is set small, the pseudo-personality of the robot changes little by little. When the increase in the accumulated voice data has not exceeded the predetermined amount (step S14: No), the emotion estimation device 100 continues accumulating voice data. When the increase has exceeded the predetermined amount (step S14: Yes), the emotion estimation device 100 performs the personal adaptation processing of the mounted emotion estimator (step S15). The personal adaptation processing of the emotion estimator is described with reference to the flowchart shown in FIG. 13.
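The check in step S14 amounts to a counter that triggers adaptation once enough new data has accumulated. The sketch below is one possible way to express it; the class name `AdaptationTrigger`, the choice to count data items, and the reset of the counter after each adaptation are all assumptions, since the description does not define how the increase is measured.

```python
class AdaptationTrigger:
    """Track newly accumulated voice data and signal when to re-adapt."""

    def __init__(self, required_increase):
        if required_increase <= 0:
            raise ValueError("required_increase must be positive")
        self.required_increase = required_increase
        self.pending = 0  # items accumulated since the last adaptation

    def add(self, new_items):
        """Record new items; return True when adaptation should run (S14: Yes)."""
        self.pending += new_items
        if self.pending > self.required_increase:
            # Assumption: the increase is measured from the last adaptation.
            self.pending = 0
            return True
        return False
```

A larger `required_increase` means fewer but larger adaptation steps, which corresponds to the larger pseudo-personality changes mentioned in the description.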

When the emotion estimation device 100 starts the personal adaptation processing of the emotion estimator, the feature extraction unit 130 sets exhalation paragraph sections, obtained by dividing the voice data in units of exhalation paragraphs as described with reference to FIG. 5, as the sections over which the power time-series change pattern and the pitch time-series change pattern of the voice data are calculated. The feature extraction unit 130 performs the voice data feature extraction processing for all of the set exhalation paragraph sections (step S21). The feature extraction processing is described with reference to the flowchart shown in FIG. 14.

When the emotion estimation device 100 starts the voice data feature extraction processing, it first uses the time length measurement unit 132 to measure the time length of the voice data divided into exhalation paragraph sections, and stores the voice data and the measured time length in the storage unit 2 in association with each other (step S31).

Next, as described with reference to FIG. 6, the power time-series change pattern calculation unit 133 calculates a power time-series change pattern for each extraction section (exhalation paragraph section) as the change pattern of the power time-series data (step S32). The calculated power time-series change pattern is then stored in the storage unit 2 in association with the time length measured by the time length measurement unit 132.

Next, as described with reference to FIG. 7, the pitch time-series change pattern calculation unit 134 calculates a pitch time-series change pattern for each extraction section (exhalation paragraph section) as the change pattern of the pitch time-series data (step S33). The calculated pitch time-series change pattern is then stored in the storage unit 2 in association with the time length measured by the time length measurement unit 132.

Returning to the flowchart of FIG. 13, when the feature extraction processing has been completed for all of the voice data (power time-series data and pitch time-series data) stored in the storage unit 2, the frequency analysis unit 150 classifies the extracted features into a plurality of patterns and analyzes the appearance frequency of each pattern for each specific individual (step S22). Specifically, as described with reference to FIG. 8, the frequency analysis unit 150 performs a frequency analysis of the power time-series change pattern and of the pitch time-series change pattern for each specific individual. In doing so, the frequency analysis unit 150 classifies the time lengths of the extraction sections measured by the time length measurement unit 132 into three classes, for example 2 seconds or less, 2 to 4 seconds, and 4 seconds or more, and performs the frequency analysis for each class. The frequency analysis unit 150 then sets each section whose appearance frequency is equal to or greater than the threshold as a neutral section to which the time-series change patterns of voice data uttered in a normal state belong.

The first label assigning unit 160 assigns a neutral label to voice data whose features (power time-series change pattern and pitch time-series change pattern) fall within a section that the frequency analysis unit 150 has set as a neutral section (step S23). The first label assigning unit 160 then stores the voice data with the neutral label in the storage unit 2 as teacher data (step S24).

Next, the second label assigning unit 170 uses the mounted emotion estimator to classify as positive or negative the voice data to which the first label assigning unit 160 did not assign a neutral label (step S25). The second label assigning unit 170 then assigns a positive or negative label to that voice data and stores it in the storage unit 2 as teacher data (step S26).

When positive, negative, or neutral labels have been assigned to all of the voice data (power time-series data and pitch time-series data) stored in the storage unit 2, the emotion estimator adaptation processing unit 180 performs the personal adaptation processing of the mounted emotion estimator (step S27). Specifically, the emotion estimator adaptation processing unit 180 rebuilds the mounted emotion estimator as an emotion estimator for the specific individual, one that classifies the voice data labeled positive, negative, or neutral by the first label assigning unit 160 and the second label assigning unit 170 into positive, negative, and neutral in accordance with the assigned labels. When three people are registered (father, mother, and child), three emotion estimators are built: one for the father, one for the mother, and one for the child. In this way, the emotion estimator adaptation processing unit 180 personally adapts the initial-state emotion estimator, generated using voice data uttered by an unspecified number of speakers as teacher data, into an emotion estimator dedicated to estimating the emotional state of a specific individual. When the personal adaptation processing of the emotion estimator (step S27) is complete, the processing of step S15 in FIG. 12 ends.

The emotion estimation device 100 continues accumulating voice data and performs the personal adaptation processing of the mounted emotion estimator each time the increase in the accumulated voice data exceeds the predetermined amount. This concludes the description of the personal adaptation processing of the emotion estimator performed by the emotion estimation device 100.

Next, the emotion estimation processing, in which the emotion estimation device 100 estimates the user's emotional state at the time of utterance from the user's voice data, is described with reference to the flowchart shown in FIG. 15. It is assumed that the personal adaptation processing of the emotion estimation device 100 described using the flowcharts shown in FIGS. 12 to 13 is performed periodically. The flowchart shown in FIG. 15 starts when the user activates the robot equipped with the emotion estimation device 100 and supplies the user's voice data to the robot.

The processing from acquiring the voice data supplied by the user (step S51) to classifying the voice data by specific individual (step S53) is the same as steps S11 to S13 described for the personal adaptation processing.

When the emotion estimation device 100 has classified the voice data (power time-series data and pitch time-series data) created by the voice data analysis unit 120 by specific individual, the feature extraction unit 130 extracts the voice data (power time-series data and pitch time-series data) of one specific individual (step S54). Having extracted the voice data of the specific individual, the feature extraction unit 130 sets exhalation paragraph sections in that voice data. Then, as described with reference to FIGS. 6 and 7, the feature extraction unit 130 extracts the power time-series change pattern and the pitch time-series change pattern as the features of the voice data for all of the exhalation paragraph sections (step S55). The details of step S55 are the same as described with reference to FIG. 14.

Next, the emotion estimation device 100 uses the mounted emotion estimator to classify the voice data of each extraction section (exhalation paragraph section) as positive, negative, or neutral, assigns the corresponding label, and stores it in the storage unit 2 (step S56).

Next, the emotion estimation device 100 determines whether the analysis (emotion estimation) has been completed for all extraction sections (exhalation paragraph sections) (step S57). If the analysis of all extraction sections has not been completed (step S57: No), the emotion estimation device 100 extracts another extraction section and continues the analysis (step S58). If the analysis has been completed for all extraction sections (step S57: Yes), the emotion estimation unit 190 uses Equations 1 to 3, as described with reference to FIGS. 10 and 11, to estimate the emotional state of the specific individual at the time of utterance for the whole of the voice data uttered by that individual (for example, a sentence) (step S59).

Next, the emotion estimation device 100 determines whether emotion estimation has been completed for the acquired voice data of all persons (step S60). If emotion estimation has not been completed for all persons (step S60: No), the emotion estimation device 100 takes another person as the specific individual, extracts the voice data (power time-series data and pitch time-series data) of the new specific individual, and continues the emotion estimation processing (step S61). If emotion estimation has been completed for the voice data of all persons (step S60: Yes), the emotion estimation processing of the emotion estimation device 100 ends.

As described above, the emotion estimation device 100 extracts the features of the voice data, assigns a neutral label to the voice data based on the appearance frequency of the extracted features, and adds the labeled data to the teacher data. As a result, as the voice data of the specific individual user accumulates day by day, the proportion of the teacher data made up of that user's voice data increases. The emotion estimator rebuilt from teacher data with this higher proportion of the specific individual user's voice data becomes adapted as an emotion estimator for that specific individual.

The labeling method of the related art assigns labels using an emotion estimator constructed solely from the voice data of an unspecified number of speakers as teacher data. Therefore, even if voice data labeled in this way is added to the teacher data and the emotion estimator is reconstructed, the result is not necessarily optimized as an emotion estimator adapted to the specific individual user. Because the emotion estimation device 100 according to the present embodiment assigns the neutral label based on the appearance frequency of feature patterns in the specific individual user's voice data accumulated day by day, the estimator adapts more readily as an emotion estimator for the specific individual. The emotion estimation device 100 can thereby improve its estimation accuracy as an emotion estimation device for a specific individual.

In addition, the feature extraction unit 130 extracts a power time-series change pattern and a pitch time-series change pattern as features of the voice data, and the frequency analysis unit 150 performs frequency analysis of the voice data features based on these time-series change patterns. The time-series change pattern of voice data tends to vary with the speaker's emotional state. Through frequency analysis of the time-series change patterns, the emotion estimation device 100 identifies the speaker's neutral (average) emotional state. The emotion estimation device 100 can thereby estimate the emotional state of the specific individual user at the time of utterance more accurately.

The first label assigning unit 160 assigns a neutral label, indicating that the emotional state at the time of utterance is a calm state, to voice data for which the appearance frequency of a specific pattern among the power time-series change patterns is determined to be equal to or greater than a threshold, or to voice data for which the appearance frequency of a specific pattern among the pitch time-series change patterns is determined to be equal to or greater than a threshold. In this way, the emotion estimation device 100 labels the emotional state of the specific individual user at the time of utterance based on statistical data on the feature patterns of that user's voice data, and adds the labeled data to the teacher data. As the voice data of the specific individual user increases, the emotion estimation device 100 can thereby adapt its on-board emotion estimator as an emotion estimator for the specific individual.
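
The neutral-labeling rule of the first label assigning unit 160 can be illustrated with a minimal Python sketch. It assumes that each change pattern is an angle between 0° and 90° grouped into 10° bins, and that the threshold is a relative frequency; the function names and the threshold value are hypothetical and are not taken from the embodiment.

```python
from collections import Counter

def bin_index(angle_deg, bin_width=10):
    """Map a change-pattern angle in [0, 90] degrees to a bin index."""
    n_bins = int(90 // bin_width)
    return min(int(angle_deg // bin_width), n_bins - 1)

def frequent_bins(angles, threshold, bin_width=10):
    """Return the bins whose relative appearance frequency is >= threshold."""
    counts = Counter(bin_index(a, bin_width) for a in angles)
    total = len(angles)
    return {b for b, c in counts.items() if c / total >= threshold}

def is_neutral(power_angle, pitch_angle, power_bins, pitch_bins, bin_width=10):
    """Neutral if the power pattern OR the pitch pattern is in a frequent bin."""
    return (bin_index(power_angle, bin_width) in power_bins
            or bin_index(pitch_angle, bin_width) in pitch_bins)
```

Here `frequent_bins` would be computed once per user from the accumulated voice data, and `is_neutral` would then be applied to each new extraction section.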

As the extraction section for change patterns of voice data features, a section such as a word, an accent phrase, or a sentence can be used instead of the exhalation paragraph. If the specific individual user is not in the habit of uttering complete sentences, shortening the extraction section, for example to word units, may allow an emotion estimation device 100 better suited to that user to be constructed. Estimation accuracy can be improved by choosing an extraction section that matches the utterance characteristics of the specific individual user.

The emotion estimator adaptation processing unit 180 performs the adaptation processing of the emotion estimator each time the increase in the specific individual's voice data accumulated day by day exceeds a preset amount. This reduces the processing load of the emotion estimation device 100 compared with a method that performs adaptation processing every time voice data is acquired.
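
The batching behavior of the emotion estimator adaptation processing unit 180 can be sketched as a simple counter. This is only an illustration under the assumption that the amount of voice data is counted in samples; the class name and its attributes are hypothetical.

```python
class AdaptationTrigger:
    """Signal re-adaptation only after enough new voice data has accumulated."""

    def __init__(self, preset_amount):
        self.preset_amount = preset_amount  # hypothetical unit: number of samples
        self.pending = 0                    # samples added since the last adaptation

    def add(self, n_new_samples):
        """Record new samples; return True when adaptation should run."""
        self.pending += n_new_samples
        if self.pending > self.preset_amount:
            self.pending = 0
            return True
        return False
```

With a preset amount of 100, two batches of 60 samples trigger one adaptation instead of two.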

The emotion estimation unit 190 obtains the neutral evaluation score Eneu, the positive evaluation score Epos, and the negative evaluation score Eneg using Equations 1 to 3, which include weighting coefficients. Setting these weighting coefficients makes it possible to adjust the balance between emphasis on the power analysis result and emphasis on the pitch analysis result. In addition, the pseudo-personality of the robot can be adjusted by adjusting the weighting coefficients for positive, negative, and neutral.

(Modification 1)
In the description of Embodiment 1, the frequency analysis unit 150 sets the neutral section with reference to a preset threshold, but the frequency analysis method need not be limited to this. For example, as shown in FIG. 16(a), sections whose appearance frequency is equal to or greater than the average appearance frequency may be set as the neutral section. Alternatively, as shown in FIG. 16(b), the variance σ of the appearance frequency distribution may be obtained and the range of the variance σ set as the neutral section. The more the range treated as the neutral section is widened, from σ to 1.5σ to 2σ, the more the pseudo-personality of the robot equipped with the emotion estimation device 100 becomes a businesslike pseudo-personality that does not pick up slight changes in the user's emotional state. Alternatively, as shown in FIG. 16(c), the median of the appearance frequency distribution may be obtained and a range of predetermined width X from the median set as the neutral section.
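
The three alternatives of Modification 1 can be compared in a short Python sketch. Several points are assumptions rather than statements of the embodiment: σ is treated as a standard deviation centred on the mean angle, the median range is taken as the median plus or minus X, and all function names are hypothetical.

```python
import statistics

def neutral_bins_by_mean(counts):
    """FIG. 16(a) style: bins whose count is >= the average count."""
    mean_count = sum(counts) / len(counts)
    return [i for i, c in enumerate(counts) if c >= mean_count]

def neutral_range_by_sigma(angles, k=1.0):
    """FIG. 16(b) style: mean +/- k*sigma (k = 1, 1.5, 2 widens the range)."""
    mu = statistics.mean(angles)
    sigma = statistics.pstdev(angles)
    return (mu - k * sigma, mu + k * sigma)

def neutral_range_by_median(angles, width_x):
    """FIG. 16(c) style: median +/- a predetermined width X."""
    med = statistics.median(angles)
    return (med - width_x, med + width_x)
```

Increasing `k` enlarges the neutral range, which corresponds to the more businesslike pseudo-personality described in the text.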

(Modification 2)
The processing of the emotion estimation unit 190 described with reference to FIG. 10 in Embodiment 1 obtains evaluation scores using Equations 1 to 3 and estimates whether the sentence as a whole is positive, negative, or neutral. However, the method of emotion estimation processing need not be limited to this. For example, each extraction section (exhalation paragraph section) may be classified as positive, negative, or neutral, the numbers of sections falling under each of positive, negative, and neutral compared, and the result decided by majority vote. Specifically, only an extraction section for which both the power analysis result and the pitch analysis result are neutral is treated as neutral. Then, for extraction sections that were not given the neutral label, the on-board emotion estimator is used to discriminate positive from negative. Finally, among the number of sections determined to be positive, the number determined to be negative, and the number determined to be neutral, the emotional state (positive, negative, or neutral) with the largest number of sections may be estimated as the emotional state of the speaker.

When only sections in which both the power analysis result and the pitch analysis result are neutral are treated as neutral sections, the range determined to be neutral becomes narrower. In this case, the pseudo-personality of the robot equipped with the emotion estimation device 100 can be made an attentive pseudo-personality that sensitively picks up changes in the user's emotional state.

As another method, a section in which either the power analysis result or the pitch analysis result is neutral may be treated as a neutral section. In this case, the range determined to be neutral becomes wider, so the pseudo-personality of the robot can be made a pseudo-personality with a strongly businesslike image that does not pick up slight changes in the user's emotional state.
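
The majority-vote estimation of Modification 2, including the choice between requiring both analyses or either analysis to be neutral, might look as follows in Python. The positive/negative estimator is passed in as a callable, and the function names, the `mode` parameter, and the tuple layout of a section are hypothetical.

```python
from collections import Counter

def is_neutral_section(power_neutral, pitch_neutral, mode="and"):
    """'and' narrows the neutral range; 'or' widens it."""
    if mode == "and":
        return power_neutral and pitch_neutral
    return power_neutral or pitch_neutral

def estimate_by_majority(sections, pos_neg_estimator, mode="and"):
    """sections: iterable of (power_neutral, pitch_neutral, features)."""
    labels = []
    for power_neutral, pitch_neutral, features in sections:
        if is_neutral_section(power_neutral, pitch_neutral, mode):
            labels.append("neutral")
        else:
            # Non-neutral sections go to the on-board estimator.
            labels.append(pos_neg_estimator(features))
    # Ties resolve to the label seen first; the text does not specify a rule.
    return Counter(labels).most_common(1)[0][0]
```

Running the same sections with `mode="and"` and `mode="or"` shows how the wider neutral definition pulls the overall result toward neutral.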

In Embodiment 1, the analysis uses two features, the power time-series change pattern and the pitch time-series change pattern, but the analysis may be performed using only one of them. Which analysis to omit may be chosen in consideration of emotion estimation accuracy, processing speed, manufacturing cost, and the like. The pseudo-personality of the robot can also be adjusted depending on which analysis item is omitted.

(Modification 3)
In the description of Embodiment 1, the teacher data is voice data divided into exhalation paragraph sections, and the first label assigning unit 160 and the second label assigning unit 170 assign positive, negative, and neutral labels to the voice data divided into exhalation paragraph sections. Modification 3 describes a case in which teacher data labeled positive, negative, or neutral in units of sentences is used.

The functional operations of the voice data acquisition unit 110, the voice data analysis unit 120, the feature extraction unit 130, and the frequency analysis unit 150 are the same as in Embodiment 1. The voice data analysis unit 120 creates power time-series data and pitch time-series data from the acquired voice data. The feature extraction unit 130 sets exhalation paragraphs as feature extraction sections in the power time-series data and the pitch time-series data, and extracts the power time-series change pattern and the pitch time-series change pattern. The frequency analysis unit 150 then performs frequency analysis of the power and pitch time-series change patterns for each specific individual.

The first label assigning unit 160 determines whether the sentence to be analyzed was uttered in a neutral emotional state. Specifically, as described with reference to FIG. 10 in Embodiment 1, each extraction section is compared with the appearance frequency distribution illustrated in FIG. 8 to determine whether it falls within the neutral section. Then, for example, if the number of sections falling within the neutral section is 50% or more of the number of extraction sections in the whole sentence, the sentence is determined to have been uttered in a neutral emotional state. A neutral label is assigned to each sentence determined to be neutral, and the sentence is stored in the storage unit 2.
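
The sentence-level rule (neutral when at least 50% of the extraction sections are neutral) can be sketched as below. The fallback to a positive/negative estimator follows the role of the second label assigning unit 170; the function name, the callable estimator, and the handling of an empty sentence are assumptions.

```python
def label_sentence(section_is_neutral, pos_neg_estimator, sentence_features,
                   min_ratio=0.5):
    """Label a sentence from per-section neutral flags (list of bool)."""
    if section_is_neutral:
        ratio = sum(section_is_neutral) / len(section_is_neutral)
        if ratio >= min_ratio:
            return "neutral"
    # Not neutral (or no sections): defer to the positive/negative estimator.
    return pos_neg_estimator(sentence_features)
```

A sentence with two sections, one of them neutral, sits exactly at the 50% boundary and is labeled neutral.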

The second label assigning unit 170 classifies sentence-unit voice data that has not been given the neutral label as either positive or negative, using an emotion estimator generated with sentence-unit voice data uttered by an unspecified number of speakers as teacher data, and assigns a positive or negative label in units of sentences. Specifically, for sentence-unit voice data to which the first label assigning unit 160 did not assign the neutral label, the second label assigning unit 170 uses the emotion estimation unit 190 to assign either a positive or a negative label, and stores the label in the storage unit 2 in association with the corresponding voice data.

The emotion estimator adaptation processing unit 180 generates teacher data for the specific individual by adding the voice data of each specific individual, to which the first label assigning unit 160 and the second label assigning unit 170 have assigned a positive, negative, or neutral label in units of sentences, to teacher data composed, in units of sentences, of voice data uttered by an unspecified number of speakers. Then, by reconstructing the emotion estimator as an emotion estimator for the specific individual based on the teacher data generated for each specific individual, the emotion estimator generated with voice data uttered by an unspecified number of speakers as teacher data is personally adapted as an emotion estimator that estimates the emotional state of the specific individual in units of sentences.

Next, the emotion estimation unit 190 estimates the speaker's emotional state in units of sentences, using an emotion estimator that estimates the speaker's emotion from voice data in units of sentences. Accordingly, the emotion estimation unit 190 does not perform the processing described using FIG. 10 and Equations 1 to 3.

When voice data with similar content is supplied day after day to the emotion estimation device 100 mounted on the robot, performing the labeling processing in units of sentences, which are longer than exhalation paragraphs, may improve the accuracy of estimating the speaker's emotion. It may also reduce the processing time required for emotion estimation.

In the above description, the power time-series change pattern and the pitch time-series change pattern were used as examples of features of the voice data, but the features of the voice data need not be limited to these. For example, the peak or average value of the voice power in the extraction section, the peak frequency f0 in the lowest frequency region when the voice data of the extraction section is subjected to an FFT, or the average frequency of the spectrum obtained by applying an FFT to the voice data of the extraction section may be used. As time-series change patterns, the power change pattern for each accent phrase, the pitch change pattern for each accent phrase, and the like can also be used. Performing frequency analysis based on a variety of voice data features in this way enables more accurate emotion estimation.

In the above description, the voice data analysis unit 120 creates the power time-series data and the pitch time-series data, and the feature extraction unit 130 sets exhalation paragraphs in the power time-series data and the pitch time-series data and extracts the power time-series change pattern and the pitch time-series change pattern for each exhalation paragraph. As a variation, the voice data may be divided into exhalation paragraphs based on silent sections, power time-series data and pitch time-series data created from the divided voice data, and the power and pitch time-series change patterns extracted from the power and pitch time-series data created for each exhalation paragraph.
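
Dividing voice data into exhalation paragraphs at silent sections can be sketched on a frame-wise power sequence. The text does not say how silence is detected, so the power threshold, the minimum number of silent frames, and the function name are all assumptions made for illustration.

```python
def split_by_silence(power, silence_threshold, min_silence_frames):
    """Return (start, end) frame pairs, end exclusive, of voiced segments.

    A segment ends once at least `min_silence_frames` consecutive frames
    have power at or below `silence_threshold`.
    """
    segments = []
    start = None      # start frame of the current voiced segment
    silent_run = 0    # consecutive silent frames inside the segment
    for i, p in enumerate(power):
        if p > silence_threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Close the last segment, dropping trailing silent frames.
        segments.append((start, len(power) - silent_run))
    return segments
```

Short pauses below `min_silence_frames` stay inside a segment, so a brief dip in power does not split one exhalation paragraph in two.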

The emotion estimator mounted on the emotion estimation device 100 may be a single estimator that classifies into the three values positive, negative, and neutral, as described with reference to FIG. 9, or two estimators may be used. For example, a first estimator may classify the voice data into neutral and other, and a second estimator may classify the voice data classified as other into positive and negative.

In the emotion estimation processing, it is also possible not to use the on-board emotion estimator for the neutral classification. Specifically, the first label assigning unit 160 determines whether the feature of the voice data of the user to be analyzed, extracted by the feature extraction unit 130, belongs to the neutral section obtained by the frequency analysis unit 150, and if it does, assigns the neutral label to the corresponding voice data. In the example shown in FIG. 8(a), the first label assigning unit 160 assigns the neutral label to the voice data when the power time-series change pattern θs calculated by the power time-series change pattern calculation unit 133 is in the range of 20° to 50°. The same applies to the pitch time-series change pattern. The on-board emotion estimator may be used for the classification into positive and negative.
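
The two-stage decision described here (a range test for neutral, then the on-board estimator for positive versus negative) can be sketched as follows. The 20° to 50° range is the FIG. 8(a) example; treating both boundaries as inclusive, the function name, and the callable estimator are assumptions.

```python
NEUTRAL_RANGE_DEG = (20.0, 50.0)  # example range from FIG. 8(a)

def label_section(theta_s, pos_neg_estimator, features,
                  neutral_range=NEUTRAL_RANGE_DEG):
    """Label one extraction section from its power change pattern theta_s."""
    lo, hi = neutral_range
    if lo <= theta_s <= hi:
        # Neutral is decided from the frequency analysis alone.
        return "neutral"
    # Everything else is passed to the positive/negative estimator.
    return pos_neg_estimator(features)
```

The same function could be applied to the pitch time-series change pattern with its own neutral range.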

In the description of Embodiment 1, a time length measuring unit 132 is provided in the feature extraction unit 130, and the frequency analysis unit 150 performs the frequency analysis of the time-series change patterns separately for each predetermined time length. However, the frequency analysis may be performed without classifying by time length. Whether to omit the classification by time length can be decided by weighing the setting of the extraction section, estimation accuracy, processing time, manufacturing cost, and the like.

In the description of the frequency analysis using FIG. 8 in Embodiment 1, nine bins were provided by dividing the range every 10°, but the bins can be set arbitrarily. For example, three bins may be used: 0° to 30°, 30° to 60°, and 60° to 90°.
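
The effect of the bin width can be seen in a small histogram sketch. It assumes change-pattern angles between 0° and 90° and places an angle of exactly 90° in the last bin; the function name is hypothetical.

```python
def angle_histogram(angles, bin_width):
    """Count change-pattern angles (0 to 90 degrees) per bin of `bin_width`."""
    n_bins = int(90 // bin_width)
    counts = [0] * n_bins
    for a in angles:
        # Clamp so that exactly 90 degrees falls into the last bin.
        counts[min(int(a // bin_width), n_bins - 1)] += 1
    return counts
```

Calling it with `bin_width=10` gives the nine bins of FIG. 8, and `bin_width=30` gives the three coarser bins mentioned above.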

In the above description, the emotion estimator adaptation processing unit 180 reconstructs the emotion estimator each time the teacher data increases by a predetermined amount. However, the timing of reconstruction may be set arbitrarily. For example, it may be set to 2:00 a.m. every night, 2:00 a.m. every Sunday, or 2:00 a.m. on the first day of every month. Setting the reconstruction timing to a specific time in this way also lets the user enjoy seeing the pseudo-personality of the robot equipped with the emotion estimation device 100 adapt at each set timing.

As described in Embodiment 1, the emotion estimation unit 190 calculates evaluation values using Equations 1 to 3 and therefore knows the neutral evaluation value Eneu, the positive evaluation value Epos, and the negative evaluation value Eneg. Accordingly, based on the ratio among Eneu, Epos, and Eneg, the emotion estimation device 100 can also perform emotion estimation that takes into account the degrees of multiple emotional states, such as "neutral, but slightly negative."
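
One way to turn the three evaluation values into a graded description is sketched below. The text only says that the ratio among Eneu, Epos, and Eneg is used; the specific rule (report the runner-up when it reaches a given fraction of the top value), the `secondary_ratio` parameter, and the function name are assumptions.

```python
def describe_emotion(e_neu, e_pos, e_neg, secondary_ratio=0.6):
    """Return e.g. 'neutral, but slightly negative' from evaluation values."""
    scores = {"neutral": e_neu, "positive": e_pos, "negative": e_neg}
    ranked = sorted(scores, key=scores.get, reverse=True)
    primary, secondary = ranked[0], ranked[1]
    if scores[primary] > 0 and scores[secondary] / scores[primary] >= secondary_ratio:
        return f"{primary}, but slightly {secondary}"
    return primary
```

With evaluation values of 0.5, 0.1, and 0.4 for neutral, positive, and negative, the negative value is 80% of the neutral one, so the graded description is returned.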

In the description of Embodiment 1, the speaker's emotional state is classified into three types, positive, negative, and neutral, but the emotion classification method need not be limited to this. Any method that distinguishes neutral (normal) from other emotional states may be used. For example, the state may be classified into five types: joy, anger, sorrow, pleasure, and normal. In this case, the initial emotion estimator constructed with voice data of an unspecified number of speakers as teacher data is likewise an emotion estimator capable of classifying voice data into the five types joy, anger, sorrow, pleasure, and normal. For voice data to which the first label assigning unit 160 did not assign the neutral label, the on-board emotion estimator classifies the speaker's emotional state into one of the four types joy, anger, sorrow, and pleasure based on feature quantities such as voice intensity, voice pitch, and phoneme duration, and assigns the corresponding label.

If the amount of initial teacher data uttered by an unspecified number of speakers is large, a large amount of the specific individual's voice data must be input before the estimator adapts to that individual. If, on the other hand, the amount of initial teacher data is too small, the initial emotion estimation accuracy decreases. The amount of initial teacher data uttered by an unspecified number of speakers is therefore preferably set with both considerations in mind.

The invention can of course be provided as an emotion estimation device 100 equipped in advance with the configuration for realizing the functions according to the present invention. In addition, by applying a program, an existing personal computer, information terminal device, or the like can be made to function as the emotion estimation device 100 according to the present invention. That is, by applying a program for realizing each functional configuration of the emotion estimation device 100 exemplified in the above embodiment so that it can be executed by a CPU or the like that controls an existing personal computer, information terminal device, or the like, that equipment can be made to function as the emotion estimation device 100 according to the present invention. The emotion estimation method according to the present invention can be carried out using the emotion estimation device 100.

The method of applying such a program is arbitrary. For example, the program can be applied by storing it on a computer-readable recording medium (a CD-ROM (Compact Disc Read-Only Memory), DVD (Digital Versatile Disc), MO (Magneto-Optical disc), or the like), or by storing it in storage on a network such as the Internet and having it downloaded.

Preferred embodiments of the present invention have been described above, but the present invention is not limited to these specific embodiments, and the present invention includes the inventions described in the claims and their equivalents. The inventions described in the claims as originally filed in the present application are appended below.

(Appendix 1)
A personal adaptation method for an emotion estimator, in which an emotion estimator that estimates a speaker's emotional state at the time of utterance and that was generated with voice data uttered by an unspecified number of speakers as teacher data is personally adapted as an emotion estimator that estimates the emotional state of a specific individual at the time of utterance, the method comprising:
an acquisition step of acquiring voice data uttered by the specific individual;
a feature extraction step of extracting features of the voice data;
a frequency analysis step of classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
a first label assigning step of assigning a neutral label, indicating that the emotional state at the time of utterance is a calm state, to the voice data of the specific extraction section for which the appearance frequency of the pattern is determined to be equal to or greater than a threshold; and
a personal adaptation step of generating teacher data for the specific individual by adding the voice data to which the neutral label was assigned in the first label assigning step to the teacher data composed of the voice data uttered by the unspecified number of speakers, and constructing an emotion estimator for the specific individual based on the generated teacher data for the specific individual, thereby personally adapting the emotion estimator generated with the voice data uttered by the unspecified number of speakers as teacher data as an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

(Appendix 2)
The personal adaptation method for an emotion estimator according to Appendix 1, wherein
the emotion estimator generated with the voice data uttered by the unspecified number of speakers as teacher data is an emotion estimator that estimates the speaker's emotional state at the time of utterance to be positive, negative, or neutral,
the method further comprises a second label assigning step of classifying voice data to which the neutral label was not assigned in the first label assigning step as either positive or negative, using the emotion estimator generated with the voice data uttered by the unspecified number of speakers as teacher data, and assigning a positive or negative label, and
in the personal adaptation step, teacher data for the specific individual is generated by adding the voice data of the specific individual, to which a positive, negative, or neutral label has been assigned by the first label assigning step and the second label assigning step, to the teacher data composed of the voice data uttered by the unspecified number of speakers, and an emotion estimator for the specific individual is constructed based on the generated teacher data for the specific individual, thereby personally adapting the emotion estimator generated with the voice data uttered by the unspecified number of speakers as teacher data as an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

(Appendix 3)
The personal adaptation method for an emotion estimator according to Appendix 1 or 2, wherein
the feature of the voice data extracted in the feature extraction step is a change pattern of power time-series data of the voice data or a change pattern of pitch time-series data of the voice data.

（付記４）
前記頻度解析ステップで分類し、出現頻度を解析するパターンは、前記パワー時系列データの変化パターン、又は前記ピッチ時系列データの変化パターンである、
ことを特徴とする付記３に記載の感情推定器の個人適応方法。 (Appendix 4)
The personal adaptation method of the emotion estimator according to Appendix 3, wherein the patterns that are classified, and whose appearance frequencies are analyzed, in the frequency analysis step are change patterns of the power time-series data or change patterns of the pitch time-series data.
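The classification of per-section change patterns and the analysis of their appearance frequency can be sketched as below. The notes above do not state how a change pattern is derived, so the rule used here (comparing the mean of the second half of a section with the mean of the first half) and the names `change_pattern`, `pattern_frequencies`, and `tol` are assumptions for illustration only. The same code applies to either a power or a pitch time series.

```python
from collections import Counter
from typing import Dict, Sequence


def change_pattern(series: Sequence[float], tol: float = 0.05) -> str:
    """Classify one extraction section as 'rising', 'falling', or 'flat' (assumed rule)."""
    if len(series) < 2:
        return "flat"
    mid = len(series) // 2
    first_mean = sum(series[:mid]) / mid
    second_mean = sum(series[mid:]) / (len(series) - mid)
    diff = second_mean - first_mean
    if diff > tol:
        return "rising"
    if diff < -tol:
        return "falling"
    return "flat"


def pattern_frequencies(sections: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Return the relative appearance frequency of each pattern over all sections."""
    patterns = [change_pattern(section) for section in sections]
    if not patterns:
        return {}
    counts = Counter(patterns)
    return {pattern: count / len(patterns) for pattern, count in counts.items()}
```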

（付記５）
前記第１ラベル付与ステップでは、前記パワー時系列データの変化パターンにおいて特定のパターンの出現頻度が閾値以上と判別された当該音声データに対して、又は前記ピッチ時系列データの変化パターンにおいて特定のパターンの出現頻度が閾値以上と判別された当該音声データに対して、発話時の感情状態が平静状態であることを示すニュートラルのラベルを付与する、
ことを特徴とする付記４に記載の感情推定器の個人適応方法。 (Appendix 5)
The personal adaptation method of the emotion estimator according to Appendix 4, wherein, in the first label assigning step, a neutral label indicating that the emotional state at the time of utterance is a calm state is assigned to voice data for which the appearance frequency of a specific pattern among the change patterns of the power time-series data is determined to be equal to or higher than a threshold, or to voice data for which the appearance frequency of a specific pattern among the change patterns of the pitch time-series data is determined to be equal to or higher than a threshold.

（付記６）
前記第１ラベル付与ステップでは、前記パワー時系列データの変化パターンにおいて特定のパターンの出現頻度が所定値以上と判別され、且つ前記ピッチ時系列データの変化パターンにおいて特定のパターンの出現頻度が所定値以上と判別された当該音声データに対して、発話時の感情状態が平静状態であることを示すニュートラルのラベルを付与する、
ことを特徴とする付記４に記載の感情推定器の個人適応方法。 (Appendix 6)
The personal adaptation method of the emotion estimator according to Appendix 4, wherein, in the first label assigning step, a neutral label indicating that the emotional state at the time of utterance is a calm state is assigned to voice data for which the appearance frequency of a specific pattern among the change patterns of the power time-series data is determined to be equal to or higher than a predetermined value and the appearance frequency of a specific pattern among the change patterns of the pitch time-series data is determined to be equal to or higher than a predetermined value.
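Appendices 5 and 6 differ only in how the power and pitch conditions are combined: Appendix 5 assigns the neutral label when either frequency reaches its threshold, Appendix 6 only when both do. A minimal sketch of that difference follows; the function name, the parameter names, and the `require_both` flag are hypothetical.

```python
def is_neutral_section(
    power_frequency: float,
    pitch_frequency: float,
    power_threshold: float,
    pitch_threshold: float,
    require_both: bool = False,
) -> bool:
    """Decide whether a section receives the neutral label.

    require_both=False follows the 'or' rule of Appendix 5;
    require_both=True follows the 'and' rule of Appendix 6.
    """
    power_hit = power_frequency >= power_threshold
    pitch_hit = pitch_frequency >= pitch_threshold
    if require_both:
        return power_hit and pitch_hit
    return power_hit or pitch_hit
```

The 'and' rule labels fewer sections as neutral, so it trades the amount of added teacher data for a lower risk of mislabeling.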

（付記７）
前記特定の抽出区間は、少なくとも、呼気段落、単語、アクセント句、文、の何れかの区間である、
ことを特徴とする付記１から６の何れか一に記載の感情推定器の個人適応方法。 (Appendix 7)
The personal adaptation method of the emotion estimator according to any one of Appendices 1 to 6, wherein the specific extraction section is at least one of a breath group, a word, an accent phrase, and a sentence.

（付記８）
前記個人適応ステップでは、前記取得ステップで取得された前記特定個人の発話者が発話した音声データの増加量が予め設定した量を超えるごとに、前記感情推定器を前記特定個人用の感情推定器として個人適応させる、
ことを特徴とする付記１から７の何れか一に記載の感情推定器の個人適応方法。 (Appendix 8)
The personal adaptation method of the emotion estimator according to any one of Appendices 1 to 7, wherein, in the individual adaptation step, the emotion estimator is personally adapted as the emotion estimator for the specific individual each time the increase in the amount of voice data uttered by the specific individual and acquired in the acquisition step exceeds a preset amount.

（付記９）
前記個人適応ステップでは、予め設定された時刻になると、前記感情推定器を前記特定個人用の感情推定器として個人適応させる、
ことを特徴とする付記１から７の何れか一に記載の感情推定器の個人適応方法。 (Appendix 9)
The personal adaptation method of the emotion estimator according to any one of Appendices 1 to 7, wherein, in the individual adaptation step, the emotion estimator is personally adapted as the emotion estimator for the specific individual when a preset time arrives.
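The two adaptation triggers of Appendices 8 and 9 can be sketched as simple predicates. The names below are hypothetical, the unit of "amount" is left open (seconds of audio, number of sections, and so on), and running the time-based trigger at most once per calendar day is an assumption made for this sketch.

```python
from datetime import date, datetime, time
from typing import Optional


def volume_trigger(
    total_amount: float,
    amount_at_last_adaptation: float,
    preset_amount: float,
) -> bool:
    """Appendix 8: adapt when the newly acquired voice data exceeds the preset amount."""
    return (total_amount - amount_at_last_adaptation) > preset_amount


def time_trigger(now: datetime, scheduled: time, last_run: Optional[date]) -> bool:
    """Appendix 9: adapt once the preset time of day has arrived.

    last_run is the date of the previous adaptation, or None if none has run yet.
    """
    return now.time() >= scheduled and last_run != now.date()
```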

（付記１０）
前記取得ステップでは、前記特定個人が発話した音声データを取得し、
前記頻度解析ステップでは、前記パターンの出現頻度を前記特定個人ごとに解析し、
前記第１ラベル付与ステップでは、前記特定個人ごとに解析した前記パターンの出現頻度が閾値以上と判別された前記特定の抽出区間の音声データに対して、発話時の感情状態が平静状態であることを示すニュートラルのラベルを付与し、
前記個人適応ステップでは、前記第１ラベル付与ステップでニュートラルのラベルが付与された前記特定個人ごとに分類された音声データを、前記不特定多数の発話者が発話した音声データで構成された教師データに追加した前記特定個人ごとの教師データを生成し、生成した前記特定個人ごとの教師データに基づいて前記特定個人ごとに感情推定器を構築することにより、前記不特定多数の発話者が発話した音声データを教師データとして生成された感情推定器を、前記特定個人ごとの発話時の感情状態を推定する感情推定器として個人適応させる、
ことを特徴とする付記１から９の何れか一に記載の感情推定器の個人適応方法。 (Appendix 10)
The personal adaptation method of the emotion estimator according to any one of Appendices 1 to 9, wherein:
in the acquisition step, the voice data uttered by the specific individual is acquired;
in the frequency analysis step, the appearance frequency of the patterns is analyzed for each specific individual;
in the first label assigning step, a neutral label indicating that the emotional state at the time of utterance is a calm state is assigned to the voice data of the specific extraction section for which the appearance frequency of the pattern, analyzed for each specific individual, is determined to be equal to or higher than the threshold; and
in the individual adaptation step, teacher data for each specific individual is generated by adding the voice data, classified by specific individual, to which the neutral label was assigned in the first label assigning step to the teacher data composed of the voice data uttered by the unspecified number of speakers, and an emotion estimator is constructed for each specific individual based on the generated teacher data for each specific individual, whereby the emotion estimator generated using, as teacher data, the voice data uttered by the unspecified number of speakers is personally adapted as an emotion estimator that estimates the emotional state of each specific individual at the time of utterance.
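The per-individual teacher data of Appendix 10 amounts to grouping labeled sections by speaker before merging each group with the general teacher data. A minimal sketch, assuming each labeled section arrives as a hypothetical `(speaker_id, section, label)` tuple:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Example = Tuple[object, str]  # (section, label)


def per_speaker_teacher_data(
    general_data: Sequence[Example],
    labeled_sections: Iterable[Tuple[str, object, str]],
) -> Dict[str, List[Example]]:
    """Build one teacher data set per speaker: general data plus that speaker's sections."""
    personal: Dict[str, List[Example]] = defaultdict(list)
    for speaker_id, section, label in labeled_sections:
        personal[speaker_id].append((section, label))
    # Each speaker gets an independent copy of the general data.
    return {sid: list(general_data) + extra for sid, extra in personal.items()}
```

One estimator would then be trained per dictionary entry, so that data from one individual never enters another individual's estimator.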

（付記１１）
特定個人が発話した音声データを取得する取得手段と、
前記音声データの特徴を抽出する特徴抽出手段と、
抽出された前記特徴を、特定の抽出区間ごとに複数のパターンに分類し、該パターンごとの出現頻度を解析する頻度解析手段と、
前記パターンの出現頻度が閾値以上と判別された前記特定の抽出区間の音声データに対して、発話時の感情状態が平静状態であることを示すニュートラルのラベルを付与する第１ラベル付与手段と、
前記第１ラベル付与手段によりニュートラルのラベルが付与された音声データを、前記不特定多数の発話者が発話した音声データで構成された教師データに追加した前記特定個人用の教師データを生成し、生成した前記特定個人用の教師データに基づいて前記特定個人用に感情推定器を構築することにより、前記不特定多数の発話者が発話した音声データを教師データとして生成された感情推定器を、前記特定個人の発話時の感情状態を推定する感情推定器として個人適応させる個人適応手段と、
を備える感情推定装置。 (Appendix 11)
An emotion estimation device comprising:
acquisition means for acquiring voice data uttered by a specific individual;
feature extraction means for extracting features of the voice data;
frequency analysis means for classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
first label assigning means for assigning a neutral label, indicating that the emotional state at the time of utterance is a calm state, to the voice data of the specific extraction section for which the appearance frequency of the pattern is determined to be equal to or higher than a threshold; and
personal adaptation means for generating teacher data for the specific individual by adding the voice data to which the neutral label was assigned by the first label assigning means to teacher data composed of voice data uttered by an unspecified number of speakers, and constructing an emotion estimator for the specific individual based on the generated teacher data for the specific individual, thereby personally adapting an emotion estimator generated using, as teacher data, the voice data uttered by the unspecified number of speakers as an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

（付記１２）
発話者が発話した音声データを取得する取得手段と、
前記音声データの特徴を抽出する特徴抽出手段と、
抽出された前記特徴を、特定の抽出区間ごとに複数のパターンに分類し、該パターンごとの出現頻度を解析する頻度解析手段と、
前記パターンの出現頻度が閾値以上と判別された前記特定の抽出区間の音声データに対して、発話時の感情状態が平静状態であることを示すニュートラルのラベルを付与する第１ラベル付与手段と、
ニュートラルのラベルを付与した前記特定の抽出区間の数に重み係数を掛けて得られた評価値と、ニュートラルのラベルを付与しなかった特定の抽出区間の数に重み係数を掛けて得られた評価値と、を比較し、ニュートラルのラベルを付与した特定の抽出区間の評価値がニュートラルのラベルを付与しなかった特定の抽出区間の評価値よりも高い評価値であった場合、発話者の発話時の感情状態をニュートラルと判別する感情推定手段と、
を備える感情推定装置。 (Appendix 12)
An emotion estimation device comprising:
acquisition means for acquiring voice data uttered by a speaker;
feature extraction means for extracting features of the voice data;
frequency analysis means for classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
first label assigning means for assigning a neutral label, indicating that the emotional state at the time of utterance is a calm state, to the voice data of the specific extraction section for which the appearance frequency of the pattern is determined to be equal to or higher than a threshold; and
emotion estimation means for comparing an evaluation value obtained by multiplying the number of specific extraction sections to which the neutral label was assigned by a weighting coefficient with an evaluation value obtained by multiplying the number of specific extraction sections to which the neutral label was not assigned by a weighting coefficient, and determining that the emotional state of the speaker at the time of utterance is neutral when the evaluation value of the sections with the neutral label is higher than the evaluation value of the sections without the neutral label.
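The weighted comparison performed by the emotion estimation means of Appendix 12 can be written in a few lines. The function and parameter names are hypothetical, and the default weights of 1.0 are an assumption; the note does not state the values of the weighting coefficients.

```python
def utterance_is_neutral(
    neutral_sections: int,
    other_sections: int,
    neutral_weight: float = 1.0,
    other_weight: float = 1.0,
) -> bool:
    """Return True when the weighted count of neutral sections is strictly higher."""
    neutral_score = neutral_sections * neutral_weight
    other_score = other_sections * other_weight
    return neutral_score > other_score
```

With equal weights this reduces to a majority vote over sections; a tie is not judged neutral because the note requires the neutral evaluation value to be higher.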

（付記１３）
コンピュータを
特定個人が発話した音声データを取得する取得手段、
前記音声データの特徴を抽出する特徴抽出手段、
抽出された前記特徴を、特定の抽出区間ごとに複数のパターンに分類し、該パターンごとの出現頻度を解析する頻度解析手段、
前記パターンの出現頻度が閾値以上と判別された前記特定の抽出区間の音声データに対して、発話時の感情状態が平静状態であることを示すニュートラルのラベルを付与する第１ラベル付与手段、
前記第１ラベル付与手段によりニュートラルのラベルが付与された音声データを、前記不特定多数の発話者が発話した音声データで構成された教師データに追加した前記特定個人用の教師データを生成し、生成した前記特定個人用の教師データに基づいて前記特定個人用に感情推定器を構築することにより、前記不特定多数の発話者が発話した音声データを教師データとして生成された感情推定器を、前記特定個人の発話時の感情状態を推定する感情推定器として個人適応させる個人適応手段、
として機能させるためのプログラム。 (Appendix 13)
A program for causing a computer to function as:
acquisition means for acquiring voice data uttered by a specific individual;
feature extraction means for extracting features of the voice data;
frequency analysis means for classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
first label assigning means for assigning a neutral label, indicating that the emotional state at the time of utterance is a calm state, to the voice data of the specific extraction section for which the appearance frequency of the pattern is determined to be equal to or higher than a threshold; and
personal adaptation means for generating teacher data for the specific individual by adding the voice data to which the neutral label was assigned by the first label assigning means to teacher data composed of voice data uttered by an unspecified number of speakers, and constructing an emotion estimator for the specific individual based on the generated teacher data for the specific individual, thereby personally adapting an emotion estimator generated using, as teacher data, the voice data uttered by the unspecified number of speakers as an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

１…制御部、２…記憶部、３…入出力部、４…バス、１００…感情推定装置、１１０…音声データ取得部、１２０…音声データ解析部、１３０…特徴抽出部、１３１…特定個人分類部、１３２…時間長測定部、１３３…パワー時系列変化パターン算出部、１３４…ピッチ時系列変化パターン算出部、１５０…頻度解析部、１６０…第１ラベル付与部、１７０…第２ラベル付与部、１８０…感情推定器適応処理部、１９０…感情推定部 DESCRIPTION OF SYMBOLS 1: control unit, 2: storage unit, 3: input/output unit, 4: bus, 100: emotion estimation device, 110: voice data acquisition unit, 120: voice data analysis unit, 130: feature extraction unit, 131: specific individual classification unit, 132: time length measurement unit, 133: power time-series change pattern calculation unit, 134: pitch time-series change pattern calculation unit, 150: frequency analysis unit, 160: first label assigning unit, 170: second label assigning unit, 180: emotion estimator adaptation processing unit, 190: emotion estimation unit

Claims

1. A personal adaptation method of an emotion estimator, in which an emotion estimator that is generated using, as teacher data, voice data uttered by an unspecified number of speakers and that estimates the emotional state of a speaker at the time of utterance is personally adapted as an emotion estimator that estimates the emotional state of a specific individual at the time of utterance, the method comprising:
an acquisition step of acquiring voice data uttered by the specific individual;
a feature extraction step of extracting features of the voice data;
a frequency analysis step of classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
a first label assigning step of assigning a neutral label, indicating that the emotional state at the time of utterance is a calm state, to the voice data of the specific extraction section for which the appearance frequency of the pattern is determined to be equal to or higher than a threshold; and
an individual adaptation step of generating teacher data for the specific individual by adding the voice data to which the neutral label was assigned in the first label assigning step to teacher data composed of the voice data uttered by the unspecified number of speakers, and constructing an emotion estimator for the specific individual based on the generated teacher data for the specific individual, thereby personally adapting the emotion estimator generated using, as teacher data, the voice data uttered by the unspecified number of speakers as an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

2. The personal adaptation method of the emotion estimator according to claim 1, wherein:
the emotion estimator generated using, as teacher data, the voice data uttered by the unspecified number of speakers is an emotion estimator that estimates the emotional state of a speaker at the time of utterance as positive, negative, or neutral;
the method further comprises a second label assigning step of classifying the voice data to which the neutral label was not assigned in the first label assigning step as either positive or negative voice data, using the emotion estimator generated using, as teacher data, the voice data uttered by the unspecified number of speakers, and assigning a positive or negative label to that voice data; and
in the individual adaptation step, teacher data for the specific individual is generated by adding the voice data of the specific individual, to which a positive, negative, or neutral label has been assigned by the first label assigning step and the second label assigning step, to the teacher data composed of the voice data uttered by the unspecified number of speakers, and an emotion estimator for the specific individual is constructed based on the generated teacher data for the specific individual, whereby the emotion estimator generated using, as teacher data, the voice data uttered by the unspecified number of speakers is personally adapted as an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

3. The personal adaptation method of the emotion estimator according to claim 1 or 2, wherein the feature of the voice data extracted in the feature extraction step is a change pattern of power time-series data of the voice data or a change pattern of pitch time-series data of the voice data.

4. The personal adaptation method of the emotion estimator according to claim 3, wherein the patterns that are classified, and whose appearance frequencies are analyzed, in the frequency analysis step are change patterns of the power time-series data or change patterns of the pitch time-series data.

5. The personal adaptation method of the emotion estimator according to claim 4, wherein, in the first label assigning step, a neutral label indicating that the emotional state at the time of utterance is a calm state is assigned to voice data for which the appearance frequency of a specific pattern among the change patterns of the power time-series data is determined to be equal to or higher than a threshold, or to voice data for which the appearance frequency of a specific pattern among the change patterns of the pitch time-series data is determined to be equal to or higher than a threshold.

6. The personal adaptation method of the emotion estimator according to claim 4, wherein, in the first label assigning step, a neutral label indicating that the emotional state at the time of utterance is a calm state is assigned to voice data for which the appearance frequency of a specific pattern among the change patterns of the power time-series data is determined to be equal to or higher than a predetermined value and the appearance frequency of a specific pattern among the change patterns of the pitch time-series data is determined to be equal to or higher than a predetermined value.

7. The personal adaptation method of the emotion estimator according to any one of claims 1 to 6, wherein the specific extraction section is at least one of a breath group, a word, an accent phrase, and a sentence.

8. The personal adaptation method of the emotion estimator according to any one of claims 1 to 7, wherein, in the individual adaptation step, the emotion estimator is personally adapted as the emotion estimator for the specific individual each time the increase in the amount of voice data uttered by the specific individual and acquired in the acquisition step exceeds a preset amount.

9. The personal adaptation method of the emotion estimator according to any one of claims 1 to 7, wherein, in the individual adaptation step, the emotion estimator is personally adapted as the emotion estimator for the specific individual when a preset time arrives.

10. The personal adaptation method of the emotion estimator according to any one of claims 1 to 9, wherein:
in the acquisition step, the voice data uttered by the specific individual is acquired;
in the frequency analysis step, the appearance frequency of the patterns is analyzed for each specific individual;
in the first label assigning step, a neutral label indicating that the emotional state at the time of utterance is a calm state is assigned to the voice data of the specific extraction section for which the appearance frequency of the pattern, analyzed for each specific individual, is determined to be equal to or higher than the threshold; and
in the individual adaptation step, teacher data for each specific individual is generated by adding the voice data, classified by specific individual, to which the neutral label was assigned in the first label assigning step to the teacher data composed of the voice data uttered by the unspecified number of speakers, and an emotion estimator is constructed for each specific individual based on the generated teacher data for each specific individual, whereby the emotion estimator generated using, as teacher data, the voice data uttered by the unspecified number of speakers is personally adapted as an emotion estimator that estimates the emotional state of each specific individual at the time of utterance.

11. An emotion estimation device comprising:
acquisition means for acquiring voice data uttered by a specific individual;
feature extraction means for extracting features of the voice data;
frequency analysis means for classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
first label assigning means for assigning a neutral label, indicating that the emotional state at the time of utterance is a calm state, to the voice data of the specific extraction section for which the appearance frequency of the pattern is determined to be equal to or higher than a threshold; and
personal adaptation means for generating teacher data for the specific individual by adding the voice data to which the neutral label was assigned by the first label assigning means to teacher data composed of voice data uttered by an unspecified number of speakers, and constructing an emotion estimator for the specific individual based on the generated teacher data for the specific individual, thereby personally adapting an emotion estimator generated using, as teacher data, the voice data uttered by the unspecified number of speakers as an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.

12. An emotion estimation device comprising:
acquisition means for acquiring voice data uttered by a speaker;
feature extraction means for extracting features of the voice data;
frequency analysis means for classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
first label assigning means for assigning a neutral label, indicating that the emotional state at the time of utterance is a calm state, to the voice data of the specific extraction section for which the appearance frequency of the pattern is determined to be equal to or higher than a threshold; and
emotion estimation means for comparing an evaluation value obtained by multiplying the number of specific extraction sections to which the neutral label was assigned by a weighting coefficient with an evaluation value obtained by multiplying the number of specific extraction sections to which the neutral label was not assigned by a weighting coefficient, and determining that the emotional state of the speaker at the time of utterance is neutral when the evaluation value of the sections with the neutral label is higher than the evaluation value of the sections without the neutral label.

13. A program for causing a computer to function as:
acquisition means for acquiring voice data uttered by a specific individual;
feature extraction means for extracting features of the voice data;
frequency analysis means for classifying the extracted features into a plurality of patterns for each specific extraction section and analyzing the appearance frequency of each pattern;
first label assigning means for assigning a neutral label, indicating that the emotional state at the time of utterance is a calm state, to the voice data of the specific extraction section for which the appearance frequency of the pattern is determined to be equal to or higher than a threshold; and
personal adaptation means for generating teacher data for the specific individual by adding the voice data to which the neutral label was assigned by the first label assigning means to teacher data composed of voice data uttered by an unspecified number of speakers, and constructing an emotion estimator for the specific individual based on the generated teacher data for the specific individual, thereby personally adapting an emotion estimator generated using, as teacher data, the voice data uttered by the unspecified number of speakers as an emotion estimator that estimates the emotional state of the specific individual at the time of utterance.