JP2017111760A

JP2017111760A - Emotion estimation device creation method, emotion estimation device creation device, emotion estimation method, emotion estimation device and program

Info

Publication number: JP2017111760A
Application number: JP2015247885A
Authority: JP
Inventors: Koichi Nakagome (中込 浩一); Katsuhiko Sato (佐藤 勝彦); Takashi Yamatani (山谷 崇史)
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2015-12-18
Filing date: 2015-12-18
Publication date: 2017-06-22
Anticipated expiration: 2035-12-18
Also published as: JP6720520B2

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of estimating a speaker's emotion from voice data.
SOLUTION: An emotion estimation device creation device 100 comprises: an analysis zone setting part 120 that sets an analysis zone in which a feature amount of voice data serving as a source of teacher data is analyzed; an accent type determination part 130 that determines, based on change patterns classified into a plurality of classes, the change pattern of the feature amount of the voice data included in the analysis zone; and an emotion estimation device creation part 150 that creates, for each change pattern of the feature amount, an emotion estimation device for estimating the emotion of a speaker at the time of utterance, using the voice data classified by change pattern of the feature amount as the teacher data.
SELECTED DRAWING: Figure 2

Description

The present invention relates to an emotion estimator generation method, an emotion estimator generation device, an emotion estimation method, an emotion estimation device, and a program.

Techniques are being developed for estimating a speaker's emotion using an emotion estimation device generated by machine learning, with a group of voice data labeled with emotions serving as teacher data. For example, Patent Literature 1 discloses a technique that obtains the amounts of change in voice intensity, voice tempo, and voice intonation, and estimates the speaker's emotion based on the obtained amounts of change.

Patent Literature 1: JP 2002-91482 A

In general, a person speaking in an excited state tends to speak faster and in a higher-pitched voice than in normal speech. A person speaking in a discouraged state tends to speak more slowly and in a lower voice than in normal speech. Thus, the speaker's emotion at the time of utterance correlates with the feature amounts of the voice. Patent Literature 1 discloses a technique for estimating the speaker's emotion by analyzing such changes in the feature amounts of voice data.

However, when the feature amounts of speech uttered in a normal emotional state are compared with those of speech uttered in an angry state, the tendency of the change may differ between short words and long words. For example, the feature amounts of short, easy-to-utter words often change greatly with the emotional state at the time of utterance. In contrast, the feature amounts of long, hard-to-utter words, such as tongue twisters, may change only slightly with the emotional state. The technique disclosed in Patent Literature 1 estimates the speaker's emotion by treating words whose feature amounts change little and words whose feature amounts change greatly in the same way, and therefore has the problem that estimation accuracy is difficult to improve.

The present invention has been made in view of these circumstances, and an object thereof is to provide an emotion estimator generation method, an emotion estimator generation device, an emotion estimation method, an emotion estimation device, and a program capable of improving the accuracy of estimating a speaker's emotion from voice data.

In order to achieve the above object, an emotion estimator generation method according to a first aspect of the present invention includes:
an analysis section setting step of setting an analysis section in which a feature amount of voice data serving as a source of teacher data is analyzed;
a change pattern determination step of determining, based on change patterns classified into a plurality of classes, the change pattern of the feature amount of the voice data included in the analysis section; and
an emotion estimator generation step of generating, for each change pattern of the feature amount, an emotion estimator that estimates the emotion of a speaker at the time of utterance, using the voice data classified by change pattern of the feature amount as teacher data.

An emotion estimator generation device according to a second aspect of the present invention includes:
analysis section setting means for setting an analysis section in which a feature amount of voice data serving as a source of teacher data is analyzed;
change pattern determination means for determining, based on change patterns classified into a plurality of classes, the change pattern of the feature amount of the voice data included in the analysis section; and
emotion estimator generation means for generating, for each change pattern of the feature amount, an emotion estimator that estimates the emotion of a speaker at the time of utterance, using the voice data classified by change pattern of the feature amount as teacher data.

A program according to a third aspect of the present invention causes a computer to function as:
analysis section setting means for setting an analysis section in which a feature amount of voice data serving as a source of teacher data is analyzed;
change pattern determination means for determining, based on change patterns classified into a plurality of classes, the change pattern of the feature amount of the voice data included in the analysis section; and
emotion estimator generation means for generating, for each change pattern of the feature amount, an emotion estimator that estimates the emotion of a speaker at the time of utterance, using the voice data classified by change pattern of the feature amount as teacher data.

An emotion estimation method according to a fourth aspect of the present invention includes:
an analysis section setting step of setting an analysis section in which a feature amount of voice data to be analyzed is analyzed;
a change pattern determination step of determining, based on change patterns classified into a plurality of classes, the change pattern of the feature amount of the voice data included in the analysis section; and
an emotion estimation step of estimating the emotion of the speaker at the time the voice in the analysis section was uttered, using, for each change pattern of the feature amount, an emotion estimator generated from teacher data having the same change pattern of the feature amount.

An emotion estimation device according to a fifth aspect of the present invention includes:
analysis section setting means for setting an analysis section in which a feature amount of voice data to be analyzed is analyzed;
change pattern determination means for determining, based on change patterns classified into a plurality of classes, the change pattern of the feature amount of the voice data included in the analysis section; and
emotion estimation means for estimating the emotion of the speaker at the time the voice in the analysis section was uttered, using, for each change pattern of the feature amount, an emotion estimator generated from teacher data having the same change pattern of the feature amount.

According to the present invention, the accuracy of estimating a speaker's emotion from voice data can be improved.

FIG. 1 is a block diagram showing the physical configuration of the emotion estimator generation device according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing the functional configuration of the emotion estimator generation device according to Embodiment 1 of the present invention.
FIG. 3 is a diagram for explaining morphemes.
FIG. 4 is a diagram for explaining mora sections.
FIG. 5 is a diagram for explaining the feature amount analysis method.
FIG. 6 is a diagram for explaining the feature amount analysis method.
FIG. 7 is a diagram for explaining the classification into classes.
FIG. 8 is a diagram for explaining a conceptual image of the discrimination thresholds of the generated emotion estimation device.
FIG. 9 is a block diagram showing the functional configuration of the emotion estimation device.
FIG. 10 is a flowchart for explaining the emotion estimator generation process.
FIG. 11 is a flowchart for explaining the emotion estimation process.
FIG. 12 is a diagram for explaining the feature amount analysis section according to Modification 1.
FIG. 13 is a diagram for explaining the feature amount analysis method according to Modification 1.
FIG. 14 is a diagram for explaining feature amount analysis based on voice intensity according to Modification 2.
FIG. 15 is a diagram for explaining the technique for estimating the degrees of a plurality of emotions according to Modification 5.

Hereinafter, an emotion estimator generation method, an emotion estimator generation device, an emotion estimation method, an emotion estimation device, and a program according to embodiments of the present invention will be described with reference to the drawings. In the drawings, identical or corresponding parts are denoted by the same reference numerals.

(Embodiment 1)
This embodiment first describes an emotion estimator generation device that generates emotion estimators for estimating a speaker's emotion from voice data, and then describes an emotion estimation device that estimates the speaker's emotion at the time of utterance. In this embodiment, the emotion estimation device estimates the speaker's emotion to be one of seven basic emotional states: sad (sadness), bored (boredom), angry (anger), surprised (surprise), discouraged (discouragement), disgusted (disgust), or happy (joy).
In the following embodiments, the change pattern of the feature amount of the voice data is referred to as an accent type.

As shown in FIG. 1, the emotion estimator generation device 100 according to Embodiment 1 physically includes a control unit 1, a storage unit 2, an input/output unit 3, and a bus 4.

The control unit 1 includes a ROM (Read Only Memory), a RAM (Random Access Memory), and a CPU (Central Processing Unit). The ROM stores the emotion estimator generation program according to this embodiment, as well as an initial program for performing various initial settings, hardware checks, program loading, and the like. The RAM functions as a work area that temporarily stores the software programs executed by the CPU and the data needed to execute them. The CPU is a central processing unit that performs various processes and computations by executing software programs.

The storage unit 2 includes a nonvolatile memory such as a hard disk drive or a flash memory. The storage unit 2 stores the voice data used as teacher data, among other data.

The input/output unit 3 includes a voice input device, a CD (Compact Disc) drive, and a USB (Universal Serial Bus) interface for acquiring the voice data used as teacher data. The input/output unit 3 acquires the voice data used as teacher data. The input/output unit 3 also outputs the generated emotion estimator to an external device, either as a program or as parameters that determine the characteristics of the emotion estimator.

The bus 4 connects the control unit 1, the storage unit 2, and the input/output unit 3.

As shown in FIG. 2, the emotion estimator generation device 100 functionally includes a voice data acquisition unit 110, an analysis section setting unit 120, an accent type determination unit 130, a feature amount extraction unit 140, and an emotion estimator generation unit 150. The analysis section setting unit 120 includes a morphological analysis unit 121 and an accent phrase extraction unit 122. The accent type determination unit 130 includes a mora section extraction unit 131 and an accent type extraction unit 132.

The voice data acquisition unit 110 acquires, via the input/output unit 3, the voice data used as teacher data for generating the emotion estimators. The teacher data consists of, for example, speech uttered in seven emotional states: sadness, boredom, anger, surprise, discouragement, disgust, and joy. The teacher data is made up of voice data containing a sufficiently wide variety of words and phrases. The more speakers who utter the teacher data and the more kinds of accent phrases it contains, the better. An accent phrase is a unit for segmenting voice data, formed by combining a noun with a particle or a verb with an auxiliary verb. For example, voice data containing 1000 or more kinds of accent phrases, uttered by a large group of about 500 people in the seven emotional states, is prepared as teacher data.

The analysis section setting unit 120 sets the analysis section, which is the unit in which the features of the voice data used as teacher data are analyzed. For this purpose, the analysis section setting unit 120 includes the morphological analysis unit 121 and the accent phrase extraction unit 122.

The morphological analysis unit 121 divides the acquired voice data into morphemes. A morpheme is the smallest unit that carries meaning in a language. For example, as shown in FIG. 3, the utterance 「坊主が屏風に上手に坊主の絵を描いた」 ("The monk skillfully drew a picture of a monk on the folding screen") is divided into twelve morphemes: 「坊主」, 「が」, 「屏風」, 「に」, 「上手」, 「に」, 「坊主」, 「の」, 「絵」, 「を」, 「描い」, and 「た」.

The accent phrase extraction unit 122 extracts accent phrases from the acquired voice data. An accent phrase is a section in which a noun or verb obtained by the morpheme division is combined with the particle or auxiliary verb that follows it. In the above example, the accent phrases are 「坊主が」, 「屏風に」, 「上手に」, 「坊主の」, 「絵を」, and 「描いた」. This embodiment describes the case where the features of the voice data are analyzed in units of accent phrases. Voice data is analyzed in units of accent phrases because the speaker's emotional state often changes from one accent phrase to the next.
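The grouping rule described above (attach each particle or auxiliary verb to the noun or verb before it) can be illustrated on text tokens. This is only a minimal sketch: the source operates on voice data, and the part-of-speech tags, the `TAIL_POS` set, and the function name `extract_accent_phrases` are hypothetical names introduced here, not taken from the source.

```python
# Hypothetical part-of-speech tags for morphemes that attach to the preceding word.
TAIL_POS = {"particle", "auxiliary"}


def extract_accent_phrases(morphemes):
    """Merge each particle/auxiliary verb into the phrase started by the word before it.

    morphemes: list of (surface, pos) pairs, assumed to come from a morphological analyzer.
    """
    phrases = []
    for surface, pos in morphemes:
        if pos in TAIL_POS and phrases:
            phrases[-1] += surface      # extend the current accent phrase
        else:
            phrases.append(surface)     # a noun or verb starts a new accent phrase
    return phrases


# The twelve morphemes of the example sentence, with assumed tags.
morphemes = [
    ("坊主", "noun"), ("が", "particle"),
    ("屏風", "noun"), ("に", "particle"),
    ("上手", "noun"), ("に", "particle"),
    ("坊主", "noun"), ("の", "particle"),
    ("絵", "noun"), ("を", "particle"),
    ("描い", "verb"), ("た", "auxiliary"),
]
accent_phrases = extract_accent_phrases(morphemes)
print(accent_phrases)
```

With the tags assumed above, the twelve morphemes collapse into the six accent phrases listed in the text.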

The accent type determination unit 130 determines the accent type of each accent phrase. An accent type is a pattern of "H" and "L" labels obtained by assigning a label to each mora section, that is, each section in which one of the syllables making up the accent phrase is uttered: "H" when the feature amount of the voice is larger than the average feature amount, and "L" when it is smaller. To determine the accent type, the accent type determination unit 130 includes the mora section extraction unit 131 and the accent type extraction unit 132.

As shown in FIG. 4, the mora section extraction unit 131 extracts mora sections from the voice data of the accent phrase section to be analyzed. A mora section is a section in which one syllable is uttered. Taking the accent phrase 「坊主が」 as an example, each of the syllables 「ボ」 (bo), 「ウ」 (u), 「ズ」 (zu), and 「ガ」 (ga) corresponds to one mora section.

The accent type extraction unit 132 determines whether each mora section corresponds to "H" or "L" and extracts the accent type of the accent phrase. The accent type can be extracted using feature amounts such as voice intensity, voice pitch, and phoneme utterance duration. Here, an extraction method that focuses on voice pitch is described with reference to FIGS. 5 and 6.

As shown in FIG. 5, the accent type extraction unit 132 sets time windows of a predetermined length that further subdivide the mora section. Window 1 is set in one mora section, and the voice data in that window is transformed by FFT (Fast Fourier Transform). Next, the voice data in window 2, obtained by shifting window 1 by a predetermined time dt, is transformed by FFT. The voice data in each subsequent window up to window n is transformed in the same way. The window length and the shift width dt are set, for example, so that ten or more time windows fit within the mora section, because the calculation accuracy drops if there are too few time windows.
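The window placement can be sketched as follows. This is a minimal illustration that assumes the mora section, the window length, and the shift dt are all expressed as sample counts (the source speaks of times); the function names `choose_hop` and `window_bounds` and the numeric values are hypothetical.

```python
def choose_hop(n_samples, window_len, min_windows=10):
    """Largest integer shift (in samples) that still yields at least min_windows windows."""
    if min_windows < 2 or n_samples <= window_len:
        return 1
    return max(1, (n_samples - window_len) // (min_windows - 1))


def window_bounds(n_samples, window_len, hop):
    """(start, end) sample indices of every full window inside the mora section."""
    return [(s, s + window_len) for s in range(0, n_samples - window_len + 1, hop)]


# Assumed example: a 150 ms mora section at 8 kHz (1200 samples), 25 ms windows (200 samples).
hop = choose_hop(1200, 200)
windows = window_bounds(1200, 200, hop)
print(hop, len(windows), windows[0], windows[-1])
```

If the mora section is too short to hold ten windows even with a one-sample shift, `window_bounds` simply returns fewer windows; how such sections should be handled is not stated in the source.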

FIG. 6 shows an example of the spectral distribution of the voice data in each window obtained by the FFT described above. The horizontal axis is frequency and the vertical axis is spectral intensity. The peak frequency in the lowest frequency region of this spectrum is denoted f0. This f0 represents the speaker-specific fundamental frequency obtained from the voice data in that window. The f0 obtained from window 1 is denoted f0_1, and the f0 obtained from window 2 is denoted f0_2. Likewise, the f0 obtained from window n is denoted f0_n. The accent type extraction unit 132 then calculates the average of f0_1 through f0_n and takes it as the average fundamental frequency 1_f0 of the first mora section.
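Picking f0 as the lowest-frequency spectral peak of one window might look like the sketch below. It uses a naive DFT from the standard library in place of an FFT (the resulting spectrum is the same, only slower), and it treats a bin as a peak only if it is a local maximum above a relative threshold. That threshold (`rel_threshold`), the function names, and the synthetic test signal are assumptions made for illustration; the source does not say how the peak is detected.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitudes of DFT bins 0..N/2 (naive DFT standing in for the FFT)."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        mags.append(abs(acc))
    return mags


def lowest_peak_frequency(samples, sample_rate, rel_threshold=0.1):
    """Frequency in Hz of the lowest local maximum above rel_threshold * max, or None."""
    mags = magnitude_spectrum(samples)
    floor = rel_threshold * max(mags)
    for k in range(1, len(mags) - 1):
        if mags[k] >= floor and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            return k * sample_rate / len(samples)
    return None


# Synthetic window: a 200 Hz fundamental plus a weaker 400 Hz harmonic, 200 samples at 8 kHz.
sample_rate = 8000
window = [
    math.sin(2 * math.pi * 200 * i / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 400 * i / sample_rate)
    for i in range(200)
]
f0 = lowest_peak_frequency(window, sample_rate)
print(f0)
```

On real speech the spectrum is noisier than this synthetic signal, so a practical implementation would likely need windowing and a more robust pitch estimator.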

The accent type extraction unit 132 performs the same calculation for every mora section included in the accent phrase. The average fundamental frequency m_f0 of the m-th mora section can be calculated using Formula 1.

$$ m\_f0 = \frac{1}{n} \sum_{i=1}^{n} f0\_i \qquad \text{(Formula 1)} $$

Next, the accent type extraction unit 132 obtains the average fundamental frequency m_th of the accent phrase section using Formula 2.

$$ m\_th = \frac{\max(1\_f0, \ldots, n\_f0) - \min(1\_f0, \ldots, n\_f0)}{2} \qquad \text{(Formula 2)} $$

Next, the accent type extraction unit 132 compares the average fundamental frequency m_f0 of each mora section with the average fundamental frequency m_th of the accent phrase section, and assigns "H" to the mora section if m_f0 ≥ m_th and "L" if m_f0 < m_th. By assigning "H" or "L" to each mora section of the accent phrase in this way, the accent type extraction unit 132 extracts an accent type made up of a combination of H and L.
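The per-mora averaging, the phrase-level reference value, and the H/L labelling can be sketched together. Formula 2 is transcribed here exactly as printed (half the spread between the largest and smallest mora averages). The threshold is kept as a separate argument of `accent_type` so that a different reference value can be tried; the value 165.0 used below is the midpoint (max + min) / 2, which is an assumption for illustration and is not stated in the source. All function names and numbers are hypothetical.

```python
def mora_mean_f0(window_f0s):
    """Formula 1: average of the per-window f0 values of one mora section."""
    return sum(window_f0s) / len(window_f0s)


def phrase_threshold(mora_means):
    """Formula 2 as printed: (max - min) / 2 over the mora averages of the accent phrase."""
    return (max(mora_means) - min(mora_means)) / 2


def accent_type(mora_means, threshold):
    """Label each mora 'H' if its mean f0 >= threshold, else 'L', and join the labels."""
    return "".join("H" if m >= threshold else "L" for m in mora_means)


# Assumed per-window f0 values (Hz) for a four-mora accent phrase.
windows_per_mora = [[118.0, 122.0], [208.0, 212.0], [198.0, 202.0], [148.0, 152.0]]
mora_means = [mora_mean_f0(w) for w in windows_per_mora]

m_th = phrase_threshold(mora_means)
print(mora_means, m_th)
print(accent_type(mora_means, m_th))    # threshold from Formula 2 as printed
print(accent_type(mora_means, 165.0))   # assumed midpoint threshold, not from the source
```

Keeping the threshold as an argument makes it easy to compare how different reference values change the resulting H/L pattern for the same mora averages.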

The accent type determination unit 130 performs this processing on all accent phrases generated from the teacher data. The accent types shown in FIG. 7 are an example for accent phrases with six or fewer mora sections, in which the 20 most frequent accent types were selected from those obtained from a large amount of teacher data. When the accent phrase to be analyzed contains six or fewer mora sections, these 20 accent types are given the class names Class 1 through Class 20 in order. Accent types and classes correspond one to one. The accent type determination unit 130 stores the voice data of each accent phrase used as teacher data in the storage unit 2 in association with its accent type (class). Accent phrases that do not match any of these 20 accent types are excluded from the teacher data. The remainder of this embodiment describes the case where the accent phrase to be analyzed contains six or fewer mora sections.
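Selecting the most frequent accent types, numbering them as classes, and dropping teacher data that matches none of them can be sketched with a frequency count. The function names, the sample identifiers, and the tiny data set are hypothetical; the example uses two classes instead of 20 only to keep it short.

```python
from collections import Counter


def build_class_table(accent_types, num_classes=20):
    """Map the num_classes most frequent accent types to class numbers 1..num_classes."""
    counts = Counter(accent_types)
    ranked = counts.most_common(num_classes)
    return {pattern: idx for idx, (pattern, _) in enumerate(ranked, start=1)}


def assign_classes(samples, class_table):
    """Attach a class number to each (phrase_id, accent_type); drop unmatched phrases."""
    return [(pid, class_table[at]) for pid, at in samples if at in class_table]


# Assumed teacher-data accent types for six accent phrases.
samples = [
    ("p1", "LHH"), ("p2", "LHH"), ("p3", "LHH"),
    ("p4", "HLL"), ("p5", "HLL"),
    ("p6", "HHL"),
]
table = build_class_table([at for _, at in samples], num_classes=2)
classified = assign_classes(samples, table)
print(table)
print(classified)
```

Here the least frequent pattern "HHL" falls outside the selected classes, so phrase "p6" is excluded, mirroring the exclusion rule in the text.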

For accent phrases with six or fewer mora sections, the 20 accent types are selected in descending order of frequency, based on experimental results obtained by statistically processing a large amount of Japanese speech uttered with the seven emotions. Reducing the number of accent type classes increases the share of teacher data that matches no accent type, which lowers the estimation accuracy of an emotion estimation device incorporating the generated emotion estimators. Increasing the number of accent type classes, on the other hand, increases the number of emotion estimators to be generated, which raises the manufacturing cost of the emotion estimation device. The number of accent type classes is therefore chosen as a trade-off between the two. Note that the figure of 20 accent types is an example for Japanese with accent phrases of six or fewer mora sections. For accent phrases with seven or more mora sections, or for other languages, the number must be determined by further statistical processing of accent type frequencies.

Returning to FIG. 2, the feature amount extraction unit 140 extracts the feature amounts of the voice for each accent phrase. The feature amounts of the voice include voice loudness, voice pitch, and phoneme utterance duration. The extracted feature amounts are given the class name (Class 1 through Class 20) determined by the accent type determination unit 130 and stored in the storage unit 2 as teacher data.

The emotion estimator generation unit 150 acquires the teacher data from the storage unit 2 class by class and generates an emotion estimator adapted to each class. Specifically, the emotion estimator generation unit 150 acquires the teacher data classified as Class 1 and generates a Class 1 emotion estimator that classifies the emotional state at the time of utterance into the seven emotional states: sad (sadness), bored (boredom), angry (anger), surprised (surprise), discouraged (discouragement), disgusted (disgust), and happy (joy). FIG. 8 is a conceptual diagram that represents, in two dimensions, the discrimination thresholds separating the seven emotions. A known technique can be used to generate an emotion estimator from teacher data. Next, the emotion estimator generation unit 150 acquires the teacher data classified as Class 2 and generates a Class 2 emotion estimator that classifies the emotional state at the time of utterance into the same seven emotional states: sadness, boredom, anger, surprise, discouragement, disgust, and joy. In the same way, the emotion estimator generation unit 150 generates emotion estimators up to Class 20.
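Training one estimator per class can be sketched as below. The source only says that a known technique is used, so the nearest-centroid classifier here is a stand-in chosen for brevity, not the method of the source. The two-element feature vectors, the emotion labels in the toy data, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(samples):
    """samples: list of (feature_vector, emotion). Returns {emotion: mean feature vector}."""
    sums, counts = {}, defaultdict(int)
    for vec, emotion in samples:
        if emotion not in sums:
            sums[emotion] = [0.0] * len(vec)
        for i, v in enumerate(vec):
            sums[emotion][i] += v
        counts[emotion] += 1
    return {e: [s / counts[e] for s in total] for e, total in sums.items()}


def train_per_class(teacher_data):
    """teacher_data: list of (class_id, feature_vector, emotion). One estimator per class."""
    by_class = defaultdict(list)
    for class_id, vec, emotion in teacher_data:
        by_class[class_id].append((vec, emotion))
    return {cid: train_centroids(samples) for cid, samples in by_class.items()}


def predict(centroids, vec):
    """Return the emotion whose centroid is closest (Euclidean distance) to vec."""
    return min(centroids, key=lambda e: math.dist(centroids[e], vec))


# Toy teacher data: (class, [loudness, pitch in Hz], emotion label).
teacher_data = [
    (1, [1.0, 200.0], "joy"), (1, [1.2, 210.0], "joy"),
    (1, [0.4, 120.0], "sadness"), (1, [0.5, 130.0], "sadness"),
    (2, [0.9, 150.0], "anger"), (2, [0.3, 150.0], "boredom"),
]
estimators = train_per_class(teacher_data)
print(sorted(estimators))
print(predict(estimators[1], [1.1, 205.0]))
```

Because each class has its own estimator, the same feature vector may be judged differently depending on the accent type of the phrase it came from, which is the point of the per-class design.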

Next, the configuration of the emotion estimation device 200, which incorporates the 20 emotion estimators generated as described above, is described with reference to FIG. 9. The physical configuration of the emotion estimation device 200 is the same as that shown in FIG. 1.

The ROM of the control unit 1 stores the emotion estimation program generated by the emotion estimator generation device 100 according to this embodiment. The storage unit 2 stores the voice data to be analyzed, among other data. The input/output unit 3 includes a voice input device, a CD drive, and a USB interface for acquiring the voice data to be analyzed. The input/output unit 3 may also acquire the parameters that determine the characteristics of the emotion estimators generated by the emotion estimator generation device 100. The input/output unit 3 further includes a display device or a voice output device for outputting the emotion estimation result.

As shown in FIG. 9, the emotion estimation device 200 includes the functions of a voice data acquisition unit 210, a speaker division unit 220, an analysis section setting unit 230, an accent type determination unit 240, a selection unit 250, a feature amount extraction unit 260, an emotion estimation unit 270, and an integration unit 280. The analysis section setting unit 230 includes the functions of a morphological analysis unit 231 and an accent phrase extraction unit 232. The accent type determination unit 240 includes the functions of a mora section extraction unit 241 and an accent type extraction unit 242.

音声データ取得部２１０は、ユーザが発話した解析対象とする音声を取得する。音声データ取得部２１０は、マイク等の音声取得装置から構成される。また、音声データ取得部２１０は、ＣＤドライブ、ＵＳＢインタフェースを備え、音声データとしてユーザの音声を取得することもできる。 The voice data acquisition unit 210 acquires voice to be analyzed which is spoken by the user. The audio data acquisition unit 210 includes an audio acquisition device such as a microphone. The audio data acquisition unit 210 includes a CD drive and a USB interface, and can acquire the user's audio as audio data.

話者分割部２２０は、取得した解析対象の音声データを話者ごとに分割する。音声データの中に複数人の音声データが存在する場合、１人の話者が発話した文ごとに発話者の感情を推定するためである。音声データを話者ごとに分割する方法は、公知の技術を用いて行う。例えば、音声の強度、音声のピッチ、音素の発話時間長等の相関性に基づいて分割することができる。 The speaker dividing unit 220 divides the acquired voice data to be analyzed by speaker. This is so that, when the voice data contains the voices of a plurality of people, the speaker's emotion can be estimated for each sentence uttered by a single speaker. The voice data is divided by speaker using a known technique. For example, it can be divided based on correlations in voice intensity, voice pitch, phoneme utterance duration, and the like.

解析区間設定部２３０、アクセント型決定部２４０は、感情推定器の生成時と同じ条件下で解析対象の音声データを解析するために、感情推定器生成装置１００と同じ構成を有している。つまり、解析区間設定部２３０は、音声データからアクセント句を抽出し、アクセント型決定部２４０は、アクセント句ごとにアクセント型（クラス）を決定する。 The analysis section setting unit 230 and the accent type determination unit 240 have the same configuration as the emotion estimator generation device 100 in order to analyze the speech data to be analyzed under the same conditions as when the emotion estimator is generated. That is, the analysis section setting unit 230 extracts an accent phrase from the audio data, and the accent type determination unit 240 determines an accent type (class) for each accent phrase.

選択部２５０は、クラス分けされたアクセント句ごとに、該当するクラスに対応する感情推定器を選択する。具体的には、感情推定装置２００に内蔵している２０種類の感情推定器の中から、解析対象のアクセント句のクラスに対応する感情推定器を選択する。 The selection unit 250 selects an emotion estimator corresponding to the corresponding class for each classified accent phrase. Specifically, the emotion estimator corresponding to the class of the accent phrase to be analyzed is selected from the 20 types of emotion estimators built in the emotion estimation device 200.

特徴量抽出部２６０は、感情推定器生成装置１００と同じ構成を有しており、同じ条件下で音声データから特徴量を抽出する。そして、特徴量抽出部２６０は、抽出した特徴量とアクセント型を示すクラス名とを対応付けて記憶部２に記憶する。 The feature amount extraction unit 260 has the same configuration as the emotion estimator generation device 100, and extracts feature amounts from the speech data under the same conditions. Then, the feature quantity extraction unit 260 stores the extracted feature quantity and the class name indicating the accent type in association with each other in the storage unit 2.

感情推定部２７０は、選択部２５０が選択した感情推定器を用いて、アクセント句ごとに発話者の感情を推定する。具体的には、感情推定部２７０は、クラス１に分類されたアクセント句の感情を推定する場合には、クラス１用の感情推定器を選択して発話者の感情を推定する。感情推定部２７０は、クラスｎに分類されたアクセント句の感情を推定する場合には、クラスｎ用の感情推定器を選択して発話者の感情を推定する。そして、感情推定部２７０は、解析対象のアクセント句を発話したときの発話者の感情状態が、悲しみ、退屈、怒り、驚き、落胆、嫌悪、喜び、のいずれの感情状態に該当するかを推定する。 The emotion estimation unit 270 uses the emotion estimator selected by the selection unit 250 to estimate the speaker's emotion for each accent phrase. Specifically, when estimating the emotion of an accent phrase classified into class 1, the emotion estimation unit 270 uses the emotion estimator for class 1 to estimate the speaker's emotion. Likewise, when estimating the emotion of an accent phrase classified into class n, the emotion estimation unit 270 uses the emotion estimator for class n. The emotion estimation unit 270 then estimates which of the emotional states of sadness, boredom, anger, surprise, discouragement, disgust, and joy corresponds to the speaker's emotional state when the accent phrase to be analyzed was uttered.

統合部２８０は、発話者の感情を音声データの文単位で推定する。具体的には、統合部２８０は、１文の中で最も多かった感情をその文を発話した発話者の感情として推定する。例えば、「坊主が」、「屏風に」、「上手に」、「坊主の」、「絵を」、「描い」、「た」の７つのアクセント句から構成される「坊主が屏風に上手に坊主の絵を描いた」という文において、「喜び」と判別されたアクセント句の数が４であり、「怒り」と判別されたアクセント句の数が２であり、「驚き」と判別されたアクセント句の数が１であった場合、一番多い「喜び」をこの「坊主が屏風に上手に坊主の絵を描いた」を発話したときの発話者の感情として推定する。 The integration unit 280 estimates the speaker's emotion for each sentence of the voice data. Specifically, the integration unit 280 takes the emotion that occurs most often within a sentence as the emotion of the speaker who uttered that sentence. For example, consider the sentence 「坊主が屏風に上手に坊主の絵を描いた」 (“The priest skillfully drew a picture of a priest on the folding screen”), which consists of the seven accent phrases 「坊主が」, 「屏風に」, 「上手に」, 「坊主の」, 「絵を」, 「描い」, and 「た」. If four accent phrases are determined to be “joy”, two to be “anger”, and one to be “surprise”, then “joy”, the most frequent, is estimated as the speaker's emotion when uttering this sentence.
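The sentence-level integration described above amounts to a majority vote over the per-phrase results. The following is a minimal sketch of that vote, not the patent's implementation: the function name `integrate_sentence` is hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter


def integrate_sentence(phrase_emotions):
    """Return the emotion label that occurs most often among a sentence's accent phrases."""
    if not phrase_emotions:
        raise ValueError("a sentence needs at least one accent phrase result")
    counts = Counter(phrase_emotions)
    # most_common(1) returns [(label, count)]; among equal counts the
    # label encountered first is returned (an assumption, see above).
    return counts.most_common(1)[0][0]


# The example from the text: 4 x joy, 2 x anger, 1 x surprise over seven accent phrases.
labels = ["joy", "joy", "anger", "joy", "surprise", "anger", "joy"]
print(integrate_sentence(labels))
```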

次に、以上の構成を有する感情推定器生成装置１００が感情推定器を生成する処理について、図１０を参照しながら説明する。教師データとして使用する悲しみ、退屈、怒り、驚き、落胆、嫌悪、喜び、の７種類の感情状態で発話された音声データは、予め記憶部２に記憶されているものとする。解析対象の音声データに含まれるアクセント句のモーラ区間数は６以下であると仮定する。感情推定器を生成する担当者が、感情推定器生成装置１００に予めインストールされている感情推定器生成プログラムを起動することにより、図１０に示すフローチャートは開始される。 Next, the process by which the emotion estimator generation device 100 having the above configuration generates the emotion estimators will be described with reference to FIG. 10. It is assumed that the voice data used as teacher data, uttered in the seven emotional states of sadness, boredom, anger, surprise, discouragement, disgust, and joy, is stored in the storage unit 2 in advance. It is also assumed that each accent phrase included in the voice data to be analyzed has six or fewer mora sections. The flowchart shown in FIG. 10 starts when the person in charge of generating the emotion estimators launches the emotion estimator generation program installed in advance on the emotion estimator generation device 100.

制御部１は、感情推定器生成プログラムが起動されると、記憶部２に記憶されている教師データを音声データ取得部１１０に取得する（ステップＳ１１）。そして、形態素解析部１２１は、取得した音声データを形態素の単位で分割する（ステップＳ１２）。次に、アクセント句抽出部１２２は、音声データの特徴を解析する単位であるアクセント句を抽出し、音声データをアクセント句に分割する（ステップＳ１３）。 When the emotion estimator generation program is activated, the control unit 1 acquires the teacher data stored in the storage unit 2 in the voice data acquisition unit 110 (step S11). Then, the morpheme analysis unit 121 divides the acquired voice data in units of morphemes (step S12). Next, the accent phrase extraction unit 122 extracts an accent phrase, which is a unit for analyzing the characteristics of the voice data, and divides the voice data into accent phrases (step S13).

次に、モーラ区間抽出部１３１は、アクセント句に含まれるモーラ区間を抽出する（ステップＳ１４）。そして、アクセント型抽出部１３２は、アクセント句のアクセント型を抽出する（ステップＳ１５）。具体的には、アクセント型抽出部１３２は、図７を用いて説明したように、教師データとして使用するアクセント句を２０のクラスに分類する（ステップＳ１６）。アクセント型抽出部１３２は、その分類をするために、図５と図６を用いて説明したように、モーラ区間ごとの平均基本周波数ｍ＿ｆ０とアクセント句区間の平均基本周波数ｍ＿ｔｈとを比較し、ｍ＿ｆ０≧ｍ＿ｔｈであれば「Ｈ」、ｍ＿ｆ０＜ｍ＿ｔｈであれば「Ｌ」をそれぞれのモーラ区間に付与する。アクセント型抽出部１３２は、このようにして教師データとして使用するアクセント句に対して、アクセント句を構成するモーラ区間ごとにＨとＬを付与し、ＨとＬのパターンによりアクセント型を抽出する。そして、アクセント型決定部１３０は、教師データのアクセント型を図７に示す２０のアクセント型（クラス）の何れかに決定する。 Next, the mora section extraction unit 131 extracts the mora sections included in the accent phrase (step S14). Then, the accent type extraction unit 132 extracts the accent type of the accent phrase (step S15). Specifically, as described with reference to FIG. 7, the accent type extraction unit 132 classifies the accent phrases used as teacher data into 20 classes (step S16). To perform this classification, as described with reference to FIGS. 5 and 6, the accent type extraction unit 132 compares the average fundamental frequency m_f0 of each mora section with the average fundamental frequency m_th of the accent phrase section, and assigns “H” to the mora section if m_f0 ≥ m_th and “L” if m_f0 < m_th. In this way, the accent type extraction unit 132 assigns H or L to each mora section constituting an accent phrase used as teacher data, and extracts the accent type from the resulting pattern of H and L. Then, the accent type determination unit 130 determines the accent type of the teacher data to be one of the 20 accent types (classes) shown in FIG. 7.
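The H/L labeling in steps S15 and S16 can be illustrated as follows. This is a sketch under stated assumptions: the function name `hl_pattern` and the input layout (one list of per-frame F0 values per mora section) are hypothetical, and m_th is computed here as the mean over all frames of the accent phrase, which is one plausible reading of "average fundamental frequency of the accent phrase section". The mapping from an H/L pattern to one of the 20 classes of FIG. 7 is not reproduced.

```python
from statistics import mean


def hl_pattern(mora_f0_frames):
    """Label each mora section H or L by comparing its mean F0 with the phrase mean.

    mora_f0_frames: list of lists of F0 values in Hz, one inner list per mora section.
    """
    # m_th: mean F0 over the whole accent phrase (assumption: pooled over all frames).
    m_th = mean(f0 for frames in mora_f0_frames for f0 in frames)
    labels = []
    for frames in mora_f0_frames:
        m_f0 = mean(frames)  # mean F0 of this mora section
        labels.append("H" if m_f0 >= m_th else "L")
    return "".join(labels)


# Four mora sections; the phrase mean is 130 Hz, so the pattern is L-H-H-L.
print(hl_pattern([[100, 110], [150, 160], [140, 150], [120, 110]]))
```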

次に、特徴量抽出部１４０は、教師データとする音声データの特徴量をアクセント句ごとに抽出し、抽出した特徴量のデータと分類されたクラスとを対応付けて教師データとして記憶部２に記憶する（ステップＳ１７）。 Next, the feature amount extraction unit 140 extracts the feature amounts of the voice data serving as teacher data for each accent phrase, associates the extracted feature amount data with the class into which the phrase was classified, and stores them in the storage unit 2 as teacher data (step S17).

感情推定器生成部１５０は、アクセント型（クラス）ごとに分類された教師データに基づいて、それぞれのクラスごとに感情推定器を生成する（ステップＳ１８）。具体的には、感情推定器生成部１５０は、クラス１に分類された教師データ（アクセント句）を取得して、その教師データの発話時の感情状態を、７種類の感情状態である悲しみ、退屈、怒り、驚き、落胆、嫌悪、喜び、に分類することが可能な感情推定器を生成する。より具体的には、図８に示すよな７種類の感情に分類するための識別閾値（分類器を構成する数式のパラメータ）を生成する。次に、感情推定器生成部１５０は、クラス２に分類された教師データ（アクセント句）を取得して、その教師データの発話時の感情状態を、７種類の感情状態である悲しみ、退屈、怒り、驚き、落胆、嫌悪、喜び、に分類することが可能な２つめの感情推定器を生成する。感情推定器生成部１５０は、このように２０種類の感情推定器を生成する。以上で、感情推定器生成装置１００の感情推定器生成処理の説明を終了する。 The emotion estimator generation unit 150 generates an emotion estimator for each class based on the teacher data classified by accent type (class) (step S18). Specifically, the emotion estimator generation unit 150 acquires the teacher data (accent phrases) classified into class 1 and generates an emotion estimator capable of classifying the emotional state at the time that teacher data was uttered into the seven emotional states of sadness, boredom, anger, surprise, discouragement, disgust, and joy. More specifically, it generates the identification thresholds (the parameters of the formulas constituting the classifier) for classifying into the seven emotions as shown in FIG. 8. Next, the emotion estimator generation unit 150 acquires the teacher data (accent phrases) classified into class 2 and generates a second emotion estimator capable of classifying the emotional state at the time of utterance into the same seven emotional states. In this way, the emotion estimator generation unit 150 generates 20 types of emotion estimators. This concludes the description of the emotion estimator generation process of the emotion estimator generation device 100.
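The idea of training one estimator per accent class and later picking the estimator that matches a phrase's class can be sketched as below. The passage does not specify the classifier beyond "identification thresholds", so a nearest-centroid classifier is swapped in purely as a stand-in. The names `train_estimators` and `estimate`, and the sample layout `(accent_class, feature_vector, emotion_label)`, are hypothetical.

```python
from collections import defaultdict
from math import dist


def train_estimators(samples):
    """Build one nearest-centroid model per accent class.

    samples: iterable of (accent_class, feature_vector, emotion_label).
    Returns {accent_class: {emotion_label: centroid_vector}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for accent_class, features, emotion in samples:
        grouped[accent_class][emotion].append(features)
    estimators = {}
    for accent_class, by_emotion in grouped.items():
        # The centroid is the per-dimension mean of the class's training vectors.
        estimators[accent_class] = {
            emotion: [sum(col) / len(vectors) for col in zip(*vectors)]
            for emotion, vectors in by_emotion.items()
        }
    return estimators


def estimate(estimators, accent_class, features):
    """Select the estimator for the phrase's accent class and return the nearest emotion."""
    centroids = estimators[accent_class]  # selection by class
    return min(centroids, key=lambda emotion: dist(centroids[emotion], features))


samples = [
    (1, [1.0, 0.0], "joy"), (1, [3.0, 0.0], "joy"), (1, [10.0, 10.0], "anger"),
    (2, [0.0, 0.0], "anger"), (2, [10.0, 10.0], "joy"),
]
models = train_estimators(samples)
print(estimate(models, 1, [2.5, 0.5]), estimate(models, 2, [1.0, 1.0]))
```

Keeping the per-class models in a dictionary keyed by class makes the later selection step a plain lookup; any other classifier could take the place of the centroid tables.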

次に、感情推定器生成装置１００が生成した２０種類の感情推定器を内蔵する感情推定装置２００が発話者の感情を推定する感情推定処理について、図１１を参照しながら説明する。ユーザが、感情推定装置２００に予めインストールされている感情推定プログラムを起動し、解析対象とする音声データを感情推定装置２００に入力することにより、図１１に示すフローチャートは開始される。 Next, the emotion estimation process in which the emotion estimation device 200, which incorporates the 20 types of emotion estimators generated by the emotion estimator generation device 100, estimates the speaker's emotion will be described with reference to FIG. 11. The flowchart shown in FIG. 11 starts when the user launches the emotion estimation program installed in advance on the emotion estimation device 200 and inputs the voice data to be analyzed into the emotion estimation device 200.

制御部１は、感情推定プログラムが起動され、ユーザが解析対象の音声データを感情推定装置２００に入力すると、話者分割部２２０は、取得した音声データを話者ごとに分割して記憶部２に記憶する。次に、解析区間設定部２３０は、話者ごとに音声データを記憶部２から取得する（ステップＳ３１）。次に、形態素解析部２３１は、任意の話者の音声データを形態素に分解し（ステップＳ３２）、アクセント句抽出部２３２は、解析単位であるアクセント句を決定する（ステップＳ３３）。 In the control unit 1, when the emotion estimation program is started and the user inputs the voice data to be analyzed into the emotion estimation device 200, the speaker dividing unit 220 divides the acquired voice data by speaker and stores it in the storage unit 2. Next, the analysis section setting unit 230 acquires the voice data of each speaker from the storage unit 2 (step S31). Next, the morphological analysis unit 231 decomposes the voice data of one of the speakers into morphemes (step S32), and the accent phrase extraction unit 232 determines the accent phrases, which are the units of analysis (step S33).

次に、アクセント型決定部２４０は、感情推定器生成装置１００の動作説明と同様に、アクセント句ごとにアクセント型（クラス）を決定する。具体的には、モーラ区間抽出部２４１が、該当するアクセント句に含まれるモーラ区間を抽出し（ステップＳ３４）、アクセント型抽出部２４２が、そのアクセント句のアクセント型を抽出する（ステップＳ３５）。そして、アクセント型決定部２４０は、抽出したアクセント型からそのアクセント句が属するクラスを決定する。そして、選択部２５０は、該当するアクセント句を発話したときの発話者の感情を推定するために使用する感情推定器として、同じアクセント型（クラス）の教師データに基づいて感情推定器生成装置１００が生成した感情推定器を選択する（ステップＳ３６）。 Next, as in the description of the operation of the emotion estimator generation device 100, the accent type determination unit 240 determines the accent type (class) of each accent phrase. Specifically, the mora section extraction unit 241 extracts the mora sections included in the accent phrase (step S34), and the accent type extraction unit 242 extracts the accent type of that accent phrase (step S35). The accent type determination unit 240 then determines, from the extracted accent type, the class to which the accent phrase belongs. Then, as the emotion estimator to be used for estimating the speaker's emotion when the accent phrase was uttered, the selection unit 250 selects the emotion estimator that the emotion estimator generation device 100 generated from teacher data of the same accent type (class) (step S36).

一方、特徴量抽出部２６０は、解析対象のアクセント句の音声の強度、音声のピッチ、音素の継続時間長といった音声の特徴量を抽出し、抽出した特徴量と判別したクラスとを対応付けて記憶部２に記憶する（ステップＳ３７）。 Meanwhile, the feature amount extraction unit 260 extracts voice feature amounts of the accent phrase to be analyzed, such as voice intensity, voice pitch, and phoneme duration, associates the extracted feature amounts with the determined class, and stores them in the storage unit 2 (step S37).

次に、感情推定部２７０は、選択部２５０が選択した感情推定器を用いて、該当するアクセント句を発話したときの発話者の感情を推定する（ステップＳ３８）。 Next, the emotion estimation unit 270 uses the emotion estimator selected by the selection unit 250 to estimate the emotion of the speaker when speaking the corresponding accent phrase (step S38).

アクセント句の１つについて感情推定が完了すると、感情推定装置２００は、まだ解析が完了していないアクセント句が存在するか否かを判別する（ステップＳ３９）。解析が完了していないアクセント句が存在する場合（ステップＳ３９：Ｎｏ）、解析が完了していない他のアクセント句を抽出し（ステップＳ４０）、そのアクセント句に該当する感情を推定する。 When emotion estimation is completed for one of the accent phrases, emotion estimation apparatus 200 determines whether there is an accent phrase that has not been analyzed (step S39). If there is an accent phrase that has not been analyzed (step S39: No), another accent phrase that has not been analyzed is extracted (step S40), and an emotion corresponding to the accent phrase is estimated.

すべてのアクセント句の解析が完了している場合（ステップＳ３９：Ｙｅｓ）、感情推定装置２００は、解析した文単位で統合処理を行う（ステップＳ４１）。具体的には、統合部２８０は、解析対象の文に含まれるアクセント句ごとの感情推定結果に基づいて、最も多かった感情をその文を発話したときの発話者の感情として推定する。 If the analysis of all accent phrases has been completed (step S39: Yes), the emotion estimation apparatus 200 performs integration processing for each analyzed sentence (step S41). Specifically, based on the emotion estimation result for each accent phrase included in the sentence to be analyzed, the integration unit 280 estimates the most common emotion as the speaker's emotion when the sentence is uttered.

次に、感情推定装置２００は、最初に取得した任意の人が発話したすべての文について解析が完了したか否かを判別する（ステップＳ４２）。すべての文について解析が完了していない場合は（ステップＳ４２：Ｎｏ）、他の文を抽出し（ステップＳ４３）、他の文について感情推定処理を継続する。 Next, the emotion estimation apparatus 200 determines whether or not the analysis has been completed for all sentences spoken by the first acquired arbitrary person (step S42). If the analysis has not been completed for all sentences (step S42: No), other sentences are extracted (step S43), and the emotion estimation process is continued for the other sentences.

一方、感情推定装置２００は、すべての文について解析が完了している場合は（ステップＳ４２：Ｙｅｓ）、音声データに含まれているすべての人について感情推定が完了しているか否かを判別する（ステップＳ４４）。すべての人について解析が完了していない場合は（ステップＳ４４：Ｎｏ）、他の人の音声データを抽出して感情推定処理を継続する（ステップＳ４５）。すべての人について解析処理が完了している場合は（ステップＳ４４：Ｙｅｓ）、感情推定処理を終了する。 On the other hand, if the analysis has been completed for all sentences (step S42: Yes), emotion estimation apparatus 200 determines whether emotion estimation has been completed for all persons included in the voice data. (Step S44). If the analysis has not been completed for all the people (step S44: No), the voice estimation data of other people is extracted and the emotion estimation process is continued (step S45). If the analysis process has been completed for all persons (step S44: Yes), the emotion estimation process is terminated.
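The control flow of FIG. 11 is three nested loops: over speakers, over each speaker's sentences, and over each sentence's accent phrases, with a per-sentence integration at the end of the innermost loop. The sketch below shows only that loop structure. The callables `accent_class_of` and `features_of`, the `estimators` mapping, and the input layout are hypothetical stand-ins for the units described in the text, and every sentence is assumed to contain at least one accent phrase.

```python
from collections import Counter


def run_estimation(speakers, accent_class_of, features_of, estimators):
    """Return {speaker: [emotion of sentence 1, emotion of sentence 2, ...]}.

    speakers: {speaker_id: [sentence, ...]}, each sentence a non-empty list of accent phrases.
    estimators: {accent_class: callable(features) -> emotion label}.
    """
    results = {}
    for speaker, sentences in speakers.items():              # loop closed by S44/S45
        sentence_emotions = []
        for sentence in sentences:                           # loop closed by S42/S43
            phrase_emotions = []
            for phrase in sentence:                          # loop closed by S39/S40
                accent_class = accent_class_of(phrase)       # S34-S35
                estimator = estimators[accent_class]         # S36
                phrase_emotions.append(estimator(features_of(phrase)))  # S37-S38
            # S41: the most frequent phrase-level emotion represents the sentence.
            sentence_emotions.append(Counter(phrase_emotions).most_common(1)[0][0])
        results[speaker] = sentence_emotions
    return results


# Toy data: each phrase is an (accent_class, feature) pair.
toy_estimators = {1: lambda f: "joy" if f > 0 else "anger", 2: lambda f: "sadness"}
toy_speakers = {"A": [[(1, 1), (1, 2), (2, 0)]], "B": [[(1, -1)], [(2, 5), (2, 6)]]}
print(run_estimation(toy_speakers, lambda p: p[0], lambda p: p[1], toy_estimators))
```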

以上に説明したように感情推定器生成装置１００は、アクセント型ごとに分類した教師データに基づいて、アクセント型ごとに感情推定器を生成する。そして、感情推定装置２００は、アクセント型ごとに生成された感情推定器を使用して、発話者の感情を推定する。具体的には、感情推定装置２００は、解析対象の音声データをアクセント型ごとに分類し、同じアクセント型を有する教師データに基づいて生成された感情推定器を用いて発話者の感情を推定する。これにより、音声データから発話者の感情を推定する推定精度を向上することができる。 As described above, the emotion estimator generation device 100 generates an emotion estimator for each accent type based on teacher data classified by accent type. The emotion estimation device 200 then estimates the speaker's emotion using the emotion estimators generated for each accent type. Specifically, the emotion estimation device 200 classifies the voice data to be analyzed by accent type and estimates the speaker's emotion using the emotion estimator generated from teacher data of the same accent type. This improves the accuracy of estimating the speaker's emotion from voice data.

また、アクセント型抽出部１３２は、モーラ区間の単位で音声の特徴量の変化を抽出するので、感情推定器生成装置１００は、発話者の感情をより細かく解析することが可能な感情推定器を生成することができる。 In addition, since the accent type extraction unit 132 extracts changes in the feature amount of speech in units of mora sections, the emotion estimator generation device 100 uses an emotion estimator that can analyze the speaker's emotion in more detail. Can be generated.

また、アクセント型抽出部１３２は、音声の基本周波数の変化に基づいてアクセント型を抽出する。発話時の感情状態により音声の基本周波数は変化する傾向がある。したがって、感情推定器生成装置１００は、発話者の感情をより正確に推定することが可能な感情推定器を生成することができる。また、同じ理由により、感情推定装置２００は、発話者の感情をより正確に推定することができる。 The accent type extraction unit 132 extracts an accent type based on a change in the fundamental frequency of the voice. The fundamental frequency of speech tends to change depending on the emotional state during speech. Therefore, the emotion estimator generation device 100 can generate an emotion estimator that can estimate the speaker's emotion more accurately. For the same reason, the emotion estimation apparatus 200 can estimate the speaker's emotion more accurately.

解析区間設定手段１２０は、形態素の単位で音声を解析するので、感情推定器生成装置１００は、発話者の感情をより正確に解析することが可能な感情推定器を生成することができる。 Since the analysis section setting unit 120 analyzes the speech in units of morphemes, the emotion estimator generation device 100 can generate an emotion estimator that can analyze the speaker's emotion more accurately.

感情推定器生成装置１００は、発話者の発話時の感情を、悲しみ、退屈、怒り、驚き、落胆、嫌悪、喜び、に分類する感情推定器を生成する。この推定器を内蔵する感情推定装置２００は、発話者の発話時の感情状態を、悲しみ、退屈、怒り、驚き、落胆、嫌悪、喜び、に分類することができる。 The emotion estimator generation device 100 generates an emotion estimator that classifies emotions of a speaker when speaking into sadness, boredom, anger, surprise, discouragement, disgust, and joy. The emotion estimation apparatus 200 incorporating this estimator can classify the emotional state of the speaker when speaking into sadness, boredom, anger, surprise, discouragement, disgust, and joy.

（変形例１） (Modification 1)
実施形態１では、アクセント型を判別するために特徴量をモーラ区間の単位で解析する説明をした。変形例１では、モーラ区間の中の母音区間に限定して特徴量を解析する説明を行う。具体的には、図１２に示すように、母音区間のみの音声データを取り出して、図１３に示すように特徴量の解析を行う。基本周波数の解析方法は実施形態１の説明と同じである。 In the first embodiment, the feature amount is analyzed in units of mora sections in order to determine the accent type. In the first modification, a description will be given of analyzing the feature quantity limited to the vowel section in the mora section. Specifically, as shown in FIG. 12, the voice data of only the vowel section is taken out and the feature amount is analyzed as shown in FIG. 13. The fundamental frequency analysis method is the same as that described in the first embodiment.

母音区間にのみ着目する理由は、子音区間よりも母音区間の方が音素の継続時間長が長く、含まれる音声のエネルギーも大きいので、感情の変化による特徴量の変化は、子音区間よりも母音区間の方に顕著に現れるからである。 The reason for focusing only on vowel sections is that phonemes last longer and carry more voice energy in vowel sections than in consonant sections, so changes in the feature amounts caused by changes in emotion appear more prominently in vowel sections than in consonant sections.

このように、変形例１に係る感情推定装置２００は、母音区間に限定して特徴量の解析を行うことにより、感情推定の推定精度を向上することができる。 As described above, the emotion estimation apparatus 200 according to the modification 1 can improve the estimation accuracy of emotion estimation by analyzing the feature amount only in the vowel section.

（変形例２） (Modification 2)
実施形態１の説明では、アクセント型抽出部１３２が、音声の特徴量として音声のピッチ情報（音声の基本周波数）を利用する場合について説明した。変形例２では、音声の特徴量として音声の強度情報を利用する場合について説明する。ここでは、発話時の感情状態によって母音の発話区間における音声のエネルギー分布が変化することに着目した技術について説明する。 In the description of the first embodiment, the case where the accent type extraction unit 132 uses the pitch information (sound fundamental frequency) as the feature amount of the sound has been described. In the second modification, a case will be described in which voice intensity information is used as a voice feature amount. Here, a technique that focuses on the fact that the energy distribution of the voice in the vowel utterance section changes depending on the emotional state during utterance will be described.

具体的には、アクセント型抽出部１３２は、図１４に点線の丸印で示した音声のエネルギーのピークが、母音区間の前半に存在するか後半に存在するかを判別する。例えば、前半にピークが存在した場合には「Ｈ」を付与し、後半にピークが存在した場合には「Ｌ」を付与する。これにより、アクセント型決定部１３０は、アクセント型を決定する。音声の強度によりアクセント型を分類する場合は、実験データに基づいてクラス分けの仕方を検討する必要がある。その他の説明は実施形態１の説明と同じである。 Specifically, the accent type extraction unit 132 determines whether the peak of the voice energy indicated by the dotted circle in FIG. 14 is in the first half or the second half of the vowel section. For example, “H” is assigned when a peak exists in the first half, and “L” is assigned when a peak exists in the second half. Thereby, the accent type determination unit 130 determines an accent type. When classifying accent types according to the strength of speech, it is necessary to examine the classification method based on experimental data. Other explanations are the same as those of the first embodiment.
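The peak-position rule of this modification can be sketched as follows: find the frame with the highest energy inside a vowel section and check which half of the section it falls in. The function name `peak_half_label` and the per-frame energy list are hypothetical. Treating a peak whose frame center lies exactly at the midpoint as "second half" is an assumption, since the text does not cover that case.

```python
def peak_half_label(energy):
    """Return "H" if the energy peak of a vowel section lies in its first half, else "L".

    energy: sequence of per-frame energy values covering one vowel section.
    """
    if not energy:
        raise ValueError("the vowel section has no frames")
    # Index of the first frame holding the maximum energy.
    peak_index = max(range(len(energy)), key=energy.__getitem__)
    # Compare the center of the peak frame with the midpoint of the section.
    return "H" if (peak_index + 0.5) < len(energy) / 2 else "L"


print(peak_half_label([0.1, 0.9, 0.4, 0.2]))  # peak early in the section
print(peak_half_label([0.1, 0.2, 0.3, 0.9]))  # peak late in the section
```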

なお、変形例２の説明では、音声エネルギーのピーク点の時間位置の変化に着目する解析方法を説明したが、音声の強度の変化を用いてアクセント型を抽出することもできる。怒った状態で発話すると音声の強度は高くなる傾向があり、悲しい状態で発話すると音声の強度は低くなる傾向があるので、この傾向を利用するものである。この場合、例えば、アクセント句区間に含まれるモーラ区間ごとの音声のピーク強度を計測して、アクセント句区間の平均ピーク強度を求める。そして、モーラ区間の音声のピーク強度と平均ピーク強度とを比較して、モーラ区間ごとに「Ｈ」又は「Ｌ」を付与することにより、アクセント型を抽出することもできる。 In the description of the second modification, the analysis method focusing on the change in the time position of the peak point of the voice energy has been described, but the accent type can be extracted using the change in the intensity of the voice. When speaking in an angry state, the strength of the voice tends to increase, and when speaking in a sad state, the strength of the voice tends to decrease, so this tendency is used. In this case, for example, the peak intensity of the voice for each mora section included in the accent phrase section is measured to obtain the average peak intensity of the accent phrase section. Then, the accent type can be extracted by comparing the peak intensity and the average peak intensity of the sound in the mora section and assigning “H” or “L” to each mora section.

このように、変形例２に係る感情推定装置２００は、音声の発話時の感情状態を音声の強度の変化情報を利用して解析するので、感情推定の推定精度を向上することができる。 As described above, the emotion estimation apparatus 200 according to the modified example 2 analyzes the emotion state at the time of speech utterance using the change information of the strength of the speech, so that the estimation accuracy of emotion estimation can be improved.

（変形例３） (Modification 3)
変形例３では、音声の特徴量として音素の継続時間長を利用する場合について説明する。怒ったり喜んだりした状態で発話すると音素の継続時間長は短くなる傾向があり、退屈な状態や悲しい状態で発話すると音素の継続時間長が長くなる傾向があるので、この傾向を利用するものである。 In the third modification, the case where phoneme duration is used as the voice feature amount will be described. Phoneme duration tends to become shorter when a person speaks while angry or happy and longer when a person speaks while bored or sad, and this modification makes use of that tendency.

具体的には、アクセント型抽出部１３２は、モーラ区間に含まれる母音の継続時間長と、教師データに含まれる同じ母音の平均継続時間長とを比較し、モーラ区間に含まれる母音の継続時間長が平均継続時間長よりも長い場合は「Ｈ」を、短い場合は「Ｌ］を付与する。これにより、アクセント型決定部１３０は、アクセント型を決定する。 Specifically, the accent type extraction unit 132 compares the duration of the vowels included in the mora section with the average duration of the same vowel included in the teacher data, and the duration of the vowels included in the mora section If the length is longer than the average duration time, “H” is assigned, and if the length is shorter, “L” is assigned, whereby the accent type determination unit 130 determines the accent type.

実施形態１、変形例１、変形例２の説明では、解析区間であるアクセント句の区間における音声の特徴量の平均値とモーラ区間の平均値とを比較した。しかし、音素の継続時間長で比較する場合、感情推定器生成装置１００のアクセント型抽出部１３２は、平均継続時間長を解析区間内の音声データの平均ではなく、教師データ全体の平均継続時間長と比較する。母音によって継続時間長は異なるので、異なる母音の継続時間長と比較することはできない。アクセント句に含まれる同じ母音の数が少ないため、平均継続時間長のバラツキが大きくなり、誤判定の要因となるので、教師データ全体の平均をとることが好ましい。 In the descriptions of the first embodiment, the first modification, and the second modification, the average value of the voice feature amount over the accent phrase section, which is the analysis section, was compared with the average value over the mora section. When comparing phoneme durations, however, the accent type extraction unit 132 of the emotion estimator generation device 100 uses as the average duration not the average over the voice data in the analysis section but the average duration over the entire teacher data. Because duration differs from vowel to vowel, a vowel's duration cannot be compared with that of a different vowel. And because an accent phrase contains only a few instances of the same vowel, an average taken within it would vary widely and cause misjudgments, so it is preferable to average over the entire teacher data.

一方、感情推定装置２００のアクセント型抽出部２４２は、話者分類部２２０が分類した話者ごとの音声データについて、母音ごとに平均継続時間長を計算することが好ましい。 On the other hand, the accent type extraction unit 242 of the emotion estimation device 200 preferably calculates the average duration of each vowel from the voice data of each speaker as divided by the speaker dividing unit 220.
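The duration-based labeling can be illustrated with a per-vowel reference table. In this sketch the same helper builds the table either from the whole teacher data (generation side) or from one speaker's voice data (estimation side). The function names, the `(vowel, duration_ms)` layout, and the choice to label a duration exactly equal to the mean as "L" are assumptions; the text only covers the longer and shorter cases.

```python
from collections import defaultdict
from statistics import mean


def vowel_mean_durations(corpus):
    """Compute the mean duration of each vowel.

    corpus: iterable of (vowel, duration_ms) pairs, e.g. all teacher data
    or all voice data of one speaker.
    """
    by_vowel = defaultdict(list)
    for vowel, duration in corpus:
        by_vowel[vowel].append(duration)
    return {vowel: mean(durations) for vowel, durations in by_vowel.items()}


def duration_pattern(phrase, means):
    """Label each mora's vowel H (longer than that vowel's mean) or L (otherwise)."""
    return "".join("H" if duration > means[vowel] else "L" for vowel, duration in phrase)


reference = vowel_mean_durations([("a", 100), ("a", 200), ("i", 50), ("i", 70)])
print(duration_pattern([("a", 180), ("i", 50), ("a", 120)], reference))
```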

音素の継続時間長によりアクセント型を分類する場合は、実験データに基づいてクラス分けの仕方を検討する必要がある。その他の説明は実施形態１の説明と同じである。 When classifying accent types according to phoneme durations, it is necessary to consider how to classify them based on experimental data. Other explanations are the same as those of the first embodiment.

このように、変形例３に係る感情推定装置２００は、音声の発話時の感情状態を音素の発話時間長の変化情報を利用して解析するので、感情推定の推定精度を向上することができる。 As described above, the emotion estimation apparatus 200 according to the modified example 3 analyzes the emotional state at the time of speech utterance using the change information of the phoneme utterance time length, so that the estimation accuracy of emotion estimation can be improved. .

（変形例４） (Modification 4)
実施形態１と変形例１では、音声の特徴量として音声のピッチ情報を利用してアクセント型を抽出する技術の説明をした。また、変形例２では、音声の強度情報を利用してアクセント型を抽出する技術を紹介し、変形例３では、音素の継続時間長を利用してアクセント型を抽出する技術を紹介した。アクセント型を抽出する場合、これらの技術を単独で使用することもできるが、音声のピッチ情報と音声の強度情報のように２つ以上の技術を組み合わせてアクセント型を抽出することもできる。２つ以上の情報を組み合わせるとアクセント型の種類が増えることになるが、感情推定の精度を向上させることができる。 In the first embodiment and the first modification, the technique for extracting the accent type using the pitch information of the voice as the voice feature amount has been described. In the second modification, a technique for extracting an accent type using the intensity information of speech was introduced. In the third modification, a technique for extracting an accent type using the duration of phonemes was introduced. When extracting an accent type, these techniques can be used alone, but an accent type can also be extracted by combining two or more techniques such as voice pitch information and voice intensity information. Combining two or more pieces of information increases the number of accent types, but can improve the accuracy of emotion estimation.

なお、上記の説明では、音声の特徴量として、音声の強度、音声のピッチ、音素の継続時間長を例にして説明したが、これに限定する必要はない。例えば、音声の強度の変化量、音声のピッチの変化量、音素の継続時間長の変化量等を抽出してアクセント型を決定することもできる。 In the above description, the sound intensity, the pitch of the sound, and the duration of the phoneme have been described as examples of the feature amount of the sound. However, the present invention is not limited to this. For example, the accent type can be determined by extracting the change amount of the sound intensity, the change amount of the sound pitch, the change amount of the phoneme duration, and the like.

（変形例５） (Modification 5)
実施形態１の説明では、解析対象の文に含まれるアクセント句ごとの感情推定結果に基づいて、最も多かった感情をその文を発話したときの発話者の感情として推定する技術について説明を行った。しかし、統合処理の仕方はこれに限定する必要は無い。例えば、「少し驚きを伴った喜び」のように、複数の感情を含む推定を行うこともできる。感情推定器を構成する分類器では、特徴量をベクトルとして取得し、そのベクトルと識別閾値との距離に基づいて、いずれの感情に分類するかを決める場合が多い。例えば、「坊主が」、「屏風に」、「上手に」、「坊主の」、「絵を」、「描いた」の７つのアクセント句に対応する特徴量を、図１５に示す１から７に示す位置ベクトルで表し、７つの位置ベクトルを合成した平均ベクトルが、図１５に「平均」で示した位置ベクトルであったとする。この場合、位置ベクトル「平均」は、喜びの領域に属しているが、喜びと驚きの境界に近い位置に存在する。このような場合には、「少し驚きの感情が混在している可能性がある」というニュアンスを含めた感情推定結果を出力するようにしてもよい。 In the description of the first embodiment, a technique was described in which, based on the emotion estimation result for each accent phrase included in the sentence to be analyzed, the most frequent emotion is estimated as the speaker's emotion when the sentence was uttered. However, the integration process need not be limited to this. For example, an estimate that includes a plurality of emotions, such as “joy with a little surprise”, can also be made. A classifier constituting an emotion estimator often acquires the feature amounts as a vector and decides which emotion to assign based on the distance between that vector and the identification thresholds. For example, suppose that the feature amounts corresponding to the seven accent phrases of the example sentence (「坊主が」, 「屏風に」, 「上手に」, 「坊主の」, 「絵を」, 「描い」, and 「た」) are represented by the position vectors labeled 1 to 7 in FIG. 15, and that the average vector obtained by combining the seven position vectors is the position vector labeled “average” in FIG. 15. In this case, the position vector “average” belongs to the joy region but lies close to the boundary between joy and surprise. In such a case, an emotion estimation result carrying the nuance that “a slight feeling of surprise may be mixed in” may be output.

図８と図１５とは、７次元の識別空間を２次元でイメージ表現した図であるので、複雑な例を表現することは困難である。しかし、感情推定器を構成する分類器の中では、それぞれの識別境界との距離を数値で計算することが可能である。したがって、「怒りと悲しみ」、「怒りと落胆」のように、複数の感情の組み合わせと、その感情の度合い（識別境界との距離）を数値計算することが可能である。さらに、複数の閾値を設定することにより、「怒り、悲しみ、落胆」のように２つ以上の感情を含めた感情推定も可能である。また、複数の感情の複合度合いも推定することができる。 Since FIGS. 8 and 15 are two-dimensional image representations of a 7-dimensional identification space, it is difficult to represent a complex example. However, in the classifier constituting the emotion estimator, it is possible to calculate the distance from each identification boundary numerically. Accordingly, it is possible to numerically calculate a combination of a plurality of emotions and a degree of the emotion (distance from the identification boundary) such as “anger and sadness” and “anger and discouragement”. Furthermore, by setting a plurality of thresholds, emotion estimation including two or more emotions such as “anger, sadness, and discouragement” is possible. Moreover, the composite degree of several emotions can also be estimated.
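One way to picture the blended estimate of this modification is sketched below. It averages the per-phrase feature vectors and reports a secondary emotion when the average lies close to a decision boundary. The classifier here is again a nearest-centroid stand-in rather than the classifier of the patent, and closeness to the boundary is approximated by the gap between the distances to the two nearest centroids. The function name `blended_estimate`, the `centroids` table, and the `margin` threshold are all hypothetical.

```python
from math import dist


def blended_estimate(phrase_vectors, centroids, margin=1.0):
    """Return (primary, secondary) emotions; secondary is None when the result is clear-cut.

    phrase_vectors: list of feature vectors, one per accent phrase of a sentence.
    centroids: {emotion: centroid_vector}, with at least two emotions.
    margin: hypothetical threshold on the distance gap below which the
            second-nearest emotion is reported as mixed in.
    """
    # Per-dimension mean of the phrase vectors (the "average" vector).
    mean_vec = [sum(col) / len(phrase_vectors) for col in zip(*phrase_vectors)]
    ranked = sorted(centroids, key=lambda emotion: dist(centroids[emotion], mean_vec))
    primary, secondary = ranked[0], ranked[1]
    gap = dist(centroids[secondary], mean_vec) - dist(centroids[primary], mean_vec)
    return (primary, secondary) if gap < margin else (primary, None)


centers = {"joy": [0.0, 0.0], "surprise": [4.0, 0.0], "anger": [0.0, 10.0]}
print(blended_estimate([[1.0, 0.0], [2.0, 0.0], [2.4, 0.0]], centers))
print(blended_estimate([[0.0, 0.0], [1.0, 0.0]], centers))
```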

変形例５で説明した構成および処理を設けることにより、感情推定器生成装置１００は、複数の感情の度合いを推定可能な感情推定器を生成することが可能となる。また、複数の感情の度合いを推定可能な感情推定器を内蔵する感情推定装置２００は、発話者の複数の感情度合いを推定することができる。 By providing the configuration and processing described in the modification example 5, the emotion estimator generation device 100 can generate an emotion estimator that can estimate the degree of a plurality of emotions. Also, the emotion estimation device 200 including an emotion estimator capable of estimating a plurality of emotion levels can estimate a plurality of emotion levels of a speaker.

なお、実施形態１の説明では、発話者の感情状態を悲しみ、退屈、怒り、驚き、落胆、嫌悪、喜び、の７種類に分類する説明をしたが、感情の分類方法はこれに限定する必要はない。例えば、喜、怒、哀、楽の４種類に分類してもよい。 In the description of the first embodiment, the emotional state of the speaker is classified into seven types of sadness, boredom, anger, surprise, discouragement, disgust, and joy. However, the emotion classification method needs to be limited to this. There is no. For example, it may be classified into four types of joy, anger, sorrow, and comfort.

また、実施形態１の説明では、発話者の発話時の感情状態を、悲しみ、退屈、怒り、驚き、落胆、嫌悪、喜び、の７つの感情状態の何れかに分類する場合について説明し、いずれにも該当しない教師データは除外する説明をした。しかし、発話者の感情として、「普通」という感情状態を設け、７つの感情に分類できなかった教師データを感情「普通」に分類するようにしてもよい。これにより、感情推定器生成装置１００は、発話者の感情を７つの感情に「普通」を加えた８つの感情に推定可能な感情推定器を生成することができる。また、感情推定装置２００は、発話者の感情を７つの感情に「普通」を加えた８つの感情に推定することができる。 In the description of the first embodiment, the emotional state at the time of speaking of the speaker is described as being classified into any of the seven emotional states of sadness, boredom, anger, surprise, discouragement, disgust, and joy. Explained that teacher data that does not fall under is excluded. However, the emotional state of “normal” may be provided as the emotion of the speaker, and the teacher data that could not be classified into the seven emotions may be classified as the emotion “normal”. As a result, the emotion estimator generation device 100 can generate an emotion estimator that can estimate the emotion of the speaker to eight emotions obtained by adding “normal” to the seven emotions. The emotion estimation apparatus 200 can estimate the speaker's emotions as eight emotions by adding “normal” to the seven emotions.

In the description of Embodiment 1, the analysis interval is the interval of an accent phrase, but the analysis interval need not be limited to this. For example, the analysis interval may be the utterance interval of a word, a breath group interval (the interval between breaths), or the utterance interval of a sentence. When the analysis interval is the utterance interval of a sentence, the integration unit 280 may estimate the speaker's emotion sentence by sentence, or may estimate it for a unit in which a plurality of sentences are grouped together.

In the description of Expression 1, the average value is used to represent the feature amount of the section, but the median may be used instead of the average value. Alternatively, the lowest frequency may be used as the representative value.

In the description of Expression 2, the median is used to represent the feature amount of the section, but the average value may be used instead of the median.
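
The interchangeable representative values mentioned for Expressions 1 and 2 can be sketched as a single helper. This is only an illustration under stated assumptions: the function name `representative_value`, the `method` parameter, and the sample fundamental-frequency values are hypothetical, and Expressions 1 and 2 themselves are not reproduced here.

```python
from statistics import mean, median


def representative_value(values, method="mean"):
    """Return one value that represents the feature samples of a section.

    method: "mean" (average value), "median", or "min" (lowest value,
    e.g. the lowest frequency when the samples are fundamental frequencies).
    """
    if not values:
        raise ValueError("values must not be empty")
    if method == "mean":
        return mean(values)
    if method == "median":
        return median(values)
    if method == "min":
        return min(values)
    raise ValueError(f"unknown method: {method}")


# Illustrative fundamental-frequency samples (Hz) for one section.
f0_samples = [100.0, 120.0, 200.0]
by_mean = representative_value(f0_samples, "mean")
by_median = representative_value(f0_samples, "median")
by_min = representative_value(f0_samples, "min")
```

The median is less sensitive than the average to isolated outliers in the samples, such as pitch-tracking errors, which is one reason the two may be swapped.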

The emotion estimator generation device 100 and the emotion estimation device 200 can be provided with the configuration for realizing the functions according to the present invention built in advance. In addition, by applying a program, an existing personal computer, information terminal device, or the like can be made to function as the emotion estimator generation device 100 or the emotion estimation device 200 according to the present invention. That is, a program for realizing the functional configurations of the emotion estimator generation device 100 and the emotion estimation device 200 exemplified in the above embodiment is applied so that it can be executed by a CPU or the like that controls the existing personal computer or information terminal device, which thereby functions as the emotion estimator generation device 100 or the emotion estimation device 200 according to the present invention. The emotion estimator generation method and the emotion estimation method according to the present invention can be carried out using the emotion estimator generation device 100 and the emotion estimation device 200.

Such a program may be applied by any method. For example, the program can be stored on and applied from a computer-readable recording medium (a CD-ROM (Compact Disc Read-Only Memory), a DVD (Digital Versatile Disc), an MO (Magneto Optical disc), or the like), or it can be stored in storage on a network such as the Internet and applied by being downloaded.

Preferred embodiments of the present invention have been described above, but the present invention is not limited to these specific embodiments; the present invention includes the inventions described in the claims and their equivalents. The inventions described in the claims as originally filed in the present application are appended below.

(Appendix 1)
An emotion estimator generation method including:
an analysis interval setting step of setting an analysis interval in which a feature amount of voice data serving as the source of teacher data is analyzed;
a change pattern determination step of determining, based on change patterns classified into a plurality of classes, the pattern in which the feature amount of the voice data included in the analysis interval changes as the change pattern of the feature amount of the voice data included in the analysis interval; and
an emotion estimator generation step of generating, for each change pattern of the feature amount, an emotion estimator that estimates the emotion of a speaker at the time the speech was uttered, using the voice data classified by change pattern of the feature amount as teacher data.

(Appendix 2)
The emotion estimator generation method according to Appendix 1, wherein the change pattern determination step includes:
a mora section extraction step of dividing the voice data included in the analysis interval into mora sections, which are syllable units; and
a change pattern extraction step of extracting the change pattern of the voice feature amount, which changes from mora section to mora section when the voice data of the analysis interval was uttered, based on the result of comparing, for each mora section, the average value of the feature amount of the voice data in the analysis interval with the average value of the feature amount of the voice data in that mora section.

(Appendix 3)
The emotion estimator generation method according to Appendix 2, wherein, in the change pattern extraction step, the fundamental frequency of the voice extracted from the voice data is used as the voice feature amount; the average fundamental frequency of the voice in the analysis interval is compared, for each mora section, with the average fundamental frequency of the voice in that mora section; High is assigned when the average fundamental frequency in the mora section is higher than that in the analysis interval, and Low is assigned when it is lower; and a change pattern of the voice feature amount that changes between High and Low from mora section to mora section is extracted.

(Appendix 4)
The emotion estimator generation method according to Appendix 2, wherein, in the change pattern extraction step, the intensity of the voice extracted from the voice data is used as the voice feature amount; the average intensity of the voice in the analysis interval is compared, for each mora section, with the average intensity of the voice in that mora section; High is assigned when the average intensity in the mora section is higher than that in the analysis interval, and Low is assigned when it is lower; and a change pattern of the voice feature amount that changes between High and Low from mora section to mora section is extracted.

(Appendix 5)
The emotion estimator generation method according to Appendix 2, wherein, in the change pattern extraction step, the phoneme duration extracted from the voice data is used as the voice feature amount; the average phoneme duration in the analysis interval is compared, for each mora section, with the average phoneme duration in that mora section; High is assigned when the average phoneme duration in the mora section is longer than that in the analysis interval, and Low is assigned when it is shorter; and a change pattern of the voice feature amount that changes between High and Low from mora section to mora section is extracted.

(Appendix 6)
The emotion estimator generation method according to any one of Appendices 2 to 5, wherein, in the change pattern extraction step, the change pattern of the voice feature amount is extracted using at least one of the fundamental frequency of the voice, the intensity of the voice, and the phoneme duration as the voice feature amount.
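
The High/Low labelling per mora section can be illustrated with a minimal sketch. Several details are assumptions rather than statements of the text: the function name `change_pattern`, the single-letter codes "H" and "L", computing the analysis-interval average over all samples of all mora sections, and assigning Low when the two averages are exactly equal (the text only covers the higher and lower cases). The same comparison applies whether the samples are fundamental frequencies, intensities, or phoneme durations.

```python
from statistics import mean


def change_pattern(mora_values):
    """Label each mora section High ("H") or Low ("L").

    mora_values: one list of feature samples per mora section, in utterance
    order. A section is High when its average exceeds the average over the
    whole analysis interval; otherwise it is Low (ties are an assumption).
    """
    all_samples = [v for mora in mora_values for v in mora]
    interval_mean = mean(all_samples)
    labels = []
    for mora in mora_values:
        labels.append("H" if mean(mora) > interval_mean else "L")
    return "".join(labels)


# Illustrative fundamental-frequency samples (Hz) for four mora sections.
f0_per_mora = [[110.0, 115.0], [150.0, 160.0], [140.0, 150.0], [100.0, 105.0]]
pattern = change_pattern(f0_per_mora)  # interval average is 128.75 Hz
```

In this example the second and third mora sections lie above the interval average, so the resulting pattern is "LHHL".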

(Appendix 7)
The emotion estimator generation method according to any one of Appendices 1 to 6, wherein, in the analysis interval setting step, the voice data is divided into morphemes, which are the smallest units of language that carry meaning, and the interval of an accent phrase formed by combining a morpheme with the particle or auxiliary verb uttered after it is set as the analysis interval.

(Appendix 8)
The emotion estimator generation method according to Appendix 2, wherein, in the mora section extraction step, when the voice data is rendered as text, each kana character constitutes one mora section, a small kana character is combined with the preceding kana character to form one mora section, and a long sound constitutes one mora section on its own.
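
The kana-based mora rule can be sketched on text alone. This is an illustration under assumptions: the function name `split_into_morae` and the `SMALL_KANA` set are hypothetical, the input is assumed to be a kana-only string, and the small っ/ッ is left out of `SMALL_KANA` and treated as its own mora, which the rule above does not state separately. The sketch does not cover aligning the resulting sections with the audio.

```python
# Small kana that attach to the preceding character (assumed set; the small
# tsu is deliberately excluded, see the lead-in).
SMALL_KANA = set("ぁぃぅぇぉゃゅょゎァィゥェォャュョヮ")


def split_into_morae(kana):
    """Split a kana string into mora-sized text units.

    Each kana character is one unit, a small kana joins the preceding unit,
    and the long-sound mark "ー" stands alone because it is not in SMALL_KANA.
    """
    morae = []
    for ch in kana:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch  # merge with the preceding kana
        else:
            morae.append(ch)
    return morae


hiragana_example = split_into_morae("きょうは")
katakana_example = split_into_morae("コーヒー")
```

For "きょうは" the small ょ merges with き, giving three units, while each long-sound mark in "コーヒー" counts as a unit of its own, giving four.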

(Appendix 9)
The emotion estimator generation method according to any one of Appendices 1 to 8, wherein the emotion estimator estimates the speaker's emotion at the time of utterance to be one of sadness, boredom, anger, surprise, discouragement, disgust, and joy.

(Appendix 10)
The emotion estimator generation method according to any one of Appendices 1 to 9, further including a change pattern setting step of setting the change patterns classified into the plurality of classes.

(Appendix 11)
An emotion estimator generation device including:
analysis interval setting means for setting an analysis interval in which a feature amount of voice data serving as the source of teacher data is analyzed;
change pattern determination means for determining, based on change patterns classified into a plurality of classes, the pattern in which the feature amount of the voice data included in the analysis interval changes as the change pattern of the feature amount of the voice data included in the analysis interval; and
emotion estimator generation means for generating, for each change pattern of the feature amount, an emotion estimator that estimates the emotion of a speaker at the time the speech was uttered, using the voice data classified by change pattern of the feature amount as teacher data.

(Appendix 12)
A program for causing a computer to function as:
analysis interval setting means for setting an analysis interval in which a feature amount of voice data serving as the source of teacher data is analyzed;
change pattern determination means for determining, based on change patterns classified into a plurality of classes, the pattern in which the feature amount of the voice data included in the analysis interval changes as the change pattern of the feature amount of the voice data included in the analysis interval; and
emotion estimator generation means for generating, for each change pattern of the feature amount, an emotion estimator that estimates the emotion of a speaker at the time the speech was uttered, using the voice data classified by change pattern of the feature amount as teacher data.

(Appendix 13)
An emotion estimation method including:
an analysis interval setting step of setting an analysis interval in which a feature amount of voice data to be analyzed is analyzed;
a change pattern determination step of determining, based on change patterns classified into a plurality of classes, the pattern in which the feature amount of the voice data included in the analysis interval changes as the change pattern of the feature amount of the voice data included in the analysis interval; and
an emotion estimation step of estimating the emotion of the speaker at the time the speech of the analysis interval was uttered, using, for each change pattern of the feature amount, an emotion estimator generated based on teacher data having the same change pattern of the feature amount.

(Appendix 14)
An emotion estimation device including:
analysis interval setting means for setting an analysis interval in which a feature amount of voice data to be analyzed is analyzed;
change pattern determination means for determining, based on change patterns classified into a plurality of classes, the pattern in which the feature amount of the voice data included in the analysis interval changes as the change pattern of the feature amount of the voice data included in the analysis interval; and
emotion estimation means for estimating the emotion of the speaker at the time the speech of the analysis interval was uttered, using, for each change pattern of the feature amount, an emotion estimator generated based on teacher data having the same change pattern of the feature amount.
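
The idea shared by the generation and estimation appendices, one estimator per change pattern, selected at estimation time by the pattern of the input, can be sketched as follows. Everything concrete here is an assumption for illustration: the nearest-centroid classifier stands in for whatever learning method is actually used, and the names `NearestCentroid`, `build_estimators`, `estimate_emotion`, the pattern strings, and the two-dimensional feature vectors are hypothetical.

```python
from collections import defaultdict
from statistics import mean


class NearestCentroid:
    """Toy classifier (a stand-in, not the method of the embodiment)."""

    def fit(self, features, labels):
        grouped = defaultdict(list)
        for x, y in zip(features, labels):
            grouped[y].append(x)
        # One centroid per emotion label: the per-dimension mean.
        self.centroids = {
            y: [mean(col) for col in zip(*xs)] for y, xs in grouped.items()
        }
        return self

    def predict(self, x):
        def dist2(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda y: dist2(self.centroids[y]))


def build_estimators(teacher_data):
    """Train one estimator per change pattern.

    teacher_data: iterable of (pattern, feature_vector, emotion_label).
    """
    by_pattern = defaultdict(lambda: ([], []))
    for pattern, x, y in teacher_data:
        by_pattern[pattern][0].append(x)
        by_pattern[pattern][1].append(y)
    return {p: NearestCentroid().fit(xs, ys) for p, (xs, ys) in by_pattern.items()}


def estimate_emotion(estimators, pattern, x):
    """Select the estimator trained on the same change pattern and apply it."""
    return estimators[pattern].predict(x)


# Illustrative teacher data: (change pattern, [feature 1, feature 2], emotion).
teacher = [
    ("LHH", [200.0, 0.8], "joy"),
    ("LHH", [120.0, 0.3], "sadness"),
    ("HLL", [210.0, 0.9], "anger"),
    ("HLL", [130.0, 0.2], "boredom"),
]
estimators = build_estimators(teacher)
result = estimate_emotion(estimators, "LHH", [190.0, 0.7])
```

In this sketch an input whose change pattern was never seen in the teacher data raises a `KeyError`; how such inputs should be handled is not addressed here.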

DESCRIPTION OF SYMBOLS: 1 ... control unit, 2 ... storage unit, 3 ... input/output unit, 4 ... bus, 100 ... emotion estimator generation device, 110, 210 ... voice data acquisition unit, 120, 230 ... analysis interval setting unit, 121, 231 ... morphological analysis unit, 122, 232 ... accent phrase extraction unit, 130, 240 ... accent type determination unit, 131, 241 ... mora section extraction unit, 132, 242 ... accent type extraction unit, 140, 260 ... feature amount extraction unit, 150 ... emotion estimator generation unit, 200 ... emotion estimation device, 220 ... speaker division unit, 250 ... selection unit, 270 ... emotion estimation unit, 280 ... integration unit

Claims

1. An emotion estimator generation method including:
an analysis interval setting step of setting an analysis interval in which a feature amount of voice data serving as the source of teacher data is analyzed;
a change pattern determination step of determining, based on change patterns classified into a plurality of classes, the pattern in which the feature amount of the voice data included in the analysis interval changes as the change pattern of the feature amount of the voice data included in the analysis interval; and
an emotion estimator generation step of generating, for each change pattern of the feature amount, an emotion estimator that estimates the emotion of a speaker at the time the speech was uttered, using the voice data classified by change pattern of the feature amount as teacher data.

2. The emotion estimator generation method according to claim 1, wherein the change pattern determination step includes:
a mora section extraction step of dividing the voice data included in the analysis interval into mora sections, which are syllable units; and
a change pattern extraction step of extracting the change pattern of the voice feature amount, which changes from mora section to mora section when the voice data of the analysis interval was uttered, based on the result of comparing, for each mora section, the average value of the feature amount of the voice data in the analysis interval with the average value of the feature amount of the voice data in that mora section.

3. The emotion estimator generation method according to claim 2, wherein, in the change pattern extraction step, the fundamental frequency of the voice extracted from the voice data is used as the voice feature amount; the average fundamental frequency of the voice in the analysis interval is compared, for each mora section, with the average fundamental frequency of the voice in that mora section; High is assigned when the average fundamental frequency in the mora section is higher than that in the analysis interval, and Low is assigned when it is lower; and a change pattern of the voice feature amount that changes between High and Low from mora section to mora section is extracted.

4. The emotion estimator generation method according to claim 2, wherein, in the change pattern extraction step, the intensity of the voice extracted from the voice data is used as the voice feature amount; the average intensity of the voice in the analysis interval is compared, for each mora section, with the average intensity of the voice in that mora section; High is assigned when the average intensity in the mora section is higher than that in the analysis interval, and Low is assigned when it is lower; and a change pattern of the voice feature amount that changes between High and Low from mora section to mora section is extracted.

5. The emotion estimator generation method according to claim 2, wherein, in the change pattern extraction step, the phoneme duration extracted from the voice data is used as the voice feature amount; the average phoneme duration in the analysis interval is compared, for each mora section, with the average phoneme duration in that mora section; High is assigned when the average phoneme duration in the mora section is longer than that in the analysis interval, and Low is assigned when it is shorter; and a change pattern of the voice feature amount that changes between High and Low from mora section to mora section is extracted.

6. The emotion estimator generation method according to any one of claims 2 to 5, wherein, in the change pattern extraction step, the change pattern of the voice feature amount is extracted using at least one of the fundamental frequency of the voice, the intensity of the voice, and the phoneme duration as the voice feature amount.

7. The emotion estimator generation method according to any one of claims 1 to 6, wherein, in the analysis interval setting step, the voice data is divided into morphemes, which are the smallest units of language that carry meaning, and the interval of an accent phrase formed by combining a morpheme with the particle or auxiliary verb uttered after it is set as the analysis interval.

8. The emotion estimator generation method according to claim 2, wherein, in the mora section extraction step, when the voice data is rendered as text, each kana character constitutes one mora section, a small kana character is combined with the preceding kana character to form one mora section, and a long sound constitutes one mora section on its own.

9. The emotion estimator generation method according to any one of claims 1 to 8, wherein the emotion estimator estimates the speaker's emotion at the time of utterance to be one of sadness, boredom, anger, surprise, discouragement, disgust, and joy.

10. The emotion estimator generation method according to any one of claims 1 to 9, further including a change pattern setting step of setting the change patterns classified into the plurality of classes.

11. An emotion estimator generation device including:
analysis interval setting means for setting an analysis interval in which a feature amount of voice data serving as the source of teacher data is analyzed;
change pattern determination means for determining, based on change patterns classified into a plurality of classes, the pattern in which the feature amount of the voice data included in the analysis interval changes as the change pattern of the feature amount of the voice data included in the analysis interval; and
emotion estimator generation means for generating, for each change pattern of the feature amount, an emotion estimator that estimates the emotion of a speaker at the time the speech was uttered, using the voice data classified by change pattern of the feature amount as teacher data.

12. A program for causing a computer to function as:
analysis interval setting means for setting an analysis interval in which a feature amount of voice data serving as the source of teacher data is analyzed;
change pattern determination means for determining, based on change patterns classified into a plurality of classes, the pattern in which the feature amount of the voice data included in the analysis interval changes as the change pattern of the feature amount of the voice data included in the analysis interval; and
emotion estimator generation means for generating, for each change pattern of the feature amount, an emotion estimator that estimates the emotion of a speaker at the time the speech was uttered, using the voice data classified by change pattern of the feature amount as teacher data.

13. An emotion estimation method including:
an analysis interval setting step of setting an analysis interval in which a feature amount of voice data to be analyzed is analyzed;
a change pattern determination step of determining, based on change patterns classified into a plurality of classes, the pattern in which the feature amount of the voice data included in the analysis interval changes as the change pattern of the feature amount of the voice data included in the analysis interval; and
an emotion estimation step of estimating the emotion of the speaker at the time the speech of the analysis interval was uttered, using, for each change pattern of the feature amount, an emotion estimator generated based on teacher data having the same change pattern of the feature amount.

14. An emotion estimation device including:
analysis interval setting means for setting an analysis interval in which a feature amount of voice data to be analyzed is analyzed;
change pattern determination means for determining, based on change patterns classified into a plurality of classes, the pattern in which the feature amount of the voice data included in the analysis interval changes as the change pattern of the feature amount of the voice data included in the analysis interval; and
emotion estimation means for estimating the emotion of the speaker at the time the speech of the analysis interval was uttered, using, for each change pattern of the feature amount, an emotion estimator generated based on teacher data having the same change pattern of the feature amount.