JP2018180334A - Emotion recognition device, method and program

Info

Publication number: JP2018180334A
Application number: JP2017080653A
Authority: JP
Inventors: Atsushi Fujimoto (藤本 敦); Asami Nakajima (中島 亜紗美); Nao Hodoshima (程島 奈緒)
Original assignee: Tokai University; Iwatsu Electric Co Ltd
Current assignee: Tokai University; Iwatsu Electric Co Ltd
Priority date: 2017-04-14
Filing date: 2017-04-14
Publication date: 2018-11-15

Abstract

PROBLEM TO BE SOLVED: To recognize the emotion at the time of utterance from a speech signal with high accuracy and high reliability. SOLUTION: An autocorrelation processing unit 20 outputs a normalized autocorrelation signal for each segment produced by a segmentation unit 10. A feature quantity extraction unit 40 takes a period of several tens to several hundreds of milliseconds containing one or more segments as an analysis frame and extracts feature quantities of the analysis frame from the normalized autocorrelation signal and the output of a voice intensity measurement unit 30. A pattern identification and emotion recognition unit 50 recognizes the emotion at the time of utterance of the speech signal on the basis of the correlation, acquired through prior learning, between emotion and the feature quantities extracted by the feature quantity extraction unit 40, and outputs an emotion label as the recognition result. SELECTED DRAWING: Figure 1

Description

The present invention relates to an emotion recognition device, method, and program, and more particularly to an emotion recognition device, method, and program capable of recognizing the emotion at the time of utterance from a speech signal with high accuracy and high reliability.

To recognize the emotion at the time of utterance from a speech signal, it is necessary to extract from the speech signal feature quantities that serve as indicators of emotion.

Patent Document 1 describes a technique that detects the speech rate, the pitch frequency, the volume, and the speech spectrum from speech information and uses them as sensibility information accompanying the speech information.

Patent Document 2 describes a technique that extracts the intensity and pitch of a speech signal as features derived from the signal and recognizes emotion from statistical representative values, such as the mean, maximum, and minimum, of their time-series data.

Patent Document 3 describes a technique that extracts the amplitude envelope of an input speech signal, determines the frequency of its periodic fluctuation, and, when that frequency lies within a predetermined range, estimates that the utterance was produced in a strained state. Patent Document 3 also describes detecting anger intensity in units as short as a phoneme.

Patent Document 4 describes a technique that extracts, for each analysis frame of speech signal data, the fundamental frequency, the temporal variation of the fundamental frequency, the RMS amplitude, the temporal variation of the RMS amplitude, the power, the temporal variation of the power, the speech rate, and the temporal variation of the speech rate as speech feature quantities, and estimates the emotion of the speech signal data.

Patent Document 1: JP H09-22296 A
Patent Document 2: JP 2003-99084 A
Patent Document 3: JP 2009-3162 A
Patent Document 4: JP 2008-204193 A

Prosodic information such as voice pitch, voice intensity (power), and speech rate (utterance speed) is used to convey not only emotion but also linguistic information. In Japanese, for example, homophones are distinguished by pitch accent, that is, by high and low fundamental frequency. Recognizing emotion from a speech signal therefore requires separating prosodic changes that carry linguistic information from those that express emotion, and recognizing the emotion from the latter. It is also desirable to extract from the speech signal feature quantities that reflect the emotion at the time of utterance as faithfully as possible.

The technique of Patent Document 1 extracts prosodic features such as voice pitch (fundamental frequency), voice intensity, and speech rate per utterance or phrase, judges from them characteristics of the speech information as a whole, for example "the voice is high" or "the voice is loud", and thereby recognizes the sensibility accompanying the speech information. Because it does not distinguish prosodic changes that express emotion from prosodic changes that carry linguistic information, it is prone to misrecognition. It is also prone to misrecognition caused by variation in prosodic features due to individual differences in speech, such as speaking fast, having a high voice, or having a low voice.

The technique of Patent Document 2 obtains time-series data of voice intensity and fundamental frequency per utterance or phrase and recognizes emotion from statistical representative values such as the mean, maximum, and minimum of the time-series data; it therefore shares the problems of the technique of Patent Document 1. It is also easily affected by the duration of the utterance and by the presence and depth of pauses within the utterance, which makes stable, accurate emotion recognition difficult.

The technique of Patent Document 3 estimates strained utterances and their anger intensity. It too does not distinguish prosodic changes that express emotion from prosodic changes that carry linguistic information, so it shares the problems of the technique of Patent Document 1. Patent Document 3 also describes detecting anger intensity in units as short as a phoneme, but that technique determines the anger intensity, for a speech signal estimated to be a strained utterance, from a strained-voice occurrence index that indicates, for each phoneme, how readily strain occurs during utterance. The index is calculated with rules that derive this susceptibility to strain from phoneme attribute information such as consonant, vowel, position within the accent phrase, and position relative to the accent nucleus.

A statistical model could be used to recognize emotion from prosodic features while separating out linguistic information and accounting for variation in prosodic information due to individual differences. Building such a model, however, requires a large amount of training data, and training requires emotion labels. In practice, image data, biometric data, or the like must therefore be acquired together with the speech data; for example, emotion labels are generated from the image or biometric data, training proceeds automatically with those labels, and misclassified data are checked manually.

The technique of Patent Document 4 extracts speech feature quantities for each analysis frame to estimate emotion, but the feature quantities it presents are the fundamental frequency, amplitude, power, and speech rate of the speech.

As Patent Documents 1 to 3 illustrate, feature quantities for recognizing emotion per utterance or phrase are discussed in many documents, whereas feature quantities for recognizing emotion per analysis frame are discussed in only a few, such as Patent Document 4.

Moreover, the feature quantities disclosed in Patent Document 4, namely fundamental frequency, amplitude, power, and speech rate, are the same as those of Patent Documents 1 to 3. Feature quantities that are effective for recognizing emotion per analysis frame, a unit far shorter than an utterance or phrase, have thus not yet been adequately discussed.

FIGS. 29 and 30 show an example of statistical representative values (mean, maximum, minimum, and so on) for each phrase obtained from time-series data of fundamental frequency and voice intensity. They are presented to show the relationship between feature quantities per utterance or phrase and feature quantities per analysis frame.

In FIG. 29, the mean fundamental frequency over the phrase is 160 Hz, with a maximum of 200 Hz at 0.32 s and a minimum of 80 Hz at 0.77 s. Periods in which no fundamental frequency is plotted are periods in which it cannot be extracted because the sound is unvoiced (produced without vocal fold vibration), the voice intensity is low, or no frequency component is dominant.

Conventional techniques that recognize emotion from statistical representative values use the maximum, minimum, mean, and so on of the fundamental frequency. In the example of FIG. 29, the maximum and the range (the difference between the maximum and the minimum) are large and the mean is not low, so the emotion is recognized as "anger", "surprise", or the like.
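As a rough illustration of the conventional phrase-level approach, the following Python sketch computes the statistical representative values of a fundamental-frequency contour. The function name `phrase_statistics` and the contour values are hypothetical; the values were chosen only so that the statistics match the figures quoted for FIG. 29 (mean 160 Hz, maximum 200 Hz, minimum 80 Hz). `None` stands in for periods where no fundamental frequency can be extracted.

```python
from statistics import mean


def phrase_statistics(f0_hz):
    """Return mean, max, min and range of the extractable F0 samples.

    Entries equal to None mark periods where F0 could not be extracted
    (unvoiced sound, low intensity, no dominant component) and are skipped.
    """
    voiced = [f for f in f0_hz if f is not None]
    if not voiced:
        raise ValueError("no extractable F0 samples in this phrase")
    hi, lo = max(voiced), min(voiced)
    return {"mean": mean(voiced), "max": hi, "min": lo, "range": hi - lo}


# Hypothetical contour, not data taken from the patent figures.
contour = [None, 150.0, 200.0, 180.0, None, 160.0, 80.0, 190.0]
stats = phrase_statistics(contour)
print(stats)
```

Note that these four numbers discard the frame-by-frame shape of the contour, which is the limitation the surrounding text goes on to discuss.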

When feature quantities of the fundamental frequency are instead extracted per analysis frame, the fundamental frequency lies within its typical range for most of the time, while its slope is steeper than usual in many time intervals.

Perceptually, consider a speaker angrily shouting "fuzakeru na" ("Don't mess with me") or "iikagen ni shiro yo" ("Cut it out"). Anger seems to be present in each of the sounds "fu", "za", "ke", "ru", "na", and likewise in each of "i", "i", "ka", "ge", "n", "ni", "shi", "ro", "yo". A raised voice naturally yields the feature of high voice intensity, but some utterances sound gentle even at high volume. Anger is therefore presumably expressed in some form other than voice intensity.

As described above, in an analysis frame far shorter than an utterance or phrase, emotion is presumably conveyed by some feature other than statistical representative values of the fundamental frequency, voice intensity, amplitude, speech rate, and the like of the speech signal.

Apart from phase information, which humans often cannot perceive, the speech information in an analysis frame can be completely described by its autocorrelation function or its power spectrum. Describing the autocorrelation function with model parameters and statistically modeling the correlation between those parameters and emotion should therefore make it possible to recognize the emotion at the time of utterance of a speech signal with high accuracy and high reliability.

An object of the present invention is to solve the above problems and to provide an emotion recognition device, method, and program that can recognize the emotion at the time of utterance from a speech signal with high accuracy and high reliability.

Based on the above considerations, the present invention decomposes a speech signal into segments, generates the low-frequency component of the autocorrelation function for each segment, extracts emotion-related feature quantities from the low-frequency component for each analysis frame of several tens to several hundreds of milliseconds, and recognizes the emotion of the speech signal from the feature quantities in a plurality of analysis frames.

To solve the above problems, the present invention is an emotion recognition device for recognizing the emotion at the time of utterance from an input speech signal, comprising: segmentation means for decomposing the speech signal into segments; voice intensity measurement means for measuring the voice intensity of the speech signal in each segment and outputting a voice intensity signal for the segment; autocorrelation calculation means for calculating the autocorrelation function of each segment and outputting an autocorrelation signal for the segment; filter means for extracting pitch information from the autocorrelation signal and outputting a filtered autocorrelation signal for the segment; normalization means for normalizing the magnitude of the filtered autocorrelation signal by the voice intensity signal and outputting a normalized autocorrelation signal for the segment; feature quantity extraction means for taking a period of several tens to several hundreds of milliseconds containing one or more segments as an analysis frame, extracting feature quantities for the analysis frame from the normalized autocorrelation signal and the voice intensity signal, and outputting them sequentially; and pattern identification and emotion recognition means for recognizing the emotion at the time of utterance of the speech signal based on the correlation between the feature quantities and emotion and outputting an emotion label as the recognition result.

The pattern identification and emotion recognition means may comprise: pattern identification means for outputting the correlation value between the feature quantities and emotion as a pattern identification signal for each analysis frame; pattern storage means for holding the pattern identification signals; and emotion recognition means for reading a pattern identification signal sequence from the pattern storage means, recognizing the emotion from the correlation between feature quantities and emotion for the analysis frames in the sequence, and outputting an emotion label as the recognition result.

Alternatively, the pattern identification and emotion recognition means may comprise: pattern identification means for outputting the feature quantities as a pattern identification signal for each analysis frame; pattern storage means for holding the pattern identification signals; and emotion recognition means for reading a pattern identification signal sequence from the pattern storage means, recognizing the emotion at the time of utterance of the speech signal from the correlation between emotion and the occurrence frequency distribution of the pattern identification signals in the sequence, or from the correlation between emotion and the most frequent or an important pattern identification signal in that distribution, and outputting an emotion label as the recognition result.
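The frequency-distribution variant can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the pattern identifiers (`"P1"`, `"P2"`, `"P3"`), the lookup table `PATTERN_TO_EMOTION`, and the function name are all hypothetical, and the table stands in for a correlation that the source obtains through prior learning. The sketch uses the simplest option mentioned, the most frequent pattern identification signal.

```python
from collections import Counter

# Hypothetical table standing in for the learned correlation between
# pattern identification signals and emotions.
PATTERN_TO_EMOTION = {"P1": "anger", "P2": "joy", "P3": "neutral"}


def recognize_by_frequency(pattern_sequence):
    """Pick the emotion associated with the most frequent pattern signal.

    Returns the emotion label and the occurrence frequency distribution.
    """
    if not pattern_sequence:
        raise ValueError("empty pattern identification signal sequence")
    counts = Counter(pattern_sequence)
    most_common_pattern, _ = counts.most_common(1)[0]
    return PATTERN_TO_EMOTION[most_common_pattern], counts


# One pattern identification signal per analysis frame (made-up sequence).
label, counts = recognize_by_frequency(["P1", "P3", "P1", "P2", "P1", "P3"])
print(label, dict(counts))
```

A fuller version could compare the whole distribution in `counts` against learned per-emotion distributions rather than using only its mode.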

Preferably, the voice intensity measurement means detects a silent period in which the speech signal is at or below a predetermined threshold and outputs a silence detection signal, and the pattern identification and emotion recognition means recognizes the emotion at the timing at which the silence detection signal is output and outputs an emotion label as the recognition result.
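A minimal Python sketch of silence-triggered recognition follows. It assumes that intensity is measured as the RMS value of each segment and that recognition is triggered when a silent period begins; the source specifies neither detail, and the threshold value and function names are hypothetical.

```python
import math


def rms(segment):
    """Root-mean-square value of one segment, used here as its intensity."""
    return math.sqrt(sum(x * x for x in segment) / len(segment))


def silence_flags(segments, threshold):
    """True for every segment whose intensity is at or below the threshold."""
    return [rms(seg) <= threshold for seg in segments]


def recognition_triggers(flags):
    """Indices where a silent period begins (speech followed by silence)."""
    return [i for i in range(1, len(flags)) if flags[i] and not flags[i - 1]]


# Made-up segments: two voiced, two silent, one voiced, one silent.
segments = [[0.5, -0.5], [0.4, -0.3], [0.0, 0.01], [0.0, 0.0],
            [0.6, -0.6], [0.005, 0.0]]
flags = silence_flags(segments, threshold=0.05)
triggers = recognition_triggers(flags)
print(flags, triggers)
```

Triggering at the onset of silence means each recognition result covers the stretch of speech that has just ended.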

It is also preferable that the filter means be a low-pass filter with a cutoff frequency in the range of 500 Hz to 800 Hz.
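The source does not say how the low-pass filter is realized. As one possible sketch, the Python code below designs a Hamming-windowed sinc FIR filter with a 650 Hz cutoff, a value picked from within the stated 500 Hz to 800 Hz range. The 16 kHz rate matches the example sampling rate given in the description; the tap count and all function names are assumptions.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=101):
    """Hamming-windowed sinc low-pass coefficients, scaled to unity DC gain."""
    fc = cutoff_hz / sample_rate_hz          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def gain_at(taps, freq_hz, sample_rate_hz):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / sample_rate_hz
    re = sum(t * math.cos(w * n) for n, t in enumerate(taps))
    im = sum(t * math.sin(w * n) for n, t in enumerate(taps))
    return math.hypot(re, im)


def apply_fir(signal, taps):
    """Centered ("same"-length) convolution with zero padding at the edges."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + half - j
            if 0 <= idx < len(signal):
                acc += t * signal[idx]
        out.append(acc)
    return out


taps = lowpass_fir(cutoff_hz=650, sample_rate_hz=16000)
flat = apply_fir([1.0] * 300, taps)   # a constant input passes unchanged mid-signal
print(gain_at(taps, 100, 16000), gain_at(taps, 3000, 16000), flat[150])
```

In the device the filter would be applied to the autocorrelation signal of each segment (for example `apply_fir(autocorr, taps)`), so that components above the pitch range are suppressed before normalization.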

It is further preferable that the device include vowel detection means for detecting vowels in each analysis frame from the normalized autocorrelation signal, and that the pattern identification and emotion recognition means perform emotion recognition only on analysis frames in which the vowel detection means has detected a vowel.

It is also preferable that the feature quantity extraction means extract feature quantities for analysis frames of a plurality of different durations, and that the pattern identification and emotion recognition means perform emotion recognition based on the correlation between all of the obtained feature quantities and emotion and output an emotion label as the recognition result.

It is further preferable that the device include long-term feature quantity extraction means that takes the feature quantities extracted by the feature quantity extraction means as input and extracts feature quantities spanning a plurality of analysis frames, and that the pattern identification and emotion recognition means perform emotion recognition based on the correlation between emotion and all of the feature quantities extracted by the feature quantity extraction means and the long-term feature quantity extraction means, and output an emotion label as the recognition result.

The present invention can be realized not only as an emotion recognition device but also as an emotion recognition method and an emotion recognition program.

According to the present invention, a speech signal is decomposed into segments, the low-frequency component of the autocorrelation function is generated for each segment, emotion-related feature quantities are extracted from the low-frequency component, feature quantities for analysis frames of several tens to several hundreds of milliseconds are extracted from them, and emotion is recognized from the feature quantities in a plurality of analysis frames. Feature quantities that correctly reflect the speech information can thus be extracted, and on that basis the emotion at the time of utterance of the speech signal can be recognized stably with high accuracy and high reliability.

FIG. 1 is a block diagram giving an overview of an embodiment of the emotion recognition device of the present invention.
FIG. 2 is a diagram showing an example of the operation of the segmentation unit.
FIG. 3 is a block diagram showing an example of the configuration of the voice intensity measurement unit.
FIG. 4 is a block diagram showing an example of the configuration of the feature quantity extraction unit.
FIG. 5 is a diagram showing the waveform of the autocorrelation signal without band limitation.
FIG. 6 is a diagram showing the waveform of the autocorrelation signal with band limitation.
FIG. 7 is a diagram showing an example in which a change in the peak value of a sinusoidal signal is observed over a short time.
FIG. 8 is a diagram showing an example in which a change in the peak value of a sinusoidal signal is observed continuously over a long time.
FIG. 9 is a diagram illustrating the emotion recognition processing in the pattern identification and emotion recognition unit.
FIG. 10 is a block diagram showing a first embodiment of the emotion recognition device of the present invention.
FIG. 11 is a diagram showing a specific example of a pattern identification signal sequence held by the pattern storage unit.
FIG. 12 is a flowchart showing an example of the operation of the emotion recognition unit.
FIG. 13 is a block diagram showing a second embodiment of the emotion recognition device of the present invention.
FIG. 14 is a flowchart showing an example of the operation of the emotion recognition unit when emotion is recognized using the occurrence frequency distribution of pattern identification signals in a pattern identification signal sequence.
FIG. 15 is an explanatory diagram for the case in which emotion is recognized after the pattern identification signals in a pattern identification signal sequence are grouped.
FIG. 16 is a diagram showing a specific example of a pattern identification signal sequence supplied to the emotion recognition unit.
FIG. 17 is a block diagram showing a third embodiment of the emotion recognition device of the present invention.
FIG. 18 is a block diagram showing an example of the configuration of the vowel detection unit.
FIG. 19 is a diagram showing the block configuration when a statistical model for pattern identification is used in the third embodiment.
FIG. 20 is a diagram showing the block configuration when a statistical model for emotion recognition is used in the third embodiment.
FIG. 21 is a block diagram showing a fourth embodiment of the emotion recognition device of the present invention.
FIG. 22 is a diagram showing the block configuration of the feature quantity extraction unit when feature quantity extraction and emotion recognition are performed in parallel with a plurality of analysis frame lengths in the fourth embodiment.
FIG. 23 is a diagram showing the block configuration when a statistical model for pattern identification is used in the fourth embodiment.
FIG. 24 is a diagram showing the block configuration when a statistical model for emotion recognition is used in the fourth embodiment.
FIG. 25 is a block diagram showing a fifth embodiment of the emotion recognition device of the present invention.
FIG. 26 is a diagram showing the internal block configuration of the long-term feature quantity extraction unit when feature quantity extraction and emotion recognition are performed with a plurality of analysis frame lengths in the fifth embodiment.
FIG. 27 is a diagram showing the block configuration when a statistical model for pattern identification is used in the fifth embodiment.
FIG. 28 is a diagram showing the block configuration when a statistical model for emotion recognition is used in the fifth embodiment.
FIG. 29 is a diagram showing an example of statistical representative values for each phrase obtained from time-series data of fundamental frequency.
FIG. 30 is a diagram showing an example of statistical representative values for each phrase obtained from time-series data of voice intensity.

The present invention will now be described with reference to the drawings.

The following describes the present invention implemented as an emotion recognition device, but it can also be implemented as an emotion recognition method or an emotion recognition program.

FIG. 1 is a block diagram giving an overview of an embodiment of the emotion recognition device of the present invention.

The emotion recognition device of this embodiment includes a segmentation unit 10, an autocorrelation processing unit 20, a voice intensity measurement unit 30, a feature quantity extraction unit 40, and a pattern identification and emotion recognition unit 50. Each of these units may be implemented in either hardware or software.

The speech signal input for recognizing the emotion at the time of utterance is a digital signal sampled at an arbitrary rate, for example 16 ksps, and quantized at an arbitrary precision, for example 16 bits.

The segmentation unit 10 segments the input speech signal by cutting it out at predetermined intervals along the time axis. Rather than simply cutting out consecutive segments from the speech signal, it is preferable to cut them out with temporal overlap, as shown for example in FIG. 2.

FIG. 2 shows segmentation in which the central 10 ms is cut out of a 30 ms period of the speech signal that overlaps the preceding and following periods by 10 ms each. The segment length here is 10 ms. A segment length of about 10 ms is preferable, but several tens of milliseconds is also acceptable. FIG. 2 also shows the relationship between segments and analysis frames. In FIG. 2 an analysis frame consists of five segments, but an analysis frame need only contain one or more segments. As described later, speech feature quantities are extracted per analysis frame.
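The Python sketch below shows one possible reading of the FIG. 2 scheme, not a definitive one: each 10 ms segment is paired with a 30 ms window that extends 10 ms before and after it, windows advance by one segment length, and five consecutive segments form an analysis frame. The 16 kHz rate is the example rate from the description; the function names and the tuple layout are assumptions.

```python
def segment_windows(num_samples, sample_rate_hz=16000, segment_ms=10, overlap_ms=10):
    """Return (window_start, segment_start, segment_end, window_end) sample indices.

    Each window spans overlap + segment + overlap; the segment is its center.
    """
    seg = sample_rate_hz * segment_ms // 1000
    pad = sample_rate_hz * overlap_ms // 1000
    windows = []
    start = pad
    while start + seg + pad <= num_samples:
        windows.append((start - pad, start, start + seg, start + seg + pad))
        start += seg                     # advance by one segment length
    return windows


def group_into_frames(items, segments_per_frame=5):
    """Group consecutive segments into non-overlapping analysis frames."""
    last = len(items) - segments_per_frame + 1
    return [items[i:i + segments_per_frame]
            for i in range(0, last, segments_per_frame)]


windows = segment_windows(num_samples=1600)   # 100 ms of audio at 16 kHz
frames = group_into_frames(windows)
print(len(windows), windows[0], len(frames))
```

With these settings a segment is 160 samples and a window is 480 samples, so 100 ms of audio yields eight complete windows and one five-segment analysis frame of 50 ms.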

As described later, a plurality of analysis frames of different durations can also be used simultaneously to improve the detection accuracy of the speech feature quantities. That is, speech feature quantities extracted with a short analysis frame may be combined with those extracted with a long analysis frame to improve accuracy. Analysis frames are described in detail later.

The segment length may be set in consideration of the following: when recognizing emotion from a speech signal, it is preferable to analyze only vowels, whose spectral structure is clear; consonants mixed into the analysis degrade recognition accuracy, so their influence should be suppressed; some consonants are short in duration; and a long segment length increases the proportion of segments containing consonants within an analysis frame, which may increase the influence of consonants on the frame's feature quantities.

In consonant intervals the spectrum is whitened and no sinusoidal component is present, so including such intervals in the analysis causes misjudgment. Adding vowel detection, as described later, lowers the probability that consonants are mixed into the analysis and reduces the resulting misjudgments. Although details are given later, the number of zero crossings of a vowel's autocorrelation signal is related to the pitch frequency. Therefore, when zero crossings corresponding to a pitch frequency of, for example, 800 Hz or more are detected, contamination by consonants or noise is suspected, and it is preferable to stop feature extraction in such a case.
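One way to picture the zero-crossing check is sketched below. The text does not say how a zero-crossing count is converted to an equivalent pitch frequency, so this sketch assumes the simple relation that a sinusoid crosses zero twice per period. The 16 kHz rate, the 20 ms lag range, and all function names are hypothetical.

```python
import math


def zero_crossings(signal):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))


def implied_frequency_hz(acf, rate_hz):
    """Assumed conversion: two zero crossings per period of a sinusoid."""
    duration_s = (len(acf) - 1) / rate_hz
    return zero_crossings(acf) / (2.0 * duration_s)


def should_skip_extraction(acf, rate_hz, limit_hz=800.0):
    """True when the crossings imply a frequency at or above the limit."""
    return implied_frequency_hz(acf, rate_hz) >= limit_hz


RATE = 16000
lags = range(321)  # lags from 0 to 20 ms at the assumed rate
# Idealized autocorrelation of a 200 Hz voiced sound, and of a 1500 Hz disturbance.
voiced = [math.cos(2 * math.pi * 200 * k / RATE) for k in lags]
noisy = [math.cos(2 * math.pi * 1500 * k / RATE) for k in lags]
```

In this toy case `should_skip_extraction(voiced, RATE)` is false and `should_skip_extraction(noisy, RATE)` is true, so only the second signal would cause feature extraction to be suspended.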

自己相関処理部20は、自己相関計算部21、フィルタ処理部22および規格化部23を備える。自己相関計算部21は、例えば、５つの連続するセグメントの音声信号を入力とし、各セグメントについての自己相関関数を計算し、自己相関信号をフィルタ処理部22へ出力する。５つの連続するセグメントの音声信号をパラレルに自己相関計算部21へ入力してもよい。 The autocorrelation processing unit 20 includes an autocorrelation calculation unit 21, a filter processing unit 22, and a normalization unit 23. The autocorrelation calculation unit 21 receives, for example, audio signals of five consecutive segments, calculates an autocorrelation function for each segment, and outputs the autocorrelation signal to the filter processing unit 22. The audio signals of five consecutive segments may be input to the autocorrelation calculation unit 21 in parallel.

The amount of computation for the autocorrelation function can be reduced by, for example, calculating the power spectrum and applying an inverse Fourier transform to it. It is also known to apply a time window before calculating the power spectrum. In that case, depending on the shape of the time window, an autocorrelation signal reflecting frequency components of approximately 50 Hz and above can be obtained.
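The relation between the power spectrum and the autocorrelation function can be checked with a small sketch. To stay self-contained it uses a naive O(N^2) discrete Fourier transform, so it shows only the equivalence, not the speed-up; the saving mentioned above comes from replacing the transform with an FFT. The function names and the test signal are illustrative assumptions.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def autocorr_direct(x):
    """Autocorrelation computed directly from its definition."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag)) for lag in range(n)]


def autocorr_via_spectrum(x):
    """Autocorrelation as the inverse transform of the power spectrum."""
    n = len(x)
    padded = list(x) + [0.0] * n            # zero-pad to avoid circular wrap-around
    power = [abs(c) ** 2 for c in dft(padded)]
    return [v.real for v in dft(power, inverse=True)[:n]]


x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
direct = autocorr_direct(x)
spectral = autocorr_via_spectrum(x)
```

Both routes give the same values up to rounding error. Any time window would be applied to `x` before the transform.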

The filter processing unit 22 receives the autocorrelation signal output from the autocorrelation calculation unit 21, applies low-pass filtering to generate a post-LPF autocorrelation signal, and outputs it to the normalization unit 23. The formant information in a speech signal is closely related to linguistic information, whereas emotion is conveyed mainly by pitch information, so low-pass filtering is applied here to extract the pitch information. The cutoff frequency of this filter may be, for example, 500 Hz to 800 Hz.
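The text does not specify the filter type. As one possible realization, the sketch below designs a Hamming-windowed sinc FIR low-pass filter with a 650 Hz cutoff (inside the 500 Hz to 800 Hz range mentioned above). The 8 kHz rate, the 101 taps, and the function names are assumptions made for illustration.

```python
import math


def lowpass_taps(cutoff_hz, rate_hz, num_taps=101):
    """Hamming-windowed sinc low-pass taps, normalized to unity gain at DC."""
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / rate_hz
    taps = []
    for k in range(num_taps):
        n = k - mid
        ideal = 2 * fc if n == 0 else math.sin(2 * math.pi * fc * n) / (math.pi * n)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def gain_at(taps, freq_hz, rate_hz):
    """Magnitude response of a symmetric FIR filter at one frequency."""
    mid = (len(taps) - 1) / 2
    return abs(sum(h * math.cos(2 * math.pi * freq_hz * (k - mid) / rate_hz)
                   for k, h in enumerate(taps)))


def apply_fir(signal, taps):
    """Same-length convolution; samples outside the signal count as zero."""
    half = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(signal):
                acc += h * signal[j]
        out.append(acc)
    return out


RATE = 8000
taps = lowpass_taps(650.0, RATE)
pitch_gain = gain_at(taps, 200.0, RATE)      # typical pitch range: passed
formant_gain = gain_at(taps, 2000.0, RATE)   # formant range: suppressed
```

A component near typical pitch frequencies passes almost unchanged, while a component in the formant range is strongly attenuated, which is the separation the paragraph above relies on.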

The normalization unit 23 receives the post-LPF autocorrelation signal for each segment output from the filter processing unit 22 and the voice intensity signal for that segment output from the voice intensity measurement unit 30. It normalizes the post-LPF autocorrelation signal by the voice intensity signal to generate a normalized autocorrelation signal and outputs it to the feature quantity extraction unit 40. Because of the band limitation in the filter processing unit 22, the normalized autocorrelation signal is a sinusoidal signal.

音声強度測定部30は、セグメントについての音声信号強度を求め、それを特徴量抽出部40へ出力する。また、音声強度測定部30は、入力された音声信号における所定期間以上の無音を検出して無音検出信号を生成し、それをパターン識別・感情認識部50の感情認識部53へ出力する。無音検出信号は、感情認識部53で発話終了を判断するために用いられる。 The voice strength measurement unit 30 obtains the voice signal strength for the segment and outputs it to the feature quantity extraction unit 40. Further, the voice intensity measurement unit 30 detects silence in the input voice signal for a predetermined period or more to generate a silence detection signal, and outputs the silence detection signal to the emotion recognition unit 53 of the pattern identification / emotion recognition unit 50. The silence detection signal is used by the emotion recognition unit 53 to determine the end of the speech.
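Silence detection of this kind can be sketched as a run-length check on the per-segment intensity. The threshold, the required run length, and the name `silence_flags` are hypothetical; the text states only that silence lasting at least a predetermined period is detected.

```python
def silence_flags(intensities, threshold, min_segments):
    """Flag each segment at which the current quiet run is long enough."""
    run = 0
    flags = []
    for level in intensities:
        run = run + 1 if level < threshold else 0   # length of the quiet run
        flags.append(run >= min_segments)
    return flags


# Per-segment intensities (arbitrary illustrative values).
levels = [0.5, 0.4, 0.01, 0.01, 0.01, 0.3]
flags = silence_flags(levels, threshold=0.05, min_segments=3)
```

The flag becomes true only at the third consecutive quiet segment and is cleared as soon as speech resumes.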

FIG. 3 is a block diagram showing an example of a configuration for generating the voice intensity signal in the voice intensity measurement unit 30.

The voice intensity measurement unit 30 includes a peak hold circuit 31 and a filter processing unit 32. The peak hold circuit 31 detects the envelope of the speech signal within a segment and outputs it as a peak hold signal. The filter processing unit 32 applies low-pass filtering to the peak hold signal and outputs the resulting signal, in which the influence of noise is reduced, as the voice intensity signal.
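A software analogue of the peak hold and low-pass stages might look like the sketch below. The decay factor, the one-pole smoothing filter, and the choice of the last smoothed sample as the segment's intensity are all assumptions; the text describes only the two stages, not their parameters.

```python
import math


def peak_hold(samples, decay=0.999):
    """Track the envelope: jump to new peaks, otherwise decay slowly."""
    held = 0.0
    out = []
    for s in samples:
        held = max(abs(s), held * decay)
        out.append(held)
    return out


def one_pole_lowpass(signal, alpha=0.1):
    """Simple recursive smoothing filter standing in for the low-pass stage."""
    y = 0.0
    out = []
    for v in signal:
        y += alpha * (v - y)
        out.append(y)
    return out


def voice_intensity(segment):
    """Intensity of a segment, taken here as the last smoothed envelope value."""
    return one_pole_lowpass(peak_hold(segment))[-1]


# One 10 ms segment of a 200 Hz tone with amplitude 0.5 at an assumed 8 kHz rate.
segment = [0.5 * math.sin(2 * math.pi * 200 * k / 8000) for k in range(80)]
intensity = voice_intensity(segment)
```

For this tone the result settles just below the amplitude of 0.5.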

特徴量抽出部40は、規格化部23から出力される規格化後自己相関信号および音声強度測定部30から出力される音声強度信号を入力とし、分析フレームについての特徴量を抽出する。規格化後自己相関信号において感情の指標となる特徴量の変化は、数10ms程度の短時間で観測される場合や数100ms程度の期間にわたって継続的に観測される場合がある。 The feature amount extraction unit 40 receives the normalized autocorrelation signal output from the normalization unit 23 and the voice intensity signal output from the voice intensity measurement unit 30 as input, and extracts feature amounts for the analysis frame. Changes in the feature value serving as an indicator of emotion in the autocorrelation signal after normalization may be observed in a short time of about several tens of ms or continuously observed over a period of about several hundred ms.

そこで、分析フレームを数10msから数100msの範囲内の時間長に設定し、その時間長で特徴量を抽出することにより、音声信号についての感情を認識するための特徴量を適切に抽出できる。なお、自己相関信号および音声強度信号はセグメント単位で計算されるため、分析フレーム長は、セグメント長の整数倍としてもよい。なお、既に説明したように、感情による音声特徴量の変化速度はさまざまであり、これを精度よく検出するためには、時間長の異なる複数の分析フレームを併用することが好ましい。 Therefore, by setting the analysis frame to a time length in the range of several tens of ms to several hundreds of ms and extracting the feature amount with that time length, it is possible to appropriately extract the feature amount for recognizing the emotion of the audio signal. Since the autocorrelation signal and the voice strength signal are calculated in segment units, the analysis frame length may be an integral multiple of the segment length. As described above, the rate of change of the voice feature amount due to emotion is various, and in order to detect this accurately, it is preferable to use a plurality of analysis frames having different time lengths in combination.

図１は単一の分析フレームを使用する一例のブロック構成図であるが、後で複数分析フレームを使用する場合のブロック構成についても説明する。図１においては、使用する分析フレーム数を明示する意味で、分析フレーム長を本ブロックに印加する形としている。もちろん、分析フレーム長を複数のものの中から適宜のものを選択して特徴量抽出部40に設定できるようにしてもよいし、それを可変設定できるようにしてもよい。以下の実施形態でも同様である。 Although FIG. 1 is a block diagram of an example using a single analysis frame, a block configuration in the case of using a plurality of analysis frames will also be described later. In FIG. 1, the analysis frame length is applied to this block in the sense of clearly indicating the number of analysis frames to be used. Of course, an appropriate analysis frame length may be selected from among a plurality of analysis frame lengths and may be set in the feature amount extraction unit 40, or may be set variably. The same applies to the following embodiments.

FIG. 4 is a block diagram showing an example of the configuration of the feature quantity extraction unit 40.

特徴量抽出部40は、ピーク検出部41、対数圧縮部42、タイムラグ解析部43、ピーク値解析部44および音声強度解析部45を備える。 The feature amount extraction unit 40 includes a peak detection unit 41, a logarithmic compression unit 42, a time lag analysis unit 43, a peak value analysis unit 44, and a voice strength analysis unit 45.

The peak detection unit 41 receives the normalized autocorrelation signal output from the normalization unit 23, detects its peak time lag and peak value, and outputs the peak time lag to the time lag analysis unit 43 and the peak value to the peak value analysis unit 44. Because the normalized autocorrelation signal has been band-limited by the filter processing unit 22 and is therefore sinusoidal, the peak detection unit 41 detects the peak time lag and peak value of that sinusoidal signal.

図５および図６は、帯域制限の有無による自己相関信号の波形の違いを示す。 FIG. 5 and FIG. 6 show the difference in the waveform of the autocorrelation signal depending on the presence or absence of band limitation.

帯域制限を受けていない自己相関信号は、図５に示すように、発話内容に応じて時々刻々と大きく変化する。一方、フィルタ処理部22で帯域制限を受けた自己相関信号は、図６に示すように、発話内容に拘わらず安定した正弦波信号となる。また、感情を含む発話においては、正弦波信号のピークタイムラグおよびピーク値の変化が観測される。 As shown in FIG. 5, the autocorrelation signal not subjected to the band limitation greatly changes every moment according to the content of the utterance. On the other hand, as shown in FIG. 6, the autocorrelation signal which has been band-limited by the filter processing unit 22 becomes a stable sine wave signal regardless of the content of the utterance. In addition, in speech including emotion, peak time lag of the sine wave signal and change in peak value are observed.

Because the autocorrelation signal is sinusoidal, multiple peaks appear at approximately equal intervals. If the maximum of the autocorrelation signal is searched for, which of these peaks holds the maximum varies under the influence of temporal fluctuations in voice intensity, noise, and the like. In the following, therefore, the time lag and value of the first peak in the region of positive time lag (greater than 0) are taken as the peak time lag and the peak value, respectively.
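Picking the first peak at positive lag, rather than the global maximum, can be sketched as a scan for the first local maximum. The function name, the 8 kHz rate, and the idealized 200 Hz cosine standing in for the band-limited autocorrelation signal are assumptions.

```python
import math


def first_positive_peak(acf, rate_hz):
    """Return (lag in seconds, value) of the first local maximum at lag > 0."""
    for k in range(1, len(acf) - 1):
        if acf[k - 1] < acf[k] >= acf[k + 1]:
            return k / rate_hz, acf[k]
    return None  # no peak inside the analyzed lag range


RATE = 8000
# Idealized autocorrelation for a 200 Hz pitch: a cosine with a 5 ms period.
acf = [math.cos(2 * math.pi * 200 * k / RATE) for k in range(161)]
lag_s, value = first_positive_peak(acf, RATE)
```

Lag 0 is skipped because the signal only falls away from it, so the scan stops at the first pitch period (5 ms here) even though later peaks have the same height.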

These changes may be observed within a short time of several tens of ms or continuously over a period of several hundreds of ms. FIG. 7 shows an example in which a change in the peak value is observed within several tens of ms, and FIG. 8 shows a case in which it is observed continuously over several hundreds of ms. The short analysis frame 1 is effective for capturing an abrupt change (rise) over a short period, but these figures show that it must be combined with the long analysis frame 2 to grasp whether the trend continues.

That is, when the analysis frame is short relative to the rate of change of the feature quantity, as with analysis frame 1 in FIG. 8, the overall picture of the emotion-induced change cannot be captured within a single analysis frame. In this case the increasing trend of the feature quantity cannot be captured clearly with analysis frame 1 alone, but it can be captured by also using analysis frame 2.

Conversely, when the analysis frame is long relative to the rate of change of the feature quantity, as with analysis frame 2 in FIG. 7, no problem arises if the feature quantity increases or decreases monotonically within the frame. If its trajectory within the frame is convex or concave, however, only the average slope over the whole frame is obtained and an abrupt change cannot be captured correctly. Even then, the abrupt change can be captured by also using the short analysis frame 1.
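The effect of frame length on a convex trajectory can be illustrated numerically. The sketch assumes that the change rate is a least-squares slope per segment, which the text does not specify, and the peak values are arbitrary illustrative numbers shaped as a short symmetric bump.

```python
def slope(values):
    """Least-squares slope of the values against their segment index."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den if den else 0.0


# Per-segment peak values: flat, a brief bump, then flat again.
peak_values = [0.2] * 8 + [0.5, 0.8, 0.8, 0.5] + [0.2] * 8

long_slope = slope(peak_values)         # 20-segment frame covering the whole bump
short_slope = slope(peak_values[5:10])  # 5-segment frame covering only the rise
```

The long frame averages the rise and fall to a slope of essentially zero, while the short frame reports a clear positive slope, which is why the two lengths complement each other.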

The time lag analysis unit 43 receives the peak time lag output from the peak detection unit 41 and uses a period of several tens of ms to several hundreds of ms, for example 50 ms, as the analysis frame. It holds the peak time lags detected within the analysis frame and outputs the average time lag, the maximum time lag, and the time lag change rate to the pattern identification unit 51 of the pattern identification / emotion recognition unit 50 as feature quantities for that analysis frame.

Similarly, the peak value analysis unit 44 receives the peak value output from the peak detection unit 41, holds the peak values detected within the analysis frame, and outputs the average peak value, the maximum peak value, and the peak value change rate to the pattern identification unit 51 of the pattern identification / emotion recognition unit 50 as feature quantities for that analysis frame.

対数圧縮部42は、音声強度測定部30から出力される音声強度信号を入力とし、対数圧縮後音声強度を出力する。音声強度測定部30から出力される音声強度信号は、通常、そのダイナミックレンジが100dB以上ある。そこで、ここでは音声強度信号を対数圧縮して広範囲の入力レベルの音声信号強度を扱うことができるようにする。 The logarithmic compression unit 42 receives the speech strength signal output from the speech strength measurement unit 30 as an input, and outputs the logarithmically compressed speech strength. The voice strength signal output from the voice strength measurement unit 30 usually has a dynamic range of 100 dB or more. Therefore, the speech strength signal is logarithmically compressed here so that speech signal strength in a wide range of input levels can be handled.

The voice intensity analysis unit 45 receives the log-compressed voice intensity output from the logarithmic compression unit 42 and holds the values for the segments (10 ms) of the analysis frame (50 ms). It calculates their average, maximum, and rate of change, and outputs the average voice intensity, the maximum voice intensity, and the voice intensity change rate to the pattern identification unit 51 of the pattern identification / emotion recognition unit 50 as feature quantities for that analysis frame.

As described above, the feature quantity extraction unit 40 extracts feature quantities from the normalized autocorrelation signal and the voice intensity signal (in the example of FIG. 4, nine elements: average time lag, maximum time lag, time lag change rate, average peak value, maximum peak value, peak value change rate, average voice intensity, maximum voice intensity, and voice intensity change rate) and outputs them to the pattern identification / emotion recognition unit 50 as the feature quantities for the analysis frame.
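Assembling the nine-element feature vector for one analysis frame can be sketched as follows. The definition of the change rate (a least-squares slope per segment), the decibel form of the logarithmic compression, the floor that avoids the logarithm of zero, and the function names are assumptions; the input values are arbitrary.

```python
import math


def mean_max_slope(values):
    """Average, maximum, and least-squares slope of a per-segment series."""
    n = len(values)
    mean = sum(values) / n
    x_mean = (n - 1) / 2
    den = sum((i - x_mean) ** 2 for i in range(n))
    num = sum((i - x_mean) * (v - mean) for i, v in enumerate(values))
    return mean, max(values), (num / den if den else 0.0)


def frame_features(peak_lags, peak_values, intensities):
    """Nine features: (mean, max, change rate) for lag, peak, and log intensity."""
    log_intensity = [20 * math.log10(max(a, 1e-10)) for a in intensities]
    features = []
    for series in (peak_lags, peak_values, log_intensity):
        features.extend(mean_max_slope(series))
    return features


# One 50 ms frame of five 10 ms segments (arbitrary illustrative values).
features = frame_features(
    peak_lags=[5.0, 5.0, 4.8, 4.6, 4.4],      # peak time lag in ms
    peak_values=[0.6, 0.6, 0.7, 0.8, 0.9],    # normalized peak value
    intensities=[0.1, 0.1, 0.1, 0.1, 0.1],    # linear voice intensity
)
```

The order of the returned list follows the order in which the nine elements are listed in the paragraph above.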

パターン識別・感情認識部50は、パターン識別部51、パターン記憶部52および感情認識部53を備える。 The pattern identification / emotion recognition unit 50 includes a pattern identification unit 51, a pattern storage unit 52, and an emotion recognition unit 53.

The pattern identification unit 51 receives the feature quantities for the analysis frame output from the feature quantity extraction unit 40 and outputs either a pattern identification signal that reflects the correlation between the feature quantities and emotions, or a pattern identification signal that reflects the feature quantities themselves (the pattern of their elements).

パターン記憶部52は、パターン識別部51から出力される分析フレームについてのパターン識別信号を保持する。パターン記憶部52は、続く分析フレームについてのパターン識別信号を順次に記憶する。 The pattern storage unit 52 holds a pattern identification signal for the analysis frame output from the pattern identification unit 51. The pattern storage unit 52 sequentially stores pattern identification signals for subsequent analysis frames.

The emotion recognition unit 53 reads the pattern identification signals for the analysis frames sequentially from the pattern storage unit 52 as a pattern identification signal sequence, recognizes the emotion at the time of utterance based on the correlation between that sequence and emotions, and outputs an emotion label as the recognition result. For example, when the pattern identification signal reflects the correlation between the feature quantities of an analysis frame and emotions, the emotion can be recognized from those correlations. When the pattern identification signal instead reflects the feature quantities (the pattern of their elements) output from the feature quantity extraction unit 40, the correlation between the pattern identification signals in the sequence and emotions is obtained at this stage, and the emotion is recognized from that correlation.

As described above, the pattern identification / emotion recognition unit 50 recognizes the emotion at the time of utterance based on the correlation between emotions and the feature quantities for the analysis frames output from the feature quantity extraction unit 40, and outputs an emotion label as the recognition result.

FIG. 9 is a diagram illustrating the emotion recognition processing in the pattern identification / emotion recognition unit 50.

The feature quantity for an analysis frame has as many elements as there are extracted quantities (nine in the example of FIG. 4) and can be represented as a vector in a space of that many dimensions. In FIG. 9 the feature quantity has two elements, feature quantities 1 and 2, so the feature quantity for an analysis frame is represented as a position vector in the two-dimensional space they span. Two elements are used here to simplify the explanation; with nine feature quantities per analysis frame, as in the example of FIG. 4, the feature pattern of an analysis frame is a vector in a nine-dimensional space.

このように、分析フレームについての特徴量のパターンを、特徴量が張る空間におけるベクトルで表したとき、それらのベクトルを感情ごとにグループ分けしてグループのベクトルを感情に関連付けることができる。これは、パターン識別信号系列の分析フレームについての特徴量に基づいて感情を認識できることを意味している。 Thus, when the pattern of the feature amount for the analysis frame is represented by a vector in the space in which the feature amount spans, those vectors can be grouped by emotion and the vector of the group can be associated with the emotion. This means that emotions can be recognized based on the feature amount of the analysis frame of the pattern identification signal sequence.

以上のことから、パターン識別・感情認識部50では、分析フレームについての特徴量と感情との相関に基づいて音声信号についての発話時の感情を認識する。その具体的手法については、各実施形態とともに後で説明する。 From the above, the pattern identification and emotion recognition unit 50 recognizes the emotion at the time of the speech signal utterance based on the correlation between the feature amount of the analysis frame and the emotion. The specific method will be described later along with each embodiment.

図１０は、本発明の感情認識装置の第１の実施形態を示すブロック図である。なお、図１と同一あるいは同等部分には同じ符号を付してある。 FIG. 10 is a block diagram showing a first embodiment of the emotion recognition device of the present invention. The same or equivalent parts as in FIG. 1 are denoted by the same reference numerals.

In the first embodiment, a statistical model is used to obtain correlation values between the feature quantities of each analysis frame and emotions, the emotion at the time of utterance is recognized from these correlation values, and an emotion label is output as the recognition result.

First, creation of the statistical model used by the pattern identification unit 51 is described. The statistical model can be created in advance from speech signals prepared for training.

As shown in FIG. 10, a training speech signal whose emotion is known in advance is input to the segmentation unit 10, and the feature quantity extraction unit 40 extracts the feature quantities for its analysis frames and supplies them to the statistical model generation unit 60. At the same time, the pattern identification signal for those feature quantities is supplied to the statistical model generation unit 60. The statistical model generation unit 60 generates a statistical model representing the correlation between the two from the feature quantities for the analysis frames and the corresponding pattern identification signals.

なお、図１０は単一の分析フレームを使用する一例のブロック構成図であるが、後で複数分析フレームを使用する場合のブロック構成についても説明する。 Although FIG. 10 is a block configuration diagram of an example using a single analysis frame, a block configuration in the case of using a plurality of analysis frames later will also be described.

The statistical model can be expressed, for example, as a matrix W, which is used to generate the pattern identification signal corresponding to the feature quantities of each analysis frame of the input speech signal. The matrix W can be obtained by, for example, the maximum likelihood method or a support vector machine (SVM).

Equation (1) is an example of the matrix W representing the statistical model. As expressed by equation (2), multiplying the matrix W by the column vector v consisting of the nine feature quantities of an analysis frame yields the vector r, a pattern identification signal representing the correlation between those nine feature quantities and each of four emotions.
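The numerical entries of the matrix in equation (1) are not reproduced in this text. From the description alone (nine feature quantities in, four emotion correlations out), the two equations have the following shape, where the entries w_{i,j} are placeholders rather than values from the source:

```latex
W =
\begin{pmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,9} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,9} \\
w_{3,1} & w_{3,2} & \cdots & w_{3,9} \\
w_{4,1} & w_{4,2} & \cdots & w_{4,9}
\end{pmatrix}
\tag{1}

r = W v, \qquad
r_i = \sum_{j=1}^{9} w_{i,j}\, v_j \quad (i = 1, \dots, 4)
\tag{2}
```

Here v is the 9-by-1 column vector of feature quantities and r is the 4-by-1 vector of correlations, one element per emotion.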

The pattern identification signal generated with the matrix W is a vector r such as {40, 8, 11, 27}, whose elements represent the correlation between the feature quantities of the analysis frame and each of four emotions, for example "anger", "sadness", "boredom", and "surprise". If a larger value means a stronger correlation, this example indicates that the feature quantities correlate most strongly with "anger", correlate to some extent with "surprise", and correlate only weakly with "sadness" and "boredom".

以上では、分析フレームについての特徴量と感情との相関をパターン識別信号としたが、特徴量と感情との相関が最大となる要素に対応する感情の感情ラベルをパターン識別信号としてもよい。例えば、ベクトルでの表現が｛４０，８，１１，２７｝の場合、"怒り"の感情ラベルをパターン識別信号とすることができる。このように、パターン識別信号は、発話時の感情を表す感情ラベルでもよい。 Although the correlation between the feature amount and the emotion in the analysis frame is used as the pattern identification signal in the above, the emotion label of the emotion corresponding to the element that maximizes the correlation between the feature amount and the emotion may be used as the pattern identification signal. For example, when the vector representation is {40, 8, 11, 27}, the emotion label of "anger" can be used as the pattern identification signal. Thus, the pattern identification signal may be an emotion label representing an emotion at the time of speech.

An emotion label may represent emotions such as "anger", "sadness", "boredom", and "surprise" by numbers such as "1", "2", "3", and "4", or by vectors such as {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, and {0, 0, 0, 1}.
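The steps from feature vector to emotion label can be sketched in a few lines. The matrix `W_toy` below is a made-up placeholder, not the matrix of equation (1), and the function names are hypothetical; only the example vector {40, 8, 11, 27} and the label encodings come from the text.

```python
EMOTIONS = ("anger", "sadness", "boredom", "surprise")


def pattern_vector(W, v):
    """Matrix-vector product r = W v: one correlation value per emotion."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def top_emotion(r):
    """Emotion whose correlation value is the largest."""
    return EMOTIONS[max(range(len(r)), key=lambda i: r[i])]


def one_hot(label):
    """Vector form of an emotion label, e.g. "sadness" -> [0, 1, 0, 0]."""
    return [1 if name == label else 0 for name in EMOTIONS]


def numeric(label):
    """Numeric form of an emotion label, counting from 1."""
    return EMOTIONS.index(label) + 1


# The example vector quoted in the text.
label = top_emotion([40, 8, 11, 27])

# A made-up 4x9 matrix and an all-ones feature vector, for shape only.
W_toy = [[1.0] * 9, [0.0] * 9, [0.5] * 9, [2.0] * 9]
r_toy = pattern_vector(W_toy, [1.0] * 9)
```

For the quoted vector the largest element is the first, so the label is "anger"; either the full vector or only this label can serve as the pattern identification signal.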

パターン記憶部52は、パターン識別部51から順次に出力される分析フレームについてのパターン識別信号を保持する。 The pattern storage unit 52 holds pattern identification signals for analysis frames sequentially output from the pattern identification unit 51.

FIG. 11 is a diagram showing a specific example of the pattern identification signal sequence held by the pattern storage unit 52.

Here, the values in the first row are the correlation values between the feature quantities of the first analysis frame and each of the four emotions: the correlations with "anger", "sadness", "boredom", and "surprise" are {20, 10, 15, 17}. The values in the second row are the corresponding correlation values for the next analysis frame: {80, 7, 25, 37}. The same applies to the following rows. Although correlations for nine consecutive analysis frames are shown here, the pattern storage unit 52 holds, as pattern identification signals, the correlations for the analysis frames of the input speech signal.

The emotion recognition unit 53 reads the pattern identification signals for the analysis frames sequentially from the pattern storage unit 52 as a pattern identification signal sequence and identifies the emotion at the time of utterance based on the correlations for the analysis frames in that sequence.

When, as in the first embodiment, the correlation between the feature quantities and emotions is obtained for each analysis frame using a statistical model that takes the feature quantities as input, and is held in the pattern storage unit 52 as the pattern identification signal, the correlations for the analysis frames in the pattern identification signal sequence can be summed for each emotion, and the emotion with the largest total can be recognized as the emotion at the time of utterance.

For example, in FIG. 11 the total correlation is largest for "anger", so the emotion at the time of utterance is recognized as "anger". Instead of totaling the correlations, the emotion showing the single largest correlation, or the emotion for which the number of analysis frames with a correlation at or above a predetermined threshold is largest, may be recognized as the emotion at the time of utterance.
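The three decision rules mentioned above can be compared on the two rows of FIG. 11 that are quoted in the text. The remaining seven rows of the figure are not reproduced here, so the sketch uses only those two; the function names and the threshold of 50 are assumptions.

```python
EMOTIONS = ("anger", "sadness", "boredom", "surprise")


def _argmax_label(scores):
    """Label of the highest score (the first one wins on a tie)."""
    return EMOTIONS[max(range(len(scores)), key=lambda i: scores[i])]


def by_total(rows):
    """Sum the correlations per emotion over all frames."""
    return _argmax_label([sum(col) for col in zip(*rows)])


def by_single_max(rows):
    """Use the largest single correlation found in any frame."""
    return _argmax_label([max(col) for col in zip(*rows)])


def by_threshold_count(rows, threshold):
    """Count the frames whose correlation reaches the threshold."""
    return _argmax_label([sum(1 for v in col if v >= threshold)
                          for col in zip(*rows)])


# First two rows of FIG. 11 as quoted in the text.
rows = [[20, 10, 15, 17],
        [80, 7, 25, 37]]
totals = [sum(col) for col in zip(*rows)]
```

On these two rows all three rules agree on "anger". With longer sequences they can disagree, for example when one emotion has a single very large value and another has many moderate ones.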

図１２は、感情認識部53の動作の一例を示すフローチャートである。 FIG. 12 is a flowchart showing an example of the operation of the emotion recognition unit 53.

感情認識部53は、音声強度測定部30から出力される無音検出信号により発話が終了したかどうかを判断する。ここで、発話が終了したと判断した場合には、パターン識別信号系列における分析フレームについての相関値を感情ごとに加算し、その合計値が最大となる感情を音声信号についての発話時の感情と認識し、その認識結果の感情ラベルを出力する。 The emotion recognition unit 53 determines, based on the silence detection signal output from the voice strength measurement unit 30, whether the speech has ended. Here, if it is determined that the utterance has ended, the correlation value for the analysis frame in the pattern identification signal sequence is added for each emotion, and the emotion for which the total value is maximum is the emotion at the time of the utterance for the voice signal. Recognize and output an emotion label of the recognition result.

以上のように、第１の実施形態では、パターン識別部51で、分析フレームについての特徴量と感情との相関値を求め、その相関値のパターン識別信号を生成し、感情認識部53で、パターン識別信号系列における分析フレームについての相関値を感情ごとに加算することにより音声信号についての発話時の感情を認識するようにしているが、以下に説明するように、分析フレームについての特徴量をパターン識別信号とし、パターン識別信号系列におけるパターン識別信号の発生頻度分布(ヒストグラム)と感情との相関に基づいて音声信号についての発話時の感情を認識することもできる。 As described above, in the first embodiment, the pattern identification unit 51 obtains the correlation value between the feature amount and emotion for the analysis frame, generates the pattern identification signal of the correlation value, and the emotion recognition unit 53 The emotion at the time of speech of the speech signal is recognized by adding the correlation value for the analysis frame in the pattern identification signal sequence for each emotion, but as described below, the feature value for the analysis frame is As a pattern identification signal, it is also possible to recognize an emotion at the time of speech about a voice signal based on the correlation between the occurrence frequency distribution (histogram) of the pattern identification signal in the pattern identification signal sequence and the emotion.

図１３は、本発明の感情認識装置の第２の実施形態を示すブロック図である。なお、図１と同一あるいは同等部分には同じ符号を付してある。 FIG. 13 is a block diagram showing a second embodiment of the emotion recognition device of the present invention. The same or equivalent parts as in FIG. 1 are denoted by the same reference numerals.

第２の実施形態は、特徴量抽出部40から出力される特徴量をパターン識別信号とし、統計モデルを用いて、パターン識別信号系列におけるパターン識別信号の発生頻度分布と感情との相関に基づいて音声信号についての発話時の感情を認識するようにしたものである。 In the second embodiment, the feature quantity output from the feature quantity extraction unit 40 is used as a pattern identification signal, and using a statistical model, based on the correlation between the occurrence frequency distribution of pattern identification signals in the pattern identification signal sequence and emotion. It is made to recognize the emotion at the time of the utterance about a voice signal.

FIG. 13 is a block diagram of an example that uses a single analysis frame; the block configuration for the case where multiple analysis frames are used is described later.

First, creation of the statistical model used by the emotion recognition unit 53 is described. The statistical model can be created in advance from training voice signals.

As shown in FIG. 13, pattern identification signals and a pattern identification signal sequence are generated from a training voice signal whose emotion is known in advance, and the occurrence frequency distribution of the pattern identification signals in that sequence is supplied to the statistical model generation unit 70. At the same time, the emotion label corresponding to that occurrence frequency distribution is supplied to the statistical model generation unit 70.

The statistical model generation unit 70 generates, from the occurrence frequency distribution of the pattern identification signals in the pattern identification signal sequence and the emotion label, a statistical model that represents the correlation between the two. The statistical model can be generated, for example, by building a histogram of the pattern identification signals in the pattern identification signal sequence generated from the training voice signal and applying machine learning, such as deep learning or an SVM (Support Vector Machine), to the values of that histogram.

When recognizing the emotion of a voice signal, the emotion recognition unit 53 obtains the occurrence frequency distribution of the pattern identification signals in the pattern identification signal sequence generated from the input voice signal, uses the statistical model to recognize the emotion highly correlated with that distribution as the emotion at the time of utterance of the voice signal, and outputs the emotion label of the recognition result.
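The training and recognition flow with a histogram-based model might look like the sketch below. Note the substitution: the description names deep learning or an SVM as examples, whereas this sketch uses a much simpler nearest-mean-histogram classifier purely to keep the example self-contained. The function names, the emotion labels, the four-pattern alphabet, and the training sequences are assumptions for illustration.

```python
from collections import Counter


def histogram(sequence, num_patterns):
    """Relative occurrence frequency of pattern IDs 1..num_patterns in a sequence."""
    if not sequence:
        return [0.0] * num_patterns
    counts = Counter(sequence)
    return [counts.get(p, 0) / len(sequence) for p in range(1, num_patterns + 1)]


def train(labeled_sequences, num_patterns):
    """Build a toy model: the mean histogram for each emotion label.

    labeled_sequences: list of (pattern ID sequence, emotion label) pairs
    obtained from training voice signals whose emotion is known.
    """
    sums, counts = {}, Counter()
    for sequence, label in labeled_sequences:
        acc = sums.setdefault(label, [0.0] * num_patterns)
        for i, value in enumerate(histogram(sequence, num_patterns)):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def recognize(model, sequence, num_patterns):
    """Return the label whose mean histogram is closest (squared Euclidean distance)."""
    h = histogram(sequence, num_patterns)

    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(h, model[label]))

    return min(model, key=distance)


# Hypothetical training data with only four pattern IDs.
training = [
    ([1, 1, 2, 1], "joy"),
    ([1, 2, 1, 1], "joy"),
    ([3, 4, 3, 3], "anger"),
    ([4, 3, 3, 4], "anger"),
]
model = train(training, num_patterns=4)
label = recognize(model, [1, 1, 1, 2, 3], num_patterns=4)
```

Replacing `train` and `recognize` with an SVM or a neural network would leave the surrounding flow unchanged: the histogram is the model input and the emotion label is the output.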

FIG. 14 is a flowchart showing an example of the operation of the emotion recognition unit 53 when emotion is recognized using the occurrence frequency distribution of the pattern identification signals in the pattern identification signal sequence.

The emotion recognition unit 53 determines whether the utterance has ended on the basis of the silence detection signal output from the voice intensity measurement unit 30. When it determines that the utterance has ended, it calculates the occurrence frequency of each pattern identification signal in the pattern identification signal sequence, recognizes the emotion highly correlated with the resulting occurrence frequency distribution as the emotion at the time of utterance of the voice signal, and outputs the emotion label of the recognition result.

The emotion recognition unit 53 can also, for example, recognize as the emotion at the time of utterance the emotion highly correlated with the pattern identification signal that occurs most frequently in the pattern identification signal sequence. Furthermore, the pattern identification unit 51 may divide the feature quantities of the analysis frames into groups and output a pattern identification signal for each group, and the emotion recognition unit 53 may then use the statistical model to recognize as the emotion at the time of utterance the emotion highly correlated with the occurrence frequency distribution of the pattern identification signals in the pattern identification signal sequence, or with the most frequent pattern identification signal.

FIG. 15 is an explanatory diagram for the case where the pattern identification signals in the pattern identification signal sequence are grouped before the emotion is recognized. Here too, the number of feature quantities is set to two to simplify the explanation.

In the example of FIG. 15, feature quantities 1 and 2 are each quantized into eight levels, so that the pattern identification signals are divided into 64 groups. A method such as the K-means method can also be used for this grouping. Here too, the emotion at the time of utterance of the voice signal can be recognized on the basis of the correlation between the emotions and either the per-group occurrence frequency distribution of the pattern identification signals in the pattern identification signal sequence or the most frequent pattern identification signal.
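The 8 x 8 grouping can be sketched as below. This is a hedged illustration: uniform quantization over a fixed value range, the function names, and the way the two level indices are combined into a pattern number from 1 to 64 are all assumptions, since the actual numbering of the groups in FIG. 15 is not stated in the text.

```python
def quantize(value, low, high, levels=8):
    """Map a value in [low, high] to a level index 0..levels-1 (uniform steps).

    Values outside the range are clamped to the nearest level.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    step = (high - low) / levels
    index = int((value - low) / step)
    return min(max(index, 0), levels - 1)


def pattern_id(feature1, feature2, range1, range2, levels=8):
    """Combine the two quantized features into one pattern number 1..levels**2.

    The row-major numbering used here is an assumption for illustration.
    """
    q1 = quantize(feature1, range1[0], range1[1], levels)
    q2 = quantize(feature2, range2[0], range2[1], levels)
    return q1 * levels + q2 + 1


# Both features are assumed to lie in [0, 1] for this example.
first = pattern_id(0.0, 0.0, (0.0, 1.0), (0.0, 1.0))
last = pattern_id(1.0, 1.0, (0.0, 1.0), (0.0, 1.0))
middle = pattern_id(0.5, 0.25, (0.0, 1.0), (0.0, 1.0))
```

A clustering method such as K-means would replace `quantize` with a nearest-centroid lookup, which lets the group boundaries follow the distribution of the training data instead of a fixed grid.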

FIG. 16 is a diagram showing a specific example of a pattern identification signal sequence supplied to the emotion recognition unit 53.

The figure shows the pattern identification signal sequence for ten consecutive analysis frames held by the pattern storage unit 52; the pattern storage unit 52 holds the pattern identification signal sequence consisting of the pattern identification signals for every analysis frame of the input voice signal. The pattern identification signals "pattern 1, pattern 2, ..., pattern 64" correspond to the 64 groups in FIG. 15.

In the pattern identification signal sequence of FIG. 16, the pattern identification signal that appears most frequently is pattern 5, so the emotion highly correlated with pattern 5 can be recognized as the emotion at the time of utterance of the voice signal.
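Selecting the most frequent pattern can be sketched as follows. The ten-frame sequence below is hypothetical (it is not the sequence of FIG. 16, which is not reproduced in the text) and was chosen only so that pattern 5 is the most frequent; the pattern-to-emotion table is likewise an assumption.

```python
from collections import Counter


def most_frequent_pattern(sequence):
    """Return the pattern ID that occurs most often in the sequence.

    On a tie, Counter.most_common keeps the pattern that appeared first.
    """
    if not sequence:
        return None
    return Counter(sequence).most_common(1)[0][0]


# Hypothetical sequence for ten consecutive analysis frames.
sequence = [5, 12, 5, 5, 20, 5, 12, 5, 33, 5]

# Hypothetical table of the emotion most correlated with each pattern.
pattern_to_emotion = {5: "joy", 12: "anger", 20: "sadness"}

top = most_frequent_pattern(sequence)
emotion = pattern_to_emotion.get(top)
```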

Pattern identification signals that are important for recognizing emotion (important patterns) may also be held in advance as the statistical model; when the pattern identification signal sequence contains an important pattern, the emotion highly correlated with that important pattern can be recognized as the emotion at the time of utterance of the voice signal. Furthermore, a plurality of important patterns may be held in advance as the statistical model; when the pattern identification signal sequence contains more than one of them, the emotion highly correlated with the important pattern of higher priority, according to a predetermined priority order, can be recognized as the emotion at the time of utterance of the voice signal.
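The priority rule for important patterns can be sketched as a lookup over an ordered list. The function name, the pattern numbers, and the emotion labels are assumptions for illustration; the text does not say how important patterns or their priorities are chosen.

```python
def emotion_from_important_patterns(sequence, important_patterns):
    """Return the emotion of the highest-priority important pattern in the sequence.

    important_patterns: list of (pattern ID, emotion label) pairs ordered from
    highest to lowest priority. Returns None if no important pattern occurs.
    """
    present = set(sequence)
    for pattern, emotion in important_patterns:
        if pattern in present:
            return emotion
    return None


# Hypothetical model: pattern 40 outranks pattern 7.
important = [(40, "anger"), (7, "sadness")]

both = emotion_from_important_patterns([5, 7, 5, 40, 5], important)
only_low = emotion_from_important_patterns([5, 7, 5], important)
neither = emotion_from_important_patterns([5, 5], important)
```

When no important pattern is present, a caller could fall back to the histogram-based or most-frequent-pattern decision described earlier.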

FIG. 17 is a block diagram showing a third embodiment of the emotion recognition device of the present invention; parts identical or equivalent to those in FIG. 1 are denoted by the same reference numerals.

As noted above, the accuracy of emotion recognition deteriorates when consonants are mixed into an analysis frame. This deterioration can be prevented by extracting feature quantities only during vowel utterance, or by performing pattern identification from the feature quantities only during vowel utterance.

In the third embodiment, therefore, a vowel detection unit 90 is provided that detects vowel utterance on the basis of the frequency of zero crossings in the normalized autocorrelation signal output from the normalization unit 23 of the autocorrelation processing unit 20, and the resulting vowel detection signal is output to the pattern identification unit 51.

The pattern identification unit 51 basically outputs a pattern identification signal for every analysis frame, but it does not output one for an analysis frame for which no vowel detection signal is output from the vowel detection unit 90 (that is, for which the vowel detection signal indicates invalid).

FIG. 17 is a block diagram of an example that uses a single analysis frame; the block configuration for the case where multiple analysis frames are used is described later.

FIG. 20 is a block diagram showing an example of the configuration of the vowel detection unit 90.

The vowel detection unit 90 includes a zero-crossing counting unit 91 and a threshold determination unit 92.

The zero-crossing counting unit 91 receives the normalized autocorrelation signal output from the normalization unit 23 of the autocorrelation processing unit 20, detects the zero crossings in the normalized autocorrelation signal for one analysis frame (for example, five segments), and outputs an autocorrelation zero-crossing count signal indicating the number of zero crossings.

The threshold determination unit 92 receives the autocorrelation zero-crossing count signal output from the zero-crossing counting unit 91 and outputs the vowel detection signal when the autocorrelation zero-crossing count is less than or equal to a predetermined threshold. The predetermined threshold for the autocorrelation zero-crossing count is set, for example, to about 50 to 100 when the analysis frame length is 50 ms.
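The two stages of the vowel detection unit 90 can be sketched as a zero-crossing counter followed by a threshold test. This is an illustration under stated assumptions: the function names are invented, samples equal to zero are skipped rather than counted, and the test signals are plain sine waves at an assumed 8 kHz sampling rate standing in for the normalized autocorrelation signal of one 50 ms analysis frame.

```python
import math


def count_zero_crossings(samples):
    """Count sign changes in a sequence of samples, ignoring exact zeros."""
    count = 0
    previous_sign = 0
    for sample in samples:
        sign = (sample > 0) - (sample < 0)
        if sign == 0:
            continue
        if previous_sign != 0 and sign != previous_sign:
            count += 1
        previous_sign = sign
    return count


def is_vowel_frame(samples, threshold=50):
    """Return True when the zero-crossing count is at or below the threshold."""
    return count_zero_crossings(samples) <= threshold


def sine_frame(frequency_hz, sample_rate_hz=8000, duration_s=0.05, phase=0.1):
    """Synthetic stand-in for one analysis frame (not the patent's signal)."""
    n = int(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * frequency_hz * i / sample_rate_hz + phase)
            for i in range(n)]


slow = sine_frame(200)    # few zero crossings, treated as vowel-like here
fast = sine_frame(3000)   # many zero crossings, treated as consonant-like here
slow_is_vowel = is_vowel_frame(slow, threshold=50)
fast_is_vowel = is_vowel_frame(fast, threshold=50)
```

The threshold of 50 is taken from the lower end of the 50 to 100 range given above for a 50 ms analysis frame.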

Although omitted from FIG. 17, the pattern identification unit 51 or the emotion recognition unit 53 uses a statistical model generated during prior training. FIG. 18 shows the block configuration when a statistical model for pattern identification is used, and FIG. 19 shows the block configuration when a statistical model for emotion recognition is used. The generation of the statistical model and the emotion recognition operation using it are the same as in the first and second embodiments, so their description is omitted.

FIG. 21 is a block diagram showing a fourth embodiment of the emotion recognition device of the present invention; parts identical to those in FIG. 1 are denoted by the same reference numerals.

As noted above, the normalized autocorrelation signal output from the normalization unit 23 of the autocorrelation processing unit 20 has been band-limited by the filter processing unit 22 and is therefore a sinusoidal signal. In the voice signal at the time of utterance, changes in the time lag and peak value of the sinusoidal peaks are observed that correspond to each emotion; these changes may occur abruptly over a short time or may persist for a period of several hundred milliseconds.

So that both cases can be handled, the fourth embodiment extracts feature quantities for a plurality of different analysis frame lengths in parallel and uses all of those feature quantities for pattern identification to recognize the emotion at the time of utterance. For this purpose, a plurality of analysis frame lengths are supplied to the feature quantity extraction unit 40, which extracts the feature quantities for those analysis frame lengths in parallel.

In FIG. 21, analysis frame lengths 1 and 2 are shown as inputs to the feature quantity extraction unit 40 to make explicit that feature quantities are extracted using two analysis frames of different durations. To extract feature quantities and recognize emotion in parallel for a plurality of analysis frame lengths, the parts that perform those processes can be provided as multiple parallel paths. FIG. 22 shows the block configuration of the feature quantity extraction unit in this case.

The pattern identification unit 51 generates the pattern identification signal using all of the feature quantities for the different analysis frame lengths, and the pattern storage unit 52 holds the pattern identification signals thus obtained.

The emotion recognition unit 53 reads out the pattern identification signal sequence from the pattern storage unit 52, recognizes the emotion at the time of utterance from that sequence, and outputs the emotion label of the recognition result. Although omitted from FIG. 21, the pattern identification unit 51 or the emotion recognition unit 53 uses a statistical model generated during prior training. FIG. 23 shows the block configuration when a statistical model for pattern identification is used, and FIG. 24 shows the block configuration when a statistical model for emotion recognition is used. The generation of the statistical model and the emotion recognition operation using it are the same as in the first and second embodiments, so their description is omitted. In this case, the statistical model is generated during prior training by learning the correlation between the plurality of feature quantities obtained with the different analysis frame lengths and the training input signal (the pattern identification signal or the emotion label).

FIG. 25 is a block diagram showing a fifth embodiment of the emotion recognition device of the present invention; parts identical or equivalent to those in FIG. 1 are denoted by the same reference numerals.

The fifth embodiment is likewise designed to handle the cases described above. It is provided with a long-term feature quantity extraction unit 80 that processes the feature quantities extracted for individual analysis frames to extract feature quantities spanning multiple analysis frames. A pattern identification signal sequence is generated using all of the feature quantities, namely those for the analysis frame extracted by the feature quantity extraction unit 40 and those for multiple analysis frames extracted by the long-term feature quantity extraction unit 80, and the emotion at the time of utterance is recognized using that sequence.

FIG. 26 shows the internal block configuration of the long-term feature quantity extraction unit. Here, the feature quantities for a predetermined number of analysis frames, for example five, are held (five values for each feature quantity), and their average, maximum, and rate of change are each obtained to output the long-term average time lag, long-term maximum time lag, long-term time lag change rate, long-term average peak value, long-term maximum peak value, long-term peak value change rate, long-term average voice intensity, long-term maximum voice intensity, and voice intensity change rate.
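A sliding-window version of the long-term feature quantity extraction might be sketched as follows. The class name, the output key names, and in particular the definition of the rate of change (difference between the newest and oldest held value divided by the number of frame intervals) are assumptions; the text does not define how the change rate is computed.

```python
from collections import deque


class LongTermFeatureExtractor:
    """Hold the last `window` per-frame features and derive long-term statistics."""

    NAMES = ("time_lag", "peak_value", "voice_intensity")

    def __init__(self, window=5):
        if window < 2:
            raise ValueError("window must be at least 2")
        self.window = window
        self.buffers = {name: deque(maxlen=window) for name in self.NAMES}

    def push(self, time_lag, peak_value, voice_intensity):
        """Add one analysis frame; return long-term features once the window is full."""
        for name, value in zip(self.NAMES, (time_lag, peak_value, voice_intensity)):
            self.buffers[name].append(value)
        if len(self.buffers["time_lag"]) < self.window:
            return None
        features = {}
        for name, buffer in self.buffers.items():
            values = list(buffer)
            features["mean_" + name] = sum(values) / len(values)
            features["max_" + name] = max(values)
            # Assumed definition: average change per frame interval across the window.
            features["change_rate_" + name] = (values[-1] - values[0]) / (len(values) - 1)
        return features


# Illustrative per-frame values for five consecutive analysis frames.
extractor = LongTermFeatureExtractor(window=5)
outputs = [
    extractor.push(lag, 0.5, intensity)
    for lag, intensity in zip([1, 2, 3, 4, 5], [10, 20, 30, 20, 10])
]
latest = outputs[-1]
```

The nine values returned per window correspond to the nine long-term outputs listed above; they would be passed to pattern identification together with the per-frame feature quantities.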

The pattern identification unit 51 generates the pattern identification signal using all of the feature quantities, namely those for the analysis frame extracted by the feature quantity extraction unit 40 and those for multiple analysis frames extracted by the long-term feature quantity extraction unit 80, and the pattern storage unit 52 holds the pattern identification signals thus obtained. The emotion recognition unit 53 reads out the pattern identification signal sequence from the pattern storage unit 52, recognizes the emotion from that sequence, and outputs the emotion label of the recognition result.

Although omitted from FIG. 25, the pattern identification unit 51 or the emotion recognition unit 53 uses a statistical model generated during prior training. FIG. 27 shows the block configuration when a statistical model for pattern identification is used, and FIG. 28 shows the block configuration when a statistical model for emotion recognition is used. The generation of the statistical model and the emotion recognition operation using it are the same as in the first and second embodiments, so their description is omitted. In this case, the statistical model is generated during prior training by learning the correlation between the plurality of feature quantities obtained for a single analysis frame and for multiple analysis frames and the training input signal (the pattern identification signal or the emotion label).

Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments.

For example, the pattern identification / emotion recognition unit in the above embodiments recognizes emotion at the timing of the end of an utterance, but a voice signal of a predetermined period may instead be designated, for example by a user operation, and the emotion for the voice signal in that period may be recognized. In this case, it is preferable to display an indication that operation for the predetermined period is in progress. In addition, by recognizing emotion for the voice signal in each successive predetermined period, changes in emotion over the course of the whole utterance can be recognized.

The present invention can be realized not only as an emotion recognition device but also as an emotion recognition method and as a program for emotion recognition.

10: segmentation unit, 20: autocorrelation processing unit, 21: autocorrelation calculation unit, 22: filter processing unit, 23: normalization unit, 30: voice intensity measurement unit, 31: peak hold circuit, 32: filter processing unit, 40: feature quantity extraction unit, 41: peak detection unit, 42: logarithmic compression unit, 43: time lag analysis unit, 44: peak value analysis unit, 45: voice intensity analysis unit, 50: pattern identification / emotion recognition unit, 51: pattern identification unit, 52: pattern storage unit, 53: emotion recognition unit, 60, 70: statistical model generation unit, 80: long-term feature quantity extraction unit, 90: vowel detection unit, 91: zero-crossing counting unit, 92: threshold determination unit

Claims

An emotion recognition device for recognizing an emotion at the time of utterance from an input voice signal, comprising:
segmentation means for decomposing the voice signal into segments;
voice intensity measuring means for measuring the voice intensity of the voice signal for each of the segments and outputting a voice intensity signal for the segment;
autocorrelation calculation means for calculating an autocorrelation function for each of the segments and outputting an autocorrelation signal for the segment;
filter means for extracting pitch information from the autocorrelation signal and outputting a filtered autocorrelation signal for the segment;
normalization means for normalizing the magnitude of the filtered autocorrelation signal with the voice intensity signal and outputting a normalized autocorrelation signal for the segment;
feature quantity extraction means for taking a period of several tens of ms to several hundreds of ms including one or more segments as an analysis frame, extracting feature quantities of the analysis frame from the normalized autocorrelation signal and the voice intensity signal, and sequentially outputting the feature quantities; and
pattern identification / emotion recognition means for recognizing the emotion at the time of utterance of the voice signal on the basis of the correlation between the feature quantities and emotions, and outputting an emotion label of the recognition result.

The emotion recognition device according to claim 1, wherein the pattern identification / emotion recognition means comprises:
pattern identification means for outputting, for each analysis frame, the correlation values between the feature quantities and emotions as a pattern identification signal;
pattern storage means for holding the pattern identification signal; and
emotion recognition means for reading out a pattern identification signal sequence from the pattern storage means, recognizing the emotion from the correlations between the feature quantities and emotions for the analysis frames in the pattern identification signal sequence, and outputting an emotion label of the recognition result.

The emotion recognition device according to claim 1, wherein the pattern identification / emotion recognition means comprises:
pattern identification means for outputting the feature quantities as a pattern identification signal for each analysis frame;
pattern storage means for holding the pattern identification signal; and
emotion recognition means for reading out a pattern identification signal sequence from the pattern storage means, recognizing the emotion at the time of utterance of the voice signal from the correlation between the occurrence frequency distribution of the pattern identification signals in the pattern identification signal sequence and emotions, or from the correlation between emotions and either the most frequent pattern identification signal in the occurrence frequency distribution or an important pattern identification signal, and outputting an emotion label of the recognition result.

The emotion recognition device, wherein the voice intensity measuring means detects a silent period in which the voice signal is at or below a predetermined threshold and outputs a silence detection signal, and
the pattern identification / emotion recognition means recognizes the emotion at the timing when the silence detection signal is output and outputs an emotion label of the recognition result.

The emotion recognition apparatus according to any one of claims 1 to 4, wherein the filter means is a low pass filter having a cutoff frequency in the range of 500 Hz to 800 Hz.

The emotion recognition device, further comprising vowel detection means for detecting a vowel for each analysis frame from the normalized autocorrelation signal,
wherein the pattern identification / emotion recognition means executes the emotion recognition process only on analysis frames in which a vowel is detected by the vowel detection means.

The emotion recognition device according to any one of claims 1 to 6, wherein the feature quantity extraction means extracts feature quantities for each of a plurality of analysis frames of different time lengths, and
the pattern identification / emotion recognition means executes the emotion recognition process on the basis of the correlation between all of the obtained feature quantities and emotions and outputs an emotion label of the recognition result.

The emotion recognition device according to any one of claims 1 to 6, further comprising long-term feature quantity extraction means for receiving the feature quantities extracted by the feature quantity extraction means and extracting feature quantities spanning a plurality of analysis frames,
wherein the pattern identification / emotion recognition means executes the emotion recognition process on the basis of the correlation between emotions and all of the feature quantities extracted by the feature quantity extraction means and the long-term feature quantity extraction means, and outputs an emotion label of the recognition result.

A method for recognizing an emotion at the time of utterance from an input voice signal, comprising:
a segmentation step of decomposing the voice signal into segments;
a voice intensity measurement step of measuring the voice intensity of the voice signal for each of the segments and outputting a voice intensity signal for the segment;
an autocorrelation calculation step of calculating an autocorrelation function for each of the segments and outputting an autocorrelation signal for the segment;
a filtering step of extracting pitch information from the autocorrelation signal and outputting a filtered autocorrelation signal for the segment;
a normalization step of normalizing the magnitude of the filtered autocorrelation signal with the voice intensity signal and outputting a normalized autocorrelation signal for the segment;
a feature quantity extraction step of taking a period of several tens of ms to several hundreds of ms including one or more segments as an analysis frame, extracting feature quantities of the analysis frame from the normalized autocorrelation signal and the voice intensity signal, and sequentially outputting the feature quantities; and
a pattern identification / emotion recognition step of recognizing the emotion at the time of utterance of the voice signal on the basis of the correlation between the feature quantities and emotions, and outputting an emotion label of the recognition result.

The method according to claim 9, wherein the pattern identification / emotion recognition step comprises:
a pattern identification step of outputting, for each analysis frame, the correlation values between the feature quantities and emotions as a pattern identification signal;
a step of storing the pattern identification signal in pattern storage means; and
an emotion recognition step of reading out a pattern identification signal sequence from the pattern storage means, recognizing the emotion from the correlations between the feature quantities and emotions for the analysis frames in the pattern identification signal sequence, and outputting an emotion label of the recognition result.

The method according to claim 9, wherein the pattern identification / emotion recognition step comprises:
a pattern identification step of outputting the feature quantities as a pattern identification signal for each analysis frame;
a step of storing the pattern identification signal in pattern storage means; and
an emotion recognition step of reading out a pattern identification signal sequence from the pattern storage means, recognizing the emotion at the time of utterance of the voice signal from the correlation between the occurrence frequency distribution of the pattern identification signals in the pattern identification signal sequence and emotions, or from the correlation between emotions and either the most frequent pattern identification signal in the occurrence frequency distribution or an important pattern identification signal, and outputting an emotion label of the recognition result.

A program for recognizing an emotion at the time of utterance from an input voice signal, the program functioning as:
segmentation means for decomposing the voice signal into segments;
voice intensity measuring means for measuring the voice intensity of the voice signal for each of the segments and outputting a voice intensity signal for the segment;
autocorrelation calculation means for calculating an autocorrelation function for each of the segments and outputting an autocorrelation signal for the segment;
filter means for extracting pitch information from the autocorrelation signal and outputting a filtered autocorrelation signal for the segment;
normalization means for normalizing the magnitude of the filtered autocorrelation signal with the voice intensity signal and outputting a normalized autocorrelation signal for the segment;
feature quantity extraction means for taking a period of several tens of ms to several hundreds of ms including one or more segments as an analysis frame, extracting feature quantities of the analysis frame from the normalized autocorrelation signal and the voice intensity signal, and sequentially outputting the feature quantities; and
pattern identification / emotion recognition means for recognizing the emotion at the time of utterance of the voice signal on the basis of the correlation between the feature quantities and emotions, and outputting an emotion label of the recognition result.

The program according to claim 12, wherein the pattern identification / emotion recognition means comprises:
pattern identification means for outputting, for each analysis frame, the correlation values between the feature quantities and emotions as a pattern identification signal; and
emotion recognition means for reading out a pattern identification signal sequence from pattern storage means that holds the pattern identification signal, recognizing the emotion from the correlations between the feature quantities and emotions for the analysis frames in the pattern identification signal sequence, and outputting an emotion label of the recognition result.

The program according to claim 12, wherein the pattern identification / emotion recognition means comprises:
pattern identification means for outputting the feature quantities as a pattern identification signal for each analysis frame; and
emotion recognition means for reading out a pattern identification signal sequence from pattern storage means that holds the pattern identification signal, recognizing the emotion at the time of utterance of the voice signal from the correlation between the occurrence frequency distribution of the pattern identification signals in the pattern identification signal sequence and emotions, or from the correlation between emotions and either the most frequent pattern identification signal in the occurrence frequency distribution or an important pattern identification signal, and outputting an emotion label of the recognition result.