JP2004317822A

JP2004317822A - Feeling analysis/display device

Info

Publication number: JP2004317822A
Application number: JP2003112315A
Authority: JP
Inventors: Kaoru Ogata; 薫尾形; Kokichi Tanihira; 耕吉谷平
Original assignee: AGI KK
Current assignee: AGI KK
Priority date: 2003-04-17
Filing date: 2003-04-17
Publication date: 2004-11-11

Abstract

PROBLEM TO BE SOLVED: To provide a simple feeling analysis/display device that recognizes a feeling by quantifying changes in power, using the sound volume power information of a voice, which can easily be obtained even by a low-specification CPU such as that of a PDA.
SOLUTION: This feeling analysis/display device analyzes a person's voice to display a feeling of the person. It separates the voice, frame by frame, in time series, and obtains the power deviation between frames in time series, the mean value of the power differences, and the deviation of the power differences, thereby analyzing and displaying the feeling of the person.
COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は、人の音声を分析してその時のその人の感情を表示する感情分析・表示装置に関する。
【０００２】
【従来の技術】
人のしゃべる音声には意味内容の他に、感情の情報が加わっている。一般的に音声の認識とは内容の認識であるが、実際には同じ内容の音声であってもニュアンスの違いにより意味合いが異なる場合がある。例えば甘えた声で「いやだ」と表現する場合と、強い口調で「いやだ」という場合は意味合いはほとんど正反対と言える。従って音声に含まれる感情の認識はマンマシン対話インターフェースにおいても非常に重要と言える。
【０００３】
人は、意味内容が分からなくても喋っている声の調子で「興奮していそうだ」とか「落ち込んでいそうだ」などの感情の状態がある程度推測できる。このことから、言語理解を伴わなくとも、音響情報の機械処理で同様な感情の推測が可能であると考えられる。
【０００４】
音声からの感情検出手法はこれまでいくつかの提案がある。そのほとんどは、以下のような方法をもとにしている。
（ａ）音声の平均音量と、平均と全体の差分（あるいは偏差）
（ｂ）ピッチ（基本周波数）平均と、平均と全体の差分（あるいは偏差）
（ｃ）ポーズ区間（無音区間）の出現タイミング
【０００５】
しかしながら、（ａ）は単なる大声と抑揚の大きい音声（興奮した音声）との差は捉えられない。（ｂ）は安定した精度のよいピッチ解析手法はまだ確立されておらず、さらにピッチ抽出はＦＦＴなどの大量の計算を伴うため，低スペックのＣＰＵでは実現できない。（ｃ）は話速抽出として補助的に用いられる程度で感情検出には有効なものではない。
【０００６】
このように、感情認識はいくつかの先行研究が行われているが、そのほとんどは音量での判定、ピッチでの判定である。しかし音量はマイクとの位置関係に大きく左右され、また、ピッチ判定には上記のようにピッチ解析が必要であり、低スペックＣＰＵでは解析が困難である。すなわち、ピッチの変化を抽出するのは手間がかかり、ＰＤＡ等では実時間処理が困難である。また、ノイズに弱く、感情認識を行う場合は音声情報を得る環境が悪いことが多いので、誤りが多い欠点がある。
【０００７】
【特許文献１】
特開２００２−２１５１８３号公報
【特許文献２】
特開２００２−９１４８２号公報
【特許文献３】
特開平９−２２２９６号公報
【０００８】
【発明が解決しようとする課題】
この発明は、ＰＤＡ等の低スペックなＣＰＵでも容易に情報を得られる音声の音量パワー情報を用い、該パワーの変化情報を数値化することで感情認識を行う簡便な感情分析・表示装置を提供することを目的とする。
【０００９】
【課題を解決するための手段】
上記問題を解決するために、本願発明の感情分析・表示装置は、人の音声を分析してその人の感情を表示する感情分析・表示装置において、
上記音声を所定のフレーム毎に時系列に分離し、上記時系列のフレーム間の音量のパワー差分を求めることにより、その人の感情を分析し、表示することを特徴とする。
【００１０】
さらに、上記時系列のフレーム間のパワー偏差、パワー差分の平均値及びパワー差分の偏差を求めることにより、その人の感情を分析し、表示することを特徴とする。
【００１１】
さらにまた、前記音声の分析が、有声音区間を抽出し、該有声音に対して行われることを特徴とする。
【００１２】
上記本発明では、以下の特徴がある。
（ａ）サンプリングされた音声は２５６サンプルを１単位（１フレーム）とし、フレームごとの音量を感情分析の単位とする。
（ｂ）隣接フレーム間での音量の差分（デルタ）を感情分析のパラメータの一つとする。
（ｃ）全フレームの音量平均と偏差、デルタ平均と偏差をパラメータとし、感情分析を行う。これらにより、低スペックのＣＰＵでも実装できる上、大声はデルタ偏差が小さくなる一方、興奮した抑揚のある声はデルタ偏差が大きくなる傾向を用いることで、より精度の高い感情分析を行うことが出来る。
【００１３】
【発明の実施の形態】
本願発明をより詳細に説明するために、添付の図面に従ってこれを説明する。
まず、本発明の原理を説明する。
【００１４】
この発明において、音声の音量パワー情報を用いて感情を認識するために用いる情報は、全区間のパワー平均とパワー偏差、正規化したパワー偏差である。
Ｎフレーム位置の区間パワーＰ_Ｎは以下で求められる。但しＷ（ｋ）は窓関数、Ａ（ｋ）は時刻ｋにおける音声データ列Ａの値（パワー）である。
【００１５】
【数１】

【００１６】
全区間の平均パワーは、フレーム数をＬとすると、
【００１７】
【数２】

【００１８】
また、全区間の偏差を求めると
【００１９】
【数３】

【００２０】
これを平均パワーで正規化する。
【００２１】
【数４】

【００２２】
なお、上記正規化する理由は以下のとおりである。
すなわち、話者とマイクの位置により、音声のパワーは大きく変化する。しかし、人間が聞いた場合には「遠くで興奮している声」と「近くで普通の声」は区別できる。人間は感情の分析を音声パワーのみに依存していないのである。
【００２３】
従って、音声の感情分析は、音声パワーに依存しないよう、パワー偏差や後述するデルタ平均・デルタ偏差はパワー平均による除算を行い、基準を一定にする必要がある。
【００２４】
実際の音声を適用し、評価を行うと、以下のごとくなる。
図７は、普通の口調で『あーどうも、こんちわ』と喋った場合の音声波形（ａ）、パワー変化（ｂ）を示している。（ｃ）は後述するデルタパワー変化である。これは、表１の音声Ａである。同様に、図８は、弱気な口調で『ちょっとじょうだんじゃないよー』と喋った場合の音声波形（ａ）、パワー変化（ｂ）を示している。これは表１の音声Ｃである。（ｃ）も後述するデルタパワー変化である。
【００２５】
このような音声に対する評価結果を、表１に示す。音声ＦとＧは、同じ口調でマイクとの距離を変えた場合である。
【００２６】
【表１】

【００２７】
なお、表１における具体的な音声のセリフは以下のごときものである。
Ａ．あーども、こんにちは（普通）
Ｂ．おがたです（大声）
Ｃ．おーい、おーい（大声）
Ｄ．ちょっとじょうだんじゃないよ（弱気）
Ｅ．お、お、お、おいおいまってくれよ（興奮）
Ｆ．なんでだよ、どうしてだよ（興奮・近くで）
Ｇ．なんでだよ、どうしてだよ（興奮・遠くから）
Ｈ．はーあもうやんなっちゃう（弱気）
【００２８】
パワー平均で区別すると、大声である音声ＢとＣは共に興奮と認識されてしまう。しかし、実際は大声だからといって、必ずしも「興奮」とは限らない。また、大声か、それとも人の口とマイクとの距離が近いかの区別ができない。正規化パワー偏差では音声Ａと音声Ｄの数値が近く、Ａは普通に話しＤは弱気に話しているが分離が難しい。また、音声ＧとＨも正規化パワー偏差では数値が近い傾向があり、Ｇは興奮しておりＨは弱気で話しているのに両者を識別できない。
【００２９】
従ってパワー平均とパワー偏差だけでは感情認識の情報としては足りないことが分かる。
上記のように、音声のパワー平均や偏差では感情認識のための情報量が足りないので、本発明は音量のパワー変化に注目し、フレーム間での該パワー差分の絶対値を求め、これらの平均と偏差を感情認識のパラメータに以下の如く追加している。
【００３０】
フレーム間パワー差分（以下，デルタパワーという。）は、以下のように絶対値で求める。
【００３１】
【数５】

【００３２】
デルタパワーの平均と偏差、正規化した偏差は同様に、
【００３３】
【数６】

【００３４】
【数７】

【００３５】
【数８】

【００３６】
となる。
以下、上記本発明の実施例について、具体的手順を追って説明する。
（１）図１に示すように、入力音声１はサンプリング周波数８ｋＨｚでサンプリング・データを抽出する。なお、サンプリング周波数はこれ以外も可能であるが、その際には後述する判断パラメータも変更する必要がある。
（２）次に、図２に示すように、上記サンプリング・データを２５６サンプリングを１単位＝１フレームとして、各フレーム２内の総パワー（総和）を算出する。この際、予め求めておいたゼロ基準値からの絶対値で総和を計算する。このフレームパワーが、分析処理の基本単位となる。
【００３７】
ここで、２５６サンプリングを１単位とするのは、以下の理由による。
すなわち、後述するように低スペックＣＰＵは整数演算となるので、低スペックＣＰＵにて演算を行う際に２の累乗を１単位とするのは一般的である。また、音声の分析においては２０〜５０ミリ秒が妥当とされており、両者をふまえて２５６サンプリングを１単位とする。これは、上記８ｋＨｚサンプリングの場合には３２ミリ秒に相当する。
【００３８】
（３）次に、図３に示すように、順次フレームパワー・データをバッファーに格納し、パワー値が音声開始基準値を一定区間超えたらば、有声区間３の開始とし、パワー値が音声終了基準値を一定区間下回ったら、有声区間の終了とし、これらの区間を有声区間として、感情分析を行う。
【００３９】
（４）図４に示すように、有声区間に対し、隣り合うフレーム間のパワー差（絶対値）を求める。これをデルタパワー（△Ｐ）と呼ぶ。
（５）図５に示すように、パワーとデルタパワーの区間平均値と偏差を求める。
（６）最後に、図６に示すように、パワー平均、デルタパワー平均、パワー偏差、デルタパワー偏差の解析結果４をもとに、あらかじめ作成してある統計情報５と比較して、感情状態を判定する。
【００４０】
判定には、概ね、以下の傾向がある：
強い興奮：
パワー平均：大パワー偏差：大 △Ｐ偏差：大
興奮：
パワー平均：中パワー偏差：中 △Ｐ偏差：大
弱気：
パワー平均：小 △Ｐ偏差：小
小声：パワー平均：小
大声：パワー平均：大 △Ｐ偏差：中
【００４１】
これらは、一般に興奮すると声が大きくなると共に語気がきつくなり、きれがよくなることで、音量と共に音量の変化量（微分値）が大きくなっていると推測される。すなわち、興奮した音声は語気が強くなるが、この際にパワー変化を観察すると小→大、あるいは大→小の変化は急激であり、かつ、頻度も高い。
【００４２】
これらのパラメータを基にいくつかの音声の分析結果を表２に示す。なお、表１と表２のＡ〜Ｈは同じ標本データを示し、表２には、表１に対応したＡ〜Ｈの音声のパワー偏差、デルタパワーの平均、偏差を記載する。
【００４３】
【表２】

【００４４】
表２において、音声ＡとＤの正規化デルタパワー平均と正規化デルタパワー偏差（１１と１３，１２と１４）は大きく離れており、弱気な調子の音声を判断することができる。音声ＧとＨも正規化デルタパワー平均、正規化デルタパワー偏差両方の数値（１５と１７，１６と１８）は離れており、分離が容易になる。
【００４５】
全体的に、パワー平均を排除し、正規化したパワー偏差・デルタパワー平均・デルタパワー偏差を用いることで感情を正確に分析できる。
このようにして、音声感情分析のために必要なパラメータを提案し、表３のように閾値を設定し感情認識を行う。
【００４６】
【表３】

【００４７】
この発明は，小型感情分析プログラムを搭載したワンチップマイコンボード（ＳＴボード）として具現される。該ＳＴボードは、マイクから音声を拾い、該音声をマイコンが分析して音声に含まれる感情要因を判断し、結果を４段階（強い興奮、興奮、平静、弱気）に識別して出力する。
【００４８】
この発明は，前記のようにワンチップマイコン等の非力なＣＰＵ、少ないメモリでの動作を前提にしており、このことにより電池駆動のおもちゃや携帯電話などに組み込むことが可能である。
【００４９】
なお、この発明でいう低スペックＣＰＵとは、以下のものである。
すなわち、低スペックとはクロック周波数１００ＭＨｚ程度以下、ＤＳＰなどの高速演算回路を持たないＣＰＵを指している。この場合，実数演算はきわめて遅く、整数演算に頼らなければならない。逆にいえば、高スペックＣＰＵは、高速な浮動小数点演算回路を持ち、クロック周波数も５００ＭＨｚ以上で整数演算よりも実数演算を高速に行うことができるＣＰＵのことである。ＦＦＴ処理など一般的な音声処理に用いる技術は大量の実数演算を前提としているため、この低スペックＣＰＵでは実時間内での処理は不可能である。本発明は、整数演算のみで実装が可能であり、実際の実装結果も良好である。
【００５０】
【発明の効果】
以上述べたように、この発明の感情分析・表示装置は、パワー情報そのものを認識に不要としたことで、マイクとの位置関係やボリュームへの依存を減少させることができる。
【００５１】
そして、正規化デルタパワー平均と正規化デルタパワー偏差をパラメータとして用いることにより、音量に拘わらず、その音声を発する人の「興奮」か、「弱気」か、の識別が簡便にできるようになった。
【００５２】
また、上記感情状態の分析は、ピッチ抽出等に比べ、ノイズに対する誤動作が少ないと共に、複雑な処理を必要としないのでＰＤＡ等の低スペックのＣＰＵで実現でき、安価で簡便な感情認識装置が実現できる。
【図面の簡単な説明】
【図１】本発明の感情分析・表示装置の入力音声を示す図である。
【図２】図１のフレーム解析を示す図である。
【図３】図２における有声音区間検出を示す図である。
【図４】デルタパワーの計算過程を示す図である。
【図５】パワーとデルタパワーの偏差を求める過程を示す図である。
【図６】感情状態の判定過程を示す図である。
【図７】普通の口調で喋った標本音声の例を示す図である。
【図８】弱気の口調で喋った標本音声の例を示す図である。
【符号の簡単な説明】
１入力音声
２フレーム
３有声音区間
４解析結果
５統計情報

[0001]
[Technical field of the invention]
The present invention relates to an emotion analysis and display device that analyzes a person's voice and displays the person's emotion at that time.
[0002]
[Prior art]
Human speech carries emotional information in addition to its semantic content. Speech recognition generally means recognition of content, but in practice utterances with the same content can carry different meanings depending on nuance. For example, saying "no" in a coaxing voice means almost the opposite of saying "no" in a strong tone. Recognition of the emotion contained in speech is therefore very important in man-machine dialogue interfaces as well.
[0003]
Even without understanding the meaning, a person can guess to some extent emotional states such as "seems excited" or "seems depressed" from the tone of a speaking voice. From this, it is considered that a similar estimation of emotion is possible by machine processing of acoustic information, without language understanding.
[0004]
There have been several proposals for emotion detection methods from speech. Most of them are based on the following methods.
(a) Average sound volume, and the difference (or deviation) between the average and the whole
(b) Average pitch (fundamental frequency), and the difference (or deviation) between the average and the whole
(c) Timing of pauses (silent sections)
[0005]
However, (a) cannot capture the difference between a merely loud voice and a voice with strong intonation (an excited voice). As for (b), a stable and accurate pitch analysis method has not yet been established, and pitch extraction involves heavy computation such as FFT, so it cannot be realized on a low-spec CPU. (c) is used only as an auxiliary measure for extracting speech rate and is not effective for emotion detection.
[0006]
As described above, there has been some prior research on emotion recognition, but most of it relies on judging by volume or by pitch. However, volume depends heavily on the position of the speaker relative to the microphone, and pitch-based judgment requires the pitch analysis mentioned above, which is difficult on a low-spec CPU. That is, extracting pitch changes is laborious, and real-time processing is difficult on a PDA or the like. In addition, these methods are vulnerable to noise, and because emotion recognition is often performed where the conditions for capturing voice are poor, they have the drawback of producing many errors.
[0007]
[Patent Document 1]
JP 2002-215183 A
[Patent Document 2]
JP 2002-91482 A
[Patent Document 3]
JP H09-22296 A
[0008]
[Problems to be solved by the invention]
An object of the present invention is to provide a simple emotion analysis and display device that performs emotion recognition by using the volume power information of a voice, which is easy to obtain even on a low-spec CPU such as that of a PDA, and quantifying the changes in that power.
[0009]
[Means for Solving the Problems]
In order to solve the above problem, the emotion analysis and display device of the present invention, which analyzes a person's voice and displays that person's emotion, is characterized in that
the voice is divided into predetermined frames in time series, and the person's emotion is analyzed and displayed by obtaining the power difference in volume between the time-series frames.
[0010]
Further, the present invention is characterized in that the emotion of the person is analyzed and displayed by calculating the power deviation, the average value of the power difference and the deviation of the power difference between the time-series frames.
[0011]
Still further, the voice analysis is characterized in that voiced sections are extracted and the analysis is performed on those voiced sections.
[0012]
The present invention has the following features.
(a) The sampled voice is divided into units of 256 samples (one frame), and the volume of each frame is the unit of emotion analysis.
(b) The difference (delta) in volume between adjacent frames is used as one of the parameters for emotion analysis.
(c) Emotion analysis is performed using the volume average and deviation and the delta average and deviation over all frames as parameters. As a result, the method can be implemented even on a low-spec CPU, and more accurate emotion analysis is possible by exploiting the tendency that a merely loud voice has a small delta deviation, whereas an excited voice with strong intonation has a large delta deviation.
[0013]
[Embodiments of the invention]
The present invention will be described in more detail with reference to the accompanying drawings.
First, the principle of the present invention will be described.
[0014]
In the present invention, the information used to recognize emotion from the volume power of a voice is the power average over the whole section, the power deviation, and the normalized power deviation.
The section power P_N at frame position N is obtained as follows, where W(k) is a window function and A(k) is the value (power) of the audio data sequence A at time k.
[0015]
(Equation 1)

[0016]
The average power of all sections is as follows, where L is the number of frames.
[0017]
(Equation 2)

[0018]
Further, the deviation over all sections is obtained as follows.
[0019]
(Equation 3)

[0020]
This is normalized by the average power.
[0021]
(Equation 4)

[0022]
The reason for the normalization is as follows.
That is, the power of a voice changes greatly depending on the positions of the speaker and the microphone. A human listener, however, can tell an "excited voice far away" from an "ordinary voice nearby". Humans do not rely on voice power alone when analyzing emotion.
[0023]
Therefore, so that voice emotion analysis does not depend on the absolute voice power, the power deviation and the delta average and delta deviation described later must be divided by the power average to keep the reference constant.
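Equations 1 to 4 do not survive in this text. The following is a hedged reconstruction from the surrounding definitions only, not the patent's own formulas. The frame length K = 256, the indexing A(NK + k), the use of the absolute value (taken from step (2) of the procedure), the reading of "deviation" as a standard deviation, and the symbols \bar{P}, \sigma_P, and \hat{\sigma}_P are all assumptions.

```latex
% Frame power at frame position N (K = 256 samples per frame; assumed indexing)
P_N = \sum_{k=0}^{K-1} W(k)\,\bigl|A(NK + k)\bigr|

% Average power over all L frames
\bar{P} = \frac{1}{L} \sum_{N=0}^{L-1} P_N

% Deviation of the power over all frames (assumed to be a standard deviation)
\sigma_P = \sqrt{\frac{1}{L} \sum_{N=0}^{L-1} \bigl(P_N - \bar{P}\bigr)^2}

% Power deviation normalized by the average power
\hat{\sigma}_P = \frac{\sigma_P}{\bar{P}}
```

Under this reading, multiplying every sample by a constant gain scales both \sigma_P and \bar{P} by that gain, so \hat{\sigma}_P is unchanged. This is the property the normalization is meant to provide.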
[0024]
When the actual voice is applied and evaluated, the result is as follows.
FIG. 7 shows the voice waveform (a) and the power change (b) when "Aa doumo, konchiwa" ("Oh, hi there, hello") is spoken in a normal tone; (c) is the delta power change described later. This is voice A in Table 1. Similarly, FIG. 8 shows the voice waveform (a) and the power change (b) when "Chotto joudan janai yo" ("Hey, you must be joking") is spoken in a timid tone. This is voice C in Table 1. Here too, (c) is the delta power change described later.
[0025]
Table 1 shows the evaluation results for such voices. Voices F and G were spoken in the same tone at different distances from the microphone.
[0026]
[Table 1]

[0027]
The specific utterances in Table 1 are as follows.
A. "Oh, hi, hello" (normal)
B. "This is Ogata" (loud)
C. "Hey, hey" (loud)
D. "Hey, you must be joking" (timid)
E. "H-h-hey, hey, wait a minute" (excited)
F. "Why? How come?" (excited, near the microphone)
G. "Why? How come?" (excited, from a distance)
H. "Sigh, I'm so fed up" (timid)
[0028]
When voices are distinguished by power average, the loud voices B and C are both recognized as excited. In reality, however, a loud voice is not necessarily an excited one. Moreover, power average cannot tell a loud voice from a short distance between the mouth and the microphone. With normalized power deviation, voices A and D have similar values, so they are hard to separate even though A is spoken normally and D timidly. Voices G and H also tend to have similar normalized power deviations, so the two cannot be distinguished even though G is excited and H is timid.
[0029]
It follows that the power average and the power deviation alone do not provide enough information for emotion recognition.
Because the power average and deviation of a voice are insufficient for emotion recognition, the present invention focuses on changes in volume power: it obtains the absolute value of the power difference between frames and adds the average and deviation of these values to the emotion recognition parameters, as follows.
[0030]
The inter-frame power difference (hereinafter referred to as delta power) is obtained as an absolute value as follows.
[0031]
(Equation 5)

[0032]
Similarly, the average and deviation of the delta power, and the normalized deviation,
[0033]
(Equation 6)

[0034]
(Equation 7)

[0035]
(Equation 8)

[0036]
are obtained as above.
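Equations 5 to 8 do not survive in this text either. The sketch below reconstructs them from the prose alone and should be read as an assumption, not as the patent's formulas. The index range, the standard-deviation form, and the symbols \overline{\Delta P} and \sigma_{\Delta P} are hypothetical. The division by the power average \bar{P} (the mean of the frame powers P_N) follows paragraph [0023].

```latex
% Delta power: absolute power difference between adjacent frames
\Delta P_N = \bigl|P_N - P_{N-1}\bigr|, \qquad N = 1, \dots, L-1

% Average delta power
\overline{\Delta P} = \frac{1}{L-1} \sum_{N=1}^{L-1} \Delta P_N

% Deviation of the delta power (assumed to be a standard deviation)
\sigma_{\Delta P} = \sqrt{\frac{1}{L-1} \sum_{N=1}^{L-1} \bigl(\Delta P_N - \overline{\Delta P}\bigr)^2}

% Normalization by the power average, per paragraph [0023]
\frac{\overline{\Delta P}}{\bar{P}}, \qquad \frac{\sigma_{\Delta P}}{\bar{P}}
```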
Hereinafter, the embodiment of the present invention will be described with a specific procedure.
(1) As shown in FIG. 1, sampling data is extracted from the input voice 1 at a sampling frequency of 8 kHz. Other sampling frequencies are possible, but the decision parameters described later must then be changed as well.
(2) Next, as shown in FIG. 2, the sampling data is divided into units of 256 samples (1 unit = 1 frame), and the total power (sum) within each frame 2 is calculated. The sum is taken over the absolute values of the deviations from a zero reference value obtained in advance. This frame power is the basic unit of the analysis processing.
[0037]
The unit is set to 256 samples for the following reasons.
As described later, a low-spec CPU relies on integer arithmetic, and on such a CPU it is common to use a power of two as the unit. In addition, a frame length of 20 to 50 milliseconds is considered appropriate for voice analysis. Taking both into account, 256 samples are used as one unit, which corresponds to 32 milliseconds at the 8 kHz sampling rate above.
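The frame-size reasoning and the frame power of step (2) can be checked with a short sketch. This is an illustration under assumptions, not the patent's implementation: the function names are hypothetical, the window function is taken to be rectangular, an incomplete trailing frame is dropped, and the zero reference value is passed in rather than measured.

```python
SAMPLE_RATE_HZ = 8000   # sampling frequency from step (1)
FRAME_SIZE = 256        # samples per frame from step (2)


def frame_duration_ms(frame_size=FRAME_SIZE, sample_rate_hz=SAMPLE_RATE_HZ):
    """Length of one frame in milliseconds."""
    return 1000 * frame_size / sample_rate_hz


def power_of_two_sizes_in_range(lo_ms=20, hi_ms=50, sample_rate_hz=SAMPLE_RATE_HZ):
    """Power-of-two frame sizes whose duration lies within [lo_ms, hi_ms]."""
    sizes = [2 ** n for n in range(1, 13)]  # 2 .. 4096 samples
    return [s for s in sizes
            if lo_ms <= frame_duration_ms(s, sample_rate_hz) <= hi_ms]


def frame_powers(samples, zero_level=0, frame_size=FRAME_SIZE):
    """Per-frame power: sum of |sample - zero_level| over each complete frame.

    Assumption: a rectangular window; an incomplete trailing frame is dropped.
    """
    powers = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        powers.append(sum(abs(s - zero_level) for s in frame))
    return powers


if __name__ == "__main__":
    print(frame_duration_ms())            # duration of a 256-sample frame at 8 kHz
    print(power_of_two_sizes_in_range())  # power-of-two sizes within 20-50 ms
```

At 8 kHz, 128 samples last 16 ms and 512 samples last 64 ms, so 256 is the only power of two inside the 20 to 50 ms range.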
[0038]
(3) Next, as shown in FIG. 3, the frame power data is stored sequentially in a buffer. When the power value stays above the voice start reference value for a certain interval, a voiced section 3 begins; when the power value stays below the voice end reference value for a certain interval, the voiced section ends. Emotion analysis is performed on the voiced sections found in this way.
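Step (3) amounts to a threshold detector with separate start and end levels. The sketch below is one possible reading, not the patent's implementation. The name `voiced_sections`, the parameter `hold` (the "certain interval", counted in frames), and the choice to date the start and end of a section back to the first frame of the qualifying run are all assumptions, and the threshold values in the usage lines are made up.

```python
def voiced_sections(powers, start_level, end_level, hold=3):
    """Return (first_frame, end_frame_exclusive) pairs for voiced sections.

    A section starts once `hold` consecutive frames exceed `start_level`
    and ends once `hold` consecutive frames fall below `end_level`.
    """
    sections = []
    start = None  # first frame of the current voiced section, None in silence
    run = 0       # consecutive frames meeting the pending start/end condition
    for i, p in enumerate(powers):
        if start is None:
            run = run + 1 if p > start_level else 0
            if run == hold:
                start = i - hold + 1
                run = 0
        else:
            run = run + 1 if p < end_level else 0
            if run == hold:
                sections.append((start, i - hold + 1))
                start = None
                run = 0
    if start is not None:  # input ended while still inside a voiced section
        sections.append((start, len(powers)))
    return sections


if __name__ == "__main__":
    demo = [0, 0, 50, 60, 70, 80, 5, 90, 4, 3, 2, 0]
    print(voiced_sections(demo, start_level=40, end_level=10, hold=3))
```

Requiring several consecutive frames on both edges keeps a single quiet frame inside an utterance (the value 5 in the demo) from splitting the section.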
[0039]
(4) As shown in FIG. 4, the power difference (absolute value) between adjacent frames is obtained for each voiced section. This is called the delta power (ΔP).
(5) As shown in FIG. 5, the section average and the deviation of the power and of the delta power are obtained.
(6) Finally, as shown in FIG. 6, the analysis result 4, consisting of the power average, the delta power average, the power deviation, and the delta power deviation, is compared with statistical information 5 prepared in advance to determine the emotional state.
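Steps (4) and (5) can be sketched as follows. This is a minimal illustration that assumes "deviation" means the population standard deviation and that all normalized values are divided by the power average, as paragraph [0023] describes. The function and key names are hypothetical. The input must contain at least two frames with a non-zero mean power.

```python
from statistics import mean, pstdev


def delta_powers(powers):
    """Absolute power difference between each pair of adjacent frames."""
    return [abs(b - a) for a, b in zip(powers, powers[1:])]


def features(powers):
    """Power and delta power statistics for one voiced section.

    Assumption: 'deviation' is taken to be the population standard deviation.
    """
    deltas = delta_powers(powers)
    p_mean = mean(powers)
    return {
        "power_mean": p_mean,
        "norm_power_dev": pstdev(powers) / p_mean,
        "norm_delta_mean": mean(deltas) / p_mean,
        "norm_delta_dev": pstdev(deltas) / p_mean,
    }


if __name__ == "__main__":
    print(features([100, 300, 100, 300]))
```

A section that alternates regularly between two levels has a large normalized delta mean but a delta deviation of zero, which shows that the two delta parameters capture different aspects of the power contour.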
[0040]
Judgments generally have the following trends:
Strong excitement:
Power average: large; power deviation: large; ΔP deviation: large
Excitement:
Power average: medium; power deviation: medium; ΔP deviation: large
Timid:
Power average: small; ΔP deviation: small
Quiet voice: power average: small
Loud voice: power average: large; ΔP deviation: medium
[0041]
These tendencies are presumed to arise because, in general, an excited speaker's voice becomes louder and the delivery becomes more forceful and crisp, so that the amount of change in volume (its derivative) increases along with the volume itself. That is, an excited voice is forceful, and when its power changes are observed, the transitions from small to large or from large to small are abrupt and frequent.
[0042]
Table 2 shows the results of analyzing several voices with these parameters. A to H denote the same samples in Table 1 and Table 2; Table 2 lists the power deviation and the average and deviation of the delta power for voices A to H of Table 1.
[0043]
[Table 2]

[0044]
In Table 2, the normalized delta power averages and normalized delta power deviations of voices A and D (11 and 13, 12 and 14) are far apart, so a timid tone of voice can be identified. For voices G and H, the values of both the normalized delta power average and the normalized delta power deviation (15 and 17, 16 and 18) are also far apart, which makes the two easy to separate.
[0045]
Overall, emotions can be accurately analyzed by eliminating the power average and using the normalized power deviation, delta power average, and delta power deviation.
In this way, parameters necessary for voice emotion analysis are proposed, threshold values are set as shown in Table 3, and emotion recognition is performed.
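The thresholds of Table 3 do not appear in this text, so the following sketch only illustrates the shape of a threshold-based decision using the tendencies listed in paragraph [0040]. Every numeric value below is a made-up placeholder, not a value from the patent, and the function name, constant names, and rule order are assumptions.

```python
# Placeholder thresholds for illustration only; these are NOT the Table 3 values.
DELTA_DEV_EXCITED_MIN = 0.40   # hypothetical: ΔP deviation counted as "large"
DELTA_DEV_TIMID_MAX = 0.15     # hypothetical: ΔP deviation counted as "small"
POWER_DEV_STRONG_MIN = 0.60    # hypothetical: power deviation counted as "large"


def classify(norm_power_dev, norm_delta_dev):
    """Map normalized parameters to one of four emotion labels."""
    if norm_delta_dev >= DELTA_DEV_EXCITED_MIN:
        if norm_power_dev >= POWER_DEV_STRONG_MIN:
            return "strong excitement"
        return "excitement"
    if norm_delta_dev <= DELTA_DEV_TIMID_MAX:
        return "timid"
    return "calm"


if __name__ == "__main__":
    print(classify(norm_power_dev=0.70, norm_delta_dev=0.50))
    print(classify(norm_power_dev=0.30, norm_delta_dev=0.10))
```

In practice the thresholds would have to be derived from labelled recordings, and they depend on the sampling frequency and frame size, as step (1) notes.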
[0046]
[Table 3]

[0047]
The present invention is embodied as a one-chip microcomputer board (ST board) carrying a small emotion analysis program. The ST board picks up a voice through a microphone, the microcomputer analyzes the voice to determine the emotional factors it contains, and the result is classified into four levels (strong excitement, excitement, calm, and timid) and output.
[0048]
As described above, the present invention assumes operation on an underpowered CPU, such as a one-chip microcomputer, with little memory, and can therefore be built into battery-powered toys, mobile phones, and the like.
[0049]
The low-spec CPU according to the present invention is as follows.
That is, "low-spec" refers to a CPU with a clock frequency of about 100 MHz or less and no high-speed arithmetic circuit such as a DSP. On such a CPU, real-number arithmetic is extremely slow, so processing must rely on integer arithmetic. Conversely, a high-spec CPU is one that has a high-speed floating-point unit and a clock frequency of 500 MHz or more, and can perform real-number arithmetic faster than integer arithmetic. Techniques used in general audio processing, such as FFT, assume a large amount of real-number arithmetic, so real-time processing with them is not possible on a low-spec CPU. The present invention can be implemented with integer arithmetic alone, and the results of the actual implementation are good.
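To illustrate how the statistics might be computed with integer arithmetic alone, here is a sketch that uses floor division and an integer square root. It is not the patent's firmware. The helper names are hypothetical, and expressing the normalized deviation as an integer percentage is an assumption. `math.isqrt` requires Python 3.8 or later.

```python
from math import isqrt


def int_mean(values):
    """Integer mean (floor division, no floating point)."""
    return sum(values) // len(values)


def int_std(values):
    """Integer approximation of the population standard deviation."""
    m = int_mean(values)
    variance = sum((v - m) ** 2 for v in values) // len(values)
    return isqrt(variance)


def normalized_percent(deviation, mean_power):
    """Deviation divided by mean power, scaled to an integer percentage."""
    return deviation * 100 // mean_power


if __name__ == "__main__":
    powers = [100, 300, 100, 300]
    print(normalized_percent(int_std(powers), int_mean(powers)))
```

Multiplying by 100 before dividing keeps two digits of precision without floating point. On a real microcontroller the accumulator width would need checking against the largest possible sum of squares.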
[0050]
[Effects of the invention]
As described above, the emotion analysis and display device of the present invention does not need the power value itself for recognition, which reduces its dependence on the position of the microphone and on the volume.
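The reduced dependence on microphone position can be illustrated with an idealized example in which distance acts as a pure gain on the frame powers. That model is an assumption: real recordings also differ in noise and reverberation. The frame power values and the name `norm_dev` are hypothetical.

```python
from statistics import mean, pstdev


def norm_dev(values):
    """Deviation divided by the mean (standard deviation assumed)."""
    return pstdev(values) / mean(values)


near = [400, 1200, 300, 1500, 500]  # made-up frame powers, speaker close to the mic
far = [p // 4 for p in near]        # same contour at a quarter of the amplitude

if __name__ == "__main__":
    print(mean(near), mean(far))          # the raw averages differ by the gain
    print(norm_dev(near), norm_dev(far))  # the normalized deviations agree
```

The raw power average differs by the gain factor, while the normalized deviation is the same for both recordings.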
[0051]
Furthermore, by using the normalized delta power average and the normalized delta power deviation as parameters, it has become possible to distinguish easily, regardless of volume, whether the speaker is "excited" or "timid".
[0052]
In addition, compared with pitch extraction and similar methods, this analysis of the emotional state malfunctions less often in the presence of noise and requires no complicated processing. It can therefore run on a low-spec CPU such as that of a PDA, making an inexpensive and simple emotion recognition device possible.
[Brief description of the drawings]
FIG. 1 is a diagram showing an input voice of an emotion analysis / display device of the present invention.
FIG. 2 is a diagram showing the frame analysis of FIG. 1.
FIG. 3 is a diagram showing voiced section detection in FIG. 2.
FIG. 4 is a diagram showing the calculation process of delta power.
FIG. 5 is a diagram showing the process of obtaining the deviations of power and delta power.
FIG. 6 is a diagram showing the process of determining an emotional state.
FIG. 7 is a diagram showing an example of a sample voice spoken in a normal tone.
FIG. 8 is a diagram showing an example of a sample voice spoken in a timid tone.
[Brief description of reference numerals]
1 input voice
2 frame
3 voiced section
4 analysis result
5 statistical information

Claims

1. An emotion analysis and display device that analyzes a person's voice and displays that person's emotion, wherein the voice is divided into predetermined frames in time series, and the person's emotion is analyzed and displayed by obtaining the power difference in volume between the time-series frames.

2. The emotion analysis and display device according to claim 1, wherein the person's emotion is analyzed and displayed by obtaining the power deviation between the time-series frames, the average value of the power differences, and the deviation of the power differences.

3. The emotion analysis and display device according to claim 1 or 2, wherein voiced sections are extracted and the voice analysis is performed on those voiced sections.