CN105609117A - Device and method for identifying voice emotion - Google Patents
- Publication number
- CN105609117A CN105609117A CN201610091015.2A CN201610091015A CN105609117A CN 105609117 A CN105609117 A CN 105609117A CN 201610091015 A CN201610091015 A CN 201610091015A CN 105609117 A CN105609117 A CN 105609117A
- Authority
- CN
- China
- Prior art keywords
- voice
- gauss
- speech
- training
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The invention discloses a device and a method for recognizing emotion in speech. The device comprises a training section and a recognition section. The training section performs speech feature extraction on pre-processed speech data and, through feature selection and Gaussian modeling, carries out SVM classification on the results obtained from the Gaussian modeling. The recognition section identifies the emotional state of speech: it performs speech feature extraction on the speech to be recognized, carries out feature selection, computes Gaussian likelihood scores, and feeds the results to the SVM classifier, thereby obtaining the emotion category of the speech to be recognized.
Description
Technical field
The present invention relates to the field of speech signal processing, and in particular to an apparatus and method for recognizing emotion in speech.
Background art
Speech emotion recognition refers to a machine's intelligent recognition of different human emotional states from the speech signal. Because the non-stationary characteristics of the speech signal differ markedly across emotions, changes in a speaker's mood can be judged by extracting acoustic features of the speech such as voice-quality features, prosodic features and spectral features. Speech emotion recognition is an emerging field at the intersection of artificial intelligence, psychology, biology and other disciplines. Its purpose is to use computer technology to identify the emotional information embedded in speech (the same sentence can convey entirely different meanings when the speaker is in different environments and affective states). Speech signals are portable and easy to collect, so emotion recognition technology can be widely applied in intelligent human-machine interaction, interactive teaching, entertainment, medicine, criminal investigation and security.
The assessment of personnel's emotional state has high practical value, particularly in military application fields such as aerospace, where prolonged, monotonous, high-intensity tasks subject personnel to severe physiological and psychological stress and can induce negative moods. Investigating the mechanisms by which negative emotions act on human cognitive activity and the factors that influence them, studying methods to improve individual cognition and operating efficiency, and avoiding factors that impair cognition and working ability are therefore of great practical significance.
In general, the emotion-related representation of speech can be realized through a speaker model or an acoustic model. Existing research shows that the features adopted for emotion recognition are mostly prosodic features, i.e. suprasegmental features such as pitch, intensity, duration and their derived parameters. However, auditory information about voice quality is also a factor that often needs to be considered.
In the non-patent literature K. Alter, E. Rank and S. Kotz, "Accentuation and Emotions - Two Different Systems," presented at the ISCA Workshop (ITRW) on Speech and Emotion, Newcastle, Northern Ireland, 2000, Alter et al., studying the relation between prosody and voice quality, found that angry and happy pronunciation differ in aspects such as breathiness and harshness. Other research shows a certain association between the prosodic features of the speech signal and the three emotion dimensions (valence, activation and control): there is a clear correlation between the activation dimension and prosodic features, and emotional states that are close on the activation dimension have similar prosodic features and are easily confused.
Summary of the invention
The object of the invention is to address the deficiencies of the prior art by designing a high-performance apparatus and method for recognizing emotion in speech.
The technical solution of the present invention is: a device for recognizing speech emotion, comprising a training section, for performing speech feature extraction on pre-processed speech data and, through feature selection and Gaussian modeling, performing SVM classification on the results obtained from the Gaussian modeling;
a recognition section, for identifying the emotional state of speech: it performs speech feature extraction on the speech to be recognized, carries out feature selection, computes Gaussian likelihood scores, feeds the results to the SVM classifier, and thereby obtains the emotion category of the speech to be recognized.
Further, the training section comprises: a training speech database, holding the speech data used to train the emotion recognition method, including speech data of multiple emotion types;
a feature extraction module, for extracting the basic acoustic features of each speech sample in the training speech database; the basic acoustic features comprise the pitch and the statistics of its first- and second-order differences, the formants and their statistics, and the MFCC features and their statistics;
a feature selection module, which combines emotion types in pairs and selects the acoustic features for each pair, yielding the training data;
a Gaussian modeling module, which models the training data with Gaussian mixture models to obtain the data distributions;
an SVM classifier, which, for each speech sample in the training speech database and each pairwise combination of emotion types, obtains from the Gaussian models the likelihood scores that the sample belongs to the two emotion types.
Further, the recognition section comprises: a feature extraction module, for extracting the basic acoustic features of the speech to be recognized;
a selection module, which, for each pairwise combination of emotion types, selects the corresponding acoustic features of the speech to be recognized, yielding the data to be recognized;
a Gaussian likelihood computation module, which computes likelihood scores for the data to be recognized;
an emotion matching section, which feeds the likelihood scores of the data to be recognized into the SVM classifier for matching, obtaining the emotion category of the speech to be recognized.
A method for recognizing speech emotion comprises the following steps: training, in which speech feature extraction is performed on pre-processed speech data and, through feature selection and Gaussian modeling, SVM classification is performed on the results obtained from the Gaussian modeling;
recognition, in which speech feature extraction is performed on the speech to be recognized, feature selection is carried out, Gaussian likelihood scores are computed, and the results are fed to the SVM classifier to obtain the emotion category of the speech to be recognized.
Compared with the prior art, the advantages of the present invention are: with the technical solution of the present invention, accuracy is high, the method is not constrained by the language spoken, and processing is fast enough for real-time operation.
Brief description of the drawings
The invention is further described below in conjunction with the drawings and embodiments:
Fig. 1 is a structural block diagram of the speech emotion recognition device;
Fig. 2 is a schematic diagram of the training process;
Fig. 3 is a schematic diagram of the recognition process.
Detailed description of the invention
As shown in Fig. 1, a device for recognizing speech emotion comprises a training section and a recognition section. The training section performs speech feature extraction on pre-processed speech data and, through feature selection and Gaussian modeling, carries out SVM classification on the results obtained from the Gaussian modeling;
the recognition section identifies the emotional state of speech: it performs speech feature extraction on the speech to be recognized, carries out feature selection, computes Gaussian likelihood scores, feeds the results to the SVM classifier, and obtains the emotion category of the speech to be recognized.
The training section comprises a training speech database, a feature extraction module, a feature selection module, a Gaussian modeling module and an SVM classifier.
As shown in Fig. 2, the training speech database holds the speech data used to train the emotion recognition method and includes speech data of multiple emotion types. Suppose there are N emotion types to be recognized; the training speech database should then contain speech data for all N emotion types. For example, to recognize the four emotion types happy, angry, sad and calm, the training speech database should contain speech data corresponding to these four types.
The feature extraction module extracts the basic acoustic features of each speech sample in the training speech database; the basic acoustic features comprise the pitch and the statistics of its first- and second-order differences, the formants and their statistics, and the MFCC features and their statistics.
For each speech sample, the extracted acoustic features form a D-dimensional feature vector, where D is the number of features.
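The assembly of the D-dimensional per-utterance vector can be sketched as below. The patent names no toolkit, so random arrays stand in for the framewise pitch and MFCC tracks that a real front end (e.g. librosa or openSMILE) would produce, and the helper name `utterance_vector` is illustrative, not from the patent. Each track and its first- and second-order differences are folded into mean/std statistics:

```python
import numpy as np

def utterance_vector(pitch, mfcc):
    """Collapse framewise feature tracks into one fixed-length vector.

    pitch : (T,) framewise fundamental-frequency track
    mfcc  : (T, C) framewise MFCC matrix
    Returns a D-dimensional vector built from mean/std statistics of each
    track and of its first- and second-order differences.
    """
    def stats(x):
        x = np.atleast_2d(x.T).T          # promote (T,) to (T, 1)
        d1 = np.diff(x, axis=0)           # first-order difference
        d2 = np.diff(x, n=2, axis=0)      # second-order difference
        return np.concatenate(
            [np.r_[p.mean(axis=0), p.std(axis=0)] for p in (x, d1, d2)])

    return np.concatenate([stats(pitch), stats(mfcc)])

# Placeholder framewise tracks standing in for a real acoustic front end:
rng = np.random.default_rng(0)
vec = utterance_vector(rng.uniform(80, 300, size=200),   # 200 pitch frames
                       rng.normal(size=(200, 13)))       # 13 MFCCs per frame
print(vec.shape)   # D = (1 + 13) tracks x 3 difference orders x 2 stats = 84
```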
A) Each speech sample in the training speech database is processed to generate a D-dimensional feature vector;
B) All features are normalized. Each dimension k (k = 1...D) is normalized one by one by the following formula: f'_k = (f_k − a_k) / (b_k − a_k)
In the formula above, k denotes a dimension of the feature vector, k = 1...D; f_k and f'_k are respectively the values of the k-th feature before and after normalization; a_k and b_k denote the minimum and maximum on dimension k, taken over the acoustic feature vectors extracted from all training utterances.
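A minimal numpy sketch of this normalization step, with random data standing in for the training feature matrix (the function names are illustrative, not from the patent):

```python
import numpy as np

def fit_minmax(train):
    """Per-dimension minimum a_k and maximum b_k over all training vectors."""
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(feats, a, b):
    """Normalize each dimension k: f'_k = (f_k - a_k) / (b_k - a_k)."""
    return (feats - a) / (b - a)

rng = np.random.default_rng(1)
train = rng.normal(size=(50, 8))     # 50 training vectors, D = 8
a, b = fit_minmax(train)
norm = apply_minmax(train, a, b)
print(norm.min(), norm.max())        # training data maps into [0, 1]
```

At recognition time the same a_k and b_k fitted on the training data would be applied to the unknown utterance's features.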
The feature selection module combines emotion types in pairs and selects the acoustic features for each pair, yielding the training data; the Gaussian modeling module models the training data with Gaussian mixture models to obtain the data distributions.
The steps of feature selection and Gaussian modeling are carried out for every pairwise combination of emotion types. If the classification task involves N emotion types, the number of combinations is N(N-1)/2: (type 1, type 2), (type 1, type 3), ..., (type 1, type N); (type 2, type 3), (type 2, type 4), ..., (type 2, type N); ...; (type N-1, type N). The same operations are performed for each combination; only the data differ. The combination of type i and type j is taken below as an example:
A) First, the acoustic feature vectors obtained from the training utterances corresponding to type i and type j are chosen as training data;
B) The discriminability s(d) of each dimension is calculated: s(d) = (μ_i(d) − μ_j(d))² / (σ_i²(d) + σ_j²(d))
Here d (d = 1, ..., D) denotes the d-th dimension of the feature vector; μ_i(d) and μ_j(d) denote the mean of the d-th dimension in the feature vectors corresponding to emotion types i and j respectively; σ_i²(d) and σ_j²(d) denote the corresponding variances. The larger the discriminability s(d), the better this feature distinguishes the two types.
C) All D dimensions are sorted by discriminability, and the M dimensions with the highest values (for example M = 10) are selected as the features for emotion types i and j, forming an M-dimensional feature vector.
Through this process, M elements are selected from the original D-dimensional acoustic features to form a new M-dimensional feature vector. For different emotion-type combinations, however, the M selected elements differ; for example, the M elements corresponding to the combination (type 1, type 2) differ from those for (type 2, type 3).
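Steps (B) and (C) can be sketched as follows. The Fisher-style ratio used here is an assumption consistent with the means and variances the text defines (the patent's formula image is not reproduced), and the function names are illustrative:

```python
import numpy as np

def discriminability(Xi, Xj):
    """Per-dimension score s(d) = (mu_i - mu_j)^2 / (var_i + var_j)."""
    mi, mj = Xi.mean(axis=0), Xj.mean(axis=0)
    vi, vj = Xi.var(axis=0), Xj.var(axis=0)
    return (mi - mj) ** 2 / (vi + vj)

def select_top_m(Xi, Xj, M=10):
    """Indices of the M most discriminative dimensions for the pair (i, j)."""
    s = discriminability(Xi, Xj)
    return np.argsort(s)[::-1][:M]

rng = np.random.default_rng(2)
D = 40
Xi = rng.normal(0.0, 1.0, size=(100, D))   # class-i training vectors
Xj = rng.normal(0.0, 1.0, size=(100, D))   # class-j training vectors
Xj[:, 5] += 4.0                            # make dimension 5 clearly discriminative
idx = select_top_m(Xi, Xj, M=10)
print(idx[0])                              # dimension 5 ranks first
```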
D) Gaussian modeling. Using the M-dimensional feature vectors obtained above, for all training data corresponding to emotion class i and to emotion class j, one Gaussian mixture model per class is used to model the distribution of that class's data.
The likelihood function of the Gaussian mixture model corresponding to emotion class i can be expressed in the following form: p(X) = Σ_{t=1..T} a_t · b_t(X)
Here X is an M-dimensional feature vector; b_t(X) are the member density functions; a_t are the mixture weights; and T is the number of mixture components. Each member density function is an M-variate Gaussian defined by a mean vector μ_t and covariance matrix Σ_t, of the form: b_t(X) = (2π)^(−M/2) |Σ_t|^(−1/2) · exp(−(1/2)(X − μ_t)ᵀ Σ_t⁻¹ (X − μ_t))
Steps (A)-(D) above can be summarized as follows: for emotion types i and j, a D-dimensional acoustic feature vector is first extracted for each training utterance; next, M feature dimensions are selected by computing the discriminability, converting the D-dimensional features into M-dimensional features; finally, a Gaussian mixture model is built for each emotion type. From these two Gaussian mixture models, the likelihood scores that a speech sample belongs to class i and to class j can be obtained.
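A sketch of step (D), under the assumption that scikit-learn's GaussianMixture is an acceptable stand-in for the patent's Gaussian modeling module; synthetic data replaces the selected M-dimensional features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
M = 10                                     # selected feature dimensions
Xi = rng.normal(0.0, 1.0, size=(200, M))   # class-i training vectors
Xj = rng.normal(2.0, 1.0, size=(200, M))   # class-j training vectors

# One GMM per emotion class (T mixture components each)
gmm_i = GaussianMixture(n_components=3, random_state=0).fit(Xi)
gmm_j = GaussianMixture(n_components=3, random_state=0).fit(Xj)

# Log-likelihoods that a sample belongs to class i and to class j
x = Xi[:1]
ll_i = gmm_i.score_samples(x)[0]
ll_j = gmm_j.score_samples(x)[0]
print(ll_i > ll_j)   # a class-i sample scores higher under gmm_i
```

The pair of `score_samples` values is exactly the per-pair likelihood pair that the SVM stage below consumes.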
The SVM classifier: for each speech sample in the training speech database and each pairwise combination of emotion types, the likelihood scores that the sample belongs to the two emotion types are obtained from the Gaussian models. For any speech sample this generates N*(N-1) likelihood values; taking these likelihood values as features and the sample's emotion category label as the label, the SVM classifier is trained.
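The construction of the N*(N-1)-dimensional likelihood feature vector and the SVM training can be sketched as below. One simplification relative to the patent: the same M features are used for every pair, whereas the patent selects a different M-subset per combination. scikit-learn's GaussianMixture and SVC stand in for the Gaussian modeling module and the SVM classifier:

```python
import numpy as np
from itertools import combinations
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(4)
N, M = 3, 4                                  # N emotion types, M-dim features
means = [0.0, 2.0, 4.0]
data = {c: rng.normal(means[c], 1.0, size=(60, M)) for c in range(N)}

# One GMM per class per pair: each pair (i, j) yields 2 likelihoods, so
# every utterance gets an N*(N-1)-dimensional likelihood feature vector.
pairs = list(combinations(range(N), 2))
gmms = {(i, j): (GaussianMixture(2, random_state=0).fit(data[i]),
                 GaussianMixture(2, random_state=0).fit(data[j]))
        for i, j in pairs}

def likelihood_vector(x):
    x = x.reshape(1, -1)
    v = []
    for (i, j) in pairs:
        gi, gj = gmms[(i, j)]
        v += [gi.score_samples(x)[0], gj.score_samples(x)[0]]
    return np.array(v)                       # length N*(N-1)

X = np.array([likelihood_vector(x) for c in range(N) for x in data[c]])
y = np.array([c for c in range(N) for _ in range(60)])
svm = SVC().fit(X, y)                        # likelihoods as features, emotion as label
print(X.shape[1])                            # 3 * 2 = 6 likelihood features
```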
The recognition section comprises a feature extraction module, a selection module, a Gaussian likelihood computation module and an emotion matching section.
As shown in Fig. 3, the feature extraction module extracts the basic acoustic features of the speech to be recognized. This step is identical to the training process: the same basic acoustic features are extracted from the input speech.
The selection module combines the emotion types in pairs for the speech to be recognized and selects the corresponding acoustic features, yielding the data to be recognized. Selecting the M features: for each emotion-type combination (e.g. emotion i and emotion j), the M acoustic features with the highest discriminability for that combination are selected.
The Gaussian likelihood computation module computes likelihood scores for the data to be recognized. Gaussian likelihood computation: for each emotion-type combination (e.g. emotion i and emotion j), the likelihood scores that the speech belongs to the two classes are computed from the pair's Gaussian mixture models and the M selected acoustic features.
The emotion matching section feeds the likelihood scores of the data to be recognized into the SVM classifier for matching and obtains the emotion category of the speech to be recognized. The values of all likelihood scores are combined into an N*(N-1)-dimensional vector, which is input to the SVM classifier for classification, giving the emotion category of the speech to be recognized.
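Recognition time can be sketched end-to-end for the simplest case N = 2 (so N*(N-1) = 2 likelihood features). As above, scikit-learn components and synthetic data are stand-ins for the patent's modules:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

# Two-emotion toy setup: an unknown utterance is scored by every pair's
# GMMs and the stacked scores go through the trained SVM.
rng = np.random.default_rng(5)
tr0 = rng.normal(0.0, 1.0, size=(80, 6))     # training vectors, emotion 0
tr1 = rng.normal(3.0, 1.0, size=(80, 6))     # training vectors, emotion 1
g0 = GaussianMixture(2, random_state=0).fit(tr0)
g1 = GaussianMixture(2, random_state=0).fit(tr1)

def scores(X):
    # N = 2, so each utterance yields N*(N-1) = 2 likelihood features
    return np.c_[g0.score_samples(X), g1.score_samples(X)]

X_tr = np.vstack([tr0, tr1])
y_tr = np.r_[np.zeros(80, int), np.ones(80, int)]
svm = SVC().fit(scores(X_tr), y_tr)

unknown = rng.normal(3.0, 1.0, size=(1, 6))  # unseen emotion-1-like utterance
print(svm.predict(scores(unknown))[0])
```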
A method for recognizing speech emotion comprises the following steps: training, in which speech feature extraction is performed on pre-processed speech data and, through feature selection and Gaussian modeling, SVM classification is performed on the results obtained from the Gaussian modeling;
recognition, in which speech feature extraction is performed on the speech to be recognized, feature selection is carried out, Gaussian likelihood scores are computed, and the results are fed to the SVM classifier to obtain the emotion category of the speech to be recognized.
Compared with the prior art, the advantages of the present invention are: with the technical solution of the present invention, accuracy is high, the method is not constrained by the language spoken, and processing is fast enough for real-time operation.
The above is only a specific exemplary application of the present invention and does not constitute any limitation on the scope of protection of the invention. In addition to the above embodiment, the present invention may have other embodiments. All technical solutions formed by equivalent replacement or equivalent transformation fall within the scope of protection claimed by the present invention.
Claims (4)
1. A device for recognizing speech emotion, characterized in that it comprises:
a training section, for performing speech feature extraction on pre-processed speech data and, through feature selection and Gaussian modeling, performing SVM classification on the results obtained from the Gaussian modeling;
a recognition section, for identifying the emotional state of speech: performing speech feature extraction on the speech to be recognized, carrying out feature selection, computing Gaussian likelihood scores, and feeding the results to the SVM classifier to obtain the emotion category of the speech to be recognized.
2. The device for recognizing speech emotion according to claim 1, characterized in that the training section comprises:
a training speech database, holding the speech data used to train the emotion recognition method, including speech data of multiple emotion types;
a feature extraction module, for extracting the basic acoustic features of each speech sample in the training speech database, the basic acoustic features comprising the pitch and the statistics of its first- and second-order differences, the formants and their statistics, and the MFCC features and their statistics;
a feature selection module, which combines emotion types in pairs and selects the acoustic features for each pair, yielding the training data;
a Gaussian modeling module, which models the training data with Gaussian mixture models to obtain the data distributions;
an SVM classifier, which, for each speech sample in the training speech database and each pairwise combination of emotion types, obtains from the Gaussian models the likelihood scores that the sample belongs to the two emotion types.
3. The device for recognizing speech emotion according to claim 1, characterized in that the recognition section comprises:
a feature extraction module, for extracting the basic acoustic features of the speech to be recognized;
a selection module, for combining the emotion types in pairs for the speech to be recognized and selecting the corresponding acoustic features, yielding the data to be recognized;
a Gaussian likelihood computation module, for computing likelihood scores for the data to be recognized;
an emotion matching section, for feeding the likelihood scores of the data to be recognized into the SVM classifier for matching, obtaining the emotion category of the speech to be recognized.
4. A method for recognizing speech emotion, characterized in that it comprises the steps of:
training, in which speech feature extraction is performed on pre-processed speech data and, through feature selection and Gaussian modeling, SVM classification is performed on the results obtained from the Gaussian modeling;
recognition, in which speech feature extraction is performed on the speech to be recognized, feature selection is carried out, Gaussian likelihood scores are computed, and the results are fed to the SVM classifier to obtain the emotion category of the speech to be recognized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610091015.2A CN105609117A (en) | 2016-02-19 | 2016-02-19 | Device and method for identifying voice emotion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105609117A true CN105609117A (en) | 2016-05-25 |
Family
ID=55989000
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610091015.2A Pending CN105609117A (en) | 2016-02-19 | 2016-02-19 | Device and method for identifying voice emotion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105609117A (en) |
- 2016-02-19 CN CN201610091015.2A patent/CN105609117A/en active Pending
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107516511A (en) * | 2016-06-13 | 2017-12-26 | 微软技术许可有限责任公司 | The Text To Speech learning system of intention assessment and mood |
US11238842B2 (en) | 2016-06-13 | 2022-02-01 | Microsoft Technology Licensing, Llc | Intent recognition and emotional text-to-speech learning |
CN106297826A (en) * | 2016-08-18 | 2017-01-04 | 竹间智能科技(上海)有限公司 | Speech emotional identification system and method |
CN107705807A (en) * | 2017-08-24 | 2018-02-16 | 平安科技(深圳)有限公司 | Voice quality detecting method, device, equipment and storage medium based on Emotion identification |
WO2019037382A1 (en) * | 2017-08-24 | 2019-02-28 | 平安科技(深圳)有限公司 | Emotion recognition-based voice quality inspection method and device, equipment and storage medium |
CN107705807B (en) * | 2017-08-24 | 2019-08-27 | 平安科技(深圳)有限公司 | Voice quality detecting method, device, equipment and storage medium based on Emotion identification |
CN109299777A (en) * | 2018-09-20 | 2019-02-01 | 于江 | A kind of data processing method and its system based on artificial intelligence |
CN109299777B (en) * | 2018-09-20 | 2021-12-03 | 于江 | Data processing method and system based on artificial intelligence |
CN109352666A (en) * | 2018-10-26 | 2019-02-19 | 广州华见智能科技有限公司 | It is a kind of based on machine talk dialogue emotion give vent to method and system |
CN110600033A (en) * | 2019-08-26 | 2019-12-20 | 北京大米科技有限公司 | Learning condition evaluation method and device, storage medium and electronic equipment |
CN110600033B (en) * | 2019-08-26 | 2022-04-05 | 北京大米科技有限公司 | Learning condition evaluation method and device, storage medium and electronic equipment |
CN113221933A (en) * | 2020-02-06 | 2021-08-06 | 本田技研工业株式会社 | Information processing apparatus, vehicle, computer-readable storage medium, and information processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
DD01 | Delivery of document by public notice |
Addressee: Zheng Hongliang Document name: Notification of Publication of the Application for Invention |
WD01 | Invention patent application deemed withdrawn after publication |

Application publication date: 20160525 |