CN108682431A - Speech emotion recognition method in PAD three-dimensional emotional space - Google Patents
Speech emotion recognition method in PAD three-dimensional emotional space
- Publication number
- CN108682431A CN108682431A CN201810438464.9A CN201810438464A CN108682431A CN 108682431 A CN108682431 A CN 108682431A CN 201810438464 A CN201810438464 A CN 201810438464A CN 108682431 A CN108682431 A CN 108682431A
- Authority
- CN
- China
- Prior art keywords
- emotional
- speech
- pad
- value
- dimensionals
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 230000002996 emotional effect Effects 0.000 title claims abstract description 60
- 238000000034 method Methods 0.000 title claims abstract description 24
- 230000008451 emotion Effects 0.000 claims abstract description 22
- 230000000694 effects Effects 0.000 claims abstract description 9
- 230000004927 fusion Effects 0.000 claims abstract description 8
- 230000008909 emotion recognition Effects 0.000 claims description 12
- 210000002569 neuron Anatomy 0.000 claims description 9
- 239000000284 extract Substances 0.000 claims description 7
- 230000003321 amplification Effects 0.000 claims description 6
- 238000000605 extraction Methods 0.000 claims description 6
- 238000001228 spectrum Methods 0.000 claims description 6
- 230000008901 benefit Effects 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 238000010220 Pearson correlation analysis Methods 0.000 claims description 3
- 230000004913 activation Effects 0.000 claims description 3
- 238000004364 calculation method Methods 0.000 claims description 3
- 230000006870 function Effects 0.000 claims description 3
- 239000011159 matrix material Substances 0.000 claims description 3
- 238000013016 damping Methods 0.000 claims 1
- 230000004044 response Effects 0.000 claims 1
- 238000002474 experimental method Methods 0.000 abstract description 7
- 239000004615 ingredient Substances 0.000 abstract description 3
- 238000010219 correlation analysis Methods 0.000 abstract description 2
- 239000013598 vector Substances 0.000 description 4
- 239000000203 mixture Substances 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 2
- 238000000205 computational method Methods 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 238000010304 firing Methods 0.000 description 2
- 238000009432 framing Methods 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 230000009471 action Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000003709 image segmentation Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000015654 memory Effects 0.000 description 1
- 238000010606 normalization Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000005855 radiation Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 230000005236 sound signal Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
The invention discloses a speech emotion recognition method in the PAD three-dimensional emotional space. The PAD three-dimensional emotion model based on dimensional emotion theory is adopted as the form in which recognition results are presented. Mel-frequency cepstral coefficients, the time firing sequence, and the firing location information are each used individually to predict the PAD values of the speech emotion; correlation analysis is then carried out in the three dimensions P (pleasure), A (arousal), and D (dominance), the weight coefficients of the three features are computed, and weighted fusion yields the final predicted value of the speech emotion in the PAD three-dimensional emotional space. Experiments show that the method can locate the affective state of speech more precisely in the emotional space, pays more attention to the expression and embodiment of the emotional components, and reflects the polarity and intensity of emotional expression more appropriately, so that the mixed affective content in emotional speech can be displayed.
Description
Technical field
The invention belongs to the field of speech emotion recognition and relates to a speech emotion recognition method, in particular to a speech emotion recognition method in the PAD three-dimensional emotional space.
Background technology
In the field of speech emotion recognition, the commonly used cepstral feature is the Mel-frequency cepstral coefficient (MFCC). MFCC is an Hz-domain spectral feature computed through the nonlinear correspondence between the Mel frequency scale and the Hz frequency scale; the Mel scale emphasizes the low-frequency details of the speech signal, so MFCC can highlight the useful information in the speech signal, reduce the interference of background noise, and recognize speech emotion effectively. However, the non-stationarity of emotional speech signals is particularly pronounced, and a direct FFT of the signal cannot reflect this non-stationarity, so using MFCC alone for speech emotion recognition leads to a relatively high false recognition rate. The spectrogram is favored by speech researchers because it intuitively presents the frequency distribution of the speech signal in each time segment; feeding the spectrogram into a pulse coupled neural network (Pulse Coupled Neural Network, PCNN) to extract the time firing sequence and the entropy sequence for speech emotion recognition has been shown experimentally to be effective for the two emotions neutral and happy.
The above results of recognizing speech emotion with MFCC and of distinguishing emotion with the speech emotion features obtained by processing the spectrogram with a PCNN show that the two approaches emphasize different aspects when identifying emotion types, and that the recognition results classify emotion into a few categories such as happy, sad, or angry. In practice, however, expressing emotion as numerical values in a multidimensional space is more convenient for human-computer interaction, because computers are better at processing numbers.
Invention content
In order to solve the above technical problem, the present invention proposes a method that fuses the PAD value predictions (P (pleasure), A (arousal), D (dominance)) obtained from the MFCC, the neuron time firing sequence, and the neuron firing location information.
The technical solution adopted in the present invention is a speech emotion recognition method in the PAD three-dimensional emotional space, characterized by comprising the following steps:
Step 1: extract the features of the emotional speech data, including the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information;
Step 2: apply the SVR algorithm to the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information individually to build speech emotion recognition models, and predict the pleasure value P, the arousal value A, and the dominance value D of the emotional speech;
Step 3: use Pearson correlation analysis to compute the correlation coefficients, in the three dimensions P, A, and D, of the predictions obtained from the three features, and determine the feature weights;
Step 4: according to the weights of the different features, obtain by weighted fusion the final PAD value of the emotional speech in the three-dimensional emotional space.
The present invention first uses the three speech emotion features, MFCC (Mel-frequency cepstral coefficients), the time firing sequence, and the firing location information, individually to predict the PAD values of the speech emotion, then carries out correlation analysis of the prediction results in the three dimensions P (pleasure), A (arousal), and D (dominance), computes the weight coefficients of the three features, and fuses them to obtain the final predicted value of the speech emotion in the PAD three-dimensional emotional space.
As a new attempt at speech emotion recognition, the present invention provides a reference for future research in this field. The final recognition result is not expressed with the discrete word labels used in previous studies (happy, sad, neutral, etc.); instead, the emotion prediction is a set of coordinate values mapped into the PAD three-dimensional emotional space. By computing its distance from the PAD values of the basic emotions, the constituent elements and their proportions in the affective state of the emotional speech can be further analyzed, so that mixed emotion types such as "mild and slightly sad" or "mild and rather happy" can be identified. This breaks through the limitation of describing emotion types with discrete adjective labels, reflects the polarity and intensity of emotional expression more appropriately, and is more convenient when handling emotion along a single dimension. Experimental results show that, in addition to a good discrimination effect for the basic speech emotion types, the proposed method focuses more on the expression and embodiment of the emotional components; its computation time is short, and with parallel machines or a hardware implementation it would suit application scenarios requiring real-time processing.
Description of the drawings
Fig. 1 is the flow chart of the embodiment of the present invention;
Fig. 2 is a schematic diagram of the PAD three-dimensional emotional space distribution of the embodiment of the present invention.
Specific implementation mode
To make it easy for those of ordinary skill in the art to understand and implement the present invention, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the implementation examples described here are only intended to illustrate and explain the present invention, not to limit it.
The CASIA Chinese emotional corpus selected for this embodiment is a discrete speech emotion database developed by the Institute of Automation, Chinese Academy of Sciences. In a completed research project, the Institute of Psychology of the Chinese Academy of Sciences engaged 346 university students to evaluate the PAD values of 14 specific emotion categories using a revised simplified Chinese version of the PAD emotion scale, obtaining the values of these 14 emotions on P (pleasure), A (arousal), and D (dominance). Six of these emotion types are contained in the CASIA corpus, so speech emotion data in the CASIA corpus whose P, A, and D values are known can serve as the dimensional emotional speech data needed for the experiments, which verify the classification ability of each of the three speech emotion features during speech emotion recognition and the effectiveness after fusion.
The running environment of this embodiment is Matlab (R2014b), the system environment is Win10, and the computer configuration is an Intel Core i3-3217U CPU (1.8 GHz) with 8 GB of memory.
Referring to Fig. 1, the speech emotion recognition method in the PAD three-dimensional emotional space provided by the invention comprises the following steps:
Step 1: extract the features of the emotional speech data, including the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information.
Step 1.1: extract the Mel-frequency cepstral coefficients MFCC.
To extract the MFCC parameters, the speech signal is first preprocessed. Preprocessing includes pre-emphasis, windowing, and framing; the frame length is 256 samples, the frame shift is 128 samples, and the window function is a Hamming window. Pre-emphasis compensates the attenuation of the high-frequency components of the speech signal caused by lip and nostril radiation. The original signal is framed to obtain the speech sequence s(n) of each frame, a fast Fourier transform (FFT) is applied to s(n) to obtain the spectrum S(n) of each frame, and the energy spectrum of the speech signal |S(n)|² is obtained by taking the squared modulus. |S(n)|² is then passed through the Mel filter bank Hm(k), and the output parameters Pm (m = 0, 1, 2, ..., M-1) are computed as:
Pm = Σk |S(k)|² Hm(k), m = 0, 1, 2, ..., M-1,
where M is the number of filters (26 in this embodiment) and fm denotes the center frequency of the m-th triangular filter Hm(k).
Finally, the logarithm of the parameters Pm is taken and a discrete cosine transform (DCT) is applied, transforming into the cepstral domain to obtain the Mel-frequency cepstral coefficients Cmel(k):
Lm = ln(Pm), m = 0, 1, 2, ..., M-1,
Cmel(n) = Σ (m = 0..M-1) Lm cos(πn(m + 0.5)/M), n = 1, 2, ..., N,
where N is the order of the Mel cepstral coefficients; this embodiment extracts 12th-order first-order-difference MFCC characteristic parameters.
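As an illustration of step 1.1, a minimal sketch of this MFCC extraction (frame length 256, frame shift 128, Hamming window, 26 Mel filters, 12 cepstral coefficients plus first-order differences) is given below. The embodiment itself runs in Matlab; the use of Python with numpy, scipy, and librosa, the pre-emphasis coefficient 0.97, and the helper name extract_mfcc are assumptions made for the sketch.

```python
# Minimal sketch (assumptions noted above), not the Matlab embodiment itself.
import numpy as np
import librosa
from scipy.fft import dct

def extract_mfcc(signal, sr, n_fft=256, hop=128, n_mels=26, n_ceps=12):
    # Pre-emphasis: compensate high-frequency attenuation from lip/nostril radiation
    # (0.97 is an assumed coefficient; the patent does not specify one).
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Framing (frame length 256, frame shift 128) and Hamming windowing
    n_frames = 1 + (len(emphasized) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(n_fft)

    # FFT of each frame and energy spectrum |S(n)|^2
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2           # (n_frames, n_fft//2 + 1)

    # Mel filter bank Hm(k) and log filter-bank energies Lm = ln(Pm)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_energy = np.log(power @ mel_fb.T + 1e-10)             # (n_frames, n_mels)

    # DCT to the cepstral domain; keep the first 12 coefficients
    ceps = dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]

    # First-order difference (delta) parameters
    delta = librosa.feature.delta(ceps, axis=0)
    return np.hstack([ceps, delta])                            # (n_frames, 2 * n_ceps)
```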
Step 1.2: extract the time firing sequence.
The spectrogram is obtained first: the speech signal is divided into several overlapping frames and each frame is windowed (a Hamming window is selected in this embodiment); the short-time spectrum of each frame is estimated by FFT; with time n as the abscissa and frequency w as the ordinate, the strength of any given frequency component at a given time is represented by the gray level of the corresponding point, and the spectrogram is thus formed. This embodiment uses a simplified PCNN neuron model, with the parameter settings and values listed in Table 1 (the values are those recited in claim 3):
Table 1. Parameter settings and values of the simplified PCNN neuron model
Parameter | αF | αL | αθ | VF | VL | Vθ | β
---|---|---|---|---|---|---|---
Value | 0.1 | 1.0 | 1.0 | 0.5 | 0.2 | 20 | 0.1
Here αF, αL, and αθ denote the decay time constants of the feedback input Fij, the linking input Lij, and the dynamic threshold θij respectively; VF, VL, and Vθ denote the feedback, linking, and threshold amplification coefficients of the PCNN respectively; and β is the linking strength coefficient. These parameters are set according to empirical values. The initial values of the linking input L, the internal activity U, and the pulse output Y are set to 0; the input is the normalized gray value, lying in [0, 1]. The linking field radius is r = 1.5, and the internal linking matrix W is a 3 × 3 matrix whose elements are the reciprocals (r⁻²) of the squared Euclidean distances from the central pixel to each surrounding pixel.
The spectrogram is fed into a PCNN whose number of neurons equals the number of spectrogram pixels and which is iterated 50 times. The total number of firings in each iteration equals the number of neurons that emit a pulse in that iteration; the time firing sequence feature is extracted through the image segmentation ability of the PCNN.
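As an illustration of step 1.2, a minimal sketch of the simplified PCNN follows: the normalized spectrogram serves as the external stimulus, and the number of neurons firing in each of the 50 iterations forms the time firing sequence. The parameter values come from Table 1; the exact update equations of the simplified model, the initial threshold value, and the Python/numpy/scipy implementation are assumptions made for the sketch.

```python
# Minimal sketch of a simplified PCNN (assumed update equations), not the embodiment itself.
import numpy as np
from scipy.signal import convolve2d

def pcnn_firing_sequence(spectrogram, iterations=50,
                         aF=0.1, aL=1.0, aT=1.0,   # decay time constants of F, L, theta (Table 1)
                         VF=0.5, VL=0.2, VT=20.0,  # feedback / linking / threshold amplification
                         beta=0.1):                # linking strength coefficient
    # Normalized gray values in [0, 1] serve as the external stimulus S
    S = (spectrogram - spectrogram.min()) / (spectrogram.max() - spectrogram.min() + 1e-12)
    F = np.zeros_like(S); L = np.zeros_like(S); Y = np.zeros_like(S)
    theta = np.ones_like(S)                        # initial threshold value is an assumption
    # 3x3 internal linking matrix W: reciprocal squared Euclidean distance to the centre pixel
    W = np.array([[0.5, 1.0, 0.5],
                  [1.0, 0.0, 1.0],
                  [0.5, 1.0, 0.5]])
    firing_counts, firing_maps = [], []
    for _ in range(iterations):
        nb = convolve2d(Y, W, mode='same')         # weighted pulses of neighbouring neurons
        F = np.exp(-aF) * F + VF * nb + S          # feedback input Fij
        L = np.exp(-aL) * L + VL * nb              # linking input Lij
        U = F * (1.0 + beta * L)                   # internal activity Uij
        Y = (U > theta).astype(float)              # pulse output Yij
        theta = np.exp(-aT) * theta + VT * Y       # dynamic threshold thetaij
        firing_counts.append(int(Y.sum()))         # total firings in this iteration
        firing_maps.append(Y.copy())               # firing location map, reused in step 1.3
    return np.array(firing_counts), firing_maps    # time firing sequence + per-iteration maps
```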
Step 1.3: extract the neuron firing location information.
The firing-neuron location map obtained at each iteration is projected onto the time axis and the frequency axis respectively, and the two projected vectors are then merged into one vector. Finally, the vectors obtained from the firing location distribution maps of all iterations are arranged in time order as columns, and the resulting matrix is the speech emotion recognition feature matrix.
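A minimal sketch of the firing-location feature in step 1.3, reusing the hypothetical firing_maps output of the PCNN sketch above: each iteration's binary firing map is projected onto the time axis and the frequency axis, the two projections are concatenated, and the per-iteration vectors are stacked as columns of the feature matrix.

```python
import numpy as np

def firing_location_matrix(firing_maps):
    # firing_maps: list of binary arrays of shape (n_freq, n_time), one per PCNN iteration
    columns = []
    for Y in firing_maps:
        time_proj = Y.sum(axis=0)                     # projection onto the time axis
        freq_proj = Y.sum(axis=1)                     # projection onto the frequency axis
        columns.append(np.concatenate([time_proj, freq_proj]))
    # Columns ordered by iteration time form the speech emotion recognition feature matrix
    return np.stack(columns, axis=1)                  # shape (n_time + n_freq, iterations)
```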
Step 2: apply the SVR (support vector regression) algorithm to the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information individually to build speech emotion recognition models, and predict the pleasure value P, the arousal value A, and the dominance value D of the emotional speech.
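A minimal sketch of step 2: for each of the three feature types, one SVR model per PAD dimension is trained, so each feature independently yields a (P, A, D) prediction. The patent names only the SVR algorithm; scikit-learn, the RBF kernel, and the assumption that the variable-length frame features have already been pooled into fixed-length vectors are choices made for the sketch.

```python
import numpy as np
from sklearn.svm import SVR

def train_pad_svr(X_train, pad_train):
    # X_train: (n_samples, n_dims) fixed-length feature vectors for one feature type
    # pad_train: (n_samples, 3) reference P, A, D values of the training utterances
    return [SVR(kernel='rbf').fit(X_train, pad_train[:, d]) for d in range(3)]

def predict_pad(models, X):
    # Returns an (n_samples, 3) array of predicted P, A, D values
    return np.column_stack([m.predict(X) for m in models])
```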
Step 3: use Pearson correlation analysis to compute the correlation coefficients, in the three dimensions P, A, and D, of the predictions obtained from the three features, and determine the feature weights.
The correlation coefficient is computed as:
ρX,Y = E[(X − μX)(Y − μY)] / (σX σY),
where, in each computation, X represents the P, A, or D values predicted from one of the features, Y represents the corresponding P, A, or D values of the emotional speech in the emotion scale, μX and μY denote the mean values of the variables X and Y, and σX and σY denote their standard deviations. ρX,Y serves as the feature weight.
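A minimal sketch of step 3: the Pearson correlation coefficient between each feature's predictions and the PAD scale reference values is computed per dimension and then normalized over the three features so that the weights sum to 1, matching λ1 + λ2 + λ3 = 1 in step 4. numpy and the helper name are assumptions made for the sketch.

```python
import numpy as np

def pearson_weights(predictions, reference):
    # predictions: list of three (n_samples, 3) arrays, one per feature type
    #              (MFCC, time firing sequence, firing location information)
    # reference:   (n_samples, 3) P, A, D values from the emotion scale
    rho = np.array([[np.corrcoef(pred[:, d], reference[:, d])[0, 1]   # rho_{X,Y} per dimension
                     for d in range(3)]
                    for pred in predictions])                          # shape (3 features, 3 dims)
    # Normalize per dimension so the three weights sum to 1 (assumes positive correlations)
    return rho / rho.sum(axis=0, keepdims=True)
```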
Step 4: according to the weights of the different features, obtain by weighted fusion the final PAD value of the emotional speech in the three-dimensional emotional space.
The final PAD value of the emotional speech in the three-dimensional emotional space is:
P = P1λ1 + P2λ2 + P3λ3,
where P1, P2, and P3 denote, in order, the predicted values of the speech in the P (pleasure) dimension obtained from the Mel-frequency cepstral coefficients, the time firing sequence, and the firing location information, and λ1, λ2, and λ3 denote, in order, the normalized correlation coefficients of these three speech emotion features for this speech emotion type in the P (pleasure) dimension, satisfying λ1 + λ2 + λ3 = 1; the A and D values are fused in the same way.
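A minimal sketch of step 4, applying the weighted fusion above to all three dimensions at once with the normalized weights from the previous sketch; the helper names are assumptions.

```python
import numpy as np

def fuse_pad(predictions, weights):
    # predictions: list of three (n_samples, 3) arrays, one per feature type
    # weights:     (3 features, 3 dims) normalized correlation coefficients (lambdas)
    stacked = np.stack(predictions, axis=0)             # (3, n_samples, 3)
    # P = P1*l1 + P2*l2 + P3*l3, and likewise for A and D
    return np.einsum('fd,fnd->nd', weights, stacked)    # (n_samples, 3) final PAD values
```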
This embodiment uses the root-mean-square error (RMSE) as the evaluation index for the recognition of basic emotion types, computed as:
RMSE = sqrt((1/n) Σi (Xobs,i − Xmodel,i)²),
where Xobs,i denotes the experimental predicted value and Xmodel,i denotes the PAD scale reference value.
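A minimal numpy sketch of this RMSE evaluation, computed per dimension over the test utterances; the helper name is an assumption.

```python
import numpy as np

def rmse(predicted, reference):
    # predicted, reference: (n_samples, 3) arrays of PAD values
    return np.sqrt(np.mean((predicted - reference) ** 2, axis=0))   # one RMSE per P, A, D
```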
The RMSE of the P, A, and D values predicted by the present invention is computed and compared with that of the P, A, and D values predicted with MFCC alone; the results are shown in Table 2, where the RMSE of each dimension is normalized to between 0 and 1, and a smaller RMSE indicates better performance of the corresponding method.
Table 2. Comparison of the RMSE of the PAD values predicted by the present method and by MFCC alone
The experiments show that, within each sample set of the same emotion type, the test samples are distributed near the coordinate point of the PAD reference value, whereas samples of different emotion types are distributed more dispersedly; together with the RMSE comparison in Table 2, this proves that the present invention can effectively classify speech emotion types. The PAD reference values of the six kinds of emotional speech signals used in the test are shown in Table 3, and Table 4 lists, for one of the speech samples in the recognition process of this method, the normalized correlation coefficients of the three features in each dimension, which are used in the subsequent weighted fusion to obtain the final PAD prediction.
Table 3. PAD scale reference values of the 6 emotion types
Table 4. Correlation coefficient results of the three speech emotion features
Fig. 2 shows the distribution of the experimental samples mapped into the PAD three-dimensional emotional space after recognition by this method, and Table 5 gives the statistics of the numerical ranges corresponding to this distribution.
Table 5. Distribution ranges of the final PAD predicted values of the 6 emotion types
Since the recognition result of the present invention is not expressed with conventional discrete emotion label words, its advantage lies in that, by computing the distance of the result from the PAD values of the basic emotions in the emotion coordinate system, the constituent elements and their proportions in the affective state of the emotional speech can be further analyzed, so that mixed emotion types such as "mild and slightly sad" or "mild and rather happy" can be identified.
The present invention displays the recognition results in the PAD three-dimensional emotional space based on the continuous dimensional theory. Through the intuitive mapping, the differences and connections between various affective states can be clearly presented, psychological activity mixing several basic emotion types can be described, and the subtle and changeable affective states of human beings can be reflected. Experiments show that the present invention can display, through precise emotion coordinate values, the mixed affective content in emotional speech, accomplishing the speech emotion recognition task more finely.
It should be understood that the parts not elaborated in this specification belong to the prior art.
It should be understood that the above description of the preferred embodiment is relatively detailed and therefore should not be regarded as limiting the scope of patent protection of the present invention. Those skilled in the art, under the inspiration of the present invention and without departing from the scope protected by the claims of the present invention, may also make substitutions or variations, all of which fall within the protection scope of the present invention; the claimed scope of the present invention is determined by the appended claims.
Claims (5)
1. A speech emotion recognition method in the PAD three-dimensional emotional space, characterized by comprising the following steps:
Step 1: extracting the features of the emotional speech data, including the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information;
Step 2: applying the SVR algorithm to the Mel-frequency cepstral coefficients MFCC, the time firing sequence, and the firing location information individually to build speech emotion recognition models, and predicting the pleasure value P, the arousal value A, and the dominance value D of the emotional speech;
Step 3: using Pearson correlation analysis to compute the correlation coefficients, in the three dimensions P, A, and D, of the predictions obtained from the three features, and determining the feature weights;
Step 4: according to the weights of the different features, obtaining by weighted fusion the final PAD value of the emotional speech in the three-dimensional emotional space.
2. The speech emotion recognition method in the PAD three-dimensional emotional space according to claim 1, characterized in that: in step 1, during the extraction of the Mel-frequency cepstral coefficients MFCC, the preprocessing uses a frame length of 256, a frame shift of 128, and a Hamming window; in the calculation, the preprocessed speech data is subjected to a fast Fourier transform and the squared modulus is taken to obtain its energy spectrum, which is passed through the Mel filter bank to output the parameters Pm (m = 0, 1, 2, ..., M-1), computed as:
Pm = Σk |S(k)|² Hm(k), m = 0, 1, 2, ..., M-1,
wherein the number of filters M is 26, fm denotes the center frequency of the triangular filter, Hm(k) denotes the frequency response of the triangular filter, and S(n) denotes the result of the fast Fourier transform of the speech signal; 12th-order first-order-difference MFCC characteristic parameters are finally extracted.
3. The speech emotion recognition method in the PAD three-dimensional emotional space according to claim 1, characterized in that: in step 1, the time firing sequence and the firing location information are extracted using a simplified PCNN neuron model whose parameters αF, αL, αθ, VF, VL, Vθ, and β take the values 0.1, 1.0, 1.0, 0.5, 0.2, 20, and 0.1 respectively; αF, αL, and αθ denote the decay time constants of the feedback input Fij, the linking input Lij, and the dynamic threshold θij, VF, VL, and Vθ denote the feedback amplification coefficient, linking amplification coefficient, and threshold amplification coefficient of the PCNN respectively, and β is the linking strength coefficient; the initial values of the linking input L, the internal activity U, and the pulse output Y are set to 0, and the input is the normalized gray value, lying in [0, 1]; the linking field radius is r = 1.5, and the internal linking matrix W is a 3 × 3 matrix whose elements are the reciprocals r⁻² of the squared Euclidean distances from the central pixel to each surrounding pixel.
4. The speech emotion recognition method in the PAD three-dimensional emotional space according to claim 1, characterized in that: in step 3, the correlation coefficients of the predictions obtained from the three features in the P, A, and D dimensions are respectively:
ρX,Y = E[(X − μX)(Y − μY)] / (σX σY),
wherein, in each computation, X represents the P, A, or D values predicted from one of the features, Y represents the corresponding P, A, or D values of the emotional speech in the emotion scale, μX and μY denote the mean values of the variables X and Y, and σX and σY denote their standard deviations; the value of ρX,Y is the computed feature weight.
5. The speech emotion recognition method in the PAD three-dimensional emotional space according to any one of claims 1 to 4, characterized in that: in step 4, the final PAD value of the emotional speech in the three-dimensional emotional space is:
P = P1λ1 + P2λ2 + P3λ3,
wherein P1, P2, and P3 respectively represent the predicted values of the speech in the P dimension obtained from the Mel-frequency cepstral coefficients, the time firing sequence, and the firing location information, and λ1, λ2, and λ3 denote, in order, the normalized correlation coefficients of these three speech emotion features for this speech emotion type in the P dimension, satisfying λ1 + λ2 + λ3 = 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810438464.9A CN108682431B (en) | 2018-05-09 | 2018-05-09 | Voice emotion recognition method in PAD three-dimensional emotion space |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810438464.9A CN108682431B (en) | 2018-05-09 | 2018-05-09 | Voice emotion recognition method in PAD three-dimensional emotion space |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108682431A true CN108682431A (en) | 2018-10-19 |
CN108682431B CN108682431B (en) | 2021-08-03 |
Family
ID=63805990
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810438464.9A Expired - Fee Related CN108682431B (en) | 2018-05-09 | 2018-05-09 | Voice emotion recognition method in PAD three-dimensional emotion space |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108682431B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409296A (en) * | 2018-10-30 | 2019-03-01 | 河北工业大学 | Video emotion recognition method fusing facial expression recognition and speech emotion recognition
CN110555084A (en) * | 2019-08-26 | 2019-12-10 | 电子科技大学 | remote supervision relation classification method based on PCNN and multi-layer attention |
CN111402928A (en) * | 2020-03-04 | 2020-07-10 | 华南理工大学 | Attention-based speech emotion state evaluation method, device, medium and equipment |
CN113749656A (en) * | 2021-08-20 | 2021-12-07 | 杭州回车电子科技有限公司 | Emotion identification method and device based on multi-dimensional physiological signals |
CN117933269A (en) * | 2024-03-22 | 2024-04-26 | 合肥工业大学 | Multi-mode depth model construction method and system based on emotion distribution |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012168740A1 (en) * | 2011-06-10 | 2012-12-13 | X-System Limited | Method and system for analysing sound |
CN107633851A (en) * | 2017-07-31 | 2018-01-26 | 中国科学院自动化研究所 | Discrete voice emotion identification method, apparatus and system based on the prediction of emotion dimension |
-
2018
- 2018-05-09 CN CN201810438464.9A patent/CN108682431B/en not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012168740A1 (en) * | 2011-06-10 | 2012-12-13 | X-System Limited | Method and system for analysing sound |
CN107633851A (en) * | 2017-07-31 | 2018-01-26 | 中国科学院自动化研究所 | Discrete voice emotion identification method, apparatus and system based on the prediction of emotion dimension |
Non-Patent Citations (3)
Title |
---|
周慧 (Zhou Hui): "Emotional speech conversion and recognition based on the PAD three-dimensional emotion model", China Master's Theses Full-text Database, Information Science and Technology *
宋静 (Song Jing): "Research on the application of the PAD emotion model in emotional speech recognition", China Master's Theses Full-text Database, Information Science and Technology *
梁泽, 马义德, 张恩溯, 朱望飞, 汤书森 (Liang Ze, Ma Yide, Zhang Ensu, Zhu Wangfei, Tang Shusen): "A new speech emotion recognition method based on pulse coupled neural network", Journal of Computer Applications *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109409296A (en) * | 2018-10-30 | 2019-03-01 | 河北工业大学 | Video emotion recognition method fusing facial expression recognition and speech emotion recognition
CN109409296B (en) * | 2018-10-30 | 2020-12-01 | 河北工业大学 | Video emotion recognition method integrating facial expression recognition and voice emotion recognition |
CN110555084A (en) * | 2019-08-26 | 2019-12-10 | 电子科技大学 | remote supervision relation classification method based on PCNN and multi-layer attention |
CN110555084B (en) * | 2019-08-26 | 2023-01-24 | 电子科技大学 | Remote supervision relation classification method based on PCNN and multi-layer attention |
CN111402928A (en) * | 2020-03-04 | 2020-07-10 | 华南理工大学 | Attention-based speech emotion state evaluation method, device, medium and equipment |
CN113749656A (en) * | 2021-08-20 | 2021-12-07 | 杭州回车电子科技有限公司 | Emotion identification method and device based on multi-dimensional physiological signals |
CN113749656B (en) * | 2021-08-20 | 2023-12-26 | 杭州回车电子科技有限公司 | Emotion recognition method and device based on multidimensional physiological signals |
CN117933269A (en) * | 2024-03-22 | 2024-04-26 | 合肥工业大学 | Multi-mode depth model construction method and system based on emotion distribution |
Also Published As
Publication number | Publication date |
---|---|
CN108682431B (en) | 2021-08-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108682431A (en) | A kind of speech-emotion recognition method in PAD three-dimensionals emotional space | |
Badshah et al. | Deep features-based speech emotion recognition for smart affective services | |
Demircan et al. | Application of fuzzy C-means clustering algorithm to spectral features for emotion classification from speech | |
Hao et al. | Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features | |
Kaya et al. | Robust acoustic emotion recognition based on cascaded normalization and extreme learning machines | |
Zhou et al. | Deception detecting from speech signal using relevance vector machine and non-linear dynamics features | |
Xiao et al. | Multi-sensor data fusion for sign language recognition based on dynamic Bayesian network and convolutional neural network | |
Ye et al. | GM-TCNet: Gated multi-scale temporal convolutional network using emotion causality for speech emotion recognition | |
Wu et al. | Environmental sound classification via time–frequency attention and framewise self-attention-based deep neural networks | |
Huang et al. | Emotional speech feature normalization and recognition based on speaker-sensitive feature clustering | |
Dang et al. | Dynamic multi-rater gaussian mixture regression incorporating temporal dependencies of emotion uncertainty using kalman filters | |
Sharma et al. | Framework for gender recognition using voice | |
Karthikeyan | Adaptive boosted random forest-support vector machine based classification scheme for speaker identification | |
Zhao et al. | A survey on automatic emotion recognition using audio big data and deep learning architectures | |
Xue et al. | Physiological-physical feature fusion for automatic voice spoofing detection | |
Gorrostieta et al. | Attention-based Sequence Classification for Affect Detection. | |
Taran | A nonlinear feature extraction approach for speech emotion recognition using VMD and TKEO | |
US20130262097A1 (en) | Systems and methods for automated speech and speaker characterization | |
Fonnegra et al. | Speech emotion recognition based on a recurrent neural network classification model | |
Nelus et al. | Privacy-preserving audio classification using variational information feature extraction | |
Liu et al. | Hierarchical component-attention based speaker turn embedding for emotion recognition | |
Mallikarjunan et al. | Text-independent speaker recognition in clean and noisy backgrounds using modified VQ-LBG algorithm | |
Sukvichai et al. | Automatic speech recognition for Thai sentence based on MFCC and CNNs | |
Lei et al. | Robust scream sound detection via sound event partitioning | |
Trivikram et al. | EVALUATION OF HYBRID FACE AND VOICE RECOGNITION SYSTEMS FOR BIOMETRIC IDENTIFICATION IN AREAS REQUIRING HIGH SECURITY. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20210803 |