CN109767788A - A speech emotion recognition method based on LLD and DSS fusion features - Google Patents

A speech emotion recognition method based on LLD and DSS fusion features Download PDF

Info

Publication number
CN109767788A
CN109767788A (application number CN201910143689.6A)
Authority
CN
China
Prior art keywords
feature
dss
lld
data set
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910143689.6A
Other languages
Chinese (zh)
Inventor
Zhang Xiuzai
Wang Weiwei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN201910143689.6A
Publication of CN109767788A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition method based on LLD and DSS fusion features. Specifically, on the basis of traditional LLD features, DSS features are added to expand the feature set; the expanded feature set is then reduced in dimensionality by an autoencoder to obtain the LLD+DSS fusion feature. Finally, the LLD+DSS fusion feature is used as the input of an LSTM deep network, and the LSTM deep network determines the emotion category corresponding to each fusion feature. Compared with traditional speech emotion features and classification and recognition algorithms, the present invention has better overall performance and improves the accuracy of speech emotion classification.

Description

A speech emotion recognition method based on LLD and DSS fusion features
Technical field
The invention belongs to the fields of artificial intelligence and speech recognition, and more particularly relates to a speech emotion recognition method based on LLD and DSS fusion features.
Background technique
In recent years, with the development of computer technology, human-machine interaction (HMI) technology has also made significant progress, but it is still far from the level of fully natural human communication, because machines have difficulty understanding the paralinguistic information hidden in speech, of which emotion is one example. The basic task of speech emotion recognition (SER) is to classify the speaker's emotional state from the speech signal, making HMI more natural and realistic. Although researchers at home and abroad have studied SER extensively, the performance of SER systems so far remains relatively low and is not yet suitable for practical application.
The main work of speech emotion recognition consists of speech emotion feature extraction and the selection of the classification network model. Current research at home and abroad mostly concerns the selection of the classification network model, and considerable progress has been made on classification models. The most common classification models in speech emotion recognition are the support vector machine [1] (SVM), the artificial neural network [2] (ANN), the K-nearest-neighbor algorithm [3] (KNN), the Elman neural network [4] and the long short-term memory network [5] (LSTM). These models mostly use the low-level descriptor (LLD) emotion features adopted by the Interspeech challenges, and few emotion features are designed for the network being optimized. Therefore, how to mine latent features and improve the recognition rate still remains to be studied.
Bibliography
[1] Lin Y L, Wei G. Speech emotion recognition based on HMM and SVM[C]// Proceedings of 2005 International Conference on Machine Learning and Cybernetics. IEEE, 2005, 8: 4898-4901.
[2] Han K, Yu D, Tashev I. Speech emotion recognition using deep neural network and extreme learning machine[C]// Fifteenth Annual Conference of the International Speech Communication Association. 2014.
[3] Schuller B, Rigoll G, Lang M. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture[C]// Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04). IEEE, 2004, 1: I-577.
[4] Yu Lingli, Zhou Kaijun, Qiu Aibing. Research on the application of speech emotion recognition based on Elman neural network[J]. Application Research of Computers, 2012, 29(5): 1809-1814.
[5] Wöllmer M, Kaiser M, Eyben F, et al. LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework[J]. Image and Vision Computing, 2013, 31(2): 153-163.
[6] Andén J, Mallat S. Deep scattering spectrum[J]. IEEE Transactions on Signal Processing, 2014, 62(16): 4114-4128.
[7] Deng J, Zhang Z, Eyben F, et al. Autoencoder-based unsupervised domain adaptation for speech emotion recognition[J]. IEEE Signal Processing Letters, 2014, 21(9): 1068-1072.
[8] Zheng F, Zhang G, Song Z. Comparison of different implementations of MFCC[J]. Journal of Computer Science and Technology, 2001, 16(6): 582-589.
[9] Guo J M, Markoni H. Driver drowsiness detection using hybrid convolutional neural network and long short-term memory[J]. Multimedia Tools and Applications, 2018: 1-29.
[10] Morchid M, Bousquet P M, Kheder W B, et al. Latent topic-based subspace for natural language processing[J]. Journal of Signal Processing Systems, 2018: 1-21.
[11] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[12] Burkhardt F, Paeschke A, Rolfes M, et al. A database of German emotional speech[C]// Ninth European Conference on Speech Communication and Technology. 2005.
[13] Livingstone S R, Peck K, Russo F A. RAVDESS: the Ryerson Audio-Visual Database of Emotional Speech and Song[C]// Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science. 2012: 205-211.
[14] Jackson P, Haq S. Surrey Audio-Visual Expressed Emotion (SAVEE) database[J]. University of Surrey: Guildford, UK, 2014.
Summary of the invention
Object of the invention: aiming at the poor performance of existing speech emotion features in classification and recognition, the present invention provides a speech emotion recognition method based on LLD and DSS fusion features.
Technical solution: the present invention provides a speech emotion recognition method based on LLD and DSS fusion features, the method comprising the following steps:
Step 1: extracting the LLD features and DSS features of an emotional speech data set;
Step 2: using the LLD features and DSS features as the training set of an autoencoder, the autoencoder performing dimensionality reduction on the LLD features and DSS features to obtain the reduced-dimension LLD+DSS fusion feature;
Step 3: sequentially inputting the LLD+DSS fusion features of step 2 into an LSTM deep network, the LSTM deep network identifying the emotion category corresponding to each fusion feature.
Further, in step 1, the DSS features are extracted from the emotional speech data set using a DSS algorithm; the order of the DSS algorithm is set to 2, i.e. the extracted DSS features comprise the zeroth-order, first-order and second-order features of the emotional speech data set. Each feature is obtained as follows: the emotional speech data set, as the input signal, is passed through a first low-pass filter to obtain the zeroth-order feature; the input signal is passed successively through a first wavelet band-pass filter and a second low-pass filter to obtain the first-order feature; the input signal is passed successively through the first wavelet band-pass filter, a second wavelet band-pass filter and a third low-pass filter to obtain the second-order feature, the frequency of the second wavelet band-pass filter being higher than the frequency of the first wavelet band-pass filter.
Further, in step 1, the emotional speech data set comprises the EMODB data set, the RAVDESS data set and the Surrey data set.
Further, in step 3, the LSTM deep network has β network layers; the first β-1 layers are trained on the input LLD+DSS fusion feature to obtain the hidden feature of the fusion feature, and the last layer is a classifier which determines the emotion category corresponding to the hidden feature, i.e. the emotion category corresponding to the fusion feature.
Further, the number of dimensions of the classifier is equal to the number θ of shared emotion categories, each dimension corresponding to one of the shared emotion categories; the shared emotion categories are the emotion categories common to the EMODB data set, the RAVDESS data set and the Surrey data set.
Further, the classifier determines the emotion category corresponding to the hidden feature as follows: the classifier maps the hidden feature into the interval (0, 1) to obtain θ1 probabilities, the θ1 probabilities corresponding one-to-one to the θ shared emotion categories, with θ1 = θ; the emotion category with the highest probability is the emotion category corresponding to the hidden feature.
Further, the classifier is a softmax classifier.
Further, the autoencoder has three neural network layers, namely an input layer, a hidden layer and an output layer; the dimension of the LLD+DSS fusion feature is equal to the number of output-layer neurons.
Beneficial effects: in view of the fact that emotional speech signals contain temporal information, and exploiting the advantage of LSTM deep networks in processing text and speech data, the present invention proposes a speech emotion classification method based on LLD and DSS fusion features. According to the nonlinear, non-stationary nature of emotional speech signals, DSS features are first extracted with the deep scattering spectrum; the expanded feature set is then reduced in dimensionality by an autoencoder to obtain the LLD+DSS fusion feature, and an LSTM deep network is then used to perform the emotion classification of the speech. Compared with traditional speech emotion features and classification and recognition algorithms, the speech emotion classification method based on LLD and DSS fusion features has better overall performance and improves the accuracy of speech emotion classification.
Detailed description of the invention
Fig. 1 is the flow chart of the invention;
Fig. 2 is the DSS feature extraction diagram;
Fig. 3 shows the zeroth-order (a), first-order (b) and second-order (c) logarithmic energy of the DSS of a Fear sentence;
Fig. 4 is the autoencoder network structure of the invention;
Fig. 5 is the internal basic structure of the LSTM deep network;
Fig. 6 shows the experimental results of the invention on the EMODB data set;
Fig. 7 shows the experimental results of the invention on the RAVDESS data set;
Fig. 8 shows the experimental results of the invention on the SAVEE data set.
Specific embodiment
The accompanying drawings, which form a part of the present invention, are provided to give a further understanding of the invention; the schematic embodiments of the invention and their description are used to explain the invention and do not constitute an improper limitation of the invention.
As shown in Fig. 1, on the basis of traditional LLD features, the present embodiment adds DSS features to expand the feature set, then reduces the dimensionality of the expanded feature set with an autoencoder to obtain the LLD+DSS fusion feature. Finally, the LLD+DSS fusion feature is used as the input of an LSTM deep network, which determines the emotion category corresponding to each fusion feature, as sketched below.
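For orientation only, the following minimal Python sketch outlines the three stages of Fig. 1 (feature-set expansion, autoencoder dimensionality reduction, LSTM classification); the callables extract_lld, extract_dss, autoencoder and lstm_classifier are hypothetical placeholders, not components defined by the patent:

import numpy as np

def recognize_emotion(wave, sr, extract_lld, extract_dss, autoencoder, lstm_classifier):
    """Illustrative outline of the Fig. 1 pipeline; all callables are assumed to be trained already."""
    lld = extract_lld(wave, sr)              # 79-dimensional LLD feature vector
    dss = extract_dss(wave, sr)              # 600-dimensional DSS feature vector
    expanded = np.concatenate([lld, dss])    # expanded feature set
    fused = autoencoder.encode(expanded)     # reduced-dimension LLD+DSS fusion feature
    probs = lstm_classifier.predict(fused)   # probabilities over the shared emotion categories
    return int(np.argmax(probs))             # index of the most probable emotion category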
The LLD features of the emotional speech data set extracted in the present embodiment have 79 dimensions; the specific dimensions are shown in Table 1:
Table 1
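The exact composition of the 79 dimensions is given in Table 1 and is not reproduced here. Purely as a hedged illustration of what frame-level LLD extraction looks like, the sketch below computes a few typical descriptors (MFCC, energy, zero-crossing rate, pitch) with librosa and summarizes them with utterance-level statistics; it is not the patent's 79-dimensional feature set:

import numpy as np
import librosa

def extract_lld_example(path, sr=16000, n_mfcc=13):
    # Illustrative LLD extraction; the descriptor choice and statistics are assumptions, not Table 1.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # cepstral coefficients per frame
    rms = librosa.feature.rms(y=y)                           # frame energy
    zcr = librosa.feature.zero_crossing_rate(y)              # zero-crossing rate per frame
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)            # pitch contour
    n = min(mfcc.shape[1], rms.shape[1], zcr.shape[1], len(f0))
    contours = np.vstack([mfcc[:, :n], rms[:, :n], zcr[:, :n], f0[None, :n]])
    # Mean and standard deviation of each contour give an utterance-level feature vector.
    return np.concatenate([contours.mean(axis=1), contours.std(axis=1)])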
The emotional speech data used in the present embodiment comprise three emotional speech data sets: the EMODB data set, the RAVDESS data set and the Surrey (SAVEE) data set. The specific emotion categories, numbers of speakers and numbers of sentences of the three data sets are shown in Table 2. In the present embodiment, for the EMODB data set the emotional sentences of 10 speakers (5 male, 5 female) are taken, 20 sentences per emotion; for the RAVDESS data set the emotional sentences of 16 speakers (8 male, 8 female) are taken, 50 sentences per emotion; for the Surrey data set the emotional sentences of 4 speakers are taken, 20 sentences per emotion. 80% of the sentences are used as the training set and 20% as the test set. Ten experiments are carried out, and the average of the ten recognition rates is used as the evaluation index. Speech emotion classification and recognition is then performed with the LLD+DSS fusion features.
Table 2
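To make the evaluation protocol concrete, the following sketch repeats the random 80%/20% split ten times and reports the mean recognition rate; train_and_score is a hypothetical stand-in for training one of the compared classifiers and returning its test accuracy, and the stratified split is an added assumption:

import numpy as np
from sklearn.model_selection import train_test_split

def average_recognition_rate(features, labels, train_and_score, n_runs=10):
    # Ten random 80/20 splits; the averaged accuracy is the evaluation index.
    rates = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, labels, test_size=0.2, stratify=labels, random_state=run)
        rates.append(train_and_score(X_tr, y_tr, X_te, y_te))
    return float(np.mean(rates))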
The deep scattering spectrum (DSS) was proposed by Joakim Andén and Stéphane Mallat in 2014. The MFCC features commonly found in LLD feature sets differ from DSS features in the high-frequency part: when the low-pass filter is applied by convolution, the MFCC features contain almost no frequency detail and lose the high-frequency components, whereas DSS features can compensate for the high-frequency characteristics that MFCC features cannot represent. DSS features have achieved better results than MFCC features on speech and music classification [6]. The DSS features are extracted with ScatNet, and DSS contains a richer frequency-domain energy distribution and more time-shift components than LLD.
In most experiments, decomposing the DSS features to the second-order scattering coefficients is sufficient for speech emotion classification, because the zeroth- to second-order scattering coefficients capture most of the signal energy. Therefore, in the present embodiment the order of the DSS algorithm is 2: the DSS algorithm extracts the zeroth- to second-order features of the emotional speech data set to obtain its DSS features, which in this embodiment have 600 dimensions.
The DSS feature extraction process is shown in Fig. 2: the emotional speech data set, as the input signal, passes through a first low-pass filter to obtain the zeroth-order feature; the input signal then passes successively through a first wavelet band-pass filter and a second low-pass filter to obtain the first-order feature; the input signal then passes successively through the first wavelet band-pass filter, a second wavelet band-pass filter and a third low-pass filter to obtain the second-order feature. The frequency of the second wavelet band-pass filter should be higher than that of the first wavelet band-pass filter; how much higher depends on the experimental conditions, the pass band of the second wavelet band-pass filter being chosen to recover the high-frequency-band signal. The zeroth-order, first-order and second-order features of the input signal together constitute the DSS feature.
The zeroth-order feature of the input signal is expressed as
S0(x) = x * φ      (1)
In formula (1), S0(x) denotes the zeroth-order feature, x is the input signal, and φ is the transfer function of the low-pass filter.
The first-order feature is expressed as
S1(x) = |x * ψλ1| * φ      (2)
In formula (2), S1(x) is the first-order feature and ψλ1 is the transfer function of the band-pass filter based on the Morlet wavelet λ1.
The second-order feature is expressed as
S2(x) = ||x * ψλ1| * ψλ2| * φ      (3)
In formula (3), S2(x) is the second-order feature and ψλ2 is the transfer function of the band-pass filter based on the Morlet wavelet λ2.
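A minimal numpy sketch of the cascade of formulas (1)-(3) is given below; the frequency-domain Gaussian low-pass and band-pass filters and all of their parameters are illustrative assumptions (in practice the DSS features are extracted with ScatNet or a comparable scattering toolbox):

import numpy as np

def lowpass(n, cutoff):
    # Gaussian low-pass filter phi, defined in the frequency domain (illustrative parameters).
    f = np.fft.fftfreq(n)
    return np.exp(-(f / cutoff) ** 2)

def bandpass(n, center, width):
    # Gaussian approximation of a Morlet band-pass filter psi_lambda (illustrative parameters).
    f = np.fft.fftfreq(n)
    return np.exp(-((np.abs(f) - center) / width) ** 2)

def convolve(x, H):
    # Circular convolution of signal x with the filter whose frequency response is H.
    return np.real(np.fft.ifft(np.fft.fft(x) * H))

def dss_orders(x, lam1=0.05, lam2=0.15, cutoff=0.01, width=0.02):
    n = len(x)
    phi, psi1, psi2 = lowpass(n, cutoff), bandpass(n, lam1, width), bandpass(n, lam2, width)
    s0 = convolve(x, phi)                      # formula (1): S0(x) = x * phi
    u1 = np.abs(convolve(x, psi1))
    s1 = convolve(u1, phi)                     # formula (2): S1(x) = |x * psi_lam1| * phi
    u2 = np.abs(convolve(u1, psi2))
    s2 = convolve(u2, phi)                     # formula (3): S2(x) = ||x * psi_lam1| * psi_lam2| * phi
    return s0, s1, s2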
In the present embodiment, one segment of FEAR emotional speech from the EMODB data set is selected for DSS feature extraction; the zeroth-order, first-order and second-order features obtained are shown in Fig. 3 (a), (b) and (c), respectively.
The set of 79-dimensional LLD features and 600-dimensional DSS features extracted from the emotional speech data set is input to the autoencoder for feature dimensionality reduction, yielding an LLD+DSS fusion feature of a certain dimension. The dimension of the LLD+DSS fusion feature is equal to the number of neurons in the output layer of the autoencoder.
The autoencoder used in the present invention is an artificial neural network for data dimensionality reduction [7], consisting of three neural network layers; its structure is shown in Fig. 4. The encoding process can be regarded as the mapping from the input layer to the hidden layer, expressed as:
h = σh(W1x1 + b1)
where h is the output of the hidden layer, σh is the activation function of the hidden layer, x1 is the set of the 79-dimensional LLD features and the 600-dimensional DSS features at the input, and W1 and b1 are the weight and bias parameters of the hidden layer.
The decoding process can be regarded as the mapping from the hidden layer to the output layer, expressed as:
y = σy(W2h + b2)
where y is the output of the output layer, σy is the activation function of the output layer, and W2 and b2 are the weight and bias parameters of the output layer.
The autoencoder is optimized through a loss function; the cross-entropy is chosen as the loss function, expressed as:
J(W, b) = -Σi [yi·log y'i + (1 - yi)·log(1 - y'i)]
where J is the loss function, W and b are the weight and bias parameters of the whole network, and yi and y'i denote the label value of the sample and the output value of the network, respectively.
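A minimal PyTorch sketch of this three-layer autoencoder follows; the input dimension 679 (79 LLD + 600 DSS dimensions), the bottleneck size and the training loop are illustrative assumptions, and, as is usual for autoencoder dimensionality reduction, the hidden-layer activation h is taken here as the reduced feature:

import torch
import torch.nn as nn

class FusionAutoencoder(nn.Module):
    # Three layers: input -> hidden (encoding) -> output (reconstruction); sizes are assumptions.
    def __init__(self, in_dim=679, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())  # h = sigma_h(W1 x1 + b1)
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())  # y = sigma_y(W2 h + b2)

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), h

def train_autoencoder(model, loader, epochs=50, lr=1e-3):
    # Cross-entropy reconstruction loss between input and output, as in the loss function above;
    # the input vectors are assumed to be scaled to [0, 1].
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for x in loader:
            y, _ = model(x)
            loss = loss_fn(y, x)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

After training, model.encoder(x) yields the LLD+DSS fusion feature that is fed to the LSTM deep network.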
The LSTM deep network can store information in memory cells according to the time sequence and can learn the contextual information relevant to the classification task. The LSTM deep network is similar to an RNN, except that the nonlinear hidden units are replaced by memory blocks of a special type. Each memory block in the LSTM deep network contains one or more memory cells and three multiplicative units (the input, output and forget gates). The multiplicative gates allow the memory cells to store and access information over long input sequences. In this embodiment, speech segments of five frames (20 ms per frame) are used as the data unit to pre-train the LSTM deep network.
The internal basic structure of the LSTM deep network in the present embodiment is shown in Fig. 5; the network model has β network layers. To control the flow of information, memory cells are designed inside the nodes of the LSTM deep network, and the deletion or addition of information is controlled by gates. A gate is a means of selectively letting information through; each node of the LSTM deep network has three kinds of gates, the input gate, the forget gate and the output gate, which protect and control the node state. Let xt be the input of an LSTM deep network node at time t, ht its output at time t, Wxk (k = i, f, c, o) the weights corresponding to the input, Whk (k = i, f, c, o) the weights corresponding to the output, Wck (k = i, f, c, o) the weights corresponding to the memory-cell value ct at time t, bk (k = i, f, c, o) the corresponding biases, and σ the activation function. The process by which the LSTM neural network model controls the information update through the gates is divided into four steps [11]:
A. Compute the value it of the input gate at time t. The input gate controls the influence of the current input on the memory-cell state value; the expression is as follows:
it = σ(Wxixt + Whiht-1 + Wcict-1 + bi)
where ht-1 is the output of the node at time t-1 and ct-1 is the value of the memory cell at time t-1.
B. Compute the value ft of the forget gate at time t. The forget gate controls the influence of historical information on the memory-cell state value; the expression is as follows:
ft = σ(Wxfxt + Whfht-1 + Wcfct-1 + bf)
C. Compute the value ct of the memory cell at time t; the expression is as follows:
ct = ft·ct-1 + it·tanh(Wxcxt + Whcht-1 + bc)
D. Compute the output ht at time t, which is determined by the output gate value ot; the calculation is as follows:
ht = ot·tanh(ct)
where ot = σ(Wxoxt + Whoht-1 + Wcoct-1 + bo).
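The following numpy sketch carries out one time step of steps A-D exactly as written above (including the peephole-style terms in which the gates also see the memory-cell value); the weight shapes and the treatment of the memory-cell weights as element-wise vectors are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W['xi'], W['hi'], ... are weight matrices; W['ci'], W['cf'], W['co'] are element-wise
    # memory-cell weights; b['i'], b['f'], b['c'], b['o'] are biases.
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])    # step A: input gate
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])    # step B: forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])  # step C: memory cell
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c_prev + b['o'])    # output gate
    h_t = o_t * np.tanh(c_t)                                                       # step D: node output
    return h_t, c_t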
The dimension-reduced LLD+DSS fusion feature matrix is used as the input of the LSTM deep network; the first β-1 layers of the LSTM deep network are trained on the input LLD+DSS fusion feature to obtain the hidden feature of the fusion feature.
The last layer of the LSTM deep network is a Softmax classifier [10]. Since the EMODB data set, the RAVDESS data set and the Surrey data set share 5 identical emotion categories (anger, disgust, fear, happiness, sadness), the Softmax classifier in the present embodiment has five dimensions, each dimension corresponding to one emotion category.
The Softmax classifier maps the hidden feature into the interval (0, 1) and obtains 5 probabilities, which correspond one-to-one to the 5 emotion categories; the emotion category with the highest probability is the emotion category corresponding to the fusion feature.
The Softmax classifier maps the outputs of multiple neurons, i.e. the hidden feature, into the interval (0, 1), which can be regarded as estimating the class to which each sample belongs. The specific calculation is as follows:
P(y = j | x) = exp(Wj·x) / Σk=1..K exp(Wk·x)
where K is the total number of classes, j is the currently predicted class, x is the neuron output, and Wj is the weight coefficient corresponding to class j.
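The softmax mapping itself can be sketched in a few lines of numpy; the snippet below scores the last-layer hidden feature against the five class weight vectors and returns the predicted emotion category (the weight shapes and the category order are illustrative assumptions):

import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness"]   # the 5 shared categories

def softmax_predict(hidden, W):
    # hidden: hidden feature from the last LSTM layer; W: (5, d) class weight matrix.
    scores = W @ hidden
    scores = scores - scores.max()                 # subtract the maximum for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # probabilities in (0, 1) summing to 1
    return EMOTIONS[int(np.argmax(probs))], probs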
In the present embodiment, the LLD features, the DSS features and the LLD+DSS fusion features are each classified with five networks (KNN, LVQ, SVM, BP and LSTM) for emotion classification comparison. Figs. 6, 7 and 8 show the classification results on the EMODB, RAVDESS and SAVEE data sets, respectively. From Figs. 6, 7 and 8 it can be seen that the recognition rate of the LLD+DSS fusion feature is higher than that of the LLD feature alone in almost all classification methods, and that the best recognition rate is obtained with the LSTM network.
Fig. 6 shows the experimental results on the EMODB data set: the recognition rates of the KNN, LVQ, SVM and BP networks with the LLD feature are all higher than with the DSS feature; in the LSTM network the recognition rate of the DSS feature is slightly higher than that of the LLD feature; and the recognition rates of all five networks with the LLD+DSS fusion feature are higher than with the LLD feature and the DSS feature, the LSTM network with the LLD+DSS fusion feature giving the relatively highest rate.
Fig. 7 shows the experimental results on the RAVDESS data set: the SVM network performs almost identically with the LLD+DSS feature and the LLD feature; the recognition rates of the KNN, LVQ, BP and LSTM networks with the LLD+DSS fusion feature are higher than with the LLD feature and the DSS feature; the LSTM network with the LLD+DSS fusion feature gives the relatively highest rate.
Fig. 8 shows the experimental results on the SAVEE data set: the recognition rate of the SVM network with the LLD+DSS fusion feature is slightly lower than with the LLD feature (a result of the poor recognition performance of SVM on high-dimensional features); the recognition rates of the KNN, LVQ, BP and LSTM networks with the LLD+DSS fusion feature are all higher than with the LLD feature and the DSS feature; the LSTM network with the LLD+DSS fusion feature gives the relatively highest rate.
It should further be noted that the specific technical features described in the above specific embodiments may be combined in any suitable manner provided that no contradiction arises. To avoid unnecessary repetition, the present invention does not further describe the various possible combinations.

Claims (8)

1. A speech emotion recognition method based on LLD and DSS fusion features, characterized in that the method comprises the following steps:
Step 1: extracting the LLD features and DSS features of an emotional speech data set;
Step 2: using the LLD features and DSS features as the training set of an autoencoder, the autoencoder performing dimensionality reduction on the LLD features and DSS features to obtain a reduced-dimension LLD+DSS fusion feature;
Step 3: sequentially inputting the LLD+DSS fusion features of step 2 into an LSTM deep network, the LSTM deep network identifying the emotion category corresponding to each fusion feature.
2. The method according to claim 1, characterized in that in step 1 the DSS features are extracted from the emotional speech data set using a DSS algorithm; the order of the DSS algorithm is set to 2, i.e. the extracted DSS features comprise the zeroth-order, first-order and second-order features of the emotional speech data set, each feature being obtained as follows: the emotional speech data set, as the input signal, is passed through a first low-pass filter to obtain the zeroth-order feature; the input signal is passed successively through a first wavelet band-pass filter and a second low-pass filter to obtain the first-order feature; the input signal is passed successively through the first wavelet band-pass filter, a second wavelet band-pass filter and a third low-pass filter to obtain the second-order feature, the frequency of the second wavelet band-pass filter being higher than the frequency of the first wavelet band-pass filter.
3. The method according to claim 1, characterized in that in step 1 the emotional speech data set comprises an EMODB data set, a RAVDESS data set and a Surrey data set.
4. The method according to claim 3, characterized in that in step 3 the LSTM deep network has β network layers, the first β-1 layers of which are trained on the input LLD+DSS fusion feature to obtain the hidden feature of the fusion feature; the last layer is a classifier which determines the emotion category corresponding to the hidden feature, i.e. the emotion category corresponding to the fusion feature.
5. The method according to claim 4, characterized in that the number of dimensions of the classifier is equal to the number θ of emotion categories, each dimension corresponding to one of the shared emotion categories; the shared emotion categories are the emotion categories common to the EMODB data set, the RAVDESS data set and the Surrey data set.
6. The method according to claim 5, characterized in that the classifier determines the emotion category corresponding to the hidden feature as follows: the classifier maps the hidden feature into the interval (0, 1) to obtain θ1 probabilities, the θ1 probabilities corresponding one-to-one to the θ shared emotion categories, with θ1 = θ; the emotion category with the highest probability is the emotion category corresponding to the hidden feature.
7. The method according to claim 4, characterized in that the classifier is a softmax classifier.
8. The method according to claim 1, characterized in that the autoencoder has three neural network layers, namely an input layer, a hidden layer and an output layer, the dimension of the LLD+DSS fusion feature being equal to the number of output-layer neurons.
CN201910143689.6A 2019-02-25 2019-02-25 A speech emotion recognition method based on LLD and DSS fusion features Pending CN109767788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910143689.6A CN109767788A (en) 2019-02-25 2019-02-25 A speech emotion recognition method based on LLD and DSS fusion features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910143689.6A CN109767788A (en) 2019-02-25 2019-02-25 A speech emotion recognition method based on LLD and DSS fusion features

Publications (1)

Publication Number Publication Date
CN109767788A true CN109767788A (en) 2019-05-17

Family

ID=66457509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910143689.6A Pending CN109767788A (en) 2019-02-25 2019-02-25 A speech emotion recognition method based on LLD and DSS fusion features

Country Status (1)

Country Link
CN (1) CN109767788A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150317990A1 (en) * 2014-05-02 2015-11-05 International Business Machines Corporation Deep scattering spectrum in acoustic modeling for speech recognition
US20160180838A1 (en) * 2014-12-22 2016-06-23 Google Inc. User specified keyword spotting using long short term memory neural network feature extractor
CN104835507A (en) * 2015-03-30 2015-08-12 渤海大学 Serial-parallel combined multi-mode emotion information fusion and identification method
CN106205636A (en) * 2016-07-07 2016-12-07 东南大学 A kind of speech emotion recognition Feature fusion based on MRMR criterion
CN106250855A (en) * 2016-08-02 2016-12-21 南京邮电大学 A kind of multi-modal emotion identification method based on Multiple Kernel Learning
CN107133481A (en) * 2017-05-22 2017-09-05 西北工业大学 The estimation of multi-modal depression and sorting technique based on DCNN DNN and PV SVM
CN107316654A (en) * 2017-07-24 2017-11-03 湖南大学 Emotion identification method based on DIS NV features
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN108899051A (en) * 2018-06-26 2018-11-27 北京大学深圳研究生院 A kind of speech emotion recognition model and recognition methods based on union feature expression

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KUN-YI HUANG ET AL.: "Mood Disorder Identification Using Deep Bottleneck Feature of Elicited Speech", Proceedings of APSIPA Annual Summit and Conference 2017 *
KUN-YI HUANG ET AL.: "Speech Emotion Recognition Using Autoencoder Bottleneck Feature and LSTM", 2016 International Conference on Orange Technologies *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151071A (en) * 2020-09-23 2020-12-29 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN112151071B (en) * 2020-09-23 2022-10-28 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning

Similar Documents

Publication Publication Date Title
Latif et al. Deep representation learning in speech processing: Challenges, recent advances, and future trends
Li et al. Dilated residual network with multi-head self-attention for speech emotion recognition
CN110675860A (en) Voice information identification method and system based on improved attention mechanism and combined with semantics
Terechshenko et al. A comparison of methods in political science text classification: Transfer learning language models for politics
Pan et al. Oil well production prediction based on CNN-LSTM model with self-attention mechanism
Wei et al. A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model
Lian et al. Conversational emotion recognition using self-attention mechanisms and graph neural networks.
CN111309909A (en) Text emotion classification method based on hybrid model
CN111325233B (en) Transformer fault detection method and device
CN117095702A (en) Multi-mode emotion recognition method based on gating multi-level feature coding network
Srivastava et al. Speech recognition using HMM and Soft Computing
Tao et al. News text classification based on an improved convolutional neural network
Kamaruddin et al. Features extraction for speech emotion
Xu Intelligent automobile auxiliary propagation system based on speech recognition and AI driven feature extraction techniques
CN106992000A (en) Prediction-based multi-feature fusion old people voice emotion recognition method
Shah et al. Articulation constrained learning with application to speech emotion recognition
Liu et al. Graph based emotion recognition with attention pooling for variable-length utterances
CN109767788A (en) A kind of speech-emotion recognition method based on LLD and DSS fusion feature
Niu Music Emotion Recognition Model Using Gated Recurrent Unit Networks and Multi‐Feature Extraction
CN115512721A (en) PDAN-based cross-database speech emotion recognition method and device
CN113190733B (en) Network event popularity prediction method and system based on multiple platforms
Kim et al. Representation learning with graph neural networks for speech emotion recognition
Mavaddati Voice-based age, gender, and language recognition based on ResNet deep model and transfer learning in spectro-temporal domain
Pashaian et al. Speech Enhancement Using Joint DNN‐NMF Model Learned with Multi‐Objective Frequency Differential Spectrum Loss Function
Sabuj et al. A Comparative Study of Machine Learning Classifiers for Speaker’s Accent Recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190517

RJ01 Rejection of invention patent application after publication