CN109767788A - A speech emotion recognition method based on LLD and DSS fusion features - Google Patents

A speech emotion recognition method based on LLD and DSS fusion features Download PDF

Info

Publication number
CN109767788A
CN109767788A (application number CN201910143689.6A)
Authority
CN
China
Prior art keywords
feature
dss
lld
data set
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910143689.6A
Other languages
Chinese (zh)
Inventor
Zhang Xiuzai
Wang Weiwei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN201910143689.6A
Publication of CN109767788A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition method based on LLD and DSS fusion features. Specifically, on the basis of traditional LLD features, DSS features are added to expand the feature set; the expanded feature set is then reduced in dimensionality by an autoencoder to obtain the LLD+DSS fusion feature. Finally, the LLD+DSS fusion feature is used as the input of an LSTM deep network, and the LSTM deep network determines the emotion category corresponding to each fusion feature. Compared with traditional speech emotion features and classification and recognition algorithms, the present invention has better overall performance and improves the accuracy of speech emotion classification.

Description

A speech emotion recognition method based on LLD and DSS fusion features
Technical field
The invention belongs to the fields of artificial intelligence and speech recognition, and more particularly relates to a speech emotion recognition method based on LLD and DSS fusion features.
Background technique
In recent years, with the development of computer technology, human-machine interaction (HMI) technology has also made significant progress, but it is still far from the level of fully natural human communication, because machines have difficulty understanding the paralinguistic information hidden in speech, of which emotion is one example. The basic task of speech emotion recognition (SER) is to classify the speaker's emotional state from the speech signal, making HMI more natural and realistic. Although researchers at home and abroad have studied SER extensively, the performance of SER systems so far remains relatively low and is not yet suitable for practical application.
The main work of speech emotion recognition consists of speech emotion feature extraction and the selection of the classification network model. Current research at home and abroad mostly concerns the selection of the classification network model, and considerable progress has been made on classification models. The most common classification models in speech emotion recognition are the support vector machine [1] (SVM), the artificial neural network [2] (ANN), the K-nearest-neighbor algorithm [3] (KNN), the Elman neural network [4] and the long short-term memory network [5] (LSTM). These models mostly use the low-level descriptor (LLD) emotion features adopted by the Interspeech challenges, and few emotion features are designed for the network being optimized. Therefore, how to mine latent features and improve the recognition rate still remains to be studied.
Bibliography
[1] Lin Y L, Wei G. Speech emotion recognition based on HMM and SVM[C]// Proceedings of 2005 International Conference on Machine Learning and Cybernetics. IEEE, 2005, 8: 4898-4901.
[2] Han K, Yu D, Tashev I. Speech emotion recognition using deep neural network and extreme learning machine[C]// Fifteenth Annual Conference of the International Speech Communication Association. 2014.
[3] Schuller B, Rigoll G, Lang M. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture[C]// Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04). IEEE, 2004, 1: I-577.
[4] Yu Lingli, Zhou Kaijun, Qiu Aibing. Research on the application of speech emotion recognition based on Elman neural network[J]. Application Research of Computers, 2012, 29(5): 1809-1814.
[5] Wöllmer M, Kaiser M, Eyben F, et al. LSTM-Modeling of continuous emotions in an audiovisual affect recognition framework[J]. Image and Vision Computing, 2013, 31(2): 153-163.
[6] Andén J, Mallat S. Deep scattering spectrum[J]. IEEE Transactions on Signal Processing, 2014, 62(16): 4114-4128.
[7] Deng J, Zhang Z, Eyben F, et al. Autoencoder-based unsupervised domain adaptation for speech emotion recognition[J]. IEEE Signal Processing Letters, 2014, 21(9): 1068-1072.
[8] Zheng F, Zhang G, Song Z. Comparison of different implementations of MFCC[J]. Journal of Computer Science and Technology, 2001, 16(6): 582-589.
[9] Guo J M, Markoni H. Driver drowsiness detection using hybrid convolutional neural network and long short-term memory[J]. Multimedia Tools and Applications, 2018: 1-29.
[10] Morchid M, Bousquet P M, Kheder W B, et al. Latent topic-based subspace for natural language processing[J]. Journal of Signal Processing Systems, 2018: 1-21.
[11] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[12] Burkhardt F, Paeschke A, Rolfes M, et al. A database of German emotional speech[C]// Ninth European Conference on Speech Communication and Technology. 2005.
[13] Livingstone S R, Peck K, Russo F A. RAVDESS: the Ryerson Audio-Visual Database of Emotional Speech and Song[C]// Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science. 2012: 205-211.
[14] Jackson P, Haq S. Surrey Audio-Visual Expressed Emotion (SAVEE) database[J]. University of Surrey: Guildford, UK, 2014.
Summary of the invention
Object of the invention: aiming at the poor performance of existing speech emotion features in classification and recognition, the present invention provides a speech emotion recognition method based on LLD and DSS fusion features.
Technical solution: the present invention provides a speech emotion recognition method based on LLD and DSS fusion features, the method comprising the following steps:
Step 1: extracting the LLD features and DSS features of an emotional speech data set;
Step 2: using the LLD features and DSS features as the training set of an autoencoder, the autoencoder performing dimensionality reduction on the LLD features and DSS features to obtain the reduced-dimension LLD+DSS fusion feature;
Step 3: sequentially inputting the LLD+DSS fusion features of step 2 into an LSTM deep network, the LSTM deep network identifying the emotion category corresponding to each fusion feature.
Further, in step 1, the DSS features are extracted from the emotional speech data set using a DSS algorithm; the order of the DSS algorithm is set to 2, i.e. the extracted DSS features comprise the zeroth-order, first-order and second-order features of the emotional speech data set. Each feature is obtained as follows: the emotional speech data set, as the input signal, is passed through a first low-pass filter to obtain the zeroth-order feature; the input signal is passed successively through a first wavelet band-pass filter and a second low-pass filter to obtain the first-order feature; the input signal is passed successively through the first wavelet band-pass filter, a second wavelet band-pass filter and a third low-pass filter to obtain the second-order feature, the frequency of the second wavelet band-pass filter being higher than the frequency of the first wavelet band-pass filter.
Further, in step 1, the emotional speech data set comprises the EMODB data set, the RAVDESS data set and the Surrey data set.
Further, in step 3, the LSTM deep network has β network layers; the first β-1 layers are trained on the input LLD+DSS fusion feature to obtain the hidden feature of the fusion feature, and the last layer is a classifier which determines the emotion category corresponding to the hidden feature, i.e. the emotion category corresponding to the fusion feature.
Further, the number of dimensions of the classifier is equal to the number θ of shared emotion categories, each dimension corresponding to one of the shared emotion categories; the shared emotion categories are the emotion categories common to the EMODB data set, the RAVDESS data set and the Surrey data set.
Further, the classifier determines the emotion category corresponding to the hidden feature as follows: the classifier maps the hidden feature into the interval (0, 1) to obtain θ1 probabilities, the θ1 probabilities corresponding one-to-one to the θ shared emotion categories, with θ1 = θ; the emotion category with the highest probability is the emotion category corresponding to the hidden feature.
Further, the classifier is a softmax classifier.
Further, the autoencoder has three neural network layers, namely an input layer, a hidden layer and an output layer; the dimension of the LLD+DSS fusion feature is equal to the number of output-layer neurons.
Beneficial effects: in view of the fact that emotional speech signals contain temporal information, and exploiting the advantage of LSTM deep networks in processing text and speech data, the present invention proposes a speech emotion classification method based on LLD and DSS fusion features. According to the nonlinear, non-stationary nature of emotional speech signals, DSS features are first extracted with the deep scattering spectrum; the expanded feature set is then reduced in dimensionality by an autoencoder to obtain the LLD+DSS fusion feature, and an LSTM deep network is then used to perform the emotion classification of the speech. Compared with traditional speech emotion features and classification and recognition algorithms, the speech emotion classification method based on LLD and DSS fusion features has better overall performance and improves the accuracy of speech emotion classification.
Detailed description of the invention
Fig. 1 is the flow chart of the invention;
Fig. 2 is the DSS feature extraction diagram;
Fig. 3 shows the zeroth-order (a), first-order (b) and second-order (c) logarithmic energy of the DSS of a Fear sentence;
Fig. 4 is the autoencoder network structure of the invention;
Fig. 5 is the internal basic structure of the LSTM deep network;
Fig. 6 shows the experimental results of the invention on the EMODB data set;
Fig. 7 shows the experimental results of the invention on the RAVDESS data set;
Fig. 8 shows the experimental results of the invention on the SAVEE data set.
Specific embodiment
The accompanying drawings, which form a part of the present invention, are provided to give a further understanding of the invention; the schematic embodiments of the invention and their description are used to explain the invention and do not constitute an improper limitation of the invention.
As shown in Fig. 1, on the basis of traditional LLD features, the present embodiment adds DSS features to expand the feature set, then reduces the dimensionality of the expanded feature set with an autoencoder to obtain the LLD+DSS fusion feature. Finally, the LLD+DSS fusion feature is used as the input of an LSTM deep network, which determines the emotion category corresponding to each fusion feature, as sketched below.
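For orientation only, the following minimal Python sketch outlines the three stages of Fig. 1 (feature-set expansion, autoencoder dimensionality reduction, LSTM classification); the callables extract_lld, extract_dss, autoencoder and lstm_classifier are hypothetical placeholders, not components defined by the patent:

import numpy as np

def recognize_emotion(wave, sr, extract_lld, extract_dss, autoencoder, lstm_classifier):
    """Illustrative outline of the Fig. 1 pipeline; all callables are assumed to be trained already."""
    lld = extract_lld(wave, sr)              # 79-dimensional LLD feature vector
    dss = extract_dss(wave, sr)              # 600-dimensional DSS feature vector
    expanded = np.concatenate([lld, dss])    # expanded feature set
    fused = autoencoder.encode(expanded)     # reduced-dimension LLD+DSS fusion feature
    probs = lstm_classifier.predict(fused)   # probabilities over the shared emotion categories
    return int(np.argmax(probs))             # index of the most probable emotion category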
The LLD features of the emotional speech data set extracted in the present embodiment have 79 dimensions; the specific dimensions are shown in Table 1:
Table 1
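The exact composition of the 79 dimensions is given in Table 1 and is not reproduced here. Purely as a hedged illustration of what frame-level LLD extraction looks like, the sketch below computes a few typical descriptors (MFCC, energy, zero-crossing rate, pitch) with librosa and summarizes them with utterance-level statistics; it is not the patent's 79-dimensional feature set:

import numpy as np
import librosa

def extract_lld_example(path, sr=16000, n_mfcc=13):
    # Illustrative LLD extraction; the descriptor choice and statistics are assumptions, not Table 1.
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # cepstral coefficients per frame
    rms = librosa.feature.rms(y=y)                           # frame energy
    zcr = librosa.feature.zero_crossing_rate(y)              # zero-crossing rate per frame
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)            # pitch contour
    n = min(mfcc.shape[1], rms.shape[1], zcr.shape[1], len(f0))
    contours = np.vstack([mfcc[:, :n], rms[:, :n], zcr[:, :n], f0[None, :n]])
    # Mean and standard deviation of each contour give an utterance-level feature vector.
    return np.concatenate([contours.mean(axis=1), contours.std(axis=1)])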
The emotional speech data used in the present embodiment comprise three emotional speech data sets: the EMODB data set, the RAVDESS data set and the Surrey (SAVEE) data set. The specific emotion categories, numbers of speakers and numbers of sentences of the three data sets are shown in Table 2. In the present embodiment, for the EMODB data set the emotional sentences of 10 speakers (5 male, 5 female) are taken, 20 sentences per emotion; for the RAVDESS data set the emotional sentences of 16 speakers (8 male, 8 female) are taken, 50 sentences per emotion; for the Surrey data set the emotional sentences of 4 speakers are taken, 20 sentences per emotion. 80% of the sentences are used as the training set and 20% as the test set. Ten experiments are carried out, and the average of the ten recognition rates is used as the evaluation index. Speech emotion classification and recognition is then performed with the LLD+DSS fusion features.
Table 2
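To make the evaluation protocol concrete, the following sketch repeats the random 80%/20% split ten times and reports the mean recognition rate; train_and_score is a hypothetical stand-in for training one of the compared classifiers and returning its test accuracy, and the stratified split is an added assumption:

import numpy as np
from sklearn.model_selection import train_test_split

def average_recognition_rate(features, labels, train_and_score, n_runs=10):
    # Ten random 80/20 splits; the averaged accuracy is the evaluation index.
    rates = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, labels, test_size=0.2, stratify=labels, random_state=run)
        rates.append(train_and_score(X_tr, y_tr, X_te, y_te))
    return float(np.mean(rates))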
The deep scattering spectrum (DSS) was proposed by Joakim Andén and Stéphane Mallat in 2014. The MFCC features commonly found in LLD feature sets differ from DSS features in the high-frequency part: when the low-pass filter is applied by convolution, the MFCC features contain almost no frequency detail and lose the high-frequency components, whereas DSS features can compensate for the high-frequency characteristics that MFCC features cannot represent. DSS features have achieved better results than MFCC features on speech and music classification [6]. The DSS features are extracted with ScatNet, and DSS contains a richer frequency-domain energy distribution and more time-shift components than LLD.
In most experiments, decomposing the DSS features to the second-order scattering coefficients is sufficient for speech emotion classification, because the zeroth- to second-order scattering coefficients capture most of the signal energy. Therefore, in the present embodiment the order of the DSS algorithm is 2: the DSS algorithm extracts the zeroth- to second-order features of the emotional speech data set to obtain its DSS features, which in this embodiment have 600 dimensions.
The DSS feature extraction process is shown in Fig. 2: the emotional speech data set, as the input signal, passes through a first low-pass filter to obtain the zeroth-order feature; the input signal then passes successively through a first wavelet band-pass filter and a second low-pass filter to obtain the first-order feature; the input signal then passes successively through the first wavelet band-pass filter, a second wavelet band-pass filter and a third low-pass filter to obtain the second-order feature. The frequency of the second wavelet band-pass filter should be higher than that of the first wavelet band-pass filter; how much higher depends on the experimental conditions, the pass band of the second wavelet band-pass filter being chosen to recover the high-frequency-band signal. The zeroth-order, first-order and second-order features of the input signal together constitute the DSS feature.
The zeroth-order feature of the input signal is expressed as
S0(x) = x * φ      (1)
In formula (1), S0(x) denotes the zeroth-order feature, x is the input signal, and φ is the transfer function of the low-pass filter.
The first-order feature is expressed as
S1(x) = |x * ψλ1| * φ      (2)
In formula (2), S1(x) is the first-order feature and ψλ1 is the transfer function of the band-pass filter based on the Morlet wavelet λ1.
The second-order feature is expressed as
S2(x) = ||x * ψλ1| * ψλ2| * φ      (3)
In formula (3), S2(x) is the second-order feature and ψλ2 is the transfer function of the band-pass filter based on the Morlet wavelet λ2.
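A minimal numpy sketch of the cascade of formulas (1)-(3) is given below; the frequency-domain Gaussian low-pass and band-pass filters and all of their parameters are illustrative assumptions (in practice the DSS features are extracted with ScatNet or a comparable scattering toolbox):

import numpy as np

def lowpass(n, cutoff):
    # Gaussian low-pass filter phi, defined in the frequency domain (illustrative parameters).
    f = np.fft.fftfreq(n)
    return np.exp(-(f / cutoff) ** 2)

def bandpass(n, center, width):
    # Gaussian approximation of a Morlet band-pass filter psi_lambda (illustrative parameters).
    f = np.fft.fftfreq(n)
    return np.exp(-((np.abs(f) - center) / width) ** 2)

def convolve(x, H):
    # Circular convolution of signal x with the filter whose frequency response is H.
    return np.real(np.fft.ifft(np.fft.fft(x) * H))

def dss_orders(x, lam1=0.05, lam2=0.15, cutoff=0.01, width=0.02):
    n = len(x)
    phi, psi1, psi2 = lowpass(n, cutoff), bandpass(n, lam1, width), bandpass(n, lam2, width)
    s0 = convolve(x, phi)                      # formula (1): S0(x) = x * phi
    u1 = np.abs(convolve(x, psi1))
    s1 = convolve(u1, phi)                     # formula (2): S1(x) = |x * psi_lam1| * phi
    u2 = np.abs(convolve(u1, psi2))
    s2 = convolve(u2, phi)                     # formula (3): S2(x) = ||x * psi_lam1| * psi_lam2| * phi
    return s0, s1, s2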
In the present embodiment, one segment of FEAR emotional speech from the EMODB data set is selected for DSS feature extraction; the zeroth-order, first-order and second-order features obtained are shown in Fig. 3 (a), (b) and (c), respectively.
The set of 79-dimensional LLD features and 600-dimensional DSS features extracted from the emotional speech data set is input to the autoencoder for feature dimensionality reduction, yielding an LLD+DSS fusion feature of a certain dimension. The dimension of the LLD+DSS fusion feature is equal to the number of neurons in the output layer of the autoencoder.
The autoencoder used in the present invention is an artificial neural network for data dimensionality reduction [7], consisting of three neural network layers; its structure is shown in Fig. 4. The encoding process can be regarded as the mapping from the input layer to the hidden layer, expressed as:
h = σh(W1x1 + b1)
where h is the output of the hidden layer, σh is the activation function of the hidden layer, x1 is the set of the 79-dimensional LLD features and the 600-dimensional DSS features at the input, and W1 and b1 are the weight and bias parameters of the hidden layer.
The decoding process can be regarded as the mapping from the hidden layer to the output layer, expressed as:
y = σy(W2h + b2)
where y is the output of the output layer, σy is the activation function of the output layer, and W2 and b2 are the weight and bias parameters of the output layer.
The autoencoder is optimized through a loss function; the cross-entropy is chosen as the loss function, expressed as:
J(W, b) = -Σi [yi·log y'i + (1 - yi)·log(1 - y'i)]
where J is the loss function, W and b are the weight and bias parameters of the whole network, and yi and y'i denote the label value of the sample and the output value of the network, respectively.
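A minimal PyTorch sketch of this three-layer autoencoder follows; the input dimension 679 (79 LLD + 600 DSS dimensions), the bottleneck size and the training loop are illustrative assumptions, and, as is usual for autoencoder dimensionality reduction, the hidden-layer activation h is taken here as the reduced feature:

import torch
import torch.nn as nn

class FusionAutoencoder(nn.Module):
    # Three layers: input -> hidden (encoding) -> output (reconstruction); sizes are assumptions.
    def __init__(self, in_dim=679, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.Sigmoid())  # h = sigma_h(W1 x1 + b1)
        self.decoder = nn.Sequential(nn.Linear(hidden_dim, in_dim), nn.Sigmoid())  # y = sigma_y(W2 h + b2)

    def forward(self, x):
        h = self.encoder(x)
        return self.decoder(h), h

def train_autoencoder(model, loader, epochs=50, lr=1e-3):
    # Cross-entropy reconstruction loss between input and output, as in the loss function above;
    # the input vectors are assumed to be scaled to [0, 1].
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for x in loader:
            y, _ = model(x)
            loss = loss_fn(y, x)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

After training, model.encoder(x) yields the LLD+DSS fusion feature that is fed to the LSTM deep network.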
The LSTM deep network can store information in memory cells according to the time sequence and can learn the contextual information relevant to the classification task. The LSTM deep network is similar to an RNN, except that the nonlinear hidden units are replaced by memory blocks of a special type. Each memory block in the LSTM deep network contains one or more memory cells and three multiplicative units (the input, output and forget gates). The multiplicative gates allow the memory cells to store and access information over long input sequences. In this embodiment, speech segments of five frames (20 ms per frame) are used as the data unit to pre-train the LSTM deep network.
The internal basic structure of the LSTM deep network in the present embodiment is shown in Fig. 5; the network model has β network layers. To control the flow of information, memory cells are designed inside the nodes of the LSTM deep network, and the deletion or addition of information is controlled by gates. A gate is a means of selectively letting information through; each node of the LSTM deep network has three kinds of gates, the input gate, the forget gate and the output gate, which protect and control the node state. Let xt be the input of an LSTM deep network node at time t, ht its output at time t, Wxk (k = i, f, c, o) the weights corresponding to the input, Whk (k = i, f, c, o) the weights corresponding to the output, Wck (k = i, f, c, o) the weights corresponding to the memory-cell value ct at time t, bk (k = i, f, c, o) the corresponding biases, and σ the activation function. The process by which the LSTM neural network model controls the information update through the gates is divided into four steps [11]:
A. Compute the value it of the input gate at time t. The input gate controls the influence of the current input on the memory-cell state value; the expression is as follows:
it = σ(Wxixt + Whiht-1 + Wcict-1 + bi)
where ht-1 is the output of the node at time t-1 and ct-1 is the value of the memory cell at time t-1.
B. Compute the value ft of the forget gate at time t. The forget gate controls the influence of historical information on the memory-cell state value; the expression is as follows:
ft = σ(Wxfxt + Whfht-1 + Wcfct-1 + bf)
C. Compute the value ct of the memory cell at time t; the expression is as follows:
ct = ft·ct-1 + it·tanh(Wxcxt + Whcht-1 + bc)
D. Compute the output ht at time t, which is determined by the output gate value ot; the calculation is as follows:
ht = ot·tanh(ct)
where ot = σ(Wxoxt + Whoht-1 + Wcoct-1 + bo).
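The following numpy sketch carries out one time step of steps A-D exactly as written above (including the peephole-style terms in which the gates also see the memory-cell value); the weight shapes and the treatment of the memory-cell weights as element-wise vectors are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W['xi'], W['hi'], ... are weight matrices; W['ci'], W['cf'], W['co'] are element-wise
    # memory-cell weights; b['i'], b['f'], b['c'], b['o'] are biases.
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])    # step A: input gate
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])    # step B: forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])  # step C: memory cell
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c_prev + b['o'])    # output gate
    h_t = o_t * np.tanh(c_t)                                                       # step D: node output
    return h_t, c_t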
The dimension-reduced LLD+DSS fusion feature matrix is used as the input of the LSTM deep network; the first β-1 layers of the LSTM deep network are trained on the input LLD+DSS fusion feature to obtain the hidden feature of the fusion feature.
The last layer of the LSTM deep network is a Softmax classifier [10]. Since the EMODB data set, the RAVDESS data set and the Surrey data set share 5 identical emotion categories (anger, disgust, fear, happiness, sadness), the Softmax classifier in the present embodiment has five dimensions, each dimension corresponding to one emotion category.
The Softmax classifier maps the hidden feature into the interval (0, 1) and obtains 5 probabilities, which correspond one-to-one to the 5 emotion categories; the emotion category with the highest probability is the emotion category corresponding to the fusion feature.
The Softmax classifier maps the outputs of multiple neurons, i.e. the hidden feature, into the interval (0, 1), which can be regarded as estimating the class to which each sample belongs. The specific calculation is as follows:
P(y = j | x) = exp(Wj·x) / Σk=1..K exp(Wk·x)
where K is the total number of classes, j is the currently predicted class, x is the neuron output, and Wj is the weight coefficient corresponding to class j.
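The softmax mapping itself can be sketched in a few lines of numpy; the snippet below scores the last-layer hidden feature against the five class weight vectors and returns the predicted emotion category (the weight shapes and the category order are illustrative assumptions):

import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness"]   # the 5 shared categories

def softmax_predict(hidden, W):
    # hidden: hidden feature from the last LSTM layer; W: (5, d) class weight matrix.
    scores = W @ hidden
    scores = scores - scores.max()                 # subtract the maximum for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()  # probabilities in (0, 1) summing to 1
    return EMOTIONS[int(np.argmax(probs))], probs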
In the present embodiment, the LLD features, the DSS features and the LLD+DSS fusion features are each classified with five networks (KNN, LVQ, SVM, BP and LSTM) for emotion classification comparison. Figs. 6, 7 and 8 show the classification results on the EMODB, RAVDESS and SAVEE data sets, respectively. From Figs. 6, 7 and 8 it can be seen that the recognition rate of the LLD+DSS fusion feature is higher than that of the LLD feature alone in almost all classification methods, and that the best recognition rate is obtained with the LSTM network.
Fig. 6 shows the experimental results on the EMODB data set: the recognition rates of the KNN, LVQ, SVM and BP networks with the LLD feature are all higher than with the DSS feature; in the LSTM network the recognition rate of the DSS feature is slightly higher than that of the LLD feature; and the recognition rates of all five networks with the LLD+DSS fusion feature are higher than with the LLD feature and the DSS feature, the LSTM network with the LLD+DSS fusion feature giving the relatively highest rate.
Fig. 7 shows the experimental results on the RAVDESS data set: the SVM network performs almost identically with the LLD+DSS feature and the LLD feature; the recognition rates of the KNN, LVQ, BP and LSTM networks with the LLD+DSS fusion feature are higher than with the LLD feature and the DSS feature; the LSTM network with the LLD+DSS fusion feature gives the relatively highest rate.
Fig. 8 shows the experimental results on the SAVEE data set: the recognition rate of the SVM network with the LLD+DSS fusion feature is slightly lower than with the LLD feature (a result of the poor recognition performance of SVM on high-dimensional features); the recognition rates of the KNN, LVQ, BP and LSTM networks with the LLD+DSS fusion feature are all higher than with the LLD feature and the DSS feature; the LSTM network with the LLD+DSS fusion feature gives the relatively highest rate.
It should further be noted that the specific technical features described in the above specific embodiments may be combined in any suitable manner provided that no contradiction arises. To avoid unnecessary repetition, the present invention does not further describe the various possible combinations.

Claims (8)

1. A speech emotion recognition method based on LLD and DSS fusion features, characterized in that the method comprises the following steps:
Step 1: extracting the LLD features and DSS features of an emotional speech data set;
Step 2: using the LLD features and DSS features as the training set of an autoencoder, the autoencoder performing dimensionality reduction on the LLD features and DSS features to obtain a reduced-dimension LLD+DSS fusion feature;
Step 3: sequentially inputting the LLD+DSS fusion features of step 2 into an LSTM deep network, the LSTM deep network identifying the emotion category corresponding to each fusion feature.
2. The method according to claim 1, characterized in that in step 1 the DSS features are extracted from the emotional speech data set using a DSS algorithm; the order of the DSS algorithm is set to 2, i.e. the extracted DSS features comprise the zeroth-order, first-order and second-order features of the emotional speech data set, each feature being obtained as follows: the emotional speech data set, as the input signal, is passed through a first low-pass filter to obtain the zeroth-order feature; the input signal is passed successively through a first wavelet band-pass filter and a second low-pass filter to obtain the first-order feature; the input signal is passed successively through the first wavelet band-pass filter, a second wavelet band-pass filter and a third low-pass filter to obtain the second-order feature, the frequency of the second wavelet band-pass filter being higher than the frequency of the first wavelet band-pass filter.
3. The method according to claim 1, characterized in that in step 1 the emotional speech data set comprises an EMODB data set, a RAVDESS data set and a Surrey data set.
4. The method according to claim 3, characterized in that in step 3 the LSTM deep network has β network layers, the first β-1 layers of which are trained on the input LLD+DSS fusion feature to obtain the hidden feature of the fusion feature; the last layer is a classifier which determines the emotion category corresponding to the hidden feature, i.e. the emotion category corresponding to the fusion feature.
5. The method according to claim 4, characterized in that the number of dimensions of the classifier is equal to the number θ of emotion categories, each dimension corresponding to one of the shared emotion categories; the shared emotion categories are the emotion categories common to the EMODB data set, the RAVDESS data set and the Surrey data set.
6. The method according to claim 5, characterized in that the classifier determines the emotion category corresponding to the hidden feature as follows: the classifier maps the hidden feature into the interval (0, 1) to obtain θ1 probabilities, the θ1 probabilities corresponding one-to-one to the θ shared emotion categories, with θ1 = θ; the emotion category with the highest probability is the emotion category corresponding to the hidden feature.
7. The method according to claim 4, characterized in that the classifier is a softmax classifier.
8. The method according to claim 1, characterized in that the autoencoder has three neural network layers, namely an input layer, a hidden layer and an output layer, the dimension of the LLD+DSS fusion feature being equal to the number of output-layer neurons.
CN201910143689.6A 2019-02-25 2019-02-25 A speech emotion recognition method based on LLD and DSS fusion features Pending CN109767788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910143689.6A CN109767788A (en) 2019-02-25 2019-02-25 A speech emotion recognition method based on LLD and DSS fusion features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910143689.6A CN109767788A (en) 2019-02-25 2019-02-25 A speech emotion recognition method based on LLD and DSS fusion features

Publications (1)

Publication Number Publication Date
CN109767788A true CN109767788A (en) 2019-05-17

Family

ID=66457509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910143689.6A Pending CN109767788A (en) 2019-02-25 2019-02-25 A speech emotion recognition method based on LLD and DSS fusion features

Country Status (1)

Country Link
CN (1) CN109767788A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150317990A1 (en) * 2014-05-02 2015-11-05 International Business Machines Corporation Deep scattering spectrum in acoustic modeling for speech recognition
US20160180838A1 (en) * 2014-12-22 2016-06-23 Google Inc. User specified keyword spotting using long short term memory neural network feature extractor
CN104835507A (en) * 2015-03-30 2015-08-12 渤海大学 Serial-parallel combined multi-mode emotion information fusion and identification method
CN106205636A (en) * 2016-07-07 2016-12-07 东南大学 A kind of speech emotion recognition Feature fusion based on MRMR criterion
CN106250855A (en) * 2016-08-02 2016-12-21 南京邮电大学 A kind of multi-modal emotion identification method based on Multiple Kernel Learning
CN107133481A (en) * 2017-05-22 2017-09-05 西北工业大学 The estimation of multi-modal depression and sorting technique based on DCNN DNN and PV SVM
CN107316654A (en) * 2017-07-24 2017-11-03 湖南大学 Emotion identification method based on DIS NV features
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN108899051A (en) * 2018-06-26 2018-11-27 北京大学深圳研究生院 A kind of speech emotion recognition model and recognition methods based on union feature expression

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KUN-YI HUANG ET AL.: "Mood Disorder Identification Using Deep Bottleneck Feature of Elicited Speech", Proceedings of APSIPA Annual Summit and Conference 2017 *
KUN-YI HUANG ET AL.: "Speech Emotion Recognition Using Autoencoder Bottleneck Feature and LSTM", 2016 International Conference on Orange Technologies *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112151071A (en) * 2020-09-23 2020-12-29 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning
CN112151071B (en) * 2020-09-23 2022-10-28 哈尔滨工程大学 Speech emotion recognition method based on mixed wavelet packet feature deep learning

Similar Documents

Publication Publication Date Title
Latif et al. Deep representation learning in speech processing: Challenges, recent advances, and future trends
Li et al. Dilated residual network with multi-head self-attention for speech emotion recognition
CN110675860A (en) Voice information identification method and system based on improved attention mechanism and combined with semantics
Terechshenko et al. A comparison of methods in political science text classification: Transfer learning language models for politics
Pan et al. Oil well production prediction based on CNN-LSTM model with self-attention mechanism
Wei et al. A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model
Lian et al. Conversational emotion recognition using self-attention mechanisms and graph neural networks.
CN111309909A (en) Text emotion classification method based on hybrid model
CN111325233B (en) Transformer fault detection method and device
CN117095702A (en) Multi-mode emotion recognition method based on gating multi-level feature coding network
Srivastava et al. Speech recognition using HMM and Soft Computing
Tao et al. News text classification based on an improved convolutional neural network
Kamaruddin et al. Features extraction for speech emotion
Xu Intelligent automobile auxiliary propagation system based on speech recognition and AI driven feature extraction techniques
CN106992000A (en) Prediction-based multi-feature fusion old people voice emotion recognition method
Shah et al. Articulation constrained learning with application to speech emotion recognition
Liu et al. Graph based emotion recognition with attention pooling for variable-length utterances
CN109767788A (en) A kind of speech-emotion recognition method based on LLD and DSS fusion feature
Niu Music Emotion Recognition Model Using Gated Recurrent Unit Networks and Multi‐Feature Extraction
CN115512721A (en) PDAN-based cross-database speech emotion recognition method and device
CN113190733B (en) Network event popularity prediction method and system based on multiple platforms
Kim et al. Representation learning with graph neural networks for speech emotion recognition
Mavaddati Voice-based age, gender, and language recognition based on ResNet deep model and transfer learning in spectro-temporal domain
Pashaian et al. Speech Enhancement Using Joint DNN‐NMF Model Learned with Multi‐Objective Frequency Differential Spectrum Loss Function
Sabuj et al. A Comparative Study of Machine Learning Classifiers for Speaker’s Accent Recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190517

RJ01 Rejection of invention patent application after publication