Disclosure of Invention
To solve the above problems, the present invention provides a method for recognizing speech emotion by using a hierarchical structure of deep learning temporal models, which can extract not only frame-level speech features but also segment-level speech features.
In order to achieve the above object, the present invention provides a speech-based emotion recognition method, including the steps of:
step 1: performing frame processing on the speech, and extracting features of each frame to obtain a feature vector v_n of each frame, where v_n represents the feature vector of the n-th frame;
step 2: inputting the feature vector v_n of each frame obtained in step 1 into a first-layer deep learning temporal model, learning the association between frames through the first-layer deep learning temporal model, and outputting a frame-level feature y_i every k frames, thereby obtaining the frame-level features of M time instants, where y_i represents the frame-level vector output by the first-layer deep learning temporal model at the i-th time instant;
step 3: inputting the frame-level features y_i obtained in step 2 and the hidden state h_{t-1} of the second-layer deep learning temporal model at time t-1 into a first attention model to obtain the input z_t of the second-layer deep learning temporal model at time t; after M learning steps, outputting segment-level features s_t, where s_t represents the segment-level vector output by the second-layer deep learning temporal model at time t;
step 4: inputting the segment-level features s_t obtained in step 3 into a second attention model to form the final utterance-level representation;
step 5: inputting the utterance-level representation obtained in step 4 into a softmax layer to obtain the probability values of the predicted emotions, thereby recognizing the emotion.
As a further refinement of the present invention, the first-layer deep learning temporal model and the second-layer deep learning temporal model are each one of LSTM, RNN and GRU.
As a further improvement of the present invention, in step 1, the length of each frame is 25 ms, and the frame shift is 10 ms.
As a further improvement of the invention, in step 1, 36-dimensional features are extracted from each frame, and the feature vector of each frame consists of 13-dimensional MFCCs, zero-crossing rate, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off point, a 12-dimensional chroma vector, the chroma vector standard deviation, signal-to-noise ratio and pitch.
As a further improvement of the invention, in step 2, k is 3, giving the frame-level features at each of the M time instants.
As a further improvement of the present invention, in step 3, the operation mechanism of the first attention model is as shown in formulas (1), (2) and (3):
e_i^(t) = w^T·tanh(W_a·h_{t-1} + U_a·y_i + b_a)    (1)
a_i^(t) = exp(e_i^(t)) / Σ_j exp(e_j^(t))    (2)
z_t = Σ_i a_i^(t)·y_i    (3)
wherein w, W_a, U_a, b_a are network parameters of the first attention model, y_i is a frame-level feature, h_{t-1} is the hidden state of the LSTM at time t-1, z_t is the input of the LSTM at time t, e_i^(t) is the correlation coefficient between the frame-level feature y_i and the LSTM input z_t at time t, and a_i^(t) is the attention coefficient at time t.
As a further improvement of the invention, in step 4, the correlation coefficient of the second attention model is obtained by vector multiplication of the network parameter u and the segment-level feature s_i.
As a further improvement of the invention, a plurality of the first-layer deep learning temporal models and first attention models are used to extract features of different levels in the speech.
The invention has the following beneficial effects: the hierarchical deep learning temporal model structure extracts features of different levels in the speech, and several attention mechanisms are introduced to effectively select the key features, which facilitates emotion recognition. With this method, not only frame-level but also segment-level speech features can be extracted, so the accuracy of emotion recognition can be effectively improved.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments and with reference to the attached drawings.
As shown in fig. 1, a method for emotion recognition based on speech according to an embodiment of the present invention includes the following steps:
step 1: performing frame processing on the speech, and extracting features of each frame to obtain a feature vector v_n of each frame, where v_n represents the feature vector of the n-th frame;
step 2: inputting the feature vector v_n of each frame obtained in step 1 into a first-layer deep learning temporal model, learning the association between frames through the first-layer deep learning temporal model, and outputting a frame-level feature y_i every k frames, thereby obtaining the frame-level features of M time instants, where y_i represents the frame-level vector output by the first-layer deep learning temporal model at the i-th time instant;
step 3: inputting the frame-level features y_i obtained in step 2 and the hidden state h_{t-1} of the second-layer deep learning temporal model at time t-1 into a first attention model to obtain the input z_t of the second-layer deep learning temporal model at time t; after M learning steps, outputting segment-level features s_t, where s_t represents the segment-level vector output by the second-layer deep learning temporal model at time t;
step 4: inputting the segment-level features s_t obtained in step 3 into a second attention model to form the final utterance-level representation;
step 5: inputting the utterance-level representation obtained in step 4 into a softmax layer to obtain the probability values of the predicted emotions, thereby recognizing the emotion.
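For illustration only, the following minimal Python (PyTorch) sketch shows one way steps 1 to 5 could be wired together; the class name HierarchicalSER, the hidden size, k = 3 and the number of emotion classes are assumptions made for the example, not a definitive implementation of the invention.

```python
# Minimal sketch of the two-layer temporal model with two attention stages
# (steps 1-5). Hidden size, k and the number of emotions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalSER(nn.Module):
    def __init__(self, feat_dim=36, hidden=128, k=3, num_emotions=4):
        super().__init__()
        self.k = k
        self.frame_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # first-layer model
        self.segment_cell = nn.LSTMCell(hidden, hidden)                # second-layer model
        # first attention model: e_i^(t) = w^T tanh(W_a h_{t-1} + U_a y_i + b_a)
        self.W_a = nn.Linear(hidden, hidden)
        self.U_a = nn.Linear(hidden, hidden, bias=False)
        self.w = nn.Linear(hidden, 1, bias=False)
        # second attention model: correlation of parameter u with each s_i
        self.u = nn.Parameter(torch.randn(hidden))
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, frames):                         # frames: (batch, N, 36)
        y_all, _ = self.frame_lstm(frames)             # step 2: frame-level outputs
        y = y_all[:, self.k - 1::self.k, :]            # keep one output every k frames
        batch, M, H = y.shape
        h = frames.new_zeros(batch, H)                 # hidden state h_{t-1} of second layer
        c = frames.new_zeros(batch, H)
        segments = []
        for _ in range(M):                             # step 3: M learning steps
            e = self.w(torch.tanh(self.W_a(h).unsqueeze(1) + self.U_a(y))).squeeze(-1)
            a = F.softmax(e, dim=1)                    # attention coefficients a_i^(t)
            z_t = (a.unsqueeze(-1) * y).sum(dim=1)     # input z_t of the second layer
            h, c = self.segment_cell(z_t, (h, c))
            segments.append(h)                         # segment-level feature s_t
        s = torch.stack(segments, dim=1)               # (batch, M, H)
        e2 = s @ self.u                                # step 4: second attention, u . s_i
        a2 = F.softmax(e2, dim=1)
        utterance = (a2.unsqueeze(-1) * s).sum(dim=1)  # utterance-level representation
        return F.softmax(self.classifier(utterance), dim=-1)  # step 5: emotion probabilities
```

For example, a forward pass on a batch of framed feature sequences, HierarchicalSER()(torch.randn(8, 300, 36)), returns an (8, 4) tensor of predicted emotion probabilities.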
Further, the first-layer deep learning temporal model and the second-layer deep learning temporal model are each one of LSTM, RNN, and GRU.
Further, in step 1, the length of each frame is 25 ms, and the frame shift is 10 ms.
Further, in step 1, 36-dimensional features are extracted from each frame, and the feature vector of each frame consists of 13-dimensional MFCCs, zero-crossing rate, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off point, a 12-dimensional chroma vector, the chroma vector standard deviation, signal-to-noise ratio and pitch.
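As a rough illustration of step 1, the sketch below frames a signal into 25 ms windows with a 10 ms shift and computes two of the listed features (zero-crossing rate and energy entropy) directly; the helper names are assumptions, and the remaining features (MFCCs, spectral statistics, chroma, SNR, pitch) would in practice come from a standard audio feature toolkit.

```python
# Sketch of step 1: 25 ms frames with a 10 ms shift, plus two of the listed
# per-frame features computed directly with NumPy. Illustrative only.
import numpy as np


def frame_signal(signal, sr, frame_ms=25, shift_ms=10):
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])


def zero_crossing_rate(frame):
    # fraction of sign changes between consecutive samples
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0


def energy_entropy(frame, n_blocks=10):
    # entropy of the normalized sub-frame energies
    blocks = np.array_split(frame ** 2, n_blocks)
    e = np.array([b.sum() for b in blocks])
    p = e / (e.sum() + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))


# usage: each row of `frames` yields one per-frame feature vector v_n
# frames = frame_signal(audio, sr=16000)
# zcr = np.array([zero_crossing_rate(f) for f in frames])
```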
Further, in step 2, k is 3, giving the frame-level features at each of the M time instants.
Further, in step 3, the operation mechanism of the first attention model is as shown in formulas (1), (2) and (3):
e_i^(t) = w^T·tanh(W_a·h_{t-1} + U_a·y_i + b_a)    (1)
a_i^(t) = exp(e_i^(t)) / Σ_j exp(e_j^(t))    (2)
z_t = Σ_i a_i^(t)·y_i    (3)
wherein w, W_a, U_a, b_a are network parameters of the first attention model (W and U are weights, b is a bias), y_i is a frame-level feature, h_{t-1} is the hidden state of the LSTM at time t-1, z_t is the input of the LSTM at time t, e_i^(t) is the correlation coefficient between the frame-level feature y_i and the LSTM input z_t at time t, and a_i^(t) is the attention coefficient at time t.
Further, in step 4, the correlation coefficient of the second attention model is obtained by vector multiplication of the network parameter u and the segment-level feature s_i.
Furthermore, a plurality of the first-layer deep learning temporal models and first attention models are used to extract features of different levels in the speech, improving the emotion recognition effect. Taking the first-layer deep learning temporal model together with the first attention model as one module, several such modules can be stacked so that three or more layers of deep learning temporal models extract the features in the speech.
As shown in fig. 2 and fig. 3, when recognizing speech emotion with the hierarchical LSTM structure, the speech is first divided into frames, each frame being 25 ms long with a frame shift of 10 ms; 36-dimensional features are extracted for each frame, comprising 13-dimensional MFCCs, zero-crossing rate, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off point, a 12-dimensional chroma vector, the chroma vector standard deviation, signal-to-noise ratio and pitch. The 36-dimensional feature vector of each frame is then input into the first-layer LSTM structure; as shown in fig. 2, v_n represents the feature vector of the n-th frame, and the association between frames can be learned by the first-layer LSTM.
Second, every k frames, the output y_i of the first-layer LSTM and the hidden state h_{t-1} of the second-layer LSTM at the previous time instant are input to an attention model; with k equal to 3, the outputs {y_1, y_2, ..., y_M} at M time instants are obtained from the first-layer LSTM. The first attention model works as follows:
e_i^(t) = w^T·tanh(W_a·h_{t-1} + U_a·y_i + b_a)    (1)
a_i^(t) = exp(e_i^(t)) / Σ_j exp(e_j^(t))    (2)
z_t = Σ_i a_i^(t)·y_i    (3)
wherein w, W_a, U_a are weights and b_a is a bias; y_i is the frame-level feature at the i-th time instant, h_{t-1} is the hidden state of the LSTM at time t-1, z_t is the input of the LSTM at time t, e_i^(t) is the correlation coefficient between the frame-level feature y_i and the LSTM input z_t at time t, and a_i^(t) is the attention coefficient at time t. With the attention model, the second-layer LSTM can effectively select the key features through larger attention coefficients; for example, when the attention coefficient a_i^(t) equals 0, the i-th frame-level feature y_i is not selected.
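A small NumPy sketch of this first attention step for a single time instant t is given below; the dimensions and the softmax normalization of the correlation coefficients follow formulas (1) to (3) as reconstructed above, and the variable names are illustrative.

```python
# NumPy sketch of the first attention step at one time instant t:
# formula (1) scores each frame-level feature y_i against h_{t-1}, the scores
# are normalized into attention coefficients, and z_t is their weighted sum.
import numpy as np


def first_attention(Y, h_prev, w, W_a, U_a, b_a):
    """Y: (M, d) frame-level features, h_prev: (d,) hidden state h_{t-1}."""
    e = np.tanh(h_prev @ W_a.T + Y @ U_a.T + b_a) @ w   # e_i^(t), shape (M,)
    a = np.exp(e - e.max()); a /= a.sum()               # attention coefficients a_i^(t)
    z_t = a @ Y                                         # input z_t of the second-layer LSTM
    return z_t, a


# usage with random illustrative values:
# M, d = 5, 128
# z_t, a = first_attention(np.random.randn(M, d), np.zeros(d),
#                          np.random.randn(d), np.random.randn(d, d),
#                          np.random.randn(d, d), np.random.randn(d))
```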
Thirdly, the features s_i learned by the second-layer LSTM are input to a second attention model to form the final utterance-level representation. Here the correlation coefficient e_i of the second attention model is obtained by vector multiplication of the network parameter u and s_i.
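Analogously, the second attention model can be sketched as follows; the softmax normalization of the correlations u·s_i is an assumption made for the example, by analogy with the first attention model.

```python
# NumPy sketch of the second attention model: the correlation of each
# segment-level feature s_i with the parameter vector u is a dot product,
# and the weighted sum gives the utterance-level representation.
import numpy as np


def second_attention(S, u):
    """S: (M, d) segment-level features, u: (d,) learned parameter."""
    e = S @ u                              # correlation coefficients via vector multiplication
    a = np.exp(e - e.max()); a /= a.sum()  # assumed softmax normalization
    return a @ S                           # utterance-level representation
```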
Finally, the utterance-level representation is input into a softmax layer to obtain the probability values of the predicted emotions, thereby realizing emotion recognition.
Using a hierarchical LSTM structure: after the audio is divided into frames, each frame is only tens of milliseconds long, so the extracted features correspond to phonemes or even lower-level units. With the hierarchical LSTM structure, the first-layer LSTM learns the relationship between phoneme features; since phonemes form syllables and syllables form different words and phrases, the relationships extracted by the first-layer LSTM are output at intervals to the second-layer LSTM, which learns syllable-level features. For example, for an utterance such as "Wei, ni hao!" ("Hello, hi!"), the hierarchical LSTM structure can learn the relationships of "/w/", "/ei/", "/n/" and so on, and can also learn and distinguish emotions based on "wei" and "ni hao", whereas the prior art can only learn from phoneme-level features such as "/w/", "/ei/", "/n/" or even lower-level units. Compared with the prior art using a single-layer LSTM, the invention extracts features of different levels in the speech with a hierarchical LSTM structure, which is more conducive to emotion recognition.
In a specific implementation, since the LSTM, RNN, and GRU modules all take a frame sequence as input and produce an output at every time step, differing only slightly in their internal mechanisms while all being able to extract sequence features, the LSTM can be replaced by similar temporal models such as RNN and GRU.
The model structures of RNN, GRU and LSTM are as follows:
The LSTM contains a memory cell internally and has both long-term and short-term memory; it comprises three gates, namely an input gate, a forget gate and an output gate, expressed by the following formulas:
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c·[h_{t-1}, x_t] + b_c)
o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)
wherein i_t, f_t, o_t are the input gate, forget gate and output gate, respectively; c_t is the memory cell, h_t is the hidden state, σ denotes the sigmoid function, ⊙ denotes element-wise multiplication, W denotes a weight, and b denotes a bias. Through the forget gate, the LSTM determines how much information from past time steps is retained in the memory cell, and through the input gate it receives the information of the current time step.
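A NumPy sketch of a single LSTM step following the five formulas above is given below; the concatenation [h_{t-1}, x_t] is realized by stacking the two vectors, and the weight shapes are illustrative.

```python
# NumPy sketch of one LSTM step, mirroring the gate formulas above.
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """W: dict of (hidden, hidden+input) matrices, b: dict of (hidden,) biases."""
    hx = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ hx + b["i"])                       # input gate
    f_t = sigmoid(W["f"] @ hx + b["f"])                       # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ hx + b["c"])  # memory cell
    o_t = sigmoid(W["o"] @ hx + b["o"])                       # output gate
    h_t = o_t * np.tanh(c_t)                                  # hidden state
    return h_t, c_t
```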
The RNN has no memory cell inside; it learns the relationship between the inputs at multiple time steps through its hidden layer, expressed by the following formulas:
h_t = f(U·x_t + W·h_{t-1} + b)
o_t = V·h_t
wherein x_t denotes the input at time t, h_t denotes the hidden state at time t, o_t denotes the output at time t, and f denotes the activation function, for which the tanh function is generally selected; U, W, V are weights, and b is a bias. It can be seen that the hidden state at time t depends not only on the input at time t but also on the state before time t, so the association within a time series can be effectively learned.
The RNN uses a single hidden layer with shared weight parameters to map the input sequence ..., x(t-1), x(t), x(t+1), ... to the output sequence ..., y(t-1), y(t), y(t+1), ...; the model structure of the RNN is shown in fig. 4.
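A corresponding NumPy sketch of one RNN step, using the recurrence as reconstructed above, is given below; the weights U, W, V and the bias b are shared across all time steps.

```python
# NumPy sketch of one RNN step: the hidden state mixes the current input
# with the previous hidden state, and the output is a projection of it.
import numpy as np


def rnn_step(x_t, h_prev, U, W, V, b):
    h_t = np.tanh(U @ x_t + W @ h_prev + b)  # hidden state at time t
    o_t = V @ h_t                            # output at time t
    return h_t, o_t
```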
The GRU likewise has no memory cell inside, but it has an update gate, which helps the model determine how much past information to pass on to the future, and a reset gate, which mainly determines how much past information to forget. The formulas are as follows:
z_t = σ(U_z·x_t + W_z·h_{t-1})
r_t = σ(U_r·x_t + W_r·h_{t-1})
h̃_t = tanh(U_h·x_t + W_h·(r_t ⊙ h_{t-1}))
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
wherein x_t denotes the input at time t, h_t denotes the output at time t, z_t denotes the update gate, r_t denotes the reset gate, and h̃_t denotes the intermediate state at time t; σ denotes the sigmoid function, ⊙ denotes element-wise multiplication, and U, W denote weights. It can be seen that the output of the GRU at each time step depends on the reset gate and the update gate.
The GRU is a variant of the LSTM that simplifies the LSTM network; the model structure of the GRU is shown in fig. 5.
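A NumPy sketch of one GRU step following the formulas above is given below; the intermediate (candidate) state and the final interpolation are written in the standard form assumed in the reconstruction.

```python
# NumPy sketch of one GRU step with update gate, reset gate and
# intermediate state, as in the formulas above.
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, U, W):
    """U, W: dicts of weight matrices keyed by 'z', 'r', 'h'."""
    z_t = sigmoid(U["z"] @ x_t + W["z"] @ h_prev)              # update gate
    r_t = sigmoid(U["r"] @ x_t + W["r"] @ h_prev)              # reset gate
    h_tilde = np.tanh(U["h"] @ x_t + W["h"] @ (r_t * h_prev))  # intermediate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                # output h_t
```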
In a further extension, the method can also be applied to speech-based age group recognition and gender recognition.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.