Disclosure of Invention
To solve the above problems, the present invention provides a method for recognizing speech emotion by using a hierarchical structure of deep learning temporal models, which can extract not only frame-level speech features but also segment-level speech features.
In order to achieve the above object, the present invention provides a speech-based emotion recognition method, including the steps of:
step 1: performing frame processing on the speech, and extracting features of each frame to obtain a feature vector v_n of each frame, where v_n represents the feature vector of the n-th frame;
step 2: inputting the feature vector v_n of each frame obtained in step 1 into a first-layer deep learning temporal model, learning the association between frames through the first-layer deep learning temporal model, and outputting a frame-level feature y_i every k frames, thereby obtaining the frame-level features of M time instants, where y_i represents the frame-level vector output by the first-layer deep learning temporal model at the i-th time instant;
step 3: inputting the frame-level features y_i obtained in step 2 and the hidden state h_{t-1} of the second-layer deep learning temporal model at time t-1 into a first attention model to obtain the input z_t of the second-layer deep learning temporal model at time t; after M learning steps, outputting segment-level features s_t, where s_t represents the segment-level vector output by the second-layer deep learning temporal model at time t;
step 4: inputting the segment-level features s_t obtained in step 3 into a second attention model to form the final utterance-level representation;
step 5: inputting the utterance-level representation obtained in step 4 into a softmax layer to obtain the probability values of the predicted emotions, thereby recognizing the emotion.
As a further refinement of the present invention, the first-layer deep learning temporal model and the second-layer deep learning temporal model are each one of LSTM, RNN and GRU.
As a further improvement of the present invention, in step 1, the length of each frame is 25 ms, and the frame shift is 10 ms.
As a further improvement of the invention, in step 1, 36-dimensional features are extracted from each frame, and the feature vector of each frame consists of 13-dimensional MFCCs, zero-crossing rate, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off point, a 12-dimensional chroma vector, the chroma vector standard deviation, signal-to-noise ratio and pitch.
As a further improvement of the invention, in step 2, k is 3, giving the frame-level features at each of the M time instants.
As a further improvement of the present invention, in step 3, the operation mechanism of the first attention model is as shown in formulas (1), (2) and (3):
e_i^(t) = w^T·tanh(W_a·h_{t-1} + U_a·y_i + b_a)    (1)
a_i^(t) = exp(e_i^(t)) / Σ_j exp(e_j^(t))    (2)
z_t = Σ_i a_i^(t)·y_i    (3)
wherein w, W_a, U_a, b_a are network parameters of the first attention model, y_i is a frame-level feature, h_{t-1} is the hidden state of the LSTM at time t-1, z_t is the input of the LSTM at time t, e_i^(t) is the correlation coefficient between the frame-level feature y_i and the LSTM input z_t at time t, and a_i^(t) is the attention coefficient at time t.
As a further improvement of the invention, in step 4, the correlation coefficient of the second attention model is obtained by vector multiplication of the network parameter u and the segment-level feature s_i.
As a further improvement of the invention, a plurality of the first-layer deep learning temporal models and first attention models are used to extract features of different levels in the speech.
The invention has the following beneficial effects: the hierarchical deep learning temporal model structure extracts features of different levels in the speech, and several attention mechanisms are introduced to effectively select the key features, which facilitates emotion recognition. With this method, not only frame-level but also segment-level speech features can be extracted, so the accuracy of emotion recognition can be effectively improved.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments and with reference to the attached drawings.
As shown in fig. 1, a method for emotion recognition based on speech according to an embodiment of the present invention includes the following steps:
step 1: performing frame processing on the speech, and extracting features of each frame to obtain a feature vector v_n of each frame, where v_n represents the feature vector of the n-th frame;
step 2: inputting the feature vector v_n of each frame obtained in step 1 into a first-layer deep learning temporal model, learning the association between frames through the first-layer deep learning temporal model, and outputting a frame-level feature y_i every k frames, thereby obtaining the frame-level features of M time instants, where y_i represents the frame-level vector output by the first-layer deep learning temporal model at the i-th time instant;
step 3: inputting the frame-level features y_i obtained in step 2 and the hidden state h_{t-1} of the second-layer deep learning temporal model at time t-1 into a first attention model to obtain the input z_t of the second-layer deep learning temporal model at time t; after M learning steps, outputting segment-level features s_t, where s_t represents the segment-level vector output by the second-layer deep learning temporal model at time t;
step 4: inputting the segment-level features s_t obtained in step 3 into a second attention model to form the final utterance-level representation;
step 5: inputting the utterance-level representation obtained in step 4 into a softmax layer to obtain the probability values of the predicted emotions, thereby recognizing the emotion.
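For illustration only, the following minimal Python (PyTorch) sketch shows one way steps 1 to 5 could be wired together; the class name HierarchicalSER, the hidden size, k = 3 and the number of emotion classes are assumptions made for the example, not a definitive implementation of the invention.

```python
# Minimal sketch of the two-layer temporal model with two attention stages
# (steps 1-5). Hidden size, k and the number of emotions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalSER(nn.Module):
    def __init__(self, feat_dim=36, hidden=128, k=3, num_emotions=4):
        super().__init__()
        self.k = k
        self.frame_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)  # first-layer model
        self.segment_cell = nn.LSTMCell(hidden, hidden)                # second-layer model
        # first attention model: e_i^(t) = w^T tanh(W_a h_{t-1} + U_a y_i + b_a)
        self.W_a = nn.Linear(hidden, hidden)
        self.U_a = nn.Linear(hidden, hidden, bias=False)
        self.w = nn.Linear(hidden, 1, bias=False)
        # second attention model: correlation of parameter u with each s_i
        self.u = nn.Parameter(torch.randn(hidden))
        self.classifier = nn.Linear(hidden, num_emotions)

    def forward(self, frames):                         # frames: (batch, N, 36)
        y_all, _ = self.frame_lstm(frames)             # step 2: frame-level outputs
        y = y_all[:, self.k - 1::self.k, :]            # keep one output every k frames
        batch, M, H = y.shape
        h = frames.new_zeros(batch, H)                 # hidden state h_{t-1} of second layer
        c = frames.new_zeros(batch, H)
        segments = []
        for _ in range(M):                             # step 3: M learning steps
            e = self.w(torch.tanh(self.W_a(h).unsqueeze(1) + self.U_a(y))).squeeze(-1)
            a = F.softmax(e, dim=1)                    # attention coefficients a_i^(t)
            z_t = (a.unsqueeze(-1) * y).sum(dim=1)     # input z_t of the second layer
            h, c = self.segment_cell(z_t, (h, c))
            segments.append(h)                         # segment-level feature s_t
        s = torch.stack(segments, dim=1)               # (batch, M, H)
        e2 = s @ self.u                                # step 4: second attention, u . s_i
        a2 = F.softmax(e2, dim=1)
        utterance = (a2.unsqueeze(-1) * s).sum(dim=1)  # utterance-level representation
        return F.softmax(self.classifier(utterance), dim=-1)  # step 5: emotion probabilities
```

For example, a forward pass on a batch of framed feature sequences, HierarchicalSER()(torch.randn(8, 300, 36)), returns an (8, 4) tensor of predicted emotion probabilities.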
Further, the first-layer deep learning temporal model and the second-layer deep learning temporal model are each one of LSTM, RNN, and GRU.
Further, in step 1, the length of each frame is 25 ms, and the frame shift is 10 ms.
Further, in step 1, 36-dimensional features are extracted from each frame, and the feature vector of each frame consists of 13-dimensional MFCCs, zero-crossing rate, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off point, a 12-dimensional chroma vector, the chroma vector standard deviation, signal-to-noise ratio and pitch.
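As a rough illustration of step 1, the sketch below frames a signal into 25 ms windows with a 10 ms shift and computes two of the listed features (zero-crossing rate and energy entropy) directly; the helper names are assumptions, and the remaining features (MFCCs, spectral statistics, chroma, SNR, pitch) would in practice come from a standard audio feature toolkit.

```python
# Sketch of step 1: 25 ms frames with a 10 ms shift, plus two of the listed
# per-frame features computed directly with NumPy. Illustrative only.
import numpy as np


def frame_signal(signal, sr, frame_ms=25, shift_ms=10):
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])


def zero_crossing_rate(frame):
    # fraction of sign changes between consecutive samples
    return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0


def energy_entropy(frame, n_blocks=10):
    # entropy of the normalized sub-frame energies
    blocks = np.array_split(frame ** 2, n_blocks)
    e = np.array([b.sum() for b in blocks])
    p = e / (e.sum() + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))


# usage: each row of `frames` yields one per-frame feature vector v_n
# frames = frame_signal(audio, sr=16000)
# zcr = np.array([zero_crossing_rate(f) for f in frames])
```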
Further, in step 2, k is 3, giving the frame-level features at each of the M time instants.
Further, in step 3, the operation mechanism of the first attention model is as shown in formulas (1), (2) and (3):
e_i^(t) = w^T·tanh(W_a·h_{t-1} + U_a·y_i + b_a)    (1)
a_i^(t) = exp(e_i^(t)) / Σ_j exp(e_j^(t))    (2)
z_t = Σ_i a_i^(t)·y_i    (3)
wherein w, W_a, U_a, b_a are network parameters of the first attention model (W and U are weights, b is a bias), y_i is a frame-level feature, h_{t-1} is the hidden state of the LSTM at time t-1, z_t is the input of the LSTM at time t, e_i^(t) is the correlation coefficient between the frame-level feature y_i and the LSTM input z_t at time t, and a_i^(t) is the attention coefficient at time t.
Further, in step 4, the correlation coefficient of the second attention model is obtained by vector multiplication of the network parameter u and the segment-level feature s_i.
Furthermore, a plurality of the first-layer deep learning temporal models and first attention models are used to extract features of different levels in the speech, improving the emotion recognition effect. Taking the first-layer deep learning temporal model together with the first attention model as one module, several such modules can be stacked so that three or more layers of deep learning temporal models extract the features in the speech.
As shown in fig. 2 and fig. 3, when recognizing speech emotion with the hierarchical LSTM structure, the speech is first divided into frames, each frame being 25 ms long with a frame shift of 10 ms; 36-dimensional features are extracted for each frame, comprising 13-dimensional MFCCs, zero-crossing rate, energy entropy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off point, a 12-dimensional chroma vector, the chroma vector standard deviation, signal-to-noise ratio and pitch. The 36-dimensional feature vector of each frame is then input into the first-layer LSTM structure; as shown in fig. 2, v_n represents the feature vector of the n-th frame, and the association between frames can be learned by the first-layer LSTM.
Second, every k frames, the output y_i of the first-layer LSTM and the hidden state h_{t-1} of the second-layer LSTM at the previous time instant are input to an attention model; with k equal to 3, the outputs {y_1, y_2, ..., y_M} at M time instants are obtained from the first-layer LSTM. The first attention model works as follows:
e_i^(t) = w^T·tanh(W_a·h_{t-1} + U_a·y_i + b_a)    (1)
a_i^(t) = exp(e_i^(t)) / Σ_j exp(e_j^(t))    (2)
z_t = Σ_i a_i^(t)·y_i    (3)
wherein w, W_a, U_a are weights and b_a is a bias; y_i is the frame-level feature at the i-th time instant, h_{t-1} is the hidden state of the LSTM at time t-1, z_t is the input of the LSTM at time t, e_i^(t) is the correlation coefficient between the frame-level feature y_i and the LSTM input z_t at time t, and a_i^(t) is the attention coefficient at time t. With the attention model, the second-layer LSTM can effectively select the key features through larger attention coefficients; for example, when the attention coefficient a_i^(t) equals 0, the i-th frame-level feature y_i is not selected.
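A small NumPy sketch of this first attention step for a single time instant t is given below; the dimensions and the softmax normalization of the correlation coefficients follow formulas (1) to (3) as reconstructed above, and the variable names are illustrative.

```python
# NumPy sketch of the first attention step at one time instant t:
# formula (1) scores each frame-level feature y_i against h_{t-1}, the scores
# are normalized into attention coefficients, and z_t is their weighted sum.
import numpy as np


def first_attention(Y, h_prev, w, W_a, U_a, b_a):
    """Y: (M, d) frame-level features, h_prev: (d,) hidden state h_{t-1}."""
    e = np.tanh(h_prev @ W_a.T + Y @ U_a.T + b_a) @ w   # e_i^(t), shape (M,)
    a = np.exp(e - e.max()); a /= a.sum()               # attention coefficients a_i^(t)
    z_t = a @ Y                                         # input z_t of the second-layer LSTM
    return z_t, a


# usage with random illustrative values:
# M, d = 5, 128
# z_t, a = first_attention(np.random.randn(M, d), np.zeros(d),
#                          np.random.randn(d), np.random.randn(d, d),
#                          np.random.randn(d, d), np.random.randn(d))
```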
Thirdly, the features s_i learned by the second-layer LSTM are input to a second attention model to form the final utterance-level representation. Here the correlation coefficient e_i of the second attention model is obtained by vector multiplication of the network parameter u and s_i.
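Analogously, the second attention model can be sketched as follows; the softmax normalization of the correlations u·s_i is an assumption made for the example, by analogy with the first attention model.

```python
# NumPy sketch of the second attention model: the correlation of each
# segment-level feature s_i with the parameter vector u is a dot product,
# and the weighted sum gives the utterance-level representation.
import numpy as np


def second_attention(S, u):
    """S: (M, d) segment-level features, u: (d,) learned parameter."""
    e = S @ u                              # correlation coefficients via vector multiplication
    a = np.exp(e - e.max()); a /= a.sum()  # assumed softmax normalization
    return a @ S                           # utterance-level representation
```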
Finally, the utterance-level representation is input into a softmax layer to obtain the probability values of the predicted emotions, thereby realizing emotion recognition.
Using a hierarchical LSTM structure: after the audio is divided into frames, each frame is only tens of milliseconds long, so the extracted features correspond to phonemes or even lower-level units. With the hierarchical LSTM structure, the first-layer LSTM learns the relationship between phoneme features; since phonemes form syllables and syllables form different words and phrases, the relationships extracted by the first-layer LSTM are output at intervals to the second-layer LSTM, which learns syllable-level features. For example, for an utterance such as "Wei, ni hao!" ("Hello, hi!"), the hierarchical LSTM structure can learn the relationships of "/w/", "/ei/", "/n/" and so on, and can also learn and distinguish emotions based on "wei" and "ni hao", whereas the prior art can only learn from phoneme-level features such as "/w/", "/ei/", "/n/" or even lower-level units. Compared with the prior art using a single-layer LSTM, the invention extracts features of different levels in the speech with a hierarchical LSTM structure, which is more conducive to emotion recognition.
In a specific implementation, since the LSTM, RNN, and GRU modules all take a frame sequence as input and produce an output at every time step, differing only slightly in their internal mechanisms while all being able to extract sequence features, the LSTM can be replaced by similar temporal models such as RNN and GRU.
The model structures of RNN, GRU and LSTM are as follows:
The LSTM contains a memory cell internally and has both long-term and short-term memory; it comprises three gates, namely an input gate, a forget gate and an output gate, expressed by the following formulas:
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c·[h_{t-1}, x_t] + b_c)
o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)
wherein i_t, f_t, o_t are the input gate, forget gate and output gate, respectively; c_t is the memory cell, h_t is the hidden state, σ denotes the sigmoid function, ⊙ denotes element-wise multiplication, W denotes a weight, and b denotes a bias. Through the forget gate, the LSTM determines how much information from past time steps is retained in the memory cell, and through the input gate it receives the information of the current time step.
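A NumPy sketch of a single LSTM step following the five formulas above is given below; the concatenation [h_{t-1}, x_t] is realized by stacking the two vectors, and the weight shapes are illustrative.

```python
# NumPy sketch of one LSTM step, mirroring the gate formulas above.
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """W: dict of (hidden, hidden+input) matrices, b: dict of (hidden,) biases."""
    hx = np.concatenate([h_prev, x_t])                        # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ hx + b["i"])                       # input gate
    f_t = sigmoid(W["f"] @ hx + b["f"])                       # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ hx + b["c"])  # memory cell
    o_t = sigmoid(W["o"] @ hx + b["o"])                       # output gate
    h_t = o_t * np.tanh(c_t)                                  # hidden state
    return h_t, c_t
```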
The RNN has no memory cell inside; it learns the relationship between the inputs at multiple time steps through its hidden layer, expressed by the following formulas:
h_t = f(U·x_t + W·h_{t-1} + b)
o_t = V·h_t
wherein x_t denotes the input at time t, h_t denotes the hidden state at time t, o_t denotes the output at time t, and f denotes the activation function, for which the tanh function is generally selected; U, W, V are weights, and b is a bias. It can be seen that the hidden state at time t depends not only on the input at time t but also on the state before time t, so the association within a time series can be effectively learned.
The RNN uses a single hidden layer with shared weight parameters to map the input sequence ..., x(t-1), x(t), x(t+1), ... to the output sequence ..., y(t-1), y(t), y(t+1), ...; the model structure of the RNN is shown in fig. 4.
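A corresponding NumPy sketch of one RNN step, using the recurrence as reconstructed above, is given below; the weights U, W, V and the bias b are shared across all time steps.

```python
# NumPy sketch of one RNN step: the hidden state mixes the current input
# with the previous hidden state, and the output is a projection of it.
import numpy as np


def rnn_step(x_t, h_prev, U, W, V, b):
    h_t = np.tanh(U @ x_t + W @ h_prev + b)  # hidden state at time t
    o_t = V @ h_t                            # output at time t
    return h_t, o_t
```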
The GRU likewise has no memory cell inside, but it has an update gate, which helps the model determine how much past information to pass on to the future, and a reset gate, which mainly determines how much past information to forget. The formulas are as follows:
z_t = σ(U_z·x_t + W_z·h_{t-1})
r_t = σ(U_r·x_t + W_r·h_{t-1})
h̃_t = tanh(U_h·x_t + W_h·(r_t ⊙ h_{t-1}))
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
wherein x_t denotes the input at time t, h_t denotes the output at time t, z_t denotes the update gate, r_t denotes the reset gate, and h̃_t denotes the intermediate state at time t; σ denotes the sigmoid function, ⊙ denotes element-wise multiplication, and U, W denote weights. It can be seen that the output of the GRU at each time step depends on the reset gate and the update gate.
The GRU is a variant of the LSTM that simplifies the LSTM network; the model structure of the GRU is shown in fig. 5.
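A NumPy sketch of one GRU step following the formulas above is given below; the intermediate (candidate) state and the final interpolation are written in the standard form assumed in the reconstruction.

```python
# NumPy sketch of one GRU step with update gate, reset gate and
# intermediate state, as in the formulas above.
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, U, W):
    """U, W: dicts of weight matrices keyed by 'z', 'r', 'h'."""
    z_t = sigmoid(U["z"] @ x_t + W["z"] @ h_prev)              # update gate
    r_t = sigmoid(U["r"] @ x_t + W["r"] @ h_prev)              # reset gate
    h_tilde = np.tanh(U["h"] @ x_t + W["h"] @ (r_t * h_prev))  # intermediate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                # output h_t
```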
In a further extension, the method can also be applied to speech-based age group recognition and gender recognition.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.