CN114927144A - Voice emotion recognition method based on attention mechanism and multi-task learning - Google Patents
- Publication number: CN114927144A
- Application number: CN202210546156.4A
- Authority: CN (China)
- Prior art keywords: lstm, att, recognition, task, emotion
- Prior art date: 2022-05-19
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
- G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/18: Extracted parameters being spectral information of each sub-band
- G10L25/30: Analysis technique using neural networks
- G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
Abstract
The invention discloses a speech emotion recognition method based on an attention mechanism and multi-task learning, which comprises the following steps. Step 1: acquire the CASIA Chinese emotion data set for speech emotion recognition. Step 2: build the LSTM_att-MTL speech emotion recognition model, which consists of a feature extraction module, a sequence modeling module and a multi-task learning module, and input the speech emotion data from step 1 into the model for collaborative training. Step 3: obtain the recognition result through the softmax classifier in step 2, and calculate the loss function between the recognition result and the training-set labels so as to adjust the network. Step 4: input the test-set speech emotion data into the network trained in step 3 to recognize the test-set speech emotion data. The invention constructs an LSTM_att-MTL model that addresses the reduced recognition performance caused by the high computational complexity and poor training behavior of traditional feature extraction methods.
Description
Technical Field
The invention belongs to the field of deep learning, and particularly relates to a speech emotion recognition method based on an attention mechanism and multi-task learning.
Background
Speech emotion recognition is an advanced computer technology spanning several disciplines; its purpose is to extract emotion feature parameters from a speech signal and to recognize the speech emotion from those parameters. Speech emotion recognition has grown from an early exploratory stage into a key technology of human-computer interaction and has been widely applied across industries: introduced into a vehicle driving system, it can record changes in the driver's mental state and increase the interaction between people and vehicles; added to medical equipment, it allows treatment to be adjusted according to changes in a patient's emotions; introduced into an education system, it can detect changes in students' emotions during remote teaching and thereby improve teaching quality. Early speech emotion recognition mainly relied on machine learning algorithms and recognition models such as the hidden Markov model (HMM), which could generally recognize only a few types of emotion. The development of deep learning techniques such as convolutional neural networks (CNN) has allowed deep-learning-based speech emotion recognition to advance considerably; for example, one method combines the spectrogram with a deep convolutional neural network, using a network of convolutional and fully connected layers to extract features from the spectrogram, and effectively recognizes a variety of emotions.
In speech emotion recognition, after the original speech is preprocessed, a feature extraction method is generally applied to extract features from it, and the emotion category to which the features belong is then recognized. One solution extracts prosodic and voice-quality features: energy, fundamental-frequency and logarithmic curves are usually extracted from emotional sentences, the corresponding first-order and second-order difference curves are calculated, and statistics such as skewness, kurtosis and variance are finally computed. The emotion-discriminating ability of such features is limited, and the recognition accuracy for emotions such as anger, fear and happiness is not ideal. Extracting features from the spectrogram is one of the important techniques of speech emotion recognition at the present stage: the spectrogram is normalized and converted to grey scale, higher-level features are extracted from both the time domain and the frequency domain, and these features can also serve as a visual representation of the speech signal.
Most existing methods that extract features from the spectrogram adopt a long short-term memory (LSTM) network to handle long-term dependencies in the time sequence, but high computational complexity and insufficient training data leave existing speech emotion classifiers with a low recognition rate. This patent therefore proposes a speech emotion recognition method based on an attention mechanism and multi-task learning, which improves the recognition performance of the algorithm.
Disclosure of Invention
The invention aims to overcome the deficiencies of the background art by providing a speech emotion recognition method based on an attention mechanism and multi-task learning. The method constructs an LSTM_att-MTL model, which addresses the reduced recognition performance caused by the high computational complexity and poor training behavior of traditional feature extraction methods.
The invention adopts the following technical scheme to solve the technical problem:
Compared with the prior art, the technical scheme adopted by the invention has the following technical effects:
The invention designs a speech emotion recognition method based on an attention mechanism and multi-task learning, namely the LSTM_att-MTL model. The model addresses the reduced recognition performance caused by the high computational complexity and poor training behavior of traditional feature extraction methods. First, to extract higher-level features, the original speech at each moment is preprocessed into a grey-scale spectrogram, which serves as the input of a two-layer CNN; the CNN layers then perform feature extraction on the speech. Second, because the obtained speech features are correlated along the time axis, the LSTM_att layers, which act as the parameter-sharing layers, perform temporal modeling on the grey spectrogram features learned by the CNN. Finally, because the memory of LSTM_att gradually fades as the speech grows longer, the influence of the starting time steps on the current moment becomes weaker and weaker, so an attention layer is added in the multi-task module. Simulation experiments on the CASIA data set show that the model achieves good speech emotion recognition performance.
Drawings
FIG. 1 is an overall framework diagram of a speech emotion recognition method based on attention mechanism and multi-task learning according to the present invention;
FIG. 2 is a schematic diagram of an LSTM structure with an attention gate in accordance with the present invention;
FIG. 3 is a schematic diagram of the values of the weight α for LSTM_att-MTL and LSTM-MTL;
FIG. 4 compares the experimental results for different numbers of LSTM_att layers.
Detailed Description
The technical scheme of the invention is further explained in detail below with reference to the attached drawings:
A speech emotion recognition method based on an attention mechanism and multi-task learning, as shown in FIG. 1 and FIG. 2, is used to address the high computational complexity, poor training behavior and reduced recognition performance of traditional feature extraction methods. An attention gate is added inside the LSTM to reduce the number of training parameters, and together with multi-task learning this forms the LSTM_att-MTL speech emotion recognition model, which improves the recognition performance of the algorithm. The method comprises the following steps:
Step 1: acquire a speech emotion data set: acquire the CASIA Chinese emotion data set for speech emotion recognition;
Step 2: construct the LSTM_att-MTL speech emotion recognition model: the model consists of a feature extraction module, a sequence modeling module and a multi-task learning module, and the speech emotion data from step 1 is input into the model for collaborative training;
Step 3: calculate the loss function of the model: obtain the recognition result through the softmax classifier in step 2, and calculate the loss function between the recognition result and the training-set labels so as to adjust the network;
Step 4: train the overall model and obtain the recognition result: input the test-set speech emotion data into the network trained in step 3 to recognize the test-set speech emotion data.
The step 1 of acquiring the speech emotion data set comprises the following steps:
Step 11: the speech emotion data set is the CASIA Chinese emotion corpus developed by the Institute of Automation of the Chinese Academy of Sciences. In this corpus, four actors (two male and two female) each perform 500 sentences of text under six emotions (anger, fear, happy, neutral, sad and surprise) in a clean recording environment, with a 16 kHz sampling rate, 16-bit quantization and a signal-to-noise ratio of about 35 dB; the sentences are stored in pcm format, and 9600 sentences are screened out;
Step 12: the data set is divided into a training set, a verification set and a test set in the ratio 6:2:2.
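As an illustration only, the following is a minimal sketch of the 6:2:2 split in step 12, assuming the 9600 utterances are available as a list of (path, emotion_label, gender_label) tuples; the function name and the fixed random seed are illustrative and not specified in the text.

```python
import random

def split_casia(samples, seed=42):
    """Shuffle and split (path, emotion_label, gender_label) tuples 6:2:2."""
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test
```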
The step 2 of constructing the LSTM_att-MTL speech emotion recognition model comprises the following steps:
Step 21: first, the original speech is preprocessed as follows: the speech is divided into frames and windowed, with a frame length of 25 ms and a frame shift of 10 ms; a short-time Fourier transform is applied to obtain the spectrogram of the speech signal; the spectrogram is max-min normalized and quantized into a grey-scale map; the process then moves on to step 22;
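A minimal Python sketch of this preprocessing, using librosa, is given below; the FFT size, window type and log-magnitude scaling are assumptions that the text does not specify.

```python
import numpy as np
import librosa

def speech_to_grey_spectrogram(path, sr=16000):
    """Frame + window + STFT + max-min normalization into an 8-bit grey-scale map."""
    y, _ = librosa.load(path, sr=sr)
    win_length = int(0.025 * sr)   # 25 ms frame length
    hop_length = int(0.010 * sr)   # 10 ms frame shift
    # Assumption: FFT size and Hamming window are not given in the patent text.
    spec = np.abs(librosa.stft(y, n_fft=512, hop_length=hop_length,
                               win_length=win_length, window="hamming"))
    spec = librosa.amplitude_to_db(spec)                 # assumption: log magnitude
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)
    return (spec * 255).astype(np.uint8)                 # quantize to a grey-level map
```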
Step 22: the spectrogram is input into the CNN network, the convolutional layers learn the speech features from the spectrogram through convolution calculations, and the ReLU function is selected as the activation function; the process then moves on to step 23. The specific parameters are as follows:
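Since the exact layer parameters are not specified in the text above, the channel counts, kernel sizes and pooling in the sketch below are assumptions; it only illustrates the two convolutional layers with ReLU described in step 22, written in PyTorch.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Two convolutional layers with ReLU, as described in step 22.

    The channel counts, kernel sizes and pooling are assumptions; the patent
    text does not reproduce the exact parameter table.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):                  # x: (batch, 1, freq, time)
        f = self.features(x)               # (batch, 32, freq/4, time/4)
        # Flatten channel and frequency axes so each time step becomes a feature vector
        b, c, fr, t = f.shape
        return f.permute(0, 3, 1, 2).reshape(b, t, c * fr)
```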
Step 23: referring to FIG. 2, two recurrent-network layers are used as the shared layers, with LSTM_att as the basic unit and 128 hidden units per layer. To prevent overfitting, dropout with a rate of 0.5 is introduced during training. The first layer passes its full output sequence to the next layer, and the second layer outputs the result of the last time step. The output of the LSTM_att attention gate is:
$att_t = \sigma(V_{att} \times \tanh(W_{att} \times c_{t-1})) \qquad (1)$
where $V_{att}$ and $W_{att}$ are parameters to be trained, learned from the training data; $c_{t-1}$ is the cell state at the previous moment; $\sigma(\cdot)$ and $\tanh(\cdot)$ are the logistic sigmoid and hyperbolic tangent activation functions, respectively. The other gate units output:
$i_t = \sigma(W_i \times [c_{t-1}, h_{t-1}, x_t] + b_i) \qquad (2)$
$o_t = \sigma(W_o \times [h_{t-1}, x_t] + b_o) \qquad (5)$
$h_t = o_t \cdot \tanh(c_t) \qquad (6)$
where $i_t$ is the input gate at time $t$, $W_i$ and $b_i$ are the weight matrix and bias term of the input gate unit, $c_{t-1}$ and $h_{t-1}$ are the cell state and hidden-layer output at the previous moment, and $x_t$ is the input at the current moment; $\tilde{c}_t$ is the candidate value for updating the cell state at time $t$, with $W_c$ and $b_c$ the weight matrix and bias term used in the state update; $c_t$ is the cell state at time $t$, and $\cdot$ denotes the Hadamard product; $o_t$ is the output gate at time $t$, with $W_o$ and $b_o$ the weight matrix and bias term of the output gate unit; $h_t$ is the hidden-layer output at time $t$;
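For illustration, a minimal PyTorch sketch of one LSTM_att step is given below. Equations (3) and (4) are not reproduced in this text, so the candidate state and the cell update follow the standard LSTM form, with the attention gate assumed to take the place of the forget gate; treat those two lines as assumptions rather than the patented formulation.

```python
import torch
import torch.nn as nn

class LSTMAttCell(nn.Module):
    """One step of an LSTM cell with an attention gate (equations (1), (2), (5), (6))."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_att = nn.Linear(hidden_size, hidden_size, bias=False)
        self.V_att = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_i = nn.Linear(hidden_size + hidden_size + input_size, hidden_size)
        self.W_c = nn.Linear(hidden_size + input_size, hidden_size)
        self.W_o = nn.Linear(hidden_size + input_size, hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        att_t = torch.sigmoid(self.V_att(torch.tanh(self.W_att(c_prev))))    # eq (1)
        i_t = torch.sigmoid(self.W_i(torch.cat([c_prev, h_prev, x_t], -1)))  # eq (2)
        c_tilde = torch.tanh(self.W_c(torch.cat([h_prev, x_t], -1)))         # assumed form of eq (3)
        c_t = att_t * c_prev + i_t * c_tilde                                 # assumed form of eq (4)
        o_t = torch.sigmoid(self.W_o(torch.cat([h_prev, x_t], -1)))          # eq (5)
        h_t = o_t * torch.tanh(c_t)                                          # eq (6)
        return h_t, c_t
```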
Step 24: a hard-parameter-sharing multi-task learning method is adopted, with the speaker's gender as the auxiliary task, because the difference between male and female voices affects the performance of speech-related systems, and gender-specific emotion recognition models perform better than emotion recognition models that do not distinguish gender.
The attention layer of the multi-task learning module applies attention weighting to the LSTM_att output:
$v = \sum_i \alpha_i h_i \qquad (8)$
In equation (7), $\alpha_i$ denotes the attention weight and the vector $\mu = (\theta_1, \theta_2, \ldots, \theta_T)$ is the attention parameter, where $T$ is the number of frames and $\{h_1, h_2, \ldots, h_T\}$ is the output of the last LSTM_att layer. The inner product of the attention parameter vector $\mu$ and $h_i$ is computed as the importance score of each time frame and then normalized; the normalized score is the weight of each frame containing the key information. Equation (8) takes the dot product of the obtained weights and the LSTM_att outputs, and the resulting weighted sum $v$ is used as the feature vector with globally updated weights.
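A minimal sketch of this attention pooling is shown below. Equation (7) itself is not reproduced in this text, so the softmax over the inner products $\mu^{T} h_i$ is an assumption that follows the prose description.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Attention pooling over the last LSTM_att layer's outputs (equation (8))."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(hidden_size))  # attention parameter vector

    def forward(self, h):                     # h: (batch, T, hidden_size)
        scores = h @ self.mu                  # inner product per time frame: (batch, T)
        alpha = torch.softmax(scores, dim=1)  # assumed normalization of the scores
        v = (alpha.unsqueeze(-1) * h).sum(1)  # weighted sum, eq (8): (batch, hidden_size)
        return v, alpha
```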
Computing the loss function of the model in step 3 comprises the following steps:
First, the feature vector obtained from equation (8) is input into a fully connected layer, which achieves task-specific optimization while learning the shared features, and classification through softmax gives the predicted value:
$\hat{y} = \mathrm{softmax}(W \times v + b)$
where $W$ and $b$ are the weight and bias terms of the fully connected layer, $v$ is the feature vector obtained from equation (8), and $\mathrm{softmax}(\cdot)$ is the softmax function.
Step 31: during training, the recognition results are compared with the training-set labels, the loss function is calculated, and an ADAM optimizer is used to optimize it. The cross-entropy function is used as the loss function of each task:
$L_{task} = -\sum_{i=1}^{N_{task}} y_i \log \hat{y}_i$
where $L_{task}$ is the loss function of each task, with task denoting the task category, emotion recognition (emotion) or gender classification (gender); $y_i$ and $\hat{y}_i$ are the true value of the training-set label and the predicted value output by the model, respectively, and $N_{task}$ is the total number of categories of the task.
Step 32: the final overall loss function is:
$L_{total} = \alpha L_{emo} + (1-\alpha) L_{gen} \qquad (10)$
where $L_{emo}$ and $L_{gen}$ are the loss functions of speech emotion recognition and gender recognition, respectively, and $\alpha$ is the weight coefficient of the emotion recognition loss. Back-propagation and gradient updates are applied to the whole network through $L_{total}$.
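As a sketch only, the joint objective of equation (10) and one ADAM update might look as follows in PyTorch; the two linear heads and variable names are illustrative, the feature dimension of 128 and alpha = 0.8 are taken from the surrounding text, and nn.CrossEntropyLoss applies the softmax internally, matching the softmax classification described above.

```python
import torch
import torch.nn as nn

emo_head = nn.Linear(128, 6)   # six emotions (illustrative head)
gen_head = nn.Linear(128, 2)   # two genders (illustrative auxiliary head)
params = list(emo_head.parameters()) + list(gen_head.parameters())
optimizer = torch.optim.Adam(params)
ce = nn.CrossEntropyLoss()     # cross-entropy with built-in softmax
alpha = 0.8                    # weight chosen on the verification set (see FIG. 3)

def train_step(v, emo_labels, gen_labels):
    """One optimization step on L_total = alpha * L_emo + (1 - alpha) * L_gen."""
    loss = alpha * ce(emo_head(v), emo_labels) + (1 - alpha) * ce(gen_head(v), gen_labels)
    optimizer.zero_grad()
    loss.backward()            # in the full model this also reaches the shared layers
    optimizer.step()
    return loss.item()
```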
The recognition of the test-set speech emotion data in step 4 comprises the following steps:
Step 41: to demonstrate the ability of LSTM_att to process time-series data, an LSTM-MTL comparison experiment is designed with the same numbers of LSTM layers and nodes as the proposed model, recognizing both emotion and gender. The values of the weight coefficients of the two task loss functions are then determined; since speech emotion recognition is the main task, its weight is set larger than that of the gender recognition task, and the initial value of α is set to 0.9 in the experiments. The weight giving the highest recognition accuracy is found by adjusting the weight and testing on the verification set. As shown in FIG. 3, the weight α takes the values 0.8 and 0.6 for LSTM_att-MTL and LSTM-MTL, respectively.
Step 42: in order to reflect the performance improvement of multi-task learning compared with single-task learning, a single-task comparison experiment LSTM _ att-STL is designed, and the emotional tasks are identified on the premise of keeping other parameters the same.
Tables 1 and 2 show the recognition results of each task and of each emotion on the CASIA data set, respectively. Table 1 shows that LSTM_att-MTL gives a clear improvement in emotion recognition; because the CASIA data set contains few speakers and the gender recognition rate is very high, gender recognition as an auxiliary task has little influence on emotion recognition. Table 2 shows that the recognition accuracy of the proposed method reaches 93.1%, and the recognition accuracy of every emotion is higher than in the comparison experiments, except that the accuracies for the neutral and surprise emotions are lower than those of LSTM-MTL. The confusion matrix of the LSTM_att-MTL recognition results is shown in Table 3, where the horizontal axis is the predicted speech emotion and the vertical axis is the corresponding true category; the table shows that some emotions have lower recognition accuracy and are easily misrecognized as neutral or surprise, while the neutral emotion is recognized accurately, reaching 96.9%.
TABLE 1 Recognition results of each task in the comparative experiments
TABLE 2 Emotion recognition results of each comparative experiment
TABLE 3 Emotion recognition results of each item in LSTM_att-MTL
Step 43: in order to select the most suitable LSTM _ att layer number in the LSTM _ att-MTL, three experiments are simultaneously set, wherein the LSTM layer number is 2, 3 and 4 respectively. The experimental result is shown in fig. 4, and the increase of the number of LSTM layers improves the efficiency of time domain feature extraction, as shown in table 5, the recognition accuracy of 2 layers of LSTM _ att is 93.1%, the recognition accuracy of 3 layers of LSTM _ att is 92.8%, and the recognition accuracy of 4 layers of LSTM _ att is 92.1%, so 2 layers of LSTM _ att are selected for the experiment.
The above description is only an embodiment of the present invention and is not intended to limit the scope of the invention; all equivalent structural or process transformations made using the contents of the specification and the drawings, or applied directly or indirectly in related technical fields, are likewise included within the scope of protection of the present invention.
Claims (5)
1. A speech emotion recognition method based on an attention mechanism and multi-task learning, characterized by comprising the following steps:
step 1: acquiring a speech emotion data set: acquiring the CASIA Chinese emotion data set for speech emotion recognition;
step 2: constructing the LSTM_att-MTL speech emotion recognition model: the model consists of a feature extraction module, a sequence modeling module and a multi-task learning module, and the speech emotion data from step 1 is input into the model for collaborative training;
step 3: calculating the loss function of the model: obtaining the recognition result through the softmax classifier in step 2, and calculating the loss function between the recognition result and the training-set labels so as to adjust the network;
step 4: training the model to obtain the recognition result: inputting the test-set speech emotion data into the network trained in step 3 to recognize the test-set speech emotion data.
2. The speech emotion recognition method based on attention mechanism and multi-task learning as claimed in claim 1, wherein the step 1 of acquiring the speech emotion data set comprises the following steps:
step 11: the speech emotion data set is the CASIA Chinese emotion corpus developed by the Institute of Automation of the Chinese Academy of Sciences; in this corpus, four actors (two male and two female) each perform 500 sentences of text under six emotions (anger, fear, happy, neutral, sad and surprise) in a clean recording environment, with a 16 kHz sampling rate, 16-bit quantization and a signal-to-noise ratio of about 35 dB; the sentences are stored in pcm format, and 9600 sentences are finally screened out;
step 12: the data set is divided into a training set, a verification set and a test set in the ratio 6:2:2.
3. The speech emotion recognition method based on attention mechanism and multi-task learning as claimed in claim 1, wherein the step 2 of constructing the LSTM_att-MTL speech emotion recognition model comprises the following steps:
step 21: first, preprocessing the original speech as follows: dividing the speech into frames and windowing, with a frame length of 25 ms and a frame shift of 10 ms; performing a short-time Fourier transform to obtain the spectrogram of the speech signal; performing max-min normalization on the spectrogram and quantizing it into a grey-scale map, then proceeding to step 22;
step 22: inputting the spectrogram into the CNN network, the convolutional layers learning the speech features from the spectrogram through convolution calculations, selecting the ReLU function as the activation function, and proceeding to step 23;
step 23: two recurrent-network layers are used as the shared layers, with LSTM_att as the basic unit and 128 hidden units per layer; to prevent overfitting, dropout with a rate of 0.5 is introduced during training; the first layer passes its full output sequence to the next layer, and the second layer outputs the result of the last time step; the output $att_t$ of the LSTM_att attention gate is:
$att_t = \sigma(V_{att} \times \tanh(W_{att} \times c_{t-1})) \qquad (1)$
where $V_{att}$ and $W_{att}$ are parameters to be trained, learned from the training data; $c_{t-1}$ is the cell state at the previous moment; $\sigma(\cdot)$ and $\tanh(\cdot)$ are the logistic sigmoid and hyperbolic tangent activation functions, respectively; the other gate units output:
$i_t = \sigma(W_i \times [c_{t-1}, h_{t-1}, x_t] + b_i) \qquad (2)$
$o_t = \sigma(W_o \times [h_{t-1}, x_t] + b_o) \qquad (5)$
$h_t = o_t \cdot \tanh(c_t) \qquad (6)$
where $i_t$ is the input gate at time $t$, $W_i$ and $b_i$ are the weight matrix and bias term of the input gate unit, $c_{t-1}$ and $h_{t-1}$ are the cell state and hidden-layer output at the previous moment, and $x_t$ is the input at the current moment; $\tilde{c}_t$ is the candidate value for updating the cell state at time $t$, with $W_c$ and $b_c$ the weight matrix and bias term used in the state update; $c_t$ is the cell state at time $t$, and $\cdot$ denotes the Hadamard product; $o_t$ is the output gate at time $t$, with $W_o$ and $b_o$ the weight matrix and bias term of the output gate unit; $h_t$ is the hidden-layer output at time $t$;
step 24: a hard-parameter-sharing multi-task learning method is adopted, with the speaker's gender as the auxiliary task, because the difference between male and female voices affects the performance of speech-related systems, and gender-specific emotion recognition models perform better than emotion recognition models that do not distinguish gender;
the attention layer in the multi-task learning module applies attention weighting to the LSTM_att output:
$v = \sum_i \alpha_i h_i \qquad (8)$
in equation (7), $\alpha_i$ denotes the attention weight and the vector $\mu = (\theta_1, \theta_2, \ldots, \theta_T)$ is the attention parameter, where $T$ is the number of frames and $\{h_1, h_2, \ldots, h_T\}$ is the output of the last LSTM_att layer; the inner product of the attention parameter vector $\mu$ and $h_i$ is computed as the importance score of each time frame and then normalized, the normalized score being the weight of each frame containing the key information; equation (8) takes the dot product of the obtained weights $\alpha_i$ and the LSTM_att outputs $h_i$, and the resulting weighted sum $v$ is used as the feature vector with globally updated weights.
4. The speech emotion recognition method based on attention mechanism and multi-task learning as claimed in claim 1, wherein computing the loss function of the model in step 3 comprises the following steps:
first, the feature vector obtained from equation (8) is input into a fully connected layer, which achieves task-specific optimization while learning the shared features, and classification through softmax gives the predicted value:
$\hat{y} = \mathrm{softmax}(W \times v + b)$
where $W$ and $b$ are the weight and bias terms of the fully connected layer, $v$ is the feature vector obtained from equation (8), and $\mathrm{softmax}(\cdot)$ is the softmax function;
step 31: during training, the recognition results are compared with the training-set labels, the loss function is calculated, and an ADAM optimizer is used to optimize it; the cross-entropy function is used as the loss function of each task:
$L_{task} = -\sum_{i=1}^{N_{task}} y_i \log \hat{y}_i$
where $L_{task}$ is the loss function of each task, with task denoting the task category, emotion recognition (emotion) or gender classification (gender); $y_i$ and $\hat{y}_i$ are the true value of the training-set label and the predicted value output by the model, respectively, and $N_{task}$ is the total number of categories of the task;
step 32: the final overall loss function is:
$L_{total} = \alpha L_{emo} + (1-\alpha) L_{gen} \qquad (10)$
where $L_{emo}$ and $L_{gen}$ are the loss functions of speech emotion recognition and gender recognition, respectively, and $\alpha$ is the weight coefficient of the emotion recognition loss; back-propagation and gradient updates are applied to the whole network through $L_{total}$.
5. The method for recognizing speech emotion based on attention mechanism and multi-task learning as claimed in claim 1, characterized in that the recognition of the test-set speech emotion data in step 4 comprises the following steps:
step 41: to demonstrate the ability of LSTM_att to process time-series data, an LSTM-MTL comparison experiment is designed with the same numbers of LSTM layers and nodes as the proposed model, recognizing both emotion and gender; the values of the weight coefficients of the two task loss functions are determined, and since speech emotion recognition is the main task its weight is set larger than that of the gender recognition task, with the initial value of α set to 0.9 in the experiments; the weight giving the highest recognition accuracy is found by adjusting the weight and testing on the verification set; as shown in FIG. 3, the weight α takes the values 0.8 and 0.6 for LSTM_att-MTL and LSTM-MTL, respectively;
step 42: to show the performance improvement of multi-task learning over single-task learning, a single-task comparison experiment, LSTM_att-STL, is designed, which recognizes only the emotion task while keeping the other parameters the same;
step 43: to select the most suitable number of LSTM_att layers in LSTM_att-MTL, three experiments are set up on the verification set with 2, 3 and 4 LSTM layers, respectively; the experimental results are shown in FIG. 4: although increasing the number of LSTM layers can improve the efficiency of temporal feature extraction, Table 5 shows that the recognition accuracy is 93.1% with 2 LSTM_att layers, 92.8% with 3 layers and 92.1% with 4 layers, so 2 LSTM_att layers are selected for the experiments.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210546156.4A | 2022-05-19 | 2022-05-19 | Voice emotion recognition method based on attention mechanism and multi-task learning |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210546156.4A | 2022-05-19 | 2022-05-19 | Voice emotion recognition method based on attention mechanism and multi-task learning |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114927144A (en) | 2022-08-19 |
Family ID: 82807755

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210546156.4A | Voice emotion recognition method based on attention mechanism and multi-task learning (CN114927144A, pending) | 2022-05-19 | 2022-05-19 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114927144A (en) |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116030526A (en) | 2023-02-27 | 2023-04-28 | 华南农业大学 | Emotion recognition method, system and storage medium based on multitask deep learning |
| CN116030526B (en) | 2023-02-27 | 2023-08-15 | 华南农业大学 | Emotion recognition method, system and storage medium based on multitask deep learning |
Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination