CN114927144A - Voice emotion recognition method based on attention mechanism and multi-task learning - Google Patents
- Publication number: CN114927144A
- Application number: CN202210546156.4A
- Authority: CN (China)
- Prior art keywords: lstm, att, recognition, task, emotion
- Prior art date: 2022-05-19
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
- G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
- G10L25/18: Extracted parameters being spectral information of each sub-band
- G10L25/30: Analysis technique using neural networks
- G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window
Abstract
The invention discloses a speech emotion recognition method based on an attention mechanism and multi-task learning, which comprises the following steps. Step 1: acquire the CASIA Chinese emotion data set for speech emotion recognition. Step 2: build the LSTM_att-MTL speech emotion recognition model, which consists of a feature extraction module, a sequence modeling module and a multi-task learning module, and input the speech emotion data from step 1 into the model for collaborative training. Step 3: obtain the recognition result through the softmax classifier in step 2, and calculate the loss function between the recognition result and the training-set labels so as to adjust the network. Step 4: input the test-set speech emotion data into the network trained in step 3 to recognize the test-set speech emotion data. The invention constructs an LSTM_att-MTL model that addresses the reduced recognition performance caused by the high computational complexity and poor training behavior of traditional feature extraction methods.
Description
Technical Field
The invention belongs to the field of deep learning, and particularly relates to a speech emotion recognition method based on an attention mechanism and multi-task learning.
Background
Speech emotion recognition is an advanced computer technology spanning several disciplines; its purpose is to extract emotion feature parameters from a speech signal and to recognize the speech emotion from those parameters. Speech emotion recognition has grown from an early exploratory stage into a key technology of human-computer interaction and has been widely applied across industries: introduced into a vehicle driving system, it can record changes in the driver's mental state and increase the interaction between people and vehicles; added to medical equipment, it allows treatment to be adjusted according to changes in a patient's emotions; introduced into an education system, it can detect changes in students' emotions during remote teaching and thereby improve teaching quality. Early speech emotion recognition mainly relied on machine learning algorithms and recognition models such as the hidden Markov model (HMM), which could generally recognize only a few types of emotion. The development of deep learning techniques such as convolutional neural networks (CNN) has allowed deep-learning-based speech emotion recognition to advance considerably; for example, one method combines the spectrogram with a deep convolutional neural network, using a network of convolutional and fully connected layers to extract features from the spectrogram, and effectively recognizes a variety of emotions.
In speech emotion recognition, after the original speech is preprocessed, a feature extraction method is generally applied to extract features from it, and the emotion category to which the features belong is then recognized. One solution extracts prosodic and voice-quality features: energy, fundamental-frequency and logarithmic curves are usually extracted from emotional sentences, the corresponding first-order and second-order difference curves are calculated, and statistics such as skewness, kurtosis and variance are finally computed. The emotion-discriminating ability of such features is limited, and the recognition accuracy for emotions such as anger, fear and happiness is not ideal. Extracting features from the spectrogram is one of the important techniques of speech emotion recognition at the present stage: the spectrogram is normalized and converted to grey scale, higher-level features are extracted from both the time domain and the frequency domain, and these features can also serve as a visual representation of the speech signal.
Most existing methods that extract features from the spectrogram adopt a long short-term memory (LSTM) network to handle long-term dependencies in the time sequence, but high computational complexity and insufficient training data leave existing speech emotion classifiers with a low recognition rate. This patent therefore proposes a speech emotion recognition method based on an attention mechanism and multi-task learning, which improves the recognition performance of the algorithm.
Disclosure of Invention
The invention aims to overcome the deficiencies of the background art by providing a speech emotion recognition method based on an attention mechanism and multi-task learning. The method constructs an LSTM_att-MTL model, which addresses the reduced recognition performance caused by the high computational complexity and poor training behavior of traditional feature extraction methods.
The invention adopts the following technical scheme to solve the technical problem:
Compared with the prior art, the technical scheme adopted by the invention has the following technical effects:
The invention designs a speech emotion recognition method based on an attention mechanism and multi-task learning, namely the LSTM_att-MTL model. The model addresses the reduced recognition performance caused by the high computational complexity and poor training behavior of traditional feature extraction methods. First, to extract higher-level features, the original speech at each moment is preprocessed into a grey-scale spectrogram, which serves as the input of a two-layer CNN; the CNN layers then perform feature extraction on the speech. Second, because the obtained speech features are correlated along the time axis, the LSTM_att layers, which act as the parameter-sharing layers, perform temporal modeling on the grey spectrogram features learned by the CNN. Finally, because the memory of LSTM_att gradually fades as the speech grows longer, the influence of the starting time steps on the current moment becomes weaker and weaker, so an attention layer is added in the multi-task module. Simulation experiments on the CASIA data set show that the model achieves good speech emotion recognition performance.
Drawings
FIG. 1 is an overall framework diagram of a speech emotion recognition method based on attention mechanism and multi-task learning according to the present invention;
FIG. 2 is a schematic diagram of an LSTM structure with an attention gate in accordance with the present invention;
FIG. 3 is a schematic diagram of the values of the weight α for LSTM_att-MTL and LSTM-MTL;
FIG. 4 compares the experimental results for different numbers of LSTM_att layers.
Detailed Description
The technical scheme of the invention is further explained in detail below with reference to the attached drawings:
A speech emotion recognition method based on an attention mechanism and multi-task learning, as shown in FIG. 1 and FIG. 2, is used to address the high computational complexity, poor training behavior and reduced recognition performance of traditional feature extraction methods. An attention gate is added inside the LSTM to reduce the number of training parameters, and together with multi-task learning this forms the LSTM_att-MTL speech emotion recognition model, which improves the recognition performance of the algorithm. The method comprises the following steps:
Step 1: acquire a speech emotion data set: acquire the CASIA Chinese emotion data set for speech emotion recognition;
Step 2: construct the LSTM_att-MTL speech emotion recognition model: the model consists of a feature extraction module, a sequence modeling module and a multi-task learning module, and the speech emotion data from step 1 is input into the model for collaborative training;
Step 3: calculate the loss function of the model: obtain the recognition result through the softmax classifier in step 2, and calculate the loss function between the recognition result and the training-set labels so as to adjust the network;
Step 4: train the overall model and obtain the recognition result: input the test-set speech emotion data into the network trained in step 3 to recognize the test-set speech emotion data.
The step 1 of acquiring the speech emotion data set comprises the following steps:
Step 11: the speech emotion data set is the CASIA Chinese emotion corpus developed by the Institute of Automation of the Chinese Academy of Sciences. In this corpus, four actors (two male and two female) each perform 500 sentences of text under six emotions (anger, fear, happy, neutral, sad and surprise) in a clean recording environment, with a 16 kHz sampling rate, 16-bit quantization and a signal-to-noise ratio of about 35 dB; the sentences are stored in pcm format, and 9600 sentences are screened out;
Step 12: the data set is divided into a training set, a verification set and a test set in the ratio 6:2:2.
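As an illustration only, the following is a minimal sketch of the 6:2:2 split in step 12, assuming the 9600 utterances are available as a list of (path, emotion_label, gender_label) tuples; the function name and the fixed random seed are illustrative and not specified in the text.

```python
import random

def split_casia(samples, seed=42):
    """Shuffle and split (path, emotion_label, gender_label) tuples 6:2:2."""
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test
```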
The step 2 of constructing the LSTM_att-MTL speech emotion recognition model comprises the following steps:
Step 21: first, the original speech is preprocessed as follows: the speech is divided into frames and windowed, with a frame length of 25 ms and a frame shift of 10 ms; a short-time Fourier transform is applied to obtain the spectrogram of the speech signal; the spectrogram is max-min normalized and quantized into a grey-scale map; the process then moves on to step 22;
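A minimal Python sketch of this preprocessing, using librosa, is given below; the FFT size, window type and log-magnitude scaling are assumptions that the text does not specify.

```python
import numpy as np
import librosa

def speech_to_grey_spectrogram(path, sr=16000):
    """Frame + window + STFT + max-min normalization into an 8-bit grey-scale map."""
    y, _ = librosa.load(path, sr=sr)
    win_length = int(0.025 * sr)   # 25 ms frame length
    hop_length = int(0.010 * sr)   # 10 ms frame shift
    # Assumption: FFT size and Hamming window are not given in the patent text.
    spec = np.abs(librosa.stft(y, n_fft=512, hop_length=hop_length,
                               win_length=win_length, window="hamming"))
    spec = librosa.amplitude_to_db(spec)                 # assumption: log magnitude
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)
    return (spec * 255).astype(np.uint8)                 # quantize to a grey-level map
```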
Step 22: the spectrogram is input into the CNN network, the convolutional layers learn the speech features from the spectrogram through convolution calculations, and the ReLU function is selected as the activation function; the process then moves on to step 23. The specific parameters are as follows:
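Since the exact layer parameters are not specified in the text above, the channel counts, kernel sizes and pooling in the sketch below are assumptions; it only illustrates the two convolutional layers with ReLU described in step 22, written in PyTorch.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Two convolutional layers with ReLU, as described in step 22.

    The channel counts, kernel sizes and pooling are assumptions; the patent
    text does not reproduce the exact parameter table.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):                  # x: (batch, 1, freq, time)
        f = self.features(x)               # (batch, 32, freq/4, time/4)
        # Flatten channel and frequency axes so each time step becomes a feature vector
        b, c, fr, t = f.shape
        return f.permute(0, 3, 1, 2).reshape(b, t, c * fr)
```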
Step 23: referring to FIG. 2, two recurrent-network layers are used as the shared layers, with LSTM_att as the basic unit and 128 hidden units per layer. To prevent overfitting, dropout with a rate of 0.5 is introduced during training. The first layer passes its full output sequence to the next layer, and the second layer outputs the result of the last time step. The output of the LSTM_att attention gate is:
$att_t = \sigma(V_{att} \times \tanh(W_{att} \times c_{t-1})) \qquad (1)$
where $V_{att}$ and $W_{att}$ are parameters to be trained, learned from the training data; $c_{t-1}$ is the cell state at the previous moment; $\sigma(\cdot)$ and $\tanh(\cdot)$ are the logistic sigmoid and hyperbolic tangent activation functions, respectively. The other gate units output:
$i_t = \sigma(W_i \times [c_{t-1}, h_{t-1}, x_t] + b_i) \qquad (2)$
$o_t = \sigma(W_o \times [h_{t-1}, x_t] + b_o) \qquad (5)$
$h_t = o_t \cdot \tanh(c_t) \qquad (6)$
where $i_t$ is the input gate at time $t$, $W_i$ and $b_i$ are the weight matrix and bias term of the input gate unit, $c_{t-1}$ and $h_{t-1}$ are the cell state and hidden-layer output at the previous moment, and $x_t$ is the input at the current moment; $\tilde{c}_t$ is the candidate value for updating the cell state at time $t$, with $W_c$ and $b_c$ the weight matrix and bias term used in the state update; $c_t$ is the cell state at time $t$, and $\cdot$ denotes the Hadamard product; $o_t$ is the output gate at time $t$, with $W_o$ and $b_o$ the weight matrix and bias term of the output gate unit; $h_t$ is the hidden-layer output at time $t$;
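For illustration, a minimal PyTorch sketch of one LSTM_att step is given below. Equations (3) and (4) are not reproduced in this text, so the candidate state and the cell update follow the standard LSTM form, with the attention gate assumed to take the place of the forget gate; treat those two lines as assumptions rather than the patented formulation.

```python
import torch
import torch.nn as nn

class LSTMAttCell(nn.Module):
    """One step of an LSTM cell with an attention gate (equations (1), (2), (5), (6))."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_att = nn.Linear(hidden_size, hidden_size, bias=False)
        self.V_att = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_i = nn.Linear(hidden_size + hidden_size + input_size, hidden_size)
        self.W_c = nn.Linear(hidden_size + input_size, hidden_size)
        self.W_o = nn.Linear(hidden_size + input_size, hidden_size)

    def forward(self, x_t, h_prev, c_prev):
        att_t = torch.sigmoid(self.V_att(torch.tanh(self.W_att(c_prev))))    # eq (1)
        i_t = torch.sigmoid(self.W_i(torch.cat([c_prev, h_prev, x_t], -1)))  # eq (2)
        c_tilde = torch.tanh(self.W_c(torch.cat([h_prev, x_t], -1)))         # assumed form of eq (3)
        c_t = att_t * c_prev + i_t * c_tilde                                 # assumed form of eq (4)
        o_t = torch.sigmoid(self.W_o(torch.cat([h_prev, x_t], -1)))          # eq (5)
        h_t = o_t * torch.tanh(c_t)                                          # eq (6)
        return h_t, c_t
```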
Step 24: a hard-parameter-sharing multi-task learning method is adopted, with the speaker's gender as the auxiliary task, because the difference between male and female voices affects the performance of speech-related systems, and gender-specific emotion recognition models perform better than emotion recognition models that do not distinguish gender.
The attention layer of the multi-task learning module applies attention weighting to the LSTM_att output:
$v = \sum_i \alpha_i h_i \qquad (8)$
In equation (7), $\alpha_i$ denotes the attention weight and the vector $\mu = (\theta_1, \theta_2, \ldots, \theta_T)$ is the attention parameter, where $T$ is the number of frames and $\{h_1, h_2, \ldots, h_T\}$ is the output of the last LSTM_att layer. The inner product of the attention parameter vector $\mu$ and $h_i$ is computed as the importance score of each time frame and then normalized; the normalized score is the weight of each frame containing the key information. Equation (8) takes the dot product of the obtained weights and the LSTM_att outputs, and the resulting weighted sum $v$ is used as the feature vector with globally updated weights.
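A minimal sketch of this attention pooling is shown below. Equation (7) itself is not reproduced in this text, so the softmax over the inner products $\mu^{T} h_i$ is an assumption that follows the prose description.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Attention pooling over the last LSTM_att layer's outputs (equation (8))."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(hidden_size))  # attention parameter vector

    def forward(self, h):                     # h: (batch, T, hidden_size)
        scores = h @ self.mu                  # inner product per time frame: (batch, T)
        alpha = torch.softmax(scores, dim=1)  # assumed normalization of the scores
        v = (alpha.unsqueeze(-1) * h).sum(1)  # weighted sum, eq (8): (batch, hidden_size)
        return v, alpha
```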
Computing the loss function of the model in step 3 comprises the following steps:
First, the feature vector obtained from equation (8) is input into a fully connected layer, which achieves task-specific optimization while learning the shared features, and classification through softmax gives the predicted value:
$\hat{y} = \mathrm{softmax}(W \times v + b)$
where $W$ and $b$ are the weight and bias terms of the fully connected layer, $v$ is the feature vector obtained from equation (8), and $\mathrm{softmax}(\cdot)$ is the softmax function.
Step 31: during training, the recognition results are compared with the training-set labels, the loss function is calculated, and an ADAM optimizer is used to optimize it. The cross-entropy function is used as the loss function of each task:
$L_{task} = -\sum_{i=1}^{N_{task}} y_i \log \hat{y}_i$
where $L_{task}$ is the loss function of each task, with task denoting the task category, emotion recognition (emotion) or gender classification (gender); $y_i$ and $\hat{y}_i$ are the true value of the training-set label and the predicted value output by the model, respectively, and $N_{task}$ is the total number of categories of the task.
Step 32: the final overall loss function is:
$L_{total} = \alpha L_{emo} + (1-\alpha) L_{gen} \qquad (10)$
where $L_{emo}$ and $L_{gen}$ are the loss functions of speech emotion recognition and gender recognition, respectively, and $\alpha$ is the weight coefficient of the emotion recognition loss. Back-propagation and gradient updates are applied to the whole network through $L_{total}$.
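As a sketch only, the joint objective of equation (10) and one ADAM update might look as follows in PyTorch; the two linear heads and variable names are illustrative, the feature dimension of 128 and alpha = 0.8 are taken from the surrounding text, and nn.CrossEntropyLoss applies the softmax internally, matching the softmax classification described above.

```python
import torch
import torch.nn as nn

emo_head = nn.Linear(128, 6)   # six emotions (illustrative head)
gen_head = nn.Linear(128, 2)   # two genders (illustrative auxiliary head)
params = list(emo_head.parameters()) + list(gen_head.parameters())
optimizer = torch.optim.Adam(params)
ce = nn.CrossEntropyLoss()     # cross-entropy with built-in softmax
alpha = 0.8                    # weight chosen on the verification set (see FIG. 3)

def train_step(v, emo_labels, gen_labels):
    """One optimization step on L_total = alpha * L_emo + (1 - alpha) * L_gen."""
    loss = alpha * ce(emo_head(v), emo_labels) + (1 - alpha) * ce(gen_head(v), gen_labels)
    optimizer.zero_grad()
    loss.backward()            # in the full model this also reaches the shared layers
    optimizer.step()
    return loss.item()
```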
The recognition of the test-set speech emotion data in step 4 comprises the following steps:
Step 41: to demonstrate the ability of LSTM_att to process time-series data, an LSTM-MTL comparison experiment is designed with the same numbers of LSTM layers and nodes as the proposed model, recognizing both emotion and gender. The values of the weight coefficients of the two task loss functions are then determined; since speech emotion recognition is the main task, its weight is set larger than that of the gender recognition task, and the initial value of α is set to 0.9 in the experiments. The weight giving the highest recognition accuracy is found by adjusting the weight and testing on the verification set. As shown in FIG. 3, the weight α takes the values 0.8 and 0.6 for LSTM_att-MTL and LSTM-MTL, respectively.
Step 42: in order to reflect the performance improvement of multi-task learning compared with single-task learning, a single-task comparison experiment LSTM _ att-STL is designed, and the emotional tasks are identified on the premise of keeping other parameters the same.
Tables 1 and 2 show the recognition results of each task and of each emotion on the CASIA data set, respectively. Table 1 shows that LSTM_att-MTL gives a clear improvement in emotion recognition; because the CASIA data set contains few speakers and the gender recognition rate is very high, gender recognition as an auxiliary task has little influence on emotion recognition. Table 2 shows that the recognition accuracy of the proposed method reaches 93.1%, and the recognition accuracy of every emotion is higher than in the comparison experiments, except that the accuracies for the neutral and surprise emotions are lower than those of LSTM-MTL. The confusion matrix of the LSTM_att-MTL recognition results is shown in Table 3, where the horizontal axis is the predicted speech emotion and the vertical axis is the corresponding true category; the table shows that some emotions have lower recognition accuracy and are easily misrecognized as neutral or surprise, while the neutral emotion is recognized accurately, reaching 96.9%.
TABLE 1 Recognition results of each task in the comparative experiments
TABLE 2 Emotion recognition results of each comparative experiment
TABLE 3 Emotion recognition results of each item in LSTM_att-MTL
Step 43: in order to select the most suitable LSTM _ att layer number in the LSTM _ att-MTL, three experiments are simultaneously set, wherein the LSTM layer number is 2, 3 and 4 respectively. The experimental result is shown in fig. 4, and the increase of the number of LSTM layers improves the efficiency of time domain feature extraction, as shown in table 5, the recognition accuracy of 2 layers of LSTM _ att is 93.1%, the recognition accuracy of 3 layers of LSTM _ att is 92.8%, and the recognition accuracy of 4 layers of LSTM _ att is 92.1%, so 2 layers of LSTM _ att are selected for the experiment.
The above description is only an embodiment of the present invention and is not intended to limit the scope of the invention; all equivalent structural or process transformations made using the contents of the specification and the drawings, or applied directly or indirectly in related technical fields, are likewise included within the scope of protection of the present invention.
Claims (5)
1. A speech emotion recognition method based on an attention mechanism and multi-task learning, characterized by comprising the following steps:
step 1: acquiring a speech emotion data set: acquiring the CASIA Chinese emotion data set for speech emotion recognition;
step 2: constructing the LSTM_att-MTL speech emotion recognition model: the model consists of a feature extraction module, a sequence modeling module and a multi-task learning module, and the speech emotion data from step 1 is input into the model for collaborative training;
step 3: calculating the loss function of the model: obtaining the recognition result through the softmax classifier in step 2, and calculating the loss function between the recognition result and the training-set labels so as to adjust the network;
step 4: training the model to obtain the recognition result: inputting the test-set speech emotion data into the network trained in step 3 to recognize the test-set speech emotion data.
2. The speech emotion recognition method based on attention mechanism and multi-task learning as claimed in claim 1, wherein the step 1 of acquiring the speech emotion data set comprises the following steps:
step 11: the speech emotion data set is the CASIA Chinese emotion corpus developed by the Institute of Automation of the Chinese Academy of Sciences; in this corpus, four actors (two male and two female) each perform 500 sentences of text under six emotions (anger, fear, happy, neutral, sad and surprise) in a clean recording environment, with a 16 kHz sampling rate, 16-bit quantization and a signal-to-noise ratio of about 35 dB; the sentences are stored in pcm format, and 9600 sentences are finally screened out;
step 12: the data set is divided into a training set, a verification set and a test set in the ratio 6:2:2.
3. The speech emotion recognition method based on attention mechanism and multi-task learning as claimed in claim 1, wherein the step 2 of constructing the LSTM_att-MTL speech emotion recognition model comprises the following steps:
step 21: first, preprocessing the original speech as follows: dividing the speech into frames and windowing, with a frame length of 25 ms and a frame shift of 10 ms; performing a short-time Fourier transform to obtain the spectrogram of the speech signal; performing max-min normalization on the spectrogram and quantizing it into a grey-scale map, then proceeding to step 22;
step 22: inputting the spectrogram into the CNN network, the convolutional layers learning the speech features from the spectrogram through convolution calculations, selecting the ReLU function as the activation function, and proceeding to step 23;
step 23: two recurrent-network layers are used as the shared layers, with LSTM_att as the basic unit and 128 hidden units per layer; to prevent overfitting, dropout with a rate of 0.5 is introduced during training; the first layer passes its full output sequence to the next layer, and the second layer outputs the result of the last time step; the output $att_t$ of the LSTM_att attention gate is:
$att_t = \sigma(V_{att} \times \tanh(W_{att} \times c_{t-1})) \qquad (1)$
where $V_{att}$ and $W_{att}$ are parameters to be trained, learned from the training data; $c_{t-1}$ is the cell state at the previous moment; $\sigma(\cdot)$ and $\tanh(\cdot)$ are the logistic sigmoid and hyperbolic tangent activation functions, respectively; the other gate units output:
$i_t = \sigma(W_i \times [c_{t-1}, h_{t-1}, x_t] + b_i) \qquad (2)$
$o_t = \sigma(W_o \times [h_{t-1}, x_t] + b_o) \qquad (5)$
$h_t = o_t \cdot \tanh(c_t) \qquad (6)$
where $i_t$ is the input gate at time $t$, $W_i$ and $b_i$ are the weight matrix and bias term of the input gate unit, $c_{t-1}$ and $h_{t-1}$ are the cell state and hidden-layer output at the previous moment, and $x_t$ is the input at the current moment; $\tilde{c}_t$ is the candidate value for updating the cell state at time $t$, with $W_c$ and $b_c$ the weight matrix and bias term used in the state update; $c_t$ is the cell state at time $t$, and $\cdot$ denotes the Hadamard product; $o_t$ is the output gate at time $t$, with $W_o$ and $b_o$ the weight matrix and bias term of the output gate unit; $h_t$ is the hidden-layer output at time $t$;
step 24: a hard-parameter-sharing multi-task learning method is adopted, with the speaker's gender as the auxiliary task, because the difference between male and female voices affects the performance of speech-related systems, and gender-specific emotion recognition models perform better than emotion recognition models that do not distinguish gender;
the attention layer in the multi-task learning module applies attention weighting to the LSTM_att output:
$v = \sum_i \alpha_i h_i \qquad (8)$
in equation (7), $\alpha_i$ denotes the attention weight and the vector $\mu = (\theta_1, \theta_2, \ldots, \theta_T)$ is the attention parameter, where $T$ is the number of frames and $\{h_1, h_2, \ldots, h_T\}$ is the output of the last LSTM_att layer; the inner product of the attention parameter vector $\mu$ and $h_i$ is computed as the importance score of each time frame and then normalized, the normalized score being the weight of each frame containing the key information; equation (8) takes the dot product of the obtained weights $\alpha_i$ and the LSTM_att outputs $h_i$, and the resulting weighted sum $v$ is used as the feature vector with globally updated weights.
4. The speech emotion recognition method based on attention mechanism and multi-task learning as claimed in claim 1, wherein computing the loss function of the model in step 3 comprises the following steps:
first, the feature vector obtained from equation (8) is input into a fully connected layer, which achieves task-specific optimization while learning the shared features, and classification through softmax gives the predicted value:
$\hat{y} = \mathrm{softmax}(W \times v + b)$
where $W$ and $b$ are the weight and bias terms of the fully connected layer, $v$ is the feature vector obtained from equation (8), and $\mathrm{softmax}(\cdot)$ is the softmax function;
step 31: during training, the recognition results are compared with the training-set labels, the loss function is calculated, and an ADAM optimizer is used to optimize it; the cross-entropy function is used as the loss function of each task:
$L_{task} = -\sum_{i=1}^{N_{task}} y_i \log \hat{y}_i$
where $L_{task}$ is the loss function of each task, with task denoting the task category, emotion recognition (emotion) or gender classification (gender); $y_i$ and $\hat{y}_i$ are the true value of the training-set label and the predicted value output by the model, respectively, and $N_{task}$ is the total number of categories of the task;
step 32: the final overall loss function is:
$L_{total} = \alpha L_{emo} + (1-\alpha) L_{gen} \qquad (10)$
where $L_{emo}$ and $L_{gen}$ are the loss functions of speech emotion recognition and gender recognition, respectively, and $\alpha$ is the weight coefficient of the emotion recognition loss; back-propagation and gradient updates are applied to the whole network through $L_{total}$.
5. The method for recognizing speech emotion based on attention mechanism and multi-task learning as claimed in claim 1, characterized in that the recognition of the test-set speech emotion data in step 4 comprises the following steps:
step 41: to demonstrate the ability of LSTM_att to process time-series data, an LSTM-MTL comparison experiment is designed with the same numbers of LSTM layers and nodes as the proposed model, recognizing both emotion and gender; the values of the weight coefficients of the two task loss functions are determined, and since speech emotion recognition is the main task its weight is set larger than that of the gender recognition task, with the initial value of α set to 0.9 in the experiments; the weight giving the highest recognition accuracy is found by adjusting the weight and testing on the verification set; as shown in FIG. 3, the weight α takes the values 0.8 and 0.6 for LSTM_att-MTL and LSTM-MTL, respectively;
step 42: to show the performance improvement of multi-task learning over single-task learning, a single-task comparison experiment, LSTM_att-STL, is designed, which recognizes only the emotion task while keeping the other parameters the same;
step 43: to select the most suitable number of LSTM_att layers in LSTM_att-MTL, three experiments are set up on the verification set with 2, 3 and 4 LSTM layers, respectively; the experimental results are shown in FIG. 4: although increasing the number of LSTM layers can improve the efficiency of temporal feature extraction, Table 5 shows that the recognition accuracy is 93.1% with 2 LSTM_att layers, 92.8% with 3 layers and 92.1% with 4 layers, so 2 LSTM_att layers are selected for the experiments.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210546156.4A | 2022-05-19 | 2022-05-19 | Voice emotion recognition method based on attention mechanism and multi-task learning |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210546156.4A | 2022-05-19 | 2022-05-19 | Voice emotion recognition method based on attention mechanism and multi-task learning |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114927144A (en) | 2022-08-19 |
Family ID: 82807755

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210546156.4A | Voice emotion recognition method based on attention mechanism and multi-task learning (CN114927144A, pending) | 2022-05-19 | 2022-05-19 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114927144A (en) |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116030526A (en) | 2023-02-27 | 2023-04-28 | 华南农业大学 | Emotion recognition method, system and storage medium based on multitask deep learning |
| CN116030526B (en) | 2023-02-27 | 2023-08-15 | 华南农业大学 | Emotion recognition method, system and storage medium based on multitask deep learning |
Legal Events

- PB01: Publication
- SE01: Entry into force of request for substantive examination