CN114927144A - Voice emotion recognition method based on attention mechanism and multi-task learning - Google Patents

Voice emotion recognition method based on attention mechanism and multi-task learning

Info

Publication number
CN114927144A
CN114927144A (application number CN202210546156.4A)
Authority
CN
China
Prior art keywords
lstm
att
recognition
task
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210546156.4A
Other languages
Chinese (zh)
Inventor
何震宇
刘斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University filed Critical Nanjing Tech University
Priority to CN202210546156.4A
Publication of CN114927144A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - characterised by the type of extracted parameters
    • G10L25/18 - the extracted parameters being spectral information of each sub-band
    • G10L25/27 - characterised by the analysis technique
    • G10L25/30 - using neural networks
    • G10L25/45 - characterised by the type of analysis window
    • G10L25/48 - specially adapted for particular use
    • G10L25/51 - for comparison or discrimination
    • G10L25/63 - for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition method based on an attention mechanism and multi-task learning, comprising the following steps. Step 1: acquire the CASIA Chinese emotion data set for speech emotion recognition. Step 2: construct the LSTM_att-MTL speech emotion recognition model, which consists of a feature extraction module, a sequence modeling module, and a multi-task learning module, and input the speech emotion data from step 1 into the model for collaborative training. Step 3: obtain the recognition result through the softmax classifier of step 2, and compute the loss function between the recognition result and the training-set labels so as to adjust the loss. Step 4: input the test-set speech emotion data into the network trained in step 3 to recognize the test-set speech emotion data. The invention constructs an LSTM_att-MTL model that addresses the reduced recognition performance caused by the high computational complexity and poor training behavior of traditional feature extraction methods.

Description

Voice emotion recognition method based on attention mechanism and multi-task learning
Technical Field
The invention belongs to the field of deep learning, and particularly relates to a speech emotion recognition method based on an attention mechanism and multi-task learning.
Background
Speech emotion recognition is an advanced computer technology involving multiple disciplines; its purpose is to extract emotional feature parameters from a speech signal and to recognize the speech emotion from those parameters. Speech emotion recognition has grown from a small research topic into a key technology of human-computer interaction and has been widely applied across industries. For example, introduced into a vehicle driving system, it can record changes in the driver's mental state and increase the interaction between people and vehicles; added to medical equipment, it allows better treatment according to a patient's emotional changes; introduced into an education system, it can detect students' emotional changes during remote teaching and thereby improve teaching quality. Early speech emotion recognition methods mainly relied on machine learning algorithms and recognition models such as the hidden Markov model (HMM) and could generally recognize only a few types of emotion. The development of deep learning techniques such as convolutional neural networks (CNN) has enabled deep-learning-based speech emotion recognition to make substantial progress; for example, methods that combine a spectrogram with a deep convolutional neural network use a network composed of convolutional layers and fully connected layers to extract features from the spectrogram and can effectively recognize a variety of emotions.
In speech emotion recognition, after the original speech is preprocessed, a feature extraction method is generally applied to extract features from it, and the emotion category to which the features belong is then recognized. One solution is to extract prosodic and voice-quality features: energy, fundamental frequency contours, and logarithmic contours are usually extracted from emotional sentences, the corresponding first-order and second-order difference curves are computed, and finally statistics such as skewness, kurtosis, and variance are calculated (a schematic sketch of this statistics pipeline is given below). The features extracted by this scheme have limited ability to distinguish emotions, and the recognition accuracy for emotions such as anger, fear, and happiness is not ideal. Extracting features from the spectrogram is one of the important techniques of present-stage speech emotion recognition: the spectrogram is normalized and converted to grey scale, higher-level features are extracted from both the time-domain and frequency-domain perspectives, and these features also serve as a visual representation of the speech signal.
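As a concrete illustration of the prosody-statistics pipeline described above, a minimal sketch is given below. It assumes the contour (an energy or fundamental frequency curve) has already been extracted, and the use of scipy for skewness and kurtosis is an implementation choice rather than something specified by the patent.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def prosody_statistics(contour: np.ndarray) -> np.ndarray:
    """Statistics over a prosodic contour (e.g., energy or F0 curve) and its differences.

    For the contour and its first- and second-order difference curves, the
    skewness, kurtosis, and variance are computed, giving 9 values in total.
    """
    curves = (contour, np.diff(contour, n=1), np.diff(contour, n=2))
    feats = []
    for curve in curves:
        feats.extend([skew(curve), kurtosis(curve), np.var(curve)])
    return np.asarray(feats)
```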
Most existing methods that extract features from the spectrogram adopt a Long Short-Term Memory (LSTM) network to handle long-term dependencies in the time sequence, but high computational complexity and insufficient training data keep the recognition rate of existing speech emotion classifiers low. This patent therefore proposes a speech emotion recognition method based on an attention mechanism and multi-task learning, which improves the recognition performance of the algorithm.
Disclosure of Invention
The technical problem addressed by the invention is to provide, in view of the shortcomings of the background technology, a speech emotion recognition method based on an attention mechanism and multi-task learning. The method constructs an LSTM_att-MTL model and solves problems such as reduced recognition performance caused by the high computational complexity and poor training behavior of traditional feature extraction methods.
The invention adopts the following technical scheme for solving the technical problems:
Compared with the prior art, the technical scheme adopted by the invention has the following technical effects:
the invention designs a speech emotion recognition method based on attention mechanism and multi-task learning, namely an LSTM _ att-MTL model. The model solves the problems that the traditional feature extraction method is high in calculation complexity and poor in training process effect, so that the recognition performance is reduced and the like. In order to extract higher-level features, the original voice at each moment is preprocessed to obtain a gray-scale representation of the voice as an input of a two-layer CNN network, and then the feature extraction of the voice is carried out at the CNN layer. Secondly, because the obtained voice features have correlation in a time sequence, performing time sequence modeling on the grey spectrogram features learned by the CNN network by adopting the LSTM _ att layer, wherein the LSTM _ att layer is a parameter sharing layer. Finally, as the memory of LSTM _ att gradually decreases with the length of the speech, the information of the start time node has weaker and weaker influence on the current moment as the speech sequence grows, so an attention layer is added in the multitask layer. The result of simulation experiments carried out on the CASIA data set shows that the model has good speech emotion recognition performance.
Drawings
FIG. 1 is an overall framework diagram of a speech emotion recognition method based on attention mechanism and multi-task learning according to the present invention;
FIG. 2 is a schematic diagram of an LSTM structure with an attention gate in accordance with the present invention;
FIG. 3 is a schematic diagram of the values of the weight α for LSTM_att-MTL and LSTM-MTL;
FIG. 4 is a comparative diagram of the experimental results for different numbers of LSTM_att layers.
Detailed Description
The technical scheme of the invention is further explained in detail below in combination with the attached drawings:
a speech emotion recognition method based on attention mechanism and multi-task learning is disclosed, as shown in fig. 1 and 2, and is used for solving the problems of high computational complexity, poor training process effect, reduced recognition performance and the like of the traditional feature extraction method, the aim of reducing training parameters is achieved by adding an attention gate in an LSTM, an LSTM _ att-MTL speech emotion recognition model is constructed together with the multi-task learning, and the recognition performance of an algorithm is improved, and the method comprises the following steps:
Step 1: acquiring the speech emotion data set: the CASIA Chinese emotion data set for speech emotion recognition is acquired;
Step 2: constructing the LSTM_att-MTL speech emotion recognition model: the model consists of a feature extraction module, a sequence modeling module, and a multi-task learning module, and the speech emotion data from step 1 are input into the recognition model for collaborative training;
Step 3: calculating the loss function of the model: the recognition result is obtained through the softmax classifier of step 2, and the loss function between the recognition result and the training-set labels is calculated so as to adjust the loss;
Step 4: training the overall model and obtaining the recognition result: the test-set speech emotion data are input into the network trained in step 3 to recognize the test-set speech emotion data.
The step 1 of acquiring the speech emotion data set comprises the following steps:
Step 11: first, the speech emotion data set is the CASIA Chinese emotion corpus developed by the Institute of Automation of the Chinese Academy of Sciences. The data set contains recordings of two pairs of actors, each performing 500 sentences of text under six emotions (anger, fear, happiness, neutral, sadness, surprise) in a clean recording environment, with a 16 kHz sampling rate, 16-bit quantization, and a signal-to-noise ratio of about 35 dB; the utterances are stored in pcm format, and 9600 sentences are finally screened out;
Step 12: second, the data set is divided into a training set, a validation set, and a test set in a 6:2:2 ratio.
Constructing the LSTM_att-MTL speech emotion recognition model in step 2 comprises the following steps:
Step 21: first, the original speech is preprocessed, which includes the following operations: framing and windowing the speech with a frame length of 25 ms and a frame shift of 10 ms; applying the short-time Fourier transform to obtain the spectrogram of the speech signal; performing max-min normalization on the spectrogram and quantizing it into a grey-scale map (a minimal sketch of this preprocessing is given below); then proceed to step 22;
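A minimal sketch of this preprocessing, assuming librosa for loading and the STFT, is shown below; at a 16 kHz sampling rate the 25 ms frame length and 10 ms frame shift correspond to 400 and 160 samples. The FFT size and window type are illustrative assumptions, since the patent does not specify them.

```python
import numpy as np
import librosa

def speech_to_gray_spectrogram(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Frame/window the speech, take the STFT, and quantize to a grey-scale map (step 21)."""
    y, sr = librosa.load(wav_path, sr=sr)
    win_length = int(0.025 * sr)   # 25 ms frame length -> 400 samples
    hop_length = int(0.010 * sr)   # 10 ms frame shift  -> 160 samples
    stft = librosa.stft(y, n_fft=512, win_length=win_length,
                        hop_length=hop_length, window="hamming")
    spec = np.abs(stft)                                            # magnitude spectrogram
    spec = (spec - spec.min()) / (spec.max() - spec.min() + 1e-8)  # max-min normalization
    gray = (spec * 255).astype(np.uint8)                           # grey-scale quantization
    return gray                                                    # shape: (freq_bins, frames)
```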
Step 22: the spectrogram is input into the CNN network; the convolutional layers learn speech features from the spectrogram through convolution calculations, with the ReLU function as the activation function; then proceed to step 23. The specific parameters are as follows:
[The table of CNN layer parameters is given as an image (Figure BDA0003652206470000041) in the original document.]
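Since the convolution parameters appear only in the image table above, the sketch below shows one plausible two-layer CNN front end with ReLU activations; the channel counts, kernel sizes, pooling, and output dimension are illustrative assumptions rather than the parameters claimed by the patent.

```python
import torch
import torch.nn as nn

class CNNFeatureExtractor(nn.Module):
    """Two convolutional layers with ReLU, as described in step 22.

    Channel counts, kernel sizes, and pooling are illustrative assumptions;
    the patent gives the exact parameters only in an image table.
    """
    def __init__(self, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),          # halve frequency and time
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(2, 1)),     # pool frequency only, keep temporal resolution
        )
        self.proj = nn.LazyLinear(out_dim)        # project each frame to a feature vector

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, freq, time) grey-scale spectrogram
        h = self.conv(x)                          # (batch, C, F', T')
        h = h.permute(0, 3, 1, 2).flatten(2)      # (batch, T', C * F') frame-wise features
        return self.proj(h)                       # (batch, T', out_dim), fed to LSTM_att
```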
Step 23: referring to FIG. 2, two recurrent layers are used as the shared layer, with LSTM_att as the basic unit and 128 hidden units. To prevent overfitting, dropout with a parameter of 0.5 is introduced during training. The first layer outputs the full time sequence to the next layer, and the second layer outputs the result of the last time step. The output of the LSTM_att attention gate is:
att_t = σ(V_att × tanh(W_att × c_{t-1}))    (1)
where V_att and W_att are parameters to be trained, learned from the training data; c_{t-1} is the cell state at the previous moment; σ(·) and tanh(·) are the logistic sigmoid and hyperbolic tangent activation functions, respectively. The outputs of the other gate units are:
i_t = σ(W_i × [c_{t-1}, h_{t-1}, x_t] + b_i)    (2)
c̃_t = tanh(W_c × [h_{t-1}, x_t] + b_c)    (3)
c_t = att_t · c_{t-1} + i_t · c̃_t    (4)
o_t = σ(W_o × [h_{t-1}, x_t] + b_o)    (5)
h_t = o_t · tanh(c_t)    (6)
where i_t denotes the input gate at time t, W_i and b_i denote the weight matrix and bias term of the input gate unit, c_{t-1} and h_{t-1} are respectively the cell state and the hidden-layer output at the previous moment, and x_t denotes the input at the current moment; c̃_t denotes the candidate value for updating the cell state at time t, and W_c and b_c denote the weight matrix and bias term used when updating the state; c_t denotes the cell state at time t, and · denotes the Hadamard product; o_t denotes the output gate at time t, and W_o and b_o denote the weight matrix and bias term of the output gate unit; h_t denotes the hidden-layer output at time t;
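The following is a minimal PyTorch sketch of a single LSTM_att cell following equations (1)-(6). Because equations (3) and (4) appear only as images in the original, the candidate-state formula and the cell-state update below, in which the attention gate att_t takes the place of the standard forget gate, are reconstructed from the surrounding definitions and should be read as assumptions.

```python
import torch
import torch.nn as nn

class LSTMAttCell(nn.Module):
    """LSTM cell with an attention gate computed from the previous cell state (eqs. 1-6).

    The attention gate att_t depends only on c_{t-1}, which is what reduces the number
    of trainable parameters relative to a standard forget gate. The exact shapes of
    V_att and W_att are not given in the text; a per-dimension gate is assumed here.
    """
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.hidden_size = hidden_size
        self.W_att = nn.Linear(hidden_size, hidden_size, bias=False)   # eq. (1)
        self.V_att = nn.Linear(hidden_size, hidden_size, bias=False)   # eq. (1)
        self.W_i = nn.Linear(2 * hidden_size + input_size, hidden_size)  # eq. (2)
        self.W_c = nn.Linear(hidden_size + input_size, hidden_size)      # eq. (3)
        self.W_o = nn.Linear(hidden_size + input_size, hidden_size)      # eq. (5)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        att_t = torch.sigmoid(self.V_att(torch.tanh(self.W_att(c_prev))))        # (1)
        i_t = torch.sigmoid(self.W_i(torch.cat([c_prev, h_prev, x_t], dim=-1)))  # (2)
        c_hat = torch.tanh(self.W_c(torch.cat([h_prev, x_t], dim=-1)))           # (3)
        c_t = att_t * c_prev + i_t * c_hat   # (4), assumed form of the cell-state update
        o_t = torch.sigmoid(self.W_o(torch.cat([h_prev, x_t], dim=-1)))          # (5)
        h_t = o_t * torch.tanh(c_t)                                              # (6)
        return h_t, c_t
```

Stacking two such cells with dropout 0.5 between them, as in step 23, gives the shared sequence-modeling layer; the first layer returns outputs for all time steps and the second layer's outputs feed the attention layer described next.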
Step 24: a multi-task learning method with hard parameter sharing is adopted, and speaker gender is used as the auxiliary task, because differences between male and female voices affect the performance of speech-related systems and gender-specific emotion recognition models outperform emotion recognition models that do not distinguish gender.
The attention layer of the multi-task learning module applies attention weighting to the LSTM_att output:
α_i = exp(μ^T · h_i) / Σ_j exp(μ^T · h_j)    (7)
ν = Σ_i α_i · h_i    (8)
In equation (7), α_i denotes the attention weight, the vector μ is the attention parameter with μ = (θ_1, θ_2, ..., θ_T), T is the number of frames, and {h_1, h_2, ..., h_T} is the output of the last LSTM_att layer. The inner product of the attention parameter vector μ with h_i is computed as a score of the importance of each time frame, and the scores are normalized; the normalized score is the weight indicating how much key information each frame contains. Equation (8) takes the dot product of the obtained weights and the LSTM_att outputs, and the resulting weighted sum ν is used as the feature vector with globally updated weights.
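Equations (7) and (8) amount to scoring each frame by the inner product of a learned parameter vector μ with the frame output h_i, normalizing the scores (a softmax normalization is assumed here, since equation (7) appears only as an image), and taking the weighted sum. A minimal sketch:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Frame-level attention pooling over LSTM_att outputs (eqs. 7-8)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(hidden_size))  # attention parameter vector

    def forward(self, h: torch.Tensor):
        # h: (batch, T, hidden) outputs of the last LSTM_att layer
        scores = h.matmul(self.mu)                 # inner products mu . h_i, shape (batch, T)
        alpha = torch.softmax(scores, dim=1)       # eq. (7): normalized frame weights
        v = (alpha.unsqueeze(-1) * h).sum(dim=1)   # eq. (8): weighted sum, shape (batch, hidden)
        return v, alpha
```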
Calculating the loss function of the model in step 3 comprises the following steps:
the method comprises the following steps: inputting the feature vector obtained by the formula (8) into a full connection layer, wherein the full connection layer can realize independent optimization while learning shared features, and can classify through softmax to obtain a predicted value:
Figure BDA0003652206470000052
wherein W and b represent weight and bias terms in the fully-connected layer, V represents a feature vector obtained by the formula (8), and softmax (·) is a softmax function.
Step 31: during training, the recognition results are compared with the training-set labels, the loss function is calculated, and the ADAM optimizer is used to optimize it. The cross-entropy function is adopted as the loss function of each task:
L_task = -Σ_{i=1}^{N_task} y_i · log(ŷ_i)
where L_task denotes the loss function of each task, task being the task category: emotion recognition (emotion) or gender classification (gender); y_i and ŷ_i are respectively the true value of the training-set label and the predicted value output by the model, and N_task is the total number of categories of the task.
Step 32: final overall loss function:
L_total = α · L_emo + (1 - α) · L_gen    (10)
where L_emo and L_gen are respectively the loss functions of speech emotion recognition and gender recognition, and α is the weight coefficient of the emotion recognition loss. L_total is used for back-propagation and gradient updating over the whole network.
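A minimal sketch of the task heads and the joint objective of equations (9) and (10) is given below; the six emotion classes and two gender classes follow the CASIA setup described in step 11, while the helper names are illustrative. The ADAM optimizer mentioned in step 31 would then be applied to L_total.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Two task-specific fully connected heads on the shared feature vector (eq. 9)."""
    def __init__(self, feat_dim: int, n_emotions: int = 6, n_genders: int = 2):
        super().__init__()
        self.emotion_fc = nn.Linear(feat_dim, n_emotions)
        self.gender_fc = nn.Linear(feat_dim, n_genders)

    def forward(self, v: torch.Tensor):
        # v: (batch, feat_dim) attention-pooled feature vector from eq. (8)
        return self.emotion_fc(v), self.gender_fc(v)   # logits; softmax is folded into the loss

def multitask_loss(emo_logits, gen_logits, emo_labels, gen_labels, alpha: float = 0.8):
    """Eq. (10): L_total = alpha * L_emo + (1 - alpha) * L_gen, cross-entropy per task."""
    ce = nn.CrossEntropyLoss()
    return alpha * ce(emo_logits, emo_labels) + (1 - alpha) * ce(gen_logits, gen_labels)
```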
The recognition of the test-set speech emotion data in step 4 comprises the following steps:
Step 41: in order to demonstrate the ability of LSTM_att to process time-sequence data, an LSTM-MTL comparison experiment is designed, with the same number of LSTM layers and nodes as in the model of this text, recognizing emotion and gender. The values of the weight coefficients of the two task loss functions are then determined: since speech emotion recognition is the main task, its weight is set larger than that of the gender recognition task, and the initial value of α in the experiment is set to 0.9. The weight with the highest recognition accuracy is found by adjusting the weight and testing on the validation set (a schematic sketch of this search is given below). As shown in FIG. 3, the weights α of LSTM_att-MTL and LSTM-MTL take the values 0.8 and 0.6, respectively.
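The weight search of step 41 reduces to a small grid search over α on the validation set; a schematic sketch follows, in which the candidate values and the train_fn/eval_fn helpers are hypothetical placeholders.

```python
def select_alpha(train_fn, eval_fn, candidates=(0.9, 0.8, 0.7, 0.6, 0.5)):
    """Pick the emotion-loss weight alpha with the best validation accuracy (step 41 sketch)."""
    best_alpha, best_acc = None, -1.0
    for alpha in candidates:
        model = train_fn(alpha)   # train LSTM_att-MTL with this loss weight (hypothetical helper)
        acc = eval_fn(model)      # emotion accuracy on the validation set (hypothetical helper)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc
```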
Step 42: in order to reflect the performance improvement of multi-task learning compared with single-task learning, a single-task comparison experiment LSTM _ att-STL is designed, and the emotional tasks are identified on the premise of keeping other parameters the same.
Tables 1 and 2 show the recognition results of each task and of each emotion on the CASIA data set, respectively. Table 1 shows that LSTM_att-MTL brings a clear improvement in emotion recognition; since the CASIA data set has few speakers and the gender recognition rate is already very high, gender recognition as an auxiliary task has little influence on emotion recognition. As can be seen from Table 2, the recognition accuracy of the proposed method reaches 93.1%, and its accuracy on every emotion is higher than in the comparison experiments, except that the accuracies for the neutral and surprise emotions are lower than those of LSTM-MTL. The confusion matrix of the LSTM_att-MTL recognition results is shown in Table 3, where the horizontal axis represents the predicted speech emotion and the vertical axis represents the true category. The table shows that some emotions have relatively low recognition accuracy and are easily misrecognized as neutral or surprise, while the neutral recognition result is accurate, reaching 96.9%.
Table 1. Recognition results of each task in the comparative experiments
[Table rendered as an image (Figure BDA0003652206470000061) in the original document.]
Table 2. Recognition results of each emotion in the comparative experiments
[Table rendered as images (Figure BDA0003652206470000062, Figure BDA0003652206470000071) in the original document.]
Table 3. Recognition results of each emotion for LSTM_att-MTL
[Table rendered as an image (Figure BDA0003652206470000072) in the original document.]
Step 43: in order to select the most suitable number of LSTM_att layers in LSTM_att-MTL, three experiments are set up with 2, 3, and 4 LSTM_att layers, respectively. The experimental results are shown in FIG. 4 and Table 5: increasing the number of LSTM_att layers does not improve the efficiency of temporal feature extraction, as the recognition accuracy is 93.1% with 2 layers of LSTM_att, 92.8% with 3 layers, and 92.1% with 4 layers; therefore 2 layers of LSTM_att are selected for the experiments.
The above description is only an embodiment of the present invention and is not intended to limit the scope of the present invention; all equivalent structural or process transformations made using the contents of the specification and drawings, or applied directly or indirectly in related technical fields, are likewise included within the scope of the present invention.

Claims (5)

1. A speech emotion recognition method based on an attention mechanism and multi-task learning, characterized by comprising the following steps:
Step 1: acquiring the speech emotion data set: the CASIA Chinese emotion data set for speech emotion recognition is acquired;
Step 2: constructing the LSTM_att-MTL speech emotion recognition model: the LSTM_att-MTL speech emotion recognition model consists of a feature extraction module, a sequence modeling module, and a multi-task learning module, and the speech emotion data from step 1 are input into the recognition model for collaborative training;
Step 3: calculating the loss function of the model: the recognition result is obtained through the softmax classifier of step 2, and the loss function between the recognition result and the training-set labels is calculated so as to adjust the loss;
Step 4: training the model to obtain the recognition result: the test-set speech emotion data are input into the network trained in step 3 to recognize the test-set speech emotion data.
2. The speech emotion recognition method based on an attention mechanism and multi-task learning as claimed in claim 1, wherein acquiring the speech emotion data set in step 1 comprises the following steps:
Step 11: first, the speech emotion data set is the CASIA Chinese emotion corpus developed by the Institute of Automation of the Chinese Academy of Sciences; the data set contains recordings of two pairs of actors, each performing 500 sentences of text under six emotions (anger, fear, happiness, neutral, sadness, surprise) in a clean recording environment, with a 16 kHz sampling rate, 16-bit quantization, and a signal-to-noise ratio of about 35 dB; the utterances are stored in pcm format, and 9600 sentences are finally screened out;
Step 12: second, the data set is divided into a training set, a validation set, and a test set in a 6:2:2 ratio.
3. The speech emotion recognition method based on an attention mechanism and multi-task learning as claimed in claim 1, wherein constructing the LSTM_att-MTL speech emotion recognition model in step 2 comprises the following steps:
Step 21: first, the original speech is preprocessed, which includes the following operations: framing and windowing the speech with a frame length of 25 ms and a frame shift of 10 ms; applying the short-time Fourier transform to obtain the spectrogram of the speech signal; performing max-min normalization on the spectrogram and quantizing it into a grey-scale map; then proceed to step 22;
Step 22: the spectrogram is input into the CNN network; the convolutional layers learn speech features from the spectrogram through convolution calculations, with the ReLU function as the activation function; then proceed to step 23;
Step 23: two recurrent layers are used as the shared layer, with LSTM_att as the basic unit and 128 hidden units; to prevent overfitting, dropout with a parameter of 0.5 is introduced during training; the first layer outputs the full time sequence to the next layer, and the second layer outputs the result of the last time step; the output att_t of the LSTM_att attention gate is:
att_t = σ(V_att × tanh(W_att × c_{t-1}))    (1)
where V_att and W_att are parameters to be trained, learned from the training data; c_{t-1} is the cell state at the previous moment; σ(·) and tanh(·) are respectively the logistic sigmoid and hyperbolic tangent activation functions; specifically, each gate unit outputs:
i_t = σ(W_i × [c_{t-1}, h_{t-1}, x_t] + b_i)    (2)
c̃_t = tanh(W_c × [h_{t-1}, x_t] + b_c)    (3)
c_t = att_t · c_{t-1} + i_t · c̃_t    (4)
o_t = σ(W_o × [h_{t-1}, x_t] + b_o)    (5)
h_t = o_t · tanh(c_t)    (6)
where i_t denotes the input gate at time t, W_i and b_i denote the weight matrix and bias term of the input gate unit, c_{t-1} and h_{t-1} are respectively the cell state and the hidden-layer output at the previous moment, and x_t denotes the input at the current moment; c̃_t denotes the candidate value for updating the cell state at time t, and W_c and b_c denote the weight matrix and bias term used when updating the state; c_t denotes the cell state at time t, and · denotes the Hadamard product; o_t denotes the output gate at time t, and W_o and b_o denote the weight matrix and bias term of the output gate unit; h_t denotes the hidden-layer output at time t;
Step 24: a multi-task learning method with hard parameter sharing is adopted, and speaker gender is used as the auxiliary task, because differences between male and female voices affect the performance of speech-related systems and gender-specific emotion recognition models outperform emotion recognition models that do not distinguish gender;
the attention layer in the multi-task learning module applies attention weighting to the LSTM_att output:
α_i = exp(μ^T · h_i) / Σ_j exp(μ^T · h_j)    (7)
ν = Σ_i α_i · h_i    (8)
in equation (7), α_i denotes the attention weight, the vector μ denotes the attention parameter with μ = (θ_1, θ_2, ..., θ_T), T is the number of frames, and {h_1, h_2, ..., h_T} is the output of the last LSTM_att layer; the inner product of the attention parameter vector μ with h_i is computed as a score of the importance of each time frame, and the scores are normalized, the normalized score being the weight indicating how much key information each frame contains; equation (8) takes the dot product of the obtained weights α_i and the LSTM_att outputs h_i, and the resulting weighted sum ν is used as the feature vector with globally updated weights.
4. The speech emotion recognition method based on an attention mechanism and multi-task learning as claimed in claim 1, wherein calculating the loss function of the model in step 3 comprises the following steps:
first, the feature vector obtained by equation (8) is input into a fully connected layer; the fully connected layers allow task-specific optimization while the shared features are learned, and classification by softmax yields the predicted value:
ŷ = softmax(W × ν + b)    (9)
where W and b denote the weight and bias terms of the fully connected layer, ν denotes the feature vector obtained by equation (8), and softmax(·) is the softmax function;
Step 31: during training, the recognition results are compared with the training-set labels, the loss function is calculated, and the ADAM optimizer is used to optimize it; the cross-entropy function is adopted as the loss function of each task:
L_task = -Σ_{i=1}^{N_task} y_i · log(ŷ_i)
where L_task denotes the loss function of each task, task being the task category: emotion recognition (emotion) or gender classification (gender); y_i and ŷ_i are respectively the true value of the training-set label and the predicted value output by the model, and N_task is the total number of categories of the task;
step 32: final overall loss function:
L_total = α · L_emo + (1 - α) · L_gen    (10)
where L_emo and L_gen are respectively the loss functions of speech emotion recognition and gender recognition, and α is the weight coefficient of the emotion recognition loss; L_total is used for back-propagation and gradient updating over the whole network.
5. The speech emotion recognition method based on an attention mechanism and multi-task learning as claimed in claim 1, wherein the recognition of the test-set speech emotion data in step 4 comprises the following steps:
Step 41: in order to demonstrate the ability of LSTM_att to process time-sequence data, an LSTM-MTL comparison experiment is designed, with the same number of LSTM layers and nodes as in the model of this text, recognizing emotion and gender; the values of the weight coefficients of the two task loss functions are then determined: since speech emotion recognition is the main task, its weight is set larger than that of the gender recognition task, and the initial value of α is set to 0.9 in the experiment; the weight with the highest recognition accuracy is found by adjusting the weight and testing on the validation set; as shown in FIG. 1, the weights α of LSTM_att-MTL and LSTM-MTL take the values 0.8 and 0.6, respectively;
Step 42: in order to show the performance improvement of multi-task learning over single-task learning, a single-task comparison experiment LSTM_att-STL is designed, in which only the emotion recognition task is performed while keeping the other parameters the same;
Step 43: in order to select the most suitable number of LSTM_att layers in LSTM_att-MTL, three experiments are set up on the validation set with 2, 3, and 4 LSTM_att layers, respectively; the experimental results are shown in FIG. 2 and Table 5: increasing the number of LSTM_att layers does not improve the efficiency of temporal feature extraction, as the recognition accuracy is 93.1% with 2 layers of LSTM_att, 92.8% with 3 layers, and 92.1% with 4 layers, so 2 layers of LSTM_att are selected for the experiments.
CN202210546156.4A 2022-05-19 2022-05-19 Voice emotion recognition method based on attention mechanism and multi-task learning Pending CN114927144A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210546156.4A CN114927144A (en) 2022-05-19 2022-05-19 Voice emotion recognition method based on attention mechanism and multi-task learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210546156.4A CN114927144A (en) 2022-05-19 2022-05-19 Voice emotion recognition method based on attention mechanism and multi-task learning

Publications (1)

Publication Number Publication Date
CN114927144A true CN114927144A (en) 2022-08-19

Family

ID=82807755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210546156.4A Pending CN114927144A (en) 2022-05-19 2022-05-19 Voice emotion recognition method based on attention mechanism and multi-task learning

Country Status (1)

Country Link
CN (1) CN114927144A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030526A (en) * 2023-02-27 2023-04-28 华南农业大学 Emotion recognition method, system and storage medium based on multitask deep learning
CN116030526B (en) * 2023-02-27 2023-08-15 华南农业大学 Emotion recognition method, system and storage medium based on multitask deep learning

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination