CN116741206A - Speech emotion recognition method and system based on multitask learning - Google Patents
- Publication number: CN116741206A
- Application number: CN202310954870.1A
- Authority: CN (China)
- Prior art keywords: task, recognition, model, emotion, auxiliary
- Prior art date: 2023-07-31
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
- G10L17/04—Training, enrolment or model building
- G10L17/18—Artificial neural networks; Connectionist approaches
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Hospice & Palliative Care (AREA)
- General Health & Medical Sciences (AREA)
- Psychiatry (AREA)
- Child & Adolescent Psychology (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a speech emotion recognition method and system based on multi-task learning, belonging to the field of emotion recognition. The method comprises the following steps: acquiring a pre-trained multi-task model, wherein the multi-task model comprises a pre-training model for extracting audio features of speech and a plurality of task branches comprising a main task and two auxiliary tasks; the main task performs speech emotion recognition, the first auxiliary task performs text recognition, and the second auxiliary task performs speaker recognition; the loss function used in the training stage of the plurality of task branches is a linear combination of the respective loss functions of the main task and the two auxiliary tasks; and inputting the original speech into the multi-task model to obtain the recognized speech emotion. The invention constructs a multi-task learning framework that takes speech emotion recognition as the main task and text recognition and speaker recognition as auxiliary tasks; the tasks can mutually promote one another during training within the multi-task framework, improving recognition accuracy.
Description
Technical Field
The invention belongs to the field of emotion recognition, and in particular relates to a speech emotion recognition method and system based on multitask learning.
Background
Speech emotion recognition is of great research significance in the field of human-computer interaction, and researchers at home and abroad have proposed many emotion recognition methods, including traditional systems that extract basic speech features for classification as well as end-to-end systems based on deep learning.
A traditional speech emotion recognition system mainly consists of a feature extraction component, a feature selection component and a classifier component. Such systems mostly use basic speech features, such as spectral features, prosodic features and voice quality features, as the features processed by the system. Developing such a system requires a deep understanding of, and sufficient domain knowledge about, speech; these systems are therefore generally complex, require careful design, and in most cases achieve only mediocre recognition performance.
End-to-end speech emotion recognition systems based on deep learning can extract deeper features, and their performance is generally better than that of traditional systems. However, most of the prior art trains on only a single specific task, and the features extracted from speech may therefore ignore emotion-related information, such as text information and speaker information, which can carry part of the emotional content; a system trained on a single task fails to capture this part of the emotion information.
In summary, the accuracy of existing speech emotion recognition is limited, and further improvement is required to meet user requirements.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a speech emotion recognition method and system based on multi-task learning, so as to solve the problem that the accuracy of existing speech emotion recognition is limited.
In order to achieve the above object, in a first aspect, the present invention provides a speech emotion recognition method based on multitask learning, including the steps of:
acquiring a pre-trained multi-task model; the multitasking model comprises: a pre-training model for extracting audio features of speech, and a plurality of task branches comprising: the system comprises a main task and two auxiliary tasks, wherein the main task is used for carrying out voice emotion recognition based on audio features, the first auxiliary task is used for carrying out text recognition based on the audio features, the second auxiliary task is used for carrying out speaker recognition based on the audio features, and loss functions used in the multiple task branch training stages are obtained by linearly combining the loss functions of the main task and the two auxiliary tasks so as to correlate the auxiliary tasks with the main task and improve the recognition effect of the main task;
and inputting the original voice into the multitasking model to obtain the recognized voice emotion.
In one possible implementation, the loss function L used by the multitasking model training process is:
L = L_CE1 + α·L_CTC + β·L_CE2
wherein L_CE1 is the loss function of the emotion recognition task, L_CTC is the loss function of the text recognition task, and L_CE2 is the loss function of the speaker recognition task; α and β are two hyper-parameters.
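For illustration, a minimal sketch of this linear combination in Python/PyTorch follows; the function and argument names are illustrative assumptions, and the default weights are the values reported as optimal in the experiments later in the description.

```python
import torch

def combined_loss(l_ce1: torch.Tensor, l_ctc: torch.Tensor, l_ce2: torch.Tensor,
                  alpha: float = 0.1, beta: float = 0.15) -> torch.Tensor:
    """Linear combination of the main-task loss and the two auxiliary-task losses.

    l_ce1: cross-entropy loss of the emotion recognition main task
    l_ctc: CTC loss of the text recognition auxiliary task
    l_ce2: cross-entropy loss of the speaker recognition auxiliary task
    alpha, beta: hyper-parameters weighting the two auxiliary tasks
    """
    return l_ce1 + alpha * l_ctc + beta * l_ce2
```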
In one possible implementation, the architecture of the main task includes: a first pooling layer, a first fully-connected layer, and a first activation function. When the main task is executed, the first pooling layer converts the vector sequence corresponding to the audio features into a single vector, the first fully-connected layer obtains predicted values corresponding to different emotion categories based on the single vector, the first activation function determines a probability distribution over the different emotion categories based on the plurality of predicted values, and the emotion category with the largest probability is output as the recognized speech emotion.
In one possible implementation, the architecture of the first auxiliary task includes: a second fully-connected layer and a second activation function. When the first auxiliary task is executed, the second fully-connected layer obtains logit predicted values of different characters based on the feature vectors corresponding to the audio features, the second activation function converts the logit predicted values of the different characters into a probability distribution over the different characters, and the character with the highest probability is output as the recognized text.
In one possible implementation, the architecture of the second auxiliary task includes: a second pooling layer, a third fully-connected layer, and a third activation function. When the second auxiliary task is executed, the second pooling layer converts the vector sequence corresponding to the audio features into a single vector, the third fully-connected layer obtains logit predicted values corresponding to different speakers based on the single vector, the third activation function determines a probability distribution over the different speakers based on the plurality of logit predicted values, and the speaker with the largest probability is output as the identified speaker.
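A minimal sketch of the three task branches described above, written in PyTorch under several assumptions: mean pooling over time stands in for the pooling layers, d = 768 matches the wav2vec-2.0 hidden size used later in the description, the class counts (4 emotions, 32 characters, 10 speakers) are illustrative, the class names are hypothetical, and the activation functions (softmax) are applied implicitly by the loss functions during training.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Main task branch: pooling layer + fully-connected layer over C emotion classes."""
    def __init__(self, d: int = 768, num_emotions: int = 4):
        super().__init__()
        self.fc = nn.Linear(d, num_emotions)

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, L, d) feature sequence
        pooled = z.mean(dim=1)                            # pool over time into a single vector (B, d)
        return self.fc(pooled)                            # emotion logits (B, C)

class TextHead(nn.Module):
    """First auxiliary task branch: frame-wise fully-connected layer for CTC."""
    def __init__(self, d: int = 768, vocab_size: int = 32):
        super().__init__()
        self.fc = nn.Linear(d, vocab_size)

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, L, d)
        return self.fc(z)                                 # character logits (B, L, V)

class SpeakerHead(nn.Module):
    """Second auxiliary task branch: pooling layer + fully-connected layer over N speakers."""
    def __init__(self, d: int = 768, num_speakers: int = 10):
        super().__init__()
        self.fc = nn.Linear(d, num_speakers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, L, d)
        return self.fc(z.mean(dim=1))                     # speaker logits (B, N)
```

During training, the emotion and speaker logits would feed cross-entropy losses and the character logits a CTC loss, combined as in the loss sketch above.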
In a possible implementation, the original speech is input to the multitasking model, and the obtained recognition information further includes a text recognition result and a speaker recognition result.
In a second aspect, the present invention provides a speech emotion recognition system based on multitasking learning, including:
the model acquisition unit is used for acquiring a pre-trained multi-task model; the multitasking model comprises: a pre-training model for extracting audio features of speech, and a plurality of task branches comprising: the system comprises a main task and two auxiliary tasks, wherein the main task is used for carrying out voice emotion recognition based on audio features, the first auxiliary task is used for carrying out text recognition based on the audio features, the second auxiliary task is used for carrying out speaker recognition based on the audio features, and loss functions used in the multiple task branch training stages are obtained by linearly combining the loss functions of the main task and the two auxiliary tasks so as to correlate the auxiliary tasks with the main task and improve the recognition effect of the main task;
and the emotion recognition unit is used for inputting the original voice into the multitasking model to obtain the recognized voice emotion.
In one possible embodiment, the system further comprises: a model training unit;
the loss function L used by the model training unit in the multi-task model training process is as follows:
L = L_CE1 + α·L_CTC + β·L_CE2
wherein L_CE1 is the loss function of the emotion recognition task, L_CTC is the loss function of the text recognition task, and L_CE2 is the loss function of the speaker recognition task; α and β are two hyper-parameters.
In a third aspect, the present invention provides an electronic device comprising: at least one memory for storing a program; at least one processor for executing a memory-stored program, which when executed is adapted to carry out the method of the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, the invention provides a computer readable storage medium storing a computer program which, when run on a processor, causes the processor to perform the method described in the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect, the invention provides a computer program product which, when run on a processor, causes the processor to perform the method described in the first aspect or any one of the possible implementations of the first aspect.
In general, the above technical solutions conceived by the present invention have the following beneficial effects compared with the prior art:
the invention provides a speech emotion recognition method and a system based on multitask learning, which construct a multitask learning frame with speech emotion recognition as a main task and text recognition and speaker recognition as auxiliary tasks, wherein the three tasks have certain correlation, can be mutually promoted during training in the multitask frame, and improve recognition accuracy. The method comprises the steps of extracting shared features by using a pre-training model, then learning a plurality of tasks by using the extracted features, and fine-tuning the pre-training model through a multi-task learning strategy.
The invention provides a speech emotion recognition method and a speech emotion recognition system based on multitask learning. The text recognition is to directly perform full-connection layer transformation on the shared features, obtain predicted characters through a softmax layer, and finally calculate CTC loss through real transcription with the text. And fusing three loss functions of the three tasks, giving weight to the auxiliary functions of the auxiliary tasks, and then linearly adding the weight to the loss functions of the main tasks to obtain a final loss function, so that back propagation is carried out to train the model.
The invention provides a speech emotion recognition method and a system based on multi-task learning, which are used for carrying out a comparison test on multi-task decisions, and have more excellent effect on multi-task speech emotion recognition than a single-task training frame, wherein the weights of all tasks in the multi-task learning frame play a very key role on experimental results, and grid search is carried out on the hyper-parameters for controlling the weights, so that the hyper-parameter combination with the best recognition effect is obtained.
Drawings
FIG. 1 is a flowchart of a speech emotion recognition method based on multitasking learning provided by an embodiment of the present invention;
FIG. 2 is a simplified flow chart of speech emotion recognition provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multi-task learning framework provided by an embodiment of the present invention;
FIG. 4 is a flow chart of shared feature extraction provided by an embodiment of the present invention;
FIG. 5 is a diagram of a model structure provided by an embodiment of the present invention;
fig. 6 is a graph showing the comparison of emotion recognition results when α=0, β= [0,0.01,0.1,1] provided in the embodiment of the present invention;
fig. 7 is a comparison chart of speaker recognition results when α=0, β= [0,0.01,0.1,1] provided by the embodiment of the present invention;
fig. 8 is a graph showing the comparison of emotion recognition results when β=0, α= [0,0.01,0.1,1] provided in the embodiment of the present invention;
fig. 9 is a comparison chart of text recognition word error rate when β=0, α= [0,0.01,0.1,1] provided by the embodiment of the present invention;
FIG. 10 is a schematic diagram of an inference process provided by an embodiment of the present invention;
fig. 11 is a schematic diagram of emotion recognition accuracy, speaker recognition accuracy and word error rate when α=0.1 and β=0.15 according to an embodiment of the present invention;
fig. 12 is a schematic diagram of a speech emotion recognition system based on multitasking learning according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The terms "first" and "second" and the like in the description and in the claims are used for distinguishing between different objects and not for describing a particular sequential order of objects.
In embodiments of the invention, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "such as" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of such words is intended to present related concepts in a concrete manner.
Next, the technical scheme provided in the embodiment of the present invention is described.
To address the above defects, the invention provides a multi-task learning framework in which emotion recognition is the main task and text recognition and speaker recognition are auxiliary tasks, so that as much emotion-related information as possible is extracted and an end-to-end speech emotion recognition system is realized. The original audio waveform is input into the trained model, which outputs a speech emotion classification label and can simultaneously output the speech text and speaker information as additional outputs.
FIG. 1 is a flowchart of a speech emotion recognition method based on multitasking learning provided by an embodiment of the present invention; as shown in fig. 1, the method comprises the following steps:
s101, acquiring a pre-trained multi-task model; the multitasking model comprises: a pre-training model for extracting audio features of speech, and a plurality of task branches comprising: the system comprises a main task and two auxiliary tasks, wherein the main task is used for carrying out voice emotion recognition based on audio features, the first auxiliary task is used for carrying out text recognition based on the audio features, the second auxiliary task is used for carrying out speaker recognition based on the audio features, and loss functions used in the multiple task branch training stages are obtained by linearly combining the loss functions of the main task and the two auxiliary tasks so as to correlate the auxiliary tasks with the main task and improve the recognition effect of the main task;
s102, inputting the original voice into the multitasking model to obtain the recognized voice emotion.
Specifically, the invention uses a multi-task learning method and an end-to-end deep neural model based on the pre-training model wav2vec-2.0 to train the three tasks of speech emotion recognition, text recognition and speaker recognition simultaneously, and finally uses the trained model to perform speech-to-emotion classification, with the text recognition and speaker recognition results produced as additional outputs.
The end-to-end speech emotion recognition model provided by the invention can be divided into three parts: shared feature extraction, multi-task model training and model inference; the processing flow is shown in fig. 2. First, unprocessed audio is input into the model and the model extracts features from the audio file; the extracted features serve as a shared layer and are input into the emotion recognition, text recognition and speaker recognition tasks for training respectively, with the total loss function being a linear combination of the loss functions of the three tasks. Finally, the trained model is used to recognize speech emotion: the audio is directly input into the model to obtain the speech emotion classification.
Specifically, multi-task learning uses a shared backbone model to simultaneously optimize multiple objectives from different tasks. Its advantage is that the tasks can promote one another's performance by sharing and complementing information; experience shows that jointly training related tasks yields better performance.
The proposed model selects three tasks to train simultaneously: emotion recognition, text recognition and speaker recognition. Emotion recognition serves as the main task, and text recognition and speaker recognition serve as auxiliary tasks. After the original audio is input, the shared features are extracted; the emotion recognition task performs emotion classification on the extracted features and outputs the predicted emotion label; the text recognition task performs text recognition and outputs the predicted text content; the speaker recognition task predicts the speaker identity corresponding to the input audio and outputs its speaker label. Fig. 3 illustrates the multi-task learning framework of the invention. The two auxiliary tasks are correlated with the main task to a certain extent and play a role in promoting the recognition performance of the main task.
The key to the multi-task learning framework is finding a shared feature representation; through it, the model can share related information among the different tasks, improving its generalization ability and performance.
The invention uses a pre-training model to extract the shared features, which serve as input to the subsequent training tasks; meanwhile, the loss functions of the multiple tasks are combined into a joint loss function, and the model is fine-tuned during training. The shared feature extraction flow is shown in fig. 4.
The invention uses the self-supervised pre-training model wav2vec-2.0 to extract the shared features: the model takes the original speech as input and extracts abstract audio representation features from it, and these features can be used for different tasks such as speech emotion analysis and speech recognition.
The invention provides an end-to-end multi-task training model with three tasks: emotion recognition, text recognition and speaker recognition, where emotion recognition is the main task and the other two are auxiliary tasks; fig. 5 is the model structure diagram. When the model is trained on the IEMOCAP dataset, the sampling rate of each sample is set to 16000 Hz and the audio duration ranges from 1 s to 40 s. The pre-trained wav2vec-2.0 model is denoted f_θ(x), where θ denotes the parameters of the pre-trained model. Let the input original audio be x ∈ R^L (L is the number of audio samples); after wav2vec-2.0, the output feature is z = f_θ(x) ∈ R^{L×d}, where d is the dimension of the last hidden layer, d = 768. After the shared feature extraction is completed, the feature z is input to the three different tasks for training.
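As a sketch of this shared-feature extraction step, the following uses the Hugging Face transformers implementation of wav2vec 2.0; the checkpoint name is an assumption (any wav2vec-2.0 base model with hidden size 768 would fit the description), and the random waveform merely stands in for a real 16 kHz recording.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

checkpoint = "facebook/wav2vec2-base-960h"          # assumed checkpoint, hidden size d = 768
extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
encoder = Wav2Vec2Model.from_pretrained(checkpoint)

waveform = torch.randn(16000 * 3)                   # placeholder: 3 s of 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    z = encoder(inputs.input_values).last_hidden_state  # shared features, shape (1, T, 768)
print(z.shape)
```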
The lower-left part of FIG. 5 shows the training process of emotion recognition. After the feature vector z is obtained, this branch inputs z into a pooling layer, which performs an accumulation operation over the sample length L and converts the vector sequence z ∈ R^{L×d} into a single vector z' ∈ R^d. The vector z' is then input into a fully-connected layer h_φ: if there are C emotion classes in total, h_φ maps z' to a logit vector c ∈ R^C, so that the logit prediction is c = h_φ(z') = h_φ(pool(f_θ(x))), where φ denotes the parameters of the fully-connected layer h. Finally, the softmax layer converts the emotion logits into a probability distribution over the emotion labels, and the emotion label with the maximum probability is obtained as the prediction.
In the text recognition part, the vocabulary used in the invention contains V = 32 characters in total, of which 26 are English letters and 6 are punctuation characters. The structure is shown in the middle part of FIG. 5. After the feature vector z is obtained, it is input into a fully-connected layer FC, denoted g_φ, which maps the feature vector z ∈ R^{L×d} to a logit vector y ∈ R^{L×V}, giving the character prediction expressed by the logit vector: y = g_φ(f_θ(x)), where φ denotes the parameters of the fully-connected layer g. The softmax layer then converts the character logits into a probability distribution over characters, and the character with the maximum probability is obtained as the prediction.
For the input shared feature z, speaker recognition is processed in a similar way to emotion recognition; the specific process is shown in the lower-right part of fig. 5. The feature vector z passes through a pooling layer, which converts the vector sequence z ∈ R^{L×d} into a single vector z' ∈ R^d, and z' is then input into a fully-connected layer s_φ. If there are N speaker categories, the speaker logit vector n ∈ R^N is obtained as n = s_φ(z') = s_φ(pool(f_θ(x))), where φ denotes the parameters of the fully-connected layer s. Likewise, after the softmax layer, the predicted speaker label is obtained.
It will be appreciated that setting the loss function of the multi-task learning model is an extremely critical step, which involves the selection and combination of the loss functions of the different tasks and the setting of hyper-parameters. In the proposed model the training process is supervised: in the final stage of each of the three tasks, a softmax operator converts the logit vectors into probability vectors for the respective task, which are compared with the true labels to compute the loss function.
For the emotion recognition task, the model calculates the cross entropy between the predicted probability vector and the true emotion label emo_label, and uses the cross-entropy loss L_CE1 as its loss function.
For the text recognition task, Connectionist Temporal Classification (CTC) can map an input sequence to an output sequence when the input and output lengths differ or no alignment information is provided. Because the lengths of the input and output sequences generally differ in this task, the model computes the CTC loss between the predicted character sequence and the true transcription, which can effectively back-propagate gradients. The CTC loss is L_CTC = -log P(transcript | y), the negative log-likelihood of the true transcription given the frame-level character predictions, marginalized over all valid CTC alignments.
For the speaker recognition task, the model calculates the cross entropy between the predicted probability vector and the true speaker label spc_label, and uses the cross-entropy loss L_CE2 as its loss function.
The model linearly combines the three loss functions into a total loss function. Two hyper-parameters α and β are therefore introduced to control the relative importance of the CTC loss and the speaker-recognition cross-entropy loss, respectively, and the optimal choice of α and β is found by grid search. Finally, the objective function to be optimized by the model is: L = L_CE1 + α·L_CTC + β·L_CE2.
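A hedged sketch of how this joint objective could be computed for one training batch in PyTorch follows; the tensor names and shapes, the CTC blank index of 0, and the helper signature are assumptions, while α and β default to the best values reported below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ce = nn.CrossEntropyLoss()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index 0 is an assumption
alpha, beta = 0.1, 0.15                        # best hyper-parameters reported in the experiments

def joint_objective(emo_logits, char_logits, spk_logits,
                    emo_label, transcript, spk_label,
                    feat_lengths, transcript_lengths):
    """Hypothetical single-batch computation of L = L_CE1 + alpha*L_CTC + beta*L_CE2.

    emo_logits: (B, C)  char_logits: (B, T, V)  spk_logits: (B, N)
    transcript: (B, S) integer-encoded target characters
    """
    l_ce1 = ce(emo_logits, emo_label)                                   # emotion cross entropy
    log_probs = F.log_softmax(char_logits, dim=-1).transpose(0, 1)      # (T, B, V) as CTC expects
    l_ctc = ctc(log_probs, transcript, feat_lengths, transcript_lengths)
    l_ce2 = ce(spk_logits, spk_label)                                   # speaker cross entropy
    return l_ce1 + alpha * l_ctc + beta * l_ce2
```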
specifically, the value of the super parameter represents the weight of the auxiliary task in the loss function, and when searching the optimal super parameter combination, the invention performs the following experiment: when alpha is 0, changing the value of beta, and observing the change of the primary task identification effect; when beta is 0, changing the value of alpha, and observing the change of the primary task identification effect; meanwhile, let α=β=0 as a control group, at this time, the model becomes a single-task speech emotion recognition model. The results are as follows.
With α = 0, β is set to 0, 0.01, 0.1 and 1 in turn; the emotion recognition accuracy (acc) and speaker recognition accuracy results are shown in fig. 6 and fig. 7. When α = 0 and β ≠ 0, the model becomes a multi-task learning model with emotion recognition as the main task and speaker recognition as the auxiliary task, and the value of β is the weight of the speaker-recognition loss in the total loss function. As can be seen from fig. 6, when β = 0.01 the multi-task emotion recognition performance is not as good as single-task learning, but as β gradually increases, the multi-task emotion recognition accuracy gradually exceeds that of the single-task control group; meanwhile, as can be seen from fig. 7, gradually increasing β makes the speaker recognition task converge more quickly, and the speed-up is obvious. Therefore, speaker recognition can serve as an auxiliary task to improve the speech emotion recognition performance of the main task.
With β = 0, α is set to 0, 0.01, 0.1 and 1 in turn; the emotion recognition accuracy (acc) and text recognition word error rate (wer) are shown in fig. 8 and fig. 9. When β = 0 and α ≠ 0, the model becomes a multi-task learning model with emotion recognition as the main task and text recognition as the auxiliary task, and the value of α is the weight of the text-recognition loss in the total loss function. As can be seen from fig. 8, as α gradually increases, the multi-task emotion recognition accuracy eventually exceeds that of the single-task control group; meanwhile, as can be seen from fig. 9, when α = 0.1 the word error rate of the text recognition task converges more quickly. Therefore, text recognition can serve as an auxiliary task to improve the speech emotion recognition performance of the main task.
Subsequent experiments show that when the two auxiliary tasks are trained together with the main task simultaneously, the improvement in the main-task recognition performance is even more obvious. The experiments also show that the choice of the hyper-parameters α and β has a great influence on the performance of the model: when α and β are both 0, the best choice is to set the learning rate to 10^-5; when α and β are greater than 0, the optimal learning rate is 5×10^-5, and the optimal hyper-parameter choice in this case is α = 0.1, β = 0.15.
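The grid search over α and β could be organized as in the following sketch; train_and_evaluate is a hypothetical placeholder for one full training run of the multi-task model, and the learning rates follow the values given in the paragraph above.

```python
import itertools
import random

def train_and_evaluate(alpha: float, beta: float, learning_rate: float) -> float:
    """Hypothetical placeholder: train the multi-task model with the given loss
    weights and learning rate, then return the emotion recognition accuracy on a
    validation split. A dummy random score stands in for the real training run."""
    return random.random()

best = (None, None, -1.0)
for alpha, beta in itertools.product([0.0, 0.01, 0.1, 1.0], [0.0, 0.01, 0.1, 1.0]):
    lr = 1e-5 if (alpha == 0.0 and beta == 0.0) else 5e-5  # learning rates from the text
    acc = train_and_evaluate(alpha, beta, lr)
    if acc > best[2]:
        best = (alpha, beta, acc)

print("best (alpha, beta):", best[0], best[1])
```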
During inference, the model needs no explicit language model or speaker recognition model and can directly perform classification from the original waveform to emotion; the inference process is shown in fig. 10. The original waveform is input into the fine-tuned pre-training model wav2vec-2.0 for feature extraction, the extracted features are passed through the pooling layer and the fully-connected layer in turn, and the resulting emotion-class logit vector outputs the emotion category through an argmax function. Meanwhile, text recognition and speaker recognition results can be generated as by-products, the former requiring the addition of a CTC decoder.
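A sketch of this inference step, reusing the hypothetical encoder and head modules from the earlier sketches; the greedy CTC decoding shown here (merge repeated frames, drop the blank) is only one simple choice for the CTC decoder mentioned above.

```python
import torch

# `encoder`, `emotion_head`, `text_head` and `speaker_head` are the trained modules
# from the preceding sketches; `input_values` is a processed 16 kHz waveform batch.
with torch.no_grad():
    z = encoder(input_values).last_hidden_state            # shared features (1, T, 768)
    emotion_id = emotion_head(z).argmax(dim=-1)             # main output: predicted emotion class

    # Optional by-products: greedy CTC decoding and speaker prediction.
    frame_chars = text_head(z).argmax(dim=-1).squeeze(0)    # (T,) frame-level character ids
    merged = torch.unique_consecutive(frame_chars)          # merge repeated frames
    text_ids = merged[merged != 0]                          # drop the CTC blank (index 0)
    speaker_id = speaker_head(z).argmax(dim=-1)             # predicted speaker label
```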
The model provided by the invention performs remarkably well on the public IEMOCAP dataset. Fig. 11 shows the accuracy of each task of the model on the validation set (α = 0.1, β = 0.15), where (a) is the emotion recognition accuracy, (b) is the speaker recognition accuracy, and (c) is the text recognition word error rate. Weighted accuracy (WA) was used as the evaluation criterion, and table 1 shows the comparison between this model and other models.
Table 1. Comparison with the experimental results of other models
As can be seen from Table 1, the emotion recognition accuracy of the scheme of the invention is considerably higher than that of multiple existing schemes, indicating that the scheme provided by the invention outperforms the prior art.
FIG. 12 is a schematic diagram of a speech emotion recognition system based on multitasking learning according to an embodiment of the present invention; as shown in fig. 12, includes:
a model acquisition unit 1210 configured to acquire a multitasking model trained in advance; the multitasking model comprises: a pre-training model for extracting audio features of speech, and a plurality of task branches comprising: the system comprises a main task and two auxiliary tasks, wherein the main task is used for carrying out voice emotion recognition based on audio features, the first auxiliary task is used for carrying out text recognition based on the audio features, the second auxiliary task is used for carrying out speaker recognition based on the audio features, and loss functions used in the multiple task branch training stages are obtained by linearly combining the loss functions of the main task and the two auxiliary tasks so as to correlate the auxiliary tasks with the main task and improve the recognition effect of the main task;
emotion recognition unit 1220 is configured to input the original speech to the multitasking model to obtain the recognized speech emotion.
Model training unit 1230 uses a loss function L for the multitasking model training process:
L = L_CE1 + α·L_CTC + β·L_CE2
wherein L_CE1 is the loss function of the emotion recognition task, L_CTC is the loss function of the text recognition task, and L_CE2 is the loss function of the speaker recognition task; α and β are two hyper-parameters.
It should be understood that the system is used to execute the method in the foregoing embodiment; the implementation principles and technical effects of the corresponding program modules in the system are similar to those described for the foregoing method, and the working process of the system may refer to the corresponding process in the foregoing method, which is not repeated here.
Based on the method in the above embodiment, the embodiment of the invention provides an electronic device. The apparatus may include: at least one memory for storing programs and at least one processor for executing the programs stored by the memory. Wherein the processor is adapted to perform the method described in the above embodiments when the program stored in the memory is executed.
Based on the method in the above embodiment, the embodiment of the present invention provides a computer-readable storage medium storing a computer program, which when executed on a processor, causes the processor to perform the method in the above embodiment.
Based on the method in the above embodiments, an embodiment of the present invention provides a computer program product, which when run on a processor causes the processor to perform the method in the above embodiments.
It is to be appreciated that the processor in embodiments of the invention may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor, but in the alternative, it may be any conventional processor.
The method steps in the embodiments of the present invention may be implemented by hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), and the like.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present invention are merely for ease of description and are not intended to limit the scope of the embodiments of the present invention.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.
Claims (10)
1. A speech emotion recognition method based on multitask learning is characterized by comprising the following steps:
acquiring a pre-trained multi-task model; the multitasking model comprises: a pre-training model for extracting audio features of speech, and a plurality of task branches comprising: the system comprises a main task and two auxiliary tasks, wherein the main task is used for carrying out voice emotion recognition based on audio features, the first auxiliary task is used for carrying out text recognition based on the audio features, the second auxiliary task is used for carrying out speaker recognition based on the audio features, and loss functions used in the multiple task branch training stages are obtained by linearly combining the loss functions of the main task and the two auxiliary tasks so as to correlate the auxiliary tasks with the main task and improve the recognition effect of the main task;
and inputting the original voice into the multitasking model to obtain the recognized voice emotion.
2. The method according to claim 1, wherein the loss function L used by the multitasking model training process is:
L = L_CE1 + α·L_CTC + β·L_CE2
wherein L_CE1 is the loss function of the emotion recognition task, L_CTC is the loss function of the text recognition task, and L_CE2 is the loss function of the speaker recognition task; α and β are two hyper-parameters.
3. The method of claim 1, wherein the architecture of the main task comprises: a first pooling layer, a first fully-connected layer, and a first activation function; when the main task is executed, the first pooling layer converts the vector sequence corresponding to the audio features into a single vector, the first fully-connected layer obtains predicted values corresponding to different emotion categories based on the single vector, the first activation function determines a probability distribution over the different emotion categories based on the plurality of predicted values, and the emotion category with the largest probability is output as the recognized speech emotion.
4. The method according to claim 3, wherein the architecture of the first auxiliary task comprises: a second fully-connected layer and a second activation function; when the first auxiliary task is executed, the second fully-connected layer obtains logit predicted values of different characters based on the feature vectors corresponding to the audio features, the second activation function converts the logit predicted values of the different characters into a probability distribution over the different characters, and the character with the highest probability is output as the recognized text.
5. The method of claim 4, wherein the architecture of the second auxiliary task comprises: a second pooling layer, a third fully-connected layer, and a third activation function; when the second auxiliary task is executed, the second pooling layer converts the vector sequence corresponding to the audio features into a single vector, the third fully-connected layer obtains logit predicted values corresponding to different speakers based on the single vector, the third activation function determines a probability distribution over the different speakers based on the plurality of logit predicted values, and the speaker with the largest probability is output as the identified speaker.
6. The method according to any one of claims 1 to 5, wherein the original speech is input to the multitasking model, and the obtained recognition information further includes a text recognition result and a speaker recognition result.
7. A speech emotion recognition system based on multitasking learning, comprising:
the model acquisition unit is used for acquiring a pre-trained multi-task model; the multitasking model comprises: a pre-training model for extracting audio features of speech, and a plurality of task branches comprising: the system comprises a main task and two auxiliary tasks, wherein the main task is used for carrying out voice emotion recognition based on audio features, the first auxiliary task is used for carrying out text recognition based on the audio features, the second auxiliary task is used for carrying out speaker recognition based on the audio features, and loss functions used in the multiple task branch training stages are obtained by linearly combining the loss functions of the main task and the two auxiliary tasks so as to correlate the auxiliary tasks with the main task and improve the recognition effect of the main task;
and the emotion recognition unit is used for inputting the original voice into the multitasking model to obtain the recognized voice emotion.
8. The system of claim 7, further comprising: a model training unit;
the loss function L used by the model training unit in the multi-task model training process is as follows:
L = L_CE1 + α·L_CTC + β·L_CE2
wherein L_CE1 is the loss function of the emotion recognition task, L_CTC is the loss function of the text recognition task, and L_CE2 is the loss function of the speaker recognition task; α and β are two hyper-parameters.
9. An electronic device, comprising:
at least one memory for storing a program;
at least one processor for executing the memory-stored program, which processor is adapted to perform the method according to any of claims 1-6 when the memory-stored program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when run on a processor, causes the processor to perform the method according to any one of claims 1-6.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310954870.1A | 2023-07-31 | 2023-07-31 | Speech emotion recognition method and system based on multitask learning |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310954870.1A | 2023-07-31 | 2023-07-31 | Speech emotion recognition method and system based on multitask learning |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116741206A (en) | 2023-09-12 |

Family ID: 87902931

Family Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310954870.1A (Pending) | 2023-07-31 | 2023-07-31 | Speech emotion recognition method and system based on multitask learning |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN116741206A (en) |
Cited By (3)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117275461A (en) * | 2023-11-23 | 2023-12-22 | 上海蜜度科技股份有限公司 | Multitasking audio processing method, system, storage medium and electronic equipment |
| CN117275461B (en) * | 2023-11-23 | 2024-03-15 | 上海蜜度科技股份有限公司 | Multitasking audio processing method, system, storage medium and electronic equipment |
| CN118280372A (en) * | 2024-06-03 | 2024-07-02 | 中邮消费金融有限公司 | Dialogue assistance method, device, storage medium, and computer program product |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |