CN115730203A - Voice emotion recognition method based on global perception cross-modal feature fusion network - Google Patents


Info

Publication number
CN115730203A
Authority
CN
China
Prior art keywords
text
modal
features
fusion
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211489099.7A
Other languages
Chinese (zh)
Inventor
李峰
王玲玲
杨菲
罗久淞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University of Finance and Economics
Original Assignee
Anhui University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University of Finance and Economics filed Critical Anhui University of Finance and Economics
Priority to CN202211489099.7A priority Critical patent/CN115730203A/en
Publication of CN115730203A publication Critical patent/CN115730203A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of speech emotion recognition, and discloses a speech emotion recognition method based on a global perception cross-modal feature fusion network, which comprises a multi-modal emotion recognition model, wherein the multi-modal emotion recognition model comprises an SER (speech emotion recognition) part and an ASR (automatic speech recognition) part.

Description

Voice emotion recognition method based on global perception cross-modal feature fusion network
Technical Field
The invention relates to the technical field of speech emotion recognition, in particular to a speech emotion recognition method based on a global perception cross-modal feature fusion network.
Background
Speech, as the first attribute of language, plays a decisive supporting role in language. It carries not only the textual content expressed by the speaker but also the emotional information the speaker intends to convey, and the same text can differ greatly under different emotional expressions. Speech emotion recognition has therefore received increasing attention because of the importance of emotion in everyday human conversation. Taking the ubiquitous virtual voice assistants (such as Alexa, Siri, Google Assistant and Cortana) as an example, as the number of people interacting with them grows, they must infer the user's emotion and react appropriately to improve the user experience. However, humans express emotion not only through speech but also in many other ways, such as words, body posture and facial expressions. Therefore, to correctly understand the emotion expressed in speech, we need to fully understand the emotional information contained in the various modalities.
In real life, speech emotion recognition helps people communicate better. Emotions usually appear in a conversation in multiple forms, such as speech and text; however, most existing emotion recognition systems use only single-modal features for emotion recognition and ignore the interaction among multi-modal information.
Disclosure of Invention
The invention provides the following technical scheme: a speech emotion recognition system based on a global perception cross-modal feature fusion network comprises a multi-modal emotion recognition model, wherein the multi-modal emotion recognition model comprises an SER part and an ASR part. The SER part comprises a wav2vec2.0 encoder, a Roberta-base encoder, a residual cross-modal fusion attention module, a global perception block and a fully connected layer; the ASR part comprises a transcription layer, an audio feature layer and a fully connected layer. The SER part calculates the CrossEntropy loss from the predicted emotion labels and the true emotion labels, the ASR part calculates the CTC loss from the audio features of the wav2vec2.0 model and the corresponding text transcription, and finally the CrossEntropy loss and the CTC loss are added to obtain the loss value of the training stage.
Preferably, the multi-modal emotion recognition model comprises a problem statement;
Data set D has k utterances u_i, and each utterance corresponds to a label l_i. Each utterance is composed of a speech segment a_i and a text transcription t_i, where u_i = (a_i, t_i) and t_i is the ASR-transcribed text or the manually annotated text. The proposed network model takes u_i as input and assigns the correct emotion label to any given utterance.
<U, L> = {{u_i = <a_i, t_i>, l_i} | i ∈ [1, k]}   (1)
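For illustration only (this sketch is not part of the original disclosure), the utterance structure of Equation (1) can be represented as a simple data container; the class and field names below are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    """One sample u_i = (a_i, t_i) together with its emotion label l_i."""
    audio: List[float]  # raw speech segment a_i (e.g. 16 kHz waveform samples)
    text: str           # transcription t_i (ASR output or manual annotation)
    label: int          # emotion label l_i

# The data set <U, L> of Equation (1) is then a list of k such utterances.
dataset: List[Utterance] = []
```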
Preferably, the multi-modal emotion recognition model comprises feature coding;
In feature encoding, the audio information and text information of each utterance are encoded into wav2vec2.0 features and text features by the corresponding encoders and then input to the proposed model.
Preferably, the multi-modal emotion recognition model comprises speech coding;
The wav2vec2.0 features contain the rich prosodic information needed for emotion recognition. In our model, we use a pre-trained wav2vec2.0 model as the raw-audio waveform encoder to extract the wav2vec2.0 features; the model is based on the transformer structure and represents the speech audio sequence by fitting a set of ASR modeling units shorter than phonemes. After comparing the two versions of the wav2vec2.0 model, we choose the wav2vec2-base model with a hidden dimension of 768. We input the audio data a_i of the i-th utterance into the pre-trained wav2vec2.0 model to obtain a contextual embedded representation A_i, where D_A denotes the size of the audio feature embedding; this can be expressed as:
A_i = F_wav2vec2.0(a_i), A_i ∈ ℝ^(j×D_A)   (2)
where F_wav2vec2.0 denotes the pre-trained wav2vec2.0 model acting as the audio feature processor, and j depends on the length of the raw audio and on the CNN feature extraction layer of the wav2vec2.0 model, which extracts frames from the raw audio with a stride of 20 ms and a window of 25 ms. In our experiments, the parameters of the CNN feature extraction layer are kept fixed.
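A minimal, non-authoritative sketch of this audio encoding step, assuming the HuggingFace transformers implementation of wav2vec2.0 and the facebook/wav2vec2-base checkpoint; the helper name encode_audio is illustrative, and only the frozen CNN feature extractor and the 768-dimensional output follow the description above.

```python
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Pre-trained wav2vec2-base encoder (hidden size D_A = 768).
processor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec2 = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
wav2vec2.freeze_feature_encoder()  # keep the CNN feature extraction layer fixed

def encode_audio(waveform, sampling_rate=16000):
    """Return the contextual audio embedding A_i with shape (j, 768)."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    outputs = wav2vec2(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (j, D_A)
```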
Preferably, the multi-modal emotion recognition model comprises a context text representation;
The text data are input into the roberta-base model for encoding. Before the text features are extracted, the input text is tokenized and separator tokens are added to separate the sentences and prevent semantic confusion. The tokenized text data are then fine-tuned together with the corresponding utterances, and the contextual text embedding T_i can be expressed as:
T_i = F_Roberta-base(t_i), T_i ∈ ℝ^(m×D_T)   (3)
where F_Roberta-base denotes the text feature extraction function, m depends on the number of tokens in the text, and D_T is the dimension of the text feature embedding.
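A comparable sketch of the contextual text encoding, assuming the HuggingFace roberta-base checkpoint; the tokenizer inserts the boundary/separator tokens mentioned above, and the helper name encode_text is illustrative.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base")

def encode_text(transcription: str) -> torch.Tensor:
    """Return the contextual text embedding T_i with shape (m, D_T = 768)."""
    # The tokenizer adds the start/separator tokens (<s>, </s>) automatically.
    inputs = tokenizer(transcription, return_tensors="pt")
    outputs = roberta(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (m, D_T)
```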
Preferably, the residual cross-modal fusion attention module is composed of two parallel fusion blocks for the different modalities, namely a speech-text residual cross-modal fusion attention block and a text-speech residual cross-modal fusion attention block, and the residual cross-modal fusion attention module completes the multi-modal information interaction using a cross-modal attention mechanism.
Preferably, the audio information (a_i) and text information (t_i) are processed by the corresponding pre-trained feature extractors to generate the associated audio features A_i and text features T_i. A residual cross-modal fusion attention block is composed of a cross-modal attention mechanism, a linear layer, a normalization layer, a dropout layer, a Gaussian error linear unit (GELU) activation function and a residual structure. The two residual cross-modal fusion attention blocks differ only in the query, key and value of the cross-modal attention mechanism: the speech-text residual cross-modal fusion attention block takes the audio features A_i as the query and the text features T_i as the keys and values, and performs the audio-text interaction with a multi-head attention mechanism; the text-speech residual cross-modal fusion attention block takes the text features T_i as the query and the audio features A_i as the keys and values. The audio features and text features are first made to interact through the multi-head attention mechanism; the interacted features then pass through a linear layer, a normalization layer and a dropout layer, and are finally connected to the block's initial audio features or text features through the residual structure;
[Equation: output of the first residual cross-modal fusion attention block, given by the learning function Φ_1 applied to the initial audio features A_i and text features T_i]
where Φ_1 denotes the learning function of the first speech-text residual cross-modal fusion attention block or of the first text-speech residual cross-modal fusion attention block;
In addition, the output of the first residual cross-modal fusion attention block is combined with the initial audio features or text features and fed to the second residual cross-modal fusion attention block, so that a plurality of residual cross-modal fusion attention blocks are stacked together to generate the corresponding multi-modal fusion features F_A→T and F_T→A:
speech-text: [Equation: the stacked speech-text residual cross-modal fusion attention blocks produce F_A→T]
text-speech: [Equation: the stacked text-speech residual cross-modal fusion attention blocks produce F_T→A]
Our fusion strategy differs from previously proposed multi-modal fusion models in that, to better integrate the multi-modal information, we always keep the keys and values unchanged: the keys and values of every residual cross-modal fusion attention block are the initial audio and text features of the module. Finally, the outputs of the two residual cross-modal fusion attention blocks are concatenated to generate the final multi-modal fusion feature F_fusion:
F_fusion = Concat(F_A→T, F_T→A)
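The module described above can be sketched in PyTorch roughly as follows; the number of heads, the dropout rate, the exact ordering of the linear/normalization/dropout layers and the concatenation axis are assumptions made for illustration, not the authoritative implementation.

```python
import torch
import torch.nn as nn

class ResidualCrossModalFusionBlock(nn.Module):
    """Cross-modal attention + linear + norm + dropout + GELU with a residual connection."""
    def __init__(self, dim=768, num_heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.linear = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.drop = nn.Dropout(dropout)
        self.act = nn.GELU()

    def forward(self, query, key_value):
        # One modality attends to the other; the keys/values are the other modality's features.
        fused, _ = self.attn(query, key_value, key_value)
        fused = self.drop(self.norm(self.act(self.linear(fused))))
        return fused + query  # residual connection to the block's input

class ResidualCrossModalFusionModule(nn.Module):
    """Two parallel stacks (speech-text and text-speech) whose outputs are concatenated."""
    def __init__(self, dim=768, num_blocks=4):
        super().__init__()
        self.a2t = nn.ModuleList([ResidualCrossModalFusionBlock(dim) for _ in range(num_blocks)])
        self.t2a = nn.ModuleList([ResidualCrossModalFusionBlock(dim) for _ in range(num_blocks)])

    def forward(self, audio_feat, text_feat):
        x_a, x_t = audio_feat, text_feat
        for blk_a, blk_t in zip(self.a2t, self.t2a):
            x_a = blk_a(x_a, text_feat)   # keys/values stay fixed to the initial text features
            x_t = blk_t(x_t, audio_feat)  # keys/values stay fixed to the initial audio features
        return torch.cat([x_a, x_t], dim=1)  # final multi-modal fusion feature F_fusion
```

Binding key_value to the initial features inside the loop mirrors the strategy stated above of never changing the keys and values across the stacked blocks.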
Preferably, after the multi-modal fusion feature is extracted by the residual cross-modal fusion attention module, the global perception module captures the global information of the multi-modal fusion feature. The size of the multi-modal fusion feature varies with the lengths of the audio (j) and the text (m); therefore, before the multi-modal fusion feature is input into the global perception block, its variable sequence dimension must be mapped to a fixed length (for example, setting the values of j and m to l). The global perception module is composed of two fully connected layers, a convolutional layer, two normalization layers, a Gaussian error linear unit activation function and a multiplication operation. The output dimensions of the first and last fully connected layers in the module are 4·D_f and D_f respectively (where D_f = 768); after projection through the Gaussian error linear unit activation function, the output is split into two parts of size 2·D_f, and their multiplication enhances cross-dimensional feature mixing. Finally, the output of the global perception fusion module is integrated for classification, and the corresponding equations are described as follows:
F_global-aware = Φ_global-aware(F_fusion)   (8)
y_i = FC(F_global-aware), y_i ∈ ℝ^C   (9)
where Φ_global-aware denotes the function applied to the multi-modal fusion feature by the global perception module, and C is the number of emotion classes.
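A possible realization of this gated structure is sketched below (two fully connected layers with output sizes 4·D_f and D_f, a convolutional layer, two normalization layers, a GELU activation and a multiplication); the exact placement of the convolution and the normalizations is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class GlobalAwareBlock(nn.Module):
    """Illustrative global perception block with a multiplicative (gating) interaction."""
    def __init__(self, dim=768, seq_len=1):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)
        self.fc_in = nn.Linear(dim, 4 * dim)       # first fully connected layer: D_f -> 4*D_f
        self.act = nn.GELU()
        self.norm_gate = nn.LayerNorm(2 * dim)
        self.conv = nn.Conv1d(seq_len, seq_len, kernel_size=1)  # global (cross-token) mixing
        self.fc_out = nn.Linear(2 * dim, dim)      # last fully connected layer: 2*D_f -> D_f

    def forward(self, x):
        # x: (batch, seq_len, D_f), after the fused feature has been mapped to a fixed length.
        y = self.act(self.fc_in(self.norm_in(x)))  # (batch, seq_len, 4*D_f)
        u, v = y.chunk(2, dim=-1)                  # two halves of size 2*D_f each
        v = self.conv(self.norm_gate(v))           # convolution mixes information across tokens
        return self.fc_out(u * v)                  # multiplication enhances feature mixing
```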
Preferably, the multi-modal emotion recognition model comprises a connectionist temporal classification (CTC) layer, which uses the connectionist temporal classification loss as a loss function to effectively back-propagate gradients; therefore, we use the wav2vec2.0 features A_i and the text transcription information t_i to calculate the connectionist temporal classification loss;
L_CTC = CTC(A_i, t_i)   (10)
where V = 32 is the size of our vocabulary, consisting of the 26 letters of the alphabet and some punctuation marks;
[Equation (11)]
Furthermore, we use the output feature y_i of the global perception block and the true emotion label l_i to calculate the cross-entropy loss;
L_CrossEntropy = CrossEntropy(y_i, l_i)   (12)
Finally, we introduce a hyper-parameter α to combine the two loss functions into a single loss; α effectively controls the relative importance of the connectionist temporal classification loss.
L = L_CrossEntropy + α·L_CTC, α ∈ (0, 1)   (13)
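A minimal sketch of the combined objective of Equation (13), assuming the standard PyTorch loss modules; tensor shapes, the blank index and the default α = 0.1 (the best value in the later ablation) are illustrative assumptions.

```python
import torch.nn as nn

ctc_loss_fn = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss_fn = nn.CrossEntropyLoss()

def total_loss(emotion_logits, emotion_labels,
               ctc_log_probs, targets, input_lengths, target_lengths,
               alpha=0.1):
    """L = L_CrossEntropy + alpha * L_CTC, with alpha in (0, 1) (Equation (13))."""
    l_ce = ce_loss_fn(emotion_logits, emotion_labels)  # SER branch, Equation (12)
    # ctc_log_probs: (T, batch, V) log-probabilities over the V = 32 vocabulary
    l_ctc = ctc_loss_fn(ctc_log_probs, targets, input_lengths, target_lengths)
    return l_ce + alpha * l_ctc
```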
Preferably, the speech emotion recognition method based on the global perception cross-modal feature fusion network comprises the following main steps:
S1: respectively extracting the wav2vec2.0 features and the text features through pre-trained transfer-learning models;
S2: fusing the features from the different modalities through a residual cross-modal fusion attention module;
S3: introducing a global perception module to capture the important emotion information of the multi-modal fusion features at different scales;
S4: performing numerous experiments on the IEMOCAP data set to verify the effectiveness of the method.
Advantageous effects
Compared with the prior art, the invention provides a speech emotion recognition method based on a global perception cross-modal feature fusion network, which has the following beneficial effects:
the invention provides a global perception cross-modal feature fusion network (GCF-Net) for speech emotion recognition. In GCF-Net, a residual across-modal fusion attention module is designed to help a network to extract rich features from audio and text, a global fusion block is added behind the residual across-modal fusion attention module to further extract the features rich in emotion in a global range, automatic speech recognition is introduced as an auxiliary task for calculating the loss of the connotation time classification, and experimental results on an IEMOCAP data set show that the residual across-modal fusion attention module, the global perception module and the automatic speech recognition calculate the loss of the connotation time classification and improve the performance of the model.
Drawings
FIG. 1 is a schematic structural diagram of the multi-modal emotion recognition model of the speech emotion recognition method based on the global perception cross-modal feature fusion network of the present invention;
FIG. 2 is a schematic structural diagram of the residual cross-modal fusion attention module of the speech emotion recognition method based on the global perception cross-modal feature fusion network of the present invention;
FIG. 3 is a schematic structural diagram of the speech-text residual cross-modal fusion attention block of the speech emotion recognition method based on the global perception cross-modal feature fusion network of the present invention;
FIG. 4 is a schematic structural diagram of the global fusion block of the speech emotion recognition method based on the global perception cross-modal feature fusion network of the present invention;
FIG. 5 is a schematic diagram of the confusion matrix results of the different modalities of the speech emotion recognition method based on the global perception cross-modal feature fusion network of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment;
Referring to Fig. 1, a speech emotion recognition system based on a global perception cross-modal feature fusion network comprises a multi-modal emotion recognition model, wherein the multi-modal emotion recognition model comprises an SER part and an ASR part. The SER part comprises a wav2vec2.0 encoder, a Roberta-base encoder, a residual cross-modal fusion attention module, a global perception block and a fully connected layer; the ASR part comprises a transcription layer, an audio feature layer and a fully connected layer. The SER part calculates the CrossEntropy loss from the predicted emotion labels and the true emotion labels, the ASR part calculates the CTC loss from the audio features of the wav2vec2.0 model and the corresponding text transcription, and finally the CrossEntropy loss and the CTC loss are added to obtain the loss value of the training stage.
Further, the multi-modal emotion recognition model comprises a problem statement;
Data set D has k utterances u_i, and each utterance corresponds to a label l_i. Each utterance is composed of a speech segment a_i and a text transcription t_i, where u_i = (a_i, t_i) and t_i is the ASR-transcribed text or the manually annotated text. The proposed network model takes u_i as input and assigns the correct emotion label to any given utterance;
<U, L> = {{u_i = <a_i, t_i>, l_i} | i ∈ [1, k]}   (1)
Preferably, the multi-modal emotion recognition model comprises feature encoding;
In feature encoding, the audio information and text information of each utterance are encoded into wav2vec2.0 features and text features by the corresponding encoders and then input to the proposed model.
Further, the multi-modal emotion recognition model comprises speech coding;
The wav2vec2.0 features contain the rich prosodic information needed for emotion recognition. In our model, we use a pre-trained wav2vec2.0 model as the raw-audio waveform encoder to extract the wav2vec2.0 features; the model is based on the transformer structure and represents the speech audio sequence by fitting a set of ASR modeling units shorter than phonemes. After comparing the two versions of the wav2vec2.0 model, we choose the wav2vec2-base model with a hidden dimension of 768. We input the audio data a_i of the i-th utterance into the pre-trained wav2vec2.0 model to obtain a contextual embedded representation A_i, where D_A denotes the size of the audio feature embedding; this can be expressed as:
A_i = F_wav2vec2.0(a_i), A_i ∈ ℝ^(j×D_A)   (2)
where F_wav2vec2.0 denotes the pre-trained wav2vec2.0 model acting as the audio feature processor, and j depends on the length of the raw audio and on the CNN feature extraction layer of the wav2vec2.0 model, which extracts frames from the raw audio with a stride of 20 ms and a window of 25 ms. In our experiments, the parameters of the CNN feature extraction layer are kept fixed.
Further, the multi-modal emotion recognition model comprises context text representation;
The text data are input into the roberta-base model for encoding. Before the text features are extracted, the input text is tokenized and separator tokens are added to separate the sentences and prevent semantic confusion. The tokenized text data are then fine-tuned together with the corresponding utterances, and the contextual text embedding T_i can be expressed as:
T_i = F_Roberta-base(t_i), T_i ∈ ℝ^(m×D_T)   (3)
where F_Roberta-base denotes the text feature extraction function, m depends on the number of tokens in the text, and D_T is the dimension of the text feature embedding.
The second embodiment;
Referring to Fig. 2 and Fig. 3, the residual cross-modal fusion attention module is composed of two parallel fusion blocks for the different modalities, namely a speech-text residual cross-modal fusion attention block and a text-speech residual cross-modal fusion attention block. The residual cross-modal fusion attention module completes the multi-modal information interaction using a cross-modal attention mechanism: a fusion layer based on cross-modal attention is designed to combine the latent representations of the different modalities, and a new global perception fusion module is introduced to obtain the key emotion information of the multi-modal fusion features.
Further, the audio information (a_i) and text information (t_i) are processed by the corresponding pre-trained feature extractors to generate the associated audio features A_i and text features T_i. A residual cross-modal fusion attention block is composed of a cross-modal attention mechanism, a linear layer, a normalization layer, a dropout layer, a Gaussian error linear unit (GELU) activation function and a residual structure. The two residual cross-modal fusion attention blocks differ only in the query, key and value of the cross-modal attention mechanism: the speech-text residual cross-modal fusion attention block takes the audio features A_i as the query and the text features T_i as the keys and values, and performs the audio-text interaction with a multi-head attention mechanism; the text-speech residual cross-modal fusion attention block takes the text features T_i as the query and the audio features A_i as the keys and values. The audio features and text features are first made to interact through the multi-head attention mechanism; the interacted features then pass through a linear layer, a normalization layer and a dropout layer, and are finally connected to the block's initial audio features or text features through the residual structure;
[Equation: output of the first residual cross-modal fusion attention block, given by the learning function Φ_1 applied to the initial audio features A_i and text features T_i]
where Φ_1 denotes the learning function of the first speech-text residual cross-modal fusion attention block or of the first text-speech residual cross-modal fusion attention block;
In addition, the output of the first residual cross-modal fusion attention block is combined with the initial audio features or text features and fed to the second residual cross-modal fusion attention block, so that a plurality of residual cross-modal fusion attention blocks are stacked together to generate the corresponding multi-modal fusion features F_A→T and F_T→A:
speech-text: [Equation: the stacked speech-text residual cross-modal fusion attention blocks produce F_A→T]
text-speech: [Equation: the stacked text-speech residual cross-modal fusion attention blocks produce F_T→A]
Our fusion strategy differs from previously proposed multi-modal fusion models in that, to better integrate the multi-modal information, we always keep the keys and values unchanged: the keys and values of every residual cross-modal fusion attention block are the initial audio and text features of the module. Finally, the outputs of the two residual cross-modal fusion attention blocks are concatenated to generate the final multi-modal fusion feature F_fusion:
F_fusion = Concat(F_A→T, F_T→A)
The third embodiment;
Referring to Fig. 4, after the multi-modal fusion feature is extracted by the residual cross-modal fusion attention module, the global perception module captures the global information of the multi-modal fusion feature. The size of the multi-modal fusion feature varies with the lengths of the audio (j) and the text (m); therefore, before the multi-modal fusion feature is input into the global perception block, its variable sequence dimension must be mapped to a fixed length (for example, setting the values of j and m to l). The global perception module is composed of two fully connected layers, a convolutional layer, two normalization layers, a Gaussian error linear unit activation function and a multiplication operation. The output dimensions of the first and last fully connected layers in the module are 4·D_f and D_f respectively (where D_f = 768); after projection through the Gaussian error linear unit activation function, the output is split into two parts of size 2·D_f, and their multiplication enhances cross-dimensional feature mixing. Finally, the output of the global perception fusion module is integrated for classification, and the corresponding equations are described as follows:
F_global-aware = Φ_global-aware(F_fusion)   (8)
y_i = FC(F_global-aware), y_i ∈ ℝ^C   (9)
where Φ_global-aware denotes the function applied to the multi-modal fusion feature by the global perception module, and C is the number of emotion classes.
Further, the multi-modal emotion recognition model comprises a connectionist temporal classification (CTC) layer, which uses the connectionist temporal classification loss as a loss function to effectively back-propagate gradients; therefore, we use the wav2vec2.0 features A_i and the text transcription information t_i to calculate the connectionist temporal classification loss;
L_CTC = CTC(A_i, t_i)   (10)
where V = 32 is the size of our vocabulary, consisting of the 26 letters of the alphabet and some punctuation marks;
[Equation (11)]
Furthermore, we use the output feature y_i of the global perception block and the true emotion label l_i to calculate the cross-entropy loss;
L_CrossEntropy = CrossEntropy(y_i, l_i)   (12)
Finally, we introduce a hyper-parameter α to combine the two loss functions into a single loss; α effectively controls the relative importance of the connectionist temporal classification loss.
L = L_CrossEntropy + α·L_CTC, α ∈ (0, 1)   (13)
Experimental examples
(1) Data set
Following much of the prior work in the speech emotion recognition literature, we trained and evaluated our model on the IEMOCAP data set, a multi-modal data set that serves as the benchmark for multi-modal emotion recognition research. It contains 12 hours of improvised and scripted audiovisual data from 10 theater actors (five male and five female) in five dyadic sessions, and the emotion information of each session is presented in four forms: video, audio, transcription, and motion capture of facial movements.
Due to experimental requirements, we selected the audio and transcription data to evaluate our model on the IEMOCAP data set. Like most studies, we selected five emotions for recognition: happy, angry, neutral, sad and excited. Since happy and excited are highly similar, we relabeled all excited samples as happy. Furthermore, we randomly split the data set into training (80%) and testing (20%) parts and evaluated our model using five-fold cross-validation.
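For illustration, the data preparation described above (relabeling excited as happy and an 80%/20% random split) might look roughly like the sketch below; the IEMOCAP label abbreviations and the function name are assumptions.

```python
import random

LABEL_MAP = {"hap": "happy", "exc": "happy", "ang": "angry", "neu": "neutral", "sad": "sad"}

def prepare_iemocap(samples, seed=0):
    """samples: list of (audio, transcription, raw_label) tuples.
    Relabel 'excited' as 'happy' and split randomly into 80% train / 20% test."""
    data = [(a, t, LABEL_MAP[lab]) for (a, t, lab) in samples if lab in LABEL_MAP]
    random.Random(seed).shuffle(data)
    cut = int(0.8 * len(data))
    return data[:cut], data[cut:]
```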
(2) Experimental setup
To explore the advantages of multi-modality, we constructed two single-modal baselines using the text and speech modalities respectively. The text baseline uses Roberta-base as the contextual text encoder and then classifies with a single linear layer and a softmax activation function; the speech baseline uses a similar setup to the text baseline, only replacing the encoder with the pre-trained wav2vec2.0 model. In addition, Table 1 shows the basic hyper-parameter settings of our experiments.
Table 1: Hyper-parameter settings.
GPU: NVIDIA GeForce 940MX
Batch size: 2
Gradient accumulation: 4
Epochs: 100
α: 0; 0.001; 0.01; 0.1; 1
Optimizer: Adam
Learning rate: 1e-5
Loss functions: cross-entropy loss and connectionist temporal classification loss
Evaluation metrics: weighted accuracy and unweighted accuracy
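The settings of Table 1 can also be collected into a plain configuration dictionary for reference; the key names below are illustrative, only the values come from the table.

```python
# Hyper-parameters from Table 1 (key names are illustrative).
CONFIG = {
    "device": "NVIDIA GeForce 940MX",
    "batch_size": 2,
    "gradient_accumulation": 4,
    "epochs": 100,
    "alpha_values": [0, 0.001, 0.01, 0.1, 1],   # weight of the CTC loss, explored in Table 5
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "losses": ["cross_entropy", "ctc"],
    "metrics": ["weighted_accuracy", "unweighted_accuracy"],
}
```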
(3) Ablation experiment
In order to better understand the contributions made by the different modules in the proposed GCF-Net model, we performed multiple ablation studies on each module on the IEMOCAP data set. Weighted accuracy and unweighted accuracy were selected as the evaluation metrics.
To verify the impact of each modality, we trained our proposed network using only audio features or only text features as input, without applying the fusion module. In Table 2, we can see that the fusion of the two features combines the advantages of both and significantly improves the emotion recognition rate compared with a single feature.
Table 2: Ablation experiments with different modal features.
Model  Weighted accuracy  Unweighted accuracy
Roberta-base baseline 69.27% 69.89%
Wav2vec2.0 baseline 79.76% 78.66%
Roberta-base+Wav2vec2.0 82.80% 82.01%
Furthermore, we investigated the impact of the global perception module on our proposed model. In Table 3, we can see that adding the global perception block improves the weighted and unweighted accuracy by 1.0% and 1.5%, respectively. This shows that the global perception module can extract more important emotion information and improve the performance of the model.
Table 3: Ablation experiment on the global perception module.
Model  Weighted accuracy  Unweighted accuracy
Without the global perception module  81.73%  80.92%
With the global perception module  82.80%  82.01%
With the global perception block added, we also set up an ablation experiment for the residual cross-modal attention fusion blocks. Since the two residual cross-modal attention fusion blocks are placed in parallel, this experiment verifies the effect of different numbers of residual cross-modal attention fusion layers in our proposed model. Table 4 shows that the best model performance is obtained with four layers of residual cross-modal attention fusion blocks (m = 4); when m = 5, the accuracy of the model decreases. We therefore take m = 4 as our best choice.
Table 4: Ablation experiment on the residual cross-modal attention fusion blocks.
Number of residual cross-modal attention fusion block layers  Weighted accuracy  Unweighted accuracy
1 79.66% 78.85%
2 80.79% 79.56%
3 82.16% 81.10%
4 82.80% 82.01%
5 80.95% 80.47%
The hyper-parameter α controls the strength of the CTC loss; therefore, we varied α from 0 to 1 to obtain different strengths. Table 5 shows the effect of different α values on our optimal model. We can see that the positive impact of the CTC loss is largest when α = 0.1.
Table 5: Ablation experiment on the value of α.
α WA UA
0 81.6% 81.1%
0.001 81.9% 81.6%
0.01 81.2% 80.1%
0.1 82.8% 82.0%
1 77.0% 76.2%
(4) Error analysis
We visualize the appearance and span of different modalities in different emotion categories by a confusion matrix. FIG. 5 shows the confusion matrix for each modality, including wav2vec2.0, roberta-base, and multi-modality fusion, respectively.
As shown in Fig. 5(a), the audio-only model easily confuses happiness with neutrality, so the recognition rates of these two emotions are much lower than those of the other two emotions; in particular, the recognition rate of anger reaches 86.94%. In general, most emotions are easily confused with neutrality. Our observation is consistent with studies reported by others, which argue that neutrality lies at the center of the activation space, making it more challenging to distinguish from the other classes. Compared with Fig. 5(a), Fig. 5(b) performs well in predicting happiness. This result is reasonable: compared with the audio signal data, the word distribution of happiness differs more significantly from that of the other emotions and provides more emotional information. On the other hand, the prediction of sadness is worst, with 23.71% of the samples confused with neutrality.
The model in Fig. 5(c) makes up for the deficiencies of the first two models (Fig. 5(a) and (b)) by fusing the two modal features. We can see that the prediction rate of every emotion except neutrality reaches 80%; in particular, the prediction of sadness reaches 91.27%. Unfortunately, the recognition rates of anger and neutrality are slightly reduced.
Comparative example
As shown in Table 6, we compared our model with today's mainstream multi-modal emotion recognition models that use the same modality data. It can be seen that our model achieves state-of-the-art results in both weighted and unweighted accuracy. The comparison further demonstrates the effectiveness of our proposed model.
Table 6: Quantitative comparison with mainstream multi-modal methods on the IEMOCAP data set.
Method  Weighted accuracy  Unweighted accuracy  Year
Xu et al. [24]  70.40%  69.50%  2019
Liu et al. [59]  72.40%  70.10%  2020
Makiuchi et al. [31]  73.50%  73.00%  2021
Cai et al. [32]  78.15%  -  2021
Morais et al. [60]  77.36%  77.76%  2022
Ghosh et al. [52]  77.64%  -  2022
GCF-Net (our model) 82.80% 82.01% -
In conclusion, the invention provides a global perception cross-modal feature fusion network (GCF-Net) for speech emotion recognition. In GCF-Net, a residual cross-modal fusion attention module is designed to help the network extract rich features from audio and text, a global perception block is added behind the residual cross-modal fusion attention module to further extract emotion-rich features at the global scale, and automatic speech recognition is introduced as an auxiliary task for calculating the connectionist temporal classification loss. Experimental results on the IEMOCAP data set show that the residual cross-modal fusion attention module, the global perception module and the auxiliary connectionist temporal classification loss all improve the performance of the model.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A speech emotion recognition method based on a global perception cross-modal feature fusion network, characterized by comprising the following main steps:
S1: respectively extracting the wav2vec2.0 features and the text features through pre-trained transfer-learning models;
S2: fusing the features from the different modalities through a residual cross-modal fusion attention module;
S3: introducing a global perception module to capture the important emotion information of the multi-modal fusion features at different scales;
S4: performing numerous experiments on the IEMOCAP data set to verify the effectiveness of the method.
2. The speech emotion recognition system based on the global perception cross-modal feature fusion network according to claim 1, comprising a multi-modal emotion recognition model, wherein: the multi-modal emotion recognition model comprises an SER part and an ASR part; the SER part comprises a wav2vec2.0 encoder, a Roberta-base encoder, a residual cross-modal fusion attention module, a global perception block and a fully connected layer; the ASR part comprises a transcription layer, an audio feature layer and a fully connected layer; the SER part calculates the CrossEntropy loss from the predicted emotion labels and the true emotion labels, the ASR part calculates the CTC loss from the audio features of the wav2vec2.0 model and the corresponding text transcription, and finally the CrossEntropy loss and the CTC loss are added to obtain the loss value of the training stage.
3. The system according to claim 2, wherein: the multi-modal emotion recognition model comprises a problem statement;
Data set D has k utterances u_i, and each utterance corresponds to a label l_i. Each utterance is composed of a speech segment a_i and a text transcription t_i, where u_i = (a_i, t_i) and t_i is the ASR-transcribed text or the manually annotated text. The proposed network model takes u_i as input and assigns the correct emotion label to any given utterance.
<U, L> = {{u_i = <a_i, t_i>, l_i} | i ∈ [1, k]}   (1)
4. The system according to claim 2, wherein: the multi-modal emotion recognition model comprises feature encoding;
In feature encoding, the audio information and text information of each utterance are encoded into wav2vec2.0 features and text features by the corresponding encoders and then input to the proposed model.
5. The system according to claim 2, wherein: the multi-modal emotion recognition model comprises speech encoding;
The wav2vec2.0 features contain the rich prosodic information needed for emotion recognition. In our model, we use a pre-trained wav2vec2.0 model as the raw-audio waveform encoder to extract the wav2vec2.0 features; the model is based on the transformer structure and represents the speech audio sequence by fitting a set of ASR modeling units shorter than phonemes. After comparing the two versions of the wav2vec2.0 model, we choose the wav2vec2-base model with a hidden dimension of 768. We input the audio data a_i of the i-th utterance into the pre-trained wav2vec2.0 model to obtain a contextual embedded representation A_i, where D_A denotes the size of the audio feature embedding; this can be expressed as:
A_i = F_wav2vec2.0(a_i), A_i ∈ ℝ^(j×D_A)   (2)
where F_wav2vec2.0 denotes the pre-trained wav2vec2.0 model acting as the audio feature processor, and j depends on the length of the raw audio and on the CNN feature extraction layer of the wav2vec2.0 model, which extracts frames from the raw audio with a stride of 20 ms and a window of 25 ms. In our experiments, the parameters of the CNN feature extraction layer are kept fixed.
6. The system according to claim 2, wherein: the multi-modal emotion recognition model comprises a contextual text representation;
The text data are input into the roberta-base model for encoding. Before the text features are extracted, the input text is tokenized and separator tokens are added to separate the sentences and prevent semantic confusion. The tokenized text data are then fine-tuned together with the corresponding utterances, and the contextual text embedding T_i can be expressed as:
T_i = F_Roberta-base(t_i), T_i ∈ ℝ^(m×D_T)   (3)
where F_Roberta-base denotes the text feature extraction function, m depends on the number of tokens in the text, and D_T is the dimension of the text feature embedding.
7. The system according to claim 2, wherein: the residual cross-modal fusion attention module is composed of two parallel fusion blocks for the different modalities, namely a speech-text residual cross-modal fusion attention block and a text-speech residual cross-modal fusion attention block, and is used for completing the multi-modal information interaction using a cross-modal attention mechanism.
8. The system according to claim 7, wherein: the speech-text residual cross-modal fusion attention block takes the audio features A_i as the query and the text features T_i as the keys and values, and performs the audio-text interaction with a multi-head attention mechanism; the text-speech residual cross-modal fusion attention block takes the text features T_i as the query and the audio features A_i as the keys and values; the audio features and text features are first made to interact through the multi-head attention mechanism, the interacted features then pass through a linear layer, a normalization layer and a dropout layer, and are finally connected to the block's initial audio features or text features through the residual structure;
[Equation: output of the first residual cross-modal fusion attention block, given by the learning function Φ_1 applied to the initial audio features A_i and text features T_i]
where Φ_1 denotes the learning function of the first speech-text residual cross-modal fusion attention block or of the first text-speech residual cross-modal fusion attention block;
In addition, the output of the first residual cross-modal fusion attention block is combined with the initial audio features or text features and fed to the second residual cross-modal fusion attention block, so that a plurality of residual cross-modal fusion attention blocks are stacked together to generate the corresponding multi-modal fusion features F_A→T and F_T→A:
speech-text: [Equation: the stacked speech-text residual cross-modal fusion attention blocks produce F_A→T]
text-speech: [Equation: the stacked text-speech residual cross-modal fusion attention blocks produce F_T→A]
Our fusion strategy differs from previously proposed multi-modal fusion models in that, to better integrate the multi-modal information, we always keep the keys and values unchanged: the keys and values of every residual cross-modal fusion attention block are the initial audio and text features of the module. Finally, the outputs of the two residual cross-modal fusion attention blocks are concatenated to generate the final multi-modal fusion feature F_fusion:
F_fusion = Concat(F_A→T, F_T→A)
9. The system according to claim 2, wherein: after the multi-modal fusion feature is extracted by the residual cross-modal fusion attention module, the global perception module captures the global information of the multi-modal fusion feature; the size of the multi-modal fusion feature varies with the lengths of the audio (j) and the text (m); therefore, before the multi-modal fusion feature is input into the global perception block, its variable sequence dimension must be mapped to a fixed length (for example, setting the values of j and m to l); the global perception module is composed of two fully connected layers, a convolutional layer, two normalization layers, a Gaussian error linear unit activation function and a multiplication operation; the output dimensions of the first and last fully connected layers in the module are 4·D_f and D_f respectively (where D_f = 768); after projection through the Gaussian error linear unit activation function, the output is split into two parts of size 2·D_f, and their multiplication enhances cross-dimensional feature mixing; finally, the output of the global perception fusion module is integrated for classification, and the corresponding equations are described as follows:
F_global-aware = Φ_global-aware(F_fusion)   (8)
y_i = FC(F_global-aware), y_i ∈ ℝ^C   (9)
where Φ_global-aware denotes the function applied to the multi-modal fusion feature by the global perception module, and C is the number of emotion classes.
10. The system according to claim 2, wherein: the multi-modal emotion recognition model comprises a connectionist temporal classification (CTC) layer, which uses the connectionist temporal classification loss as a loss function to effectively back-propagate gradients; therefore, we use the wav2vec2.0 features A_i and the text transcription information t_i to calculate the connectionist temporal classification loss;
L_CTC = CTC(A_i, t_i)   (10)
where V = 32 is the size of our vocabulary, consisting of the 26 letters of the alphabet and some punctuation marks;
[Equation (11)]
Furthermore, we use the output feature y_i of the global perception block and the true emotion label l_i to calculate the cross-entropy loss;
L_CrossEntropy = CrossEntropy(y_i, l_i)   (12)
Finally, we introduce a hyper-parameter α to combine the two loss functions into a single loss; α effectively controls the relative importance of the connectionist temporal classification loss.
L = L_CrossEntropy + α·L_CTC, α ∈ (0, 1)   (13)
CN202211489099.7A 2022-11-25 2022-11-25 Voice emotion recognition method based on global perception cross-modal feature fusion network Pending CN115730203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211489099.7A CN115730203A (en) 2022-11-25 2022-11-25 Voice emotion recognition method based on global perception cross-modal feature fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211489099.7A CN115730203A (en) 2022-11-25 2022-11-25 Voice emotion recognition method based on global perception cross-modal feature fusion network

Publications (1)

Publication Number Publication Date
CN115730203A true CN115730203A (en) 2023-03-03

Family

ID=85298296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211489099.7A Pending CN115730203A (en) 2022-11-25 2022-11-25 Voice emotion recognition method based on global perception cross-modal feature fusion network

Country Status (1)

Country Link
CN (1) CN115730203A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116089906A (en) * 2023-03-13 2023-05-09 山东大学 Multi-mode classification method and system based on dynamic context representation and mode fusion
CN116778967A (en) * 2023-08-28 2023-09-19 清华大学 Multi-mode emotion recognition method and device based on pre-training model
CN116778967B (en) * 2023-08-28 2023-11-28 清华大学 Multi-mode emotion recognition method and device based on pre-training model

Similar Documents

Publication Publication Date Title
Atmaja et al. Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion
CN113205817B (en) Speech semantic recognition method, system, device and medium
Kim et al. DNN-based emotion recognition based on bottleneck acoustic features and lexical features
CN113420807A (en) Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
Ren et al. Intention detection based on siamese neural network with triplet loss
CN113836277A (en) Machine learning system for digital assistant
CN115730203A (en) Voice emotion recognition method based on global perception cross-modal feature fusion network
Chen et al. Multimodal emotion recognition with temporal and semantic consistency
Raymond et al. On the use of finite state transducers for semantic interpretation
Zhang et al. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects
Wang et al. Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition.
Zhang et al. Multi-head attention fusion networks for multi-modal speech emotion recognition
CN114676259B (en) Conversation emotion recognition method based on causal perception interactive network
CN115630145A (en) Multi-granularity emotion-based conversation recommendation method and system
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
de Velasco et al. Emotion Detection from Speech and Text.
CN114386426B (en) Gold medal speaking skill recommendation method and device based on multivariate semantic fusion
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Goel et al. Emotion-aware transformer encoder for empathetic dialogue generation
CN114005446A (en) Emotion analysis method, related equipment and readable storage medium
CN117591648A (en) Power grid customer service co-emotion dialogue reply generation method based on emotion fine perception
Kakuba et al. Deep learning approaches for bimodal speech emotion recognition: Advancements, challenges, and a multi-learning model
Li et al. GCF2-Net: Global-aware cross-modal feature fusion network for speech emotion recognition
Cao et al. Acoustic and lexical representations for affect prediction in spontaneous conversations
Pandey et al. Multimodal sarcasm detection (msd) in videos using deep learning models

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination