CN114743569A - Speech emotion recognition method based on double-layer fusion deep network - Google Patents
- Publication number
- CN114743569A (application CN202210419568.1A)
- Authority
- CN
- China
- Prior art keywords
- fusion
- layer
- network
- level
- text
- Prior art date
- Legal status: Pending
Classifications
- G10L25/63—Speech or voice analysis techniques specially adapted for estimating an emotional state
- G10L25/18—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique, using neural networks
Abstract
The invention relates to a speech emotion recognition method based on a double-layer fusion deep network. To obtain rich cross-modal information, it fuses the feature vectors of the speech and text modalities and captures the complex associations between audio and text for emotion recognition. First, the speech and text are preprocessed to obtain audio and text feature vectors, and the audio and text features are cross-fused by an FBP feature fusion module. The fused feature vectors are then passed through a level-1 primary feature coding network consisting of three sub-models (LSTM, GRU and DNN). The outputs of the three level-1 sub-networks are fused a second time by Hadamard product to encode high-level features, the fused features are fed into a level-2 BiLSTM coding network, and a classification output layer finally predicts the emotion category. Experiments with the fusion algorithm on the public IEMOCAP dataset reach 80.38% WA and 78.62% UA, among the better results currently reported in the field of speech emotion recognition.
Description
Technical Field
The invention belongs to the field of speech emotion recognition and relates to a multi-modal feature fusion speech emotion recognition algorithm.
Background
At present, the field of speech emotion recognition has three main research directions. The first is based on traditional machine learning: emotion-related features are extracted manually and classified with machine learning methods. The second is emotion recognition based on deep neural networks, which realize end-to-end emotion recognition systems. The third combines speech and text for multi-modal emotion recognition.
Multi-modal speech emotion recognition is a new research hotspot in recent years and has attracted extensive attention. Its main focus is combining speech and text for multi-modal emotion recognition, considering that different speech characteristics influence emotion differently, with the aim of improving the recognition rate of the system. A multi-modal recognition framework is the likely future trend of speech emotion recognition.
Speech emotion recognition is a challenging task because emotional expression is complex and multi-modal. The keys to improving speech emotion recognition performance are, first, the extraction of emotion features and, second, the fusion strategy used when multi-modal features are combined. The mainstream approach to feature extraction uses deep neural networks to extract emotional features, and many studies have verified that this clearly improves performance. However, innovation in feature fusion is still limited: most methods use simple fusion operations such as addition or concatenation, and none of these fully exploits the complementarity between the modalities.
In prior research in the field of multi-modal emotion recognition, the feature fusion methods used in published work on multi-modal emotion recognition based on speech-text fusion likewise rely on simple operations such as addition and concatenation. Clearly, such methods cannot fully exploit the complementarity of emotional features across modalities.
Disclosure of Invention
The problem addressed by the invention is to improve existing speech emotion recognition methods and systems by providing a multi-modal feature fusion scheme that raises the recognition rate. To achieve this technical purpose, the technical scheme adopted by the invention comprises the following steps:
Step 1: preprocess the speech and text data so that they meet the input requirements of the network model;
Step 2: input the speech and text feature vectors preprocessed in Step 1 into a factorized bilinear pooling fusion module (FBP) for primary feature fusion;
Step 3: pass the fused primary features output by the FBP module in Step 2 through a level-1 primary feature coding network consisting of three sub-models, LSTM, GRU and DNN;
Step 4: fuse the outputs of the three level-1 sub-networks a second time to encode high-level features, using the Hadamard product as the fusion method, then feed the fused features into a level-2 BiLSTM coding network, and finally connect a classification output layer to predict the emotion category;
Step 5: train the network model.
Further, the text signal preprocessing uses word embedding: a pre-trained GloVe model represents each word as a vector.
Further, the speech signal preprocessing windows and frames the audio signal using a Hamming window with a window length of 25 ms and a frame shift of 10 ms, applies a Fourier transform to each frame, and finally applies Mel filtering to obtain Mel spectrum features.
Further, the alignment operation combines the speech frames associated with each word to obtain the speech feature corresponding to that word.
Further, the network model comprises four layers in total: the first layer is the FBP primary feature fusion layer; the second layer is the level-1 primary feature coding network layer consisting of the three sub-models LSTM, GRU and DNN; the third layer is the Hadamard fusion layer; the fourth layer is the high-level coding network layer consisting of the BiLSTM.
Further, the fusion follows this algorithmic process: the audio and text feature vectors are input; the audio and text features are first cross-fused by the FBP fusion module; the fused features are passed through the level-1 primary feature coding network consisting of the three sub-models LSTM, GRU and DNN; the outputs of the three level-1 sub-networks are then fused a second time by Hadamard product to encode high-level features; the fused features are fed into the level-2 BiLSTM coding network; and finally a classification output layer predicts the emotion category.
Further, the network model is trained with an Adam optimizer minimizing a cross-entropy loss function, with the learning rate set to 0.0001, the batch size set to 100, and L2 regularization used to prevent overfitting.
The invention has the beneficial effects that:
(1) The network structure of the invention is hierarchical: the first-layer sub-networks encode primary features, and the second-layer sub-network maps the primary features to high-level features, which effectively establishes the hierarchical relation between high-level and low-level features.
(2) The invention fully exploits the feature complementarity between speech and text information and can capture the complex associations between audio and text. Compared with prior-art methods, the recognition rate of the model is greatly improved.
Drawings
FIG. 1 is a diagram of a system model of the present invention;
FIG. 2 is a diagram of a decomposition bilinear pooled feature fusion module (FBP);
FIG. 3 is the confusion matrix of the proposed algorithm's experiments on the IEMOCAP dataset;
FIG. 4 is a graph of the change in model loss function;
FIG. 5 shows graphs of precision, recall and F1 score.
Detailed Description
The technical scheme of the invention is further explained below with reference to the accompanying drawings.
System model
The model structure provided by the invention comprises three stages. The first is low-level feature fusion, which uses a factorized bilinear pooling fusion module (FBP) to fuse the speech feature vector and the text feature vector. The second is high-level feature fusion: the low-level features output by the FBP module are passed through the three sub-models LSTM, GRU and DNN, their outputs are fused a second time, and the high-level features are encoded by concatenating pairwise Hadamard products. The third is the emotion classification stage: the high-level feature vector is fed into a fully connected layer, which reduces the dimension of the fused high-level emotion features, and a softmax function finally outputs the emotion scores and makes the emotion decision. The system model is shown in FIG. 1.
Speech signal pre-processing
The speech emotion features of the invention are Mel spectra; a Mel spectrum is an ordinary spectrogram passed through a Mel filter bank. The extraction process is as follows.
Given a speech signal, it is windowed and framed and denoted S = {s_1, s_2, s_3, ..., s_n}, where n is the total number of speech frames. The window used here is a 25 ms Hamming window with a 10 ms frame shift. A Fourier transform is then applied to each frame in S to obtain its frequency-domain representation x_t:

x_t(k) = Σ_{m=0}^{M-1} s_t(m) · e^{-j2πkm/M}, k = 0, 1, ..., M-1,

where M is the number of Fourier points and x_t(k) is the k-th value of the vector x_t.

The Mel transformation

mel(f) = 2595 · log_10(1 + f/700)

then converts x_t from the linear frequency scale to the Mel scale, where f is the frequency scale after the Fourier transform. Finally, a bank of filters designed on the Mel scale filters the spectrum of each frame, yielding the Mel spectrum features.
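As an illustration of this preprocessing step, the following Python sketch extracts a Mel spectrogram with a 25 ms Hamming window and a 10 ms frame shift using librosa; the sample rate, number of Fourier points and number of Mel bands are assumptions, since the description fixes only the window length and frame shift.

```python
import librosa


def extract_mel_spectrogram(wav_path, sr=16000, n_mels=40):
    """Mel-spectrum features: 25 ms Hamming window, 10 ms frame shift.

    sr, n_fft and n_mels are illustrative assumptions; the description only
    fixes the window length (25 ms) and the frame shift (10 ms).
    """
    signal, sr = librosa.load(wav_path, sr=sr)
    win_length = int(0.025 * sr)   # 25 ms window
    hop_length = int(0.010 * sr)   # 10 ms frame shift
    mel = librosa.feature.melspectrogram(
        y=signal,
        sr=sr,
        n_fft=512,                 # number of Fourier points M (assumed)
        win_length=win_length,
        hop_length=hop_length,
        window="hamming",
        n_mels=n_mels,
    )
    # Log compression is common practice, though not stated in the description.
    return librosa.power_to_db(mel).T   # shape: (n_frames, n_mels)
```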
Text signal pre-processing
The invention uses the well-known GloVe algorithm to preprocess the text information, representing each word by a vector. The main idea of GloVe is to represent the relationship between words i and j through a co-occurrence matrix, and to vectorize words so that the vectors carry as much semantic and syntactic information as possible. The algorithm proceeds as follows:
1. and constructing a co-occurrence matrix X and solving the probability. The mathematical expression is as follows:
wherein p isijRepresenting the probability of co-occurrence of words i and j in the context.
2. And constructing an approximate relation between the word vector and the co-occurrence matrix X. Given that word k represents a word that may appear in their vicinity, the relevance between it and i and j is judged.
The correlation between k and i and j can be obtained. If the correlation is small, the result of the formula is close to 1, and if the correlation is large, the result of the formula is far from 1. Wherein p isikRepresenting co-occurrence probability, p, of words i and kjkRepresenting the co-occurrence probability of words j and k.
3. Construct the loss function

J = Σ_{i,j=1}^{V} f(X_ij) · (w_i^T w̃_j + b_i + b̃_j - log X_ij)^2,

where V is the vocabulary size, w_i and w̃_j are word vectors, b_i and b̃_j are biases, and f(X_ij) is a weighting function.

The weighting function is

f(x) = (x / x_max)^α if x < x_max, and f(x) = 1 otherwise.

Based on experience, the GloVe authors consider x_max = 100 and α = 0.75 a good choice. If two words never appear together the weight is 0, and f(x) must be a non-decreasing function.
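A minimal NumPy sketch of the weighting function and loss just described follows; the toy co-occurrence matrix, the embedding dimension and α = 0.75 are illustrative assumptions beyond what the description fixes (x_max = 100).

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max)^alpha below x_max, otherwise 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_loss(X, W, W_tilde, b, b_tilde):
    """J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over X_ij > 0."""
    i, j = np.nonzero(X)                       # only co-occurring word pairs
    inner = np.sum(W[i] * W_tilde[j], axis=1)  # w_i . w~_j
    diff = inner + b[i] + b_tilde[j] - np.log(X[i, j])
    return np.sum(glove_weight(X[i, j]) * diff ** 2)


# Toy example: 4-word vocabulary, 5-dimensional embeddings (illustrative only).
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(4, 4)).astype(float)
W, W_tilde = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))
b, b_tilde = rng.normal(size=4), rng.normal(size=4)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```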
4. Training model
For the text data in the IEMOCAP dataset, the text information is vectorized with the trained GloVe word embedding model, with a dimensionality of 300. The maximum sequence length of the text data is kept at 500, finally yielding a (500, 300)-dimensional word-vector matrix for text emotion analysis.
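The sketch below shows one way to turn a transcript into the (500, 300) word-vector matrix described above; the embedding file name and the zero-padding / truncation strategy are assumptions, not specified by the description.

```python
import numpy as np


def load_glove(path="glove.6B.300d.txt", dim=300):
    """Load pre-trained GloVe vectors into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def embed_transcript(text, glove, max_len=500, dim=300):
    """Map a transcript to a (max_len, dim) matrix, truncating or zero-padding to 500 words."""
    words = text.lower().split()[:max_len]
    mat = np.zeros((max_len, dim), dtype=np.float32)
    for k, w in enumerate(words):
        if w in glove:                 # out-of-vocabulary words stay zero (assumed handling)
            mat[k] = glove[w]
    return mat
```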
Speech text alignment
This section introduces the method for aligning speech and text. Let the speech representation be S = {s_1, ..., s_i, ..., s_n}, where n is the number of speech frames, and the text word-vector representation be T = {t_1, ..., t_j, ..., t_m}, where m is the number of words in the text. Since one word corresponds to several frames, the speech feature of each word can be obtained by combining the speech frames associated with that word; in other words, there is a strong connection between each word and its associated speech frames. Starting from this point, we first compute the similarity between each word and each speech frame,

a_ij = similarity(t_j, s_i),

where t_j is the j-th word vector, s_i is the i-th speech frame, and a_ij is the similarity between them; a larger a_ij means the two vectors are more similar.

The a_ij are then normalized over the frames to obtain the probability α_ij associating word j with speech frame i.

Finally, all frames are weighted and summed to obtain the speech feature aligned with the j-th word,

u_j = Σ_i α_ij · s_i,

and the aligned speech is denoted U = {u_1, ..., u_j, ..., u_m}.
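A sketch of the alignment just described: similarity scores between each word vector and each speech frame are normalized over the frames and used to form a weighted sum. The dot-product similarity and the softmax normalization are assumptions (the exact similarity formula is not reproduced here), and the word vectors and frame features are assumed to have already been projected to a common dimension d.

```python
import numpy as np


def align_speech_to_words(S, T):
    """Align speech frames to words.

    S: (n, d) speech-frame features, T: (m, d) word vectors, assumed to share
    a common dimension d.  Returns U: (m, d) word-aligned speech features.
    """
    A = T @ S.T                                    # a_ij: similarity of word j and frame i (assumed dot product)
    A = A - A.max(axis=1, keepdims=True)           # numerical stability
    alpha = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # normalize over frames (assumed softmax)
    U = alpha @ S                                  # u_j = sum_i alpha_ij * s_i
    return U


# Toy usage: 120 frames and 7 words in a 64-dimensional common space.
rng = np.random.default_rng(1)
U = align_speech_to_words(rng.normal(size=(120, 64)), rng.normal(size=(7, 64)))
print(U.shape)   # (7, 64)
```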
Introduction to data set
The IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset is an emotion recognition dataset widely used in English multi-modal research, recorded by the SAIL laboratory of the University of Southern California. It is divided into 5 sessions recorded by 10 actors (5 male and 5 female), one male-female pair per session, and is widely cited in the field of speech recognition. The IEMOCAP multi-modal dataset comprises about 12 hours of audio and video in total, covering three modalities: audio, semantic text and facial expression. The speech in the dataset is labeled with ten emotions in total: happy, angry, sad, frustrated, neutral, excited, fear, disgust, surprised and other. To stay consistent with existing work, only four emotion classes are used here: angry, sad, happy and neutral. The distribution of the four emotion categories is shown in the following table.
| Emotion | Angry | Happy | Neutral | Sad | Total |
|---|---|---|---|---|---|
| Utterances | 1103 | 1636 | 1707 | 1084 | 5531 |

Fusion algorithm process of the invention

1. The factorized bilinear pooling fusion module (FBP) is shown in FIG. 2, with x and y denoting the preprocessed speech and text features, x ∈ R^m, y ∈ R^n. The specific operation is as follows.
For a single output element, the factorized bilinear model computes

z_i = 1^T (U_i^T x ∘ V_i^T y),

where k is the latent dimension of the factorized matrices, U_i = [u_1, ..., u_k] ∈ R^{m×k}, V_i = [v_1, ..., v_k] ∈ R^{n×k}, 1 ∈ R^k, and ∘ denotes the Hadamard (element-wise) product.

To obtain the output feature z ∈ R^o, the weights to be learned are two third-order tensors U = [U_1, ..., U_o] ∈ R^{m×k×o} and V = [V_1, ..., V_o] ∈ R^{n×k×o}. U and V can be reshaped into two-dimensional matrices Ũ ∈ R^{m×ko} and Ṽ ∈ R^{n×ko}, after which

z = SumPooling(Ũ^T x ∘ Ṽ^T y, k),

where the function SumPooling(x, k) performs sum pooling on x with a one-dimensional non-overlapping window of size k. Dropout is used to prevent overfitting. Because the element-wise multiplication makes the output amplitude vary greatly, the energy of z is normalized to 1 using L2 normalization.
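A PyTorch sketch of a factorized bilinear pooling block consistent with the description above; the concrete dimensions (m, n, k, o) and the dropout rate are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FBPFusion(nn.Module):
    """Factorized bilinear pooling: z = SumPool(U~'x * V~'y, k), then L2-normalize."""

    def __init__(self, audio_dim, text_dim, out_dim, k=4, dropout=0.3):
        super().__init__()
        self.k = k
        self.U = nn.Linear(audio_dim, out_dim * k, bias=False)  # U~ in R^{m x ko}
        self.V = nn.Linear(text_dim, out_dim * k, bias=False)   # V~ in R^{n x ko}
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, y):
        # Hadamard product of the two projected features.
        z = self.U(x) * self.V(y)                            # (..., out_dim * k)
        z = self.dropout(z)
        # Sum pooling with a non-overlapping window of size k.
        z = z.view(*z.shape[:-1], -1, self.k).sum(dim=-1)    # (..., out_dim)
        # Normalize the energy of z to 1 (L2 normalization).
        return F.normalize(z, p=2, dim=-1)


# Example: fuse 40-dim audio features with 300-dim word vectors for 50 words.
fbp = FBPFusion(audio_dim=40, text_dim=300, out_dim=128)
z = fbp(torch.randn(8, 50, 40), torch.randn(8, 50, 300))     # (batch, words, 128)
print(z.shape)
```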
2. The second part of the network extracts deep features with three different neural networks, LSTM, GRU and DNN, to obtain different representations. The primary fused feature vector z_t output by the FBP module is fed to the LSTM, GRU and DNN networks, whose outputs are denoted h_t^LSTM, h_t^GRU and h_t^DNN, respectively.
3. The high-level feature fusion stage then combines the three kinds of features by taking pairwise Hadamard products and concatenating them into one vector, giving the high-level fused feature at time t:

C_t = [h_t^LSTM ∘ h_t^GRU ; h_t^LSTM ∘ h_t^DNN ; h_t^GRU ∘ h_t^DNN],

where the sequence of fused features is {C_1, C_2, ..., C_m}, m is the number of words, and ∘ denotes element-wise multiplication.
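The level-1 encoding and the pairwise Hadamard fusion can be sketched as follows; the hidden size, the DNN structure, and the assumption that all three sub-network outputs share one dimension are illustrative.

```python
import torch
import torch.nn as nn


class Level1HadamardFusion(nn.Module):
    """LSTM / GRU / DNN sub-encoders whose outputs are fused by pairwise Hadamard products."""

    def __init__(self, in_dim=128, hid=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid, batch_first=True)
        self.gru = nn.GRU(in_dim, hid, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())

    def forward(self, z):                      # z: (batch, m words, in_dim)
        h_lstm, _ = self.lstm(z)               # (batch, m, hid)
        h_gru, _ = self.gru(z)                 # (batch, m, hid)
        h_dnn = self.dnn(z)                    # (batch, m, hid)
        # Pairwise Hadamard products, concatenated into the high-level feature C_t.
        return torch.cat([h_lstm * h_gru, h_lstm * h_dnn, h_gru * h_dnn], dim=-1)


fusion = Level1HadamardFusion()
c = fusion(torch.randn(8, 50, 128))
print(c.shape)                                 # torch.Size([8, 50, 384])
```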
4. The fused vectors C_t from the previous step are then fed into a BiLSTM network for encoding: the forward LSTM produces the forward feature vector, the backward LSTM produces the backward feature vector, and the two are combined as the BiLSTM output.
5. Finally, the feature vector is passed through a fully connected layer for dimension reduction, and a softmax function produces the predicted emotion distribution ŷ of the utterance; a cross-entropy loss is used for training:

ŷ = softmax(f_θ(·)),   L = -Σ_k y_k log ŷ_k,

where y_k is the ground-truth distribution of the sample class, ŷ_k is the predicted distribution, and f_θ(·) is the fully connected network with parameters θ.
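A sketch of the level-2 BiLSTM encoder and the classification head described above; taking the final time step of the BiLSTM as the utterance representation and the hidden size are assumptions, and the softmax is folded into the cross-entropy loss as is usual in PyTorch.

```python
import torch
import torch.nn as nn


class Level2Classifier(nn.Module):
    """Level-2 BiLSTM encoder followed by a fully connected layer and softmax."""

    def __init__(self, in_dim=384, hid=128, n_classes=4):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hid, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hid, n_classes)

    def forward(self, c):                       # c: (batch, m, in_dim)
        h, _ = self.bilstm(c)                   # forward and backward states, (batch, m, 2 * hid)
        # Utterance-level representation: last time step of the BiLSTM (assumed pooling).
        return self.fc(h[:, -1, :])             # logits; softmax is applied inside the loss


model = Level2Classifier()
logits = model(torch.randn(8, 50, 384))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (8,)))  # cross-entropy loss
print(loss.item())
```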
Network training
When training the model, an Adam optimizer was used to minimize the cross-entropy loss function, the learning rate was set to 0.0001, the batch size was 100, and L2 regularization was used to prevent overfitting of the model.
The Adam optimization algorithm estimates the first and second moments of the gradient of the cross-entropy loss function and applies them to the parameter update:

m_t = β_1 · m_{t-1} + (1 - β_1) · g_t
v_t = β_2 · v_{t-1} + (1 - β_2) · g_t^2
θ_t = θ_{t-1} - lr · m̂_t / (sqrt(v̂_t) + ε),  with m̂_t = m_t / (1 - β_1^t), v̂_t = v_t / (1 - β_2^t),

where m_t and v_t are the first and second moments of the cross-entropy loss gradient, the hyper-parameters β_1 and β_2 control their decay, g_t is the gradient of the function, θ_t is the updated parameter value, and lr is the learning rate.
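A training-loop sketch reflecting the stated settings (Adam, learning rate 0.0001, batch size 100, cross-entropy loss, L2 regularization via weight decay); the weight-decay coefficient, epoch count and the toy model and data are placeholders, not values fixed by the description.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train(model, dataset, epochs=50, weight_decay=1e-5):
    """Train with Adam (lr=0.0001), batch size 100, cross-entropy loss and L2 regularization.

    weight_decay (the L2 coefficient) and the epoch count are assumptions; the
    description specifies only the optimizer, learning rate and batch size.
    """
    loader = DataLoader(dataset, batch_size=100, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()


# Toy usage with random data and a linear stand-in model.
toy_model = nn.Linear(384, 4)
toy_data = TensorDataset(torch.randn(500, 384), torch.randint(0, 4, (500,)))
train(toy_model, toy_data, epochs=1)
```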
The parameter settings of the network are shown in the following table:
simulation parameter table
Experimental results:
The confusion matrix of our proposed efficient fusion algorithm model on the IEMOCAP dataset is shown in FIG. 3. From the confusion matrix, the accuracies of the four emotions happy, sad, neutral and angry are 91.69%, 73.95%, 73.68% and 75.16% respectively, and the overall accuracy reaches 80.38%. This verifies the validity and feasibility of the proposed method on the IEMOCAP dataset.
The invention mainly measures model performance with weighted accuracy (WA) and unweighted accuracy (UA): WA evaluates the overall performance of the model, while UA evaluates the classification result for each emotion class. They are computed as

WA = Σ_i TP_i / Σ_i (TP_i + FN_i),   UA = (1/L) Σ_i TP_i / (TP_i + FN_i),

where L is the number of emotion classes, TP_i is the number of class-i samples predicted correctly, and FN_i is the number of class-i samples predicted incorrectly. A sketch of this computation is given below.
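The following sketch computes WA and UA from a confusion matrix, matching the definitions above; the example confusion matrix is illustrative only, not the patent's results.

```python
import numpy as np


def weighted_unweighted_accuracy(confusion):
    """WA = overall accuracy; UA = mean of the per-class recalls.

    confusion[i, j] = number of class-i samples predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)                    # TP_i
    per_class = tp / confusion.sum(axis=1)     # TP_i / (TP_i + FN_i)
    wa = tp.sum() / confusion.sum()
    ua = per_class.mean()
    return wa, ua


# Illustrative 2-class example (not the patent's data).
print(weighted_unweighted_accuracy([[90, 10],
                                    [30, 70]]))   # (0.8, 0.8)
```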
To further measure the performance of the proposed speech emotion recognition model, precision, recall and the F1 score are also reported:

precision = TP / (TP + FP),   recall = TP / (TP + FN),   F1 = 2 · precision · recall / (precision + recall),

where TP means both the prediction and the label are positive, FP means the prediction is positive but the label is negative, TN means both the prediction and the label are negative, and FN means the prediction is negative but the label is positive. The F1 score is the harmonic mean of precision and recall, and larger F1 values are better. The precision, recall and F1 results are shown in the following table, and a sketch of the computation follows the table.
| Emotion | TP | FN | FP | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| happy | 397 | 36 | 93 | 0.8102 | 0.9169 | 0.8602 |
| sad | 176 | 62 | 31 | 0.8502 | 0.7395 | 0.7910 |
| neutral | 280 | 100 | 82 | 0.7735 | 0.7368 | 0.7547 |
| angry | 118 | 39 | 31 | 0.7919 | 0.7516 | 0.7712 |
| Average | | | | 0.8065 | 0.7862 | 0.7943 |
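The per-class scores can be reproduced from the TP/FN/FP counts in the table with the formulas above; the helper below is a sketch, checked against the "happy" row.

```python
def precision_recall_f1(tp, fn, fp):
    """precision = TP/(TP+FP); recall = TP/(TP+FN); F1 = harmonic mean of the two."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# "happy" row of the table: TP=397, FN=36, FP=93.
print(precision_recall_f1(397, 36, 93))   # approximately (0.8102, 0.9169, 0.8602)
```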
Comparison with other methods
To better verify the validity and scientificity of the improved algorithm, the algorithm was compared with other algorithm models on an IEMOCAP data set, and the results are shown in the following table.
1. Hengshun Zhou et al. use a bilinear pooling algorithm as the feature fusion approach. (ACM 2021)
2. Haiyang Xu et al. use concatenation of speech and text features as the fusion approach. (INTERSPEECH 2019)
3. Md Asif Jalal et al. concatenate the outputs of multiple attention modules as the fusion approach. (INTERSPEECH 2020)
4. Ming Chen et al. concatenate the mixed (fused) features together with the individual modality features. (INTERSPEECH 2020)
5. Qi Cao et al. fuse using speech-text consistency and a contextual attention mechanism. (2021)
Comparison with different models
As shown in the table above, the method proposed by the invention achieves 80.38% WA and 78.62% UA in the four-class emotion recognition experiment on the IEMOCAP dataset, and its WA is the best among all compared methods. The worst result comes from Haiyang Xu et al., whose simple linear concatenation fusion reaches only 70.4% WA and 69.5% UA, probably because simple concatenation does not fully capture the complementary relationship between the two modalities. Ming Chen et al. additionally concatenate the two separate features in parallel on top of the approach of Haiyang Xu et al., taking the correlations between and within the two modalities into account, and obtain 71.06% WA and 72.05% UA, an improvement in performance. Yuanyuan Zhang et al., who fuse the two features with bilinear pooling, reach 75.49% WA. The invention performs a second feature fusion on top of bilinear pooling feature fusion to obtain a rich cross-modal feature representation, thereby improving WA by 4.89%. Qi Cao et al. achieve 78.74% WA and 79.77% UA using speech-text consistency and contextual attention fusion, slightly higher than our approach in UA. In summary, these comparative experiments verify that the algorithm provided by the invention has a significant advantage over the other algorithms.
To address the shortcomings of feature fusion in the field of multi-modal emotion recognition, a speech emotion recognition method based on a double-layer fusion deep network is proposed that can capture the complex associations between audio and text. The method performs a first fusion with the factorized bilinear pooling fusion module (FBP) and then a second fusion, which outperforms concatenation and summation and fuses the features of each modality more fully at lower time complexity. The network structure is hierarchical: the first-layer sub-networks encode primary features and the second-layer network maps them to high-level features, effectively establishing the hierarchical relation between high-level and low-level features. Experiments on the IEMOCAP dataset reach 80.38% WA and 78.62% UA. Compared with the related work of other researchers, the recognition rate is greatly improved, verifying the feasibility and effectiveness of the proposed efficient fusion algorithm.
The scope of the present invention is defined by the claims.
Claims (7)
1. A speech emotion recognition method based on a double-layer fusion deep network is characterized by comprising the following steps:
Step 1: preprocessing a speech signal and a text signal and performing an alignment operation so that they meet the input requirements of a network model;
Step 2: inputting the speech and text feature vectors preprocessed in Step 1 into a factorized bilinear pooling fusion module (FBP) for primary feature fusion;
Step 3: passing the fused primary features output by the factorized bilinear pooling fusion module FBP in Step 2 through a level-1 primary feature coding network consisting of three sub-models, LSTM, GRU and DNN;
Step 4: secondarily fusing the outputs of the three level-1 sub-networks to encode high-level features, the fusion method being the Hadamard product, then inputting the fused features into a level-2 BiLSTM coding network, and finally connecting a classification output layer to predict the emotion category;
Step 5: finally training the network model.
2. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the text signal preprocessing uses word embedding, with a pre-trained GloVe model representing each word as a vector.
3. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the speech signal preprocessing comprises windowing and framing the audio signal with a Hamming window of 25 ms window length and 10 ms frame shift, applying a Fourier transform to each frame, and finally applying Mel filtering to obtain Mel spectrum features.
4. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the alignment operation obtains the speech feature corresponding to each word by combining the speech frames associated with that word.
5. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the network model comprises four layers in total: the first layer is the FBP primary feature fusion layer; the second layer is the level-1 primary feature coding network layer consisting of the three sub-models LSTM, GRU and DNN; the third layer is the Hadamard fusion layer; and the fourth layer is the high-level coding network layer consisting of the BiLSTM.
6. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the fusion follows a fusion algorithm process in which audio and text feature vectors are input, the audio features and text features are first cross-fused by the FBP fusion module, the fused features are passed through the level-1 primary feature coding network consisting of the three sub-models LSTM, GRU and DNN, the outputs of the three level-1 sub-networks are then secondarily fused to encode high-level features using the Hadamard product, the fused features are input into the level-2 BiLSTM coding network, and finally a classification output layer is connected to predict the emotion category.
7. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the network model is trained by using an Adam optimizer to minimize a cross-entropy loss function, with the learning rate set to 0.0001, the batch size set to 100, and L2 regularization used to prevent model overfitting.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210419568.1A CN114743569A (en) | 2022-04-21 | 2022-04-21 | Speech emotion recognition method based on double-layer fusion deep network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210419568.1A CN114743569A (en) | 2022-04-21 | 2022-04-21 | Speech emotion recognition method based on double-layer fusion deep network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114743569A true CN114743569A (en) | 2022-07-12 |
Family
ID=82282837
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210419568.1A Pending CN114743569A (en) | 2022-04-21 | 2022-04-21 | Speech emotion recognition method based on double-layer fusion deep network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114743569A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116468214A (en) * | 2023-03-07 | 2023-07-21 | 德联易控科技(北京)有限公司 | Evidence electronization method and electronic equipment based on fault event processing process |
CN116468214B (en) * | 2023-03-07 | 2023-12-15 | 德联易控科技(北京)有限公司 | Evidence electronization method and electronic equipment based on fault event processing process |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |