CN114743569A - Speech emotion recognition method based on double-layer fusion deep network - Google Patents
- Publication number
- CN114743569A (application CN202210419568.1A)
- Authority
- CN
- China
- Prior art keywords
- fusion
- layer
- network
- level
- text
- Prior art date
- Legal status: Pending
Classifications
- G10L25/63—Speech or voice analysis techniques specially adapted for estimating an emotional state
- G10L25/18—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique, using neural networks
Abstract
The invention relates to a speech emotion recognition method based on a double-layer fusion deep network. To obtain rich cross-modal information, it fuses the feature vectors of the speech and text modalities and captures the complex associations between audio and text for emotion recognition. First, the speech and text are preprocessed to obtain audio and text feature vectors, and the audio and text features are cross-fused by an FBP feature fusion module. The fused feature vectors are then passed through a level-1 primary feature coding network consisting of three sub-models (LSTM, GRU and DNN). The outputs of the three level-1 sub-networks are fused a second time by Hadamard product to encode high-level features, the fused features are fed into a level-2 BiLSTM coding network, and a classification output layer finally predicts the emotion category. Experiments with the fusion algorithm on the public IEMOCAP dataset reach 80.38% WA and 78.62% UA, among the better results currently reported in the field of speech emotion recognition.
Description
Technical Field
The invention belongs to the field of speech emotion recognition and relates to a multi-modal feature fusion speech emotion recognition algorithm.
Background
At present, the field of speech emotion recognition has three main research directions. The first is based on traditional machine learning: emotion-related features are extracted manually and classified with machine learning methods. The second is emotion recognition based on deep neural networks, which realize end-to-end emotion recognition systems. The third combines speech and text for multi-modal emotion recognition.
Multi-modal speech emotion recognition is a new research hotspot in recent years and has attracted extensive attention. Its main focus is combining speech and text for multi-modal emotion recognition, considering that different speech characteristics influence emotion differently, with the aim of improving the recognition rate of the system. A multi-modal recognition framework is the likely future trend of speech emotion recognition.
Speech emotion recognition is a challenging task because emotional expression is complex and multi-modal. The keys to improving speech emotion recognition performance are, first, the extraction of emotion features and, second, the fusion strategy used when multi-modal features are combined. The mainstream approach to feature extraction uses deep neural networks to extract emotional features, and many studies have verified that this clearly improves performance. However, innovation in feature fusion is still limited: most methods use simple fusion operations such as addition or concatenation, and none of these fully exploits the complementarity between the modalities.
In prior research in the field of multi-modal emotion recognition, the feature fusion methods used in published work on multi-modal emotion recognition based on speech-text fusion likewise rely on simple operations such as addition and concatenation. Clearly, such methods cannot fully exploit the complementarity of emotional features across modalities.
Disclosure of Invention
The problem addressed by the invention is to improve existing speech emotion recognition methods and systems by providing a multi-modal feature fusion scheme that raises the recognition rate. To achieve this technical purpose, the technical scheme adopted by the invention comprises the following steps:
Step 1: preprocess the speech and text data so that they meet the input requirements of the network model;
Step 2: input the speech and text feature vectors preprocessed in Step 1 into a factorized bilinear pooling fusion module (FBP) for primary feature fusion;
Step 3: pass the fused primary features output by the FBP module in Step 2 through a level-1 primary feature coding network consisting of three sub-models, LSTM, GRU and DNN;
Step 4: fuse the outputs of the three level-1 sub-networks a second time to encode high-level features, using the Hadamard product as the fusion method, then feed the fused features into a level-2 BiLSTM coding network, and finally connect a classification output layer to predict the emotion category;
Step 5: train the network model.
Further, the text signal preprocessing uses word embedding: a pre-trained GloVe model represents each word as a vector.
Further, the speech signal preprocessing windows and frames the audio signal using a Hamming window with a window length of 25 ms and a frame shift of 10 ms, applies a Fourier transform to each frame, and finally applies Mel filtering to obtain Mel spectrum features.
Further, the alignment operation combines the speech frames associated with each word to obtain the speech feature corresponding to that word.
Further, the network model comprises four layers in total: the first layer is the FBP primary feature fusion layer; the second layer is the level-1 primary feature coding network layer consisting of the three sub-models LSTM, GRU and DNN; the third layer is the Hadamard fusion layer; the fourth layer is the high-level coding network layer consisting of the BiLSTM.
Further, the fusion follows this algorithmic process: the audio and text feature vectors are input; the audio and text features are first cross-fused by the FBP fusion module; the fused features are passed through the level-1 primary feature coding network consisting of the three sub-models LSTM, GRU and DNN; the outputs of the three level-1 sub-networks are then fused a second time by Hadamard product to encode high-level features; the fused features are fed into the level-2 BiLSTM coding network; and finally a classification output layer predicts the emotion category.
Further, the network model is trained with an Adam optimizer minimizing a cross-entropy loss function, with the learning rate set to 0.0001, the batch size set to 100, and L2 regularization used to prevent overfitting.
The invention has the beneficial effects that:
(1) The network structure of the invention is hierarchical: the first-layer sub-networks encode primary features, and the second-layer sub-network maps the primary features to high-level features, which effectively establishes the hierarchical relation between high-level and low-level features.
(2) The invention fully exploits the feature complementarity between speech and text information and can capture the complex associations between audio and text. Compared with prior-art methods, the recognition rate of the model is greatly improved.
Drawings
FIG. 1 is a diagram of a system model of the present invention;
FIG. 2 is a diagram of a decomposition bilinear pooled feature fusion module (FBP);
FIG. 3 is the confusion matrix of the proposed algorithm's experiments on the IEMOCAP dataset;
FIG. 4 is a graph of the change in model loss function;
FIG. 5 shows graphs of precision, recall and F1 score.
Detailed Description
The technical scheme of the invention is further explained below with reference to the accompanying drawings.
System model
The model structure provided by the invention comprises three stages. The first is low-level feature fusion, which uses a factorized bilinear pooling fusion module (FBP) to fuse the speech feature vector and the text feature vector. The second is high-level feature fusion: the low-level features output by the FBP module are passed through the three sub-models LSTM, GRU and DNN, their outputs are fused a second time, and the high-level features are encoded by concatenating pairwise Hadamard products. The third is the emotion classification stage: the high-level feature vector is fed into a fully connected layer, which reduces the dimension of the fused high-level emotion features, and a softmax function finally outputs the emotion scores and makes the emotion decision. The system model is shown in FIG. 1.
Speech signal pre-processing
The speech emotion features of the invention are Mel spectra; a Mel spectrum is an ordinary spectrogram passed through a Mel filter bank. The extraction process is as follows.
Given a speech signal, it is windowed and framed and denoted S = {s_1, s_2, s_3, ..., s_n}, where n is the total number of speech frames. The window used here is a 25 ms Hamming window with a 10 ms frame shift. A Fourier transform is then applied to each frame in S to obtain its frequency-domain representation x_t:

x_t(k) = Σ_{m=0}^{M-1} s_t(m) · e^{-j2πkm/M}, k = 0, 1, ..., M-1,

where M is the number of Fourier points and x_t(k) is the k-th value of the vector x_t.

The Mel transformation

mel(f) = 2595 · log_10(1 + f/700)

then converts x_t from the linear frequency scale to the Mel scale, where f is the frequency scale after the Fourier transform. Finally, a bank of filters designed on the Mel scale filters the spectrum of each frame, yielding the Mel spectrum features.
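As an illustration of this preprocessing step, the following Python sketch extracts a Mel spectrogram with a 25 ms Hamming window and a 10 ms frame shift using librosa; the sample rate, number of Fourier points and number of Mel bands are assumptions, since the description fixes only the window length and frame shift.

```python
import librosa


def extract_mel_spectrogram(wav_path, sr=16000, n_mels=40):
    """Mel-spectrum features: 25 ms Hamming window, 10 ms frame shift.

    sr, n_fft and n_mels are illustrative assumptions; the description only
    fixes the window length (25 ms) and the frame shift (10 ms).
    """
    signal, sr = librosa.load(wav_path, sr=sr)
    win_length = int(0.025 * sr)   # 25 ms window
    hop_length = int(0.010 * sr)   # 10 ms frame shift
    mel = librosa.feature.melspectrogram(
        y=signal,
        sr=sr,
        n_fft=512,                 # number of Fourier points M (assumed)
        win_length=win_length,
        hop_length=hop_length,
        window="hamming",
        n_mels=n_mels,
    )
    # Log compression is common practice, though not stated in the description.
    return librosa.power_to_db(mel).T   # shape: (n_frames, n_mels)
```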
Text signal pre-processing
The invention uses the well-known GloVe algorithm to preprocess the text information, representing each word by a vector. The main idea of GloVe is to represent the relationship between words i and j through a co-occurrence matrix, and to vectorize words so that the vectors carry as much semantic and syntactic information as possible. The algorithm proceeds as follows:
1. and constructing a co-occurrence matrix X and solving the probability. The mathematical expression is as follows:
wherein p isijRepresenting the probability of co-occurrence of words i and j in the context.
2. And constructing an approximate relation between the word vector and the co-occurrence matrix X. Given that word k represents a word that may appear in their vicinity, the relevance between it and i and j is judged.
The correlation between k and i and j can be obtained. If the correlation is small, the result of the formula is close to 1, and if the correlation is large, the result of the formula is far from 1. Wherein p isikRepresenting co-occurrence probability, p, of words i and kjkRepresenting the co-occurrence probability of words j and k.
3. Construct the loss function

J = Σ_{i,j=1}^{V} f(X_ij) · (w_i^T w̃_j + b_i + b̃_j - log X_ij)^2,

where V is the vocabulary size, w_i and w̃_j are word vectors, b_i and b̃_j are biases, and f(X_ij) is a weighting function.

The weighting function is

f(x) = (x / x_max)^α if x < x_max, and f(x) = 1 otherwise.

Based on experience, the GloVe authors consider x_max = 100 and α = 0.75 a good choice. If two words never appear together the weight is 0, and f(x) must be a non-decreasing function.
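A minimal NumPy sketch of the weighting function and loss just described follows; the toy co-occurrence matrix, the embedding dimension and α = 0.75 are illustrative assumptions beyond what the description fixes (x_max = 100).

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max)^alpha below x_max, otherwise 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_loss(X, W, W_tilde, b, b_tilde):
    """J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over X_ij > 0."""
    i, j = np.nonzero(X)                       # only co-occurring word pairs
    inner = np.sum(W[i] * W_tilde[j], axis=1)  # w_i . w~_j
    diff = inner + b[i] + b_tilde[j] - np.log(X[i, j])
    return np.sum(glove_weight(X[i, j]) * diff ** 2)


# Toy example: 4-word vocabulary, 5-dimensional embeddings (illustrative only).
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(4, 4)).astype(float)
W, W_tilde = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))
b, b_tilde = rng.normal(size=4), rng.normal(size=4)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```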
4. Training model
For the text data in the IEMOCAP dataset, the text information is vectorized with the trained GloVe word embedding model, with a dimensionality of 300. The maximum sequence length of the text data is kept at 500, finally yielding a (500, 300)-dimensional word-vector matrix for text emotion analysis.
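The sketch below shows one way to turn a transcript into the (500, 300) word-vector matrix described above; the embedding file name and the zero-padding / truncation strategy are assumptions, not specified by the description.

```python
import numpy as np


def load_glove(path="glove.6B.300d.txt", dim=300):
    """Load pre-trained GloVe vectors into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def embed_transcript(text, glove, max_len=500, dim=300):
    """Map a transcript to a (max_len, dim) matrix, truncating or zero-padding to 500 words."""
    words = text.lower().split()[:max_len]
    mat = np.zeros((max_len, dim), dtype=np.float32)
    for k, w in enumerate(words):
        if w in glove:                 # out-of-vocabulary words stay zero (assumed handling)
            mat[k] = glove[w]
    return mat
```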
Speech text alignment
This section introduces the method for aligning speech and text. Let the speech representation be S = {s_1, ..., s_i, ..., s_n}, where n is the number of speech frames, and the text word-vector representation be T = {t_1, ..., t_j, ..., t_m}, where m is the number of words in the text. Since one word corresponds to several frames, the speech feature of each word can be obtained by combining the speech frames associated with that word; in other words, there is a strong connection between each word and its associated speech frames. Starting from this point, we first compute the similarity between each word and each speech frame,

a_ij = similarity(t_j, s_i),

where t_j is the j-th word vector, s_i is the i-th speech frame, and a_ij is the similarity between them; a larger a_ij means the two vectors are more similar.

The a_ij are then normalized over the frames to obtain the probability α_ij associating word j with speech frame i.

Finally, all frames are weighted and summed to obtain the speech feature aligned with the j-th word,

u_j = Σ_i α_ij · s_i,

and the aligned speech is denoted U = {u_1, ..., u_j, ..., u_m}.
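A sketch of the alignment just described: similarity scores between each word vector and each speech frame are normalized over the frames and used to form a weighted sum. The dot-product similarity and the softmax normalization are assumptions (the exact similarity formula is not reproduced here), and the word vectors and frame features are assumed to have already been projected to a common dimension d.

```python
import numpy as np


def align_speech_to_words(S, T):
    """Align speech frames to words.

    S: (n, d) speech-frame features, T: (m, d) word vectors, assumed to share
    a common dimension d.  Returns U: (m, d) word-aligned speech features.
    """
    A = T @ S.T                                    # a_ij: similarity of word j and frame i (assumed dot product)
    A = A - A.max(axis=1, keepdims=True)           # numerical stability
    alpha = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # normalize over frames (assumed softmax)
    U = alpha @ S                                  # u_j = sum_i alpha_ij * s_i
    return U


# Toy usage: 120 frames and 7 words in a 64-dimensional common space.
rng = np.random.default_rng(1)
U = align_speech_to_words(rng.normal(size=(120, 64)), rng.normal(size=(7, 64)))
print(U.shape)   # (7, 64)
```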
Introduction to data set
The IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset is an emotion recognition dataset widely used in English multi-modal research, recorded by the SAIL laboratory of the University of Southern California. It is divided into 5 sessions recorded by 10 actors (5 male and 5 female), one male-female pair per session, and is widely cited in the field of speech recognition. The IEMOCAP multi-modal dataset comprises about 12 hours of audio and video in total, covering three modalities: audio, semantic text and facial expression. The speech in the dataset is labeled with ten emotions in total: happy, angry, sad, frustrated, neutral, excited, fear, disgust, surprised and other. To stay consistent with existing work, only four emotion classes are used here: angry, sad, happy and neutral. The distribution of the four emotion categories is shown in the following table.
| Emotion | Angry | Happy | Neutral | Sad | Total |
|---|---|---|---|---|---|
| Utterances | 1103 | 1636 | 1707 | 1084 | 5531 |

Fusion algorithm process of the invention

1. The factorized bilinear pooling fusion module (FBP) is shown in FIG. 2, with x and y denoting the preprocessed speech and text features, x ∈ R^m, y ∈ R^n. The specific operation is as follows.
For a single output element, the factorized bilinear model computes

z_i = 1^T (U_i^T x ∘ V_i^T y),

where k is the latent dimension of the factorized matrices, U_i = [u_1, ..., u_k] ∈ R^{m×k}, V_i = [v_1, ..., v_k] ∈ R^{n×k}, 1 ∈ R^k, and ∘ denotes the Hadamard (element-wise) product.

To obtain the output feature z ∈ R^o, the weights to be learned are two third-order tensors U = [U_1, ..., U_o] ∈ R^{m×k×o} and V = [V_1, ..., V_o] ∈ R^{n×k×o}. U and V can be reshaped into two-dimensional matrices Ũ ∈ R^{m×ko} and Ṽ ∈ R^{n×ko}, after which

z = SumPooling(Ũ^T x ∘ Ṽ^T y, k),

where the function SumPooling(x, k) performs sum pooling on x with a one-dimensional non-overlapping window of size k. Dropout is used to prevent overfitting. Because the element-wise multiplication makes the output amplitude vary greatly, the energy of z is normalized to 1 using L2 normalization.
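A PyTorch sketch of a factorized bilinear pooling block consistent with the description above; the concrete dimensions (m, n, k, o) and the dropout rate are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FBPFusion(nn.Module):
    """Factorized bilinear pooling: z = SumPool(U~'x * V~'y, k), then L2-normalize."""

    def __init__(self, audio_dim, text_dim, out_dim, k=4, dropout=0.3):
        super().__init__()
        self.k = k
        self.U = nn.Linear(audio_dim, out_dim * k, bias=False)  # U~ in R^{m x ko}
        self.V = nn.Linear(text_dim, out_dim * k, bias=False)   # V~ in R^{n x ko}
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, y):
        # Hadamard product of the two projected features.
        z = self.U(x) * self.V(y)                            # (..., out_dim * k)
        z = self.dropout(z)
        # Sum pooling with a non-overlapping window of size k.
        z = z.view(*z.shape[:-1], -1, self.k).sum(dim=-1)    # (..., out_dim)
        # Normalize the energy of z to 1 (L2 normalization).
        return F.normalize(z, p=2, dim=-1)


# Example: fuse 40-dim audio features with 300-dim word vectors for 50 words.
fbp = FBPFusion(audio_dim=40, text_dim=300, out_dim=128)
z = fbp(torch.randn(8, 50, 40), torch.randn(8, 50, 300))     # (batch, words, 128)
print(z.shape)
```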
2. The second part of the network extracts deep features with three different neural networks, LSTM, GRU and DNN, to obtain different representations. The primary fused feature vector z_t output by the FBP module is fed to the LSTM, GRU and DNN networks, whose outputs are denoted h_t^LSTM, h_t^GRU and h_t^DNN, respectively.
3. The high-level feature fusion stage then combines the three kinds of features by taking pairwise Hadamard products and concatenating them into one vector, giving the high-level fused feature at time t:

C_t = [h_t^LSTM ∘ h_t^GRU ; h_t^LSTM ∘ h_t^DNN ; h_t^GRU ∘ h_t^DNN],

where the sequence of fused features is {C_1, C_2, ..., C_m}, m is the number of words, and ∘ denotes element-wise multiplication.
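The level-1 encoding and the pairwise Hadamard fusion can be sketched as follows; the hidden size, the DNN structure, and the assumption that all three sub-network outputs share one dimension are illustrative.

```python
import torch
import torch.nn as nn


class Level1HadamardFusion(nn.Module):
    """LSTM / GRU / DNN sub-encoders whose outputs are fused by pairwise Hadamard products."""

    def __init__(self, in_dim=128, hid=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid, batch_first=True)
        self.gru = nn.GRU(in_dim, hid, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())

    def forward(self, z):                      # z: (batch, m words, in_dim)
        h_lstm, _ = self.lstm(z)               # (batch, m, hid)
        h_gru, _ = self.gru(z)                 # (batch, m, hid)
        h_dnn = self.dnn(z)                    # (batch, m, hid)
        # Pairwise Hadamard products, concatenated into the high-level feature C_t.
        return torch.cat([h_lstm * h_gru, h_lstm * h_dnn, h_gru * h_dnn], dim=-1)


fusion = Level1HadamardFusion()
c = fusion(torch.randn(8, 50, 128))
print(c.shape)                                 # torch.Size([8, 50, 384])
```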
4. The fused vectors C_t from the previous step are then fed into a BiLSTM network for encoding: the forward LSTM produces the forward feature vector, the backward LSTM produces the backward feature vector, and the two are combined as the BiLSTM output.
5. Finally, the feature vector is passed through a fully connected layer for dimension reduction, and a softmax function produces the predicted emotion distribution ŷ of the utterance; a cross-entropy loss is used for training:

ŷ = softmax(f_θ(·)),   L = -Σ_k y_k log ŷ_k,

where y_k is the ground-truth distribution of the sample class, ŷ_k is the predicted distribution, and f_θ(·) is the fully connected network with parameters θ.
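A sketch of the level-2 BiLSTM encoder and the classification head described above; taking the final time step of the BiLSTM as the utterance representation and the hidden size are assumptions, and the softmax is folded into the cross-entropy loss as is usual in PyTorch.

```python
import torch
import torch.nn as nn


class Level2Classifier(nn.Module):
    """Level-2 BiLSTM encoder followed by a fully connected layer and softmax."""

    def __init__(self, in_dim=384, hid=128, n_classes=4):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hid, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hid, n_classes)

    def forward(self, c):                       # c: (batch, m, in_dim)
        h, _ = self.bilstm(c)                   # forward and backward states, (batch, m, 2 * hid)
        # Utterance-level representation: last time step of the BiLSTM (assumed pooling).
        return self.fc(h[:, -1, :])             # logits; softmax is applied inside the loss


model = Level2Classifier()
logits = model(torch.randn(8, 50, 384))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (8,)))  # cross-entropy loss
print(loss.item())
```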
Network training
When training the model, an Adam optimizer was used to minimize the cross-entropy loss function, the learning rate was set to 0.0001, the batch size was 100, and L2 regularization was used to prevent overfitting of the model.
The Adam optimization algorithm estimates the first and second moments of the gradient of the cross-entropy loss function and applies them to the parameter update:

m_t = β_1 · m_{t-1} + (1 - β_1) · g_t
v_t = β_2 · v_{t-1} + (1 - β_2) · g_t^2
θ_t = θ_{t-1} - lr · m̂_t / (sqrt(v̂_t) + ε),  with m̂_t = m_t / (1 - β_1^t), v̂_t = v_t / (1 - β_2^t),

where m_t and v_t are the first and second moments of the cross-entropy loss gradient, the hyper-parameters β_1 and β_2 control their decay, g_t is the gradient of the function, θ_t is the updated parameter value, and lr is the learning rate.
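A training-loop sketch reflecting the stated settings (Adam, learning rate 0.0001, batch size 100, cross-entropy loss, L2 regularization via weight decay); the weight-decay coefficient, epoch count and the toy model and data are placeholders, not values fixed by the description.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train(model, dataset, epochs=50, weight_decay=1e-5):
    """Train with Adam (lr=0.0001), batch size 100, cross-entropy loss and L2 regularization.

    weight_decay (the L2 coefficient) and the epoch count are assumptions; the
    description specifies only the optimizer, learning rate and batch size.
    """
    loader = DataLoader(dataset, batch_size=100, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()


# Toy usage with random data and a linear stand-in model.
toy_model = nn.Linear(384, 4)
toy_data = TensorDataset(torch.randn(500, 384), torch.randint(0, 4, (500,)))
train(toy_model, toy_data, epochs=1)
```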
The parameter settings of the network are shown in the following table:
simulation parameter table
Experimental results:
The confusion matrix of our proposed efficient fusion algorithm model on the IEMOCAP dataset is shown in FIG. 3. From the confusion matrix, the accuracies of the four emotions happy, sad, neutral and angry are 91.69%, 73.95%, 73.68% and 75.16% respectively, and the overall accuracy reaches 80.38%. This verifies the validity and feasibility of the proposed method on the IEMOCAP dataset.
The invention mainly measures model performance with weighted accuracy (WA) and unweighted accuracy (UA): WA evaluates the overall performance of the model, while UA evaluates the classification result for each emotion class. They are computed as

WA = Σ_i TP_i / Σ_i (TP_i + FN_i),   UA = (1/L) Σ_i TP_i / (TP_i + FN_i),

where L is the number of emotion classes, TP_i is the number of class-i samples predicted correctly, and FN_i is the number of class-i samples predicted incorrectly. A sketch of this computation is given below.
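The following sketch computes WA and UA from a confusion matrix, matching the definitions above; the example confusion matrix is illustrative only, not the patent's results.

```python
import numpy as np


def weighted_unweighted_accuracy(confusion):
    """WA = overall accuracy; UA = mean of the per-class recalls.

    confusion[i, j] = number of class-i samples predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)                    # TP_i
    per_class = tp / confusion.sum(axis=1)     # TP_i / (TP_i + FN_i)
    wa = tp.sum() / confusion.sum()
    ua = per_class.mean()
    return wa, ua


# Illustrative 2-class example (not the patent's data).
print(weighted_unweighted_accuracy([[90, 10],
                                    [30, 70]]))   # (0.8, 0.8)
```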
To further measure the performance of the proposed speech emotion recognition model, precision, recall and the F1 score are also reported:

precision = TP / (TP + FP),   recall = TP / (TP + FN),   F1 = 2 · precision · recall / (precision + recall),

where TP means both the prediction and the label are positive, FP means the prediction is positive but the label is negative, TN means both the prediction and the label are negative, and FN means the prediction is negative but the label is positive. The F1 score is the harmonic mean of precision and recall, and larger F1 values are better. The precision, recall and F1 results are shown in the following table, and a sketch of the computation follows the table.
| Emotion | TP | FN | FP | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| happy | 397 | 36 | 93 | 0.8102 | 0.9169 | 0.8602 |
| sad | 176 | 62 | 31 | 0.8502 | 0.7395 | 0.7910 |
| neutral | 280 | 100 | 82 | 0.7735 | 0.7368 | 0.7547 |
| angry | 118 | 39 | 31 | 0.7919 | 0.7516 | 0.7712 |
| Average | | | | 0.8065 | 0.7862 | 0.7943 |
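The per-class scores can be reproduced from the TP/FN/FP counts in the table with the formulas above; the helper below is a sketch, checked against the "happy" row.

```python
def precision_recall_f1(tp, fn, fp):
    """precision = TP/(TP+FP); recall = TP/(TP+FN); F1 = harmonic mean of the two."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# "happy" row of the table: TP=397, FN=36, FP=93.
print(precision_recall_f1(397, 36, 93))   # approximately (0.8102, 0.9169, 0.8602)
```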
Comparison with other methods
To better verify the validity and scientificity of the improved algorithm, the algorithm was compared with other algorithm models on an IEMOCAP data set, and the results are shown in the following table.
1. Hengshun Zhou et al. use a bilinear pooling algorithm as the feature fusion approach. (ACM 2021)
2. Haiyang Xu et al. use concatenation of speech and text features as the fusion approach. (INTERSPEECH 2019)
3. Md Asif Jalal et al. concatenate the outputs of multiple attention modules as the fusion approach. (INTERSPEECH 2020)
4. Ming Chen et al. concatenate the mixed (fused) features together with the individual modality features. (INTERSPEECH 2020)
5. Qi Cao et al. fuse using speech-text consistency and a contextual attention mechanism. (2021)
Comparison with different models
As shown in the table above, the method proposed by the invention achieves 80.38% WA and 78.62% UA in the four-class emotion recognition experiment on the IEMOCAP dataset, and its WA is the best among all compared methods. The worst result comes from Haiyang Xu et al., whose simple linear concatenation fusion reaches only 70.4% WA and 69.5% UA, probably because simple concatenation does not fully capture the complementary relationship between the two modalities. Ming Chen et al. additionally concatenate the two separate features in parallel on top of the approach of Haiyang Xu et al., taking the correlations between and within the two modalities into account, and obtain 71.06% WA and 72.05% UA, an improvement in performance. Yuanyuan Zhang et al., who fuse the two features with bilinear pooling, reach 75.49% WA. The invention performs a second feature fusion on top of bilinear pooling feature fusion to obtain a rich cross-modal feature representation, thereby improving WA by 4.89%. Qi Cao et al. achieve 78.74% WA and 79.77% UA using speech-text consistency and contextual attention fusion, slightly higher than our approach in UA. In summary, these comparative experiments verify that the algorithm provided by the invention has a significant advantage over the other algorithms.
To address the shortcomings of feature fusion in the field of multi-modal emotion recognition, a speech emotion recognition method based on a double-layer fusion deep network is proposed that can capture the complex associations between audio and text. The method performs a first fusion with the factorized bilinear pooling fusion module (FBP) and then a second fusion, which outperforms concatenation and summation and fuses the features of each modality more fully at lower time complexity. The network structure is hierarchical: the first-layer sub-networks encode primary features and the second-layer network maps them to high-level features, effectively establishing the hierarchical relation between high-level and low-level features. Experiments on the IEMOCAP dataset reach 80.38% WA and 78.62% UA. Compared with the related work of other researchers, the recognition rate is greatly improved, verifying the feasibility and effectiveness of the proposed efficient fusion algorithm.
The scope of the present invention is defined by the claims.
Claims (7)
1. A speech emotion recognition method based on a double-layer fusion deep network is characterized by comprising the following steps:
Step 1: preprocessing a speech signal and a text signal and performing an alignment operation so that they meet the input requirements of a network model;
Step 2: inputting the speech and text feature vectors preprocessed in Step 1 into a factorized bilinear pooling fusion module (FBP) for primary feature fusion;
Step 3: passing the fused primary features output by the factorized bilinear pooling fusion module FBP in Step 2 through a level-1 primary feature coding network consisting of three sub-models, LSTM, GRU and DNN;
Step 4: secondarily fusing the outputs of the three level-1 sub-networks to encode high-level features, the fusion method being the Hadamard product, then inputting the fused features into a level-2 BiLSTM coding network, and finally connecting a classification output layer to predict the emotion category;
Step 5: finally training the network model.
2. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the text signal preprocessing uses word embedding, with a pre-trained GloVe model representing each word as a vector.
3. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the speech signal preprocessing comprises windowing and framing the audio signal with a Hamming window of 25 ms window length and 10 ms frame shift, applying a Fourier transform to each frame, and finally applying Mel filtering to obtain Mel spectrum features.
4. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the alignment operation obtains the speech feature corresponding to each word by combining the speech frames associated with that word.
5. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the network model comprises four layers in total: the first layer is the FBP primary feature fusion layer; the second layer is the level-1 primary feature coding network layer consisting of the three sub-models LSTM, GRU and DNN; the third layer is the Hadamard fusion layer; and the fourth layer is the high-level coding network layer consisting of the BiLSTM.
6. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the fusion follows a fusion algorithm process in which audio and text feature vectors are input, the audio features and text features are first cross-fused by the FBP fusion module, the fused features are passed through the level-1 primary feature coding network consisting of the three sub-models LSTM, GRU and DNN, the outputs of the three level-1 sub-networks are then secondarily fused to encode high-level features using the Hadamard product, the fused features are input into the level-2 BiLSTM coding network, and finally a classification output layer is connected to predict the emotion category.
7. The speech emotion recognition method based on the double-layer fusion deep network according to claim 1, wherein the network model is trained by using an Adam optimizer to minimize a cross-entropy loss function, with the learning rate set to 0.0001, the batch size set to 100, and L2 regularization used to prevent model overfitting.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210419568.1A CN114743569A (en) | 2022-04-21 | 2022-04-21 | Speech emotion recognition method based on double-layer fusion deep network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210419568.1A CN114743569A (en) | 2022-04-21 | 2022-04-21 | Speech emotion recognition method based on double-layer fusion deep network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114743569A true CN114743569A (en) | 2022-07-12 |
Family
ID=82282837
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210419568.1A Pending CN114743569A (en) | 2022-04-21 | 2022-04-21 | Speech emotion recognition method based on double-layer fusion deep network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114743569A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116468214A (en) * | 2023-03-07 | 2023-07-21 | 德联易控科技(北京)有限公司 | Evidence electronization method and electronic equipment based on fault event processing process |
CN116468214B (en) * | 2023-03-07 | 2023-12-15 | 德联易控科技(北京)有限公司 | Evidence electronization method and electronic equipment based on fault event processing process |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |