CN114743569A - Speech emotion recognition method based on double-layer fusion deep network - Google Patents

Speech emotion recognition method based on double-layer fusion deep network Download PDF

Info

Publication number
CN114743569A
Authority
CN
China
Prior art keywords
fusion
layer
network
level
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210419568.1A
Other languages
Chinese (zh)
Inventor
李飞
李斌建
李汀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210419568.1A priority Critical patent/CN114743569A/en
Publication of CN114743569A publication Critical patent/CN114743569A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a speech emotion recognition method based on a double-layer fusion deep network, which performs optimized fusion of the feature vectors of the two modalities of speech and text and captures the complex associations between audio and text in order to obtain rich cross-modal information for emotion recognition. First, the speech and text information are preprocessed to obtain audio and text feature vectors; the audio and text features are cross-fused by an FBP feature fusion module; the fused feature vectors are respectively passed through a level-1 primary feature coding network consisting of three submodels, LSTM, GRU and DNN; the outputs of the three level-1 sub-networks are then fused a second time by Hadamard product to encode high-level features; the fused features are input into a level-2 BiLSTM coding network; and finally a classification output layer is connected to predict the emotion category. Experimental results of the fusion algorithm on the public IEMOCAP dataset show that it reaches 80.38% WA and 78.62% UA, which is among the better current results in the field of speech emotion recognition.

Description

Speech emotion recognition method based on double-layer fusion deep network
Technical Field
The invention belongs to the field of speech emotion recognition, and relates to a multimodal feature-fusion speech emotion recognition algorithm.
Background
At present, the field of speech emotion recognition mainly covers three research directions. The first is based on traditional machine learning: emotion-related features are extracted manually and classified with machine learning methods. The second is emotion recognition based on deep neural networks, which realizes end-to-end emotion recognition systems. The third combines speech and text for multimodal emotion recognition.
Multimodal speech emotion recognition is a new research hotspot in recent years and has attracted extensive attention. Its main line of research is to combine speech and text for multimodal emotion recognition, taking into account that different speech characteristics have different influences on emotion, with the aim of improving the recognition rate of the system; the future trend of speech emotion recognition is a multimodal recognition framework.
Speech emotion recognition is a challenging task because emotional expression is complex and multimodal. The keys to improving speech emotion recognition performance are, first, the extraction of emotional features and, second, the fusion method used when multimodal features are combined. The current mainstream approach to feature extraction uses deep neural networks, and many studies have verified that this clearly improves the results. However, innovation in feature fusion is still not deep enough; most methods use simple fusion operations such as addition and concatenation. Obviously, these methods do not fully exploit the complementarity between the modality information.
In prior research in the field of multimodal emotion recognition, the feature fusion method of the published work on multimodal emotion recognition based on speech-text fusion likewise relies on simple operations such as addition and concatenation. Obviously, such methods cannot fully exploit the complementarity of emotional features across modalities.
Disclosure of Invention
The invention mainly aims to solve the following problem: improving the existing speech emotion recognition methods and systems by providing a multimodal feature fusion scheme capable of raising the recognition rate. To realize this technical purpose, the technical scheme adopted by the invention comprises the following steps:
Step 1: preprocessing the voice and text data so that they meet the input requirements of the network model;
Step 2: inputting the preprocessed voice and text feature vectors of step 1 into a decomposed bilinear pooling fusion module (FBP) for primary feature fusion;
Step 3: passing the fused primary features output by the FBP module in step 2 through a level-1 primary feature coding network consisting of three submodels, LSTM, GRU and DNN;
Step 4: secondarily fusing the outputs of the three level-1 sub-networks by Hadamard product to encode high-level features, then inputting the fused features into a level-2 BiLSTM coding network, and finally connecting a classification output layer to predict the emotion category;
Step 5: finally training the network model.
Further, the text signal preprocessing refers to representing each word by a vector through word embedding, using a pre-trained GloVe model.
Further, the voice signal preprocessing refers to windowing and framing the audio signal with a Hamming window of 25 ms length and a 10 ms frame shift, performing a Fourier transform on each frame, and finally applying Mel filtering to obtain the Mel spectrum features.
Furthermore, the alignment operation is to combine the voice frames related to the words to obtain the voice characteristics corresponding to each word.
Further, the structure of the network model comprises four layers in total: the first layer is the FBP primary feature fusion layer; the second layer is the level-1 primary feature coding network layer consisting of the three submodels LSTM, GRU and DNN; the third layer is the Hadamard fusion layer; the fourth layer is the high-level coding network layer consisting of a BiLSTM.
Further, the fusion adopts the following fusion algorithm process: audio and text feature vectors are input; the audio features and text features are first cross-fused by the FBP fusion module; the fused features are respectively passed through the level-1 primary feature coding network consisting of the three submodels LSTM, GRU and DNN; the outputs of the three level-1 sub-networks are then secondarily fused by Hadamard product to encode high-level features; the fused features are input into the level-2 BiLSTM coding network; and finally a classification output layer is connected to predict the emotion category.
Further, the network model is trained with an Adam optimizer to minimize a cross-entropy loss function, the learning rate is set to 0.0001, the batch size is 100, and L2 regularization is used to prevent the model from overfitting.
The invention has the beneficial effects that:
(1) The network structure of the invention adopts a hierarchical mode: the primary features are encoded by the first-layer sub-networks and mapped to high-level features by the second-layer sub-network, which effectively establishes the hierarchical relation between high-level and low-level features.
(2) The invention makes full use of the feature complementarity between the voice and text information and can capture the complex associations between audio and text. Compared with prior-art methods, the recognition rate of the model is greatly improved.
Drawings
FIG. 1 is a diagram of a system model of the present invention;
FIG. 2 is a diagram of a decomposition bilinear pooled feature fusion module (FBP);
FIG. 3 is a confusion matrix of the experiments of the proposed algorithm on the IEMOCAP data set;
FIG. 4 is a graph of the change in the model loss function;
FIG. 5 is a graph of precision, recall and F1 score.
Detailed Description
The technical scheme of the invention is further explained by combining the attached drawings.
System model
The model structure provided by the invention comprises three layers. The first layer is low-level feature fusion, which uses the decomposed bilinear pooling fusion module (FBP) to fuse the speech feature vector and the text feature vector. The second layer is high-level feature fusion: the low-level features output by the FBP module are passed through the three submodels LSTM, GRU and DNN respectively and then fused a second time, the high-level features being encoded by splicing the pairwise Hadamard products of the three outputs. The third layer is the emotion classification layer: the obtained high-level feature vector is input into a fully connected layer, which reduces the dimensionality of the fused high-level emotion features, and finally the emotion scores are output through a softmax function and the emotion decision is made. The system model is shown in FIG. 1.
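For illustration only, the following is a minimal PyTorch sketch of how these three layers could be wired together. It assumes the FBP, Level1Fusion and Level2Classifier modules sketched later in this description, and all layer sizes (40-dimensional Mel frames, 300-dimensional word vectors, 256-dimensional fused features, hidden size 128) are assumptions rather than values taken from the patent.
```python
import torch.nn as nn

class TwoLayerFusionSER(nn.Module):
    """High-level skeleton of the double-layer fusion network."""
    def __init__(self, audio_dim=40, text_dim=300, fused_dim=256, n_classes=4):
        super().__init__()
        self.fbp = FBP(audio_dim, text_dim, fused_dim)               # low-level fusion
        self.level1 = Level1Fusion(fused_dim, hid=128)               # LSTM/GRU/DNN + Hadamard
        self.level2 = Level2Classifier(3 * 128, n_classes=n_classes) # BiLSTM + classifier

    def forward(self, audio, text):
        z = self.fbp(audio, text)   # per-word fused features
        c = self.level1(z)          # high-level fused features C_t
        return self.level2(c)       # emotion class scores
```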
Speech signal pre-processing
The speech emotion features of the invention use the Mel spectrum; a Mel spectrum is an ordinary spectrogram passed through a Mel filter bank. The extraction process is as follows.
Given a speech signal S, which is windowed and framed, it is denoted S = {s_1, s_2, s_3, ..., s_n}, where n represents the total number of speech frames. The window used in this section is a 25 ms Hamming window with a 10 ms frame shift. Then, a Fourier transform is applied to each frame in S to obtain the frequency-domain representation x_t of each frame, as expressed by the formula:
x_t(k) = Σ_{m=0}^{M-1} s_t(m) · e^(−j·2π·k·m/M),  k = 0, 1, ..., M−1
where M represents the number of Fourier transform points and x_t(k) denotes the k-th value of the vector x_t.
mel(f) = 2595 · log10(1 + f/700)
Subsequently, the Mel transformation is performed according to the above formula, transforming x_t from a linear frequency scale to the Mel scale, where f represents the frequency scale after the Fourier transform. A group of filters is then designed on the Mel scale to filter the spectrum of each frame, finally yielding the Mel spectrum features.
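As a hedged illustration of the extraction steps above, the following sketch uses the librosa library with a 25 ms Hamming window and a 10 ms frame shift; the 16 kHz sampling rate and the choice of 40 Mel bands are assumptions, since the patent does not specify them.
```python
import librosa

def mel_spectrogram(wav_path, sr=16000, n_mels=40):
    """Mel spectrum features: framing, Hamming window, FFT, Mel filtering."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)        # 25 ms window length
    hop_length = int(0.010 * sr)   # 10 ms frame shift
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        window="hamming", n_mels=n_mels)
    return librosa.power_to_db(mel).T   # shape: (n_frames, n_mels)
```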
Text signal pre-processing
The invention adopts the well-known GloVe algorithm to preprocess the text information, representing each word by a vector. The main idea of the GloVe algorithm is to represent the relationship between words i and j through a co-occurrence matrix, and to vectorize words so that the vectors contain as much semantic and syntactic information as possible. The algorithm flow is as follows:
1. and constructing a co-occurrence matrix X and solving the probability. The mathematical expression is as follows:
Figure BDA0003607000400000053
wherein p isijRepresenting the probability of co-occurrence of words i and j in the context.
2. Construct the approximate relationship between the word vectors and the co-occurrence matrix X. Given a word k that may appear in the vicinity of words i and j, its relevance to i and j is judged through the ratio
p_ik / p_jk
which expresses the correlation between k and i and j. If k is related to i and j to a similar degree, the result of the formula is close to 1; if the correlation differs greatly, the result is far from 1. Here p_ik represents the co-occurrence probability of words i and k, and p_jk the co-occurrence probability of words j and k.
3. Construct the loss function:
J = Σ_{i,j=1}^{V} f(X_ij) · (w_i^T w̃_j + b_i + b̃_j − log X_ij)^2
where V represents the size of the vocabulary, w_i and w̃_j are word vectors, b_i and b̃_j are bias terms, and f(X_ij) is a weighting function.
The weighting function is expressed as follows:
f(x) = (x / x_max)^α  if x < x_max;  f(x) = 1 otherwise.
Based on experience, the GloVe authors consider x_max = 100 and α = 3/4 to be a good choice. If two words never appear together, the weight is 0, and f(x) must be a non-decreasing function.
4. Training model
For the text data in the IEMOCAP dataset, the text information is vectorized with the trained GloVe word-embedding model, with a dimensionality of 300. The maximum sequence length of the text data is kept at 500, finally yielding a (500, 300)-dimensional word-vector representation used for the text emotion analysis.
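The following is a minimal sketch of this vectorization step, assuming a standard pre-trained 300-dimensional GloVe text file; the file name, lower-casing and zero-padding of unknown or missing words are assumptions not stated in the patent.
```python
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Load pre-trained GloVe vectors into a word -> vector dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def vectorize(tokens, vectors, max_len=500, dim=300):
    """Map a token list to a (500, 300) matrix, zero-padded / truncated."""
    out = np.zeros((max_len, dim), dtype=np.float32)
    for i, tok in enumerate(tokens[:max_len]):
        out[i] = vectors.get(tok.lower(), np.zeros(dim, dtype=np.float32))
    return out
```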
Speech text alignment
This section mainly introduces the implementation of speech-text alignment. Given the speech representation S = {s_1, ..., s_i, ..., s_n}, where n denotes the number of speech frames, and the text word-vector representation T = {t_1, ..., t_j, ..., t_m}, where m represents the number of words in the text. Since one word corresponds to several frames, the speech feature of each word can be obtained by combining the speech frames associated with that word, which means there is a strong connection between each word and its associated speech frames. Starting from this point, we first calculate the similarity between a word and a speech frame to obtain the relationship between them, as shown in the formula:
a_ij = t_j^T · s_i
where t_j represents the j-th word vector, s_i represents the i-th speech frame, and a_ij represents the similarity between them; a larger value means the two vectors are more similar.
a_ij is then normalized (a softmax over the frames) to obtain the probability value linking word j and speech frame i:
α_ij = exp(a_ij) / Σ_{i=1}^{n} exp(a_ij)
Finally, all frames are weighted and summed to obtain the speech feature u_j aligned with the j-th word; the aligned speech is denoted U = {u_1, ..., u_j, ..., u_m}:
u_j = Σ_{i=1}^{n} α_ij · s_i
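A small NumPy sketch of this alignment follows, assuming the word vectors and speech frames have already been projected to a common dimensionality so that the dot-product similarity is defined (the patent does not state how the dimensions are matched).
```python
import numpy as np

def align(T, S):
    """T: (m, d) word vectors, S: (n, d) speech frames -> (m, d) aligned features."""
    a = T @ S.T                                     # similarities a_ij, shape (m, n)
    a = a - a.max(axis=1, keepdims=True)            # numerical stability for softmax
    alpha = np.exp(a) / np.exp(a).sum(axis=1, keepdims=True)  # softmax over frames
    U = alpha @ S                                   # weighted sum of frames per word
    return U
```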
Introduction to the data set
The IEMOCAP (Interactive Emotional Dyadic Motion Capture) dataset is an emotion recognition dataset widely used in English multimodal research, recorded by the SAIL laboratory at the University of Southern California. It is divided into 5 sessions, recorded by 10 actors in total (5 male, 5 female), with one male-female pair per session, and is widely cited in the field of speech recognition. The IEMOCAP multimodal dataset comprises about 12 hours of audio and video files in total, covering data of three modalities: audio, semantic text and facial expression. The speech in the dataset is labelled with ten emotions in total: happy, angry, sad, frustrated, neutral, excited, fearful, disgusted, surprised and other. To be consistent with existing work, only four emotion classes are used in our work: anger, sadness, happiness and neutral. The distribution of the four emotion categories is shown in the following table.
Emotion   Angry   Happy   Neutral   Sad    Total
Count     1103    1636    1707      1084   5531
Fusion algorithm process of the invention
1. The decomposed bilinear pooling fusion module (FBP) is shown in FIG. 2, with x and y representing the preprocessed speech and text features, x ∈ R^m, y ∈ R^n. The specific operation is as follows.
z_i = x^T W_i y
Factorizing the projection matrix W_i into two low-rank matrices gives
z_i = x^T U_i V_i^T y = 1^T (U_i^T x ∘ V_i^T y)
where k is the latent dimension of the factorized matrices, U_i = [u_1, ..., u_k] ∈ R^{m×k}, V_i = [v_1, ..., v_k] ∈ R^{n×k}, 1 ∈ R^k is an all-ones vector, and ∘ denotes the Hadamard (element-wise) product.
To obtain an output feature z ∈ R^o, the weights to be learned are two third-order tensors U = [U_1, ..., U_o] ∈ R^{m×k×o} and V = [V_1, ..., V_o] ∈ R^{n×k×o}. We can reshape U and V into two-dimensional matrices U' ∈ R^{m×ko} and V' ∈ R^{n×ko}; after this reshaping operation, the formula becomes
z = SumPooling(U'^T x ∘ V'^T y, k)
where the function SumPooling(x, k) performs sum pooling over x using a one-dimensional non-overlapping window of size k. Dropout is used to prevent overfitting. Since the element-wise multiplication makes the magnitude of the output vary greatly, the energy of z is normalized to 1 using L2 normalization:
z ← z / ||z||_2
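A possible PyTorch realization of this decomposed bilinear pooling module is sketched below; the latent dimension k, the dropout rate and the hidden sizes are assumptions, and only the structure (two linear projections, Hadamard product, dropout, sum pooling with window k, L2 normalization) follows the description above.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBP(nn.Module):
    """Decomposed (factorized) bilinear pooling fusion of two feature vectors."""
    def __init__(self, dim_x, dim_y, out_dim, k=4, dropout=0.3):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.U = nn.Linear(dim_x, k * out_dim, bias=False)   # reshaped tensor U'
        self.V = nn.Linear(dim_y, k * out_dim, bias=False)   # reshaped tensor V'
        self.drop = nn.Dropout(dropout)

    def forward(self, x, y):
        z = self.U(x) * self.V(y)                 # Hadamard product of projections
        z = self.drop(z)
        z = z.view(*z.shape[:-1], self.out_dim, self.k).sum(-1)  # sum pooling, window k
        return F.normalize(z, p=2, dim=-1)        # L2-normalize the fused feature
```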
2. The second part of the network framework of the invention adopts three different neural networks, LSTM, GRU and DNN, to extract deep features and obtain different representations. The primary fused feature vector z_t output by the FBP module is passed through the LSTM, GRU and DNN networks respectively, giving the outputs h_t^L, h_t^G and h_t^D. The formulas are as follows:
h_t^L = LSTM(z_t)
h_t^G = GRU(z_t)
h_t^D = DNN(z_t)
3. The high-level feature fusion part then integrates the three kinds of features by splicing their pairwise Hadamard products into one vector, and outputs the high-level fused feature C_t, as shown below:
C_t = (h_t^L ⊙ h_t^G) ⊕ (h_t^L ⊙ h_t^D) ⊕ (h_t^G ⊙ h_t^D)
where C_t = {C_1, C_2, ..., C_m}, m represents the number of words, the symbol ⊙ denotes element-wise (Hadamard) multiplication, and ⊕ denotes concatenation.
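A sketch of the level-1 coding network and the pairwise Hadamard fusion follows, under the assumption that the three branches share the same output size so that their pairwise Hadamard products can be concatenated; the hidden size is illustrative.
```python
import torch
import torch.nn as nn

class Level1Fusion(nn.Module):
    """Level-1 LSTM / GRU / DNN encoders with pairwise Hadamard fusion."""
    def __init__(self, in_dim, hid=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid, batch_first=True)
        self.gru = nn.GRU(in_dim, hid, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())

    def forward(self, z):                  # z: (batch, m words, in_dim)
        hl, _ = self.lstm(z)
        hg, _ = self.gru(z)
        hd = self.dnn(z)
        # pairwise Hadamard products, spliced into one vector C_t per word
        return torch.cat([hl * hg, hl * hd, hg * hd], dim=-1)   # (batch, m, 3*hid)
```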
4. We then feed the fused vector C_t from the previous step into a BiLSTM network for encoding. The specific operation is as follows, with h_t^f denoting the forward feature vector and h_t^b the backward feature vector:
h_t^f = forward-LSTM(C_t)
h_t^b = backward-LSTM(C_t)
h_t = [h_t^f ; h_t^b]
5. Finally, the feature vector is fed into a fully connected layer for dimensionality reduction, and a softmax function is used to obtain the predicted distribution ŷ for the emotion classification of the utterance, trained with a cross-entropy loss function. The specific operation is as follows:
ŷ = softmax(f_θ(h))
L = − Σ_k y_k · log(ŷ_k)
where y_k represents the ground-truth distribution of the sample classes, ŷ_k is the predicted distribution of the sample classes, and f_θ(·) represents a fully connected network with parameters θ.
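A sketch of the level-2 BiLSTM encoder with the fully connected classification layer follows; taking the final time step of the BiLSTM output as the utterance representation is an assumption, since the patent only states that the fused features pass through a BiLSTM, a fully connected layer and a softmax.
```python
import torch
import torch.nn as nn

class Level2Classifier(nn.Module):
    """Level-2 BiLSTM encoder followed by a fully connected classifier."""
    def __init__(self, in_dim, hid=128, n_classes=4):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hid, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hid, n_classes)

    def forward(self, c):                  # c: (batch, m, in_dim)
        h, _ = self.bilstm(c)              # forward and backward states, concatenated
        logits = self.fc(h[:, -1, :])      # last time step -> class scores
        return logits                      # pair with nn.CrossEntropyLoss (softmax inside)
```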
Network training
When training the model, an Adam optimizer is used to minimize the cross-entropy loss function, the learning rate is set to 0.0001, the batch size is 100, and L2 regularization is used to prevent overfitting of the model.
The Adam optimization algorithm estimates the first and second moments of the gradient of the cross-entropy loss function and applies them to the parameter update. The specific operation is as follows:
m_t = β1 · m_{t−1} + (1 − β1) · g_t
v_t = β2 · v_{t−1} + (1 − β2) · g_t^2
θ_t = θ_{t−1} − lr · m_t / (√v_t + ε)
where m_t and v_t are respectively the first and second moments of the gradient of the cross-entropy loss, the hyperparameters β1 and β2 control their decay, g_t represents the gradient of the loss function, θ_t is the updated parameter value, ε is a small constant for numerical stability, and lr represents the learning rate.
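A training-loop sketch matching the stated settings (Adam, learning rate 0.0001, batch size 100, cross-entropy loss) is given below; realizing the L2 regularization through the optimizer's weight_decay, its value, and the number of epochs are assumptions.
```python
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs=50, device="cpu"):
    """Train with Adam (lr=1e-4) and cross-entropy, as described above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
    criterion = nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(num_epochs):
        for audio, text, labels in train_loader:   # DataLoader built with batch_size=100
            audio, text, labels = audio.to(device), text.to(device), labels.to(device)
            loss = criterion(model(audio, text), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```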
The main parameter settings of the network (Adam optimizer, learning rate 0.0001, batch size 100, L2 regularization) are summarized in the simulation parameter table.
The experimental results are as follows:
The confusion matrix of the experimental results of our proposed efficient fusion algorithm model on the IEMOCAP data set is shown in FIG. 3. From the confusion matrix, the recognition accuracies of the four emotions happy, sad, neutral and angry are 91.69%, 73.95%, 73.68% and 75.16% respectively, and the overall accuracy reaches 80.38%. This verifies the validity and feasibility of the proposed method on the IEMOCAP data set.
The invention mainly adopts WA (weighted accuracy) and UA (unweighted accuracy) to measure the performance of the model, where WA evaluates the overall performance of the model and UA evaluates the classification result for each emotion class. The specific calculation formulas are as follows:
WA = Σ_{i=1}^{L} TP_i / Σ_{i=1}^{L} (TP_i + FN_i)
UA = (1/L) · Σ_{i=1}^{L} TP_i / (TP_i + FN_i)
where L is the number of emotion classes, TP_i indicates the number of class-i samples predicted correctly, and FN_i indicates the number of class-i samples predicted incorrectly.
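Both metrics can be computed directly from a confusion matrix, as in the following sketch.
```python
import numpy as np

def wa_ua(conf):
    """conf[i, j] = number of samples of true class i predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)                  # TP_i, correctly predicted class-i samples
    per_class = conf.sum(axis=1)        # TP_i + FN_i for each class
    wa = tp.sum() / conf.sum()          # weighted (overall) accuracy
    ua = (tp / per_class).mean()        # unweighted accuracy: mean per-class recall
    return wa, ua
```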
To further measure the performance of the proposed speech emotion recognition model, precision, recall and the F1 score are also adopted. Their calculation formulas are as follows:
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 · precision · recall / (precision + recall)
where TP means both the prediction and the label are positive; FP means the prediction is positive but the label is negative; TN means both the prediction and the label are negative; and FN means the prediction is negative but the label is positive. The F1 score is the harmonic mean of precision and recall, and a larger F1 score is better. The per-class precision, recall and F1 results are shown in the following table, with a short numerical check after it.
           TP     FN     FP     precision   recall   F1-score
happy      397    36     93     0.8102      0.9169   0.8602
sad        176    62     31     0.8502      0.7395   0.7910
neutral    280    100    82     0.7735      0.7368   0.7547
anger      118    39     31     0.7919      0.7516   0.7712
Average    -      -      -      0.8065      0.7862   0.7943
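As a quick numerical check of the formulas above, the "happy" row of the table (TP = 397, FN = 36, FP = 93) reproduces the reported values:
```python
tp, fn, fp = 397, 36, 93
precision = tp / (tp + fp)                            # 397/490 ≈ 0.8102
recall = tp / (tp + fn)                               # 397/433 ≈ 0.9169
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.8602
print(round(precision, 4), round(recall, 4), round(f1, 4))
```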
Comparison with other methods
To better verify the validity and scientific soundness of the improved algorithm, it is compared with other algorithm models on the IEMOCAP data set; the results are summarized below.
1. Hengshun Zhou et al. use a bilinear pooling algorithm as the feature fusion approach. (ACM 2021)
2. Haiyang Xu et al. use concatenation fusion of speech and text features. (INTERSPEECH 2019)
3. Md Asif Jalal et al. use a multi-attention output concatenation fusion method. (INTERSPEECH 2020)
4. Ming Chen et al. adopt a scheme that splices the mixed features and then additionally splices the individual features. (INTERSPEECH 2020)
5. Qi Cao et al. use a fusion approach based on speech-text consistency and a contextual attention mechanism. (2021)
Comparison with different models
In these comparison experiments, the method proposed by the present invention achieves 80.38% WA and 78.62% UA in the four-class emotion recognition experiment on the IEMOCAP data set, and its WA is the best result among all compared methods. The worst result is that of Haiyang Xu et al., whose multimodal fusion uses simple linear concatenation and reaches only 70.4% WA and 69.5% UA, probably because simple concatenation does not fully capture the complementary relationship between the two modalities. Ming Chen et al. add the two separate features in parallel on the basis of Haiyang Xu et al. and, by considering the correlations between and within the two modalities, obtain 71.06% WA and 72.05% UA, an improvement in performance. Yuanyuan Zhang et al., who adopt bilinear pooling to fuse the two kinds of features, reach 75.49% WA. The present invention performs feature fusion once more on the basis of bilinear pooling feature fusion to obtain a rich cross-modal feature representation, thereby improving WA by 4.89%. Qi Cao et al. achieve 78.74% WA and 79.77% UA using speech-text consistency and a contextual attention mechanism, which is slightly higher than our approach in UA. In summary, the above comparative experiments verify that the algorithm provided by the present invention has a significant advantage over the other algorithms.
To address the shortcomings of the feature fusion methods in the field of multimodal emotion recognition, a speech emotion recognition method based on a double-layer fusion deep network is provided, which can capture complex associations between audio and text. The method performs a first fusion based on the decomposed bilinear pooling fusion module (FBP) and then a second fusion, which is superior to concatenation and summation and allows the features of each modality to be fused more fully with lower time complexity. The network structure adopts a hierarchical mode: primary features are encoded by the first-layer sub-networks and mapped to high-level features by the second-layer network, effectively establishing the hierarchical relation between high-level and low-level features. Finally, experiments on the IEMOCAP dataset reach 80.38% WA and 78.62% UA. Compared with the related work of other researchers, the recognition rate is greatly improved, verifying the feasibility and effectiveness of the proposed efficient fusion algorithm.
The scope of the present invention is defined by the claims.

Claims (7)

1. A speech emotion recognition method based on a double-layer fusion deep network is characterized by comprising the following steps:
step 1: preprocessing a voice signal and a text signal, and performing alignment operation to make the voice signal and the text signal meet the input requirement of a network model;
step 2: inputting the voice and text feature vectors preprocessed in the step 1 into a decomposition bilinear pooling fusion module FBP for primary feature fusion;
step 3: passing the fused primary features output by the decomposed bilinear pooling fusion module FBP in step 2 through a level-1 primary feature coding network consisting of three submodels, namely LSTM, GRU and DNN, respectively;
step 4: secondarily fusing the outputs of the three level-1 sub-networks to encode high-level features, the fusion method being the Hadamard product, then inputting the fused features into a level-2 BiLSTM coding network, and finally connecting a classification output layer to predict the emotion category;
step 5: finally training the network model.
2. The speech emotion recognition method based on the two-layer fusion deep network, as recited in claim 1, wherein the text signal preprocessing is performed by using a word embedding manner, and a pre-trained Glove model is used to represent each word by using a vector.
3. The method for speech emotion recognition based on the two-layer fusion deep network as claimed in claim 1, wherein the speech signal preprocessing is to perform windowing and framing on the audio signal using a Hamming window with a window length of 25 ms and a frame shift of 10 ms, perform a Fourier transform on each frame, and finally perform a Mel filtering operation to obtain the Mel spectrum features.
4. The method of claim 1, wherein the aligning operation is to obtain the speech feature corresponding to each word by combining the speech frames associated with the words.
5. The method for recognizing speech emotion based on the two-layer fusion depth network as claimed in claim 1, wherein the structure of the network model comprises four layers in total: the first layer is the FBP primary feature fusion layer; the second layer is the level-1 primary feature coding network layer consisting of the three submodels LSTM, GRU and DNN; the third layer is the Hadamard fusion layer; and the fourth layer is the high-level coding network layer consisting of a BiLSTM.
6. The speech emotion recognition method based on the dual-layer fusion depth network as claimed in claim 1, wherein the fusion adopts the following fusion algorithm process: audio and text feature vectors are input; the audio features and the text features are first cross-fused through the FBP fusion module; the fused features are respectively passed through the level-1 primary feature coding network composed of the three submodels LSTM, GRU and DNN; the outputs of the three level-1 sub-networks are then secondarily fused to encode high-level features, the fusion method being the Hadamard product; the fused features are then input into the level-2 BiLSTM coding network; and finally a classification output layer is connected to predict the emotion category.
7. The method for recognizing speech emotion based on two-layer fusion depth network as claimed in claim 1, wherein the network model training mode is to adopt an Adam optimizer to minimize cross entropy loss function, the learning rate is set to 0.0001, the batch is 100, and L2 regularization is used to prevent model overfitting.
CN202210419568.1A 2022-04-21 2022-04-21 Speech emotion recognition method based on double-layer fusion deep network Pending CN114743569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210419568.1A CN114743569A (en) 2022-04-21 2022-04-21 Speech emotion recognition method based on double-layer fusion deep network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210419568.1A CN114743569A (en) 2022-04-21 2022-04-21 Speech emotion recognition method based on double-layer fusion deep network

Publications (1)

Publication Number Publication Date
CN114743569A true CN114743569A (en) 2022-07-12

Family

ID=82282837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210419568.1A Pending CN114743569A (en) 2022-04-21 2022-04-21 Speech emotion recognition method based on double-layer fusion deep network

Country Status (1)

Country Link
CN (1) CN114743569A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116468214A (en) * 2023-03-07 2023-07-21 德联易控科技(北京)有限公司 Evidence electronization method and electronic equipment based on fault event processing process
CN116468214B (en) * 2023-03-07 2023-12-15 德联易控科技(北京)有限公司 Evidence electronization method and electronic equipment based on fault event processing process

Similar Documents

Publication Publication Date Title
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
CN110211574B (en) Method for establishing voice recognition model based on bottleneck characteristics and multi-scale multi-head attention mechanism
Aldeneh et al. Using regional saliency for speech emotion recognition
Cai et al. A novel learnable dictionary encoding layer for end-to-end language identification
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
Sultana et al. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks
CN110675859B (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110287320A (en) A kind of deep learning of combination attention mechanism is classified sentiment analysis model more
CN110674339A (en) Chinese song emotion classification method based on multi-mode fusion
CN105047194B (en) A kind of self study sound spectrograph feature extracting method for speech emotion recognition
CN110060690A (en) Multi-to-multi voice conversion method based on STARGAN and ResNet
Zhong et al. A Lightweight Model Based on Separable Convolution for Speech Emotion Recognition.
Keyvanrad et al. Deep belief network training improvement using elite samples minimizing free energy
Zhao et al. Multi-level fusion of wav2vec 2.0 and bert for multimodal emotion recognition
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Parthasarathy et al. Improving emotion classification through variational inference of latent variables
CN110532380B (en) Text emotion classification method based on memory network
CN114694255A (en) Sentence-level lip language identification method based on channel attention and time convolution network
CN114898775A (en) Voice emotion recognition method and system based on cross-layer cross fusion
CN114743569A (en) Speech emotion recognition method based on double-layer fusion deep network
Liu et al. Graph based emotion recognition with attention pooling for variable-length utterances
CN112700796B (en) Voice emotion recognition method based on interactive attention model
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
Ejbali et al. Intelligent approach to train wavelet networks for Recognition System of Arabic Words

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination