CN117095702A - Multi-mode emotion recognition method based on gating multi-level feature coding network

Multi-mode emotion recognition method based on gating multi-level feature coding network

Info

Publication number: CN117095702A
Application number: CN202310909951.XA
Authority: CN (China)
Prior art keywords: voice, text, features, network, gating
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 孙林慧, 王静, 张子晓, 李平安, 叶蕾
Current Assignee: Nanjing University of Posts and Telecommunications (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Nanjing University of Posts and Telecommunications
Priority date: 2023-07-24 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2023-07-24
Publication date: 2023-11-21
Application filed by Nanjing University of Posts and Telecommunications; priority to CN202310909951.XA

Classifications

    • G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L13/027: Concept to speech synthesisers; generation of natural phrases from machine-based concepts
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/18: Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L25/24: Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The invention discloses a multi-modal emotion recognition method based on a gating multi-level feature coding network. First, shallow features are extracted from the original input speech and text modalities respectively. Then, a depth coding network is constructed for each modality to obtain emotion-related deep features. To better explore the spatial information of speech, a branch network is introduced to obtain spectrogram information based on a deep convolutional network. An adaptive multi-modal gating fusion module is designed to realize dynamic fusion of the feature information of the three branches, completing the construction of the model based on the gating multi-level feature coding network. Finally, the performance of the multi-modal emotion recognition method based on the gating multi-level feature coding network is evaluated. The invention further mines the multi-level emotion information of the input signal and constructs a gating-based multi-level feature coding network for multi-modal emotion recognition, so that the emotion recognition performance and generalization capability of the system are further improved and the method can be well applied to intelligent human-computer interaction.

Description

Multi-mode emotion recognition method based on gating multi-level feature coding network
Technical Field
The invention belongs to the technical field of speech and text recognition, and particularly relates to a multi-modal emotion recognition method based on a gating multi-level feature coding network.
Background
Emotion recognition aims to identify the dynamic emotional state of a person; the emotion underlying each person's behavior differs. Emotion recognition is important for social interaction in daily life: humans express their feelings in different ways, and emotion plays an important role in shaping human behavior, so accurate interpretation of these emotions is essential for meaningful communication. In real life, emotion is expressed through many channels, both verbal and non-verbal, including spoken language, facial expressions, body language and the like. Emotional signals from multiple modalities can therefore be used to predict the emotional state of a subject, and the performance of a multi-modal emotion recognition system is directly affected by the quality of the emotion features and of the classification model. With the rapid development of artificial intelligence technology, multi-modal emotion recognition is applied ever more widely, for example in social robots, security monitoring, online remote teaching assistance, and emotion monitoring of patients with depression. However, for multi-modal emotion recognition to be better applied in the field of human-computer interaction, its performance still needs to be further improved.
Extracting highly discriminative features rich in emotion information is the key to multi-modal emotion recognition. For the text modality, traditional hand-crafted text emotion features rely primarily on the Bag-of-Words (BoW) model. With the development of deep learning, many pre-trained word embedding models based on deep learning have been widely applied to text emotion feature extraction. Atmaja et al. integrated acoustic and semantic information via Long Short-Term Memory (LSTM) networks and Support Vector Machines (SVM). For the speech modality, Kwon et al. extracted two types of features based on a multi-level learning strategy, spatio-temporal highly discriminative features and long-term dependency features, and performed speech emotion recognition with a one-dimensional dilated convolutional neural network. Zhu et al. proposed a weight-aware multi-scale neural network that uses a global-aware fusion module and multi-scale feature information to improve emotion recognition performance. It has been found that using multi-scale speech features and mel spectrogram features provides more comprehensive audio information than using only the raw waveform or MFCCs as the input of a speech emotion recognition system.
In addition, the quality of the emotion classification model directly influences the performance of a multi-modal emotion recognition system. The development of deep learning has created more possibilities for such systems, and several researchers have proposed attention-based network models to improve recognition performance. Choi et al. used a CNN with an attention mechanism for feature learning between speech and text. Krishna et al. used large self-supervised pre-trained models and a cross-modal attention mechanism for multi-modal emotion recognition. Zhou et al. proposed a novel multi-modal feature fusion algorithm based on adaptive and multi-level decomposed bilinear pooling for emotion recognition.
The above research has advanced multi-modal emotion recognition, but several problems remain. First, in speech signal preprocessing, a single frame length is used for framing when extracting emotion features, and emotion recognition that relies only on time-frequency information lacks complementary spatial information. Second, when large pre-trained models are adopted for multi-modal emotion recognition, the models are complex and the amount of parameter computation is large, which limits practicality. Finally, in multi-modal emotion recognition the effective fusion of emotion information from different modalities directly affects recognition performance, so it is necessary to construct an adaptive multi-modal fusion module.
Disclosure of Invention
The invention aims to: provide a multi-modal emotion recognition method based on a gating multi-level feature coding network, in which time-domain, frequency-domain and spatial information is acquired through three different depth encoders, an adaptive multi-modal gating fusion module enables the network to capture more highly discriminative emotion features, and the gating fusion module implicitly completes the fusion of text and speech information through adaptive gating control.
The technical scheme is as follows: the invention discloses a multi-modal emotion recognition method based on a gating multi-level feature coding network, which comprises the following steps:
step 1, extracting shallow features of each original text through a Word2vec network, carrying out feature fusion after extracting multi-scale statistical features of each original voice to obtain multi-scale fusion voice features, and collecting Mel spectrogram features of each original voice;
step 2, taking the multi-scale fusion voice features and the shallow features of the original text as input and the multi-scale voice depth features and the text depth features as output, and constructing a depth feature coding model, wherein the depth feature coding model comprises a GRUs network framework suitable for the text features and a CNNs network framework suitable for the voice features;
step 3, constructing a self-adaptive multi-mode gating fusion module, respectively inputting the mel spectrogram characteristics in the step 1, the multi-scale voice depth characteristics and the text depth characteristics in the step 2 into the self-adaptive multi-mode gating fusion module, and adaptively weighing contribution of each mode through a gating mechanism, controlling information transmission and suppression, realizing multi-mode characteristic information fusion, and obtaining emotion classification labels;
and 4, performing performance evaluation on the multi-mode emotion recognition method based on the gating multi-level feature coding network.
Further, the step 1 specifically includes:
step 1.1, preprocessing each voice signal with two different frame lengths, 256 and 512 respectively, and extracting features from the preprocessed signals, including the 24-dimensional MFCC, the 24-dimensional MFCC first-order dynamic difference, the 24-dimensional MFCC second-order dynamic difference, energy, pitch frequency and short-time zero-crossing rate;
step 1.2, fusing the voice statistical characteristics obtained by the scales 256 and 512 to obtain 750-dimensional multi-scale fused voice characteristics;
step 1.3, preprocessing each original text, constructing a Word2vec network, realizing unsupervised learning through Word2vec, and learning the distribution of words in the context to capture the semantic information of the words, so that the words are mapped to the feature representation in a high-dimensional space to obtain 300-dimensional shallow text features;
and 1.4, sampling, pre-emphasis, framing and windowing the original voice, calculating a power spectrum, performing Mel scale transformation on the power spectrum, and converting the frequency into an auditory perception scale, thereby obtaining a Mel spectrogram.
Further, the step 2 specifically includes:
step 2.1, constructing a GRUs network framework suitable for text characteristics, comprising two GRU network layers, processing text sequence data through the GRU layers, and realizing characteristic coding with a simpler structure and fewer parameters;
step 2.2, constructing a CNNs network framework suitable for multi-scale voice features, wherein the CNNs network framework comprises a first convolution layer, a second ReLu layer, a third full connection layer, a fourth convolution layer, a fifth ReLu layer and a sixth full connection layer;
step 2.3, constructing a Mel-spectra-AlexNet branch, introducing an AlexNet pre-training network model, and processing the Mel Spectrogram information through convolution and a nonlinear activation function to obtain the characteristics of the voice Spectrogram;
further, in step 2.3, the constructing the Mel-spectra-alexent branch includes the following steps:
step 2.3.1, carrying out normalization operation on the mel spectrogram, and uniformly cutting the mel spectrogram to 224 x 224;
2.3.2, building an AlexNet integral network framework, which comprises a first convolution layer, a second maximum pooling layer, a third convolution layer, a fourth maximum pooling layer, a fifth convolution layer, a sixth convolution layer, a seventh convolution layer, an eighth full connection layer and a ninth full connection layer;
and 2.3.3, inputting the normalized voice Mel spectrogram into a pre-training model AlexNet network, and extracting spectrogram characteristics by utilizing convolution operation and nonlinear activation function, thereby obtaining 300-dimensional voice Mel spectrogram characteristics.
Further, the step 3 specifically includes:
step 3.1, inputting the text depth feature and the multi-scale voice depth feature into a gating fusion module to obtain a weighted vector e:
z = σ(W_z[x′_t, x′_f])
e = z·h_t + (1 - z)·h_f
wherein W_t, W_f and W_z respectively denote the weights applied to the text vector, the audio vector and the concatenated text-audio vector, h_t and h_f denote the correspondingly weighted text and audio vectors, e is the weighted vector output by the gating unit, x′_t denotes the depth-coded text feature vector, x′_f denotes the depth-coded multi-scale voice feature vector, and σ denotes the sigmoid function;
step 3.2, inputting the text depth feature and the mel spectrogram depth feature into a gating fusion module to obtain a weighted vector e′:
z′ = σ(W′_z[x′_t, x′_m])
e′ = z′·h_t + (1 - z′)·h_m
wherein W_t, W_m and W′_z respectively denote the weights applied to the text vector, the mel spectrogram vector and the concatenated text-mel spectrogram vector, h_t and h_m denote the correspondingly weighted text and mel spectrogram vectors, e′ is the weighted vector output by the gating unit, x′_t denotes the depth-coded text feature vector, x′_m denotes the depth-coded mel spectrogram feature vector, and σ denotes the sigmoid function;
step 3.3, concatenating the weighted vectors e and e′ with the retained text depth feature to obtain the adaptive multi-modal weighted feature, and inputting it into a classifier, wherein the classifier comprises a first fully connected layer, a second fully connected layer and a third Softmax layer;
step 3.4, setting the cross entropy loss function as the loss function of the gating multi-level feature coding network, and obtaining the emotion classification label at the Softmax layer:
p_i = exp(s_i) / Σ_{j=1}^{J} exp(s_j), L = -Σ_{i=1}^{J} y_i·log(p_i)
wherein p_i denotes the probability of the i-th emotion class, s_i denotes the i-th emotion feature (the input of the i-th Softmax unit), y_i denotes the ground-truth label of the i-th class, and J denotes the number of emotion classes.
Further, the step 4 specifically includes:
step 4.1, inputting the text shallow features and the multi-scale voice statistical features into the same GRUs network, inputting the text shallow features and the multi-scale voice statistical features into the same CNNs network, and verifying the validity of the GRUs-CNNs parallel network;
step 4.2, extracting voice statistical features with a frame length of 256 and voice statistical features with a frame length of 512, respectively inputting the voice statistical features into a depth coding network as voice statistical features, and verifying the effectiveness of using multi-scale features to multi-mode emotion recognition;
step 4.3, respectively inputting text mode characteristics and voice mode characteristics into a GRUs-CNNs parallel network, introducing Mel-spectra-AlexNet branches, and verifying the effectiveness of introducing Mel-spectra-AlexNet branches on multi-mode emotion recognition;
and 4.4, respectively inputting the text depth characteristics and the multi-level voice statistical characteristics into a depth coding network containing the gating fusion module and a depth coding network not containing the gating fusion module, and verifying the effectiveness of the gating fusion module on multi-mode emotion recognition.
The beneficial effects are that: compared with the prior art, the invention has the following remarkable advantages:
1. the method provides a simple and effective gating multi-level feature coding network. The method and the system have the advantages that the time domain frequency domain and the space information are acquired through three different depth encoders, and the network can capture more high-discrimination emotion characteristics through the self-adaptive multi-mode gating fusion module.
2. Multi-level features are introduced as shallow features of speech, including multi-scale shallow speech statistics features and mel spectrogram features. The Mel-spectra-AlexNet is used as an auxiliary branch to supplement the information of the voice in the spatial dimension.
3. The multi-mode gating fusion module is provided for realizing dynamic fusion of multi-level depth characteristics. The multi-mode gating fusion module can implicitly complete the fusion of text-voice information through the self-adaptive control of the multi-mode gating system.
Drawings
FIG. 1 is a system block diagram of the present invention;
FIG. 2 is a block diagram of the inside of a GRU unit of the text depth coding branch of the present invention;
FIG. 3 is a block diagram of depth coding of multi-level speech statistics of the present invention;
FIG. 4 is a Mel-spectra-AlexNet branch of the present invention;
FIG. 5 is a block diagram showing the internal calculation of the gating cell G according to the present invention;
FIG. 6 is a block flow diagram of a multi-modal gating fusion module of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
Referring to fig. 1, the present embodiment provides a multi-modal emotion recognition method based on a gating multi-level feature coding network. In practical applications, emotion recognition is usually performed with features extracted at a single frame length, which is insufficient to represent speech emotion information; similarly, performing emotion recognition using only time-frequency information lacks complementary spatial information. Furthermore, when large pre-trained models are adopted for multi-modal emotion recognition, the models are complex and computationally expensive, which affects practicality. Finally, in multi-modal emotion recognition the effective fusion of emotion information from different modalities directly affects recognition performance, so the multi-modal feature information needs to be weighted and fused to obtain more discriminative and effective emotion information. To overcome these deficiencies, the invention provides a multi-modal emotion recognition method based on a gating multi-level feature coding network that can effectively capture highly discriminative emotion information. In the network, the audio modality and the text modality are encoded separately to obtain emotion-related deep features. In addition, the Mel-spectra-AlexNet branch is introduced to further explore the spatial information of the speech signal. Finally, a multi-modal gating fusion module is adopted to fuse the speech-text information from the three branches, thereby obtaining a dynamic weight vector.
The invention provides a simple and efficient gating multi-level feature coding network for emotion recognition, which is divided into three stages. In the first stage, shallow features are extracted from the original text and speech. For the text modality, the original text is normalized and discretized into numerical form, and shallow features are then extracted from the text using a Word2vec network. For the speech modality, long-term speech is first framed and preprocessed to obtain multi-scale high-order statistical features and mel spectrogram features. In the second stage, in order to mine more discriminative and effective emotion features, the shallow features are input into depth encoders for feature coding, and advanced feature representations are learned, thereby obtaining highly discriminative deep emotion features. In the third stage, a multi-modal gating fusion module is adopted to combine the deep features of the three branches; through this module, adaptive feature fusion is realized, so that the generated fused features pay more attention to the information that contributes most to judging the emotion label. Specific embodiments of the invention are discussed in detail below.
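As a concrete illustration of the Word2vec shallow text feature extraction used in the first stage above, the following minimal sketch assumes the gensim library; the toy corpus, the whitespace tokenizer, the skip-gram setting and the zero-padding to 50 words are illustrative assumptions rather than requirements of the invention.

```python
# A minimal sketch of the Word2vec shallow text features (300-d word vectors),
# assuming gensim; corpus, tokenizer and the 50-word cap are placeholders.
import numpy as np
from gensim.models import Word2Vec

corpus = [
    "i am so happy to see you today",
    "this is really making me angry",
]
tokenized = [s.split() for s in corpus]          # placeholder tokenizer

# Unsupervised training: words are mapped to 300-dimensional vectors that
# capture their distribution in context (skip-gram shown here).
w2v = Word2Vec(sentences=tokenized, vector_size=300, window=5,
               min_count=1, sg=1, epochs=10)

MAX_WORDS = 50                                    # each text limited to 50 words

def text_to_features(tokens, model, max_words=MAX_WORDS, dim=300):
    """Return a (max_words, dim) matrix of word vectors, zero-padded so that a
    Masking layer can ignore the padding positions downstream."""
    feats = np.zeros((max_words, dim), dtype=np.float32)
    for i, w in enumerate(tokens[:max_words]):
        if w in model.wv:
            feats[i] = model.wv[w]
    return feats

x_text = np.stack([text_to_features(t, w2v) for t in tokenized])
print(x_text.shape)                               # (2, 50, 300)
```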
step 1: collecting multi-scale statistical characteristics of each voice and carrying out characteristic fusion, collecting Word2vec shallow characteristics of each text, collecting Mel spectrogram characteristics of each voice, and taking multi-level voice characteristics and text characteristics as input of a depth characteristic coding model.
Collecting multi-scale voice characteristics of each voice:
the speech signal is non-stationary but typically in a relatively short time of 10ms to 40ms, the speech signal can be regarded as approximately stationary. Before extracting the emotion features of the speech signal, a smooth frame signal is usually obtained by framing. When different frame lengths are adopted for framing, the extracted emotion characteristics contained in the emotion characteristics with different scales are different. Therefore, frames are divided by adopting the frame lengths of 256 and 512 with different scales, and then the MFCC characteristics, the MFCC first-order differential characteristics, the MFCC second-order differential characteristics, the energy characteristics, the pitch frequency characteristics and the short-time zero-crossing rate characteristics of the original voice signals are respectively calculated and extracted.
Carrying out statistical calculation and feature fusion on the multi-scale voice features:
because the emotion information contained in the single-frame voice signal is limited and is insufficient to express accurate emotion, after the emotion characteristics are extracted in frames, the statistical characteristics of the multi-frame voice signal need to be calculated. Statistics employed by the present invention include maximum, minimum, median, variance, and mean. The calculation of 5 common statistics can be expressed as x max =max(x 1 ,x 2 ,…,x N ),x min =min(x 1 ,x 2 ,…,x N ), Wherein x is i Indicating the ith feature and N indicating the length of the feature. And then, fusing the global statistical features obtained with the frame lengths of 25 and 512 to obtain 750-dimensional multi-scale voice statistical features.
Extracting a mel spectrogram of each voice:
In order to better capture the time-frequency and spatial information of speech, the mel spectrogram is introduced to compensate for the limitations of the multi-scale voice feature information. First, the original voice is sampled, pre-emphasized, framed and windowed to obtain the preprocessed voice signal. Then, the power spectrum of the voice signal is calculated, and the spectrogram formed by the power spectra of consecutive frames is passed through a mel filter bank to obtain the mel spectrogram.
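A minimal sketch of the mel spectrogram extraction, assuming librosa; the 25 ms window and 10 ms hop follow the experimental settings reported later, while the 128 mel bands and the dB conversion are illustrative assumptions.

```python
# Sketch of the mel spectrogram extraction described above.
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=16000)               # hypothetical input file
y = librosa.effects.preemphasis(y)                         # pre-emphasis
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=int(0.025 * sr),                                 # 25 ms analysis window
    hop_length=int(0.010 * sr),                            # 10 ms hop
    n_mels=128, power=2.0)                                 # power spectrum -> mel filter bank
mel_db = librosa.power_to_db(mel, ref=np.max)              # log-mel image for the branch network
print(mel_db.shape)                                        # (128, n_frames)
```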
step 2: and carrying out depth treatment aiming at different modes to realize the construction of a depth characteristic coding model.
Constructing the GRUs network framework:
A text input branch of the multi-level depth coding network is constructed, with the internal structure of the GRU unit shown in fig. 2. For the text modality, a Masking layer is first used to process the features from the Word2vec network, which then pass through the first GRU layer, the second GRU layer and the third Flatten layer. The first GRU layer has 256 units with a tanh activation function, the second GRU layer likewise has 256 units with a tanh activation function, and the Flatten layer then flattens the data. Let the input text feature of the depth coding branch be x_t; the computation of the text depth coding branch can be expressed as x′_t = f_t(x_t) = GRU(GRU(x_t)), where x′_t denotes the deep text coding feature.
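A minimal Keras sketch of this text branch, assuming 50-word inputs with 300-dimensional Word2vec vectors; having the second GRU return only its final state, so that the Flatten layer is effectively a no-op, is an assumption about the exact reading of fig. 2.

```python
# Sketch of the text depth coding branch f_t: Masking -> GRU(256, tanh)
# -> GRU(256, tanh) -> Flatten, for 50-word texts with 300-d Word2vec vectors.
import tensorflow as tf
from tensorflow.keras import layers

text_in = tf.keras.Input(shape=(50, 300), name="word2vec_seq")
x = layers.Masking(mask_value=0.0)(text_in)          # ignore zero-padded words
x = layers.GRU(256, activation="tanh", return_sequences=True)(x)
x = layers.GRU(256, activation="tanh")(x)            # returns the last hidden state
x_t_deep = layers.Flatten()(x)                       # x'_t, the deep text feature

text_branch = tf.keras.Model(text_in, x_t_deep, name="text_gru_branch")
```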
Constructing a CNNs network framework:
a CNNs network framework suitable for multi-scale speech features is constructed, the flow chart of which is shown in fig. 3.
For the multi-scale speech features, let the input of the depth coding network be x_f, where x_f is the multi-scale voice statistical feature. It is passed through a CNNs depth coding network comprising a first convolution layer, a second ReLU layer, a third max pooling layer, a fourth convolution layer, a fifth ReLU layer and a sixth max pooling layer. The first convolution layer has a kernel size of 3, 512 filters and a stride of 2. The third max pooling layer has a pooling kernel size of 2 and a stride of 2. The fourth convolution layer has a kernel size of 3, 512 filters and a stride of 1. The sixth max pooling layer has a pooling kernel size of 3 and a stride of 3. The computation of the multi-scale speech feature depth coding branch can be expressed as x′_f = f_h(x_f) = Conv1D(Conv1D(x_f)), where x′_f denotes the deep multi-scale speech coding feature.
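A minimal Keras sketch of this multi-scale speech branch; treating the 750-dimensional statistical feature as a length-750 sequence with one channel, and the final Flatten layer, are assumptions for illustration.

```python
# Sketch of the multi-scale speech depth coding branch f_h:
# Conv1D(512, k=3, s=2) -> ReLU -> MaxPool(2, 2) -> Conv1D(512, k=3, s=1)
# -> ReLU -> MaxPool(3, 3).
import tensorflow as tf
from tensorflow.keras import layers

speech_in = tf.keras.Input(shape=(750, 1), name="multiscale_stats")
x = layers.Conv1D(512, kernel_size=3, strides=2)(speech_in)
x = layers.ReLU()(x)
x = layers.MaxPooling1D(pool_size=2, strides=2)(x)
x = layers.Conv1D(512, kernel_size=3, strides=1)(x)
x = layers.ReLU()(x)
x = layers.MaxPooling1D(pool_size=3, strides=3)(x)
x_f_deep = layers.Flatten()(x)               # x'_f, the deep multi-scale speech feature

speech_branch = tf.keras.Model(speech_in, x_f_deep, name="speech_cnn_branch")
```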
Constructing Mel-spectra-AlexNet branches:
A. The mel spectrogram is normalized and uniformly cropped to 224 x 224.
B. An AlexNet pre-trained model is introduced, with the flow chart shown in fig. 4. The AlexNet network comprises a first convolution layer, a second max pooling layer, a third convolution layer, a fourth max pooling layer, a fifth convolution layer, a sixth convolution layer, a seventh convolution layer, an eighth fully connected layer and a ninth fully connected layer. The first convolution layer has a kernel size of 11 x 11, 96 filters and a stride of 4 x 4. The second pooling layer has a pooling kernel size of 3 x 3 and a stride of 2 x 2. The third convolution layer has a kernel size of 5 x 5 and a stride of 1 x 1. The fourth pooling layer has a pooling kernel size of 3 x 3 and a stride of 2 x 2. The fifth and sixth convolution layers have a kernel size of 3 x 3, 384 filters and a stride of 1 x 1. The seventh convolution layer has a kernel size of 3 x 3, 256 filters and a stride of 1 x 1. The eighth fully connected layer has 4096 units with a ReLU activation function. The ninth fully connected layer has 300 units with a Softmax activation function.
C. The normalized voice mel spectrogram is input into the pre-trained AlexNet network, and the 300-dimensional voice mel spectrogram feature is obtained through network training. Let the input mel spectrogram be x_m; the computation of the speech mel spectrogram depth coding branch can be expressed as x′_m = f_m(x_m) = AlexNet(x_m), where x′_m denotes the deep mel spectrogram feature.
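A minimal Keras sketch of the Mel-spectra-AlexNet branch with the layer configuration listed above. tf.keras does not ship a pre-trained AlexNet, so the sketch builds the stack from scratch with random weights; the three-channel 224 x 224 input, the "same" padding and the 256 filters of the third convolution layer (not specified above, taken from standard AlexNet) are assumptions.

```python
# Sketch of the Mel-spectra-AlexNet branch ending in a 300-d output x'_m.
import tensorflow as tf
from tensorflow.keras import layers

mel_in = tf.keras.Input(shape=(224, 224, 3), name="mel_image")
x = layers.Conv2D(96, 11, strides=4, activation="relu")(mel_in)    # conv 1
x = layers.MaxPooling2D(3, strides=2)(x)                           # pool 2
x = layers.Conv2D(256, 5, strides=1, padding="same",
                  activation="relu")(x)                            # conv 3
x = layers.MaxPooling2D(3, strides=2)(x)                           # pool 4
x = layers.Conv2D(384, 3, strides=1, padding="same",
                  activation="relu")(x)                            # conv 5
x = layers.Conv2D(384, 3, strides=1, padding="same",
                  activation="relu")(x)                            # conv 6
x = layers.Conv2D(256, 3, strides=1, padding="same",
                  activation="relu")(x)                            # conv 7
x = layers.Flatten()(x)
x = layers.Dense(4096, activation="relu")(x)                       # fc 8
x_m_deep = layers.Dense(300, activation="softmax")(x)              # fc 9, x'_m

mel_branch = tf.keras.Model(mel_in, x_m_deep, name="mel_alexnet_branch")
```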
Step 3: construction multi-mode door control fusion module
Inputting the text and multi-scale speech depth features into the gating fusion module:
The text depth feature and the multi-scale voice depth feature are input into a gating unit, whose internal computational block diagram is shown in fig. 5. Let the input text depth feature be x′_t and the multi-scale voice depth feature be x′_f. The calculation process of the first gating module can be expressed as:
z = σ(W_z[x′_t, x′_f])
e = z·h_t + (1 - z)·h_f
wherein W_t, W_f and W_z respectively denote the weights applied to the text vector, the audio vector and the concatenated text-audio vector, h_t and h_f denote the correspondingly weighted text and audio vectors, σ denotes the sigmoid function, and e is the weighted vector output by the gating unit.
Inputting the text and mel spectrogram depth features into the gating fusion module:
The text depth feature and the mel spectrogram depth feature are input into a second gating unit. Let the input text depth feature be x′_t and the mel spectrogram depth feature be x′_m. The calculation process of the second gating module can be expressed as:
z′ = σ(W′_z[x′_t, x′_m])
e′ = z′·h_t + (1 - z′)·h_m
wherein W_t, W_m and W′_z respectively denote the weights applied to the text vector, the mel spectrogram vector and the concatenated text-mel spectrogram vector, h_t and h_m denote the correspondingly weighted text and mel spectrogram vectors, σ denotes the sigmoid function, and e′ is the weighted vector output by the gating unit.
Weighted depth feature concatenation:
The weighted vector e, the weighted vector e′ and the text depth vector x′_t are concatenated. Retaining the text modality feature better preserves the completeness of the text information. The flow of the multi-modal gating fusion module is shown in fig. 6, and its computation can be expressed as x_z = Concatenate[e, e′, x′_t].
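A minimal sketch of the gating unit G and of the fusion x_z, reusing the deep features from the branch sketches above; projecting both inputs of a gate into a common 300-dimensional space before the convex combination is an assumption needed for the element-wise sum.

```python
# Sketch of the gating unit and the fusion x_z = Concatenate[e, e', x'_t],
# reusing x_t_deep, x_f_deep and x_m_deep from the earlier branch sketches.
import tensorflow as tf
from tensorflow.keras import layers

def gating_unit(x_text, x_other, units=300):
    """z = sigmoid(W_z [x_text, x_other]); e = z*h_t + (1 - z)*h_o."""
    h_t = layers.Dense(units)(x_text)                    # W_t applied to the text vector
    h_o = layers.Dense(units)(x_other)                   # W_f / W_m applied to the other modality
    z = layers.Dense(units, activation="sigmoid")(
        layers.Concatenate()([x_text, x_other]))         # W_z on the concatenated vector
    return z * h_t + (1.0 - z) * h_o                     # weighted vector e

e_speech = gating_unit(x_t_deep, x_f_deep)               # text / multi-scale speech gate
e_mel = gating_unit(x_t_deep, x_m_deep)                  # text / mel spectrogram gate
x_z = layers.Concatenate(name="adaptive_fusion")([e_speech, e_mel, x_t_deep])
```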
Setting a loss function for experiments:
and adopting a cross entropy Loss function as a los function, and completing the multi-mode emotion recognition task through one Softmax. Softmax function and output predictive tagThe calculation formula of (2) is as follows:
wherein,the input of the ith Softmax cell is shown, i e (1, j). J represents the number of output emotion categories. P is p i Representing the probability of the i-th emotion.
Step 4: and performing performance evaluation on the multi-mode emotion recognition method based on the gating multi-level feature coding network.
Database and experimental setup:
The dataset adopted in this embodiment is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which consists of scripted and improvised dialogues performed by professional actors. The dataset contains 5 sessions and about 12 hours of speech data, covering emotions such as anger (angry), excitement (excited), happiness (happy), sadness (sad) and neutral. In the experiments, the excited and happy samples were merged into one class and denoted happy, so samples of four emotion categories are used: angry, happy, sad and neutral. The experimental dataset includes 5531 utterances, split into training and test sets at a ratio of 8:2, as shown in Table 1. In addition, 15% of the training utterances are set aside as the validation set, and performance is evaluated on the IEMOCAP dataset using 5-fold cross-validation.
TABLE 1 Number of utterances of each emotion category in the IEMOCAP dataset
In the experiments, the model was optimized using the Adam optimizer. The initial learning rate of the model was set to 10^-2, the batch size was 16 and the number of iterations was 100. Experiments were implemented in TensorFlow v2.7.0. First, the original text and voice data are preprocessed. For the text modality, the original text is numerically discretized and the number of words per text is limited to 50; a CNN layer followed by a Dense layer is then used to obtain the text features. For the speech modality, a mel spectrogram with a window size of 25 ms and a hop size of 10 ms is generated using librosa. In addition, shallow features are extracted for each emotion speech sample at frame lengths of 256 and 512. To characterize the global emotion of the whole utterance, the statistics of the frame-level emotion features are calculated over each utterance; the five statistics used here are the maximum, minimum, median, variance and mean. The statistical features under the different frame length settings are then merged as the multi-scale speech statistical features. Finally, to prevent overfitting, the model is constrained using a dropout layer and regularization. The hyper-parameters of the model are shown in Table 2.
TABLE 2 Hyper-parameter settings of the network model
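A minimal sketch of the training configuration stated above (Adam, initial learning rate 10^-2, batch size 16, 100 iterations, 15% validation split); the array names are placeholders for the prepared features and one-hot labels.

```python
# Sketch of the stated training setup; x_text, x_stats, x_mel and y_onehot are
# assumed to be pre-computed NumPy arrays, and the cross entropy loss
# corresponds to L = -sum_i y_i log p_i.
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-2),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit([x_text, x_stats, x_mel], y_onehot,
                    batch_size=16, epochs=100,
                    validation_split=0.15)
```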
Experimental performance evaluation:
in this embodiment, a block diagram of the system is shown in fig. 1.
First, to verify the effectiveness of the depth coding network for emotion recognition, experiments were carried out with different coding network frameworks: GRUs coding networks, CNNs coding networks and the GRUs-CNNs parallel coding network were built separately. Table 3 shows the multi-modal emotion recognition rates under the different coding network frameworks. As can be seen from Table 3, the emotion recognition rate of the model is highest, 81.39%, when the text features and the speech features are deeply coded separately by the parallel coding framework. For the text modality, the GRUs network better captures contextual information and yields features carrying emotion characterization information; for the speech modality, the CNNs better capture the effective information in the multi-scale speech features, thereby improving the emotion recognition rate.
TABLE 3 emotion recognition rate table for different depth coding models
Then, to verify that multi-scale fused emotion features can further improve speech emotion recognition performance, comparative experiments were carried out under different frame length settings. Table 4 shows the weighted recognition rates for emotion recognition using the speech emotion features at frame lengths 256 and 512 and their fused features. Table 4 shows that the speech emotion features extracted with multi-scale frame lengths significantly improve emotion recognition performance compared with single-scale emotion features. Among the single-scale settings of 256 and 512, the features at frame length 256 give the best performance, with an average recognition rate of 81.99%. Using different single-scale emotion features for emotion classification leads to different system performance, mainly because the emotion information contained in features at different scales differs; fusing multi-scale features makes full use of the diversity of the features and mines the emotion information in the speech signal more comprehensively. When the emotion features extracted at the two frame length scales are fused for emotion classification, combining the statistical features at frame lengths 256 and 512 gives the best performance, with an average recognition rate of 81.18%. This shows that multi-scale emotion feature fusion can further improve speech emotion recognition performance.
TABLE 4 Emotion recognition rate table for different frame length models
Next, the effectiveness of introducing the Mel-spectra-AlexNet branch was verified. Table 5 gives the weighted emotion recognition rates of the gating multi-level feature coding network with and without the Mel-spectra-AlexNet branch. As can be seen from Table 5, the emotion recognition rate of the model is improved by 0.79% when the Mel-spectra-AlexNet branch is introduced. Using only traditional hand-crafted features for the multi-modal emotion recognition task suffers from information loss; introducing the Mel-spectra-AlexNet branch effectively alleviates this problem and compensates for the lack of spatial information in the traditional hand-crafted features, so that the depth-coded features can be used more effectively for emotion judgment.
TABLE 5 Emotion recognition rates of the model with and without the Mel-spectra-AlexNet branch
Finally, experiments were carried out to compare the effectiveness of the gating fusion module in improving emotion recognition performance. Traditional multi-modal emotion recognition systems mostly use simple concatenation of the multi-modal features for emotion computation, whereas in practice different features contribute differently to emotion recognition. Therefore, the invention constructs a feature coding network based on the gating fusion module to obtain weighted feature vectors. The experimental performance comparison of the gating fusion module is shown in Table 6: the emotion recognition rate of the multi-level feature coding network using simple concatenation is 81.82%, while that of the multi-level feature coding network using gating is 82.18%. Compared with the traditional fusion method of direct concatenation, the gating fusion module can focus on the features with high emotion discrimination. By constructing the gating fusion module, adaptive feature fusion is realized and more weight is given to the features that lead to the correct emotion output.
TABLE 6 Emotion recognition rates of the model with and without the gating fusion module
The above results show that the multi-modal emotion recognition method based on the gating multi-level feature coding network improves the recognition rate of the speech emotion recognition system in three respects. First, a multi-branch depth coding network realizes targeted coding of the different modalities. Second, emotion features are extracted at frame lengths 256 and 512, their statistical features are calculated, and the statistical features of the two scales are fused as the input of the model, so that the emotion information in the speech signal is mined more comprehensively; at the same time, the Mel-spectra-AlexNet branch is introduced to supplement speech emotion information and better represent the spatial information of the speech signal. Finally, for the feature fusion problem in multi-modal emotion recognition, the gating-based multi-level feature coding network constructed by the invention realizes adaptive feature fusion and gives more weight to features with high emotion discrimination, thereby improving multi-modal emotion recognition performance. The model constructed by the invention was tested on the IEMOCAP emotion database and achieves an emotion recognition rate of 82.18%, which verifies the effectiveness of the proposed multi-modal emotion recognition method and shows good practical application value.
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the specific embodiments described above, and that the above specific embodiments and descriptions are provided for further illustration of the principles of the present invention, and that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined in the appended claims. The scope of the invention is defined by the claims and their equivalents.

Claims (6)

1. A multi-mode emotion recognition method based on a gating multi-level feature coding network is characterized by comprising the following steps:
step 1, extracting shallow features of each original text through a Word2vec network, carrying out feature fusion after extracting multi-scale statistical features of each original voice to obtain multi-scale fusion voice features, and collecting Mel spectrogram features of each original voice;
step 2, taking the multi-scale fusion voice features and the shallow features of the original text as input and the multi-scale voice depth features and the text depth features as output, and constructing a depth feature coding model, wherein the depth feature coding model comprises a GRUs network framework suitable for the text features and a CNNs network framework suitable for the voice features;
step 3, constructing a self-adaptive multi-mode gating fusion module, respectively inputting the mel spectrogram characteristics in the step 1, the multi-scale voice depth characteristics and the text depth characteristics in the step 2 into the self-adaptive multi-mode gating fusion module, and adaptively weighing contribution of each mode through a gating mechanism, controlling information transmission and suppression, realizing multi-mode characteristic information fusion, and obtaining emotion classification labels;
and 4, performing performance evaluation on the multi-mode emotion recognition method based on the gating multi-level feature coding network.
2. The method for identifying multi-modal emotion based on a gated multi-level feature encoding network according to claim 1, wherein step 1 specifically comprises:
step 1.1, preprocessing each voice signal with two different frame lengths, 256 and 512 respectively, and extracting features from the preprocessed signals, including the 24-dimensional MFCC, the 24-dimensional MFCC first-order dynamic difference, the 24-dimensional MFCC second-order dynamic difference, energy, pitch frequency and short-time zero-crossing rate;
step 1.2, fusing the voice statistical characteristics obtained by the scales 256 and 512 to obtain 750-dimensional multi-scale fused voice characteristics;
step 1.3, preprocessing each original text, constructing a Word2vec network, realizing unsupervised learning through Word2vec, and learning the distribution of words in the context to capture the semantic information of the words, so that the words are mapped to the feature representation in a high-dimensional space to obtain 300-dimensional shallow text features;
and 1.4, sampling, pre-emphasis, framing and windowing the original voice, calculating a power spectrum, performing Mel scale transformation on the power spectrum, and converting the frequency into an auditory perception scale, thereby obtaining a Mel spectrogram.
3. The method for identifying multi-modal emotion based on a gated multi-level feature encoding network according to claim 1, wherein step 2 is specifically:
step 2.1, constructing a GRUs network framework suitable for text characteristics, comprising two GRU network layers, processing text sequence data through the GRU layers, and realizing characteristic coding with a simpler structure and fewer parameters;
step 2.2, constructing a CNNs network framework suitable for multi-scale voice features, wherein the CNNs network framework comprises a first convolution layer, a second ReLu layer, a third full connection layer, a fourth convolution layer, a fifth ReLu layer and a sixth full connection layer;
and 2.3, constructing a Mel-spectra-AlexNet branch, introducing an AlexNet pre-training network model, and processing the Mel Spectrogram information through convolution and a nonlinear activation function to obtain the characteristics of the voice Spectrogram.
4. A method for identifying multi-modal emotion based on a gated multi-level feature encoding network according to claim 3, wherein in step 2.3, the constructing Mel-spectra-alexent branches comprises the steps of:
step 2.3.1, carrying out normalization operation on the mel spectrogram, and uniformly cutting the mel spectrogram to 224 x 224;
2.3.2, building an AlexNet integral network framework, which comprises a first convolution layer, a second maximum pooling layer, a third convolution layer, a fourth maximum pooling layer, a fifth convolution layer, a sixth convolution layer, a seventh convolution layer, an eighth full connection layer and a ninth full connection layer;
and 2.3.3, inputting the normalized voice Mel spectrogram into a pre-training model AlexNet network, and extracting spectrogram characteristics by utilizing convolution operation and nonlinear activation function, thereby obtaining 300-dimensional voice Mel spectrogram characteristics.
5. The method for identifying multi-modal emotion based on a gated multi-level feature encoding network according to claim 1, wherein step 3 is specifically:
step 3.1, inputting the text depth feature and the multi-scale voice depth feature into a gating fusion module to obtain a weighted vector e:
z = σ(W_z[x′_t, x′_f])
e = z·h_t + (1 - z)·h_f
wherein W_t, W_f and W_z respectively denote the weights applied to the text vector, the audio vector and the concatenated text-audio vector, h_t and h_f denote the correspondingly weighted text and audio vectors, e is the weighted vector output by the gating unit, x′_t denotes the depth-coded text feature vector, x′_f denotes the depth-coded multi-scale voice feature vector, and σ denotes the sigmoid function;
step 3.2, inputting the text depth feature and the mel spectrogram depth feature into a gating fusion module to obtain a weighted vector e′:
z′ = σ(W′_z[x′_t, x′_m])
e′ = z′·h_t + (1 - z′)·h_m
wherein W_t, W_m and W′_z respectively denote the weights applied to the text vector, the mel spectrogram vector and the concatenated text-mel spectrogram vector, h_t and h_m denote the correspondingly weighted text and mel spectrogram vectors, e′ is the weighted vector output by the gating unit, x′_t denotes the depth-coded text feature vector, x′_m denotes the depth-coded mel spectrogram feature vector, and σ denotes the sigmoid function;
step 3.3, concatenating the weighted vectors e and e′ with the retained text depth feature to obtain the adaptive multi-modal weighted feature, and inputting it into a classifier, wherein the classifier comprises a first fully connected layer, a second fully connected layer and a third Softmax layer;
step 3.4, setting the cross entropy loss function as the loss function of the gating multi-level feature coding network, and obtaining the emotion classification label at the Softmax layer:
p_i = exp(s_i) / Σ_{j=1}^{J} exp(s_j), L = -Σ_{i=1}^{J} y_i·log(p_i)
wherein p_i denotes the probability of the i-th emotion class, s_i denotes the i-th emotion feature (the input of the i-th Softmax unit), y_i denotes the ground-truth label of the i-th class, and J denotes the number of emotion classes.
6. The method for identifying multi-modal emotion based on a gated multi-level feature encoding network according to claim 1, wherein step 4 is specifically:
step 4.1, inputting the text shallow features and the multi-scale voice statistical features into the same GRUs network, inputting the text shallow features and the multi-scale voice statistical features into the same CNNs network, and verifying the validity of the GRUs-CNNs parallel network;
step 4.2, extracting voice statistical features with a frame length of 256 and voice statistical features with a frame length of 512, respectively inputting the voice statistical features into a depth coding network as voice statistical features, and verifying the effectiveness of using multi-scale features to multi-mode emotion recognition;
step 4.3, respectively inputting text mode characteristics and voice mode characteristics into a GRUs-CNNs parallel network, introducing Mel-spectra-AlexNet branches, and verifying the effectiveness of introducing Mel-spectra-AlexNet branches on multi-mode emotion recognition;
and 4.4, respectively inputting the text depth characteristics and the multi-level voice statistical characteristics into a depth coding network containing the gating fusion module and a depth coding network not containing the gating fusion module, and verifying the effectiveness of the gating fusion module on multi-mode emotion recognition.
CN117095702A (en), status: Pending. Application CN202310909951.XA, filed 2023-07-24, priority date 2023-07-24: Multi-mode emotion recognition method based on gating multi-level feature coding network

Priority Applications (1)

Application Number: CN202310909951.XA
Priority Date: 2023-07-24
Filing Date: 2023-07-24
Title: Multi-mode emotion recognition method based on gating multi-level feature coding network (CN117095702A, en)

Publications (1)

Publication Number: CN117095702A
Publication Date: 2023-11-21

Family

ID=88768845

Family Applications (1)

Application Number: CN202310909951.XA
Publication: CN117095702A (en)
Title: Multi-mode emotion recognition method based on gating multi-level feature coding network

Country Status (1)

Country Link
CN (1) CN117095702A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117649633A (en) * 2024-01-30 2024-03-05 武汉纺织大学 Pavement pothole detection method for highway inspection
CN117649633B (en) * 2024-01-30 2024-04-26 武汉纺织大学 Pavement pothole detection method for highway inspection
CN117668285A (en) * 2024-01-31 2024-03-08 合肥师范学院 Music emotion matching method based on acoustic features
CN117668285B (en) * 2024-01-31 2024-04-09 合肥师范学院 Music emotion matching method based on acoustic features


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination