CN113450830A - Speech emotion recognition method of a convolutional recurrent neural network with multiple attention mechanisms - Google Patents

Speech emotion recognition method of a convolutional recurrent neural network with multiple attention mechanisms

Info

Publication number
CN113450830A
Authority
CN
China
Prior art keywords
features
attention
cnn
neural network
layer
Prior art date
Legal status
Granted
Application number
CN202110695847.6A
Other languages
Chinese (zh)
Other versions
CN113450830B (en)
Inventor
姜芃旭
梁瑞宇
赵力
徐新洲
陶华伟
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202110695847.6A
Publication of CN113450830A
Application granted
Publication of CN113450830B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a speech emotion recognition method based on a convolutional recurrent neural network with multiple attention mechanisms, which comprises the following steps. Step 1: extract spectrogram features and frame-level features. Step 2: feed the spectrogram features into a CNN module to learn the time-frequency information in the features. Step 3: a multi-head self-attention layer acts on the CNN module to compute the weights of different frames under global features of different scales and to fuse features of different depths in the CNN. Step 4: a multi-dimensional attention layer acts on the frame-level features fed to the LSTM so as to jointly consider the relationship between local and global features. Step 5: the processed frame-level features are passed into an LSTM model to capture the temporal information in the features. Step 6: a fusion layer aggregates the outputs of the different modules to enhance model performance. Step 7: a Softmax classifier is used to classify the different emotions. The invention combines deep learning networks, and the modules process the features simultaneously in a parallel structure, effectively improving speech emotion recognition performance.

Description

Speech emotion recognition method of a convolutional recurrent neural network with multiple attention mechanisms
Technical Field
The invention relates to the technical field of speech emotion recognition, and in particular to a speech emotion recognition method based on a convolutional recurrent neural network with multiple attention mechanisms.
Background
The focus of paralinguistic research is to mine the latent information in speech that characterizes the state of the speaker or the voice. As an affective task in paralinguistics, speech emotion recognition learns the category of emotion from speech, which can assist intelligent human-computer interaction. Recent deep-learning studies have provided speech emotion recognition with deep models that better describe the emotional state of speech. One of the most important deep learning models is the neural network, which is typically used to learn discriminative feature representations from low-level acoustic features. Furthermore, these emotion-related tasks tend to center on convolutional neural networks (CNN) and long short-term memory (LSTM) based recurrent neural networks to mine local information in speech. CNNs are often used to learn time-frequency information from spectral features, while LSTMs are mainly used to extract the sequential correlation of speech time series.
Although the neural network models described above have been successfully applied to speech emotion recognition, three problems remain to be solved. First, most existing neural network methods segment the complete utterance into fixed-length segments to meet the fixed-size input requirement of the model; in this process the incomplete temporal information inevitably causes a loss of emotional detail. Second, most CNN-based methods take only the last convolutional layer as the output and do not consider the hidden convolutional layers, which contain high-resolution low-level information. Third, existing LSTM-attention-based speech emotion studies attach the attention layer to the back end of the LSTM and weight the sequence in the high-level representation, thereby neglecting the temporal correlation of the frame-level features within the utterance.
Disclosure of Invention
The technical problem is as follows: in order to overcome the problems of existing speech emotion recognition techniques, the invention discloses a speech emotion recognition method based on a convolutional recurrent neural network with multiple attention mechanisms (CRNN-MA).
The technical solution is as follows: a speech emotion recognition method based on a convolutional recurrent neural network with multiple attention mechanisms comprises the following steps:
Step A: extract spectrogram features and frame-level features, respectively, as the inputs of the different modules of the model. The two feature sets are then fed into a convolutional neural network (CNN) and a long short-term memory (LSTM) recurrent neural network, respectively, so that time-frequency information and sequence information are acquired simultaneously in a parallel structure. Step B: the spectrogram features are fed into the CNN to learn the time-frequency information in the features. Step C: a multi-head self-attention layer (Multiple Self-Attention) acts on the CNN module to compute the weights of different frames under global features of different scales and to fuse features of different depths in the CNN. Step D: a multi-dimensional attention layer (Multi-Dimensional Attention) acts on the frame-level features fed to the LSTM so as to jointly consider the relationship between local and global features. Step E: the processed frame-level features are passed into the LSTM model to capture the temporal information in the features. Step F: a fusion layer aggregates the outputs of the different modules to enhance model performance. Step G: a Softmax classifier is used to classify the different emotions.
Preferably, the specific steps of extracting the spectrogram features in step A comprise: pre-emphasis, framing and fast Fourier transform are applied to the speech, and the energy spectrum is then passed through a bank of Mel-scale triangular filters to obtain the spectrogram features; the first-order and second-order differences of each spectrogram segment are then computed. The specific steps of extracting the frame-level features in step A comprise: 95-dimensional low-level descriptors are extracted from each speech frame, including mel-frequency cepstral coefficients and their first-order derivatives, the mel spectrum and its first-order derivative, spectral features, spectral flatness, chroma features, zero-crossing rate and root-mean-square energy.
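For illustration, the sketch below shows how the two feature streams of step A might be extracted with librosa. It is not the patented implementation: the sampling rate, frame length, hop size, number of mel bands and the exact subset of low-level descriptors are assumptions, since the text above only fixes the 95-dimensional LLD set and the mel-filter-bank spectrogram with its first- and second-order differences.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mels=64):
    """Illustrative sketch of step A: spectrogram stream and frame-level LLD stream."""
    y, sr = librosa.load(path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])           # pre-emphasis

    # Spectrogram stream: mel filter bank on the power spectrum, plus deltas.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)
    spec3d = np.stack([logmel,
                       librosa.feature.delta(logmel, order=1),
                       librosa.feature.delta(logmel, order=2)], axis=-1)   # (n_mels, frames, 3)

    # Frame-level stream: an illustrative subset of the low-level descriptors named above.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=512, hop_length=256)
    lld = np.vstack([mfcc,
                     librosa.feature.delta(mfcc),
                     librosa.feature.chroma_stft(y=y, sr=sr, n_fft=512, hop_length=256),
                     librosa.feature.spectral_flatness(y=y, n_fft=512, hop_length=256),
                     librosa.feature.zero_crossing_rate(y, frame_length=512, hop_length=256),
                     librosa.feature.rms(y=y, frame_length=512, hop_length=256)])
    return spec3d, lld.T                                  # LLDs as (frames, features)
```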
Preferably, the specific steps of step B include:
The spectrogram features from step A, together with their first-order and second-order differences, form a three-dimensional spectrogram feature that is fed into the CNN module for learning. For the CNN module, AlexNet trained on the ImageNet data set is used as the initial model; the model has five convolutional layers and three pooling layers in total, and the fully connected layers of the network are removed to better match the multi-head self-attention layer. The input size is 227 × 227 × 3; the first convolutional layer contains 96 convolution kernels of size 11 × 11, and the second contains 256 convolution kernels of size 5 × 5; the last three convolutional layers contain 384, 384 and 256 convolution kernels, respectively, each of size 3 × 3.
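A minimal sketch of the step-B backbone, assuming torchvision's ImageNet-pretrained AlexNet as the initial model (the string form of the `weights` argument requires torchvision 0.13 or later); the classifier part is dropped and the three max-pooling outputs are collected for the attention module of step C.

```python
import torch
import torchvision

class AlexNetTrunk(torch.nn.Module):
    """Sketch of the step-B CNN: AlexNet convolutional trunk (five conv layers,
    three max-pool layers) without the fully connected layers, returning the
    three pooling outputs for the multi-head self-attention module."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.alexnet(weights="IMAGENET1K_V1")
        self.features = backbone.features               # conv/pool stack only
        # indices of the three MaxPool2d layers inside AlexNet's feature stack
        self.pool_idx = [i for i, m in enumerate(self.features)
                         if isinstance(m, torch.nn.MaxPool2d)]

    def forward(self, x):                               # x: (batch, 3, 227, 227)
        pooled = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.pool_idx:
                pooled.append(x)
        return pooled                                   # [pool1, pool2, pool3]
```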
Preferably, the specific steps of step C include:
Step C-1: the three pooling layers of the CNN in step B are taken as the inputs of the self-attention layers; in each self-attention layer, the input is first reduced in dimension:
F_n = σ_R(f_n * X_n)
where σ_R(·) denotes the ReLU activation function, * is the convolution operation, X_n is the input, and X_1, X_2, X_3 denote the first, second and third pooling layers of the CNN, respectively;
Step C-2: an attention unit is added to compute the interdependence of all frames, giving the weights of the different frames α_n ∈ R^(T_0×1):
α_n = Softmax(V_n · U_n)
where V_n = σ_S(F_n · W_n + b_n), T_0 is the time dimension, W and U are weights, b is the bias, σ_S denotes the Sigmoid activation function, and Softmax denotes the softmax operation;
Step C-3: a 1 × 1 convolution g_n with 1024 convolution kernels is applied, computed as:
G_n = σ_R(g_n * F_n)
where N_0 denotes the feature dimension of the input features; a max-pooling operation of size N_0 × 1 is then applied to G_n:
M_n = MaxPool_(N_0 × 1)(G_n)
Step C-4: the output of the multi-head self-attention layer combines all of the self-attention heads:
O^(CNN) = O_1 + O_2 + O_3
where O_n = M_n · α_n ∈ R^(1024×1).
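The sketch below implements one head of the step-C module following equations C-1 to C-4 as reconstructed above; the attention dimension, the ReLU after the 1024-kernel 1 × 1 convolution, and the way the three heads are combined are assumptions, since those equations appear only as images in the original.

```python
import torch
import torch.nn.functional as F

class SelfAttentionHead(torch.nn.Module):
    """One head of the step-C module (sketch): dimension reduction (C-1), frame
    weights (C-2), 1x1 convolution and max-pooling over the feature axis (C-3),
    and the weighted output O_n = M_n · alpha_n (C-4)."""
    def __init__(self, in_ch, n0, att_dim=128):
        super().__init__()
        self.reduce = torch.nn.Conv2d(in_ch, 1, kernel_size=1)   # f_n: F_n = ReLU(f_n * X_n)
        self.W = torch.nn.Linear(n0, att_dim)                    # V_n = sigmoid(F_n W_n + b_n)
        self.U = torch.nn.Linear(att_dim, 1, bias=False)         # scores V_n · U_n
        self.expand = torch.nn.Conv2d(1, 1024, kernel_size=1)    # g_n: 1024 kernels of size 1x1

    def forward(self, x):                        # x: (batch, C, T0, N0) pooling output
        f = F.relu(self.reduce(x))               # (batch, 1, T0, N0)
        v = torch.sigmoid(self.W(f.squeeze(1)))  # (batch, T0, att_dim)
        alpha = torch.softmax(self.U(v), dim=1)  # (batch, T0, 1) frame weights alpha_n
        g = F.relu(self.expand(f))               # G_n: (batch, 1024, T0, N0)
        m = g.max(dim=3).values                  # N0 x 1 max-pool -> M_n: (batch, 1024, T0)
        return torch.bmm(m, alpha).squeeze(2)    # O_n: (batch, 1024)
```

The full module would instantiate one such head per pooling output of step B and combine the three O_n (here assumed to be a sum) into O^(CNN).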
Preferably, the specific steps of step D include:
Step D-1: in the multi-dimensional attention layer, single-channel 1 × 1 convolutions f_T and f_N are first set, and the outputs along the frame dimension and the feature dimension are expressed as:
F_T = σ_R(f_T * X_T) ∈ R^(T×N)
F_N = σ_R(f_N * X_N) ∈ R^(N×T)
where X_T and X_N = (X_T)^T denote the inputs of the two dimensions of the multi-dimensional attention layer, and T and N denote the frame dimension and the feature dimension, respectively;
Step D-2: the attention unit is used to score the frame dimension and the feature dimension, thereby obtaining the weights of the two attention branches as:
α_T = Softmax(σ_R(F_T · W_T + b_T) · U_T) ∈ R^(T×1)
α_N = Softmax(σ_R(F_N · W_N + b_N) · U_N) ∈ R^(N×1)
where W_T, U_T, W_N, U_N denote weight matrices and b_T, b_N denote biases;
Step D-3: the output O_T of the frame dimension and the output O_N of the feature dimension are expressed as:
O_T = (α_T · e_N^T) ⊙ F_T ∈ R^(T×N)
O_N = (α_N · e_T^T) ⊙ F_N ∈ R^(N×T)
where e_T ∈ R^(T×1), e_N ∈ R^(N×1), and ⊙ denotes the Hadamard product; O_N is then transposed and fused with O_T as the input to the LSTM, whose output is O^(LSTM) ∈ R^(1024×1).
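A sketch of the step-D multi-dimensional attention layer under the reconstruction above; treating e_T and e_N as broadcasting (all-ones) vectors and fusing the two weighted views by addition are assumptions.

```python
import torch

class MultiDimensionalAttention(torch.nn.Module):
    """Sketch of step D: attention over the frame axis (T) and the feature axis (N)
    of the frame-level feature matrix, then fusion of the two weighted views."""
    def __init__(self, t, n, att_dim=128):
        super().__init__()
        self.conv_t = torch.nn.Conv2d(1, 1, kernel_size=1)   # f_T (single channel, 1x1)
        self.conv_n = torch.nn.Conv2d(1, 1, kernel_size=1)   # f_N
        self.w_t = torch.nn.Linear(n, att_dim)
        self.u_t = torch.nn.Linear(att_dim, 1, bias=False)
        self.w_n = torch.nn.Linear(t, att_dim)
        self.u_n = torch.nn.Linear(att_dim, 1, bias=False)

    def forward(self, x):                                    # x: (batch, T, N) frame-level LLDs
        f_t = torch.relu(self.conv_t(x.unsqueeze(1))).squeeze(1)                   # F_T: (batch, T, N)
        f_n = torch.relu(self.conv_n(x.transpose(1, 2).unsqueeze(1))).squeeze(1)   # F_N: (batch, N, T)
        alpha_t = torch.softmax(self.u_t(torch.relu(self.w_t(f_t))), dim=1)        # (batch, T, 1)
        alpha_n = torch.softmax(self.u_n(torch.relu(self.w_n(f_n))), dim=1)        # (batch, N, 1)
        o_t = alpha_t * f_t                      # frame weights broadcast over features
        o_n = alpha_n * f_n                      # feature weights broadcast over frames
        return o_t + o_n.transpose(1, 2)         # fused (batch, T, N); fed to the LSTM of step E
```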
Preferably, the specific steps of step E include:
Step E-1: the inputs of the LSTM at each time step are the current input value x_t, the previous output value h_(t-1) and the previous cell state c_(t-1); the outputs are the current output h_t and the current cell state c_t. The forget gate f_t determines which information the cell discards:
f_t = σ(W_f · [h_(t-1), x_t] + b_f)
where σ denotes the Sigmoid activation function and W and b are the weight and bias, respectively; the output of f_t lies between 0 and 1, where 1 means the information is fully retained and 0 means it is completely discarded;
Step E-2: the cell decides which values to update:
i_t = σ(W_i · [h_(t-1), x_t] + b_i)
C̃_t = tanh(W_C · [h_(t-1), x_t] + b_C)
where the sigmoid gate i_t decides which values are to be updated and tanh creates the new candidate values C̃_t;
Step E-3: the cell state is updated and the final state is output:
C_t = f_t * C_(t-1) + i_t * C̃_t
o_t = σ(W_o · [h_(t-1), x_t] + b_o)
h_t = o_t * tanh(C_t).
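The gate equations of step E are the standard LSTM formulation, so torch.nn.LSTM can stand in directly; the sketch below keeps the final hidden state as O^(LSTM), with a hidden size of 1024 chosen to match the dimensions stated above.

```python
import torch

class FrameLSTM(torch.nn.Module):
    """Sketch of step E: run the attention-weighted frame-level features through
    an LSTM and keep the final hidden state as O_LSTM (1024-dimensional here)."""
    def __init__(self, feat_dim=95, hidden=1024):
        super().__init__()
        self.lstm = torch.nn.LSTM(input_size=feat_dim, hidden_size=hidden, batch_first=True)

    def forward(self, x):              # x: (batch, T, feat_dim) from the multi-dimensional attention layer
        _, (h_n, _) = self.lstm(x)     # h_n: (num_layers, batch, hidden)
        return h_n[-1]                 # O_LSTM: (batch, 1024)
```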
Preferably, the specific steps of step F include:
Step F-1: the output values of the two different modules are separately normalized to accelerate the convergence of training:
O^(CNN-BN) = σ_S(BN(O^(CNN)))
O^(LSTM-BN) = σ_R(BN(O^(LSTM)))
where BN denotes batch normalization;
Step F-2: the output of the CRNN-MA model is computed as:
O^(CRNN-MA) = σ_S([(O^(CNN-BN))^T, (O^(LSTM-BN))^T] · W) · V
where W, V ∈ R^(2048×2048) denote the weights of the fusion layer.
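A sketch of the step-F fusion layer following the reconstructed formulas: each stream is batch-normalized and activated, the two 1024-dimensional vectors are concatenated, and the two 2048 × 2048 fusion weights W and V are applied; implementing W and V as bias-free linear layers is an assumption.

```python
import torch

class FusionLayer(torch.nn.Module):
    """Sketch of step F: normalise the CNN and LSTM outputs, concatenate them,
    and apply the two fusion weights W and V (2048 x 2048 in the text)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.bn_cnn = torch.nn.BatchNorm1d(dim)
        self.bn_lstm = torch.nn.BatchNorm1d(dim)
        self.W = torch.nn.Linear(2 * dim, 2 * dim, bias=False)
        self.V = torch.nn.Linear(2 * dim, 2 * dim, bias=False)

    def forward(self, o_cnn, o_lstm):                    # each: (batch, 1024)
        a = torch.sigmoid(self.bn_cnn(o_cnn))            # O_(CNN-BN)
        b = torch.relu(self.bn_lstm(o_lstm))             # O_(LSTM-BN)
        fused = torch.cat([a, b], dim=1)                 # (batch, 2048)
        return self.V(torch.sigmoid(self.W(fused)))      # O_(CRNN-MA): (batch, 2048)
```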
Preferably, the specific steps of step G include: the Softmax function is expressed as:
f(V_i) = e^(V_i) / Σ_j e^(V_j)
where f(V_i) is the probability corresponding to the feature value V_i and the sum of all probabilities equals 1; if one V_i is larger than all the other values, its output probability is the highest, its mapped component approaches 1, and the mapped components of the other feature values approach 0.
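For step G, a minimal sketch of the classifier; the class count of six (the number of ABC emotion categories mentioned below) and the 2048-dimensional fused input are illustrative assumptions.

```python
import torch

n_classes = 6                                     # e.g. the six ABC emotion categories
classifier = torch.nn.Sequential(
    torch.nn.Linear(2048, n_classes),             # logits V_i from the fused 2048-d vector
    torch.nn.Softmax(dim=1),                      # f(V_i) = exp(V_i) / sum_j exp(V_j)
)
probs = classifier(torch.randn(4, 2048))          # (batch, n_classes); each row sums to 1
pred = probs.argmax(dim=1)                        # predicted emotion index per sample
```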
Beneficial effects: the speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms combines deep learning networks, and the modules process the features simultaneously in a parallel structure, which can effectively improve speech emotion recognition performance.
Drawings
FIG. 1 is a diagram of the structure of the CRNN-MA model;
FIG. 2 is a diagram of a multi-headed self-attention layer structure;
FIG. 3 is a diagram of a multi-dimensional attention layer model architecture.
Detailed Description
The architecture of the proposed CRNN-MA model is shown in fig. 1. The spectrogram features and the frame-level features are first input into the model. Using the three pooling layers of the CNN as the inputs of the multi-head self-attention module, the relationship between local and global features is obtained; the structure of this module is shown in fig. 2. The multi-dimensional attention layer computes the weights of the different frames and features, as shown in fig. 3. A fusion layer then fuses the different outputs, and the Softmax classifier outputs the result.
To verify the performance of the proposed model, experiments were performed on the ABC emotion database and the eNTERFACE emotion database. The ABC database is a German database with 6 different emotions recorded by 4 males and 4 females, containing 430 speech samples in total. The eNTERFACE emotion database includes 43 subjects from 14 different countries, recorded in English, for a total of 1283 speech samples.
For the ABC database, the leave-one-speaker-out (LOSO) cross-validation strategy was adopted. In this strategy, the speech samples of one speaker are selected from the data set as the test set of the experiment and the remaining samples are used as the training set; each speaker's speech takes a turn as the test set, and the average over the individual tests is finally calculated. For the eNTERFACE database, the data were randomly divided into 8 speaker-independent groups, seven of which contained the samples of 5 speakers each while the remaining group contained the samples of 8 speakers, and eight-fold cross-validation was performed.
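The LOSO protocol described above can be expressed with scikit-learn's LeaveOneGroupOut, using the speaker identity as the group label; the function and variable names below are illustrative.

```python
from sklearn.model_selection import LeaveOneGroupOut

def loso_splits(features, labels, speakers):
    """Yield (train_idx, test_idx) pairs where each test fold holds one speaker's samples."""
    logo = LeaveOneGroupOut()
    for train_idx, test_idx in logo.split(features, labels, groups=speakers):
        yield train_idx, test_idx
```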
Due to the imbalance of the emotion classes, weighted accuracy (WA) and unweighted accuracy (UA) are used to evaluate the experimental results. The weighted accuracy is the ratio of the number of correctly classified samples to the total number of samples. The unweighted accuracy is the sum of the per-class accuracies (the recall of each class) divided by the number of classes, regardless of the number of samples per class.
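A small sketch of the two metrics as defined above: WA as overall accuracy, UA as the per-class accuracy (recall) averaged over the classes.

```python
import numpy as np

def weighted_unweighted_accuracy(y_true, y_pred):
    """WA: fraction of correctly classified samples.
    UA: per-class recall averaged over classes, ignoring class sizes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = float(np.mean(y_true == y_pred))
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    ua = float(np.mean(recalls))
    return wa, ua

print(weighted_unweighted_accuracy([0, 0, 0, 1], [0, 0, 1, 1]))   # WA = 0.75, UA ≈ 0.83
```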
Table 1 shows the effect of the multi-head self-attention layer on model performance.

TABLE 1 Performance comparison of the multi-head self-attention layer with CNN (%, 'WA/UA')

Data set                               ABC         eNTERFACE
CNN                                    49.2/41.7   70.9/71.0
CNN + multi-head self-attention layer  60.9/53.9   71.0/71.1
As can be seen from Table 1, the multi-head self-attention layer effectively improves the emotion recognition performance of the model: the WA values on the two databases increase by 11.7% and 0.1%, respectively, and the UA values increase by 12.2% and 0.1%. The results show that the proposed multi-head self-attention layer can improve the CNN by capturing the time-frequency information of the CNN module.
Table 2 shows the effect of the multi-dimensional attention layer on the performance of the LSTM model.

TABLE 2 Performance comparison of the multi-dimensional attention layer with LSTM (%, 'WA/UA')

Data set                                  ABC         eNTERFACE
LSTM                                      57.2/49.4   71.5/71.6
LSTM + multi-dimensional attention layer  60.1/52.5   74.3/74.4
As can be seen from Table 2, the multi-dimensional attention layer also effectively improves the recognition performance of the model: the WA values on the two databases increase by 2.9% and 2.8%, respectively, and the UA values increase by 3.1% and 2.8%. This indicates that the proposed multi-dimensional attention layer is effective in coordinating the emotional segments.
Table 3 shows the enhancement of model performance by the fusion layer.

TABLE 3 Enhancement of model performance by the fusion layer (%, 'WA/UA')

Data set                   ABC         eNTERFACE
CNN + LSTM                 58.0/49.9   74.8/75.0
CRNN-MA (no fusion layer)  60.3/53.1   75.7/75.6
CRNN-MA                    65.3/59.7   78.6/78.6
From Table 3 it can be seen that the proposed CRNN-MA model achieves the best experimental results, and that the fusion layer's integration of the different high-level features has a positive effect on model performance; moreover, adding the fusion layer enables the model to obtain more effective emotional information.

Claims (8)

1. A speech emotion recognition method of a convolutional recurrent neural network with multiple attention mechanisms, characterized by comprising the following steps:
Step A: extracting spectrogram features and frame-level features from the speech;
Step B: using a CNN to learn the time-frequency information in the spectrogram;
Step C: a multi-head self-attention layer acts on the CNN module to compute the weights of different frames under global features of different scales and to fuse features of different depths in the CNN;
Step D: a multi-dimensional attention layer acts on the frame-level features so as to jointly consider the relationship between local and global features;
Step E: the processed frame-level features are passed into an LSTM model to capture the temporal information in the features;
Step F: a fusion layer aggregates the outputs of the different modules to enhance model performance;
Step G: the emotions are classified using a Softmax classifier.
2. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of extracting the spectrogram features in step A comprise: pre-emphasis, framing and fast Fourier transform are applied to the speech, and the energy spectrum is then passed through a bank of Mel-scale triangular filters to obtain the spectrogram features; the first-order and second-order differences of each spectrogram segment are then computed; and the specific steps of extracting the frame-level features in step A comprise: 95-dimensional low-level descriptors are extracted from each speech frame, including mel-frequency cepstral coefficients and their first-order derivatives, the mel spectrum and its first-order derivative, spectral features, spectral flatness, chroma features, zero-crossing rate and root-mean-square energy.
3. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 2, wherein the specific steps of step B comprise:
the spectrogram features from step A, together with their first-order and second-order differences, form a three-dimensional spectrogram feature that is fed into the CNN module for learning; for the CNN module, AlexNet trained on the ImageNet data set is used as the initial model; the model has five convolutional layers and three pooling layers in total, and the fully connected layers of the network are removed to better match the multi-head self-attention layer; the input size is 227 × 227 × 3; the first convolutional layer contains 96 convolution kernels of size 11 × 11, and the second contains 256 convolution kernels of size 5 × 5; the last three convolutional layers contain 384, 384 and 256 convolution kernels, respectively, each of size 3 × 3.
4. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 3, wherein the specific steps of step C comprise:
Step C-1: the three pooling layers of the CNN in step B are taken as the inputs of the self-attention layers; in each self-attention layer, the input is first reduced in dimension:
F_n = σ_R(f_n * X_n)
where σ_R(·) denotes the ReLU activation function, * is the convolution operation, X_n is the input, and X_1, X_2, X_3 denote the first, second and third pooling layers of the CNN, respectively;
Step C-2: an attention unit is added to compute the interdependence of all frames, giving the weights of the different frames α_n ∈ R^(T_0×1):
α_n = Softmax(V_n · U_n)
where V_n = σ_S(F_n · W_n + b_n), T_0 is the time dimension, W and U are weights, b is the bias, σ_S denotes the Sigmoid activation function, and Softmax denotes the softmax operation;
Step C-3: a 1 × 1 convolution g_n with 1024 convolution kernels is applied, computed as:
G_n = σ_R(g_n * F_n)
where N_0 denotes the feature dimension of the input features; a max-pooling operation of size N_0 × 1 is then applied to G_n:
M_n = MaxPool_(N_0 × 1)(G_n)
Step C-4: the output of the multi-head self-attention layer combines all of the self-attention heads:
O^(CNN) = O_1 + O_2 + O_3
where O_n = M_n · α_n ∈ R^(1024×1).
5. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step D comprise:
Step D-1: in the multi-dimensional attention layer, single-channel 1 × 1 convolutions f_T and f_N are first set, and the outputs along the frame dimension and the feature dimension are expressed as:
F_T = σ_R(f_T * X_T) ∈ R^(T×N)
F_N = σ_R(f_N * X_N) ∈ R^(N×T)
where X_T and X_N = (X_T)^T denote the inputs of the two dimensions of the multi-dimensional attention layer, and T and N denote the frame dimension and the feature dimension, respectively;
Step D-2: the attention unit is used to score the frame dimension and the feature dimension, thereby obtaining the weights of the two attention branches as:
α_T = Softmax(σ_R(F_T · W_T + b_T) · U_T) ∈ R^(T×1)
α_N = Softmax(σ_R(F_N · W_N + b_N) · U_N) ∈ R^(N×1)
where W_T, U_T, W_N, U_N denote weight matrices and b_T, b_N denote biases;
Step D-3: the output O_T of the frame dimension and the output O_N of the feature dimension are expressed as:
O_T = (α_T · e_N^T) ⊙ F_T ∈ R^(T×N)
O_N = (α_N · e_T^T) ⊙ F_N ∈ R^(N×T)
where e_T ∈ R^(T×1), e_N ∈ R^(N×1), and ⊙ denotes the Hadamard product; O_N is then transposed and fused with O_T as the input to the LSTM, whose output is O^(LSTM) ∈ R^(1024×1).
6. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step E comprise:
Step E-1: the inputs of the LSTM at each time step are the current input value x_t, the previous output value h_(t-1) and the previous cell state c_(t-1); the outputs are the current output h_t and the current cell state c_t. The forget gate f_t determines which information the cell discards:
f_t = σ(W_f · [h_(t-1), x_t] + b_f)
where σ denotes the Sigmoid activation function and W and b are the weight and bias, respectively; the output of f_t lies between 0 and 1, where 1 means the information is fully retained and 0 means it is completely discarded;
Step E-2: the cell decides which values to update:
i_t = σ(W_i · [h_(t-1), x_t] + b_i)
C̃_t = tanh(W_C · [h_(t-1), x_t] + b_C)
where the sigmoid gate i_t decides which values are to be updated and tanh creates the new candidate values C̃_t;
Step E-3: the cell state is updated and the final state is output:
C_t = f_t * C_(t-1) + i_t * C̃_t
o_t = σ(W_o · [h_(t-1), x_t] + b_o)
h_t = o_t * tanh(C_t).
7. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step F comprise:
Step F-1: the output values of the two different modules are separately normalized to accelerate the convergence of training:
O^(CNN-BN) = σ_S(BN(O^(CNN)))
O^(LSTM-BN) = σ_R(BN(O^(LSTM)))
where BN denotes batch normalization;
Step F-2: the output of the CRNN-MA model is computed as:
O^(CRNN-MA) = σ_S([(O^(CNN-BN))^T, (O^(LSTM-BN))^T] · W) · V
where W, V ∈ R^(2048×2048) denote the weights of the fusion layer.
8. The speech emotion recognition method of the convolutional recurrent neural network with multiple attention mechanisms as claimed in claim 1, wherein the specific steps of step G comprise: the Softmax function is expressed as:
f(V_i) = e^(V_i) / Σ_j e^(V_j)
the features are normalized using Softmax, where f(V_i) is the probability corresponding to the feature value V_i; if one V_i is larger than all the other values, its output probability is the highest, its mapped component approaches 1, and the mapped components of the other feature values approach 0.
CN202110695847.6A 2021-06-23 2021-06-23 Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms Active CN113450830B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110695847.6A CN113450830B (en) 2021-06-23 2021-06-23 Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110695847.6A CN113450830B (en) 2021-06-23 2021-06-23 Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms

Publications (2)

Publication Number Publication Date
CN113450830A true CN113450830A (en) 2021-09-28
CN113450830B CN113450830B (en) 2024-03-08

Family

ID=77812318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110695847.6A Active CN113450830B (en) 2021-06-23 2021-06-23 Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms

Country Status (1)

Country Link
CN (1) CN113450830B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10937444B1 (en) * 2017-11-22 2021-03-02 Educational Testing Service End-to-end neural network based automated speech scoring
CN108664632A (en) * 2018-05-15 2018-10-16 华南理工大学 A kind of text emotion sorting algorithm based on convolutional neural networks and attention mechanism
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic
CN110853680A (en) * 2019-11-05 2020-02-28 河南工业大学 double-BiLSTM structure with multi-input multi-fusion strategy for speech emotion recognition
US10699129B1 (en) * 2019-11-15 2020-06-30 Fudan University System and method for video captioning
CN112784798A (en) * 2021-02-01 2021-05-11 东南大学 Multi-modal emotion recognition method based on feature-time attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHOU Kaiye: "Multi-branch action recognition network based on a motion attention module", Industrial Control Computer, no. 07

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111785301A (en) * 2020-06-28 2020-10-16 重庆邮电大学 Residual error network-based 3DACRNN speech emotion recognition method and storage medium

Also Published As

Publication number Publication date
CN113450830B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
CN107610707B (en) A kind of method for recognizing sound-groove and device
CN112466326B (en) Voice emotion feature extraction method based on transducer model encoder
Umamaheswari et al. An enhanced human speech emotion recognition using hybrid of PRNN and KNN
Bhat et al. Automatic assessment of sentence-level dysarthria intelligibility using BLSTM
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN111798874A (en) Voice emotion recognition method and system
CN108520753A (en) Voice lie detection method based on the two-way length of convolution memory network in short-term
CN111583964A (en) Natural speech emotion recognition method based on multi-mode deep feature learning
Yılmaz et al. Articulatory features for asr of pathological speech
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
Janbakhshi et al. Automatic dysarthric speech detection exploiting pairwise distance-based convolutional neural networks
CN116524960A (en) Speech emotion recognition system based on mixed entropy downsampling and integrated classifier
Alashban et al. Speaker gender classification in mono-language and cross-language using BLSTM network
Jiang et al. Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit.
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
CN112466284B (en) Mask voice identification method
CN113450830B (en) Speech emotion recognition method of convolutional recurrent neural network with multiple attention mechanisms
CN111243621A (en) Construction method of GRU-SVM deep learning model for synthetic speech detection
Işık et al. Turkish dialect recognition using acoustic and phonotactic features in deep learning architectures
Romero et al. Exploring transformer-based language recognition using phonotactic information
Zhang et al. Autoencoder based on cepstrum separation to detect depression from speech
Abdiche et al. Text-independent speaker identification using mel-frequency energy coefficients and convolutional neural networks
Kalita et al. Use of Bidirectional Long Short Term Memory in Spoken Word Detection with reference to the Assamese language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant