CN114898777A - Cross-library speech emotion recognition method and device based on deep direct-push migration network - Google Patents
- Publication number
- CN114898777A (application CN202210513096.6A)
- Authority
- CN
- China
- Legal status: Pending (an assumption by Google, not a legal conclusion)
Classifications
- G—PHYSICS › G10—MUSICAL INSTRUMENTS; ACOUSTICS › G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/63—Speech or voice analysis techniques specially adapted for estimating an emotional state
- G10L25/18—Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
Abstract
The invention discloses a cross-database speech emotion recognition method and device based on a deep direct-push (transductive) migration network. The method comprises the following steps: (1) obtaining a source speech emotion database and a target speech emotion database; (2) processing the emotion audio of the source database and the target database into spectrograms; (3) establishing a deep regression neural network; (4) inputting the spectrograms of the source database and the target database into the deep regression neural network for training, calculating the maximum mean discrepancy between the source data and the target data at different scales from the features produced by the network, and fine-tuning the neural network accordingly; (5) treating the speech to be recognized as speech data in the target database and inputting it into the trained deep regression neural network to obtain its emotion type. The invention achieves higher recognition accuracy.
Description
Technical Field
The invention relates to speech emotion recognition technology, and in particular to a cross-database speech emotion recognition method and device based on a deep direct-push (i.e., transductive) migration network.
Background
Speech is one of the most natural ways for humans to express themselves, and in daily life it reveals a person's natural emotional state more than other common modes of communication. Because emotion helps people understand each other better, automatic speech emotion recognition using computer programs and artificial intelligence algorithms has, in recent years, become a popular research direction in pattern recognition, computer vision, affective computing and related fields, with the aim of helping people understand and recognize speech emotion more efficiently.
In recent years, researchers have proposed many effective methods based on machine learning and deep learning to recognize speech emotion automatically. Conventional machine learning methods typically first extract hand-crafted features, such as the IS09 and IS10 feature sets, and then build various classifiers specifically for the speech emotion recognition task, such as SVM, K-NN and Bayesian classifiers. Meanwhile, deep learning methods have also been applied to the task, such as LSTM, pre-trained CNNs (e.g., ResNet, VGGNet and DenseNet) and CapsNet. These networks can generally improve the representation of speech emotion and learn emotion features and classifiers in an end-to-end manner to classify speech emotion.
The above methods are carried out in the ideal case where the test samples and the training samples come from the same database. However, in many practical applications the test and training samples usually come from different databases, which easily introduces large domain differences and leaves most speech emotion recognition methods with unsatisfactory recognition performance across databases. Recently, many researchers have attempted to solve the cross-database speech emotion recognition problem. For example, Zong et al. proposed a domain-adaptive least squares regression (DaLSR) approach to handle the cross-database speech emotion recognition task. Hassan et al. proposed an importance-weighted support vector machine (IW-SVM) to eliminate the feature-distribution mismatch between samples and improve classification accuracy across databases. Long et al. proposed transfer kernel learning (TKL) to learn a domain-invariant kernel that removes the feature-distribution differences between samples from different databases. Gong et al. proposed the Geodesic Flow Kernel (GFK) method, connecting two different databases with a well-designed kernel on the Grassmann manifold to reduce the feature-distribution difference between them. Deng et al. proposed a Universum autoencoder (UAE) to learn a database-independent feature space, with the goal of mapping test and training samples into a domain-independent feature space using the powerful mapping ability of the autoencoder. Fernando et al. proposed a Subspace Alignment (SA) method for finding a mapping function that aligns the subspace of the source samples with that of the target samples. Pan et al. proposed Transfer Component Analysis (TCA), based on a reproducing kernel Hilbert space, to eliminate the distribution difference between samples of different domains by finding cross-domain transfer components.
Gideon et al. proposed an adversarial discriminative domain generalization (ADDoG) method that learns more generalizable speech emotion features from samples of more databases. Most research on cross-library speech emotion recognition is still based on traditional machine learning: a spatio-temporal descriptor with domain invariance combined with a machine-learning classifier handles the cross-library task, but the recognition rate is not ideal and remains some distance from practical application. In addition, as speech emotion datasets grow, deep-learning-based methods will be one of the main research directions of cross-library speech emotion recognition in the future; at present, however, related research is sparse and overall progress is slow.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a cross-database speech emotion recognition method and device based on a deep direct-push migration network, which have higher recognition accuracy.
The technical scheme is as follows: the cross-library speech emotion recognition method based on the deep direct push type migration network comprises the following steps:
(1) acquiring two different voice emotion databases, namely a source database and a target database, wherein emotion voice audios and corresponding emotion type labels are stored in the source database, and only the emotion voice audios are stored in the target database;
(2) processing the emotion voice audios of the source database and the target database into spectrograms;
(3) establishing a deep regression neural network;
(4) simultaneously inputting the spectrogram of the source database with its corresponding label, and the spectrogram of the target database with a randomly initialized pseudo label, into the deep regression neural network as samples for pre-training; then inputting the obtained depth features as samples into the deep regression neural network for fine-tuning, completing the training;
(5) preprocessing the emotion voice frequency of the voice to be recognized into a spectrogram, and inputting the spectrogram as a sample in a target database into a trained deep regression neural network to obtain the emotion type of the voice.
Further, the step (2) specifically comprises: the emotional voice audios in the source database and the target database are processed into spectrogram forms by using a librosa tool package of python.
Further, the deep regression neural network established in step (3) comprises, connected in sequence from front to back: a first convolutional layer, a first max-pooling layer, a second convolutional layer, a second max-pooling layer, a third convolutional layer, a fourth convolutional layer, a third max-pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fourth max-pooling layer, a seventh convolutional layer, an eighth convolutional layer, a fifth max-pooling layer, an adaptive average pooling layer and a fully connected layer.
Further, each neuron in all convolutional layers and fully connected layers adopts the rectified linear unit (ReLU) as its activation function.
Further, Dropout of 0.5 is applied to the neuron outputs of the fully connected layer to prevent overfitting.
Further, the step (4) comprises the following steps:
(4-1) simultaneously inputting the spectrogram of the source database with its corresponding label, and the spectrogram of the target database with a randomly initialized pseudo label, into the deep regression neural network as samples, wherein the loss function L adopted during training is:

L = -(1/N) Σ_{n=1..N} Σ_{j=1..J} y_nj · log(p_nj) - (1/M) Σ_{m=1..M} Σ_{j=1..J} q̂_mj · log(q_mj)

wherein N is the number of source-database samples and n indexes them, M is the number of target-database samples and m indexes them, J is the number of speech emotion classes and j indexes them; p_nj is the probability that the actual output emotional feature of the n-th source-database sample is assigned to class j; y_nj is the expected probability (label) that the output emotional feature of the n-th source-database sample is assigned to class j; q_mj is the probability that the actual output emotional feature of the m-th target-database sample is assigned to class j; and q̂_mj is the probability that the actual output emotional feature of the m-th target-database sample was assigned to class j in the previous round of training (the pseudo label);
(4-2) inputting the depth features of the source database and the target database obtained from the deep regression neural network as samples into the trained network for fine-tuning, wherein the loss function L_total adopted during training is:

L_total = α · L_mmd + β · L

L_mmd = MMD(X_s, Y_t) + Σ_{j=1..J} MMD(X_s^j, Y_t^j) + MMD(X_s^pn, Y_t^pn)

wherein MMD(X_s, Y_t) is the maximum mean discrepancy between the emotional features X_s output by the source-database samples and the emotional features Y_t output by the target-database samples on the deep regression neural network; MMD(X_s^j, Y_t^j) is the maximum mean discrepancy between the class-j emotional features X_s^j output by the source-database samples and the class-j emotional features Y_t^j output by the target-database samples; MMD(X_s^pn, Y_t^pn) is the maximum mean discrepancy between the positive/negative-emotion features X_s^pn output by the source-database samples and the positive/negative-emotion features Y_t^pn output by the target-database samples; MMD is the distance between the means of two sets of data in a reproducing kernel Hilbert space; and α and β are constraint-strength coefficients obtained through training.
The cross-library speech emotion recognition device based on the deep direct push type migration network comprises a processor and a computer program which is stored on a memory and can run on the processor, wherein the processor realizes the method when executing the program.
Beneficial effects: compared with the prior art, the invention has the remarkable advantage of higher recognition accuracy.
Drawings
FIG. 1 is a schematic flow chart of a cross-database speech emotion recognition method based on a deep direct-push migration network according to the present invention;
FIG. 2 is a schematic diagram of a cross-database speech emotion recognition architecture of a deep direct-push migration network DTTRN;
fig. 3 is a detailed structure of a backbone network of the deep direct push migration network DTTRN;
fig. 4 is an experimental result comparison of the deep direct-push migration network DTTRN in the cross-library speech emotion recognition experiment.
Detailed Description
The embodiment provides a cross-database speech emotion recognition method based on a deep direct-push migration network, as shown in fig. 1, including:
(1) the method comprises the steps of obtaining two different voice emotion databases which are respectively a source database and a target database, wherein emotion voice audios and corresponding emotion type labels are stored in the source database, and only the emotion voice audios are stored in the target database.
In the invention, a training sample (a source database or a source domain) and a testing sample (a target domain or a target database) belong to different speech emotion data sets, and the training sample and the testing sample have obvious characteristic distribution difference.
(2) Processing the emotional voice audios of the source database and the target database into spectrograms.
In this embodiment, spectrogram processing is implemented using the librosa toolkit of python.
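The embodiment names librosa as the tool for this step. As an illustration of the underlying computation only — the FFT size, hop length and dB floor below are assumptions, not the patent's parameters — a numpy-only sketch of a log-magnitude spectrogram:

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=128):
    """Frame the signal, apply a Hann window, and take the log-magnitude
    STFT; librosa.stft plus librosa.amplitude_to_db wrap the same idea."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (n_fft//2+1, n_frames)
    return 20.0 * np.log10(np.maximum(spec, 1e-10))  # dB scale

# a 1-second 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```

The resulting frequency-by-time matrix is what would be saved as the spectrogram image fed to the network.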
(3) A deep regression neural network (DTTRN) is established; the network comprises a CNN backbone network and a fully connected layer FC1 connected to the backbone.
As shown in fig. 2, the structure of the deep regression neural network includes, connected in sequence from front to back: the first convolutional layer conv1, the first max-pooling layer Maxpool1, the second convolutional layer conv2, the second max-pooling layer Maxpool2, the third convolutional layer conv3, the fourth convolutional layer conv4, the third max-pooling layer Maxpool3, the fifth convolutional layer conv5, the sixth convolutional layer conv6, the fourth max-pooling layer Maxpool4, the seventh convolutional layer conv7, the eighth convolutional layer conv8, the fifth max-pooling layer Maxpool5, the adaptive average pooling layer AdaAvgPool1 and the fully connected layer FC1. Each neuron in the convolutional and fully connected layers adopts the rectified linear unit (ReLU) as its activation function, and Dropout of 0.5 is applied to all fully connected layer outputs to prevent overfitting.
As shown in fig. 3, the local receptive fields of the 8 convolutional layers are all set to 3 × 3, the stride is set to 1, and an edge zero-padding strategy keeps the feature-map size unchanged after convolution. The 1st convolutional layer has 64 convolution kernels; the 2nd has 128; the 3rd and 4th have 256; the last 4 have 512. For the two kinds of pooling layers, the max-pooling window size is set to 2 × 2 with stride 2, and the adaptive average pooling output size is set to 7 × 7. Each max-pooling halves the spatial size of the feature maps.
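With these hyper-parameters, the feature-map size reaching FC1 can be traced in a few lines (the 224 × 224 input size below is an assumed example, not stated in the patent):

```python
def dttrn_feature_shape(h, w):
    """Trace the spatial size through the backbone: conv1..conv8 are 3x3,
    stride 1, zero-padded (size preserved); Maxpool1..Maxpool5 are 2x2,
    stride 2 (size halved); AdaAvgPool1 then emits a fixed 7x7 map of the
    512 channels produced by conv8 (7*7*512 = 25088 values into FC1)."""
    for _ in range(5):                 # five max-pooling stages
        h, w = h // 2, w // 2
    pre_pool = (h, w, 512)             # size entering the adaptive pool
    return pre_pool, (7, 7, 512)       # AdaAvgPool1 output

pre, shape = dttrn_feature_shape(224, 224)
print(pre, shape)
```

For a 224 × 224 spectrogram the map entering the adaptive pool is already 7 × 7, so the pool is an identity there; for any other input size it still normalizes the output to 7 × 7.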
The network activation function is set as follows: the rectified linear unit (ReLU) is used as the activation function for each neuron in the DTTRN network, defined as ReLU(x) = max(0, x).
(4) Simultaneously inputting the spectrogram of the source database with its corresponding label, and the spectrogram of the target database with a randomly initialized pseudo label, into the deep regression neural network as samples for pre-training; then inputting the obtained depth features as samples into the deep regression neural network for fine-tuning, completing the training.
The method specifically comprises the following steps:
(4-1) Simultaneously inputting the spectrogram of the source database with its corresponding label, and the spectrogram of the target database with a randomly initialized pseudo label, into the deep regression neural network as samples. Training uses a loss function L improved from the cross-entropy loss function:

L = -(1/N) Σ_{n=1..N} Σ_{j=1..J} y_nj · log(p_nj) - (1/M) Σ_{m=1..M} Σ_{j=1..J} q̂_mj · log(q_mj)

wherein N is the number of source-database samples and n indexes them, M is the number of target-database samples and m indexes them, J is the number of speech emotion classes and j indexes them; p_nj is the probability that the actual output emotional feature of the n-th source-database sample is assigned to class j; y_nj is the expected probability (label) that the output emotional feature of the n-th source-database sample is assigned to class j; q_mj is the probability that the actual output emotional feature of the m-th target-database sample is assigned to class j; and q̂_mj is the probability that the actual output emotional feature of the m-th target-database sample was assigned to class j in the previous round of training (the pseudo label).
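Using the quantities just described (the variable names below are illustrative, not the patent's notation), the pre-training loss might be sketched as:

```python
import numpy as np

def pretrain_loss(p_src, y_src, p_tgt, pseudo_tgt):
    """Cross-entropy of the source predictions against their true labels,
    plus cross-entropy of the target predictions against their pseudo
    labels (randomly initialised, then refreshed from the previous round).
    Shapes: p_src (N, J), y_src (N, J) one-hot, p_tgt (M, J),
    pseudo_tgt (M, J)."""
    eps = 1e-12                       # guards log(0)
    src = -np.mean(np.sum(y_src * np.log(p_src + eps), axis=1))
    tgt = -np.mean(np.sum(pseudo_tgt * np.log(p_tgt + eps), axis=1))
    return src + tgt

# perfect predictions matching labels and pseudo labels give ~zero loss
y = np.eye(3)
loss = pretrain_loss(y, y, y, y)
```

As the pseudo labels are refreshed each round from the network's own target-domain predictions, the second term pulls successive predictions toward consistency.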
(4-2) Inputting the depth features of the source database and the target database obtained from the deep regression neural network as samples into the trained network for fine-tuning, wherein the loss function L_total adopted during training is:

L_total = α · L_mmd + β · L

L_mmd = MMD(X_s, Y_t) + Σ_{j=1..J} MMD(X_s^j, Y_t^j) + MMD(X_s^pn, Y_t^pn)

wherein MMD(X_s, Y_t) is the maximum mean discrepancy between the emotional features X_s output by the source-database samples and the emotional features Y_t output by the target-database samples on the deep regression neural network; MMD(X_s^j, Y_t^j) is the maximum mean discrepancy between the class-j emotional features X_s^j output by the source-database samples and the class-j emotional features Y_t^j output by the target-database samples; MMD(X_s^pn, Y_t^pn) is the maximum mean discrepancy between the positive/negative-emotion features X_s^pn output by the source-database samples and the positive/negative-emotion features Y_t^pn output by the target-database samples; MMD is the distance between the means of two sets of data in a reproducing kernel Hilbert space; and α and β are constraint-strength coefficients obtained through training.
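Each MMD term above reduces to the same estimator: the squared distance between the two sample means in a reproducing kernel Hilbert space. A sketch with an RBF kernel (the kernel choice and bandwidth are assumptions; the patent does not fix them):

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased MMD^2 estimate between samples X (n, d) and Y (m, d) with
    RBF kernel k(a, b) = exp(-gamma * ||a - b||^2): the squared distance
    between the two sample mean embeddings in the RKHS."""
    def k(A, B):
        sq = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd_rbf(rng.normal(0, 1, (200, 4)), rng.normal(0, 1, (200, 4)))
shifted = mmd_rbf(rng.normal(0, 1, (200, 4)), rng.normal(3, 1, (200, 4)))
```

The global, per-class and positive/negative terms of L_mmd would each call this estimator on the corresponding feature subsets, so minimizing L_mmd pulls the two databases' feature distributions together at all three scales.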
(5) Preprocessing the emotion voice frequency of the voice to be recognized into a spectrogram, and inputting the spectrogram as a sample in a target database into a trained deep regression neural network to obtain the emotion type of the voice.
The training optimizer adopts the stochastic gradient descent (SGD) algorithm with Nesterov momentum; the loss function is computed and the weights are updated continuously. By simulating the concept of momentum in physics, Nesterov momentum suppresses oscillation in the gradient direction and accelerates convergence: if the historical gradient agrees with the current gradient direction, the momentum term grows, otherwise it shrinks. The Nesterov term additionally applies a correction when updating the gradient, preventing the update from advancing too fast while keeping it flexible. The iteration is:

v_t = μ · v_{t-1} + η · ∇_θ L(θ - μ · v_{t-1})
θ = θ - v_t

where η is the learning rate, set to 10^-3 in the experiments; in addition, the weight decay is set to 10^-5 and the momentum coefficient μ (the correction factor) to 0.9.
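The update rule can be exercised on a toy quadratic (the objective and step count below are illustrative; only the learning rate 10^-3, weight decay 10^-5 and momentum 0.9 come from the text):

```python
import numpy as np

def sgd_nesterov(grad, theta, lr=1e-3, momentum=0.9, weight_decay=1e-5,
                 steps=4000):
    """SGD with Nesterov momentum: the gradient (plus weight decay) is
    evaluated at the look-ahead point theta - momentum*v, then
    v_t = momentum*v_{t-1} + lr*g and theta = theta - v_t."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        look = theta - momentum * v        # look-ahead evaluation point
        g = grad(look) + weight_decay * look
        v = momentum * v + lr * g
        theta = theta - v
    return theta

# toy objective f(x) = ||x - 1||^2, gradient 2(x - 1); minimum at x = 1
theta = sgd_nesterov(lambda x: 2.0 * (x - 1.0), np.array([5.0, -3.0]))
```

Evaluating the gradient at the look-ahead point is what distinguishes Nesterov momentum from plain (heavy-ball) momentum.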
In order to accelerate the training speed and improve the reliability of the recognition result, weights of all layers are frozen on a DTTRN network model pre-trained by a source database, and then training and fine-tuning are only carried out on the weight of the last full-connection layer of the DTTRN by combining a target domain sample and a source domain sample, so that the expected cross-library speech emotion recognition task can be realized. The training maximum period is set to 200.
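The freezing scheme — backbone weights fixed, only the last fully connected layer updated — can be sketched framework-independently (the layer names below are illustrative):

```python
import numpy as np

def finetune_step(params, grads, trainable=("fc1",), lr=1e-3):
    """Apply a gradient step only to the layers listed in `trainable`;
    every other (pre-trained, frozen) weight is returned unchanged."""
    return {name: w - lr * grads[name] if name in trainable else w
            for name, w in params.items()}

params = {"conv1": np.ones(2), "fc1": np.ones(2)}   # frozen vs. trainable
grads = {"conv1": np.ones(2), "fc1": np.ones(2)}
new = finetune_step(params, grads)
```

In a deep-learning framework the same effect is achieved by disabling gradient tracking on the frozen layers and passing only the last layer's parameters to the optimizer.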
The embodiment also provides a cross-library speech emotion recognition device based on the deep direct-push migration network, which comprises a processor and a computer program stored on a memory and capable of running on the processor, wherein the processor realizes the method when executing the program.
In order to verify the invention, simulation experiments were carried out. Cross-library speech emotion recognition experiments require sample data that share the same labels but belong to different databases; EmoDB, eNTERFACE and CASIA were selected, denoted B, E and C respectively. In the co-migration of EmoDB and CASIA, we used the five emotions anger, sadness, fear, happiness and neutral; in the co-migration of EmoDB and eNTERFACE, we used anger, sadness, fear, happiness and disgust; in the co-migration of CASIA and eNTERFACE, we used anger, sadness, fear, happiness and surprise. In order to verify the effectiveness and necessity of the direct-push migration deep regression network DTTRN, cross-library speech emotion experiments were carried out on EmoDB, eNTERFACE and CASIA. UAR (Unweighted Average Recall) was selected as the evaluation index, and the results are shown in fig. 4; it can be observed that the direct-push migration deep regression network DTTRN of the invention achieves an excellent recognition effect on the cross-library speech emotion recognition task. DTTRN borrows the idea of transductive (direct-push) migration and uses the unlabelled target-domain data to pre-train the network better; in addition, the designed loss function constrains the feature-distribution difference of the two databases on the network during DTTRN fine-tuning, which plays a key role in improving cross-library recognition performance.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (7)
1. A cross-database speech emotion recognition method based on a deep direct push type migration network is characterized by comprising the following steps:
(1) acquiring two different voice emotion databases, namely a source database and a target database, wherein emotion voice audios and corresponding emotion type labels are stored in the source database, and only the emotion voice audios are stored in the target database;
(2) processing the emotion voice audios of the source database and the target database into spectrograms;
(3) establishing a deep regression neural network;
(4) simultaneously inputting the spectrogram of the source database with its corresponding label, and the spectrogram of the target database with a randomly initialized pseudo label, into the deep regression neural network as samples for pre-training; then inputting the obtained depth features as samples into the deep regression neural network for fine-tuning, completing the training;
(5) preprocessing the emotion voice frequency of the voice to be recognized into a spectrogram, and inputting the spectrogram as a sample in a target database into a trained deep regression neural network to obtain the emotion type of the voice.
2. The cross-library speech emotion recognition method based on the deep direct-push migration network, characterized in that step (2) specifically comprises: processing the emotional voice audios in the source database and the target database into spectrograms using the librosa toolkit of python.
3. The cross-library speech emotion recognition method based on the deep direct-push migration network, characterized in that the deep regression neural network established in step (3) comprises, connected in sequence from front to back: a first convolutional layer, a first max-pooling layer, a second convolutional layer, a second max-pooling layer, a third convolutional layer, a fourth convolutional layer, a third max-pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fourth max-pooling layer, a seventh convolutional layer, an eighth convolutional layer, a fifth max-pooling layer, an adaptive average pooling layer and a fully connected layer.
4. The cross-library speech emotion recognition method based on the deep direct-push migration network, characterized in that each neuron in all convolutional layers and fully connected layers adopts the rectified linear unit (ReLU) as its activation function.
5. The cross-library speech emotion recognition method based on the deep direct-push migration network according to claim 3, characterized in that the outputs of the neurons in the fully connected layer are regularized with dropout at a rate of 0.5 to prevent overfitting.
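A NumPy illustration of the two per-neuron choices in claims 4 and 5: ReLU activation and dropout at rate 0.5. The inverted-scaling variant of dropout is an assumption for the sketch; the claim only fixes the rate:

```python
import numpy as np

def relu(x):
    # Rectified linear unit used by every convolutional / fully connected neuron
    return np.maximum(0.0, x)

def dropout(x, p=0.5, rng=None, train=True):
    # Inverted dropout: zero each fully connected output with probability p
    # during training and rescale by 1/(1-p), so inference needs no change.
    if not train:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.array([-1.0, 0.5, 2.0])
print(relu(x))                    # [0.  0.5 2. ]
print(dropout(relu(x), p=0.5))    # surviving values doubled, the rest zeroed
```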
6. The cross-library speech emotion recognition method based on the deep direct-push migration network according to claim 1, characterized in that step (4) comprises the following steps:
(4-1) simultaneously inputting the spectrograms of the source database together with their corresponding labels, and the spectrograms of the target database together with randomly initialized pseudo labels, into the deep regression neural network as samples, wherein the loss function L adopted during training is:

L = -(1/N) Σ_{n=1}^{N} Σ_{j=1}^{J} q_{nj}^{s} ln p_{nj}^{s} - (1/M) Σ_{m=1}^{M} Σ_{j=1}^{J} p̂_{mj}^{t} ln p_{mj}^{t}

wherein N is the number of source database samples, n indexes the source database samples, M is the number of target database samples, m indexes the target database samples, J is the number of speech emotion categories, and j indexes the emotion categories; p_{nj}^{s} is the probability that the actually output emotional feature of the n-th source database sample belongs to category j; q_{nj}^{s} is the expected (label) probability that the output emotional feature of the n-th source database sample belongs to category j; p_{mj}^{t} is the probability that the actually output emotional feature of the m-th target database sample belongs to category j; and p̂_{mj}^{t} is the probability, obtained in the previous round of training, that the actually output emotional feature of the m-th target database sample belongs to category j;
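The pre-training objective above can be sketched in NumPy: a standard cross-entropy on labeled source samples plus a pseudo-label cross-entropy on target samples, whose targets are the previous round's own outputs. The function name and array layout are illustrative:

```python
import numpy as np

def pretrain_loss(p_src, q_src, p_tgt, p_tgt_prev):
    """Reconstructed loss L for the pre-training stage.
    p_src:      (N, J) network outputs for source samples
    q_src:      (N, J) one-hot source labels (expected probabilities)
    p_tgt:      (M, J) network outputs for target samples
    p_tgt_prev: (M, J) target outputs from the previous training round
                (the pseudo labels; randomly initialized at the start)
    """
    eps = 1e-12  # numerical floor inside the logarithm
    src = -np.mean(np.sum(q_src * np.log(p_src + eps), axis=1))
    tgt = -np.mean(np.sum(p_tgt_prev * np.log(p_tgt + eps), axis=1))
    return src + tgt

# Perfect source predictions contribute ~0; a maximally uncertain
# target sample with matching pseudo label contributes ln 2.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = pretrain_loss(q, q, np.array([[0.5, 0.5]]), np.array([[0.5, 0.5]]))
print(loss)   # ~0.6931
```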
(4-2) inputting the deep features of the source database and the target database obtained through the deep regression neural network as samples into the pre-trained deep regression neural network for fine-tuning, wherein the loss function L_total adopted during training is:
L_total = α·L_mmd + β·L
wherein L_mmd = MMD(X_s, Y_t) + Σ_{j=1}^{J} MMD(X_s^j, Y_t^j) + MMD(X_s^{pn}, Y_t^{pn}); MMD(X_s, Y_t) denotes the maximum mean discrepancy (MMD) between the emotional features output by the source database samples and by the target database samples in the deep regression neural network, where X_s denotes the emotional features output by the source database samples and Y_t denotes the emotional features output by the target database samples; MMD(X_s^j, Y_t^j) denotes the MMD between the category-j emotional features output by the source and target database samples, where X_s^j denotes the distribution of the category-j emotional features output by the source database samples and Y_t^j denotes the category-j emotional features output by the target database samples; MMD(X_s^{pn}, Y_t^{pn}) denotes the MMD between the positive/negative emotional features output by the source and target database samples, where X_s^{pn} denotes the positive/negative emotional features output by the source database samples and Y_t^{pn} denotes those output by the target database samples; the MMD is the distance between the means of two data sets in a reproducing kernel Hilbert space; and α and β are constraint strength coefficients obtained through training.
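The fine-tuning objective can be sketched in NumPy as follows. A linear-kernel MMD (distance between feature means) stands in for a full kernel MMD; the `pn_s`/`pn_t` 0/1 positive-negative indicators and the additive combination of the three alignment terms are assumptions, since the claim defines the terms but the combining formula is not reproduced in this text:

```python
import numpy as np

def mmd(X, Y):
    """Linear-kernel maximum mean discrepancy: squared distance between
    the feature means of two sample sets (a simple stand-in for the
    RKHS mean distance referred to in the claim)."""
    return float(np.sum((X.mean(axis=0) - Y.mean(axis=0)) ** 2))

def total_loss(L, Xs, Yt, labels_s, labels_t, pn_s, pn_t,
               alpha=1.0, beta=1.0, J=2):
    """L_total = alpha * L_mmd + beta * L, combining global, per-category,
    and positive/negative-emotion alignment terms."""
    L_mmd = mmd(Xs, Yt)
    for j in range(J):  # per-category alignment (assumes every category occurs)
        L_mmd += mmd(Xs[labels_s == j], Yt[labels_t == j])
    # positive / negative emotion alignment
    L_mmd += mmd(Xs[pn_s == 1], Yt[pn_t == 1]) + mmd(Xs[pn_s == 0], Yt[pn_t == 0])
    return alpha * L_mmd + beta * L

# When source and target features coincide, every MMD term vanishes
# and only the classification loss L remains.
Xs = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1])
pn = np.array([0, 1])
print(total_loss(1.0, Xs, Xs, labels, labels, pn, pn))   # 1.0
```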
7. A cross-library speech emotion recognition device based on a deep direct-push migration network, comprising a processor and a computer program stored in memory and executable on the processor, characterized in that the processor, when executing the program, implements the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210513096.6A CN114898777A (en) | 2022-05-12 | 2022-05-12 | Cross-library speech emotion recognition method and device based on deep direct-push migration network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114898777A true CN114898777A (en) | 2022-08-12 |
Family
ID=82722316
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210513096.6A Pending CN114898777A (en) | 2022-05-12 | 2022-05-12 | Cross-library speech emotion recognition method and device based on deep direct-push migration network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114898777A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116072154A (en) * | 2023-03-07 | 2023-05-05 | 华南师范大学 | Speech emotion recognition method, device and equipment based on data enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||