CN113851148A - Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment - Google Patents

Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment

Info

Publication number
CN113851148A
Authority
CN
China
Prior art keywords
feature
source domain
emotion
cross
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111117676.5A
Other languages
Chinese (zh)
Inventor
庄志豪
刘曼
汪洋
陶华伟
傅洪亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology filed Critical Henan University of Technology
Priority to CN202111117676.5A
Publication of CN113851148A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment, which comprises the following steps: first, a deep network model based on a deep denoising autoencoder and a deep neural network is built to compress redundant feature information and improve feature characterization capability; then, a global and sub-domain adaptation method is adopted to realize feature transfer and reduce the influence of the sample imbalance problem on recognition performance; finally, in the training stage, dynamic weight factors are constructed to adjust the contributions of the different loss functions and thereby optimize the model. The proposed method can effectively learn the common emotion information of corpora with unbalanced samples and reduce the feature distribution difference.

Description

Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment
Technical field
The invention belongs to the technical field of speech signal processing, and particularly relates to a cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment.
Background
Speech emotion recognition is an important technical basis for human-computer interaction. Traditional speech emotion recognition research usually trains and tests on the same corpus and achieves good recognition results. However, the speech feature distributions of different corpora differ greatly because of differences in recording environments, speaker gender and age distributions, languages, and so on; this is the typical cross-corpus speech emotion recognition problem. Therefore, how to effectively deal with the feature distribution differences caused by training across corpora is an important and challenging problem in speech emotion recognition research.
Inspired by the successful application of transfer learning in text classification and clustering, image classification, sensor positioning, collaborative filtering and the like, domain adaptation has been introduced into cross-corpus speech emotion recognition research to reduce the feature distribution differences between domains.
Therefore, the invention mainly focuses on cross-corpus speech emotion recognition between different corpora. First, low-dimensional speech emotion features are obtained from a deep network model built from a deep denoising autoencoder and a deep neural network: the deep denoising autoencoder effectively compresses redundant feature information and improves the robustness of the model, while the deep neural network has strong nonlinear fitting capability and effectively improves the emotion characterization ability of the speech features. Then MMD and LMMD are introduced to reduce the feature distribution distance while alleviating the influence of sample imbalance on recognition performance. Finally, in the training stage, dynamic weight factors are used to adjust the contributions of the different loss functions to model optimization.
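The MMD term mentioned above is a standard kernel-based distance between feature distributions. The following is a minimal sketch, not part of the original filing, of how it could be computed over two batches of low-dimensional features, assuming a PyTorch implementation and a multi-kernel Gaussian mapping; the function names and the bandwidth heuristic are illustrative assumptions.

```python
import torch

def gaussian_kernel(x, y, num_kernels=5, mul_factor=2.0):
    # Multi-kernel Gaussian (RBF) matrix over the concatenated source/target batch.
    total = torch.cat([x, y], dim=0)                  # (n_x + n_y, d)
    dists = torch.cdist(total, total).pow(2)          # pairwise squared distances
    bandwidth = dists.mean().detach()                 # simple mean-distance bandwidth heuristic
    bandwidths = [bandwidth * mul_factor ** (i - num_kernels // 2)
                  for i in range(num_kernels)]
    return sum(torch.exp(-dists / b) for b in bandwidths)

def mmd_loss(source, target):
    # Biased estimate of the squared MMD between the source and target batches.
    n = source.size(0)
    k = gaussian_kernel(source, target)
    return k[:n, :n].mean() + k[n:, n:].mean() - 2.0 * k[:n, n:].mean()
```

LMMD follows the same pattern but weights each sample's kernel contribution by its (true or pseudo) class-membership probability, so that each emotion class of the source domain is aligned with the corresponding class of the target domain.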
Disclosure of Invention
In order to address the feature distribution differences between different corpora, better transfer knowledge from labeled source domain data to the unlabeled target domain, and accurately classify the unlabeled data, a cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment is provided. The specific steps are as follows:
(1) Preparing corpora: acquire two corpora with unbalanced samples to serve as the source domain database and the target domain database, respectively, wherein the source domain database comprises a plurality of speech signals and the corresponding emotion category labels, and the target domain database comprises a plurality of speech signals;
(2) Speech preprocessing: preprocess the speech signals in the source domain and target domain databases in preparation for the subsequent feature extraction;
(3) Speech feature extraction: extract speech emotion features from the speech signals preprocessed in step (2), the features including but not limited to MFCC, short-time average zero-crossing rate, fundamental frequency, mean, standard deviation, and maximum and minimum values;
(4) Feature processing: first, take the source domain features $X_S=\{x_i^S\}_{i=1}^{n_S}$, the corresponding source domain labels $Y_S=\{y_i^S\}_{i=1}^{n_S}$, and the target domain features $X_T=\{x_j^T\}_{j=1}^{n_T}$ obtained in step (3). After adding noise drawn from a normal distribution to $X_S$ and $X_T$, input them into the deep denoising autoencoder for feature reconstruction:

$$\hat{X}_S=\mathrm{Dec}\big(\mathrm{Enc}(X_S+\epsilon)\big),\qquad \hat{X}_T=\mathrm{Dec}\big(\mathrm{Enc}(X_T+\epsilon)\big),$$

where $\hat{X}_S$ and $\hat{X}_T$ are the sample features reconstructed by the decoder of the autoencoder, $\mathrm{Enc}(\cdot)$ and $\mathrm{Dec}(\cdot)$ denote its encoder and decoder, and $\epsilon$ is the added noise. Then the encoded output of the autoencoder is fed into a deep neural network for further processing, yielding the low-dimensional emotion features of the source domain and the target domain, $X'_S$ and $X'_T$, respectively. Finally, the true source domain labels $Y_S$ and the source domain class probabilities $\hat{Y}_S$ predicted by the softmax classifier are used to compute the cross-entropy loss:

$$\mathcal{L}_y=-\frac{1}{n_S}\sum_{i=1}^{n_S}\sum_{c=1}^{C} y_{i,c}^S \log \hat{y}_{i,c}^S;$$
(5) Feature transfer: first, the Maximum Mean Discrepancy (MMD) is used to reduce the global feature distribution distance between $X'_S$ and $X'_T$:

$$\mathcal{L}_{MMD}=\bigg\|\frac{1}{n_S}\sum_{i=1}^{n_S}\delta(x_i'^S)-\frac{1}{n_T}\sum_{j=1}^{n_T}\delta(x_j'^T)\bigg\|_{\mathcal{H}}^{2},$$

where $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) and $\delta(\cdot)$ is the feature mapping function (a Gaussian kernel). Then the Local Maximum Mean Discrepancy (LMMD) is used at the same time to reduce the sub-domain (class-wise) feature distribution distance between $X'_S$ and $X'_T$:

$$\mathcal{L}_{LMMD}=\frac{1}{C}\sum_{c=1}^{C}\bigg\|\sum_{x_i'^S\in X'_S} w_i^{Sc}\,\delta(x_i'^S)-\sum_{x_j'^T\in X'_T} w_j^{Tc}\,\delta(x_j'^T)\bigg\|_{\mathcal{H}}^{2},$$

where $w_i^{Sc}$ is the weight with which source domain sample $x_i'^S$ belongs to emotion class $c$, and $w_j^{Tc}$ is the weight with which target domain sample $x_j'^T$ belongs to emotion class $c$;
(6) Model training: given the five loss functions obtained in steps (4) and (5), dynamic weight factors $w_i$ are used to adjust the contribution of each loss to model optimization, so that the overall optimization objective of the model is the weighted sum

$$\mathcal{L}=\sum_{i\in\{S,T,y,MMD,LMMD\}} w_i\,\mathcal{L}_i,$$

where $\mathcal{L}_S$ and $\mathcal{L}_T$ are the source and target domain reconstruction losses. Each dynamic weight factor $w_i$ is recomputed from the current values of the loss terms and the hyperparameters $\alpha_i>0$, $i\in\{S,T,y,MMD,LMMD\}$ (the exact expression is given only as a formula image in the original filing; an illustrative weighting sketch is shown after these steps);
(7) Repeat steps (4) and (5), iteratively training the network model by gradient descent and continuously updating the dynamic weight factors in step (6) until the model is optimal;
(8) Use the network model trained in step (7) with the softmax classifier to predict the labels of the target domain features from step (3), finally achieving speech emotion recognition under the cross-corpus condition.
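The following is a minimal sketch, not part of the original filing, of one plausible way to combine the five losses with dynamically adjusted weights in PyTorch. The specific weighting rule (weights proportional to $\alpha_i$ times the most recent loss value, then normalized) is an assumption made only for illustration, since the filed expression for $w_i$ is available only as an image.

```python
import torch

def dynamic_weights(prev_losses, alphas):
    # Illustrative stand-in for the filed w_i formula: each weight is proportional
    # to alpha_i times the loss value from the previous epoch, normalized to sum to 1.
    keys = list(prev_losses.keys())
    raw = torch.tensor([alphas[k] * prev_losses[k] for k in keys])
    return dict(zip(keys, (raw / raw.sum()).tolist()))

def total_loss(losses, weights):
    # Weighted sum of the five terms: source/target reconstruction, classification, MMD, LMMD.
    return sum(weights[k] * losses[k] for k in losses)

# Example: equal hyperparameters, loss values taken from the previous training epoch.
alphas = {'S': 1.0, 'T': 1.0, 'y': 1.0, 'MMD': 1.0, 'LMMD': 1.0}
weights = dynamic_weights({'S': 0.8, 'T': 0.9, 'y': 1.2, 'MMD': 0.3, 'LMMD': 0.4}, alphas)
```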
Drawings
Fig. 1 is the training flow chart of the cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment, and Fig. 2 is the corresponding testing flow chart.
Detailed Description
The present invention will be further described with reference to the following embodiments.
(1) The standard feature set of the INTERSPEECH 2010 Paralinguistic Challenge (IS10) is extracted from each utterance with the open-source toolkit openSMILE, giving 1582-dimensional features per utterance. The 5 emotion classes taken from the EMO-DB database contain 368 utterances in total, for a data size of 368 x 1582; the 5 emotion classes of the eNTERFACE database contain 1072 utterances, for 1072 x 1582; and the 6 emotion classes of the CASIA Chinese emotional speech database contain 1200 utterances, for 1200 x 1582.
(2) The source domain features and the target domain features are normalized, and noise drawn from a normal distribution is added.
(3) A deep network model is built from a deep denoising autoencoder (DAE) and a deep neural network (DNN); a sketch matching this architecture is given after these steps. The DAE has 5 hidden layers whose numbers of neurons are set to 1200, 900, 500, 900 and 1200, respectively; the ELU function is used as the activation function in the encoding stage and the Sigmoid function in the decoding stage. In addition, a BatchNorm layer and a Dropout layer are added to each layer of the DAE. The DNN has 2 hidden layers with 600 and 256 neurons, respectively, and uses the Sigmoid activation function.
(4) The source domain and target domain features obtained in step (2) are input into the deep network model based on the deep autoencoder and the deep neural network to extract low-dimensional emotion features.
(5) In the low-dimensional emotion space, MMD and LMMD are used simultaneously to measure the feature distribution distance between the low-dimensional emotion features of the source and target domains. MMD needs only the low-dimensional features of the two domains; when computing the LMMD-based sub-domain adaptation loss, the probability distribution produced by softmax is additionally used as the target domain label, i.e., a pseudo label, together with the source domain low-dimensional features, their true labels, and the target domain low-dimensional features.
(6) The network loss of the model is the weighted combination of five terms,

$$\mathcal{L}=\sum_{i\in\{S,T,y,MMD,LMMD\}} w_i\,\mathcal{L}_i,$$

where $\mathcal{L}_{MMD}$ and $\mathcal{L}_{LMMD}$ are the loss functions measuring the feature distribution difference, $\mathcal{L}_S$ is the source domain reconstruction loss, $\mathcal{L}_T$ is the target domain reconstruction loss, $\mathcal{L}_y$ is the source domain classification loss, and the weights are governed by hyperparameters $\alpha_i$ that strengthen the contribution of the different losses within the overall loss.
(7) The learning rate and batch size of the model are set to 0.00001 and 100, the network model is trained with the Adam gradient descent method for 500 iterations, and softmax is used as the classifier. In MMD and LMMD, the feature mapping function uses a multi-kernel Gaussian function with the number of Gaussian kernels set to 5. At the end of each training round, the set of loss function values is used to update the weights $w_i$ in (6), realizing dynamic adjustment of the loss weights.
(8) The speech signal to be recognized is normalized and input into the trained deep network model, and the class with the maximum softmax probability is output as the recognized emotion class.
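For concreteness, the following is a minimal PyTorch sketch, not part of the original filing, of a network matching the layer sizes and activations stated in steps (3) and (7); the dropout rate, noise standard deviation, output-layer activation, and number of emotion classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def dae_block(in_dim, out_dim, act, p_drop=0.5):
    # One DAE layer: Linear + BatchNorm + activation + Dropout (dropout rate assumed).
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim), act, nn.Dropout(p_drop))

class CrossCorpusSERNet(nn.Module):
    """DAE with hidden sizes 1200-900-500-900-1200 (ELU encoder, Sigmoid decoder),
    followed by a 600-256 DNN and a softmax classifier, as described in step (3)."""
    def __init__(self, in_dim=1582, num_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(dae_block(in_dim, 1200, nn.ELU()),
                                     dae_block(1200, 900, nn.ELU()),
                                     dae_block(900, 500, nn.ELU()))
        self.decoder = nn.Sequential(dae_block(500, 900, nn.Sigmoid()),
                                     dae_block(900, 1200, nn.Sigmoid()),
                                     nn.Linear(1200, in_dim))       # linear output layer (assumption)
        self.dnn = nn.Sequential(nn.Linear(500, 600), nn.Sigmoid(),
                                 nn.Linear(600, 256), nn.Sigmoid())
        self.classifier = nn.Linear(256, num_classes)               # softmax applied in the loss

    def forward(self, x, noise_std=0.1):
        noisy = x + noise_std * torch.randn_like(x) if self.training else x  # denoising corruption
        code = self.encoder(noisy)
        recon = self.decoder(code)          # reconstruction for the L_S / L_T losses
        feat = self.dnn(code)               # low-dimensional emotion features for MMD / LMMD
        return recon, feat, self.classifier(feat)

# Training configuration stated in step (7): Adam, lr = 1e-5, batch size = 100, 500 iterations.
model = CrossCorpusSERNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```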
The following examples use the eNTERFACE database (abbreviated E), the EMO-DB database (abbreviated B) and the CASIA database (abbreviated C). The 4 cross-corpus experiments listed are all sample-imbalance experiments, and the verification results are shown in Table 1.
Table 1. Experimental results (UAR/WAR) of the proposed method and the comparison methods [the results table is provided as an image in the original publication]
The accuracy metrics used for the model are the unweighted average recall (UAR) and the weighted average recall (WAR). PCA+SVM, DoSL, TSDSL and JDAR are all cross-corpus speech emotion recognition methods that use the IS10 feature set and recognize the same emotion categories across corpora. The invention is the cross-library speech emotion recognition method built as described in the steps above.
The experimental results show that the proposed method obtains the best cross-corpus speech recognition rates in the sample-imbalance experiments. The proposed model combines a deep network with two transfer learning algorithms and uses dynamic weight factors to adjust the contributions of the different loss functions, thereby optimizing model training.
The scope of the invention is not limited to the description of the embodiments.

Claims (1)

1. A cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment is characterized by comprising the following steps:
(1) preparing corpora: obtaining two corpora with unbalanced samples to serve as the source domain database and the target domain database, respectively, wherein the source domain database comprises a plurality of speech signals and the corresponding emotion category labels, and the target domain database comprises a plurality of speech signals;
(2) speech preprocessing: preprocessing the speech signals in the source domain and target domain databases in preparation for the subsequent feature extraction;
(3) speech feature extraction: extracting speech emotion features from the speech signals preprocessed in step (2), the features including but not limited to MFCC, short-time average zero-crossing rate, fundamental frequency, mean, standard deviation, and maximum and minimum values;
(4) feature processing: first, take the source domain features $X_S=\{x_i^S\}_{i=1}^{n_S}$, the corresponding source domain labels $Y_S=\{y_i^S\}_{i=1}^{n_S}$, and the target domain features $X_T=\{x_j^T\}_{j=1}^{n_T}$ obtained in step (3); after adding noise drawn from a normal distribution to $X_S$ and $X_T$, input them into the deep denoising autoencoder for feature reconstruction:

$$\hat{X}_S=\mathrm{Dec}\big(\mathrm{Enc}(X_S+\epsilon)\big),\qquad \hat{X}_T=\mathrm{Dec}\big(\mathrm{Enc}(X_T+\epsilon)\big),$$

where $\hat{X}_S$ and $\hat{X}_T$ are the sample features reconstructed by the decoder of the autoencoder, $\mathrm{Enc}(\cdot)$ and $\mathrm{Dec}(\cdot)$ denote its encoder and decoder, and $\epsilon$ is the added noise; then the encoded output of the autoencoder is fed into a deep neural network for further processing, yielding the low-dimensional emotion features of the source domain and the target domain, $X'_S$ and $X'_T$, respectively; finally, the true source domain labels $Y_S$ and the source domain class probabilities $\hat{Y}_S$ predicted by the softmax classifier are used to compute the cross-entropy loss:

$$\mathcal{L}_y=-\frac{1}{n_S}\sum_{i=1}^{n_S}\sum_{c=1}^{C} y_{i,c}^S \log \hat{y}_{i,c}^S;$$
(5) feature transfer: first, the Maximum Mean Discrepancy (MMD) is used to reduce the global feature distribution distance between $X'_S$ and $X'_T$:

$$\mathcal{L}_{MMD}=\bigg\|\frac{1}{n_S}\sum_{i=1}^{n_S}\delta(x_i'^S)-\frac{1}{n_T}\sum_{j=1}^{n_T}\delta(x_j'^T)\bigg\|_{\mathcal{H}}^{2},$$

where $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) and $\delta(\cdot)$ is the feature mapping function (a Gaussian kernel); then the Local Maximum Mean Discrepancy (LMMD) is used at the same time to reduce the sub-domain (class-wise) feature distribution distance between $X'_S$ and $X'_T$:

$$\mathcal{L}_{LMMD}=\frac{1}{C}\sum_{c=1}^{C}\bigg\|\sum_{x_i'^S\in X'_S} w_i^{Sc}\,\delta(x_i'^S)-\sum_{x_j'^T\in X'_T} w_j^{Tc}\,\delta(x_j'^T)\bigg\|_{\mathcal{H}}^{2},$$

where $w_i^{Sc}$ is the weight with which source domain sample $x_i'^S$ belongs to emotion class $c$, and $w_j^{Tc}$ is the weight with which target domain sample $x_j'^T$ belongs to emotion class $c$;
(6) model training: given the five loss functions obtained in steps (4) and (5), dynamic weight factors $w_i$ are used to adjust the contribution of each loss to model optimization, so that the overall optimization objective of the model is the weighted sum

$$\mathcal{L}=\sum_{i\in\{S,T,y,MMD,LMMD\}} w_i\,\mathcal{L}_i,$$

where $\mathcal{L}_S$ and $\mathcal{L}_T$ are the source and target domain reconstruction losses, and each dynamic weight factor $w_i$ is recomputed from the current values of the loss terms and the hyperparameters $\alpha_i>0$, $i\in\{S,T,y,MMD,LMMD\}$ (the exact expression is given only as a formula image in the original filing);
(7) repeating steps (4) and (5), iteratively training the network model by gradient descent and continuously updating the dynamic weight factors in step (6) until the model is optimal;
(8) using the network model trained in step (7) with the softmax classifier to predict the labels of the target domain features from step (3), finally achieving speech emotion recognition under the cross-corpus condition.
CN202111117676.5A 2021-09-23 2021-09-23 Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment Pending CN113851148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111117676.5A CN113851148A (en) 2021-09-23 2021-09-23 Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111117676.5A CN113851148A (en) 2021-09-23 2021-09-23 Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment

Publications (1)

Publication Number Publication Date
CN113851148A true CN113851148A (en) 2021-12-28

Family

ID=78979543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111117676.5A Pending CN113851148A (en) 2021-09-23 2021-09-23 Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment

Country Status (1)

Country Link
CN (1) CN113851148A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757310A (en) * 2022-06-16 2022-07-15 山东海量信息技术研究院 Emotion recognition model, and training method, device, equipment and readable storage medium thereof
CN114757310B (en) * 2022-06-16 2022-11-11 山东海量信息技术研究院 Emotion recognition model and training method, device, equipment and readable storage medium thereof

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN108319666B (en) Power supply service assessment method based on multi-modal public opinion analysis
WO2015180368A1 (en) Variable factor decomposition method for semi-supervised speech features
CN111161744B (en) Speaker clustering method for simultaneously optimizing deep characterization learning and speaker identification estimation
CN110751044A (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN111461025B (en) Signal identification method for self-evolving zero-sample learning
CN113077823B (en) Depth self-encoder subdomain self-adaptive cross-library voice emotion recognition method
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN104077598B (en) A kind of emotion identification method based on voice fuzzy cluster
CN113554110B (en) Brain electricity emotion recognition method based on binary capsule network
Wei et al. A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model
CN112766355A (en) Electroencephalogram signal emotion recognition method under label noise
CN113571067A (en) Voiceprint recognition countermeasure sample generation method based on boundary attack
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Janbakhshi et al. Automatic dysarthric speech detection exploiting pairwise distance-based convolutional neural networks
CN113851148A (en) Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment
CN112863521B (en) Speaker identification method based on mutual information estimation
CN110246509A (en) A kind of stack denoising self-encoding encoder and deep neural network structure for voice lie detection
CN106297768B (en) Speech recognition method
CN113628640A (en) Cross-library speech emotion recognition method based on sample equalization and maximum mean difference
CN116842460A (en) Cough-related disease identification method and system based on attention mechanism and residual neural network
CN115101077A (en) Voiceprint detection model training method and voiceprint recognition method
CN113851149A (en) Cross-library speech emotion recognition method based on anti-migration and Frobenius norm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination