CN113851148A - Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment - Google Patents

Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment

Info

Publication number
CN113851148A
Authority
CN
China
Prior art keywords
feature
source domain
emotion
cross
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111117676.5A
Other languages
Chinese (zh)
Inventor
庄志豪
刘曼
汪洋
陶华伟
傅洪亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology filed Critical Henan University of Technology
Priority to CN202111117676.5A
Publication of CN113851148A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment, which comprises the following steps: first, a deep network model based on a deep denoising autoencoder and a deep neural network is built to compress redundant feature information and improve feature characterization capability; then, a global and sub-domain adaptation method is adopted to realize feature transfer and reduce the influence of the sample imbalance problem on recognition performance; finally, in the training stage, dynamic weight factors are constructed to adjust the contributions of the different loss functions and thereby optimize the model. The proposed method can effectively learn the common emotion information of corpora with unbalanced samples and reduce the feature distribution difference.

Description

Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment
Technical field
The invention belongs to the technical field of speech signal processing, and particularly relates to a cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment.
Background
Speech emotion recognition is an important technical basis for human-computer interaction. Traditional speech emotion recognition research usually trains and tests on the same corpus and achieves good recognition results. However, the speech feature distributions of different corpora differ greatly because of differences in recording environments, speaker gender and age distributions, languages, and so on; this is the typical cross-corpus speech emotion recognition problem. Therefore, how to effectively deal with the feature distribution differences caused by training across corpora is an important and challenging problem in speech emotion recognition research.
Inspired by the successful application of transfer learning in text classification and clustering, image classification, sensor positioning, collaborative filtering and the like, domain adaptation has been introduced into cross-corpus speech emotion recognition research to reduce the feature distribution differences between domains.
Therefore, the invention mainly focuses on cross-corpus speech emotion recognition between different corpora. First, low-dimensional speech emotion features are obtained from a deep network model built from a deep denoising autoencoder and a deep neural network: the deep denoising autoencoder effectively compresses redundant feature information and improves the robustness of the model, while the deep neural network has strong nonlinear fitting capability and effectively improves the emotion characterization ability of the speech features. Then MMD and LMMD are introduced to reduce the feature distribution distance while alleviating the influence of sample imbalance on recognition performance. Finally, in the training stage, dynamic weight factors are used to adjust the contributions of the different loss functions to model optimization.
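The MMD term mentioned above is a standard kernel-based distance between feature distributions. The following is a minimal sketch, not part of the original filing, of how it could be computed over two batches of low-dimensional features, assuming a PyTorch implementation and a multi-kernel Gaussian mapping; the function names and the bandwidth heuristic are illustrative assumptions.

```python
import torch

def gaussian_kernel(x, y, num_kernels=5, mul_factor=2.0):
    # Multi-kernel Gaussian (RBF) matrix over the concatenated source/target batch.
    total = torch.cat([x, y], dim=0)                  # (n_x + n_y, d)
    dists = torch.cdist(total, total).pow(2)          # pairwise squared distances
    bandwidth = dists.mean().detach()                 # simple mean-distance bandwidth heuristic
    bandwidths = [bandwidth * mul_factor ** (i - num_kernels // 2)
                  for i in range(num_kernels)]
    return sum(torch.exp(-dists / b) for b in bandwidths)

def mmd_loss(source, target):
    # Biased estimate of the squared MMD between the source and target batches.
    n = source.size(0)
    k = gaussian_kernel(source, target)
    return k[:n, :n].mean() + k[n:, n:].mean() - 2.0 * k[:n, n:].mean()
```

LMMD follows the same pattern but weights each sample's kernel contribution by its (true or pseudo) class-membership probability, so that each emotion class of the source domain is aligned with the corresponding class of the target domain.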
Disclosure of Invention
In order to address the feature distribution differences between different corpora, better transfer knowledge from labeled source domain data to the unlabeled target domain, and accurately classify the unlabeled data, a cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment is provided. The specific steps are as follows:
(1) Preparing corpora: acquire two corpora with unbalanced samples to serve as the source domain database and the target domain database, respectively, wherein the source domain database comprises a plurality of speech signals and the corresponding emotion category labels, and the target domain database comprises a plurality of speech signals;
(2) Speech preprocessing: preprocess the speech signals in the source domain and target domain databases in preparation for the subsequent feature extraction;
(3) Speech feature extraction: extract speech emotion features from the speech signals preprocessed in step (2), the features including but not limited to MFCC, short-time average zero-crossing rate, fundamental frequency, mean, standard deviation, and maximum and minimum values;
(4) Feature processing: first, take the source domain features $X_S=\{x_i^S\}_{i=1}^{n_S}$, the corresponding source domain labels $Y_S=\{y_i^S\}_{i=1}^{n_S}$, and the target domain features $X_T=\{x_j^T\}_{j=1}^{n_T}$ obtained in step (3). After adding noise drawn from a normal distribution to $X_S$ and $X_T$, input them into the deep denoising autoencoder for feature reconstruction:

$$\hat{X}_S=\mathrm{Dec}\big(\mathrm{Enc}(X_S+\epsilon)\big),\qquad \hat{X}_T=\mathrm{Dec}\big(\mathrm{Enc}(X_T+\epsilon)\big),$$

where $\hat{X}_S$ and $\hat{X}_T$ are the sample features reconstructed by the decoder of the autoencoder, $\mathrm{Enc}(\cdot)$ and $\mathrm{Dec}(\cdot)$ denote its encoder and decoder, and $\epsilon$ is the added noise. Then the encoded output of the autoencoder is fed into a deep neural network for further processing, yielding the low-dimensional emotion features of the source domain and the target domain, $X'_S$ and $X'_T$, respectively. Finally, the true source domain labels $Y_S$ and the source domain class probabilities $\hat{Y}_S$ predicted by the softmax classifier are used to compute the cross-entropy loss:

$$\mathcal{L}_y=-\frac{1}{n_S}\sum_{i=1}^{n_S}\sum_{c=1}^{C} y_{i,c}^S \log \hat{y}_{i,c}^S;$$
(5) Feature transfer: first, the Maximum Mean Discrepancy (MMD) is used to reduce the global feature distribution distance between $X'_S$ and $X'_T$:

$$\mathcal{L}_{MMD}=\bigg\|\frac{1}{n_S}\sum_{i=1}^{n_S}\delta(x_i'^S)-\frac{1}{n_T}\sum_{j=1}^{n_T}\delta(x_j'^T)\bigg\|_{\mathcal{H}}^{2},$$

where $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) and $\delta(\cdot)$ is the feature mapping function (a Gaussian kernel). Then the Local Maximum Mean Discrepancy (LMMD) is used at the same time to reduce the sub-domain (class-wise) feature distribution distance between $X'_S$ and $X'_T$:

$$\mathcal{L}_{LMMD}=\frac{1}{C}\sum_{c=1}^{C}\bigg\|\sum_{x_i'^S\in X'_S} w_i^{Sc}\,\delta(x_i'^S)-\sum_{x_j'^T\in X'_T} w_j^{Tc}\,\delta(x_j'^T)\bigg\|_{\mathcal{H}}^{2},$$

where $w_i^{Sc}$ is the weight with which source domain sample $x_i'^S$ belongs to emotion class $c$, and $w_j^{Tc}$ is the weight with which target domain sample $x_j'^T$ belongs to emotion class $c$;
(6) Model training: given the five loss functions obtained in steps (4) and (5), dynamic weight factors $w_i$ are used to adjust the contribution of each loss to model optimization, so that the overall optimization objective of the model is the weighted sum

$$\mathcal{L}=\sum_{i\in\{S,T,y,MMD,LMMD\}} w_i\,\mathcal{L}_i,$$

where $\mathcal{L}_S$ and $\mathcal{L}_T$ are the source and target domain reconstruction losses. Each dynamic weight factor $w_i$ is recomputed from the current values of the loss terms and the hyperparameters $\alpha_i>0$, $i\in\{S,T,y,MMD,LMMD\}$ (the exact expression is given only as a formula image in the original filing; an illustrative weighting sketch is shown after these steps);
(7) Repeat steps (4) and (5), iteratively training the network model by gradient descent and continuously updating the dynamic weight factors in step (6) until the model is optimal;
(8) Use the network model trained in step (7) with the softmax classifier to predict the labels of the target domain features from step (3), finally achieving speech emotion recognition under the cross-corpus condition.
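The following is a minimal sketch, not part of the original filing, of one plausible way to combine the five losses with dynamically adjusted weights in PyTorch. The specific weighting rule (weights proportional to $\alpha_i$ times the most recent loss value, then normalized) is an assumption made only for illustration, since the filed expression for $w_i$ is available only as an image.

```python
import torch

def dynamic_weights(prev_losses, alphas):
    # Illustrative stand-in for the filed w_i formula: each weight is proportional
    # to alpha_i times the loss value from the previous epoch, normalized to sum to 1.
    keys = list(prev_losses.keys())
    raw = torch.tensor([alphas[k] * prev_losses[k] for k in keys])
    return dict(zip(keys, (raw / raw.sum()).tolist()))

def total_loss(losses, weights):
    # Weighted sum of the five terms: source/target reconstruction, classification, MMD, LMMD.
    return sum(weights[k] * losses[k] for k in losses)

# Example: equal hyperparameters, loss values taken from the previous training epoch.
alphas = {'S': 1.0, 'T': 1.0, 'y': 1.0, 'MMD': 1.0, 'LMMD': 1.0}
weights = dynamic_weights({'S': 0.8, 'T': 0.9, 'y': 1.2, 'MMD': 0.3, 'LMMD': 0.4}, alphas)
```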
Drawings
Fig. 1 is the training flow chart of the cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment, and Fig. 2 is the corresponding testing flow chart.
Detailed Description
The present invention will be further described with reference to the following embodiments.
(1) The standard feature set of the INTERSPEECH 2010 Paralinguistic Challenge (IS10) is extracted from each utterance with the open-source toolkit openSMILE, giving 1582-dimensional features per utterance. The 5 emotion classes taken from the EMO-DB database contain 368 utterances in total, for a data size of 368 x 1582; the 5 emotion classes of the eNTERFACE database contain 1072 utterances, for 1072 x 1582; and the 6 emotion classes of the CASIA Chinese emotional speech database contain 1200 utterances, for 1200 x 1582.
(2) The source domain features and the target domain features are normalized, and noise drawn from a normal distribution is added.
(3) A deep network model is built from a deep denoising autoencoder (DAE) and a deep neural network (DNN); a sketch matching this architecture is given after these steps. The DAE has 5 hidden layers whose numbers of neurons are set to 1200, 900, 500, 900 and 1200, respectively; the ELU function is used as the activation function in the encoding stage and the Sigmoid function in the decoding stage. In addition, a BatchNorm layer and a Dropout layer are added to each layer of the DAE. The DNN has 2 hidden layers with 600 and 256 neurons, respectively, and uses the Sigmoid activation function.
(4) The source domain and target domain features obtained in step (2) are input into the deep network model based on the deep autoencoder and the deep neural network to extract low-dimensional emotion features.
(5) In the low-dimensional emotion space, MMD and LMMD are used simultaneously to measure the feature distribution distance between the low-dimensional emotion features of the source and target domains. MMD needs only the low-dimensional features of the two domains; when computing the LMMD-based sub-domain adaptation loss, the probability distribution produced by softmax is additionally used as the target domain label, i.e., a pseudo label, together with the source domain low-dimensional features, their true labels, and the target domain low-dimensional features.
(6) The network loss of the model is the weighted combination of five terms,

$$\mathcal{L}=\sum_{i\in\{S,T,y,MMD,LMMD\}} w_i\,\mathcal{L}_i,$$

where $\mathcal{L}_{MMD}$ and $\mathcal{L}_{LMMD}$ are the loss functions measuring the feature distribution difference, $\mathcal{L}_S$ is the source domain reconstruction loss, $\mathcal{L}_T$ is the target domain reconstruction loss, $\mathcal{L}_y$ is the source domain classification loss, and the weights are governed by hyperparameters $\alpha_i$ that strengthen the contribution of the different losses within the overall loss.
(7) The learning rate and batch size of the model are set to 0.00001 and 100, the network model is trained with the Adam gradient descent method for 500 iterations, and softmax is used as the classifier. In MMD and LMMD, the feature mapping function uses a multi-kernel Gaussian function with the number of Gaussian kernels set to 5. At the end of each training round, the set of loss function values is used to update the weights $w_i$ in (6), realizing dynamic adjustment of the loss weights.
(8) The speech signal to be recognized is normalized and input into the trained deep network model, and the class with the maximum softmax probability is output as the recognized emotion class.
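For concreteness, the following is a minimal PyTorch sketch, not part of the original filing, of a network matching the layer sizes and activations stated in steps (3) and (7); the dropout rate, noise standard deviation, output-layer activation, and number of emotion classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def dae_block(in_dim, out_dim, act, p_drop=0.5):
    # One DAE layer: Linear + BatchNorm + activation + Dropout (dropout rate assumed).
    return nn.Sequential(nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim), act, nn.Dropout(p_drop))

class CrossCorpusSERNet(nn.Module):
    """DAE with hidden sizes 1200-900-500-900-1200 (ELU encoder, Sigmoid decoder),
    followed by a 600-256 DNN and a softmax classifier, as described in step (3)."""
    def __init__(self, in_dim=1582, num_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(dae_block(in_dim, 1200, nn.ELU()),
                                     dae_block(1200, 900, nn.ELU()),
                                     dae_block(900, 500, nn.ELU()))
        self.decoder = nn.Sequential(dae_block(500, 900, nn.Sigmoid()),
                                     dae_block(900, 1200, nn.Sigmoid()),
                                     nn.Linear(1200, in_dim))       # linear output layer (assumption)
        self.dnn = nn.Sequential(nn.Linear(500, 600), nn.Sigmoid(),
                                 nn.Linear(600, 256), nn.Sigmoid())
        self.classifier = nn.Linear(256, num_classes)               # softmax applied in the loss

    def forward(self, x, noise_std=0.1):
        noisy = x + noise_std * torch.randn_like(x) if self.training else x  # denoising corruption
        code = self.encoder(noisy)
        recon = self.decoder(code)          # reconstruction for the L_S / L_T losses
        feat = self.dnn(code)               # low-dimensional emotion features for MMD / LMMD
        return recon, feat, self.classifier(feat)

# Training configuration stated in step (7): Adam, lr = 1e-5, batch size = 100, 500 iterations.
model = CrossCorpusSERNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```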
The following examples use the eNTERFACE database (abbreviated E), the EMO-DB database (abbreviated B) and the CASIA database (abbreviated C). The 4 cross-corpus experiments listed are all sample-imbalance experiments, and the verification results are shown in Table 1.
Table 1. Experimental results (UAR/WAR) of the proposed method and the comparison methods [the results table is provided as an image in the original publication]
The accuracy metrics used for the model are the unweighted average recall (UAR) and the weighted average recall (WAR). PCA+SVM, DoSL, TSDSL and JDAR are all cross-corpus speech emotion recognition methods that use the IS10 feature set and recognize the same emotion categories across corpora. The invention is the cross-library speech emotion recognition method built as described in the steps above.
The experimental results show that the proposed method obtains the best cross-corpus speech recognition rates in the sample-imbalance experiments. The proposed model combines a deep network with two transfer learning algorithms and uses dynamic weight factors to adjust the contributions of the different loss functions, thereby optimizing model training.
The scope of the invention is not limited to the description of the embodiments.

Claims (1)

1. A cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment is characterized by comprising the following steps:
(1) preparing corpora: obtaining two corpora with unbalanced samples to serve as the source domain database and the target domain database, respectively, wherein the source domain database comprises a plurality of speech signals and the corresponding emotion category labels, and the target domain database comprises a plurality of speech signals;
(2) speech preprocessing: preprocessing the speech signals in the source domain and target domain databases in preparation for the subsequent feature extraction;
(3) speech feature extraction: extracting speech emotion features from the speech signals preprocessed in step (2), the features including but not limited to MFCC, short-time average zero-crossing rate, fundamental frequency, mean, standard deviation, and maximum and minimum values;
(4) feature processing: first, take the source domain features $X_S=\{x_i^S\}_{i=1}^{n_S}$, the corresponding source domain labels $Y_S=\{y_i^S\}_{i=1}^{n_S}$, and the target domain features $X_T=\{x_j^T\}_{j=1}^{n_T}$ obtained in step (3); after adding noise drawn from a normal distribution to $X_S$ and $X_T$, input them into the deep denoising autoencoder for feature reconstruction:

$$\hat{X}_S=\mathrm{Dec}\big(\mathrm{Enc}(X_S+\epsilon)\big),\qquad \hat{X}_T=\mathrm{Dec}\big(\mathrm{Enc}(X_T+\epsilon)\big),$$

where $\hat{X}_S$ and $\hat{X}_T$ are the sample features reconstructed by the decoder of the autoencoder, $\mathrm{Enc}(\cdot)$ and $\mathrm{Dec}(\cdot)$ denote its encoder and decoder, and $\epsilon$ is the added noise; then the encoded output of the autoencoder is fed into a deep neural network for further processing, yielding the low-dimensional emotion features of the source domain and the target domain, $X'_S$ and $X'_T$, respectively; finally, the true source domain labels $Y_S$ and the source domain class probabilities $\hat{Y}_S$ predicted by the softmax classifier are used to compute the cross-entropy loss:

$$\mathcal{L}_y=-\frac{1}{n_S}\sum_{i=1}^{n_S}\sum_{c=1}^{C} y_{i,c}^S \log \hat{y}_{i,c}^S;$$
(5) feature transfer: first, the Maximum Mean Discrepancy (MMD) is used to reduce the global feature distribution distance between $X'_S$ and $X'_T$:

$$\mathcal{L}_{MMD}=\bigg\|\frac{1}{n_S}\sum_{i=1}^{n_S}\delta(x_i'^S)-\frac{1}{n_T}\sum_{j=1}^{n_T}\delta(x_j'^T)\bigg\|_{\mathcal{H}}^{2},$$

where $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) and $\delta(\cdot)$ is the feature mapping function (a Gaussian kernel); then the Local Maximum Mean Discrepancy (LMMD) is used at the same time to reduce the sub-domain (class-wise) feature distribution distance between $X'_S$ and $X'_T$:

$$\mathcal{L}_{LMMD}=\frac{1}{C}\sum_{c=1}^{C}\bigg\|\sum_{x_i'^S\in X'_S} w_i^{Sc}\,\delta(x_i'^S)-\sum_{x_j'^T\in X'_T} w_j^{Tc}\,\delta(x_j'^T)\bigg\|_{\mathcal{H}}^{2},$$

where $w_i^{Sc}$ is the weight with which source domain sample $x_i'^S$ belongs to emotion class $c$, and $w_j^{Tc}$ is the weight with which target domain sample $x_j'^T$ belongs to emotion class $c$;
(6) model training: given the five loss functions obtained in steps (4) and (5), dynamic weight factors $w_i$ are used to adjust the contribution of each loss to model optimization, so that the overall optimization objective of the model is the weighted sum

$$\mathcal{L}=\sum_{i\in\{S,T,y,MMD,LMMD\}} w_i\,\mathcal{L}_i,$$

where $\mathcal{L}_S$ and $\mathcal{L}_T$ are the source and target domain reconstruction losses, and each dynamic weight factor $w_i$ is recomputed from the current values of the loss terms and the hyperparameters $\alpha_i>0$, $i\in\{S,T,y,MMD,LMMD\}$ (the exact expression is given only as a formula image in the original filing);
(7) repeating steps (4) and (5), iteratively training the network model by gradient descent and continuously updating the dynamic weight factors in step (6) until the model is optimal;
(8) using the network model trained in step (7) with the softmax classifier to predict the labels of the target domain features from step (3), finally achieving speech emotion recognition under the cross-corpus condition.
CN202111117676.5A 2021-09-23 2021-09-23 Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment Pending CN113851148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111117676.5A CN113851148A (en) 2021-09-23 2021-09-23 Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111117676.5A CN113851148A (en) 2021-09-23 2021-09-23 Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment

Publications (1)

Publication Number Publication Date
CN113851148A true CN113851148A (en) 2021-12-28

Family

ID=78979543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111117676.5A Pending CN113851148A (en) 2021-09-23 2021-09-23 Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment

Country Status (1)

Country Link
CN (1) CN113851148A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757310A (en) * 2022-06-16 2022-07-15 山东海量信息技术研究院 Emotion recognition model, and training method, device, equipment and readable storage medium thereof
CN114757310B (en) * 2022-06-16 2022-11-11 山东海量信息技术研究院 Emotion recognition model and training method, device, equipment and readable storage medium thereof

Similar Documents

Publication Publication Date Title
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN108319666B (en) Power supply service assessment method based on multi-modal public opinion analysis
WO2015180368A1 (en) Variable factor decomposition method for semi-supervised speech features
CN111161744B (en) Speaker clustering method for simultaneously optimizing deep characterization learning and speaker identification estimation
CN110751044A (en) Urban noise identification method based on deep network migration characteristics and augmented self-coding
CN105206270A (en) Isolated digit speech recognition classification system and method combining principal component analysis (PCA) with restricted Boltzmann machine (RBM)
CN111461025B (en) Signal identification method for self-evolving zero-sample learning
CN113077823B (en) Depth self-encoder subdomain self-adaptive cross-library voice emotion recognition method
CN112053694A (en) Voiceprint recognition method based on CNN and GRU network fusion
CN104077598B (en) A kind of emotion identification method based on voice fuzzy cluster
CN113554110B (en) Brain electricity emotion recognition method based on binary capsule network
Wei et al. A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model
CN112766355A (en) Electroencephalogram signal emotion recognition method under label noise
CN113571067A (en) Voiceprint recognition countermeasure sample generation method based on boundary attack
CN113763965A (en) Speaker identification method with multiple attention characteristics fused
CN115393933A (en) Video face emotion recognition method based on frame attention mechanism
Janbakhshi et al. Automatic dysarthric speech detection exploiting pairwise distance-based convolutional neural networks
CN113851148A (en) Cross-library speech emotion recognition method based on transfer learning and multi-loss dynamic adjustment
CN112863521B (en) Speaker identification method based on mutual information estimation
CN110246509A (en) A kind of stack denoising self-encoding encoder and deep neural network structure for voice lie detection
CN106297768B (en) Speech recognition method
CN113628640A (en) Cross-library speech emotion recognition method based on sample equalization and maximum mean difference
CN116842460A (en) Cough-related disease identification method and system based on attention mechanism and residual neural network
CN115101077A (en) Voiceprint detection model training method and voiceprint recognition method
CN113851149A (en) Cross-library speech emotion recognition method based on anti-migration and Frobenius norm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination