CN114898777A - Cross-library speech emotion recognition method and device based on deep direct-push migration network - Google Patents

Cross-library speech emotion recognition method and device based on deep direct-push migration network

Info

Publication number
CN114898777A
CN114898777A (Application No. CN202210513096.6A)
Authority
CN
China
Prior art keywords
deep
neural network
database
sample
target database
Prior art date
Legal status
Pending
Application number
CN202210513096.6A
Other languages
Chinese (zh)
Inventor
郑文明
赵焱
宗源
赵力
路成
连海伦
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210513096.6A priority Critical patent/CN114898777A/en
Publication of CN114898777A publication Critical patent/CN114898777A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-database speech emotion recognition method and device based on a deep direct-push (transductive) migration network. The method comprises the following steps: (1) obtaining a source speech emotion database and a target speech emotion database; (2) processing the emotion audio of the source database and the target database into spectrograms; (3) establishing a deep regression neural network; (4) inputting the spectrograms of the source database and the target database into the deep regression neural network for training, calculating the maximum mean discrepancy between the source data and the target data at different scales from the features obtained from the network, and fine-tuning the neural network accordingly; (5) treating the speech to be recognized as speech data in the target database and inputting it into the trained deep regression neural network to obtain the speech emotion category. The invention achieves higher recognition accuracy.

Description

Cross-library speech emotion recognition method and device based on deep direct-push migration network
Technical Field
The invention relates to speech emotion recognition technology, and in particular to a cross-database speech emotion recognition method and device based on a deep direct-push (transductive) migration network.
Background
Speech is one of the most natural ways for humans to express themselves, and it reveals the natural emotional state of humans more than other common modes of communication in daily life. Because emotion helps people understand each other better, and in order to help people understand and recognize speech emotion more efficiently, automatic speech emotion recognition using computer programs and artificial intelligence algorithms has become a popular research direction in recent years in fields such as pattern recognition, computer vision and affective computing.
In recent years, researchers have proposed many effective methods based on machine learning and deep learning to recognize speech emotion automatically. Conventional machine learning methods typically first extract hand-crafted features such as IS09 and IS10, and then build various classifiers, such as SVM, K-NN and Bayesian classifiers, specifically for the speech emotion recognition task. Meanwhile, deep learning methods have also been applied to this task, such as LSTM, pre-trained CNNs (e.g., ResNet, VGGNet and DenseNet) and CapsNet. These networks can generally improve the representation ability for speech emotion and learn emotion features and classifiers in an end-to-end manner to classify speech emotion.
The above methods operate in the ideal case where the test samples and the training samples come from the same database. In many practical applications, however, the test samples and the training samples usually come from different databases, which easily introduces large domain differences and leads to unsatisfactory recognition performance for most speech emotion recognition methods in the cross-database setting. Recently, many researchers have attempted to solve the cross-database speech emotion recognition problem. For example, Zong et al. proposed a domain-adaptive least squares regression (DaLSR) approach to handle the cross-database speech emotion recognition task. Hassan et al. proposed an importance-weighted support vector machine (IW-SVM) to eliminate the feature distribution mismatch between different samples and improve classification accuracy across different databases. Long et al. proposed transfer kernel learning (TKL) to learn domain-invariant kernels that eliminate feature distribution differences between samples from different databases. Gong et al. proposed the geodesic flow kernel (GFK), which connects two different databases with a well-designed kernel on a Grassmann manifold and reduces the feature distribution difference between them. Deng et al. proposed a universal autoencoder (UAE) to learn a database-independent feature space, with the goal of mapping test and training samples into a domain-independent feature space using the powerful mapping capability of the UAE. Fernando et al. proposed a subspace alignment (SA) method that finds a mapping function aligning the subspace of the source samples with that of the target samples. Pan et al. proposed transfer component analysis (TCA), based on a reproducing kernel Hilbert space, which eliminates the distribution difference between samples of different domains by finding cross-domain transfer components. Gideon et al. proposed adversarial discriminative domain generalization (ADDoG), which learns more generalized speech emotion features from samples of more databases. Most existing research on cross-library speech emotion recognition is based on traditional machine learning, using domain-invariant spatio-temporal descriptors and machine learning classifiers to handle the cross-library task; the recognition rate is not ideal, and there is still a gap to practical application. In addition, as speech emotion datasets grow, deep-learning-based methods will be one of the main research directions of cross-library speech emotion recognition in the future, but related research is currently scarce and overall progress is slow.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a cross-database speech emotion recognition method and device based on a deep direct-push migration network that achieve higher recognition accuracy.
The technical scheme is as follows: the cross-library speech emotion recognition method based on the deep direct push type migration network comprises the following steps:
(1) acquiring two different speech emotion databases, namely a source database and a target database, wherein the source database stores emotion speech audio and the corresponding emotion category labels, and the target database stores only the emotion speech audio;
(2) processing the emotion speech audio of the source database and the target database into spectrograms;
(3) establishing a deep regression neural network;
(4) simultaneously inputting the spectrograms of the source database with their corresponding labels, and the spectrograms of the target database with randomly initialized pseudo labels, into the deep regression neural network as samples for pre-training; then inputting the obtained deep features as samples into the deep regression neural network for fine-tuning, completing training;
(5) preprocessing the emotion speech audio of the speech to be recognized into a spectrogram and inputting it, as a target database sample, into the trained deep regression neural network to obtain the emotion category of the speech.
Further, step (2) specifically comprises: processing the emotion speech audio in the source database and the target database into spectrograms using the librosa toolkit of Python.
Further, the deep regression neural network established in step (3) comprises: a first convolutional layer, a first max-pooling layer, a second convolutional layer, a second max-pooling layer, a third convolutional layer, a fourth convolutional layer, a third max-pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fourth max-pooling layer, a seventh convolutional layer, an eighth convolutional layer, a fifth max-pooling layer, an adaptive average pooling layer and a fully connected layer, connected in sequence from front to back.
Further, each neuron in all convolutional layers and the fully connected layer uses the rectified linear unit (ReLU) as its activation function.
Further, Dropout with a rate of 0.5 is applied to the neuron output of the fully connected layer to prevent overfitting.
Further, the step (4) comprises the following steps:
(4-1) simultaneously inputting the spectrograms of the source database with their corresponding labels, and the spectrograms of the target database with randomly initialized pseudo labels, into the deep regression neural network as samples, wherein the loss function L adopted during training is:

L = -(1/N)·Σ_{n=1..N} Σ_{j=1..J} ŷ_{nj}^s · log y_{nj}^s - (1/M)·Σ_{m=1..M} Σ_{j=1..J} ỹ_{mj}^t · log y_{mj}^t

wherein N is the number of source database samples and n indexes the source database samples, M is the number of target database samples and m indexes the target database samples, J is the number of speech emotion categories and j indexes the speech emotion categories; y_{nj}^s is the probability that the emotion feature actually output for the n-th source database sample is assigned to category j; ŷ_{nj}^s is the expected (label) probability that the emotion feature of the n-th source database sample is assigned to category j; y_{mj}^t is the probability that the emotion feature actually output for the m-th target database sample is assigned to category j; and ỹ_{mj}^t is the probability that the emotion feature actually output for the m-th target database sample was assigned to category j in the previous round of training (the pseudo label);
(4-2) inputting the deep features of the source database and target database samples obtained from the deep regression neural network as samples into the trained deep regression neural network for fine-tuning, wherein the loss function L_total adopted during training is:

L_total = α·L_mmd + β·L

L_mmd = MMD(X_s, Y_t) + Σ_{j=1..J} MMD(X_s^j, Y_t^j) + MMD(X_s^pn, Y_t^pn)

wherein MMD(X_s, Y_t) is the maximum mean discrepancy (MMD) between the emotion features output by the source database samples and the target database samples in the deep regression neural network, X_s denotes the emotion features output by the source database samples and Y_t the emotion features output by the target database samples; MMD(X_s^j, Y_t^j) is the MMD between the emotion features of category j output by the source and target database samples, X_s^j denotes the distribution of the category-j emotion features output by the source database samples and Y_t^j the category-j emotion features output by the target database samples; MMD(X_s^pn, Y_t^pn) is the MMD between the emotion features of positive and negative emotions output by the source and target database samples, X_s^pn denotes the positive/negative-emotion features output by the source database samples and Y_t^pn those output by the target database samples; MMD is the mean distance between two sets of data in a reproducing kernel Hilbert space; and α and β are constraint strength coefficients obtained through training.
The cross-library speech emotion recognition device based on the deep direct-push migration network comprises a processor and a computer program stored on a memory and executable on the processor, wherein the processor implements the above method when executing the program.
Beneficial effects: compared with the prior art, the invention has the notable advantage of higher recognition accuracy.
Drawings
FIG. 1 is a schematic flow chart of a cross-database speech emotion recognition method based on a deep direct-push migration network according to the present invention;
FIG. 2 is a schematic diagram of a cross-database speech emotion recognition architecture of a deep direct-push migration network DTTRN;
FIG. 3 shows the detailed structure of the backbone network of the deep direct-push migration network DTTRN;
FIG. 4 compares experimental results of the deep direct-push migration network DTTRN in the cross-library speech emotion recognition experiments.
Detailed Description
The embodiment provides a cross-database speech emotion recognition method based on a deep direct-push migration network, as shown in fig. 1, including:
(1) Two different speech emotion databases are obtained, namely a source database and a target database; the source database stores emotion speech audio and the corresponding emotion category labels, while the target database stores only the emotion speech audio.
In the invention, the training samples (the source database, or source domain) and the test samples (the target database, or target domain) belong to different speech emotion datasets and exhibit an obvious feature distribution difference.
(2) The emotion speech audio of the source database and the target database is processed into spectrograms.
In this embodiment, the spectrogram processing is implemented using the librosa toolkit of Python.
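As an illustrative sketch only (not part of the original disclosure), the librosa-based preprocessing could look roughly as follows; the sampling rate, FFT size, hop length and mel-band count are assumed values that the patent does not specify:

```python
import librosa
import numpy as np

def audio_to_spectrogram(path, sr=16000, n_fft=1024, hop_length=256, n_mels=128):
    """Load an emotion speech audio file and convert it to a log-scaled
    (mel-)spectrogram, matching the librosa-based preprocessing step."""
    y, sr = librosa.load(path, sr=sr)                        # waveform, resampled
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)           # log-compress for stability
    return log_mel.astype(np.float32)                        # shape: (n_mels, frames)
```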
(3) A deep regression neural network (DTTRN) is established; the network comprises a CNN backbone network and a fully connected layer FC1 connected to the backbone.
As shown in FIG. 2, the structure of the deep regression neural network includes: a first convolutional layer conv1, a first max-pooling layer Maxpool1, a second convolutional layer conv2, a second max-pooling layer Maxpool2, a third convolutional layer conv3, a fourth convolutional layer conv4, a third max-pooling layer Maxpool3, a fifth convolutional layer conv5, a sixth convolutional layer conv6, a fourth max-pooling layer Maxpool4, a seventh convolutional layer conv7, an eighth convolutional layer conv8, a fifth max-pooling layer Maxpool5, an adaptive average pooling layer AdaAvgpool1 and a fully connected layer FC1, connected in sequence from front to back. Each neuron in the convolutional layers and the fully connected layer uses the rectified linear unit (ReLU) as its activation function, and Dropout with a rate of 0.5 is applied to the fully connected layer output to prevent overfitting.
As shown in FIG. 3, the local receptive field sizes of the 8 convolutional layers in the network are all set to 3×3, the stride is set to 1, and an edge zero-padding strategy is used to keep the feature map size unchanged after convolution. The 1st convolutional layer has 64 convolution kernels; the 2nd convolutional layer has 128 convolution kernels; the 3rd and 4th convolutional layers have 256 convolution kernels; and the last 4 convolutional layers have 512 convolution kernels. For the two kinds of pooling layers, the max-pooling window size is set to 2×2 and the adaptive average pooling layer to 7×7, with a stride of 2; each pooling reduces the spatial size of the feature maps to half of the original.
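For illustration, a minimal PyTorch sketch of a backbone matching this description (eight 3×3 convolutional layers with stride 1 and zero padding, five 2×2 max-pooling layers with stride 2, adaptive average pooling to 7×7, and one fully connected layer) is given below; the single-channel input, the placement of Dropout and the number of emotion classes are assumptions, and DTTRNBackbone is a hypothetical name:

```python
import torch.nn as nn

class DTTRNBackbone(nn.Module):
    """Sketch of the described CNN backbone: conv1-Maxpool1-conv2-Maxpool2-
    conv3-conv4-Maxpool3-conv5-conv6-Maxpool4-conv7-conv8-Maxpool5-
    AdaAvgpool1-FC1 (single-channel spectrogram input assumed)."""
    def __init__(self, num_classes=5):
        super().__init__()
        def block(cin, cout, n_convs):
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv2d(cin if i == 0 else cout, cout,
                                     kernel_size=3, stride=1, padding=1),
                           nn.ReLU(inplace=True)]
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            return layers
        self.features = nn.Sequential(
            *block(1, 64, 1),      # conv1 + Maxpool1
            *block(64, 128, 1),    # conv2 + Maxpool2
            *block(128, 256, 2),   # conv3, conv4 + Maxpool3
            *block(256, 512, 2),   # conv5, conv6 + Maxpool4
            *block(512, 512, 2),   # conv7, conv8 + Maxpool5
        )
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7))   # adaptive average pooling to 7x7
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),                        # Dropout of 0.5 around FC1 (placement assumed)
            nn.Linear(512 * 7 * 7, num_classes),      # FC1
        )

    def forward(self, x):
        feat = self.avgpool(self.features(x))
        return self.classifier(feat)
```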
The network activation function is set as follows: the rectified linear unit (ReLU) is used as the activation function of each neuron in the DTTRN network, defined as f(x) = max(0, x).
(4) The spectrograms of the source database with their corresponding labels, and the spectrograms of the target database with randomly initialized pseudo labels, are simultaneously input into the deep regression neural network as samples for pre-training; the obtained deep features are then input as samples into the deep regression neural network for fine-tuning, completing training.
The method specifically comprises the following steps:
(4-1) The spectrograms of the source database with their corresponding labels, and the spectrograms of the target database with randomly initialized pseudo labels, are simultaneously input into the deep regression neural network as samples. Training uses a loss function L built on the cross-entropy loss function:

L = -(1/N)·Σ_{n=1..N} Σ_{j=1..J} ŷ_{nj}^s · log y_{nj}^s - (1/M)·Σ_{m=1..M} Σ_{j=1..J} ỹ_{mj}^t · log y_{mj}^t

where N is the number of source database samples and n indexes the source database samples, M is the number of target database samples and m indexes the target database samples, J is the number of speech emotion categories and j indexes the speech emotion categories; y_{nj}^s is the probability that the emotion feature actually output for the n-th source database sample is assigned to category j; ŷ_{nj}^s is the expected (label) probability that the emotion feature of the n-th source database sample is assigned to category j; y_{mj}^t is the probability that the emotion feature actually output for the m-th target database sample is assigned to category j; and ỹ_{mj}^t is the probability that the emotion feature actually output for the m-th target database sample was assigned to category j in the previous round of training (the pseudo label).
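A minimal sketch of this combined source/target cross-entropy objective, under the reconstruction above; the equal weighting of the two terms and the pseudo-label refresh rule in the final comment are assumptions rather than details stated verbatim in the original text:

```python
import torch
import torch.nn.functional as F

def pretrain_loss(src_logits, src_labels, tgt_logits, tgt_pseudo_labels):
    """Cross-entropy on labeled source samples plus cross-entropy on target
    samples against their (randomly initialized, then refreshed) pseudo labels."""
    loss_src = F.cross_entropy(src_logits, src_labels)          # supervised source term
    loss_tgt = F.cross_entropy(tgt_logits, tgt_pseudo_labels)   # pseudo-label target term
    return loss_src + loss_tgt

# After each training round the pseudo labels could be refreshed from the
# previous round's predictions (an assumption consistent with the description):
# tgt_pseudo_labels = tgt_logits.detach().argmax(dim=1)
```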
(4-2) The deep features of the source database and target database samples obtained from the deep regression neural network are then input, as samples, into the trained deep regression neural network for fine-tuning. The loss function L_total adopted during training is:

L_total = α·L_mmd + β·L

L_mmd = MMD(X_s, Y_t) + Σ_{j=1..J} MMD(X_s^j, Y_t^j) + MMD(X_s^pn, Y_t^pn)

where MMD(X_s, Y_t) is the maximum mean discrepancy (MMD) between the emotion features output by the source database samples and the target database samples in the deep regression neural network, X_s denotes the emotion features output by the source database samples and Y_t the emotion features output by the target database samples; MMD(X_s^j, Y_t^j) is the MMD between the emotion features of category j output by the source and target database samples, X_s^j denotes the distribution of the category-j emotion features output by the source database samples and Y_t^j the category-j emotion features output by the target database samples; MMD(X_s^pn, Y_t^pn) is the MMD between the emotion features of positive and negative emotions output by the source and target database samples, X_s^pn denotes the positive/negative-emotion features output by the source database samples and Y_t^pn those output by the target database samples; MMD is the mean distance between two sets of data in a reproducing kernel Hilbert space; and α and β are constraint strength coefficients obtained through training.
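For illustration, a sketch of the multi-scale MMD term using a Gaussian-kernel estimate of the mean distance in a reproducing kernel Hilbert space; the kernel and its bandwidth, and the use of 1-D tensors of class indices to define the positive/negative emotion groups, are assumptions:

```python
import torch

def gaussian_mmd(x, y, sigma=1.0):
    """Biased MMD^2 estimate between two feature batches in the RKHS
    induced by a Gaussian kernel (bandwidth sigma is an assumed choice)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def mmd_loss(src_feat, tgt_feat, src_labels, tgt_pseudo,
             positive_classes, negative_classes):
    """Global MMD + category-wise MMD + positive/negative-emotion MMD,
    following the multi-scale constraint described in step (4-2)."""
    loss = gaussian_mmd(src_feat, tgt_feat)                       # global scale
    for j in torch.unique(src_labels):                            # per-category scale
        s_j, t_j = src_feat[src_labels == j], tgt_feat[tgt_pseudo == j]
        if len(s_j) and len(t_j):
            loss = loss + gaussian_mmd(s_j, t_j)
    for group in (positive_classes, negative_classes):            # valence-level scale
        s_g = src_feat[torch.isin(src_labels, group)]
        t_g = tgt_feat[torch.isin(tgt_pseudo, group)]
        if len(s_g) and len(t_g):
            loss = loss + gaussian_mmd(s_g, t_g)
    return loss
```

The total fine-tuning objective would then combine this term with the pre-training loss, e.g. loss_total = alpha * mmd_loss(...) + beta * pretrain_loss(...), with alpha and beta as the constraint strength coefficients described above.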
(5) The emotion speech audio of the speech to be recognized is preprocessed into a spectrogram and input, as a target database sample, into the trained deep regression neural network to obtain the emotion category of the speech.
The training optimizer adopts the stochastic gradient descent (SGD) algorithm with Nesterov momentum as a correction factor; the loss function is computed and the weights are continuously updated. By borrowing the concept of momentum from physics, Nesterov momentum suppresses oscillation along the gradient direction and accelerates convergence: if the historical gradient is consistent with the current gradient direction, the momentum term is increased, otherwise it is reduced. The Nesterov term additionally applies a correction when updating the gradient, which keeps the update from advancing too fast and makes it more flexible. The iteration is

v_t = γ·v_{t-1} + η·∇_θ J(θ - γ·v_{t-1}),   θ = θ - v_t,

where η is the learning rate, set to 10^-3 in the experiments; the weight decay is set to 10^-5, and the correction factor (momentum coefficient γ) is 0.9.
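As a sketch, this optimizer configuration maps directly onto PyTorch's SGD with Nesterov momentum; the parameter names are PyTorch's, the values are those stated above, and DTTRNBackbone refers to the hypothetical backbone sketch given earlier:

```python
import torch

model = DTTRNBackbone(num_classes=5)   # hypothetical backbone sketch from above

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-3,            # learning rate eta = 10^-3
    momentum=0.9,       # correction factor (momentum) 0.9
    weight_decay=1e-5,  # weight decay 10^-5
    nesterov=True,      # Nesterov momentum correction
)
```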
In order to accelerate training and improve the reliability of the recognition results, the weights of all layers of the DTTRN model pre-trained on the source database are frozen, and only the weights of the last fully connected layer of the DTTRN are then trained and fine-tuned on the combined target-domain and source-domain samples, which suffices to realize the intended cross-library speech emotion recognition task. The maximum number of training epochs is set to 200.
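A sketch of the described freeze-and-fine-tune strategy; the layer names refer to the hypothetical DTTRNBackbone sketch above, so the indexing of the final fully connected layer is an assumption:

```python
# Freeze all pre-trained weights, then unfreeze only the last fully connected layer.
for param in model.parameters():
    param.requires_grad = False
for param in model.classifier[-1].parameters():   # FC1 in the sketch above
    param.requires_grad = True

finetune_optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=1e-3, momentum=0.9, weight_decay=1e-5, nesterov=True,
)
```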
This embodiment also provides a cross-library speech emotion recognition device based on the deep direct-push migration network, comprising a processor and a computer program stored on a memory and executable on the processor; the processor implements the above method when executing the program.
To verify the invention, simulation experiments were carried out. Cross-library speech emotion recognition experiments require sample data that share the same label set but belong to different databases; EmoDB, eNTERFACE and CASIA were selected, denoted B, E and C respectively. For the mutual transfer between EmoDB and CASIA, the five emotions anger, sadness, fear, happiness and calm were used; for the mutual transfer between EmoDB and eNTERFACE, anger, sadness, fear, happiness and disgust; and for the mutual transfer between CASIA and eNTERFACE, anger, sadness, fear, happiness and surprise. To verify the effectiveness and necessity of the direct-push (transductive) migration deep regression network DTTRN, cross-library speech emotion experiments were conducted on EmoDB, eNTERFACE and CASIA. UAR (Unweighted Average Recall) was selected as the evaluation index, and the results are shown in FIG. 4; it can be observed that the direct-push migration deep regression network DTTRN of the invention achieves excellent recognition performance on the cross-library speech emotion recognition task. DTTRN borrows the idea of transductive transfer and uses the unlabeled target-domain data to better pre-train the network; in addition, the designed loss function constrains the feature distribution difference of the two databases on the network during DTTRN fine-tuning, which plays a key role in improving cross-database recognition performance.
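UAR (unweighted average recall) is the recall averaged over emotion categories with equal class weights; a scikit-learn-based helper is sketched below as an illustration, not as part of the original experiments:

```python
from sklearn.metrics import recall_score

def unweighted_average_recall(y_true, y_pred):
    """UAR = recall averaged over emotion classes with equal class weights."""
    return recall_score(y_true, y_pred, average="macro")
```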
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (7)

1. A cross-database speech emotion recognition method based on a deep direct-push migration network, characterized by comprising the following steps:
(1) acquiring two different speech emotion databases, namely a source database and a target database, wherein the source database stores emotion speech audio and the corresponding emotion category labels, and the target database stores only the emotion speech audio;
(2) processing the emotion speech audio of the source database and the target database into spectrograms;
(3) establishing a deep regression neural network;
(4) simultaneously inputting the spectrograms of the source database with their corresponding labels, and the spectrograms of the target database with randomly initialized pseudo labels, into the deep regression neural network as samples for pre-training; then inputting the obtained deep features as samples into the deep regression neural network for fine-tuning, completing training;
(5) preprocessing the emotion speech audio of the speech to be recognized into a spectrogram and inputting it, as a target database sample, into the trained deep regression neural network to obtain the emotion category of the speech.
2. The cross-library speech emotion recognition method based on the deep direct-push migration network, which is characterized in that: step (2) specifically comprises: processing the emotion speech audio in the source database and the target database into spectrograms using the librosa toolkit of Python.
3. The cross-library speech emotion recognition method based on the deep direct-push migration network, which is characterized in that: the deep regression neural network established in step (3) comprises: a first convolutional layer, a first max-pooling layer, a second convolutional layer, a second max-pooling layer, a third convolutional layer, a fourth convolutional layer, a third max-pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fourth max-pooling layer, a seventh convolutional layer, an eighth convolutional layer, a fifth max-pooling layer, an adaptive average pooling layer and a fully connected layer, connected in sequence from front to back.
4. The cross-library speech emotion recognition method based on the deep direct-push migration network, which is characterized in that: each neuron in all convolutional layers and the fully connected layer uses the rectified linear unit (ReLU) as its activation function.
5. The cross-library speech emotion recognition method based on the deep direct-push migration network, which is characterized in that: Dropout with a rate of 0.5 is applied to the neuron output of the fully connected layer to prevent overfitting.
6. The cross-library speech emotion recognition method based on the deep direct-push migration network, which is characterized in that: the step (4) comprises the following steps:
(4-1) simultaneously inputting the spectrograms of the source database with their corresponding labels, and the spectrograms of the target database with randomly initialized pseudo labels, into the deep regression neural network as samples, wherein the loss function L adopted during training is:

L = -(1/N)·Σ_{n=1..N} Σ_{j=1..J} ŷ_{nj}^s · log y_{nj}^s - (1/M)·Σ_{m=1..M} Σ_{j=1..J} ỹ_{mj}^t · log y_{mj}^t

wherein N is the number of source database samples and n indexes the source database samples, M is the number of target database samples and m indexes the target database samples, J is the number of speech emotion categories and j indexes the speech emotion categories; y_{nj}^s is the probability that the emotion feature actually output for the n-th source database sample is assigned to category j; ŷ_{nj}^s is the expected (label) probability that the emotion feature of the n-th source database sample is assigned to category j; y_{mj}^t is the probability that the emotion feature actually output for the m-th target database sample is assigned to category j; and ỹ_{mj}^t is the probability that the emotion feature actually output for the m-th target database sample was assigned to category j in the previous round of training (the pseudo label);
(4-2) inputting the deep features of the source database and target database samples obtained from the deep regression neural network as samples into the trained deep regression neural network for fine-tuning, wherein the loss function L_total adopted during training is:

L_total = α·L_mmd + β·L

L_mmd = MMD(X_s, Y_t) + Σ_{j=1..J} MMD(X_s^j, Y_t^j) + MMD(X_s^pn, Y_t^pn)

wherein MMD(X_s, Y_t) is the maximum mean discrepancy (MMD) between the emotion features output by the source database samples and the target database samples in the deep regression neural network, X_s denotes the emotion features output by the source database samples and Y_t the emotion features output by the target database samples; MMD(X_s^j, Y_t^j) is the MMD between the emotion features of category j output by the source and target database samples, X_s^j denotes the distribution of the category-j emotion features output by the source database samples and Y_t^j the category-j emotion features output by the target database samples; MMD(X_s^pn, Y_t^pn) is the MMD between the emotion features of positive and negative emotions output by the source and target database samples, X_s^pn denotes the positive/negative-emotion features output by the source database samples and Y_t^pn those output by the target database samples; MMD is the mean distance between two sets of data in a reproducing kernel Hilbert space; and α and β are constraint strength coefficients obtained through training.
7. A cross-library speech emotion recognition device based on a deep direct-push migration network, comprising a processor and a computer program stored on a memory and executable on the processor, characterized in that: the processor implements the method of any one of claims 1-6 when executing the program.
CN202210513096.6A 2022-05-12 2022-05-12 Cross-library speech emotion recognition method and device based on deep direct-push migration network Pending CN114898777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210513096.6A CN114898777A (en) 2022-05-12 2022-05-12 Cross-library speech emotion recognition method and device based on deep direct-push migration network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210513096.6A CN114898777A (en) 2022-05-12 2022-05-12 Cross-library speech emotion recognition method and device based on deep direct-push migration network

Publications (1)

Publication Number Publication Date
CN114898777A true CN114898777A (en) 2022-08-12

Family

ID=82722316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210513096.6A Pending CN114898777A (en) 2022-05-12 2022-05-12 Cross-library speech emotion recognition method and device based on deep direct-push migration network

Country Status (1)

Country Link
CN (1) CN114898777A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116072154A (en) * 2023-03-07 2023-05-05 华南师范大学 Speech emotion recognition method, device and equipment based on data enhancement


Similar Documents

Publication Publication Date Title
Hinton et al. Improving neural networks by preventing co-adaptation of feature detectors
CN109271522B (en) Comment emotion classification method and system based on deep hybrid model transfer learning
CN110083838B (en) Biomedical semantic relation extraction method based on multilayer neural network and external knowledge base
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
Elleuch et al. Arabic handwritten characters recognition using deep belief neural networks
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
CN112257449B (en) Named entity recognition method and device, computer equipment and storage medium
Golovko et al. The nature of unsupervised learning in deep neural networks: A new understanding and novel approach
CN112232087B (en) Specific aspect emotion analysis method of multi-granularity attention model based on Transformer
CN112784532A (en) Multi-head attention memory network for short text sentiment classification
Cai et al. Meta Multi-Task Learning for Speech Emotion Recognition.
Demircan et al. Comparison of the effects of mel coefficients and spectrogram images via deep learning in emotion classification
Schmitt et al. End-to-end audio classification with small datasets–making it work
Soliman et al. Isolated word speech recognition using convolutional neural network
CN114898777A (en) Cross-library speech emotion recognition method and device based on deep direct-push migration network
Du et al. CGaP: Continuous growth and pruning for efficient deep learning
Aishwarya et al. Kannada speech recognition system for Aphasic people
Starzyk et al. Concurrent associative memories with synaptic delays
Rahim A neural tree network for phoneme classification with experiments on the TIMIT database
CN114743569A (en) Speech emotion recognition method based on double-layer fusion deep network
Busser et al. Machine learning of word pronunciation: the case against abstraction.
Kroshchanka et al. The reduction of fully connected neural network parameters using the pre-training technique
CN112465054A (en) Multivariate time series data classification method based on FCN
Hallac et al. Experiments on fine tuning deep learning models with news data for tweet classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination