CN114898777A - Cross-library speech emotion recognition method and device based on deep direct-push migration network - Google Patents
- Publication number
- CN114898777A (application CN202210513096.6A)
- Authority
- CN
- China
- Legal status: Pending (an assumption by Google, not a legal conclusion)
Classifications
- G—PHYSICS › G10—MUSICAL INSTRUMENTS; ACOUSTICS › G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/63—Speech or voice analysis techniques specially adapted for estimating an emotional state
- G10L25/18—Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
Abstract
The invention discloses a cross-database speech emotion recognition method and device based on a deep direct-push (transductive) migration network. The method comprises the following steps: (1) obtaining a source speech emotion database and a target speech emotion database; (2) processing the emotion audio of the source database and the target database into spectrograms; (3) establishing a deep regression neural network; (4) inputting the spectrograms of the source database and the target database into the deep regression neural network for training, calculating the maximum mean discrepancy between the source data and the target data at different scales from the features produced by the network, and fine-tuning the neural network accordingly; (5) treating the speech to be recognized as speech data in the target database and inputting it into the trained deep regression neural network to obtain its emotion type. The invention achieves higher recognition accuracy.
Description
Technical Field
The invention relates to speech emotion recognition technology, and in particular to a cross-database speech emotion recognition method and device based on a deep direct-push (i.e., transductive) migration network.
Background
Speech is one of the most natural ways for humans to express themselves, and in daily life it reveals a person's natural emotional state more than other common modes of communication. Because emotion helps people understand each other better, automatic speech emotion recognition using computer programs and artificial intelligence algorithms has, in recent years, become a popular research direction in pattern recognition, computer vision, affective computing and related fields, with the aim of helping people understand and recognize speech emotion more efficiently.
In recent years, researchers have proposed many effective methods based on machine learning and deep learning to recognize speech emotion automatically. Conventional machine learning methods typically first extract hand-crafted features, such as the IS09 and IS10 feature sets, and then build various classifiers specifically for the speech emotion recognition task, such as SVM, K-NN and Bayesian classifiers. Meanwhile, deep learning methods have also been applied to the task, such as LSTM, pre-trained CNNs (e.g., ResNet, VGGNet and DenseNet) and CapsNet. These networks can generally improve the representation of speech emotion and learn emotion features and classifiers in an end-to-end manner to classify speech emotion.
The above methods are carried out in the ideal case where the test samples and the training samples come from the same database. However, in many practical applications the test and training samples usually come from different databases, which easily introduces large domain differences and leaves most speech emotion recognition methods with unsatisfactory recognition performance across databases. Recently, many researchers have attempted to solve the cross-database speech emotion recognition problem. For example, Zong et al. proposed a domain-adaptive least squares regression (DaLSR) approach to handle the cross-database speech emotion recognition task. Hassan et al. proposed an importance-weighted support vector machine (IW-SVM) to eliminate the feature-distribution mismatch between samples and improve classification accuracy across databases. Long et al. proposed transfer kernel learning (TKL) to learn a domain-invariant kernel that removes the feature-distribution differences between samples from different databases. Gong et al. proposed the Geodesic Flow Kernel (GFK) method, connecting two different databases with a well-designed kernel on the Grassmann manifold to reduce the feature-distribution difference between them. Deng et al. proposed a Universum autoencoder (UAE) to learn a database-independent feature space, with the goal of mapping test and training samples into a domain-independent feature space using the powerful mapping ability of the autoencoder. Fernando et al. proposed a Subspace Alignment (SA) method for finding a mapping function that aligns the subspace of the source samples with that of the target samples. Pan et al. proposed Transfer Component Analysis (TCA), based on a reproducing kernel Hilbert space, to eliminate the distribution difference between samples of different domains by finding cross-domain transfer components.
Gideon et al. proposed an adversarial discriminative domain generalization (ADDoG) method that learns more generalizable speech emotion features from samples of more databases. Most research on cross-library speech emotion recognition is still based on traditional machine learning: a spatio-temporal descriptor with domain invariance combined with a machine-learning classifier handles the cross-library task, but the recognition rate is not ideal and remains some distance from practical application. In addition, as speech emotion datasets grow, deep-learning-based methods will be one of the main research directions of cross-library speech emotion recognition in the future; at present, however, related research is sparse and overall progress is slow.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a cross-database speech emotion recognition method and device based on a deep direct-push migration network, which have higher recognition accuracy.
The technical scheme is as follows: the cross-library speech emotion recognition method based on the deep direct push type migration network comprises the following steps:
(1) acquiring two different voice emotion databases, namely a source database and a target database, wherein emotion voice audios and corresponding emotion type labels are stored in the source database, and only the emotion voice audios are stored in the target database;
(2) processing the emotion voice audios of the source database and the target database into spectrograms;
(3) establishing a deep regression neural network;
(4) simultaneously inputting the spectrogram of the source database with its corresponding label, and the spectrogram of the target database with a randomly initialized pseudo label, into the deep regression neural network as samples for pre-training; then inputting the obtained depth features as samples into the deep regression neural network for fine-tuning, completing the training;
(5) preprocessing the emotion voice frequency of the voice to be recognized into a spectrogram, and inputting the spectrogram as a sample in a target database into a trained deep regression neural network to obtain the emotion type of the voice.
Further, the step (2) specifically comprises: the emotional voice audios in the source database and the target database are processed into spectrogram forms by using a librosa tool package of python.
Further, the deep regression neural network established in step (3) comprises, connected in sequence from front to back: a first convolutional layer, a first max-pooling layer, a second convolutional layer, a second max-pooling layer, a third convolutional layer, a fourth convolutional layer, a third max-pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fourth max-pooling layer, a seventh convolutional layer, an eighth convolutional layer, a fifth max-pooling layer, an adaptive average pooling layer and a fully connected layer.
Further, each neuron in all convolutional layers and fully connected layers adopts the rectified linear unit (ReLU) as its activation function.
Further, Dropout of 0.5 is applied to the neuron outputs of the fully connected layer to prevent overfitting.
Further, the step (4) comprises the following steps:
(4-1) simultaneously inputting the spectrogram of the source database with its corresponding label, and the spectrogram of the target database with a randomly initialized pseudo label, into the deep regression neural network as samples, wherein the loss function L adopted during training is:

L = -(1/N) Σ_{n=1..N} Σ_{j=1..J} y_nj · log(p_nj) - (1/M) Σ_{m=1..M} Σ_{j=1..J} q̂_mj · log(q_mj)

wherein N is the number of source-database samples and n indexes them, M is the number of target-database samples and m indexes them, J is the number of speech emotion classes and j indexes them; p_nj is the probability that the actual output emotional feature of the n-th source-database sample is assigned to class j; y_nj is the expected probability (label) that the output emotional feature of the n-th source-database sample is assigned to class j; q_mj is the probability that the actual output emotional feature of the m-th target-database sample is assigned to class j; and q̂_mj is the probability that the actual output emotional feature of the m-th target-database sample was assigned to class j in the previous round of training (the pseudo label);
(4-2) inputting the depth features of the source database and the target database obtained from the deep regression neural network as samples into the trained network for fine-tuning, wherein the loss function L_total adopted during training is:

L_total = α · L_mmd + β · L

L_mmd = MMD(X_s, Y_t) + Σ_{j=1..J} MMD(X_s^j, Y_t^j) + MMD(X_s^pn, Y_t^pn)

wherein MMD(X_s, Y_t) is the maximum mean discrepancy between the emotional features X_s output by the source-database samples and the emotional features Y_t output by the target-database samples on the deep regression neural network; MMD(X_s^j, Y_t^j) is the maximum mean discrepancy between the class-j emotional features X_s^j output by the source-database samples and the class-j emotional features Y_t^j output by the target-database samples; MMD(X_s^pn, Y_t^pn) is the maximum mean discrepancy between the positive/negative-emotion features X_s^pn output by the source-database samples and the positive/negative-emotion features Y_t^pn output by the target-database samples; MMD is the distance between the means of two sets of data in a reproducing kernel Hilbert space; and α and β are constraint-strength coefficients obtained through training.
The cross-library speech emotion recognition device based on the deep direct push type migration network comprises a processor and a computer program which is stored on a memory and can run on the processor, wherein the processor realizes the method when executing the program.
Beneficial effects: compared with the prior art, the invention has the remarkable advantage of higher recognition accuracy.
Drawings
FIG. 1 is a schematic flow chart of a cross-database speech emotion recognition method based on a deep direct-push migration network according to the present invention;
FIG. 2 is a schematic diagram of a cross-database speech emotion recognition architecture of a deep direct-push migration network DTTRN;
fig. 3 is a detailed structure of a backbone network of the deep direct push migration network DTTRN;
fig. 4 is an experimental result comparison of the deep direct-push migration network DTTRN in the cross-library speech emotion recognition experiment.
Detailed Description
The embodiment provides a cross-database speech emotion recognition method based on a deep direct-push migration network, as shown in fig. 1, including:
(1) the method comprises the steps of obtaining two different voice emotion databases which are respectively a source database and a target database, wherein emotion voice audios and corresponding emotion type labels are stored in the source database, and only the emotion voice audios are stored in the target database.
In the invention, a training sample (a source database or a source domain) and a testing sample (a target domain or a target database) belong to different speech emotion data sets, and the training sample and the testing sample have obvious characteristic distribution difference.
(2) Processing the emotional voice audios of the source database and the target database into spectrograms.
In this embodiment, spectrogram processing is implemented using the librosa toolkit of python.
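The embodiment names librosa as the tool for this step. As an illustration of the underlying computation only — the FFT size, hop length and dB floor below are assumptions, not the patent's parameters — a numpy-only sketch of a log-magnitude spectrogram:

```python
import numpy as np

def log_spectrogram(signal, n_fft=512, hop=128):
    """Frame the signal, apply a Hann window, and take the log-magnitude
    STFT; librosa.stft plus librosa.amplitude_to_db wrap the same idea."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (n_fft//2+1, n_frames)
    return 20.0 * np.log10(np.maximum(spec, 1e-10))  # dB scale

# a 1-second 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)
```

The resulting frequency-by-time matrix is what would be saved as the spectrogram image fed to the network.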
(3) A deep regression neural network (DTTRN) is established; the network comprises a CNN backbone network and a fully connected layer FC1 connected to the backbone.
As shown in fig. 2, the structure of the deep regression neural network includes, connected in sequence from front to back: the first convolutional layer conv1, the first max-pooling layer Maxpool1, the second convolutional layer conv2, the second max-pooling layer Maxpool2, the third convolutional layer conv3, the fourth convolutional layer conv4, the third max-pooling layer Maxpool3, the fifth convolutional layer conv5, the sixth convolutional layer conv6, the fourth max-pooling layer Maxpool4, the seventh convolutional layer conv7, the eighth convolutional layer conv8, the fifth max-pooling layer Maxpool5, the adaptive average pooling layer AdaAvgPool1 and the fully connected layer FC1. Each neuron in the convolutional and fully connected layers adopts the rectified linear unit (ReLU) as its activation function, and Dropout of 0.5 is applied to all fully connected layer outputs to prevent overfitting.
As shown in fig. 3, the local receptive fields of the 8 convolutional layers are all set to 3 × 3, the stride is set to 1, and an edge zero-padding strategy keeps the feature-map size unchanged after convolution. The 1st convolutional layer has 64 convolution kernels; the 2nd has 128; the 3rd and 4th have 256; the last 4 have 512. For the two kinds of pooling layers, the max-pooling window size is set to 2 × 2 with stride 2, and the adaptive average pooling output size is set to 7 × 7. Each max-pooling halves the spatial size of the feature maps.
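With these hyper-parameters, the feature-map size reaching FC1 can be traced in a few lines (the 224 × 224 input size below is an assumed example, not stated in the patent):

```python
def dttrn_feature_shape(h, w):
    """Trace the spatial size through the backbone: conv1..conv8 are 3x3,
    stride 1, zero-padded (size preserved); Maxpool1..Maxpool5 are 2x2,
    stride 2 (size halved); AdaAvgPool1 then emits a fixed 7x7 map of the
    512 channels produced by conv8 (7*7*512 = 25088 values into FC1)."""
    for _ in range(5):                 # five max-pooling stages
        h, w = h // 2, w // 2
    pre_pool = (h, w, 512)             # size entering the adaptive pool
    return pre_pool, (7, 7, 512)       # AdaAvgPool1 output

pre, shape = dttrn_feature_shape(224, 224)
print(pre, shape)
```

For a 224 × 224 spectrogram the map entering the adaptive pool is already 7 × 7, so the pool is an identity there; for any other input size it still normalizes the output to 7 × 7.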
The network activation function is set as follows: the rectified linear unit (ReLU) is used as the activation function for each neuron in the DTTRN network, defined as ReLU(x) = max(0, x).
(4) Simultaneously inputting the spectrogram of the source database with its corresponding label, and the spectrogram of the target database with a randomly initialized pseudo label, into the deep regression neural network as samples for pre-training; then inputting the obtained depth features as samples into the deep regression neural network for fine-tuning, completing the training.
The method specifically comprises the following steps:
(4-1) Simultaneously inputting the spectrogram of the source database with its corresponding label, and the spectrogram of the target database with a randomly initialized pseudo label, into the deep regression neural network as samples. Training uses a loss function L improved from the cross-entropy loss function:

L = -(1/N) Σ_{n=1..N} Σ_{j=1..J} y_nj · log(p_nj) - (1/M) Σ_{m=1..M} Σ_{j=1..J} q̂_mj · log(q_mj)

wherein N is the number of source-database samples and n indexes them, M is the number of target-database samples and m indexes them, J is the number of speech emotion classes and j indexes them; p_nj is the probability that the actual output emotional feature of the n-th source-database sample is assigned to class j; y_nj is the expected probability (label) that the output emotional feature of the n-th source-database sample is assigned to class j; q_mj is the probability that the actual output emotional feature of the m-th target-database sample is assigned to class j; and q̂_mj is the probability that the actual output emotional feature of the m-th target-database sample was assigned to class j in the previous round of training (the pseudo label).
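Using the quantities just described (the variable names below are illustrative, not the patent's notation), the pre-training loss might be sketched as:

```python
import numpy as np

def pretrain_loss(p_src, y_src, p_tgt, pseudo_tgt):
    """Cross-entropy of the source predictions against their true labels,
    plus cross-entropy of the target predictions against their pseudo
    labels (randomly initialised, then refreshed from the previous round).
    Shapes: p_src (N, J), y_src (N, J) one-hot, p_tgt (M, J),
    pseudo_tgt (M, J)."""
    eps = 1e-12                       # guards log(0)
    src = -np.mean(np.sum(y_src * np.log(p_src + eps), axis=1))
    tgt = -np.mean(np.sum(pseudo_tgt * np.log(p_tgt + eps), axis=1))
    return src + tgt

# perfect predictions matching labels and pseudo labels give ~zero loss
y = np.eye(3)
loss = pretrain_loss(y, y, y, y)
```

As the pseudo labels are refreshed each round from the network's own target-domain predictions, the second term pulls successive predictions toward consistency.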
(4-2) Inputting the depth features of the source database and the target database obtained from the deep regression neural network as samples into the trained network for fine-tuning, wherein the loss function L_total adopted during training is:

L_total = α · L_mmd + β · L

L_mmd = MMD(X_s, Y_t) + Σ_{j=1..J} MMD(X_s^j, Y_t^j) + MMD(X_s^pn, Y_t^pn)

wherein MMD(X_s, Y_t) is the maximum mean discrepancy between the emotional features X_s output by the source-database samples and the emotional features Y_t output by the target-database samples on the deep regression neural network; MMD(X_s^j, Y_t^j) is the maximum mean discrepancy between the class-j emotional features X_s^j output by the source-database samples and the class-j emotional features Y_t^j output by the target-database samples; MMD(X_s^pn, Y_t^pn) is the maximum mean discrepancy between the positive/negative-emotion features X_s^pn output by the source-database samples and the positive/negative-emotion features Y_t^pn output by the target-database samples; MMD is the distance between the means of two sets of data in a reproducing kernel Hilbert space; and α and β are constraint-strength coefficients obtained through training.
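Each MMD term above reduces to the same estimator: the squared distance between the two sample means in a reproducing kernel Hilbert space. A sketch with an RBF kernel (the kernel choice and bandwidth are assumptions; the patent does not fix them):

```python
import numpy as np

def mmd_rbf(X, Y, gamma=1.0):
    """Biased MMD^2 estimate between samples X (n, d) and Y (m, d) with
    RBF kernel k(a, b) = exp(-gamma * ||a - b||^2): the squared distance
    between the two sample mean embeddings in the RKHS."""
    def k(A, B):
        sq = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd_rbf(rng.normal(0, 1, (200, 4)), rng.normal(0, 1, (200, 4)))
shifted = mmd_rbf(rng.normal(0, 1, (200, 4)), rng.normal(3, 1, (200, 4)))
```

The global, per-class and positive/negative terms of L_mmd would each call this estimator on the corresponding feature subsets, so minimizing L_mmd pulls the two databases' feature distributions together at all three scales.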
(5) Preprocessing the emotion voice frequency of the voice to be recognized into a spectrogram, and inputting the spectrogram as a sample in a target database into a trained deep regression neural network to obtain the emotion type of the voice.
The training optimizer adopts the stochastic gradient descent (SGD) algorithm with Nesterov momentum; the loss function is computed and the weights are updated continuously. By simulating the concept of momentum in physics, Nesterov momentum suppresses oscillation in the gradient direction and accelerates convergence: if the historical gradient agrees with the current gradient direction, the momentum term grows, otherwise it shrinks. The Nesterov term additionally applies a correction when updating the gradient, preventing the update from advancing too fast while keeping it flexible. The iteration is:

v_t = μ · v_{t-1} + η · ∇_θ L(θ - μ · v_{t-1})
θ = θ - v_t

where η is the learning rate, set to 10^-3 in the experiments; in addition, the weight decay is set to 10^-5 and the momentum coefficient μ (the correction factor) to 0.9.
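The update rule can be exercised on a toy quadratic (the objective and step count below are illustrative; only the learning rate 10^-3, weight decay 10^-5 and momentum 0.9 come from the text):

```python
import numpy as np

def sgd_nesterov(grad, theta, lr=1e-3, momentum=0.9, weight_decay=1e-5,
                 steps=4000):
    """SGD with Nesterov momentum: the gradient (plus weight decay) is
    evaluated at the look-ahead point theta - momentum*v, then
    v_t = momentum*v_{t-1} + lr*g and theta = theta - v_t."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        look = theta - momentum * v        # look-ahead evaluation point
        g = grad(look) + weight_decay * look
        v = momentum * v + lr * g
        theta = theta - v
    return theta

# toy objective f(x) = ||x - 1||^2, gradient 2(x - 1); minimum at x = 1
theta = sgd_nesterov(lambda x: 2.0 * (x - 1.0), np.array([5.0, -3.0]))
```

Evaluating the gradient at the look-ahead point is what distinguishes Nesterov momentum from plain (heavy-ball) momentum.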
In order to accelerate the training speed and improve the reliability of the recognition result, weights of all layers are frozen on a DTTRN network model pre-trained by a source database, and then training and fine-tuning are only carried out on the weight of the last full-connection layer of the DTTRN by combining a target domain sample and a source domain sample, so that the expected cross-library speech emotion recognition task can be realized. The training maximum period is set to 200.
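The freezing scheme — backbone weights fixed, only the last fully connected layer updated — can be sketched framework-independently (the layer names below are illustrative):

```python
import numpy as np

def finetune_step(params, grads, trainable=("fc1",), lr=1e-3):
    """Apply a gradient step only to the layers listed in `trainable`;
    every other (pre-trained, frozen) weight is returned unchanged."""
    return {name: w - lr * grads[name] if name in trainable else w
            for name, w in params.items()}

params = {"conv1": np.ones(2), "fc1": np.ones(2)}   # frozen vs. trainable
grads = {"conv1": np.ones(2), "fc1": np.ones(2)}
new = finetune_step(params, grads)
```

In a deep-learning framework the same effect is achieved by disabling gradient tracking on the frozen layers and passing only the last layer's parameters to the optimizer.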
The embodiment also provides a cross-library speech emotion recognition device based on the deep direct-push migration network, which comprises a processor and a computer program stored on a memory and capable of running on the processor, wherein the processor realizes the method when executing the program.
In order to verify the invention, simulation experiments were carried out. Cross-library speech emotion recognition experiments require sample data that share the same labels but belong to different databases; EmoDB, eNTERFACE and CASIA were selected, denoted B, E and C respectively. In the co-migration of EmoDB and CASIA, we used the five emotions anger, sadness, fear, happiness and neutral; in the co-migration of EmoDB and eNTERFACE, we used anger, sadness, fear, happiness and disgust; in the co-migration of CASIA and eNTERFACE, we used anger, sadness, fear, happiness and surprise. In order to verify the effectiveness and necessity of the direct-push migration deep regression network DTTRN, cross-library speech emotion experiments were carried out on EmoDB, eNTERFACE and CASIA. UAR (Unweighted Average Recall) was selected as the evaluation index, and the results are shown in fig. 4; it can be observed that the direct-push migration deep regression network DTTRN of the invention achieves an excellent recognition effect on the cross-library speech emotion recognition task. DTTRN borrows the idea of transductive (direct-push) migration and uses the unlabelled target-domain data to pre-train the network better; in addition, the designed loss function constrains the feature-distribution difference of the two databases on the network during DTTRN fine-tuning, which plays a key role in improving cross-library recognition performance.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (7)
1. A cross-database speech emotion recognition method based on a deep direct push type migration network is characterized by comprising the following steps:
(1) acquiring two different voice emotion databases, namely a source database and a target database, wherein emotion voice audios and corresponding emotion type labels are stored in the source database, and only the emotion voice audios are stored in the target database;
(2) processing the emotion voice audios of the source database and the target database into spectrograms;
(3) establishing a deep regression neural network;
(4) simultaneously inputting the spectrogram of the source database with its corresponding label, and the spectrogram of the target database with a randomly initialized pseudo label, into the deep regression neural network as samples for pre-training; then inputting the obtained depth features as samples into the deep regression neural network for fine-tuning, completing the training;
(5) preprocessing the emotion voice frequency of the voice to be recognized into a spectrogram, and inputting the spectrogram as a sample in a target database into a trained deep regression neural network to obtain the emotion type of the voice.
2. The cross-library speech emotion recognition method based on the deep direct-push migration network, characterized in that step (2) specifically comprises: processing the emotional voice audios in the source database and the target database into spectrograms using the librosa toolkit of python.
3. The cross-library speech emotion recognition method based on the deep direct-push migration network, characterized in that the deep regression neural network established in step (3) comprises, connected in sequence from front to back: a first convolutional layer, a first max-pooling layer, a second convolutional layer, a second max-pooling layer, a third convolutional layer, a fourth convolutional layer, a third max-pooling layer, a fifth convolutional layer, a sixth convolutional layer, a fourth max-pooling layer, a seventh convolutional layer, an eighth convolutional layer, a fifth max-pooling layer, an adaptive average pooling layer and a fully connected layer.
4. The cross-library speech emotion recognition method based on the deep direct-push migration network, characterized in that each neuron in all convolutional layers and fully connected layers adopts the rectified linear unit (ReLU) as its activation function.
5. The cross-library speech emotion recognition method based on the deep direct-push migration network according to claim 3, characterized in that the outputs of the neurons in the fully connected layer are regularized with dropout at a rate of 0.5 to prevent overfitting.
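A NumPy illustration of the two per-neuron choices in claims 4 and 5: ReLU activation and dropout at rate 0.5. The inverted-scaling variant of dropout is an assumption for the sketch; the claim only fixes the rate:

```python
import numpy as np

def relu(x):
    # Rectified linear unit used by every convolutional / fully connected neuron
    return np.maximum(0.0, x)

def dropout(x, p=0.5, rng=None, train=True):
    # Inverted dropout: zero each fully connected output with probability p
    # during training and rescale by 1/(1-p), so inference needs no change.
    if not train:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.array([-1.0, 0.5, 2.0])
print(relu(x))                    # [0.  0.5 2. ]
print(dropout(relu(x), p=0.5))    # surviving values doubled, the rest zeroed
```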
6. The cross-library speech emotion recognition method based on the deep direct-push migration network according to claim 1, characterized in that step (4) comprises the following steps:
(4-1) simultaneously inputting the spectrograms of the source database together with their corresponding labels, and the spectrograms of the target database together with randomly initialized pseudo labels, into the deep regression neural network as samples, wherein the loss function L adopted during training is:

L = -(1/N) Σ_{n=1}^{N} Σ_{j=1}^{J} q_{nj}^{s} ln p_{nj}^{s} - (1/M) Σ_{m=1}^{M} Σ_{j=1}^{J} p̂_{mj}^{t} ln p_{mj}^{t}

wherein N is the number of source database samples, n indexes the source database samples, M is the number of target database samples, m indexes the target database samples, J is the number of speech emotion categories, and j indexes the emotion categories; p_{nj}^{s} is the probability that the actually output emotional feature of the n-th source database sample belongs to category j; q_{nj}^{s} is the expected (label) probability that the output emotional feature of the n-th source database sample belongs to category j; p_{mj}^{t} is the probability that the actually output emotional feature of the m-th target database sample belongs to category j; and p̂_{mj}^{t} is the probability, obtained in the previous round of training, that the actually output emotional feature of the m-th target database sample belongs to category j;
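The pre-training objective above can be sketched in NumPy: a standard cross-entropy on labeled source samples plus a pseudo-label cross-entropy on target samples, whose targets are the previous round's own outputs. The function name and array layout are illustrative:

```python
import numpy as np

def pretrain_loss(p_src, q_src, p_tgt, p_tgt_prev):
    """Reconstructed loss L for the pre-training stage.
    p_src:      (N, J) network outputs for source samples
    q_src:      (N, J) one-hot source labels (expected probabilities)
    p_tgt:      (M, J) network outputs for target samples
    p_tgt_prev: (M, J) target outputs from the previous training round
                (the pseudo labels; randomly initialized at the start)
    """
    eps = 1e-12  # numerical floor inside the logarithm
    src = -np.mean(np.sum(q_src * np.log(p_src + eps), axis=1))
    tgt = -np.mean(np.sum(p_tgt_prev * np.log(p_tgt + eps), axis=1))
    return src + tgt

# Perfect source predictions contribute ~0; a maximally uncertain
# target sample with matching pseudo label contributes ln 2.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = pretrain_loss(q, q, np.array([[0.5, 0.5]]), np.array([[0.5, 0.5]]))
print(loss)   # ~0.6931
```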
(4-2) inputting the deep features of the source database and the target database obtained through the deep regression neural network as samples into the pre-trained deep regression neural network for fine-tuning, wherein the loss function L_total adopted during training is:
L_total = α·L_mmd + β·L
wherein L_mmd = MMD(X_s, Y_t) + Σ_{j=1}^{J} MMD(X_s^j, Y_t^j) + MMD(X_s^{pn}, Y_t^{pn}); MMD(X_s, Y_t) denotes the maximum mean discrepancy (MMD) between the emotional features output by the source database samples and by the target database samples in the deep regression neural network, where X_s denotes the emotional features output by the source database samples and Y_t denotes the emotional features output by the target database samples; MMD(X_s^j, Y_t^j) denotes the MMD between the category-j emotional features output by the source and target database samples, where X_s^j denotes the distribution of the category-j emotional features output by the source database samples and Y_t^j denotes the category-j emotional features output by the target database samples; MMD(X_s^{pn}, Y_t^{pn}) denotes the MMD between the positive/negative emotional features output by the source and target database samples, where X_s^{pn} denotes the positive/negative emotional features output by the source database samples and Y_t^{pn} denotes those output by the target database samples; the MMD is the distance between the means of two data sets in a reproducing kernel Hilbert space; and α and β are constraint strength coefficients obtained through training.
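The fine-tuning objective can be sketched in NumPy as follows. A linear-kernel MMD (distance between feature means) stands in for a full kernel MMD; the `pn_s`/`pn_t` 0/1 positive-negative indicators and the additive combination of the three alignment terms are assumptions, since the claim defines the terms but the combining formula is not reproduced in this text:

```python
import numpy as np

def mmd(X, Y):
    """Linear-kernel maximum mean discrepancy: squared distance between
    the feature means of two sample sets (a simple stand-in for the
    RKHS mean distance referred to in the claim)."""
    return float(np.sum((X.mean(axis=0) - Y.mean(axis=0)) ** 2))

def total_loss(L, Xs, Yt, labels_s, labels_t, pn_s, pn_t,
               alpha=1.0, beta=1.0, J=2):
    """L_total = alpha * L_mmd + beta * L, combining global, per-category,
    and positive/negative-emotion alignment terms."""
    L_mmd = mmd(Xs, Yt)
    for j in range(J):  # per-category alignment (assumes every category occurs)
        L_mmd += mmd(Xs[labels_s == j], Yt[labels_t == j])
    # positive / negative emotion alignment
    L_mmd += mmd(Xs[pn_s == 1], Yt[pn_t == 1]) + mmd(Xs[pn_s == 0], Yt[pn_t == 0])
    return alpha * L_mmd + beta * L

# When source and target features coincide, every MMD term vanishes
# and only the classification loss L remains.
Xs = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1])
pn = np.array([0, 1])
print(total_loss(1.0, Xs, Xs, labels, labels, pn, pn))   # 1.0
```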
7. A cross-library speech emotion recognition device based on a deep direct-push migration network, comprising a processor and a computer program stored in memory and executable on the processor, characterized in that the processor, when executing the program, implements the method of any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210513096.6A CN114898777A (en) | 2022-05-12 | 2022-05-12 | Cross-library speech emotion recognition method and device based on deep direct-push migration network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114898777A true CN114898777A (en) | 2022-08-12 |
Family
ID=82722316
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210513096.6A Pending CN114898777A (en) | 2022-05-12 | 2022-05-12 | Cross-library speech emotion recognition method and device based on deep direct-push migration network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114898777A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116072154A (en) * | 2023-03-07 | 2023-05-05 | 华南师范大学 | Speech emotion recognition method, device and equipment based on data enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||