CN116524960A

CN116524960A - Speech emotion recognition system based on mixed entropy downsampling and integrated classifier

Info

Publication number: CN116524960A
Application number: CN202310509029.1A
Authority: CN
Inventors: 李冬冬; 王喆; 宣正吉; 王建伟
Original assignee: East China University of Science and Technology
Current assignee: East China University of Science and Technology
Priority date: 2023-05-08
Filing date: 2023-05-08
Publication date: 2023-08-01

Abstract

The invention discloses a voice emotion recognition system based on mixed entropy downsampling and an integrated classifier, which comprises the following steps: the preprocessing stage divides the voice signal of the training data into fragments, then extracts a spectrogram, trains a base classifier by using the fragment, and obtains the depth characteristic and the confidence coefficient of each voice fragment; the training stage, calculating the mixed entropy of all the voice fragments and taking the weighted sum of the mixed entropy and the confidence coefficient as a ranking value; then, the spectrograms of the voice fragments with the ranking values larger than the set threshold value are used for retraining a base classifier, ranking values of all the voice fragments and training the base classifier are calculated again, the operation is cycled for given rounds, and the base classifier trained in each round forms an integrated classifier; finally, testing the speech segments, extracting the spectrogram, inputting the speech segments into an integrated classifier, and calculating the emotion recognition result of the speech. The invention obviously reduces the influence of voice fragments with undefined emotion and unstable distribution structure, and effectively improves the accuracy of voice emotion recognition.

Description

Speech emotion recognition system based on mixed entropy downsampling and integrated classifier

Technical Field

The invention relates to the technical field of voice emotion recognition, in particular to a voice emotion recognition system based on mixed entropy downsampling and an integrated classifier.

Background

Speech is the most direct and natural communication mode between people, and is the main form of man-machine interaction. However, speech emotion in real life is often complex, subtle, in a constantly changing state. Thus, detecting and recognizing emotion in speech has become a challenging task. In recent years, speech emotion recognition has been studied and developed, and has been widely used in various fields such as virtual customer service, intelligent assistants, and medical auxiliary diagnosis. The speech emotion recognition system generally comprises two parts of feature extraction and training of a classifier, wherein the traditional method is to segment an original speech waveform and then extract artificial features, and the classifier commonly used in speech emotion recognition comprises a Gaussian mixture classifier, a support vector machine and the like. In recent years, with the development of deep learning, many methods based on deep learning classifiers, such as recurrent neural network classifiers and convolutional neural network classifiers, have emerged.

Previous studies have found that the confidence level of each emotion varies with the location of each segment of speech in the speech. For example, a true emotion tag for a piece of speech is happy, but the trained classifier results show that the confidence of neutral emotion is highest in the first half of speech and the confidence of happy is highest in the second half. Clearly, the first half has weaker emotional intensity of happiness, which is detrimental to classifier training. Speech segments with ambiguous emotion introduce noise into the classifier training process and degrade the performance of the speech emotion recognition system. Thus, challenges remain in speech emotion recognition at the speech segment level. While there are some approaches to this problem, such as attention mechanisms and multi-instance learning, these approaches autonomously learn how to weight different parts of speech through deep learning classifiers, which are difficult to analyze and interpret theoretically.

Disclosure of Invention

The invention provides a voice emotion recognition system based on mixed entropy downsampling and an integrated classifier, which selects voice fragments with clear emotion from voice fragments of all training data for next round of training in each round of training, namely downsampling the voice fragments of all training data, each round can generate a base classifier, and the base classifiers form the integrated classifier. In the training process of each round, the mixed entropy and the confidence of the voice fragments are calculated, and the ranking value is calculated according to the mixed entropy and the confidence, so that a sample with clear emotion type is selected. The integrated classifier utilizes the base classifier trained by multiple iterations to predict emotion of the whole voice, and accuracy of voice emotion recognition is effectively improved.

The voice emotion recognition system based on the mixed entropy downsampling and the integrated classifier in the project comprises the following steps:

1) Dividing a data set into two parts of training data and test data, dividing a voice signal of the training data into fragments, extracting a spectrogram, and training a base classifier by using the voice signal and obtaining depth characteristics and confidence of each voice fragment;

2) Calculating the mixed entropy of all the voice fragments and taking the weighted sum of the mixed entropy and the confidence coefficient as a ranking value;

3) The spectrogram of the voice fragments with the ranking values larger than the set threshold value is used for retraining a base classifier, ranking values of all the voice fragments and training the base classifier are calculated again, the operation is cycled for given rounds, and the base classifier trained in each round forms an integrated classifier;

4) Testing the speech segments, extracting the spectrogram, inputting the speech segments into an integrated classifier, and calculating the emotion recognition result of the speech.

The technical scheme adopted by the invention can be further refined, the label of each voice segment is the real label of the whole voice in which the label is positioned in the data set, and the mixed entropy of the voice segment in the step 2) consists of emotion certainty entropy and structure distribution entropy, wherein the emotion certainty entropy is used for measuring the significance degree of emotion expressed by the voice segment, and the emotion certainty entropy is used for measuring the significance degree of emotion expressed by the voice segmentThe formula of (2) is:

wherein i is the number of the voice fragment on the training data, C is the emotion type number in the data set, k is the set neighbor number,determining degree entropy for the basis of the ith voice segment;

the structure distribution entropy is used for measuring the distribution structure stability of the voice fragments in the depth feature spaceThe formula of (2) is:

wherein i is the number of the voice fragment on the training data, k is the set neighbor number, d _i,q Representing the Euclidean distance between depth features of the ith speech segment and the qth speech segment in the training data, ln represents the logarithm of the calculated base e.

In the training process of the base model, a ranking value calculated by the weighted sum of the mixed entropy and the confidence coefficient is used as a basis for downsampling the voice fragments in each turn, the mixed entropy of each voice fragment is calculated by the emotion certainty entropy and the structure distribution entropy, and the formula of the mixed entropy is as follows:

wherein i is the number of the voice segment on the training data,representing emotion certainty entropy->Representing the structure distribution entropy, nor is a Min-Max normalization function, MIE _i Representing the mixed entropy of the ith speech segment;

the ranking value of each speech segment is defined as the weighted sum of the mixed entropy and the confidence obtained in step 1), and the formula of the ranking value is:

Rnak _i ＝(1-λ)nor(conf _i )+λnor(-MIE _i )， (4)

wherein i is the number of the voice fragment on the training data, conf _i Representing confidence level of ith speech segment MiE _i Represents the mixed entropy of the ith speech segment, lambda is the weight coefficient, nor is the Min-Max normalization function, rank _i Representing the ranking value of the ith speech segment.

The basic model in each turn updates parameters by minimizing cross entropy loss of voice segment labels and voice segment emotion classification results through a gradient descent method, and finally an integrated classifier composed of basic classifiers generated by each turn calculates emotion types predicted by a system according to the output of each voice segment in the whole test voice.

The beneficial effects of the invention are as follows: the invention provides a voice emotion recognition system based on mixed entropy downsampling and an integrated classifier, which combines the iterative classifiers of each round into an integrated classifier by selecting voice fragments participating in training in the basic classifier training of a plurality of rounds, thereby effectively improving the accuracy of voice emotion recognition and obviously reducing the influence of voice fragments with undefined emotion types compared with the existing classifier and basic classifier. The invention provides a concept of mixed entropy, wherein the mixed entropy of a voice segment comprises emotion certainty entropy and structure distribution entropy, and a ranking value calculated by the mixed entropy and the confidence is used as a standard, so that samples with definite emotion types and stable distribution structure can be effectively selected for training an integrated classifier.

Drawings

FIG. 1 is a block diagram of a speech emotion recognition system based on mixed entropy downsampling and an integrated classifier in accordance with the present invention.

Detailed Description

The following detailed description of specific embodiments of the invention refers to the accompanying drawings, in which:

step 1: pretreatment stepThe segment divides all emotion voice original voice signals in the training data into voice segments with the duration of 2s one by one, N voice segments are divided in total, no overlapping part exists between the two voice segments, the voice segments with the duration of less than 2s are subjected to zero padding processing on the read signal value, and then frames and window division operation are carried out on the signal value of each voice segment as required to extract a spectrogram as new training data Wherein f is the number of sub-frames, w is the characteristic length of the frame voice, and the corresponding training label of each voice segment is +.> The true emotion label of the whole voice in which the true emotion label is positioned in the training data;

step 2: in each iteration round i, a new base classifier m is trained _l Training data composed of a spectrogram of a voice fragment in the first round is inputIts corresponding tag in Y is +.>Wherein n is the number of speech segments in each round that participate in the training of the base classifier; when l=1, X ₁ The number n=n of the voice fragments, namely the spectrograms of all the voice fragments participate in the training of the base classifier; each speech segment is in the base classifier m _l The final output isWherein C is the number of emotion categories in the dataset, which represents the probability that the speech segment is predicted to be each emotion category on the base classifierBasis classifier m _l Predictive tag of->The loss function on the ith voice segment in the training process is a true emotion label y _i Cross entropy loss-y 'of sum base classifier output' _i ·log(y _i )-(1-y′ _i )·log(1-y _i ) The loss is minimized and the parameters of the base classifier are updated by a gradient descent method, and after the given times of gradient descent iteration, the trained base classifier m of the round can be obtained _l ；

Step 3: inputting all training data, namely the spectrogram X of all voice fragments, into a trained classifier m _l Of which each speech segment has a spectrogram x _i Depth features of size z can be obtained in the penultimate fully connected layer of the classifierWhere i is the number of the speech segment, the depth feature corresponding to the spectrogram X of all the speech segments can be denoted as f= { F _i I=1, 2, …, N }; the speech segment is in the base classifier m _l Confidence in the way can be determined by->yy＝y _i Calculating;

step 4: calculating a k-nearest neighbor Euclidean distance matrix between depth features F of voice segment spectrograms in all training data And a k nearest neighbor speech fragment numbering matrix +.>The method is used for calculating emotion certainty entropy and structure distribution entropy in the mixed entropy;

step 5: the mixed entropy is calculated on the depth features F of the speech segment spectrograms in all training data:

step 5.1: calculating emotion certainty entropy:

emotion certainty entropyThe formula of (2) is:

wherein i is the number of the voice fragment on the training data, C is the emotion type number in the data set, k is the set neighbor number,entropy of the basic certainty of the ith speech segment, ln represents the logarithm based on e calculated as follows:

specifically, emotion certainty entropyIn the formula of->The number of the fragments corresponding to the emotion type label with the largest number of fragments among the k voice fragments with the nearest Euclidean distance calculated with the ith voice fragment on the depth characteristic of the training data is represented as follows:

wherein, the liquid crystal display device comprises a liquid crystal display device,represented in training dataCalculating the number of fragments with emotion type label j in k voice fragments with nearest Euclidean distance between depth features from ith voice fragment, which is formed by matrix M _ind Calculating, < +.for the ith speech segment>Is M _ind Voice fragment real tag y corresponding to ith row k neighbor voice fragment number _i The number of fragments j.

Step 5.2: calculating structure distribution entropy:

wherein d is _i,q ∈M _dis Depth feature F representing the ith speech segment on depth feature F _i Depth feature f with the q-th speech segment _q Is a euclidean distance of (c).

Step 5.3: the mixed entropy of each speech segment is calculated from the emotion certainty entropy in step 5.1 and the structure distribution entropy in step 5.2:

where nor is the Min-Max normalization function.

Step 6: calculating ranking values on depth features F of speech segment spectrograms in all training data, wherein the ranking values are a weighted sum of the mixed entropy calculated in the step 5 and the confidence obtained in the step 3:

Rank _i ＝(1-λ)nor(conf _i )+λnor(-MIE _i ) (10)

step 7: downsampling the speech fragments participating in training, and connecting Rank _i A spectrogram of n voice fragments larger than a specified threshold is used as new training data X _l+1 Namely, selecting a spectrogram of a speech fragment with clear emotion and strong distributed structural stability in a depth feature space as new training data X _l+1 ；

Step 8: repeating the steps 2 to 7 for L rounds, wherein the base classifier m is obtained in each round _l Adding the integrated classifier into a set M to serve as an integrated classifier;

step 9: during the test, a complete voice is divided into E voice fragments, and the output of each voice fragment on the basic classifier obtained by the first round of training is thatWhere e is the segment number and C is the emotion category number in the data set, then the output of the complete speech on the integrated classifier M can be defined as:

wherein e is the segment number, and the output of each voice segment on the base classifier obtained by the first round of training isE is the number of voice fragments divided by the complete voice, L is the set total training round, and the complete voice is subjected to final recognition emotion subscript R of a voice emotion recognition system based on mixed entropy downsampling and integrated classifier _ind The calculation formula of (2) is as follows:

wherein C is the emotion category number in the data set, R _ind The corresponding emotion type is the final recognition result of the system.

Design of experiment

And (3) selecting an experimental data set: the invention uses a voice data set: IEMOCAP. It contains 12 hours of voice audio, played in conversational form by 10 actors. Five sections are divided into two, 10. In the experiment of the invention, only four common emotions of anger, happiness, neutrality and sadness are considered, and the emotion type real tag of the excited voice audio in the data set is also regarded as happiness. The data contains 5531 speech in total, including 1,103 categories of happiness, 1,636 categories of happiness, 1,708 categories of anger, and 1,084 categories of sadness.

We use two indices Weighted Accuracy (WA) and Unweighted Accuracy (UA) to measure the accuracy of the classifier on the test data, defined as the following two formulas, where N _c Representing the number of class c emotion samples, r _c Representing the number of emotionally correctly classified samples of class c:

the base classifier in the experiment adopts a ResNet18 convolutional neural network, an ablation experiment and a comparison experiment are respectively carried out on the basis of the classifier, the voice corresponding to each person in the data set is adopted as test data in the experiment in turn, and the average value is obtained on the result. The ablation experiment compares an original base classifier, the integrated learning of the original base classifier for downsampling by using the confidence level, the integrated learning classifier of the original base classifier for downsampling by using the mixed entropy, and the method provided by the invention to reveal the utility of each right in the method; in the comparison test, compared with a voice emotion recognition method which is popular in recent years, the method has the advantages that the number of adjacent times k=5, the total iteration times L=5, the weight coefficient lambda=0.6, the downsampling threshold t=0.6, 8 gradient descent iterations are carried out during training of the base classifier, the voice framing duration is 16ms, and the overlapping part duration is 8ms.

Ablation experimental results:

table 1 speech emotion recognition accuracy for ablation experiments on IEMOCAP datasets

Classifier name	WA(％)	UA(％)
			Base classifier	54.95	56.42
Base classifier+confidence	57.26	58.49
			Base classifier+mixed entropy	57.76	58.32
The invention is that	58.72	58.79

Each row in the table is a set of ablation experiments, and each column is the WA and UA of the current ablation experiment, respectively, as a percentage.

It can be seen that when only the confidence is used for downsampling ensemble learning, WA and UA of the classifier are improved by 2.31% and 2.07% respectively, which indicates that the confidence can well measure the emotion intensity on each segment. When downsampling is performed using only the mixed entropy as a basis, the result is more improved relative to the base classifier by 2.81% and 1.90% over WA and UA, respectively. The result is improved most when the confidence and the mixed entropy participate in downsampling at the same time, and the calculated ranking value considers the characteristics of the depth features on the sample space by adding the mixed entropy, so that the result accuracy is further improved.

The ablation experiment result shows that when the downsampling ensemble learning is carried out by using the confidence coefficient only, WA and UA of the classifier are respectively improved by 2.31% and 2.07%, and the confidence coefficient is proved to be capable of measuring the emotion intensity on each segment well. When downsampling is performed using only the mixed entropy as a basis, the result is more improved relative to the base classifier by 2.81% and 1.90% over WA and UA, respectively. The result is improved most when the confidence and the mixed entropy participate in downsampling at the same time, and the calculated ranking value considers the characteristics of the depth features on the sample space by adding the mixed entropy, so that the result accuracy is further improved.

Comparing the experimental results:

TABLE 2 accuracy of speech emotion recognition for comparative experiments on IEMOCAP datasets

Each row in the table is an experiment of a classifier, each column is WA and UA of the current experiment, respectively, listed in percent.

The comparison experiment shows that the accuracy of our classifier on WA and UA is high. The methods in the classifier 2 and the classifier 4 are deep learning classifiers, and the comparison result shows that the integrated learning method provided by the user effectively improves the accuracy of voice emotion recognition.

Claims

1. A speech emotion recognition system based on mixed entropy downsampling and an integrated classifier, comprising the steps of:

2. The speech emotion recognition system based on mixed entropy downsampling and integrated classifier as claimed in claim 1, wherein the mixed entropy in step 2) is composed of emotion certainty entropy and structure distribution entropy, wherein the emotion certainty entropy is used for measuring the significance of emotion expressed by the speech fragment;

emotion certainty entropyThe formula of (2) is:

specifically, emotion certainty entropyIn the formula of->The method is characterized in that the method comprises the steps of representing the number of fragments corresponding to emotion type labels with the largest number of fragments in k voice fragments with the nearest Euclidean distance calculated with the ith voice fragment on the depth characteristic of training data, wherein the formula is as follows:

wherein, the liquid crystal display device comprises a liquid crystal display device,representing the number of fragments with emotion type label j in k voice fragments with nearest Euclidean distance between depth features calculated from the ith voice fragment in the training data.

3. The hybrid entropy according to claim 2, wherein the structure distribution entropy in the hybrid entropy is used to measure the distribution structure stability of the speech segment in the depth feature space;

structure distribution entropyThe formula of (2) is:

4. A speech emotion recognition system based on mixed entropy downsampling and integrated classifier as claimed in claim 1, wherein the mixed entropy of each speech segment in step 2) is calculated from the emotion certainty entropy in claim 2 and the structure distribution entropy in claim 3;

the formula of the mixed entropy is:

wherein i is the number of the voice segment on the training data,representing emotion certainty entropy->Representing the structure distribution entropy, nor is a Min-Max normalization function, MIE _i Representing the mixed entropy of the ith speech segment.

5. A speech emotion recognition system based on mixed entropy downsampling and integrated classifier as claimed in claim 1, characterized in that the ranking value of each speech segment in step 2) is defined as a weighted sum of the mixed entropy as claimed in claim 4 and the confidence obtained in step 1) in claim 1;

the formula of the ranking value is:

Rank _i ＝(1-λ)nor(conf _i )+λnor(-MTE _i )

wherein i is the number of the voice fragment on the training data, conf _i Representing confidence of ith speech segment, MIE _i Represents the mixed entropy of the ith speech segment, lambda is the weight coefficient, nor is the Min-Max normalization function, rank _i And representing the ranking value of the ith voice segment, wherein the ranking value is used as a basis for downsampling the voice segment.

6. The speech emotion recognition system based on mixed entropy downsampling and integrated classifier as claimed in claim 1, wherein the integrated classifier obtained in step 3) is m= { M _l L=1, 2,3, …, L }, where m _l Representing the first roundThe trained basic classifier, L is the set total training round, a complete voice is divided into E voice fragments in the test process, and the output of each voice fragment on the basic classifier obtained by the first round training isWhere e is the segment number and C is the emotion category number in the data set, then the output of the complete speech on the integrated classifier M can be defined as:

wherein e is the segment number, and the output of each voice segment on the base classifier obtained by the first round of training isE is the number of voice fragments divided by the complete voice, L is the set total training round, and the final recognition emotion index R of the complete voice in the voice emotion recognition system based on mixed entropy downsampling and integrated classifier as set forth in claim 1 _ind The calculation formula of (2) is as follows: