CN113870901B - SVM-KNN-based voice emotion recognition method - Google Patents

SVM-KNN-based voice emotion recognition method

Info

Publication number
CN113870901B
CN113870901B
Authority
CN
China
Prior art keywords
emotion
sample
svm
classification
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111127502.7A
Other languages
Chinese (zh)
Other versions
CN113870901A (en)
Inventor
王海
路璐
侯宇婷
冯毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY filed Critical NORTHWEST UNIVERSITY
Priority to CN202111127502.7A priority Critical patent/CN113870901B/en
Publication of CN113870901A publication Critical patent/CN113870901A/en
Application granted granted Critical
Publication of CN113870901B publication Critical patent/CN113870901B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method comprises the following steps: first, preprocessing the original voice signal; second, performing voice enhancement with a microphone-array delay-alignment method; third, extracting features from the processed data with a BN-DNN based on the SHL structure; fourth, selecting the extracted features with a fuzzy-set-theory-based method; and fifth, performing emotion recognition with an optimized SVM-KNN method. With the method, a user can obtain higher voice emotion classification accuracy, the problem of limited optimization under large-scale training samples is avoided, and the SVM classification accuracy and recognition speed are improved. The SVM-KNN concept proposed by the invention can also be applied to other fields of voice recognition, such as dialect classification, and provides a reference for classification and recognition based on voice signals.

Description

SVM-KNN-based voice emotion recognition method
Technical Field
The invention relates to voice emotion recognition, in particular to a voice emotion recognition method based on SVM-KNN.
Background
Among current speech emotion recognition methods, the support vector machine (SVM) has proved to be a relatively effective classification tool, but when the degree of emotion confusion is high, accurate recognition with the SVM alone is still difficult.
For a long time, the study of emotion was carried out mainly by experts in physiology and psychology. With the rapid development of artificial intelligence, emotion research in human-computer interaction has attracted great interest from a large number of experts. In human-computer interaction, people hope to communicate with machines more naturally, which requires the machine to understand human emotion, so the classification and recognition of emotion are particularly important. In human communication, language carries rich information, so machines can use speech to classify and recognize emotion. Experts have carried out a great deal of research and analysis on speech emotion classification and recognition, including building speech emotion databases, extracting emotion features, and classification and recognition methods. To improve the recognition rate of speech emotion, previous work has improved each of these links, but there is no unified system and the recognition rate is not very high. MFCCs have been used as recognition features, but they were not further processed before recognition, so large amounts of redundant information affected the recognition. To eliminate this influence and improve the recognition rate, selecting an appropriate classifier is an important focus of study; to improve the emotion recognition rate and process emotion features accurately, it is important to select a proper classification method.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a voice emotion recognition method based on SVM-KNN, in which voice enhancement is performed with a microphone-array delay-alignment method, features are extracted with a BN-DNN based on the SHL structure, features are selected with a fuzzy-set-theory-based method, and emotion recognition is then performed with an optimized SVM-KNN method, finally forming a voice emotion recognition method with high precision and low computational load.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a voice emotion recognition method based on SVM-KNN comprises the following steps of different voice signal preprocessing modes, specific feature extraction, fuzzy feature selection and SVM-KNN support vector machine classification:
(1) Preprocessing the input voice signal; the preprocessing comprises pre-emphasis filtering and windowed framing, where the pre-emphasis coefficient α of the pre-emphasis filter is 0.95 and the frame length of the windowed framing is 26 ms (a minimal sketch of this step is given below);
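A minimal preprocessing sketch in Python; the coefficient α = 0.95 and the 26 ms frame length come from the text, while the 16 kHz sampling rate, the 10 ms frame shift and the Hamming window are assumptions:

import numpy as np

def preprocess(signal, fs=16000, alpha=0.95, frame_ms=26, shift_ms=10):
    """Pre-emphasis, framing and Hamming windowing (shift and window type assumed)."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)     # 26 ms -> 416 samples at 16 kHz
    frame_shift = int(fs * shift_ms / 1000)   # assumed 10 ms shift
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift

    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames   # shape: (n_frames, frame_len)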
(2) The data of different microphone channels are delay-aligned using a microphone array solution to localize the sound source and improve audio quality (a delay-alignment sketch is given after this step):
1) A nested microphone array structure consisting of 9 microphones is used; it is in effect 4 nested linear sub-arrays, each consisting of 5 equidistant microphones (spacings of 2.5 cm, 5 cm, 10 cm and 20 cm, respectively), thus ensuring that the frequency range of the recorded speech signal covers 300-3400 Hz.
2) Meanwhile, the proportional relation between the microphone spacing and the distance from the speaker to the array is taken into account so that the far-field sound-field assumption is met, and the impulse response of the hypothetical room is modeled with the image model.
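The text does not reproduce the delay-alignment procedure itself; the following is a hedged sketch of one common approach, estimating each channel's delay against a reference microphone by cross-correlation and shifting the channels accordingly (function names are illustrative):

import numpy as np

def align_channels(channels, ref_idx=0):
    """Delay-align multi-microphone recordings to a reference channel by cross-correlation."""
    ref = channels[ref_idx]
    aligned = []
    for ch in channels:
        # Lag that maximizes the cross-correlation with the reference channel.
        corr = np.correlate(ch, ref, mode="full")
        lag = np.argmax(corr) - (len(ref) - 1)
        # Shift the channel by -lag so it lines up with the reference.
        aligned.append(np.roll(ch, -lag))
    return np.stack(aligned)

# After alignment the channels can be averaged (delay-and-sum) to enhance the speech.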
(3) Adopting BN-DNN based on SHL structure to extract the characteristics, wherein the characteristic extraction process is as follows:
1) In the experiment, 5 hidden layers are set in the BN-DNN model, the 3rd hidden layer is set as the bottleneck layer, and the number of neurons in the other hidden layers is 1024; the input data are the 40-dimensional MFCC features of 11 consecutive frames.
2) The number of neurons in the input layer is therefore set to 440 (40 × 11). The DNN network structure is 440-[1024-1024-1024-1024-1024]-440.
3) The optimal number of neurons per group and the sparse-group overlap coefficient α are determined; the experimental group sizes were 64, 128 and 256, and the overlap coefficient α was set to 0%, 20%, 30% and 40%.
4) The sparsity of the network is measured by using the proportion of activation probability h equal to 0 in the neurons, and the sparsity is defined as:
where D represents the number of neurons in a layer and h_i (i = 1, 2, …, D) represents the activation probability of the i-th neuron; a greater sparsity means the neurons in the hidden layer are sparser, i.e., more neurons have an activation probability of 0. For each model, the training set is used to train the model and obtain the activation probability of the neurons in each layer, the sparsity of each layer is then computed on the training set, and finally the average sparsity over all hidden layers is taken as the sparsity of the whole neural network, after which the voice bottleneck features are extracted.
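A small sketch of the sparsity measure as described above (fraction of neurons whose activation probability is 0, averaged over all hidden layers):

import numpy as np

def layer_sparsity(activations):
    """Sparsity of one hidden layer: proportion of neurons whose activation probability is 0."""
    h = np.asarray(activations)   # shape: (D,) activation probabilities h_1 ... h_D
    return float(np.mean(h == 0))

def network_sparsity(hidden_layer_activations):
    """Average sparsity over all hidden layers of the BN-DNN."""
    return float(np.mean([layer_sparsity(h) for h in hidden_layer_activations]))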
(4) And selecting the characteristics by adopting a method based on fuzzy set theory:
1) In the feature space R^n, for a c-class problem, the training sample set is X = {x1, x2, …, xN}, where N is the number of samples; for the sample x to be classified, the value K of the K nearest neighbors is determined first.
2) Determining the distances between the sample to be classified and all training samples, where the Euclidean distance is adopted:
3) The N distances are sorted in ascending order:
d(1) ≤ d(2) ≤ d(3) ≤ … ≤ d(K) ≤ d(K+1) ≤ … ≤ d(N)
where d(1), …, d(K) are the distances from the sample to be classified to its K nearest neighbors.
4) Calculating the class membership of the sample to be classified according to formula (1), where m is the fuzzy weight adjustment factor; if u_i(x) = max_n{u_n(x)}, then x is judged to belong to class i (a sketch of this step is given below).
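Since formula (1) is not reproduced here, the following sketch uses the classical fuzzy-KNN membership weighting as an assumption; k and m correspond to K and the fuzzy weight adjustment factor in the text:

import numpy as np

def fuzzy_knn_membership(x, X_train, y_train, n_classes, k=5, m=2.0):
    """Fuzzy-KNN class membership for one test sample; y_train holds integer labels 0..n_classes-1."""
    d = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances to all training samples
    nn = np.argsort(d)[:k]                           # indices of the K nearest neighbors
    w = 1.0 / (d[nn] ** (2.0 / (m - 1.0)) + 1e-12)   # fuzzy distance weights (assumed form)
    u = np.zeros(n_classes)
    for idx, wi in zip(nn, w):
        u[y_train[idx]] += wi                        # accumulate weight for the neighbor's class
    u /= u.sum()
    return u                                         # x is assigned to class argmax(u), as in the text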
(5) And carrying out emotion recognition by adopting an optimized SVM-KNN method:
1) Assuming that each sample has a class membership degree s_i, the fuzzified input sample set is S = {(x1, y1, s1), (x2, y2, s2), …, (xl, yl, sl)}, where x_i ∈ R^n, y_i ∈ {1, -1}, and σ ≤ s_i ≤ 1 with σ a sufficiently small positive number; s_i indicates the degree to which the i-th sample belongs to the positive class.
2) In the nonlinear case a transformation φ: R^n → F is introduced to map samples from the input space to a high-dimensional feature space F; an optimal classification hyperplane is determined in the feature space using the structural-risk-minimization principle and the idea of maximizing the classification margin, and solving the FSVM optimal-hyperplane problem can be converted into the following optimization problem:
ξ_i ≥ 0, i = 1, …, l.
3) Building the Lagrange function:
where μ_i > 0 and β_i > 0 are the Lagrange multipliers, C0 > 0 is the penalty factor, and w is the weight coefficient of the linear classification function y.
4) The following dual programming problem is obtained.
0 ≤ μ_i ≤ s_i C0, i = 1, …, l. (5)
where k(x_i, x_j) is the kernel function. Taking the KKT conditions into account, a sample x_i with μ_i = 0 is a correctly classified sample, i.e., a non-support vector; a sample with 0 < μ_i < s_i C0 is a support vector on the margin boundary, i.e., the sample x_i lies in the correctly partitioned region on the margin boundary (a sample-weighted sketch of this formulation is given below).
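The dual problem above gives each sample the individual box constraint 0 ≤ μ_i ≤ s_i C0. As an approximation, scikit-learn's per-sample sample_weight (which rescales C for each sample) can emulate this fuzzy weighting; this is a sketch under that assumption, not the patent's exact solver:

import numpy as np
from sklearn.svm import SVC

def train_fsvm(X, y, s, C0=1.0, kernel="rbf"):
    """Fuzzy-weighted SVM sketch: s holds the memberships s_i in (0, 1], scaling each sample's penalty."""
    clf = SVC(C=C0, kernel=kernel)
    clf.fit(X, y, sample_weight=np.asarray(s))   # per-sample weight approximates s_i * C0
    return clf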
Drawings
FIG. 1 is a flow chart of speech signal enhancement according to the present invention;
FIG. 2 is a flow chart of a multi-level SVM classifier of the present invention;
FIG. 3 shows the SVM-KNN classification step according to the invention.
FIG. 4 is a flow chart of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIGS. 1, 2 and 3, the speech emotion recognition method based on SVM-KNN comprises the following steps: speech signal preprocessing, feature extraction, fuzzy-KNN feature selection, and SVM-KNN (support vector machine plus K-nearest-neighbor) classification:
(1) Preprocessing the original data; the original data are preprocessed with pre-emphasis, framing, windowing and endpoint detection.
1) The signal s(n) becomes sw(n) after windowing, according to the formula: sw(n) = s(n) × w(n)
2) The data are filtered with a pre-emphasis filter whose pre-emphasis coefficient α is 0.95.
(2) A corresponding microphone array is designed, and the data of different microphone channels are delay-aligned using the microphone array solution to realize sound-source localization and improve audio quality.
1) A nested microphone array structure consisting of 9 microphones, in effect 4 nested linear sub-arrays each consisting of 5 equidistant microphones (spacings of 2.5 cm, 5 cm, 10 cm and 20 cm), ensures that the frequency range of the recorded speech signal covers 300-3400 Hz.
2) The image model is used, taking into account the proportional relationship between the microphone spacing and the distance from the speaker to the array, so as to conform to the impulse response of a hypothetical room whose sound field satisfies the far-field condition.
(3) And performing feature extraction on the processed data based on BN-DNN of the SHL structure.
1) 5 hidden layers are set in the BN-DNN model, the 3rd hidden layer is set as the bottleneck layer, and the number of neurons in the other hidden layers is 1024; the 40-dimensional MFCC data of 11 consecutive frames are taken as the input.
2) The DNN network structure is 440- [1024-1024-1024-1024-1024] -440.
3) The optimal number of neurons per group and the sparse-group overlap coefficient α are determined; the experimental group sizes were 64, 128 and 256, and the overlap coefficient α was set to 0%, 20%, 30% and 40%.
4) The larger the proportion of neurons whose activation probability equals 0, the sparser the network; the sparsity is defined as:
Finally, the average sparsity over all hidden layers is computed as the sparsity of the whole neural network, and the voice bottleneck features are extracted.
(4) And selecting the extracted features based on a fuzzy set theory method:
1) The energy, short-time amplitude, short-time zero-crossing rate and pitch frequency characteristics are extracted using a function.
2) And forming the extracted characteristic parameters into characteristic vectors, and taking the characteristic vectors as the input of the fuzzy set.
3) For C-class emotion recognition, the average value of each characteristic parameter over the training sample set X under each of the C emotion states is computed and denoted M_ij (i = 1, 2, …, C; j = 1, 2, …, N, where N is the number of emotion feature parameters); each feature parameter M_jm of each speech sample in each emotion state (m indexes the samples within that emotion state, m = 1 being the first sentence, and so on) is normalized by the normalization formula shown below.
4) The dispersion of each characteristic parameter under a specific emotion is then calculated:
5) After the dispersion of each characteristic parameter under each emotion is obtained, the contribution degree of each characteristic parameter under each emotion is calculated according to the dispersion degree:
6) The contribution degrees of the emotion characteristic parameters are used, together with the Euclidean distance, as weights when the fuzzy K-nearest-neighbor method discriminates the sample to be classified.
Finally, the features that contribute most to emotion recognition are extracted (a sketch of the dispersion and contribution computation is given below).
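The dispersion and contribution formulas are not reproduced in the text; the sketch below assumes that the dispersion is the standard deviation of the class-mean-normalized parameter and that the contribution is inversely proportional to it:

import numpy as np

def feature_contributions(X, y, n_classes):
    """Per-emotion feature dispersion and contribution (assumed forms, see lead-in)."""
    n_feats = X.shape[1]
    contrib = np.zeros((n_classes, n_feats))
    for c in range(n_classes):
        Xc = X[y == c]                            # samples of emotion c
        mean = Xc.mean(axis=0)                    # M_ij in the text
        normalized = Xc / (mean + 1e-12)          # normalize each parameter by its class mean
        dispersion = normalized.std(axis=0)       # dispersion of feature j under emotion c
        contrib[c] = 1.0 / (dispersion + 1e-12)   # smaller dispersion -> larger contribution
        contrib[c] /= contrib[c].sum()            # normalize contributions to sum to 1
    return contrib   # later used to weight the Euclidean distance in the fuzzy KNN

# Usage: w = feature_contributions(X_train, y_train, n_classes=7)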
(5) And carrying out emotion recognition by adopting an optimized SVM-KNN method:
1) And constructing a voice emotion recognition model according to the multi-level classification strategy.
2) The emotion confusion degree is the similarity degree of two different emotions.
For the i-th emotion B_i and the j-th emotion B_j, the confusion degree is I_ij, defined as the average of the probability that the i-th emotion is misjudged as the j-th emotion and the probability that the j-th emotion is misjudged as the i-th emotion; the mathematical expression is as follows:
where x is the test data and t is the recognition result corresponding to the test data x.
3) The construction algorithm of the multi-level classification comprises the following specific steps:
a. Calculating a voice emotion recognition confusion matrix by using a traditional support vector machine (SVM) method;
b. Constructing a first-level classifier and setting its probability threshold P1; emotions whose confusion degree exceeds P1 are classified into one group, i.e., if I_ab > P1 and I_cd > P1, then a and b are grouped together and c and d are grouped together; if I_ab > P1 and I_bc > P1, then a, b and c are grouped together.
After the upper-level classifier has been constructed, the probability P2 of the second-level classifier is set; if I_ab > P2 and I_bc > P2, then a, b and c are likewise classified as one group. Here, P1 is set to 10% when the first-level classifier is designed, and each subsequent classifier's probability is increased by 2% over that of the level above it, i.e., the second-level P2 is based on the first-level P1 plus 2%, so the thresholds are 10%, 12%, 14%, 16% and so on;
c. calculating the emotion confusion degree of the ungrouped emotion states according to the formula (1.1), and turning to the step b to classify the ungrouped emotion states into the existing groups or the independent groups;
d. when all four emotions have been correctly grouped, the procedure ends (a confusion-degree and grouping sketch is given below).
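A sketch of the confusion-degree computation and threshold-based grouping described above; the confusion matrix itself would come from the first-pass SVM, and the function and variable names are illustrative:

import numpy as np

def confusion_degree(conf_matrix):
    """I_ij = average of P(i misjudged as j) and P(j misjudged as i), from a row-normalized confusion matrix."""
    P = conf_matrix / conf_matrix.sum(axis=1, keepdims=True)
    return (P + P.T) / 2.0

def group_emotions(I, threshold):
    """Group emotions whose pairwise confusion degree exceeds the level's threshold (P1 = 10%, +2% per level)."""
    n = I.shape[0]
    groups = [{i} for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if I[i, j] > threshold:
                gi = next(g for g in groups if i in g)
                gj = next(g for g in groups if j in g)
                if gi is not gj:
                    gi |= gj             # merge the two groups
                    groups.remove(gj)
    return groups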
Examples
Step 1: preprocessing the original data, comprising the following steps:
(1) In this example the EMO-DB dataset was used: a German emotional speech library recorded at Berlin University, in which 10 actors (5 male and 5 female) simulate 7 emotions (neutral, anger, fear, joy, sadness, disgust, boredom) over 10 sentences (5 long and 5 short), containing a corpus of 800 sentences in total, sampled at 48 kHz (later downsampled to 16 kHz) with 16-bit quantization. The corpus text was selected to be semantically neutral and free of emotional tendency, in an everyday spoken style without excessive written polishing. Recording was done in a professional studio, and the actors were asked to enhance the realism of the emotion by recalling their own real experiences for emotional incubation before performing a given emotion. A listening test with 20 participants (10 men and 10 women) yielded a recognition rate of 84.3%.
After the listening test, 233 male emotion sentences and 302 female emotion sentences were retained, 535 sentences in total. The sentence content comprises 5 short and 5 long sentences of everyday expression, allows a high degree of emotional freedom, and contains no specific emotional tendency. The files are sampled at 16 kHz with 16-bit quantization and saved in WAV format.
(2) Preprocessing: the pre-emphasis coefficient is set and pre-emphasis is applied to the voice signal, and the windowed-framing frame length is set to frame the voice signal.
Step 2: and selecting a specific channel from the preprocessed data, and delaying and aligning the data of different microphone channels by using a microphone array solution to realize the positioning of a sound source and improve the audio quality.
(3) And performing feature extraction by adopting BN-DNN based on an SHL structure. Setting the layer number and each layer parameter, and constructing the BN-DNN neural network. Training a neural network, inputting the MFCC, and finally outputting the voice bottleneck characteristics.
(4) The extracted features are selected as fuzzy features: the class membership of each sample is calculated from the extracted features and the sample x is judged to belong to the i-th class, until all samples have been processed.
(5) Emotion recognition is carried out with the optimized SVM-KNN method: the emotion confusion degrees are calculated and the emotions are classified into the corresponding groups.
(6) And counting the accuracy rate to obtain a final result.

Claims (1)

1. A voice emotion recognition method based on SVM-KNN is characterized by comprising the following steps:
(1) Preprocessing the original data; preprocessing the original data by using a pre-emphasis, framing, windowing and endpoint detection method;
1) The pre-emphasis technology is utilized to boost the high-frequency part, so that the frequency spectrum of the signal is flattened, which facilitates spectrum analysis or vocal-tract parameter analysis;
2) Framing the voice signal; in order to make the transition between frames smooth and keep continuity, overlapping segmentation is used, a segment being intercepted at each frame shift, so that as many frames as possible are obtained to facilitate short-time analysis;
3) Multiplying s(n) by a window function w(n), thereby forming the windowed speech signal sw(n) = s(n) × w(n);
4) Accurately finding the start point and end point of the speech within a segment of the voice signal, so that the effective speech signal and the useless noise signal are separated;
(2) Designing a corresponding microphone array, and carrying out time delay alignment on data of different microphone channels by utilizing a microphone array solution to realize sound source positioning and improve audio quality;
1) Estimating a noise power spectrum of the input speech signal using a first order recursive smoothing method;
2) Calculating the a posteriori signal-to-noise ratio and the a priori signal-to-noise ratio of the noisy voice signal;
3) Smoothing the noisy speech signal to obtain the smoothed power spectrum S(λ, k) of the signal, and searching for the minimum of the smoothed output to obtain S_min(λ, k);
4) Solving for the speech-presence indicator I(λ, k), carrying out a second round of smoothing and minimum searching according to it, and calculating the speech-absence probability q(λ, k);
5) Calculating the speech-presence probability according to the following formula:
6) Updating the time-varying smoothing parameters and the smoothed noise power spectrum (a sketch of this noise-tracking step is given below);
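A hedged sketch of one frame of this noise-tracking step; the recursion constants and the decision-directed form of the a priori SNR are assumptions, since the claim does not give the exact formulas:

import numpy as np

def update_noise_and_snr(Y_pow, noise_pow_prev, S_prev, gain_prev,
                         alpha_n=0.9, alpha_s=0.8, alpha_dd=0.98):
    """Per-frame noise PSD smoothing, a posteriori SNR and decision-directed a priori SNR (sketch)."""
    noise_pow = alpha_n * noise_pow_prev + (1 - alpha_n) * Y_pow   # first-order recursive smoothing
    snr_post = Y_pow / (noise_pow + 1e-12)                          # a posteriori SNR
    snr_prior = alpha_dd * (gain_prev ** 2) * snr_post + \
                (1 - alpha_dd) * np.maximum(snr_post - 1.0, 0.0)    # a priori SNR (assumed form)
    S = alpha_s * S_prev + (1 - alpha_s) * Y_pow                    # smoothed power spectrum S(lambda, k)
    return noise_pow, snr_post, snr_prior, S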
(3) Performing feature extraction on the processed data based on BN-DNN of the SHL structure;
1) Firstly, 39-dimensional MFCC features (13 + Δ + ΔΔ) are extracted from 1 hour of VYSTADIAL_CZ data, a triphone GMM model is trained, and forced alignment is carried out;
2) A triphone GMM acoustic model based on Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) is trained (9-frame splicing of 13-dimensional MFCC features, reduced to 40 dimensions by LDA), where the number of Gaussian mixture components of the model is 19200;
3) Then, Speaker Adaptive Training (SAT) is performed using the Feature-space Maximum Likelihood Linear Regression (fMLLR) technique, forming an LDA+MLLT+SAT GMM acoustic model;
4) The training targets of the softmax layer in the BN-DNN are obtained by forced alignment with this model; fbank features work well as DNN training features, so 40-dimensional fbank features are extracted, 11 frames are spliced (5-1-5), and the resulting supervectors are used as the input features of the DNN;
5) 10 rounds of RBM pre-training are performed on each hidden layer (including the BN layer), the global parameters are then fine-tuned with the BP algorithm, and finally the three major classes of features of prosody, voice quality and spectrum are extracted (a frame-splicing sketch is given below);
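A small sketch of the 11-frame (5-1-5) splicing that produces the 440-dimensional DNN input supervectors; edge padding at the utterance boundaries is an assumption:

import numpy as np

def splice_frames(feats, left=5, right=5):
    """Splice each 40-dim fbank frame with 5 left and 5 right context frames -> 11*40 = 440 dims."""
    T, D = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t : t + left + 1 + right].reshape(-1) for t in range(T)])

# Example: 40-dim fbank features for 100 frames -> array of shape (100, 440).
# spliced = splice_frames(np.random.randn(100, 40))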
(4) Selecting the extracted features based on a fuzzy set theory method;
1) Analyzing short-time energy, short-time amplitude, short-time zero-crossing rate and pitch frequency of the extracted features by a function TimePara () and extracting the pitch frequency by a function FunFre ();
2) After the short-time energy, the short-time amplitude, the short-time zero-crossing rate and the fundamental tone frequency are respectively extracted, the extracted characteristic parameters form characteristic vectors which are used as the input of the fuzzy set.
3) For C-class emotion recognition, the average value of each characteristic parameter over the training sample set X under each of the C emotion states is computed and denoted M_ij (i = 1, 2, …, C; j = 1, 2, …, N, where N is the number of emotion feature parameters); each feature parameter M_jm of each speech sample in each emotion state (m indexes the samples within that emotion state, m = 1 being the first sentence, and so on) is normalized by the normalization formula;
4) Then calculating the dispersion of the characteristic parameters under a certain specific emotion:
5) After the dispersion of each characteristic parameter under each emotion has been calculated, the contribution degree of each characteristic parameter under each emotion is calculated from the dispersion; the contribution u_ij of characteristic parameter θ_i is:
The contribution degrees of the emotion characteristic parameters are used, together with the Euclidean distance, as weights when the fuzzy K-nearest-neighbor method discriminates the samples to be classified;
Finally extracting the characteristics with the largest contribution to emotion recognition;
(5) Performing emotion recognition on the voice features by adopting an optimized SVM-KNN method based on the extracted features;
1) Decomposing the 6-emotion classification problem and establishing a multi-level SVM classifier based on a decision tree; the SVM at each level identifies one emotion from the sample set, the remaining sample set is identified by the SVM of the next level, decreasing level by level as shown in FIG. 1, and the leaf nodes of the decision tree finally give the emotion classes;
2) For the misclassified samples generated near the SVM hyperplane, the KNN algorithm is incorporated, and an SVM-KNN combined classification model is constructed to improve the accuracy of the SVM; the SVM-KNN classification proceeds as follows:
① The method comprises the steps that samples in an initial training set are marked, a small number of samples are selected randomly from the training set, a small sample training set is constructed, and each emotion in the initial training sample set is guaranteed to at least contain one sample;
② Obtaining a weak SVM classifier of emotion A according to the initial training sample, and then determining an optimal classification hyperplane, a support vector set T, a coefficient W of a classification decision function and a constant b;
③ Selecting a sample with fuzzy classification and low accuracy near the hyperplane from the A-class emotion, calculating the similarity of the sample with all the samples of the non-A-class emotion, selecting n most probable samples of the non-A-class emotion, and marking the samples as a sample set A; selecting one sample from non-A samples, calculating the similarity between the sample and all samples of A emotion, selecting n samples most likely to be the A emotion, and marking the samples as a sample set B;
④ The samples in A and B are points near the hyperplane; each sample x in A and B is substituted into the decision function
g(x) = Σ_i y_i a_i K(x_i, x) + b (6);
⑤ If |g(x)| > e, the SVM classifies the sample point with high accuracy and reliability, so the class to which the sample point belongs can be determined by the decision function f(x) = sgn(g(x));
⑥ If |g(x)| < e, the sample point is near the hyperplane, its classification reliability is low and it is easily misclassified, so the class of sample x is determined by the KNN method; the support vector set T of class A and non-class A is used as the training samples, the distance d(x, x_i) between x and each vector in T is calculated, and the class of the nearest vector is taken as the class of sample x,
where x_i is a support vector and K() is a first-order polynomial kernel function; the threshold e ranges over [0, 1], its specific value can be adjusted dynamically according to the experimental results, the initial value is generally set to 1, and if it is adjusted to 0 the algorithm degenerates into the traditional SVM algorithm;
⑦ Placing the sample obtained by SVM classification and the sample of KNN classification into an initial training set to expand the sample, so that a new SVM2 is trained on the basis of the expanded training set;
⑧ Iterating until all samples in the training set are added into the initial training set, and stopping iterating; obtaining an SVM classifier with high class A emotion classification precision by utilizing the final training set;
⑨ The non-A sample set is then used, with the trained first-level SVM classifier in the decision tree, to train the next-level SVM; training proceeds level by level to obtain the SVM classifier corresponding to each emotion category (a sketch of the SVM-KNN decision rule is given below).
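A sketch of the SVM-KNN decision rule for one binary (A vs. non-A) level, using scikit-learn; the class name and the choice of training the KNN on the support vector set T follow the description above, while defaults such as k = 5 are assumptions:

import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

class SVMKNN:
    """Confident samples (|g(x)| > eps) are labeled by the SVM; ambiguous samples near
    the hyperplane are labeled by a KNN trained on the support vector set T."""

    def __init__(self, eps=1.0, k=5, C=1.0):
        self.eps = eps
        self.k = k
        self.svm = SVC(C=C, kernel="poly", degree=1)   # first-order polynomial kernel
        self.knn = None

    def fit(self, X, y):
        self.svm.fit(X, y)
        sv = self.svm.support_                          # indices of the support vector set T
        self.knn = KNeighborsClassifier(n_neighbors=min(self.k, len(sv)))
        self.knn.fit(X[sv], y[sv])
        return self

    def predict(self, X):
        g = self.svm.decision_function(X)               # signed score g(x) for the binary problem
        labels = self.svm.predict(X)
        near = np.abs(g) < self.eps                     # samples too close to the hyperplane
        if near.any():
            labels[near] = self.knn.predict(X[near])    # fall back to KNN for ambiguous samples
        return labels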
CN202111127502.7A 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method Active CN113870901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111127502.7A CN113870901B (en) 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111127502.7A CN113870901B (en) 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method

Publications (2)

Publication Number Publication Date
CN113870901A CN113870901A (en) 2021-12-31
CN113870901B true CN113870901B (en) 2024-05-24

Family

ID=78994361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111127502.7A Active CN113870901B (en) 2021-09-26 2021-09-26 SVM-KNN-based voice emotion recognition method

Country Status (1)

Country Link
CN (1) CN113870901B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107492384A (en) * 2017-07-14 2017-12-19 北京联合大学 A kind of speech-emotion recognition method based on fuzzy nearest neighbor algorithm
KR20190102667A (en) * 2018-02-27 2019-09-04 광주과학기술원 Emotion recognition system and method thereof
CN108899046A (en) * 2018-07-12 2018-11-27 东北大学 A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification
CN109036468A (en) * 2018-11-06 2018-12-18 渤海大学 Speech-emotion recognition method based on deepness belief network and the non-linear PSVM of core
CN111832438A (en) * 2020-06-27 2020-10-27 西安电子科技大学 Electroencephalogram signal channel selection method and system for emotion recognition and application

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Intelligent recognition of Chinese speech emotion information based on the SVM multi-classification algorithm; 王光艳; 张培玟; 于宝芸; Electronic Components and Information Technology; 2020-07-20 (07); full text *

Also Published As

Publication number Publication date
CN113870901A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
Chen et al. Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction
Shahin et al. Emotion recognition using hybrid Gaussian mixture model and deep neural network
Jancovic et al. Bird species recognition using unsupervised modeling of individual vocalization elements
CN110211594B (en) Speaker identification method based on twin network model and KNN algorithm
CN112581979A (en) Speech emotion recognition method based on spectrogram
Ghai et al. Emotion recognition on speech signals using machine learning
Sefara The effects of normalisation methods on speech emotion recognition
Renjith et al. Speech based emotion recognition in Tamil and Telugu using LPCC and hurst parameters—A comparitive study using KNN and ANN classifiers
Ribeiro et al. Binary neural networks for classification of voice commands from throat microphone
Zheng et al. MSRANet: Learning discriminative embeddings for speaker verification via channel and spatial attention mechanism in alterable scenarios
Nawas et al. Speaker recognition using random forest
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
Kaur et al. An efficient speaker recognition using quantum neural network
Gade et al. A comprehensive study on automatic speaker recognition by using deep learning techniques
Konangi et al. Emotion recognition through speech: A review
Moumin et al. Automatic Speaker Recognition using Deep Neural Network Classifiers
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Jiang et al. Research on voiceprint recognition of camouflage voice based on deep belief network
Prakash et al. Analysis of emotion recognition system through speech signal using KNN & GMM classifier
CN113870901B (en) SVM-KNN-based voice emotion recognition method
Aggarwal et al. Application of genetically optimized neural networks for hindi speech recognition system
Al-Talabani Automatic speech emotion recognition-feature space dimensionality and classification challenges
Gade et al. Hybrid Deep Convolutional Neural Network based Speaker Recognition for Noisy Speech Environments
Mangalam et al. Emotion Recognition from Mizo Speech: A Signal Processing Approach
Praksah et al. Analysis of emotion recognition system through speech signal using KNN, GMM & SVM classifier

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant