CN113870901A - Voice emotion recognition method based on SVM-KNN - Google Patents
Info
- Publication number: CN113870901A (application CN202111127502.7A)
- Authority: CN (China)
- Prior art keywords: emotion, sample, svm, training, samples
- Prior art date: 2021-09-26
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state
- G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
- G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique, using neural networks
Abstract
A speech emotion recognition method based on SVM-KNN comprises the following steps: step one, the original speech signal is preprocessed; step two, speech enhancement is performed with a microphone-array delay-alignment method; step three, features are extracted from the processed data with a BN-DNN based on the SHL structure; step four, the extracted features are selected with a method based on fuzzy set theory; and step five, emotion recognition is carried out with an optimized SVM-KNN method. The method yields higher speech emotion classification accuracy, avoids the optimization bottleneck of the SVM on large-scale training samples, and improves both SVM classification accuracy and recognition speed. In addition, the SVM-KNN idea proposed by the invention can be applied to other speech-recognition tasks, such as dialect classification, and provides a reference for classification and recognition based on speech signals.
Description
Technical Field
The invention relates to voice emotion recognition, in particular to a voice emotion recognition method based on SVM-KNN.
Background
Among current speech emotion recognition methods, the support vector machine (SVM) has proved to be a relatively effective classification tool, but when the degree of emotion confusion is large the SVM still struggles to recognize emotions accurately.
Emotion has long been studied by experts in physiology and psychology. With the rapid development of artificial intelligence, emotion research in human-computer interaction has attracted great interest. In human-computer interaction, people hope to communicate with machines more naturally, which requires machines to understand human emotions, so emotion classification and recognition are particularly important. In human communication, speech carries rich information, so machines can use it to classify and recognize emotions. Experts have done extensive research and analysis on speech emotion classification and recognition, including building speech emotion databases, extracting emotional features, and designing classification and recognition methods. To improve the speech emotion recognition rate, previous work has refined each of these links, but there is no unified system and the recognition rate is still not very high. MFCC has been used as a recognition feature, but without further processing before recognition it carries a large amount of redundant information that degrades recognition. To eliminate this influence and improve the recognition rate, the choice of a suitable classifier is a key research point: selecting a proper classification method is essential to improving the emotion recognition rate and handling the emotional features correctly.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a speech emotion recognition method based on SVM-KNN, which performs speech enhancement with a microphone-array delay-alignment method, extracts features with a BN-DNN based on the SHL structure, selects features with a method based on fuzzy set theory, and then performs emotion recognition with an optimized SVM-KNN method, yielding a speech emotion recognition method with high accuracy and low computational load.
In order to achieve the purpose, the invention adopts the technical scheme that:
A speech emotion recognition method based on SVM-KNN, combining dedicated speech signal preprocessing, feature extraction, fuzzy feature selection and SVM-KNN (support vector machine plus K-nearest-neighbour) classification, comprises the following steps:
(1) An input speech signal is preprocessed; the preprocessing comprises pre-emphasis filtering and windowed framing, where the pre-emphasis coefficient α of the pre-emphasis filter is 0.95 and the frame length of the windowed framing is 26 ms (a preprocessing sketch follows this step);
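A minimal sketch of this preprocessing step is given below. The sampling rate, frame shift and Hamming window are illustrative assumptions; the patent only fixes α = 0.95 and a 26 ms frame length.

```python
import numpy as np

def preprocess(signal, fs=16000, alpha=0.95, frame_ms=26, shift_ms=10):
    """Pre-emphasis plus windowed framing, as described in step (1).

    fs, shift_ms and the Hamming window are assumptions for illustration.
    """
    # Pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    frame_len = int(fs * frame_ms / 1000)    # 26 ms -> 416 samples at 16 kHz
    frame_shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)

    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, frame_shift):
        frames.append(emphasized[start:start + frame_len] * window)
    return np.array(frames)                  # shape: (num_frames, frame_len)
```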
(2) the data of different microphone channels are delayed and aligned by using a microphone array solution so as to realize the positioning of a sound source and improve the audio quality:
1) A nested microphone array of 9 microphones is used; it is in effect 4 linear sub-arrays, each of 5 uniformly spaced microphones (spacings of 2.5 cm, 5 cm, 10 cm and 20 cm respectively), ensuring that the 300-3400 Hz frequency range of the recorded speech signal is covered.
2) At the same time, the ratio between the microphone spacing and the distance from the speaker to the microphone array is modeled as satisfying the far-field assumption for the room impulse response (a delay-alignment sketch follows this step).
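The delay alignment of step (2) can be sketched as time-delay estimation by cross-correlation followed by shifting each channel onto a reference channel. The patent does not specify the estimator, so the cross-correlation approach and the function name below are assumptions.

```python
import numpy as np

def align_channels(channels, ref_idx=0):
    """Delay-align multi-microphone recordings to a reference channel.

    channels: list of equal-length 1-D signals, one per microphone.
    The cross-correlation delay estimator is an illustrative choice.
    """
    ref = channels[ref_idx]
    aligned = []
    for ch in channels:
        # Estimate the lag of this channel relative to the reference.
        corr = np.correlate(ch, ref, mode="full")
        delay = np.argmax(corr) - (len(ref) - 1)   # positive: ch lags ref
        # Shift the channel back onto the reference (np.roll wraps around,
        # which is acceptable for a short edge region in this sketch).
        aligned.append(np.roll(ch, -delay))
    return np.vstack(aligned)

# The aligned channels can then be summed (delay-and-sum beamforming)
# to enhance the speech before feature extraction.
```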
(3) Features are extracted with the BN-DNN based on the SHL structure; the feature extraction process is as follows:
1) In the experiment the BN-DNN model has 5 hidden layers, with the 3rd hidden layer set as the bottleneck layer and 1024 neurons in each of the remaining hidden layers; the input is the 40-dimensional MFCC features of 11 consecutive frames;
2) The input layer therefore has 440 neurons (40 × 11), and the DNN structure is laid out accordingly, with the bottleneck layer in the middle of the five hidden layers.
3) The optimal number of neurons per group and the sparse-group overlap coefficient α are determined; the experimental candidates are 64, 128 and 256 neurons per group, with overlap factors α of 0%, 20%, 30% and 40%.
4) The sparsity of the network is measured by the proportion of neurons whose activation probability h equals 0, and the sparsity of a layer is defined on that basis, where D is the number of neurons in the layer and h_i (i = 1, 2, …, D) is the activation probability of the i-th neuron; a larger sparsity value indicates a sparser hidden layer. For each model, the model is first trained on the training set to obtain the activation probability of every neuron in each layer; these probabilities are substituted into the definition to compute each layer's sparsity; the average over all hidden layers is taken as the sparsity of the whole neural network; finally the speech bottleneck features are extracted (a sparsity sketch follows this step).
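A minimal numpy sketch of this sparsity measure: the fraction of neurons in a hidden layer whose activation probability is (numerically) zero, averaged over all hidden layers. The small tolerance `eps` is an implementation assumption, since the patent only speaks of probabilities "equal to 0".

```python
import numpy as np

def layer_sparsity(h, eps=1e-6):
    """Fraction of neurons whose activation probability h_i is (near) zero.

    h: 1-D array of length D with the activation probability of each neuron
    in one hidden layer.  eps is an illustrative tolerance.
    """
    return float(np.mean(np.abs(h) < eps))

def network_sparsity(hidden_layer_probs):
    """Average the per-layer sparsity over all hidden layers, as in step (3)."""
    return float(np.mean([layer_sparsity(h) for h in hidden_layer_probs]))
```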
(4) Features are selected with a method based on fuzzy set theory:
1) In the feature space R, for a c-class problem, the training sample set is X = {x1, x2, …, xN}, where N is the number of samples; for a sample x to be classified, the number K of nearest neighbours is determined first;
2) The distance between the sample to be classified and every training sample is computed; the Euclidean distance is used.
3) The N distances are sorted:
d(1) ≤ d(2) ≤ d(3) ≤ … ≤ d(K) ≤ d(K+1) ≤ … ≤ d(N)
where d(1), …, d(K) are the distances from the sample to its K nearest neighbours.
4) The class membership of the sample to be classified is computed according to formula (1), where m is the fuzzy weight adjustment factor and n = 1, 2, …, c; if u_i(x) = max{u_n(x)}, x is judged to belong to the i-th class. The algorithm is repeated until all samples to be classified have been processed (a fuzzy-KNN sketch follows this step).
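The fuzzy-KNN selection of step (4) can be sketched as follows. The membership formula used here is the standard fuzzy K-nearest-neighbour rule (distance weights with exponent 2/(m-1)); since the patent's formula (1) is not reproduced in the text, treating it as this standard rule is an assumption.

```python
import numpy as np

def fuzzy_knn_membership(x, X_train, U_train, K=5, m=2.0):
    """Class memberships u_i(x) of a test sample via fuzzy KNN.

    X_train: (N, d) training samples; U_train: (N, c) memberships of the
    training samples in the c classes; m is the fuzzy weight factor.
    """
    d = np.linalg.norm(X_train - x, axis=1)           # Euclidean distances
    nn = np.argsort(d)[:K]                            # K nearest neighbours
    w = 1.0 / (d[nn] ** (2.0 / (m - 1.0)) + 1e-12)    # distance weights
    u = (U_train[nn] * w[:, None]).sum(axis=0) / w.sum()
    return u                                          # argmax(u) gives the class
```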
(5) Emotion recognition is carried out with the optimized SVM-KNN method:
1) Let the membership of each sample to its class be s_i; the fuzzified input sample set is then S = {(x1, y1, s1), (x2, y2, s2), …, (xl, yl, sl)}, where xi ∈ R, yi ∈ {1, -1}, σ ≤ si ≤ 1, σ is a sufficiently small positive number, and si indicates the degree to which the i-th sample belongs to the positive class.
2) In the nonlinear case a transformation Φ: R → F is introduced, mapping the samples from the input space R to a high-dimensional feature space F; using the structural-risk-minimization principle and the idea of maximizing the classification margin, the optimal separating hyperplane is determined in the feature space, so that the FSVM optimal-hyperplane problem is converted into a constrained optimization problem with slack variables ξi ≥ 0, i = 1, …, l (its standard form is reproduced after step 4 below).
3) A Lagrange function is established, where μi > 0 and βi > 0 are the Lagrange multipliers, C0 > 0 is the penalty factor, and w is the weight vector of the linear classification function.
4) This yields the dual programming problem with the box constraint
0 ≤ μi ≤ si·C0, i = 1, …, l.   (5)
where k(xi, xj) is the kernel function. Considering the KKT conditions, samples with μi = 0 are correctly classified and are not support vectors, while samples whose multiplier lies strictly between 0 and si·C0 correspond to support vectors on the margin boundary, i.e. the sample xi is correctly classified and sits exactly on the boundary of the margin.
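For reference, the fuzzy-SVM problem described in steps 2)-4) can be written out in its standard form. The objective functions appear only as images in the original, so this reproduction of the usual FSVM formulation is an assumption consistent with the constraints quoted above.

```latex
% Standard fuzzy-SVM (FSVM) primal problem
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C_{0}\sum_{i=1}^{l} s_{i}\,\xi_{i}
\quad \text{s.t.}\quad y_{i}\bigl(w\cdot\Phi(x_{i}) + b\bigr) \ge 1 - \xi_{i},\qquad
\xi_{i} \ge 0,\ i = 1,\dots,l.

% Corresponding dual problem
\max_{\mu}\ \sum_{i=1}^{l}\mu_{i}
 - \tfrac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\mu_{i}\mu_{j}\,y_{i}y_{j}\,k(x_{i},x_{j})
\quad \text{s.t.}\quad \sum_{i=1}^{l} y_{i}\mu_{i} = 0,\qquad
0 \le \mu_{i} \le s_{i}C_{0},\ i = 1,\dots,l.
```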
Drawings
FIG. 1 is a flow chart of speech signal enhancement according to the present invention;
FIG. 2 is a flow chart of the multi-level SVM classifier of the present invention;
FIG. 3 shows the SVM-KNN classification steps of the present invention;
FIG. 4 is a flow chart of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIGS. 1, 2 and 3, a speech emotion recognition method based on SVM-KNN, combining dedicated speech signal preprocessing, multi-feature extraction, fuzzy-KNN feature selection and SVM (support vector machine) plus K-nearest-neighbour classification, comprises the following steps:
(1) The original data are preprocessed; the preprocessing comprises pre-emphasis, framing, windowing and endpoint detection (an endpoint-detection sketch follows this step).
1) After the signal s(n) is windowed it becomes sw(n), according to the formula sw(n) = s(n) × w(n);
2) The data are filtered by a pre-emphasis filter with pre-emphasis coefficient α = 0.95.
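Step (1) also calls for endpoint detection. One common realization, sketched below, marks the start and end of speech with a short-time energy threshold; the threshold and the single-threshold scheme are assumptions, since the patent only states that endpoints are detected.

```python
import numpy as np

def detect_endpoints(frames, energy_ratio=0.1):
    """Return (start_frame, end_frame) of the speech region.

    frames: (num_frames, frame_len) windowed frames from preprocessing.
    A simple energy threshold (a fraction of the peak energy) is used for
    illustration; the patent does not specify the detector.
    """
    energy = np.sum(frames ** 2, axis=1)          # short-time energy per frame
    threshold = energy_ratio * energy.max()
    voiced = np.where(energy > threshold)[0]
    if voiced.size == 0:
        return 0, frames.shape[0] - 1             # no speech found: keep all
    return int(voiced[0]), int(voiced[-1])
```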
(2) A corresponding microphone array is designed, and the data of the different microphone channels are delay-aligned using a microphone-array solution to localize the sound source and improve the audio quality.
1) A nested microphone array of 9 microphones is used, formed by 4 linear sub-arrays of 5 uniformly spaced microphones each (spacings of 2.5 cm, 5 cm, 10 cm and 20 cm respectively), ensuring that the 300-3400 Hz frequency range of the recorded speech signal is covered.
2) The ratio between the microphone spacing and the distance from the speaker to the array is modeled as satisfying the far-field assumption for the room impulse response.
(3) Features are extracted from the processed data with the BN-DNN based on the SHL structure.
1) The BN-DNN model has 5 hidden layers, with the 3rd hidden layer set as the bottleneck layer and 1024 neurons in each hidden layer; the 40-dimensional MFCC features of 11 consecutive frames are used as input, from which the bottleneck features are derived;
2) The input layer therefore has 440 neurons (40 × 11), and the DNN structure follows the layout described above.
3) The optimal number of neurons per group and the sparse-group overlap coefficient α are determined; the experimental candidates are 64, 128 and 256 neurons per group, with overlap factors α of 0%, 20%, 30% and 40%.
4) The larger the proportion of neurons whose activation probability is 0, the sparser the network; the sparsity is defined as in the disclosure above. Finally, the average sparsity of all hidden layers is taken as the sparsity of the whole neural network, and the speech bottleneck features are extracted (a model-definition sketch follows this step).
(4) The extracted features are selected with a method based on fuzzy set theory:
1) Short-time energy, short-time amplitude, short-time zero-crossing rate and pitch frequency features are extracted with dedicated functions.
2) The extracted feature parameters are combined into a feature vector that serves as the input of the fuzzy set.
3) For C-class emotion recognition, the mean of each feature parameter under each of the C emotion states is computed over the training sample set X and recorded as Mij (i = 1, 2, …, C; j = 1, 2, …, N, where N is the number of emotion feature parameters); then every feature parameter of each speech sample under each emotion state (with n indexing the samples of that emotion state, n = 1 being the first sentence, and so on) is normalized according to the normalization formula.
4) The dispersion of each feature parameter under a given emotion is then computed.
5) After the dispersion of each feature parameter under each emotion is obtained, the contribution degree u_ij of feature parameter θ_i under each emotion is computed from the dispersion.
6) When fuzzy K-nearest-neighbour is used to classify a sample, the contribution degrees of the emotional feature parameters are used to weight the Euclidean distance.
Finally, the features that contribute most to emotion recognition are retained (a dispersion/contribution sketch follows this step).
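One plausible reading of steps 3)-6) is sketched below: per-emotion means, normalization by the mean, a dispersion measure, and a contribution weight that grows as the dispersion shrinks. The exact normalization, dispersion and contribution formulas appear only as images in the original, so the concrete formulas used here (standard deviation of normalized values, inverse-dispersion weighting) are assumptions.

```python
import numpy as np

def contribution_weights(features, labels, n_classes):
    """features: (num_samples, N) emotion feature parameters; labels: (num_samples,).

    Returns u of shape (n_classes, N): contribution of feature j under emotion i.
    The inverse-dispersion weighting is an illustrative choice.
    """
    u = np.zeros((n_classes, features.shape[1]))
    for i in range(n_classes):
        F = features[labels == i]                 # samples of emotion i
        M = F.mean(axis=0)                        # per-feature mean under emotion i
        normalized = F / (M + 1e-12)              # normalize each sample by the mean
        dispersion = normalized.std(axis=0)       # spread of the normalized values
        w = 1.0 / (dispersion + 1e-12)            # smaller dispersion -> larger weight
        u[i] = w / w.sum()                        # contributions sum to 1 per emotion
    return u

# The weights can then scale the Euclidean distance used by the fuzzy KNN,
# e.g. d(x, y) = sqrt(sum_j u_ij * (x_j - y_j) ** 2).
```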
(5) Emotion recognition is carried out with the optimized SVM-KNN method:
1) A speech emotion recognition model is constructed according to a multi-level classification strategy.
2) The emotional confusion degree is the degree of similarity between two different emotions.
The confusion degree Iij between the i-th emotion Bi and the j-th emotion Bj is defined as the average of the probability of misjudging the i-th emotion as the j-th emotion and the probability of misjudging the j-th emotion as the i-th emotion; in its mathematical expression, x is the test data and t is the recognition result corresponding to the test data x.
3) The construction algorithm of the multilevel classification comprises the following specific steps:
a. A speech emotion recognition confusion matrix is computed with the traditional support vector machine (SVM) method;
b. A first-level classifier is constructed with a probability threshold P1; emotions whose confusion degree exceeds P1 are merged into one class, i.e. if Iab > P1 and Icd > P1, then a and b form one group and c and d form another; if Iab > P1 and Ibc > P1, then a, b and c are grouped together.
Once the higher-level classifier has been constructed, a new threshold P2 is set when constructing the second-level classifier; if Iab > P2 and Ibc > P2, then a, b and c are likewise grouped together. In this work the first-level threshold P1 is set to 10%, and the threshold of each subsequent level is incremented by 2% over that of the level above, i.e. 10%, 12%, 14%, 16%, and so on;
c. calculating the emotional confusion degree of the non-grouped emotional states according to the formula (1.1), turning to the step b, and classifying the non-grouped emotional states into the existing group or the independent group;
d. The procedure ends when all four emotions have been correctly grouped (a grouping sketch follows this list).
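A sketch of the grouping rule in steps a-d: the confusion degree I_ij is read off a confusion matrix as the average of the two misclassification probabilities, and emotions whose confusion exceeds the current level's threshold (10% at the first level, then +2% per level) are merged into one group. The union-find style merging below is an implementation convenience, not something prescribed by the patent.

```python
import numpy as np

def confusion_degree(conf, i, j):
    """I_ij: average of P(i misjudged as j) and P(j misjudged as i).

    conf is a row-normalized confusion matrix, conf[i, j] = P(predict j | true i).
    """
    return 0.5 * (conf[i, j] + conf[j, i])

def group_emotions(conf, level=0, p1=0.10, step=0.02):
    """Group emotions whose mutual confusion exceeds the level's threshold."""
    threshold = p1 + level * step                 # 10%, 12%, 14%, ... per level
    n = conf.shape[0]
    parent = list(range(n))

    def find(a):                                  # simple union-find lookup
        while parent[a] != a:
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if confusion_degree(conf, i, j) > threshold:
                parent[find(j)] = find(i)         # merge i and j into one group

    groups = {}
    for e in range(n):
        groups.setdefault(find(e), []).append(e)
    return list(groups.values())
```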
Examples
Step 1: the method for preprocessing the raw data comprises the following steps:
(1) In this embodiment the EMO-DB data set is used; it is a German emotional speech corpus recorded at the Technical University of Berlin, obtained by having 10 actors (5 men and 5 women) act out 10 sentences (5 long and 5 short) in 7 emotions (neutral, anger, fear, joy, sadness, disgust and boredom), giving about 800 utterances in total, sampled at 48 kHz (later downsampled to 16 kHz) with 16-bit quantization. The corpus texts were chosen to be semantically neutral, without emotional tendency, in an everyday spoken style free of excessive written-language phrasing. The recordings were made in a professional studio, and before performing a given emotion the actors were asked to induce it by recalling a real experience of their own, to make the emotion more genuine. A listening test with 20 participants (10 men and 10 women) yielded a recognition rate of 84.3%.
After the listening test, 233 male emotional sentences and 302 female emotional sentences were retained, 535 sentences in total. The sentence content comprises 5 short and 5 long everyday phrases, allows a high degree of emotional freedom and carries no specific emotional tendency. The files are sampled at 16 kHz, quantized with 16 bits, and saved in WAV format.
(2) Preprocessing: the pre-emphasis coefficient is set and the speech signal is pre-emphasized; the windowed-framing frame length is set and the speech signal is framed.
Step 2: Specific channels of the preprocessed data are selected, and the data of the different microphone channels are delay-aligned using a microphone-array solution to localize the sound source and improve the audio quality.
(3) Features are extracted with the BN-DNN based on the SHL structure: the number of layers and the parameters of each layer are set, the BN-DNN neural network is constructed and trained, the MFCC is fed in, and the speech bottleneck features are output.
(4) Fuzzy feature selection is applied to the extracted features: the class membership of each extracted feature is computed according to the formula, and the sample x is assigned to the i-th class, until all samples have been processed.
(5) Emotion recognition is performed with the optimized SVM-KNN method: the emotional confusion degrees are computed and the emotions are assigned to the corresponding groups.
(6) The accuracy is computed to obtain the final result (an SVM-KNN decision sketch follows this section).
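The core SVM-KNN decision of step (5) can be sketched with scikit-learn: samples far from the hyperplane (|g(x)| > e) take the SVM decision, while samples inside the margin band fall back to a nearest-neighbour vote over the support vectors. The threshold value and the use of scikit-learn are illustrative choices, not part of the patent.

```python
import numpy as np
from sklearn.svm import SVC

def svm_knn_predict(svm: SVC, x, e=0.5):
    """Hybrid decision for one sample x (binary case: emotion A vs. non-A).

    svm is a fitted sklearn SVC; e is the margin threshold (assumed value).
    """
    g = svm.decision_function(x.reshape(1, -1)).item()  # g(x) = sum_i y_i a_i K(x_i, x) + b
    if abs(g) > e:
        return int(np.sign(g))                           # reliable: use the SVM sign
    # Near the hyperplane: 1-NN over the support vectors (the set T).
    sv = svm.support_vectors_
    sv_is_positive = svm.dual_coef_[0] > 0               # sign of y_i * a_i identifies the class
    nearest = np.argmin(np.linalg.norm(sv - x, axis=1))
    return 1 if sv_is_positive[nearest] else -1
```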
Claims (1)
1. A speech emotion recognition method based on SVM-KNN is characterized by comprising the following steps:
(1) preprocessing original data by pre-emphasis, framing, windowing and endpoint detection;
1) the high-frequency part is improved by utilizing a pre-emphasis technology, so that the frequency spectrum of the signal becomes flat, and the frequency spectrum analysis or the vocal tract parameter analysis is facilitated;
2) performing framing processing on the speech signal; to make the transition between frames smooth and keep continuity, an overlapping segmentation method is used, a new segment being taken each time the frame is shifted, so that as many frames as possible are obtained for short-time analysis;
3) multiplying s(n) by a window function w(n) to form the windowed speech signal sw(n) = s(n)·w(n);
4) accurately finding out a starting point and an ending point of a voice signal from a section of voice signal so as to separate an effective voice signal from a useless noise signal;
(2) designing a corresponding microphone array, and performing delay alignment on data of different microphone channels by using a microphone array solution to realize sound source positioning and improve audio quality;
1) estimating a noise power spectrum of an input speech signal by using a first-order recursive smoothing method;
2) calculating the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the voice signal with noise;
3) smoothing the noisy speech signal to obtain the smoothed power spectrum S(λ, k) of the signal, and performing a minimum-value search on the smoothed output signal to obtain Smin(λ, k);
4) solving the speech-presence probability I(λ, k), performing a second smoothing and minimum-value search accordingly, and computing the speech-presence probability q(λ, k);
5) according to the formula
6) Updating time-varying smoothing parameters and a smoothing noise power spectrum;
(3) extracting the characteristics of the processed data based on BN-DNN of an SHL structure;
1) firstly, extracting 39-dimensional MFCC features (13 static coefficients plus first- and second-order deltas) from 1 hour of Vystdial_cz data, training a triphone GMM model, and performing forced alignment;
2) training a triphone GMM acoustic model based on linear discriminant analysis (LDA) and maximum likelihood linear transformation (MLLT) (13-dimensional MFCC features spliced over 9 frames and reduced to 40 dimensions by LDA), the model having 19200 Gaussian mixture components;
3) then, carrying out Speaker Adaptive Training (SAT) by using a Feature-space maximum likelihood linear regression (fMLLR) technology, thereby forming a GMM acoustic model of LDA + MLLT + SAT;
4) obtaining the training targets of the softmax layer of the BN-DNN through forced alignment with this model; fbank features, which work well, are used as the DNN training features: 40-dimensional fbank features are extracted and spliced over 11 frames (5-1-5), and the resulting supervector is used as the DNN input;
5) performing 10 rounds of RBM pre-training on each hidden layer (including the BN layer), then fine-tuning the global parameters with the BP algorithm, and finally extracting the three major classes of features: prosody, voice quality and spectrum;
(4) selecting the extracted features based on a fuzzy set theory method;
1) analyzing the short-time energy, short-time amplitude and short-time zero-crossing rate of the extracted features with the function TimePara(), and extracting the pitch frequency with the function FunFre();
2) after short-time energy, short-time amplitude, short-time zero-crossing rate and fundamental tone frequency are respectively extracted, the extracted characteristic parameters are combined into a characteristic vector to be used as the input of a fuzzy set.
3) for C-class emotion recognition, counting over the training sample set X the mean of the same feature parameter under the C different emotion states and recording it as Mij (i = 1, 2, …, C; j = 1, 2, …, N, N being the number of emotion feature parameters), and then normalizing every feature parameter of each speech sample under each emotion state (with n indexing the samples of that emotion state, n = 1 being the first sentence, and so on) according to the normalization formula;
4) Then calculating the dispersion of the characteristic parameter under a certain specific emotion:
5) after the dispersion of each feature parameter under each emotion is obtained, calculating from the dispersion the contribution degree u_ij of feature parameter θ_i under each emotion:
Weighting the contribution degree of the emotional characteristic parameters and the Euclidean distance when the sample to be classified is judged by using fuzzy K nearest neighbor;
finally, extracting features which have the largest contribution to emotion recognition;
(5) performing emotion recognition on the voice features by adopting an optimized SVM-KNN method based on the extracted features;
1) decomposing the 6-emotion classification problem and establishing a multi-level SVM classifier based on a decision tree: the SVM at each level recognizes one emotion from the sample set, the remaining samples enter the SVM of the next level for recognition, decreasing level by level as shown in FIG. 1, until the leaf nodes of the decision tree yield the recognized emotions;
2) for the misclassified samples that arise near the SVM hyperplane, the KNN algorithm is combined with the SVM to construct an SVM-KNN joint classification model and improve the SVM's accuracy; the SVM-KNN classification steps are:
① initially, all samples in the training set are regarded as labeled; a small number of samples are randomly selected from the training set to build a small training set, ensuring that every emotion appears at least once in the initial training sample set;
② a weak SVM classifier for emotion A is obtained from the initial training samples; its optimal classification hyperplane, support vector set T, classification decision-function coefficients W and constant b are determined;
③ a sample is taken from emotion class A and its similarity to all non-A samples is computed; the n samples most likely not to belong to class A are selected and recorded as sample set A; a sample is then taken from the non-A samples and its similarity to all class-A samples is computed; the n samples most likely to belong to class A are selected and recorded as sample set B;
④ the samples in A and B are points near the hyperplane; each sample x in A and B is substituted into the decision function
g(x) = ∑i yi·ai·K(xi, x) + b (6);
⑤ if |g(x)| > e, the SVM classifies the sample point with high accuracy and reliability, so its class is determined by the decision function f(x) = sgn(g(x));
⑥ if |g(x)| < e, the sample point lies near the hyperplane, its classification reliability is low and it is easily misclassified; therefore the KNN method is used to determine the class of the sample x: the support vector sets T of class A and non-A are taken as training samples, the distance d(x, xi) between x and each vector in T is computed, and the class of the vector closest to x is taken as the class of x;
in the formula, xi is a support vector and K(·) is a first-order polynomial kernel; the threshold e ranges over [0, 1] and can be adjusted dynamically according to the experimental results; it is generally initialized to 1, and if it is set to 0 the algorithm reduces to the traditional SVM algorithm;
⑦ the samples classified by the SVM and those classified by KNN are added to the initial training set to expand it, and a new SVM2 is trained on the expanded training set;
⑧ this is iterated until all samples of the training set have been added to the initial training set, at which point iteration stops; the final training set yields an SVM classifier with high classification accuracy for class-A emotion;
⑨ at this point the first-level SVM classifier of the decision tree has been trained; the non-A sample set is then used to train the next-level SVM, and training proceeds level by level to obtain the SVM classifier corresponding to each emotion class.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111127502.7A CN113870901B (en) | 2021-09-26 | 2021-09-26 | SVM-KNN-based voice emotion recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113870901A true CN113870901A (en) | 2021-12-31 |
CN113870901B CN113870901B (en) | 2024-05-24 |
Family
ID=78994361
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111127502.7A Active CN113870901B (en) | 2021-09-26 | 2021-09-26 | SVM-KNN-based voice emotion recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113870901B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107492384A (en) * | 2017-07-14 | 2017-12-19 | 北京联合大学 | A kind of speech-emotion recognition method based on fuzzy nearest neighbor algorithm |
KR20190102667A (en) * | 2018-02-27 | 2019-09-04 | 광주과학기술원 | Emotion recognition system and method thereof |
CN108899046A (en) * | 2018-07-12 | 2018-11-27 | 东北大学 | A kind of speech-emotion recognition method and system based on Multistage Support Vector Machine classification |
CN109036468A (en) * | 2018-11-06 | 2018-12-18 | 渤海大学 | Speech-emotion recognition method based on deepness belief network and the non-linear PSVM of core |
CN111832438A (en) * | 2020-06-27 | 2020-10-27 | 西安电子科技大学 | Electroencephalogram signal channel selection method and system for emotion recognition and application |
Non-Patent Citations (1)
Title |
---|
WANG Guangyan; ZHANG Peiwen; YU Baoyun: "Intelligent recognition of Chinese speech emotion information based on an SVM multi-classification algorithm", Electronic Components and Information Technology, No. 07, 20 July 2020 (2020-07-20) *
Also Published As
Publication number | Publication date |
---|---|
CN113870901B (en) | 2024-05-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |